2 NFSv4 D. Noveck, Ed. 3 Internet-Draft NetApp 4 Obsoletes: 5661 (if approved) C. Lever 5 Intended status: Standards Track ORACLE 6 Expires: February 5, 2020 August 4, 2019

8 Network File System (NFS) Version 4 Minor Version 1 Protocol 9 draft-ietf-nfsv4-rfc5661sesqui-msns-01

11 Abstract

13 This document describes the Network File System (NFS) version 4 minor 14 version 1, including features retained from the base protocol (NFS 15 version 4 minor version 0, which is specified in RFC 7530) and 16 protocol extensions made subsequently. The later minor version has 17 no dependencies on NFS version 4 minor version 0, and is considered a 18 separate protocol.

20 This document obsoletes RFC 5661. It substantially revises the 21 treatment of features relating to multi-server namespaces, superseding 22 the description of those features appearing in RFC 5661.

24 Status of This Memo

26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79.

29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at https://datatracker.ietf.org/drafts/current/.

34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress."

39 This Internet-Draft will expire on February 5, 2020.

41 Copyright Notice

43 Copyright (c) 2019 IETF Trust and the persons identified as the 44 document authors. All rights reserved.
46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents 48 (https://trustee.ietf.org/license-info) in effect on the date of 49 publication of this document. Please review these documents 50 carefully, as they describe your rights and restrictions with respect 51 to this document. Code Components extracted from this document must 52 include Simplified BSD License text as described in Section 4.e of 53 the Trust Legal Provisions and are provided without warranty as 54 described in the Simplified BSD License. 56 This document may contain material from IETF Documents or IETF 57 Contributions published or made publicly available before November 58 10, 2008. The person(s) controlling the copyright in some of this 59 material may not have granted the IETF Trust the right to allow 60 modifications of such material outside the IETF Standards Process. 61 Without obtaining an adequate license from the person(s) controlling 62 the copyright in such materials, this document may not be modified 63 outside the IETF Standards Process, and derivative works of it may 64 not be created outside the IETF Standards Process, except to format 65 it for publication as an RFC or to translate it into languages other 66 than English. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 7 71 1.1. Introduction to this Update . . . . . . . . . . . . . . . 7 72 1.2. The NFS Version 4 Minor Version 1 Protocol . . . . . . . 8 73 1.3. Requirements Language . . . . . . . . . . . . . . . . . . 9 74 1.4. Scope of This Document . . . . . . . . . . . . . . . . . 9 75 1.5. NFSv4 Goals . . . . . . . . . . . . . . . . . . . . . . . 9 76 1.6. NFSv4.1 Goals . . . . . . . . . . . . . . . . . . . . . . 10 77 1.7. General Definitions . . . . . . . . . . . . . . . . . . . 10 78 1.8. Overview of NFSv4.1 Features . . . . . . . . . . . . . . 13 79 1.9. Differences from NFSv4.0 . . . . . . . . . . . . . . . . 17 80 2. 
Core Infrastructure . . . . . . . . . . . . . . . . . . . . . 18 81 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 18 82 2.2. RPC and XDR . . . . . . . . . . . . . . . . . . . . . . . 18 83 2.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 22 84 2.4. Client Identifiers and Client Owners . . . . . . . . . . 22 85 2.5. Server Owners . . . . . . . . . . . . . . . . . . . . . . 28 86 2.6. Security Service Negotiation . . . . . . . . . . . . . . 29 87 2.7. Minor Versioning . . . . . . . . . . . . . . . . . . . . 34 88 2.8. Non-RPC-Based Security Services . . . . . . . . . . . . . 36 89 2.9. Transport Layers . . . . . . . . . . . . . . . . . . . . 37 90 2.10. Session . . . . . . . . . . . . . . . . . . . . . . . . . 40 91 3. Protocol Constants and Data Types . . . . . . . . . . . . . . 86 92 3.1. Basic Constants . . . . . . . . . . . . . . . . . . . . . 86 93 3.2. Basic Data Types . . . . . . . . . . . . . . . . . . . . 87 94 3.3. Structured Data Types . . . . . . . . . . . . . . . . . . 89 95 4. Filehandles . . . . . . . . . . . . . . . . . . . . . . . . . 97 96 4.1. Obtaining the First Filehandle . . . . . . . . . . . . . 98 97 4.2. Filehandle Types . . . . . . . . . . . . . . . . . . . . 99 98 4.3. One Method of Constructing a Volatile Filehandle . . . . 101 99 4.4. Client Recovery from Filehandle Expiration . . . . . . . 102 100 5. File Attributes . . . . . . . . . . . . . . . . . . . . . . . 103 101 5.1. REQUIRED Attributes . . . . . . . . . . . . . . . . . . . 104 102 5.2. RECOMMENDED Attributes . . . . . . . . . . . . . . . . . 104 103 5.3. Named Attributes . . . . . . . . . . . . . . . . . . . . 105 104 5.4. Classification of Attributes . . . . . . . . . . . . . . 106 105 5.5. Set-Only and Get-Only Attributes . . . . . . . . . . . . 107 106 5.6. REQUIRED Attributes - List and Definition References . . 107 107 5.7. RECOMMENDED Attributes - List and Definition References . 108 108 5.8. Attribute Definitions . . . . . . . . . . . . . . 
110 109 5.9. Interpreting owner and owner_group . . . . . . . . . . . 119 110 5.10. Character Case Attributes . . . . . . . . . . . . . . . . 121 111 5.11. Directory Notification Attributes . . . . . . . . . . . . 121 112 5.12. pNFS Attribute Definitions . . . . . . . . . . . . . . . 122 113 5.13. Retention Attributes . . . . . . . . . . . . . . . . . . 123 114 6. Access Control Attributes . . . . . . . . . . . . . . . . . . 126 115 6.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 126 116 6.2. File Attributes Discussion . . . . . . . . . . . . . . . 127 117 6.3. Common Methods . . . . . . . . . . . . . . . . . . . . . 144 118 6.4. Requirements . . . . . . . . . . . . . . . . . . . . . . 146 119 7. Single-Server Namespace . . . . . . . . . . . . . . . . . . . 153 120 7.1. Server Exports . . . . . . . . . . . . . . . . . . . . . 153 121 7.2. Browsing Exports . . . . . . . . . . . . . . . . . . . . 153 122 7.3. Server Pseudo File System . . . . . . . . . . . . . . . . 154 123 7.4. Multiple Roots . . . . . . . . . . . . . . . . . . . . . 154 124 7.5. Filehandle Volatility . . . . . . . . . . . . . . . . . . 155 125 7.6. Exported Root . . . . . . . . . . . . . . . . . . . . . . 155 126 7.7. Mount Point Crossing . . . . . . . . . . . . . . . . . . 155 127 7.8. Security Policy and Namespace Presentation . . . . . . . 156 128 8. State Management . . . . . . . . . . . . . . . . . . . . . . 157 129 8.1. Client and Session ID . . . . . . . . . . . . . . . . . . 158 130 8.2. Stateid Definition . . . . . . . . . . . . . . . . . . . 158 131 8.3. Lease Renewal . . . . . . . . . . . . . . . . . . . . . . 167 132 8.4. Crash Recovery . . . . . . . . . . . . . . . . . . . . . 169 133 8.5. Server Revocation of Locks . . . . . . . . . . . . . . . 180 134 8.6. Short and Long Leases . . . . . . . . . . . . . . . . . . 181 135 8.7. Clocks, Propagation Delay, and Calculating Lease 136 Expiration . . . . . . . . . . . . . . . . . . . . . . . 182 137 8.8. 
Obsolete Locking Infrastructure from NFSv4.0 . . . . . . 182 138 9. File Locking and Share Reservations . . . . . . . . . . . . . 183 139 9.1. Opens and Byte-Range Locks . . . . . . . . . . . . . . . 183 140 9.2. Lock Ranges . . . . . . . . . . . . . . . . . . . . . . . 187 141 9.3. Upgrading and Downgrading Locks . . . . . . . . . . . . . 188 142 9.4. Stateid Seqid Values and Byte-Range Locks . . . . . . . . 188 143 9.5. Issues with Multiple Open-Owners . . . . . . . . . . . . 188 144 9.6. Blocking Locks . . . . . . . . . . . . . . . . . . . . . 189 145 9.7. Share Reservations . . . . . . . . . . . . . . . . . . . 190 146 9.8. OPEN/CLOSE Operations . . . . . . . . . . . . . . . . . . 191 147 9.9. Open Upgrade and Downgrade . . . . . . . . . . . . . . . 192 148 9.10. Parallel OPENs . . . . . . . . . . . . . . . . . . . . . 193 149 9.11. Reclaim of Open and Byte-Range Locks . . . . . . . . . . 193 150 10. Client-Side Caching . . . . . . . . . . . . . . . . . . . . . 194 151 10.1. Performance Challenges for Client-Side Caching . . . . . 194 152 10.2. Delegation and Callbacks . . . . . . . . . . . . . . . . 195 153 10.3. Data Caching . . . . . . . . . . . . . . . . . . . . . . 200 154 10.4. Open Delegation . . . . . . . . . . . . . . . . . . . . 204 155 10.5. Data Caching and Revocation . . . . . . . . . . . . . . 215 156 10.6. Attribute Caching . . . . . . . . . . . . . . . . . . . 217 157 10.7. Data and Metadata Caching and Memory Mapped Files . . . 219 158 10.8. Name and Directory Caching without Directory Delegations 221 159 10.9. Directory Delegations . . . . . . . . . . . . . . . . . 223 160 11. Multi-Server Namespace . . . . . . . . . . . . . . . . . . . 227 161 11.1. Terminology . . . . . . . . . . . . . . . . . . . . . . 227 162 11.2. File System Location Attributes . . . . . . . . . . . . 230 163 11.3. File System Presence or Absence . . . . . . . . . . . . 231 164 11.4. Getting Attributes for an Absent File System . . . . . . 232 165 11.5. 
Uses of File System Location Information . . . . . . . . 234 166 11.6. Users and Groups in a Multi-server Namespace . . . . . . 242 167 11.7. Additional Client-Side Considerations . . . . . . . . . 243 168 11.8. Overview of File Access Transitions . . . . . . . . . . 244 169 11.9. Effecting Network Endpoint Transitions . . . . . . . . . 244 170 11.10. Effecting File System Transitions . . . . . . . . . . . 245 171 11.11. Transferring State upon Migration . . . . . . . . . . . 253 172 11.12. Client Responsibilities when Access is Transitioned . . 255 173 11.13. Server Responsibilities Upon Migration . . . . . . . . . 264 174 11.14. Effecting File System Referrals . . . . . . . . . . . . 270 175 11.15. The Attribute fs_locations . . . . . . . . . . . . . . . 277 176 11.16. The Attribute fs_locations_info . . . . . . . . . . . . 280 177 11.17. The Attribute fs_status . . . . . . . . . . . . . . . . 293 178 12. Parallel NFS (pNFS) . . . . . . . . . . . . . . . . . . . . . 297 179 12.1. Introduction . . . . . . . . . . . . . . . . . . . . . . 297 180 12.2. pNFS Definitions . . . . . . . . . . . . . . . . . . . . 298 181 12.3. pNFS Operations . . . . . . . . . . . . . . . . . . . . 304 182 12.4. pNFS Attributes . . . . . . . . . . . . . . . . . . . . 305 183 12.5. Layout Semantics . . . . . . . . . . . . . . . . . . . . 305 184 12.6. pNFS Mechanics . . . . . . . . . . . . . . . . . . . . . 320 185 12.7. Recovery . . . . . . . . . . . . . . . . . . . . . . . . 322 186 12.8. Metadata and Storage Device Roles . . . . . . . . . . . 327 187 12.9. Security Considerations for pNFS . . . . . . . . . . . . 327 188 13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type . 329 189 13.1. Client ID and Session Considerations . . . . . . . . . . 329 190 13.2. File Layout Definitions . . . . . . . . . . . . . . . . 332 191 13.3. File Layout Data Types . . . . . . . . . . . . . . . . . 332 192 13.4. Interpreting the File Layout . . . . . . . . . . . . . . 336 193 13.5. 
Data Server Multipathing . . . . . . . . . . . . . . . . 344 194 13.6. Operations Sent to NFSv4.1 Data Servers . . . . . . . . 345 195 13.7. COMMIT through Metadata Server . . . . . . . . . . . . . 347 196 13.8. The Layout Iomode . . . . . . . . . . . . . . . . . . . 348 197 13.9. Metadata and Data Server State Coordination . . . . . . 349 198 13.10. Data Server Component File Size . . . . . . . . . . . . 352 199 13.11. Layout Revocation and Fencing . . . . . . . . . . . . . 352 200 13.12. Security Considerations for the File Layout Type . . . . 353 201 14. Internationalization . . . . . . . . . . . . . . . . . . . . 354 202 14.1. Stringprep Profile for the utf8str_cs Type . . . . . . . 355 203 14.2. Stringprep Profile for the utf8str_cis Type . . . . . . 357 204 14.3. Stringprep Profile for the utf8str_mixed Type . . . . . 358 205 14.4. UTF-8 Capabilities . . . . . . . . . . . . . . . . . . . 360 206 14.5. UTF-8 Related Errors . . . . . . . . . . . . . . . . . . 360 207 15. Error Values . . . . . . . . . . . . . . . . . . . . . . . . 361 208 15.1. Error Definitions . . . . . . . . . . . . . . . . . . . 361 209 15.2. Operations and Their Valid Errors . . . . . . . . . . . 380 210 15.3. Callback Operations and Their Valid Errors . . . . . . . 396 211 15.4. Errors and the Operations That Use Them . . . . . . . . 399 212 16. NFSv4.1 Procedures . . . . . . . . . . . . . . . . . . . . . 414 213 16.1. Procedure 0: NULL - No Operation . . . . . . . . . . . . 414 214 16.2. Procedure 1: COMPOUND - Compound Operations . . . . . . 414 215 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL . . . . . . . 426 216 18. NFSv4.1 Operations . . . . . . . . . . . . . . . . . . . . . 429 217 18.1. Operation 3: ACCESS - Check Access Rights . . . . . . . 429 218 18.2. Operation 4: CLOSE - Close File . . . . . . . . . . . . 435 219 18.3. Operation 5: COMMIT - Commit Cached Data . . . . . . . . 436 220 18.4. Operation 6: CREATE - Create a Non-Regular File Object . 439 221 18.5. 
Operation 7: DELEGPURGE - Purge Delegations Awaiting 222 Recovery . . . . . . . . . . . . . . . . . . . . . . . . 442 223 18.6. Operation 8: DELEGRETURN - Return Delegation . . . . . . 443 224 18.7. Operation 9: GETATTR - Get Attributes . . . . . . . . . 443 225 18.8. Operation 10: GETFH - Get Current Filehandle . . . . . . 445 226 18.9. Operation 11: LINK - Create Link to a File . . . . . . . 446 227 18.10. Operation 12: LOCK - Create Lock . . . . . . . . . . . . 449 228 18.11. Operation 13: LOCKT - Test for Lock . . . . . . . . . . 454 229 18.12. Operation 14: LOCKU - Unlock File . . . . . . . . . . . 455 230 18.13. Operation 15: LOOKUP - Lookup Filename . . . . . . . . . 457 231 18.14. Operation 16: LOOKUPP - Lookup Parent Directory . . . . 459 232 18.15. Operation 17: NVERIFY - Verify Difference in Attributes 460 233 18.16. Operation 18: OPEN - Open a Regular File . . . . . . . . 461 234 18.17. Operation 19: OPENATTR - Open Named Attribute Directory 481 235 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access . 483 236 18.19. Operation 22: PUTFH - Set Current Filehandle . . . . . . 484 237 18.20. Operation 23: PUTPUBFH - Set Public Filehandle . . . . 485 238 18.21. Operation 24: PUTROOTFH - Set Root Filehandle . . . . . 487 239 18.22. Operation 25: READ - Read from File . . . . . . . . . . 488 240 18.23. Operation 26: READDIR - Read Directory . . . . . . . . . 490 241 18.24. Operation 27: READLINK - Read Symbolic Link . . . . . . 494 242 18.25. Operation 28: REMOVE - Remove File System Object . . . . 495 243 18.26. Operation 29: RENAME - Rename Directory Entry . . . . . 498 244 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle . . . 501 245 18.28. Operation 32: SAVEFH - Save Current Filehandle . . . . . 502 246 18.29. Operation 33: SECINFO - Obtain Available Security . . . 503 247 18.30. Operation 34: SETATTR - Set Attributes . . . . . . . . . 507 248 18.31. Operation 37: VERIFY - Verify Same Attributes . . . . . 510 249 18.32. 
Operation 38: WRITE - Write to File . . . . . . . . . . 511 250 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control . . 516 251 18.34. Operation 41: BIND_CONN_TO_SESSION - Associate 252 Connection with Session . . . . . . . . . . . . . . . . 517 253 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID . . . 520 254 18.36. Operation 43: CREATE_SESSION - Create New Session and 255 Confirm Client ID . . . . . . . . . . . . . . . . . . . 538 256 18.37. Operation 44: DESTROY_SESSION - Destroy a Session . . . 549 257 18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks 550 258 18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory 259 Delegation . . . . . . . . . . . . . . . . . . . . . . . 551 260 18.40. Operation 47: GETDEVICEINFO - Get Device Information . . 555 261 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings 262 for a File System . . . . . . . . . . . . . . . . . . . 558 263 18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a 264 Layout . . . . . . . . . . . . . . . . . . . . . . . . . 560 265 18.43. Operation 50: LAYOUTGET - Get Layout Information . . . . 564 266 18.44. Operation 51: LAYOUTRETURN - Release Layout Information 573 267 18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed 268 Object . . . . . . . . . . . . . . . . . . . . . . . . . 578 269 18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing 270 and Control . . . . . . . . . . . . . . . . . . . . . . 579 271 18.47. Operation 54: SET_SSV - Update SSV for a Client ID . . . 585 272 18.48. Operation 55: TEST_STATEID - Test Stateids for Validity 587 273 18.49. Operation 56: WANT_DELEGATION - Request Delegation . . . 589 274 18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID . . 593 275 18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims 276 Finished . . . . . . . . . . . . . . . . . . . . . . . . 594 277 18.52. Operation 10044: ILLEGAL - Illegal Operation . . . . . . 597 278 19. NFSv4.1 Callback Procedures . . . . . . . 
. . . . . . . . . . 598 279 19.1. Procedure 0: CB_NULL - No Operation . . . . . . . . . . 598 280 19.2. Procedure 1: CB_COMPOUND - Compound Operations . . . . . 598 281 20. NFSv4.1 Callback Operations . . . . . . . . . . . . . . . . . 603 282 20.1. Operation 3: CB_GETATTR - Get Attributes . . . . . . . . 603 283 20.2. Operation 4: CB_RECALL - Recall a Delegation . . . . . . 604 284 20.3. Operation 5: CB_LAYOUTRECALL - Recall Layout from Client 605 285 20.4. Operation 6: CB_NOTIFY - Notify Client of Directory 286 Changes . . . . . . . . . . . . . . . . . . . . . . . . 608 287 20.5. Operation 7: CB_PUSH_DELEG - Offer Previously Requested 288 Delegation to Client . . . . . . . . . . . . . . . . . . 612 290 20.6. Operation 8: CB_RECALL_ANY - Keep Any N Recallable 291 Objects . . . . . . . . . . . . . . . . . . . . . . . . 613 292 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources 293 for Recallable Objects . . . . . . . . . . . . . . . . . 616 294 20.8. Operation 10: CB_RECALL_SLOT - Change Flow Control 295 Limits . . . . . . . . . . . . . . . . . . . . . . . . . 617 296 20.9. Operation 11: CB_SEQUENCE - Supply Backchannel 297 Sequencing and Control . . . . . . . . . . . . . . . . . 618 298 20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending 299 Delegation Wants . . . . . . . . . . . . . . . . . . . . 621 300 20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible 301 Lock Availability . . . . . . . . . . . . . . . . . . . 622 302 20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of 303 Device ID Changes . . . . . . . . . . . . . . . . . . . 623 304 20.13. Operation 10044: CB_ILLEGAL - Illegal Callback Operation 625 305 21. Security Considerations . . . . . . . . . . . . . . . . . . . 626 306 22. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 630 307 22.1. IANA Actions Needed . . . . . . . . . . . . . . . . . . 630 308 22.2. Named Attribute Definitions . . . . . . . . . . . . . . 630 309 22.3. Device ID Notifications . . . 
. . . . . . . . . . . . . 631 310 22.4. Object Recall Types . . . . . . . . . . . . . . . . . . 633 311 22.5. Layout Types . . . . . . . . . . . . . . . . . . . . . . 635 312 22.6. Path Variable Definitions . . . . . . . . . . . . . . . 637 313 23. References . . . . . . . . . . . . . . . . . . . . . . . . . 641 314 23.1. Normative References . . . . . . . . . . . . . . . . . . 641 315 23.2. Informative References . . . . . . . . . . . . . . . . . 644 316 Appendix A. Need for this Update . . . . . . . . . . . . . . . . 647 317 Appendix B. Changes in this Update . . . . . . . . . . . . . . . 649 318 B.1. Revisions Made to Section 11 of [RFC5661] . . . . . . . . 649 319 B.2. Revisions Made to Operations in [RFC5661] . . . . . . . . 652 320 B.3. Revisions Made to Error Definitions in [RFC5661] . . . . 655 321 B.4. Other Revisions Made to [RFC5661] . . . . . . . . . . . . 655 322 Appendix C. Security Issues that Need to be Addressed . . . . . 656 323 Appendix D. Acknowledgments . . . . . . . . . . . . . . . . . . 658 324 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 661 326 1. Introduction 328 1.1. Introduction to this Update 330 The revised description of the NFS version 4 minor version 1 331 (NFSv4.1) protocol presented in this update is necessary to enable 332 full use of trunking in connection with multi-server namespace 333 features and to enable the use of transparent state migration in 334 connection with NFSv4.1. See Appendix A for a discussion of the need 335 for this modified treatment and Appendix B for a description of the 336 specific changes made to arrive at the current text. 338 This document is in the form of a complete updated description of the 339 protocol, rather than a series of updated sections. However, it is 340 not, as the term has generally been used with regard to NFSv4 minor 341 versions, a "bis document", which contains a full description of the 342 protocol, with no documents updating it. 
Production of such a 343 document would require completion of the work items listed below. 344 This would include work with regard to documents which currently 345 update RFC5661 [60], which will also update this document. In 346 addition, it is believed that there are areas for which the 347 description in RFC5661 [60] is either incorrect or inadequate. 349 o Work would have to be done with regard to RFC8178 [61], with which 350 RFC5661 [60] is currently inconsistent, in order to arrive at a 351 situation in which there would be no need for RFC8178 to update 352 the NFSv4.1 specification. 354 o Work would have to be done with regard to RFC8434 [64], which 355 currently updates RFC5661 [60]. When that work is done and the 356 resulting document approved, the new NFSv4.1 specification will 357 obsolete RFC8434 as well as RFC5661. 359 o There is a need for a new approach to the description of 360 internationalization since the current internationalization 361 section (Section 14) has never been implemented and does not meet 362 the needs of the NFSv4 protocol. Possible solutions are to create 363 a new internationalization section modeled on that in [62] or to 364 create a new document describing internationalization for all 365 NFSv4 minor versions and reference that document in the RFCs 366 defining both NFSv4.0 and NFSv4.1. 368 o There is a need for a revised treatment of security in NFSv4.1. 369 The issues with the existing treatment are discussed in 370 Appendix C. 372 Until the above work is done, there will not be a full, correct 373 description of the NFSv4.1 protocol in a single document, and any full 374 description would involve documents updating the specification, just 375 as RFC8434 [64] and RFC8178 [61] do today. 377 1.2. The NFS Version 4 Minor Version 1 Protocol 379 The NFS version 4 minor version 1 (NFSv4.1) protocol is the second 380 minor version of the NFS version 4 (NFSv4) protocol. 
The first minor 381 version, NFSv4.0, is now described in RFC7530 [62]. It generally 382 follows the guidelines for minor versioning that are listed in 383 Section 10 of RFC 3530. However, it diverges from guidelines 11 ("a 384 client and server that support minor version X must support minor 385 versions 0 through X-1") and 12 ("no new features may be introduced 386 as mandatory in a minor version"). These divergences are due to the 387 introduction of the sessions model for managing non-idempotent 388 operations and the RECLAIM_COMPLETE operation. These two new 389 features are infrastructural in nature and simplify implementation of 390 existing and other new features. Making them anything but REQUIRED 391 would add undue complexity to protocol definition and implementation. 392 NFSv4.1 accordingly updates the minor versioning guidelines 393 (Section 2.7). 395 As a minor version, NFSv4.1 is consistent with the overall goals for 396 NFSv4, but extends the protocol so as to better meet those goals, 397 based on experiences with NFSv4.0. In addition, NFSv4.1 has adopted 398 some additional goals, which motivate some of the major extensions in 399 NFSv4.1. 401 1.3. Requirements Language 403 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 404 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 405 document are to be interpreted as described in RFC 2119 [1]. 407 1.4. Scope of This Document 409 This document describes the NFSv4.1 protocol. With respect to 410 NFSv4.0, this document does not: 412 o describe the NFSv4.0 protocol, except where needed to contrast 413 with NFSv4.1. 415 o modify the specification of the NFSv4.0 protocol. 417 o clarify the NFSv4.0 protocol. 419 1.5. NFSv4 Goals 421 The NFSv4 protocol is a further revision of the NFS protocol defined 422 already by NFSv3 [34]. 
It retains the essential characteristics of 423 previous versions: easy recovery; independence of transport 424 protocols, operating systems, and file systems; simplicity; and good 425 performance. NFSv4 has the following goals: 427 o Improved access and good performance on the Internet 429 The protocol is designed to transit firewalls easily, perform well 430 where latency is high and bandwidth is low, and scale to very 431 large numbers of clients per server. 433 o Strong security with negotiation built into the protocol 434 The protocol builds on the work of the ONCRPC working group in 435 supporting the RPCSEC_GSS protocol. Additionally, the NFSv4.1 436 protocol provides a mechanism to allow clients and servers the 437 ability to negotiate security and require clients and servers to 438 support a minimal set of security schemes. 440 o Good cross-platform interoperability 442 The protocol features a file system model that provides a useful, 443 common set of features that does not unduly favor one file system 444 or operating system over another. 446 o Designed for protocol extensions 448 The protocol is designed to accept standard extensions within a 449 framework that enables and encourages backward compatibility. 451 1.6. NFSv4.1 Goals 453 NFSv4.1 has the following goals, within the framework established by 454 the overall NFSv4 goals. 456 o To correct significant structural weaknesses and oversights 457 discovered in the base protocol. 459 o To add clarity and specificity to areas left unaddressed or not 460 addressed in sufficient detail in the base protocol. However, as 461 stated in Section 1.4, it is not a goal to clarify the NFSv4.0 462 protocol in the NFSv4.1 specification. 464 o To add specific features based on experience with the existing 465 protocol and recent industry developments. 
467 o To provide protocol support to take advantage of clustered server 468 deployments including the ability to provide scalable parallel 469 access to files distributed among multiple servers. 471 1.7. General Definitions 473 The following definitions provide an appropriate context for the 474 reader. 476 Byte: In this document, a byte is an octet, i.e., a datum exactly 8 477 bits in length. 479 Client: The client is the entity that accesses the NFS server's 480 resources. The client may be an application that contains the 481 logic to access the NFS server directly. The client may also be 482 the traditional operating system client that provides remote file 483 system services for a set of applications. 485 A client is uniquely identified by a client owner. 487 With reference to byte-range locking, the client is also the 488 entity that maintains a set of locks on behalf of one or more 489 applications. This client is responsible for crash or failure 490 recovery for those locks it manages. 492 Note that multiple clients may share the same transport and 493 connection and multiple clients may exist on the same network 494 node. 496 Client ID: The client ID is a 64-bit quantity used as a unique, 497 short-hand reference to a client-supplied verifier and client 498 owner. The server is responsible for supplying the client ID. 500 Client Owner: The client owner is a unique string, opaque to the 501 server, that identifies a client. Multiple network connections 502 and source network addresses originating from those connections 503 may share a client owner. The server is expected to treat 504 requests from connections with the same client owner as coming 505 from the same client. 507 File System: The file system is the collection of objects on a 508 server (as identified by the major identifier of a server owner, 509 which is defined later in this section) that share the same fsid 510 attribute (see Section 5.8.1.9). 
Lease:  A lease is an interval of time defined by the server for which the client is irrevocably granted locks. At the end of a lease period, locks may be revoked if the lease has not been extended. A lock must be revoked if a conflicting lock has been granted after the lease interval.

   A server grants a client a single lease for all state.

Lock:  The term "lock" is used to refer to byte-range (in UNIX environments, also known as record) locks, share reservations, delegations, or layouts unless specifically stated otherwise.

Secret State Verifier (SSV):  The SSV is a unique secret key shared between a client and server. The SSV serves as the secret key for an internal (that is, internal to NFSv4.1) Generic Security Services (GSS) mechanism (the SSV GSS mechanism; see Section 2.10.9). The SSV GSS mechanism uses the SSV to compute message integrity code (MIC) and Wrap tokens. See Section 2.10.8.3 for more details on how NFSv4.1 uses the SSV and the SSV GSS mechanism.

Server:  The server is the entity responsible for coordinating client access to a set of file systems and is identified by a server owner. A server can span multiple network addresses.

Server Owner:  The server owner identifies the server to the client. The server owner consists of a major identifier and a minor identifier. When the client has two connections, each to a peer with the same major identifier, the client assumes that both peers are the same server (the server namespace is the same via each connection) and that lock state is sharable across both connections. When each peer has both the same major and minor identifiers, the client assumes that each connection might be associable with the same session.
Stable Storage:  Stable storage is storage from which data stored by an NFSv4.1 server can be recovered without data loss from multiple power failures (including cascading power failures, that is, several power failures in quick succession), operating system failures, and/or hardware failure of components other than the storage medium itself (such as disk, nonvolatile RAM, flash memory, etc.).

   Some examples of stable storage that are allowable for an NFS server include:

   1.  Media commit of data; that is, the modified data has been successfully written to the disk media, for example, the disk platter.

   2.  An immediate reply disk drive with battery-backed, on-drive intermediate storage or uninterruptible power system (UPS).

   3.  Server commit of data with battery-backed intermediate storage and recovery software.

   4.  Cache commit with uninterruptible power system (UPS) and recovery software.

Stateid:  A stateid is a 128-bit quantity returned by a server that uniquely defines the open and locking states provided by the server for a specific open-owner or lock-owner/open-owner pair for a specific file and type of lock.

Verifier:  A verifier is a 64-bit quantity generated by the client that the server can use to determine if the client has restarted and lost all previous lock state.

1.8.  Overview of NFSv4.1 Features

The major features of the NFSv4.1 protocol will be reviewed in brief. This will be done to provide an appropriate context for both the reader who is familiar with the previous versions of the NFS protocol and the reader who is new to the NFS protocols. For the reader new to the NFS protocols, there is still a set of fundamental knowledge that is expected. The reader should be familiar with the External Data Representation (XDR) and Remote Procedure Call (RPC) protocols as described in [2] and [3].
A basic knowledge of file systems and distributed file systems is expected as well.

In general, this specification of NFSv4.1 will not distinguish those features added in minor version 1 from those present in the base protocol but will treat NFSv4.1 as a unified whole. See Section 1.9 for a summary of the differences between NFSv4.0 and NFSv4.1.

1.8.1.  RPC and Security

As with previous versions of NFS, the External Data Representation (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1 protocol are those defined in [2] and [3]. To meet end-to-end security requirements, the RPCSEC_GSS framework [4] is used to extend the basic RPC security. With the use of RPCSEC_GSS, various mechanisms can be provided to offer authentication, integrity, and privacy to the NFSv4 protocol. Kerberos V5 is used as described in [5] to provide one security framework. With the use of RPCSEC_GSS, other mechanisms may also be specified and used for NFSv4.1 security.

To enable in-band security negotiation, the NFSv4.1 protocol has operations that provide the client a method of querying the server about its policies regarding which security mechanisms must be used for access to the server's file system resources. With this, the client can securely match the security mechanism that meets the policies specified at both the client and server.

NFSv4.1 introduces parallel access (see Section 1.8.2.2), which is called pNFS. The security framework described in this section is significantly modified by the introduction of pNFS (see Section 12.9), because data access is sometimes not over RPC. The level of significance varies with the storage protocol (see Section 12.2.5) and can be as low as zero impact (see Section 13.12).

1.8.2.  Protocol Structure

1.8.2.1.
Core Protocol

Unlike NFSv3, which used a series of ancillary protocols (e.g., NLM, NSM (Network Status Monitor), MOUNT), within all minor versions of NFSv4 a single RPC protocol is used to make requests to the server. Facilities that had been separate protocols, such as locking, are now integrated within a single unified protocol.

1.8.2.2.  Parallel Access

Minor version 1 supports high-performance data access to a clustered server implementation by enabling a separation of metadata access and data access, with the latter done to multiple servers in parallel.

Such parallel data access is controlled by recallable objects known as "layouts", which are integrated into the protocol locking model. Clients direct requests for data access to a set of data servers specified by the layout via a data storage protocol, which may be NFSv4.1 or another protocol.

Because the protocols used for parallel data access are not necessarily RPC-based, the RPC-based security model (Section 1.8.1) is obviously impacted (see Section 12.9). The degree of impact varies with the storage protocol (see Section 12.2.5) used for data access, and can be as low as zero (see Section 13.12).

1.8.3.  File System Model

The general file system model used for the NFSv4.1 protocol is the same as previous versions. The server file system is hierarchical, with the regular files contained within being treated as opaque byte streams. In a slight departure, file and directory names are encoded with UTF-8 to deal with the basics of internationalization.

The NFSv4.1 protocol does not require a separate protocol to provide for the initial mapping between path name and filehandle. All file systems exported by a server are presented as a tree so that all file systems are reachable from a special per-server global root filehandle.
This allows LOOKUP operations to be used to perform functions previously provided by the MOUNT protocol. The server provides any necessary pseudo file systems to bridge any gaps that arise due to unexported gaps between exported file systems.

1.8.3.1.  Filehandles

As in previous versions of the NFS protocol, opaque filehandles are used to identify individual files and directories. Lookup-type and create operations translate file and directory names to filehandles, which are then used to identify objects in subsequent operations.

The NFSv4.1 protocol provides support for persistent filehandles, guaranteed to be valid for the lifetime of the file system object designated. In addition, it allows servers to provide filehandles with more limited validity guarantees, called volatile filehandles.

1.8.3.2.  File Attributes

The NFSv4.1 protocol has a rich and extensible file object attribute structure, which is divided into REQUIRED, RECOMMENDED, and named attributes (see Section 5).

Several (but not all) of the REQUIRED attributes are derived from the attributes of NFSv3 (see the definition of the fattr3 data type in [34]). An example of a REQUIRED attribute is the file object's type (Section 5.8.1.2), so that regular files can be distinguished from directories (also known as folders in some operating environments) and other types of objects. REQUIRED attributes are discussed in Section 5.1.

Three examples of RECOMMENDED attributes are acl, sacl, and dacl. These attributes define an Access Control List (ACL) on a file object (Section 6). An ACL provides directory and file access control beyond the model used in NFSv3. The ACL definition allows for specification of specific sets of permissions for individual users and groups.
In addition, ACL inheritance allows propagation of access permissions and restrictions down a directory tree as file system objects are created. RECOMMENDED attributes are discussed in Section 5.2.

A named attribute is an opaque byte stream that is associated with a directory or file and referred to by a string name. Named attributes are meant to be used by client applications as a method to associate application-specific data with a regular file or directory. NFSv4.1 modifies named attributes relative to NFSv4.0 by tightening the allowed operations in order to prevent the development of non-interoperable implementations. Named attributes are discussed in Section 5.3.

1.8.3.3.  Multi-Server Namespace

NFSv4.1 contains a number of features to allow implementation of namespaces that cross server boundaries and that allow and facilitate a non-disruptive transfer of support for individual file systems between servers. They are all based upon attributes that allow one file system to specify alternate, additional, and new location information that specifies how the client may access that file system.

These attributes can be used to provide for individual active file systems:

o  Alternate network addresses to access the current file system instance.

o  The locations of alternate file system instances or replicas to be used in the event that the current file system instance becomes unavailable.

These file system location attributes may be used together with the concept of absent file systems, in which a position in the server namespace is associated with locations on other servers without there being any corresponding file system instance on the current server. For example,

o  These attributes may be used with absent file systems to implement referrals whereby one server may direct the client to a file system provided by another server.
This allows extensive multi-server namespaces to be constructed.

o  These attributes may be provided when a previously present file system becomes absent. This allows non-disruptive migration of file systems to alternate servers.

1.8.4.  Locking Facilities

As mentioned previously, NFSv4.1 is a single protocol that includes locking facilities. These locking facilities include support for many types of locks, including a number of sorts of recallable locks. Recallable locks such as delegations allow the client to be assured that certain events will not occur so long as that lock is held. When circumstances change, the lock is recalled via a callback request. The assurances provided by delegations allow more extensive caching to be done safely when circumstances allow it.

The types of locks are:

o  Share reservations as established by OPEN operations.

o  Byte-range locks.

o  File delegations, which are recallable locks that assure the holder that inconsistent opens and file changes cannot occur so long as the delegation is held.

o  Directory delegations, which are recallable locks that assure the holder that inconsistent directory modifications cannot occur so long as the delegation is held.

o  Layouts, which are recallable objects that assure the holder that direct access to the file data may be performed by the client and that no change to the data's location that is inconsistent with that access may be made so long as the layout is held.

All locks for a given client are tied together under a single client-wide lease. All requests made on sessions associated with the client renew that lease. When the client's lease is not promptly renewed, the client's locks are subject to revocation. In the event of server restart, clients have the opportunity to safely reclaim their locks within a special grace period.

1.9.
Differences from NFSv4.0

The following summarizes the major differences between minor version 1 and the base protocol:

o  Implementation of the sessions model (Section 2.10).

o  Parallel access to data (Section 12).

o  Addition of the RECLAIM_COMPLETE operation to better structure the lock reclamation process (Section 18.51).

o  Enhanced delegation support as follows.

   *  Delegations on directories and other file types in addition to regular files (Section 18.39, Section 18.49).

   *  Operations to optimize acquisition of recalled or denied delegations (Section 18.49, Section 20.5, Section 20.7).

   *  Notifications of changes to files and directories (Section 18.39, Section 20.4).

   *  A method to allow a server to indicate that it is recalling one or more delegations for resource management reasons, and thus a method to allow the client to pick which delegations to return (Section 20.6).

o  Attributes can be set atomically during exclusive file create via the OPEN operation (see the new EXCLUSIVE4_1 creation method in Section 18.16).

o  Open files can be preserved if removed and the hard link count ("hard link" is defined in an Open Group [6] standard) goes to zero, thus obviating the need for clients to rename deleted files to partially hidden names -- colloquially called "silly rename" (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in Section 18.16).

o  Improved compatibility with Microsoft Windows for Access Control Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2).

o  Data retention (Section 5.13).

o  Identification of the implementation of the NFS client and server (Section 18.35).

o  Support for notification of the availability of byte-range locks (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in Section 18.16 and see Section 20.11).

o  In NFSv4.1, LIPKEY and SPKM-3 are not required security mechanisms [35].

2.
Core Infrastructure

2.1.  Introduction

NFSv4.1 relies on core infrastructure common to nearly every operation. This core infrastructure is described in the remainder of this section.

2.2.  RPC and XDR

The NFSv4.1 protocol is a Remote Procedure Call (RPC) application that uses RPC version 2 and the corresponding eXternal Data Representation (XDR) as defined in [3] and [2].

2.2.1.  RPC-Based Security

Previous NFS versions have been thought of as having a host-based authentication model, where the NFS server authenticates the NFS client and trusts the client to authenticate all users. Actually, NFS has always depended on RPC for authentication. One of the first forms of RPC authentication, AUTH_SYS, had no strong authentication and required a host-based authentication approach. NFSv4.1 also depends on RPC for basic security services and mandates RPC support for a user-based authentication model. The user-based authentication model has user principals authenticated by a server, and in turn the server authenticated by user principals. RPC provides some basic security services that are used by NFSv4.1.

2.2.1.1.  RPC Security Flavors

As described in Section 7.2 ("Authentication") of [3], RPC security is encapsulated in the RPC header, via a security or authentication flavor, and information specific to the specified security flavor. Every RPC header conveys information used to identify and authenticate a client and server. As discussed in Section 2.2.1.1.1, some security flavors provide additional security services.

NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This requirement to implement is not a requirement to use.) Other flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well.

2.2.1.1.1.  RPCSEC_GSS and Security Services

RPCSEC_GSS [4] uses the functionality of GSS-API [7].
This allows for the use of various security mechanisms by the RPC layer without the additional implementation overhead of adding RPC security flavors.

2.2.1.1.1.1.  Identification, Authentication, Integrity, Privacy

Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate users on clients to servers, and servers to users. It can also perform integrity checking on the entire RPC message, including the RPC header, and on the arguments or results. Finally, privacy, usually via encryption, is a service available with RPCSEC_GSS. Privacy is performed on the arguments and results. Note that if privacy is selected, integrity, authentication, and identification are enabled. If privacy is not selected, but integrity is selected, authentication and identification are enabled. If integrity and privacy are not selected, but authentication is enabled, identification is enabled. RPCSEC_GSS does not provide identification as a separate service.

Although GSS-API has an authentication service distinct from its privacy and integrity services, GSS-API's authentication service is not used for RPCSEC_GSS's authentication service. Instead, each RPC request and response header is integrity protected with the GSS-API integrity service, and this allows RPCSEC_GSS to offer per-RPC authentication and identity. See [4] for more information.

NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's privacy service. NFSv4.1 clients SHOULD support RPCSEC_GSS's privacy service.

2.2.1.1.1.2.  Security Mechanisms for NFSv4.1

RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide security services. Therefore, NFSv4.1 clients and servers MUST support the Kerberos V5 security mechanism.
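The service relationships described in Section 2.2.1.1.1.1 form a strict chain: privacy implies integrity, integrity implies authentication, and authentication implies identification. A minimal sketch of that chain (the helper and its names are illustrative, not part of RPCSEC_GSS or any real API):

```python
# Illustrative encoding (not from any RPCSEC_GSS implementation) of the
# service implications: privacy -> integrity -> authentication ->
# identification. Identification is never selectable on its own.
IMPLIES = {
    "privacy": "integrity",
    "integrity": "authentication",
    "authentication": "identification",
}

def enabled_services(selected):
    """Return the full set of services enabled by selecting one service."""
    services = {selected}
    while selected in IMPLIES:
        selected = IMPLIES[selected]
        services.add(selected)
    return services
```

For example, selecting the privacy service enables all four services, while selecting only integrity enables integrity, authentication, and identification.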
The use of RPCSEC_GSS requires selection of mechanism, quality of protection (QOP), and service (authentication, integrity, privacy). For the mandated security mechanisms, NFSv4.1 specifies that a QOP of zero is used, leaving it up to the mechanism or the mechanism's configuration to map QOP zero to an appropriate level of protection. Each mandated mechanism specifies a minimum set of cryptographic algorithms for implementing integrity and privacy. NFSv4.1 clients and servers MUST be implemented on operating environments that comply with the REQUIRED cryptographic algorithms of each REQUIRED mechanism.

2.2.1.1.1.2.1.  Kerberos V5

The Kerberos V5 GSS-API mechanism as described in [5] MUST be implemented with the RPCSEC_GSS services as specified in the following table:

   column descriptions:
   1 == number of pseudo flavor
   2 == name of pseudo flavor
   3 == mechanism's OID
   4 == RPCSEC_GSS service
   5 == NFSv4.1 clients MUST support
   6 == NFSv4.1 servers MUST support

   1      2      3                     4                      5    6
   ------------------------------------------------------------------
   390003 krb5   1.2.840.113554.1.2.2  rpc_gss_svc_none       yes  yes
   390004 krb5i  1.2.840.113554.1.2.2  rpc_gss_svc_integrity  yes  yes
   390005 krb5p  1.2.840.113554.1.2.2  rpc_gss_svc_privacy    no   yes

Note that the number and name of the pseudo flavor are presented here as a mapping aid to the implementor. Because the NFSv4.1 protocol includes a method to negotiate security and it understands the GSS-API mechanism, the pseudo flavor is not needed. The pseudo flavor is needed for NFSv3 since the security negotiation is done via the MOUNT protocol as described in [36].

At the time NFSv4.1 was specified, the Advanced Encryption Standard (AES) with HMAC-SHA1 was a REQUIRED algorithm set for Kerberos V5.
In contrast, when NFSv4.0 was specified, weaker algorithm sets were REQUIRED for Kerberos V5, and were REQUIRED in the NFSv4.0 specification, because the Kerberos V5 specification at the time did not specify stronger algorithms. The NFSv4.1 specification does not specify REQUIRED algorithms for Kerberos V5; instead, the implementor is expected to track the evolution of the Kerberos V5 standard if and when stronger algorithms are specified.

2.2.1.1.1.2.1.1.  Security Considerations for Cryptographic Algorithms in Kerberos V5

When deploying NFSv4.1, the strength of the security achieved depends on the existing Kerberos V5 infrastructure. The algorithms of Kerberos V5 are not directly exposed to or selectable by the client or server, so there is some due diligence required by the user of NFSv4.1 to ensure that security is acceptable where needed.

2.2.1.1.1.3.  GSS Server Principal

Regardless of what security mechanism under RPCSEC_GSS is being used, the NFS server MUST identify itself in GSS-API via a GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE names are of the form:

   service@hostname

For NFS, the "service" element is

   nfs

Implementations of security mechanisms will convert nfs@hostname to various different forms. For Kerberos V5, the following form is RECOMMENDED:

   nfs/hostname

2.3.  COMPOUND and CB_COMPOUND

A significant departure from the versions of the NFS protocol before NFSv4 is the introduction of the COMPOUND procedure. For the NFSv4 protocol, in all minor versions, there are exactly two RPC procedures, NULL and COMPOUND. The COMPOUND procedure is defined as a series of individual operations, and these operations perform the sorts of functions performed by traditional NFS procedures.
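The evaluation rules for a COMPOUND, detailed in the following paragraphs, can be sketched as a simple loop. This is an illustration of the evaluation model only, not the wire protocol; here each operation is represented as a callable returning a (status, result) pair:

```python
def evaluate_compound(operations):
    """Illustrative COMPOUND evaluation loop (not an implementation of
    the actual protocol): operations run strictly in order with no
    atomicity guarantee; evaluation stops at the first failing status,
    and the results of every operation evaluated so far are returned.
    """
    NFS4_OK = 0
    results = []
    for op in operations:
        status, result = op()
        results.append((status, result))
        if status != NFS4_OK:
            break  # remaining operations are never evaluated
    return results
```

A multi-component lookup, for example, would be a sequence of LOOKUP operations followed by a GETATTR; if the second LOOKUP fails, the GETATTR is never evaluated, and the client receives the results of the operations up to and including the failure.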
The operations combined within a COMPOUND request are evaluated in order by the server, without any atomicity guarantees. A limited set of facilities exists to pass results from one operation to another. Once an operation returns a failing result, the evaluation ends and the results of all evaluated operations are returned to the client.

With the use of the COMPOUND procedure, the client is able to build simple or complex requests. These COMPOUND requests allow for a reduction in the number of RPCs needed for logical file system operations. For example, multi-component lookup requests can be constructed by combining multiple LOOKUP operations. Those can be further combined with operations such as GETATTR, READDIR, or OPEN plus READ to do more complicated sets of operations without incurring additional latency.

NFSv4.1 also contains a considerable set of callback operations in which the server makes an RPC directed at the client. Callback RPCs have a similar structure to that of the normal server requests. In all minor versions of the NFSv4 protocol, there are two callback RPC procedures: CB_NULL and CB_COMPOUND. The CB_COMPOUND procedure is defined in an analogous fashion to that of COMPOUND, with its own set of callback operations.

The addition of new server and callback operations within the COMPOUND and CB_COMPOUND request framework provides a means of extending the protocol in subsequent minor versions.

Except for a small number of operations needed for session creation, server requests and callback requests are performed within the context of a session. Sessions provide a client context for every request and support robust reply protection for non-idempotent requests.

2.4.
Client Identifiers and Client Owners

For each operation that obtains or depends on locking state, the specific client needs to be identifiable by the server.

Each distinct client instance is represented by a client ID. A client ID is a 64-bit identifier representing a specific client at a given time. The client ID is changed whenever the client re-initializes, and may change when the server re-initializes. Client IDs are used to support lock identification and crash recovery.

During steady state operation, the client ID associated with each operation is derived from the session (see Section 2.10) on which the operation is sent. A session is associated with a client ID when the session is created.

Unlike NFSv4.0, the only NFSv4.1 operations possible before a client ID is established are those needed to establish the client ID.

A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION operation using that client ID (eir_clientid as returned from EXCHANGE_ID) is required to establish and confirm the client ID on the server. Establishment of identification by a new incarnation of the client also has the effect of immediately releasing any locking state that a previous incarnation of that same client might have had on the server. Such released state would include all byte-range lock, share reservation, and layout state, and -- where the server supports neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_CUR_FH claim types -- all delegation state associated with the same client with the same identity. For discussion of delegation state recovery, see Section 10.2.1. For discussion of layout state recovery, see Section 12.7.1.

Releasing such state requires that the server be able to determine that one client instance is the successor of another.
Where this cannot be done, for any of a number of reasons, the locking state will remain for a time subject to lease expiration (see Section 8.3), and the new client will need to wait for such state to be removed if it makes conflicting lock requests.

Client identification is encapsulated in the following client owner data type:

   struct client_owner4 {
           verifier4       co_verifier;
           opaque          co_ownerid<NFS4_OPAQUE_LIMIT>;
   };

The first field, co_verifier, is a client incarnation verifier. The server will start the process of canceling the client's leased state if co_verifier is different than what the server has previously recorded for the identified client (as specified in the co_ownerid field).

The second field, co_ownerid, is a variable-length string that uniquely defines the client so that subsequent instances of the same client bear the same co_ownerid with a different verifier.

There are several considerations for how the client generates the co_ownerid string:

o  The string should be unique so that multiple clients do not present the same string. The consequences of two clients presenting the same string range from one client getting an error to one client having its leased state abruptly and unexpectedly cancelled.

o  The string should be selected so that subsequent incarnations (e.g., restarts) of the same client cause the client to present the same string. The implementor is cautioned against an approach that requires the string to be recorded in a local file, because this precludes the use of the implementation in an environment where there is no local disk and all file access is from an NFSv4.1 server.

o  The string should be the same for each server network address that the client accesses.
This way, if a server has multiple interfaces, the client can trunk traffic over multiple network paths as described in Section 2.10.5. (Note: the precise opposite was advised in the NFSv4.0 specification [33].)

o  The algorithm for generating the string should not assume that the client's network address will not change, unless the client implementation knows it is using statically assigned network addresses. This includes changes between client incarnations and even changes while the client is still running in its current incarnation. Thus, with dynamic address assignment, if the client includes just the client's network address in the co_ownerid string, there is a real risk that after the client gives up the network address, another client, using a similar algorithm for generating the co_ownerid string, would generate a conflicting co_ownerid string.

Given the above considerations, an example of a well-generated co_ownerid string is one that includes:

o  If applicable, the client's statically assigned network address.

o  Additional information that tends to be unique, such as one or more of:

   *  The client machine's serial number (for privacy reasons, it is best to perform some one-way function on the serial number).

   *  A Media Access Control (MAC) address (again, a one-way function should be performed).

   *  The timestamp of when the NFSv4.1 software was first installed on the client (though this is subject to the previously mentioned caution about using information that is stored in a file, because the file might only be accessible over NFSv4.1).

   *  A true random number. However, since this number ought to be the same between client incarnations, this shares the same problem as that of using the timestamp of the software installation.
o  For a user-level NFSv4.1 client, it should contain additional information to distinguish the client from other user-level clients running on the same host, such as a process identifier or other unique sequence.

The client ID is assigned by the server (the eir_clientid result from EXCHANGE_ID) and should be chosen so that it will not conflict with a client ID previously assigned by the server. This applies across server restarts.

In the event of a server restart, a client may find out that its current client ID is no longer valid when it receives an NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on the characteristics of the sessions involved, specifically whether the session is persistent (see Section 2.10.6.5), but in each case the client will receive this error when it attempts to establish a new session with the existing client ID, indicating that a new client ID needs to be obtained via EXCHANGE_ID and a new session established with that client ID.

When a session is not persistent, the client will find out that it needs to create a new session as a result of getting an NFS4ERR_BADSESSION error, since the session in question was lost as part of a server restart. When the existing client ID is presented to a server as part of creating a session and that client ID is not recognized, as would happen after a server restart, the server will reject the request with the error NFS4ERR_STALE_CLIENTID.

In the case of the session being persistent, the client will re-establish communication using the existing session after the restart. This session will be associated with the existing client ID but may only be used to retransmit operations that the client previously transmitted and did not see replies to.
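One way to combine the co_ownerid considerations listed above can be sketched as follows. Every name and input here is an assumption for illustration, not mandated by this specification; a real client would choose components appropriate to its platform:

```python
import hashlib
import uuid

def make_co_ownerid(install_timestamp, instance_suffix=""):
    """Illustrative co_ownerid construction (not mandated by the spec).

    A one-way hash over components that are stable across restarts --
    here the MAC address and an assumed software-install timestamp --
    yields a string that is the same in every incarnation of the client,
    the same for every server network address contacted, and does not
    expose the raw identifying data (per the privacy advice above).
    """
    mac = uuid.getnode()  # MAC address; hashed below rather than sent raw
    digest = hashlib.sha256(f"{mac}/{install_timestamp}".encode()).hexdigest()
    # A user-level client would append something like a process
    # identifier so clients on the same host remain distinct.
    return f"nfsv41-{digest}{instance_suffix}"
```

Note that the install timestamp remains subject to the caution above about data stored in a file that might only be reachable over NFSv4.1.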
Replies to operations that the server previously performed will come from the reply cache; otherwise, NFS4ERR_DEADSESSION will be returned. Hence, such a session is referred to as "dead". In this situation, in order to perform new operations, the client needs to establish a new session. If an attempt is made to establish this new session with the existing client ID, the server will reject the request with NFS4ERR_STALE_CLIENTID.

When NFS4ERR_STALE_CLIENTID is received in either of these situations, the client needs to obtain a new client ID by use of the EXCHANGE_ID operation, then use that client ID as the basis of a new session, and then proceed to any other necessary recovery for the server restart case (see Section 8.4.2).

See the descriptions of EXCHANGE_ID (Section 18.35) and CREATE_SESSION (Section 18.36) for a complete specification of these operations.

2.4.1. Upgrade from NFSv4.0 to NFSv4.1

To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a value of data type client_owner4 in an EXCHANGE_ID with a value of data type nfs_client_id4 that was established using the SETCLIENTID operation of NFSv4.0. A server that does so will allow an upgraded client to avoid waiting until the lease (i.e., the lease established by the NFSv4.0 client instance) expires. This requires that the value of data type client_owner4 be constructed the same way as the value of data type nfs_client_id4. If the latter's contents included the server's network address (per the recommendations of the NFSv4.0 specification [33]), and the NFSv4.1 client does not wish to use a client ID that prevents trunking, it should send two EXCHANGE_ID operations. The first EXCHANGE_ID will have a client_owner4 equal to the nfs_client_id4. This will clear the state created by the NFSv4.0 client.
The second EXCHANGE_ID will not have the server's network address. The state created for the second EXCHANGE_ID will not have to wait for lease expiration, because there will be no state to expire.

2.4.2. Server Release of Client ID

NFSv4.1 introduces a new operation called DESTROY_CLIENTID (Section 18.50), which the client SHOULD use to destroy a client ID it no longer needs. This permits graceful, bilateral release of a client ID. The operation cannot be used if there are sessions associated with the client ID, or state with an unexpired lease.

If the server determines that the client holds no associated state for its client ID (associated state includes unrevoked sessions, opens, locks, delegations, layouts, and wants), the server MAY choose to unilaterally release the client ID in order to conserve resources. If the client contacts the server after this release, the server MUST ensure that the client receives the appropriate error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence to establish a new client ID. The server ought to be very hesitant to release a client ID since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically, a server would not release a client ID unless there had been no activity from that client for many minutes. As long as there are sessions, opens, locks, delegations, layouts, or wants, the server MUST NOT release the client ID. See Section 2.10.13.1.4 for discussion on releasing inactive sessions.

2.4.3. Resolving Client Owner Conflicts

When the server gets an EXCHANGE_ID for a client owner that currently has no state, or that has state but the lease has expired, the server MUST allow the EXCHANGE_ID and confirm the new client ID if followed by the appropriate CREATE_SESSION.
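The release rules of Section 2.4.2 amount to a simple eligibility check that a server implementation might apply before unilaterally discarding a client ID. The following Python sketch is purely illustrative: the record fields and the idle threshold are hypothetical local-policy choices, not protocol elements.

```python
from dataclasses import dataclass, field

@dataclass
class ClientRecord:
    """Hypothetical per-client-ID state a server might track (names illustrative)."""
    sessions: list = field(default_factory=list)
    opens: list = field(default_factory=list)
    locks: list = field(default_factory=list)
    delegations: list = field(default_factory=list)
    layouts: list = field(default_factory=list)
    wants: list = field(default_factory=list)
    idle_seconds: float = 0.0

def may_release_client_id(rec: ClientRecord, idle_threshold: float = 600.0) -> bool:
    """A server MAY unilaterally release a client ID only when the client
    holds no associated state; as long as sessions, opens, locks,
    delegations, layouts, or wants exist, it MUST NOT.  The idle threshold
    reflects the "no activity for many minutes" guidance and is a local
    policy knob, not a protocol constant."""
    holds_state = any([rec.sessions, rec.opens, rec.locks,
                       rec.delegations, rec.layouts, rec.wants])
    return (not holds_state) and rec.idle_seconds >= idle_threshold
```

Note that even when this check passes, releasing the client ID imposes full restart-style recovery on the client, which is why the text advises hesitancy.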
When the server gets an EXCHANGE_ID for a new incarnation of a client owner that currently has an old incarnation with state and an unexpired lease, the server is allowed to dispose of the state of the previous incarnation of the client owner if one of the following is true:

o  The principal that created the client ID for the client owner is the same as the principal that is sending the EXCHANGE_ID operation. Note that if the client ID was created with SP4_MACH_CRED state protection (Section 18.35), the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal MUST be used as that used when the client ID was created.

o  The client ID was established with SP4_SSV protection (Section 18.35, Section 2.10.8.3) and the client sends the EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the GSS SSV mechanism (Section 2.10.9).

o  The client ID was established with SP4_SSV protection, and under the conditions described herein, the EXCHANGE_ID was sent with SP4_MACH_CRED state protection. Because the SSV might not persist across client and server restart, and because the first time a client sends EXCHANGE_ID to a server it does not have an SSV, the client MAY send the subsequent EXCHANGE_ID without an SSV RPCSEC_GSS handle. Instead, as with SP4_MACH_CRED protection, the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal MUST be used as that used when the client ID was created.

If none of the above situations apply, the server MUST return NFS4ERR_CLID_INUSE.
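The decision procedure above can be sketched as follows. This is an illustrative Python sketch, not a normative algorithm: the credential record is a symbolic stand-in for an RPCSEC_GSS credential, and the field and mechanism names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cred:
    """Symbolic stand-in for the credential on a request (names illustrative)."""
    flavor: str          # e.g., "RPCSEC_GSS" or "AUTH_SYS"
    mechanism: str       # GSS mechanism, named symbolically (e.g., "krb5", "SSV")
    principal: str
    service: str         # "none", "integrity", or "privacy"

def may_dispose_old_incarnation(protection: str, creator: Cred, req: Cred) -> bool:
    """True when the server may dispose of the previous incarnation's state,
    per the three conditions of Section 2.4.3; False means the server MUST
    return NFS4ERR_CLID_INUSE."""
    strong = req.flavor == "RPCSEC_GSS" and req.service in ("integrity", "privacy")
    same_gss_identity = (strong
                         and req.mechanism == creator.mechanism
                         and req.principal == creator.principal)
    if protection == "SP4_MACH_CRED":
        # Same principal, RPCSEC_GSS-based, integrity or privacy service.
        return same_gss_identity
    if protection == "SP4_SSV":
        # Either the GSS SSV mechanism is used, or (e.g., after a restart,
        # before any SSV exists) a machine-credential-style match suffices.
        if req.flavor == "RPCSEC_GSS" and req.mechanism == "SSV":
            return True
        return same_gss_identity
    # SP4_NONE: the same-principal test of the first bullet.
    return req.principal == creator.principal
```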
If the server accepts the principal and co_ownerid as matching that which created the client ID, and the co_verifier in the EXCHANGE_ID differs from the co_verifier used when the client ID was created, then after the server receives a CREATE_SESSION that confirms the client ID, the server deletes state. If the co_verifier values are the same (e.g., the client either is updating properties of the client ID (Section 18.35) or is attempting trunking (Section 2.10.5)), the server MUST NOT delete state.

2.5. Server Owners

The server owner is similar to a client owner (Section 2.4), but unlike the client owner, there is no shorthand server ID. The server owner is defined in the following data type:

   struct server_owner4 {
       uint64_t    so_minor_id;
       opaque      so_major_id<NFS4_OPAQUE_LIMIT>;
   };

The server owner is returned from EXCHANGE_ID. When the so_major_id fields are the same in two EXCHANGE_ID results, the connections that each EXCHANGE_ID were sent over can be assumed to address the same server (as defined in Section 1.7). If the so_minor_id fields are also the same, then not only do both connections connect to the same server, but the session can be shared across both connections. The reader is cautioned that multiple servers may deliberately or accidentally claim to have the same so_major_id or so_major_id/so_minor_id; the reader should examine Sections 2.10.5 and 18.35 in order to avoid acting on falsely matching server owner values.

The considerations for generating a so_major_id are similar to those for generating a co_ownerid string (see Section 2.4). The consequences of two servers generating conflicting so_major_id values are less dire than they are for co_ownerid conflicts because the client can use RPCSEC_GSS to compare the authenticity of each server (see Section 2.10.5).

2.6. Security Service Negotiation

With the NFSv4.1 server potentially offering multiple security mechanisms, the client needs a method to determine or negotiate which mechanism is to be used for its communication with the server. The NFS server may have multiple points within its file system namespace that are available for use by NFS clients. These points can be considered security policy boundaries, and, in some NFS implementations, are tied to NFS export points. In turn, the NFS server may be configured such that each of these security policy boundaries may have different or multiple security mechanisms in use.

The security negotiation between client and server SHOULD be done with a secure channel to eliminate the possibility of a third party intercepting the negotiation sequence and forcing the client and server to choose a lower level of security than required or desired. See Section 21 for further discussion.

2.6.1. NFSv4.1 Security Tuples

An NFS server can assign one or more "security tuples" to each security policy boundary in its namespace. Each security tuple consists of a security flavor (see Section 2.2.1.1) and, if the flavor is RPCSEC_GSS, a GSS-API mechanism Object Identifier (OID), a GSS-API quality of protection, and an RPCSEC_GSS service.

2.6.2. SECINFO and SECINFO_NO_NAME

The SECINFO and SECINFO_NO_NAME operations allow the client to determine, on a per-filehandle basis, what security tuple is to be used for server access. In general, the client will not have to use either operation except during initial communication with the server or when the client crosses security policy boundaries at the server. However, the server's policies may also change at any time and force the client to negotiate a new security tuple.
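The security tuple of Section 2.6.1 and a client's choice among the tuples a server offers can be modeled as below. This is an illustrative Python sketch; the flavor strings are symbolic, and honoring the server's listing order is one reasonable client policy (servers are advised to list preferable tuples first), not a protocol mandate.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SecurityTuple:
    """A security tuple: a flavor plus, for RPCSEC_GSS, mechanism/QOP/service."""
    flavor: str                # e.g., "AUTH_SYS" or "RPCSEC_GSS"
    gss_mech_oid: str = ""     # GSS-API mechanism OID (RPCSEC_GSS only)
    qop: int = 0               # GSS-API quality of protection
    service: str = ""          # "none", "integrity", or "privacy"

def choose_tuple(offered: list,
                 supported: list) -> Optional[SecurityTuple]:
    """Pick the first tuple, in the server's offered order, that the
    client also supports; None means negotiation failed and the client
    cannot access this part of the namespace with its current flavors."""
    supported_set = set(supported)
    for t in offered:
        if t in supported_set:
            return t
    return None
```

For example, a client supporting only AUTH_SYS and Kerberos V5 with integrity would pick krb5i from a server list that offers krb5p first and krb5i second.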
Where the use of different security tuples would affect the type of access that would be allowed if a request was sent over the same connection used for the SECINFO or SECINFO_NO_NAME operation (e.g., read-only vs. read-write access), security tuples that allow greater access should be presented first. Where the general level of access is the same and different security flavors limit the range of principals whose privileges are recognized (e.g., allowing or disallowing root access), flavors supporting the greatest range of principals should be listed first.

2.6.3. Security Error

Based on the assumption that each NFSv4.1 client and server MUST support a minimum set of security (i.e., Kerberos V5 under RPCSEC_GSS), the NFS client will initiate file access to the server with one of the minimal security tuples. During communication with the server, the client may receive an NFS error of NFS4ERR_WRONGSEC. This error allows the server to notify the client that the security tuple currently being used contravenes the server's security policy. The client is then responsible for determining (see Section 2.6.3.1) what security tuples are available at the server and choosing one that is appropriate for the client.

2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME

This section explains the mechanics of NFSv4.1 security negotiation.

2.6.3.1.1. Put Filehandle Operations

The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH. Each of the subsections herein describes how the server handles a subseries of operations that starts with a put filehandle operation.

2.6.3.1.1.1. Put Filehandle Operation + SAVEFH

The client is saving a filehandle for a future RESTOREFH, LINK, or RENAME. SAVEFH MUST NOT return NFS4ERR_WRONGSEC.
To determine whether or not the put filehandle operation returns NFS4ERR_WRONGSEC, the server implementation pretends SAVEFH is not in the series of operations and examines which of the situations described in the other subsections of Section 2.6.3.1.1 apply.

2.6.3.1.1.2. Two or More Put Filehandle Operations

For a series of N put filehandle operations, the server MUST NOT return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations. The Nth put filehandle operation is handled as if it is the first in a subseries of operations. For example, if the server received a COMPOUND request with this series of operations -- PUTFH, PUTROOTFH, LOOKUP -- then the PUTFH operation is ignored for NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is processed according to Section 2.6.3.1.1.3.

2.6.3.1.1.3. Put Filehandle Operation + LOOKUP (or OPEN of an Existing Name)

This situation also applies to a put filehandle operation followed by a LOOKUP or an OPEN operation that specifies an existing component name.

In this situation, the client is potentially crossing a security policy boundary, and the set of security tuples the parent directory supports may differ from those of the child. The server implementation may decide whether to impose any restrictions on security policy administration. There are at least three approaches (sec_policy_child is the tuple set of the child export, sec_policy_parent is that of the parent):

(a) sec_policy_child <= sec_policy_parent (<= for subset). This means that the set of security tuples specified on the security policy of a child directory is always a subset of that of its parent directory.

(b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, {} for the empty set).
This means that the set of security tuples specified on the security policy of a child directory always has a non-empty intersection with that of the parent.

(c) sec_policy_child ^ sec_policy_parent == {}. This means that the set of security tuples specified on the security policy of a child directory may not intersect with that of the parent. In other words, there are no restrictions on how the system administrator may set up these tuples.

In order for a server to support approaches (b) (for the case when a client chooses a flavor that is not a member of sec_policy_parent) and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC when there is a security tuple mismatch. Instead, it should be returned from the LOOKUP (or OPEN by existing component name) that follows.

Since the above guideline does not contradict approach (a), it should be followed in general. Even if approach (a) is implemented, it is possible for the security tuple used to be acceptable for the target of LOOKUP but not for the filehandles used in the put filehandle operation. The put filehandle operation could be a PUTROOTFH or PUTPUBFH, where the client cannot know the security tuples for the root or public filehandle. Or the security policy for the filehandle used by the put filehandle operation could have changed since the time the filehandle was obtained.

Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to the put filehandle operation if the operation is immediately followed by a LOOKUP or an OPEN by component name.

2.6.3.1.1.4. Put Filehandle Operation + LOOKUPP

Since SECINFO only works its way down, there is no way LOOKUPP can return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME solves this issue via style SECINFO_STYLE4_PARENT, which works in the opposite direction from SECINFO.
As with Section 2.6.3.1.1.3, a put filehandle operation that is followed by a LOOKUPP MUST NOT return NFS4ERR_WRONGSEC. If the server does not support SECINFO_NO_NAME, the client's only recourse is to send the put filehandle operation, LOOKUPP, GETFH sequence of operations with every security tuple it supports.

Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle operation if the operation is immediately followed by a LOOKUPP.

2.6.3.1.1.5. Put Filehandle Operation + SECINFO/SECINFO_NO_NAME

A security-sensitive client is allowed to choose a strong security tuple when querying a server to determine a file object's permitted security tuples. The security tuple chosen by the client does not have to be included in the tuple list of the security policy of either the parent directory indicated in the put filehandle operation or the child file object indicated in SECINFO (or any parent directory indicated in SECINFO_NO_NAME). Of course, the server has to be configured for whatever security tuple the client selects; otherwise, the request will fail at the RPC layer with an appropriate authentication error.

In theory, there is no connection between the security flavor used by SECINFO or SECINFO_NO_NAME and those supported by the security policy. But in practice, the client may start looking for strong flavors from those supported by the security policy, followed by those in the REQUIRED set.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put filehandle operation that is immediately followed by SECINFO or SECINFO_NO_NAME. The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC from SECINFO or SECINFO_NO_NAME.

2.6.3.1.1.6. Put Filehandle Operation + Nothing

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.1.7. Put Filehandle Operation + Anything Else

"Anything Else" includes OPEN by filehandle.

The security policy enforcement applies to the filehandle specified in the put filehandle operation. Therefore, the put filehandle operation MUST return NFS4ERR_WRONGSEC when there is a security tuple mismatch. This avoids the complexity of adding NFS4ERR_WRONGSEC as an allowable error to every other operation.

A COMPOUND containing the series put filehandle operation + SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way for the client to recover from NFS4ERR_WRONGSEC.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by component name).

2.6.3.1.1.8. Operations after SECINFO and SECINFO_NO_NAME

Suppose a client sends a COMPOUND procedure containing the series SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security tuple used does not match that required for the target file. By rule (see Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can return NFS4ERR_WRONGSEC. By rule (see Section 2.6.3.1.1.7), READ cannot return NFS4ERR_WRONGSEC. The issue is resolved by the fact that SECINFO and SECINFO_NO_NAME consume the current filehandle (note that this is a change from NFSv4.0). This leaves no current filehandle for READ to use, and READ returns NFS4ERR_NOFILEHANDLE.

2.6.3.1.2. LINK and RENAME

The LINK and RENAME operations use both the current and saved filehandles. Technically, the server MAY return NFS4ERR_WRONGSEC from LINK or RENAME if the security policy of the saved filehandle rejects the security flavor used in the COMPOUND request's credentials.
If the server does so, then if there is no intersection between the security policies of the saved and current filehandles, it will be impossible for the client to perform the intended LINK or RENAME operation.

For example, suppose the client sends this COMPOUND request: SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where filehandles bFH and aFH refer to different directories. Suppose no common security tuple exists between the security policies of aFH and bFH. If the client sends the request using credentials acceptable to bFH's security policy but not aFH's policy, then the PUTFH aFH operation will fail with NFS4ERR_WRONGSEC. After a SECINFO_NO_NAME request, the client sends SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", using credentials acceptable to aFH's security policy but not bFH's policy. The server returns NFS4ERR_WRONGSEC on the RENAME operation.

To prevent the client from entering an endless sequence of a request containing LINK or RENAME, followed by a request containing SECINFO_NO_NAME or SECINFO, the server MUST detect when the security policies of the current and saved filehandles have no mutually acceptable security tuple, and MUST NOT return NFS4ERR_WRONGSEC from LINK or RENAME in that situation. Instead, the server MUST do one of two things:

o  The server can return NFS4ERR_XDEV.

o  The server can allow the security policy of the current filehandle to override that of the saved filehandle, and so return NFS4_OK.

2.7. Minor Versioning

To address the requirement of an NFS protocol that can evolve as the need arises, the NFSv4.1 protocol contains the rules and framework to allow for future minor changes or versioning.

The base assumption with respect to minor versioning is that any future accepted minor version will be documented in one or more Standards Track RFCs.
Minor version 0 of the NFSv4 protocol is represented by [33], and minor version 1 is represented by this RFC. The COMPOUND and CB_COMPOUND procedures support the encoding of the minor version being requested by the client.

The following items represent the basic rules for the development of minor versions. Note that a future minor version may modify or add to the following rules as part of the minor version definition.

1.  Procedures are not added or deleted.

    To maintain the general RPC model, NFSv4 minor versions will not add to or delete procedures from the NFS program.

2.  Minor versions may add operations to the COMPOUND and CB_COMPOUND procedures.

    The addition of operations to the COMPOUND and CB_COMPOUND procedures does not affect the RPC model.

    *  Minor versions may append attributes to the bitmap4 that represents sets of attributes and to the fattr4 that represents sets of attribute values.

       This allows for the expansion of the attribute model to allow for future growth or adaptation.

    *  Minor version X must append any new attributes after the last documented attribute.

       Since attribute results are specified as an opaque array of per-attribute, XDR-encoded results, the complexity of adding new attributes in the midst of the current definitions would be too burdensome.

3.  Minor versions must not modify the structure of an existing operation's arguments or results.

    Again, the complexity of handling multiple structure definitions for a single operation is too burdensome. New operations should be added instead of modifying existing structures for a minor version.
    This rule does not preclude the following adaptations in a minor version:

    *  adding bits to flag fields, such as new attributes to GETATTR's bitmap4 data type, and providing corresponding variants of opaque arrays, such as a notify4 used together with such bitmaps

    *  adding bits to existing attributes like ACLs that have flag words

    *  extending enumerated types (including NFS4ERR_*) with new values

    *  adding cases to a switched union

4.  Minor versions must not modify the structure of existing attributes.

5.  Minor versions must not delete operations.

    This prevents the potential reuse of a particular operation "slot" in a future minor version.

6.  Minor versions must not delete attributes.

7.  Minor versions must not delete flag bits or enumeration values.

8.  Minor versions may declare an operation MUST NOT be implemented.

    Specifying that an operation MUST NOT be implemented is equivalent to obsoleting an operation. For the client, it means that the operation MUST NOT be sent to the server. For the server, an NFS error can be returned as opposed to "dropping" the request as an XDR decode error. This approach allows for the obsolescence of an operation while maintaining its structure so that a future minor version can reintroduce the operation.

    1.  Minor versions may declare that an attribute MUST NOT be implemented.

    2.  Minor versions may declare that a flag bit or enumeration value MUST NOT be implemented.

9.  Minor versions may downgrade features from REQUIRED to RECOMMENDED, or RECOMMENDED to OPTIONAL.

10. Minor versions may upgrade features from OPTIONAL to RECOMMENDED, or RECOMMENDED to REQUIRED.

11. A client and server that support minor version X SHOULD support minor versions zero through X-1 as well.

12. Except for infrastructural changes, a minor version must not introduce REQUIRED new features.

    This rule allows for the introduction of new functionality and forces the use of implementation experience before designating a feature as REQUIRED. On the other hand, some classes of features are infrastructural and have broad effects. Allowing infrastructural features to be RECOMMENDED or OPTIONAL complicates implementation of the minor version.

13. A client MUST NOT attempt to use a stateid, filehandle, or similar returned object from the COMPOUND procedure with minor version X for another COMPOUND procedure with minor version Y, where X != Y.

2.8. Non-RPC-Based Security Services

As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for identification, authentication, integrity, and privacy. NFSv4.1 itself provides or enables additional security services as described in the next several subsections.

2.8.1. Authorization

Authorization to access a file object via an NFSv4.1 operation is ultimately determined by the NFSv4.1 server. A client can predetermine its access to a file object via the OPEN (Section 18.16) and the ACCESS (Section 18.1) operations.

Principals with appropriate access rights can modify the authorization on a file object via the SETATTR (Section 18.30) operation. Attributes that affect access rights include mode, owner, owner_group, acl, dacl, and sacl. See Section 5.

2.8.2. Auditing

NFSv4.1 provides auditing on a per-file object basis, via the acl and sacl attributes as described in Section 6. It is outside the scope of this specification to specify audit log formats or management policies.

2.8.3. Intrusion Detection

NFSv4.1 provides alarm control on a per-file object basis, via the acl and sacl attributes as described in Section 6.
Alarms may serve as the basis for intrusion detection. It is outside the scope of this specification to specify heuristics for detecting intrusion via alarms.

2.9. Transport Layers

2.9.1. REQUIRED and RECOMMENDED Properties of Transports

NFSv4.1 works over Remote Direct Memory Access (RDMA) and non-RDMA-based transports with the following attributes:

o  The transport supports reliable delivery of data, which NFSv4.1 requires but neither NFSv4.1 nor RPC has facilities for ensuring [37].

o  The transport delivers data in the order it was sent. Ordered delivery simplifies detection of transmit errors, and simplifies the sending of arbitrary sized requests and responses via the record marking protocol [3].

Where an NFSv4.1 implementation supports operation over the IP network protocol, any transport used between NFS and IP MUST be among the IETF-approved congestion control transport protocols. At the time this document was written, the only two transports that had the above attributes were TCP and the Stream Control Transmission Protocol (SCTP). To enhance the possibilities for interoperability, an NFSv4.1 implementation MUST support operation over the TCP transport protocol.

Even if NFSv4.1 is used over a non-IP network protocol, it is RECOMMENDED that the transport support congestion control.

It is permissible for a connectionless transport to be used under NFSv4.1; however, reliable and in-order delivery of data combined with congestion control by the connectionless transport is REQUIRED. As a consequence, UDP by itself MUST NOT be used as an NFSv4.1 transport. NFSv4.1 assumes that a client transport address and server transport address used to send data over a transport together constitute a connection, even if the underlying transport eschews the concept of a connection.

2.9.2. Client and Server Transport Behavior

If a connection-oriented transport (e.g., TCP) is used, the client and server SHOULD use long-lived connections for at least three reasons:

1.  This will prevent the weakening of the transport's congestion control mechanisms via short-lived connections.

2.  This will improve performance for the WAN environment by eliminating the need for connection setup handshakes.

3.  The NFSv4.1 callback model differs from NFSv4.0's, and requires the client and server to maintain a client-created backchannel (see Section 2.10.3.1) for the server to use.

In order to reduce congestion, if a connection-oriented transport is used, and the request is not the NULL procedure:

o  A requester MUST NOT retry a request unless the connection the request was sent over was lost before the reply was received.

o  A replier MUST NOT silently drop a request, even if the request is a retry. (The silent drop behavior of RPCSEC_GSS [4] does not apply because this behavior happens at the RPCSEC_GSS layer, a lower layer in the request processing.) Instead, the replier SHOULD return an appropriate error (see Section 2.10.6.1), or it MAY disconnect the connection.

When sending a reply, the replier MUST send the reply to the same full network address (e.g., if using an IP-based transport, the source port of the requester is part of the full network address) from which the requester sent the request. If using a connection-oriented transport, replies MUST be sent on the same connection from which the request was received.

If a connection is dropped after the replier receives the request but before the replier sends the reply, the replier might have a pending reply.
If a connection is established with the same source and destination full network address as the dropped connection, then the replier MUST NOT send the reply until the requester retries the request. The reason for this prohibition is that the requester MAY retry a request over a different connection (provided that connection is associated with the original request's session).

When using RDMA transports, there are other reasons for not tolerating retries over the same connection:

o  RDMA transports use "credits" to enforce flow control, where a credit is a right to a peer to transmit a message. If one peer were to retransmit a request (or reply), it would consume an additional credit. If the replier retransmitted a reply, it would certainly result in an RDMA connection loss, since the requester would typically only post a single receive buffer for each request. If the requester retransmitted a request, the additional credit consumed on the server might lead to RDMA connection failure unless the client accounted for it and decreased its available credit, leading to wasted resources.

o  RDMA credits present a new issue to the reply cache in NFSv4.1. The reply cache may be used when a connection within a session is lost, such as after the client reconnects. Credit information is a dynamic property of the RDMA connection, and stale values must not be replayed from the cache. This implies that the reply cache contents must not be blindly used when replies are sent from it, and credit information appropriate to the channel must be refreshed by the RPC layer.

In addition, as described in Section 2.10.6.2, while a session is active, the NFSv4.1 requester MUST NOT stop waiting for a reply.

2.9.3. Ports

Historically, NFSv3 servers have listened over TCP port 2049.
The 1848 registered port 2049 [38] for the NFS protocol should be the default 1849 configuration. NFSv4.1 clients SHOULD NOT use the RPC binding 1850 protocols as described in [39]. 1852 2.10. Session 1854 NFSv4.1 clients and servers MUST support and MUST use the session 1855 feature as described in this section. 1857 2.10.1. Motivation and Overview 1859 Previous versions and minor versions of NFS have suffered from the 1860 following: 1862 o Lack of support for Exactly Once Semantics (EOS). This includes 1863 lack of support for EOS through server failure and recovery. 1865 o Limited callback support, including no support for sending 1866 callbacks through firewalls, and races between replies to normal 1867 requests and callbacks. 1869 o Limited trunking over multiple network paths. 1871 o Requiring machine credentials for fully secure operation. 1873 Through the introduction of a session, NFSv4.1 addresses the above 1874 shortfalls with practical solutions: 1876 o EOS is enabled by a reply cache with a bounded size, making it 1877 feasible to keep the cache in persistent storage and enable EOS 1878 through server failure and recovery. One reason that previous 1879 revisions of NFS did not support EOS was because some EOS 1880 approaches often limited parallelism. As will be explained in 1881 Section 2.10.6, NFSv4.1 supports both EOS and unlimited 1882 parallelism. 1884 o The NFSv4.1 client (defined in Section 1.7, Paragraph 2) creates 1885 transport connections and provides them to the server to use for 1886 sending callback requests, thus solving the firewall issue 1887 (Section 18.34). Races between responses from client requests and 1888 callbacks caused by the requests are detected via the session's 1889 sequencing properties that are a consequence of EOS 1890 (Section 2.10.6.3). 1892 o The NFSv4.1 client can associate an arbitrary number of 1893 connections with the session, and thus provide trunking 1894 (Section 2.10.5). 
1896 o The NFSv4.1 client and server produces a session key independent 1897 of client and server machine credentials which can be used to 1898 compute a digest for protecting critical session management 1899 operations (Section 2.10.8.3). 1901 o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for 1902 use by the session's backchannel that do not require the server to 1903 authenticate to a client machine principal (Section 2.10.8.2). 1905 A session is a dynamically created, long-lived server object created 1906 by a client and used over time from one or more transport 1907 connections. Its function is to maintain the server's state relative 1908 to the connection(s) belonging to a client instance. This state is 1909 entirely independent of the connection itself, and indeed the state 1910 exists whether or not the connection exists. A client may have one 1911 or more sessions associated with it so that client-associated state 1912 may be accessed using any of the sessions associated with that 1913 client's client ID, when connections are associated with those 1914 sessions. When no connections are associated with any of a client 1915 ID's sessions for an extended time, such objects as locks, opens, 1916 delegations, layouts, etc. are subject to expiration. The session 1917 serves as an object representing a means of access by a client to the 1918 associated client state on the server, independent of the physical 1919 means of access to that state. 1921 A single client may create multiple sessions. A single session MUST 1922 NOT serve multiple clients. 1924 2.10.2. NFSv4 Integration 1926 Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major 1927 infrastructure change such as sessions would require a new major 1928 version number to an Open Network Computing (ONC) RPC program like 1929 NFS. 
However, because NFSv4 encapsulates its functionality in a 1930 single procedure, COMPOUND, and because COMPOUND can support an 1931 arbitrary number of operations, sessions have been added to NFSv4.1 1932 with little difficulty. COMPOUND includes a minor version number 1933 field, and for NFSv4.1 this minor version is set to 1. When the 1934 NFSv4 server processes a COMPOUND with the minor version set to 1, it 1935 expects a different set of operations than it does for NFSv4.0. 1936 NFSv4.1 defines the SEQUENCE operation, which is required for every 1937 COMPOUND that operates over an established session, with the 1938 exception of some session administration operations, such as 1939 DESTROY_SESSION (Section 18.37). 1941 2.10.2.1. SEQUENCE and CB_SEQUENCE 1943 In NFSv4.1, when the SEQUENCE operation is present, it MUST be the 1944 first operation in the COMPOUND procedure. The primary purpose of 1945 SEQUENCE is to carry the session identifier. The session identifier 1946 associates all other operations in the COMPOUND procedure with a 1947 particular session. SEQUENCE also contains required information for 1948 maintaining EOS (see Section 2.10.6). Session-enabled NFSv4.1 1949 COMPOUND requests thus have the form: 1951 +-----+--------------+-----------+------------+-----------+---- 1952 | tag | minorversion | numops |SEQUENCE op | op + args | ... 1953 | | (== 1) | (limited) | + args | | 1954 +-----+--------------+-----------+------------+-----------+---- 1956 and the replies have the form: 1958 +------------+-----+--------+-------------------------------+--// 1959 |last status | tag | numres |status + SEQUENCE op + results | // 1960 +------------+-----+--------+-------------------------------+--// 1961 //-----------------------+---- 1962 // status + op + results | ... 1963 //-----------------------+---- 1965 A CB_COMPOUND procedure request and reply has a similar form to 1966 COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE 1967 operation. 
CB_COMPOUND also has an additional field called 1968 "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored 1969 by the client. CB_SEQUENCE has the same information as SEQUENCE, and 1970 also includes other information needed to resolve callback races 1971 (Section 2.10.6.3). 1973 2.10.2.2. Client ID and Session Association 1975 Each client ID (Section 2.4) can have zero or more active sessions. 1976 A client ID and associated session are required to perform file 1977 access in NFSv4.1. Each time a session is used (whether by a client 1978 sending a request to the server or the client replying to a callback 1979 request from the server), the state leased to its associated client 1980 ID is automatically renewed. 1982 State (which can consist of share reservations, locks, delegations, 1983 and layouts (Section 1.8.4)) is tied to the client ID. Client state 1984 is not tied to any individual session. Successive state changing 1985 operations from a given state owner MAY go over different sessions, 1986 provided the session is associated with the same client ID. A 1987 callback MAY arrive over a different session than that of the request 1988 that originally acquired the state pertaining to the callback. For 1989 example, if session A is used to acquire a delegation, a request to 1990 recall the delegation MAY arrive over session B if both sessions are 1991 associated with the same client ID. Sections 2.10.8.1 and 2.10.8.2 1992 discuss the security considerations around callbacks. 1994 2.10.3. Channels 1996 A channel is not a connection. A channel represents the direction 1997 ONC RPC requests are sent. 1999 Each session has one or two channels: the fore channel and the 2000 backchannel. Because there are at most two channels per session, and 2001 because each channel has a distinct purpose, channels are not 2002 assigned identifiers. 
2004 The fore channel is used for ordinary requests from the client to the 2005 server, and carries COMPOUND requests and responses. A session 2006 always has a fore channel. 2008 The backchannel is used for callback requests from server to client, 2009 and carries CB_COMPOUND requests and responses. Whether or not there 2010 is a backchannel is a decision made by the client; however, many 2011 features of NFSv4.1 require a backchannel. NFSv4.1 servers MUST 2012 support backchannels. 2014 Each session has resources for each channel, including separate reply 2015 caches (see Section 2.10.6.1). Note that even the backchannel 2016 requires a reply cache (or, at least, a slot table in order to detect 2017 retries) because some callback operations are nonidempotent. 2019 2.10.3.1. Association of Connections, Channels, and Sessions 2021 Each channel is associated with zero or more transport connections 2022 (whether of the same transport protocol or different transport 2023 protocols). A connection can be associated with one channel or both 2024 channels of a session; the client and server negotiate whether a 2025 connection will carry traffic for one channel or both channels via 2026 the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION 2027 (Section 18.34) operations. When a session is created via 2028 CREATE_SESSION, the connection that transported the CREATE_SESSION 2029 request is automatically associated with the fore channel, and 2030 optionally the backchannel. If the client specifies no state 2031 protection (Section 18.35) when the session is created, then when 2032 SEQUENCE is transmitted on a different connection, the connection is 2033 automatically associated with the fore channel of the session 2034 specified in the SEQUENCE operation. 2036 A connection's association with a session is not exclusive. 
A 2037 connection associated with the channel(s) of one session may be 2038 simultaneously associated with the channel(s) of other sessions 2039 including sessions associated with other client IDs. 2041 It is permissible for connections of multiple transport types to be 2042 associated with the same channel. For example, both TCP and RDMA 2043 connections can be associated with the fore channel. In the event an 2044 RDMA and non-RDMA connection are associated with the same channel, 2045 the maximum number of slots SHOULD be at least one more than the 2046 total number of RDMA credits (Section 2.10.6.1). This way, if all 2047 RDMA credits are used, the non-RDMA connection can have at least one 2048 outstanding request. If a server supports multiple transport types, 2049 it MUST allow a client to associate connections from each transport 2050 to a channel. 2052 It is permissible for a connection of one type of transport to be 2053 associated with the fore channel, and a connection of a different 2054 type to be associated with the backchannel. 2056 2.10.4. Server Scope 2058 Servers each specify a server scope value in the form of an opaque 2059 string eir_server_scope returned as part of the results of an 2060 EXCHANGE_ID operation. The purpose of the server scope is to allow a 2061 group of servers to indicate to clients that a set of servers sharing 2062 the same server scope value has arranged to use compatible values of 2063 otherwise opaque identifiers. Thus, the identifiers generated by two 2064 servers within that set can be assumed compatible so that, in some 2065 cases, identifiers generated by one server in that set may be 2066 presented to another server of the same scope. 2068 The use of such compatible values does not imply that a value 2069 generated by one server will always be accepted by another. In most 2070 cases, it will not. However, a server will not accept a value 2071 generated by another inadvertently. 
When it does accept it, it will 2072 be because it is recognized as valid and carrying the same meaning as 2073 on another server of the same scope. 2075 When servers are of the same server scope, this compatibility of 2076 values applies to the following identifiers: 2078 o Filehandle values. A filehandle value accepted by two servers of 2079 the same server scope denotes the same object. A WRITE operation 2080 sent to one server is reflected immediately in a READ sent to the 2081 other. 2083 o Server owner values. When the server scope values are the same, 2084 server owner value may be validly compared. In cases where the 2085 server scope values are different, server owner values are treated 2086 as different even if they contain identical strings of bytes. 2088 The coordination among servers required to provide such compatibility 2089 can be quite minimal, and limited to a simple partition of the ID 2090 space. The recognition of common values requires additional 2091 implementation, but this can be tailored to the specific situations 2092 in which that recognition is desired. 2094 Clients will have occasion to compare the server scope values of 2095 multiple servers under a number of circumstances, each of which will 2096 be discussed under the appropriate functional section: 2098 o When server owner values received in response to EXCHANGE_ID 2099 operations sent to multiple network addresses are compared for the 2100 purpose of determining the validity of various forms of trunking, 2101 as described in Section 11.5.2. . 2103 o When network or server reconfiguration causes the same network 2104 address to possibly be directed to different servers, with the 2105 necessity for the client to determine when lock reclaim should be 2106 attempted, as described in Section 8.4.2.1. 
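   The gating rule above (server owner values compare only within the
   same server scope) can be sketched as follows; the dictionary keys
   mirror the eir_server_scope and eir_server_owner fields returned by
   EXCHANGE_ID, and the helper function name is illustrative rather
   than part of the protocol:

```python
def same_server_owner(reply_a, reply_b):
    """Compare server owner values from two EXCHANGE_ID replies.

    Server owner values are opaque outside a server scope: when the
    scopes differ, the owners are treated as different even if the
    owner bytes are identical.
    """
    if reply_a["eir_server_scope"] != reply_b["eir_server_scope"]:
        return False  # different scopes: owners are never comparable
    return (reply_a["so_major_id"] == reply_b["so_major_id"]
            and reply_a["so_minor_id"] == reply_b["so_minor_id"])

# Identical owner bytes, but different scopes: treated as different.
a = {"eir_server_scope": b"scope-1", "so_major_id": b"X", "so_minor_id": 0}
b = {"eir_server_scope": b"scope-2", "so_major_id": b"X", "so_minor_id": 0}
assert same_server_owner(a, b) is False
```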
   When two replies from EXCHANGE_ID, each from two different server
   network addresses, have the same server scope, there are a number of
   ways a client can validate that the common server scope is due to
   two servers cooperating in a group.

   o  If both EXCHANGE_ID requests were sent with RPCSEC_GSS ([4], [9],
      [27]) authentication and the server principal is the same for
      both targets, the equality of server scope is validated.  It is
      RECOMMENDED that two servers intending to share the same server
      scope and server_owner major_id also share the same principal
      name.  In some cases, this simplifies the client's task of
      validating server scope.

   o  The client may accept the appearance of the second server in the
      fs_locations or fs_locations_info attribute for a relevant file
      system.  For example, if there is a migration event for a
      particular file system or there are locks to be reclaimed on a
      particular file system, the attributes for that particular file
      system may be used.  The client sends the GETATTR request to the
      first server for the fs_locations or fs_locations_info attribute
      with RPCSEC_GSS authentication.  It may need to do this in
      advance of the need to verify the common server scope.  If the
      client successfully authenticates the reply to GETATTR, and the
      GETATTR request and reply containing the fs_locations or
      fs_locations_info attribute refers to the second server, then the
      equality of server scope is supported.  A client may choose to
      limit the use of this form of support to information relevant to
      the specific file system involved (e.g., a file system being
      migrated).

2.10.5.  Trunking

   Trunking is the use of multiple connections between a client and
   server in order to increase the speed of data transfer.  NFSv4.1
   supports two types of trunking: session trunking and client ID
   trunking.
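   The EXCHANGE_ID comparison rules developed in the remainder of this
   section can be summarized in a short decision sketch.  The field
   names follow the spec; the function itself is illustrative, not
   normative, and assumes both EXCHANGE_IDs were sent with the same
   eia_clientowner:

```python
def trunking_allowed(r1, r2):
    """Classify what trunking two EXCHANGE_ID results permit.

    Returns "session", "client_id", or None.  r1 and r2 hold the
    eir_clientid, eir_server_owner, and eir_server_scope results.
    """
    if (r1["eir_clientid"] != r2["eir_clientid"]
            or r1["so_major_id"] != r2["so_major_id"]
            or r1["eir_server_scope"] != r2["eir_server_scope"]):
        return None  # not the same server: no trunking permitted
    if r1["so_minor_id"] == r2["so_minor_id"]:
        return "session"   # session trunking permitted
    return "client_id"     # minor IDs differ: client ID trunking only
```

   Note that session trunking requires all four values to match, while
   client ID trunking tolerates a so_minor_id mismatch; this is why a
   client that declines session trunking can still invoke
   CREATE_SESSION and end up with client ID trunking instead.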
2144 In the context of a single server network address, it can be assumed 2145 that all connections are accessing the same server and NFSv4.1 2146 servers MUST support both forms of trunking. When multiple 2147 connections use a set of network addresses accessing the same server, 2148 the server MUST support both forms of trunking. NFSv4.1 servers in a 2149 clustered configuration MAY allow network addresses for different 2150 servers to use client ID trunking. 2152 Clients may use either form of trunking as long as they do not, when 2153 trunking between different server network addresses, violate the 2154 servers' mandates as to the kinds of trunking to be allowed (see 2155 below). With regard to callback channels, the client MUST allow the 2156 server to choose among all callback channels valid for a given client 2157 ID and MUST support trunking when the connections supporting the 2158 backchannel allow session or client ID trunking to be used for 2159 callbacks. 2161 Session trunking is essentially the association of multiple 2162 connections, each with potentially different target and/or source 2163 network addresses, to the same session. When the target network 2164 addresses (server addresses) of the two connections are the same, the 2165 server MUST support such session trunking. When the target network 2166 addresses are different, the server MAY indicate such support using 2167 the data returned by the EXCHANGE_ID operation (see below). 2169 Client ID trunking is the association of multiple sessions to the 2170 same client ID. Servers MUST support client ID trunking for two 2171 target network addresses whenever they allow session trunking for 2172 those same two network addresses. In addition, a server MAY, by 2173 presenting the same major server owner ID (Section 2.5) and server 2174 scope (Section 2.10.4), allow an additional case of client ID 2175 trunking. 
When two servers return the same major server owner and 2176 server scope, it means that the two servers are cooperating on 2177 locking state management, which is a prerequisite for client ID 2178 trunking. 2180 Distinguishing when the client is allowed to use session and client 2181 ID trunking requires understanding how the results of the EXCHANGE_ID 2182 (Section 18.35) operation identify a server. Suppose a client sends 2183 EXCHANGE_IDs over two different connections, each with a possibly 2184 different target network address, but each EXCHANGE_ID operation has 2185 the same value in the eia_clientowner field. If the same NFSv4.1 2186 server is listening over each connection, then each EXCHANGE_ID 2187 result MUST return the same values of eir_clientid, 2188 eir_server_owner.so_major_id, and eir_server_scope. The client can 2189 then treat each connection as referring to the same server (subject 2190 to verification; see Section 2.10.5.1 below), and it can use each 2191 connection to trunk requests and replies. The client's choice is 2192 whether session trunking or client ID trunking applies. 2194 Session Trunking. If the eia_clientowner argument is the same in two 2195 different EXCHANGE_ID requests, and the eir_clientid, 2196 eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and 2197 eir_server_scope results match in both EXCHANGE_ID results, then 2198 the client is permitted to perform session trunking. If the 2199 client has no session mapping to the tuple of eir_clientid, 2200 eir_server_owner.so_major_id, eir_server_scope, and 2201 eir_server_owner.so_minor_id, then it creates the session via a 2202 CREATE_SESSION operation over one of the connections, which 2203 associates the connection to the session. If there is a session 2204 for the tuple, the client can send BIND_CONN_TO_SESSION to 2205 associate the connection to the session. 2207 Of course, if the client does not desire to use session trunking, 2208 it is not required to do so. 
It can invoke CREATE_SESSION on the 2209 connection. This will result in client ID trunking as described 2210 below. It can also decide to drop the connection if it does not 2211 choose to use trunking. 2213 Client ID Trunking. If the eia_clientowner argument is the same in 2214 two different EXCHANGE_ID requests, and the eir_clientid, 2215 eir_server_owner.so_major_id, and eir_server_scope results match 2216 in both EXCHANGE_ID results, then the client is permitted to 2217 perform client ID trunking (regardless of whether the 2218 eir_server_owner.so_minor_id results match). The client can 2219 associate each connection with different sessions, where each 2220 session is associated with the same server. 2222 The client completes the act of client ID trunking by invoking 2223 CREATE_SESSION on each connection, using the same client ID that 2224 was returned in eir_clientid. These invocations create two 2225 sessions and also associate each connection with its respective 2226 session. The client is free to decline to use client ID trunking 2227 by simply dropping the connection at this point. 2229 When doing client ID trunking, locking state is shared across 2230 sessions associated with that same client ID. This requires the 2231 server to coordinate state across sessions and the client to be 2232 able to associate the same locking state with multiple sessions. 2234 It is always possible that, as a result of various sorts of 2235 reconfiguration events, eir_server_scope and eir_server_owner values 2236 may be different on subsequent EXCHANGE_ID requests made to the same 2237 network address. 2239 In most cases such reconfiguration events will be disruptive and 2240 indicate that an IP address formerly connected to one server is now 2241 connected to an entirely different one. 2243 Some guidelines on client handling of such situations follow: 2245 o When eir_server_scope changes, the client has no assurance that 2246 any id's it obtained previously (e.g. 
file handles, state ids, 2247 client ids) can be validly used on the new server, and, even if 2248 the new server accepts them, there is no assurance that this is 2249 not due to accident. Thus, it is best to treat all such state as 2250 lost/stale although a client may assume that the probability of 2251 inadvertent acceptance is low and treat this situation as within 2252 the next case. 2254 o When eir_server_scope remains the same and 2255 eir_server_owner.so_major_id changes, the client can use the 2256 filehandles it has, consider its locking state lost, and attempt 2257 to reclaim or otherwise re-obtain its locks. It may find that its 2258 file handle IS now stale but if NFS4ERR_STALE is not received, it 2259 can proceed to reclaim or otherwise re-obtain its open locking 2260 state. 2262 o When eir_server_scope and eir_server_owner.so_major_id remain the 2263 same, the client has to use the now-current values of 2264 eir_server_owner.so_minor_id in deciding on appropriate forms of 2265 trunking. This may result in connections being dropped or new 2266 sessions being created. 2268 2.10.5.1. Verifying Claims of Matching Server Identity 2270 When the server responds using two different connections claim 2271 matching or partially matching eir_server_owner, eir_server_scope, 2272 and eir_clientid values, the client does not have to trust the 2273 servers' claims. The client may verify these claims before trunking 2274 traffic in the following ways: 2276 o For session trunking, clients SHOULD reliably verify if 2277 connections between different network paths are in fact associated 2278 with the same NFSv4.1 server and usable on the same session, and 2279 servers MUST allow clients to perform reliable verification. When 2280 a client ID is created, the client SHOULD specify that 2281 BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or 2282 SP4_MACH_CRED (Section 18.35) state protection options. 
For 2283 SP4_SSV, reliable verification depends on a shared secret (the 2284 SSV) that is established via the SET_SSV (see Section 18.47) 2285 operation. 2287 When a new connection is associated with the session (via the 2288 BIND_CONN_TO_SESSION operation, see Section 18.34), if the client 2289 specified SP4_SSV state protection for the BIND_CONN_TO_SESSION 2290 operation, the client MUST send the BIND_CONN_TO_SESSION with 2291 RPCSEC_GSS protection, using integrity or privacy, and an 2292 RPCSEC_GSS handle created with the GSS SSV mechanism (see 2293 Section 2.10.9). 2295 If the client mistakenly tries to associate a connection to a 2296 session of a wrong server, the server will either reject the 2297 attempt because it is not aware of the session identifier of the 2298 BIND_CONN_TO_SESSION arguments, or it will reject the attempt 2299 because the RPCSEC_GSS authentication fails. Even if the server 2300 mistakenly or maliciously accepts the connection association 2301 attempt, the RPCSEC_GSS verifier it computes in the response will 2302 not be verified by the client, so the client will know it cannot 2303 use the connection for trunking the specified session. 2305 If the client specified SP4_MACH_CRED state protection, the 2306 BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or 2307 privacy, using the same credential that was used when the client 2308 ID was created. Mutual authentication via RPCSEC_GSS assures the 2309 client that the connection is associated with the correct session 2310 of the correct server. 2312 o For client ID trunking, the client has at least two options for 2313 verifying that the same client ID obtained from two different 2314 EXCHANGE_ID operations came from the same server. The first 2315 option is to use RPCSEC_GSS authentication when sending each 2316 EXCHANGE_ID operation. Each time an EXCHANGE_ID is sent with 2317 RPCSEC_GSS authentication, the client notes the principal name of 2318 the GSS target. 
If the EXCHANGE_ID results indicate that client 2319 ID trunking is possible, and the GSS targets' principal names are 2320 the same, the servers are the same and client ID trunking is 2321 allowed. 2323 The second option for verification is to use SP4_SSV protection. 2324 When the client sends EXCHANGE_ID, it specifies SP4_SSV 2325 protection. The first EXCHANGE_ID the client sends always has to 2326 be confirmed by a CREATE_SESSION call. The client then sends 2327 SET_SSV. Later, the client sends EXCHANGE_ID to a second 2328 destination network address different from the one the first 2329 EXCHANGE_ID was sent to. The client checks that each EXCHANGE_ID 2330 reply has the same eir_clientid, eir_server_owner.so_major_id, and 2331 eir_server_scope. If so, the client verifies the claim by sending 2332 a CREATE_SESSION operation to the second destination address, 2333 protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle 2334 returned by the second EXCHANGE_ID. If the server accepts the 2335 CREATE_SESSION request, and if the client verifies the RPCSEC_GSS 2336 verifier and integrity codes, then the client has proof the second 2337 server knows the SSV, and thus the two servers are cooperating for 2338 the purposes of specifying server scope and client ID trunking. 2340 2.10.6. Exactly Once Semantics 2342 Via the session, NFSv4.1 offers exactly once semantics (EOS) for 2343 requests sent over a channel. EOS is supported on both the fore 2344 channel and backchannel. 2346 Each COMPOUND or CB_COMPOUND request that is sent with a leading 2347 SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver 2348 exactly once. This requirement holds regardless of whether the 2349 request is sent with reply caching specified (see 2350 Section 2.10.6.1.3). The requirement holds even if the requester is 2351 sending the request over a session created between a pNFS data client 2352 and pNFS data server. 
To understand the rationale for this 2353 requirement, divide the requests into three classifications: 2355 o Non-idempotent requests. 2357 o Idempotent modifying requests. 2359 o Idempotent non-modifying requests. 2361 An example of a non-idempotent request is RENAME. Obviously, if a 2362 replier executes the same RENAME request twice, and the first 2363 execution succeeds, the re-execution will fail. If the replier 2364 returns the result from the re-execution, this result is incorrect. 2365 Therefore, EOS is required for non-idempotent requests. 2367 An example of an idempotent modifying request is a COMPOUND request 2368 containing a WRITE operation. Repeated execution of the same WRITE 2369 has the same effect as execution of that WRITE a single time. 2370 Nevertheless, enforcing EOS for WRITEs and other idempotent modifying 2371 requests is necessary to avoid data corruption. 2373 Suppose a client sends WRITE A to a noncompliant server that does not 2374 enforce EOS, and receives no response, perhaps due to a network 2375 partition. The client reconnects to the server and re-sends WRITE A. 2376 Now, the server has outstanding two instances of A. The server can 2377 be in a situation in which it executes and replies to the retry of A, 2378 while the first A is still waiting in the server's internal I/O 2379 system for some resource. Upon receiving the reply to the second 2380 attempt of WRITE A, the client believes its WRITE is done so it is 2381 free to send WRITE B, which overlaps the byte-range of A. When the 2382 original A is dispatched from the server's I/O system and executed 2383 (thus the second time A will have been written), then what has been 2384 written by B can be overwritten and thus corrupted. 2386 An example of an idempotent non-modifying request is a COMPOUND 2387 containing SEQUENCE, PUTFH, READLINK, and nothing else. The re- 2388 execution of such a request will not cause data corruption or produce 2389 an incorrect result. 
Nonetheless, to keep the implementation simple, 2390 the replier MUST enforce EOS for all requests, whether or not 2391 idempotent and non-modifying. 2393 Note that true and complete EOS is not possible unless the server 2394 persists the reply cache in stable storage, and unless the server is 2395 somehow implemented to never require a restart (indeed, if such a 2396 server exists, the distinction between a reply cache kept in stable 2397 storage versus one that is not is one without meaning). See 2398 Section 2.10.6.5 for a discussion of persistence in the reply cache. 2399 Regardless, even if the server does not persist the reply cache, EOS 2400 improves robustness and correctness over previous versions of NFS 2401 because the legacy duplicate request/reply caches were based on the 2402 ONC RPC transaction identifier (XID). Section 2.10.6.1 explains the 2403 shortcomings of the XID as a basis for a reply cache and describes 2404 how NFSv4.1 sessions improve upon the XID. 2406 2.10.6.1. Slot Identifiers and Reply Cache 2408 The RPC layer provides a transaction ID (XID), which, while required 2409 to be unique, is not convenient for tracking requests for two 2410 reasons. First, the XID is only meaningful to the requester; it 2411 cannot be interpreted by the replier except to test for equality with 2412 previously sent requests. When consulting an RPC-based duplicate 2413 request cache, the opaqueness of the XID requires a computationally 2414 expensive look up (often via a hash that includes XID and source 2415 address). NFSv4.1 requests use a non-opaque slot ID, which is an 2416 index into a slot table, which is far more efficient. Second, 2417 because RPC requests can be executed by the replier in any order, 2418 there is no bound on the number of requests that may be outstanding 2419 at any time. To achieve perfect EOS, using ONC RPC would require 2420 storing all replies in the reply cache. 
XIDs are 32 bits; storing 2421 over four billion (2^32) replies in the reply cache is not practical. 2422 In practice, previous versions of NFS have chosen to store a fixed 2423 number of replies in the cache, and to use a least recently used 2424 (LRU) approach to replacing cache entries with new entries when the 2425 cache is full. In NFSv4.1, the number of outstanding requests is 2426 bounded by the size of the slot table, and a sequence ID per slot is 2427 used to tell the replier when it is safe to delete a cached reply. 2429 In the NFSv4.1 reply cache, when the requester sends a new request, 2430 it selects a slot ID in the range 0..N, where N is the replier's 2431 current maximum slot ID granted to the requester on the session over 2432 which the request is to be sent. The value of N starts out as equal 2433 to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the 2434 response to SEQUENCE or CB_SEQUENCE as described later in this 2435 section. The slot ID must be unused by any of the requests that the 2436 requester has already active on the session. "Unused" here means the 2437 requester has no outstanding request for that slot ID. 2439 A slot contains a sequence ID and the cached reply corresponding to 2440 the request sent with that sequence ID. The sequence ID is a 32-bit 2441 unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - 2442 1). The first time a slot is used, the requester MUST specify a 2443 sequence ID of one (Section 18.36). Each time a slot is reused, the 2444 request MUST specify a sequence ID that is one greater than that of 2445 the previous request on the slot. If the previous sequence ID was 2446 0xFFFFFFFF, then the next request for the slot MUST have the sequence 2447 ID set to zero (i.e., (2^32 - 1) + 1 mod 2^32). 2449 The sequence ID accompanies the slot ID in each request. 
It is used for the critical check at the replier: it allows the replier to efficiently determine whether a request using a certain slot ID is a retransmit or a new, never-before-seen request. It is not feasible for the requester to assert that it is retransmitting to implement this, because for any given request the requester cannot know whether the replier has seen it unless the replier actually replies. Of course, if the requester has seen the reply, the requester would not retransmit.

The replier compares each received request's sequence ID with the last one previously received for that slot ID, to see if the new request is:

o  A new request, in which the sequence ID is one greater than that previously seen in the slot (accounting for sequence wraparound). The replier proceeds to execute the new request, and the replier MUST increase the slot's sequence ID by one.

o  A retransmitted request, in which the sequence ID is equal to that currently recorded in the slot. If the original request has executed to completion, the replier returns the cached reply. See Section 2.10.6.2 for direction on how the replier deals with retries of requests that are still in progress.

o  A misordered retry, in which the sequence ID is less than (accounting for sequence wraparound) that previously seen in the slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

o  A misordered new request, in which the sequence ID is two or more greater than (accounting for sequence wraparound) that previously seen in the slot. Note that because the sequence ID MUST wrap around to zero once it reaches 0xFFFFFFFF, a misordered new request and a misordered retry cannot be distinguished. Thus, the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).
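The four cases above amount to a single comparison at the replier, taking 32-bit wraparound into account. The following is a minimal, non-normative sketch; the function name and return values are invented for illustration and do not come from the NFSv4.1 XDR:

```python
SEQ_MOD = 2**32  # sequence IDs are 32-bit unsigned values

def classify_request(slot_seqid, incoming_seqid):
    """Illustrative replier-side check of a request's sequence ID.

    slot_seqid: sequence ID currently recorded in the slot.
    incoming_seqid: sequence ID carried by the received request.
    """
    if incoming_seqid == (slot_seqid + 1) % SEQ_MOD:
        # New request: execute it, then bump the slot's sequence ID.
        return "new"
    if incoming_seqid == slot_seqid:
        # Retransmit: return the cached reply (or handle an
        # in-progress original per Section 2.10.6.2).
        return "retry"
    # A lower sequence ID (misordered retry) and one that is two or
    # more ahead (misordered new request) cannot be told apart once
    # wraparound is possible, so both draw the same error.
    return "NFS4ERR_SEQ_MISORDERED"
```

Note that the wraparound case falls out of the modular arithmetic: a slot whose recorded sequence ID is 0xFFFFFFFF treats an incoming sequence ID of zero as the next new request.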
Unlike the XID, the slot ID is always within a specific range; this has two implications. The first implication is that for a given session, the replier need only cache the results of a limited number of COMPOUND requests. The second implication derives from the first: unlike XID-indexed reply caches (also known as duplicate request caches, or DRCs), the slot ID-based reply cache cannot be overflowed. Through use of the sequence ID to identify retransmitted requests, the replier does not need to actually cache the request itself, further reducing the storage requirements of the reply cache. These facilities make it practical to maintain all the required entries for an effective reply cache.

The slot ID, sequence ID, and session ID therefore take over the traditional role of the XID and source network address in the replier's reply cache implementation. This approach is considerably more portable and completely robust -- it is not subject to the reassignment of ports as clients reconnect over IP networks. In addition, the RPC XID is not used in the reply cache, enhancing robustness of the cache in the face of any rapid reuse of XIDs by the requester. While the replier does not care about the XID for the purposes of reply cache management (though the replier MUST return the same XID that was in the request), there are nonetheless considerations for the XID in NFSv4.1 that are the same as in all previous versions of NFS. The RPC XID remains in each message and needs to be formulated in NFSv4.1 requests as in any other ONC RPC request. The reasons include:

o  The RPC layer retains its existing semantics and implementation.

o  The requester and replier must be able to interoperate at the RPC layer, prior to the NFSv4.1 decoding of the SEQUENCE or CB_SEQUENCE operation.
o  If an operation is being used that does not start with SEQUENCE or CB_SEQUENCE (e.g., BIND_CONN_TO_SESSION), then the RPC XID is needed for correct operation to match the reply to the request.

o  The SEQUENCE or CB_SEQUENCE operation may generate an error. If so, the embedded slot ID, sequence ID, and session ID (if present) in the request will not be in the reply, and the requester has only the XID to match the reply to the request.

Given that well-formulated XIDs continue to be required, this raises the question: why do SEQUENCE and CB_SEQUENCE replies have a session ID, slot ID, and sequence ID? Having the session ID in the reply means that the requester does not have to use the XID to look up the session ID, which would be necessary if the connection were associated with multiple sessions. Having the slot ID and sequence ID in the reply means that the requester does not have to use the XID to look up the slot ID and sequence ID. Furthermore, since the XID is only 32 bits, it is too small to guarantee the re-association of a reply with its request [40]; having the session ID, slot ID, and sequence ID in the reply allows the client to validate that the reply in fact belongs to the matched request.

The SEQUENCE (and CB_SEQUENCE) operation also carries a "highest_slotid" value, which carries additional requester slot usage information. The requester MUST always indicate the slot ID representing the outstanding request with the highest-numbered slot value. The requester should in all cases provide the most conservative value possible, although it can be increased somewhat above the actual instantaneous usage to maintain some minimum or optimal level. This provides a way for the requester to yield unused request slots back to the replier, which in turn can use the information to reallocate resources.
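The requester-side rules described so far (choose an unused slot in 0..N, start each slot's sequence ID at one, increment with 32-bit wraparound, and report the highest-numbered slot in use as highest_slotid) can be sketched as follows. This is a minimal illustration under assumed names; the class and its methods are invented here and are not part of the NFSv4.1 XDR:

```python
SEQ_MOD = 2**32  # sequence IDs are 32-bit unsigned values

class RequesterSlotTable:
    """Illustrative (non-normative) requester-side slot bookkeeping."""

    def __init__(self, ca_maxrequests):
        self.max_slotid = ca_maxrequests - 1  # N, as granted by the replier
        self.last_seqid = {}                  # slot ID -> last sequence ID sent
        self.in_use = set()                   # slots with outstanding requests

    def acquire(self):
        """Pick the lowest unused slot and compute its next sequence ID."""
        for slot in range(self.max_slotid + 1):
            if slot not in self.in_use:
                # First use of a slot carries sequence ID 1; later uses
                # carry one greater than before, wrapping 0xFFFFFFFF to 0.
                seq = (self.last_seqid.get(slot, 0) + 1) % SEQ_MOD
                self.last_seqid[slot] = seq
                self.in_use.add(slot)
                return slot, seq, self.highest_slotid()
        raise RuntimeError("all slots busy; wait for a reply")

    def release(self, slot):
        """Called once the reply for the slot's request has been seen."""
        self.in_use.discard(slot)

    def highest_slotid(self):
        """Highest-numbered slot with an outstanding request."""
        return max(self.in_use)
```

Preferring the lowest unused slot (as recommended later in this section) keeps the reported highest_slotid conservative, which in turn helps the replier retire unneeded slots.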
The replier responds with both a new target highest_slotid and an enforced highest_slotid, described as follows:

o  The target highest_slotid is an indication to the requester of the highest_slotid the replier wishes the requester to be using. This permits the replier to withdraw (or add) resources from a requester that has been found to not be using them, in order to more fairly share resources among a varying level of demand from other requesters. The requester must always comply with the replier's value updates, since they indicate newly established hard limits on the requester's access to session resources. However, because of request pipelining, the requester may have active requests in flight reflecting prior values; therefore, the replier must not immediately require the requester to comply.

o  The enforced highest_slotid indicates the highest slot ID the requester is permitted to use on a subsequent SEQUENCE or CB_SEQUENCE operation. The replier's enforced highest_slotid SHOULD be no less than the highest_slotid the requester indicated in the SEQUENCE or CB_SEQUENCE arguments.

   A requester can be intransigent with respect to lowering its highest_slotid argument to a Sequence operation, i.e., the requester continues to ignore the target highest_slotid in the response to a Sequence operation, and continues to set its highest_slotid argument to be higher than the target highest_slotid. This can be considered particularly egregious behavior when the replier knows there are no outstanding requests with slot IDs higher than its target highest_slotid. When faced with such intransigence, the replier is free to take more forceful action, and MAY reply with a new enforced highest_slotid that is less than its previous enforced highest_slotid.
Thereafter, if the requester continues to send requests with a highest_slotid that is greater than the replier's new enforced highest_slotid, the server MAY return NFS4ERR_BAD_HIGH_SLOT, unless the slot ID in the request is greater than the new enforced highest_slotid and the request is a retry.

   The replier SHOULD retain the slots it wants to retire until the requester sends a request with a highest_slotid less than or equal to the replier's new enforced highest_slotid.

   The requester can also be intransigent with respect to sending non-retry requests that have a slot ID that exceeds the replier's highest_slotid. Once the replier has forcibly lowered the enforced highest_slotid, the requester is only allowed to send retries on slots that exceed the replier's highest_slotid. If a request is received with a slot ID that is higher than the new enforced highest_slotid, and the sequence ID is one higher than what is in the slot's reply cache, then the server can both retire the slot and return NFS4ERR_BADSLOT (however, the server MUST NOT do one and not the other). The reason it is safe to retire the slot is that, by using the next sequence ID, the requester is indicating it has received the previous reply for the slot.

o  The requester SHOULD use the lowest available slot when sending a new request. This way, the replier may be able to retire slot entries faster. However, where the replier is actively adjusting its granted highest_slotid, it will not be able to rely solely on the receipt of the slot ID and highest_slotid in the request. Neither the slot ID nor the highest_slotid used in a request may reflect the replier's current idea of the requester's session limit, because the request may have been sent from the requester before the update was received.
Therefore, in the downward adjustment case, the replier may have to retain a number of reply cache entries at least as large as the old value of maximum requests outstanding, until it can infer that the requester has seen a reply containing the new granted highest_slotid. The replier can infer that the requester has seen such a reply when it receives a new request with the same slot ID as the request replied to and the next higher sequence ID.

2.10.6.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies

When a SEQUENCE or CB_SEQUENCE operation is successfully executed, its reply MUST always be cached. Specifically, the session ID, sequence ID, and slot ID MUST be cached in the reply cache. The reply from SEQUENCE also includes the highest slot ID, target highest slot ID, and status flags. Instead of caching these values, the server MAY re-compute the values from the current state of the fore channel, session, and/or client ID as appropriate. Similarly, the reply from CB_SEQUENCE includes a highest slot ID and target highest slot ID. The client MAY re-compute the values from the current state of the session as appropriate.

Regardless of whether or not a replier is re-computing highest slot ID, target slot ID, and status on replies to retries, the requester MUST NOT assume that the values are being re-computed whenever it receives a reply after a retry is sent, since it has no way of knowing whether the reply it has received was sent by the replier in response to the retry or is a delayed response to the original request. Therefore, it may be the case that the highest slot ID, target slot ID, or status bits reflect the state of affairs when the request was first executed. Although acting based on such delayed information is valid, it may cause the receiver of the reply to do unneeded work.
Requesters MAY choose to send additional requests to get the current state of affairs or use the state of affairs reported by subsequent requests, in preference to acting immediately on data that might be out of date.

2.10.6.1.2. Errors from SEQUENCE and CB_SEQUENCE

Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of the slot MUST NOT change. The replier MUST NOT modify the reply cache entry for the slot whenever an error is returned from SEQUENCE or CB_SEQUENCE.

2.10.6.1.3. Optional Reply Caching

On a per-request basis, the requester can choose to direct the replier to cache the reply to all operations after the first operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it would not direct the replier to cache the entire reply is that the request is composed of all idempotent operations [37]. Caching the reply may offer little benefit. If the reply is too large (see Section 2.10.6.4), it may not be cacheable anyway. Even if the reply to an idempotent request is small enough to cache, unnecessarily caching the reply slows down the server and increases RPC latency.

Whether or not the requester requests the reply to be cached has no effect on the slot processing. If the result of SEQUENCE or CB_SEQUENCE is NFS4_OK, then the slot's sequence ID MUST be incremented by one. If a requester does not direct the replier to cache the reply, the replier MUST do one of the following:

o  The replier can cache the entire original reply. Even though sa_cachethis or csa_cachethis is FALSE, the replier is always free to cache. It may choose this approach in order to simplify implementation.
o  The replier enters into its reply cache a reply consisting of the original results to the SEQUENCE or CB_SEQUENCE operation, with the next operation in COMPOUND or CB_COMPOUND having the error NFS4ERR_RETRY_UNCACHED_REP. Thus, if the requester later retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP. If a replier receives a retried Sequence operation where the reply to the COMPOUND or CB_COMPOUND was not cached, then the replier,

   *  MAY return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is not the first operation (granted, a requester that does so is in violation of the NFSv4.1 protocol).

   *  MUST NOT return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is the first operation.

o  If the second operation is an illegal operation, or an operation that was legal in a previous minor version of NFSv4 and MUST NOT be supported in the current minor version (e.g., SETCLIENTID), the replier MUST never return NFS4ERR_RETRY_UNCACHED_REP. Instead the replier MUST return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or NFS4ERR_NOTSUPP as appropriate.

o  If the second operation can result in another error status, the replier MAY return a status other than NFS4ERR_RETRY_UNCACHED_REP, provided the operation is not executed in such a way that the state of the replier is changed. Examples of such an error status include: NFS4ERR_NOTSUPP returned for an operation that is legal but not REQUIRED in the current minor version, and thus not supported by the replier; NFS4ERR_SEQUENCE_POS; and NFS4ERR_REQ_TOO_BIG.

The discussion above assumes that the retried request matches the original one. Section 2.10.6.1.3.1 discusses what the replier might do, and MUST do, when the original and retried requests do not match.
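The second option above (cache only the SEQUENCE or CB_SEQUENCE result, with the next operation marked NFS4ERR_RETRY_UNCACHED_REP) can be sketched as follows. The function name and the list-of-results layout are invented for this illustration and do not reflect the spec's XDR:

```python
def build_reply_cache_entry(sequence_result, other_results, cachethis):
    """Illustrative construction of a reply cache entry.

    cachethis: the sa_cachethis/csa_cachethis flag from the request.
    sequence_result: result of the leading SEQUENCE/CB_SEQUENCE op.
    other_results: results of the remaining operations in the COMPOUND.
    """
    if cachethis:
        # The requester asked for the full reply to be cached.
        return [sequence_result] + other_results
    # Otherwise, cache the SEQUENCE/CB_SEQUENCE result followed by a
    # marker, so a later retry of the request receives
    # NFS4ERR_RETRY_UNCACHED_REP for the uncached remainder.
    return [sequence_result, "NFS4ERR_RETRY_UNCACHED_REP"]
```

A replier is also always free to cache the full reply even when cachethis is FALSE, as the first option above notes; this sketch shows only the minimal-storage alternative.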
Since the replier may only cache a small amount of the information that would be required to determine whether this is a case of a false retry, the replier may send to the client any of the following responses:

o  The cached reply to the original request (if the replier has cached it in its entirety and the users of the original request and retry match).

o  A reply that consists only of the Sequence operation with the error NFS4ERR_FALSE_RETRY.

o  A reply consisting of the response to Sequence with the status NFS4_OK, together with the second operation as it appeared in the retried request with an error of NFS4ERR_RETRY_UNCACHED_REP or other error as described above.

o  A reply that consists of the response to Sequence with the status NFS4_OK, together with the second operation as it appeared in the original request with an error of NFS4ERR_RETRY_UNCACHED_REP or other error as described above.

2.10.6.1.3.1. False Retry

If a requester sent a Sequence operation with a slot ID and sequence ID that are in the reply cache, but the replier detected that the retried request is not the same as the original request (including a retry that has different operations or different arguments in the operations from the original, or a retry that uses a different principal in the RPC request's credential field that translates to a different user), then this is a false retry. When the replier detects a false retry, it is permitted (but not always obligated) to return NFS4ERR_FALSE_RETRY in response to the Sequence operation.

Translations of particularly privileged user values to other users due to the lack of appropriately secure credentials, as configured on the replier, should be applied before determining whether the users are the same or different.
If the replier determines the users are different between the original request and a retry, then the replier MUST return NFS4ERR_FALSE_RETRY.

If an operation of the retry is an illegal operation, or an operation that was legal in a previous minor version of NFSv4 and MUST NOT be supported in the current minor version (e.g., SETCLIENTID), the replier MAY return NFS4ERR_FALSE_RETRY (and MUST do so if the users of the original request and retry differ). Otherwise, the replier MAY return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or NFS4ERR_NOTSUPP as appropriate. Note that this handling contrasts with how the replier deals with retried requests that have no cached reply. The difference is due to NFS4ERR_FALSE_RETRY being a valid error only for Sequence operations, whereas NFS4ERR_RETRY_UNCACHED_REP is a valid error for all operations except illegal operations and operations that MUST NOT be supported in the current minor version of NFSv4.

2.10.6.2. Retry and Replay of Reply

A requester MUST NOT retry a request, unless the connection it used to send the request disconnects. The requester can then reconnect and re-send the request, or it can re-send the request over a different connection that is associated with the same session.

If the requester is a server wanting to re-send a callback operation over the backchannel of a session, the requester of course cannot reconnect because only the client can associate connections with the backchannel. The server can re-send the request over another connection that is bound to the same session's backchannel. If there is no such connection, the server MUST indicate that the session has no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag bit in the response to the next SEQUENCE operation from the client.
The client MUST then associate a connection with the session (or destroy the session).

Note that it is not fatal for a requester to retry without a disconnect between the request and retry. However, the retry does consume resources, especially with RDMA, where each request, retry or not, consumes a credit. Retries for no reason, especially retries sent shortly after the previous attempt, are a poor use of network bandwidth and defeat the purpose of a transport's inherent congestion control system.

A requester MUST wait for a reply to a request before using the slot for another request. If it does not wait for a reply, then the requester does not know what sequence ID to use for the slot on its next request. For example, suppose a requester sends a request with sequence ID 1, and does not wait for the response. The next time it uses the slot, it sends the new request with sequence ID 2. If the replier has not seen the request with sequence ID 1, then the replier is not expecting sequence ID 2, and rejects the requester's new request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

RDMA fabrics do not guarantee that the memory handles (Steering Tags) within each RPC/RDMA "chunk" [31] are valid on a scope outside that of a single connection. Therefore, handles used by the direct operations become invalid after connection loss. The server must ensure that any RDMA operations that must be replayed from the reply cache use the newly provided handle(s) from the most recent request.

A retry might be sent while the original request is still in progress on the replier. The replier SHOULD deal with the issue by returning NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE operation, but implementations MAY return NFS4ERR_SEQ_MISORDERED.
Since errors from SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this approach allows the results of the execution of the original request to be properly recorded in the reply cache (assuming that the requester specified the reply to be cached).

2.10.6.3. Resolving Server Callback Races

It is possible for server callbacks to arrive at the client before the reply from related fore channel operations. For example, a client may have been granted a delegation to a file it has opened, but the reply to the OPEN (informing the client of the granting of the delegation) may be delayed in the network. If a conflicting operation arrives at the server, it will recall the delegation using the backchannel, which may be on a different transport connection, perhaps even a different network, or even a different session associated with the same client ID.

The presence of a session between the client and server alleviates this issue. When a session is in place, each client request is uniquely identified by its { session ID, slot ID, sequence ID } triple. By the rules under which slot entries (reply cache entries) are retired, the server knows whether the client has "seen" each of the server's replies. The server can therefore provide sufficient information to the client to allow it to disambiguate a callback race condition from an erroneous or conflicting callback.

For each client operation that might result in some sort of server callback, the server SHOULD "remember" the { session ID, slot ID, sequence ID } triple of the client request until the slot ID retirement rules allow the server to determine that the client has, in fact, seen the server's reply.
Until the time the { session ID, slot ID, sequence ID } request triple can be retired, any recalls of the associated object MUST carry an array of these referring identifiers (in the CB_SEQUENCE operation's arguments), for the benefit of the client. After this time, it is not necessary for the server to provide this information in related callbacks, since it is certain that a race condition can no longer occur.

The CB_SEQUENCE operation that begins each server callback carries a list of "referring" { session ID, slot ID, sequence ID } triples. If the client finds the request corresponding to the referring session ID, slot ID, and sequence ID to be currently outstanding (i.e., the server's reply has not been seen by the client), it can determine that the callback has raced the reply, and act accordingly. If the client does not find the request corresponding to the referring triple to be outstanding (including the case of a session ID referring to a destroyed session), then there is no race with respect to this triple. The server SHOULD limit the referring triples to just those requests that refer to the objects referred to in the CB_COMPOUND procedure.

The client must not simply wait forever for the expected server reply to arrive before responding to the CB_COMPOUND that won the race, because it is possible that it will be delayed indefinitely. The client should assume the likely case that the reply will arrive within the average round-trip time for COMPOUND requests to the server, and wait that period of time. If that period of time expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY. There are other scenarios under which callbacks may race replies. Among them are pNFS layout recalls, as described in Section 12.5.5.2.

2.10.6.4.
COMPOUND and CB_COMPOUND Construction Issues

Very large requests and replies may pose both buffer management issues (especially with RDMA) and reply cache issues. When the session is created (Section 18.36), for each channel (fore and back), the client and server negotiate the maximum-sized request they will send or process (ca_maxrequestsize), the maximum-sized reply they will return or process (ca_maxresponsesize), and the maximum-sized reply they will store in the reply cache (ca_maxresponsesize_cached).

If a request exceeds ca_maxrequestsize, the reply will have the status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request (which means that no operations in the request executed and that the state of the slot in the reply cache is unchanged), or it MAY opt to return it on a subsequent operation in the same COMPOUND or CB_COMPOUND request (which means that at least one operation did execute and that the state of the slot in the reply cache does change). The replier SHOULD set NFS4ERR_REQ_TOO_BIG on the operation that exceeds ca_maxrequestsize.

If a reply exceeds ca_maxresponsesize, the reply will have the status NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, or it MAY opt to return it on a subsequent operation (in the same COMPOUND or CB_COMPOUND reply). A replier MAY return NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if the response would still exceed ca_maxresponsesize.

If sa_cachethis or csa_cachethis is TRUE, then the replier MUST cache a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE operation (see Section 2.10.6.1.2).
If the reply exceeds ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis is TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error, for that matter) is returned on an operation other than the first operation (SEQUENCE or CB_SEQUENCE), the reply MUST be cached if sa_cachethis or csa_cachethis is TRUE. For example, if a COMPOUND has eleven operations, including SEQUENCE, the fifth operation is a RENAME, and the tenth operation is a READ for one million bytes, the server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since the server executed several operations, especially the non-idempotent RENAME, the client's request to cache the reply needs to be honored in order for exactly once semantics to operate correctly. If the client retries the request, the server will have cached a reply that contains results for ten of the eleven requested operations, with the tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.

A client needs to take care that, when sending operations that change the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH, and RESTOREFH), it does not exceed the maximum reply buffer before the GETFH operation. Otherwise, the client will have to retry the operation that changed the current filehandle, in order to obtain the desired filehandle. For the OPEN operation (see Section 18.16), retry is not always available as an option. The following guidelines for the handling of filehandle-changing operations are advised:

o  Within the same COMPOUND procedure, a client SHOULD send GETFH immediately after a current filehandle-changing operation. A client MUST send GETFH after a current filehandle-changing operation that is also non-idempotent (e.g., the OPEN operation), unless the operation is RESTOREFH.
RESTOREFH is an exception, because even though it is non-idempotent, the filehandle RESTOREFH produced originated from an operation that is either idempotent (e.g., PUTFH, LOOKUP) or non-idempotent (e.g., OPEN, CREATE). If the origin is non-idempotent, then because the client MUST send GETFH after the origin operation, the client can recover if RESTOREFH returns an error.

o  A server MAY return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle-changing operation if the reply would be too large on the next operation.

o  A server SHOULD return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle-changing, non-idempotent operation if the reply would be too large on the next operation, especially if the operation is OPEN.

o  A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent current filehandle-changing operation, if it looks at the next operation (in the same COMPOUND procedure) and finds it is not GETFH. The server SHOULD do this if it is unable to determine in advance whether the total response size would exceed ca_maxresponsesize_cached or ca_maxresponsesize.

2.10.6.5. Persistence

Since the reply cache is bounded, it is practical for the reply cache to persist across server restarts. The replier MUST persist the following information if it agreed to persist the session (when the session was created; see Section 18.36):

o  The session ID.

o  The slot table, including the sequence ID and cached reply for each slot.

The above are sufficient for a replier to provide EOS semantics for any requests that were sent and executed before the server restarted.
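The persisted state enumerated in this section (session ID and slot table for bare EOS, plus the extra items listed below for session re-animation) might be grouped as in the following sketch. This is non-normative; the field and class names are invented here and are not the spec's XDR:

```python
from dataclasses import dataclass, field

@dataclass
class PersistedSlot:
    seqid: int           # last sequence ID executed on this slot
    cached_reply: bytes  # the encoded reply to replay on a retry

@dataclass
class PersistedSession:
    """Minimum state for EOS across a server restart (illustrative)."""
    session_id: bytes
    slots: dict = field(default_factory=dict)  # slot ID -> PersistedSlot

@dataclass
class ReanimationState:
    """Extra state a server MAY persist so a session can accept new
    requests (not just retries) after a restart (illustrative)."""
    client_id: int
    client_seqid: int    # client ID's sequence ID for session creation
    principal: str       # to authenticate a post-restart EXCHANGE_ID
    ssv: bytes = b""     # empty unless SP4_SSV protection was specified
    # The client ID's properties (Section 18.35) would also be kept.
```

With only a PersistedSession on stable storage, retries replay from the cache and genuinely new requests draw NFS4ERR_DEADSESSION; persisting ReanimationState as well is what allows the session to accept new requests after the restart.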
If the replier is a client, then there is no need for it to persist any more information, unless the client will be persisting all other state across client restart, in which case the server will never see any NFSv4.1-level protocol manifestation of a client restart. If the replier is a server, with just the slot table and session ID persisting, any requests the client retries after the server restart will return the results that are cached in the reply cache, and any new requests (i.e., requests with a sequence ID one greater than the slot's sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by SEQUENCE). Such a session is considered dead. A server MAY re-animate a session after a server restart so that the session will accept new requests as well as retries. To re-animate a session, the server needs to persist additional information through server restart:

o  The client ID. This is a prerequisite to let the client create more sessions associated with the same client ID as the re-animated session.

o  The client ID's sequence ID that is used for creating sessions (see Sections 18.35 and 18.36). This is a prerequisite to let the client create more sessions.

o  The principal that created the client ID. This allows the server to authenticate the client when it sends EXCHANGE_ID.

o  The SSV, if SP4_SSV state protection was specified when the client ID was created (see Section 18.35). This lets the client create new sessions, and associate connections with the new and existing sessions.

o  The properties of the client ID as defined in Section 18.35.

A persistent reply cache places certain demands on the server. The execution of the sequence of operations (starting with SEQUENCE) and placement of its results in the persistent cache MUST be atomic.
If a client retries a sequence of operations that was previously executed on the server, the only acceptable outcomes are either the original cached reply or an indication that the client ID or session has been lost (indicating a catastrophic loss of the reply cache or a session that has been deleted because the client failed to use the session for an extended period of time).

A server could fail and restart in the middle of a COMPOUND procedure that contains one or more non-idempotent or idempotent-but-modifying operations. This creates an even higher challenge for atomic execution and placement of results in the reply cache. One way to view the problem is as a single transaction consisting of each operation in the COMPOUND followed by storing the result in persistent storage, then finally a transaction commit. If there is a failure before the transaction is committed, then the server rolls back the transaction. If the server itself fails, then when it restarts, its recovery logic could roll back the transaction before starting the NFSv4.1 server.

While a description of the implementation for atomic execution of the request and caching of the reply is beyond the scope of this document, an example implementation for NFSv2 [41] is described in [42].

2.10.7.  RDMA Considerations

A complete discussion of the operation of RPC-based protocols over RDMA transports is in [31]. A discussion of the operation of NFSv4, including NFSv4.1, over RDMA is in [32]. Where RDMA is considered, this specification assumes the use of such a layering; it addresses only the upper-layer issues relevant to making best use of RPC/RDMA.

2.10.7.1.  RDMA Connection Resources

RDMA requires its consumers to register memory and post buffers of a specific size and number for receive operations.
Registration of memory can be a relatively high-overhead operation, since it requires pinning of buffers, assignment of attributes (e.g., readable/writable), and initialization of hardware translation. Preregistration is desirable to reduce overhead. These registrations are specific to hardware interfaces and even to RDMA connection endpoints; therefore, negotiation of their limits is desirable to manage resources effectively.

Following basic registration, these buffers must be posted by the RPC layer to handle receives. These buffers remain in use by the RPC/NFSv4.1 implementation; the size and number of them must be known to the remote peer in order to avoid RDMA errors that would cause a fatal error on the RDMA connection.

NFSv4.1 manages slots as resources on a per-session basis (see Section 2.10), while RDMA connections manage credits on a per-connection basis. This means that in order for a peer to send data over RDMA to a remote buffer, it has to have both an NFSv4.1 slot and an RDMA credit. If multiple RDMA connections are associated with a session, then if the total number of credits across all RDMA connections associated with the session is X, and the number of slots in the session is Y, then the maximum number of outstanding requests is the lesser of X and Y.

2.10.7.2.  Flow Control

Previous versions of NFS do not provide flow control; instead, they rely on the windowing provided by transports like TCP to throttle requests. This does not work with RDMA, which provides no operation flow control and will terminate a connection in error when limits are exceeded. Limits such as the maximum number of requests outstanding are therefore negotiated when a session is created (see the ca_maxrequests field in Section 18.36).
These limits then provide the maxima within which each connection associated with the session's channel(s) must remain. RDMA connections are managed within these limits as described in Section 3.3 of [31]; if there are multiple RDMA connections, then the maximum number of requests for a channel will be divided among the RDMA connections. Put a different way, the onus is on the replier to ensure that the total number of RDMA credits across all connections associated with the replier's channel does not exceed the channel's maximum number of outstanding requests.

The limits may also be modified dynamically at the replier's choosing by manipulating certain parameters present in each NFSv4.1 reply. In addition, the CB_RECALL_SLOT callback operation (see Section 20.8) can be sent by a server to a client to return RDMA credits to the server, thereby lowering the maximum number of requests a client can have outstanding to the server.

2.10.7.3.  Padding

Header padding is requested by each peer at session initiation (see the ca_headerpadsize argument to CREATE_SESSION in Section 18.36), and is subsequently used by the RPC RDMA layer, as described in [31]. Zero padding is permitted.

Padding leverages the useful property that RDMA transfers preserve alignment of data, even when they are placed into anonymous (untagged) buffers. If requested, client inline writes will insert appropriate pad bytes within the request header to align the data payload on the specified boundary. The client is encouraged to add sufficient padding (up to the negotiated size) so that the "data" field of the WRITE operation is aligned. Most servers can make good use of such padding, which allows them to chain receive buffers in such a way that any data carried by client requests will be placed into appropriate buffers at the server, ready for file system processing.
The receiver's RPC layer encounters no overhead from skipping over pad bytes, and the RDMA layer's high performance makes the insertion and transmission of padding on the sender a significant optimization. In this way, the need for servers to perform RDMA Read to satisfy all but the largest client writes is obviated. An added benefit is the reduction of message round trips on the network -- a potentially good trade, where latency is present.

The value to choose for padding is subject to a number of criteria. A primary source of variable-length data in the RPC header is the authentication information, the form of which is client-determined, possibly in response to server specification. The contents of COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all go into the determination of a maximal NFSv4.1 request size and therefore minimal buffer size. The client must select its offered value carefully, so as to avoid overburdening the server, and vice versa. The benefit of an appropriate padding value is higher performance.

   Sender gather:
       |RPC Request|Pad  bytes|Length| -> |User data...|
       \------+----------------------/      \
               \                             \
                \    Receiver scatter:        \-----------+- ...
            /-----+----------------\           \           \
            |RPC Request|Pad|Length|   ->  |FS buffer|->|FS buffer|->...

In the above case, the server may recycle unused buffers to the next posted receive if unused by the actual received request, or may pass the now-complete buffers by reference for normal write processing. For a server that can make use of it, this removes any need for data copies of incoming data, without resorting to complicated end-to-end buffer advertisement and management. This includes most kernel-based and integrated server designs, among many others. The client may perform similar optimizations, if desired.

2.10.7.4.
Dual RDMA and Non-RDMA Transports

Some RDMA transports (e.g., RFC 5040 [8]) permit a "streaming" (non-RDMA) phase, where ordinary traffic might flow before "stepping up" to RDMA mode, commencing RDMA traffic. Some RDMA transports start connections always in RDMA mode. NFSv4.1 allows, but does not assume, a streaming phase before RDMA mode. When a connection is associated with a session, the client and server negotiate whether the connection is used in RDMA or non-RDMA mode (see Sections 18.36 and 18.34).

2.10.8.  Session Security

2.10.8.1.  Session Callback Security

Via session/connection association, NFSv4.1 improves security over that provided by NFSv4.0 for the backchannel. The connection is client-initiated (see Section 18.34) and subject to the same firewall and routing checks as the fore channel. At the client's option (see Section 18.35), connection association is fully authenticated before being activated (see Section 18.34). Traffic from the server over the backchannel is authenticated exactly as the client specifies (see Section 2.10.8.2).

2.10.8.2.  Backchannel RPC Security

When the NFSv4.1 client establishes the backchannel, it informs the server of the security flavors and principals to use when sending requests. If the security flavor is RPCSEC_GSS, the client expresses the principal in the form of an established RPCSEC_GSS context. The server is free to use any of the flavor/principal combinations the client offers, but it MUST NOT use unoffered combinations. This way, the client need not provide a target GSS principal for the backchannel as it did with NFSv4.0, nor does the server have to implement an RPCSEC_GSS initiator as it did with NFSv4.0 [33].

The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL (Section 18.33) operations allow the client to specify flavor/principal combinations.
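The server's selection rule above can be sketched as picking strictly from the client's offered list. This is an illustrative model, not an API defined by this document; the function and variable names are hypothetical.

```python
def choose_backchannel_security(offered, server_preference):
    # 'offered' holds the (flavor, principal) pairs the client gave in
    # CREATE_SESSION or BACKCHANNEL_CTL.  The server may pick any of
    # them, but MUST NOT use a combination the client did not offer.
    for candidate in server_preference:
        if candidate in offered:
            return candidate
    raise LookupError("no acceptable flavor/principal offered")

offered = [("RPCSEC_GSS", "ctx-123"), ("AUTH_SYS", "root@client")]
choice = choose_backchannel_security(
    offered, server_preference=[("RPCSEC_GSS", "ctx-123")])
```

The key property is the failure case: a server preferring an unoffered combination must fall back to an offered one or refuse, never silently use it.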
Also note that the SP4_SSV state protection mode (see Sections 18.35 and 2.10.8.3) has the side benefit of providing SSV-derived RPCSEC_GSS contexts (Section 2.10.9).

2.10.8.3.  Protection from Unauthorized State Changes

As described to this point in the specification, the state model of NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation with a forged session ID and with a slot ID that it expects the legitimate client to use next. When the legitimate client uses the slot ID with the same sequence number, the server returns the attacker's result from the reply cache, which disrupts the legitimate client and thus denies service to it. Similarly, an attacker could send a CREATE_SESSION with a forged client ID to create a new session associated with the client ID. The attacker could send requests using the new session that change locking state, such as LOCKU operations to release locks the legitimate client has acquired. Setting a security policy on the file that requires RPCSEC_GSS credentials when manipulating the file's state is one potential workaround, but it has the disadvantage of preventing a legitimate client from releasing state when RPCSEC_GSS is required to do so but a GSS context cannot be obtained (possibly because the user has logged off the client).

NFSv4.1 provides three options to a client for state protection, which are specified when a client creates a client ID via EXCHANGE_ID (Section 18.35).

The first (SP4_NONE) is to simply waive state protection.

The other two options (SP4_MACH_CRED and SP4_SSV) share several traits:

o  An RPCSEC_GSS-based credential is used to authenticate client ID and session maintenance operations, including creating and destroying a session, associating a connection with the session, and destroying the client ID.
o  Because RPCSEC_GSS is used to authenticate client ID and session maintenance, the attacker cannot associate a rogue connection with a legitimate session, or associate a rogue session with a legitimate client ID, in order to maliciously alter the client ID's lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc.

o  In cases where the server's security policies on a portion of its namespace require RPCSEC_GSS authentication, a client may have to use an RPCSEC_GSS credential to remove per-file state (e.g., LOCKU, CLOSE, etc.). The server may require that the principal that removes the state match certain criteria (e.g., the principal might have to be the same as the one that acquired the state). However, the client might not have an RPCSEC_GSS context for such a principal, and might not be able to create such a context (perhaps because the user has logged off). When the client establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a list of operations that the server MUST allow using the machine credential (if SP4_MACH_CRED is used) or the SSV credential (if SP4_SSV is used).

The SP4_MACH_CRED state protection option uses a machine credential where the principal that creates the client ID MUST also be the principal that performs client ID and session maintenance operations. The security of the machine credential state protection approach depends entirely on safeguarding the per-machine credential. Assuming a proper safeguard, using the per-machine credential for operations like CREATE_SESSION, BIND_CONN_TO_SESSION, DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from associating a rogue connection with a session, or associating a rogue session with a client ID.

There are at least three scenarios for the SP4_MACH_CRED option:

1.
The system administrator configures a unique, permanent per-machine credential for one of the mandated GSS mechanisms (e.g., if Kerberos V5 is used, a "keytab" containing a principal derived from a client host name could be used).

2.  The client is used by a single user, and so the client ID and its sessions are used by just that user. If the user's credential expires, then session and client ID maintenance cannot occur, but since the client has a single user, only that user is inconvenienced.

3.  The physical client has multiple users, but the client implementation has a unique client ID for each user. This is effectively the same as the second scenario, but a disadvantage is that each user needs to be allocated at least one session, so the approach suffers from a lack of economy.

The SP4_SSV protection option uses the SSV (Section 1.7), via RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect state from attack. The SP4_SSV protection option is intended for the situation of a client that has multiple active users and a system administrator who wants to avoid the burden of installing a permanent machine credential on each client. The SSV is established and updated on the server via SET_SSV (see Section 18.47). To prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS with the privacy service. Several aspects of the SSV make it intractable for an attacker to guess the SSV, and thus to associate rogue connections with a session, and rogue sessions with a client ID:

o  The arguments to and results of SET_SSV include digests of the old and new SSV, respectively.
o  Because the initial value of the SSV is zero and therefore known, a client that opts for SP4_SSV protection and opts to apply SP4_SSV protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST send at least one SET_SSV operation before the first BIND_CONN_TO_SESSION operation or before the second CREATE_SESSION operation on a client ID. If it does not, the SSV mechanism will not generate tokens (Section 2.10.9). A client SHOULD send SET_SSV as soon as a session is created.

o  A SET_SSV request does not replace the SSV with the argument to SET_SSV. Instead, the current SSV on the server is logically exclusive ORed (XORed) with the argument to SET_SSV. Each time a new principal uses a client ID for the first time, the client SHOULD send a SET_SSV with that principal's RPCSEC_GSS credentials, with the RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY.

Here are the types of attacks that can be attempted by an attacker named Eve on a victim named Bob, and how SP4_SSV protection foils each attack:

o  Suppose Eve is the first user to log into a legitimate client. Eve's use of an NFSv4.1 file system will cause the legitimate client to create a client ID with SP4_SSV protection, specifying that the BIND_CONN_TO_SESSION operation MUST use the SSV credential. Eve's use of the file system also causes an SSV to be created. The SET_SSV operation that creates the SSV will be protected by the RPCSEC_GSS context created by the legitimate client, which uses Eve's GSS principal and credentials. Eve can eavesdrop on the network while her RPCSEC_GSS context is created and the SET_SSV using her context is sent. Even if the legitimate client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve knows her own credentials, she can decrypt the SSV.
Eve can compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will accept, and so associate a new connection with the legitimate session. Eve can change the slot ID and sequence state of a legitimate session, and/or the SSV state, in such a way that when Bob accesses the server via the same legitimate client, the legitimate client will be unable to use the session.

   The client's only recourse is to create a new client ID for Bob to use, and establish a new SSV for the client ID. The client will be unable to delete the old client ID, and will let the lease on the old client ID expire.

   Once the legitimate client establishes an SSV over the new session using Bob's RPCSEC_GSS context, Eve can use the new session via the legitimate client, but she cannot disrupt Bob. Moreover, because the client SHOULD have modified the SSV due to Eve using the new session, Bob cannot get revenge on Eve by associating a rogue connection with the session.

   The question is: how did the legitimate client detect that Eve hijacked the old session? When the client detects that a new principal, Bob, wants to use the session, it SHOULD have sent a SET_SSV, which leads to the following sub-scenarios:

   *  Let us suppose that from the rogue connection, Eve sent a SET_SSV with the same slot ID and sequence ID that the legitimate client later uses. The server will assume the SET_SSV sent with Bob's credentials is a retry, and return to the legitimate client the reply it sent Eve. However, unless Eve can correctly guess the SSV the legitimate client will use, the digest verification checks in the SET_SSV response will fail. That is an indication to the client that the session has apparently been hijacked.

   *  Alternatively, Eve sent a SET_SSV with a different slot ID than the legitimate client uses for its SET_SSV.
Then the digest verification of the SET_SSV sent with Bob's credentials fails on the server, and the error returned to the client makes it apparent that the session has been hijacked.

   *  Alternatively, Eve sent an operation other than SET_SSV, but with the same slot ID and sequence that the legitimate client uses for its SET_SSV. The server returns to the legitimate client the response it sent Eve. The client sees that the response is not at all what it expects. The client assumes either session hijacking or a server bug, and either way destroys the old session.

o  Eve associates a rogue connection with the session as above, and then destroys the session. Again, Bob goes to use the server from the legitimate client, which sends a SET_SSV using Bob's credentials. The client receives an error that indicates that the session does not exist. When the client tries to create a new session, this will fail because the SSV it has does not match that which the server has, and now the client knows the session was hijacked. The legitimate client establishes a new client ID.

o  If Eve creates a connection before the legitimate client establishes an SSV, because the initial value of the SSV is zero and therefore known, Eve can send a SET_SSV that will pass the digest verification check. However, because the new connection has not been associated with the session, the SET_SSV is rejected for that reason.

In summary, an attacker's disruption of state when SP4_SSV protection is in use is limited to the formative period of a client ID, its first session, and the establishment of the SSV. Once a non-malicious user uses the client ID, the client quickly detects any hijack and rectifies the situation. Once a non-malicious user successfully modifies the SSV, the attacker cannot use NFSv4.1 operations to disrupt the non-malicious user.
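The XOR-based SET_SSV update that underlies these protections can be sketched as follows. This is illustrative only; a real SET_SSV also carries digests of the old and new SSV so each side can verify the other's view.

```python
# Sketch of the SET_SSV update rule: the server does not replace the
# SSV with the argument; it XORs the current SSV with the argument.
# An attacker who saw one SET_SSV argument still cannot predict the
# SSV once a later principal's SET_SSV has been folded in.

def set_ssv(current_ssv: bytes, argument: bytes) -> bytes:
    if len(current_ssv) != len(argument):
        raise ValueError("SSV and argument must be the same length")
    return bytes(a ^ b for a, b in zip(current_ssv, argument))

ssv = bytes(16)                   # initial SSV is all zeros (known)
ssv = set_ssv(ssv, b"A" * 16)     # first SET_SSV: result equals argument
ssv = set_ssv(ssv, b"B" * 16)     # later SET_SSVs fold in more entropy
```

Because XOR is its own inverse, applying the same argument twice restores the prior SSV, which is why each new principal SHOULD contribute fresh random material under RPC_GSS_SVC_PRIVACY.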
Note that neither the SP4_MACH_CRED nor the SP4_SSV protection approach prevents hijacking of a transport connection that has previously been associated with a session. If the goal of a counter-threat strategy is to prevent connection hijacking, the use of IPsec is RECOMMENDED.

If a connection hijack occurs, the hijacker could in theory change locking state and negatively impact the service to legitimate clients. However, if the server is configured to require the use of RPCSEC_GSS with integrity or privacy on the affected file objects, and if the EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35) is in force, this will thwart unauthorized attempts to change locking state.

2.10.9.  The Secret State Verifier (SSV) GSS Mechanism

The SSV provides the secret key for a GSS mechanism internal to NFSv4.1 that NFSv4.1 uses for state protection. Contexts for this mechanism are not established via the RPCSEC_GSS protocol. Instead, the contexts are automatically created when EXCHANGE_ID specifies SP4_SSV protection. The only tokens defined are the PerMsgToken (emitted by GSS_GetMIC) and the SealedMessage token (emitted by GSS_Wrap).

The mechanism OID for the SSV mechanism is iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech (1.3.6.1.4.1.28882.1.1). While the SSV mechanism does not define any initial context tokens, the OID can be used to let servers indicate that the SSV mechanism is acceptable whenever the client sends a SECINFO or SECINFO_NO_NAME operation (see Section 2.6).

The SSV mechanism defines four subkeys derived from the SSV value. Each time SET_SSV is invoked, the subkeys are recalculated by the client and server. The calculation of each of the four subkeys depends on each of the four respective ssv_subkey4 enumerated values.
The calculation uses the HMAC [59] algorithm, using the current SSV as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID, and the input text as represented by the XDR encoded enumeration value for that subkey of data type ssv_subkey4. If the length of the output of the HMAC algorithm exceeds the length of the key of the encryption algorithm (which is also negotiated by EXCHANGE_ID), then the subkey MUST be truncated from the HMAC output; i.e., if the subkey is N bytes long, then the first N bytes of the HMAC output MUST be used for the subkey. The specification of EXCHANGE_ID states that the length of the output of the HMAC algorithm MUST NOT be less than the length of the subkey needed for the encryption algorithm (see Section 18.35).

   /* Input for computing subkeys */
   enum ssv_subkey4 {
           SSV4_SUBKEY_MIC_I2T     = 1,
           SSV4_SUBKEY_MIC_T2I     = 2,
           SSV4_SUBKEY_SEAL_I2T    = 3,
           SSV4_SUBKEY_SEAL_T2I    = 4
   };

The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating message integrity codes (MICs) that originate from the NFSv4.1 client, whether as part of a request over the fore channel or a response over the backchannel. The subkey derived from SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1 server. The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for encrypting text originating from the NFSv4.1 client, and the subkey derived from SSV4_SUBKEY_SEAL_T2I is used for encrypting text originating from the NFSv4.1 server.
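As a concrete illustration of the derivation above, the following sketch assumes SHA-256 was negotiated as the one-way hash and a 16-byte encryption key (so truncation applies); the function name is illustrative, not part of the protocol.

```python
import hashlib
import hmac
import struct

SSV4_SUBKEY_MIC_I2T = 1  # values from the ssv_subkey4 enum above

def derive_subkey(ssv: bytes, subkey_enum: int, key_len: int) -> bytes:
    # The input text is the XDR encoding of the enum value: a
    # big-endian 32-bit integer.  The HMAC output is truncated to the
    # negotiated encryption algorithm's key length.
    text = struct.pack(">I", subkey_enum)
    digest = hmac.new(ssv, text, hashlib.sha256).digest()
    return digest[:key_len]

mic_i2t = derive_subkey(b"\x00" * 32, SSV4_SUBKEY_MIC_I2T, 16)
```

Both peers run this computation after every SET_SSV, so the four subkeys track the current SSV without any extra key-exchange traffic.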
The PerMsgToken description is based on an XDR definition:

   /* Input for computing smt_hmac */
   struct ssv_mic_plain_tkn4 {
     uint32_t        smpt_ssv_seq;
     opaque          smpt_orig_plain<>;
   };

   /* SSV GSS PerMsgToken token */
   struct ssv_mic_tkn4 {
     uint32_t        smt_ssv_seq;
     opaque          smt_hmac<>;
   };

The field smt_hmac is an HMAC calculated by using the subkey derived from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID, and the input text as represented by data of type ssv_mic_plain_tkn4. The field smpt_ssv_seq is the same as smt_ssv_seq. The field smpt_orig_plain is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of [7]). The caller of GSS_GetMIC() provides a pointer to a buffer containing the plain text. The SSV mechanism's entry point for GSS_GetMIC() encodes this into an opaque array, and the encoding will include an initial four-byte length, plus any necessary padding. Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus making up an XDR encoding of a value of data type ssv_mic_plain_tkn4, which in turn is the input into the HMAC.

The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type ssv_mic_tkn4. The field smt_ssv_seq comes from the SSV sequence number, which is equal to one after SET_SSV (Section 18.47) is called the first time on a client ID. Thereafter, the SSV sequence number is incremented on each SET_SSV. Thus, smt_ssv_seq represents the version of the SSV at the time GSS_GetMIC() was called. As noted in Section 18.35, the client and server can maintain multiple concurrent versions of the SSV. This allows the SSV to be changed without serializing all RPC calls that use the SSV mechanism with SET_SSV operations.
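A sketch of the PerMsgToken construction just described, assuming SHA-256 as the negotiated hash; the XDR helpers are hand-rolled for illustration and the function names are not part of the protocol.

```python
import hashlib
import hmac
import struct

def xdr_opaque(data: bytes) -> bytes:
    # XDR variable-length opaque<>: a 4-byte big-endian length, the
    # bytes themselves, then zero padding up to a multiple of 4.
    pad = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def get_mic_token(mic_subkey: bytes, ssv_seq: int, message: bytes) -> bytes:
    # XDR encoding of ssv_mic_plain_tkn4: smpt_ssv_seq, then the
    # caller's message as smpt_orig_plain.  This is the HMAC input.
    plain = struct.pack(">I", ssv_seq) + xdr_opaque(message)
    mac = hmac.new(mic_subkey, plain, hashlib.sha256).digest()
    # XDR encoding of ssv_mic_tkn4: smt_ssv_seq, then smt_hmac.
    return struct.pack(">I", ssv_seq) + xdr_opaque(mac)
```

The verifier on the other side rebuilds the same ssv_mic_plain_tkn4 encoding from the received sequence number and message, recomputes the HMAC with the peer-direction subkey, and compares.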
Once the HMAC is calculated, it is XDR encoded into smt_hmac, which will include an initial four-byte length, and any necessary padding. Prepended to this will be the XDR encoded value of smt_ssv_seq.

The SealedMessage description is based on an XDR definition:

   /* Input for computing ssct_encr_data and ssct_hmac */
   struct ssv_seal_plain_tkn4 {
     opaque          sspt_confounder<>;
     uint32_t        sspt_ssv_seq;
     opaque          sspt_orig_plain<>;
     opaque          sspt_pad<>;
   };

   /* SSV GSS SealedMessage token */
   struct ssv_seal_cipher_tkn4 {
     uint32_t      ssct_ssv_seq;
     opaque        ssct_iv<>;
     opaque        ssct_encr_data<>;
     opaque        ssct_hmac<>;
   };

The token emitted by GSS_Wrap() is XDR encoded and of XDR data type ssv_seal_cipher_tkn4.

The ssct_ssv_seq field has the same meaning as smt_ssv_seq.

The ssct_encr_data field is the result of encrypting a value of the XDR encoded data type ssv_seal_plain_tkn4. The encryption key is the subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and the encryption algorithm is that negotiated by EXCHANGE_ID.

The ssct_iv field is the initialization vector (IV) for the encryption algorithm (if applicable) and is sent in clear text. The content and size of the IV MUST comply with the specification of the encryption algorithm. For example, the id-aes256-CBC algorithm MUST use a 16-byte initialization vector (IV), which MUST be unpredictable for each instance of a value of data type ssv_seal_plain_tkn4 that is encrypted with a particular SSV key.

The ssct_hmac field is the result of computing an HMAC using the value of the XDR encoded data type ssv_seal_plain_tkn4 as the input text. The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that negotiated by EXCHANGE_ID.

The sspt_confounder field is a random value.
The sspt_ssv_seq field is the same as ssct_ssv_seq.

The sspt_orig_plain field is the original plaintext and is the "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of [7]). As with the handling of the plaintext by the SSV mechanism's GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a pointer to the plaintext, and will XDR encode an opaque array into sspt_orig_plain representing the plain text, along with the other fields of an instance of data type ssv_seal_plain_tkn4.

The sspt_pad field is present to support encryption algorithms that require inputs to be in fixed-sized blocks. The content of sspt_pad is zero filled except for the length. Beware that the XDR encoding of ssv_seal_plain_tkn4 contains three variable-length arrays, and so each array consumes four bytes for an array length, and each array that follows the length is always padded to a multiple of four bytes per the XDR standard.

For example, suppose the encryption algorithm uses 16-byte blocks, the sspt_confounder is three bytes long, and the sspt_orig_plain field is 15 bytes long. The XDR encoding of sspt_confounder uses eight bytes (4 + 3 + 1 byte pad), the XDR encoding of sspt_ssv_seq uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4 + 15 + 1 byte pad), and the smallest XDR encoding of the sspt_pad field is four bytes. This totals 36 bytes. The next multiple of 16 is 48; thus, the length field of sspt_pad needs to be set to 12 bytes, for a total encoding of 16 bytes. The total number of XDR encoded bytes is thus 8 + 4 + 20 + 16 = 48.

GSS_Wrap() emits a token that is an XDR encoding of a value of data type ssv_seal_cipher_tkn4. Note that regardless of whether or not the caller of GSS_Wrap() requests confidentiality, the token always has confidentiality.
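The block-padding arithmetic in the worked example above can be checked mechanically. The helper names below are illustrative only.

```python
def xdr_opaque_len(n: int) -> int:
    # Encoded size of an XDR opaque<> of n bytes: a 4-byte length,
    # the bytes, then zero padding up to a multiple of 4.
    return 4 + n + (-n) % 4

def seal_plain_tkn_len(confounder_len: int, plain_len: int,
                       pad_len: int) -> int:
    # Total XDR size of ssv_seal_plain_tkn4: three variable-length
    # opaques plus the fixed 4-byte sspt_ssv_seq.
    return (xdr_opaque_len(confounder_len) + 4 +
            xdr_opaque_len(plain_len) + xdr_opaque_len(pad_len))

# Worked example from the text: 3-byte confounder, 15-byte plaintext,
# 16-byte cipher blocks -> an sspt_pad length of 12 yields 48 bytes.
total = seal_plain_tkn_len(3, 15, 12)
```

Picking sspt_pad so that the whole encoding lands on a cipher-block boundary is the implementer's job; the XDR per-array padding makes the arithmetic easy to get wrong without such a check.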
This is because the SSV mechanism is for RPCSEC_GSS, and RPCSEC_GSS never produces GSS_Wrap() tokens without confidentiality.

There is one SSV per client ID. There is a single GSS context for a client ID / SSV pair. All SSV mechanism RPCSEC_GSS handles of a client ID / SSV pair share the same GSS context. SSV GSS contexts do not expire except when the SSV is destroyed (causes would include the client ID being destroyed or a server restart). Since one purpose of context expiration is to replace keys that have been in use for "too long", and hence are vulnerable to compromise by brute force or accident, the client can replace the SSV key by sending periodic SET_SSV operations, which is done by cycling through different users' RPCSEC_GSS credentials. This way, the SSV is replaced without destroying the SSV's GSS contexts.

SSV RPCSEC_GSS handles can be expired or deleted by the server at any time, and the EXCHANGE_ID operation can be used to create more SSV RPCSEC_GSS handles. Expiration of SSV RPCSEC_GSS handles does not imply that the SSV or its GSS context has expired.

The client MUST establish an SSV via SET_SSV before the SSV GSS context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC(). If SET_SSV has not been successfully called, attempts to emit tokens MUST fail.

The SSV mechanism does not support replay detection and sequencing in its tokens because RPCSEC_GSS does not use those features (see Section 5.2.2, "Context Creation Requests", in [4]). However, Section 2.10.10 discusses special considerations for the SSV mechanism when used with RPCSEC_GSS.

2.10.10.
Security Considerations for RPCSEC_GSS When Using the SSV 3624 Mechanism 3626 When a client ID is created with SP4_SSV state protection (see 3627 Section 18.35), the client is permitted to associate multiple 3628 RPCSEC_GSS handles with the single SSV GSS context (see 3629 Section 2.10.9). Because of the way RPCSEC_GSS (both version 1 and 3630 version 2, see [4] and [9]) calculate the verifier of the reply, 3631 special care must be taken by the implementation of the NFSv4.1 3632 client to prevent attacks by a man-in-the-middle. The verifier of an 3633 RPCSEC_GSS reply is the output of GSS_GetMIC() applied to the input 3634 value of the seq_num field of the RPCSEC_GSS credential (data type 3635 rpc_gss_cred_ver_1_t) (see Section 5.3.3.2 of [4]). If multiple 3636 RPCSEC_GSS handles share the same GSS context, then if one handle is 3637 used to send a request with the same seq_num value as another handle, 3638 an attacker could block the reply, and replace it with the verifier 3639 used for the other handle. 3641 There are multiple ways to prevent the attack on the SSV RPCSEC_GSS 3642 verifier in the reply. The simplest is believed to be as follows. 3644 o Each time one or more new SSV RPCSEC_GSS handles are created via 3645 EXCHANGE_ID, the client SHOULD send a SET_SSV operation to modify 3646 the SSV. By changing the SSV, the new handles will not result in 3647 the re-use of an SSV RPCSEC_GSS verifier in a reply. 3649 o When a requester decides to use N SSV RPCSEC_GSS handles, it 3650 SHOULD assign a unique and non-overlapping range of seq_nums to 3651 each SSV RPCSEC_GSS handle. The size of each range SHOULD be 3652 equal to MAXSEQ / N (see Section 5 of [4] for the definition of 3653 MAXSEQ). When an SSV RPCSEC_GSS handle reaches its maximum, it 3654 SHOULD force the replier to destroy the handle by sending a NULL 3655 RPC request with seq_num set to MAXSEQ + 1 (see Section 5.3.3.3 of 3656 [4]). 
3658 o When the requester wants to increase or decrease N, it SHOULD 3659 force the replier to destroy all N handles by sending a NULL RPC 3660 request on each handle with seq_num set to MAXSEQ + 1. If the 3661 requester is the client, it SHOULD send a SET_SSV operation before 3662 using new handles. If the requester is the server, then the 3663 client SHOULD send a SET_SSV operation when it detects that the 3664 server has forced it to destroy a backchannel's SSV RPCSEC_GSS 3665 handle. By sending a SET_SSV operation, the SSV will change, and 3666 so the attacker will be unable to successfully replay a 3667 previous verifier in a reply to the requester. 3669 Note that if the replier carefully creates the SSV RPCSEC_GSS 3670 handles, the related risk of a man-in-the-middle splicing a forged 3671 SSV RPCSEC_GSS credential with a verifier for another handle does not 3672 exist. This is because the verifier in an RPCSEC_GSS request is 3673 computed from input that includes both the RPCSEC_GSS handle and 3674 seq_num (see Section 5.3.1 of [4]). Provided the replier takes care 3675 to avoid re-using the value of an RPCSEC_GSS handle that it creates, 3676 such as by including a generation number in the handle, the man-in- 3677 the-middle will not be able to successfully replay a previous 3678 verifier in the request to a replier. 3680 2.10.11. Session Mechanics - Steady State 3682 2.10.11.1. Obligations of the Server 3684 The server has the primary obligation to monitor the state of 3685 backchannel resources that the client has created for the server 3686 (RPCSEC_GSS contexts and backchannel connections). If these 3687 resources vanish, the server takes action as specified in 3688 Section 2.10.13.2. 3690 2.10.11.2. Obligations of the Client 3692 The client SHOULD honor the following obligations in order to utilize 3693 the session: 3695 o Keep a necessary session from going idle on the server.
A client 3696 that requires a session but nonetheless is not sending operations 3697 risks having the session be destroyed by the server. This is 3698 because sessions consume resources, and resource limitations may 3699 force the server to cull an inactive session. A server MAY 3700 consider a session to be inactive if the client has not used the 3701 session before the session inactivity timer (Section 2.10.12) has 3702 expired. 3704 o Destroy the session when not needed. If a client has multiple 3705 sessions, one of which has no requests waiting for replies, and 3706 has been idle for some period of time, it SHOULD destroy the 3707 session. 3709 o Maintain GSS contexts and RPCSEC_GSS handles for the backchannel. 3710 If the client requires the server to use the RPCSEC_GSS security 3711 flavor for callbacks, then it needs to be sure the RPCSEC_GSS 3712 handles and/or their GSS contexts that are handed to the server 3713 via BACKCHANNEL_CTL or CREATE_SESSION are unexpired. 3715 o Preserve a connection for a backchannel. The server requires a 3716 backchannel in order to gracefully recall recallable state or 3717 notify the client of certain events. Note that if the connection 3718 is not being used for the fore channel, there is no way for the 3719 client to tell if the connection is still alive (e.g., the server 3720 restarted without sending a disconnect). The onus is on the 3721 server, not the client, to determine if the backchannel's 3722 connection is alive, and to indicate in the response to a SEQUENCE 3723 operation when the last connection associated with a session's 3724 backchannel has disconnected. 3726 2.10.11.3. Steps the Client Takes to Establish a Session 3728 If the client does not have a client ID, the client sends EXCHANGE_ID 3729 to establish a client ID. 
If it opts for SP4_MACH_CRED or SP4_SSV 3730 protection, in the spo_must_enforce list of operations, it SHOULD at 3731 minimum specify CREATE_SESSION, DESTROY_SESSION, 3732 BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID. If it 3733 opts for SP4_SSV protection, the client needs to ask for SSV-based 3734 RPCSEC_GSS handles. 3736 The client uses the client ID to send a CREATE_SESSION on a 3737 connection to the server. The results of CREATE_SESSION indicate 3738 whether or not the server will persist the session reply cache 3739 through a server restart, and the client notes this for 3740 future reference. 3742 If the client specified SP4_SSV state protection when the client ID 3743 was created, then it SHOULD send SET_SSV in the first COMPOUND after 3744 the session is created. Each time a new principal goes to use the 3745 client ID, it SHOULD send a SET_SSV again. 3747 If the client wants to use delegations, layouts, directory 3748 notifications, or any other state that requires a backchannel, then 3749 it needs to add a connection to the backchannel if CREATE_SESSION did 3750 not already do so. The client creates a connection, and calls 3751 BIND_CONN_TO_SESSION to associate the connection with the session and 3752 the session's backchannel. If CREATE_SESSION did not already do so, 3753 the client MUST tell the server what security is required in order 3754 for the client to accept callbacks. The client does this via 3755 BACKCHANNEL_CTL. If the client selected SP4_MACH_CRED or SP4_SSV 3756 protection when it called EXCHANGE_ID, then the client SHOULD specify 3757 that the backchannel use RPCSEC_GSS contexts for security. 3759 If the client wants to use additional connections for the 3760 backchannel, then it needs to call BIND_CONN_TO_SESSION on each 3761 connection it wants to use with the session.
If the client wants to 3762 use additional connections for the fore channel, then it needs to 3763 call BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED 3764 state protection when the client ID was created. 3766 At this point, the session has reached steady state. 3768 2.10.12. Session Inactivity Timer 3770 The server MAY maintain a session inactivity timer for each session. 3771 If the session inactivity timer expires, then the server MAY destroy 3772 the session. To avoid losing a session due to inactivity, the client 3773 MUST renew the session inactivity timer. The length of the session 3774 inactivity timer MUST NOT be less than the lease_time attribute 3775 (Section 5.8.1.11). As with lease renewal (Section 8.3), when the 3776 server receives a SEQUENCE operation, it resets the session 3777 inactivity timer, and MUST NOT allow the timer to expire while the 3778 rest of the operations in the COMPOUND procedure's request are still 3779 executing. Once the last operation has finished, the server MUST set 3780 the session inactivity timer to expire no sooner than the sum of the 3781 current time and the value of the lease_time attribute. 3783 2.10.13. Session Mechanics - Recovery 3785 2.10.13.1. Events Requiring Client Action 3787 The following events require client action to recover. 3789 2.10.13.1.1. RPCSEC_GSS Context Loss by Callback Path 3791 If all RPCSEC_GSS handles granted by the client to the server for 3792 callback use have expired, the client MUST establish a new handle via 3793 BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE results 3794 indicates when callback handles are nearly expired, or fully expired 3795 (see Section 18.46.3). 3797 2.10.13.1.2.
Connection Loss 3799 If the client loses the last connection of the session and wants to 3800 retain the session, then it needs to create a new connection, and if, 3801 when the client ID was created, BIND_CONN_TO_SESSION was specified in 3802 the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION 3803 to associate the connection with the session. 3805 If there was a request outstanding at the time of connection loss, 3806 then if the client wants to continue to use the session, it MUST 3807 retry the request, as described in Section 2.10.6.2. Note that it is 3808 not necessary to retry requests over a connection with the same 3809 source network address or the same destination network address as the 3810 lost connection. As long as the session ID, slot ID, and sequence ID 3811 in the retry match that of the original request, the server will 3812 recognize the request as a retry if it executed the request prior to 3813 disconnect. 3815 If the connection that was lost was the last one associated with the 3816 backchannel, and the client wants to retain the backchannel and/or 3817 prevent revocation of recallable state, the client needs to 3818 reconnect, and if it does, it MUST associate the connection to the 3819 session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD 3820 indicate when it has no callback connection via the sr_status_flags 3821 result from SEQUENCE. 3823 2.10.13.1.3. Backchannel GSS Context Loss 3825 Via the sr_status_flags result of the SEQUENCE operation or other 3826 means, the client will learn if some or all of the RPCSEC_GSS 3827 contexts it assigned to the backchannel have been lost. If the 3828 client wants to retain the backchannel and/or not put recallable 3829 state subject to revocation, the client needs to use BACKCHANNEL_CTL 3830 to assign new contexts. 3832 2.10.13.1.4. Loss of Session 3834 The replier might lose a record of the session. Causes include: 3836 o Replier failure and restart. 
3838 o A catastrophe that causes the reply cache to be corrupted or lost 3839 on the media on which it was stored. This applies even if the 3840 replier indicated in the CREATE_SESSION results that it would 3841 persist the cache. 3843 o The server purges the session of a client that has been inactive 3844 for a very extended period of time. 3846 o As a result of configuration changes among a set of clustered 3847 servers, a network address previously connected to one server 3848 becomes connected to a different server that has no knowledge of 3849 the session in question. Such a configuration change will 3850 generally only happen when the original server ceases to function 3851 for a time. 3853 Loss of reply cache is equivalent to loss of session. The replier 3854 indicates loss of session to the requester by returning 3855 NFS4ERR_BADSESSION on the next operation that uses the session ID 3856 that refers to the lost session. 3858 After an event like a server restart, the client may have lost its 3859 connections. The client assumes for the moment that the session has 3860 not been lost. It reconnects, and if it specified connection 3861 association enforcement when the session was created, it invokes 3862 BIND_CONN_TO_SESSION using the session ID. Otherwise, it invokes 3863 SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns 3864 NFS4ERR_BADSESSION, the client knows the session is not available to 3865 it when communicating with that network address. If the connection 3866 survives session loss, then the next SEQUENCE operation the client 3867 sends over the connection will get back NFS4ERR_BADSESSION. The 3868 client again knows the session was lost. 3870 Here is one suggested algorithm for the client when it gets 3871 NFS4ERR_BADSESSION. It is not obligatory in that, if a client does 3872 not want to take advantage of such features as trunking, it may omit 3873 parts of it. 
However, it is a useful example that draws attention to 3874 various possible recovery issues: 3876 1. If the client has other connections to other server network 3877 addresses associated with the same session, attempt a COMPOUND 3878 with a single operation, SEQUENCE, on each of the other 3879 connections. 3881 2. If the attempts succeed, the session is still alive, and this is 3882 a strong indicator that the server's network address has moved. 3883 The client might send an EXCHANGE_ID on the connection that 3884 returned NFS4ERR_BADSESSION to see if there are opportunities for 3885 client ID trunking (i.e., the same client ID and so_major_id are 3886 returned). The client might use DNS to see if the moved network 3887 address was replaced with another, so that the performance and 3888 availability benefits of session trunking can continue. 3890 3. If the SEQUENCE requests fail with NFS4ERR_BADSESSION, then the 3891 session no longer exists on any of the server network addresses 3892 for which the client has connections associated with that session 3893 ID. It is possible the session is still alive and available on 3894 other network addresses. The client sends an EXCHANGE_ID on all 3895 the connections to see if the server owner is still listening on 3896 those network addresses. If the same server owner is returned 3897 but a new client ID is returned, this is a strong indicator of a 3898 server restart. If both the same server owner and same client ID 3899 are returned, then this is a strong indication that the server 3900 did delete the session, and the client will need to send a 3901 CREATE_SESSION if it has no other sessions for that client ID. 3902 If a different server owner is returned, the client can use DNS 3903 to find other network addresses.
If it does not, or if DNS does 3904 not find any other addresses for the server, then the client will 3905 be unable to provide NFSv4.1 service, and fatal errors should be 3906 returned to processes that were using the server. If the client 3907 is using a "mount" paradigm, unmounting the server is advised. 3909 4. If the client knows of no other connections associated with the 3910 session ID and server network addresses that are, or have been, 3911 associated with the session ID, then the client can use DNS to 3912 find other network addresses. If it does not, or if DNS does not 3913 find any other addresses for the server, then the client will be 3914 unable to provide NFSv4.1 service, and fatal errors should be 3915 returned to processes that were using the server. If the client 3916 is using a "mount" paradigm, unmounting the server is advised. 3918 If there is a reconfiguration event that results in the same network 3919 address being assigned to servers where the eir_server_scope value is 3920 different, it cannot be guaranteed that a session ID generated by the 3921 first will be recognized as invalid by the second. Therefore, in 3922 managing server reconfigurations among servers with different server 3923 scope values, it is necessary to make sure that all clients have 3924 disconnected from the first server before effecting the 3925 reconfiguration. Nonetheless, clients should not assume that servers 3926 will always adhere to this requirement; clients MUST be prepared to 3927 deal with unexpected effects of server reconfigurations. Even where 3928 a session ID is inappropriately recognized as valid, it is likely 3929 either that the connection will not be recognized as valid or that a 3930 sequence value for a slot will not be correct. Therefore, when a 3931 client receives results indicating such unexpected errors, the use of 3932 EXCHANGE_ID to determine the current server configuration is 3933 RECOMMENDED.
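The client recovery algorithm above (steps 1 through 4) can be sketched as follows. This is a non-normative illustration only: the helper callables probe_sequence and send_exchange_id are hypothetical stand-ins for the client's RPC machinery and are not protocol elements.

```python
# Non-normative sketch of the NFS4ERR_BADSESSION recovery algorithm
# (steps 1-4 above).  The callables passed in are hypothetical; none
# of these names come from the NFSv4.1 protocol itself.

def recover_bad_session(other_conns, probe_sequence, send_exchange_id,
                        old_server_owner, old_client_id):
    """Decide the next recovery action after NFS4ERR_BADSESSION.

    other_conns      -- other connections associated with the same session ID
    probe_sequence   -- conn -> True if a lone SEQUENCE succeeds on conn
    send_exchange_id -- conn -> (server_owner, client_id) from EXCHANGE_ID
    """
    # Steps 1 and 2: probe the session over every other connection.  If
    # any SEQUENCE succeeds, the session is alive and the original
    # network address has likely moved; look for trunking opportunities.
    if any(probe_sequence(c) for c in other_conns):
        return "session alive; check for client ID / session trunking"

    # Step 3: the session is gone on all known addresses.  Send
    # EXCHANGE_ID to see if the same server owner is still listening.
    for conn in other_conns:
        server_owner, client_id = send_exchange_id(conn)
        if server_owner == old_server_owner:
            if client_id == old_client_id:
                # Same owner and client ID: the server deleted the
                # session, so the client re-creates it.
                return "send CREATE_SESSION"
            # Same owner, new client ID: strong indicator of a restart.
            return "server restarted; full state recovery"

    # Step 4 (or no other connections at all): fall back to DNS, and if
    # that fails, return fatal errors to processes using the server.
    return "try DNS, else fail"
```

For instance, a client holding no other connections associated with the session falls directly through to the DNS fallback of step 4.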
3935 A variation on the above is that after a server's network address 3936 moves, there is no NFSv4.1 server listening, e.g., no listener on 3937 port 2049. In this example, one of the following occurs: the NFSv4 3938 server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns a 3939 PROG_MISMATCH error, the RPC listener on 2049 returns PROG_UNAVAIL, or 3940 attempts to reconnect to the network address time out. These SHOULD 3941 be treated as equivalent to SEQUENCE returning NFS4ERR_BADSESSION for 3942 these purposes. 3944 When the client detects session loss, it needs to call CREATE_SESSION 3945 to recover. Any non-idempotent operations that were in progress 3946 might have been performed on the server at the time of session loss. 3947 The client has no general way to recover from this. 3949 Note that loss of session does not imply loss of byte-range lock, 3950 open, delegation, or layout state because locks, opens, delegations, 3951 and layouts are tied to the client ID and depend on the client ID, 3952 not the session. Nor does loss of byte-range lock, open, delegation, 3953 or layout state imply loss of session state, because the session 3954 depends on the client ID; loss of client ID, however, does imply loss 3955 of session, byte-range lock, open, delegation, and layout state. See 3956 Section 8.4.2. A session can survive a server restart, but lock 3957 recovery may still be needed. 3959 It is possible that CREATE_SESSION will fail with 3960 NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not 3961 preserve client ID state). If so, the client needs to call 3962 EXCHANGE_ID, followed by CREATE_SESSION. 3964 2.10.13.2. Events Requiring Server Action 3966 The following events require server action to recover. 3968 2.10.13.2.1. Client Crash and Restart 3970 As described in Section 18.35, a restarted client sends EXCHANGE_ID 3971 in such a way that it causes the server to delete any sessions it 3972 had. 3974 2.10.13.2.2.
Client Crash with No Restart 3976 If a client crashes and never comes back, it will never send 3977 EXCHANGE_ID with its old client owner. Thus, the server has session 3978 state that will never be used again. After an extended period of 3979 time, and if the server has resource constraints, it MAY destroy the 3980 old session as well as locking state. 3982 2.10.13.2.3. Extended Network Partition 3984 To the server, the extended network partition may be no different 3985 from a client crash with no restart (see Section 2.10.13.2.2). 3986 Unless the server can discern that there is a network partition, it 3987 is free to treat the situation as if the client has crashed 3988 permanently. 3990 2.10.13.2.4. Backchannel Connection Loss 3992 If there were callback requests outstanding at the time of a 3993 connection loss, then the server MUST retry the requests, as 3994 described in Section 2.10.6.2. Note that it is not necessary to 3995 retry requests over a connection with the same source network address 3996 or the same destination network address as the lost connection. As 3997 long as the session ID, slot ID, and sequence ID in the retry match 3998 that of the original request, the callback target will recognize the 3999 request as a retry even if it did see the request prior to 4000 disconnect. 4002 If the connection lost is the last one associated with the 4003 backchannel, then the server MUST indicate that in the 4004 sr_status_flags field of every SEQUENCE reply until the backchannel 4005 is re-established. There are two situations, each of which uses 4006 different status flags: no connectivity for the session's backchannel 4007 and no connectivity for any session backchannel of the client. See 4008 Section 18.46 for a description of the appropriate flags in 4009 sr_status_flags. 4011 2.10.13.2.5. 
GSS Context Loss 4013 The server SHOULD monitor when the number of RPCSEC_GSS handles 4014 assigned to the backchannel reaches one, and when that one handle is 4015 near expiry (i.e., between one and two periods of lease time), and 4016 indicate so in the sr_status_flags field of all SEQUENCE replies. 4017 The server MUST indicate when all of the backchannel's assigned 4018 RPCSEC_GSS handles have expired via the sr_status_flags field of all 4019 SEQUENCE replies. 4021 2.10.14. Parallel NFS and Sessions 4023 A client and server can potentially be a non-pNFS implementation, a 4024 metadata server implementation, a data server implementation, or two 4025 or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, 4026 EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not 4027 mutually exclusive) are passed in the EXCHANGE_ID arguments and 4028 results to allow the client to indicate how it wants to use sessions 4029 created under the client ID, and to allow the server to indicate how 4030 it will allow the sessions to be used. See Section 13.1 for pNFS 4031 sessions considerations. 4033 3. Protocol Constants and Data Types 4035 The syntax and semantics to describe the data types of the NFSv4.1 4036 protocol are defined in the XDR RFC 4506 [2] and RPC RFC 5531 [3] 4037 documents. The next sections build upon the XDR data types to define 4038 constants, types, and structures specific to this protocol. The full 4039 list of XDR data types is in [10]. 4041 3.1. Basic Constants 4043 const NFS4_FHSIZE = 128; 4044 const NFS4_VERIFIER_SIZE = 8; 4045 const NFS4_OPAQUE_LIMIT = 1024; 4046 const NFS4_SESSIONID_SIZE = 16; 4048 const NFS4_INT64_MAX = 0x7fffffffffffffff; 4049 const NFS4_UINT64_MAX = 0xffffffffffffffff; 4050 const NFS4_INT32_MAX = 0x7fffffff; 4051 const NFS4_UINT32_MAX = 0xffffffff; 4053 const NFS4_MAXFILELEN = 0xffffffffffffffff; 4054 const NFS4_MAXFILEOFF = 0xfffffffffffffffe; 4056 Except where noted, all these constants are defined in bytes. 
4058 o NFS4_FHSIZE is the maximum size of a filehandle. 4060 o NFS4_VERIFIER_SIZE is the fixed size of a verifier. 4062 o NFS4_OPAQUE_LIMIT is the maximum size of certain opaque 4063 information. 4065 o NFS4_SESSIONID_SIZE is the fixed size of a session identifier. 4067 o NFS4_INT64_MAX is the maximum value of a signed 64-bit integer. 4069 o NFS4_UINT64_MAX is the maximum value of an unsigned 64-bit 4070 integer. 4072 o NFS4_INT32_MAX is the maximum value of a signed 32-bit integer. 4074 o NFS4_UINT32_MAX is the maximum value of an unsigned 32-bit 4075 integer. 4077 o NFS4_MAXFILELEN is the maximum length of a regular file. 4079 o NFS4_MAXFILEOFF is the maximum offset into a regular file. 4081 3.2. Basic Data Types 4083 These are the base NFSv4.1 data types. 4085 +---------------+---------------------------------------------------+ 4086 | Data Type | Definition | 4087 +---------------+---------------------------------------------------+ 4088 | int32_t | typedef int int32_t; | 4089 | uint32_t | typedef unsigned int uint32_t; | 4090 | int64_t | typedef hyper int64_t; | 4091 | uint64_t | typedef unsigned hyper uint64_t; | 4092 | attrlist4 | typedef opaque attrlist4<>; | 4093 | | Used for file/directory attributes. | 4094 | bitmap4 | typedef uint32_t bitmap4<>; | 4095 | | Used in attribute array encoding. | 4096 | changeid4 | typedef uint64_t changeid4; | 4097 | | Used in the definition of change_info4. | 4098 | clientid4 | typedef uint64_t clientid4; | 4099 | | Shorthand reference to client identification. | 4100 | count4 | typedef uint32_t count4; | 4101 | | Various count parameters (READ, WRITE, COMMIT). | 4102 | length4 | typedef uint64_t length4; | 4103 | | The length of a byte-range within a file. | 4104 | mode4 | typedef uint32_t mode4; | 4105 | | Mode attribute data type. | 4106 | nfs_cookie4 | typedef uint64_t nfs_cookie4; | 4107 | | Opaque cookie value for READDIR. | 4108 | nfs_fh4 | typedef opaque nfs_fh4; | 4109 | | Filehandle definition. 
| 4110 | nfs_ftype4 | enum nfs_ftype4; | 4111 | | Various defined file types. | 4112 | nfsstat4 | enum nfsstat4; | 4113 | | Return value for operations. | 4114 | offset4 | typedef uint64_t offset4; | 4115 | | Various offset designations (READ, WRITE, LOCK, | 4116 | | COMMIT). | 4117 | qop4 | typedef uint32_t qop4; | 4118 | | Quality of protection designation in SECINFO. | 4119 | sec_oid4 | typedef opaque sec_oid4<>; | 4120 | | Security Object Identifier. The sec_oid4 data | 4121 | | type is not really opaque. Instead, it contains | 4122 | | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in | 4123 | | the mech_type argument to GSS_Init_sec_context. | 4124 | | See [7] for details. | 4125 | sequenceid4 | typedef uint32_t sequenceid4; | 4126 | | Sequence number used for various session | 4127 | | operations (EXCHANGE_ID, CREATE_SESSION, | 4128 | | SEQUENCE, CB_SEQUENCE). | 4129 | seqid4 | typedef uint32_t seqid4; | 4130 | | Sequence identifier used for locking. | 4131 | sessionid4 | typedef opaque sessionid4[NFS4_SESSIONID_SIZE]; | 4132 | | Session identifier. | 4133 | slotid4 | typedef uint32_t slotid4; | 4134 | | Sequencing artifact for various session | 4135 | | operations (SEQUENCE, CB_SEQUENCE). | 4136 | utf8string | typedef opaque utf8string<>; | 4137 | | UTF-8 encoding for strings. | 4138 | utf8str_cis | typedef utf8string utf8str_cis; | 4139 | | Case-insensitive UTF-8 string. | 4140 | utf8str_cs | typedef utf8string utf8str_cs; | 4141 | | Case-sensitive UTF-8 string. | 4142 | utf8str_mixed | typedef utf8string utf8str_mixed; | 4143 | | UTF-8 strings with a case-sensitive prefix and a | 4144 | | case-insensitive suffix. | 4145 | component4 | typedef utf8str_cs component4; | 4146 | | Represents pathname components. | 4147 | linktext4 | typedef utf8str_cs linktext4; | 4148 | | Symbolic link contents ("symbolic link" is | 4149 | | defined in an Open Group [11] standard). 
| 4150 | pathname4 | typedef component4 pathname4<>; | 4151 | | Represents pathname for fs_locations. | 4152 | verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; | 4153 | | Verifier used for various operations (COMMIT, | 4154 | | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE) | 4155 | | NFS4_VERIFIER_SIZE is defined as 8. | 4156 +---------------+---------------------------------------------------+ 4158 End of Base Data Types 4160 Table 1 4162 3.3. Structured Data Types 4164 3.3.1. nfstime4 4166 struct nfstime4 { 4167 int64_t seconds; 4168 uint32_t nseconds; 4169 }; 4171 The nfstime4 data type gives the number of seconds and nanoseconds 4172 since midnight or zero hour January 1, 1970 Coordinated Universal 4173 Time (UTC). Values greater than zero for the seconds field denote 4174 dates after the zero hour January 1, 1970. Values less than zero for 4175 the seconds field denote dates before the zero hour January 1, 1970. 4176 In both cases, the nseconds field is to be added to the seconds field 4177 for the final time representation. For example, if the time to be 4178 represented is one-half second before zero hour January 1, 1970, the 4179 seconds field would have a value of negative one (-1) and the 4180 nseconds field would have a value of one-half second (500000000). 4181 Values greater than 999,999,999 for nseconds are invalid. 4183 This data type is used to pass time and date information. A server 4184 converts to and from its local representation of time when processing 4185 time values, preserving as much accuracy as possible. If the 4186 precision of timestamps stored for a file system object is less than 4187 defined, loss of precision can occur. An adjunct time maintenance 4188 protocol is RECOMMENDED to reduce client and server time skew. 4190 3.3.2. time_how4 4192 enum time_how4 { 4193 SET_TO_SERVER_TIME4 = 0, 4194 SET_TO_CLIENT_TIME4 = 1 4195 }; 4197 3.3.3. 
settime4 4198 union settime4 switch (time_how4 set_it) { 4199 case SET_TO_CLIENT_TIME4: 4200 nfstime4 time; 4201 default: 4202 void; 4203 }; 4205 The time_how4 and settime4 data types are used for setting timestamps 4206 in file object attributes. If set_it is SET_TO_SERVER_TIME4, then 4207 the server uses its local representation of time for the time value. 4209 3.3.4. specdata4 4211 struct specdata4 { 4212 uint32_t specdata1; /* major device number */ 4213 uint32_t specdata2; /* minor device number */ 4214 }; 4216 This data type represents the device numbers for the device file 4217 types NF4CHR and NF4BLK. 4219 3.3.5. fsid4 4221 struct fsid4 { 4222 uint64_t major; 4223 uint64_t minor; 4224 }; 4226 3.3.6. change_policy4 4228 struct change_policy4 { 4229 uint64_t cp_major; 4230 uint64_t cp_minor; 4231 }; 4233 The change_policy4 data type is used for the change_policy 4234 RECOMMENDED attribute. It provides change sequencing indication 4235 analogous to the change attribute. To enable the server to present a 4236 value valid across server re-initialization without requiring 4237 persistent storage, two 64-bit quantities are used, allowing one to 4238 be a server instance ID and the second to be incremented non- 4239 persistently, within a given server instance. 4241 3.3.7. fattr4 4242 struct fattr4 { 4243 bitmap4 attrmask; 4244 attrlist4 attr_vals; 4245 }; 4247 The fattr4 data type is used to represent file and directory 4248 attributes. 4250 The bitmap is a counted array of 32-bit integers used to contain bit 4251 values. The position of the integer in the array that contains bit n 4252 can be computed from the expression (n / 32), and its bit within that 4253 integer is (n mod 32). 4255 0 1 4256 +-----------+-----------+-----------+-- 4257 | count | 31 .. 0 | 63 .. 32 | 4258 +-----------+-----------+-----------+-- 4260 3.3.8. 
change_info4 4262 struct change_info4 { 4263 bool atomic; 4264 changeid4 before; 4265 changeid4 after; 4266 }; 4268 This data type is used with the CREATE, LINK, OPEN, REMOVE, and 4269 RENAME operations to let the client know the value of the change 4270 attribute for the directory in which the target file system object 4271 resides. 4273 3.3.9. netaddr4 4275 struct netaddr4 { 4276 /* see struct rpcb in RFC 1833 */ 4277 string na_r_netid<>; /* network id */ 4278 string na_r_addr<>; /* universal address */ 4279 }; 4281 The netaddr4 data type is used to identify network transport 4282 endpoints. The r_netid and r_addr fields respectively contain a 4283 netid and uaddr. The netid and uaddr concepts are defined in [12]. 4284 The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are 4285 defined in [12], specifically Tables 2 and 3 and Sections 5.2.3.3 and 4286 5.2.3.4. 4288 3.3.10. state_owner4 4290 struct state_owner4 { 4291 clientid4 clientid; 4292 opaque owner; 4293 }; 4295 typedef state_owner4 open_owner4; 4296 typedef state_owner4 lock_owner4; 4298 The state_owner4 data type is the base type for the open_owner4 4299 (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2). 4301 3.3.10.1. open_owner4 4303 This data type is used to identify the owner of OPEN state. 4305 3.3.10.2. lock_owner4 4307 This structure is used to identify the owner of byte-range locking 4308 state. 4310 3.3.11. open_to_lock_owner4 4312 struct open_to_lock_owner4 { 4313 seqid4 open_seqid; 4314 stateid4 open_stateid; 4315 seqid4 lock_seqid; 4316 lock_owner4 lock_owner; 4317 }; 4319 This data type is used for the first LOCK operation done for an 4320 open_owner4. It provides both the open_stateid and lock_owner, such 4321 that the transition is made from a valid open_stateid sequence to 4322 that of the new lock_stateid sequence. 
Using this mechanism avoids 4323 the confirmation of the lock_owner/lock_seqid pair since it is tied 4324 to established state in the form of the open_stateid/open_seqid. 4326 3.3.12. stateid4 4328 struct stateid4 { 4329 uint32_t seqid; 4330 opaque other[12]; 4331 }; 4333 This data type is used for the various state sharing mechanisms 4334 between the client and server. The client never modifies a value of 4335 data type stateid. The starting value of the "seqid" field is 4336 undefined. The server is required to increment the "seqid" field by 4337 one at each transition of the stateid. This is important since the 4338 client will inspect the seqid in OPEN stateids to determine the order 4339 of OPEN processing done by the server. 4341 3.3.13. layouttype4 4343 enum layouttype4 { 4344 LAYOUT4_NFSV4_1_FILES = 0x1, 4345 LAYOUT4_OSD2_OBJECTS = 0x2, 4346 LAYOUT4_BLOCK_VOLUME = 0x3 4347 }; 4349 This data type indicates what type of layout is being used. The file 4350 server advertises the layout types it supports through the 4351 fs_layout_type file system attribute (Section 5.12.1). A client asks 4352 for layouts of a particular type in LAYOUTGET, and processes those 4353 layouts in its layout-type-specific logic. 4355 The layouttype4 data type is 32 bits in length. The range 4356 represented by the layout type is split into three parts. Type 0x0 4357 is reserved. Types within the range 0x00000001-0x7FFFFFFF are 4358 globally unique and are assigned according to the description in 4359 Section 22.5; they are maintained by IANA. Types within the range 4360 0x80000000-0xFFFFFFFF are site specific and for private use only. 4362 The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file 4363 layout type, as defined in Section 13, is to be used. The 4364 LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as 4365 defined in [43], is to be used. 
Similarly, the LAYOUT4_BLOCK_VOLUME 4366 enumeration specifies that the block/volume layout, as defined in 4367 [44], is to be used. 4369 3.3.14. deviceid4 4371 const NFS4_DEVICEID4_SIZE = 16; 4373 typedef opaque deviceid4[NFS4_DEVICEID4_SIZE]; 4375 Layout information includes device IDs that specify a storage device 4376 through a compact handle. Addressing and type information is 4377 obtained with the GETDEVICEINFO operation. Device IDs are not 4378 guaranteed to be valid across metadata server restarts. A device ID 4379 is unique per client ID and layout type. See Section 12.2.10 for 4380 more details. 4382 3.3.15. device_addr4 4384 struct device_addr4 { 4385 layouttype4 da_layout_type; 4386 opaque da_addr_body<>; 4387 }; 4389 The device address is used to set up a communication channel with the 4390 storage device. Different layout types will require different data 4391 types to define how they communicate with storage devices. The 4392 opaque da_addr_body field is interpreted based on the specified 4393 da_layout_type field. 4395 This document defines the device address for the NFSv4.1 file layout 4396 (see Section 13.3), which identifies a storage device by network IP 4397 address and port number. This is sufficient for the clients to 4398 communicate with the NFSv4.1 storage devices, and may be sufficient 4399 for other layout types as well. Device types for object-based 4400 storage devices and block storage devices (e.g., Small Computer 4401 System Interface (SCSI) volume labels) are defined by their 4402 respective layout specifications. 4404 3.3.16. layout_content4 4406 struct layout_content4 { 4407 layouttype4 loc_type; 4408 opaque loc_body<>; 4409 }; 4411 The loc_body field is interpreted based on the layout type 4412 (loc_type). This document defines the loc_body for the NFSv4.1 file 4413 layout type; see Section 13.3 for its definition. 4415 3.3.17. 
layout4 4417 struct layout4 { 4418 offset4 lo_offset; 4419 length4 lo_length; 4420 layoutiomode4 lo_iomode; 4421 layout_content4 lo_content; 4422 }; 4424 The layout4 data type defines a layout for a file. The layout type 4425 specific data is opaque within lo_content. Since layouts are sub- 4426 dividable, the offset and length together with the file's filehandle, 4427 the client ID, iomode, and layout type identify the layout. 4429 3.3.18. layoutupdate4 4431 struct layoutupdate4 { 4432 layouttype4 lou_type; 4433 opaque lou_body<>; 4434 }; 4436 The layoutupdate4 data type is used by the client to return updated 4437 layout information to the metadata server via the LAYOUTCOMMIT 4438 (Section 18.42) operation. This data type provides a channel to pass 4439 layout type specific information (in field lou_body) back to the 4440 metadata server. For example, for the block/volume layout type, this 4441 could include the list of reserved blocks that were written. The 4442 contents of the opaque lou_body argument are determined by the layout 4443 type. The NFSv4.1 file-based layout does not use this data type; if 4444 lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a 4445 zero length. 4447 3.3.19. layouthint4 4449 struct layouthint4 { 4450 layouttype4 loh_type; 4451 opaque loh_body<>; 4452 }; 4454 The layouthint4 data type is used by the client to pass in a hint 4455 about the type of layout it would like created for a particular file. 4456 It is the data type specified by the layout_hint attribute described 4457 in Section 5.12.4. The metadata server may ignore the hint or may 4458 selectively ignore fields within the hint. This hint should be 4459 provided at create time as part of the initial attributes within 4460 OPEN. The loh_body field is specific to the type of layout 4461 (loh_type). The NFSv4.1 file-based layout uses the 4462 nfsv4_1_file_layouthint4 data type as defined in Section 13.3. 4464 3.3.20. 
layoutiomode4 4466 enum layoutiomode4 { 4467 LAYOUTIOMODE4_READ = 1, 4468 LAYOUTIOMODE4_RW = 2, 4469 LAYOUTIOMODE4_ANY = 3 4470 }; 4472 The iomode specifies whether the client intends to just read or both 4473 read and write the data represented by the layout. While the 4474 LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the 4475 LAYOUTGET operation, it MAY be used in the arguments to the 4476 LAYOUTRETURN and CB_LAYOUTRECALL operations. The LAYOUTIOMODE4_ANY 4477 iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ 4478 and LAYOUTIOMODE4_RW iomodes are being returned or recalled, 4479 respectively. The metadata server's use of the iomode may depend on 4480 the layout type being used. The storage devices MAY validate I/O 4481 accesses against the iomode and reject invalid accesses. 4483 3.3.21. nfs_impl_id4 4485 struct nfs_impl_id4 { 4486 utf8str_cis nii_domain; 4487 utf8str_cs nii_name; 4488 nfstime4 nii_date; 4489 }; 4491 This data type is used to identify client and server implementation 4492 details. The nii_domain field is the DNS domain name with which the 4493 implementor is associated. The nii_name field is the product name of 4494 the implementation and is completely free form. It is RECOMMENDED 4495 that the nii_name be used to distinguish machine architecture, 4496 machine platforms, revisions, versions, and patch levels. The 4497 nii_date field is the timestamp of when the software instance was 4498 published or built. 4500 3.3.22. threshold_item4 4502 struct threshold_item4 { 4503 layouttype4 thi_layout_type; 4504 bitmap4 thi_hintset; 4505 opaque thi_hintlist<>; 4506 }; 4508 This data type contains a list of hints specific to a layout type for 4509 helping the client determine when it should send I/O directly through 4510 the metadata server versus the storage devices. 
The data type 4511 consists of the layout type (thi_layout_type), a bitmap (thi_hintset) 4512 describing the set of hints supported by the server (they may differ 4513 based on the layout type), and a list of hints (thi_hintlist) whose 4514 content is determined by the hintset bitmap. See the mdsthreshold 4515 attribute for more details. 4517 The thi_hintset field is a bitmap of the following values: 4519 +-------------------------+---+---------+---------------------------+ 4520 | name | # | Data | Description | 4521 | | | Type | | 4522 +-------------------------+---+---------+---------------------------+ 4523 | threshold4_read_size | 0 | length4 | If a file's length is | 4524 | | | | less than the value of | 4525 | | | | threshold4_read_size, | 4526 | | | | then it is RECOMMENDED | 4527 | | | | that the client read from | 4528 | | | | the file via the MDS and | 4529 | | | | not a storage device. | 4530 | threshold4_write_size | 1 | length4 | If a file's length is | 4531 | | | | less than the value of | 4532 | | | | threshold4_write_size, | 4533 | | | | then it is RECOMMENDED | 4534 | | | | that the client write to | 4535 | | | | the file via the MDS and | 4536 | | | | not a storage device. | 4537 | threshold4_read_iosize | 2 | length4 | For read I/O sizes below | 4538 | | | | this threshold, it is | 4539 | | | | RECOMMENDED to read data | 4540 | | | | through the MDS. | 4541 | threshold4_write_iosize | 3 | length4 | For write I/O sizes below | 4542 | | | | this threshold, it is | 4543 | | | | RECOMMENDED to write data | 4544 | | | | through the MDS. | 4545 +-------------------------+---+---------+---------------------------+ 4547 3.3.23. mdsthreshold4 4549 struct mdsthreshold4 { 4550 threshold_item4 mth_hints<>; 4551 }; 4553 This data type holds an array of elements of data type 4554 threshold_item4, each of which is valid for a particular layout type. 4555 An array is necessary because a server can support multiple layout 4556 types for a single file. 4558 4. 
Filehandles 4560 The filehandle in the NFS protocol is a per-server unique identifier 4561 for a file system object. The contents of the filehandle are opaque 4562 to the client. Therefore, the server is responsible for translating 4563 the filehandle to an internal representation of the file system 4564 object. 4566 4.1. Obtaining the First Filehandle 4568 The operations of the NFS protocol are defined in terms of one or 4569 more filehandles. Therefore, the client needs a filehandle to 4570 initiate communication with the server. With the NFSv3 protocol (RFC 4571 1813 [34]), there exists an ancillary protocol to obtain this first 4572 filehandle. The MOUNT protocol, RPC program number 100005, provides 4573 the mechanism of translating a string-based file system pathname to a 4574 filehandle, which can then be used by the NFS protocols. 4576 The MOUNT protocol has deficiencies in the area of security and use 4577 via firewalls. This is one reason that the use of the public 4578 filehandle was introduced in RFC 2054 [45] and RFC 2055 [46]. With 4579 the use of the public filehandle in combination with the LOOKUP 4580 operation in the NFSv3 protocol, it has been demonstrated that the 4581 MOUNT protocol is unnecessary for viable interaction between NFS 4582 client and server. 4584 Therefore, the NFSv4.1 protocol will not use an ancillary protocol 4585 for translation from string-based pathnames to a filehandle. Two 4586 special filehandles will be used as starting points for the NFS 4587 client. 4589 4.1.1. Root Filehandle 4591 The first of the special filehandles is the ROOT filehandle. The 4592 ROOT filehandle is the "conceptual" root of the file system namespace 4593 at the NFS server. The client uses or starts with the ROOT 4594 filehandle by employing the PUTROOTFH operation. The PUTROOTFH 4595 operation instructs the server to set the "current" filehandle to the 4596 ROOT of the server's file tree. 
Once this PUTROOTFH operation is 4597 used, the client can then traverse the entirety of the server's file 4598 tree with the LOOKUP operation. A complete discussion of the server 4599 namespace is in Section 7. 4601 4.1.2. Public Filehandle 4603 The second special filehandle is the PUBLIC filehandle. Unlike the 4604 ROOT filehandle, the PUBLIC filehandle may be bound to, or may represent, an 4605 arbitrary file system object at the server. The server is 4606 responsible for this binding. It may be that the PUBLIC filehandle 4607 and the ROOT filehandle refer to the same file system object. 4608 However, it is up to the administrative software at the server and 4609 the policies of the server administrator to define the binding of the 4610 PUBLIC filehandle and server file system object. The client may not 4611 make any assumptions about this binding. The client uses the PUBLIC 4612 filehandle via the PUTPUBFH operation. 4614 4.2. Filehandle Types 4616 In the NFSv3 protocol, there was one type of filehandle with a single 4617 set of semantics. This type of filehandle is termed "persistent" in 4618 NFSv4.1. The semantics of a persistent filehandle remain the same as 4619 before. A new type of filehandle introduced in NFSv4.1 is the 4620 "volatile" filehandle, which attempts to accommodate certain server 4621 environments. 4623 The volatile filehandle type was introduced to address server 4624 functionality or implementation issues that make correct 4625 implementation of a persistent filehandle infeasible. Some server 4626 environments do not provide a file-system-level invariant that can be 4627 used to construct a persistent filehandle. The underlying server 4628 file system may not provide the invariant or the server's file system 4629 programming interfaces may not provide access to the needed 4630 invariant.
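The "file-system-level invariant" mentioned above can be made concrete with a small sketch: a server whose file system exposes a stable (fsid, fileid, generation) triple can pack those values into an opaque persistent filehandle. The field layout below is purely hypothetical and is not any format defined by NFSv4.1; it only illustrates why a stable invariant is a prerequisite for persistence.

```python
import struct

# Hypothetical invariants a server-side file system might expose.  If the
# underlying file system cannot supply a stable (fsid, fileid, generation)
# triple, a persistent filehandle cannot be built this way -- which is
# exactly the situation the volatile filehandle type addresses.
def make_persistent_fh(fsid_major, fsid_minor, fileid, generation):
    # Pack the invariants into an opaque byte string; the client never
    # interprets these bytes, it only compares them for equality.
    return struct.pack(">QQQI", fsid_major, fsid_minor, fileid, generation)

fh1 = make_persistent_fh(1, 0, 12345, 7)
fh2 = make_persistent_fh(1, 0, 12345, 7)
fh3 = make_persistent_fh(1, 0, 99, 1)

# Byte-by-byte equality is the only comparison a client may perform.
assert fh1 == fh2
assert fh1 != fh3
```

Because the triple is stable for the lifetime of the object, the same bytes come back after a server restart with no per-handle state kept by the server.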
Volatile filehandles may ease the implementation of 4631 server functionality such as hierarchical storage management or file 4632 system reorganization or migration. However, the volatile filehandle 4633 increases the implementation burden for the client. 4635 Since the client will need to handle persistent and volatile 4636 filehandles differently, a file attribute is defined that may be used 4637 by the client to determine the filehandle types being returned by the 4638 server. 4640 4.2.1. General Properties of a Filehandle 4642 The filehandle contains all the information the server needs to 4643 distinguish an individual file. To the client, the filehandle is 4644 opaque. The client stores filehandles for use in a later request and 4645 can compare two filehandles from the same server for equality by 4646 doing a byte-by-byte comparison. However, the client MUST NOT 4647 otherwise interpret the contents of filehandles. If two filehandles 4648 from the same server are equal, they MUST refer to the same file. 4649 Servers SHOULD try to maintain a one-to-one correspondence between 4650 filehandles and files, but this is not required. Clients MUST use 4651 filehandle comparisons only to improve performance, not for correct 4652 behavior. All clients need to be prepared for situations in which it 4653 cannot be determined whether two filehandles denote the same object 4654 and in such cases, avoid making invalid assumptions that might cause 4655 incorrect behavior. Further discussion of filehandle and attribute 4656 comparison in the context of data caching is presented in 4657 Section 10.3.4. 4659 As an example, in the case that two different pathnames when 4660 traversed at the server terminate at the same file system object, the 4661 server SHOULD return the same filehandle for each path. This can 4662 occur if a hard link (see [6]) is used to create two file names that 4663 refer to the same underlying file object and associated data. 
For 4664 example, if paths /a/b/c and /a/d/c refer to the same file, the 4665 server SHOULD return the same filehandle for both pathnames' 4666 traversals. 4668 4.2.2. Persistent Filehandle 4670 A persistent filehandle is defined as having a fixed value for the 4671 lifetime of the file system object to which it refers. Once the 4672 server creates the filehandle for a file system object, the server 4673 MUST accept the same filehandle for the object for the lifetime of 4674 the object. If the server restarts, the NFS server MUST honor the 4675 same filehandle value as it did in the server's previous 4676 instantiation. Similarly, if the file system is migrated, the new 4677 NFS server MUST honor the same filehandle as the old NFS server. 4679 The persistent filehandle will become stale or invalid when the 4680 file system object is removed. When the server is presented with a 4681 persistent filehandle that refers to a deleted object, it MUST return 4682 an error of NFS4ERR_STALE. A filehandle may become stale when the 4683 file system containing the object is no longer available. The file 4684 system may become unavailable if it exists on removable media and the 4685 media is no longer available at the server, if the file system as a 4686 whole has been destroyed, or if the file system has simply been removed 4687 from the server's namespace (i.e., unmounted in a UNIX environment). 4689 4.2.3. Volatile Filehandle 4691 A volatile filehandle does not share the same longevity 4692 characteristics as a persistent filehandle. The server may determine 4693 that a volatile filehandle is no longer valid at many different 4694 points in time. If the server can definitively determine that a 4695 volatile filehandle refers to an object that has been removed, the 4696 server should return NFS4ERR_STALE to the client (as is the case for 4697 persistent filehandles).
In all other cases where the server 4698 determines that a volatile filehandle can no longer be used, it 4699 should return an error of NFS4ERR_FHEXPIRED. 4701 The REQUIRED attribute "fh_expire_type" is used by the client to 4702 determine what type of filehandle the server is providing for a 4703 particular file system. This attribute is a bitmask with the 4704 following values: 4706 FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a 4707 persistent filehandle, which is valid until the object is removed 4708 from the file system. The server will not return 4709 NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined 4710 as a value in which none of the bits specified below are set. 4712 FH4_VOLATILE_ANY The filehandle may expire at any time, except as 4713 specifically excluded (i.e., FH4_NOEXPIRE_WITH_OPEN). 4715 FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set. 4716 If this bit is set, then the meaning of FH4_VOLATILE_ANY is 4717 qualified to exclude any expiration of the filehandle when it is 4718 open. 4720 FH4_VOL_MIGRATION The filehandle will expire as a result of a file 4721 system transition (migration or replication), in those cases in 4722 which the continuity of filehandle use is not specified by handle 4723 class information within the fs_locations_info attribute. When 4724 this bit is set, clients without access to fs_locations_info 4725 information should assume that filehandles will expire on file 4726 system transitions. 4728 FH4_VOL_RENAME The filehandle will expire during rename. This 4729 includes a rename by the requesting client or a rename by any 4730 other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant. 4732 Servers that provide volatile filehandles that can expire while open 4733 require special care as regards handling of RENAMEs and REMOVEs.
4734 This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is 4735 set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not 4736 set, or if a non-read-only file system has a transition target in a 4737 different handle class. In these cases, the server should deny a 4738 RENAME or REMOVE that would affect an OPEN file of any of the 4739 components leading to the OPEN file. In addition, the server should 4740 deny all RENAME or REMOVE requests during the grace period, in order 4741 to make sure that reclaims of files where filehandles may have 4742 expired do not do a reclaim for the wrong file. 4744 Volatile filehandles are especially suitable for implementation of 4745 the pseudo file systems used to bridge exports. See Section 7.5 for 4746 a discussion of this. 4748 4.3. One Method of Constructing a Volatile Filehandle 4750 A volatile filehandle, while opaque to the client, could contain: 4752 [volatile bit = 1 | server boot time | slot | generation number] 4753 o slot is an index in the server volatile filehandle table 4755 o generation number is the generation number for the table entry/ 4756 slot 4758 When the client presents a volatile filehandle, the server makes the 4759 following checks, which assume that the check for the volatile bit 4760 has passed. If the server boot time encoded in the filehandle is less 4761 than the current server boot time, return NFS4ERR_FHEXPIRED. If slot 4762 is out of range, return NFS4ERR_BADHANDLE. If the generation number 4763 does not match, return NFS4ERR_FHEXPIRED. 4765 When the server restarts, the table is gone (it is volatile). 4767 If the volatile bit is 0, then it is a persistent filehandle with a 4768 different structure following it. 4770 4.4. Client Recovery from Filehandle Expiration 4772 If possible, the client SHOULD recover from the receipt of an 4773 NFS4ERR_FHEXPIRED error.
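The checks that Section 4.3 walks through can be sketched in a few lines. The packed field layout, the table representation, and the function name below are all hypothetical; only the decision logic and the error codes (whose numeric values are those assigned by the protocol) come from the text.

```python
import struct

NFS4_OK = 0
NFS4ERR_BADHANDLE = 10001
NFS4ERR_FHEXPIRED = 10011

def check_volatile_fh(fh, current_boot_time, table):
    # Hypothetical layout: [volatile bit | server boot time | slot | generation]
    volatile, boot_time, slot, generation = struct.unpack(">BQII", fh)
    assert volatile == 1  # the caller has already checked the volatile bit
    if boot_time < current_boot_time:
        return NFS4ERR_FHEXPIRED   # handle predates this server instance
    if slot >= len(table):
        return NFS4ERR_BADHANDLE   # slot out of range
    if table[slot] != generation:
        return NFS4ERR_FHEXPIRED   # table entry/slot was reused
    return NFS4_OK

table = [3, 7]   # generation number per slot; lost on server restart
boot = 1000
fh_ok = struct.pack(">BQII", 1, 1000, 1, 7)
fh_old = struct.pack(">BQII", 1, 900, 1, 7)
fh_bad_slot = struct.pack(">BQII", 1, 1000, 5, 7)

assert check_volatile_fh(fh_ok, boot, table) == NFS4_OK
assert check_volatile_fh(fh_old, boot, table) == NFS4ERR_FHEXPIRED
assert check_volatile_fh(fh_bad_slot, boot, table) == NFS4ERR_BADHANDLE
```

Because the table is held only in memory, a restart empties it and every outstanding volatile filehandle fails the boot-time check, which is the intended behavior.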
The client must take on additional 4774 responsibility so that it may prepare itself to recover from the 4775 expiration of a volatile filehandle. If the server returns 4776 persistent filehandles, the client does not need these additional 4777 steps. 4779 For volatile filehandles, most commonly the client will need to store 4780 the component names leading up to and including the file system 4781 object in question. With these names, the client should be able to 4782 recover by finding a filehandle in the namespace that is still 4783 available or by starting at the root of the server's file system 4784 namespace. 4786 If the expired filehandle refers to an object that has been removed 4787 from the file system, obviously the client will not be able to 4788 recover from the expired filehandle. 4790 It is also possible that the expired filehandle refers to a file that 4791 has been renamed. If the file was renamed by another client, again 4792 it is possible that the original client will not be able to recover. 4793 However, in the case that the client itself is renaming the file and 4794 the file is open, it is possible that the client may be able to 4795 recover. The client can determine the new pathname based on the 4796 processing of the rename request. The client can then regenerate the 4797 new filehandle based on the new pathname. The client could also use 4798 the COMPOUND procedure to construct a series of operations like: 4800 RENAME A B 4801 LOOKUP B 4802 GETFH 4804 Note that the COMPOUND procedure does not provide atomicity. This 4805 example only reduces the overhead of recovering from an expired 4806 filehandle. 4808 5. File Attributes 4810 To meet the requirements of extensibility and increased 4811 interoperability with non-UNIX platforms, attributes need to be 4812 handled in a flexible manner. The NFSv3 fattr3 structure contains a 4813 fixed list of attributes that not all clients and servers are able to 4814 support or care about. 
The fattr3 structure cannot be extended as 4815 new needs arise and it provides no way to indicate non-support. With 4816 the NFSv4.1 protocol, the client is able to query what attributes the 4817 server supports and construct requests with only those supported 4818 attributes (or a subset thereof). 4820 To this end, attributes are divided into three groups: REQUIRED, 4821 RECOMMENDED, and named. Both REQUIRED and RECOMMENDED attributes are 4822 supported in the NFSv4.1 protocol by a specific and well-defined 4823 encoding and are identified by number. They are requested by setting 4824 a bit in the bit vector sent in the GETATTR request; the server 4825 response includes a bit vector to list what attributes were returned 4826 in the response. New REQUIRED or RECOMMENDED attributes may be added 4827 to the NFSv4 protocol as part of a new minor version by publishing a 4828 Standards Track RFC that allocates a new attribute number value and 4829 defines the encoding for the attribute. See Section 2.7 for further 4830 discussion. 4832 Named attributes are accessed by the new OPENATTR operation, which 4833 accesses a hidden directory of attributes associated with a file 4834 system object. OPENATTR takes a filehandle for the object and 4835 returns the filehandle for the attribute hierarchy. The filehandle 4836 for the named attributes is a directory object accessible by LOOKUP 4837 or READDIR and contains files whose names represent the named 4838 attributes and whose data bytes are the value of the attribute. 
For 4839 example: 4841 +----------+-----------+---------------------------------+ 4842 | LOOKUP | "foo" | ; look up file | 4843 | GETATTR | attrbits | | 4844 | OPENATTR | | ; access foo's named attributes | 4845 | LOOKUP | "x11icon" | ; look up specific attribute | 4846 | READ | 0,4096 | ; read stream of bytes | 4847 +----------+-----------+---------------------------------+ 4849 Named attributes are intended for data needed by applications rather 4850 than by an NFS client implementation. NFS implementors are strongly 4851 encouraged to define their new attributes as RECOMMENDED attributes 4852 by bringing them to the IETF Standards Track process. 4854 The set of attributes that are classified as REQUIRED is deliberately 4855 small since servers need to do whatever it takes to support them. A 4856 server should support as many of the RECOMMENDED attributes as 4857 possible but, by their definition, the server is not required to 4858 support all of them. Attributes are deemed REQUIRED if the data is 4859 both needed by a large number of clients and is not otherwise 4860 reasonably computable by the client when support is not provided on 4861 the server. 4863 Note that the hidden directory returned by OPENATTR is a convenience 4864 for protocol processing. The client should not make any assumptions 4865 about the server's implementation of named attributes and whether or 4866 not the underlying file system at the server has a named attribute 4867 directory. Therefore, operations such as SETATTR and GETATTR on the 4868 named attribute directory are undefined. 4870 5.1. REQUIRED Attributes 4872 These MUST be supported by every NFSv4.1 client and server in order 4873 to ensure a minimum level of interoperability. The server MUST store 4874 and return these attributes, and the client MUST be able to function 4875 with an attribute set limited to these attributes. 
With just the 4876 REQUIRED attributes, some client functionality may be impaired or 4877 limited in some ways. A client may ask for any of these attributes 4878 to be returned by setting a bit in the GETATTR request, and the 4879 server MUST return their value. 4881 5.2. RECOMMENDED Attributes 4883 These attributes are understood well enough to warrant support in the 4884 NFSv4.1 protocol. However, they may not be supported on all clients 4885 and servers. A client may ask for any of these attributes to be 4886 returned by setting a bit in the GETATTR request but must handle the 4887 case where the server does not return them. A client MAY ask for the 4888 set of attributes the server supports and SHOULD NOT request 4889 attributes the server does not support. A server should be tolerant 4890 of requests for unsupported attributes and simply not return them 4891 rather than considering the request an error. It is expected that 4892 servers will support all attributes they comfortably can and only 4893 fail to support attributes that are difficult to support in their 4894 operating environments. A server should provide attributes whenever 4895 it does not have to "tell lies" to the client. For example, a file 4896 modification time should be either an accurate time or should not be 4897 supported by the server. At times this will be difficult for 4898 clients, but a client is better positioned to decide whether and how 4899 to fabricate or construct an attribute or whether to do without the 4900 attribute.
The OPENATTR operation 4909 returns a filehandle for a virtual "named attribute directory", and 4910 further perusal and modification of the namespace may be done using 4911 operations that work on more typical directories. In particular, 4912 READDIR may be used to get a list of such named attributes, and 4913 LOOKUP and OPEN may select a particular attribute. Creation of a new 4914 named attribute may be the result of an OPEN specifying file 4915 creation. 4917 Once an OPEN is done, named attributes may be examined and changed by 4918 normal READ and WRITE operations using the filehandles and stateids 4919 returned by OPEN. 4921 Named attributes and the named attribute directory may have their own 4922 (non-named) attributes. Each of these objects MUST have all of the 4923 REQUIRED attributes and may have additional RECOMMENDED attributes. 4924 However, the set of attributes for named attributes and the named 4925 attribute directory need not be, and typically will not be, as large 4926 as that for other objects in that file system. 4928 Named attributes and the named attribute directory might be the 4929 target of delegations (in the case of the named attribute directory, 4930 these will be directory delegations). However, since granting 4931 delegations is at the server's discretion, a server need not support 4932 delegations on named attributes or the named attribute directory. 4934 It is RECOMMENDED that servers support arbitrary named attributes. A 4935 client should not depend on the ability to store any named attributes 4936 in the server's file system. If a server does support named 4937 attributes, a client that is also able to handle them should be able 4938 to copy a file's data and metadata with complete transparency from 4939 one location to another; this would imply that names allowed for 4940 regular directory entries are valid for named attribute names as 4941 well. 
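The transparent copy described in the last paragraph can be pictured with a toy model of the data involved. The class and function names below are invented for illustration and correspond to no protocol construct; the point is only that a named attribute directory behaves as a flat mapping of name to opaque bytes that travels with the object.

```python
# A toy model of the named attribute concept: each file system object
# carries a flat mapping of attribute name -> opaque bytes, which is what
# the hidden directory returned by OPENATTR exposes.  Illustration only.
class FsObject:
    def __init__(self, data=b""):
        self.data = data
        self.named_attrs = {}   # name -> uninterpreted stream of bytes

def copy_with_metadata(src):
    # Copy both the file data and the named attributes, the "complete
    # transparency" envisaged in the paragraph above.
    dst = FsObject(src.data)
    dst.named_attrs = dict(src.named_attrs)
    return dst

f = FsObject(b"payload")
f.named_attrs["x11icon"] = b"\x89..."   # name and value are arbitrary
g = copy_with_metadata(f)
assert g.data == f.data and g.named_attrs == f.named_attrs
```

Note that the mapping is deliberately flat: as the restrictions below spell out, no hierarchy of named attributes under a single object is permitted.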
4943 In NFSv4.1, the structure of named attribute directories is 4944 restricted in a number of ways, in order to prevent the development 4945 of non-interoperable implementations in which some servers support a 4946 fully general hierarchical directory structure for named attributes 4947 while others support a limited but adequate structure for named 4948 attributes. In such an environment, clients or applications might 4949 come to depend on non-portable extensions. The restrictions are: 4951 o CREATE is not allowed in a named attribute directory. Thus, such 4952 objects as symbolic links and special files are not allowed to be 4953 named attributes. Further, directories may not be created in a 4954 named attribute directory, so no hierarchical structure of named 4955 attributes for a single object is allowed. 4957 o If OPENATTR is done on a named attribute directory or on a named 4958 attribute, the server MUST return NFS4ERR_WRONG_TYPE. 4960 o Doing a RENAME of a named attribute to a different named attribute 4961 directory or to an ordinary (i.e., non-named-attribute) directory 4962 is not allowed. 4964 o Creating hard links between named attribute directories or between 4965 named attribute directories and ordinary directories is not 4966 allowed. 4968 Names of attributes will not be controlled by this document or other 4969 IETF Standards Track documents. See Section 22.2 for further 4970 discussion. 4972 5.4. Classification of Attributes 4974 Each of the REQUIRED and RECOMMENDED attributes can be classified in 4975 one of three categories: per server (i.e., the value of the attribute 4976 will be the same for all file objects that share the same server 4977 owner; see Section 2.5 for a definition of server owner), per file 4978 system (i.e., the value of the attribute will be the same for some or 4979 all file objects that share the same fsid attribute (Section 5.8.1.9) 4980 and server owner), or per file system object. 
Note that it is possible that some per file system attributes may
vary within the file system, depending on the value of the
"homogeneous" (Section 5.8.2.16) attribute.  Note that the attributes
time_access_set and time_modify_set are not listed in this section
because they are write-only attributes corresponding to time_access
and time_modify, and are used in a special instance of SETATTR.

o  The per-server attribute is:

      lease_time

o  The per-file system attributes are:

      supported_attrs, suppattr_exclcreat, fh_expire_type,
      link_support, symlink_support, unique_handles, aclsupport,
      cansettime, case_insensitive, case_preserving,
      chown_restricted, files_avail, files_free, files_total,
      fs_locations, homogeneous, maxfilesize, maxname, maxread,
      maxwrite, no_trunc, space_avail, space_free, space_total,
      time_delta, change_policy, fs_status, fs_layout_type,
      fs_locations_info, fs_charset_cap

o  The per-file system object attributes are:

      type, change, size, named_attr, fsid, rdattr_error, filehandle,
      acl, archive, fileid, hidden, maxlink, mimetype, mode,
      numlinks, owner, owner_group, rawdev, space_used, system,
      time_access, time_backup, time_create, time_metadata,
      time_modify, mounted_on_fileid, dir_notif_delay,
      dirent_notif_delay, dacl, sacl, layout_type, layout_hint,
      layout_blksize, layout_alignment, mdsthreshold, retention_get,
      retention_set, retentevt_get, retentevt_set, retention_hold,
      mode_set_masked

For quota_avail_hard, quota_avail_soft, and quota_used, see their
definitions below for the appropriate classification.

5.5.  Set-Only and Get-Only Attributes

Some REQUIRED and RECOMMENDED attributes are set-only; i.e., they can
be set via SETATTR but not retrieved via GETATTR.  Similarly, some
REQUIRED and RECOMMENDED attributes are get-only; i.e., they can be
retrieved via GETATTR but not set via SETATTR.
If a client attempts to set a get-only attribute or get a set-only
attribute, the server MUST return NFS4ERR_INVAL.

5.6.  REQUIRED Attributes - List and Definition References

The list of REQUIRED attributes appears in Table 2.  The meanings of
the columns of the table are:

o  Name: The name of the attribute.

o  Id: The number assigned to the attribute.  In the event of
   conflicts between the assigned number and [10], the latter is
   likely authoritative, but should be resolved with Errata to this
   document and/or [10].  See [47] for the Errata process.

o  Data Type: The XDR data type of the attribute.

o  Acc: Access allowed to the attribute.  R means read-only (GETATTR
   may retrieve, SETATTR may not set).  W means write-only (SETATTR
   may set, GETATTR may not retrieve).  R W means read/write (GETATTR
   may retrieve, SETATTR may set).

o  Defined in: The section of this specification that describes the
   attribute.

   +--------------------+----+------------+-----+-------------------+
   | Name               | Id | Data Type  | Acc | Defined in:       |
   +--------------------+----+------------+-----+-------------------+
   | supported_attrs    | 0  | bitmap4    | R   | Section 5.8.1.1   |
   | type               | 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
   | fh_expire_type     | 2  | uint32_t   | R   | Section 5.8.1.3   |
   | change             | 3  | uint64_t   | R   | Section 5.8.1.4   |
   | size               | 4  | uint64_t   | R W | Section 5.8.1.5   |
   | link_support       | 5  | bool       | R   | Section 5.8.1.6   |
   | symlink_support    | 6  | bool       | R   | Section 5.8.1.7   |
   | named_attr         | 7  | bool       | R   | Section 5.8.1.8   |
   | fsid               | 8  | fsid4      | R   | Section 5.8.1.9   |
   | unique_handles     | 9  | bool       | R   | Section 5.8.1.10  |
   | lease_time         | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
   | rdattr_error       | 11 | enum       | R   | Section 5.8.1.12  |
   | filehandle         | 19 | nfs_fh4    | R   | Section 5.8.1.13  |
   | suppattr_exclcreat | 75 | bitmap4    | R   | Section 5.8.1.14  |
   +--------------------+----+------------+-----+-------------------+

                                Table 2

5.7.  RECOMMENDED Attributes - List and Definition References

The RECOMMENDED attributes are defined in Table 3.  The meanings of
the column headers are the same as Table 2; see Section 5.6 for the
meanings.

  +--------------------+----+----------------+-----+------------------+
  | Name               | Id | Data Type      | Acc | Defined in:      |
  +--------------------+----+----------------+-----+------------------+
  | acl                | 12 | nfsace4<>      | R W | Section 6.2.1    |
  | aclsupport         | 13 | uint32_t       | R   | Section 6.2.1.2  |
  | archive            | 14 | bool           | R W | Section 5.8.2.1  |
  | cansettime         | 15 | bool           | R   | Section 5.8.2.2  |
  | case_insensitive   | 16 | bool           | R   | Section 5.8.2.3  |
  | case_preserving    | 17 | bool           | R   | Section 5.8.2.4  |
  | change_policy      | 60 | chg_policy4    | R   | Section 5.8.2.5  |
  | chown_restricted   | 18 | bool           | R   | Section 5.8.2.6  |
  | dacl               | 58 | nfsacl41       | R W | Section 6.2.2    |
  | dir_notif_delay    | 56 | nfstime4       | R   | Section 5.11.1   |
  | dirent_notif_delay | 57 | nfstime4       | R   | Section 5.11.2   |
  | fileid             | 20 | uint64_t       | R   | Section 5.8.2.7  |
  | files_avail        | 21 | uint64_t       | R   | Section 5.8.2.8  |
  | files_free         | 22 | uint64_t       | R   | Section 5.8.2.9  |
  | files_total        | 23 | uint64_t       | R   | Section 5.8.2.10 |
  | fs_charset_cap     | 76 | uint32_t       | R   | Section 5.8.2.11 |
  | fs_layout_type     | 62 | layouttype4<>  | R   | Section 5.12.1   |
  | fs_locations       | 24 | fs_locations   | R   | Section 5.8.2.12 |
  | fs_locations_info  | 67 | *              | R   | Section 5.8.2.13 |
  | fs_status          | 61 | fs4_status     | R   | Section 5.8.2.14 |
  | hidden             | 25 | bool           | R W | Section 5.8.2.15 |
  | homogeneous        | 26 | bool           | R   | Section 5.8.2.16 |
  | layout_alignment   | 66 | uint32_t       | R   | Section 5.12.2   |
  | layout_blksize     | 65 | uint32_t       | R   | Section 5.12.3   |
  | layout_hint        | 63 | layouthint4    | W   | Section 5.12.4   |
  | layout_type        | 64 | layouttype4<>  | R   | Section 5.12.5   |
  | maxfilesize        | 27 | uint64_t       | R   | Section 5.8.2.17 |
  | maxlink            | 28 | uint32_t       | R   | Section 5.8.2.18 |
  | maxname            | 29 | uint32_t       | R   | Section 5.8.2.19 |
  | maxread            | 30 | uint64_t       | R   | Section 5.8.2.20 |
  | maxwrite           | 31 | uint64_t       | R   | Section 5.8.2.21 |
  | mdsthreshold       | 68 | mdsthreshold4  | R   | Section 5.12.6   |
  | mimetype           | 32 | utf8str_cs     | R W | Section 5.8.2.22 |
  | mode               | 33 | mode4          | R W | Section 6.2.4    |
  | mode_set_masked    | 74 | mode_masked4   | W   | Section 6.2.5    |
  | mounted_on_fileid  | 55 | uint64_t       | R   | Section 5.8.2.23 |
  | no_trunc           | 34 | bool           | R   | Section 5.8.2.24 |
  | numlinks           | 35 | uint32_t       | R   | Section 5.8.2.25 |
  | owner              | 36 | utf8str_mixed  | R W | Section 5.8.2.26 |
  | owner_group        | 37 | utf8str_mixed  | R W | Section 5.8.2.27 |
  | quota_avail_hard   | 38 | uint64_t       | R   | Section 5.8.2.28 |
  | quota_avail_soft   | 39 | uint64_t       | R   | Section 5.8.2.29 |
  | quota_used         | 40 | uint64_t       | R   | Section 5.8.2.30 |
  | rawdev             | 41 | specdata4      | R   | Section 5.8.2.31 |
  | retentevt_get      | 71 | retention_get4 | R   | Section 5.13.3   |
  | retentevt_set      | 72 | retention_set4 | W   | Section 5.13.4   |
  | retention_get      | 69 | retention_get4 | R   | Section 5.13.1   |
  | retention_hold     | 73 | uint64_t       | R W | Section 5.13.5   |
  | retention_set      | 70 | retention_set4 | W   | Section 5.13.2   |
  | sacl               | 59 | nfsacl41       | R W | Section 6.2.3    |
  | space_avail        | 42 | uint64_t       | R   | Section 5.8.2.32 |
  | space_free         | 43 | uint64_t       | R   | Section 5.8.2.33 |
  | space_total        | 44 | uint64_t       | R   | Section 5.8.2.34 |
  | space_used         | 45 | uint64_t       | R   | Section 5.8.2.35 |
  | system             | 46 | bool           | R W | Section 5.8.2.36 |
  | time_access        | 47 | nfstime4       | R   | Section 5.8.2.37 |
  | time_access_set    | 48 | settime4       | W   | Section 5.8.2.38 |
  | time_backup        | 49 | nfstime4       | R W | Section 5.8.2.39 |
  | time_create        | 50 | nfstime4       | R W | Section 5.8.2.40 |
  | time_delta         | 51 | nfstime4       | R   | Section 5.8.2.41 |
  | time_metadata      | 52 | nfstime4       | R   | Section 5.8.2.42 |
  | time_modify        | 53 | nfstime4       | R   | Section 5.8.2.43 |
  | time_modify_set    | 54 | settime4       | W   | Section 5.8.2.44 |
  +--------------------+----+----------------+-----+------------------+

                                Table 3

   * fs_locations_info4

5.8.  Attribute Definitions

5.8.1.  Definitions of REQUIRED Attributes

5.8.1.1.  Attribute 0: supported_attrs

The bit vector that would retrieve all REQUIRED and RECOMMENDED
attributes that are supported for this object.  The scope of this
attribute applies to all objects with a matching fsid.

5.8.1.2.  Attribute 1: type

Designates the type of an object in terms of one of a number of
special constants:

o  NF4REG designates a regular file.

o  NF4DIR designates a directory.

o  NF4BLK designates a block device special file.

o  NF4CHR designates a character device special file.

o  NF4LNK designates a symbolic link.

o  NF4SOCK designates a named socket special file.

o  NF4FIFO designates a fifo special file.

o  NF4ATTRDIR designates a named attribute directory.

o  NF4NAMEDATTR designates a named attribute.

Within the explanatory text and operation descriptions, the following
phrases will be used with the meanings given below:

o  The phrase "is a directory" means that the object's type attribute
   is NF4DIR or NF4ATTRDIR.

o  The phrase "is a special file" means that the object's type
   attribute is NF4BLK, NF4CHR, NF4SOCK, or NF4FIFO.

o  The phrases "is an ordinary file" and "is a regular file" mean
   that the object's type attribute is NF4REG or NF4NAMEDATTR.

5.8.1.3.  Attribute 2: fh_expire_type

The server uses this to specify filehandle expiration behavior to the
client.  See Section 4 for additional description.

5.8.1.4.  Attribute 3: change

A value created by the server that the client can use to determine if
file data, directory contents, or attributes of the object have been
modified.  The server may return the object's time_metadata attribute
for this attribute's value, but only if the file system object cannot
be updated more frequently than the resolution of time_metadata.

5.8.1.5.  Attribute 4: size

The size of the object in bytes.

5.8.1.6.  Attribute 5: link_support

TRUE, if the object's file system supports hard links.

5.8.1.7.  Attribute 6: symlink_support

TRUE, if the object's file system supports symbolic links.

5.8.1.8.  Attribute 7: named_attr

TRUE, if this object has named attributes.  In other words, this
object has a non-empty named attribute directory.

5.8.1.9.  Attribute 8: fsid

Unique file system identifier for the file system holding this
object.  The fsid attribute has major and minor components, each of
which are of data type uint64_t.

5.8.1.10.  Attribute 9: unique_handles

TRUE, if two distinct filehandles are guaranteed to refer to two
different file system objects.

5.8.1.11.  Attribute 10: lease_time

Duration of the lease at server in seconds.

5.8.1.12.  Attribute 11: rdattr_error

Error returned from an attempt to retrieve attributes during a
READDIR operation.

5.8.1.13.  Attribute 19: filehandle

The filehandle of this object (primarily for READDIR requests).

5.8.1.14.  Attribute 75: suppattr_exclcreat

The bit vector that would set all REQUIRED and RECOMMENDED attributes
that are supported by the EXCLUSIVE4_1 method of file creation via
the OPEN operation.  The scope of this attribute applies to all
objects with a matching fsid.

5.8.2.  Definitions of Uncategorized RECOMMENDED Attributes

The definitions of most of the RECOMMENDED attributes follow.
Collections that share a common category are defined in other
sections.

5.8.2.1.  Attribute 14: archive

TRUE, if this file has been archived since the time of last
modification (deprecated in favor of time_backup).

5.8.2.2.  Attribute 15: cansettime

TRUE, if the server is able to change the times for a file system
object as specified in a SETATTR operation.

5.8.2.3.  Attribute 16: case_insensitive

TRUE, if file name comparisons on this file system are case
insensitive.

5.8.2.4.  Attribute 17: case_preserving

TRUE, if file name case on this file system is preserved.

5.8.2.5.  Attribute 60: change_policy

A value created by the server that the client can use to determine if
some server policy related to the current file system has been
subject to change.  If the value remains the same, then the client
can be sure that the values of the attributes related to fs location
and the fss_type field of the fs_status attribute have not changed.
On the other hand, a change in this value does not necessarily imply
a change in policy.  It is up to the client to interrogate the server
to determine if some policy relevant to it has changed.  See
Section 3.3.6 for details.

This attribute MUST change when the value returned by the
fs_locations or fs_locations_info attribute changes, when a file
system goes from read-only to writable or vice versa, or when the
allowable set of security flavors for the file system or any part
thereof is changed.

5.8.2.6.  Attribute 18: chown_restricted

If TRUE, the server will reject any request to change either the
owner or the group associated with a file if the caller is not a
privileged user (for example, "root" in UNIX operating environments
or, in Windows 2000, the "Take Ownership" privilege).

5.8.2.7.  Attribute 20: fileid

A number uniquely identifying the file within the file system.
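The client-side use of change_policy described in Section 5.8.2.5 can
be sketched as a simple cache keyed on that attribute's value: as
long as the value is unchanged, the cached fs-location-related
attributes remain valid.  The class and function names below are
illustrative, not part of the protocol.

```python
# Sketch (assumptions: fetch_change_policy and fetch_policy_attrs are
# illustrative stand-ins for GETATTR requests for change_policy and
# for the policy-related attributes, respectively).

class PolicyCache:
    def __init__(self):
        self._change_policy = None
        self._policy_attrs = None      # e.g. cached fs_locations data

    def get_policy_attrs(self, fetch_change_policy, fetch_policy_attrs):
        cp = fetch_change_policy()
        if cp != self._change_policy:
            # A changed value means some policy *may* have changed, so
            # the client must interrogate the server again.
            self._policy_attrs = fetch_policy_attrs()
            self._change_policy = cp
        # An unchanged value guarantees the fs-location attributes and
        # the fss_type field of fs_status have not changed.
        return self._policy_attrs

calls = {"policy": 0}
def fetch_cp():
    return 7                            # server's current change_policy
def fetch_attrs():
    calls["policy"] += 1
    return {"fs_locations": ["server1:/export"]}

cache = PolicyCache()
a1 = cache.get_policy_attrs(fetch_cp, fetch_attrs)
a2 = cache.get_policy_attrs(fetch_cp, fetch_attrs)
print(calls["policy"])                  # 1: second call reused the cache
```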
5.8.2.8.  Attribute 21: files_avail

File slots available to this user on the file system containing this
object -- this should be the smallest relevant limit.

5.8.2.9.  Attribute 22: files_free

Free file slots on the file system containing this object -- this
should be the smallest relevant limit.

5.8.2.10.  Attribute 23: files_total

Total file slots on the file system containing this object.

5.8.2.11.  Attribute 76: fs_charset_cap

Character set capabilities for this file system.  See Section 14.4.

5.8.2.12.  Attribute 24: fs_locations

Locations where this file system may be found.  If the server returns
NFS4ERR_MOVED as an error, this attribute MUST be supported.  See
Section 11.15 for more details.

5.8.2.13.  Attribute 67: fs_locations_info

Full function file system location.  See Section 11.16.2 for more
details.

5.8.2.14.  Attribute 61: fs_status

Generic file system type information.  See Section 11.17 for more
details.

5.8.2.15.  Attribute 25: hidden

TRUE, if the file is considered hidden with respect to the Windows
API.

5.8.2.16.  Attribute 26: homogeneous

TRUE, if this object's file system is homogeneous; i.e., all objects
in the file system (all objects on the server with the same fsid)
have common values for all per-file-system attributes.

5.8.2.17.  Attribute 27: maxfilesize

Maximum supported file size for the file system of this object.

5.8.2.18.  Attribute 28: maxlink

Maximum number of links for this object.

5.8.2.19.  Attribute 29: maxname

Maximum file name size supported for this object.

5.8.2.20.  Attribute 30: maxread

Maximum amount of data the READ operation will return for this
object.

5.8.2.21.  Attribute 31: maxwrite

Maximum amount of data the WRITE operation will accept for this
object.  This attribute SHOULD be supported if the file is writable.
Lack of this attribute can lead to the client either wasting
bandwidth or not receiving the best performance.

5.8.2.22.  Attribute 32: mimetype

MIME body type/subtype of this object.

5.8.2.23.  Attribute 55: mounted_on_fileid

Like fileid, but if the target filehandle is the root of a file
system, this attribute represents the fileid of the underlying
directory.

UNIX-based operating environments connect a file system into the
namespace by connecting (mounting) the file system onto the existing
file object (the mount point, usually a directory) of an existing
file system.  When the mount point's parent directory is read via an
API like readdir(), the return results are directory entries, each
with a component name and a fileid.  The fileid of the mount point's
directory entry will be different from the fileid that the stat()
system call returns.  The stat() system call is returning the fileid
of the root of the mounted file system, whereas readdir() is
returning the fileid that stat() would have returned before any file
systems were mounted on the mount point.

Unlike NFSv3, NFSv4.1 allows a client's LOOKUP request to cross other
file systems.  The client detects the file system crossing whenever
the filehandle argument of LOOKUP has an fsid attribute different
from that of the filehandle returned by LOOKUP.  A UNIX-based client
will consider this a "mount point crossing".  UNIX has a legacy
scheme for allowing a process to determine its current working
directory.  This relies on readdir() of a mount point's parent and
stat() of the mount point returning fileids as previously described.
The mounted_on_fileid attribute corresponds to the fileid that
readdir() would have returned as described previously.
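The server-side rule for this attribute, including the case of
several file systems stacked on one mount point, can be sketched as
follows.  The mount-table representation is purely an assumption for
illustration; a real server's internal data structures will differ.

```python
# Sketch of the mounted_on_fileid rule: if the object is the root of a
# mounted file system, report the fileid of the underlying (base)
# mount-point directory -- what readdir() of the parent would show;
# otherwise mounted_on_fileid equals fileid.

def mounted_on_fileid(obj, mount_table):
    """mount_table (illustrative) maps a mounted root's fileid to the
    fileid of the directory it is mounted on."""
    fid = obj["fileid"]
    # Follow the chain to the *base* mount point, not an intermediate
    # one, in case several file systems are stacked on one mount point.
    while fid in mount_table:
        fid = mount_table[fid]
    return fid

# /export/a is a mount point (underlying directory fileid 17) with two
# stacked file systems mounted on it (roots 1001, then 2002 on top).
mounts = {2002: 1001, 1001: 17}
print(mounted_on_fileid({"fileid": 2002}, mounts))   # 17 (base mount point)
print(mounted_on_fileid({"fileid": 55}, mounts))     # 55 (not a mounted root)
```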
While the NFSv4.1 client could simply fabricate a fileid
corresponding to what mounted_on_fileid provides (and if the server
does not support mounted_on_fileid, the client has no choice), there
is a risk that the client will generate a fileid that conflicts with
one that is already assigned to another object in the file system.
Instead, if the server can provide the mounted_on_fileid, the
potential for client operational problems in this area is eliminated.

If the server detects that there is no mounted point at the target
file object, then the value for mounted_on_fileid that it returns is
the same as that of the fileid attribute.

The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
provide it if possible, and for a UNIX-based server, this is
straightforward.  Usually, mounted_on_fileid will be requested during
a READDIR operation, in which case it is trivial (at least for UNIX-
based servers) to return mounted_on_fileid since it is equal to the
fileid of a directory entry returned by readdir().  If
mounted_on_fileid is requested in a GETATTR operation, the server
should obey an invariant that has it returning a value that is equal
to the file object's entry in the object's parent directory, i.e.,
what readdir() would have returned.  Some operating environments
allow a series of two or more file systems to be mounted onto a
single mount point.  In this case, for the server to obey the
aforementioned invariant, it will need to find the base mount point,
and not the intermediate mount points.

5.8.2.24.  Attribute 34: no_trunc

If this attribute is TRUE, then if the client uses a file name longer
than name_max, an error will be returned instead of the name being
truncated.

5.8.2.25.  Attribute 35: numlinks

Number of hard links to this object.

5.8.2.26.  Attribute 36: owner

The string name of the owner of this object.

5.8.2.27.  Attribute 37: owner_group

The string name of the group ownership of this object.

5.8.2.28.  Attribute 38: quota_avail_hard

The value in bytes that represents the amount of additional disk
space beyond the current allocation that can be allocated to this
file or directory before further allocations will be refused.  It is
understood that this space may be consumed by allocations to other
files or directories.

5.8.2.29.  Attribute 39: quota_avail_soft

The value in bytes that represents the amount of additional disk
space that can be allocated to this file or directory before the user
may reasonably be warned.  It is understood that this space may be
consumed by allocations to other files or directories though there is
a rule as to which other files or directories.

5.8.2.30.  Attribute 40: quota_used

The value in bytes that represents the amount of disk space used by
this file or directory and possibly a number of other similar files
or directories, where the set of "similar" meets at least the
criterion that allocating space to any file or directory in the set
will reduce the "quota_avail_hard" of every other file or directory
in the set.

Note that there may be a number of distinct but overlapping sets of
files or directories for which a quota_used value is maintained,
e.g., "all files with a given owner", "all files with a given group
owner", etc.  The server is at liberty to choose any of those sets
when providing the content of the quota_used attribute, but should do
so in a repeatable way.  The rule may be configured per file system
or may be "choose the set with the smallest quota".

5.8.2.31.  Attribute 41: rawdev

Raw device number of file of type NF4BLK or NF4CHR.  The device
number is split into major and minor numbers.
If the file's type attribute is not NF4BLK or NF4CHR, the value
returned SHOULD NOT be considered useful.

5.8.2.32.  Attribute 42: space_avail

Disk space in bytes available to this user on the file system
containing this object -- this should be the smallest relevant limit.

5.8.2.33.  Attribute 43: space_free

Free disk space in bytes on the file system containing this object --
this should be the smallest relevant limit.

5.8.2.34.  Attribute 44: space_total

Total disk space in bytes on the file system containing this object.

5.8.2.35.  Attribute 45: space_used

Number of file system bytes allocated to this object.

5.8.2.36.  Attribute 46: system

This attribute is TRUE if this file is a "system" file with respect
to the Windows operating environment.

5.8.2.37.  Attribute 47: time_access

The time_access attribute represents the time of last access to the
object by a READ operation sent to the server.  The notion of what is
an "access" depends on the server's operating environment and/or the
server's file system semantics.  For example, for servers obeying
Portable Operating System Interface (POSIX) semantics, time_access
would be updated only by the READ and READDIR operations and not any
of the operations that modify the content of the object [13], [14],
[15].  Of course, setting the corresponding time_access_set attribute
is another way to modify the time_access attribute.

Whenever the file object resides on a writable file system, the
server should make its best efforts to record time_access into stable
storage.  However, to mitigate the performance effects of doing so,
and most especially whenever the server is satisfying the read of the
object's content from its cache, the server MAY cache access time
updates and lazily write them to stable storage.
It is also acceptable to give administrators of the server the option
to disable time_access updates.

5.8.2.38.  Attribute 48: time_access_set

Sets the time of last access to the object.  SETATTR use only.

5.8.2.39.  Attribute 49: time_backup

The time of last backup of the object.

5.8.2.40.  Attribute 50: time_create

The time of creation of the object.  This attribute does not have any
relation to the traditional UNIX file attribute "ctime" or "change
time".

5.8.2.41.  Attribute 51: time_delta

Smallest useful server time granularity.

5.8.2.42.  Attribute 52: time_metadata

The time of last metadata modification of the object.

5.8.2.43.  Attribute 53: time_modify

The time of last modification to the object.

5.8.2.44.  Attribute 54: time_modify_set

Sets the time of last modification to the object.  SETATTR use only.

5.9.  Interpreting owner and owner_group

The RECOMMENDED attributes "owner" and "owner_group" (and also users
and groups within the "acl" attribute) are represented in terms of a
UTF-8 string.  To avoid a representation that is tied to a particular
underlying implementation at the client or server, the use of the
UTF-8 string has been chosen.  Note that Section 6.1 of RFC 2624 [48]
provides additional rationale.  It is expected that the client and
server will have their own local representation of owner and
owner_group that is used for local storage or presentation to the end
user.  Therefore, it is expected that when these attributes are
transferred between the client and server, the local representation
is translated to a syntax of the form "user@dns_domain".  This will
allow for a client and server that do not use the same local
representation the ability to translate to a common syntax that can
be interpreted by both.
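The receiver-side interpretation rules described in this section
(name@dns_domain form, the optional NFSv3-compatible numeric form,
and the untranslated no-"@" case) can be sketched as follows.  The
mapping table and function names are illustrative; a real server
would also apply the SHOULD-level NFS4ERR_BADOWNER behavior this
section describes.

```python
# Sketch of interpreting a received owner/owner_group string:
#  - "user@dns_domain" is translated via some local mapping (a miss
#    corresponds to the NFS4ERR_BADOWNER case on SETATTR);
#  - a pure decimal string with no leading zeros MAY be treated as an
#    NFSv3-style numeric id by implementations choosing that support;
#  - anything else without "@" is kept opaque (display only).

def interpret_owner(s, name_map):
    if "@" in s:
        return ("name", name_map.get(s))        # None -> no translation
    if s.isdigit() and (s == "0" or not s.startswith("0")):
        return ("numeric", int(s))              # optional NFSv3 compat
    return ("opaque", s)                        # untranslated string

table = {"alice@example.org": 1000}             # illustrative mapping
print(interpret_owner("alice@example.org", table))  # ('name', 1000)
print(interpret_owner("1000", table))               # ('numeric', 1000)
print(interpret_owner("nobody", table))             # ('opaque', 'nobody')
```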
Similarly, security principals may be represented in different ways
by different security mechanisms.  Servers normally translate these
representations into a common format, generally that used by local
storage, to serve as a means of identifying the users corresponding
to these security principals.  When these local identifiers are
translated to the form of the owner attribute, associated with files
created by such principals, they identify, in a common format, the
users associated with each corresponding set of security principals.

The translation used to interpret owner and group strings is not
specified as part of the protocol.  This allows various solutions to
be employed.  For example, a local translation table may be consulted
that maps a numeric identifier to the user@dns_domain syntax.  A name
service may also be used to accomplish the translation.  A server may
provide a more general service, not limited by any particular
translation (which would only translate a limited set of possible
strings) by storing the owner and owner_group attributes in local
storage without any translation or it may augment a translation
method by storing the entire string for attributes for which no
translation is available while using the local representation for
those cases in which a translation is available.

Servers that do not provide support for all possible values of the
owner and owner_group attributes SHOULD return an error
(NFS4ERR_BADOWNER) when a string is presented that has no
translation, as the value to be set for a SETATTR of the owner,
owner_group, or acl attributes.  When a server does accept an owner
or owner_group value as valid on a SETATTR (and similarly for the
owner and group strings in an acl), it is promising to return that
same string when a corresponding GETATTR is done.
Configuration changes (including changes from the mapping of the
string to the local representation) and ill-constructed name
translations (those that contain aliasing) may make that promise
impossible to honor.  Servers should make appropriate efforts to
avoid a situation in which these attributes have their values changed
when no real change to ownership has occurred.

The "dns_domain" portion of the owner string is meant to be a DNS
domain name, for example, user@example.org.  Servers should accept as
valid a set of users for at least one domain.  A server may treat
other domains as having no valid translations.  A more general
service is provided when a server is capable of accepting users for
multiple domains, or for all domains, subject to security
constraints.

In the case where there is no translation available to the client or
server, the attribute value will be constructed without the "@".
Therefore, the absence of the @ from the owner or owner_group
attribute signifies that no translation was available at the sender
and that the receiver of the attribute should not use that string as
a basis for translation into its own internal format.  Even though
the attribute value cannot be translated, it may still be useful.  In
the case of a client, the attribute string may be used for local
display of ownership.

To provide a greater degree of compatibility with NFSv3, which
identified users and groups by 32-bit unsigned user identifiers and
group identifiers, owner and group strings that consist of decimal
numeric values with no leading zeros can be given a special
interpretation by clients and servers that choose to provide such
support.  The receiver may treat such a user or group string as
representing the same user as would be represented by an NFSv3 uid or
gid having the corresponding numeric value.
A server is not obligated to accept such a string, but may return an
NFS4ERR_BADOWNER instead.  To avoid this mechanism being used to
subvert user and group translation, so that a client might pass all
of the owners and groups in numeric form, a server SHOULD return an
NFS4ERR_BADOWNER error when there is a valid translation for the user
or owner designated in this way.  In that case, the client must use
the appropriate name@domain string and not the special form for
compatibility.

The owner string "nobody" may be used to designate an anonymous user,
which will be associated with a file created by a security principal
that cannot be mapped through normal means to the owner attribute.
Users and implementations of NFSv4.1 SHOULD NOT use "nobody" to
designate a real user whose access is not anonymous.

5.10.  Character Case Attributes

With respect to the case_insensitive and case_preserving attributes,
each UCS-4 character (which UTF-8 encodes) can be mapped according to
Appendix B.2 of RFC 3454 [16].  For general character handling and
internationalization issues, see Section 14.

5.11.  Directory Notification Attributes

As described in Section 18.39, the client can request a minimum delay
for notifications of changes to attributes, but the server is free to
ignore what the client requests.  The client can determine in advance
what notification delays the server will accept by sending a GETATTR
operation for either or both of two directory notification
attributes.  When the client calls the GET_DIR_DELEGATION operation
and asks for attribute change notifications, it should request
notification delays that are no less than the values in the server-
provided attributes.

5.11.1.  Attribute 56: dir_notif_delay

The dir_notif_delay attribute is the minimum number of seconds the
server will delay before notifying the client of a change to the
directory's attributes.

5.11.2.  Attribute 57: dirent_notif_delay

The dirent_notif_delay attribute is the minimum number of seconds the
server will delay before notifying the client of a change to a file
object that has an entry in the directory.

5.12.  pNFS Attribute Definitions

5.12.1.  Attribute 62: fs_layout_type

The fs_layout_type attribute (see Section 3.3.13) applies to a file
system and indicates what layout types are supported by the file
system.  When the client encounters a new fsid, the client SHOULD
obtain the value for the fs_layout_type attribute associated with the
new file system.  This attribute is used by the client to determine
if the layout types supported by the server match any of the client's
supported layout types.

5.12.2.  Attribute 66: layout_alignment

When a client holds layouts on files of a file system, the
layout_alignment attribute indicates the preferred alignment for I/O
to files on that file system.  Where possible, the client should send
READ and WRITE operations with offsets that are whole multiples of
the layout_alignment attribute.

5.12.3.  Attribute 65: layout_blksize

When a client holds layouts on files of a file system, the
layout_blksize attribute indicates the preferred block size for I/O
to files on that file system.  Where possible, the client should send
READ operations with a count argument that is a whole multiple of
layout_blksize, and WRITE operations with a data argument of size
that is a whole multiple of layout_blksize.

5.12.4.  Attribute 63: layout_hint

The layout_hint attribute (see Section 3.3.19) may be set on newly
created files to influence the metadata server's choice for the
file's layout.
If possible, this attribute is one of those set in the initial attributes within the OPEN operation.  The metadata server may choose to ignore this attribute.  The layout_hint attribute is a subset of the layout structure returned by LAYOUTGET.  For example, instead of specifying particular devices, this would be used to suggest the stripe width of a file.  The server implementation determines which fields within the layout will be used.

5.12.5.  Attribute 64: layout_type

This attribute lists the layout type(s) available for a file.  The value returned by the server is for informational purposes only.  The client will use the LAYOUTGET operation to obtain the information needed in order to perform I/O, for example, the specific device information for the file and its layout.

5.12.6.  Attribute 68: mdsthreshold

This attribute is a server-provided hint used to communicate to the client when it is more efficient to send READ and WRITE operations to the metadata server or the data server.  The two types of thresholds described are file size thresholds and I/O size thresholds.  If a file's size is smaller than the file size threshold, data accesses SHOULD be sent to the metadata server.  If an I/O request has a length that is below the I/O size threshold, the I/O SHOULD be sent to the metadata server.  Each threshold type is specified separately for read and write.

The server MAY provide both types of thresholds for a file.  If both file size and I/O size are provided, the client SHOULD reach or exceed both thresholds before sending its read or write requests to the data server.  Alternatively, if only one of the specified thresholds is reached or exceeded, the I/O requests are sent to the metadata server.
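The client-side decision implied by these thresholds can be sketched as follows.  The structure and field names are illustrative stand-ins for the mdsthreshold hint, not the wire format; an absent threshold is treated as imposing no restriction.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical client-side view of one direction (read or write)
 * of the mdsthreshold hint. */
struct mds_threshold {
    bool     has_file_size;   /* file-size threshold provided? */
    uint64_t file_size;
    bool     has_io_size;     /* I/O-size threshold provided? */
    uint64_t io_size;
};

/* Returns true if the I/O should go to the metadata server.  When
 * both thresholds are provided, the I/O goes to the data server
 * only once both are reached or exceeded. */
bool send_to_mds(const struct mds_threshold *t,
                 uint64_t file_size, uint64_t io_len)
{
    if (t->has_file_size && file_size < t->file_size)
        return true;          /* file smaller than threshold */
    if (t->has_io_size && io_len < t->io_size)
        return true;          /* request shorter than threshold */
    return false;             /* both thresholds met: use data server */
}
```

A client would typically evaluate this per READ or WRITE, using the threshold values fetched (and periodically refreshed) via GETATTR.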
For each threshold type, a value of zero indicates no READ or WRITE should be sent to the metadata server, while a value of all ones indicates that all READs or WRITEs should be sent to the metadata server.

The attribute is available on a per-filehandle basis.  If the current filehandle refers to a non-pNFS file or directory, the metadata server should return an attribute that is representative of the filehandle's file system.  It is suggested that this attribute is queried as part of the OPEN operation.  Due to dynamic system changes, the client should not assume that the attribute will remain constant for any specific time period; thus, it should be periodically refreshed.

5.13.  Retention Attributes

Retention is a concept whereby a file object can be placed in an immutable, undeletable, unrenamable state for a fixed or infinite duration of time.  Once in this "retained" state, the file cannot be moved out of the state until the duration of retention has been reached.

When retention is enabled, retention MUST extend to the data of the file, and the name of the file.  The server MAY extend retention to any other property of the file, including any subset of REQUIRED, RECOMMENDED, and named attributes, with the exceptions noted in this section.

Servers MAY support or not support retention on any file object type.

The five retention attributes are explained in the next subsections.

5.13.1.  Attribute 69: retention_get

If retention is enabled for the associated file, this attribute's value represents the retention begin time of the file object.  This attribute's value is only readable with the GETATTR operation and MUST NOT be modified by the SETATTR operation (Section 5.5).  The value of the attribute consists of:

   const RET4_DURATION_INFINITE = 0xffffffffffffffff;

   struct retention_get4 {
           uint64_t        rg_duration;
           nfstime4        rg_begin_time<1>;
   };

The field rg_duration is the duration in seconds indicating how long the file will be retained once retention is enabled.  The field rg_begin_time is an array of up to one absolute time value.  If the array is zero length, no beginning retention time has been established, and retention is not enabled.  If rg_duration is equal to RET4_DURATION_INFINITE, the file, once retention is enabled, will be retained for an infinite duration.

If rg_duration is zero (or as soon as it reaches zero), rg_begin_time will be of zero length, and retention is not (or is no longer) enabled.

5.13.2.  Attribute 70: retention_set

This attribute is used to set the retention duration and optionally enable retention for the associated file object.  This attribute is only modifiable via the SETATTR operation and MUST NOT be retrieved by the GETATTR operation (Section 5.5).  This attribute corresponds to retention_get.  The value of the attribute consists of:

   struct retention_set4 {
           bool            rs_enable;
           uint64_t        rs_duration<1>;
   };

If the client sets rs_enable to TRUE, then it is enabling retention on the file object with the begin time of retention starting from the server's current time and date.  The duration of the retention can also be provided if the rs_duration array is of length one.  The duration is the time in seconds from the begin time of retention, and if set to RET4_DURATION_INFINITE, the file is to be retained forever.  If retention is enabled, with no duration specified in either this SETATTR or a previous SETATTR, the duration defaults to zero seconds.
The server MAY restrict the enabling of retention or the duration of retention on the basis of the ACE4_WRITE_RETENTION ACL permission.  The enabling of retention MUST NOT prevent the enabling of event-based retention or the modification of the retention_hold attribute.

The following rules apply to both the retention_set and retentevt_set attributes.

o  As long as retention is not enabled, the client is permitted to decrease the duration.

o  The duration can always be set to an equal or higher value, even if retention is enabled.  Note that once retention is enabled, the actual duration (as returned by the retention_get or retentevt_get attributes; see Section 5.13.1 or Section 5.13.3) is constantly counting down to zero (one unit per second), unless the duration was set to RET4_DURATION_INFINITE.  Thus, it will not be possible for the client to precisely extend the duration on a file that has retention enabled.

o  While retention is enabled, attempts to disable retention or decrease the retention's duration MUST fail with the error NFS4ERR_INVAL.

o  If the principal attempting to change retention_set or retentevt_set does not have ACE4_WRITE_RETENTION permissions, the attempt MUST fail with NFS4ERR_ACCESS.

5.13.3.  Attribute 71: retentevt_get

Gets the event-based retention duration, and if enabled, the event-based retention begin time of the file object.  This attribute is like retention_get, but refers to event-based retention.  The event that triggers event-based retention is not defined by the NFSv4.1 specification.

5.13.4.  Attribute 72: retentevt_set

Sets the event-based retention duration, and optionally enables event-based retention on the file object.  This attribute corresponds to retentevt_get and is like retention_set, but refers to event-based retention.  When event-based retention is set, the file MUST be retained even if non-event-based retention has been set, and the duration of non-event-based retention has been reached.  Conversely, when non-event-based retention has been set, the file MUST be retained even if event-based retention has been set, and the duration of event-based retention has been reached.  The server MAY restrict the enabling of event-based retention or the duration of event-based retention on the basis of the ACE4_WRITE_RETENTION ACL permission.  The enabling of event-based retention MUST NOT prevent the enabling of non-event-based retention or the modification of the retention_hold attribute.

5.13.5.  Attribute 73: retention_hold

Gets or sets administrative retention holds, one hold per bit position.

This attribute allows up to 64 administrative holds, one hold per bit on the attribute.  If retention_hold is not zero, then the file MUST NOT be deleted, renamed, or modified, even if the duration of enabled event-based or non-event-based retention has been reached.  The server MAY restrict the modification of retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL permission.  The enabling of administrative retention holds does not prevent the enabling of event-based or non-event-based retention.

If the principal attempting to change retention_hold does not have ACE4_WRITE_RETENTION_HOLD permissions, the attempt MUST fail with NFS4ERR_ACCESS.

6.  Access Control Attributes

Access Control Lists (ACLs) are file attributes that specify fine-grained access control.  This section covers the "acl", "dacl", "sacl", "aclsupport", "mode", and "mode_set_masked" file attributes and their interactions.  Note that file attributes may apply to any file system object.

6.1.  Goals

ACLs and modes represent two well-established models for specifying permissions.  This section specifies requirements that attempt to meet the following goals:

o  If a server supports the mode attribute, it should provide reasonable semantics to clients that only set and retrieve the mode attribute.

o  If a server supports ACL attributes, it should provide reasonable semantics to clients that only set and retrieve those attributes.

o  On servers that support the mode attribute, if ACL attributes have never been set on an object, via inheritance or explicitly, the behavior should be traditional UNIX-like behavior.

o  On servers that support the mode attribute, if the ACL attributes have been previously set on an object, either explicitly or via inheritance:

   *  Setting only the mode attribute should effectively control the traditional UNIX-like permissions of read, write, and execute on owner, owner_group, and other.

   *  Setting only the mode attribute should provide reasonable security.  For example, setting a mode of 000 should be enough to ensure that future OPEN operations for OPEN4_SHARE_ACCESS_READ or OPEN4_SHARE_ACCESS_WRITE by any principal fail, regardless of a previously existing or inherited ACL.

o  NFSv4.1 may introduce different semantics relating to the mode and ACL attributes, but it does not render invalid any previously existing implementations.  Additionally, this section provides clarifications based on previous implementations and discussions around them.

o  On servers that support both the mode and the acl or dacl attributes, the server must keep the two consistent with each other.  The value of the mode attribute (with the exception of the three high-order bits described in Section 6.2.4) must be determined entirely by the value of the ACL, so that use of the mode is never required for anything other than setting the three high-order bits.  See Section 6.4.1 for exact requirements.

o  When a mode attribute is set on an object, the ACL attributes may need to be modified in order to not conflict with the new mode.  In such cases, it is desirable that the ACL keep as much information as possible.  This includes information about inheritance, AUDIT and ALARM ACEs, and permissions granted and denied that do not conflict with the new mode.

6.2.  File Attributes Discussion

6.2.1.  Attribute 12: acl

The NFSv4.1 ACL attribute contains an array of Access Control Entries (ACEs) that are associated with the file system object.  Although the client can set and get the acl attribute, the server is responsible for using the ACL to perform access control.  The client can use the OPEN or ACCESS operations to check access without modifying or reading data or metadata.

The NFS ACE structure is defined as follows:

   typedef uint32_t        acetype4;

   typedef uint32_t        aceflag4;

   typedef uint32_t        acemask4;

   struct nfsace4 {
           acetype4        type;
           aceflag4        flag;
           acemask4        access_mask;
           utf8str_mixed   who;
   };

To determine if a request succeeds, the server processes each nfsace4 entry in order.  Only ACEs that have a "who" that matches the requester are considered.  Each ACE is processed until all of the bits of the requester's access have been ALLOWED.  Once a bit (see below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer considered in the processing of later ACEs.
If an ACCESS_DENIED_ACE is encountered where the requester's access still has unALLOWED bits in common with the "access_mask" of the ACE, the request is denied.  When the ACL is fully processed, if there are bits in the requester's mask that have not been ALLOWED or DENIED, access is denied.

Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do not affect a requester's access, and instead are for triggering events as a result of a requester's access attempt.  Therefore, AUDIT and ALARM ACEs are processed only after processing ALLOW and DENY ACEs.

The NFSv4.1 ACL model is quite rich.  Some server platforms may provide access-control functionality that goes beyond the UNIX-style mode attribute, but that is not as rich as the NFS ACL model.  So that users can take advantage of this more limited functionality, the server may support the acl attributes by mapping between its ACL model and the NFSv4.1 ACL model.  Servers must ensure that the ACL they actually store or enforce is at least as strict as the NFSv4 ACL that was set.  It is tempting to accomplish this by rejecting any ACL that falls outside the small set that can be represented accurately.  However, such an approach can render ACLs unusable without special client-side knowledge of the server's mapping, which defeats the purpose of having a common NFSv4 ACL protocol.  Therefore, servers should accept every ACL that they can without compromising security.

To help accomplish this, servers may make a special exception, in the case of unsupported permission bits, to the rule that bits not ALLOWED or DENIED by an ACL must be denied.  For example, a UNIX-style server might choose to silently allow read attribute permissions even though an ACL does not explicitly allow those permissions.  (An ACL that explicitly denies permission to read attributes should still be rejected.)

The situation is complicated by the fact that a server may have multiple modules that enforce ACLs.  For example, the enforcement for NFSv4.1 access may be different from, but not weaker than, the enforcement for local access, and both may be different from the enforcement for access through other protocols such as SMB (Server Message Block).  So it may be useful for a server to accept an ACL even if not all of its modules are able to support it.

The guiding principle with regard to NFSv4 access is that the server must not accept ACLs that appear to make access to the file more restrictive than it really is.

6.2.1.1.  ACE Type

The constants used for the type field (acetype4) are as follows:

   const ACE4_ACCESS_ALLOWED_ACE_TYPE      = 0x00000000;
   const ACE4_ACCESS_DENIED_ACE_TYPE       = 0x00000001;
   const ACE4_SYSTEM_AUDIT_ACE_TYPE        = 0x00000002;
   const ACE4_SYSTEM_ALARM_ACE_TYPE        = 0x00000003;

Only the ALLOWED and DENIED bits may be used in the dacl attribute, and only the AUDIT and ALARM bits may be used in the sacl attribute.  All four are permitted in the acl attribute.

   +------------------------------+--------------+---------------------+
   | Value                        | Abbreviation | Description         |
   +------------------------------+--------------+---------------------+
   | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW        | Explicitly grants   |
   |                              |              | the access defined  |
   |                              |              | in acemask4 to the  |
   |                              |              | file or directory.  |
   | ACE4_ACCESS_DENIED_ACE_TYPE  | DENY         | Explicitly denies   |
   |                              |              | the access defined  |
   |                              |              | in acemask4 to the  |
   |                              |              | file or directory.  |
   | ACE4_SYSTEM_AUDIT_ACE_TYPE   | AUDIT        | Log (in a system-   |
   |                              |              | dependent way) any  |
   |                              |              | access attempt to a |
   |                              |              | file or directory   |
   |                              |              | that uses any of    |
   |                              |              | the access methods  |
   |                              |              | specified in        |
   |                              |              | acemask4.           |
   | ACE4_SYSTEM_ALARM_ACE_TYPE   | ALARM        | Generate an alarm   |
   |                              |              | (in a system-       |
   |                              |              | dependent way) when |
   |                              |              | any access attempt  |
   |                              |              | is made to a file   |
   |                              |              | or directory for    |
   |                              |              | the access methods  |
   |                              |              | specified in        |
   |                              |              | acemask4.           |
   +------------------------------+--------------+---------------------+

The "Abbreviation" column denotes how the types will be referred to throughout the rest of this section.

6.2.1.2.  Attribute 13: aclsupport

A server need not support all of the above ACE types.  This attribute indicates which ACE types are supported for the current file system.  The bitmask constants used to represent the above definitions within the aclsupport attribute are as follows:

   const ACL4_SUPPORT_ALLOW_ACL    = 0x00000001;
   const ACL4_SUPPORT_DENY_ACL     = 0x00000002;
   const ACL4_SUPPORT_AUDIT_ACL    = 0x00000004;
   const ACL4_SUPPORT_ALARM_ACL    = 0x00000008;

Servers that support either the ALLOW or DENY ACE type SHOULD support both ALLOW and DENY ACE types.

Clients should not attempt to set an ACE unless the server claims support for that ACE type.  If the server receives a request to set an ACE that it cannot store, it MUST reject the request with NFS4ERR_ATTRNOTSUPP.  If the server receives a request to set an ACE that it can store but cannot enforce, the server SHOULD reject the request with NFS4ERR_ATTRNOTSUPP.

Support for any of the ACL attributes is optional (albeit RECOMMENDED).  However, a server that supports either of the new ACL attributes (dacl or sacl) MUST allow use of the new ACL attributes to access all of the ACE types that it supports.  In other words, if such a server supports ALLOW or DENY ACEs, then it MUST support the dacl attribute, and if it supports AUDIT or ALARM ACEs, then it MUST support the sacl attribute.

6.2.1.3.  ACE Access Mask

The bitmask constants used for the access mask field are as follows:

   const ACE4_READ_DATA            = 0x00000001;
   const ACE4_LIST_DIRECTORY       = 0x00000001;
   const ACE4_WRITE_DATA           = 0x00000002;
   const ACE4_ADD_FILE             = 0x00000002;
   const ACE4_APPEND_DATA          = 0x00000004;
   const ACE4_ADD_SUBDIRECTORY     = 0x00000004;
   const ACE4_READ_NAMED_ATTRS     = 0x00000008;
   const ACE4_WRITE_NAMED_ATTRS    = 0x00000010;
   const ACE4_EXECUTE              = 0x00000020;
   const ACE4_DELETE_CHILD         = 0x00000040;
   const ACE4_READ_ATTRIBUTES      = 0x00000080;
   const ACE4_WRITE_ATTRIBUTES     = 0x00000100;
   const ACE4_WRITE_RETENTION      = 0x00000200;
   const ACE4_WRITE_RETENTION_HOLD = 0x00000400;

   const ACE4_DELETE               = 0x00010000;
   const ACE4_READ_ACL             = 0x00020000;
   const ACE4_WRITE_ACL            = 0x00040000;
   const ACE4_WRITE_OWNER          = 0x00080000;
   const ACE4_SYNCHRONIZE          = 0x00100000;

Note that some masks have coincident values, for example, ACE4_READ_DATA and ACE4_LIST_DIRECTORY.  The mask entries ACE4_LIST_DIRECTORY, ACE4_ADD_FILE, and ACE4_ADD_SUBDIRECTORY are intended to be used with directory objects, while ACE4_READ_DATA, ACE4_WRITE_DATA, and ACE4_APPEND_DATA are intended to be used with non-directory objects.

6.2.1.3.1.  Discussion of Mask Attributes

ACE4_READ_DATA

   Operation(s) affected:

      READ

      OPEN

   Discussion:

      Permission to read the data of the file.

      Servers SHOULD allow a user the ability to read the data of the file when only the ACE4_EXECUTE access mask bit is allowed.
ACE4_LIST_DIRECTORY

   Operation(s) affected:

      READDIR

   Discussion:

      Permission to list the contents of a directory.

ACE4_WRITE_DATA

   Operation(s) affected:

      WRITE

      OPEN

      SETATTR of size

   Discussion:

      Permission to modify a file's data.

ACE4_ADD_FILE

   Operation(s) affected:

      CREATE

      LINK

      OPEN

      RENAME

   Discussion:

      Permission to add a new file in a directory.  The CREATE operation is affected when nfs_ftype4 is NF4LNK, NF4BLK, NF4CHR, NF4SOCK, or NF4FIFO.  (NF4DIR is not listed because it is covered by ACE4_ADD_SUBDIRECTORY.)  OPEN is affected when used to create a regular file.  LINK and RENAME are always affected.

ACE4_APPEND_DATA

   Operation(s) affected:

      WRITE

      OPEN

      SETATTR of size

   Discussion:

      The ability to modify a file's data, but only starting at EOF.  This allows for the notion of append-only files, by allowing ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to the same user or group.  If a file has an ACL such as the one described above and a WRITE request is made for somewhere other than EOF, the server SHOULD return NFS4ERR_ACCESS.

ACE4_ADD_SUBDIRECTORY

   Operation(s) affected:

      CREATE

      RENAME

   Discussion:

      Permission to create a subdirectory in a directory.  The CREATE operation is affected when nfs_ftype4 is NF4DIR.  The RENAME operation is always affected.

ACE4_READ_NAMED_ATTRS

   Operation(s) affected:

      OPENATTR

   Discussion:

      Permission to read the named attributes of a file or to look up the named attribute directory.  OPENATTR is affected when it is not used to create a named attribute directory.  This is when 1) createdir is TRUE, but a named attribute directory already exists, or 2) createdir is FALSE.

ACE4_WRITE_NAMED_ATTRS

   Operation(s) affected:

      OPENATTR

   Discussion:

      Permission to write the named attributes of a file or to create a named attribute directory.  OPENATTR is affected when it is used to create a named attribute directory.  This is when createdir is TRUE and no named attribute directory exists.  The ability to check whether or not a named attribute directory exists depends on the ability to look it up; therefore, users also need the ACE4_READ_NAMED_ATTRS permission in order to create a named attribute directory.

ACE4_EXECUTE

   Operation(s) affected:

      READ

      OPEN

      REMOVE

      RENAME

      LINK

      CREATE

   Discussion:

      Permission to execute a file.

      Servers SHOULD allow a user the ability to read the data of the file when only the ACE4_EXECUTE access mask bit is allowed.  This is because there is no way to execute a file without reading the contents.  Though a server may treat ACE4_EXECUTE and ACE4_READ_DATA bits identically when deciding to permit a READ operation, it SHOULD still allow the two bits to be set independently in ACLs, and MUST distinguish between them when replying to ACCESS operations.  In particular, servers SHOULD NOT silently turn on one of the two bits when the other is set, as that would make it impossible for the client to correctly enforce the distinction between read and execute permissions.

      As an example, following a SETATTR of the following ACL:

         nfsuser:ACE4_EXECUTE:ALLOW

      A subsequent GETATTR of ACL for that file SHOULD return:

         nfsuser:ACE4_EXECUTE:ALLOW

      Rather than:

         nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW

ACE4_EXECUTE

   Operation(s) affected:

      LOOKUP

   Discussion:

      Permission to traverse/search a directory.

ACE4_DELETE_CHILD

   Operation(s) affected:

      REMOVE

      RENAME

   Discussion:

      Permission to delete a file or directory within a directory.  See Section 6.2.1.3.2 for information on how ACE4_DELETE and ACE4_DELETE_CHILD interact.

ACE4_READ_ATTRIBUTES

   Operation(s) affected:

      GETATTR of file system object attributes

      VERIFY

      NVERIFY

      READDIR

   Discussion:

      The ability to read basic attributes (non-ACLs) of a file.  On a UNIX system, basic attributes can be thought of as the stat-level attributes.  Allowing this access mask bit would mean that the entity can execute "ls -l" and stat.  If a READDIR operation requests attributes, this mask must be allowed for the READDIR to succeed.

ACE4_WRITE_ATTRIBUTES

   Operation(s) affected:

      SETATTR of time_access_set, time_backup, time_create, time_modify_set, mimetype, hidden, system

   Discussion:

      Permission to change the times associated with a file or directory to an arbitrary value.  Also permission to change the mimetype, hidden, and system attributes.  A user having ACE4_WRITE_DATA or ACE4_WRITE_ATTRIBUTES will be allowed to set the times associated with a file to the current server time.

ACE4_WRITE_RETENTION

   Operation(s) affected:

      SETATTR of retention_set, retentevt_set.

   Discussion:

      Permission to modify the durations of event and non-event-based retention.  Also permission to enable event and non-event-based retention.  A server MAY behave such that setting ACE4_WRITE_ATTRIBUTES allows ACE4_WRITE_RETENTION.

ACE4_WRITE_RETENTION_HOLD

   Operation(s) affected:

      SETATTR of retention_hold.

   Discussion:

      Permission to modify the administrative retention holds.  A server MAY map ACE4_WRITE_ATTRIBUTES to ACE4_WRITE_RETENTION_HOLD.

ACE4_DELETE

   Operation(s) affected:

      REMOVE

   Discussion:

      Permission to delete the file or directory.  See Section 6.2.1.3.2 for information on how ACE4_DELETE and ACE4_DELETE_CHILD interact.

ACE4_READ_ACL

   Operation(s) affected:

      GETATTR of acl, dacl, or sacl

      NVERIFY

      VERIFY

   Discussion:

      Permission to read the ACL.

ACE4_WRITE_ACL

   Operation(s) affected:

      SETATTR of acl and mode

   Discussion:

      Permission to write the acl and mode attributes.

ACE4_WRITE_OWNER

   Operation(s) affected:

      SETATTR of owner and owner_group

   Discussion:

      Permission to write the owner and owner_group attributes.  On UNIX systems, this is the ability to execute chown() and chgrp().

ACE4_SYNCHRONIZE

   Operation(s) affected:

      NONE

   Discussion:

      Permission to use the file object as a synchronization primitive for interprocess communication.  This permission is not enforced or interpreted by the NFSv4.1 server on behalf of the client.

      Typically, the ACE4_SYNCHRONIZE permission is only meaningful on local file systems, i.e., file systems not accessed via NFSv4.1.  The reason that the permission bit exists is that some operating environments, such as Windows, use ACE4_SYNCHRONIZE.

      For example, if a client copies a file that has ACE4_SYNCHRONIZE set from a local file system to an NFSv4.1 server, and then later copies the file from the NFSv4.1 server to a local file system, it is likely that if ACE4_SYNCHRONIZE was set in the original file, the client will want it set in the second copy.  The first copy will not have the permission set unless the NFSv4.1 server has the means to set the ACE4_SYNCHRONIZE bit.  The second copy will not have the permission set unless the NFSv4.1 server has the means to retrieve the ACE4_SYNCHRONIZE bit.
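The ALLOW/DENY processing described at the start of Section 6.2.1 can be sketched as below.  This is a minimal model, not server code: "who" matching is abstracted into a boolean, and AUDIT/ALARM ACEs (which are processed afterward and do not affect access) are omitted.

```c
#include <stdint.h>
#include <stdbool.h>

#define ACE4_ACCESS_ALLOWED_ACE_TYPE 0x00000000
#define ACE4_ACCESS_DENIED_ACE_TYPE  0x00000001

struct ace {                /* pared-down nfsace4 for this sketch */
    uint32_t type;
    uint32_t access_mask;
    bool     who_matches;   /* stands in for "who" matching the requester */
};

/* Walk the ACEs in order, accumulating ALLOWED bits.  A matching DENY
 * ACE whose mask covers any still-unALLOWED requested bit denies the
 * request; bits left neither ALLOWED nor DENIED at the end are denied. */
bool acl_check(const struct ace *aces, int nace, uint32_t requested)
{
    uint32_t needed = requested;     /* bits not yet ALLOWED */
    for (int i = 0; i < nace && needed != 0; i++) {
        if (!aces[i].who_matches)
            continue;                /* only matching ACEs are considered */
        if (aces[i].type == ACE4_ACCESS_ALLOWED_ACE_TYPE)
            needed &= ~aces[i].access_mask;   /* ALLOWED bits stay allowed */
        else if (aces[i].type == ACE4_ACCESS_DENIED_ACE_TYPE &&
                 (needed & aces[i].access_mask) != 0)
            return false;            /* a needed bit is explicitly denied */
    }
    return needed == 0;              /* leftover bits mean access denied */
}
```

Note how a DENY ACE appearing after an ALLOW ACE for the same bit has no effect, since an ALLOWED bit is no longer considered by later ACEs.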
Server implementations need not provide the granularity of control that is implied by this list of masks.  For example, POSIX-based systems might not distinguish ACE4_APPEND_DATA (the ability to append to a file) from ACE4_WRITE_DATA (the ability to modify existing contents); both masks would be tied to a single "write" permission [17].  When such a server returns attributes to the client, it would show both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the write permission is enabled.

If a server receives a SETATTR request that it cannot accurately implement, it should err in the direction of more restricted access, except in the previously discussed cases of execute and read.  For example, suppose a server cannot distinguish overwriting data from appending new data, as described in the previous paragraph.  If a client submits an ALLOW ACE where ACE4_APPEND_DATA is set but ACE4_WRITE_DATA is not (or vice versa), the server should either turn off ACE4_APPEND_DATA or reject the request with NFS4ERR_ATTRNOTSUPP.

6.2.1.3.2.  ACE4_DELETE vs. ACE4_DELETE_CHILD

Two access mask bits govern the ability to delete a directory entry: ACE4_DELETE on the object itself (the "target") and ACE4_DELETE_CHILD on the containing directory (the "parent").

Many systems also take the "sticky bit" (MODE4_SVTX) on a directory to allow unlink only to a user that owns either the target or the parent; on some such systems the decision also depends on whether the target is writable.

Servers SHOULD allow unlink if either ACE4_DELETE is permitted on the target, or ACE4_DELETE_CHILD is permitted on the parent.  (Note that this is true even if the parent or target explicitly denies one of these permissions.)

If the ACLs in question neither explicitly ALLOW nor DENY either of the above, and if MODE4_SVTX is not set on the parent, then the server SHOULD allow the removal if and only if ACE4_ADD_FILE is permitted.  In the case where MODE4_SVTX is set, the server may also require the remover to own either the parent or the target, or may require the target to be writable.

This allows servers to support something close to traditional UNIX-like semantics, with ACE4_ADD_FILE taking the place of the write bit.

6.2.1.4.  ACE flag

The bitmask constants used for the flag field are as follows:

   const ACE4_FILE_INHERIT_ACE             = 0x00000001;
   const ACE4_DIRECTORY_INHERIT_ACE        = 0x00000002;
   const ACE4_NO_PROPAGATE_INHERIT_ACE     = 0x00000004;
   const ACE4_INHERIT_ONLY_ACE             = 0x00000008;
   const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG   = 0x00000010;
   const ACE4_FAILED_ACCESS_ACE_FLAG       = 0x00000020;
   const ACE4_IDENTIFIER_GROUP             = 0x00000040;
   const ACE4_INHERITED_ACE                = 0x00000080;

A server need not support any of these flags.  If the server supports flags that are similar to, but not exactly the same as, these flags, the implementation may define a mapping between the protocol-defined flags and the implementation-defined flags.

For example, suppose a client tries to set an ACE with ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE.  If the server does not support any form of ACL inheritance, the server should reject the request with NFS4ERR_ATTRNOTSUPP.  If the server supports a single "inherit ACE" flag that applies to both files and directories, the server may reject the request (i.e., requiring the client to set both the file and directory inheritance flags).  The server may also accept the request and silently turn on the ACE4_DIRECTORY_INHERIT_ACE flag.

6.2.1.4.1.  Discussion of Flag Bits

ACE4_FILE_INHERIT_ACE
   Any non-directory file in any sub-directory will get this ACE inherited.

ACE4_DIRECTORY_INHERIT_ACE
   Can be placed on a directory and indicates that this ACE should be added to each new directory created.
   If this flag is set in an ACE in an ACL attribute to be set on a non-directory file system object, the operation attempting to set the ACL SHOULD fail with NFS4ERR_ATTRNOTSUPP.

ACE4_NO_PROPAGATE_INHERIT_ACE
   Can be placed on a directory.  This flag tells the server that inheritance of this ACE should stop at newly created child directories.

ACE4_INHERIT_ONLY_ACE
   Can be placed on a directory but does not apply to the directory; ALLOW and DENY ACEs with this bit set do not affect access to the directory, and AUDIT and ALARM ACEs with this bit set do not trigger log or alarm events.  Such ACEs only take effect once they are applied (with this bit cleared) to newly created files and directories as specified by the ACE4_FILE_INHERIT_ACE and ACE4_DIRECTORY_INHERIT_ACE flags.

   If this flag is present on an ACE, but neither ACE4_DIRECTORY_INHERIT_ACE nor ACE4_FILE_INHERIT_ACE is present, then an operation attempting to set such an attribute SHOULD fail with NFS4ERR_ATTRNOTSUPP.

ACE4_SUCCESSFUL_ACCESS_ACE_FLAG

ACE4_FAILED_ACCESS_ACE_FLAG
   The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits may be set only on ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE (ALARM) ACE types.  If during the processing of the file's ACL, the server encounters an AUDIT or ALARM ACE that matches the principal attempting the OPEN, the server notes that fact, and the presence, if any, of the SUCCESS and FAILED flags encountered in the AUDIT or ALARM ACE.  Once the server completes the ACL processing, it then notes if the operation succeeded or failed.  If the operation succeeded, and if the SUCCESS flag was set for a matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM event occurs.  If the operation failed, and if the FAILED flag was set for the matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM event occurs.  Either or both of the SUCCESS and FAILED flags can be set, but if neither is set, the AUDIT or ALARM ACE is not useful.

   The previously described processing applies to ACCESS operations even when they return NFS4_OK.  For the purposes of AUDIT and ALARM, we consider an ACCESS operation to be a "failure" if it fails to return a bit that was requested and supported.

ACE4_IDENTIFIER_GROUP
   Indicates that the "who" refers to a GROUP as defined under UNIX or a GROUP ACCOUNT as defined under Windows.  Clients and servers MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who value equal to one of the special identifiers outlined in Section 6.2.1.5.

ACE4_INHERITED_ACE
   Indicates that this ACE is inherited from a parent directory.  A server that supports automatic inheritance will place this flag on any ACEs inherited from the parent directory when creating a new object.  Client applications will use this to perform automatic inheritance.  Clients and servers MUST clear this bit in the acl attribute; it may only be used in the dacl and sacl attributes.

6.2.1.5.  ACE Who

The "who" field of an ACE is an identifier that specifies the principal or principals to whom the ACE applies.  It may refer to a user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying which.

There are several special identifiers that need to be understood universally, rather than in the context of a particular DNS domain.  Some of these identifiers cannot be understood when an NFS client accesses the server, but have meaning when a local process accesses the file.  Displaying and modifying these permissions is permitted over NFS, even if none of the access methods on the server understands the identifiers.

   +---------------+---------------------------------------------------+
   | Who           | Description                                       |
   +---------------+---------------------------------------------------+
   | OWNER         | The owner of the file.                            |
   | GROUP         | The group associated with the file.               |
   | EVERYONE      | The world, including the owner and owning group.  |
   | INTERACTIVE   | Accessed from an interactive terminal.            |
   | NETWORK       | Accessed via the network.                         |
   | DIALUP        | Accessed as a dialup user to the server.          |
   | BATCH         | Accessed from a batch job.                        |
   | ANONYMOUS     | Accessed without any authentication.              |
   | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS).   |
   | SERVICE       | Access from a system service.                     |
   +---------------+---------------------------------------------------+

                                Table 4

To avoid conflict, these special identifiers are distinguished by an appended "@" and should appear in the form "xxxx@" (with no domain name after the "@"), for example, ANONYMOUS@.

The ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries with these special identifiers.  When encoding entries with these special identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to zero.

6.2.1.5.1.  Discussion of EVERYONE@

It is important to note that "EVERYONE@" is not equivalent to the UNIX "other" entity.  This is because, by definition, UNIX "other" does not include the owner or owning group of a file.  "EVERYONE@" means literally everyone, including the owner or owning group.

6.2.2.  Attribute 58: dacl

The dacl attribute is like the acl attribute, but dacl allows just ALLOW and DENY ACEs.
The dacl attribute supports automatic 6686 inheritance (see Section 6.4.3.2). 6688 6.2.3. Attribute 59: sacl 6690 The sacl attribute is like the acl attribute, but sacl allows just 6691 AUDIT and ALARM ACEs. The sacl attribute supports automatic 6692 inheritance (see Section 6.4.3.2). 6694 6.2.4. Attribute 33: mode 6696 The NFSv4.1 mode attribute is based on the UNIX mode bits. The 6697 following bits are defined: 6699 const MODE4_SUID = 0x800; /* set user id on execution */ 6700 const MODE4_SGID = 0x400; /* set group id on execution */ 6701 const MODE4_SVTX = 0x200; /* save text even after use */ 6702 const MODE4_RUSR = 0x100; /* read permission: owner */ 6703 const MODE4_WUSR = 0x080; /* write permission: owner */ 6704 const MODE4_XUSR = 0x040; /* execute permission: owner */ 6705 const MODE4_RGRP = 0x020; /* read permission: group */ 6706 const MODE4_WGRP = 0x010; /* write permission: group */ 6707 const MODE4_XGRP = 0x008; /* execute permission: group */ 6708 const MODE4_ROTH = 0x004; /* read permission: other */ 6709 const MODE4_WOTH = 0x002; /* write permission: other */ 6710 const MODE4_XOTH = 0x001; /* execute permission: other */ 6712 Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal 6713 identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and 6714 MODE4_XGRP apply to principals identified in the owner_group 6715 attribute but who are not identified in the owner attribute. Bits 6716 MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH apply to any principal that 6717 does not match that in the owner attribute and does not have a group 6718 matching that of the owner_group attribute. 6720 Bits within a mode other than those specified above are not defined 6721 by this protocol. A server MUST NOT return bits other than those 6722 defined above in a GETATTR or READDIR operation, and it MUST return 6723 NFS4ERR_INVAL if bits other than those defined above are set in a 6724 SETATTR, CREATE, OPEN, VERIFY, or NVERIFY operation. 6726 6.2.5. 
Attribute 74: mode_set_masked 6728 The mode_set_masked attribute is a write-only attribute that allows 6729 individual bits in the mode attribute to be set or reset, without 6730 changing others. It allows, for example, the bits MODE4_SUID, 6731 MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified 6732 any of the nine low-order mode bits devoted to permissions. 6734 In such instances that the nine low-order bits are left unmodified, 6735 then neither the acl nor the dacl attribute should be automatically 6736 modified as discussed in Section 6.4.1. 6738 The mode_set_masked attribute consists of two words, each in the form 6739 of a mode4. The first consists of the value to be applied to the 6740 current mode value and the second is a mask. Only bits set to one in 6741 the mask word are changed (set or reset) in the file's mode. All 6742 other bits in the mode remain unchanged. Bits in the first word that 6743 correspond to bits that are zero in the mask are ignored, except that 6744 undefined bits are checked for validity and can result in 6745 NFS4ERR_INVAL as described below. 6747 The mode_set_masked attribute is only valid in a SETATTR operation. 6748 If it is used in a CREATE or OPEN operation, the server MUST return 6749 NFS4ERR_INVAL. 6751 Bits not defined as valid in the mode attribute are not valid in 6752 either word of the mode_set_masked attribute. The server MUST return 6753 NFS4ERR_INVAL if any such bits are set to one in a SETATTR. If the 6754 mode and mode_set_masked attributes are both specified in the same 6755 SETATTR, the server MUST also return NFS4ERR_INVAL. 6757 6.3. Common Methods 6759 The requirements in this section will be referred to in future 6760 sections, especially Section 6.4. 6762 6.3.1. Interpreting an ACL 6764 6.3.1.1. Server Considerations 6766 The server uses the algorithm described in Section 6.2.1 to determine 6767 whether an ACL allows access to an object. 
However, the ACL might 6768 not be the sole determiner of access. For example: 6770 o In the case of a file system exported as read-only, the server may 6771 deny write access even though an object's ACL grants it. 6773 o Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL 6774 permissions to prevent a situation from arising in which there is 6775 no valid way to ever modify the ACL. 6777 o All servers will allow a user the ability to read the data of the 6778 file when only the execute permission is granted (i.e., if the ACL 6779 denies the user the ACE4_READ_DATA access and allows the user 6780 ACE4_EXECUTE, the server will allow the user to read the data of 6781 the file). 6783 o Many servers have the notion of owner-override in which the owner 6784 of the object is allowed to override accesses that are denied by 6785 the ACL. This may be helpful, for example, to allow users 6786 continued access to open files on which the permissions have 6787 changed. 6789 o Many servers have the notion of a "superuser" that has privileges 6790 beyond an ordinary user. The superuser may be able to read or 6791 write data or metadata in ways that would not be permitted by the 6792 ACL. 6794 o A retention attribute might also block access otherwise allowed by 6795 ACLs (see Section 5.13). 6797 6.3.1.2. Client Considerations 6799 Clients SHOULD NOT do their own access checks based on their 6800 interpretation of the ACL, but rather use the OPEN and ACCESS 6801 operations to do access checks. This allows the client to act on the 6802 results of having the server determine whether or not access should 6803 be granted based on its interpretation of the ACL. 6805 Clients must be aware of situations in which an object's ACL will 6806 define a certain access even though the server will not enforce it. 6807 In general, but especially in these situations, the client needs to 6808 do its part in the enforcement of access as defined by the ACL. 
To 6809 do this, the client MAY send the appropriate ACCESS operation prior 6810 to servicing the request of the user or application in order to 6811 determine whether the user or application should be granted the 6812 access requested. For examples in which the ACL may define accesses 6813 that the server doesn't enforce, see Section 6.3.1.1. 6815 6.3.2. Computing a Mode Attribute from an ACL 6817 The following method can be used to calculate the MODE4_R*, MODE4_W*, 6818 and MODE4_X* bits of a mode attribute, based upon an ACL. 6820 First, for each of the special identifiers OWNER@, GROUP@, and 6821 EVERYONE@, evaluate the ACL in order, considering only ALLOW and DENY 6822 ACEs for the identifier EVERYONE@ and for the identifier under 6823 consideration. The result of the evaluation will be an NFSv4 ACL 6824 mask showing exactly which bits are permitted to that identifier. 6826 Then translate the calculated mask for OWNER@, GROUP@, and EVERYONE@ 6827 into mode bits for, respectively, the user, group, and other, as 6828 follows: 6830 1. Set the read bit (MODE4_RUSR, MODE4_RGRP, or MODE4_ROTH) if and 6831 only if ACE4_READ_DATA is set in the corresponding mask. 6833 2. Set the write bit (MODE4_WUSR, MODE4_WGRP, or MODE4_WOTH) if and 6834 only if ACE4_WRITE_DATA and ACE4_APPEND_DATA are both set in the 6835 corresponding mask. 6837 3. Set the execute bit (MODE4_XUSR, MODE4_XGRP, or MODE4_XOTH), if 6838 and only if ACE4_EXECUTE is set in the corresponding mask. 6840 6.3.2.1. Discussion 6842 Some server implementations also add bits permitted to named users 6843 and groups to the group bits (MODE4_RGRP, MODE4_WGRP, and 6844 MODE4_XGRP). 6846 Implementations are discouraged from doing this, because it has been 6847 found to cause confusion for users who see members of a file's group 6848 denied access that the mode bits appear to allow. (The presence of 6849 DENY ACEs may also lead to such behavior, but DENY ACEs are expected 6850 to be more rarely used.) 
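The mode-computation method of Section 6.3.2 can be sketched in executable form. The following is a non-normative illustration: the access-mask constants match the values defined in Section 6.2.1.3, but the tuple representation of an ACE (type, who, access mask) and the helper names are assumptions made for this example only.

```python
# Illustrative sketch of the Section 6.3.2 mode computation.
# Constant values match the spec; the ACE tuple layout
# (ace_type, who, access_mask) is a hypothetical data model.
ACE4_READ_DATA   = 0x00000001
ACE4_WRITE_DATA  = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_EXECUTE     = 0x00000020

ALLOW, DENY = 0, 1   # ACE4_ACCESS_ALLOWED/_DENIED_ACE_TYPE

def allowed_mask(acl, who):
    """Evaluate ALLOW/DENY ACEs in order for `who` (plus EVERYONE@);
    the first ACE that mentions a given access bit decides that bit."""
    decided = allowed = 0
    for ace_type, ace_who, mask in acl:
        if ace_type not in (ALLOW, DENY) or ace_who not in (who, "EVERYONE@"):
            continue
        new_bits = mask & ~decided      # bits not yet decided
        if ace_type == ALLOW:
            allowed |= new_bits
        decided |= new_bits
    return allowed

def mode_from_acl(acl):
    """Translate the masks for OWNER@, GROUP@, EVERYONE@ into
    the MODE4_R*/W*/X* bits, per the three rules of Section 6.3.2."""
    mode = 0
    for who, shift in (("OWNER@", 6), ("GROUP@", 3), ("EVERYONE@", 0)):
        m = allowed_mask(acl, who)
        if m & ACE4_READ_DATA:
            mode |= 0o4 << shift
        if (m & ACE4_WRITE_DATA) and (m & ACE4_APPEND_DATA):
            mode |= 0o2 << shift        # write needs both WRITE and APPEND
        if m & ACE4_EXECUTE:
            mode |= 0o1 << shift
    return mode
```

Note that, consistent with the discussion above, this sketch deliberately does not fold permissions granted to named users or groups into the group bits.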
6852 The same user confusion seen when fetching the mode also results if 6853 setting the mode does not effectively control permissions for the 6854 owner, group, and other users; this motivates some of the 6855 requirements that follow. 6857 6.4. Requirements 6859 The server that supports both mode and ACL must take care to 6860 synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the 6861 ACEs that have respective who fields of "OWNER@", "GROUP@", and 6862 "EVERYONE@". This way, the client can see if semantically equivalent 6863 access permissions exist whether the client asks for the owner, 6864 owner_group, and mode attributes or for just the ACL. 6866 In this section, much is made of the methods in Section 6.3.2. Many 6867 requirements refer to this section. But note that the methods have 6868 behaviors specified with "SHOULD". This is intentional, to avoid 6869 invalidating existing implementations that compute the mode according 6870 to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by 6871 actual permissions on owner, group, and other. 6873 6.4.1. Setting the Mode and/or ACL Attributes 6875 In the case where a server supports the sacl or dacl attribute, in 6876 addition to the acl attribute, the server MUST fail a request to set 6877 the acl attribute simultaneously with a dacl or sacl attribute. The 6878 error to be given is NFS4ERR_ATTRNOTSUPP. 6880 6.4.1.1. Setting Mode and not ACL 6882 When any of the nine low-order mode bits are subject to change, 6883 either because the mode attribute was set or because the 6884 mode_set_masked attribute was set and the mask included one or more 6885 bits from the nine low-order mode bits, and no ACL attribute is 6886 explicitly set, the acl and dacl attributes must be modified in 6887 accordance with the updated value of those bits. This must happen 6888 even if the value of the low-order bits is the same after the mode is 6889 set as before. 
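The masked update referred to above (mode_set_masked, defined in Section 6.2.5) amounts to a simple bitwise merge plus a validity check. The sketch below is a non-normative illustration; the helper name and the returned flag indicating whether the acl/dacl resynchronization of this section is needed are assumptions for this example.

```python
# Illustrative sketch of a mode_set_masked update (Section 6.2.5).
# The function name and return convention are hypothetical.
MODE4_DEFINED_BITS = 0o7777   # MODE4_SUID..MODE4_XOTH (Section 6.2.4)
MODE4_PERM_BITS    = 0o777    # the nine low-order permission bits

def apply_mode_set_masked(old_mode, value, mask):
    """Merge per mode_set_masked: only bits set in `mask` change.
    Undefined bits in either word are invalid (NFS4ERR_INVAL)."""
    if (value | mask) & ~MODE4_DEFINED_BITS:
        raise ValueError("NFS4ERR_INVAL")      # undefined bits set
    new_mode = (old_mode & ~mask) | (value & mask)
    # Per Section 6.4.1.1, the acl and dacl attributes must be brought
    # into line only when the mask touches the nine low-order bits:
    acl_update_needed = bool(mask & MODE4_PERM_BITS)
    return new_mode, acl_update_needed
```

For example, setting MODE4_SUID while masking only the three high-order bits leaves the permission bits, and hence the acl and dacl, untouched.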
6891 Note that any AUDIT or ALARM ACEs (hence any ACEs in the sacl 6892 attribute) are unaffected by changes to the mode. 6894 In cases in which the permissions bits are subject to change, the acl 6895 and dacl attributes MUST be modified such that the mode computed via 6896 the method in Section 6.3.2 yields the low-order nine bits (MODE4_R*, 6897 MODE4_W*, MODE4_X*) of the mode attribute as modified by the 6898 attribute change. The ACL attributes SHOULD also be modified such 6899 that: 6901 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL 6902 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6903 ACE4_READ_DATA. 6905 2. If MODE4_WGRP is not set, entities explicitly listed in the ACL 6906 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6907 ACE4_WRITE_DATA or ACE4_APPEND_DATA. 6909 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL 6910 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6911 ACE4_EXECUTE. 6913 Access mask bits other than those listed above, appearing in ALLOW 6914 ACEs, MAY also be disabled. 6916 Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect 6917 the permissions of the ACL itself, nor do ACEs of the type AUDIT and 6918 ALARM. As such, it is desirable to leave these ACEs unmodified when 6919 modifying the ACL attributes. 6921 Also note that the requirement may be met by discarding the acl and 6922 dacl, in favor of an ACL that represents the mode and only the mode. 6923 This is permitted, but it is preferable for a server to preserve as 6924 much of the ACL as possible without violating the above requirements. 6925 Discarding the ACL makes it effectively impossible for a file created 6926 with a mode attribute to inherit an ACL (see Section 6.4.3). 6928 6.4.1.2. Setting ACL and Not Mode 6930 When setting the acl or dacl and not setting the mode or 6931 mode_set_masked attributes, the permission bits of the mode need to 6932 be derived from the ACL. 
In this case, the ACL attribute SHOULD be 6933 set as given. The nine low-order bits of the mode attribute 6934 (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result 6935 of the method in Section 6.3.2. The three high-order bits of the 6936 mode (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged. 6938 6.4.1.3. Setting Both ACL and Mode 6940 When setting both the mode (includes use of either the mode attribute 6941 or the mode_set_masked attribute) and the acl or dacl attributes in 6942 the same operation, the attributes MUST be applied in this order: 6943 mode (or mode_set_masked), then ACL. The mode-related attribute is 6944 set as given, then the ACL attribute is set as given, possibly 6945 changing the final mode, as described above in Section 6.4.1.2. 6947 6.4.2. Retrieving the Mode and/or ACL Attributes 6949 This section applies only to servers that support both the mode and 6950 ACL attributes. 6952 Some server implementations may have a concept of "objects without 6953 ACLs", meaning that all permissions are granted and denied according 6954 to the mode attribute and that no ACL attribute is stored for that 6955 object. If an ACL attribute is requested of such a server, the 6956 server SHOULD return an ACL that does not conflict with the mode; 6957 that is to say, the ACL returned SHOULD represent the nine low-order 6958 bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as 6959 described in Section 6.3.2. 6961 For other server implementations, the ACL attribute is always present 6962 for every object. Such servers SHOULD store at least the three high- 6963 order bits of the mode attribute (MODE4_SUID, MODE4_SGID, 6964 MODE4_SVTX). The server SHOULD return a mode attribute if one is 6965 requested, and the low-order nine bits of the mode (MODE4_R*, 6966 MODE4_W*, MODE4_X*) MUST match the result of applying the method in 6967 Section 6.3.2 to the ACL attribute. 6969 6.4.3. 
Creating New Objects 6971 If a server supports any ACL attributes, it may use the ACL 6972 attributes on the parent directory to compute an initial ACL 6973 attribute for a newly created object. This will be referred to as 6974 the inherited ACL within this section. The act of adding one or more 6975 ACEs to the inherited ACL that are based upon ACEs in the parent 6976 directory's ACL will be referred to as inheriting an ACE within this 6977 section. 6979 Implementors should standardize what the behavior of CREATE and OPEN 6980 must be depending on the presence or absence of the mode and ACL 6981 attributes. 6983 1. If just the mode is given in the call: 6985 In this case, inheritance SHOULD take place, but the mode MUST be 6986 applied to the inherited ACL as described in Section 6.4.1.1, 6987 thereby modifying the ACL. 6989 2. If just the ACL is given in the call: 6991 In this case, inheritance SHOULD NOT take place, and the ACL as 6992 defined in the CREATE or OPEN will be set without modification, 6993 and the mode modified as in Section 6.4.1.2. 6995 3. If both mode and ACL are given in the call: 6997 In this case, inheritance SHOULD NOT take place, and both 6998 attributes will be set as described in Section 6.4.1.3. 7000 4. If neither mode nor ACL is given in the call: 7002 In the case where an object is being created without any initial 7003 attributes at all, e.g., an OPEN operation with an opentype4 of 7004 OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD 7005 NOT take place (note that EXCLUSIVE4_1 is a better choice of 7006 createmode4, since it does permit initial attributes). Instead, 7007 the server SHOULD set permissions to deny all access to the newly 7008 created object. It is expected that the appropriate client will 7009 set the desired attributes in a subsequent SETATTR operation, and 7010 the server SHOULD allow that operation to succeed, regardless of 7011 what permissions the object is created with. 
For example, an 7012 empty ACL denies all permissions, but the server should allow the 7013 owner's SETATTR to succeed even though WRITE_ACL is implicitly 7014 denied. 7016 In other cases, inheritance SHOULD take place, and no 7017 modifications to the ACL will happen. The mode attribute, if 7018 supported, MUST be as computed in Section 6.3.2, with the 7019 MODE4_SUID, MODE4_SGID, and MODE4_SVTX bits clear. If no 7020 inheritable ACEs exist on the parent directory, the rules for 7021 creating acl, dacl, or sacl attributes are implementation 7022 defined. If either the dacl or sacl attribute is supported, then 7023 the ACL4_DEFAULTED flag SHOULD be set on the newly created 7024 attributes. 7026 6.4.3.1. The Inherited ACL 7028 If the object being created is not a directory, the inherited ACL 7029 SHOULD NOT inherit ACEs from the parent directory ACL unless the 7030 ACE4_FILE_INHERIT_ACE flag is set. 7032 If the object being created is a directory, the inherited ACL should 7033 inherit all inheritable ACEs from the parent directory, that is, 7034 those that have the ACE4_FILE_INHERIT_ACE or 7035 ACE4_DIRECTORY_INHERIT_ACE flag set. If the inheritable ACE has 7036 ACE4_FILE_INHERIT_ACE set but ACE4_DIRECTORY_INHERIT_ACE is clear, 7037 the inherited ACE on the newly created directory MUST have the 7038 ACE4_INHERIT_ONLY_ACE flag set to prevent the directory from being 7039 affected by ACEs meant for non-directories. 7041 When a new directory is created, the server MAY split any inherited 7042 ACE that is both inheritable and effective (in other words, that has 7043 neither ACE4_INHERIT_ONLY_ACE nor ACE4_NO_PROPAGATE_INHERIT_ACE set), 7044 into two ACEs, one with no inheritance flags and one with 7045 ACE4_INHERIT_ONLY_ACE set. (In the case of a dacl or sacl attribute, 7046 both of those ACEs SHOULD also have the ACE4_INHERITED_ACE flag set.)
7047 This makes it simpler to modify the effective permissions on the 7048 directory without modifying the ACE that is to be inherited to the 7049 new directory's children. 7051 6.4.3.2. Automatic Inheritance 7053 The acl attribute consists only of an array of ACEs, but the sacl 7054 (Section 6.2.3) and dacl (Section 6.2.2) attributes also include an 7055 additional flag field. 7057 struct nfsacl41 { 7058 aclflag4 na41_flag; 7059 nfsace4 na41_aces<>; 7060 }; 7062 The flag field applies to the entire sacl or dacl; three flag values 7063 are defined: 7065 const ACL4_AUTO_INHERIT = 0x00000001; 7066 const ACL4_PROTECTED = 0x00000002; 7067 const ACL4_DEFAULTED = 0x00000004; 7069 and all other bits must be cleared. The ACE4_INHERITED_ACE flag may 7070 be set in the ACEs of the sacl or dacl (whereas it must always be 7071 cleared in the acl). 7073 Together these features allow a server to support automatic 7074 inheritance, which we now explain in more detail. 7076 Inheritable ACEs are normally inherited by child objects only at the 7077 time that the child objects are created; later modifications to 7078 inheritable ACEs do not result in modifications to inherited ACEs on 7079 descendants. 7081 However, the dacl and sacl provide an OPTIONAL mechanism that allows 7082 a client application to propagate changes to inheritable ACEs to an 7083 entire directory hierarchy. 7085 A server that supports this performs inheritance at object creation 7086 time in the normal way, and SHOULD set the ACE4_INHERITED_ACE flag on 7087 any inherited ACEs as they are added to the new object. 7089 A client application such as an ACL editor may then propagate changes 7090 to inheritable ACEs on a directory by recursively traversing that 7091 directory's descendants and modifying each ACL encountered to remove 7092 any ACEs with the ACE4_INHERITED_ACE flag and to replace them by the 7093 new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set). 
It 7094 uses the existing ACE inheritance flags in the obvious way to decide 7095 which ACEs to propagate. (Note that it may encounter further 7096 inheritable ACEs when descending the directory hierarchy and that 7097 those will also need to be taken into account when propagating 7098 inheritable ACEs to further descendants.) 7100 The reach of this propagation may be limited in two ways: first, 7101 automatic inheritance is not performed from any directory ACL that 7102 has the ACL4_AUTO_INHERIT flag cleared; and second, automatic 7103 inheritance stops wherever an ACL with the ACL4_PROTECTED flag is 7104 set, preventing modification of that ACL and also (if the ACL is set 7105 on a directory) of the ACL on any of the object's descendants. 7107 This propagation is performed independently for the sacl and the dacl 7108 attributes; thus, the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may 7109 be independently set for the sacl and the dacl, and propagation of 7110 one type of acl may continue down a hierarchy even where propagation 7111 of the other acl has stopped. 7113 New objects should be created with a dacl and a sacl that both have 7114 the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to 7115 the same value as that on, respectively, the dacl or sacl of the 7116 parent object. 7118 Both the dacl and sacl attributes are RECOMMENDED, and a server may 7119 support one without supporting the other. 7121 A server that supports both the old acl attribute and one or both of 7122 the new dacl or sacl attributes must do so in such a way as to keep 7123 all three attributes consistent with each other. Thus, the ACEs 7124 reported in the acl attribute should be the union of the ACEs 7125 reported in the dacl and sacl attributes, except that the 7126 ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl. 7127 And of course a client that queries only the acl will be unable to 7128 determine the values of the sacl or dacl flag fields.
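The client-side propagation loop described above can be sketched roughly as follows. The flag constants match Sections 6.2.1.4 and 6.4.3.2, but the in-memory tree (dicts with `flag`, `aces`, `is_dir`, `children`) is a hypothetical stand-in: a real client would issue GETATTR and SETATTR for the dacl of each object, and the sketch covers only the dacl and a subset of the inheritance rules of Section 6.4.3.1.

```python
# Non-normative sketch of automatic-inheritance propagation.
# Constant values are from the spec; the tree data model is invented
# for illustration.
ACE4_FILE_INHERIT_ACE         = 0x00000001
ACE4_DIRECTORY_INHERIT_ACE    = 0x00000002
ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004
ACE4_INHERIT_ONLY_ACE         = 0x00000008
ACE4_INHERITED_ACE            = 0x00000080
INHERIT_FLAGS = (ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE |
                 ACE4_NO_PROPAGATE_INHERIT_ACE | ACE4_INHERIT_ONLY_ACE)

ACL4_AUTO_INHERIT = 0x00000001
ACL4_PROTECTED    = 0x00000002

def inherited_aces(parent_aces, child_is_dir):
    """Compute the ACEs a child inherits (rules per Section 6.4.3.1)."""
    out = []
    for who, mask, flags in parent_aces:
        if child_is_dir and flags & ACE4_DIRECTORY_INHERIT_ACE:
            f = flags & ~ACE4_INHERIT_ONLY_ACE        # effective on the dir
            if flags & ACE4_NO_PROPAGATE_INHERIT_ACE:
                f &= ~INHERIT_FLAGS                   # inheritance stops here
        elif flags & ACE4_FILE_INHERIT_ACE:
            if child_is_dir:
                f = flags | ACE4_INHERIT_ONLY_ACE     # carried, not effective
            else:
                f = flags & ~INHERIT_FLAGS
        else:
            continue
        out.append((who, mask, f | ACE4_INHERITED_ACE))
    return out

def propagate(directory):
    """Push this directory's inheritable ACEs down its subtree."""
    if not directory["flag"] & ACL4_AUTO_INHERIT:
        return                      # no automatic inheritance from here
    heritable = [a for a in directory["aces"] if a[2] & INHERIT_FLAGS]
    for child in directory["children"]:
        if child["flag"] & ACL4_PROTECTED:
            continue                # this ACL (and its subtree) is left alone
        kept = [a for a in child["aces"] if not a[2] & ACE4_INHERITED_ACE]
        child["aces"] = kept + inherited_aces(heritable, child["is_dir"])
        if child["is_dir"]:
            propagate(child)
```

As the text requires, inherited ACEs are replaced as a group and appended after the object's own ACEs, and any ACL marked ACL4_PROTECTED is skipped along with its subtree.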
7130 When a client performs a SETATTR for the acl attribute, the server 7131 SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the 7132 dacl. By using the acl attribute, as opposed to the dacl or sacl 7133 attributes, the client signals that it may not understand automatic 7134 inheritance, and thus cannot be trusted to set an ACL for which 7135 automatic inheritance would make sense. 7137 When a client application queries an ACL, modifies it, and sets it 7138 again, it should leave any ACEs marked with ACE4_INHERITED_ACE 7139 unchanged, in their original order, at the end of the ACL. If the 7140 application is unable to do this, it should set the ACL4_PROTECTED 7141 flag. This behavior is not enforced by servers, but violations of 7142 this rule may lead to unexpected results when applications perform 7143 automatic inheritance. 7145 If a server also supports the mode attribute, it SHOULD set the mode 7146 in such a way that leaves inherited ACEs unchanged, in their original 7147 order, at the end of the ACL. If it is unable to do so, it SHOULD 7148 set the ACL4_PROTECTED flag on the file's dacl. 7150 Finally, in the case where the request that creates a new file or 7151 directory does not also set permissions for that file or directory, 7152 and there are also no ACEs to inherit from the parent's directory, 7153 then the server's choice of ACL for the new object is implementation- 7154 dependent. In this case, the server SHOULD set the ACL4_DEFAULTED 7155 flag on the ACL it chooses for the new object. An application 7156 performing automatic inheritance takes the ACL4_DEFAULTED flag as a 7157 sign that the ACL should be completely replaced by one generated 7158 using the automatic inheritance rules. 7160 7. Single-Server Namespace 7162 This section describes the NFSv4 single-server namespace. 
Single- 7163 server namespaces may be presented directly to clients, or they may 7164 be used as a basis to form larger multi-server namespaces (e.g., 7165 site-wide or organization-wide) to be presented to clients, as 7166 described in Section 11. 7168 7.1. Server Exports 7170 On a UNIX server, the namespace describes all the files reachable by 7171 pathnames under the root directory or "/". On a Windows server, the 7172 namespace constitutes all the files on disks named by mapped disk 7173 letters. NFS server administrators rarely make the entire server's 7174 file system namespace available to NFS clients. More often, portions 7175 of the namespace are made available via an "export" feature. In 7176 previous versions of the NFS protocol, the root filehandle for each 7177 export is obtained through the MOUNT protocol; the client sent a 7178 string that identified the export name within the namespace and the 7179 server returned the root filehandle for that export. The MOUNT 7180 protocol also provided an EXPORTS procedure that enumerated the 7181 server's exports. 7183 7.2. Browsing Exports 7185 The NFSv4.1 protocol provides a root filehandle that clients can use 7186 to obtain filehandles for the exports of a particular server, via a 7187 series of LOOKUP operations within a COMPOUND, to traverse a path. A 7188 common user experience is to use a graphical user interface (perhaps 7189 a file "Open" dialog window) to find a file via progressive browsing 7190 through a directory tree. The client must be able to move from one 7191 export to another export via single-component, progressive LOOKUP 7192 operations. 7194 This style of browsing is not well supported by the NFSv3 protocol. 7195 In NFSv3, the client expects all LOOKUP operations to remain within a 7196 single server file system. For example, the device attribute will 7197 not change. This prevents a client from taking namespace paths that 7198 span exports. 
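By contrast, an NFSv4.1 client can walk an arbitrary path one component at a time and fetch the fsid attribute along the way to notice where it crosses from one file system into another (for example, from a pseudo file system into an export; see Sections 7.3 and 7.7). The following is a rough, non-normative sketch: the `Namespace` object is an in-memory stand-in invented for this example, not a real COMPOUND of PUTROOTFH/LOOKUP/GETATTR operations.

```python
# Hypothetical in-memory namespace standing in for a server; a real
# client would send PUTROOTFH / LOOKUP / GETATTR(fsid) in a COMPOUND.
class Namespace:
    def __init__(self, tree):
        self.tree = tree              # nested dicts: {"fsid": n, "entries": {...}}

    def lookup(self, node, name):
        return node["entries"][name]  # single-component LOOKUP

def walk(ns, path):
    """Walk `path` component by component, recording each name at
    which the fsid changes, i.e., where a file system was crossed."""
    node = ns.tree
    fsid = node["fsid"]
    crossings = []
    for name in filter(None, path.split("/")):
        node = ns.lookup(node, name)
        if node["fsid"] != fsid:
            crossings.append(name)
            fsid = node["fsid"]
    return node, crossings
```

In the NFSv3 model sketched in the preceding paragraph, the equivalent walk would fail at the first export boundary rather than report it.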
7200 In the case of NFSv3, an automounter on the client can obtain a 7201 snapshot of the server's namespace using the EXPORTS procedure of the 7202 MOUNT protocol. If it understands the server's pathname syntax, it 7203 can create an image of the server's namespace on the client. The 7204 parts of the namespace that are not exported by the server are filled 7205 in with directories that might be constructed similarly to an NFSv4.1 7206 "pseudo file system" (see Section 7.3) that allows the user to browse 7207 from one mounted file system to another. There is a drawback to this 7208 representation of the server's namespace on the client: it is static. 7209 If the server administrator adds a new export, the client will be 7210 unaware of it. 7212 7.3. Server Pseudo File System 7214 NFSv4.1 servers avoid this namespace inconsistency by presenting all 7215 the exports for a given server within the framework of a single 7216 namespace for that server. An NFSv4.1 client uses LOOKUP and READDIR 7217 operations to browse seamlessly from one export to another. 7219 Where there are portions of the server namespace that are not 7220 exported, clients require some way of traversing those portions to 7221 reach actual exported file systems. A technique that servers may use 7222 to provide for this is to bridge the unexported portion of the 7223 namespace via a "pseudo file system" that provides a view of exported 7224 directories only. A pseudo file system has a unique fsid and behaves 7225 like a normal, read-only file system. 7227 Based on the construction of the server's namespace, it is possible 7228 that multiple pseudo file systems may exist. For example, 7230 /a pseudo file system 7231 /a/b real file system 7232 /a/b/c pseudo file system 7233 /a/b/c/d real file system 7235 Each of the pseudo file systems is considered a separate entity and 7236 therefore MUST have its own fsid, unique among all the fsids for that 7237 server. 7239 7.4. 
Multiple Roots 7241 Certain operating environments are sometimes described as having 7242 "multiple roots". In such environments, individual file systems are 7243 commonly represented by disk or volume names. NFSv4 servers for 7244 these platforms can construct a pseudo file system above these root 7245 names so that disk letters or volume names are simply directory names 7246 in the pseudo root. 7248 7.5. Filehandle Volatility 7250 The nature of the server's pseudo file system is that it is a logical 7251 representation of file system(s) available from the server. 7252 Therefore, the pseudo file system is most likely constructed 7253 dynamically when the server is first instantiated. It is expected 7254 that the pseudo file system may not have an on-disk counterpart from 7255 which persistent filehandles could be constructed. Even though it is 7256 preferable that the server provide persistent filehandles for the 7257 pseudo file system, the NFS client should expect that pseudo file 7258 system filehandles are volatile. This can be confirmed by checking 7259 the associated "fh_expire_type" attribute for those filehandles in 7260 question. If the filehandles are volatile, the NFS client must be 7261 prepared to recover a filehandle value (e.g., with a series of LOOKUP 7262 operations) when receiving an error of NFS4ERR_FHEXPIRED. 7264 Because it is quite likely that servers will implement pseudo file 7265 systems using volatile filehandles, clients need to be prepared for 7266 them, rather than assuming that all filehandles will be persistent. 7268 7.6. Exported Root 7270 If the server's root file system is exported, one might conclude that 7271 a pseudo file system is unneeded. This is not necessarily so. 7272 Assume the following file systems on a server: 7274 / fs1 (exported) 7275 /a fs2 (not exported) 7276 /a/b fs3 (exported) 7278 Because fs2 is not exported, fs3 cannot be reached with simple 7279 LOOKUPs. 
The server must bridge the gap with a pseudo file system.

7.7.  Mount Point Crossing

The server file system environment may be constructed in such a way that one file system contains a directory that is 'covered' or mounted upon by a second file system.  For example:

   /a/b            (file system 1)
   /a/b/c/d        (file system 2)

The pseudo file system for this server may be constructed to look like:

   /               (place holder/not exported)
   /a/b            (file system 1)
   /a/b/c/d        (file system 2)

It is the server's responsibility to present the pseudo file system that is complete to the client.  If the client sends a LOOKUP request for the path /a/b/c/d, the server's response is the filehandle of the root of the file system /a/b/c/d.  In previous versions of the NFS protocol, the server would respond with the filehandle of directory /a/b/c/d within the file system /a/b.

The NFS client will be able to determine if it crosses a server mount point by a change in the value of the "fsid" attribute.

7.8.  Security Policy and Namespace Presentation

Because NFSv4 clients possess the ability to change the security mechanisms used, after determining what is allowed, by using SECINFO and SECINFO_NONAME, the server SHOULD NOT present a different view of the namespace based on the security mechanism being used by a client.  Instead, it should present a consistent view and return NFS4ERR_WRONGSEC if an attempt is made to access data with an inappropriate security mechanism.

If security considerations make it necessary to hide the existence of a particular file system, as opposed to all of the data within it, the server can apply the security policy of a shared resource in the server's namespace to components of the resource's ancestors.
For example:

   /                       (place holder/not exported)
   /a/b                    (file system 1)
   /a/b/MySecretProject    (file system 2)

The /a/b/MySecretProject directory is a real file system and is the shared resource.  Suppose the security policy for /a/b/MySecretProject is Kerberos with integrity and it is desired to limit knowledge of the existence of this file system.  In this case, the server should apply the same security policy to /a/b.  This allows for knowledge of the existence of a file system to be secured when desirable.

For the case of the use of multiple, disjoint security mechanisms in the server's resources, applying that sort of policy would result in the higher-level file system not being accessible using any security flavor.  Therefore, that sort of configuration is not compatible with hiding the existence (as opposed to the contents) of a file system from clients using multiple disjoint sets of security flavors.

In other circumstances, a desirable policy is for the security of a particular object in the server's namespace to include the union of all security mechanisms of all direct descendants.  A common and convenient practice, unless strong security requirements dictate otherwise, is to make the entire pseudo file system accessible by all of the valid security mechanisms.

Where there is concern about the security of data on the network, clients should use strong security mechanisms to access the pseudo file system in order to prevent man-in-the-middle attacks.

8.  State Management

Integrating locking into the NFS protocol necessarily causes it to be stateful.
With the inclusion of such features as share reservations, file and directory delegations, recallable layouts, and support for mandatory byte-range locking, the protocol becomes substantially more dependent on proper management of state than the traditional combination of NFS and NLM (Network Lock Manager) [49].  These features include expanded locking facilities, which provide some measure of inter-client exclusion, but the state also offers features not readily providable using a stateless model.  There are three components to making this state manageable:

   o  clear division between client and server

   o  ability to reliably detect inconsistency in state between client and server

   o  simple and robust recovery mechanisms

In this model, the server owns the state information.  The client requests changes in locks and the server responds with the changes made.  Non-client-initiated changes in locking state are infrequent.  The client receives prompt notification of such changes and can adjust its view of the locking state to reflect the server's changes.

Individual pieces of state created by the server and passed to the client at its request are represented by 128-bit stateids.  These stateids may represent a particular open file, a set of byte-range locks held by a particular owner, or a recallable delegation of privileges to access a file in particular ways or at a particular location.

In all cases, there is a transition from the most general information that represents a client as a whole to the eventual lightweight stateid used for most client and server locking interactions.  The details of this transition will vary with the type of object, but it always starts with a client ID.

8.1.  Client and Session ID

A client must establish a client ID (see Section 2.4) and then one or more session IDs (see Section 2.10) before performing any operations to open, byte-range lock, delegate, or obtain a layout for a file object.  Each session ID is associated with a specific client ID, and thus serves as a shorthand reference to an NFSv4.1 client.

For some types of locking interactions, the client will represent some number of internal locking entities called "owners", which normally correspond to processes internal to the client.  For other types of locking-related objects, such as delegations and layouts, no such intermediate entities are provided for, and the locking-related objects are considered to be transferred directly between the server and a unitary client.

8.2.  Stateid Definition

When the server grants a lock of any type (including opens, byte-range locks, delegations, and layouts), it responds with a unique stateid that represents a set of locks (often a single lock) for the same file, of the same type, and sharing the same ownership characteristics.  Thus, opens of the same file by different open-owners each have an identifying stateid.  Similarly, each set of byte-range locks on a file owned by a specific lock-owner has its own identifying stateid.  Delegations and layouts also have associated stateids by which they may be referenced.  The stateid is used as a shorthand reference to a lock or set of locks, and given a stateid, the server can determine the associated state-owner or state-owners (in the case of an open-owner/lock-owner pair) and the associated filehandle.  When stateids are used, the current filehandle must be the one associated with that stateid.
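The granting model described above can be sketched as follows.  This is a minimal, hypothetical illustration (the class and its names are not part of the protocol): each distinct (client ID, state-owner, filehandle, lock type) combination gets its own stateid, and a regrant for the same combination reuses the "other" value while bumping the seqid (the seqid starts at one and skips zero on wraparound, per Section 8.2.2).

```python
class StateidTable:
    """Hypothetical server-side table mapping each lock-set identity to
    a (other, seqid) stateid pair, per the model sketched above."""

    def __init__(self):
        self._by_key = {}       # (client_id, owner, filehandle, type) -> (other, seqid)
        self._next_other = 1    # next unused "other" value

    def grant(self, client_id, owner, filehandle, lock_type):
        key = (client_id, owner, filehandle, lock_type)
        if key in self._by_key:
            # Same ownership characteristics: keep "other", bump "seqid",
            # skipping zero when wrapping past NFS4_UINT32_MAX.
            other, seqid = self._by_key[key]
            seqid = 1 if seqid == 0xFFFFFFFF else seqid + 1
        else:
            # A newly created set of locks starts with a seqid of one.
            other, seqid = self._next_other, 1
            self._next_other += 1
        self._by_key[key] = (other, seqid)
        return (other, seqid)
```

With this sketch, two opens of the same file by different open-owners yield stateids with different "other" values, while an OPEN upgrade by the same open-owner reuses the "other" value with an incremented seqid.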
All stateids associated with a given client ID are associated with a common lease that represents the claim of those stateids and the objects they represent to be maintained by the server.  See Section 8.3 for a discussion of the lease.

The server may assign stateids independently for different clients.  A stateid with the same bit pattern for one client may designate an entirely different set of locks for a different client.  The stateid is always interpreted with respect to the client ID associated with the current session.  Stateids apply to all sessions associated with the given client ID, and the client may use a stateid obtained from one session on another session associated with the same client ID.

8.2.1.  Stateid Types

With the exception of special stateids (see Section 8.2.3), each stateid represents locking objects of one of a set of types defined by the NFSv4.1 protocol.  Note that in all these cases, where we speak of guarantee, it is understood there are situations such as a client restart, or lock revocation, that allow the guarantee to be voided.

   o  Stateids may represent opens of files.

      Each stateid in this case represents the OPEN state for a given client ID/open-owner/filehandle triple.  Such stateids are subject to change (with consequent incrementing of the stateid's seqid) in response to OPENs that result in upgrade and OPEN_DOWNGRADE operations.

   o  Stateids may represent sets of byte-range locks.

      All locks held on a particular file by a particular owner and gotten under the aegis of a particular open file are associated with a single stateid, with the seqid being incremented whenever LOCK and LOCKU operations affect that set of locks.
   o  Stateids may represent file delegations, which are recallable guarantees by the server to the client that other clients will not reference or modify a particular file, until the delegation is returned.  In NFSv4.1, file delegations may be obtained on both regular and non-regular files.

      A stateid represents a single delegation held by a client for a particular filehandle.

   o  Stateids may represent directory delegations, which are recallable guarantees by the server to the client that other clients will not modify the directory, until the delegation is returned.

      A stateid represents a single delegation held by a client for a particular directory filehandle.

   o  Stateids may represent layouts, which are recallable guarantees by the server to the client that particular files may be accessed via an alternate data access protocol at specific locations.  Such access is limited to particular sets of byte-ranges and may proceed until those byte-ranges are reduced or the layout is returned.

      A stateid represents the set of all layouts held by a particular client for a particular filehandle with a given layout type.  The seqid is updated as the layouts of that set of byte-ranges change, via layout stateid changing operations such as LAYOUTGET and LAYOUTRETURN.

8.2.2.  Stateid Structure

Stateids are divided into two fields, a 96-bit "other" field identifying the specific set of locks and a 32-bit "seqid" sequence value.  Except in the case of special stateids (see Section 8.2.3), a particular value of the "other" field denotes a set of locks of the same type (for example, byte-range locks, opens, delegations, or layouts), for a specific file or directory, and sharing the same ownership characteristics.
The seqid designates a specific instance of such a set of locks, and is incremented to indicate changes in such a set of locks, either by the addition or deletion of locks from the set, a change in the byte-range they apply to, or an upgrade or downgrade in the type of one or more locks.

When such a set of locks is first created, the server returns a stateid with a seqid value of one.  On subsequent operations that modify the set of locks, the server is required to increment the "seqid" field by one whenever it returns a stateid for the same state-owner/file/type combination and there is some change in the set of locks actually designated.  In this case, the server will return a stateid with an "other" field the same as previously used for that state-owner/file/type combination, with an incremented "seqid" field.  This pattern continues until the seqid is incremented past NFS4_UINT32_MAX, and one (not zero) is the next seqid value.

The purpose of the incrementing of the seqid is to allow the server to communicate to the client the order in which operations that modified locking state associated with a stateid have been processed and to make it possible for the client to send requests that are conditional on the set of locks not having changed since the stateid in question was returned.

Except for layout stateids (Section 12.5.3), when a client sends a stateid to the server, it has two choices with regard to the seqid sent.  It may set the seqid to zero to indicate to the server that it wishes the most up-to-date seqid for that stateid's "other" field to be used.  This would be the common choice in the case of a stateid sent with a READ or WRITE operation.  It also may set a non-zero value, in which case the server checks if that seqid is the correct one.
In that case, the server is required to return NFS4ERR_OLD_STATEID if the seqid is lower than the most current value and NFS4ERR_BAD_STATEID if the seqid is greater than the most current value.  This would be the common choice in the case of stateids sent with a CLOSE or OPEN_DOWNGRADE.  Because OPENs may be sent in parallel for the same owner, a client might close a file without knowing that an OPEN upgrade had been done by the server, changing the lock in question.  If CLOSE were sent with a zero seqid, the OPEN upgrade would be cancelled before the client even received an indication that an upgrade had happened.

When a stateid is sent by the server to the client as part of a callback operation, it is not subject to checking for a current seqid and returning NFS4ERR_OLD_STATEID.  This is because the client is not in a position to know the most up-to-date seqid and thus cannot verify it.  Unless specially noted, the seqid value for a stateid sent by the server to the client as part of a callback is required to be zero, with NFS4ERR_BAD_STATEID returned if it is not.

In making comparisons between seqids, both by the client in determining the order of operations and by the server in determining whether NFS4ERR_OLD_STATEID is to be returned, the possibility of the seqid having wrapped around past the NFS4_UINT32_MAX value needs to be taken into account.  When two seqid values are being compared, the total count of slots for all sessions associated with the current client is used to do this.  When one seqid value is less than this total slot count and another seqid value is greater than NFS4_UINT32_MAX minus the total slot count, the latter is to be treated as lower than the former, despite the fact that it is numerically greater.

8.2.3.  Special Stateids

Stateid values whose "other" field is either all zeros or all ones are reserved.  They may not be assigned by the server but have special meanings defined by the protocol.  The particular meaning depends on whether the "other" field is all zeros or all ones and the specific value of the "seqid" field.

The following combinations of "other" and "seqid" are defined in NFSv4.1:

   o  When "other" and "seqid" are both zero, the stateid is treated as a special anonymous stateid, which can be used in READ, WRITE, and SETATTR requests to indicate the absence of any OPEN state associated with the request.  When an anonymous stateid value is used and an existing open denies the form of access requested, then access will be denied to the request.  This stateid MUST NOT be used on operations to data servers (Section 13.6).

   o  When "other" and "seqid" are both all ones, the stateid is a special READ bypass stateid.  When this value is used in WRITE or SETATTR, it is treated like the anonymous value.  When used in READ, the server MAY grant access, even if access would normally be denied to READ operations.  This stateid MUST NOT be used on operations to data servers.

   o  When "other" is zero and "seqid" is one, the stateid represents the current stateid, which is whatever value is the last stateid returned by an operation within the COMPOUND.  In the case of an OPEN, the stateid returned for the open file, and not the delegation, is used.  The stateid passed to the operation in place of the special value has its "seqid" value set to zero, except when the current stateid is used by the operation CLOSE or OPEN_DOWNGRADE.  If there is no operation in the COMPOUND that has returned a stateid value, the server MUST return the error NFS4ERR_BAD_STATEID.
      As illustrated in Figure 6, if the value of a current stateid is a special stateid and the stateid of an operation's arguments has "other" set to zero and "seqid" set to one, then the server MUST return the error NFS4ERR_BAD_STATEID.

   o  When "other" is zero and "seqid" is NFS4_UINT32_MAX, the stateid represents a reserved stateid value defined to be invalid.  When this stateid is used, the server MUST return the error NFS4ERR_BAD_STATEID.

If a stateid value is used that has all zeros or all ones in the "other" field but does not match one of the cases above, the server MUST return the error NFS4ERR_BAD_STATEID.

Special stateids, unlike other stateids, are not associated with individual client IDs or filehandles and can be used with all valid client IDs and filehandles.  In the case of a special stateid designating the current stateid, the current stateid value substituted for the special stateid is associated with a particular client ID and filehandle, and so, if it is used where the current filehandle does not match that associated with the current stateid, the operation to which the stateid is passed will return NFS4ERR_BAD_STATEID.

8.2.4.  Stateid Lifetime and Validation

Stateids must remain valid until either a client restart or a server restart or until the client returns all of the locks associated with the stateid by means of an operation such as CLOSE or DELEGRETURN.  If the locks are lost due to revocation, as long as the client ID is valid, the stateid remains a valid designation of that revoked state until the client frees it by using FREE_STATEID.  Stateids associated with byte-range locks are an exception.  They remain valid even if a LOCKU frees all remaining locks, so long as the open file with which they are associated remains open, unless the client frees the stateids via the FREE_STATEID operation.
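The validation procedures discussed in this section key off the two stateid fields defined in Section 8.2.2, and the seqid checks must use the wraparound-aware comparison also described there.  A minimal sketch, assuming hypothetical helper names (total_slots stands for the total slot count across the client's sessions):

```python
import struct

NFS4_UINT32_MAX = 0xFFFFFFFF

def split_stateid(stateid16):
    """Split a 16-byte stateid into its 32-bit "seqid" and 96-bit
    "other" fields.  In the XDR definition, seqid comes first."""
    seqid = struct.unpack(">I", stateid16[:4])[0]
    other = stateid16[4:]          # 12 opaque bytes
    return seqid, other

def seqid_compare(a, b, total_slots):
    """Wraparound-aware comparison: return -1, 0, or 1 as seqid a is
    logically older than, equal to, or newer than seqid b."""
    if a == b:
        return 0
    # A small value that has just wrapped past NFS4_UINT32_MAX is newer
    # than a value near the top of the range, despite being numerically
    # smaller.
    if a < total_slots and b > NFS4_UINT32_MAX - total_slots:
        return 1
    if b < total_slots and a > NFS4_UINT32_MAX - total_slots:
        return -1
    return 1 if a > b else -1
```

A server applying the checks of this section would, for a non-zero incoming seqid, return NFS4ERR_OLD_STATEID when the comparison reports "older" and NFS4ERR_BAD_STATEID when it reports "newer" than the current value.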
It should be noted that there are situations in which the client's locks become invalid, without the client requesting they be returned.  These include lease expiration and a number of forms of lock revocation within the lease period.  It is important to note that in these situations, the stateid remains valid and the client can use it to determine the disposition of the associated lost locks.

An "other" value must never be reused for a different purpose (i.e., different filehandle, owner, or type of locks) within the context of a single client ID.  A server may retain the "other" value for the same purpose beyond the point where it may otherwise be freed, but if it does so, it must maintain "seqid" continuity with previous values.

One mechanism that may be used to satisfy the requirement that the server recognize invalid and out-of-date stateids is for the server to divide the "other" field of the stateid into two fields:

   o  an index into a table of locking-state structures.

   o  a generation number that is incremented on each allocation of a table entry for a particular use.

And then store in each table entry:

   o  the client ID with which the stateid is associated.

   o  the current generation number for the (at most one) valid stateid sharing this index value.

   o  the filehandle of the file on which the locks are taken.

   o  an indication of the type of stateid (open, byte-range lock, file delegation, directory delegation, layout).

   o  the last "seqid" value returned corresponding to the current "other" value.

   o  an indication of the current status of the locks associated with this stateid, in particular, whether these have been revoked and if so, for what reason.

With this information, an incoming stateid can be validated and the appropriate error returned when necessary.
Special and non-special stateids are handled separately.  (See Section 8.2.3 for a discussion of special stateids.)

Note that stateids are implicitly qualified by the current client ID, as derived from the client ID associated with the current session.  Note, however, that the semantics of the session will prevent stateids associated with a previous client or server instance from being analyzed by this procedure.

If server restart has resulted in an invalid client ID or a session ID that is invalid, SEQUENCE will return an error and the operation that takes a stateid as an argument will never be processed.

If there has been a server restart where there is a persistent session and all leased state has been lost, then the session in question will, although valid, be marked as dead, and any operation not satisfied by means of the reply cache will receive the error NFS4ERR_DEADSESSION, and thus not be processed as indicated below.

When a stateid is being tested and the "other" field is all zeros or all ones, a check that the "other" and "seqid" fields match a defined combination for a special stateid is done and the results determined as follows:

   o  If the "other" and "seqid" fields do not match a defined combination associated with a special stateid, the error NFS4ERR_BAD_STATEID is returned.

   o  If the special stateid is one designating the current stateid and there is a current stateid, then the current stateid is substituted for the special stateid and the checks appropriate to non-special stateids are performed.

   o  If the combination is valid in general but is not appropriate to the context in which the stateid is used (e.g., an all-zero stateid is used when an OPEN stateid is required in a LOCK operation), the error NFS4ERR_BAD_STATEID is also returned.
   o  Otherwise, the check is completed and the special stateid is accepted as valid.

When a stateid is being tested, and the "other" field is neither all zeros nor all ones, the following procedure could be used to validate an incoming stateid and return an appropriate error, when necessary, assuming that the "other" field would be divided into a table index and an entry generation:

   o  If the table index field is outside the range of the associated table, return NFS4ERR_BAD_STATEID.

   o  If the selected table entry is of a different generation than that specified in the incoming stateid, return NFS4ERR_BAD_STATEID.

   o  If the selected table entry does not match the current filehandle, return NFS4ERR_BAD_STATEID.

   o  If the client ID in the table entry does not match the client ID associated with the current session, return NFS4ERR_BAD_STATEID.

   o  If the stateid represents revoked state, then return NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED, as appropriate.

   o  If the stateid type is not valid for the context in which the stateid appears, return NFS4ERR_BAD_STATEID.  Note that a stateid may be valid in general, as would be reported by the TEST_STATEID operation, but be invalid for a particular operation, as, for example, when a stateid that doesn't represent byte-range locks is passed to the non-from_open case of LOCK or to LOCKU, or when a stateid that does not represent an open is passed to CLOSE or OPEN_DOWNGRADE.  In such cases, the server MUST return NFS4ERR_BAD_STATEID.

   o  If the "seqid" field is not zero and it is greater than the current sequence value corresponding to the current "other" field, return NFS4ERR_BAD_STATEID.

   o  If the "seqid" field is not zero and it is less than the current sequence value corresponding to the current "other" field, return NFS4ERR_OLD_STATEID.
   o  Otherwise, the stateid is valid, and the table entry should contain any additional information about the type of stateid and information associated with that particular type of stateid, such as the associated set of locks, e.g., open-owner and lock-owner information, as well as information on the specific locks, e.g., open modes and byte-ranges.

8.2.5.  Stateid Use for I/O Operations

Clients performing I/O operations need to select an appropriate stateid based on the locks (including opens and delegations) held by the client and the various types of state-owners sending the I/O requests.  SETATTR operations that change the file size are treated like I/O operations in this regard.

The following rules, applied in order of decreasing priority, govern the selection of the appropriate stateid.  In following these rules, the client will only consider locks of which it has actually received notification by an appropriate operation response or callback.  Note that the rules are slightly different in the case of I/O to data servers when file layouts are being used (see Section 13.9.1).

   o  If the client holds a delegation for the file in question, the delegation stateid SHOULD be used.

   o  Otherwise, if the entity corresponding to the lock-owner (e.g., a process) sending the I/O has a byte-range lock stateid for the associated open file, then the byte-range lock stateid for that lock-owner and open file SHOULD be used.

   o  If there is no byte-range lock stateid, then the OPEN stateid for the open file in question SHOULD be used.

   o  Finally, if none of the above apply, then a special stateid SHOULD be used.

Ignoring these rules may result in situations in which the server does not have information necessary to properly process the request.
For example, when mandatory byte-range locks are in effect, if the stateid does not indicate the proper lock-owner, via a lock stateid, a request might be avoidably rejected.

The server, however, should not try to enforce these ordering rules and should use whatever information is available to properly process I/O requests.  In particular, when a client has a delegation for a given file, it SHOULD take note of this fact in processing a request, even if it is sent with a special stateid.

8.2.6.  Stateid Use for SETATTR Operations

Because each operation is associated with a session ID and from that the client ID can be determined, operations do not need to include a stateid for the server to be able to determine whether they should cause a delegation to be recalled or are to be treated as done within the scope of the delegation.

In the case of SETATTR operations, a stateid is present.  In cases other than those that set the file size, the client may send either a special stateid or, when a delegation is held for the file in question, a delegation stateid.  While the server SHOULD validate the stateid and may use the stateid to optimize the determination as to whether a delegation is held, it SHOULD note the presence of a delegation even when a special stateid is sent, and MUST accept a valid delegation stateid when sent.

8.3.  Lease Renewal

Each client/server pair, as represented by a client ID, has a single lease.  The purpose of the lease is to allow the client to indicate to the server, in a low-overhead way, that it is active, and thus that the server is to retain the client's locks.  This arrangement allows the server to remove stale locking-related objects that are held by a client that has crashed or is otherwise unreachable, once the relevant lease expires.
This in turn allows other clients to obtain conflicting locks without being delayed indefinitely by inactive or unreachable clients.  It is not a mechanism for cache consistency, and lease renewals may not be denied if the lease interval has not expired.

Since each session is associated with a specific client (identified by the client's client ID), any operation sent on that session is an indication that the associated client is reachable.  When a request is sent for a given session, successful execution of a SEQUENCE operation (or successful retrieval of the result of SEQUENCE from the reply cache) on an unexpired lease will result in the lease being implicitly renewed, for the standard renewal period (equal to the lease_time attribute).

If the client ID's lease has not expired when the server receives a SEQUENCE operation, then the server MUST renew the lease.  If the client ID's lease has expired when the server receives a SEQUENCE operation, the server MAY renew the lease; this depends on whether any state was revoked as a result of the client's failure to renew the lease before expiration.

Absent other activity that would renew the lease, a COMPOUND consisting of a single SEQUENCE operation will suffice.  The client should also take communication-related delays into account and take steps to ensure that the renewal messages actually reach the server in good time.  For example:

   o  When trunking is in effect, the client should consider sending multiple requests on different connections, in order to ensure that renewal occurs, even in the event of blockage in the path used for one of those connections.

   o  Transport retransmission delays might become so large as to approach or exceed the length of the lease period.  This may be particularly likely when the server is unresponsive due to a restart; see Section 8.4.2.1.
      If the client implementation is not careful, transport retransmission delays can result in the client failing to detect a server restart before the grace period ends.  The scenario is that the client is using a transport with exponential backoff, such that the maximum retransmission timeout exceeds both the grace period and the lease_time attribute.  A network partition causes the client's connection's retransmission interval to back off, and even after the partition heals, the next transport-level retransmission is sent after the server has restarted and its grace period ends.

      The client MUST either recover from the ensuing NFS4ERR_NO_GRACE errors or it MUST ensure that, despite transport-level retransmission intervals that exceed the lease_time, a SEQUENCE operation is sent that renews the lease before expiration.  The client can achieve this by associating a new connection with the session, and sending a SEQUENCE operation on it.  However, if the attempt to establish a new connection is delayed for some reason (e.g., exponential backoff of the connection establishment packets), the client will have to abort the connection establishment attempt before the lease expires, and attempt to reconnect.

If the server renews the lease upon receiving a SEQUENCE operation, the server MUST NOT allow the lease to expire while the rest of the operations in the COMPOUND procedure's request are still executing.  Once the last operation has finished, and the response to COMPOUND has been sent, the server MUST set the lease to expire no sooner than the sum of current time and the value of the lease_time attribute.
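The renewal rule above can be sketched with a hypothetical server-side bookkeeping class (illustrative only; the class and method names are not from the protocol, and times are plain seconds): a SEQUENCE on an unexpired lease resets expiry to the current time plus lease_time, and because the same reset is applied again when the COMPOUND finishes, the lease cannot expire while the request is still executing.

```python
import time

class Lease:
    """Hypothetical per-client-ID lease record."""

    def __init__(self, lease_time, now=None):
        self.lease_time = lease_time
        self.expiry = (now if now is not None else time.time()) + lease_time

    def expired(self, now):
        return now >= self.expiry

    def renew(self, now):
        """Apply a lease-renewing SEQUENCE at time `now`."""
        if not self.expired(now):
            # MUST renew: push expiry to now + lease_time.
            self.expiry = now + self.lease_time
            return True
        # Already expired: the server MAY renew, depending on whether
        # state was revoked; this sketch simply reports the expiry.
        return False
```

For example, with lease_time of 90 seconds, a renewal at t=1050 on a lease created at t=1000 moves the expiry from 1090 to 1140.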
A client ID's lease can expire when it has been at least the lease interval (lease_time) since the last lease-renewing SEQUENCE operation was sent on any of the client ID's sessions and there are no active COMPOUND operations on any such sessions.

Because the SEQUENCE operation is the basic mechanism to renew a lease, and because it must be done at least once for each lease period, it is the natural mechanism whereby the server will inform the client of changes in the lease status that the client needs to be informed of. The client should inspect the status flags (sr_status_flags) returned by SEQUENCE and take the appropriate action (see Section 18.46.3 for details).

o  The status bits SEQ4_STATUS_CB_PATH_DOWN and SEQ4_STATUS_CB_PATH_DOWN_SESSION indicate problems with the backchannel that the client may need to address in order to receive callback requests.

o  The status bits SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING and SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED indicate problems with GSS contexts or RPCSEC_GSS handles for the backchannel that the client might have to address in order to allow callback requests to be sent.

o  The status bits SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, SEQ4_STATUS_ADMIN_STATE_REVOKED, and SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock revocation events. When these bits are set, the client should use TEST_STATEID to find what stateids have been revoked and use FREE_STATEID to acknowledge loss of the associated state.

o  The status bit SEQ4_STATUS_LEASE_MOVED indicates that responsibility for lease renewal has been transferred to one or more new servers.

o  The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that due to server restart the client must reclaim locking state.
o  The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates that the server has encountered an unrecoverable fault with the backchannel (e.g., it has lost track of a sequence ID for a slot in the backchannel).

8.4.  Crash Recovery

A critical requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts. All READ and WRITE operations that may have been queued within the client or network buffers must wait until the client has successfully recovered the locks protecting the READ and WRITE operations. Any that reach the server before the server can safely determine that the client has recovered enough locking state to be sure that such operations can be safely processed must be rejected. This will happen because either:

o  The state presented is no longer valid since it is associated with a now invalid client ID. In this case, the client will receive either an NFS4ERR_BADSESSION or NFS4ERR_DEADSESSION error, and any attempt to attach a new session to that invalid client ID will result in an NFS4ERR_STALE_CLIENTID error.

o  Subsequent recovery of locks may make execution of the operation inappropriate (NFS4ERR_GRACE).

8.4.1.  Client Failure and Recovery

In the event that a client fails, the server may release the client's locks when the associated lease has expired. Conflicting locks from another client may only be granted after this lease expiration. As discussed in Section 8.3, when a client has not failed and re-establishes its lease before expiration occurs, requests for conflicting locks will not be granted.

To minimize client delay upon restart, lock requests are associated with an instance of the client by a client-supplied verifier.
This verifier is part of the client_owner4 sent in the initial EXCHANGE_ID call made by the client. The server returns a client ID as a result of the EXCHANGE_ID operation. The client then confirms the use of the client ID by establishing a session associated with that client ID (see Section 18.36.3 for a description of how this is done). All locks, including opens, byte-range locks, delegations, and layouts obtained by sessions using that client ID, are associated with that client ID.

Since the verifier will be changed by the client upon each initialization, the server can compare a new verifier to the verifier associated with currently held locks and determine that they do not match. This signifies the client's new instantiation and subsequent loss (upon confirmation of the new client ID) of locking state. As a result, the server is free to release all locks held that are associated with the old client ID that was derived from the old verifier. At this point, conflicting locks from other clients, kept waiting while the lease had not yet expired, can be granted. In addition, all stateids associated with the old client ID can also be freed, as they are no longer reference-able.

Note that the verifier must have the same uniqueness properties as the verifier for the COMMIT operation.

8.4.2.  Server Failure and Recovery

If the server loses locking state (usually as a result of a restart), it must allow clients time to discover this fact and re-establish the lost locking state. The client must be able to re-establish the locking state without having the server deny valid requests because the server has granted conflicting access to another client.
Likewise, if there is a possibility that clients have not yet re-established their locking state for a file, and that such locking state might make it invalid to perform READ or WRITE operations (for example, if mandatory locks are a possibility), the server must disallow READ and WRITE operations for that file.

A client can determine that loss of locking state has occurred via several methods.

1.  When a SEQUENCE (most common) or other operation returns NFS4ERR_BADSESSION, this may mean that the session has been destroyed but the client ID is still valid. The client sends a CREATE_SESSION request with the client ID to re-establish the session. If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, the client must establish a new client ID (see Section 8.1) and re-establish its lock state with the new client ID, after the CREATE_SESSION operation succeeds (see Section 8.4.2.1).

2.  When a SEQUENCE (most common) or other operation on a persistent session returns NFS4ERR_DEADSESSION, this indicates that the session is no longer usable for new operations, i.e., those not satisfied from the reply cache. Once all pending operations are determined to be either performed before the retry or not performed, the client sends a CREATE_SESSION request with the client ID to re-establish the session. If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, the client must establish a new client ID (see Section 8.1) and re-establish its lock state after the CREATE_SESSION, with the new client ID, succeeds (Section 8.4.2.1).

3.  When an operation that is neither SEQUENCE nor preceded by SEQUENCE (for example, CREATE_SESSION or DESTROY_SESSION) returns NFS4ERR_STALE_CLIENTID, the client MUST establish a new client ID (Section 8.1) and re-establish its lock state (Section 8.4.2.1).

8.4.2.1.  State Reclaim

When state information and the associated locks are lost as a result of a server restart, the protocol must provide a way to cause that state to be re-established. The approach used is to define, for most types of locking state (layouts are an exception), a request whose function is to allow the client to re-establish on the server a lock first obtained from a previous instance. Generally, these requests are variants of the requests normally used to create locks of that type and are referred to as "reclaim-type" requests, and the process of re-establishing such locks is referred to as "reclaiming" them.

Because each client must have an opportunity to reclaim all of the locks that it has without the possibility that some other client will be granted a conflicting lock, a "grace period" is devoted to the reclaim process. During this period, requests creating client IDs and sessions are handled normally, but locking requests are subject to special restrictions. Only reclaim-type locking requests are allowed, unless the server can reliably determine (through state persistently maintained across restart instances) that granting any such lock cannot possibly conflict with a subsequent reclaim. When a request is made to obtain a new lock (i.e., not a reclaim-type request) during the grace period and such a determination cannot be made, the server must return the error NFS4ERR_GRACE.

Once a session is established using the new client ID, the client will use reclaim-type locking requests (e.g., LOCK operations with reclaim set to TRUE and OPEN operations with a claim type of CLAIM_PREVIOUS; see Section 9.11) to re-establish its locking state.
Once this is done, or if there is no such locking state to reclaim, the client sends a global RECLAIM_COMPLETE operation, i.e., one with the rca_one_fs argument set to FALSE, to indicate that it has reclaimed all of the locking state that it will reclaim. Once a client sends such a RECLAIM_COMPLETE operation, it may attempt non-reclaim locking operations, although it might get an NFS4ERR_GRACE status result from each such operation until the period of special handling is over. See Section 11.10.9 for a discussion of the analogous handling of lock reclamation in the case of file systems transitioning from server to server.

During the grace period, the server must reject READ and WRITE operations and non-reclaim locking requests (i.e., other LOCK and OPEN operations) with an error of NFS4ERR_GRACE, unless it can guarantee that these may be done safely, as described below.

The grace period may last until all clients that are known to possibly have had locks have done a global RECLAIM_COMPLETE operation, indicating that they have finished reclaiming the locks they held before the server restart. This means that a client that has done a RECLAIM_COMPLETE must be prepared to receive an NFS4ERR_GRACE when attempting to acquire new locks. In order for the server to know that all clients with possible prior lock state have done a RECLAIM_COMPLETE, the server must maintain in stable storage a list of clients that may have such locks. The server may also terminate the grace period before all clients have done a global RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period before a time equal to the lease period, in order to give clients an opportunity to find out about the server restart as a result of sending requests on associated sessions with a frequency governed by the lease time.
Note that when a client does not send such requests (or they are sent by the client but not received by the server), it is possible for the grace period to expire before the client finds out that the server restart has occurred.

Some additional time, in order to allow a client to establish a new client ID and session and to effect lock reclaims, may be added to the lease time. Note that analogous rules apply to file system-specific grace periods discussed in Section 11.10.9.

If the server can reliably determine that granting a non-reclaim request will not conflict with reclamation of locks by other clients, the NFS4ERR_GRACE error does not have to be returned even within the grace period, although NFS4ERR_GRACE must always be returned to clients attempting a non-reclaim lock request before doing their own global RECLAIM_COMPLETE. For the server to be able to service READ and WRITE operations during the grace period, it must again be able to guarantee that no possible conflict could arise between a potential reclaim locking request and the READ or WRITE operation. If the server is unable to offer that guarantee, the NFS4ERR_GRACE error must be returned to the client.

For a server to provide simple, valid handling during the grace period, the easiest method is to simply reject all non-reclaim locking requests and READ and WRITE operations by returning the NFS4ERR_GRACE error. However, a server may keep information about granted locks in stable storage. With this information, the server could determine if a locking, READ, or WRITE operation can be safely processed.

For example, if the server maintained on stable storage summary information on whether mandatory locks exist, either mandatory byte-range locks or share reservations specifying deny modes, many requests could be allowed during the grace period.
If it is known that no such share reservations exist, OPEN requests that do not specify deny modes may be safely granted. If, in addition, it is known that no mandatory byte-range locks exist, either through information stored on stable storage or simply because the server does not support such locks, READ and WRITE operations may be safely processed during the grace period. Another important case is where it is known that no mandatory byte-range locks exist, either because the server does not provide support for them or because their absence is known from persistently recorded data. In this case, READ and WRITE operations specifying stateids derived from reclaim-type operations may be validly processed during the grace period, because the valid reclaim ensures that no lock subsequently granted can prevent the I/O.

To reiterate, for a server that allows non-reclaim lock and I/O requests to be processed during the grace period, it MUST determine that no lock subsequently reclaimed will be rejected and that no lock subsequently reclaimed would have prevented any I/O operation processed during the grace period.

Clients should be prepared for the return of NFS4ERR_GRACE errors for non-reclaim lock and I/O requests. In this case, the client should employ a retry mechanism for the request. A delay (on the order of several seconds) between retries should be used to avoid overwhelming the server. Further discussion of the general issue is included in [50]. The client must account both for servers that can perform I/O and non-reclaim locking requests within the grace period and for those that cannot do so.

A reclaim-type locking request outside the server's grace period can only succeed if the server can guarantee that no conflicting lock or I/O request has been granted since restart.
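The server-side gating described in this section can be summarized in a sketch (illustrative only; the names, and the single 'provably_safe' predicate standing in for whatever persistent state the server actually keeps, are hypothetical):

```python
from collections import namedtuple

# A hypothetical, simplified view of an incoming request.
Request = namedtuple("Request", "is_reclaim is_lock")

def grace_disposition(req, in_grace, sent_reclaim_complete, provably_safe):
    """Dispose of a request relative to the grace period (sketch).

    'provably_safe' stands in for the server's ability to determine,
    from state persisted across restarts, that granting the request
    cannot conflict with any reclaim (or, outside the grace period,
    that no conflicting grant has occurred since restart).
    """
    if in_grace:
        if req.is_reclaim:
            return "PROCESS"
        # A non-reclaim locking request before the client's own global
        # RECLAIM_COMPLETE always fails with NFS4ERR_GRACE.
        if req.is_lock and not sent_reclaim_complete:
            return "NFS4ERR_GRACE"
        # Other non-reclaim locking and I/O requests need the
        # no-conflict guarantee.
        return "PROCESS" if provably_safe else "NFS4ERR_GRACE"
    # A reclaim outside the grace period succeeds only with the
    # no-conflict-since-restart guarantee.
    if req.is_reclaim:
        return "PROCESS" if provably_safe else "NFS4ERR_NO_GRACE"
    return "PROCESS"
```

The simplest valid server corresponds to 'provably_safe' always being false: every non-reclaim lock and I/O request during grace gets NFS4ERR_GRACE.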
A server may, upon restart, establish a new value for the lease period. Therefore, clients should, once a new client ID is established, refetch the lease_time attribute and use it as the basis for lease renewal for the lease associated with that server. However, the server must establish, for this restart event, a grace period at least as long as the lease period for the previous server instantiation. This allows the client state obtained during the previous server instance to be reliably re-established.

The possibility exists that, because of server configuration events, the client will be communicating with a server different than the one on which the locks were obtained, as shown by the combination of eir_server_scope and eir_server_owner. This leads to the issue of if and when the client should attempt to reclaim locks previously obtained on what is being reported as a different server. The rules to resolve this question are as follows:

o  If the server scope is different, the client should not attempt to reclaim locks. In this situation, no lock reclaim is possible. Any attempt to re-obtain the locks with non-reclaim operations is problematic since there is no guarantee that the existing filehandles will be recognized by the new server, or that if recognized, they denote the same objects. It is best to treat the locks as having been revoked by the reconfiguration event.

o  If the server scope is the same, the client should attempt to reclaim locks, even if the eir_server_owner value is different. In this situation, it is the responsibility of the server to return NFS4ERR_NO_GRACE if it cannot provide correct support for lock reclaim operations, including the prevention of edge conditions.

The eir_server_owner field is not used in making this determination.
Its function is to specify trunking possibilities for the client (see Section 2.10.5) and not to control lock reclaim.

8.4.2.1.1.  Security Considerations for State Reclaim

During the grace period, a client can reclaim state that it believes or asserts it had before the server restarted. Unless the server maintained a complete record of all the state the client had, the server has little choice but to trust the client. (Of course, if the server maintained a complete record, then it would not have to force the client to reclaim state after server restart.) While the server has to trust the client to tell the truth, such trust does not have any negative consequences for security. The fundamental rule for the server when processing reclaim requests is that it MUST NOT grant the reclaim if an equivalent non-reclaim request would not be granted during steady state due to access control or access conflict issues. For example, an OPEN request during a reclaim will be refused with NFS4ERR_ACCESS if the principal making the request does not have access to open the file according to the discretionary ACL (Section 6.2.2) on the file.

Nonetheless, it is possible that a client operating in error or maliciously could, during reclaim, prevent another client from reclaiming access to state. For example, an attacker could send an OPEN reclaim operation with a deny mode that prevents another client from reclaiming the OPEN state it had before the server restarted. The attacker could perform the same denial of service during steady state prior to server restart, as long as the attacker had permissions.
Given that the attack vectors are equivalent, the grace period does not offer any additional opportunity for denial of service, and any concerns about this attack vector, whether during grace or steady state, are addressed the same way: use RPCSEC_GSS for authentication and limit access to the file only to principals that the owner of the file trusts.

Note that if prior to restart the server had client IDs with the EXCHGID4_FLAG_BIND_PRINC_STATEID (Section 18.35) capability set, then the server SHOULD record in stable storage the client owner and the principal that established the client ID via EXCHANGE_ID. If the server does not, then there is a risk that a client will be unable to reclaim state if it does not have a credential for a principal that was originally authorized to establish the state.

8.4.3.  Network Partitions and Recovery

If the duration of a network partition is greater than the lease period provided by the server, the server will not have received a lease renewal from the client. If this occurs, the server may free all locks held for the client, or it may allow the lock state to remain for a considerable period, subject to the constraint that if a request for a conflicting lock is made, locks associated with an expired lease do not prevent such a conflicting lock from being granted but MUST be revoked as necessary so as to avoid interfering with such conflicting requests.

If the server chooses to delay freeing of lock state until there is a conflict, it may either free all of the client's locks once there is a conflict or it may only revoke the minimum set of locks necessary to allow conflicting requests. When it adopts the finer-grained approach, it must revoke all locks associated with a given stateid, even if the conflict is with only a subset of locks.
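The finer-grained approach can be sketched as follows (Python purely for illustration; the data model, in which each stateid of an expired lease maps to a list of (offset, length) byte ranges, is a hypothetical simplification):

```python
def revoke_minimum(locks_by_stateid, requested):
    """Revoke expired-lease locks that conflict with a requested range.

    Even if the requested (offset, length) range conflicts with only
    some of a stateid's locks, every lock under that stateid is
    revoked together.  Returns the set of revoked stateids.
    """
    def overlaps(a, b):
        # Half-open interval overlap test on (offset, length) pairs.
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    revoked = {sid for sid, locks in locks_by_stateid.items()
               if any(overlaps(lock, requested) for lock in locks)}
    for sid in revoked:
        del locks_by_stateid[sid]  # all locks under the stateid go
    return revoked
```

Here a stateid with locks at (0, 10) and (100, 10) loses both locks when a new request conflicts with only the first.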
When the server chooses to free all of a client's lock state, either immediately upon lease expiration or as a result of the first attempt to obtain a conflicting lock, the server may report the loss of lock state in a number of ways.

The server may choose to invalidate the session and the associated client ID. In this case, once the client can communicate with the server, it will receive an NFS4ERR_BADSESSION error. Upon attempting to create a new session, it would get an NFS4ERR_STALE_CLIENTID. Upon creating the new client ID and new session, the client will attempt to reclaim locks. Normally, the server will not allow the client to reclaim locks, because the server will not be in its recovery grace period.

Another possibility is for the server to maintain the session and client ID but for all stateids held by the client to become invalid or stale. Once the client can reach the server after such a network partition, the status returned by the SEQUENCE operation will indicate a loss of locking state; i.e., the flag SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in sr_status_flags. In addition, all I/O submitted by the client with the now invalid stateids will fail with the server returning the error NFS4ERR_EXPIRED. Once the client learns of the loss of locking state, it will suitably notify the applications that held the invalidated locks. The client should then take action to free invalidated stateids, either by establishing a new client ID using a new verifier or by doing a FREE_STATEID operation to release each of the invalidated stateids.

When the server adopts a finer-grained approach to revocation of locks when a client's lease has expired, only a subset of stateids will normally become invalid during a network partition.
When the client can communicate with the server after such a network partition heals, the status returned by the SEQUENCE operation will indicate a partial loss of locking state (SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED). In addition, operations, including I/O submitted by the client, with the now invalid stateids will fail with the server returning the error NFS4ERR_EXPIRED. Once the client learns of the loss of locking state, it will use the TEST_STATEID operation on all of its stateids to determine which locks have been lost and then suitably notify the applications that held the invalidated locks. The client can then release the invalidated locking state and acknowledge the revocation of the associated locks by doing a FREE_STATEID operation on each of the invalidated stateids.

When a network partition is combined with a server restart, there are edge conditions that place requirements on the server in order to avoid silent data corruption following the server restart. Two of these edge conditions are known, and are discussed below.

The first edge condition arises as a result of scenarios such as the following:

1.  Client A acquires a lock.

2.  Client A and server experience mutual network partition, such that client A is unable to renew its lease.

3.  Client A's lease expires, and the server releases the lock.

4.  Client B acquires a lock that would have conflicted with that of client A.

5.  Client B releases its lock.

6.  Server restarts.

7.  Network partition between client A and server heals.

8.  Client A connects to a new server instance and finds out about server restart.

9.  Client A reclaims its lock within the server's grace period.

Thus, at the final step, the server has erroneously granted client A's lock reclaim.
If client B modified the object the lock was protecting, client A will experience object corruption.

The second known edge condition arises in situations such as the following:

1.   Client A acquires one or more locks.

2.   Server restarts.

3.   Client A and server experience mutual network partition, such that client A is unable to reclaim all of its locks within the grace period.

4.   Server's reclaim grace period ends. Client A has either no locks or an incomplete set of locks known to the server.

5.   Client B acquires a lock that would have conflicted with a lock of client A that was not reclaimed.

6.   Client B releases the lock.

7.   Server restarts a second time.

8.   Network partition between client A and server heals.

9.   Client A connects to new server instance and finds out about server restart.

10.  Client A reclaims its lock within the server's grace period.

As with the first edge condition, the final step of the scenario of the second edge condition has the server erroneously granting client A's lock reclaim.

Solving the first and second edge conditions requires either that the server always assume after it restarts that some edge condition occurs, and thus return NFS4ERR_NO_GRACE for all reclaim attempts, or that the server record some information in stable storage. The amount of information the server records in stable storage is in inverse proportion to how harsh the server intends to be whenever edge conditions arise. The server that is completely tolerant of all edge conditions will record in stable storage every lock that is acquired, removing the lock record from stable storage only when the lock is released.
For the two edge conditions discussed above, the harshest a server can be, and still support a grace period for reclaims, requires that the server record in stable storage some minimal information. For example, a server implementation could, for each client, save in stable storage a record containing:

o  the co_ownerid field from the client_owner4 presented in the EXCHANGE_ID operation.

o  a boolean that indicates if the client's lease expired or if there was administrative intervention (see Section 8.5) to revoke a byte-range lock, share reservation, or delegation and there has been no acknowledgment, via FREE_STATEID, of such revocation.

o  a boolean that indicates whether the client may have locks that it believes to be reclaimable in situations in which the grace period was terminated, making the server's view of lock reclaimability suspect. The server will set this for any client record in stable storage where the client has not done a suitable RECLAIM_COMPLETE (global or file system-specific depending on the target of the lock request) before it grants any new (i.e., not reclaimed) lock to any client.

Assuming the above record keeping, for the first edge condition, after the server restarts, the record that client A's lease expired means that another client could have acquired a conflicting byte-range lock, share reservation, or delegation. Hence, the server must reject a reclaim from client A with the error NFS4ERR_NO_GRACE.

For the second edge condition, after the server restarts for a second time, the indication that the client had not completed its reclaims at the time at which the grace period ended means that the server must reject a reclaim from client A with the error NFS4ERR_NO_GRACE.

When either edge condition occurs, the client's attempt to reclaim locks will result in the error NFS4ERR_NO_GRACE.
When this is received, or after the client restarts with no lock state, the client will send a global RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is received, the server and client are again in agreement regarding reclaimable locks and both booleans in persistent storage can be reset, to be set again only when there is a subsequent event that causes lock reclaim operations to be questionable.

Regardless of the level and approach to record keeping, the server MUST implement one of the following strategies (which apply to reclaims of share reservations, byte-range locks, and delegations):

1.  Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely unforgiving, but necessary if the server does not record lock state in stable storage.

2.  Record sufficient state in stable storage such that all known edge conditions involving server restart, including the two noted in this section, are detected. It is acceptable to erroneously recognize an edge condition and not allow a reclaim, when, with sufficient knowledge, it would be allowed. The error the server would return in this case is NFS4ERR_NO_GRACE. Note that it is not known if there are other edge conditions.

In the event that, after a server restart, the server determines there is unrecoverable damage or corruption to the information in stable storage, then for all clients and/or locks that may be affected, the server MUST return NFS4ERR_NO_GRACE.

A mandate for the client's handling of the NFS4ERR_NO_GRACE error is outside the scope of this specification, since the strategies for such handling are very dependent on the client's operating environment. However, one potential approach is described below.
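The minimal per-client record and the resulting server-side reclaim decision might be sketched as follows (illustrative only; the field and function names are hypothetical renderings of the record described above):

```python
from dataclasses import dataclass

@dataclass
class ClientRecord:
    co_ownerid: str                  # from client_owner4 in EXCHANGE_ID
    lease_expired_or_revoked: bool   # lease expired, or unacknowledged
                                     # administrative revocation
    reclaim_incomplete: bool         # no suitable RECLAIM_COMPLETE before
                                     # a new lock was granted to any client

def reclaim_decision(record):
    """Decide a reclaim request after restart, per the rules above."""
    if record is None:
        # No stable-storage record of prior lock state for this client.
        return "NFS4ERR_NO_GRACE"
    if record.lease_expired_or_revoked or record.reclaim_incomplete:
        # One of the known edge conditions may have occurred.
        return "NFS4ERR_NO_GRACE"
    return "GRANT_IF_VALID"          # still subject to normal access checks
```

Both booleans are reset when a global RECLAIM_COMPLETE brings the server and client back into agreement about reclaimable locks.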
When the client receives NFS4ERR_NO_GRACE, it could examine the change attribute of the objects for which the client is trying to reclaim state, and use that to determine whether to re-establish the state via normal OPEN or LOCK operations. This is acceptable provided that the client's operating environment allows it. In other words, the client implementor is advised to document this behavior for users. The client could also inform the application that its byte-range lock or share reservations (whether or not they were delegated) have been lost, such as via a UNIX signal, a Graphical User Interface (GUI) pop-up window, etc. See Section 10.5 for a discussion of what the client should do for dealing with unreclaimed delegations on client state.

For further discussion of revocation of locks, see Section 8.5.

8.5.  Server Revocation of Locks

At any point, the server can revoke locks held by a client, and the client must be prepared for this event. When the client detects that its locks have been or may have been revoked, the client is responsible for validating the state information between itself and the server. Validating locking state for the client means that it must verify or reclaim state for each lock currently held.

The first occasion of lock revocation is upon server restart. Note that this includes situations in which sessions are persistent and locking state is lost. In this class of instances, the client will receive an error (NFS4ERR_STALE_CLIENTID) on an operation that takes a client ID (usually as part of recovery in response to a problem with the current session), and the client will proceed with normal crash recovery as described in Section 8.4.2.1.

The second occasion of lock revocation is the inability to renew the lease before expiration, as discussed in Section 8.4.3.
While this is considered a rare or unusual event, the client must be prepared to recover. The server is responsible for determining the precise consequences of the lease expiration, informing the client of the scope of the lock revocation decided upon. The client then uses the status information provided by the server in the SEQUENCE results (field sr_status_flags, see Section 18.46.3) to synchronize its locking state with that of the server, in order to recover.

The third occasion of lock revocation can occur as a result of revocation of locks within the lease period, either because of administrative intervention or because a recallable lock (a delegation or layout) was not returned within the lease period after having been recalled. While these are considered rare events, they are possible, and the client must be prepared to deal with them. When either of these events occurs, the client finds out about the situation through the status returned by the SEQUENCE operation. Any use of stateids associated with locks revoked during the lease period will receive the error NFS4ERR_ADMIN_REVOKED or NFS4ERR_DELEG_REVOKED, as appropriate.

In all situations in which a subset of locking state may have been revoked, which include all cases in which locking state is revoked within the lease period, it is up to the client to determine which locks have been revoked and which have not. It does this by using the TEST_STATEID operation on the appropriate set of stateids. Once the set of revoked locks has been determined, the applications can be notified, and the invalidated stateids can be freed and lock revocation acknowledged by using FREE_STATEID.

8.6.  Short and Long Leases

When determining the time period for the server lease, the usual lease tradeoffs apply.
A short lease is good for fast server recovery at a cost of increased operations to effect lease renewal (when there are no other operations during the period to effect lease renewal as a side effect). A long lease is certainly kinder and gentler to servers trying to handle very large numbers of clients. The number of extra requests to effect lock renewal drops in inverse proportion to the lease time. The disadvantages of a long lease include the possibility of slower recovery after certain failures. After server failure, a longer grace period may be required when some clients do not promptly reclaim their locks and do a global RECLAIM_COMPLETE. In the event of client failure, the longer period for a lease to expire will force conflicting requests to wait longer.

A long lease is practical if the server can store lease state in stable storage. Upon recovery, the server can reconstruct the lease state from its stable storage and continue operation with its clients.

8.7.  Clocks, Propagation Delay, and Calculating Lease Expiration

To avoid the need for synchronized clocks, lease times are granted by the server as a time delta. However, there is a requirement that the client and server clocks do not drift excessively over the duration of the lease. There is also the issue of propagation delay across the network, which could easily be several hundred milliseconds, as well as the possibility that requests will be lost and need to be retransmitted.

To take propagation delay into account, the client should subtract it from lease times (e.g., if the client estimates the one-way propagation delay as 200 milliseconds, then it can assume that the lease is already 200 milliseconds old when it gets it). In addition, it will take another 200 milliseconds to get a response back to the server.
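The adjustment described above can be sketched as follows. This is an illustrative calculation only; the function and parameter names are hypothetical and not part of the protocol:

```python
def renewal_deadline(lease_granted_at, lease_time_s, one_way_delay_s):
    """Latest time at which to transmit a lease-renewing request.

    The lease is assumed to already be one one-way delay old when the
    client receives it, and the renewal itself needs another one-way
    delay to reach the server, so both are subtracted from the nominal
    expiration time.
    """
    effective_expiry = lease_granted_at + lease_time_s - one_way_delay_s
    return effective_expiry - one_way_delay_s

# With a 90-second lease and an estimated 200 ms one-way delay, the
# client must renew no later than 89.6 seconds after the grant time.
deadline = renewal_deadline(lease_granted_at=0.0, lease_time_s=90.0,
                            one_way_delay_s=0.2)
```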
So the client must send a lease renewal or write data back to the server at least 400 milliseconds before the lease would expire. If the propagation delay varies over the life of the lease (e.g., the client is on a mobile host), the client will need to continuously subtract the increase in propagation delay from the lease times.

The server's lease period configuration should take into account the network distance of the clients that will be accessing the server's resources. It is expected that the lease period will take into account the network propagation delays and other network delay factors for the client population. Since the protocol does not allow for an automatic method to determine an appropriate lease period, the server's administrator may have to tune the lease period.

8.8.  Obsolete Locking Infrastructure from NFSv4.0

There are a number of operations and fields within existing operations that no longer have a function in NFSv4.1. In one way or another, these changes are all due to the implementation of sessions that provide client context and exactly once semantics as a base feature of the protocol, separate from locking itself.

The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1. The server MUST return NFS4ERR_NOTSUPP if these operations are found in an NFSv4.1 COMPOUND.

o  SETCLIENTID since its function has been replaced by EXCHANGE_ID.

o  SETCLIENTID_CONFIRM since client ID confirmation now happens by means of CREATE_SESSION.

o  OPEN_CONFIRM because state-owner-based seqids have been replaced by the sequence ID in the SEQUENCE operation.

o  RELEASE_LOCKOWNER because lock-owners with no associated locks do not have any sequence-related state and so can be deleted by the server at will.
o  RENEW because every SEQUENCE operation for a session causes lease renewal, making a separate operation superfluous.

Also, there are a number of fields, present in existing operations, related to locking that have no use in minor version 1. They were used in minor version 0 to perform functions now provided in a different fashion.

o  Sequence ids used to sequence requests for a given state-owner and to provide retry protection, now provided via sessions.

o  Client IDs used to identify the client associated with a given request. Client identification is now available using the client ID associated with the current session, without needing an explicit client ID field.

Such vestigial fields in existing operations have no function in NFSv4.1 and are ignored by the server. Note that client IDs in operations new to NFSv4.1 (such as CREATE_SESSION and DESTROY_CLIENTID) are not ignored.

9.  File Locking and Share Reservations

To support Win32 share reservations, it is necessary to provide operations that atomically open or create files. Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API. In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFSv4.1 protocol defines an OPEN operation that is capable of atomically looking up, creating, and locking a file on the server.

9.1.  Opens and Byte-Range Locks

It is assumed that manipulating a byte-range lock is rare when compared to READ and WRITE operations. It is also assumed that server restarts and network partitions are relatively rare. Therefore, it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock.
A LOCK operation contains the heavyweight information required to establish a byte-range lock and uniquely define the owner of the lock.

9.1.1.  State-Owner Definition

When opening a file or requesting a byte-range lock, the client must specify an identifier that represents the owner of the requested lock. This identifier is in the form of a state-owner, represented in the protocol by a state_owner4, a variable-length opaque array that, when concatenated with the current client ID, uniquely defines the owner of a lock managed by the client. This may be a thread ID, process ID, or other unique value.

Owners of opens and owners of byte-range locks are separate entities and remain separate even if the same opaque arrays are used to designate owners of each. The protocol distinguishes between open-owners (represented by open_owner4 structures) and lock-owners (represented by lock_owner4 structures).

Each open is associated with a specific open-owner while each byte-range lock is associated with a lock-owner and an open-owner, the latter being the open-owner associated with the open file under which the LOCK operation was done. Delegations and layouts, on the other hand, are not associated with a specific owner but are associated with the client as a whole (identified by a client ID).

9.1.2.  Use of the Stateid and Locking

All READ, WRITE, and SETATTR operations contain a stateid. For the purposes of this section, SETATTR operations that change the size attribute of a file are treated as if they are writing the area between the old and new sizes (i.e., the byte-range truncated or added to the file by means of the SETATTR), even where SETATTR is not explicitly mentioned in the text.
The stateid passed to one of these operations must be one that represents an open, a set of byte-range locks, or a delegation, or it may be a special stateid representing anonymous access or the special bypass stateid.

If the state-owner performs a READ or WRITE operation in a situation in which it has established a byte-range lock or share reservation on the server (any OPEN constitutes a share reservation), the stateid (previously returned by the server) must be used to indicate what locks, including both byte-range locks and share reservations, are held by the state-owner. If no state is established by the client, either a byte-range lock or a share reservation, a special stateid for anonymous state (zero as the value for "other" and "seqid") is used. (See Section 8.2.3 for a description of 'special' stateids in general.) Regardless of whether a stateid for anonymous state or a stateid returned by the server is used, if there is a conflicting share reservation or mandatory byte-range lock held on the file, the server MUST refuse to service the READ or WRITE operation.

Share reservations are established by OPEN operations and by their nature are mandatory in that when the OPEN denies READ or WRITE operations, that denial results in such operations being rejected with error NFS4ERR_LOCKED. Byte-range locks may be implemented by the server as either mandatory or advisory, or the choice of mandatory or advisory behavior may be determined by the server on the basis of the file being accessed (for example, some UNIX-based servers support a "mandatory lock bit" on the mode attribute such that if set, byte-range locks are required on the file before I/O is possible). When byte-range locks are advisory, they only prevent the granting of conflicting lock requests and have no effect on READs or WRITEs.
Mandatory byte-range locks, however, prevent conflicting I/O operations. When they are attempted, they are rejected with NFS4ERR_LOCKED. When the client gets NFS4ERR_LOCKED on a file for which it knows it has the proper share reservation, it will need to send a LOCK operation on the byte-range of the file that includes the byte-range the I/O was to be performed on, with an appropriate locktype field of the LOCK operation's arguments (i.e., READ*_LT for a READ operation, WRITE*_LT for a WRITE operation).

Note that for UNIX environments that support mandatory byte-range locking, the distinction between advisory and mandatory locking is subtle. In fact, advisory and mandatory byte-range locks are exactly the same as far as the APIs and requirements on implementation. If the mandatory lock attribute is set on the file, the server checks to see if the lock-owner has an appropriate shared (READ_LT) or exclusive (WRITE_LT) byte-range lock on the byte-range it wishes to READ from or WRITE to. If there is no appropriate lock, the server checks if there is a conflicting lock (which can be done by attempting to acquire the conflicting lock on behalf of the lock-owner, and if successful, release the lock after the READ or WRITE operation is done), and if there is, the server returns NFS4ERR_LOCKED.

For Windows environments, byte-range locks are always mandatory, so the server always checks for byte-range locks during I/O requests.

Thus, the LOCK operation does not need to distinguish between advisory and mandatory byte-range locks. It is the server's processing of the READ and WRITE operations that introduces the distinction.
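The conflict check that the server applies to I/O under mandatory locking can be sketched roughly as follows. This is an illustrative sketch only; the function name, the tuple-based lock table, and the return convention are hypothetical, not part of the protocol:

```python
def check_io_locks(locks, lock_owner, io_range, is_write):
    """Mandatory byte-range lock check for a READ or WRITE.

    `locks` is a list of (owner, start, end, is_write_lock) tuples with
    inclusive byte offsets; `io_range` is the (start, end) range of the
    I/O.  Returns None if the I/O may proceed, or "NFS4ERR_LOCKED" if a
    lock held by a different lock-owner conflicts with it.
    """
    start, end = io_range
    for owner, lk_start, lk_end, lk_write in locks:
        if lk_end < start or lk_start > end:
            continue                 # no overlap with the I/O range
        if owner == lock_owner:
            continue                 # requester's own lock: no conflict
        # Another owner's overlapping lock: a WRITE conflicts with any
        # lock; a READ conflicts only with an exclusive (WRITE_LT) lock.
        if is_write or lk_write:
            return "NFS4ERR_LOCKED"
    return None
```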
Every stateid that is validly passed to READ, WRITE, or SETATTR, with the exception of special stateid values, defines an access mode for the file (i.e., OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH).

o  For stateids associated with opens, this is the mode defined by the original OPEN that caused the allocation of the OPEN stateid and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the same open-owner/file pair.

o  For stateids returned by byte-range LOCK operations, the appropriate mode is the access mode for the OPEN stateid associated with the lock set represented by the stateid.

o  For delegation stateids, the access mode is based on the type of delegation.

When a READ, WRITE, or SETATTR (that specifies the size attribute) operation is done, the operation is subject to checking against the access mode to verify that the operation is appropriate given the stateid with which the operation is associated.

In the case of WRITE-type operations (i.e., WRITEs and SETATTRs that set size), the server MUST verify that the access mode allows writing and MUST return an NFS4ERR_OPENMODE error if it does not. In the case of READ, the server may perform the corresponding check on the access mode, or it may choose to allow READ on OPENs for OPEN4_SHARE_ACCESS_WRITE, to accommodate clients whose WRITE implementation may unavoidably do reads (e.g., due to buffer cache constraints). However, even if READs are allowed in these circumstances, the server MUST still check for locks that conflict with the READ (e.g., another OPEN specified OPEN4_SHARE_DENY_READ or OPEN4_SHARE_DENY_BOTH).
Note that a server that does enforce the access mode check on READs need not explicitly check for conflicting share reservations since the existence of OPEN for OPEN4_SHARE_ACCESS_READ guarantees that no conflicting share reservation can exist.

The READ bypass special stateid (all bits of "other" and "seqid" set to one) indicates a desire to bypass locking checks. The server MAY allow READ operations to bypass locking checks at the server, when this special stateid is used. However, WRITE operations with this special stateid value MUST NOT bypass locking checks and are treated exactly the same as if a special stateid for anonymous state were used.

A lock may not be granted while a READ or WRITE operation using one of the special stateids is being performed and the scope of the lock to be granted would conflict with the READ or WRITE operation. This can occur when:

o  A mandatory byte-range lock is requested with a byte-range that conflicts with the byte-range of the READ or WRITE operation. For the purposes of this paragraph, a conflict occurs when a shared lock is requested and a WRITE operation is being performed, or an exclusive lock is requested and either a READ or a WRITE operation is being performed.

o  A share reservation is requested that denies reading and/or writing and the corresponding operation is being performed.

o  A delegation is to be granted and the delegation type would prevent the I/O operation, i.e., READ and WRITE conflict with an OPEN_DELEGATE_WRITE delegation and WRITE conflicts with an OPEN_DELEGATE_READ delegation.

When a client holds a delegation, it needs to ensure that the stateid sent conveys the association of the operation with the delegation, so as to avoid having the delegation needlessly recalled.
When the delegation stateid, an open stateid associated with that delegation, or a stateid representing byte-range locks derived from such an open is used, the server knows that the READ, WRITE, or SETATTR does not conflict with the delegation but is sent under the aegis of the delegation. Even though it is possible for the server to determine from the client ID (via the session ID) that the client does in fact have a delegation, the server is not obliged to check this, so using a special stateid can result in avoidable recall of the delegation.

9.2.  Lock Ranges

The protocol allows a lock-owner to request a lock with a byte-range and then either upgrade, downgrade, or unlock a sub-range of the initial lock, or a byte-range that overlaps -- fully or partially -- either with that initial lock or a combination of a set of existing locks for the same lock-owner. It is expected that this will be an uncommon type of request. In any case, servers or server file systems may not be able to support sub-range lock semantics. In the event that a server receives a locking request that represents a sub-range of current locking state for the lock-owner, the server is allowed to return the error NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock operations. Therefore, the client should be prepared to receive this error and, if appropriate, report the error to the requesting application.

The client is discouraged from combining multiple independent locking ranges that happen to be adjacent into a single request since the server may not support sub-range requests for reasons related to the recovery of byte-range locking state in the event of server failure.
As discussed in Section 8.4.2, the server may employ certain optimizations during recovery that work effectively only when the client's behavior during lock recovery is similar to the client's locking behavior prior to server failure.

9.3.  Upgrading and Downgrading Locks

If a client has a WRITE_LT lock on a byte-range, it can request an atomic downgrade of the lock to a READ_LT lock via the LOCK operation, by setting the type to READ_LT. If the server supports atomic downgrade, the request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this error and, if appropriate, report the error to the requesting application.

If a client has a READ_LT lock on a byte-range, it can request an atomic upgrade of the lock to a WRITE_LT lock via the LOCK operation by setting the type to WRITE_LT or WRITEW_LT. If the server does not support atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade can be achieved without an existing conflict, the request will succeed. Otherwise, the server will return either NFS4ERR_DENIED or NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the client sent the LOCK operation with the type set to WRITEW_LT and the server has detected a deadlock. The client should be prepared to receive such errors and, if appropriate, report the error to the requesting application.

9.4.  Stateid Seqid Values and Byte-Range Locks

When a LOCK or LOCKU operation is performed, the stateid returned has the same "other" value as the argument's stateid, and a "seqid" value that is incremented (relative to the argument's stateid) to reflect the occurrence of the LOCK or LOCKU operation.
The server MUST increment the value of the "seqid" field whenever there is any change to the locking status of any byte offset as described by any of the locks covered by the stateid. A change in locking status includes a change from locked to unlocked or the reverse, or a change from being locked for READ_LT to being locked for WRITE_LT or the reverse.

When there is no such change, as, for example, when a range already locked for WRITE_LT is locked again for WRITE_LT, the server MAY increment the "seqid" value.

9.5.  Issues with Multiple Open-Owners

When the same file is opened by multiple open-owners, a client will have multiple OPEN stateids for that file, each associated with a different open-owner. In that case, there can be multiple LOCK and LOCKU requests for the same lock-owner sent using the different OPEN stateids, and so a situation may arise in which there are multiple stateids, each representing byte-range locks on the same file and held by the same lock-owner but each associated with a different open-owner.

In such a situation, the locking status of each byte (i.e., whether it is locked, the READ_LT or WRITE_LT type of the lock, and the lock-owner holding the lock) MUST reflect the last LOCK or LOCKU operation done for the lock-owner in question, independent of the stateid through which the request was sent.

When a byte is locked by the lock-owner in question, the open-owner to which that byte-range lock is assigned SHOULD be that of the open-owner associated with the stateid through which the last LOCK of that byte was done. When there is a change in the open-owner associated with locks for the stateid through which a LOCK or LOCKU was done, the "seqid" field of the stateid MUST be incremented, even if the locking, in terms of lock-owners, has not changed.
When there is a change to the set of locked bytes associated with a different stateid for the same lock-owner, i.e., associated with a different open-owner, the "seqid" value for that stateid MUST NOT be incremented.

9.6.  Blocking Locks

Some clients require the support of blocking locks. While NFSv4.1 provides a callback when a previously unavailable lock becomes available, this is an OPTIONAL feature and clients cannot depend on its presence. Clients need to be prepared to continually poll for the lock. This presents a fairness problem. Two of the lock types, READW_LT and WRITEW_LT, are used to indicate to the server that the client is requesting a blocking lock. When the callback is not used, the server should maintain an ordered list of pending blocking locks. When the conflicting lock is released, the server may wait for a period of time equal to lease_time for the first waiting client to re-request the lock. After the lease period expires, the next waiting client request is allowed the lock. Clients are required to poll at an interval sufficiently small that it is likely to acquire the lock in a timely manner. The server is not required to maintain a list of pending blocked locks, as it is used to increase fairness rather than to ensure correct operation. Because of the unordered nature of crash recovery, storing lock state in stable storage would be required to guarantee ordered granting of blocking locks.

Servers may also note the lock types and delay returning denial of the request to allow extra time for a conflicting lock to be released, allowing a successful return. In this way, clients can avoid the burden of needlessly frequent polling for blocking locks. The server should take care in the length of delay in the event the client retransmits the request.
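The fairness mechanism described above can be modeled as follows. This is a simplified illustrative sketch only; the class, its methods, and the way waiters are tracked are hypothetical, not mandated by the protocol:

```python
import time

class BlockingLockWaitlist:
    """Ordered list of clients polling for a contended byte-range lock,
    used to grant the lock preferentially to the earliest waiter."""

    def __init__(self, lease_time_s):
        self.lease_time_s = lease_time_s
        self.waiters = []          # client IDs, in order of first request
        self.released_at = None    # when the conflicting lock was freed

    def note_blocking_request(self, client_id):
        # READW_LT/WRITEW_LT mark the request as blocking; record order.
        if client_id not in self.waiters:
            self.waiters.append(client_id)

    def on_conflicting_lock_released(self, now=None):
        self.released_at = time.monotonic() if now is None else now

    def may_grant(self, client_id, now=None):
        """A poll may succeed if the requester is first in line, or if
        the first waiter's reservation (one lease period) has lapsed."""
        if self.released_at is None:
            return False           # conflicting lock still held
        now = time.monotonic() if now is None else now
        if not self.waiters or client_id == self.waiters[0]:
            return True
        return (now - self.released_at) > self.lease_time_s
```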
If a server receives a blocking LOCK operation, denies it, and then later receives a nonblocking request for the same lock, which is also denied, then it should remove the lock in question from its list of pending blocking locks. Clients should use such a nonblocking request to indicate to the server that this is the last time they intend to poll for the lock, as may happen when the process requesting the lock is interrupted. This is a courtesy to the server, to prevent it from unnecessarily waiting a lease period before granting other LOCK operations. However, clients are not required to perform this courtesy, and servers must not depend on them doing so. Also, clients must be prepared for the possibility that this final locking request will be accepted.

When a server indicates, via the flag OPEN4_RESULT_MAY_NOTIFY_LOCK, that CB_NOTIFY_LOCK callbacks might be done for the current open file, the client should take notice of this, but, since this is a hint, cannot rely on a CB_NOTIFY_LOCK always being done. A client may reasonably reduce the frequency with which it polls for a denied lock, since the greater latency that might occur is likely to be eliminated given a prompt callback, but it still needs to poll. When it receives a CB_NOTIFY_LOCK, it should promptly try to obtain the lock, but it should be aware that other clients may be polling and that the server is under no obligation to reserve the lock for that particular client.

9.7.  Share Reservations

A share reservation is a mechanism to control access to a file. It is a separate and independent mechanism from byte-range locking.
When a client opens a file, it sends an OPEN operation to the server specifying the type of access required (READ, WRITE, or BOTH) and the type of access to deny others (OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or OPEN4_SHARE_DENY_BOTH). If the OPEN fails, the client will fail the application's open request.

Pseudo-code definition of the semantics:

   if (request.access == 0) {
     return (NFS4ERR_INVAL);
   } else {
     if ((request.access & file_state.deny) ||
         (request.deny & file_state.access)) {
       return (NFS4ERR_SHARE_DENIED);
     }
   }
   return (NFS4ERR_OK);

When doing this checking of share reservations on OPEN, the current file_state used in the algorithm includes bits that reflect all current opens, including those for the open-owner making the new OPEN request.

The constants used for the OPEN and OPEN_DOWNGRADE operations for the access and deny fields are as follows:

   const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
   const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
   const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;

   const OPEN4_SHARE_DENY_NONE     = 0x00000000;
   const OPEN4_SHARE_DENY_READ     = 0x00000001;
   const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
   const OPEN4_SHARE_DENY_BOTH     = 0x00000003;

9.8.  OPEN/CLOSE Operations

To provide correct share semantics, a client MUST use the OPEN operation to obtain the initial filehandle and indicate the desired access and what access, if any, to deny. Even if the client intends to use a special stateid for anonymous state or READ bypass, it must still obtain the filehandle for the regular file with the OPEN operation so the appropriate share semantics can be applied. Clients that do not have a deny mode built into their programming interfaces for opening a file should request a deny mode of OPEN4_SHARE_DENY_NONE.
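The share-reservation check in the pseudo-code can be expressed as a runnable sketch. The function name and calling convention below are hypothetical illustrations, not part of the protocol; only the bit values and the check itself come from the specification:

```python
OPEN4_SHARE_ACCESS_READ  = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH  = 0x00000003

OPEN4_SHARE_DENY_NONE  = 0x00000000
OPEN4_SHARE_DENY_READ  = 0x00000001
OPEN4_SHARE_DENY_WRITE = 0x00000002
OPEN4_SHARE_DENY_BOTH  = 0x00000003

def check_share(request_access, request_deny, state_access, state_deny):
    """Apply the share-reservation check to a new OPEN request.

    state_access/state_deny accumulate the access and deny bits of all
    current opens of the file, including those of the open-owner making
    the new request.  Returns the status name from the pseudo-code.
    """
    if request_access == 0:
        return "NFS4ERR_INVAL"
    if (request_access & state_deny) or (request_deny & state_access):
        return "NFS4ERR_SHARE_DENIED"
    return "NFS4ERR_OK"

# An existing open for READ with DENY_WRITE blocks a new OPEN for WRITE:
# check_share(OPEN4_SHARE_ACCESS_WRITE, OPEN4_SHARE_DENY_NONE,
#             OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_DENY_WRITE)
```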
The OPEN operation with the CREATE flag also subsumes the CREATE operation for regular files as used in previous versions of the NFS protocol. This allows a create with a share to be done atomically.

The CLOSE operation removes all share reservations held by the open-owner on that file. If byte-range locks are held, the client SHOULD release all locks before sending a CLOSE operation. The server MAY free all outstanding locks on CLOSE, but some servers may not support the CLOSE of a file that still has byte-range locks held. The server MUST return failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the CLOSE.

The LOOKUP operation will return a filehandle without establishing any lock state on the server. Without a valid stateid, the server will assume that the client has the least access. For example, if one client opened a file with OPEN4_SHARE_DENY_BOTH and another client accesses the file via a filehandle obtained through LOOKUP, the second client could only read the file using the special read bypass stateid. The second client could not WRITE the file at all because it would not have a valid stateid from OPEN and the special anonymous stateid would not be allowed access.

9.9.  Open Upgrade and Downgrade

When an OPEN is done for a file and the open-owner for which the OPEN is being done already has the file open, the result is to upgrade the open file status maintained on the server to include the access and deny bits specified by the new OPEN as well as those for the existing OPEN. The result is that there is one open file, as far as the protocol is concerned, and it includes the union of the access and deny bits for all of the OPEN requests completed.
The OPEN is represented by a single stateid whose "other" value matches that of the original open, and whose "seqid" value is incremented to reflect the occurrence of the upgrade. The increment is required in cases in which the "upgrade" results in no change to the open mode (e.g., an OPEN is done for read when the existing open file is opened for OPEN4_SHARE_ACCESS_BOTH). Only a single CLOSE will be done to reset the effects of both OPENs. The client may use the stateid returned by the OPEN effecting the upgrade, or a stateid sharing the same "other" field and a seqid of zero, although care needs to be taken as far as upgrades that happen while the CLOSE is pending. Note that the client, when sending the OPEN, may not know that the same file is in fact being opened. The above only applies if both OPENs result in the OPENed object being designated by the same filehandle.

When the server chooses to export multiple filehandles corresponding to the same file object and returns different filehandles on two different OPENs of the same file object, the server MUST NOT "OR" together the access and deny bits and coalesce the two open files. Instead, the server must maintain separate OPENs with separate stateids and will require separate CLOSEs to free them.

When multiple open files on the client are merged into a single OPEN file object on the server, the close of one of the open files (on the client) may necessitate a change of the access and deny status of the open file on the server. This is because the union of the access and deny bits for the remaining opens may be smaller (i.e., a proper subset) than previously. The OPEN_DOWNGRADE operation is used to make the necessary change, and the client should use it to update the server so that share reservation requests by other clients are handled properly.
The stateid returned has the same "other" field as that passed to the server. The "seqid" value in the returned stateid MUST be incremented, even in situations in which there is no change to the access and deny bits for the file.

9.10.  Parallel OPENs

Unlike the case of NFSv4.0, in which OPEN operations for the same open-owner are inherently serialized because of the owner-based seqid, multiple OPENs for the same open-owner may be done in parallel. When clients do this, they may encounter situations in which, because of the existence of hard links, two OPEN operations may turn out to open the same file, with a later OPEN performed being an upgrade of the first, with this fact only visible to the client once the operations complete.

In this situation, clients may determine the order in which the OPENs were performed by examining the stateids returned by the OPENs. Stateids that share a common value of the "other" field can be recognized as having opened the same file, with the order of the operations determinable from the order of the "seqid" fields, modulo any possible wraparound of the 32-bit field.

When the possibility exists that the client will send multiple OPENs for the same open-owner in parallel, it may be the case that an open upgrade may happen without the client knowing beforehand that this could happen. Because of this possibility, CLOSEs and OPEN_DOWNGRADEs should generally be sent with a non-zero seqid in the stateid, to avoid the possibility that the status change associated with an open upgrade is inadvertently lost.

9.11.  Reclaim of Open and Byte-Range Locks

Special forms of the LOCK and OPEN operations are provided when it is necessary to re-establish byte-range locks or opens after a server failure.

o  To reclaim existing opens, an OPEN operation is performed using a CLAIM_PREVIOUS.
Because the client, in this type of situation, will have already opened the file and have the filehandle of the target file, this operation requires that the current filehandle be the target file, rather than a directory, and no file name is specified.

o  To reclaim byte-range locks, a LOCK operation with the reclaim parameter set to true is used.

Reclaims of opens associated with delegations are discussed in Section 10.2.1.

10.  Client-Side Caching

Client-side caching of data, of file attributes, and of file names is essential to providing good performance with the NFS protocol.  Providing distributed cache coherence is a difficult problem, and previous versions of the NFS protocol have not attempted it.  Instead, several NFS client implementation techniques have been used to reduce the problems that a lack of coherence poses for users.  These techniques have not been clearly defined by earlier protocol specifications, and it is often unclear what is valid or invalid client behavior.

The NFSv4.1 protocol uses many techniques similar to those that have been used in previous protocol versions.  The NFSv4.1 protocol does not provide distributed cache coherence.  However, it defines a more limited set of caching guarantees to allow locks and share reservations to be used without destructive interference from client-side caching.

In addition, the NFSv4.1 protocol introduces a delegation mechanism, which allows many decisions normally made by the server to be made locally by clients.  This mechanism provides efficient support of the common cases where sharing is infrequent or where sharing is read-only.

10.1.  Performance Challenges for Client-Side Caching

Caching techniques used in previous versions of the NFS protocol have been successful in providing good performance.
However, several scalability challenges can arise when those techniques are used with very large numbers of clients.  This is particularly true when clients are geographically distributed, which classically increases the latency for cache revalidation requests.

The previous versions of the NFS protocol repeat their file data cache validation requests at the time the file is opened.  This behavior can have serious performance drawbacks.  A common case is one in which a file is only accessed by a single client.  Therefore, sharing is infrequent.

In this case, repeated references to the server to find that no conflicts exist are expensive.  A better option with regard to performance is to allow a client that repeatedly opens a file to do so without reference to the server.  This is done until potentially conflicting operations from another client actually occur.

A similar situation arises in connection with byte-range locking.  Sending LOCK and LOCKU operations as well as the READ and WRITE operations necessary to make data caching consistent with the locking semantics (see Section 10.3.2) can severely limit performance.  When locking is used to provide protection against infrequent conflicts, a large penalty is incurred.  This penalty may discourage the use of byte-range locking by applications.

The NFSv4.1 protocol provides more aggressive caching strategies with the following design goals:

o  Compatibility with a large range of server semantics.

o  Providing the same caching benefits as previous versions of the NFS protocol when unable to support the more aggressive model.

o  Requirements for aggressive caching are organized so that a large portion of the benefit can be obtained even when not all of the requirements can be met.
The appropriate requirements for the server are discussed in later sections in which specific forms of caching are covered (see Section 10.4).

10.2.  Delegation and Callbacks

Recallable delegation of server responsibilities for a file to a client improves performance by avoiding repeated requests to the server in the absence of inter-client conflict.  With the use of a "callback" RPC from server to client, a server recalls delegated responsibilities when another client engages in sharing of a delegated file.

A delegation is passed from the server to the client, specifying the object of the delegation and the type of delegation.  There are different types of delegations, but each type contains a stateid to be used to represent the delegation when performing operations that depend on the delegation.  This stateid is similar to those associated with locks and share reservations but differs in that the stateid for a delegation is associated with a client ID and may be used on behalf of all the open-owners for the given client.  A delegation is made to the client as a whole and not to any specific process or thread of control within it.

The backchannel is established by CREATE_SESSION and BIND_CONN_TO_SESSION, and the client is required to maintain it.  Because the backchannel may be down, even temporarily, correct protocol operation does not depend on it.  Preliminary testing of backchannel functionality by means of a CB_COMPOUND procedure with a single operation, CB_SEQUENCE, can be used to check the continuity of the backchannel.  A server avoids delegating responsibilities until it has determined that the backchannel exists.
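The server-side policy above can be paraphrased as: never grant a delegation until a backchannel probe has succeeded, and let the OPEN succeed without a delegation otherwise.  A sketch under stated assumptions (the `Session` class, the probe stub, and the conflict predicate are all hypothetical stand-ins, not protocol elements):

```python
from dataclasses import dataclass

@dataclass
class Session:
    backchannel_verified: bool = False

def send_cb_sequence_probe(session):
    # Stand-in for an actual CB_COMPOUND { CB_SEQUENCE } round trip
    # over the backchannel; here we simply pretend the probe succeeds.
    return True

def no_conflicting_access(open_request):
    return open_request.get("conflict", False) is False

def maybe_grant_delegation(session, open_request):
    """Grant a delegation only once the backchannel has been verified;
    otherwise the OPEN is still processed, just with no delegation."""
    if not session.backchannel_verified:
        session.backchannel_verified = send_cb_sequence_probe(session)
    if session.backchannel_verified and no_conflicting_access(open_request):
        return "OPEN_DELEGATE_READ"   # illustrative choice of type
    return None                       # OPEN succeeds without a delegation

s = Session()
print(maybe_grant_delegation(s, {}))                   # backchannel probed first
print(maybe_grant_delegation(s, {"conflict": True}))   # no delegation granted
```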
Because the granting of a delegation is always conditional upon the absence of conflicting access, clients MUST NOT assume that a delegation will be granted and they MUST always be prepared for OPENs, WANT_DELEGATIONs, and GET_DIR_DELEGATIONs to be processed without any delegations being granted.

Unlike locks, an operation by a second client to a delegated file will cause the server to recall a delegation through a callback.  For individual operations, we will describe, under IMPLEMENTATION, when such operations are required to effect a recall.  A number of points should be noted, however.

o  The server is free to recall a delegation whenever it feels it is desirable and may do so even if no operations requiring recall are being done.

o  Operations done outside the NFSv4.1 protocol, due to, for example, access by other protocols or by local access, also need to result in delegation recall when they make analogous changes to file system data.  What is crucial is whether the change would invalidate the guarantees provided by the delegation.  When this is possible, the delegation needs to be recalled and MUST be returned or revoked before allowing the operation to proceed.

o  The semantics of the file system are crucial in defining when delegation recall is required.  If a particular change within a specific implementation causes change to a file attribute, then delegation recall is required, whether or not that operation has been specifically listed as requiring delegation recall.  Again, what is critical is whether the guarantees provided by the delegation are being invalidated.

Despite those caveats, the implementation sections for a number of operations describe situations in which delegation recall would be required under some common circumstances:

o  For GETATTR, see Section 18.7.4.

o  For OPEN, see Section 18.16.4.
o  For READ, see Section 18.22.4.

o  For REMOVE, see Section 18.25.4.

o  For RENAME, see Section 18.26.4.

o  For SETATTR, see Section 18.30.4.

o  For WRITE, see Section 18.32.4.

On recall, the client holding the delegation needs to flush modified state (such as modified data) to the server and return the delegation.  The conflicting request will not be acted on until the recall is complete.  The recall is considered complete when the client returns the delegation or the server times out its wait for the delegation to be returned and revokes the delegation as a result of the timeout.  In the interim, the server will either delay responding to conflicting requests or respond to them with NFS4ERR_DELAY.  Following the resolution of the recall, the server has the information necessary to grant or deny the second client's request.

At the time the client receives a delegation recall, it may have substantial state that needs to be flushed to the server.  Therefore, the server should allow sufficient time for the delegation to be returned since it may involve numerous RPCs to the server.  If the server is able to determine that the client is diligently flushing state to the server as a result of the recall, the server may extend the usual time allowed for a recall.  However, the time allowed for recall completion should not be unbounded.

An example of this is when responsibility to mediate opens on a given file is delegated to a client (see Section 10.4).  The server will not know what opens are in effect on the client.  Without this knowledge, the server will be unable to determine if the access and deny states for the file allow any particular open until the delegation for the file has been returned.

A client failure or a network partition can result in failure to respond to a recall callback.
In this case, the server will revoke the delegation, which in turn will render useless any modified state still on the client.

10.2.1.  Delegation Recovery

There are three situations that delegation recovery needs to deal with:

o  client restart

o  server restart

o  network partition (full or backchannel-only)

In the event the client restarts, the failure to renew the lease will result in the revocation of byte-range locks and share reservations.  Delegations, however, may be treated a bit differently.

There will be situations in which delegations will need to be re-established after a client restarts.  The reason for this is that the client may have file data stored locally and this data was associated with the previously held delegations.  The client will need to re-establish the appropriate file state on the server.

To allow for this type of client recovery, the server MAY extend the period for delegation recovery beyond the typical lease expiration period.  This implies that requests from other clients that conflict with these delegations will need to wait.  Because the normal recall process may require significant time for the client to flush changed state to the server, other clients need to be prepared for delays that occur because of a conflicting delegation.  This longer interval would increase the window for clients to restart and consult stable storage so that the delegations can be reclaimed.  For OPEN delegations, such delegations are reclaimed using OPEN with a claim type of CLAIM_DELEGATE_PREV or CLAIM_DELEG_PREV_FH (see Sections 10.5 and 18.16 for discussion of OPEN delegation and the details of OPEN, respectively).
A server MAY support claim types of CLAIM_DELEGATE_PREV and CLAIM_DELEG_PREV_FH, and if it does, it MUST NOT remove delegations upon a CREATE_SESSION that confirms a client ID created by EXCHANGE_ID.  Instead, the server MUST, for a period of time no less than that of the value of the lease_time attribute, maintain the client's delegations to allow time for the client to send CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH requests.  A server that supports CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH MUST support the DELEGPURGE operation.

When the server restarts, delegations are reclaimed (using the OPEN operation with CLAIM_PREVIOUS) in a similar fashion to byte-range locks and share reservations.  However, there is a slight semantic difference.  In the normal case, if the server decides that a delegation should not be granted, it performs the requested action (e.g., OPEN) without granting any delegation.  For reclaim, the server grants the delegation, but a special designation is applied so that the client treats the delegation as having been granted but recalled by the server.  Because of this, the client has the duty to write all modified state to the server and then return the delegation.  This process of handling delegation reclaim reconciles three principles of the NFSv4.1 protocol:

o  Upon reclaim, a client reporting resources assigned to it by an earlier server instance must be granted those resources.

o  The server has unquestionable authority to determine whether delegations are to be granted and, once granted, whether they are to be continued.

o  The use of callbacks should not be depended upon until the client has proven its ability to receive them.

When a client needs to reclaim a delegation and there is no associated open, the client may use the CLAIM_PREVIOUS variant of the WANT_DELEGATION operation.
However, since the server is not required to support this operation, an alternative is to reclaim via a dummy OPEN together with the delegation using an OPEN of type CLAIM_PREVIOUS.  The dummy open file can then be released using a CLOSE to re-establish the original state to be reclaimed: a delegation without an associated open.

When a client has more than a single open associated with a delegation, state for those additional opens can be established using OPEN operations of type CLAIM_DELEGATE_CUR.  When these are used to establish opens associated with reclaimed delegations, the server MUST allow them when made within the grace period.

When a network partition occurs, delegations are subject to freeing by the server when the lease renewal period expires.  This is similar to the behavior for locks and share reservations.  For delegations, however, the server may extend the period in which conflicting requests are held off.  Eventually, the occurrence of a conflicting request from another client will cause revocation of the delegation.  A loss of the backchannel (e.g., by a later network configuration change) will have the same effect.  A recall request will fail and revocation of the delegation will result.

A client normally finds out about revocation of a delegation when it uses a stateid associated with a delegation and receives one of the errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED.  It also may find out about delegation revocation after a client restart when it attempts to reclaim a delegation and receives that same error.  Note that in the case of a revoked OPEN_DELEGATE_WRITE delegation, there are issues because data may have been modified by the client whose delegation is revoked and separately by other clients.  See Section 10.5.1 for a discussion of such issues.
Note also that when delegations are revoked, information about the revoked delegation will be written by the server to stable storage (as described in Section 8.4.3).  This is done to deal with the case in which a server restarts after revoking a delegation but before the client holding the revoked delegation is notified about the revocation.

10.3.  Data Caching

When applications share access to a set of files, they need to be implemented so as to take account of the possibility of conflicting access by another application.  This is true whether the applications in question execute on different clients or reside on the same client.

Share reservations and byte-range locks are the facilities the NFSv4.1 protocol provides to allow applications to coordinate access by using mutual exclusion facilities.  The NFSv4.1 protocol's data caching must be implemented such that it does not invalidate the assumptions on which those using these facilities depend.

10.3.1.  Data Caching and OPENs

In order to avoid invalidating the sharing assumptions on which applications rely, NFSv4.1 clients should not provide cached data to applications or modify it on behalf of an application when it would not be valid to obtain or modify that same data via a READ or WRITE operation.

Furthermore, in the absence of an OPEN delegation (see Section 10.4), two additional rules apply.  Note that these rules are obeyed in practice by many NFSv3 clients.

o  First, cached data present on a client must be revalidated after doing an OPEN.  Revalidating means that the client fetches the change attribute from the server, compares it with the cached change attribute, and if different, declares the cached data (as well as the cached attributes) as invalid.  This is to ensure that the data for the OPENed file is still correctly reflected in the client's cache.
This validation must be done at least when the client's OPEN operation includes a deny of OPEN4_SHARE_DENY_WRITE or OPEN4_SHARE_DENY_BOTH, thus terminating a period in which other clients may have had the opportunity to open the file with OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH access.  Clients may choose to do the revalidation more often (i.e., at OPENs specifying a deny mode of OPEN4_SHARE_DENY_NONE) to parallel the NFSv3 protocol's practice for the benefit of users assuming this degree of cache revalidation.

Since the change attribute is updated for data and metadata modifications, some client implementors may be tempted to use the time_modify attribute and not the change attribute to validate cached data, so that metadata changes do not spuriously invalidate clean data.  The implementor is cautioned against this approach.  The change attribute is guaranteed to change for each update to the file, whereas time_modify is guaranteed to change only at the granularity of the time_delta attribute.  Use by the client's data cache validation logic of time_modify and not change runs the risk of the client incorrectly marking stale data as valid.  Thus, any cache validation approach by the client MUST include the use of the change attribute.

o  Second, modified data must be flushed to the server before closing a file OPENed for OPEN4_SHARE_ACCESS_WRITE.  This is complementary to the first rule.  If the data is not flushed at CLOSE, the revalidation done after the client OPENs a file is unable to achieve its purpose.  The other aspect to flushing the data before close is that the data must be committed to stable storage, at the server, before the CLOSE operation is requested by the client.
In the case of a server restart and a CLOSEd file, it may not be possible to retransmit the data to be written to the file; hence, this requirement.

10.3.2.  Data Caching and File Locking

For those applications that choose to use byte-range locking instead of share reservations to exclude inconsistent file access, there is an analogous set of constraints that apply to client-side data caching.  These rules are effective only if the byte-range locking is used in a way that matches in an equivalent way the actual READ and WRITE operations executed.  This is as opposed to byte-range locking that is based on pure convention.  For example, it is possible to manipulate a two-megabyte file by dividing the file into two one-megabyte ranges and protecting access to the two byte-ranges by byte-range locks on bytes zero and one.  A WRITE_LT lock on byte zero of the file would represent the right to perform READ and WRITE operations on the first byte-range.  A WRITE_LT lock on byte one of the file would represent the right to perform READ and WRITE operations on the second byte-range.  As long as all applications manipulating the file obey this convention, they will work on a local file system.  However, they may not work with the NFSv4.1 protocol unless clients refrain from data caching.

The rules for data caching in the byte-range locking environment are:

o  First, when a client obtains a byte-range lock for a particular byte-range, the data cache corresponding to that byte-range (if any cache data exists) must be revalidated.  If the change attribute indicates that the file may have been updated since the cached data was obtained, the client must flush or invalidate the cached data for the newly locked byte-range.
A client might choose to invalidate all of the non-modified cached data that it has for the file, but the only requirement for correct operation is to invalidate all of the data in the newly locked byte-range.

o  Second, before releasing a WRITE_LT lock for a byte-range, all modified data for that byte-range must be flushed to the server.  The modified data must also be written to stable storage.

Note that flushing data to the server and the invalidation of cached data must reflect the actual byte-ranges locked or unlocked.  Rounding these up or down to reflect client cache block boundaries will cause problems if not carefully done.  For example, writing a modified block when only half of that block is within an area being unlocked may cause invalid modification to the byte-range outside the unlocked area.  This, in turn, may be part of a byte-range locked by another client.  Clients can avoid this situation by synchronously performing portions of WRITE operations that overlap that portion (initial or final) that is not a full block.  Similarly, invalidating a locked area that is not an integral number of full buffer blocks would require the client to read one or two partial blocks from the server if the revalidation procedure shows that the data that the client possesses may not be valid.

The data that is written to the server as a prerequisite to the unlocking of a byte-range must be written, at the server, to stable storage.  The client may accomplish this either with synchronous writes or by following asynchronous writes with a COMMIT operation.  This is required because retransmission of the modified data after a server restart might conflict with a lock held by another client.
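The block-boundary caution above can be made concrete.  A sketch (assuming a hypothetical client cache with a fixed block size; the function and its return shape are illustrative) of how a flush for an unlocked byte-range might be split so that only bytes inside the range are ever written:

```python
def split_flush_range(offset, length, block_size=4096):
    """Split the byte-range [offset, offset+length) into the partial
    leading/trailing pieces, which must be flushed with WRITEs covering
    only the locked bytes rather than whole cache blocks, and the
    block-aligned middle, which may be written block-at-a-time and
    then made durable with a COMMIT."""
    end = offset + length
    first_full = -(-offset // block_size) * block_size   # round up
    last_full = (end // block_size) * block_size         # round down
    if first_full >= last_full:        # range lies within a single block
        return [(offset, length)], None
    pieces = []
    if offset < first_full:
        pieces.append((offset, first_full - offset))     # leading partial
    if last_full < end:
        pieces.append((last_full, end - last_full))      # trailing partial
    return pieces, (first_full, last_full - first_full)  # aligned middle

# A lock covering bytes 1000..9999 with 4 KiB blocks:
print(split_flush_range(1000, 9000))
# -> ([(1000, 3096), (8192, 1808)], (4096, 4096))
```

Writing the partial pieces exactly (and synchronously) keeps the client from modifying bytes outside the unlocked area that may be under another client's lock.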
A client implementation may choose to accommodate applications that use byte-range locking in non-standard ways (e.g., using a byte-range lock as a global semaphore) by flushing to the server more data upon a LOCKU than is covered by the locked range.  This may include modified data within files other than the one for which the unlocks are being done.  In such cases, the client must not interfere with applications whose READs and WRITEs are being done only within the bounds of byte-range locks that the application holds.  For example, an application locks a single byte of a file and proceeds to write that single byte.  A client that chose to handle a LOCKU by flushing all modified data to the server could validly write that single byte in response to an unrelated LOCKU operation.  However, it would not be valid to write the entire block in which that single written byte was located since it includes an area that is not locked and might be locked by another client.  Client implementations can avoid this problem by dividing files with modified data into those for which all modifications are done to areas covered by an appropriate byte-range lock and those for which there are modifications not covered by a byte-range lock.  Any writes done for the former class of files must not include areas not locked and thus not modified on the client.

10.3.3.  Data Caching and Mandatory File Locking

Client-side data caching needs to respect mandatory byte-range locking when it is in effect.  The presence of mandatory byte-range locking for a given file is indicated when the client gets back NFS4ERR_LOCKED from a READ or WRITE operation on a file for which it has an appropriate share reservation.  When mandatory locking is in effect for a file, the client must check for an appropriate byte-range lock for data being read or written.
If a byte-range lock exists for the range being read or written, the client may satisfy the request using the client's validated cache.  If an appropriate byte-range lock is not held for the range of the read or write, the read or write request must not be satisfied by the client's cache and the request must be sent to the server for processing.  When a read or write request partially overlaps a locked byte-range, the request should be subdivided into multiple pieces with each byte-range (locked or not) treated appropriately.

10.3.4.  Data Caching and File Identity

When clients cache data, the file data needs to be organized according to the file system object to which the data belongs.  For NFSv3 clients, the typical practice has been to assume for the purpose of caching that distinct filehandles represent distinct file system objects.  The client then has the choice to organize and maintain the data cache on this basis.

In the NFSv4.1 protocol, there is now the possibility to have significant deviations from a "one filehandle per object" model because a filehandle may be constructed on the basis of the object's pathname.  Therefore, clients need a reliable method to determine if two filehandles designate the same file system object.  If clients were simply to assume that all distinct filehandles denote distinct objects and proceed to do data caching on this basis, caching inconsistencies would arise between the distinct client-side objects that mapped to the same server-side object.

By providing a method to differentiate filehandles, the NFSv4.1 protocol alleviates a potential functional regression in comparison with the NFSv3 protocol.  Without this method, caching inconsistencies within the same client could occur, and this has not been present in previous versions of the NFS protocol.
Note that it is possible to have such inconsistencies with applications executing on multiple clients, but that is not the issue being addressed here.

For the purposes of data caching, the following steps allow an NFSv4.1 client to determine whether two distinct filehandles denote the same server-side object:

o  If GETATTR directed to two filehandles returns different values of the fsid attribute, then the filehandles represent distinct objects.

o  If GETATTR for any file with an fsid that matches the fsid of the two filehandles in question returns a unique_handles attribute with a value of TRUE, then the two objects are distinct.

o  If GETATTR directed to the two filehandles does not return the fileid attribute for both of the handles, then it cannot be determined whether the two objects are the same.  Therefore, operations that depend on that knowledge (e.g., client-side data caching) cannot be done reliably.  Note that if GETATTR does not return the fileid attribute for both filehandles, it will return it for neither of the filehandles, since the fsid for both filehandles is the same.

o  If GETATTR directed to the two filehandles returns different values for the fileid attribute, then they are distinct objects.

o  Otherwise, they are the same object.

10.4.  Open Delegation

When a file is being OPENed, the server may delegate further handling of opens and closes for that file to the opening client.  Any such delegation is recallable since the circumstances that allowed for the delegation are subject to change.  In particular, if the server receives a conflicting OPEN from another client, the server must recall the delegation before deciding whether the OPEN from the other client may be granted.
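The filehandle-identity steps in Section 10.3.4 above reduce to a small decision procedure.  A sketch in Python (the attribute dictionaries and the same/distinct/unknown results are illustrative conventions, not protocol XDR):

```python
SAME, DISTINCT, UNKNOWN = "same", "distinct", "unknown"

def same_object(attrs1, attrs2):
    """Decide whether two filehandles denote one server-side object,
    given their GETATTR results as dicts with 'fsid', optionally
    'unique_handles', and optionally 'fileid' keys."""
    if attrs1["fsid"] != attrs2["fsid"]:
        return DISTINCT        # different file systems: distinct objects
    if attrs1.get("unique_handles"):
        return DISTINCT        # fs guarantees one filehandle per object
    if "fileid" not in attrs1 or "fileid" not in attrs2:
        return UNKNOWN         # identity undecidable; cache conservatively
    if attrs1["fileid"] != attrs2["fileid"]:
        return DISTINCT
    return SAME

# Two handles reaching the same file via a hard link:
print(same_object({"fsid": 7, "fileid": 42}, {"fsid": 7, "fileid": 42}))
# -> same
```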
Making a delegation is up to the server, and clients should not assume that any particular OPEN either will or will not result in an OPEN delegation.  The following is a typical set of conditions that servers might use in deciding whether an OPEN should be delegated:

o  The client must be able to respond to the server's callback requests.  If a backchannel has been established, the server will send a CB_COMPOUND request, containing a single operation, CB_SEQUENCE, for a test of backchannel availability.

o  The client must have responded properly to previous recalls.

o  There must be no current OPEN conflicting with the requested delegation.

o  There should be no current delegation that conflicts with the delegation being requested.

o  The probability of future conflicting open requests should be low based on the recent history of the file.

o  There must be no server-specific semantics of OPEN/CLOSE that would make the required handling incompatible with the prescribed handling that the delegated client would apply (see below).

There are two types of OPEN delegations: OPEN_DELEGATE_READ and OPEN_DELEGATE_WRITE.  An OPEN_DELEGATE_READ delegation allows a client to handle, on its own, requests to open a file for reading that do not deny OPEN4_SHARE_ACCESS_READ access to others.  Multiple OPEN_DELEGATE_READ delegations may be outstanding simultaneously and do not conflict.  An OPEN_DELEGATE_WRITE delegation allows the client to handle, on its own, all opens.  Only one OPEN_DELEGATE_WRITE delegation may exist for a given file at a given time, and it is inconsistent with any OPEN_DELEGATE_READ delegations.
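The compatibility rules for the two delegation types can be expressed compactly.  A sketch, assuming (purely for illustration) that a server tracks the delegations outstanding on a file as a list of type strings:

```python
def may_grant(requested, outstanding):
    """Return True if a delegation of the requested type can coexist
    with the delegations already outstanding for the file: multiple
    read delegations coexist, while a write delegation is exclusive
    of all others."""
    if requested == "OPEN_DELEGATE_WRITE":
        return not outstanding                 # must be the only delegation
    # OPEN_DELEGATE_READ: compatible only with other read delegations.
    return all(d == "OPEN_DELEGATE_READ" for d in outstanding)

print(may_grant("OPEN_DELEGATE_READ", ["OPEN_DELEGATE_READ"]))   # True
print(may_grant("OPEN_DELEGATE_WRITE", ["OPEN_DELEGATE_READ"]))  # False
```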
When a client has an OPEN_DELEGATE_READ delegation, it is assured that neither the contents, the attributes (with the exception of time_access), nor the names of any links to the file will change without its knowledge, so long as the delegation is held.  When a client has an OPEN_DELEGATE_WRITE delegation, it may modify the file data locally since no other client will be accessing the file's data.  The client holding an OPEN_DELEGATE_WRITE delegation may only locally affect file attributes that are intimately connected with the file data: size, change, time_access, time_metadata, and time_modify.  All other attributes must be reflected on the server.

When a client has an OPEN delegation, it does not need to send OPENs or CLOSEs to the server.  Instead, the client may update the appropriate status internally.  For an OPEN_DELEGATE_READ delegation, opens that cannot be handled locally (opens that are for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH or that deny OPEN4_SHARE_ACCESS_READ access) must be sent to the server.

When an OPEN delegation is made, the reply to the OPEN contains an OPEN delegation structure that specifies the following:

o  the type of delegation (OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE).

o  space limitation information to control flushing of data on close (OPEN_DELEGATE_WRITE delegation only; see Section 10.4.1)

o  an nfsace4 specifying read and write permissions

o  a stateid to represent the delegation

The delegation stateid is separate and distinct from the stateid for the OPEN proper.  The standard stateid, unlike the delegation stateid, is associated with a particular lock-owner and will continue to be valid after the delegation is recalled and the file remains open.
When a request internal to the client is made to open a file and an OPEN delegation is in effect, it will be accepted or rejected solely on the basis of the following conditions.  Any requirement for other checks to be made by the delegate should result in the OPEN delegation being denied so that the checks can be made by the server itself.

o  The access and deny bits for the request and the file as described in Section 9.7.

o  The read and write permissions as determined below.

The nfsace4 passed with the delegation can be used to avoid frequent ACCESS calls.  The permission check should be as follows:

o  If the nfsace4 indicates that the open may be done, then it should be granted without reference to the server.

o  If the nfsace4 indicates that the open may not be done, then an ACCESS request must be sent to the server to obtain the definitive answer.

The server may return an nfsace4 that is more restrictive than the actual ACL of the file.  This includes an nfsace4 that specifies denial of all access.  Note that some common practices such as mapping the traditional user "root" to the user "nobody" (see Section 5.9) may make it incorrect to return the actual ACL of the file in the delegation response.

The use of a delegation together with various other forms of caching creates the possibility that no server authentication and authorization will ever be performed for a given user since all of the user's requests might be satisfied locally.  Where the client is depending on the server for authentication and authorization, the client should be sure authentication and authorization occur for each user by use of the ACCESS operation.  This should be the case even if an ACCESS operation would not be required otherwise.  As mentioned before, the server may enforce frequent authentication by returning an nfsace4 denying all access with every OPEN delegation.

10.4.1.  Open Delegation and Data Caching

An OPEN delegation allows much of the message overhead associated with the opening and closing of files to be eliminated.  An open when an OPEN delegation is in effect does not require that a validation message be sent to the server.  The continued endurance of the "OPEN_DELEGATE_READ delegation" provides a guarantee that no OPEN for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH, and thus no write, has occurred.  Similarly, when closing a file opened for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH and if an OPEN_DELEGATE_WRITE delegation is in effect, the data written does not have to be written to the server until the OPEN delegation is recalled.  The continued endurance of the OPEN delegation provides a guarantee that no open, and thus no READ or WRITE, has been done by another client.

For the purposes of OPEN delegation, READs and WRITEs done without an OPEN are treated as the functional equivalents of a corresponding type of OPEN.  Although a client SHOULD NOT use special stateids when an open exists, delegation handling on the server can use the client ID associated with the current session to determine if the operation has been done by the holder of the delegation (in which case, no recall is necessary) or by another client (in which case, the delegation must be recalled and I/O not proceed until the delegation is returned or revoked).

With delegations, a client is able to avoid writing data to the server when the CLOSE of a file is serviced.  The file close system call is the usual point at which the client is notified of a lack of stable storage for the modified file data generated by the application.
At the close, file data is written to the server and, through normal accounting, the server is able to determine if the available file system space for the data has been exceeded (i.e., the server returns NFS4ERR_NOSPC or NFS4ERR_DQUOT).  This accounting includes quotas.  The introduction of delegations requires that an alternative method be in place for the same type of communication to occur between client and server.

In the delegation response, the server provides either the limit of the size of the file or the number of modified blocks and associated block size.  The server must ensure that the client will be able to write modified data to the server of a size equal to that provided in the original delegation.  The server must make this assurance for all outstanding delegations.  Therefore, the server must be careful in its management of available space for new or modified data, taking into account available file system space and any applicable quotas.  The server can recall delegations as a result of managing the available file system space.  The client should abide by the server's stated space limits for delegations.  If the client exceeds the stated limits for the delegation, the server's behavior is undefined.

Based on server conditions, quotas, or available file system space, the server may grant OPEN_DELEGATE_WRITE delegations with very restrictive space limitations.  The limitations may be defined in a way that will always force modified data to be flushed to the server on close.

With respect to authentication, flushing modified data to the server after a CLOSE has occurred may be problematic.  For example, the user of the application may have logged off the client, and unexpired authentication credentials may not be present.  In this case, the client may need to take special care to ensure that local unexpired credentials will in fact be available.  This may be accomplished by tracking the expiration time of credentials and flushing data well in advance of their expiration or by making private copies of credentials to assure their availability when needed.

10.4.2.  Open Delegation and File Locks

When a client holds an OPEN_DELEGATE_WRITE delegation, lock operations are performed locally.  This includes those required for mandatory byte-range locking.  This can be done since the delegation implies that there can be no conflicting locks.  Similarly, all of the revalidations that would normally be associated with obtaining locks and the flushing of data associated with the releasing of locks need not be done.

When a client holds an OPEN_DELEGATE_READ delegation, lock operations are not performed locally.  All lock operations, including those requesting non-exclusive locks, are sent to the server for resolution.

10.4.3.  Handling of CB_GETATTR

The server needs to employ special handling for a GETATTR where the target is a file that has an OPEN_DELEGATE_WRITE delegation in effect.  The reason for this is that the client holding the OPEN_DELEGATE_WRITE delegation may have modified the data, and the server needs to reflect this change to the second client that submitted the GETATTR.  Therefore, the client holding the OPEN_DELEGATE_WRITE delegation needs to be interrogated.  The server will use the CB_GETATTR operation.  The only attributes that the server can reliably query via CB_GETATTR are size and change.

Since CB_GETATTR is being used to satisfy another client's GETATTR request, the server only needs to know if the client holding the delegation has a modified version of the file.
If the client's copy of the delegated file is not modified (data or size), the server can satisfy the second client's GETATTR request from the attributes stored locally at the server.  If the file is modified, the server only needs to know about this modified state.  If the server determines that the file is currently modified, it will respond to the second client's GETATTR as if the file had been modified locally at the server.

Since the form of the change attribute is determined by the server and is opaque to the client, the client and server need to agree on a method of communicating the modified state of the file.  For the size attribute, the client will report its current view of the file size.  For the change attribute, the handling is more involved.

For the client, the following steps will be taken when receiving an OPEN_DELEGATE_WRITE delegation:

o  The value of the change attribute will be obtained from the server and cached.  Let this value be represented by c.

o  The client will create a value greater than c that will be used for communicating that modified data is held at the client.  Let this value be represented by d.

o  When the client is queried via CB_GETATTR for the change attribute, it checks to see if it holds modified data.  If the file is modified, the value d is returned for the change attribute value.  If this file is not currently modified, the client returns the value c for the change attribute.

For simplicity of implementation, the client MAY for each CB_GETATTR return the same value d.  This is true even if, between successive CB_GETATTR operations, the client again modifies the file's data or metadata in its cache.  The client can return the same value because the only requirement is that the client be able to indicate to the server that the client holds modified data.  Therefore, the value of d may always be c + 1.

While the change attribute is opaque to the client in the sense that it has no idea what units of time, if any, the server is counting change with, it is not opaque in that the client has to treat it as an unsigned integer, and the server has to be able to see the results of the client's changes to that integer.  Therefore, the server MUST encode the change attribute in network order when sending it to the client.  The client MUST decode it from network order to its native order when receiving it, and the client MUST encode it in network order when sending it to the server.  For this reason, change is defined as an unsigned integer rather than an opaque array of bytes.

For the server, the following steps will be taken when providing an OPEN_DELEGATE_WRITE delegation:

o  Upon providing an OPEN_DELEGATE_WRITE delegation, the server will cache a copy of the change attribute in the data structure it uses to record the delegation.  Let this value be represented by sc.

o  When a second client sends a GETATTR operation on the same file to the server, the server obtains the change attribute from the first client.  Let this value be cc.

o  If the value cc is equal to sc, the file is not modified and the server returns the current values for change, time_metadata, and time_modify (for example) to the second client.

o  If the value cc is NOT equal to sc, the file is currently modified at the first client and most likely will be modified at the server at a future time.  The server then uses its current time to construct attribute values for time_metadata and time_modify.  A new value of sc, which we will call nsc, is computed by the server, such that nsc >= sc + 1.  The server then returns the constructed time_metadata, time_modify, and nsc values to the requester.
The server replaces sc in the delegation record with nsc.  To prevent the possibility of time_modify, time_metadata, and change from appearing to go backward (which would happen if the client holding the delegation fails to write its modified data to the server before the delegation is revoked or returned), the server SHOULD update the file's metadata record with the constructed attribute values.  For reasons of reasonable performance, committing the constructed attribute values to stable storage is OPTIONAL.

As discussed earlier in this section, the client MAY return the same cc value on subsequent CB_GETATTR calls, even if the file was modified in the client's cache yet again between successive CB_GETATTR calls.  Therefore, the server must assume that the file has been modified yet again, and MUST take care to ensure that the new nsc it constructs and returns is greater than the previous nsc it returned.  An example implementation's delegation record would satisfy this mandate by including a boolean field (let us call it "modified") that is set to FALSE when the delegation is granted, and an sc value set at the time of grant to the change attribute value.  The modified field would be set to TRUE the first time cc != sc, and would stay TRUE until the delegation is returned or revoked.  The processing for constructing nsc, time_modify, and time_metadata would use this pseudo code:

    if (!modified) {
        do CB_GETATTR for change and size;

        if (cc != sc)
            modified = TRUE;
    } else {
        do CB_GETATTR for size;
    }

    if (modified) {
        sc = sc + 1;
        time_modify = time_metadata = current_time;
        update sc, time_modify, time_metadata into file's metadata;
    }

This would return to the client (that sent GETATTR) the attributes it requested, but make sure size comes from what CB_GETATTR returned.  The server would not update the file's metadata with the client's modified size.

In the case that the file attribute size is different from the server's current value, the server treats this as a modification regardless of the value of the change attribute retrieved via CB_GETATTR and responds to the second client as in the last step.

This methodology resolves issues of clock differences between client and server and other scenarios where the use of CB_GETATTR breaks down.

It should be noted that the server is under no obligation to use CB_GETATTR, and therefore the server MAY simply recall the delegation to avoid its use.

10.4.4.  Recall of Open Delegation

The following events necessitate recall of an OPEN delegation:

o  potentially conflicting OPEN request (or a READ or WRITE operation done with a special stateid)

o  SETATTR sent by another client

o  REMOVE request for the file

o  RENAME request for the file as either the source or target of the RENAME

Whether a RENAME of a directory in the path leading to the file results in recall of an OPEN delegation depends on the semantics of the server's file system.  If that file system denies such RENAMEs when a file is open, the recall must be performed to determine whether the file in question is, in fact, open.

In addition to the situations above, the server may choose to recall OPEN delegations at any time if resource constraints make it advisable to do so.  Clients should always be prepared for the possibility of recall.

When a client receives a recall for an OPEN delegation, it needs to update state on the server before returning the delegation.  These same updates must be done whenever a client chooses to return a delegation voluntarily.
The following items of state need to be dealt with:

o  If the file associated with the delegation is no longer open and no previous CLOSE operation has been sent to the server, a CLOSE operation must be sent to the server.

o  If a file has other open references at the client, then OPEN operations must be sent to the server.  The appropriate stateids will be provided by the server for subsequent use by the client since the delegation stateid will no longer be valid.  These OPEN requests are done with the claim type of CLAIM_DELEGATE_CUR.  This will allow the presentation of the delegation stateid so that the client can establish the appropriate rights to perform the OPEN.  (See Section 18.16, which describes the OPEN operation, for details.)

o  If there are granted byte-range locks, the corresponding LOCK operations need to be performed.  This applies to the OPEN_DELEGATE_WRITE delegation case only.

o  For an OPEN_DELEGATE_WRITE delegation, if at the time of recall the file is not open for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH, all modified data for the file must be flushed to the server.  If the delegation had not existed, the client would have done this data flush before the CLOSE operation.

o  For an OPEN_DELEGATE_WRITE delegation when a file is still open at the time of recall, any modified data for the file needs to be flushed to the server.

o  With the OPEN_DELEGATE_WRITE delegation in place, it is possible that the file was truncated during the duration of the delegation.  For example, the truncation could have occurred as a result of an OPEN UNCHECKED with a size attribute value of zero.  Therefore, if a truncation of the file has occurred and this operation has not been propagated to the server, the truncation must occur before any modified data is written to the server.
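The items above, in one order a client might complete them before returning an OPEN_DELEGATE_WRITE delegation, can be sketched as follows.  This is an illustrative model only; the function and dictionary keys are hypothetical, and the only orderings the text actually requires are that truncation precede the flush of modified data and that all updates precede the delegation return.

```python
def steps_before_return(state: dict) -> list[str]:
    """Order the state updates a client performs before returning
    an OPEN_DELEGATE_WRITE delegation, per the items listed above."""
    steps = []
    for _ in state.get("other_opens", []):
        steps.append("OPEN(CLAIM_DELEGATE_CUR)")  # re-establish opens
    for _ in state.get("byte_range_locks", []):
        steps.append("LOCK")                      # re-establish locks
    if state.get("truncated"):
        steps.append("SETATTR(size)")             # truncate first ...
    if state.get("dirty_data"):
        steps.append("WRITE+COMMIT")              # ... then flush data
    if not state.get("still_open") and not state.get("close_sent"):
        steps.append("CLOSE")                     # close if no open refs
    steps.append("DELEGRETURN")
    return steps

plan = steps_before_return({"other_opens": ["fd1"], "truncated": True,
                            "dirty_data": True, "still_open": True})
assert plan == ["OPEN(CLAIM_DELEGATE_CUR)", "SETATTR(size)",
                "WRITE+COMMIT", "DELEGRETURN"]
```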
In the case of OPEN_DELEGATE_WRITE delegation, byte-range locking imposes some additional requirements.  To precisely maintain the associated invariant, it is required to flush any modified data in any byte-range for which a WRITE_LT lock was released while the OPEN_DELEGATE_WRITE delegation was in effect.  However, because the OPEN_DELEGATE_WRITE delegation implies no other locking by other clients, a simpler implementation is to flush all modified data for the file (as described just above) if any WRITE_LT lock has been released while the OPEN_DELEGATE_WRITE delegation was in effect.

An implementation need not wait until delegation recall (or the decision to voluntarily return a delegation) to perform any of the above actions, if implementation considerations (e.g., resource availability constraints) make that desirable.  Generally, however, the fact that the actual OPEN state of the file may continue to change makes it not worthwhile to send information about opens and closes to the server, except as part of delegation return.  An exception is when the client has no more internal opens of the file.  In this case, sending a CLOSE is useful because it reduces resource utilization on the client and server.  Regardless of the client's choices on scheduling these actions, all must be performed before the delegation is returned, including (when applicable) the close that corresponds to the OPEN that resulted in the delegation.  These actions can be performed either in previous requests or in previous operations in the same COMPOUND request.

10.4.5.  Clients That Fail to Honor Delegation Recalls

A client may fail to respond to a recall for various reasons, such as a failure of the backchannel from the server to the client.  The client may be unaware of a failure in the backchannel.  This lack of awareness could result in the client finding out long after the failure that its delegation has been revoked, and another client has modified the data for which the client had a delegation.  This is especially a problem for the client that held an OPEN_DELEGATE_WRITE delegation.

Status bits returned by SEQUENCE operations help to provide an alternate way of informing the client of issues regarding the status of the backchannel and of recalled delegations.  When the backchannel is not available, the server returns the status bit SEQ4_STATUS_CB_PATH_DOWN on SEQUENCE operations.  The client can react by attempting to re-establish the backchannel and by returning recallable objects if a backchannel cannot be successfully re-established.

Whether the backchannel is functioning or not, it may be that the recalled delegation is not returned.  Note that the client's lease might still be renewed, even though the recalled delegation is not returned.  In this situation, servers SHOULD revoke delegations that are not returned in a period of time equal to the lease period.  This period of time should allow the client time to note the backchannel-down status and re-establish the backchannel.

When delegations are revoked, the server will return with the SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit set on subsequent SEQUENCE operations.  The client should note this and then use TEST_STATEID to find which delegations have been revoked.

10.4.6.  Delegation Revocation

At the point a delegation is revoked, if there are associated opens on the client, these opens may or may not be revoked.
If no byte-range lock or open is granted that is inconsistent with the existing open, the stateid for the open may remain valid and be disconnected from the revoked delegation, just as would be the case if the delegation were returned.

For example, if an OPEN for OPEN4_SHARE_ACCESS_BOTH with a deny of OPEN4_SHARE_DENY_NONE is associated with the delegation, granting of another such OPEN to a different client will revoke the delegation but need not revoke the OPEN, since the two OPENs are consistent with each other.  On the other hand, if an OPEN denying write access is granted, then the existing OPEN must be revoked.

When opens and/or locks are revoked, the applications holding these opens or locks need to be notified.  This notification usually occurs by returning errors for READ/WRITE operations or when a close is attempted for the open file.

If no opens exist for the file at the point the delegation is revoked, then notification of the revocation is unnecessary.  However, if there is modified data present at the client for the file, the user of the application should be notified.  Unfortunately, it may not be possible to notify the user since active applications may not be present at the client.  See Section 10.5.1 for additional details.

10.4.7.  Delegations via WANT_DELEGATION

In addition to providing delegations as part of the reply to OPEN operations, servers MAY provide delegations separate from open, via the OPTIONAL WANT_DELEGATION operation.  This allows delegations to be obtained in advance of an OPEN that might benefit from them, for objects that are not a valid target of OPEN, or to deal with cases in which a delegation has been recalled and the client wants to make an attempt to re-establish it if the absence of use by other clients allows that.

The WANT_DELEGATION operation may be performed on any type of file object other than a directory.

When a delegation is obtained using WANT_DELEGATION, any open files for the same filehandle held by that client are to be treated as subordinate to the delegation, just as if they had been created using an OPEN of type CLAIM_DELEGATE_CUR.  They are otherwise unchanged as to seqid, access and deny modes, and the relationship with byte-range locks.  Similarly, because existing byte-range locks are subordinate to an open, those byte-range locks also become indirectly subordinate to that new delegation.

The WANT_DELEGATION operation provides for delivery of delegations via callbacks, when the delegations are not immediately available.  When a requested delegation is available, it is delivered to the client via a CB_PUSH_DELEG operation.  When this happens, open files for the same filehandle become subordinate to the new delegation at the point at which the delegation is delivered, just as if they had been created using an OPEN of type CLAIM_DELEGATE_CUR.  Similarly, this occurs for existing byte-range locks subordinate to an open.

10.5.  Data Caching and Revocation

When locks and delegations are revoked, the assumptions upon which successful caching depends are no longer guaranteed.  For any locks or share reservations that have been revoked, the corresponding state-owner needs to be notified.  This notification includes applications with a file open that has a corresponding delegation that has been revoked.  Cached data associated with the revocation must be removed from the client.  In the case of modified data existing in the client's cache, that data must be removed from the client without being written to the server.
As mentioned, the assumptions made by the client are no longer valid at the point when a lock or delegation has been revoked.  For example, another client may have been granted a conflicting byte-range lock after the revocation of the byte-range lock at the first client.  Therefore, the data within the lock range may have been modified by the other client.  Obviously, the first client is unable to guarantee to the application what has occurred to the file in the case of revocation.

Notification to a state-owner will in many cases consist of simply returning an error on the next and all subsequent READs/WRITEs to the open file or on the close.  Where the methods available to a client make such notification impossible because errors for certain operations may not be returned, more drastic action such as signals or process termination may be appropriate.  The justification here is that an invariant on which an application depends may be violated.  Depending on how errors are typically treated for the client operating environment, further levels of notification including logging, console messages, and GUI pop-ups may be appropriate.

10.5.1.  Revocation Recovery for Write Open Delegation

Revocation recovery for an OPEN_DELEGATE_WRITE delegation poses the special issue of modified data in the client cache while the file is not open.  In this situation, any client that does not flush modified data to the server on each close must ensure that the user receives appropriate notification of the failure as a result of the revocation.  Since such situations may require human action to correct problems, notification schemes in which the appropriate user or administrator is notified may be necessary.  Logging and console messages are typical examples.
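A minimal sketch of the notification decision discussed above, under the assumption (drawn from Sections 10.4.6 and 10.5.1) that READ/WRITE/CLOSE errors can carry the notification only while some application still holds the file open; the function name is hypothetical:

```python
def needs_out_of_band_notice(dirty: bool, open_refs: int) -> bool:
    """If a write delegation is revoked while modified data is
    cached and no application still has the file open, errors on
    READ/WRITE/CLOSE cannot reach a user, so an out-of-band channel
    (log entry, console message) is needed instead."""
    return dirty and open_refs == 0

assert needs_out_of_band_notice(dirty=True, open_refs=0)       # log/console
assert not needs_out_of_band_notice(dirty=True, open_refs=2)   # I/O errors
assert not needs_out_of_band_notice(dirty=False, open_refs=0)  # nothing lost
```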
If there is modified data on the client, it must not be flushed normally to the server.  A client may attempt to provide a copy of the file data as modified during the delegation under a different name in the file system namespace to ease recovery.  Note that when the client can determine that the file has not been modified by any other client, or when the client has a complete cached copy of the file in question, such a saved copy of the client's view of the file may be of particular value for recovery.  In another case, recovery using a copy of the file based partially on the client's cached data and partially on the server's copy as modified by other clients will be anything but straightforward, so clients may avoid saving file contents in these situations or specially mark the results to warn users of possible problems.

Saving of such modified data in delegation revocation situations may be limited to files of a certain size or might be used only when sufficient disk space is available within the target file system.  Such saving may also be restricted to situations when the client has sufficient buffering resources to keep the cached copy available until it is properly stored to the target file system.

10.6.  Attribute Caching

This section pertains to the caching of a file's attributes on a client when that client does not hold a delegation on the file.

The attributes discussed in this section do not include named attributes.  Individual named attributes are analogous to files, and caching of the data for these needs to be handled just as data caching is for ordinary files.  Similarly, LOOKUP results from an OPENATTR directory (as well as the directory's contents) are to be cached on the same basis as any other pathnames.

Clients may cache file attributes obtained from the server and use them to avoid subsequent GETATTR requests.  Such caching is write-through in that modification to file attributes is always done by means of requests to the server and should not be done locally and should not be cached.  The exceptions to this are modifications to attributes that are intimately connected with data caching.  Therefore, extending a file by writing data to the local data cache is reflected immediately in the size as seen on the client without this change being immediately reflected on the server.  Normally, such changes are not propagated directly to the server, but when the modified data is flushed to the server, analogous attribute changes are made on the server.  When OPEN delegation is in effect, the modified attributes may be returned to the server in reaction to a CB_RECALL call.

The result of local caching of attributes is that the attribute caches maintained on individual clients will not be coherent.  Changes made in one order on the server may be seen in a different order on one client and in a third order on another client.

The typical file system application programming interfaces do not provide means to atomically modify or interrogate attributes for multiple files at the same time.  The following rules provide an environment where the potential incoherencies mentioned above can be reasonably managed.  These rules are derived from the practice of previous NFS protocols.

o  All attributes for a given file (per-fsid attributes excepted) are cached as a unit at the client so that no non-serializability can arise within the context of a single file.

o  An upper time boundary is maintained on how long a client cache entry can be kept without being refreshed from the server.
10251 o When operations are performed that change attributes at the 10252 server, the updated attribute set is requested as part of the 10253 containing RPC. This includes directory operations that update 10254 attributes indirectly. This is accomplished by following the 10255 modifying operation with a GETATTR operation and then using the 10256 results of the GETATTR to update the client's cached attributes. 10258 Note that if the full set of attributes to be cached is requested by 10259 READDIR, the results can be cached by the client on the same basis as 10260 attributes obtained via GETATTR. 10262 A client may validate its cached version of attributes for a file by 10263 fetching both the change and time_access attributes and assuming that 10264 if the change attribute has the same value as it did when the 10265 attributes were cached, then no attributes other than time_access 10266 have changed. The reason why time_access is also fetched is because 10267 many servers operate in environments where the operation that updates 10268 change does not update time_access. For example, POSIX file 10269 semantics do not update access time when a file is modified by the 10270 write system call [15]. Therefore, the client that wants a current 10271 time_access value should fetch it with change during the attribute 10272 cache validation processing and update its cached time_access. 10274 The client may maintain a cache of modified attributes for those 10275 attributes intimately connected with data of modified regular files 10276 (size, time_modify, and change). Other than those three attributes, 10277 the client MUST NOT maintain a cache of modified attributes. 10278 Instead, attribute changes are immediately sent to the server. 10280 In some operating environments, the equivalent to time_access is 10281 expected to be implicitly updated by each read of the content of the 10282 file object. 
If an NFS client is caching the content of a file object, whether it is a regular file, directory, or symbolic link, the client SHOULD NOT update the time_access attribute (via SETATTR or a small READ or READDIR request) on the server with each read that is satisfied from cache.  The reason is that this can defeat the performance benefits of caching content, especially since an explicit SETATTR of time_access may alter the change attribute on the server.  If the change attribute changes, clients that are caching the content will think the content has changed, and will re-read unmodified data from the server.  Nor is the client encouraged to maintain a modified version of time_access in its cache, since the client would either eventually have to write the access time to the server with bad performance effects or never update the server's time_access, thereby resulting in a situation where an application that caches access time between a close and open of the same file observes the access time oscillating between the past and present.  The time_access attribute always means the time of last access to a file by a read that was satisfied by the server.  This way clients will tend to see only time_access changes that go forward in time.

10.7.  Data and Metadata Caching and Memory Mapped Files

Some operating environments include the capability for an application to map a file's content into the application's address space.  Each time the application accesses a memory location that corresponds to a block that has not been loaded into the address space, a page fault occurs and the file is read (or, if the block does not exist in the file, the block is allocated and then instantiated in the application's address space).
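The page-fault behavior just described can be modeled with a short sketch.  This is illustrative only; the class, PAGE_SIZE, and the fault counter are hypothetical and not part of the protocol:

```python
# Minimal model (not NFS code) of memory-mapped file access: the first
# access to a page triggers a fault that reads (or allocates) the block;
# later accesses hit memory directly and bypass the file system, so
# attributes can only be updated at fault time.

PAGE_SIZE = 8192  # illustrative page size


class MappedFile:
    def __init__(self, file_blocks):
        self.file_blocks = file_blocks   # backing "file": page index -> bytes
        self.resident = {}               # pages loaded into the address space
        self.faults = 0                  # attribute updates happen only here

    def access(self, offset):
        page = offset // PAGE_SIZE
        if page not in self.resident:
            # Page fault: the file is read here, and time_access (and,
            # for stores, time_modify/change) could be updated.
            self.faults += 1
            if page not in self.file_blocks:
                # Block does not exist: allocate and instantiate it.
                self.file_blocks[page] = bytes(PAGE_SIZE)
            self.resident[page] = bytearray(self.file_blocks[page])
        return self.resident[page][offset % PAGE_SIZE]


f = MappedFile({0: bytes(PAGE_SIZE)})
f.access(0)      # first touch of page 0: page fault
f.access(100)    # same page: no fault, no attribute update
assert f.faults == 1
```

The point of the model is the asymmetry the following paragraphs rely on: only the faulting access is visible to the file system at all.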
As long as each memory-mapped access to the file requires a page fault, the relevant attributes of the file that are used to detect access and modification (time_access, time_metadata, time_modify, and change) will be updated.  However, in many operating environments, when page faults are not required, these attributes will not be updated on reads or updates to the file via memory access (regardless of whether the file is local or is accessed remotely).  A client or server MAY fail to update attributes of a file that is being accessed via memory-mapped I/O.  This has several implications:

o  If there is an application on the server that has memory mapped a file that a client is also accessing, the client may not be able to get a consistent value of the change attribute to determine whether or not its cache is stale.  A server that knows that the file is memory-mapped could always pessimistically return updated values for change so as to force the application to always get the most up-to-date data and metadata for the file.  However, due to the negative performance implications of this, such behavior is OPTIONAL.

o  If the memory-mapped file is not being modified on the server, and instead is just being read by an application via the memory-mapped interface, the client will not see an updated time_access attribute.  However, in many operating environments, neither will any process running on the server.  Thus, NFS clients are at no disadvantage with respect to local processes.

o  If there is another client that is memory mapping the file, and if that client is holding an OPEN_DELEGATE_WRITE delegation, the same set of issues as discussed in the previous two bullet points apply.
So, when a server does a CB_GETATTR to a file that the client has modified in its cache, the reply from CB_GETATTR will not necessarily be accurate.  As discussed earlier, the client's obligation is to report that the file has been modified since the delegation was granted, not whether it has been modified again between successive CB_GETATTR calls, and the server MUST assume that any file the client has modified in cache has been modified again between successive CB_GETATTR calls.  Depending on the nature of the client's memory management system, even this weak obligation may not be possible to fulfill.  A client MAY return stale information in CB_GETATTR whenever the file is memory-mapped.

o  The mixture of memory mapping and byte-range locking on the same file is problematic.  Consider the following scenario, where the page size on each client is 8192 bytes.

   *  Client A memory maps the first page (8192 bytes) of file X.

   *  Client B memory maps the first page (8192 bytes) of file X.

   *  Client A WRITE_LT locks the first 4096 bytes.

   *  Client B WRITE_LT locks the second 4096 bytes.

   *  Client A, via a STORE instruction, modifies part of its locked byte-range.

   *  Simultaneously with client A, client B executes a STORE on part of its locked byte-range.

Here the challenge is for each client to resynchronize to get a correct view of the first page.  In many operating environments, the virtual memory management systems on each client only know that a page is modified, not that a subset of the page corresponding to the respective locked byte-ranges has been modified.  So it is not possible for each client to do the right thing, which is to write to the server only that portion of the page that is locked.  For example, if client A simply writes out the page, and then client B writes out the page, client A's data is lost.
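The lost-update outcome of the scenario above can be shown with a small simulation.  This is illustrative only: no NFS operations are involved, and the page-granularity flush stands in for what the virtual memory system forces each client to do:

```python
# Illustrative simulation of the scenario above: the VM system tracks
# dirtiness per page, so each client writes back the whole 8192-byte
# page rather than only its locked 4096-byte half.

PAGE = 8192

server_page = bytearray(PAGE)            # page 0 of file X on the server

# Both clients memory map the page (each gets a private copy).
client_a = bytearray(server_page)
client_b = bytearray(server_page)

# Client A holds a WRITE_LT lock on bytes 0..4095 and STOREs there.
client_a[0:4] = b"AAAA"
# Client B holds a WRITE_LT lock on bytes 4096..8191 and STOREs there.
client_b[4096:4100] = b"BBBB"

# Flush: each client can only write back the whole dirty page.
server_page[:] = client_a                # client A writes out page 0
server_page[:] = client_b                # client B then writes out page 0

# B's flush carried stale zeros for A's locked half: A's update is lost.
assert server_page[4096:4100] == b"BBBB"
assert server_page[0:4] != b"AAAA"       # lost update
```

Writing back only the locked 4096-byte range from each client would avoid the loss, but, as the text notes, the page-level dirty tracking in typical VM systems makes that information unavailable.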
Moreover, if mandatory locking is enabled on the file, then we have a different problem.  When clients A and B execute the STORE instructions, the resulting page faults require a byte-range lock on the entire page.  Each client then tries to extend its locked range to the entire page, which results in a deadlock.  Communicating the NFS4ERR_DEADLOCK error to a STORE instruction is difficult at best.

If a client is locking the entire memory-mapped file, there is no problem with advisory or mandatory byte-range locking, at least until the client unlocks a byte-range in the middle of the file.

Given the above issues, the following are permitted:

o  Clients and servers MAY deny memory mapping a file for which they know there are byte-range locks.

o  Clients and servers MAY deny a byte-range lock on a file they know is memory-mapped.

o  A client MAY deny memory mapping a file that it knows requires mandatory locking for I/O.  If mandatory locking is enabled after the file is opened and mapped, the client MAY deny the application further access to its mapped file.

10.8.  Name and Directory Caching without Directory Delegations

The NFSv4.1 directory delegation facility (described in Section 10.9 below) is OPTIONAL for servers to implement.  Even where it is implemented, it may not always be functional because of resource availability issues or other constraints.  Thus, it is important to understand how name and directory caching are done in the absence of directory delegations.  These topics are discussed in the next two subsections.

10.8.1.  Name Caching

The results of LOOKUP and READDIR operations may be cached to avoid the cost of subsequent LOOKUP operations.  Just as in the case of attribute caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies and given the context of typical file system APIs, an upper time boundary is maintained for how long a client name cache entry can be kept without verifying that the entry has not been made invalid by a directory change operation performed by another client.

When a client is not making changes to a directory for which there exist name cache entries, the client needs to periodically fetch attributes for that directory to ensure that it is not being modified.  After determining that no modification has occurred, the expiration time for the associated name cache entries may be updated to be the current time plus the name cache staleness bound.

When a client is making changes to a given directory, it needs to determine whether there have been changes made to the directory by other clients.  It does this by using the change attribute as reported before and after the directory operation in the associated change_info4 value returned for the operation.  The server is able to communicate to the client whether the change_info4 data is provided atomically with respect to the directory operation.  If the change values are provided atomically, the client has a basis for determining, given proper care, whether other clients are modifying the directory in question.

The simplest way to enable the client to make this determination is for the client to serialize all changes made to a specific directory.  When this is done, and the server provides before and after values of the change attribute atomically, the client can simply compare the after value of the change attribute from one operation on a directory with the before value from the subsequent operation modifying that directory.
When these are equal, the client is assured that no other client is modifying the directory in question.

When such serialization is not used, and there may be multiple simultaneous outstanding operations modifying a single directory sent from a single client, making this sort of determination can be more complicated.  If two such operations complete in a different order than they were actually performed, that might give an appearance consistent with modification being made by another client.  Where this appears to happen, the client needs to await the completion of all such modifications that were started previously, to see if the outstanding before and after change numbers can be sorted into a chain such that the before value of one change number matches the after value of a previous one, in a chain consistent with this client being the only one modifying the directory.

In either of these cases, the client is able to determine whether the directory is being modified by another client.  If the comparison indicates that the directory was updated by another client, the name cache associated with the modified directory is purged from the client.  If the comparison indicates no modification, the name cache can be updated on the client to reflect the directory operation and the associated timeout can be extended.  The post-operation change value needs to be saved as the basis for future change_info4 comparisons.

As demonstrated by the scenario above, name caching requires that the client revalidate name cache data by inspecting the change attribute of a directory at the point when the name cache item was cached.  This requires that the server update the change attribute for directories when the contents of the corresponding directory are modified.
For a client to use the change_info4 information appropriately and correctly, the server must report the pre- and post-operation change attribute values atomically.  When the server is unable to report the before and after values atomically with respect to the directory operation, the server must indicate that fact in the change_info4 return value.  When the information is not atomically reported, the client should not assume that other clients have not changed the directory.

10.8.2.  Directory Caching

The results of READDIR operations may be used to avoid subsequent READDIR operations.  Just as in the cases of attribute and name caching, inconsistencies may arise among the various client caches.  To mitigate the effects of these inconsistencies, and given the context of typical file system APIs, the following rules should be followed:

o  Cached READDIR information for a directory that is not obtained in a single READDIR operation must always be a consistent snapshot of the directory contents.  This is determined by using a GETATTR before the first READDIR and after the last READDIR that contributes to the cache.

o  An upper time boundary is maintained to indicate the length of time a directory cache entry is considered valid before the client must revalidate the cached information.

The revalidation technique parallels that discussed in the case of name caching.  When the client is not changing the directory in question, checking the change attribute of the directory with GETATTR is adequate.  The lifetime of the cache entry can be extended at these checkpoints.  When a client is modifying the directory, the client needs to use the change_info4 data to determine whether there are other clients modifying the directory.
If it is determined that no other client modifications are occurring, the client may update its directory cache to reflect its own changes.

As demonstrated previously, directory caching requires that the client revalidate directory cache data by inspecting the change attribute of a directory at the point when the directory was cached.  This requires that the server update the change attribute for directories when the contents of the corresponding directory are modified.  For a client to use the change_info4 information appropriately and correctly, the server must report the pre- and post-operation change attribute values atomically.  When the server is unable to report the before and after values atomically with respect to the directory operation, the server must indicate that fact in the change_info4 return value.  When the information is not atomically reported, the client should not assume that other clients have not changed the directory.

10.9.  Directory Delegations

10.9.1.  Introduction to Directory Delegations

Directory caching for the NFSv4.1 protocol, as previously described, is similar to file caching in previous versions.  Clients typically cache directory information for a duration determined by the client.  At the end of a predefined timeout, the client will query the server to see if the directory has been updated.  By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes.  Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call.
By caching name and inode information about most recently looked up entries in a Directory Name Lookup Cache (DNLC), clients do not need to send LOOKUP calls to the server every time these files are accessed.

This caching approach works reasonably well at reducing network traffic in many environments.  However, it does not address environments where there are numerous queries for files that do not exist.  In these cases of "misses", the client sends requests to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries.  An example of high miss activity is compilation in software development environments.  The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments.  Other distributed stateful file system architectures such as AFS and DFS have proven that adding state around directory contents can greatly reduce network traffic in high-miss environments.

Delegation of directory contents is an OPTIONAL feature of NFSv4.1.  Directory delegations provide similar traffic reduction benefits as with file delegations.  By allowing clients to cache directory contents (in a read-only fashion) while being notified of changes, the client can avoid making frequent requests to interrogate the contents of slowly-changing directories, reducing network traffic and improving client performance.  It can also simplify the task of determining whether other clients are making changes to the directory when the client itself is making many changes to the directory and changes are not serialized.

Directory delegations allow improved namespace cache consistency to be achieved through delegations and synchronous recalls, in the absence of notifications.
In addition, if time-based consistency is sufficient, asynchronous notifications can provide performance benefits for the client, and possibly the server, under some common operating conditions such as slowly-changing and/or very large directories.

10.9.2.  Directory Delegation Design

NFSv4.1 introduces the GET_DIR_DELEGATION (Section 18.39) operation to allow the client to ask for a directory delegation.  The delegation covers directory attributes and all entries in the directory.  If either of these changes, the delegation will be recalled synchronously.  The operation causing the recall will have to wait until the recall is complete.  Changes to the attributes of individual directory entries will not cause the delegation to be recalled.

In addition to asking for delegations, a client can also ask for notifications for certain events.  These events include changes to the directory's attributes and/or its contents.  If a client asks for notification for a certain event, the server will notify the client when that event occurs.  This will not result in the delegation being recalled for that client.  The notifications are asynchronous and provide a way of avoiding recalls in situations where a directory is changing enough that the pure recall model may not be effective, while still allowing the client to obtain substantial benefit.  In the absence of notifications, once the delegation is recalled the client has to refresh its directory cache; this might not be very efficient for very large directories.

The delegation is read-only, and the client may not make changes to the directory other than by performing NFSv4.1 operations that modify the directory or the associated file attributes, so that the server has knowledge of these changes.
In order to keep the client's namespace synchronized with the server, the server will notify the delegation-holding client (assuming it has requested notifications) of the changes made as a result of that client's directory-modifying operations.  This is to avoid any need for that client to send subsequent GETATTR or READDIR operations to the server.  If a single client is holding the delegation and that client makes any changes to the directory (i.e., the changes are made via operations sent on a session associated with the client ID holding the delegation), the delegation will not be recalled.  Multiple clients may hold a delegation on the same directory, but if any such client modifies the directory, the server MUST recall the delegation from the other clients, unless those clients have made provisions to be notified of that sort of modification.

Delegations can be recalled by the server at any time.  Normally, the server will recall the delegation when the directory changes in a way that is not covered by the notification, or when the directory changes and notifications have not been requested.  If another client removes the directory for which a delegation has been granted, the server will recall the delegation.

10.9.3.  Attributes in Support of Directory Notifications

See Section 5.11 for a description of the attributes associated with directory notifications.

10.9.4.  Directory Delegation Recall

The server will recall the directory delegation by sending a callback to the client.  It will use the same callback procedure as used for recalling file delegations.  The server will recall the delegation when the directory changes in a way that is not covered by the notification.  However, the server need not recall the delegation if attributes of an entry within the directory change.
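The recall-versus-notify choice described above can be sketched as follows.  The event names here are hypothetical placeholders, not the notification types defined by the protocol's XDR:

```python
# Hedged sketch of the server-side decision described above: notify
# when the delegation holder asked for that event, recall the
# delegation otherwise, and do nothing for changes to the attributes
# of individual entries within the directory.

# Illustrative event names (not protocol constants).
DIR_CONTENTS_CHANGED = "dir_contents_changed"
DIR_ATTRS_CHANGED = "dir_attrs_changed"
ENTRY_ATTRS_CHANGED = "entry_attrs_changed"


def handle_directory_change(event, requested_notifications):
    """Return the action taken for one delegation-holding client."""
    if event == ENTRY_ATTRS_CHANGED:
        return "none"        # per-entry attribute changes need no recall
    if event in requested_notifications:
        return "notify"      # asynchronous notification; delegation kept
    return "recall"          # synchronous recall; the modifying
                             # operation waits until the recall completes


assert handle_directory_change(DIR_CONTENTS_CHANGED,
                               {DIR_CONTENTS_CHANGED}) == "notify"
assert handle_directory_change(DIR_ATTRS_CHANGED, set()) == "recall"
assert handle_directory_change(ENTRY_ATTRS_CHANGED, set()) == "none"
```

Note that this decision is made per delegation holder: the client whose own operation caused the change is notified rather than recalled, while other holders that did not request notification of that event have their delegations recalled.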
If the server notices that handing out a delegation for a directory is causing too many notifications to be sent out, it may decide not to hand out delegations for that directory and/or to recall those already granted.  If a client tries to remove the directory for which a delegation has been granted, the server will recall all associated delegations.

The implementation sections for a number of operations describe situations in which notification or delegation recall would be required under some common circumstances.  In this regard, a similar set of caveats to those listed in Section 10.2 apply.

o  For CREATE, see Section 18.4.4.

o  For LINK, see Section 18.9.4.

o  For OPEN, see Section 18.16.4.

o  For REMOVE, see Section 18.25.4.

o  For RENAME, see Section 18.26.4.

o  For SETATTR, see Section 18.30.4.

10.9.5.  Directory Delegation Recovery

Recovery from client or server restart for state on regular files has two main goals: avoiding the necessity of breaking application guarantees with respect to locked files and delivery of updates cached at the client.  Neither of these goals applies to directories protected by OPEN_DELEGATE_READ delegations and notifications.  Thus, no provision is made for reclaiming directory delegations in the event of client or server restart.  The client can simply establish a directory delegation in the same fashion as was done initially.

11.  Multi-Server Namespace

NFSv4.1 supports attributes that allow a namespace to extend beyond the boundaries of a single server.  It is desirable that clients and servers support construction of such multi-server namespaces.  Use of such multi-server namespaces is OPTIONAL, however, and for many purposes, single-server namespaces are perfectly acceptable.
Use of multi-server namespaces can provide many advantages, by separating a file system's logical position in a namespace from the (possibly changing) logistical and administrative considerations that result in particular file systems being located on particular servers via a single network access path known in advance or determined using DNS.

11.1.  Terminology

In this section as a whole (i.e., within Section 11), the phrase "client ID" always refers to the 64-bit shorthand identifier assigned by the server (a clientid4) and never to the structure which the client uses to identify itself to the server (called an nfs_client_id4 or client_owner in NFSv4.0 and NFSv4.1, respectively).  The opaque identifier within those structures is referred to as a "client id string".

11.1.1.  Terminology Related to Trunking

It is particularly important to clarify the distinction between trunking detection and trunking discovery.  The definitions we present are applicable to all minor versions of NFSv4, but we will focus on how these terms apply to NFS version 4.1.

o  Trunking detection refers to ways of deciding whether two specific network addresses are connected to the same NFSv4 server.  The means available to make this determination depend on the protocol version and, in some cases, on the client implementation.

   In the case of NFS version 4.1 and later minor versions, the means of trunking detection are as described in this document and are available to every client.  Two network addresses connected to the same server are always server-trunkable but cannot necessarily be used together to access a single session.

o  Trunking discovery is a process by which a client using one network address can obtain other addresses that are connected to the same server.
Typically, it builds on a trunking detection facility by providing one or more methods by which candidate addresses are made available to the client, which can then use trunking detection to appropriately filter them.

Despite the support for trunking detection, there was no description of trunking discovery provided in RFC5661 [60], making it necessary to provide those means in this document.

The combination of a server network address and a particular connection type to be used by a connection is referred to as a "server endpoint".  Although using different connection types may result in different ports being used, the use of different ports by multiple connections to the same network address is not the essence of the distinction between the two endpoints used.

Two network addresses connected to the same server are said to be server-trunkable.  Two such addresses support the use of client ID trunking, as described in Section 2.10.5.

Two network addresses connected to the same server such that those addresses can be used to support a single common session are referred to as session-trunkable.  Note that two addresses may be server-trunkable without being session-trunkable, and that when two connections of different connection types are made to the same network address and are based on a single file system location entry, they are always session-trunkable, independent of the connection type, as specified by Section 2.10.5, since their derivation from the same file system location entry, together with the identity of their network addresses, assures that both connections are to the same server and will return server-owner information allowing session trunking to be used.

11.1.2.  Terminology Related to File System Location

Regarding terminology relating to the construction of multi-server namespaces out of a set of local per-server namespaces:

o  Each server has a set of exported file systems that may be accessed by NFSv4 clients.  Typically, this is done by assigning each file system a name within the pseudo-fs associated with the server, although the pseudo-fs may be dispensed with if there is only a single exported file system.  Each such file system is part of the server's local namespace, and can be considered as a file system instance within a larger multi-server namespace.

o  The set of all exported file systems for a given server constitutes that server's local namespace.

o  In some cases, a server will have a namespace more extensive than its local namespace by using features associated with attributes that provide file system location information.  These features, which allow construction of a multi-server namespace, are all described in individual sections below and include referrals (described in Section 11.5.6), migration (described in Section 11.5.5), and replication (described in Section 11.5.4).

o  A file system present in a server's pseudo-fs may have multiple file system instances on different servers associated with it.  All such instances are considered replicas of one another.

o  When a file system is present in a server's pseudo-fs, but there is no corresponding local file system, it is said to be "absent".  In such cases, all associated instances will be accessed on other servers.

Regarding terminology relating to attributes used in trunking discovery and other multi-server namespace features:

o  File system location attributes include the fs_locations and fs_locations_info attributes.
o  File system location entries provide the individual file system locations within the file system location attributes.  Each such entry specifies a server, in the form of a host name or an IP address, and an fs name, which designates the location of the file system within the server's local namespace.  A file system location entry designates a set of server endpoints to which the client may establish connections.  There may be multiple endpoints because a host name may map to multiple network addresses and because multiple connection types may be used to communicate with a single network address.  However, all such endpoints MUST provide a way of connecting to a single server.  The exact form of the location entry varies with the particular file system location attribute used, as described in Section 11.2.

o  File system location elements are derived from location entries, and each describes a particular network access path, consisting of a network address and a location within the server's local namespace.  Such location elements need not appear within a file system location attribute, but the existence of each location element derives from a corresponding location entry.  When a location entry specifies an IP address, there is only a single corresponding location element.  File system location entries that contain a host name are resolved using DNS, and may result in one or more location elements.  All location elements consist of a location address, which is the IP address of an interface to a server, and an fs name, which is the location of the file system within the server's local namespace.  The fs name can be empty if the server has no pseudo-fs and only a single exported file system at the root filehandle.
o  Two file system location elements are said to be server-trunkable if they specify the same fs name and the location addresses are such that the location addresses are server-trunkable.  When the corresponding network paths are used, the client will always be able to use client ID trunking, but will only be able to use session trunking if the paths are also session-trunkable.

o  Two file system location elements are said to be session-trunkable if they specify the same fs name and the location addresses are such that the location addresses are session-trunkable.  When the corresponding network paths are used, the client will be able to use either client ID trunking or session trunking.

Discussion of the term "replica" is complicated by the fact that the term was used in RFC5661 [60] with a meaning different from that in this document.  In short, in [60] each replica is identified by a single network access path, while in the current document a set of network access paths that have server-trunkable network addresses and the same root-relative file system pathname is considered to be a single replica with multiple network access paths.

Each set of server-trunkable location elements defines a set of available network access paths to a particular file system.  When there are multiple such file systems, each of which contains the same data, these file systems are considered replicas of one another.  Logically, such replication is symmetric, since the fs currently in use and an alternate fs are replicas of each other.  Often, in other documents, the term "replica" is not applied to the fs currently in use, despite the fact that the replication relation is inherently symmetric.

11.2.  File System Location Attributes

NFSv4.1 contains attributes that provide information about how (i.e., at what network address and namespace position) a given file system may be accessed.  As a result, file systems in the namespace of one server can be associated with one or more instances of that file system on other servers.  These attributes contain file system location entries specifying a server address target (either as a DNS name representing one or more IP addresses or as a specific IP address) together with the pathname of that file system within the associated single-server namespace.

The fs_locations_info RECOMMENDED attribute allows specification of one or more file system instance locations where the data corresponding to a given file system may be found.  This attribute provides to the client, in addition to specification of file system instance locations, other helpful information such as:

o  Information guiding choices among the various file system instances provided (e.g., priority for use, writability, currency, etc.).

o  Information to help the client efficiently effect as seamless a transition as possible among multiple file system instances, when and if that should be necessary.

o  Information helping to guide the selection of the appropriate connection type to be used when establishing a connection.

Within the fs_locations_info attribute, each fs_locations_server4 entry corresponds to a file system location entry, with the fls_server field designating the server and the location pathname within the server's pseudo-fs given by the fl_rootpath field of the encompassing fs_locations_item4.

The fs_locations attribute defined in NFSv4.0 is also a part of NFSv4.1.
This attribute only allows specification of the file system
locations where the data corresponding to a given file system may be
found.  Servers should make this attribute available whenever
fs_locations_info is supported, but client use of fs_locations_info
is preferable, as it provides more information.

Within the fs_locations attribute, each fs_location4 contains a file
system location entry with the server field designating the server
and the rootpath field giving the location pathname within the
server's pseudo-fs.

11.3.  File System Presence or Absence

A given location in an NFSv4.1 namespace (typically but not
necessarily a multi-server namespace) can have a number of file
system instance locations associated with it (via the fs_locations or
fs_locations_info attribute).  There may also be an actual current
file system at that location, accessible via normal namespace
operations (e.g., LOOKUP).  In this case, the file system is said to
be "present" at that position in the namespace, and clients will
typically use it, reserving use of additional locations specified via
the location-related attributes to situations in which the principal
location is no longer available.

When there is no actual file system at the namespace location in
question, the file system is said to be "absent".  An absent file
system contains no files or directories other than the root.  Any
reference to it, except to access a small set of attributes useful in
determining alternate locations, will result in an error,
NFS4ERR_MOVED.  Note that if the server ever returns the error
NFS4ERR_MOVED, it MUST support the fs_locations attribute and SHOULD
support the fs_locations_info and fs_status attributes.
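The rule just stated, together with the GETATTR exception detailed in
Section 11.4.1, can be sketched as follows (an illustrative model
only, not server implementation guidance; attribute names are
represented here as strings):

```python
# Sketch of the server-side check for an absent file system: any
# operation fails with NFS4ERR_MOVED, except a GETATTR whose mask
# includes at least one location-related attribute.
NFS4_OK = 0
NFS4ERR_MOVED = 10019   # error value from the NFSv4.1 definitions

LOCATION_ATTRS = frozenset({"fs_locations", "fs_locations_info",
                            "fs_status"})

def getattr_absent_fs(requested, available):
    """requested: attribute names in the GETATTR mask;
    available: the small set the absent file system can supply.
    Returns (status, attributes actually returned)."""
    if not LOCATION_ATTRS & set(requested):
        return NFS4ERR_MOVED, set()
    # Unsupported bits do not cause an error; the reply carries the
    # mask of the attributes actually supported with the results.
    return NFS4_OK, {a for a in requested if a in available}
```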
While the error name suggests that we have a case of a file system
that once was present and has only become absent later, this is only
one possibility.  A position in the namespace may be permanently
absent with the set of file system(s) designated by the location
attributes being the only realization.  The name NFS4ERR_MOVED
reflects an earlier, more limited conception of its function, but
this error will be returned whenever the referenced file system is
absent, whether it has moved or not.

Except in the case of GETATTR-type operations (to be discussed
later), when the current filehandle at the start of an operation is
within an absent file system, that operation is not performed and the
error NFS4ERR_MOVED is returned, to indicate that the file system is
absent on the current server.

Because a GETFH cannot succeed if the current filehandle is within an
absent file system, filehandles within an absent file system cannot
be transferred to the client.  When a client does have filehandles
within an absent file system, it is the result of obtaining them when
the file system was present, and having the file system become absent
subsequently.

It should be noted that because the check for the current filehandle
being within an absent file system happens at the start of every
operation, operations that change the current filehandle so that it
is within an absent file system will not result in an error.  This
allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be
used to get attribute information, particularly location attribute
information, as discussed below.

The RECOMMENDED file system attribute fs_status can be used to
interrogate the present/absent status of a given file system.

11.4.  Getting Attributes for an Absent File System

When a file system is absent, most attributes are not available, but
it is necessary to allow the client access to the small set of
attributes that are available, and most particularly those that give
information about the correct current locations for this file system:
fs_locations and fs_locations_info.

11.4.1.  GETATTR within an Absent File System

As mentioned above, an exception is made for GETATTR in that
attributes may be obtained for a filehandle within an absent file
system.  This exception only applies if the attribute mask contains
at least one attribute bit that indicates the client is interested in
a result regarding an absent file system: fs_locations,
fs_locations_info, or fs_status.  If none of these attributes is
requested, GETATTR will result in an NFS4ERR_MOVED error.

When a GETATTR is done on an absent file system, the set of supported
attributes is very limited.  Many attributes, including those that
are normally REQUIRED, will not be available on an absent file
system.  In addition to the attributes mentioned above (fs_locations,
fs_locations_info, fs_status), the following attributes SHOULD be
available on absent file systems.  In the case of RECOMMENDED
attributes, they should be available at least to the same degree that
they are available on present file systems.

change_policy:  This attribute is useful for absent file systems and
   can be helpful in summarizing to the client when any of the
   location-related attributes change.

fsid:  This attribute should be provided so that the client can
   determine file system boundaries, including, in particular, the
   boundary between present and absent file systems.
   This value must be different from any other fsid on the current
   server and need have no particular relationship to fsids on any
   particular destination to which the client might be directed.

mounted_on_fileid:  For objects at the top of an absent file system,
   this attribute needs to be available.  Since the fileid is within
   the present parent file system, there should be no need to
   reference the absent file system to provide this information.

Other attributes SHOULD NOT be made available for absent file
systems, even when it is possible to provide them.  The server should
not assume that more information is always better and should avoid
gratuitously providing additional information.

When a GETATTR operation includes a bit mask for one of the
attributes fs_locations, fs_locations_info, or fs_status, but where
the bit mask includes attributes that are not supported, GETATTR will
not return an error, but will return the mask of the actual
attributes supported with the results.

Handling of VERIFY/NVERIFY is similar to GETATTR in that if the
attribute mask does not include fs_locations, fs_locations_info, or
fs_status, the error NFS4ERR_MOVED will result.  It differs in that
any appearance in the attribute mask of an attribute not supported
for an absent file system (and note that this will include some
normally REQUIRED attributes) will also cause an NFS4ERR_MOVED
result.

11.4.2.  READDIR and Absent File Systems

A READDIR performed when the current filehandle is within an absent
file system will result in an NFS4ERR_MOVED error, since, unlike the
case of GETATTR, no such exception is made for READDIR.

Attributes for an absent file system may be fetched via a READDIR for
a directory in a present file system, when that directory contains
the root directories of one or more absent file systems.
In this case, the handling is as follows:

o  If the attribute set requested includes one of the attributes
   fs_locations, fs_locations_info, or fs_status, then fetching of
   attributes proceeds normally and no NFS4ERR_MOVED indication is
   returned, even when the rdattr_error attribute is requested.

o  If the attribute set requested does not include one of the
   attributes fs_locations, fs_locations_info, or fs_status, then if
   the rdattr_error attribute is requested, each directory entry for
   the root of an absent file system will report NFS4ERR_MOVED as the
   value of the rdattr_error attribute.

o  If the attribute set requested does not include any of the
   attributes fs_locations, fs_locations_info, fs_status, or
   rdattr_error, then the occurrence of the root of an absent file
   system within the directory will result in the READDIR failing
   with an NFS4ERR_MOVED error.

o  The unavailability of an attribute because of a file system's
   absence, even one that is ordinarily REQUIRED, does not result in
   any error indication.  The set of attributes returned for the root
   directory of the absent file system in that case is simply
   restricted to those actually available.

11.5.  Uses of File System Location Information

The file system location attributes (i.e., fs_locations and
fs_locations_info), together with the possibility of absent file
systems, provide a number of important facilities in providing
reliable, manageable, and scalable data access.

When a file system is present, these attributes can provide:

o  The locations of alternative replicas, to be used to access the
   same data in the event of server failures, communications
   problems, or other difficulties that make continued access to the
   current replica impossible or otherwise impractical.
   Provision and use of such alternate replicas is referred to as
   "replication" and is discussed in Section 11.5.4 below.

o  The network address(es) to be used to access the current file
   system instance or replicas of it.  Client use of this information
   is discussed in Section 11.5.2 below.

Under some circumstances, multiple replicas may be used
simultaneously to provide higher-performance access to the file
system in question, although the lack of state sharing between
servers may be an impediment to such use.

When a file system is present and becomes absent, clients can be
given the opportunity to have continued access to their data, using a
different replica.  In this case, a continued attempt to use the data
in the now-absent file system will result in an NFS4ERR_MOVED error
and, at that point, the successor replica or set of possible replica
choices can be fetched and used to continue access.  Transfer of
access to the new replica location is referred to as "migration", and
is discussed in Section 11.5.5 below.

Where a file system had been absent, specification of file system
location provides a means by which file systems located on one server
can be associated with a namespace defined by another server, thus
allowing a general multi-server namespace facility.  A designation of
such a remote instance, in place of a file system never previously
present, is called a "pure referral" and is discussed in
Section 11.5.6 below.

Because client support for attributes related to file system location
is OPTIONAL, a server may choose to take action to hide migration and
referral events from such clients, by acting as a proxy, for example.
The server can determine the presence of client support from the
arguments of the EXCHANGE_ID operation (see Section 18.35.3).

11.5.1.  Combining Multiple Uses in a Single Attribute

A file system location attribute will sometimes contain information
relating to the location of multiple replicas, which may be used in
different ways:

o  File system location entries that relate to the file system
   instance currently in use provide trunking information, allowing
   the client to find additional network addresses by which the
   instance may be accessed.

o  File system location entries that provide information about
   replicas to which access is to be transferred.

o  Other file system location entries that relate to replicas that
   are available to use in the event that access to the current
   replica becomes unsatisfactory.

In order to simplify client handling and allow the best choice of
replicas to access, the server should adhere to the following
guidelines:

o  All file system location entries that relate to a single file
   system instance should be adjacent.

o  File system location entries that relate to the instance currently
   in use should appear first.

o  File system location entries that relate to replica(s) to which
   migration is occurring should appear before replicas that are
   available for later use if the current replica should become
   inaccessible.

11.5.2.  File System Location Attributes and Trunking

Trunking is the use of multiple connections between a client and
server in order to increase the speed of data transfer.  A client may
determine the set of network addresses to use to access a given file
system in a number of ways:

o  When the name of the server is known to the client, it may use DNS
   to obtain a set of network addresses to use in accessing the
   server.

o  The client may fetch the file system location attribute for the
   file system.
   This will provide either the name of the server (which can be
   turned into a set of network addresses using DNS) or a set of
   server-trunkable location entries.  Using the latter alternative,
   the server can provide addresses it regards as desirable to use to
   access the file system in question.

It should be noted that the client, when it fetches a location
attribute for a file system, may encounter multiple entries for a
number of reasons, so that, when determining trunking information, it
may have to bypass addresses not trunkable with one already known.

The server can provide location entries that include either names or
network addresses.  It might use the latter form because of DNS-
related security concerns or because the set of addresses to be used
might require active management by the server.

Location entries used to discover candidate addresses for use in
trunking are subject to change, as discussed in Section 11.5.7 below.
The client may respond to such changes by using additional addresses
once they are verified or by ceasing to use existing ones.  The
server can force the client to cease using an address by returning
NFS4ERR_MOVED when that address is used to access a file system.
This allows a transfer of client access that is similar to migration,
although the same file system instance is accessed throughout.

11.5.3.  File System Location Attributes and Connection Type Selection

Because of the need to support multiple types of connections, clients
face the issue of determining the proper connection type to use when
establishing a connection to a given server network address.  In some
cases, this issue can be addressed through the use of the connection
"step-up" facility described in Section 18.36.
However, because there are cases in which that facility is not
available, the client may have to choose a connection type with no
possibility of changing it within the scope of a single connection.

The two file system location attributes differ as to the information
made available in this regard.  Fs_locations provides no information
to support connection type selection.  As a result, clients
supporting multiple connection types would need to attempt to
establish connections using multiple connection types until the one
preferred by the client is successfully established.

Fs_locations_info includes a flag, FSLI4TF_RDMA, which, when set,
indicates that RPC-over-RDMA support is available using the specified
location entry, by "stepping up" an existing TCP connection to
include support for RDMA operation.  This flag makes it convenient
for a client wishing to use RDMA.  When this flag is set, the client
can establish a TCP connection and then convert that connection to
use RDMA by using the step-up facility.

Irrespective of the particular attribute used, when there is no
indication that a step-up operation can be performed, a client
supporting RDMA operation can establish a new RDMA connection, and it
can be bound to the session already established by the TCP
connection, allowing the TCP connection to be dropped and the session
converted to further use in RDMA mode.

11.5.4.  File System Replication

The fs_locations and fs_locations_info attributes provide alternative
file system locations, to be used to access data in place of or in
addition to the current file system instance.  On first access to a
file system, the client should obtain the set of alternate locations
by interrogating the fs_locations or fs_locations_info attribute,
with the latter being preferred.
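On failure of the current instance, a client might iterate over the
alternate locations in preference order.  That behavior can be
sketched as follows (illustrative only; access_with_failover and the
try_access hook are hypothetical names, not part of any client
implementation defined here):

```python
# Sketch: try each replica in the order provided (per the guidelines
# of Section 11.5.1) until one of them can be accessed.
class ReplicaUnavailable(Exception):
    """Raised when a replica cannot be reached or used."""

def access_with_failover(replicas, try_access):
    """replicas: location elements in preference order;
    try_access(replica): assumed hook that returns a result or
    raises ReplicaUnavailable."""
    for replica in replicas:
        try:
            return try_access(replica)
        except ReplicaUnavailable:
            continue
    raise ReplicaUnavailable("no replica reachable")
```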
In the event that server failures, communications problems, or other
difficulties make continued access to the current file system
impossible or otherwise impractical, the client can use the alternate
locations as a way to get continued access to its data.

The alternate locations may be physical replicas of the (typically
read-only) file system data, or they may provide for the use of
various forms of server clustering in which multiple servers provide
alternate ways of accessing the same physical file system.  How these
different modes of file system transition are represented within the
fs_locations and fs_locations_info attributes and how the client
deals with file system transition issues will be discussed in detail
below.

11.5.5.  File System Migration

When a file system is present and becomes absent, the NFSv4.1
protocol provides a means by which clients can be given the
opportunity to have continued access to their data, using a different
replica.  The location of this replica is specified by a file system
location attribute.  The ensuing migration of access to another
replica includes the ability to retain locks across the transition,
either by using lock reclaim or by taking advantage of Transparent
State Migration.

Typically, a client will be accessing the file system in question,
get an NFS4ERR_MOVED error, and then use a file system location
attribute to determine the new location of the data.  When
fs_locations_info is used, additional information will be available
that will define the nature of the client's handling of the
transition to a new server.

Such migration can be helpful in providing load balancing or general
resource reallocation.  The protocol does not specify how the file
system will be moved between servers.
It is anticipated that a number of different server-to-server transfer
mechanisms might be used, with the choice left to the server
implementer.  The NFSv4.1 protocol specifies the method used to
communicate the migration event between client and server.

The new location may be, in the case of various forms of server
clustering, another server providing access to the same physical file
system.  The client's responsibilities in dealing with this
transition will depend on whether migration has occurred and the
means the server has chosen to provide continuity of locking state.
These issues will be discussed in detail below.

Although a single successor location is typical, multiple locations
may be provided.  When multiple locations are provided, the client
will typically use the first one provided.  If that is inaccessible
for some reason, later ones can be used.  In such cases, the client
might consider the transition to the new replica to be a migration
event, even though some of the servers involved might not be aware of
the use of the server that was inaccessible.  In such a case, a
client might lose access to locking state as a result of the access
transfer.

When an alternate location is designated as the target for migration,
it must designate the same data (with metadata being the same to the
degree indicated by the fs_locations_info attribute).  Where file
systems are writable, a change made on the original file system must
be visible on all migration targets.  Where a file system is not
writable but represents a read-only copy (possibly periodically
updated) of a writable file system, similar requirements apply to the
propagation of updates.
Any change visible in the original file system must already be
effected on all migration targets, to avoid any possibility that a
client, in effecting a transition to the migration target, will see
any reversion in file system state.

11.5.6.  Referrals

Referrals allow the server to associate a file system namespace entry
located on one server with a file system located on another server.
When this includes the use of pure referrals, servers are provided a
way of placing a file system in a location within the namespace
essentially without respect to its physical location on a particular
server.  This allows a single server or a set of servers to present a
multi-server namespace that encompasses file systems located on a
wider range of servers.  Some likely uses of this facility include
establishment of site-wide or organization-wide namespaces, with the
eventual possibility of combining such namespaces into a truly global
namespace, such as the one provided by AFS (the Andrew File System)
[TBD: appropriate reference needed].

Referrals occur when a client determines, upon first referencing a
position in the current namespace, that it is part of a new file
system and that the file system is absent.  When this occurs,
typically upon receiving the error NFS4ERR_MOVED, the actual location
or locations of the file system can be determined by fetching a
locations attribute.

The file system location attribute may designate a single file system
location or multiple file system locations, to be selected based on
the needs of the client.  The server, in the fs_locations_info
attribute, may specify priorities to be associated with various file
system location choices.
The server may assign different priorities to different locations as
reported to individual clients, in order to adapt to client physical
location or to effect load balancing.  When both read-only and
read-write file systems are present, some of the read-only locations
might not be absolutely up-to-date (as they would have to be in the
case of replication and migration).  Servers may also specify file
system locations that include client-substituted variables so that
different clients are referred to different file systems (with
different data contents) based on client attributes such as CPU
architecture.

When the fs_locations_info attribute is such that there are multiple
possible targets listed, the relationships among them may be
important to the client in selecting which one to use.  The same
rules specified in Section 11.5.5 above regarding multiple migration
targets apply to these multiple replicas as well.  For example, the
client might prefer a writable target on a server that has additional
writable replicas to which it subsequently might switch.  Note that,
as distinguished from the case of replication, there is no need to
deal with the case of propagation of updates made by the current
client, since the current client has not accessed the file system in
question.

Use of multi-server namespaces is enabled by NFSv4.1 but is not
required.  The use of multi-server namespaces and their scope will
depend on the applications used and system administration
preferences.

Multi-server namespaces can be established by a single server
providing a large set of pure referrals to all of the included file
systems.
Alternatively, a single multi-server namespace may be administratively
segmented with separate referral file systems (on separate servers)
for each separately administered portion of the namespace.  The
top-level referral file system or any segment may use replicated
referral file systems for higher availability.

Generally, multi-server namespaces are for the most part uniform, in
that the same data made available to one client at a given location
in the namespace is made available to all clients at that namespace
location.  However, there are facilities provided that allow
different clients to be directed to different sets of data, for
reasons such as enabling adaptation to such client characteristics as
CPU architecture.  These facilities are described in Section 11.16.3.

Note that it is possible, when providing a uniform namespace, to
provide different location entries to different clients, in order to
provide each client with a copy of the data physically closest to it,
or otherwise optimize access (e.g., provide load balancing).

11.5.7.  Changes in a File System Location Attribute

Although clients will typically fetch a file system location
attribute when first accessing a file system and when NFS4ERR_MOVED
is returned, a client can choose to fetch the attribute periodically,
in which case the value fetched may change over time.

For clients not prepared to access multiple replicas simultaneously
(see Section 11.10.1), the handling of the various cases of location
change is as follows:

o  Changes in the list of replicas or in the network addresses
   associated with replicas do not require immediate action.  The
   client will typically update its list of replicas to reflect the
   new information.
o  Additions to the list of network addresses for the current file
   system instance need not be acted on promptly.  However, to
   prepare for the case in which a migration event occurs
   subsequently, the client can choose to take note of the new
   address and then use it whenever it needs to switch access to a
   new replica.

o  Deletions from the list of network addresses for the current file
   system instance need not be acted on immediately, although the
   client might need to be prepared for a shift in access whenever
   the server indicates that a network access path is not usable to
   access the current file system, by returning NFS4ERR_MOVED.

For clients that are prepared to access several replicas
simultaneously, the following additional cases need to be addressed.
As in the cases discussed above, changes in the set of replicas need
not be acted upon promptly, although the client has the option of
adjusting its access even in the absence of difficulties that would
lead to a new replica being selected.

o  When a new replica is added that may be accessed simultaneously
   with one currently in use, the client is free to use the new
   replica immediately.

o  When a replica currently in use is deleted from the list, the
   client need not cease using it immediately.  However, since the
   server may subsequently force such use to cease (by returning
   NFS4ERR_MOVED), clients might decide to limit the need for later
   state transfer.  For example, new opens might be done on other
   replicas, rather than on one not present in the list.

11.6.  Users and Groups in a Multi-server Namespace

As in the case of a single-server environment (see Section 5.9), when
an owner or group name of the form "id@domain" is assigned to a file,
there is an implicit promise to return that same string when the
corresponding attribute is interrogated subsequently.  In the case of
a multi-server namespace, that same promise applies even if server
boundaries have been crossed.  Similarly, when the owner attribute of
a file is derived from the security principal that created the file,
that attribute should have the same value even if the interrogation
occurs on a different server from the file creation.

Similarly, the set of security principals recognized by all the
participating servers needs to be the same, with each such principal
having the same credentials, regardless of the particular server
being accessed.

In order to meet these requirements, those setting up multi-server
namespaces will need to limit the servers included so that:

o  In all cases in which more than a single domain is supported, the
   requirements stated in RFC 8000 [30] are to be respected.

o  All servers support a common set of domains, which includes all of
   the domains clients use and expect to see returned as the domain
   portion of an owner or group in the form "id@domain".  Note that
   although this set most often consists of a single domain, it is
   possible for multiple domains to be supported.

o  All servers, for each domain that they support, accept the same
   set of user and group ids as valid.

o  All servers recognize the same set of security principals.  For
   each principal, the same credentials are required, independent of
   the server being accessed.  In addition, the group membership for
   each such principal is to be the same, independent of the server
   accessed.
11444 Note that there is no requirement that the users corresponding to 11445 particular security principals have the same local representation on 11446 each server, even though it is most often the case that this is so. 11448 When AUTH_SYS is used, with or without the use of stringified owners 11449 and groups, the following additional requirements must be met: 11451 o Only a single NFSv4 domain can be supported. 11453 o The "local" representation of all owners and groups must be the 11454 same on all servers. The word "local" is used here since that is 11455 the way that numeric user and group ids are described in 11456 Section 5.9. However, when AUTH_SYS or stringified owners or 11457 groups are used, these identifiers are not truly local, since they 11458 are known to the clients as well as the server. 11460 11.7. Additional Client-Side Considerations 11462 When clients make use of servers that implement referrals, 11463 replication, and migration, care should be taken that a user who 11464 mounts a given file system that includes a referral or a relocated 11465 file system continues to see a coherent picture of that user-side 11466 file system despite the fact that it contains a number of server-side 11467 file systems that may be on different servers. 11469 One important issue is upward navigation from the root of a server- 11470 side file system to its parent (specified as ".." in UNIX), in the 11471 case in which it transitions to that file system as a result of 11472 referral, migration, or a transition as a result of replication. 11473 When the client is at such a point, and it needs to ascend to the 11474 parent, it must go back to the parent as seen within the multi-server 11475 namespace rather than sending a LOOKUPP operation to the server, 11476 which would result in the parent within that server's single-server 11477 namespace.
In order to do this, the client needs to remember the 11478 filehandles that represent such file system roots and use these 11479 instead of sending a LOOKUPP operation to the current server. This 11480 will allow the client to present to applications a consistent 11481 namespace, where upward and downward navigation are 11482 consistent. 11484 Another issue concerns refresh of referral locations. When referrals 11485 are used extensively, they may change as server configurations 11486 change. It is expected that clients will cache information related 11487 to traversing referrals so that future client-side requests are 11488 resolved locally without server communication. This is usually 11489 rooted in client-side name lookup caching. Clients should 11490 periodically purge this data for referral points in order to detect 11491 changes in location information. When the change_policy attribute 11492 changes for directories that hold referral entries or for the 11493 referral entries themselves, clients should consider any associated 11494 cached referral information to be out of date. 11496 11.8. Overview of File Access Transitions 11498 File access transitions are of two types: 11500 o Those that involve a transition from accessing the current replica 11501 to another one in connection with either replication or migration. 11502 How these are dealt with is discussed in Section 11.10. 11504 o Those in which access to the current file system instance is 11505 retained, while the network path used to access that instance is 11506 changed. This case is discussed in Section 11.9. 11508 11.9. Effecting Network Endpoint Transitions 11510 The endpoints used to access a particular file system instance may 11511 change in a number of ways, as listed below. In each of these cases, 11512 the same fsid, filehandles, stateids, client IDs and sessions are 11513 used to continue access, with a continuity of lock state.
11515 o When use of a particular address is to cease and there is also one 11516 currently in use which is server-trunkable with it, requests that 11517 would have been issued on the address whose use is to be 11518 discontinued can be issued on the remaining address(es). When an 11519 address is not a session-trunkable one, the request might need to 11520 be modified to reflect the fact that a different session will be 11521 used. 11523 o When use of a particular connection is to cease, as indicated by 11524 receiving NFS4ERR_MOVED when using that connection but that 11525 address is still indicated as accessible according to the 11526 appropriate file system location entries, it is likely that 11527 requests can be issued on a new connection of a different 11528 connection type, once that connection is established. Since any 11529 two server endpoints that share a network address are inherently 11530 session-trunkable, the client can use BIND_CONN_TO_SESSION to 11531 access the existing session using the new connection and proceed 11532 to access the file system using the new connection. 11534 o When there are no potential replacement addresses in use but there 11535 are valid addresses session-trunkable with the one whose use is to 11536 be discontinued, the client can use BIND_CONN_TO_SESSION to access 11537 the existing session using the new address. Although the target 11538 session will generally be accessible, there may be cases in which 11539 that session is no longer accessible. In this case, the client 11540 can create a new session to enable continued access to the 11541 existing instance and provide for use of existing filehandles, 11542 stateids, and client ids while providing continuity of locking 11543 state. 
11545 o When there is no potential replacement address in use and there 11546 are no valid addresses session-trunkable with the one whose use is 11547 to be discontinued, other server-trunkable addresses may be used 11548 to provide continued access. Although use of CREATE_SESSION is 11549 available to provide continued access to the existing instance, 11550 servers have the option of providing continued access to the 11551 existing session through the new network access path in a fashion 11552 similar to that provided by session migration (see Section 11.11). 11553 To take advantage of this possibility, clients can perform an 11554 initial BIND_CONN_TO_SESSION, as in the previous case, and use 11555 CREATE_SESSION only if that fails. 11557 11.10. Effecting File System Transitions 11559 There is a range of situations in which a change is to be 11560 effected in the set of replicas used to access a particular file 11561 system. Some of these may involve an expansion or contraction of the 11562 set of replicas used as discussed in Section 11.10.1 below. 11564 For reasons explained in that section, most transitions will involve 11565 a transition from a single replica to a corresponding replacement 11566 replica. When effecting a replica transition, some types of sharing 11567 between the replicas may affect handling of the transition as 11568 described in Sections 11.10.2 through 11.10.8 below. The attribute 11569 fs_locations_info provides helpful information to allow the client to 11570 determine the degree of inter-replica sharing. 11572 With regard to some types of state, the degree of continuity across 11573 the transition depends on the occasion prompting the transition, with 11574 transitions initiated by the servers (i.e. migration) offering much 11575 more scope for a non-disruptive transition than cases in which the 11576 client on its own shifts its access to another replica (i.e. 11577 replication).
This issue potentially applies to locking state and to 11578 session state, which are dealt with below as follows: 11580 o An introduction to the possible means of providing continuity in 11581 these areas appears in Section 11.10.9 below. 11583 o Transparent State Migration is introduced in Section 11.11. The 11584 possible transfer of session state is addressed there as well. 11586 o The client handling of transitions, including determining how to 11587 deal with the various means that the server might take to supply 11588 effective continuity of locking state, is discussed in 11589 Section 11.12. 11591 o The servers' (source and destination) responsibilities in 11592 effecting Transparent Migration of locking and session state are 11593 discussed in Section 11.13. 11595 11.10.1. File System Transitions and Simultaneous Access 11597 The fs_locations_info attribute (described in Section 11.16) may 11598 indicate that two replicas may be used simultaneously (see 11599 Section 11.7.2.1 of RFC5661 [60] for details). Although situations 11600 in which multiple replicas may be accessed simultaneously are 11601 somewhat similar to those in which a single replica is accessed by 11602 multiple network addresses, there are important differences, since 11603 locking state is not shared among multiple replicas. 11605 Because of this difference in state handling, many clients will not 11606 have the ability to take advantage of the fact that such replicas 11607 represent the same data. Such clients will not be prepared to use 11608 multiple replicas simultaneously but will access each file system 11609 using only a single replica, although the replica selected might make 11610 multiple server-trunkable addresses available. 11612 Clients that are prepared to use multiple replicas simultaneously will 11613 divide opens among replicas however they choose.
Once that choice is 11614 made, any subsequent transitions will treat the set of locking state 11615 associated with each replica as a single entity. 11617 For example, if one of the replicas becomes unavailable, access will 11618 be transferred to a different replica, also capable of simultaneous 11619 access with the one still in use. 11621 When there is no such replica, the transition may be to the replica 11622 already in use. At this point, the client has a choice between 11623 merging the locking state for the two replicas under the aegis of the 11624 sole replica in use or treating these separately, until another 11625 replica capable of simultaneous access presents itself. 11627 11.10.2. Filehandles and File System Transitions 11629 There are a number of ways in which filehandles can be handled across 11630 a file system transition. These can be divided into two broad 11631 classes depending upon whether the two file systems across which the 11632 transition happens share sufficient state to effect some sort of 11633 continuity of file system handling. 11635 When there is no such cooperation in filehandle assignment, the two 11636 file systems are reported as being in different handle classes. In 11637 this case, all filehandles are assumed to expire as part of the file 11638 system transition. Note that this behavior does not depend on the 11639 fh_expire_type attribute and supersedes the specification of the 11640 FH4_VOL_MIGRATION bit, which only affects behavior when 11641 fs_locations_info is not available. 11643 When there is cooperation in filehandle assignment, the two file 11644 systems are reported as being in the same handle class. In this 11645 case, persistent filehandles remain valid after the file system 11646 transition, while volatile filehandles (excluding those that are only 11647 volatile due to the FH4_VOL_MIGRATION bit) are subject to expiration 11648 on the target server. 11650 11.10.3.
Fileids and File System Transitions 11652 In NFSv4.0, the issue of continuity of fileids in the event of a file 11653 system transition was not addressed. The general expectation had 11654 been that in situations in which the two file system instances are 11655 created by a single vendor using some sort of file system image copy, 11656 fileids would be consistent across the transition, while in the 11657 analogous multi-vendor transitions they would not. This poses 11658 difficulties, especially for the client without special knowledge of 11659 the transition mechanisms adopted by the server. Note that although 11660 fileid is not a REQUIRED attribute, many servers support fileids and 11661 many clients provide APIs that depend on fileids. 11663 It is important to note that while clients themselves may have no 11664 trouble with a fileid changing as a result of a file system 11665 transition event, applications do typically have access to the fileid 11666 (e.g., via stat). The result is that an application may work 11667 perfectly well if there is no file system instance transition or if 11668 any such transition is among instances created by a single vendor, 11669 yet be unable to deal with the situation in which a multi-vendor 11670 transition occurs at the wrong time. 11672 Providing the same fileids in a multi-vendor (multiple server 11673 vendors) environment has generally been held to be quite difficult. 11674 While there is work to be done, it needs to be pointed out that this 11675 difficulty is partly self-imposed. Servers have typically identified 11676 fileid with inode number, i.e. with a quantity used to find the file 11677 in question. This identification poses special difficulties for 11678 migration of a file system between vendors where assigning the same 11679 index to a given file may not be possible. Note here that a fileid 11680 is not required to be useful to find the file in question, only that 11681 it is unique within the given file system. 
Servers prepared to 11682 accept a fileid as a single piece of metadata and store it apart from 11683 the value used to index the file information can relatively easily 11684 maintain a fileid value across a migration event, allowing a truly 11685 transparent migration event. 11687 In any case, where servers can provide continuity of fileids, they 11688 should, and the client should be able to find out that such 11689 continuity is available and take appropriate action. Information 11690 about the continuity (or lack thereof) of fileids across a file 11691 system transition is represented by specifying whether the file 11692 systems in question are of the same fileid class. 11694 Note that when consistent fileids do not exist across a transition 11695 (either because there is no continuity of fileids or because fileid 11696 is not a supported attribute on one of the instances involved), and there 11697 are no reliable filehandles across a transition event (either because 11698 there is no filehandle continuity or because the filehandles are 11699 volatile), the client is in a position where it cannot verify that 11700 files it was accessing before the transition are the same objects. 11701 It is forced to assume that no object has been renamed, and, unless 11702 there are guarantees that provide this (e.g., the file system is 11703 read-only), problems for applications may occur. Therefore, use of 11704 such configurations should be limited to situations where the 11705 problems that this may cause can be tolerated. 11707 11.10.4. Fsids and File System Transitions 11709 Since fsids are generally only unique on a per-server basis, it is 11710 likely that they will change during a file system transition. 11711 Clients should not make the fsids received from the server visible to 11712 applications since they may not be globally unique, and because they 11713 may change during a file system transition event.
Applications are 11714 best served if they are isolated from such transitions to the extent 11715 possible. 11717 Although normally a single source file system will transition to a 11718 single target file system, there is a provision for splitting a 11719 single source file system into multiple target file systems, by 11720 specifying the FSLI4F_MULTI_FS flag. 11722 11.10.4.1. File System Splitting 11724 When a file system transition is made and the fs_locations_info 11725 indicates that the file system in question might be split into 11726 multiple file systems (via the FSLI4F_MULTI_FS flag), the client 11727 SHOULD do GETATTRs to determine the fsid attribute on all known 11728 objects within the file system undergoing transition to determine the 11729 new file system boundaries. 11731 Clients might choose to maintain the fsids passed to existing 11732 applications by mapping all of the fsids for the descendant file 11733 systems to the common fsid used for the original file system. 11735 Splitting a file system can be done on a transition between file 11736 systems of the same fileid class, since the fact that fileids are 11737 unique within the source file system ensures that they will be unique in 11738 each of the target file systems. 11740 11.10.5. The Change Attribute and File System Transitions 11742 Since the change attribute is defined as a server-specific one, 11743 change attributes fetched from one server are normally presumed to be 11744 invalid on another server. Such a presumption is troublesome since 11745 it would invalidate all cached change attributes, requiring 11746 refetching. Even more disruptive, the absence of any assured 11747 continuity for the change attribute means that even if the same value 11748 is retrieved on refetch, no conclusions can be drawn as to whether 11749 the object in question has changed.
The identical change attribute 11750 could be merely an artifact of a modified file with a different 11751 change attribute construction algorithm, with that new algorithm just 11752 happening to result in an identical change value. 11754 When the two file systems have consistent change attribute formats, 11755 and this fact is communicated to the client by reporting in the same 11756 change class, the client may assume a continuity of change attribute 11757 construction and handle this situation just as it would be handled 11758 without any file system transition. 11760 11.10.6. Write Verifiers and File System Transitions 11762 In a file system transition, the two file systems might be clustered 11763 in the handling of unstably written data. When this is the case, and 11764 the two file systems belong to the same write-verifier class, write 11765 verifiers returned from one system may be compared to those returned 11766 by the other and superfluous writes avoided. 11768 When two file systems belong to different write-verifier classes, any 11769 verifier generated by one must not be compared to one provided by the 11770 other. Instead, the two verifiers should be treated as not equal 11771 even when the values are identical. 11773 11.10.7. Readdir Cookies and Verifiers and File System Transitions 11775 In a file system transition, the two file systems might be consistent 11776 in their handling of READDIR cookies and verifiers. When this is the 11777 case, and the two file systems belong to the same readdir class, 11778 READDIR cookies and verifiers from one system may be recognized by 11779 the other and READDIR operations started on one server may be validly 11780 continued on the other, simply by presenting the cookie and verifier 11781 returned by a READDIR operation done on the first file system to the 11782 second. 
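The readdir-class rule just described can be sketched as client-side bookkeeping. The following Python fragment is purely illustrative; the ReplicaInfo and ReaddirState types and their fields are assumptions of this example, not protocol elements:

```python
# Illustrative sketch: deciding whether a cached READDIR cookie/verifier
# obtained from one replica may be presented to another replica.
from dataclasses import dataclass

@dataclass
class ReplicaInfo:
    name: str
    readdir_class: int  # readdir class as reported via fs_locations_info

@dataclass
class ReaddirState:
    cookie: int
    verifier: bytes
    source: ReplicaInfo

def reusable_after_transition(state: ReaddirState, target: ReplicaInfo) -> bool:
    # A cookie/verifier pair remains valid on the target only when both
    # replicas report the same readdir class; otherwise the client must
    # restart the enumeration rather than present the stale cookie.
    return state.source.readdir_class == target.readdir_class

old = ReplicaInfo("serverA", readdir_class=7)
new_same = ReplicaInfo("serverB", readdir_class=7)
new_diff = ReplicaInfo("serverC", readdir_class=9)
st = ReaddirState(cookie=42, verifier=b"\x01" * 8, source=old)
assert reusable_after_transition(st, new_same)
assert not reusable_after_transition(st, new_diff)
```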
11784 When two file systems belong to different readdir classes, any 11785 READDIR cookie and verifier generated by one is not valid on the 11786 second, and must not be presented to that server by the client. The 11787 client should act as if the verifier were rejected. 11789 11.10.8. File System Data and File System Transitions 11791 When multiple replicas exist and are used simultaneously or in 11792 succession by a client, applications using them will normally expect 11793 that they contain either the same data or data that is consistent 11794 with the normal sorts of changes that are made by other clients 11795 updating the data of the file system (with metadata being the same to 11796 the degree indicated by the fs_locations_info attribute). However, 11797 when multiple file systems are presented as replicas of one another, 11798 the precise relationship between the data of one and the data of 11799 another is not, as a general matter, specified by the NFSv4.1 11800 protocol. It is quite possible to present as replicas file systems 11801 where the data of those file systems is sufficiently different that 11802 some applications have problems dealing with the transition between 11803 replicas. The namespace will typically be constructed so that 11804 applications can choose an appropriate level of support, so that in 11805 one position in the namespace a varied set of replicas will be 11806 listed, while in another only those that are up-to-date may be 11807 considered replicas. The protocol does define three special cases of 11808 the relationship among replicas to be specified by the server and 11809 relied upon by clients: 11811 o When multiple replicas exist and are used simultaneously by a 11812 client (see the FSLI4BX_CLSIMUL definition within 11813 fs_locations_info), they must designate the same data.
Where file 11814 systems are writable, a change made on one instance must be 11815 visible on all instances, immediately upon the earlier of the 11816 return of the modifying requester or the visibility of that change 11817 on any of the associated replicas. This allows a client to use 11818 these replicas simultaneously without any special adaptation to 11819 the fact that there are multiple replicas, beyond adapting to the 11820 fact that locks obtained on one replica are maintained separately 11821 (i.e. under a different client ID). In this case, locks (whether 11822 share reservations or byte-range locks) and delegations obtained 11823 on one replica are immediately reflected on all replicas, in the 11824 sense that access from all other servers is prevented regardless 11825 of the replica used. However, because the servers are not 11826 required to treat two associated client IDs as representing the 11827 same client, it is best to access each file using only a single 11828 client ID. 11830 o When one replica is designated as the successor instance to 11831 another existing instance after the return of NFS4ERR_MOVED (i.e., the 11832 case of migration), the client may depend on the fact that all 11833 changes written to stable storage on the original instance are 11834 written to stable storage of the successor (uncommitted writes are 11835 dealt with in Section 11.10.6 above). 11837 o Where a file system is not writable but represents a read-only 11838 copy (possibly periodically updated) of a writable file system, 11839 clients have similar requirements with regard to the propagation 11840 of updates. They may need a guarantee that any change visible on 11841 the original file system instance must be immediately visible on 11842 any replica before the client transitions access to that replica, 11843 in order to avoid any possibility that a client, in effecting a 11844 transition to a replica, will see any reversion in file system 11845 state.
The specific means of this guarantee varies based on the 11846 value of the fss_type field that is reported as part of the 11847 fs_status attribute (see Section 11.17). Since these file systems 11848 are presumed to be unsuitable for simultaneous use, there is no 11849 specification of how locking is handled; in general, locks 11850 obtained on one file system will be separate from those on others. 11851 Since these are expected to be read-only file systems, this is not 11852 likely to pose an issue for clients or applications. 11854 11.10.9. Lock State and File System Transitions 11856 While accessing a file system, clients obtain locks enforced by the 11857 server which may prevent actions by other clients that are 11858 inconsistent with those locks. 11860 When access is transferred between replicas, clients need to be 11861 assured that the actions disallowed by holding these locks cannot 11862 have occurred during the transition. This can be ensured by the 11863 methods below. Unless at least one of these is implemented, clients 11864 will not be assured of continuity of lock possession across a 11865 migration event. 11867 o Providing the client an opportunity to re-obtain its locks via a 11868 per-fs grace period on the destination server. Because the lock 11869 reclaim mechanism was originally defined to support server reboot, 11870 it implicitly assumes that the filehandles presented on reclaim will be 11871 the same as those at open. In the case of migration, this 11872 requires that source and destination servers use the same 11873 filehandles, as evidenced by using the same server scope (see 11874 Section 2.10.4) or by showing this agreement using 11875 fs_locations_info (see Section 11.10.2 above). 11877 o Locking state can be transferred as part of the transition by 11878 providing Transparent State Migration as described in 11879 Section 11.11.
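A client's choice among these lock-recovery paths might be sketched as follows. The function and its inputs are hypothetical illustrations of the decision described in the text, not protocol elements:

```python
# Illustrative sketch of client-side lock recovery after a migration event.
def lock_recovery_method(state_migrated_transparently: bool,
                         filehandles_preserved: bool,
                         per_fs_grace_offered: bool) -> str:
    if state_migrated_transparently:
        # Locks arrived via Transparent State Migration: no reclaim delay.
        return "use-transferred-state"
    if per_fs_grace_offered and filehandles_preserved:
        # Reclaim is only workable when the filehandles presented on
        # reclaim match those originally opened (same server scope or
        # same handle class).
        return "reclaim-under-grace"
    # Neither facility applies: locking state is lost and locks must be
    # re-obtained as new (possibly conflicting) requests.
    return "locks-lost"

assert lock_recovery_method(True, False, False) == "use-transferred-state"
assert lock_recovery_method(False, True, True) == "reclaim-under-grace"
assert lock_recovery_method(False, False, True) == "locks-lost"
```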
11881 Of these, Transparent State Migration provides the smoother 11882 experience for clients in that there is no grace-period-based delay 11883 before new locks can be obtained. However, it requires a greater 11884 degree of inter-server co-ordination. In general, the servers taking 11885 part in migration are free to provide either facility. However, when 11886 the filehandles can differ across the migration event, Transparent 11887 State Migration is the only available means of providing the needed 11888 functionality. 11890 It should be noted that these two methods are not mutually exclusive 11891 and that a server might well provide both. In particular, if there 11892 is some circumstance preventing a specific lock from being 11893 transferred transparently, the destination server can allow it to be 11894 reclaimed, by implementing a per-fs grace period for the migrated 11895 file system. 11897 11.10.9.1. Leases and File System Transitions 11899 In the case of lease renewal, the client may not be submitting 11900 requests for a file system that has been transferred to another 11901 server. This can occur because of the lease renewal mechanism. The 11902 client renews the lease associated with all file systems when 11903 submitting a request on an associated session, regardless of the 11904 specific file system being referenced. 11906 In order for the client to schedule renewal of its lease where there 11907 is locking state that may have been relocated to the new server, the 11908 client must find out about lease relocation before that lease 11909 expires. To accomplish this, the SEQUENCE operation will return the status bit 11910 SEQ4_STATUS_LEASE_MOVED if responsibility for any of the renewed 11911 locking state has been transferred to a new server. This will 11912 continue until the client receives an NFS4ERR_MOVED error for each of 11913 the file systems for which there has been locking state relocation.
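This discovery process can be simulated with a short sketch. The probe and flag callbacks below are hypothetical stand-ins for the client's per-file-system operation and the server's SEQ4_STATUS_LEASE_MOVED indication, not real protocol interfaces:

```python
# Illustrative simulation of migration discovery driven by
# SEQ4_STATUS_LEASE_MOVED.
def discover_migrated(filesystems, probe_fs, lease_moved):
    """Reference each file system with locking state until the server
    clears SEQ4_STATUS_LEASE_MOVED; collect those answering NFS4ERR_MOVED."""
    migrated = []
    for fs in filesystems:
        if not lease_moved():
            break  # indication cleared: no further moved state to find
        if probe_fs(fs) == "NFS4ERR_MOVED":
            migrated.append(fs)
    return migrated

# Simulated server: two of four file systems have migrated locking state.
moved = {"fs2", "fs4"}
probed = set()

def probe_fs(fs):
    probed.add(fs)
    return "NFS4ERR_MOVED" if fs in moved else "OK"

def lease_moved():
    # The indication stays set until every moved file system has been probed.
    return not moved <= probed

result = discover_migrated(["fs1", "fs2", "fs3", "fs4"], probe_fs, lease_moved)
assert set(result) == moved
```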
11915 When a client receives an SEQ4_STATUS_LEASE_MOVED indication from a 11916 server, for each file system of the server for which the client has 11917 locking state, the client should perform an operation directed at that 11918 file system. For simplicity, the client may choose to reference all file systems, but 11919 what is important is that it must reference all file systems for 11920 which there was locking state where that state has moved. Once the 11921 client receives an NFS4ERR_MOVED error for each such file system, the 11922 server will clear the SEQ4_STATUS_LEASE_MOVED indication. The client 11923 can terminate the process of checking file systems once this 11924 indication is cleared (but only if the client has received a reply 11925 for all outstanding SEQUENCE requests on all sessions it has with the 11926 server), since there are no others for which locking state has moved. 11928 A client may use GETATTR of the fs_status (or fs_locations_info) 11929 attribute on all of the file systems to get absence indications in a 11930 single (or a few) request(s), since absent file systems will not 11931 cause an error in this context. However, it still must do an 11932 operation that receives NFS4ERR_MOVED on each file system, in order 11933 to clear the SEQ4_STATUS_LEASE_MOVED indication. 11935 Once the set of file systems with transferred locking state has been 11936 determined, the client can follow the normal process to obtain the 11937 new server information (through the fs_locations and 11938 fs_locations_info attributes) and perform renewal of that lease on 11939 the new server, unless information in the fs_locations_info attribute 11940 shows that no state could have been transferred. If the server has 11941 not had state transferred to it transparently, the client will 11942 receive NFS4ERR_STALE_CLIENTID from the new server, as described 11943 above, and the client can then reclaim locks as is done in the event 11944 of server failure. 11946 11.10.9.2.
Transitions and the Lease_time Attribute 11948 In order that the client may appropriately manage its lease in the 11949 case of a file system transition, the destination server must 11950 establish proper values for the lease_time attribute. 11952 When state is transferred transparently, that state should include 11953 the correct value of the lease_time attribute. The lease_time 11954 attribute on the destination server must never be less than that on 11955 the source, since this would result in premature expiration of a 11956 lease granted by the source server. Upon transitions in which state 11957 is transferred transparently, the client is under no obligation to 11958 refetch the lease_time attribute and may continue to use the value 11959 previously fetched (on the source server). 11961 If state has not been transferred transparently, either because the 11962 associated servers are shown as having different eir_server_scope 11963 strings or because the client ID is rejected when presented to the 11964 new server, the client should fetch the value of lease_time on the 11965 new (i.e., destination) server, and use it for subsequent locking 11966 requests. However, the server must respect a grace period of at 11967 least as long as the lease_time on the source server, in order to 11968 ensure that clients have ample time to reclaim their locks before 11969 potentially conflicting non-reclaimed locks are granted. 11971 11.11. Transferring State upon Migration 11973 When the transition is a result of a server-initiated decision to 11974 transition access and the source and destination servers have 11975 implemented appropriate co-operation, it is possible to: 11977 o Transfer locking state from the source to the destination server, 11978 in a fashion similar to that provided by Transparent State 11979 Migration in NFSv4.0, as described in [63]. Server 11980 responsibilities are described in Section 11.13.2.
11982 o Transfer session state from the source to the destination server. 11983 Server responsibilities in effecting such a transfer are described 11984 in Section 11.13.3. 11986 The means by which the client determines which of these transfer 11987 events has occurred are described in Section 11.12. 11989 11.11.1. Transparent State Migration and pNFS 11991 When pNFS is involved, the protocol is capable of supporting: 11993 o Migration of the Metadata Server (MDS), leaving the Data Servers 11994 (DS's) in place. 11996 o Migration of the file system as a whole, including the MDS and 11997 associated DS's. 11999 o Replacement of one DS by another. 12001 o Migration of a pNFS file system to one in which pNFS is not used. 12003 o Migration of a file system not using pNFS to one in which layouts 12004 are available. 12006 Note that migration per se is only involved in the transfer of the 12007 MDS function. Although the servicing of a layout may be transferred 12008 from one data server to another, this is not done using the file system 12009 location attributes. The MDS can effect such transfers by recalling/ 12010 revoking existing layouts and granting new ones on a different data 12011 server. 12013 Migration of the MDS function is directly supported by Transparent 12014 State Migration. Layout state will normally be transparently 12015 transferred, just as other state is. As a result, Transparent State 12016 Migration provides a framework in which, given appropriate inter-MDS 12017 data transfer, one MDS can be substituted for another. 12019 Migration of the file system function as a whole can be accomplished 12020 by recalling all layouts as part of the initial phase of the 12021 migration process. As a result, IO will be done through the MDS 12022 during the migration process, and new layouts can be granted once the 12023 client is interacting with the new MDS.
An MDS can also effect this 12024 sort of transition by revoking all layouts as part of Transparent 12025 State Migration, as long as the client is notified about the loss of 12026 locking state. 12028 In order to allow migration to a file system on which pNFS is not 12029 supported, clients need to be prepared for a situation in which 12030 layouts are not available or supported on the destination file system 12031 and so need to direct IO requests to the destination server, rather than 12032 depending on layouts being available. 12034 Replacement of one DS by another is not addressed by migration as 12035 such but can be effected by an MDS recalling layouts for the DS to be 12036 replaced and issuing new ones to be served by the successor DS. 12038 Migration may transfer a file system from a server which does not 12039 support pNFS to one which does. In order to properly adapt to this 12040 situation, clients that support pNFS but function adequately in its 12041 absence should check for pNFS support when a file system is migrated 12042 and be prepared to use pNFS when support is available on the 12043 destination. 12045 11.12. Client Responsibilities when Access is Transitioned 12047 For a client to respond to an access transition, it must become aware 12048 of it. The ways in which this can happen are discussed in 12049 Section 11.12.1, which discusses indications that a specific file 12050 system access path has transitioned, as well as situations in which 12051 additional activity is necessary to determine the set of file systems 12052 that have been migrated. Section 11.12.2 goes on to complete the 12053 discussion of how the set of migrated file systems might be 12054 determined. Sections 11.12.3 through 11.12.5 discuss how the client 12055 should deal with each transition it becomes aware of, either directly 12056 or as a result of migration discovery.
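The pNFS adaptation described in Section 11.11.1, in which a client re-checks for layout support after a migration rather than assuming the destination matches the source, can be sketched as follows. This is an illustrative Python sketch; the function and its inputs (a list of advertised layout types) are hypothetical and not part of the protocol's XDR.

```python
# Hypothetical sketch: after a migration, a pNFS-capable client re-probes the
# destination file system's advertised layout types rather than assuming that
# layouts remain available (or unavailable) as they were on the source.

def choose_io_path(fs_layout_types, client_supports_pnfs):
    """Return "pnfs" when the destination advertises a layout type and the
    client can use pNFS; otherwise fall back to I/O through the server."""
    if client_supports_pnfs and fs_layout_types:
        return "pnfs"   # request layouts from the new MDS
    return "mds"        # direct all IO requests to the server itself

# A client migrating between pNFS and non-pNFS file systems switches paths:
assert choose_io_path([], True) == "mds"
assert choose_io_path(["LAYOUT4_NFSV4_1_FILES"], True) == "pnfs"
assert choose_io_path(["LAYOUT4_NFSV4_1_FILES"], False) == "mds"
```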
12058 The following terms are used to describe client activities: 12060 o "Transition recovery" refers to the process of restoring access to 12061 a file system on which NFS4ERR_MOVED was received. 12063 o "Migration recovery" refers to that subset of transition recovery which 12064 applies when the file system has migrated to a different replica. 12066 o "Migration discovery" refers to the process of determining which 12067 file system(s) have been migrated. It is necessary to avoid a 12068 situation in which leases could expire when a file system is not 12069 accessed for a long period of time, since a client unaware of the 12070 migration might be referencing an unmigrated file system and not 12071 renewing the lease associated with the migrated file system. 12073 11.12.1. Client Transition Notifications 12075 When there is a change in the network access path which a client is 12076 to use to access a file system, there are a number of related status 12077 indications with which clients need to deal: 12079 o If an attempt is made to use or return a filehandle within a file 12080 system that is no longer accessible at the address previously used 12081 to access it, the error NFS4ERR_MOVED is returned. 12083 Exceptions are made to allow such filehandles to be used when 12084 interrogating a file system location attribute. This enables a 12085 client to determine a new replica's location or a new network 12086 access path. 12088 This condition continues on subsequent attempts to access the file 12089 system in question. The only way the client can avoid the error 12090 is to cease accessing the file system in question at its old 12091 server location and access it instead using a different address at 12092 which it is now available.
12094 o Whenever a SEQUENCE operation is sent by a client to a server 12095 which generated state held on that client which is associated with 12096 a file system that is no longer accessible on the server at which 12097 it was previously available, the response will contain a lease-migrated indication, with the SEQ4_STATUS_LEASE_MOVED status bit 12099 being set. 12101 This condition continues until the client acknowledges the 12102 notification by fetching a file system location attribute for the 12103 file system whose network access path is being changed. When 12104 there are multiple such file systems, the location attribute for 12105 each such file system needs to be fetched in order to clear 12106 the condition. Even after the condition is cleared, the client 12107 needs to respond by using the location information to access the 12108 file system at its new location to ensure that leases are not 12109 needlessly expired. 12112 Unlike the case of NFSv4.0, in which the corresponding conditions are 12113 both errors and thus mutually exclusive, in NFSv4.1 the client can, 12114 and often will, receive both indications on the same request. As a 12115 result, implementations need to address the question of how to 12116 co-ordinate the necessary recovery actions when both indications arrive 12117 in the response to the same request. It should be noted that when 12118 processing an NFSv4 COMPOUND, the server will normally decide whether 12119 SEQ4_STATUS_LEASE_MOVED is to be set before it determines which file 12120 system will be referenced or whether NFS4ERR_MOVED is to be returned. 12122 Since these indications are not mutually exclusive in NFSv4.1, the 12123 following combinations are possible results when a COMPOUND is 12124 issued: 12126 o The COMPOUND status is NFS4ERR_MOVED and SEQ4_STATUS_LEASE_MOVED 12127 is asserted. 12129 In this case, transition recovery is required.
While it is 12130 possible that migration discovery is needed in addition, it is 12131 likely that only the accessed file system has transitioned. In 12132 any case, because addressing NFS4ERR_MOVED is necessary to allow 12133 the rejected requests to be processed on the target, dealing with 12134 it will typically have priority over migration discovery. 12136 o The COMPOUND status is NFS4ERR_MOVED and SEQ4_STATUS_LEASE_MOVED 12137 is clear. 12139 In this case, transition recovery is also required. It is clear 12140 that migration discovery is not needed to find file systems that 12141 have been migrated other than the one returning NFS4ERR_MOVED. 12142 Cases in which this result can arise include a referral or a 12143 migration for which there is no associated locking state. This 12144 can also arise in cases in which an access path transition other 12145 than migration occurs within the same server. In such a case, 12146 there is no need to set SEQ4_STATUS_LEASE_MOVED, since the lease 12147 remains associated with the current server even though the access 12148 path has changed. 12150 o The COMPOUND status is not NFS4ERR_MOVED and 12151 SEQ4_STATUS_LEASE_MOVED is asserted. 12153 In this case, no transition recovery activity is required on the 12154 file system(s) accessed by the request. However, to prevent 12155 avoidable lease expiration, migration discovery needs to be done. 12157 o The COMPOUND status is not NFS4ERR_MOVED and 12158 SEQ4_STATUS_LEASE_MOVED is clear. 12160 In this case, neither transition-related activity nor migration 12161 discovery is required. 12163 Note that the specified actions only need to be taken if they are not 12164 already going on.
For example, when NFS4ERR_MOVED is received when 12165 accessing a file system for which transition recovery is already going 12166 on, the client merely waits for that recovery to be completed, while 12167 the receipt of a SEQ4_STATUS_LEASE_MOVED indication only needs to 12168 initiate migration discovery for a server if such discovery is not 12169 already underway for that server. 12171 The fact that a lease-migrated condition does not result in an error 12172 in NFSv4.1 has a number of important consequences. In addition to 12173 the fact, discussed above, that the two indications are not mutually 12174 exclusive, there are a number of issues that are important in 12175 considering implementation of migration discovery, as discussed in 12176 Section 11.12.2. 12178 Because the lease-migrated condition is not an error, it is possible for 12179 file systems whose access path has not changed to be successfully 12180 accessed on a given server even though recovery is necessary for 12181 other file systems on the same server. As a result, access can go on 12182 while: 12184 o The migration discovery process is going on for that server. 12186 o The transition recovery process is going on for other file 12187 systems connected to that server. 12189 11.12.2. Performing Migration Discovery 12191 Migration discovery can be performed in the same context as 12192 transition recovery, allowing recovery for each migrated file system 12193 to be invoked as it is discovered. Alternatively, it may be done in 12194 a separate migration discovery thread, allowing migration discovery 12195 to be done in parallel with one or more instances of transition 12196 recovery. 12198 In either case, because the lease-migrated indication does not result 12199 in an error, other access to file systems on the server can proceed 12200 normally, with the possibility that further such indications will be 12201 received, raising the issue of how such indications are to be dealt 12202 with.
In general, 12204 o No action needs to be taken for such indications received by the 12205 thread performing migration discovery, since continuation of that 12206 work will address the issue. 12208 o In other cases in which migration discovery is currently being 12209 performed, nothing further needs to be done to respond to such 12210 lease migration indications, as long as one can be certain that 12211 the migration discovery process would deal with those indications. 12212 See below for details. 12214 o For such indications received in all other contexts, the 12215 appropriate response is to initiate or otherwise provide for the 12216 execution of migration discovery for file systems associated with 12217 the server IP address returning the indication. 12219 This leaves a potential difficulty in situations in which the 12220 migration discovery process is near to completion but is still 12221 operating. One should not ignore a LEASE_MOVED indication if the 12222 migration discovery process is not able to respond to the discovery 12223 of additional migrating file systems without additional aid. A 12224 further complexity relevant in addressing such situations is that a 12225 lease-migrated indication may reflect the server's state at the time 12226 the SEQUENCE operation was processed, which may be different from 12227 that in effect at the time the response is received. Because new 12228 migration events may occur at any time, and because a LEASE_MOVED 12229 indication may reflect the situation in effect a considerable time 12230 before the indication is received, special care needs to be taken to 12231 ensure that LEASE_MOVED indications are not inappropriately ignored. 12233 A useful approach to this issue involves the use of separate 12234 externally-visible migration discovery states for each server.
12235 Separate values could represent the various possible states for the 12236 migration discovery process for a server: 12238 o non-operation, in which migration discovery is not being performed. 12240 o normal operation, in which there is an ongoing scan for migrated 12241 file systems. 12243 o completion/verification of migration discovery processing, in 12244 which the possible completion of migration discovery processing 12245 needs to be verified. 12247 Given that framework, migration discovery processing would proceed as 12248 follows. 12250 o While in the normal-operation state, the thread performing 12251 discovery would fetch, for successive file systems known to the 12252 client on the server being worked on, a file system location 12253 attribute plus the fs_status attribute. 12255 o If the fs_status attribute indicates that the file system is a 12256 migrated one (i.e., fss_absent is true and fss_type != 12257 STATUS4_REFERRAL), then it is likely that the fetch of the 12258 file system location attribute has cleared one of the file systems 12259 contributing to the lease-migrated indication. 12261 o In cases in which that happened, the thread cannot know whether 12262 the lease-migrated indication has been cleared and so it enters 12263 the completion/verification state and proceeds to issue a COMPOUND 12264 to see if the LEASE_MOVED indication has been cleared. 12266 o When the discovery process is in the completion/verification 12267 state, if other requests get a lease-migrated indication they note 12268 that it was received. Later, the existence of such indications 12269 is used when the request completes, as described below. 12271 When the request used in the completion/verification state completes: 12273 o If a lease-migrated indication is returned, the discovery 12274 continues normally. Note that this is so even if all file systems 12275 have been traversed, since new migrations could have occurred while the 12276 process was going on.
12278 o Otherwise, if there is any record that other requests saw a lease- 12279 migrated indication while the request was going on, that record is 12280 cleared and the verification request is retried. The discovery 12281 process remains in the completion/verification state. 12283 o If there have been no lease-migrated indications, the work of 12284 migration discovery is considered completed and it enters the non- 12285 operating state. Once it enters this state, subsequent lease- 12286 migrated indications will trigger a new migration discovery 12287 process. 12289 It should be noted that the process described above is not guaranteed 12290 to terminate, as a long series of new migration events might 12291 continually delay the clearing of the LEASE_MOVED indication. To 12292 prevent unnecessary lease expiration, it is appropriate for clients 12293 to use the discovery of migrations to effect lease renewal 12294 immediately, rather than waiting for clearing of the LEASE_MOVED 12295 indication when the complete set of migrations is available. 12297 11.12.3. Overview of Client Response to NFS4ERR_MOVED 12299 This section outlines a way in which a client that receives 12300 NFS4ERR_MOVED can effect transition recovery by using a new server or 12301 server endpoint if one is available. As part of that process, it 12302 will determine: 12304 o Whether the NFS4ERR_MOVED indicates migration has occurred, or 12305 whether it indicates another sort of file system access transition 12306 as discussed in Section 11.9 above. 12308 o In the case of migration, whether Transparent State Migration has 12309 occurred. 12311 o Whether any state has been lost during the process of Transparent 12312 State Migration. 12314 o Whether sessions have been transferred as part of Transparent 12315 State Migration.
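The first of these determinations hinges on trunking comparisons of EXCHANGE_ID results, per the rules of Section 2.10.4. A minimal Python sketch follows; the record type and its field names (mirroring eir_server_owner and eir_server_scope) are illustrative stand-ins, not the protocol's XDR.

```python
# Illustrative (non-normative) classification of two EXCHANGE_ID results.
# Two endpoints are server-trunkable when server scope and so_major_id match;
# they are additionally session-trunkable when so_minor_id also matches.
from collections import namedtuple

ExchangeIdResult = namedtuple("ExchangeIdResult",
                              ["so_major_id", "so_minor_id", "server_scope"])

def trunking_relation(old, new):
    """Return the strongest trunking relationship between two endpoints:
    "session", "server", or "none" (different server, so migration)."""
    if (old.server_scope != new.server_scope
            or old.so_major_id != new.so_major_id):
        return "none"
    if old.so_minor_id == new.so_minor_id:
        return "session"
    return "server"

old = ExchangeIdResult("srvA", 1, "scopeX")
assert trunking_relation(old, ExchangeIdResult("srvA", 1, "scopeX")) == "session"
assert trunking_relation(old, ExchangeIdResult("srvA", 2, "scopeX")) == "server"
assert trunking_relation(old, ExchangeIdResult("srvB", 1, "scopeX")) == "none"
```

A "none" result corresponds to the initial determination that migration has occurred; "server" or "session" corresponds to a network access path change within the same server.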
12317 During the first phase of this process, the client proceeds to 12318 examine file system location entries to find the initial network 12319 address it will use to continue access to the file system or its 12320 replacement. For each location entry that the client examines, the 12321 process consists of five steps: 12323 1. Performing an EXCHANGE_ID directed at the location address. This 12324 operation is used to register the client owner (in the form of a 12325 client_owner4) with the server, to obtain a client ID to be used 12326 subsequently to communicate with it, to obtain that client ID's 12327 confirmation status, and to determine server_owner and scope for 12328 the purpose of determining if the entry is trunkable with that 12329 previously being used to access the file system (i.e., that it 12330 represents another network access path to the same file system 12331 and can share locking state with it). 12333 2. Making an initial determination of whether migration has 12334 occurred. The initial determination will be based on whether the 12335 EXCHANGE_ID results indicate that the current location element is 12336 server-trunkable with that used to access the file system when 12337 access was terminated by receiving NFS4ERR_MOVED. If it is, then 12338 migration has not occurred. In that case, the transition is 12339 dealt with, at least initially, as one involving continued access 12340 to the same file system on the same server through a new network 12341 address. 12343 3. Obtaining access to existing session state or creating new 12344 sessions. How this is done depends on the initial determination 12345 of whether migration has occurred and can be done as described in 12346 Section 11.12.4 below in the case of migration or as described in 12347 Section 11.12.5 below in the case of a network address transfer 12348 without migration. 12350 4. Verification of the trunking relationship assumed in step 2 as 12351 discussed in Section 2.10.5.1.
Although this step will generally 12352 confirm the initial determination, it is possible for 12353 verification to fail with the result that an initial 12354 determination that a network address shift (without migration) 12355 has occurred may be invalidated and migration determined to have 12356 occurred. There is no need to redo step 3 above, since it will 12357 be possible to continue use of the session established already. 12359 5. Obtaining access to existing locking state and/or reobtaining it. 12360 How this is done depends on the final determination of whether 12361 migration has occurred and can be done as described below in 12362 Section 11.12.4 in the case of migration or as described in 12363 Section 11.12.5 in the case of a network address transfer without 12364 migration. 12366 Once the initial address has been determined, clients are free to 12367 apply an abbreviated process to find additional addresses trunkable 12368 with it (clients may seek session-trunkable or server-trunkable 12369 addresses depending on whether they support clientid trunking). 12370 During this later phase of the process, further location entries are 12371 examined using the abbreviated procedure specified below: 12373 A: Before the EXCHANGE_ID, the fs name of the location entry is 12374 examined and if it does not match that currently being used, the 12375 entry is ignored; otherwise, one proceeds as specified by step 1 12376 above. 12378 B: In the case that the network address is session-trunkable with 12379 one used previously, a BIND_CONN_TO_SESSION is used to access that 12380 session using the new network address. Otherwise, or if the bind 12381 operation fails, a CREATE_SESSION is done. 12383 C: The verification procedure referred to in step 4 above is used. 12384 However, if it fails, the entry is ignored and the next available 12385 entry is used. 12387 11.12.4.
Obtaining Access to Sessions and State after Migration 12389 In the event that migration has occurred, migration recovery will 12390 involve determining whether Transparent State Migration has occurred. 12391 This decision is made based on the client ID returned by the 12392 EXCHANGE_ID and the reported confirmation status. 12394 o If the client ID is an unconfirmed client ID not previously known 12395 to the client, then Transparent State Migration has not occurred. 12397 o If the client ID is a confirmed client ID previously known to the 12398 client, then any transferred state would have been merged with an 12399 existing client ID representing the client to the destination 12400 server. In this state merger case, Transparent State Migration 12401 might or might not have occurred and a determination as to whether 12402 it has occurred is deferred until sessions are established and the 12403 client is ready to begin state recovery. 12405 o If the client ID is a confirmed client ID not previously known to 12406 the client, then the client can conclude that the client ID was 12407 transferred as part of Transparent State Migration. In this 12408 transferred client ID case, Transparent State Migration has 12409 occurred although some state might have been lost. 12411 Once the client ID has been obtained, it is necessary to obtain 12412 access to sessions to continue communication with the new server. In 12413 any of the cases in which Transparent State Migration has occurred, 12414 it is possible that a session was transferred as well. To deal with 12415 that possibility, clients can, after doing the EXCHANGE_ID, issue a 12416 BIND_CONN_TO_SESSION to connect the transferred session to a 12417 connection to the new server. If that fails, it is an indication 12418 that the session was not transferred and that a new session needs to 12419 be created to take its place. 
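The three-way determination above, based on whether the client ID returned by EXCHANGE_ID was previously known to the client and whether it is confirmed, can be sketched as follows. The boolean inputs are assumptions about what the client records locally; they are not protocol fields.

```python
# Non-normative sketch of the Transparent State Migration (TSM) determination
# made from the EXCHANGE_ID results after migration.

def classify_migration(clientid_known, clientid_confirmed):
    """Classify the migration case from the returned client ID's status."""
    if not clientid_confirmed and not clientid_known:
        return "no-tsm"        # fresh client ID: TSM has not occurred
    if clientid_confirmed and clientid_known:
        return "state-merger"  # TSM status deferred until state recovery
    if clientid_confirmed and not clientid_known:
        return "transferred"   # client ID moved by TSM (state may be lost)
    return "unexpected"        # unconfirmed but known: not an expected case

assert classify_migration(clientid_known=False, clientid_confirmed=False) == "no-tsm"
assert classify_migration(clientid_known=True, clientid_confirmed=True) == "state-merger"
assert classify_migration(clientid_known=False, clientid_confirmed=True) == "transferred"
```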
12421 In some situations, it is possible for a BIND_CONN_TO_SESSION to 12422 succeed without session migration having occurred. If state merger 12423 has taken place, then the associated client ID may have already had a 12424 set of existing sessions, with it being possible that the sessionid 12425 of a given session is the same as one that might have been migrated. 12426 In that event, a BIND_CONN_TO_SESSION might succeed, even though 12427 there could have been no migration of the session with that 12428 sessionid. In such cases, the client will receive sequence errors 12429 when the slot sequence values used are not appropriate on the new 12430 session. When this occurs, the client can create a new session and 12431 cease using the existing one. 12433 Once the client has determined the initial migration status, and 12434 determined that there was a shift to a new server, it needs to re- 12435 establish its locking state, if possible. To enable this to happen 12436 without loss of the guarantees normally provided by locking, the 12437 destination server needs to implement a per-fs grace period in all 12438 cases in which lock state was lost, including those in which 12439 Transparent State Migration was not implemented. 12441 Clients need to deal with the following cases: 12443 o In the state merger case, it is possible that the server has not 12444 attempted Transparent State Migration, in which case state may 12445 have been lost without it being reflected in the SEQ4_STATUS bits. 12446 To determine whether this has happened, the client can use 12447 TEST_STATEID to check whether the stateids created on the source 12448 server are still accessible on the destination server. Once a 12449 single stateid is found to have been successfully transferred, the 12450 client can conclude that Transparent State Migration was begun and 12451 any failure to transport all of the stateids will be reflected in 12452 the SEQ4_STATUS bits.
Otherwise, Transparent State Migration has 12453 not occurred. 12455 o In a case in which Transparent State Migration has not occurred, 12456 the client can use the per-fs grace period provided by the 12457 destination server to reclaim locks that were held on the source 12458 server. 12460 o In a case in which Transparent State Migration has occurred, and 12461 no lock state was lost (as shown by SEQ4_STATUS flags), no lock 12462 reclaim is necessary. 12464 o In a case in which Transparent State Migration has occurred, and 12465 some lock state was lost (as shown by SEQ4_STATUS flags), existing 12466 stateids need to be checked for validity using TEST_STATEID, and 12467 reclaim used to re-establish any that were not transferred. 12469 For all of the cases above, RECLAIM_COMPLETE with an rca_one_fs value 12470 of TRUE needs to be done before normal use of the file system, 12471 including obtaining new locks for the file system. This applies even 12472 if no locks were lost and there was no need for any to be reclaimed. 12474 11.12.5. Obtaining Access to Sessions and State after Network Address 12475 Transfer 12477 The case in which there is a transfer to a new network address 12478 without migration is similar to that described in Section 11.12.4 12479 above in that there is a need to obtain access to needed sessions and 12480 locking state. However, the details are simpler and will vary 12481 depending on the type of trunking between the address receiving 12482 NFS4ERR_MOVED and that to which the transfer is to be made. 12484 To make a session available for use, a BIND_CONN_TO_SESSION should be 12485 used to obtain access to the session previously in use. Only if this 12486 fails should a CREATE_SESSION be done. While this procedure mirrors 12487 that in Section 11.12.4 above, there is an important difference in 12488 that preservation of the session is not purely optional but depends 12489 on the type of trunking.
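The session-recovery order just described, attempting BIND_CONN_TO_SESSION first and falling back to CREATE_SESSION only if the bind fails, can be sketched as below. The callables are hypothetical stand-ins for the actual operations, and the error reporting via an exception is an implementation assumption.

```python
# Non-normative sketch: try to bind a connection to the existing session
# first; create a new session only when the bind fails.

def recover_session(session_id, bind_conn_to_session, create_session):
    """Return (session, was_preserved)."""
    try:
        # BIND_CONN_TO_SESSION stand-in; success means the session survived.
        return bind_conn_to_session(session_id), True
    except Exception:
        # e.g. the server rejects the sessionid; fall back to CREATE_SESSION.
        return create_session(), False

# If the destination preserved the session, the bind succeeds:
sess, kept = recover_session("sid1", lambda sid: sid, lambda: "new-sid")
assert (sess, kept) == ("sid1", True)

# Otherwise the client falls back to creating a new session:
def failing_bind(sid):
    raise RuntimeError("NFS4ERR_BADSESSION")

sess, kept = recover_session("sid1", failing_bind, lambda: "new-sid")
assert (sess, kept) == ("new-sid", False)
```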
12491 Access to appropriate locking state will generally need no actions 12492 beyond access to the session. However, the SEQ4_STATUS bits need to 12493 be checked for lost locking state, including the need to reclaim 12494 locks after a server reboot, since there is always a possibility of 12495 locking state being lost. 12497 11.13. Server Responsibilities Upon Migration 12499 In the event of file system migration, when the client connects to 12500 the destination server, that server needs to be able to provide the 12501 client continued access to the files it had open on the source 12502 server. There are two ways to provide this: 12504 o By provision of an fs-specific grace period, allowing the client 12505 the ability to reclaim its locks, in a fashion similar to what 12506 would have been done in the case of recovery from a server 12507 restart. See Section 11.13.1 for a more complete discussion. 12509 o By implementing Transparent State Migration, possibly in connection 12510 with session migration, giving the client immediate access on the 12511 destination to the state built up on the source server. 12514 These features are discussed separately in Sections 11.13.2 and 12515 11.13.3, which discuss Transparent State Migration and session 12516 migration respectively. 12518 All the features described above can involve transfer of lock-related 12519 information between source and destination servers. In some cases, 12520 this transfer is a necessary part of the implementation, while in 12521 other cases it is a helpful implementation aid which servers might or 12522 might not use. The sub-sections below discuss the information which 12523 would be transferred but do not define the specifics of the transfer 12524 protocol. This is left as an implementation choice, although 12525 standards in this area could be developed at a later time. 12527 11.13.1.
Server Responsibilities in Effecting State Reclaim after 12528 Migration 12530 In this case, the destination server need have no knowledge of the locks 12531 held on the source server, but relies on the clients to accurately 12532 report (via reclaim operations) the locks previously held, not 12533 allowing new locks to be granted on the migrated file system until the 12534 grace period expires. 12536 During this grace period, clients have the opportunity to use reclaim 12537 operations to obtain locks for file system objects within the 12538 migrated file system, in the same way that they do when recovering 12539 from server restart, and the servers typically rely on clients to 12540 accurately report their locks, although they have the option of 12541 subjecting these requests to verification. If the clients only 12542 reclaim locks held on the source server, no conflict can arise. Once 12543 the client has reclaimed its locks, it indicates the completion of 12544 lock reclamation by performing a RECLAIM_COMPLETE specifying 12545 rca_one_fs as TRUE. 12547 While it is not necessary for source and destination servers to co- 12548 operate to transfer information about locks, implementations are 12549 well-advised to consider transferring the following useful 12550 information: 12552 o If information about the set of clients that have locking state 12553 for the transferred file system is made available, the destination 12554 server will be able to terminate the grace period once all such 12555 clients have reclaimed their locks, allowing normal locking 12556 activity to resume earlier than it would have otherwise. 12558 o Locking summary information for individual clients (at various 12559 possible levels of detail) can be used to detect some instances in which 12560 clients do not accurately represent the locks held on the source 12561 server. 12563 11.13.2.
Server Responsibilities in Effecting Transparent State 12564 Migration 12566 The basic responsibility of the source server in effecting 12567 Transparent State Migration is to make available to the destination 12568 server a description of each piece of locking state associated with 12569 the file system being migrated. In addition to the client id string and 12570 verifier, the source server needs to provide, for each stateid: 12572 o The stateid including the current sequence value. 12574 o The associated client ID. 12576 o The handle of the associated file. 12578 o The type of the lock, such as open, byte-range lock, delegation, 12579 or layout. 12581 o For locks such as opens and byte-range locks, there will be 12582 information about the owner(s) of the lock. 12584 o For recallable/revocable lock types, the current recall status 12585 needs to be included. 12587 o For each lock type, there will be type-specific information, such 12588 as share and deny modes for opens and type and byte ranges for 12589 byte-range locks and layouts. 12591 Such information will most probably be organized by client id string 12592 on the destination server so that it can be used to provide 12593 appropriate context to each client when it makes itself known to the 12594 destination server. Issues connected with a client impersonating another by 12595 presenting another client's id string are discussed in Section 21. 12597 A further server responsibility concerns locks that are revoked or 12598 otherwise lost during the process of file system migration. Because 12599 locks that appear to be lost during the process of migration will be 12600 reclaimed by the client, the servers have to take steps to ensure 12601 that locks revoked soon before or soon after migration are not 12602 inadvertently allowed to be reclaimed in situations in which the 12603 continuity of lock possession cannot be assured.
12605 o For locks lost on the source but whose loss has not yet been 12606 acknowledged by the client (by using FREE_STATEID), the 12607 destination must be aware of this loss so that it can deny a 12608 request to reclaim them. 12610 o For locks lost on the destination after the state transfer but 12611 before the client's RECLAIM_COMPLETE is done, the destination 12612 server should note these and not allow them to be reclaimed. 12614 An additional responsibility of the cooperating servers concerns 12615 situations in which a stateid cannot be transferred transparently 12616 because it conflicts with an existing stateid held by the client and 12617 associated with a different file system. In this case, there are two 12618 valid choices: 12620 o Treat the transfer, as in NFSv4.0, as one without Transparent 12621 State Migration. In this case, conflicting locks cannot be 12622 granted until the client does a RECLAIM_COMPLETE, after reclaiming 12623 the locks it had, with the exception of reclaims denied because 12624 they were attempts to reclaim locks that had been lost. 12626 o Implement Transparent State Migration, except for the lock with 12627 the conflicting stateid. In this case, the client will be aware 12628 of a lost lock (through the SEQ4_STATUS flags) and be allowed to 12629 reclaim it. 12631 When transferring state between the source and destination, the 12632 issues discussed in Section 7.2 of [63] must still be attended to. 12633 In this case, the use of NFS4ERR_DELAY may still be necessary in 12634 NFSv4.1, as it was in NFSv4.0, to prevent locking state from changing 12635 while it is being transferred. 12637 There are a number of important differences in the NFSv4.1 context: 12639 o The absence of RELEASE_LOCKOWNER means that the one case in which 12640 an operation could not be deferred by use of NFS4ERR_DELAY no 12641 longer exists. 12643 o Sequencing of operations is no longer done using owner-based 12644 operation sequence numbers.
Instead, sequencing is session-based.

   As a result, when sessions are not transferred, the techniques
   discussed in Section 7.2 of [63] are adequate and will not be
   further discussed.

11.13.3.  Server Responsibilities in Effecting Session Transfer

   The basic responsibility of the source server in effecting session
   transfer is to make available to the destination server a
   description of the current state of each slot within the session,
   including:

   o  The last sequence value received for that slot.

   o  Whether there is cached reply data for the last request executed
      and, if so, the cached reply.

   When sessions are transferred, there are a number of issues that
   pose challenges in terms of making the transferred state
   unmodifiable during the period it is gathered up and transferred to
   the destination server:

   o  A single session may be used to access multiple file systems, not
      all of which are being transferred.

   o  Requests made on a session may, even if rejected, affect the
      state of the session by advancing the sequence number associated
      with the slot used.

   As a result, when the file system state might otherwise be
   considered unmodifiable, the client might have any number of
   in-flight requests, each of which is capable of changing session
   state.  These requests may be of a number of types:

   1.  Those requests that were processed on the migrating file system
       before migration began.

   2.  Those requests that got the error NFS4ERR_DELAY because the file
       system being accessed was in the process of being migrated.

   3.  Those requests that got the error NFS4ERR_MOVED because the file
       system being accessed had been migrated.

   4.  Those requests that accessed the migrating file system in order
       to obtain location or status information.

   5.  Those requests that did not reference the migrating file system.
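   As a rough illustration of the per-slot state described at the start
   of this section, and of the fact that even rejected requests (classes
   2 and 3 above) advance a slot, consider the following sketch.  All
   names and types here are illustrative assumptions, not part of the
   protocol's XDR:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SlotState:
    """Per-slot description a source server could hand to the destination."""
    last_seqid: int                # last sequence value received for the slot
    cached_reply: Optional[bytes]  # cached reply for the last request, if any

def record_request(slot: SlotState, seqid: int, reply: bytes) -> None:
    # Even a request rejected with NFS4ERR_DELAY or NFS4ERR_MOVED changes
    # session state by advancing the sequence number associated with the
    # slot used, so the transferred description must reflect it.
    slot.last_seqid = seqid
    slot.cached_reply = reply
```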
   It should be noted that the history of any particular slot is likely
   to include a number of these request classes.  In the case in which
   a migrated session is used to access file systems other than the one
   being migrated, requests of class 5 may be common and may be the
   last request processed for many slots.

   Since session state can change even after the locking state has been
   fixed as part of the migration process, the session state known to
   the client could be different from that on the destination server,
   which necessarily reflects the session state on the source server at
   an earlier time.  In deciding how to deal with this situation, it is
   helpful to distinguish between two sorts of behavioral consequences
   of the choice of initial sequence ID values:

   o  The error NFS4ERR_SEQ_MISORDERED is returned when the sequence ID
      in a request is neither equal to the last one seen for the
      current slot nor the next greater one.

      In view of the difficulty of arriving at a mutually acceptable
      value for the correct last sequence value at the point of
      migration, it may be necessary for the server to show some degree
      of forbearance when the sequence ID is one that would be
      considered unacceptable if session migration were not involved.

   o  Returning the cached reply for a previously executed request when
      the sequence ID in the request matches the last value recorded
      for the slot.

      In the cases in which an error is returned and there is no
      possibility of any non-idempotent operation having been executed,
      it may not be necessary to adhere to this as strictly as might be
      proper if session migration were not involved.
For example, the fact that the error NFS4ERR_DELAY was returned may
      not assist the client in any material way, while the fact that
      NFS4ERR_MOVED was returned by the source server may not be
      relevant when the request was reissued, directed to the
      destination server.

   An important issue is that the specification needs to take note of
   all potential COMPOUNDs, even if they might be unlikely in practice.
   For example, a COMPOUND is allowed to access multiple file systems
   and might perform non-idempotent operations in some of them before
   accessing a file system being migrated.  Also, a COMPOUND may return
   considerable data in the response before being rejected with
   NFS4ERR_DELAY or NFS4ERR_MOVED, and may in addition be marked as
   sa_cachethis.

   To address these issues, a destination server MAY do any of the
   following when implementing session transfer:

   o  Avoid enforcing any sequencing semantics for a particular slot
      until the client has established the starting sequence for that
      slot on the destination server.

   o  For each slot, avoid returning a cached reply that returns
      NFS4ERR_DELAY or NFS4ERR_MOVED until the client has established
      the starting sequence for that slot on the destination server.

   o  Until the client has established the starting sequence for a
      particular slot on the destination server, avoid reporting
      NFS4ERR_SEQ_MISORDERED or returning a cached reply that returns
      NFS4ERR_DELAY or NFS4ERR_MOVED, where the reply consists solely
      of a series of operations whose response is NFS4_OK until the
      final error.
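   One way a destination server might combine these relaxations is
   sketched below.  This is a non-normative illustration under assumed
   names (Slot, handle_sequence); a real server tracks considerably more
   per-slot state:

```python
class Slot:
    """Minimal per-slot state on the destination server (illustrative)."""
    def __init__(self, seqid, cached_reply):
        self.seqid = seqid            # last sequence value transferred by the source
        self.cached_reply = cached_reply
        self.established = False      # set once the client starts a new sequence here

def handle_sequence(slot, seqid, execute):
    """Handle a SEQUENCE with the given sequence ID; execute() runs the request."""
    if seqid == slot.seqid:
        # Matches the last value recorded for the slot: reply as for a retry.
        return slot.cached_reply
    if slot.established and seqid != slot.seqid + 1:
        return "NFS4ERR_SEQ_MISORDERED"
    # Before a starting sequence is established, any other value is accepted
    # rather than reported as misordered, since the destination cannot know
    # of requests made after the handoff.
    slot.seqid = seqid
    slot.established = True
    slot.cached_reply = execute()
    return slot.cached_reply
```

   Once the first non-retry SEQUENCE establishes a new starting point,
   ordinary slot sequencing applies, with no further reference to the
   value transferred by the source server.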
   Because of the considerations mentioned above, the destination
   server can respond appropriately to SEQUENCE operations received
   from the client by adopting the three policies listed below:

   o  Not responding with NFS4ERR_SEQ_MISORDERED for the initial
      request on a slot within a transferred session, since the
      destination server cannot be aware of requests made by the client
      after the server handoff but before the client became aware of
      the shift.

   o  Replying as it would for a retry whenever the sequence matches
      that transferred by the source server, even though this would not
      provide retry handling for requests issued after the server
      handoff, under the assumption that when such requests are issued
      they will never be responded to in a state-changing fashion,
      making retry support for them unnecessary.

   o  Once a non-retry SEQUENCE is received for a given slot, using
      that as the basis for further sequence checking, with no further
      reference to the sequence value transferred by the source server.

11.14.  Effecting File System Referrals

   Referrals are effected when an absent file system is encountered and
   one or more alternate locations are made available by the
   fs_locations or fs_locations_info attributes.  The client will
   typically get an NFS4ERR_MOVED error, fetch the appropriate location
   information, and proceed to access the file system on a different
   server, even though it retains its logical position within the
   original namespace.  Referrals differ from migration events in that
   they happen only when the client has not previously referenced the
   file system in question (so there is nothing to transition).
   Referrals can only come into effect when an absent file system is
   encountered at its root.
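   The client-side path adjustment that follows a referral (detailed in
   terms of fs_root and rootpath in Section 11.15) amounts to a prefix
   substitution.  A minimal sketch, with an assumed list-of-components
   representation for pathname4:

```python
def substitute_root(access_path, fs_root, rootpath):
    """Map a path under fs_root on the referring server to the
    corresponding path under rootpath on the new server."""
    if access_path[:len(fs_root)] != fs_root:
        raise ValueError("fs_root must be a prefix of the access path")
    return rootpath + access_path[len(fs_root):]

# Example drawn from Section 11.15: fs_root /a/b/c on the referring
# server, rootpath /x/y/z on the new server; /a/b/c/d maps to /x/y/z/d.
new_path = substitute_root(["a", "b", "c", "d"],
                           ["a", "b", "c"],
                           ["x", "y", "z"])
```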
   The examples given in the sections below are somewhat artificial in
   that an actual client will not typically do a multi-component look
   up, but will have cached information regarding the upper levels of
   the name hierarchy.  However, these examples are chosen to make the
   required behavior clear and easy to put within the scope of a small
   number of requests, without getting into a discussion of the details
   of how specific clients might choose to cache things.

11.14.1.  Referral Example (LOOKUP)

   Let us suppose that the following COMPOUND is sent in an environment
   in which /this/is/the/path is absent from the target server.  This
   may be for a number of reasons.  It may be that the file system has
   moved, or it may be that the target server is functioning mainly, or
   solely, to refer clients to the servers on which various file
   systems are located.

   o  PUTROOTFH

   o  LOOKUP "this"

   o  LOOKUP "is"

   o  LOOKUP "the"

   o  LOOKUP "path"

   o  GETFH

   o  GETATTR (fsid, fileid, size, time_modify)

   Under the given circumstances, the following will be the result:

   o  PUTROOTFH --> NFS_OK.  The current fh is now the root of the
      pseudo-fs.

   o  LOOKUP "this" --> NFS_OK.  The current fh is for /this and is
      within the pseudo-fs.

   o  LOOKUP "is" --> NFS_OK.  The current fh is for /this/is and is
      within the pseudo-fs.

   o  LOOKUP "the" --> NFS_OK.  The current fh is for /this/is/the and
      is within the pseudo-fs.

   o  LOOKUP "path" --> NFS_OK.  The current fh is for
      /this/is/the/path and is within a new, absent file system, but
      ... the client will never see the value of that fh.

   o  GETFH --> NFS4ERR_MOVED.  Fails because the current fh is in an
      absent file system at the start of the operation, and the
      specification makes no exception for GETFH.

   o  GETATTR (fsid, fileid, size, time_modify).
Not executed because the failure of the GETFH stops processing of
      the COMPOUND.

   Given the failure of the GETFH, the client has the job of
   determining the root of the absent file system and where to find
   that file system, i.e., the server and path relative to that
   server's root fh.  Note that in this example, the client did not
   obtain filehandles and attribute information (e.g., fsid) for the
   intermediate directories, so that it would not be sure where the
   absent file system starts.  It could be the case, for example, that
   /this/is/the is the root of the moved file system and that the
   reason that the look up of "path" succeeded is that the file system
   was not absent on that operation but was moved between the last
   LOOKUP and the GETFH (since COMPOUND is not atomic).  Even if we had
   the fsids for all of the intermediate directories, we could have no
   way of knowing that /this/is/the/path was the root of a new file
   system, since we don't yet have its fsid.

   In order to get the necessary information, let us re-send the chain
   of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we
   can be sure where the appropriate file system boundaries are.  The
   client could choose to get fs_locations_info at the same time, but
   in most cases the client will have a good guess as to where the file
   system boundaries are (because of where NFS4ERR_MOVED was, and was
   not, received), making fetching of fs_locations_info unnecessary.

   OP01:  PUTROOTFH --> NFS_OK

   -  Current fh is root of pseudo-fs.

   OP02:  GETATTR(fsid) --> NFS_OK

   -  Just for completeness.  Normally, clients will know the fsid of
      the pseudo-fs as soon as they establish communication with a
      server.

   OP03:  LOOKUP "this" --> NFS_OK

   OP04:  GETATTR(fsid) --> NFS_OK

   -  Get current fsid to see where file system boundaries are.
The fsid will be that for the pseudo-fs in this example, so no
      boundary.

   OP05:  GETFH --> NFS_OK

   -  Current fh is for /this and is within pseudo-fs.

   OP06:  LOOKUP "is" --> NFS_OK

   -  Current fh is for /this/is and is within pseudo-fs.

   OP07:  GETATTR(fsid) --> NFS_OK

   -  Get current fsid to see where file system boundaries are.  The
      fsid will be that for the pseudo-fs in this example, so no
      boundary.

   OP08:  GETFH --> NFS_OK

   -  Current fh is for /this/is and is within pseudo-fs.

   OP09:  LOOKUP "the" --> NFS_OK

   -  Current fh is for /this/is/the and is within pseudo-fs.

   OP10:  GETATTR(fsid) --> NFS_OK

   -  Get current fsid to see where file system boundaries are.  The
      fsid will be that for the pseudo-fs in this example, so no
      boundary.

   OP11:  GETFH --> NFS_OK

   -  Current fh is for /this/is/the and is within pseudo-fs.

   OP12:  LOOKUP "path" --> NFS_OK

   -  Current fh is for /this/is/the/path and is within a new, absent
      file system, but ...

   -  The client will never see the value of that fh.

   OP13:  GETATTR(fsid, fs_locations_info) --> NFS_OK

   -  We are getting the fsid to know where the file system boundaries
      are.  In this operation, the fsid will be different than that of
      the parent directory (which in turn was retrieved in OP10).  Note
      that the fsid we are given will not necessarily be preserved at
      the new location.  That fsid might be different, and in fact the
      fsid we have for this file system might be a valid fsid of a
      different file system on that new server.

   -  In this particular case, we are pretty sure anyway that what has
      moved is /this/is/the/path rather than /this/is/the since we have
      the fsid of the latter and it is that of the pseudo-fs, which
      presumably cannot move.
However, in other examples, we might not have this kind of
      information to rely on (e.g., /this/is/the might be a non-pseudo
      file system separate from /this/is/the/path), so we need to have
      other reliable source information on the boundary of the file
      system that is moved.  If, for example, the file system /this/is
      had moved, we would have a case of migration rather than
      referral, and once the boundaries of the migrated file system
      were clear we could fetch fs_locations_info.

   -  We are fetching fs_locations_info because the fact that we got an
      NFS4ERR_MOVED at this point means that it is most likely that
      this is a referral and we need the destination.  Even if it is
      the case that /this/is/the is a file system that has migrated, we
      will still need the location information for that file system.

   OP14:  GETFH --> NFS4ERR_MOVED

   -  Fails because the current fh is in an absent file system at the
      start of the operation, and the specification makes no exception
      for GETFH.  Note that this means the server will never send the
      client a filehandle from within an absent file system.

   Given the above, the client knows where the root of the absent file
   system is (/this/is/the/path) by noting where the change of fsid
   occurred (between "the" and "path").  The fs_locations_info
   attribute also gives the client the actual location of the absent
   file system, so that the referral can proceed.  The server gives the
   client the bare minimum of information about the absent file system
   so that there will be very little scope for problems of conflict
   between information sent by the referring server and information of
   the file system's home.  No filehandles and very few attributes are
   present on the referring server, and the client can treat those it
   receives as transient information with the function of enabling the
   referral.

11.14.2.
Referral Example (READDIR)

   Another context in which a client may encounter referrals is when it
   does a READDIR on a directory in which some of the sub-directories
   are the roots of absent file systems.

   Suppose such a directory is read as follows:

   o  PUTROOTFH

   o  LOOKUP "this"

   o  LOOKUP "is"

   o  LOOKUP "the"

   o  READDIR (fsid, size, time_modify, mounted_on_fileid)

   In this case, because rdattr_error is not requested,
   fs_locations_info is not requested, and some of the attributes
   cannot be provided, the result will be an NFS4ERR_MOVED error on the
   READDIR, with the detailed results as follows:

   o  PUTROOTFH --> NFS_OK.  The current fh is at the root of the
      pseudo-fs.

   o  LOOKUP "this" --> NFS_OK.  The current fh is for /this and is
      within the pseudo-fs.

   o  LOOKUP "is" --> NFS_OK.  The current fh is for /this/is and is
      within the pseudo-fs.

   o  LOOKUP "the" --> NFS_OK.  The current fh is for /this/is/the and
      is within the pseudo-fs.

   o  READDIR (fsid, size, time_modify, mounted_on_fileid) -->
      NFS4ERR_MOVED.  Note that the same error would have been returned
      if /this/is/the had migrated, but it is returned because the
      directory contains the root of an absent file system.

   So now suppose that we re-send with rdattr_error:

   o  PUTROOTFH

   o  LOOKUP "this"

   o  LOOKUP "is"

   o  LOOKUP "the"

   o  READDIR (rdattr_error, fsid, size, time_modify,
      mounted_on_fileid)

   The results will be:

   o  PUTROOTFH --> NFS_OK.  The current fh is at the root of the
      pseudo-fs.

   o  LOOKUP "this" --> NFS_OK.  The current fh is for /this and is
      within the pseudo-fs.

   o  LOOKUP "is" --> NFS_OK.  The current fh is for /this/is and is
      within the pseudo-fs.

   o  LOOKUP "the" --> NFS_OK.  The current fh is for /this/is/the and
      is within the pseudo-fs.
   o  READDIR (rdattr_error, fsid, size, time_modify,
      mounted_on_fileid) --> NFS_OK.  The attributes for the directory
      entry with the component named "path" will only contain
      rdattr_error with the value NFS4ERR_MOVED, together with an fsid
      value and a value for mounted_on_fileid.

   Suppose we do another READDIR to get fs_locations_info (although we
   could have used a GETATTR directly, as in Section 11.14.1):

   o  PUTROOTFH

   o  LOOKUP "this"

   o  LOOKUP "is"

   o  LOOKUP "the"

   o  READDIR (rdattr_error, fs_locations_info, mounted_on_fileid,
      fsid, size, time_modify)

   The results would be:

   o  PUTROOTFH --> NFS_OK.  The current fh is at the root of the
      pseudo-fs.

   o  LOOKUP "this" --> NFS_OK.  The current fh is for /this and is
      within the pseudo-fs.

   o  LOOKUP "is" --> NFS_OK.  The current fh is for /this/is and is
      within the pseudo-fs.

   o  LOOKUP "the" --> NFS_OK.  The current fh is for /this/is/the and
      is within the pseudo-fs.

   o  READDIR (rdattr_error, fs_locations_info, mounted_on_fileid,
      fsid, size, time_modify) --> NFS_OK.  The attributes will be as
      shown below.

   The attributes for the directory entry with the component named
   "path" will only contain:

   o  rdattr_error (value: NFS_OK)

   o  fs_locations_info

   o  mounted_on_fileid (value: unique fileid within referring file
      system)

   o  fsid (value: unique value within referring server)

   The attributes for entry "path" will not contain size or time_modify
   because these attributes are not available within an absent file
   system.

11.15.
The Attribute fs_locations

   The fs_locations attribute is structured in the following way:

   struct fs_location4 {
           utf8str_cis             server<>;
           pathname4               rootpath;
   };

   struct fs_locations4 {
           pathname4               fs_root;
           fs_location4            locations<>;
   };

   The fs_location4 data type is used to represent the location of a
   file system by providing a server name and the path to the root of
   the file system within that server's namespace.  When a set of
   servers have corresponding file systems at the same path within
   their namespaces, an array of server names may be provided.  An
   entry in the server array is a UTF-8 string and represents one of a
   traditional DNS host name, IPv4 address, IPv6 address, or a zero-
   length string.  An IPv4 or IPv6 address is represented as a
   universal address (see Section 3.3.9 and [12]), minus the netid, and
   either with or without the trailing ".p1.p2" suffix that represents
   the port number.  If the suffix is omitted, then the default port,
   2049, SHOULD be assumed.  A zero-length string SHOULD be used to
   indicate the current address being used for the RPC call.  It is not
   a requirement that all servers that share the same rootpath be
   listed in one fs_location4 instance.  The array of server names is
   provided for convenience.  Servers that share the same rootpath may
   also be listed in separate fs_location4 entries in the fs_locations
   attribute.

   The fs_locations4 data type and the fs_locations attribute each
   contain an array of such locations.  Since the namespace of each
   server may be constructed differently, the "fs_root" field is
   provided.  The path represented by fs_root represents the location
   of the file system in the current server's namespace, i.e., that of
   the server from which the fs_locations attribute was obtained.
The fs_root path is meant to aid the client by clearly referencing
   the root of the file system whose locations are being reported, no
   matter what object within the current file system the current
   filehandle designates.  The fs_root is simply the pathname the
   client used to reach the object on the current server (i.e., the
   object to which the fs_locations attribute applies).

   When the fs_locations attribute is interrogated and there are no
   alternate file system locations, the server SHOULD return a zero-
   length array of fs_location4 structures, together with a valid
   fs_root.

   As an example, suppose there is a replicated file system located at
   two servers (servA and servB).  At servA, the file system is located
   at path /a/b/c.  At servB, the file system is located at path
   /x/y/z.  If the client were to obtain the fs_locations value for the
   directory at /a/b/c/d, it might not necessarily know that the file
   system's root is located in servA's namespace at /a/b/c.  When the
   client switches to servB, it will need to determine that the
   directory it first referenced at servA is now represented by the
   path /x/y/z/d on servB.  To facilitate this, the fs_locations
   attribute provided by servA would have an fs_root value of /a/b/c
   and two entries in fs_locations.  One entry in fs_locations will be
   for itself (servA), and the other will be for servB with a path of
   /x/y/z.  With this information, the client is able to substitute
   /x/y/z for the /a/b/c at the beginning of its access path and
   construct /x/y/z/d to use for the new server.

   Note that there is no requirement that the number of components in
   each rootpath be the same; there is no relation between the number
   of components in rootpath or fs_root, and none of the components in
   a rootpath and fs_root have to be the same.
In the above example, we could have had a third element in the
   locations array, with server equal to "servC" and rootpath equal to
   "/I/II", and a fourth element in locations with server equal to
   "servD" and rootpath equal to "/aleph/beth/gimel/daleth/he".

   The relationship between fs_root and a rootpath is that the client
   replaces the pathname indicated in fs_root for the current server
   with the substitute indicated in rootpath for the new server.

   For an example of a referred or migrated file system, suppose there
   is a file system located at serv1.  At serv1, the file system is
   located at /az/buky/vedi/glagoli.  The client finds that the object
   at glagoli has migrated (or is a referral).  The client gets the
   fs_locations attribute, which contains an fs_root of
   /az/buky/vedi/glagoli and one element in the locations array, with
   server equal to serv2 and rootpath equal to /izhitsa/fita.  The
   client replaces /az/buky/vedi/glagoli with /izhitsa/fita and uses
   the latter pathname on serv2.

   Thus, the server MUST return an fs_root that is equal to the path
   the client used to reach the object to which the fs_locations
   attribute applies.  Otherwise, the client cannot determine the new
   path to use on the new server.

   Since the fs_locations attribute lacks information defining various
   attributes of the various file system choices presented, it SHOULD
   only be interrogated and used when fs_locations_info is not
   available.  When fs_locations is used, information about the
   specific locations should be assumed based on the following rules.

   The following rules are general and apply irrespective of the
   context:

   o  All listed file system instances should be considered as of the
      same handle class if and only if the current fh_expire_type
      attribute does not include the FH4_VOL_MIGRATION bit.
Note that in the case of referral, filehandle issues do not apply
      since there can be no filehandles known within the current file
      system, nor is there any access to the fh_expire_type attribute
      on the referring (absent) file system.

   o  All listed file system instances should be considered as of the
      same fileid class if and only if the fh_expire_type attribute
      indicates persistent filehandles and does not include the
      FH4_VOL_MIGRATION bit.  Note that in the case of referral, fileid
      issues do not apply since there can be no fileids known within
      the referring (absent) file system, nor is there any access to
      the fh_expire_type attribute.

   o  All listed file system instances should be considered as of
      different change classes.

   For other class assignments, handling of file system transitions
   depends on the reasons for the transition:

   o  When the transition is due to migration, that is, the client was
      directed to a new file system after receiving an NFS4ERR_MOVED
      error, the target should be treated as being of the same write-
      verifier class as the source.

   o  When the transition is due to failover to another replica, that
      is, the client selected another replica without receiving an
      NFS4ERR_MOVED error, the target should be treated as being of a
      different write-verifier class from the source.

   The specific choices reflect typical implementation patterns for
   failover and controlled migration, respectively.  Since other
   choices are possible and useful, this information is better obtained
   by using fs_locations_info.  When a server implementation needs to
   communicate other choices, it MUST support the fs_locations_info
   attribute.

   See Section 21 for a discussion of the recommendations for the
   security flavor to be used by any GETATTR operation that requests
   the "fs_locations" attribute.

11.16.
The Attribute fs_locations_info

   The fs_locations_info attribute is intended as a more functional
   replacement for the fs_locations attribute, which will continue to
   exist and be supported.  Clients can use it to get a more complete
   set of data about alternative file system locations, including
   additional network paths to access replicas in use and additional
   replicas.  When the server does not support fs_locations_info,
   fs_locations can be used to get a subset of the data.  A server that
   supports fs_locations_info MUST support fs_locations as well.

   There is additional data present in fs_locations_info that is not
   available in fs_locations:

   o  Attribute continuity information.  This information will allow a
      client to select a replica that meets the transparency
      requirements of the applications accessing the data and to
      leverage optimizations due to the server guarantees of attribute
      continuity (e.g., if the change attribute of a file of the file
      system is continuous between multiple replicas, the client does
      not have to invalidate the file's cache when switching to a
      different replica).

   o  File system identity information that indicates when multiple
      replicas, from the client's point of view, correspond to the same
      target file system, allowing them to be used interchangeably,
      without disruption, as distinct synchronized replicas of the same
      file data.

      Note that having two replicas with common identity information is
      distinct from the case of two (trunked) paths to the same
      replica.

   o  Information that will bear on the suitability of various
      replicas, depending on the use that the client intends.  For
      example, many applications need an absolutely up-to-date copy
      (e.g., those that write), while others may only need access to
      the most up-to-date copy reasonably available.
   o  Server-derived preference information for replicas, which can be
      used to implement load-balancing while giving the client the
      entire file system list to be used in case the primary fails.

   The fs_locations_info attribute is structured similarly to the
   fs_locations attribute.  A top-level structure (fs_locations_info4)
   contains the entire attribute, including the root pathname of the
   file system and an array of lower-level structures that define
   replicas that share a common rootpath on their respective servers.
   The lower-level structure in turn (fs_locations_item4) contains a
   specific pathname and information on one or more individual network
   access paths.  For that last, lowest level, fs_locations_info has an
   fs_locations_server4 structure that contains per-server-replica
   information in addition to the file system location entry.  This
   per-server-replica information includes a nominally opaque array,
   fls_info, within which specific pieces of information are located at
   the specific indices listed below.

   Two fs_locations_server4 entries that are within different
   fs_locations_item4 structures are never trunkable, while two entries
   within the same fs_locations_item4 structure might or might not be
   trunkable.  Two entries that are trunkable will have identical
   identity information, although, as noted above, the converse is not
   the case.

   The attribute will always contain at least a single
   fs_locations_server4 entry.  Typically, there will be an entry with
   the FSLI4GF_CUR_REQ flag set, although in the case of a referral
   there will be no entry with that flag set.

   It should be noted that fs_locations_info attributes returned by
   servers for various replicas may differ for various reasons.  One
   server may know about a set of replicas that are not known to other
   servers.
Further, compatibility attributes may differ.  Filehandles might
   be of the same class going from replica A to replica B but not going
   in the reverse direction.  This might happen because the filehandles
   are the same, but replica B's server implementation might not have
   provision to note and report that equivalence.

   The fs_locations_info attribute consists of a root pathname
   (fli_fs_root, just like fs_root in the fs_locations attribute),
   together with an array of fs_locations_item4 structures.  The
   fs_locations_item4 structures in turn consist of a root pathname
   (fli_rootpath) together with an array (fli_entries) of elements of
   data type fs_locations_server4, all defined as follows.

   /*
    * Defines an individual server access path.
    */
   struct fs_locations_server4 {
           int32_t         fls_currency;
           opaque          fls_info<>;
           utf8str_cis     fls_server;
   };

   /*
    * Byte indices of items within
    * fls_info: flag fields, class numbers,
    * bytes indicating ranks and orders.
    */
   const FSLI4BX_GFLAGS            = 0;
   const FSLI4BX_TFLAGS            = 1;

   const FSLI4BX_CLSIMUL           = 2;
   const FSLI4BX_CLHANDLE          = 3;
   const FSLI4BX_CLFILEID          = 4;
   const FSLI4BX_CLWRITEVER        = 5;
   const FSLI4BX_CLCHANGE          = 6;
   const FSLI4BX_CLREADDIR         = 7;

   const FSLI4BX_READRANK          = 8;
   const FSLI4BX_WRITERANK         = 9;
   const FSLI4BX_READORDER         = 10;
   const FSLI4BX_WRITEORDER        = 11;

   /*
    * Bits defined within the general flag byte.
    */
   const FSLI4GF_WRITABLE          = 0x01;
   const FSLI4GF_CUR_REQ           = 0x02;
   const FSLI4GF_ABSENT            = 0x04;
   const FSLI4GF_GOING             = 0x08;
   const FSLI4GF_SPLIT             = 0x10;

   /*
    * Bits defined within the transport flag byte.
    */
   const FSLI4TF_RDMA              = 0x01;

   /*
    * Defines a set of replicas sharing
    * a common value of the rootpath
    * within the corresponding
    * single-server namespaces.
    */
   struct fs_locations_item4 {
           fs_locations_server4    fli_entries<>;
           pathname4               fli_rootpath;
   };

   /*
    * Defines the overall structure of
    * the fs_locations_info attribute.
    */
   struct fs_locations_info4 {
           uint32_t                fli_flags;
           int32_t                 fli_valid_for;
           pathname4               fli_fs_root;
           fs_locations_item4      fli_items<>;
   };

   /*
    * Flag bits in fli_flags.
    */
   const FSLI4IF_VAR_SUB = 0x00000001;

   typedef fs_locations_info4 fattr4_fs_locations_info;

   As noted above, the fs_locations_info attribute, when supported, may
   be requested of absent file systems without causing NFS4ERR_MOVED to
   be returned.  It is generally expected that it will be available for
   both present and absent file systems, even if only a single
   fs_locations_server4 entry is present, designating the current
   (present) file system, or two fs_locations_server4 entries
   designating the previous location of an absent file system (the one
   just referenced) and its successor location.  Servers are strongly
   urged to support this attribute on all file systems if they support
   it on any file system.

   The data presented in the fs_locations_info attribute may be
   obtained by the server in any number of ways, including
   specification by the administrator or by current protocols for
   transferring data among replicas and protocols not yet developed.
   NFSv4.1 only defines how this information is presented by the server
   to the client.

11.16.1.
The fs_locations_server4 Structure 13412 The fs_locations_server4 structure consists of the following items in 13413 addition to the fls_server field which specifies a network address or 13414 set of addresses to be used to access the specified file system. 13415 Note that both of these items (i.e., fls_currency and flinfo) specify 13416 attributes of the file system replica and should not be different 13417 when there are multiple fs_locations_server4 structures for the same 13418 replica, each specifying a network path to the chosen replica. 13420 When these values are different in two fs_locations_server4 13421 structures, a client has no basis for choosing one over the other and 13422 is best off simply ignoring both entries, whether these entries apply 13423 to migration replication or referral. When there are more than two 13424 such entries, majority voting can be used to exclude a single 13425 erroneous entry from consideration. In the case in which trunking 13426 information is provided for a replica currently being accessed, the 13427 additional trunked addresses can be ignored while access continues on 13428 the address currently being used, even if the entry corresponding to 13429 that path might be considered invalid. 13431 o An indication of how up-to-date the file system is (fls_currency) 13432 in seconds. This value is relative to the master copy. A 13433 negative value indicates that the server is unable to give any 13434 reasonably useful value here. A value of zero indicates that the 13435 file system is the actual writable data or a reliably coherent and 13436 fully up-to-date copy. Positive values indicate how out-of-date 13437 this copy can normally be before it is considered for update. 13438 Such a value is not a guarantee that such updates will always be 13439 performed on the required schedule but instead serves as a hint 13440 about how far the copy of the data would be expected to be behind 13441 the most up-to-date copy. 
o  A counted array of one-byte values (fls_info) containing information about the particular file system instance. This data includes general flags, transport capability flags, file system equivalence class information, and selection priority information. The encoding will be discussed below.

o  The server string (fls_server). For the case of the replica currently being accessed (via GETATTR), a zero-length string MAY be used to indicate the current address being used for the RPC call. The fls_server field can also be an IPv4 or IPv6 address, formatted the same way as an IPv4 or IPv6 address in the "server" field of the fs_location4 data type (see Section 11.15).

With the exception of the transport-flag field (at offset FSLI4BX_TFLAGS within the fls_info array), all of this data applies to the replica specified by the entry, rather than the specific network path used to access it.

Data within the fls_info array is in the form of 8-bit data items with constants giving the offsets within the array of various values describing this particular file system instance. This style of definition was chosen, in preference to explicit XDR structure definitions for these values, for a number of reasons.

o  The kinds of data in the fls_info array, representing flags, file system classes, and priorities among sets of file systems representing the same data, are such that 8 bits provide a quite acceptable range of values. Even where there might be more than 256 such file system instances, having more than 256 distinct classes or priorities is unlikely.

o  Explicit definition of the various specific data items within XDR would limit expandability in that any extension within would require yet another attribute, leading to specification and implementation clumsiness.
In the context of the NFSv4 extension model in effect at the time fs_locations_info was designed (i.e., that described in RFC 5661 [60]), this would necessitate a new minor version to effect any Standards Track extension to the data in fls_info.

The set of fls_info data is subject to expansion in a future minor version, or in a Standards Track RFC, within the context of a single minor version. The server SHOULD NOT send and the client MUST NOT use indices within the fls_info array or flag bits that are not defined in Standards Track RFCs.

In light of the new extension model defined in RFC 8178 [61] and the fact that the individual items within fls_info are not explicitly referenced in the XDR, the following practices should be followed when extending or otherwise changing the structure of the data returned in fls_info within the scope of a single minor version.

o  All extensions need to be described by Standards Track documents. There is no need for such documents to be marked as updating RFC 5661 [60] or this document.

o  It needs to be made clear whether the information in any added data items applies to the replica specified by the entry or to the specific network paths specified in the entry.

o  There needs to be a reliable way defined to determine whether the server is aware of the extension. This may be based on the length field of the fls_info array, but it is more flexible to provide fs-scope or server-scope attributes to indicate what extensions are provided.

This encoding scheme can be adapted to the specification of multi-byte numeric values, even though none are currently defined. If extensions are made via Standards Track RFCs, multi-byte quantities will be encoded as a range of bytes with a range of indices, with the bytes interpreted in big-endian byte order.
Further, any such index assignments will be constrained by the need for the relevant quantities not to cross XDR word boundaries.

The fls_info array currently contains:

o  Two 8-bit flag fields, one devoted to general file-system characteristics and a second reserved for transport-related capabilities.

o  Six 8-bit class values that define various file system equivalence classes as explained below.

o  Four 8-bit priority values that govern file system selection as explained below.

The general file system characteristics flag (at byte index FSLI4BX_GFLAGS) has the following bits defined within it:

o  FSLI4GF_WRITABLE indicates that this file system target is writable, allowing it to be selected by clients that may need to write on this file system. When the current file system instance is writable and belongs to the same simultaneous-use class (as specified by the value at index FSLI4BX_CLSIMUL) as the instance to which the client was previously writing, it must incorporate within its data any committed write made on the source file system instance. See Section 11.10.6, which discusses the write-verifier class. While there is no harm in not setting this flag for a file system that turns out to be writable, turning the flag on for a read-only file system can cause problems for clients that select a migration or replication target based on the flag and then find themselves unable to write.

o  FSLI4GF_CUR_REQ indicates that this replica is the one on which the request is being made. Only a single server entry may have this flag set and, in the case of a referral, no entry will have it set. Note that this flag might be set even if the request was made on a network access path different from any of those specified in the current entry.
13553 o FSLI4GF_ABSENT indicates that this entry corresponds to an absent 13554 file system replica. It can only be set if FSLI4GF_CUR_REQ is 13555 set. When both such bits are set, it indicates that a file system 13556 instance is not usable but that the information in the entry can 13557 be used to determine the sorts of continuity available when 13558 switching from this replica to other possible replicas. Since 13559 this bit can only be true if FSLI4GF_CUR_REQ is true, the value 13560 could be determined using the fs_status attribute, but the 13561 information is also made available here for the convenience of the 13562 client. An entry with this bit, since it represents a true file 13563 system (albeit absent), does not appear in the event of a 13564 referral, but only when a file system has been accessed at this 13565 location and has subsequently been migrated. 13567 o FSLI4GF_GOING indicates that a replica, while still available, 13568 should not be used further. The client, if using it, should make 13569 an orderly transfer to another file system instance as 13570 expeditiously as possible. It is expected that file systems going 13571 out of service will be announced as FSLI4GF_GOING some time before 13572 the actual loss of service. It is also expected that the 13573 fli_valid_for value will be sufficiently small to allow clients to 13574 detect and act on scheduled events, while large enough that the 13575 cost of the requests to fetch the fs_locations_info values will 13576 not be excessive. Values on the order of ten minutes seem 13577 reasonable. 13579 When this flag is seen as part of a transition into a new file 13580 system, a client might choose to transfer immediately to another 13581 replica, or it may reference the current file system and only 13582 transition when a migration event occurs. 
Similarly, when this flag appears on a replica in a referral, clients would likely avoid being referred to this instance whenever there is another choice.

This flag, like the other items within fls_info, applies to the replica rather than to a particular path to that replica. When it appears, a transition to a new replica, rather than to a different path to the same replica, is indicated.

o  FSLI4GF_SPLIT indicates that when a transition occurs from the current file system instance to this one, the replacement may consist of multiple file systems. In this case, the client has to be prepared for the possibility that objects on the same file system before migration will be on different ones after. Note that FSLI4GF_SPLIT is not incompatible with the file systems belonging to the same fileid class since, if one has a set of fileids that are unique within a file system, each subset assigned to a smaller file system after migration would not have any conflicts internal to that file system.

A client, in the case of a split file system, will interrogate existing files with which it has continuing connection (it is free to simply forget cached filehandles). If the client remembers the directory filehandle associated with each open file, it may proceed upward using LOOKUPP to find the new file system boundaries. Note that in the event of a referral, there will not be any such files and so these actions will not be performed. Instead, a reference to a portion of the original file system now split off into other file systems will encounter an fsid change and possibly a further referral.
13614 Once the client recognizes that one file system has been split 13615 into two, it can prevent the disruption of running applications by 13616 presenting the two file systems as a single one until a convenient 13617 point to recognize the transition, such as a restart. This would 13618 require a mapping from the server's fsids to fsids as seen by the 13619 client, but this is already necessary for other reasons. As noted 13620 above, existing fileids within the two descendant file systems 13621 will not conflict. Providing non-conflicting fileids for newly 13622 created files on the split file systems is the responsibility of 13623 the server (or servers working in concert). The server can encode 13624 filehandles such that filehandles generated before the split event 13625 can be discerned from those generated after the split, allowing 13626 the server to determine when the need for emulating two file 13627 systems as one is over. 13629 Although it is possible for this flag to be present in the event 13630 of referral, it would generally be of little interest to the 13631 client, since the client is not expected to have information 13632 regarding the current contents of the absent file system. 13634 The transport-flag field (at byte index FSLI4BX_TFLAGS) contains the 13635 following bits related to the transport capabilities of the specific 13636 network path(s) specified by the entry. 13638 o FSLI4TF_RDMA indicates that any specified network paths provide 13639 NFSv4.1 clients access using an RDMA-capable transport. 13641 Attribute continuity and file system identity information are 13642 expressed by defining equivalence relations on the sets of file 13643 systems presented to the client. Each such relation is expressed as 13644 a set of file system equivalence classes. For each relation, a file 13645 system has an 8-bit class number. Two file systems belong to the 13646 same class if both have identical non-zero class numbers. 
Zero is 13647 treated as non-matching. Most often, the relevant question for the 13648 client will be whether a given replica is identical to / continuous 13649 with the current one in a given respect, but the information should 13650 be available also as to whether two other replicas match in that 13651 respect as well. 13653 The following fields specify the file system's class numbers for the 13654 equivalence relations used in determining the nature of file system 13655 transitions. See Sections 11.8 through 11.13 and their various 13656 subsections for details about how this information is to be used. 13657 Servers may assign these values as they wish, so long as file system 13658 instances that share the same value have the specified relationship 13659 to one another; conversely, file systems that have the specified 13660 relationship to one another share a common class value. As each 13661 instance entry is added, the relationships of this instance to 13662 previously entered instances can be consulted, and if one is found 13663 that bears the specified relationship, that entry's class value can 13664 be copied to the new entry. When no such previous entry exists, a 13665 new value for that byte index (not previously used) can be selected, 13666 most likely by incrementing the value of the last class value 13667 assigned for that index. 13669 o The field with byte index FSLI4BX_CLSIMUL defines the 13670 simultaneous-use class for the file system. 13672 o The field with byte index FSLI4BX_CLHANDLE defines the handle 13673 class for the file system. 13675 o The field with byte index FSLI4BX_CLFILEID defines the fileid 13676 class for the file system. 13678 o The field with byte index FSLI4BX_CLWRITEVER defines the write- 13679 verifier class for the file system. 13681 o The field with byte index FSLI4BX_CLCHANGE defines the change 13682 class for the file system. 
13684 o The field with byte index FSLI4BX_CLREADDIR defines the readdir 13685 class for the file system. 13687 Server-specified preference information is also provided via 8-bit 13688 values within the fls_info array. The values provide a rank and an 13689 order (see below) to be used with separate values specifiable for the 13690 cases of read-only and writable file systems. These values are 13691 compared for different file systems to establish the server-specified 13692 preference, with lower values indicating "more preferred". 13694 Rank is used to express a strict server-imposed ordering on clients, 13695 with lower values indicating "more preferred". Clients should 13696 attempt to use all replicas with a given rank before they use one 13697 with a higher rank. Only if all of those file systems are 13698 unavailable should the client proceed to those of a higher rank. 13699 Because specifying a rank will override client preferences, servers 13700 should be conservative about using this mechanism, particularly when 13701 the environment is one in which client communication characteristics 13702 are neither tightly controlled nor visible to the server. 13704 Within a rank, the order value is used to specify the server's 13705 preference to guide the client's selection when the client's own 13706 preferences are not controlling, with lower values of order 13707 indicating "more preferred". If replicas are approximately equal in 13708 all respects, clients should defer to the order specified by the 13709 server. When clients look at server latency as part of their 13710 selection, they are free to use this criterion, but it is suggested 13711 that when latency differences are not significant, the server- 13712 specified order should guide selection. 13714 o The field at byte index FSLI4BX_READRANK gives the rank value to 13715 be used for read-only access. 
o  The field at byte index FSLI4BX_READORDER gives the order value to be used for read-only access.

o  The field at byte index FSLI4BX_WRITERANK gives the rank value to be used for writable access.

o  The field at byte index FSLI4BX_WRITEORDER gives the order value to be used for writable access.

Depending on the potential need for write access by a given client, one of the pairs of rank and order values is used. The read rank and order should only be used if the client knows that only reading will ever be done or if it is prepared to switch to a different replica in the event that any write access capability is required in the future.

11.16.2.  The fs_locations_info4 Structure

The fs_locations_info4 structure, encoding the fs_locations_info attribute, contains the following:

o  The fli_flags field, which contains general flags that affect the interpretation of this fs_locations_info4 structure and all fs_locations_item4 structures within it. The only flag currently defined is FSLI4IF_VAR_SUB. All bits in the fli_flags field that are not defined should always be returned as zero.

o  The fli_fs_root field, which contains the pathname of the root of the current file system on the current server, just as it does in the fs_locations4 structure.

o  An array called fli_items of fs_locations_item4 structures, which contain information about replicas of the current file system. Where the current file system is actually present, or has been present, i.e., this is not a referral situation, one of the fs_locations_item4 structures will contain an fs_locations_server4 for the current server. This structure will have FSLI4GF_ABSENT set if the current file system is absent, i.e., normal access to it will return NFS4ERR_MOVED.
13756 o The fli_valid_for field specifies a time in seconds for which it 13757 is reasonable for a client to use the fs_locations_info attribute 13758 without refetch. The fli_valid_for value does not provide a 13759 guarantee of validity since servers can unexpectedly go out of 13760 service or become inaccessible for any number of reasons. Clients 13761 are well-advised to refetch this information for an actively 13762 accessed file system at every fli_valid_for seconds. This is 13763 particularly important when file system replicas may go out of 13764 service in a controlled way using the FSLI4GF_GOING flag to 13765 communicate an ongoing change. The server should set 13766 fli_valid_for to a value that allows well-behaved clients to 13767 notice the FSLI4GF_GOING flag and make an orderly switch before 13768 the loss of service becomes effective. If this value is zero, 13769 then no refetch interval is appropriate and the client need not 13770 refetch this data on any particular schedule. In the event of a 13771 transition to a new file system instance, a new value of the 13772 fs_locations_info attribute will be fetched at the destination. 13773 It is to be expected that this may have a different fli_valid_for 13774 value, which the client should then use in the same fashion as the 13775 previous value. Because a refetch of the attribute causes 13776 information from all component entries to be refetched, the server 13777 will typically provide a low value for this field if any of the 13778 replicas are likely to go out of service in a short time frame. 13779 Note that, because of the ability of the server to return 13780 NFS4ERR_MOVED to trigger the use of different paths, when 13781 alternate trunked paths are available, there is generally no need 13782 to use low values of fli_valid_for in connection with the 13783 management of alternate paths to the same replica. 
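As a purely illustrative sketch (not part of the protocol) of how a client might consume the structure just described, the following Python fragment walks a decoded fs_locations_info4 value represented as plain dictionaries, locates the entry with FSLI4GF_CUR_REQ set, and derives a refetch time from fli_valid_for. All function names and the dictionary layout are assumptions; only the flag values and the zero-means-no-schedule rule come from the text above.

```python
# Byte index and flag values from the fls_info definition;
# everything else here is an illustrative client-side sketch.
FSLI4BX_GFLAGS = 0
FSLI4GF_CUR_REQ = 0x02
FSLI4GF_GOING = 0x08

def current_entry(fli_items):
    """Return the fs_locations_server4 entry with FSLI4GF_CUR_REQ
    set, or None in the referral case (no entry has it set)."""
    for item in fli_items:
        for entry in item["fli_entries"]:
            info = entry["fls_info"]
            if len(info) > FSLI4BX_GFLAGS and \
               info[FSLI4BX_GFLAGS] & FSLI4GF_CUR_REQ:
                return entry
    return None

def next_refetch(fli_valid_for, fetched_at):
    """Absolute time of the next fs_locations_info refetch, or
    None when fli_valid_for is zero (no particular schedule)."""
    return None if fli_valid_for == 0 else fetched_at + fli_valid_for

items = [{"fli_entries": [
    {"fls_server": "a.example", "fls_info": bytes([0x00])},
    {"fls_server": "b.example", "fls_info": bytes([FSLI4GF_CUR_REQ])},
]}]
assert current_entry(items)["fls_server"] == "b.example"
assert next_refetch(0, 1000.0) is None
```

Under this sketch, a referral is recognizable by current_entry() returning None, matching the rule that no entry carries FSLI4GF_CUR_REQ in that case.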
13785 The FSLI4IF_VAR_SUB flag within fli_flags controls whether variable 13786 substitution is to be enabled. See Section 11.16.3 for an 13787 explanation of variable substitution. 13789 11.16.3. The fs_locations_item4 Structure 13791 The fs_locations_item4 structure contains a pathname (in the field 13792 fli_rootpath) that encodes the path of the target file system 13793 replicas on the set of servers designated by the included 13794 fs_locations_server4 entries. The precise manner in which this 13795 target location is specified depends on the value of the 13796 FSLI4IF_VAR_SUB flag within the associated fs_locations_info4 13797 structure. 13799 If this flag is not set, then fli_rootpath simply designates the 13800 location of the target file system within each server's single-server 13801 namespace just as it does for the rootpath within the fs_location4 13802 structure. When this bit is set, however, component entries of a 13803 certain form are subject to client-specific variable substitution so 13804 as to allow a degree of namespace non-uniformity in order to 13805 accommodate the selection of client-specific file system targets to 13806 adapt to different client architectures or other characteristics. 13808 When such substitution is in effect, a variable beginning with the 13809 string "${" and ending with the string "}" and containing a colon is 13810 to be replaced by the client-specific value associated with that 13811 variable. The string "unknown" should be used by the client when it 13812 has no value for such a variable. The pathname resulting from such 13813 substitutions is used to designate the target file system, so that 13814 different clients may have different file systems, corresponding to 13815 that location in the multi-server namespace. 13817 As mentioned above, such substituted pathname variables contain a 13818 colon. 
The part before the colon is to be a DNS domain name, and the part after is to be a case-insensitive alphanumeric string.

Where the domain is "ietf.org", only variable names defined in this document or subsequent Standards Track RFCs are subject to such substitution. Organizations are free to use their domain names to create their own sets of client-specific variables, to be subject to such substitution. In cases where such variables are intended to be used more broadly than a single organization, publication of an Informational RFC defining such variables is RECOMMENDED.

The variable ${ietf.org:CPU_ARCH} is used to denote the CPU architecture for which object files are compiled. This specification does not limit the acceptable values (except that they must be valid UTF-8 strings), but such values as "x86", "x86_64", and "sparc" would be expected to be used in line with industry practice.

The variable ${ietf.org:OS_TYPE} is used to denote the operating system, and thus the kernel and library APIs, for which code might be compiled. This specification does not limit the acceptable values (except that they must be valid UTF-8 strings), but such values as "linux" and "freebsd" would be expected to be used in line with industry practice.

The variable ${ietf.org:OS_VERSION} is used to denote the operating system version, and thus the specific details of versioned interfaces, for which code might be compiled. This specification does not limit the acceptable values (except that they must be valid UTF-8 strings). However, combinations of numbers and letters with interspersed dots would be expected to be used in line with industry practice, with the details of the version format depending on the specific value of the variable ${ietf.org:OS_TYPE} with which it is used.
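The substitution rule just described can be sketched as follows. Only the "${domain:name}" syntax and the "unknown" fallback come from this specification; the regular expression, the variable table, and the function name are illustrative assumptions.

```python
import re

# Matches a "${domain:name}" variable: the part before the colon
# is a DNS domain name, the part after an alphanumeric string.
_VAR = re.compile(r"\$\{([^}:]+):([^}]+)\}")

def substitute(component, values):
    """Replace each ${domain:name} variable in one pathname
    component, using "unknown" when the client has no value."""
    def repl(m):
        return values.get((m.group(1), m.group(2)), "unknown")
    return _VAR.sub(repl, component)

values = {("ietf.org", "CPU_ARCH"): "x86_64",
          ("ietf.org", "OS_TYPE"): "linux"}
assert substitute("${ietf.org:CPU_ARCH}", values) == "x86_64"
assert substitute("${ietf.org:OS_VERSION}", values) == "unknown"
```

Applied to each component of fli_rootpath, this would map, for example, a "${ietf.org:OS_TYPE}" component to "linux" for a client supplying that value.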
Use of these variables could result in the direction of different clients to different file systems on the same server, as appropriate to particular clients. In cases in which the target file systems are located on different servers, a single server could serve as a referral point so that each valid combination of variable values would designate a referral hosted on a single server, with the targets of those referrals on a number of different servers.

Because namespace administration is affected by the values selected to substitute for various variables, clients should provide convenient means of determining what variable substitutions a client will implement, as well as, where appropriate, providing means to control the substitutions to be used. The exact means by which this will be done is outside the scope of this specification.

Although variable substitution is most suitable for use in the context of referrals, it may be used in the context of replication and migration. If it is used in these contexts, the server must ensure that no matter what values the client presents for the substituted variables, the result is always a valid successor file system instance to that from which a transition is occurring, i.e., that the data is identical or represents a later image of a writable file system.

Note that when fli_rootpath is a null pathname (that is, one with zero components), the file system designated is at the root of the specified server, whether or not the FSLI4IF_VAR_SUB flag within the associated fs_locations_info4 structure is set.

11.17.  The Attribute fs_status

In an environment in which multiple copies of the same basic set of data are available, information regarding the particular source of such data and the relationships among different copies can be very helpful in providing consistent data to applications.

   enum fs4_status_type {
           STATUS4_FIXED = 1,
           STATUS4_UPDATED = 2,
           STATUS4_VERSIONED = 3,
           STATUS4_WRITABLE = 4,
           STATUS4_REFERRAL = 5
   };

   struct fs4_status {
           bool            fss_absent;
           fs4_status_type fss_type;
           utf8str_cs      fss_source;
           utf8str_cs      fss_current;
           int32_t         fss_age;
           nfstime4        fss_version;
   };

The boolean fss_absent indicates whether the file system is currently absent. This value will be set if the file system was previously present and becomes absent, or if the file system has never been present and the type is STATUS4_REFERRAL. When this boolean is set and the type is not STATUS4_REFERRAL, the remaining information in the fs4_status reflects the state last valid when the file system was present.

The fss_type field indicates the kind of file system image represented. This is of particular importance when using the version values to determine appropriate succession of file system images. When fss_absent is set, and the file system was previously present, the value of fss_type reflected is that when the file system was last present. Five values are distinguished:

o  STATUS4_FIXED, which indicates a read-only image in the sense that it will never change. The possibility is allowed that, as a result of migration or switch to a different image, changed data can be accessed, but within the confines of this instance, no change is allowed. The client can use this fact to cache aggressively.
o  STATUS4_UPDATED, which indicates an image that cannot be updated by the user writing to it but that may be changed externally, typically because it is a periodically updated copy of another writable file system somewhere else. In this case, version information is not provided, and the client does not have the responsibility of making sure that this version only advances upon a file system instance transition. In this case, it is the responsibility of the server to make sure that the data presented after a file system instance transition is a proper successor image and includes all changes seen by the client and any change made before all such changes.

o  STATUS4_VERSIONED, which indicates that the image, like the STATUS4_UPDATED case, is updated externally, but it provides a guarantee that the server will carefully update an associated version value so that the client can protect itself from a situation in which it reads data from one version of the file system and then later reads data from an earlier version of the same file system. See below for a discussion of how this can be done.

o  STATUS4_WRITABLE, which indicates that the file system is an actual writable one. The client need not, of course, actually write to the file system, but once it does, it should not accept a transition to anything other than a writable instance of that same file system.

o  STATUS4_REFERRAL, which indicates that the file system in question is absent and has never been present on this server.

Note that in the STATUS4_UPDATED and STATUS4_VERSIONED cases, the server is responsible for the appropriate handling of locks that are inconsistent with external changes to delegations.
If a server gives
out delegations, they SHOULD be recalled before an inconsistent
change is made to the data, and MUST be revoked if this is not
possible.  Similarly, if an OPEN is inconsistent with data that is
changed (the OPEN has OPEN4_SHARE_DENY_WRITE/OPEN4_SHARE_DENY_BOTH
and the data is changed), that OPEN SHOULD be considered
administratively revoked.

The opaque strings fss_source and fss_current provide a way of
presenting information about the source of the file system image
being present.  It is not intended that the client do anything with
this information other than make it available to administrative
tools.  It is intended that this information be helpful when
researching possible problems with a file system image that might
arise when it is unclear if the correct image is being accessed and,
if not, how that image came to be made.  This kind of diagnostic
information will be helpful if, as seems likely, copies of file
systems are made in many different ways (e.g., simple user-level
copies, file-system-level point-in-time copies, clones of the
underlying storage), under a variety of administrative arrangements.
In such environments, determining how a given set of data was
constructed can be very helpful in resolving problems.

The opaque string fss_source is used to indicate the source of a
given file system with the expectation that tools capable of creating
a file system image propagate this information, when possible.  It is
understood that this may not always be possible since a user-level
copy may be thought of as creating a new data set and the tools used
may have no mechanism to propagate this data.  When a file system is
initially created, it is desirable to associate with it data
regarding how the file system was created, where it was created, who
created it, etc.
Making this information available in this attribute
in a human-readable string will be helpful for applications and
system administrators and will also serve to make it available when
the original file system is used to make subsequent copies.

The opaque string fss_current should provide whatever information is
available about the source of the current copy.  Such information
includes the tool that created it, any relevant parameters to that
tool, the time at which the copy was done, the user making the
change, the server on which the change was made, etc.  All
information should be in a human-readable string.

The field fss_age provides an indication of how out-of-date the file
system currently is with respect to its ultimate data source (in the
case of cascading data updates).  This complements the fls_currency
field of fs_locations_server4 (see Section 11.16) in the following
way: the information in fls_currency gives a bound for how out of
date the data in a file system might typically get, while the value
in fss_age gives a bound on how out-of-date that data actually is.
Negative values imply that no information is available.  A zero means
that this data is known to be current.  A positive value means that
this data is known to be no older than that number of seconds with
respect to the ultimate data source.  Using this value, the client
may be able to decide that a data copy is too old, so that it may
search for a newer version to use.

The fss_version field provides a version identification, in the form
of a time value, such that successive versions always have later time
values.
When fss_type is anything other than STATUS4_VERSIONED,
the server may provide such a value, but there is no guarantee as to
its validity, and clients will not use it except to provide
additional information to add to fss_source and fss_current.

When fss_type is STATUS4_VERSIONED, servers SHOULD provide a value of
fss_version that progresses monotonically whenever any new version of
the data is established.  This allows the client, if reliable image
progression is important to it, to fetch this attribute as part of
each COMPOUND where data or metadata from the file system is used.

When it is important to the client to make sure that only valid
successor images are accepted, it must make sure that it does not
read data or metadata from the file system without updating its sense
of the current state of the image.  This is to avoid the possibility
that the fs_status that the client holds will be one for an earlier
image, which would cause the client to accept a new file system
instance that is later than that but still earlier than the updated
data read by the client.

In order to accept valid images reliably, the client must do a
GETATTR of the fs_status attribute that follows any interrogation of
data or metadata within the file system in question.  Often this is
most conveniently done by appending such a GETATTR after all other
operations that reference a given file system.  When errors occur
between reading file system data and performing such a GETATTR, care
must be exercised to make sure that the data in question is not used
before obtaining the proper fs_status value.  In this connection,
when an OPEN is done within such a versioned file system and the
associated GETATTR of fs_status is not successfully completed, the
open file in question must not be accessed until that fs_status is
fetched.
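The checking discipline above can be sketched as follows.  This is a
non-normative illustration; the class and method names are invented
for exposition and are not part of the protocol, and fss_version
(an nfstime4 value) is modeled as a simple number.

```python
# Illustrative sketch of client-side handling of STATUS4_VERSIONED file
# systems: each COMPOUND touching the file system also fetches fs_status,
# and a candidate successor instance is rejected if its fss_version is
# earlier than the last version already seen.

STATUS4_VERSIONED = 3  # value from the fs4_status_type enum

class VersionedFsView:
    def __init__(self):
        self.last_version = None  # latest fss_version seen so far

    def record_status(self, fss_type, fss_version):
        """Record the fs_status from a GETATTR appended to a COMPOUND."""
        if fss_type != STATUS4_VERSIONED:
            raise ValueError("image no longer provides version guarantees")
        if self.last_version is None or fss_version > self.last_version:
            self.last_version = fss_version

    def acceptable_successor(self, fss_type, fss_version):
        """Is a candidate file system instance a valid successor image?"""
        if fss_type != STATUS4_VERSIONED:
            return False
        return self.last_version is None or fss_version >= self.last_version

view = VersionedFsView()
view.record_status(STATUS4_VERSIONED, 100)
view.record_status(STATUS4_VERSIONED, 105)
# An instance whose version is earlier than the last one seen is rejected:
assert not view.acceptable_successor(STATUS4_VERSIONED, 99)
assert view.acceptable_successor(STATUS4_VERSIONED, 105)
```

The same bookkeeping covers multiple responses in flight: recording
each returned fss_version and keeping the maximum implements "assemble
into the required partial order and use the last".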
The procedure above will ensure that, before using any data from the
file system, the client has in hand a newly fetched current version
of the file system image.  Multiple values for multiple requests in
flight can be resolved by assembling them into the required partial
order (and the elements should form a total order within the partial
order) and using the last.  The client may then, when switching among
file system instances, decline to use an instance that does not have
an fss_type of STATUS4_VERSIONED or whose fss_version field is
earlier than the last one obtained from the predecessor file system
instance.

12.  Parallel NFS (pNFS)

12.1.  Introduction

pNFS is an OPTIONAL feature within NFSv4.1; the pNFS feature set
allows direct client access to the storage devices containing file
data.  When file data for a single NFSv4 server is stored on multiple
and/or higher-throughput storage devices (by comparison to the
server's throughput capability), the result can be significantly
better file access performance.  The relationship among multiple
clients, a single server, and multiple storage devices for pNFS
(server and clients have access to all storage devices) is shown in
Figure 1.

   +-----------+
   |+-----------+                               +-----------+
   ||+-----------+                              |           |
   |||           |      NFSv4.1 + pNFS          |           |
   +||  Clients  |<---------------------------->|  Server   |
    +|           |                              |           |
     +-----------+                              |           |
         |||                                    +-----------+
         |||                                         |
         |||                                         |
         |||  Storage       +-----------+            |
         |||  Protocol      |+-----------+           |
         ||+----------------||+-----------+  Control |
         |+-----------------|||           |  Protocol|
         +------------------+||  Storage  |----------+
                             +|  Devices  |
                              +-----------+

                               Figure 1

In this model, the clients, server, and storage devices are
responsible for managing file access.
This is in contrast to NFSv4
without pNFS, where it is primarily the server's responsibility; some
of this responsibility may be delegated to the client under strictly
specified conditions.  See Section 12.2.5 for a discussion of the
Storage Protocol.  See Section 12.2.6 for a discussion of the Control
Protocol.

pNFS takes the form of OPTIONAL operations that manage protocol
objects called 'layouts' (Section 12.2.7) that contain a byte-range
and storage location information.  The layout is managed in a similar
fashion as NFSv4.1 data delegations.  For example, the layout is
leased, recallable, and revocable.  However, layouts are distinct
abstractions and are manipulated with new operations.  When a client
holds a layout, it is granted the ability to directly access the
byte-range at the storage location specified in the layout.

There are interactions between layouts and other NFSv4.1 abstractions
such as data delegations and byte-range locking.  Delegation issues
are discussed in Section 12.5.5.  Byte-range locking issues are
discussed in Sections 12.2.9 and 12.5.1.

12.2.  pNFS Definitions

NFSv4.1's pNFS feature provides parallel data access to a file system
that stripes its content across multiple storage servers.  The first
instantiation of pNFS, as part of NFSv4.1, separates the file system
protocol processing into two parts: metadata processing and data
processing.  Data consist of the contents of regular files that are
striped across storage servers.  Data striping occurs in at least two
ways: on a file-by-file basis and, within sufficiently large files,
on a block-by-block basis.  In contrast, striped access to metadata
by pNFS clients is not provided in NFSv4.1, even though the file
system back end of a pNFS server might stripe metadata.
Metadata
consist of everything else, including the contents of non-regular
files (e.g., directories); see Section 12.2.1.  The metadata
functionality is implemented by an NFSv4.1 server that supports pNFS
and the operations described in Section 18; such a server is called a
metadata server (Section 12.2.2).

The data functionality is implemented by one or more storage devices,
each of which is accessed by the client via a storage protocol.  A
subset (defined in Section 13.6) of NFSv4.1 is one such storage
protocol.  New terms are introduced to the NFSv4.1 nomenclature and
existing terms are clarified to allow for the description of the pNFS
feature.

12.2.1.  Metadata

Information about a file system object, such as its name, location
within the namespace, owner, ACL, and other attributes.  Metadata may
also include storage location information, and this will vary based
on the underlying storage mechanism that is used.

12.2.2.  Metadata Server

An NFSv4.1 server that supports the pNFS feature.  A variety of
architectural choices exist for the metadata server and its use of
file system information held at the server.  Some servers may contain
metadata only for file objects residing at the metadata server, while
the file data resides on associated storage devices.  Other metadata
servers may hold both metadata and a varying degree of file data.

12.2.3.  pNFS Client

An NFSv4.1 client that supports pNFS operations and supports at least
one storage protocol for performing I/O to storage devices.

12.2.4.  Storage Device

A storage device stores a regular file's data, but leaves metadata
management to the metadata server.
A storage device could be another
NFSv4.1 server, an object-based storage device (OSD), a block device
accessed over a System Area Network (SAN, e.g., either Fibre Channel
or iSCSI SAN), or some other entity.

12.2.5.  Storage Protocol

As noted in Figure 1, the storage protocol is the method used by the
client to store and retrieve data directly from the storage devices.

The NFSv4.1 pNFS feature has been structured to allow for a variety
of storage protocols to be defined and used.  One example storage
protocol is NFSv4.1 itself (as documented in Section 13).  Other
options for the storage protocol are described elsewhere and include:

o  Block/volume protocols such as Internet SCSI (iSCSI) [51] and FCP
   [52].  The block/volume protocol support can be independent of the
   addressing structure of the block/volume protocol used, allowing
   more than one protocol to access the same file data and enabling
   extensibility to other block/volume protocols.  See [44] for a
   layout specification that allows pNFS to use block/volume storage
   protocols.

o  Object protocols such as OSD over iSCSI or Fibre Channel [53].
   See [43] for a layout specification that allows pNFS to use object
   storage protocols.

Various storage protocols may be available to both client and server,
and a given client and server may not have a matching storage
protocol available to them.  Because of this, the pNFS server MUST
support normal NFSv4.1 access to any file accessible by the pNFS
feature; this will allow for continued interoperability between an
NFSv4.1 client and server.

12.2.6.  Control Protocol

As noted in Figure 1, the control protocol is used by the exported
file system between the metadata server and storage devices.
Specification of such protocols is outside the scope of the NFSv4.1
protocol.
Such control protocols would be used to control activities
such as the allocation and deallocation of storage, the management of
state required by the storage devices to perform client access
control, and, depending on the storage protocol, the enforcement of
authentication and authorization so that restrictions that would be
enforced by the metadata server are also enforced by the storage
device.

A particular control protocol is not REQUIRED by NFSv4.1, but
requirements are placed on the control protocol for maintaining
attributes like modify time, the change attribute, and the end-of-
file (EOF) position.  Note that if pNFS is layered over a clustered,
parallel file system (e.g., PVFS [54]), the mechanisms that enable
clustering and parallelism in that file system can be considered the
control protocol.

12.2.7.  Layout Types

A layout describes the mapping of a file's data to the storage
devices that hold the data.  A layout is said to belong to a specific
layout type (data type layouttype4, see Section 3.3.13).  The layout
type allows for variants to handle different storage protocols, such
as those associated with block/volume [44], object [43], and file
(Section 13) layout types.  A metadata server, along with its control
protocol, MUST support at least one layout type.  A private sub-range
of the layout type namespace is also defined.  Values from the
private layout type range MAY be used for internal testing or
experimentation (see Section 3.3.13).

As an example, the organization of the file layout type could be an
array of tuples (e.g., device ID, filehandle), along with a
definition of how the data is stored across the devices (e.g.,
striping).
A block/volume layout might be an array of tuples that
store <deviceID, block_number, block_count> along with information
about block size and the associated file offset of the block number.
An object layout might be an array of tuples <deviceID, objectID>
and an additional structure (i.e., the aggregation map) that defines
how the logical byte sequence of the file data is serialized into the
different objects.  Note that the actual layouts are typically more
complex than these simple expository examples.

Requests for pNFS-related operations will often specify a layout
type.  Examples of such operations are GETDEVICEINFO and LAYOUTGET.
The response for these operations will include structures such as a
device_addr4 or a layout4, each of which includes a layout type
within it.  The layout type sent by the server MUST always be the
same one requested by the client.  When a server sends a response
that includes a different layout type, the client SHOULD ignore the
response and behave as if the server had returned an error response.

12.2.8.  Layout

A layout defines how a file's data is organized on one or more
storage devices.  There are many potential layout types; each of the
layout types is differentiated by the storage protocol used to access
data and by the aggregation scheme that lays out the file data on the
underlying storage devices.  A layout is precisely identified by the
tuple <client ID, filehandle, layout type, iomode, range>, where
filehandle refers to the filehandle of the file on the metadata
server.

It is important to define when layouts overlap and/or conflict with
each other.  For two layouts with overlapping byte-ranges to actually
overlap each other, both layouts must be of the same layout type,
correspond to the same filehandle, and have the same iomode.
Layouts
conflict when they overlap and differ in the content of the layout
(i.e., the storage device/file mapping parameters differ).  Note that
differing iomodes do not lead to conflicting layouts.  It is
permissible for layouts with different iomodes, pertaining to the
same byte-range, to be held by the same client.  An example of this
would be copy-on-write functionality for a block/volume layout type.

12.2.9.  Layout Iomode

The layout iomode (data type layoutiomode4, see Section 3.3.20)
indicates to the metadata server the client's intent to perform
either just READ operations or a mixture containing READ and WRITE
operations.  For certain layout types, it is useful for a client to
specify this intent at the time it sends LAYOUTGET (Section 18.43).
For example, for block/volume-based protocols, block allocation could
occur when a LAYOUTIOMODE4_RW iomode is specified.  A special
LAYOUTIOMODE4_ANY iomode is defined and can only be used for
LAYOUTRETURN and CB_LAYOUTRECALL, not for LAYOUTGET.  It specifies
that layouts pertaining to both LAYOUTIOMODE4_READ and
LAYOUTIOMODE4_RW iomodes are being returned or recalled,
respectively.

A storage device may validate I/O with regard to the iomode; this is
dependent upon storage device implementation and layout type.  Thus,
if the client's layout iomode is inconsistent with the I/O being
performed, the storage device may reject the client's I/O with an
error indicating that a new layout with the correct iomode should be
obtained via LAYOUTGET.  For example, if a client gets a layout with
a LAYOUTIOMODE4_READ iomode and performs a WRITE to a storage device,
the storage device is allowed to reject that WRITE.
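The optional iomode validation described above can be sketched as
follows.  This is a non-normative illustration (the function name is
invented); the enum values match the layoutiomode4 data type.

```python
# Illustrative sketch: a storage device MAY check each I/O against the
# iomode of the layout under which the client is accessing the data.
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2

def io_consistent_with_iomode(iomode, is_write):
    """Return True if the I/O is consistent with the layout's iomode."""
    if is_write and iomode == LAYOUTIOMODE4_READ:
        # The device is allowed to reject a WRITE done under a READ layout;
        # the client should then get a LAYOUTIOMODE4_RW layout via LAYOUTGET.
        return False
    return True

assert io_consistent_with_iomode(LAYOUTIOMODE4_READ, is_write=False)
assert not io_consistent_with_iomode(LAYOUTIOMODE4_READ, is_write=True)
assert io_consistent_with_iomode(LAYOUTIOMODE4_RW, is_write=True)
```

Whether a device performs this check at all depends on the layout type
and the device implementation; some devices cannot check it.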
The use of the layout iomode does not conflict with OPEN share modes
or byte-range LOCK operations; open share mode and byte-range lock
conflicts are enforced as they are without the use of pNFS and are
logically separate from the pNFS layout level.  Open share modes and
byte-range locks are the preferred method for restricting user access
to data files.  For example, an OPEN of OPEN4_SHARE_ACCESS_WRITE does
not conflict with a LAYOUTGET containing an iomode of
LAYOUTIOMODE4_RW performed by another client.  Applications that
depend on writing into the same file concurrently may use byte-range
locking to serialize their accesses.

12.2.10.  Device IDs

The device ID (data type deviceid4, see Section 3.3.14) identifies a
group of storage devices.  The scope of a device ID is the pair
<client ID, layout type>.  In practice, a significant amount of
information may be required to fully address a storage device.
Rather than embedding all such information in a layout, layouts embed
device IDs.  The NFSv4.1 operation GETDEVICEINFO (Section 18.40) is
used to retrieve the complete address information (including all
device addresses for the device ID) regarding the storage device
according to its layout type and device ID.  For example, the address
of an NFSv4.1 data server or of an object-based storage device could
be an IP address and port.  The address of a block storage device
could be a volume label.

Clients cannot expect the mapping between a device ID and its storage
device address(es) to persist across metadata server restart.  See
Section 12.7.4 for a description of how recovery works in that
situation.

A device ID lives as long as there is a layout referring to the
device ID.  If there are no layouts referring to the device ID, the
server is free to delete the device ID at any time.
Once a device ID is
deleted by the server, the server MUST NOT reuse the device ID for
the same layout type and client ID again.  This requirement is
feasible because the device ID is 16 bytes long, leaving sufficient
room to store a generation number if the server's implementation
requires most of the rest of the device ID's content to be reused.
This requirement is necessary because otherwise the race conditions
between asynchronous notification of device ID addition and deletion
would be too difficult to sort out.

Device ID to device address mappings are not leased, and can be
changed at any time.  (Note that while device ID to device address
mappings are likely to change after the metadata server restarts, the
server is not required to change the mappings.)  A server has two
choices for changing mappings.  It can recall all layouts referring
to the device ID or it can use a notification mechanism.

The NFSv4.1 protocol has no optimal way to recall all layouts that
referred to a particular device ID (unless the server associates a
single device ID with a single fsid or a single client ID; in which
case, CB_LAYOUTRECALL has options for recalling all layouts
associated with the fsid, client ID pair, or just the client ID).

Via a notification mechanism (see Section 20.12), device ID to device
address mappings can change over the duration of server operation
without recalling or revoking the layouts that refer to the device
ID.

The notification mechanism can also delete a device ID, but only if
the client has no layouts referring to the device ID.  A notification
of a change to a device ID to device address mapping will immediately
or eventually invalidate some or all of the device ID's mappings.
The server MUST support notifications and the client must request
them before they can be used.
For further information about the
notification types, see Section 20.12.

12.3.  pNFS Operations

NFSv4.1 has several operations that are needed for pNFS servers,
regardless of layout type or storage protocol.  These operations are
all sent to a metadata server and summarized here.  While pNFS is an
OPTIONAL feature, if pNFS is implemented, some operations are
REQUIRED in order to comply with pNFS.  See Section 17.

These are the fore channel pNFS operations:

GETDEVICEINFO (Section 18.40), as noted previously
   (Section 12.2.10), returns the mapping of device ID to storage
   device address.

GETDEVICELIST (Section 18.41) allows clients to fetch all device IDs
   for a specific file system.

LAYOUTGET (Section 18.43) is used by a client to get a layout for a
   file.

LAYOUTCOMMIT (Section 18.42) is used to inform the metadata server
   of the client's intent to commit data that has been written to the
   storage device (the storage device as originally indicated in the
   return value of LAYOUTGET).

LAYOUTRETURN (Section 18.44) is used to return layouts for a file, a
   file system ID (FSID), or a client ID.

These are the backchannel pNFS operations:

CB_LAYOUTRECALL (Section 20.3) recalls a layout, all layouts
   belonging to a file system, or all layouts belonging to a client
   ID.

CB_RECALL_ANY (Section 20.6) tells a client that it needs to return
   some number of recallable objects, including layouts, to the
   metadata server.

CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a
   recallable object that it was denied (in case of pNFS, a layout
   denied by LAYOUTGET) due to resource exhaustion is now available.

CB_NOTIFY_DEVICEID (Section 20.12) notifies the client of changes to
   device IDs.

12.4.  pNFS Attributes

A number of attributes specific to pNFS are listed and described in
Section 5.12.
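The fore channel operations listed above combine into a typical pNFS
write cycle.  The sketch below is non-normative: the classes are toy
stand-ins (not a real NFS implementation) that only model who does
what, with file data flowing directly to the storage device rather
than through the metadata server.

```python
# Toy model of a pNFS write: LAYOUTGET, GETDEVICEINFO, direct I/O to the
# storage device, then LAYOUTCOMMIT back at the metadata server.

class StorageDevice:
    def __init__(self):
        self.blocks = {}                 # offset -> bytes, written by clients

    def write(self, layout, offset, data):
        assert layout.iomode == "RW"     # a device MAY validate the iomode
        self.blocks[offset] = data

class Layout:
    def __init__(self, device_id, iomode):
        self.device_id = device_id
        self.iomode = iomode

class MetadataServer:
    def __init__(self, devices):
        self.devices = devices           # device_id -> StorageDevice
        self.committed = []

    def layoutget(self, fh, offset, length, iomode):
        return Layout(device_id=0, iomode=iomode)   # trivially one device

    def getdeviceinfo(self, device_id):
        return self.devices[device_id]   # device ID -> device address(es)

    def layoutcommit(self, fh, offset, length):
        self.committed.append((fh, offset, length))  # update size/mtime etc.

mds = MetadataServer({0: StorageDevice()})
layout = mds.layoutget(fh="f1", offset=0, length=5, iomode="RW")  # LAYOUTGET
dev = mds.getdeviceinfo(layout.device_id)                         # GETDEVICEINFO
dev.write(layout, 0, b"hello")                                    # direct I/O
mds.layoutcommit(fh="f1", offset=0, length=5)                     # LAYOUTCOMMIT
assert dev.blocks[0] == b"hello"
```

A LAYOUTRETURN could follow, but as Section 12.5.2 notes, a layout may
also be cached past CLOSE.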
12.5.  Layout Semantics

12.5.1.  Guarantees Provided by Layouts

Layouts grant to the client the ability to access data located at a
storage device with the appropriate storage protocol.  The client is
guaranteed the layout will be recalled when one of two things occurs:
either a conflicting layout is requested or the state encapsulated by
the layout becomes invalid (this can happen when an event directly or
indirectly modifies the layout).  When a layout is recalled and
returned by the client, the client continues with the ability to
access file data with normal NFSv4.1 operations through the metadata
server.  Only the ability to access the storage devices is affected.

The requirement of NFSv4.1 that all user access rights MUST be
obtained through the appropriate OPEN, LOCK, and ACCESS operations is
not modified by the existence of layouts.  Layouts are provided to
NFSv4.1 clients, and user access still follows the rules of the
protocol as if they did not exist.  It is a requirement that for a
client to access a storage device, a layout must be held by the
client.  If a storage device receives an I/O request for a byte-range
for which the client does not hold a layout, the storage device
SHOULD reject that I/O request.  Note that the act of modifying a
file for which a layout is held does not necessarily conflict with
the holding of the layout that describes the file being modified.
Therefore, it is the requirement of the storage protocol or layout
type that determines the necessary behavior.  For example, block/
volume layout types require that the layout's iomode agree with the
type of I/O being performed.

Depending upon the layout type and storage protocol in use, storage
device access permissions may be granted by LAYOUTGET and may be
encoded within the type-specific layout.
For an example of storage
device access permissions, see an object-based protocol such as [53].
If access permissions are encoded within the layout, the metadata
server SHOULD recall the layout when those permissions become invalid
for any reason -- for example, when a file becomes unwritable or
inaccessible to a client.  Note that clients are still required to
perform the appropriate OPEN, LOCK, and ACCESS operations as
described above.  The degree to which it is possible for the client
to circumvent these operations and the consequences of doing so must
be clearly specified by the individual layout type specifications.
In addition, these specifications must be clear about the
requirements and non-requirements for the checking performed by the
server.

In the presence of pNFS functionality, mandatory byte-range locks
MUST behave as they would without pNFS.  Therefore, if mandatory file
locks and layouts are provided simultaneously, the storage device
MUST be able to enforce the mandatory byte-range locks.  For example,
if one client obtains a mandatory byte-range lock and a second client
accesses the storage device, the storage device MUST appropriately
restrict I/O for the range of the mandatory byte-range lock.  If the
storage device is incapable of providing this check in the presence
of mandatory byte-range locks, then the metadata server MUST NOT
grant layouts and mandatory byte-range locks simultaneously.

12.5.2.  Getting a Layout

A client obtains a layout with the LAYOUTGET operation.  The metadata
server will grant layouts of a particular type (e.g., block/volume,
object, or file).  The client selects an appropriate layout type that
the server supports and the client is prepared to use.
The layout
returned to the client might not exactly match the requested byte-
range, as described in Section 18.43.3.  As needed, a client may send
multiple LAYOUTGET operations; these might result in multiple
overlapping, non-conflicting layouts (see Section 12.2.8).

In order to get a layout, the client must first have opened the file
via the OPEN operation.  When a client has no layout on a file, it
MUST present an open stateid, a delegation stateid, or a byte-range
lock stateid in the loga_stateid argument.  A successful LAYOUTGET
result includes a layout stateid.  The first successful LAYOUTGET
processed by the server using a non-layout stateid as an argument
MUST have the "seqid" field of the layout stateid in the response set
to one.  Thereafter, the client MUST use a layout stateid (see
Section 12.5.3) on future invocations of LAYOUTGET on the file, and
the "seqid" MUST NOT be set to zero.  Once the layout has been
retrieved, it can be held across multiple OPEN and CLOSE sequences.
Therefore, a client may hold a layout for a file that is not
currently open by any user on the client.  This allows for the
caching of layouts beyond CLOSE.

The storage protocol used by the client to access the data on the
storage device is determined by the layout's type.  The client is
responsible for matching the layout type with an available method to
interpret and use the layout.  The method for this layout type
selection is outside the scope of the pNFS functionality.

Although the metadata server is in control of the layout for a file,
the pNFS client can provide hints to the server when a file is opened
or created about the preferred layout type and aggregation schemes.
pNFS introduces a layout_hint attribute (Section 5.12.4) that the
client can set at file creation time to provide a hint to the server
for new files.  Setting this attribute separately, after the file has
been created, might make it difficult, or impossible, for the server
implementation to comply.

Because the EXCLUSIVE4 createmode4 does not allow the setting of
attributes at file creation time, NFSv4.1 introduces the EXCLUSIVE4_1
createmode4, which does allow attributes to be set at file creation
time.  In addition, if the session is created with persistent reply
caches, EXCLUSIVE4_1 is neither necessary nor allowed.  Instead,
GUARDED4 both works better and is prescribed.  Table 10 in
Section 18.16.3 summarizes how a client is allowed to send an
exclusive create.

12.5.3.  Layout Stateid

As with all other stateids, the layout stateid consists of a "seqid"
and "other" field.  Once a layout stateid is established, the "other"
field will stay constant unless the stateid is revoked or the client
returns all layouts on the file and the server disposes of the
stateid.  The "seqid" field is initially set to one, and is never
zero on any NFSv4.1 operation that uses layout stateids, whether it
is a fore channel or backchannel operation.  After the layout stateid
is established, the server increments by one the value of the "seqid"
in each subsequent LAYOUTGET and LAYOUTRETURN response and in each
CB_LAYOUTRECALL request.

Given the design goal of pNFS to provide parallelism, the layout
stateid differs from other stateid types in that the client is
expected to send LAYOUTGET and LAYOUTRETURN operations in parallel.
The "seqid" value is used by the client to properly sort responses to
LAYOUTGET and LAYOUTRETURN.  The "seqid" is also used to prevent race
conditions between LAYOUTGET and CB_LAYOUTRECALL.
Given that the processing rules differ between layout stateids and other stateid types, only the pNFS sections of this document should be considered to determine proper layout stateid handling.

Once the client receives a layout stateid, it MUST use the correct "seqid" for subsequent LAYOUTGET or LAYOUTRETURN operations. The correct "seqid" is defined as the highest "seqid" value from responses of fully processed LAYOUTGET or LAYOUTRETURN operations or arguments of a fully processed CB_LAYOUTRECALL operation. Since the server is incrementing the "seqid" value on each layout operation, the client may determine the order of operation processing by inspecting the "seqid" value. In the case of overlapping layout ranges, the ordering information will provide the client the knowledge of which layout ranges are held. Note that overlapping layout ranges may occur because of the client's specific requests or because the server is allowed to expand the range of a requested layout and notify the client in the LAYOUTRETURN results. Additional layout stateid sequencing requirements are provided in Section 12.5.5.2.

The client's receipt of a "seqid" is not sufficient for subsequent use. The client must fully process the operations before the "seqid" can be used. For LAYOUTGET results, if the client is not using the forgetful model (Section 12.5.5.1), it MUST first update its record of what ranges of the file's layout it has before using the seqid. For LAYOUTRETURN results, the client MUST delete the range from its record of what ranges of the file's layout it had before using the seqid. For CB_LAYOUTRECALL arguments, the client MUST send a response to the recall before using the seqid. The fundamental requirement in client processing is that the "seqid" is used to provide the order of processing.
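The seqid bookkeeping described above can be sketched as follows. This is a non-normative illustration: the class and method names are invented for this sketch, and seqid wraparound (Section 12.5.5.2.1.4) is deliberately ignored for clarity.

```python
# Illustrative sketch only; names are not from the NFSv4.1 XDR.
class LayoutStateidTracker:
    """Tracks the client's "correct seqid" for one file: the highest
    "seqid" taken from fully processed LAYOUTGET/LAYOUTRETURN results
    or fully processed CB_LAYOUTRECALL arguments."""

    def __init__(self):
        self.current_seqid = 0  # no layout stateid established yet

    def operation_fully_processed(self, seqid):
        # A seqid may be recorded only after the operation carrying it
        # has been fully processed.  Replies to parallel operations may
        # arrive out of order, so keep the highest value seen.
        if seqid > self.current_seqid:
            self.current_seqid = seqid

    def recall_is_in_order(self, recall_seqid):
        # CB_LAYOUTRECALL requests must be processed in seqid order:
        # an in-order recall carries exactly the next seqid.
        return recall_seqid == self.current_seqid + 1
```

A real client would additionally keep per-range layout records (unless using the forgetful model) and would apply the modulo arithmetic of Section 12.5.5.2.1.4 instead of plain integer comparison.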
LAYOUTGET results may be processed in parallel. LAYOUTRETURN results may be processed in parallel. LAYOUTGET and LAYOUTRETURN responses may be processed in parallel as long as the ranges do not overlap. CB_LAYOUTRECALL requests MUST be processed in "seqid" order at all times.

Once a client has no more layouts on a file, the layout stateid is no longer valid and MUST NOT be used. Any attempt to use such a layout stateid will result in NFS4ERR_BAD_STATEID.

12.5.4. Committing a Layout

Allowing for varying storage protocol capabilities, the pNFS protocol does not require the metadata server and storage devices to have a consistent view of file attributes and data location mappings. Data location mapping refers to aspects such as which offsets store data as opposed to storing holes (see Section 13.4.4 for a discussion). Related issues arise for storage protocols where a layout may hold provisionally allocated blocks where the allocation of those blocks does not survive a complete restart of both the client and server. Because of this inconsistency, it is necessary to resynchronize the client with the metadata server and its storage devices and make any potential changes available to other clients. This is accomplished by use of the LAYOUTCOMMIT operation.

The LAYOUTCOMMIT operation is responsible for committing a modified layout to the metadata server. The data should be written and committed to the appropriate storage devices before the LAYOUTCOMMIT occurs. The scope of the LAYOUTCOMMIT operation depends on the storage protocol in use. It is important to note that the level of synchronization is from the point of view of the client that sent the LAYOUTCOMMIT.
The updated state on the metadata server need only reflect the state as of the client's last operation previous to the LAYOUTCOMMIT. The metadata server is not REQUIRED to maintain a global view that accounts for other clients' I/O that may have occurred within the same time frame.

For block/volume-based layouts, LAYOUTCOMMIT may require updating the block list that comprises the file and committing this layout to stable storage. For file-based layouts, synchronization of attributes between the metadata and storage devices, primarily the size attribute, is required.

The control protocol is free to synchronize the attributes before it receives a LAYOUTCOMMIT; however, upon successful completion of a LAYOUTCOMMIT, state that exists on the metadata server that describes the file MUST be synchronized with the state that exists on the storage devices that comprise that file as of the client's last sent operation. Thus, a client that queries the size of a file between a WRITE to a storage device and the LAYOUTCOMMIT might observe a size that does not reflect the actual data written.

The client MUST have a layout in order to send a LAYOUTCOMMIT operation.

12.5.4.1. LAYOUTCOMMIT and change/time_modify

The change and time_modify attributes may be updated by the server when the LAYOUTCOMMIT operation is processed. The reason for this is that some layout types do not support the update of these attributes when the storage devices process I/O operations. If a client has a layout with the LAYOUTIOMODE4_RW iomode on the file, the client MAY provide a suggested value to the server for time_modify within the arguments to LAYOUTCOMMIT. Based on the layout type, the provided value may or may not be used. The server should sanity-check the client-provided values before they are used.
For example, the server should ensure that time does not flow backwards. The client always has the option to set time_modify through an explicit SETATTR operation.

For some layout protocols, the storage device is able to notify the metadata server of the occurrence of an I/O; as a result, the change and time_modify attributes may be updated at the metadata server. For a metadata server that is capable of monitoring updates to the change and time_modify attributes, LAYOUTCOMMIT processing is not required to update the change attribute. In this case, the metadata server must ensure that no further update to the data has occurred since the last update of the attributes; file-based protocols may have enough information to make this determination or may update the change attribute upon each file modification. This also applies for the time_modify attribute. If the server implementation is able to determine that the file has not been modified since the last time_modify update, the server need not update time_modify at LAYOUTCOMMIT. At LAYOUTCOMMIT completion, the updated attributes should be visible if that file was modified since the latest previous LAYOUTCOMMIT or LAYOUTGET.

12.5.4.2. LAYOUTCOMMIT and size

The size of a file may be updated when the LAYOUTCOMMIT operation is used by the client. One of the fields in the argument to LAYOUTCOMMIT is loca_last_write_offset; this field indicates the highest byte offset written but not yet committed with the LAYOUTCOMMIT operation. The data type of loca_last_write_offset is newoffset4 and is switched on a boolean value, no_newoffset, that indicates if a previous write occurred or not. If no_newoffset is FALSE, an offset is not given.
If the client has a layout with LAYOUTIOMODE4_RW iomode on the file, with a byte-range (denoted by the values of lo_offset and lo_length) that overlaps loca_last_write_offset, then the client MAY set no_newoffset to TRUE and provide an offset that will update the file size. Keep in mind that offset is not the same as length, though they are related. For example, a loca_last_write_offset value of zero means that one byte was written at offset zero, and so the length of the file is at least one byte.

The metadata server may do one of the following:

1. Update the file's size using the last write offset provided by the client as either the true file size or as a hint of the file size. If the metadata server has a method available, any new value for file size should be sanity-checked. For example, the file must not be truncated if the client presents a last write offset less than the file's current size.

2. Ignore the client-provided last write offset; the metadata server must have sufficient knowledge from other sources to determine the file's size. For example, the metadata server queries the storage devices with the control protocol.

The method chosen to update the file's size will depend on the storage device's and/or the control protocol's capabilities. For example, if the storage devices are block devices with no knowledge of file size, the metadata server must rely on the client to set the last write offset appropriately.

The results of LAYOUTCOMMIT contain a new size value in the form of a newsize4 union data type. If the file's size is set as a result of LAYOUTCOMMIT, the metadata server must reply with the new size; otherwise, the new size is not provided.
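The offset-versus-length distinction and the truncation sanity check above can be made concrete with a small, non-normative sketch of option 1 (the function name and return convention are invented for illustration):

```python
# Illustrative sketch of a metadata server using loca_last_write_offset
# as a size hint at LAYOUTCOMMIT time.  Not normative.
def updated_file_size(current_size, no_newoffset, last_write_offset):
    """Return (new_size, size_changed) after LAYOUTCOMMIT.

    A last write offset of N means byte N was written, so the file
    holds at least N + 1 bytes; the offset is not a length.
    """
    if not no_newoffset:
        # no_newoffset FALSE: the client supplied no offset.
        return current_size, False
    implied_size = last_write_offset + 1
    if implied_size <= current_size:
        # Sanity check: a last write offset below the current size
        # must not truncate the file.
        return current_size, False
    return implied_size, True
```

For example, a loca_last_write_offset of 199 against a 100-byte file implies a new size of 200, while an offset of 99 leaves the size unchanged.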
If the file size is updated, the metadata server SHOULD update the storage devices such that the new file size is reflected when LAYOUTCOMMIT processing is complete. For example, the client should be able to read up to the new file size.

The client can extend the length of a file or truncate a file by sending a SETATTR operation to the metadata server with the size attribute specified. If the size specified is larger than the current size of the file, the file is "zero extended", i.e., zeros are implicitly added between the file's previous EOF and the new EOF. (In many implementations, the zero-extended byte-range of the file consists of unallocated holes in the file.) When the client writes past EOF via WRITE, the SETATTR operation does not need to be used.

12.5.4.3. LAYOUTCOMMIT and layoutupdate

The LAYOUTCOMMIT argument contains a loca_layoutupdate field (Section 18.42.1) of data type layoutupdate4 (Section 3.3.18). This argument is a layout-type-specific structure. The structure can be used to pass arbitrary layout-type-specific information from the client to the metadata server at LAYOUTCOMMIT time. For example, if using a block/volume layout, the client can indicate to the metadata server which reserved or allocated blocks the client used or did not use. The content of loca_layoutupdate (field lou_body) need not be the same layout-type-specific content returned by LAYOUTGET (Section 18.43.2) in the loc_body field of the lo_content field of the logr_layout field. The content of loca_layoutupdate is defined by the layout type specification and is opaque to LAYOUTCOMMIT.

12.5.5. Recalling a Layout

Since a layout protects a client's access to a file via a direct client-storage-device path, a layout need only be recalled when it is semantically unable to serve this function.
Typically, this occurs when the layout no longer encapsulates the true location of the file over the byte-range it represents. Any operation or action, such as server-driven restriping or load balancing, that changes the layout will result in a recall of the layout. A layout is recalled by the CB_LAYOUTRECALL callback operation (see Section 20.3) and returned with LAYOUTRETURN (see Section 18.44). The CB_LAYOUTRECALL operation may recall a layout identified by a byte-range, all layouts associated with a file system ID (FSID), or all layouts associated with a client ID. Section 12.5.5.2 discusses sequencing issues surrounding the getting, returning, and recalling of layouts.

An iomode is also specified when recalling a layout. Generally, the iomode in the recall request must match the layout being returned; for example, a recall with an iomode of LAYOUTIOMODE4_RW should cause the client to only return LAYOUTIOMODE4_RW layouts and not LAYOUTIOMODE4_READ layouts. However, a special LAYOUTIOMODE4_ANY enumeration is defined to enable recalling a layout of any iomode; in other words, the client must return both LAYOUTIOMODE4_READ and LAYOUTIOMODE4_RW layouts.

A REMOVE operation SHOULD cause the metadata server to recall the layout to prevent the client from accessing a non-existent file and to reclaim state stored on the client. Since a REMOVE may be delayed until the last close of the file has occurred, the recall may also be delayed until this time. After the last reference on the file has been released and the file has been removed, the client should no longer be able to perform I/O using the layout. In the case of a file-based layout, the data server SHOULD return NFS4ERR_STALE in response to any operation on the removed file.
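The iomode matching rule above amounts to a simple predicate. The sketch below is illustrative only; the enum values mirror the NFSv4.1 layoutiomode4 names, but the helper function itself is an invented example:

```python
# Values from the NFSv4.1 layoutiomode4 enumeration.
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2
LAYOUTIOMODE4_ANY = 3   # valid only in recalls, not in held layouts

def layout_matches_recall(layout_iomode, recall_iomode):
    """Does a held layout fall under a CB_LAYOUTRECALL's iomode?"""
    if recall_iomode == LAYOUTIOMODE4_ANY:
        # ANY recalls layouts of every iomode: both READ and RW.
        return True
    # Otherwise the recall only covers layouts of the named iomode.
    return layout_iomode == recall_iomode
```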
Once a layout has been returned, the client MUST NOT send I/Os to the storage devices for the file, byte-range, and iomode represented by the returned layout. If a client does send an I/O to a storage device for which it does not hold a layout, the storage device SHOULD reject the I/O.

Although pNFS does not alter the file data caching capabilities of clients, or their semantics, it recognizes that some clients may perform more aggressive write-behind caching to optimize the benefits provided by pNFS. However, write-behind caching may negatively affect the latency in returning a layout in response to a CB_LAYOUTRECALL; this is similar to file delegations and the impact that file data caching has on DELEGRETURN. Client implementations SHOULD limit the amount of unwritten data they have outstanding at any one time in order to prevent excessively long responses to CB_LAYOUTRECALL. Once a layout is recalled, a server MUST wait one lease period before taking further action. As soon as a lease period has passed, the server may choose to fence the client's access to the storage devices if the server perceives the client has taken too long to return a layout. However, just as in the case of data delegation and DELEGRETURN, the server may choose to wait, given that the client is showing forward progress on its way to returning the layout. This forward progress can take the form of successful interaction with the storage devices or of sub-portions of the layout being returned by the client. The server can also limit exposure to these problems by limiting the byte-ranges initially provided in the layouts and thus the amount of outstanding modified data.

12.5.5.1. Layout Recall Callback Robustness

It has been assumed thus far that pNFS client state (layout ranges and iomode) for a file exactly matches that of the pNFS server for that file. This assumption leads to the implication that any callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that exactly match the range in the callback, since both client and server agree about the state being maintained. However, it can be useful if this assumption does not always hold. For example:

o If conflicts that require callbacks are very rare, and a server can use a multi-file callback to recover per-client resources (e.g., via an FSID recall or a multi-file recall within a single CB_COMPOUND), the result may be significantly less client-server pNFS traffic.

o It may be useful for servers to maintain information about what ranges are held by a client on a coarse-grained basis, leading to the server's layout ranges being beyond those actually held by the client. In the extreme, a server could manage conflicts on a per-file basis, only sending whole-file callbacks even though clients may request and be granted sub-file ranges.

o It may be useful for clients to "forget" details about what layouts and ranges the client actually has, leading to the server's layout ranges being beyond those that the client "thinks" it has. As long as the client does not assume it has layouts that are beyond what the server has granted, this is a safe practice. When a client forgets what ranges and layouts it has, and it receives a CB_LAYOUTRECALL operation, the client MUST follow up with a LAYOUTRETURN for what the server recalled, or alternatively return the NFS4ERR_NOMATCHING_LAYOUT error if it has no layout to return in the recalled range.
o In order to avoid errors, it is vital that a client not assign itself layout permissions beyond what the server has granted, and that the server not forget layout permissions that have been granted. On the other hand, if a server believes that a client holds a layout that the client does not know about, it is useful for the client to cleanly indicate completion of the requested recall either by sending a LAYOUTRETURN operation for the entire requested range or by returning an NFS4ERR_NOMATCHING_LAYOUT error to the CB_LAYOUTRECALL.

Thus, in light of the above, it is useful for a server to be able to send callbacks for layout ranges it has not granted to a client, and for a client to return ranges it does not hold. A pNFS client MUST always return layouts that comprise the full range specified by the recall. Note, the full recalled layout range need not be returned as part of a single operation, but may be returned in portions. This allows the client to stage the flushing of dirty data and commits and returns of layouts. Also, it indicates to the metadata server that the client is making progress.

When a layout is returned, the client MUST NOT have any outstanding I/O requests to the storage devices involved in the layout. Rephrasing, the client MUST NOT return the layout while it has outstanding I/O requests to the storage device.

Even with this requirement for the client, it is possible that I/O requests may be presented to a storage device no longer allowed to perform them. Since the server has no strict control as to when the client will return the layout, the server may later decide to unilaterally revoke the client's access to the storage devices as provided by the layout.
In choosing to revoke access, the server must deal with the possibility of lingering I/O requests, i.e., I/O requests that are still in flight to storage devices identified by the revoked layout. All layout type specifications MUST define whether unilateral layout revocation by the metadata server is supported; if it is, the specification must also describe how lingering writes are processed. For example, storage devices identified by the revoked layout could be fenced off from the client that held the layout.

In order to ensure client/server convergence with regard to layout state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN operations for a particular recall MUST specify the entire range being recalled, echoing the recalled layout type, iomode, recall/return type (FILE, FSID, or ALL), and byte-range, even if layouts pertaining to partial ranges were previously returned. In addition, if the client holds no layouts that overlap the range being recalled, the client should return the NFS4ERR_NOMATCHING_LAYOUT error code to CB_LAYOUTRECALL. This allows the server to update its view of the client's layout state.

12.5.5.2. Sequencing of Layout Operations

As with other stateful operations, pNFS requires the correct sequencing of layout operations. pNFS uses the "seqid" in the layout stateid to provide the correct sequencing between regular operations and callbacks. It is the server's responsibility to avoid inconsistencies regarding the layouts provided and the client's responsibility to properly serialize its layout requests and layout returns.

12.5.5.2.1. Layout Recall and Return Sequencing

One critical issue with regard to layout operations sequencing concerns callbacks.
The protocol must defend against races between the reply to a LAYOUTGET or LAYOUTRETURN operation and a subsequent CB_LAYOUTRECALL. A client MUST NOT process a CB_LAYOUTRECALL that implies one or more outstanding LAYOUTGET or LAYOUTRETURN operations to which the client has not yet received a reply. The client detects such a CB_LAYOUTRECALL by examining the "seqid" field of the recall's layout stateid. If the "seqid" is not exactly one higher than what the client currently has recorded, and the client has at least one LAYOUTGET and/or LAYOUTRETURN operation outstanding, the client knows the server sent the CB_LAYOUTRECALL after sending a response to an outstanding LAYOUTGET or LAYOUTRETURN. The client MUST wait before processing such a CB_LAYOUTRECALL until it processes all replies for outstanding LAYOUTGET and LAYOUTRETURN operations for the corresponding file with seqid less than the seqid given by CB_LAYOUTRECALL (lor_stateid; see Section 20.3).

In addition to the seqid-based mechanism, Section 2.10.6.3 describes the sessions mechanism for allowing the client to detect callback race conditions and delay processing such a CB_LAYOUTRECALL. The server MAY reference conflicting operations in the CB_SEQUENCE that precedes the CB_LAYOUTRECALL. Because the server has already sent replies for these operations before sending the callback, the replies may race with the CB_LAYOUTRECALL. The client MUST wait for all the referenced calls to complete and update its view of the layout state before processing the CB_LAYOUTRECALL.

12.5.5.2.1.1. Get/Return Sequencing

The protocol allows the client to send concurrent LAYOUTGET and LAYOUTRETURN operations to the server. The protocol does not provide any means for the server to process the requests in the same order in which they were created.
However, through the use of the "seqid" field in the layout stateid, the client can determine the order in which parallel outstanding operations were processed by the server. Thus, when a layout retrieved by an outstanding LAYOUTGET operation intersects with a layout returned by an outstanding LAYOUTRETURN on the same file, the order in which the two conflicting operations are processed determines the final state of the overlapping layout. The order is determined by the "seqid" returned in each operation: the operation with the higher seqid was executed later.

It is permissible for the client to send multiple parallel LAYOUTGET operations for the same file or multiple parallel LAYOUTRETURN operations for the same file or a mix of both.

It is permissible for the client to use the current stateid (see Section 16.2.3.1.2) for LAYOUTGET operations, for example, when compounding LAYOUTGETs or compounding OPEN and LAYOUTGETs. It is also permissible to use the current stateid when compounding LAYOUTRETURNs.

It is permissible for the client to use the current stateid when combining LAYOUTRETURN and LAYOUTGET operations for the same file in the same COMPOUND request since the server MUST process these in order. However, if a client does send such COMPOUND requests, it MUST NOT have more than one outstanding for the same file at the same time, and it MUST NOT have other LAYOUTGET or LAYOUTRETURN operations outstanding at the same time for that same file.

12.5.5.2.1.2. Client Considerations

Consider a pNFS client that has sent a LAYOUTGET, and before it receives the reply to LAYOUTGET, it receives a CB_LAYOUTRECALL for the same file with an overlapping range. There are two possibilities, which the client can distinguish via the layout stateid in the recall.

1. The server processed the LAYOUTGET before sending the recall, so the LAYOUTGET must be waited for because it may be carrying layout information that will need to be returned to deal with the CB_LAYOUTRECALL.

2. The server sent the callback before receiving the LAYOUTGET. The server will not respond to the LAYOUTGET until the CB_LAYOUTRECALL is processed.

If these possibilities cannot be distinguished, a deadlock could result, as the client must wait for the LAYOUTGET response before processing the recall in the first case, but that response will not arrive until after the recall is processed in the second case. Note that in the first case, the "seqid" in the layout stateid of the recall is two greater than what the client has recorded; in the second case, the "seqid" is one greater than what the client has recorded. This allows the client to disambiguate between the two cases, so it knows precisely which possibility applies.

In case 1, the client knows it needs to wait for the LAYOUTGET response before processing the recall (or the client can return NFS4ERR_DELAY).

In case 2, the client will not wait for the LAYOUTGET response before processing the recall because waiting would cause deadlock. Therefore, the action at the client will only require waiting in the case that the client has not yet seen the server's earlier responses to the LAYOUTGET operation(s).

The recall process can be considered completed when the final LAYOUTRETURN operation for the recalled range is completed. The LAYOUTRETURN uses the layout stateid (with seqid) specified in CB_LAYOUTRECALL. If the client uses multiple LAYOUTRETURNs in processing the recall, the first LAYOUTRETURN will use the layout stateid as specified in CB_LAYOUTRECALL.
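The two-case disambiguation above reduces to comparing the recall's seqid against the client's recorded seqid. The following is a non-normative sketch (the function name and return strings are invented, one outstanding LAYOUTGET is assumed, and wraparound per Section 12.5.5.2.1.4 is ignored):

```python
# Illustrative sketch of the client-side race disambiguation.
def classify_recall(recall_seqid, recorded_seqid):
    """recorded_seqid is the client's highest fully processed seqid;
    exactly one LAYOUTGET is assumed outstanding."""
    if recall_seqid == recorded_seqid + 2:
        # Case 1: the server processed the LAYOUTGET first, then sent
        # the recall; wait for the LAYOUTGET reply (or return
        # NFS4ERR_DELAY) before handling the recall.
        return "wait-for-layoutget"
    if recall_seqid == recorded_seqid + 1:
        # Case 2: the recall was sent before the LAYOUTGET was
        # processed; handle the recall now, since the LAYOUTGET reply
        # will not arrive until the recall is processed.
        return "process-recall-now"
    # More than one operation outstanding, or replies not yet seen.
    return "out-of-order"
```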
Subsequent LAYOUTRETURNs will use the highest seqid as is the usual case.

12.5.5.2.1.3. Server Considerations

Consider a race from the metadata server's point of view. The metadata server has sent a CB_LAYOUTRECALL and receives an overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s) that respond to the CB_LAYOUTRECALL. There are three cases:

1. The client sent the LAYOUTGET before processing the CB_LAYOUTRECALL. The "seqid" in the layout stateid of the arguments of LAYOUTGET is one less than the "seqid" in CB_LAYOUTRECALL. The server returns NFS4ERR_RECALLCONFLICT to the client, which indicates to the client that there is a pending recall.

2. The client sent the LAYOUTGET after processing the CB_LAYOUTRECALL, but the LAYOUTGET arrived before the LAYOUTRETURN and the response to CB_LAYOUTRECALL that completed that processing. The "seqid" in the layout stateid of LAYOUTGET is equal to or greater than that of the "seqid" in CB_LAYOUTRECALL. The server has not received a response to the CB_LAYOUTRECALL, so it returns NFS4ERR_RECALLCONFLICT.

3. The client sent the LAYOUTGET after processing the CB_LAYOUTRECALL; the server received the CB_LAYOUTRECALL response, but the LAYOUTGET arrived before the LAYOUTRETURN that completed that processing. The "seqid" in the layout stateid of LAYOUTGET is equal to that of the "seqid" in CB_LAYOUTRECALL. The server has received a response to the CB_LAYOUTRECALL, so it returns NFS4ERR_RETURNCONFLICT.

12.5.5.2.1.4. Wraparound and Validation of Seqid

The rules for layout stateid processing differ from other stateids in the protocol because the "seqid" value cannot be zero and the stateid's "seqid" value changes in a CB_LAYOUTRECALL operation.
The non-zero requirement combined with the inherent parallelism of layout operations means that a set of LAYOUTGET and LAYOUTRETURN operations may contain the same value for "seqid". The server uses a slightly modified version of the modulo arithmetic as described in Section 2.10.6.1 when incrementing the layout stateid's "seqid". The difference is that zero is not a valid value for "seqid"; when the value of a "seqid" is 0xFFFFFFFF, the next valid value will be 0x00000001. The modulo arithmetic is also used for the comparisons of "seqid" values in the processing of CB_LAYOUTRECALL events as described above in Section 12.5.5.2.1.3.

Just as the server validates the "seqid" in the event of CB_LAYOUTRECALL usage, as described in Section 12.5.5.2.1.3, the server also validates the "seqid" value to ensure that it is within an appropriate range. This range represents the degree of parallelism the server supports for layout stateids. If the client is sending multiple layout operations to the server in parallel, by definition, the "seqid" value in the supplied stateid will not be the current "seqid" as held by the server. The range of parallelism spans from the highest or current "seqid" to a "seqid" value in the past. To assist in the discussion, the server's current "seqid" value for a layout stateid is defined as SERVER_CURRENT_SEQID. The lowest "seqid" value that is acceptable to the server is represented by PAST_SEQID. And the value for the range of valid "seqid"s or range of parallelism is VALID_SEQID_RANGE. Therefore, the following holds: VALID_SEQID_RANGE = SERVER_CURRENT_SEQID - PAST_SEQID. In the following, all arithmetic is the modulo arithmetic as described above.

The server MUST support a minimum VALID_SEQID_RANGE.
The minimum is defined as: VALID_SEQID_RANGE = summation over 1..N of (ca_maxoperations(i) - 1), where N is the number of session fore channels and ca_maxoperations(i) is the value of the ca_maxoperations returned from CREATE_SESSION of the i'th session. The reason for "- 1" is to allow for the required SEQUENCE operation. The server MAY support a VALID_SEQID_RANGE value larger than the minimum. The maximum VALID_SEQID_RANGE is (2^32 - 2) (accounting for zero not being a valid "seqid" value).

If the server finds the "seqid" is zero, the NFS4ERR_BAD_STATEID error is returned to the client. The server further validates the "seqid" to ensure it is within the range of parallelism, VALID_SEQID_RANGE. If the "seqid" value is outside of that range, the error NFS4ERR_OLD_STATEID is returned to the client. Upon receipt of NFS4ERR_OLD_STATEID, the client updates the stateid in the layout request based on processing of other layout requests and re-sends the operation to the server.

12.5.5.2.1.5. Bulk Recall and Return

pNFS supports recalling and returning all layouts that are for files belonging to a particular fsid (LAYOUTRECALL4_FSID, LAYOUTRETURN4_FSID) or client ID (LAYOUTRECALL4_ALL, LAYOUTRETURN4_ALL). There are no "bulk" stateids, so detection of races via the seqid is not possible. The server MUST NOT initiate bulk recall while another recall is in progress, or the corresponding LAYOUTRETURN is in progress or pending. In the event the server sends a bulk recall while the client has a pending or in-progress LAYOUTRETURN, CB_LAYOUTRECALL, or LAYOUTGET, the client returns NFS4ERR_DELAY. In the event the client sends a LAYOUTGET or LAYOUTRETURN while a bulk recall is in progress, the server returns NFS4ERR_RECALLCONFLICT.
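The zero-skipping wraparound and range validation described in Section 12.5.5.2.1.4 can be sketched as follows. This is a non-normative illustration: the function names are invented, and the exact boundary treatment of PAST_SEQID (here taken as still acceptable) is an assumption of this sketch.

```python
# Illustrative sketch of the modified modulo arithmetic: zero is never
# a valid "seqid", so 0xFFFFFFFF is followed by 0x00000001.
MAX_SEQID = 0xFFFFFFFF   # 2^32 - 1 valid values: 1 .. 0xFFFFFFFF

def next_seqid(seqid):
    return 1 if seqid == MAX_SEQID else seqid + 1

def validate_seqid(seqid, server_current_seqid, valid_seqid_range):
    """Mirror the validation rules: zero -> NFS4ERR_BAD_STATEID,
    older than the range of parallelism -> NFS4ERR_OLD_STATEID."""
    if seqid == 0:
        return "NFS4ERR_BAD_STATEID"
    # Modulo distance (over the 2^32 - 1 valid values) from the
    # supplied seqid forward to SERVER_CURRENT_SEQID.
    distance = (server_current_seqid - seqid) % MAX_SEQID
    if distance > valid_seqid_range:
        return "NFS4ERR_OLD_STATEID"
    return "NFS4_OK"
```

For instance, with SERVER_CURRENT_SEQID = 2 just after wraparound, a supplied seqid of 0xFFFFFFFE is only three steps in the past, since the sequence runs 0xFFFFFFFE, 0xFFFFFFFF, 1, 2.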
If the client sends a LAYOUTGET or LAYOUTRETURN after the server receives NFS4ERR_DELAY from a bulk recall, then to ensure forward progress, the server MAY return NFS4ERR_RECALLCONFLICT.

Once a CB_LAYOUTRECALL of LAYOUTRECALL4_ALL is sent, the server MUST NOT allow the client to use any layout stateid except for LAYOUTCOMMIT operations.  Once the client receives a CB_LAYOUTRECALL of LAYOUTRECALL4_ALL, it MUST NOT use any layout stateid except for LAYOUTCOMMIT operations.  Once a LAYOUTRETURN of LAYOUTRETURN4_ALL is sent, all layout stateids granted to the client ID are freed.  The client MUST NOT use the layout stateids again.  It MUST use LAYOUTGET to obtain new layout stateids.

Once a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID is sent, the server MUST NOT allow the client to use any layout stateid that refers to a file with the specified fsid except for LAYOUTCOMMIT operations.  Once the client receives a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID, it MUST NOT use any layout stateid that refers to a file with the specified fsid except for LAYOUTCOMMIT operations.  Once a LAYOUTRETURN of LAYOUTRETURN4_FSID is sent, all layout stateids granted to the referenced fsid are freed.  The client MUST NOT use those freed layout stateids for files with the referenced fsid again.  Subsequently, for any file with the referenced fsid, to use a layout, the client MUST first send a LAYOUTGET operation in order to obtain a new layout stateid for that file.

If the server has sent a bulk CB_LAYOUTRECALL and receives a LAYOUTGET, or a LAYOUTRETURN with a stateid, the server MUST return NFS4ERR_RECALLCONFLICT.  If the server has sent a bulk CB_LAYOUTRECALL and receives a LAYOUTRETURN with an lr_returntype that is not equal to the lor_recalltype of the CB_LAYOUTRECALL, the server MUST return NFS4ERR_RECALLCONFLICT.

12.5.6.  Revoking Layouts

Parallel NFS permits servers to revoke layouts from clients that fail to respond to recalls and/or fail to renew their lease in time.  Depending on the layout type, the server might revoke the layout and might take certain actions with respect to the client's I/O to data servers.

12.5.7.  Metadata Server Write Propagation

Asynchronous writes written through the metadata server may be propagated lazily to the storage devices.  For data written asynchronously through the metadata server, a client performing a read at the appropriate storage device is not guaranteed to see the newly written data until a COMMIT occurs at the metadata server.  While the write is pending, reads to the storage device may give out either the old data, the new data, or a mixture of new and old.  Upon completion of a synchronous WRITE or COMMIT (for asynchronously written data), the metadata server MUST ensure that storage devices give out the new data and that the data has been written to stable storage.  If the server implements its storage in any way such that it cannot obey these constraints, then it MUST recall the layouts to prevent reads being done that cannot be handled correctly.  Note that the layouts MUST be recalled prior to the server responding to the associated WRITE operations.

12.6.  pNFS Mechanics

This section describes the operations flow taken by a pNFS client to a metadata server and storage device.

When a pNFS client encounters a new FSID, it sends a GETATTR to the NFSv4.1 server for the fs_layout_type (Section 5.12.1) attribute.  If the attribute returns at least one layout type, and the layout types returned are among the set supported by the client, the client knows that pNFS is a possibility for the file system.
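The fs_layout_type probe above amounts to a simple set intersection.  The sketch below is illustrative only: the layout type constants are the ones defined by this specification, but the helper itself is hypothetical.

```python
# Layout type constants as defined by the NFSv4.1 specification.
LAYOUT4_NFSV4_1_FILES = 1
LAYOUT4_OSD2_OBJECTS  = 2
LAYOUT4_BLOCK_VOLUME  = 3

def pnfs_possible(fs_layout_types, client_supported):
    """True if at least one layout type returned in the fs_layout_type
    attribute is among the set the client implements."""
    return bool(set(fs_layout_types) & set(client_supported))
```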
If, from the server that returned the new FSID, the client does not have a client ID that came from an EXCHANGE_ID result that returned EXCHGID4_FLAG_USE_PNFS_MDS, it MUST send an EXCHANGE_ID to the server with the EXCHGID4_FLAG_USE_PNFS_MDS bit set.  If the server's response does not have EXCHGID4_FLAG_USE_PNFS_MDS, then contrary to what the fs_layout_type attribute said, the server does not support pNFS, and the client will not be able to use pNFS to that server; in this case, the server MUST return NFS4ERR_NOTSUPP in response to any pNFS operation.

The client then creates a session, requesting a persistent session, so that exclusive creates can be done with a single round trip via the createmode4 of GUARDED4.  If the session ends up not being persistent, the client will use EXCLUSIVE4_1 for exclusive creates.

If a file is to be created on a pNFS-enabled file system, the client uses the OPEN operation.  With the normal set of attributes that may be provided upon OPEN used for creation, there is an OPTIONAL layout_hint attribute.  The client's use of layout_hint allows the client to express its preference for a layout type and its associated layout details.  The use of a createmode4 of UNCHECKED4, GUARDED4, or EXCLUSIVE4_1 will allow the client to provide the layout_hint attribute at create time.  The client MUST NOT use EXCLUSIVE4 (see Table 10).  The client is RECOMMENDED to combine a GETATTR operation after the OPEN within the same COMPOUND.  The GETATTR may then retrieve the layout_type attribute for the newly created file.  The client will then know what layout type the server has chosen for the file and therefore what storage protocol the client must use.

If the client wants to open an existing file, then it also includes a GETATTR to determine what layout type the file supports.
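The createmode4 selection rules above can be condensed into a small decision helper.  This is a hedged sketch: the enum values are the createmode4 values from the specification, while the helper name is hypothetical.

```python
# createmode4 values as defined by the NFSv4.1 specification.
UNCHECKED4, GUARDED4, EXCLUSIVE4, EXCLUSIVE4_1 = 0, 1, 2, 3

def exclusive_createmode(session_is_persistent: bool) -> int:
    """Pick the createmode4 for an exclusive create carrying a layout_hint.

    EXCLUSIVE4 cannot be used with the layout_hint attribute, so the
    client uses GUARDED4 on a persistent session (single round trip)
    and falls back to EXCLUSIVE4_1 otherwise.
    """
    return GUARDED4 if session_is_persistent else EXCLUSIVE4_1
```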
The GETATTR in either the file creation or plain file open case can also include the layout_blksize and layout_alignment attributes so that the client can determine optimal offsets and lengths for I/O on the file.

Assuming the client supports the layout type returned by GETATTR and it chooses to use pNFS for data access, it then sends LAYOUTGET using the filehandle and stateid returned by OPEN, specifying the range it wants to do I/O on.  The response is a layout, which may be a subset of the range for which the client asked.  It also includes device IDs and a description of how data is organized (or in the case of writing, how data is to be organized) across the devices.  The device IDs and data description are encoded in a format that is specific to the layout type, but the client is expected to understand.

When the client wants to send an I/O, it determines to which device ID it needs to send the I/O command by examining the data description in the layout.  It then sends a GETDEVICEINFO to find the device address(es) of the device ID.  The client then sends the I/O request to one of the device ID's device addresses, using the storage protocol defined for the layout type.  Note that if a client has multiple I/Os to send, these I/O requests may be done in parallel.

If the I/O was a WRITE, then at some point the client may want to use LAYOUTCOMMIT to commit the modification time and the new size of the file (if it believes it extended the file size) to the metadata server and the modified data to the file system.

12.7.  Recovery

Recovery is complicated by the distributed nature of the pNFS protocol.  In general, crash recovery for layouts is similar to crash recovery for delegations in the base NFSv4.1 protocol.
However, the client's ability to perform I/O without contacting the metadata server introduces subtleties that must be handled correctly if the possibility of file system corruption is to be avoided.

12.7.1.  Recovery from Client Restart

Client recovery for layouts is similar to client recovery for other lock and delegation state.  When a pNFS client restarts, it will lose all information about the layouts that it previously owned.  There are two methods by which the server can reclaim these resources and allow otherwise conflicting layouts to be provided to other clients.

The first is through the expiry of the client's lease.  If the client recovery time is longer than the lease period, the client's lease will expire and the server will know that state may be released.  For layouts, the server may release the state immediately upon lease expiry or it may allow the layout to persist, awaiting possible lease revival, as long as no other layout conflicts.

The second is through the client restarting in less time than it takes for the lease period to expire.  In such a case, the client will contact the server through the standard EXCHANGE_ID protocol.  The server will find that the client's co_ownerid matches the co_ownerid of the previous client invocation, but that the verifier is different.  The server uses this as a signal to release all layout state associated with the client's previous invocation.  In this scenario, the data written by the client but not covered by a successful LAYOUTCOMMIT is in an undefined state; it may have been written or it may now be lost.  This is acceptable behavior and it is the client's responsibility to use LAYOUTCOMMIT to achieve the desired level of stability.

12.7.2.  Dealing with Lease Expiration on the Client

If a client believes its lease has expired, it MUST NOT send I/O to the storage device until it has validated its lease.  The client can send a SEQUENCE operation to the metadata server.  If the SEQUENCE operation is successful, but sr_status_flags has SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST NOT use currently held layouts.  The client has two choices to recover from the lease expiration.  First, for all modified but uncommitted data, the client writes it to the metadata server using the FILE_SYNC4 flag for the WRITEs, or WRITE and COMMIT.  Second, the client re-establishes a client ID and session with the server and obtains new layouts and device-ID-to-device-address mappings for the modified data ranges and then writes the data to the storage devices with the newly obtained layouts.

If sr_status_flags from the metadata server has SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns NFS4ERR_BAD_SESSION and CREATE_SESSION returns NFS4ERR_STALE_CLIENTID), then the metadata server has restarted, and the client SHOULD recover using the methods described in Section 12.7.4.

If sr_status_flags from the metadata server has SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following the procedure described in Section 11.10.9.1.  After that, the client may get an indication that the layout state was not moved with the file system.  The client recovers as in the other applicable situations discussed in the first two paragraphs of this section.

If sr_status_flags reports no loss of state, then the lease for the layouts that the client has is valid and renewed, and the client can once again send I/O requests to the storage devices.
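The dispatch on sr_status_flags described above can be sketched as a small classifier.  This is an illustration, not normative client logic: the flag names come from the specification, the numeric values shown follow its XDR definitions but should be verified against it, and the helper and its return strings are hypothetical.

```python
# SEQ4_STATUS flag values per the NFSv4.1 XDR (verify against the spec).
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED  = 0x00000008
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010
SEQ4_STATUS_ADMIN_STATE_REVOKED        = 0x00000020
SEQ4_STATUS_LEASE_MOVED                = 0x00000080
SEQ4_STATUS_RESTART_RECLAIM_NEEDED     = 0x00000100

def layout_recovery_action(sr_status_flags: int) -> str:
    """Map SEQUENCE status flags to the recovery path described above."""
    if sr_status_flags & SEQ4_STATUS_RESTART_RECLAIM_NEEDED:
        return "server restarted: recover per Section 12.7.4"
    if sr_status_flags & SEQ4_STATUS_LEASE_MOVED:
        return "follow Section 11.10.9.1, then re-check layout state"
    if sr_status_flags & (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED
                          | SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
                          | SEQ4_STATUS_ADMIN_STATE_REVOKED):
        return "stop using held layouts: flush via MDS or obtain new layouts"
    return "lease renewed: resume I/O to storage devices"
```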
While clients SHOULD NOT send I/Os to storage devices that may extend past the lease expiration time period, this is not always possible, for example, an extended network partition that starts after the I/O is sent and does not heal until the I/O request is received by the storage device.  Thus, the metadata server and/or storage devices are responsible for protecting themselves from I/Os that are both sent before the lease expires and arrive after the lease expires.  See Section 12.7.3.

12.7.3.  Dealing with Loss of Layout State on the Metadata Server

This is a description of the case where all of the following are true:

o  the metadata server has not restarted

o  a pNFS client's layouts have been discarded (usually because the client's lease expired) and are invalid

o  an I/O from the pNFS client arrives at the storage device

The metadata server and its storage devices MUST solve this by fencing the client.  In other words, they MUST solve this by preventing the execution of I/O operations from the client to the storage devices after layout state loss.  The details of how fencing is done are specific to the layout type.  The solution for NFSv4.1 file-based layouts is described in Section 13.11, and solutions for other layout types are in their respective external specification documents.

12.7.4.  Recovery from Metadata Server Restart

The pNFS client will discover that the metadata server has restarted via the methods described in Section 8.4.2 and discussed in a pNFS-specific context in Section 12.7.2, Paragraph 2.  The client MUST stop using layouts and delete the device ID to device address mappings it previously received from the metadata server.  Having done that, if the client wrote data to the storage device without committing the layouts via LAYOUTCOMMIT, then the client has additional work to do in order to have the client, metadata server, and storage device(s) all synchronized on the state of the data.

o  If the client has data still modified and unwritten in the client's memory, the client has only two choices.

   1.  The client can obtain a layout via LAYOUTGET after the server's grace period and write the data to the storage devices.

   2.  The client can WRITE that data through the metadata server using the WRITE (Section 18.32) operation, and then obtain layouts as desired.

o  If the client asynchronously wrote data to the storage device, but still has a copy of the data in its memory, then it has available to it the recovery options listed above in the previous bullet point.  If the metadata server is also in its grace period, the client has available to it the options below in the next bullet point.

o  The client does not have a copy of the data in its memory and the metadata server is still in its grace period.  The client cannot use LAYOUTGET (within or outside the grace period) to reclaim a layout because the contents of the response from LAYOUTGET may not match what it had previously.  The range might be different or the client might get the same range but the content of the layout might be different.  Even if the content of the layout appears to be the same, the device IDs may map to different device addresses, and even if the device addresses are the same, the device addresses could have been assigned to a different storage device.  The option of retrieving the data from the storage device and writing it to the metadata server per the recovery scenario described above is not available because, again, the mappings of range to device ID, device ID to device address, and device address to physical device are stale, and new mappings via new LAYOUTGET do not solve the problem.

   The only recovery option for this scenario is to send a LAYOUTCOMMIT in reclaim mode, which the metadata server will accept as long as it is in its grace period.  The use of LAYOUTCOMMIT in reclaim mode informs the metadata server that the layout has changed.  It is critical that the metadata server receive this information before its grace period ends, and thus before it starts allowing updates to the file system.

   To send LAYOUTCOMMIT in reclaim mode, the client sets the loca_reclaim field of the operation's arguments (Section 18.42.1) to TRUE.  During the metadata server's recovery grace period (and only during the recovery grace period) the metadata server is prepared to accept LAYOUTCOMMIT requests with the loca_reclaim field set to TRUE.

   When loca_reclaim is TRUE, the client is attempting to commit changes to the layout that occurred prior to the restart of the metadata server.  The metadata server applies some consistency checks on the loca_layoutupdate field of the arguments to determine whether the client can commit the data written to the storage device to the file system.  The loca_layoutupdate field is of data type layoutupdate4 and contains layout-type-specific content (in the lou_body field of loca_layoutupdate).  The layout-type-specific information that loca_layoutupdate might have is discussed in Section 12.5.4.3.  If the metadata server's consistency checks on loca_layoutupdate succeed, then the metadata server MUST commit the data (as described by the loca_offset, loca_length, and loca_layoutupdate fields of the arguments) that was written to the storage device.  If the metadata server's consistency checks on loca_layoutupdate fail, the metadata server rejects the LAYOUTCOMMIT operation and makes no changes to the file system.  However, any time LAYOUTCOMMIT with loca_reclaim TRUE fails, the pNFS client has lost all the data in the range defined by <loca_offset, loca_length>.  A client can defend against this risk by caching all data, whether written synchronously or asynchronously in its memory, and by not releasing the cached data until a successful LAYOUTCOMMIT.  This condition does not hold true for all layout types; for example, file-based storage devices need not suffer from this limitation.

o  The client does not have a copy of the data in its memory and the metadata server is no longer in its grace period; i.e., the metadata server returns NFS4ERR_NO_GRACE.  As with the scenario in the above bullet point, the failure of LAYOUTCOMMIT means the data in the range <loca_offset, loca_length> is lost.  The defense against the risk is the same -- cache all written data on the client until a successful LAYOUTCOMMIT.

12.7.5.  Operations during Metadata Server Grace Period

Some of the recovery scenarios thus far noted that some operations (namely, WRITE and LAYOUTGET) might be permitted during the metadata server's grace period.  The metadata server may allow these operations during its grace period.  For LAYOUTGET, the metadata server must reliably determine that servicing such a request will not conflict with an impending LAYOUTCOMMIT reclaim request.
For WRITE, the metadata server must reliably determine that servicing the request will not conflict with an impending OPEN or with a LOCK where the file has mandatory byte-range locking enabled.

As mentioned previously, for expediency, the metadata server might reject some operations (namely, WRITE and LAYOUTGET) during its grace period, because the simplest correct approach is to reject all non-reclaim pNFS requests and WRITE operations by returning the NFS4ERR_GRACE error.  However, depending on the storage protocol (which is specific to the layout type) and metadata server implementation, the metadata server may be able to determine that a particular request is safe.  For example, a metadata server may save provisional allocation mappings for each file to stable storage, as well as information about potentially conflicting OPEN share modes and mandatory byte-range locks that might have been in effect at the time of restart, and the metadata server may use this information during the recovery grace period to determine that a WRITE request is safe.

12.7.6.  Storage Device Recovery

Recovery from storage device restart is mostly dependent upon the layout type in use.  However, there are a few general techniques a client can use if it discovers a storage device has crashed while holding modified, uncommitted data that was asynchronously written.  First and foremost, it is important to realize that the client is the only one that has the information necessary to recover non-committed data since it holds the modified data and probably nothing else does.  Second, the best solution is for the client to err on the side of caution and attempt to rewrite the modified data through another path.
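The write-back idea can be sketched as below.  All names here are hypothetical, and FILE_SYNC4 stability is represented by a boolean flag; a real client would issue WRITE operations through its NFSv4.1 machinery.

```python
# Minimal sketch: after a storage device crash, the client is the only
# holder of uncommitted asynchronous writes, so it replays each cached
# dirty range through another path (here, the metadata server),
# requesting stable (FILE_SYNC4-equivalent) semantics.

def recover_uncommitted(dirty_ranges, write_via_mds):
    """Rewrite cached-but-uncommitted data through the metadata server.

    dirty_ranges:  iterable of (offset, data) still held in client memory
    write_via_mds: callable(offset, data, stable=True) performing the WRITE
    """
    for offset, data in dirty_ranges:
        write_via_mds(offset, data, stable=True)  # FILE_SYNC4 semantics
```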
The client SHOULD immediately WRITE the data to the metadata server, with the stable field in the WRITE4args set to FILE_SYNC4.  Once it does this, there is no need to wait for the original storage device.

12.8.  Metadata and Storage Device Roles

If the same physical hardware is used to implement both a metadata server and storage device, then the same hardware entity is to be understood to be implementing two distinct roles and it is important that it be clearly understood on behalf of which role the hardware is executing at any given time.

Two sub-cases can be distinguished.

1.  The storage device uses NFSv4.1 as the storage protocol, i.e., the same physical hardware is used to implement both a metadata and data server.  See Section 13.1 for a description of how multiple roles are handled.

2.  The storage device does not use NFSv4.1 as the storage protocol, and the same physical hardware is used to implement both a metadata and storage device.  Whether distinct network addresses are used to access the metadata server and storage device is immaterial.  This is because it is always clear to the pNFS client and server, from the upper-layer protocol being used (NFSv4.1 or non-NFSv4.1), to which role the request to the common server network address is directed.

12.9.  Security Considerations for pNFS

pNFS separates file system metadata and data and provides access to both.  There are pNFS-specific operations (listed in Section 12.3) that provide access to the metadata; all existing NFSv4.1 conventional (non-pNFS) security mechanisms and features apply to accessing the metadata.  The combination of components in a pNFS system (see Figure 1) is required to preserve the security properties of NFSv4.1 with respect to an entity that is accessing a storage device from a client, including security countermeasures to defend against threats for which NFSv4.1 provides defenses in environments where these threats are considered significant.

In some cases, the security countermeasures for connections to storage devices may take the form of physical isolation or a recommendation to avoid the use of pNFS in an environment.  For example, it may be impractical to provide confidentiality protection for some storage protocols to protect against eavesdropping.  In environments where eavesdropping on such protocols is of sufficient concern to require countermeasures, physical isolation of the communication channel (e.g., via direct connection from client(s) to storage device(s)) and/or a decision to forgo use of pNFS (e.g., and fall back to conventional NFSv4.1) may be appropriate courses of action.

Where communication with storage devices is subject to the same threats as client-to-metadata server communication, the protocols used for that communication need to provide security mechanisms as strong as or no weaker than those available via RPCSEC_GSS for NFSv4.1.  Except for the storage protocol used for the LAYOUT4_NFSV4_1_FILES layout (see Section 13), i.e., except for NFSv4.1, it is beyond the scope of this document to specify the security mechanisms for storage access protocols.

pNFS implementations MUST NOT remove NFSv4.1's access controls.  The combination of clients, storage devices, and the metadata server are responsible for ensuring that all client-to-storage-device file data access respects NFSv4.1's ACLs and file open modes.  This entails performing both of these checks on every access in the client, the storage device, or both (as applicable; when the storage device is an NFSv4.1 server, the storage device is ultimately responsible for controlling access as described in Section 13.9.2).  If a pNFS configuration performs these checks only in the client, the risk of a misbehaving client obtaining unauthorized access is an important consideration in determining when it is appropriate to use such a pNFS configuration.  Such layout types SHOULD NOT be used when client-only access checks do not provide sufficient assurance that NFSv4.1 access control is being applied correctly.  (This is not a problem for the file layout type described in Section 13 because the storage access protocol for LAYOUT4_NFSV4_1_FILES is NFSv4.1, and thus the security model for storage device access via LAYOUT4_NFSV4_1_FILES is the same as that of the metadata server.)  For handling of access control specific to a layout, the reader should examine the layout specification, such as the NFSv4.1/file-based layout (Section 13) of this document, the blocks layout [44], and objects layout [43].

13.  NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type

This section describes the semantics and format of NFSv4.1 file-based layouts for pNFS.  NFSv4.1 file-based layouts use the LAYOUT4_NFSV4_1_FILES layout type.  The LAYOUT4_NFSV4_1_FILES type defines striping data across multiple NFSv4.1 data servers.

13.1.  Client ID and Session Considerations

Sessions are a REQUIRED feature of NFSv4.1, and this extends to both the metadata server and file-based (NFSv4.1-based) data servers.

The role a server plays in pNFS is determined by the result it returns from EXCHANGE_ID.
The roles are:

o  Metadata server (EXCHGID4_FLAG_USE_PNFS_MDS is set in the result eir_flags).

o  Data server (EXCHGID4_FLAG_USE_PNFS_DS).

o  Non-metadata server (EXCHGID4_FLAG_USE_NON_PNFS).  This is an NFSv4.1 server that does not support operations (e.g., LAYOUTGET) or attributes that pertain to pNFS.

The client MAY request zero or more of EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_DS, or EXCHGID4_FLAG_USE_PNFS_MDS, even though some combinations (e.g., EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS) are contradictory.  However, the server MUST only return the following acceptable combinations:

   +---------------------------------------------------------+
   | Acceptable Results from EXCHANGE_ID                     |
   +---------------------------------------------------------+
   | EXCHGID4_FLAG_USE_PNFS_MDS                              |
   | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS  |
   | EXCHGID4_FLAG_USE_PNFS_DS                               |
   | EXCHGID4_FLAG_USE_NON_PNFS                              |
   | EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS  |
   +---------------------------------------------------------+

As the above table implies, a server can have one or two roles.  A server can be both a metadata server and a data server, or it can be both a data server and non-metadata server.  In addition to returning two roles in the EXCHANGE_ID's results, and thus serving both roles via a common client ID, a server can serve two roles by returning a unique client ID and server owner for each role in each of two EXCHANGE_ID results, with each result indicating each role.
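The table of acceptable role combinations can be checked mechanically.  In this sketch the flag values are the EXCHGID4_FLAG constants from the specification; the helper itself is hypothetical.

```python
# EXCHGID4 pNFS role flags as defined by the NFSv4.1 specification.
EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS  = 0x00040000

# The five acceptable EXCHANGE_ID result combinations from the table.
ACCEPTABLE_ROLE_RESULTS = {
    EXCHGID4_FLAG_USE_PNFS_MDS,
    EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS,
    EXCHGID4_FLAG_USE_PNFS_DS,
    EXCHGID4_FLAG_USE_NON_PNFS,
    EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS,
}

def roles_acceptable(eir_flags: int) -> bool:
    """Check the pNFS role bits of an EXCHANGE_ID result against the table."""
    role_bits = eir_flags & (EXCHGID4_FLAG_USE_NON_PNFS
                             | EXCHGID4_FLAG_USE_PNFS_MDS
                             | EXCHGID4_FLAG_USE_PNFS_DS)
    return role_bits in ACCEPTABLE_ROLE_RESULTS
```

Note that the contradictory combination EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS is correctly rejected.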
In the case of a server with concurrent pNFS roles that are served by a common client ID, if the EXCHANGE_ID request from the client has zero or a combination of the bits set in eia_flags, the server result should set bits that represent the higher of the acceptable combination of the server roles, with a preference to match the roles requested by the client.  Thus, if a client request has (EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS) flags set, and the server is both a metadata server and a data server, serving both the roles by a common client ID, the server SHOULD return with (EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS) set.

In the case of a server that has multiple concurrent pNFS roles, each role served by a unique client ID, if the client specifies zero or a combination of roles in the request, the server results SHOULD return only one of the roles from the combination specified by the client request.  If the role specified by the server result does not match the intended use by the client, the client should send the EXCHANGE_ID specifying just the interested pNFS role.

If a pNFS metadata client gets a layout that refers it to an NFSv4.1 data server, it needs a client ID on that data server.  If it does not yet have a client ID from the server that had the EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID results, then the client needs to send an EXCHANGE_ID to the data server, using the same co_ownerid as it sent to the metadata server, with the EXCHGID4_FLAG_USE_PNFS_DS flag set in the arguments.  If the server's EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_DS set, then the client may use the client ID to create sessions that will exchange pNFS data operations.  The client ID returned by the data server has no relationship with the client ID returned by a metadata server unless the client IDs are equal, and the server owners and server scopes of the data server and metadata server are equal.

In NFSv4.1, the session ID in the SEQUENCE operation implies the client ID, which in turn might be used by the server to map the stateid to the right client/server pair.  However, when a data server is presented with a READ or WRITE operation with a stateid, because the stateid is associated with a client ID on a metadata server, and because the session ID in the preceding SEQUENCE operation is tied to the client ID of the data server, the data server has no obvious way to determine the metadata server from the COMPOUND procedure, and thus has no way to validate the stateid.  One RECOMMENDED approach is for pNFS servers to encode metadata server routing and/or identity information in the data server filehandles as returned in the layout.

If metadata server routing and/or identity information is encoded in data server filehandles, when the metadata server identity or location changes, the data server filehandles it gave out will become invalid (stale), and so the metadata server MUST first recall the layouts.  Invalidating a data server filehandle does not render the NFS client's data cache invalid.  The client's cache should map a data server filehandle to a metadata server filehandle, and a metadata server filehandle to cached data.

If a server is both a metadata server and a data server, the server might need to distinguish operations on files that are directed to the metadata server from those that are directed to the data server.  It is RECOMMENDED that the values of the filehandles returned by the LAYOUTGET operation be different than the value of the filehandle returned by the OPEN of the same file.

Another scenario is for the metadata server and the storage device to be distinct from one client's point of view, and the roles reversed from another client's point of view.  For example, in the cluster file system model, a metadata server to one client might be a data server to another client.  If NFSv4.1 is being used as the storage protocol, then pNFS servers need to encode the values of filehandles according to their specific roles.

13.1.1.  Sessions Considerations for Data Servers

Section 2.10.11.2 states that a client has to keep its lease renewed in order to prevent a session from being deleted by the server.  If the reply to EXCHANGE_ID has just the EXCHGID4_FLAG_USE_PNFS_DS role set, then (as noted in Section 13.6) the client will not be able to determine the data server's lease_time attribute because GETATTR will not be permitted.  Instead, the rule is that any time a client receives a layout referring it to a data server that returns just the EXCHGID4_FLAG_USE_PNFS_DS role, the client MAY assume that the lease_time attribute from the metadata server that returned the layout applies to the data server.  Thus, the data server MUST be aware of the values of all lease_time attributes of all metadata servers for which it is providing I/O, and it MUST use the maximum of all such lease_time values as the lease interval for all client IDs and sessions established on it.
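The lease-interval rule for an EXCHGID4_FLAG_USE_PNFS_DS-only data server reduces to taking a maximum.  A trivial sketch, with a hypothetical helper name:

```python
def ds_lease_interval(mds_lease_times):
    """Lease interval (in seconds) a data server must honor: the maximum
    of the lease_time attributes of every metadata server for which the
    data server is providing I/O."""
    return max(mds_lease_times)
```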
15685 For example, if one metadata server has a lease_time attribute of 20 15686 seconds, and a second metadata server has a lease_time attribute of 15687 10 seconds, then if both servers return layouts that refer to an 15688 EXCHGID4_FLAG_USE_PNFS_DS-only data server, the data server MUST 15689 renew a client's lease if the interval between two SEQUENCE 15690 operations on different COMPOUND requests is less than 20 seconds. 15692 13.2. File Layout Definitions 15694 The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout 15695 type and may be applicable to other layout types. 15697 Unit. A unit is a fixed-size quantity of data written to a data 15698 server. 15700 Pattern. A pattern is a method of distributing one or more equal 15701 sized units across a set of data servers. A pattern is iterated 15702 one or more times. 15704 Stripe. A stripe is a set of data distributed across a set of data 15705 servers in a pattern before that pattern repeats. 15707 Stripe Count. A stripe count is the number of units in a pattern. 15709 Stripe Width. A stripe width is the size of a stripe in bytes. The 15710 stripe width = the stripe count * the size of the stripe unit. 15712 Hereafter, this document will refer to a unit that is written in a 15713 pattern as a "stripe unit". 15715 A pattern may have more stripe units than data servers. If so, some 15716 data servers will have more than one stripe unit per stripe. A data 15717 server that has multiple stripe units per stripe MAY store each unit 15718 in a different data file (and depending on the implementation, will 15719 possibly assign a unique data filehandle to each data file). 15721 13.3. File Layout Data Types 15723 The high level NFSv4.1 layout types are nfsv4_1_file_layouthint4, 15724 nfsv4_1_file_layout_ds_addr4, and nfsv4_1_file_layout4. 15726 The SETATTR operation supports a layout hint attribute 15727 (Section 5.12.4).
When the client sets a layout hint (data type 15728 layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the 15729 loh_type field), the loh_body field contains a value of data type 15730 nfsv4_1_file_layouthint4. 15732 const NFL4_UFLG_MASK = 0x0000003F; 15733 const NFL4_UFLG_DENSE = 0x00000001; 15734 const NFL4_UFLG_COMMIT_THRU_MDS = 0x00000002; 15735 const NFL4_UFLG_STRIPE_UNIT_SIZE_MASK 15736 = 0xFFFFFFC0; 15738 typedef uint32_t nfl_util4; 15739 enum filelayout_hint_care4 { 15740 NFLH4_CARE_DENSE = NFL4_UFLG_DENSE, 15742 NFLH4_CARE_COMMIT_THRU_MDS 15743 = NFL4_UFLG_COMMIT_THRU_MDS, 15745 NFLH4_CARE_STRIPE_UNIT_SIZE 15746 = 0x00000040, 15748 NFLH4_CARE_STRIPE_COUNT = 0x00000080 15749 }; 15751 /* Encoded in the loh_body field of data type layouthint4: */ 15753 struct nfsv4_1_file_layouthint4 { 15754 uint32_t nflh_care; 15755 nfl_util4 nflh_util; 15756 count4 nflh_stripe_count; 15757 }; 15759 The generic layout hint structure is described in Section 3.3.19. 15760 The client uses the layout hint in the layout_hint (Section 5.12.4) 15761 attribute to indicate the preferred type of layout to be used for a 15762 newly created file. The LAYOUT4_NFSV4_1_FILES layout-type-specific 15763 content for the layout hint is composed of three fields. The first 15764 field, nflh_care, is a set of flags indicating which values of the 15765 hint the client cares about. If the NFLH4_CARE_DENSE flag is set, 15766 then the client indicates in the second field, nflh_util, a 15767 preference for how the data file is packed (Section 13.4.4), which is 15768 controlled by the value of the expression nflh_util & NFL4_UFLG_DENSE 15769 ("&" represents the bitwise AND operator). If the 15770 NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a 15771 preference for whether the client should send COMMIT operations to 15772 the metadata server or data server (Section 13.7), which is 15773 controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS. 
If 15774 the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its 15775 preferred stripe unit size, which is indicated in nflh_util & 15776 NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus, the stripe unit size MUST be a 15777 multiple of 64 bytes). The minimum stripe unit size is 64 bytes. If 15778 the NFLH4_CARE_STRIPE_COUNT flag is set, the client indicates in the 15779 third field, nflh_stripe_count, the stripe count. The stripe count 15780 multiplied by the stripe unit size is the stripe width. 15782 When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in 15783 the loc_type field of the lo_content field), the loc_body field of 15784 the lo_content field contains a value of data type 15785 nfsv4_1_file_layout4. Among other content, nfsv4_1_file_layout4 has 15786 a storage device ID (field nfl_deviceid) of data type deviceid4. The 15787 GETDEVICEINFO operation maps a device ID to a storage device address 15788 (type device_addr4). When GETDEVICEINFO returns a device address 15789 with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type 15790 field), the da_addr_body field contains a value of data type 15791 nfsv4_1_file_layout_ds_addr4. 15793 typedef netaddr4 multipath_list4<>; 15795 /* 15796 * Encoded in the da_addr_body field of 15797 * data type device_addr4: 15798 */ 15799 struct nfsv4_1_file_layout_ds_addr4 { 15800 uint32_t nflda_stripe_indices<>; 15801 multipath_list4 nflda_multipath_ds_list<>; 15802 }; 15804 The nfsv4_1_file_layout_ds_addr4 data type represents the device 15805 address. It is composed of two fields: 15807 1. nflda_multipath_ds_list: An array of lists of data servers, where 15808 each list can be one or more elements, and each element 15809 represents a data server address that may serve equally as the 15810 target of I/O operations (see Section 13.5). The length of this 15811 array might be different than the stripe count. 15813 2. 
nflda_stripe_indices: An array of indices used to index into 15814 nflda_multipath_ds_list. The value of each element of 15815 nflda_stripe_indices MUST be less than the number of elements in 15816 nflda_multipath_ds_list. Each element of nflda_multipath_ds_list 15817 SHOULD be referred to by one or more elements of 15818 nflda_stripe_indices. The number of elements in 15819 nflda_stripe_indices is always equal to the stripe count. 15821 /* 15822 * Encoded in the loc_body field of 15823 * data type layout_content4: 15824 */ 15825 struct nfsv4_1_file_layout4 { 15826 deviceid4 nfl_deviceid; 15827 nfl_util4 nfl_util; 15828 uint32_t nfl_first_stripe_index; 15829 offset4 nfl_pattern_offset; 15830 nfs_fh4 nfl_fh_list<>; 15831 }; 15832 The nfsv4_1_file_layout4 data type represents the layout. It is 15833 composed of the following fields: 15835 1. nfl_deviceid: The device ID that maps to a value of type 15836 nfsv4_1_file_layout_ds_addr4. 15838 2. nfl_util: Like the nflh_util field of data type 15839 nfsv4_1_file_layouthint4, a compact representation of how the 15840 data on a file on each data server is packed, whether the client 15841 should send COMMIT operations to the metadata server or data 15842 server, and the stripe unit size. If a server returns two or 15843 more overlapping layouts, each stripe unit size in each 15844 overlapping layout MUST be the same. 15846 3. nfl_first_stripe_index: The index into the first element of the 15847 nflda_stripe_indices array to use. 15849 4. nfl_pattern_offset: This field is the logical offset into the 15850 file where the striping pattern starts. It is required for 15851 converting the client's logical I/O offset (e.g., the current 15852 offset in a POSIX file descriptor before the read() or write() 15853 system call is sent) into the stripe unit number (see 15854 Section 13.4.1). 
15856 If dense packing is used, then nfl_pattern_offset is also needed 15857 to convert the client's logical I/O offset to an offset on the 15858 file on the data server corresponding to the stripe unit number 15859 (see Section 13.4.4). 15861 Note that nfl_pattern_offset is not always the same as lo_offset. 15862 For example, via the LAYOUTGET operation, a client might request 15863 a layout starting at offset 1000 of a file that has its striping 15864 pattern start at offset zero. 15866 5. nfl_fh_list: An array of data server filehandles for each list of 15867 data servers in each element of the nflda_multipath_ds_list 15868 array. The number of elements in nfl_fh_list depends on whether 15869 sparse or dense packing is being used. 15871 * If sparse packing is being used, the number of elements in 15872 nfl_fh_list MUST be one of three values: 15874 + Zero. This means that filehandles used for each data 15875 server are the same as the filehandle returned by the OPEN 15876 operation from the metadata server. 15878 + One. This means that every data server uses the same 15879 filehandle: what is specified in nfl_fh_list[0]. 15881 + The same number of elements in nflda_multipath_ds_list. 15882 Thus, in this case, when sending an I/O operation to any 15883 data server in nflda_multipath_ds_list[X], the filehandle 15884 in nfl_fh_list[X] MUST be used. 15886 See the discussion on sparse packing in Section 13.4.4. 15888 * If dense packing is being used, the number of elements in 15889 nfl_fh_list MUST be the same as the number of elements in 15890 nflda_stripe_indices. Thus, when sending an I/O operation to 15891 any data server in 15892 nflda_multipath_ds_list[nflda_stripe_indices[Y]], the 15893 filehandle in nfl_fh_list[Y] MUST be used. 
In addition, any 15894 time there exists i and j, (i != j), such that the 15895 intersection of 15896 nflda_multipath_ds_list[nflda_stripe_indices[i]] and 15897 nflda_multipath_ds_list[nflda_stripe_indices[j]] is not empty, 15898 then nfl_fh_list[i] MUST NOT equal nfl_fh_list[j]. In other 15899 words, when dense packing is being used, if a data server 15900 appears in two or more units of a striping pattern, each 15901 reference to the data server MUST use a different filehandle. 15903 Indeed, if there are multiple striping patterns, as indicated 15904 by the presence of multiple objects of data type layout4 15905 (either returned in one or multiple LAYOUTGET operations), and 15906 a data server is the target of a unit of one pattern and 15907 another unit of another pattern, then each reference to each 15908 data server MUST use a different filehandle. 15910 See the discussion on dense packing in Section 13.4.4. 15912 The details on the interpretation of the layout are in Section 13.4. 15914 13.4. Interpreting the File Layout 15916 13.4.1. Determining the Stripe Unit Number 15918 To find the stripe unit number that corresponds to the client's 15919 logical file offset, the pattern offset will also be used. The i'th 15920 stripe unit (SUi) is: 15922 relative_offset = file_offset - nfl_pattern_offset; 15923 SUi = floor(relative_offset / stripe_unit_size); 15925 13.4.2. 
Interpreting the File Layout Using Sparse Packing 15927 When sparse packing is used, the algorithm for determining the 15928 filehandle and set of data-server network addresses to write stripe 15929 unit i (SUi) to is: 15931 stripe_count = number of elements in nflda_stripe_indices; 15933 j = (SUi + nfl_first_stripe_index) % stripe_count; 15935 idx = nflda_stripe_indices[j]; 15937 fh_count = number of elements in nfl_fh_list; 15938 ds_count = number of elements in nflda_multipath_ds_list; 15940 switch (fh_count) { 15941 case ds_count: 15942 fh = nfl_fh_list[idx]; 15943 break; 15945 case 1: 15946 fh = nfl_fh_list[0]; 15947 break; 15949 case 0: 15950 fh = filehandle returned by OPEN; 15951 break; 15953 default: 15954 throw a fatal exception; 15955 break; 15956 } 15958 address_list = nflda_multipath_ds_list[idx]; 15960 The client would then select a data server from address_list, and 15961 send a READ or WRITE operation using the filehandle specified in fh. 15963 Consider the following example: 15965 Suppose we have a device address consisting of seven data servers, 15966 arranged in three equivalence (Section 13.5) classes: 15968 { A, B, C, D }, { E }, { F, G } 15970 where A through G are network addresses. 15972 Then 15974 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 15976 i.e., 15978 nflda_multipath_ds_list[0] = { A, B, C, D } 15980 nflda_multipath_ds_list[1] = { E } 15982 nflda_multipath_ds_list[2] = { F, G } 15984 Suppose the striping index array is: 15986 nflda_stripe_indices<> = { 2, 0, 1, 0 } 15988 Now suppose the client gets a layout that has a device ID that maps 15989 to the above device address. The initial index contains 15991 nfl_first_stripe_index = 2, 15993 and the filehandle list is 15995 nfl_fh_list = { 0x36, 0x87, 0x67 }. 
15997 If the client wants to write to SU0, the set of valid { network 15998 address, filehandle } combinations for SUi are determined by: 16000 nfl_first_stripe_index = 2 16002 So 16004 idx = nflda_stripe_indices[(0 + 2) % 4] 16006 = nflda_stripe_indices[2] 16008 = 1 16010 So 16012 nflda_multipath_ds_list[1] = { E } 16014 and 16016 nfl_fh_list[1] = { 0x87 } 16018 The client can thus write SU0 to { 0x87, { E } }. 16020 The destinations of the first 13 storage units are: 16022 +-----+------------+--------------+ 16023 | SUi | filehandle | data servers | 16024 +-----+------------+--------------+ 16025 | 0 | 87 | E | 16026 | 1 | 36 | A,B,C,D | 16027 | 2 | 67 | F,G | 16028 | 3 | 36 | A,B,C,D | 16029 | | | | 16030 | 4 | 87 | E | 16031 | 5 | 36 | A,B,C,D | 16032 | 6 | 67 | F,G | 16033 | 7 | 36 | A,B,C,D | 16034 | | | | 16035 | 8 | 87 | E | 16036 | 9 | 36 | A,B,C,D | 16037 | 10 | 67 | F,G | 16038 | 11 | 36 | A,B,C,D | 16039 | | | | 16040 | 12 | 87 | E | 16041 +-----+------------+--------------+ 16043 13.4.3. Interpreting the File Layout Using Dense Packing 16045 When dense packing is used, the algorithm for determining the 16046 filehandle and set of data server network addresses to write stripe 16047 unit i (SUi) to is: 16049 stripe_count = number of elements in nflda_stripe_indices; 16051 j = (SUi + nfl_first_stripe_index) % stripe_count; 16053 idx = nflda_stripe_indices[j]; 16055 fh_count = number of elements in nfl_fh_list; 16056 ds_count = number of elements in nflda_multipath_ds_list; 16058 switch (fh_count) { 16059 case stripe_count: 16060 fh = nfl_fh_list[j]; 16061 break; 16063 default: 16064 throw a fatal exception; 16065 break; 16066 } 16068 address_list = nflda_multipath_ds_list[idx]; 16070 The client would then select a data server from address_list, and 16071 send a READ or WRITE operation using the filehandle specified in fh. 
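The sparse and dense selection algorithms can be sketched together in Python (non-normative; the parameter names mirror the XDR fields, and the sample values are those of the sparse packing example above):

```python
def select_fh_and_addrs(su_i, nfl_first_stripe_index, nflda_stripe_indices,
                        nflda_multipath_ds_list, nfl_fh_list,
                        dense, open_fh=None):
    # Sketch of the filehandle/address selection pseudocode for
    # stripe unit SUi, covering both packing modes.
    stripe_count = len(nflda_stripe_indices)
    j = (su_i + nfl_first_stripe_index) % stripe_count
    idx = nflda_stripe_indices[j]
    fh_count = len(nfl_fh_list)
    if dense:
        if fh_count != stripe_count:
            raise ValueError("invalid dense layout")
        fh = nfl_fh_list[j]
    else:
        if fh_count == len(nflda_multipath_ds_list):
            fh = nfl_fh_list[idx]
        elif fh_count == 1:
            fh = nfl_fh_list[0]
        elif fh_count == 0:
            fh = open_fh  # filehandle returned by OPEN
        else:
            raise ValueError("invalid sparse layout")
    return fh, nflda_multipath_ds_list[idx]

# Values from the sparse example: three equivalence classes and
# striping index array { 2, 0, 1, 0 }, nfl_first_stripe_index = 2.
ds_list = [{"A", "B", "C", "D"}, {"E"}, {"F", "G"}]
indices = [2, 0, 1, 0]
fh, addrs = select_fh_and_addrs(0, 2, indices, ds_list,
                                [0x36, 0x87, 0x67], dense=False)
assert (fh, addrs) == (0x87, {"E"})  # SU0 -> { 0x87, { E } }
```

The same function reproduces the dense example later in this section, which uses the same device address but a four-element filehandle list.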
16073 Consider the following example (which is the same as the sparse 16074 packing example, except for the filehandle list): 16076 Suppose we have a device address consisting of seven data servers, 16077 arranged in three equivalence (Section 13.5) classes: 16079 { A, B, C, D }, { E }, { F, G } 16081 where A through G are network addresses. 16083 Then 16085 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 16087 i.e., 16089 nflda_multipath_ds_list[0] = { A, B, C, D } 16091 nflda_multipath_ds_list[1] = { E } 16093 nflda_multipath_ds_list[2] = { F, G } 16095 Suppose the striping index array is: 16097 nflda_stripe_indices<> = { 2, 0, 1, 0 } 16099 Now suppose the client gets a layout that has a device ID that maps 16100 to the above device address. The initial index contains 16102 nfl_first_stripe_index = 2, 16104 and 16106 nfl_fh_list = { 0x67, 0x37, 0x87, 0x36 }. 16108 The interesting examples for dense packing are SU1 and SU3 because 16109 each stripe unit refers to the same data server list, yet each stripe 16110 unit MUST use a different filehandle. If the client wants to write 16111 to SU1, the set of valid { network address, filehandle } combinations 16112 for SUi are determined by: 16114 nfl_first_stripe_index = 2 16116 So 16118 j = (1 + 2) % 4 = 3 16120 idx = nflda_stripe_indices[j] 16122 = nflda_stripe_indices[3] 16124 = 0 16126 So 16128 nflda_multipath_ds_list[0] = { A, B, C, D } 16130 and 16132 nfl_fh_list[3] = { 0x36 } 16134 The client can thus write SU1 to { 0x36, { A, B, C, D } }. 16136 For SU3, j = (3 + 2) % 4 = 1, and nflda_stripe_indices[1] = 0. Then 16137 nflda_multipath_ds_list[0] = { A, B, C, D }, and nfl_fh_list[1] = 16138 0x37. The client can thus write SU3 to { 0x37, { A, B, C, D } }. 
16140 The destinations of the first 13 storage units are: 16142 +-----+------------+--------------+ 16143 | SUi | filehandle | data servers | 16144 +-----+------------+--------------+ 16145 | 0 | 87 | E | 16146 | 1 | 36 | A,B,C,D | 16147 | 2 | 67 | F,G | 16148 | 3 | 37 | A,B,C,D | 16149 | | | | 16150 | 4 | 87 | E | 16151 | 5 | 36 | A,B,C,D | 16152 | 6 | 67 | F,G | 16153 | 7 | 37 | A,B,C,D | 16154 | | | | 16155 | 8 | 87 | E | 16156 | 9 | 36 | A,B,C,D | 16157 | 10 | 67 | F,G | 16158 | 11 | 37 | A,B,C,D | 16159 | | | | 16160 | 12 | 87 | E | 16161 +-----+------------+--------------+ 16163 13.4.4. Sparse and Dense Stripe Unit Packing 16165 The flag NFL4_UFLG_DENSE of the nfl_util4 data type (field nflh_util 16166 of the data type nfsv4_1_file_layouthint4 and field nfl_util of data 16167 type nfsv4_1_file_layout4) specifies how the data is packed 16168 within the data file on a data server. It allows for two different 16169 data packings: sparse and dense. The packing type determines the 16170 calculation that will be made to map the client-visible file offset 16171 to the offset within the data file located on the data server. 16173 If nfl_util & NFL4_UFLG_DENSE is zero, this means that sparse packing 16174 is being used. Hence, the logical offsets of the file as viewed by a 16175 client sending READs and WRITEs directly to the metadata server are 16176 the same offsets each data server uses when storing a stripe unit. 16177 The effect then, for striping patterns consisting of at least two 16178 stripe units, is for each data server file to be sparse or "holey". 16179 So for example, suppose there is a pattern with three stripe units, 16180 the stripe unit size is 4096 bytes, and there are three data servers 16181 in the pattern. Then, the file in data server 1 will have stripe 16182 units 0, 3, 6, 9, ... filled; data server 2's file will have stripe 16183 units 1, 4, 7, 10, ... filled; and data server 3's file will have 16184 stripe units 2, 5, 8, 11, ...
filled. The unfilled stripe units of 16185 each file will be holes; hence, the files in each data server are 16186 sparse. 16188 If sparse packing is being used and a client attempts I/O to one of 16189 the holes, then an error MUST be returned by the data server. Using 16190 the above example, if data server 3 received a READ or WRITE 16191 operation for block 4, the data server would return 16192 NFS4ERR_PNFS_IO_HOLE. Thus, data servers need to understand the 16193 striping pattern in order to support sparse packing. 16195 If nfl_util & NFL4_UFLG_DENSE is one, this means that dense packing 16196 is being used, and the data server files have no holes. Dense 16197 packing might be selected because the data server does not 16198 (efficiently) support holey files or because the data server cannot 16199 recognize read-ahead unless there are no holes. If dense packing is 16200 indicated in the layout, the data files will be packed. Using the 16201 same striping pattern and stripe unit size that were used for the 16202 sparse packing example, the corresponding dense packing example would 16203 have all stripe units of all data files filled as follows: 16205 o Logical stripe units 0, 3, 6, ... of the file would live on stripe 16206 units 0, 1, 2, ... of the file of data server 1. 16208 o Logical stripe units 1, 4, 7, ... of the file would live on stripe 16209 units 0, 1, 2, ... of the file of data server 2. 16211 o Logical stripe units 2, 5, 8, ... of the file would live on stripe 16212 units 0, 1, 2, ... of the file of data server 3. 16214 Because dense packing does not leave holes on the data servers, the 16215 pNFS client is allowed to write to any offset of any data file of any 16216 data server in the stripe. Thus, the data servers need not know the 16217 file's striping pattern. 
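The dense mapping of logical stripe units onto data-file offsets can be sketched as follows (non-normative; the values are those of the example above: 4096-byte stripe units and three data servers):

```python
def dense_data_file_offset(file_offset, nfl_pattern_offset,
                           stripe_unit_size, stripe_count):
    # Sketch: map a client-visible file offset to the byte offset
    # within the dense-packed data file on one data server.
    stripe_width = stripe_unit_size * stripe_count
    relative = file_offset - nfl_pattern_offset
    return ((relative // stripe_width) * stripe_unit_size
            + relative % stripe_unit_size)

# Logical stripe units 0, 3, 6 all land on units 0, 1, 2 of data
# server 1's file, as in the bullets above:
assert dense_data_file_offset(0 * 4096, 0, 4096, 3) == 0 * 4096
assert dense_data_file_offset(3 * 4096, 0, 4096, 3) == 1 * 4096
assert dense_data_file_offset(6 * 4096, 0, 4096, 3) == 2 * 4096
```

Because every data server's file is packed from offset zero with no holes, the same formula applies on each data server; only the choice of server and filehandle varies per stripe unit.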
16219 The calculation to determine the byte offset within the data file for 16220 dense data server layouts is: 16222 stripe_width = stripe_unit_size * N; 16223 where N = number of elements in nflda_stripe_indices. 16225 relative_offset = file_offset - nfl_pattern_offset; 16227 data_file_offset = floor(relative_offset / stripe_width) 16228 * stripe_unit_size 16229 + relative_offset % stripe_unit_size 16231 If dense packing is being used, and a data server appears more than 16232 once in a striping pattern, then to distinguish one stripe unit from 16233 another, the data server MUST use a different filehandle. Let's 16234 suppose there are two data servers. Logical stripe units 0, 3, 6 are 16235 served by data server 1; logical stripe units 1, 4, 7 are served by 16236 data server 2; and logical stripe units 2, 5, 8 are also served by 16237 data server 2. Unless data server 2 has two filehandles (each 16238 referring to a different data file), then, for example, a write to 16239 logical stripe unit 1 overwrites the write to logical stripe unit 2 16240 because both logical stripe units are located in the same stripe unit 16241 (0) of data server 2. 16243 13.5. Data Server Multipathing 16245 The NFSv4.1 file layout supports multipathing to multiple data server 16246 addresses. Data-server-level multipathing is used for bandwidth 16247 scaling via trunking (Section 2.10.5) and for higher availability of 16248 use in the case of a data-server failure. Multipathing allows the 16249 client to switch to another data server address which may be that of 16250 another data server that is exporting the same data stripe unit, 16251 without having to contact the metadata server for a new layout. 16253 To support data server multipathing, each element of the 16254 nflda_multipath_ds_list contains an array of one or more data server 16255 network addresses.
This array (data type multipath_list4) represents 16256 a list of data servers (each identified by a network address), with 16257 the possibility that some data servers will appear in the list 16258 multiple times. 16260 The client is free to use any of the network addresses as a 16261 destination to send data server requests. If some network addresses 16262 are less optimal paths to the data than others, then the MDS SHOULD 16263 NOT include those network addresses in an element of 16264 nflda_multipath_ds_list. If less optimal network addresses exist to 16265 provide failover, the RECOMMENDED method to offer the addresses is to 16266 provide them in a replacement device-ID-to-device-address mapping, or 16267 a replacement device ID. When a client finds that no data server in 16268 an element of nflda_multipath_ds_list responds, it SHOULD send a 16269 GETDEVICEINFO to attempt to replace the existing device-ID-to-device- 16270 address mappings. If the MDS detects that all data servers 16271 represented by an element of nflda_multipath_ds_list are unavailable, 16272 the MDS SHOULD send a CB_NOTIFY_DEVICEID (if the client has indicated 16273 it wants device ID notifications for changed device IDs) to change 16274 the device-ID-to-device-address mappings to the available data 16275 servers. If the device ID itself will be replaced, the MDS SHOULD 16276 recall all layouts with the device ID, and thus force the client to 16277 get new layouts and device ID mappings via LAYOUTGET and 16278 GETDEVICEINFO. 16280 Generally, if two network addresses appear in an element of 16281 nflda_multipath_ds_list, they will designate the same data server, 16282 and the two data server addresses will support the implementation of 16283 client ID or session trunking (the latter is RECOMMENDED) as defined 16284 in Section 2.10.5. The two data server addresses will share the same 16285 server owner or major ID of the server owner. 
It is not always 16286 necessary for the two data server addresses to designate the same 16287 server with trunking being used. For example, the data could be 16288 read-only, and the data consist of exact replicas. 16290 13.6. Operations Sent to NFSv4.1 Data Servers 16292 Clients accessing data on an NFSv4.1 data server MUST send only the 16293 NULL procedure and COMPOUND procedures whose operations are taken 16294 only from two restricted subsets of the operations defined as valid 16295 NFSv4.1 operations. Clients MUST use the filehandle specified by the 16296 layout when accessing data on NFSv4.1 data servers. 16298 The first of these operation subsets consists of management 16299 operations. This subset consists of the BACKCHANNEL_CTL, 16300 BIND_CONN_TO_SESSION, CREATE_SESSION, DESTROY_CLIENTID, 16301 DESTROY_SESSION, EXCHANGE_ID, SECINFO_NO_NAME, SET_SSV, and SEQUENCE 16302 operations. The client may use these operations in order to set up 16303 and maintain the appropriate client IDs, sessions, and security 16304 contexts involved in communication with the data server. Henceforth, 16305 these will be referred to as data-server housekeeping operations. 16307 The second subset consists of COMMIT, READ, WRITE, and PUTFH. These 16308 operations MUST be used with a current filehandle specified by the 16309 layout. In the case of PUTFH, the new current filehandle MUST be one 16310 taken from the layout. Henceforth, these will be referred to as 16311 data-server I/O operations. As described in Section 12.5.1, a client 16312 MUST NOT send an I/O to a data server for which it does not hold a 16313 valid layout; the data server MUST reject such an I/O. 
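As a non-normative sketch, a data server without a concurrent non-data-server personality could screen each incoming COMPOUND against the two subsets above; operation names stand in for their numeric NFSv4.1 opcodes:

```python
# The two operation subsets a client may send to an NFSv4.1 data
# server, as listed above: housekeeping and data-server I/O.
HOUSEKEEPING_OPS = {
    "BACKCHANNEL_CTL", "BIND_CONN_TO_SESSION", "CREATE_SESSION",
    "DESTROY_CLIENTID", "DESTROY_SESSION", "EXCHANGE_ID",
    "SECINFO_NO_NAME", "SET_SSV", "SEQUENCE",
}
DS_IO_OPS = {"COMMIT", "READ", "WRITE", "PUTFH"}

def ds_permits(compound_ops):
    """True if every operation is in one of the two permitted
    subsets; otherwise the server returns NFS4ERR_NOTSUPP."""
    return all(op in HOUSEKEEPING_OPS or op in DS_IO_OPS
               for op in compound_ops)

assert ds_permits(["SEQUENCE", "PUTFH", "READ"])
assert not ds_permits(["SEQUENCE", "PUTFH", "GETATTR"])  # NFS4ERR_NOTSUPP
```

A real implementation would additionally verify that the current filehandle comes from a layout and that the client holds a valid layout for the I/O range.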
16315 Unless the server has a concurrent non-data-server personality -- 16316 i.e., EXCHANGE_ID results returned (EXCHGID4_FLAG_USE_PNFS_DS | 16317 EXCHGID4_FLAG_USE_PNFS_MDS) or (EXCHGID4_FLAG_USE_PNFS_DS | 16318 EXCHGID4_FLAG_USE_NON_PNFS) see Section 13.1 -- any attempted use of 16319 operations against a data server other than those specified in the 16320 two subsets above MUST return NFS4ERR_NOTSUPP to the client. 16322 When the server has concurrent data-server and non-data-server 16323 personalities, each COMPOUND sent by the client MUST be constructed 16324 so that it is appropriate to one of the two personalities, and it 16325 MUST NOT contain operations directed to a mix of those personalities. 16326 The server MUST enforce this. To understand the constraints, 16327 operations within a COMPOUND are divided into the following three 16328 classes: 16330 1. An operation that is ambiguous regarding its personality 16331 assignment. This includes all of the data-server housekeeping 16332 operations. Additionally, if the server has assigned filehandles 16333 so that the ones defined by the layout are the same as those used 16334 by the metadata server, all operations using such filehandles are 16335 within this class, with the following exception. The exception 16336 is that if the operation uses a stateid that is incompatible with 16337 a data-server personality (e.g., a special stateid or the stateid 16338 has a non-zero "seqid" field, see Section 13.9.1), the operation 16339 is in class 3, as described below. A COMPOUND containing 16340 multiple class 1 operations (and operations of no other class) 16341 MAY be sent to a server with multiple concurrent data server and 16342 non-data-server personalities. 16344 2. An operation that is unambiguously referable to the data-server 16345 personality. This includes data-server I/O operations where the 16346 filehandle is one that can only be validly directed to the data- 16347 server personality. 16349 3. 
An operation that is unambiguously referable to the non-data- 16350 server personality. This includes all COMPOUND operations that 16351 are neither data-server housekeeping nor data-server I/O 16352 operations, plus data-server I/O operations where the current fh 16353 (or the one to be made the current fh in the case of PUTFH) is 16354 only valid on the metadata server or where a stateid is used that 16355 is incompatible with the data server, i.e., is a special stateid 16356 or has a non-zero seqid value. 16358 When a COMPOUND first executes an operation from class 3 above, it 16359 acts as a normal COMPOUND on any other server, and the data-server 16360 personality ceases to be relevant. There are no special restrictions 16361 on the operations in the COMPOUND to limit them to those for a data 16362 server. When a PUTFH is done, filehandles derived from the layout 16363 are not valid. If their format is not normally acceptable, then 16364 NFS4ERR_BADHANDLE MUST result. Similarly, current filehandles for 16365 other operations do not accept filehandles derived from layouts and 16366 are not normally usable on the metadata server. Using these will 16367 result in NFS4ERR_STALE. 16369 When a COMPOUND first executes an operation from class 2, which would 16370 be PUTFH where the filehandle is one from a layout, the COMPOUND 16371 henceforth is interpreted with respect to the data-server 16372 personality. Operations outside the two classes discussed above MUST 16373 result in NFS4ERR_NOTSUPP. Filehandles are validated using the rules 16374 of the data server, resulting in NFS4ERR_BADHANDLE and/or 16375 NFS4ERR_STALE even when they would not normally do so when addressed 16376 to the non-data-server personality. Stateids must obey the rules of 16377 the data server in that any use of special stateids or stateids with 16378 non-zero seqid values must result in NFS4ERR_BAD_STATEID. 
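The three-way classification can be sketched as follows (non-normative; the filehandle sets and stateid inputs are hypothetical stand-ins for state a real server would derive internally):

```python
# Sketch of the three classes above for a server with concurrent
# data-server and non-data-server personalities. LAYOUT_FHS models
# filehandles handed out in layouts; MDS_FHS those valid on the
# metadata server personality.
HOUSEKEEPING = {"BACKCHANNEL_CTL", "BIND_CONN_TO_SESSION",
                "CREATE_SESSION", "DESTROY_CLIENTID", "DESTROY_SESSION",
                "EXCHANGE_ID", "SECINFO_NO_NAME", "SET_SSV", "SEQUENCE"}
DS_IO = {"COMMIT", "READ", "WRITE", "PUTFH"}

def classify(op, fh, seqid, is_special, layout_fhs, mds_fhs):
    """Return 1 (ambiguous), 2 (data server), or 3 (non-data-server)."""
    if op in HOUSEKEEPING:
        return 1
    stateid_bad_for_ds = is_special or seqid != 0
    if op in DS_IO and fh in layout_fhs and not stateid_bad_for_ds:
        # A filehandle shared with the MDS keeps the op ambiguous.
        return 1 if fh in mds_fhs else 2
    return 3

assert classify("READ", "L", 0, False, {"L"}, set()) == 2
assert classify("READ", "S", 0, False, {"S"}, {"S"}) == 1
assert classify("READ", "S", 5, False, {"S"}, {"S"}) == 3  # bad seqid
assert classify("GETATTR", "M", 0, False, {"L"}, {"M"}) == 3
```

The sketch also shows why separate filehandles per personality are RECOMMENDED: with disjoint filehandle sets, no I/O operation ever lands in class 1.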
16380 Until the server first executes an operation from class 2 or class 3, 16381 the client MUST NOT depend on the operation being executed by either 16382 the data-server or the non-data-server personality. The server MUST 16383 pick one personality consistently for a given COMPOUND, with the only 16384 possible transition being a single one when the first operation from 16385 class 2 or class 3 is executed. 16387 Because of the complexity induced by assigning filehandles so they 16388 can be used on both a data server and a metadata server, it is 16389 RECOMMENDED that where the same server can have both personalities, 16390 the server assign separate unique filehandles to both personalities. 16391 This makes it unambiguous for which server a given request is 16392 intended. 16394 GETATTR and SETATTR MUST be directed to the metadata server. In the 16395 case of a SETATTR of the size attribute, the control protocol is 16396 responsible for propagating size updates/truncations to the data 16397 servers. In the case of extending WRITEs to the data servers, the 16398 new size must be visible on the metadata server once a LAYOUTCOMMIT 16399 has completed (see Section 12.5.4.2). Section 13.10 describes the 16400 mechanism by which the client is to handle data-server files that do 16401 not reflect the metadata server's size. 16403 13.7. COMMIT through Metadata Server 16405 The file layout provides two alternate means of providing for the 16406 commit of data written through data servers. The flag 16407 NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file layout 16408 (data type nfsv4_1_file_layout4) is an indication from the metadata 16409 server to the client of the REQUIRED way of performing COMMIT, either 16410 by sending the COMMIT to the data server or the metadata server. 16411 These two methods of dealing with the issue correspond to broad 16412 styles of implementation for a pNFS server supporting the file layout 16413 type. 
16415 o When the flag is FALSE, COMMIT operations MUST be sent to the 16416 data server to which the corresponding WRITE operations were sent. 16417 This approach is sometimes useful when file striping is 16418 implemented within the pNFS server (instead of the file system), 16419 with the individual data servers each implementing their own file 16420 systems. 16422 o When the flag is TRUE, COMMIT operations MUST be sent to the 16423 metadata server, rather than to the individual data servers. This 16424 approach is sometimes useful when file striping is implemented 16425 within the clustered file system that is the backend to the pNFS 16426 server. In such an implementation, each COMMIT to each data 16427 server might result in repeated writes of metadata blocks to the 16428 detriment of write performance. Sending a single COMMIT to the 16429 metadata server can be more efficient when there exists a 16430 clustered file system capable of implementing such a coordinated 16431 COMMIT. 16433 If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to 16434 maintain the current NFSv4.1 commit and recovery model, the data 16435 servers MUST return a common writeverf verifier in all WRITE 16436 responses for a given file layout, and the metadata server's 16437 COMMIT implementation must return the same writeverf. The value 16438 of the writeverf verifier MUST be changed at the metadata server 16439 or any data server that is referenced in the layout, whenever 16440 there is a server event that can possibly lead to loss of 16441 uncommitted data. The scope of the verifier can be for a file or 16442 for the entire pNFS server. It might be more difficult for the 16443 server to maintain the verifier at the file level, but the benefit 16444 is that only events that impact a given file will require recovery 16445 action.
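The common-verifier requirement implies a simple client-side check when COMMIT is sent through the metadata server; a non-normative sketch (writeverf modeled as an opaque 8-byte string):

```python
# Sketch: the client collects the write verifier from each WRITE
# reply, then compares the verifier returned by the single COMMIT
# sent to the metadata server. A mismatch means uncommitted data may
# have been lost, and the affected WRITEs must be resent.

def commit_succeeded(write_verfs, commit_verf):
    """All WRITE verifiers must match the COMMIT verifier; otherwise
    the client treats the data as a WRITE failure and recovers."""
    return all(v == commit_verf for v in write_verfs)

verfs = [b"\x00" * 8, b"\x00" * 8]
assert commit_succeeded(verfs, b"\x00" * 8)
# A changed verifier (e.g., after a server restart) forces resends:
assert not commit_succeeded(verfs, b"\x01" + b"\x00" * 7)
```

This check is only meaningful when NFL4_UFLG_COMMIT_THRU_MDS is TRUE; when it is FALSE, a COMMIT to the metadata server commits only data written to the metadata server itself.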
Note that if the layout specifies dense packing, the offset used in a COMMIT to the metadata server may differ from the offset used in a COMMIT to the data server.

The single COMMIT to the metadata server will return a verifier, and the client should compare it to all the verifiers from the WRITEs and fail the COMMIT if there are any mismatched verifiers. If the COMMIT to the metadata server fails, the client should re-send WRITEs for all the modified data in the file. The client should treat modified data with a mismatched verifier as a WRITE failure and try to recover by resending the WRITEs to the original data server or using another path to that data if the layout has not been recalled. Alternatively, the client can obtain a new layout, or it could rewrite the data directly to the metadata server. If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is FALSE, sending a COMMIT to the metadata server might have no effect; in that case, a COMMIT sent to the metadata server should be used only to commit data that was written to the metadata server. See Section 12.7.6 for recovery options.

13.8. The Layout Iomode

The layout iomode need not be used by the metadata server when servicing NFSv4.1 file-based layouts, although in some circumstances it may be useful. For example, if the server implementation supports reading from read-only replicas or mirrors, it would be useful for the server to return a layout enabling the client to do so. As such, the client SHOULD set the iomode based on its intent to read or write the data. The client may default to an iomode of LAYOUTIOMODE4_RW. The iomode need not be checked by the data servers when clients perform I/O. However, the data servers SHOULD still validate that the client holds a valid layout and return an error if the client does not.

13.9. Metadata and Data Server State Coordination

13.9.1. Global Stateid Requirements

When the client sends I/O to a data server, the stateid used MUST NOT be a layout stateid as returned by LAYOUTGET or sent by CB_LAYOUTRECALL. Permitted stateids are based on one of the following: an OPEN stateid (the stateid field of data type OPEN4resok as returned by OPEN), a delegation stateid (the stateid field of data types open_read_delegation4 and open_write_delegation4 as returned by OPEN or WANT_DELEGATION, or as sent by CB_PUSH_DELEG), or a stateid returned by the LOCK or LOCKU operations. The stateid sent to the data server MUST be sent with the seqid set to zero, indicating the most current version of that stateid, rather than indicating a specific non-zero seqid value. In no case is the use of special stateid values allowed.

The stateid used for I/O MUST have the same effect and be subject to the same validation on a data server as it would if the I/O were being performed on the metadata server itself in the absence of pNFS. This has the implication that stateids are globally valid on both the metadata and data servers. This requires the metadata server to propagate changes in LOCK and OPEN state to the data servers, so that the data servers can validate I/O accesses. This is discussed further in Section 13.9.2. Depending on when stateids are propagated, the existence of a valid stateid on the data server may act as proof of a valid layout.

Clients performing I/O operations need to select an appropriate stateid based on the locks (including opens and delegations) held by the client and the various types of state-owners sending the I/O requests.
The rules for doing so when referencing data servers are somewhat different from those discussed in Section 8.2.5, which apply when accessing metadata servers.

The following rules, applied in order of decreasing priority, govern the selection of the appropriate stateid:

o If the client holds a delegation for the file in question, the delegation stateid should be used.

o Otherwise, there must be an OPEN stateid for the current open-owner, and that OPEN stateid for the open file in question is used, unless mandatory locking prevents that. See below.

o If the data server had previously responded with NFS4ERR_LOCKED to use of the OPEN stateid, then the client should use the byte-range lock stateid whenever one exists for that open file with the current lock-owner.

o Special stateids should never be used. If they are used, the data server MUST reject the I/O with an NFS4ERR_BAD_STATEID error.

13.9.2. Data Server State Propagation

Since the metadata server, which handles byte-range lock and open-mode state changes as well as ACLs, might not be co-located with the data servers where I/O accesses are validated, the server implementation MUST take care of propagating changes of this state to the data servers. Once the propagation to the data servers is complete, the full effect of those changes MUST be in effect at the data servers. However, some state changes need not be propagated immediately, although all changes SHOULD be propagated promptly. These state propagations have an impact on the design of the control protocol, even though the control protocol is outside of the scope of this specification. Immediate propagation refers to the synchronous propagation of state from the metadata server to the data server(s); the propagation must be complete before returning to the client.

13.9.2.1. Lock State Propagation

If the pNFS server supports mandatory byte-range locking, any mandatory byte-range locks on a file MUST be made effective at the data servers before the request that establishes them returns to the caller. The effect MUST be the same as if the mandatory byte-range lock state were synchronously propagated to the data servers, even though the details of the control protocol may avoid actual transfer of the state under certain circumstances.

On the other hand, since advisory byte-range lock state is not used for checking I/O accesses at the data servers, there is no semantic reason for propagating advisory byte-range lock state to the data servers. Since updates to advisory locks neither confer nor remove privileges, these changes need not be propagated immediately, and may not need to be propagated promptly. The updates to advisory locks need only be propagated when the data server needs to resolve a question about a stateid. In fact, if byte-range locking is not mandatory (i.e., is advisory), clients are advised to avoid using byte-range lock-based stateids for I/O. The stateids returned by OPEN are sufficient and eliminate overhead for this kind of state propagation.

If a client gets back an NFS4ERR_LOCKED error from a data server, this is an indication that mandatory byte-range locking is in force. The client recovers from this by getting a byte-range lock that covers the affected range and re-sending the I/O with the stateid of the byte-range lock.

13.9.2.2. Open and Deny Mode Validation

Open and deny mode validation MUST be performed against the open and deny mode(s) held by the data servers.
When access is reduced or a deny mode made more restrictive (because of CLOSE or OPEN_DOWNGRADE), the data server MUST prevent any I/Os that would be denied if performed on the metadata server. When access is expanded, the data server MUST make sure that no requests are subsequently rejected because of open or deny issues that no longer apply, given the previous relaxation.

13.9.2.3. File Attributes

Since the SETATTR operation has the ability to modify state that is visible on both the metadata and data servers (e.g., the size), care must be taken to ensure that the resultant state across the set of data servers is consistent, especially when truncating or growing the file.

As described earlier, the LAYOUTCOMMIT operation is used to ensure that the metadata is synchronized with changes made to the data servers. For the NFSv4.1-based data storage protocol, it is necessary to re-synchronize state such as the size attribute, and the setting of mtime/change/atime. See Section 12.5.4 for a full description of the semantics regarding LAYOUTCOMMIT and attribute synchronization. It should be noted that by using an NFSv4.1-based layout type, it is possible to synchronize this state before LAYOUTCOMMIT occurs. For example, the control protocol can be used to query the attributes present on the data servers.

Any changes to file attributes that control authorization or access as reflected by ACCESS calls or READs and WRITEs on the metadata server MUST be propagated to the data servers for enforcement on READ and WRITE I/O calls. If the changes made on the metadata server result in more restrictive access permissions for any user, those changes MUST be propagated to the data servers synchronously.
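The synchronous-propagation rule above hinges on whether a change can be more restrictive for some user. As a non-normative sketch, restricted here to the mode attribute (a real server must also consider ACLs, open/deny modes, and other state; the function name is hypothetical):

```python
def must_propagate_synchronously(old_mode, new_mode):
    """Return True if a mode change removes any permission bit.

    A bit set in old_mode but cleared in new_mode denies an access that
    was previously allowed, so the data servers must see the change
    before the SETATTR returns; a purely permissive change may be
    propagated asynchronously (though still promptly).
    """
    removed_bits = old_mode & ~new_mode
    return removed_bits != 0
```

For example, changing 0644 to 0600 removes group/other read and requires synchronous propagation, while changing 0644 to 0664 only adds permission.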
The OPEN operation (Section 18.16.4) does not impose any requirement that I/O operations on an open file have the same credentials as the OPEN itself (unless EXCHGID4_FLAG_BIND_PRINC_STATEID is set when EXCHANGE_ID creates the client ID), and so it requires the server's READ and WRITE operations to perform appropriate access checking.

Changes to ACLs also require new access checking by READ and WRITE on the server. The propagation of access-right changes due to changes in ACLs may be asynchronous only if the server implementation is able to determine that the updated ACL is not more restrictive for any user specified in the old ACL. Due to the relative infrequency of ACL updates, it is suggested that all changes be propagated synchronously.

13.10. Data Server Component File Size

A potential problem exists when a component data file on a particular data server has grown past EOF; the problem exists for both dense and sparse layouts. Imagine the following scenario: a client creates a new file (size == 0) and writes to byte 131072; the client then seeks to the beginning of the file and reads byte 100. The client should receive zeroes back as a result of the READ. However, if the striping pattern directs the client to send the READ to a data server other than the one that received the client's original WRITE, the data server servicing the READ may believe that the file's size is still 0 bytes. In that event, the data server's READ response will contain zero bytes and an indication of EOF. The data server can only return zeroes if it knows that the file's size has been extended. This would require the immediate propagation of the file's size to all data servers, which is potentially very costly. Therefore, the client that has initiated the extension of the file's size MUST be prepared to deal with these EOF conditions. When the offset in the arguments to READ is less than the client's view of the file size, if the READ response indicates EOF and/or contains fewer bytes than requested, the client will interpret such a response as a hole in the file, and the NFS client will substitute zeroes for the data.

The NFSv4.1 protocol only provides close-to-open file data cache semantics, meaning that when the file is closed, all modified data is written to the server. When a subsequent OPEN of the file is done, the change attribute is inspected for a difference from a cached value for the change attribute. For the case above, this means that a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and will update the file's size and change attribute. Access from another client after that point will result in the appropriate size being returned.

13.11. Layout Revocation and Fencing

As described in Section 12.7, the layout-type-specific storage protocol is responsible for handling the effects of I/Os that started before lease expiration and extend through lease expiration. The LAYOUT4_NFSV4_1_FILES layout type can prevent all I/Os to data servers from being executed after lease expiration (this prevention is called "fencing"), without relying on a precise client lease timer and without requiring data servers to maintain lease timers. The LAYOUT4_NFSV4_1_FILES pNFS server has the flexibility to revoke individual layouts, and thus fence I/O on a per-file basis.

In addition to lease expiration, the reasons a layout can be revoked include: the client fails to respond to a CB_LAYOUTRECALL, the metadata server restarts, or administrative intervention occurs. Regardless of the reason, once a client's layout has been revoked, the pNFS server MUST prevent the client from sending I/O for the affected file from and to all data servers; in other words, it MUST fence the client from the affected file on the data servers.

Fencing works as follows. As described in Section 13.1, in COMPOUND procedure requests to the data server, the data filehandle provided by the PUTFH operation and the stateid in the READ or WRITE operation are used to ensure that the client has a valid layout for the I/O being performed; if it does not, the I/O is rejected with NFS4ERR_PNFS_NO_LAYOUT. The server can simply check the stateid and, additionally, make the data filehandle stale if the layout specified a data filehandle that is different from the metadata server's filehandle for the file (see the nfl_fh_list description in Section 13.3).

Before the metadata server takes any action to revoke layout state given out by a previous instance, it must make sure that all layout state from that previous instance is invalidated at the data servers. This has the following implications.

o The metadata server must not restripe a file until it has contacted all of the data servers to invalidate the layouts from the previous instance.

o The metadata server must not give out mandatory locks that conflict with layouts from the previous instance without either doing a specific layout invalidation (as it would have to do anyway) or doing a global data server invalidation.

13.12. Security Considerations for the File Layout Type

The NFSv4.1 file layout type MUST adhere to the security considerations outlined in Section 12.9. NFSv4.1 data servers MUST make all of the required access checks on each READ or WRITE I/O as determined by the NFSv4.1 protocol.
If the metadata server would deny a READ or WRITE operation on a file due to its ACL, mode attribute, open access mode, open deny mode, mandatory byte-range lock state, or any other attributes and state, the data server MUST also deny the READ or WRITE operation. This impacts the control protocol and the propagation of state from the metadata server to the data servers; see Section 13.9.2 for more details.

The methods for authentication, integrity, and privacy for data servers based on the LAYOUT4_NFSV4_1_FILES layout type are the same as those used by metadata servers. Metadata and data servers use ONC RPC security flavors to authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the security mechanism and services to be used. Thus, when using the LAYOUT4_NFSV4_1_FILES layout type, the impact on the RPC-based security model due to pNFS (as alluded to in Sections 1.8.1 and 1.8.2.2) is zero.

For a given file object, a metadata server MAY require different security parameters (secinfo4 value) than the data server. For a given file object with multiple data servers, the secinfo4 value SHOULD be the same across all data servers. If the secinfo4 values across a metadata server and its data servers differ for a specific file, the mapping of the principal to the server's internal user identifier MUST be the same in order for the access-control checks based on ACL, mode, open and deny mode, and mandatory locking to be consistent across the pNFS server.

If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file layouts, then the implementation MUST support the SECINFO_NO_NAME operation on both the metadata and data servers.

14. Internationalization

The primary area in which NFSv4.1 needs to deal with internationalization, or I18N, is that of file names and other strings as used within the protocol. The choice of string representation must allow reasonable name/string access to clients that use various languages. The UTF-8 encoding of the UCS (Universal Multiple-Octet Coded Character Set) as defined by ISO 10646 [18] allows for this type of access and follows the policy described in "IETF Policy on Character Sets and Languages", RFC 2277 [19].

RFC 3454 [16], otherwise known as "stringprep", documents a framework for using Unicode/UTF-8 in networking protocols so as "to increase the likelihood that string input and string comparison work in ways that make sense for typical users throughout the world". A protocol must define a profile of stringprep "in order to fully specify the processing options". The remainder of this section defines the NFSv4.1 stringprep profiles. Much of the terminology used for the remainder of this section comes from stringprep.

There are three UTF-8 string types defined for NFSv4.1: utf8str_cs, utf8str_cis, and utf8str_mixed. Separate profiles are defined for each. Each profile defines the following, as required by stringprep:

o The intended applicability of the profile.

o The character repertoire that is the input and output to stringprep (which is Unicode 3.2 for the referenced version of stringprep). However, NFSv4.1 implementations are not limited to 3.2.

o The mapping tables from stringprep used (as described in Section 3 of stringprep).

o Any additional mapping tables specific to the profile.

o The Unicode normalization used, if any (as described in Section 4 of stringprep).
o The tables from the stringprep listing of characters that are prohibited as output (as described in Section 5 of stringprep).

o The bidirectional string testing used, if any (as described in Section 6 of stringprep).

o Any additional characters that are prohibited as output specific to the profile.

Stringprep discusses Unicode characters, whereas NFSv4.1 renders UTF-8 characters. Since there is a one-to-one mapping from UTF-8 to Unicode, when the remainder of this document refers to Unicode, the reader should assume UTF-8.

Much of the text for the profiles comes from RFC 3491 [20].

14.1. Stringprep Profile for the utf8str_cs Type

Every use of the utf8str_cs type definition in the NFSv4 protocol specification follows the profile named nfs4_cs_prep.

14.1.1. Intended Applicability of the nfs4_cs_prep Profile

The utf8str_cs type is a case-sensitive string of UTF-8 characters. Its primary use in NFSv4.1 is for naming components and pathnames. Components and pathnames are stored on the server's file system. Two valid distinct UTF-8 strings might be the same after processing via the utf8str_cs profile. If the strings are two names inside a directory, the NFSv4.1 server will need to either:

o disallow the creation of a second name if its post-processed form collides with that of an existing name, or

o allow the creation of the second name, but arrange so that after post-processing, the second name is different than the post-processed form of the first name.

14.1.2. Character Repertoire of nfs4_cs_prep

The nfs4_cs_prep profile uses Unicode 3.2, as defined in stringprep's Appendix A.1. However, NFSv4.1 implementations are not limited to 3.2.

14.1.3. Mapping Used by nfs4_cs_prep

The nfs4_cs_prep profile specifies mapping using the following tables from stringprep:

   Table B.1

Table B.2 is normally not part of the nfs4_cs_prep profile, as it is primarily for dealing with case-insensitive comparisons. However, if the NFSv4.1 file server supports the case_insensitive file system attribute, and if case_insensitive is TRUE, the NFSv4.1 server MUST use Table B.2 (in addition to Table B.1) when processing utf8str_cs strings, and the NFSv4.1 client MUST assume Table B.2 (in addition to Table B.1) is being used.

If the case_preserving attribute is present and set to FALSE, then the NFSv4.1 server MUST use Table B.2 to map case when processing utf8str_cs strings. Whether the server maps from lower to upper case or from upper to lower case is an implementation dependency.

14.1.4. Normalization used by nfs4_cs_prep

The nfs4_cs_prep profile does not specify a normalization form. A later revision of this specification may specify a particular normalization form. Therefore, the server and client can expect that they may receive unnormalized characters within protocol requests and responses. If the operating environment requires normalization, then the implementation must normalize utf8str_cs strings within the protocol before presenting the information to an application (at the client) or local file system (at the server).

14.1.5. Prohibited Output for nfs4_cs_prep

The nfs4_cs_prep profile RECOMMENDS prohibiting the use of the following tables from stringprep:

   Table C.5

   Table C.6

14.1.6. Bidirectional Output for nfs4_cs_prep

The nfs4_cs_prep profile does not specify any checking of bidirectional strings.

14.2. Stringprep Profile for the utf8str_cis Type

Every use of the utf8str_cis type definition in the NFSv4.1 protocol specification follows the profile named nfs4_cis_prep.

14.2.1. Intended Applicability of the nfs4_cis_prep Profile

The utf8str_cis type is a case-insensitive string of UTF-8 characters. Its primary use in NFSv4.1 is for naming NFS servers.

14.2.2. Character Repertoire of nfs4_cis_prep

The nfs4_cis_prep profile uses Unicode 3.2, as defined in stringprep's Appendix A.1. However, NFSv4.1 implementations are not limited to 3.2.

14.2.3. Mapping Used by nfs4_cis_prep

The nfs4_cis_prep profile specifies mapping using the following tables from stringprep:

   Table B.1

   Table B.2

14.2.4. Normalization Used by nfs4_cis_prep

The nfs4_cis_prep profile specifies using Unicode normalization form KC, as described in stringprep.

14.2.5. Prohibited Output for nfs4_cis_prep

The nfs4_cis_prep profile specifies prohibiting the use of the following tables from stringprep:

   Table C.1.2

   Table C.2.2

   Table C.3

   Table C.4

   Table C.5

   Table C.6

   Table C.7

   Table C.8

   Table C.9

14.2.6. Bidirectional Output for nfs4_cis_prep

The nfs4_cis_prep profile specifies checking bidirectional strings as described in stringprep's Section 6.

14.3. Stringprep Profile for the utf8str_mixed Type

Every use of the utf8str_mixed type definition in the NFSv4.1 protocol specification follows the profile named nfs4_mixed_prep.

14.3.1. Intended Applicability of the nfs4_mixed_prep Profile

The utf8str_mixed type is a string of UTF-8 characters, with a prefix that is case sensitive, a separator equal to '@', and a suffix that is a fully qualified domain name. Its primary use in NFSv4.1 is for naming principals identified in an Access Control Entry.

14.3.2.
Character Repertoire of nfs4_mixed_prep

The nfs4_mixed_prep profile uses Unicode 3.2, as defined in stringprep's Appendix A.1. However, NFSv4.1 implementations are not limited to 3.2.

14.3.3. Mapping Used by nfs4_mixed_prep

For the prefix and the separator of a utf8str_mixed string, the nfs4_mixed_prep profile specifies mapping using the following table from stringprep:

   Table B.1

For the suffix of a utf8str_mixed string, the nfs4_mixed_prep profile specifies mapping using the following tables from stringprep:

   Table B.1

   Table B.2

14.3.4. Normalization Used by nfs4_mixed_prep

The nfs4_mixed_prep profile specifies using Unicode normalization form KC, as described in stringprep.

14.3.5. Prohibited Output for nfs4_mixed_prep

The nfs4_mixed_prep profile specifies prohibiting the use of the following tables from stringprep:

   Table C.1.2

   Table C.2.2

   Table C.3

   Table C.4

   Table C.5

   Table C.6

   Table C.7

   Table C.8

   Table C.9

14.3.6. Bidirectional Output for nfs4_mixed_prep

The nfs4_mixed_prep profile specifies checking bidirectional strings as described in stringprep's Section 6.

14.4. UTF-8 Capabilities

   const FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1;
   const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8  = 0x2;

   typedef uint32_t fs_charset_cap4;

Because some operating environments and file systems do not enforce character set encodings, NFSv4.1 supports the fs_charset_cap attribute (Section 5.8.2.11) that indicates to the client a file system's UTF-8 capabilities. The attribute is an integer containing a pair of flags. The first flag is FSCHARSET_CAP4_CONTAINS_NON_UTF8, which, if set to one, tells the client that the file system contains non-UTF-8 characters, and that the server will not convert non-UTF-8 characters to UTF-8 if the client reads a symlink or directory; nor will operations with component names or pathnames in the arguments convert the strings to UTF-8. The second flag is FSCHARSET_CAP4_ALLOWS_ONLY_UTF8, which, if set to one, indicates that the server will accept (and generate) only UTF-8 characters on the file system. If FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to one, FSCHARSET_CAP4_CONTAINS_NON_UTF8 MUST be set to zero. FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 SHOULD always be set to one.

14.5. UTF-8 Related Errors

Where the client sends an invalid UTF-8 string, the server should return NFS4ERR_INVAL (see Table 5). This includes cases in which inappropriate prefixes are detected and where the count includes trailing bytes that do not constitute a full UCS character.

Where the client-supplied string is valid UTF-8 but contains characters that are not supported by the server as a value for that string (e.g., names containing characters outside of Unicode plane 0 on file systems that fail to support such characters despite their presence in the Unicode standard), the server should return NFS4ERR_BADCHAR.

Where a UTF-8 string is used as a file name, and the file system (while supporting all of the characters within the name) does not allow that particular name to be used, the server should return the error NFS4ERR_BADNAME (Table 5). This includes situations in which the server file system imposes a normalization constraint on name strings, but will also include such situations as file system prohibitions of "." and ".." as file names for certain operations, and other such constraints.

15.
Error Values

NFS error numbers are assigned to failed operations within a Compound
(COMPOUND or CB_COMPOUND) request. A Compound request contains a
number of NFS operations that have their results encoded in sequence
in a Compound reply. The results of successful operations will
consist of an NFS4_OK status followed by the encoded results of the
operation. If an NFS operation fails, an error status will be
entered in the reply and the Compound request will be terminated.

15.1. Error Definitions

Protocol Error Definitions

   +-----------------------------------+--------+-------------------+
   | Error                             | Number | Description       |
   +-----------------------------------+--------+-------------------+
   | NFS4_OK                           | 0      | Section 15.1.3.1  |
   | NFS4ERR_ACCESS                    | 13     | Section 15.1.6.1  |
   | NFS4ERR_ATTRNOTSUPP               | 10032  | Section 15.1.15.1 |
   | NFS4ERR_ADMIN_REVOKED             | 10047  | Section 15.1.5.1  |
   | NFS4ERR_BACK_CHAN_BUSY            | 10057  | Section 15.1.12.1 |
   | NFS4ERR_BADCHAR                   | 10040  | Section 15.1.7.1  |
   | NFS4ERR_BADHANDLE                 | 10001  | Section 15.1.2.1  |
   | NFS4ERR_BADIOMODE                 | 10049  | Section 15.1.10.1 |
   | NFS4ERR_BADLAYOUT                 | 10050  | Section 15.1.10.2 |
   | NFS4ERR_BADNAME                   | 10041  | Section 15.1.7.2  |
   | NFS4ERR_BADOWNER                  | 10039  | Section 15.1.15.2 |
   | NFS4ERR_BADSESSION                | 10052  | Section 15.1.11.1 |
   | NFS4ERR_BADSLOT                   | 10053  | Section 15.1.11.2 |
   | NFS4ERR_BADTYPE                   | 10007  | Section 15.1.4.1  |
   | NFS4ERR_BADXDR                    | 10036  | Section 15.1.1.1  |
   | NFS4ERR_BAD_COOKIE                | 10003  | Section 15.1.1.2  |
   | NFS4ERR_BAD_HIGH_SLOT             | 10077  | Section 15.1.11.3 |
   | NFS4ERR_BAD_RANGE                 | 10042  | Section 15.1.8.1  |
   | NFS4ERR_BAD_SEQID                 | 10026  | Section 15.1.16.1 |
   | NFS4ERR_BAD_SESSION_DIGEST        | 10051  | Section 15.1.12.2 |
   | NFS4ERR_BAD_STATEID               | 10025  | Section 15.1.5.2  |
   | NFS4ERR_CB_PATH_DOWN              | 10048  | Section 15.1.11.4 |
   | NFS4ERR_CLID_INUSE                | 10017  | Section 15.1.13.2 |
   | NFS4ERR_CLIENTID_BUSY             | 10074  | Section 15.1.13.1 |
   | NFS4ERR_COMPLETE_ALREADY          | 10054  | Section 15.1.9.1  |
   | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | 10055  | Section 15.1.11.6 |
   | NFS4ERR_DEADLOCK                  | 10045  | Section 15.1.8.2  |
   | NFS4ERR_DEADSESSION               | 10078  | Section 15.1.11.5 |
   | NFS4ERR_DELAY                     | 10008  | Section 15.1.1.3  |
   | NFS4ERR_DELEG_ALREADY_WANTED      | 10056  | Section 15.1.14.1 |
   | NFS4ERR_DELEG_REVOKED             | 10087  | Section 15.1.5.3  |
   | NFS4ERR_DENIED                    | 10010  | Section 15.1.8.3  |
   | NFS4ERR_DIRDELEG_UNAVAIL          | 10084  | Section 15.1.14.2 |
   | NFS4ERR_DQUOT                     | 69     | Section 15.1.4.2  |
   | NFS4ERR_ENCR_ALG_UNSUPP           | 10079  | Section 15.1.13.3 |
   | NFS4ERR_EXIST                     | 17     | Section 15.1.4.3  |
   | NFS4ERR_EXPIRED                   | 10011  | Section 15.1.5.4  |
   | NFS4ERR_FBIG                      | 27     | Section 15.1.4.4  |
   | NFS4ERR_FHEXPIRED                 | 10014  | Section 15.1.2.2  |
   | NFS4ERR_FILE_OPEN                 | 10046  | Section 15.1.4.5  |
   | NFS4ERR_GRACE                     | 10013  | Section 15.1.9.2  |
   | NFS4ERR_HASH_ALG_UNSUPP           | 10072  | Section 15.1.13.4 |
   | NFS4ERR_INVAL                     | 22     | Section 15.1.1.4  |
   | NFS4ERR_IO                        | 5      | Section 15.1.4.6  |
   | NFS4ERR_ISDIR                     | 21     | Section 15.1.2.3  |
   | NFS4ERR_LAYOUTTRYLATER            | 10058  | Section 15.1.10.3 |
   | NFS4ERR_LAYOUTUNAVAILABLE         | 10059  | Section 15.1.10.4 |
   | NFS4ERR_LEASE_MOVED               | 10031  | Section 15.1.16.2 |
   | NFS4ERR_LOCKED                    | 10012  | Section 15.1.8.4  |
   | NFS4ERR_LOCKS_HELD                | 10037  | Section 15.1.8.5  |
   | NFS4ERR_LOCK_NOTSUPP              | 10043  | Section 15.1.8.6  |
   | NFS4ERR_LOCK_RANGE                | 10028  | Section 15.1.8.7  |
   | NFS4ERR_MINOR_VERS_MISMATCH       | 10021  | Section 15.1.3.2  |
   | NFS4ERR_MLINK                     | 31     | Section 15.1.4.7  |
   | NFS4ERR_MOVED                     | 10019  | Section 15.1.2.4  |
   | NFS4ERR_NAMETOOLONG               | 63     | Section 15.1.7.3  |
   | NFS4ERR_NOENT                     | 2      | Section 15.1.4.8  |
   | NFS4ERR_NOFILEHANDLE              | 10020  | Section 15.1.2.5  |
   | NFS4ERR_NOMATCHING_LAYOUT         | 10060  | Section 15.1.10.5 |
   | NFS4ERR_NOSPC                     | 28     | Section 15.1.4.9  |
   | NFS4ERR_NOTDIR                    | 20     | Section 15.1.2.6  |
   | NFS4ERR_NOTEMPTY                  | 66     | Section 15.1.4.10 |
   | NFS4ERR_NOTSUPP                   | 10004  | Section 15.1.1.5  |
   | NFS4ERR_NOT_ONLY_OP               | 10081  | Section 15.1.3.3  |
   | NFS4ERR_NOT_SAME                  | 10027  | Section 15.1.15.3 |
   | NFS4ERR_NO_GRACE                  | 10033  | Section 15.1.9.3  |
   | NFS4ERR_NXIO                      | 6      | Section 15.1.16.3 |
   | NFS4ERR_OLD_STATEID               | 10024  | Section 15.1.5.5  |
   | NFS4ERR_OPENMODE                  | 10038  | Section 15.1.8.8  |
   | NFS4ERR_OP_ILLEGAL                | 10044  | Section 15.1.3.4  |
   | NFS4ERR_OP_NOT_IN_SESSION         | 10071  | Section 15.1.3.5  |
   | NFS4ERR_PERM                      | 1      | Section 15.1.6.2  |
   | NFS4ERR_PNFS_IO_HOLE              | 10075  | Section 15.1.10.6 |
   | NFS4ERR_PNFS_NO_LAYOUT            | 10080  | Section 15.1.10.7 |
   | NFS4ERR_RECALLCONFLICT            | 10061  | Section 15.1.14.3 |
   | NFS4ERR_RECLAIM_BAD               | 10034  | Section 15.1.9.4  |
   | NFS4ERR_RECLAIM_CONFLICT          | 10035  | Section 15.1.9.5  |
   | NFS4ERR_REJECT_DELEG              | 10085  | Section 15.1.14.4 |
   | NFS4ERR_REP_TOO_BIG               | 10066  | Section 15.1.3.6  |
   | NFS4ERR_REP_TOO_BIG_TO_CACHE      | 10067  | Section 15.1.3.7  |
   | NFS4ERR_REQ_TOO_BIG               | 10065  | Section 15.1.3.8  |
   | NFS4ERR_RESTOREFH                 | 10030  | Section 15.1.16.4 |
   | NFS4ERR_RETRY_UNCACHED_REP        | 10068  | Section 15.1.3.9  |
   | NFS4ERR_RETURNCONFLICT            | 10086  | Section 15.1.10.8 |
   | NFS4ERR_ROFS                      | 30     | Section 15.1.4.11 |
   | NFS4ERR_SAME                      | 10009  | Section 15.1.15.4 |
   | NFS4ERR_SHARE_DENIED              | 10015  | Section 15.1.8.9  |
   | NFS4ERR_SEQUENCE_POS              | 10064  | Section 15.1.3.10 |
   | NFS4ERR_SEQ_FALSE_RETRY           | 10076  | Section 15.1.11.7 |
   | NFS4ERR_SEQ_MISORDERED            | 10063  | Section 15.1.11.8 |
   | NFS4ERR_SERVERFAULT               | 10006  | Section 15.1.1.6  |
   | NFS4ERR_STALE                     | 70     | Section 15.1.2.7  |
   | NFS4ERR_STALE_CLIENTID            | 10022  | Section 15.1.13.5 |
   | NFS4ERR_STALE_STATEID             | 10023  | Section 15.1.16.5 |
   | NFS4ERR_SYMLINK                   | 10029  | Section 15.1.2.8  |
   | NFS4ERR_TOOSMALL                  | 10005  | Section 15.1.1.7  |
   | NFS4ERR_TOO_MANY_OPS              | 10070  | Section 15.1.3.11 |
   | NFS4ERR_UNKNOWN_LAYOUTTYPE        | 10062  | Section 15.1.10.9 |
   | NFS4ERR_UNSAFE_COMPOUND           | 10069  | Section 15.1.3.12 |
   | NFS4ERR_WRONGSEC                  | 10016  | Section 15.1.6.3  |
   | NFS4ERR_WRONG_CRED                | 10082  | Section 15.1.6.4  |
   | NFS4ERR_WRONG_TYPE                | 10083  | Section 15.1.2.9  |
   | NFS4ERR_XDEV                      | 18     | Section 15.1.4.12 |
   +-----------------------------------+--------+-------------------+

   Table 5

15.1.1. General Errors

This section deals with errors that are applicable to a broad set of
different purposes.

15.1.1.1. NFS4ERR_BADXDR (Error Code 10036)

The arguments for this operation do not match those specified in the
XDR definition. This includes situations in which the request ends
before all the arguments have been seen. Note that this error
applies when fixed enumerations (these include booleans) have a value
within the input stream that is not valid for the enum. A replier
may pre-parse all operations for a Compound procedure before doing
any operation execution and return RPC-level XDR errors in that case.

15.1.1.2. NFS4ERR_BAD_COOKIE (Error Code 10003)

Used for operations that provide a set of information indexed by some
quantity provided by the client or cookie sent by the server for an
earlier invocation. Where the value cannot be used for its intended
purpose, this error results.

15.1.1.3. NFS4ERR_DELAY (Error Code 10008)

For any of a number of reasons, the replier could not process this
operation in what was deemed a reasonable time. The client should
wait and then try the request with a new slot and sequence value.
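As a non-normative illustration of this retry discipline, client-side
logic might be sketched as follows. The session helpers (acquire_slot,
release_slot, send) are hypothetical names for this sketch, not part of
the protocol:

```python
import time

NFS4ERR_DELAY = 10008

def send_with_delay_retry(session, op, max_tries=5, backoff=0.1):
    """Re-send an operation while the server returns NFS4ERR_DELAY.

    Each attempt acquires a fresh session slot, so each re-send uses a
    new slot ID / sequence ID pair, as the specification requires: a
    retry after NFS4ERR_DELAY must not reuse both the slot ID and the
    sequence ID of the failed attempt.
    """
    for attempt in range(max_tries):
        slot = session.acquire_slot()          # new slot ID and seqid
        try:
            status, result = session.send(slot, op)
        finally:
            session.release_slot(slot)
        if status != NFS4ERR_DELAY:
            return status, result
        time.sleep(backoff * (2 ** attempt))   # wait before re-sending
    return NFS4ERR_DELAY, None
```

The exponential backoff is one reasonable client policy; the protocol
only requires that the client wait and then re-send on a fresh slot or
sequence value.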
Some examples of scenarios that might lead to this situation:

o  A server that supports hierarchical storage receives a request to
   process a file that had been migrated.

o  An operation requires a delegation recall to proceed, and waiting
   for this delegation recall makes processing this request in a
   timely fashion impossible.

In such cases, the error NFS4ERR_DELAY allows these preparatory
operations to proceed without holding up client resources such as a
session slot. After delaying for a period of time, the client can
then re-send the operation in question (but not with the same slot ID
and sequence ID; one or both MUST be different on the re-send).

Note that without the ability to return NFS4ERR_DELAY and the
client's willingness to re-send when receiving it, deadlock might
result. For example, if a recall is done, and if the delegation
return or operations preparatory to delegation return are held up by
other operations that need the delegation to be returned, session
slots might not be available. The result could be deadlock.

15.1.1.4. NFS4ERR_INVAL (Error Code 22)

The arguments for this operation are not valid for some reason, even
though they do match those specified in the XDR definition for the
request.

15.1.1.5. NFS4ERR_NOTSUPP (Error Code 10004)

Operation not supported, either because the operation is an OPTIONAL
one and is not supported by this server or because the operation MUST
NOT be implemented in the current minor version.

15.1.1.6. NFS4ERR_SERVERFAULT (Error Code 10006)

An error occurred on the server that does not map to any of the
specific legal NFSv4.1 protocol error values. The client should
translate this into an appropriate error. UNIX clients may choose to
translate this to EIO.

15.1.1.7. NFS4ERR_TOOSMALL (Error Code 10005)

Used where an operation returns a variable amount of data, with a
limit specified by the client. Where the data returned cannot be fit
within the limit specified by the client, this error results.

15.1.2. Filehandle Errors

These errors deal with the situation in which the current or saved
filehandle, or the filehandle passed to PUTFH intended to become the
current filehandle, is invalid in some way. This includes situations
in which the filehandle is a valid filehandle in general but is not
of the appropriate object type for the current operation.

Where the error description indicates a problem with the current or
saved filehandle, it is to be understood that filehandles are only
checked for the condition if they are implicit arguments of the
operation in question.

15.1.2.1. NFS4ERR_BADHANDLE (Error Code 10001)

Illegal NFS filehandle for the current server. The current file
handle failed internal consistency checks. Once accepted as valid
(by PUTFH), no subsequent status change can cause the filehandle to
generate this error.

15.1.2.2. NFS4ERR_FHEXPIRED (Error Code 10014)

A current or saved filehandle that is an argument to the current
operation is volatile and has expired at the server.

15.1.2.3. NFS4ERR_ISDIR (Error Code 21)

The current or saved filehandle designates a directory when the
current operation does not allow a directory to be accepted as the
target of this operation.

15.1.2.4. NFS4ERR_MOVED (Error Code 10019)

The file system that contains the current filehandle object is not
present at the server. It may have been relocated or migrated to
another server, or it may have never been present. The client may
obtain the new file system location by obtaining the "fs_locations"
or "fs_locations_info" attribute for the current filehandle. For
further discussion, refer to Section 11.3.

15.1.2.5. NFS4ERR_NOFILEHANDLE (Error Code 10020)

The logical current or saved filehandle value is required by the
current operation and is not set. This may be a result of a
malformed COMPOUND operation (i.e., no PUTFH or PUTROOTFH before an
operation that requires the current filehandle be set).

15.1.2.6. NFS4ERR_NOTDIR (Error Code 20)

The current (or saved) filehandle designates an object that is not a
directory for an operation in which a directory is required.

15.1.2.7. NFS4ERR_STALE (Error Code 70)

The current or saved filehandle value designating an argument to the
current operation is invalid. The file referred to by that
filehandle no longer exists or access to it has been revoked.

15.1.2.8. NFS4ERR_SYMLINK (Error Code 10029)

The current filehandle designates a symbolic link when the current
operation does not allow a symbolic link as the target.

15.1.2.9. NFS4ERR_WRONG_TYPE (Error Code 10083)

The current (or saved) filehandle designates an object that is of an
invalid type for the current operation, and there is no more specific
error (such as NFS4ERR_ISDIR or NFS4ERR_SYMLINK) that applies. Note
that in NFSv4.0, such situations generally resulted in the
less-specific error NFS4ERR_INVAL.

15.1.3. Compound Structure Errors

This section deals with errors that relate to the overall structure
of a Compound request (by which we mean to include both COMPOUND and
CB_COMPOUND), rather than to particular operations.

There are a number of basic constraints on the operations that may
appear in a Compound request. Sessions add to these basic
constraints by requiring a Sequence operation (either SEQUENCE or
CB_SEQUENCE) at the start of the Compound.

15.1.3.1.
NFS4_OK (Error Code 0)

Indicates the operation completed successfully, in that all of the
constituent operations completed without error.

15.1.3.2. NFS4ERR_MINOR_VERS_MISMATCH (Error Code 10021)

The minor version specified is not one that the current listener
supports. This value is returned in the overall status for the
Compound but is not associated with a specific operation since the
results will specify a result count of zero.

15.1.3.3. NFS4ERR_NOT_ONLY_OP (Error Code 10081)

Certain operations, which are allowed to be executed outside of a
session, MUST be the only operation within a Compound whenever the
Compound does not start with a Sequence operation. This error
results when that constraint is not met.

15.1.3.4. NFS4ERR_OP_ILLEGAL (Error Code 10044)

The operation code is not a valid one for the current Compound
procedure. The opcode in the result stream matched with this error
is the ILLEGAL value, although the value that appears in the request
stream may be different. Where an illegal value appears and the
replier pre-parses all operations for a Compound procedure before
doing any operation execution, an RPC-level XDR error may be
returned.

15.1.3.5. NFS4ERR_OP_NOT_IN_SESSION (Error Code 10071)

Most forward operations and all callback operations are only valid
within the context of a session, so that the Compound request in
question MUST begin with a Sequence operation. If an attempt is made
to execute these operations outside the context of a session, this
error results.

15.1.3.6. NFS4ERR_REP_TOO_BIG (Error Code 10066)

The reply to a Compound would exceed the channel's negotiated maximum
response size.

15.1.3.7. NFS4ERR_REP_TOO_BIG_TO_CACHE (Error Code 10067)

The reply to a Compound would exceed the channel's negotiated maximum
size for replies cached in the reply cache when the Sequence for the
current request specifies that this request is to be cached.

15.1.3.8. NFS4ERR_REQ_TOO_BIG (Error Code 10065)

The Compound request exceeds the channel's negotiated maximum size
for requests.

15.1.3.9. NFS4ERR_RETRY_UNCACHED_REP (Error Code 10068)

The requester has attempted a retry of a Compound that it previously
requested not be placed in the reply cache.

15.1.3.10. NFS4ERR_SEQUENCE_POS (Error Code 10064)

A Sequence operation appeared in a position other than the first
operation of a Compound request.

15.1.3.11. NFS4ERR_TOO_MANY_OPS (Error Code 10070)

The Compound request has too many operations, exceeding the count
negotiated when the session was created.

15.1.3.12. NFS4ERR_UNSAFE_COMPOUND (Error Code 10069)

The client has sent a COMPOUND request with an unsafe mix of
operations -- specifically, with a non-idempotent operation that
changes the current filehandle and that is not followed by a GETFH.

15.1.4. File System Errors

These errors describe situations that occurred in the underlying file
system implementation rather than in the protocol or any NFSv4.x
feature.

15.1.4.1. NFS4ERR_BADTYPE (Error Code 10007)

An attempt was made to create an object with an inappropriate type
specified to CREATE. This may be because the type is undefined,
because the type is not supported by the server, or because the type
is not intended to be created by CREATE (such as a regular file or
named attribute, for which OPEN is used to do the file creation).

15.1.4.2. NFS4ERR_DQUOT (Error Code 69)

Resource (quota) hard limit exceeded. The user's resource limit on
the server has been exceeded.

15.1.4.3. NFS4ERR_EXIST (Error Code 17)

A file of the specified target name (when creating, renaming, or
linking) already exists.

15.1.4.4. NFS4ERR_FBIG (Error Code 27)

The file is too large. The operation would have caused the file to
grow beyond the server's limit.

15.1.4.5. NFS4ERR_FILE_OPEN (Error Code 10046)

The operation is not allowed because a file involved in the operation
is currently open. Servers may, but are not required to, disallow
linking-to, removing, or renaming open files.

15.1.4.6. NFS4ERR_IO (Error Code 5)

Indicates that an I/O error occurred for which the file system was
unable to provide recovery.

15.1.4.7. NFS4ERR_MLINK (Error Code 31)

The request would have caused the server's limit for the number of
hard links a file may have to be exceeded.

15.1.4.8. NFS4ERR_NOENT (Error Code 2)

Indicates no such file or directory. The file or directory name
specified does not exist.

15.1.4.9. NFS4ERR_NOSPC (Error Code 28)

Indicates there is no space left on the device. The operation would
have caused the server's file system to exceed its limit.

15.1.4.10. NFS4ERR_NOTEMPTY (Error Code 66)

An attempt was made to remove a directory that was not empty.

15.1.4.11. NFS4ERR_ROFS (Error Code 30)

Indicates a read-only file system. A modifying operation was
attempted on a read-only file system.

15.1.4.12. NFS4ERR_XDEV (Error Code 18)

Indicates an attempt to do an operation, such as linking, that
inappropriately crosses a boundary. This may be due to such
boundaries as:

o  that between file systems (where the fsids are different).

o  that between different named attribute directories or between a
   named attribute directory and an ordinary directory.
o  that between byte-ranges of a file system that the file system
   implementation treats as separate (for example, for space
   accounting purposes), and where cross-connection between the
   byte-ranges is not allowed.

15.1.5. State Management Errors

These errors indicate problems with the stateid (or one of the
stateids) passed to a given operation. This includes situations in
which the stateid is invalid as well as situations in which the
stateid is valid but designates locking state that has been revoked.
Depending on the operation, the stateid when valid may designate
opens, byte-range locks, file or directory delegations, layouts, or
device maps.

15.1.5.1. NFS4ERR_ADMIN_REVOKED (Error Code 10047)

A stateid designates locking state of any type that has been revoked
due to administrative interaction, possibly while the lease is valid.

15.1.5.2. NFS4ERR_BAD_STATEID (Error Code 10025)

A stateid does not properly designate any valid state. See Sections
8.2.4 and 8.2.3 for a discussion of how stateids are validated.

15.1.5.3. NFS4ERR_DELEG_REVOKED (Error Code 10087)

A stateid designates recallable locking state of any type (delegation
or layout) that has been revoked due to the failure of the client to
return the lock when it was recalled.

15.1.5.4. NFS4ERR_EXPIRED (Error Code 10011)

A stateid designates locking state of any type that has been revoked
due to expiration of the client's lease, either immediately upon
lease expiration, or following a later request for a conflicting
lock.

15.1.5.5. NFS4ERR_OLD_STATEID (Error Code 10024)

A stateid with a non-zero seqid value does not match the current
seqid for the state designated by the user.

15.1.6. Security Errors

These are the various permission-related errors in NFSv4.1.

15.1.6.1. NFS4ERR_ACCESS (Error Code 13)

Indicates permission denied. The caller does not have the correct
permission to perform the requested operation. Contrast this with
NFS4ERR_PERM (Section 15.1.6.2), which restricts itself to owner or
privileged-user permission failures, and NFS4ERR_WRONG_CRED
(Section 15.1.6.4), which deals with appropriate permission to delete
or modify transient objects based on the credentials of the user that
created them.

15.1.6.2. NFS4ERR_PERM (Error Code 1)

Indicates requester is not the owner. The operation was not allowed
because the caller is neither a privileged user (root) nor the owner
of the target of the operation.

15.1.6.3. NFS4ERR_WRONGSEC (Error Code 10016)

Indicates that the security mechanism being used by the client for
the operation does not match the server's security policy. The
client should change the security mechanism being used and re-send
the operation (but not with the same slot ID and sequence ID; one or
both MUST be different on the re-send). SECINFO and SECINFO_NO_NAME
can be used to determine the appropriate mechanism.

15.1.6.4. NFS4ERR_WRONG_CRED (Error Code 10082)

An operation that manipulates state was attempted by a principal that
was not allowed to modify that piece of state.

15.1.7. Name Errors

Names in NFSv4 are UTF-8 strings. When the strings are not valid
UTF-8 or are of length zero, the error NFS4ERR_INVAL results.
Besides this, there are a number of other errors to indicate specific
problems with names.

15.1.7.1. NFS4ERR_BADCHAR (Error Code 10040)

A UTF-8 string contains a character that is not supported by the
server in the context in which it is being used.

15.1.7.2. NFS4ERR_BADNAME (Error Code 10041)

A name string in a request consisted of valid UTF-8 characters
supported by the server, but the name is not supported by the server
as a valid name for the current operation. An example might be
creating a file or directory named ".." on a server whose file system
uses that name for links to parent directories.

15.1.7.3. NFS4ERR_NAMETOOLONG (Error Code 63)

Returned when the filename in an operation exceeds the server's
implementation limit.

15.1.8. Locking Errors

This section deals with errors related to locking, both as to share
reservations and byte-range locking. It does not deal with errors
specific to the process of reclaiming locks. Those are dealt with in
Section 15.1.9.

15.1.8.1. NFS4ERR_BAD_RANGE (Error Code 10042)

The byte-range of a LOCK, LOCKT, or LOCKU operation is not allowed by
the server. For example, this error results when a server that only
supports 32-bit ranges receives a range that cannot be handled by
that server. (See Section 18.10.3.)

15.1.8.2. NFS4ERR_DEADLOCK (Error Code 10045)

The server has been able to determine a byte-range locking deadlock
condition for a READW_LT or WRITEW_LT LOCK operation.

15.1.8.3. NFS4ERR_DENIED (Error Code 10010)

An attempt to lock a file is denied. Since this may be a temporary
condition, the client is encouraged to re-send the lock request (but
not with the same slot ID and sequence ID; one or both MUST be
different on the re-send) until the lock is accepted. See
Section 9.6 for a discussion of the re-send.

15.1.8.4. NFS4ERR_LOCKED (Error Code 10012)

A READ or WRITE operation was attempted on a file where there was a
conflict between the I/O and an existing lock:

o  There is a share reservation inconsistent with the I/O being done.
o  The range to be read or written intersects an existing mandatory
   byte-range lock.

15.1.8.5. NFS4ERR_LOCKS_HELD (Error Code 10037)

An operation was prevented by the unexpected presence of locks.

15.1.8.6. NFS4ERR_LOCK_NOTSUPP (Error Code 10043)

A LOCK operation was attempted that would require the upgrade or
downgrade of a byte-range lock range already held by the owner, and
the server does not support atomic upgrade or downgrade of locks.

15.1.8.7. NFS4ERR_LOCK_RANGE (Error Code 10028)

A LOCK operation is operating on a range that overlaps in part a
currently held byte-range lock for the current lock-owner and does
not precisely match a single such byte-range lock, where the server
does not support this type of request and thus does not implement
POSIX locking semantics [21]. See Sections 18.10.4, 18.11.4, and
18.12.4 for a discussion of how this applies to LOCK, LOCKT, and
LOCKU, respectively.

15.1.8.8. NFS4ERR_OPENMODE (Error Code 10038)

The client attempted a READ, WRITE, LOCK, or other operation not
sanctioned by the stateid passed (e.g., writing to a file opened for
read-only access).

15.1.8.9. NFS4ERR_SHARE_DENIED (Error Code 10015)

An attempt to OPEN a file with a share reservation has failed because
of a share conflict.

15.1.9. Reclaim Errors

These errors relate to the process of reclaiming locks after a server
restart.

15.1.9.1. NFS4ERR_COMPLETE_ALREADY (Error Code 10054)

The client previously sent a successful RECLAIM_COMPLETE operation.
An additional RECLAIM_COMPLETE operation is not necessary and results
in this error.

15.1.9.2. NFS4ERR_GRACE (Error Code 10013)

The server was in its recovery or grace period. The locking request
was not a reclaim request and so could not be granted during that
period.

15.1.9.3. NFS4ERR_NO_GRACE (Error Code 10033)

A reclaim of client state was attempted in circumstances in which the
server cannot guarantee that conflicting state has not been provided
to another client. This can occur because the reclaim has been done
outside of the grace period of the server, after the client has done
a RECLAIM_COMPLETE operation, or because previous operations have
created a situation in which the server is not able to determine that
a reclaim-interfering edge condition does not exist.

15.1.9.4. NFS4ERR_RECLAIM_BAD (Error Code 10034)

The server has determined that a reclaim attempted by the client is
not valid, i.e., the lock specified as being reclaimed could not
possibly have existed before the server restart. A server is not
obliged to make this determination and will typically rely on the
client to only reclaim locks that the client was granted prior to
restart. However, when a server does have reliable information to
enable it to make this determination, this error indicates that the
reclaim has been rejected as invalid. This is as opposed to the
error NFS4ERR_RECLAIM_CONFLICT (see Section 15.1.9.5), where the
server can only determine that there has been an invalid reclaim but
cannot determine which request is invalid.

15.1.9.5. NFS4ERR_RECLAIM_CONFLICT (Error Code 10035)

The reclaim attempted by the client has encountered a conflict and
cannot be satisfied. This potentially indicates a misbehaving
client, although not necessarily the one receiving the error. The
misbehavior might be on the part of the client that established the
lock with which this client conflicted. See also Section 15.1.9.4
for the related error, NFS4ERR_RECLAIM_BAD.

15.1.10.
pNFS Errors

This section deals with pNFS-related errors, including those that are
associated with using NFSv4.1 to communicate with a data server.

15.1.10.1. NFS4ERR_BADIOMODE (Error Code 10049)

An invalid or inappropriate layout iomode was specified. For
example, suppose a client's LAYOUTGET operation specified an iomode
of LAYOUTIOMODE4_RW, and the server is neither able nor willing to
let the client send write requests to data servers; the server can
reply with NFS4ERR_BADIOMODE. The client would then send another
LAYOUTGET with an iomode of LAYOUTIOMODE4_READ.

15.1.10.2. NFS4ERR_BADLAYOUT (Error Code 10050)

The layout specified is invalid in some way. For LAYOUTCOMMIT, this
indicates that the specified layout is not held by the client or is
not of mode LAYOUTIOMODE4_RW. For LAYOUTGET, it indicates that a
layout matching the client's specification as to minimum length
cannot be granted.

15.1.10.3. NFS4ERR_LAYOUTTRYLATER (Error Code 10058)

Layouts are temporarily unavailable for the file. The client should
re-send later (but not with the same slot ID and sequence ID; one or
both MUST be different on the re-send).

15.1.10.4. NFS4ERR_LAYOUTUNAVAILABLE (Error Code 10059)

Returned when layouts are not available for the current file system
or the particular specified file.

15.1.10.5. NFS4ERR_NOMATCHING_LAYOUT (Error Code 10060)

Returned when layouts are recalled and the client has no layouts
matching the specification of the layouts being recalled.

15.1.10.6. NFS4ERR_PNFS_IO_HOLE (Error Code 10075)

The pNFS client has attempted to read from or write to an illegal
hole of a file of a data server that is using sparse packing. See
Section 13.4.4.

15.1.10.7. NFS4ERR_PNFS_NO_LAYOUT (Error Code 10080)

The pNFS client has attempted to read from or write to a file (using
a request to a data server) without holding a valid layout. This
includes the case where the client had a layout, but the iomode does
not allow a WRITE.

15.1.10.8. NFS4ERR_RETURNCONFLICT (Error Code 10086)

A layout is unavailable due to an attempt to perform the LAYOUTGET
before a pending LAYOUTRETURN on the file has been received. See
Section 12.5.5.2.1.3.

15.1.10.9. NFS4ERR_UNKNOWN_LAYOUTTYPE (Error Code 10062)

The client has specified a layout type that is not supported by the
server.

15.1.11. Session Use Errors

This section deals with errors encountered when using sessions, that
is, errors encountered when a request uses a Sequence (i.e., either
SEQUENCE or CB_SEQUENCE) operation.

15.1.11.1. NFS4ERR_BADSESSION (Error Code 10052)

The specified session ID is unknown to the server to which the
operation is addressed.

15.1.11.2. NFS4ERR_BADSLOT (Error Code 10053)

The requester sent a Sequence operation that attempted to use a slot
the replier does not have in its slot table. It is possible the slot
may have been retired.

15.1.11.3. NFS4ERR_BAD_HIGH_SLOT (Error Code 10077)

The highest_slot argument in a Sequence operation exceeds the
replier's enforced highest_slotid.

15.1.11.4. NFS4ERR_CB_PATH_DOWN (Error Code 10048)

There is a problem contacting the client via the callback path. The
function of this error has been mostly superseded by the use of
status flags in the reply to the SEQUENCE operation (see
Section 18.46).

15.1.11.5. NFS4ERR_DEADSESSION (Error Code 10078)

The specified session is a persistent session that is dead and does
not accept new requests or perform new operations on existing
requests (in the case in which a request was partially executed
before server restart).

15.1.11.6. NFS4ERR_CONN_NOT_BOUND_TO_SESSION (Error Code 10055)

A Sequence operation was sent on a connection that has not been
associated with the specified session, where the client specified
that connection association was to be enforced with SP4_MACH_CRED or
SP4_SSV state protection.

15.1.11.7. NFS4ERR_SEQ_FALSE_RETRY (Error Code 10076)

The requester sent a Sequence operation with a slot ID and sequence
ID that are in the reply cache, but the replier has detected that the
retried request is not the same as the original request. See
Section 2.10.6.1.3.1.

15.1.11.8. NFS4ERR_SEQ_MISORDERED (Error Code 10063)

The requester sent a Sequence operation with an invalid sequence ID.

15.1.12. Session Management Errors

This section deals with errors associated with requests used in
session management.

15.1.12.1. NFS4ERR_BACK_CHAN_BUSY (Error Code 10057)

An attempt was made to destroy a session when the session cannot be
destroyed because the server has callback requests outstanding.

15.1.12.2. NFS4ERR_BAD_SESSION_DIGEST (Error Code 10051)

The digest used in a SET_SSV request is not valid.

15.1.13. Client Management Errors

This section deals with errors associated with requests used to
create and manage client IDs.

15.1.13.1. NFS4ERR_CLIENTID_BUSY (Error Code 10074)

The DESTROY_CLIENTID operation has found there are sessions and/or
unexpired state associated with the client ID to be destroyed.

15.1.13.2.
NFS4ERR_CLID_INUSE (Error Code 10017) 17835 While processing an EXCHANGE_ID operation, the server was presented 17836 with a co_ownerid field that matches an existing client with valid 17837 leased state, but the principal sending the EXCHANGE_ID operation 17838 differs from the principal that established the existing client. 17840 This indicates a collision (most likely due to chance) between 17841 clients. The client should recover by changing the co_ownerid and 17842 re-sending EXCHANGE_ID (but not with the same slot ID and sequence 17843 ID; one or both MUST be different on the re-send). 17845 15.1.13.3. NFS4ERR_ENCR_ALG_UNSUPP (Error Code 10079) 17847 An EXCHANGE_ID was sent that specified state protection via SSV, and 17848 where the set of encryption algorithms presented by the client did 17849 not include any supported by the server. 17851 15.1.13.4. NFS4ERR_HASH_ALG_UNSUPP (Error Code 10072) 17853 An EXCHANGE_ID was sent that specified state protection via SSV, and 17854 where the set of hashing algorithms presented by the client did not 17855 include any supported by the server. 17857 15.1.13.5. NFS4ERR_STALE_CLIENTID (Error Code 10022) 17859 A client ID not recognized by the server was passed to an operation. 17860 Note that unlike the case of NFSv4.0, client IDs are not passed 17861 explicitly to the server in ordinary locking operations and cannot 17862 result in this error. Instead, when there is a server restart, it is 17863 first manifested through an error on the associated session, and the 17864 staleness of the client ID is detected when trying to associate a 17865 client ID with a new session. 17867 15.1.14. Delegation Errors 17869 This section deals with errors associated with requesting and 17870 returning delegations. 17872 15.1.14.1. NFS4ERR_DELEG_ALREADY_WANTED (Error Code 10056) 17874 The client has requested a delegation when it had already registered 17875 that it wants that same delegation. 17877 15.1.14.2. 
NFS4ERR_DIRDELEG_UNAVAIL (Error Code 10084) 17879 This error is returned when the server is unable or unwilling to 17880 provide a requested directory delegation. 17882 15.1.14.3. NFS4ERR_RECALLCONFLICT (Error Code 10061) 17884 A recallable object (i.e., a layout or delegation) is unavailable due 17885 to a conflicting recall operation that is currently in progress for 17886 that object. 17888 15.1.14.4. NFS4ERR_REJECT_DELEG (Error Code 10085) 17890 The callback operation invoked to deal with a new delegation has 17891 rejected it. 17893 15.1.15. Attribute Handling Errors 17895 This section deals with errors specific to attribute handling within 17896 NFSv4. 17898 15.1.15.1. NFS4ERR_ATTRNOTSUPP (Error Code 10032) 17900 An attribute specified is not supported by the server. This error 17901 MUST NOT be returned by the GETATTR operation. 17903 15.1.15.2. NFS4ERR_BADOWNER (Error Code 10039) 17905 This error is returned when an owner or owner_group attribute value 17906 or the who field of an ACE within an ACL attribute value cannot be 17907 translated to a local representation. 17909 15.1.15.3. NFS4ERR_NOT_SAME (Error Code 10027) 17911 This error is returned by the VERIFY operation to signify that the 17912 attributes compared were not the same as those provided in the 17913 client's request. 17915 15.1.15.4. NFS4ERR_SAME (Error Code 10009) 17917 This error is returned by the NVERIFY operation to signify that the 17918 attributes compared were the same as those provided in the client's 17919 request. 17921 15.1.16. Obsoleted Errors 17923 These errors MUST NOT be generated by any NFSv4.1 operation. This 17924 can be for a number of reasons. 17926 o The function provided by the error has been superseded by one of 17927 the status bits returned by the SEQUENCE operation. 17929 o The new session structure and associated change in locking have 17930 made the error unnecessary. 
17932 o There has been a restructuring of some errors for NFSv4.1 that 17933 resulted in the elimination of certain errors. 17935 15.1.16.1. NFS4ERR_BAD_SEQID (Error Code 10026) 17937 The sequence number (seqid) in a locking request is neither the next 17938 expected number or the last number processed. These seqids are 17939 ignored in NFSv4.1. 17941 15.1.16.2. NFS4ERR_LEASE_MOVED (Error Code 10031) 17943 A lease being renewed is associated with a file system that has been 17944 migrated to a new server. The error has been superseded by the 17945 SEQ4_STATUS_LEASE_MOVED status bit (see Section 18.46). 17947 15.1.16.3. NFS4ERR_NXIO (Error Code 5) 17949 I/O error. No such device or address. This error is for errors 17950 involving block and character device access, but because NFSv4.1 is 17951 not a device-access protocol, this error is not applicable. 17953 15.1.16.4. NFS4ERR_RESTOREFH (Error Code 10030) 17955 The RESTOREFH operation does not have a saved filehandle (identified 17956 by SAVEFH) to operate upon. In NFSv4.1, this error has been 17957 superseded by NFS4ERR_NOFILEHANDLE. 17959 15.1.16.5. NFS4ERR_STALE_STATEID (Error Code 10023) 17961 A stateid generated by an earlier server instance was used. This 17962 error is moot in NFSv4.1 because all operations that take a stateid 17963 MUST be preceded by the SEQUENCE operation, and the earlier server 17964 instance is detected by the session infrastructure that supports 17965 SEQUENCE. 17967 15.2. Operations and Their Valid Errors 17969 This section contains a table that gives the valid error returns for 17970 each protocol operation. The error code NFS4_OK (indicating no 17971 error) is not listed but should be understood to be returnable by all 17972 operations with two important exceptions: 17974 o The operations that MUST NOT be implemented: OPEN_CONFIRM, 17975 RELEASE_LOCKOWNER, RENEW, SETCLIENTID, and SETCLIENTID_CONFIRM. 17977 o The invalid operation: ILLEGAL. 
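   The distinctions among a new request, a valid retry, and the
   NFS4ERR_SEQ_FALSE_RETRY and NFS4ERR_SEQ_MISORDERED cases described
   in Section 15.1.11 can be sketched as follows.  This is an
   illustrative model only: the Slot class and the digest-based
   comparison of requests are assumptions of the sketch, not elements
   of the protocol.

```python
# Illustrative sketch (not normative): how a replier might classify an
# incoming SEQUENCE request against one slot of its reply cache.
from dataclasses import dataclass

@dataclass
class Slot:
    seqid: int            # sequence ID of the last request accepted here
    cached_digest: bytes  # digest of that request's arguments (assumed)
    cached_reply: object  # cached reply, or None if caching was not asked for

def check_sequence(slot: Slot, seqid: int, digest: bytes) -> str:
    if seqid == (slot.seqid + 1) % (2 ** 32):
        return "NEW_REQUEST"                    # normal case: execute, cache
    if seqid == slot.seqid:
        if digest != slot.cached_digest:
            return "NFS4ERR_SEQ_FALSE_RETRY"    # retry differs from original
        if slot.cached_reply is None:
            return "NFS4ERR_RETRY_UNCACHED_REP" # original reply not cached
        return "REPLAY_FROM_CACHE"              # legitimate retry
    return "NFS4ERR_SEQ_MISORDERED"             # neither next nor last seqid

slot = Slot(seqid=7, cached_digest=b"d1", cached_reply="ok")
assert check_sequence(slot, 8, b"d2") == "NEW_REQUEST"
assert check_sequence(slot, 7, b"d1") == "REPLAY_FROM_CACHE"
assert check_sequence(slot, 7, b"xx") == "NFS4ERR_SEQ_FALSE_RETRY"
assert check_sequence(slot, 5, b"d1") == "NFS4ERR_SEQ_MISORDERED"
```

   A real replier compares the retried arguments themselves (or a
   stored digest of them); the essential point is that a matching slot
   ID and sequence ID alone do not establish a valid retry.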
17979 Valid Error Returns for Each Protocol Operation 17981 +----------------------+--------------------------------------------+ 17982 | Operation | Errors | 17983 +----------------------+--------------------------------------------+ 17984 | ACCESS | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 17985 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 17986 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 17987 | | NFS4ERR_IO, NFS4ERR_MOVED, | 17988 | | NFS4ERR_NOFILEHANDLE, | 17989 | | NFS4ERR_OP_NOT_IN_SESSION, | 17990 | | NFS4ERR_REP_TOO_BIG, | 17991 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 17992 | | NFS4ERR_REQ_TOO_BIG, | 17993 | | NFS4ERR_RETRY_UNCACHED_REP, | 17994 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 17995 | | NFS4ERR_TOO_MANY_OPS | 17996 | | | 17997 | BACKCHANNEL_CTL | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 17998 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 17999 | | NFS4ERR_NOENT, NFS4ERR_OP_NOT_IN_SESSION, | 18000 | | NFS4ERR_REP_TOO_BIG, | 18001 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18002 | | NFS4ERR_REQ_TOO_BIG, | 18003 | | NFS4ERR_RETRY_UNCACHED_REP, | 18004 | | NFS4ERR_TOO_MANY_OPS | 18005 | | | 18006 | BIND_CONN_TO_SESSION | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 18007 | | NFS4ERR_BAD_SESSION_DIGEST, | 18008 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18009 | | NFS4ERR_INVAL, NFS4ERR_NOT_ONLY_OP, | 18010 | | NFS4ERR_REP_TOO_BIG, | 18011 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18012 | | NFS4ERR_REQ_TOO_BIG, | 18013 | | NFS4ERR_RETRY_UNCACHED_REP, | 18014 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 18015 | | | 18016 | CLOSE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18017 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 18018 | | NFS4ERR_DELAY, NFS4ERR_EXPIRED, | 18019 | | NFS4ERR_FHEXPIRED, NFS4ERR_LOCKS_HELD, | 18020 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18021 | | NFS4ERR_OLD_STATEID, | 18022 | | NFS4ERR_OP_NOT_IN_SESSION, | 18023 | | NFS4ERR_REP_TOO_BIG, | 18024 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18025 | | NFS4ERR_REQ_TOO_BIG, | 18026 | | NFS4ERR_RETRY_UNCACHED_REP, | 18027 | | 
NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18028 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18029 | | | 18030 | COMMIT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18031 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18032 | | NFS4ERR_FHEXPIRED, NFS4ERR_IO, | 18033 | | NFS4ERR_ISDIR, NFS4ERR_MOVED, | 18034 | | NFS4ERR_NOFILEHANDLE, | 18035 | | NFS4ERR_OP_NOT_IN_SESSION, | 18036 | | NFS4ERR_REP_TOO_BIG, | 18037 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18038 | | NFS4ERR_REQ_TOO_BIG, | 18039 | | NFS4ERR_RETRY_UNCACHED_REP, | 18040 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18041 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18042 | | NFS4ERR_WRONG_TYPE | 18043 | | | 18044 | CREATE | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 18045 | | NFS4ERR_BADCHAR, NFS4ERR_BADNAME, | 18046 | | NFS4ERR_BADOWNER, NFS4ERR_BADTYPE, | 18047 | | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18048 | | NFS4ERR_DELAY, NFS4ERR_DQUOT, | 18049 | | NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, | 18050 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MLINK, | 18051 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 18052 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 18053 | | NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION, | 18054 | | NFS4ERR_PERM, NFS4ERR_REP_TOO_BIG, | 18055 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18056 | | NFS4ERR_REQ_TOO_BIG, | 18057 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18058 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18059 | | NFS4ERR_TOO_MANY_OPS, | 18060 | | NFS4ERR_UNSAFE_COMPOUND | 18061 | | | 18062 | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | 18063 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18064 | | NFS4ERR_INVAL, NFS4ERR_NOENT, | 18065 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_NOSPC, | 18066 | | NFS4ERR_REP_TOO_BIG, | 18067 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18068 | | NFS4ERR_REQ_TOO_BIG, | 18069 | | NFS4ERR_RETRY_UNCACHED_REP, | 18070 | | NFS4ERR_SEQ_MISORDERED, | 18071 | | NFS4ERR_SERVERFAULT, | 18072 | | NFS4ERR_STALE_CLIENTID, NFS4ERR_TOOSMALL, | 18073 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18074 | | | 18075 | 
DELEGPURGE | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18076 | | NFS4ERR_DELAY, NFS4ERR_NOTSUPP, | 18077 | | NFS4ERR_OP_NOT_IN_SESSION, | 18078 | | NFS4ERR_REP_TOO_BIG, | 18079 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18080 | | NFS4ERR_REQ_TOO_BIG, | 18081 | | NFS4ERR_RETRY_UNCACHED_REP, | 18082 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18083 | | NFS4ERR_WRONG_CRED | 18084 | | | 18085 | DELEGRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18086 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 18087 | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | 18088 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 18089 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 18090 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 18091 | | NFS4ERR_OLD_STATEID, | 18092 | | NFS4ERR_OP_NOT_IN_SESSION, | 18093 | | NFS4ERR_REP_TOO_BIG, | 18094 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18095 | | NFS4ERR_REQ_TOO_BIG, | 18096 | | NFS4ERR_RETRY_UNCACHED_REP, | 18097 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18098 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18099 | | | 18100 | DESTROY_CLIENTID | NFS4ERR_BADXDR, NFS4ERR_CLIENTID_BUSY, | 18101 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18102 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_REP_TOO_BIG, | 18103 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18104 | | NFS4ERR_REQ_TOO_BIG, | 18105 | | NFS4ERR_RETRY_UNCACHED_REP, | 18106 | | NFS4ERR_SERVERFAULT, | 18107 | | NFS4ERR_STALE_CLIENTID, | 18108 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18109 | | | 18110 | DESTROY_SESSION | NFS4ERR_BACK_CHAN_BUSY, | 18111 | | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 18112 | | NFS4ERR_CB_PATH_DOWN, | 18113 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 18114 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18115 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_REP_TOO_BIG, | 18116 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18117 | | NFS4ERR_REQ_TOO_BIG, | 18118 | | NFS4ERR_RETRY_UNCACHED_REP, | 18119 | | NFS4ERR_SERVERFAULT, | 18120 | | NFS4ERR_STALE_CLIENTID, | 18121 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18122 | | | 18123 | EXCHANGE_ID | 
NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 18124 | | NFS4ERR_CLID_INUSE, NFS4ERR_DEADSESSION, | 18125 | | NFS4ERR_DELAY, NFS4ERR_ENCR_ALG_UNSUPP, | 18126 | | NFS4ERR_HASH_ALG_UNSUPP, NFS4ERR_INVAL, | 18127 | | NFS4ERR_NOENT, NFS4ERR_NOT_ONLY_OP, | 18128 | | NFS4ERR_NOT_SAME, NFS4ERR_REP_TOO_BIG, | 18129 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18130 | | NFS4ERR_REQ_TOO_BIG, | 18131 | | NFS4ERR_RETRY_UNCACHED_REP, | 18132 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 18133 | | | 18134 | FREE_STATEID | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 18135 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18136 | | NFS4ERR_LOCKS_HELD, NFS4ERR_OLD_STATEID, | 18137 | | NFS4ERR_OP_NOT_IN_SESSION, | 18138 | | NFS4ERR_REP_TOO_BIG, | 18139 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18140 | | NFS4ERR_REQ_TOO_BIG, | 18141 | | NFS4ERR_RETRY_UNCACHED_REP, | 18142 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18143 | | NFS4ERR_WRONG_CRED | 18144 | | | 18145 | GET_DIR_DELEGATION | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18146 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18147 | | NFS4ERR_DIRDELEG_UNAVAIL, | 18148 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18149 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 18150 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 18151 | | NFS4ERR_NOTSUPP, | 18152 | | NFS4ERR_OP_NOT_IN_SESSION, | 18153 | | NFS4ERR_REP_TOO_BIG, | 18154 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18155 | | NFS4ERR_REQ_TOO_BIG, | 18156 | | NFS4ERR_RETRY_UNCACHED_REP, | 18157 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18158 | | NFS4ERR_TOO_MANY_OPS | 18159 | | | 18160 | GETATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18161 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18162 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18163 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 18164 | | NFS4ERR_NOFILEHANDLE, | 18165 | | NFS4ERR_OP_NOT_IN_SESSION, | 18166 | | NFS4ERR_REP_TOO_BIG, | 18167 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18168 | | NFS4ERR_REQ_TOO_BIG, | 18169 | | NFS4ERR_RETRY_UNCACHED_REP, | 18170 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 
18171 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE | 18172 | | | 18173 | GETDEVICEINFO | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18174 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 18175 | | NFS4ERR_NOENT, NFS4ERR_NOTSUPP, | 18176 | | NFS4ERR_OP_NOT_IN_SESSION, | 18177 | | NFS4ERR_REP_TOO_BIG, | 18178 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18179 | | NFS4ERR_REQ_TOO_BIG, | 18180 | | NFS4ERR_RETRY_UNCACHED_REP, | 18181 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOOSMALL, | 18182 | | NFS4ERR_TOO_MANY_OPS, | 18183 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 18184 | | | 18185 | GETDEVICELIST | NFS4ERR_BADXDR, NFS4ERR_BAD_COOKIE, | 18186 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18187 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18188 | | NFS4ERR_IO, NFS4ERR_NOFILEHANDLE, | 18189 | | NFS4ERR_NOTSUPP, NFS4ERR_NOT_SAME, | 18190 | | NFS4ERR_OP_NOT_IN_SESSION, | 18191 | | NFS4ERR_REP_TOO_BIG, | 18192 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18193 | | NFS4ERR_REQ_TOO_BIG, | 18194 | | NFS4ERR_RETRY_UNCACHED_REP, | 18195 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18196 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 18197 | | | 18198 | GETFH | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | 18199 | | NFS4ERR_NOFILEHANDLE, | 18200 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_STALE | 18201 | | | 18202 | ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL | 18203 | | | 18204 | LAYOUTCOMMIT | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18205 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADIOMODE, | 18206 | | NFS4ERR_BADLAYOUT, NFS4ERR_BADXDR, | 18207 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18208 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 18209 | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | 18210 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 18211 | | NFS4ERR_ISDIR NFS4ERR_MOVED, | 18212 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 18213 | | NFS4ERR_NO_GRACE, | 18214 | | NFS4ERR_OP_NOT_IN_SESSION, | 18215 | | NFS4ERR_RECLAIM_BAD, | 18216 | | NFS4ERR_RECLAIM_CONFLICT, | 18217 | | NFS4ERR_REP_TOO_BIG, | 18218 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18219 | | NFS4ERR_REQ_TOO_BIG, | 
18220 | | NFS4ERR_RETRY_UNCACHED_REP, | 18221 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18222 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18223 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 18224 | | NFS4ERR_WRONG_CRED | 18225 | | | 18226 | LAYOUTGET | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18227 | | NFS4ERR_BADIOMODE, NFS4ERR_BADLAYOUT, | 18228 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 18229 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18230 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 18231 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18232 | | NFS4ERR_INVAL, NFS4ERR_IO, | 18233 | | NFS4ERR_LAYOUTTRYLATER, | 18234 | | NFS4ERR_LAYOUTUNAVAILABLE, NFS4ERR_LOCKED, | 18235 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18236 | | NFS4ERR_NOSPC, NFS4ERR_NOTSUPP, | 18237 | | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE, | 18238 | | NFS4ERR_OP_NOT_IN_SESSION, | 18239 | | NFS4ERR_RECALLCONFLICT, | 18240 | | NFS4ERR_REP_TOO_BIG, | 18241 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18242 | | NFS4ERR_REQ_TOO_BIG, | 18243 | | NFS4ERR_RETRY_UNCACHED_REP, | 18244 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18245 | | NFS4ERR_TOOSMALL, NFS4ERR_TOO_MANY_OPS, | 18246 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 18247 | | NFS4ERR_WRONG_TYPE | 18248 | | | 18249 | LAYOUTRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18250 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 18251 | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | 18252 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 18253 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 18254 | | NFS4ERR_ISDIR, NFS4ERR_MOVED, | 18255 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 18256 | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | 18257 | | NFS4ERR_OP_NOT_IN_SESSION, | 18258 | | NFS4ERR_REP_TOO_BIG, | 18259 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18260 | | NFS4ERR_REQ_TOO_BIG, | 18261 | | NFS4ERR_RETRY_UNCACHED_REP, | 18262 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18263 | | NFS4ERR_TOO_MANY_OPS, | 18264 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 18265 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 18266 | | | 18267 | LINK | 
NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 18268 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 18269 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18270 | | NFS4ERR_DQUOT, NFS4ERR_EXIST, | 18271 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 18272 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 18273 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_MLINK, | 18274 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 18275 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 18276 | | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP, | 18277 | | NFS4ERR_OP_NOT_IN_SESSION, | 18278 | | NFS4ERR_REP_TOO_BIG, | 18279 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18280 | | NFS4ERR_REQ_TOO_BIG, | 18281 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18282 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18283 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18284 | | NFS4ERR_WRONGSEC, NFS4ERR_WRONG_TYPE, | 18285 | | NFS4ERR_XDEV | 18286 | | | 18287 | LOCK | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18288 | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | 18289 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADLOCK, | 18290 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18291 | | NFS4ERR_DENIED, NFS4ERR_EXPIRED, | 18292 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18293 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, | 18294 | | NFS4ERR_LOCK_NOTSUPP, NFS4ERR_LOCK_RANGE, | 18295 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18296 | | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID, | 18297 | | NFS4ERR_OPENMODE, | 18298 | | NFS4ERR_OP_NOT_IN_SESSION, | 18299 | | NFS4ERR_RECLAIM_BAD, | 18300 | | NFS4ERR_RECLAIM_CONFLICT, | 18301 | | NFS4ERR_REP_TOO_BIG, | 18302 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18303 | | NFS4ERR_REQ_TOO_BIG, | 18304 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18305 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18306 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18307 | | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE | 18308 | | | 18309 | LOCKT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18310 | | NFS4ERR_BAD_RANGE, NFS4ERR_DEADSESSION, | 18311 | | NFS4ERR_DELAY, NFS4ERR_DENIED, | 18312 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18313 | | 
NFS4ERR_INVAL, NFS4ERR_ISDIR, | 18314 | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | 18315 | | NFS4ERR_NOFILEHANDLE, | 18316 | | NFS4ERR_OP_NOT_IN_SESSION, | 18317 | | NFS4ERR_REP_TOO_BIG, | 18318 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18319 | | NFS4ERR_REQ_TOO_BIG, | 18320 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18321 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 18322 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED, | 18323 | | NFS4ERR_WRONG_TYPE | 18324 | | | 18325 | LOCKU | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18326 | | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE, | 18327 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 18328 | | NFS4ERR_DELAY, NFS4ERR_EXPIRED, | 18329 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18330 | | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED, | 18331 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_OLD_STATEID, | 18332 | | NFS4ERR_OP_NOT_IN_SESSION, | 18333 | | NFS4ERR_REP_TOO_BIG, | 18334 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18335 | | NFS4ERR_REQ_TOO_BIG, | 18336 | | NFS4ERR_RETRY_UNCACHED_REP, | 18337 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18338 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18339 | | | 18340 | LOOKUP | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 18341 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 18342 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18343 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18344 | | NFS4ERR_IO, NFS4ERR_MOVED, | 18345 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 18346 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 18347 | | NFS4ERR_OP_NOT_IN_SESSION, | 18348 | | NFS4ERR_REP_TOO_BIG, | 18349 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18350 | | NFS4ERR_REQ_TOO_BIG, | 18351 | | NFS4ERR_RETRY_UNCACHED_REP, | 18352 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18353 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18354 | | NFS4ERR_WRONGSEC | 18355 | | | 18356 | LOOKUPP | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | 18357 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 18358 | | NFS4ERR_IO, NFS4ERR_MOVED, NFS4ERR_NOENT, | 18359 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 18360 | | 
NFS4ERR_OP_NOT_IN_SESSION, | 18361 | | NFS4ERR_REP_TOO_BIG, | 18362 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18363 | | NFS4ERR_REQ_TOO_BIG, | 18364 | | NFS4ERR_RETRY_UNCACHED_REP, | 18365 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18366 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18367 | | NFS4ERR_WRONGSEC | 18368 | | | 18369 | NVERIFY | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 18370 | | NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 18371 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18372 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18373 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 18374 | | NFS4ERR_NOFILEHANDLE, | 18375 | | NFS4ERR_OP_NOT_IN_SESSION, | 18376 | | NFS4ERR_REP_TOO_BIG, | 18377 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18378 | | NFS4ERR_REQ_TOO_BIG, | 18379 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_SAME, | 18380 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18381 | | NFS4ERR_TOO_MANY_OPS, | 18382 | | NFS4ERR_UNKNOWN_LAYOUTTYPE, | 18383 | | NFS4ERR_WRONG_TYPE | 18384 | | | 18385 | OPEN | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18386 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR, | 18387 | | NFS4ERR_BADNAME, NFS4ERR_BADOWNER, | 18388 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 18389 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18390 | | NFS4ERR_DELEG_ALREADY_WANTED, | 18391 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT, | 18392 | | NFS4ERR_EXIST, NFS4ERR_EXPIRED, | 18393 | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | 18394 | | NFS4ERR_GRACE, NFS4ERR_INVAL, | 18395 | | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_MOVED, | 18396 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 18397 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 18398 | | NFS4ERR_NOTDIR, NFS4ERR_NO_GRACE, | 18399 | | NFS4ERR_OLD_STATEID, | 18400 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM, | 18401 | | NFS4ERR_RECLAIM_BAD, | 18402 | | NFS4ERR_RECLAIM_CONFLICT, | 18403 | | NFS4ERR_REP_TOO_BIG, | 18404 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18405 | | NFS4ERR_REQ_TOO_BIG, | 18406 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18407 | | NFS4ERR_SERVERFAULT, 
NFS4ERR_SHARE_DENIED, | 18408 | | NFS4ERR_STALE, NFS4ERR_SYMLINK, | 18409 | | NFS4ERR_TOO_MANY_OPS, | 18410 | | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_WRONGSEC, | 18411 | | NFS4ERR_WRONG_TYPE | 18412 | | | 18413 | OPEN_CONFIRM | NFS4ERR_NOTSUPP | 18414 | | | 18415 | OPEN_DOWNGRADE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18416 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 18417 | | NFS4ERR_DELAY, NFS4ERR_EXPIRED, | 18418 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18419 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18420 | | NFS4ERR_OLD_STATEID, | 18421 | | NFS4ERR_OP_NOT_IN_SESSION, | 18422 | | NFS4ERR_REP_TOO_BIG, | 18423 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18424 | | NFS4ERR_REQ_TOO_BIG, | 18425 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18426 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18427 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18428 | | | 18429 | OPENATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18430 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18431 | | NFS4ERR_DQUOT, NFS4ERR_FHEXPIRED, | 18432 | | NFS4ERR_IO, NFS4ERR_MOVED, NFS4ERR_NOENT, | 18433 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 18434 | | NFS4ERR_NOTSUPP, | 18435 | | NFS4ERR_OP_NOT_IN_SESSION, | 18436 | | NFS4ERR_REP_TOO_BIG, | 18437 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18438 | | NFS4ERR_REQ_TOO_BIG, | 18439 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18440 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18441 | | NFS4ERR_TOO_MANY_OPS, | 18442 | | NFS4ERR_UNSAFE_COMPOUND, | 18443 | | NFS4ERR_WRONG_TYPE | 18444 | | | 18445 | PUTFH | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR, | 18446 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18447 | | NFS4ERR_MOVED, NFS4ERR_OP_NOT_IN_SESSION, | 18448 | | NFS4ERR_REP_TOO_BIG, | 18449 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18450 | | NFS4ERR_REQ_TOO_BIG, | 18451 | | NFS4ERR_RETRY_UNCACHED_REP, | 18452 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18453 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 18454 | | | 18455 | PUTPUBFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18456 | | 
NFS4ERR_OP_NOT_IN_SESSION, | 18457 | | NFS4ERR_REP_TOO_BIG, | 18458 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18459 | | NFS4ERR_REQ_TOO_BIG, | 18460 | | NFS4ERR_RETRY_UNCACHED_REP, | 18461 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18462 | | NFS4ERR_WRONGSEC | 18463 | | | 18464 | PUTROOTFH | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18465 | | NFS4ERR_OP_NOT_IN_SESSION, | 18466 | | NFS4ERR_REP_TOO_BIG, | 18467 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18468 | | NFS4ERR_REQ_TOO_BIG, | 18469 | | NFS4ERR_RETRY_UNCACHED_REP, | 18470 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18471 | | NFS4ERR_WRONGSEC | 18472 | | | 18473 | READ | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18474 | | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 18475 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18476 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 18477 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18478 | | NFS4ERR_INVAL, NFS4ERR_ISDIR, NFS4ERR_IO, | 18479 | | NFS4ERR_LOCKED, NFS4ERR_MOVED, | 18480 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_OLD_STATEID, | 18481 | | NFS4ERR_OPENMODE, | 18482 | | NFS4ERR_OP_NOT_IN_SESSION, | 18483 | | NFS4ERR_PNFS_IO_HOLE, | 18484 | | NFS4ERR_PNFS_NO_LAYOUT, | 18485 | | NFS4ERR_REP_TOO_BIG, | 18486 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18487 | | NFS4ERR_REQ_TOO_BIG, | 18488 | | NFS4ERR_RETRY_UNCACHED_REP, | 18489 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18490 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18491 | | NFS4ERR_WRONG_TYPE | 18492 | | | 18493 | READDIR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18494 | | NFS4ERR_BAD_COOKIE, NFS4ERR_DEADSESSION, | 18495 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 18496 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 18497 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 18498 | | NFS4ERR_NOT_SAME, | 18499 | | NFS4ERR_OP_NOT_IN_SESSION, | 18500 | | NFS4ERR_REP_TOO_BIG, | 18501 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18502 | | NFS4ERR_REQ_TOO_BIG, | 18503 | | NFS4ERR_RETRY_UNCACHED_REP, | 18504 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18505 | | NFS4ERR_TOOSMALL, 
NFS4ERR_TOO_MANY_OPS | 18506 | | | 18507 | READLINK | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION, | 18508 | | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED, | 18509 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 18510 | | NFS4ERR_NOFILEHANDLE, | 18511 | | NFS4ERR_OP_NOT_IN_SESSION, | 18512 | | NFS4ERR_REP_TOO_BIG, | 18513 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18514 | | NFS4ERR_REQ_TOO_BIG, | 18515 | | NFS4ERR_RETRY_UNCACHED_REP, | 18516 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18517 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE | 18518 | | | 18519 | RECLAIM_COMPLETE | NFS4ERR_BADXDR, NFS4ERR_COMPLETE_ALREADY, | 18520 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18521 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18522 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18523 | | NFS4ERR_OP_NOT_IN_SESSION, | 18524 | | NFS4ERR_REP_TOO_BIG, | 18525 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18526 | | NFS4ERR_REQ_TOO_BIG, | 18527 | | NFS4ERR_RETRY_UNCACHED_REP, | 18528 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18529 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED, | 18530 | | NFS4ERR_WRONG_TYPE | 18531 | | | 18532 | RELEASE_LOCKOWNER | NFS4ERR_NOTSUPP | 18533 | | | 18534 | REMOVE | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 18535 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 18536 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18537 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 18538 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 18539 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 18540 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 18541 | | NFS4ERR_NOTDIR, NFS4ERR_NOTEMPTY, | 18542 | | NFS4ERR_OP_NOT_IN_SESSION, | 18543 | | NFS4ERR_REP_TOO_BIG, | 18544 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18545 | | NFS4ERR_REQ_TOO_BIG, | 18546 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18547 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18548 | | NFS4ERR_TOO_MANY_OPS | 18549 | | | 18550 | RENAME | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 18551 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 18552 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18553 | | NFS4ERR_DQUOT, 
NFS4ERR_EXIST, | 18554 | | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN, | 18555 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 18556 | | NFS4ERR_MLINK, NFS4ERR_MOVED, | 18557 | | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT, | 18558 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 18559 | | NFS4ERR_NOTDIR, NFS4ERR_NOTEMPTY, | 18560 | | NFS4ERR_OP_NOT_IN_SESSION, | 18561 | | NFS4ERR_REP_TOO_BIG, | 18562 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18563 | | NFS4ERR_REQ_TOO_BIG, | 18564 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18565 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18566 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC, | 18567 | | NFS4ERR_XDEV | 18568 | | | 18569 | RENEW | NFS4ERR_NOTSUPP | 18570 | | | 18571 | RESTOREFH | NFS4ERR_DEADSESSION, NFS4ERR_FHEXPIRED, | 18572 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18573 | | NFS4ERR_OP_NOT_IN_SESSION, | 18574 | | NFS4ERR_REP_TOO_BIG, | 18575 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18576 | | NFS4ERR_REQ_TOO_BIG, | 18577 | | NFS4ERR_RETRY_UNCACHED_REP, | 18578 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18579 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC | 18580 | | | 18581 | SAVEFH | NFS4ERR_DEADSESSION, NFS4ERR_FHEXPIRED, | 18582 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18583 | | NFS4ERR_OP_NOT_IN_SESSION, | 18584 | | NFS4ERR_REP_TOO_BIG, | 18585 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18586 | | NFS4ERR_REQ_TOO_BIG, | 18587 | | NFS4ERR_RETRY_UNCACHED_REP, | 18588 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18589 | | NFS4ERR_TOO_MANY_OPS | 18590 | | | 18591 | SECINFO | NFS4ERR_ACCESS, NFS4ERR_BADCHAR, | 18592 | | NFS4ERR_BADNAME, NFS4ERR_BADXDR, | 18593 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18594 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18595 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 18596 | | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE, | 18597 | | NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION, | 18598 | | NFS4ERR_REP_TOO_BIG, | 18599 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18600 | | NFS4ERR_REQ_TOO_BIG, | 18601 | | NFS4ERR_RETRY_UNCACHED_REP, | 18602 | | 
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SECINFO_NO_NAME      | NFS4ERR_ACCESS, NFS4ERR_BADXDR,            |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,          |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOENT,              |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR,      |
   |                      | NFS4ERR_NOTSUPP,                           |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SEQUENCE             | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT,       |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT,     |
   |                      | NFS4ERR_CONN_NOT_BOUND_TO_SESSION,         |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SEQUENCE_POS,                      |
   |                      | NFS4ERR_SEQ_FALSE_RETRY,                   |
   |                      | NFS4ERR_SEQ_MISORDERED,                    |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SET_SSV              | NFS4ERR_BADXDR,                            |
   |                      | NFS4ERR_BAD_SESSION_DIGEST,                |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_INVAL, NFS4ERR_OP_NOT_IN_SESSION,  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SETATTR              | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR,      |
   |                      | NFS4ERR_BADOWNER, NFS4ERR_BADXDR,          |
   |                      | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION,  |
   |                      | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED,      |
   |                      | NFS4ERR_DQUOT, NFS4ERR_EXPIRED,            |
   |                      | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED,           |
   |                      | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO,  |
   |                      | NFS4ERR_LOCKED, NFS4ERR_MOVED,             |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,     |
   |                      | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM,   |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | SETCLIENTID          | NFS4ERR_NOTSUPP                            |
   |                      |                                            |
   | SETCLIENTID_CONFIRM  | NFS4ERR_NOTSUPP                            |
   |                      |                                            |
   | TEST_STATEID         | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION,       |
   |                      | NFS4ERR_DELAY, NFS4ERR_OP_NOT_IN_SESSION,  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS  |
   |                      |                                            |
   | VERIFY               | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP,       |
   |                      | NFS4ERR_BADCHAR, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOT_SAME,    |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | WANT_DELEGATION      | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION,       |
   |                      | NFS4ERR_DELAY,                             |
   |                      | NFS4ERR_DELEG_ALREADY_WANTED,              |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP,     |
   |                      | NFS4ERR_NO_GRACE,                          |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_RECALLCONFLICT,                    |
   |                      | NFS4ERR_RECLAIM_BAD,                       |
   |                      | NFS4ERR_RECLAIM_CONFLICT,                  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE   |
   |                      |                                            |
   | WRITE                | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,       |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT,      |
   |                      | NFS4ERR_EXPIRED, NFS4ERR_FBIG,             |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR,  |
   |                      | NFS4ERR_LOCKED, NFS4ERR_MOVED,             |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,     |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_PNFS_IO_HOLE,                      |
   |                      | NFS4ERR_PNFS_NO_LAYOUT,                    |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_WRONG_TYPE                         |
   +----------------------+--------------------------------------------+

                                  Table 6

15.3.  Callback Operations and Their Valid Errors

   This section contains a table that gives the valid error returns for
   each callback operation.  The error code NFS4_OK (indicating no
   error) is not listed but should be understood to be returnable by
   all callback operations with the exception of CB_ILLEGAL.
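The rule above can be sketched in code.  The following is an illustrative Python fragment, not part of the protocol: the helper name `valid_cb_status` is hypothetical, and the per-operation error sets are transcribed from Table 7 for just two callback operations.  It encodes the convention that NFS4_OK is implicitly permitted for every callback operation except CB_ILLEGAL.

```python
# Hypothetical helper; error sets transcribed from Table 7 below.
CB_ERROR_TABLE = {
    "CB_ILLEGAL": {"NFS4ERR_BADXDR", "NFS4ERR_OP_ILLEGAL"},
    "CB_RECALL": {
        "NFS4ERR_BADHANDLE", "NFS4ERR_BADXDR", "NFS4ERR_BAD_STATEID",
        "NFS4ERR_DELAY", "NFS4ERR_OP_NOT_IN_SESSION",
        "NFS4ERR_REP_TOO_BIG", "NFS4ERR_REP_TOO_BIG_TO_CACHE",
        "NFS4ERR_REQ_TOO_BIG", "NFS4ERR_RETRY_UNCACHED_REP",
        "NFS4ERR_SERVERFAULT", "NFS4ERR_TOO_MANY_OPS",
    },
}

def valid_cb_status(op: str, status: str) -> bool:
    """True if 'status' is a permitted result for callback 'op'."""
    allowed = set(CB_ERROR_TABLE[op])
    if op != "CB_ILLEGAL":
        # NFS4_OK is returnable by all callbacks except CB_ILLEGAL.
        allowed.add("NFS4_OK")
    return status in allowed
```

For example, `valid_cb_status("CB_RECALL", "NFS4_OK")` holds, while `valid_cb_status("CB_ILLEGAL", "NFS4_OK")` does not.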
         Valid Error Returns for Each Protocol Callback Operation

   +-------------------------+-----------------------------------------+
   | Callback Operation      | Errors                                  |
   +-------------------------+-----------------------------------------+
   | CB_GETATTR              | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_DELAY, NFS4ERR_INVAL,           |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_ILLEGAL              | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL      |
   |                         |                                         |
   | CB_LAYOUTRECALL         | NFS4ERR_BADHANDLE, NFS4ERR_BADIOMODE,   |
   |                         | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,    |
   |                         | NFS4ERR_DELAY, NFS4ERR_INVAL,           |
   |                         | NFS4ERR_NOMATCHING_LAYOUT,              |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_TOO_MANY_OPS,                   |
   |                         | NFS4ERR_UNKNOWN_LAYOUTTYPE,             |
   |                         | NFS4ERR_WRONG_TYPE                      |
   |                         |                                         |
   | CB_NOTIFY               | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY,     |
   |                         | NFS4ERR_INVAL, NFS4ERR_NOTSUPP,         |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_NOTIFY_DEVICEID      | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_INVAL, NFS4ERR_NOTSUPP,         |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_NOTIFY_LOCK          | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY,     |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_PUSH_DELEG           | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_DELAY, NFS4ERR_INVAL,           |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REJECT_DELEG,                   |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS,                   |
   |                         | NFS4ERR_WRONG_TYPE                      |
   |                         |                                         |
   | CB_RECALL               | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY,     |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_RECALL_ANY           | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_INVAL,                          |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_RECALLABLE_OBJ_AVAIL | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_INVAL, NFS4ERR_NOTSUPP,         |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_RECALL_SLOT          | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT,  |
   |                         | NFS4ERR_DELAY,                          |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_SEQUENCE             | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT,    |
   |                         | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT,  |
   |                         | NFS4ERR_CONN_NOT_BOUND_TO_SESSION,      |
   |                         | NFS4ERR_DELAY, NFS4ERR_REP_TOO_BIG,     |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SEQUENCE_POS,                   |
   |                         | NFS4ERR_SEQ_FALSE_RETRY,                |
   |                         | NFS4ERR_SEQ_MISORDERED,                 |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_WANTS_CANCELLED      | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   +-------------------------+-----------------------------------------+

                                  Table 7
15.4.  Errors and the Operations That Use Them

   +-----------------------------------+-------------------------------+
   | Error                             | Operations                    |
   +-----------------------------------+-------------------------------+
   | NFS4ERR_ACCESS                    | ACCESS, COMMIT, CREATE,       |
   |                                   | GETATTR, GET_DIR_DELEGATION,  |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LINK, LOCK, LOCKT, LOCKU,     |
   |                                   | LOOKUP, LOOKUPP, NVERIFY,     |
   |                                   | OPEN, OPENATTR, READ,         |
   |                                   | READDIR, READLINK, REMOVE,    |
   |                                   | RENAME, SECINFO,              |
   |                                   | SECINFO_NO_NAME, SETATTR,     |
   |                                   | VERIFY, WRITE                 |
   |                                   |                               |
   | NFS4ERR_ADMIN_REVOKED             | CLOSE, DELEGRETURN,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LOCK, LOCKU,    |
   |                                   | OPEN, OPEN_DOWNGRADE, READ,   |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_ATTRNOTSUPP               | CREATE, LAYOUTCOMMIT,         |
   |                                   | NVERIFY, OPEN, SETATTR,       |
   |                                   | VERIFY                        |
   |                                   |                               |
   | NFS4ERR_BACK_CHAN_BUSY            | DESTROY_SESSION               |
   |                                   |                               |
   | NFS4ERR_BADCHAR                   | CREATE, EXCHANGE_ID, LINK,    |
   |                                   | LOOKUP, NVERIFY, OPEN,        |
   |                                   | REMOVE, RENAME, SECINFO,      |
   |                                   | SETATTR, VERIFY               |
   |                                   |                               |
   | NFS4ERR_BADHANDLE                 | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY, CB_NOTIFY_LOCK,    |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | PUTFH                         |
   |                                   |                               |
   | NFS4ERR_BADIOMODE                 | CB_LAYOUTRECALL,              |
   |                                   | LAYOUTCOMMIT, LAYOUTGET       |
   |                                   |                               |
   | NFS4ERR_BADLAYOUT                 | LAYOUTCOMMIT, LAYOUTGET       |
   |                                   |                               |
   | NFS4ERR_BADNAME                   | CREATE, LINK, LOOKUP, OPEN,   |
   |                                   | REMOVE, RENAME, SECINFO       |
   |                                   |                               |
   | NFS4ERR_BADOWNER                  | CREATE, OPEN, SETATTR         |
   |                                   |                               |
   | NFS4ERR_BADSESSION                | BIND_CONN_TO_SESSION,         |
   |                                   | CB_SEQUENCE, DESTROY_SESSION, |
   |                                   | SEQUENCE                      |
   |                                   |                               |
   | NFS4ERR_BADSLOT                   | CB_SEQUENCE, SEQUENCE         |
   |                                   |                               |
   | NFS4ERR_BADTYPE                   | CREATE                        |
   |                                   |                               |
   | NFS4ERR_BADXDR                    | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_ILLEGAL,       |
   |                                   | CB_LAYOUTRECALL, CB_NOTIFY,   |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION, ILLEGAL,  |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | NVERIFY, OPEN, OPENATTR,      |
   |                                   | OPEN_DOWNGRADE, PUTFH, READ,  |
   |                                   | READDIR, RECLAIM_COMPLETE,    |
   |                                   | REMOVE, RENAME, SECINFO,      |
   |                                   | SECINFO_NO_NAME, SEQUENCE,    |
   |                                   | SETATTR, SET_SSV,             |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_BAD_COOKIE                | GETDEVICELIST, READDIR        |
   |                                   |                               |
   | NFS4ERR_BAD_HIGH_SLOT             | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | SEQUENCE                      |
   |                                   |                               |
   | NFS4ERR_BAD_RANGE                 | LOCK, LOCKT, LOCKU            |
   |                                   |                               |
   | NFS4ERR_BAD_SESSION_DIGEST        | BIND_CONN_TO_SESSION, SET_SSV |
   |                                   |                               |
   | NFS4ERR_BAD_STATEID               | CB_LAYOUTRECALL, CB_NOTIFY,   |
   |                                   | CB_NOTIFY_LOCK, CB_RECALL,    |
   |                                   | CLOSE, DELEGRETURN,           |
   |                                   | FREE_STATEID, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LOCK, LOCKU,    |
   |                                   | OPEN, OPEN_DOWNGRADE, READ,   |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_CB_PATH_DOWN              | DESTROY_SESSION               |
   |                                   |                               |
   | NFS4ERR_CLID_INUSE                | CREATE_SESSION, EXCHANGE_ID   |
   |                                   |                               |
   | NFS4ERR_CLIENTID_BUSY             | DESTROY_CLIENTID              |
   |                                   |                               |
   | NFS4ERR_COMPLETE_ALREADY          | RECLAIM_COMPLETE              |
   |                                   |                               |
   | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, DESTROY_SESSION, |
   |                                   | SEQUENCE                      |
   |                                   |                               |
   | NFS4ERR_DEADLOCK                  | LOCK                          |
   |                                   |                               |
   | NFS4ERR_DEADSESSION               | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION, CLOSE,  |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SEQUENCE, SETATTR, SET_SSV,   |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_DELAY                     | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, SECINFO,              |
   |                                   | SECINFO_NO_NAME, SEQUENCE,    |
   |                                   | SETATTR, SET_SSV,             |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_DELEG_ALREADY_WANTED      | OPEN, WANT_DELEGATION         |
   |                                   |                               |
   | NFS4ERR_DELEG_REVOKED             | DELEGRETURN, LAYOUTCOMMIT,    |
   |                                   | LAYOUTGET, LAYOUTRETURN,      |
   |                                   | OPEN, READ, SETATTR, WRITE    |
   |                                   |                               |
   | NFS4ERR_DENIED                    | LOCK, LOCKT                   |
   |                                   |                               |
   | NFS4ERR_DIRDELEG_UNAVAIL          | GET_DIR_DELEGATION            |
   |                                   |                               |
   | NFS4ERR_DQUOT                     | CREATE, LAYOUTGET, LINK,      |
   |                                   | OPEN, OPENATTR, RENAME,       |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_ENCR_ALG_UNSUPP           | EXCHANGE_ID                   |
   |                                   |                               |
   | NFS4ERR_EXIST                     | CREATE, LINK, OPEN, RENAME    |
   |                                   |                               |
   | NFS4ERR_EXPIRED                   | CLOSE, DELEGRETURN,           |
   |                                   | LAYOUTCOMMIT, LAYOUTRETURN,   |
   |                                   | LOCK, LOCKU, OPEN,            |
   |                                   | OPEN_DOWNGRADE, READ,         |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_FBIG                      | LAYOUTCOMMIT, OPEN, SETATTR,  |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_FHEXPIRED                 | ACCESS, CLOSE, COMMIT,        |
   |                                   | CREATE, DELEGRETURN, GETATTR, |
   |                                   | GETDEVICELIST, GETFH,         |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SETATTR, VERIFY,              |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_FILE_OPEN                 | LINK, REMOVE, RENAME          |
   |                                   |                               |
   | NFS4ERR_GRACE                     | GETATTR, GET_DIR_DELEGATION,  |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, NVERIFY, OPEN, READ,   |
   |                                   | REMOVE, RENAME, SETATTR,      |
   |                                   | VERIFY, WANT_DELEGATION,      |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_HASH_ALG_UNSUPP           | EXCHANGE_ID                   |
   |                                   |                               |
   | NFS4ERR_INVAL                     | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_PUSH_DELEG,                |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY, CREATE,        |
   |                                   | CREATE_SESSION, DELEGRETURN,  |
   |                                   | EXCHANGE_ID, GETATTR,         |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | NVERIFY, OPEN,                |
   |                                   | OPEN_DOWNGRADE, READ,         |
   |                                   | READDIR, READLINK,            |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, SECINFO,              |
   |                                   | SECINFO_NO_NAME, SETATTR,     |
   |                                   | SET_SSV, VERIFY,              |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_IO                        | ACCESS, COMMIT, CREATE,       |
   |                                   | GETATTR, GETDEVICELIST,       |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LINK, LOOKUP, LOOKUPP,        |
   |                                   | NVERIFY, OPEN, OPENATTR,      |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | REMOVE, RENAME, SETATTR,      |
   |                                   | VERIFY, WANT_DELEGATION,      |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_ISDIR                     | COMMIT, LAYOUTCOMMIT,         |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, OPEN, READ, WRITE      |
   |                                   |                               |
   | NFS4ERR_LAYOUTTRYLATER            | LAYOUTGET                     |
   |                                   |                               |
   | NFS4ERR_LAYOUTUNAVAILABLE         | LAYOUTGET                     |
   |                                   |                               |
   | NFS4ERR_LOCKED                    | LAYOUTGET, READ, SETATTR,     |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_LOCKS_HELD                | CLOSE, FREE_STATEID           |
   |                                   |                               |
   | NFS4ERR_LOCK_NOTSUPP              | LOCK                          |
   |                                   |                               |
   | NFS4ERR_LOCK_RANGE                | LOCK, LOCKT, LOCKU            |
   |                                   |                               |
   | NFS4ERR_MLINK                     | CREATE, LINK, RENAME          |
   |                                   |                               |
   | NFS4ERR_MOVED                     | ACCESS, CLOSE, COMMIT,        |
   |                                   | CREATE, DELEGRETURN, GETATTR, |
   |                                   | GETFH, GET_DIR_DELEGATION,    |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, READ, READDIR,         |
   |                                   | READLINK, RECLAIM_COMPLETE,   |
   |                                   | REMOVE, RENAME, RESTOREFH,    |
   |                                   | SAVEFH, SECINFO,              |
   |                                   | SECINFO_NO_NAME, SETATTR,     |
   |                                   | VERIFY, WANT_DELEGATION,      |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_NAMETOOLONG               | CREATE, LINK, LOOKUP, OPEN,   |
   |                                   | REMOVE, RENAME, SECINFO       |
   |                                   |                               |
   | NFS4ERR_NOENT                     | BACKCHANNEL_CTL,              |
   |                                   | CREATE_SESSION, EXCHANGE_ID,  |
   |                                   | GETDEVICEINFO, LOOKUP,        |
   |                                   | LOOKUPP, OPEN, OPENATTR,      |
   |                                   | REMOVE, RENAME, SECINFO,      |
   |                                   | SECINFO_NO_NAME               |
   |                                   |                               |
   | NFS4ERR_NOFILEHANDLE              | ACCESS, CLOSE, COMMIT,        |
   |                                   | CREATE, DELEGRETURN, GETATTR, |
   |                                   | GETDEVICELIST, GETFH,         |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SETATTR, VERIFY,              |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_NOMATCHING_LAYOUT         | CB_LAYOUTRECALL               |
   |                                   |                               |
   | NFS4ERR_NOSPC                     | CREATE, CREATE_SESSION,       |
   |                                   | LAYOUTGET, LINK, OPEN,        |
   |                                   | OPENATTR, RENAME, SETATTR,    |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_NOTDIR                    | CREATE, GET_DIR_DELEGATION,   |
   |                                   | LINK, LOOKUP, LOOKUPP, OPEN,  |
   |                                   | READDIR, REMOVE, RENAME,      |
   |                                   | SECINFO, SECINFO_NO_NAME      |
   |                                   |                               |
   | NFS4ERR_NOTEMPTY                  | REMOVE, RENAME                |
   |                                   |                               |
   | NFS4ERR_NOTSUPP                   | CB_LAYOUTRECALL, CB_NOTIFY,   |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG,                |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_WANTS_CANCELLED,           |
   |                                   | DELEGPURGE, DELEGRETURN,      |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, OPENATTR, |
   |                                   | OPEN_CONFIRM,                 |
   |                                   | RELEASE_LOCKOWNER, RENEW,     |
   |                                   | SECINFO_NO_NAME, SETCLIENTID, |
   |                                   | SETCLIENTID_CONFIRM,          |
   |                                   | WANT_DELEGATION               |
   |                                   |                               |
   | NFS4ERR_NOT_ONLY_OP               | BIND_CONN_TO_SESSION,         |
   |                                   | CREATE_SESSION,               |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID  |
   |                                   |                               |
   | NFS4ERR_NOT_SAME                  | EXCHANGE_ID, GETDEVICELIST,   |
   |                                   | READDIR, VERIFY               |
   |                                   |                               |
   | NFS4ERR_NO_GRACE                  | LAYOUTCOMMIT, LAYOUTRETURN,   |
   |                                   | LOCK, OPEN, WANT_DELEGATION   |
   |                                   |                               |
   | NFS4ERR_OLD_STATEID               | CLOSE, DELEGRETURN,           |
   |                                   | FREE_STATEID, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LOCK, LOCKU,    |
   |                                   | OPEN, OPEN_DOWNGRADE, READ,   |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_OPENMODE                  | LAYOUTGET, LOCK, READ,        |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_OP_ILLEGAL                | CB_ILLEGAL, ILLEGAL           |
   |                                   |                               |
   | NFS4ERR_OP_NOT_IN_SESSION         | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT,               |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE, DELEGPURGE,   |
   |                                   | DELEGRETURN, FREE_STATEID,    |
   |                                   | GETATTR, GETDEVICEINFO,       |
   |                                   | GETDEVICELIST, GETFH,         |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SETATTR, SET_SSV,             |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_PERM                      | CREATE, OPEN, SETATTR         |
   |                                   |                               |
   | NFS4ERR_PNFS_IO_HOLE              | READ, WRITE                   |
   |                                   |                               |
   | NFS4ERR_PNFS_NO_LAYOUT            | READ, WRITE                   |
   |                                   |                               |
   | NFS4ERR_RECALLCONFLICT            | LAYOUTGET, WANT_DELEGATION    |
   |                                   |                               |
   | NFS4ERR_RECLAIM_BAD               | LAYOUTCOMMIT, LOCK, OPEN,     |
   |                                   | WANT_DELEGATION               |
   |                                   |                               |
   | NFS4ERR_RECLAIM_CONFLICT          | LAYOUTCOMMIT, LOCK, OPEN,     |
   |                                   | WANT_DELEGATION               |
   |                                   |                               |
   | NFS4ERR_REJECT_DELEG              | CB_PUSH_DELEG                 |
   |                                   |                               |
   | NFS4ERR_REP_TOO_BIG               | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SEQUENCE, SETATTR, SET_SSV,   |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_REP_TOO_BIG_TO_CACHE      | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SEQUENCE, SETATTR, SET_SSV,   |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_REQ_TOO_BIG               | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SEQUENCE, SETATTR, SET_SSV,   |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_RETRY_UNCACHED_REP        | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SEQUENCE, SETATTR, SET_SSV,   |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_ROFS                      | CREATE, LINK, LOCK, LOCKT,    |
   |                                   | OPEN, OPENATTR,               |
   |                                   | OPEN_DOWNGRADE, REMOVE,       |
   |                                   | RENAME, SETATTR, WRITE        |
   |                                   |                               |
   | NFS4ERR_SAME                      | NVERIFY                       |
   |                                   |                               |
   | NFS4ERR_SEQUENCE_POS              | CB_SEQUENCE, SEQUENCE         |
   |                                   |                               |
   | NFS4ERR_SEQ_FALSE_RETRY           | CB_SEQUENCE, SEQUENCE         |
   |                                   |                               |
   | NFS4ERR_SEQ_MISORDERED            | CB_SEQUENCE, CREATE_SESSION,  |
   |                                   | SEQUENCE                      |
   |                                   |                               |
   | NFS4ERR_SERVERFAULT               | ACCESS, BIND_CONN_TO_SESSION, |
   |                                   | CB_GETATTR, CB_NOTIFY,        |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKU, LOOKUP, LOOKUPP,       |
   |                                   | NVERIFY, OPEN, OPENATTR,      |
   |                                   | OPEN_DOWNGRADE, PUTFH,        |
   |                                   | PUTPUBFH, PUTROOTFH, READ,    |
   |                                   | READDIR, READLINK,            |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SETATTR, TEST_STATEID,        |
   |                                   | VERIFY, WANT_DELEGATION,      |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_SHARE_DENIED              | OPEN                          |
   |                                   |                               |
   | NFS4ERR_STALE                     | ACCESS, CLOSE, COMMIT,        |
   |                                   | CREATE, DELEGRETURN, GETATTR, |
   |                                   | GETFH, GET_DIR_DELEGATION,    |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, READ, READDIR,         |
   |                                   | READLINK, RECLAIM_COMPLETE,   |
   |                                   | REMOVE, RENAME, RESTOREFH,    |
   |                                   | SAVEFH, SECINFO,              |
   |                                   | SECINFO_NO_NAME, SETATTR,     |
   |                                   | VERIFY, WANT_DELEGATION,      |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_STALE_CLIENTID            | CREATE_SESSION,               |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION               |
   |                                   |                               |
   | NFS4ERR_SYMLINK                   | COMMIT, LAYOUTCOMMIT, LINK,   |
   |                                   | LOCK, LOCKT, LOOKUP, LOOKUPP, |
   |                                   | OPEN, READ, WRITE             |
   |                                   |                               |
   | NFS4ERR_TOOSMALL                  | CREATE_SESSION,               |
   |                                   | GETDEVICEINFO, LAYOUTGET,     |
   |                                   | READDIR                       |
   |                                   |                               |
   | NFS4ERR_TOO_MANY_OPS              | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SEQUENCE, SETATTR, SET_SSV,   |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_UNKNOWN_LAYOUTTYPE        | CB_LAYOUTRECALL,              |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, NVERIFY,        |
   |                                   | SETATTR, VERIFY               |
   |                                   |                               |
   | NFS4ERR_UNSAFE_COMPOUND           | CREATE, OPEN, OPENATTR        |
   |                                   |                               |
   | NFS4ERR_WRONGSEC                  | LINK, LOOKUP, LOOKUPP, OPEN,  |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | RENAME, RESTOREFH             |
   |                                   |                               |
   | NFS4ERR_WRONG_CRED                | CLOSE, CREATE_SESSION,        |
   |                                   | DELEGPURGE, DELEGRETURN,      |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION,              |
   |                                   | FREE_STATEID, LAYOUTCOMMIT,   |
   |                                   | LAYOUTRETURN, LOCK, LOCKT,    |
   |                                   | LOCKU, OPEN_DOWNGRADE,        |
   |                                   | RECLAIM_COMPLETE              |
   |                                   |                               |
   | NFS4ERR_WRONG_TYPE                | CB_LAYOUTRECALL,              |
   |                                   | CB_PUSH_DELEG, COMMIT,        |
   |                                   | GETATTR, LAYOUTGET,           |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, NVERIFY, OPEN,         |
   |                                   | OPENATTR, READ, READLINK,     |
   |                                   | RECLAIM_COMPLETE, SETATTR,    |
   |                                   | VERIFY, WANT_DELEGATION,      |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_XDEV                      | LINK, RENAME                  |
   +-----------------------------------+-------------------------------+

                                  Table 8
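Table 8 carries the same operation/error relation as Tables 6 and 7, indexed by error code rather than by operation.  The following illustrative Python fragment (not part of the protocol; the helper name `errors_to_operations` and the sample data are for demonstration only) shows how the per-error view can be derived mechanically from a per-operation table; operation lists are sorted here for a deterministic result.

```python
from collections import defaultdict

def errors_to_operations(op_table):
    """Invert {operation: set-of-errors} into {error: [operations]}."""
    inverse = defaultdict(list)
    for op in sorted(op_table):
        for err in op_table[op]:
            inverse[err].append(op)
    return dict(inverse)

# A small slice of Table 6, transcribed for illustration:
OP_TABLE = {
    "SETCLIENTID":         {"NFS4ERR_NOTSUPP"},
    "SETCLIENTID_CONFIRM": {"NFS4ERR_NOTSUPP"},
    "TEST_STATEID":        {"NFS4ERR_BADXDR", "NFS4ERR_DELAY"},
}
```

Running `errors_to_operations(OP_TABLE)` maps NFS4ERR_NOTSUPP back to SETCLIENTID and SETCLIENTID_CONFIRM, mirroring the corresponding rows of Table 8.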
NFSv4.1 Procedures 19581 Both procedures, NULL and COMPOUND, MUST be implemented. 19583 16.1. Procedure 0: NULL - No Operation 19585 16.1.1. ARGUMENTS 19587 void; 19589 16.1.2. RESULTS 19591 void; 19593 16.1.3. DESCRIPTION 19595 This is the standard NULL procedure with the standard void argument 19596 and void response. This procedure has no functionality associated 19597 with it. Because of this, it is sometimes used to measure the 19598 overhead of processing a service request. Therefore, the server 19599 SHOULD ensure that no unnecessary work is done in servicing this 19600 procedure. 19602 16.1.4. ERRORS 19604 None. 19606 16.2. Procedure 1: COMPOUND - Compound Operations 19608 16.2.1. ARGUMENTS 19610 enum nfs_opnum4 { 19611 OP_ACCESS = 3, 19612 OP_CLOSE = 4, 19613 OP_COMMIT = 5, 19614 OP_CREATE = 6, 19615 OP_DELEGPURGE = 7, 19616 OP_DELEGRETURN = 8, 19617 OP_GETATTR = 9, 19618 OP_GETFH = 10, 19619 OP_LINK = 11, 19620 OP_LOCK = 12, 19621 OP_LOCKT = 13, 19622 OP_LOCKU = 14, 19623 OP_LOOKUP = 15, 19624 OP_LOOKUPP = 16, 19625 OP_NVERIFY = 17, 19626 OP_OPEN = 18, 19627 OP_OPENATTR = 19, 19628 OP_OPEN_CONFIRM = 20, /* Mandatory not-to-implement */ 19629 OP_OPEN_DOWNGRADE = 21, 19630 OP_PUTFH = 22, 19631 OP_PUTPUBFH = 23, 19632 OP_PUTROOTFH = 24, 19633 OP_READ = 25, 19634 OP_READDIR = 26, 19635 OP_READLINK = 27, 19636 OP_REMOVE = 28, 19637 OP_RENAME = 29, 19638 OP_RENEW = 30, /* Mandatory not-to-implement */ 19639 OP_RESTOREFH = 31, 19640 OP_SAVEFH = 32, 19641 OP_SECINFO = 33, 19642 OP_SETATTR = 34, 19643 OP_SETCLIENTID = 35, /* Mandatory not-to-implement */ 19644 OP_SETCLIENTID_CONFIRM = 36, /* Mandatory not-to-implement */ 19645 OP_VERIFY = 37, 19646 OP_WRITE = 38, 19647 OP_RELEASE_LOCKOWNER = 39, /* Mandatory not-to-implement */ 19649 /* new operations for NFSv4.1 */ 19651 OP_BACKCHANNEL_CTL = 40, 19652 OP_BIND_CONN_TO_SESSION = 41, 19653 OP_EXCHANGE_ID = 42, 19654 OP_CREATE_SESSION = 43, 19655 OP_DESTROY_SESSION = 44, 19656 OP_FREE_STATEID = 45, 19657 
    OP_GET_DIR_DELEGATION   = 46,
    OP_GETDEVICEINFO        = 47,
    OP_GETDEVICELIST        = 48,
    OP_LAYOUTCOMMIT         = 49,
    OP_LAYOUTGET            = 50,
    OP_LAYOUTRETURN         = 51,
    OP_SECINFO_NO_NAME      = 52,
    OP_SEQUENCE             = 53,
    OP_SET_SSV              = 54,
    OP_TEST_STATEID         = 55,
    OP_WANT_DELEGATION      = 56,
    OP_DESTROY_CLIENTID     = 57,
    OP_RECLAIM_COMPLETE     = 58,
    OP_ILLEGAL              = 10044
   };

   union nfs_argop4 switch (nfs_opnum4 argop) {
    case OP_ACCESS:        ACCESS4args opaccess;
    case OP_CLOSE:         CLOSE4args opclose;
    case OP_COMMIT:        COMMIT4args opcommit;
    case OP_CREATE:        CREATE4args opcreate;
    case OP_DELEGPURGE:    DELEGPURGE4args opdelegpurge;
    case OP_DELEGRETURN:   DELEGRETURN4args opdelegreturn;
    case OP_GETATTR:       GETATTR4args opgetattr;
    case OP_GETFH:         void;
    case OP_LINK:          LINK4args oplink;
    case OP_LOCK:          LOCK4args oplock;
    case OP_LOCKT:         LOCKT4args oplockt;
    case OP_LOCKU:         LOCKU4args oplocku;
    case OP_LOOKUP:        LOOKUP4args oplookup;
    case OP_LOOKUPP:       void;
    case OP_NVERIFY:       NVERIFY4args opnverify;
    case OP_OPEN:          OPEN4args opopen;
    case OP_OPENATTR:      OPENATTR4args opopenattr;

    /* Not for NFSv4.1 */
    case OP_OPEN_CONFIRM:  OPEN_CONFIRM4args opopen_confirm;

    case OP_OPEN_DOWNGRADE:
                           OPEN_DOWNGRADE4args opopen_downgrade;

    case OP_PUTFH:         PUTFH4args opputfh;
    case OP_PUTPUBFH:      void;
    case OP_PUTROOTFH:     void;
    case OP_READ:          READ4args opread;
    case OP_READDIR:       READDIR4args opreaddir;
    case OP_READLINK:      void;
    case OP_REMOVE:        REMOVE4args opremove;
    case OP_RENAME:        RENAME4args oprename;

    /* Not for NFSv4.1 */
    case OP_RENEW:         RENEW4args oprenew;

    case OP_RESTOREFH:     void;
    case OP_SAVEFH:        void;
    case OP_SECINFO:       SECINFO4args opsecinfo;
    case OP_SETATTR:       SETATTR4args opsetattr;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID:   SETCLIENTID4args opsetclientid;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID_CONFIRM:
                           SETCLIENTID_CONFIRM4args
                                   opsetclientid_confirm;

    case OP_VERIFY:        VERIFY4args opverify;
    case OP_WRITE:         WRITE4args opwrite;

    /* Not for NFSv4.1 */
    case OP_RELEASE_LOCKOWNER:
                           RELEASE_LOCKOWNER4args
                                   oprelease_lockowner;

    /* Operations new to NFSv4.1 */
    case OP_BACKCHANNEL_CTL:
                           BACKCHANNEL_CTL4args opbackchannel_ctl;

    case OP_BIND_CONN_TO_SESSION:
                           BIND_CONN_TO_SESSION4args
                                   opbind_conn_to_session;

    case OP_EXCHANGE_ID:   EXCHANGE_ID4args opexchange_id;

    case OP_CREATE_SESSION:
                           CREATE_SESSION4args opcreate_session;

    case OP_DESTROY_SESSION:
                           DESTROY_SESSION4args opdestroy_session;

    case OP_FREE_STATEID:  FREE_STATEID4args opfree_stateid;

    case OP_GET_DIR_DELEGATION:
                           GET_DIR_DELEGATION4args
                                   opget_dir_delegation;

    case OP_GETDEVICEINFO: GETDEVICEINFO4args opgetdeviceinfo;
    case OP_GETDEVICELIST: GETDEVICELIST4args opgetdevicelist;
    case OP_LAYOUTCOMMIT:  LAYOUTCOMMIT4args oplayoutcommit;
    case OP_LAYOUTGET:     LAYOUTGET4args oplayoutget;
    case OP_LAYOUTRETURN:  LAYOUTRETURN4args oplayoutreturn;

    case OP_SECINFO_NO_NAME:
                           SECINFO_NO_NAME4args opsecinfo_no_name;

    case OP_SEQUENCE:      SEQUENCE4args opsequence;
    case OP_SET_SSV:       SET_SSV4args opset_ssv;
    case OP_TEST_STATEID:  TEST_STATEID4args optest_stateid;

    case OP_WANT_DELEGATION:
                           WANT_DELEGATION4args opwant_delegation;

    case OP_DESTROY_CLIENTID:
                           DESTROY_CLIENTID4args
                                   opdestroy_clientid;

    case OP_RECLAIM_COMPLETE:
                           RECLAIM_COMPLETE4args
                                   opreclaim_complete;

    /* Operations not new to NFSv4.1 */
    case OP_ILLEGAL:       void;
   };

   struct COMPOUND4args {
           utf8str_cs      tag;
           uint32_t        minorversion;
           nfs_argop4      argarray<>;
   };

16.2.2.  RESULTS

   union nfs_resop4 switch (nfs_opnum4 resop) {
    case OP_ACCESS:        ACCESS4res opaccess;
    case OP_CLOSE:         CLOSE4res opclose;
    case OP_COMMIT:        COMMIT4res opcommit;
    case OP_CREATE:        CREATE4res opcreate;
    case OP_DELEGPURGE:    DELEGPURGE4res opdelegpurge;
    case OP_DELEGRETURN:   DELEGRETURN4res opdelegreturn;
    case OP_GETATTR:       GETATTR4res opgetattr;
    case OP_GETFH:         GETFH4res opgetfh;
    case OP_LINK:          LINK4res oplink;
    case OP_LOCK:          LOCK4res oplock;
    case OP_LOCKT:         LOCKT4res oplockt;
    case OP_LOCKU:         LOCKU4res oplocku;
    case OP_LOOKUP:        LOOKUP4res oplookup;
    case OP_LOOKUPP:       LOOKUPP4res oplookupp;
    case OP_NVERIFY:       NVERIFY4res opnverify;
    case OP_OPEN:          OPEN4res opopen;
    case OP_OPENATTR:      OPENATTR4res opopenattr;

    /* Not for NFSv4.1 */
    case OP_OPEN_CONFIRM:  OPEN_CONFIRM4res opopen_confirm;

    case OP_OPEN_DOWNGRADE:
                           OPEN_DOWNGRADE4res opopen_downgrade;

    case OP_PUTFH:         PUTFH4res opputfh;
    case OP_PUTPUBFH:      PUTPUBFH4res opputpubfh;
    case OP_PUTROOTFH:     PUTROOTFH4res opputrootfh;
    case OP_READ:          READ4res opread;
    case OP_READDIR:       READDIR4res opreaddir;
    case OP_READLINK:      READLINK4res opreadlink;
    case OP_REMOVE:        REMOVE4res opremove;
    case OP_RENAME:        RENAME4res oprename;

    /* Not for NFSv4.1 */
    case OP_RENEW:         RENEW4res oprenew;

    case OP_RESTOREFH:     RESTOREFH4res oprestorefh;
    case OP_SAVEFH:        SAVEFH4res opsavefh;
    case OP_SECINFO:       SECINFO4res opsecinfo;
    case OP_SETATTR:       SETATTR4res opsetattr;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID:   SETCLIENTID4res opsetclientid;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID_CONFIRM:
                           SETCLIENTID_CONFIRM4res
                                   opsetclientid_confirm;

    case OP_VERIFY:        VERIFY4res opverify;
    case OP_WRITE:         WRITE4res opwrite;

    /* Not for NFSv4.1 */
    case OP_RELEASE_LOCKOWNER:
                           RELEASE_LOCKOWNER4res
                                   oprelease_lockowner;

    /* Operations new to NFSv4.1 */
    case OP_BACKCHANNEL_CTL:
                           BACKCHANNEL_CTL4res opbackchannel_ctl;

    case OP_BIND_CONN_TO_SESSION:
                           BIND_CONN_TO_SESSION4res
                                   opbind_conn_to_session;

    case OP_EXCHANGE_ID:   EXCHANGE_ID4res opexchange_id;

    case OP_CREATE_SESSION:
                           CREATE_SESSION4res opcreate_session;

    case OP_DESTROY_SESSION:
                           DESTROY_SESSION4res opdestroy_session;

    case OP_FREE_STATEID:  FREE_STATEID4res opfree_stateid;

    case OP_GET_DIR_DELEGATION:
                           GET_DIR_DELEGATION4res
                                   opget_dir_delegation;

    case OP_GETDEVICEINFO: GETDEVICEINFO4res opgetdeviceinfo;
    case OP_GETDEVICELIST: GETDEVICELIST4res opgetdevicelist;
    case OP_LAYOUTCOMMIT:  LAYOUTCOMMIT4res oplayoutcommit;
    case OP_LAYOUTGET:     LAYOUTGET4res oplayoutget;
    case OP_LAYOUTRETURN:  LAYOUTRETURN4res oplayoutreturn;

    case OP_SECINFO_NO_NAME:
                           SECINFO_NO_NAME4res opsecinfo_no_name;

    case OP_SEQUENCE:      SEQUENCE4res opsequence;
    case OP_SET_SSV:       SET_SSV4res opset_ssv;
    case OP_TEST_STATEID:  TEST_STATEID4res optest_stateid;

    case OP_WANT_DELEGATION:
                           WANT_DELEGATION4res opwant_delegation;

    case OP_DESTROY_CLIENTID:
                           DESTROY_CLIENTID4res
                                   opdestroy_clientid;

    case OP_RECLAIM_COMPLETE:
                           RECLAIM_COMPLETE4res
                                   opreclaim_complete;

    /* Operations not new to NFSv4.1 */
    case OP_ILLEGAL:       ILLEGAL4res opillegal;
   };

   struct COMPOUND4res {
           nfsstat4        status;
           utf8str_cs      tag;
           nfs_resop4      resarray<>;
   };

16.2.3.  DESCRIPTION

   The COMPOUND procedure is used to combine one or more NFSv4
   operations into a single RPC request.  The server interprets each of
   the operations in turn.  If an operation is executed by the server
   and the status of that operation is NFS4_OK, then the next operation
   in the COMPOUND procedure is executed.
   The server continues this process until there are no more operations
   to be executed or until one of the operations has a status value
   other than NFS4_OK.

   In the processing of the COMPOUND procedure, the server may find that
   it does not have the available resources to execute any or all of the
   operations within the COMPOUND sequence.  See Section 2.10.6.4 for a
   more detailed discussion.

   The server will generally choose between two methods of decoding the
   client's request.  The first would be the traditional one-pass XDR
   decode.  If there is an XDR decoding error in this case, the RPC XDR
   decode error would be returned.  The second method would be to make
   an initial pass to decode the basic COMPOUND request and then to XDR
   decode the individual operations; the most interesting is the decode
   of attributes.  In this case, the server may encounter an XDR decode
   error during the second pass.  If it does, the server would return
   the error NFS4ERR_BADXDR to signify the decode error.

   The COMPOUND arguments contain a "minorversion" field.  For NFSv4.1,
   the value for this field is 1.  If the server receives a COMPOUND
   procedure with a minorversion field value that it does not support,
   the server MUST return an error of NFS4ERR_MINOR_VERS_MISMATCH and a
   zero-length resultdata array.

   Contained within the COMPOUND results is a "status" field.  If the
   results array length is non-zero, this status must be equivalent to
   the status of the last operation that was executed within the
   COMPOUND procedure.  Therefore, if an operation incurred an error
   then the "status" value will be the same error value as is being
   returned for the operation that failed.

   Note that operations zero and one are not defined for the COMPOUND
   procedure.  Operation 2 is not defined and is reserved for future
   definition and use with minor versioning.
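   As a non-normative illustration (not part of the protocol
   definition), the evaluation rules above -- execute operations in
   order, stop after the first non-NFS4_OK status, report the last
   executed operation's status as the COMPOUND status, and answer an
   out-of-range opcode with OP_ILLEGAL -- can be sketched as follows.
   The operation table, handler signature, and numeric status values
   other than the named constants are simplified stand-ins, not real
   NFSv4.1 server interfaces:

   ```python
   # Non-normative sketch of COMPOUND evaluation; op_table maps an
   # opcode to a handler returning (status, result).

   NFS4_OK = 0
   NFS4ERR_MINOR_VERS_MISMATCH = 10021
   NFS4ERR_OP_ILLEGAL = 10044

   def compound(minorversion, argarray, op_table, supported_minor=1):
       """Evaluate operations in order, stopping at the first failure.

       Returns (status, resarray); status equals the status of the
       last operation executed, per Section 16.2.3.
       """
       if minorversion != supported_minor:
           # Unsupported minor version: error plus a zero-length
           # resultdata array.
           return NFS4ERR_MINOR_VERS_MISMATCH, []
       resarray = []
       status = NFS4_OK
       for opnum, args in argarray:
           handler = op_table.get(opnum)
           if handler is None:
               # The response encodes OP_ILLEGAL rather than the
               # illegal opcode from the request.
               resarray.append(("OP_ILLEGAL", NFS4ERR_OP_ILLEGAL))
               status = NFS4ERR_OP_ILLEGAL
               break
           status, _result = handler(args)
           resarray.append((opnum, status))
           if status != NFS4_OK:
               break  # remaining operations are not executed
       return status, resarray
   ```

   For example, a three-operation request whose second operation fails
   yields a two-entry result array whose overall status is the second
   operation's error.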
   If the server receives an operation array that contains operation 2
   and the minorversion field has a value of zero, an error of
   NFS4ERR_OP_ILLEGAL, as described in the next paragraph, is returned
   to the client.  If an operation array contains an operation 2 and the
   minorversion field is non-zero and the server does not support the
   minor version, the server returns an error of
   NFS4ERR_MINOR_VERS_MISMATCH.  Therefore, the
   NFS4ERR_MINOR_VERS_MISMATCH error takes precedence over all other
   errors.

   It is possible that the server receives a request that contains an
   operation that is less than the first legal operation (OP_ACCESS) or
   greater than the last legal operation (OP_RELEASE_LOCKOWNER).  In
   this case, the server's response will encode the opcode OP_ILLEGAL
   rather than the illegal opcode of the request.  The status field in
   the ILLEGAL return results will be set to NFS4ERR_OP_ILLEGAL.  The
   COMPOUND procedure's return results will also be NFS4ERR_OP_ILLEGAL.

   The definition of the "tag" in the request is left to the
   implementor.  It may be used to summarize the content of the Compound
   request for the benefit of packet-sniffers and engineers debugging
   implementations.  However, the value of "tag" in the response SHOULD
   be the same value as provided in the request.  This applies to the
   tag field of the CB_COMPOUND procedure as well.

16.2.3.1.  Current Filehandle and Stateid

   The COMPOUND procedure offers a simple environment for the execution
   of the operations specified by the client.  The first two relate to
   the filehandle while the second two relate to the current stateid.

16.2.3.1.1.  Current Filehandle

   The current and saved filehandles are used throughout the protocol.
   Most operations implicitly use the current filehandle as an argument,
   and many set the current filehandle as part of the results.
   The combination of client-specified sequences of operations and
   current and saved filehandle arguments and results allows for greater
   protocol flexibility.  The best or easiest example of current
   filehandle usage is a sequence like the following:

       PUTFH fh1              {fh1}
       LOOKUP "compA"         {fh2}
       GETATTR                {fh2}
       LOOKUP "compB"         {fh3}
       GETATTR                {fh3}
       LOOKUP "compC"         {fh4}
       GETATTR                {fh4}
       GETFH

                                 Figure 2

   In this example, the PUTFH (Section 18.19) operation explicitly sets
   the current filehandle value while the result of each LOOKUP
   operation sets the current filehandle value to the resultant file
   system object.  Also, the client is able to insert GETATTR operations
   using the current filehandle as an argument.

   The PUTROOTFH (Section 18.21) and PUTPUBFH (Section 18.20) operations
   also set the current filehandle.  The above example would replace
   "PUTFH fh1" with PUTROOTFH or PUTPUBFH with no filehandle argument in
   order to achieve the same effect (on the assumption that "compA" is
   directly below the root of the namespace).

   Along with the current filehandle, there is a saved filehandle.
   While the current filehandle is set as the result of operations like
   LOOKUP, the saved filehandle must be set directly with the use of the
   SAVEFH operation.  The SAVEFH operation copies the current filehandle
   value to the saved value.  The saved filehandle value is used in
   combination with the current filehandle value for the LINK and RENAME
   operations.  The RESTOREFH operation will copy the saved filehandle
   value to the current filehandle value; as a result, the saved
   filehandle value may be used as a sort of "scratch" area for the
   client's series of operations.

16.2.3.1.2.  Current Stateid

   With NFSv4.1, additions of a current stateid and a saved stateid have
   been made to the COMPOUND processing environment; this allows for the
   passing of stateids between operations.  There are no changes to the
   syntax of the protocol, only changes to the semantics of a few
   operations.

   A "current stateid" is the stateid that is associated with the
   current filehandle.  The current stateid may only be changed by an
   operation that modifies the current filehandle or returns a stateid.
   If an operation returns a stateid, it MUST set the current stateid to
   the returned value.  If an operation sets the current filehandle but
   does not return a stateid, the current stateid MUST be set to the
   all-zeros special stateid, i.e., (seqid, other) = (0, 0).  If an
   operation uses a stateid as an argument but does not return a
   stateid, the current stateid MUST NOT be changed.  For example,
   PUTFH, PUTROOTFH, and PUTPUBFH will change the current server state
   from {ocfh, (osid)} to {cfh, (0, 0)}, while LOCK will change the
   current state from {cfh, (osid)} to {cfh, (nsid)}.  Operations like
   LOOKUP that transform a current filehandle and component name into a
   new current filehandle will also change the current state to {0, 0}.
   The SAVEFH and RESTOREFH operations will save and restore both the
   current filehandle and the current stateid as a set.

   The following example is the common case of a simple READ operation
   with a normal stateid showing that the PUTFH initializes the current
   stateid to (0, 0).  The subsequent READ with stateid (sid1) leaves
   the current stateid unchanged.

       PUTFH fh1                -               ->  {fh1, (0, 0)}
       READ (sid1), 0, 1024     {fh1, (0, 0)}   ->  {fh1, (0, 0)}

                                 Figure 3

   This next example performs an OPEN with the root filehandle and, as a
   result, generates stateid (sid1).
   The next operation specifies the READ with the argument stateid set
   such that (seqid, other) are equal to (1, 0), but the current stateid
   set by the previous operation is actually used when the operation is
   evaluated.  This allows correct interaction with any existing,
   potentially conflicting, locks.

       PUTROOTFH                -               ->  {fh1, (0, 0)}
       OPEN "compA"             {fh1, (0, 0)}   ->  {fh2, (sid1)}
       READ (1, 0), 0, 1024     {fh2, (sid1)}   ->  {fh2, (sid1)}
       CLOSE (1, 0)             {fh2, (sid1)}   ->  {fh2, (sid2)}

                                 Figure 4

   This next example is similar to the second in how it passes the
   stateid sid2 generated by the LOCK operation to the next READ
   operation.  This allows the client to explicitly surround a single
   I/O operation with a lock and its appropriate stateid to guarantee
   correctness with other client locks.  The example also shows how
   SAVEFH and RESTOREFH can save and later reuse a filehandle and
   stateid, passing them as the current filehandle and stateid to a READ
   operation.

       PUTFH fh1                -               ->  {fh1, (0, 0)}
       LOCK 0, 1024, (sid1)     {fh1, (sid1)}   ->  {fh1, (sid2)}
       READ (1, 0), 0, 1024     {fh1, (sid2)}   ->  {fh1, (sid2)}
       LOCKU 0, 1024, (1, 0)    {fh1, (sid2)}   ->  {fh1, (sid3)}
       SAVEFH                   {fh1, (sid3)}   ->  {fh1, (sid3)}

       PUTFH fh2                {fh1, (sid3)}   ->  {fh2, (0, 0)}
       WRITE (1, 0), 0, 1024    {fh2, (0, 0)}   ->  {fh2, (0, 0)}

       RESTOREFH                {fh2, (0, 0)}   ->  {fh1, (sid3)}
       READ (1, 0), 1024, 1024  {fh1, (sid3)}   ->  {fh1, (sid3)}

                                 Figure 5

   The final example shows a disallowed use of the current stateid.  The
   client is attempting to implicitly pass an anonymous special stateid,
   (0, 0), to the READ operation.  The server MUST return
   NFS4ERR_BAD_STATEID in the reply to the READ operation.

       PUTFH fh1                -               ->  {fh1, (0, 0)}
       READ (1, 0), 0, 1024     {fh1, (0, 0)}   ->  NFS4ERR_BAD_STATEID

                                 Figure 6

16.2.4.  ERRORS

   COMPOUND will of course return every error that each operation on the
   fore channel can return (see Table 6).  However, if COMPOUND returns
   zero operations, obviously the error returned by COMPOUND has nothing
   to do with an error returned by an operation.  The list of errors
   COMPOUND will return if it processes zero operations includes:

                           COMPOUND Error Returns

   +------------------------------+------------------------------------+
   | Error                        | Notes                              |
   +------------------------------+------------------------------------+
   | NFS4ERR_BADCHAR              | The tag argument has a character   |
   |                              | the replier does not support.      |
   | NFS4ERR_BADXDR               |                                    |
   | NFS4ERR_DELAY                |                                    |
   | NFS4ERR_INVAL                | The tag argument is not in UTF-8   |
   |                              | encoding.                          |
   | NFS4ERR_MINOR_VERS_MISMATCH  |                                    |
   | NFS4ERR_SERVERFAULT          |                                    |
   | NFS4ERR_TOO_MANY_OPS         |                                    |
   | NFS4ERR_REP_TOO_BIG          |                                    |
   | NFS4ERR_REP_TOO_BIG_TO_CACHE |                                    |
   | NFS4ERR_REQ_TOO_BIG          |                                    |
   +------------------------------+------------------------------------+

                                  Table 9

17.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL

   The following tables summarize the operations of the NFSv4.1 protocol
   and the corresponding designation of REQUIRED, RECOMMENDED, and
   OPTIONAL to implement or MUST NOT implement.  The designation of MUST
   NOT implement is reserved for those operations that were defined in
   NFSv4.0 and MUST NOT be implemented in NFSv4.1.

   For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation
   for operations sent by the client is for the server implementation.
   The client is generally required to implement the operations needed
   for the operating environment for which it serves.  For example, a
   read-only NFSv4.1 client would have no need to implement the WRITE
   operation and is not required to do so.
   The REQUIRED or OPTIONAL designation for callback operations sent by
   the server is for both the client and server.  Generally, the client
   has the option of creating the backchannel and sending the operations
   on the fore channel that will be a catalyst for the server sending
   callback operations.  A partial exception is CB_RECALL_SLOT; the only
   way the client can avoid supporting this operation is by not creating
   a backchannel.

   Since this is a summary of the operations and their designation,
   there are subtleties that are not presented here.  Therefore, if
   there is a question of the requirements of implementation, the
   operation descriptions themselves must be consulted along with other
   relevant explanatory text within this specification.

   The abbreviations used in the second and third columns of the table
   are defined as follows.

   REQ  REQUIRED to implement

   REC  RECOMMENDED to implement

   OPT  OPTIONAL to implement

   MNI  MUST NOT implement

   For the NFSv4.1 features that are OPTIONAL, the operations that
   support those features are OPTIONAL, and the server would return
   NFS4ERR_NOTSUPP in response to the client's use of those operations.
   If an OPTIONAL feature is supported, it is possible that a set of
   operations related to the feature become REQUIRED to implement.  The
   third column of the table designates the feature(s) and if the
   operation is REQUIRED or OPTIONAL in the presence of support for the
   feature.
   The OPTIONAL features identified and their abbreviations are as
   follows:

   pNFS  Parallel NFS

   FDELG  File Delegations

   DDELG  Directory Delegations

                                Operations

   +----------------------+------------+--------------+----------------+
   | Operation            | REQ, REC,  | Feature      | Definition     |
   |                      | OPT, or    | (REQ, REC,   |                |
   |                      | MNI        | or OPT)      |                |
   +----------------------+------------+--------------+----------------+
   | ACCESS               | REQ        |              | Section 18.1   |
   | BACKCHANNEL_CTL      | REQ        |              | Section 18.33  |
   | BIND_CONN_TO_SESSION | REQ        |              | Section 18.34  |
   | CLOSE                | REQ        |              | Section 18.2   |
   | COMMIT               | REQ        |              | Section 18.3   |
   | CREATE               | REQ        |              | Section 18.4   |
   | CREATE_SESSION       | REQ        |              | Section 18.36  |
   | DELEGPURGE           | OPT        | FDELG (REQ)  | Section 18.5   |
   | DELEGRETURN          | OPT        | FDELG,       | Section 18.6   |
   |                      |            | DDELG, pNFS  |                |
   |                      |            | (REQ)        |                |
   | DESTROY_CLIENTID     | REQ        |              | Section 18.50  |
   | DESTROY_SESSION      | REQ        |              | Section 18.37  |
   | EXCHANGE_ID          | REQ        |              | Section 18.35  |
   | FREE_STATEID         | REQ        |              | Section 18.38  |
   | GETATTR              | REQ        |              | Section 18.7   |
   | GETDEVICEINFO        | OPT        | pNFS (REQ)   | Section 18.40  |
   | GETDEVICELIST        | OPT        | pNFS (OPT)   | Section 18.41  |
   | GETFH                | REQ        |              | Section 18.8   |
   | GET_DIR_DELEGATION   | OPT        | DDELG (REQ)  | Section 18.39  |
   | LAYOUTCOMMIT         | OPT        | pNFS (REQ)   | Section 18.42  |
   | LAYOUTGET            | OPT        | pNFS (REQ)   | Section 18.43  |
   | LAYOUTRETURN         | OPT        | pNFS (REQ)   | Section 18.44  |
   | LINK                 | OPT        |              | Section 18.9   |
   | LOCK                 | REQ        |              | Section 18.10  |
   | LOCKT                | REQ        |              | Section 18.11  |
   | LOCKU                | REQ        |              | Section 18.12  |
   | LOOKUP               | REQ        |              | Section 18.13  |
   | LOOKUPP              | REQ        |              | Section 18.14  |
   | NVERIFY              | REQ        |              | Section 18.15  |
   | OPEN                 | REQ        |              | Section 18.16  |
   | OPENATTR             | OPT        |              | Section 18.17  |
   | OPEN_CONFIRM         | MNI        |              | N/A            |
   | OPEN_DOWNGRADE       | REQ        |              | Section 18.18  |
   | PUTFH                | REQ        |              | Section 18.19  |
   | PUTPUBFH             | REQ        |              | Section 18.20  |
   | PUTROOTFH            | REQ        |              | Section 18.21  |
   | READ                 | REQ        |              | Section 18.22  |
   | READDIR              | REQ        |              | Section 18.23  |
   | READLINK             | OPT        |              | Section 18.24  |
   | RECLAIM_COMPLETE     | REQ        |              | Section 18.51  |
   | RELEASE_LOCKOWNER    | MNI        |              | N/A            |
   | REMOVE               | REQ        |              | Section 18.25  |
   | RENAME               | REQ        |              | Section 18.26  |
   | RENEW                | MNI        |              | N/A            |
   | RESTOREFH            | REQ        |              | Section 18.27  |
   | SAVEFH               | REQ        |              | Section 18.28  |
   | SECINFO              | REQ        |              | Section 18.29  |
   | SECINFO_NO_NAME      | REC        | pNFS file    | Section 18.45, |
   |                      |            | layout (REQ) | Section 13.12  |
   | SEQUENCE             | REQ        |              | Section 18.46  |
   | SETATTR              | REQ        |              | Section 18.30  |
   | SETCLIENTID          | MNI        |              | N/A            |
   | SETCLIENTID_CONFIRM  | MNI        |              | N/A            |
   | SET_SSV              | REQ        |              | Section 18.47  |
   | TEST_STATEID         | REQ        |              | Section 18.48  |
   | VERIFY               | REQ        |              | Section 18.31  |
   | WANT_DELEGATION      | OPT        | FDELG (OPT)  | Section 18.49  |
   | WRITE                | REQ        |              | Section 18.32  |
   +----------------------+------------+--------------+----------------+

                            Callback Operations

   +-------------------------+------------+---------------+------------+
   | Operation               | REQ, REC,  | Feature (REQ, | Definition |
   |                         | OPT, or    | REC, or OPT)  |            |
   |                         | MNI        |               |            |
   +-------------------------+------------+---------------+------------+
   | CB_GETATTR              | OPT        | FDELG (REQ)   | Section    |
   |                         |            |               | 20.1       |
   | CB_LAYOUTRECALL         | OPT        | pNFS (REQ)    | Section    |
   |                         |            |               | 20.3       |
   | CB_NOTIFY               | OPT        | DDELG (REQ)   | Section    |
   |                         |            |               | 20.4       |
   | CB_NOTIFY_DEVICEID      | OPT        | pNFS (OPT)    | Section    |
   |                         |            |               | 20.12      |
   | CB_NOTIFY_LOCK          | OPT        |               | Section    |
   |                         |            |               | 20.11      |
   | CB_PUSH_DELEG           | OPT        | FDELG (OPT)   | Section    |
   |                         |            |               | 20.5       |
   | CB_RECALL               | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.2       |
   | CB_RECALL_ANY           | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.6       |
   | CB_RECALL_SLOT          | REQ        |               | Section    |
   |                         |            |               | 20.8       |
   | CB_RECALLABLE_OBJ_AVAIL | OPT        | DDELG, pNFS   | Section    |
   |                         |            | (REQ)         | 20.7       |
   | CB_SEQUENCE             | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.9       |
   | CB_WANTS_CANCELLED      | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.10      |
   +-------------------------+------------+---------------+------------+

18.  NFSv4.1 Operations

18.1.  Operation 3: ACCESS - Check Access Rights

18.1.1.  ARGUMENTS

   const ACCESS4_READ      = 0x00000001;
   const ACCESS4_LOOKUP    = 0x00000002;
   const ACCESS4_MODIFY    = 0x00000004;
   const ACCESS4_EXTEND    = 0x00000008;
   const ACCESS4_DELETE    = 0x00000010;
   const ACCESS4_EXECUTE   = 0x00000020;

   struct ACCESS4args {
           /* CURRENT_FH: object */
           uint32_t        access;
   };

18.1.2.  RESULTS

   struct ACCESS4resok {
           uint32_t        supported;
           uint32_t        access;
   };

   union ACCESS4res switch (nfsstat4 status) {
    case NFS4_OK:
            ACCESS4resok   resok4;
    default:
            void;
   };

18.1.3.  DESCRIPTION

   ACCESS determines the access rights that a user, as identified by the
   credentials in the RPC request, has with respect to the file system
   object specified by the current filehandle.  The client encodes the
   set of access rights that are to be checked in the bit mask "access".
   The server checks the permissions encoded in the bit mask.  If a
   status of NFS4_OK is returned, two bit masks are included in the
   response.  The first, "supported", represents the access rights for
   which the server can verify reliably.  The second, "access",
   represents the access rights available to the user for the filehandle
   provided.  On success, the current filehandle retains its value.
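   The bitmask exchange can be illustrated with a non-normative sketch
   (not part of the protocol).  The "can_check" and "granted" inputs are
   hypothetical stand-ins for a server's real permission evaluation: the
   reply's supported field is limited to requested bits the server can
   verify reliably, and the access field is further limited to bits
   actually granted.

   ```python
   # Non-normative sketch of the ACCESS bitmask exchange; can_check
   # and granted are hypothetical stand-ins for server-side policy.

   ACCESS4_READ    = 0x00000001
   ACCESS4_LOOKUP  = 0x00000002
   ACCESS4_MODIFY  = 0x00000004
   ACCESS4_EXTEND  = 0x00000008
   ACCESS4_DELETE  = 0x00000010
   ACCESS4_EXECUTE = 0x00000020

   def access_reply(requested, can_check, granted):
       # supported: requested rights the server can verify reliably;
       # never more bits than were requested.
       supported = requested & can_check
       # access: rights actually available to the user; never more
       # bits than appear in supported.
       access = supported & granted
       return supported, access
   ```

   For instance, a request for READ|EXECUTE against a server that can
   check both bits but grants only read access yields supported =
   READ|EXECUTE and access = READ.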
   Note that the reply's supported and access fields MUST NOT contain
   more values than originally set in the request's access field.  For
   example, if the client sends an ACCESS operation with just the
   ACCESS4_READ value set and the server supports this value, the server
   MUST NOT set more than ACCESS4_READ in the supported field even if it
   could have reliably checked other values.

   The reply's access field MUST NOT contain more values than the
   supported field.

   The results of this operation are necessarily advisory in nature.  A
   return status of NFS4_OK and the appropriate bit set in the bit mask
   do not imply that such access will be allowed to the file system
   object in the future.  This is because access rights can be revoked
   by the server at any time.

   The following access permissions may be requested:

   ACCESS4_READ  Read data from file or read a directory.

   ACCESS4_LOOKUP  Look up a name in a directory (no meaning for non-
      directory objects).

   ACCESS4_MODIFY  Rewrite existing file data or modify existing
      directory entries.

   ACCESS4_EXTEND  Write new data or add directory entries.

   ACCESS4_DELETE  Delete an existing directory entry.

   ACCESS4_EXECUTE  Execute a regular file (no meaning for a directory).

   On success, the current filehandle retains its value.

   ACCESS4_EXECUTE is a challenging semantic to implement because NFS
   provides remote file access, not remote execution.  This leads to the
   following:

   o  Whether or not a regular file is executable ought to be the
      responsibility of the NFS client and not the server.  And yet the
      ACCESS operation is specified to seemingly require a server to own
      that responsibility.

   o  When a client executes a regular file, it has to read the file
      from the server.  Strictly speaking, the server should not allow
      the client to read a file being executed unless the user has read
      permissions on the file.  Requiring explicit read permissions on
      executable files in order to access them over NFS is not going to
      be acceptable to some users and storage administrators.
      Historically, NFS servers have allowed a user to READ a file if
      the user has execute access to the file.

   As a practical example, the UNIX specification [55] states that an
   implementation claiming conformance to UNIX may indicate in the
   access() programming interface's result that a privileged user has
   execute rights, even if no execute permission bits are set on the
   regular file's attributes.  It is possible to claim conformance to
   the UNIX specification and instead not indicate execute rights in
   that situation, which is true for some operating environments.
   Suppose the operating environments of the client and server are
   implementing the access() semantics for privileged users differently,
   and the ACCESS operation implementations of the client and server
   follow their respective access() semantics.  This can cause undesired
   behavior:

   o  Suppose the client's access() interface returns X_OK if the user
      is privileged and no execute permission bits are set on the
      regular file's attribute, and the server's access() interface does
      not return X_OK in that situation.  Then the client will be unable
      to execute files stored on the NFS server that could be executed
      if stored on a non-NFS file system.

   o  Suppose the client's access() interface does not return X_OK if
      the user is privileged, and no execute permission bits are set on
      the regular file's attribute, and the server's access() interface
      does return X_OK in that situation.  Then:

      *  The client will be able to execute files stored on the NFS
         server that could be executed if stored on a non-NFS file
         system, unless the client's execution subsystem also checks for
         execute permission bits.

      *  Even if the execution subsystem is checking for execute
         permission bits, there are more potential issues.  For example,
         suppose the client is invoking access() to build a "path search
         table" of all executable files in the user's "search path",
         where the path is a list of directories each containing
         executable files.  Suppose there are two files each in separate
         directories of the search path, such that files have the same
         component name.  In the first directory the file has no execute
         permission bits set, and in the second directory the file has
         execute bits set.  The path search table will indicate that the
         first directory has the executable file, but the execute
         subsystem will fail to execute it.  The command shell might
         fail to try the second file in the second directory.  And even
         if it did, this is a potential performance issue.  Clearly, the
         desired outcome for the client is for the path search table to
         not contain the first file.

   To deal with the problems described above, the "smart client, stupid
   server" principle is used.  The client owns overall responsibility
   for determining execute access and relies on the server to parse the
   execution permissions within the file's mode, acl, and dacl
   attributes.  The rules for the client and server follow:

   o  If the client is sending ACCESS in order to determine if the user
      can read the file, the client SHOULD set ACCESS4_READ in the
      request's access field.
o  If the client's operating environment only grants execution to the
   user if the user has execute access according to the execute
   permissions in the mode, acl, and dacl attributes, then if the
   client wants to determine execute access, the client SHOULD send
   an ACCESS request with the ACCESS4_EXECUTE bit set in the
   request's access field.

o  If the client's operating environment grants execution to the user
   even if the user does not have execute access according to the
   execute permissions in the mode, acl, and dacl attributes, then if
   the client wants to determine execute access, it SHOULD send an
   ACCESS request with both the ACCESS4_EXECUTE and ACCESS4_READ bits
   set in the request's access field.  This way, if any read or
   execute permission grants the user read or execute access (or if
   the server interprets the user as privileged), as indicated by the
   presence of ACCESS4_EXECUTE and/or ACCESS4_READ in the reply's
   access field, the client will be able to grant the user execute
   access to the file.

o  If the server supports execute permission bits, or some other
   method for denoting executability (e.g., the suffix of the name of
   the file might indicate execute), it MUST check only execute
   permissions, not read permissions, when determining whether or not
   the reply will have ACCESS4_EXECUTE set in the access field.  The
   server MUST NOT also examine read permission bits when determining
   whether or not the reply will have ACCESS4_EXECUTE set in the
   access field.  Even if the server's operating environment would
   grant execute access to the user (e.g., the user is privileged),
   the server MUST NOT reply with ACCESS4_EXECUTE set in the reply's
   access field unless there is at least one execute permission bit
   set in the mode, acl, or dacl attributes.
In the case of acl and dacl, the "one execute permission bit" MUST be
an ACE4_EXECUTE bit set in an ALLOW ACE.

o  If the server does not support execute permission bits or some
   other method for denoting executability, it MUST NOT set
   ACCESS4_EXECUTE in the reply's supported and access fields.  If
   the client set ACCESS4_EXECUTE in the ACCESS request's access
   field, and ACCESS4_EXECUTE is not set in the reply's supported
   field, then the client will have to send an ACCESS request with
   the ACCESS4_READ bit set in the request's access field.

o  If the server supports read permission bits, it MUST only check
   for read permissions in the mode, acl, and dacl attributes when it
   receives an ACCESS request with ACCESS4_READ set in the access
   field.  The server MUST NOT also examine execute permission bits
   when determining whether or not the reply will have ACCESS4_READ
   set in the access field.

Note that if the ACCESS reply has ACCESS4_READ or ACCESS4_EXECUTE
set, then the user also has permissions to OPEN (Section 18.16) or
READ (Section 18.22) the file.  In other words, if the client sends
an ACCESS request with the ACCESS4_READ and ACCESS4_EXECUTE bits set
in the access field (or two separate requests, one with ACCESS4_READ
set and the other with ACCESS4_EXECUTE set), and the reply has just
ACCESS4_EXECUTE set in the access field (or just the one reply has
ACCESS4_EXECUTE set), then the user has authorization to OPEN or READ
the file.

18.1.4.  IMPLEMENTATION

In general, it is not sufficient for the client to attempt to deduce
access permissions by inspecting the uid, gid, and mode fields in the
file attributes or by attempting to interpret the contents of the ACL
attribute.  This is because the server may perform uid or gid mapping
or enforce additional access-control restrictions.
It is also possible that the server may not be in the same ID space
as the client.  In these cases (and perhaps others), the client
cannot reliably perform an access check with only current file
attributes.

In the NFSv2 protocol, the only reliable way to determine whether an
operation was allowed was to try it and see if it succeeded or
failed.  Using the ACCESS operation in the NFSv4.1 protocol, the
client can ask the server to indicate whether or not one or more
classes of operations are permitted.  The ACCESS operation is
provided to allow clients to check before doing a series of
operations that will result in an access failure.  The OPEN operation
provides a point where the server can verify access to the file
object and a method to return that information to the client.  The
ACCESS operation is still useful for directory operations or for use
in the case that the UNIX interface access() is used on the client.

The information returned by the server in response to an ACCESS call
is not permanent.  It was correct at the exact time that the server
performed the checks, but not necessarily afterwards.  The server can
revoke access permission at any time.

The client should use the effective credentials of the user to build
the authentication information in the ACCESS request used to
determine access rights.  It is the effective user and group
credentials that are used in subsequent READ and WRITE operations.

Many implementations do not directly support the ACCESS4_DELETE
permission.  Operating systems like UNIX will ignore the
ACCESS4_DELETE bit if set on an access request on a non-directory
object.  In these systems, delete permission on a file is determined
by the access permissions on the directory in which the file resides,
instead of being determined by the permissions of the file itself.
Therefore, the mask returned enumerating which access rights can be
determined will have the ACCESS4_DELETE value set to 0.  This
indicates to the client that the server was unable to check that
particular access right.  The ACCESS4_DELETE bit in the access mask
returned will then be ignored by the client.

18.2.  Operation 4: CLOSE - Close File

18.2.1.  ARGUMENTS

   struct CLOSE4args {
           /* CURRENT_FH: object */
           seqid4          seqid;
           stateid4        open_stateid;
   };

18.2.2.  RESULTS

   union CLOSE4res switch (nfsstat4 status) {
    case NFS4_OK:
           stateid4        open_stateid;
    default:
           void;
   };

18.2.3.  DESCRIPTION

The CLOSE operation releases share reservations for the regular or
named attribute file as specified by the current filehandle.  The
share reservations and other state information released at the server
as a result of this CLOSE are only those associated with the supplied
stateid.  State associated with other OPENs is not affected.

If byte-range locks are held, the client SHOULD release all locks
before sending a CLOSE.  The server MAY free all outstanding locks on
CLOSE, but some servers may not support the CLOSE of a file that
still has byte-range locks held.  The server MUST return failure if
any locks would exist after the CLOSE.

The argument seqid MAY have any value, and the server MUST ignore
seqid.

On success, the current filehandle retains its value.

The server MAY require that the combination of principal, security
flavor, and, if applicable, GSS mechanism that sent the OPEN request
also be the one to CLOSE the file.  This might not be possible if
credentials for the principal are no longer available.  The server
MAY allow the machine credential or SSV credential (see
Section 18.35) to send CLOSE.

18.2.4.
IMPLEMENTATION

Even though CLOSE returns a stateid, this stateid is not useful to
the client and should be treated as deprecated.  CLOSE "shuts down"
the state associated with all OPENs for the file by a single
open-owner.  As noted above, CLOSE will either release all
file-locking state or return an error.  Therefore, the stateid
returned by CLOSE is not useful for operations that follow.  To help
find any uses of this stateid by clients, the server SHOULD return
the invalid special stateid (the "other" value is zero and the
"seqid" field is NFS4_UINT32_MAX; see Section 8.2.3).

A CLOSE operation may make delegations grantable where they were not
previously.  Servers may choose to respond immediately if there are
pending delegation want requests or may respond to the situation at a
later time.

18.3.  Operation 5: COMMIT - Commit Cached Data

18.3.1.  ARGUMENTS

   struct COMMIT4args {
           /* CURRENT_FH: file */
           offset4         offset;
           count4          count;
   };

18.3.2.  RESULTS

   struct COMMIT4resok {
           verifier4       writeverf;
   };

   union COMMIT4res switch (nfsstat4 status) {
    case NFS4_OK:
           COMMIT4resok   resok4;
    default:
           void;
   };

18.3.3.  DESCRIPTION

The COMMIT operation forces or flushes uncommitted, modified data to
stable storage for the file specified by the current filehandle.  The
flushed data is that which was previously written with one or more
WRITE operations that had the "committed" field of their results
field set to UNSTABLE4.

The offset specifies the position within the file where the flush is
to begin.  An offset value of zero means to flush data starting at
the beginning of the file.  The count specifies the number of bytes
of data to flush.  If the count is zero, a flush from the offset to
the end of the file is done.
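The offset/count convention just described can be sketched as
follows.  This is an illustrative helper only; the function name is
not from the specification, and clamping a non-zero count to the file
size is an assumed implementation choice, not a protocol requirement.

```python
def commit_flush_range(offset, count, file_size):
    """Translate COMMIT's offset/count into the byte range to flush.

    Per the description above, a count of zero means "flush from
    offset through the end of the file", so offset == 0 together
    with count == 0 covers the whole file.
    """
    if count == 0:
        # Flush from offset to end of file.
        return (offset, file_size)
    # Flush exactly the requested range (clamped here as an
    # implementation choice assumed for this sketch).
    return (offset, min(offset + count, file_size))
```

A full-file COMMIT (offset zero, count zero) thus maps to the range
(0, file_size), matching the fsync()-equivalent case discussed in the
IMPLEMENTATION section below.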
The server returns a write verifier upon successful completion of the
COMMIT.  The write verifier is used by the client to determine if the
server has restarted between the initial WRITE operations and the
COMMIT.  The client does this by comparing the write verifier
returned from the initial WRITE operations and the verifier returned
by the COMMIT operation.  The server must vary the value of the write
verifier at each server event or instantiation that may lead to a
loss of uncommitted data.  Most commonly this occurs when the server
is restarted; however, other events at the server may result in
uncommitted data loss as well.

On success, the current filehandle retains its value.

18.3.4.  IMPLEMENTATION

The COMMIT operation is similar in operation and semantics to the
POSIX fsync() [22] system interface that synchronizes a file's state
with the disk (file data and metadata is flushed to disk or stable
storage).  COMMIT performs the same operation for a client, flushing
any unsynchronized data and metadata on the server to the server's
disk or stable storage for the specified file.  Like fsync(), it may
be that there is some modified data or no modified data to
synchronize.  The data may have been synchronized by the server's
normal periodic buffer synchronization activity.  COMMIT should
return NFS4_OK, unless there has been an unexpected error.

COMMIT differs from fsync() in that it is possible for the client to
flush a range of the file (most likely triggered by a
buffer-reclamation scheme on the client before the file has been
completely written).

The server implementation of COMMIT is reasonably simple.  If the
server receives a full file COMMIT request, that is, starting at
offset zero and count zero, it should do the equivalent of applying
fsync() to the entire file.
Otherwise, it should arrange to have the modified data in the range
specified by offset and count flushed to stable storage.  In both
cases, any metadata associated with the file must be flushed to
stable storage before returning.  It is not an error for there to be
nothing to flush on the server.  This means that the data and
metadata that needed to be flushed have already been flushed or lost
during the last server failure.

The client implementation of COMMIT is a little more complex.  There
are two reasons for wanting to commit a client buffer to stable
storage.  The first is that the client wants to reuse a buffer.  In
this case, the offset and count of the buffer are sent to the server
in the COMMIT request.  The server then flushes any modified data
based on the offset and count, and flushes any modified metadata
associated with the file.  It then returns the status of the flush
and the write verifier.  The second reason for the client to generate
a COMMIT is for a full file flush, such as may be done at close.  In
this case, the client would gather all of the buffers for this file
that contain uncommitted data, do the COMMIT operation with an offset
of zero and count of zero, and then free all of those buffers.  Any
other dirty buffers would be sent to the server in the normal
fashion.

After a buffer is written (via the WRITE operation) by the client
with the "committed" field in the result of WRITE set to UNSTABLE4,
the buffer must be considered as modified by the client until the
buffer has either been flushed via a COMMIT operation or written via
a WRITE operation with the "committed" field in the result set to
FILE_SYNC4 or DATA_SYNC4.  This is done to prevent the buffer from
being freed and reused before the data can be flushed to stable
storage on the server.
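The buffer-state rule above, combined with the write-verifier
comparison described earlier, might be tracked on the client with
bookkeeping along these lines.  This is a hypothetical sketch; the
class and method names are not part of the protocol.

```python
class UncommittedCache:
    """Track client buffers written with committed == UNSTABLE4.

    A buffer stays dirty until a COMMIT reply (or a WRITE returning
    FILE_SYNC4/DATA_SYNC4) confirms it, and every reply's write
    verifier is compared against the one last seen; a change means
    the server may have lost uncommitted data.
    """

    def __init__(self):
        self.verifier = None
        self.dirty = {}                      # offset -> bytes

    def note_unstable_write(self, offset, data, verifier):
        """Record an UNSTABLE4 WRITE reply; return True if the
        verifier changed, i.e., all dirty buffers must be resent."""
        changed = self.verifier is not None and verifier != self.verifier
        self.verifier = verifier
        self.dirty[offset] = data
        return changed

    def note_commit_reply(self, verifier):
        """Process a COMMIT reply.  On a matching verifier the dirty
        buffers are safe to free (return True); otherwise the caller
        must retransmit everything in self.dirty (return False)."""
        if verifier == self.verifier:
            self.dirty.clear()
            return True
        self.verifier = verifier
        return False
```

On a verifier mismatch the caller would retransmit the buffers, for
example as UNSTABLE4 WRITEs followed by another COMMIT, as described
in the paragraph that follows.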
When a response is returned from either a WRITE or a COMMIT operation
and it contains a write verifier that differs from that previously
returned by the server, the client will need to retransmit all of the
buffers containing uncommitted data to the server.  How this is to be
done is up to the implementor.  If there is only one buffer of
interest, then it should be sent in a WRITE request with the
FILE_SYNC4 stable parameter.  If there is more than one buffer, it
might be worthwhile retransmitting all of the buffers in WRITE
operations with the stable parameter set to UNSTABLE4 and then
retransmitting the COMMIT operation to flush all of the data on the
server to stable storage.  However, if the server repeatedly returns
from COMMIT a verifier that differs from that returned by WRITE, the
only way to ensure progress is to retransmit all of the buffers with
WRITE requests with the FILE_SYNC4 stable parameter.

The above description applies to page-cache-based systems as well as
buffer-cache-based systems.  In the former systems, the virtual
memory system will need to be modified instead of the buffer cache.

18.4.  Operation 6: CREATE - Create a Non-Regular File Object

18.4.1.  ARGUMENTS

   union createtype4 switch (nfs_ftype4 type) {
    case NF4LNK:
           linktext4 linkdata;
    case NF4BLK:
    case NF4CHR:
           specdata4 devdata;
    case NF4SOCK:
    case NF4FIFO:
    case NF4DIR:
           void;
    default:
           void;  /* server should return NFS4ERR_BADTYPE */
   };

   struct CREATE4args {
           /* CURRENT_FH: directory for creation */
           createtype4     objtype;
           component4      objname;
           fattr4          createattrs;
   };

18.4.2.
RESULTS

   struct CREATE4resok {
           change_info4    cinfo;
           bitmap4         attrset;        /* attributes set */
   };

   union CREATE4res switch (nfsstat4 status) {
    case NFS4_OK:
           /* new CURRENTFH: created object */
           CREATE4resok resok4;
    default:
           void;
   };

18.4.3.  DESCRIPTION

The CREATE operation creates a file object other than an ordinary
file in a directory with a given name.  The OPEN operation MUST be
used to create a regular file or a named attribute.

The current filehandle must be a directory: an object of type NF4DIR.
If the current filehandle is an attribute directory (type
NF4ATTRDIR), the error NFS4ERR_WRONG_TYPE is returned.  If the
current filehandle designates any other type of object, the error
NFS4ERR_NOTDIR results.

The objname specifies the name for the new object.  The objtype
determines the type of object to be created: directory, symlink, etc.
If the object type specified is that of an ordinary file, a named
attribute, or a named attribute directory, the error NFS4ERR_BADTYPE
results.

If an object of the same name already exists in the directory, the
server will return the error NFS4ERR_EXIST.

For the directory where the new file object was created, the server
returns change_info4 information in cinfo.  With the atomic field of
the change_info4 data type, the server will indicate if the before
and after change attributes were obtained atomically with respect to
the file object creation.

If the objname has a length of zero, or if objname does not obey the
UTF-8 definition, the error NFS4ERR_INVAL will be returned.

The current filehandle is replaced by that of the new object.

The createattrs specifies the initial set of attributes for the
object.  The set of attributes may include any writable attribute
valid for the object type.
When the operation is successful, the server will return to the
client an attribute mask signifying which attributes were
successfully set for the object.

If createattrs includes neither the owner attribute nor an ACL with
an ACE for the owner, and if the server's file system both supports
and requires an owner attribute (or an owner ACE), then the server
MUST derive the owner (or the owner ACE).  This would typically be
from the principal indicated in the RPC credentials of the call, but
the server's operating environment or file system semantics may
dictate other methods of derivation.  Similarly, if createattrs
includes neither the group attribute nor a group ACE, and if the
server's file system both supports and requires the notion of a group
attribute (or group ACE), the server MUST derive the group attribute
(or the corresponding group ACE) for the file.  This could be from
the RPC call's credentials, such as the group principal if the
credentials include it (such as with AUTH_SYS), from the group
identifier associated with the principal in the credentials (e.g.,
POSIX systems have a user database [23] that has a group identifier
for every user identifier), inherited from the directory in which the
object is created, or whatever else the server's operating
environment or file system semantics dictate.  This applies to the
OPEN operation too.

Conversely, it is possible that the client will specify in
createattrs an owner attribute, group attribute, or ACL that the
principal indicated in the RPC call's credentials does not have
permissions to create files for.  The error to be returned in this
instance is NFS4ERR_PERM.  This applies to the OPEN operation too.
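The argument checks laid out in this DESCRIPTION can be sketched in a
few lines.  The type tags and the function itself are illustrative
stand-ins for the spec's XDR constants, not a normative algorithm,
and the checks are shown in one plausible order; a server may test
them in any order that yields a permitted error.

```python
# Hypothetical type tags mirroring a subset of nfs_ftype4.
NF4REG, NF4DIR, NF4ATTRDIR, NF4LNK, NF4SOCK = range(5)

def create_precheck(dir_type, objtype, objname, existing_names):
    """Sketch of the CREATE argument checks described above."""
    if dir_type == NF4ATTRDIR:
        return "NFS4ERR_WRONG_TYPE"     # attribute directory
    if dir_type != NF4DIR:
        return "NFS4ERR_NOTDIR"         # not a directory at all
    if objtype in (NF4REG, NF4ATTRDIR):
        return "NFS4ERR_BADTYPE"        # regular files use OPEN
    if len(objname) == 0:
        return "NFS4ERR_INVAL"          # zero-length name
    if objname in existing_names:
        return "NFS4ERR_EXIST"          # name already present
    return "NFS4_OK"
```

(The UTF-8 validity check on objname is omitted from the sketch for
brevity.)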
If the current filehandle designates a directory for which another
client holds a directory delegation, then, unless the delegation is
such that the situation can be resolved by sending a notification,
the delegation MUST be recalled, and the CREATE operation MUST NOT
proceed until the delegation is returned or revoked.  Except where
this happens very quickly, one or more NFS4ERR_DELAY errors will be
returned to requests made while the delegation remains outstanding.

When the current filehandle designates a directory for which one or
more directory delegations exist, then, when those delegations
request such notifications, NOTIFY4_ADD_ENTRY will be generated as a
result of this operation.

If the capability FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set
(Section 14.4), and a symbolic link is being created, then the
content of the symbolic link MUST be in UTF-8 encoding.

18.4.4.  IMPLEMENTATION

If the client desires to set attribute values after the create, a
SETATTR operation can be added to the COMPOUND request so that the
appropriate attributes will be set.

18.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery

18.5.1.  ARGUMENTS

   struct DELEGPURGE4args {
           clientid4       clientid;
   };

18.5.2.  RESULTS

   struct DELEGPURGE4res {
           nfsstat4        status;
   };

18.5.3.  DESCRIPTION

This operation purges all of the delegations awaiting recovery for a
given client.  This is useful for clients that do not commit
delegation information to stable storage, to indicate that
conflicting requests need not be delayed by the server awaiting
recovery of delegation information.

The client is NOT specified by the clientid field of the request.
The client SHOULD set the clientid field to zero, and the server MUST
ignore the clientid field.
Instead, the server MUST derive the client ID from the value of the
session ID in the arguments of the SEQUENCE operation that precedes
DELEGPURGE in the COMPOUND request.

The DELEGPURGE operation should be used by clients that record
delegation information on stable storage on the client.  In this
case, after the client recovers all delegations it knows of, it
should immediately send a DELEGPURGE operation.  Doing so will notify
the server that no additional delegations for the client will be
recovered, allowing it to free resources and avoid delaying other
clients that make requests conflicting with the unrecovered
delegations.  The set of delegations known to the server and the
client might be different.  The reason for this is that after sending
a request that resulted in a delegation, the client might experience
a failure before it both received the delegation and committed the
delegation to the client's stable storage.

The server MAY support DELEGPURGE, but if it does not, it MUST NOT
support CLAIM_DELEGATE_PREV and MUST NOT support CLAIM_DELEG_PREV_FH.

18.6.  Operation 8: DELEGRETURN - Return Delegation

18.6.1.  ARGUMENTS

   struct DELEGRETURN4args {
           /* CURRENT_FH: delegated object */
           stateid4        deleg_stateid;
   };

18.6.2.  RESULTS

   struct DELEGRETURN4res {
           nfsstat4        status;
   };

18.6.3.  DESCRIPTION

The DELEGRETURN operation returns the delegation represented by the
current filehandle and stateid.

Delegations may be returned voluntarily (i.e., before the server has
recalled them) or when recalled.  In either case, the client must
properly propagate state changed under the context of the delegation
to the server before returning the delegation.
The server MAY require that the combination of principal, security
flavor, and, if applicable, GSS mechanism that acquired the
delegation also be the one to send DELEGRETURN on the file.  This
might not be possible if credentials for the principal are no longer
available.  The server MAY allow the machine credential or SSV
credential (see Section 18.35) to send DELEGRETURN.

18.7.  Operation 9: GETATTR - Get Attributes

18.7.1.  ARGUMENTS

   struct GETATTR4args {
           /* CURRENT_FH: object */
           bitmap4         attr_request;
   };

18.7.2.  RESULTS

   struct GETATTR4resok {
           fattr4          obj_attributes;
   };

   union GETATTR4res switch (nfsstat4 status) {
    case NFS4_OK:
           GETATTR4resok  resok4;
    default:
           void;
   };

18.7.3.  DESCRIPTION

The GETATTR operation will obtain attributes for the file system
object specified by the current filehandle.  The client sets a bit in
the bitmap argument for each attribute value that it would like the
server to return.  The server returns an attribute bitmap that
indicates the attribute values that it was able to return, which will
include all attributes requested by the client that are attributes
supported by the server for the target file system.  This bitmap is
followed by the attribute values ordered lowest attribute number
first.

The server MUST return a value for each attribute that the client
requests if the attribute is supported by the server for the target
file system.  If the server does not support a particular attribute
on the target file system, then it MUST NOT return the attribute
value and MUST NOT set the attribute bit in the result bitmap.  The
server MUST return an error if it supports an attribute on the target
but cannot obtain its value.  In that case, no attribute values will
be returned.
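The bitmap rules just stated reduce to a small computation.  The
sketch below uses plain integers as bitmaps for illustration (the
wire format is a bitmap4 array of words), and the function name is
hypothetical.

```python
def getattr_reply_mask(requested, supported, obtainable):
    """Compute the GETATTR result bitmap as described above.

    Every requested attribute the server supports on the target
    file system is answered; a requested bit the server does not
    support is silently dropped from the result bitmap; a supported,
    requested attribute whose value cannot be obtained fails the
    whole operation (no attribute values are returned).
    """
    answered = requested & supported
    if answered & ~obtainable:
        # Supported but unobtainable: GETATTR must return an error.
        raise RuntimeError("GETATTR fails; no attributes returned")
    return answered
```

For example, a request for bits {0, 1, 3} against a server supporting
only bits {0, 1} yields a result bitmap of {0, 1} with no error.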
File systems that are absent should be treated as having support for
a very small set of attributes as described in Section 11.4.1, even
if previously, when the file system was present, more attributes were
supported.

All servers MUST support the REQUIRED attributes as specified in
Section 5.6, for all file systems, with the exception of absent file
systems.

On success, the current filehandle retains its value.

18.7.4.  IMPLEMENTATION

Suppose there is an OPEN_DELEGATE_WRITE delegation held by another
client for the file in question and size and/or change are among the
set of attributes being interrogated.  The server has two choices.
First, the server can obtain the actual current value of these
attributes from the client holding the delegation by using the
CB_GETATTR callback.  Second, the server, particularly when the
delegated client is unresponsive, can recall the delegation in
question.  The GETATTR MUST NOT proceed until one of the following
occurs:

o  The requested attribute values are returned in the response to
   CB_GETATTR.

o  The OPEN_DELEGATE_WRITE delegation is returned.

o  The OPEN_DELEGATE_WRITE delegation is revoked.

Unless one of the above happens very quickly, one or more
NFS4ERR_DELAY errors will be returned while a delegation is
outstanding.

18.8.  Operation 10: GETFH - Get Current Filehandle

18.8.1.  ARGUMENTS

   /* CURRENT_FH: */
   void;

18.8.2.  RESULTS

   struct GETFH4resok {
           nfs_fh4         object;
   };

   union GETFH4res switch (nfsstat4 status) {
    case NFS4_OK:
           GETFH4resok     resok4;
    default:
           void;
   };

18.8.3.  DESCRIPTION

This operation returns the current filehandle value.

On success, the current filehandle retains its value.
As described in Section 2.10.6.4, GETFH is REQUIRED or RECOMMENDED to
immediately follow certain operations, and servers are free to reject
such operations if the client fails to insert GETFH in the request as
REQUIRED or RECOMMENDED.  Section 18.16.4.1 provides additional
justification for why GETFH MUST follow OPEN.

18.8.4.  IMPLEMENTATION

Operations that change the current filehandle, like LOOKUP or CREATE,
do not automatically return the new filehandle as a result.  For
instance, if a client needs to look up a directory entry and obtain
its filehandle, then the following request is needed.

   PUTFH  (directory filehandle)

   LOOKUP (entry name)

   GETFH

18.9.  Operation 11: LINK - Create Link to a File

18.9.1.  ARGUMENTS

   struct LINK4args {
           /* SAVED_FH: source object */
           /* CURRENT_FH: target directory */
           component4      newname;
   };

18.9.2.  RESULTS

   struct LINK4resok {
           change_info4    cinfo;
   };

   union LINK4res switch (nfsstat4 status) {
    case NFS4_OK:
           LINK4resok resok4;
    default:
           void;
   };

18.9.3.  DESCRIPTION

The LINK operation creates an additional newname for the file
represented by the saved filehandle, as set by the SAVEFH operation,
in the directory represented by the current filehandle.  The existing
file and the target directory must reside within the same file system
on the server.  On success, the current filehandle will continue to
be the target directory.  If an object exists in the target directory
with the same name as newname, the server must return NFS4ERR_EXIST.

For the target directory, the server returns change_info4 information
in cinfo.  With the atomic field of the change_info4 data type, the
server will indicate if the before and after change attributes were
obtained atomically with respect to the link creation.
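One way a client can use this change_info4 result is to decide
whether its cached view of the target directory remains valid without
a round trip.  This is a hedged sketch of that decision, not a
mandated algorithm; the names are illustrative.

```python
def cache_still_valid(cinfo_atomic, cinfo_before, cinfo_after,
                      cached_change):
    """Decide how a client might handle change_info4 from LINK.

    If the before/after change values were captured atomically with
    respect to the link creation and 'before' matches the client's
    cached change attribute, the client can patch its cached
    directory and adopt 'after' as the new change value; otherwise
    it must revalidate the directory cache.
    """
    if cinfo_atomic and cinfo_before == cached_change:
        return True, cinfo_after     # patch cache, no revalidation
    return False, None               # full revalidation required
```

The same pattern applies to the cinfo returned by CREATE and other
directory-modifying operations.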
If the newname has a length of zero, or if newname does not obey the
UTF-8 definition, the error NFS4ERR_INVAL will be returned.

18.9.4.  IMPLEMENTATION

The server MAY impose restrictions on the LINK operation such that
LINK may not be done when the file is open or when that open is done
by particular protocols, or with particular options or access modes.
When LINK is rejected because of such restrictions, the error
NFS4ERR_FILE_OPEN is returned.

If a server does implement such restrictions and those restrictions
include cases of NFSv4 opens preventing successful execution of a
link, the server needs to recall any delegations that could hide the
existence of opens relevant to that decision.  The reason is that
when a client holds a delegation, the server might not have an
accurate account of the opens for that client, since the client may
execute OPENs and CLOSEs locally.  The LINK operation must be delayed
only until a definitive result can be obtained.  For example, suppose
there are multiple delegations and one of them establishes an open
whose presence would prevent the link.  Given the server's semantics,
NFS4ERR_FILE_OPEN may be returned to the caller as soon as that
delegation is returned, without waiting for other delegations to be
returned.  Similarly, if such opens are not associated with
delegations, NFS4ERR_FILE_OPEN can be returned immediately with no
delegation recall being done.

If the current filehandle designates a directory for which another
client holds a directory delegation, then, unless the delegation is
such that the situation can be resolved by sending a notification,
the delegation MUST be recalled, and the operation cannot be
performed successfully until the delegation is returned or revoked.
Except where this happens very quickly, one or more NFS4ERR_DELAY
errors will be returned to requests made while the delegation remains
outstanding.

When the current filehandle designates a directory for which one or
more directory delegations exist, then, when those delegations
request such notifications, instead of a recall, NOTIFY4_ADD_ENTRY
will be generated as a result of the LINK operation.

If the current file system supports the numlinks attribute, and other
clients have delegations to the file being linked, then those
delegations MUST be recalled and the LINK operation MUST NOT proceed
until all delegations are returned or revoked.  Except where this
happens very quickly, one or more NFS4ERR_DELAY errors will be
returned to requests made while delegation remains outstanding.

Changes to any property of the "hard" linked files are reflected in
all of the linked files.  When a link is made to a file, the
attributes for the file should have a value for numlinks that is one
greater than the value before the LINK operation.

The statement "file and the target directory must reside within the
same file system on the server" means that the fsid fields in the
attributes for the objects are the same.  If they reside on different
file systems, the error NFS4ERR_XDEV is returned.  This error may be
returned by some servers when there is an internal partitioning of a
file system that the LINK operation would violate.

On some servers, "." and ".." are illegal values for newname, and the
error NFS4ERR_BADNAME will be returned if they are specified.

When the current filehandle designates a named attribute directory
and the object to be linked (the saved filehandle) is not a named
attribute for the same object, the error NFS4ERR_XDEV MUST be
returned.
When the saved filehandle designates a named attribute and
the current filehandle is not the appropriate named attribute
directory, the error NFS4ERR_XDEV MUST also be returned.

When the current filehandle designates a named attribute directory
and the object to be linked (the saved filehandle) is a named
attribute within that directory, the server may return the error
NFS4ERR_NOTSUPP.

In the case that newname is already linked to the file represented by
the saved filehandle, the server will return NFS4ERR_EXIST.

Note that symbolic links are created with the CREATE operation.

18.10.  Operation 12: LOCK - Create Lock

18.10.1.  ARGUMENTS

   /*
    * For LOCK, transition from open_stateid and lock_owner
    * to a lock stateid.
    */
   struct open_to_lock_owner4 {
           seqid4          open_seqid;
           stateid4        open_stateid;
           seqid4          lock_seqid;
           lock_owner4     lock_owner;
   };

   /*
    * For LOCK, existing lock stateid continues to request new
    * file lock for the same lock_owner and open_stateid.
    */
   struct exist_lock_owner4 {
           stateid4        lock_stateid;
           seqid4          lock_seqid;
   };

   union locker4 switch (bool new_lock_owner) {
   case TRUE:
           open_to_lock_owner4     open_owner;
   case FALSE:
           exist_lock_owner4       lock_owner;
   };

   /*
    * LOCK/LOCKT/LOCKU: Record lock management
    */
   struct LOCK4args {
           /* CURRENT_FH: file */
           nfs_lock_type4  locktype;
           bool            reclaim;
           offset4         offset;
           length4         length;
           locker4         locker;
   };

18.10.2.  RESULTS

   struct LOCK4denied {
           offset4         offset;
           length4         length;
           nfs_lock_type4  locktype;
           lock_owner4     owner;
   };

   struct LOCK4resok {
           stateid4        lock_stateid;
   };

   union LOCK4res switch (nfsstat4 status) {
   case NFS4_OK:
           LOCK4resok      resok4;
   case NFS4ERR_DENIED:
           LOCK4denied     denied;
   default:
           void;
   };

18.10.3.  DESCRIPTION

The LOCK operation requests a byte-range lock for the byte-range
specified by the offset and length parameters, and lock type
specified in the locktype parameter.  If this is a reclaim request,
the reclaim parameter will be TRUE.

Bytes in a file may be locked even if those bytes are not currently
allocated to the file.  To lock the file from a specific offset
through the end-of-file (no matter how long the file actually is),
use a length field equal to NFS4_UINT64_MAX.  The server MUST return
NFS4ERR_INVAL under the following combinations of length and offset:

o  Length is equal to zero.

o  Length is not equal to NFS4_UINT64_MAX, and the sum of length and
   offset exceeds NFS4_UINT64_MAX.

32-bit servers are servers that support locking for byte offsets that
fit within 32 bits (i.e., less than or equal to NFS4_UINT32_MAX).  If
the client specifies a range that overlaps one or more bytes beyond
offset NFS4_UINT32_MAX but does not end at offset NFS4_UINT64_MAX,
then such a 32-bit server MUST return the error NFS4ERR_BAD_RANGE.

If the server returns NFS4ERR_DENIED, the owner, offset, and length
of a conflicting lock are returned.

The locker argument specifies the lock-owner that is associated with
the LOCK operation.  The locker4 structure is a switched union that
indicates whether the client has already created byte-range locking
state associated with the current open file and lock-owner.
In the case in which it has, the argument is just a stateid representing the
set of locks associated with that open file and lock-owner, together
with a lock_seqid value that MAY be any value and MUST be ignored by
the server.  In the case where no byte-range locking state has been
established, or the client does not have the stateid available, the
argument contains the stateid of the open file with which this lock
is to be associated, together with the lock-owner with which the lock
is to be associated.  The open_to_lock_owner case covers the very
first lock done by a lock-owner for a given open file and offers a
method to use the established state of the open_stateid to transition
to the use of a lock stateid.

The following fields of the locker parameter MAY be set to any value
by the client and MUST be ignored by the server:

o  The clientid field of the lock_owner field of the open_owner field
   (locker.open_owner.lock_owner.clientid).  The reason the server
   MUST ignore the clientid field is that the server MUST derive the
   client ID from the session ID from the SEQUENCE operation of the
   COMPOUND request.

o  The open_seqid and lock_seqid fields of the open_owner field
   (locker.open_owner.open_seqid and locker.open_owner.lock_seqid).

o  The lock_seqid field of the lock_owner field
   (locker.lock_owner.lock_seqid).

Note that the client ID appearing in a LOCK4denied structure is the
actual client associated with the conflicting lock, whether this is
the client ID associated with the current session or a different one.
Thus, if the server returns NFS4ERR_DENIED, it MUST set the clientid
field of the owner field of the denied field.

If the current filehandle is not an ordinary file, an error will be
returned to the client.
In the case that the current filehandle
represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.

On success, the current filehandle retains its value.

18.10.4.  IMPLEMENTATION

If the server is unable to determine the exact offset and length of
the conflicting byte-range lock, the same offset and length that were
provided in the arguments should be returned in the denied results.

LOCK operations are subject to permission checks and to checks
against the access type of the associated file.  However, the
specific rights and modes required for various types of locks reflect
the semantics of the server-exported file system, and are not
specified by the protocol.  For example, Windows 2000 allows a write
lock of a file open for read access, while a POSIX-compliant system
does not.

When the client sends a LOCK operation that corresponds to a range
that the lock-owner has locked already (with the same or different
lock type), or to a sub-range of such a range, or to a byte-range
that includes multiple locks already granted to that lock-owner, in
whole or in part, and the server does not support such locking
operations (i.e., does not support POSIX locking semantics), the
server will return the error NFS4ERR_LOCK_RANGE.  In that case, the
client may return an error, or it may emulate the required
operations, using only LOCK for ranges that do not include any bytes
already locked by that lock-owner and LOCKU of locks held by that
lock-owner (specifying an exactly matching range and type).
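The emulation approach described above amounts to an interval computation: issue LOCK only for the sub-ranges the lock-owner does not already hold. The sketch below is a hypothetical client-side helper, not part of the protocol; the function name and the `(lo, hi)` half-open representation of byte-ranges are illustrative.

```python
def uncovered_subranges(held, start, end):
    """Sub-ranges of [start, end) not covered by any (lo, hi) pair in
    `held` (disjoint byte-ranges already locked by this lock-owner).
    A client emulating POSIX-style range coalescing against a server
    that returns NFS4ERR_LOCK_RANGE would send LOCK only for these."""
    result = []
    cur = start
    for lo, hi in sorted(held):
        if hi <= cur or lo >= end:
            continue  # no overlap with the remaining request
        if lo > cur:
            result.append((cur, lo))  # gap before this held range
        cur = max(cur, hi)
        if cur >= end:
            break
    if cur < end:
        result.append((cur, end))
    return result
```

For example, with bytes [0, 100) already held, a request for [50, 150) requires a LOCK only for [100, 150).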
Similarly, when the client sends a LOCK operation that amounts to
upgrading (changing from a READ_LT lock to a WRITE_LT lock) or
downgrading (changing from a WRITE_LT lock to a READ_LT lock) an
existing byte-range lock, and the server does not support such a
lock, the server will return NFS4ERR_LOCK_NOTSUPP.  Such operations
may not perfectly reflect the required semantics in the face of
conflicting LOCK operations from other clients.

When a client holds an OPEN_DELEGATE_WRITE delegation, the client
holding that delegation is assured that there are no opens by other
clients.  Thus, there can be no conflicting LOCK operations from such
clients.  Therefore, the client may be handling locking requests
locally, without doing LOCK operations on the server.  If it does
that, it must be prepared to update the lock status on the server, by
sending appropriate LOCK and LOCKU operations before returning the
delegation.

When one or more clients hold OPEN_DELEGATE_READ delegations, any
LOCK operation where the server is implementing mandatory locking
semantics MUST result in the recall of all such delegations.  The
LOCK operation may not be granted until all such delegations are
returned or revoked.  Except where this happens very quickly, one or
more NFS4ERR_DELAY errors will be returned to requests made while the
delegation remains outstanding.

18.11.  Operation 13: LOCKT - Test for Lock

18.11.1.  ARGUMENTS

   struct LOCKT4args {
           /* CURRENT_FH: file */
           nfs_lock_type4  locktype;
           offset4         offset;
           length4         length;
           lock_owner4     owner;
   };

18.11.2.  RESULTS

   union LOCKT4res switch (nfsstat4 status) {
   case NFS4ERR_DENIED:
           LOCK4denied     denied;
   case NFS4_OK:
           void;
   default:
           void;
   };

18.11.3.  DESCRIPTION

The LOCKT operation tests the lock as specified in the arguments.  If
a conflicting lock exists, the owner, offset, length, and type of the
conflicting lock are returned.  The owner field in the results
includes the client ID of the owner of the conflicting lock, whether
this is the client ID associated with the current session or a
different client ID.  If no lock is held, nothing other than NFS4_OK
is returned.  Lock types READ_LT and READW_LT are processed in the
same way in that a conflicting lock test is done without regard to
blocking or non-blocking.  The same is true for WRITE_LT and
WRITEW_LT.

The ranges are specified as for LOCK.  The NFS4ERR_INVAL and
NFS4ERR_BAD_RANGE errors are returned under the same circumstances as
for LOCK.

The clientid field of the owner MAY be set to any value by the client
and MUST be ignored by the server.  The reason the server MUST ignore
the clientid field is that the server MUST derive the client ID from
the session ID from the SEQUENCE operation of the COMPOUND request.

If the current filehandle is not an ordinary file, an error will be
returned to the client.  In the case that the current filehandle
represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.

On success, the current filehandle retains its value.

18.11.4.  IMPLEMENTATION

If the server is unable to determine the exact offset and length of
the conflicting lock, the same offset and length that were provided
in the arguments should be returned in the denied results.

LOCKT uses a lock_owner4 rather than a stateid4, as is used in LOCK,
to identify the owner.
This is because the client does not have to open
the file to test for the existence of a lock, so a stateid might not
be available.

As noted in Section 18.10.4, some servers may return
NFS4ERR_LOCK_RANGE to certain (otherwise non-conflicting) LOCK
operations that overlap ranges already granted to the current lock-
owner.

The LOCKT operation's test for conflicting locks SHOULD exclude locks
for the current lock-owner, and thus should return NFS4_OK in such
cases.  Note that this means that a server might return NFS4_OK to a
LOCKT request even though a LOCK operation for the same range and
lock-owner would fail with NFS4ERR_LOCK_RANGE.

When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose
(see Section 18.10.4) to handle LOCK requests locally.  In such a
case, LOCKT requests will similarly be handled locally.

18.12.  Operation 14: LOCKU - Unlock File

18.12.1.  ARGUMENTS

   struct LOCKU4args {
           /* CURRENT_FH: file */
           nfs_lock_type4  locktype;
           seqid4          seqid;
           stateid4        lock_stateid;
           offset4         offset;
           length4         length;
   };

18.12.2.  RESULTS

   union LOCKU4res switch (nfsstat4 status) {
   case NFS4_OK:
           stateid4        lock_stateid;
   default:
           void;
   };

18.12.3.  DESCRIPTION

The LOCKU operation unlocks the byte-range lock specified by the
parameters.  The client may set the locktype field to any value that
is legal for the nfs_lock_type4 enumerated type, and the server MUST
accept any legal value for locktype.  Any legal value for locktype
has no effect on the success or failure of the LOCKU operation.

The ranges are specified as for LOCK.  The NFS4ERR_INVAL and
NFS4ERR_BAD_RANGE errors are returned under the same circumstances as
for LOCK.

The seqid parameter MAY be any value and the server MUST ignore it.
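The range rules shared by LOCK, LOCKT, and LOCKU (NFS4ERR_INVAL for a zero or overflowing range, NFS4ERR_BAD_RANGE on 32-bit servers) can be sketched as a single validation step. This is an illustrative sketch only; the function name and the `server_is_32bit` flag are assumptions, not protocol elements.

```python
NFS4_UINT64_MAX = 2**64 - 1
NFS4_UINT32_MAX = 2**32 - 1

def check_byte_range(offset, length, server_is_32bit=False):
    """Validate a LOCK/LOCKT/LOCKU byte-range.  Returns the name of
    the NFSv4.1 error to return, or None if the range is acceptable."""
    if length == 0:
        return "NFS4ERR_INVAL"
    if length != NFS4_UINT64_MAX and offset + length > NFS4_UINT64_MAX:
        return "NFS4ERR_INVAL"
    if server_is_32bit:
        # A length of NFS4_UINT64_MAX means "through end-of-file", and
        # such a range is taken to end at offset NFS4_UINT64_MAX.
        last = (NFS4_UINT64_MAX if length == NFS4_UINT64_MAX
                else offset + length - 1)
        if last > NFS4_UINT32_MAX and last != NFS4_UINT64_MAX:
            return "NFS4ERR_BAD_RANGE"
    return None
```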
If the current filehandle is not an ordinary file, an error will be
returned to the client.  In the case that the current filehandle
represents an object of type NF4DIR, NFS4ERR_ISDIR is returned.  If
the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is
returned.  In all other cases, NFS4ERR_WRONG_TYPE is returned.

On success, the current filehandle retains its value.

The server MAY require that the combination of principal, security
flavor, and, if applicable, GSS mechanism that sent a LOCK operation
also be the one to send LOCKU on the file.  This might not be
possible if credentials for the principal are no longer available.
The server MAY allow the machine credential or SSV credential (see
Section 18.35) to send LOCKU.

18.12.4.  IMPLEMENTATION

If the area to be unlocked does not correspond exactly to a lock
actually held by the lock-owner, the server may return the error
NFS4ERR_LOCK_RANGE.  This includes the case in which the area is not
locked, where the area is a sub-range of the area locked, where it
overlaps the area locked without matching exactly, or where the area
specified includes multiple locks held by the lock-owner.  In all of
these cases, allowed by POSIX locking [21] semantics, a client
receiving this error should, if it desires support for such
operations, simulate the operation using LOCKU on ranges
corresponding to locks it actually holds, possibly followed by LOCK
operations for the sub-ranges not being unlocked.

When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose
(see Section 18.10.4) to handle LOCK requests locally.  In such a
case, LOCKU operations will similarly be handled locally.

18.13.  Operation 15: LOOKUP - Lookup Filename

18.13.1.  ARGUMENTS

   struct LOOKUP4args {
           /* CURRENT_FH: directory */
           component4      objname;
   };

18.13.2.  RESULTS

   struct LOOKUP4res {
           /* New CURRENT_FH: object */
           nfsstat4        status;
   };

18.13.3.  DESCRIPTION

The LOOKUP operation looks up or finds a file system object using the
directory specified by the current filehandle.  LOOKUP evaluates the
component and if the object exists, the current filehandle is
replaced with the component's filehandle.

If the component cannot be evaluated either because it does not exist
or because the client does not have permission to evaluate the
component, then an error will be returned and the current filehandle
will be unchanged.

If the component is a zero-length string or if any component does not
obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.

18.13.4.  IMPLEMENTATION

If the client wants to achieve the effect of a multi-component
lookup, it may construct a COMPOUND request such as (and obtain each
filehandle):

   PUTFH  (directory filehandle)
   LOOKUP "pub"
   GETFH
   LOOKUP "foo"
   GETFH
   LOOKUP "bar"
   GETFH

Unlike NFSv3, NFSv4.1 allows LOOKUP requests to cross mountpoints on
the server.  The client can detect a mountpoint crossing by comparing
the fsid attribute of the directory with the fsid attribute of the
directory looked up.  If the fsids are different, then the new
directory is a server mountpoint.  UNIX clients that detect a
mountpoint crossing will need to mount the server's file system.
This needs to be done to maintain the file object identity checking
mechanisms common to UNIX clients.
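The multi-component lookup pattern above can be sketched as a small builder for the COMPOUND operation list. The tuple encoding of operations below is purely illustrative; it is not an on-the-wire format.

```python
def multi_lookup_ops(dir_fh, components):
    """Build the operation list for a multi-component lookup that
    also obtains each intermediate filehandle, mirroring the
    PUTFH/LOOKUP/GETFH example above."""
    ops = [("PUTFH", dir_fh)]
    for name in components:
        ops.append(("LOOKUP", name))
        ops.append(("GETFH",))
    return ops
```

For instance, `multi_lookup_ops(root_fh, ["pub", "foo", "bar"])` yields the seven operations of the example above.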
Servers that limit NFS access to "shared" or "exported" file systems
should provide a pseudo file system into which the exported file
systems can be integrated, so that clients can browse the server's
namespace.  The client's view of a pseudo file system will be limited
to paths that lead to exported file systems.

Note: previous versions of the protocol assigned special semantics to
the names "." and "..".  NFSv4.1 assigns no special semantics to
these names.  The LOOKUPP operator must be used to look up a parent
directory.

Note that this operation does not follow symbolic links.  The client
is responsible for all parsing of filenames, including filenames that
are modified by symbolic links encountered during the lookup
process.

If the current filehandle supplied is not a directory but a symbolic
link, the error NFS4ERR_SYMLINK is returned.  For all other
non-directory file types, the error NFS4ERR_NOTDIR is returned.

18.14.  Operation 16: LOOKUPP - Lookup Parent Directory

18.14.1.  ARGUMENTS

   /* CURRENT_FH: object */
   void;

18.14.2.  RESULTS

   struct LOOKUPP4res {
           /* new CURRENT_FH: parent directory */
           nfsstat4        status;
   };

18.14.3.  DESCRIPTION

The current filehandle is assumed to refer to a regular directory or
a named attribute directory.  LOOKUPP assigns the filehandle for its
parent directory to be the current filehandle.  If there is no parent
directory, an NFS4ERR_NOENT error must be returned.  Therefore,
NFS4ERR_NOENT will be returned by the server when the current
filehandle is at the root or top of the server's file tree.

As is the case with LOOKUP, LOOKUPP will also cross mountpoints.

If the current filehandle is not a directory or named attribute
directory, the error NFS4ERR_NOTDIR is returned.
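The error selection when the current filehandle is not a directory differs slightly between LOOKUP (which distinguishes symbolic links) and LOOKUPP, as described above. A sketch, using illustrative type-name strings; the function names and the `is_named_attr_dir` flag are assumptions for illustration.

```python
def lookup_notdir_error(ftype):
    """LOOKUP: symbolic links get NFS4ERR_SYMLINK, other
    non-directories get NFS4ERR_NOTDIR; directories proceed."""
    if ftype == "NF4DIR":
        return None
    return "NFS4ERR_SYMLINK" if ftype == "NF4LNK" else "NFS4ERR_NOTDIR"

def lookupp_notdir_error(ftype, is_named_attr_dir=False):
    """LOOKUPP: any non-directory, symlink included, gets
    NFS4ERR_NOTDIR; named attribute directories are acceptable."""
    if ftype == "NF4DIR" or is_named_attr_dir:
        return None
    return "NFS4ERR_NOTDIR"
```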
If the requester's security flavor does not match that configured for
the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC
(a future minor revision of NFSv4 may upgrade this to MUST) in the
LOOKUPP response.  However, if the server does so, it MUST support
the SECINFO_NO_NAME operation (Section 18.45), so that the client can
gracefully determine the correct security flavor.

If the current filehandle is a named attribute directory that is
associated with a file system object via OPENATTR (i.e., not a
subdirectory of a named attribute directory), LOOKUPP SHOULD return
the filehandle of the associated file system object.

18.14.4.  IMPLEMENTATION

An issue to note is upward navigation from named attribute
directories.  The named attribute directories are essentially
detached from the namespace, and this property should be safely
represented in the client operating environment.  LOOKUPP on a named
attribute directory may return the filehandle of the associated file,
and conveying this to applications might be unsafe as many
applications expect the parent of an object to always be a directory.
Therefore, the client may want to hide the parent of named attribute
directories (represented as ".." in UNIX) or represent the named
attribute directory as its own parent (as is typically done for the
file system root directory in UNIX).

18.15.  Operation 17: NVERIFY - Verify Difference in Attributes

18.15.1.  ARGUMENTS

   struct NVERIFY4args {
           /* CURRENT_FH: object */
           fattr4          obj_attributes;
   };

18.15.2.  RESULTS

   struct NVERIFY4res {
           nfsstat4        status;
   };

18.15.3.  DESCRIPTION

This operation is used to prefix a sequence of operations to be
performed if one or more attributes have changed on some file system
object.
If all the attributes match, then the error NFS4ERR_SAME
MUST be returned.

On success, the current filehandle retains its value.

18.15.4.  IMPLEMENTATION

This operation is useful as a cache validation operator.  If the
object to which the attributes belong has changed, then the following
operations may obtain new data associated with that object, for
instance, to check if a file has been changed and obtain new data if
it has:

   SEQUENCE
   PUTFH fh
   NVERIFY attrbits attrs
   READ 0 32767

Contrast this with NFSv3, which would first send a GETATTR in one
request/reply round trip, and then if attributes indicated that the
client's cache was stale, send a READ in another request/reply
round trip.

In the case that a RECOMMENDED attribute is specified in the NVERIFY
operation and the server does not support that attribute for the file
system object, the error NFS4ERR_ATTRNOTSUPP is returned to the
client.

When the attribute rdattr_error or any set-only attribute (e.g.,
time_modify_set) is specified, the error NFS4ERR_INVAL is returned to
the client.

18.16.  Operation 18: OPEN - Open a Regular File

18.16.1.  ARGUMENTS

   /*
    * Various definitions for OPEN
    */
   enum createmode4 {
           UNCHECKED4      = 0,
           GUARDED4        = 1,
           /* Deprecated in NFSv4.1. */
           EXCLUSIVE4      = 2,
           /*
            * New to NFSv4.1.  If session is persistent,
            * GUARDED4 MUST be used.  Otherwise, use
            * EXCLUSIVE4_1 instead of EXCLUSIVE4.
            */
           EXCLUSIVE4_1    = 3
   };

   struct creatverfattr {
           verifier4       cva_verf;
           fattr4          cva_attrs;
   };

   union createhow4 switch (createmode4 mode) {
   case UNCHECKED4:
   case GUARDED4:
           fattr4          createattrs;
   case EXCLUSIVE4:
           verifier4       createverf;
   case EXCLUSIVE4_1:
           creatverfattr   ch_createboth;
   };

   enum opentype4 {
           OPEN4_NOCREATE  = 0,
           OPEN4_CREATE    = 1
   };

   union openflag4 switch (opentype4 opentype) {
   case OPEN4_CREATE:
           createhow4      how;
   default:
           void;
   };

   /* Next definitions used for OPEN delegation */
   enum limit_by4 {
           NFS_LIMIT_SIZE          = 1,
           NFS_LIMIT_BLOCKS        = 2
           /* others as needed */
   };

   struct nfs_modified_limit4 {
           uint32_t        num_blocks;
           uint32_t        bytes_per_block;
   };

   union nfs_space_limit4 switch (limit_by4 limitby) {
   /* limit specified as file size */
   case NFS_LIMIT_SIZE:
           uint64_t        filesize;
   /* limit specified by number of blocks */
   case NFS_LIMIT_BLOCKS:
           nfs_modified_limit4 mod_blocks;
   };

   /*
    * Share Access and Deny constants for open argument
    */
   const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
   const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
   const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;

   const OPEN4_SHARE_DENY_NONE     = 0x00000000;
   const OPEN4_SHARE_DENY_READ     = 0x00000001;
   const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
   const OPEN4_SHARE_DENY_BOTH     = 0x00000003;

   /* new flags for share_access field of OPEN4args */
   const OPEN4_SHARE_ACCESS_WANT_DELEG_MASK        = 0xFF00;
   const OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE     = 0x0000;
   const OPEN4_SHARE_ACCESS_WANT_READ_DELEG        = 0x0100;
   const OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG       = 0x0200;
   const OPEN4_SHARE_ACCESS_WANT_ANY_DELEG         = 0x0300;
   const OPEN4_SHARE_ACCESS_WANT_NO_DELEG          = 0x0400;
   const OPEN4_SHARE_ACCESS_WANT_CANCEL            = 0x0500;

   const
    OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL
    = 0x10000;

   const
    OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED
    = 0x20000;

   enum open_delegation_type4 {
           OPEN_DELEGATE_NONE      = 0,
           OPEN_DELEGATE_READ      = 1,
           OPEN_DELEGATE_WRITE     = 2,
           OPEN_DELEGATE_NONE_EXT  = 3 /* new to v4.1 */
   };

   enum open_claim_type4 {
           /*
            * Not a reclaim.
            */
           CLAIM_NULL              = 0,

           CLAIM_PREVIOUS          = 1,
           CLAIM_DELEGATE_CUR      = 2,
           CLAIM_DELEGATE_PREV     = 3,

           /*
            * Not a reclaim.
            *
            * Like CLAIM_NULL, but object identified
            * by the current filehandle.
            */
           CLAIM_FH                = 4, /* new to v4.1 */

           /*
            * Like CLAIM_DELEGATE_CUR, but object identified
            * by current filehandle.
            */
           CLAIM_DELEG_CUR_FH      = 5, /* new to v4.1 */

           /*
            * Like CLAIM_DELEGATE_PREV, but object identified
            * by current filehandle.
            */
           CLAIM_DELEG_PREV_FH     = 6 /* new to v4.1 */
   };

   struct open_claim_delegate_cur4 {
           stateid4        delegate_stateid;
           component4      file;
   };

   union open_claim4 switch (open_claim_type4 claim) {
   /*
    * No special rights to file.
    * Ordinary OPEN of the specified file.
    */
   case CLAIM_NULL:
           /* CURRENT_FH: directory */
           component4      file;
   /*
    * Right to the file established by an
    * open previous to server reboot.  File
    * identified by filehandle obtained at
    * that time rather than by name.
    */
   case CLAIM_PREVIOUS:
           /* CURRENT_FH: file being reclaimed */
           open_delegation_type4   delegate_type;

   /*
    * Right to file based on a delegation
    * granted by the server.  File is
    * specified by name.
    */
   case CLAIM_DELEGATE_CUR:
           /* CURRENT_FH: directory */
           open_claim_delegate_cur4        delegate_cur_info;

   /*
    * Right to file based on a delegation
    * granted to a previous boot instance
    * of the client.  File is specified by name.
    */
   case CLAIM_DELEGATE_PREV:
           /* CURRENT_FH: directory */
           component4      file_delegate_prev;

   /*
    * Like CLAIM_NULL.  No special rights
    * to file.  Ordinary OPEN of the
    * specified file by current filehandle.
    */
   case CLAIM_FH: /* new to v4.1 */
           /* CURRENT_FH: regular file to open */
           void;

   /*
    * Like CLAIM_DELEGATE_PREV.  Right to file based on a
    * delegation granted to a previous boot
    * instance of the client.  File is identified
    * by filehandle.
    */
   case CLAIM_DELEG_PREV_FH: /* new to v4.1 */
           /* CURRENT_FH: file being opened */
           void;

   /*
    * Like CLAIM_DELEGATE_CUR.  Right to file based on
    * a delegation granted by the server.
    * File is identified by filehandle.
    */
   case CLAIM_DELEG_CUR_FH: /* new to v4.1 */
           /* CURRENT_FH: file being opened */
           stateid4        oc_delegate_stateid;

   };

   /*
    * OPEN: Open a file, potentially receiving an OPEN delegation
    */
   struct OPEN4args {
           seqid4          seqid;
           uint32_t        share_access;
           uint32_t        share_deny;
           open_owner4     owner;
           openflag4       openhow;
           open_claim4     claim;
   };

18.16.2.  RESULTS

   struct open_read_delegation4 {
           stateid4 stateid;     /* Stateid for delegation */
           bool     recall;      /* Pre-recalled flag for
                                    delegations obtained
                                    by reclaim (CLAIM_PREVIOUS) */

           nfsace4  permissions; /* Defines users who don't
                                    need an ACCESS call to
                                    open for read */
   };

   struct open_write_delegation4 {
           stateid4 stateid;     /* Stateid for delegation */
           bool     recall;      /* Pre-recalled flag for
                                    delegations obtained
                                    by reclaim
                                    (CLAIM_PREVIOUS) */

           nfs_space_limit4
                    space_limit; /* Defines condition that
                                    the client must check to
                                    determine whether the
                                    file needs to be flushed
                                    to the server on close. */

           nfsace4  permissions; /* Defines users who don't
                                    need an ACCESS call as
                                    part of a delegated
                                    open. */
   };

   enum why_no_delegation4 { /* new to v4.1 */
           WND4_NOT_WANTED                 = 0,
           WND4_CONTENTION                 = 1,
           WND4_RESOURCE                   = 2,
           WND4_NOT_SUPP_FTYPE             = 3,
           WND4_WRITE_DELEG_NOT_SUPP_FTYPE = 4,
           WND4_NOT_SUPP_UPGRADE           = 5,
           WND4_NOT_SUPP_DOWNGRADE         = 6,
           WND4_CANCELLED                  = 7,
           WND4_IS_DIR                     = 8
   };

   union open_none_delegation4 /* new to v4.1 */
   switch (why_no_delegation4 ond_why) {
   case WND4_CONTENTION:
           bool ond_server_will_push_deleg;
   case WND4_RESOURCE:
           bool ond_server_will_signal_avail;
   default:
           void;
   };

   union open_delegation4
   switch (open_delegation_type4 delegation_type) {
   case OPEN_DELEGATE_NONE:
           void;
   case OPEN_DELEGATE_READ:
           open_read_delegation4 read;
   case OPEN_DELEGATE_WRITE:
           open_write_delegation4 write;
   case OPEN_DELEGATE_NONE_EXT: /* new to v4.1 */
           open_none_delegation4 od_whynone;
   };

   /*
    * Result flags
    */

   /* Client must confirm open */
   const OPEN4_RESULT_CONFIRM              = 0x00000002;
   /* Type of file locking behavior at the server */
   const OPEN4_RESULT_LOCKTYPE_POSIX       = 0x00000004;
   /* Server will preserve file if removed while open */
   const OPEN4_RESULT_PRESERVE_UNLINKED    = 0x00000008;

   /*
    * Server may use CB_NOTIFY_LOCK on locks
    * derived from this open
    */
   const OPEN4_RESULT_MAY_NOTIFY_LOCK      = 0x00000020;

   struct OPEN4resok {
           stateid4        stateid;      /* Stateid for open */
           change_info4    cinfo;        /* Directory Change Info */
           uint32_t        rflags;       /* Result flags */
           bitmap4         attrset;      /* attribute set for create */
           open_delegation4 delegation;  /* Info on any open
                                            delegation */
   };

   union OPEN4res switch (nfsstat4 status) {
   case NFS4_OK:
           /* New CURRENT_FH: opened file */
           OPEN4resok      resok4;
   default:
           void;
   };

18.16.3.  DESCRIPTION

The OPEN operation opens a regular file in a directory with the
provided name or filehandle.  OPEN can also create a file if a name
is provided, and the client specifies it wants to create a file.
Specification of whether or not a file is to be created, and the
method of creation, is via the openhow parameter.  The openhow
parameter consists of a switched union (data type openflag4), which
switches on the value of opentype (OPEN4_NOCREATE or OPEN4_CREATE).
If OPEN4_CREATE is specified, this leads to another switched union
(data type createhow4) that supports four cases of creation methods:
UNCHECKED4, GUARDED4, EXCLUSIVE4, or EXCLUSIVE4_1.  If opentype is
OPEN4_CREATE, then the claim field MUST be one of CLAIM_NULL,
CLAIM_DELEGATE_CUR, or CLAIM_DELEGATE_PREV, because these claim
methods include a component of a file name.

Upon success (which might entail creation of a new file), the current
filehandle is replaced by that of the created or existing object.
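The OPEN4_CREATE claim restriction stated above can be sketched as a simple check. The choice of NFS4ERR_INVAL as the resulting error is an assumption for illustration (the text states only the requirement), and the string encoding of enum values is purely illustrative.

```python
# Claim types that carry a file name component; only these can be
# combined with OPEN4_CREATE, since creating a file requires a name.
NAME_BEARING_CLAIMS = {"CLAIM_NULL", "CLAIM_DELEGATE_CUR",
                       "CLAIM_DELEGATE_PREV"}

def check_create_claim(opentype, claim):
    """Return an error name (NFS4ERR_INVAL assumed) when OPEN4_CREATE
    is paired with a claim type that identifies the file only by
    filehandle, or None when the combination is permitted."""
    if opentype == "OPEN4_CREATE" and claim not in NAME_BEARING_CLAIMS:
        return "NFS4ERR_INVAL"
    return None
```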
If the current filehandle is a named attribute directory, OPEN will
then create or open a named attribute file.  Note that exclusive
create of a named attribute is not supported.  If the createmode is
EXCLUSIVE4 or EXCLUSIVE4_1 and the current filehandle is a named
attribute directory, the server will return NFS4ERR_INVAL.

UNCHECKED4 means that the file should be created if a file of that
name does not exist and encountering an existing regular file of that
name is not an error.  For this type of create, createattrs specifies
the initial set of attributes for the file.  The set of attributes
may include any writable attribute valid for regular files.  When an
UNCHECKED4 create encounters an existing file, the attributes
specified by createattrs are not used, except that when createattrs
specifies the size attribute with a size of zero, the existing file
is truncated.

If GUARDED4 is specified, the server checks for the presence of a
duplicate object by name before performing the create.  If a
duplicate exists, NFS4ERR_EXIST is returned.  If the object does not
exist, the request is performed as described for UNCHECKED4.

For the UNCHECKED4 and GUARDED4 cases, where the operation is
successful, the server will return to the client an attribute mask
signifying which attributes were successfully set for the object.

EXCLUSIVE4_1 and EXCLUSIVE4 specify that the server is to follow
exclusive creation semantics, using the verifier to ensure exclusive
creation of the target.  The server should check for the presence of
a duplicate object by name.  If the object does not exist, the server
creates the object and stores the verifier with the object.  If the
object does exist and the stored verifier matches the client-provided
verifier, the server uses the existing object as the newly created
object.
If the stored verifier does not match, then an error of NFS4ERR_EXIST
is returned.

If using EXCLUSIVE4, and if the server uses attributes to store the
exclusive create verifier, the server will signify which attributes it
used by setting the appropriate bits in the attribute mask that is
returned in the results.  Unlike UNCHECKED4, GUARDED4, and
EXCLUSIVE4_1, EXCLUSIVE4 does not support the setting of attributes at
file creation, and after a successful OPEN via EXCLUSIVE4, the client
MUST send a SETATTR to set attributes to a known state.

In NFSv4.1, EXCLUSIVE4 has been deprecated in favor of EXCLUSIVE4_1.
Unlike EXCLUSIVE4, attributes may be provided in the EXCLUSIVE4_1 case,
but because the server may use attributes of the target object to store
the verifier, the set of allowable attributes may be fewer than the set
of attributes SETATTR allows.  The allowable attributes for
EXCLUSIVE4_1 are indicated in the suppattr_exclcreat (Section 5.8.1.14)
attribute.  If the client attempts to set in cva_attrs an attribute
that is not in suppattr_exclcreat, the server MUST return
NFS4ERR_INVAL.  The response field, attrset, indicates both which
attributes the server set from cva_attrs and which attributes the
server used to store the verifier.  As described in Section 18.16.4,
the client can compare cva_attrs.attrmask with attrset to determine
which attributes were used to store the verifier.

With the addition of persistent sessions and pNFS, under some
conditions EXCLUSIVE4 MUST NOT be used by the client or supported by
the server.
The following table summarizes the appropriate and mandated exclusive
create methods for implementations of NFSv4.1:

                Required methods for exclusive create

   +----------------+-----------+---------------+----------------------+
   | Persistent     | Server    | Server        | Client Allowed       |
   | Reply Cache    | Supports  | REQUIRED      |                      |
   | Enabled        | pNFS      |               |                      |
   +----------------+-----------+---------------+----------------------+
   | no             | no        | EXCLUSIVE4_1  | EXCLUSIVE4_1         |
   |                |           | and           | (SHOULD) or          |
   |                |           | EXCLUSIVE4    | EXCLUSIVE4 (SHOULD   |
   |                |           |               | NOT)                 |
   | no             | yes       | EXCLUSIVE4_1  | EXCLUSIVE4_1         |
   | yes            | no        | GUARDED4      | GUARDED4             |
   | yes            | yes       | GUARDED4      | GUARDED4             |
   +----------------+-----------+---------------+----------------------+

                              Table 10

If CREATE_SESSION4_FLAG_PERSIST is set in the results of
CREATE_SESSION, the reply cache is persistent (see Section 18.36).  If
the EXCHGID4_FLAG_USE_PNFS_MDS flag is set in the results from
EXCHANGE_ID, the server is a pNFS server (see Section 18.35).  If the
client attempts to use EXCLUSIVE4 on a persistent session, or a session
derived from an EXCHGID4_FLAG_USE_PNFS_MDS client ID, the server MUST
return NFS4ERR_INVAL.

With persistent sessions, exclusive create semantics are fully
achievable via GUARDED4, and so EXCLUSIVE4 or EXCLUSIVE4_1 MUST NOT be
used.  When pNFS is being used, the layout_hint attribute might not be
supported after the file is created.  Only the EXCLUSIVE4_1 and
GUARDED4 methods of exclusive file creation allow the atomic setting of
attributes.

For the target directory, the server returns change_info4 information
in cinfo.  With the atomic field of the change_info4 data type, the
server will indicate if the before and after change attributes were
obtained atomically with respect to the link creation.
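The client-side choice in Table 10 can be sketched as follows.  This is
an illustrative sketch, not part of the protocol: the function name and
return strings are hypothetical, while the inputs correspond to the
CREATE_SESSION4_FLAG_PERSIST and EXCHGID4_FLAG_USE_PNFS_MDS results
described above.

```python
# Hypothetical sketch of the method-selection logic in Table 10.
def exclusive_create_method(persistent_reply_cache: bool,
                            pnfs_mds: bool) -> str:
    """Pick the exclusive create method a client should use."""
    if persistent_reply_cache:
        # With a persistent reply cache, GUARDED4 already gives
        # exactly-once semantics; EXCLUSIVE4/EXCLUSIVE4_1 MUST NOT be used.
        return "GUARDED4"
    if pnfs_mds:
        # On a pNFS metadata server, only EXCLUSIVE4_1 allows attributes
        # (e.g., layout_hint) to be set atomically at creation.
        return "EXCLUSIVE4_1"
    # Plain NFSv4.1: EXCLUSIVE4_1 SHOULD be used; EXCLUSIVE4 is deprecated.
    return "EXCLUSIVE4_1"
```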
The OPEN operation provides for Windows share reservation capability
with the use of the share_access and share_deny fields of the OPEN
arguments.  The client specifies at OPEN the required share_access and
share_deny modes.  For clients that do not directly support SHAREs
(i.e., UNIX), the expected deny value is OPEN4_SHARE_DENY_NONE.  In the
case that there is an existing SHARE reservation that conflicts with
the OPEN request, the server returns the error NFS4ERR_SHARE_DENIED.
For additional discussion of SHARE semantics, see Section 9.7.

For each OPEN, the client provides a value for the owner field of the
OPEN argument.  The owner field is of data type open_owner4, and
contains a field called clientid and a field called owner.  The client
can set the clientid field to any value and the server MUST ignore it.
Instead, the server MUST derive the client ID from the session ID of
the SEQUENCE operation of the COMPOUND request.

The "seqid" field of the request is not used in NFSv4.1, but it MAY be
any value and the server MUST ignore it.

In the case that the client is recovering state from a server failure,
the claim field of the OPEN argument is used to signify that the
request is meant to reclaim state previously held.

The "claim" field of the OPEN argument is used to specify the file to
be opened and the state information that the client claims to possess.
There are seven claim types as follows:

   +----------------------+--------------------------------------------+
   | open type            | description                                |
   +----------------------+--------------------------------------------+
   | CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN request |
   |                      | and there is no previous state associated  |
   |                      | with the file for the client.  With        |
   |                      | CLAIM_NULL, the file is identified by the  |
   |                      | current filehandle and the specified       |
   |                      | component name.  With CLAIM_FH (new to     |
   |                      | NFSv4.1), the file is identified by just   |
   |                      | the current filehandle.                    |
   | CLAIM_PREVIOUS       | The client is claiming basic OPEN state    |
   |                      | for a file that was held previous to a     |
   |                      | server restart.  Generally used when a     |
   |                      | server is returning persistent             |
   |                      | filehandles; the client may not have the   |
   |                      | file name to reclaim the OPEN.             |
   | CLAIM_DELEGATE_CUR,  | The client is claiming a delegation for    |
   | CLAIM_DELEG_CUR_FH   | OPEN as granted by the server.  Generally, |
   |                      | this is done as part of recalling a        |
   |                      | delegation.  With CLAIM_DELEGATE_CUR, the  |
   |                      | file is identified by the current          |
   |                      | filehandle and the specified component     |
   |                      | name.  With CLAIM_DELEG_CUR_FH (new to     |
   |                      | NFSv4.1), the file is identified by just   |
   |                      | the current filehandle.                    |
   | CLAIM_DELEGATE_PREV, | The client is claiming a delegation        |
   | CLAIM_DELEG_PREV_FH  | granted to a previous client instance;     |
   |                      | used after the client restarts.  The       |
   |                      | server MAY support CLAIM_DELEGATE_PREV     |
   |                      | and/or CLAIM_DELEG_PREV_FH (new to         |
   |                      | NFSv4.1).  If it does support either claim |
   |                      | type, CREATE_SESSION MUST NOT remove the   |
   |                      | client's delegation state, and the server  |
   |                      | MUST support the DELEGPURGE operation.     |
   +----------------------+--------------------------------------------+

For OPEN requests that reach the server during the grace period, the
server returns an error of NFS4ERR_GRACE.  The following claim types
are exceptions:

o  OPEN requests specifying the claim type CLAIM_PREVIOUS are devoted
   to reclaiming opens after a server restart and are typically only
   valid during the grace period.
o  OPEN requests specifying the claim types CLAIM_DELEGATE_CUR and
   CLAIM_DELEG_CUR_FH are valid both during and after the grace period.
   Since the granting of the delegation that they are subordinate to
   assures that there is no conflict with locks to be reclaimed by
   other clients, the server need not return NFS4ERR_GRACE when these
   are received during the grace period.

For any OPEN request, the server may return an OPEN delegation, which
allows further opens and closes to be handled locally on the client as
described in Section 10.4.  Note that delegation is up to the server to
decide.  The client should never assume that delegation will or will
not be granted in a particular instance.  It should always be prepared
for either case.  A partial exception is the reclaim (CLAIM_PREVIOUS)
case, in which a delegation type is claimed.  In this case, delegation
will always be granted, although the server may specify an immediate
recall in the delegation structure.

The rflags returned by a successful OPEN allow the server to return
information governing how the open file is to be handled.

o  OPEN4_RESULT_CONFIRM is deprecated and MUST NOT be returned by an
   NFSv4.1 server.

o  OPEN4_RESULT_LOCKTYPE_POSIX indicates that the server's byte-range
   locking behavior supports the complete set of POSIX locking
   techniques [21].  From this, the client can choose to manage byte-
   range locking state in a way to handle a mismatch of byte-range
   locking management.

o  OPEN4_RESULT_PRESERVE_UNLINKED indicates that the server will
   preserve the open file if the client (or any other client) removes
   the file as long as it is open.  Furthermore, the server promises to
   preserve the file through the grace period after server restart,
   thereby giving the client the opportunity to reclaim its open.
o  OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt
   CB_NOTIFY_LOCK callbacks for locks on this file.  This flag is a
   hint only, and may be safely ignored by the client.

If the component is of zero length, NFS4ERR_INVAL will be returned.
The component is also subject to the normal UTF-8, character support,
and name checks.  See Section 14.5 for further discussion.

When an OPEN is done and the specified open-owner already has the
resulting filehandle open, the result is to "OR" together the new share
and deny status with the existing status.  In this case, only a single
CLOSE need be done, even though multiple OPENs were completed.  When
such an OPEN is done, checking of share reservations for the new OPEN
proceeds normally, with no exception for the existing OPEN held by the
same open-owner.  In this case, the stateid returned has an "other"
field that matches that of the previous open, while the "seqid" field
is incremented to reflect the changed status due to the new open.

If the underlying file system at the server is only accessible in a
read-only mode and the OPEN request has specified ACCESS_WRITE or
ACCESS_BOTH, the server will return NFS4ERR_ROFS to indicate a read-
only file system.

As with the CREATE operation, the server MUST derive the owner, owner
ACE, group, or group ACE if any of the four attributes are required and
supported by the server's file system.  For an OPEN with the EXCLUSIVE4
createmode, the server has no choice, since such OPEN calls do not
include the createattrs field.  Conversely, if createattrs (UNCHECKED4
or GUARDED4) or cva_attrs (EXCLUSIVE4_1) is specified, and includes an
owner, owner_group, or ACE that the principal in the RPC call's
credentials does not have authorization to create files for, then the
server may return NFS4ERR_PERM.
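The share-bit combination rule above, for a repeated OPEN by the same
open-owner, can be sketched as follows.  The state dictionary and
function are illustrative models, not protocol elements; only the
OR-ing of bits, the stable "other" field, and the incremented "seqid"
reflect the behavior described in the text.

```python
# Hypothetical sketch: combining a new OPEN by an open-owner that already
# holds the file open.  Share/deny bits are OR-ed; the stateid's "other"
# field stays the same while its "seqid" is incremented.
OPEN4_SHARE_ACCESS_READ = 0x1
OPEN4_SHARE_ACCESS_WRITE = 0x2

def combine_open(state, new_access, new_deny):
    """state: dict with 'access', 'deny', 'other', 'seqid' for the open."""
    state["access"] |= new_access   # union of share_access bits
    state["deny"] |= new_deny       # union of share_deny bits
    state["seqid"] += 1             # reflects the changed status
    # state["other"] is deliberately untouched: same open, same "other".
    return state
```

Because the bits are merged into one open, a single CLOSE releases
everything, as the text notes.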
In the case of an OPEN that specifies a size of zero (e.g., truncation)
and the file has named attributes, the named attributes are left as is
and are not removed.

NFSv4.1 gives more precise control to clients over acquisition of
delegations via the following new flags for the share_access field of
OPEN4args:

   OPEN4_SHARE_ACCESS_WANT_READ_DELEG

   OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG

   OPEN4_SHARE_ACCESS_WANT_ANY_DELEG

   OPEN4_SHARE_ACCESS_WANT_NO_DELEG

   OPEN4_SHARE_ACCESS_WANT_CANCEL

   OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL

   OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED

If (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is not zero,
then the client will have specified one and only one of:

   OPEN4_SHARE_ACCESS_WANT_READ_DELEG

   OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG

   OPEN4_SHARE_ACCESS_WANT_ANY_DELEG

   OPEN4_SHARE_ACCESS_WANT_NO_DELEG

   OPEN4_SHARE_ACCESS_WANT_CANCEL

Otherwise, the client is neither indicating a desire nor a non-desire
for a delegation, and the server MAY or MAY NOT return a delegation in
the OPEN response.

If the server supports the new _WANT_ flags and the client sends one or
more of the new flags, then in the event the server does not return a
delegation, it MUST return a delegation type of OPEN_DELEGATE_NONE_EXT.
The field ond_why in the reply indicates why no delegation was returned
and will be one of:

   WND4_NOT_WANTED  The client specified
      OPEN4_SHARE_ACCESS_WANT_NO_DELEG.

   WND4_CONTENTION  There is a conflicting delegation or open on the
      file.

   WND4_RESOURCE  Resource limitations prevent the server from granting
      a delegation.

   WND4_NOT_SUPP_FTYPE  The server does not support delegations on this
      file type.

   WND4_WRITE_DELEG_NOT_SUPP_FTYPE  The server does not support
      OPEN_DELEGATE_WRITE delegations on this file type.
   WND4_NOT_SUPP_UPGRADE  The server does not support atomic upgrade of
      an OPEN_DELEGATE_READ delegation to an OPEN_DELEGATE_WRITE
      delegation.

   WND4_NOT_SUPP_DOWNGRADE  The server does not support atomic
      downgrade of an OPEN_DELEGATE_WRITE delegation to an
      OPEN_DELEGATE_READ delegation.

   WND4_CANCELED  The client specified OPEN4_SHARE_ACCESS_WANT_CANCEL
      and now any "want" for this file object is cancelled.

   WND4_IS_DIR  The specified file object is a directory, and the
      operation is OPEN or WANT_DELEGATION, which do not support
      delegations on directories.

OPEN4_SHARE_ACCESS_WANT_READ_DELEG,
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG, and
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG mean, respectively, that the client
wants an OPEN_DELEGATE_READ delegation, an OPEN_DELEGATE_WRITE
delegation, or any delegation, regardless of which of
OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or
OPEN4_SHARE_ACCESS_BOTH is set.  If the client has an
OPEN_DELEGATE_READ delegation on a file and requests an
OPEN_DELEGATE_WRITE delegation, then the client is requesting atomic
upgrade of its OPEN_DELEGATE_READ delegation to an OPEN_DELEGATE_WRITE
delegation.  If the client has an OPEN_DELEGATE_WRITE delegation on a
file and requests an OPEN_DELEGATE_READ delegation, then the client is
requesting atomic downgrade to an OPEN_DELEGATE_READ delegation.  A
server MAY support atomic upgrade or downgrade.  If it does, then a
returned delegation_type of OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE
that is different from the delegation type the client currently has
indicates successful upgrade or downgrade.  If the server does not
support atomic delegation upgrade or downgrade, then ond_why will be
set to WND4_NOT_SUPP_UPGRADE or WND4_NOT_SUPP_DOWNGRADE.

OPEN4_SHARE_ACCESS_WANT_NO_DELEG means that the client wants no
delegation.
OPEN4_SHARE_ACCESS_WANT_CANCEL means that the client wants no
delegation and wants to cancel any previously registered "want" for a
delegation.

The client may set one or both of
OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and
OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED.  However, they
will have no effect unless one of the following is set:

o  OPEN4_SHARE_ACCESS_WANT_READ_DELEG

o  OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG

o  OPEN4_SHARE_ACCESS_WANT_ANY_DELEG

If the client specifies
OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL, then it wishes
to register a "want" for a delegation, in the event the OPEN results do
not include a delegation.  If so and the server denies the delegation
due to insufficient resources, the server MAY later inform the client,
via the CB_RECALLABLE_OBJ_AVAIL operation, that the resource limitation
condition has eased.  The server will tell the client that it intends
to send a future CB_RECALLABLE_OBJ_AVAIL operation by setting
delegation_type in the results to OPEN_DELEGATE_NONE_EXT, ond_why to
WND4_RESOURCE, and ond_server_will_signal_avail to TRUE.  If
ond_server_will_signal_avail is set to TRUE, the server MUST later send
a CB_RECALLABLE_OBJ_AVAIL operation.

If the client specifies
OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED, then it wishes to
register a "want" for a delegation, in the event the OPEN results do
not include a delegation.  If so and the server denies the delegation
due to contention, the server MAY later inform the client, via the
CB_PUSH_DELEG operation, that the contention condition has eased.
The server will tell the client that it intends to send a future
CB_PUSH_DELEG operation by setting delegation_type in the results to
OPEN_DELEGATE_NONE_EXT, ond_why to WND4_CONTENTION, and
ond_server_will_push_deleg to TRUE.  If ond_server_will_push_deleg is
TRUE, the server MUST later send a CB_PUSH_DELEG operation.

If the client has previously registered a want for a delegation on a
file, and then sends a request to register a want for a delegation on
the same file, the server MUST return a new error:
NFS4ERR_DELEG_ALREADY_WANTED.  If the client wishes to register a
different type of delegation want for the same file, it MUST cancel the
existing delegation want.

18.16.4.  IMPLEMENTATION

In the absence of a persistent session, the client invokes exclusive
create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1.  In
these cases, the client provides a verifier that can reasonably be
expected to be unique.  A combination of a client identifier, perhaps
the client network address, and a unique number generated by the
client, perhaps the RPC transaction identifier, may be appropriate.

If the object does not exist, the server creates the object and stores
the verifier in stable storage.  For file systems that do not provide a
mechanism for the storage of arbitrary file attributes, the server may
use one or more elements of the object's metadata to store the
verifier.  The verifier MUST be stored in stable storage to prevent
erroneous failure on retransmission of the request.  It is assumed that
an exclusive create is being performed because exclusive semantics are
critical to the application.  Because of the expected usage, exclusive
CREATE does not rely solely on the server's reply cache for storage of
the verifier.
A nonpersistent reply cache does not survive a crash, and the session
and reply cache may be deleted after a network partition that exceeds
the lease time, thus opening failure windows.

An NFSv4.1 server SHOULD NOT store the verifier in any of the file's
RECOMMENDED or REQUIRED attributes.  If it does, the server SHOULD use
time_modify_set or time_access_set to store the verifier.  The server
SHOULD NOT store the verifier in the following attributes:

   acl (it is desirable for access control to be established at
   creation),

   dacl (ditto),

   mode (ditto),

   owner (ditto),

   owner_group (ditto),

   retentevt_set (it may be desired to establish retention at
   creation),

   retention_hold (ditto),

   retention_set (ditto),

   sacl (it is desirable for auditing control to be established at
   creation),

   size (on some servers, size may have a limited range of values),

   mode_set_masked (as with mode), and

   time_creation (a meaningful file creation time should be set when
   the file is created).

Another alternative for the server is to use a named attribute to store
the verifier.

Because the EXCLUSIVE4 create method does not specify initial
attributes, when processing an EXCLUSIVE4 create the server

o  SHOULD set the owner of the file to that corresponding to the
   credential of the request's RPC header.

o  SHOULD NOT leave the file's access control to anyone but the owner
   of the file.

If the server cannot support exclusive create semantics, possibly
because of the requirement to commit the verifier to stable storage, it
should fail the OPEN request with the error NFS4ERR_NOTSUPP.

During an exclusive CREATE request, if the object already exists, the
server reconstructs the object's verifier and compares it with the
verifier in the request.
If they match, the server treats the request as a success.  The request
is presumed to be a duplicate of an earlier, successful request for
which the reply was lost and that the server duplicate request cache
mechanism did not detect.  If the verifiers do not match, the request
is rejected with the status NFS4ERR_EXIST.

After the client has performed a successful exclusive create, the
attrset response indicates which attributes were used to store the
verifier.  If EXCLUSIVE4 was used, the attributes set in attrset were
used for the verifier.  If EXCLUSIVE4_1 was used, the client determines
the attributes used for the verifier by comparing attrset with
cva_attrs.attrmask; any bits set in the former but not the latter
identify the attributes used to store the verifier.  The client MUST
immediately send a SETATTR to set the attributes used to store the
verifier.  Until it does so, the attributes used to store the verifier
cannot be relied upon.  The subsequent SETATTR MUST NOT occur in the
same COMPOUND request as the OPEN.

Unless a persistent session is used, use of the GUARDED4 attribute does
not provide exactly-once semantics.  In particular, if a reply is lost
and the server does not detect the retransmission of the request, the
operation can fail with NFS4ERR_EXIST, even though the create was
performed successfully.  The client would use this behavior in the case
that the application has not requested an exclusive create but has
asked to have the file truncated when the file is opened.  In the case
of the client timing out and retransmitting the create request, the
client can use GUARDED4 to prevent a sequence like create, write,
create (retransmitted) from occurring.
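The two client-side computations above can be sketched as follows.
Both functions are illustrative, not protocol elements: make_verifier
builds an 8-octet exclusive-create verifier from a client identifier
and the RPC transaction ID, using a CRC-plus-XID layout that is one
possible choice, not mandated by the protocol; verifier_attrs models
the EXCLUSIVE4_1 comparison of attrset with cva_attrs.attrmask, with
bitmaps modeled as plain integers.

```python
# Hypothetical sketches of exclusive-create verifier handling.
import struct
import zlib

def make_verifier(client_addr: str, xid: int) -> bytes:
    """Build an 8-octet verifier4 that is reasonably unique per request."""
    addr_hash = zlib.crc32(client_addr.encode()) & 0xFFFFFFFF
    return struct.pack("!II", addr_hash, xid & 0xFFFFFFFF)

def verifier_attrs(attrset: int, requested_mask: int) -> int:
    """Bits set in the returned attrset but absent from the requested
    cva_attrs.attrmask: the attributes the server used to store the
    verifier, which the client must reset with a SETATTR."""
    return attrset & ~requested_mask
```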
For SHARE reservations, the value of the expression (share_access &
~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) MUST be one of
OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or
OPEN4_SHARE_ACCESS_BOTH.  If not, the server MUST return
NFS4ERR_INVAL.  The value of share_deny MUST be one of
OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE,
or OPEN4_SHARE_DENY_BOTH.  If not, the server MUST return
NFS4ERR_INVAL.

Based on the share_access value (OPEN4_SHARE_ACCESS_READ,
OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH), the server
should check that the requester has the proper access rights to perform
the specified operation.  This would generally be the result of
applying the ACL access rules to the file for the current requester.
However, just as with the ACCESS operation, the client should not
attempt to second-guess the server's decisions, as access rights may
change and may be subject to server administrative controls outside the
ACL framework.  If the requester's READ or WRITE operation is not
authorized (depending on the share_access value), the server MUST
return NFS4ERR_ACCESS.

Note that if the client ID was not created with the
EXCHGID4_FLAG_BIND_PRINC_STATEID capability set in the reply to
EXCHANGE_ID, then the server MUST NOT impose any requirement that READs
and WRITEs sent for an open file have the same credentials as the OPEN
itself, and the server is REQUIRED to perform access checking on the
READs and WRITEs themselves.  Otherwise, if the reply to EXCHANGE_ID
did have EXCHGID4_FLAG_BIND_PRINC_STATEID set, then with one exception,
the credentials used in the OPEN request MUST match those used in the
READs and WRITEs, and the stateids in the READs and WRITEs MUST match,
or be derived from, the stateid from the reply to OPEN.
The exception is if SP4_SSV or SP4_MACH_CRED state protection is used,
and the spo_must_allow result of EXCHANGE_ID includes the READ and/or
WRITE operations.  In that case, the machine or SSV credential will be
allowed to send READ and/or WRITE.  See Section 18.35.

If the component provided to OPEN is a symbolic link, the error
NFS4ERR_SYMLINK will be returned to the client, while if it is a
directory, the error NFS4ERR_ISDIR will be returned.  If the component
is neither of those but not an ordinary file, the error
NFS4ERR_WRONG_TYPE is returned.  If the current filehandle is not a
directory, the error NFS4ERR_NOTDIR will be returned.

The use of the OPEN4_RESULT_PRESERVE_UNLINKED result flag allows a
client to avoid the common implementation practice of renaming an open
file to ".nfs<unique value>" after it removes the file.  After the
server returns OPEN4_RESULT_PRESERVE_UNLINKED, if a client sends a
REMOVE operation that would reduce the file's link count to zero, the
server SHOULD report a value of zero for the numlinks attribute on the
file.

If another client has a delegation of the file being opened that
conflicts with the open being done (sometimes depending on the
share_access or share_deny value specified), the delegation(s) MUST be
recalled, and the operation cannot proceed until each such delegation
is returned or revoked.  Except where this happens very quickly, one or
more NFS4ERR_DELAY errors will be returned to requests made while the
delegation remains outstanding.  In the case of an OPEN_DELEGATE_WRITE
delegation, any open by a different client will conflict, while for an
OPEN_DELEGATE_READ delegation, only opens with one of the following
characteristics will be considered conflicting:

o  The value of share_access includes the bit
   OPEN4_SHARE_ACCESS_WRITE.
o  The value of share_deny specifies OPEN4_SHARE_DENY_READ or
   OPEN4_SHARE_DENY_BOTH.

o  OPEN4_CREATE is specified together with UNCHECKED4, the size
   attribute is specified as zero (for truncation), and an existing
   file is truncated.

If OPEN4_CREATE is specified and the file does not exist and the
current filehandle designates a directory for which another client
holds a directory delegation, then, unless the delegation is such that
the situation can be resolved by sending a notification, the delegation
MUST be recalled, and the operation cannot proceed until the delegation
is returned or revoked.  Except where this happens very quickly, one or
more NFS4ERR_DELAY errors will be returned to requests made while the
delegation remains outstanding.

If OPEN4_CREATE is specified and the file does not exist and the
current filehandle designates a directory for which one or more
directory delegations exist, then, when those delegations request such
notifications, NOTIFY4_ADD_ENTRY will be generated as a result of this
operation.

18.16.4.1.  Warning to Client Implementors

OPEN resembles LOOKUP in that it generates a filehandle for the client
to use.  Unlike LOOKUP, though, OPEN creates server state on the
filehandle.  In normal circumstances, the client can only release this
state with a CLOSE operation.  CLOSE uses the current filehandle to
determine which file to close.  Therefore, the client MUST follow every
OPEN operation with a GETFH operation in the same COMPOUND procedure.
This will supply the client with the filehandle such that CLOSE can be
used appropriately.

Simply waiting for the lease on the file to expire is insufficient
because the server may maintain the state indefinitely as long as
another client does not attempt to make a conflicting access to the
same file.
See also Section 2.10.6.4.

18.17.  Operation 19: OPENATTR - Open Named Attribute Directory

18.17.1.  ARGUMENTS

   struct OPENATTR4args {
           /* CURRENT_FH: object */
           bool    createdir;
   };

18.17.2.  RESULTS

   struct OPENATTR4res {
           /*
            * If status is NFS4_OK,
            *   new CURRENT_FH: named attribute
            *                   directory
            */
           nfsstat4        status;
   };

18.17.3.  DESCRIPTION

The OPENATTR operation is used to obtain the filehandle of the named
attribute directory associated with the current filehandle.  The result
of the OPENATTR will be a filehandle to an object of type NF4ATTRDIR.
From this filehandle, READDIR and LOOKUP operations can be used to
obtain filehandles for the various named attributes associated with the
original file system object.  Filehandles returned within the named
attribute directory will designate objects of type NF4NAMEDATTR.

The createdir argument allows the client to signify if a named
attribute directory should be created as a result of the OPENATTR
operation.  Some clients may use the OPENATTR operation with a value of
FALSE for createdir to determine if any named attributes exist for the
object.  If none exist, then NFS4ERR_NOENT will be returned.  If
createdir has a value of TRUE and no named attribute directory exists,
one is created and its filehandle becomes the current filehandle.  On
the other hand, if createdir has a value of TRUE and the named
attribute directory already exists, no error results and the filehandle
of the existing directory becomes the current filehandle.  The creation
of a named attribute directory assumes that the server has implemented
named attribute support in this fashion and is not required to do so by
this definition.
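The createdir=FALSE probe described above can be sketched from the
client's point of view as follows.  This is an illustrative sketch:
the openattr callable is a hypothetical stand-in for issuing the
OPENATTR operation and returning its NFSv4 status string.

```python
# Hypothetical client-side sketch: probe for named attributes by sending
# OPENATTR with createdir = FALSE and interpreting the status.
def has_named_attributes(openattr, fh) -> bool:
    """True if the object has a named attribute directory."""
    status = openattr(fh, createdir=False)
    if status == "NFS4_OK":
        return True                 # directory exists; named attrs present
    if status == "NFS4ERR_NOENT":
        return False                # no named attributes exist
    raise RuntimeError(f"unexpected OPENATTR status: {status}")
```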
If the current file handle designates an object of type NF4NAMEDATTR
(a named attribute) or NF4ATTRDIR (a named attribute directory), an
error of NFS4ERR_WRONG_TYPE is returned to the client.  Named
attributes or a named attribute directory MUST NOT have their own named
attributes.

18.17.4.  IMPLEMENTATION

If the server does not support named attributes for the current
filehandle, an error of NFS4ERR_NOTSUPP will be returned to the client.

18.18.  Operation 21: OPEN_DOWNGRADE - Reduce Open File Access

18.18.1.  ARGUMENTS

   struct OPEN_DOWNGRADE4args {
           /* CURRENT_FH: opened file */
           stateid4        open_stateid;
           seqid4          seqid;
           uint32_t        share_access;
           uint32_t        share_deny;
   };

18.18.2.  RESULTS

   struct OPEN_DOWNGRADE4resok {
           stateid4        open_stateid;
   };

   union OPEN_DOWNGRADE4res switch (nfsstat4 status) {
    case NFS4_OK:
           OPEN_DOWNGRADE4resok    resok4;
    default:
           void;
   };

18.18.3.  DESCRIPTION

This operation is used to adjust the access and deny states for a given
open.  This is necessary when a given open-owner opens the same file
multiple times with different access and deny values.  In this
situation, a close of one of the opens may change the appropriate
share_access and share_deny flags to remove bits associated with opens
no longer in effect.

Valid values for the expression (share_access &
~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) are OPEN4_SHARE_ACCESS_READ,
OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH.  If the client
specifies other values, the server MUST reply with NFS4ERR_INVAL.

Valid values for the share_deny field are OPEN4_SHARE_DENY_NONE,
OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or
OPEN4_SHARE_DENY_BOTH.  If the client specifies other values, the
server MUST reply with NFS4ERR_INVAL.
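The share_access and share_deny checks above, which apply to both OPEN
(Section 18.16.4) and OPEN_DOWNGRADE, can be sketched as follows.  The
constant values follow the NFSv4.1 XDR definitions; the checker itself
is an illustrative sketch that implements the spec's masking expression
literally, not server code.

```python
# Hypothetical sketch of the share_access/share_deny argument checks.
OPEN4_SHARE_ACCESS_READ  = 0x1
OPEN4_SHARE_ACCESS_WRITE = 0x2
OPEN4_SHARE_ACCESS_BOTH  = 0x3
OPEN4_SHARE_DENY_NONE    = 0x0
OPEN4_SHARE_DENY_READ    = 0x1
OPEN4_SHARE_DENY_WRITE   = 0x2
OPEN4_SHARE_DENY_BOTH    = 0x3
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00

def check_share_args(share_access, share_deny):
    """Raise ValueError (modeling NFS4ERR_INVAL) on invalid arguments."""
    # Mask off the "want" bits, then require exactly READ, WRITE, or BOTH.
    access = share_access & ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK
    if access not in (OPEN4_SHARE_ACCESS_READ,
                      OPEN4_SHARE_ACCESS_WRITE,
                      OPEN4_SHARE_ACCESS_BOTH):
        raise ValueError("NFS4ERR_INVAL: bad share_access")
    if share_deny not in (OPEN4_SHARE_DENY_NONE,
                          OPEN4_SHARE_DENY_READ,
                          OPEN4_SHARE_DENY_WRITE,
                          OPEN4_SHARE_DENY_BOTH):
        raise ValueError("NFS4ERR_INVAL: bad share_deny")
```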
22723 After checking for valid values of share_access and share_deny, the 22724 server replaces the current access and deny modes on the file with 22725 share_access and share_deny subject to the following constraints: 22727 o The bits in share_access SHOULD equal the union of the 22728 share_access bits (not including OPEN4_SHARE_WANT_* bits) 22729 specified for some subset of the OPENs in effect for the current 22730 open-owner on the current file. 22732 o The bits in share_deny SHOULD equal the union of the share_deny 22733 bits specified for some subset of the OPENs in effect for the 22734 current open-owner on the current file. 22736 If the above constraints are not respected, the server SHOULD return 22737 the error NFS4ERR_INVAL. Since share_access and share_deny bits 22738 should be subsets of those already granted, short of a defect in the 22739 client or server implementation, it is not possible for the 22740 OPEN_DOWNGRADE request to be denied because of conflicting share 22741 reservations. 22743 The seqid argument is not used in NFSv4.1, MAY be any value, and MUST 22744 be ignored by the server. 22746 On success, the current filehandle retains its value. 22748 18.18.4. IMPLEMENTATION 22750 An OPEN_DOWNGRADE operation may make OPEN_DELEGATE_READ delegations 22751 grantable where they were not previously. Servers may choose to 22752 respond immediately if there are pending delegation want requests or 22753 may respond to the situation at a later time. 22755 18.19. Operation 22: PUTFH - Set Current Filehandle 22757 18.19.1. ARGUMENTS 22759 struct PUTFH4args { 22760 nfs_fh4 object; 22761 }; 22763 18.19.2. RESULTS 22765 struct PUTFH4res { 22766 /* 22767 * If status is NFS4_OK, 22768 * new CURRENT_FH: argument to PUTFH 22769 */ 22770 nfsstat4 status; 22771 }; 22773 18.19.3. DESCRIPTION 22775 This operation replaces the current filehandle with the filehandle 22776 provided as an argument. It clears the current stateid. 
22778 If the security mechanism used by the requester does not meet the 22779 requirements of the filehandle provided to this operation, the server 22780 MUST return NFS4ERR_WRONGSEC. 22782 See Section 16.2.3.1.1 for more details on the current filehandle. 22784 See Section 16.2.3.1.2 for more details on the current stateid. 22786 18.19.4. IMPLEMENTATION 22788 This operation is used in an NFS request to set the context for file 22789 accessing operations that follow in the same COMPOUND request. 22791 18.20. Operation 23: PUTPUBFH - Set Public Filehandle 22793 18.20.1. ARGUMENT 22795 void; 22797 18.20.2. RESULT 22799 struct PUTPUBFH4res { 22800 /* 22801 * If status is NFS4_OK, 22802 * new CURRENT_FH: public fh 22803 */ 22804 nfsstat4 status; 22805 }; 22807 18.20.3. DESCRIPTION 22809 This operation replaces the current filehandle with the filehandle 22810 that represents the public filehandle of the server's namespace. 22811 This filehandle may be different from the "root" filehandle that may 22812 be associated with some other directory on the server. 22814 PUTPUBFH also clears the current stateid. 22816 The public filehandle represents the concepts embodied in RFC 2054 22817 [45], RFC 2055 [46], and RFC 2224 [56]. The intent for NFSv4.1 is 22818 that the public filehandle (represented by the PUTPUBFH operation) be 22819 used as a method of providing WebNFS server compatibility with NFSv3. 22821 The public filehandle and the root filehandle (represented by the 22822 PUTROOTFH operation) SHOULD be equivalent. If the public and root 22823 filehandles are not equivalent, then the directory corresponding to 22824 the public filehandle MUST be a descendant of the directory 22825 corresponding to the root filehandle. 22827 See Section 16.2.3.1.1 for more details on the current filehandle. 22829 See Section 16.2.3.1.2 for more details on the current stateid. 22831 18.20.4. 
IMPLEMENTATION 22833 This operation is used in an NFS request to set the context for file 22834 accessing operations that follow in the same COMPOUND request. 22836 With the NFSv3 public filehandle, the client is able to specify 22837 whether the pathname provided in the LOOKUP should be evaluated as 22838 either an absolute path relative to the server's root or relative to 22839 the public filehandle. RFC 2224 [56] contains further discussion of 22840 the functionality. With NFSv4.1, that type of specification is not 22841 directly available in the LOOKUP operation. This is 22842 because the component separators needed to specify absolute vs. 22843 relative paths are not allowed in NFSv4. Therefore, the client is 22844 responsible for constructing its request such that the use of either 22845 PUTROOTFH or PUTPUBFH signifies absolute or relative evaluation of an 22846 NFS URL, respectively. 22848 Note that there are warnings mentioned in RFC 2224 [56] with respect 22849 to the use of absolute evaluation and the restrictions the server may 22850 place on that evaluation with respect to how much of its namespace 22851 has been made available. These same warnings apply to NFSv4.1. It 22852 is likely, therefore, that because of server implementation details, 22853 an NFSv3 absolute public filehandle lookup may behave differently 22854 from an NFSv4.1 absolute resolution. 22856 There is a form of security negotiation as described in RFC 2755 [57] 22857 that uses the public filehandle and an overloading of the pathname. 22858 This method is not available with NFSv4.1 as filehandles are not 22859 overloaded with special meaning and therefore do not provide the same 22860 framework as NFSv3. Clients should therefore use the security 22861 negotiation mechanisms described in Section 2.6. 22863 18.21. Operation 24: PUTROOTFH - Set Root Filehandle 22865 18.21.1. ARGUMENTS 22867 void; 22869 18.21.2.
RESULTS 22871 struct PUTROOTFH4res { 22872 /* 22873 * If status is NFS4_OK, 22874 * new CURRENT_FH: root fh 22875 */ 22876 nfsstat4 status; 22877 }; 22879 18.21.3. DESCRIPTION 22881 This operation replaces the current filehandle with the filehandle 22882 that represents the root of the server's namespace. From this 22883 filehandle, a LOOKUP operation can locate any other filehandle on the 22884 server. This filehandle may be different from the "public" 22885 filehandle that may be associated with some other directory on the 22886 server. 22888 PUTROOTFH also clears the current stateid. 22890 See Section 16.2.3.1.1 for more details on the current filehandle. 22892 See Section 16.2.3.1.2 for more details on the current stateid. 22894 18.21.4. IMPLEMENTATION 22896 This operation is used in an NFS request to set the context for file 22897 accessing operations that follow in the same COMPOUND request. 22899 18.22. Operation 25: READ - Read from File 22901 18.22.1. ARGUMENTS 22903 struct READ4args { 22904 /* CURRENT_FH: file */ 22905 stateid4 stateid; 22906 offset4 offset; 22907 count4 count; 22908 }; 22910 18.22.2. RESULTS 22912 struct READ4resok { 22913 bool eof; 22914 opaque data<>; 22915 }; 22917 union READ4res switch (nfsstat4 status) { 22918 case NFS4_OK: 22919 READ4resok resok4; 22920 default: 22921 void; 22922 }; 22924 18.22.3. DESCRIPTION 22926 The READ operation reads data from the regular file identified by the 22927 current filehandle. 22929 The client provides an offset of where the READ is to start and a 22930 count of how many bytes are to be read. An offset of zero means to 22931 read data starting at the beginning of the file. If offset is 22932 greater than or equal to the size of the file, the status NFS4_OK is 22933 returned with a data length set to zero and eof is set to TRUE. The 22934 READ is subject to access permissions checking. 
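The offset/count/eof rules for READ can be captured in a small sketch (a hypothetical helper; permission and locking checks are omitted):

```python
def read_reply(file_size, offset, count):
    """Compute the result of a READ per the rules above. Returns
    (number of bytes returned, eof flag)."""
    if offset >= file_size:
        # Reading at or past end-of-file: zero bytes, eof TRUE.
        # This also covers a READ of an empty file.
        return 0, True
    n = min(count, file_size - offset)
    # eof is TRUE when the read reaches the end of the file, whether
    # offset + count equals the size or extends beyond it.
    return n, offset + n == file_size
```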
22936 If the client specifies a count value of zero, the READ succeeds and 22937 returns zero bytes of data, again subject to access permissions 22938 checking. The server may choose to return fewer bytes than specified 22939 by the client. The client needs to check for this condition and 22940 handle it appropriately. 22942 Except when special stateids are used, the stateid value for a READ 22943 request represents a value returned from a previous byte-range lock 22944 or share reservation request or the stateid associated with a 22945 delegation. The stateid identifies the associated owners, if any, and 22946 is used by the server to verify that the associated locks are still 22947 valid (e.g., have not been revoked). 22949 If the read ended at the end-of-file (formally, in a correctly formed 22950 READ operation, if offset + count is equal to the size of the file), 22951 or the READ operation extends beyond the size of the file (if offset 22952 + count is greater than the size of the file), eof is returned as 22953 TRUE; otherwise, it is FALSE. A successful READ of an empty file 22954 will always return eof as TRUE. 22956 If the current filehandle is not an ordinary file, an error will be 22957 returned to the client. In the case that the current filehandle 22958 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 22959 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 22960 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 22962 For a READ with a stateid value of all bits equal to zero, the server 22963 MAY allow the READ to be serviced subject to mandatory byte-range 22964 locks or the current share deny modes for the file. For a READ with 22965 a stateid value of all bits equal to one, the server MAY allow READ 22966 operations to bypass locking checks at the server. 22968 On success, the current filehandle retains its value. 22970 18.22.4.
IMPLEMENTATION 22972 If the server returns a "short read" (i.e., fewer bytes of data than 22973 requested, with eof set to FALSE), the client should send another READ 22974 to get the remaining data. A server may return less data than 22975 requested under several circumstances. The file may have been 22976 truncated by another client or perhaps on the server itself, changing 22977 the file size from what the requesting client believes to be the 22978 case. This would reduce the actual amount of data available to the 22979 client. It is also possible that the server will reduce the transfer 22980 size and so return a short read result. Server resource exhaustion 22981 may also result in a short read. 22983 If mandatory byte-range locking is in effect for the file, and if the 22984 byte-range corresponding to the data to be read from the file is 22985 WRITE_LT locked by an owner not associated with the stateid, the 22986 server will return the NFS4ERR_LOCKED error. The client should try 22987 to get the appropriate READ_LT via the LOCK operation before re- 22988 attempting the READ. When the READ completes, the client should 22989 release the byte-range lock via LOCKU. 22991 If another client has an OPEN_DELEGATE_WRITE delegation for the file 22992 being read, the delegation must be recalled, and the operation cannot 22993 proceed until that delegation is returned or revoked. Except where 22994 this happens very quickly, one or more NFS4ERR_DELAY errors will be 22995 returned to requests made while the delegation remains outstanding. 22996 Normally, delegations will not be recalled as a result of a READ 22997 operation since the recall will occur as a result of an earlier OPEN. 22998 However, since it is possible for a READ to be done with a special 22999 stateid, the server needs to check for this case even though the 23000 client should have done an OPEN previously. 23002 18.23. Operation 26: READDIR - Read Directory 23004 18.23.1.
ARGUMENTS 23006 struct READDIR4args { 23007 /* CURRENT_FH: directory */ 23008 nfs_cookie4 cookie; 23009 verifier4 cookieverf; 23010 count4 dircount; 23011 count4 maxcount; 23012 bitmap4 attr_request; 23013 }; 23015 18.23.2. RESULTS 23016 struct entry4 { 23017 nfs_cookie4 cookie; 23018 component4 name; 23019 fattr4 attrs; 23020 entry4 *nextentry; 23021 }; 23023 struct dirlist4 { 23024 entry4 *entries; 23025 bool eof; 23026 }; 23028 struct READDIR4resok { 23029 verifier4 cookieverf; 23030 dirlist4 reply; 23031 }; 23033 union READDIR4res switch (nfsstat4 status) { 23034 case NFS4_OK: 23035 READDIR4resok resok4; 23036 default: 23037 void; 23038 }; 23040 18.23.3. DESCRIPTION 23042 The READDIR operation retrieves a variable number of entries from a 23043 file system directory and returns client-requested attributes for 23044 each entry along with information to allow the client to request 23045 additional directory entries in a subsequent READDIR. 23047 The arguments contain a cookie value that represents where the 23048 READDIR should start within the directory. A value of zero for the 23049 cookie is used to start reading at the beginning of the directory. 23050 For subsequent READDIR requests, the client specifies a cookie value 23051 that is provided by the server on a previous READDIR request. 23053 The request's cookieverf field should be set to zero (0) when the 23054 request's cookie field is zero (first read of the directory). On 23055 subsequent requests, the cookieverf field must match the cookieverf 23056 returned by the READDIR in which the cookie was acquired. If the 23057 server determines that the cookieverf is no longer valid for the 23058 directory, the error NFS4ERR_NOT_SAME must be returned. 23060 The dircount field of the request is a hint of the maximum number of 23061 bytes of directory information that should be returned. This value 23062 represents the total length of the names of the directory entries and 23063 the cookie value for these entries.
This length represents the XDR 23064 encoding of the data (names and cookies) and not the length in the 23065 native format of the server. 23067 The maxcount field of the request represents the maximum total size 23068 of all of the data being returned within the READDIR4resok structure 23069 and includes the XDR overhead. The server MAY return less data. If 23070 the server is unable to return a single directory entry within the 23071 maxcount limit, the error NFS4ERR_TOOSMALL MUST be returned to the 23072 client. 23074 Finally, the request's attr_request field represents the list of 23075 attributes to be returned for each directory entry supplied by the 23076 server. 23078 A successful reply consists of a list of directory entries. Each of 23079 these entries contains the name of the directory entry, a cookie 23080 value for that entry, and the associated attributes as requested. 23081 The "eof" flag has a value of TRUE if there are no more entries in 23082 the directory. 23084 The cookie value is only meaningful to the server and is used as a 23085 cursor for the directory entry. As mentioned, this cookie is used by 23086 the client for subsequent READDIR operations so that it may continue 23087 reading a directory. The cookie is similar in concept to a READ 23088 offset but MUST NOT be interpreted as such by the client. Ideally, 23089 the cookie value SHOULD NOT change if the directory is modified since 23090 the client may be caching these values. 23092 In some cases, the server may encounter an error while obtaining the 23093 attributes for a directory entry. Instead of returning an error for 23094 the entire READDIR operation, the server can instead return the 23095 attribute rdattr_error (Section 5.8.1.12). With this, the server is 23096 able to communicate the failure to the client and not fail the entire 23097 operation in the instance of what might be a transient failure. 
23098 Obviously, the client must request the fattr4_rdattr_error attribute 23099 for this method to work properly. If the client does not request the 23100 attribute, the server has no choice but to return failure for the 23101 entire READDIR operation. 23103 For some file system environments, the directory entries "." and ".." 23104 have special meaning, and in other environments, they do not. If the 23105 server supports these special entries within a directory, they SHOULD 23106 NOT be returned to the client as part of the READDIR response. To 23107 enable some client environments, the cookie values of zero, one, and 23108 two are to be considered reserved. Note that the UNIX client will use 23109 these values when combining the server's response and local 23110 representations to enable a fully formed UNIX directory presentation 23111 to the application. 23113 For READDIR arguments, cookie values of one and two SHOULD NOT be 23114 used, and for READDIR results, cookie values of zero, one, and two 23115 SHOULD NOT be returned. 23117 On success, the current filehandle retains its value. 23119 18.23.4. IMPLEMENTATION 23121 The server's file system directory representations can differ 23122 greatly. A client's programming interfaces may also be bound to the 23123 local operating environment in a way that does not translate well 23124 into the NFS protocol. Therefore, the dircount and 23125 maxcount fields are provided to enable the client to give hints to 23126 the server. If the client is aggressive about attribute collection 23127 during a READDIR, the server has an idea of how to limit the encoded 23128 response. 23130 If dircount is zero, the server bounds the reply's size based on the 23131 request's maxcount field. 23133 The cookieverf may be used by the server to help manage cookie values 23134 that may become stale.
It should be a rare occurrence that a server 23135 is unable to continue properly reading a directory with the provided 23136 cookie/cookieverf pair. The server SHOULD make every effort to avoid 23137 this condition since the application at the client might be unable to 23138 properly handle this type of failure. 23140 The use of the cookieverf will also protect the client from using 23141 READDIR cookie values that might be stale. For example, if the file 23142 system has been migrated, the server might or might not be able to 23143 use the same cookie values to service READDIR as the previous server 23144 used. With the client providing the cookieverf, the server is able 23145 to provide the appropriate response to the client. This prevents the 23146 case where the server accepts a cookie value but the underlying 23147 directory has changed and the response is invalid from the client's 23148 context of its previous READDIR. 23150 Since some servers will not be returning "." and ".." entries as has 23151 been done with previous versions of the NFS protocol, the client that 23152 requires these entries be present in READDIR responses must fabricate 23153 them. 23155 18.24. Operation 27: READLINK - Read Symbolic Link 23157 18.24.1. ARGUMENTS 23159 /* CURRENT_FH: symlink */ 23160 void; 23162 18.24.2. RESULTS 23164 struct READLINK4resok { 23165 linktext4 link; 23166 }; 23168 union READLINK4res switch (nfsstat4 status) { 23169 case NFS4_OK: 23170 READLINK4resok resok4; 23171 default: 23172 void; 23173 }; 23175 18.24.3. DESCRIPTION 23177 READLINK reads the data associated with a symbolic link. Depending 23178 on the value of the UTF-8 capability attribute (Section 14.4), the 23179 data is encoded in UTF-8. Whether created by an NFS client or 23180 created locally on the server, the data in a symbolic link is not 23181 interpreted (except possibly to check for proper UTF-8 encoding) when 23182 created, but is simply stored. 
23184 On success, the current filehandle retains its value. 23186 18.24.4. IMPLEMENTATION 23188 A symbolic link is nominally a pointer to another file. The data is 23189 not necessarily interpreted by the server, just stored in the file. 23190 It is possible for a client implementation to store a pathname that 23191 is not meaningful to the server operating system in a symbolic link. 23192 A READLINK operation returns the data to the client for 23193 interpretation. If different implementations want to share access to 23194 symbolic links, then they must agree on the interpretation of the 23195 data in the symbolic link. 23197 The READLINK operation is only allowed on objects of type NF4LNK. 23198 The server should return the error NFS4ERR_WRONG_TYPE if the object 23199 is not of type NF4LNK. 23201 18.25. Operation 28: REMOVE - Remove File System Object 23203 18.25.1. ARGUMENTS 23205 struct REMOVE4args { 23206 /* CURRENT_FH: directory */ 23207 component4 target; 23208 }; 23210 18.25.2. RESULTS 23212 struct REMOVE4resok { 23213 change_info4 cinfo; 23214 }; 23216 union REMOVE4res switch (nfsstat4 status) { 23217 case NFS4_OK: 23218 REMOVE4resok resok4; 23219 default: 23220 void; 23221 }; 23223 18.25.3. DESCRIPTION 23225 The REMOVE operation removes (deletes) a directory entry named by 23226 filename from the directory corresponding to the current filehandle. 23227 If the entry in the directory was the last reference to the 23228 corresponding file system object, the object may be destroyed. The 23229 directory may be either of type NF4DIR or NF4ATTRDIR. 23231 For the directory where the filename was removed, the server returns 23232 change_info4 information in cinfo. With the atomic field of the 23233 change_info4 data type, the server will indicate if the before and 23234 after change attributes were obtained atomically with respect to the 23235 removal. 
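A client can use the change_info4 returned by REMOVE to decide whether its cached view of the directory is still valid. The sketch below assumes a hypothetical per-directory cache object holding the last known change attribute: if the server sampled the attribute atomically around the removal and the "before" value matches what the client has cached, the client can keep its cached data and simply roll the attribute forward; otherwise it must invalidate.

```python
class DirCache:
    """Hypothetical per-directory cache keyed by the change attribute."""
    def __init__(self, change):
        self.change = change
        self.valid = True

    def invalidate(self):
        self.valid = False

def apply_change_info(cache, atomic, before, after):
    """Apply a change_info4 (atomic, before, after) to the cache.
    Returns True if the cached directory data was retained."""
    if atomic and cache.valid and cache.change == before:
        # No unseen modification intervened: keep the cache and
        # advance the change attribute.
        cache.change = after
        return True
    # Non-atomic sampling, or an unseen change slipped in: purge.
    cache.invalidate()
    cache.change = after
    return False
```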
23237 If the target has a length of zero, or if the target does not obey 23238 the UTF-8 definition (and the server is enforcing UTF-8 encoding; see 23239 Section 14.4), the error NFS4ERR_INVAL will be returned. 23241 On success, the current filehandle retains its value. 23243 18.25.4. IMPLEMENTATION 23245 NFSv3 required a different operator RMDIR for directory removal and 23246 REMOVE for non-directory removal. This allowed clients to skip 23247 checking the file type when being passed a non-directory delete 23248 system call (e.g., unlink() [24] in POSIX) to remove a directory, as 23249 well as the converse (e.g., a rmdir() on a non-directory) because 23250 they knew the server would check the file type. NFSv4.1 REMOVE can 23251 be used to delete any directory entry independent of its file type. 23252 The implementor of an NFSv4.1 client's entry points from the unlink() 23253 and rmdir() system calls should first check the file type against the 23254 types the system call is allowed to remove before sending a REMOVE 23255 operation. Alternatively, the implementor can produce a COMPOUND 23256 call that includes a LOOKUP/VERIFY sequence of operations to verify 23257 the file type before a REMOVE operation in the same COMPOUND call. 23259 The concept of last reference is server specific. However, if the 23260 numlinks field in the previous attributes of the object had the value 23261 1, the client should not rely on referring to the object via a 23262 filehandle. Likewise, the client should not rely on the resources 23263 (disk space, directory entry, and so on) formerly associated with the 23264 object becoming immediately available. Thus, if a client needs to be 23265 able to continue to access a file after using REMOVE to remove it, 23266 the client should take steps to make sure that the file will still be 23267 accessible. 
While the traditional mechanism used is to RENAME the 23268 file from its old name to a new hidden name, the NFSv4.1 OPEN 23269 operation MAY return a result flag, OPEN4_RESULT_PRESERVE_UNLINKED, 23270 which indicates to the client that the file will be preserved if the 23271 file has an outstanding open (see Section 18.16). 23273 If the server finds that the file is still open when the REMOVE 23274 arrives: 23276 o The server SHOULD NOT delete the file's directory entry if the 23277 file was opened with OPEN4_SHARE_DENY_WRITE or 23278 OPEN4_SHARE_DENY_BOTH. 23280 o If the file was not opened with OPEN4_SHARE_DENY_WRITE or 23281 OPEN4_SHARE_DENY_BOTH, the server SHOULD delete the file's 23282 directory entry. However, until the last CLOSE of the file, the 23283 server MAY continue to allow access to the file via its 23284 filehandle. 23286 o The server MUST NOT delete the directory entry if the reply from 23287 OPEN had the flag OPEN4_RESULT_PRESERVE_UNLINKED set. 23289 The server MAY implement its own restrictions on removal of a file 23290 while it is open. The server might disallow such a REMOVE (or a 23291 removal that occurs as part of RENAME). The conditions that 23292 influence the restrictions on removal of a file while it is still 23293 open include: 23295 o Whether certain access protocols (i.e., not just NFS) are holding 23296 the file open. 23298 o Whether particular options, access modes, or policies on the 23299 server are enabled. 23301 If a file has an outstanding OPEN and this prevents the removal of 23302 the file's directory entry, the error NFS4ERR_FILE_OPEN is returned. 23304 Where the determination above cannot be made definitively because 23305 delegations are being held, they MUST be recalled to allow processing 23306 of the REMOVE to continue.
When a delegation is held, the server has 23307 no reliable knowledge of the status of OPENs for that client, so 23308 unless there are files opened with the particular deny modes by 23309 clients without delegations, the determination cannot be made until 23310 delegations are recalled, and the operation cannot proceed until each 23311 sufficient delegation has been returned or revoked to allow the 23312 server to make a correct determination. 23314 In all cases in which delegations are recalled, the server is likely 23315 to return one or more NFS4ERR_DELAY errors while delegations remain 23316 outstanding. 23318 If the current filehandle designates a directory for which another 23319 client holds a directory delegation, then, unless the situation can 23320 be resolved by sending a notification, the directory delegation MUST 23321 be recalled, and the operation MUST NOT proceed until the delegation 23322 is returned or revoked. Except where this happens very quickly, one 23323 or more NFS4ERR_DELAY errors will be returned to requests made while 23324 delegation remains outstanding. 23326 When the current filehandle designates a directory for which one or 23327 more directory delegations exist, then, when those delegations 23328 request such notifications, NOTIFY4_REMOVE_ENTRY will be generated as 23329 a result of this operation. 23331 Note that when a remove occurs as a result of a RENAME, 23332 NOTIFY4_REMOVE_ENTRY will only be generated if the removal happens as 23333 a separate operation. In the case in which the removal is integrated 23334 and atomic with RENAME, the notification of the removal is integrated 23335 with notification for the RENAME. See the discussion of the 23336 NOTIFY4_RENAME_ENTRY notification in Section 20.4. 23338 18.26. Operation 29: RENAME - Rename Directory Entry 23340 18.26.1. 
ARGUMENTS 23342 struct RENAME4args { 23343 /* SAVED_FH: source directory */ 23344 component4 oldname; 23345 /* CURRENT_FH: target directory */ 23346 component4 newname; 23347 }; 23349 18.26.2. RESULTS 23351 struct RENAME4resok { 23352 change_info4 source_cinfo; 23353 change_info4 target_cinfo; 23354 }; 23356 union RENAME4res switch (nfsstat4 status) { 23357 case NFS4_OK: 23358 RENAME4resok resok4; 23359 default: 23360 void; 23361 }; 23363 18.26.3. DESCRIPTION 23365 The RENAME operation renames the object identified by oldname in the 23366 source directory corresponding to the saved filehandle, as set by the 23367 SAVEFH operation, to newname in the target directory corresponding to 23368 the current filehandle. The operation is required to be atomic to 23369 the client. Source and target directories MUST reside on the same 23370 file system on the server. On success, the current filehandle will 23371 continue to be the target directory. 23373 If the target directory already contains an entry with the name 23374 newname, the source object MUST be compatible with the target: either 23375 both are non-directories or both are directories and the target MUST 23376 be empty. If compatible, the existing target is removed before the 23377 rename occurs or, preferably, the target is removed atomically as 23378 part of the rename. See Section 18.25.4 for client and server 23379 actions whenever a target is removed. Note however that when the 23380 removal is performed atomically with the rename, certain parts of the 23381 removal described there are integrated with the rename. For example, 23382 notification of the removal will not be via a NOTIFY4_REMOVE_ENTRY 23383 but will be indicated as part of the NOTIFY4_ADD_ENTRY or 23384 NOTIFY4_RENAME_ENTRY generated by the rename. 23386 If the source object and the target are not compatible or if the 23387 target is a directory but not empty, the server will return the error 23388 NFS4ERR_EXIST. 
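The target-compatibility rule above can be sketched as follows (the error value is that assigned by the NFSv4.1 XDR; the helper and its argument encoding are hypothetical):

```python
NFS4_OK = 0
NFS4ERR_EXIST = 17

def check_rename_target(source_is_dir, target):
    """Check RENAME's target-compatibility rule. target is None when
    newname does not already exist in the target directory, otherwise
    an (is_dir, is_empty) pair describing the existing entry."""
    if target is None:
        return NFS4_OK
    target_is_dir, target_is_empty = target
    if source_is_dir != target_is_dir:
        # One is a directory and the other is not: incompatible.
        return NFS4ERR_EXIST
    if target_is_dir and not target_is_empty:
        # Both are directories, but the target is not empty.
        return NFS4ERR_EXIST
    # Compatible: the existing target is removed (preferably
    # atomically) as part of the rename.
    return NFS4_OK
```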
23390 If oldname and newname both refer to the same file (e.g., they might 23391 be hard links of each other), then unless the file is open (see 23392 Section 18.26.4), RENAME MUST perform no action and return NFS4_OK. 23394 For both directories involved in the RENAME, the server returns 23395 change_info4 information. With the atomic field of the change_info4 23396 data type, the server will indicate if the before and after change 23397 attributes were obtained atomically with respect to the rename. 23399 If oldname refers to a named attribute and the saved and current 23400 filehandles refer to different file system objects, the server will 23401 return NFS4ERR_XDEV just as if the saved and current filehandles 23402 represented directories on different file systems. 23404 If oldname or newname has a length of zero, or if oldname or newname 23405 does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be 23406 returned. 23408 18.26.4. IMPLEMENTATION 23410 The server MAY impose restrictions on the RENAME operation such that 23411 RENAME may not be done when the file being renamed is open or when 23412 that open is done by particular protocols, or with particular options 23413 or access modes. Similar restrictions may be applied when a file 23414 exists with the target name and is open. When RENAME is rejected 23415 because of such restrictions, the error NFS4ERR_FILE_OPEN is 23416 returned. 23418 When oldname and newname refer to the same file and that file is open 23419 in a fashion such that RENAME would normally be rejected with 23420 NFS4ERR_FILE_OPEN if oldname and newname were different files, then 23421 RENAME SHOULD be rejected with NFS4ERR_FILE_OPEN. 23423 If a server does implement such restrictions and those restrictions 23424 include cases of NFSv4 opens preventing successful execution of a 23425 rename, the server needs to recall any delegations that could hide 23426 the existence of opens relevant to that decision.
This is because 23427 when a client holds a delegation, the server might not have an 23428 accurate account of the opens for that client, since the client may 23429 execute OPENs and CLOSEs locally. The RENAME operation need only be 23430 delayed until a definitive result can be obtained. For example, if 23431 there are multiple delegations and one of them establishes an open 23432 whose presence would prevent the rename, given the server's 23433 semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon as 23434 that delegation is returned without waiting for other delegations to 23435 be returned. Similarly, if such opens are not associated with 23436 delegations, NFS4ERR_FILE_OPEN can be returned immediately with no 23437 delegation recall being done. 23439 If the current filehandle or the saved filehandle designates a 23440 directory for which another client holds a directory delegation, 23441 then, unless the situation can be resolved by sending a notification, 23442 the delegation MUST be recalled, and the operation cannot proceed 23443 until the delegation is returned or revoked. Except where this 23444 happens very quickly, one or more NFS4ERR_DELAY errors will be 23445 returned to requests made while delegation remains outstanding. 23447 When the current and saved filehandles are the same and they 23448 designate a directory for which one or more directory delegations 23449 exist, then, when those delegations request such notifications, a 23450 notification of type NOTIFY4_RENAME_ENTRY will be generated as a 23451 result of this operation. When oldname and newname refer to the same 23452 file, no notification is generated (because, as Section 18.26.3 23453 states, the server MUST take no action). When a file is removed 23454 because it has the same name as the target, if that removal is done 23455 atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification will 23456 not be generated.
Instead, the deletion of the file will be reported as part of the
NOTIFY4_RENAME_ENTRY notification.

When the current and saved filehandles are not the same:

o  If the current filehandle designates a directory for which one or
   more directory delegations exist, then, when those delegations
   request such notifications, NOTIFY4_ADD_ENTRY will be generated as
   a result of this operation.  When a file is removed because it has
   the same name as the target, if that removal is done atomically
   with the rename, a NOTIFY4_REMOVE_ENTRY notification will not be
   generated.  Instead, the deletion of the file will be reported as
   part of the NOTIFY4_ADD_ENTRY notification.

o  If the saved filehandle designates a directory for which one or
   more directory delegations exist, then, when those delegations
   request such notifications, NOTIFY4_REMOVE_ENTRY will be generated
   as a result of this operation.

If the object being renamed has file delegations held by clients
other than the one doing the RENAME, the delegations MUST be
recalled, and the operation cannot proceed until each such delegation
is returned or revoked.  Note that in the case of multiply linked
files, the delegation recall requirement applies even if the
delegation was obtained through a different name than the one being
renamed.  In all cases in which delegations are recalled, the server
is likely to return one or more NFS4ERR_DELAY errors while the
delegation(s) remains outstanding, although it might not do that if
the delegations are returned quickly.

The RENAME operation must be atomic to the client.  The statement
"source and target directories MUST reside on the same file system on
the server" means that the fsid fields in the attributes for the
directories are the same.  If they reside on different file systems,
the error NFS4ERR_XDEV is returned.
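The notification rules above can be summarized as a small decision function. This is an illustrative sketch only: the function name and the returned dictionary shape are invented for this example, and a real server would work from XDR-level delegation and notification-request state rather than booleans.

```python
# Notification type names from the NFSv4.1 notify_type4 enumeration.
NOTIFY4_ADD_ENTRY = "NOTIFY4_ADD_ENTRY"
NOTIFY4_REMOVE_ENTRY = "NOTIFY4_REMOVE_ENTRY"
NOTIFY4_RENAME_ENTRY = "NOTIFY4_RENAME_ENTRY"


def rename_notifications(src_dir_fh, dst_dir_fh, same_file):
    """Map each affected directory filehandle to the notification types
    a RENAME generates, assuming directory delegations on those
    directories requested these notifications.  An atomically replaced
    target produces no separate NOTIFY4_REMOVE_ENTRY; its removal is
    reported inside the RENAME_ENTRY / ADD_ENTRY notification."""
    if same_file:
        # oldname and newname refer to the same file: the server MUST
        # take no action, so no notification is generated.
        return {}
    if src_dir_fh == dst_dir_fh:
        # Same directory: one RENAME_ENTRY notification covers it.
        return {src_dir_fh: [NOTIFY4_RENAME_ENTRY]}
    # Different directories: removal reported against the source,
    # addition against the destination.
    return {src_dir_fh: [NOTIFY4_REMOVE_ENTRY],
            dst_dir_fh: [NOTIFY4_ADD_ENTRY]}
```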
Based on the value of the fh_expire_type attribute for the object,
the filehandle may or may not expire on a RENAME.  However, server
implementors are strongly encouraged to attempt to keep filehandles
from expiring in this fashion.

On some servers, the file names "." and ".." are illegal as either
oldname or newname, and will result in the error NFS4ERR_BADNAME.  In
addition, on many servers the case of oldname or newname being an
alias for the source directory will be checked for.  Such servers
will return the error NFS4ERR_INVAL in these cases.

If the source or target filehandle is not a directory, the server
will return NFS4ERR_NOTDIR.

18.27.  Operation 31: RESTOREFH - Restore Saved Filehandle

18.27.1.  ARGUMENTS

   /* SAVED_FH: */
   void;

18.27.2.  RESULTS

   struct RESTOREFH4res {
           /*
            * If status is NFS4_OK,
            *     new CURRENT_FH: value of saved fh
            */
           nfsstat4        status;
   };

18.27.3.  DESCRIPTION

The RESTOREFH operation sets the current filehandle and stateid to
the values in the saved filehandle and stateid.  If there is no saved
filehandle, then the server will return the error
NFS4ERR_NOFILEHANDLE.

See Section 16.2.3.1.1 for more details on the current filehandle.

See Section 16.2.3.1.2 for more details on the current stateid.

18.27.4.  IMPLEMENTATION

Operations like OPEN and LOOKUP use the current filehandle to
represent a directory and replace it with a new filehandle.  Assuming
that the previous filehandle was saved with a SAVEFH operator, the
previous filehandle can be restored as the current filehandle.
This is commonly used to obtain post-operation attributes for the
directory, e.g.,

   PUTFH (directory filehandle)
   SAVEFH
   GETATTR attrbits     (pre-op dir attrs)
   CREATE optbits "foo" attrs
   GETATTR attrbits     (file attributes)
   RESTOREFH
   GETATTR attrbits     (post-op dir attrs)

18.28.  Operation 32: SAVEFH - Save Current Filehandle

18.28.1.  ARGUMENTS

   /* CURRENT_FH: */
   void;

18.28.2.  RESULTS

   struct SAVEFH4res {
           /*
            * If status is NFS4_OK,
            *    new SAVED_FH: value of current fh
            */
           nfsstat4        status;
   };

18.28.3.  DESCRIPTION

The SAVEFH operation saves the current filehandle and stateid.  If a
previous filehandle was saved, then it is no longer accessible.  The
saved filehandle can be restored as the current filehandle with the
RESTOREFH operator.

On success, the current filehandle retains its value.

See Section 16.2.3.1.1 for more details on the current filehandle.

See Section 16.2.3.1.2 for more details on the current stateid.

18.28.4.  IMPLEMENTATION

18.29.  Operation 33: SECINFO - Obtain Available Security

18.29.1.  ARGUMENTS

   struct SECINFO4args {
           /* CURRENT_FH: directory */
           component4     name;
   };

18.29.2.  RESULTS

   /*
    * From RFC 2203
    */
   enum rpc_gss_svc_t {
           RPC_GSS_SVC_NONE        = 1,
           RPC_GSS_SVC_INTEGRITY   = 2,
           RPC_GSS_SVC_PRIVACY     = 3
   };

   struct rpcsec_gss_info {
           sec_oid4        oid;
           qop4            qop;
           rpc_gss_svc_t   service;
   };

   /* RPCSEC_GSS has a value of '6' - See RFC 2203 */
   union secinfo4 switch (uint32_t flavor) {
    case RPCSEC_GSS:
            rpcsec_gss_info flavor_info;
    default:
            void;
   };

   typedef secinfo4 SECINFO4resok<>;

   union SECINFO4res switch (nfsstat4 status) {
    case NFS4_OK:
            /* CURRENTFH: consumed */
            SECINFO4resok resok4;
    default:
            void;
   };

18.29.3.  DESCRIPTION

The SECINFO operation is used by the client to obtain a list of valid
RPC authentication flavors for a specific directory filehandle, file
name pair.  SECINFO should apply the same access methodology used for
LOOKUP when evaluating the name.  Therefore, if the requester does
not have the appropriate access to LOOKUP the name, then SECINFO MUST
behave the same way and return NFS4ERR_ACCESS.

The result will contain an array that represents the security
mechanisms available, with an order corresponding to the server's
preferences, the most preferred being first in the array.  The client
is free to pick whatever security mechanism it both desires and
supports, or to pick in the server's preference order the first one
it supports.  The array entries are represented by the secinfo4
structure.  The field 'flavor' will contain a value of AUTH_NONE,
AUTH_SYS (as defined in RFC 5531 [3]), or RPCSEC_GSS (as defined in
RFC 2203 [4]).  The field flavor can also be any other security
flavor registered with IANA.

For the flavors AUTH_NONE and AUTH_SYS, no additional security
information is returned.
The same is true of many (if not most) other security flavors,
including AUTH_DH.  For a return value of RPCSEC_GSS, a security
triple is returned that contains the mechanism object identifier
(OID, as defined in RFC 2743 [7]), the quality of protection (as
defined in RFC 2743 [7]), and the service type (as defined in
RFC 2203 [4]).  It is possible for SECINFO to return multiple entries
with flavor equal to RPCSEC_GSS with different security triple
values.

On success, the current filehandle is consumed (see
Section 2.6.3.1.1.8), and if the next operation after SECINFO tries
to use the current filehandle, that operation will fail with the
status NFS4ERR_NOFILEHANDLE.

If the name has a length of zero, or if the name does not obey the
UTF-8 definition (assuming UTF-8 capabilities are enabled; see
Section 14.4), the error NFS4ERR_INVAL will be returned.

See Section 2.6 for additional information on the use of SECINFO.

18.29.4.  IMPLEMENTATION

The SECINFO operation is expected to be used by the NFS client when
the error value of NFS4ERR_WRONGSEC is returned from another NFS
operation.  This signifies to the client that the server's security
policy is different from what the client is currently using.  At this
point, the client is expected to obtain a list of possible security
flavors and choose what best suits its policies.

As mentioned, the server's security policies will determine when a
client request receives NFS4ERR_WRONGSEC.  See Table 8 for a list of
operations that can return NFS4ERR_WRONGSEC.  In addition, when
READDIR returns attributes, the rdattr_error (Section 5.8.1.12) can
contain NFS4ERR_WRONGSEC.  Note that CREATE and REMOVE MUST NOT
return NFS4ERR_WRONGSEC.
The rationale for CREATE is that unless the target name exists, it
cannot have a separate security policy from the parent directory, and
the security policy of the parent was checked when its filehandle was
injected into the COMPOUND request's operations stream (for similar
reasons, an OPEN operation that creates the target MUST NOT return
NFS4ERR_WRONGSEC).  If the target name exists, while it might have a
separate security policy, that is irrelevant because CREATE MUST
return NFS4ERR_EXIST.  The rationale for REMOVE is that while that
target might have a separate security policy, the target is going to
be removed, and so the security policy of the parent trumps that of
the object being removed.  RENAME and LINK MAY return
NFS4ERR_WRONGSEC, but the NFS4ERR_WRONGSEC error applies only to the
saved filehandle (see Section 2.6.3.1.2).  Any NFS4ERR_WRONGSEC error
on the current filehandle used by LINK and RENAME MUST be returned by
the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH operation that injected
the current filehandle.

With the exception of LINK and RENAME, the set of operations that can
return NFS4ERR_WRONGSEC represents the point at which the client can
inject a filehandle into the "current filehandle" at the server.  The
filehandle is either provided by the client (PUTFH, PUTPUBFH,
PUTROOTFH), generated as a result of a name-to-filehandle translation
(LOOKUP and OPEN), or generated from the saved filehandle via
RESTOREFH.  As Section 2.6.3.1.1.1 states, a put filehandle operation
followed by SAVEFH MUST NOT return NFS4ERR_WRONGSEC.  Thus, the
RESTOREFH operation, under certain conditions (see
Section 2.6.3.1.1), is permitted to return NFS4ERR_WRONGSEC so that
security policies can be honored.

The READDIR operation will not directly return the NFS4ERR_WRONGSEC
error.
However, if the READDIR request included a request for attributes, it
is possible that the READDIR request's security triple did not match
that of a directory entry.  If this is the case and the client has
requested the rdattr_error attribute, the server will return the
NFS4ERR_WRONGSEC error in rdattr_error for the entry.

To resolve an error return of NFS4ERR_WRONGSEC, the client does the
following:

o  For LOOKUP and OPEN, the client will use SECINFO with the same
   current filehandle and name as provided in the original LOOKUP or
   OPEN to enumerate the available security triples.

o  For the rdattr_error, the client will use SECINFO with the same
   current filehandle as provided in the original READDIR.  The name
   passed to SECINFO will be that of the directory entry (as returned
   from READDIR) that had the NFS4ERR_WRONGSEC error in the
   rdattr_error attribute.

o  For PUTFH, PUTROOTFH, PUTPUBFH, RESTOREFH, LINK, and RENAME, the
   client will use SECINFO_NO_NAME { style =
   SECINFO_STYLE4_CURRENT_FH }.  The client will prefix the
   SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or
   PUTROOTFH operation that provides the filehandle originally
   provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH
   operation.

   NOTE: In NFSv4.0, the client was required to use SECINFO, and had
   to reconstruct the parent of the original filehandle and the
   component name of the original filehandle.  The introduction in
   NFSv4.1 of SECINFO_NO_NAME obviates the need for reconstruction.

o  For LOOKUPP, the client will use SECINFO_NO_NAME { style =
   SECINFO_STYLE4_PARENT } and provide the filehandle that equals the
   filehandle originally provided to LOOKUPP.

See Section 21 for a discussion on the recommendations for the
security flavor used by SECINFO and SECINFO_NO_NAME.

18.30.  Operation 34: SETATTR - Set Attributes

18.30.1.  ARGUMENTS

   struct SETATTR4args {
           /* CURRENT_FH: target object */
           stateid4        stateid;
           fattr4          obj_attributes;
   };

18.30.2.  RESULTS

   struct SETATTR4res {
           nfsstat4        status;
           bitmap4         attrsset;
   };

18.30.3.  DESCRIPTION

The SETATTR operation changes one or more of the attributes of a
file system object.  The new attributes are specified with a bitmap
and the attributes that follow the bitmap in bit order.

The stateid argument for SETATTR is used to provide byte-range
locking context that is necessary for SETATTR requests that set the
size attribute.  Since setting the size attribute modifies the file's
data, it has the same locking requirements as a corresponding WRITE.
Any SETATTR that sets the size attribute is incompatible with a share
reservation that specifies OPEN4_SHARE_DENY_WRITE.  The area between
the old end-of-file and the new end-of-file is considered to be
modified just as would have been the case had the area in question
been specified as the target of WRITE, for the purpose of checking
conflicts with byte-range locks, for those cases in which a server is
implementing mandatory byte-range locking behavior.  A valid stateid
SHOULD always be specified.  When the file size attribute is not set,
the special stateid consisting of all bits equal to zero MAY be
passed.

On either success or failure of the operation, the server will
return the attrsset bitmask to represent what (if any) attributes
were successfully set.  The attrsset in the response is a subset of
the attrmask field of the obj_attributes field in the argument.

On success, the current filehandle retains its value.

18.30.4.  IMPLEMENTATION

If the request specifies the owner attribute to be set, the server
SHOULD allow the operation to succeed if the current owner of the
object matches the value specified in the request.  Some servers may
be implemented in such a way as to prohibit the setting of the owner
attribute unless the requester has privilege to do so.  If the server
is lenient in this one case of matching owner values, the client
implementation may be simplified in cases of creation of an object
(e.g., an exclusive create via OPEN) followed by a SETATTR.

The file size attribute is used to request changes to the size of a
file.  A value of zero causes the file to be truncated, a value less
than the current size of the file causes data from the new size to
the end of the file to be discarded, and a size greater than the
current size of the file causes logically zeroed data bytes to be
added to the end of the file.  Servers are free to implement this
using unallocated bytes (holes) or allocated data bytes set to zero.
Clients should not make any assumptions regarding a server's
implementation of this feature, beyond that the bytes in the affected
byte-range returned by READ will be zeroed.  Servers MUST support
extending the file size via SETATTR.

SETATTR is not guaranteed to be atomic.  A failed SETATTR may
partially change a file's attributes, hence the reason why the reply
always includes the status and the list of attributes that were set.

If the object whose attributes are being changed has a file
delegation that is held by a client other than the one doing the
SETATTR, the delegation(s) must be recalled, and the operation cannot
proceed to actually change an attribute until each such delegation is
returned or revoked.
In all cases in which delegations are recalled, the server is likely
to return one or more NFS4ERR_DELAY errors while the delegation(s)
remains outstanding, although it might not do that if the delegations
are returned quickly.

If the object whose attributes are being set is a directory and
another client holds a directory delegation for that directory, then,
if enabled, asynchronous notifications will be generated when the set
of attributes changed has a non-null intersection with the set of
attributes for which notification is requested.  Notifications of
type NOTIFY4_CHANGE_DIR_ATTRS will be sent to the appropriate
client(s), but the SETATTR is not delayed by waiting for these
notifications to be sent.

If the object whose attributes are being set is a member of the
directory for which another client holds a directory delegation, then
asynchronous notifications will be generated when the set of
attributes changed has a non-null intersection with the set of
attributes for which notification is requested.  Notifications of
type NOTIFY4_CHANGE_CHILD_ATTRS will be sent to the appropriate
clients, but the SETATTR is not delayed by waiting for these
notifications to be sent.

Changing the size of a file with SETATTR indirectly changes the
time_modify and change attributes.  A client must account for this,
as size changes can result in data deletion.

The attributes time_access_set and time_modify_set are write-only
attributes constructed as a switched union so the client can direct
the server in setting the time values.  If the switched union
specifies SET_TO_CLIENT_TIME4, the client has provided an nfstime4 to
be used for the operation.  If the switched union does not specify
SET_TO_CLIENT_TIME4, the server is to use its current time for the
SETATTR operation.
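The switched-union behavior of time_access_set and time_modify_set can be sketched as follows. This is a hedged illustration: the helper name and keyword-argument interface are invented here, and a real server would decode the XDR settime4 union rather than take Python arguments.

```python
import time

# Arm discriminants of the settime4 switched union (time_how4).
SET_TO_SERVER_TIME4 = 0
SET_TO_CLIENT_TIME4 = 1


def resolve_settime(how, client_time=None, server_now=None):
    """Pick the timestamp the server stores for time_access_set /
    time_modify_set.  SET_TO_CLIENT_TIME4 carries a client-supplied
    nfstime4; any other arm means the server uses its own clock."""
    if how == SET_TO_CLIENT_TIME4:
        if client_time is None:
            raise ValueError("SET_TO_CLIENT_TIME4 requires an nfstime4")
        return client_time
    # Server time arm: fall back to the server clock.
    return server_now if server_now is not None else time.time()
```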
If server and client times differ, programs that compare client time
to file times can break.  A time synchronization protocol should be
used to limit client/server time skew.

Use of a COMPOUND containing a VERIFY operation specifying only the
change attribute, immediately followed by a SETATTR, provides a means
whereby a client may specify a request that emulates the
functionality of the SETATTR guard mechanism of NFSv3.  Since the
function of the guard mechanism is to avoid changes to the file
attributes based on stale information, delays between checking of the
guard condition and the setting of the attributes have the potential
to compromise this function, as would the corresponding delay in the
NFSv4 emulation.  Therefore, NFSv4.1 servers SHOULD take care to
avoid such delays, to the degree possible, when executing such a
request.

If the server does not support an attribute as requested by the
client, the server SHOULD return NFS4ERR_ATTRNOTSUPP.

A mask of the attributes actually set is returned by SETATTR in all
cases.  That mask MUST NOT include attribute bits not requested to be
set by the client.  If the attribute masks in the request and reply
are equal, the status field in the reply MUST be NFS4_OK.

18.31.  Operation 37: VERIFY - Verify Same Attributes

18.31.1.  ARGUMENTS

   struct VERIFY4args {
           /* CURRENT_FH: object */
           fattr4          obj_attributes;
   };

18.31.2.  RESULTS

   struct VERIFY4res {
           nfsstat4        status;
   };

18.31.3.  DESCRIPTION

The VERIFY operation is used to verify that attributes have the
value assumed by the client before proceeding with the following
operations in the COMPOUND request.  If any of the attributes do not
match, then the error NFS4ERR_NOT_SAME must be returned.
The current filehandle retains its value after successful completion
of the operation.

18.31.4.  IMPLEMENTATION

One possible use of the VERIFY operation is the following series of
operations.  With this, the client is attempting to verify that the
file being removed will match what the client expects to be removed.
This series can help prevent the unintended deletion of a file.

   PUTFH (directory filehandle)
   LOOKUP (file name)
   VERIFY (filehandle == fh)
   PUTFH (directory filehandle)
   REMOVE (file name)

This series does not prevent a second client from removing and
creating a new file in the middle of this sequence, but it does help
avoid the unintended result.

In the case that a RECOMMENDED attribute is specified in the VERIFY
operation and the server does not support that attribute for the file
system object, the error NFS4ERR_ATTRNOTSUPP is returned to the
client.

When the attribute rdattr_error or any set-only attribute (e.g.,
time_modify_set) is specified, the error NFS4ERR_INVAL is returned to
the client.

18.32.  Operation 38: WRITE - Write to File

18.32.1.  ARGUMENTS

   enum stable_how4 {
           UNSTABLE4       = 0,
           DATA_SYNC4      = 1,
           FILE_SYNC4      = 2
   };

   struct WRITE4args {
           /* CURRENT_FH: file */
           stateid4        stateid;
           offset4         offset;
           stable_how4     stable;
           opaque          data<>;
   };

18.32.2.  RESULTS

   struct WRITE4resok {
           count4          count;
           stable_how4     committed;
           verifier4       writeverf;
   };

   union WRITE4res switch (nfsstat4 status) {
    case NFS4_OK:
            WRITE4resok     resok4;
    default:
            void;
   };

18.32.3.  DESCRIPTION

The WRITE operation is used to write data to a regular file.  The
target file is specified by the current filehandle.  The offset
specifies the offset where the data should be written.
An offset of zero specifies that the write should start at the
beginning of the file.  The count, as encoded as part of the opaque
data parameter, represents the number of bytes of data that are to be
written.  If the count is zero, the WRITE will succeed and return a
count of zero, subject to permissions checking.  The server MAY write
fewer bytes than requested by the client.

The client specifies with the stable parameter the method by which
the data is to be processed by the server.  If stable is FILE_SYNC4,
the server MUST commit the data written plus all file system metadata
to stable storage before returning results.  This corresponds to the
NFSv2 protocol semantics.  Any other behavior constitutes a protocol
violation.  If stable is DATA_SYNC4, then the server MUST commit all
of the data to stable storage and enough of the metadata to retrieve
the data before returning.  The server implementor is free to
implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a
possible performance drop.  If stable is UNSTABLE4, the server is
free to commit any part of the data and the metadata to stable
storage, including all or none, before returning a reply to the
client.  There is no guarantee whether or when any uncommitted data
will subsequently be committed to stable storage.  The only
guarantees made by the server are that it will not destroy any data
without changing the value of writeverf and that it will not commit
the data and metadata at a level less than that requested by the
client.

Except when special stateids are used, the stateid value for a WRITE
request represents a value returned from a previous byte-range LOCK
or OPEN request or the stateid associated with a delegation.
The stateid identifies the associated owners if any and is used by
the server to verify that the associated locks are still valid (e.g.,
have not been revoked).

Upon successful completion, the following results are returned.  The
count result is the number of bytes of data written to the file.  The
server may write fewer bytes than requested.  If so, the actual
number of bytes written, starting at the location given by offset, is
returned.

The server also returns an indication of the level of commitment of
the data and metadata via committed.  Per Table 11,

o  The server MAY commit the data at a stronger level than requested.

o  The server MUST commit the data at a level at least as high as
   that committed.

Valid combinations of the fields stable in the request and committed
in the reply are shown below.

   +------------+-----------------------------------+
   | stable     | committed                         |
   +------------+-----------------------------------+
   | UNSTABLE4  | FILE_SYNC4, DATA_SYNC4, UNSTABLE4 |
   | DATA_SYNC4 | FILE_SYNC4, DATA_SYNC4            |
   | FILE_SYNC4 | FILE_SYNC4                        |
   +------------+-----------------------------------+

                        Table 11

The final portion of the result is the field writeverf.  This field
is the write verifier and is a cookie that the client can use to
determine whether a server has changed instance state (e.g., server
restart) between a call to WRITE and a subsequent call to either
WRITE or COMMIT.  This cookie MUST be unchanged during a single
instance of the NFSv4.1 server and MUST be unique between instances
of the NFSv4.1 server.  If the cookie changes, then the client MUST
assume that any data written with an UNSTABLE4 value for committed
and an old writeverf in the reply has been lost and will need to be
recovered.
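Because the stable_how4 enumeration is ordered UNSTABLE4 < DATA_SYNC4 < FILE_SYNC4, the Table 11 relationship reduces to "committed must be at least as strong as stable". The following client-side sketch checks that, and applies the verifier rule; the helper names are invented for illustration.

```python
# stable_how4 values from the WRITE XDR definition.
UNSTABLE4, DATA_SYNC4, FILE_SYNC4 = 0, 1, 2


def reply_is_valid(stable, committed):
    """Table 11: the reply's committed level must be at least as
    strong as the requested stable level."""
    return committed >= stable


def needs_recovery(committed, reply_writeverf, current_writeverf):
    """Data whose reply carried a weaker-than-FILE_SYNC4 commitment and
    an old write verifier must be treated as possibly lost (and
    retransmitted) once the server's verifier changes."""
    return committed != FILE_SYNC4 and reply_writeverf != current_writeverf
```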
If a client writes data to the server with the stable argument set to
UNSTABLE4 and the reply yields a committed response of DATA_SYNC4 or
UNSTABLE4, the client will follow up some time in the future with a
COMMIT operation to synchronize outstanding asynchronous data and
metadata with the server's stable storage, barring client error.  It
is possible that, due to client crash or other error, a subsequent
COMMIT will not be received by the server.

For a WRITE with a stateid value of all bits equal to zero, the
server MAY allow the WRITE to be serviced subject to mandatory byte-
range locks or the current share deny modes for the file.  For a
WRITE with a stateid value of all bits equal to 1, the server MUST
NOT allow the WRITE operation to bypass locking checks at the server
and otherwise is treated as if a stateid of all bits equal to zero
were used.

On success, the current filehandle retains its value.

18.32.4.  IMPLEMENTATION

It is possible for the server to write fewer bytes of data than
requested by the client.  In this case, the server SHOULD NOT return
an error unless no data was written at all.  If the server writes
less than the number of bytes specified, the client will need to send
another WRITE to write the remaining data.

It is assumed that the act of writing data to a file will cause the
time_modified and change attributes of the file to be updated.
However, these attributes SHOULD NOT be changed unless the contents
of the file are changed.  Thus, a WRITE request with count set to
zero SHOULD NOT cause the time_modified and change attributes of the
file to be updated.

Stable storage is persistent storage that survives:

1.  Repeated power failures.

2.  Hardware failures (of any board, power supply, etc.).

3.  Repeated software crashes and restarts.
This definition does not address failure of the stable storage
module itself.

The verifier is defined to allow a client to detect different
instances of an NFSv4.1 protocol server over which cached,
uncommitted data may be lost.  In the most likely case, the verifier
allows the client to detect server restarts.  This information is
required so that the client can safely determine whether the server
could have lost cached data.  If the server fails unexpectedly and
the client has uncommitted data from previous WRITE requests (done
with the stable argument set to UNSTABLE4 and in which the result
committed was returned as UNSTABLE4 as well), the server might not
have flushed cached data to stable storage.  The burden of recovery
is on the client, and the client will need to retransmit the data to
the server.

A suggested verifier would be to use the time that the server was
last started (if restarting the server results in lost buffers).

The reply's committed field allows the client to do more effective
caching.  If the server is committing all WRITE requests to stable
storage, then it SHOULD return with committed set to FILE_SYNC4,
regardless of the value of the stable field in the arguments.  A
server that uses an NVRAM accelerator may choose to implement this
policy.  The client can use this to increase the effectiveness of the
cache by discarding cached data that has already been committed on
the server.

Some implementations may return NFS4ERR_NOSPC instead of
NFS4ERR_DQUOT when a user's quota is exceeded.

In the case that the current filehandle is of type NF4DIR, the server
will return NFS4ERR_ISDIR.  If the current file is a symbolic link,
the error NFS4ERR_SYMLINK will be returned.
Otherwise, if the current filehandle does not designate an ordinary
file, the server will return NFS4ERR_WRONG_TYPE.

If mandatory byte-range locking is in effect for the file, and the
corresponding byte-range of the data to be written to the file is
READ_LT or WRITE_LT locked by an owner that is not associated with
the stateid, the server MUST return NFS4ERR_LOCKED.  If so, the
client MUST check if the owner corresponding to the stateid used with
the WRITE operation has a conflicting READ_LT lock that overlaps with
the byte-range that was to be written.  If the stateid's owner has no
conflicting READ_LT lock, then the client SHOULD try to get the
appropriate write byte-range lock via the LOCK operation before
re-attempting the WRITE.  When the WRITE completes, the client SHOULD
release the byte-range lock via LOCKU.

If the stateid's owner had a conflicting READ_LT lock, then the
client has no choice but to return an error to the application that
attempted the WRITE.  The reason is that since the stateid's owner
had a READ_LT lock, either the server attempted to temporarily
effectively upgrade this READ_LT lock to a WRITE_LT lock or the
server has no upgrade capability.  If the server attempted to upgrade
the READ_LT lock and failed, it is pointless for the client to
re-attempt the upgrade via the LOCK operation, because there might be
another client also trying to upgrade.  If two clients are blocked
trying to upgrade the same lock, the clients deadlock.  If the server
has no upgrade capability, then it is pointless to try a LOCK
operation to upgrade.

If one or more other clients have delegations for the file being
written, those delegations MUST be recalled, and the operation cannot
proceed until those delegations are returned or revoked.
Except where this happens very quickly, one or more NFS4ERR_DELAY
errors will be returned to requests made while the delegation remains
outstanding.  Normally, delegations will not be recalled as a result
of a WRITE operation since the recall will occur as a result of an
earlier OPEN.  However, since it is possible for a WRITE to be done
with a special stateid, the server needs to check for this case even
though the client should have done an OPEN previously.

18.33.  Operation 40: BACKCHANNEL_CTL - Backchannel Control

18.33.1.  ARGUMENT

   typedef opaque gsshandle4_t<>;

   struct gss_cb_handles4 {
           rpc_gss_svc_t           gcbp_service; /* RFC 2203 */
           gsshandle4_t            gcbp_handle_from_server;
           gsshandle4_t            gcbp_handle_from_client;
   };

   union callback_sec_parms4 switch (uint32_t cb_secflavor) {
    case AUTH_NONE:
            void;
    case AUTH_SYS:
            authsys_parms   cbsp_sys_cred; /* RFC 1831 */
    case RPCSEC_GSS:
            gss_cb_handles4 cbsp_gss_handles;
   };

   struct BACKCHANNEL_CTL4args {
           uint32_t                bca_cb_program;
           callback_sec_parms4     bca_sec_parms<>;
   };

18.33.2.  RESULT

   struct BACKCHANNEL_CTL4res {
           nfsstat4                bcr_status;
   };

18.33.3.  DESCRIPTION

The BACKCHANNEL_CTL operation replaces the backchannel's callback
program number and adds (not replaces) RPCSEC_GSS handles for use by
the backchannel.

The arguments of the BACKCHANNEL_CTL call are a subset of the
CREATE_SESSION parameters.  In the arguments of BACKCHANNEL_CTL, the
bca_cb_program and bca_sec_parms fields correspond respectively to
the csa_cb_program and csa_sec_parms fields of the arguments of
CREATE_SESSION (Section 18.36).

BACKCHANNEL_CTL MUST appear in a COMPOUND that starts with SEQUENCE.
If the RPCSEC_GSS handle identified by gcbp_handle_from_server does not exist on the server, the server MUST return NFS4ERR_NOENT.

If an RPCSEC_GSS handle is using the SSV context (see Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a common SSV GSS context, there are security considerations specific to this situation discussed in Section 2.10.10.

18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection with Session

18.34.1. ARGUMENT

   enum channel_dir_from_client4 {
           CDFC4_FORE              = 0x1,
           CDFC4_BACK              = 0x2,
           CDFC4_FORE_OR_BOTH      = 0x3,
           CDFC4_BACK_OR_BOTH      = 0x7
   };

   struct BIND_CONN_TO_SESSION4args {
           sessionid4      bctsa_sessid;

           channel_dir_from_client4
                           bctsa_dir;

           bool            bctsa_use_conn_in_rdma_mode;
   };

18.34.2. RESULT

   enum channel_dir_from_server4 {
           CDFS4_FORE      = 0x1,
           CDFS4_BACK      = 0x2,
           CDFS4_BOTH      = 0x3
   };

   struct BIND_CONN_TO_SESSION4resok {
           sessionid4      bctsr_sessid;

           channel_dir_from_server4
                           bctsr_dir;

           bool            bctsr_use_conn_in_rdma_mode;
   };

   union BIND_CONN_TO_SESSION4res
   switch (nfsstat4 bctsr_status) {

   case NFS4_OK:
           BIND_CONN_TO_SESSION4resok
                           bctsr_resok4;

   default: void;
   };

18.34.3. DESCRIPTION

BIND_CONN_TO_SESSION is used to associate additional connections with a session. It MUST be used on the connection being associated with the session. It MUST be the only operation in the COMPOUND procedure. If SP4_NONE (Section 18.35) state protection is used, any principal, security flavor, or RPCSEC_GSS context MAY be used to invoke the operation. If SP4_MACH_CRED is used, RPCSEC_GSS MUST be used with the integrity or privacy services, using the principal that created the client ID.
If SP4_SSV is used, RPCSEC_GSS with the SSV GSS mechanism (Section 2.10.9) and integrity or privacy MUST be used.

If, when the client ID was created, the client opted for SP4_NONE state protection, the client is not required to use BIND_CONN_TO_SESSION to associate the connection with the session, unless the client wishes to associate the connection with the backchannel. When SP4_NONE protection is used, simply sending a COMPOUND request with a SEQUENCE operation is sufficient to associate the connection with the session specified in SEQUENCE.

The field bctsa_dir indicates whether the client wants to associate the connection with the fore channel or the backchannel or both channels. The value CDFC4_FORE_OR_BOTH indicates that the client wants to associate the connection with both the fore channel and backchannel, but will accept the connection being associated to just the fore channel. The value CDFC4_BACK_OR_BOTH indicates that the client wants to associate with both the fore channel and backchannel, but will accept the connection being associated with just the backchannel. The server replies in bctsr_dir which channel(s) the connection is associated with. If the client specified CDFC4_FORE, the server MUST return CDFS4_FORE. If the client specified CDFC4_BACK, the server MUST return CDFS4_BACK. If the client specified CDFC4_FORE_OR_BOTH, the server MUST return CDFS4_FORE or CDFS4_BOTH. If the client specified CDFC4_BACK_OR_BOTH, the server MUST return CDFS4_BACK or CDFS4_BOTH.

See the CREATE_SESSION operation (Section 18.36), and the description of the argument csa_use_conn_in_rdma_mode to understand bctsa_use_conn_in_rdma_mode, and the description of csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode.
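The bctsa_dir/bctsr_dir rules above can be captured in a small validity check (an illustrative sketch; the constant names and values follow the XDR, while the table and function are hypothetical):

```python
# Channel-direction constants from the BIND_CONN_TO_SESSION XDR.
CDFC4_FORE, CDFC4_BACK = 0x1, 0x2
CDFC4_FORE_OR_BOTH, CDFC4_BACK_OR_BOTH = 0x3, 0x7
CDFS4_FORE, CDFS4_BACK, CDFS4_BOTH = 0x1, 0x2, 0x3

# For each client request, the set of replies the server MUST choose from.
VALID_REPLIES = {
    CDFC4_FORE:         {CDFS4_FORE},
    CDFC4_BACK:         {CDFS4_BACK},
    CDFC4_FORE_OR_BOTH: {CDFS4_FORE, CDFS4_BOTH},
    CDFC4_BACK_OR_BOTH: {CDFS4_BACK, CDFS4_BOTH},
}

def reply_is_valid(bctsa_dir, bctsr_dir):
    """Check that the server's bctsr_dir honors the client's bctsa_dir."""
    return bctsr_dir in VALID_REPLIES[bctsa_dir]
```

A client-side sanity check of this form could reject a non-conforming server reply before updating its channel bookkeeping.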
Invoking BIND_CONN_TO_SESSION on a connection already associated with the specified session has no effect, and the server MUST respond with NFS4_OK, unless the client is demanding changes to the set of channels the connection is associated with. If so, the server MUST return NFS4ERR_INVAL.

18.34.4. IMPLEMENTATION

If a session's channel loses all connections, depending on the client ID's state protection and type of channel, the client might need to use BIND_CONN_TO_SESSION to associate a new connection. If the server restarted and does not keep the reply cache in stable storage, the server will not recognize the session ID. The client will ultimately have to invoke EXCHANGE_ID to create a new client ID and session.

Suppose SP4_SSV state protection is being used, and BIND_CONN_TO_SESSION is among the operations included in the spo_must_enforce set when the client ID was created (Section 18.35). If so, there is an issue if SET_SSV is sent, no response is returned, and the last connection associated with the client ID drops. The client, per the sessions model, MUST retry the SET_SSV. But it needs a new connection to do so, and MUST associate that connection with the session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS mechanism. The problem is that the RPCSEC_GSS message integrity codes use a subkey derived from the SSV as the key and the SSV may have changed. While there are multiple recovery strategies, a single, general strategy is described here.

o  The client reconnects.

o  The client assumes that the SET_SSV was executed, and so sends BIND_CONN_TO_SESSION with the subkey (derived from the new SSV, i.e., what SET_SSV would have set the SSV to) used as the key for the RPCSEC_GSS credential message integrity codes.
o  If the request succeeds, this means that the original attempted SET_SSV did execute successfully. The client re-sends the original SET_SSV, which the server will reply to via the reply cache.

o  If the server returns an RPC authentication error, this means that the server's current SSV was not changed (and the SET_SSV was likely not executed). The client then tries BIND_CONN_TO_SESSION with the subkey derived from the old SSV as the key for the RPCSEC_GSS message integrity codes.

o  The attempted BIND_CONN_TO_SESSION with the old SSV should succeed. If so, the client re-sends the original SET_SSV. If the original SET_SSV was not executed, then the server executes it. If the original SET_SSV was executed but failed, the server will return the SET_SSV from the reply cache.

18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID

The EXCHANGE_ID operation exchanges long-hand client and server identifiers (owners), and provides access to a client ID, creating one if necessary. This client ID becomes associated with the connection on which the operation is done, so that it is available when a CREATE_SESSION is done or when the connection is used to issue a request on an existing session associated with the current client.

18.35.1. ARGUMENT

   const EXCHGID4_FLAG_SUPP_MOVED_REFER    = 0x00000001;
   const EXCHGID4_FLAG_SUPP_MOVED_MIGR     = 0x00000002;

   const EXCHGID4_FLAG_BIND_PRINC_STATEID  = 0x00000100;

   const EXCHGID4_FLAG_USE_NON_PNFS        = 0x00010000;
   const EXCHGID4_FLAG_USE_PNFS_MDS        = 0x00020000;
   const EXCHGID4_FLAG_USE_PNFS_DS         = 0x00040000;

   const EXCHGID4_FLAG_MASK_PNFS           = 0x00070000;

   const EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000;
   const EXCHGID4_FLAG_CONFIRMED_R         = 0x80000000;

   struct state_protect_ops4 {
           bitmap4 spo_must_enforce;
           bitmap4 spo_must_allow;
   };

   struct ssv_sp_parms4 {
           state_protect_ops4      ssp_ops;
           sec_oid4                ssp_hash_algs<>;
           sec_oid4                ssp_encr_algs<>;
           uint32_t                ssp_window;
           uint32_t                ssp_num_gss_handles;
   };

   enum state_protect_how4 {
           SP4_NONE = 0,
           SP4_MACH_CRED = 1,
           SP4_SSV = 2
   };

   union state_protect4_a switch(state_protect_how4 spa_how) {
   case SP4_NONE:
           void;
   case SP4_MACH_CRED:
           state_protect_ops4      spa_mach_ops;
   case SP4_SSV:
           ssv_sp_parms4           spa_ssv_parms;
   };

   struct EXCHANGE_ID4args {
           client_owner4           eia_clientowner;
           uint32_t                eia_flags;
           state_protect4_a        eia_state_protect;
           nfs_impl_id4            eia_client_impl_id<1>;
   };

18.35.2. RESULT

   struct ssv_prot_info4 {
           state_protect_ops4      spi_ops;
           uint32_t                spi_hash_alg;
           uint32_t                spi_encr_alg;
           uint32_t                spi_ssv_len;
           uint32_t                spi_window;
           gsshandle4_t            spi_handles<>;
   };

   union state_protect4_r switch(state_protect_how4 spr_how) {
   case SP4_NONE:
           void;
   case SP4_MACH_CRED:
           state_protect_ops4      spr_mach_ops;
   case SP4_SSV:
           ssv_prot_info4          spr_ssv_info;
   };

   struct EXCHANGE_ID4resok {
           clientid4               eir_clientid;
           sequenceid4             eir_sequenceid;
           uint32_t                eir_flags;
           state_protect4_r        eir_state_protect;
           server_owner4           eir_server_owner;
           opaque                  eir_server_scope<NFS4_OPAQUE_LIMIT>;
           nfs_impl_id4            eir_server_impl_id<1>;
   };

   union EXCHANGE_ID4res switch (nfsstat4 eir_status) {
   case NFS4_OK:
           EXCHANGE_ID4resok       eir_resok4;

   default:
           void;
   };

18.35.3. DESCRIPTION

The client uses the EXCHANGE_ID operation to register a particular client_owner with the server. However, when the client_owner has already been registered by other means (e.g., Transparent State Migration), the client may still use EXCHANGE_ID to obtain the client ID assigned previously.

The client ID returned from this operation will be associated with the connection on which the EXCHANGE_ID is received and will serve as a parent object for sessions created by the client on this connection or to which the connection is bound. As a result of using those sessions to make requests involving the creation of state, that state will become associated with the client ID returned.

In situations in which the registration of the client_owner has not occurred previously, the client ID must first be used, along with the returned eir_sequenceid, in creating an associated session using CREATE_SESSION.
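Whether that confirming CREATE_SESSION is still required can be decided from the EXCHGID4_FLAG_CONFIRMED_R result flag, described next. A minimal client-side sketch (the flag value follows the ARGUMENT section; the function is hypothetical):

```python
# Flag value from the EXCHANGE_ID XDR constants.
EXCHGID4_FLAG_CONFIRMED_R = 0x80000000

def needs_create_session_confirmation(eir_flags):
    """After EXCHANGE_ID, a CREATE_SESSION is still required to confirm
    the client ID unless the server reported the registration of the
    client_owner as already confirmed."""
    return not (eir_flags & EXCHGID4_FLAG_CONFIRMED_R)
```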
If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the result, eir_flags, then it is an indication that the registration of the client_owner has already occurred and that a further CREATE_SESSION is not needed to confirm it. Of course, subsequent CREATE_SESSION operations may be needed for other reasons.

The value eir_sequenceid is used to establish an initial sequence value associated with the client ID returned. In cases in which a CREATE_SESSION has already been done, there is no need for this value, since sequencing of such requests has already been established; the client will ignore it.

EXCHANGE_ID MAY be sent in a COMPOUND procedure that starts with SEQUENCE. However, when a client communicates with a server for the first time, it will not have a session, so using SEQUENCE will not be possible. If EXCHANGE_ID is sent without a preceding SEQUENCE, then it MUST be the only operation in the COMPOUND procedure's request. If it is not, the server MUST return NFS4ERR_NOT_ONLY_OP.

The eia_clientowner field is composed of a co_verifier field and a co_ownerid string. As noted in Section 2.4, the co_ownerid describes the client, and the co_verifier is the incarnation of the client. An EXCHANGE_ID sent with a new incarnation of the client will lead to the server removing lock state of the old incarnation, whereas an EXCHANGE_ID sent with the current incarnation and co_ownerid will result in an error or an update of the client ID's properties, depending on the arguments to EXCHANGE_ID.

A server MUST NOT provide the same client ID to two different incarnations of an eia_clientowner.

In addition to the client ID and sequence ID, the server returns a server owner (eir_server_owner) and server scope (eir_server_scope).
The former field is used in connection with network trunking as described in Section 2.10.5. The latter field is used to allow clients to determine when client IDs sent by one server may be recognized by another in the event of file system migration (see Section 11.10.9 of the current document).

The client ID returned by EXCHANGE_ID is only unique relative to the combination of eir_server_owner.so_major_id and eir_server_scope. Thus, if two servers return the same client ID, the onus is on the client to distinguish the client IDs on the basis of eir_server_owner.so_major_id and eir_server_scope. In the event two different servers claim matching server_owner.so_major_id and eir_server_scope, the client can use the verification techniques discussed in Section 2.10.5.1 to determine if the servers are distinct. If they are distinct, then the client will need to note the destination network addresses of the connections used with each server and use the network address as the final discriminator.

The server, as defined by the unique identity expressed in the so_major_id of the server owner and the server scope, needs to track several properties of each client ID it hands out. The properties apply to the client ID and all sessions associated with the client ID. The properties are derived from the arguments and results of EXCHANGE_ID.
The client ID properties include:

o  The capabilities expressed by the following bits, which come from the results of EXCHANGE_ID:

   *  EXCHGID4_FLAG_SUPP_MOVED_REFER

   *  EXCHGID4_FLAG_SUPP_MOVED_MIGR

   *  EXCHGID4_FLAG_BIND_PRINC_STATEID

   *  EXCHGID4_FLAG_USE_NON_PNFS

   *  EXCHGID4_FLAG_USE_PNFS_MDS

   *  EXCHGID4_FLAG_USE_PNFS_DS

   These properties may be updated by subsequent EXCHANGE_ID operations on confirmed client IDs, though the server MAY refuse to change them.

o  The state protection method used, one of SP4_NONE, SP4_MACH_CRED, or SP4_SSV, as set by the spa_how field of the arguments to EXCHANGE_ID. Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID operations.

o  For SP4_MACH_CRED or SP4_SSV state protection:

   *  The list of operations (spo_must_enforce) that MUST use the specified state protection. This list comes from the results of EXCHANGE_ID.

   *  The list of operations (spo_must_allow) that MAY use the specified state protection. This list comes from the results of EXCHANGE_ID.

   Once the client ID is confirmed, these properties cannot be updated by subsequent EXCHANGE_ID requests.

o  For SP4_SSV protection:

   *  The OID of the hash algorithm. This property is represented by one of the algorithms in the ssp_hash_algs field of the EXCHANGE_ID arguments. Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID requests.

   *  The OID of the encryption algorithm. This property is represented by one of the algorithms in the ssp_encr_algs field of the EXCHANGE_ID arguments. Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID requests.

   *  The length of the SSV. This property is represented by the spi_ssv_len field in the EXCHANGE_ID results.
Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID operations.

      There are REQUIRED and RECOMMENDED relationships among the length of the key of the encryption algorithm ("key length"), the length of the output of the hash algorithm ("hash length"), and the length of the SSV ("SSV length").

      +  key length MUST be <= hash length. This is because the keys used for the encryption algorithm are actually subkeys derived from the SSV, and the derivation is via the hash algorithm. The selection of an encryption algorithm with a key length that exceeded the length of the output of the hash algorithm would require padding, and thus weaken the use of the encryption algorithm.

      +  hash length SHOULD be <= SSV length. This is because the SSV is a key used to derive subkeys via an HMAC, and it is recommended that the key used as input to an HMAC be at least as long as the length of the HMAC's hash algorithm's output (see Section 3 of [59]).

      +  key length SHOULD be <= SSV length. This is a transitive result of the above two invariants.

      +  key length SHOULD be >= hash length / 2. This is because the subkey derivation is via an HMAC and it is recommended that if the HMAC has to be truncated, it should not be truncated to less than half the hash length (see Section 4 of RFC 2104 [59]).

   *  Number of concurrent versions of the SSV the client and server will support (see Section 2.10.9). This property is represented by spi_window in the EXCHANGE_ID results. The property may be updated by subsequent EXCHANGE_ID operations.

o  The client's implementation ID as represented by the eia_client_impl_id field of the arguments. The property may be updated by subsequent EXCHANGE_ID requests.
o  The server's implementation ID as represented by the eir_server_impl_id field of the reply. The property may be updated by replies to subsequent EXCHANGE_ID requests.

The eia_flags passed as part of the arguments and the eir_flags results allow the client and server to inform each other of their capabilities as well as indicate how the client ID will be used. Whether a bit is set or cleared on the arguments' flags does not force the server to set or clear the same bit on the results' side. Bits not defined above cannot be set in the eia_flags field. If they are, the server MUST reject the operation with NFS4ERR_INVAL.

The EXCHGID4_FLAG_UPD_CONFIRMED_REC_A bit can only be set in eia_flags; it is always off in eir_flags. The EXCHGID4_FLAG_CONFIRMED_R bit can only be set in eir_flags; it is always off in eia_flags. If the server recognizes the co_ownerid and co_verifier as mapping to a confirmed client ID, it sets EXCHGID4_FLAG_CONFIRMED_R in eir_flags. The EXCHGID4_FLAG_CONFIRMED_R flag allows a client to tell if the client ID it is trying to create already exists and is confirmed.

If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set in eia_flags, this means that the client is attempting to update properties of an existing confirmed client ID (if the client wants to update properties of an unconfirmed client ID, it MUST NOT set EXCHGID4_FLAG_UPD_CONFIRMED_REC_A). If so, it is RECOMMENDED that the client send the update EXCHANGE_ID operation in the same COMPOUND as a SEQUENCE so that the EXCHANGE_ID is executed exactly once. Whether the client can update the properties of the client ID depends on the state protection it selected when the client ID was created, and the principal and security flavor it used when sending the EXCHANGE_ID operation.
The situations described in items 6, 7, 8, or 9 of the second numbered list of Section 18.35.4 below will apply.

Note that if the operation succeeds and returns a client ID that is already confirmed, the server MUST set the EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags.

If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in eia_flags, this means that the client is trying to establish a new client ID; it is attempting to trunk data communication to the server (see Section 2.10.5); or it is attempting to update properties of an unconfirmed client ID. The situations described in items 1, 2, 3, 4, or 5 of the second numbered list of Section 18.35.4 below will apply. Note that if the operation succeeds and returns a client ID that was previously confirmed, the server MUST set the EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags.

When the EXCHGID4_FLAG_SUPP_MOVED_REFER flag bit is set, the client indicates that it is capable of dealing with an NFS4ERR_MOVED error as part of a referral sequence. When this bit is not set, it is still legal for the server to perform a referral sequence. However, a server may use the fact that the client is incapable of correctly responding to a referral, by avoiding it for that particular client. It may, for instance, act as a proxy for that particular file system, at some cost in performance, although it is not obligated to do so. If the server will potentially perform a referral, it MUST set EXCHGID4_FLAG_SUPP_MOVED_REFER in eir_flags.

When the EXCHGID4_FLAG_SUPP_MOVED_MIGR is set, the client indicates that it is capable of dealing with an NFS4ERR_MOVED error as part of a file system migration sequence. When this bit is not set, it is still legal for the server to indicate that a file system has moved, when this in fact happens.
However, a server may use the fact that the client is incapable of correctly responding to a migration in its scheduling of file systems to migrate so as to avoid migration of file systems being actively used. It may also hide actual migrations from clients unable to deal with them by acting as a proxy for a migrated file system for particular clients, at some cost in performance, although it is not obligated to do so. If the server will potentially perform a migration, it MUST set EXCHGID4_FLAG_SUPP_MOVED_MIGR in eir_flags.

When EXCHGID4_FLAG_BIND_PRINC_STATEID is set, the client indicates that it wants the server to bind the stateid to the principal. This means that when a principal creates a stateid, it has to be the one to use the stateid. If the server will perform binding, it will return EXCHGID4_FLAG_BIND_PRINC_STATEID. The server MAY return EXCHGID4_FLAG_BIND_PRINC_STATEID even if the client does not request it. If an update to the client ID changes the value of EXCHGID4_FLAG_BIND_PRINC_STATEID's client ID property, the effect applies only to new stateids. Existing stateids (and all stateids with the same "other" field) that were created with stateid to principal binding in force will continue to have binding in force. Existing stateids (and all stateids with the same "other" field) that were created with stateid to principal binding not in force will continue to have binding not in force.

The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS bits are described in Section 2.10.2.2 and convey roles the client ID is to be used for in a pNFS environment. The server MUST set one of the acceptable combinations of these bits (roles) in eir_flags, as specified in that section. Note that the same client owner/server owner pair can have multiple roles.
Multiple roles can be associated with the same client ID or with different client IDs. Thus, if a client sends EXCHANGE_ID from the same client owner to the same server owner multiple times, but specifies different pNFS roles each time, the server might return different client IDs. Given that different pNFS roles might have different client IDs, the client may ask for different properties for each role/client ID.

The spa_how field of the eia_state_protect field specifies how the client wants to protect its client, locking, and session states from unauthorized changes (Section 2.10.8.3):

o  SP4_NONE. The client does not request the NFSv4.1 server to enforce state protection. The NFSv4.1 server MUST NOT enforce state protection for the returned client ID.

o  SP4_MACH_CRED. If spa_how is SP4_MACH_CRED, then the client MUST send the EXCHANGE_ID operation with RPCSEC_GSS as the security flavor, and with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY. If SP4_MACH_CRED is specified, then the client wants to use an RPCSEC_GSS-based machine credential to protect its state. The server MUST note the principal the EXCHANGE_ID operation was sent with, and the GSS mechanism used. These notes collectively comprise the machine credential.

   After the client ID is confirmed, as long as the lease associated with the client ID is unexpired, a subsequent EXCHANGE_ID operation that uses the same eia_clientowner.co_owner as the first EXCHANGE_ID MUST also use the same machine credential as the first EXCHANGE_ID. The server returns the same client ID for the subsequent EXCHANGE_ID as that returned from the first EXCHANGE_ID.

o  SP4_SSV.
If spa_how is SP4_SSV, then the client MUST send the EXCHANGE_ID operation with RPCSEC_GSS as the security flavor, and with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY.

   If SP4_SSV is specified, then the client wants to use the SSV to protect its state. The server records the credential used in the request as the machine credential (as defined above) for the eia_clientowner.co_owner. The CREATE_SESSION operation that confirms the client ID MUST use the same machine credential.

When a client specifies SP4_MACH_CRED or SP4_SSV, it also provides two lists of operations (each expressed as a bitmap). The first list is spo_must_enforce and consists of those operations the client MUST send (subject to the server confirming the list of operations in the result of EXCHANGE_ID) with the machine credential (if SP4_MACH_CRED protection is specified) or the SSV-based credential (if SP4_SSV protection is used). The client MUST send the operations with RPCSEC_GSS credentials that specify the RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY security service. Typically, the first list of operations includes EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, and DESTROY_CLIENTID. The client SHOULD NOT specify in this list any operations that require a filehandle because the server's access policies MAY conflict with the client's choice, and thus the client would then be unable to access a subset of the server's namespace.

Note that if SP4_SSV protection is specified, and the client indicates that CREATE_SESSION must be protected with SP4_SSV, because the SSV cannot exist without a confirmed client ID, the first CREATE_SESSION MUST instead be sent using the machine credential, and the server MUST accept the machine credential.
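The credential-selection logic a client might implement for spo_must_enforce, including the CREATE_SESSION bootstrap exception just noted, can be sketched as follows (operation names match the spec; the function and return labels are hypothetical):

```python
def credential_for(op, spo_must_enforce, client_id_confirmed,
                   protection="SP4_SSV"):
    """Pick the credential class for an operation under state protection.

    spo_must_enforce: set of operation names that MUST use the protected
    credential, taken from the EXCHANGE_ID result.
    """
    if op not in spo_must_enforce:
        return "any"            # no enforcement for this operation
    if protection == "SP4_MACH_CRED":
        return "machine"
    # SP4_SSV: the SSV cannot exist before the client ID is confirmed,
    # so the confirming CREATE_SESSION falls back to the machine cred.
    if op == "CREATE_SESSION" and not client_id_confirmed:
        return "machine"
    return "ssv"
```

In all protected cases the RPCSEC_GSS service would additionally have to be RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY, which this sketch does not model.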
There is a corresponding result, also called spo_must_enforce, of the operations for which the server will require SP4_MACH_CRED or SP4_SSV protection. Normally, the server's result equals the client's argument, but the result MAY be different. If the client requests one or more operations in the set { EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, DESTROY_CLIENTID }, then the result spo_must_enforce MUST include the operations the client requested from that set.

If spo_must_enforce in the results has BIND_CONN_TO_SESSION set, then connection binding enforcement is enabled, and the client MUST use the machine (if SP4_MACH_CRED protection is used) or SSV (if SP4_SSV protection is used) credential on calls to BIND_CONN_TO_SESSION.

The second list is spo_must_allow and consists of those operations the client wants to have the option of sending with the machine credential or the SSV-based credential, even if the object the operations are performed on is not owned by the machine or SSV credential.

The corresponding result, also called spo_must_allow, consists of the operations the server will allow the client to use SP4_SSV or SP4_MACH_CRED credentials with. Normally, the server's result equals the client's argument, but the result MAY be different.

The purpose of spo_must_allow is to allow clients to solve the following conundrum. Suppose the client ID is confirmed with EXCHGID4_FLAG_BIND_PRINC_STATEID, and it calls OPEN with the RPCSEC_GSS credentials of a normal user. Now suppose the user's credentials expire, and cannot be renewed (e.g., a Kerberos ticket granting ticket expires, and the user has logged off and will not be acquiring a new ticket granting ticket).
The client will be unable to send CLOSE without the user's credentials, which is to say the client has to either leave the state on the server or re-send EXCHANGE_ID with a new verifier to clear all state, that is, unless the client includes CLOSE on the list of operations in spo_must_allow and the server agrees.

The SP4_SSV protection parameters also have:

ssp_hash_algs:

   This is the set of algorithms the client supports for the purpose of computing the digests needed for the internal SSV GSS mechanism and for the SET_SSV operation. Each algorithm is specified as an object identifier (OID). The REQUIRED algorithms for a server are id-sha1, id-sha224, id-sha256, id-sha384, and id-sha512 [25]. The algorithm the server selects among the set is indicated in spi_hash_alg, a field of spr_ssv_info. The field spi_hash_alg is an index into the array ssp_hash_algs. If the server does not support any of the offered algorithms, it returns NFS4ERR_HASH_ALG_UNSUPP. If ssp_hash_algs is empty, the server MUST return NFS4ERR_INVAL.

ssp_encr_algs:

   This is the set of algorithms the client supports for the purpose of providing privacy protection for the internal SSV GSS mechanism. Each algorithm is specified as an OID. The REQUIRED algorithm for a server is id-aes256-CBC. The RECOMMENDED algorithms are id-aes192-CBC and id-aes128-CBC [26]. The selected algorithm is returned in spi_encr_alg, an index into ssp_encr_algs. If the server does not support any of the offered algorithms, it returns NFS4ERR_ENCR_ALG_UNSUPP. If ssp_encr_algs is empty, the server MUST return NFS4ERR_INVAL.
Note that due to previously stated requirements and recommendations on the relationships between key length and hash length, some combinations of RECOMMENDED and REQUIRED encryption algorithm and hash algorithm either SHOULD NOT or MUST NOT be used. Table 12 summarizes the illegal and discouraged combinations.

ssp_window:

   This is the number of SSV versions the client wants the server to maintain (i.e., each successful call to SET_SSV produces a new version of the SSV). If ssp_window is zero, the server MUST return NFS4ERR_INVAL. The server responds with spi_window, which MUST NOT exceed ssp_window and MUST be at least one. Any requests on the backchannel or fore channel that are using a version of the SSV that is outside the window will fail with an ONC RPC authentication error, and the requester will have to retry them with the same slot ID and sequence ID.

ssp_num_gss_handles:

   This is the number of RPCSEC_GSS handles the server should create that are based on the GSS SSV mechanism (see Section 2.10.9). It is not the total number of RPCSEC_GSS handles for the client ID. Indeed, subsequent calls to EXCHANGE_ID will add RPCSEC_GSS handles. The server responds with a list of handles in spi_handles. If the client asks for at least one handle and the server cannot create it, the server MUST return an error. The handles in spi_handles are not available for use until the client ID is confirmed, which could be immediately if EXCHANGE_ID returns EXCHGID4_FLAG_CONFIRMED_R, or upon successful confirmation from CREATE_SESSION.
   While a client ID can span all the connections that are connected to a server sharing the same eir_server_owner.so_major_id, the RPCSEC_GSS handles returned in spi_handles can only be used on connections connected to a server that returns the same eir_server_owner.so_major_id and eir_server_owner.so_minor_id on each connection.  It is permissible for the client to set ssp_num_gss_handles to zero; the client can create more handles with another EXCHANGE_ID call.

   Because each SSV RPCSEC_GSS handle shares a common SSV GSS context, there are security considerations specific to this situation discussed in Section 2.10.10.

   The seq_window (see Section 5.2.3.1 of RFC2203 [4]) of each RPCSEC_GSS handle in spi_handles MUST be the same as the seq_window of the RPCSEC_GSS handle used for the credential of the RPC request that the EXCHANGE_ID operation was sent as a part of.

   +-------------------+----------------------+------------------------+
   | Encryption        | MUST NOT be combined | SHOULD NOT be combined |
   | Algorithm         | with                 | with                   |
   +-------------------+----------------------+------------------------+
   | id-aes128-CBC     |                      | id-sha384, id-sha512   |
   | id-aes192-CBC     | id-sha1              | id-sha512              |
   | id-aes256-CBC     | id-sha1, id-sha224   |                        |
   +-------------------+----------------------+------------------------+

                                Table 12

The arguments include an array of up to one element in length called eia_client_impl_id.  If eia_client_impl_id is present, it contains information identifying the implementation of the client.  Similarly, the results include an array of up to one element in length called eir_server_impl_id that identifies the implementation of the server.  Servers MUST accept a zero-length eia_client_impl_id array, and clients MUST accept a zero-length eir_server_impl_id array.
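As an illustrative aid (not part of the protocol), the Table 12 restrictions can be expressed as a small lookup.  The function name and return strings are invented for this sketch.

```python
# Table 12: restrictions on combining SSV encryption and hash algorithms.
MUST_NOT = {
    "id-aes192-CBC": {"id-sha1"},
    "id-aes256-CBC": {"id-sha1", "id-sha224"},
}
SHOULD_NOT = {
    "id-aes128-CBC": {"id-sha384", "id-sha512"},
    "id-aes192-CBC": {"id-sha512"},
}

def combination_status(encr_alg, hash_alg):
    """Classify an (encryption, hash) pair per Table 12."""
    if hash_alg in MUST_NOT.get(encr_alg, set()):
        return "MUST NOT"
    if hash_alg in SHOULD_NOT.get(encr_alg, set()):
        return "SHOULD NOT"
    return "ok"
```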
A possible use for implementation identifiers would be in diagnostic software that extracts this information in an attempt to identify interoperability problems, performance workload behaviors, or general usage statistics.  Since the intent of having access to this information is for planning or general diagnosis only, the client and server MUST NOT interpret this implementation identity information in a way that affects how the implementation interacts with its peer.  The client and server are not allowed to depend on the peer's manifesting a particular allowed behavior based on an implementation identifier but are required to interoperate as specified elsewhere in the protocol specification.

Because it is possible that some implementations might violate the protocol specification and interpret the identity information, implementations MUST provide facilities to allow the NFSv4 client and server to be configured to set the contents of the nfs_impl_id structures sent to any specified value.

18.35.4.  IMPLEMENTATION

A server's client record is a 5-tuple:

1.  co_ownerid:

       The client identifier string, from the eia_clientowner structure of the EXCHANGE_ID4args structure.

2.  co_verifier:

       A client-specific value used to indicate incarnations (where a client restart represents a new incarnation), from the eia_clientowner structure of the EXCHANGE_ID4args structure.

3.  principal:

       The principal that was defined in the RPC header's credential and/or verifier at the time the client record was established.

4.  client ID:

       The shorthand client identifier, generated by the server and returned via the eir_clientid field in the EXCHANGE_ID4resok structure.

5.  confirmed:

       A private field on the server indicating whether or not a client record has been confirmed.
       A client record is confirmed if there has been a successful CREATE_SESSION operation to confirm it.  Otherwise, it is unconfirmed.  An unconfirmed record is established by an EXCHANGE_ID call.  Any unconfirmed record that is not confirmed within a lease period SHOULD be removed.

The following identifiers represent special values for the fields in the records.

ownerid_arg:

   The value of the eia_clientowner.co_ownerid subfield of the EXCHANGE_ID4args structure of the current request.

verifier_arg:

   The value of the eia_clientowner.co_verifier subfield of the EXCHANGE_ID4args structure of the current request.

old_verifier_arg:

   A value of the eia_clientowner.co_verifier field of a client record received in a previous request; this is distinct from verifier_arg.

principal_arg:

   The value of the RPCSEC_GSS principal for the current request.

old_principal_arg:

   A value of the principal of a client record as defined by the RPC header's credential or verifier of a previous request.  This is distinct from principal_arg.

clientid_ret:

   The value of the eir_clientid field the server will return in the EXCHANGE_ID4resok structure for the current request.

old_clientid_ret:

   The value of the eir_clientid field the server returned in the EXCHANGE_ID4resok structure for a previous request.  This is distinct from clientid_ret.

confirmed:

   The client ID has been confirmed.

unconfirmed:

   The client ID has not been confirmed.

Since EXCHANGE_ID is a non-idempotent operation, we must consider the possibility that retries occur as a result of a client restart, network partition, malfunctioning router, etc.  Retries are identified by the value of the eia_clientowner field of EXCHANGE_ID4args, and the method for dealing with them is outlined in the scenarios below.
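The 5-tuple client record and the basic retry test can be modeled as follows.  The class and helper names are invented for illustration; the scenarios below remain the normative description.

```python
# Illustrative model of the server's 5-tuple client record.
from dataclasses import dataclass

@dataclass
class ClientRecord:
    co_ownerid: str      # client identifier string
    co_verifier: bytes   # incarnation verifier
    principal: str       # principal from the RPC credential/verifier
    clientid: int        # server-generated shorthand client ID
    confirmed: bool      # set once CREATE_SESSION confirms the record

def matches_retry(rec, ownerid_arg, verifier_arg, principal_arg):
    """A retried EXCHANGE_ID presents the same ownerid, verifier, and
    principal as the stored record (the non-update scenario below)."""
    return (rec.co_ownerid == ownerid_arg and
            rec.co_verifier == verifier_arg and
            rec.principal == principal_arg)
```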
The scenarios are described in terms of the client record(s) a server has for a given co_ownerid.  Note that if the client ID was created specifying SP4_SSV state protection and EXCHANGE_ID as one of the operations in spo_must_allow, then the server MUST authorize EXCHANGE_IDs with the SSV principal in addition to the principal that created the client ID.

1.  New Owner ID

       If the server has no client records with eia_clientowner.co_ownerid matching ownerid_arg, and EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in the EXCHANGE_ID, then a new shorthand client ID (let us call it clientid_ret) is generated, and the following unconfirmed record is added to the server's state.

           { ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }

       Subsequently, the server returns clientid_ret.

2.  Non-Update on Existing Client ID

       If the server has the following confirmed record, and the request does not have EXCHGID4_FLAG_UPD_CONFIRMED_REC_A set, then the request is the result of a retried request due to a faulty router or lost connection, or the client is trying to determine if it can perform trunking.

           { ownerid_arg, verifier_arg, principal_arg, clientid_ret, confirmed }

       Since the record has been confirmed, the client must have received the server's reply from the initial EXCHANGE_ID request.  Since the server has a confirmed record, and since EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, with the possible exception of eir_server_owner.so_minor_id, the server returns the same result it did when the client ID's properties were last updated (or if never updated, the result when the client ID was created).  The confirmed record is unchanged.

3.
    Client Collision

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the server has the following confirmed record, then this request is likely the result of a chance collision between the values of the eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args for two different clients.

           { ownerid_arg, *, old_principal_arg, old_clientid_ret, confirmed }

       If there is currently no state associated with old_clientid_ret, or if there is state but the lease has expired, then this case is effectively equivalent to the New Owner ID case of Paragraph 1.  The confirmed record is deleted, the old_clientid_ret and its lock state are deleted, a new shorthand client ID is generated, and the following unconfirmed record is added to the server's state.

           { ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }

       Subsequently, the server returns clientid_ret.

       If old_clientid_ret has an unexpired lease with state, then no state of old_clientid_ret is changed or deleted.  The server returns NFS4ERR_CLID_INUSE to indicate that the client should retry with a different value for the eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args.  The client record is not changed.

4.  Replacement of Unconfirmed Record

       If the EXCHGID4_FLAG_UPD_CONFIRMED_REC_A flag is not set, and the server has the following unconfirmed record, then the client is attempting EXCHANGE_ID again on an unconfirmed client ID, perhaps due to a retry, a client restart before client ID confirmation (i.e., before CREATE_SESSION was called), or some other reason.

           { ownerid_arg, *, *, old_clientid_ret, unconfirmed }

       It is possible that the properties of old_clientid_ret are different than those specified in the current EXCHANGE_ID.
       Whether or not the properties are being updated, to eliminate ambiguity, the server deletes the unconfirmed record, generates a new client ID (clientid_ret), and establishes the following unconfirmed record:

           { ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }

5.  Client Restart

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the server has the following confirmed client record, then this request is likely from a previously confirmed client that has restarted.

           { ownerid_arg, old_verifier_arg, principal_arg, old_clientid_ret, confirmed }

       Since the previous incarnation of the same client will no longer be making requests, once the new client ID is confirmed by CREATE_SESSION, byte-range locks and share reservations should be released immediately rather than forcing the new incarnation to wait for the lease time on the previous incarnation to expire.  Furthermore, session state should be removed since if the client had maintained that information across restart, this request would not have been sent.  If the server supports neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_PREV_FH claim types, associated delegations should be purged as well; otherwise, delegations are retained and recovery proceeds according to Section 10.2.1.

       After processing, clientid_ret is returned to the client and this client record is added:

           { ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }

       The previously described confirmed record continues to exist, and thus the same ownerid_arg exists in both a confirmed and unconfirmed state at the same time.  The number of states can collapse to one once the server receives an applicable CREATE_SESSION or EXCHANGE_ID.
       +  If the server subsequently receives a successful CREATE_SESSION that confirms clientid_ret, then the server atomically destroys the confirmed record and makes the unconfirmed record confirmed as described in Section 18.36.3.

       +  If the server instead subsequently receives an EXCHANGE_ID with the client owner equal to ownerid_arg, one strategy is to simply delete the unconfirmed record, and process the EXCHANGE_ID as described in the entirety of Section 18.35.4.

6.  Update

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has the following confirmed record, then this request is an attempt at an update.

           { ownerid_arg, verifier_arg, principal_arg, clientid_ret, confirmed }

       Since the record has been confirmed, the client must have received the server's reply from the initial EXCHANGE_ID request.  The server allows the update, and the client record is left intact.

7.  Update but No Confirmed Record

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has no confirmed record corresponding to ownerid_arg, then the server returns NFS4ERR_NOENT and leaves any unconfirmed record intact.

8.  Update but Wrong Verifier

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has the following confirmed record, then this request is an illegal attempt at an update, perhaps because of a retry from a previous client incarnation.

           { ownerid_arg, old_verifier_arg, *, clientid_ret, confirmed }

       The server returns NFS4ERR_NOT_SAME and leaves the client record intact.

9.  Update but Wrong Principal

       If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has the following confirmed record, then this request is an illegal attempt at an update by an unauthorized principal.
           { ownerid_arg, verifier_arg, old_principal_arg, clientid_ret, confirmed }

       The server returns NFS4ERR_PERM and leaves the client record intact.

18.36.  Operation 43: CREATE_SESSION - Create New Session and Confirm Client ID

18.36.1.  ARGUMENT

   struct channel_attrs4 {
           count4                  ca_headerpadsize;
           count4                  ca_maxrequestsize;
           count4                  ca_maxresponsesize;
           count4                  ca_maxresponsesize_cached;
           count4                  ca_maxoperations;
           count4                  ca_maxrequests;
           uint32_t                ca_rdma_ird<1>;
   };

   const CREATE_SESSION4_FLAG_PERSIST        = 0x00000001;
   const CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002;
   const CREATE_SESSION4_FLAG_CONN_RDMA      = 0x00000004;

   struct CREATE_SESSION4args {
           clientid4               csa_clientid;
           sequenceid4             csa_sequence;

           uint32_t                csa_flags;

           channel_attrs4          csa_fore_chan_attrs;
           channel_attrs4          csa_back_chan_attrs;

           uint32_t                csa_cb_program;
           callback_sec_parms4     csa_sec_parms<>;
   };

18.36.2.  RESULT

   struct CREATE_SESSION4resok {
           sessionid4              csr_sessionid;
           sequenceid4             csr_sequence;

           uint32_t                csr_flags;

           channel_attrs4          csr_fore_chan_attrs;
           channel_attrs4          csr_back_chan_attrs;
   };

   union CREATE_SESSION4res switch (nfsstat4 csr_status) {
   case NFS4_OK:
           CREATE_SESSION4resok    csr_resok4;
   default:
           void;
   };

18.36.3.  DESCRIPTION

This operation is used by the client to create new session objects on the server.

CREATE_SESSION can be sent with or without a preceding SEQUENCE operation in the same COMPOUND procedure.  If CREATE_SESSION is sent with a preceding SEQUENCE operation, any session created by CREATE_SESSION has no direct relation to the session specified in the SEQUENCE operation, although the two sessions might be associated with the same client ID.
If CREATE_SESSION is sent without a preceding SEQUENCE, then it MUST be the only operation in the COMPOUND procedure's request.  If it is not, the server MUST return NFS4ERR_NOT_ONLY_OP.

In addition to creating a session, CREATE_SESSION has the following effects:

o  The first session created with a new client ID serves to confirm the creation of that client's state on the server.  The server returns the parameter values for the new session.

o  The connection over which CREATE_SESSION is sent is associated with the session's fore channel.

The arguments and results of CREATE_SESSION are described as follows:

csa_clientid:

   This is the client ID with which the new session will be associated.  The corresponding result is csr_sessionid, the session ID of the new session.

csa_sequence:

   Each client ID serializes CREATE_SESSION via a per-client ID sequence number (see Section 18.36.4).  The corresponding result is csr_sequence, which MUST be equal to csa_sequence.

In the next three arguments, the client offers a value that is to be a property of the session.  Except where stated otherwise, it is RECOMMENDED that the server accept the value.  If it is not acceptable, the server MAY use a different value.  Regardless, the server MUST return the value the session will use (which will be either what the client offered, or what the server is insisting on) to the client.

csa_flags:

   The csa_flags field contains a list of the following flag bits:

   CREATE_SESSION4_FLAG_PERSIST:

      If CREATE_SESSION4_FLAG_PERSIST is set, the client wants the server to provide a persistent reply cache.  For sessions in which only idempotent operations will be used (e.g., a read-only session), clients SHOULD NOT set CREATE_SESSION4_FLAG_PERSIST.
      If the server does not or cannot provide a persistent reply cache, the server MUST NOT set CREATE_SESSION4_FLAG_PERSIST in the field csr_flags.

      If the server is a pNFS metadata server, for reasons described in Section 12.5.2 it SHOULD support CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint (Section 5.12.4) attribute.

   CREATE_SESSION4_FLAG_CONN_BACK_CHAN:

      If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is set in csa_flags, the client is requesting that the connection over which the CREATE_SESSION operation arrived be associated with the session's backchannel in addition to its fore channel.  If the server agrees, it sets CREATE_SESSION4_FLAG_CONN_BACK_CHAN in the result field csr_flags.  If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is not set in csa_flags, then CREATE_SESSION4_FLAG_CONN_BACK_CHAN MUST NOT be set in csr_flags.

   CREATE_SESSION4_FLAG_CONN_RDMA:

      If CREATE_SESSION4_FLAG_CONN_RDMA is set in csa_flags, and if the connection over which the CREATE_SESSION operation arrived is currently in non-RDMA mode but has the capability to operate in RDMA mode, then the client is requesting that the server "step up" to RDMA mode on the connection.  If the server agrees, it sets CREATE_SESSION4_FLAG_CONN_RDMA in the result field csr_flags.  If CREATE_SESSION4_FLAG_CONN_RDMA is not set in csa_flags, then CREATE_SESSION4_FLAG_CONN_RDMA MUST NOT be set in csr_flags.  Note that once the server agrees to step up, it and the client MUST exchange all future traffic on the connection with RPC RDMA framing and not Record Marking ([31]).
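The flag-negotiation rule common to all three bits (a flag MUST NOT appear in csr_flags unless the client set it in csa_flags) amounts to a bitwise intersection of the client's request with what the server is willing to grant.  The helper below is an illustrative sketch, not protocol text; only the three constants come from the ARGUMENT definition.

```python
# Flag constants from CREATE_SESSION4args.
CREATE_SESSION4_FLAG_PERSIST        = 0x00000001
CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002
CREATE_SESSION4_FLAG_CONN_RDMA      = 0x00000004

def compute_csr_flags(csa_flags, server_grants):
    """csr_flags is constrained to be a subset of csa_flags: a flag the
    client did not request MUST NOT appear in the result, even if the
    server would otherwise grant it."""
    return csa_flags & server_grants
```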
csa_fore_chan_attrs, csa_back_chan_attrs:

   The csa_fore_chan_attrs and csa_back_chan_attrs fields apply to attributes of the fore channel (which conveys requests originating from the client to the server), and the backchannel (the channel that conveys callback requests originating from the server to the client), respectively.  The results are in corresponding structures called csr_fore_chan_attrs and csr_back_chan_attrs.  The results establish attributes for each channel, and on all subsequent use of each channel of the session.  Each structure has the following fields:

   ca_headerpadsize:

      The maximum amount of padding the requester is willing to apply to ensure that write payloads are aligned on some boundary at the replier.  For each channel, the server

      +  will reply in ca_headerpadsize with its preferred value, or zero if padding is not in use, and

      +  MAY decrease this value but MUST NOT increase it.

   ca_maxrequestsize:

      The maximum size of a COMPOUND or CB_COMPOUND request that will be sent.  This size represents the XDR encoded size of the request, including the RPC headers (including security flavor credentials and verifiers) but excludes any RPC transport framing headers.  Imagine a request coming over a non-RDMA TCP/IP connection, and that it has a single Record Marking header preceding it.  The maximum allowable count encoded in the header will be ca_maxrequestsize.  If a requester sends a request that exceeds ca_maxrequestsize, the error NFS4ERR_REQ_TOO_BIG will be returned per the description in Section 2.10.6.4.  For each channel, the server MAY decrease this value but MUST NOT increase it.
   ca_maxresponsesize:

      The maximum size of a COMPOUND or CB_COMPOUND reply that the requester will accept from the replier including RPC headers (see the ca_maxrequestsize definition).  For each channel, the server MAY decrease this value, but MUST NOT increase it.  However, if the client selects a value for ca_maxresponsesize such that a replier on a channel could never send a response, the server SHOULD return NFS4ERR_TOOSMALL in the CREATE_SESSION reply.  After the session is created, if a requester sends a request for which the size of the reply would exceed this value, the replier will return NFS4ERR_REP_TOO_BIG, per the description in Section 2.10.6.4.

   ca_maxresponsesize_cached:

      Like ca_maxresponsesize, but the maximum size of a reply that will be stored in the reply cache (Section 2.10.6.1).  For each channel, the server MAY decrease this value, but MUST NOT increase it.  If, in the reply to CREATE_SESSION, the value of ca_maxresponsesize_cached of a channel is less than the value of ca_maxresponsesize of the same channel, then this is an indication to the requester that it needs to be selective about which replies it directs the replier to cache; for example, large replies from nonidempotent operations (e.g., COMPOUND requests with a READ operation) should not be cached.  The requester decides which replies to cache via an argument to the SEQUENCE (the sa_cachethis field, see Section 18.46) or CB_SEQUENCE (the csa_cachethis field, see Section 20.9) operations.  After the session is created, if a requester sends a request for which the size of the reply would exceed ca_maxresponsesize_cached, the replier will return NFS4ERR_REP_TOO_BIG_TO_CACHE, per the description in Section 2.10.6.4.
   ca_maxoperations:

      The maximum number of operations the replier will accept in a COMPOUND or CB_COMPOUND.  For the backchannel, the server MUST NOT change the value the client offers.  For the fore channel, the server MAY change the requested value.  After the session is created, if a requester sends a COMPOUND or CB_COMPOUND with more operations than ca_maxoperations, the replier MUST return NFS4ERR_TOO_MANY_OPS.

   ca_maxrequests:

      The maximum number of concurrent COMPOUND or CB_COMPOUND requests the requester will send on the session.  Subsequent requests will each be assigned a slot identifier by the requester within the range zero to ca_maxrequests - 1 inclusive.  For the backchannel, the server MUST NOT change the value the client offers.  For the fore channel, the server MAY change the requested value.

   ca_rdma_ird:

      This array has a maximum of one element.  If this array has one element, then the element contains the inbound RDMA read queue depth (IRD).  For each channel, the server MAY decrease this value, but MUST NOT increase it.

csa_cb_program:

   This is the ONC RPC program number the server MUST use in any callbacks sent through the backchannel to the client.  The server MUST specify an ONC RPC program number equal to csa_cb_program and an ONC RPC version number equal to 4 in callbacks sent to the client.  If a CB_COMPOUND is sent to the client, the server MUST use a minor version number of 1.  There is no corresponding result.

csa_sec_parms:

   The field csa_sec_parms is an array of acceptable security credentials the server can use on the session's backchannel.  Three security flavors are supported: AUTH_NONE, AUTH_SYS, and RPCSEC_GSS.
   If AUTH_NONE is specified for a credential, then this says the client is authorizing the server to use AUTH_NONE on all callbacks for the session.  If AUTH_SYS is specified, then the client is authorizing the server to use AUTH_SYS on all callbacks, using the credential specified in cbsp_sys_cred.  If RPCSEC_GSS is specified, then the server is allowed to use the RPCSEC_GSS context specified in cbsp_gss_parms as the RPCSEC_GSS context in the credential of the RPC header of callbacks to the client.  There is no corresponding result.

   The RPCSEC_GSS context for the backchannel is specified via a pair of values of data type gsshandle4_t.  The data type gsshandle4_t represents an RPCSEC_GSS handle, and is precisely the same as the data type of the "handle" field of the rpc_gss_init_res data type defined in Section 5.2.3.1, "Context Creation Response - Successful Acceptance", of [4].

   The first RPCSEC_GSS handle, gcbp_handle_from_server, is the fore handle the server returned to the client (either in the handle field of data type rpc_gss_init_res or as one of the elements of the spi_handles field returned in the reply to EXCHANGE_ID) when the RPCSEC_GSS context was created on the server.  The second handle, gcbp_handle_from_client, is the back handle to which the client will map the RPCSEC_GSS context.  The server can immediately use the value of gcbp_handle_from_client in the RPCSEC_GSS credential in callback RPCs.  That is, the value in gcbp_handle_from_client can be used as the value of the field "handle" in data type rpc_gss_cred_t (see Section 5, "Elements of the RPCSEC_GSS Security Protocol", of [4]) in callback RPCs.
   The server MUST use the RPCSEC_GSS security service specified in gcbp_service, i.e., it MUST set the "service" field of the rpc_gss_cred_t data type in the RPCSEC_GSS credential to the value of gcbp_service (see Section 5.3.1, "RPC Request Header", of [4]).

   If the RPCSEC_GSS handle identified by gcbp_handle_from_server does not exist on the server, the server will return NFS4ERR_NOENT.

   Within each element of csa_sec_parms, the fore and back RPCSEC_GSS contexts MUST share the same GSS context and MUST have the same seq_window (see Section 5.2.3.1 of RFC2203 [4]).  The fore and back RPCSEC_GSS context state are independent of each other as far as the RPCSEC_GSS sequence number is concerned (see the seq_num field in the rpc_gss_cred_t data type of Sections 5 and 5.3.1 of [4]).

   If an RPCSEC_GSS handle is using the SSV context (see Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a common SSV GSS context, there are security considerations specific to this situation discussed in Section 2.10.10.

Once the session is created, the first SEQUENCE or CB_SEQUENCE received on a slot MUST have a sequence ID equal to 1; if not, the replier MUST return NFS4ERR_SEQ_MISORDERED.

18.36.4.  IMPLEMENTATION

To describe a possible implementation, the same notation for client records introduced in the description of EXCHANGE_ID is used with the following addition:

clientid_arg:

   The value of the csa_clientid field of the CREATE_SESSION4args structure of the current request.

Since CREATE_SESSION is a non-idempotent operation, we need to consider the possibility that retries may occur as a result of a client restart, network partition, malfunctioning router, etc.
For each client ID created by EXCHANGE_ID, the server maintains a separate reply cache (called the CREATE_SESSION reply cache) similar to the session reply cache used for SEQUENCE operations, with two distinctions.

o  First, this is a reply cache just for detecting and processing CREATE_SESSION requests for a given client ID.

o  Second, the size of the client ID reply cache is one slot (and as a result, the CREATE_SESSION request does not carry a slot number).  This means that at most one CREATE_SESSION request for a given client ID can be outstanding.

As previously stated, CREATE_SESSION can be sent with or without a preceding SEQUENCE operation.  Even if a SEQUENCE precedes CREATE_SESSION, the server MUST maintain the CREATE_SESSION reply cache, which is separate from the reply cache for the session associated with a SEQUENCE.  If CREATE_SESSION was originally sent by itself, the client MAY send a retry of the CREATE_SESSION operation within a COMPOUND preceded by a SEQUENCE.  If CREATE_SESSION was originally sent in a COMPOUND that started with a SEQUENCE, then the client SHOULD send a retry in a COMPOUND that starts with a SEQUENCE that has the same session ID as the SEQUENCE of the original request.  However, the client MAY send a retry in a COMPOUND that either has no preceding SEQUENCE, or has a preceding SEQUENCE that refers to a different session than the original CREATE_SESSION.  This might be necessary if the client sends a CREATE_SESSION in a COMPOUND preceded by a SEQUENCE with session ID X, and session X no longer exists.  Regardless, any retry of CREATE_SESSION, with or without a preceding SEQUENCE, MUST use the same value of csa_sequence as the original.
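A minimal model of this single-slot CREATE_SESSION reply cache might look like the following.  The class and method names are invented for illustration; the sequence-ID comparison matches the rules spelled out in this section (replay returns the cached result, out-of-order requests leave the slot unchanged, and sequence IDs wrap in 32 bits).

```python
# Illustrative single-slot reply cache, one instance per client ID.
MAX_SEQ = 2**32  # sequenceid4 is a 32-bit value and wraps around

class CreateSessionSlot:
    """At most one CREATE_SESSION per client ID can be outstanding."""
    def __init__(self, init_seq, cached_result):
        self.seq = init_seq          # e.g., eir_sequenceid - 1 at init
        self.cached = cached_result  # e.g., a contrived SEQ_MISORDERED

    def process(self, csa_sequence, handler):
        if csa_sequence == self.seq:
            return self.cached                     # replay: cached reply
        if csa_sequence != (self.seq + 1) % MAX_SEQ:
            return "NFS4ERR_SEQ_MISORDERED"        # slot unchanged
        self.seq = csa_sequence                    # in-order new request
        self.cached = handler()                    # run and cache result
        return self.cached
```

Initializing the slot to eir_sequenceid - 1 with a cached NFS4ERR_SEQ_MISORDERED, as the text below describes, makes the first in-order CREATE_SESSION the only request that advances the slot.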
After the client has received a reply to an EXCHANGE_ID operation that contains a new, unconfirmed client ID, the server expects the client to follow with a CREATE_SESSION operation to confirm the client ID.  The server expects the value of csa_sequenceid in the arguments to that CREATE_SESSION to equal the value of the field eir_sequenceid that was returned in the results of the EXCHANGE_ID that returned the unconfirmed client ID.  Before the server replies to that EXCHANGE_ID operation, it initializes the client ID slot to be equal to eir_sequenceid - 1 (accounting for underflow), and records a contrived CREATE_SESSION result with a "cached" result of NFS4ERR_SEQ_MISORDERED.  With the client ID slot thus initialized, the processing of the CREATE_SESSION operation is divided into four phases:

1.  Client record lookup.  The server looks up the client ID in its client record table.  If the server contains no records with client ID equal to clientid_arg, then most likely the client's state has been purged during a period of inactivity, possibly due to a loss of connectivity.  NFS4ERR_STALE_CLIENTID is returned, and no changes are made to any client records on the server.  Otherwise, the server goes to phase 2.

2.  Sequence ID processing.  If csa_sequenceid is equal to the sequence ID in the client ID's slot, then this is a replay of the previous CREATE_SESSION request, and the server returns the cached result.  If csa_sequenceid is not equal to the sequence ID in the slot, and is more than one greater (accounting for wraparound), then the server returns the error NFS4ERR_SEQ_MISORDERED, and does not change the slot.
    If csa_sequenceid is equal to the slot's sequence ID + 1 (accounting for wraparound), then the slot's sequence ID is set to csa_sequenceid, and the CREATE_SESSION processing goes to the next phase.  A subsequent new CREATE_SESSION call over the same client ID MUST use a csa_sequenceid that is one greater than the sequence ID in the slot.

3.  Client ID confirmation.  If this would be the first session for the client ID, the CREATE_SESSION operation serves to confirm the client ID.  Otherwise, the client ID confirmation phase is skipped and only the session creation phase occurs.  Any case in which there is more than one record with identical values for client ID represents a server implementation error.  Operation in the potentially valid cases is summarized as follows.

    *  Successful Confirmation

          If the server has the following unconfirmed record, then this is the expected confirmation of an unconfirmed record.

              { ownerid, verifier, principal_arg, clientid_arg, unconfirmed }

          As noted in Section 18.35.4, the server might also have the following confirmed record.

              { ownerid, old_verifier, principal_arg, old_clientid, confirmed }

          The server schedules the replacement of both records with:

              { ownerid, verifier, principal_arg, clientid_arg, confirmed }

          The processing of CREATE_SESSION continues on to session creation.  Once the session is successfully created, the scheduled client record replacement is committed.  If the session is not successfully created, then no changes are made to any client records on the server.

    *  Unsuccessful Confirmation

          If the server has the following record, then the client has changed principals after the previous EXCHANGE_ID request, or there has been a chance collision between shorthand client identifiers.
25633 { *, *, old_principal_arg, clientid_arg, * } 25634 Neither of these cases is permissible. Processing stops 25635 and NFS4ERR_CLID_INUSE is returned to the client. No 25636 changes are made to any client records on the server. 25638 4. Session creation. The server confirmed the client ID, either in 25639 this CREATE_SESSION operation, or a previous CREATE_SESSION 25640 operation. The server examines the remaining fields of the 25641 arguments. 25643 The server creates the session by recording the parameter values 25644 used (including whether the CREATE_SESSION4_FLAG_PERSIST flag is 25645 set and has been accepted by the server) and allocating space for 25646 the session reply cache (if there is not enough space, the server 25647 returns NFS4ERR_NOSPC). For each slot in the reply cache, the 25648 server sets the sequence ID to zero, and records an entry 25649 containing a COMPOUND reply with zero operations and the error 25650 NFS4ERR_SEQ_MISORDERED. This way, if the first SEQUENCE request 25651 sent has a sequence ID equal to zero, the server can simply 25652 return what is in the reply cache: NFS4ERR_SEQ_MISORDERED. The 25653 client initializes its reply cache for receiving callbacks in the 25654 same way, and similarly, the first CB_SEQUENCE operation on a 25655 slot after session creation MUST have a sequence ID of one. 25657 If the session state is created successfully, the server 25658 associates the session with the client ID provided by the client. 25660 When a request that had CREATE_SESSION4_FLAG_CONN_RDMA set needs 25661 to be retried, the retry MUST be done on a new connection that is 25662 in non-RDMA mode. If properties of the new connection are 25663 different enough that the arguments to CREATE_SESSION need to 25664 change, then a non-retry MUST be sent. The server will 25665 eventually dispose of any session that was created on the 25666 original connection. 
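The slot handling in phases 2-4 above can be sketched as a small state machine. This is an illustrative model only, not implementation guidance from this specification: the `ClientIdSlot` class and the `execute` callback are invented names, and error codes are represented as strings.

```python
# Illustrative model of the per-client-ID slot used by CREATE_SESSION.
# Sequence IDs are 32-bit and wrap around; the slot starts at
# eir_sequenceid - 1 with a contrived cached NFS4ERR_SEQ_MISORDERED.

MASK32 = 0xFFFFFFFF

class ClientIdSlot:
    def __init__(self, eir_sequenceid):
        self.seqid = (eir_sequenceid - 1) & MASK32  # underflow wraps
        self.cached_result = "NFS4ERR_SEQ_MISORDERED"

    def process_create_session(self, csa_sequenceid, execute):
        if csa_sequenceid == self.seqid:
            # Replay of the previous CREATE_SESSION: return cached result.
            return self.cached_result
        if csa_sequenceid != (self.seqid + 1) & MASK32:
            # Misordered request: error returned, slot left unchanged.
            return "NFS4ERR_SEQ_MISORDERED"
        # Expected next request: advance the slot, then run phases 3-4
        # (confirmation and session creation) via the supplied callback.
        self.seqid = csa_sequenceid
        self.cached_result = execute()
        return self.cached_result
```

With eir_sequenceid = 1, the slot starts at 0, so the first CREATE_SESSION must carry csa_sequenceid = 1; resending 1 replays the cached result.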
25668 On the backchannel, the client and server might wish to have many 25669 slots, in some cases perhaps more than for the fore channel, in order to 25670 deal with situations where the network link has high latency and 25671 is the primary bottleneck for response to recalls. If so, and if the 25672 client provides too few slots to the backchannel, the server might 25673 limit the number of recallable objects it gives to the client. 25675 Implementing RPCSEC_GSS callback support requires changes to both the 25676 client and server implementations of RPCSEC_GSS. One possible set of 25677 changes includes: 25679 o Adding a data structure that wraps the GSS-API context with a 25680 reference count. 25682 o Adding new functions to increment and decrement the reference count. If 25683 the reference count is decremented to zero, the wrapper data 25684 structure and the GSS-API context it refers to would be freed. 25686 o Changing RPCSEC_GSS to create the wrapper data structure upon 25687 receiving a GSS-API context from gss_accept_sec_context() and 25688 gss_init_sec_context(). The reference count would be initialized 25689 to 1. 25691 o Adding a function to map an existing RPCSEC_GSS handle to a 25692 pointer to the wrapper data structure. The reference count would 25693 be incremented. 25695 o Adding a function to create a new RPCSEC_GSS handle from a pointer 25696 to the wrapper data structure. The reference count would be 25697 incremented. 25699 o Replacing calls from RPCSEC_GSS that free GSS-API contexts with 25700 calls to decrement the reference count on the wrapper data 25701 structure. 25703 18.37. Operation 44: DESTROY_SESSION - Destroy a Session 25705 18.37.1. ARGUMENT 25707 struct DESTROY_SESSION4args { 25708 sessionid4 dsa_sessionid; 25709 }; 25711 18.37.2. RESULT 25713 struct DESTROY_SESSION4res { 25714 nfsstat4 dsr_status; 25715 }; 25717 18.37.3.
DESCRIPTION 25719 The DESTROY_SESSION operation closes the session and discards the 25720 session's reply cache, if any. Any remaining connections associated 25721 with the session are immediately disassociated. If the connection 25722 has no remaining associated sessions, the connection MAY be closed by 25723 the server. Locks, delegations, layouts, wants, and the lease, which 25724 are all tied to the client ID, are not affected by DESTROY_SESSION. 25726 DESTROY_SESSION MUST be invoked on a connection that is associated 25727 with the session being destroyed. In addition, if SP4_MACH_CRED 25728 state protection was specified when the client ID was created, the 25729 RPCSEC_GSS principal that created the session MUST be the one that 25730 destroys the session, using RPCSEC_GSS privacy or integrity. If 25731 SP4_SSV state protection was specified when the client ID was 25732 created, RPCSEC_GSS using the SSV mechanism (Section 2.10.9) MUST be 25733 used, with integrity or privacy. 25735 If the COMPOUND request starts with SEQUENCE, and if the sessionids 25736 specified in SEQUENCE and DESTROY_SESSION are the same, then 25738 o DESTROY_SESSION MUST be the final operation in the COMPOUND 25739 request. 25741 o It is advisable to avoid placing DESTROY_SESSION in a COMPOUND 25742 request with other state-modifying operations, because the 25743 DESTROY_SESSION will destroy the reply cache. 25745 o Because the session and its reply cache are destroyed, a client 25746 that retries the request may receive an error in reply to the 25747 retry, even though the original request was successful. 25749 If the COMPOUND request starts with SEQUENCE, and if the sessionids 25750 specified in SEQUENCE and DESTROY_SESSION are different, then 25751 DESTROY_SESSION can appear in any position of the COMPOUND request 25752 (except for the first position). The two sessionids can belong to 25753 different client IDs. 
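The placement rules for DESTROY_SESSION within a COMPOUND can be condensed into a short predicate. This is a sketch with invented names (`ops` as a list of operation names, session IDs as opaque values); only the rules themselves come from this specification.

```python
# Sketch of where DESTROY_SESSION may legally appear in a COMPOUND.
# ops: list of operation names; destroy_index: position of DESTROY_SESSION.

def destroy_session_placement_ok(ops, destroy_index,
                                 seq_sessionid=None, dsa_sessionid=None):
    if not ops or ops[0] != "SEQUENCE":
        # Without a leading SEQUENCE, DESTROY_SESSION must be the sole
        # operation (otherwise the server returns NFS4ERR_NOT_ONLY_OP).
        return len(ops) == 1 and ops[0] == "DESTROY_SESSION"
    if seq_sessionid == dsa_sessionid:
        # Same session named in SEQUENCE and DESTROY_SESSION:
        # DESTROY_SESSION must be the final operation.
        return destroy_index == len(ops) - 1
    # Different sessions: any position except the first.
    return destroy_index >= 1
```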
25755 If the COMPOUND request does not start with SEQUENCE, and if 25756 DESTROY_SESSION is not the sole operation, then the server MUST return 25757 NFS4ERR_NOT_ONLY_OP. 25759 If there is a backchannel on the session and the server has 25760 outstanding CB_COMPOUND operations for the session which have not 25761 been replied to, then the server MAY refuse to destroy the session 25762 and return an error. If so, then in the event the backchannel is 25763 down, the server SHOULD return NFS4ERR_CB_PATH_DOWN to inform the 25764 client that the backchannel needs to be repaired before the server 25765 will allow the session to be destroyed. Otherwise, the error 25766 NFS4ERR_BACK_CHAN_BUSY SHOULD be returned to indicate that there are 25767 CB_COMPOUNDs that need to be replied to. The client SHOULD reply to 25768 all outstanding CB_COMPOUNDs before re-sending DESTROY_SESSION. 25770 18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks 25771 18.38.1. ARGUMENT 25773 struct FREE_STATEID4args { 25774 stateid4 fsa_stateid; 25775 }; 25777 18.38.2. RESULT 25779 struct FREE_STATEID4res { 25780 nfsstat4 fsr_status; 25781 }; 25783 18.38.3. DESCRIPTION 25785 The FREE_STATEID operation is used to free a stateid that no longer 25786 has any associated locks (including opens, byte-range locks, 25787 delegations, and layouts). This may be because of client LOCKU 25788 operations or because of server revocation. If there are valid locks 25789 (of any kind) associated with the stateid in question, the error 25790 NFS4ERR_LOCKS_HELD will be returned, and the associated stateid will 25791 not be freed. 25793 When a stateid is freed that had been associated with revoked locks, 25794 by sending the FREE_STATEID operation, the client acknowledges the 25795 loss of those locks. This allows the server, once all such revoked 25796 state is acknowledged, to once again allow that client to reclaim locks, 25797 without encountering the edge conditions discussed in Section 8.4.2.
25799 Once a successful FREE_STATEID is done for a given stateid, any 25800 subsequent use of that stateid will result in an NFS4ERR_BAD_STATEID 25801 error. 25803 18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory Delegation 25805 18.39.1. ARGUMENT 25806 typedef nfstime4 attr_notice4; 25808 struct GET_DIR_DELEGATION4args { 25809 /* CURRENT_FH: delegated directory */ 25810 bool gdda_signal_deleg_avail; 25811 bitmap4 gdda_notification_types; 25812 attr_notice4 gdda_child_attr_delay; 25813 attr_notice4 gdda_dir_attr_delay; 25814 bitmap4 gdda_child_attributes; 25815 bitmap4 gdda_dir_attributes; 25816 }; 25818 18.39.2. RESULT 25820 struct GET_DIR_DELEGATION4resok { 25821 verifier4 gddr_cookieverf; 25822 /* Stateid for get_dir_delegation */ 25823 stateid4 gddr_stateid; 25824 /* Which notifications can the server support */ 25825 bitmap4 gddr_notification; 25826 bitmap4 gddr_child_attributes; 25827 bitmap4 gddr_dir_attributes; 25828 }; 25830 enum gddrnf4_status { 25831 GDD4_OK = 0, 25832 GDD4_UNAVAIL = 1 25833 }; 25835 union GET_DIR_DELEGATION4res_non_fatal 25836 switch (gddrnf4_status gddrnf_status) { 25837 case GDD4_OK: 25838 GET_DIR_DELEGATION4resok gddrnf_resok4; 25839 case GDD4_UNAVAIL: 25840 bool gddrnf_will_signal_deleg_avail; 25841 }; 25843 union GET_DIR_DELEGATION4res 25844 switch (nfsstat4 gddr_status) { 25845 case NFS4_OK: 25846 GET_DIR_DELEGATION4res_non_fatal gddr_res_non_fatal4; 25847 default: 25848 void; 25849 }; 25851 18.39.3. DESCRIPTION 25853 The GET_DIR_DELEGATION operation is used by a client to request a 25854 directory delegation. The directory is represented by the current 25855 filehandle. The client also specifies whether it wants the server to 25856 notify it when the directory changes in certain ways by setting one 25857 or more bits in a bitmap. The server may refuse to grant the 25858 delegation. In that case, the server will return 25859 NFS4ERR_DIRDELEG_UNAVAIL. 
If the server decides to hand out the 25860 delegation, it will return a cookie verifier for that directory. If 25861 the cookie verifier changes when the client is holding the 25862 delegation, the delegation will be recalled unless the client has 25863 asked for notification for this event. 25865 The server will also return a directory delegation stateid, 25866 gddr_stateid, as a result of the GET_DIR_DELEGATION operation. This 25867 stateid will appear in callback messages related to the delegation, 25868 such as notifications and delegation recalls. The client will use 25869 this stateid to return the delegation voluntarily or upon recall. A 25870 delegation is returned by calling the DELEGRETURN operation. 25872 The server might not be able to support notifications of certain 25873 events. If the client asks for such notifications, the server MUST 25874 inform the client of its inability to do so as part of the 25875 GET_DIR_DELEGATION reply by not setting the appropriate bits in the 25876 supported notifications bitmask, gddr_notification, contained in the 25877 reply. The server MUST NOT add bits to gddr_notification that the 25878 client did not request. 25880 The GET_DIR_DELEGATION operation can be used for both normal and 25881 named attribute directories. 25883 If the client sets gdda_signal_deleg_avail to TRUE, then it is 25884 registering with the server a "want" for a directory delegation. If 25885 the delegation is not available, and the server supports and will 25886 honor the "want", the results will have 25887 gddrnf_will_signal_deleg_avail set to TRUE and no error will be 25888 indicated on return. If so, the client should expect a future 25889 CB_RECALLABLE_OBJ_AVAIL operation to indicate that a directory 25890 delegation is available. If the server does not wish to honor the 25891 "want" or is not able to do so, it returns the error 25892 NFS4ERR_DIRDELEG_UNAVAIL.
If the delegation is immediately 25893 available, the server SHOULD return it with the response to the 25894 operation, rather than via a callback. 25896 When a client makes a request for a directory delegation while it 25897 already holds a directory delegation for that directory (including 25898 the case where it has been recalled but not yet returned by the 25899 client or revoked by the server), the server MUST reply with the 25900 value of gddr_status set to NFS4_OK, the value of gddrnf_status set 25901 to GDD4_UNAVAIL, and the value of gddrnf_will_signal_deleg_avail set 25902 to FALSE. The delegation the client held before the request remains 25903 intact, and its state is unchanged. The current stateid is not 25904 changed (see Section 16.2.3.1.2 for a description of the current 25905 stateid). 25907 18.39.4. IMPLEMENTATION 25909 Directory delegations provide the benefit of improving cache 25910 consistency of namespace information. This is done through 25911 synchronous callbacks. A server must support synchronous callbacks 25912 in order to support directory delegations. In addition to that, 25913 asynchronous notifications provide a way to reduce network traffic as 25914 well as improve client performance in certain conditions. 25916 Notifications are specified in terms of potential changes to the 25917 directory. A client can ask to be notified of events by setting one 25918 or more bits in gdda_notification_types. The client can ask for 25919 notifications on addition of entries to a directory (by setting the 25920 NOTIFY4_ADD_ENTRY in gdda_notification_types), notifications on entry 25921 removal (NOTIFY4_REMOVE_ENTRY), renames (NOTIFY4_RENAME_ENTRY), 25922 directory attribute changes (NOTIFY4_CHANGE_DIR_ATTRIBUTES), and 25923 cookie verifier changes (NOTIFY4_CHANGE_COOKIE_VERIFIER) by setting 25924 one or more corresponding bits in the gdda_notification_types field. 
25926 The client can also ask for notifications of changes to attributes of 25927 directory entries (NOTIFY4_CHANGE_CHILD_ATTRIBUTES) in order to keep 25928 its attribute cache up to date. However, any changes made to child 25929 attributes do not cause the delegation to be recalled. If a client 25930 is interested in directory entry caching or negative name caching, it 25931 can set the gdda_notification_types appropriately to its particular 25932 need and the server will notify it of all changes that would 25933 otherwise invalidate its name cache. The kind of notification a 25934 client asks for may depend on the directory size, its rate of change, 25935 and the applications being used to access that directory. The 25936 enumeration of the conditions under which a client might ask for a 25937 notification is out of the scope of this specification. 25939 For attribute notifications, the client will set bits in the 25940 gdda_dir_attributes bitmap to indicate which attributes it wants to 25941 be notified of. If the server does not support notifications for 25942 changes to a certain attribute, it SHOULD NOT set that attribute in 25943 the supported attribute bitmap specified in the reply 25944 (gddr_dir_attributes). The client will also set in the 25945 gdda_child_attributes bitmap the attributes of directory entries it 25946 wants to be notified of, and the server will indicate in 25947 gddr_child_attributes which attributes of directory entries it will 25948 notify the client of. 25950 The client will also let the server know if it wants to get the 25951 notification as soon as the attribute change occurs or after a 25952 certain delay by setting a delay factor; gdda_child_attr_delay is for 25953 attribute changes to directory entries and gdda_dir_attr_delay is for 25954 attribute changes to the directory. 
If this delay factor is set to 25955 zero, that indicates to the server that the client wants to be 25956 notified of any attribute changes as soon as they occur. If the 25957 delay factor is set to N seconds, the server will make a best-effort 25958 guarantee that attribute updates are synchronized within N seconds. 25959 If the client asks for a delay factor that the server does not 25960 support or that may cause significant resource consumption on the 25961 server by causing the server to send a lot of notifications, the 25962 server should not commit to sending out notifications for attributes 25963 and therefore must not set the appropriate bit in the 25964 gddr_child_attributes and gddr_dir_attributes bitmaps in the 25965 response. 25967 The client MUST use a security tuple (Section 2.6.1) that the 25968 directory or its applicable ancestor (Section 2.6) is exported with. 25969 If not, the server MUST return NFS4ERR_WRONGSEC to the operation that 25970 both precedes GET_DIR_DELEGATION and sets the current filehandle (see 25971 Section 2.6.3.1). 25973 The directory delegation covers all the entries in the directory 25974 except the parent entry. That means if a directory and its parent 25975 both hold directory delegations, any changes to the parent will not 25976 cause a notification to be sent for the child even though the child's 25977 parent entry points to the parent directory. 25979 18.40. Operation 47: GETDEVICEINFO - Get Device Information 25981 18.40.1. ARGUMENT 25983 struct GETDEVICEINFO4args { 25984 deviceid4 gdia_device_id; 25985 layouttype4 gdia_layout_type; 25986 count4 gdia_maxcount; 25987 bitmap4 gdia_notify_types; 25988 }; 25990 18.40.2. 
RESULT 25992 struct GETDEVICEINFO4resok { 25993 device_addr4 gdir_device_addr; 25994 bitmap4 gdir_notification; 25995 }; 25997 union GETDEVICEINFO4res switch (nfsstat4 gdir_status) { 25998 case NFS4_OK: 25999 GETDEVICEINFO4resok gdir_resok4; 26000 case NFS4ERR_TOOSMALL: 26001 count4 gdir_mincount; 26002 default: 26003 void; 26004 }; 26006 18.40.3. DESCRIPTION 26008 The GETDEVICEINFO operation returns pNFS storage device address 26009 information for the specified device ID. The client identifies the 26010 device information to be returned by providing the gdia_device_id and 26011 gdia_layout_type that uniquely identify the device. The client 26012 provides gdia_maxcount to limit the number of bytes for the result. 26013 This maximum size represents all of the data being returned within 26014 the GETDEVICEINFO4resok structure and includes the XDR overhead. The 26015 server may return less data. If the server is unable to return any 26016 information within the gdia_maxcount limit, the error 26017 NFS4ERR_TOOSMALL will be returned. However, if gdia_maxcount is 26018 zero, NFS4ERR_TOOSMALL MUST NOT be returned. 26020 The da_layout_type field of the gdir_device_addr returned by the 26021 server MUST be equal to the gdia_layout_type specified by the client. 26022 If it is not equal, the client SHOULD ignore the response as invalid 26023 and behave as if the server returned an error, even if the client 26024 does have support for the layout type returned. 26026 The client also provides a notification bitmap, gdia_notify_types, 26027 indicating the device ID mapping notifications that it is interested in 26028 receiving; the server must support device ID notifications for the 26029 notification request to have effect. The notification mask is 26030 composed in the same manner as the bitmap for file attributes 26031 (Section 3.3.7). The numbers of bit positions are listed in the 26032 notify_device_type4 enumeration type (Section 20.12).
Only two 26033 enumerated values of notify_device_type4 currently apply to 26034 GETDEVICEINFO: NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE 26035 (see Section 20.12). 26037 The notification bitmap applies only to the specified device ID. If 26038 a client sends a GETDEVICEINFO operation on a device ID multiple 26039 times, the last notification bitmap is used by the server for 26040 subsequent notifications. If the bitmap is zero or empty, then the 26041 device ID's notifications are turned off. 26043 If the client wants to just update or turn off notifications, it MAY 26044 send a GETDEVICEINFO operation with gdia_maxcount set to zero. In 26045 that event, if the device ID is valid, the reply's da_addr_body field 26046 of the gdir_device_addr field will be of zero length. 26048 If an unknown device ID is given in gdia_device_id, the server 26049 returns NFS4ERR_NOENT. Otherwise, the device address information is 26050 returned in gdir_device_addr. Finally, if the server supports 26051 notifications for device ID mappings, the gdir_notification result 26052 will contain a bitmap of which notifications it will actually send to 26053 the client (via CB_NOTIFY_DEVICEID, see Section 20.12). 26055 If NFS4ERR_TOOSMALL is returned, the results also contain 26056 gdir_mincount. The value of gdir_mincount represents the minimum 26057 size necessary to obtain the device information. 26059 18.40.4. IMPLEMENTATION 26061 Aside from updating or turning off notifications, another use case 26062 for gdia_maxcount being set to zero is to validate a device ID. 26064 The client SHOULD request a notification for changes or deletion of a 26065 device ID to device address mapping so that the server can allow the 26066 client to gracefully use a new mapping, without having pending I/O fail 26067 abruptly, or force layouts using the device ID to be recalled or 26068 revoked.
26070 It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with 26071 CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the 26072 client gets and processes the response to GETDEVICEINFO or 26073 GETDEVICELIST. The analysis of the race leverages the fact that the 26074 server MUST NOT delete a device ID that is referred to by a layout 26075 the client has. 26077 o CB_NOTIFY_DEVICEID deletes a device ID. If the client believes it 26078 has layouts that refer to the device ID, then it is possible that 26079 layouts referring to the deleted device ID have been revoked. The 26080 client should send a TEST_STATEID request using the stateid for 26081 each layout that might have been revoked. If TEST_STATEID 26082 indicates that any layouts have been revoked, the client must 26083 recover from layout revocation as described in Section 12.5.6. If 26084 TEST_STATEID indicates that at least one layout has not been 26085 revoked, the client should send a GETDEVICEINFO operation on the 26086 supposedly deleted device ID to verify that the device ID has been 26087 deleted. 26089 If GETDEVICEINFO indicates that the device ID does not exist, then 26090 the client assumes the server is faulty and recovers by sending an 26091 EXCHANGE_ID operation. If GETDEVICEINFO indicates that the device 26092 ID does exist, then while the server is faulty for sending an 26093 erroneous device ID deletion notification, the degree to which it 26094 is faulty does not require the client to create a new client ID. 26096 If the client does not have layouts that refer to the device ID, 26097 no harm is done. The client should mark the device ID as deleted, 26098 and when GETDEVICEINFO or GETDEVICELIST results are received that 26099 indicate that the device ID has been in fact deleted, the device 26100 ID should be removed from the client's cache. 26102 o CB_NOTIFY_DEVICEID indicates that a device ID's device addressing 26103 mappings have changed. 
The client should assume that the results 26104 from the in-progress GETDEVICEINFO will be stale for the device ID 26105 once received, and so it should send another GETDEVICEINFO on the 26106 device ID. 26108 18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for a File 26109 System 26111 18.41.1. ARGUMENT 26113 struct GETDEVICELIST4args { 26114 /* CURRENT_FH: object belonging to the file system */ 26115 layouttype4 gdla_layout_type; 26117 /* number of deviceIDs to return */ 26118 count4 gdla_maxdevices; 26120 nfs_cookie4 gdla_cookie; 26121 verifier4 gdla_cookieverf; 26122 }; 26124 18.41.2. RESULT 26125 struct GETDEVICELIST4resok { 26126 nfs_cookie4 gdlr_cookie; 26127 verifier4 gdlr_cookieverf; 26128 deviceid4 gdlr_deviceid_list<>; 26129 bool gdlr_eof; 26130 }; 26132 union GETDEVICELIST4res switch (nfsstat4 gdlr_status) { 26133 case NFS4_OK: 26134 GETDEVICELIST4resok gdlr_resok4; 26135 default: 26136 void; 26137 }; 26139 18.41.3. DESCRIPTION 26141 This operation is used by the client to enumerate all of the device 26142 IDs that a server's file system uses. 26144 The client provides a current filehandle of a file object that 26145 belongs to the file system (i.e., all file objects sharing the same 26146 fsid as that of the current filehandle) and the layout type in 26147 gdla_layout_type. Since this operation might require multiple calls 26148 to enumerate all the device IDs (and is thus similar to the READDIR 26149 (Section 18.23) operation), the client also provides gdla_cookie and 26150 gdla_cookieverf to specify the current cursor position in the list. 26151 When the client wants to read from the beginning of the file system's 26152 device mappings, it sets gdla_cookie to zero. The field 26153 gdla_cookieverf MUST be ignored by the server when gdla_cookie is 26154 zero. The client provides gdla_maxdevices to limit the number of 26155 device IDs in the result. If gdla_maxdevices is zero, the server 26156 MUST return NFS4ERR_INVAL.
The server MAY return fewer device IDs. 26158 The successful response to the operation will contain the cookie, 26159 gdlr_cookie, and the cookie verifier, gdlr_cookieverf, to be used on 26160 the subsequent GETDEVICELIST. A gdlr_eof value of TRUE signifies 26161 that there are no remaining entries in the server's device list. 26162 Each element of gdlr_deviceid_list contains a device ID. 26164 18.41.4. IMPLEMENTATION 26166 An example of the use of this operation is for pNFS clients and 26167 servers that use LAYOUT4_BLOCK_VOLUME layouts. In these environments 26168 it may be helpful for a client to determine device accessibility upon 26169 first file system access. 26171 18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a Layout 26173 18.42.1. ARGUMENT 26175 union newtime4 switch (bool nt_timechanged) { 26176 case TRUE: 26177 nfstime4 nt_time; 26178 case FALSE: 26179 void; 26180 }; 26182 union newoffset4 switch (bool no_newoffset) { 26183 case TRUE: 26184 offset4 no_offset; 26185 case FALSE: 26186 void; 26187 }; 26189 struct LAYOUTCOMMIT4args { 26190 /* CURRENT_FH: file */ 26191 offset4 loca_offset; 26192 length4 loca_length; 26193 bool loca_reclaim; 26194 stateid4 loca_stateid; 26195 newoffset4 loca_last_write_offset; 26196 newtime4 loca_time_modify; 26197 layoutupdate4 loca_layoutupdate; 26198 }; 26200 18.42.2. RESULT 26201 union newsize4 switch (bool ns_sizechanged) { 26202 case TRUE: 26203 length4 ns_size; 26204 case FALSE: 26205 void; 26206 }; 26208 struct LAYOUTCOMMIT4resok { 26209 newsize4 locr_newsize; 26210 }; 26212 union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) { 26213 case NFS4_OK: 26214 LAYOUTCOMMIT4resok locr_resok4; 26215 default: 26216 void; 26217 }; 26219 18.42.3. DESCRIPTION 26221 The LAYOUTCOMMIT operation commits changes in the layout represented 26222 by the current filehandle, client ID (derived from the session ID in 26223 the preceding SEQUENCE operation), byte-range, and stateid. 
Since 26224 layouts are sub-dividable, a smaller portion of a layout, retrieved 26225 via LAYOUTGET, can be committed. The byte-range being committed is 26226 specified through the byte-range (loca_offset and loca_length). This 26227 byte-range MUST overlap with one or more existing layouts previously 26228 granted via LAYOUTGET (Section 18.43), each with an iomode of 26229 LAYOUTIOMODE4_RW. In the case where the iomode of any held layout 26230 segment is not LAYOUTIOMODE4_RW, the server should return the error 26231 NFS4ERR_BAD_IOMODE. For the case where the client does not hold 26232 matching layout segment(s) for the defined byte-range, the server 26233 should return the error NFS4ERR_BAD_LAYOUT. 26235 The LAYOUTCOMMIT operation indicates that the client has completed 26236 writes using a layout obtained by a previous LAYOUTGET. The client 26237 may have only written a subset of the data range it previously 26238 requested. LAYOUTCOMMIT allows it to commit or discard provisionally 26239 allocated space and to update the server with a new end-of-file. The 26240 layout referenced by LAYOUTCOMMIT is still valid after the operation 26241 completes and can continue to be referenced by the client ID, 26242 filehandle, byte-range, layout type, and stateid. 26244 If the loca_reclaim field is set to TRUE, this indicates that the 26245 client is attempting to commit changes to a layout after the restart 26246 of the metadata server during the metadata server's recovery grace 26247 period (see Section 12.7.4). This type of request may be necessary 26248 when the client has uncommitted writes to provisionally allocated 26249 byte-ranges of a file that were sent to the storage devices before 26250 the restart of the metadata server. In this case, the layout 26251 provided by the client MUST be a subset of a writable layout that the 26252 client held immediately before the restart of the metadata server.
26253 The value of the field loca_stateid MUST be a value that the metadata 26254 server returned before it restarted. The metadata server is free to 26255 accept or reject this request based on its own internal metadata 26256 consistency checks. If the metadata server finds that the layout 26257 provided by the client does not pass its consistency checks, it MUST 26258 reject the request with the status NFS4ERR_RECLAIM_BAD. The 26259 successful completion of the LAYOUTCOMMIT request with loca_reclaim 26260 set to TRUE does NOT provide the client with a layout for the file. 26261 It simply commits the changes to the layout specified in the 26262 loca_layoutupdate field. To obtain a layout for the file, the client 26263 must send a LAYOUTGET request to the server after the server's grace 26264 period has expired. If the metadata server receives a LAYOUTCOMMIT 26265 request with loca_reclaim set to TRUE when the metadata server is not 26266 in its recovery grace period, it MUST reject the request with the 26267 status NFS4ERR_NO_GRACE. 26269 Setting the loca_reclaim field to TRUE is required if and only if the 26270 committed layout was acquired before the metadata server restart. If 26271 the client is committing a layout that was acquired during the 26272 metadata server's grace period, it MUST set the "reclaim" field to 26273 FALSE. 26275 The loca_stateid is a layout stateid value as returned by previously 26276 successful layout operations (see Section 12.5.3). 26278 The loca_last_write_offset field specifies the offset of the last 26279 byte written by the client previous to the LAYOUTCOMMIT. Note that 26280 this value is never equal to the file's size (at most it is one byte 26281 less than the file's size) and MUST be less than or equal to 26282 NFS4_MAXFILEOFF. Also, loca_last_write_offset MUST overlap the range 26283 described by loca_offset and loca_length. 
The metadata server may 26284 use this information to determine whether the file's size needs to be 26285 updated. If the metadata server updates the file's size as the 26286 result of the LAYOUTCOMMIT operation, it must return the new size 26287 (locr_newsize.ns_size) as part of the results. 26289 The loca_time_modify field allows the client to suggest a 26290 modification time it would like the metadata server to set. The 26291 metadata server may use the suggestion or it may use the time of the 26292 LAYOUTCOMMIT operation to set the modification time. If the metadata 26293 server uses the client-provided modification time, it should ensure 26294 that time does not flow backwards. If the client wants to force the 26295 metadata server to set an exact time, the client should use a SETATTR 26296 operation in a COMPOUND right after LAYOUTCOMMIT. See Section 12.5.4 26297 for more details. If the client desires the resultant modification 26298 time, it should construct the COMPOUND so that a GETATTR follows the 26299 LAYOUTCOMMIT. 26301 The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism 26302 for a client to provide layout-specific updates to the metadata 26303 server. For example, the layout update can describe what byte-ranges 26304 of the original layout have been used and what byte-ranges can be 26305 deallocated. There is no NFSv4.1 file layout-specific layoutupdate4 26306 structure. 26308 The layout information is more verbose for block devices than for 26309 objects and files because the latter two hide the details of block 26310 allocation behind their storage protocols. At the minimum, the 26311 client needs to communicate changes to the end-of-file location back 26312 to the server, and, if desired, its view of the file's modification 26313 time. For block/volume layouts, it needs to specify precisely which 26314 blocks have been used. 
26316 If the layout identified in the arguments does not exist, the error 26317 NFS4ERR_BADLAYOUT is returned. The layout being committed may also 26318 be rejected if it does not correspond to an existing layout with an 26319 iomode of LAYOUTIOMODE4_RW. 26321 On success, the current filehandle retains its value and the current 26322 stateid retains its value. 26324 18.42.4. IMPLEMENTATION 26326 The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set 26327 to TRUE to convey hints about modified file attributes or to report 26328 layout-type specific information such as I/O errors for object-based 26329 storage layouts, as it normally does during regular operation. Doing so 26330 may help the metadata server to recover files more efficiently after 26331 restart. For example, some file system implementations may require 26332 expansive recovery of file system objects if the metadata server does 26333 not get a positive indication from all clients holding a 26334 LAYOUTIOMODE4_RW layout that they have successfully completed all 26335 their writes. Sending a LAYOUTCOMMIT (if required) and then 26336 following with LAYOUTRETURN can provide such an indication and allow 26337 for graceful and efficient recovery. 26339 If loca_reclaim is TRUE, the metadata server is free to either 26340 examine or ignore the value in the field loca_stateid. The metadata 26341 server implementation might or might not encode in its layout stateid 26342 information that allows the metadata server to perform a consistency 26343 check on the LAYOUTCOMMIT request. 26345 18.43. Operation 50: LAYOUTGET - Get Layout Information 26347 18.43.1. ARGUMENT 26349 struct LAYOUTGET4args { 26350 /* CURRENT_FH: file */ 26351 bool loga_signal_layout_avail; 26352 layouttype4 loga_layout_type; 26353 layoutiomode4 loga_iomode; 26354 offset4 loga_offset; 26355 length4 loga_length; 26356 length4 loga_minlength; 26357 stateid4 loga_stateid; 26358 count4 loga_maxcount; 26359 }; 26361 18.43.2.
RESULT 26363 struct LAYOUTGET4resok { 26364 bool logr_return_on_close; 26365 stateid4 logr_stateid; 26366 layout4 logr_layout<>; 26367 }; 26369 union LAYOUTGET4res switch (nfsstat4 logr_status) { 26370 case NFS4_OK: 26371 LAYOUTGET4resok logr_resok4; 26372 case NFS4ERR_LAYOUTTRYLATER: 26373 bool logr_will_signal_layout_avail; 26374 default: 26375 void; 26376 }; 26378 18.43.3. DESCRIPTION 26380 The LAYOUTGET operation requests a layout from the metadata server 26381 for reading or writing the file given by the filehandle at the byte- 26382 range specified by offset and length. Layouts are identified by the 26383 client ID (derived from the session ID in the preceding SEQUENCE 26384 operation), current filehandle, layout type (loga_layout_type), and 26385 the layout stateid (loga_stateid). The use of the loga_iomode field 26386 depends upon the layout type, but should reflect the client's data 26387 access intent. 26389 If the metadata server is in a grace period, and does not persist 26390 layouts and device ID to device address mappings, then it MUST return 26391 NFS4ERR_GRACE (see Section 8.4.2.1). 26393 The LAYOUTGET operation returns layout information for the specified 26394 byte-range: a layout. The client actually specifies two ranges, both 26395 starting at the offset in the loga_offset field. The first range is 26396 between loga_offset and loga_offset + loga_length - 1 inclusive. 26397 This range indicates the desired range the client wants the layout to 26398 cover. The second range is between loga_offset and loga_offset + 26399 loga_minlength - 1 inclusive. This range indicates the required 26400 range the client needs the layout to cover. Thus, loga_minlength 26401 MUST be less than or equal to loga_length. 
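The two inclusive ranges encoded by a single LAYOUTGET request can be computed as follows (a small illustrative helper, not part of the protocol):

```python
def layoutget_ranges(loga_offset, loga_length, loga_minlength):
    """Return the desired and required byte-ranges, both inclusive.

    Both ranges start at loga_offset; the desired range extends for
    loga_length bytes, the required range for loga_minlength bytes.
    """
    desired = (loga_offset, loga_offset + loga_length - 1)
    required = (loga_offset, loga_offset + loga_minlength - 1)
    return desired, required
```

For instance, loga_offset = 4096, loga_length = 8192, and loga_minlength = 4096 describe a desired range of [4096, 12287] and a required range of [4096, 8191].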
26403 When a length field is set to NFS4_UINT64_MAX, this indicates a 26404 desire (when loga_length is NFS4_UINT64_MAX) or requirement (when 26405 loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset 26406 through the end-of-file, regardless of the file's length. 26408 The following rules govern the relationships among, and the minima 26409 of, loga_length, loga_minlength, and loga_offset. 26411 o If loga_length is less than loga_minlength, the metadata server 26412 MUST return NFS4ERR_INVAL. 26414 o If loga_minlength is zero, this is an indication to the metadata 26415 server that the client desires any layout at offset loga_offset or 26416 less that the metadata server has "readily available". "Readily 26417 available" is subjective, and depends on the layout type and the pNFS server 26418 implementation. For example, some metadata servers might have to 26419 pre-allocate stable storage when they receive a request for a 26420 range of a file that goes beyond the file's current length. If 26421 loga_minlength is zero and loga_length is greater than zero, this 26422 tells the metadata server what range of the layout the client 26423 would prefer to have. If loga_length and loga_minlength are both 26424 zero, then the client is indicating that it desires a layout of 26425 any length with the ending offset of the range no less than the 26426 value specified by loga_offset, and the starting offset at or below 26427 loga_offset. If the metadata server does not have a layout that 26428 is readily available, then it MUST return NFS4ERR_LAYOUTTRYLATER. 26430 o If the sum of loga_offset and loga_minlength exceeds 26431 NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the 26432 error NFS4ERR_INVAL MUST result. 26434 o If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX, 26435 and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL 26436 MUST result.
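Assuming 64-bit unsigned arithmetic, the NFS4ERR_INVAL rules above can be sketched as follows (illustrative only; the function name is hypothetical, and NFS4_UINT64_MAX doubles as the "to end of file" sentinel):

```python
NFS4_UINT64_MAX = 0xffffffffffffffff

def check_layoutget_ranges(loga_offset, loga_length, loga_minlength):
    """Sketch of the NFS4ERR_INVAL checks listed above."""
    # loga_minlength must never exceed loga_length.
    if loga_length < loga_minlength:
        return "NFS4ERR_INVAL"
    # Unless a length field is the end-of-file sentinel, the sum of
    # offset and length must not exceed NFS4_UINT64_MAX.
    if (loga_minlength != NFS4_UINT64_MAX
            and loga_offset + loga_minlength > NFS4_UINT64_MAX):
        return "NFS4ERR_INVAL"
    if (loga_length != NFS4_UINT64_MAX
            and loga_offset + loga_length > NFS4_UINT64_MAX):
        return "NFS4ERR_INVAL"
    return "NFS4_OK"
```

Note that a request with loga_offset = 0, loga_length = NFS4_UINT64_MAX, and loga_minlength = 0 passes all checks: it asks for any readily available layout, preferably one covering the whole file.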
26438 After the metadata server has performed the above checks on 26439 loga_offset, loga_minlength, and loga_length, the metadata server 26440 MUST return a layout according to the rules in Table 13. 26442 Acceptable layouts based on loga_minlength. Note: u64m = 26443 NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength. 26445 +-----------+-----------+----------+----------+---------------------+ 26446 | Layout | Layout | Layout | Layout | Layout length of | 26447 | iomode of | a_minlen | iomode | offset | reply | 26448 | request | of | of reply | of reply | | 26449 | | request | | | | 26450 +-----------+-----------+----------+----------+---------------------+ 26451 | _READ | u64m | MAY be | MUST be | MUST be >= file | 26452 | | | _READ | <= a_off | length - layout | 26453 | | | | | offset | 26454 | _READ | u64m | MAY be | MUST be | MUST be u64m | 26455 | | | _RW | <= a_off | | 26456 | _READ | > 0 and < | MAY be | MUST be | MUST be >= MIN(file | 26457 | | u64m | _READ | <= a_off | length, a_minlen + | 26458 | | | | | a_off) - layout | 26459 | | | | | offset | 26460 | _READ | > 0 and < | MAY be | MUST be | MUST be >= a_off - | 26461 | | u64m | _RW | <= a_off | layout offset + | 26462 | | | | | a_minlen | 26463 | _READ | 0 | MAY be | MUST be | MUST be > 0 | 26464 | | | _READ | <= a_off | | 26465 | _READ | 0 | MAY be | MUST be | MUST be > 0 | 26466 | | | _RW | <= a_off | | 26467 | _RW | u64m | MUST be | MUST be | MUST be u64m | 26468 | | | _RW | <= a_off | | 26469 | _RW | > 0 and < | MUST be | MUST be | MUST be >= a_off - | 26470 | | u64m | _RW | <= a_off | layout offset + | 26471 | | | | | a_minlen | 26472 | _RW | 0 | MUST be | MUST be | MUST be > 0 | 26473 | | | _RW | <= a_off | | 26474 +-----------+-----------+----------+----------+---------------------+ 26476 Table 13 26478 If loga_minlength is not zero and the metadata server cannot return a 26479 layout according to the rules in Table 13, then the metadata server 26480 MUST return the error
NFS4ERR_BADLAYOUT. If loga_minlength is zero 26481 and the metadata server cannot or will not return a layout according 26482 to the rules in Table 13, then the metadata server MUST return the 26483 error NFS4ERR_LAYOUTTRYLATER. Assuming that loga_length is greater 26484 than loga_minlength or equal to zero, the metadata server SHOULD 26485 return a layout according to the rules in Table 14. 26487 Desired layouts based on loga_length. The rules of Table 13 MUST be 26488 applied first. Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; 26489 a_len = loga_length. 26491 +------------+------------+-----------+-----------+-----------------+ 26492 | Layout | Layout | Layout | Layout | Layout length | 26493 | iomode of | a_len of | iomode of | offset of | of reply | 26494 | request | request | reply | reply | | 26495 +------------+------------+-----------+-----------+-----------------+ 26496 | _READ | u64m | MAY be | MUST be | SHOULD be u64m | 26497 | | | _READ | <= a_off | | 26498 | _READ | u64m | MAY be | MUST be | SHOULD be u64m | 26499 | | | _RW | <= a_off | | 26500 | _READ | > 0 and < | MAY be | MUST be | SHOULD be >= | 26501 | | u64m | _READ | <= a_off | a_off - layout | 26502 | | | | | offset + a_len | 26503 | _READ | > 0 and < | MAY be | MUST be | SHOULD be >= | 26504 | | u64m | _RW | <= a_off | a_off - layout | 26505 | | | | | offset + a_len | 26506 | _READ | 0 | MAY be | MUST be | SHOULD be > | 26507 | | | _READ | <= a_off | a_off - layout | 26508 | | | | | offset | 26509 | _READ | 0 | MAY be | MUST be | SHOULD be > | 26510 | | | _RW | <= a_off | a_off - layout | 26511 | | | | | offset | 26512 | _RW | u64m | MUST be | MUST be | SHOULD be u64m | 26513 | | | _RW | <= a_off | | 26514 | _RW | > 0 and < | MUST be | MUST be | SHOULD be >= | 26515 | | u64m | _RW | <= a_off | a_off - layout | 26516 | | | | | offset + a_len | 26517 | _RW | 0 | MUST be | MUST be | SHOULD be > | 26518 | | | _RW | <= a_off | a_off - layout | 26519 | | | | | offset | 26520
+------------+------------+-----------+-----------+-----------------+ 26522 Table 14 26524 The loga_stateid field specifies a valid stateid. If a layout is not 26525 currently held by the client, the loga_stateid field represents a 26526 stateid reflecting the correspondingly valid open, byte-range lock, 26527 or delegation stateid. Once a layout is held on the file by the 26528 client, the loga_stateid field MUST be a stateid as returned from a 26529 previous LAYOUTGET or LAYOUTRETURN operation or provided by a 26530 CB_LAYOUTRECALL operation (see Section 12.5.3). 26532 The loga_maxcount field specifies the maximum layout size (in bytes) 26533 that the client can handle. If the size of the layout structure 26534 exceeds the size specified by maxcount, the metadata server will 26535 return the NFS4ERR_TOOSMALL error. 26537 The returned layout is expressed as an array, logr_layout, with each 26538 element of type layout4. If a file has a single striping pattern, 26539 then logr_layout SHOULD contain just one entry. Otherwise, if the 26540 requested range overlaps more than one striping pattern, logr_layout 26541 will contain the required number of entries. The elements of 26542 logr_layout MUST be sorted in ascending order of the value of the 26543 lo_offset field of each element. There MUST be no gaps or overlaps 26544 in the range between two successive elements of logr_layout. The 26545 lo_iomode field in each element of logr_layout MUST be the same. 26547 Table 13 and Table 14 both refer to a returned layout iomode, offset, 26548 and length. Because the returned layout is encoded in the 26549 logr_layout array, more description is required. 26551 iomode 26553 The value of the returned layout iomode listed in Table 13 and 26554 Table 14 is equal to the value of the lo_iomode field in each 26555 element of logr_layout. 
As shown in Table 13 and Table 14, the 26556 metadata server MAY return a layout with an lo_iomode different 26557 from the requested iomode (field loga_iomode of the request). If 26558 it does so, it MUST ensure that the lo_iomode is more permissive 26559 than the loga_iomode requested. For example, this behavior allows 26560 an implementation to upgrade LAYOUTIOMODE4_READ requests to 26561 LAYOUTIOMODE4_RW requests at its discretion, within the limits of 26562 the layout type specific protocol. A lo_iomode of either 26563 LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned. 26565 offset 26567 The value of the returned layout offset listed in Table 13 and 26568 Table 14 is always equal to the lo_offset field of the first 26569 element of logr_layout. 26571 length 26573 When setting the value of the returned layout length, the 26574 situation is complicated by the possibility that the special 26575 layout length value NFS4_UINT64_MAX is involved. For a 26576 logr_layout array of N elements, the lo_length field in the first 26577 N-1 elements MUST NOT be NFS4_UINT64_MAX. The lo_length field of 26578 the last element of logr_layout can be NFS4_UINT64_MAX under some 26579 conditions as described in the following list. 26581 * If an applicable rule of Table 13 states that the metadata 26582 server MUST return a layout of length NFS4_UINT64_MAX, then the 26583 lo_length field of the last element of logr_layout MUST be 26584 NFS4_UINT64_MAX. 26586 * If an applicable rule of Table 13 states that the metadata 26587 server MUST NOT return a layout of length NFS4_UINT64_MAX, then 26588 the lo_length field of the last element of logr_layout MUST NOT 26589 be NFS4_UINT64_MAX. 26591 * If an applicable rule of Table 14 states that the metadata 26592 server SHOULD return a layout of length NFS4_UINT64_MAX, then 26593 the lo_length field of the last element of logr_layout SHOULD 26594 be NFS4_UINT64_MAX.
26596 * When the value of the returned layout length of Table 13 and 26597 Table 14 is not NFS4_UINT64_MAX, then the returned layout 26598 length is equal to the sum of the lo_length fields of each 26599 element of logr_layout. 26601 The logr_return_on_close result field is a directive to return the 26602 layout before closing the file. When the metadata server sets this 26603 return value to TRUE, it MUST be prepared to recall the layout in the 26604 case in which the client fails to return the layout before close. 26605 For the metadata server that knows a layout must be returned before a 26606 close of the file, this return value can be used to communicate the 26607 desired behavior to the client and thus remove one extra step from 26608 the client's and metadata server's interaction. 26610 The logr_stateid stateid is returned to the client for use in 26611 subsequent layout related operations. See Sections 8.2, 12.5.3, and 26612 12.5.5.2 for a further discussion and requirements. 26614 The format of the returned layout (lo_content) is specific to the 26615 layout type. The value of the layout type (lo_content.loc_type) for 26616 each of the elements of the array of layouts returned by the metadata 26617 server (logr_layout) MUST be equal to the loga_layout_type specified 26618 by the client. If it is not equal, the client SHOULD ignore the 26619 response as invalid and behave as if the metadata server returned an 26620 error, even if the client does have support for the layout type 26621 returned. 26623 If neither the requested file nor its containing file system support 26624 layouts, the metadata server MUST return NFS4ERR_LAYOUTUNAVAILABLE. 26625 If the layout type is not supported, the metadata server MUST return 26626 NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout 26627 matches the client provided layout identification, the metadata 26628 server MUST return NFS4ERR_BADLAYOUT. 
If an invalid loga_iomode is 26629 specified, or a loga_iomode of LAYOUTIOMODE4_ANY is specified, the 26630 metadata server MUST return NFS4ERR_BADIOMODE. 26632 If the layout for the file is unavailable due to transient 26633 conditions, e.g., file sharing prohibits layouts, the metadata server 26634 MUST return NFS4ERR_LAYOUTTRYLATER. 26636 If the layout request is rejected due to an overlapping layout 26637 recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. See 26638 Section 12.5.5.2 for details. 26640 If the layout conflicts with a mandatory byte-range lock held on the 26641 file, and if the storage devices have no method of enforcing 26642 mandatory locks, other than through the restriction of layouts, the 26643 metadata server SHOULD return NFS4ERR_LOCKED. 26645 If the client sets loga_signal_layout_avail to TRUE, then it is 26646 registering with the metadata server a "want" for a layout in the event the 26647 layout cannot be obtained due to resource exhaustion. If the 26648 metadata server supports and will honor the "want", the results will 26649 have logr_will_signal_layout_avail set to TRUE. If so, the client 26650 should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a 26651 layout is available. 26653 On success, the current filehandle retains its value and the current 26654 stateid is updated to match the value as returned in the results. 26656 18.43.4. IMPLEMENTATION 26658 Typically, LAYOUTGET will be called as part of a COMPOUND request 26659 after an OPEN operation and results in the client having location 26660 information for the file. This requires that loga_stateid be set to 26661 the special stateid that tells the metadata server to use the current 26662 stateid, which is set by OPEN (see Section 16.2.3.1.2). A client may 26663 also hold a layout across multiple OPENs. The client specifies a 26664 layout type that limits what kind of layout the metadata server will 26665 return.
This prevents metadata servers from granting layouts that 26666 are unusable by the client. 26668 As indicated by Table 13 and Table 14, the specification of LAYOUTGET 26669 allows a pNFS client and server considerable flexibility. A pNFS 26670 client can take several strategies for sending LAYOUTGET. Some 26671 examples are as follows. 26673 o If LAYOUTGET is preceded by OPEN in the same COMPOUND request and 26674 the OPEN requests OPEN4_SHARE_ACCESS_READ access, the client might 26675 opt to request a _READ layout with loga_offset set to zero, 26676 loga_minlength set to zero, and loga_length set to 26677 NFS4_UINT64_MAX. If the file has space allocated to it, that 26678 space is striped over one or more storage devices, and there is 26679 either no conflicting layout or the concept of a conflicting 26680 layout does not apply to the pNFS server's layout type or 26681 implementation, then the metadata server might return a layout 26682 with a starting offset of zero, and a length equal to the length 26683 of the file, if not NFS4_UINT64_MAX. If the length of the file is 26684 not a multiple of the pNFS server's stripe width (see Section 13.2 26685 for a formal definition), the metadata server might round up the 26686 returned layout's length. 26688 o If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and 26689 the OPEN requests OPEN4_SHARE_ACCESS_WRITE access and does not 26690 truncate the file, the client might opt to request a _RW layout 26691 with loga_offset set to zero, loga_minlength set to zero, and 26692 loga_length set to the file's current length (if known), or 26693 NFS4_UINT64_MAX. As with the previous case, under some conditions 26694 the metadata server might return a layout that covers the entire 26695 length of the file or beyond. 26697 o This strategy is as above, but the OPEN truncates the file. 
In 26698 this case, the client might anticipate it will be writing to the 26699 file from offset zero, and so loga_offset and loga_minlength are 26700 set to zero, and loga_length is set to the value of 26701 threshold4_write_iosize. The metadata server might return a 26702 layout from offset zero with a length at least as long as 26703 threshold4_write_iosize. 26705 o A process on the client invokes a request to read from offset 26706 10000 for length 50000. The client is using buffered I/O, and has 26707 buffer sizes of 4096 bytes. The client intends to map the request 26708 of the process into a series of READ requests starting at offset 26709 8192. The end offset needs to be higher than 10000 + 50000 = 26710 60000, and the next offset that is a multiple of 4096 is 61440. 26711 The difference between 61440 and the starting offset of the 26712 layout (8192) is 53248 (which is the product of 4096 and 13). The value 26713 of threshold4_read_iosize is less than 53248, so the client sends 26714 a LAYOUTGET request with loga_offset set to 8192, loga_minlength 26715 set to 53248, and loga_length set to the file's length (if known) 26716 minus 8192 or NFS4_UINT64_MAX (if the file's length is not known). 26717 Since this LAYOUTGET request exceeds the metadata server's 26718 threshold, it grants the layout, possibly with an initial offset 26719 of zero, with an end offset of at least 8192 + 53248 - 1 = 61439, 26720 but preferably a layout with an offset aligned on the stripe width 26721 and a length that is a multiple of the stripe width. 26723 o This strategy is as above, but the client is not using buffered I/ 26724 O, and instead all internal I/O requests are sent directly to the 26725 server. The LAYOUTGET request has loga_offset equal to 10000 and 26726 loga_minlength set to 50000. The value of loga_length is set to 26727 the length of the file.
The metadata server is free to return a 26728 layout that fully overlaps the requested range, with a starting 26729 offset and length aligned on the stripe width. 26731 o Again, a process on the client invokes a request to read from 26732 offset 10000 for length 50000 (i.e. a range with a starting offset 26733 of 10000 and an ending offset of 69999), and buffered I/O is in 26734 use. The client is expecting that the server might not be able to 26735 return the layout for the full I/O range. The client intends to 26736 map the request of the process into a series of thirteen READ 26737 requests starting at offset 8192, each with length 4096, with a 26738 total length of 53248 (which equals 13 * 4096), which fully 26739 contains the range that client's process wants to read. Because 26740 the value of threshold4_read_iosize is equal to 4096, it is 26741 practical and reasonable for the client to use several LAYOUTGET 26742 operations to complete the series of READs. The client sends a 26743 LAYOUTGET request with loga_offset set to 8192, loga_minlength set 26744 to 4096, and loga_length set to 53248 or higher. The server will 26745 grant a layout possibly with an initial offset of zero, with an 26746 end offset of at least 8192 + 4096 - 1 = 12287, but preferably a 26747 layout with an offset aligned on the stripe width and a length 26748 that is a multiple of the stripe width. This will allow the 26749 client to make forward progress, possibly sending more LAYOUTGET 26750 operations for the remainder of the range. 26752 o An NFS client detects a sequential read pattern, and so sends a 26753 LAYOUTGET operation that goes well beyond any current or pending 26754 read requests to the server. The server might likewise detect 26755 this pattern, and grant the LAYOUTGET request. 
Once the client 26756 reads from an offset of the file that represents 50% of the way 26757 through the range of the last layout it received, in order to 26758 avoid stalling I/O that would wait for a layout, the client sends 26759 more operations from an offset of the file that represents 50% of 26760 the way through the last layout it received. The client continues 26761 to request layouts with byte-ranges that are well in advance of 26762 the byte-ranges of recent and/or pending read requests of processes 26763 running on the client. 26765 o This strategy is as above, but while the client fails to detect the 26766 pattern, the server does. The next time the metadata server 26767 gets a LAYOUTGET, it returns a layout with a length that is well 26768 beyond loga_minlength. 26770 o A client is using buffered I/O, and has a long queue of write- 26771 behinds to process and also detects a sequential write pattern. 26772 It sends a LAYOUTGET for a layout that spans the range of the 26773 queued write-behinds and well beyond, including ranges beyond the 26774 file's current length. The client continues to send LAYOUTGET 26775 operations once the write-behind queue reaches 50% of the maximum 26776 queue length. 26778 Once the client has obtained a layout referring to a particular 26779 device ID, the metadata server MUST NOT delete the device ID until 26780 the layout is returned or revoked. 26782 CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is 26783 that LAYOUTGET returns a device ID for which the client does not have 26784 device address mappings, and the metadata server sends a 26785 CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and 26786 meanwhile the client sends GETDEVICEINFO on the device ID. This 26787 scenario is discussed in Section 18.40.4. Another scenario is that 26788 the CB_NOTIFY_DEVICEID is processed by the client before it processes 26789 the results from LAYOUTGET.
The client will send a GETDEVICEINFO on 26790 the device ID. If the results from GETDEVICEINFO are received before 26791 the client gets results from LAYOUTGET, then there is no longer a 26792 race. If the results from LAYOUTGET are received before the results 26793 from GETDEVICEINFO, the client can either wait for results of 26794 GETDEVICEINFO or send another one to get possibly more up-to-date 26795 device address mappings for the device ID. 26797 18.44. Operation 51: LAYOUTRETURN - Release Layout Information 26799 18.44.1. ARGUMENT 26800 /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ 26801 const LAYOUT4_RET_REC_FILE = 1; 26802 const LAYOUT4_RET_REC_FSID = 2; 26803 const LAYOUT4_RET_REC_ALL = 3; 26805 enum layoutreturn_type4 { 26806 LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, 26807 LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, 26808 LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL 26809 }; 26811 struct layoutreturn_file4 { 26812 offset4 lrf_offset; 26813 length4 lrf_length; 26814 stateid4 lrf_stateid; 26815 /* layouttype4 specific data */ 26816 opaque lrf_body<>; 26817 }; 26819 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 26820 case LAYOUTRETURN4_FILE: 26821 layoutreturn_file4 lr_layout; 26822 default: 26823 void; 26824 }; 26826 struct LAYOUTRETURN4args { 26827 /* CURRENT_FH: file */ 26828 bool lora_reclaim; 26829 layouttype4 lora_layout_type; 26830 layoutiomode4 lora_iomode; 26831 layoutreturn4 lora_layoutreturn; 26832 }; 26834 18.44.2. RESULT 26835 union layoutreturn_stateid switch (bool lrs_present) { 26836 case TRUE: 26837 stateid4 lrs_stateid; 26838 case FALSE: 26839 void; 26840 }; 26842 union LAYOUTRETURN4res switch (nfsstat4 lorr_status) { 26843 case NFS4_OK: 26844 layoutreturn_stateid lorr_stateid; 26845 default: 26846 void; 26847 }; 26849 18.44.3. 
DESCRIPTION 26851 This operation returns from the client to the server one or more 26852 layouts represented by the client ID (derived from the session ID in 26853 the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. 26854 When lr_returntype is LAYOUTRETURN4_FILE, the returned layout is 26855 further identified by the current filehandle, lrf_offset, lrf_length, 26856 and lrf_stateid. If the lrf_length field is NFS4_UINT64_MAX, all 26857 bytes of the layout, starting at lrf_offset, are returned. When 26858 lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used 26859 to identify the file system and all layouts matching the client ID, 26860 the fsid of the file system, lora_layout_type, and lora_iomode are 26861 returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts 26862 matching the client ID, lora_layout_type, and lora_iomode are 26863 returned and the current filehandle is not used. After this call, 26864 the client MUST NOT use the returned layout(s) and the associated 26865 storage protocol to access the file data. 26867 If the set of layouts designated in the case of LAYOUTRETURN4_FSID or 26868 LAYOUTRETURN4_ALL is empty, then no error results. In the case of 26869 LAYOUTRETURN4_FILE, the byte-range specified is returned even if it 26870 is a subdivision of a layout previously obtained with LAYOUTGET, a 26871 combination of multiple layouts previously obtained with LAYOUTGET, 26872 or a combination including some layouts previously obtained with 26873 LAYOUTGET, and one or more subdivisions of such layouts. When the 26874 byte-range does not designate any bytes for which a layout is held 26875 for the specified file, client ID, layout type and mode, no error 26876 results. See Section 12.5.5.2.1.5 for considerations with "bulk" 26877 return of layouts. 26879 The layout being returned may be a subset or superset of a layout 26880 specified by CB_LAYOUTRECALL. 
However, if it is a subset, the recall 26881 is not complete until the full recalled scope has been returned. 26883 Recalled scope refers to the byte-range in the case of 26884 LAYOUTRETURN4_FILE, the use of LAYOUTRETURN4_FSID, or the use of 26885 LAYOUTRETURN4_ALL. There must be a LAYOUTRETURN with a matching 26886 scope to complete the return even if all current layout ranges have 26887 been previously individually returned. 26889 For all lr_returntype values, an iomode of LAYOUTIOMODE4_ANY 26890 specifies that all layouts that match the other arguments to 26891 LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of current 26892 filehandle and range; fsid derived from current filehandle; or 26893 LAYOUTRETURN4_ALL) are being returned. 26895 In the case that lr_returntype is LAYOUTRETURN4_FILE, the lrf_stateid 26896 provided by the client is a layout stateid as returned from previous 26897 layout operations. Note that the "seqid" field of lrf_stateid MUST 26898 NOT be zero. See Sections 8.2, 12.5.3, and 12.5.5.2 for a further 26899 discussion and requirements. 26901 Return of a layout or all layouts does not invalidate the mapping of 26902 storage device ID to a storage device address. The mapping remains 26903 in effect until specifically changed or deleted via device ID 26904 notification callbacks. Of course if there are no remaining layouts 26905 that refer to a previously used device ID, the server is free to 26906 delete a device ID without a notification callback, which will be the 26907 case when notifications are not in effect. 26909 If the lora_reclaim field is set to TRUE, the client is attempting to 26910 return a layout that was acquired before the restart of the metadata 26911 server during the metadata server's grace period. When returning 26912 layouts that were acquired during the metadata server's grace period, 26913 the client MUST set the lora_reclaim field to FALSE. 
The 26914 lora_reclaim field MUST be set to FALSE also when lr_returntype is 26915 LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See LAYOUTCOMMIT 26916 (Section 18.42) for more details. 26918 Layouts may be returned when recalled or voluntarily (i.e., before 26919 the server has recalled them). In either case, the client must 26920 properly propagate state changed under the context of the layout to 26921 the storage device(s) or to the metadata server before returning the 26922 layout. 26924 If the client returns the layout in response to a CB_LAYOUTRECALL 26925 where the lor_recalltype field of the clora_recall field was 26926 LAYOUTRECALL4_FILE, the client should use the lor_stateid value from 26927 CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should 26928 use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid 26929 (from a previous LAYOUTRETURN result). This is done to indicate the 26930 point in time (in terms of layout stateid transitions) when the 26931 recall was sent. The client uses the precise lrf_stateid 26932 value and MUST NOT set the stateid's seqid to zero; otherwise, 26933 NFS4ERR_BAD_STATEID MUST be returned. NFS4ERR_OLD_STATEID can be 26934 returned if the client is using an old seqid, and the server knows 26935 the client should not be using the old seqid. For example, the 26936 client uses the seqid on slot 1 of the session, receives the response 26937 with the new seqid, and uses the slot to send another request with 26938 the old seqid. 26940 If a client fails to return a layout in a timely manner, then the 26941 metadata server SHOULD use its control protocol with the storage 26942 devices to fence the client from accessing the data referenced by the 26943 layout. See Section 12.5.5 for more details. 26945 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after 26946 the metadata server's grace period, NFS4ERR_NO_GRACE is returned.
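The lora_reclaim checks described in this section can be summarized in a small sketch. The names are illustrative, and the ordering of the two error checks when both would apply is an assumption of the sketch, not a requirement of the protocol.

```python
def check_layoutreturn_reclaim(lora_reclaim, lr_returntype, in_grace):
    """Return a status name for the reclaim-related checks above."""
    if lora_reclaim:
        # Reclaims only apply to individual files, not to bulk returns ...
        if lr_returntype in ("LAYOUTRETURN4_FSID", "LAYOUTRETURN4_ALL"):
            return "NFS4ERR_INVAL"
        # ... and only during the metadata server's grace period.
        if not in_grace:
            return "NFS4ERR_NO_GRACE"
    return "NFS4_OK"
```

A non-reclaim return (lora_reclaim FALSE) passes these checks regardless of the return type or the state of the grace period.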
26948 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and 26949 lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, 26950 NFS4ERR_INVAL is returned. 26952 If the client sets the lr_returntype field to LAYOUTRETURN4_FILE, 26953 then the lrs_stateid field will represent the layout stateid as 26954 updated for this operation's processing; the current stateid will 26955 also be updated to match the returned value. If the last byte of any 26956 layout for the current file, client ID, and layout type is being 26957 returned and there are no remaining pending CB_LAYOUTRECALL 26958 operations for which a LAYOUTRETURN operation must be done, 26959 lrs_present MUST be FALSE, and no stateid will be returned. In 26960 addition, the COMPOUND request's current stateid will be set to the 26961 all-zeroes special stateid (see Section 16.2.3.1.2). The server MUST 26962 reject with NFS4ERR_BAD_STATEID any further use of the current 26963 stateid in that COMPOUND until the current stateid is re-established 26964 by a later stateid-returning operation. 26966 On success, the current filehandle retains its value. 26968 If the EXCHGID4_FLAG_BIND_PRINC_STATEID capability is set on the 26969 client ID (see Section 18.35), the server will require that the 26970 principal, security flavor, and if applicable, the GSS mechanism, 26971 combination that acquired the layout also be the one to send 26972 LAYOUTRETURN. This might not be possible if credentials for the 26973 principal are no longer available. The server will allow the machine 26974 credential or SSV credential (see Section 18.35) to send LAYOUTRETURN 26975 if LAYOUTRETURN's operation code was set in the spo_must_allow result 26976 of EXCHANGE_ID. 26978 18.44.4. IMPLEMENTATION 26980 The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL 26981 callback MUST be serialized with any outstanding, intersecting 26982 LAYOUTRETURN operations. 
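The lora_reclaim error cases described above (NFS4ERR_NO_GRACE after the grace period, NFS4ERR_INVAL for non-file-scoped reclaims) can be summarized in a small server-side sketch. The enum values mirror the layoutreturn_type4 XDR names; the helper name and its boolean grace-period argument are illustrative assumptions.

```python
# Illustrative server-side validation of the lora_reclaim rules.
LAYOUTRETURN4_FILE, LAYOUTRETURN4_FSID, LAYOUTRETURN4_ALL = 1, 2, 3

def check_layoutreturn_reclaim(lora_reclaim, lr_returntype, in_grace_period):
    if lora_reclaim:
        # Reclaim returns are per-file only, and only during grace.
        if lr_returntype in (LAYOUTRETURN4_FSID, LAYOUTRETURN4_ALL):
            return "NFS4ERR_INVAL"
        if not in_grace_period:
            return "NFS4ERR_NO_GRACE"
    return "NFS4_OK"
```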
Note that it is possible that while a 26983 client is returning the layout for some recalled range, the server 26984 may recall a superset of that range (e.g., LAYOUTRECALL4_ALL); the 26985 final return operation for the latter must block until the former 26986 layout recall is done. 26988 Returning all layouts in a file system using LAYOUTRETURN4_FSID is 26989 typically done in response to a CB_LAYOUTRECALL for that file system 26990 as the final return operation. Similarly, LAYOUTRETURN4_ALL is used 26991 in response to a recall callback for all layouts. It is possible 26992 that the client already returned some outstanding layouts via 26993 individual LAYOUTRETURN calls and the call for LAYOUTRETURN4_FSID or 26994 LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See 26995 Section 12.5.5.1 for more details. 26997 Once the client has returned all layouts referring to a particular 26998 device ID, the server MAY delete the device ID. 27000 18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed Object 27002 18.45.1. ARGUMENT 27004 enum secinfo_style4 { 27005 SECINFO_STYLE4_CURRENT_FH = 0, 27006 SECINFO_STYLE4_PARENT = 1 27007 }; 27009 /* CURRENT_FH: object or child directory */ 27010 typedef secinfo_style4 SECINFO_NO_NAME4args; 27012 18.45.2. RESULT 27014 /* CURRENTFH: consumed if status is NFS4_OK */ 27015 typedef SECINFO4res SECINFO_NO_NAME4res; 27017 18.45.3. DESCRIPTION 27019 Like the SECINFO operation, SECINFO_NO_NAME is used by the client to 27020 obtain a list of valid RPC authentication flavors for a specific file 27021 object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that 27022 are accessed by filehandle. 27024 There are two styles of SECINFO_NO_NAME, as determined by the value 27025 of the secinfo_style4 enumeration. If SECINFO_STYLE4_CURRENT_FH is 27026 passed, then SECINFO_NO_NAME is querying for the required security 27027 for the current filehandle. 
If SECINFO_STYLE4_PARENT is passed, then 27028 SECINFO_NO_NAME is querying for the required security of the current 27029 filehandle's parent. If the style selected is SECINFO_STYLE4_PARENT, 27030 then SECINFO should apply the same access methodology used for 27031 LOOKUPP when evaluating the traversal to the parent directory. 27032 Therefore, if the requester does not have the appropriate access to 27033 LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and 27034 return NFS4ERR_ACCESS. 27036 If PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH returns NFS4ERR_WRONGSEC, 27037 then the client resolves the situation by sending a COMPOUND request 27038 that consists of PUTFH, PUTPUBFH, or PUTROOTFH immediately followed 27039 by SECINFO_NO_NAME, style SECINFO_STYLE4_CURRENT_FH. See Section 2.6 27040 for instructions on dealing with NFS4ERR_WRONGSEC error returns from 27041 PUTFH, PUTROOTFH, PUTPUBFH, or RESTOREFH. 27043 If SECINFO_STYLE4_PARENT is specified and there is no parent 27044 directory, SECINFO_NO_NAME MUST return NFS4ERR_NOENT. 27046 On success, the current filehandle is consumed (see 27047 Section 2.6.3.1.1.8), and if the next operation after SECINFO_NO_NAME 27048 tries to use the current filehandle, that operation will fail with 27049 the status NFS4ERR_NOFILEHANDLE. 27051 Everything else about SECINFO_NO_NAME is the same as SECINFO. See 27052 the discussion on SECINFO (Section 18.29.3). 27054 18.45.4. IMPLEMENTATION 27056 See the discussion on SECINFO (Section 18.29.4). 27058 18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing and 27059 Control 27061 18.46.1. ARGUMENT 27063 struct SEQUENCE4args { 27064 sessionid4 sa_sessionid; 27065 sequenceid4 sa_sequenceid; 27066 slotid4 sa_slotid; 27067 slotid4 sa_highest_slotid; 27068 bool sa_cachethis; 27069 }; 27071 18.46.2. 
RESULT 27073 const SEQ4_STATUS_CB_PATH_DOWN = 0x00000001; 27074 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING = 0x00000002; 27075 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED = 0x00000004; 27076 const SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008; 27077 const SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010; 27078 const SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020; 27079 const SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040; 27080 const SEQ4_STATUS_LEASE_MOVED = 0x00000080; 27081 const SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100; 27082 const SEQ4_STATUS_CB_PATH_DOWN_SESSION = 0x00000200; 27083 const SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400; 27084 const SEQ4_STATUS_DEVID_CHANGED = 0x00000800; 27085 const SEQ4_STATUS_DEVID_DELETED = 0x00001000; 27087 struct SEQUENCE4resok { 27088 sessionid4 sr_sessionid; 27089 sequenceid4 sr_sequenceid; 27090 slotid4 sr_slotid; 27091 slotid4 sr_highest_slotid; 27092 slotid4 sr_target_highest_slotid; 27093 uint32_t sr_status_flags; 27094 }; 27096 union SEQUENCE4res switch (nfsstat4 sr_status) { 27097 case NFS4_OK: 27098 SEQUENCE4resok sr_resok4; 27099 default: 27100 void; 27101 }; 27103 18.46.3. DESCRIPTION 27105 The SEQUENCE operation is used by the server to implement session 27106 request control and the reply cache semantics. 27108 SEQUENCE MUST appear as the first operation of any COMPOUND in which 27109 it appears. The error NFS4ERR_SEQUENCE_POS will be returned when it 27110 is found in any position in a COMPOUND beyond the first. Operations 27111 other than SEQUENCE, BIND_CONN_TO_SESSION, EXCHANGE_ID, 27112 CREATE_SESSION, and DESTROY_SESSION, MUST NOT appear as the first 27113 operation in a COMPOUND. Such operations MUST yield the error 27114 NFS4ERR_OP_NOT_IN_SESSION if they do appear at the start of a 27115 COMPOUND. 
27117 If SEQUENCE is received on a connection not associated with the 27118 session via CREATE_SESSION or BIND_CONN_TO_SESSION, and connection 27119 association enforcement is enabled (see Section 18.35), then the 27120 server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION. 27122 The sa_sessionid argument identifies the session to which this 27123 request applies. The sr_sessionid result MUST equal sa_sessionid. 27125 The sa_slotid argument is the index in the reply cache for the 27126 request. The sa_sequenceid field is the sequence number of the 27127 request for the reply cache entry (slot). The sr_slotid result MUST 27128 equal sa_slotid. The sr_sequenceid result MUST equal sa_sequenceid. 27130 The sa_highest_slotid argument is the highest slot ID for which the 27131 client has a request outstanding; it could be equal to sa_slotid. 27132 The server returns two "highest_slotid" values: sr_highest_slotid and 27133 sr_target_highest_slotid. The former is the highest slot ID the 27134 server will accept in future SEQUENCE operations, and SHOULD NOT be 27135 less than the value of sa_highest_slotid (but see Section 2.10.6.1 27136 for an exception). The latter is the highest slot ID the server 27137 would prefer the client use on a future SEQUENCE operation. 27139 If sa_cachethis is TRUE, then the client is requesting that the 27140 server cache the entire reply in the server's reply cache; therefore, 27141 the server MUST cache the reply (see Section 2.10.6.1.3). The server 27142 MAY cache the reply if sa_cachethis is FALSE. If the server does not 27143 cache the entire reply, it MUST still record that it executed the 27144 request at the specified slot and sequence ID. 27146 The response to the SEQUENCE operation contains a word of status 27147 flags (sr_status_flags) that can provide to the client information 27148 related to the status of the client's lock state and communications 27149 paths.
Note that any status bits relating to lock state MAY be reset 27150 when lock state is lost due to a server restart (even if the session 27151 is persistent across restarts; session persistence does not imply 27152 lock state persistence) or the establishment of a new client 27153 instance. 27155 SEQ4_STATUS_CB_PATH_DOWN 27156 When set, indicates that the client has no operational backchannel 27157 path for any session associated with the client ID, making it 27158 necessary for the client to re-establish one. This bit remains 27159 set on all SEQUENCE responses on all sessions associated with the 27160 client ID until at least one backchannel is available on any 27161 session associated with the client ID. If the client fails to re- 27162 establish a backchannel for the client ID, it is subject to having 27163 recallable state revoked. 27165 SEQ4_STATUS_CB_PATH_DOWN_SESSION 27166 When set, indicates that the session has no operational 27167 backchannel. There are two reasons why 27168 SEQ4_STATUS_CB_PATH_DOWN_SESSION may be set and not 27169 SEQ4_STATUS_CB_PATH_DOWN. First is that a callback operation that 27170 applies specifically to the session (e.g., CB_RECALL_SLOT, see 27171 Section 20.8) needs to be sent. Second is that the server did 27172 send a callback operation, but the connection was lost before the 27173 reply. The server cannot be sure whether or not the client 27174 received the callback operation, and so, per rules on request 27175 retry, the server MUST retry the callback operation over the same 27176 session. The SEQ4_STATUS_CB_PATH_DOWN_SESSION bit is the 27177 indication to the client that it needs to associate a connection 27178 to the session's backchannel. This bit remains set on all 27179 SEQUENCE responses of the session until a connection is associated 27180 with the session's backchannel. If the client fails to re- 27181 establish a backchannel for the session, it is subject to having 27182 recallable state revoked.
27184 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING 27185 When set, indicates that all GSS contexts or RPCSEC_GSS handles 27186 assigned to the session's backchannel will expire within a period 27187 equal to the lease time. This bit remains set on all SEQUENCE 27188 replies until at least one of the following is true: 27190 * All SSV RPCSEC_GSS handles on the session's backchannel have 27191 been destroyed and all non-SSV GSS contexts have expired. 27193 * At least one more SSV RPCSEC_GSS handle has been added to the 27194 backchannel. 27196 * The expiration time of at least one non-SSV GSS context of an 27197 RPCSEC_GSS handle is beyond the lease period from the current 27198 time (relative to the time of when a SEQUENCE response was 27199 sent). 27201 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED 27202 When set, indicates that all non-SSV GSS contexts and all SSV 27203 RPCSEC_GSS handles assigned to the session's backchannel have 27204 expired or have been destroyed. This bit remains set on all 27205 SEQUENCE replies until at least one non-expired non-SSV GSS 27206 context for the session's backchannel has been established or at 27207 least one SSV RPCSEC_GSS handle has been assigned to the 27208 backchannel. 27210 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED 27211 When set, indicates that the lease has expired and as a result the 27212 server released all of the client's locking state. This status 27213 bit remains set on all SEQUENCE replies until the loss of all such 27214 locks has been acknowledged by use of FREE_STATEID (see 27215 Section 18.38), or by establishing a new client instance by 27216 destroying all sessions (via DESTROY_SESSION), the client ID (via 27217 DESTROY_CLIENTID), and then invoking EXCHANGE_ID and 27218 CREATE_SESSION to establish a new client ID.
27220 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED 27221 When set, indicates that some subset of the client's locks have 27222 been revoked due to expiration of the lease period followed by 27223 another client's conflicting LOCK operation. This status bit 27224 remains set on all SEQUENCE replies until the loss of all such 27225 locks has been acknowledged by use of FREE_STATEID. 27227 SEQ4_STATUS_ADMIN_STATE_REVOKED 27228 When set, indicates that one or more locks have been revoked 27229 without expiration of the lease period, due to administrative 27230 action. This status bit remains set on all SEQUENCE replies until 27231 the loss of all such locks has been acknowledged by use of 27232 FREE_STATEID. 27234 SEQ4_STATUS_RECALLABLE_STATE_REVOKED 27235 When set, indicates that one or more recallable objects have been 27236 revoked without expiration of the lease period, due to the 27237 client's failure to return them when recalled, which may be a 27238 consequence of there being no working backchannel and the client 27239 failing to re-establish a backchannel per the 27240 SEQ4_STATUS_CB_PATH_DOWN, SEQ4_STATUS_CB_PATH_DOWN_SESSION, or 27241 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED status flags. This status bit 27242 remains set on all SEQUENCE replies until the loss of all such 27243 locks has been acknowledged by use of FREE_STATEID. 27245 SEQ4_STATUS_LEASE_MOVED 27246 When set, indicates that responsibility for lease renewal has been 27247 transferred to one or more new servers. This condition will 27248 continue until the client receives an NFS4ERR_MOVED error and the 27249 server receives the subsequent GETATTR for the fs_locations or 27250 fs_locations_info attribute for an access to each file system for 27251 which a lease has been moved to a new server. See 27252 Section 11.10.9.1. 27254 SEQ4_STATUS_RESTART_RECLAIM_NEEDED 27255 When set, indicates that due to server restart, the client must 27256 reclaim locking state. 
Until the client sends a global 27257 RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation will 27258 return SEQ4_STATUS_RESTART_RECLAIM_NEEDED. 27260 SEQ4_STATUS_BACKCHANNEL_FAULT 27261 The server has encountered an unrecoverable fault with the 27262 backchannel (e.g., it has lost track of the sequence ID for a slot 27263 in the backchannel). The client MUST stop sending more requests 27264 on the session's fore channel, wait for all outstanding requests 27265 to complete on the fore and back channel, and then destroy the 27266 session. 27268 SEQ4_STATUS_DEVID_CHANGED 27269 The client is using device ID notifications and the server has 27270 changed a device ID mapping held by the client. This flag will 27271 stay present until the client has obtained the new mapping with 27272 GETDEVICEINFO. 27274 SEQ4_STATUS_DEVID_DELETED 27275 The client is using device ID notifications and the server has 27276 deleted a device ID mapping held by the client. This flag will 27277 stay in effect until the client sends a GETDEVICEINFO on the 27278 device ID with a null value in the argument gdia_notify_types. 27280 The value of the sa_sequenceid argument relative to the cached 27281 sequence ID on the slot falls into one of three cases. 27283 o If the difference between sa_sequenceid and the server's cached 27284 sequence ID at the slot ID is two (2) or more, or if sa_sequenceid 27285 is less than the cached sequence ID (accounting for wraparound of 27286 the unsigned sequence ID value), then the server MUST return 27287 NFS4ERR_SEQ_MISORDERED. 27289 o If sa_sequenceid and the cached sequence ID are the same, this is 27290 a retry, and the server replies with what is recorded in the reply 27291 cache. The lease is possibly renewed as described below. 27293 o If sa_sequenceid is one greater (accounting for wraparound) than 27294 the cached sequence ID, then this is a new request, and the slot's 27295 sequence ID is incremented. 
The operations subsequent to 27296 SEQUENCE, if any, are processed. If there are no other 27297 operations, the only other effects are to cache the SEQUENCE reply 27298 in the slot, maintain the session's activity, and possibly renew 27299 the lease. 27301 If the client reuses a slot ID and sequence ID for a completely 27302 different request, the server MAY treat the request as if it is a 27303 retry of what it has already executed. The server MAY however detect 27304 the client's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY. 27306 If SEQUENCE returns an error, then the state of the slot (sequence 27307 ID, cached reply) MUST NOT change, and the associated lease MUST NOT 27308 be renewed. 27310 If SEQUENCE returns NFS4_OK, then the associated lease MUST be 27311 renewed (see Section 8.3), except if 27312 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags. 27314 18.46.4. IMPLEMENTATION 27316 The server MUST maintain a mapping of session ID to client ID in 27317 order to validate any operations that follow SEQUENCE that take a 27318 stateid as an argument and/or result. 27320 If the client establishes a persistent session, then a SEQUENCE 27321 received after a server restart might encounter requests performed 27322 and recorded in a persistent reply cache before the server restart. 27323 In this case, SEQUENCE will be processed successfully, while requests 27324 that were not previously performed and recorded are rejected with 27325 NFS4ERR_DEADSESSION. 27327 Depending on which of the operations within the COMPOUND were 27328 successfully performed before the server restart, these operations 27329 will also have replies sent from the server reply cache. Note that 27330 when these operations establish locking state, it is locking state 27331 that applies to the previous server instance and to the previous 27332 client ID, even though the server restart, which logically happened 27333 after these operations, eliminated that state. 
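The three-way classification of sa_sequenceid described above can be sketched as follows. This is a minimal illustration; the helper name is an assumption, and 32-bit unsigned arithmetic is used to account for wraparound of the sequence ID value as the text requires.

```python
# Sketch of comparing sa_sequenceid against the slot's cached sequence
# ID, with unsigned 32-bit wraparound.
MASK32 = 0xFFFFFFFF

def classify_seqid(sa_sequenceid, cached_seqid):
    if sa_sequenceid == cached_seqid:
        return "retry"                    # reply served from the reply cache
    if sa_sequenceid == (cached_seqid + 1) & MASK32:
        return "new"                      # new request; slot seqid advances
    # Two or more ahead, or behind the cached value.
    return "NFS4ERR_SEQ_MISORDERED"
```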
In the case of a 27334 partially executed COMPOUND, processing may reach an operation not 27335 processed during the earlier server instance, making this operation a 27336 new one and not performable on the existing session. In this case, 27337 NFS4ERR_DEADSESSION will be returned from that operation. 27339 18.47. Operation 54: SET_SSV - Update SSV for a Client ID 27341 18.47.1. ARGUMENT 27343 struct ssa_digest_input4 { 27344 SEQUENCE4args sdi_seqargs; 27345 }; 27347 struct SET_SSV4args { 27348 opaque ssa_ssv<>; 27349 opaque ssa_digest<>; 27350 }; 27352 18.47.2. RESULT 27353 struct ssr_digest_input4 { 27354 SEQUENCE4res sdi_seqres; 27355 }; 27357 struct SET_SSV4resok { 27358 opaque ssr_digest<>; 27359 }; 27361 union SET_SSV4res switch (nfsstat4 ssr_status) { 27362 case NFS4_OK: 27363 SET_SSV4resok ssr_resok4; 27364 default: 27365 void; 27366 }; 27368 18.47.3. DESCRIPTION 27370 This operation is used to update the SSV for a client ID. Before 27371 SET_SSV is called the first time on a client ID, the SSV is zero. 27372 The SSV is the key used for the SSV GSS mechanism (Section 2.10.9) 27374 SET_SSV MUST be preceded by a SEQUENCE operation in the same 27375 COMPOUND. It MUST NOT be used if the client did not opt for SP4_SSV 27376 state protection when the client ID was created (see Section 18.35); 27377 the server returns NFS4ERR_INVAL in that case. 27379 The field ssa_digest is computed as the output of the HMAC (RFC 2104 27380 [59]) using the subkey derived from the SSV4_SUBKEY_MIC_I2T and 27381 current SSV as the key (see Section 2.10.9 for a description of 27382 subkeys), and an XDR encoded value of data type ssa_digest_input4. 27383 The field sdi_seqargs is equal to the arguments of the SEQUENCE 27384 operation for the COMPOUND procedure that SET_SSV is within. 27386 The argument ssa_ssv is XORed with the current SSV to produce the new 27387 SSV. The argument ssa_ssv SHOULD be generated randomly. 
27389 In the response, ssr_digest is the output of the HMAC using the 27390 subkey derived from SSV4_SUBKEY_MIC_T2I and new SSV as the key, and 27391 an XDR encoded value of data type ssr_digest_input4. The field 27392 sdi_seqres is equal to the results of the SEQUENCE operation for the 27393 COMPOUND procedure that SET_SSV is within. 27395 As noted in Section 18.35, the client and server can maintain 27396 multiple concurrent versions of the SSV. The client and server each 27397 MUST maintain an internal SSV version number, which is set to one the 27398 first time SET_SSV executes on the server and the client receives the 27399 first SET_SSV reply. Each subsequent SET_SSV increases the internal 27400 SSV version number by one. The value of this version number 27401 corresponds to the smpt_ssv_seq, smt_ssv_seq, sspt_ssv_seq, and 27402 ssct_ssv_seq fields of the SSV GSS mechanism tokens (see 27403 Section 2.10.9). 27405 18.47.4. IMPLEMENTATION 27407 When the server receives ssa_digest, it MUST verify the digest by 27408 computing the digest the same way the client did and comparing it 27409 with ssa_digest. If the server gets a different result, this is an 27410 error, NFS4ERR_BAD_SESSION_DIGEST. This error might be the result of 27411 another SET_SSV from the same client ID changing the SSV. If so, the 27412 client recovers by sending a SET_SSV operation again with a 27413 recomputed digest based on the subkey of the new SSV. If the 27414 transport connection is dropped after the SET_SSV request is sent, 27415 but before the SET_SSV reply is received, then there are special 27416 considerations for recovery if the client has no more connections 27417 associated with sessions associated with the client ID of the SSV. 27418 See Section 18.34.4. 
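The digest verification and SSV update arithmetic described above can be sketched as follows. This is an assumed illustration only: HMAC-SHA-1 stands in for whatever MIC algorithm the GSS mechanism negotiated, the SSV4_SUBKEY_MIC_I2T subkey derivation of Section 2.10.9 is elided (the subkey is passed in directly), and xdr_seq_args is a placeholder for the XDR-encoded ssa_digest_input4.

```python
# Hypothetical server-side sketch of SET_SSV processing.
import hmac
import hashlib

def apply_set_ssv(current_ssv: bytes, ssa_ssv: bytes,
                  mic_i2t_subkey: bytes, xdr_seq_args: bytes,
                  ssa_digest: bytes):
    # Verify ssa_digest: an HMAC keyed by the I2T subkey derived from
    # the *current* SSV, over the XDR-encoded SEQUENCE arguments.
    expected = hmac.new(mic_i2t_subkey, xdr_seq_args, hashlib.sha1).digest()
    if not hmac.compare_digest(expected, ssa_digest):
        return None, "NFS4ERR_BAD_SESSION_DIGEST"
    # The new SSV is the XOR of the current SSV with ssa_ssv.
    new_ssv = bytes(a ^ b for a, b in zip(current_ssv, ssa_ssv))
    return new_ssv, "NFS4_OK"
```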
27420 Clients SHOULD NOT send an ssa_ssv that is equal to a previous 27421 ssa_ssv, nor equal to a previous or current SSV (including an ssa_ssv 27422 equal to zero since the SSV is initialized to zero when the client ID 27423 is created). 27425 Clients SHOULD send SET_SSV with RPCSEC_GSS privacy. Servers MUST 27426 support RPCSEC_GSS with privacy for any COMPOUND that has { SEQUENCE, 27427 SET_SSV }. 27429 A client SHOULD NOT send SET_SSV with the SSV GSS mechanism's 27430 credential because the purpose of SET_SSV is to seed the SSV from 27431 non-SSV credentials. Instead, SET_SSV SHOULD be sent with the 27432 credential of a user that is accessing the client ID for the first 27433 time (Section 2.10.8.3). However, if the client does send SET_SSV 27434 with SSV credentials, the digest protecting the arguments uses the 27435 value of the SSV before ssa_ssv is XORed in, and the digest 27436 protecting the results uses the value of the SSV after the ssa_ssv is 27437 XORed in. 27439 18.48. Operation 55: TEST_STATEID - Test Stateids for Validity 27441 18.48.1. ARGUMENT 27443 struct TEST_STATEID4args { 27444 stateid4 ts_stateids<>; 27445 }; 27447 18.48.2. RESULT 27449 struct TEST_STATEID4resok { 27450 nfsstat4 tsr_status_codes<>; 27451 }; 27453 union TEST_STATEID4res switch (nfsstat4 tsr_status) { 27454 case NFS4_OK: 27455 TEST_STATEID4resok tsr_resok4; 27456 default: 27457 void; 27458 }; 27460 18.48.3. DESCRIPTION 27462 The TEST_STATEID operation is used to check the validity of a set of 27463 stateids. It can be used at any time, but the client should 27464 definitely use it when it receives an indication that one or more of 27465 its stateids have been invalidated due to lock revocation. 
This 27466 occurs when the SEQUENCE operation returns with one of the following 27467 sr_status_flags set: 27469 o SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED 27471 o SEQ4_STATUS_ADMIN_STATE_REVOKED 27473 o SEQ4_STATUS_RECALLABLE_STATE_REVOKED 27475 The client can use TEST_STATEID one or more times to test the 27476 validity of its stateids. Each use of TEST_STATEID allows a large 27477 set of such stateids to be tested and avoids problems with earlier 27478 stateids in a COMPOUND request from interfering with the checking of 27479 subsequent stateids, as would happen if individual stateids were 27480 tested by a series of corresponding operations in a COMPOUND 27481 request. 27483 For each stateid, the server returns the status code that would be 27484 returned if that stateid were to be used in normal operation. 27485 Returning such a status indication is not an error and does not cause 27486 COMPOUND processing to terminate. Checks for the validity of the 27487 stateid proceed as they would for normal operations with a number of 27488 exceptions: 27490 o There is no check for the type of stateid object, as would be the 27491 case for normal use of a stateid. 27493 o There is no reference to the current filehandle. 27495 o Special stateids are always considered invalid (they result in the 27496 error code NFS4ERR_BAD_STATEID). 27498 All stateids are interpreted as being associated with the client for 27499 the current session. Any possible association with a previous 27500 instance of the client (as stale stateids) is not considered. 27502 The valid status values in the returned status_code array are 27503 NFS4_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID, 27504 NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED. 27506 18.48.4. IMPLEMENTATION 27508 See Sections 8.2.2 and 8.2.4 for a discussion of stateid structure, 27509 lifetime, and validation. 27511 18.49. Operation 56: WANT_DELEGATION - Request Delegation 27513 18.49.1.
ARGUMENT 27514 union deleg_claim4 switch (open_claim_type4 dc_claim) { 27515 /* 27516 * No special rights to object. Ordinary delegation 27517 * request of the specified object. Object identified 27518 * by filehandle. 27519 */ 27520 case CLAIM_FH: /* new to v4.1 */ 27521 /* CURRENT_FH: object being delegated */ 27522 void; 27524 /* 27525 * Right to file based on a delegation granted 27526 * to a previous boot instance of the client. 27527 * File is specified by filehandle. 27528 */ 27529 case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ 27530 /* CURRENT_FH: object being delegated */ 27531 void; 27533 /* 27534 * Right to the file established by an open previous 27535 * to server reboot. File identified by filehandle. 27536 * Used during server reclaim grace period. 27537 */ 27538 case CLAIM_PREVIOUS: 27539 /* CURRENT_FH: object being reclaimed */ 27540 open_delegation_type4 dc_delegate_type; 27541 }; 27543 struct WANT_DELEGATION4args { 27544 uint32_t wda_want; 27545 deleg_claim4 wda_claim; 27546 }; 27548 18.49.2. RESULT 27550 union WANT_DELEGATION4res switch (nfsstat4 wdr_status) { 27551 case NFS4_OK: 27552 open_delegation4 wdr_resok4; 27553 default: 27554 void; 27555 }; 27557 18.49.3. DESCRIPTION 27559 Where this description mandates the return of a specific error code 27560 for a specific condition, and where multiple conditions apply, the 27561 server MAY return any of the mandated error codes. 27563 This operation allows a client to: 27565 o Get a delegation on all types of files except directories. 27567 o Register a "want" for a delegation for the specified file object, 27568 and be notified via a callback when the delegation is available. 27569 The server MAY support notifications of availability via 27570 callbacks. 
If the server does not support registration of wants, 27571 it MUST NOT return an error to indicate that, and instead MUST 27572 return with ond_why set to WND4_CONTENTION or WND4_RESOURCE and 27573 ond_server_will_push_deleg or ond_server_will_signal_avail set to 27574 FALSE. When the server indicates that it will notify the client 27575 by means of a callback, it will either provide the delegation 27576 using a CB_PUSH_DELEG operation or cancel its promise by sending a 27577 CB_WANTS_CANCELLED operation. 27579 o Cancel a want for a delegation. 27581 The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set 27582 OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server MUST 27583 ignore them. 27585 The meanings of the following flags in wda_want are the same as they 27586 are in OPEN, except as noted below. 27588 o OPEN4_SHARE_ACCESS_WANT_READ_DELEG 27590 o OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 27592 o OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 27594 o OPEN4_SHARE_ACCESS_WANT_NO_DELEG. Unlike the OPEN operation, this 27595 flag SHOULD NOT be set by the client in the arguments to 27596 WANT_DELEGATION, and MUST be ignored by the server. 27598 o OPEN4_SHARE_ACCESS_WANT_CANCEL 27600 o OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 27602 o OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 27603 The handling of the above flags in WANT_DELEGATION is the same as in 27604 OPEN. Information about the delegation and/or the promises the 27605 server is making regarding future callbacks are the same as those 27606 described in the open_delegation4 structure. 27608 The successful results of WANT_DELEGATION are of data type 27609 open_delegation4, which is the same data type as the "delegation" 27610 field in the results of the OPEN operation (see Section 18.16.3). 27611 The server constructs wdr_resok4 the same way it constructs OPEN's 27612 "delegation" with one difference: WANT_DELEGATION MUST NOT return a 27613 delegation type of OPEN_DELEGATE_NONE. 
27615 If ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) & 27616 ~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) is zero, then the client is 27617 indicating no explicit desire or non-desire for a delegation and the 27618 server MUST return NFS4ERR_INVAL. 27620 The client uses the OPEN4_SHARE_ACCESS_WANT_CANCEL flag in the 27621 WANT_DELEGATION operation to cancel a previously requested want for a 27622 delegation. Note that if the server is in the process of sending the 27623 delegation (via CB_PUSH_DELEG) at the time the client sends a 27624 cancellation of the want, the delegation might still be pushed to the 27625 client. 27627 If WANT_DELEGATION fails to return a delegation, and the server 27628 returns NFS4_OK, the server MUST set the delegation type to 27629 OPEN_DELEGATE_NONE_EXT, and set od_whynone, as described in 27630 Section 18.16. Write delegations are not available for file types 27631 that are not writable. This includes file objects of types NF4BLK, 27632 NF4CHR, NF4LNK, NF4SOCK, and NF4FIFO. If the client requests 27633 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG without 27634 OPEN4_SHARE_ACCESS_WANT_READ_DELEG on an object with one of the 27635 aforementioned file types, the server MUST set 27636 wdr_resok4.od_whynone.ond_why to WND4_WRITE_DELEG_NOT_SUPP_FTYPE. 27638 18.49.4. IMPLEMENTATION 27640 A request for a conflicting delegation is not normally intended to 27641 trigger the recall of the existing delegation. Servers may choose to 27642 treat some clients as having higher priority such that their wants 27643 will trigger recall of an existing delegation, although that is 27644 expected to be an unusual situation. 27646 Servers will generally recall delegations assigned by WANT_DELEGATION 27647 on the same basis as those assigned by OPEN. CB_RECALL will 27648 generally be done only when other clients perform operations 27649 inconsistent with the delegation.
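The wda_want mask check quoted above can be sketched directly. The constant values below follow the OPEN4_SHARE_ACCESS_WANT_* definitions elsewhere in this document; the helper function itself is an illustrative assumption.

```python
# Sketch of the NFS4ERR_INVAL test on wda_want in WANT_DELEGATION.
OPEN4_SHARE_ACCESS_WANT_READ_DELEG  = 0x0100
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG   = 0x0300
OPEN4_SHARE_ACCESS_WANT_NO_DELEG    = 0x0400
OPEN4_SHARE_ACCESS_WANT_CANCEL      = 0x0500
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK  = 0xFF00

def want_delegation_valid(wda_want):
    # Invalid (NFS4ERR_INVAL) when the masked want, with the NO_DELEG
    # bit removed, is zero -- i.e., no explicit desire or non-desire.
    return ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK)
            & ~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) != 0
```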
The normal response to aging of
delegations is to use CB_RECALL_ANY, in order to give the client the
opportunity to keep the delegations most useful from its point of
view.

18.50.  Operation 57: DESTROY_CLIENTID - Destroy a Client ID

18.50.1.  ARGUMENT

   struct DESTROY_CLIENTID4args {
           clientid4       dca_clientid;
   };

18.50.2.  RESULT

   struct DESTROY_CLIENTID4res {
           nfsstat4        dcr_status;
   };

18.50.3.  DESCRIPTION

The DESTROY_CLIENTID operation destroys the client ID.  If there are
sessions (both idle and non-idle), opens, locks, delegations,
layouts, and/or wants (Section 18.49) associated with the unexpired
lease of the client ID, the server MUST return NFS4ERR_CLIENTID_BUSY.
DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as
the client ID derived from the session ID of SEQUENCE is not the same
as the client ID to be destroyed.  If the client IDs are the same,
then the server MUST return NFS4ERR_CLIENTID_BUSY.

If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only
operation in the COMPOUND request (otherwise, the server MUST return
NFS4ERR_NOT_ONLY_OP).  If the operation is sent without a SEQUENCE
preceding it, a client that retransmits the request may receive an
error in response, because the original request might have been
successfully executed.

18.50.4.  IMPLEMENTATION

DESTROY_CLIENTID allows a server to immediately reclaim the resources
consumed by an unused client ID, and also to forget that it ever
generated the client ID.  By forgetting that it ever generated the
client ID, the server can safely reuse the client ID on a future
EXCHANGE_ID operation.

18.51.  Operation 58: RECLAIM_COMPLETE - Indicates Reclaims Finished

18.51.1.
ARGUMENT

   struct RECLAIM_COMPLETE4args {
           /*
            * If rca_one_fs TRUE,
            *
            *    CURRENT_FH: object in
            *    file system reclaim is
            *    complete for.
            */
           bool            rca_one_fs;
   };

18.51.2.  RESULTS

   struct RECLAIM_COMPLETE4res {
           nfsstat4        rcr_status;
   };

18.51.3.  DESCRIPTION

A RECLAIM_COMPLETE operation is used to indicate that the client has
reclaimed all of the locking state that it will recover using
reclaim, when it is recovering state due to either a server restart
or the migration of a file system to another server.  There are two
types of RECLAIM_COMPLETE operations:

o  When rca_one_fs is FALSE, a global RECLAIM_COMPLETE is being done.
   This indicates that recovery of all locks that the client held on
   the previous server instance has been completed.  The current
   filehandle need not be set in this case.

o  When rca_one_fs is TRUE, a file system-specific RECLAIM_COMPLETE
   is being done.  This indicates that recovery of locks for a single
   fs (the one designated by the current filehandle) due to the
   migration of the file system has been completed.  Presence of a
   current filehandle is required when rca_one_fs is set to TRUE.
   When the current filehandle designates a filehandle in a file
   system not in the process of migration, the operation returns
   NFS4_OK and is otherwise ignored.

Once a RECLAIM_COMPLETE is done, there can be no further reclaim
operations for locks whose scope is defined as having completed
recovery.  Once the client sends RECLAIM_COMPLETE, the server will
not allow the client to do subsequent reclaims of locking state for
that scope and, if these are attempted, will return NFS4ERR_NO_GRACE.
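The scope rule above can be sketched as simple server-side
bookkeeping.  This is a minimal sketch under stated assumptions: the
structure and helper names are hypothetical, and a real server would
keep one such record for the global scope and one per migrated file
system; the numeric status values come from the protocol's nfsstat4
definitions.

```c
#include <stdint.h>

/* Status values from the NFSv4.1 nfsstat4 enumeration. */
#define NFS4_OK            0
#define NFS4ERR_NO_GRACE   10033

/* Hypothetical per-scope server state. */
struct reclaim_scope {
    int reclaim_complete_done;  /* set once RECLAIM_COMPLETE arrives */
};

/* Once RECLAIM_COMPLETE has been done for a scope, any further
 * reclaim operation within that scope fails with NFS4ERR_NO_GRACE. */
int check_reclaim_allowed(const struct reclaim_scope *scope)
{
    return scope->reclaim_complete_done ? NFS4ERR_NO_GRACE : NFS4_OK;
}
```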
Whenever a client establishes a new client ID and before it does the
first non-reclaim operation that obtains a lock, it MUST send a
RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no
locks to reclaim.  If non-reclaim locking operations are done before
the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.

Similarly, when the client accesses a migrated file system on a new
server, before it sends the first non-reclaim operation that obtains
a lock on this new server, it MUST send a RECLAIM_COMPLETE with
rca_one_fs set to TRUE and the current filehandle within that file
system, even if there are no locks to reclaim.  If non-reclaim
locking operations are done on that file system before the
RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.

It should be noted that there are situations in which a client needs
to issue both forms of RECLAIM_COMPLETE.  An example is an instance
of file system migration in which the file system is migrated to a
server for which the client has no client ID.  As a result, the
client needs to obtain a client ID from the server (incurring the
responsibility to do RECLAIM_COMPLETE with rca_one_fs set to FALSE)
as well as RECLAIM_COMPLETE with rca_one_fs set to TRUE to complete
the per-fs grace period associated with the file system migration.
These two may be done in any order as long as all necessary lock
reclaims have been done before issuing either of them.

Any locks not reclaimed at the point at which RECLAIM_COMPLETE is
done become non-reclaimable.  The client MUST NOT attempt to reclaim
them, either during the current server instance or in any subsequent
server instance, or on another server to which responsibility for
that file system is transferred.
If the client were to do so, it
would be violating the protocol by representing itself as owning
locks that it does not own, and so has no right to reclaim.  See
Section 8.4.3 of [60] for a discussion of edge conditions related to
lock reclaim.

By sending a RECLAIM_COMPLETE, the client indicates readiness to
proceed to do normal non-reclaim locking operations.  The client
should be aware that such operations may temporarily result in
NFS4ERR_GRACE errors until the server is ready to terminate its grace
period.

18.51.4.  IMPLEMENTATION

Servers will typically use the information as to when reclaim
activity is complete to reduce the length of the grace period.  When
the server maintains in persistent storage a list of clients that
might have had locks, it is able to use the fact that all such
clients have done a RECLAIM_COMPLETE to terminate the grace period
and begin normal operations (i.e., grant requests for new locks)
sooner than it might otherwise.

Latency can be minimized by doing a RECLAIM_COMPLETE as part of the
COMPOUND request in which the last lock-reclaiming operation is done.
When there are no reclaims to be done, RECLAIM_COMPLETE should be
done immediately in order to allow the grace period to end as soon as
possible.

RECLAIM_COMPLETE should only be done once for each server instance or
occasion of the transition of a file system.  If it is done a second
time, the error NFS4ERR_COMPLETE_ALREADY will result.  Note that
because of the session feature's retry protection, retries of
COMPOUND requests containing a RECLAIM_COMPLETE operation will not
result in this error.

When a RECLAIM_COMPLETE is sent, the client effectively acknowledges
any locks not yet reclaimed as lost.
This allows the server to re-enable
the client to recover locks if the occurrence of edge
conditions, as described in Section 8.4.3, had caused the server to
disable the client's ability to recover locks.

Because previous descriptions of RECLAIM_COMPLETE were not
sufficiently explicit about the circumstances in which use of
RECLAIM_COMPLETE with rca_one_fs set to TRUE was appropriate, there
have been cases in which it has been misused by clients, and cases in
which servers have, in various ways, not responded to such misuse as
described above.  While clients SHOULD NOT misuse this feature and
servers SHOULD respond to such misuse as described above,
implementers need to be aware of the following considerations as they
make necessary tradeoffs between interoperability with existing
implementations and proper support for facilities to allow lock
recovery in the event of file system migration.

o  When servers have no support for becoming the destination server
   of a file system subject to migration, there is no possibility of
   a per-fs RECLAIM_COMPLETE being done legitimately, and occurrences
   of it SHOULD be ignored.  However, the negative consequences of
   accepting such mistaken use are quite limited as long as the
   client does not issue it before all necessary reclaims are done.

o  When a server might become the destination for a file system being
   migrated, inappropriate use of per-fs RECLAIM_COMPLETE is more
   concerning.  In the case in which the file system designated is
   not within a per-fs grace period, the per-fs RECLAIM_COMPLETE
   SHOULD be ignored, with the negative consequences of accepting it
   being limited, as in the case in which migration is not supported.
However, if the server encounters a file system undergoing
migration, the operation cannot be accepted as if it were a global
RECLAIM_COMPLETE without invalidating its intended use.

18.52.  Operation 10044: ILLEGAL - Illegal Operation

18.52.1.  ARGUMENTS

   void;

18.52.2.  RESULTS

   struct ILLEGAL4res {
           nfsstat4        status;
   };

18.52.3.  DESCRIPTION

This operation is a placeholder for encoding a result to handle the
case of the client sending an operation code within COMPOUND that is
not supported.  See the COMPOUND procedure description for more
details.

The status field of ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL.

18.52.4.  IMPLEMENTATION

A client will probably not send an operation with code OP_ILLEGAL,
but if it does, the response will be ILLEGAL4res, just as it would be
with any other invalid operation code.  Note that if the server gets
an illegal operation code that is not OP_ILLEGAL, and if the server
checks for legal operation codes during the XDR decode phase, then
ILLEGAL4res would not be returned.

19.  NFSv4.1 Callback Procedures

The procedures used for callbacks are defined in the following
sections.  In the interest of clarity, the terms "client" and
"server" refer to NFS clients and servers, despite the fact that for
an individual callback RPC, the sense of these terms would be
precisely the opposite.

Both procedures, CB_NULL and CB_COMPOUND, MUST be implemented.

19.1.  Procedure 0: CB_NULL - No Operation

19.1.1.  ARGUMENTS

   void;

19.1.2.  RESULTS

   void;

19.1.3.  DESCRIPTION

CB_NULL is the standard ONC RPC NULL procedure, with the standard
void argument and void response.
Even though there is no direct
functionality associated with this procedure, the server will use
CB_NULL to confirm the existence of a path for RPCs from the server
to the client.

19.1.4.  ERRORS

None.

19.2.  Procedure 1: CB_COMPOUND - Compound Operations

19.2.1.  ARGUMENTS

   enum nfs_cb_opnum4 {
       OP_CB_GETATTR               = 3,
       OP_CB_RECALL                = 4,
   /* Callback operations new to NFSv4.1 */
       OP_CB_LAYOUTRECALL          = 5,
       OP_CB_NOTIFY                = 6,
       OP_CB_PUSH_DELEG            = 7,
       OP_CB_RECALL_ANY            = 8,
       OP_CB_RECALLABLE_OBJ_AVAIL  = 9,
       OP_CB_RECALL_SLOT           = 10,
       OP_CB_SEQUENCE              = 11,
       OP_CB_WANTS_CANCELLED       = 12,
       OP_CB_NOTIFY_LOCK           = 13,
       OP_CB_NOTIFY_DEVICEID       = 14,

       OP_CB_ILLEGAL               = 10044
   };

   union nfs_cb_argop4 switch (unsigned argop) {
    case OP_CB_GETATTR:
         CB_GETATTR4args           opcbgetattr;
    case OP_CB_RECALL:
         CB_RECALL4args            opcbrecall;
    case OP_CB_LAYOUTRECALL:
         CB_LAYOUTRECALL4args      opcblayoutrecall;
    case OP_CB_NOTIFY:
         CB_NOTIFY4args            opcbnotify;
    case OP_CB_PUSH_DELEG:
         CB_PUSH_DELEG4args        opcbpush_deleg;
    case OP_CB_RECALL_ANY:
         CB_RECALL_ANY4args        opcbrecall_any;
    case OP_CB_RECALLABLE_OBJ_AVAIL:
         CB_RECALLABLE_OBJ_AVAIL4args opcbrecallable_obj_avail;
    case OP_CB_RECALL_SLOT:
         CB_RECALL_SLOT4args       opcbrecall_slot;
    case OP_CB_SEQUENCE:
         CB_SEQUENCE4args          opcbsequence;
    case OP_CB_WANTS_CANCELLED:
         CB_WANTS_CANCELLED4args   opcbwants_cancelled;
    case OP_CB_NOTIFY_LOCK:
         CB_NOTIFY_LOCK4args       opcbnotify_lock;
    case OP_CB_NOTIFY_DEVICEID:
         CB_NOTIFY_DEVICEID4args   opcbnotify_deviceid;
    case OP_CB_ILLEGAL:            void;
   };

   struct CB_COMPOUND4args {
           utf8str_cs      tag;
           uint32_t        minorversion;
           uint32_t        callback_ident;
           nfs_cb_argop4   argarray<>;
   };

19.2.2.
RESULTS

   union nfs_cb_resop4 switch (unsigned resop) {
    case OP_CB_GETATTR:     CB_GETATTR4res  opcbgetattr;
    case OP_CB_RECALL:      CB_RECALL4res   opcbrecall;

    /* new NFSv4.1 operations */
    case OP_CB_LAYOUTRECALL:
                            CB_LAYOUTRECALL4res
                                            opcblayoutrecall;
    case OP_CB_NOTIFY:      CB_NOTIFY4res   opcbnotify;
    case OP_CB_PUSH_DELEG:  CB_PUSH_DELEG4res
                                            opcbpush_deleg;
    case OP_CB_RECALL_ANY:  CB_RECALL_ANY4res
                                            opcbrecall_any;
    case OP_CB_RECALLABLE_OBJ_AVAIL:
                            CB_RECALLABLE_OBJ_AVAIL4res
                                            opcbrecallable_obj_avail;
    case OP_CB_RECALL_SLOT:
                            CB_RECALL_SLOT4res
                                            opcbrecall_slot;
    case OP_CB_SEQUENCE:    CB_SEQUENCE4res opcbsequence;
    case OP_CB_WANTS_CANCELLED:
                            CB_WANTS_CANCELLED4res
                                            opcbwants_cancelled;
    case OP_CB_NOTIFY_LOCK:
                            CB_NOTIFY_LOCK4res
                                            opcbnotify_lock;
    case OP_CB_NOTIFY_DEVICEID:
                            CB_NOTIFY_DEVICEID4res
                                            opcbnotify_deviceid;

    /* Not new operation */
    case OP_CB_ILLEGAL:     CB_ILLEGAL4res  opcbillegal;
   };

   struct CB_COMPOUND4res {
           nfsstat4        status;
           utf8str_cs      tag;
           nfs_cb_resop4   resarray<>;
   };

19.2.3.  DESCRIPTION

The CB_COMPOUND procedure is used to combine one or more of the
callback procedures into a single RPC request.  The main callback RPC
program has two main procedures: CB_NULL and CB_COMPOUND.  All other
operations use the CB_COMPOUND procedure as a wrapper.

During the processing of the CB_COMPOUND procedure, the client may
find that it does not have the available resources to execute any or
all of the operations within the CB_COMPOUND sequence.  Refer to
Section 2.10.6.4 for details.

The minorversion field of the arguments MUST be the same as the
minorversion of the COMPOUND procedure used to create the client ID
and session.  For NFSv4.1, minorversion MUST be set to 1.
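The minorversion check, together with the stop-at-first-error
processing of the operation array described in this section, can be
sketched as a client-side loop.  This is a hedged sketch: the names
are hypothetical, and the decoded operations are abstracted as
function pointers returning nfsstat4-style status values.

```c
#include <stdint.h>
#include <stddef.h>

/* Status values from the NFSv4.1 nfsstat4 enumeration. */
#define NFS4_OK                      0
#define NFS4ERR_MINOR_VERS_MISMATCH  10021

/* Hypothetical representation of one decoded callback operation. */
typedef int (*cb_op_func)(void *op_args);

/* Two toy operations for illustration only. */
static int cb_op_succeed(void *a) { (void)a; return NFS4_OK; }
static int cb_op_delay(void *a)   { (void)a; return 10008; /* NFS4ERR_DELAY */ }

/* Sketch of CB_COMPOUND processing: reject a bad minorversion,
 * execute operations in order until one returns a status other
 * than NFS4_OK, and report the status of the last operation
 * executed as the procedure's status. */
int cb_compound_process(uint32_t minorversion,
                        cb_op_func *ops, void **args, size_t nops,
                        size_t *nexecuted)
{
    int status = NFS4_OK;

    *nexecuted = 0;
    if (minorversion != 1)
        return NFS4ERR_MINOR_VERS_MISMATCH;
    for (size_t i = 0; i < nops; i++) {
        status = ops[i](args[i]);
        (*nexecuted)++;
        if (status != NFS4_OK)
            break;
    }
    return status;
}
```

Note how the "status" field described next falls out of the loop: it
is simply the status of the last operation that was executed.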
Contained within the CB_COMPOUND results is a "status" field.  This
status MUST be equal to the status of the last operation that was
executed within the CB_COMPOUND procedure.  Therefore, if an
operation incurred an error, then the "status" value will be the same
error value as is being returned for the operation that failed.

The "tag" field is handled the same way as that of the COMPOUND
procedure (see Section 16.2.3).

Illegal operation codes are handled in the same way as they are
handled for the COMPOUND procedure.

19.2.4.  IMPLEMENTATION

The CB_COMPOUND procedure is used to combine individual operations
into a single RPC request.  The client interprets each of the
operations in turn.  If an operation is executed by the client and
the status of that operation is NFS4_OK, then the next operation in
the CB_COMPOUND procedure is executed.  The client continues this
process until there are no more operations to be executed or one of
the operations has a status value other than NFS4_OK.

19.2.5.  ERRORS

CB_COMPOUND will of course return every error that each operation on
the backchannel can return (see Table 7).  However, if CB_COMPOUND
returns zero operations, obviously the error returned by COMPOUND has
nothing to do with an error returned by an operation.  The list of
errors CB_COMPOUND will return if it processes zero operations
includes:

                     CB_COMPOUND error returns

 +------------------------------+------------------------------------+
 | Error                        | Notes                              |
 +------------------------------+------------------------------------+
 | NFS4ERR_BADCHAR              | The tag argument has a character   |
 |                              | the replier does not support.      |
 | NFS4ERR_BADXDR               |                                    |
 | NFS4ERR_DELAY                |                                    |
 | NFS4ERR_INVAL                | The tag argument is not in UTF-8   |
 |                              | encoding.                          |
 | NFS4ERR_MINOR_VERS_MISMATCH  |                                    |
 | NFS4ERR_SERVERFAULT          |                                    |
 | NFS4ERR_TOO_MANY_OPS         |                                    |
 | NFS4ERR_REP_TOO_BIG          |                                    |
 | NFS4ERR_REP_TOO_BIG_TO_CACHE |                                    |
 | NFS4ERR_REQ_TOO_BIG          |                                    |
 +------------------------------+------------------------------------+

                              Table 15

20.  NFSv4.1 Callback Operations

20.1.  Operation 3: CB_GETATTR - Get Attributes

20.1.1.  ARGUMENT

   struct CB_GETATTR4args {
           nfs_fh4 fh;
           bitmap4 attr_request;
   };

20.1.2.  RESULT

   struct CB_GETATTR4resok {
           fattr4  obj_attributes;
   };

   union CB_GETATTR4res switch (nfsstat4 status) {
    case NFS4_OK:
           CB_GETATTR4resok resok4;
    default:
           void;
   };

20.1.3.  DESCRIPTION

The CB_GETATTR operation is used by the server to obtain the current
modified state of a file that has been OPEN_DELEGATE_WRITE delegated.
The size and change attributes are the only ones guaranteed to be
serviced by the client.  See Section 10.4.3 for a full description of
how the client and server are to interact with the use of CB_GETATTR.

If the filehandle specified is not one for which the client holds an
OPEN_DELEGATE_WRITE delegation, an NFS4ERR_BADHANDLE error is
returned.

20.1.4.  IMPLEMENTATION

The client returns attrmask bits and the associated attribute values
only for the change attribute, and attributes that it may change
(time_modify, and size).

20.2.  Operation 4: CB_RECALL - Recall a Delegation

20.2.1.  ARGUMENT

   struct CB_RECALL4args {
           stateid4        stateid;
           bool            truncate;
           nfs_fh4         fh;
   };

20.2.2.  RESULT

   struct CB_RECALL4res {
           nfsstat4        status;
   };

20.2.3.  DESCRIPTION

The CB_RECALL operation is used to begin the process of recalling a
delegation and returning it to the server.
The truncate flag is used to optimize recall for a file object that
is a regular file and is about to be truncated to zero.  When it is
TRUE, the client is freed of the obligation to propagate modified
data for the file to the server, since this data is irrelevant.

If the handle specified is not one for which the client holds a
delegation, an NFS4ERR_BADHANDLE error is returned.

If the stateid specified is not one corresponding to an OPEN
delegation for the file specified by the filehandle, an
NFS4ERR_BAD_STATEID is returned.

20.2.4.  IMPLEMENTATION

The client SHOULD reply to the callback immediately.  Replying does
not complete the recall, except when the value of the reply's status
field is neither NFS4ERR_DELAY nor NFS4_OK.  The recall is not
complete until the delegation is returned using a DELEGRETURN
operation.

20.3.  Operation 5: CB_LAYOUTRECALL - Recall Layout from Client

20.3.1.  ARGUMENT

   /*
    * NFSv4.1 callback arguments and results
    */

   enum layoutrecall_type4 {
           LAYOUTRECALL4_FILE = LAYOUT4_RET_REC_FILE,
           LAYOUTRECALL4_FSID = LAYOUT4_RET_REC_FSID,
           LAYOUTRECALL4_ALL  = LAYOUT4_RET_REC_ALL
   };

   struct layoutrecall_file4 {
           nfs_fh4         lor_fh;
           offset4         lor_offset;
           length4         lor_length;
           stateid4        lor_stateid;
   };

   union layoutrecall4 switch(layoutrecall_type4 lor_recalltype) {
    case LAYOUTRECALL4_FILE:
           layoutrecall_file4 lor_layout;
    case LAYOUTRECALL4_FSID:
           fsid4              lor_fsid;
    case LAYOUTRECALL4_ALL:
           void;
   };

   struct CB_LAYOUTRECALL4args {
           layouttype4             clora_type;
           layoutiomode4           clora_iomode;
           bool                    clora_changed;
           layoutrecall4           clora_recall;
   };

20.3.2.  RESULT

   struct CB_LAYOUTRECALL4res {
           nfsstat4        clorr_status;
   };

20.3.3.
DESCRIPTION

The CB_LAYOUTRECALL operation is used by the server to recall layouts
from the client; as a result, the client will begin the process of
returning layouts via LAYOUTRETURN.  The CB_LAYOUTRECALL operation
specifies one of three forms of recall processing with the value of
layoutrecall_type4.  The recall is for one of the following: a
specific layout of a specific file (LAYOUTRECALL4_FILE), an entire
file system ID (LAYOUTRECALL4_FSID), or all file systems
(LAYOUTRECALL4_ALL).

The behavior of the operation varies based on the value of the
layoutrecall_type4.  The value and behaviors are:

LAYOUTRECALL4_FILE

   For a layout to match the recall request, the values of the
   following fields must match those of the layout: clora_type,
   clora_iomode, lor_fh, and the byte-range specified by lor_offset
   and lor_length.  The clora_iomode field may have a special value
   of LAYOUTIOMODE4_ANY, which will match any iomode originally
   returned in a layout; it therefore acts as a wild card.  The other
   special value used is for lor_length: a value of NFS4_UINT64_MAX
   means the maximum possible file size.  If a matching layout is
   found, it MUST be returned using the LAYOUTRETURN operation (see
   Section 18.44).  As an example of the special values' use, if
   clora_iomode is LAYOUTIOMODE4_ANY, lor_offset is zero, and
   lor_length is NFS4_UINT64_MAX, then the entire layout is to be
   returned.

   The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the
   client does not hold layouts for the file or does not have any
   layouts overlapping the range specified in the layout recall.
LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL

   If LAYOUTRECALL4_FSID is specified, the fsid specifies the file
   system for which any outstanding layouts MUST be returned.  If
   LAYOUTRECALL4_ALL is specified, all outstanding layouts MUST be
   returned.  In addition, LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL
   specify that all the storage device ID to storage device address
   mappings in the affected file system(s) are also recalled.  The
   respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or
   LAYOUTRETURN4_ALL acknowledges to the server that the client
   invalidated the said device mappings.  See Section 12.5.5.2.1.5
   for considerations with "bulk" recall of layouts.

   The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the
   client does not hold layouts and does not have valid deviceid
   mappings.

In processing the layout recall request, the client also varies its
behavior based on the value of the clora_changed field.  This field
is used by the server to provide additional context for the reason
why the layout is being recalled.  A FALSE value for clora_changed
indicates that no change in the layout is expected and the client may
write modified data to the storage devices involved; this must be
done prior to returning the layout via LAYOUTRETURN.  A TRUE value
for clora_changed indicates that the server is changing the layout.
Examples of layout changes, and reasons for a TRUE indication, are
the following: the metadata server is restriping the file, or a
permanent error has occurred on a storage device and the metadata
server would like to provide a new layout for the file.  Therefore, a
clora_changed value of TRUE indicates some level of change for the
layout and the client SHOULD NOT write and commit modified data to
the storage devices.
In this case, the client writes and commits
data through the metadata server.

See Section 12.5.3 for a description of how the lor_stateid field in
the arguments is to be constructed.  Note that the "seqid" field of
lor_stateid MUST NOT be zero.  See Sections 8.2, 12.5.3, and 12.5.5.2
for further discussion and requirements.

20.3.4.  IMPLEMENTATION

The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL
(recall of file delegations) in that the client responds to the
request before actually returning layouts via the LAYOUTRETURN
operation.  While the client responds to the CB_LAYOUTRECALL
immediately, the operation is not considered complete (i.e., it is
considered pending) until all affected layouts are returned to the
server via the LAYOUTRETURN operation.

Before returning the layout to the server via LAYOUTRETURN, the
client should wait for the response from in-process or in-flight
READ, WRITE, or COMMIT operations that use the recalled layout.

If the client is holding modified data that is affected by a recalled
layout, the client has various options for writing the data to the
server.  As always, the client may write the data through the
metadata server.  In fact, the client may not have a choice other
than writing to the metadata server when the clora_changed argument
is TRUE and a new layout is unavailable from the server.  However,
the client may be able to write the modified data to the storage
device if the clora_changed argument is FALSE; this needs to be done
before returning the layout via LAYOUTRETURN.  If the client were to
obtain a new layout covering the modified data's byte-range, then
writing to the storage devices is an available alternative.  Note
that before obtaining a new layout, the client must first return the
original layout.
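Returning to the matching rules for LAYOUTRECALL4_FILE given in the
DESCRIPTION, the iomode and byte-range tests can be sketched as
follows.  The helper names are hypothetical, comparison of clora_type
and lor_fh is assumed to have been done already, and the constant
values come from the protocol's XDR.

```c
#include <stdint.h>

#define NFS4_UINT64_MAX   0xFFFFFFFFFFFFFFFFULL
#define LAYOUTIOMODE4_ANY 3   /* from the layoutiomode4 enumeration */

/* End of a byte-range, treating a length of NFS4_UINT64_MAX as
 * "to the maximum possible file size". */
static uint64_t range_end(uint64_t offset, uint64_t length)
{
    return (length == NFS4_UINT64_MAX) ? NFS4_UINT64_MAX
                                       : offset + length;
}

/* A held layout matches the recall when the iomodes match (or the
 * recall uses the LAYOUTIOMODE4_ANY wild card) and the recalled
 * byte-range overlaps the layout's byte-range. */
int layout_matches_recall(int layout_iomode,
                          uint64_t layout_off, uint64_t layout_len,
                          int clora_iomode,
                          uint64_t lor_offset, uint64_t lor_length)
{
    if (clora_iomode != LAYOUTIOMODE4_ANY &&
        clora_iomode != layout_iomode)
        return 0;
    return lor_offset < range_end(layout_off, layout_len) &&
           layout_off < range_end(lor_offset, lor_length);
}
```

With clora_iomode set to LAYOUTIOMODE4_ANY, lor_offset zero, and
lor_length NFS4_UINT64_MAX, every held layout for the file matches,
which is the "entire layout" recall described above.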
In the case of modified data being written while the layout is held,
the client must use LAYOUTCOMMIT operations at the appropriate time;
as required, LAYOUTCOMMIT must be done before the LAYOUTRETURN.  If a
large amount of modified data is outstanding, the client may send
LAYOUTRETURNs for portions of the recalled layout; this allows the
server to monitor the client's progress and adherence to the original
recall request.  However, the last LAYOUTRETURN in a sequence of
returns MUST specify the full range being recalled (see
Section 12.5.5.1 for details).

If a server needs to delete a device ID and there are layouts
referring to the device ID, CB_LAYOUTRECALL MUST be invoked to cause
the client to return all layouts referring to the device ID before
the server can delete the device ID.  If the client does not return
the affected layouts, the server MAY revoke the layouts.

20.4.  Operation 6: CB_NOTIFY - Notify Client of Directory Changes

20.4.1.  ARGUMENT

   /*
    * Directory notification types.
    */
   enum notify_type4 {
           NOTIFY4_CHANGE_CHILD_ATTRS = 0,
           NOTIFY4_CHANGE_DIR_ATTRS = 1,
           NOTIFY4_REMOVE_ENTRY = 2,
           NOTIFY4_ADD_ENTRY = 3,
           NOTIFY4_RENAME_ENTRY = 4,
           NOTIFY4_CHANGE_COOKIE_VERIFIER = 5
   };

   /* Changed entry information.  */
   struct notify_entry4 {
           component4      ne_file;
           fattr4          ne_attrs;
   };

   /* Previous entry information */
   struct prev_entry4 {
           notify_entry4   pe_prev_entry;
           /* what READDIR returned for this entry */
           nfs_cookie4     pe_prev_entry_cookie;
   };

   struct notify_remove4 {
           notify_entry4   nrm_old_entry;
           nfs_cookie4     nrm_old_entry_cookie;
   };

   struct notify_add4 {
           /*
            * Information on object
            * possibly renamed over.
            */
           notify_remove4  nad_old_entry<1>;
           notify_entry4   nad_new_entry;
           /* what READDIR would have returned for this entry */
           nfs_cookie4     nad_new_entry_cookie<1>;
           prev_entry4     nad_prev_entry<1>;
           bool            nad_last_entry;
   };

   struct notify_attr4 {
           notify_entry4   na_changed_entry;
   };

   struct notify_rename4 {
           notify_remove4  nrn_old_entry;
           notify_add4     nrn_new_entry;
   };

   struct notify_verifier4 {
           verifier4       nv_old_cookieverf;
           verifier4       nv_new_cookieverf;
   };

   /*
    * Objects of type notify_<>4 and
    * notify_device_<>4 are encoded in this.
    */
   typedef opaque notifylist4<>;

   struct notify4 {
           /* composed from notify_type4 or notify_deviceid_type4 */
           bitmap4         notify_mask;
           notifylist4     notify_vals;
   };

   struct CB_NOTIFY4args {
           stateid4    cna_stateid;
           nfs_fh4     cna_fh;
           notify4     cna_changes<>;
   };

20.4.2.  RESULT

   struct CB_NOTIFY4res {
           nfsstat4    cnr_status;
   };

20.4.3.  DESCRIPTION

The CB_NOTIFY operation is used by the server to send notifications
to clients about changes to delegated directories.  The registration
of notifications for the directories occurs when the delegation is
established using GET_DIR_DELEGATION.  These notifications are sent
over the backchannel.  The notification is sent once the original
request has been processed on the server.  The server will send an
array of notifications for changes that might have occurred in the
directory.  The notifications are sent as a list of pairs of bitmaps
and values.  See Section 3.3.7 for a description of how NFSv4.1
bitmaps work.

If the server has more notifications than can fit in the CB_COMPOUND
request, it SHOULD send a sequence of serial CB_COMPOUND requests so
that the client's view of the directory does not become confused.
For example, if the server indicates that a file named "foo" is added
and that the file "foo" is removed, the order in which the client
receives these notifications needs to be the same as the order in
which the corresponding operations occurred on the server.

If the client holding the delegation makes any changes in the
directory that cause files or sub-directories to be added or removed,
the server will notify that client of the resulting change(s).  If
the client holding the delegation is making attribute or cookie
verifier changes only, the server does not need to send notifications
to that client.  The server will send the following information for
each operation:

NOTIFY4_ADD_ENTRY
   The server will send information about the new directory entry
   being created along with the cookie for that entry.  The entry
   information (data type notify_add4) includes the component name of
   the entry and attributes.  The server will send this type of entry
   when a file is actually being created, when an entry is being
   added to a directory as a result of a rename across directories
   (see below), and when a hard link is being created to an existing
   file.  If this entry is added to the end of the directory, the
   server will set the nad_last_entry flag to TRUE.  If the file is
   added such that there is at least one entry before it, the server
   will also return the previous entry information (nad_prev_entry, a
   variable-length array of up to one element; if the array is of
   zero length, there is no previous entry), along with its cookie.
   This is to help clients find the right location in their file name
   caches and directory caches where this entry should be cached.  If
   the new entry's cookie is available, it will be in the
   nad_new_entry_cookie (another variable-length array of up to one
   element) field.
   If the addition of the entry causes another entry to be deleted
   (which can only happen in the rename case) atomically with the
   addition, then information on this entry is reported in
   nad_old_entry.

NOTIFY4_REMOVE_ENTRY
   The server will send information about the directory entry being
   deleted.  The server will also send the cookie value for the
   deleted entry so that clients can get to the cached information
   for this entry.

NOTIFY4_RENAME_ENTRY
   The server will send information about both the old entry and the
   new entry.  This includes the name and attributes for each entry.
   In addition, if the rename causes the deletion of an entry (i.e.,
   the case of a file renamed over), then this is reported in
   nrn_new_entry.nad_old_entry.  This notification is only sent if
   both entries are in the same directory.  If the rename is across
   directories, the server will send a remove notification to one
   directory and an add notification to the other directory, assuming
   both have a directory delegation.

NOTIFY4_CHANGE_CHILD_ATTRS/NOTIFY4_CHANGE_DIR_ATTRS
   The client will use the attribute mask to inform the server of
   attributes for which it wants to receive notifications.  This
   change notification can be requested for changes to the attributes
   of the directory as well as changes to any file's attributes in
   the directory by using two separate attribute masks.  The client
   cannot ask for change attribute notification for a specific file.
   One attribute mask covers all the files in the directory.  Upon
   any attribute change, the server will send back the values of
   changed attributes.  Notifications might not make sense for some
   file system-wide attributes, and it is up to the server to decide
   which subset it wants to support.  The client can negotiate the
   frequency of attribute notifications by letting the server know
   how often it wants to be notified of an attribute change.  The
   server will return supported notification frequencies or an
   indication that no notification is permitted for directory or
   child attributes by setting the dir_notif_delay and
   dir_entry_notif_delay attributes, respectively.

NOTIFY4_CHANGE_COOKIE_VERIFIER
   If the cookie verifier changes while a client is holding a
   delegation, the server will notify the client so that it can
   invalidate its cookies and re-send a READDIR to get the new set of
   cookies.

20.5. Operation 7: CB_PUSH_DELEG - Offer Previously Requested
      Delegation to Client

20.5.1. ARGUMENT

   struct CB_PUSH_DELEG4args {
      nfs_fh4          cpda_fh;
      open_delegation4 cpda_delegation;
   };

20.5.2. RESULT

   struct CB_PUSH_DELEG4res {
      nfsstat4         cpdr_status;
   };

20.5.3. DESCRIPTION

CB_PUSH_DELEG is used by the server both to signal to the client that
the delegation it wants (previously indicated via a want established
from an OPEN or WANT_DELEGATION operation) is available and to
simultaneously offer the delegation to the client.  The client has
the choice of accepting the delegation by returning NFS4_OK to the
server, delaying the decision to accept the offered delegation by
returning NFS4ERR_DELAY, or permanently rejecting the offer of the
delegation by returning NFS4ERR_REJECT_DELEG.  When a delegation is
rejected in this fashion, the want previously established is
permanently deleted and the delegation is subject to acquisition by
another client.

20.5.4. IMPLEMENTATION

If the client does return NFS4ERR_DELAY and there is a conflicting
delegation request, the server MAY process it at the expense of the
client that returned NFS4ERR_DELAY.
The client's want will not be cancelled, but MAY be processed behind
other delegation requests or registered wants.

When a client returns a status other than NFS4_OK, NFS4ERR_DELAY, or
NFS4ERR_REJECT_DELEG, the want remains pending, although servers may
decide to cancel the want by sending a CB_WANTS_CANCELLED.

20.6. Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects

20.6.1. ARGUMENT

   const RCA4_TYPE_MASK_RDATA_DLG          = 0;
   const RCA4_TYPE_MASK_WDATA_DLG          = 1;
   const RCA4_TYPE_MASK_DIR_DLG            = 2;
   const RCA4_TYPE_MASK_FILE_LAYOUT        = 3;
   const RCA4_TYPE_MASK_BLK_LAYOUT         = 4;
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN     = 8;
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX     = 9;
   const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN   = 12;
   const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX   = 15;

   struct CB_RECALL_ANY4args {
      uint32_t  craa_objects_to_keep;
      bitmap4   craa_type_mask;
   };

20.6.2. RESULT

   struct CB_RECALL_ANY4res {
      nfsstat4  crar_status;
   };

20.6.3. DESCRIPTION

The server may decide that it cannot hold all of the state for
recallable objects, such as delegations and layouts, without running
out of resources.  In such a case, while not optimal, the server is
free to recall individual objects to reduce the load.

Because the general purpose of such recallable objects as delegations
is to eliminate client interaction with the server, the server cannot
interpret lack of recent use as indicating that the object is no
longer useful.  The absence of visible use is consistent with a
delegation keeping potential operations from being sent to the
server.
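The craa_objects_to_keep field shown above names a count of objects
the client may keep rather than a count to return.  As a rough
client-side illustration (a hypothetical sketch; these function names
are not part of the protocol), the client returns any excess above
the keep-count, and when several CB_RECALL_ANY callbacks for the same
type mask are outstanding, the lowest count is the one that binds:

```c
#include <stdint.h>

/*
 * Hypothetical client-side sketch (not part of the protocol):
 * given how many recallable objects of the masked types the client
 * holds and the craa_objects_to_keep value from a CB_RECALL_ANY,
 * compute how many objects the client must select and return.
 */
uint32_t craa_excess(uint32_t objects_held, uint32_t craa_objects_to_keep)
{
    return objects_held > craa_objects_to_keep
               ? objects_held - craa_objects_to_keep
               : 0;
}

/*
 * Combine two outstanding keep-counts for the same type mask:
 * the lower count is the binding one.
 */
uint32_t craa_binding_count(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}
```

The choice of which particular objects make up the excess is left
entirely to the client, as the description below explains.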
In the case of layouts, while it is true that the usefulness of a
layout is indicated by the use of the layout when storage devices
receive I/O requests, because there is no mandate that a storage
device indicate to the metadata server any past or present use of a
layout, the metadata server is not likely to know which layouts are
good candidates to recall in response to low resources.

In order to implement an effective reclaim scheme for such objects,
the server's knowledge of available resources must be used to
determine when objects must be recalled, with the clients selecting
the actual objects to be returned.

Server implementations may differ in their resource allocation
requirements.  For example, one server may share resources among all
classes of recallable objects, whereas another may use separate
resource pools for layouts and for delegations, or further separate
resources by types of delegations.

When a given resource pool is over-utilized, the server can send a
CB_RECALL_ANY to clients holding recallable objects of the types
involved, allowing it to keep a certain number of such objects and
return any excess.  A mask specifies which types of objects are to be
limited.  The client chooses, based on its own knowledge of current
usefulness, which of the objects in that class should be returned.

A number of bits are defined.  For some of these, ranges are defined
and it is up to the definition of the storage protocol to specify how
these are to be used.  There are ranges reserved for object-based
storage protocols and for other experimental storage protocols.  An
RFC defining such a storage protocol needs to specify how particular
bits within its range are to be used.  For example, it may specify a
mapping between attributes of the layout (read vs. write, size of
area) and the bit to be used, or it may define a field in the layout
where the associated bit position is made available by the server to
the client.

RCA4_TYPE_MASK_RDATA_DLG
   The client is to return OPEN_DELEGATE_READ delegations on
   non-directory file objects.

RCA4_TYPE_MASK_WDATA_DLG
   The client is to return OPEN_DELEGATE_WRITE delegations on regular
   file objects.

RCA4_TYPE_MASK_DIR_DLG
   The client is to return directory delegations.

RCA4_TYPE_MASK_FILE_LAYOUT
   The client is to return layouts of type LAYOUT4_NFSV4_1_FILES.

RCA4_TYPE_MASK_BLK_LAYOUT
   See [44] for a description.

RCA4_TYPE_MASK_OBJ_LAYOUT_MIN to RCA4_TYPE_MASK_OBJ_LAYOUT_MAX
   See [43] for a description.

RCA4_TYPE_MASK_OTHER_LAYOUT_MIN to RCA4_TYPE_MASK_OTHER_LAYOUT_MAX
   This range is reserved for telling the client to recall layouts of
   experimental or site-specific layout types (see Section 3.3.13).

When a bit is set in the type mask that corresponds to an undefined
type of recallable object, NFS4ERR_INVAL MUST be returned.  When a
bit is set that corresponds to a defined type of object but the
client does not support an object of the type, NFS4ERR_INVAL MUST NOT
be returned.  Future minor versions of NFSv4 may expand the set of
valid type mask bits.

CB_RECALL_ANY specifies a count of objects that the client may keep
as opposed to a count that the client must return.  This is to avoid
a potential race between a CB_RECALL_ANY that had a count of objects
to free and a set of client-originated operations to return layouts
or delegations.  As a result of the race, the client and server would
have differing ideas as to how many objects to return.  Hence, the
client could mistakenly free too many.
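The bit assignments and the NFS4ERR_INVAL rule above can be sketched
as a client-side check (a hypothetical illustration only, assuming
the client supports all of the types defined for NFSv4.1; it is not
normative, and a real client would also consult its own supported
set):

```c
#include <stdint.h>

/* Bit positions from the CB_RECALL_ANY argument (Section 20.6.1). */
#define RCA4_TYPE_MASK_RDATA_DLG        0
#define RCA4_TYPE_MASK_WDATA_DLG        1
#define RCA4_TYPE_MASK_DIR_DLG          2
#define RCA4_TYPE_MASK_FILE_LAYOUT      3
#define RCA4_TYPE_MASK_BLK_LAYOUT       4
#define RCA4_TYPE_MASK_OBJ_LAYOUT_MIN   8
#define RCA4_TYPE_MASK_OBJ_LAYOUT_MAX   9
#define RCA4_TYPE_MASK_OTHER_LAYOUT_MIN 12
#define RCA4_TYPE_MASK_OTHER_LAYOUT_MAX 15

#define NFS4_OK        0
#define NFS4ERR_INVAL  22

/* Build the mask of all bit positions NFSv4.1 defines: 0-4, 8-9, 12-15. */
static uint32_t rca4_defined_mask(void)
{
    uint32_t m = 0;
    int i;

    for (i = RCA4_TYPE_MASK_RDATA_DLG; i <= RCA4_TYPE_MASK_BLK_LAYOUT; i++)
        m |= 1u << i;
    for (i = RCA4_TYPE_MASK_OBJ_LAYOUT_MIN;
         i <= RCA4_TYPE_MASK_OBJ_LAYOUT_MAX; i++)
        m |= 1u << i;
    for (i = RCA4_TYPE_MASK_OTHER_LAYOUT_MIN;
         i <= RCA4_TYPE_MASK_OTHER_LAYOUT_MAX; i++)
        m |= 1u << i;
    return m;
}

/*
 * Return NFS4ERR_INVAL if any bit in craa_type_mask selects an
 * undefined object type; a defined-but-unsupported type does NOT
 * produce NFS4ERR_INVAL.
 */
int craa_mask_check(uint32_t craa_type_mask)
{
    return (craa_type_mask & ~rca4_defined_mask()) ? NFS4ERR_INVAL
                                                   : NFS4_OK;
}
```

For example, a mask selecting bit 5 (undefined in NFSv4.1) fails the
check, while any combination of the defined bits passes.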
If resource demands prompt it, the server may send another
CB_RECALL_ANY with a lower count, even if it has not yet received an
acknowledgment from the client for a previous CB_RECALL_ANY with the
same type mask.  Although the possibility exists that these will be
received by the client in an order different from the order in which
they were sent, any such permutation of the callback stream is
harmless.  It is the job of the client to bring down the size of the
recallable object set in line with each CB_RECALL_ANY received, and
until that obligation is met, it cannot be cancelled or modified by
any subsequent CB_RECALL_ANY for the same type mask.  Thus, if the
server sends two CB_RECALL_ANYs, the effect will be the same as if
the lower count had been sent, whatever the order of recall receipt.
Note that this means that a server may not cancel the effect of a
CB_RECALL_ANY by sending another recall with a higher count.  When a
CB_RECALL_ANY is received and the count is already within the limit
set or is above a limit that the client is working to get down to,
that callback has no effect.

Servers are generally free to deny recallable objects when
insufficient resources are available.  Note that the effect of such a
policy is implicitly to give precedence to existing objects relative
to requested ones, with the result that resources might not be
optimally used.  To prevent this, servers are well advised to make
the point at which they start sending CB_RECALL_ANY callbacks
somewhat below that at which they cease to give out new delegations
and layouts.  This allows the client to purge its less-used objects
whenever appropriate and so continue to have its subsequent requests
given new resources freed up by object returns.

20.6.4. IMPLEMENTATION

The client can choose to return any type of object specified by the
mask.  If a server wishes to limit the use of objects of a specific
type, it should only specify that type in the mask it sends.  Should
the client fail to return requested objects, it is up to the server
to handle this situation, typically by sending specific recalls
(i.e., sending CB_RECALL operations) to properly limit resource
usage.  The server should give the client enough time to return
objects before proceeding to specific recalls.  This time should not
be less than the lease period.

20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources for
      Recallable Objects

20.7.1. ARGUMENT

   typedef CB_RECALL_ANY4args CB_RECALLABLE_OBJ_AVAIL4args;

20.7.2. RESULT

   struct CB_RECALLABLE_OBJ_AVAIL4res {
      nfsstat4  croa_status;
   };

20.7.3. DESCRIPTION

CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client
that the server has resources to grant recallable objects that might
previously have been denied by OPEN, WANT_DELEGATION,
GET_DIR_DELEGATION, or LAYOUTGET.

The argument craa_objects_to_keep means the total number of
recallable objects of the types indicated in the argument type_mask
that the server believes it can allow the client to have, including
the number of such objects the client already has.  A client that
tries to acquire more recallable objects than the server informs it
can have runs the risk of having objects recalled.

The server is not obligated to reserve the difference between the
number of the objects the client currently has and the value of
craa_objects_to_keep, nor does delaying the reply to
CB_RECALLABLE_OBJ_AVAIL prevent the server from using the resources
of the recallable objects for another purpose.
Indeed, if a client responds slowly to CB_RECALLABLE_OBJ_AVAIL, the
server might interpret the client as having reduced capability to
manage recallable objects, and so cancel or reduce any reservation it
is maintaining on behalf of the client.  Thus, if the client desires
to acquire more recallable objects, it needs to reply quickly to
CB_RECALLABLE_OBJ_AVAIL, and then send the appropriate operations to
acquire recallable objects.

20.8. Operation 10: CB_RECALL_SLOT - Change Flow Control Limits

20.8.1. ARGUMENT

   struct CB_RECALL_SLOT4args {
      slotid4  rsa_target_highest_slotid;
   };

20.8.2. RESULT

   struct CB_RECALL_SLOT4res {
      nfsstat4  rsr_status;
   };

20.8.3. DESCRIPTION

The CB_RECALL_SLOT operation requests the client to return session
slots, and if applicable, transport credits (e.g., RDMA credits for
connections associated with the operations channel) of the session's
fore channel.  CB_RECALL_SLOT specifies rsa_target_highest_slotid,
the value of the target highest slot ID the server wants for the
session.  The client MUST then progress toward reducing the session's
highest slot ID to the target value.

If the session has only non-RDMA connections associated with its
operations channel, then the client need only wait for all
outstanding requests with a slot ID > rsa_target_highest_slotid to
complete, then send a single COMPOUND consisting of a single SEQUENCE
operation, with the sa_highest_slotid field set to
rsa_target_highest_slotid.  If there are RDMA-based connections
associated with the operations channel, then the client needs to also
send enough zero-length "RDMA Send" messages to take the total RDMA
credit count to rsa_target_highest_slotid + 1 or below.

20.8.4. IMPLEMENTATION

If the client fails to reduce the highest slot ID it has on the fore
channel to what the server requests, the server can force the issue
by asserting flow control on the receive side of all connections
bound to the fore channel, and then finish servicing all outstanding
requests that are in slots greater than rsa_target_highest_slotid.
Once that is done, the server can then open the flow control, and any
time the client sends a new request on a slot greater than
rsa_target_highest_slotid, the server can return NFS4ERR_BADSLOT.

20.9. Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing and
      Control

20.9.1. ARGUMENT

   struct referring_call4 {
      sequenceid4  rc_sequenceid;
      slotid4      rc_slotid;
   };

   struct referring_call_list4 {
      sessionid4       rcl_sessionid;
      referring_call4  rcl_referring_calls<>;
   };

   struct CB_SEQUENCE4args {
      sessionid4            csa_sessionid;
      sequenceid4           csa_sequenceid;
      slotid4               csa_slotid;
      slotid4               csa_highest_slotid;
      bool                  csa_cachethis;
      referring_call_list4  csa_referring_call_lists<>;
   };

20.9.2. RESULT

   struct CB_SEQUENCE4resok {
      sessionid4   csr_sessionid;
      sequenceid4  csr_sequenceid;
      slotid4      csr_slotid;
      slotid4      csr_highest_slotid;
      slotid4      csr_target_highest_slotid;
   };

   union CB_SEQUENCE4res switch (nfsstat4 csr_status) {
      case NFS4_OK:
         CB_SEQUENCE4resok  csr_resok4;
      default:
         void;
   };

20.9.3. DESCRIPTION

The CB_SEQUENCE operation is used to manage operational accounting
for the backchannel of the session on which a request is sent.
The contents include the session ID to which this request belongs,
the slot ID and sequence ID used by the server to implement session
request control and exactly once semantics, and exchanged slot ID
maxima that are used to adjust the size of the reply cache.  In each
CB_COMPOUND request, CB_SEQUENCE MUST appear once and MUST be the
first operation.  The error NFS4ERR_SEQUENCE_POS MUST be returned
when CB_SEQUENCE is found in any position in a CB_COMPOUND beyond the
first.  If any other operation is in the first position of
CB_COMPOUND, NFS4ERR_OP_NOT_IN_SESSION MUST be returned.

See Section 18.46.3 for a description of how slots are processed.

If csa_cachethis is TRUE, then the server is requesting that the
client cache the reply in the callback reply cache.  The client MUST
cache the reply (see Section 2.10.6.1.3).

The csa_referring_call_lists array is the list of COMPOUND requests,
identified by session ID, slot ID, and sequence ID.  These are
requests that the client previously sent to the server.  These
previous requests created state that some operation(s) in the same
CB_COMPOUND as the csa_referring_call_lists are identifying.  A
session ID is included because leased state is tied to a client ID,
and a client ID can have multiple sessions.  See Section 2.10.6.3.

The value of the csa_sequenceid argument relative to the cached
sequence ID on the slot falls into one of three cases.

o  If the difference between csa_sequenceid and the client's cached
   sequence ID at the slot ID is two (2) or more, or if
   csa_sequenceid is less than the cached sequence ID (accounting for
   wraparound of the unsigned sequence ID value), then the client
   MUST return NFS4ERR_SEQ_MISORDERED.
o  If csa_sequenceid and the cached sequence ID are the same, this is
   a retry, and the client returns the CB_COMPOUND request's cached
   reply.

o  If csa_sequenceid is one greater (accounting for wraparound) than
   the cached sequence ID, then this is a new request, and the slot's
   sequence ID is incremented.  The operations subsequent to
   CB_SEQUENCE, if any, are processed.  If there are no other
   operations, the only other effects are to cache the CB_SEQUENCE
   reply in the slot, maintain the session's activity, and, when the
   server receives the CB_SEQUENCE reply, renew the lease of state
   related to the client ID.

If the server reuses a slot ID and sequence ID for a completely
different request, the client MAY treat the request as if it is a
retry of what it has already executed.  The client MAY, however,
detect the server's illegal reuse and return
NFS4ERR_SEQ_FALSE_RETRY.

If CB_SEQUENCE returns an error, then the state of the slot (sequence
ID, cached reply) MUST NOT change.  See Section 2.10.6.1.3 for the
conditions when the error NFS4ERR_RETRY_UNCACHED_REP might be
returned.

The client returns two "highest_slotid" values: csr_highest_slotid
and csr_target_highest_slotid.  The former is the highest slot ID the
client will accept in a future CB_SEQUENCE operation, and SHOULD NOT
be less than the value of csa_highest_slotid (but see
Section 2.10.6.1 for an exception).  The latter is the highest slot
ID the client would prefer the server use on a future CB_SEQUENCE
operation.

20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending Delegation
       Wants

20.10.1. ARGUMENT

   struct CB_WANTS_CANCELLED4args {
      bool  cwca_contended_wants_cancelled;
      bool  cwca_resourced_wants_cancelled;
   };

20.10.2. RESULT

   struct CB_WANTS_CANCELLED4res {
      nfsstat4  cwcr_status;
   };

20.10.3. DESCRIPTION

The CB_WANTS_CANCELLED operation is used to notify the client that
some or all of the wants it registered for recallable delegations and
layouts have been cancelled.

If cwca_contended_wants_cancelled is TRUE, this indicates that the
server will not be pushing to the client any delegations that become
available after contention passes.

If cwca_resourced_wants_cancelled is TRUE, this indicates that the
server will not notify the client when there are resources on the
server to grant delegations or layouts.

After receiving a CB_WANTS_CANCELLED operation, the client is free to
attempt to acquire the delegations or layouts it was waiting for, and
possibly re-register wants.

20.10.4. IMPLEMENTATION

When a client has an OPEN, WANT_DELEGATION, or GET_DIR_DELEGATION
request outstanding at the time a CB_WANTS_CANCELLED is sent, the
server may need to make clear to the client whether a promise to
signal delegation availability happened before the
CB_WANTS_CANCELLED and is thus covered by it, or after the
CB_WANTS_CANCELLED, in which case it was not covered by it.  The
server can make this distinction by putting the appropriate requests
into the list of referring calls in the associated CB_SEQUENCE.

20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible Lock
       Availability

20.11.1. ARGUMENT

   struct CB_NOTIFY_LOCK4args {
      nfs_fh4      cnla_fh;
      lock_owner4  cnla_lock_owner;
   };

20.11.2. RESULT

   struct CB_NOTIFY_LOCK4res {
      nfsstat4  cnlr_status;
   };

20.11.3. DESCRIPTION

The server can use this operation to indicate that a byte-range lock
for the given file and lock-owner, previously requested by the client
via an unsuccessful LOCK operation, might be available.
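Because the callback is purely advisory, about the only useful client
reaction is to move its next LOCK retry forward.  A minimal sketch of
such a polling scheduler (a hypothetical illustration; the
specification mandates no particular polling policy, and the client
still must not assume the LOCK will succeed):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical client-side polling scheduler for a blocking
 * byte-range lock (illustration only).  A CB_NOTIFY_LOCK merely
 * hints that a LOCK retry may now succeed, so the scheduler moves
 * the next retry up to "now"; without a notification, the client
 * keeps polling at its chosen interval.
 */
uint64_t next_lock_retry(uint64_t now, uint64_t poll_interval,
                         bool notify_lock_received)
{
    return notify_lock_received ? now : now + poll_interval;
}
```

A client might also shorten poll_interval when the server advertised
OPEN4_RESULT_MAY_NOTIFY_LOCK, and lengthen it otherwise, as the
description below suggests.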
This callback is meant to be used by servers to help reduce the
latency of blocking locks in the case where they recognize that a
client that has been polling for a blocking byte-range lock may now
be able to acquire the lock.  If the server supports this callback
for a given file, it MUST set the OPEN4_RESULT_MAY_NOTIFY_LOCK flag
when responding to successful opens for that file.  This does not
commit the server to the use of CB_NOTIFY_LOCK, but the client may
use this as a hint to decide how frequently to poll for locks derived
from that open.

If an OPEN operation results in an upgrade, in which the stateid
returned has an "other" value matching that of a stateid already
allocated, with a new "seqid" indicating a change in the lock being
represented, then the value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag
when responding to that new OPEN controls handling from that point
going forward.  When parallel OPENs are done on the same file and
open-owner, the ordering of the "seqid" fields of the returned
stateids (subject to wraparound) is to be used to select the
controlling value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag.

20.11.4. IMPLEMENTATION

The server MUST NOT grant the byte-range lock to the client unless
and until it receives a LOCK operation from the client.  Similarly,
the client receiving this callback cannot assume that it now has the
lock or that a subsequent LOCK operation for the lock will be
successful.

The server is not required to implement this callback, and even if it
does, it is not required to use it in any particular case.
Therefore, the client must still rely on polling for blocking locks,
as described in Section 9.6.

Similarly, the client is not required to implement this callback, and
even if it does, it is still free to ignore it.
Therefore, the server MUST NOT assume that the client will act based
on the callback.

20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device ID
       Changes

20.12.1. ARGUMENT

   /*
    * Device notification types.
    */
   enum notify_deviceid_type4 {
      NOTIFY_DEVICEID4_CHANGE = 1,
      NOTIFY_DEVICEID4_DELETE = 2
   };

   /* For NOTIFY_DEVICEID4_DELETE */
   struct notify_deviceid_delete4 {
      layouttype4  ndd_layouttype;
      deviceid4    ndd_deviceid;
   };

   /* For NOTIFY_DEVICEID4_CHANGE */
   struct notify_deviceid_change4 {
      layouttype4  ndc_layouttype;
      deviceid4    ndc_deviceid;
      bool         ndc_immediate;
   };

   struct CB_NOTIFY_DEVICEID4args {
      notify4  cnda_changes<>;
   };

20.12.2. RESULT

   struct CB_NOTIFY_DEVICEID4res {
      nfsstat4  cndr_status;
   };

20.12.3. DESCRIPTION

The CB_NOTIFY_DEVICEID operation is used by the server to send
notifications to clients about changes to pNFS device IDs.  The
registration of device ID notifications is optional and is done via
GETDEVICEINFO.  These notifications are sent over the backchannel
once the original request has been processed on the server.  The
server will send an array of notifications, cnda_changes, as a list
of pairs of bitmaps and values.  See Section 3.3.7 for a description
of how NFSv4.1 bitmaps work.

As with CB_NOTIFY (Section 20.4.3), it is possible the server has
more notifications than can fit in a CB_COMPOUND, thus requiring
multiple CB_COMPOUNDs.  Unlike CB_NOTIFY, serialization is not an
issue because, unlike directory entries, device IDs cannot be reused
after being deleted (Section 12.2.10).

All device ID notifications contain a device ID and a layout type.
The layout type is necessary because two different layout types can
share the same device ID, and the common device ID can have
completely different mappings for each layout type.

The server will send the following notifications:

NOTIFY_DEVICEID4_CHANGE
   A previously provided device-ID-to-device-address mapping has
   changed and the client uses GETDEVICEINFO to obtain the updated
   mapping.  The notification is encoded in a value of data type
   notify_deviceid_change4.  This data type also contains a boolean
   field, ndc_immediate, which if TRUE indicates that the change will
   be enforced immediately, and so the client might not be able to
   complete any pending I/O to the device ID.  If ndc_immediate is
   FALSE, then for an indefinite time, the client can complete
   pending I/O.  After pending I/O is complete, the client SHOULD get
   the new device-ID-to-device-address mappings before sending new
   I/O requests to the storage devices addressed by the device ID.

NOTIFY_DEVICEID4_DELETE
   Deletes a device ID from the mappings.  This notification MUST NOT
   be sent if the client has a layout that refers to the device ID.
   In other words, if the server is sending a delete device ID
   notification, one of the following is true for layouts associated
   with the layout type:

   *  The client never had a layout referring to that device ID.

   *  The client has returned all layouts referring to that device
      ID.

   *  The server has revoked all layouts referring to that device
      ID.

   The notification is encoded in a value of data type
   notify_deviceid_delete4.  After a server deletes a device ID, it
   MUST NOT reuse that device ID for the same layout type until the
   client ID is deleted.

20.13. Operation 10044: CB_ILLEGAL - Illegal Callback Operation

20.13.1. ARGUMENT

   void;

20.13.2. RESULT

   /*
    * CB_ILLEGAL: Response for illegal operation numbers
    */
   struct CB_ILLEGAL4res {
      nfsstat4  status;
   };

20.13.3. DESCRIPTION

This operation is a placeholder for encoding a result to handle the
case of the server sending an operation code within CB_COMPOUND that
is not defined in the NFSv4.1 specification.  See Section 19.2.3 for
more details.

The status field of CB_ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL.

20.13.4. IMPLEMENTATION

A server will probably not send an operation with code OP_CB_ILLEGAL,
but if it does, the response will be CB_ILLEGAL4res just as it would
be with any other invalid operation code.  Note that if the client
gets an illegal operation code that is not OP_ILLEGAL, and if the
client checks for legal operation codes during the XDR decode phase,
then an instance of data type CB_ILLEGAL4res will not be returned.

21. Security Considerations

Historically, the authentication model of NFS was based on the entire
machine being the NFS client, with the NFS server trusting the NFS
client to authenticate the end-user.  The NFS server in turn shared
its files only to specific clients, as identified by the client's
source network address.  Given this model, the AUTH_SYS RPC security
flavor simply identified the end-user using the client to the NFS
server.  When processing NFS responses, the client ensured that the
responses came from the same network address and port number to which
the request was sent.  While such a model is easy to implement and
simple to deploy and use, it is unsafe.
Thus, NFSv4.1 implementations are REQUIRED to support a security
model that uses end-to-end authentication, where an end-user on a
client mutually authenticates (via cryptographic schemes that do not
expose passwords or keys in the clear on the network) to a principal
on an NFS server.

Consideration is also given to the integrity and privacy of NFS
requests and responses.  The issues of end-to-end mutual
authentication, integrity, and privacy are discussed in
Section 2.2.1.1.1.  There are specific considerations when using
Kerberos V5 as described in Section 2.2.1.1.1.2.1.1.

Note that being REQUIRED to implement does not mean REQUIRED to use;
AUTH_SYS can be used by NFSv4.1 clients and servers.  However,
AUTH_SYS is merely an OPTIONAL security flavor in NFSv4.1, and so
interoperability via AUTH_SYS is not assured.

For reasons of reduced administration overhead, better performance,
and/or reduction of CPU utilization, users of NFSv4.1 implementations
might decline to use security mechanisms that enable integrity
protection on each remote procedure call and response.  The use of
mechanisms without integrity leaves the user vulnerable to a
man-in-the-middle between the NFS client and server that modifies the
RPC request and/or the response.  While implementations are free to
provide the option to use weaker security mechanisms, there are three
operations in particular that warrant the implementation overriding
user choices.

o  The first two such operations are SECINFO and SECINFO_NO_NAME.  It
   is RECOMMENDED that the client send both operations such that they
   are protected with a security flavor that has integrity
   protection, such as RPCSEC_GSS with either the
   rpc_gss_svc_integrity or rpc_gss_svc_privacy service.  Without
   integrity protection encapsulating SECINFO and SECINFO_NO_NAME and
   their results, a man-in-the-middle could modify results such that
   the client might select a weaker algorithm in the set allowed by
   the server, making the client and/or server vulnerable to further
   attacks.

o  The third operation that SHOULD use integrity protection is any
   GETATTR for the fs_locations and fs_locations_info attributes, in
   order to mitigate the severity of a man-in-the-middle attack.  The
   attack has two steps.  First, the attacker modifies the
   unprotected results of some operation to return NFS4ERR_MOVED.
   Second, when the client follows up with a GETATTR for the
   fs_locations or fs_locations_info attributes, the attacker
   modifies the results to cause the client to migrate its traffic to
   a server controlled by the attacker.  With integrity protection,
   this attack is mitigated.

Relative to previous NFS versions, NFSv4.1 has additional security
considerations for pNFS (see Sections 12.9 and 13.12), locking and
session state (see Section 2.10.8.3), and state recovery during the
grace period (see Section 8.4.2.1.1).  With respect to locking and
session state, if SP4_SSV state protection is being used,
Section 2.10.10 has specific security considerations for the NFSv4.1
client and server.

The use of the multi-server namespace features described in
Section 11 raises the possibility that requests to determine the set
of network addresses corresponding to a given server might be
interfered with or have their responses modified in flight.  In light
of this possibility, the following considerations should be taken
note of:

o  When DNS is used to convert server names to addresses and DNSSEC
   [29] is not available, the validity of the network addresses
   returned cannot be relied upon.
   However, when the client uses RPCSEC_GSS to access the designated
   server, it is possible for mutual authentication to discover
   invalid server addresses, provided that the RPCSEC_GSS
   implementation used does not use insecure DNS queries to
   canonicalize the hostname components of the service principal
   names, as explained in [28].

o  The fetching of attributes containing file system location
   information SHOULD be performed using RPCSEC_GSS with integrity
   protection.  It is important to note here that a client making a
   request of this sort without using RPCSEC_GSS with integrity
   protection needs to be aware of the negative consequences of doing
   so, which can lead to invalid host names or network addresses
   being returned.  These include cases in which the client is
   directed to a server under the control of an attacker, who might
   get access to data written or provide incorrect values for data
   read.  In light of this, the client needs to recognize that using
   such returned location information to access an NFSv4 server
   without use of RPCSEC_GSS (i.e., by using AUTH_SYS) poses dangers,
   as it can result in the client interacting with such an
   attacker-controlled server without any authentication facilities
   to verify the server's identity.

o  Despite the fact that it is a requirement that "implementations"
   provide "support" for use of RPCSEC_GSS, it cannot be assumed that
   use of RPCSEC_GSS is always available between any particular
   client-server pair.

o  When a client has the network addresses of a server but not the
   associated host names, that would interfere with its ability to
   use RPCSEC_GSS.

In light of the above, a server SHOULD present file system location
entries that correspond to file systems on other servers using a
host name.
This would allow the client to interrogate the fs_locations on the
destination server to obtain trunking information (as well as replica
information) using RPCSEC_GSS with integrity, validating the name
provided while assuring that the response has not been modified in
flight.

When RPCSEC_GSS is not available on a server, the client needs to be
aware of the fact that the location entries are subject to
modification in flight and so cannot be relied upon.  In the case of
a client being directed to another server after receiving
NFS4ERR_MOVED, this could vitiate the authentication provided by the
use of RPCSEC_GSS on the destination.  Even when RPCSEC_GSS
authentication is available on the destination, the server might
validly represent itself as the server to which the client was
erroneously directed.  Without a way to decide whether the server is
a valid one, the client can only determine, using RPCSEC_GSS, that
the server corresponds to the name provided, with no basis for
trusting that server.  As a result, the client SHOULD NOT use such
unverified location entries as a basis for migration, even though
RPCSEC_GSS might be available on the destination.

When a file system location attribute is fetched upon connecting with
an NFS server, it SHOULD, as stated above, be done using RPCSEC_GSS
with integrity protection.  When this is not possible, it is
generally best for the client to ignore trunking and replica
information or simply not fetch the location information for these
purposes.

When location information cannot be verified, it can be subjected to
additional filtering to prevent the client from being inappropriately
directed.  For example, if a range of network addresses can be
determined that assures that the servers and clients using AUTH_SYS
are subject to the appropriate set of constraints (e.g., physical
network isolation, administrative controls on the operating systems
used), then network addresses in the appropriate range can be used,
with others discarded or restricted in their use of AUTH_SYS.

To summarize considerations regarding the use of RPCSEC_GSS in
fetching location information, we need to consider the following
possibilities for requests to interrogate location information, with
interrogation approaches on the referring and destination servers
arrived at separately:

o  The use of RPCSEC_GSS with integrity protection is RECOMMENDED in
   all cases, since the absence of integrity protection exposes the
   client to the possibility of the results being modified in
   transit.

o  The use of requests issued without RPCSEC_GSS (i.e., using
   AUTH_SYS, which has no provision to avoid modification of data in
   flight), while undesirable and a potential security exposure, may
   not be avoidable in all cases.  Where the use of the returned
   information cannot be avoided, it is made subject to filtering as
   described above to eliminate the possibility that the client would
   treat an invalid address as if it were an NFSv4 server.  The
   specifics will vary depending on the degree of network isolation
   and whether the request is to the referring or destination server.

22.  IANA Considerations

This section uses terms that are defined in [58].

22.1.  IANA Actions Needed

This update does not require any actions by IANA.

Previous actions by IANA related to NFSv4.1 are listed in the
remaining subsections of Section 22.

22.2.  Named Attribute Definitions

IANA created a registry called the "NFSv4 Named Attribute Definitions
Registry".

The NFSv4.1 protocol supports the association of a file with zero or
more named attributes.
The namespace identifiers for these attributes are defined as string
names.  The protocol does not define the specific assignment of the
namespace for these file attributes.  The IANA registry promotes
interoperability where common interests exist.  While application
developers are allowed to define and use attributes as needed, they
are encouraged to register the attributes with IANA.

Such registered named attributes are presumed to apply to all minor
versions of NFSv4, including those defined subsequently to the
registration.  If the named attribute is intended to be limited to
specific minor versions, this will be clearly stated in the
registry's assignment.

All assignments to the registry are made on a First Come First Served
basis, per Section 4.1 of [58].  The policy for each assignment is
Specification Required, per Section 4.1 of [58].

Under the NFSv4.1 specification, the name of a named attribute can in
theory be up to 2^32 - 1 bytes in length, but in practice NFSv4.1
clients and servers will be unable to handle a string that long.
IANA should reject any assignment request with a named attribute that
exceeds 128 UTF-8 characters.  To give the IESG the flexibility to
set up bases of assignment of Experimental Use and Standards Action,
the prefixes of "EXPE" and "STDS" are Reserved.  The named attribute
with a zero-length name is Reserved.

The prefix "PRIV" is designated for Private Use.  A site that wants
to make use of unregistered named attributes without risk of
conflicting with an assignment in IANA's registry should use the
prefix "PRIV" in all of its named attributes.
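The prefix and length rules above, together with the case-insensitive
reservation of these prefixes described in this section, can be
sketched as a simple check a registrar might apply to a proposed
named-attribute name.  This is a non-normative illustration; the
function name and result strings are invented here, and folding the
name to upper case is a simplification of the rule that all case
permutations of the Reserved prefixes are equally Reserved.

```python
# Non-normative sketch of the named-attribute registration checks.
RESERVED_PREFIXES = ("EXPE", "STDS")  # Reserved for Experimental Use / Standards Action
PRIVATE_PREFIX = "PRIV"               # Designated for Private Use; never registered


def check_named_attribute(name: str) -> str:
    """Classify a proposed named-attribute name per the registry rules."""
    if len(name) == 0:
        return "rejected: the zero-length name is Reserved"
    if len(name) > 128:
        return "rejected: name exceeds 128 UTF-8 characters"
    upper = name.upper()  # case permutations of the prefixes are also Reserved
    if upper.startswith(PRIVATE_PREFIX):
        return "rejected: the 'PRIV' prefix is designated for Private Use"
    if upper.startswith(RESERVED_PREFIXES):
        return "rejected: the 'EXPE' and 'STDS' prefixes are Reserved"
    return "eligible for First Come First Served assignment"
```

A registrar would still need the separate check, described below, that
no two assignments conflict when both names are converted to a common
case.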
Because some NFSv4.1 clients and servers have case-insensitive
semantics, the fifteen additional lower case and mixed case
permutations of each of "EXPE", "PRIV", and "STDS" are Reserved
(e.g., "expe", "expE", "exPe", etc. are Reserved).  Similarly, IANA
must not allow two assignments that would conflict if both named
attributes were converted to a common case.

The registry of named attributes is a list of assignments, each
containing three fields.

1.  A US-ASCII string name that is the actual name of the attribute.
    This name must be unique.  This string name can be 1 to 128 UTF-8
    characters long.

2.  A reference to the specification of the named attribute.  The
    reference can consume up to 256 bytes (or more if IANA permits).

3.  The point of contact of the registrant.  The point of contact can
    consume up to 256 bytes (or more if IANA permits).

22.2.1.  Initial Registry

There is no initial registry.

22.2.2.  Updating Registrations

The registrant is always permitted to update the point of contact
field.  Any other change will require Expert Review or IESG Approval.

22.3.  Device ID Notifications

IANA created a registry called the "NFSv4 Device ID Notifications
Registry".

The potential exists for new notification types to be added to the
CB_NOTIFY_DEVICEID operation (see Section 20.12).  This can be done
via changes to the operations that register notifications, or by
adding new operations to NFSv4.  This requires a new minor version of
NFSv4, and requires a Standards Track document from the IETF.

Another way to add a notification is to specify a new layout type
(see Section 22.5).

Hence, all assignments to the registry are made on a Standards Action
basis per Section 4.1 of [58], with Expert Review required.
The registry is a list of assignments, each containing five fields.

1.  The name of the notification type.  This name must have the
    prefix "NOTIFY_DEVICEID4_".  This name must be unique.

2.  The value of the notification.  IANA will assign this number, and
    the request from the registrant will use TBD1 instead of an
    actual value.  IANA MUST use a whole number that can be no higher
    than 2^32-1, and should be the next available value.  The value
    assigned must be unique.  A Designated Expert must be used to
    ensure that when the name of the notification type and its value
    are added to the NFSv4.1 notify_deviceid_type4 enumerated data
    type in the NFSv4.1 XDR description ([10]), the result continues
    to be a valid XDR description.

3.  The Standards Track RFC(s) that describe the notification.  If
    the RFC(s) have not yet been published, the registrant will use
    RFCTBD2, RFCTBD3, etc. instead of an actual RFC number.

4.  How the RFC introduces the notification.  This is indicated by a
    single US-ASCII value.  If the value is N, it means a minor
    revision to the NFSv4 protocol.  If the value is L, it means a
    new pNFS layout type.  Other values can be used with IESG
    Approval.

5.  The minor versions of NFSv4 that are allowed to use the
    notification.  While these are numeric values, IANA will not
    allocate and assign them; the author of the relevant RFCs with
    IESG Approval assigns these numbers.  Each time there is a new
    minor version of NFSv4 approved, a Designated Expert should
    review the registry to make recommended updates as needed.

22.3.1.  Initial Registry

The initial registry is in Table 16.  Note that the next available
value is zero.
   +-------------------------+-------+---------+-----+----------------+
   | Notification Name       | Value | RFC     | How | Minor Versions |
   +-------------------------+-------+---------+-----+----------------+
   | NOTIFY_DEVICEID4_CHANGE | 1     | RFC5661 | N   | 1              |
   | NOTIFY_DEVICEID4_DELETE | 2     | RFC5661 | N   | 1              |
   +-------------------------+-------+---------+-----+----------------+

        Table 16: Initial Device ID Notification Assignments

22.3.2.  Updating Registrations

The update of a registration will require IESG Approval on the advice
of a Designated Expert.

22.4.  Object Recall Types

IANA created a registry called the "NFSv4 Recallable Object Types
Registry".

The potential exists for new object types to be added to the
CB_RECALL_ANY operation (see Section 20.6).  This can be done via
changes to the operations that add recallable types, or by adding new
operations to NFSv4.  This requires a new minor version of NFSv4, and
requires a Standards Track document from the IETF.  Another way to
add a new recallable object is to specify a new layout type (see
Section 22.5).

All assignments to the registry are made on a Standards Action basis
per Section 4.1 of [58], with Expert Review required.

Recallable object types are 32-bit unsigned numbers.  There are no
Reserved values.  Values in the range 12 through 15, inclusive, are
designated for Private Use.

The registry is a list of assignments, each containing five fields.

1.  The name of the recallable object type.  This name must have the
    prefix "RCA4_TYPE_MASK_".  The name must be unique.

2.  The value of the recallable object type.  IANA will assign this
    number, and the request from the registrant will use TBD1 instead
    of an actual value.  IANA MUST use a whole number that can be no
    higher than 2^32-1, and should be the next available value.
    The value must be unique.  A Designated Expert must be used to
    ensure that when the name of the recallable type and its value
    are added to the NFSv4 XDR description [10], the result continues
    to be a valid XDR description.

3.  The Standards Track RFC(s) that describe the recallable object
    type.  If the RFC(s) have not yet been published, the registrant
    will use RFCTBD2, RFCTBD3, etc. instead of an actual RFC number.

4.  How the RFC introduces the recallable object type.  This is
    indicated by a single US-ASCII value.  If the value is N, it
    means a minor revision to the NFSv4 protocol.  If the value is L,
    it means a new pNFS layout type.  Other values can be used with
    IESG Approval.

5.  The minor versions of NFSv4 that are allowed to use the
    recallable object type.  While these are numeric values, IANA
    will not allocate and assign them; the author of the relevant
    RFCs with IESG Approval assigns these numbers.  Each time there
    is a new minor version of NFSv4 approved, a Designated Expert
    should review the registry to make recommended updates as needed.

22.4.1.  Initial Registry

The initial registry is in Table 17.  Note that the next available
value is five.
   +-------------------------------+-------+----------+-----+----------+
   | Recallable Object Type Name   | Value | RFC      | How | Minor    |
   |                               |       |          |     | Versions |
   +-------------------------------+-------+----------+-----+----------+
   | RCA4_TYPE_MASK_RDATA_DLG      | 0     | RFC 5661 | N   | 1        |
   | RCA4_TYPE_MASK_WDATA_DLG      | 1     | RFC 5661 | N   | 1        |
   | RCA4_TYPE_MASK_DIR_DLG        | 2     | RFC 5661 | N   | 1        |
   | RCA4_TYPE_MASK_FILE_LAYOUT    | 3     | RFC 5661 | N   | 1        |
   | RCA4_TYPE_MASK_BLK_LAYOUT     | 4     | RFC 5661 | L   | 1        |
   | RCA4_TYPE_MASK_OBJ_LAYOUT_MIN | 8     | RFC 5661 | L   | 1        |
   | RCA4_TYPE_MASK_OBJ_LAYOUT_MAX | 9     | RFC 5661 | L   | 1        |
   +-------------------------------+-------+----------+-----+----------+

         Table 17: Initial Recallable Object Type Assignments

22.4.2.  Updating Registrations

The update of a registration will require IESG Approval on the advice
of a Designated Expert.

22.5.  Layout Types

IANA created a registry called the "pNFS Layout Types Registry".

All assignments to the registry are made on a Standards Action basis,
with Expert Review required.

Layout types are 32-bit numbers.  The value zero is Reserved.  Values
in the range 0x80000000 to 0xFFFFFFFF inclusive are designated for
Private Use.  IANA will assign numbers from the range 0x00000001 to
0x7FFFFFFF inclusive.

The registry is a list of assignments, each containing five fields.

1.  The name of the layout type.  This name must have the prefix
    "LAYOUT4_".  The name must be unique.

2.  The value of the layout type.  IANA will assign this number, and
    the request from the registrant will use TBD1 instead of an
    actual value.  The value assigned must be unique.
    A Designated Expert must be used to ensure that when the name of
    the layout type and its value are added to the NFSv4.1
    layouttype4 enumerated data type in the NFSv4.1 XDR description
    ([10]), the result continues to be a valid XDR description.

3.  The Standards Track RFC(s) that describe the layout type.  If
    the RFC(s) have not yet been published, the registrant will use
    RFCTBD2, RFCTBD3, etc. instead of an actual RFC number.
    Collectively, the RFC(s) must adhere to the guidelines listed in
    Section 22.5.3.

4.  How the RFC introduces the layout type.  This is indicated by a
    single US-ASCII value.  If the value is N, it means a minor
    revision to the NFSv4 protocol.  If the value is L, it means a
    new pNFS layout type.  Other values can be used with IESG
    Approval.

5.  The minor versions of NFSv4 that are allowed to use the layout
    type.  While these are numeric values, IANA will not allocate
    and assign them; the author of the relevant RFCs with IESG
    Approval assigns these numbers.  Each time there is a new minor
    version of NFSv4 approved, a Designated Expert should review the
    registry to make recommended updates as needed.

22.5.1.  Initial Registry

The initial registry is in Table 18.

   +-----------------------+-------+----------+-----+----------------+
   | Layout Type Name      | Value | RFC      | How | Minor Versions |
   +-----------------------+-------+----------+-----+----------------+
   | LAYOUT4_NFSV4_1_FILES | 0x1   | RFC 5661 | N   | 1              |
   | LAYOUT4_OSD2_OBJECTS  | 0x2   | RFC 5664 | L   | 1              |
   | LAYOUT4_BLOCK_VOLUME  | 0x3   | RFC 5663 | L   | 1              |
   +-----------------------+-------+----------+-----+----------------+

              Table 18: Initial Layout Type Assignments

22.5.2.  Updating Registrations

The update of a registration will require IESG Approval on the advice
of a Designated Expert.

22.5.3.  Guidelines for Writing Layout Type Specifications

The author of a new pNFS layout specification must follow these steps
to obtain acceptance of the layout type as a Standards Track RFC:

1.  The author devises the new layout specification.

2.  The new layout type specification MUST, at a minimum:

    *  Define the contents of the layout-type-specific fields of the
       following data types:

       +  the da_addr_body field of the device_addr4 data type;

       +  the loh_body field of the layouthint4 data type;

       +  the loc_body field of the layout_content4 data type (which
          in turn is the lo_content field of the layout4 data type);

       +  the lou_body field of the layoutupdate4 data type.

    *  Describe or define the storage access protocol used to access
       the storage devices.

    *  Describe whether revocation of layouts is supported.

    *  At a minimum, describe the methods of recovery from:

       1.  Failure and restart for client, server, storage device.

       2.  Lease expiration from the perspective of the active
           client, server, storage device.

       3.  Loss of layout state resulting in fencing of client access
           to storage devices (for an example, see Section 12.7.3).

    *  Include an IANA considerations section, which will in turn
       include:

       +  A request to IANA for a new layout type per Section 22.5.

       +  A list of requests to IANA for any new recallable object
          types for CB_RECALL_ANY; each entry is to be presented in
          the form described in Section 22.4.

       +  A list of requests to IANA for any new notification values
          for CB_NOTIFY_DEVICEID; each entry is to be presented in
          the form described in Section 22.3.

    *  Include a security considerations section.  This section MUST
       explain how the NFSv4.1 authentication, authorization, and
       access-control models are preserved.
       That is, if a metadata server would restrict a READ or WRITE
       operation, how would pNFS via the layout similarly restrict a
       corresponding input or output operation?

3.  The author documents the new layout specification as an
    Internet-Draft.

4.  The author submits the Internet-Draft for review through the IETF
    standards process as defined in "The Internet Standards Process
    -- Revision 3" (BCP 9).  The new layout specification will be
    submitted for eventual publication as a Standards Track RFC.

5.  The layout specification progresses through the IETF standards
    process.

22.6.  Path Variable Definitions

This section deals with the IANA considerations associated with the
variable substitution feature for location names as described in
Section 11.16.3.  As described there, variables subject to
substitution consist of a domain name and a specific name within that
domain, with the two separated by a colon.  There are two sets of
IANA considerations here:

1.  The list of variable names.

2.  For each variable name, the list of possible values.

Thus, there will be one registry for the list of variable names, and
possibly one registry for listing the values of each variable name.

22.6.1.  Path Variables Registry

IANA created a registry called the "NFSv4 Path Variables Registry".

22.6.1.1.  Path Variable Values

Variable names are of the form "${", followed by a domain name,
followed by a colon (":"), followed by a domain-specific portion of
the variable name, followed by "}".  When the domain name is
"ietf.org", all variable names must be registered with IANA on a
Standards Action basis, with Expert Review required.
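The "${domain:name}" framing just described can be parsed
mechanically.  The following sketch is non-normative: the regular
expression and function name are invented for illustration, it checks
only the framing (not registration status), and the character classes
are an assumption, since the text specifies only the "${", domain,
":", name, "}" structure.

```python
import re

# Matches "${", a domain name, ":", a domain-specific name, "}".
VARIABLE_RE = re.compile(r"^\$\{([^:}]+):([^}]+)\}$")


def parse_path_variable(variable):
    """Split a path variable into its (domain, domain-specific name) parts."""
    m = VARIABLE_RE.match(variable)
    if m is None:
        raise ValueError("not a well-formed path variable")
    return m.group(1), m.group(2)
```

For example, parse_path_variable("${ietf.org:CPU_ARCH}") yields
("ietf.org", "CPU_ARCH"), while a string without the "${...}" framing
raises an error.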
Path variables with registered domain names neither part of nor equal
to ietf.org are assigned on a Hierarchical Allocation basis
(delegating to the domain owner) and are thus of no concern to IANA,
unless the domain owner chooses to register a variable name from that
domain.  If the domain owner chooses to do so, IANA will do so on a
First Come First Served basis.  To accommodate registrants who do not
have their own domain, IANA will accept requests to register
variables with the prefix "${FCFS.ietf.org:" on a First Come First
Served basis.  Assignments on a First Come First Served basis do not
require Expert Review, unless the registrant also wants IANA to
establish a registry for the values of the registered variable.

The registry is a list of assignments, each containing three fields.

1.  The name of the variable.  The name of this variable must start
    with a "${" followed by a registered domain name, followed by
    ":", or it must start with "${FCFS.ietf.org".  The name must be
    no more than 64 UTF-8 characters long.  The name must be unique.

2.  For assignments made on a Standards Action basis, the Standards
    Track RFC(s) that describe the variable.  If the RFC(s) have not
    yet been published, the registrant will use RFCTBD1, RFCTBD2,
    etc. instead of an actual RFC number.  Note that the RFCs do not
    have to be a part of an NFS minor version.  For assignments made
    on a First Come First Served basis, an explanation (consuming no
    more than 1024 bytes, or more if IANA permits) of the purpose of
    the variable.  A reference to the explanation can be substituted.

3.  The point of contact, including an email address.  The point of
    contact can consume up to 256 bytes (or more if IANA permits).
    For assignments made on a Standards Action basis, the point of
    contact is always IESG.

22.6.1.1.1.  Initial Registry

The initial registry is in Table 19.

   +------------------------+----------+------------------+
   | Variable Name          | RFC      | Point of Contact |
   +------------------------+----------+------------------+
   | ${ietf.org:CPU_ARCH}   | RFC 5661 | IESG             |
   | ${ietf.org:OS_TYPE}    | RFC 5661 | IESG             |
   | ${ietf.org:OS_VERSION} | RFC 5661 | IESG             |
   +------------------------+----------+------------------+

           Table 19: Initial List of Path Variables

IANA has created registries for the values of the variable names
${ietf.org:CPU_ARCH} and ${ietf.org:OS_TYPE}.  See Sections 22.6.2
and 22.6.3.

For the values of the variable ${ietf.org:OS_VERSION}, no registry is
needed, as the specifics of the values of the variable will vary with
the value of ${ietf.org:OS_TYPE}.  Thus, values for
${ietf.org:OS_VERSION} are assigned on a Hierarchical Allocation
basis and are of no concern to IANA.

22.6.1.1.2.  Updating Registrations

The update of an assignment made on a Standards Action basis will
require IESG Approval on the advice of a Designated Expert.

The registrant can always update the point of contact of an
assignment made on a First Come First Served basis.  Any other update
will require Expert Review.

22.6.2.  Values for the ${ietf.org:CPU_ARCH} Variable

IANA created a registry called the "NFSv4 ${ietf.org:CPU_ARCH} Value
Registry".

Assignments to the registry are made on a First Come First Served
basis.  The zero-length value of ${ietf.org:CPU_ARCH} is Reserved.
Values with a prefix of "PRIV" are designated for Private Use.

The registry is a list of assignments, each containing three fields.

1.  A value of the ${ietf.org:CPU_ARCH} variable.  The value must be
    1 to 32 UTF-8 characters long.  The value must be unique.

2.  An explanation (consuming no more than 1024 bytes, or more if
    IANA permits) of what CPU architecture the value denotes.  A
    reference to the explanation can be substituted.

3.  The point of contact, including an email address.  The point of
    contact can consume up to 256 bytes (or more if IANA permits).

22.6.2.1.  Initial Registry

There is no initial registry.

22.6.2.2.  Updating Registrations

The registrant is free to update the assignment, i.e., change the
explanation and/or point-of-contact fields.

22.6.3.  Values for the ${ietf.org:OS_TYPE} Variable

IANA created a registry called the "NFSv4 ${ietf.org:OS_TYPE} Value
Registry".

Assignments to the registry are made on a First Come First Served
basis.  The zero-length value of ${ietf.org:OS_TYPE} is Reserved.
Values with a prefix of "PRIV" are designated for Private Use.

The registry is a list of assignments, each containing three fields.

1.  A value of the ${ietf.org:OS_TYPE} variable.  The value must be
    1 to 32 UTF-8 characters long.  The value must be unique.

2.  An explanation (consuming no more than 1024 bytes, or more if
    IANA permits) of what operating system the value denotes.  A
    reference to the explanation can be substituted.

3.  The point of contact, including an email address.  The point of
    contact can consume up to 256 bytes (or more if IANA permits).

22.6.3.1.  Initial Registry

There is no initial registry.

22.6.3.2.  Updating Registrations

The registrant is free to update the assignment, i.e., change the
explanation and/or point-of-contact fields.

23.  References

23.1.  Normative References

[1]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
      Levels", BCP 14, RFC 2119, March 1997.

[2]   Eisler, M., Ed., "XDR: External Data Representation Standard",
      STD 67, RFC 4506, May 2006.
[3]   Thurlow, R., "RPC: Remote Procedure Call Protocol Specification
      Version 2", RFC 5531, May 2009.

[4]   Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol
      Specification", RFC 2203, September 1997.

[5]   Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos Version
      5 Generic Security Service Application Program Interface
      (GSS-API) Mechanism Version 2", RFC 4121, July 2005.

[6]   The Open Group, "Section 3.191 of Chapter 3 of Base Definitions
      of The Open Group Base Specifications Issue 6 IEEE Std 1003.1,
      2004 Edition, HTML Version (www.opengroup.org), ISBN
      1931624232", 2004.

[7]   Linn, J., "Generic Security Service Application Program
      Interface Version 2, Update 1", RFC 2743, January 2000.

[8]   Recio, R., Metzler, B., Culley, P., Hilland, J., and D. Garcia,
      "A Remote Direct Memory Access Protocol Specification",
      RFC 5040, October 2007.

[9]   Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, February 2009.

[10]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network
      File System (NFS) Version 4 Minor Version 1 External Data
      Representation Standard (XDR) Description", RFC 5662, January
      2010.

[11]  The Open Group, "Section 3.372 of Chapter 3 of Base Definitions
      of The Open Group Base Specifications Issue 6 IEEE Std 1003.1,
      2004 Edition, HTML Version (www.opengroup.org), ISBN
      1931624232", 2004.

[12]  Eisler, M., "IANA Considerations for Remote Procedure Call
      (RPC) Network Identifiers and Universal Address Formats",
      RFC 5665, January 2010.

[13]  The Open Group, "Section 'read()' of System Interfaces of The
      Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004
      Edition, HTML Version (www.opengroup.org), ISBN 1931624232",
      2004.
[14]  The Open Group, "Section 'readdir()' of System Interfaces of
      The Open Group Base Specifications Issue 6 IEEE Std 1003.1,
      2004 Edition, HTML Version (www.opengroup.org), ISBN
      1931624232", 2004.

[15]  The Open Group, "Section 'write()' of System Interfaces of The
      Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004
      Edition, HTML Version (www.opengroup.org), ISBN 1931624232",
      2004.

[16]  Hoffman, P. and M. Blanchet, "Preparation of Internationalized
      Strings ("stringprep")", RFC 3454, December 2002.

[17]  The Open Group, "Section 'chmod()' of System Interfaces of The
      Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004
      Edition, HTML Version (www.opengroup.org), ISBN 1931624232",
      2004.

[18]  International Organization for Standardization, "Information
      Technology - Universal Multiple-octet coded Character Set (UCS)
      - Part 1: Architecture and Basic Multilingual Plane", ISO
      Standard 10646-1, May 1993.

[19]  Alvestrand, H., "IETF Policy on Character Sets and Languages",
      BCP 18, RFC 2277, January 1998.

[20]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep Profile
      for Internationalized Domain Names (IDN)", RFC 3491, March
      2003.

[21]  The Open Group, "Section 'fcntl()' of System Interfaces of The
      Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004
      Edition, HTML Version (www.opengroup.org), ISBN 1931624232",
      2004.

[22]  The Open Group, "Section 'fsync()' of System Interfaces of The
      Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004
      Edition, HTML Version (www.opengroup.org), ISBN 1931624232",
      2004.

[23]  The Open Group, "Section 'getpwnam()' of System Interfaces of
      The Open Group Base Specifications Issue 6 IEEE Std 1003.1,
      2004 Edition, HTML Version (www.opengroup.org), ISBN
      1931624232", 2004.
[24]  The Open Group, "Section 'unlink()' of System Interfaces of The
      Open Group Base Specifications Issue 6 IEEE Std 1003.1, 2004
      Edition, HTML Version (www.opengroup.org), ISBN 1931624232",
      2004.

[25]  Schaad, J., Kaliski, B., and R. Housley, "Additional Algorithms
      and Identifiers for RSA Cryptography for use in the Internet
      X.509 Public Key Infrastructure Certificate and Certificate
      Revocation List (CRL) Profile", RFC 4055, June 2005.

[26]  National Institute of Standards and Technology, "Cryptographic
      Algorithm Object Registration", URL http://csrc.nist.gov/
      groups/ST/crypto_apps_infra/csor/algorithms.html, November
      2007.

[27]  Adamson, A. and N. Williams, "Remote Procedure Call (RPC)
      Security Version 3", RFC 7861, DOI 10.17487/RFC7861, November
      2016.

[28]  Neuman, C., Yu, T., Hartman, S., and K. Raeburn, "The Kerberos
      Network Authentication Service (V5)", RFC 4120,
      DOI 10.17487/RFC4120, July 2005.

[29]  Arends, R., Austein, R., Larson, M., Massey, D., and S. Rose,
      "DNS Security Introduction and Requirements", RFC 4033,
      DOI 10.17487/RFC4033, March 2005.

[30]  Adamson, A. and N. Williams, "Requirements for NFSv4
      Multi-Domain Namespace Deployment", RFC 8000,
      DOI 10.17487/RFC8000, November 2016.

[31]  Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
      Memory Access Transport for Remote Procedure Call Version 1",
      RFC 8166, DOI 10.17487/RFC8166, June 2017.

[32]  Lever, C., "Network File System (NFS) Upper-Layer Binding to
      RPC-over-RDMA Version 1", RFC 8267, DOI 10.17487/RFC8267,
      October 2017.

23.2.  Informative References

[33]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
      C., Eisler, M., and D. Noveck, "Network File System (NFS)
      version 4 Protocol", RFC 3530, April 2003.
[34] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, June 1995.

[35] Eisler, M., "LIPKEY - A Low Infrastructure Public Key Mechanism Using SPKM", RFC 2847, June 2000.

[36] Eisler, M., "NFS Version 2 and Version 3 Security Issues and the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", RFC 2623, June 1999.

[37] Juszczak, C., "Improving the Performance and Correctness of an NFS Server", USENIX Conference Proceedings, June 1990.

[38] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced by an On-line Database", RFC 3232, January 2002.

[39] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", RFC 1833, August 1995.

[40] Werme, R., "RPC XID Issues", USENIX Conference Proceedings, February 1996.

[41] Nowicki, B., "NFS: Network File System Protocol specification", RFC 1094, March 1989.

[42] Bhide, A., Elnozahy, E., and S. Morgan, "A Highly Available Network Server", USENIX Conference Proceedings, January 1991.

[43] Halevy, B., Welch, B., and J. Zelenka, "Object-Based Parallel NFS (pNFS) Operations", RFC 5664, January 2010.

[44] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS (pNFS) Block/Volume Layout", RFC 5663, January 2010.

[45] Callaghan, B., "WebNFS Client Specification", RFC 2054, October 1996.

[46] Callaghan, B., "WebNFS Server Specification", RFC 2055, October 1996.

[47] IESG, "IESG Processing of RFC Errata for the IETF Stream", July 2008.

[48] Shepler, S., "NFS Version 4 Design Considerations", RFC 2624, June 1999.

[49] The Open Group, "Protocols for Interworking: XNFS, Version 3W, ISBN 1-85912-184-5", February 1998.

[50] Floyd, S. and V. Jacobson, "The Synchronization of Periodic Routing Messages", IEEE/ACM Transactions on Networking 2(2), pp. 122-136, April 1994.
[51] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. Zeidner, "Internet Small Computer Systems Interface (iSCSI)", RFC 3720, April 2004.

[52] Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version (FCP-2)", ANSI/INCITS 350-2003, October 2003.

[53] Weber, R., "Object-Based Storage Device Commands (OSD)", ANSI/INCITS 400-2004, July 2004.

[54] Carns, P., Ligon III, W., Ross, R., and R. Thakur, "PVFS: A Parallel File System for Linux Clusters", Proceedings of the 4th Annual Linux Showcase and Conference, 2000.

[55] The Open Group, "The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2004 Edition", 2004.

[56] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997.

[57] Chiu, A., Eisler, M., and B. Callaghan, "Security Negotiation for WebNFS", RFC 2755, January 2000.

[58] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

[59] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed-Hashing for Message Authentication", RFC 2104, February 1997.

[60] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

[61] Noveck, D., "Rules for NFSv4 Extensions and Minor Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017.

[62] Haynes, T., Ed. and D. Noveck, Ed., "Network File System (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, March 2015.

[63] Noveck, D., Ed., Shivam, P., Lever, C., and B. Baker, "NFSv4.0 Migration: Specification Update", RFC 7931, DOI 10.17487/RFC7931, July 2016.

[64] Haynes, T., "Requirements for Parallel NFS (pNFS) Layout Types", RFC 8434, DOI 10.17487/RFC8434, August 2018.

[65] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May 2014.

[66] Rescorla, E. and B. Korver, "Guidelines for Writing RFC Text on Security Considerations", BCP 72, RFC 3552, DOI 10.17487/RFC3552, July 2003.

Appendix A. Need for this Update

This document includes an explanation of how clients and servers are to determine the particular network access paths to be used to access a file system. This includes describing how changes to the specific replica to be used, or to the set of addresses used to access it, are to be dealt with, and how transfers of responsibility that need to be made can be dealt with transparently. This includes cases in which there is a shift between one replica and another and those in which different network access paths are used to access the same replica.

As a result of the following problems in RFC5661 [60], it is necessary to provide the specific updates which are made by this document. These updates are described in Appendix B.

o  RFC5661 [60], while it dealt with situations in which various forms of clustering allowed co-ordination of the state assigned by co-operating servers, made no provision for Transparent State Migration. Within NFSv4.0, Transparent Migration was first explained clearly in RFC7530 [62] and corrected and clarified by RFC7931 [63]. No corresponding explanation for NFSv4.1 had been provided.

o  Although NFSv4.1 was defined with a clear definition of how trunking detection was to be done, there was no clear specification of how trunking discovery was to be done, despite the fact that the specification clearly indicated that this information could be made available via the file system location attributes.
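The distinction drawn above can be illustrated with a purely informal sketch, not part of the protocol specification: trunking detection compares what a server reports about itself (via EXCHANGE_ID results) on two connections, while trunking discovery obtains additional addresses from a file system location attribute. The record and field names below are simplified stand-ins for the corresponding NFSv4.1 XDR fields, and the comparison shown is a reduced form of the actual trunking rules.

```python
# Illustrative sketch only.  ExchangeIdResult stands in for the
# relevant parts of an EXCHANGE_ID reply; real trunking rules also
# involve so_minor_id and other conditions.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExchangeIdResult:
    clientid: int        # eir_clientid
    so_major_id: bytes   # eir_server_owner.so_major_id
    scope: bytes         # eir_server_scope

def session_trunkable(a: ExchangeIdResult, b: ExchangeIdResult) -> bool:
    # Trunking detection: two connections may belong to the same
    # server when the server reports the same scope, server owner,
    # and client ID on both of them.
    return (a.scope == b.scope
            and a.so_major_id == b.so_major_id
            and a.clientid == b.clientid)

def discovered_addresses(fs_locations_info_entries) -> set:
    # Trunking discovery: a location attribute can list further
    # network addresses usable to reach the file system the client
    # is already accessing.
    return {addr
            for entry in fs_locations_info_entries
            for addr in entry.get("addresses", [])}
```

The point of the sketch is only that detection works from data the client already holds after EXCHANGE_ID, whereas discovery requires fetching location information from the server.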
o  Because the existence of multiple network access paths to the same file system was dealt with as if there were multiple replicas, issues relating to transitions between replicas could never be clearly distinguished from trunking-related transitions between the addresses used to access a particular file system instance. As a result, in situations in which both migration and trunking configuration changes were involved, neither could be dealt with clearly, and the relationship between these two features was not seriously addressed.

o  Because use of two network access paths to the same file system instance (i.e., trunking) was often treated as if two replicas were involved, it was considered that two replicas were being used simultaneously. As a result, the treatment in RFC5661 [60] of replicas being used simultaneously was not clear, as it covered two distinct cases: a single file system instance being accessed via two different network access paths, and two replicas being accessed simultaneously, with the limitations of the latter case not being clearly laid out.

The majority of the consequences of these issues are dealt with by presenting in Section 11 a replacement for Section 11 of RFC5661 [60]. This replacement modifies existing sub-sections within that section and adds new ones, as described in Appendix B.1. Also, some existing sections are deleted. These changes were made in order to:

o  Reorganize the description so that the case of two network access paths to the same file system instance is clearly distinguished from the case of two different replicas since, in the former case, locking state is shared and there can also be sharing of session state.
o  Provide a clear statement regarding the desirability of transparent transfer of state between replicas, together with a recommendation that either that or a single-fs grace period be provided.

o  Specifically delineate how such transfers are to be dealt with by the client, taking into account the differences from the treatment in [63] made necessary by the major protocol changes made in NFSv4.1.

o  Provide discussion of the relationship between transparent state transfer and Parallel NFS (pNFS).

o  Provide clarification of the fs_locations_info attribute in order to specify which portions of the information provided apply to a specific network access path and which to the replica which that path is used to access.

In addition, there are also updates to other sections of RFC5661 [60], where the consequences of the incorrect assumptions underlying the current treatment of multi-server namespace issues also needed to be corrected. These are dealt with as described in Appendices B.2 through B.4.

o  A revised introductory section regarding multi-server namespace facilities is provided.

o  A more realistic treatment of server scope is provided, which reflects the more limited co-ordination of locking state adopted by servers actually sharing a common server scope.

o  Some confusing text regarding changes in server_owner has been clarified.

o  The description of some existing errors has been modified to explain certain error situations more clearly, reflecting the existence of trunking and the possible use of fs-specific grace periods. For details, see Appendix B.3.
o  New descriptions of certain existing operations are provided, either because the existing treatment did not account for situations that would arise in dealing with transparent state migration, or because some types of reclaim issues were not adequately dealt with in the context of fs-specific grace periods. For details, see Appendix B.2.

Appendix B. Changes in this Update

B.1. Revisions Made to Section 11 of [RFC5661]

A number of areas needed to be revised or extended, in many cases replacing existing sub-sections within Section 11 of RFC5661 [60]:

o  New introductory material, including a terminology section, replaces the existing material in RFC5661 [60] ranging from the start of the existing Section 11 up to and including the existing Section 11.1. The new material starts at the beginning of Section 11 and continues through Section 11.2 below.

o  A significant reorganization of the material in the existing Sections 11.4 and 11.5 (of RFC5661 [60]) is necessary. The reasons for the reorganization of these sections into a single section with multiple subsections are discussed in Appendix B.1.1 below. This replacement appears as Section 11.5 below.

   New material relating to the handling of the file system location attributes is contained in Sections 11.5.1 and 11.5.7 below.

o  A new section describing requirements for user and group handling within a multi-server namespace has been added as Section 11.6.

o  A major replacement for the existing Section 11.7 of RFC5661 [60], entitled "Effecting File System Transitions", will appear as Sections 11.8 through 11.13. The reasons for the reorganization of this section into multiple sections are discussed in Appendix B.1.2.
o  A replacement for the existing Section 11.10 of RFC5661 [60], entitled "The Attribute fs_locations_info", will appear as Section 11.16, with Appendix B.1.3 describing the differences between the new section and the treatment within [60]. A revised treatment is necessary because the existing treatment did not make clear how the added attribute information relates to the case of trunked paths to the same replica. These issues were not addressed in RFC5661 [60], where the concepts of a replica and a network path used to access a replica were not clearly distinguished.

B.1.1. Re-organization of Sections 11.4 and 11.5 of [RFC5661]

Previously, issues related to the fact that multiple location entries directed the client to the same file system instance were dealt with in a separate Section 11.5 of RFC5661 [60]. Because of the new treatment of trunking, these issues now belong within Section 11.5 below.

In this new section, trunking is dealt with in Section 11.5.2, together with the other uses of file system location information described in Sections 11.5.3 through 11.5.6.

As a result, Section 11.5, which will replace Section 11.4 of RFC5661 [60], is substantially different from the section it replaces, in that some existing sections will be replaced by corresponding sections below while, at the same time, new sections will be added, resulting in a replacement containing some renumbered sections, as follows:

o  The material in Section 11.5, exclusive of subsections, replaces the material in Section 11.4 of RFC5661 [60] exclusive of subsections.

o  Section 11.5.1 is a new first subsection of the overall section.

o  Section 11.5.2 is a new second subsection of the overall section.
o  Each of the Sections 11.5.4, 11.5.5, and 11.5.6 replaces (in order) one of the corresponding Sections 11.4.1, 11.4.2, and 11.4.3 of RFC5661 [60].

o  Section 11.5.7 is a new final subsection of the overall section.

B.1.2. Re-organization of Material Dealing with File System Transitions

The material relating to file system transitions, previously contained in Section 11.7 of RFC5661 [60], has been reorganized and augmented as described below:

o  Because there can be a shift of the network access paths used to access a file system instance without any shift between replicas, a new Section 11.8 distinguishes between those cases in which there is a shift between distinct replicas and those involving a shift in network access paths with no shift between replicas.

   As a result, a new Section 11.9 deals with network address transitions, while the bulk of the former Section 11.7 (in RFC5661 [60]) is extensively modified, as reflected in Section 11.10, which is now limited to cases in which there is a shift between two different sets of replicas.

o  The additional Section 11.11 discusses the case in which a shift to a different replica is made and state is transferred, allowing the client continued access to its accumulated locking state on the new server.
o  The additional Section 11.12 discusses the client's response to access transitions, how it determines whether migration has occurred, and how it gets access to any transferred locking and session state.

o  The additional Section 11.13 discusses the responsibilities of the source and destination servers when transferring locking and session state.

This re-organization has caused a renumbering of the sections within Section 11 of [60], as described below:

o  The new Sections 11.8 and 11.9 have caused the existing sections with these numbers to be renumbered.

o  Section 11.7 of [60] will be substantially modified and appear as Section 11.10. The necessary modifications reflect the fact that this section will only deal with transitions between replicas, while transitions between network addresses are dealt with in other sections. Details of the reorganization are described later in this section.

o  The additional Sections 11.11, 11.12, and 11.13 have been added.

o  Consequently, Sections 11.8, 11.9, 11.10, and 11.11 in [60] now appear as Sections 11.13, 11.14, 11.15, and 11.16, respectively.

As part of this general re-organization, Section 11.7 of RFC5661 [60] will be modified as described below:

o  Sections 11.7 and 11.7.1 of RFC5661 [60] are to be replaced by Sections 11.10 and 11.10.1, respectively.

o  Section 11.7.2 (and included subsections) of RFC5661 [60] is to be deleted.

o  Sections 11.7.3, 11.7.4, 11.7.5, 11.7.5.1, and 11.7.6 of RFC5661 [60] are to be replaced by Sections 11.10.2, 11.10.3, 11.10.4, 11.10.4.1, and 11.10.5, respectively, in this document.

o  Section 11.7.7 of RFC5661 [60] is to be replaced by Section 11.10.9. This sub-section has been moved to the end of the section dealing with file system transitions.

o  Sections 11.7.8, 11.7.9, and 11.7.10 of RFC5661 [60] are to be replaced by Sections 11.10.6, 11.10.7, and 11.10.8, respectively, in this document.

B.1.3. Updates to Treatment of fs_locations_info

Various elements of the fs_locations_info attribute contain information that applies either to a specific file system replica or to a network path or set of network paths used to access such a replica.
The existing treatment of fs_locations_info (in Section 11.10 of RFC5661 [60]) does not clearly distinguish these cases, in part because the document did not clearly distinguish replicas from the paths used to access them.

In addition, special clarification needed to be provided with regard to the following fields:

o  With regard to the handling of FSLI4GF_GOING, it needs to be made clear that this applies only to the unavailability of a replica, rather than to a path used to access a replica.

o  In describing the appropriate value for a server to use for fli_valid_for, it needs to be made clear that there is no need for the client to fetch the fs_locations_info value frequently in order to be prepared for shifts in trunking patterns.

o  Clarification of the rules for extensions to the fls_info needs to be provided. The existing treatment reflects the extension model in effect at the time RFC5661 [60] was written, and needed to be updated in accordance with the extension model described in RFC8178 [61].

B.2. Revisions Made to Operations in [RFC5661]

Revised descriptions were needed to address issues that arose in effecting necessary changes to multi-server namespace features.

o  The existing treatment of EXCHANGE_ID (in Section 18.35 of RFC5661 [60]) assumes that client IDs cannot be created or confirmed other than by the EXCHANGE_ID and CREATE_SESSION operations. Also, the necessary use of EXCHANGE_ID in recovery from migration and related situations is not addressed clearly. A revised treatment of EXCHANGE_ID is necessary, and it appears in Section 18.35, while the specific differences between it and the treatment within [60] are explained in Appendix B.2.1 below.
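As a rough illustration of the kind of recovery flow the revised EXCHANGE_ID treatment addresses, a client that has followed a migration to a destination server can issue EXCHANGE_ID there and use the reply to decide whether its state was transparently migrated. The sketch below is purely illustrative and not the normative algorithm: the flag value EXCHGID4_FLAG_CONFIRMED_R is taken from the NFSv4.1 XDR, but the surrounding helper (send_exchange_id) and the dictionary shape of its reply are invented for this example.

```python
# Illustrative sketch only; send_exchange_id() and the reply layout
# are hypothetical stand-ins for a real RPC implementation.

EXCHGID4_FLAG_CONFIRMED_R = 0x80000000  # from the NFSv4.1 XDR

def probe_after_migration(send_exchange_id, co_ownerid):
    """Issue EXCHANGE_ID on the destination server's address and
    report whether the server already has a confirmed client ID for
    this client, which suggests state was transparently migrated."""
    res = send_exchange_id(co_ownerid)
    migrated = bool(res["flags"] & EXCHGID4_FLAG_CONFIRMED_R)
    # Either way, the sequence id returned here may be needed by a
    # subsequent CREATE_SESSION, one of the points the revised
    # treatment makes explicit.
    return migrated, res["clientid"], res["sequenceid"]
```

The design point being illustrated is the one the revision makes: the client cannot assume an EXCHANGE_ID reply for an already-known client instance is useless, since its results feed directly into recovery after Transparent State Migration.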
o  The existing treatment of RECLAIM_COMPLETE (in Section 18.51 of RFC5661 [60]) is not sufficiently clear about the purpose and use of the rca_one_fs argument and how the server is to deal with inappropriate values of this argument. Because the resulting confusion raises interoperability issues, a new treatment of RECLAIM_COMPLETE is necessary, and it appears in Section 18.51 below, while the specific differences between it and the treatment within RFC5661 [60] are discussed in Appendix B.2.2 below. In addition, the definitions of the reclaim-related errors receive an updated treatment in Section 15.1.9 to reflect the fact that there are multiple contexts for lock reclaim operations.

B.2.1. Revision to Treatment of EXCHANGE_ID

There are a number of issues in the original treatment of EXCHANGE_ID (in RFC5661 [60]) that cause problems for Transparent State Migration and for the transfer of access between different network access paths to the same file system instance.

These issues arise from the fact that this treatment was written:

o  Assuming that a client ID can only become known to a server by having been created by executing an EXCHANGE_ID, with confirmation of the ID only possible by execution of a CREATE_SESSION.

o  Considering the interactions between a client and a server as only occurring on a single network address.

As these assumptions have become invalid in the context of Transparent State Migration and active use of trunking, the treatment has been modified in several respects.

o  It had been assumed that an EXCHANGE_ID executed when the server is already aware of a given client instance must be either updating associated parameters (e.g., with respect to callbacks) or a lingering retransmission to deal with a previously lost reply.
   As a result, any slot sequence returned by that operation would be of no use. The existing treatment goes so far as to say that it "MUST NOT" be used, although this usage is not in accord with [1].

   This created a difficulty when an EXCHANGE_ID is done after Transparent State Migration, since that slot sequence would need to be used in a subsequent CREATE_SESSION.

   In the updated treatment, CREATE_SESSION remains a way in which client IDs are confirmed, but it is understood that other ways are possible. The slot sequence can be used as needed, and cases in which it would be of no use are appropriately noted.

o  It was assumed that the only functions of EXCHANGE_ID were to inform the server of the client, create the client ID, and communicate it to the client. When multiple simultaneous connections are involved, as often happens when trunking, that treatment was inadequate in that it ignored the role of EXCHANGE_ID in associating the client ID with the connection on which it was done, so that it could be used by a subsequent CREATE_SESSION, whose parameters do not include an explicit client ID.

   The new treatment explicitly discusses the role of EXCHANGE_ID in associating the client ID with the connection, so that it can be used by CREATE_SESSION and in associating a connection with an existing session.

The new treatment can be found in Section 18.35 below. It is intended to supersede the treatment in Section 18.35 of RFC5661 [60]. Publishing a complete replacement for Section 18.35 allows the corrected definition to be read as a whole, in place of the one in RFC5661 [60].

B.2.2. Revision to Treatment of RECLAIM_COMPLETE

The following changes were made to the treatment of RECLAIM_COMPLETE in RFC5661 [60] to arrive at the treatment in Section 18.51:
o  In a number of places, the text is made more explicit about the purpose of rca_one_fs and its connection to file system migration.

o  There is a discussion of situations in which particular forms of RECLAIM_COMPLETE would need to be done.

o  There is a discussion of interoperability issues that result from implementations that may have arisen due to the lack of clarity of the previous treatment of RECLAIM_COMPLETE.

B.3. Revisions Made to Error Definitions in [RFC5661]

The new handling of various situations required revisions of some existing error definitions:

o  Because of the need to appropriately address trunking-related issues, some uses of the term "replica" in RFC5661 [60] have become problematic, since a shift in network access paths was considered to be a shift to a different replica. As a result, the existing definition of NFS4ERR_MOVED (in Section 15.1.2.4 of RFC5661 [60]) needs to be updated to reflect the different handling of unavailability of a particular file system via a specific network address.

   Since such a situation is no longer considered to constitute unavailability of a file system instance, the description needs to change even though the set of circumstances in which it is to be returned remains the same. The new paragraph explicitly recognizes that a different network address might be used, while the previous description, misleadingly, treated this as a shift between two replicas when only a single file system instance might be involved. The updated description appears in Section 15.1.2.4 below.

o  Because of the need to accommodate the use of fs-specific grace periods, it is necessary to clarify some of the definitions of reclaim-related errors in Section 15 of RFC5661 [60], so that the text applies properly to reclaims for all types of grace periods.
   The updated descriptions appear within Section 15.1.9 below.

B.4. Other Revisions Made to [RFC5661]

Besides the major reworking of Section 11 and the associated revisions to existing operations and errors, there are a number of related changes that are necessary:

o  The summary that appeared in Section 1.7.3.3 of RFC5661 [60] was revised to reflect the changes made in the revised Section 11 above. The updated summary appears as Section 1.8.3.3 above.

o  The discussion of server scope which appeared in Section 2.10.4 of RFC5661 [60] needed to be replaced, since the previous text appears to require a level of inter-server co-ordination incompatible with its basic function of avoiding the need for a globally uniform means of assigning server_owner values. A revised treatment appears in Section 2.10.4.

o  The discussion of trunking which appeared in Section 2.10.5 of RFC5661 [60] needed to be revised, to explain more clearly the multiple types of trunking supported and how the client can be made aware of the existing trunking configuration. In addition, while the last paragraph (exclusive of sub-sections) of that section, dealing with server_owner changes, is literally true, it has been a source of confusion. Since the existing paragraph can be read as suggesting that such changes be dealt with non-disruptively, the issue needs to be clarified in the revised section, which appears in Section 2.10.5.

Appendix C. Security Issues that Need to be Addressed

The following issues in the treatment of security within the NFSv4.1 specification need to be addressed:

o  The Security Considerations section of RFC5661 [60] is not written in accord with RFC3552 [66] (also BCP 72). Of particular concern is the fact that the section does not contain a threat analysis.
o  Initial analysis of the existing security issues with NFSv4.1 has made it likely that a revised Security Considerations section for the existing protocol (one containing a threat analysis) would conclude that NFSv4.1 does not meet the goal of secure use on the Internet.

The Security Considerations section of this document (in Section 21) has not been thoroughly revised to correct the difficulties mentioned above. Instead, it has been modified to take proper account of issues related to the multi-server namespace features discussed in Section 11, leaving the incomplete discussion and security weaknesses largely as they were.

The following major security issues need to be addressed in a satisfactory fashion before an updated Security Considerations section can be published as part of a bis document for NFSv4.1:

o  The continued use of AUTH_SYS and the security exposures it creates need to be addressed. Addressing this issue must not be limited to the questions of whether the designation of AUTH_SYS as OPTIONAL was justified and whether it should be changed.

   In any event, it may not be possible, at this point, to correct the security problems created by continued use of AUTH_SYS simply by revising this designation.

o  The lack of attention within the protocol to the possibility of pervasive monitoring attacks such as those described in RFC7258 [65] (also BCP 188).

   In that connection, the use of CREATE_SESSION without privacy protection needs to be addressed, as it exposes the session ID to view by an attacker.
   This is worrisome because it is precisely the type of protocol artifact alluded to in RFC7258, one that can enable further mischief on the part of the attacker, since it enables denial-of-service attacks which can be executed effectively with only a single, normally low-value, credential, even when RPCSEC_GSS authentication is in use.

o  The lack of effective use of privacy and integrity, even where the infrastructure to support use of RPCSEC_GSS is present, needs to be addressed.

   In light of the security exposures that this situation creates, it is not enough to define a protocol that could, with the provision of sufficient resources, address the problem. Instead, what is needed is a way to provide the necessary security with very limited performance costs and without requiring security infrastructure that experience has shown is difficult for many clients and servers to provide.

In trying to provide a major security upgrade for a deployed protocol such as NFSv4.1, the working group and the Internet community are likely to find themselves dealing with a number of considerations such as the following:

o  The need to accommodate existing deployments of existing protocols as specified previously in existing Proposed Standards.

o  The difficulty of effecting changes to existing interoperating implementations.

o  The difficulty of making changes to NFSv4 protocols other than those in the form of OPTIONAL extensions.

o  The tendency of those responsible for existing NFSv4 deployments to ignore security flaws in the context of local area networks, under the mistaken impression that network isolation provides, in and of itself, isolation from all potential attackers.
Given that the difficulties mentioned above apply to minor version zero as well, it may make sense to deal with these security issues in a common document applying to all NFSv4 minor versions. If that approach is taken, the Security Considerations section of an eventual NFSv4.1 bis document would reference that common document, and the defining RFCs for other minor versions might do so as well.

Appendix D. Acknowledgments

D.1. Acknowledgments for this Update

The authors wish to acknowledge the important role of Andy Adamson of Netapp in clarifying the need for trunking discovery functionality, and in exploring the role of the file system location attributes in providing the necessary support.

The authors wish to thank Tom Haynes of Hammerspace for drawing our attention to the fact that internationalization and security might best be handled in documents dealing with such protocol issues as they apply to all NFSv4 minor versions.

The authors also wish to acknowledge the work of Xuan Qi of Oracle with NFSv4.1 client and server prototypes of transparent state migration functionality.

The authors wish to thank others who brought attention to important issues. The comments of Trond Myklebust of Primary Data related to trunking helped to clarify the role of DNS in trunking discovery. Rick Macklem's comments brought attention to problems in the handling of the per-fs version of RECLAIM_COMPLETE.

The authors wish to thank Olga Kornievskaia of Netapp for her helpful review comments.

D.2. Acknowledgments for RFC5661

The initial text for the SECINFO extensions was edited by Mike Eisler with contributions from Peng Dai, Sergey Klyushin, and Carl Burnett.
   The initial text for the SESSIONS extensions was edited by Tom
   Talpey, Spencer Shepler, and Jon Bauman with contributions from
   Charles Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet
   Juszczak, Trond Myklebust, Dave Noveck, John Scott, Mike
   Stolarchuk, and Mark Wittle.

   Initial text relating to multi-server namespace features,
   including the concept of referrals, was contributed by Dave
   Noveck, Carl Burnett, and Charles Fan with contributions from Ted
   Anderson, Neil Brown, and Jon Haswell.

   The initial text for the Directory Delegations support was
   contributed by Saadia Khan with input from Dave Noveck, Mike
   Eisler, Carl Burnett, Ted Anderson, and Tom Talpey.

   The initial text for the ACL explanations was contributed by Sam
   Falkner and Lisa Week.

   The pNFS work was inspired by the NASD and OSD work done by Garth
   Gibson.  Gary Grider has also been a champion of high-performance
   parallel I/O.  Garth Gibson and Peter Corbett started the pNFS
   effort with a problem statement document for the IETF that formed
   the basis for the pNFS work in NFSv4.1.

   The initial text for the parallel NFS support was edited by Brent
   Welch and Garth Goodson.  Additional authors for those documents
   were Benny Halevy, David Black, and Andy Adamson.  Additional
   input came from the informal group that contributed to the
   construction of the initial pNFS drafts; specific acknowledgment
   goes to Gary Grider, Peter Corbett, Dave Noveck, Peter Honeyman,
   and Stephen Fridella.

   Fredric Isaman found several errors in draft versions of the ONC
   RPC XDR description of the NFSv4.1 protocol.

   Audrey Van Belleghem provided, in numerous ways, essential
   coordination and management of the process of editing the
   specification documents.

   Richard Jernigan gave feedback on the file layout's striping
   pattern design.
   Several formal inspection teams were formed to review various
   areas of the protocol.  All the inspections found significant
   errors and room for improvement.  NFSv4.1's inspection teams were:

   o  ACLs, with the following inspectors: Sam Falkner, Bruce Fields,
      Rahul Iyer, Saadia Khan, Dave Noveck, Lisa Week, Mario Wurzl,
      and Alan Yoder.

   o  Sessions, with the following inspectors: William Brown, Tom
      Doeppner, Robert Gordon, Benny Halevy, Fredric Isaman, Rick
      Macklem, Trond Myklebust, Dave Noveck, Karen Rochford, John
      Scott, and Peter Shah.

   o  Initial pNFS inspection, with the following inspectors: Andy
      Adamson, David Black, Mike Eisler, Marc Eshel, Sam Falkner,
      Garth Goodson, Benny Halevy, Rahul Iyer, Trond Myklebust,
      Spencer Shepler, and Lisa Week.

   o  Global namespace, with the following inspectors: Mike Eisler,
      Dan Ellard, Craig Everhart, Fredric Isaman, Trond Myklebust,
      Dave Noveck, Theresa Raj, Spencer Shepler, Renu Tewari, and
      Robert Thurlow.

   o  NFSv4.1 file layout type, with the following inspectors: Andy
      Adamson, Marc Eshel, Sam Falkner, Garth Goodson, Rahul Iyer,
      Trond Myklebust, and Lisa Week.

   o  NFSv4.1 locking and directory delegations, with the following
      inspectors: Mike Eisler, Pranoop Erasani, Robert Gordon, Saadia
      Khan, Eric Kustarz, Dave Noveck, Spencer Shepler, and Amy
      Weaver.

   o  EXCHANGE_ID and DESTROY_CLIENTID, with the following
      inspectors: Mike Eisler, Pranoop Erasani, Robert Gordon, Benny
      Halevy, Fredric Isaman, Saadia Khan, Ricardo Labiaga, Rick
      Macklem, Trond Myklebust, Spencer Shepler, and Brent Welch.
   o  Final pNFS inspection, with the following inspectors: Andy
      Adamson, Mike Eisler, Marc Eshel, Sam Falkner, Jason Glasgow,
      Garth Goodson, Robert Gordon, Benny Halevy, Dean Hildebrand,
      Rahul Iyer, Suchit Kaura, Trond Myklebust, Anatoly Pinchuk,
      Spencer Shepler, Renu Tewari, Lisa Week, and Brent Welch.

   A review team worked together to generate the tables of
   assignments of error sets to operations and make sure that each
   such assignment had two or more people validating it.
   Participating in the process were Andy Adamson, Mike Eisler, Sam
   Falkner, Garth Goodson, Robert Gordon, Trond Myklebust, Dave
   Noveck, Spencer Shepler, Tom Talpey, Amy Weaver, and Lisa Week.

   Jari Arkko, David Black, Scott Bradner, Lisa Dusseault, Lars
   Eggert, Chris Newman, and Tim Polk provided valuable review and
   guidance.

   Olga Kornievskaia found several errors in the SSV specification.

   Ricardo Labiaga found several places where the use of RPCSEC_GSS
   was underspecified.

   Those who provided miscellaneous comments include: Andy Adamson,
   Sunil Bhargo, Alex Burlyga, Pranoop Erasani, Bruce Fields, Vadim
   Finkelstein, Jason Goldschmidt, Vijay K. Gurbani, Sergey Klyushin,
   Ricardo Labiaga, James Lentini, Anshul Madan, Daniel Muntz, Daniel
   Picken, Archana Ramani, Jim Rees, Mahesh Siddheshwar, Tom Talpey,
   and Peter Varga.

Authors' Addresses

   David Noveck (editor)
   NetApp
   1601 Trapelo Road, Suite 16
   Waltham, MA  02451
   USA

   Phone: +1-781-768-5347
   EMail: dnoveck@netapp.com

   Charles Lever
   Oracle Corporation
   1015 Granger Avenue
   Ann Arbor, MI  48104
   United States of America

   Phone: +1 248 614 5091
   EMail: chuck.lever@oracle.com