NFSv4                                                    D. Noveck, Ed.
Internet-Draft                                                    NetApp
Obsoletes: 5661 (if approved)                                   C. Lever
Intended status: Standards Track                                  ORACLE
Expires: July 31, 2020                                  January 28, 2020

      Network File System (NFS) Version 4 Minor Version 1 Protocol
                 draft-ietf-nfsv4-rfc5661sesqui-msns-04

Abstract

   This document describes the Network File System (NFS) version 4
   minor version 1, including features retained from the base protocol
   (NFS version 4 minor version 0, which is specified in RFC 7530) and
   protocol extensions made subsequently.  The later minor version has
   no dependencies on NFS version 4 minor version 0, and is considered
   a separate protocol.

   This document obsoletes RFC5661.  It substantially revises the
   treatment of features relating to multi-server namespace,
   superseding the description of those features appearing in RFC5661.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   This Internet-Draft will expire on July 31, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1.  Introduction
     1.1.  Introduction to this Update
     1.2.  The NFS Version 4 Minor Version 1 Protocol
     1.3.  Requirements Language
     1.4.  Scope of This Document
     1.5.  NFSv4 Goals
     1.6.  NFSv4.1 Goals
     1.7.  General Definitions
     1.8.  Overview of NFSv4.1 Features
     1.9.  Differences from NFSv4.0
   2.  Core Infrastructure
     2.1.  Introduction
     2.2.  RPC and XDR
     2.3.  COMPOUND and CB_COMPOUND
     2.4.  Client Identifiers and Client Owners
     2.5.  Server Owners
     2.6.  Security Service Negotiation
     2.7.  Minor Versioning
     2.8.  Non-RPC-Based Security Services
     2.9.  Transport Layers
     2.10. Session
   3.  Protocol Constants and Data Types
     3.1.  Basic Constants
     3.2.  Basic Data Types
     3.3.  Structured Data Types
   4.  Filehandles
     4.1.  Obtaining the First Filehandle
     4.2.  Filehandle Types
     4.3.  One Method of Constructing a Volatile Filehandle
     4.4.  Client Recovery from Filehandle Expiration
   5.  File Attributes
     5.1.  REQUIRED Attributes
     5.2.  RECOMMENDED Attributes
     5.3.  Named Attributes
     5.4.  Classification of Attributes
     5.5.  Set-Only and Get-Only Attributes
     5.6.  REQUIRED Attributes - List and Definition References
     5.7.  RECOMMENDED Attributes - List and Definition References
     5.8.  Attribute Definitions
     5.9.  Interpreting owner and owner_group
     5.10. Character Case Attributes
     5.11. Directory Notification Attributes
     5.12. pNFS Attribute Definitions
     5.13. Retention Attributes
   6.  Access Control Attributes
     6.1.  Goals
     6.2.  File Attributes Discussion
     6.3.  Common Methods
     6.4.  Requirements
   7.  Single-Server Namespace
     7.1.  Server Exports
     7.2.  Browsing Exports
     7.3.  Server Pseudo File System
     7.4.  Multiple Roots
     7.5.  Filehandle Volatility
     7.6.  Exported Root
     7.7.  Mount Point Crossing
     7.8.  Security Policy and Namespace Presentation
   8.  State Management
     8.1.  Client and Session ID
     8.2.  Stateid Definition
     8.3.  Lease Renewal
     8.4.  Crash Recovery
     8.5.  Server Revocation of Locks
     8.6.  Short and Long Leases
     8.7.  Clocks, Propagation Delay, and Calculating Lease Expiration
     8.8.  Obsolete Locking Infrastructure from NFSv4.0
   9.  File Locking and Share Reservations
     9.1.  Opens and Byte-Range Locks
     9.2.  Lock Ranges
     9.3.  Upgrading and Downgrading Locks
     9.4.  Stateid Seqid Values and Byte-Range Locks
     9.5.  Issues with Multiple Open-Owners
     9.6.  Blocking Locks
     9.7.  Share Reservations
     9.8.  OPEN/CLOSE Operations
     9.9.  Open Upgrade and Downgrade
     9.10. Parallel OPENs
     9.11. Reclaim of Open and Byte-Range Locks
   10. Client-Side Caching
     10.1.  Performance Challenges for Client-Side Caching
     10.2.  Delegation and Callbacks
     10.3.  Data Caching
     10.4.  Open Delegation
     10.5.  Data Caching and Revocation
     10.6.  Attribute Caching
     10.7.  Data and Metadata Caching and Memory Mapped Files
     10.8.  Name and Directory Caching without Directory Delegations
     10.9.  Directory Delegations
   11. Multi-Server Namespace
     11.1.  Terminology
     11.2.  File System Location Attributes
     11.3.  File System Presence or Absence
     11.4.  Getting Attributes for an Absent File System
     11.5.  Uses of File System Location Information
     11.6.  Trunking without File System Location Information
     11.7.  Users and Groups in a Multi-server Namespace
     11.8.  Additional Client-Side Considerations
     11.9.  Overview of File Access Transitions
     11.10. Effecting Network Endpoint Transitions
     11.11. Effecting File System Transitions
     11.12. Transferring State upon Migration
     11.13. Client Responsibilities when Access is Transitioned
     11.14. Server Responsibilities Upon Migration
     11.15. Effecting File System Referrals
     11.16. The Attribute fs_locations
     11.17. The Attribute fs_locations_info
     11.18. The Attribute fs_status
   12. Parallel NFS (pNFS)
     12.1.  Introduction
     12.2.  pNFS Definitions
     12.3.  pNFS Operations
     12.4.  pNFS Attributes
     12.5.  Layout Semantics
     12.6.  pNFS Mechanics
     12.7.  Recovery
     12.8.  Metadata and Storage Device Roles
     12.9.  Security Considerations for pNFS
   13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type
     13.1.  Client ID and Session Considerations
     13.2.  File Layout Definitions
     13.3.  File Layout Data Types
     13.4.  Interpreting the File Layout
     13.5.  Data Server Multipathing
     13.6.  Operations Sent to NFSv4.1 Data Servers
     13.7.  COMMIT through Metadata Server
     13.8.  The Layout Iomode
     13.9.  Metadata and Data Server State Coordination
     13.10. Data Server Component File Size
     13.11. Layout Revocation and Fencing
     13.12. Security Considerations for the File Layout Type
   14. Internationalization
     14.1.  Stringprep Profile for the utf8str_cs Type
     14.2.  Stringprep Profile for the utf8str_cis Type
     14.3.  Stringprep Profile for the utf8str_mixed Type
     14.4.  UTF-8 Capabilities
     14.5.  UTF-8 Related Errors
   15. Error Values
     15.1.  Error Definitions
     15.2.  Operations and Their Valid Errors
     15.3.  Callback Operations and Their Valid Errors
     15.4.  Errors and the Operations That Use Them
   16. NFSv4.1 Procedures
     16.1.  Procedure 0: NULL - No Operation
     16.2.  Procedure 1: COMPOUND - Compound Operations
   17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL
   18. NFSv4.1 Operations
     18.1.  Operation 3: ACCESS - Check Access Rights
     18.2.  Operation 4: CLOSE - Close File
     18.3.  Operation 5: COMMIT - Commit Cached Data
     18.4.  Operation 6: CREATE - Create a Non-Regular File Object
     18.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting
            Recovery
     18.6.  Operation 8: DELEGRETURN - Return Delegation
     18.7.  Operation 9: GETATTR - Get Attributes
     18.8.  Operation 10: GETFH - Get Current Filehandle
     18.9.  Operation 11: LINK - Create Link to a File
     18.10. Operation 12: LOCK - Create Lock
     18.11. Operation 13: LOCKT - Test for Lock
     18.12. Operation 14: LOCKU - Unlock File
     18.13. Operation 15: LOOKUP - Lookup Filename
     18.14. Operation 16: LOOKUPP - Lookup Parent Directory
     18.15. Operation 17: NVERIFY - Verify Difference in Attributes
     18.16. Operation 18: OPEN - Open a Regular File
     18.17. Operation 19: OPENATTR - Open Named Attribute Directory
     18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
     18.19. Operation 22: PUTFH - Set Current Filehandle
     18.20. Operation 23: PUTPUBFH - Set Public Filehandle
     18.21. Operation 24: PUTROOTFH - Set Root Filehandle
     18.22. Operation 25: READ - Read from File
     18.23. Operation 26: READDIR - Read Directory
     18.24. Operation 27: READLINK - Read Symbolic Link
     18.25. Operation 28: REMOVE - Remove File System Object
     18.26. Operation 29: RENAME - Rename Directory Entry
     18.27. Operation 31: RESTOREFH - Restore Saved Filehandle
     18.28. Operation 32: SAVEFH - Save Current Filehandle
     18.29. Operation 33: SECINFO - Obtain Available Security
     18.30. Operation 34: SETATTR - Set Attributes
     18.31. Operation 37: VERIFY - Verify Same Attributes
     18.32. Operation 38: WRITE - Write to File
     18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control
     18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection
            with Session
     18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID
     18.36. Operation 43: CREATE_SESSION - Create New Session and
            Confirm Client ID
     18.37. Operation 44: DESTROY_SESSION - Destroy a Session
     18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks
     18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory
            Delegation
     18.40. Operation 47: GETDEVICEINFO - Get Device Information
     18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for
            a File System
     18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a
            Layout
     18.43. Operation 50: LAYOUTGET - Get Layout Information
     18.44. Operation 51: LAYOUTRETURN - Release Layout Information
     18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed
            Object
     18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing
            and Control
     18.47. Operation 54: SET_SSV - Update SSV for a Client ID
     18.48. Operation 55: TEST_STATEID - Test Stateids for Validity
     18.49. Operation 56: WANT_DELEGATION - Request Delegation
     18.50. Operation 57: DESTROY_CLIENTID - Destroy a Client ID
     18.51. Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
            Finished
     18.52. Operation 10044: ILLEGAL - Illegal Operation
   19. NFSv4.1 Callback Procedures
     19.1.  Procedure 0: CB_NULL - No Operation
     19.2.  Procedure 1: CB_COMPOUND - Compound Operations
   20. NFSv4.1 Callback Operations
     20.1.  Operation 3: CB_GETATTR - Get Attributes
     20.2.  Operation 4: CB_RECALL - Recall a Delegation
     20.3.  Operation 5: CB_LAYOUTRECALL - Recall Layout from Client
     20.4.  Operation 6: CB_NOTIFY - Notify Client of Directory
            Changes
     20.5.  Operation 7: CB_PUSH_DELEG - Offer Previously Requested
            Delegation to Client
     20.6.  Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects
     20.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources
            for Recallable Objects
     20.8.  Operation 10: CB_RECALL_SLOT - Change Flow Control Limits
     20.9.  Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing
            and Control
     20.10. Operation 12: CB_WANTS_CANCELLED - Cancel Pending
            Delegation Wants
     20.11. Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible
            Lock Availability
     20.12. Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device
            ID Changes
     20.13. Operation 10044: CB_ILLEGAL - Illegal Callback Operation
   21. Security Considerations
   22. IANA Considerations
     22.1.  IANA Actions Needed
     22.2.  Named Attribute Definitions
     22.3.  Device ID Notifications
     22.4.  Object Recall Types
     22.5.  Layout Types
     22.6.  Path Variable Definitions
   23. References
     23.1.  Normative References
     23.2.  Informative References
   Appendix A.  Need for this Update
   Appendix B.  Changes in this Update
     B.1.  Revisions Made to Section 11 of RFC5661
     B.2.  Revisions Made to Operations in RFC5661
     B.3.  Revisions Made to Error Definitions in RFC5661
     B.4.  Other Revisions Made to RFC5661
   Appendix C.  Security Issues that Need to be Addressed
   Appendix D.  Acknowledgments
   Authors' Addresses

1.  Introduction

1.1.  Introduction to this Update

   Two important features previously defined in minor version 0 but
   never fully addressed in minor version 1 are trunking, the
   simultaneous use of multiple connections between a client and
   server, potentially to different network addresses, and transparent
   state migration, which allows a file system to be transferred
   between servers in a way that allows the client to maintain its
   existing locking state across the transfer.
   The revised description of the NFS version 4 minor version 1
   (NFSv4.1) protocol presented in this update is necessary to enable
   full use of these features together with other multi-server
   namespace features.  This document is in the form of an updated
   description of the NFSv4.1 protocol previously defined in RFC5661
   [65].  RFC5661 is obsoleted by this document.  However, the update
   has a limited scope and is focused on enabling full use of trunking
   and transparent state migration.  The need for these changes is
   discussed in Appendix A.  Appendix B describes the specific changes
   made to arrive at the current text.

   This limited-scope update replaces the current NFSv4.1 RFC with the
   intention of providing an authoritative and complete specification,
   the motivation for which is discussed in [35], addressing the issues
   within the scope of the update.  However, it will not address issues
   that are known but outside of this limited scope, as could be
   expected from a full update of the protocol.  Below are some areas
   that are known to need addressing in a future update of the
   protocol.

   o  Work needs to be done with regard to RFC8178 [66], which
      establishes NFSv4-wide versioning rules.  As RFC5661 is currently
      inconsistent with that document, changes are needed in order to
      arrive at a situation in which there would be no need for RFC8178
      to update the NFSv4.1 specification.

   o  Work needs to be done with regard to RFC8434 [69], which
      establishes the requirements for pNFS layout types, which are not
      clearly defined in RFC5661.  When that work is done and the
      resulting documents approved, the new NFSv4.1 specification
      document will provide a clear set of requirements for layout
      types and a description of the file layout type that conforms to
      those requirements.  Other layout types will have their own
      specification documents that conform to those requirements as
      well.
   o  Work needs to be done to address many errata reports relevant to
      RFC 5661, other than errata report 2006 [63], which is addressed
      in this document.  Addressing that report was not deferrable
      because of the interaction of the changes suggested there and the
      newly described handling of state and session migration.

      The errata reports that have been deferred and that will need to
      be addressed in a later document include reports currently
      assigned a range of statuses in the errata reporting system,
      including reports marked Accepted and those marked Hold For
      Document Update because the change was too minor to address
      immediately.

      In addition, there is a set of other reports, including at least
      one in state Rejected, that will need to be addressed in a later
      document.  This will involve making changes to consensus
      decisions reflected in RFC 5661, in situations in which the
      working group has decided that the treatment in RFC 5661 is
      incorrect and needs to be revised to reflect the working group's
      new consensus and to ensure compatibility with existing
      implementations that do not follow the handling described in
      RFC 5661.

      Note that all such errata reports are expected to remain relevant
      to implementers and the authors of an eventual rfc5661bis,
      despite the fact that this document, when approved, will obsolete
      RFC 5661 [65].

   o  There is a need for a new approach to the description of
      internationalization since the current internationalization
      section (Section 14) has never been implemented and does not meet
      the needs of the NFSv4 protocol.  Possible solutions are to
      create a new internationalization section modeled on that in [67]
      or to create a new document describing internationalization for
      all NFSv4 minor versions and reference that document in the RFCs
      defining both NFSv4.0 and NFSv4.1.
409 o There is a need for a revised treatment of security in NFSv4.1. 410 The issues with the existing treatment are discussed in 411 Appendix C. 413 Until the above work is done, there will not be a consistent set of 414 documents providing a description of the NFSv4.1 protocol and any 415 full description would involve documents updating other documents 416 within the specification. The updates applied by RFC8434 [69] and 417 RFC8178 [66] to RFC5661 also apply to this specification, and will 418 apply to any subsequent v4.1 specification until that work is done. 420 1.2. The NFS Version 4 Minor Version 1 Protocol 422 The NFS version 4 minor version 1 (NFSv4.1) protocol is the second 423 minor version of the NFS version 4 (NFSv4) protocol. The first minor 424 version, NFSv4.0, is now described in RFC 7530 [67]. It generally 425 follows the guidelines for minor versioning that are listed in 426 Section 10 of RFC 3530. However, it diverges from guidelines 11 ("a 427 client and server that support minor version X must support minor 428 versions 0 through X-1") and 12 ("no new features may be introduced 429 as mandatory in a minor version"). These divergences are due to the 430 introduction of the sessions model for managing non-idempotent 431 operations and the RECLAIM_COMPLETE operation. These two new 432 features are infrastructural in nature and simplify implementation of 433 existing and other new features. Making them anything but REQUIRED 434 would add undue complexity to protocol definition and implementation. 435 NFSv4.1 accordingly updates the minor versioning guidelines 436 (Section 2.7). 438 As a minor version, NFSv4.1 is consistent with the overall goals for 439 NFSv4, but extends the protocol so as to better meet those goals, 440 based on experiences with NFSv4.0. In addition, NFSv4.1 has adopted 441 some additional goals, which motivate some of the major extensions in 442 NFSv4.1. 444 1.3. 
Requirements Language 446 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 447 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 448 document are to be interpreted as described in RFC 2119 [1]. 450 1.4. Scope of This Document 452 This document describes the NFSv4.1 protocol. With respect to 453 NFSv4.0, this document does not: 455 o describe the NFSv4.0 protocol, except where needed to contrast 456 with NFSv4.1. 458 o modify the specification of the NFSv4.0 protocol. 460 o clarify the NFSv4.0 protocol. 462 1.5. NFSv4 Goals 464 The NFSv4 protocol is a further revision of the NFS protocol defined 465 already by NFSv3 [37]. It retains the essential characteristics of 466 previous versions: easy recovery; independence of transport 467 protocols, operating systems, and file systems; simplicity; and good 468 performance. NFSv4 has the following goals: 470 o Improved access and good performance on the Internet 472 The protocol is designed to transit firewalls easily, perform well 473 where latency is high and bandwidth is low, and scale to very 474 large numbers of clients per server. 476 o Strong security with negotiation built into the protocol 478 The protocol builds on the work of the ONCRPC working group in 479 supporting the RPCSEC_GSS protocol. Additionally, the NFSv4.1 480 protocol provides a mechanism to allow clients and servers the 481 ability to negotiate security and require clients and servers to 482 support a minimal set of security schemes. 484 o Good cross-platform interoperability 486 The protocol features a file system model that provides a useful, 487 common set of features that does not unduly favor one file system 488 or operating system over another. 490 o Designed for protocol extensions 492 The protocol is designed to accept standard extensions within a 493 framework that enables and encourages backward compatibility. 495 1.6. 
NFSv4.1 Goals 497 NFSv4.1 has the following goals, within the framework established by 498 the overall NFSv4 goals. 500 o To correct significant structural weaknesses and oversights 501 discovered in the base protocol. 503 o To add clarity and specificity to areas left unaddressed or not 504 addressed in sufficient detail in the base protocol. However, as 505 stated in Section 1.4, it is not a goal to clarify the NFSv4.0 506 protocol in the NFSv4.1 specification. 508 o To add specific features based on experience with the existing 509 protocol and recent industry developments. 511 o To provide protocol support to take advantage of clustered server 512 deployments including the ability to provide scalable parallel 513 access to files distributed among multiple servers. 515 1.7. General Definitions 517 The following definitions provide an appropriate context for the 518 reader. 520 Byte: In this document, a byte is an octet, i.e., a datum exactly 8 521 bits in length. 523 Client: The client is the entity that accesses the NFS server's 524 resources. The client may be an application that contains the 525 logic to access the NFS server directly. The client may also be 526 the traditional operating system client that provides remote file 527 system services for a set of applications. 529 A client is uniquely identified by a client owner. 531 With reference to byte-range locking, the client is also the 532 entity that maintains a set of locks on behalf of one or more 533 applications. This client is responsible for crash or failure 534 recovery for those locks it manages. 536 Note that multiple clients may share the same transport and 537 connection and multiple clients may exist on the same network 538 node. 540 Client ID: The client ID is a 64-bit quantity used as a unique, 541 short-hand reference to a client-supplied verifier and client 542 owner. The server is responsible for supplying the client ID. 
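The shorthand relationship just described (a client ID as a server-supplied, 64-bit reference to a client-supplied verifier and client owner) can be sketched non-normatively. This Python sketch is illustrative only and not part of the protocol; all names in it are invented:

```python
import itertools

class ClientIdTable:
    """Illustrative (non-normative) server-side map from a client owner
    (co_ownerid, co_verifier) to a 64-bit client ID."""

    def __init__(self):
        self._next_id = itertools.count(1)   # the server supplies client IDs
        self._by_owner = {}                  # co_ownerid -> (verifier, id)

    def establish(self, co_ownerid, co_verifier):
        prior = self._by_owner.get(co_ownerid)
        if prior is not None and prior[0] == co_verifier:
            return prior[1]                  # same incarnation, same ID
        # New client, or a new incarnation of a known client: the server
        # supplies a fresh 64-bit client ID (for a new incarnation it
        # would also release the previous incarnation's leased state).
        client_id = next(self._next_id) & 0xFFFFFFFFFFFFFFFF
        self._by_owner[co_ownerid] = (co_verifier, client_id)
        return client_id
```

The sketch shows only the shorthand property: a changed verifier (a client restart) yields a different client ID, while a repeated (owner, verifier) pair maps back to the same one.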
544 Client Owner: The client owner is a unique string, opaque to the 545 server, that identifies a client. Multiple network connections 546 and source network addresses originating from those connections 547 may share a client owner. The server is expected to treat 548 requests from connections with the same client owner as coming 549 from the same client. 551 File System: The file system is the collection of objects on a 552 server (as identified by the major identifier of a server owner, 553 which is defined later in this section) that share the same fsid 554 attribute (see Section 5.8.1.9). 556 Lease: A lease is an interval of time defined by the server for 557 which the client is irrevocably granted locks. At the end of a 558 lease period, locks may be revoked if the lease has not been 559 extended. A lock must be revoked if a conflicting lock has been 560 granted after the lease interval. 562 A server grants a client a single lease for all state. 564 Lock: The term "lock" is used to refer to byte-range (in UNIX 565 environments, also known as record) locks, share reservations, 566 delegations, or layouts unless specifically stated otherwise. 568 Secret State Verifier (SSV): The SSV is a unique secret key shared 569 between a client and server. The SSV serves as the secret key for 570 an internal (that is, internal to NFSv4.1) Generic Security 571 Services (GSS) mechanism (the SSV GSS mechanism; see 572 Section 2.10.9). The SSV GSS mechanism uses the SSV to compute 573 message integrity code (MIC) and Wrap tokens. See 574 Section 2.10.8.3 for more details on how NFSv4.1 uses the SSV and 575 the SSV GSS mechanism. 577 Server: The Server is the entity responsible for coordinating client 578 access to a set of file systems and is identified by a server 579 owner. A server can span multiple network addresses. 581 Server Owner: The server owner identifies the server to the client. 582 The server owner consists of a major identifier and a minor 583 identifier. 
When the client has two connections each to a peer 584 with the same major identifier, the client assumes that both peers 585 are the same server (the server namespace is the same via each 586 connection) and that lock state is sharable across both 587 connections. When each peer has both the same major and minor 588 identifiers, the client assumes that each connection might be 589 associable with the same session. 591 Stable Storage: Stable storage is storage from which data stored by 592 an NFSv4.1 server can be recovered without data loss from multiple 593 power failures (including cascading power failures, that is, 594 several power failures in quick succession), operating system 595 failures, and/or hardware failure of components other than the 596 storage medium itself (such as disk, nonvolatile RAM, flash 597 memory, etc.). 599 Some examples of stable storage that are allowable for an NFS 600 server include: 602 1. Media commit of data; that is, the modified data has been 603 successfully written to the disk media, for example, the disk 604 platter. 606 2. An immediate reply disk drive with battery-backed, on-drive 607 intermediate storage or uninterruptible power system (UPS). 609 3. Server commit of data with battery-backed intermediate storage 610 and recovery software. 612 4. Cache commit with uninterruptible power system (UPS) and 613 recovery software. 615 Stateid: A stateid is a 128-bit quantity returned by a server that 616 uniquely defines the open and locking states provided by the 617 server for a specific open-owner or lock-owner/open-owner pair for 618 a specific file and type of lock. 620 Verifier: A verifier is a 64-bit quantity generated by the client 621 that the server can use to determine if the client has restarted 622 and lost all previous lock state. 624 1.8. Overview of NFSv4.1 Features 626 The major features of the NFSv4.1 protocol will be reviewed in brief. 
627 This will be done to provide an appropriate context for both the 628 reader who is familiar with the previous versions of the NFS protocol 629 and the reader who is new to the NFS protocols. For the reader new 630 to the NFS protocols, there is still a set of fundamental knowledge 631 that is expected. The reader should be familiar with the External 632 Data Representation (XDR) and Remote Procedure Call (RPC) protocols 633 as described in [2] and [3]. A basic knowledge of file systems and 634 distributed file systems is expected as well. 636 In general, this specification of NFSv4.1 will not distinguish those 637 features added in minor version 1 from those present in the base 638 protocol but will treat NFSv4.1 as a unified whole. See Section 1.9 639 for a summary of the differences between NFSv4.0 and NFSv4.1. 641 1.8.1. RPC and Security 643 As with previous versions of NFS, the External Data Representation 644 (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1 645 protocol are those defined in [2] and [3]. To meet end-to-end 646 security requirements, the RPCSEC_GSS framework [4] is used to extend 647 the basic RPC security. With the use of RPCSEC_GSS, various 648 mechanisms can be provided to offer authentication, integrity, and 649 privacy to the NFSv4 protocol. Kerberos V5 is used as described in 650 [5] to provide one security framework. With the use of RPCSEC_GSS, 651 other mechanisms may also be specified and used for NFSv4.1 security. 653 To enable in-band security negotiation, the NFSv4.1 protocol has 654 operations that provide the client a method of querying the server 655 about its policies regarding which security mechanisms must be used 656 for access to the server's file system resources. With this, the 657 client can securely match the security mechanism that meets the 658 policies specified at both the client and server. 660 NFSv4.1 introduces parallel access (see Section 1.8.2.2), which is 661 called pNFS. 
The security framework described in this section is 662 significantly modified by the introduction of pNFS (see 663 Section 12.9), because data access is sometimes not over RPC. The 664 level of significance varies with the storage protocol (see 665 Section 12.2.5) and can be as low as zero impact (see Section 13.12). 667 1.8.2. Protocol Structure 668 1.8.2.1. Core Protocol 670 Unlike NFSv3, which used a series of ancillary protocols (e.g., NLM, 671 NSM (Network Status Monitor), MOUNT), within all minor versions of 672 NFSv4 a single RPC protocol is used to make requests to the server. 673 Facilities that had been separate protocols, such as locking, are now 674 integrated within a single unified protocol. 676 1.8.2.2. Parallel Access 678 Minor version 1 supports high-performance data access to a clustered 679 server implementation by enabling a separation of metadata access and 680 data access, with the latter done to multiple servers in parallel. 682 Such parallel data access is controlled by recallable objects known 683 as "layouts", which are integrated into the protocol locking model. 684 Clients direct requests for data access to a set of data servers 685 specified by the layout via a data storage protocol which may be 686 NFSv4.1 or may be another protocol. 688 Because the protocols used for parallel data access are not 689 necessarily RPC-based, the RPC-based security model (Section 1.8.1) 690 is obviously impacted (see Section 12.9). The degree of impact 691 varies with the storage protocol (see Section 12.2.5) used for data 692 access, and can be as low as zero (see Section 13.12). 694 1.8.3. File System Model 696 The general file system model used for the NFSv4.1 protocol is the 697 same as previous versions. The server file system is hierarchical 698 with the regular files contained within being treated as opaque byte 699 streams. In a slight departure, file and directory names are encoded 700 with UTF-8 to deal with the basics of internationalization. 
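The UTF-8 encoding of file and directory names mentioned above can be illustrated with a minimal, non-normative sketch; the function names are invented, not part of any NFSv4.1 API:

```python
# Non-normative sketch: NFSv4.1 carries file and directory names as
# UTF-8 octet strings.  A client implementation might convert between
# its native string type and the wire form at its API boundary.
def name_to_wire(name: str) -> bytes:
    """Encode one component name as UTF-8 for the wire."""
    return name.encode("utf-8")

def name_from_wire(octets: bytes) -> str:
    """Decode a wire name; octets that are not valid UTF-8 raise here."""
    return octets.decode("utf-8")
```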
702 The NFSv4.1 protocol does not require a separate protocol to provide 703 for the initial mapping between path name and filehandle. All file 704 systems exported by a server are presented as a tree so that all file 705 systems are reachable from a special per-server global root 706 filehandle. This allows LOOKUP operations to be used to perform 707 functions previously provided by the MOUNT protocol. The server 708 provides any necessary pseudo file systems to bridge gaps that 709 arise where unexported portions of the namespace lie between exported file systems. 711 1.8.3.1. Filehandles 713 As in previous versions of the NFS protocol, opaque filehandles are 714 used to identify individual files and directories. Lookup-type and 715 create operations translate file and directory names to filehandles, 716 which are then used to identify objects in subsequent operations. 718 The NFSv4.1 protocol provides support for persistent filehandles, 719 guaranteed to be valid for the lifetime of the file system object 720 designated. In addition, it allows servers to provide 721 filehandles with more limited validity guarantees, called volatile 722 filehandles. 724 1.8.3.2. File Attributes 726 The NFSv4.1 protocol has a rich and extensible file object attribute 727 structure, which is divided into REQUIRED, RECOMMENDED, and named 728 attributes (see Section 5). 730 Several (but not all) of the REQUIRED attributes are derived from the 731 attributes of NFSv3 (see the definition of the fattr3 data type in 732 [37]). An example of a REQUIRED attribute is the file object's type 733 (Section 5.8.1.2) so that regular files can be distinguished from 734 directories (also known as folders in some operating environments) 735 and other types of objects. REQUIRED attributes are discussed in 736 Section 5.1. 738 Three examples of RECOMMENDED attributes are acl, sacl, and dacl. 739 These attributes define an Access Control List (ACL) on a file object 740 (Section 6).
An ACL provides directory and file access control 741 beyond the model used in NFSv3. The ACL definition allows for 742 specification of specific sets of permissions for individual users 743 and groups. In addition, ACL inheritance allows propagation of 744 access permissions and restrictions down a directory tree as file 745 system objects are created. RECOMMENDED attributes are discussed in 746 Section 5.2. 748 A named attribute is an opaque byte stream that is associated with a 749 directory or file and referred to by a string name. Named attributes 750 are meant to be used by client applications as a method to associate 751 application-specific data with a regular file or directory. NFSv4.1 752 modifies named attributes relative to NFSv4.0 by tightening the 753 allowed operations in order to prevent the development of non- 754 interoperable implementations. Named attributes are discussed in 755 Section 5.3. 757 1.8.3.3. Multi-Server Namespace 759 NFSv4.1 contains a number of features to allow implementation of 760 namespaces that cross server boundaries and that allow and facilitate 761 a non-disruptive transfer of support for individual file systems 762 between servers. They are all based upon attributes that allow one 763 file system to specify alternate, additional, and new location 764 information that specifies how the client may access that file 765 system. 767 These attributes can be used to provide for individual active file 768 systems: 770 o Alternate network addresses to access the current file system 771 instance. 773 o The locations of alternate file system instances or replicas to be 774 used in the event that the current file system instance becomes 775 unavailable. 
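The per-file-system location information described in the two items above (alternate addresses for the current instance, plus alternate instances or replicas) might be modeled as in the following non-normative sketch. The field and class names here are invented for illustration and are not the fs_locations_info XDR; the addresses use the documentation range 203.0.113.0/24:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FsLocation:
    """One way to reach a file system instance (illustrative only)."""
    server_addresses: List[str]   # network addresses for this instance
    root_path: List[str]          # pathname components on that server

@dataclass
class FsLocationInfo:
    """Illustrative location data for one file system: the current
    instance plus alternates/replicas for failover or migration."""
    current: FsLocation
    alternates: List[FsLocation] = field(default_factory=list)

    def candidates(self):
        # Prefer the current instance; fall back to replicas when it
        # becomes unavailable.
        return [self.current] + self.alternates
```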
777 These file system location attributes may be used together with the 778 concept of absent file systems, in which a position in the server 779 namespace is associated with locations on other servers without there 780 being any corresponding file system instance on the current server. 781 For example, 783 o These attributes may be used with absent file systems to implement 784 referrals whereby one server may direct the client to a file 785 system provided by another server. This allows extensive multi- 786 server namespaces to be constructed. 788 o These attributes may be provided when a previously present file 789 system becomes absent. This allows non-disruptive migration of 790 file systems to alternate servers. 792 1.8.4. Locking Facilities 794 As mentioned previously, NFSv4.1 is a single protocol that includes 795 locking facilities. These locking facilities include support for 796 many types of locks including a number of sorts of recallable locks. 797 Recallable locks such as delegations allow the client to be assured 798 that certain events will not occur so long as that lock is held. 799 When circumstances change, the lock is recalled via a callback 800 request. The assurances provided by delegations allow more extensive 801 caching to be done safely when circumstances allow it. 803 The types of locks are: 805 o Share reservations as established by OPEN operations. 807 o Byte-range locks. 809 o File delegations, which are recallable locks that assure the 810 holder that inconsistent opens and file changes cannot occur so 811 long as the delegation is held. 813 o Directory delegations, which are recallable locks that assure the 814 holder that inconsistent directory modifications cannot occur so 815 long as the delegation is held. 
817 o Layouts, which are recallable objects that assure the holder that 818 direct access to the file data may be performed directly by the 819 client and that no change to the data's location that is 820 inconsistent with that access may be made so long as the layout is 821 held. 823 All locks for a given client are tied together under a single client- 824 wide lease. All requests made on sessions associated with the client 825 renew that lease. When the client's lease is not promptly renewed, 826 the client's locks are subject to revocation. In the event of server 827 restart, clients have the opportunity to safely reclaim their locks 828 within a special grace period. 830 1.9. Differences from NFSv4.0 832 The following summarizes the major differences between minor version 833 1 and the base protocol: 835 o Implementation of the sessions model (Section 2.10). 837 o Parallel access to data (Section 12). 839 o Addition of the RECLAIM_COMPLETE operation to better structure the 840 lock reclamation process (Section 18.51). 842 o Enhanced delegation support as follows. 844 * Delegations on directories and other file types in addition to 845 regular files (Section 18.39, Section 18.49). 847 * Operations to optimize acquisition of recalled or denied 848 delegations (Section 18.49, Section 20.5, Section 20.7). 850 * Notifications of changes to files and directories 851 (Section 18.39, Section 20.4). 853 * A method to allow a server to indicate that it is recalling one 854 or more delegations for resource management reasons, and thus a 855 method to allow the client to pick which delegations to return 856 (Section 20.6). 858 o Attributes can be set atomically during exclusive file create via 859 the OPEN operation (see the new EXCLUSIVE4_1 creation method in 860 Section 18.16). 
862 o Open files can be preserved if removed and the hard link count 863 ("hard link" is defined in an Open Group [6] standard) goes to 864 zero, thus obviating the need for clients to rename deleted files 865 to partially hidden names -- colloquially called "silly rename" 866 (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in 867 Section 18.16). 869 o Improved compatibility with Microsoft Windows for Access Control 870 Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2). 872 o Data retention (Section 5.13). 874 o Identification of the implementation of the NFS client and server 875 (Section 18.35). 877 o Support for notification of the availability of byte-range locks 878 (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in 879 Section 18.16 and see Section 20.11). 881 o In NFSv4.1, LIPKEY and SPKM-3 are not required security mechanisms 882 [38]. 884 2. Core Infrastructure 886 2.1. Introduction 888 NFSv4.1 relies on core infrastructure common to nearly every 889 operation. This core infrastructure is described in the remainder of 890 this section. 892 2.2. RPC and XDR 894 The NFSv4.1 protocol is a Remote Procedure Call (RPC) application 895 that uses RPC version 2 and the corresponding eXternal Data 896 Representation (XDR) as defined in [3] and [2]. 898 2.2.1. RPC-Based Security 900 Previous NFS versions have been thought of as having a host-based 901 authentication model, where the NFS server authenticates the NFS 902 client, and trusts the client to authenticate all users. Actually, 903 NFS has always depended on RPC for authentication. One of the first 904 forms of RPC authentication, AUTH_SYS, had no strong authentication 905 and required a host-based authentication approach. NFSv4.1 also 906 depends on RPC for basic security services and mandates RPC support 907 for a user-based authentication model. The user-based authentication 908 model has user principals authenticated by a server, and in turn the 909 server authenticated by user principals. 
RPC provides some basic 910 security services that are used by NFSv4.1. 912 2.2.1.1. RPC Security Flavors 914 As described in Section 7.2 ("Authentication") of [3], RPC security 915 is encapsulated in the RPC header, via a security or authentication 916 flavor, and information specific to the specified security flavor. 917 Every RPC header conveys information used to identify and 918 authenticate a client and server. As discussed in Section 2.2.1.1.1, 919 some security flavors provide additional security services. 921 NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This 922 requirement to implement is not a requirement to use.) Other 923 flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well. 925 2.2.1.1.1. RPCSEC_GSS and Security Services 927 RPCSEC_GSS [4] uses the functionality of GSS-API [7]. This allows 928 for the use of various security mechanisms by the RPC layer without 929 the additional implementation overhead of adding RPC security 930 flavors. 932 2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy 934 Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate 935 users on clients to servers, and servers to users. It can also 936 perform integrity checking on the entire RPC message, including the 937 RPC header, and on the arguments or results. Finally, privacy, 938 usually via encryption, is a service available with RPCSEC_GSS. 939 Privacy is performed on the arguments and results. Note that if 940 privacy is selected, integrity, authentication, and identification 941 are enabled. If privacy is not selected, but integrity is selected, 942 authentication and identification are enabled. If integrity and 943 privacy are not selected, but authentication is enabled, 944 identification is enabled. RPCSEC_GSS does not provide 945 identification as a separate service. 
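The implications among the RPCSEC_GSS services described above (privacy enables integrity, which enables authentication, which enables identification) can be expressed as a small, non-normative helper; the Python names are invented:

```python
# Non-normative sketch of the service implications in the text:
# selecting a stronger RPCSEC_GSS service enables all weaker ones.
# Identification is never provided as a separate service.
SERVICE_ORDER = ["identification", "authentication", "integrity", "privacy"]

def enabled_services(selected: str):
    """Return every service enabled when `selected` is chosen."""
    idx = SERVICE_ORDER.index(selected)
    return set(SERVICE_ORDER[: idx + 1])
```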
947 Although GSS-API has an authentication service distinct from its 948 privacy and integrity services, GSS-API's authentication service is 949 not used for RPCSEC_GSS's authentication service. Instead, each RPC 950 request and response header is integrity protected with the GSS-API 951 integrity service, and this allows RPCSEC_GSS to offer per-RPC 952 authentication and identity. See [4] for more information. 954 NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and 955 authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's 956 privacy service. NFSv4.1 clients SHOULD support RPCSEC_GSS's privacy 957 service. 959 2.2.1.1.1.2. Security Mechanisms for NFSv4.1 961 RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide 962 security services. Therefore, NFSv4.1 clients and servers MUST 963 support the Kerberos V5 security mechanism. 965 The use of RPCSEC_GSS requires selection of mechanism, quality of 966 protection (QOP), and service (authentication, integrity, privacy). 967 For the mandated security mechanisms, NFSv4.1 specifies that a QOP of 968 zero is used, leaving it up to the mechanism or the mechanism's 969 configuration to map QOP zero to an appropriate level of protection. 970 Each mandated mechanism specifies a minimum set of cryptographic 971 algorithms for implementing integrity and privacy. NFSv4.1 clients 972 and servers MUST be implemented on operating environments that comply 973 with the REQUIRED cryptographic algorithms of each REQUIRED 974 mechanism. 976 2.2.1.1.1.2.1.
Kerberos V5 978 The Kerberos V5 GSS-API mechanism as described in [5] MUST be 979 implemented with the RPCSEC_GSS services as specified in the 980 following table: 982 column descriptions: 983 1 == number of pseudo flavor 984 2 == name of pseudo flavor 985 3 == mechanism's OID 986 4 == RPCSEC_GSS service 987 5 == NFSv4.1 clients MUST support 988 6 == NFSv4.1 servers MUST support

990   1       2      3                     4                      5    6
991   ------------------------------------------------------------------
992   390003  krb5   1.2.840.113554.1.2.2  rpc_gss_svc_none       yes  yes
993   390004  krb5i  1.2.840.113554.1.2.2  rpc_gss_svc_integrity  yes  yes
994   390005  krb5p  1.2.840.113554.1.2.2  rpc_gss_svc_privacy    no   yes

996 Note that the number and name of the pseudo flavor are presented here 997 as a mapping aid to the implementor. Because the NFSv4.1 protocol 998 includes a method to negotiate security and it understands the GSS-API 999 mechanism, the pseudo flavor is not needed. The pseudo flavor is 1000 needed for NFSv3 since the security negotiation is done via the 1001 MOUNT protocol as described in [39]. 1003 At the time NFSv4.1 was specified, the Advanced Encryption Standard 1004 (AES) with HMAC-SHA1 was a REQUIRED algorithm set for Kerberos V5. 1005 In contrast, when NFSv4.0 was specified, weaker algorithm sets were 1006 REQUIRED for Kerberos V5, and were REQUIRED in the NFSv4.0 1007 specification, because the Kerberos V5 specification at the time did 1008 not specify stronger algorithms. The NFSv4.1 specification does not 1009 specify REQUIRED algorithms for Kerberos V5, and instead, the 1010 implementor is expected to track the evolution of the Kerberos V5 1011 standard if and when stronger algorithms are specified. 1013 2.2.1.1.1.2.1.1. Security Considerations for Cryptographic Algorithms 1014 in Kerberos V5 1016 When deploying NFSv4.1, the strength of the security achieved depends 1017 on the existing Kerberos V5 infrastructure.
The algorithms of 1018 Kerberos V5 are not directly exposed to or selectable by the client 1019 or server, so there is some due diligence required by the user of 1020 NFSv4.1 to ensure that security is acceptable where needed. 1022 2.2.1.1.1.3. GSS Server Principal 1024 Regardless of what security mechanism under RPCSEC_GSS is being used, 1025 the NFS server MUST identify itself in GSS-API via a 1026 GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE 1027 names are of the form: 1029 service@hostname 1031 For NFS, the "service" element is 1033 nfs 1035 Implementations of security mechanisms will convert nfs@hostname to 1036 various forms. For Kerberos V5, the following form is 1037 RECOMMENDED: 1039 nfs/hostname 1041 2.3. COMPOUND and CB_COMPOUND 1043 A significant departure from the versions of the NFS protocol before 1044 NFSv4 is the introduction of the COMPOUND procedure. For the NFSv4 1045 protocol, in all minor versions, there are exactly two RPC 1046 procedures, NULL and COMPOUND. The COMPOUND procedure is defined as 1047 a series of individual operations and these operations perform the 1048 sorts of functions performed by traditional NFS procedures. 1050 The operations combined within a COMPOUND request are evaluated in 1051 order by the server, without any atomicity guarantees. A limited set 1052 of facilities exist to pass results from one operation to another. 1053 Once an operation returns a failing result, the evaluation ends and 1054 the results of all evaluated operations are returned to the client. 1056 With the use of the COMPOUND procedure, the client is able to build 1057 simple or complex requests. These COMPOUND requests allow for a 1058 reduction in the number of RPCs needed for logical file system 1059 operations. For example, multi-component lookup requests can be 1060 constructed by combining multiple LOOKUP operations.
Those can be 1061 further combined with operations such as GETATTR, READDIR, or OPEN 1062 plus READ to do more complicated sets of operations without incurring 1063 additional latency. 1065 NFSv4.1 also contains a considerable set of callback operations in 1066 which the server makes an RPC directed at the client. Callback RPCs 1067 have a similar structure to that of the normal server requests. In 1068 all minor versions of the NFSv4 protocol, there are two callback RPC 1069 procedures: CB_NULL and CB_COMPOUND. The CB_COMPOUND procedure is 1070 defined in an analogous fashion to that of COMPOUND with its own set 1071 of callback operations. 1073 The addition of new server and callback operations within the 1074 COMPOUND and CB_COMPOUND request framework provides a means of 1075 extending the protocol in subsequent minor versions. 1077 Except for a small number of operations needed for session creation, 1078 server requests and callback requests are performed within the 1079 context of a session. Sessions provide a client context for every 1080 request and support robust replay protection for non-idempotent 1081 requests. 1083 2.4. Client Identifiers and Client Owners 1085 For each operation that obtains or depends on locking state, the 1086 specific client needs to be identifiable by the server. 1088 Each distinct client instance is represented by a client ID. A 1089 client ID is a 64-bit identifier representing a specific client at a 1090 given time. The client ID is changed whenever the client 1091 re-initializes, and may change when the server re-initializes. Client 1092 IDs are used to support lock identification and crash recovery. 1094 During steady state operation, the client ID associated with each 1095 operation is derived from the session (see Section 2.10) on which the 1096 operation is sent. A session is associated with a client ID when the 1097 session is created.
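The association just described, in which each session is bound to a client ID at creation and every steady-state operation derives its client ID from the session carrying it, can be sketched non-normatively; the Python names are invented for illustration:

```python
# Illustrative only: during steady-state operation, the client ID for
# an operation is derived from the session on which it arrives.
class SessionTable:
    def __init__(self):
        self._sessions = {}            # session_id -> client_id

    def create_session(self, session_id, client_id):
        # A session is associated with a client ID when it is created.
        self._sessions[session_id] = client_id

    def client_for_operation(self, session_id):
        # Every steady-state request arrives on a session, which
        # identifies the client without any per-operation client ID.
        return self._sessions[session_id]
```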
1099 Unlike NFSv4.0, the only NFSv4.1 operations possible before a client 1100 ID is established are those needed to establish the client ID. 1102 A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION 1103 operation using that client ID (eir_clientid as returned from 1104 EXCHANGE_ID) is required to establish and confirm the client ID on 1105 the server. Establishment of identification by a new incarnation of 1106 the client also has the effect of immediately releasing any locking 1107 state that a previous incarnation of that same client might have had 1108 on the server. Such released state would include all byte-range 1109 lock, share reservation, layout state, and -- where the server 1110 supports neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_CUR_FH claim 1111 types -- all delegation state associated with the same client with 1112 the same identity. For discussion of delegation state recovery, see 1113 Section 10.2.1. For discussion of layout state recovery, see 1114 Section 12.7.1. 1116 Releasing such state requires that the server be able to determine 1117 that one client instance is the successor of another. Where this 1118 cannot be done, for any of a number of reasons, the locking state 1119 will remain for a time subject to lease expiration (see Section 8.3) 1120 and the new client will need to wait for such state to be removed, if 1121 it makes conflicting lock requests. 1123 Client identification is encapsulated in the following client owner 1124 data type:

1126 struct client_owner4 {
1127         verifier4       co_verifier;
1128         opaque          co_ownerid<NFS4_OPAQUE_LIMIT>;
1129 };

1131 The first field, co_verifier, is a client incarnation verifier, 1132 allowing the server to distinguish successive incarnations (e.g., 1133 reboots) of the same client. The server will start the process of 1134 canceling the client's leased state if co_verifier is different than 1135 what the server has previously recorded for the identified client (as 1136 specified in the co_ownerid field).
1138 The second field, co_ownerid, is a variable-length string that 1139 uniquely defines the client so that subsequent instances of the same 1140 client bear the same co_ownerid with a different verifier. 1142 There are several considerations for how the client generates the 1143 co_ownerid string: 1145 o The string should be unique so that multiple clients do not 1146 present the same string. The consequences of two clients 1147 presenting the same string range from one client getting an error 1148 to one client having its leased state abruptly and unexpectedly 1149 cancelled. 1151 o The string should be selected so that subsequent incarnations 1152 (e.g., restarts) of the same client cause the client to present 1153 the same string. The implementor is cautioned against an approach 1154 that requires the string to be recorded in a local file because 1155 this precludes the use of the implementation in an environment 1156 where there is no local disk and all file access is from an 1157 NFSv4.1 server. 1159 o The string should be the same for each server network address that 1160 the client accesses. This way, if a server has multiple 1161 interfaces, the client can trunk traffic over multiple network 1162 paths as described in Section 2.10.5. (Note: the precise opposite 1163 was advised in the NFSv4.0 specification [36].) 1165 o The algorithm for generating the string should not assume that the 1166 client's network address will not change, unless the client 1167 implementation knows it is using statically assigned network 1168 addresses. This includes changes between client incarnations and 1169 even changes while the client is still running in its current 1170 incarnation.
Thus, with dynamic address assignment, if the client 1171 includes just the client's network address in the co_ownerid 1172 string, there is a real risk that after the client gives up the 1173 network address, another client, using a similar algorithm for 1174 generating the co_ownerid string, would generate a conflicting 1175 co_ownerid string. 1177 Given the above considerations, an example of a well-generated 1178 co_ownerid string is one that includes: 1180 o If applicable, the client's statically assigned network address. 1182 o Additional information that tends to be unique, such as one or 1183 more of: 1185 * The client machine's serial number (for privacy reasons, it is 1186 best to perform some one-way function on the serial number). 1188 * A Media Access Control (MAC) address (again, a one-way function 1189 should be performed). 1191 * The timestamp of when the NFSv4.1 software was first installed 1192 on the client (though this is subject to the previously 1193 mentioned caution about using information that is stored in a 1194 file, because the file might only be accessible over NFSv4.1). 1196 * A true random number. However, since this number ought to be 1197 the same between client incarnations, this shares the same 1198 problem as that of using the timestamp of the software 1199 installation. 1201 o For a user-level NFSv4.1 client, it should contain additional 1202 information to distinguish the client from other user-level 1203 clients running on the same host, such as a process identifier or 1204 other unique sequence. 1206 The client ID is assigned by the server (the eir_clientid result from 1207 EXCHANGE_ID) and should be chosen so that it will not conflict with a 1208 client ID previously assigned by the server. This applies across 1209 server restarts. 1211 In the event of a server restart, a client may find out that its 1212 current client ID is no longer valid when it receives an 1213 NFS4ERR_STALE_CLIENTID error. 
The precise circumstances depend on the characteristics of the sessions involved, specifically whether the session is persistent (see Section 2.10.6.5), but in each case the client will receive this error when it attempts to establish a new session with the existing client ID and receives the error NFS4ERR_STALE_CLIENTID, indicating that a new client ID needs to be obtained via EXCHANGE_ID and the new session established with that client ID.

When a session is not persistent, the client will find out that it needs to create a new session as a result of getting an NFS4ERR_BADSESSION, since the session in question was lost as part of a server restart.  When the existing client ID is presented to a server as part of creating a session and that client ID is not recognized, as would happen after a server restart, the server will reject the request with the error NFS4ERR_STALE_CLIENTID.

In the case of the session being persistent, the client will re-establish communication using the existing session after the restart.  This session will be associated with the existing client ID but may only be used to retransmit operations that the client previously transmitted and did not see replies to.  Replies to operations that the server previously performed will come from the reply cache; otherwise, NFS4ERR_DEADSESSION will be returned.  Hence, such a session is referred to as "dead".  In this situation, in order to perform new operations, the client needs to establish a new session.  If an attempt is made to establish this new session with the existing client ID, the server will reject the request with NFS4ERR_STALE_CLIENTID.
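The co_ownerid considerations above can be sketched in code.  The following is a minimal illustration only, not part of the protocol: the function name and inputs are hypothetical, and it simply applies a one-way function (SHA-256) to identifiers such as a MAC address and an install timestamp so that the string is stable across incarnations without exposing the raw values.

```python
import hashlib

def make_co_ownerid(mac_address: str, install_time: str,
                    static_ip: str = "") -> str:
    """Build a stable co_ownerid string per the considerations above.

    The MAC address and install timestamp are run through a one-way
    function (SHA-256) so the raw values are not exposed on the wire.
    The result is deterministic, so every incarnation of the same
    client presents the same string.
    """
    digest = hashlib.sha256(
        f"{mac_address}|{install_time}".encode()).hexdigest()
    # Include a network address only when it is statically assigned;
    # a dynamically assigned address must never be used.
    return f"{static_ip}+{digest}" if static_ip else digest
```

A user-level client would additionally mix in a per-instance discriminator (e.g., a process identifier) as described above.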
When NFS4ERR_STALE_CLIENTID is received in either of these situations, the client needs to obtain a new client ID by use of the EXCHANGE_ID operation, then use that client ID as the basis of a new session, and then proceed to any other necessary recovery for the server restart case (see Section 8.4.2).

See the descriptions of EXCHANGE_ID (Section 18.35) and CREATE_SESSION (Section 18.36) for a complete specification of these operations.

2.4.1.  Upgrade from NFSv4.0 to NFSv4.1

To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a value of data type client_owner4 in an EXCHANGE_ID with a value of data type nfs_client_id4 that was established using the SETCLIENTID operation of NFSv4.0.  A server that does so will allow an upgraded client to avoid waiting until the lease (i.e., the lease established by the NFSv4.0 client instance) expires.  This requires that the value of data type client_owner4 be constructed the same way as the value of data type nfs_client_id4.  If the latter's contents included the server's network address (per the recommendations of the NFSv4.0 specification [36]), and the NFSv4.1 client does not wish to use a client ID that prevents trunking, it should send two EXCHANGE_ID operations.  The first EXCHANGE_ID will have a client_owner4 equal to the nfs_client_id4.  This will clear the state created by the NFSv4.0 client.  The second EXCHANGE_ID will not have the server's network address.  The state created for the second EXCHANGE_ID will not have to wait for lease expiration, because there will be no state to expire.

2.4.2.  Server Release of Client ID

NFSv4.1 introduces a new operation called DESTROY_CLIENTID (Section 18.50), which the client SHOULD use to destroy a client ID it no longer needs.  This permits graceful, bilateral release of a client ID.
The operation cannot be used if there are sessions associated with the client ID, or state with an unexpired lease.

If the server determines that the client holds no associated state for its client ID (associated state includes unrevoked sessions, opens, locks, delegations, layouts, and wants), the server MAY choose to unilaterally release the client ID in order to conserve resources.  If the client contacts the server after this release, the server MUST ensure that the client receives the appropriate error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence to establish a new client ID.  The server ought to be very hesitant to release a client ID since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted.

Typically, a server would not release a client ID unless there had been no activity from that client for many minutes.  As long as there are sessions, opens, locks, delegations, layouts, or wants, the server MUST NOT release the client ID.  See Section 2.10.13.1.4 for discussion on releasing inactive sessions.

2.4.3.  Resolving Client Owner Conflicts

When the server gets an EXCHANGE_ID for a client owner that currently has no state, or that has state but the lease has expired, the server MUST allow the EXCHANGE_ID and confirm the new client ID if followed by the appropriate CREATE_SESSION.

When the server gets an EXCHANGE_ID for a new incarnation of a client owner that currently has an old incarnation with state and an unexpired lease, the server is allowed to dispose of the state of the previous incarnation of the client owner if one of the following is true:

o  The principal that created the client ID for the client owner is the same as the principal that is sending the EXCHANGE_ID operation.
   Note that if the client ID was created with SP4_MACH_CRED state protection (Section 18.35), the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal MUST be used as that used when the client ID was created.

o  The client ID was established with SP4_SSV protection (Section 18.35, Section 2.10.8.3) and the client sends the EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the GSS SSV mechanism (Section 2.10.9).

o  The client ID was established with SP4_SSV protection, and under the conditions described herein, the EXCHANGE_ID was sent with SP4_MACH_CRED state protection.  Because the SSV might not persist across client and server restart, and because the first time a client sends EXCHANGE_ID to a server it does not have an SSV, the client MAY send the subsequent EXCHANGE_ID without an SSV RPCSEC_GSS handle.  Instead, as with SP4_MACH_CRED protection, the principal MUST be based on RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be integrity or privacy, and the same GSS mechanism and principal MUST be used as that used when the client ID was created.

If none of the above situations apply, the server MUST return NFS4ERR_CLID_INUSE.

If the server accepts the principal and co_ownerid as matching that which created the client ID, and the co_verifier in the EXCHANGE_ID differs from the co_verifier used when the client ID was created, then after the server receives a CREATE_SESSION that confirms the client ID, the server deletes state.  If the co_verifier values are the same (e.g., the client either is updating properties of the client ID (Section 18.35) or is attempting trunking (Section 2.10.5)), the server MUST NOT delete state.

2.5.  Server Owners

The server owner is similar to a client owner (Section 2.4), but unlike the client owner, there is no shorthand server ID.  The server owner is defined in the following data type:

struct server_owner4 {
  uint64_t so_minor_id;
  opaque   so_major_id<NFS4_OPAQUE_LIMIT>;
};

The server owner is returned from EXCHANGE_ID.  When the so_major_id fields are the same in two EXCHANGE_ID results, the connections that each EXCHANGE_ID was sent over can be assumed to address the same server (as defined in Section 1.7).  If the so_minor_id fields are also the same, then not only do both connections connect to the same server, but the session can be shared across both connections.  The reader is cautioned that multiple servers may deliberately or accidentally claim to have the same so_major_id or so_major_id/so_minor_id; the reader should examine Sections 2.10.5 and 18.35 in order to avoid acting on falsely matching server owner values.

The considerations for generating an so_major_id are similar to those for generating a co_ownerid string (see Section 2.4).  The consequences of two servers generating conflicting so_major_id values are less dire than they are for co_ownerid conflicts because the client can use RPCSEC_GSS to compare the authenticity of each server (see Section 2.10.5).

2.6.  Security Service Negotiation

With the NFSv4.1 server potentially offering multiple security mechanisms, the client needs a method to determine or negotiate which mechanism is to be used for its communication with the server.  The NFS server may have multiple points within its file system namespace that are available for use by NFS clients.  These points can be considered security policy boundaries, and, in some NFS implementations, are tied to NFS export points.
In turn, the NFS server may be configured such that each of these security policy boundaries may have different or multiple security mechanisms in use.

The security negotiation between client and server SHOULD be done with a secure channel to eliminate the possibility of a third party intercepting the negotiation sequence and forcing the client and server to choose a lower level of security than required or desired.  See Section 21 for further discussion.

2.6.1.  NFSv4.1 Security Tuples

An NFS server can assign one or more "security tuples" to each security policy boundary in its namespace.  Each security tuple consists of a security flavor (see Section 2.2.1.1) and, if the flavor is RPCSEC_GSS, a GSS-API mechanism Object Identifier (OID), a GSS-API quality of protection, and an RPCSEC_GSS service.

2.6.2.  SECINFO and SECINFO_NO_NAME

The SECINFO and SECINFO_NO_NAME operations allow the client to determine, on a per-filehandle basis, what security tuple is to be used for server access.  In general, the client will not have to use either operation except during initial communication with the server or when the client crosses security policy boundaries at the server.  However, the server's policies may also change at any time and force the client to negotiate a new security tuple.

Where the use of different security tuples would affect the type of access that would be allowed if a request was sent over the same connection used for the SECINFO or SECINFO_NO_NAME operation (e.g., read-only vs. read-write access), security tuples that allow greater access should be presented first.
Where the general level of access is the same and different security flavors limit the range of principals whose privileges are recognized (e.g., allowing or disallowing root access), flavors supporting the greatest range of principals should be listed first.

2.6.3.  Security Error

Based on the assumption that each NFSv4.1 client and server MUST support a minimum set of security (i.e., Kerberos V5 under RPCSEC_GSS), the NFS client will initiate file access to the server with one of the minimal security tuples.  During communication with the server, the client may receive an NFS error of NFS4ERR_WRONGSEC.  This error allows the server to notify the client that the security tuple currently being used contravenes the server's security policy.  The client is then responsible for determining (see Section 2.6.3.1) what security tuples are available at the server and choosing one that is appropriate for the client.

2.6.3.1.  Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME

This section explains the mechanics of NFSv4.1 security negotiation.

2.6.3.1.1.  Put Filehandle Operations

The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH.  Each of the subsections herein describes how the server handles a subseries of operations that starts with a put filehandle operation.

2.6.3.1.1.1.  Put Filehandle Operation + SAVEFH

The client is saving a filehandle for a future RESTOREFH, LINK, or RENAME.  SAVEFH MUST NOT return NFS4ERR_WRONGSEC.  To determine whether or not the put filehandle operation returns NFS4ERR_WRONGSEC, the server implementation pretends SAVEFH is not in the series of operations and examines which of the situations described in the other subsections of Section 2.6.3.1.1 apply.

2.6.3.1.1.2.  Two or More Put Filehandle Operations

For a series of N put filehandle operations, the server MUST NOT return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations.  The Nth put filehandle operation is handled as if it is the first in a subseries of operations.  For example, if the server received a COMPOUND request with this series of operations -- PUTFH, PUTROOTFH, LOOKUP -- then the PUTFH operation is ignored for NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is processed according to Section 2.6.3.1.1.3.

2.6.3.1.1.3.  Put Filehandle Operation + LOOKUP (or OPEN of an Existing Name)

This situation also applies to a put filehandle operation followed by a LOOKUP or an OPEN operation that specifies an existing component name.

In this situation, the client is potentially crossing a security policy boundary, and the set of security tuples the parent directory supports may differ from those of the child.  The server implementation may decide whether to impose any restrictions on security policy administration.  There are at least three approaches (sec_policy_child is the tuple set of the child export, sec_policy_parent is that of the parent):

(a) sec_policy_child <= sec_policy_parent (<= for subset).  This means that the set of security tuples specified on the security policy of a child directory is always a subset of its parent directory.

(b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, {} for the empty set).  This means that the set of security tuples specified on the security policy of a child directory always has a non-empty intersection with that of the parent.

(c) sec_policy_child ^ sec_policy_parent == {}.  This means that the set of security tuples specified on the security policy of a child directory may not intersect with that of the parent.
In other words, there are no restrictions on how the system administrator may set up these tuples.

In order for a server to support approaches (b) (for the case when a client chooses a flavor that is not a member of sec_policy_parent) and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC when there is a security tuple mismatch.  Instead, it should be returned from the LOOKUP (or OPEN by existing component name) that follows.

Since the above guideline does not contradict approach (a), it should be followed in general.  Even if approach (a) is implemented, it is possible for the security tuple used to be acceptable for the target of LOOKUP but not for the filehandles used in the put filehandle operation.  The put filehandle operation could be a PUTROOTFH or PUTPUBFH, where the client cannot know the security tuples for the root or public filehandle.  Or the security policy for the filehandle used by the put filehandle operation could have changed since the time the filehandle was obtained.

Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to the put filehandle operation if the operation is immediately followed by a LOOKUP or an OPEN by component name.

2.6.3.1.1.4.  Put Filehandle Operation + LOOKUPP

Since SECINFO only works its way down, there is no way LOOKUPP can return NFS4ERR_WRONGSEC without SECINFO_NO_NAME.  SECINFO_NO_NAME solves this issue via style SECINFO_STYLE4_PARENT, which works in the opposite direction as SECINFO.  As with Section 2.6.3.1.1.3, a put filehandle operation that is followed by a LOOKUPP MUST NOT return NFS4ERR_WRONGSEC.  If the server does not support SECINFO_NO_NAME, the client's only recourse is to send the put filehandle operation, LOOKUPP, GETFH sequence of operations with every security tuple it supports.
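The per-subsection rules governing when a put filehandle operation may return NFS4ERR_WRONGSEC, including the "anything else" case covered later in Section 2.6.3.1.1.7, can be condensed into a single predicate.  The sketch below is illustrative only: operation names are placeholder strings, not XDR opcodes, and the function is an assumption of this example rather than anything defined by the protocol.

```python
PUTFH_OPS = {"PUTFH", "PUTROOTFH", "PUTPUBFH", "RESTOREFH"}

# For these following operations, a security mismatch must surface
# from the following operation (LOOKUP/LOOKUPP/OPEN by name) or not
# at all (SECINFO/SECINFO_NO_NAME fail at the RPC layer instead).
SUPPRESS_WRONGSEC = {"LOOKUP", "LOOKUPP", "OPEN_BY_NAME",
                     "SECINFO", "SECINFO_NO_NAME"}

def putfh_may_return_wrongsec(ops, i):
    """May the put filehandle operation at index i return
    NFS4ERR_WRONGSEC on a security tuple mismatch?"""
    assert ops[i] in PUTFH_OPS
    nxt = ops[i + 1] if i + 1 < len(ops) else None
    if nxt is None:                    # put filehandle + nothing
        return False
    if nxt in PUTFH_OPS:               # first N-1 of a put-fh series
        return False
    if nxt == "SAVEFH":                # pretend SAVEFH is absent
        return putfh_may_return_wrongsec(ops[:i + 1] + ops[i + 2:], i)
    if nxt in SUPPRESS_WRONGSEC:
        return False
    return True                        # "anything else", e.g. READ
```

For instance, in PUTFH, READ the PUTFH may return NFS4ERR_WRONGSEC, while in PUTFH, LOOKUP the error, if any, comes from the LOOKUP.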
Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle operation if the operation is immediately followed by a LOOKUPP.

2.6.3.1.1.5.  Put Filehandle Operation + SECINFO/SECINFO_NO_NAME

A security-sensitive client is allowed to choose a strong security tuple when querying a server to determine a file object's permitted security tuples.  The security tuple chosen by the client does not have to be included in the tuple list of the security policy of either the parent directory indicated in the put filehandle operation or the child file object indicated in SECINFO (or any parent directory indicated in SECINFO_NO_NAME).  Of course, the server has to be configured for whatever security tuple the client selects; otherwise, the request will fail at the RPC layer with an appropriate authentication error.

In theory, there is no connection between the security flavor used by SECINFO or SECINFO_NO_NAME and those supported by the security policy.  But in practice, the client may start looking for strong flavors from those supported by the security policy, followed by those in the REQUIRED set.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put filehandle operation that is immediately followed by SECINFO or SECINFO_NO_NAME.  The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC from SECINFO or SECINFO_NO_NAME.

2.6.3.1.1.6.  Put Filehandle Operation + Nothing

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.1.7.  Put Filehandle Operation + Anything Else

"Anything Else" includes OPEN by filehandle.

The security policy enforcement applies to the filehandle specified in the put filehandle operation.  Therefore, the put filehandle operation MUST return NFS4ERR_WRONGSEC when there is a security tuple mismatch.
This avoids the complexity of adding NFS4ERR_WRONGSEC as an allowable error to every other operation.

A COMPOUND containing the series put filehandle operation + SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way for the client to recover from NFS4ERR_WRONGSEC.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by component name).

2.6.3.1.1.8.  Operations after SECINFO and SECINFO_NO_NAME

Suppose a client sends a COMPOUND procedure containing the series SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security tuple used does not match that required for the target file.  By rule (see Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can return NFS4ERR_WRONGSEC.  By rule (see Section 2.6.3.1.1.7), READ cannot return NFS4ERR_WRONGSEC.  The issue is resolved by the fact that SECINFO and SECINFO_NO_NAME consume the current filehandle (note that this is a change from NFSv4.0).  This leaves no current filehandle for READ to use, and READ returns NFS4ERR_NOFILEHANDLE.

2.6.3.1.2.  LINK and RENAME

The LINK and RENAME operations use both the current and saved filehandles.  Technically, the server MAY return NFS4ERR_WRONGSEC from LINK or RENAME if the security policy of the saved filehandle rejects the security flavor used in the COMPOUND request's credentials.  If the server does so, then if there is no intersection between the security policies of saved and current filehandles, this means that it will be impossible for the client to perform the intended LINK or RENAME operation.

For example, suppose the client sends this COMPOUND request: SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where filehandles bFH and aFH refer to different directories.
Suppose no common security tuple exists between the security policies of aFH and bFH.  If the client sends the request using credentials acceptable to bFH's security policy but not aFH's policy, then the PUTFH aFH operation will fail with NFS4ERR_WRONGSEC.  After a SECINFO_NO_NAME request, the client sends SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", using credentials acceptable to aFH's security policy but not bFH's policy.  The server returns NFS4ERR_WRONGSEC on the RENAME operation.

To prevent a client from entering an endless sequence of requests containing LINK or RENAME, each followed by a request containing SECINFO_NO_NAME or SECINFO, the server MUST detect when the security policies of the current and saved filehandles have no mutually acceptable security tuple, and MUST NOT return NFS4ERR_WRONGSEC from LINK or RENAME in that situation.  Instead, the server MUST do one of two things:

o  The server can return NFS4ERR_XDEV.

o  The server can allow the security policy of the current filehandle to override that of the saved filehandle, and so return NFS4_OK.

2.7.  Minor Versioning

To address the requirement of an NFS protocol that can evolve as the need arises, the NFSv4.1 protocol contains the rules and framework to allow for future minor changes or versioning.

The base assumption with respect to minor versioning is that any future accepted minor version will be documented in one or more Standards Track RFCs.  Minor version 0 of the NFSv4 protocol is represented by [36], and minor version 1 is represented by this RFC.  The COMPOUND and CB_COMPOUND procedures support the encoding of the minor version being requested by the client.

The following items represent the basic rules for the development of minor versions.
Note that a future minor version may modify or add to the following rules as part of the minor version definition.

1.   Procedures are not added or deleted.

     To maintain the general RPC model, NFSv4 minor versions will not add to or delete procedures from the NFS program.

2.   Minor versions may add operations to the COMPOUND and CB_COMPOUND procedures.

     The addition of operations to the COMPOUND and CB_COMPOUND procedures does not affect the RPC model.

     *  Minor versions may append attributes to the bitmap4 that represents sets of attributes and to the fattr4 that represents sets of attribute values.

        This allows for the expansion of the attribute model to allow for future growth or adaptation.

     *  Minor version X must append any new attributes after the last documented attribute.

        Since attribute results are specified as an opaque array of per-attribute, XDR-encoded results, the complexity of adding new attributes in the midst of the current definitions would be too burdensome.

3.   Minor versions must not modify the structure of an existing operation's arguments or results.

     Again, the complexity of handling multiple structure definitions for a single operation is too burdensome.  New operations should be added instead of modifying existing structures for a minor version.

     This rule does not preclude the following adaptations in a minor version:

     *  adding bits to flag fields, such as new attributes to GETATTR's bitmap4 data type, and providing corresponding variants of opaque arrays, such as a notify4 used together with such bitmaps

     *  adding bits to existing attributes like ACLs that have flag words

     *  extending enumerated types (including NFS4ERR_*) with new values

     *  adding cases to a switched union

4.   Minor versions must not modify the structure of existing attributes.

5.   Minor versions must not delete operations.

     This prevents the potential reuse of a particular operation "slot" in a future minor version.

6.   Minor versions must not delete attributes.

7.   Minor versions must not delete flag bits or enumeration values.

8.   Minor versions may declare an operation MUST NOT be implemented.

     Specifying that an operation MUST NOT be implemented is equivalent to obsoleting an operation.  For the client, it means that the operation MUST NOT be sent to the server.  For the server, an NFS error can be returned as opposed to "dropping" the request as an XDR decode error.  This approach allows for the obsolescence of an operation while maintaining its structure so that a future minor version can reintroduce the operation.

     1.  Minor versions may declare that an attribute MUST NOT be implemented.

     2.  Minor versions may declare that a flag bit or enumeration value MUST NOT be implemented.

9.   Minor versions may downgrade features from REQUIRED to RECOMMENDED, or RECOMMENDED to OPTIONAL.

10.  Minor versions may upgrade features from OPTIONAL to RECOMMENDED, or RECOMMENDED to REQUIRED.

11.  A client and server that support minor version X SHOULD support minor versions zero through X-1 as well.

12.  Except for infrastructural changes, a minor version must not introduce REQUIRED new features.

     This rule allows for the introduction of new functionality and forces the use of implementation experience before designating a feature as REQUIRED.  On the other hand, some classes of features are infrastructural and have broad effects.  Allowing infrastructural features to be RECOMMENDED or OPTIONAL complicates implementation of the minor version.

13.  A client MUST NOT attempt to use a stateid, filehandle, or similar returned object from the COMPOUND procedure with minor version X for another COMPOUND procedure with minor version Y, where X != Y.

2.8.  Non-RPC-Based Security Services

As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for identification, authentication, integrity, and privacy.  NFSv4.1 itself provides or enables additional security services as described in the next several subsections.

2.8.1.  Authorization

Authorization to access a file object via an NFSv4.1 operation is ultimately determined by the NFSv4.1 server.  A client can predetermine its access to a file object via the OPEN (Section 18.16) and the ACCESS (Section 18.1) operations.

Principals with appropriate access rights can modify the authorization on a file object via the SETATTR (Section 18.30) operation.  Attributes that affect access rights include mode, owner, owner_group, acl, dacl, and sacl.  See Section 5.

2.8.2.  Auditing

NFSv4.1 provides auditing on a per-file object basis, via the acl and sacl attributes as described in Section 6.  It is outside the scope of this specification to specify audit log formats or management policies.

2.8.3.  Intrusion Detection

NFSv4.1 provides alarm control on a per-file object basis, via the acl and sacl attributes as described in Section 6.  Alarms may serve as the basis for intrusion detection.  It is outside the scope of this specification to specify heuristics for detecting intrusion via alarms.

2.9.  Transport Layers

2.9.1.  REQUIRED and RECOMMENDED Properties of Transports

NFSv4.1 works over Remote Direct Memory Access (RDMA) and non-RDMA-based transports with the following attributes:

o  The transport supports reliable delivery of data, which NFSv4.1 requires but neither NFSv4.1 nor RPC has facilities for ensuring [40].
o  The transport delivers data in the order it was sent.  Ordered delivery simplifies detection of transmit errors, and simplifies the sending of arbitrary sized requests and responses via the record marking protocol [3].

Where an NFSv4.1 implementation supports operation over the IP network protocol, any transport used between NFS and IP MUST be among the IETF-approved congestion control transport protocols.  At the time this document was written, the only two transports that had the above attributes were TCP and the Stream Control Transmission Protocol (SCTP).  To enhance the possibilities for interoperability, an NFSv4.1 implementation MUST support operation over the TCP transport protocol.

Even if NFSv4.1 is used over a non-IP network protocol, it is RECOMMENDED that the transport support congestion control.

It is permissible for a connectionless transport to be used under NFSv4.1; however, reliable and in-order delivery of data combined with congestion control by the connectionless transport is REQUIRED.  As a consequence, UDP by itself MUST NOT be used as an NFSv4.1 transport.  NFSv4.1 assumes that a client transport address and server transport address used to send data over a transport together constitute a connection, even if the underlying transport eschews the concept of a connection.

2.9.2.  Client and Server Transport Behavior

If a connection-oriented transport (e.g., TCP) is used, the client and server SHOULD use long-lived connections for at least three reasons:

1.  This will prevent the weakening of the transport's congestion control mechanisms via short-lived connections.

2.  This will improve performance for the WAN environment by eliminating the need for connection setup handshakes.

3.  The NFSv4.1 callback model differs from NFSv4.0, and requires the client and server to maintain a client-created backchannel (see Section 2.10.3.1) for the server to use.

In order to reduce congestion, if a connection-oriented transport is used, and the request is not the NULL procedure:

o  A requester MUST NOT retry a request unless the connection the request was sent over was lost before the reply was received.

o  A replier MUST NOT silently drop a request, even if the request is a retry.  (The silent drop behavior of RPCSEC_GSS [4] does not apply because this behavior happens at the RPCSEC_GSS layer, a lower layer in the request processing.)  Instead, the replier SHOULD return an appropriate error (see Section 2.10.6.1), or it MAY disconnect the connection.

When sending a reply, the replier MUST send the reply to the same full network address (e.g., if using an IP-based transport, the source port of the requester is part of the full network address) from which the requester sent the request.  If using a connection-oriented transport, replies MUST be sent on the same connection from which the request was received.

If a connection is dropped after the replier receives the request but before the replier sends the reply, the replier might have a pending reply.  If a connection is established with the same source and destination full network address as the dropped connection, then the replier MUST NOT send the reply until the requester retries the request.  The reason for this prohibition is that the requester MAY retry a request over a different connection (provided that connection is associated with the original request's session).
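The retry and pending-reply rules above can be sketched as code.  This is a minimal illustration under assumed names (nothing here is protocol-defined structure): a requester-side predicate for when a retry is permitted, and a replier-side holder that releases a pending reply only when the requester actually retries.

```python
def requester_may_retry(is_null_proc, connection_lost_before_reply):
    """Requester rule from Section 2.9.2: over a connection-oriented
    transport, a non-NULL request may be retried only if the
    connection it was sent over was lost before the reply arrived."""
    return is_null_proc or connection_lost_before_reply

class Replier:
    """Pending-reply rule: a reply whose connection dropped before it
    could be sent is held back, and MUST NOT be sent on a new
    connection with the same full network address until the requester
    actually retries the request."""

    def __init__(self):
        self._pending = {}                 # request id -> held reply

    def connection_dropped(self, req_id, reply):
        self._pending[req_id] = reply      # hold the reply back

    def new_connection(self, req_id):
        # Same source/destination full network address as the dropped
        # connection: the held reply still MUST NOT be sent
        # unsolicited, because the requester may retry over a
        # different connection in the same session.
        return None

    def retry_received(self, req_id):
        # Only an actual retry releases the held reply.
        return self._pending.pop(req_id, None)
```

The replier must also never silently drop a non-NULL request; on a retry it would answer from the reply cache or return an error, per the bullets above.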
1864 When using RDMA transports, there are other reasons for not 1865 tolerating retries over the same connection: 1867 o RDMA transports use "credits" to enforce flow control, where a 1868 credit is a right to a peer to transmit a message. If one peer 1869 were to retransmit a request (or reply), it would consume an 1870 additional credit. If the replier retransmitted a reply, it would 1871 certainly result in an RDMA connection loss, since the requester 1872 would typically only post a single receive buffer for each 1873 request. If the requester retransmitted a request, the additional 1874 credit consumed on the server might lead to RDMA connection 1875 failure unless the client accounted for it and decreased its 1876 available credit, leading to wasted resources. 1878 o RDMA credits present a new issue to the reply cache in NFSv4.1. 1879 The reply cache may be used when a connection within a session is 1880 lost, such as after the client reconnects. Credit information is 1881 a dynamic property of the RDMA connection, and stale values must 1882 not be replayed from the cache. This implies that the reply cache 1883 contents must not be blindly used when replies are sent from it, 1884 and credit information appropriate to the channel must be 1885 refreshed by the RPC layer. 1887 In addition, as described in Section 2.10.6.2, while a session is 1888 active, the NFSv4.1 requester MUST NOT stop waiting for a reply. 1890 2.9.3. Ports 1892 Historically, NFSv3 servers have listened over TCP port 2049. The 1893 registered port 2049 [41] for the NFS protocol should be the default 1894 configuration. NFSv4.1 clients SHOULD NOT use the RPC binding 1895 protocols as described in [42]. 1897 2.10. Session 1899 NFSv4.1 clients and servers MUST support and MUST use the session 1900 feature as described in this section. 1902 2.10.1. 
Motivation and Overview 1904 Previous versions and minor versions of NFS have suffered from the 1905 following: 1907 o Lack of support for Exactly Once Semantics (EOS). This includes 1908 lack of support for EOS through server failure and recovery. 1910 o Limited callback support, including no support for sending 1911 callbacks through firewalls, and races between replies to normal 1912 requests and callbacks. 1914 o Limited trunking over multiple network paths. 1916 o Requiring machine credentials for fully secure operation. 1918 Through the introduction of a session, NFSv4.1 addresses the above 1919 shortfalls with practical solutions: 1921 o EOS is enabled by a reply cache with a bounded size, making it 1922 feasible to keep the cache in persistent storage and enable EOS 1923 through server failure and recovery. One reason that previous 1924 revisions of NFS did not support EOS was because some EOS 1925 approaches often limited parallelism. As will be explained in 1926 Section 2.10.6, NFSv4.1 supports both EOS and unlimited 1927 parallelism. 1929 o The NFSv4.1 client (defined in Section 1.7, Paragraph 2) creates 1930 transport connections and provides them to the server to use for 1931 sending callback requests, thus solving the firewall issue 1932 (Section 18.34). Races between responses from client requests and 1933 callbacks caused by the requests are detected via the session's 1934 sequencing properties that are a consequence of EOS 1935 (Section 2.10.6.3). 1937 o The NFSv4.1 client can associate an arbitrary number of 1938 connections with the session, and thus provide trunking 1939 (Section 2.10.5). 1941 o The NFSv4.1 client and server produce a session key independent of 1942 client and server machine credentials which can be used to compute 1943 a digest for protecting critical session management operations 1944 (Section 2.10.8.3). 
1946 o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for 1947 use by the session's backchannel that do not require the server to 1948 authenticate to a client machine principal (Section 2.10.8.2). 1950 A session is a dynamically created, long-lived server object created 1951 by a client and used over time from one or more transport 1952 connections. Its function is to maintain the server's state relative 1953 to the connection(s) belonging to a client instance. This state is 1954 entirely independent of the connection itself, and indeed the state 1955 exists whether or not the connection exists. A client may have one 1956 or more sessions associated with it so that client-associated state 1957 may be accessed using any of the sessions associated with that 1958 client's client ID, when connections are associated with those 1959 sessions. When no connections are associated with any of a client 1960 ID's sessions for an extended time, such objects as locks, opens, 1961 delegations, layouts, etc. are subject to expiration. The session 1962 serves as an object representing a means of access by a client to the 1963 associated client state on the server, independent of the physical 1964 means of access to that state. 1966 A single client may create multiple sessions. A single session MUST 1967 NOT serve multiple clients. 1969 2.10.2. NFSv4 Integration 1971 Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major 1972 infrastructure change such as sessions would require a new major 1973 version number to an Open Network Computing (ONC) RPC program like 1974 NFS. However, because NFSv4 encapsulates its functionality in a 1975 single procedure, COMPOUND, and because COMPOUND can support an 1976 arbitrary number of operations, sessions have been added to NFSv4.1 1977 with little difficulty. COMPOUND includes a minor version number 1978 field, and for NFSv4.1 this minor version is set to 1. 
When the 1979 NFSv4 server processes a COMPOUND with the minor version set to 1, it 1980 expects a different set of operations than it does for NFSv4.0. 1981 NFSv4.1 defines the SEQUENCE operation, which is required for every 1982 COMPOUND that operates over an established session, with the 1983 exception of some session administration operations, such as 1984 DESTROY_SESSION (Section 18.37). 1986 2.10.2.1. SEQUENCE and CB_SEQUENCE 1988 In NFSv4.1, when the SEQUENCE operation is present, it MUST be the 1989 first operation in the COMPOUND procedure. The primary purpose of 1990 SEQUENCE is to carry the session identifier. The session identifier 1991 associates all other operations in the COMPOUND procedure with a 1992 particular session. SEQUENCE also contains required information for 1993 maintaining EOS (see Section 2.10.6). Session-enabled NFSv4.1 1994 COMPOUND requests thus have the form: 1996 +-----+--------------+-----------+------------+-----------+---- 1997 | tag | minorversion | numops |SEQUENCE op | op + args | ... 1998 | | (== 1) | (limited) | + args | | 1999 +-----+--------------+-----------+------------+-----------+---- 2001 and the replies have the form: 2003 +------------+-----+--------+-------------------------------+--// 2004 |last status | tag | numres |status + SEQUENCE op + results | // 2005 +------------+-----+--------+-------------------------------+--// 2006 //-----------------------+---- 2007 // status + op + results | ... 2008 //-----------------------+---- 2010 A CB_COMPOUND procedure request and reply has a similar form to 2011 COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE 2012 operation. CB_COMPOUND also has an additional field called 2013 "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored 2014 by the client. CB_SEQUENCE has the same information as SEQUENCE, and 2015 also includes other information needed to resolve callback races 2016 (Section 2.10.6.3). 2018 2.10.2.2. 
Client ID and Session Association 2020 Each client ID (Section 2.4) can have zero or more active sessions. 2021 A client ID and associated session are required to perform file 2022 access in NFSv4.1. Each time a session is used (whether by a client 2023 sending a request to the server or the client replying to a callback 2024 request from the server), the state leased to its associated client 2025 ID is automatically renewed. 2027 State (which can consist of share reservations, locks, delegations, 2028 and layouts (Section 1.8.4)) is tied to the client ID. Client state 2029 is not tied to any individual session. Successive state changing 2030 operations from a given state owner MAY go over different sessions, 2031 provided the session is associated with the same client ID. A 2032 callback MAY arrive over a different session than that of the request 2033 that originally acquired the state pertaining to the callback. For 2034 example, if session A is used to acquire a delegation, a request to 2035 recall the delegation MAY arrive over session B if both sessions are 2036 associated with the same client ID. Sections 2.10.8.1 and 2.10.8.2 2037 discuss the security considerations around callbacks. 2039 2.10.3. Channels 2041 A channel is not a connection. A channel represents the direction 2042 ONC RPC requests are sent. 2044 Each session has one or two channels: the fore channel and the 2045 backchannel. Because there are at most two channels per session, and 2046 because each channel has a distinct purpose, channels are not 2047 assigned identifiers. 2049 The fore channel is used for ordinary requests from the client to the 2050 server, and carries COMPOUND requests and responses. A session 2051 always has a fore channel. 2053 The backchannel is used for callback requests from server to client, 2054 and carries CB_COMPOUND requests and responses. 
Whether or not there 2055 is a backchannel is decided by the client; however, many features of 2056 NFSv4.1 require a backchannel. NFSv4.1 servers MUST support 2057 backchannels. 2059 Each session has resources for each channel, including separate reply 2060 caches (see Section 2.10.6.1). Note that even the backchannel 2061 requires a reply cache (or, at least, a slot table in order to detect 2062 retries) because some callback operations are nonidempotent. 2064 2.10.3.1. Association of Connections, Channels, and Sessions 2066 Each channel is associated with zero or more transport connections 2067 (whether of the same transport protocol or different transport 2068 protocols). A connection can be associated with one channel or both 2069 channels of a session; the client and server negotiate whether a 2070 connection will carry traffic for one channel or both channels via 2071 the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION 2072 (Section 18.34) operations. When a session is created via 2073 CREATE_SESSION, the connection that transported the CREATE_SESSION 2074 request is automatically associated with the fore channel, and 2075 optionally the backchannel. If the client specifies no state 2076 protection (Section 18.35) when the session is created, then when 2077 SEQUENCE is transmitted on a different connection, the connection is 2078 automatically associated with the fore channel of the session 2079 specified in the SEQUENCE operation. 2081 A connection's association with a session is not exclusive. A 2082 connection associated with the channel(s) of one session may be 2083 simultaneously associated with the channel(s) of other sessions 2084 including sessions associated with other client IDs. 2086 It is permissible for connections of multiple transport types to be 2087 associated with the same channel. For example, both TCP and RDMA 2088 connections can be associated with the fore channel. 
In the event an 2089 RDMA and non-RDMA connection are associated with the same channel, 2090 the maximum number of slots SHOULD be at least one more than the 2091 total number of RDMA credits (Section 2.10.6.1). This way, if all 2092 RDMA credits are used, the non-RDMA connection can have at least one 2093 outstanding request. If a server supports multiple transport types, 2094 it MUST allow a client to associate connections from each transport 2095 to a channel. 2097 It is permissible for a connection of one type of transport to be 2098 associated with the fore channel, and a connection of a different 2099 type to be associated with the backchannel. 2101 2.10.4. Server Scope 2103 Servers each specify a server scope value in the form of an opaque 2104 string eir_server_scope returned as part of the results of an 2105 EXCHANGE_ID operation. The purpose of the server scope is to allow a 2106 group of servers to indicate to clients that a set of servers sharing 2107 the same server scope value has arranged to use distinct values of 2108 opaque identifiers so that the two servers never assign the same 2109 value to two distinct objects. Thus, the identifiers generated by 2110 two servers within that set can be assumed compatible so that, in 2111 certain important cases, identifiers generated by one server in that 2112 set may be presented to another server of the same scope. 2114 The use of such compatible values does not imply that a value 2115 generated by one server will always be accepted by another. In most 2116 cases, it will not. However, a server will not inadvertently accept 2117 a value generated by another server. When it does accept it, it will 2118 be because it is recognized as valid and carrying the same meaning as 2119 on another server of the same scope. 2121 When servers are of the same server scope, this compatibility of 2122 values applies to the following identifiers: 2124 o Filehandle values. 
A filehandle value accepted by two servers of 2125 the same server scope denotes the same object. A WRITE operation 2126 sent to one server is reflected immediately in a READ sent to the 2127 other. 2129 o Server owner values. When the server scope values are the same, 2130 server owner values may be validly compared. In cases where the 2131 server scope values are different, server owner values are treated 2132 as different even if they contain identical strings of bytes. 2134 The coordination among servers required to provide such compatibility 2135 can be quite minimal, and limited to a simple partition of the ID 2136 space. The recognition of common values requires additional 2137 implementation, but this can be tailored to the specific situations 2138 in which that recognition is desired. 2140 Clients will have occasion to compare the server scope values of 2141 multiple servers under a number of circumstances, each of which will 2142 be discussed under the appropriate functional section: 2144 o When server owner values received in response to EXCHANGE_ID 2145 operations sent to multiple network addresses are compared for the 2146 purpose of determining the validity of various forms of trunking, 2147 as described in Section 11.5.2. 2149 o When network or server reconfiguration causes the same network 2150 address to possibly be directed to different servers, with the 2151 necessity for the client to determine when lock reclaim should be 2152 attempted, as described in Section 8.4.2.1. 2154 When two replies from EXCHANGE_ID, each from two different server 2155 network addresses, have the same server scope, there are a number of 2156 ways a client can validate that the common server scope is due to two 2157 servers cooperating in a group. 2159 o If both EXCHANGE_ID requests were sent with RPCSEC_GSS ([4], [9], 2160 [27]) authentication and the server principal is the same for both 2161 targets, the equality of server scope is validated.
It is 2162 RECOMMENDED that two servers intending to share the same server 2163 scope and server_owner major_id also share the same principal 2164 name. In some cases, this simplifies the client's task of 2165 validating server scope. 2167 o The client may accept the appearance of the second server in the 2168 fs_locations or fs_locations_info attribute for a relevant file 2169 system. For example, if there is a migration event for a 2170 particular file system or there are locks to be reclaimed on a 2171 particular file system, the attributes for that particular file 2172 system may be used. The client sends the GETATTR request to the 2173 first server for the fs_locations or fs_locations_info attribute 2174 with RPCSEC_GSS authentication. It may need to do this in advance 2175 of the need to verify the common server scope. If the client 2176 successfully authenticates the reply to GETATTR, and the GETATTR 2177 request and reply containing the fs_locations or fs_locations_info 2178 attribute refers to the second server, then the equality of server 2179 scope is supported. A client may choose to limit the use of this 2180 form of support to information relevant to the specific file 2181 system involved (e.g., a file system being migrated). 2183 2.10.5. Trunking 2185 Trunking is the use of multiple connections between a client and 2186 server in order to increase the speed of data transfer. NFSv4.1 2187 supports two types of trunking: session trunking and client ID 2188 trunking. 2190 In the context of a single server network address, it can be assumed 2191 that all connections are accessing the same server, and NFSv4.1 2192 servers MUST support both forms of trunking. When multiple 2193 connections use a set of network addresses accessing the same server, 2194 the server MUST support both forms of trunking. NFSv4.1 servers in a 2195 clustered configuration MAY allow network addresses for different 2196 servers to use client ID trunking.
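The distinction between the two forms, which the remainder of this section develops in terms of the EXCHANGE_ID results, can be sketched as a decision rule (hypothetical Python names mirroring the eir_* fields; a sketch, not a normative algorithm):

```python
from collections import namedtuple

# Illustrative stand-in for the relevant EXCHANGE_ID results.
ExchangeIdResult = namedtuple(
    "ExchangeIdResult",
    ["clientid", "so_major_id", "so_minor_id", "server_scope"])

def trunking_allowed(a, b):
    """For two EXCHANGE_ID results obtained with the same eia_clientowner,
    return 'session', 'clientid', or None."""
    if (a.clientid, a.so_major_id, a.server_scope) != \
       (b.clientid, b.so_major_id, b.server_scope):
        return None          # different servers: no trunking between them
    if a.so_minor_id == b.so_minor_id:
        return "session"     # session trunking (client ID trunking also allowed)
    return "clientid"        # minor IDs differ: client ID trunking only
```

As the text below notes, a client permitted session trunking may still elect client ID trunking (or no trunking at all); the function only reports the strongest form allowed.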
2198 Clients may use either form of trunking as long as they do not, when 2199 trunking between different server network addresses, violate the 2200 servers' mandates as to the kinds of trunking to be allowed (see 2201 below). With regard to callback channels, the client MUST allow the 2202 server to choose among all callback channels valid for a given client 2203 ID and MUST support trunking when the connections supporting the 2204 backchannel allow session or client ID trunking to be used for 2205 callbacks. 2207 Session trunking is essentially the association of multiple 2208 connections, each with potentially different target and/or source 2209 network addresses, to the same session. When the target network 2210 addresses (server addresses) of the two connections are the same, the 2211 server MUST support such session trunking. When the target network 2212 addresses are different, the server MAY indicate such support using 2213 the data returned by the EXCHANGE_ID operation (see below). 2215 Client ID trunking is the association of multiple sessions to the 2216 same client ID. Servers MUST support client ID trunking for two 2217 target network addresses whenever they allow session trunking for 2218 those same two network addresses. In addition, a server MAY, by 2219 presenting the same major server owner ID (Section 2.5) and server 2220 scope (Section 2.10.4), allow an additional case of client ID 2221 trunking. When two servers return the same major server owner and 2222 server scope, it means that the two servers are cooperating on 2223 locking state management, which is a prerequisite for client ID 2224 trunking. 2226 Distinguishing when the client is allowed to use session and client 2227 ID trunking requires understanding how the results of the EXCHANGE_ID 2228 (Section 18.35) operation identify a server. 
Suppose a client sends 2229 EXCHANGE_IDs over two different connections, each with a possibly 2230 different target network address, but each EXCHANGE_ID operation has 2231 the same value in the eia_clientowner field. If the same NFSv4.1 2232 server is listening over each connection, then each EXCHANGE_ID 2233 result MUST return the same values of eir_clientid, 2234 eir_server_owner.so_major_id, and eir_server_scope. The client can 2235 then treat each connection as referring to the same server (subject 2236 to verification; see Section 2.10.5.1 below), and it can use each 2237 connection to trunk requests and replies. The client's choice is 2238 whether session trunking or client ID trunking applies. 2240 Session Trunking. If the eia_clientowner argument is the same in two 2241 different EXCHANGE_ID requests, and the eir_clientid, 2242 eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and 2243 eir_server_scope results match in both EXCHANGE_ID results, then 2244 the client is permitted to perform session trunking. If the 2245 client has no session mapping to the tuple of eir_clientid, 2246 eir_server_owner.so_major_id, eir_server_scope, and 2247 eir_server_owner.so_minor_id, then it creates the session via a 2248 CREATE_SESSION operation over one of the connections, which 2249 associates the connection to the session. If there is a session 2250 for the tuple, the client can send BIND_CONN_TO_SESSION to 2251 associate the connection to the session. 2253 Of course, if the client does not desire to use session trunking, 2254 it is not required to do so. It can invoke CREATE_SESSION on the 2255 connection. This will result in client ID trunking as described 2256 below. It can also decide to drop the connection if it does not 2257 choose to use trunking. 2259 Client ID Trunking. 
If the eia_clientowner argument is the same in 2260 two different EXCHANGE_ID requests, and the eir_clientid, 2261 eir_server_owner.so_major_id, and eir_server_scope results match 2262 in both EXCHANGE_ID results, then the client is permitted to 2263 perform client ID trunking (regardless of whether the 2264 eir_server_owner.so_minor_id results match). The client can 2265 associate each connection with different sessions, where each 2266 session is associated with the same server. 2268 The client completes the act of client ID trunking by invoking 2269 CREATE_SESSION on each connection, using the same client ID that 2270 was returned in eir_clientid. These invocations create two 2271 sessions and also associate each connection with its respective 2272 session. The client is free to decline to use client ID trunking 2273 by simply dropping the connection at this point. 2275 When doing client ID trunking, locking state is shared across 2276 sessions associated with that same client ID. This requires the 2277 server to coordinate state across sessions and the client to be 2278 able to associate the same locking state with multiple sessions. 2280 It is always possible that, as a result of various sorts of 2281 reconfiguration events, eir_server_scope and eir_server_owner values 2282 may be different on subsequent EXCHANGE_ID requests made to the same 2283 network address. 2285 In most cases, such reconfiguration events will be disruptive and 2286 indicate that an IP address formerly connected to one server is now 2287 connected to an entirely different one. 2289 Some guidelines on client handling of such situations follow: 2291 o When eir_server_scope changes, the client has no assurance that 2292 any IDs it obtained previously (e.g., filehandles) can be validly 2293 used on the new server, and, even if the new server accepts them, 2294 there is no assurance that this is not due to accident.
Thus, it 2295 is best to treat all such state as lost/stale although a client 2296 may assume that the probability of inadvertent acceptance is low 2297 and treat this situation as within the next case. 2299 o When eir_server_scope remains the same and 2300 eir_server_owner.so_major_id changes, the client can use the 2301 filehandles it has, consider its locking state lost, and attempt 2302 to reclaim or otherwise re-obtain its locks. It might find that 2303 its file handle is now stale. However, if NFS4ERR_STALE is not 2304 returned, it can proceed to reclaim or otherwise re-obtain its 2305 open locking state. 2307 o When eir_server_scope and eir_server_owner.so_major_id remain the 2308 same, the client has to use the now-current values of 2309 eir_server_owner.so_minor_id in deciding on appropriate forms of 2310 trunking. This may result in connections being dropped or new 2311 sessions being created. 2313 2.10.5.1. Verifying Claims of Matching Server Identity 2315 When the server responds using two different connections claiming 2316 matching or partially matching eir_server_owner, eir_server_scope, 2317 and eir_clientid values, the client does not have to trust the 2318 servers' claims. The client may verify these claims before trunking 2319 traffic in the following ways: 2321 o For session trunking, clients SHOULD reliably verify if 2322 connections between different network paths are in fact associated 2323 with the same NFSv4.1 server and usable on the same session, and 2324 servers MUST allow clients to perform reliable verification. When 2325 a client ID is created, the client SHOULD specify that 2326 BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or 2327 SP4_MACH_CRED (Section 18.35) state protection options. For 2328 SP4_SSV, reliable verification depends on a shared secret (the 2329 SSV) that is established via the SET_SSV (see Section 18.47) 2330 operation. 
2332 When a new connection is associated with the session (via the 2333 BIND_CONN_TO_SESSION operation, see Section 18.34), if the client 2334 specified SP4_SSV state protection for the BIND_CONN_TO_SESSION 2335 operation, the client MUST send the BIND_CONN_TO_SESSION with 2336 RPCSEC_GSS protection, using integrity or privacy, and an 2337 RPCSEC_GSS handle created with the GSS SSV mechanism (see 2338 Section 2.10.9). 2340 If the client mistakenly tries to associate a connection to a 2341 session of a wrong server, the server will either reject the 2342 attempt because it is not aware of the session identifier of the 2343 BIND_CONN_TO_SESSION arguments, or it will reject the attempt 2344 because the RPCSEC_GSS authentication fails. Even if the server 2345 mistakenly or maliciously accepts the connection association 2346 attempt, the RPCSEC_GSS verifier it computes in the response will 2347 not be verified by the client, so the client will know it cannot 2348 use the connection for trunking the specified session. 2350 If the client specified SP4_MACH_CRED state protection, the 2351 BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or 2352 privacy, using the same credential that was used when the client 2353 ID was created. Mutual authentication via RPCSEC_GSS assures the 2354 client that the connection is associated with the correct session 2355 of the correct server. 2357 o For client ID trunking, the client has at least two options for 2358 verifying that the same client ID obtained from two different 2359 EXCHANGE_ID operations came from the same server. The first 2360 option is to use RPCSEC_GSS authentication when sending each 2361 EXCHANGE_ID operation. Each time an EXCHANGE_ID is sent with 2362 RPCSEC_GSS authentication, the client notes the principal name of 2363 the GSS target. 
If the EXCHANGE_ID results indicate that client 2364 ID trunking is possible, and the GSS targets' principal names are 2365 the same, the servers are the same and client ID trunking is 2366 allowed. 2368 The second option for verification is to use SP4_SSV protection. 2369 When the client sends EXCHANGE_ID, it specifies SP4_SSV 2370 protection. The first EXCHANGE_ID the client sends always has to 2371 be confirmed by a CREATE_SESSION call. The client then sends 2372 SET_SSV. Later, the client sends EXCHANGE_ID to a second 2373 destination network address different from the one the first 2374 EXCHANGE_ID was sent to. The client checks that each EXCHANGE_ID 2375 reply has the same eir_clientid, eir_server_owner.so_major_id, and 2376 eir_server_scope. If so, the client verifies the claim by sending 2377 a CREATE_SESSION operation to the second destination address, 2378 protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle 2379 returned by the second EXCHANGE_ID. If the server accepts the 2380 CREATE_SESSION request, and if the client verifies the RPCSEC_GSS 2381 verifier and integrity codes, then the client has proof the second 2382 server knows the SSV, and thus the two servers are cooperating for 2383 the purposes of specifying server scope and client ID trunking. 2385 2.10.6. Exactly Once Semantics 2387 Via the session, NFSv4.1 offers exactly once semantics (EOS) for 2388 requests sent over a channel. EOS is supported on both the fore 2389 channel and backchannel. 2391 Each COMPOUND or CB_COMPOUND request that is sent with a leading 2392 SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver 2393 exactly once. This requirement holds regardless of whether the 2394 request is sent with reply caching specified (see 2395 Section 2.10.6.1.3). The requirement holds even if the requester is 2396 sending the request over a session created between a pNFS data client 2397 and pNFS data server. 
To understand the rationale for this 2398 requirement, divide the requests into three classifications: 2400 o Non-idempotent requests. 2402 o Idempotent modifying requests. 2404 o Idempotent non-modifying requests. 2406 An example of a non-idempotent request is RENAME. Obviously, if a 2407 replier executes the same RENAME request twice, and the first 2408 execution succeeds, the re-execution will fail. If the replier 2409 returns the result from the re-execution, this result is incorrect. 2410 Therefore, EOS is required for non-idempotent requests. 2412 An example of an idempotent modifying request is a COMPOUND request 2413 containing a WRITE operation. Repeated execution of the same WRITE 2414 has the same effect as execution of that WRITE a single time. 2415 Nevertheless, enforcing EOS for WRITEs and other idempotent modifying 2416 requests is necessary to avoid data corruption. 2418 Suppose a client sends WRITE A to a noncompliant server that does not 2419 enforce EOS, and receives no response, perhaps due to a network 2420 partition. The client reconnects to the server and re-sends WRITE A. 2421 Now, the server has two outstanding instances of A. The server can 2422 be in a situation in which it executes and replies to the retry of A, 2423 while the first A is still waiting in the server's internal I/O 2424 system for some resource. Upon receiving the reply to the second 2425 attempt of WRITE A, the client believes its WRITE is done, so it is 2426 free to send WRITE B, which overlaps the byte-range of A. When the 2427 original A is dispatched from the server's I/O system and executed 2428 (thus the second time A will have been written), then what has been 2429 written by B can be overwritten and thus corrupted. 2431 An example of an idempotent non-modifying request is a COMPOUND 2432 containing SEQUENCE, PUTFH, READLINK, and nothing else. The re- 2433 execution of such a request will not cause data corruption or produce 2434 an incorrect result.
Nonetheless, to keep the implementation simple, 2435 the replier MUST enforce EOS for all requests, whether or not 2436 idempotent and non-modifying. 2438 Note that true and complete EOS is not possible unless the server 2439 persists the reply cache in stable storage, and unless the server is 2440 somehow implemented to never require a restart (indeed, if such a 2441 server exists, the distinction between a reply cache kept in stable 2442 storage versus one that is not is one without meaning). See 2443 Section 2.10.6.5 for a discussion of persistence in the reply cache. 2444 Regardless, even if the server does not persist the reply cache, EOS 2445 improves robustness and correctness over previous versions of NFS 2446 because the legacy duplicate request/reply caches were based on the 2447 ONC RPC transaction identifier (XID). Section 2.10.6.1 explains the 2448 shortcomings of the XID as a basis for a reply cache and describes 2449 how NFSv4.1 sessions improve upon the XID. 2451 2.10.6.1. Slot Identifiers and Reply Cache 2453 The RPC layer provides a transaction ID (XID), which, while required 2454 to be unique, is not convenient for tracking requests for two 2455 reasons. First, the XID is only meaningful to the requester; it 2456 cannot be interpreted by the replier except to test for equality with 2457 previously sent requests. When consulting an RPC-based duplicate 2458 request cache, the opaqueness of the XID requires a computationally 2459 expensive look up (often via a hash that includes XID and source 2460 address). NFSv4.1 requests use a non-opaque slot ID, which is an 2461 index into a slot table, which is far more efficient. Second, 2462 because RPC requests can be executed by the replier in any order, 2463 there is no bound on the number of requests that may be outstanding 2464 at any time. To achieve perfect EOS, using ONC RPC would require 2465 storing all replies in the reply cache. 
XIDs are 32 bits; storing over four billion (2^32) replies in the reply cache is not practical.  In practice, previous versions of NFS have chosen to store a fixed number of replies in the cache, and to use a least recently used (LRU) approach to replacing cache entries with new entries when the cache is full.  In NFSv4.1, the number of outstanding requests is bounded by the size of the slot table, and a sequence ID per slot is used to tell the replier when it is safe to delete a cached reply.

In the NFSv4.1 reply cache, when the requester sends a new request, it selects a slot ID in the range 0..N, where N is the replier's current maximum slot ID granted to the requester on the session over which the request is to be sent.  The value of N starts out as equal to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the response to SEQUENCE or CB_SEQUENCE as described later in this section.  The slot ID must be unused by any of the requests that the requester has already active on the session.  "Unused" here means the requester has no outstanding request for that slot ID.

A slot contains a sequence ID and the cached reply corresponding to the request sent with that sequence ID.  The sequence ID is a 32-bit unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - 1).  The first time a slot is used, the requester MUST specify a sequence ID of one (Section 18.36).  Each time a slot is reused, the request MUST specify a sequence ID that is one greater than that of the previous request on the slot.  If the previous sequence ID was 0xFFFFFFFF, then the next request for the slot MUST have the sequence ID set to zero (i.e., (2^32 - 1) + 1 mod 2^32).

The sequence ID accompanies the slot ID in each request.
It is for the critical check at the replier: it is used to efficiently determine whether a request using a certain slot ID is a retransmit or a new, never-before-seen request.  It is not feasible for the requester to assert that it is retransmitting to implement this, because for any given request the requester cannot know whether the replier has seen it unless the replier actually replies.  Of course, if the requester has seen the reply, the requester would not retransmit.

The replier compares each received request's sequence ID with the last one previously received for that slot ID, to see if the new request is:

o  A new request, in which the sequence ID is one greater than that previously seen in the slot (accounting for sequence wraparound).  The replier proceeds to execute the new request, and the replier MUST increase the slot's sequence ID by one.

o  A retransmitted request, in which the sequence ID is equal to that currently recorded in the slot.  If the original request has executed to completion, the replier returns the cached reply.  See Section 2.10.6.2 for direction on how the replier deals with retries of requests that are still in progress.

o  A misordered retry, in which the sequence ID is less than (accounting for sequence wraparound) that previously seen in the slot.  The replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

o  A misordered new request, in which the sequence ID is two or more than (accounting for sequence wraparound) that previously seen in the slot.  Note that because the sequence ID MUST wrap around to zero once it reaches 0xFFFFFFFF, a misordered new request and a misordered retry cannot be distinguished.  Thus, the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).
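The comparison above reduces to simple modular arithmetic on the 32-bit sequence ID.  The following sketch (illustrative only; the function and constant names are not part of this specification) shows how a replier might classify an incoming sequence ID against the value recorded in the slot, with the two misordered cases collapsing into the single NFS4ERR_SEQ_MISORDERED outcome:

```python
# Illustrative replier-side check (hypothetical names, not protocol XDR).
MASK32 = 0xFFFFFFFF  # sequence IDs are 32-bit unsigned values

def classify(slot_seqid: int, req_seqid: int) -> str:
    """Classify a request's sequence ID against the slot's recorded one."""
    if req_seqid == slot_seqid:
        # Retransmit: replay the cached reply if execution completed.
        return "RETRY"
    if req_seqid == (slot_seqid + 1) & MASK32:
        # New request: execute it and advance the slot's sequence ID,
        # wrapping 0xFFFFFFFF to zero.
        return "NEW"
    # Lower than, or two-or-more greater than, the recorded value:
    # indistinguishable after wraparound, so both are rejected alike.
    return "NFS4ERR_SEQ_MISORDERED"
```

Note how the wraparound case makes a request with sequence ID zero following 0xFFFFFFFF a valid new request.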
Unlike the XID, the slot ID is always within a specific range; this has two implications.  The first implication is that for a given session, the replier need only cache the results of a limited number of COMPOUND requests.  The second implication derives from the first, which is that unlike XID-indexed reply caches (also known as duplicate request caches - DRCs), the slot ID-based reply cache cannot be overflowed.  Through use of the sequence ID to identify retransmitted requests, the replier does not need to actually cache the request itself, reducing the storage requirements of the reply cache further.  These facilities make it practical to maintain all the required entries for an effective reply cache.

The slot ID, sequence ID, and session ID therefore take over the traditional role of the XID and source network address in the replier's reply cache implementation.  This approach is considerably more portable and completely robust -- it is not subject to the reassignment of ports as clients reconnect over IP networks.  In addition, the RPC XID is not used in the reply cache, enhancing robustness of the cache in the face of any rapid reuse of XIDs by the requester.  While the replier does not care about the XID for the purposes of reply cache management (but the replier MUST return the same XID that was in the request), nonetheless there are considerations for the XID in NFSv4.1 that are the same as in all other previous versions of NFS.  The RPC XID remains in each message and needs to be formulated in NFSv4.1 requests as in any other ONC RPC request.  The reasons include:

o  The RPC layer retains its existing semantics and implementation.

o  The requester and replier must be able to interoperate at the RPC layer, prior to the NFSv4.1 decoding of the SEQUENCE or CB_SEQUENCE operation.
o  If an operation is being used that does not start with SEQUENCE or CB_SEQUENCE (e.g., BIND_CONN_TO_SESSION), then the RPC XID is needed for correct operation to match the reply to the request.

o  The SEQUENCE or CB_SEQUENCE operation may generate an error.  If so, the embedded slot ID, sequence ID, and session ID (if present) in the request will not be in the reply, and the requester has only the XID to match the reply to the request.

Given that well-formulated XIDs continue to be required, this raises the question: why do SEQUENCE and CB_SEQUENCE replies have a session ID, slot ID, and sequence ID?  Having the session ID in the reply means that the requester does not have to use the XID to look up the session ID, which would be necessary if the connection were associated with multiple sessions.  Having the slot ID and sequence ID in the reply means that the requester does not have to use the XID to look up the slot ID and sequence ID.  Furthermore, since the XID is only 32 bits, it is too small to guarantee the re-association of a reply with its request [43]; having the session ID, slot ID, and sequence ID in the reply allows the client to validate that the reply in fact belongs to the matched request.

The SEQUENCE (and CB_SEQUENCE) operation also carries a "highest_slotid" value, which carries additional requester slot usage information.  The requester MUST always indicate the slot ID representing the outstanding request with the highest-numbered slot value.  The requester should in all cases provide the most conservative value possible, although it can be increased somewhat above the actual instantaneous usage to maintain some minimum or optimal level.  This provides a way for the requester to yield unused request slots back to the replier, which in turn can use the information to reallocate resources.
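The slot-usage reporting just described can be sketched as follows (illustrative Python; the class and method names are hypothetical, not taken from this specification).  A requester acquires the lowest free slot for each new request and derives the highest_slotid value to place in its SEQUENCE arguments from the highest slot currently in use:

```python
# Hypothetical requester-side slot bookkeeping.  A real client also tracks
# per-slot sequence IDs and the replier's target/enforced highest_slotid.
class SlotTable:
    def __init__(self, ca_maxrequests: int):
        # One flag per slot; True means an outstanding (unanswered) request.
        self.in_use = [False] * ca_maxrequests

    def acquire(self) -> int:
        # Use the lowest available slot, helping the replier retire
        # high-numbered slot entries sooner.
        for slotid, busy in enumerate(self.in_use):
            if not busy:
                self.in_use[slotid] = True
                return slotid
        raise RuntimeError("no free slot: wait for an outstanding reply")

    def highest_slotid(self) -> int:
        # The value to report in the SEQUENCE arguments: the highest slot
        # with an outstanding request (0 if the table is idle).
        return max((i for i, busy in enumerate(self.in_use) if busy),
                   default=0)

    def release(self, slotid: int) -> None:
        # Called once the reply for this slot has been received.
        self.in_use[slotid] = False
```

Reporting the most conservative (lowest accurate) highest_slotid lets the replier reallocate the slots the requester is not using.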
The replier responds with both a new target highest_slotid and an enforced highest_slotid, described as follows:

o  The target highest_slotid is an indication to the requester of the highest_slotid the replier wishes the requester to be using.  This permits the replier to withdraw (or add) resources from a requester that has been found to not be using them, in order to more fairly share resources among a varying level of demand from other requesters.  The requester must always comply with the replier's value updates, since they indicate newly established hard limits on the requester's access to session resources.  However, because of request pipelining, the requester may have active requests in flight reflecting prior values; therefore, the replier must not immediately require the requester to comply.

o  The enforced highest_slotid indicates the highest slot ID the requester is permitted to use on a subsequent SEQUENCE or CB_SEQUENCE operation.  The replier's enforced highest_slotid SHOULD be no less than the highest_slotid the requester indicated in the SEQUENCE or CB_SEQUENCE arguments.

   A requester can be intransigent with respect to lowering its highest_slotid argument to a Sequence operation, i.e., the requester continues to ignore the target highest_slotid in the response to a Sequence operation, and continues to set its highest_slotid argument to be higher than the target highest_slotid.  This can be considered particularly egregious behavior when the replier knows there are no outstanding requests with slot IDs higher than its target highest_slotid.  When faced with such intransigence, the replier is free to take more forceful action, and MAY reply with a new enforced highest_slotid that is less than its previous enforced highest_slotid.
Thereafter, if the requester continues to send requests with a highest_slotid that is greater than the replier's new enforced highest_slotid, the server MAY return NFS4ERR_BAD_HIGH_SLOT, unless the slot ID in the request is greater than the new enforced highest_slotid and the request is a retry.

   The replier SHOULD retain the slots it wants to retire until the requester sends a request with a highest_slotid less than or equal to the replier's new enforced highest_slotid.

   The requester can also be intransigent with respect to sending non-retry requests that have a slot ID that exceeds the replier's highest_slotid.  Once the replier has forcibly lowered the enforced highest_slotid, the requester is only allowed to send retries on slots that exceed the replier's highest_slotid.  If a request is received with a slot ID that is higher than the new enforced highest_slotid, and the sequence ID is one higher than what is in the slot's reply cache, then the server can both retire the slot and return NFS4ERR_BADSLOT (however, the server MUST NOT do one and not the other).  The reason it is safe to retire the slot is that by using the next sequence ID, the requester is indicating it has received the previous reply for the slot.

o  The requester SHOULD use the lowest available slot when sending a new request.  This way, the replier may be able to retire slot entries faster.  However, where the replier is actively adjusting its granted highest_slotid, it will not be able to use only the receipt of the slot ID and highest_slotid in the request.  Neither the slot ID nor the highest_slotid used in a request may reflect the replier's current idea of the requester's session limit, because the request may have been sent from the requester before the update was received.
Therefore, in the downward adjustment case, the replier may have to retain a number of reply cache entries at least as large as the old value of maximum requests outstanding, until it can infer that the requester has seen a reply containing the new granted highest_slotid.  The replier can infer that the requester has seen such a reply when it receives a new request with the same slot ID as the request replied to and the next higher sequence ID.

2.10.6.1.1.  Caching of SEQUENCE and CB_SEQUENCE Replies

When a SEQUENCE or CB_SEQUENCE operation is successfully executed, its reply MUST always be cached.  Specifically, the session ID, sequence ID, and slot ID MUST be cached in the reply cache.  The reply from SEQUENCE also includes the highest slot ID, target highest slot ID, and status flags.  Instead of caching these values, the server MAY re-compute the values from the current state of the fore channel, session, and/or client ID as appropriate.  Similarly, the reply from CB_SEQUENCE includes a highest slot ID and target highest slot ID.  The client MAY re-compute the values from the current state of the session as appropriate.

Regardless of whether or not a replier is re-computing the highest slot ID, target slot ID, and status on replies to retries, the requester MUST NOT assume that the values are being re-computed whenever it receives a reply after a retry is sent, since it has no way of knowing whether the reply it has received was sent by the replier in response to the retry or is a delayed response to the original request.  Therefore, it may be the case that the highest slot ID, target slot ID, or status bits may reflect the state of affairs when the request was first executed.  Although acting based on such delayed information is valid, it may cause the receiver of the reply to do unneeded work.
Requesters MAY choose to send additional requests to get the current state of affairs or use the state of affairs reported by subsequent requests, in preference to acting immediately on data that might be out of date.

2.10.6.1.2.  Errors from SEQUENCE and CB_SEQUENCE

Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of the slot MUST NOT change.  The replier MUST NOT modify the reply cache entry for the slot whenever an error is returned from SEQUENCE or CB_SEQUENCE.

2.10.6.1.3.  Optional Reply Caching

On a per-request basis, the requester can choose to direct the replier to cache the reply to all operations after the first operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE.  The reason it would not direct the replier to cache the entire reply is that the request is composed of all idempotent operations [40].  Caching the reply may offer little benefit.  If the reply is too large (see Section 2.10.6.4), it may not be cacheable anyway.  Even if the reply to an idempotent request is small enough to cache, unnecessarily caching the reply slows down the server and increases RPC latency.

Whether or not the requester requests the reply to be cached has no effect on the slot processing.  If the result of SEQUENCE or CB_SEQUENCE is NFS4_OK, then the slot's sequence ID MUST be incremented by one.  If a requester does not direct the replier to cache the reply, the replier MUST do one of the following:

o  The replier can cache the entire original reply.  Even though sa_cachethis or csa_cachethis is FALSE, the replier is always free to cache.  It may choose this approach in order to simplify implementation.
o  The replier enters into its reply cache a reply consisting of the original results to the SEQUENCE or CB_SEQUENCE operation, and with the next operation in COMPOUND or CB_COMPOUND having the error NFS4ERR_RETRY_UNCACHED_REP.  Thus, if the requester later retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP.  If a replier receives a retried Sequence operation where the reply to the COMPOUND or CB_COMPOUND was not cached, then the replier,

   *  MAY return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is not the first operation (granted, a requester that does so is in violation of the NFSv4.1 protocol).

   *  MUST NOT return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is the first operation.

o  If the second operation is an illegal operation, or an operation that was legal in a previous minor version of NFSv4 and MUST NOT be supported in the current minor version (e.g., SETCLIENTID), the replier MUST NOT ever return NFS4ERR_RETRY_UNCACHED_REP.  Instead the replier MUST return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or NFS4ERR_NOTSUPP as appropriate.

o  If the second operation can result in another error status, the replier MAY return a status other than NFS4ERR_RETRY_UNCACHED_REP, provided the operation is not executed in such a way that the state of the replier is changed.  Examples of such an error status include: NFS4ERR_NOTSUPP returned for an operation that is legal but not REQUIRED in the current minor version, and thus not supported by the replier; NFS4ERR_SEQUENCE_POS; and NFS4ERR_REQ_TOO_BIG.

The discussion above assumes that the retried request matches the original one.  Section 2.10.6.1.3.1 discusses what the replier might do, and MUST do, when the original and retried requests do not match.
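The handling of an uncached reply described in the second bullet above can be sketched as follows (illustrative Python; real implementations operate on XDR-encoded operation results, for which plain tuples stand in here, and all names are hypothetical).  When caching was not requested, the replier records only the Sequence result followed by a stub carrying NFS4ERR_RETRY_UNCACHED_REP, which is what a later retry will see:

```python
# Hypothetical sketch: build the reply-cache entry for a slot depending on
# whether the requester set sa_cachethis/csa_cachethis.
NFS4ERR_RETRY_UNCACHED_REP = "NFS4ERR_RETRY_UNCACHED_REP"  # symbolic stand-in

def reply_cache_entry(sequence_result, full_reply, cachethis: bool):
    if cachethis:
        # Caching was requested: the entire reply is stored and will be
        # replayed verbatim on a retry.
        return list(full_reply)
    # Caching was not requested: store the Sequence result plus a stub,
    # so a retry receives NFS4ERR_RETRY_UNCACHED_REP for the next operation.
    return [sequence_result, ("NEXT_OP", NFS4ERR_RETRY_UNCACHED_REP)]
```

A replier that prefers simplicity may instead always cache the full reply, as the first bullet permits.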
Since the replier may only cache a small amount of the information that would be required to determine whether this is a case of a false retry, the replier may send to the client any of the following responses:

o  The cached reply to the original request (if the replier has cached it in its entirety and the users of the original request and retry match).

o  A reply that consists only of the Sequence operation with the error NFS4ERR_FALSE_RETRY.

o  A reply consisting of the response to Sequence with the status NFS4_OK, together with the second operation as it appeared in the retried request with an error of NFS4ERR_RETRY_UNCACHED_REP or other error as described above.

o  A reply that consists of the response to Sequence with the status NFS4_OK, together with the second operation as it appeared in the original request with an error of NFS4ERR_RETRY_UNCACHED_REP or other error as described above.

2.10.6.1.3.1.  False Retry

If a requester sent a Sequence operation with a slot ID and sequence ID that are in the reply cache, but the replier detects that the retried request is not the same as the original request -- including a retry that has different operations or different arguments in the operations from the original, and a retry that uses a different principal in the RPC request's credential field that translates to a different user -- then this is a false retry.  When the replier detects a false retry, it is permitted (but not always obligated) to return NFS4ERR_FALSE_RETRY in response to the Sequence operation.

Translations of particularly privileged user values to other users due to the lack of appropriately secure credentials, as configured on the replier, should be applied before determining whether the users are the same or different.
If the replier determines the users are different between the original request and a retry, then the replier MUST return NFS4ERR_FALSE_RETRY.

If an operation of the retry is an illegal operation, or an operation that was legal in a previous minor version of NFSv4 and MUST NOT be supported in the current minor version (e.g., SETCLIENTID), the replier MAY return NFS4ERR_FALSE_RETRY (and MUST do so if the users of the original request and retry differ).  Otherwise, the replier MAY return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or NFS4ERR_NOTSUPP as appropriate.  Note that the handling here is in contrast to how the replier deals with retries of requests with no cached reply.  The difference is due to NFS4ERR_FALSE_RETRY being a valid error only for Sequence operations, whereas NFS4ERR_RETRY_UNCACHED_REP is a valid error for all operations except illegal operations and operations that MUST NOT be supported in the current minor version of NFSv4.

2.10.6.2.  Retry and Replay of Reply

A requester MUST NOT retry a request, unless the connection it used to send the request disconnects.  The requester can then reconnect and re-send the request, or it can re-send the request over a different connection that is associated with the same session.

If the requester is a server wanting to re-send a callback operation over the backchannel of a session, the requester of course cannot reconnect because only the client can associate connections with the backchannel.  The server can re-send the request over another connection that is bound to the same session's backchannel.  If there is no such connection, the server MUST indicate that the session has no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag bit in the response to the next SEQUENCE operation from the client.
The client MUST then associate a connection with the session (or destroy the session).

Note that it is not fatal for a requester to retry without a disconnect between the request and retry.  However, the retry does consume resources, especially with RDMA, where each request, retry or not, consumes a credit.  Retries for no reason, especially retries sent shortly after the previous attempt, are a poor use of network bandwidth and defeat the purpose of a transport's inherent congestion control system.

A requester MUST wait for a reply to a request before using the slot for another request.  If it does not wait for a reply, then the requester does not know what sequence ID to use for the slot on its next request.  For example, suppose a requester sends a request with sequence ID 1, and does not wait for the response.  The next time it uses the slot, it sends the new request with sequence ID 2.  If the replier has not seen the request with sequence ID 1, then the replier is not expecting sequence ID 2, and rejects the requester's new request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

RDMA fabrics do not guarantee that the memory handles (Steering Tags) within each RPC/RDMA "chunk" [32] are valid on a scope outside that of a single connection.  Therefore, handles used by the direct operations become invalid after connection loss.  The server must ensure that any RDMA operations that must be replayed from the reply cache use the newly provided handle(s) from the most recent request.

A retry might be sent while the original request is still in progress on the replier.  The replier SHOULD deal with the issue by returning NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE operation, but implementations MAY return NFS4ERR_SEQ_MISORDERED.
Since errors from SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this approach allows the results of the execution of the original request to be properly recorded in the reply cache (assuming that the requester specified the reply to be cached).

2.10.6.3.  Resolving Server Callback Races

It is possible for server callbacks to arrive at the client before the reply from related fore channel operations.  For example, a client may have been granted a delegation to a file it has opened, but the reply to the OPEN (informing the client of the granting of the delegation) may be delayed in the network.  If a conflicting operation arrives at the server, it will recall the delegation using the backchannel, which may be on a different transport connection, perhaps even a different network, or even a different session associated with the same client ID.

The presence of a session between the client and server alleviates this issue.  When a session is in place, each client request is uniquely identified by its { session ID, slot ID, sequence ID } triple.  By the rules under which slot entries (reply cache entries) are retired, the server has knowledge of whether the client has "seen" each of the server's replies.  The server can therefore provide sufficient information to the client to allow it to disambiguate an erroneous or conflicting callback race condition.

For each client operation that might result in some sort of server callback, the server SHOULD "remember" the { session ID, slot ID, sequence ID } triple of the client request until the slot ID retirement rules allow the server to determine that the client has, in fact, seen the server's reply.
Until the time the { session ID, slot ID, sequence ID } request triple can be retired, any recalls of the associated object MUST carry an array of these referring identifiers (in the CB_SEQUENCE operation's arguments), for the benefit of the client.  After this time, it is not necessary for the server to provide this information in related callbacks, since it is certain that a race condition can no longer occur.

The CB_SEQUENCE operation that begins each server callback carries a list of "referring" { session ID, slot ID, sequence ID } triples.  If the client finds the request corresponding to the referring session ID, slot ID, and sequence ID to be currently outstanding (i.e., the server's reply has not been seen by the client), it can determine that the callback has raced the reply, and act accordingly.  If the client does not find the request corresponding to the referring triple to be outstanding (including the case of a session ID referring to a destroyed session), then there is no race with respect to this triple.  The server SHOULD limit the referring triples to requests that refer to just those that apply to the objects referred to in the CB_COMPOUND procedure.

The client must not simply wait forever for the expected server reply to arrive before responding to the CB_COMPOUND that won the race, because it is possible that it will be delayed indefinitely.  The client should assume the likely case that the reply will arrive within the average round-trip time for COMPOUND requests to the server, and wait that period of time.  If that period of time expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY.  There are other scenarios under which callbacks may race replies.  Among them are pNFS layout recalls as described in Section 12.5.5.2.

2.10.6.4.  COMPOUND and CB_COMPOUND Construction Issues

Very large requests and replies may pose both buffer management issues (especially with RDMA) and reply cache issues.  When the session is created (Section 18.36), for each channel (fore and back), the client and server negotiate the maximum-sized request they will send or process (ca_maxrequestsize), the maximum-sized reply they will return or process (ca_maxresponsesize), and the maximum-sized reply they will store in the reply cache (ca_maxresponsesize_cached).

If a request exceeds ca_maxrequestsize, the reply will have the status NFS4ERR_REQ_TOO_BIG.  A replier MAY return NFS4ERR_REQ_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request (which means that no operations in the request executed and that the state of the slot in the reply cache is unchanged), or it MAY opt to return it on a subsequent operation in the same COMPOUND or CB_COMPOUND request (which means that at least one operation did execute and that the state of the slot in the reply cache does change).  The replier SHOULD set NFS4ERR_REQ_TOO_BIG on the operation that exceeds ca_maxrequestsize.

If a reply exceeds ca_maxresponsesize, the reply will have the status NFS4ERR_REP_TOO_BIG.  A replier MAY return NFS4ERR_REP_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, or it MAY opt to return it on a subsequent operation (in the same COMPOUND or CB_COMPOUND reply).  A replier MAY return NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if the response would still exceed ca_maxresponsesize.

If sa_cachethis or csa_cachethis is TRUE, then the replier MUST cache a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE operation (see Section 2.10.6.1.2).
If the reply exceeds ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis is TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE.  Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) is returned on an operation other than the first operation (SEQUENCE or CB_SEQUENCE), the reply MUST be cached if sa_cachethis or csa_cachethis is TRUE.  For example, if a COMPOUND has eleven operations, including SEQUENCE, the fifth operation is a RENAME, and the tenth operation is a READ for one million bytes, the server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation.  Since the server executed several operations, especially the non-idempotent RENAME, the client's request to cache the reply needs to be honored in order for exactly once semantics to operate correctly.  If the client retries the request, the server will have cached a reply that contains results for ten of the eleven requested operations, with the tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.

A client needs to take care that, when sending operations that change the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH, and RESTOREFH), it does not exceed the maximum reply buffer before the GETFH operation.  Otherwise, the client will have to retry the operation that changed the current filehandle, in order to obtain the desired filehandle.  For the OPEN operation (see Section 18.16), retry is not always available as an option.  The following guidelines for the handling of filehandle-changing operations are advised:

o  Within the same COMPOUND procedure, a client SHOULD send GETFH immediately after a current filehandle-changing operation.  A client MUST send GETFH after a current filehandle-changing operation that is also non-idempotent (e.g., the OPEN operation), unless the operation is RESTOREFH.
RESTOREFH is an exception, because even though it is non-idempotent, the filehandle RESTOREFH produced originated from an operation that is either idempotent (e.g., PUTFH, LOOKUP) or non-idempotent (e.g., OPEN, CREATE).  If the origin is non-idempotent, then because the client MUST send GETFH after the origin operation, the client can recover if RESTOREFH returns an error.

o  A server MAY return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle-changing operation if the reply would be too large on the next operation.

o  A server SHOULD return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle-changing, non-idempotent operation if the reply would be too large on the next operation, especially if the operation is OPEN.

o  A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent current filehandle-changing operation, if it looks at the next operation (in the same COMPOUND procedure) and finds it is not GETFH.  The server SHOULD do this if it is unable to determine in advance whether the total response size would exceed ca_maxresponsesize_cached or ca_maxresponsesize.

2.10.6.5.  Persistence

Since the reply cache is bounded, it is practical for the reply cache to persist across server restarts.  The replier MUST persist the following information if it agreed to persist the session (when the session was created; see Section 18.36):

o  The session ID.

o  The slot table, including the sequence ID and cached reply for each slot.

The above are sufficient for a replier to provide EOS semantics for any requests that were sent and executed before the server restarted.
3031 If the replier is a client, then there is no need for it to persist 3032 any more information, unless the client will be persisting all other 3033 state across client restart, in which case, the server will never see 3034 any NFSv4.1-level protocol manifestation of a client restart. If the 3035 replier is a server, with just the slot table and session ID 3036 persisting, any requests the client retries after the server restart 3037 will return the results that are cached in the reply cache, and any 3038 new requests (i.e., the sequence ID is one greater than the slot's 3039 sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by 3040 SEQUENCE). Such a session is considered dead. A server MAY re- 3041 animate a session after a server restart so that the session will 3042 accept new requests as well as retries. To re-animate a session, the 3043 server needs to persist additional information through server 3044 restart: 3046 o The client ID. This is a prerequisite to let the client create 3047 more sessions associated with the same client ID as the re- 3048 animated session. 3050 o The client ID's sequence ID that is used for creating sessions 3051 (see Sections 18.35 and 18.36). This is a prerequisite to let the 3052 client create more sessions. 3054 o The principal that created the client ID. This allows the server 3055 to authenticate the client when it sends EXCHANGE_ID. 3057 o The SSV, if SP4_SSV state protection was specified when the client 3058 ID was created (see Section 18.35). This lets the client create 3059 new sessions, and associate connections with the new and existing 3060 sessions. 3062 o The properties of the client ID as defined in Section 18.35. 3064 A persistent reply cache places certain demands on the server. The 3065 execution of the sequence of operations (starting with SEQUENCE) and 3066 placement of its results in the persistent cache MUST be atomic. 
If 3067 a client retries a sequence of operations that was previously 3068 executed on the server, the only acceptable outcomes are either the 3069 original cached reply or an indication that the client ID or session 3070 has been lost (indicating a catastrophic loss of the reply cache or a 3071 session that has been deleted because the client failed to use the 3072 session for an extended period of time). 3074 A server could fail and restart in the middle of a COMPOUND procedure 3075 that contains one or more non-idempotent or idempotent-but-modifying 3076 operations. This creates an even higher challenge for atomic 3077 execution and placement of results in the reply cache. One way to 3078 view the problem is as a single transaction consisting of each 3079 operation in the COMPOUND followed by storing the result in 3080 persistent storage, then finally a transaction commit. If there is a 3081 failure before the transaction is committed, then the server rolls 3082 back the transaction. If the server itself fails, then when it 3083 restarts, its recovery logic could roll back the transaction before 3084 starting the NFSv4.1 server. 3086 While the description of the implementation for atomic execution of 3087 the request and caching of the reply is beyond the scope of this 3088 document, an example implementation for NFSv2 [44] is described in 3089 [45]. 3091 2.10.7. RDMA Considerations 3093 A complete discussion of the operation of RPC-based protocols over 3094 RDMA transports is in [32]. A discussion of the operation of NFSv4, 3095 including NFSv4.1, over RDMA is in [33]. Where RDMA is considered, 3096 this specification assumes the use of such a layering; it addresses 3097 only the upper-layer issues relevant to making best use of RPC/RDMA. 3099 2.10.7.1. RDMA Connection Resources 3101 RDMA requires its consumers to register memory and post buffers of a 3102 specific size and number for receive operations. 
3104 Registration of memory can be a relatively high-overhead operation, 3105 since it requires pinning of buffers, assignment of attributes (e.g., 3106 readable/writable), and initialization of hardware translation. 3107 Preregistration is desirable to reduce overhead. These registrations 3108 are specific to hardware interfaces and even to RDMA connection 3109 endpoints; therefore, negotiation of their limits is desirable to 3110 manage resources effectively. 3112 Following basic registration, these buffers must be posted by the RPC 3113 layer to handle receives. These buffers remain in use by the RPC/ 3114 NFSv4.1 implementation; the size and number of them must be known to 3115 the remote peer in order to avoid RDMA errors that would cause a 3116 fatal error on the RDMA connection. 3118 NFSv4.1 manages slots as resources on a per-session basis (see 3119 Section 2.10), while RDMA connections manage credits on a per- 3120 connection basis. This means that in order for a peer to send data 3121 over RDMA to a remote buffer, it has to have both an NFSv4.1 slot and 3122 an RDMA credit. If multiple RDMA connections are associated with a 3123 session, then if the total number of credits across all RDMA 3124 connections associated with the session is X, and the number of slots 3125 in the session is Y, then the maximum number of outstanding requests 3126 is the lesser of X and Y. 3128 2.10.7.2. Flow Control 3130 Previous versions of NFS do not provide flow control; instead, they 3131 rely on the windowing provided by transports like TCP to throttle 3132 requests. This does not work with RDMA, which provides no operation 3133 flow control and will terminate a connection in error when limits are 3134 exceeded. Limits such as maximum number of requests outstanding are 3135 therefore negotiated when a session is created (see the 3136 ca_maxrequests field in Section 18.36). 
These limits then provide 3137 the maxima within which each connection associated with the session's 3138 channel(s) must remain. RDMA connections are managed within these 3139 limits as described in Section 3.3 of [32]; if there are multiple 3140 RDMA connections, then the maximum number of requests for a channel 3141 will be divided among the RDMA connections. Put a different way, the 3142 onus is on the replier to ensure that the total number of RDMA 3143 credits across all connections associated with the replier's channel 3144 does not exceed the channel's maximum number of outstanding requests. 3146 The limits may also be modified dynamically at the replier's choosing 3147 by manipulating certain parameters present in each NFSv4.1 reply. In 3148 addition, the CB_RECALL_SLOT callback operation (see Section 20.8) 3149 can be sent by a server to a client to return RDMA credits to the 3150 server, thereby lowering the maximum number of requests a client can 3151 have outstanding to the server. 3153 2.10.7.3. Padding 3155 Header padding is requested by each peer at session initiation (see 3156 the ca_headerpadsize argument to CREATE_SESSION in Section 18.36), 3157 and subsequently used by the RPC RDMA layer, as described in [32]. 3158 Zero padding is permitted. 3160 Padding leverages the useful property that RDMA transfers preserve alignment of 3161 data, even when they are placed into anonymous (untagged) buffers. 3162 If requested, client inline writes will insert appropriate pad bytes 3163 within the request header to align the data payload on the specified 3164 boundary. The client is encouraged to add sufficient padding (up to 3165 the negotiated size) so that the "data" field of the WRITE operation 3166 is aligned. Most servers can make good use of such padding, which 3167 allows them to chain receive buffers in such a way that any data 3168 carried by client requests will be placed into appropriate buffers at 3169 the server, ready for file system processing. 
The receiver's RPC 3170 layer encounters no overhead from skipping over pad bytes, and the 3171 RDMA layer's high performance makes the insertion and transmission of 3172 padding on the sender a significant optimization. In this way, the 3173 need for servers to perform RDMA Read to satisfy all but the largest 3174 client writes is obviated. An added benefit is the reduction of 3175 message round trips on the network -- a potentially good trade, where 3176 latency is present. 3178 The value to choose for padding is subject to a number of criteria. 3179 A primary source of variable-length data in the RPC header is the 3180 authentication information, the form of which is client-determined, 3181 possibly in response to server specification. The contents of 3182 COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all 3183 go into the determination of a maximal NFSv4.1 request size and 3184 therefore minimal buffer size. The client must select its offered 3185 value carefully, so as to avoid overburdening the server, and vice 3186 versa. The benefit of an appropriate padding value is higher 3187 performance. 3189 Sender gather: 3190 |RPC Request|Pad bytes|Length| -> |User data...| 3191 \------+----------------------/ \ 3192 \ \ 3193 \ Receiver scatter: \-----------+- ... 3194 /-----+----------------\ \ \ 3195 |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->... 3197 In the above case, the server may recycle unused buffers to the next 3198 posted receive if unused by the actual received request, or may pass 3199 the now-complete buffers by reference for normal write processing. 3200 For a server that can make use of it, this removes any need for data 3201 copies of incoming data, without resorting to complicated end-to-end 3202 buffer advertisement and management. This includes most kernel-based 3203 and integrated server designs, among many others. The client may 3204 perform similar optimizations, if desired. 3206 2.10.7.4. 
Dual RDMA and Non-RDMA Transports 3208 Some RDMA transports (e.g., RFC 5040 [8]) permit a "streaming" (non- 3209 RDMA) phase, where ordinary traffic might flow before "stepping up" 3210 to RDMA mode, commencing RDMA traffic. Some RDMA transports start 3211 connections always in RDMA mode. NFSv4.1 allows, but does not 3212 assume, a streaming phase before RDMA mode. When a connection is 3213 associated with a session, the client and server negotiate whether 3214 the connection is used in RDMA or non-RDMA mode (see Sections 18.36 3215 and 18.34). 3217 2.10.8. Session Security 3219 2.10.8.1. Session Callback Security 3221 Via session/connection association, NFSv4.1 improves security over 3222 that provided by NFSv4.0 for the backchannel. The connection is 3223 client-initiated (see Section 18.34) and subject to the same firewall 3224 and routing checks as the fore channel. At the client's option (see 3225 Section 18.35), connection association is fully authenticated before 3226 being activated (see Section 18.34). Traffic from the server over 3227 the backchannel is authenticated exactly as the client specifies (see 3228 Section 2.10.8.2). 3230 2.10.8.2. Backchannel RPC Security 3232 When the NFSv4.1 client establishes the backchannel, it informs the 3233 server of the security flavors and principals to use when sending 3234 requests. If the security flavor is RPCSEC_GSS, the client expresses 3235 the principal in the form of an established RPCSEC_GSS context. The 3236 server is free to use any of the flavor/principal combinations the 3237 client offers, but it MUST NOT use unoffered combinations. This way, 3238 the client need not provide a target GSS principal for the 3239 backchannel as it did with NFSv4.0, nor does the server have to 3240 implement an RPCSEC_GSS initiator as it did with NFSv4.0 [36]. 3242 The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL 3243 (Section 18.33) operations allow the client to specify flavor/ 3244 principal combinations. 
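The backchannel security rule above (the server may use any flavor/principal combination the client offered via CREATE_SESSION or BACKCHANNEL_CTL, but MUST NOT use an unoffered one) can be sketched as follows. This is a non-normative illustration; the function and the data shapes are hypothetical and not part of the protocol.

```python
# Non-normative sketch of Section 2.10.8.2's selection rule.
# A "combination" is modeled as a (flavor, principal) tuple; the
# names here are invented for illustration only.

def select_backchannel_security(offered, server_preference):
    """Return the first server-preferred (flavor, principal) pair
    that the client actually offered; never pick an unoffered one."""
    offered_set = set(offered)
    for combo in server_preference:
        if combo in offered_set:
            return combo
    # No offered combination is acceptable: the server cannot send
    # backchannel requests using an unoffered combination.
    return None

offered = [("RPCSEC_GSS", "ctx-handle-1"), ("AUTH_SYS", "root@client")]
choice = select_backchannel_security(
    offered,
    [("RPCSEC_GSS", "ctx-handle-1"), ("AUTH_NONE", "")])
```

Here `choice` is the RPCSEC_GSS pair, since it is both offered and first in the server's preference list.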
3246 Also note that the SP4_SSV state protection mode (see Sections 18.35 3247 and 2.10.8.3) has the side benefit of providing SSV-derived 3248 RPCSEC_GSS contexts (Section 2.10.9). 3250 2.10.8.3. Protection from Unauthorized State Changes 3252 As described to this point in the specification, the state model of 3253 NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation 3254 with a forged session ID and with a slot ID that it expects the 3255 legitimate client to use next. When the legitimate client uses the 3256 slot ID with the same sequence number, the server returns the 3257 attacker's result from the reply cache, which disrupts the legitimate 3258 client and thus denies service to it. Similarly, an attacker could 3259 send a CREATE_SESSION with a forged client ID to create a new session 3260 associated with the client ID. The attacker could send requests 3261 using the new session that change locking state, such as LOCKU 3262 operations to release locks the legitimate client has acquired. 3263 Setting a security policy on the file that requires RPCSEC_GSS 3264 credentials when manipulating the file's state is one potential work 3265 around, but has the disadvantage of preventing a legitimate client 3266 from releasing state when RPCSEC_GSS is required to do so, but a GSS 3267 context cannot be obtained (possibly because the user has logged off 3268 the client). 3270 NFSv4.1 provides three options to a client for state protection, 3271 which are specified when a client creates a client ID via EXCHANGE_ID 3272 (Section 18.35). 3274 The first (SP4_NONE) is to simply waive state protection. 3276 The other two options (SP4_MACH_CRED and SP4_SSV) share several 3277 traits: 3279 o An RPCSEC_GSS-based credential is used to authenticate client ID 3280 and session maintenance operations, including creating and 3281 destroying a session, associating a connection with the session, 3282 and destroying the client ID. 
3284 o Because RPCSEC_GSS is used to authenticate client ID and session 3285 maintenance, the attacker cannot associate a rogue connection with 3286 a legitimate session, or associate a rogue session with a 3287 legitimate client ID in order to maliciously alter the client ID's 3288 lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc. 3290 o In cases where the server's security policies on a portion of its 3291 namespace require RPCSEC_GSS authentication, a client may have to 3292 use an RPCSEC_GSS credential to remove per-file state (e.g., 3293 LOCKU, CLOSE, etc.). The server may require that the principal 3294 that removes the state match certain criteria (e.g., the principal 3295 might have to be the same as the one that acquired the state). 3296 However, the client might not have an RPCSEC_GSS context for such 3297 a principal, and might not be able to create such a context 3298 (perhaps because the user has logged off). When the client 3299 establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a 3300 list of operations that the server MUST allow using the machine 3301 credential (if SP4_MACH_CRED is used) or the SSV credential (if 3302 SP4_SSV is used). 3304 The SP4_MACH_CRED state protection option uses a machine credential 3305 where the principal that creates the client ID MUST also be the 3306 principal that performs client ID and session maintenance operations. 3307 The security of the machine credential state protection approach 3308 depends entirely on safeguarding the per-machine credential. 3309 Assuming a proper safeguard using the per-machine credential for 3310 operations like CREATE_SESSION, BIND_CONN_TO_SESSION, 3311 DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from 3312 associating a rogue connection with a session, or associating a rogue 3313 session with a client ID. 3315 There are at least three scenarios for the SP4_MACH_CRED option: 3317 1. 
The system administrator configures a unique, permanent per- 3318 machine credential for one of the mandated GSS mechanisms (e.g., 3319 if Kerberos V5 is used, a "keytab" containing a principal derived 3320 from a client host name could be used). 3322 2. The client is used by a single user, and so the client ID and its 3323 sessions are used by just that user. If the user's credential 3324 expires, then session and client ID maintenance cannot occur, but 3325 since the client has a single user, only that user is 3326 inconvenienced. 3328 3. The physical client has multiple users, but the client 3329 implementation has a unique client ID for each user. This is 3330 effectively the same as the second scenario, but a disadvantage 3331 is that each user needs to be allocated at least one session 3332 each, so the approach suffers from lack of economy. 3334 The SP4_SSV protection option uses the SSV (Section 1.7), via 3335 RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect 3336 state from attack. The SP4_SSV protection option is intended for the 3337 situation comprised of a client that has multiple active users and a 3338 system administrator who wants to avoid the burden of installing a 3339 permanent machine credential on each client. The SSV is established 3340 and updated on the server via SET_SSV (see Section 18.47). To 3341 prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS 3342 with the privacy service. Several aspects of the SSV make it 3343 intractable for an attacker to guess the SSV, and thus associate 3344 rogue connections with a session, and rogue sessions with a client 3345 ID: 3347 o The arguments to and results of SET_SSV include digests of the old 3348 and new SSV, respectively. 
3350 o Because the initial value of the SSV is zero, therefore known, the 3351 client that opts for SP4_SSV protection and opts to apply SP4_SSV 3352 protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST send at 3353 least one SET_SSV operation before the first BIND_CONN_TO_SESSION 3354 operation or before the second CREATE_SESSION operation on a 3355 client ID. If it does not, the SSV mechanism will not generate 3356 tokens (Section 2.10.9). A client SHOULD send SET_SSV as soon as 3357 a session is created. 3359 o A SET_SSV request does not replace the SSV with the argument to 3360 SET_SSV. Instead, the current SSV on the server is logically 3361 exclusive ORed (XORed) with the argument to SET_SSV. Each time a 3362 new principal uses a client ID for the first time, the client 3363 SHOULD send a SET_SSV with that principal's RPCSEC_GSS 3364 credentials, with RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY. 3366 Here are the types of attacks that can be attempted by an attacker 3367 named Eve on a victim named Bob, and how SP4_SSV protection foils 3368 each attack: 3370 o Suppose Eve is the first user to log into a legitimate client. 3371 Eve's use of an NFSv4.1 file system will cause the legitimate 3372 client to create a client ID with SP4_SSV protection, specifying 3373 that the BIND_CONN_TO_SESSION operation MUST use the SSV 3374 credential. Eve's use of the file system also causes an SSV to be 3375 created. The SET_SSV operation that creates the SSV will be 3376 protected by the RPCSEC_GSS context created by the legitimate 3377 client, which uses Eve's GSS principal and credentials. Eve can 3378 eavesdrop on the network while her RPCSEC_GSS context is created 3379 and the SET_SSV using her context is sent. Even if the legitimate 3380 client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve 3381 knows her own credentials, she can decrypt the SSV. 
Eve can 3382 compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will 3383 accept, and so associate a new connection with the legitimate 3384 session. Eve can change the slot ID and sequence state of a 3385 legitimate session, and/or the SSV state, in such a way that when 3386 Bob accesses the server via the same legitimate client, the 3387 legitimate client will be unable to use the session. 3389 The client's only recourse is to create a new client ID for Bob to 3390 use, and establish a new SSV for the client ID. The client will 3391 be unable to delete the old client ID, and will let the lease on 3392 the old client ID expire. 3394 Once the legitimate client establishes an SSV over the new session 3395 using Bob's RPCSEC_GSS context, Eve can use the new session via 3396 the legitimate client, but she cannot disrupt Bob. Moreover, 3397 because the client SHOULD have modified the SSV due to Eve using 3398 the new session, Bob cannot get revenge on Eve by associating a 3399 rogue connection with the session. 3401 The question is how did the legitimate client detect that Eve has 3402 hijacked the old session? When the client detects that a new 3403 principal, Bob, wants to use the session, it SHOULD have sent a 3404 SET_SSV, which leads to the following sub-scenarios: 3406 * Let us suppose that from the rogue connection, Eve sent a 3407 SET_SSV with the same slot ID and sequence ID that the 3408 legitimate client later uses. The server will assume the 3409 SET_SSV sent with Bob's credentials is a retry, and return to 3410 the legitimate client the reply it sent Eve. However, unless 3411 Eve can correctly guess the SSV the legitimate client will use, 3412 the digest verification checks in the SET_SSV response will 3413 fail. That is an indication to the client that the session has 3414 apparently been hijacked. 3416 * Alternatively, Eve sent a SET_SSV with a different slot ID than 3417 the legitimate client uses for its SET_SSV. 
Then the digest 3418 verification of the SET_SSV sent with Bob's credentials fails 3419 on the server, and the error returned to the client makes it 3420 apparent that the session has been hijacked. 3422 * Alternatively, Eve sent an operation other than SET_SSV, but 3423 with the same slot ID and sequence that the legitimate client 3424 uses for its SET_SSV. The server returns to the legitimate 3425 client the response it sent Eve. The client sees that the 3426 response is not at all what it expects. The client assumes 3427 either session hijacking or a server bug, and either way 3428 destroys the old session. 3430 o Eve associates a rogue connection with the session as above, and 3431 then destroys the session. Again, Bob goes to use the server from 3432 the legitimate client, which sends a SET_SSV using Bob's 3433 credentials. The client receives an error that indicates that the 3434 session does not exist. When the client tries to create a new 3435 session, this will fail because the SSV it has does not match that 3436 which the server has, and now the client knows the session was 3437 hijacked. The legitimate client establishes a new client ID. 3439 o If Eve creates a connection before the legitimate client 3440 establishes an SSV, because the initial value of the SSV is zero 3441 and therefore known, Eve can send a SET_SSV that will pass the 3442 digest verification check. However, because the new connection 3443 has not been associated with the session, the SET_SSV is rejected 3444 for that reason. 3446 In summary, an attacker's disruption of state when SP4_SSV protection 3447 is in use is limited to the formative period of a client ID, its 3448 first session, and the establishment of the SSV. Once a non- 3449 malicious user uses the client ID, the client quickly detects any 3450 hijack and rectifies the situation. Once a non-malicious user 3451 successfully modifies the SSV, the attacker cannot use NFSv4.1 3452 operations to disrupt the non-malicious user. 
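The SET_SSV update rule that underlies this protection (the current SSV on the server is XORed with the argument to SET_SSV, and the exchange carries digests of the old and new SSV) can be modeled with a toy sketch. This is non-normative: SHA-256 stands in for the one-way hash negotiated via EXCHANGE_ID, and the class and field names are invented for illustration.

```python
import hashlib

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

class SsvState:
    """Toy model of one side's SSV: starts at zero (hence known to
    an attacker until the first SET_SSV), and each SET_SSV XORs the
    argument into the current value."""
    def __init__(self, length=32):
        self.ssv = bytes(length)  # initial SSV is all zeros

    def set_ssv(self, arg):
        # Return digests of the old and new SSV, mirroring the
        # digests carried in SET_SSV arguments and results that let
        # each peer verify the other's view of the SSV.
        old_digest = hashlib.sha256(self.ssv).digest()
        self.ssv = xor_bytes(self.ssv, arg)
        new_digest = hashlib.sha256(self.ssv).digest()
        return old_digest, new_digest

server = SsvState()
client_view = SsvState()
arg = bytes(range(32))
# Both sides apply the same argument and so compute the same digests;
# an attacker who does not know the current SSV cannot produce
# matching digests, which is how a hijack is detected.
assert server.set_ssv(arg) == client_view.set_ssv(arg)
```

Because the update is an XOR rather than a replacement, each principal's SET_SSV mixes new secret material into the SSV without requiring any principal to know the contributions of the others.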
3454 Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches 3455 prevent hijacking of a transport connection that has previously been 3456 associated with a session. If the goal of a counter-threat strategy 3457 is to prevent connection hijacking, the use of IPsec is RECOMMENDED. 3459 If a connection hijack occurs, the hijacker could in theory change 3460 locking state and negatively impact the service to legitimate 3461 clients. However, if the server is configured to require the use of 3462 RPCSEC_GSS with integrity or privacy on the affected file objects, 3463 and if EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35) is 3464 in force, this will thwart unauthorized attempts to change locking 3465 state. 3467 2.10.9. The Secret State Verifier (SSV) GSS Mechanism 3469 The SSV provides the secret key for a GSS mechanism internal to 3470 NFSv4.1 that NFSv4.1 uses for state protection. Contexts for this 3471 mechanism are not established via the RPCSEC_GSS protocol. Instead, 3472 the contexts are automatically created when EXCHANGE_ID specifies 3473 SP4_SSV protection. The only tokens defined are the PerMsgToken 3474 (emitted by GSS_GetMIC) and the SealedMessage token (emitted by 3475 GSS_Wrap). 3477 The mechanism OID for the SSV mechanism is 3478 iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech 3479 (1.3.6.1.4.1.28882.1.1). While the SSV mechanism does not define any 3480 initial context tokens, the OID can be used to let servers indicate 3481 that the SSV mechanism is acceptable whenever the client sends a 3482 SECINFO or SECINFO_NO_NAME operation (see Section 2.6). 3484 The SSV mechanism defines four subkeys derived from the SSV value. 3485 Each time SET_SSV is invoked, the subkeys are recalculated by the 3486 client and server. The calculation of each of the four subkeys 3487 depends on each of the four respective ssv_subkey4 enumerated values. 
3488 The calculation uses the HMAC [51] algorithm, using the current SSV 3489 as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID, 3490 and the input text as represented by the XDR encoded enumeration 3491 value for that subkey of data type ssv_subkey4. If the length of the 3492 output of the HMAC algorithm exceeds the length of the key of the 3493 encryption algorithm (which is also negotiated by EXCHANGE_ID), then 3494 the subkey MUST be truncated from the HMAC output, i.e., if the 3495 subkey is N bytes long, then the first N bytes of the HMAC output 3496 MUST be used for the subkey. The specification of EXCHANGE_ID states 3497 that the length of the output of the HMAC algorithm MUST NOT be less 3498 than the length of the subkey needed for the encryption algorithm (see 3499 Section 18.35). 3501 /* Input for computing subkeys */ 3502 enum ssv_subkey4 { 3503 SSV4_SUBKEY_MIC_I2T = 1, 3504 SSV4_SUBKEY_MIC_T2I = 2, 3505 SSV4_SUBKEY_SEAL_I2T = 3, 3506 SSV4_SUBKEY_SEAL_T2I = 4 3507 }; 3508 The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating 3509 message integrity codes (MICs) that originate from the NFSv4.1 3510 client, whether as part of a request over the fore channel or a 3511 response over the backchannel. The subkey derived from 3512 SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1 3513 server. The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for 3514 encryption text originating from the NFSv4.1 client, and the subkey 3515 derived from SSV4_SUBKEY_SEAL_T2I is used for encryption text 3516 originating from the NFSv4.1 server. 
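The subkey derivation just described can be sketched as follows. This is a non-normative illustration: SHA-256 and a 16-byte cipher key length stand in for the hash and encryption algorithms actually negotiated by EXCHANGE_ID, and the function name is invented.

```python
import hashlib
import hmac
import struct

# The four ssv_subkey4 enumerated values from the XDR above.
SSV4_SUBKEY_MIC_I2T = 1
SSV4_SUBKEY_MIC_T2I = 2
SSV4_SUBKEY_SEAL_I2T = 3
SSV4_SUBKEY_SEAL_T2I = 4

def derive_subkey(ssv, subkey_enum, key_len, hash_name="sha256"):
    """HMAC keyed with the current SSV over the XDR encoding of the
    ssv_subkey4 value, truncated to the first key_len bytes when the
    HMAC output is longer than the cipher key."""
    # XDR encodes an enum as a 4-byte big-endian integer.
    input_text = struct.pack(">I", subkey_enum)
    out = hmac.new(ssv, input_text, getattr(hashlib, hash_name)).digest()
    if len(out) < key_len:
        # Per the spec, the negotiated hash output MUST NOT be
        # shorter than the subkey the encryption algorithm needs.
        raise ValueError("hash output shorter than required subkey")
    return out[:key_len]

ssv = bytes(32)  # for illustration; a real SSV is set via SET_SSV
mic_i2t = derive_subkey(ssv, SSV4_SUBKEY_MIC_I2T, 16)
```

Because the enum value is the only input text, the four subkeys differ from each other but are all recomputable by both peers from the shared SSV alone.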
3518 The PerMsgToken description is based on an XDR definition: 3520 /* Input for computing smt_hmac */ 3521 struct ssv_mic_plain_tkn4 { 3522 uint32_t smpt_ssv_seq; 3523 opaque smpt_orig_plain<>; 3524 }; 3526 /* SSV GSS PerMsgToken token */ 3527 struct ssv_mic_tkn4 { 3528 uint32_t smt_ssv_seq; 3529 opaque smt_hmac<>; 3530 }; 3532 The field smt_hmac is an HMAC calculated by using the subkey derived 3533 from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one- 3534 way hash algorithm as negotiated by EXCHANGE_ID, and the input text 3535 as represented by data of type ssv_mic_plain_tkn4. The field 3536 smpt_ssv_seq is the same as smt_ssv_seq. The field smpt_orig_plain 3537 is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of 3538 [7]). The caller of GSS_GetMIC() provides a pointer to a buffer 3539 containing the plain text. The SSV mechanism's entry point for 3540 GSS_GetMIC() encodes this into an opaque array, and the encoding will 3541 include an initial four-byte length, plus any necessary padding. 3542 Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus 3543 making up an XDR encoding of a value of data type ssv_mic_plain_tkn4, 3544 which in turn is the input into the HMAC. 3546 The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type 3547 ssv_mic_tkn4. The field smt_ssv_seq comes from the SSV sequence 3548 number, which is equal to one after SET_SSV (Section 18.47) is called 3549 the first time on a client ID. Thereafter, the SSV sequence number 3550 is incremented on each SET_SSV. Thus, smt_ssv_seq represents the 3551 version of the SSV at the time GSS_GetMIC() was called. As noted in 3552 Section 18.35, the client and server can maintain multiple concurrent 3553 versions of the SSV. This allows the SSV to be changed without 3554 serializing all RPC calls that use the SSV mechanism with SET_SSV 3555 operations. 
Once the HMAC is calculated, it is XDR encoded into 3556 smt_hmac, which will include an initial four-byte length, and any 3557 necessary padding. Prepended to this will be the XDR encoded value 3558 of smt_ssv_seq. 3560 The SealedMessage description is based on an XDR definition: 3562 /* Input for computing ssct_encr_data and ssct_hmac */ 3563 struct ssv_seal_plain_tkn4 { 3564 opaque sspt_confounder<>; 3565 uint32_t sspt_ssv_seq; 3566 opaque sspt_orig_plain<>; 3567 opaque sspt_pad<>; 3568 }; 3570 /* SSV GSS SealedMessage token */ 3571 struct ssv_seal_cipher_tkn4 { 3572 uint32_t ssct_ssv_seq; 3573 opaque ssct_iv<>; 3574 opaque ssct_encr_data<>; 3575 opaque ssct_hmac<>; 3576 }; 3578 The token emitted by GSS_Wrap() is XDR encoded and of XDR data type 3579 ssv_seal_cipher_tkn4. 3581 The ssct_ssv_seq field has the same meaning as smt_ssv_seq. 3583 The ssct_encr_data field is the result of encrypting a value of the 3584 XDR encoded data type ssv_seal_plain_tkn4. The encryption key is the 3585 subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and 3586 the encryption algorithm is that negotiated by EXCHANGE_ID. 3588 The ssct_iv field is the initialization vector (IV) for the 3589 encryption algorithm (if applicable) and is sent in clear text. The 3590 content and size of the IV MUST comply with the specification of the 3591 encryption algorithm. For example, the id-aes256-CBC algorithm MUST 3592 use a 16-byte initialization vector (IV), which MUST be unpredictable 3593 for each instance of a value of data type ssv_seal_plain_tkn4 that is 3594 encrypted with a particular SSV key. 3596 The ssct_hmac field is the result of computing an HMAC using the 3597 value of the XDR encoded data type ssv_seal_plain_tkn4 as the input 3598 text. The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or 3599 SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that 3600 negotiated by EXCHANGE_ID. 3602 The sspt_confounder field is a random value. 
3604 The sspt_ssv_seq field is the same as ssct_ssv_seq. 3606 The sspt_orig_plain field is the original plaintext and is the 3607 "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of 3608 [7]). As with the handling of the plaintext by the SSV mechanism's 3609 GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a 3610 pointer to the plaintext, and will XDR encode an opaque array into 3611 sspt_orig_plain representing the plain text, along with the other 3612 fields of an instance of data type ssv_seal_plain_tkn4. 3614 The sspt_pad field is present to support encryption algorithms that 3615 require inputs to be in fixed-sized blocks. The content of sspt_pad 3616 is zero filled except for the length. Beware that the XDR encoding 3617 of ssv_seal_plain_tkn4 contains three variable-length arrays, and so 3618 each array consumes four bytes for an array length, and each array 3619 that follows the length is always padded to a multiple of four bytes 3620 per the XDR standard. 3622 For example, suppose the encryption algorithm uses 16-byte blocks, 3623 and the sspt_confounder is three bytes long, and the sspt_orig_plain 3624 field is 15 bytes long. The XDR encoding of sspt_confounder uses 3625 eight bytes (4 + 3 + 1-byte pad), the XDR encoding of sspt_ssv_seq 3626 uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4 3627 + 15 + 1-byte pad), and the smallest XDR encoding of the sspt_pad 3628 field is four bytes. This totals 36 bytes. The next multiple of 16 3629 is 48; thus, the length field of sspt_pad needs to be set to 12 3630 bytes, or a total encoding of 16 bytes. The total number of XDR 3631 encoded bytes is thus 8 + 4 + 20 + 16 = 48. 3633 GSS_Wrap() emits a token that is an XDR encoding of a value of data 3634 type ssv_seal_cipher_tkn4. Note that regardless of whether or not 3635 the caller of GSS_Wrap() requests confidentiality, the token always 3636 has confidentiality. 
This is because the SSV mechanism is for 3637 RPCSEC_GSS, and RPCSEC_GSS never produces GSS_Wrap() tokens without 3638 confidentiality. 3640 There is one SSV per client ID. There is a single GSS context for a 3641 client ID / SSV pair. All SSV mechanism RPCSEC_GSS handles of a 3642 client ID / SSV pair share the same GSS context. SSV GSS contexts do 3643 not expire except when the SSV is destroyed (causes would include the 3644 client ID being destroyed or a server restart). Since one purpose of 3645 context expiration is to replace keys that have been in use for "too 3646 long", and hence are vulnerable to compromise by brute force or accident, the 3647 client can replace the SSV key by sending periodic SET_SSV 3648 operations, which is done by cycling through different users' 3649 RPCSEC_GSS credentials. This way, the SSV is replaced without 3650 destroying the SSV's GSS contexts. 3652 SSV RPCSEC_GSS handles can be expired or deleted by the server at any 3653 time, and the EXCHANGE_ID operation can be used to create more SSV 3654 RPCSEC_GSS handles. Expiration of SSV RPCSEC_GSS handles does not 3655 imply that the SSV or its GSS context has expired. 3657 The client MUST establish an SSV via SET_SSV before the SSV GSS 3658 context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC(). 3659 If SET_SSV has not been successfully called, attempts to emit tokens 3660 MUST fail. 3662 The SSV mechanism does not support replay detection and sequencing in 3663 its tokens because RPCSEC_GSS does not use those features (see 3664 Section 5.2.2, "Context Creation Requests", in [4]). However, 3665 Section 2.10.10 discusses special considerations for the SSV 3666 mechanism when used with RPCSEC_GSS. 3668 2.10.10.
Security Considerations for RPCSEC_GSS When Using the SSV 3669 Mechanism 3671 When a client ID is created with SP4_SSV state protection (see 3672 Section 18.35), the client is permitted to associate multiple 3673 RPCSEC_GSS handles with the single SSV GSS context (see 3674 Section 2.10.9). Because of the way RPCSEC_GSS (both version 1 and 3675 version 2, see [4] and [9]) calculates the verifier of the reply, 3676 special care must be taken by the implementation of the NFSv4.1 3677 client to prevent attacks by a man-in-the-middle. The verifier of an 3678 RPCSEC_GSS reply is the output of GSS_GetMIC() applied to the input 3679 value of the seq_num field of the RPCSEC_GSS credential (data type 3680 rpc_gss_cred_ver_1_t) (see Section 5.3.3.2 of [4]). If multiple 3681 RPCSEC_GSS handles share the same GSS context, then if one handle is 3682 used to send a request with the same seq_num value as another handle, 3683 an attacker could block the reply, and replace it with the verifier 3684 used for the other handle. 3686 There are multiple ways to prevent the attack on the SSV RPCSEC_GSS 3687 verifier in the reply. The simplest is believed to be as follows. 3689 o Each time one or more new SSV RPCSEC_GSS handles are created via 3690 EXCHANGE_ID, the client SHOULD send a SET_SSV operation to modify 3691 the SSV. By changing the SSV, the new handles will not result in 3692 the re-use of an SSV RPCSEC_GSS verifier in a reply. 3694 o When a requester decides to use N SSV RPCSEC_GSS handles, it 3695 SHOULD assign a unique and non-overlapping range of seq_nums to 3696 each SSV RPCSEC_GSS handle. The size of each range SHOULD be 3697 equal to MAXSEQ / N (see Section 5 of [4] for the definition of 3698 MAXSEQ). When an SSV RPCSEC_GSS handle reaches its maximum, it 3699 SHOULD force the replier to destroy the handle by sending a NULL 3700 RPC request with seq_num set to MAXSEQ + 1 (see Section 5.3.3.3 of 3701 [4]).
3703 o When the requester wants to increase or decrease N, it SHOULD 3704 force the replier to destroy all N handles by sending a NULL RPC 3705 request on each handle with seq_num set to MAXSEQ + 1. If the 3706 requester is the client, it SHOULD send a SET_SSV operation before 3707 using new handles. If the requester is the server, then the 3708 client SHOULD send a SET_SSV operation when it detects that the 3709 server has forced it to destroy a backchannel's SSV RPCSEC_GSS 3710 handle. By sending a SET_SSV operation, the SSV will change, and 3711 so the attacker will be unable to successfully replay a 3712 previous verifier in a reply to the requester. 3714 Note that if the replier carefully creates the SSV RPCSEC_GSS 3715 handles, the related risk of a man-in-the-middle splicing a forged 3716 SSV RPCSEC_GSS credential with a verifier for another handle does not 3717 exist. This is because the verifier in an RPCSEC_GSS request is 3718 computed from input that includes both the RPCSEC_GSS handle and 3719 seq_num (see Section 5.3.1 of [4]). Provided the replier takes care 3720 to avoid re-using the value of an RPCSEC_GSS handle that it creates, 3721 such as by including a generation number in the handle, the 3722 man-in-the-middle will not be able to successfully replay a previous 3723 verifier in the request to a replier. 3725 2.10.11. Session Mechanics - Steady State 3727 2.10.11.1. Obligations of the Server 3729 The server has the primary obligation to monitor the state of 3730 backchannel resources that the client has created for the server 3731 (RPCSEC_GSS contexts and backchannel connections). If these 3732 resources vanish, the server takes action as specified in 3733 Section 2.10.13.2. 3735 2.10.11.2. Obligations of the Client 3737 The client SHOULD honor the following obligations in order to utilize 3738 the session: 3740 o Keep a necessary session from going idle on the server.
A client 3741 that requires a session but nonetheless is not sending operations 3742 risks having the session be destroyed by the server. This is 3743 because sessions consume resources, and resource limitations may 3744 force the server to cull an inactive session. A server MAY 3745 consider a session to be inactive if the client has not used the 3746 session before the session inactivity timer (Section 2.10.12) has 3747 expired. 3749 o Destroy the session when not needed. If a client has multiple 3750 sessions, one of which has no requests waiting for replies, and 3751 has been idle for some period of time, it SHOULD destroy the 3752 session. 3754 o Maintain GSS contexts and RPCSEC_GSS handles for the backchannel. 3755 If the client requires the server to use the RPCSEC_GSS security 3756 flavor for callbacks, then it needs to be sure the RPCSEC_GSS 3757 handles and/or their GSS contexts that are handed to the server 3758 via BACKCHANNEL_CTL or CREATE_SESSION are unexpired. 3760 o Preserve a connection for a backchannel. The server requires a 3761 backchannel in order to gracefully recall recallable state or 3762 notify the client of certain events. Note that if the connection 3763 is not being used for the fore channel, there is no way for the 3764 client to tell if the connection is still alive (e.g., the server 3765 restarted without sending a disconnect). The onus is on the 3766 server, not the client, to determine if the backchannel's 3767 connection is alive, and to indicate in the response to a SEQUENCE 3768 operation when the last connection associated with a session's 3769 backchannel has disconnected. 3771 2.10.11.3. Steps the Client Takes to Establish a Session 3773 If the client does not have a client ID, the client sends EXCHANGE_ID 3774 to establish a client ID. 
If it opts for SP4_MACH_CRED or SP4_SSV 3775 protection, in the spo_must_enforce list of operations, it SHOULD at 3776 minimum specify CREATE_SESSION, DESTROY_SESSION, 3777 BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID. If it 3778 opts for SP4_SSV protection, the client needs to ask for SSV-based 3779 RPCSEC_GSS handles. 3781 The client uses the client ID to send a CREATE_SESSION on a 3782 connection to the server. The results of CREATE_SESSION indicate 3783 whether or not the server will persist the session reply cache 3784 through a server that has restarted, and the client notes this for 3785 future reference. 3787 If the client specified SP4_SSV state protection when the client ID 3788 was created, then it SHOULD send SET_SSV in the first COMPOUND after 3789 the session is created. Each time a new principal goes to use the 3790 client ID, it SHOULD send a SET_SSV again. 3792 If the client wants to use delegations, layouts, directory 3793 notifications, or any other state that requires a backchannel, then 3794 it needs to add a connection to the backchannel if CREATE_SESSION did 3795 not already do so. The client creates a connection, and calls 3796 BIND_CONN_TO_SESSION to associate the connection with the session and 3797 the session's backchannel. If CREATE_SESSION did not already do so, 3798 the client MUST tell the server what security is required in order 3799 for the client to accept callbacks. The client does this via 3800 BACKCHANNEL_CTL. If the client selected SP4_MACH_CRED or SP4_SSV 3801 protection when it called EXCHANGE_ID, then the client SHOULD specify 3802 that the backchannel use RPCSEC_GSS contexts for security. 3804 If the client wants to use additional connections for the 3805 backchannel, then it needs to call BIND_CONN_TO_SESSION on each 3806 connection it wants to use with the session. 
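The establishment steps just described can be summarized as an ordered list of operations. The following sketch uses invented names and simplifies the SP4 flag handling; it is illustrative, not normative:

```python
def session_setup_ops(state_protection, use_backchannel,
                      backchannel_from_create_session=False):
    """Rough order of NFSv4.1 operations a client sends to reach
    session steady state.

    state_protection: None, "SP4_MACH_CRED", or "SP4_SSV" (invented
    string encoding for this sketch)."""
    ops = ["EXCHANGE_ID", "CREATE_SESSION"]
    if state_protection == "SP4_SSV":
        # SET_SSV SHOULD be sent in the first COMPOUND after the
        # session is created.
        ops.append("SET_SSV")
    if use_backchannel and not backchannel_from_create_session:
        # Associate a connection with the session's backchannel, then
        # state the required callback security if CREATE_SESSION did
        # not already do so.
        ops.append("BIND_CONN_TO_SESSION")
        ops.append("BACKCHANNEL_CTL")
    return ops

print(session_setup_ops("SP4_SSV", use_backchannel=True))
```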
If the client wants to 3807 use additional connections for the fore channel, then it needs to 3808 call BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED 3809 state protection when the client ID was created. 3811 At this point, the session has reached steady state. 3813 2.10.12. Session Inactivity Timer 3815 The server MAY maintain a session inactivity timer for each session. 3816 If the session inactivity timer expires, then the server MAY destroy 3817 the session. To avoid losing a session due to inactivity, the client 3818 MUST renew the session inactivity timer. The length of the session 3819 inactivity timer MUST NOT be less than the lease_time attribute 3820 (Section 5.8.1.11). As with lease renewal (Section 8.3), when the 3821 server receives a SEQUENCE operation, it resets the session 3822 inactivity timer, and MUST NOT allow the timer to expire while the 3823 rest of the operations in the COMPOUND procedure's request are still 3824 executing. Once the last operation has finished, the server MUST set 3825 the session inactivity timer to expire no sooner than the sum of the 3826 current time and the value of the lease_time attribute. 3828 2.10.13. Session Mechanics - Recovery 3830 2.10.13.1. Events Requiring Client Action 3832 The following events require client action to recover. 3834 2.10.13.1.1. RPCSEC_GSS Context Loss by Callback Path 3836 If all RPCSEC_GSS handles granted by the client to the server for 3837 callback use have expired, the client MUST establish a new handle via 3838 BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE results 3839 indicates when callback handles are nearly expired, or fully expired 3840 (see Section 18.46.3). 3842 2.10.13.1.2.
Connection Loss 3844 If the client loses the last connection of the session and wants to 3845 retain the session, then it needs to create a new connection, and if, 3846 when the client ID was created, BIND_CONN_TO_SESSION was specified in 3847 the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION 3848 to associate the connection with the session. 3850 If there was a request outstanding at the time of connection loss, 3851 then if the client wants to continue to use the session, it MUST 3852 retry the request, as described in Section 2.10.6.2. Note that it is 3853 not necessary to retry requests over a connection with the same 3854 source network address or the same destination network address as the 3855 lost connection. As long as the session ID, slot ID, and sequence ID 3856 in the retry match that of the original request, the server will 3857 recognize the request as a retry if it executed the request prior to 3858 disconnect. 3860 If the connection that was lost was the last one associated with the 3861 backchannel, and the client wants to retain the backchannel and/or 3862 prevent revocation of recallable state, the client needs to 3863 reconnect, and if it does, it MUST associate the connection to the 3864 session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD 3865 indicate when it has no callback connection via the sr_status_flags 3866 result from SEQUENCE. 3868 2.10.13.1.3. Backchannel GSS Context Loss 3870 Via the sr_status_flags result of the SEQUENCE operation or other 3871 means, the client will learn if some or all of the RPCSEC_GSS 3872 contexts it assigned to the backchannel have been lost. If the 3873 client wants to retain the backchannel and/or not put recallable 3874 state subject to revocation, the client needs to use BACKCHANNEL_CTL 3875 to assign new contexts. 3877 2.10.13.1.4. Loss of Session 3879 The replier might lose a record of the session. Causes include: 3881 o Replier failure and restart. 
3883 o A catastrophe that causes the reply cache to be corrupted or lost 3884 on the media on which it was stored. This applies even if the 3885 replier indicated in the CREATE_SESSION results that it would 3886 persist the cache. 3888 o The server purges the session of a client that has been inactive 3889 for a very extended period of time. 3891 o As a result of configuration changes among a set of clustered 3892 servers, a network address previously connected to one server 3893 becomes connected to a different server that has no knowledge of 3894 the session in question. Such a configuration change will 3895 generally only happen when the original server ceases to function 3896 for a time. 3898 Loss of reply cache is equivalent to loss of session. The replier 3899 indicates loss of session to the requester by returning 3900 NFS4ERR_BADSESSION on the next operation that uses the session ID 3901 that refers to the lost session. 3903 After an event like a server restart, the client may have lost its 3904 connections. The client assumes for the moment that the session has 3905 not been lost. It reconnects, and if it specified connection 3906 association enforcement when the session was created, it invokes 3907 BIND_CONN_TO_SESSION using the session ID. Otherwise, it invokes 3908 SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns 3909 NFS4ERR_BADSESSION, the client knows the session is not available to 3910 it when communicating with that network address. If the connection 3911 survives session loss, then the next SEQUENCE operation the client 3912 sends over the connection will get back NFS4ERR_BADSESSION. The 3913 client again knows the session was lost. 3915 Here is one suggested algorithm for the client when it gets 3916 NFS4ERR_BADSESSION. It is not obligatory in that, if a client does 3917 not want to take advantage of such features as trunking, it may omit 3918 parts of it. 
However, it is a useful example that draws attention to 3919 various possible recovery issues: 3921 1. If the client has other connections to other server network 3922 addresses associated with the same session, attempt a COMPOUND 3923 with a single operation, SEQUENCE, on each of the other 3924 connections. 3926 2. If the attempts succeed, the session is still alive, and this is 3927 a strong indicator that the server's network address has moved. 3928 The client might send an EXCHANGE_ID on the connection that 3929 returned NFS4ERR_BADSESSION to see if there are opportunities for 3930 client ID trunking (i.e., the same client ID and so_major_id value 3931 are returned). The client might use DNS to see if the moved 3932 network address was replaced with another, so that the 3933 performance and availability benefits of session trunking can 3934 continue. 3936 3. If the SEQUENCE requests fail with NFS4ERR_BADSESSION, then the 3937 session no longer exists on any of the server network addresses 3938 for which the client has connections associated with that session 3939 ID. It is possible the session is still alive and available on 3940 other network addresses. The client sends an EXCHANGE_ID on all 3941 the connections to see if the server owner is still listening on 3942 those network addresses. If the same server owner is returned 3943 but a new client ID is returned, this is a strong indicator of a 3944 server restart. If both the same server owner and same client ID 3945 are returned, then this is a strong indication that the server 3946 did delete the session, and the client will need to send a 3947 CREATE_SESSION if it has no other sessions for that client ID. 3948 If a different server owner is returned, the client can use DNS 3949 to find other network addresses.
If it does not, or if DNS does 3950 not find any other addresses for the server, then the client will 3951 be unable to provide NFSv4.1 service, and fatal errors should be 3952 returned to processes that were using the server. If the client 3953 is using a "mount" paradigm, unmounting the server is advised. 3955 4. If the client knows of no other connections associated with the 3956 session ID and server network addresses that are, or have been, 3957 associated with the session ID, then the client can use DNS to 3958 find other network addresses. If it does not, or if DNS does not 3959 find any other addresses for the server, then the client will be 3960 unable to provide NFSv4.1 service, and fatal errors should be 3961 returned to processes that were using the server. If the client 3962 is using a "mount" paradigm, unmounting the server is advised. 3964 If there is a reconfiguration event that results in the same network 3965 address being assigned to servers where the eir_server_scope value is 3966 different, it cannot be guaranteed that a session ID generated by the 3967 first will be recognized as invalid by the second. Therefore, in 3968 managing server reconfigurations among servers with different server 3969 scope values, it is necessary to make sure that all clients have 3970 disconnected from the first server before effecting the 3971 reconfiguration. Nonetheless, clients should not assume that servers 3972 will always adhere to this requirement; clients MUST be prepared to 3973 deal with unexpected effects of server reconfigurations. Even where 3974 a session ID is inappropriately recognized as valid, it is likely 3975 either that the connection will not be recognized as valid or that a 3976 sequence value for a slot will not be correct. Therefore, when a 3977 client receives results indicating such unexpected errors, the use of 3978 EXCHANGE_ID to determine the current server configuration is 3979 RECOMMENDED.
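The decision logic in the steps above, for the case where a follow-up EXCHANGE_ID succeeds, can be sketched as a pure function. The helper name and return strings are invented for illustration; the inputs are the client's previously known identity values and those returned by EXCHANGE_ID:

```python
def classify_exchange_id_result(old_owner, old_clientid,
                                new_owner, new_clientid):
    # Interpreting EXCHANGE_ID results after SEQUENCE failed with
    # NFS4ERR_BADSESSION on every connection known for the session.
    if new_owner != old_owner:
        # Different server owner: look for other addresses (e.g., via DNS).
        return "different-server"
    if new_clientid != old_clientid:
        # Same owner, new client ID: strong indicator of a server restart.
        return "server-restart"
    # Same owner and same client ID: the server deleted the session;
    # send CREATE_SESSION if no other session exists for this client ID.
    return "session-deleted"

print(classify_exchange_id_result("owner-A", 17, "owner-A", 17))
```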
3981 A variation on the above is that after a server's network address 3982 moves, there is no NFSv4.1 server listening, e.g., no listener on 3983 port 2049. In this case, one of the following occurs: the NFSv4 3984 server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns a 3985 PROG_MISMATCH error, the RPC listener on 2049 returns PROG_UNAVAIL, or 3986 attempts to reconnect to the network address time out. These SHOULD 3987 be treated as equivalent to SEQUENCE returning NFS4ERR_BADSESSION for 3988 these purposes. 3990 When the client detects session loss, it needs to call CREATE_SESSION 3991 to recover. Any non-idempotent operations that were in progress 3992 might have been performed on the server at the time of session loss. 3993 The client has no general way to recover from this. 3995 Note that loss of session does not imply loss of byte-range lock, 3996 open, delegation, or layout state because locks, opens, delegations, 3997 and layouts are tied to the client ID, 3998 not the session. Nor does loss of byte-range lock, open, delegation, 3999 or layout state imply loss of session state, because the session 4000 depends on the client ID; loss of the client ID, however, does imply loss 4001 of session, byte-range lock, open, delegation, and layout state. See 4002 Section 8.4.2. A session can survive a server restart, but lock 4003 recovery may still be needed. 4005 It is possible that CREATE_SESSION will fail with 4006 NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not 4007 preserve client ID state). If so, the client needs to call 4008 EXCHANGE_ID, followed by CREATE_SESSION. 4010 2.10.13.2. Events Requiring Server Action 4012 The following events require server action to recover. 4014 2.10.13.2.1. Client Crash and Restart 4016 As described in Section 18.35, a restarted client sends EXCHANGE_ID 4017 in such a way that it causes the server to delete any sessions it 4018 had. 4020 2.10.13.2.2.
Client Crash with No Restart 4022 If a client crashes and never comes back, it will never send 4023 EXCHANGE_ID with its old client owner. Thus, the server has session 4024 state that will never be used again. After an extended period of 4025 time, and if the server has resource constraints, it MAY destroy the 4026 old session as well as locking state. 4028 2.10.13.2.3. Extended Network Partition 4030 To the server, the extended network partition may be no different 4031 from a client crash with no restart (see Section 2.10.13.2.2). 4032 Unless the server can discern that there is a network partition, it 4033 is free to treat the situation as if the client has crashed 4034 permanently. 4036 2.10.13.2.4. Backchannel Connection Loss 4038 If there were callback requests outstanding at the time of a 4039 connection loss, then the server MUST retry the requests, as 4040 described in Section 2.10.6.2. Note that it is not necessary to 4041 retry requests over a connection with the same source network address 4042 or the same destination network address as the lost connection. As 4043 long as the session ID, slot ID, and sequence ID in the retry match 4044 that of the original request, the callback target will recognize the 4045 request as a retry even if it did see the request prior to 4046 disconnect. 4048 If the connection lost is the last one associated with the 4049 backchannel, then the server MUST indicate that in the 4050 sr_status_flags field of every SEQUENCE reply until the backchannel 4051 is re-established. There are two situations, each of which uses 4052 different status flags: no connectivity for the session's backchannel 4053 and no connectivity for any session backchannel of the client. See 4054 Section 18.46 for a description of the appropriate flags in 4055 sr_status_flags. 4057 2.10.13.2.5. 
GSS Context Loss 4059 The server SHOULD monitor when the number of RPCSEC_GSS handles 4060 assigned to the backchannel reaches one, and when that one handle is 4061 near expiry (i.e., between one and two periods of lease time), and 4062 indicate so in the sr_status_flags field of all SEQUENCE replies. 4063 The server MUST indicate when all of the backchannel's assigned 4064 RPCSEC_GSS handles have expired via the sr_status_flags field of all 4065 SEQUENCE replies. 4067 2.10.14. Parallel NFS and Sessions 4069 A client and server can potentially be a non-pNFS implementation, a 4070 metadata server implementation, a data server implementation, or two 4071 or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, 4072 EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not 4073 mutually exclusive) are passed in the EXCHANGE_ID arguments and 4074 results to allow the client to indicate how it wants to use sessions 4075 created under the client ID, and to allow the server to indicate how 4076 it will allow the sessions to be used. See Section 13.1 for pNFS 4077 sessions considerations. 4079 3. Protocol Constants and Data Types 4081 The syntax and semantics to describe the data types of the NFSv4.1 4082 protocol are defined in the XDR RFC 4506 [2] and RPC RFC 5531 [3] 4083 documents. The next sections build upon the XDR data types to define 4084 constants, types, and structures specific to this protocol. The full 4085 list of XDR data types is in [10]. 4087 3.1. Basic Constants 4089 const NFS4_FHSIZE = 128; 4090 const NFS4_VERIFIER_SIZE = 8; 4091 const NFS4_OPAQUE_LIMIT = 1024; 4092 const NFS4_SESSIONID_SIZE = 16; 4094 const NFS4_INT64_MAX = 0x7fffffffffffffff; 4095 const NFS4_UINT64_MAX = 0xffffffffffffffff; 4096 const NFS4_INT32_MAX = 0x7fffffff; 4097 const NFS4_UINT32_MAX = 0xffffffff; 4099 const NFS4_MAXFILELEN = 0xffffffffffffffff; 4100 const NFS4_MAXFILEOFF = 0xfffffffffffffffe; 4102 Except where noted, all these constants are defined in bytes. 
4104 o NFS4_FHSIZE is the maximum size of a filehandle. 4106 o NFS4_VERIFIER_SIZE is the fixed size of a verifier. 4108 o NFS4_OPAQUE_LIMIT is the maximum size of certain opaque 4109 information. 4111 o NFS4_SESSIONID_SIZE is the fixed size of a session identifier. 4113 o NFS4_INT64_MAX is the maximum value of a signed 64-bit integer. 4115 o NFS4_UINT64_MAX is the maximum value of an unsigned 64-bit 4116 integer. 4118 o NFS4_INT32_MAX is the maximum value of a signed 32-bit integer. 4120 o NFS4_UINT32_MAX is the maximum value of an unsigned 32-bit 4121 integer. 4123 o NFS4_MAXFILELEN is the maximum length of a regular file. 4125 o NFS4_MAXFILEOFF is the maximum offset into a regular file. 4127 3.2. Basic Data Types 4129 These are the base NFSv4.1 data types. 4131 +---------------+---------------------------------------------------+ 4132 | Data Type | Definition | 4133 +---------------+---------------------------------------------------+ 4134 | int32_t | typedef int int32_t; | 4135 | uint32_t | typedef unsigned int uint32_t; | 4136 | int64_t | typedef hyper int64_t; | 4137 | uint64_t | typedef unsigned hyper uint64_t; | 4138 | attrlist4 | typedef opaque attrlist4<>; | 4139 | | Used for file/directory attributes. | 4140 | bitmap4 | typedef uint32_t bitmap4<>; | 4141 | | Used in attribute array encoding. | 4142 | changeid4 | typedef uint64_t changeid4; | 4143 | | Used in the definition of change_info4. | 4144 | clientid4 | typedef uint64_t clientid4; | 4145 | | Shorthand reference to client identification. | 4146 | count4 | typedef uint32_t count4; | 4147 | | Various count parameters (READ, WRITE, COMMIT). | 4148 | length4 | typedef uint64_t length4; | 4149 | | The length of a byte-range within a file. | 4150 | mode4 | typedef uint32_t mode4; | 4151 | | Mode attribute data type. | 4152 | nfs_cookie4 | typedef uint64_t nfs_cookie4; | 4153 | | Opaque cookie value for READDIR. | 4154 | nfs_fh4 | typedef opaque nfs_fh4; | 4155 | | Filehandle definition. 
| 4156 | nfs_ftype4 | enum nfs_ftype4; | 4157 | | Various defined file types. | 4158 | nfsstat4 | enum nfsstat4; | 4159 | | Return value for operations. | 4160 | offset4 | typedef uint64_t offset4; | 4161 | | Various offset designations (READ, WRITE, LOCK, | 4162 | | COMMIT). | 4163 | qop4 | typedef uint32_t qop4; | 4164 | | Quality of protection designation in SECINFO. | 4165 | sec_oid4 | typedef opaque sec_oid4<>; | 4166 | | Security Object Identifier. The sec_oid4 data | 4167 | | type is not really opaque. Instead, it contains | 4168 | | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in | 4169 | | the mech_type argument to GSS_Init_sec_context. | 4170 | | See [7] for details. | 4171 | sequenceid4 | typedef uint32_t sequenceid4; | 4172 | | Sequence number used for various session | 4173 | | operations (EXCHANGE_ID, CREATE_SESSION, | 4174 | | SEQUENCE, CB_SEQUENCE). | 4175 | seqid4 | typedef uint32_t seqid4; | 4176 | | Sequence identifier used for locking. | 4177 | sessionid4 | typedef opaque sessionid4[NFS4_SESSIONID_SIZE]; | 4178 | | Session identifier. | 4179 | slotid4 | typedef uint32_t slotid4; | 4180 | | Sequencing artifact for various session | 4181 | | operations (SEQUENCE, CB_SEQUENCE). | 4182 | utf8string | typedef opaque utf8string<>; | 4183 | | UTF-8 encoding for strings. | 4184 | utf8str_cis | typedef utf8string utf8str_cis; | 4185 | | Case-insensitive UTF-8 string. | 4186 | utf8str_cs | typedef utf8string utf8str_cs; | 4187 | | Case-sensitive UTF-8 string. | 4188 | utf8str_mixed | typedef utf8string utf8str_mixed; | 4189 | | UTF-8 strings with a case-sensitive prefix and a | 4190 | | case-insensitive suffix. | 4191 | component4 | typedef utf8str_cs component4; | 4192 | | Represents pathname components. | 4193 | linktext4 | typedef utf8str_cs linktext4; | 4194 | | Symbolic link contents ("symbolic link" is | 4195 | | defined in an Open Group [11] standard). 
| 4196 | pathname4 | typedef component4 pathname4<>; | 4197 | | Represents pathname for fs_locations. | 4198 | verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; | 4199 | | Verifier used for various operations (COMMIT, | 4200 | | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE) | 4201 | | NFS4_VERIFIER_SIZE is defined as 8. | 4202 +---------------+---------------------------------------------------+ 4204 End of Base Data Types 4206 Table 1 4208 3.3. Structured Data Types 4210 3.3.1. nfstime4 4212 struct nfstime4 { 4213 int64_t seconds; 4214 uint32_t nseconds; 4215 }; 4217 The nfstime4 data type gives the number of seconds and nanoseconds 4218 since midnight or zero hour January 1, 1970 Coordinated Universal 4219 Time (UTC). Values greater than zero for the seconds field denote 4220 dates after the zero hour January 1, 1970. Values less than zero for 4221 the seconds field denote dates before the zero hour January 1, 1970. 4222 In both cases, the nseconds field is to be added to the seconds field 4223 for the final time representation. For example, if the time to be 4224 represented is one-half second before zero hour January 1, 1970, the 4225 seconds field would have a value of negative one (-1) and the 4226 nseconds field would have a value of one-half second (500000000). 4227 Values greater than 999,999,999 for nseconds are invalid. 4229 This data type is used to pass time and date information. A server 4230 converts to and from its local representation of time when processing 4231 time values, preserving as much accuracy as possible. If the 4232 precision of timestamps stored for a file system object is less than 4233 defined, loss of precision can occur. An adjunct time maintenance 4234 protocol is RECOMMENDED to reduce client and server time skew. 4236 3.3.2. time_how4 4238 enum time_how4 { 4239 SET_TO_SERVER_TIME4 = 0, 4240 SET_TO_CLIENT_TIME4 = 1 4241 }; 4243 3.3.3. 
settime4 4245 union settime4 switch (time_how4 set_it) { 4246 case SET_TO_CLIENT_TIME4: 4247 nfstime4 time; 4248 default: 4249 void; 4250 }; 4252 The time_how4 and settime4 data types are used for setting timestamps 4253 in file object attributes. If set_it is SET_TO_SERVER_TIME4, then 4254 the server uses its local representation of time for the time value. 4256 3.3.4. specdata4 4258 struct specdata4 { 4259 uint32_t specdata1; /* major device number */ 4260 uint32_t specdata2; /* minor device number */ 4261 }; 4263 This data type represents the device numbers for the device file 4264 types NF4CHR and NF4BLK. 4266 3.3.5. fsid4 4268 struct fsid4 { 4269 uint64_t major; 4270 uint64_t minor; 4271 }; 4273 3.3.6. change_policy4 4275 struct change_policy4 { 4276 uint64_t cp_major; 4277 uint64_t cp_minor; 4278 }; 4280 The change_policy4 data type is used for the change_policy 4281 RECOMMENDED attribute. It provides change sequencing indication 4282 analogous to the change attribute. To enable the server to present a 4283 value valid across server re-initialization without requiring 4284 persistent storage, two 64-bit quantities are used, allowing one to 4285 be a server instance ID and the second to be incremented non- 4286 persistently, within a given server instance. 4288 3.3.7. fattr4 4290 struct fattr4 { 4291 bitmap4 attrmask; 4292 attrlist4 attr_vals; 4293 }; 4295 The fattr4 data type is used to represent file and directory 4296 attributes. 4298 The bitmap is a counted array of 32-bit integers used to contain bit 4299 values. The position of the integer in the array that contains bit n 4300 can be computed from the expression (n / 32), and its bit within that 4301 integer is (n mod 32). 4303 0 1 4304 +-----------+-----------+-----------+-- 4305 | count | 31 .. 0 | 63 .. 32 | 4306 +-----------+-----------+-----------+-- 4308 3.3.8. 
change_info4 4310 struct change_info4 { 4311 bool atomic; 4312 changeid4 before; 4313 changeid4 after; 4314 }; 4316 This data type is used with the CREATE, LINK, OPEN, REMOVE, and 4317 RENAME operations to let the client know the value of the change 4318 attribute for the directory in which the target file system object 4319 resides. 4321 3.3.9. netaddr4 4323 struct netaddr4 { 4324 /* see struct rpcb in RFC 1833 */ 4325 string na_r_netid<>; /* network id */ 4326 string na_r_addr<>; /* universal address */ 4327 }; 4329 The netaddr4 data type is used to identify network transport 4330 endpoints. The na_r_netid and na_r_addr fields respectively contain 4331 a netid and uaddr. The netid and uaddr concepts are defined in [12]. 4332 The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are 4333 defined in [12], specifically Tables 2 and 3 and Sections 5.2.3.3 and 4334 5.2.3.4. 4336 3.3.10. state_owner4 4338 struct state_owner4 { 4339 clientid4 clientid; 4340 opaque owner; 4341 }; 4343 typedef state_owner4 open_owner4; 4344 typedef state_owner4 lock_owner4; 4346 The state_owner4 data type is the base type for the open_owner4 4347 (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2). 4349 3.3.10.1. open_owner4 4351 This data type is used to identify the owner of OPEN state. 4353 3.3.10.2. lock_owner4 4355 This structure is used to identify the owner of byte-range locking 4356 state. 4358 3.3.11. open_to_lock_owner4 4360 struct open_to_lock_owner4 { 4361 seqid4 open_seqid; 4362 stateid4 open_stateid; 4363 seqid4 lock_seqid; 4364 lock_owner4 lock_owner; 4365 }; 4367 This data type is used for the first LOCK operation done for an 4368 open_owner4. It provides both the open_stateid and lock_owner, such 4369 that the transition is made from a valid open_stateid sequence to 4370 that of the new lock_stateid sequence. 
Using this mechanism avoids 4371 the confirmation of the lock_owner/lock_seqid pair since it is tied 4372 to established state in the form of the open_stateid/open_seqid. 4374 3.3.12. stateid4 4376 struct stateid4 { 4377 uint32_t seqid; 4378 opaque other[12]; 4379 }; 4381 This data type is used for the various state sharing mechanisms 4382 between the client and server. The client never modifies a value of 4383 data type stateid. The starting value of the "seqid" field is 4384 undefined. The server is required to increment the "seqid" field by 4385 one at each transition of the stateid. This is important since the 4386 client will inspect the seqid in OPEN stateids to determine the order 4387 of OPEN processing done by the server. 4389 3.3.13. layouttype4 4391 enum layouttype4 { 4392 LAYOUT4_NFSV4_1_FILES = 0x1, 4393 LAYOUT4_OSD2_OBJECTS = 0x2, 4394 LAYOUT4_BLOCK_VOLUME = 0x3 4395 }; 4396 This data type indicates what type of layout is being used. The file 4397 server advertises the layout types it supports through the 4398 fs_layout_type file system attribute (Section 5.12.1). A client asks 4399 for layouts of a particular type in LAYOUTGET, and processes those 4400 layouts in its layout-type-specific logic. 4402 The layouttype4 data type is 32 bits in length. The range 4403 represented by the layout type is split into three parts. Type 0x0 4404 is reserved. Types within the range 0x00000001-0x7FFFFFFF are 4405 globally unique and are assigned according to the description in 4406 Section 22.5; they are maintained by IANA. Types within the range 4407 0x80000000-0xFFFFFFFF are site specific and for private use only. 4409 The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file 4410 layout type, as defined in Section 13, is to be used. The 4411 LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as 4412 defined in [46], is to be used. 
Similarly, the LAYOUT4_BLOCK_VOLUME 4413 enumeration specifies that the block/volume layout, as defined in 4414 [47], is to be used. 4416 3.3.14. deviceid4 4418 const NFS4_DEVICEID4_SIZE = 16; 4420 typedef opaque deviceid4[NFS4_DEVICEID4_SIZE]; 4422 Layout information includes device IDs that specify a storage device 4423 through a compact handle. Addressing and type information is 4424 obtained with the GETDEVICEINFO operation. Device IDs are not 4425 guaranteed to be valid across metadata server restarts. A device ID 4426 is unique per client ID and layout type. See Section 12.2.10 for 4427 more details. 4429 3.3.15. device_addr4 4431 struct device_addr4 { 4432 layouttype4 da_layout_type; 4433 opaque da_addr_body<>; 4434 }; 4436 The device address is used to set up a communication channel with the 4437 storage device. Different layout types will require different data 4438 types to define how they communicate with storage devices. The 4439 opaque da_addr_body field is interpreted based on the specified 4440 da_layout_type field. 4442 This document defines the device address for the NFSv4.1 file layout 4443 (see Section 13.3), which identifies a storage device by network IP 4444 address and port number. This is sufficient for the clients to 4445 communicate with the NFSv4.1 storage devices, and may be sufficient 4446 for other layout types as well. Device types for object-based 4447 storage devices and block storage devices (e.g., Small Computer 4448 System Interface (SCSI) volume labels) are defined by their 4449 respective layout specifications. 4451 3.3.16. layout_content4 4453 struct layout_content4 { 4454 layouttype4 loc_type; 4455 opaque loc_body<>; 4456 }; 4458 The loc_body field is interpreted based on the layout type 4459 (loc_type). This document defines the loc_body for the NFSv4.1 file 4460 layout type; see Section 13.3 for its definition. 4462 3.3.17. 
layout4 4464 struct layout4 { 4465 offset4 lo_offset; 4466 length4 lo_length; 4467 layoutiomode4 lo_iomode; 4468 layout_content4 lo_content; 4469 }; 4471 The layout4 data type defines a layout for a file. The layout type 4472 specific data is opaque within lo_content. Since layouts are sub- 4473 dividable, the offset and length together with the file's filehandle, 4474 the client ID, iomode, and layout type identify the layout. 4476 3.3.18. layoutupdate4 4478 struct layoutupdate4 { 4479 layouttype4 lou_type; 4480 opaque lou_body<>; 4481 }; 4483 The layoutupdate4 data type is used by the client to return updated 4484 layout information to the metadata server via the LAYOUTCOMMIT 4485 (Section 18.42) operation. This data type provides a channel to pass 4486 layout type specific information (in field lou_body) back to the 4487 metadata server. For example, for the block/volume layout type, this 4488 could include the list of reserved blocks that were written. The 4489 contents of the opaque lou_body argument are determined by the layout 4490 type. The NFSv4.1 file-based layout does not use this data type; if 4491 lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a 4492 zero length. 4494 3.3.19. layouthint4 4496 struct layouthint4 { 4497 layouttype4 loh_type; 4498 opaque loh_body<>; 4499 }; 4501 The layouthint4 data type is used by the client to pass in a hint 4502 about the type of layout it would like created for a particular file. 4503 It is the data type specified by the layout_hint attribute described 4504 in Section 5.12.4. The metadata server may ignore the hint or may 4505 selectively ignore fields within the hint. This hint should be 4506 provided at create time as part of the initial attributes within 4507 OPEN. The loh_body field is specific to the type of layout 4508 (loh_type). The NFSv4.1 file-based layout uses the 4509 nfsv4_1_file_layouthint4 data type as defined in Section 13.3. 4511 3.3.20. 
layoutiomode4 4513 enum layoutiomode4 { 4514 LAYOUTIOMODE4_READ = 1, 4515 LAYOUTIOMODE4_RW = 2, 4516 LAYOUTIOMODE4_ANY = 3 4517 }; 4519 The iomode specifies whether the client intends to just read or both 4520 read and write the data represented by the layout. While the 4521 LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the 4522 LAYOUTGET operation, it MAY be used in the arguments to the 4523 LAYOUTRETURN and CB_LAYOUTRECALL operations. The LAYOUTIOMODE4_ANY 4524 iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ 4525 and LAYOUTIOMODE4_RW iomodes are being returned or recalled, 4526 respectively. The metadata server's use of the iomode may depend on 4527 the layout type being used. The storage devices MAY validate I/O 4528 accesses against the iomode and reject invalid accesses. 4530 3.3.21. nfs_impl_id4 4532 struct nfs_impl_id4 { 4533 utf8str_cis nii_domain; 4534 utf8str_cs nii_name; 4535 nfstime4 nii_date; 4536 }; 4537 This data type is used to identify client and server implementation 4538 details. The nii_domain field is the DNS domain name with which the 4539 implementor is associated. The nii_name field is the product name of 4540 the implementation and is completely free form. It is RECOMMENDED 4541 that the nii_name be used to distinguish machine architecture, 4542 machine platforms, revisions, versions, and patch levels. The 4543 nii_date field is the timestamp of when the software instance was 4544 published or built. 4546 3.3.22. threshold_item4 4548 struct threshold_item4 { 4549 layouttype4 thi_layout_type; 4550 bitmap4 thi_hintset; 4551 opaque thi_hintlist<>; 4552 }; 4554 This data type contains a list of hints specific to a layout type for 4555 helping the client determine when it should send I/O directly through 4556 the metadata server versus the storage devices. 
The data type 4557 consists of the layout type (thi_layout_type), a bitmap (thi_hintset) 4558 describing the set of hints supported by the server (they may differ 4559 based on the layout type), and a list of hints (thi_hintlist) whose 4560 content is determined by the hintset bitmap. See the mdsthreshold 4561 attribute for more details. 4563 The thi_hintset field is a bitmap of the following values: 4565 +-------------------------+---+---------+---------------------------+ 4566 | name | # | Data | Description | 4567 | | | Type | | 4568 +-------------------------+---+---------+---------------------------+ 4569 | threshold4_read_size | 0 | length4 | If a file's length is | 4570 | | | | less than the value of | 4571 | | | | threshold4_read_size, | 4572 | | | | then it is RECOMMENDED | 4573 | | | | that the client read from | 4574 | | | | the file via the MDS and | 4575 | | | | not a storage device. | 4576 | threshold4_write_size | 1 | length4 | If a file's length is | 4577 | | | | less than the value of | 4578 | | | | threshold4_write_size, | 4579 | | | | then it is RECOMMENDED | 4580 | | | | that the client write to | 4581 | | | | the file via the MDS and | 4582 | | | | not a storage device. | 4583 | threshold4_read_iosize | 2 | length4 | For read I/O sizes below | 4584 | | | | this threshold, it is | 4585 | | | | RECOMMENDED to read data | 4586 | | | | through the MDS. | 4587 | threshold4_write_iosize | 3 | length4 | For write I/O sizes below | 4588 | | | | this threshold, it is | 4589 | | | | RECOMMENDED to write data | 4590 | | | | through the MDS. | 4591 +-------------------------+---+---------+---------------------------+ 4593 3.3.23. mdsthreshold4 4595 struct mdsthreshold4 { 4596 threshold_item4 mth_hints<>; 4597 }; 4599 This data type holds an array of elements of data type 4600 threshold_item4, each of which is valid for a particular layout type. 4601 An array is necessary because a server can support multiple layout 4602 types for a single file. 4604 4. 
Filehandles 4606 The filehandle in the NFS protocol is a per-server unique identifier 4607 for a file system object. The contents of the filehandle are opaque 4608 to the client. Therefore, the server is responsible for translating 4609 the filehandle to an internal representation of the file system 4610 object. 4612 4.1. Obtaining the First Filehandle 4614 The operations of the NFS protocol are defined in terms of one or 4615 more filehandles. Therefore, the client needs a filehandle to 4616 initiate communication with the server. With the NFSv3 protocol (RFC 4617 1813 [37]), there exists an ancillary protocol to obtain this first 4618 filehandle. The MOUNT protocol, RPC program number 100005, provides 4619 the mechanism of translating a string-based file system pathname to a 4620 filehandle, which can then be used by the NFS protocols. 4622 The MOUNT protocol has deficiencies in the area of security and use 4623 via firewalls. This is one reason that the use of the public 4624 filehandle was introduced in RFC 2054 [48] and RFC 2055 [49]. With 4625 the use of the public filehandle in combination with the LOOKUP 4626 operation in the NFSv3 protocol, it has been demonstrated that the 4627 MOUNT protocol is unnecessary for viable interaction between NFS 4628 client and server. 4630 Therefore, the NFSv4.1 protocol will not use an ancillary protocol 4631 for translation from string-based pathnames to a filehandle. Two 4632 special filehandles will be used as starting points for the NFS 4633 client. 4635 4.1.1. Root Filehandle 4637 The first of the special filehandles is the ROOT filehandle. The 4638 ROOT filehandle is the "conceptual" root of the file system namespace 4639 at the NFS server. The client uses or starts with the ROOT 4640 filehandle by employing the PUTROOTFH operation. The PUTROOTFH 4641 operation instructs the server to set the "current" filehandle to the 4642 ROOT of the server's file tree. 
Once this PUTROOTFH operation is 4643 used, the client can then traverse the entirety of the server's file 4644 tree with the LOOKUP operation. A complete discussion of the server 4645 namespace is in Section 7. 4647 4.1.2. Public Filehandle 4649 The second special filehandle is the PUBLIC filehandle. Unlike the 4650 ROOT filehandle, the PUBLIC filehandle may be bound or represent an 4651 arbitrary file system object at the server. The server is 4652 responsible for this binding. It may be that the PUBLIC filehandle 4653 and the ROOT filehandle refer to the same file system object. 4654 However, it is up to the administrative software at the server and 4655 the policies of the server administrator to define the binding of the 4656 PUBLIC filehandle and server file system object. The client may not 4657 make any assumptions about this binding. The client uses the PUBLIC 4658 filehandle via the PUTPUBFH operation. 4660 4.2. Filehandle Types 4662 In the NFSv3 protocol, there was one type of filehandle with a single 4663 set of semantics. This type of filehandle is termed "persistent" in 4664 NFSv4.1. The semantics of a persistent filehandle remain the same as 4665 before. A new type of filehandle introduced in NFSv4.1 is the 4666 "volatile" filehandle, which attempts to accommodate certain server 4667 environments. 4669 The volatile filehandle type was introduced to address server 4670 functionality or implementation issues that make correct 4671 implementation of a persistent filehandle infeasible. Some server 4672 environments do not provide a file-system-level invariant that can be 4673 used to construct a persistent filehandle. The underlying server 4674 file system may not provide the invariant or the server's file system 4675 programming interfaces may not provide access to the needed 4676 invariant. 
Volatile filehandles may ease the implementation of 4677 server functionality such as hierarchical storage management or file 4678 system reorganization or migration. However, the volatile filehandle 4679 increases the implementation burden for the client. 4681 Since the client will need to handle persistent and volatile 4682 filehandles differently, a file attribute is defined that may be used 4683 by the client to determine the filehandle types being returned by the 4684 server. 4686 4.2.1. General Properties of a Filehandle 4688 The filehandle contains all the information the server needs to 4689 distinguish an individual file. To the client, the filehandle is 4690 opaque. The client stores filehandles for use in a later request and 4691 can compare two filehandles from the same server for equality by 4692 doing a byte-by-byte comparison. However, the client MUST NOT 4693 otherwise interpret the contents of filehandles. If two filehandles 4694 from the same server are equal, they MUST refer to the same file. 4695 Servers SHOULD try to maintain a one-to-one correspondence between 4696 filehandles and files, but this is not required. Clients MUST use 4697 filehandle comparisons only to improve performance, not for correct 4698 behavior. All clients need to be prepared for situations in which it 4699 cannot be determined whether two filehandles denote the same object 4700 and in such cases, avoid making invalid assumptions that might cause 4701 incorrect behavior. Further discussion of filehandle and attribute 4702 comparison in the context of data caching is presented in 4703 Section 10.3.4. 4705 As an example, in the case that two different pathnames when 4706 traversed at the server terminate at the same file system object, the 4707 server SHOULD return the same filehandle for each path. This can 4708 occur if a hard link (see [6]) is used to create two file names that 4709 refer to the same underlying file object and associated data. 
For 4710 example, if paths /a/b/c and /a/d/c refer to the same file, the 4711 server SHOULD return the same filehandle for both pathnames' 4712 traversals. 4714 4.2.2. Persistent Filehandle 4716 A persistent filehandle is defined as having a fixed value for the 4717 lifetime of the file system object to which it refers. Once the 4718 server creates the filehandle for a file system object, the server 4719 MUST accept the same filehandle for the object for the lifetime of 4720 the object. If the server restarts, the NFS server MUST honor the 4721 same filehandle value as it did in the server's previous 4722 instantiation. Similarly, if the file system is migrated, the new 4723 NFS server MUST honor the same filehandle as the old NFS server. 4725 The persistent filehandle will become stale or invalid when the 4726 file system object is removed. When the server is presented with a 4727 persistent filehandle that refers to a deleted object, it MUST return 4728 an error of NFS4ERR_STALE. A filehandle may become stale when the 4729 file system containing the object is no longer available. The file 4730 system may become unavailable if it exists on removable media and the 4731 media is no longer available at the server or the file system in 4732 whole has been destroyed or the file system has simply been removed 4733 from the server's namespace (i.e., unmounted in a UNIX environment). 4735 4.2.3. Volatile Filehandle 4737 A volatile filehandle does not share the same longevity 4738 characteristics of a persistent filehandle. The server may determine 4739 that a volatile filehandle is no longer valid at many different 4740 points in time. If the server can definitively determine that a 4741 volatile filehandle refers to an object that has been removed, the 4742 server should return NFS4ERR_STALE to the client (as is the case for 4743 persistent filehandles). 
In all other cases where the server 4744 determines that a volatile filehandle can no longer be used, it 4745 should return an error of NFS4ERR_FHEXPIRED. 4747 The REQUIRED attribute "fh_expire_type" is used by the client to 4748 determine what type of filehandle the server is providing for a 4749 particular file system. This attribute is a bitmask with the 4750 following values: 4752 FH4_PERSISTENT The value of FH4_PERSISTENT is used to indicate a 4753 persistent filehandle, which is valid until the object is removed 4754 from the file system. The server will not return 4755 NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined 4756 as a value in which none of the bits specified below are set. 4758 FH4_VOLATILE_ANY The filehandle may expire at any time, except as 4759 specifically excluded (i.e., FH4_NOEXPIRE_WITH_OPEN). 4761 FH4_NOEXPIRE_WITH_OPEN May only be set when FH4_VOLATILE_ANY is set. 4762 If this bit is set, then the meaning of FH4_VOLATILE_ANY is 4763 qualified to exclude any expiration of the filehandle when it is 4764 open. 4766 FH4_VOL_MIGRATION The filehandle will expire as a result of a file 4767 system transition (migration or replication), in those cases in 4768 which the continuity of filehandle use is not specified by handle 4769 class information within the fs_locations_info attribute. When 4770 this bit is set, clients without access to fs_locations_info 4771 information should assume that filehandles will expire on file 4772 system transitions. 4774 FH4_VOL_RENAME The filehandle will expire during rename. This 4775 includes a rename by the requesting client or a rename by any 4776 other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is redundant. 4778 Servers that provide volatile filehandles that can expire while open 4779 require special care as regards handling of RENAMEs and REMOVEs. 
4780 This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is 4781 set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not 4782 set, or if a non-read-only file system has a transition target in a 4783 different handle class. In these cases, the server should deny a 4784 RENAME or REMOVE that would affect an OPEN file of any of the 4785 components leading to the OPEN file. In addition, the server should 4786 deny all RENAME or REMOVE requests during the grace period, in order 4787 to make sure that reclaims of files where filehandles may have 4788 expired do not do a reclaim for the wrong file. 4790 Volatile filehandles are especially suitable for implementation of 4791 the pseudo file systems used to bridge exports. See Section 7.5 for 4792 a discussion of this. 4794 4.3. One Method of Constructing a Volatile Filehandle 4796 A volatile filehandle, while opaque to the client, could contain: 4798 [volatile bit = 1 | server boot time | slot | generation number] 4799 o slot is an index in the server volatile filehandle table 4801 o generation number is the generation number for the table entry/ 4802 slot 4804 When the client presents a volatile filehandle, the server makes the 4805 following checks, which assume that the check for the volatile bit 4806 has passed. If the server boot time is less than the current server 4807 boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return 4808 NFS4ERR_BADHANDLE. If the generation number does not match, return 4809 NFS4ERR_FHEXPIRED. 4811 When the server restarts, the table is gone (it is volatile). 4813 If the volatile bit is 0, then it is a persistent filehandle with a 4814 different structure following it. 4816 4.4. Client Recovery from Filehandle Expiration 4818 If possible, the client SHOULD recover from the receipt of an 4819 NFS4ERR_FHEXPIRED error. 
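The sequence of checks described in Section 4.3 can be sketched as follows. This is an illustrative sketch only: the in-memory table, the constant and function names, and the representation of NFS errors as Python strings are assumptions made for the example, not protocol requirements.

```python
# Sketch of the volatile-filehandle checks in Section 4.3.  The
# filehandle fields (boot time, slot, generation) are assumed to have
# already been parsed, and the volatile bit already checked.

CURRENT_BOOT_TIME = 1000          # hypothetical current server boot time

# Hypothetical volatile-filehandle table: slot -> generation number.
# The table is volatile: it is empty again after a server restart.
volatile_table = {0: 7, 1: 3}

def check_volatile_fh(boot_time, slot, generation):
    """Validate a volatile filehandle presented by a client."""
    if boot_time < CURRENT_BOOT_TIME:       # predates a restart; table lost
        return "NFS4ERR_FHEXPIRED"
    if slot not in volatile_table:          # slot out of range
        return "NFS4ERR_BADHANDLE"
    if volatile_table[slot] != generation:  # slot reused for another object
        return "NFS4ERR_FHEXPIRED"
    return "NFS4_OK"
```

For instance, under the assumed table above, `check_volatile_fh(1000, 0, 7)` succeeds, while a filehandle carrying an older boot time or a stale generation number yields NFS4ERR_FHEXPIRED.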
The client must take on additional 4820 responsibility so that it may prepare itself to recover from the 4821 expiration of a volatile filehandle. If the server returns 4822 persistent filehandles, the client does not need these additional 4823 steps. 4825 For volatile filehandles, most commonly the client will need to store 4826 the component names leading up to and including the file system 4827 object in question. With these names, the client should be able to 4828 recover by finding a filehandle in the namespace that is still 4829 available or by starting at the root of the server's file system 4830 namespace. 4832 If the expired filehandle refers to an object that has been removed 4833 from the file system, obviously the client will not be able to 4834 recover from the expired filehandle. 4836 It is also possible that the expired filehandle refers to a file that 4837 has been renamed. If the file was renamed by another client, again 4838 it is possible that the original client will not be able to recover. 4839 However, in the case that the client itself is renaming the file and 4840 the file is open, it is possible that the client may be able to 4841 recover. The client can determine the new pathname based on the 4842 processing of the rename request. The client can then regenerate the 4843 new filehandle based on the new pathname. The client could also use 4844 the COMPOUND procedure to construct a series of operations like: 4846 RENAME A B 4847 LOOKUP B 4848 GETFH 4850 Note that the COMPOUND procedure does not provide atomicity. This 4851 example only reduces the overhead of recovering from an expired 4852 filehandle. 4854 5. File Attributes 4856 To meet the requirements of extensibility and increased 4857 interoperability with non-UNIX platforms, attributes need to be 4858 handled in a flexible manner. The NFSv3 fattr3 structure contains a 4859 fixed list of attributes that not all clients and servers are able to 4860 support or care about. 
The fattr3 structure cannot be extended as 4861 new needs arise and it provides no way to indicate non-support. With 4862 the NFSv4.1 protocol, the client is able to query what attributes the 4863 server supports and construct requests with only those supported 4864 attributes (or a subset thereof). 4866 To this end, attributes are divided into three groups: REQUIRED, 4867 RECOMMENDED, and named. Both REQUIRED and RECOMMENDED attributes are 4868 supported in the NFSv4.1 protocol by a specific and well-defined 4869 encoding and are identified by number. They are requested by setting 4870 a bit in the bit vector sent in the GETATTR request; the server 4871 response includes a bit vector to list what attributes were returned 4872 in the response. New REQUIRED or RECOMMENDED attributes may be added 4873 to the NFSv4 protocol as part of a new minor version by publishing a 4874 Standards Track RFC that allocates a new attribute number value and 4875 defines the encoding for the attribute. See Section 2.7 for further 4876 discussion. 4878 Named attributes are accessed by the new OPENATTR operation, which 4879 accesses a hidden directory of attributes associated with a file 4880 system object. OPENATTR takes a filehandle for the object and 4881 returns the filehandle for the attribute hierarchy. The filehandle 4882 for the named attributes is a directory object accessible by LOOKUP 4883 or READDIR and contains files whose names represent the named 4884 attributes and whose data bytes are the value of the attribute. 
For 4885 example: 4887 +----------+-----------+---------------------------------+ 4888 | LOOKUP | "foo" | ; look up file | 4889 | GETATTR | attrbits | | 4890 | OPENATTR | | ; access foo's named attributes | 4891 | LOOKUP | "x11icon" | ; look up specific attribute | 4892 | READ | 0,4096 | ; read stream of bytes | 4893 +----------+-----------+---------------------------------+ 4895 Named attributes are intended for data needed by applications rather 4896 than by an NFS client implementation. NFS implementors are strongly 4897 encouraged to define their new attributes as RECOMMENDED attributes 4898 by bringing them to the IETF Standards Track process. 4900 The set of attributes that are classified as REQUIRED is deliberately 4901 small since servers need to do whatever it takes to support them. A 4902 server should support as many of the RECOMMENDED attributes as 4903 possible but, by their definition, the server is not required to 4904 support all of them. Attributes are deemed REQUIRED if the data is 4905 both needed by a large number of clients and is not otherwise 4906 reasonably computable by the client when support is not provided on 4907 the server. 4909 Note that the hidden directory returned by OPENATTR is a convenience 4910 for protocol processing. The client should not make any assumptions 4911 about the server's implementation of named attributes and whether or 4912 not the underlying file system at the server has a named attribute 4913 directory. Therefore, operations such as SETATTR and GETATTR on the 4914 named attribute directory are undefined. 4916 5.1. REQUIRED Attributes 4918 These MUST be supported by every NFSv4.1 client and server in order 4919 to ensure a minimum level of interoperability. The server MUST store 4920 and return these attributes, and the client MUST be able to function 4921 with an attribute set limited to these attributes. 
With just the 4922 REQUIRED attributes some client functionality may be impaired or 4923 limited in some ways. A client may ask for any of these attributes 4924 to be returned by setting a bit in the GETATTR request, and the 4925 server MUST return their value. 4927 5.2. RECOMMENDED Attributes 4929 These attributes are understood well enough to warrant support in the 4930 NFSv4.1 protocol. However, they may not be supported on all clients 4931 and servers. A client may ask for any of these attributes to be 4932 returned by setting a bit in the GETATTR request but must handle the 4933 case where the server does not return them. A client MAY ask for the 4934 set of attributes the server supports and SHOULD NOT request 4935 attributes the server does not support. A server should be tolerant 4936 of requests for unsupported attributes and simply not return them 4937 rather than considering the request an error. It is expected that 4938 servers will support all attributes they comfortably can and only 4939 fail to support attributes that are difficult to support in their 4940 operating environments. A server should provide attributes whenever 4941 they don't have to "tell lies" to the client. For example, a file 4942 modification time should be either an accurate time or should not be 4943 supported by the server. At times this will be difficult for 4944 clients, but a client is better positioned to decide whether and how 4945 to fabricate or construct an attribute or whether to do without the 4946 attribute. 4948 5.3. Named Attributes 4950 These attributes are not supported by direct encoding in the NFSv4 4951 protocol but are accessed by string names rather than numbers and 4952 correspond to an uninterpreted stream of bytes that are stored with 4953 the file system object. The namespace for these attributes may be 4954 accessed by using the OPENATTR operation. 
The OPENATTR operation 4955 returns a filehandle for a virtual "named attribute directory", and 4956 further perusal and modification of the namespace may be done using 4957 operations that work on more typical directories. In particular, 4958 READDIR may be used to get a list of such named attributes, and 4959 LOOKUP and OPEN may select a particular attribute. Creation of a new 4960 named attribute may be the result of an OPEN specifying file 4961 creation. 4963 Once an OPEN is done, named attributes may be examined and changed by 4964 normal READ and WRITE operations using the filehandles and stateids 4965 returned by OPEN. 4967 Named attributes and the named attribute directory may have their own 4968 (non-named) attributes. Each of these objects MUST have all of the 4969 REQUIRED attributes and may have additional RECOMMENDED attributes. 4970 However, the set of attributes for named attributes and the named 4971 attribute directory need not be, and typically will not be, as large 4972 as that for other objects in that file system. 4974 Named attributes and the named attribute directory might be the 4975 target of delegations (in the case of the named attribute directory, 4976 these will be directory delegations). However, since granting of 4977 delegations is at the server's discretion, a server need not support 4978 delegations on named attributes or the named attribute directory. 4980 It is RECOMMENDED that servers support arbitrary named attributes. A 4981 client should not depend on the ability to store any named attributes 4982 in the server's file system. If a server does support named 4983 attributes, a client that is also able to handle them should be able 4984 to copy a file's data and metadata with complete transparency from 4985 one location to another; this would imply that names allowed for 4986 regular directory entries are valid for named attribute names as 4987 well. 
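The access model of Section 5.3 (OPENATTR yields a virtual attribute directory, which ordinary directory and file operations then traverse) can be illustrated with a toy model. The in-memory objects and function names below are assumptions of this sketch, not protocol data types or operations.

```python
# Toy model of named-attribute access per Section 5.3.  The named
# attribute "directory" is flat (no subdirectories are permitted), so
# a name -> bytes mapping suffices; each attribute value is an
# uninterpreted stream of bytes.

files = {"foo": {"data": b"hello", "attrdir": {"x11icon": b"\x89ICON"}}}

def openattr(fname):
    """Return the virtual named attribute directory for a file."""
    return files[fname]["attrdir"]

def readdir(attrdir):
    """List the named attributes, as READDIR on the directory would."""
    return sorted(attrdir)

def read(attrdir, name):
    """Read an attribute's value, as LOOKUP/OPEN plus READ would."""
    return attrdir[name]

adir = openattr("foo")          # cf. LOOKUP "foo"; OPENATTR
value = read(adir, "x11icon")   # cf. LOOKUP "x11icon"; READ
```

This mirrors the COMPOUND sequence shown earlier (LOOKUP "foo"; OPENATTR; LOOKUP "x11icon"; READ), with the flat mapping reflecting the restriction that no hierarchical structure of named attributes is allowed.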
In NFSv4.1, the structure of named attribute directories is restricted in a number of ways, in order to prevent the development of non-interoperable implementations in which some servers support a fully general hierarchical directory structure for named attributes while others support a limited but adequate structure for named attributes. In such an environment, clients or applications might come to depend on non-portable extensions. The restrictions are:

o  CREATE is not allowed in a named attribute directory. Thus, such objects as symbolic links and special files are not allowed to be named attributes. Further, directories may not be created in a named attribute directory, so no hierarchical structure of named attributes for a single object is allowed.

o  If OPENATTR is done on a named attribute directory or on a named attribute, the server MUST return NFS4ERR_WRONG_TYPE.

o  Doing a RENAME of a named attribute to a different named attribute directory or to an ordinary (i.e., non-named-attribute) directory is not allowed.

o  Creating hard links between named attribute directories or between named attribute directories and ordinary directories is not allowed.

Names of attributes will not be controlled by this document or other IETF Standards Track documents. See Section 22.2 for further discussion.

5.4. Classification of Attributes

Each of the REQUIRED and RECOMMENDED attributes can be classified in one of three categories: per server (i.e., the value of the attribute will be the same for all file objects that share the same server owner; see Section 2.5 for a definition of server owner), per file system (i.e., the value of the attribute will be the same for some or all file objects that share the same fsid attribute (Section 5.8.1.9) and server owner), or per file system object.
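One practical consequence of this three-way classification is in client-side attribute caching: a per-server or per-file-system value learned for one object can be reused for every other object with the same server owner or fsid. The sketch below is a hypothetical illustration of that idea (the attribute sets shown are small subsets, and the cache-key scheme is an assumption, not anything the protocol mandates).

```python
# Hypothetical cache keying by attribute classification. Only a few
# attributes from each class are listed for illustration.
PER_SERVER = {"lease_time"}
PER_FS = {"supported_attrs", "maxread", "maxwrite", "homogeneous"}

def cache_key(attr, server_owner, fsid, filehandle):
    """Choose cache granularity from the attribute's classification."""
    if attr in PER_SERVER:
        return ("server", server_owner)
    if attr in PER_FS:
        return ("fs", server_owner, fsid)
    return ("object", filehandle)   # per file system object

# Objects sharing an fsid share one cached value for per-fs attributes:
assert cache_key("maxread", "s1", 7, "fhA") == cache_key("maxread", "s1", 7, "fhB")
# Per-object attributes are cached separately per filehandle:
assert cache_key("size", "s1", 7, "fhA") != cache_key("size", "s1", 7, "fhB")
```

A real client would additionally honor the "homogeneous" attribute, since, as noted below, some per-file-system attributes may vary within a file system when homogeneous is FALSE.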
Note that it is possible that some per-file-system attributes may vary within the file system, depending on the value of the "homogeneous" (Section 5.8.2.16) attribute. Note that the attributes time_access_set and time_modify_set are not listed in this section because they are write-only attributes corresponding to time_access and time_modify, and are used in a special instance of SETATTR.

o  The per-server attribute is:

      lease_time

o  The per-file system attributes are:

      supported_attrs, suppattr_exclcreat, fh_expire_type, link_support, symlink_support, unique_handles, aclsupport, cansettime, case_insensitive, case_preserving, chown_restricted, files_avail, files_free, files_total, fs_locations, homogeneous, maxfilesize, maxname, maxread, maxwrite, no_trunc, space_avail, space_free, space_total, time_delta, change_policy, fs_status, fs_layout_type, fs_locations_info, fs_charset_cap

o  The per-file system object attributes are:

      type, change, size, named_attr, fsid, rdattr_error, filehandle, acl, archive, fileid, hidden, maxlink, mimetype, mode, numlinks, owner, owner_group, rawdev, space_used, system, time_access, time_backup, time_create, time_metadata, time_modify, mounted_on_fileid, dir_notif_delay, dirent_notif_delay, dacl, sacl, layout_type, layout_hint, layout_blksize, layout_alignment, mdsthreshold, retention_get, retention_set, retentevt_get, retentevt_set, retention_hold, mode_set_masked

For quota_avail_hard, quota_avail_soft, and quota_used, see their definitions below for the appropriate classification.

5.5. Set-Only and Get-Only Attributes

Some REQUIRED and RECOMMENDED attributes are set-only; i.e., they can be set via SETATTR but not retrieved via GETATTR. Similarly, some REQUIRED and RECOMMENDED attributes are get-only; i.e., they can be retrieved via GETATTR but not set via SETATTR.
If a client attempts to set a get-only attribute or get a set-only attribute, the server MUST return NFS4ERR_INVAL.

5.6. REQUIRED Attributes - List and Definition References

The list of REQUIRED attributes appears in Table 2. The meanings of the columns of the table are:

o  Name: The name of the attribute.

o  Id: The number assigned to the attribute. In the event of conflicts between the assigned number and [10], the latter is likely authoritative, but should be resolved with Errata to this document and/or [10]. See [50] for the Errata process.

o  Data Type: The XDR data type of the attribute.

o  Acc: Access allowed to the attribute. R means read-only (GETATTR may retrieve, SETATTR may not set). W means write-only (SETATTR may set, GETATTR may not retrieve). R W means read/write (GETATTR may retrieve, SETATTR may set).

o  Defined in: The section of this specification that describes the attribute.

   +--------------------+----+------------+-----+-------------------+
   | Name               | Id | Data Type  | Acc | Defined in:       |
   +--------------------+----+------------+-----+-------------------+
   | supported_attrs    | 0  | bitmap4    | R   | Section 5.8.1.1   |
   | type               | 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
   | fh_expire_type     | 2  | uint32_t   | R   | Section 5.8.1.3   |
   | change             | 3  | uint64_t   | R   | Section 5.8.1.4   |
   | size               | 4  | uint64_t   | R W | Section 5.8.1.5   |
   | link_support       | 5  | bool       | R   | Section 5.8.1.6   |
   | symlink_support    | 6  | bool       | R   | Section 5.8.1.7   |
   | named_attr         | 7  | bool       | R   | Section 5.8.1.8   |
   | fsid               | 8  | fsid4      | R   | Section 5.8.1.9   |
   | unique_handles     | 9  | bool       | R   | Section 5.8.1.10  |
   | lease_time         | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
   | rdattr_error       | 11 | enum       | R   | Section 5.8.1.12  |
   | filehandle         | 19 | nfs_fh4    | R   | Section 5.8.1.13  |
   | suppattr_exclcreat | 75 | bitmap4    | R   | Section 5.8.1.14  |
   +--------------------+----+------------+-----+-------------------+

                                Table 2

5.7. RECOMMENDED Attributes - List and Definition References

The RECOMMENDED attributes are defined in Table 3. The meanings of the column headers are the same as Table 2; see Section 5.6 for the meanings.

   +--------------------+----+----------------+-----+------------------+
   | Name               | Id | Data Type      | Acc | Defined in:      |
   +--------------------+----+----------------+-----+------------------+
   | acl                | 12 | nfsace4<>      | R W | Section 6.2.1    |
   | aclsupport         | 13 | uint32_t       | R   | Section 6.2.1.2  |
   | archive            | 14 | bool           | R W | Section 5.8.2.1  |
   | cansettime         | 15 | bool           | R   | Section 5.8.2.2  |
   | case_insensitive   | 16 | bool           | R   | Section 5.8.2.3  |
   | case_preserving    | 17 | bool           | R   | Section 5.8.2.4  |
   | change_policy      | 60 | chg_policy4    | R   | Section 5.8.2.5  |
   | chown_restricted   | 18 | bool           | R   | Section 5.8.2.6  |
   | dacl               | 58 | nfsacl41       | R W | Section 6.2.2    |
   | dir_notif_delay    | 56 | nfstime4       | R   | Section 5.11.1   |
   | dirent_notif_delay | 57 | nfstime4       | R   | Section 5.11.2   |
   | fileid             | 20 | uint64_t       | R   | Section 5.8.2.7  |
   | files_avail        | 21 | uint64_t       | R   | Section 5.8.2.8  |
   | files_free         | 22 | uint64_t       | R   | Section 5.8.2.9  |
   | files_total        | 23 | uint64_t       | R   | Section 5.8.2.10 |
   | fs_charset_cap     | 76 | uint32_t       | R   | Section 5.8.2.11 |
   | fs_layout_type     | 62 | layouttype4<>  | R   | Section 5.12.1   |
   | fs_locations       | 24 | fs_locations   | R   | Section 5.8.2.12 |
   | fs_locations_info  | 67 | *              | R   | Section 5.8.2.13 |
   | fs_status          | 61 | fs4_status     | R   | Section 5.8.2.14 |
   | hidden             | 25 | bool           | R W | Section 5.8.2.15 |
   | homogeneous        | 26 | bool           | R   | Section 5.8.2.16 |
   | layout_alignment   | 66 | uint32_t       | R   | Section 5.12.2   |
   | layout_blksize     | 65 | uint32_t       | R   | Section 5.12.3   |
   | layout_hint        | 63 | layouthint4    | W   | Section 5.12.4   |
   | layout_type        | 64 | layouttype4<>  | R   | Section 5.12.5   |
   | maxfilesize        | 27 | uint64_t       | R   | Section 5.8.2.17 |
   | maxlink            | 28 | uint32_t       | R   | Section 5.8.2.18 |
   | maxname            | 29 | uint32_t       | R   | Section 5.8.2.19 |
   | maxread            | 30 | uint64_t       | R   | Section 5.8.2.20 |
   | maxwrite           | 31 | uint64_t       | R   | Section 5.8.2.21 |
   | mdsthreshold       | 68 | mdsthreshold4  | R   | Section 5.12.6   |
   | mimetype           | 32 | utf8str_cs     | R W | Section 5.8.2.22 |
   | mode               | 33 | mode4          | R W | Section 6.2.4    |
   | mode_set_masked    | 74 | mode_masked4   | W   | Section 6.2.5    |
   | mounted_on_fileid  | 55 | uint64_t       | R   | Section 5.8.2.23 |
   | no_trunc           | 34 | bool           | R   | Section 5.8.2.24 |
   | numlinks           | 35 | uint32_t       | R   | Section 5.8.2.25 |
   | owner              | 36 | utf8str_mixed  | R W | Section 5.8.2.26 |
   | owner_group        | 37 | utf8str_mixed  | R W | Section 5.8.2.27 |
   | quota_avail_hard   | 38 | uint64_t       | R   | Section 5.8.2.28 |
   | quota_avail_soft   | 39 | uint64_t       | R   | Section 5.8.2.29 |
   | quota_used         | 40 | uint64_t       | R   | Section 5.8.2.30 |
   | rawdev             | 41 | specdata4      | R   | Section 5.8.2.31 |
   | retentevt_get      | 71 | retention_get4 | R   | Section 5.13.3   |
   | retentevt_set      | 72 | retention_set4 | W   | Section 5.13.4   |
   | retention_get      | 69 | retention_get4 | R   | Section 5.13.1   |
   | retention_hold     | 73 | uint64_t       | R W | Section 5.13.5   |
   | retention_set      | 70 | retention_set4 | W   | Section 5.13.2   |
   | sacl               | 59 | nfsacl41       | R W | Section 6.2.3    |
   | space_avail        | 42 | uint64_t       | R   | Section 5.8.2.32 |
   | space_free         | 43 | uint64_t       | R   | Section 5.8.2.33 |
   | space_total        | 44 | uint64_t       | R   | Section 5.8.2.34 |
   | space_used         | 45 | uint64_t       | R   | Section 5.8.2.35 |
   | system             | 46 | bool           | R W | Section 5.8.2.36 |
   | time_access        | 47 | nfstime4       | R   | Section 5.8.2.37 |
   | time_access_set    | 48 | settime4       | W   | Section 5.8.2.38 |
   | time_backup        | 49 | nfstime4       | R W | Section 5.8.2.39 |
   | time_create        | 50 | nfstime4       | R W | Section 5.8.2.40 |
   | time_delta         | 51 | nfstime4       | R   | Section 5.8.2.41 |
   | time_metadata      | 52 | nfstime4       | R   | Section 5.8.2.42 |
   | time_modify        | 53 | nfstime4       | R   | Section 5.8.2.43 |
   | time_modify_set    | 54 | settime4       | W   | Section 5.8.2.44 |
   +--------------------+----+----------------+-----+------------------+

                                Table 3

   * fs_locations_info4

5.8. Attribute Definitions

5.8.1. Definitions of REQUIRED Attributes

5.8.1.1. Attribute 0: supported_attrs

The bit vector that would retrieve all REQUIRED and RECOMMENDED attributes that are supported for this object. The scope of this attribute applies to all objects with a matching fsid.

5.8.1.2. Attribute 1: type

Designates the type of an object in terms of one of a number of special constants:

o  NF4REG designates a regular file.

o  NF4DIR designates a directory.

o  NF4BLK designates a block device special file.

o  NF4CHR designates a character device special file.

o  NF4LNK designates a symbolic link.

o  NF4SOCK designates a named socket special file.

o  NF4FIFO designates a fifo special file.

o  NF4ATTRDIR designates a named attribute directory.

o  NF4NAMEDATTR designates a named attribute.

Within the explanatory text and operation descriptions, the following phrases will be used with the meanings given below:

o  The phrase "is a directory" means that the object's type attribute is NF4DIR or NF4ATTRDIR.

o  The phrase "is a special file" means that the object's type attribute is NF4BLK, NF4CHR, NF4SOCK, or NF4FIFO.

o  The phrases "is an ordinary file" and "is a regular file" mean that the object's type attribute is NF4REG or NF4NAMEDATTR.

5.8.1.3. Attribute 2: fh_expire_type

Server uses this to specify filehandle expiration behavior to the client. See Section 4 for additional description.

5.8.1.4.
Attribute 3: change

A value created by the server that the client can use to determine if file data, directory contents, or attributes of the object have been modified. The server may return the object's time_metadata attribute for this attribute's value, but only if the file system object cannot be updated more frequently than the resolution of time_metadata.

5.8.1.5. Attribute 4: size

The size of the object in bytes.

5.8.1.6. Attribute 5: link_support

TRUE, if the object's file system supports hard links.

5.8.1.7. Attribute 6: symlink_support

TRUE, if the object's file system supports symbolic links.

5.8.1.8. Attribute 7: named_attr

TRUE, if this object has named attributes. In other words, this object has a non-empty named attribute directory.

5.8.1.9. Attribute 8: fsid

Unique file system identifier for the file system holding this object. The fsid attribute has major and minor components, each of which are of data type uint64_t.

5.8.1.10. Attribute 9: unique_handles

TRUE, if two distinct filehandles are guaranteed to refer to two different file system objects.

5.8.1.11. Attribute 10: lease_time

Duration of the lease at server in seconds.

5.8.1.12. Attribute 11: rdattr_error

Error returned from an attempt to retrieve attributes during a READDIR operation.

5.8.1.13. Attribute 19: filehandle

The filehandle of this object (primarily for READDIR requests).

5.8.1.14. Attribute 75: suppattr_exclcreat

The bit vector that would set all REQUIRED and RECOMMENDED attributes that are supported by the EXCLUSIVE4_1 method of file creation via the OPEN operation. The scope of this attribute applies to all objects with a matching fsid.

5.8.2. Definitions of Uncategorized RECOMMENDED Attributes

The definitions of most of the RECOMMENDED attributes follow.
Collections that share a common category are defined in other sections.

5.8.2.1. Attribute 14: archive

TRUE, if this file has been archived since the time of last modification (deprecated in favor of time_backup).

5.8.2.2. Attribute 15: cansettime

TRUE, if the server is able to change the times for a file system object as specified in a SETATTR operation.

5.8.2.3. Attribute 16: case_insensitive

TRUE, if file name comparisons on this file system are case insensitive.

5.8.2.4. Attribute 17: case_preserving

TRUE, if file name case on this file system is preserved.

5.8.2.5. Attribute 60: change_policy

A value created by the server that the client can use to determine if some server policy related to the current file system has been subject to change. If the value remains the same, then the client can be sure that the values of the attributes related to fs location and the fss_type field of the fs_status attribute have not changed. On the other hand, a change in this value does not necessarily imply a change in policy. It is up to the client to interrogate the server to determine if some policy relevant to it has changed. See Section 3.3.6 for details.

This attribute MUST change when the value returned by the fs_locations or fs_locations_info attribute changes, when a file system goes from read-only to writable or vice versa, or when the allowable set of security flavors for the file system or any part thereof is changed.

5.8.2.6. Attribute 18: chown_restricted

If TRUE, the server will reject any request to change either the owner or the group associated with a file if the caller is not a privileged user (for example, "root" in UNIX operating environments or, in Windows 2000, the "Take Ownership" privilege).

5.8.2.7. Attribute 20: fileid

A number uniquely identifying the file within the file system.
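The change_policy behavior described above lends itself to a simple client-side check: cached location and policy information may be reused only while the attribute's value is unchanged, and a differing value forces the client to re-interrogate the server. The following sketch is hypothetical (the cache class and fetch callback are illustrative names, not protocol elements).

```python
class FsPolicyCache:
    """Illustrative client cache keyed on the change_policy attribute."""
    def __init__(self):
        self.change_policy = None
        self.fs_locations = None

    def locations(self, server_change_policy, fetch):
        # A changed value means some policy MAY have changed: re-fetch.
        if self.change_policy != server_change_policy:
            self.fs_locations = fetch()
            self.change_policy = server_change_policy
        return self.fs_locations

cache = FsPolicyCache()
calls = []
fetch = lambda: calls.append(1) or ["serverA:/export"]

assert cache.locations(1, fetch) == ["serverA:/export"]
assert cache.locations(1, fetch) == ["serverA:/export"]  # unchanged: cached
assert len(calls) == 1
cache.locations(2, fetch)                                # changed: re-fetched
assert len(calls) == 2
```

The inverse does not hold: as the text notes, an unchanged value guarantees only the attributes listed above, while a changed value obliges the client to check which, if any, policy actually changed.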
5.8.2.8. Attribute 21: files_avail

File slots available to this user on the file system containing this object -- this should be the smallest relevant limit.

5.8.2.9. Attribute 22: files_free

Free file slots on the file system containing this object -- this should be the smallest relevant limit.

5.8.2.10. Attribute 23: files_total

Total file slots on the file system containing this object.

5.8.2.11. Attribute 76: fs_charset_cap

Character set capabilities for this file system. See Section 14.4.

5.8.2.12. Attribute 24: fs_locations

Locations where this file system may be found. If the server returns NFS4ERR_MOVED as an error, this attribute MUST be supported. See Section 11.16 for more details.

5.8.2.13. Attribute 67: fs_locations_info

Full function file system location. See Section 11.17.2 for more details.

5.8.2.14. Attribute 61: fs_status

Generic file system type information. See Section 11.18 for more details.

5.8.2.15. Attribute 25: hidden

TRUE, if the file is considered hidden with respect to the Windows API.

5.8.2.16. Attribute 26: homogeneous

TRUE, if this object's file system is homogeneous; i.e., all objects in the file system (all objects on the server with the same fsid) have common values for all per-file-system attributes.

5.8.2.17. Attribute 27: maxfilesize

Maximum supported file size for the file system of this object.

5.8.2.18. Attribute 28: maxlink

Maximum number of links for this object.

5.8.2.19. Attribute 29: maxname

Maximum file name size supported for this object.

5.8.2.20. Attribute 30: maxread

Maximum amount of data the READ operation will return for this object.

5.8.2.21. Attribute 31: maxwrite

Maximum amount of data the WRITE operation will accept for this object. This attribute SHOULD be supported if the file is writable.
Lack of this attribute can lead to the client either wasting bandwidth or not receiving the best performance.

5.8.2.22. Attribute 32: mimetype

MIME body type/subtype of this object.

5.8.2.23. Attribute 55: mounted_on_fileid

Like fileid, but if the target filehandle is the root of a file system, this attribute represents the fileid of the underlying directory.

UNIX-based operating environments connect a file system into the namespace by connecting (mounting) the file system onto the existing file object (the mount point, usually a directory) of an existing file system. When the mount point's parent directory is read via an API like readdir(), the return results are directory entries, each with a component name and a fileid. The fileid of the mount point's directory entry will be different from the fileid that the stat() system call returns. The stat() system call is returning the fileid of the root of the mounted file system, whereas readdir() is returning the fileid that stat() would have returned before any file systems were mounted on the mount point.

Unlike NFSv3, NFSv4.1 allows a client's LOOKUP request to cross other file systems. The client detects the file system crossing whenever the filehandle argument of LOOKUP has an fsid attribute different from that of the filehandle returned by LOOKUP. A UNIX-based client will consider this a "mount point crossing". UNIX has a legacy scheme for allowing a process to determine its current working directory. This relies on readdir() of a mount point's parent and stat() of the mount point returning fileids as previously described. The mounted_on_fileid attribute corresponds to the fileid that readdir() would have returned as described previously.
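The server-side rule for mounted_on_fileid can be sketched as follows. This is a hypothetical model, not server code: `Obj` stands in for a file system object, and `mounted_on` points to the directory an object's file system is mounted on (None when nothing is mounted there). It also shows the stacked-mount case discussed below, where the server must find the base mount point rather than an intermediate one.

```python
class Obj:
    """Toy file system object; fileid is what stat() would report."""
    def __init__(self, fileid, mounted_on=None):
        self.fileid = fileid
        self.mounted_on = mounted_on  # covered directory, or None

def mounted_on_fileid(obj):
    # No file system mounted at the target object: same as fileid.
    if obj.mounted_on is None:
        return obj.fileid
    # Otherwise, report the fileid readdir() of the parent would show,
    # skipping intermediate mounts to reach the base mount point.
    base = obj.mounted_on
    while base.mounted_on is not None:
        base = base.mounted_on
    return base.fileid

plain = Obj(100)
mnt_dir = Obj(7)                       # directory being mounted on
fs_root = Obj(2, mounted_on=mnt_dir)   # root of the mounted file system
stacked = Obj(3, mounted_on=fs_root)   # second file system on same point

assert mounted_on_fileid(plain) == 100   # equals the fileid attribute
assert mounted_on_fileid(fs_root) == 7   # fileid of underlying directory
assert mounted_on_fileid(stacked) == 7   # base mount point, not fs_root
```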
While the NFSv4.1 client could simply fabricate a fileid corresponding to what mounted_on_fileid provides (and if the server does not support mounted_on_fileid, the client has no choice), there is a risk that the client will generate a fileid that conflicts with one that is already assigned to another object in the file system. Instead, if the server can provide the mounted_on_fileid, the potential for client operational problems in this area is eliminated.

If the server detects that there is no mounted point at the target file object, then the value for mounted_on_fileid that it returns is the same as that of the fileid attribute.

The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD provide it if possible, and for a UNIX-based server, this is straightforward. Usually, mounted_on_fileid will be requested during a READDIR operation, in which case it is trivial (at least for UNIX-based servers) to return mounted_on_fileid since it is equal to the fileid of a directory entry returned by readdir(). If mounted_on_fileid is requested in a GETATTR operation, the server should obey an invariant that has it returning a value that is equal to the file object's entry in the object's parent directory, i.e., what readdir() would have returned. Some operating environments allow a series of two or more file systems to be mounted onto a single mount point. In this case, for the server to obey the aforementioned invariant, it will need to find the base mount point, and not the intermediate mount points.

5.8.2.24. Attribute 34: no_trunc

If this attribute is TRUE, then if the client uses a file name longer than name_max, an error will be returned instead of the name being truncated.

5.8.2.25. Attribute 35: numlinks

Number of hard links to this object.

5.8.2.26.
Attribute 36: owner

The string name of the owner of this object.

5.8.2.27. Attribute 37: owner_group

The string name of the group ownership of this object.

5.8.2.28. Attribute 38: quota_avail_hard

The value in bytes that represents the amount of additional disk space beyond the current allocation that can be allocated to this file or directory before further allocations will be refused. It is understood that this space may be consumed by allocations to other files or directories.

5.8.2.29. Attribute 39: quota_avail_soft

The value in bytes that represents the amount of additional disk space that can be allocated to this file or directory before the user may reasonably be warned. It is understood that this space may be consumed by allocations to other files or directories though there is a rule as to which other files or directories.

5.8.2.30. Attribute 40: quota_used

The value in bytes that represents the amount of disk space used by this file or directory and possibly a number of other similar files or directories, where the set of "similar" meets at least the criterion that allocating space to any file or directory in the set will reduce the "quota_avail_hard" of every other file or directory in the set.

Note that there may be a number of distinct but overlapping sets of files or directories for which a quota_used value is maintained, e.g., "all files with a given owner", "all files with a given group owner", etc. The server is at liberty to choose any of those sets when providing the content of the quota_used attribute, but should do so in a repeatable way. The rule may be configured per file system or may be "choose the set with the smallest quota".

5.8.2.31. Attribute 41: rawdev

Raw device number of file of type NF4BLK or NF4CHR. The device number is split into major and minor numbers.
If the file's type attribute is not NF4BLK or NF4CHR, the value returned SHOULD NOT be considered useful.

5.8.2.32. Attribute 42: space_avail

Disk space in bytes available to this user on the file system containing this object -- this should be the smallest relevant limit.

5.8.2.33. Attribute 43: space_free

Free disk space in bytes on the file system containing this object -- this should be the smallest relevant limit.

5.8.2.34. Attribute 44: space_total

Total disk space in bytes on the file system containing this object.

5.8.2.35. Attribute 45: space_used

Number of file system bytes allocated to this object.

5.8.2.36. Attribute 46: system

This attribute is TRUE if this file is a "system" file with respect to the Windows operating environment.

5.8.2.37. Attribute 47: time_access

The time_access attribute represents the time of last access to the object by a READ operation sent to the server. The notion of what is an "access" depends on the server's operating environment and/or the server's file system semantics. For example, for servers obeying Portable Operating System Interface (POSIX) semantics, time_access would be updated only by the READ and READDIR operations and not any of the operations that modify the content of the object [13], [14], [15]. Of course, setting the corresponding time_access_set attribute is another way to modify the time_access attribute.

Whenever the file object resides on a writable file system, the server should make its best efforts to record time_access into stable storage. However, to mitigate the performance effects of doing so, and most especially whenever the server is satisfying the read of the object's content from its cache, the server MAY cache access time updates and lazily write them to stable storage.
It is also acceptable to give administrators of the server the option to disable time_access updates.

5.8.2.38. Attribute 48: time_access_set

Sets the time of last access to the object. SETATTR use only.

5.8.2.39. Attribute 49: time_backup

The time of last backup of the object.

5.8.2.40. Attribute 50: time_create

The time of creation of the object. This attribute does not have any relation to the traditional UNIX file attribute "ctime" or "change time".

5.8.2.41. Attribute 51: time_delta

Smallest useful server time granularity.

5.8.2.42. Attribute 52: time_metadata

The time of last metadata modification of the object.

5.8.2.43. Attribute 53: time_modify

The time of last modification to the object.

5.8.2.44. Attribute 54: time_modify_set

Sets the time of last modification to the object. SETATTR use only.

5.9. Interpreting owner and owner_group

The RECOMMENDED attributes "owner" and "owner_group" (and also users and groups within the "acl" attribute) are represented in terms of a UTF-8 string. To avoid a representation that is tied to a particular underlying implementation at the client or server, the use of the UTF-8 string has been chosen. Note that Section 6.1 of RFC 2624 [52] provides additional rationale. It is expected that the client and server will have their own local representation of owner and owner_group that is used for local storage or presentation to the end user. Therefore, it is expected that when these attributes are transferred between the client and server, the local representation is translated to a syntax of the form "user@dns_domain". This allows a client and server that do not use the same local representation to translate to a common syntax that can be interpreted by both.
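The handling of owner strings described in this section and the paragraphs that follow can be sketched as below. This is a hedged illustration: the mapping table and function are hypothetical, real servers consult local accounts or a name service, and the rules shown (no translation for strings lacking "@", optional NFSv3-style numeric interpretation with no leading zeros) are described in the surrounding text.

```python
# Hypothetical local mapping; a real server would use accounts or a
# name service rather than a hard-coded table.
USER_TABLE = {"alice@example.org": 1001}

def interpret_owner(owner):
    """Return a local uid for an owner string, or None if untranslatable."""
    if owner in USER_TABLE:
        return USER_TABLE[owner]
    if "@" not in owner:
        # No "@": the sender had no translation available. Optionally,
        # a decimal string with no leading zeros may be treated as an
        # NFSv3-style numeric uid.
        if owner.isdigit() and (owner == "0" or not owner.startswith("0")):
            return int(owner)
        return None   # usable only for local display
    # Unknown domain/user: a SETATTR would yield NFS4ERR_BADOWNER.
    return None

assert interpret_owner("alice@example.org") == 1001
assert interpret_owner("1001") == 1001                 # NFSv3-compat form
assert interpret_owner("bob@elsewhere.example") is None
assert interpret_owner("nobody") is None
```

Note that, as specified below, a server SHOULD reject the numeric form with NFS4ERR_BADOWNER when a valid name@domain translation exists for that user, to keep the compatibility mechanism from subverting normal translation.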
Similarly, security principals may be represented in different ways by different security mechanisms. Servers normally translate these representations into a common format, generally that used by local storage, to serve as a means of identifying the users corresponding to these security principals. When these local identifiers are translated to the form of the owner attribute, associated with files created by such principals, they identify, in a common format, the users associated with each corresponding set of security principals.

The translation used to interpret owner and group strings is not specified as part of the protocol. This allows various solutions to be employed. For example, a local translation table may be consulted that maps a numeric identifier to the user@dns_domain syntax. A name service may also be used to accomplish the translation. A server may provide a more general service, not limited by any particular translation (which would only translate a limited set of possible strings), by storing the owner and owner_group attributes in local storage without any translation, or it may augment a translation method by storing the entire string for attributes for which no translation is available while using the local representation for those cases in which a translation is available.

Servers that do not provide support for all possible values of the owner and owner_group attributes SHOULD return an error (NFS4ERR_BADOWNER) when a string is presented that has no translation, as the value to be set for a SETATTR of the owner, owner_group, or acl attributes. When a server does accept an owner or owner_group value as valid on a SETATTR (and similarly for the owner and group strings in an acl), it is promising to return that same string when a corresponding GETATTR is done.
Configuration changes (including changes from the mapping of the string to the local representation) and ill-constructed name translations (those that contain aliasing) may make that promise impossible to honor. Servers should make appropriate efforts to avoid a situation in which these attributes have their values changed when no real change to ownership has occurred.

The "dns_domain" portion of the owner string is meant to be a DNS domain name, for example, user@example.org. Servers should accept as valid a set of users for at least one domain. A server may treat other domains as having no valid translations. A more general service is provided when a server is capable of accepting users for multiple domains, or for all domains, subject to security constraints.

In the case where there is no translation available to the client or server, the attribute value will be constructed without the "@". Therefore, the absence of the @ from the owner or owner_group attribute signifies that no translation was available at the sender and that the receiver of the attribute should not use that string as a basis for translation into its own internal format. Even though the attribute value cannot be translated, it may still be useful. In the case of a client, the attribute string may be used for local display of ownership.

To provide a greater degree of compatibility with NFSv3, which identified users and groups by 32-bit unsigned user identifiers and group identifiers, owner and group strings that consist of decimal numeric values with no leading zeros can be given a special interpretation by clients and servers that choose to provide such support. The receiver may treat such a user or group string as representing the same user as would be represented by an NFSv3 uid or gid having the corresponding numeric value.
A server is not obligated to accept such a string, but may return an NFS4ERR_BADOWNER instead. To avoid this mechanism being used to subvert user and group translation, so that a client might pass all of the owners and groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER error when there is a valid translation for the user or owner designated in this way. In that case, the client must use the appropriate name@domain string and not the special form for compatibility.

The owner string "nobody" may be used to designate an anonymous user, which will be associated with a file created by a security principal that cannot be mapped through normal means to the owner attribute. Users and implementations of NFSv4.1 SHOULD NOT use "nobody" to designate a real user whose access is not anonymous.

5.10. Character Case Attributes

With respect to the case_insensitive and case_preserving attributes, each UCS-4 character (which UTF-8 encodes) can be mapped according to Appendix B.2 of RFC 3454 [16]. For general character handling and internationalization issues, see Section 14.

5.11. Directory Notification Attributes

As described in Section 18.39, the client can request a minimum delay for notifications of changes to attributes, but the server is free to ignore what the client requests. The client can determine in advance what notification delays the server will accept by sending a GETATTR operation for either or both of two directory notification attributes. When the client calls the GET_DIR_DELEGATION operation and asks for attribute change notifications, it should request notification delays that are no less than the values in the server-provided attributes.

5.11.1.
Attribute 56: dir_notif_delay 5731 The dir_notif_delay attribute is the minimum number of seconds the 5732 server will delay before notifying the client of a change to the 5733 directory's attributes. 5735 5.11.2. Attribute 57: dirent_notif_delay 5737 The dirent_notif_delay attribute is the minimum number of seconds the 5738 server will delay before notifying the client of a change to a file 5739 object that has an entry in the directory. 5741 5.12. pNFS Attribute Definitions 5743 5.12.1. Attribute 62: fs_layout_type 5745 The fs_layout_type attribute (see Section 3.3.13) applies to a file 5746 system and indicates what layout types are supported by the file 5747 system. When the client encounters a new fsid, the client SHOULD 5748 obtain the value for the fs_layout_type attribute associated with the 5749 new file system. This attribute is used by the client to determine 5750 if the layout types supported by the server match any of the client's 5751 supported layout types. 5753 5.12.2. Attribute 66: layout_alignment 5755 When a client holds layouts on files of a file system, the 5756 layout_alignment attribute indicates the preferred alignment for I/O 5757 to files on that file system. Where possible, the client should send 5758 READ and WRITE operations with offsets that are whole multiples of 5759 the layout_alignment attribute. 5761 5.12.3. Attribute 65: layout_blksize 5763 When a client holds layouts on files of a file system, the 5764 layout_blksize attribute indicates the preferred block size for I/O 5765 to files on that file system. Where possible, the client should send 5766 READ operations with a count argument that is a whole multiple of 5767 layout_blksize, and WRITE operations with a data argument of size 5768 that is a whole multiple of layout_blksize. 5770 5.12.4. Attribute 63: layout_hint 5772 The layout_hint attribute (see Section 3.3.19) may be set on newly 5773 created files to influence the metadata server's choice for the 5774 file's layout. 
If possible, this attribute is one of those set in 5775 the initial attributes within the OPEN operation. The metadata 5776 server may choose to ignore this attribute. The layout_hint 5777 attribute is a subset of the layout structure returned by LAYOUTGET. 5778 For example, instead of specifying particular devices, this would be 5779 used to suggest the stripe width of a file. The server 5780 implementation determines which fields within the layout will be 5781 used. 5783 5.12.5. Attribute 64: layout_type 5785 This attribute lists the layout type(s) available for a file. The 5786 value returned by the server is for informational purposes only. The 5787 client will use the LAYOUTGET operation to obtain the information 5788 needed in order to perform I/O, for example, the specific device 5789 information for the file and its layout. 5791 5.12.6. Attribute 68: mdsthreshold 5793 This attribute is a server-provided hint used to communicate to the 5794 client when it is more efficient to send READ and WRITE operations to 5795 the metadata server or the data server. The two types of thresholds 5796 described are file size thresholds and I/O size thresholds. If a 5797 file's size is smaller than the file size threshold, data accesses 5798 SHOULD be sent to the metadata server. If an I/O request has a 5799 length that is below the I/O size threshold, the I/O SHOULD be sent 5800 to the metadata server. Each threshold type is specified separately 5801 for read and write. 5803 The server MAY provide both types of thresholds for a file. If both 5804 file size and I/O size are provided, the client SHOULD reach or 5805 exceed both thresholds before sending its read or write requests to 5806 the data server. Alternatively, if only one of the specified 5807 thresholds is reached or exceeded, the I/O requests are sent to the 5808 metadata server. 
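The mdsthreshold decision rules above can be sketched as follows for the read path. This is an illustrative sketch only; the function and parameter names are hypothetical, not identifiers from the protocol's XDR, and the zero and all-ones sentinel values are those defined for the threshold attributes.

```python
UINT64_MAX = 0xFFFFFFFFFFFFFFFF  # "all ones" threshold sentinel

def read_goes_to_mds(file_size, io_length, size_thresh, iosize_thresh):
    """Return True if a READ should be sent to the metadata server.

    A threshold of all ones sends every READ to the MDS; a threshold of
    zero can never exceed the request, so it sends nothing to the MDS.
    Otherwise the client SHOULD reach or exceed *both* thresholds before
    using the data server; falling short of either one means the I/O
    goes to the metadata server.
    """
    if UINT64_MAX in (size_thresh, iosize_thresh):
        return True
    return file_size < size_thresh or io_length < iosize_thresh
```

For example, with a file size threshold of 4096 and no I/O size threshold (zero), a READ of a 100-byte file would go to the metadata server, while any READ of a 10000-byte file would go to the data server.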
For each threshold type, a value of zero indicates no READ or WRITE should be sent to the metadata server, while a value of all ones indicates that all READs or WRITEs should be sent to the metadata server.

The attribute is available on a per-filehandle basis. If the current filehandle refers to a non-pNFS file or directory, the metadata server should return an attribute that is representative of the filehandle's file system. It is suggested that this attribute be queried as part of the OPEN operation. Due to dynamic system changes, the client should not assume that the attribute will remain constant for any specific time period; thus, it should be periodically refreshed.

5.13. Retention Attributes

Retention is a concept whereby a file object can be placed in an immutable, undeletable, unrenamable state for a fixed or infinite duration of time. Once in this "retained" state, the file cannot be moved out of the state until the duration of retention has been reached.

When retention is enabled, retention MUST extend to the data of the file and to the name of the file. The server MAY extend retention to any other property of the file, including any subset of REQUIRED, RECOMMENDED, and named attributes, with the exceptions noted in this section.

A server MAY support retention on all, some, or none of the file object types.

The five retention attributes are explained in the next subsections.

5.13.1. Attribute 69: retention_get

If retention is enabled for the associated file, this attribute's value represents the retention begin time of the file object. This attribute's value is only readable with the GETATTR operation and MUST NOT be modified by the SETATTR operation (Section 5.5).
The value of the attribute consists of:

   const RET4_DURATION_INFINITE   = 0xffffffffffffffff;

   struct retention_get4 {
           uint64_t        rg_duration;
           nfstime4        rg_begin_time<1>;
   };

The field rg_duration is the duration in seconds indicating how long the file will be retained once retention is enabled. The field rg_begin_time is an array of up to one absolute time value. If the array is zero length, no beginning retention time has been established, and retention is not enabled. If rg_duration is equal to RET4_DURATION_INFINITE, the file, once retention is enabled, will be retained for an infinite duration.

Once rg_duration reaches zero, rg_begin_time will be of zero length, and retention is no longer enabled.

5.13.2. Attribute 70: retention_set

This attribute is used to set the retention duration and optionally enable retention for the associated file object. This attribute is only modifiable via the SETATTR operation and MUST NOT be retrieved by the GETATTR operation (Section 5.5). This attribute corresponds to retention_get. The value of the attribute consists of:

   struct retention_set4 {
           bool            rs_enable;
           uint64_t        rs_duration<1>;
   };

If the client sets rs_enable to TRUE, then it is enabling retention on the file object with the begin time of retention starting from the server's current time and date. The duration of the retention can also be provided if the rs_duration array is of length one. The duration is the time in seconds from the begin time of retention, and if set to RET4_DURATION_INFINITE, the file is to be retained forever. If retention is enabled, with no duration specified in either this SETATTR or a previous SETATTR, the duration defaults to zero seconds.
5888 The server MAY restrict the enabling of retention or the duration of 5889 retention on the basis of the ACE4_WRITE_RETENTION ACL permission. 5890 The enabling of retention MUST NOT prevent the enabling of event- 5891 based retention or the modification of the retention_hold attribute. 5893 The following rules apply to both the retention_set and retentevt_set 5894 attributes. 5896 o As long as retention is not enabled, the client is permitted to 5897 decrease the duration. 5899 o The duration can always be set to an equal or higher value, even 5900 if retention is enabled. Note that once retention is enabled, the 5901 actual duration (as returned by the retention_get or retentevt_get 5902 attributes; see Section 5.13.1 or Section 5.13.3) is constantly 5903 counting down to zero (one unit per second), unless the duration 5904 was set to RET4_DURATION_INFINITE. Thus, it will not be possible 5905 for the client to precisely extend the duration on a file that has 5906 retention enabled. 5908 o While retention is enabled, attempts to disable retention or 5909 decrease the retention's duration MUST fail with the error 5910 NFS4ERR_INVAL. 5912 o If the principal attempting to change retention_set or 5913 retentevt_set does not have ACE4_WRITE_RETENTION permissions, the 5914 attempt MUST fail with NFS4ERR_ACCESS. 5916 5.13.3. Attribute 71: retentevt_get 5918 Gets the event-based retention duration, and if enabled, the event- 5919 based retention begin time of the file object. This attribute is 5920 like retention_get, but refers to event-based retention. The event 5921 that triggers event-based retention is not defined by the NFSv4.1 5922 specification. 5924 5.13.4. Attribute 72: retentevt_set 5926 Sets the event-based retention duration, and optionally enables 5927 event-based retention on the file object. This attribute corresponds 5928 to retentevt_get and is like retention_set, but refers to event-based 5929 retention. 
When event-based retention is set, the file MUST be retained even if non-event-based retention has been set and the duration of non-event-based retention has been reached. Conversely, when non-event-based retention has been set, the file MUST be retained even if event-based retention has been set and the duration of event-based retention has been reached. The server MAY restrict the enabling of event-based retention or the duration of event-based retention on the basis of the ACE4_WRITE_RETENTION ACL permission. The enabling of event-based retention MUST NOT prevent the enabling of non-event-based retention or the modification of the retention_hold attribute.

5.13.5. Attribute 73: retention_hold

Gets or sets administrative retention holds, one hold per bit position.

This attribute allows up to 64 administrative holds, one hold per bit on the attribute. If retention_hold is not zero, then the file MUST NOT be deleted, renamed, or modified, even if the duration of enabled event-based or non-event-based retention has been reached. The server MAY restrict the modification of retention_hold on the basis of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of administrative retention holds does not prevent the enabling of event-based or non-event-based retention.

If the principal attempting to change retention_hold does not have ACE4_WRITE_RETENTION_HOLD permissions, the attempt MUST fail with NFS4ERR_ACCESS.

6. Access Control Attributes

Access Control Lists (ACLs) are file attributes that specify fine-grained access control. This section covers the "acl", "dacl", "sacl", "aclsupport", "mode", and "mode_set_masked" file attributes and their interactions. Note that file attributes may apply to any file system object.

6.1.
Goals 5969 ACLs and modes represent two well-established models for specifying 5970 permissions. This section specifies requirements that attempt to 5971 meet the following goals: 5973 o If a server supports the mode attribute, it should provide 5974 reasonable semantics to clients that only set and retrieve the 5975 mode attribute. 5977 o If a server supports ACL attributes, it should provide reasonable 5978 semantics to clients that only set and retrieve those attributes. 5980 o On servers that support the mode attribute, if ACL attributes have 5981 never been set on an object, via inheritance or explicitly, the 5982 behavior should be traditional UNIX-like behavior. 5984 o On servers that support the mode attribute, if the ACL attributes 5985 have been previously set on an object, either explicitly or via 5986 inheritance: 5988 * Setting only the mode attribute should effectively control the 5989 traditional UNIX-like permissions of read, write, and execute 5990 on owner, owner_group, and other. 5992 * Setting only the mode attribute should provide reasonable 5993 security. For example, setting a mode of 000 should be enough 5994 to ensure that future OPEN operations for 5995 OPEN4_SHARE_ACCESS_READ or OPEN4_SHARE_ACCESS_WRITE by any 5996 principal fail, regardless of a previously existing or 5997 inherited ACL. 5999 o NFSv4.1 may introduce different semantics relating to the mode and 6000 ACL attributes, but it does not render invalid any previously 6001 existing implementations. Additionally, this section provides 6002 clarifications based on previous implementations and discussions 6003 around them. 6005 o On servers that support both the mode and the acl or dacl 6006 attributes, the server must keep the two consistent with each 6007 other. 
The value of the mode attribute (with the exception of the 6008 three high-order bits described in Section 6.2.4) must be 6009 determined entirely by the value of the ACL, so that use of the 6010 mode is never required for anything other than setting the three 6011 high-order bits. See Section 6.4.1 for exact requirements. 6013 o When a mode attribute is set on an object, the ACL attributes may 6014 need to be modified in order to not conflict with the new mode. 6015 In such cases, it is desirable that the ACL keep as much 6016 information as possible. This includes information about 6017 inheritance, AUDIT and ALARM ACEs, and permissions granted and 6018 denied that do not conflict with the new mode. 6020 6.2. File Attributes Discussion 6022 6.2.1. Attribute 12: acl 6024 The NFSv4.1 ACL attribute contains an array of Access Control Entries 6025 (ACEs) that are associated with the file system object. Although the 6026 client can set and get the acl attribute, the server is responsible 6027 for using the ACL to perform access control. The client can use the 6028 OPEN or ACCESS operations to check access without modifying or 6029 reading data or metadata. 6031 The NFS ACE structure is defined as follows: 6033 typedef uint32_t acetype4; 6035 typedef uint32_t aceflag4; 6037 typedef uint32_t acemask4; 6039 struct nfsace4 { 6040 acetype4 type; 6041 aceflag4 flag; 6042 acemask4 access_mask; 6043 utf8str_mixed who; 6044 }; 6046 To determine if a request succeeds, the server processes each nfsace4 6047 entry in order. Only ACEs that have a "who" that matches the 6048 requester are considered. Each ACE is processed until all of the 6049 bits of the requester's access have been ALLOWED. Once a bit (see 6050 below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer 6051 considered in the processing of later ACEs. 
If an ACCESS_DENIED_ACE 6052 is encountered where the requester's access still has unALLOWED bits 6053 in common with the "access_mask" of the ACE, the request is denied. 6054 When the ACL is fully processed, if there are bits in the requester's 6055 mask that have not been ALLOWED or DENIED, access is denied. 6057 Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types do 6058 not affect a requester's access, and instead are for triggering 6059 events as a result of a requester's access attempt. Therefore, AUDIT 6060 and ALARM ACEs are processed only after processing ALLOW and DENY 6061 ACEs. 6063 The NFSv4.1 ACL model is quite rich. Some server platforms may 6064 provide access-control functionality that goes beyond the UNIX-style 6065 mode attribute, but that is not as rich as the NFS ACL model. So 6066 that users can take advantage of this more limited functionality, the 6067 server may support the acl attributes by mapping between its ACL 6068 model and the NFSv4.1 ACL model. Servers must ensure that the ACL 6069 they actually store or enforce is at least as strict as the NFSv4 ACL 6070 that was set. It is tempting to accomplish this by rejecting any ACL 6071 that falls outside the small set that can be represented accurately. 6072 However, such an approach can render ACLs unusable without special 6073 client-side knowledge of the server's mapping, which defeats the 6074 purpose of having a common NFSv4 ACL protocol. Therefore, servers 6075 should accept every ACL that they can without compromising security. 6077 To help accomplish this, servers may make a special exception, in the 6078 case of unsupported permission bits, to the rule that bits not 6079 ALLOWED or DENIED by an ACL must be denied. For example, a UNIX- 6080 style server might choose to silently allow read attribute 6081 permissions even though an ACL does not explicitly allow those 6082 permissions. 
(An ACL that explicitly denies permission to read attributes should still be rejected.)

The situation is complicated by the fact that a server may have multiple modules that enforce ACLs. For example, the enforcement for NFSv4.1 access may be different from, but not weaker than, the enforcement for local access, and both may be different from the enforcement for access through other protocols such as SMB (Server Message Block). So it may be useful for a server to accept an ACL even if not all of its modules are able to support it.

The guiding principle with regard to NFSv4 access is that the server must not accept ACLs that appear to make access to the file more restrictive than it really is.

6.2.1.1. ACE Type

The constants used for the type field (acetype4) are as follows:

   const ACE4_ACCESS_ALLOWED_ACE_TYPE      = 0x00000000;
   const ACE4_ACCESS_DENIED_ACE_TYPE       = 0x00000001;
   const ACE4_SYSTEM_AUDIT_ACE_TYPE        = 0x00000002;
   const ACE4_SYSTEM_ALARM_ACE_TYPE        = 0x00000003;

Only the ALLOWED and DENIED bits may be used in the dacl attribute, and only the AUDIT and ALARM bits may be used in the sacl attribute. All four are permitted in the acl attribute.

   +------------------------------+--------------+---------------------+
   | Value                        | Abbreviation | Description         |
   +------------------------------+--------------+---------------------+
   | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW        | Explicitly grants   |
   |                              |              | the access defined  |
   |                              |              | in acemask4 to the  |
   |                              |              | file or directory.  |
   | ACE4_ACCESS_DENIED_ACE_TYPE  | DENY         | Explicitly denies   |
   |                              |              | the access defined  |
   |                              |              | in acemask4 to the  |
   |                              |              | file or directory.  |
   | ACE4_SYSTEM_AUDIT_ACE_TYPE   | AUDIT        | Log (in a system-   |
   |                              |              | dependent way) any  |
   |                              |              | access attempt to a |
   |                              |              | file or directory   |
   |                              |              | that uses any of    |
   |                              |              | the access methods  |
   |                              |              | specified in        |
   |                              |              | acemask4.           |
   | ACE4_SYSTEM_ALARM_ACE_TYPE   | ALARM        | Generate an alarm   |
   |                              |              | (in a system-       |
   |                              |              | dependent way) when |
   |                              |              | any access attempt  |
   |                              |              | is made to a file   |
   |                              |              | or directory for    |
   |                              |              | the access methods  |
   |                              |              | specified in        |
   |                              |              | acemask4.           |
   +------------------------------+--------------+---------------------+

The "Abbreviation" column denotes how the types will be referred to throughout the rest of this section.

6.2.1.2. Attribute 13: aclsupport

A server need not support all of the above ACE types. This attribute indicates which ACE types are supported for the current file system. The bitmask constants used to represent the above definitions within the aclsupport attribute are as follows:

   const ACL4_SUPPORT_ALLOW_ACL    = 0x00000001;
   const ACL4_SUPPORT_DENY_ACL     = 0x00000002;
   const ACL4_SUPPORT_AUDIT_ACL    = 0x00000004;
   const ACL4_SUPPORT_ALARM_ACL    = 0x00000008;

Servers that support either the ALLOW or DENY ACE type SHOULD support both ALLOW and DENY ACE types.

Clients should not attempt to set an ACE unless the server claims support for that ACE type. If the server receives a request to set an ACE that it cannot store, it MUST reject the request with NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE that it can store but cannot enforce, the server SHOULD reject the request with NFS4ERR_ATTRNOTSUPP.

Support for any of the ACL attributes is optional (albeit RECOMMENDED).
However, a server that supports either of the new ACL 6167 attributes (dacl or sacl) MUST allow use of the new ACL attributes to 6168 access all of the ACE types that it supports. In other words, if 6169 such a server supports ALLOW or DENY ACEs, then it MUST support the 6170 dacl attribute, and if it supports AUDIT or ALARM ACEs, then it MUST 6171 support the sacl attribute. 6173 6.2.1.3. ACE Access Mask 6175 The bitmask constants used for the access mask field are as follows: 6177 const ACE4_READ_DATA = 0x00000001; 6178 const ACE4_LIST_DIRECTORY = 0x00000001; 6179 const ACE4_WRITE_DATA = 0x00000002; 6180 const ACE4_ADD_FILE = 0x00000002; 6181 const ACE4_APPEND_DATA = 0x00000004; 6182 const ACE4_ADD_SUBDIRECTORY = 0x00000004; 6183 const ACE4_READ_NAMED_ATTRS = 0x00000008; 6184 const ACE4_WRITE_NAMED_ATTRS = 0x00000010; 6185 const ACE4_EXECUTE = 0x00000020; 6186 const ACE4_DELETE_CHILD = 0x00000040; 6187 const ACE4_READ_ATTRIBUTES = 0x00000080; 6188 const ACE4_WRITE_ATTRIBUTES = 0x00000100; 6189 const ACE4_WRITE_RETENTION = 0x00000200; 6190 const ACE4_WRITE_RETENTION_HOLD = 0x00000400; 6192 const ACE4_DELETE = 0x00010000; 6193 const ACE4_READ_ACL = 0x00020000; 6194 const ACE4_WRITE_ACL = 0x00040000; 6195 const ACE4_WRITE_OWNER = 0x00080000; 6196 const ACE4_SYNCHRONIZE = 0x00100000; 6198 Note that some masks have coincident values, for example, 6199 ACE4_READ_DATA and ACE4_LIST_DIRECTORY. The mask entries 6200 ACE4_LIST_DIRECTORY, ACE4_ADD_FILE, and ACE4_ADD_SUBDIRECTORY are 6201 intended to be used with directory objects, while ACE4_READ_DATA, 6202 ACE4_WRITE_DATA, and ACE4_APPEND_DATA are intended to be used with 6203 non-directory objects. 6205 6.2.1.3.1. Discussion of Mask Attributes 6207 ACE4_READ_DATA 6209 Operation(s) affected: 6211 READ 6213 OPEN 6215 Discussion: 6217 Permission to read the data of the file. 6219 Servers SHOULD allow a user the ability to read the data of the 6220 file when only the ACE4_EXECUTE access mask bit is allowed. 
6222 ACE4_LIST_DIRECTORY 6224 Operation(s) affected: 6226 READDIR 6228 Discussion: 6230 Permission to list the contents of a directory. 6232 ACE4_WRITE_DATA 6234 Operation(s) affected: 6236 WRITE 6238 OPEN 6240 SETATTR of size 6242 Discussion: 6244 Permission to modify a file's data. 6246 ACE4_ADD_FILE 6248 Operation(s) affected: 6250 CREATE 6252 LINK 6253 OPEN 6255 RENAME 6257 Discussion: 6259 Permission to add a new file in a directory. The CREATE 6260 operation is affected when nfs_ftype4 is NF4LNK, NF4BLK, 6261 NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because it 6262 is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected when 6263 used to create a regular file. LINK and RENAME are always 6264 affected. 6266 ACE4_APPEND_DATA 6268 Operation(s) affected: 6270 WRITE 6272 OPEN 6274 SETATTR of size 6276 Discussion: 6278 The ability to modify a file's data, but only starting at EOF. 6279 This allows for the notion of append-only files, by allowing 6280 ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to the same user 6281 or group. If a file has an ACL such as the one described above 6282 and a WRITE request is made for somewhere other than EOF, the 6283 server SHOULD return NFS4ERR_ACCESS. 6285 ACE4_ADD_SUBDIRECTORY 6287 Operation(s) affected: 6289 CREATE 6291 RENAME 6293 Discussion: 6295 Permission to create a subdirectory in a directory. The CREATE 6296 operation is affected when nfs_ftype4 is NF4DIR. The RENAME 6297 operation is always affected. 6299 ACE4_READ_NAMED_ATTRS 6300 Operation(s) affected: 6302 OPENATTR 6304 Discussion: 6306 Permission to read the named attributes of a file or to look up 6307 the named attribute directory. OPENATTR is affected when it is 6308 not used to create a named attribute directory. This is when 6309 1) createdir is TRUE, but a named attribute directory already 6310 exists, or 2) createdir is FALSE. 
6312 ACE4_WRITE_NAMED_ATTRS 6314 Operation(s) affected: 6316 OPENATTR 6318 Discussion: 6320 Permission to write the named attributes of a file or to create 6321 a named attribute directory. OPENATTR is affected when it is 6322 used to create a named attribute directory. This is when 6323 createdir is TRUE and no named attribute directory exists. The 6324 ability to check whether or not a named attribute directory 6325 exists depends on the ability to look it up; therefore, users 6326 also need the ACE4_READ_NAMED_ATTRS permission in order to 6327 create a named attribute directory. 6329 ACE4_EXECUTE 6331 Operation(s) affected: 6333 READ 6335 OPEN 6337 REMOVE 6339 RENAME 6341 LINK 6343 CREATE 6345 Discussion: 6347 Permission to execute a file. 6349 Servers SHOULD allow a user the ability to read the data of the 6350 file when only the ACE4_EXECUTE access mask bit is allowed. 6351 This is because there is no way to execute a file without 6352 reading the contents. Though a server may treat ACE4_EXECUTE 6353 and ACE4_READ_DATA bits identically when deciding to permit a 6354 READ operation, it SHOULD still allow the two bits to be set 6355 independently in ACLs, and MUST distinguish between them when 6356 replying to ACCESS operations. In particular, servers SHOULD 6357 NOT silently turn on one of the two bits when the other is set, 6358 as that would make it impossible for the client to correctly 6359 enforce the distinction between read and execute permissions. 6361 As an example, following a SETATTR of the following ACL: 6363 nfsuser:ACE4_EXECUTE:ALLOW 6365 A subsequent GETATTR of ACL for that file SHOULD return: 6367 nfsuser:ACE4_EXECUTE:ALLOW 6369 Rather than: 6371 nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW 6373 ACE4_EXECUTE 6375 Operation(s) affected: 6377 LOOKUP 6379 Discussion: 6381 Permission to traverse/search a directory. 
ACE4_DELETE_CHILD

   Operation(s) affected:

      REMOVE

      RENAME

   Discussion:

      Permission to delete a file or directory within a directory. See Section 6.2.1.3.2 for information on how ACE4_DELETE and ACE4_DELETE_CHILD interact.

ACE4_READ_ATTRIBUTES

   Operation(s) affected:

      GETATTR of file system object attributes

      VERIFY

      NVERIFY

      READDIR

   Discussion:

      The ability to read basic attributes (non-ACLs) of a file. On a UNIX system, basic attributes can be thought of as the stat-level attributes. Allowing this access mask bit would mean that the entity can execute "ls -l" and stat. If a READDIR operation requests attributes, this mask must be allowed for the READDIR to succeed.

ACE4_WRITE_ATTRIBUTES

   Operation(s) affected:

      SETATTR of time_access_set, time_backup, time_create, time_modify_set, mimetype, hidden, system

   Discussion:

      Permission to change the times associated with a file or directory to an arbitrary value. Also permission to change the mimetype, hidden, and system attributes. A user having ACE4_WRITE_DATA or ACE4_WRITE_ATTRIBUTES will be allowed to set the times associated with a file to the current server time.

ACE4_WRITE_RETENTION

   Operation(s) affected:

      SETATTR of retention_set, retentevt_set.

   Discussion:

      Permission to modify the durations of event-based and non-event-based retention. Also permission to enable event-based and non-event-based retention. A server MAY behave such that setting ACE4_WRITE_ATTRIBUTES allows ACE4_WRITE_RETENTION.

ACE4_WRITE_RETENTION_HOLD

   Operation(s) affected:

      SETATTR of retention_hold.

   Discussion:

      Permission to modify the administrative retention holds. A server MAY map ACE4_WRITE_ATTRIBUTES to ACE4_WRITE_RETENTION_HOLD.
ACE4_DELETE

   Operation(s) affected:

      REMOVE

   Discussion:

      Permission to delete the file or directory. See Section 6.2.1.3.2 for information on how ACE4_DELETE and ACE4_DELETE_CHILD interact.

ACE4_READ_ACL

   Operation(s) affected:

      GETATTR of acl, dacl, or sacl

      NVERIFY

      VERIFY

   Discussion:

      Permission to read the ACL.

ACE4_WRITE_ACL

   Operation(s) affected:

      SETATTR of acl and mode

   Discussion:

      Permission to write the acl and mode attributes.

ACE4_WRITE_OWNER

   Operation(s) affected:

      SETATTR of owner and owner_group

   Discussion:

      Permission to write the owner and owner_group attributes. On UNIX systems, this is the ability to execute chown() and chgrp().

ACE4_SYNCHRONIZE

   Operation(s) affected:

      NONE

   Discussion:

      Permission to use the file object as a synchronization primitive for interprocess communication. This permission is not enforced or interpreted by the NFSv4.1 server on behalf of the client.

      Typically, the ACE4_SYNCHRONIZE permission is only meaningful on local file systems, i.e., file systems not accessed via NFSv4.1. The reason that the permission bit exists is that some operating environments, such as Windows, use ACE4_SYNCHRONIZE.

      For example, if a client copies a file that has ACE4_SYNCHRONIZE set from a local file system to an NFSv4.1 server, and then later copies the file from the NFSv4.1 server to a local file system, it is likely that if ACE4_SYNCHRONIZE was set in the original file, the client will want it set in the second copy. The first copy will not have the permission set unless the NFSv4.1 server has the means to set the ACE4_SYNCHRONIZE bit. The second copy will not have the permission set unless the NFSv4.1 server has the means to retrieve the ACE4_SYNCHRONIZE bit.
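The ordered ACE-processing rules given at the start of Section 6.2.1 (each ACE considered in order, bits accumulating as ALLOWED, a DENY on any still-unALLOWED requested bit failing the check, and leftover bits denied by default) can be sketched as follows. This is an illustrative sketch, not a server implementation; the function names and the `matches` callback are hypothetical.

```python
ALLOW, DENY = 0, 1  # ACE4_ACCESS_ALLOWED_ACE_TYPE, ACE4_ACCESS_DENIED_ACE_TYPE

def acl_check(aces, requester, wanted_mask, matches):
    """Evaluate an ACL per the ordered-processing rules of Section 6.2.1.

    aces is a list of (type, access_mask, who) tuples; matches(who,
    requester) says whether an ACE applies to the requester.  Returns
    True iff every bit in wanted_mask is ALLOWED before being DENIED.
    AUDIT/ALARM ACEs are out of scope here; they never affect access.
    """
    remaining = wanted_mask              # bits not yet ALLOWED
    for ace_type, mask, who in aces:
        if not matches(who, requester):
            continue                     # only matching "who"s count
        if ace_type == ALLOW:
            remaining &= ~mask           # these bits are now ALLOWED
        elif ace_type == DENY and remaining & mask:
            return False                 # a still-unALLOWED bit is denied
    return remaining == 0                # leftover bits are denied by default
```

A usable `matches` for simple cases might be `lambda who, r: who == r or who == "EVERYONE@"`; real servers also consider group membership and the other special identifiers of Section 6.2.1.5.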
6537 Server implementations need not provide the granularity of control 6538 that is implied by this list of masks. For example, POSIX-based 6539 systems might not distinguish ACE4_APPEND_DATA (the ability to append 6540 to a file) from ACE4_WRITE_DATA (the ability to modify existing 6541 contents); both masks would be tied to a single "write" permission 6542 [17]. When such a server returns attributes to the client, it would 6543 show both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if the 6544 write permission is enabled. 6546 If a server receives a SETATTR request that it cannot accurately 6547 implement, it should err in the direction of more restricted access, 6548 except in the previously discussed cases of execute and read. For 6549 example, suppose a server cannot distinguish overwriting data from 6550 appending new data, as described in the previous paragraph. If a 6551 client submits an ALLOW ACE where ACE4_APPEND_DATA is set but 6552 ACE4_WRITE_DATA is not (or vice versa), the server should either turn 6553 off ACE4_APPEND_DATA or reject the request with NFS4ERR_ATTRNOTSUPP. 6555 6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD 6557 Two access mask bits govern the ability to delete a directory entry: 6558 ACE4_DELETE on the object itself (the "target") and ACE4_DELETE_CHILD 6559 on the containing directory (the "parent"). 6561 Many systems also take the "sticky bit" (MODE4_SVTX) on a directory 6562 to allow unlink only to a user that owns either the target or the 6563 parent; on some such systems the decision also depends on whether the 6564 target is writable. 6566 Servers SHOULD allow unlink if either ACE4_DELETE is permitted on the 6567 target, or ACE4_DELETE_CHILD is permitted on the parent. (Note that 6568 this is true even if the parent or target explicitly denies one of 6569 these permissions.) 
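The removal rules above can be sketched as follows. This is a hedged sketch under stated assumptions: the parameter names are hypothetical, the three permission inputs are tri-state (True for explicitly ALLOWed, False for explicitly DENIed, None for neither), and the sticky-bit branch models just one of the behaviors a server "may" choose.

```python
def may_unlink(target_allows_delete, parent_allows_delete_child,
               parent_allows_add_file, svtx_set,
               remover_owns_parent_or_target=False):
    """Sketch of the directory-entry removal rules of Section 6.2.1.3.2."""
    # Either explicit permission suffices, even if the other ACL
    # explicitly denies its counterpart.
    if target_allows_delete or parent_allows_delete_child:
        return True
    # Neither explicitly ALLOWed nor DENIed: fall back to ACE4_ADD_FILE,
    # with an ownership check when MODE4_SVTX is set on the parent
    # (one of the server behaviors the text permits).
    if target_allows_delete is None and parent_allows_delete_child is None:
        if svtx_set and not remover_owns_parent_or_target:
            return False
        return bool(parent_allows_add_file)
    # An explicit DENY (with no countervailing ALLOW) blocks removal.
    return False
```

For example, ACE4_DELETE allowed on the target permits unlink even if the parent denies ACE4_DELETE_CHILD, while with neither set the decision falls back to ACE4_ADD_FILE, subject to the sticky bit.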
6571 If the ACLs in question neither explicitly ALLOW nor DENY either of 6572 the above, and if MODE4_SVTX is not set on the parent, then the 6573 server SHOULD allow the removal if and only if ACE4_ADD_FILE is 6574 permitted. In the case where MODE4_SVTX is set, the server may also 6575 require the remover to own either the parent or the target, or may 6576 require the target to be writable. 6578 This allows servers to support something close to traditional UNIX- 6579 like semantics, with ACE4_ADD_FILE taking the place of the write bit. 6581 6.2.1.4. ACE flag 6583 The bitmask constants used for the flag field are as follows: 6585 const ACE4_FILE_INHERIT_ACE = 0x00000001; 6586 const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; 6587 const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; 6588 const ACE4_INHERIT_ONLY_ACE = 0x00000008; 6589 const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; 6590 const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; 6591 const ACE4_IDENTIFIER_GROUP = 0x00000040; 6592 const ACE4_INHERITED_ACE = 0x00000080; 6594 A server need not support any of these flags. If the server supports 6595 flags that are similar to, but not exactly the same as, these flags, 6596 the implementation may define a mapping between the protocol-defined 6597 flags and the implementation-defined flags. 6599 For example, suppose a client tries to set an ACE with 6600 ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the 6601 server does not support any form of ACL inheritance, the server 6602 should reject the request with NFS4ERR_ATTRNOTSUPP. If the server 6603 supports a single "inherit ACE" flag that applies to both files and 6604 directories, the server may reject the request (i.e., requiring the 6605 client to set both the file and directory inheritance flags). The 6606 server may also accept the request and silently turn on the 6607 ACE4_DIRECTORY_INHERIT_ACE flag. 6609 6.2.1.4.1. 
Discussion of Flag Bits 6611 ACE4_FILE_INHERIT_ACE 6612 Any non-directory file in any sub-directory will get this ACE 6613 inherited. 6615 ACE4_DIRECTORY_INHERIT_ACE 6616 Can be placed on a directory and indicates that this ACE should be 6617 added to each new directory created. 6618 If this flag is set in an ACE in an ACL attribute to be set on a 6619 non-directory file system object, the operation attempting to set 6620 the ACL SHOULD fail with NFS4ERR_ATTRNOTSUPP. 6622 ACE4_NO_PROPAGATE_INHERIT_ACE 6623 Can be placed on a directory. This flag tells the server that 6624 inheritance of this ACE should stop at newly created child 6625 directories. 6627 ACE4_INHERIT_ONLY_ACE 6628 Can be placed on a directory but does not apply to the directory; 6629 ALLOW and DENY ACEs with this bit set do not affect access to the 6630 directory, and AUDIT and ALARM ACEs with this bit set do not 6631 trigger log or alarm events. Such ACEs only take effect once they 6632 are applied (with this bit cleared) to newly created files and 6633 directories as specified by the ACE4_FILE_INHERIT_ACE and 6634 ACE4_DIRECTORY_INHERIT_ACE flags. 6636 If this flag is present on an ACE, but neither 6637 ACE4_DIRECTORY_INHERIT_ACE nor ACE4_FILE_INHERIT_ACE is present, 6638 then an operation attempting to set such an attribute SHOULD fail 6639 with NFS4ERR_ATTRNOTSUPP. 6641 ACE4_SUCCESSFUL_ACCESS_ACE_FLAG 6643 ACE4_FAILED_ACCESS_ACE_FLAG 6644 The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and 6645 ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits may be set only on 6646 ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE 6647 (ALARM) ACE types. If during the processing of the file's ACL, 6648 the server encounters an AUDIT or ALARM ACE that matches the 6649 principal attempting the OPEN, the server notes that fact, and the 6650 presence, if any, of the SUCCESS and FAILED flags encountered in 6651 the AUDIT or ALARM ACE. 
Once the server completes the ACL 6652 processing, it then notes if the operation succeeded or failed. 6653 If the operation succeeded, and if the SUCCESS flag was set for a 6654 matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM 6655 event occurs. If the operation failed, and if the FAILED flag was 6656 set for the matching AUDIT or ALARM ACE, then the appropriate 6657 AUDIT or ALARM event occurs. Either or both of the SUCCESS or 6658 FAILED can be set, but if neither is set, the AUDIT or ALARM ACE 6659 is not useful. 6661 The previously described processing applies to ACCESS operations 6662 even when they return NFS4_OK. For the purposes of AUDIT and 6663 ALARM, we consider an ACCESS operation to be a "failure" if it 6664 fails to return a bit that was requested and supported. 6666 ACE4_IDENTIFIER_GROUP 6667 Indicates that the "who" refers to a GROUP as defined under UNIX 6668 or a GROUP ACCOUNT as defined under Windows. Clients and servers 6669 MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who 6670 value equal to one of the special identifiers outlined in 6671 Section 6.2.1.5. 6673 ACE4_INHERITED_ACE 6674 Indicates that this ACE is inherited from a parent directory. A 6675 server that supports automatic inheritance will place this flag on 6676 any ACEs inherited from the parent directory when creating a new 6677 object. Client applications will use this to perform automatic 6678 inheritance. Clients and servers MUST clear this bit in the acl 6679 attribute; it may only be used in the dacl and sacl attributes. 6681 6.2.1.5. ACE Who 6683 The "who" field of an ACE is an identifier that specifies the 6684 principal or principals to whom the ACE applies. It may refer to a 6685 user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying 6686 which. 6688 There are several special identifiers that need to be understood 6689 universally, rather than in the context of a particular DNS domain. 
6690 Some of these identifiers cannot be understood when an NFS client 6691 accesses the server, but have meaning when a local process accesses 6692 the file. The ability to display and modify these permissions is 6693 permitted over NFS, even if none of the access methods on the server 6694 understands the identifiers. 6696 +---------------+---------------------------------------------------+ 6697 | Who | Description | 6698 +---------------+---------------------------------------------------+ 6699 | OWNER | The owner of the file. | 6700 | GROUP | The group associated with the file. | 6701 | EVERYONE | The world, including the owner and owning group. | 6702 | INTERACTIVE | Accessed from an interactive terminal. | 6703 | NETWORK | Accessed via the network. | 6704 | DIALUP | Accessed as a dialup user to the server. | 6705 | BATCH | Accessed from a batch job. | 6706 | ANONYMOUS | Accessed without any authentication. | 6707 | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS). | 6708 | SERVICE | Access from a system service. | 6709 +---------------+---------------------------------------------------+ 6711 Table 4 6713 To avoid conflict, these special identifiers are distinguished by an 6714 appended "@" and should appear in the form "xxxx@" (with no domain 6715 name after the "@"), for example, ANONYMOUS@. 6717 The ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries with these 6718 special identifiers. When encoding entries with these special 6719 identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to zero. 6721 6.2.1.5.1. Discussion of EVERYONE@ 6723 It is important to note that "EVERYONE@" is not equivalent to the 6724 UNIX "other" entity. This is because, by definition, UNIX "other" 6725 does not include the owner or owning group of a file. "EVERYONE@" 6726 means literally everyone, including the owner or owning group. 6728 6.2.2. Attribute 58: dacl 6730 The dacl attribute is like the acl attribute, but dacl allows just 6731 ALLOW and DENY ACEs. 
The dacl attribute supports automatic 6732 inheritance (see Section 6.4.3.2). 6734 6.2.3. Attribute 59: sacl 6736 The sacl attribute is like the acl attribute, but sacl allows just 6737 AUDIT and ALARM ACEs. The sacl attribute supports automatic 6738 inheritance (see Section 6.4.3.2). 6740 6.2.4. Attribute 33: mode 6742 The NFSv4.1 mode attribute is based on the UNIX mode bits. The 6743 following bits are defined: 6745 const MODE4_SUID = 0x800; /* set user id on execution */ 6746 const MODE4_SGID = 0x400; /* set group id on execution */ 6747 const MODE4_SVTX = 0x200; /* save text even after use */ 6748 const MODE4_RUSR = 0x100; /* read permission: owner */ 6749 const MODE4_WUSR = 0x080; /* write permission: owner */ 6750 const MODE4_XUSR = 0x040; /* execute permission: owner */ 6751 const MODE4_RGRP = 0x020; /* read permission: group */ 6752 const MODE4_WGRP = 0x010; /* write permission: group */ 6753 const MODE4_XGRP = 0x008; /* execute permission: group */ 6754 const MODE4_ROTH = 0x004; /* read permission: other */ 6755 const MODE4_WOTH = 0x002; /* write permission: other */ 6756 const MODE4_XOTH = 0x001; /* execute permission: other */ 6758 Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal 6759 identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and 6760 MODE4_XGRP apply to principals identified in the owner_group 6761 attribute but who are not identified in the owner attribute. Bits 6762 MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH apply to any principal that 6763 does not match that in the owner attribute and does not have a group 6764 matching that of the owner_group attribute. 6766 Bits within a mode other than those specified above are not defined 6767 by this protocol. A server MUST NOT return bits other than those 6768 defined above in a GETATTR or READDIR operation, and it MUST return 6769 NFS4ERR_INVAL if bits other than those defined above are set in a 6770 SETATTR, CREATE, OPEN, VERIFY, or NVERIFY operation. 6772 6.2.5. 
Attribute 74: mode_set_masked 6774 The mode_set_masked attribute is a write-only attribute that allows 6775 individual bits in the mode attribute to be set or reset, without 6776 changing others. It allows, for example, the bits MODE4_SUID, 6777 MODE4_SGID, and MODE4_SVTX to be modified while leaving unmodified 6778 any of the nine low-order mode bits devoted to permissions. 6780 In such instances that the nine low-order bits are left unmodified, 6781 then neither the acl nor the dacl attribute should be automatically 6782 modified as discussed in Section 6.4.1. 6784 The mode_set_masked attribute consists of two words, each in the form 6785 of a mode4. The first consists of the value to be applied to the 6786 current mode value and the second is a mask. Only bits set to one in 6787 the mask word are changed (set or reset) in the file's mode. All 6788 other bits in the mode remain unchanged. Bits in the first word that 6789 correspond to bits that are zero in the mask are ignored, except that 6790 undefined bits are checked for validity and can result in 6791 NFS4ERR_INVAL as described below. 6793 The mode_set_masked attribute is only valid in a SETATTR operation. 6794 If it is used in a CREATE or OPEN operation, the server MUST return 6795 NFS4ERR_INVAL. 6797 Bits not defined as valid in the mode attribute are not valid in 6798 either word of the mode_set_masked attribute. The server MUST return 6799 NFS4ERR_INVAL if any such bits are set to one in a SETATTR. If the 6800 mode and mode_set_masked attributes are both specified in the same 6801 SETATTR, the server MUST also return NFS4ERR_INVAL. 6803 6.3. Common Methods 6805 The requirements in this section will be referred to in future 6806 sections, especially Section 6.4. 6808 6.3.1. Interpreting an ACL 6810 6.3.1.1. Server Considerations 6812 The server uses the algorithm described in Section 6.2.1 to determine 6813 whether an ACL allows access to an object. 
However, the ACL might 6814 not be the sole determiner of access. For example: 6816 o In the case of a file system exported as read-only, the server may 6817 deny write access even though an object's ACL grants it. 6819 o Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL 6820 permissions to prevent a situation from arising in which there is 6821 no valid way to ever modify the ACL. 6823 o All servers will allow a user the ability to read the data of the 6824 file when only the execute permission is granted (i.e., if the ACL 6825 denies the user the ACE4_READ_DATA access and allows the user 6826 ACE4_EXECUTE, the server will allow the user to read the data of 6827 the file). 6829 o Many servers have the notion of owner-override in which the owner 6830 of the object is allowed to override accesses that are denied by 6831 the ACL. This may be helpful, for example, to allow users 6832 continued access to open files on which the permissions have 6833 changed. 6835 o Many servers have the notion of a "superuser" that has privileges 6836 beyond an ordinary user. The superuser may be able to read or 6837 write data or metadata in ways that would not be permitted by the 6838 ACL. 6840 o A retention attribute might also block access otherwise allowed by 6841 ACLs (see Section 5.13). 6843 6.3.1.2. Client Considerations 6845 Clients SHOULD NOT do their own access checks based on their 6846 interpretation of the ACL, but rather use the OPEN and ACCESS 6847 operations to do access checks. This allows the client to act on the 6848 results of having the server determine whether or not access should 6849 be granted based on its interpretation of the ACL. 6851 Clients must be aware of situations in which an object's ACL will 6852 define a certain access even though the server will not enforce it. 6853 In general, but especially in these situations, the client needs to 6854 do its part in the enforcement of access as defined by the ACL. 
To 6855 do this, the client MAY send the appropriate ACCESS operation prior 6856 to servicing the request of the user or application in order to 6857 determine whether the user or application should be granted the 6858 access requested. For examples in which the ACL may define accesses 6859 that the server doesn't enforce, see Section 6.3.1.1. 6861 6.3.2. Computing a Mode Attribute from an ACL 6863 The following method can be used to calculate the MODE4_R*, MODE4_W*, 6864 and MODE4_X* bits of a mode attribute, based upon an ACL. 6866 First, for each of the special identifiers OWNER@, GROUP@, and 6867 EVERYONE@, evaluate the ACL in order, considering only ALLOW and DENY 6868 ACEs for the identifier EVERYONE@ and for the identifier under 6869 consideration. The result of the evaluation will be an NFSv4 ACL 6870 mask showing exactly which bits are permitted to that identifier. 6872 Then translate the calculated mask for OWNER@, GROUP@, and EVERYONE@ 6873 into mode bits for, respectively, the user, group, and other, as 6874 follows: 6876 1. Set the read bit (MODE4_RUSR, MODE4_RGRP, or MODE4_ROTH) if and 6877 only if ACE4_READ_DATA is set in the corresponding mask. 6879 2. Set the write bit (MODE4_WUSR, MODE4_WGRP, or MODE4_WOTH) if and 6880 only if ACE4_WRITE_DATA and ACE4_APPEND_DATA are both set in the 6881 corresponding mask. 6883 3. Set the execute bit (MODE4_XUSR, MODE4_XGRP, or MODE4_XOTH), if 6884 and only if ACE4_EXECUTE is set in the corresponding mask. 6886 6.3.2.1. Discussion 6888 Some server implementations also add bits permitted to named users 6889 and groups to the group bits (MODE4_RGRP, MODE4_WGRP, and 6890 MODE4_XGRP). 6892 Implementations are discouraged from doing this, because it has been 6893 found to cause confusion for users who see members of a file's group 6894 denied access that the mode bits appear to allow. (The presence of 6895 DENY ACEs may also lead to such behavior, but DENY ACEs are expected 6896 to be more rarely used.) 
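The two-step method of Section 6.3.2 can be sketched as follows. ACEs are represented here as simplified (type, mask, who) triples with the flag field omitted; the evaluation loop follows the in-order, first-mention-wins processing of ALLOW and DENY ACEs described in Section 6.2.1:

```python
# ACE types and access mask bits used below (values as defined in
# Sections 6.2.1.1 and 6.2.1.3).
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE  = 0x00000001
ACE4_READ_DATA   = 0x00000001
ACE4_WRITE_DATA  = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_EXECUTE     = 0x00000020

def permitted_mask(acl, who):
    """First step: evaluate the ACL in order, considering only ALLOW
    and DENY ACEs for `who` and for EVERYONE@.  The first ACE to
    mention an access bit decides it."""
    allowed = denied = 0
    for ace_type, mask, ace_who in acl:
        if ace_who not in (who, "EVERYONE@"):
            continue
        if ace_type == ACE4_ACCESS_ALLOWED_ACE_TYPE:
            allowed |= mask & ~denied
        elif ace_type == ACE4_ACCESS_DENIED_ACE_TYPE:
            denied |= mask & ~allowed
    return allowed

def mode_from_acl(acl):
    """Second step: translate the OWNER@, GROUP@, and EVERYONE@
    masks into the nine low-order mode bits."""
    mode = 0
    for shift, who in ((6, "OWNER@"), (3, "GROUP@"), (0, "EVERYONE@")):
        m = permitted_mask(acl, who)
        if m & ACE4_READ_DATA:
            mode |= 0o4 << shift
        if (m & ACE4_WRITE_DATA) and (m & ACE4_APPEND_DATA):
            mode |= 0o2 << shift
        if m & ACE4_EXECUTE:
            mode |= 0o1 << shift
    return mode
```

Note that a DENY ACE for GROUP@ appearing before an ALLOW ACE for EVERYONE@ removes the denied bits from the group mask but not from the other mask, which is one way the mode can understate access granted to EVERYONE@.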
6898 The same user confusion seen when fetching the mode also results if 6899 setting the mode does not effectively control permissions for the 6900 owner, group, and other users; this motivates some of the 6901 requirements that follow. 6903 6.4. Requirements 6905 The server that supports both mode and ACL must take care to 6906 synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the 6907 ACEs that have respective who fields of "OWNER@", "GROUP@", and 6908 "EVERYONE@". This way, the client can see if semantically equivalent 6909 access permissions exist whether the client asks for the owner, 6910 owner_group, and mode attributes or for just the ACL. 6912 In this section, much is made of the methods in Section 6.3.2. Many 6913 requirements refer to this section. But note that the methods have 6914 behaviors specified with "SHOULD". This is intentional, to avoid 6915 invalidating existing implementations that compute the mode according 6916 to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by 6917 actual permissions on owner, group, and other. 6919 6.4.1. Setting the Mode and/or ACL Attributes 6921 In the case where a server supports the sacl or dacl attribute, in 6922 addition to the acl attribute, the server MUST fail a request to set 6923 the acl attribute simultaneously with a dacl or sacl attribute. The 6924 error to be given is NFS4ERR_ATTRNOTSUPP. 6926 6.4.1.1. Setting Mode and not ACL 6928 When any of the nine low-order mode bits are subject to change, 6929 either because the mode attribute was set or because the 6930 mode_set_masked attribute was set and the mask included one or more 6931 bits from the nine low-order mode bits, and no ACL attribute is 6932 explicitly set, the acl and dacl attributes must be modified in 6933 accordance with the updated value of those bits. This must happen 6934 even if the value of the low-order bits is the same after the mode is 6935 set as before. 
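For reference, the masked update defined for mode_set_masked in Section 6.2.5, which determines whether any of the nine low-order bits are subject to change, reduces to a few bit operations. This sketch signals NFS4ERR_INVAL as a Python exception; undefined bits are rejected in either word, even where the mask would ignore them:

```python
MODE4_DEFINED = 0o7777  # the twelve bits defined in Section 6.2.4

def apply_mode_set_masked(current_mode, value, mask):
    """Sketch of the mode_set_masked update (Section 6.2.5): only
    bits set to one in `mask` are copied from `value` into the mode;
    all other bits of the mode remain unchanged."""
    if (value | mask) & ~MODE4_DEFINED:
        raise ValueError("NFS4ERR_INVAL: undefined mode bits set")
    return (current_mode & ~mask) | (value & mask)
```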
6937 Note that any AUDIT or ALARM ACEs (hence any ACEs in the sacl 6938 attribute) are unaffected by changes to the mode. 6940 In cases in which the permissions bits are subject to change, the acl 6941 and dacl attributes MUST be modified such that the mode computed via 6942 the method in Section 6.3.2 yields the low-order nine bits (MODE4_R*, 6943 MODE4_W*, MODE4_X*) of the mode attribute as modified by the 6944 attribute change. The ACL attributes SHOULD also be modified such 6945 that: 6947 1. If MODE4_RGRP is not set, entities explicitly listed in the ACL 6948 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6949 ACE4_READ_DATA. 6951 2. If MODE4_WGRP is not set, entities explicitly listed in the ACL 6952 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6953 ACE4_WRITE_DATA or ACE4_APPEND_DATA. 6955 3. If MODE4_XGRP is not set, entities explicitly listed in the ACL 6956 other than OWNER@ and EVERYONE@ SHOULD NOT be granted 6957 ACE4_EXECUTE. 6959 Access mask bits other than those listed above, appearing in ALLOW 6960 ACEs, MAY also be disabled. 6962 Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect 6963 the permissions of the ACL itself, nor do ACEs of the type AUDIT and 6964 ALARM. As such, it is desirable to leave these ACEs unmodified when 6965 modifying the ACL attributes. 6967 Also note that the requirement may be met by discarding the acl and 6968 dacl, in favor of an ACL that represents the mode and only the mode. 6969 This is permitted, but it is preferable for a server to preserve as 6970 much of the ACL as possible without violating the above requirements. 6971 Discarding the ACL makes it effectively impossible for a file created 6972 with a mode attribute to inherit an ACL (see Section 6.4.3). 6974 6.4.1.2. Setting ACL and Not Mode 6976 When setting the acl or dacl and not setting the mode or 6977 mode_set_masked attributes, the permission bits of the mode need to 6978 be derived from the ACL. 
In this case, the ACL attribute SHOULD be 6979 set as given. The nine low-order bits of the mode attribute 6980 (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result 6981 of the method in Section 6.3.2. The three high-order bits of the 6982 mode (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged. 6984 6.4.1.3. Setting Both ACL and Mode 6986 When setting both the mode (includes use of either the mode attribute 6987 or the mode_set_masked attribute) and the acl or dacl attributes in 6988 the same operation, the attributes MUST be applied in this order: 6989 mode (or mode_set_masked), then ACL. The mode-related attribute is 6990 set as given, then the ACL attribute is set as given, possibly 6991 changing the final mode, as described above in Section 6.4.1.2. 6993 6.4.2. Retrieving the Mode and/or ACL Attributes 6995 This section applies only to servers that support both the mode and 6996 ACL attributes. 6998 Some server implementations may have a concept of "objects without 6999 ACLs", meaning that all permissions are granted and denied according 7000 to the mode attribute and that no ACL attribute is stored for that 7001 object. If an ACL attribute is requested of such a server, the 7002 server SHOULD return an ACL that does not conflict with the mode; 7003 that is to say, the ACL returned SHOULD represent the nine low-order 7004 bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as 7005 described in Section 6.3.2. 7007 For other server implementations, the ACL attribute is always present 7008 for every object. Such servers SHOULD store at least the three high- 7009 order bits of the mode attribute (MODE4_SUID, MODE4_SGID, 7010 MODE4_SVTX). The server SHOULD return a mode attribute if one is 7011 requested, and the low-order nine bits of the mode (MODE4_R*, 7012 MODE4_W*, MODE4_X*) MUST match the result of applying the method in 7013 Section 6.3.2 to the ACL attribute. 7015 6.4.3. 
Creating New Objects 7017 If a server supports any ACL attributes, it may use the ACL 7018 attributes on the parent directory to compute an initial ACL 7019 attribute for a newly created object. This will be referred to as 7020 the inherited ACL within this section. The act of adding one or more 7021 ACEs to the inherited ACL that are based upon ACEs in the parent 7022 directory's ACL will be referred to as inheriting an ACE within this 7023 section. 7025 Implementors should standardize what the behavior of CREATE and OPEN 7026 must be depending on the presence or absence of the mode and ACL 7027 attributes. 7029 1. If just the mode is given in the call: 7031 In this case, inheritance SHOULD take place, but the mode MUST be 7032 applied to the inherited ACL as described in Section 6.4.1.1, 7033 thereby modifying the ACL. 7035 2. If just the ACL is given in the call: 7037 In this case, inheritance SHOULD NOT take place, and the ACL as 7038 defined in the CREATE or OPEN will be set without modification, 7039 and the mode modified as in Section 6.4.1.2. 7041 3. If both mode and ACL are given in the call: 7043 In this case, inheritance SHOULD NOT take place, and both 7044 attributes will be set as described in Section 6.4.1.3. 7046 4. If neither mode nor ACL is given in the call: 7048 In the case where an object is being created without any initial 7049 attributes at all, e.g., an OPEN operation with an opentype4 of 7050 OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD 7051 NOT take place (note that EXCLUSIVE4_1 is a better choice of 7052 createmode4, since it does permit initial attributes). Instead, 7053 the server SHOULD set permissions to deny all access to the newly 7054 created object. It is expected that the appropriate client will 7055 set the desired attributes in a subsequent SETATTR operation, and 7056 the server SHOULD allow that operation to succeed, regardless of 7057 what permissions the object is created with. 
For example, an 7058 empty ACL denies all permissions, but the server should allow the 7059 owner's SETATTR to succeed even though WRITE_ACL is implicitly 7060 denied. 7062 In other cases, inheritance SHOULD take place, and no 7063 modifications to the ACL will happen. The mode attribute, if 7064 supported, MUST be as computed in Section 6.3.2, with the 7065 MODE4_SUID, MODE4_SGID, and MODE4_SVTX bits clear. If no 7066 inheritable ACEs exist on the parent directory, the rules for 7067 creating acl, dacl, or sacl attributes are implementation 7068 defined. If either the dacl or sacl attribute is supported, then 7069 the ACL4_DEFAULTED flag SHOULD be set on the newly created 7070 attributes. 7072 6.4.3.1. The Inherited ACL 7074 If the object being created is not a directory, the inherited ACL 7075 SHOULD NOT inherit ACEs from the parent directory ACL unless the 7076 ACE4_FILE_INHERIT_ACE flag is set. 7078 If the object being created is a directory, the inherited ACL should 7079 inherit all inheritable ACEs from the parent directory, that is, 7080 those that have the ACE4_FILE_INHERIT_ACE or 7081 ACE4_DIRECTORY_INHERIT_ACE flag set. If the inheritable ACE has 7082 ACE4_FILE_INHERIT_ACE set but ACE4_DIRECTORY_INHERIT_ACE is clear, 7083 the inherited ACE on the newly created directory MUST have the 7084 ACE4_INHERIT_ONLY_ACE flag set to prevent the directory from being 7085 affected by ACEs meant for non-directories. 7087 When a new directory is created, the server MAY split any inherited 7088 ACE that is both inheritable and effective (in other words, that has 7089 neither ACE4_INHERIT_ONLY_ACE nor ACE4_NO_PROPAGATE_INHERIT_ACE set), 7090 into two ACEs, one with no inheritance flags and one with 7091 ACE4_INHERIT_ONLY_ACE set. (In the case of a dacl or sacl attribute, 7092 both of those ACEs SHOULD also have the ACE4_INHERITED_ACE flag set.)
7093 This makes it simpler to modify the effective permissions on the 7094 directory without modifying the ACE that is to be inherited to the 7095 new directory's children. 7097 6.4.3.2. Automatic Inheritance 7099 The acl attribute consists only of an array of ACEs, but the sacl 7100 (Section 6.2.3) and dacl (Section 6.2.2) attributes also include an 7101 additional flag field. 7103 struct nfsacl41 { 7104 aclflag4 na41_flag; 7105 nfsace4 na41_aces<>; 7106 }; 7108 The flag field applies to the entire sacl or dacl; three flag values 7109 are defined: 7111 const ACL4_AUTO_INHERIT = 0x00000001; 7112 const ACL4_PROTECTED = 0x00000002; 7113 const ACL4_DEFAULTED = 0x00000004; 7115 and all other bits must be cleared. The ACE4_INHERITED_ACE flag may 7116 be set in the ACEs of the sacl or dacl (whereas it must always be 7117 cleared in the acl). 7119 Together these features allow a server to support automatic 7120 inheritance, which we now explain in more detail. 7122 Inheritable ACEs are normally inherited by child objects only at the 7123 time that the child objects are created; later modifications to 7124 inheritable ACEs do not result in modifications to inherited ACEs on 7125 descendants. 7127 However, the dacl and sacl provide an OPTIONAL mechanism that allows 7128 a client application to propagate changes to inheritable ACEs to an 7129 entire directory hierarchy. 7131 A server that supports this performs inheritance at object creation 7132 time in the normal way, and SHOULD set the ACE4_INHERITED_ACE flag on 7133 any inherited ACEs as they are added to the new object. 7135 A client application such as an ACL editor may then propagate changes 7136 to inheritable ACEs on a directory by recursively traversing that 7137 directory's descendants and modifying each ACL encountered to remove 7138 any ACEs with the ACE4_INHERITED_ACE flag and to replace them by the 7139 new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set). 
It 7140 uses the existing ACE inheritance flags in the obvious way to decide 7141 which ACEs to propagate. (Note that it may encounter further 7142 inheritable ACEs when descending the directory hierarchy and that 7143 those will also need to be taken into account when propagating 7144 inheritable ACEs to further descendants.) 7146 The reach of this propagation may be limited in two ways: first, 7147 automatic inheritance is not performed from any directory ACL that 7148 has the ACL4_AUTO_INHERIT flag cleared; and second, automatic 7149 inheritance stops wherever an ACL with the ACL4_PROTECTED flag is 7150 set, preventing modification of that ACL and also (if the ACL is set 7151 on a directory) of the ACL on any of the object's descendants. 7153 This propagation is performed independently for the sacl and the dacl 7154 attributes; thus, the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may 7155 be independently set for the sacl and the dacl, and propagation of 7156 one type of acl may continue down a hierarchy even where propagation 7157 of the other acl has stopped. 7159 New objects should be created with a dacl and a sacl that both have 7160 the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to 7161 the same value as that on, respectively, the sacl or dacl of the 7162 parent object. 7164 Both the dacl and sacl attributes are RECOMMENDED, and a server may 7165 support one without supporting the other. 7167 A server that supports both the old acl attribute and one or both of 7168 the new dacl or sacl attributes must do so in such a way as to keep 7169 all three attributes consistent with each other. Thus, the ACEs 7170 reported in the acl attribute should be the union of the ACEs 7171 reported in the dacl and sacl attributes, except that the 7172 ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl. 7173 And of course a client that queries only the acl will be unable to 7174 determine the values of the sacl or dacl flag fields. 
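A client-side propagation pass of the sort an ACL editor performs might look like the following sketch. The Node class and tuple ACE representation are hypothetical, and filtering by the file/directory inheritance flags is omitted for brevity:

```python
ACL4_AUTO_INHERIT  = 0x00000001
ACL4_PROTECTED     = 0x00000002
ACE4_INHERITED_ACE = 0x00000080

class Node:
    """A file or directory carrying a dacl: a flag word plus a list
    of ACEs, each a (ace_flags, mask, who) tuple."""
    def __init__(self, flag, aces, children=()):
        self.flag = flag
        self.aces = list(aces)
        self.children = list(children)

def propagate(node, inheritable):
    """Push new inheritable ACEs down a subtree (Section 6.4.3.2)."""
    if node.flag & ACL4_PROTECTED:
        return  # this ACL, and everything below it, is left alone
    # Remove previously inherited ACEs and append the new ones, each
    # marked with ACE4_INHERITED_ACE.
    node.aces = (
        [a for a in node.aces if not a[0] & ACE4_INHERITED_ACE]
        + [(f | ACE4_INHERITED_ACE, m, w) for f, m, w in inheritable])
    if node.flag & ACL4_AUTO_INHERIT:
        for child in node.children:
            propagate(child, inheritable)
```

A fuller implementation would also merge in any additional inheritable ACEs encountered on intermediate directories before recursing, as the text above requires.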
7176 When a client performs a SETATTR for the acl attribute, the server 7177 SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the 7178 dacl. By using the acl attribute, as opposed to the dacl or sacl 7179 attributes, the client signals that it may not understand automatic 7180 inheritance, and thus cannot be trusted to set an ACL for which 7181 automatic inheritance would make sense. 7183 When a client application queries an ACL, modifies it, and sets it 7184 again, it should leave any ACEs marked with ACE4_INHERITED_ACE 7185 unchanged, in their original order, at the end of the ACL. If the 7186 application is unable to do this, it should set the ACL4_PROTECTED 7187 flag. This behavior is not enforced by servers, but violations of 7188 this rule may lead to unexpected results when applications perform 7189 automatic inheritance. 7191 If a server also supports the mode attribute, it SHOULD set the mode 7192 in such a way that leaves inherited ACEs unchanged, in their original 7193 order, at the end of the ACL. If it is unable to do so, it SHOULD 7194 set the ACL4_PROTECTED flag on the file's dacl. 7196 Finally, in the case where the request that creates a new file or 7197 directory does not also set permissions for that file or directory, 7198 and there are also no ACEs to inherit from the parent's directory, 7199 then the server's choice of ACL for the new object is implementation- 7200 dependent. In this case, the server SHOULD set the ACL4_DEFAULTED 7201 flag on the ACL it chooses for the new object. An application 7202 performing automatic inheritance takes the ACL4_DEFAULTED flag as a 7203 sign that the ACL should be completely replaced by one generated 7204 using the automatic inheritance rules. 7206 7. Single-Server Namespace 7208 This section describes the NFSv4 single-server namespace. 
Single- 7209 server namespaces may be presented directly to clients, or they may 7210 be used as a basis to form larger multi-server namespaces (e.g., 7211 site-wide or organization-wide) to be presented to clients, as 7212 described in Section 11. 7214 7.1. Server Exports 7216 On a UNIX server, the namespace describes all the files reachable by 7217 pathnames under the root directory or "/". On a Windows server, the 7218 namespace constitutes all the files on disks named by mapped disk 7219 letters. NFS server administrators rarely make the entire server's 7220 file system namespace available to NFS clients. More often, portions 7221 of the namespace are made available via an "export" feature. In 7222 previous versions of the NFS protocol, the root filehandle for each 7223 export is obtained through the MOUNT protocol; the client sent a 7224 string that identified the export name within the namespace and the 7225 server returned the root filehandle for that export. The MOUNT 7226 protocol also provided an EXPORTS procedure that enumerated the 7227 server's exports. 7229 7.2. Browsing Exports 7231 The NFSv4.1 protocol provides a root filehandle that clients can use 7232 to obtain filehandles for the exports of a particular server, via a 7233 series of LOOKUP operations within a COMPOUND, to traverse a path. A 7234 common user experience is to use a graphical user interface (perhaps 7235 a file "Open" dialog window) to find a file via progressive browsing 7236 through a directory tree. The client must be able to move from one 7237 export to another export via single-component, progressive LOOKUP 7238 operations. 7240 This style of browsing is not well supported by the NFSv3 protocol. 7241 In NFSv3, the client expects all LOOKUP operations to remain within a 7242 single server file system. For example, the device attribute will 7243 not change. This prevents a client from taking namespace paths that 7244 span exports. 
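The progressive browsing described above, combined with the fsid-based mount-point detection of Section 7.7, can be sketched from the client's point of view. The server object and its root_fh/lookup/fsid helpers are hypothetical stand-ins for PUTROOTFH, LOOKUP, and GETATTR operations within a COMPOUND:

```python
def browse(server, components):
    """Walk a path one component at a time, as an NFSv4.1 client
    would with single-component LOOKUPs, recording each component at
    which the fsid attribute changes, i.e. each file system boundary
    crossed."""
    fh = server.root_fh()
    fsid = server.fsid(fh)
    crossings = []
    for name in components:
        fh = server.lookup(fh, name)
        if server.fsid(fh) != fsid:
            fsid = server.fsid(fh)
            crossings.append(name)
    return fh, crossings
```

Under NFSv3, the loop above would fail at the first export boundary; under NFSv4.1, the pseudo file system lets it continue, with each fsid change marking entry into a new (pseudo or real) file system.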
7246 In the case of NFSv3, an automounter on the client can obtain a 7247 snapshot of the server's namespace using the EXPORTS procedure of the 7248 MOUNT protocol. If it understands the server's pathname syntax, it 7249 can create an image of the server's namespace on the client. The 7250 parts of the namespace that are not exported by the server are filled 7251 in with directories that might be constructed similarly to an NFSv4.1 7252 "pseudo file system" (see Section 7.3) that allows the user to browse 7253 from one mounted file system to another. There is a drawback to this 7254 representation of the server's namespace on the client: it is static. 7255 If the server administrator adds a new export, the client will be 7256 unaware of it. 7258 7.3. Server Pseudo File System 7260 NFSv4.1 servers avoid this namespace inconsistency by presenting all 7261 the exports for a given server within the framework of a single 7262 namespace for that server. An NFSv4.1 client uses LOOKUP and READDIR 7263 operations to browse seamlessly from one export to another. 7265 Where there are portions of the server namespace that are not 7266 exported, clients require some way of traversing those portions to 7267 reach actual exported file systems. A technique that servers may use 7268 to provide for this is to bridge the unexported portion of the 7269 namespace via a "pseudo file system" that provides a view of exported 7270 directories only. A pseudo file system has a unique fsid and behaves 7271 like a normal, read-only file system. 7273 Based on the construction of the server's namespace, it is possible 7274 that multiple pseudo file systems may exist. For example, 7276 /a pseudo file system 7277 /a/b real file system 7278 /a/b/c pseudo file system 7279 /a/b/c/d real file system 7281 Each of the pseudo file systems is considered a separate entity and 7282 therefore MUST have its own fsid, unique among all the fsids for that 7283 server. 7285 7.4. 
7.4. Multiple Roots

Certain operating environments are sometimes described as having "multiple roots". In such environments, individual file systems are commonly represented by disk or volume names. NFSv4 servers for these platforms can construct a pseudo file system above these root names so that disk letters or volume names are simply directory names in the pseudo root.

7.5. Filehandle Volatility

The nature of the server's pseudo file system is that it is a logical representation of file system(s) available from the server. Therefore, the pseudo file system is most likely constructed dynamically when the server is first instantiated. It is expected that the pseudo file system may not have an on-disk counterpart from which persistent filehandles could be constructed. Even though it is preferable that the server provide persistent filehandles for the pseudo file system, the NFS client should expect that pseudo file system filehandles are volatile. This can be confirmed by checking the associated "fh_expire_type" attribute for those filehandles in question. If the filehandles are volatile, the NFS client must be prepared to recover a filehandle value (e.g., with a series of LOOKUP operations) when receiving an error of NFS4ERR_FHEXPIRED.

Because it is quite likely that servers will implement pseudo file systems using volatile filehandles, clients need to be prepared for them, rather than assuming that all filehandles will be persistent.

7.6. Exported Root

If the server's root file system is exported, one might conclude that a pseudo file system is unneeded. This is not necessarily so. Assume the following file systems on a server:

   /       fs1  (exported)
   /a      fs2  (not exported)
   /a/b    fs3  (exported)

Because fs2 is not exported, fs3 cannot be reached with simple LOOKUPs.
The server must bridge the gap with a pseudo file system.

7.7. Mount Point Crossing

The server file system environment may be constructed in such a way that one file system contains a directory that is 'covered' or mounted upon by a second file system. For example:

   /a/b           (file system 1)
   /a/b/c/d       (file system 2)

The pseudo file system for this server may be constructed to look like:

   /              (place holder/not exported)
   /a/b           (file system 1)
   /a/b/c/d       (file system 2)

It is the server's responsibility to present the pseudo file system that is complete to the client. If the client sends a LOOKUP request for the path /a/b/c/d, the server's response is the filehandle of the root of the file system /a/b/c/d. In previous versions of the NFS protocol, the server would respond with the filehandle of directory /a/b/c/d within the file system /a/b.

The NFS client will be able to determine if it crosses a server mount point by a change in the value of the "fsid" attribute.

7.8. Security Policy and Namespace Presentation

Because NFSv4 clients possess the ability to change the security mechanisms used, after determining what is allowed, by using SECINFO and SECINFO_NONAME, the server SHOULD NOT present a different view of the namespace based on the security mechanism being used by a client. Instead, it should present a consistent view and return NFS4ERR_WRONGSEC if an attempt is made to access data with an inappropriate security mechanism.

If security considerations make it necessary to hide the existence of a particular file system, as opposed to all of the data within it, the server can apply the security policy of a shared resource in the server's namespace to components of the resource's ancestors.
For example:

   /                         (place holder/not exported)
   /a/b                      (file system 1)
   /a/b/MySecretProject      (file system 2)

The /a/b/MySecretProject directory is a real file system and is the shared resource. Suppose the security policy for /a/b/MySecretProject is Kerberos with integrity and it is desired to limit knowledge of the existence of this file system. In this case, the server should apply the same security policy to /a/b. This allows for knowledge of the existence of a file system to be secured when desirable.

For the case of the use of multiple, disjoint security mechanisms in the server's resources, applying that sort of policy would result in the higher-level file system not being accessible using any security flavor. Therefore, that sort of configuration is not compatible with hiding the existence (as opposed to the contents) from clients using multiple disjoint sets of security flavors.

In other circumstances, a desirable policy is for the security of a particular object in the server's namespace to include the union of all security mechanisms of all direct descendants. A common and convenient practice, unless strong security requirements dictate otherwise, is to make the entire pseudo file system accessible by all of the valid security mechanisms.

Where there is concern about the security of data on the network, clients should use strong security mechanisms to access the pseudo file system in order to prevent man-in-the-middle attacks.

8. State Management

Integrating locking into the NFS protocol necessarily causes it to be stateful.
With the inclusion of such features as share reservations, file and directory delegations, recallable layouts, and support for mandatory byte-range locking, the protocol becomes substantially more dependent on proper management of state than the traditional combination of NFS and NLM (Network Lock Manager) [53]. These features include expanded locking facilities, which provide some measure of inter-client exclusion, but the state also offers features not readily providable using a stateless model. There are three components to making this state manageable:

o  clear division between client and server

o  ability to reliably detect inconsistency in state between client and server

o  simple and robust recovery mechanisms

In this model, the server owns the state information. The client requests changes in locks and the server responds with the changes made. Non-client-initiated changes in locking state are infrequent. The client receives prompt notification of such changes and can adjust its view of the locking state to reflect the server's changes.

Individual pieces of state created by the server and passed to the client at its request are represented by 128-bit stateids. These stateids may represent a particular open file, a set of byte-range locks held by a particular owner, or a recallable delegation of privileges to access a file in particular ways or at a particular location.

In all cases, there is a transition from the most general information that represents a client as a whole to the eventual lightweight stateid used for most client and server locking interactions. The details of this transition will vary with the type of object, but it always starts with a client ID.
8.1. Client and Session ID

A client must establish a client ID (see Section 2.4) and then one or more session IDs (see Section 2.10) before performing any operations to open, byte-range lock, delegate, or obtain a layout for a file object. Each session ID is associated with a specific client ID, and thus serves as a shorthand reference to an NFSv4.1 client.

For some types of locking interactions, the client will represent some number of internal locking entities called "owners", which normally correspond to processes internal to the client. For other types of locking-related objects, such as delegations and layouts, no such intermediate entities are provided for, and the locking-related objects are considered to be transferred directly between the server and a unitary client.

8.2. Stateid Definition

When the server grants a lock of any type (including opens, byte-range locks, delegations, and layouts), it responds with a unique stateid that represents a set of locks (often a single lock) for the same file, of the same type, and sharing the same ownership characteristics. Thus, opens of the same file by different open-owners each have an identifying stateid. Similarly, each set of byte-range locks on a file owned by a specific lock-owner has its own identifying stateid. Delegations and layouts also have associated stateids by which they may be referenced. The stateid is used as a shorthand reference to a lock or set of locks, and given a stateid, the server can determine the associated state-owner or state-owners (in the case of an open-owner/lock-owner pair) and the associated filehandle. When stateids are used, the current filehandle must be the one associated with that stateid.
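The per-owner granularity described above can be sketched in a few lines. In this illustrative Python model (the class, the counter-based stateid values, and the data structures are invented for illustration, not mandated by the protocol), a server grants a distinct stateid for each client-ID/open-owner/filehandle triple, and an upgrade of an existing open bumps the seqid rather than creating a new stateid:

```python
# Illustrative sketch: a toy server granting a distinct stateid per
# (client ID, open-owner, filehandle) triple, as described for opens.
import itertools

class ToyServer:
    def __init__(self):
        self._next_other = itertools.count(1)
        self._stateids = {}   # (clientid, owner, fh) -> [other, seqid]

    def open_file(self, clientid, owner, fh):
        key = (clientid, owner, fh)
        if key not in self._stateids:
            # First open by this owner: new "other" value, seqid starts at 1.
            self._stateids[key] = [next(self._next_other), 1]
        else:
            # Upgrade of an existing open: same "other", seqid incremented.
            self._stateids[key][1] += 1
        other, seqid = self._stateids[key]
        return (other, seqid)

srv = ToyServer()
s1 = srv.open_file("client1", "ownerA", "fh1")
s2 = srv.open_file("client1", "ownerB", "fh1")   # same file, different owner
s3 = srv.open_file("client1", "ownerA", "fh1")   # upgrade: seqid increments
```

Note how s1 and s2 carry different "other" values even though they name the same file, while s3 reuses s1's "other" with a higher seqid.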
All stateids associated with a given client ID are associated with a common lease that represents the claim of those stateids and the objects they represent to be maintained by the server. See Section 8.3 for a discussion of the lease.

The server may assign stateids independently for different clients. A stateid with the same bit pattern for one client may designate an entirely different set of locks for a different client. The stateid is always interpreted with respect to the client ID associated with the current session. Stateids apply to all sessions associated with the given client ID, and the client may use a stateid obtained from one session on another session associated with the same client ID.

8.2.1. Stateid Types

With the exception of special stateids (see Section 8.2.3), each stateid represents locking objects of one of a set of types defined by the NFSv4.1 protocol. Note that in all these cases, where we speak of guarantee, it is understood there are situations such as a client restart, or lock revocation, that allow the guarantee to be voided.

o  Stateids may represent opens of files.

   Each stateid in this case represents the OPEN state for a given client ID/open-owner/filehandle triple. Such stateids are subject to change (with consequent incrementing of the stateid's seqid) in response to OPENs that result in upgrade and OPEN_DOWNGRADE operations.

o  Stateids may represent sets of byte-range locks.

   All locks held on a particular file by a particular owner and gotten under the aegis of a particular open file are associated with a single stateid, with the seqid being incremented whenever LOCK and LOCKU operations affect that set of locks.
o  Stateids may represent file delegations, which are recallable guarantees by the server to the client that other clients will not reference or modify a particular file, until the delegation is returned. In NFSv4.1, file delegations may be obtained on both regular and non-regular files.

   A stateid represents a single delegation held by a client for a particular filehandle.

o  Stateids may represent directory delegations, which are recallable guarantees by the server to the client that other clients will not modify the directory, until the delegation is returned.

   A stateid represents a single delegation held by a client for a particular directory filehandle.

o  Stateids may represent layouts, which are recallable guarantees by the server to the client that particular files may be accessed via an alternate data access protocol at specific locations. Such access is limited to particular sets of byte-ranges and may proceed until those byte-ranges are reduced or the layout is returned.

   A stateid represents the set of all layouts held by a particular client for a particular filehandle with a given layout type. The seqid is updated as the layouts of that set of byte-ranges change, via layout stateid changing operations such as LAYOUTGET and LAYOUTRETURN.

8.2.2. Stateid Structure

Stateids are divided into two fields, a 96-bit "other" field identifying the specific set of locks and a 32-bit "seqid" sequence value. Except in the case of special stateids (see Section 8.2.3), a particular value of the "other" field denotes a set of locks of the same type (for example, byte-range locks, opens, delegations, or layouts), for a specific file or directory, and sharing the same ownership characteristics.
The seqid designates a specific instance of such a set of locks, and is incremented to indicate changes in such a set of locks, either by the addition or deletion of locks from the set, a change in the byte-range they apply to, or an upgrade or downgrade in the type of one or more locks.

When such a set of locks is first created, the server returns a stateid with a seqid value of one. On subsequent operations that modify the set of locks, the server is required to increment the "seqid" field by one whenever it returns a stateid for the same state-owner/file/type combination and there is some change in the set of locks actually designated. In this case, the server will return a stateid with an "other" field the same as previously used for that state-owner/file/type combination, with an incremented "seqid" field. This pattern continues until the seqid is incremented past NFS4_UINT32_MAX, and one (not zero) is the next seqid value.

The purpose of the incrementing of the seqid is to allow the server to communicate to the client the order in which operations that modified locking state associated with a stateid have been processed and to make it possible for the client to send requests that are conditional on the set of locks not having changed since the stateid in question was returned.

Except for layout stateids (Section 12.5.3), when a client sends a stateid to the server, it has two choices with regard to the seqid sent. It may set the seqid to zero to indicate to the server that it wishes the most up-to-date seqid for that stateid's "other" field to be used. This would be the common choice in the case of a stateid sent with a READ or WRITE operation. It also may set a non-zero value, in which case the server checks if that seqid is the correct one.
In that case, the server is required to return NFS4ERR_OLD_STATEID if the seqid is lower than the most current value and NFS4ERR_BAD_STATEID if the seqid is greater than the most current value. This would be the common choice in the case of stateids sent with a CLOSE or OPEN_DOWNGRADE. Because OPENs may be sent in parallel for the same owner, a client might close a file without knowing that an OPEN upgrade had been done by the server, changing the lock in question. If CLOSE were sent with a zero seqid, the OPEN upgrade would be cancelled before the client even received an indication that an upgrade had happened.

When a stateid is sent by the server to the client as part of a callback operation, it is not subject to checking for a current seqid and returning NFS4ERR_OLD_STATEID. This is because the client is not in a position to know the most up-to-date seqid and thus cannot verify it. Unless specially noted, the seqid value for a stateid sent by the server to the client as part of a callback is required to be zero, with NFS4ERR_BAD_STATEID returned if it is not.

In making comparisons between seqids, both by the client in determining the order of operations and by the server in determining whether NFS4ERR_OLD_STATEID is to be returned, the possibility of the seqid being wrapped around past the NFS4_UINT32_MAX value needs to be taken into account. When two seqid values are being compared, the total count of slots for all sessions associated with the current client is used to do this. When one seqid value is less than this total slot count and another seqid value is greater than NFS4_UINT32_MAX minus the total slot count, the former is to be treated as higher than the latter, despite the fact that it is numerically lower.
8.2.3. Special Stateids

Stateid values whose "other" field is either all zeros or all ones are reserved. They may not be assigned by the server but have special meanings defined by the protocol. The particular meaning depends on whether the "other" field is all zeros or all ones and the specific value of the "seqid" field.

The following combinations of "other" and "seqid" are defined in NFSv4.1:

o  When "other" and "seqid" are both zero, the stateid is treated as a special anonymous stateid, which can be used in READ, WRITE, and SETATTR requests to indicate the absence of any OPEN state associated with the request. When an anonymous stateid value is used and an existing open denies the form of access requested, then access will be denied to the request. This stateid MUST NOT be used on operations to data servers (Section 13.6).

o  When "other" and "seqid" are both all ones, the stateid is a special READ bypass stateid. When this value is used in WRITE or SETATTR, it is treated like the anonymous value. When used in READ, the server MAY grant access, even if access would normally be denied to READ operations. This stateid MUST NOT be used on operations to data servers.

o  When "other" is zero and "seqid" is one, the stateid represents the current stateid, which is whatever value is the last stateid returned by an operation within the COMPOUND. In the case of an OPEN, the stateid returned for the open file and not the delegation is used. The stateid passed to the operation in place of the special value has its "seqid" value set to zero, except when the current stateid is used by the operation CLOSE or OPEN_DOWNGRADE. If there is no operation in the COMPOUND that has returned a stateid value, the server MUST return the error NFS4ERR_BAD_STATEID.
   As illustrated in Figure 6, if the value of a current stateid is a special stateid and the stateid of an operation's arguments has "other" set to zero and "seqid" set to one, then the server MUST return the error NFS4ERR_BAD_STATEID.

o  When "other" is zero and "seqid" is NFS4_UINT32_MAX, the stateid represents a reserved stateid value defined to be invalid. When this stateid is used, the server MUST return the error NFS4ERR_BAD_STATEID.

If a stateid value is used that has all zeros or all ones in the "other" field but does not match one of the cases above, the server MUST return the error NFS4ERR_BAD_STATEID.

Special stateids, unlike other stateids, are not associated with individual client IDs or filehandles and can be used with all valid client IDs and filehandles. In the case of a special stateid designating the current stateid, the current stateid value substituted for the special stateid is associated with a particular client ID and filehandle, and so, if it is used where the current filehandle does not match that associated with the current stateid, the operation to which the stateid is passed will return NFS4ERR_BAD_STATEID.

8.2.4. Stateid Lifetime and Validation

Stateids must remain valid until either a client restart or a server restart or until the client returns all of the locks associated with the stateid by means of an operation such as CLOSE or DELEGRETURN. If the locks are lost due to revocation, as long as the client ID is valid, the stateid remains a valid designation of that revoked state until the client frees it by using FREE_STATEID. Stateids associated with byte-range locks are an exception. They remain valid even if a LOCKU frees all remaining locks, so long as the open file with which they are associated remains open, unless the client frees the stateids via the FREE_STATEID operation.
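The seqid arithmetic from Section 8.2.2, incrementing past NFS4_UINT32_MAX to one (never zero) and comparing seqids with wraparound in mind, can be sketched as follows. This is a minimal illustration; here, `total_slots` stands for the total slot count over all of the client's sessions, which bounds how far apart two live seqid values can be.

```python
NFS4_UINT32_MAX = 0xFFFFFFFF

def next_seqid(s):
    """Increment a seqid, wrapping from NFS4_UINT32_MAX to 1 (never 0)."""
    return 1 if s == NFS4_UINT32_MAX else s + 1

def seqid_newer(a, b, total_slots):
    """Return True if seqid 'a' is newer than seqid 'b', treating a value
    that has wrapped past NFS4_UINT32_MAX as higher than one that has not,
    despite being numerically lower."""
    if a < total_slots and b > NFS4_UINT32_MAX - total_slots:
        return True          # 'a' has wrapped: numerically lower but newer
    if b < total_slots and a > NFS4_UINT32_MAX - total_slots:
        return False         # 'b' has wrapped: 'a' is the older value
    return a > b             # no wraparound in play: plain comparison
```

A server would use a comparison of this form when deciding between NFS4ERR_OLD_STATEID and NFS4ERR_BAD_STATEID for a non-zero incoming seqid.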
It should be noted that there are situations in which the client's locks become invalid, without the client requesting they be returned. These include lease expiration and a number of forms of lock revocation within the lease period. It is important to note that in these situations, the stateid remains valid and the client can use it to determine the disposition of the associated lost locks.

An "other" value must never be reused for a different purpose (i.e., different filehandle, owner, or type of locks) within the context of a single client ID. A server may retain the "other" value for the same purpose beyond the point where it may otherwise be freed, but if it does so, it must maintain "seqid" continuity with previous values.

One mechanism that may be used to satisfy the requirement that the server recognize invalid and out-of-date stateids is for the server to divide the "other" field of the stateid into two fields:

o  an index into a table of locking-state structures.

o  a generation number that is incremented on each allocation of a table entry for a particular use.

And then store in each table entry,

o  the client ID with which the stateid is associated.

o  the current generation number for the (at most one) valid stateid sharing this index value.

o  the filehandle of the file on which the locks are taken.

o  an indication of the type of stateid (open, byte-range lock, file delegation, directory delegation, layout).

o  the last "seqid" value returned corresponding to the current "other" value.

o  an indication of the current status of the locks associated with this stateid, in particular, whether these have been revoked and if so, for what reason.

With this information, an incoming stateid can be validated and the appropriate error returned when necessary.
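One possible realization of this table-based scheme is sketched below. The field widths, table size, and entry layout are arbitrary illustrative choices, not protocol requirements; a real server would use the full 96-bit "other" field and richer entries.

```python
# Illustrative sketch: the "other" field split into a table index and a
# generation number, with a per-entry generation check detecting stale
# or invalid stateids.  Widths here are toy values for readability.
TABLE_SIZE = 1024                            # 10 bits of index in this sketch

def make_other(index, generation):
    return (generation << 10) | index

def split_other(other):
    return other & 0x3FF, other >> 10        # (index, generation)

table = [None] * TABLE_SIZE   # entry: {"generation", "clientid", "fh", ...}

def validate(other, clientid, fh):
    """Return None if the 'other' field checks out, else an error name."""
    index, generation = split_other(other)
    if index >= TABLE_SIZE:
        return "NFS4ERR_BAD_STATEID"         # index outside the table
    entry = table[index]
    if entry is None or entry["generation"] != generation:
        return "NFS4ERR_BAD_STATEID"         # freed or reallocated entry
    if entry["fh"] != fh or entry["clientid"] != clientid:
        return "NFS4ERR_BAD_STATEID"         # wrong file or wrong client
    return None

table[5] = {"generation": 7, "clientid": "c1", "fh": "fh1"}
ok = validate(make_other(5, 7), "c1", "fh1")   # passes all checks
bad = validate(make_other(5, 6), "c1", "fh1")  # stale generation
```

The generation check is what lets the server safely reuse table slots: an old stateid naming a reallocated slot fails the comparison rather than aliasing the new state.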
Special and non-special stateids are handled separately. (See Section 8.2.3 for a discussion of special stateids.)

Note that stateids are implicitly qualified by the current client ID, as derived from the client ID associated with the current session. Note, however, that the semantics of the session will prevent stateids associated with a previous client or server instance from being analyzed by this procedure.

If server restart has resulted in an invalid client ID or a session ID that is invalid, SEQUENCE will return an error and the operation that takes a stateid as an argument will never be processed.

If there has been a server restart where there is a persistent session and all leased state has been lost, then the session in question will, although valid, be marked as dead, and any operation not satisfied by means of the reply cache will receive the error NFS4ERR_DEADSESSION, and thus not be processed as indicated below.

When a stateid is being tested and the "other" field is all zeros or all ones, a check that the "other" and "seqid" fields match a defined combination for a special stateid is done and the results determined as follows:

o  If the "other" and "seqid" fields do not match a defined combination associated with a special stateid, the error NFS4ERR_BAD_STATEID is returned.

o  If the special stateid is one designating the current stateid and there is a current stateid, then the current stateid is substituted for the special stateid and the checks appropriate to non-special stateids are performed.

o  If the combination is valid in general but is not appropriate to the context in which the stateid is used (e.g., an all-zero stateid is used when an OPEN stateid is required in a LOCK operation), the error NFS4ERR_BAD_STATEID is also returned.
o  Otherwise, the check is completed and the special stateid is accepted as valid.

When a stateid is being tested, and the "other" field is neither all zeros nor all ones, the following procedure could be used to validate an incoming stateid and return an appropriate error, when necessary, assuming that the "other" field would be divided into a table index and an entry generation:

o  If the table index field is outside the range of the associated table, return NFS4ERR_BAD_STATEID.

o  If the selected table entry is of a different generation than that specified in the incoming stateid, return NFS4ERR_BAD_STATEID.

o  If the selected table entry does not match the current filehandle, return NFS4ERR_BAD_STATEID.

o  If the client ID in the table entry does not match the client ID associated with the current session, return NFS4ERR_BAD_STATEID.

o  If the stateid represents revoked state, then return NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED, as appropriate.

o  If the stateid type is not valid for the context in which the stateid appears, return NFS4ERR_BAD_STATEID. Note that a stateid may be valid in general, as would be reported by the TEST_STATEID operation, but be invalid for a particular operation, as, for example, when a stateid that doesn't represent byte-range locks is passed to the non-from_open case of LOCK or to LOCKU, or when a stateid that does not represent an open is passed to CLOSE or OPEN_DOWNGRADE. In such cases, the server MUST return NFS4ERR_BAD_STATEID.

o  If the "seqid" field is not zero and it is greater than the current sequence value corresponding to the current "other" field, return NFS4ERR_BAD_STATEID.

o  If the "seqid" field is not zero and it is less than the current sequence value corresponding to the current "other" field, return NFS4ERR_OLD_STATEID.
o  Otherwise, the stateid is valid and the table entry should contain any additional information about the type of stateid and information associated with that particular type of stateid, such as the associated set of locks, e.g., open-owner and lock-owner information, as well as information on the specific locks, e.g., open modes and byte-ranges.

8.2.5. Stateid Use for I/O Operations

Clients performing I/O operations need to select an appropriate stateid based on the locks (including opens and delegations) held by the client and the various types of state-owners sending the I/O requests. SETATTR operations that change the file size are treated like I/O operations in this regard.

The following rules, applied in order of decreasing priority, govern the selection of the appropriate stateid. In following these rules, the client will only consider locks of which it has actually received notification by an appropriate operation response or callback. Note that the rules are slightly different in the case of I/O to data servers when file layouts are being used (see Section 13.9.1).

o  If the client holds a delegation for the file in question, the delegation stateid SHOULD be used.

o  Otherwise, if the entity corresponding to the lock-owner (e.g., a process) sending the I/O has a byte-range lock stateid for the associated open file, then the byte-range lock stateid for that lock-owner and open file SHOULD be used.

o  If there is no byte-range lock stateid, then the OPEN stateid for the open file in question SHOULD be used.

o  Finally, if none of the above apply, then a special stateid SHOULD be used.

Ignoring these rules may result in situations in which the server does not have information necessary to properly process the request.
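The priority order of these selection rules can be sketched directly. In the following illustrative Python fragment, the lock-tracking dictionaries and stateid tuples are hypothetical stand-ins for a client's own records of its delegations, byte-range lock stateids, and opens:

```python
# Illustrative sketch of the stateid-selection priority for I/O.
ANONYMOUS_STATEID = ("anonymous", 0)  # stand-in for the all-zero special stateid

def select_stateid(client_state, fh, lock_owner):
    """Apply the selection rules in decreasing priority order."""
    # 1. A delegation for the file takes precedence.
    if fh in client_state.get("delegations", {}):
        return client_state["delegations"][fh]
    # 2. Then a byte-range lock stateid for this lock-owner and open file.
    lock_sids = client_state.get("byte_range_locks", {})
    if (lock_owner, fh) in lock_sids:
        return lock_sids[(lock_owner, fh)]
    # 3. Then the OPEN stateid for the open file.
    if fh in client_state.get("opens", {}):
        return client_state["opens"][fh]
    # 4. Finally, fall back to a special stateid.
    return ANONYMOUS_STATEID

state = {
    "opens": {"fh1": ("open-sid", 1)},
    "byte_range_locks": {("proc7", "fh1"): ("lock-sid", 1)},
}
state_with_deleg = {**state, "delegations": {"fh1": ("deleg-sid", 1)}}

sid = select_stateid(state, "fh1", "proc7")   # lock stateid beats OPEN stateid
```

A different lock-owner on the same file would fall through to the OPEN stateid, and an unopened file to the anonymous stateid.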
For example, when mandatory byte-range locks are in effect, if the stateid does not indicate the proper lock-owner, via a lock stateid, a request might be avoidably rejected.

The server, however, should not try to enforce these ordering rules and should use whatever information is available to properly process I/O requests. In particular, when a client has a delegation for a given file, it SHOULD take note of this fact in processing a request, even if it is sent with a special stateid.

8.2.6. Stateid Use for SETATTR Operations

Because each operation is associated with a session ID and from that the client ID can be determined, operations do not need to include a stateid for the server to be able to determine whether they should cause a delegation to be recalled or are to be treated as done within the scope of the delegation.

In the case of SETATTR operations, a stateid is present. In cases other than those that set the file size, the client may send either a special stateid or, when a delegation is held for the file in question, a delegation stateid. While the server SHOULD validate the stateid and may use the stateid to optimize the determination as to whether a delegation is held, it SHOULD note the presence of a delegation even when a special stateid is sent, and MUST accept a valid delegation stateid when sent.

8.3. Lease Renewal

Each client/server pair, as represented by a client ID, has a single lease. The purpose of the lease is to allow the client to indicate to the server, in a low-overhead way, that it is active, and thus that the server is to retain the client's locks. This arrangement allows the server to remove stale locking-related objects that are held by a client that has crashed or is otherwise unreachable, once the relevant lease expires.
This in turn allows other clients to obtain conflicting locks without being delayed indefinitely by inactive or unreachable clients. It is not a mechanism for cache consistency, and lease renewals may not be denied if the lease interval has not expired.

Since each session is associated with a specific client (identified by the client's client ID), any operation sent on that session is an indication that the associated client is reachable. When a request is sent for a given session, successful execution of a SEQUENCE operation (or successful retrieval of the result of SEQUENCE from the reply cache) on an unexpired lease will result in the lease being implicitly renewed, for the standard renewal period (equal to the lease_time attribute).

If the client ID's lease has not expired when the server receives a SEQUENCE operation, then the server MUST renew the lease. If the client ID's lease has expired when the server receives a SEQUENCE operation, the server MAY renew the lease; this depends on whether any state was revoked as a result of the client's failure to renew the lease before expiration.

Absent other activity that would renew the lease, a COMPOUND consisting of a single SEQUENCE operation will suffice. The client should also take communication-related delays into account and take steps to ensure that the renewal messages actually reach the server in good time. For example:

o  When trunking is in effect, the client should consider sending multiple requests on different connections, in order to ensure that renewal occurs, even in the event of blockage in the path used for one of those connections.

o  Transport retransmission delays might become so large as to approach or exceed the length of the lease period. This may be particularly likely when the server is unresponsive due to a restart; see Section 8.4.2.1.
If the client implementation is not 7913 careful, transport retransmission delays can result in the client 7914 failing to detect a server restart before the grace period ends. 7915 The scenario is that the client is using a transport with 7916 exponential backoff, such that the maximum retransmission timeout 7917 exceeds both the grace period and the lease_time attribute. A 7918 network partition causes the client's connection's retransmission 7919 interval to back off, and even after the partition heals, the next 7920 transport-level retransmission is sent after the server has 7921 restarted and its grace period ends. 7923 The client MUST either recover from the ensuing NFS4ERR_NO_GRACE 7924 errors or it MUST ensure that, despite transport-level 7925 retransmission intervals that exceed the lease_time, a SEQUENCE 7926 operation is sent that renews the lease before expiration. The 7927 client can achieve this by associating a new connection with the 7928 session, and sending a SEQUENCE operation on it. However, if the 7929 attempt to establish a new connection is delayed for some reason 7930 (e.g., exponential backoff of the connection establishment 7931 packets), the client will have to abort the connection 7932 establishment attempt before the lease expires, and attempt to 7933 reconnect. 7935 If the server renews the lease upon receiving a SEQUENCE operation, 7936 the server MUST NOT allow the lease to expire while the rest of the 7937 operations in the COMPOUND procedure's request are still executing. 7938 Once the last operation has finished, and the response to COMPOUND 7939 has been sent, the server MUST set the lease to expire no sooner than 7940 the sum of current time and the value of the lease_time attribute. 
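As a rough sketch of the renewal rules above, consider the following Python helper. It is illustrative only; the class and method names and the 90-second lease value are hypothetical and not protocol elements:

```python
import time

LEASE_TIME = 90  # illustrative value of the lease_time attribute, in seconds


class ClientLease:
    """Hypothetical server-side lease record for one client ID."""

    def __init__(self, now=None):
        now = time.time() if now is None else now
        self.expiry = now + LEASE_TIME
        self.in_flight = 0  # COMPOUNDs currently executing on the client's sessions

    def expired(self, now):
        # The lease cannot be treated as expired while operations from a
        # lease-renewing COMPOUND are still executing.
        return self.in_flight == 0 and now >= self.expiry

    def on_sequence(self, now, state_was_revoked):
        """Apply the renewal rules when a SEQUENCE operation arrives."""
        if now < self.expiry:
            # Unexpired lease: the server MUST renew for a full lease period.
            self.expiry = now + LEASE_TIME
            return True
        if not state_was_revoked:
            # Expired lease, but no state was revoked: the server MAY renew.
            self.expiry = now + LEASE_TIME
            return True
        return False  # renewal denied; state was revoked after expiration
```

A real server would additionally re-arm the expiry once the COMPOUND reply has been sent, per the MUST in the preceding paragraph.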
7942 A client ID's lease can expire when it has been at least the lease 7943 interval (lease_time) since the last lease-renewing SEQUENCE 7944 operation was sent on any of the client ID's sessions and there are 7945 no active COMPOUND operations on any such sessions. 7947 Because the SEQUENCE operation is the basic mechanism to renew a 7948 lease, and because it must be done at least once for each lease 7949 period, it is the natural mechanism whereby the server informs 7950 the client of changes in lease status that the client needs to 7951 know about. The client should inspect the status flags 7952 (sr_status_flags) returned by SEQUENCE and take the appropriate 7953 action (see Section 18.46.3 for details). 7955 o The status bits SEQ4_STATUS_CB_PATH_DOWN and 7956 SEQ4_STATUS_CB_PATH_DOWN_SESSION indicate problems with the 7957 backchannel that the client may need to address in order to 7958 receive callback requests. 7960 o The status bits SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING and 7961 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED indicate problems with GSS 7962 contexts or RPCSEC_GSS handles for the backchannel that the client 7963 might have to address in order to allow callback requests to be 7964 sent. 7966 o The status bits SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, 7967 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, 7968 SEQ4_STATUS_ADMIN_STATE_REVOKED, and 7969 SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock 7970 revocation events. When these bits are set, the client should use 7971 TEST_STATEID to find what stateids have been revoked and use 7972 FREE_STATEID to acknowledge loss of the associated state. 7974 o The status bit SEQ4_STATUS_LEASE_MOVED indicates that 7975 responsibility for lease renewal has been transferred to one or 7976 more new servers. 7978 o The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that 7979 due to server restart the client must reclaim locking state.
7981 o The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates that the 7982 server has encountered an unrecoverable fault with the backchannel 7983 (e.g., it has lost track of a sequence ID for a slot in the 7984 backchannel). 7986 8.4. Crash Recovery 7988 A critical requirement in crash recovery is that both the client and 7989 the server know when the other has failed. Additionally, it is 7990 required that a client sees a consistent view of data across server 7991 restarts. All READ and WRITE operations that may have been queued 7992 within the client or network buffers must wait until the client has 7993 successfully recovered the locks protecting the READ and WRITE 7994 operations. Any that reach the server before the server can safely 7995 determine that the client has recovered enough locking state to be 7996 sure that such operations can be safely processed must be rejected. 7997 This will happen because either: 7999 o The state presented is no longer valid since it is associated with 8000 a now invalid client ID. In this case, the client will receive 8001 either an NFS4ERR_BADSESSION or NFS4ERR_DEADSESSION error, and any 8002 attempt to attach a new session to that invalid client ID will 8003 result in an NFS4ERR_STALE_CLIENTID error. 8005 o Subsequent recovery of locks may make execution of the operation 8006 inappropriate (NFS4ERR_GRACE). 8008 8.4.1. Client Failure and Recovery 8010 In the event that a client fails, the server may release the client's 8011 locks when the associated lease has expired. Conflicting locks from 8012 another client may only be granted after this lease expiration. As 8013 discussed in Section 8.3, when a client has not failed and re- 8014 establishes its lease before expiration occurs, requests for 8015 conflicting locks will not be granted. 8017 To minimize client delay upon restart, lock requests are associated 8018 with an instance of the client by a client-supplied verifier. 
This 8019 verifier is part of the client_owner4 sent in the initial EXCHANGE_ID 8020 call made by the client. The server returns a client ID as a result 8021 of the EXCHANGE_ID operation. The client then confirms the use of 8022 the client ID by establishing a session associated with that client 8023 ID (see Section 18.36.3 for a description of how this is done). All 8024 locks, including opens, byte-range locks, delegations, and layouts 8025 obtained by sessions using that client ID, are associated with that 8026 client ID. 8028 Since the verifier will be changed by the client upon each 8029 initialization, the server can compare a new verifier to the verifier 8030 associated with currently held locks and determine that they do not 8031 match. This signifies the client's new instantiation and subsequent 8032 loss (upon confirmation of the new client ID) of locking state. As a 8033 result, the server is free to release all locks held that are 8034 associated with the old client ID that was derived from the old 8035 verifier. At this point, conflicting locks from other clients, kept 8036 waiting while the lease had not yet expired, can be granted. In 8037 addition, all stateids associated with the old client ID can also be 8038 freed, as they are no longer reference-able. 8040 Note that the verifier must have the same uniqueness properties as 8041 the verifier for the COMMIT operation. 8043 8.4.2. Server Failure and Recovery 8045 If the server loses locking state (usually as a result of a restart), 8046 it must allow clients time to discover this fact and re-establish the 8047 lost locking state. The client must be able to re-establish the 8048 locking state without having the server deny valid requests because 8049 the server has granted conflicting access to another client. 
8050 Likewise, if there is a possibility that clients have not yet re- 8051 established their locking state for a file, and such locking 8052 state might make it invalid to perform READ or WRITE operations (for 8053 example, if mandatory locks are a possibility), the server must 8054 disallow READ and WRITE operations for that file. 8056 A client can determine that loss of locking state has occurred via 8057 several methods. 8059 1. When a SEQUENCE (most common) or other operation returns 8060 NFS4ERR_BADSESSION, this may mean that the session has been 8061 destroyed but the client ID is still valid. The client sends a 8062 CREATE_SESSION request with the client ID to re-establish the 8063 session. If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID, 8064 the client must establish a new client ID (see Section 8.1) and 8065 re-establish its lock state with the new client ID, after the 8066 CREATE_SESSION operation succeeds (see Section 8.4.2.1). 8068 2. When a SEQUENCE (most common) or other operation on a persistent 8069 session returns NFS4ERR_DEADSESSION, this indicates that the 8070 session is no longer usable for new operations, i.e., those not 8071 satisfied from the reply cache. Once all pending operations are 8072 determined to be either performed before the retry or not 8073 performed, the client sends a CREATE_SESSION request with the 8074 client ID to re-establish the session. If CREATE_SESSION fails 8075 with NFS4ERR_STALE_CLIENTID, the client must establish a new 8076 client ID (see Section 8.1) and re-establish its lock state after 8077 the CREATE_SESSION, with the new client ID, succeeds 8078 (Section 8.4.2.1). 8080 3. When an operation, neither SEQUENCE nor preceded by SEQUENCE (for 8081 example, CREATE_SESSION, DESTROY_SESSION), returns 8082 NFS4ERR_STALE_CLIENTID, the client MUST establish a new client ID 8083 (Section 8.1) and re-establish its lock state (Section 8.4.2.1). 8085 8.4.2.1.
State Reclaim 8087 When state information and the associated locks are lost as a result 8088 of a server restart, the protocol must provide a way to cause that 8089 state to be re-established. The approach used is to define, for most 8090 types of locking state (layouts are an exception), a request whose 8091 function is to allow the client to re-establish on the server a lock 8092 first obtained from a previous instance. Generally, these requests 8093 are variants of the requests normally used to create locks of that 8094 type and are referred to as "reclaim-type" requests, and the process 8095 of re-establishing such locks is referred to as "reclaiming" them. 8097 Because each client must have an opportunity to reclaim all of the 8098 locks that it has without the possibility that some other client will 8099 be granted a conflicting lock, a "grace period" is devoted to the 8100 reclaim process. During this period, requests creating client IDs 8101 and sessions are handled normally, but locking requests are subject 8102 to special restrictions. Only reclaim-type locking requests are 8103 allowed, unless the server can reliably determine (through state 8104 persistently maintained across restart instances) that granting any 8105 such lock cannot possibly conflict with a subsequent reclaim. When a 8106 request is made to obtain a new lock (i.e., not a reclaim-type 8107 request) during the grace period and such a determination cannot be 8108 made, the server must return the error NFS4ERR_GRACE. 8110 Once a session is established using the new client ID, the client 8111 will use reclaim-type locking requests (e.g., LOCK operations with 8112 reclaim set to TRUE and OPEN operations with a claim type of 8113 CLAIM_PREVIOUS; see Section 9.11) to re-establish its locking state. 
8114 Once this is done, or if there is no such locking state to reclaim, 8115 the client sends a global RECLAIM_COMPLETE operation, i.e., one with 8116 the rca_one_fs argument set to FALSE, to indicate that it has 8117 reclaimed all of the locking state that it will reclaim. Once a 8118 client sends such a RECLAIM_COMPLETE operation, it may attempt non- 8119 reclaim locking operations, although it might get an NFS4ERR_GRACE 8120 status result from each such operation until the period of special 8121 handling is over. See Section 11.11.9 for a discussion of the 8122 analogous handling of lock reclamation in the case of file systems 8123 transitioning from server to server. 8125 During the grace period, the server must reject READ and WRITE 8126 operations and non-reclaim locking requests (i.e., other LOCK and 8127 OPEN operations) with an error of NFS4ERR_GRACE, unless it can 8128 guarantee that these may be done safely, as described below. 8130 The grace period may last until all clients that are known to 8131 possibly have had locks have done a global RECLAIM_COMPLETE 8132 operation, indicating that they have finished reclaiming the locks 8133 they held before the server restart. This means that a client that 8134 has done a RECLAIM_COMPLETE must be prepared to receive an 8135 NFS4ERR_GRACE when attempting to acquire new locks. In order for the 8136 server to know that all clients with possible prior lock state have 8137 done a RECLAIM_COMPLETE, the server must maintain in stable storage a 8138 list of clients that may have such locks. The server may also terminate 8139 the grace period before all clients have done a global 8140 RECLAIM_COMPLETE. The server SHOULD NOT terminate the grace period 8141 before a time equal to the lease period in order to give clients an 8142 opportunity to find out about the server restart, as a result of 8143 sending requests on associated sessions with a frequency governed by 8144 the lease time.
Note that when a client does not send such requests 8145 (or they are sent by the client but not received by the server), it 8146 is possible for the grace period to expire before the client finds 8147 out that the server restart has occurred. 8149 Some additional time in order to allow a client to establish a new 8150 client ID and session and to effect lock reclaims may be added to the 8151 lease time. Note that analogous rules apply to file system-specific 8152 grace periods discussed in Section 11.11.9. 8154 If the server can reliably determine that granting a non-reclaim 8155 request will not conflict with reclamation of locks by other clients, 8156 the NFS4ERR_GRACE error does not have to be returned even within the 8157 grace period, although NFS4ERR_GRACE must always be returned to 8158 clients attempting a non-reclaim lock request before doing their own 8159 global RECLAIM_COMPLETE. For the server to be able to service READ 8160 and WRITE operations during the grace period, it must again be able 8161 to guarantee that no possible conflict could arise between a 8162 potential reclaim locking request and the READ or WRITE operation. 8163 If the server is unable to offer that guarantee, the NFS4ERR_GRACE 8164 error must be returned to the client. 8166 For a server to provide simple, valid handling during the grace 8167 period, the easiest method is to simply reject all non-reclaim 8168 locking requests and READ and WRITE operations by returning the 8169 NFS4ERR_GRACE error. However, a server may keep information about 8170 granted locks in stable storage. With this information, the server 8171 could determine if a locking, READ or WRITE operation can be safely 8172 processed. 8174 For example, if the server maintained on stable storage summary 8175 information on whether mandatory locks exist, either mandatory byte- 8176 range locks, or share reservations specifying deny modes, many 8177 requests could be allowed during the grace period. 
If it is known 8178 that no such share reservations exist, OPEN requests that do not 8179 specify deny modes may be safely granted. If, in addition, it is 8180 known that no mandatory byte-range locks exist, either through 8181 information stored on stable storage or simply because the server 8182 does not support such locks, READ and WRITE operations may be safely 8183 processed during the grace period. Another important case is where 8184 it is known that no mandatory byte-range locks exist, either because 8185 the server does not provide support for them or because their absence 8186 is known from persistently recorded data. In this case, READ and 8187 WRITE operations specifying stateids derived from reclaim-type 8188 operations may be validly processed during the grace period because 8189 the valid reclaim ensures that no lock subsequently 8190 granted can prevent the I/O. 8192 To reiterate, a server that allows non-reclaim lock and I/O 8193 requests to be processed during the grace period MUST determine 8194 that no lock subsequently reclaimed will be rejected and that no lock 8195 subsequently reclaimed would have prevented any I/O operation 8196 processed during the grace period. 8198 Clients should be prepared for the return of NFS4ERR_GRACE errors for 8199 non-reclaim lock and I/O requests. In this case, the client should 8200 employ a retry mechanism for the request. A delay (on the order of 8201 several seconds) between retries should be used to avoid overwhelming 8202 the server. Further discussion of the general issue is included in 8203 [54]. The client must account for servers that can perform I/O 8204 and non-reclaim locking requests within the grace period as well as 8205 those that cannot do so. 8207 A reclaim-type locking request outside the server's grace period can 8208 only succeed if the server can guarantee that no conflicting lock or 8209 I/O request has been granted since restart.
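The admission rules above might be sketched as follows. This Python fragment is a deliberate simplification: all names are hypothetical, and it omits per-client RECLAIM_COMPLETE tracking, which would force NFS4ERR_GRACE for a client's non-reclaim lock requests before its own global RECLAIM_COMPLETE.

```python
from dataclasses import dataclass

NFS4_OK = 0
NFS4ERR_GRACE = 10013     # error values as assigned in the NFSv4.1 spec
NFS4ERR_NO_GRACE = 10033


@dataclass
class ServerState:
    in_grace: bool
    # Summary information a server might keep in stable storage (illustrative):
    any_deny_mode_shares: bool          # share reservations with deny modes exist
    any_mandatory_locks: bool           # mandatory byte-range locks exist
    conflict_granted_since_restart: bool


def check_request(srv, kind, reclaim=False):
    """Admission sketch for OPEN/LOCK/READ/WRITE during and after grace."""
    if not srv.in_grace:
        if reclaim and srv.conflict_granted_since_restart:
            # A late reclaim succeeds only if no conflicting grant occurred.
            return NFS4ERR_NO_GRACE
        return NFS4_OK
    if reclaim:
        return NFS4_OK  # reclaim-type requests are what the grace period is for
    if kind == "OPEN" and not srv.any_deny_mode_shares:
        return NFS4_OK  # no deny-mode share reservation can be reclaimed against it
    if kind in ("READ", "WRITE") and not srv.any_mandatory_locks:
        return NFS4_OK  # no mandatory byte-range lock can block the I/O
    return NFS4ERR_GRACE
```

A server keeping no such summary information simply collapses to returning NFS4ERR_GRACE for every non-reclaim request during the grace period.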
8211 A server may, upon restart, establish a new value for the lease 8212 period. Therefore, clients should, once a new client ID is 8213 established, refetch the lease_time attribute and use it as the basis 8214 for lease renewal for the lease associated with that server. 8215 However, the server must establish, for this restart event, a grace 8216 period at least as long as the lease period for the previous server 8217 instantiation. This allows the client state obtained during the 8218 previous server instance to be reliably re-established. 8220 The possibility exists that, because of server configuration events, 8221 the client will be communicating with a server different from the one 8222 on which the locks were obtained, as shown by the combination of 8223 eir_server_scope and eir_server_owner. This leads to the issue of 8224 whether and when the client should attempt to reclaim locks previously 8225 obtained on what is being reported as a different server. The rules 8226 to resolve this question are as follows: 8228 o If the server scope is different, the client should not attempt to 8229 reclaim locks. In this situation, no lock reclaim is possible. 8230 Any attempt to re-obtain the locks with non-reclaim operations is 8231 problematic since there is no guarantee that the existing 8232 filehandles will be recognized by the new server, or that if 8233 recognized, they denote the same objects. It is best to treat the 8234 locks as having been revoked by the reconfiguration event. 8236 o If the server scope is the same, the client should attempt to 8237 reclaim locks, even if the eir_server_owner value is different. 8238 In this situation, it is the responsibility of the server to 8239 return NFS4ERR_NO_GRACE if it cannot provide correct support for 8240 lock reclaim operations, including the prevention of edge 8241 conditions. 8243 The eir_server_owner field is not used in making this determination.
8244 Its function is to specify trunking possibilities for the client (see 8245 Section 2.10.5) and not to control lock reclaim. 8247 8.4.2.1.1. Security Considerations for State Reclaim 8249 During the grace period, a client can reclaim state that it believes 8250 or asserts it had before the server restarted. Unless the server 8251 maintained a complete record of all the state the client had, the 8252 server has little choice but to trust the client. (Of course, if the 8253 server maintained a complete record, then it would not have to force 8254 the client to reclaim state after server restart.) While the server 8255 has to trust the client to tell the truth, the negative consequences 8256 for security are limited to enabling denial-of-service attacks in 8257 situations in which AUTH_SYS is supported. The fundamental rule for 8258 the server when processing reclaim requests is that it MUST NOT grant 8259 the reclaim if an equivalent non-reclaim request would not be granted 8260 during steady state due to access control or access conflict issues. 8261 For example, an OPEN request during a reclaim will be refused with 8262 NFS4ERR_ACCESS if the principal making the request does not have 8263 access to open the file according to the discretionary ACL 8264 (Section 6.2.2) on the file. 8266 Nonetheless, it is possible that a client operating in error or 8267 maliciously could, during reclaim, prevent another client from 8268 reclaiming access to state. For example, an attacker could send an 8269 OPEN reclaim operation with a deny mode that prevents another client 8270 from reclaiming the OPEN state it had before the server restarted. 8271 The attacker could perform the same denial of service during steady 8272 state prior to server restart, as long as the attacker had 8273 permissions. 
Given that the attack vectors are equivalent, the grace 8274 period does not offer any additional opportunity for denial of 8275 service, and any concerns about this attack vector, whether during 8276 grace or steady state, are addressed the same way: use RPCSEC_GSS for 8277 authentication and limit access to the file only to principals that 8278 the owner of the file trusts. 8280 Note that if prior to restart the server had client IDs with the 8281 EXCHGID4_FLAG_BIND_PRINC_STATEID (Section 18.35) capability set, then 8282 the server SHOULD record in stable storage the client owner and the 8283 principal that established the client ID via EXCHANGE_ID. If the 8284 server does not, then there is a risk a client will be unable to 8285 reclaim state if it does not have a credential for a principal that 8286 was originally authorized to establish the state. 8288 8.4.3. Network Partitions and Recovery 8290 If the duration of a network partition is greater than the lease 8291 period provided by the server, the server will not have received a 8292 lease renewal from the client. If this occurs, the server may free 8293 all locks held for the client or it may allow the lock state to 8294 remain for a considerable period, subject to the constraint that if a 8295 request for a conflicting lock is made, locks associated with an 8296 expired lease do not prevent such a conflicting lock from being 8297 granted but MUST be revoked as necessary so as to avoid interfering 8298 with such conflicting requests. 8300 If the server chooses to delay freeing of lock state until there is a 8301 conflict, it may either free all of the client's locks once there is 8302 a conflict or it may only revoke the minimum set of locks necessary 8303 to allow conflicting requests. When it adopts the finer-grained 8304 approach, it must revoke all locks associated with a given stateid, 8305 even if the conflict is with only a subset of locks. 
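The two revocation policies just described (free everything versus revoke the minimum set) can be sketched as a small selection function; the names below are hypothetical:

```python
def revoke_for_conflict(stateids, conflicting_locks, finer_grained=True):
    """Choose which of an expired client's stateids to revoke.

    stateids: dict mapping stateid -> set of locks held under it
    conflicting_locks: set of the client's locks that a new request conflicts with
    Returns the set of stateids to revoke.
    """
    if not finer_grained:
        return set(stateids)  # free all of the client's lock state at once
    # Finer-grained approach: revoke only stateids holding a conflicting lock,
    # but a stateid must be revoked whole, even if only a subset of its locks
    # actually conflicts.
    return {sid for sid, locks in stateids.items() if locks & conflicting_locks}
```

Note that even in the finer-grained case the unit of revocation is the stateid, not the individual lock, matching the constraint in the paragraph above.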
8307 When the server chooses to free all of a client's lock state, either 8308 immediately upon lease expiration or as a result of the first attempt 8309 to obtain a conflicting lock, the server may report the loss of 8310 lock state in a number of ways. 8312 The server may choose to invalidate the session and the associated 8313 client ID. In this case, once the client can communicate with the 8314 server, it will receive an NFS4ERR_BADSESSION error. Upon attempting 8315 to create a new session, it would get an NFS4ERR_STALE_CLIENTID. 8316 Upon creating the new client ID and new session, the client will 8317 attempt to reclaim locks. Normally, the server will not allow the 8318 client to reclaim locks, because the server will not be in its 8319 recovery grace period. 8321 Another possibility is for the server to maintain the session and 8322 client ID but for all stateids held by the client to become invalid 8323 or stale. Once the client can reach the server after such a network 8324 partition, the status returned by the SEQUENCE operation will 8325 indicate a loss of locking state; i.e., the flag 8326 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in sr_status_flags. 8327 In addition, all I/O submitted by the client with the now invalid 8328 stateids will fail with the server returning the error 8329 NFS4ERR_EXPIRED. Once the client learns of the loss of locking 8330 state, it will suitably notify the applications that held the 8331 invalidated locks. The client should then take action to free 8332 invalidated stateids, either by establishing a new client ID using a 8333 new verifier or by doing a FREE_STATEID operation to release each of 8334 the invalidated stateids. 8336 When the server adopts a finer-grained approach to revocation of 8337 locks when a client's lease has expired, only a subset of stateids 8338 will normally become invalid during a network partition.
When the 8339 client can communicate with the server after such a network partition 8340 heals, the status returned by the SEQUENCE operation will indicate a 8341 partial loss of locking state 8342 (SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED). In addition, operations, 8343 including I/O submitted by the client, with the now invalid stateids 8344 will fail with the server returning the error NFS4ERR_EXPIRED. Once 8345 the client learns of the loss of locking state, it will use the 8346 TEST_STATEID operation on all of its stateids to determine which 8347 locks have been lost and then suitably notify the applications that 8348 held the invalidated locks. The client can then release the 8349 invalidated locking state and acknowledge the revocation of the 8350 associated locks by doing a FREE_STATEID operation on each of the 8351 invalidated stateids. 8353 When a network partition is combined with a server restart, there are 8354 edge conditions that place requirements on the server in order to 8355 avoid silent data corruption following the server restart. Two of 8356 these edge conditions are known, and are discussed below. 8358 The first edge condition arises as a result of the scenarios such as 8359 the following: 8361 1. Client A acquires a lock. 8363 2. Client A and server experience mutual network partition, such 8364 that client A is unable to renew its lease. 8366 3. Client A's lease expires, and the server releases the lock. 8368 4. Client B acquires a lock that would have conflicted with that of 8369 client A. 8371 5. Client B releases its lock. 8373 6. Server restarts. 8375 7. Network partition between client A and server heals. 8377 8. Client A connects to a new server instance and finds out about 8378 server restart. 8380 9. Client A reclaims its lock within the server's grace period. 8382 Thus, at the final step, the server has erroneously granted client 8383 A's lock reclaim. 
If client B modified the object the lock was 8384 protecting, client A will experience object corruption. 8386 The second known edge condition arises in situations such as the 8387 following: 8389 1. Client A acquires one or more locks. 8391 2. Server restarts. 8393 3. Client A and server experience mutual network partition, such 8394 that client A is unable to reclaim all of its locks within the 8395 grace period. 8397 4. Server's reclaim grace period ends. Client A has either no 8398 locks or an incomplete set of locks known to the server. 8400 5. Client B acquires a lock that would have conflicted with a lock 8401 of client A that was not reclaimed. 8403 6. Client B releases the lock. 8405 7. Server restarts a second time. 8407 8. Network partition between client A and server heals. 8409 9. Client A connects to new server instance and finds out about 8410 server restart. 8412 10. Client A reclaims its lock within the server's grace period. 8414 As with the first edge condition, the final step of the scenario of 8415 the second edge condition has the server erroneously granting client 8416 A's lock reclaim. 8418 Solving the first and second edge conditions requires either that the 8419 server always assumes after it restarts that some edge condition 8420 occurs, and thus returns NFS4ERR_NO_GRACE for all reclaim attempts, 8421 or that the server record some information in stable storage. The 8422 amount of information the server records in stable storage is in 8423 inverse proportion to how harsh the server intends to be whenever 8424 edge conditions arise. The server that is completely tolerant of all 8425 edge conditions will record in stable storage every lock that is 8426 acquired, removing the lock record from stable storage only when the 8427 lock is released. 
For the two edge conditions discussed above, the 8428 harshest a server can be, and still support a grace period for 8429 reclaims, requires that the server record in stable storage some 8430 minimal information. For example, a server implementation could, for 8431 each client, save in stable storage a record containing: 8433 o the co_ownerid field from the client_owner4 presented in the 8434 EXCHANGE_ID operation. 8436 o a boolean that indicates if the client's lease expired or if there 8437 was administrative intervention (see Section 8.5) to revoke a 8438 byte-range lock, share reservation, or delegation and there has 8439 been no acknowledgment, via FREE_STATEID, of such revocation. 8441 o a boolean that indicates whether the client may have locks that it 8442 believes to be reclaimable in situations in which the grace period 8443 was terminated, making the server's view of lock reclaimability 8444 suspect. The server will set this for any client record in stable 8445 storage where the client has not done a suitable RECLAIM_COMPLETE 8446 (global or file system-specific depending on the target of the 8447 lock request) before it grants any new (i.e., not reclaimed) lock 8448 to any client. 8450 Assuming the above record keeping, for the first edge condition, 8451 after the server restarts, the record that client A's lease expired 8452 means that another client could have acquired a conflicting byte- 8453 range lock, share reservation, or delegation. Hence, the server must 8454 reject a reclaim from client A with the error NFS4ERR_NO_GRACE. 8456 For the second edge condition, after the server restarts for a second 8457 time, the indication that the client had not completed its reclaims 8458 at the time at which the grace period ended means that the server 8459 must reject a reclaim from client A with the error NFS4ERR_NO_GRACE. 8461 When either edge condition occurs, the client's attempt to reclaim 8462 locks will result in the error NFS4ERR_NO_GRACE. 
When this is 8463 received, or after the client restarts with no lock state, the client 8464 will send a global RECLAIM_COMPLETE. When the RECLAIM_COMPLETE is 8465 received, the server and client are again in agreement regarding 8466 reclaimable locks and both booleans in persistent storage can be 8467 reset, to be set again only when there is a subsequent event that 8468 causes lock reclaim operations to be questionable. 8470 Regardless of the level and approach to record keeping, the server 8471 MUST implement one of the following strategies (which apply to 8472 reclaims of share reservations, byte-range locks, and delegations): 8474 1. Reject all reclaims with NFS4ERR_NO_GRACE. This is extremely 8475 unforgiving, but necessary if the server does not record lock 8476 state in stable storage. 8478 2. Record sufficient state in stable storage such that all known 8479 edge conditions involving server restart, including the two noted 8480 in this section, are detected. It is acceptable to erroneously 8481 recognize an edge condition and not allow a reclaim, when, with 8482 sufficient knowledge, it would be allowed. The error the server 8483 would return in this case is NFS4ERR_NO_GRACE. Note that it is 8484 not known if there are other edge conditions. 8486 In the event that, after a server restart, the server determines 8487 there is unrecoverable damage or corruption to the information in 8488 stable storage, then for all clients and/or locks that may be 8489 affected, the server MUST return NFS4ERR_NO_GRACE. 8491 A mandate for the client's handling of the NFS4ERR_NO_GRACE error is 8492 outside the scope of this specification, since the strategies for 8493 such handling are very dependent on the client's operating 8494 environment. However, one potential approach is described below. 
When the client receives NFS4ERR_NO_GRACE, it could examine the change attribute of the objects for which the client is trying to reclaim state, and use that to determine whether to re-establish the state via normal OPEN or LOCK operations.  This is acceptable provided that the client's operating environment allows it.  In other words, the client implementor is advised to document this behavior for users.  The client could also inform the application that its byte-range lock or share reservations (whether or not they were delegated) have been lost, such as via a UNIX signal, a Graphical User Interface (GUI) pop-up window, etc.  See Section 10.5 for a discussion of what the client should do for dealing with unreclaimed delegations on client state.

For further discussion of revocation of locks, see Section 8.5.

8.5.  Server Revocation of Locks

At any point, the server can revoke locks held by a client, and the client must be prepared for this event.  When the client detects that its locks have been or may have been revoked, the client is responsible for validating the state information between itself and the server.  Validating locking state for the client means that it must verify or reclaim state for each lock currently held.

The first occasion of lock revocation is upon server restart.  Note that this includes situations in which sessions are persistent and locking state is lost.  In this class of instances, the client will receive an error (NFS4ERR_STALE_CLIENTID) on an operation that takes a client ID, usually as part of recovery in response to a problem with the current session, and the client will proceed with normal crash recovery as described in Section 8.4.2.1.

The second occasion of lock revocation is the inability to renew the lease before expiration, as discussed in Section 8.4.3.
While this is considered a rare or unusual event, the client must be prepared to recover.  The server is responsible for determining the precise consequences of the lease expiration, informing the client of the scope of the lock revocation decided upon.  The client then uses the status information provided by the server in the SEQUENCE results (field sr_status_flags, see Section 18.46.3) to synchronize its locking state with that of the server, in order to recover.

The third occasion of lock revocation can occur as a result of revocation of locks within the lease period, either because of administrative intervention or because a recallable lock (a delegation or layout) was not returned within the lease period after having been recalled.  While these are considered rare events, they are possible, and the client must be prepared to deal with them.  When either of these events occurs, the client finds out about the situation through the status returned by the SEQUENCE operation.  Any use of stateids associated with locks revoked during the lease period will receive the error NFS4ERR_ADMIN_REVOKED or NFS4ERR_DELEG_REVOKED, as appropriate.

In all situations in which a subset of locking state may have been revoked, which include all cases in which locking state is revoked within the lease period, it is up to the client to determine which locks have been revoked and which have not.  It does this by using the TEST_STATEID operation on the appropriate set of stateids.  Once the set of revoked locks has been determined, the applications can be notified, and the invalidated stateids can be freed and lock revocation acknowledged by using FREE_STATEID.

8.6.  Short and Long Leases

When determining the time period for the server lease, the usual lease tradeoffs apply.
A short lease is good for fast server recovery at a cost of increased operations to effect lease renewal (when there are no other operations during the period to effect lease renewal as a side effect).  A long lease is certainly kinder and gentler to servers trying to handle very large numbers of clients.  The number of extra requests to effect lock renewal drops in inverse proportion to the lease time.  The disadvantages of a long lease include the possibility of slower recovery after certain failures.  After server failure, a longer grace period may be required when some clients do not promptly reclaim their locks and do a global RECLAIM_COMPLETE.  In the event of client failure, the longer period for a lease to expire will force conflicting requests to wait longer.

A long lease is practical if the server can store lease state in stable storage.  Upon recovery, the server can reconstruct the lease state from its stable storage and continue operation with its clients.

8.7.  Clocks, Propagation Delay, and Calculating Lease Expiration

To avoid the need for synchronized clocks, lease times are granted by the server as a time delta.  However, there is a requirement that the client and server clocks do not drift excessively over the duration of the lease.  There is also the issue of propagation delay across the network, which could easily be several hundred milliseconds, as well as the possibility that requests will be lost and need to be retransmitted.

To take propagation delay into account, the client should subtract it from lease times (e.g., if the client estimates the one-way propagation delay as 200 milliseconds, then it can assume that the lease is already 200 milliseconds old when it gets it).  In addition, it will take another 200 milliseconds to get a response back to the server.
So the client must send a lease renewal or write data back to the server at least 400 milliseconds before the lease would expire.  If the propagation delay varies over the life of the lease (e.g., the client is on a mobile host), the client will need to continuously subtract the increase in propagation delay from the lease times.

The server's lease period configuration should take into account the network distance of the clients that will be accessing the server's resources.  It is expected that the lease period will take into account the network propagation delays and other network delay factors for the client population.  Since the protocol does not allow for an automatic method to determine an appropriate lease period, the server's administrator may have to tune the lease period.

8.8.  Obsolete Locking Infrastructure from NFSv4.0

There are a number of operations and fields within existing operations that no longer have a function in NFSv4.1.  In one way or another, these changes are all due to the implementation of sessions that provide client context and exactly once semantics as a base feature of the protocol, separate from locking itself.

The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1.  The server MUST return NFS4ERR_NOTSUPP if these operations are found in an NFSv4.1 COMPOUND.

o  SETCLIENTID since its function has been replaced by EXCHANGE_ID.

o  SETCLIENTID_CONFIRM since client ID confirmation now happens by means of CREATE_SESSION.

o  OPEN_CONFIRM because state-owner-based seqids have been replaced by the sequence ID in the SEQUENCE operation.

o  RELEASE_LOCKOWNER because lock-owners with no associated locks do not have any sequence-related state and so can be deleted by the server at will.
o  RENEW because every SEQUENCE operation for a session causes lease renewal, making a separate operation superfluous.

Also, there are a number of fields, present in existing operations, related to locking that have no use in minor version 1.  They were used in minor version 0 to perform functions now provided in a different fashion.

o  Sequence IDs used to sequence requests for a given state-owner and to provide retry protection, now provided via sessions.

o  Client IDs used to identify the client associated with a given request.  Client identification is now available using the client ID associated with the current session, without needing an explicit client ID field.

Such vestigial fields in existing operations have no function in NFSv4.1 and are ignored by the server.  Note that client IDs in operations new to NFSv4.1 (such as CREATE_SESSION and DESTROY_CLIENTID) are not ignored.

9.  File Locking and Share Reservations

To support Win32 share reservations, it is necessary to provide operations that atomically open or create files.  Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API.  In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced.  The NFSv4.1 protocol defines an OPEN operation that is capable of atomically looking up, creating, and locking a file on the server.

9.1.  Opens and Byte-Range Locks

It is assumed that manipulating a byte-range lock is rare when compared to READ and WRITE operations.  It is also assumed that server restarts and network partitions are relatively rare.  Therefore, it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock.
A LOCK operation contains the heavyweight information required to establish a byte-range lock and uniquely define the owner of the lock.

9.1.1.  State-Owner Definition

When opening a file or requesting a byte-range lock, the client must specify an identifier that represents the owner of the requested lock.  This identifier is in the form of a state-owner, represented in the protocol by a state_owner4, a variable-length opaque array that, when concatenated with the current client ID, uniquely defines the owner of a lock managed by the client.  This may be a thread ID, process ID, or other unique value.

Owners of opens and owners of byte-range locks are separate entities and remain separate even if the same opaque arrays are used to designate owners of each.  The protocol distinguishes between open-owners (represented by open_owner4 structures) and lock-owners (represented by lock_owner4 structures).

Each open is associated with a specific open-owner while each byte-range lock is associated with a lock-owner and an open-owner, the latter being the open-owner associated with the open file under which the LOCK operation was done.  Delegations and layouts, on the other hand, are not associated with a specific owner but are associated with the client as a whole (identified by a client ID).

9.1.2.  Use of the Stateid and Locking

All READ, WRITE, and SETATTR operations contain a stateid.  For the purposes of this section, SETATTR operations that change the size attribute of a file are treated as if they are writing the area between the old and new sizes (i.e., the byte-range truncated or added to the file by means of the SETATTR), even where SETATTR is not explicitly mentioned in the text.
The stateid passed to one of these operations must be one that represents an open, a set of byte-range locks, or a delegation, or it may be a special stateid representing anonymous access or the special bypass stateid.

If the state-owner performs a READ or WRITE operation in a situation in which it has established a byte-range lock or share reservation on the server (any OPEN constitutes a share reservation), the stateid (previously returned by the server) must be used to indicate what locks, including both byte-range locks and share reservations, are held by the state-owner.  If no state is established by the client, either a byte-range lock or a share reservation, a special stateid for anonymous state (zero as the value for "other" and "seqid") is used.  (See Section 8.2.3 for a description of 'special' stateids in general.)  Regardless of whether a stateid for anonymous state or a stateid returned by the server is used, if there is a conflicting share reservation or mandatory byte-range lock held on the file, the server MUST refuse to service the READ or WRITE operation.

Share reservations are established by OPEN operations and by their nature are mandatory in that when the OPEN denies READ or WRITE operations, that denial results in such operations being rejected with error NFS4ERR_LOCKED.  Byte-range locks may be implemented by the server as either mandatory or advisory, or the choice of mandatory or advisory behavior may be determined by the server on the basis of the file being accessed (for example, some UNIX-based servers support a "mandatory lock bit" on the mode attribute such that if set, byte-range locks are required on the file before I/O is possible).  When byte-range locks are advisory, they only prevent the granting of conflicting lock requests and have no effect on READs or WRITEs.
Mandatory byte-range locks, however, prevent conflicting I/O operations.  When they are attempted, they are rejected with NFS4ERR_LOCKED.  When the client gets NFS4ERR_LOCKED on a file for which it knows it has the proper share reservation, it will need to send a LOCK operation on the byte-range of the file that includes the byte-range the I/O was to be performed on, with an appropriate locktype field of the LOCK operation's arguments (i.e., READ*_LT for a READ operation, WRITE*_LT for a WRITE operation).

Note that for UNIX environments that support mandatory byte-range locking, the distinction between advisory and mandatory locking is subtle.  In fact, advisory and mandatory byte-range locks are exactly the same as far as the APIs and implementation requirements are concerned.  If the mandatory lock attribute is set on the file, the server checks to see if the lock-owner has an appropriate shared (READ_LT) or exclusive (WRITE_LT) byte-range lock on the byte-range it wishes to READ from or WRITE to.  If there is no appropriate lock, the server checks if there is a conflicting lock (which can be done by attempting to acquire the conflicting lock on behalf of the lock-owner, and if successful, releasing the lock after the READ or WRITE operation is done), and if there is, the server returns NFS4ERR_LOCKED.

For Windows environments, byte-range locks are always mandatory, so the server always checks for byte-range locks during I/O requests.

Thus, the LOCK operation does not need to distinguish between advisory and mandatory byte-range locks.  It is the server's processing of the READ and WRITE operations that introduces the distinction.
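The mandatory-locking check described above can be sketched as a small in-memory model.  The types and the check_io function below are illustrative, not part of the protocol; the sketch assumes a lock-owner's own locks never conflict with its own I/O (so that an upgrade attempted on the owner's behalf would succeed absent other conflicts):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum lock_type { READ_LT, WRITE_LT };
enum nfsstat   { NFS4_OK, NFS4ERR_LOCKED };

/* One held byte-range lock (hypothetical in-memory representation). */
struct range_lock {
    uint64_t offset, length;
    enum lock_type type;
    int owner;                  /* lock-owner id */
};

static bool overlaps(const struct range_lock *l, uint64_t off, uint64_t len)
{
    return off < l->offset + l->length && l->offset < off + len;
}

/* For a file with the mandatory lock attribute set: may 'owner' perform
 * a READ (is_write == false) or WRITE (is_write == true) on [off, off+len)? */
enum nfsstat check_io(const struct range_lock *locks, size_t n,
                      int owner, bool is_write,
                      uint64_t off, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (!overlaps(&locks[i], off, len))
            continue;
        if (locks[i].owner == owner)
            continue;           /* the owner's own locks never conflict */
        /* another owner's WRITE_LT conflicts with any I/O;
         * another owner's READ_LT conflicts only with a WRITE */
        if (locks[i].type == WRITE_LT || is_write)
            return NFS4ERR_LOCKED;
    }
    return NFS4_OK;
}
```

A READ through a range covered only by other owners' READ_LT locks succeeds, while a WRITE through the same range is rejected with NFS4ERR_LOCKED, matching the advisory-versus-mandatory distinction being made only in the server's READ/WRITE processing.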
Every stateid that is validly passed to READ, WRITE, or SETATTR, with the exception of special stateid values, defines an access mode for the file (i.e., OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH).

o  For stateids associated with opens, this is the mode defined by the original OPEN that caused the allocation of the OPEN stateid and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the same open-owner/file pair.

o  For stateids returned by byte-range LOCK operations, the appropriate mode is the access mode for the OPEN stateid associated with the lock set represented by the stateid.

o  For delegation stateids, the access mode is based on the type of delegation.

When a READ, WRITE, or SETATTR (that specifies the size attribute) operation is done, the operation is subject to checking against the access mode to verify that the operation is appropriate given the stateid with which the operation is associated.

In the case of WRITE-type operations (i.e., WRITEs and SETATTRs that set size), the server MUST verify that the access mode allows writing and MUST return an NFS4ERR_OPENMODE error if it does not.  In the case of READ, the server may perform the corresponding check on the access mode, or it may choose to allow READ on OPENs for OPEN4_SHARE_ACCESS_WRITE, to accommodate clients whose WRITE implementation may unavoidably do reads (e.g., due to buffer cache constraints).  However, even if READs are allowed in these circumstances, the server MUST still check for locks that conflict with the READ (e.g., another OPEN specified OPEN4_SHARE_DENY_READ or OPEN4_SHARE_DENY_BOTH).
Note that a server that does enforce the access mode check on READs need not explicitly check for conflicting share reservations since the existence of OPEN for OPEN4_SHARE_ACCESS_READ guarantees that no conflicting share reservation can exist.

The READ bypass special stateid (all bits of "other" and "seqid" set to one) indicates a desire to bypass locking checks.  The server MAY allow READ operations to bypass locking checks at the server, when this special stateid is used.  However, WRITE operations with this special stateid value MUST NOT bypass locking checks and are treated exactly the same as if a special stateid for anonymous state were used.

A lock may not be granted while a READ or WRITE operation using one of the special stateids is being performed and the scope of the lock to be granted would conflict with the READ or WRITE operation.  This can occur when:

o  A mandatory byte-range lock is requested with a byte-range that conflicts with the byte-range of the READ or WRITE operation.  For the purposes of this paragraph, a conflict occurs when a shared lock is requested and a WRITE operation is being performed, or an exclusive lock is requested and either a READ or a WRITE operation is being performed.

o  A share reservation is requested that denies reading and/or writing and the corresponding operation is being performed.

o  A delegation is to be granted and the delegation type would prevent the I/O operation, i.e., READ and WRITE conflict with an OPEN_DELEGATE_WRITE delegation and WRITE conflicts with an OPEN_DELEGATE_READ delegation.

When a client holds a delegation, it needs to ensure that the stateid sent conveys the association of the operation with the delegation, so as to avoid the delegation being needlessly recalled.
When the delegation stateid, an open stateid associated with that delegation, or a stateid representing byte-range locks derived from such an open is used, the server knows that the READ, WRITE, or SETATTR does not conflict with the delegation but is sent under the aegis of the delegation.  Even though it is possible for the server to determine from the client ID (via the session ID) that the client does in fact have a delegation, the server is not obliged to check this, so using a special stateid can result in avoidable recall of the delegation.

9.2.  Lock Ranges

The protocol allows a lock-owner to request a lock with a byte-range and then either upgrade, downgrade, or unlock a sub-range of the initial lock, or a byte-range that overlaps -- fully or partially -- either with that initial lock or a combination of a set of existing locks for the same lock-owner.  It is expected that this will be an uncommon type of request.  In any case, servers or server file systems may not be able to support sub-range lock semantics.  In the event that a server receives a locking request that represents a sub-range of current locking state for the lock-owner, the server is allowed to return the error NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock operations.  Therefore, the client should be prepared to receive this error and, if appropriate, report the error to the requesting application.

The client is discouraged from combining multiple independent locking ranges that happen to be adjacent into a single request since the server may not support sub-range requests for reasons related to the recovery of byte-range locking state in the event of server failure.
As discussed in Section 8.4.2, the server may employ certain optimizations during recovery that work effectively only when the client's behavior during lock recovery is similar to the client's locking behavior prior to server failure.

9.3.  Upgrading and Downgrading Locks

If a client has a WRITE_LT lock on a byte-range, it can request an atomic downgrade of the lock to a READ_LT lock via the LOCK operation, by setting the type to READ_LT.  If the server supports atomic downgrade, the request will succeed.  If not, it will return NFS4ERR_LOCK_NOTSUPP.  The client should be prepared to receive this error and, if appropriate, report the error to the requesting application.

If a client has a READ_LT lock on a byte-range, it can request an atomic upgrade of the lock to a WRITE_LT lock via the LOCK operation by setting the type to WRITE_LT or WRITEW_LT.  If the server does not support atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP.  If the upgrade can be achieved without an existing conflict, the request will succeed.  Otherwise, the server will return either NFS4ERR_DENIED or NFS4ERR_DEADLOCK.  The error NFS4ERR_DEADLOCK is returned if the client sent the LOCK operation with the type set to WRITEW_LT and the server has detected a deadlock.  The client should be prepared to receive such errors and, if appropriate, report the error to the requesting application.

9.4.  Stateid Seqid Values and Byte-Range Locks

When a LOCK or LOCKU operation is performed, the stateid returned has the same "other" value as the argument's stateid, and a "seqid" value that is incremented (relative to the argument's stateid) to reflect the occurrence of the LOCK or LOCKU operation.
The server MUST increment the value of the "seqid" field whenever there is any change to the locking status of any byte offset as described by any of the locks covered by the stateid.  A change in locking status includes a change from locked to unlocked or the reverse or a change from being locked for READ_LT to being locked for WRITE_LT or the reverse.

When there is no such change, as, for example, when a range already locked for WRITE_LT is locked again for WRITE_LT, the server MAY increment the "seqid" value.

9.5.  Issues with Multiple Open-Owners

When the same file is opened by multiple open-owners, a client will have multiple OPEN stateids for that file, each associated with a different open-owner.  In that case, there can be multiple LOCK and LOCKU requests for the same lock-owner sent using the different OPEN stateids, and so a situation may arise in which there are multiple stateids, each representing byte-range locks on the same file and held by the same lock-owner but each associated with a different open-owner.

In such a situation, the locking status of each byte (i.e., whether it is locked, the READ_LT or WRITE_LT type of the lock, and the lock-owner holding the lock) MUST reflect the last LOCK or LOCKU operation done for the lock-owner in question, independent of the stateid through which the request was sent.

When a byte is locked by the lock-owner in question, the open-owner to which that byte-range lock is assigned SHOULD be that of the open-owner associated with the stateid through which the last LOCK of that byte was done.  When there is a change in the open-owner associated with locks for the stateid through which a LOCK or LOCKU was done, the "seqid" field of the stateid MUST be incremented, even if the locking, in terms of lock-owners, has not changed.
When there is a change to the set of locked bytes associated with a different stateid for the same lock-owner, i.e., associated with a different open-owner, the "seqid" value for that stateid MUST NOT be incremented.

9.6.  Blocking Locks

Some clients require the support of blocking locks.  While NFSv4.1 provides a callback when a previously unavailable lock becomes available, this is an OPTIONAL feature and clients cannot depend on its presence.  Clients need to be prepared to continually poll for the lock.  This presents a fairness problem.  Two of the lock types, READW_LT and WRITEW_LT, are used to indicate to the server that the client is requesting a blocking lock.  When the callback is not used, the server should maintain an ordered list of pending blocking locks.  When the conflicting lock is released, the server may wait for a period of time equal to lease_time for the first waiting client to re-request the lock.  After the lease period expires, the next waiting client request is allowed the lock.  Clients are required to poll at an interval sufficiently small that it is likely to acquire the lock in a timely manner.  The server is not required to maintain a list of pending blocked locks as it is used to increase fairness and not correct operation.  Because of the unordered nature of crash recovery, storing of lock state to stable storage would be required to guarantee ordered granting of blocking locks.

Servers may also note the lock types and delay returning denial of the request to allow extra time for a conflicting lock to be released, allowing a successful return.  In this way, clients can avoid the burden of needlessly frequent polling for blocking locks.  The server should take care in the length of delay in the event the client retransmits the request.
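The ordered-list fairness scheme described above can be modeled with a small sketch.  The blocked_queue structure and may_grant function are hypothetical names; the sketch assumes that, after the conflicting lock is released, each waiter in arrival order gets a window of lease_time in which only it may be granted the lock, and that once every waiter's window has passed the lock is open to any requester:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-lock wait queue kept by the server. */
struct blocked_queue {
    const int *waiters;     /* client ids, in arrival order */
    size_t nwaiters;
    uint64_t released_at;   /* time the conflicting lock was released */
};

/* May 'client' be granted the lock when it polls at time 'now'?
 * Assumes now >= q->released_at and lease_time > 0. */
bool may_grant(const struct blocked_queue *q, int client,
               uint64_t now, uint64_t lease_time)
{
    /* index of the waiter whose grant window is currently open */
    size_t head = (size_t)((now - q->released_at) / lease_time);
    if (head >= q->nwaiters)
        return true;        /* all windows have passed: first come, first served */
    return client == q->waiters[head];
}
```

This also illustrates why clients must poll at an interval smaller than lease_time: a waiter that polls less often than its window lasts may miss its turn entirely.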
If a server receives a blocking LOCK operation, denies it, and then later receives a nonblocking request for the same lock, which is also denied, then it should remove the lock in question from its list of pending blocking locks.  Clients should use such a nonblocking request to indicate to the server that this is the last time they intend to poll for the lock, as may happen when the process requesting the lock is interrupted.  This is a courtesy to the server, to prevent it from unnecessarily waiting a lease period before granting other LOCK operations.  However, clients are not required to perform this courtesy, and servers must not depend on them doing so.  Also, clients must be prepared for the possibility that this final locking request will be accepted.

When a server indicates, via the flag OPEN4_RESULT_MAY_NOTIFY_LOCK, that CB_NOTIFY_LOCK callbacks might be done for the current open file, the client should take notice of this, but, since this is a hint, cannot rely on a CB_NOTIFY_LOCK always being done.  A client may reasonably reduce the frequency with which it polls for a denied lock, since the greater latency that might occur is likely to be eliminated given a prompt callback, but it still needs to poll.  When it receives a CB_NOTIFY_LOCK, it should promptly try to obtain the lock, but it should be aware that other clients may be polling and that the server is under no obligation to reserve the lock for that particular client.

9.7.  Share Reservations

A share reservation is a mechanism to control access to a file.  It is a separate and independent mechanism from byte-range locking.
When a client opens a file, it sends an OPEN operation to the server specifying the type of access required (READ, WRITE, or BOTH) and the type of access to deny others (OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or OPEN4_SHARE_DENY_BOTH).  If the OPEN fails, the client will fail the application's open request.

Pseudo-code definition of the semantics:

    if (request.access == 0) {
        return (NFS4ERR_INVAL)
    } else {
        if ((request.access & file_state.deny) ||
            (request.deny & file_state.access)) {
            return (NFS4ERR_SHARE_DENIED)
        }
    }
    return (NFS4ERR_OK);

When doing this checking of share reservations on OPEN, the current file_state used in the algorithm includes bits that reflect all current opens, including those for the open-owner making the new OPEN request.

The constants used for the OPEN and OPEN_DOWNGRADE operations for the access and deny fields are as follows:

    const OPEN4_SHARE_ACCESS_READ   = 0x00000001;
    const OPEN4_SHARE_ACCESS_WRITE  = 0x00000002;
    const OPEN4_SHARE_ACCESS_BOTH   = 0x00000003;

    const OPEN4_SHARE_DENY_NONE     = 0x00000000;
    const OPEN4_SHARE_DENY_READ     = 0x00000001;
    const OPEN4_SHARE_DENY_WRITE    = 0x00000002;
    const OPEN4_SHARE_DENY_BOTH     = 0x00000003;

9.8.  OPEN/CLOSE Operations

To provide correct share semantics, a client MUST use the OPEN operation to obtain the initial filehandle and indicate the desired access and what access, if any, to deny.  Even if the client intends to use a special stateid for anonymous state or READ bypass, it must still obtain the filehandle for the regular file with the OPEN operation so the appropriate share semantics can be applied.  Clients that do not have a deny mode built into their programming interfaces for opening a file should request a deny mode of OPEN4_SHARE_DENY_NONE.
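The share-reservation check given by the pseudo-code and constants in Section 9.7 can be written as a small runnable function.  The check_share name and its parameter layout are illustrative: file_access and file_deny stand for the unions of the access and deny bits of all current opens of the file (including the requester's own), while req_access and req_deny come from the new OPEN:

```c
#include <assert.h>
#include <stdint.h>

/* Constants from the specification. */
#define OPEN4_SHARE_ACCESS_READ  0x00000001
#define OPEN4_SHARE_ACCESS_WRITE 0x00000002
#define OPEN4_SHARE_ACCESS_BOTH  0x00000003

#define OPEN4_SHARE_DENY_NONE  0x00000000
#define OPEN4_SHARE_DENY_READ  0x00000001
#define OPEN4_SHARE_DENY_WRITE 0x00000002
#define OPEN4_SHARE_DENY_BOTH  0x00000003

enum nfsstat { NFS4_OK, NFS4ERR_INVAL, NFS4ERR_SHARE_DENIED };

/* Runnable form of the pseudo-code: an OPEN fails if its requested
 * access is denied by an existing open, or if it denies access that
 * an existing open already has. */
enum nfsstat check_share(uint32_t file_access, uint32_t file_deny,
                         uint32_t req_access, uint32_t req_deny)
{
    if (req_access == 0)
        return NFS4ERR_INVAL;
    if ((req_access & file_deny) || (req_deny & file_access))
        return NFS4ERR_SHARE_DENIED;
    return NFS4_OK;
}
```

For example, if one open-owner holds the file with ACCESS_READ and DENY_WRITE, a second OPEN requesting ACCESS_WRITE fails with NFS4ERR_SHARE_DENIED, while a second OPEN requesting only ACCESS_READ with DENY_NONE succeeds.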
The OPEN operation with the CREATE flag also subsumes the CREATE operation for regular files as used in previous versions of the NFS protocol.  This allows a create with a share to be done atomically.

The CLOSE operation removes all share reservations held by the open-owner on that file.  If byte-range locks are held, the client SHOULD release all locks before sending a CLOSE operation.  The server MAY free all outstanding locks on CLOSE, but some servers may not support the CLOSE of a file that still has byte-range locks held.  The server MUST return failure, NFS4ERR_LOCKS_HELD, if any locks would exist after the CLOSE.

The LOOKUP operation will return a filehandle without establishing any lock state on the server.  Without a valid stateid, the server will assume that the client has the least access.  For example, if one client opened a file with OPEN4_SHARE_DENY_BOTH and another client accesses the file via a filehandle obtained through LOOKUP, the second client could only read the file using the special read bypass stateid.  The second client could not WRITE the file at all because it would not have a valid stateid from OPEN and the special anonymous stateid would not be allowed access.

9.9.  Open Upgrade and Downgrade

When an OPEN is done for a file and the open-owner for which the OPEN is being done already has the file open, the result is to upgrade the open file status maintained on the server to include the access and deny bits specified by the new OPEN as well as those for the existing OPEN.  The result is that there is one open file, as far as the protocol is concerned, and it includes the union of the access and deny bits for all of the OPEN requests completed.
The OPEN is represented by a single stateid whose "other" value matches that of the original open, and whose "seqid" value is incremented to reflect the occurrence of the upgrade.  The increment is required in cases in which the "upgrade" results in no change to the open mode (e.g., an OPEN is done for read when the existing open file is opened for OPEN4_SHARE_ACCESS_BOTH).  Only a single CLOSE will be done to reset the effects of both OPENs.  The client may use the stateid returned by the OPEN effecting the upgrade or a stateid sharing the same "other" field and a seqid of zero, although care needs to be taken as far as upgrades that happen while the CLOSE is pending.  Note that the client, when sending the OPEN, may not know that the same file is in fact being opened.  The above only applies if both OPENs result in the OPENed object being designated by the same filehandle.

When the server chooses to export multiple filehandles corresponding to the same file object and returns different filehandles on two different OPENs of the same file object, the server MUST NOT "OR" together the access and deny bits and coalesce the two open files.  Instead, the server must maintain separate OPENs with separate stateids and will require separate CLOSEs to free them.

When multiple open files on the client are merged into a single OPEN file object on the server, the close of one of the open files (on the client) may necessitate change of the access and deny status of the open file on the server.  This is because the union of the access and deny bits for the remaining opens may be smaller (i.e., a proper subset) than previously.  The OPEN_DOWNGRADE operation is used to make the necessary change and the client should use it to update the server so that share reservation requests by other clients are handled properly.
The stateid returned has the same "other" field as that passed to the server. The "seqid" value in the returned stateid MUST be incremented, even in situations in which there is no change to the access and deny bits for the file.

9.10. Parallel OPENs

Unlike the case of NFSv4.0, in which OPEN operations for the same open-owner are inherently serialized because of the owner-based seqid, multiple OPENs for the same open-owner may be done in parallel. When clients do this, they may encounter situations in which, because of the existence of hard links, two OPEN operations may turn out to open the same file, with a later OPEN performed being an upgrade of the first, with this fact only visible to the client once the operations complete.

In this situation, clients may determine the order in which the OPENs were performed by examining the stateids returned by the OPENs. Stateids that share a common value of the "other" field can be recognized as having opened the same file, with the order of the operations determinable from the order of the "seqid" fields, mod any possible wraparound of the 32-bit field.

When the possibility exists that the client will send multiple OPENs for the same open-owner in parallel, it may be the case that an open upgrade happens without the client knowing beforehand that this could happen. Because of this possibility, CLOSEs and OPEN_DOWNGRADEs should generally be sent with a non-zero seqid in the stateid, to avoid the possibility that the status change associated with an open upgrade is inadvertently lost.

9.11. Reclaim of Open and Byte-Range Locks

Special forms of the LOCK and OPEN operations are provided when it is necessary to re-establish byte-range locks or opens after a server failure.

o To reclaim existing opens, an OPEN operation is performed using a CLAIM_PREVIOUS.
Because the client, in this type of situation, will have already opened the file and have the filehandle of the target file, this operation requires that the current filehandle be the target file, rather than a directory, and no file name is specified.

o To reclaim byte-range locks, a LOCK operation with the reclaim parameter set to true is used.

Reclaims of opens associated with delegations are discussed in Section 10.2.1.

10. Client-Side Caching

Client-side caching of data, of file attributes, and of file names is essential to providing good performance with the NFS protocol. Providing distributed cache coherence is a difficult problem, and previous versions of the NFS protocol have not attempted it. Instead, several NFS client implementation techniques have been used to reduce the problems that a lack of coherence poses for users. These techniques have not been clearly defined by earlier protocol specifications, and it is often unclear what is valid or invalid client behavior.

The NFSv4.1 protocol uses many techniques similar to those that have been used in previous protocol versions. The NFSv4.1 protocol does not provide distributed cache coherence. However, it defines a more limited set of caching guarantees to allow locks and share reservations to be used without destructive interference from client-side caching.

In addition, the NFSv4.1 protocol introduces a delegation mechanism, which allows many decisions normally made by the server to be made locally by clients. This mechanism provides efficient support of the common cases where sharing is infrequent or where sharing is read-only.

10.1. Performance Challenges for Client-Side Caching

Caching techniques used in previous versions of the NFS protocol have been successful in providing good performance.
However, several scalability challenges can arise when those techniques are used with very large numbers of clients. This is particularly true when clients are geographically distributed, which classically increases the latency for cache revalidation requests.

The previous versions of the NFS protocol repeat their file data cache validation requests at the time the file is opened. This behavior can have serious performance drawbacks. A common case is one in which a file is only accessed by a single client; sharing is therefore infrequent.

In this case, repeated references to the server to find that no conflicts exist are expensive. A better option with regard to performance is to allow a client that repeatedly opens a file to do so without reference to the server, until potentially conflicting operations from another client actually occur.

A similar situation arises in connection with byte-range locking. Sending LOCK and LOCKU operations as well as the READ and WRITE operations necessary to make data caching consistent with the locking semantics (see Section 10.3.2) can severely limit performance. When locking is used to provide protection against infrequent conflicts, a large penalty is incurred. This penalty may discourage the use of byte-range locking by applications.

The NFSv4.1 protocol provides more aggressive caching strategies with the following design goals:

o Compatibility with a large range of server semantics.

o Providing the same caching benefits as previous versions of the NFS protocol when unable to support the more aggressive model.

o Requirements for aggressive caching are organized so that a large portion of the benefit can be obtained even when not all of the requirements can be met.
The appropriate requirements for the server are discussed in later sections in which specific forms of caching are covered (see Section 10.4).

10.2. Delegation and Callbacks

Recallable delegation of server responsibilities for a file to a client improves performance by avoiding repeated requests to the server in the absence of inter-client conflict. With the use of a "callback" RPC from server to client, a server recalls delegated responsibilities when another client engages in sharing of a delegated file.

A delegation is passed from the server to the client, specifying the object of the delegation and the type of delegation. There are different types of delegations, but each type contains a stateid to be used to represent the delegation when performing operations that depend on the delegation. This stateid is similar to those associated with locks and share reservations but differs in that the stateid for a delegation is associated with a client ID and may be used on behalf of all the open-owners for the given client. A delegation is made to the client as a whole and not to any specific process or thread of control within it.

The backchannel is established by CREATE_SESSION and BIND_CONN_TO_SESSION, and the client is required to maintain it. Because the backchannel may be down, even temporarily, correct protocol operation does not depend on it. Preliminary testing of backchannel functionality by means of a CB_COMPOUND procedure with a single operation, CB_SEQUENCE, can be used to check the continuity of the backchannel. A server avoids delegating responsibilities until it has determined that the backchannel exists.
Because the granting of a delegation is always conditional upon the absence of conflicting access, clients MUST NOT assume that a delegation will be granted, and they MUST always be prepared for OPENs, WANT_DELEGATIONs, and GET_DIR_DELEGATIONs to be processed without any delegations being granted.

Unlike locks, an operation by a second client to a delegated file will cause the server to recall a delegation through a callback. For individual operations, we will describe, under IMPLEMENTATION, when such operations are required to effect a recall. A number of points should be noted, however.

o The server is free to recall a delegation whenever it feels it is desirable and may do so even if no operations requiring recall are being done.

o Operations done outside the NFSv4.1 protocol, due to, for example, access by other protocols, or by local access, also need to result in delegation recall when they make analogous changes to file system data. What is crucial is whether the change would invalidate the guarantees provided by the delegation. When this is possible, the delegation needs to be recalled and MUST be returned or revoked before allowing the operation to proceed.

o The semantics of the file system are crucial in defining when delegation recall is required. If a particular change within a specific implementation causes change to a file attribute, then delegation recall is required, whether or not that operation has been specifically listed as requiring delegation recall. Again, what is critical is whether the guarantees provided by the delegation are being invalidated.

Despite those caveats, the implementation sections for a number of operations describe situations in which delegation recall would be required under some common circumstances:

o For GETATTR, see Section 18.7.4.

o For OPEN, see Section 18.16.4.
o For READ, see Section 18.22.4.

o For REMOVE, see Section 18.25.4.

o For RENAME, see Section 18.26.4.

o For SETATTR, see Section 18.30.4.

o For WRITE, see Section 18.32.4.

On recall, the client holding the delegation needs to flush modified state (such as modified data) to the server and return the delegation. The conflicting request will not be acted on until the recall is complete. The recall is considered complete when the client returns the delegation or the server times out its wait for the delegation to be returned and revokes the delegation as a result of the timeout. In the interim, the server will either delay responding to conflicting requests or respond to them with NFS4ERR_DELAY. Following the resolution of the recall, the server has the information necessary to grant or deny the second client's request.

At the time the client receives a delegation recall, it may have substantial state that needs to be flushed to the server. Therefore, the server should allow sufficient time for the delegation to be returned, since it may involve numerous RPCs to the server. If the server is able to determine that the client is diligently flushing state to the server as a result of the recall, the server may extend the usual time allowed for a recall. However, the time allowed for recall completion should not be unbounded.

An example of this is when responsibility to mediate opens on a given file is delegated to a client (see Section 10.4). The server will not know what opens are in effect on the client. Without this knowledge, the server will be unable to determine if the access and deny states for the file allow any particular open until the delegation for the file has been returned.

A client failure or a network partition can result in failure to respond to a recall callback.
In this case, the server will revoke the delegation, which in turn will render useless any modified state still on the client.

10.2.1. Delegation Recovery

There are three situations that delegation recovery needs to deal with:

o client restart

o server restart

o network partition (full or backchannel-only)

In the event the client restarts, the failure to renew the lease will result in the revocation of byte-range locks and share reservations. Delegations, however, may be treated a bit differently.

There will be situations in which delegations will need to be re-established after a client restarts. The reason for this is that the client may have file data stored locally and this data was associated with the previously held delegations. The client will need to re-establish the appropriate file state on the server.

To allow for this type of client recovery, the server MAY extend the period for delegation recovery beyond the typical lease expiration period. This implies that requests from other clients that conflict with these delegations will need to wait. Because the normal recall process may require significant time for the client to flush changed state to the server, other clients need to be prepared for delays that occur because of a conflicting delegation. This longer interval would increase the window for clients to restart and consult stable storage so that the delegations can be reclaimed. For OPEN delegations, such delegations are reclaimed using OPEN with a claim type of CLAIM_DELEGATE_PREV or CLAIM_DELEG_PREV_FH (see Sections 10.5 and 18.16 for discussion of OPEN delegation and the details of OPEN, respectively).
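As a rough, non-normative illustration of the client-restart case just described, the sketch below walks the delegations a client recorded in stable storage and reclaims each with the appropriate claim type. The helper `send_open` is hypothetical, standing in for issuing a COMPOUND containing an OPEN; the data layout is likewise an assumption for illustration only.

```python
# Hypothetical sketch of delegation reclaim after a client restart.
# send_open stands in for sending a COMPOUND containing an OPEN and is
# assumed to return a status code (0 on success).

CLAIM_DELEGATE_PREV = "CLAIM_DELEGATE_PREV"   # reclaim by directory FH + name
CLAIM_DELEG_PREV_FH = "CLAIM_DELEG_PREV_FH"   # reclaim by file FH alone

def reclaim_delegations(recorded, send_open):
    """recorded: list of dicts describing delegations read back from
    stable storage; each has a 'filehandle' and may have 'dir_fh' and
    'name'.  Returns the subset that could not be reclaimed."""
    failed = []
    for deleg in recorded:
        if "name" in deleg and "dir_fh" in deleg:
            status = send_open(CLAIM_DELEGATE_PREV, deleg)
        else:
            # Only the file's own filehandle survived the restart.
            status = send_open(CLAIM_DELEG_PREV_FH, deleg)
        if status != 0:
            failed.append(deleg)
    return failed
```

A real client would, of course, issue these reclaims within the period the server maintains the delegations (no less than the lease_time attribute, as stated below).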
A server MAY support claim types of CLAIM_DELEGATE_PREV and CLAIM_DELEG_PREV_FH, and if it does, it MUST NOT remove delegations upon a CREATE_SESSION that confirms a client ID created by EXCHANGE_ID. Instead, the server MUST, for a period of time no less than that of the value of the lease_time attribute, maintain the client's delegations to allow time for the client to send CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH requests. A server that supports CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH MUST support the DELEGPURGE operation.

When the server restarts, delegations are reclaimed (using the OPEN operation with CLAIM_PREVIOUS) in a similar fashion to byte-range locks and share reservations. However, there is a slight semantic difference. In the normal case, if the server decides that a delegation should not be granted, it performs the requested action (e.g., OPEN) without granting any delegation. For reclaim, the server grants the delegation, but a special designation is applied so that the client treats the delegation as having been granted but recalled by the server. Because of this, the client has the duty to write all modified state to the server and then return the delegation. This process of handling delegation reclaim reconciles three principles of the NFSv4.1 protocol:

o Upon reclaim, a client reporting resources assigned to it by an earlier server instance must be granted those resources.

o The server has unquestionable authority to determine whether delegations are to be granted and, once granted, whether they are to be continued.

o The use of callbacks should not be depended upon until the client has proven its ability to receive them.

When a client needs to reclaim a delegation and there is no associated open, the client may use the CLAIM_PREVIOUS variant of the WANT_DELEGATION operation.
However, since the server is not required to support this operation, an alternative is to reclaim the delegation via a dummy OPEN of type CLAIM_PREVIOUS. The dummy open file can then be released using a CLOSE to re-establish the original state to be reclaimed: a delegation without an associated open.

When a client has more than a single open associated with a delegation, state for those additional opens can be established using OPEN operations of type CLAIM_DELEGATE_CUR. When these are used to establish opens associated with reclaimed delegations, the server MUST allow them when made within the grace period.

When a network partition occurs, delegations are subject to freeing by the server when the lease renewal period expires. This is similar to the behavior for locks and share reservations. For delegations, however, the server may extend the period in which conflicting requests are held off. Eventually, the occurrence of a conflicting request from another client will cause revocation of the delegation. A loss of the backchannel (e.g., by later network configuration change) will have the same effect. A recall request will fail and revocation of the delegation will result.

A client normally finds out about revocation of a delegation when it uses a stateid associated with a delegation and receives one of the errors NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED. It also may find out about delegation revocation after a client restart when it attempts to reclaim a delegation and receives that same error. Note that in the case of a revoked OPEN_DELEGATE_WRITE delegation, there are issues because data may have been modified by the client whose delegation is revoked and separately by other clients. See Section 10.5.1 for a discussion of such issues.
Note also that when delegations are revoked, information about the revoked delegation will be written by the server to stable storage (as described in Section 8.4.3). This is done to deal with the case in which a server restarts after revoking a delegation but before the client holding the revoked delegation is notified about the revocation.

10.3. Data Caching

When applications share access to a set of files, they need to be implemented so as to take account of the possibility of conflicting access by another application. This is true whether the applications in question execute on different clients or reside on the same client.

Share reservations and byte-range locks are the facilities the NFSv4.1 protocol provides to allow applications to coordinate access by using mutual exclusion facilities. The NFSv4.1 protocol's data caching must be implemented such that it does not invalidate the assumptions on which those using these facilities depend.

10.3.1. Data Caching and OPENs

In order to avoid invalidating the sharing assumptions on which applications rely, NFSv4.1 clients should not provide cached data to applications or modify it on behalf of an application when it would not be valid to obtain or modify that same data via a READ or WRITE operation.

Furthermore, in the absence of an OPEN delegation (see Section 10.4), two additional rules apply. Note that these rules are obeyed in practice by many NFSv3 clients.

o First, cached data present on a client must be revalidated after doing an OPEN. Revalidating means that the client fetches the change attribute from the server, compares it with the cached change attribute, and if different, declares the cached data (as well as the cached attributes) invalid. This is to ensure that the data for the OPENed file is still correctly reflected in the client's cache.
This validation must be done at least when the client's OPEN operation includes a deny of OPEN4_SHARE_DENY_WRITE or OPEN4_SHARE_DENY_BOTH, thus terminating a period in which other clients may have had the opportunity to open the file with OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH access. Clients may choose to do the revalidation more often (i.e., at OPENs specifying a deny mode of OPEN4_SHARE_DENY_NONE) to parallel the NFSv3 protocol's practice for the benefit of users assuming this degree of cache revalidation.

Since the change attribute is updated for data and metadata modifications, some client implementors may be tempted to use the time_modify attribute and not the change attribute to validate cached data, so that metadata changes do not spuriously invalidate clean data. The implementor is cautioned against this approach. The change attribute is guaranteed to change for each update to the file, whereas time_modify is guaranteed to change only at the granularity of the time_delta attribute. Use by the client's data cache validation logic of time_modify and not change runs the risk of the client incorrectly marking stale data as valid. Thus, any cache validation approach by the client MUST include the use of the change attribute.

o Second, modified data must be flushed to the server before closing a file OPENed for OPEN4_SHARE_ACCESS_WRITE. This is complementary to the first rule. If the data is not flushed at CLOSE, the revalidation done after the client OPENs a file is unable to achieve its purpose. The other aspect to flushing the data before close is that the data must be committed to stable storage, at the server, before the CLOSE operation is requested by the client.
In the case of a server restart and a CLOSEd file, it may not be possible to retransmit the data to be written to the file; hence this requirement.

10.3.2. Data Caching and File Locking

For those applications that choose to use byte-range locking instead of share reservations to exclude inconsistent file access, there is an analogous set of constraints that apply to client-side data caching. These rules are effective only if the byte-range locking is used in a way that matches in an equivalent way the actual READ and WRITE operations executed. This is as opposed to byte-range locking that is based on pure convention. For example, it is possible to manipulate a two-megabyte file by dividing the file into two one-megabyte ranges and protecting access to the two byte-ranges by byte-range locks on bytes zero and one. A WRITE_LT lock on byte zero of the file would represent the right to perform READ and WRITE operations on the first byte-range. A WRITE_LT lock on byte one of the file would represent the right to perform READ and WRITE operations on the second byte-range. As long as all applications manipulating the file obey this convention, they will work on a local file system. However, they may not work with the NFSv4.1 protocol unless clients refrain from data caching.

The rules for data caching in the byte-range locking environment are:

o First, when a client obtains a byte-range lock for a particular byte-range, the data cache corresponding to that byte-range (if any cache data exists) must be revalidated. If the change attribute indicates that the file may have been updated since the cached data was obtained, the client must flush or invalidate the cached data for the newly locked byte-range.
A client might choose to invalidate all of the non-modified cached data that it has for the file, but the only requirement for correct operation is to invalidate all of the data in the newly locked byte-range.

o Second, before releasing a WRITE_LT lock for a byte-range, all modified data for that byte-range must be flushed to the server. The modified data must also be written to stable storage.

Note that flushing data to the server and the invalidation of cached data must reflect the actual byte-ranges locked or unlocked. Rounding these up or down to reflect client cache block boundaries will cause problems if not carefully done. For example, writing a modified block when only half of that block is within an area being unlocked may cause invalid modification to the byte-range outside the unlocked area. This, in turn, may be part of a byte-range locked by another client. Clients can avoid this situation by synchronously performing portions of WRITE operations that overlap that portion (initial or final) that is not a full block. Similarly, invalidating a locked area that is not an integral number of full buffer blocks would require the client to read one or two partial blocks from the server if the revalidation procedure shows that the data that the client possesses may not be valid.

The data that is written to the server as a prerequisite to the unlocking of a byte-range must be written, at the server, to stable storage. The client may accomplish this either with synchronous writes or by following asynchronous writes with a COMMIT operation. This is required because retransmission of the modified data after a server restart might conflict with a lock held by another client.
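To illustrate the block-boundary concern above, the following sketch (illustrative only; the function name and the `block_size` parameter are not from the protocol) splits a locked byte-range being flushed into the leading and trailing partial blocks, which must be written byte-exact so as not to touch bytes outside the range, and the whole cache blocks in between, which may safely be written as full blocks:

```python
def split_range_for_flush(offset, length, block_size):
    """Split [offset, offset+length) into sub-ranges for flushing.

    Returns (partials, whole): 'partials' is a list of (offset, length)
    pairs that do not cover a full cache block and so must be written
    byte-exact; 'whole' is the (offset, length) of the complete blocks
    inside the range, or None if the range covers no complete block.
    """
    end = offset + length
    first_full = -(-offset // block_size) * block_size   # round up to block
    last_full = (end // block_size) * block_size         # round down to block
    if first_full >= last_full:
        # The locked range lies within fewer than one full block.
        return [(offset, length)], None
    partials = []
    if offset < first_full:
        partials.append((offset, first_full - offset))   # leading partial
    if last_full < end:
        partials.append((last_full, end - last_full))    # trailing partial
    return partials, (first_full, last_full - first_full)
```

A client using block-aligned WRITEs for the middle portion and byte-exact WRITEs for the two partial pieces never modifies bytes outside the locked range.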
A client implementation may choose to accommodate applications that use byte-range locking in non-standard ways (e.g., using a byte-range lock as a global semaphore) by flushing to the server more data upon a LOCKU than is covered by the locked range. This may include modified data within files other than the one for which the unlocks are being done. In such cases, the client must not interfere with applications whose READs and WRITEs are being done only within the bounds of byte-range locks that the application holds. For example, an application locks a single byte of a file and proceeds to write that single byte. A client that chose to handle a LOCKU by flushing all modified data to the server could validly write that single byte in response to an unrelated LOCKU operation. However, it would not be valid to write the entire block in which that single written byte was located, since it includes an area that is not locked and might be locked by another client. Client implementations can avoid this problem by dividing files with modified data into those for which all modifications are done to areas covered by an appropriate byte-range lock and those for which there are modifications not covered by a byte-range lock. Any writes done for the former class of files must not include areas not locked and thus not modified on the client.

10.3.3. Data Caching and Mandatory File Locking

Client-side data caching needs to respect mandatory byte-range locking when it is in effect. The presence of mandatory byte-range locking for a given file is indicated when the client gets back NFS4ERR_LOCKED from a READ or WRITE operation on a file for which it has an appropriate share reservation. When mandatory locking is in effect for a file, the client must check for an appropriate byte-range lock for data being read or written.
If a byte-range lock exists for the range being read or written, the client may satisfy the request using the client's validated cache. If an appropriate byte-range lock is not held for the range of the read or write, the read or write request must not be satisfied by the client's cache and the request must be sent to the server for processing. When a read or write request partially overlaps a locked byte-range, the request should be subdivided into multiple pieces with each byte-range (locked or not) treated appropriately.

10.3.4. Data Caching and File Identity

When clients cache data, the file data needs to be organized according to the file system object to which the data belongs. For NFSv3 clients, the typical practice has been to assume for the purpose of caching that distinct filehandles represent distinct file system objects. The client then has the choice to organize and maintain the data cache on this basis.

In the NFSv4.1 protocol, there is now the possibility to have significant deviations from a "one filehandle per object" model because a filehandle may be constructed on the basis of the object's pathname. Therefore, clients need a reliable method to determine if two filehandles designate the same file system object. If clients were simply to assume that all distinct filehandles denote distinct objects and proceed to do data caching on this basis, caching inconsistencies would arise between the distinct client-side objects that mapped to the same server-side object.

By providing a method to differentiate filehandles, the NFSv4.1 protocol alleviates a potential functional regression in comparison with the NFSv3 protocol. Without this method, caching inconsistencies within the same client could occur, and this has not been present in previous versions of the NFS protocol.
Note that it is possible to have such inconsistencies with applications executing on multiple clients, but that is not the issue being addressed here.

For the purposes of data caching, the following steps allow an NFSv4.1 client to determine whether two distinct filehandles denote the same server-side object:

o If GETATTR directed to two filehandles returns different values of the fsid attribute, then the filehandles represent distinct objects.

o If GETATTR for any file with an fsid that matches the fsid of the two filehandles in question returns a unique_handles attribute with a value of TRUE, then the two objects are distinct.

o If GETATTR directed to the two filehandles does not return the fileid attribute for both of the handles, then it cannot be determined whether the two objects are the same. Therefore, operations that depend on that knowledge (e.g., client-side data caching) cannot be done reliably. Note that if GETATTR does not return the fileid attribute for both filehandles, it will return it for neither of the filehandles, since the fsid for both filehandles is the same.

o If GETATTR directed to the two filehandles returns different values for the fileid attribute, then they are distinct objects.

o Otherwise, they are the same object.

10.4. Open Delegation

When a file is being OPENed, the server may delegate further handling of opens and closes for that file to the opening client. Any such delegation is recallable, since the circumstances that allowed for the delegation are subject to change. In particular, if the server receives a conflicting OPEN from another client, the server must recall the delegation before deciding whether the OPEN from the other client may be granted.
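Returning briefly to Section 10.3.4, the filehandle-identity steps listed there can be sketched as a single decision function (illustrative only; plain dictionaries stand in for GETATTR results, and the function is not part of the protocol):

```python
def same_object(attrs_a, attrs_b, unique_handles):
    """Decide whether two distinct filehandles denote the same object,
    following the steps of Section 10.3.4.

    attrs_a/attrs_b: attributes returned by GETATTR for each filehandle
    ('fsid' and, if supported, 'fileid'); unique_handles: the value of
    the unique_handles attribute for the file system containing both.
    Returns True, False, or None when identity cannot be determined.
    """
    if attrs_a["fsid"] != attrs_b["fsid"]:
        return False        # different fsid: distinct objects
    if unique_handles:
        return False        # server guarantees one filehandle per object
    if "fileid" not in attrs_a or "fileid" not in attrs_b:
        return None         # no fileid: cannot be determined reliably
    if attrs_a["fileid"] != attrs_b["fileid"]:
        return False        # different fileid: distinct objects
    return True             # otherwise, the same object
```

A client receiving None would have to forgo the caching optimizations that depend on object identity.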
Making a delegation is up to the server, and clients should not assume that any particular OPEN either will or will not result in an OPEN delegation. The following is a typical set of conditions that servers might use in deciding whether an OPEN should be delegated:

o The client must be able to respond to the server's callback requests. If a backchannel has been established, the server will send a CB_COMPOUND request, containing a single operation, CB_SEQUENCE, for a test of backchannel availability.

o The client must have responded properly to previous recalls.

o There must be no current OPEN conflicting with the requested delegation.

o There should be no current delegation that conflicts with the delegation being requested.

o The probability of future conflicting open requests should be low based on the recent history of the file.

o The existence of any server-specific semantics of OPEN/CLOSE that would make the required handling incompatible with the prescribed handling that the delegated client would apply (see below).

There are two types of OPEN delegations: OPEN_DELEGATE_READ and OPEN_DELEGATE_WRITE. An OPEN_DELEGATE_READ delegation allows a client to handle, on its own, requests to open a file for reading that do not deny OPEN4_SHARE_ACCESS_READ access to others. Multiple OPEN_DELEGATE_READ delegations may be outstanding simultaneously and do not conflict. An OPEN_DELEGATE_WRITE delegation allows the client to handle, on its own, all opens. Only one OPEN_DELEGATE_WRITE delegation may exist for a given file at a given time, and it is inconsistent with any OPEN_DELEGATE_READ delegations.
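The conflict rules just stated can be modeled compactly. The sketch below is illustrative only, not server logic from this specification: many OPEN_DELEGATE_READ delegations coexist, while an OPEN_DELEGATE_WRITE delegation excludes every other delegation on the file.

```python
OPEN_DELEGATE_READ = "READ"
OPEN_DELEGATE_WRITE = "WRITE"

def delegation_compatible(existing, requested):
    """existing: list of delegation types currently outstanding for the
    file; requested: the type of delegation being considered.

    A WRITE delegation may be granted only when no delegation is
    outstanding; a READ delegation is compatible with other READ
    delegations but not with an outstanding WRITE delegation."""
    if requested == OPEN_DELEGATE_WRITE:
        return len(existing) == 0
    return OPEN_DELEGATE_WRITE not in existing
```

This captures only the delegation-vs-delegation rule; a real server would also weigh the other conditions listed above (callback availability, conflicting OPENs, file history).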
When a client has an OPEN_DELEGATE_READ delegation, it is assured that neither the contents, the attributes (with the exception of time_access), nor the names of any links to the file will change without its knowledge, so long as the delegation is held.  When a client has an OPEN_DELEGATE_WRITE delegation, it may modify the file data locally since no other client will be accessing the file's data.  The client holding an OPEN_DELEGATE_WRITE delegation may only locally affect file attributes that are intimately connected with the file data: size, change, time_access, time_metadata, and time_modify.  All other attributes must be reflected on the server.

When a client has an OPEN delegation, it does not need to send OPENs or CLOSEs to the server.  Instead, the client may update the appropriate status internally.  For an OPEN_DELEGATE_READ delegation, opens that cannot be handled locally (opens that are for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH or that deny OPEN4_SHARE_ACCESS_READ access) must be sent to the server.

When an OPEN delegation is made, the reply to the OPEN contains an OPEN delegation structure that specifies the following:

o  the type of delegation (OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE).

o  space limitation information to control flushing of data on close (OPEN_DELEGATE_WRITE delegation only; see Section 10.4.1)

o  an nfsace4 specifying read and write permissions

o  a stateid to represent the delegation

The delegation stateid is separate and distinct from the stateid for the OPEN proper.  The standard stateid, unlike the delegation stateid, is associated with a particular lock-owner and will continue to be valid after the delegation is recalled and the file remains open.
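The division between opens a delegation holder may handle locally and opens that must be sent to the server can be sketched as follows.  This is an illustrative sketch; the helper name is an assumption, though the constant values mirror the protocol's OPEN4_SHARE_* flags.

```python
# OPEN4_SHARE_* flag values as defined by the NFSv4.1 protocol.
ACCESS_READ, ACCESS_WRITE, ACCESS_BOTH = 0x1, 0x2, 0x3
DENY_NONE, DENY_READ = 0x0, 0x1

def can_handle_locally(deleg_type, share_access, share_deny):
    """May a client holding a delegation service this OPEN itself?"""
    if deleg_type == "write":
        # An OPEN_DELEGATE_WRITE delegation covers all opens.
        return True
    # An OPEN_DELEGATE_READ delegation covers only opens for reading
    # that do not deny read access to others.
    return share_access == ACCESS_READ and not (share_deny & DENY_READ)

assert can_handle_locally("read", ACCESS_READ, DENY_NONE)
assert not can_handle_locally("read", ACCESS_BOTH, DENY_NONE)   # write access
assert not can_handle_locally("read", ACCESS_READ, DENY_READ)   # denies read
assert can_handle_locally("write", ACCESS_BOTH, DENY_READ)
```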
When a request internal to the client is made to open a file and an OPEN delegation is in effect, it will be accepted or rejected solely on the basis of the following conditions.  Any requirement for other checks to be made by the delegate should result in the OPEN delegation being denied so that the checks can be made by the server itself.

o  The access and deny bits for the request and the file as described in Section 9.7.

o  The read and write permissions as determined below.

The nfsace4 passed with the delegation can be used to avoid frequent ACCESS calls.  The permission check should be as follows:

o  If the nfsace4 indicates that the open may be done, then it should be granted without reference to the server.

o  If the nfsace4 indicates that the open may not be done, then an ACCESS request must be sent to the server to obtain the definitive answer.

The server may return an nfsace4 that is more restrictive than the actual ACL of the file.  This includes an nfsace4 that specifies denial of all access.  Note that some common practices, such as mapping the traditional user "root" to the user "nobody" (see Section 5.9), may make it incorrect to return the actual ACL of the file in the delegation response.

The use of a delegation together with various other forms of caching creates the possibility that no server authentication and authorization will ever be performed for a given user since all of the user's requests might be satisfied locally.  Where the client is depending on the server for authentication and authorization, the client should be sure authentication and authorization occur for each user by use of the ACCESS operation.  This should be the case even if an ACCESS operation would not be required otherwise.
As mentioned before, the server may enforce frequent authentication by returning an nfsace4 denying all access with every OPEN delegation.

10.4.1.  Open Delegation and Data Caching

An OPEN delegation allows much of the message overhead associated with the opening and closing of files to be eliminated.  An open when an OPEN delegation is in effect does not require that a validation message be sent to the server.  The continued endurance of the "OPEN_DELEGATE_READ delegation" provides a guarantee that no OPEN for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH, and thus no write, has occurred.  Similarly, when closing a file opened for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH and if an OPEN_DELEGATE_WRITE delegation is in effect, the data written does not have to be written to the server until the OPEN delegation is recalled.  The continued endurance of the OPEN delegation provides a guarantee that no open, and thus no READ or WRITE, has been done by another client.

For the purposes of OPEN delegation, READs and WRITEs done without an OPEN are treated as the functional equivalents of a corresponding type of OPEN.  Although a client SHOULD NOT use special stateids when an open exists, delegation handling on the server can use the client ID associated with the current session to determine if the operation has been done by the holder of the delegation (in which case, no recall is necessary) or by another client (in which case, the delegation must be recalled and I/O not proceed until the delegation is returned or revoked).

With delegations, a client is able to avoid writing data to the server when the CLOSE of a file is serviced.  The file close system call is the usual point at which the client is notified of a lack of stable storage for the modified file data generated by the application.
At the close, file data is written to the server and, through normal accounting, the server is able to determine if the available file system space for the data has been exceeded (i.e., the server returns NFS4ERR_NOSPC or NFS4ERR_DQUOT).  This accounting includes quotas.  The introduction of delegations requires that an alternative method be in place for the same type of communication to occur between client and server.

In the delegation response, the server provides either the limit of the size of the file or the number of modified blocks and associated block size.  The server must ensure that the client will be able to write modified data to the server of a size equal to that provided in the original delegation.  The server must make this assurance for all outstanding delegations.  Therefore, the server must be careful in its management of available space for new or modified data, taking into account available file system space and any applicable quotas.  The server can recall delegations as a result of managing the available file system space.  The client should abide by the server's state space limits for delegations.  If the client exceeds the stated limits for the delegation, the server's behavior is undefined.

Based on server conditions, quotas, or available file system space, the server may grant OPEN_DELEGATE_WRITE delegations with very restrictive space limitations.  The limitations may be defined in a way that will always force modified data to be flushed to the server on close.

With respect to authentication, flushing modified data to the server after a CLOSE has occurred may be problematic.  For example, the user of the application may have logged off the client, and unexpired authentication credentials may not be present.
In this case, the client may need to take special care to ensure that local unexpired credentials will in fact be available.  This may be accomplished by tracking the expiration time of credentials and flushing data well in advance of their expiration or by making private copies of credentials to assure their availability when needed.

10.4.2.  Open Delegation and File Locks

When a client holds an OPEN_DELEGATE_WRITE delegation, lock operations are performed locally.  This includes those required for mandatory byte-range locking.  This can be done since the delegation implies that there can be no conflicting locks.  Similarly, all of the revalidations that would normally be associated with obtaining locks and the flushing of data associated with the releasing of locks need not be done.

When a client holds an OPEN_DELEGATE_READ delegation, lock operations are not performed locally.  All lock operations, including those requesting non-exclusive locks, are sent to the server for resolution.

10.4.3.  Handling of CB_GETATTR

The server needs to employ special handling for a GETATTR where the target is a file that has an OPEN_DELEGATE_WRITE delegation in effect.  The reason for this is that the client holding the OPEN_DELEGATE_WRITE delegation may have modified the data, and the server needs to reflect this change to the second client that submitted the GETATTR.  Therefore, the client holding the OPEN_DELEGATE_WRITE delegation needs to be interrogated.  The server will use the CB_GETATTR operation.  The only attributes that the server can reliably query via CB_GETATTR are size and change.

Since CB_GETATTR is being used to satisfy another client's GETATTR request, the server only needs to know if the client holding the delegation has a modified version of the file.
If the client's copy of the delegated file is not modified (data or size), the server can satisfy the second client's GETATTR request from the attributes stored locally at the server.  If the file is modified, the server only needs to know about this modified state.  If the server determines that the file is currently modified, it will respond to the second client's GETATTR as if the file had been modified locally at the server.

Since the form of the change attribute is determined by the server and is opaque to the client, the client and server need to agree on a method of communicating the modified state of the file.  For the size attribute, the client will report its current view of the file size.  For the change attribute, the handling is more involved.

For the client, the following steps will be taken when receiving an OPEN_DELEGATE_WRITE delegation:

o  The value of the change attribute will be obtained from the server and cached.  Let this value be represented by c.

o  The client will create a value greater than c that will be used for communicating that modified data is held at the client.  Let this value be represented by d.

o  When the client is queried via CB_GETATTR for the change attribute, it checks to see if it holds modified data.  If the file is modified, the value d is returned for the change attribute value.  If this file is not currently modified, the client returns the value c for the change attribute.

For simplicity of implementation, the client MAY for each CB_GETATTR return the same value d.  This is true even if, between successive CB_GETATTR operations, the client again modifies the file's data or metadata in its cache.  The client can return the same value because the only requirement is that the client be able to indicate to the server that the client holds modified data.
Therefore, the value of d may always be c + 1.

While the change attribute is opaque to the client in the sense that it has no idea what units of time, if any, the server is counting change with, it is not opaque in that the client has to treat it as an unsigned integer, and the server has to be able to see the results of the client's changes to that integer.  Therefore, the server MUST encode the change attribute in network order when sending it to the client.  The client MUST decode it from network order to its native order when receiving it, and the client MUST encode it in network order when sending it to the server.  For this reason, change is defined as an unsigned integer rather than an opaque array of bytes.

For the server, the following steps will be taken when providing an OPEN_DELEGATE_WRITE delegation:

o  Upon providing an OPEN_DELEGATE_WRITE delegation, the server will cache a copy of the change attribute in the data structure it uses to record the delegation.  Let this value be represented by sc.

o  When a second client sends a GETATTR operation on the same file to the server, the server obtains the change attribute from the first client.  Let this value be cc.

o  If the value cc is equal to sc, the file is not modified and the server returns the current values for change, time_metadata, and time_modify (for example) to the second client.

o  If the value cc is NOT equal to sc, the file is currently modified at the first client and most likely will be modified at the server at a future time.  The server then uses its current time to construct attribute values for time_metadata and time_modify.  A new value of sc, which we will call nsc, is computed by the server, such that nsc >= sc + 1.  The server then returns the constructed time_metadata, time_modify, and nsc values to the requester.
The server replaces sc in the delegation record with nsc.  To prevent time_modify, time_metadata, and change from appearing to go backward (which would happen if the client holding the delegation fails to write its modified data to the server before the delegation is revoked or returned), the server SHOULD update the file's metadata record with the constructed attribute values.  For reasons of reasonable performance, committing the constructed attribute values to stable storage is OPTIONAL.

As discussed earlier in this section, the client MAY return the same cc value on subsequent CB_GETATTR calls, even if the file was modified in the client's cache yet again between successive CB_GETATTR calls.  Therefore, the server must assume that the file has been modified yet again, and MUST take care to ensure that the new nsc it constructs and returns is greater than the previous nsc it returned.  An example implementation's delegation record would satisfy this mandate by including a boolean field (let us call it "modified") that is set to FALSE when the delegation is granted, and an sc value set at the time of grant to the change attribute value.  The modified field would be set to TRUE the first time cc != sc, and would stay TRUE until the delegation is returned or revoked.  The processing for constructing nsc, time_modify, and time_metadata would use this pseudo code:

   if (!modified) {
       do CB_GETATTR for change and size;

       if (cc != sc)
           modified = TRUE;
   } else {
       do CB_GETATTR for size;
   }

   if (modified) {
       sc = sc + 1;
       time_modify = time_metadata = current_time;
       update sc, time_modify, time_metadata into file's metadata;
   }

This would return to the client (that sent GETATTR) the attributes it requested, but make sure size comes from what CB_GETATTR returned.
The server would not update the file's metadata with the client's modified size.

In the case that the file attribute size is different from the server's current value, the server treats this as a modification regardless of the value of the change attribute retrieved via CB_GETATTR and responds to the second client as in the last step.

This methodology resolves issues of clock differences between client and server and other scenarios where the use of CB_GETATTR breaks down.

It should be noted that the server is under no obligation to use CB_GETATTR, and therefore the server MAY simply recall the delegation to avoid its use.

10.4.4.  Recall of Open Delegation

The following events necessitate recall of an OPEN delegation:

o  potentially conflicting OPEN request (or a READ or WRITE operation done with a special stateid)

o  SETATTR sent by another client

o  REMOVE request for the file

o  RENAME request for the file as either the source or target of the RENAME

Whether a RENAME of a directory in the path leading to the file results in recall of an OPEN delegation depends on the semantics of the server's file system.  If that file system denies such RENAMEs when a file is open, the recall must be performed to determine whether the file in question is, in fact, open.

In addition to the situations above, the server may choose to recall OPEN delegations at any time if resource constraints make it advisable to do so.  Clients should always be prepared for the possibility of recall.

When a client receives a recall for an OPEN delegation, it needs to update state on the server before returning the delegation.  These same updates must be done whenever a client chooses to return a delegation voluntarily.
The following items of state need to be dealt with:

o  If the file associated with the delegation is no longer open and no previous CLOSE operation has been sent to the server, a CLOSE operation must be sent to the server.

o  If a file has other open references at the client, then OPEN operations must be sent to the server.  The appropriate stateids will be provided by the server for subsequent use by the client since the delegation stateid will no longer be valid.  These OPEN requests are done with the claim type of CLAIM_DELEGATE_CUR.  This will allow the presentation of the delegation stateid so that the client can establish the appropriate rights to perform the OPEN.  (See Section 18.16, which describes the OPEN operation, for details.)

o  If there are granted byte-range locks, the corresponding LOCK operations need to be performed.  This applies to the OPEN_DELEGATE_WRITE delegation case only.

o  For an OPEN_DELEGATE_WRITE delegation, if at the time of recall the file is not open for OPEN4_SHARE_ACCESS_WRITE/OPEN4_SHARE_ACCESS_BOTH, all modified data for the file must be flushed to the server.  If the delegation had not existed, the client would have done this data flush before the CLOSE operation.

o  For an OPEN_DELEGATE_WRITE delegation when a file is still open at the time of recall, any modified data for the file needs to be flushed to the server.

o  With the OPEN_DELEGATE_WRITE delegation in place, it is possible that the file was truncated during the duration of the delegation.  For example, the truncation could have occurred as a result of an OPEN UNCHECKED with a size attribute value of zero.  Therefore, if a truncation of the file has occurred and this operation has not been propagated to the server, the truncation must occur before any modified data is written to the server.
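The items above can be combined into a rough ordering, sketched below.  The record layout and operation labels are assumptions for illustration only; the sketch captures the constraint that a local truncation must reach the server before modified data is written, and that all updates precede the delegation return.

```python
from dataclasses import dataclass

# Illustrative sketch of the state updates performed before returning
# a recalled delegation (not a protocol definition).
@dataclass
class DelegationState:
    deleg_type: str            # "read" or "write"
    still_open: bool = False   # file still open at the client
    close_sent: bool = False
    truncated_locally: bool = False
    has_dirty_data: bool = False
    local_opens: int = 0       # opens handled locally under the delegation
    local_write_locks: int = 0 # byte-range WRITE_LT locks handled locally

def operations_before_return(d):
    ops = []
    # Re-establish locally handled opens, presenting the delegation
    # stateid via CLAIM_DELEGATE_CUR.
    ops += ["OPEN(CLAIM_DELEGATE_CUR)"] * d.local_opens
    if d.deleg_type == "write":
        # Byte-range locks apply to the write delegation case only.
        ops += ["LOCK"] * d.local_write_locks
        if d.truncated_locally:
            # Truncation must be propagated before modified data.
            ops.append("SETATTR(size)")
        if d.has_dirty_data:
            ops.append("WRITE/COMMIT")
    if not d.still_open and not d.close_sent:
        ops.append("CLOSE")
    ops.append("DELEGRETURN")
    return ops

ops = operations_before_return(DelegationState(
    "write", still_open=True, truncated_locally=True,
    has_dirty_data=True, local_opens=1, local_write_locks=1))
assert ops.index("SETATTR(size)") < ops.index("WRITE/COMMIT")
assert "CLOSE" not in ops and ops[-1] == "DELEGRETURN"
```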
In the case of OPEN_DELEGATE_WRITE delegation, byte-range locking imposes some additional requirements.  To precisely maintain the associated invariant, it is required to flush any modified data in any byte-range for which a WRITE_LT lock was released while the OPEN_DELEGATE_WRITE delegation was in effect.  However, because the OPEN_DELEGATE_WRITE delegation implies no other locking by other clients, a simpler implementation is to flush all modified data for the file (as described just above) if any WRITE_LT lock has been released while the OPEN_DELEGATE_WRITE delegation was in effect.

An implementation need not wait until delegation recall (or the decision to voluntarily return a delegation) to perform any of the above actions, if implementation considerations (e.g., resource availability constraints) make that desirable.  Generally, however, the fact that the actual OPEN state of the file may continue to change makes it not worthwhile to send information about opens and closes to the server, except as part of delegation return.  An exception is when the client has no more internal opens of the file.  In this case, sending a CLOSE is useful because it reduces resource utilization on the client and server.  Regardless of the client's choices on scheduling these actions, all must be performed before the delegation is returned, including (when applicable) the close that corresponds to the OPEN that resulted in the delegation.  These actions can be performed either in previous requests or in previous operations in the same COMPOUND request.

10.4.5.  Clients That Fail to Honor Delegation Recalls

A client may fail to respond to a recall for various reasons, such as a failure of the backchannel from server to the client.  The client may be unaware of a failure in the backchannel.
This lack of awareness could result in the client finding out long after the failure that its delegation has been revoked, and another client has modified the data for which the client had a delegation.  This is especially a problem for the client that held an OPEN_DELEGATE_WRITE delegation.

Status bits returned by SEQUENCE operations help to provide an alternate way of informing the client of issues regarding the status of the backchannel and of recalled delegations.  When the backchannel is not available, the server returns the status bit SEQ4_STATUS_CB_PATH_DOWN on SEQUENCE operations.  The client can react by attempting to re-establish the backchannel and by returning recallable objects if a backchannel cannot be successfully re-established.

Whether the backchannel is functioning or not, it may be that the recalled delegation is not returned.  Note that the client's lease might still be renewed, even though the recalled delegation is not returned.  In this situation, servers SHOULD revoke delegations that are not returned in a period of time equal to the lease period.  This period of time should allow the client time to note the backchannel-down status and re-establish the backchannel.

When delegations are revoked, the server will return with the SEQ4_STATUS_RECALLABLE_STATE_REVOKED status bit set on subsequent SEQUENCE operations.  The client should note this and then use TEST_STATEID to find which delegations have been revoked.

10.4.6.  Delegation Revocation

At the point a delegation is revoked, if there are associated opens on the client, these opens may or may not be revoked.
If no byte-range lock or open is granted that is inconsistent with the existing open, the stateid for the open may remain valid and be disconnected from the revoked delegation, just as would be the case if the delegation were returned.

For example, if an OPEN for OPEN4_SHARE_ACCESS_BOTH with a deny of OPEN4_SHARE_DENY_NONE is associated with the delegation, granting of another such OPEN to a different client will revoke the delegation but need not revoke the OPEN, since the two OPENs are consistent with each other.  On the other hand, if an OPEN denying write access is granted, then the existing OPEN must be revoked.

When opens and/or locks are revoked, the applications holding these opens or locks need to be notified.  This notification usually occurs by returning errors for READ/WRITE operations or when a close is attempted for the open file.

If no opens exist for the file at the point the delegation is revoked, then notification of the revocation is unnecessary.  However, if there is modified data present at the client for the file, the user of the application should be notified.  Unfortunately, it may not be possible to notify the user since active applications may not be present at the client.  See Section 10.5.1 for additional details.

10.4.7.  Delegations via WANT_DELEGATION

In addition to providing delegations as part of the reply to OPEN operations, servers MAY provide delegations separate from open, via the OPTIONAL WANT_DELEGATION operation.  This allows delegations to be obtained in advance of an OPEN that might benefit from them, for objects that are not a valid target of OPEN, or to deal with cases in which a delegation has been recalled and the client wants to make an attempt to re-establish it if the absence of use by other clients allows that.
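The OPEN consistency test used in the Section 10.4.6 example can be expressed as a share-reservation bitmask check.  This is an illustrative sketch; the flag values match the protocol's OPEN4_SHARE_* constants.

```python
# OPEN4_SHARE_* flag values as defined by the NFSv4.1 protocol.
OPEN4_SHARE_ACCESS_READ  = 0x1
OPEN4_SHARE_ACCESS_BOTH  = 0x3
OPEN4_SHARE_DENY_NONE    = 0x0
OPEN4_SHARE_DENY_WRITE   = 0x2

def compatible(access1, deny1, access2, deny2):
    # Two OPENs are consistent when neither denies what the other uses.
    return (access1 & deny2) == 0 and (access2 & deny1) == 0

# The Section 10.4.6 example: two ACCESS_BOTH/DENY_NONE OPENs are
# consistent, so the existing OPEN need not be revoked.
assert compatible(OPEN4_SHARE_ACCESS_BOTH, OPEN4_SHARE_DENY_NONE,
                  OPEN4_SHARE_ACCESS_BOTH, OPEN4_SHARE_DENY_NONE)
# An OPEN denying write access conflicts with the existing ACCESS_BOTH
# OPEN, so the existing OPEN must be revoked.
assert not compatible(OPEN4_SHARE_ACCESS_BOTH, OPEN4_SHARE_DENY_NONE,
                      OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_DENY_WRITE)
```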
The WANT_DELEGATION operation may be performed on any type of file object other than a directory.

When a delegation is obtained using WANT_DELEGATION, any open files for the same filehandle held by that client are to be treated as subordinate to the delegation, just as if they had been created using an OPEN of type CLAIM_DELEGATE_CUR.  They are otherwise unchanged as to seqid, access and deny modes, and the relationship with byte-range locks.  Similarly, because existing byte-range locks are subordinate to an open, those byte-range locks also become indirectly subordinate to that new delegation.

The WANT_DELEGATION operation provides for delivery of delegations via callbacks, when the delegations are not immediately available.  When a requested delegation is available, it is delivered to the client via a CB_PUSH_DELEG operation.  When this happens, open files for the same filehandle become subordinate to the new delegation at the point at which the delegation is delivered, just as if they had been created using an OPEN of type CLAIM_DELEGATE_CUR.  Similarly, this occurs for existing byte-range locks subordinate to an open.

10.5.  Data Caching and Revocation

When locks and delegations are revoked, the assumptions upon which successful caching depends are no longer guaranteed.  For any locks or share reservations that have been revoked, the corresponding state-owner needs to be notified.  This notification includes applications with a file open that has a corresponding delegation that has been revoked.  Cached data associated with the revocation must be removed from the client.  In the case of modified data existing in the client's cache, that data must be removed from the client without being written to the server.
As mentioned, the assumptions made by the client are no longer valid at the point when a lock or delegation has been revoked.  For example, another client may have been granted a conflicting byte-range lock after the revocation of the byte-range lock at the first client.  Therefore, the data within the lock range may have been modified by the other client.  Obviously, the first client is unable to guarantee to the application what has occurred to the file in the case of revocation.

Notification to a state-owner will in many cases consist of simply returning an error on the next and all subsequent READs/WRITEs to the open file or on the close.  Where the methods available to a client make such notification impossible because errors for certain operations may not be returned, more drastic action such as signals or process termination may be appropriate.  The justification here is that an invariant on which an application depends may be violated.  Depending on how errors are typically treated for the client operating environment, further levels of notification including logging, console messages, and GUI pop-ups may be appropriate.

10.5.1.  Revocation Recovery for Write Open Delegation

Revocation recovery for an OPEN_DELEGATE_WRITE delegation poses the special issue of modified data in the client cache while the file is not open.  In this situation, any client that does not flush modified data to the server on each close must ensure that the user receives appropriate notification of the failure as a result of the revocation.  Since such situations may require human action to correct problems, notification schemes in which the appropriate user or administrator is notified may be necessary.  Logging and console messages are typical examples.
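A minimal sketch of the notification behavior described in Section 10.5, with all names assumed for illustration: on revocation, modified cached data is discarded without write-back, and subsequent I/O on the open file fails so the application is notified.

```python
import errno

# Illustrative client-side model of revoked open state (not an API of
# any real NFS implementation).
class OpenFileState:
    def __init__(self):
        self.revoked = False
        self.dirty_pages = {0: b"modified"}  # cached modified data

    def on_revocation(self):
        # Modified data must be removed without being written back;
        # the assumptions behind caching it no longer hold.
        self.dirty_pages.clear()
        self.revoked = True

    def read(self, offset, count):
        if self.revoked:
            # Error on the next and all subsequent READs/WRITEs.
            raise OSError(errno.EIO, "open state revoked by server")
        return b"\0" * count

f = OpenFileState()
f.on_revocation()
assert f.dirty_pages == {}
try:
    f.read(0, 4)
    raise AssertionError("read should fail after revocation")
except OSError as e:
    assert e.errno == errno.EIO
```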
If there is modified data on the client, it must not be flushed normally to the server.  A client may attempt to provide a copy of the file data as modified during the delegation under a different name in the file system namespace to ease recovery.  Note that when the client can determine that the file has not been modified by any other client, or when the client has a complete cached copy of the file in question, such a saved copy of the client's view of the file may be of particular value for recovery.  In another case, recovery using a copy of the file based partially on the client's cached data and partially on the server's copy as modified by other clients will be anything but straightforward, so clients may avoid saving file contents in these situations or specially mark the results to warn users of possible problems.

Saving of such modified data in delegation revocation situations may be limited to files of a certain size or might be used only when sufficient disk space is available within the target file system.  Such saving may also be restricted to situations when the client has sufficient buffering resources to keep the cached copy available until it is properly stored to the target file system.

10.6.  Attribute Caching

This section pertains to the caching of a file's attributes on a client when that client does not hold a delegation on the file.

The attributes discussed in this section do not include named attributes.  Individual named attributes are analogous to files, and caching of the data for these needs to be handled just as data caching is for ordinary files.  Similarly, LOOKUP results from an OPENATTR directory (as well as the directory's contents) are to be cached on the same basis as any other pathnames.
Clients may cache file attributes obtained from the server and use them to avoid subsequent GETATTR requests.  Such caching is write through in that modification to file attributes is always done by means of requests to the server and should not be done locally and should not be cached.  The exceptions to this are modifications to attributes that are intimately connected with data caching.  Therefore, extending a file by writing data to the local data cache is reflected immediately in the size as seen on the client without this change being immediately reflected on the server.  Normally, such changes are not propagated directly to the server, but when the modified data is flushed to the server, analogous attribute changes are made on the server.  When OPEN delegation is in effect, the modified attributes may be returned to the server in reaction to a CB_RECALL call.

The result of local caching of attributes is that the attribute caches maintained on individual clients will not be coherent.  Changes made in one order on the server may be seen in a different order on one client and in a third order on another client.

The typical file system application programming interfaces do not provide means to atomically modify or interrogate attributes for multiple files at the same time.  The following rules provide an environment where the potential incoherencies mentioned above can be reasonably managed.  These rules are derived from the practice of previous NFS protocols.

o  All attributes for a given file (per-fsid attributes excepted) are cached as a unit at the client so that no non-serializability can arise within the context of a single file.

o  An upper time boundary is maintained on how long a client cache entry can be kept without being refreshed from the server.
o  When operations are performed that change attributes at the
   server, the updated attribute set is requested as part of the
   containing RPC.  This includes directory operations that update
   attributes indirectly.  This is accomplished by following the
   modifying operation with a GETATTR operation and then using the
   results of the GETATTR to update the client's cached attributes.

Note that if the full set of attributes to be cached is requested by
READDIR, the results can be cached by the client on the same basis as
attributes obtained via GETATTR.

A client may validate its cached version of attributes for a file by
fetching both the change and time_access attributes and assuming that
if the change attribute has the same value as it did when the
attributes were cached, then no attributes other than time_access
have changed.  The reason time_access is also fetched is that many
servers operate in environments where the operation that updates
change does not update time_access.  For example, POSIX file
semantics do not update access time when a file is modified by the
write system call [15].  Therefore, a client that wants a current
time_access value should fetch it with change during the attribute
cache validation processing and update its cached time_access.

The client may maintain a cache of modified attributes for those
attributes intimately connected with data of modified regular files
(size, time_modify, and change).  Other than those three attributes,
the client MUST NOT maintain a cache of modified attributes.
Instead, attribute changes are immediately sent to the server.

In some operating environments, the equivalent to time_access is
expected to be implicitly updated by each read of the content of the
file object.
If an NFS client is caching the content of a file object, whether it
is a regular file, directory, or symbolic link, the client SHOULD NOT
update the time_access attribute (via SETATTR or a small READ or
READDIR request) on the server with each read that is satisfied from
cache.  The reason is that this can defeat the performance benefits
of caching content, especially since an explicit SETATTR of
time_access may alter the change attribute on the server.  If the
change attribute changes, clients that are caching the content will
think the content has changed, and will re-read unmodified data from
the server.  Nor is the client encouraged to maintain a modified
version of time_access in its cache, since the client would then
either eventually have to write the access time to the server with
bad performance effects or never update the server's time_access,
thereby resulting in a situation where an application that caches
access time between a close and open of the same file observes the
access time oscillating between the past and present.  The
time_access attribute always means the time of last access to a file
by a read that was satisfied by the server.  This way clients will
tend to see only time_access changes that go forward in time.

10.7.  Data and Metadata Caching and Memory Mapped Files

Some operating environments include the capability for an application
to map a file's content into the application's address space.  Each
time the application accesses a memory location that corresponds to a
block that has not been loaded into the address space, a page fault
occurs and the file is read (or if the block does not exist in the
file, the block is allocated and then instantiated in the
application's address space).
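The page-fault-driven model just described can be sketched with
Python's mmap facility.  This is a purely local illustration (no NFS
involved); the temporary-file handling is an assumption made so the
sketch is self-contained:

```python
import mmap
import tempfile

# Map a file's content into the process address space.  Indexing the
# map stands in for the memory loads that fault pages in from the
# file; assigning through the map stands in for STORE instructions.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello, mapped world")
    f.flush()
    with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the whole file
        first = m[0:5]                    # a read through the mapping
        m[0:5] = b"HELLO"                 # a store through the mapping
    path = f.name
# After the map is closed, stores made through it are visible in the
# underlying file, without any explicit write() call having been made.
```

Note that no read() or write() system call appears between the map
being established and the data changing, which is exactly why the
access- and modification-detection attributes discussed next may not
be updated.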
As long as each memory-mapped access to the file requires a page
fault, the relevant attributes of the file that are used to detect
access and modification (time_access, time_metadata, time_modify, and
change) will be updated.  However, in many operating environments,
when page faults are not required, these attributes will not be
updated on reads or updates to the file via memory access (regardless
of whether the file is local or is accessed remotely).  A client or
server MAY fail to update attributes of a file that is being accessed
via memory-mapped I/O.  This has several implications:

o  If there is an application on the server that has memory mapped a
   file that a client is also accessing, the client may not be able
   to get a consistent value of the change attribute to determine
   whether or not its cache is stale.  A server that knows that the
   file is memory-mapped could always pessimistically return updated
   values for change so as to force the application to always get the
   most up-to-date data and metadata for the file.  However, due to
   the negative performance implications of this, such behavior is
   OPTIONAL.

o  If the memory-mapped file is not being modified on the server, and
   instead is just being read by an application via the memory-mapped
   interface, the client will not see an updated time_access
   attribute.  However, in many operating environments, neither will
   any process running on the server.  Thus, NFS clients are at no
   disadvantage with respect to local processes.

o  If there is another client that is memory mapping the file, and if
   that client is holding an OPEN_DELEGATE_WRITE delegation, the same
   set of issues as discussed in the previous two bullet points
   apply.
   So, when a server does a CB_GETATTR to a file that the client has
   modified in its cache, the reply from CB_GETATTR will not
   necessarily be accurate.  As discussed earlier, the client's
   obligation is to report that the file has been modified since the
   delegation was granted, not whether it has been modified again
   between successive CB_GETATTR calls, and the server MUST assume
   that any file the client has modified in cache has been modified
   again between successive CB_GETATTR calls.  Depending on the
   nature of the client's memory management system, even this weak
   obligation may not be possible to fulfill.  A client MAY return
   stale information in CB_GETATTR whenever the file is
   memory-mapped.

o  The mixture of memory mapping and byte-range locking on the same
   file is problematic.  Consider the following scenario, where the
   page size on each client is 8192 bytes.

   *  Client A memory maps the first page (8192 bytes) of file X.

   *  Client B memory maps the first page (8192 bytes) of file X.

   *  Client A WRITE_LT locks the first 4096 bytes.

   *  Client B WRITE_LT locks the second 4096 bytes.

   *  Client A, via a STORE instruction, modifies part of its locked
      byte-range.

   *  Simultaneously with client A, client B executes a STORE on part
      of its locked byte-range.

Here the challenge is for each client to resynchronize to get a
correct view of the first page.  In many operating environments, the
virtual memory management systems on each client only know that a
page is modified, not that a subset of the page corresponding to the
respective lock byte-ranges has been modified.  So it is not possible
for each client to do the right thing, which is to write to the
server only that portion of the page that is locked.  For example, if
client A simply writes out the page, and then client B writes out the
page, client A's data is lost.
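The lost-update hazard in the scenario above can be demonstrated with
a small simulation (no NFS involved; the function name and page
layout are illustrative assumptions).  Each client caches the whole
first page but holds a byte-range lock on only half of it; flushing
the full page clobbers the other client's locked half, while flushing
only the locked sub-range is safe:

```python
PAGE = 8192

def write_back(server_page, client_page, lock_start, lock_end,
               whole_page=False):
    """Return the server page after a client write-back."""
    if whole_page:
        # Naive behavior: the VM system only knows the page is dirty,
        # so the client flushes the entire page.
        return bytes(client_page)
    # Correct behavior: flush only the locked sub-range.
    merged = bytearray(server_page)
    merged[lock_start:lock_end] = client_page[lock_start:lock_end]
    return bytes(merged)

server = bytearray(PAGE)                      # initial page: all zeros
page_a = bytearray(server); page_a[0:4096] = b"A" * 4096
page_b = bytearray(server); page_b[4096:8192] = b"B" * 4096

# Naive order A-then-B: B's stale copy of the first half wins, and
# client A's data is lost.
naive = write_back(write_back(server, page_a, 0, 4096, whole_page=True),
                   page_b, 4096, 8192, whole_page=True)

# Sub-range write-back preserves both clients' modifications.
safe = write_back(write_back(server, page_a, 0, 4096),
                  page_b, 4096, 8192)
```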
Moreover, if mandatory locking is enabled on the file, then we have a
different problem.  When clients A and B execute the STORE
instructions, the resulting page faults require a byte-range lock on
the entire page.  Each client then tries to extend its locked range
to the entire page, which results in a deadlock.  Communicating the
NFS4ERR_DEADLOCK error to a STORE instruction is difficult at best.

If a client is locking the entire memory-mapped file, there is no
problem with advisory or mandatory byte-range locking, at least until
the client unlocks a byte-range in the middle of the file.

Given the above issues, the following are permitted:

o  Clients and servers MAY deny memory mapping a file for which they
   know there are byte-range locks.

o  Clients and servers MAY deny a byte-range lock on a file they know
   is memory-mapped.

o  A client MAY deny memory mapping a file that it knows requires
   mandatory locking for I/O.  If mandatory locking is enabled after
   the file is opened and mapped, the client MAY deny the application
   further access to its mapped file.

10.8.  Name and Directory Caching without Directory Delegations

The NFSv4.1 directory delegation facility (described in Section 10.9
below) is OPTIONAL for servers to implement.  Even where it is
implemented, it may not always be functional because of resource
availability issues or other constraints.  Thus, it is important to
understand how name and directory caching are done in the absence of
directory delegations.  These topics are discussed in the next two
subsections.

10.8.1.  Name Caching

The results of LOOKUP and READDIR operations may be cached to avoid
the cost of subsequent LOOKUP operations.  Just as in the case of
attribute caching, inconsistencies may arise among the various client
caches.
To mitigate the effects of these inconsistencies and given the
context of typical file system APIs, an upper time boundary is
maintained for how long a client name cache entry can be kept without
verifying that the entry has not been made invalid by a directory
change operation performed by another client.

When a client is not making changes to a directory for which there
exist name cache entries, the client needs to periodically fetch
attributes for that directory to ensure that it is not being
modified.  After determining that no modification has occurred, the
expiration time for the associated name cache entries may be updated
to be the current time plus the name cache staleness bound.

When a client is making changes to a given directory, it needs to
determine whether there have been changes made to the directory by
other clients.  It does this by using the change attribute as
reported before and after the directory operation in the associated
change_info4 value returned for the operation.  The server is able to
communicate to the client whether the change_info4 data is provided
atomically with respect to the directory operation.  If the change
values are provided atomically, the client has a basis for
determining, given proper care, whether other clients are modifying
the directory in question.

The simplest way to enable the client to make this determination is
for the client to serialize all changes made to a specific directory.
When this is done, and the server provides before and after values of
the change attribute atomically, the client can simply compare the
after value of the change attribute from one operation on a directory
with the before value on the subsequent operation modifying that
directory.
When these are equal, the client is assured that no other client is
modifying the directory in question.

When such serialization is not used, and there may be multiple
simultaneous outstanding operations modifying a single directory sent
from a single client, making this sort of determination can be more
complicated.  If two such operations complete in a different order
than they were actually performed, that might give an appearance
consistent with modification being made by another client.  Where
this appears to happen, the client needs to await the completion of
all such modifications that were started previously, to see if the
outstanding before and after change numbers can be sorted into a
chain such that the before value of one change number matches the
after value of a previous one, in a chain consistent with this client
being the only one modifying the directory.

In either of these cases, the client is able to determine whether the
directory is being modified by another client.  If the comparison
indicates that the directory was updated by another client, the name
cache associated with the modified directory is purged from the
client.  If the comparison indicates no modification, the name cache
can be updated on the client to reflect the directory operation and
the associated timeout can be extended.  The post-operation change
value needs to be saved as the basis for future change_info4
comparisons.

As demonstrated by the scenario above, name caching requires that the
client revalidate name cache data by inspecting the change attribute
of a directory at the point when the name cache item was cached.
This requires that the server update the change attribute for
directories when the contents of the corresponding directory are
modified.
For a client to use the change_info4 information appropriately and
correctly, the server must report the pre- and post-operation change
attribute values atomically.  When the server is unable to report the
before and after values atomically with respect to the directory
operation, the server must indicate that fact in the change_info4
return value.  When the information is not atomically reported, the
client should not assume that other clients have not changed the
directory.

10.8.2.  Directory Caching

The results of READDIR operations may be used to avoid subsequent
READDIR operations.  Just as in the cases of attribute and name
caching, inconsistencies may arise among the various client caches.
To mitigate the effects of these inconsistencies, and given the
context of typical file system APIs, the following rules should be
followed:

o  Cached READDIR information for a directory that is not obtained in
   a single READDIR operation must always be a consistent snapshot of
   directory contents.  This is determined by using a GETATTR before
   the first READDIR and after the last READDIR that contributes to
   the cache.

o  An upper time boundary is maintained to indicate the length of
   time a directory cache entry is considered valid before the client
   must revalidate the cached information.

The revalidation technique parallels that discussed in the case of
name caching.  When the client is not changing the directory in
question, checking the change attribute of the directory with GETATTR
is adequate.  The lifetime of the cache entry can be extended at
these checkpoints.  When a client is modifying the directory, the
client needs to use the change_info4 data to determine whether there
are other clients modifying the directory.
If it is determined that no other client modifications are occurring,
the client may update its directory cache to reflect its own changes.

As demonstrated previously, directory caching requires that the
client revalidate directory cache data by inspecting the change
attribute of a directory at the point when the directory was cached.
This requires that the server update the change attribute for
directories when the contents of the corresponding directory are
modified.  For a client to use the change_info4 information
appropriately and correctly, the server must report the pre- and
post-operation change attribute values atomically.  When the server
is unable to report the before and after values atomically with
respect to the directory operation, the server must indicate that
fact in the change_info4 return value.  When the information is not
atomically reported, the client should not assume that other clients
have not changed the directory.

10.9.  Directory Delegations

10.9.1.  Introduction to Directory Delegations

Directory caching for the NFSv4.1 protocol, as previously described,
is similar to file caching in previous versions.  Clients typically
cache directory information for a duration determined by the client.
At the end of a predefined timeout, the client will query the server
to see if the directory has been updated.  By caching attributes,
clients reduce the number of GETATTR calls made to the server to
validate attributes.  Furthermore, frequently accessed files and
directories, such as the current working directory, have their
attributes cached on the client so that some NFS operations can be
performed without having to make an RPC call.
By caching name and inode information about most recently looked up
entries in a Directory Name Lookup Cache (DNLC), clients do not need
to send LOOKUP calls to the server every time these files are
accessed.

This caching approach works reasonably well at reducing network
traffic in many environments.  However, it does not address
environments where there are numerous queries for files that do not
exist.  In these cases of "misses", the client sends requests to the
server in order to provide reasonable application semantics and
promptly detect the creation of new directory entries.  An example of
high miss activity is compilation in software development
environments.  The current behavior of NFS limits its potential
scalability and wide-area sharing effectiveness in these types of
environments.  Other distributed stateful file system architectures
such as AFS and DFS have proven that adding state around directory
contents can greatly reduce network traffic in high-miss
environments.

Delegation of directory contents is an OPTIONAL feature of NFSv4.1.
Directory delegations provide traffic reduction benefits similar to
those of file delegations.  By allowing clients to cache directory
contents (in a read-only fashion) while being notified of changes,
the client can avoid making frequent requests to interrogate the
contents of slowly-changing directories, reducing network traffic and
improving client performance.  Directory delegations can also
simplify the task of determining whether other clients are making
changes to the directory when the client itself is making many
changes to the directory and changes are not serialized.

Directory delegations allow improved namespace cache consistency to
be achieved through delegations and synchronous recalls, in the
absence of notifications.
In addition, if time-based consistency is sufficient, asynchronous
notifications can provide performance benefits for the client, and
possibly the server, under some common operating conditions such as
slowly-changing and/or very large directories.

10.9.2.  Directory Delegation Design

NFSv4.1 introduces the GET_DIR_DELEGATION (Section 18.39) operation
to allow the client to ask for a directory delegation.  The
delegation covers directory attributes and all entries in the
directory.  If either of these changes, the delegation will be
recalled synchronously.  The operation causing the recall will have
to wait until the recall is complete.  Changes to the attributes of
individual directory entries will not cause the delegation to be
recalled.

In addition to asking for delegations, a client can also ask for
notifications for certain events.  These events include changes to
the directory's attributes and/or its contents.  If a client asks for
notification for a certain event, the server will notify the client
when that event occurs.  This will not result in the delegation being
recalled for that client.  The notifications are asynchronous and
provide a way of avoiding recalls in situations where a directory is
changing enough that the pure recall model may not be effective,
while still allowing the client to get substantial benefit.  In the
absence of notifications, once the delegation is recalled the client
has to refresh its directory cache; this might not be very efficient
for very large directories.

The delegation is read-only, and the client may not make changes to
the directory other than by performing NFSv4.1 operations that modify
the directory or the associated file attributes, so that the server
has knowledge of these changes.
In order to keep the client's namespace synchronized with that of the
server, the server will notify the delegation-holding client
(assuming it has requested notifications) of the changes made as a
result of that client's directory-modifying operations.  This is to
avoid any need for that client to send subsequent GETATTR or READDIR
operations to the server.  If a single client is holding the
delegation and that client makes any changes to the directory (i.e.,
the changes are made via operations sent on a session associated with
the client ID holding the delegation), the delegation will not be
recalled.  Multiple clients may hold a delegation on the same
directory, but if any such client modifies the directory, the server
MUST recall the delegation from the other clients, unless those
clients have made provisions to be notified of that sort of
modification.

Delegations can be recalled by the server at any time.  Normally, the
server will recall the delegation when the directory changes in a way
that is not covered by the notification, or when the directory
changes and notifications have not been requested.  If another client
removes the directory for which a delegation has been granted, the
server will recall the delegation.

10.9.3.  Attributes in Support of Directory Notifications

See Section 5.11 for a description of the attributes associated with
directory notifications.

10.9.4.  Directory Delegation Recall

The server will recall the directory delegation by sending a callback
to the client.  It will use the same callback procedure as used for
recalling file delegations.  The server will recall the delegation
when the directory changes in a way that is not covered by the
notification.  However, the server need not recall the delegation if
attributes of an entry within the directory change.
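The recall-versus-notify decision described above can be sketched as
a per-holder dispatch.  This is a non-normative illustration; the
function name, the change-kind strings, and the action labels are
assumptions, and the special case of the modifying client itself
(whose delegation is not recalled) is omitted for brevity:

```python
def on_directory_change(holders, change_kind):
    """Decide, per delegation holder, how to react to a change.

    holders: dict mapping client id -> set of requested notification
             kinds (empty set means the client asked for none).
    """
    actions = {}
    for client, wanted in holders.items():
        if change_kind == "entry_attr_change":
            # Attribute changes of entries *within* the directory need
            # not cause a recall or a notification.
            actions[client] = "none"
        elif change_kind in wanted:
            actions[client] = "notify"   # asynchronous notification
        else:
            actions[client] = "recall"   # synchronous recall
    return actions
```

For example, a client that requested "entry_added" notifications is
notified when an entry is created, while a holder with no matching
notification request has its delegation recalled.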
If the server notices that handing out a delegation for a directory
is causing too many notifications to be sent out, it may decide not
to hand out delegations for that directory and/or recall those
already granted.  If a client tries to remove the directory for which
a delegation has been granted, the server will recall all associated
delegations.

The implementation sections for a number of operations describe
situations in which notification or delegation recall would be
required under some common circumstances.  In this regard, a similar
set of caveats to those listed in Section 10.2 apply.

o  For CREATE, see Section 18.4.4.

o  For LINK, see Section 18.9.4.

o  For OPEN, see Section 18.16.4.

o  For REMOVE, see Section 18.25.4.

o  For RENAME, see Section 18.26.4.

o  For SETATTR, see Section 18.30.4.

10.9.5.  Directory Delegation Recovery

Recovery from client or server restart for state on regular files has
two main goals: avoiding the necessity of breaking application
guarantees with respect to locked files and delivery of updates
cached at the client.  Neither of these goals applies to directories
protected by OPEN_DELEGATE_READ delegations and notifications.  Thus,
no provision is made for reclaiming directory delegations in the
event of client or server restart.  The client can simply establish a
directory delegation in the same fashion as was done initially.

11.  Multi-Server Namespace

NFSv4.1 supports attributes that allow a namespace to extend beyond
the boundaries of a single server.  It is desirable that clients and
servers support construction of such multi-server namespaces.  Use of
such multi-server namespaces is OPTIONAL, however, and for many
purposes, single-server namespaces are perfectly acceptable.
Use of multi-server namespaces can provide many advantages, by
separating a file system's logical position in a namespace from the
(possibly changing) logistical and administrative considerations that
result in particular file systems being located on particular servers
via a single network access path known in advance or determined using
DNS.

11.1.  Terminology

In this section as a whole (i.e., within all of Section 11), the
phrase "client ID" always refers to the 64-bit shorthand identifier
assigned by the server (a clientid4) and never to the structure which
the client uses to identify itself to the server (called an
nfs_client_id4 or client_owner in NFSv4.0 and NFSv4.1, respectively).
The opaque identifier within those structures is referred to as a
"client id string".

11.1.1.  Terminology Related to Trunking

It is particularly important to clarify the distinction between
trunking detection and trunking discovery.  The definitions we
present are applicable to all minor versions of NFSv4, but we will
focus on how these terms apply to NFS version 4.1.

o  Trunking detection refers to ways of deciding whether two specific
   network addresses are connected to the same NFSv4 server.  The
   means available to make this determination depends on the protocol
   version and, in some cases, on the client implementation.

   In the case of NFS version 4.1 and later minor versions, the means
   of trunking detection are as described in this document and are
   available to every client.  Two network addresses connected to the
   same server can always be used together to access a particular
   server but cannot necessarily be used together to access a single
   session.
   See below for definitions of the terms "server-trunkable" and
   "session-trunkable".

o  Trunking discovery is a process by which a client using one
   network address can obtain other addresses that are connected to
   the same server.  Typically, it builds on a trunking detection
   facility by providing one or more methods by which candidate
   addresses are made available to the client, who can then use
   trunking detection to appropriately filter them.

   Despite the support for trunking detection, there was no
   description of trunking discovery provided in RFC 5661 [65],
   making it necessary to provide those means in this document.

The combination of a server network address and a particular
connection type to be used by a connection is referred to as a
"server endpoint".  Although using different connection types may
result in different ports being used, the use of different ports by
multiple connections to the same network address in such cases is not
the essence of the distinction between the two endpoints used.  This
is in contrast to the case of port-specific endpoints, in which the
explicit specification of port numbers within network addresses is
used to allow a single server node to support multiple NFS servers.

Two network addresses connected to the same server are said to be
server-trunkable.  Two such addresses support the use of client ID
trunking, as described in Section 2.10.5.

Two network addresses connected to the same server such that those
addresses can be used to support a single common session are referred
to as session-trunkable.
Note that two addresses may be server-trunkable without being
session-trunkable.  However, when two connections of different
connection types are made to the same network address and are based
on a single file system location entry, they are always
session-trunkable, independent of the connection type, as specified
by Section 2.10.5, since their derivation from the same file system
location entry, together with the identity of their network
addresses, assures that both connections are to the same server and
will return server-owner information allowing session trunking to be
used.

11.1.2.  Terminology Related to File System Location

Regarding terminology relating to the construction of multi-server
namespaces out of a set of local per-server namespaces:

o  Each server has a set of exported file systems which may be
   accessed by NFSv4 clients.  Typically, this is done by assigning
   each file system a name within the pseudo-fs associated with the
   server, although the pseudo-fs may be dispensed with if there is
   only a single exported file system.  Each such file system is part
   of the server's local namespace, and can be considered as a file
   system instance within a larger multi-server namespace.

o  The set of all exported file systems for a given server
   constitutes that server's local namespace.

o  In some cases, a server will have a namespace more extensive than
   its local namespace by using features associated with attributes
   that provide file system location information.  These features,
   which allow construction of a multi-server namespace, are all
   described in individual sections below and include referrals
   (described in Section 11.5.6), migration (described in
   Section 11.5.5), and replication (described in Section 11.5.4).
o  A file system present in a server's pseudo-fs may have multiple
   file system instances on different servers associated with it.
   All such instances are considered replicas of one another.
   Whether such replicas can be used simultaneously is discussed in
   Section 11.11.1, while the level of co-ordination between them
   (important when switching between them) is discussed in Sections
   11.11.2 through 11.11.8 below.

o  When a file system is present in a server's pseudo-fs, but there
   is no corresponding local file system, it is said to be "absent".
   In such cases, all associated instances will be accessed on other
   servers.

Regarding terminology relating to attributes used in trunking
discovery and other multi-server namespace features:

o  File system location attributes include the fs_locations and
   fs_locations_info attributes.

o  File system location entries provide the individual file system
   locations within the file system location attributes.  Each such
   entry specifies a server, in the form of a host name or an
   address, and an fs name, which designates the location of the file
   system within the server's local namespace.  A file system
   location entry designates a set of server endpoints to which the
   client may establish connections.  There may be multiple endpoints
   because a host name may map to multiple network addresses and
   because multiple connection types may be used to communicate with
   a single network address.  However, except where explicit port
   numbers are used to designate a set of servers within a single
   server node, all such endpoints MUST designate a way of connecting
   to a single server.  The exact form of the location entry varies
   with the particular file system location attribute used, as
   described in Section 11.2.
The network addresses used in file system location entries typically appear without port number indications and are used to designate a server at one of the standard ports for NFS access, e.g., 2049 for TCP, or 20049 for use with RPC-over-RDMA. Port numbers may be used in file system location entries to designate servers (typically user-level ones) accessed using other port numbers. In the case where network addresses indicate trunking relationships, use of an explicit port number is inappropriate since trunking is a relationship between network addresses. See Section 11.5.2 for details.

o  File system location elements are derived from location entries, and each describes a particular network access path, consisting of a network address and a location within the server's local namespace. Such location elements need not appear within a file system location attribute, but the existence of each location element derives from a corresponding location entry. When a location entry specifies an IP address, there is only a single corresponding location element. File system location entries that contain a host name are resolved using DNS, and may result in one or more location elements. All location elements consist of a location address, which includes the IP address of an interface to a server, and an fs name, which is the location of the file system within the server's local namespace. The fs name can be empty if the server has no pseudo-fs and only a single exported file system at the root filehandle.

o  Two file system location elements are said to be server-trunkable if they specify the same fs name and the location addresses are such that the location addresses are server-trunkable.
When the corresponding network paths are used, the client will always be able to use client ID trunking, but will only be able to use session trunking if the paths are also session-trunkable.

o  Two file system location elements are said to be session-trunkable if they specify the same fs name and the location addresses are such that the location addresses are session-trunkable. When the corresponding network paths are used, the client will be able to use either client ID trunking or session trunking.

Discussion of the term "replica" is complicated by the fact that the term was used in RFC 5661 [65] with a meaning different from that in this document. In short, in [65] each replica is identified by a single network access path, while in the current document a set of network access paths which have server-trunkable network addresses and the same root-relative file system pathname is considered to be a single replica with multiple network access paths.

Each set of server-trunkable location elements defines a set of available network access paths to a particular file system. When there are multiple such file systems, each of which contains the same data, these file systems are considered replicas of one another. Logically, such replication is symmetric, since the fs currently in use and an alternate fs are replicas of each other. Often, in other documents, the term "replica" is not applied to the fs currently in use, despite the fact that the replication relation is inherently symmetric.

11.2.  File System Location Attributes

NFSv4.1 contains attributes that provide information about how (i.e., at what network address and namespace position) a given file system may be accessed.
As a result, file systems in the namespace of one server can be associated with one or more instances of that file system on other servers. These attributes contain file system location entries specifying a server address target (either as a DNS name representing one or more IP addresses or as a specific IP address) together with the pathname of that file system within the associated single-server namespace.

The fs_locations_info RECOMMENDED attribute allows specification of one or more file system instance locations where the data corresponding to a given file system may be found. This attribute provides to the client, in addition to specification of file system instance locations, other helpful information such as:

o  Information guiding choices among the various file system instances provided (e.g., priority for use, writability, currency, etc.).

o  Information to help the client efficiently effect as seamless a transition as possible among multiple file system instances, when and if that should be necessary.

o  Information helping to guide the selection of the appropriate connection type to be used when establishing a connection.

Within the fs_locations_info attribute, each fs_locations_server4 entry corresponds to a file system location entry, with the fls_server field designating the server and the location pathname within the server's pseudo-fs given by the fl_rootpath field of the encompassing fs_locations_item4.

The fs_locations attribute defined in NFSv4.0 is also a part of NFSv4.1. This attribute only allows specification of the file system locations where the data corresponding to a given file system may be found.
Servers SHOULD make this attribute available whenever fs_locations_info is supported, but client use of fs_locations_info is preferable, as it provides more information.

Within the fs_locations attribute, each fs_location4 contains a file system location entry with the server field designating the server and the rootpath field giving the location pathname within the server's pseudo-fs.

11.3.  File System Presence or Absence

A given location in an NFSv4.1 namespace (typically but not necessarily a multi-server namespace) can have a number of file system instance locations associated with it (via the fs_locations or fs_locations_info attribute). There may also be an actual current file system at that location, accessible via normal namespace operations (e.g., LOOKUP). In this case, the file system is said to be "present" at that position in the namespace, and clients will typically use it, reserving use of additional locations specified via the location-related attributes to situations in which the principal location is no longer available.

When there is no actual file system at the namespace location in question, the file system is said to be "absent". An absent file system contains no files or directories other than the root. Any reference to it, except to access a small set of attributes useful in determining alternate locations, will result in an error, NFS4ERR_MOVED. Note that if the server ever returns the error NFS4ERR_MOVED, it MUST support the fs_locations attribute and SHOULD support the fs_locations_info and fs_status attributes.

While the error name suggests that we have a case of a file system that once was present, and has only become absent later, this is only one possibility.
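The behavior just described, in which references to an absent file system yield NFS4ERR_MOVED unless location-related attributes are requested (detailed further in Section 11.4.1), can be modeled with a small sketch. The class and helper names below are hypothetical stand-ins, not protocol elements; only the error value and the attribute names come from the specification:

```python
# Hypothetical model of GETATTR handling on an absent file system.
# Class and variable names are illustrative only; the error value and
# attribute names are from the NFSv4.1 specification.
NFS4ERR_MOVED = 10019

LOCATION_ATTRS = {"fs_locations", "fs_locations_info", "fs_status"}

class FakeAbsentFs:
    """Models an absent file system that supports only a small attribute set."""

    def __init__(self, locations):
        # Only location-related and boundary-related attributes are kept.
        self.attrs = {"fs_locations": locations, "fsid": (17, 0)}

    def getattr(self, requested):
        # If no location-related attribute is requested, the operation
        # fails with NFS4ERR_MOVED.
        if not (set(requested) & LOCATION_ATTRS):
            return NFS4ERR_MOVED, {}
        # Otherwise, return only those requested attributes actually
        # supported; unsupported bits are omitted from the result mask
        # rather than causing an error.
        found = {a: self.attrs[a] for a in requested if a in self.attrs}
        return 0, found

fs = FakeAbsentFs(locations=["serverB:/export/data"])

status, _ = fs.getattr(["size", "mode"])          # no location attrs requested
assert status == NFS4ERR_MOVED

status, attrs = fs.getattr(["fs_locations", "size"])
assert status == 0 and "size" not in attrs        # size not available here
```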
A position in the namespace may be permanently absent, with the set of file system(s) designated by the location attributes being the only realization. The name NFS4ERR_MOVED reflects an earlier, more limited conception of its function, but this error will be returned whenever the referenced file system is absent, whether it has moved or not.

Except in the case of GETATTR-type operations (to be discussed later), when the current filehandle at the start of an operation is within an absent file system, that operation is not performed and the error NFS4ERR_MOVED is returned, to indicate that the file system is absent on the current server.

Because a GETFH cannot succeed if the current filehandle is within an absent file system, filehandles within an absent file system cannot be transferred to the client. When a client does have filehandles within an absent file system, it is the result of obtaining them when the file system was present, and having the file system become absent subsequently.

It should be noted that because the check for the current filehandle being within an absent file system happens at the start of every operation, operations that change the current filehandle so that it is within an absent file system will not result in an error. This allows such combinations as PUTFH-GETATTR and LOOKUP-GETATTR to be used to get attribute information, particularly location attribute information, as discussed below.

The RECOMMENDED file system attribute fs_status can be used to interrogate the present/absent status of a given file system.

11.4.  Getting Attributes for an Absent File System

When a file system is absent, most attributes are not available, but it is necessary to allow the client access to the small set of attributes that are available, and most particularly those that give information about the correct current locations for this file system: fs_locations and fs_locations_info.

11.4.1.  GETATTR within an Absent File System

As mentioned above, an exception is made for GETATTR in that attributes may be obtained for a filehandle within an absent file system. This exception only applies if the attribute mask contains at least one attribute bit that indicates the client is interested in a result regarding an absent file system: fs_locations, fs_locations_info, or fs_status. If none of these attributes is requested, GETATTR will result in an NFS4ERR_MOVED error.

When a GETATTR is done on an absent file system, the set of supported attributes is very limited. Many attributes, including those that are normally REQUIRED, will not be available on an absent file system. In addition to the attributes mentioned above (fs_locations, fs_locations_info, fs_status), the following attributes SHOULD be available on absent file systems. In the case of RECOMMENDED attributes, they should be available at least to the same degree that they are available on present file systems.

change_policy:  This attribute is useful for absent file systems and can be helpful in summarizing to the client when any of the location-related attributes change.

fsid:  This attribute should be provided so that the client can determine file system boundaries, including, in particular, the boundary between present and absent file systems.
This value must be different from any other fsid on the current server and need have no particular relationship to fsids on any particular destination to which the client might be directed.

mounted_on_fileid:  For objects at the top of an absent file system, this attribute needs to be available. Since the fileid is within the present parent file system, there should be no need to reference the absent file system to provide this information.

Other attributes SHOULD NOT be made available for absent file systems, even when it is possible to provide them. The server should not assume that more information is always better and should avoid gratuitously providing additional information.

When a GETATTR operation includes a bit mask for one of the attributes fs_locations, fs_locations_info, or fs_status, but the bit mask also includes attributes that are not supported, GETATTR will not return an error; rather, it will return the mask of the attributes actually supported along with the results.

Handling of VERIFY/NVERIFY is similar to GETATTR in that if the attribute mask does not include fs_locations, fs_locations_info, or fs_status, the error NFS4ERR_MOVED will result. It differs in that any appearance in the attribute mask of an attribute not supported for an absent file system (and note that this will include some normally REQUIRED attributes) will also cause an NFS4ERR_MOVED result.

11.4.2.  READDIR and Absent File Systems

A READDIR performed when the current filehandle is within an absent file system will result in an NFS4ERR_MOVED error, since, unlike the case of GETATTR, no such exception is made for READDIR.

Attributes for an absent file system may be fetched via a READDIR for a directory in a present file system, when that directory contains the root directories of one or more absent file systems.
In this case, the handling is as follows:

o  If the attribute set requested includes one of the attributes fs_locations, fs_locations_info, or fs_status, then fetching of attributes proceeds normally and no NFS4ERR_MOVED indication is returned, even when the rdattr_error attribute is requested.

o  If the attribute set requested does not include one of the attributes fs_locations, fs_locations_info, or fs_status, then if the rdattr_error attribute is requested, each directory entry for the root of an absent file system will report NFS4ERR_MOVED as the value of the rdattr_error attribute.

o  If the attribute set requested does not include any of the attributes fs_locations, fs_locations_info, fs_status, or rdattr_error, then the occurrence of the root of an absent file system within the directory will result in the READDIR failing with an NFS4ERR_MOVED error.

o  The unavailability of an attribute because of a file system's absence, even one that is ordinarily REQUIRED, does not result in any error indication. The set of attributes returned for the root directory of the absent file system in that case is simply restricted to those actually available.

11.5.  Uses of File System Location Information

The file system location attributes (i.e., fs_locations and fs_locations_info), together with the possibility of absent file systems, provide a number of important facilities for reliable, manageable, and scalable data access.

When a file system is present, these attributes can provide:

o  The locations of alternative replicas, to be used to access the same data in the event of server failures, communications problems, or other difficulties that make continued access to the current replica impossible or otherwise impractical.
Provision and use of such alternate replicas is referred to as "replication" and is discussed in Section 11.5.4 below.

o  The network address(es) to be used to access the current file system instance or replicas of it. Client use of this information is discussed in Section 11.5.2 below.

Under some circumstances, multiple replicas may be used simultaneously to provide higher-performance access to the file system in question, although the lack of state sharing between servers may be an impediment to such use.

When a file system is present and becomes absent, clients can be given the opportunity to have continued access to their data, using a different replica. In this case, a continued attempt to use the data in the now-absent file system will result in an NFS4ERR_MOVED error and, at that point, the successor replica or set of possible replica choices can be fetched and used to continue access. Transfer of access to the new replica location is referred to as "migration", and is discussed in Section 11.5.5 below.

Where a file system is currently absent, specification of file system location provides a means by which file systems located on one server can be associated with a namespace defined by another server, thus allowing a general multi-server namespace facility. A designation of such a remote instance, in place of a file system not previously present, is called a "pure referral" and is discussed in Section 11.5.6 below.

Because client support for attributes related to file system location is OPTIONAL, a server may choose to take action to hide migration and referral events from such clients, by acting as a proxy, for example. The server can determine the presence of client support from the arguments of the EXCHANGE_ID operation (see Section 18.35.3).

11.5.1.  Combining Multiple Uses in a Single Attribute

A file system location attribute will sometimes contain information relating to the location of multiple replicas, which may be used in different ways:

o  File system location entries that relate to the file system instance currently in use provide trunking information, allowing the client to find additional network addresses by which the instance may be accessed.

o  Other file system location entries provide information about replicas to which access is to be transferred.

o  Still other file system location entries relate to replicas that are available to use in the event that access to the current replica becomes unsatisfactory.

In order to simplify client handling and to allow the best choice of replicas to access, the server should adhere to the following guidelines:

o  All file system location entries that relate to a single file system instance should be adjacent.

o  File system location entries that relate to the instance currently in use should appear first.

o  File system location entries that relate to replica(s) to which migration is occurring should appear before replicas that are available for later use if the current replica should become inaccessible.

11.5.2.  File System Location Attributes and Trunking

Trunking is the use of multiple connections between a client and server in order to increase the speed of data transfer. A client may determine the set of network addresses to use to access a given file system in a number of ways:

o  When the name of the server is known to the client, it may use DNS to obtain a set of network addresses to use in accessing the server.

o  The client may fetch the file system location attribute for the file system.
This will provide either the name of the server (which can be turned into a set of network addresses using DNS) or a set of server-trunkable location entries. Using the latter alternative, the server can provide addresses it regards as desirable to use to access the file system in question. Although these entries can contain port numbers, those port numbers are not used in determining trunking relationships. Once the candidate addresses have been determined and an EXCHANGE_ID has been done to the proper server, only the value of the so_major_id field returned by the servers in question determines whether a trunking relationship actually exists.

It should be noted that the client, when it fetches a location attribute for a file system, may encounter multiple entries for a number of reasons, so that, when determining trunking information, it may have to bypass addresses not trunkable with one already known.

The server can provide location entries that include either names or network addresses. It might use the latter form because of DNS-related security concerns or because the set of addresses to be used might require active management by the server.

Location entries used to discover candidate addresses for use in trunking are subject to change, as discussed in Section 11.5.7 below. The client may respond to such changes by using additional addresses once they are verified or by ceasing to use existing ones. The server can force the client to cease using an address by returning NFS4ERR_MOVED when that address is used to access a file system. This allows a transfer of client access which is similar to migration, although the same file system instance is accessed throughout.

11.5.3.  File System Location Attributes and Connection Type Selection

Because of the need to support multiple types of connections, clients face the issue of determining the proper connection type to use when establishing a connection to a given server network address. In some cases, this issue can be addressed through the use of the connection "step-up" facility described in Section 18.36. However, because there are cases in which that facility is not available, the client may have to choose a connection type with no possibility of changing it within the scope of a single connection.

The two file system location attributes differ as to the information made available in this regard. Fs_locations provides no information to support connection type selection. As a result, clients supporting multiple connection types would need to attempt to establish connections using multiple connection types until the one preferred by the client is successfully established.

Fs_locations_info includes a flag, FSLI4TF_RDMA, which, when set, indicates that RPC-over-RDMA support is available using the specified location entry, by "stepping up" an existing TCP connection to include support for RDMA operation. This flag makes it convenient for a client wishing to use RDMA: when this flag is set, it can establish a TCP connection and then convert that connection to use RDMA by using the step-up facility.

Irrespective of the particular attribute used, when there is no indication that a step-up operation can be performed, a client supporting RDMA operation can establish a new RDMA connection, and it can be bound to the session already established by the TCP connection, allowing the TCP connection to be dropped and the session converted to further use in RDMA mode, if the server supports that.

11.5.4.  File System Replication

The fs_locations and fs_locations_info attributes provide alternative file system locations, to be used to access data in place of or in addition to the current file system instance. On first access to a file system, the client should obtain the set of alternate locations by interrogating the fs_locations or fs_locations_info attribute, with the latter being preferred.

In the event that server failures, communications problems, or other difficulties make continued access to the current file system impossible or otherwise impractical, the client can use the alternate locations as a way to get continued access to its data.

The alternate locations may be physical replicas of the (typically read-only) file system data supplemented by possible asynchronous propagation of updates. Alternatively, they may provide for the use of various forms of server clustering in which multiple servers provide alternate ways of accessing the same physical file system. How the difference between replicas affects file system transitions can be represented within the fs_locations and fs_locations_info attributes, and how the client deals with file system transition issues will be discussed in detail in later sections.

Although the location attributes provide some information about the nature of the inter-replica transition, many aspects of the semantics of possible asynchronous updates are not currently described by the protocol, making it necessary that clients using replication to switch among replicas undergoing change familiarize themselves with the semantics of the update approach used.
Because of this lack of specificity, many applications may find use of migration more appropriate, since, in that case, the server, when effecting the transition, has established a point in time such that all updates made before that point can be propagated to the new replica as part of the migration event.

11.5.4.1.  File System Trunking Presented as Replication

In some situations, a file system location entry may indicate a file system access path to be used as an alternate location, where trunking, rather than replication, is to be used. The situations in which this is appropriate are limited to those in which both of the following are true:

o  The two file system locations (i.e., the one on which the location attribute is obtained and the one specified in the file system location entry) designate the same locations within their respective single-server namespaces.

o  The two server network addresses (i.e., the one being used to obtain the location attribute and the one specified in the file system location entry) designate the same server (as indicated by the same value of the so_major_id field of the eir_server_owner field returned in response to EXCHANGE_ID).

When these conditions hold, operations using both access paths are generally trunked, although, when the attribute fs_locations_info is used, trunking may be disallowed:

o  When the fs_locations_info attribute shows the two entries as not having the same simultaneous-use class, trunking is inhibited and the two access paths cannot be used together.

In this case, the two paths can be used serially with no transition activity required on the part of the client.
In this case, any transition between access paths is transparent, and the client, in transferring access from one to the other, is acting as it would in the event that communication is interrupted, with a new connection and possibly a new session being established to continue access to the same file system.

o  Note that for two such location entries, any information within the fs_locations_info attribute that indicates the need for special transition activity, i.e., the appearance of the two file system location entries with different handle, fileid, write-verifier, change, and readdir classes, indicates a serious problem. The client, if it allows transition to the file system instance at all, must not treat any transition as a transparent one. The server SHOULD NOT indicate that these two entries (for the same file system on the same server) belong to different handle, fileid, write-verifier, change, and readdir classes, whether or not the two entries are shown as belonging to the same simultaneous-use class.

These situations were recognized by [65], even though that document made no explicit mention of trunking:

o  It treated the situation that we describe as trunking as one of simultaneous use of two distinct file system instances, even though, in the explanatory framework now used to describe the situation, the case is one in which a single file system is accessed by two different trunked addresses.

o  It treated the situation in which two paths are to be used serially as a special sort of "transparent transition". However, in the descriptive framework now used to categorize transition situations, this is considered a case of a "network endpoint transition" (see Section 11.9).

11.5.5.  File System Migration

When a file system is present and becomes inaccessible using the current access path, the NFSv4.1 protocol provides a means by which clients can be given the opportunity to have continued access to their data. This may involve use of a different access path to the existing replica or provision of a path to a different replica. The new access path or the location of the new replica is specified by a file system location attribute. The ensuing migration of access includes the ability to retain locks across the transition. Depending on circumstances, this can involve:

o  The continued use of the existing client ID when accessing the current replica using a new access path.

o  Use of lock reclaim, taking advantage of a per-fs grace period.

o  Use of Transparent State Migration.

Typically, a client will be accessing the file system in question, get an NFS4ERR_MOVED error, and then use a file system location attribute to determine the new access path for the data. When fs_locations_info is used, additional information will be available that will define the nature of the client's handling of the transition to a new server.

In most instances, servers will choose to migrate all clients using a particular file system to a successor replica at the same time, to avoid cases in which different clients are updating different replicas. However, migration of individual clients can be helpful in providing load balancing, as long as the replicas in question are such that they represent the same data, as described in Section 11.11.8:

o  In the case in which there is no transition between replicas (i.e., only a change in access path), there are no special difficulties in using this mechanism to effect load balancing.
o  In the case in which the two replicas are sufficiently co-ordinated as to allow coherent simultaneous access to both by a single client, there is, in general, no obstacle to use of migration of particular clients to effect load balancing. Generally, such simultaneous use involves co-operation between servers to ensure that locks granted on two co-ordinated replicas cannot conflict and can remain effective when transferred to a common replica.

o  In the case in which a large set of clients is accessing a file system in a read-only fashion, it can be helpful to migrate all clients with writable access simultaneously, while using load balancing on the set of read-only copies, as long as the rules appearing in Section 11.11.8, designed to prevent data reversion, are adhered to.

In other cases, the client might not have sufficient guarantees of data similarity/coherence to function properly (e.g., the data in the two replicas is similar but not identical), and the possibility that different clients are updating different replicas can exacerbate the difficulties, making use of load balancing in such situations a perilous enterprise.

The protocol does not specify how the file system will be moved between servers or how updates to multiple replicas will be co-ordinated. It is anticipated that a number of different server-to-server co-ordination mechanisms might be used, with the choice left to the server implementer. The NFSv4.1 protocol specifies the method used to communicate the migration event between client and server.

The new location may be, in the case of various forms of server clustering, another server providing access to the same physical file system.
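The typical client reaction to migration described above (receive NFS4ERR_MOVED, fetch a location attribute, and move to a successor location, trying later entries if earlier ones are inaccessible) might be sketched as follows. The helper names and the "server:/path" string format for location entries are illustrative assumptions, not protocol elements:

```python
# Hypothetical sketch of client-side migration recovery.  Location
# entries are modeled as "server:/path" strings; a real client would use
# the fs_locations/fs_locations_info XDR structures.
NFS4ERR_MOVED = 10019

class Unreachable(Exception):
    """Raised by connect() when a candidate server cannot be contacted."""

def recover_from_moved(locations, connect):
    """Try successor locations in the order the server provided them.

    `connect` is a caller-supplied function that either returns a usable
    connection object or raises Unreachable.
    """
    for loc in locations:
        try:
            return loc, connect(loc)
        except Unreachable:
            # If the first location is inaccessible, later ones can be
            # used; the client may treat this as a migration event and
            # might need to re-establish locking state afterwards.
            continue
    raise RuntimeError("no replica in the location attribute was reachable")

# Example: the first candidate is down, so the second is used.
down = {"serverA:/export/data"}

def connect(loc):
    if loc in down:
        raise Unreachable(loc)
    return f"connection-to-{loc}"

chosen, conn = recover_from_moved(
    ["serverA:/export/data", "serverB:/export/data"], connect)
assert chosen == "serverB:/export/data"
```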
The client's responsibilities in dealing with this transition will depend on whether a switch between replicas has occurred and the means the server has chosen to provide continuity of locking state. These issues will be discussed in detail below.

Although a single successor location is typical, multiple locations may be provided. When multiple locations are provided, the client will typically use the first one provided. If that is inaccessible for some reason, later ones can be used. In such cases, the client might consider the transition to the new replica to be a migration event, even though some of the servers involved might not be aware of the use of the server which was inaccessible. In such a case, a client might lose access to locking state as a result of the access transfer.

When an alternate location is designated as the target for migration, it must designate the same data (with metadata being the same to the degree indicated by the fs_locations_info attribute). Where file systems are writable, a change made on the original file system must be visible on all migration targets. Where a file system is not writable but represents a read-only copy (possibly periodically updated) of a writable file system, similar requirements apply to the propagation of updates. Any change visible in the original file system must already be effected on all migration targets, to avoid any possibility that a client, in effecting a transition to the migration target, will see any reversion in file system state.

11.5.6.  Referrals

Referrals allow the server to associate a file system namespace entry located on one server with a file system located on another server.
When this includes the use of pure referrals, servers are provided a way of placing a file system in a location within the namespace essentially without respect to its physical location on a particular server. This allows a single server or a set of servers to present a multi-server namespace that encompasses file systems located on a wider range of servers. Some likely uses of this facility include establishment of site-wide or organization-wide namespaces, with the eventual possibility of combining such namespaces together into a truly global namespace, such as the one provided by AFS (the Andrew File System) [64].

Referrals occur when a client determines, upon first referencing a position in the current namespace, that it is part of a new file system and that the file system is absent. When this occurs, typically upon receiving the error NFS4ERR_MOVED, the actual location or locations of the file system can be determined by fetching a locations attribute.

The file system location attribute may designate a single file system location or multiple file system locations, to be selected based on the needs of the client. The server, in the fs_locations_info attribute, may specify priorities to be associated with various file system location choices. The server may assign different priorities to different locations as reported to individual clients, in order to adapt to client physical location or to effect load balancing. When both read-only and read-write file systems are present, some of the read-only locations might not be absolutely up-to-date (as they would have to be in the case of replication and migration).
Servers may also specify file system locations that include client-substituted variables so that different clients are referred to different file systems (with different data contents) based on client attributes such as CPU architecture.

When the fs_locations_info attribute is such that there are multiple possible targets listed, the relationships among them may be important to the client in selecting which one to use. The same rules specified in Section 11.5.5 above regarding multiple migration targets apply to these multiple replicas as well. For example, the client might prefer a writable target on a server that has additional writable replicas to which it subsequently might switch. Note that, as distinguished from the case of replication, there is no need to deal with the case of propagation of updates made by the current client, since the current client has not accessed the file system in question.

Use of multi-server namespaces is enabled by NFSv4.1 but is not required. The use of multi-server namespaces and their scope will depend on the applications used and system administration preferences.

Multi-server namespaces can be established by a single server providing a large set of pure referrals to all of the included file systems. Alternatively, a single multi-server namespace may be administratively segmented with separate referral file systems (on separate servers) for each separately administered portion of the namespace. The top-level referral file system or any segment may use replicated referral file systems for higher availability.

Generally, multi-server namespaces are for the most part uniform, in that the same data made available to one client at a given location in the namespace is made available to all clients at that namespace location.
However, there are facilities provided that allow different clients to be directed to different sets of data, for reasons such as enabling adaptation to such client characteristics as CPU architecture. These facilities are described in Section 11.17.3.

Note that it is possible, when providing a uniform namespace, to provide different location entries to different clients, in order to provide each client with a copy of the data physically closest to it, or otherwise optimize access (e.g., provide load balancing).

11.5.7. Changes in a File System Location Attribute

Although clients will typically fetch a file system location attribute when first accessing a file system and when NFS4ERR_MOVED is returned, a client can choose to fetch the attribute periodically, in which case the value fetched may change over time.

For clients not prepared to access multiple replicas simultaneously (see Section 11.11.1), the handling of the various cases of location change is as follows:

o Changes in the list of replicas or in the network addresses associated with replicas do not require immediate action. The client will typically update its list of replicas to reflect the new information.

o Additions to the list of network addresses for the current file system instance need not be acted on promptly. However, to prepare for the case in which a migration event occurs subsequently, the client can choose to take note of the new address and then use it whenever it needs to switch access to a new replica.

o Deletions from the list of network addresses for the current file system instance do not need to be acted on immediately by ceasing use of existing access paths, although new connections are not to be established on addresses that have been deleted.
However, clients can choose to act on such deletions by making preparations for an eventual shift in access, which would become unavoidable as soon as the server indicates that a particular network access path is not usable to access the current file system, by returning NFS4ERR_MOVED.

For clients that are prepared to access several replicas simultaneously, the following additional cases need to be addressed. As in the cases discussed above, changes in the set of replicas need not be acted upon promptly, although the client has the option of adjusting its access even in the absence of difficulties that would lead to a new replica being selected.

o When a new replica is added which may be accessed simultaneously with one currently in use, the client is free to use the new replica immediately.

o When a replica currently in use is deleted from the list, the client need not cease using it immediately. However, since the server may subsequently force such use to cease (by returning NFS4ERR_MOVED), clients might decide to limit the need for later state transfer. For example, new opens might be done on other replicas, rather than on one not present in the list.

11.6. Trunking without File System Location Information

In situations in which a file system is accessed using two server-trunkable addresses (as indicated by the same value of the so_major_id field of the eir_server_owner field returned in response to EXCHANGE_ID), trunked access is allowed even though there might not be any location entries specifically indicating the use of trunking for that file system.
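The trunking determination described above can be sketched in client-side logic as follows. This is an illustrative sketch, not protocol text: the ExchangeIdResult type and field names are simplified stand-ins for the eir_server_owner and eir_server_scope values actually returned by EXCHANGE_ID.

```python
# Sketch (assumption: simplified stand-in types, not the real XDR
# structures): classifying two EXCHANGE_ID results by comparing the
# eir_server_owner fields, per the trunking rules summarized above.
from dataclasses import dataclass

@dataclass
class ExchangeIdResult:
    so_major_id: bytes   # eir_server_owner.so_major_id
    so_minor_id: int     # eir_server_owner.so_minor_id
    server_scope: bytes  # eir_server_scope

def trunking_relation(a: ExchangeIdResult, b: ExchangeIdResult) -> str:
    """Return 'session-trunkable', 'server-trunkable', or 'not-trunkable'."""
    if a.so_major_id != b.so_major_id or a.server_scope != b.server_scope:
        return "not-trunkable"
    # Same major ID and scope: both addresses reach the same server.
    if a.so_minor_id == b.so_minor_id:
        return "session-trunkable"   # connections may share one session
    return "server-trunkable"        # same server, but separate sessions
```

As the paragraph above notes, a "server-trunkable" result is sufficient to allow trunked access to the file system even when no location entry mentions the second address.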
This situation was recognized by [65], even though that document made no explicit mention of trunking and treated the situation as one of simultaneous use of two distinct file system instances; in the explanatory framework now used to describe the situation, the case is one in which a single file system is accessed by two different trunked addresses.

11.7. Users and Groups in a Multi-server Namespace

As in the case of a single-server environment (see Section 5.9), when an owner or group name of the form "id@domain" is assigned to a file, there is an implicit promise to return that same string when the corresponding attribute is interrogated subsequently. In the case of a multi-server namespace, that same promise applies even if server boundaries have been crossed. Similarly, when the owner attribute of a file is derived from the security principal which created the file, that attribute should have the same value even if the interrogation occurs on a different server from the file creation.

Similarly, the set of security principals recognized by all the participating servers needs to be the same, with each such principal having the same credentials, regardless of the particular server being accessed.

In order to meet these requirements, those setting up multi-server namespaces will need to limit the servers included so that:

o In all cases in which more than a single domain is supported, the requirements stated in RFC 8000 [31] are to be respected.

o All servers support a common set of domains which includes all of the domains clients use and expect to see returned as the domain portion of an owner or group in the form "id@domain". Note that although this set most often consists of a single domain, it is possible for multiple domains to be supported.
o All servers, for each domain that they support, accept the same set of user and group ids as valid.

o All servers recognize the same set of security principals. For each principal, the same credential is required, independent of the server being accessed. In addition, the group membership for each such principal is to be the same, independent of the server accessed.

Note that there is no requirement in general that the users corresponding to particular security principals have the same local representation on each server, even though it is most often the case that this is so.

When AUTH_SYS is used, the following additional requirements must be met:

o Only a single NFSv4 domain can be supported through use of AUTH_SYS.

o The "local" representation of all owners and groups must be the same on all servers. The word "local" is used here since that is the way that numeric user and group ids are described in Section 5.9. However, when AUTH_SYS or stringified numeric owners or groups are used, these identifiers are not truly local, since they are known to the clients as well as the server.

Similarly, when stringified numeric user and group ids are used, the "local" representation of all owners and groups must be the same on all servers, even when AUTH_SYS is not used.

11.8. Additional Client-Side Considerations

When clients make use of servers that implement referrals, replication, and migration, care should be taken that a user who mounts a given file system that includes a referral or a relocated file system continues to see a coherent picture of that user-side file system despite the fact that it contains a number of server-side file systems that may be on different servers.
One important issue is upward navigation from the root of a server-side file system to its parent (specified as ".." in UNIX), in the case in which it transitions to that file system as a result of referral, migration, or a transition as a result of replication. When the client is at such a point, and it needs to ascend to the parent, it must go back to the parent as seen within the multi-server namespace rather than sending a LOOKUPP operation to the server, which would result in the parent within that server's single-server namespace. In order to do this, the client needs to remember the filehandles that represent such file system roots and use these instead of sending a LOOKUPP operation to the current server. This will allow the client to present to applications a consistent namespace, where upward navigation and downward navigation are consistent.

Another issue concerns refresh of referral locations. When referrals are used extensively, they may change as server configurations change. It is expected that clients will cache information related to traversing referrals so that future client-side requests are resolved locally without server communication. This is usually rooted in client-side name lookup caching. Clients should periodically purge this data for referral points in order to detect changes in location information. When the change_policy attribute changes for directories that hold referral entries or for the referral entries themselves, clients should consider any associated cached referral information to be out of date.

11.9. Overview of File Access Transitions

File access transitions are of two types:

o Those that involve a transition from accessing the current replica to another one in connection with either replication or migration.
How these are dealt with is discussed in Section 11.11.

o Those in which access to the current file system instance is retained, while the network path used to access that instance is changed. This case is discussed in Section 11.10.

11.10. Effecting Network Endpoint Transitions

The endpoints used to access a particular file system instance may change in a number of ways, as listed below. In each of these cases, the same fsid, filehandles, stateids, and client IDs are used to continue access, with a continuity of lock state. In many cases, the same sessions can also be used.

The appropriate action depends on the set of replacement addresses (i.e., server endpoints which are server-trunkable with one previously being used) which are available for use.

o When use of a particular address is to cease and there is also another one currently in use which is server-trunkable with it, requests that would have been issued on the address whose use is to be discontinued can be issued on the remaining address(es). When an address is server-trunkable but not session-trunkable with the address whose use is to be discontinued, the request might need to be modified to reflect the fact that a different session will be used.

o When use of a particular connection is to cease, as indicated by receiving NFS4ERR_MOVED when using that connection, but that address is still indicated as accessible according to the appropriate file system location entries, it is likely that requests can be issued on a new connection of a different connection type, once that connection is established.
Since any two non-port-specific server endpoints that share a network address are inherently session-trunkable, the client can use BIND_CONN_TO_SESSION to access the existing session using the new connection and proceed to access the file system using the new connection.

o When there are no potential replacement addresses in use but there are valid addresses session-trunkable with the one whose use is to be discontinued, the client can use BIND_CONN_TO_SESSION to access the existing session using the new address. Although the target session will generally be accessible, there may be rare situations in which that session is no longer accessible when an attempt is made to bind the new connection to it. In this case, the client can create a new session to enable continued access to the existing instance using the new connection, providing for use of existing filehandles, stateids, and client ids while providing continuity of locking state.

o When there is no potential replacement address in use and there are no valid addresses session-trunkable with the one whose use is to be discontinued, other server-trunkable addresses may be used to provide continued access. Although use of CREATE_SESSION is available to provide continued access to the existing instance, servers have the option of providing continued access to the existing session through the new network access path in a fashion similar to that provided by session migration (see Section 11.12). To take advantage of this possibility, clients can perform an initial BIND_CONN_TO_SESSION, as in the previous case, and use CREATE_SESSION only if that fails.

11.11. Effecting File System Transitions

There are a range of situations in which there is a change to be effected in the set of replicas used to access a particular file system.
Some of these may involve an expansion or contraction of the set of replicas used, as discussed in Section 11.11.1 below.

For reasons explained in that section, most transitions will involve a transition from a single replica to a corresponding replacement replica. When effecting replica transition, some types of sharing between the replicas may affect handling of the transition, as described in Sections 11.11.2 through 11.11.8 below. The attribute fs_locations_info provides helpful information to allow the client to determine the degree of inter-replica sharing.

With regard to some types of state, the degree of continuity across the transition depends on the occasion prompting the transition, with transitions initiated by the servers (i.e., migration) offering much more scope for a non-disruptive transition than cases in which the client on its own shifts its access to another replica (i.e., replication). This issue potentially applies to locking state and to session state, which are dealt with below as follows:

o An introduction to the possible means of providing continuity in these areas appears in Section 11.11.9 below.

o Transparent State Migration is introduced in Section 11.12. The possible transfer of session state is addressed there as well.

o The client handling of transitions, including determining how to deal with the various means that the server might take to supply effective continuity of locking state, is discussed in Section 11.13.

o The servers' (source and destination) responsibilities in effecting Transparent State Migration of locking and session state are discussed in Section 11.14.

11.11.1. File System Transitions and Simultaneous Access

The fs_locations_info attribute (described in Section 11.17) may indicate that two replicas may be used simultaneously, although some situations in which such simultaneous access is permitted are more appropriately described as instances of trunking (see Section 11.5.4.1). Although situations in which multiple replicas may be accessed simultaneously are somewhat similar to those in which a single replica is accessed by multiple network addresses, there are important differences, since locking state is not shared among multiple replicas.

Because of this difference in state handling, many clients will not have the ability to take advantage of the fact that such replicas represent the same data. Such clients will not be prepared to use multiple replicas simultaneously but will access each file system using only a single replica, although the replica selected might make multiple server-trunkable addresses available.

Clients that are prepared to use multiple replicas simultaneously will divide opens among replicas however they choose. Once that choice is made, any subsequent transitions will treat the set of locking state associated with each replica as a single entity.

For example, if one of the replicas becomes unavailable, access will be transferred to a different replica, also capable of simultaneous access with the one still in use.

When there is no such replica, the transition may be to the replica already in use. At this point, the client has a choice between merging the locking state for the two replicas under the aegis of the sole replica in use or treating these separately, until another replica capable of simultaneous access presents itself.

11.11.2. Filehandles and File System Transitions

There are a number of ways in which filehandles can be handled across a file system transition. These can be divided into two broad classes depending upon whether the two file systems across which the transition happens share sufficient state to effect some sort of continuity of file system handling.

When there is no such cooperation in filehandle assignment, the two file systems are reported as being in different handle classes. In this case, all filehandles are assumed to expire as part of the file system transition. Note that this behavior does not depend on the fh_expire_type attribute and supersedes the specification of the FH4_VOL_MIGRATION bit, which only affects behavior when fs_locations_info is not available.

When there is cooperation in filehandle assignment, the two file systems are reported as being in the same handle class. In this case, persistent filehandles remain valid after the file system transition, while volatile filehandles (excluding those that are only volatile due to the FH4_VOL_MIGRATION bit) are subject to expiration on the target server.

11.11.3. Fileids and File System Transitions

In NFSv4.0, the issue of continuity of fileids in the event of a file system transition was not addressed. The general expectation had been that in situations in which the two file system instances are created by a single vendor using some sort of file system image copy, fileids would be consistent across the transition, while in the analogous multi-vendor transitions they would not. This poses difficulties, especially for the client without special knowledge of the transition mechanisms adopted by the server. Note that although fileid is not a REQUIRED attribute, many servers support fileids and many clients provide APIs that depend on fileids.
It is important to note that while clients themselves may have no trouble with a fileid changing as a result of a file system transition event, applications do typically have access to the fileid (e.g., via stat). The result is that an application may work perfectly well if there is no file system instance transition or if any such transition is among instances created by a single vendor, yet be unable to deal with the situation in which a multi-vendor transition occurs at the wrong time.

Providing the same fileids in a multi-vendor (multiple server vendors) environment has generally been held to be quite difficult. While there is work to be done, it needs to be pointed out that this difficulty is partly self-imposed. Servers have typically identified fileid with inode number, i.e., with a quantity used to find the file in question. This identification poses special difficulties for migration of a file system between vendors where assigning the same index to a given file may not be possible. Note here that a fileid is not required to be useful to find the file in question, only that it is unique within the given file system. Servers prepared to accept a fileid as a single piece of metadata and store it apart from the value used to index the file information can relatively easily maintain a fileid value across a migration event, allowing a truly transparent migration event.

In any case, where servers can provide continuity of fileids, they should, and the client should be able to find out that such continuity is available and take appropriate action. Information about the continuity (or lack thereof) of fileids across a file system transition is represented by specifying whether the file systems in question are of the same fileid class.
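The fileid-class rule above can be sketched as a cache-management decision on the client. This is an illustrative helper, not protocol text; the boolean inputs stand in for information the client would derive from fs_locations_info and attribute support, and the cache shape is hypothetical.

```python
# Sketch (assumption: simplified inputs, not real attribute structures):
# deciding whether fileids cached before a file system transition remain
# meaningful afterwards, based on whether the two instances are reported
# as belonging to the same fileid class.

def retain_cached_fileids(same_fileid_class: bool,
                          fileid_supported_on_both: bool,
                          cached_fileids: dict) -> dict:
    """Return the fileid cache to carry across the transition.

    Continuity may be assumed only when both instances support the
    fileid attribute and are reported as the same fileid class;
    otherwise all cached fileids must be discarded.
    """
    if fileid_supported_on_both and same_fileid_class:
        return cached_fileids
    return {}
```

A client following this rule avoids handing applications (e.g., via stat) fileids whose continuity the server has not vouched for.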
Note that when consistent fileids do not exist across a transition (either because there is no continuity of fileids or because fileid is not a supported attribute on one of the instances involved), and there are no reliable filehandles across a transition event (either because there is no filehandle continuity or because the filehandles are volatile), the client is in a position where it cannot verify that the files it was accessing before the transition are the same objects. It is forced to assume that no object has been renamed, and, unless there are guarantees that provide this (e.g., the file system is read-only), problems for applications may occur. Therefore, use of such configurations should be limited to situations where the problems that this may cause can be tolerated.

11.11.4. Fsids and File System Transitions

Since fsids are generally only unique on a per-server basis, it is likely that they will change during a file system transition. Clients should not make the fsids received from the server visible to applications, since they may not be globally unique and because they may change during a file system transition event. Applications are best served if they are isolated from such transitions to the extent possible.

Although normally a single source file system will transition to a single target file system, there is a provision for splitting a single source file system into multiple target file systems, by specifying the FSLI4F_MULTI_FS flag.

11.11.4.1. File System Splitting

When a file system transition is made and the fs_locations_info indicates that the file system in question might be split into multiple file systems (via the FSLI4F_MULTI_FS flag), the client SHOULD do GETATTRs to determine the fsid attribute on all known objects within the file system undergoing transition to determine the new file system boundaries.

Clients might choose to maintain the fsids passed to existing applications by mapping all of the fsids for the descendant file systems to the common fsid used for the original file system.

Splitting a file system can be done on a transition between file systems of the same fileid class, since the fact that fileids are unique within the source file system ensures that they will be unique in each of the target file systems.

11.11.5. The Change Attribute and File System Transitions

Since the change attribute is defined as a server-specific one, change attributes fetched from one server are normally presumed to be invalid on another server. Such a presumption is troublesome since it would invalidate all cached change attributes, requiring refetching. Even more disruptive, the absence of any assured continuity for the change attribute means that even if the same value is retrieved on refetch, no conclusions can be drawn as to whether the object in question has changed. The identical change attribute could be merely an artifact of a modified file with a different change attribute construction algorithm, with that new algorithm just happening to result in an identical change value.
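The caching implication above can be sketched as follows. This is an illustrative helper, not protocol text; it assumes the client has learned (e.g., from change-class reporting in fs_locations_info) whether the two instances share a change class.

```python
# Sketch (assumption: simplified inputs): validating a cached change
# attribute against one refetched after a file system transition.
# Unless the two instances share a change class, even an identical
# value permits no conclusion, so the cache entry must be dropped.

def cache_still_valid(cached_change: int,
                      refetched_change: int,
                      same_change_class: bool) -> bool:
    if not same_change_class:
        return False          # identical values may be coincidental
    return cached_change == refetched_change
```

The same "same class, or treat as unequal" pattern applies to the write-verifier and readdir classes discussed in the following subsections.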
When the two file systems have consistent change attribute formats, and this fact is communicated to the client by reporting them as being in the same change class, the client may assume a continuity of change attribute construction and handle this situation just as it would be handled without any file system transition.

11.11.6. Write Verifiers and File System Transitions

In a file system transition, the two file systems might be cooperating in the handling of unstably written data. Clients can determine if this is the case by seeing if the two file systems belong to the same write-verifier class. When this is the case, write verifiers returned from one system may be compared to those returned by the other, and superfluous writes avoided.

When two file systems belong to different write-verifier classes, any verifier generated by one must not be compared to one provided by the other. Instead, the two verifiers should be treated as not equal even when the values are identical.

11.11.7. Readdir Cookies and Verifiers and File System Transitions

In a file system transition, the two file systems might be consistent in their handling of READDIR cookies and verifiers. Clients can determine if this is the case by seeing if the two file systems belong to the same readdir class. When this is the case, READDIR cookies and verifiers from one system will be recognized by the other, and READDIR operations started on one server can be validly continued on the other, simply by presenting the cookie and verifier returned by a READDIR operation done on the first file system to the second.

When two file systems belong to different readdir classes, any READDIR cookie and verifier generated by one is not valid on the second and must not be presented to that server by the client.
The client should act as if the verifier were rejected.

11.11.8. File System Data and File System Transitions

When multiple replicas exist and are used simultaneously or in succession by a client, applications using them will normally expect that they contain either the same data or data that is consistent with the normal sorts of changes that are made by other clients updating the data of the file system (with metadata being the same to the degree indicated by the fs_locations_info attribute). However, when multiple file systems are presented as replicas of one another, the precise relationship between the data of one and the data of another is not, as a general matter, specified by the NFSv4.1 protocol. It is quite possible to present as replicas file systems where the data of those file systems is sufficiently different that some applications have problems dealing with the transition between replicas. The namespace will typically be constructed so that applications can choose an appropriate level of support, so that in one position in the namespace a varied set of replicas might be listed, while in another only those that are up-to-date would be considered replicas. The protocol does define three special cases of the relationship among replicas to be specified by the server and relied upon by clients:

o When multiple replicas exist and are used simultaneously by a client (see the FSLI4BX_CLSIMUL definition within fs_locations_info), they must designate the same data. Where file systems are writable, a change made on one instance must be visible on all instances at the same time, regardless of whether the interrogated instance is the one on which the modification was done.
This allows a client to use these replicas simultaneously without any special adaptation to the fact that there are multiple replicas, beyond adapting to the fact that locks obtained on one replica are maintained separately (i.e., under a different client ID). In this case, locks (whether share reservations or byte-range locks) and delegations obtained on one replica are immediately reflected on all replicas, in the sense that access from all other servers is prevented regardless of the replica used. However, because the servers are not required to treat two associated client IDs as representing the same client, it is best to access each file using only a single client ID.

o When one replica is designated as the successor instance to another existing instance after return of NFS4ERR_MOVED (i.e., the case of migration), the client may depend on the fact that all changes written to stable storage on the original instance are written to stable storage of the successor (uncommitted writes are dealt with in Section 11.11.6 above).

o Where a file system is not writable but represents a read-only copy (possibly periodically updated) of a writable file system, clients have similar requirements with regard to the propagation of updates. They may need a guarantee that any change visible on the original file system instance must be immediately visible on any replica before the client transitions access to that replica, in order to avoid any possibility that a client, in effecting a transition to a replica, will see any reversion in file system state. The specific means of this guarantee varies based on the value of the fss_type field that is reported as part of the fs_status attribute (see Section 11.18).
Since these file systems are presumed to be unsuitable for simultaneous use, there is no specification of how locking is handled; in general, locks obtained on one file system will be separate from those on others. Since these are expected to be read-only file systems, this is not likely to pose an issue for clients or applications.

When none of these special situations apply, there is no basis within the protocol for the client to make assumptions about the contents of a replica file system or its relationship to previous file system instances. Thus, switching between nominally identical read-write file systems would not be possible, because either the client does not use or the server does not support the fs_locations_info attribute.

11.11.9. Lock State and File System Transitions

While accessing a file system, clients obtain locks enforced by the server, which may prevent actions by other clients that are inconsistent with those locks.

When access is transferred between replicas, clients need to be assured that the actions disallowed by holding these locks cannot have occurred during the transition. This can be ensured by the methods below. Unless at least one of these is implemented, clients will not be assured of continuity of lock possession across a migration event.

o Providing the client an opportunity to re-obtain its locks via a per-fs grace period on the destination server, denying all clients using the destination file system the opportunity to obtain new locks that conflict with those held by the transferred client as long as that client has not completed its per-fs grace period. Because the lock reclaim mechanism was originally defined to support server reboot, it implicitly assumes that file handles will, upon reclaim, be the same as those at open.
In the case of migration, this requires that source and destination servers use the same filehandles, as evidenced by using the same server scope (see Section 2.10.4) or by showing this agreement using fs_locations_info (see Section 11.11.2 above).

Note that such a grace period can be implemented without interfering with the ability of non-transferred clients to obtain new locks while it is going on. As long as the destination server is aware of the transferred locks, it can distinguish requests to obtain new locks that conflict with existing locks from those that do not, allowing it to treat such client requests without reference to the ongoing grace period.

o Locking state can be transferred as part of the transition by providing Transparent State Migration as described in Section 11.12.

Of these, Transparent State Migration provides the smoother experience for clients in that there is no need to go through a reclaim process before new locks can be obtained. However, it requires a greater degree of inter-server co-ordination. In general, the servers taking part in migration are free to provide either facility. However, when the filehandles can differ across the migration event, Transparent State Migration is the only available means of providing the needed functionality.

It should be noted that these two methods are not mutually exclusive and that a server might well provide both. In particular, if there is some circumstance preventing a specific lock from being transferred transparently, the destination server can allow it to be reclaimed, by implementing a per-fs grace period for the migrated file system.

11.11.9.1.
Security Consideration Related to Reclaiming Lock State after File System Transitions

Although it is possible for a client reclaiming state to misrepresent its state, in the same fashion as described in Section 8.4.2.1.1, most implementations providing for such reclamation in the case of file system transitions will have the ability to detect such misrepresentations. This limits the ability of unauthenticated clients to execute denial-of-service attacks in these circumstances. Nevertheless, the rules stated in Section 8.4.2.1.1, regarding principal verification for reclaim requests, apply in this situation as well.

Typically, implementations that support file system transitions will have extensive information about the locks to be transferred. This is because:

o Since failure is not involved, there is no need to store locking information in persistent storage.

o There is no need, as there is in the failure case, to update multiple repositories containing locking state to keep them in sync. Instead, there is a one-time communication of locking state from the source to the destination server.

o Providing this information avoids the potential interference with existing clients using the destination file system that would result from denying them the ability to obtain new locks during the grace period.

When such detailed locking information, not necessarily including the associated stateids, is available:

o It is possible to detect reclaim requests that attempt to reclaim locks that did not exist before the transfer, rejecting them with NFS4ERR_RECLAIM_BAD (Section 15.1.9.4).

o It is possible, when dealing with non-reclaim requests, to determine whether they conflict with existing locks, eliminating the need to return NFS4ERR_GRACE (Section 15.1.9.2) on non-reclaim requests.
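As an illustration of how such detailed (stateid-free) locking information might be used, the following sketch classifies incoming requests on a destination server during a per-fs grace period. The record format, function names, and the use of NFS4ERR_DENIED for ordinary conflicts are illustrative assumptions, not protocol requirements:

```python
# Hypothetical sketch: a destination server holding a table of locks
# transferred from the source. Lock records are modeled as
# (client_id, filehandle, (start, end)) byte-range tuples.

NFS4_OK = 0
NFS4ERR_DENIED = 10010
NFS4ERR_RECLAIM_BAD = 10034

def _overlaps(a, b):
    # Two half-open byte ranges overlap if each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

def check_lock_request(transferred_locks, client, fh, rng, reclaim):
    if reclaim:
        # A reclaim is valid only if the lock existed before the transfer;
        # anything else is rejected with NFS4ERR_RECLAIM_BAD.
        if (client, fh, rng) in transferred_locks:
            return NFS4_OK
        return NFS4ERR_RECLAIM_BAD
    # A non-reclaim request can be checked against the transferred locks
    # directly, so NFS4ERR_GRACE never needs to be returned: the request is
    # granted or denied on its own merits while the grace period runs.
    for (c, f, r) in transferred_locks:
        if f == fh and c != client and _overlaps(r, rng):
            return NFS4ERR_DENIED
    return NFS4_OK
```

With `transferred_locks = {("A", "fh1", (0, 100))}`, client A can reclaim that exact lock, a reclaim of a lock that never existed is rejected, and a conflicting non-reclaim request from another client is simply denied rather than deferred behind the grace period.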
It is possible for implementations of grace periods in connection with file system transitions not to have detailed locking information available at the destination server, in which case the security situation is exactly as described in Section 8.4.2.1.1.

11.11.9.2. Leases and File System Transitions

In the case of lease renewal, the client may not be submitting requests for a file system that has been transferred to another server. This can occur because of the lease renewal mechanism. The client renews the lease associated with all file systems when submitting a request on an associated session, regardless of the specific file system being referenced.

In order for the client to schedule renewal of its lease where there is locking state that may have been relocated to the new server, the client must find out about lease relocation before that lease expires. To accomplish this, the SEQUENCE operation will return the status bit SEQ4_STATUS_LEASE_MOVED if responsibility for any of the renewed locking state has been transferred to a new server. This will continue until the client receives an NFS4ERR_MOVED error for each of the file systems for which there has been locking state relocation.

When a client receives a SEQ4_STATUS_LEASE_MOVED indication from a server, for each file system of the server for which the client has locking state, the client should perform an operation. For simplicity, the client may choose to reference all file systems, but what is important is that it must reference all file systems for which there was locking state where that state has moved. Once the client receives an NFS4ERR_MOVED error for each such file system, the server will clear the SEQ4_STATUS_LEASE_MOVED indication.
The client can terminate the process of checking file systems once this indication is cleared (but only if the client has received a reply for all outstanding SEQUENCE requests on all sessions it has with the server), since there are no others for which locking state has moved.

A client may use GETATTR of the fs_status (or fs_locations_info) attribute on all of the file systems to get absence indications in a single (or a few) request(s), since absent file systems will not cause an error in this context. However, it still must do an operation that receives NFS4ERR_MOVED on each file system, in order to clear the SEQ4_STATUS_LEASE_MOVED indication.

Once the set of file systems with transferred locking state has been determined, the client can follow the normal process to obtain the new server information (through the fs_locations and fs_locations_info attributes) and perform renewal of that lease on the new server, unless information in the fs_locations_info attribute shows that no state could have been transferred. If the server has not had state transferred to it transparently, the client will receive NFS4ERR_STALE_CLIENTID from the new server, as described above, and the client can then reclaim locks as is done in the event of server failure.

11.11.9.3. Transitions and the Lease_time Attribute

In order that the client may appropriately manage its lease in the case of a file system transition, the destination server must establish proper values for the lease_time attribute.

When state is transferred transparently, that state should include the correct value of the lease_time attribute. The lease_time attribute on the destination server must never be less than that on the source, since this would result in premature expiration of a lease granted by the source server.
Upon transitions in which state is transferred transparently, the client is under no obligation to refetch the lease_time attribute and may continue to use the value previously fetched (on the source server).

If state has not been transferred transparently, either because the associated servers are shown as having different eir_server_scope strings or because the client ID is rejected when presented to the new server, the client should fetch the value of lease_time on the new (i.e., destination) server, and use it for subsequent locking requests. However, the server must respect a grace period of at least as long as the lease_time on the source server, in order to ensure that clients have ample time to reclaim their locks before potentially conflicting non-reclaimed locks are granted.

11.12. Transferring State upon Migration

When the transition is a result of a server-initiated decision to transition access and the source and destination servers have implemented appropriate co-operation, it is possible to:

o Transfer locking state from the source to the destination server, in a fashion similar to that provided by Transparent State Migration in NFSv4.0, as described in [68]. Server responsibilities are described in Section 11.14.2.

o Transfer session state from the source to the destination server. Server responsibilities in effecting such a transfer are described in Section 11.14.3.

The means by which the client determines which of these transfer events has occurred are described in Section 11.13.

11.12.1. Transparent State Migration and pNFS

When pNFS is involved, the protocol is capable of supporting:

o Migration of the Metadata Server (MDS), leaving the Data Servers (DS's) in place.

o Migration of the file system as a whole, including the MDS and associated DS's.
o Replacement of one DS by another.

o Migration of a pNFS file system to one in which pNFS is not used.

o Migration of a file system not using pNFS to one in which layouts are available.

Note that migration per se is only involved in the transfer of the MDS function. Although the servicing of a layout may be transferred from one data server to another, this is not done using the file system location attributes. The MDS can effect such transfers by recalling/revoking existing layouts and granting new ones on a different data server.

Migration of the MDS function is directly supported by Transparent State Migration. Layout state will normally be transparently transferred, just as other state is. As a result, Transparent State Migration provides a framework in which, given appropriate inter-MDS data transfer, one MDS can be substituted for another.

Migration of the file system function as a whole can be accomplished by recalling all layouts as part of the initial phase of the migration process. As a result, IO will be done through the MDS during the migration process, and new layouts can be granted once the client is interacting with the new MDS. An MDS can also effect this sort of transition by revoking all layouts as part of Transparent State Migration, as long as the client is notified about the loss of locking state.

In order to allow migration to a file system on which pNFS is not supported, clients need to be prepared for a situation in which layouts are not available or supported on the destination file system and so direct IO requests to the destination server, rather than depending on layouts being available.
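The client adaptation just described can be illustrated by a small sketch. The `layout_types` parameter stands in for the result of a GETATTR of the fs_layout_type attribute on the destination; the routing labels and function name are invented for illustration:

```python
# Illustrative sketch only: after a migration, decide whether I/O should be
# attempted via pNFS layouts or sent directly to the destination server.

def select_io_path(layout_types, client_supports_pnfs=True):
    # A pNFS-capable client rechecks layout availability after migration
    # rather than assuming the destination resembles the source.
    if client_supports_pnfs and layout_types:
        return "layout-io"   # request layouts (LAYOUTGET) and use the DS's
    return "mds-io"          # direct all READ/WRITE at the server itself

# Destination advertises file layouts: the client may use pNFS.
assert select_io_path(["LAYOUT4_NFSV4_1_FILES"]) == "layout-io"
# Destination lacks pNFS support: fall back to I/O through the server.
assert select_io_path([]) == "mds-io"
```

The same check covers both directions of the last two migration cases above: a client arriving at a non-pNFS destination falls back to ordinary I/O, and one arriving at a pNFS-capable destination from a non-pNFS source can begin using layouts.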
Replacement of one DS by another is not addressed by migration as such but can be effected by an MDS recalling layouts for the DS to be replaced and issuing new ones to be served by the successor DS.

Migration may transfer a file system from a server which does not support pNFS to one which does. In order to properly adapt to this situation, clients which support pNFS, but function adequately in its absence, should check for pNFS support when a file system is migrated and be prepared to use pNFS when support is available on the destination.

11.13. Client Responsibilities when Access is Transitioned

For a client to respond to an access transition, it must become aware of it. The ways in which this can happen are discussed in Section 11.13.1, which discusses indications that a specific file system access path has transitioned as well as situations in which additional activity is necessary to determine the set of file systems that have been migrated. Section 11.13.2 goes on to complete the discussion of how the set of migrated file systems might be determined. Sections 11.13.3 through 11.13.5 discuss how the client should deal with each transition it becomes aware of, either directly or as a result of migration discovery.

The following terms are used to describe client activities:

o "Transition recovery" refers to the process of restoring access to a file system on which NFS4ERR_MOVED was received.

o "Migration recovery" refers to that subset of transition recovery which applies when the file system has migrated to a different replica.

o "Migration discovery" refers to the process of determining which file system(s) have been migrated.
It is necessary to avoid a situation in which leases could expire when a file system is not accessed for a long period of time, since a client unaware of the migration might be referencing an unmigrated file system and not renewing the lease associated with the migrated file system.

11.13.1. Client Transition Notifications

When there is a change in the network access path which a client is to use to access a file system, there are a number of related status indications with which clients need to deal:

o If an attempt is made to use or return a filehandle within a file system that is no longer accessible at the address previously used to access it, the error NFS4ERR_MOVED is returned.

Exceptions are made to allow such file handles to be used when interrogating a file system location attribute. This enables a client to determine a new replica's location or a new network access path.

This condition continues on subsequent attempts to access the file system in question. The only way the client can avoid the error is to cease accessing the file system in question at its old server location and access it instead using a different address at which it is now available.

o Whenever a SEQUENCE operation is sent by a client to a server which generated state held on that client which is associated with a file system that is no longer accessible on the server at which it was previously available, the response will contain a lease-migrated indication, with the SEQ4_STATUS_LEASE_MOVED status bit being set.

This condition continues until the client acknowledges the notification by fetching a file system location attribute for the file system whose network access path is being changed. When there are multiple such file systems, a location attribute for each such file system needs to be fetched.
The location attribute for all migrated file systems needs to be fetched in order to clear the condition. Even after the condition is cleared, the client needs to respond by using the location information to access the file system at its new location to ensure that leases are not needlessly expired.

Unlike the case of NFSv4.0, in which the corresponding conditions are both errors and thus mutually exclusive, in NFSv4.1 the client can, and often will, receive both indications on the same request. As a result, implementations need to address the question of how to co-ordinate the necessary recovery actions when both indications arrive in the response to the same request. It should be noted that when processing an NFSv4 COMPOUND, the server will normally decide whether SEQ4_STATUS_LEASE_MOVED is to be set before it determines which file system will be referenced or whether NFS4ERR_MOVED is to be returned.

Since these indications are not mutually exclusive in NFSv4.1, the following combinations are possible results when a COMPOUND is issued:

o The COMPOUND status is NFS4ERR_MOVED and SEQ4_STATUS_LEASE_MOVED is asserted.

In this case, transition recovery is required. While it is possible that migration discovery is needed in addition, it is likely that only the accessed file system has transitioned. In any case, because addressing NFS4ERR_MOVED is necessary to allow the rejected requests to be processed on the target, dealing with it will typically have priority over migration discovery.

o The COMPOUND status is NFS4ERR_MOVED and SEQ4_STATUS_LEASE_MOVED is clear.

In this case, transition recovery is also required. It is clear that migration discovery is not needed to find file systems that have been migrated other than the one returning NFS4ERR_MOVED.
Cases in which this result can arise include a referral or a migration for which there is no associated locking state. This can also arise in cases in which an access path transition other than migration occurs within the same server. In such a case, there is no need to set SEQ4_STATUS_LEASE_MOVED, since the lease remains associated with the current server even though the access path has changed.

o The COMPOUND status is not NFS4ERR_MOVED and SEQ4_STATUS_LEASE_MOVED is asserted.

In this case, no transition recovery activity is required on the file system(s) accessed by the request. However, to prevent avoidable lease expiration, migration discovery needs to be done.

o The COMPOUND status is not NFS4ERR_MOVED and SEQ4_STATUS_LEASE_MOVED is clear.

In this case, neither transition-related activity nor migration discovery is required.

Note that the specified actions only need to be taken if they are not already going on. For example, when NFS4ERR_MOVED is received when accessing a file system for which transition recovery is already going on, the client merely waits for that recovery to be completed, while the receipt of a SEQ4_STATUS_LEASE_MOVED indication only needs to initiate migration discovery for a server if such discovery is not already underway for that server.

The fact that a lease-migrated condition does not result in an error in NFSv4.1 has a number of important consequences. In addition to the fact, discussed above, that the two indications are not mutually exclusive, there are a number of issues that are important in considering implementation of migration discovery, as discussed in Section 11.13.2.
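The four combinations just enumerated reduce to a small dispatch function. The function and action names below are illustrative; as noted above, a real client would also give transition recovery priority when both indications arrive together:

```python
# Illustrative mapping from the two independent indications in a COMPOUND
# reply to the client activities they call for. In NFSv4.1, NFS4ERR_MOVED
# and SEQ4_STATUS_LEASE_MOVED are not mutually exclusive.

def required_actions(status_is_moved, lease_moved_set,
                     recovery_underway=False, discovery_underway=False):
    actions = set()
    if status_is_moved and not recovery_underway:
        # The accessed file system has transitioned; recover access to it.
        actions.add("transition-recovery")
    if lease_moved_set and not discovery_underway:
        # Other file systems on this server may have migrated as well.
        actions.add("migration-discovery")
    return actions

assert required_actions(True, True) == {"transition-recovery",
                                        "migration-discovery"}
assert required_actions(True, False) == {"transition-recovery"}
assert required_actions(False, True) == {"migration-discovery"}
assert required_actions(False, False) == set()
# Indications received while the work is already going on need no new action.
assert required_actions(True, True, recovery_underway=True,
                        discovery_underway=True) == set()
```

The two flags for work already underway model the note above that the specified actions only need to be initiated when they are not already in progress.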
Because SEQ4_STATUS_LEASE_MOVED is not an error condition, it is possible for file systems whose access paths have not changed to be successfully accessed on a given server even though recovery is necessary for other file systems on the same server. As a result, access can go on while:

o The migration discovery process is going on for that server.

o The transition recovery process is going on for other file systems connected to that server.

11.13.2. Performing Migration Discovery

Migration discovery can be performed in the same context as transition recovery, allowing recovery for each migrated file system to be invoked as it is discovered. Alternatively, it may be done in a separate migration discovery thread, allowing migration discovery to be done in parallel with one or more instances of transition recovery.

In either case, because the lease-migrated indication does not result in an error, other access to file systems on the server can proceed normally, with the possibility that further such indications will be received, raising the issue of how such indications are to be dealt with. In general:

o No action needs to be taken for such indications received by any threads performing migration discovery, since continuation of that work will address the issue.

o In other cases in which migration discovery is currently being performed, nothing further needs to be done to respond to such lease migration indications, as long as one can be certain that the migration discovery process would deal with those indications. See below for details.

o For such indications received in all other contexts, the appropriate response is to initiate or otherwise provide for the execution of migration discovery for file systems associated with the server IP address returning the indication.
This leaves a potential difficulty in situations in which the migration discovery process is near to completion but is still operating. One should not ignore a LEASE_MOVED indication if the migration discovery process is not able to respond to the discovery of additional migrating file systems without additional aid. A further complexity relevant in addressing such situations is that a lease-migrated indication may reflect the server's state at the time the SEQUENCE operation was processed, which may be different from that in effect at the time the response is received. Because new migration events may occur at any time, and because a LEASE_MOVED indication may reflect the situation in effect a considerable time before the indication is received, special care needs to be taken to ensure that LEASE_MOVED indications are not inappropriately ignored.

A useful approach to this issue involves the use of separate externally-visible migration discovery states for each server. Separate values could represent the various possible states for the migration discovery process for a server:

o non-operation, in which migration discovery is not being performed.

o normal operation, in which there is an ongoing scan for migrated file systems.

o completion/verification of migration discovery processing, in which the possible completion of migration discovery processing needs to be verified.

Given that framework, migration discovery processing would proceed as follows.

o While in the normal-operation state, the thread performing discovery would fetch, for successive file systems known to the client on the server being worked on, a file system location attribute plus the fs_status attribute.

o If the fs_status attribute indicates that the file system is a migrated one (i.e.,
fss_absent is true and fss_type != STATUS4_REFERRAL), then a migrated file system has been found. In this situation, it is likely that the fetch of the file system location attribute has cleared one of the file systems contributing to the lease-migrated indication.

o In cases in which that happened, the thread cannot know whether the lease-migrated indication has been cleared, and so it enters the completion/verification state and proceeds to issue a COMPOUND to see if the LEASE_MOVED indication has been cleared.

o When the discovery process is in the completion/verification state, if other requests get a lease-migrated indication, they note that it was received. Later, the existence of such indications is used when the request completes, as described below.

When the request used in the completion/verification state completes:

o If a lease-migrated indication is returned, the discovery continues normally. Note that this is so even if all file systems have been traversed, since new migrations could have occurred while the process was going on.

o Otherwise, if there is any record that other requests saw a lease-migrated indication while the request was going on, that record is cleared and the verification request retried. The discovery process remains in the completion/verification state.

o If there have been no lease-migrated indications, the work of migration discovery is considered completed and it enters the non-operating state. Once it enters this state, a subsequent lease-migrated indication will trigger a new migration discovery process.

It should be noted that the process described above is not guaranteed to terminate, as a long series of new migration events might continually delay the clearing of the LEASE_MOVED indication.
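The three per-server states and the transitions between them can be sketched as a small state machine. The method names are invented; a real client would fetch location and fs_status attributes during the scan and issue a SEQUENCE probe where indicated:

```python
# Illustrative three-state sketch of per-server migration discovery.

NON_OPERATING = "non-operating"   # no discovery being performed
NORMAL = "normal"                 # ongoing scan for migrated file systems
VERIFYING = "verifying"           # checking whether discovery is complete

class MigrationDiscovery:
    def __init__(self):
        self.state = NON_OPERATING
        self.indication_seen = False  # lease-migrated seen by other requests

    def lease_moved_indication(self):
        """Some request on this server observed SEQ4_STATUS_LEASE_MOVED."""
        if self.state == NON_OPERATING:
            self.state = NORMAL          # trigger a new discovery scan
        elif self.state == VERIFYING:
            self.indication_seen = True  # record it for probe completion

    def migrated_fs_found(self):
        """The scan found a migrated file system; the attribute fetch may
        have cleared the indication, so verification is needed."""
        if self.state == NORMAL:
            self.state = VERIFYING

    def probe_completed(self, lease_moved_in_reply):
        """The verification request (e.g., a SEQUENCE probe) completed."""
        if self.state != VERIFYING:
            return
        if lease_moved_in_reply:
            self.state = NORMAL          # more migrations remain; keep scanning
        elif self.indication_seen:
            self.indication_seen = False # others saw the indication meanwhile:
            pass                         # stay in VERIFYING; retry the probe
        else:
            self.state = NON_OPERATING   # discovery is complete
```

Note that, as in the prose, a run of new migration events can bounce the machine between the normal and verifying states indefinitely, which is why lease renewal should not wait for the non-operating state to be reached.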
To prevent unnecessary lease expiration, it is appropriate for clients to use the discovery of migrations to effect lease renewal immediately, rather than waiting for clearing of the LEASE_MOVED indication when the complete set of migrations is available.

Lease discovery needs to be provided as described above. This ensures that the client discovers file system migrations soon enough to renew its leases on each destination server before they expire. Non-renewal of leases can lead to loss of locking state. While the consequences of such loss can be ameliorated through implementations of courtesy locks, servers are under no obligation to do so, and a conflicting lock request may mean that a lock is revoked unexpectedly. Clients should be aware of this possibility.

11.13.3. Overview of Client Response to NFS4ERR_MOVED

This section outlines a way in which a client that receives NFS4ERR_MOVED can effect transition recovery by using a new server or server endpoint if one is available. As part of that process, it will determine:

o Whether the NFS4ERR_MOVED indicates migration has occurred, or whether it indicates another sort of file system access transition as discussed in Section 11.10 above.

o In the case of migration, whether Transparent State Migration has occurred.

o Whether any state has been lost during the process of Transparent State Migration.

o Whether sessions have been transferred as part of Transparent State Migration.

During the first phase of this process, the client proceeds to examine file system location entries to find the initial network address it will use to continue access to the file system or its replacement. For each location entry that the client examines, the process consists of five steps:

1. Performing an EXCHANGE_ID directed at the location address.
This operation is used to register the client owner (in the form of a client_owner4) with the server, to obtain a client ID to be used subsequently to communicate with it, to obtain that client ID's confirmation status, and to determine server_owner and scope for the purpose of determining if the entry is trunkable with that previously being used to access the file system (i.e., that it represents another network access path to the same file system and can share locking state with it).

2. Making an initial determination of whether migration has occurred. The initial determination will be based on whether the EXCHANGE_ID results indicate that the current location element is server-trunkable with that used to access the file system when access was terminated by receiving NFS4ERR_MOVED. If it is, then migration has not occurred. In that case, the transition is dealt with, at least initially, as one involving continued access to the same file system on the same server through a new network address.

3. Obtaining access to existing session state or creating new sessions. How this is done depends on the initial determination of whether migration has occurred and can be done as described in Section 11.13.4 below in the case of migration or as described in Section 11.13.5 below in the case of a network address transfer without migration.

4. Verification of the trunking relationship assumed in step 2 as discussed in Section 2.10.5.1. Although this step will generally confirm the initial determination, it is possible for verification to fail, with the result that an initial determination that a network address shift (without migration) has occurred may be invalidated and migration determined to have occurred.
There is no need to redo step 3 above, since it will be possible to continue use of the session established already.

5. Obtaining access to existing locking state and/or reobtaining it. How this is done depends on the final determination of whether migration has occurred and can be done as described below in Section 11.13.4 in the case of migration or as described in Section 11.13.5 in the case of a network address transfer without migration.

Once the initial address has been determined, clients are free to apply an abbreviated process to find additional addresses trunkable with it (clients may seek session-trunkable or server-trunkable addresses depending on whether they support clientid trunking). During this later phase of the process, further location entries are examined using the abbreviated procedure specified below:

A: Before the EXCHANGE_ID, the fs name of the location entry is examined, and if it does not match that currently being used, the entry is ignored. Otherwise, one proceeds as specified by step 1 above.

B: In the case that the network address is session-trunkable with one used previously, a BIND_CONN_TO_SESSION is used to access that session using the new network address. Otherwise, or if the bind operation fails, a CREATE_SESSION is done.

C: The verification procedure referred to in step 4 above is used. However, if it fails, the entry is ignored and the next available entry is used.

11.13.4. Obtaining Access to Sessions and State after Migration

In the event that migration has occurred, migration recovery will involve determining whether Transparent State Migration has occurred. This decision is made based on the client ID returned by the EXCHANGE_ID and the reported confirmation status.
o  If the client ID is an unconfirmed client ID not previously known
   to the client, then Transparent State Migration has not occurred.

o  If the client ID is a confirmed client ID previously known to the
   client, then any transferred state would have been merged with an
   existing client ID representing the client to the destination
   server.  In this state merger case, Transparent State Migration
   might or might not have occurred, and a determination as to
   whether it has occurred is deferred until sessions are
   established and the client is ready to begin state recovery.

o  If the client ID is a confirmed client ID not previously known to
   the client, then the client can conclude that the client ID was
   transferred as part of Transparent State Migration.  In this
   transferred client ID case, Transparent State Migration has
   occurred, although some state might have been lost.

Once the client ID has been obtained, it is necessary to obtain
access to sessions to continue communication with the new server.
In any of the cases in which Transparent State Migration has
occurred, it is possible that a session was transferred as well.  To
deal with that possibility, clients can, after doing the
EXCHANGE_ID, issue a BIND_CONN_TO_SESSION to connect the transferred
session to a connection to the new server.  If that fails, it is an
indication that the session was not transferred and that a new
session needs to be created to take its place.

In some situations, it is possible for a BIND_CONN_TO_SESSION to
succeed without session migration having occurred.  If state merger
has taken place, then the associated client ID may already have had
a set of existing sessions, with it being possible that the
sessionid of a given session is the same as one that might have been
migrated.  In that event, a BIND_CONN_TO_SESSION might succeed, even
though there could have been no migration of the session with that
sessionid.  In such cases, the client will receive sequence errors
when the slot sequence values used are not appropriate on the new
session.  When this occurs, the client can create a new session and
cease using the existing one.

Once the client has determined the initial migration status, and
determined that there was a shift to a new server, it needs to
re-establish its locking state, if possible.  To enable this to
happen without loss of the guarantees normally provided by locking,
the destination server needs to implement a per-fs grace period in
all cases in which lock state was lost, including those in which
Transparent State Migration was not implemented.  Each client for
which there was a transfer of locking state to the new server will
have the duration of the grace period to reclaim its locks, from the
time its locks were transferred.

Clients need to deal with the following cases:

o  In the state merger case, it is possible that the server has not
   attempted Transparent State Migration, in which case state may
   have been lost without it being reflected in the SEQ4_STATUS
   bits.  To determine whether this has happened, the client can use
   TEST_STATEID to check whether the stateids created on the source
   server are still accessible on the destination server.  Once a
   single stateid is found to have been successfully transferred,
   the client can conclude that Transparent State Migration was
   begun, and any failure to transport all of the stateids will be
   reflected in the SEQ4_STATUS bits.  Otherwise, Transparent State
   Migration has not occurred.
o  In a case in which Transparent State Migration has not occurred,
   the client can use the per-fs grace period provided by the
   destination server to reclaim locks that were held on the source
   server.

o  In a case in which Transparent State Migration has occurred, and
   no lock state was lost (as shown by SEQ4_STATUS flags), no lock
   reclaim is necessary.

o  In a case in which Transparent State Migration has occurred, and
   some lock state was lost (as shown by SEQ4_STATUS flags),
   existing stateids need to be checked for validity using
   TEST_STATEID, and reclaim used to re-establish any that were not
   transferred.

For all of the cases above, RECLAIM_COMPLETE with an rca_one_fs
value of TRUE needs to be done before normal use of the file system,
including obtaining new locks for the file system.  This applies
even if no locks were lost and there was no need for any to be
reclaimed.

11.13.5.  Obtaining Access to Sessions and State after Network
          Address Transfer

The case in which there is a transfer to a new network address
without migration is similar to that described in Section 11.13.4
above in that there is a need to obtain access to needed sessions
and locking state.  However, the details are simpler and will vary
depending on the type of trunking between the address receiving
NFS4ERR_MOVED and that to which the transfer is to be made.

To make a session available for use, a BIND_CONN_TO_SESSION should
be used to obtain access to the session previously in use.  Only if
this fails should a CREATE_SESSION be done.  While this procedure
mirrors that in Section 11.13.4 above, there is an important
difference in that preservation of the session is not purely
optional but depends on the type of trunking.

Access to appropriate locking state will generally need no actions
beyond access to the session.  However, the SEQ4_STATUS bits need to
be checked for lost locking state, including the need to reclaim
locks after a server reboot, since there is always a possibility of
locking state being lost.

11.14.  Server Responsibilities Upon Migration

In the event of file system migration, when the client connects to
the destination server, that server needs to be able to provide the
client continued access to the files it had open on the source
server.  There are two ways to provide this:

o  By provision of an fs-specific grace period, allowing the client
   the ability to reclaim its locks, in a fashion similar to what
   would have been done in the case of recovery from a server
   restart.  See Section 11.14.1 for a more complete discussion.

o  By implementing Transparent State Migration, possibly in
   connection with session migration, the server can provide the
   client immediate access on the destination to the state built up
   on the source server.

These features are discussed separately in Sections 11.14.2 and
11.14.3, which discuss Transparent State Migration and session
migration, respectively.

All the features described above can involve transfer of
lock-related information between source and destination servers.  In
some cases, this transfer is a necessary part of the implementation,
while in other cases it is a helpful implementation aid that servers
might or might not use.  The sub-sections below discuss the
information that would be transferred but do not define the
specifics of the transfer protocol.  This is left as an
implementation choice, although standards in this area could be
developed at a later time.

11.14.1.  Server Responsibilities in Effecting State Reclaim after
          Migration

In this case, the destination server needs no knowledge of the locks
held on the source server.
It relies on the clients to accurately report (via reclaim
operations) the locks previously held, and does not allow new locks
to be granted on migrated file systems until the grace period
expires.  Disallowing of new locks applies to all clients accessing
these file systems, while grace period expiration occurs for each
migrated client independently.

During this grace period, clients have the opportunity to use
reclaim operations to obtain locks for file system objects within
the migrated file system, in the same way that they do when
recovering from server restart, and the servers typically rely on
clients to accurately report their locks, although they have the
option of subjecting these requests to verification.  If the clients
only reclaim locks held on the source server, no conflict can arise.
Once the client has reclaimed its locks, it indicates the completion
of lock reclamation by performing a RECLAIM_COMPLETE specifying
rca_one_fs as TRUE.

While it is not necessary for source and destination servers to
co-operate to transfer information about locks, implementations are
well-advised to consider transferring the following useful
information:

o  If information about the set of clients that have locking state
   for the transferred file system is made available, the
   destination server will be able to terminate the grace period
   once all such clients have reclaimed their locks, allowing normal
   locking activity to resume earlier than it would have otherwise.

o  Locking summary information for individual clients (at various
   possible levels of detail) can detect some instances in which
   clients do not accurately represent the locks held on the source
   server.

11.14.2.  Server Responsibilities in Effecting Transparent State
          Migration

The basic responsibility of the source server in effecting
Transparent State Migration is to make available to the destination
server a description of each piece of locking state associated with
the file system being migrated.  In addition to the client id string
and verifier, the source server needs to provide, for each stateid:

o  The stateid, including the current sequence value.

o  The associated client ID.

o  The handle of the associated file.

o  The type of the lock, such as open, byte-range lock, delegation,
   or layout.

o  For locks such as opens and byte-range locks, there will be
   information about the owner(s) of the lock.

o  For recallable/revocable lock types, the current recall status
   needs to be included.

o  For each lock type, there will be type-specific information, such
   as share and deny modes for opens, and type and byte ranges for
   byte-range locks and layouts.

Such information will most probably be organized by client id string
on the destination server so that it can be used to provide
appropriate context to each client when it makes itself known to the
server.  Issues connected with a client impersonating another by
presenting another client's client id string can be addressed using
NFSv4.1 state protection features, as described in Section 21.

A further server responsibility concerns locks that are revoked or
otherwise lost during the process of file system migration.  Because
locks that appear to be lost during the process of migration will be
reclaimed by the client, the servers have to take steps to ensure
that locks revoked soon before or soon after migration are not
inadvertently allowed to be reclaimed in situations in which the
continuity of lock possession cannot be assured.
o  For locks lost on the source but whose loss has not yet been
   acknowledged by the client (by using FREE_STATEID), the
   destination must be aware of this loss so that it can deny a
   request to reclaim them.

o  For locks lost on the destination after the state transfer but
   before the client's RECLAIM_COMPLETE is done, the destination
   server should note these and not allow them to be reclaimed.

An additional responsibility of the cooperating servers concerns
situations in which a stateid cannot be transferred transparently
because it conflicts with an existing stateid held by the client and
associated with a different file system.  In this case, there are
two valid choices:

o  Treat the transfer, as in NFSv4.0, as one without Transparent
   State Migration.  In this case, conflicting locks cannot be
   granted until the client does a RECLAIM_COMPLETE, after
   reclaiming the locks it had, with the exception of reclaims
   denied because they were attempts to reclaim locks that had been
   lost.

o  Implement Transparent State Migration, except for the lock with
   the conflicting stateid.  In this case, the client will be aware
   of a lost lock (through the SEQ4_STATUS flags) and be allowed to
   reclaim it.

When transferring state between the source and destination, the
issues discussed in Section 7.2 of [68] must still be attended to.
In this case, the use of NFS4ERR_DELAY may still be necessary in
NFSv4.1, as it was in NFSv4.0, to prevent locking state from
changing while it is being transferred.  See Section 15.1.1.3 for
information about appropriate client retry approaches in the event
that NFS4ERR_DELAY is returned.
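The two bullets above imply that a destination server must track, per
client, the stateids whose loss has not yet been acknowledged, and
deny reclaims of them during the grace period.  The following is a
non-normative sketch of that bookkeeping; the GraceState class and
its fields are hypothetical server internals, not part of the
protocol, though the error code values are those assigned by the
specification.

```python
# Illustrative only: how a destination server might use lost-lock
# information transferred from the source to decide whether a reclaim
# may be honored during the per-fs grace period.  GraceState is a
# hypothetical internal structure, not a protocol element.

NFS4_OK = 0
NFS4ERR_NO_GRACE = 10033  # reclaim cannot be granted


class GraceState:
    def __init__(self, transferred, lost_on_source, lost_on_destination):
        # Each argument is a set of (client_id, stateid_other) pairs:
        # state the source reported as held, state lost before the
        # transfer, and state lost on the destination before the
        # client's RECLAIM_COMPLETE.
        self.transferred = set(transferred)
        self.lost = set(lost_on_source) | set(lost_on_destination)

    def check_reclaim(self, client_id, stateid_other):
        # Deny reclaims of locks known to be lost, since continuity
        # of lock possession cannot be assured for them.
        if (client_id, stateid_other) in self.lost:
            return NFS4ERR_NO_GRACE
        return NFS4_OK
```

A reclaim of a lock that was lost on either server is refused, while
a reclaim of transferred (or unknown, client-reported) state proceeds
through the server's normal reclaim verification.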
There are a number of important differences in the NFSv4.1 context:

o  The absence of RELEASE_LOCKOWNER means that the one case in which
   an operation could not be deferred by use of NFS4ERR_DELAY no
   longer exists.

o  Sequencing of operations is no longer done using owner-based
   operation sequence numbers.  Instead, sequencing is
   session-based.

As a result, when sessions are not transferred, the techniques
discussed in Section 7.2 of [68] are adequate and will not be
further discussed.

11.14.3.  Server Responsibilities in Effecting Session Transfer

The basic responsibility of the source server in effecting session
transfer is to make available to the destination server a
description of the current state of each slot within the session,
including:

o  The last sequence value received for that slot.

o  Whether there is cached reply data for the last request executed
   and, if so, the cached reply.

When sessions are transferred, there are a number of issues that
pose challenges in terms of making the transferred state
unmodifiable during the period it is gathered up and transferred to
the destination server.

o  A single session may be used to access multiple file systems, not
   all of which are being transferred.

o  Requests made on a session may, even if rejected, affect the
   state of the session by advancing the sequence number associated
   with the slot used.

As a result, when the file system state might otherwise be
considered unmodifiable, the client might have any number of
in-flight requests, each of which is capable of changing session
state.  These requests may be of a number of types:

1.  Those requests that were processed on the migrating file system,
    before migration began.

2.  Those requests that got the error NFS4ERR_DELAY because the
    file system being accessed was in the process of being migrated.

3.  Those requests that got the error NFS4ERR_MOVED because the file
    system being accessed had been migrated.

4.  Those requests that accessed the migrating file system, in order
    to obtain location or status information.

5.  Those requests that did not reference the migrating file system.

It should be noted that the history of any particular slot is likely
to include a number of these request classes.  In the case in which
a session that is migrated is used by file systems other than the
one migrated, requests of class 5 may be common and be the last
request processed, for many slots.

Since session state can change even after the locking state has been
fixed as part of the migration process, the session state known to
the client could be different from that on the destination server,
which necessarily reflects the session state on the source server at
an earlier time.  In deciding how to deal with this situation, it is
helpful to distinguish between two sorts of behavioral consequences
of the choice of initial sequence ID values.

o  The error NFS4ERR_SEQ_MISORDERED is returned when the sequence ID
   in a request is neither equal to the last one seen for the
   current slot nor the next greater one.

   In view of the difficulty of arriving at a mutually acceptable
   value for the correct last sequence value at the point of
   migration, it may be necessary for the server to show some degree
   of forbearance, when the sequence ID is one that would be
   considered unacceptable if session migration were not involved.

o  Returning the cached reply for a previously executed request when
   the sequence ID in the request matches the last value recorded
   for the slot.
   In the cases in which an error is returned and there is no
   possibility of any non-idempotent operation having been executed,
   it may not be necessary to adhere to this as strictly as might be
   proper if session migration were not involved.  For example, the
   fact that the error NFS4ERR_DELAY was returned may not assist the
   client in any material way, while the fact that NFS4ERR_MOVED was
   returned by the source server may not be relevant when the
   request was reissued, directed to the destination server.

An important issue is that the specification needs to take note of
all potential COMPOUNDs, even if they might be unlikely in practice.
For example, a COMPOUND is allowed to access multiple file systems
and might perform non-idempotent operations in some of them before
accessing a file system being migrated.  Also, a COMPOUND may return
considerable data in the response, before being rejected with
NFS4ERR_DELAY or NFS4ERR_MOVED, and may in addition be marked as
sa_cachethis.  However, note that if the client and server adhere to
the rules in Section 15.1.1.3, there is no possibility of
non-idempotent operations being spuriously reissued after receiving
an NFS4ERR_DELAY response.

To address these issues, a destination server MAY do any of the
following when implementing session transfer.

o  Avoid enforcing any sequencing semantics for a particular slot
   until the client has established the starting sequence for that
   slot on the destination server.

o  For each slot, avoid returning a cached reply returning
   NFS4ERR_DELAY or NFS4ERR_MOVED until the client has established
   the starting sequence for that slot on the destination server.

o  Until the client has established the starting sequence for a
   particular slot on the destination server, avoid reporting
   NFS4ERR_SEQ_MISORDERED or returning a cached reply returning
   NFS4ERR_DELAY or NFS4ERR_MOVED, where the reply consists solely
   of a series of operations where the response is NFS4_OK until the
   final error.

Because of the considerations mentioned above, including the rules
for the handling of NFS4ERR_DELAY included in Section 15.1.1.3, the
destination server can respond appropriately to SEQUENCE operations
received from the client by adopting the three policies listed
below:

o  Not responding with NFS4ERR_SEQ_MISORDERED for the initial
   request on a slot within a transferred session, since the
   destination server cannot be aware of requests made by the client
   after the server handoff but before the client became aware of
   the shift.  In cases in which NFS4ERR_SEQ_MISORDERED would
   normally have been reported, the request is to be processed
   normally, as a new request.

o  Replying as it would for a retry whenever the sequence matches
   that transferred by the source server, even though this would not
   provide retry handling for requests issued after the server
   handoff, under the assumption that when such requests are issued
   they will never be responded to in a state-changing fashion,
   making retry support for them unnecessary.

o  Once a non-retry SEQUENCE is received for a given slot, using
   that as the basis for further sequence checking, with no further
   reference to the sequence value transferred by the source server.

11.15.  Effecting File System Referrals

Referrals are effected when an absent file system is encountered and
one or more alternate locations are made available by the
fs_locations or fs_locations_info attributes.
The client will typically get an NFS4ERR_MOVED error, fetch the
appropriate location information, and proceed to access the file
system on a different server, even though it retains its logical
position within the original namespace.  Referrals differ from
migration events in that they happen only when the client has not
previously referenced the file system in question (so there is
nothing to transition).  Referrals can only come into effect when an
absent file system is encountered at its root.

The examples given in the sections below are somewhat artificial in
that an actual client will not typically do a multi-component look
up, but will have cached information regarding the upper levels of
the name hierarchy.  However, these examples are chosen to make the
required behavior clear and easy to put within the scope of a small
number of requests, without getting into a discussion of the details
of how specific clients might choose to cache things.

11.15.1.  Referral Example (LOOKUP)

Let us suppose that the following COMPOUND is sent in an environment
in which /this/is/the/path is absent from the target server.  This
may be for a number of reasons.  It may be that the file system has
moved, or it may be that the target server is functioning mainly, or
solely, to refer clients to the servers on which various file
systems are located.

o  PUTROOTFH

o  LOOKUP "this"

o  LOOKUP "is"

o  LOOKUP "the"

o  LOOKUP "path"

o  GETFH

o  GETATTR (fsid, fileid, size, time_modify)

Under the given circumstances, the following will be the result.

o  PUTROOTFH --> NFS_OK.  The current fh is now the root of the
   pseudo-fs.

o  LOOKUP "this" --> NFS_OK.  The current fh is for /this and is
   within the pseudo-fs.

o  LOOKUP "is" --> NFS_OK.  The current fh is for /this/is and is
   within the pseudo-fs.

o  LOOKUP "the" --> NFS_OK.  The current fh is for /this/is/the and
   is within the pseudo-fs.

o  LOOKUP "path" --> NFS_OK.  The current fh is for
   /this/is/the/path and is within a new, absent file system, but
   ... the client will never see the value of that fh.

o  GETFH --> NFS4ERR_MOVED.  Fails because the current fh is in an
   absent file system at the start of the operation, and the
   specification makes no exception for GETFH.

o  GETATTR (fsid, fileid, size, time_modify).  Not executed because
   the failure of the GETFH stops processing of the COMPOUND.

Given the failure of the GETFH, the client has the job of
determining the root of the absent file system and where to find
that file system, i.e., the server and path relative to that
server's root fh.  Note that in this example, the client did not
obtain filehandles and attribute information (e.g., fsid) for the
intermediate directories, so that it would not be sure where the
absent file system starts.  It could be the case, for example, that
/this/is/the is the root of the moved file system and that the
reason that the look up of "path" succeeded is that the file system
was not absent on that operation but was moved between the last
LOOKUP and the GETFH (since COMPOUND is not atomic).  Even if we had
the fsids for all of the intermediate directories, we could have no
way of knowing that /this/is/the/path was the root of a new file
system, since we don't yet have its fsid.

In order to get the necessary information, let us re-send the chain
of LOOKUPs with GETFHs and GETATTRs to at least get the fsids so we
can be sure where the appropriate file system boundaries are.
The client could choose to get fs_locations_info at the same time,
but in most cases the client will have a good guess as to where the
file system boundaries are (because of where NFS4ERR_MOVED was, and
was not, received), making fetching of fs_locations_info
unnecessary.

OP01:  PUTROOTFH --> NFS_OK

   -  The current fh is the root of the pseudo-fs.

OP02:  GETATTR(fsid) --> NFS_OK

   -  Just for completeness.  Normally, clients will know the fsid
      of the pseudo-fs as soon as they establish communication with
      a server.

OP03:  LOOKUP "this" --> NFS_OK

OP04:  GETATTR(fsid) --> NFS_OK

   -  Get the current fsid to see where the file system boundaries
      are.  The fsid will be that for the pseudo-fs in this example,
      so no boundary.

OP05:  GETFH --> NFS_OK

   -  The current fh is for /this and is within the pseudo-fs.

OP06:  LOOKUP "is" --> NFS_OK

   -  The current fh is for /this/is and is within the pseudo-fs.

OP07:  GETATTR(fsid) --> NFS_OK

   -  Get the current fsid to see where the file system boundaries
      are.  The fsid will be that for the pseudo-fs in this example,
      so no boundary.

OP08:  GETFH --> NFS_OK

   -  The current fh is for /this/is and is within the pseudo-fs.

OP09:  LOOKUP "the" --> NFS_OK

   -  The current fh is for /this/is/the and is within the
      pseudo-fs.

OP10:  GETATTR(fsid) --> NFS_OK

   -  Get the current fsid to see where the file system boundaries
      are.  The fsid will be that for the pseudo-fs in this example,
      so no boundary.

OP11:  GETFH --> NFS_OK

   -  The current fh is for /this/is/the and is within the
      pseudo-fs.

OP12:  LOOKUP "path" --> NFS_OK

   -  The current fh is for /this/is/the/path and is within a new,
      absent file system, but ...

   -  The client will never see the value of that fh.

OP13:  GETATTR(fsid, fs_locations_info) --> NFS_OK

   -  We are getting the fsid to know where the file system
      boundaries are.  In this operation, the fsid will be different
      than that of the parent directory (which in turn was retrieved
      in OP10).  Note that the fsid we are given will not
      necessarily be preserved at the new location.  That fsid might
      be different, and in fact the fsid we have for this file
      system might be a valid fsid of a different file system on
      that new server.

   -  In this particular case, we are pretty sure anyway that what
      has moved is /this/is/the/path rather than /this/is/the, since
      we have the fsid of the latter and it is that of the
      pseudo-fs, which presumably cannot move.  However, in other
      examples, we might not have this kind of information to rely
      on (e.g., /this/is/the might be a non-pseudo file system
      separate from /this/is/the/path), so we need to have other
      reliable source information on the boundary of the file system
      that is moved.  If, for example, the file system /this/is had
      moved, we would have a case of migration rather than referral,
      and once the boundaries of the migrated file system were clear
      we could fetch fs_locations_info.

   -  We are fetching fs_locations_info because the fact that we got
      an NFS4ERR_MOVED at this point means that it is most likely
      that this is a referral and we need the destination.  Even if
      it is the case that /this/is/the is a file system that has
      migrated, we will still need the location information for that
      file system.

OP14:  GETFH --> NFS4ERR_MOVED

   -  Fails because the current fh is in an absent file system at
      the start of the operation, and the specification makes no
      exception for GETFH.  Note that this means the server will
      never send the client a filehandle from within an absent file
      system.

Given the above, the client knows where the root of the absent file
system is (/this/is/the/path) by noting where the change of fsid
occurred (between "the" and "path").  The fs_locations_info
attribute also gives the client the actual location of the absent
file system, so that the referral can proceed.  The server gives the
client the bare minimum of information about the absent file system
so that there will be very little scope for problems of conflict
between information sent by the referring server and information of
the file system's home.  No filehandles and very few attributes are
present on the referring server, and the client can treat those it
receives as transient information with the function of enabling the
referral.

11.15.2.  Referral Example (READDIR)

Another context in which a client may encounter referrals is when it
does a READDIR on a directory in which some of the sub-directories
are the roots of absent file systems.

Suppose such a directory is read as follows:

o  PUTROOTFH

o  LOOKUP "this"

o  LOOKUP "is"

o  LOOKUP "the"

o  READDIR (fsid, size, time_modify, mounted_on_fileid)

In this case, because rdattr_error is not requested,
fs_locations_info is not requested, and some of the attributes
cannot be provided, the result will be an NFS4ERR_MOVED error on the
READDIR, with the detailed results as follows:

o  PUTROOTFH --> NFS_OK.  The current fh is at the root of the
   pseudo-fs.

o  LOOKUP "this" --> NFS_OK.  The current fh is for /this and is
   within the pseudo-fs.

o  LOOKUP "is" --> NFS_OK.  The current fh is for /this/is and is
   within the pseudo-fs.

o  LOOKUP "the" --> NFS_OK.  The current fh is for /this/is/the and
   is within the pseudo-fs.

o  READDIR (fsid, size, time_modify, mounted_on_fileid) -->
   NFS4ERR_MOVED.  Note that the same error would have been returned
   if /this/is/the had migrated, but it is returned because the
   directory contains the root of an absent file system.
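The READDIR failure above suggests a simple client strategy: when a
plain READDIR fails with NFS4ERR_MOVED, retry it with rdattr_error
added to the requested attribute set so that per-entry errors are
reported instead of failing the whole operation.  The sketch below is
purely illustrative; the readdir() callable and its return shape are
hypothetical stand-ins for a real NFSv4.1 client implementation, but
the error-code value is the one assigned by the specification.

```python
# Illustrative only: retry a READDIR that failed with NFS4ERR_MOVED,
# this time requesting rdattr_error so entries that are roots of
# absent file systems report the error per entry.

NFS4ERR_MOVED = 10019


def readdir_with_referrals(readdir, attrs):
    """Issue READDIR via the supplied callable; on NFS4ERR_MOVED,
    retry with rdattr_error included in the attribute request."""
    status, entries = readdir(attrs)
    if status == NFS4ERR_MOVED and "rdattr_error" not in attrs:
        # With rdattr_error requested, the server returns NFS_OK and
        # flags each absent-fs entry individually.
        status, entries = readdir(["rdattr_error"] + attrs)
    return status, entries
```

The second call corresponds to the re-sent COMPOUND in the text that
follows, where the entry "path" carries rdattr_error with the value
NFS4ERR_MOVED while the directory read as a whole succeeds.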
13334 So now suppose that we re-send with rdattr_error: 13336 o PUTROOTFH 13338 o LOOKUP "this" 13340 o LOOKUP "is" 13342 o LOOKUP "the" 13344 o READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) 13346 The results will be: 13348 o PUTROOTFH --> NFS_OK. The current fh is at the root of the 13349 pseudo-fs. 13351 o LOOKUP "this" --> NFS_OK. The current fh is for /this and is 13352 within the pseudo-fs. 13354 o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 13355 within the pseudo-fs. 13357 o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 13358 is within the pseudo-fs. 13360 o READDIR (rdattr_error, fsid, size, time_modify, mounted_on_fileid) 13361 --> NFS_OK. The attributes for directory entry with the component 13362 named "path" will only contain rdattr_error with the value 13363 NFS4ERR_MOVED, together with an fsid value and a value for 13364 mounted_on_fileid. 13366 Suppose we do another READDIR to get fs_locations_info (although we 13367 could have used a GETATTR directly, as in Section 11.15.1). 13369 o PUTROOTFH 13371 o LOOKUP "this" 13373 o LOOKUP "is" 13375 o LOOKUP "the" 13377 o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, 13378 size, time_modify) 13380 The results would be: 13382 o PUTROOTFH --> NFS_OK. The current fh is at the root of the 13383 pseudo-fs. 13385 o LOOKUP "this" --> NFS_OK. The current fh is for /this and is 13386 within the pseudo-fs. 13388 o LOOKUP "is" --> NFS_OK. The current fh is for /this/is and is 13389 within the pseudo-fs. 13391 o LOOKUP "the" --> NFS_OK. The current fh is for /this/is/the and 13392 is within the pseudo-fs. 13394 o READDIR (rdattr_error, fs_locations_info, mounted_on_fileid, fsid, 13395 size, time_modify) --> NFS_OK. The attributes will be as shown 13396 below. 
The attributes for the directory entry with the component named "path" will only contain:

o  rdattr_error (value: NFS_OK)

o  fs_locations_info

o  mounted_on_fileid (value: unique fileid within referring file system)

o  fsid (value: unique value within referring server)

The attributes for entry "path" will not contain size or time_modify because these attributes are not available within an absent file system.

11.16. The Attribute fs_locations

The fs_locations attribute is structured in the following way:

   struct fs_location4 {
           utf8str_cis     server<>;
           pathname4       rootpath;
   };

   struct fs_locations4 {
           pathname4       fs_root;
           fs_location4    locations<>;
   };

The fs_location4 data type is used to represent the location of a file system by providing a server name and the path to the root of the file system within that server's namespace. When a set of servers have corresponding file systems at the same path within their namespaces, an array of server names may be provided. An entry in the server array is a UTF-8 string and represents one of a traditional DNS host name, IPv4 address, IPv6 address, or a zero-length string. An IPv4 or IPv6 address is represented as a universal address (see Section 3.3.9 and [12]), minus the netid, and either with or without the trailing ".p1.p2" suffix that represents the port number. If the suffix is omitted, then the default port, 2049, SHOULD be assumed. A zero-length string SHOULD be used to indicate the current address being used for the RPC call. It is not a requirement that all servers that share the same rootpath be listed in one fs_location4 instance. The array of server names is provided for convenience. Servers that share the same rootpath may also be listed in separate fs_location4 entries in the fs_locations attribute.
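As a non-normative illustration of the server-string rules above, the following Python sketch interprets an IPv4-style entry from the fs_location4 "server" array. The function name and the six-part heuristic for detecting a universal address with a port suffix are assumptions of this sketch, not part of the protocol.

```python
# Hypothetical sketch: interpreting an entry from the fs_location4
# "server" array.  An IPv4 universal address may carry a ".p1.p2"
# suffix encoding the port as p1 * 256 + p2; if the suffix is absent,
# the default port 2049 SHOULD be assumed, and a zero-length string
# denotes the address already in use for the RPC call.

NFS_PORT = 2049

def parse_ipv4_server_entry(entry):
    """Return (host, port); host is None for a zero-length string."""
    if entry == "":
        # Zero-length string: reuse the current RPC call's address.
        return (None, NFS_PORT)
    parts = entry.split(".")
    if len(parts) == 6 and all(p.isdigit() for p in parts):
        # Universal address form h1.h2.h3.h4.p1.p2
        host = ".".join(parts[:4])
        port = int(parts[4]) * 256 + int(parts[5])
        return (host, port)
    # Bare IPv4 address or DNS host name: default port applies.
    return (entry, NFS_PORT)
```

For example, "192.0.2.5.8.1" yields host 192.0.2.5 and port 8 * 256 + 1 = 2049.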
The fs_locations4 data type and the fs_locations attribute each contain an array of such locations. Since the namespace of each server may be constructed differently, the "fs_root" field is provided. The path represented by fs_root represents the location of the file system in the current server's namespace, i.e., that of the server from which the fs_locations attribute was obtained. The fs_root path is meant to aid the client by clearly referencing the root of the file system whose locations are being reported, no matter what object within the current file system the current filehandle designates. The fs_root is simply the pathname the client used to reach the object on the current server (i.e., the object to which the fs_locations attribute applies).

When the fs_locations attribute is interrogated and there are no alternate file system locations, the server SHOULD return a zero-length array of fs_location4 structures, together with a valid fs_root.

As an example, suppose there is a replicated file system located at two servers (servA and servB). At servA, the file system is located at path /a/b/c. At servB, the file system is located at path /x/y/z. If the client were to obtain the fs_locations value for the directory at /a/b/c/d, it might not necessarily know that the file system's root is located in servA's namespace at /a/b/c. When the client switches to servB, it will need to determine that the directory it first referenced at servA is now represented by the path /x/y/z/d on servB. To facilitate this, the fs_locations attribute provided by servA would have an fs_root value of /a/b/c and two entries in fs_locations. One entry in fs_locations will be for itself (servA) and the other will be for servB with a path of /x/y/z.
With this information, the client is able to substitute /x/y/z for the /a/b/c at the beginning of its access path and construct /x/y/z/d to use for the new server.

Note that there is no requirement that the number of components in each rootpath be the same; there is no relation between the number of components in rootpath or fs_root, and none of the components in a rootpath and fs_root have to be the same. In the above example, we could have had a third element in the locations array, with server equal to "servC" and rootpath equal to "/I/II", and a fourth element in locations with server equal to "servD" and rootpath equal to "/aleph/beth/gimel/daleth/he".

The relationship between fs_root and a rootpath is that the client replaces the pathname indicated in fs_root for the current server with the substitute indicated in rootpath for the new server.

For an example of a referred or migrated file system, suppose there is a file system located at serv1. At serv1, the file system is located at /az/buky/vedi/glagoli. The client finds that the object at glagoli has migrated (or is a referral). The client gets the fs_locations attribute, which contains an fs_root of /az/buky/vedi/glagoli, and one element in the locations array, with server equal to serv2 and rootpath equal to /izhitsa/fita. The client replaces /az/buky/vedi/glagoli with /izhitsa/fita and uses the latter pathname on serv2.

Thus, the server MUST return an fs_root that is equal to the path the client used to reach the object to which the fs_locations attribute applies. Otherwise, the client cannot determine the new path to use on the new server.
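The fs_root-to-rootpath substitution described above can be sketched as follows. This is a non-normative Python illustration; the function name is hypothetical, and paths are modeled as component lists in the spirit of the pathname4 data type (an array of components) rather than flat strings.

```python
# Hypothetical sketch: replace the fs_root prefix of the client's
# access path with the rootpath advertised for the new server.

def substitute_root(path, fs_root, rootpath):
    """Return 'path' with its fs_root prefix replaced by rootpath."""
    if path[:len(fs_root)] != fs_root:
        raise ValueError("path does not lie under fs_root")
    return rootpath + path[len(fs_root):]

# Replication example from the text: /a/b/c/d on servA -> /x/y/z/d on servB.
new_path = substitute_root(["a", "b", "c", "d"],
                           ["a", "b", "c"],
                           ["x", "y", "z"])

# Migration/referral example: /az/buky/vedi/glagoli -> /izhitsa/fita on serv2.
moved = substitute_root(["az", "buky", "vedi", "glagoli"],
                        ["az", "buky", "vedi", "glagoli"],
                        ["izhitsa", "fita"])
```

Note that nothing requires fs_root and rootpath to have the same number of components, which is why the substitution works on whole prefixes rather than component-by-component.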
Since the fs_locations attribute lacks information defining various attributes of the various file system choices presented, it SHOULD only be interrogated and used when fs_locations_info is not available. When fs_locations is used, information about the specific locations should be assumed based on the following rules.

The following rules are general and apply irrespective of the context.

o  All listed file system instances should be considered as of the same handle class if and only if the current fh_expire_type attribute does not include the FH4_VOL_MIGRATION bit. Note that in the case of referral, filehandle issues do not apply since there can be no filehandles known within the current file system, nor is there any access to the fh_expire_type attribute on the referring (absent) file system.

o  All listed file system instances should be considered as of the same fileid class if and only if the fh_expire_type attribute indicates persistent filehandles and does not include the FH4_VOL_MIGRATION bit. Note that in the case of referral, fileid issues do not apply since there can be no fileids known within the referring (absent) file system, nor is there any access to the fh_expire_type attribute.

o  All file system instances should be considered as of different change classes.

For other class assignments, handling of file system transitions depends on the reasons for the transition:

o  When the transition is due to migration, that is, the client was directed to a new file system after receiving an NFS4ERR_MOVED error, the target should be treated as being of the same write-verifier class as the source.
o  When the transition is due to failover to another replica, that is, the client selected another replica without receiving an NFS4ERR_MOVED error, the target should be treated as being of a different write-verifier class from the source.

The specific choices reflect typical implementation patterns for failover and controlled migration, respectively. Since other choices are possible and useful, this information is better obtained by using fs_locations_info. When a server implementation needs to communicate other choices, it MUST support the fs_locations_info attribute.

See Section 21 for a discussion on the recommendations for the security flavor to be used by any GETATTR operation that requests the "fs_locations" attribute.

11.17. The Attribute fs_locations_info

The fs_locations_info attribute is intended as a more functional replacement for the fs_locations attribute, which will continue to exist and be supported. Clients can use it to get a more complete set of data about alternative file system locations, including additional network paths to access replicas in use and additional replicas. When the server does not support fs_locations_info, fs_locations can be used to get a subset of the data. A server that supports fs_locations_info MUST support fs_locations as well.

There is additional data present in fs_locations_info that is not available in fs_locations:

o  Attribute continuity information.
This information will allow a client to select a replica that meets the transparency requirements of the applications accessing the data and to leverage optimizations due to the server guarantees of attribute continuity (e.g., if the change attribute of a file of the file system is continuous between multiple replicas, the client does not have to invalidate the file's cache when switching to a different replica).

o  File system identity information that indicates when multiple replicas, from the client's point of view, correspond to the same target file system, allowing them to be used interchangeably, without disruption, as distinct synchronized replicas of the same file data.

Note that having two replicas with common identity information is distinct from the case of two (trunked) paths to the same replica.

o  Information that will bear on the suitability of various replicas, depending on the use that the client intends. For example, many applications need an absolutely up-to-date copy (e.g., those that write), while others may only need access to the most up-to-date copy reasonably available.

o  Server-derived preference information for replicas, which can be used to implement load-balancing while giving the client the entire file system list to be used in case the primary fails.

The fs_locations_info attribute is structured similarly to the fs_locations attribute. A top-level structure (fs_locations_info4) contains the entire attribute, including the root pathname of the file system and an array of lower-level structures that define replicas that share a common rootpath on their respective servers. The lower-level structure in turn (fs_locations_item4) contains a specific pathname and information on one or more individual network access paths.
For that last lowest level, fs_locations_info has an fs_locations_server4 structure that contains per-server-replica information in addition to the file system location entry. This per-server-replica information includes a nominally opaque array, fls_info, within which specific pieces of information are located at the specific indices listed below.

Two fs_locations_server4 entries that are within different fs_locations_item4 structures are never trunkable, while two entries within the same fs_locations_item4 structure might or might not be trunkable. Two entries that are trunkable will have identical identity information, although, as noted above, the converse is not the case.

The attribute will always contain at least a single fs_locations_server4 entry. Typically, there will be an entry with the FSLI4GF_CUR_REQ flag set, although in the case of a referral there will be no entry with that flag set.

It should be noted that fs_locations_info attributes returned by servers for various replicas may differ for various reasons. One server may know about a set of replicas that are not known to other servers. Further, compatibility attributes may differ. Filehandles might be of the same class going from replica A to replica B but not going in the reverse direction. This might happen because the filehandles are the same, but replica B's server implementation might not have provision to note and report that equivalence.

The fs_locations_info attribute consists of a root pathname (fli_fs_root, just like fs_root in the fs_locations attribute), together with an array of fs_locations_item4 structures. The fs_locations_item4 structures in turn consist of a root pathname (fli_rootpath) together with an array (fli_entries) of elements of data type fs_locations_server4, all defined as follows.
   /*
    * Defines an individual server access path
    */
   struct fs_locations_server4 {
           int32_t         fls_currency;
           opaque          fls_info<>;
           utf8str_cis     fls_server;
   };

   /*
    * Byte indices of items within
    * fls_info: flag fields, class numbers,
    * bytes indicating ranks and orders.
    */
   const FSLI4BX_GFLAGS            = 0;
   const FSLI4BX_TFLAGS            = 1;

   const FSLI4BX_CLSIMUL           = 2;
   const FSLI4BX_CLHANDLE          = 3;
   const FSLI4BX_CLFILEID          = 4;
   const FSLI4BX_CLWRITEVER        = 5;
   const FSLI4BX_CLCHANGE          = 6;
   const FSLI4BX_CLREADDIR         = 7;

   const FSLI4BX_READRANK          = 8;
   const FSLI4BX_WRITERANK         = 9;
   const FSLI4BX_READORDER         = 10;
   const FSLI4BX_WRITEORDER        = 11;

   /*
    * Bits defined within the general flag byte.
    */
   const FSLI4GF_WRITABLE          = 0x01;
   const FSLI4GF_CUR_REQ           = 0x02;
   const FSLI4GF_ABSENT            = 0x04;
   const FSLI4GF_GOING             = 0x08;
   const FSLI4GF_SPLIT             = 0x10;

   /*
    * Bits defined within the transport flag byte.
    */
   const FSLI4TF_RDMA              = 0x01;

   /*
    * Defines a set of replicas sharing
    * a common value of the rootpath
    * within the corresponding
    * single-server namespaces.
    */
   struct fs_locations_item4 {
           fs_locations_server4    fli_entries<>;
           pathname4               fli_rootpath;
   };

   /*
    * Defines the overall structure of
    * the fs_locations_info attribute.
    */
   struct fs_locations_info4 {
           uint32_t                fli_flags;
           int32_t                 fli_valid_for;
           pathname4               fli_fs_root;
           fs_locations_item4      fli_items<>;
   };

   /*
    * Flag bits in fli_flags.
    */
   const FSLI4IF_VAR_SUB = 0x00000001;

   typedef fs_locations_info4 fattr4_fs_locations_info;

As noted above, the fs_locations_info attribute, when supported, may be requested of absent file systems without causing NFS4ERR_MOVED to be returned.
It is generally expected that it will be available for both present and absent file systems even if only a single fs_locations_server4 entry is present, designating the current (present) file system, or two fs_locations_server4 entries designating the previous location of an absent file system (the one just referenced) and its successor location. Servers are strongly urged to support this attribute on all file systems if they support it on any file system.

The data presented in the fs_locations_info attribute may be obtained by the server in any number of ways, including specification by the administrator or by current protocols for transferring data among replicas and protocols not yet developed. NFSv4.1 only defines how this information is presented by the server to the client.

11.17.1. The fs_locations_server4 Structure

The fs_locations_server4 structure consists of the following items in addition to the fls_server field, which specifies a network address or set of addresses to be used to access the specified file system. Note that both of these items (i.e., fls_currency and fls_info) specify attributes of the file system replica and should not be different when there are multiple fs_locations_server4 structures for the same replica, each specifying a network path to the chosen replica.

When these values are different in two fs_locations_server4 structures, a client has no basis for choosing one over the other and is best off simply ignoring both entries, whether these entries apply to migration, replication, or referral. When there are more than two such entries, majority voting can be used to exclude a single erroneous entry from consideration.
In the case in which trunking information is provided for a replica currently being accessed, the additional trunked addresses can be ignored while access continues on the address currently being used, even if the entry corresponding to that path might be considered invalid.

o  An indication of how up-to-date the file system is (fls_currency) in seconds. This value is relative to the master copy. A negative value indicates that the server is unable to give any reasonably useful value here. A value of zero indicates that the file system is the actual writable data or a reliably coherent and fully up-to-date copy. Positive values indicate how out-of-date this copy can normally be before it is considered for update. Such a value is not a guarantee that such updates will always be performed on the required schedule but instead serves as a hint about how far the copy of the data would be expected to be behind the most up-to-date copy.

o  A counted array of one-byte values (fls_info) containing information about the particular file system instance. This data includes general flags, transport capability flags, file system equivalence class information, and selection priority information. The encoding will be discussed below.

o  The server string (fls_server). For the case of the replica currently being accessed (via GETATTR), a zero-length string MAY be used to indicate the current address being used for the RPC call. The fls_server field can also be an IPv4 or IPv6 address, formatted the same way as an IPv4 or IPv6 address in the "server" field of the fs_location4 data type (see Section 11.16).
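The majority-voting consistency check suggested above can be sketched as follows. This non-normative Python fragment models each fs_locations_server4 entry for a replica as a (fls_currency, fls_info) pair; the function name and this modeling are assumptions of the sketch.

```python
# Hypothetical sketch: when three or more fs_locations_server4 entries
# describe the same replica, majority voting can exclude a single
# erroneous entry; with no majority, a client has no basis for choosing
# and is best off ignoring the conflicting entries.

from collections import Counter

def replica_consensus(entries):
    """Return the (fls_currency, fls_info) value shared by a strict
    majority of entries, or None when no majority exists."""
    counts = Counter((currency, bytes(info)) for currency, info in entries)
    value, n = counts.most_common(1)[0]
    if n > len(entries) // 2:
        return value
    return None
```

With two conflicting entries there is no majority, matching the advice to simply ignore both.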
With the exception of the transport-flag field (at offset FSLI4BX_TFLAGS within the fls_info array), all of the data defined in this specification applies to the replica specified by the entry, rather than the specific network path used to access it. The classification of data in extensions to this data is discussed below.

Data within the fls_info array is in the form of 8-bit data items with constants giving the offsets within the array of various values describing this particular file system instance. This style of definition was chosen, in preference to explicit XDR structure definitions for these values, for a number of reasons.

o  The kinds of data in the fls_info array, representing flags, file system classes, and priorities among sets of file systems representing the same data, are such that 8 bits provide a quite acceptable range of values. Even where there might be more than 256 such file system instances, having more than 256 distinct classes or priorities is unlikely.

o  Explicit definition of the various specific data items within XDR would limit expandability in that any extension within would require yet another attribute, leading to specification and implementation clumsiness. In the context of the NFSv4 extension model in effect at the time fs_locations_info was designed (i.e., that described in RFC 5661 [65]), this would necessitate a new minor version to effect any Standards Track extension to the data in fls_info.

The set of fls_info data is subject to expansion in a future minor version, or in a Standards Track RFC, within the context of a single minor version. The server SHOULD NOT send and the client MUST NOT use indices within the fls_info array or flag bits that are not defined in Standards Track RFCs.
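A client-side reading of that last rule might look like the following non-normative Python sketch, which returns only bytes and flag bits that are defined in this specification and actually present in what the server sent. The helper name and the masking approach are assumptions of the sketch.

```python
# Hypothetical sketch: a client MUST NOT use fls_info indices or flag
# bits that are not defined in Standards Track RFCs.  The sets below
# reflect only the indices and bits defined in this specification.

DEFINED_INDICES = set(range(12))   # FSLI4BX_GFLAGS .. FSLI4BX_WRITEORDER
DEFINED_GFLAGS = 0x1F              # FSLI4GF_WRITABLE .. FSLI4GF_SPLIT
DEFINED_TFLAGS = 0x01              # FSLI4TF_RDMA

def fls_info_byte(info, index):
    """Return a defined fls_info byte, or None if the index is not
    defined or lies beyond the counted array the server sent."""
    if index not in DEFINED_INDICES or index >= len(info):
        return None
    value = info[index]
    if index == 0:                 # general flags: mask undefined bits
        value &= DEFINED_GFLAGS
    elif index == 1:               # transport flags: mask undefined bits
        value &= DEFINED_TFLAGS
    return value
```

Checking the array length before indexing also matches the later observation that the length of fls_info is one way a client can tell whether a server is aware of an extension.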
In light of the new extension model defined in RFC 8178 [66] and the fact that the individual items within fls_info are not explicitly referenced in the XDR, the following practices should be followed when extending or otherwise changing the structure of the data returned in fls_info within the scope of a single minor version.

o  All extensions need to be described by Standards Track documents. There is no need for such documents to be marked as updating RFC 5661 [65] or this document.

o  It needs to be made clear whether the information in any added data items applies to the replica specified by the entry or to the specific network paths specified in the entry.

o  There needs to be a reliable way defined to determine whether the server is aware of the extension. This may be based on the length field of the fls_info array, but it is more flexible to provide fs-scope or server-scope attributes to indicate what extensions are provided.

This encoding scheme can be adapted to the specification of multi-byte numeric values, even though none are currently defined. If extensions are made via Standards Track RFCs, multi-byte quantities will be encoded as a range of bytes with a range of indices, with the bytes interpreted in big-endian byte order. Further, any such index assignments will be constrained by the need for the relevant quantities not to cross XDR word boundaries.

The fls_info array currently contains:

o  Two 8-bit flag fields, one devoted to general file-system characteristics and a second reserved for transport-related capabilities.

o  Six 8-bit class values that define various file system equivalence classes as explained below.

o  Four 8-bit priority values that govern file system selection as explained below.
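The layout just summarized can be unpacked as in the following non-normative Python sketch. The dataclass and function names are illustrative; the byte offsets mirror the FSLI4BX_* constants in the XDR, and the class-matching helper reflects the rule, given later in this section, that two class numbers match only when both are identical and non-zero.

```python
# Hypothetical sketch: decode the 12 defined bytes of fls_info into
# the three groups described above.

from dataclasses import dataclass

@dataclass
class FlsInfo:
    gflags: int        # general file-system characteristics
    tflags: int        # transport capabilities
    classes: tuple     # simul, handle, fileid, writever, change, readdir
    ranks: tuple       # (read rank, write rank)
    orders: tuple      # (read order, write order)

def decode_fls_info(info):
    if len(info) < 12:
        raise ValueError("fls_info shorter than the defined 12 bytes")
    return FlsInfo(
        gflags=info[0],               # FSLI4BX_GFLAGS
        tflags=info[1],               # FSLI4BX_TFLAGS
        classes=tuple(info[2:8]),     # FSLI4BX_CLSIMUL .. FSLI4BX_CLREADDIR
        ranks=(info[8], info[9]),     # FSLI4BX_READRANK, FSLI4BX_WRITERANK
        orders=(info[10], info[11]),  # FSLI4BX_READORDER, FSLI4BX_WRITEORDER
    )

def same_class(a, b):
    """Class numbers match only when equal and non-zero; zero never matches."""
    return a != 0 and a == b
```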
The general file system characteristics flag (at byte index FSLI4BX_GFLAGS) has the following bits defined within it:

o  FSLI4GF_WRITABLE indicates that this file system target is writable, allowing it to be selected by clients that may need to write on this file system. When the current file system instance is writable and is defined as of the same simultaneous-use class (as specified by the value at index FSLI4BX_CLSIMUL) to which the client was previously writing, then it must incorporate within its data any committed write made on the source file system instance. See Section 11.11.6, which discusses the write-verifier class. While there is no harm in not setting this flag for a file system that turns out to be writable, turning the flag on for a read-only file system can cause problems for clients that select a migration or replication target based on the flag and then find themselves unable to write.

o  FSLI4GF_CUR_REQ indicates that this replica is the one on which the request is being made. Only a single server entry may have this flag set and, in the case of a referral, no entry will have it set. Note that this flag might be set even if the request was made on a network access path different from any of those specified in the current entry.

o  FSLI4GF_ABSENT indicates that this entry corresponds to an absent file system replica. It can only be set if FSLI4GF_CUR_REQ is set. When both such bits are set, it indicates that a file system instance is not usable but that the information in the entry can be used to determine the sorts of continuity available when switching from this replica to other possible replicas.
Since this bit can only be true if FSLI4GF_CUR_REQ is true, the value could be determined using the fs_status attribute, but the information is also made available here for the convenience of the client. An entry with this bit set, since it represents a true file system (albeit absent), does not appear in the event of a referral, but only when a file system has been accessed at this location and has subsequently been migrated.

o  FSLI4GF_GOING indicates that a replica, while still available, should not be used further. The client, if using it, should make an orderly transfer to another file system instance as expeditiously as possible. It is expected that file systems going out of service will be announced as FSLI4GF_GOING some time before the actual loss of service. It is also expected that the fli_valid_for value will be sufficiently small to allow clients to detect and act on scheduled events, while large enough that the cost of the requests to fetch the fs_locations_info values will not be excessive. Values on the order of ten minutes seem reasonable.

When this flag is seen as part of a transition into a new file system, a client might choose to transfer immediately to another replica, or it may reference the current file system and only transition when a migration event occurs. Similarly, when this flag appears on a replica in a referral, clients would likely avoid being referred to this instance whenever there is another choice.

This flag, like the other items within fls_info, applies to the replica, rather than to a particular path to that replica. When it appears, a transition to a new replica, rather than to a different path to the same replica, is indicated.
o  FSLI4GF_SPLIT indicates that when a transition occurs from the current file system instance to this one, the replacement may consist of multiple file systems. In this case, the client has to be prepared for the possibility that objects on the same file system before migration will be on different ones after. Note that FSLI4GF_SPLIT is not incompatible with the file systems belonging to the same fileid class since, if one has a set of fileids that are unique within a file system, each subset assigned to a smaller file system after migration would not have any conflicts internal to that file system.

A client, in the case of a split file system, will interrogate existing files with which it has continuing connection (it is free to simply forget cached filehandles). If the client remembers the directory filehandle associated with each open file, it may proceed upward using LOOKUPP to find the new file system boundaries. Note that in the event of a referral, there will not be any such files and so these actions will not be performed. Instead, a reference to a portion of the original file system now split off into other file systems will encounter an fsid change and possibly a further referral.

Once the client recognizes that one file system has been split into two, it can prevent the disruption of running applications by presenting the two file systems as a single one until a convenient point to recognize the transition, such as a restart. This would require a mapping from the server's fsids to fsids as seen by the client, but this is already necessary for other reasons. As noted above, existing fileids within the two descendant file systems will not conflict.
Providing non-conflicting fileids for newly created files on the split file systems is the responsibility of the server (or servers working in concert). The server can encode filehandles such that filehandles generated before the split event can be discerned from those generated after the split, allowing the server to determine when the need for emulating two file systems as one is over.

Although it is possible for this flag to be present in the event of referral, it would generally be of little interest to the client, since the client is not expected to have information regarding the current contents of the absent file system.

The transport-flag field (at byte index FSLI4BX_TFLAGS) contains the following bits related to the transport capabilities of the specific network path(s) specified by the entry.

o  FSLI4TF_RDMA indicates that any specified network paths provide NFSv4.1 clients access using an RDMA-capable transport.

Attribute continuity and file system identity information are expressed by defining equivalence relations on the sets of file systems presented to the client. Each such relation is expressed as a set of file system equivalence classes. For each relation, a file system has an 8-bit class number. Two file systems belong to the same class if both have identical non-zero class numbers. Zero is treated as non-matching. Most often, the relevant question for the client will be whether a given replica is identical to / continuous with the current one in a given respect, but the information should be available also as to whether two other replicas match in that respect as well.

The following fields specify the file system's class numbers for the equivalence relations used in determining the nature of file system transitions.
See Sections 11.9 through 11.14 and their various subsections for details about how this information is to be used. Servers may assign these values as they wish, so long as file system instances that share the same value have the specified relationship to one another; conversely, file systems that have the specified relationship to one another share a common class value. As each instance entry is added, the relationships of this instance to previously entered instances can be consulted, and if one is found that bears the specified relationship, that entry's class value can be copied to the new entry. When no such previous entry exists, a new value for that byte index (not previously used) can be selected, most likely by incrementing the value of the last class value assigned for that index.

o  The field with byte index FSLI4BX_CLSIMUL defines the simultaneous-use class for the file system.

o  The field with byte index FSLI4BX_CLHANDLE defines the handle class for the file system.

o  The field with byte index FSLI4BX_CLFILEID defines the fileid class for the file system.

o  The field with byte index FSLI4BX_CLWRITEVER defines the write-verifier class for the file system.

o  The field with byte index FSLI4BX_CLCHANGE defines the change class for the file system.

o  The field with byte index FSLI4BX_CLREADDIR defines the readdir class for the file system.

Server-specified preference information is also provided via 8-bit values within the fls_info array. The values provide a rank and an order (see below) to be used with separate values specifiable for the cases of read-only and writable file systems. These values are compared for different file systems to establish the server-specified preference, with lower values indicating "more preferred".
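A client's use of these preference values can be sketched as a sort on the applicable (rank, order) pair, lower values first. This non-normative Python fragment assumes replicas are given as (name, fls_info) pairs; the function name is illustrative, and an actual client may also weigh its own criteria, such as observed latency.

```python
# Hypothetical sketch: order replicas by server-specified preference.
# Rank dominates (all replicas of a lower rank are tried before any of
# a higher rank); within a rank, the order value breaks ties.
# Byte indices follow the XDR: read rank/order at 8 and 10,
# write rank/order at 9 and 11.

def preference_sorted(replicas, for_write=False):
    """Sort (name, fls_info) pairs by the applicable (rank, order)."""
    rank_ix, order_ix = (9, 11) if for_write else (8, 10)
    return sorted(replicas,
                  key=lambda r: (r[1][rank_ix], r[1][order_ix]))
```

A client needing write access would sort with for_write=True, using the write rank and order bytes instead of the read ones.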
Rank is used to express a strict server-imposed ordering on clients, with lower values indicating "more preferred". Clients should attempt to use all replicas with a given rank before they use one with a higher rank. Only if all of those file systems are unavailable should the client proceed to those of a higher rank. Because specifying a rank will override client preferences, servers should be conservative about using this mechanism, particularly when the environment is one in which client communication characteristics are neither tightly controlled nor visible to the server.

Within a rank, the order value is used to specify the server's preference to guide the client's selection when the client's own preferences are not controlling, with lower values of order indicating "more preferred". If replicas are approximately equal in all respects, clients should defer to the order specified by the server. When clients look at server latency as part of their selection, they are free to use this criterion, but it is suggested that when latency differences are not significant, the server-specified order should guide selection.

o  The field at byte index FSLI4BX_READRANK gives the rank value to be used for read-only access.

o  The field at byte index FSLI4BX_READORDER gives the order value to be used for read-only access.

o  The field at byte index FSLI4BX_WRITERANK gives the rank value to be used for writable access.

o  The field at byte index FSLI4BX_WRITEORDER gives the order value to be used for writable access.

Depending on the potential need for write access by a given client, one of the pairs of rank and order values is used.
The read rank and 14052 order should only be used if the client knows that only reading will 14053 ever be done or if it is prepared to switch to a different replica in 14054 the event that any write access capability is required in the future. 14056 11.17.2. The fs_locations_info4 Structure 14058 The fs_locations_info4 structure, encoding the fs_locations_info 14059 attribute, contains the following: 14061 o The fli_flags field, which contains general flags that affect the 14062 interpretation of this fs_locations_info4 structure and all 14063 fs_locations_item4 structures within it. The only flag currently 14064 defined is FSLI4IF_VAR_SUB. All bits in the fli_flags field that 14065 are not defined should always be returned as zero. 14067 o The fli_fs_root field, which contains the pathname of the root of 14068 the current file system on the current server, just as it does in 14069 the fs_locations4 structure. 14071 o An array called fli_items of fs_locations_item4 structures, which 14072 contain information about replicas of the current file system. 14073 Where the current file system is actually present, or has been 14074 present, i.e., this is not a referral situation, one of the 14075 fs_locations_item4 structures will contain an fs_locations_server4 14076 for the current server. This structure will have FSLI4GF_ABSENT 14077 set if the current file system is absent, i.e., normal access to 14078 it will return NFS4ERR_MOVED. 14080 o The fli_valid_for field specifies a time in seconds for which it 14081 is reasonable for a client to use the fs_locations_info attribute 14082 without refetch. The fli_valid_for value does not provide a 14083 guarantee of validity since servers can unexpectedly go out of 14084 service or become inaccessible for any number of reasons. Clients 14085 are well-advised to refetch this information for an actively 14086 accessed file system every fli_valid_for seconds.
This is 14087 particularly important when file system replicas may go out of 14088 service in a controlled way using the FSLI4GF_GOING flag to 14089 communicate an ongoing change. The server should set 14090 fli_valid_for to a value that allows well-behaved clients to 14091 notice the FSLI4GF_GOING flag and make an orderly switch before 14092 the loss of service becomes effective. If this value is zero, 14093 then no refetch interval is appropriate and the client need not 14094 refetch this data on any particular schedule. In the event of a 14095 transition to a new file system instance, a new value of the 14096 fs_locations_info attribute will be fetched at the destination. 14097 It is to be expected that this may have a different fli_valid_for 14098 value, which the client should then use in the same fashion as the 14099 previous value. Because a refetch of the attribute causes 14100 information from all component entries to be refetched, the server 14101 will typically provide a low value for this field if any of the 14102 replicas are likely to go out of service in a short time frame. 14103 Note that, because of the ability of the server to return 14104 NFS4ERR_MOVED to trigger the use of different paths, when 14105 alternate trunked paths are available, there is generally no need 14106 to use low values of fli_valid_for in connection with the 14107 management of alternate paths to the same replica. 14109 The FSLI4IF_VAR_SUB flag within fli_flags controls whether variable 14110 substitution is to be enabled. See Section 11.17.3 for an 14111 explanation of variable substitution. 14113 11.17.3. The fs_locations_item4 Structure 14115 The fs_locations_item4 structure contains a pathname (in the field 14116 fli_rootpath) that encodes the path of the target file system 14117 replicas on the set of servers designated by the included 14118 fs_locations_server4 entries. 
The precise manner in which this 14119 target location is specified depends on the value of the 14120 FSLI4IF_VAR_SUB flag within the associated fs_locations_info4 14121 structure. 14123 If this flag is not set, then fli_rootpath simply designates the 14124 location of the target file system within each server's single-server 14125 namespace just as it does for the rootpath within the fs_location4 14126 structure. When this bit is set, however, component entries of a 14127 certain form are subject to client-specific variable substitution so 14128 as to allow a degree of namespace non-uniformity in order to 14129 accommodate the selection of client-specific file system targets to 14130 adapt to different client architectures or other characteristics. 14132 When such substitution is in effect, a variable beginning with the 14133 string "${" and ending with the string "}" and containing a colon is 14134 to be replaced by the client-specific value associated with that 14135 variable. The string "unknown" should be used by the client when it 14136 has no value for such a variable. The pathname resulting from such 14137 substitutions is used to designate the target file system, so that 14138 different clients may have different file systems, corresponding to 14139 that location in the multi-server namespace. 14141 As mentioned above, such substituted pathname variables contain a 14142 colon. The part before the colon is to be a DNS domain name, and the 14143 part after is to be a case-insensitive alphanumeric string. 14145 Where the domain is "ietf.org", only variable names defined in this 14146 document or subsequent Standards Track RFCs are subject to such 14147 substitution. Organizations are free to use their domain names to 14148 create their own sets of client-specific variables, to be subject to 14149 such substitution. 
In cases where such variables are intended to be 14150 used more broadly than a single organization, publication of an 14151 Informational RFC defining such variables is RECOMMENDED. 14153 The variable ${ietf.org:CPU_ARCH} is used to denote the CPU 14154 architecture for which object files are compiled. This specification 14155 does not limit the acceptable values (except that they must be valid 14156 UTF-8 strings), but such values as "x86", "x86_64", and "sparc" would 14157 be expected to be used in line with industry practice. 14159 The variable ${ietf.org:OS_TYPE} is used to denote the operating 14160 system, and thus the kernel and library APIs, for which code might be 14161 compiled. This specification does not limit the acceptable values 14162 (except that they must be valid UTF-8 strings), but such values as 14163 "linux" and "freebsd" would be expected to be used in line with 14164 industry practice. 14166 The variable ${ietf.org:OS_VERSION} is used to denote the operating 14167 system version, and thus the specific details of versioned 14168 interfaces, for which code might be compiled. This specification 14169 does not limit the acceptable values (except that they must be valid 14170 UTF-8 strings). However, combinations of numbers and letters with 14171 interspersed dots would be expected to be used in line with industry 14172 practice, with the details of the version format depending on the 14173 specific value of the variable ${ietf.org:OS_TYPE} with which it is 14174 used. 14176 Use of these variables could result in the direction of different 14177 clients to different file systems on the same server, as appropriate 14178 to particular clients.
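The substitution rule described above can be sketched as follows. This is an illustrative client-side sketch, not a normative algorithm; lowercasing the variable name is a simplification of the stated case-insensitivity rules, and the client's variable table is implementation-specific.

```python
import re

# A component is substitutable when it has the exact form
# "${domain:name}"; the colon is required.
_VAR = re.compile(r"^\$\{([^:{}]+:[^:{}]+)\}$")

def substitute_path(components, client_vars):
    """Apply client-specific variable substitution to a pathname,
    represented (as in NFSv4) as an array of components. A variable
    for which the client has no value is replaced by "unknown"."""
    result = []
    for comp in components:
        m = _VAR.match(comp)
        if m:
            result.append(client_vars.get(m.group(1).lower(), "unknown"))
        else:
            result.append(comp)
    return result
```

For example, a client advertising a Linux/x86_64 environment would resolve the path components ["exports", "${ietf.org:OS_TYPE}", "${ietf.org:CPU_ARCH}"] to ["exports", "linux", "x86_64"].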
In cases in which the target file systems are 14179 located on different servers, a single server could serve as a 14180 referral point so that each valid combination of variable values 14181 would designate a referral hosted on a single server, with the 14182 targets of those referrals on a number of different servers. 14184 Because namespace administration is affected by the values selected 14185 to substitute for various variables, clients should provide 14186 convenient means of determining what variable substitutions a client 14187 will implement, as well as, where appropriate, providing means to 14188 control the substitutions to be used. The exact means by which this 14189 will be done is outside the scope of this specification. 14191 Although variable substitution is most suitable for use in the 14192 context of referrals, it may be used in the context of replication 14193 and migration. If it is used in these contexts, the server must 14194 ensure that no matter what values the client presents for the 14195 substituted variables, the result is always a valid successor file 14196 system instance to that from which a transition is occurring, i.e., 14197 that the data is identical or represents a later image of a writable 14198 file system. 14200 Note that when fli_rootpath is a null pathname (that is, one with 14201 zero components), the file system designated is at the root of the 14202 specified server, whether or not the FSLI4IF_VAR_SUB flag within the 14203 associated fs_locations_info4 structure is set. 14205 11.18. The Attribute fs_status 14207 In an environment in which multiple copies of the same basic set of 14208 data are available, information regarding the particular source of 14209 such data and the relationships among different copies can be very 14210 helpful in providing consistent data to applications. 
14212 enum fs4_status_type { 14213 STATUS4_FIXED = 1, 14214 STATUS4_UPDATED = 2, 14215 STATUS4_VERSIONED = 3, 14216 STATUS4_WRITABLE = 4, 14217 STATUS4_REFERRAL = 5 14218 }; 14220 struct fs4_status { 14221 bool fss_absent; 14222 fs4_status_type fss_type; 14223 utf8str_cs fss_source; 14224 utf8str_cs fss_current; 14225 int32_t fss_age; 14226 nfstime4 fss_version; 14227 }; 14229 The boolean fss_absent indicates whether the file system is currently 14230 absent. This value will be set if the file system was previously 14231 present and becomes absent, or if the file system has never been 14232 present and the type is STATUS4_REFERRAL. When this boolean is set 14233 and the type is not STATUS4_REFERRAL, the remaining information in 14234 the fs4_status reflects the state that was last valid while the file 14235 system was present. 14237 The fss_type field indicates the kind of file system image 14238 represented. This is of particular importance when using the version 14239 values to determine appropriate succession of file system images. 14240 When fss_absent is set, and the file system was previously present, 14241 the value of fss_type reflected is that from when the file system was 14242 last present. Five values are distinguished: 14244 o STATUS4_FIXED, which indicates a read-only image in the sense that 14245 it will never change. The possibility is allowed that, as a 14246 result of migration or switch to a different image, changed data 14247 can be accessed, but within the confines of this instance, no 14248 change is allowed. The client can use this fact to cache 14249 aggressively.
14251 o STATUS4_UPDATED, which indicates an image that cannot be updated 14252 by the user writing to it but that may be changed externally, 14253 typically because it is a periodically updated copy of another 14254 writable file system somewhere else. In this case, version 14255 information is not provided, and the client does not have the 14256 responsibility of making sure that this version only advances upon 14257 a file system instance transition. In this case, it is the 14258 responsibility of the server to make sure that the data presented 14259 after a file system instance transition is a proper successor 14260 image and includes all changes seen by the client and any change 14261 made before all such changes. 14263 o STATUS4_VERSIONED, which indicates that the image, like the 14264 STATUS4_UPDATED case, is updated externally, but it provides a 14265 guarantee that the server will carefully update an associated 14266 version value so that the client can protect itself from a 14267 situation in which it reads data from one version of the file 14268 system and then later reads data from an earlier version of the 14269 same file system. See below for a discussion of how this can be 14270 done. 14272 o STATUS4_WRITABLE, which indicates that the file system is an 14273 actual writable one. The client need not, of course, actually 14274 write to the file system, but once it does, it should not accept a 14275 transition to anything other than a writable instance of that same 14276 file system. 14278 o STATUS4_REFERRAL, which indicates that the file system in question 14279 is absent and has never been present on this server. 14281 Note that in the STATUS4_UPDATED and STATUS4_VERSIONED cases, the 14282 server is responsible for the appropriate handling of locks that are 14283 inconsistent with external changes to delegations.
If a server gives 14284 out delegations, they SHOULD be recalled before an inconsistent 14285 change is made to the data, and MUST be revoked if this is not 14286 possible. Similarly, if an OPEN is inconsistent with data that is 14287 changed (the OPEN has OPEN4_SHARE_DENY_WRITE/OPEN4_SHARE_DENY_BOTH 14288 and the data is changed), that OPEN SHOULD be considered 14289 administratively revoked. 14291 The opaque strings fss_source and fss_current provide a way of 14292 presenting information about the source of the file system image 14293 being present. It is not intended that the client do anything with 14294 this information other than make it available to administrative 14295 tools. It is intended that this information be helpful when 14296 researching possible problems with a file system image that might 14297 arise when it is unclear if the correct image is being accessed and, 14298 if not, how that image came to be made. This kind of diagnostic 14299 information will be helpful, if, as seems likely, copies of file 14300 systems are made in many different ways (e.g., simple user-level 14301 copies, file-system-level point-in-time copies, clones of the 14302 underlying storage), under a variety of administrative arrangements. 14303 In such environments, determining how a given set of data was 14304 constructed can be very helpful in resolving problems. 14306 The opaque string fss_source is used to indicate the source of a 14307 given file system with the expectation that tools capable of creating 14308 a file system image propagate this information, when possible. It is 14309 understood that this may not always be possible since a user-level 14310 copy may be thought of as creating a new data set and the tools used 14311 may have no mechanism to propagate this data. When a file system is 14312 initially created, it is desirable to associate with it data 14313 regarding how the file system was created, where it was created, who 14314 created it, etc. 
Making this information available in this attribute 14315 in a human-readable string will be helpful for applications and 14316 system administrators and will also serve to make it available when 14317 the original file system is used to make subsequent copies. 14319 The opaque string fss_current should provide whatever information is 14320 available about the source of the current copy. Such information 14321 includes the tool creating it, any relevant parameters to that tool, 14322 the time at which the copy was done, the user making the change, the 14323 server on which the change was made, etc. All information should be 14324 in a human-readable string. 14326 The field fss_age provides an indication of how out-of-date the file 14327 system currently is with respect to its ultimate data source (in case 14328 of cascading data updates). This complements the fls_currency field 14329 of fs_locations_server4 (see Section 11.17) in the following way: the 14330 information in fls_currency gives a bound for how out of date the 14331 data in a file system might typically get, while the value in fss_age 14332 gives a bound on how out-of-date that data actually is. Negative 14333 values imply that no information is available. A zero means that 14334 this data is known to be current. A positive value means that this 14335 data is known to be no older than that number of seconds with respect 14336 to the ultimate data source. Using this value, the client may be 14337 able to decide that a data copy is too old, so that it may search for 14338 a newer version to use. 14340 The fss_version field provides a version identification, in the form 14341 of a time value, such that successive versions always have later time 14342 values. 
When fss_type is anything other than STATUS4_VERSIONED, 14343 the server may provide such a value, but there is no guarantee as to 14344 its validity and clients will not use it except to provide additional 14345 information to add to fss_source and fss_current. 14347 When fss_type is STATUS4_VERSIONED, servers SHOULD provide a value of 14348 fss_version that progresses monotonically whenever any new version of 14349 the data is established. This allows the client, if reliable image 14350 progression is important to it, to fetch this attribute as part of 14351 each COMPOUND where data or metadata from the file system is used. 14353 When it is important to the client to make sure that only valid 14354 successor images are accepted, it must make sure that it does not 14355 read data or metadata from the file system without updating its sense 14356 of the current state of the image. This is to avoid the possibility 14357 that the fs_status that the client holds will be one for an earlier 14358 image, which would cause the client to accept a new file system 14359 instance that is later than that but still earlier than the updated 14360 data read by the client. 14362 In order to accept valid images reliably, the client must do a 14363 GETATTR of the fs_status attribute that follows any interrogation of 14364 data or metadata within the file system in question. Often this is 14365 most conveniently done by appending such a GETATTR after all other 14366 operations that reference a given file system. When errors occur 14367 between reading file system data and performing such a GETATTR, care 14368 must be exercised to make sure that the data in question is not used 14369 before obtaining the proper fs_status value. In this connection, 14370 when an OPEN is done within such a versioned file system and the 14371 associated GETATTR of fs_status is not successfully completed, the 14372 open file in question must not be accessed until that fs_status is 14373 fetched.
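A client's bookkeeping for this procedure might look like the following sketch. The representation of nfstime4 values as (seconds, nseconds) tuples, which compare lexicographically, is an assumption for the example.

```python
STATUS4_VERSIONED = 3   # from the fs4_status_type enum

class ImageVersionTracker:
    """Track the latest fss_version observed via GETATTR so that, on a
    file system transition, a prospective instance can be declined
    unless it is versioned and not earlier than everything already
    read from the predecessor."""
    def __init__(self):
        self.latest = None

    def observe(self, fss_version):
        """Record a version from a completed GETATTR; out-of-order
        responses are folded into the required partial order by
        keeping only the maximum."""
        if self.latest is None or fss_version > self.latest:
            self.latest = fss_version

    def acceptable_successor(self, fss_type, fss_version):
        """Apply the acceptance rule: versioned instances only, and
        never one earlier than the last version obtained."""
        if fss_type != STATUS4_VERSIONED:
            return False
        return self.latest is None or fss_version >= self.latest
```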
14375 The procedure above will ensure that before using any data from the 14376 file system the client has in hand a newly-fetched current version of 14377 the file system image. Multiple values for multiple requests in 14378 flight can be resolved by assembling them into the required partial 14379 order (and the elements should form a total order within the partial 14380 order) and using the last. The client may then, when switching among 14381 file system instances, decline to use an instance that does not have 14382 an fss_type of STATUS4_VERSIONED or whose fss_version field is 14383 earlier than the last one obtained from the predecessor file system 14384 instance. 14386 12. Parallel NFS (pNFS) 14388 12.1. Introduction 14390 pNFS is an OPTIONAL feature within NFSv4.1; the pNFS feature set 14391 allows direct client access to the storage devices containing file 14392 data. When file data for a single NFSv4 server is stored on multiple 14393 and/or higher-throughput storage devices (by comparison to the 14394 server's throughput capability), the result can be significantly 14395 better file access performance. The relationship among multiple 14396 clients, a single server, and multiple storage devices for pNFS 14397 (server and clients have access to all storage devices) is shown in 14398 Figure 1. 14400 +-----------+ 14401 |+-----------+ +-----------+ 14402 ||+-----------+ | | 14403 ||| | NFSv4.1 + pNFS | | 14404 +|| Clients |<------------------------------>| Server | 14405 +| | | | 14406 +-----------+ | | 14407 ||| +-----------+ 14408 ||| | 14409 ||| | 14410 ||| Storage +-----------+ | 14411 ||| Protocol |+-----------+ | 14412 ||+----------------||+-----------+ Control | 14413 |+-----------------||| | Protocol| 14414 +------------------+|| Storage |------------+ 14415 +| Devices | 14416 +-----------+ 14418 Figure 1 14420 In this model, the clients, server, and storage devices are 14421 responsible for managing file access. 
This is in contrast to NFSv4 14422 without pNFS, where it is primarily the server's responsibility; some 14423 of this responsibility may be delegated to the client under strictly 14424 specified conditions. See Section 12.2.5 for a discussion of the 14425 Storage Protocol. See Section 12.2.6 for a discussion of the Control 14426 Protocol. 14428 pNFS takes the form of OPTIONAL operations that manage protocol 14429 objects called 'layouts' (Section 12.2.7) that contain a byte-range 14430 and storage location information. The layout is managed in a similar 14431 fashion as NFSv4.1 data delegations. For example, the layout is 14432 leased, recallable, and revocable. However, layouts are distinct 14433 abstractions and are manipulated with new operations. When a client 14434 holds a layout, it is granted the ability to directly access the 14435 byte-range at the storage location specified in the layout. 14437 There are interactions between layouts and other NFSv4.1 abstractions 14438 such as data delegations and byte-range locking. Delegation issues 14439 are discussed in Section 12.5.5. Byte-range locking issues are 14440 discussed in Sections 12.2.9 and 12.5.1. 14442 12.2. pNFS Definitions 14444 NFSv4.1's pNFS feature provides parallel data access to a file system 14445 that stripes its content across multiple storage servers. The first 14446 instantiation of pNFS, as part of NFSv4.1, separates the file system 14447 protocol processing into two parts: metadata processing and data 14448 processing. Data consist of the contents of regular files that are 14449 striped across storage servers. Data striping occurs in at least two 14450 ways: on a file-by-file basis and, within sufficiently large files, 14451 on a block-by-block basis. In contrast, striped access to metadata 14452 by pNFS clients is not provided in NFSv4.1, even though the file 14453 system back end of a pNFS server might stripe metadata. 
Metadata 14454 consist of everything else, including the contents of non-regular 14455 files (e.g., directories); see Section 12.2.1. The metadata 14456 functionality is implemented by an NFSv4.1 server that supports pNFS 14457 and the operations described in Section 18; such a server is called a 14458 metadata server (Section 12.2.2). 14460 The data functionality is implemented by one or more storage devices, 14461 each of which is accessed by the client via a storage protocol. A 14462 subset (defined in Section 13.6) of NFSv4.1 is one such storage 14463 protocol. New terms are introduced to the NFSv4.1 nomenclature and 14464 existing terms are clarified to allow for the description of the pNFS 14465 feature. 14467 12.2.1. Metadata 14469 Information about a file system object, such as its name, location 14470 within the namespace, owner, ACL, and other attributes. Metadata may 14471 also include storage location information, and this will vary based 14472 on the underlying storage mechanism that is used. 14474 12.2.2. Metadata Server 14476 An NFSv4.1 server that supports the pNFS feature. A variety of 14477 architectural choices exist for the metadata server and its use of 14478 file system information held at the server. Some servers may contain 14479 metadata only for file objects residing at the metadata server, while 14480 the file data resides on associated storage devices. Other metadata 14481 servers may hold both metadata and a varying degree of file data. 14483 12.2.3. pNFS Client 14485 An NFSv4.1 client that supports pNFS operations and supports at least 14486 one storage protocol for performing I/O to storage devices. 14488 12.2.4. Storage Device 14490 A storage device stores a regular file's data, but leaves metadata 14491 management to the metadata server.
A storage device could be another 14492 NFSv4.1 server, an object-based storage device (OSD), a block device 14493 accessed over a System Area Network (SAN, e.g., either FiberChannel 14494 or iSCSI SAN), or some other entity. 14496 12.2.5. Storage Protocol 14498 As noted in Figure 1, the storage protocol is the method used by the 14499 client to store and retrieve data directly from the storage devices. 14501 The NFSv4.1 pNFS feature has been structured to allow for a variety 14502 of storage protocols to be defined and used. One example storage 14503 protocol is NFSv4.1 itself (as documented in Section 13). Other 14504 options for the storage protocol are described elsewhere and include: 14506 o Block/volume protocols such as Internet SCSI (iSCSI) [55] and FCP 14507 [56]. The block/volume protocol support can be independent of the 14508 addressing structure of the block/volume protocol used, allowing 14509 more than one protocol to access the same file data and enabling 14510 extensibility to other block/volume protocols. See [47] for a 14511 layout specification that allows pNFS to use block/volume storage 14512 protocols. 14514 o Object protocols such as OSD over iSCSI or Fibre Channel [57]. 14515 See [46] for a layout specification that allows pNFS to use object 14516 storage protocols. 14518 It is possible that various storage protocols are available to both 14519 client and server and it may be possible that a client and server do 14520 not have a matching storage protocol available to them. Because of 14521 this, the pNFS server MUST support normal NFSv4.1 access to any file 14522 accessible by the pNFS feature; this will allow for continued 14523 interoperability between an NFSv4.1 client and server. 14525 12.2.6. Control Protocol 14527 As noted in Figure 1, the control protocol is used by the exported 14528 file system between the metadata server and storage devices. 14529 Specification of such protocols is outside the scope of the NFSv4.1 14530 protocol. 
Such control protocols would be used to control activities 14531 such as the allocation and deallocation of storage, the management of 14532 state required by the storage devices to perform client access 14533 control, and, depending on the storage protocol, the enforcement of 14534 authentication and authorization so that restrictions that would be 14535 enforced by the metadata server are also enforced by the storage 14536 device. 14538 A particular control protocol is not REQUIRED by NFSv4.1 but 14539 requirements are placed on the control protocol for maintaining 14540 attributes like modify time, the change attribute, and the end-of- 14541 file (EOF) position. Note that if pNFS is layered over a clustered, 14542 parallel file system (e.g., PVFS [58]), the mechanisms that enable 14543 clustering and parallelism in that file system can be considered the 14544 control protocol. 14546 12.2.7. Layout Types 14548 A layout describes the mapping of a file's data to the storage 14549 devices that hold the data. A layout is said to belong to a specific 14550 layout type (data type layouttype4, see Section 3.3.13). The layout 14551 type allows for variants to handle different storage protocols, such 14552 as those associated with block/volume [47], object [46], and file 14553 (Section 13) layout types. A metadata server, along with its control 14554 protocol, MUST support at least one layout type. A private sub-range 14555 of the layout type namespace is also defined. Values from the 14556 private layout type range MAY be used for internal testing or 14557 experimentation (see Section 3.3.13). 14559 As an example, the organization of the file layout type could be an 14560 array of tuples (e.g., device ID, filehandle), along with a 14561 definition of how the data is stored across the devices (e.g., 14562 striping). 
A block/volume layout might be an array of tuples that 14563 store <deviceID, block_number, block_count> along with information 14564 about block size and the associated file offset of the block number. 14565 An object layout might be an array of tuples <deviceID, objectID> 14566 and an additional structure (i.e., the aggregation map) that defines 14567 how the logical byte sequence of the file data is serialized into the 14568 different objects. Note that the actual layouts are typically more 14569 complex than these simple expository examples. 14571 Requests for pNFS-related operations will often specify a layout 14572 type. Examples of such operations are GETDEVICEINFO and LAYOUTGET. 14573 The response for these operations will include structures such as a 14574 device_addr4 or a layout4, each of which includes a layout type 14575 within it. The layout type sent by the server MUST always be the 14576 same one requested by the client. When a server sends a response 14577 that includes a different layout type, the client SHOULD ignore the 14578 response and behave as if the server had returned an error response. 14580 12.2.8. Layout 14582 A layout defines how a file's data is organized on one or more 14583 storage devices. There are many potential layout types; each of the 14584 layout types is differentiated by the storage protocol used to 14585 access data and by the aggregation scheme that lays out the file data 14586 on the underlying storage devices. A layout is precisely identified 14587 by the tuple <client ID, filehandle, layout type, iomode, range>, 14588 where filehandle refers to the filehandle of the file on the metadata 14589 server. 14591 It is important to define when layouts overlap and/or conflict with 14592 each other. For two layouts with overlapping byte-ranges to actually 14593 overlap each other, both layouts must be of the same layout type, 14594 correspond to the same filehandle, and have the same iomode.
Layouts 14595 conflict when they overlap and differ in the content of the layout 14596 (i.e., the storage device/file mapping parameters differ). Note that 14597 differing iomodes do not lead to conflicting layouts. It is 14598 permissible for layouts with different iomodes, pertaining to the 14599 same byte-range, to be held by the same client. An example of this 14600 would be copy-on-write functionality for a block/volume layout type. 14602 12.2.9. Layout Iomode 14604 The layout iomode (data type layoutiomode4, see Section 3.3.20) 14605 indicates to the metadata server the client's intent to perform 14606 either just READ operations or a mixture containing READ and WRITE 14607 operations. For certain layout types, it is useful for a client to 14608 specify this intent at the time it sends LAYOUTGET (Section 18.43). 14609 For example, for block/volume-based protocols, block allocation could 14610 occur when a LAYOUTIOMODE4_RW iomode is specified. A special 14611 LAYOUTIOMODE4_ANY iomode is defined and can only be used for 14612 LAYOUTRETURN and CB_LAYOUTRECALL, not for LAYOUTGET. It specifies 14613 that layouts pertaining to both LAYOUTIOMODE4_READ and 14614 LAYOUTIOMODE4_RW iomodes are being returned or recalled, 14615 respectively. 14617 A storage device may validate I/O with regard to the iomode; this is 14618 dependent upon storage device implementation and layout type. Thus, 14619 if the client's layout iomode is inconsistent with the I/O being 14620 performed, the storage device may reject the client's I/O with an 14621 error indicating that a new layout with the correct iomode should be 14622 obtained via LAYOUTGET. For example, if a client gets a layout with 14623 a LAYOUTIOMODE4_READ iomode and performs a WRITE to a storage device, 14624 the storage device is allowed to reject that WRITE. 
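The layout overlap and conflict rules of Section 12.2.8, including the iomode's role in them, can be expressed as simple predicates. The dictionary representation of a layout below is an illustrative assumption, not the protocol's encoding.

```python
def ranges_overlap(r1, r2):
    """Byte-ranges modeled as (offset, length) pairs."""
    return r1[0] < r2[0] + r2[1] and r2[0] < r1[0] + r1[1]

def layouts_overlap(l1, l2):
    """Two layouts overlap only when they are of the same layout type,
    correspond to the same filehandle, have the same iomode, and cover
    intersecting byte-ranges."""
    return (l1["type"] == l2["type"]
            and l1["fh"] == l2["fh"]
            and l1["iomode"] == l2["iomode"]
            and ranges_overlap(l1["range"], l2["range"]))

def layouts_conflict(l1, l2):
    """Layouts conflict when they overlap yet differ in content, i.e.,
    the storage device/file mapping parameters differ."""
    return layouts_overlap(l1, l2) and l1["mapping"] != l2["mapping"]
```

Note how the iomode test makes the copy-on-write case above fall out naturally: two layouts for the same byte-range but with different iomodes never overlap, so they can never conflict.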
14626 The use of the layout iomode does not conflict with OPEN share modes 14627 or byte-range LOCK operations; open share mode and byte-range lock 14628 conflicts are enforced as they are without the use of pNFS and are 14629 logically separate from the pNFS layout level. Open share modes and 14630 byte-range locks are the preferred method for restricting user access 14631 to data files. For example, an OPEN of OPEN4_SHARE_ACCESS_WRITE does 14632 not conflict with a LAYOUTGET containing an iomode of 14633 LAYOUTIOMODE4_RW performed by another client. Applications that 14634 depend on writing into the same file concurrently may use byte-range 14635 locking to serialize their accesses. 14637 12.2.10. Device IDs 14639 The device ID (data type deviceid4, see Section 3.3.14) identifies a 14640 group of storage devices. The scope of a device ID is the pair 14641 . In practice, a significant amount of 14642 information may be required to fully address a storage device. 14643 Rather than embedding all such information in a layout, layouts embed 14644 device IDs. The NFSv4.1 operation GETDEVICEINFO (Section 18.40) is 14645 used to retrieve the complete address information (including all 14646 device addresses for the device ID) regarding the storage device 14647 according to its layout type and device ID. For example, the address 14648 of an NFSv4.1 data server or of an object-based storage device could 14649 be an IP address and port. The address of a block storage device 14650 could be a volume label. 14652 Clients cannot expect the mapping between a device ID and its storage 14653 device address(es) to persist across metadata server restart. See 14654 Section 12.7.4 for a description of how recovery works in that 14655 situation. 14657 A device ID lives as long as there is a layout referring to the 14658 device ID. If there are no layouts referring to the device ID, the 14659 server is free to delete the device ID any time. 
Once a device ID is deleted by the server, the server MUST NOT reuse the device ID for the same layout type and client ID again.  This requirement is feasible because the device ID is 16 bytes long, leaving sufficient room to store a generation number if the server's implementation requires most of the rest of the device ID's content to be reused.  This requirement is necessary because otherwise the race conditions between asynchronous notification of device ID addition and deletion would be too difficult to sort out.

Device ID to device address mappings are not leased, and can be changed at any time.  (Note that while device ID to device address mappings are likely to change after the metadata server restarts, the server is not required to change the mappings.)  A server has two choices for changing mappings.  It can recall all layouts referring to the device ID or it can use a notification mechanism.

The NFSv4.1 protocol has no optimal way to recall all layouts that referred to a particular device ID (unless the server associates a single device ID with a single fsid or a single client ID; in which case, CB_LAYOUTRECALL has options for recalling all layouts associated with the fsid, client ID pair, or just the client ID).

Via a notification mechanism (see Section 20.12), device ID to device address mappings can change over the duration of server operation without recalling or revoking the layouts that refer to the device ID.

The notification mechanism can also delete a device ID, but only if the client has no layouts referring to the device ID.  A notification of a change to a device ID to device address mapping will immediately or eventually invalidate some or all of the device ID's mappings.  The server MUST support notifications and the client must request them before they can be used.
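The generation-number technique described earlier in this section (fitting a generation counter into the 16-byte deviceid4 so that a deleted device ID is never reused for the same layout type and client ID) might be sketched as follows.  The 12-byte/4-byte split is an illustrative assumption; the protocol does not mandate any internal structure for the 16 bytes.

```python
import struct


def make_deviceid(base12: bytes, generation: int) -> bytes:
    """Pack a 16-byte deviceid4 from 12 bytes of server-chosen content
    plus a 4-byte big-endian generation number.  The split is an
    assumption for illustration; servers may divide the 16 bytes
    however their implementation requires."""
    assert len(base12) == 12
    return base12 + struct.pack(">I", generation)


# Reusing "most of the rest of the device ID's content" while bumping
# the generation still yields a device ID that was never handed out
# before, satisfying the MUST NOT reuse requirement.
d1 = make_deviceid(b"pool-0/vol-7", 1)
d2 = make_deviceid(b"pool-0/vol-7", 2)
```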
For further information about the notification types, see Section 20.12.

12.3.  pNFS Operations

NFSv4.1 has several operations that are needed for pNFS servers, regardless of layout type or storage protocol.  These operations are all sent to a metadata server and summarized here.  While pNFS is an OPTIONAL feature, if pNFS is implemented, some operations are REQUIRED in order to comply with pNFS.  See Section 17.

These are the fore channel pNFS operations:

GETDEVICEINFO (Section 18.40), as noted previously (Section 12.2.10), returns the mapping of device ID to storage device address.

GETDEVICELIST (Section 18.41) allows clients to fetch all device IDs for a specific file system.

LAYOUTGET (Section 18.43) is used by a client to get a layout for a file.

LAYOUTCOMMIT (Section 18.42) is used to inform the metadata server of the client's intent to commit data that has been written to the storage device (the storage device as originally indicated in the return value of LAYOUTGET).

LAYOUTRETURN (Section 18.44) is used to return layouts for a file, a file system ID (FSID), or a client ID.

These are the backchannel pNFS operations:

CB_LAYOUTRECALL (Section 20.3) recalls a layout, all layouts belonging to a file system, or all layouts belonging to a client ID.

CB_RECALL_ANY (Section 20.6) tells a client that it needs to return some number of recallable objects, including layouts, to the metadata server.

CB_RECALLABLE_OBJ_AVAIL (Section 20.7) tells a client that a recallable object that it was denied (in case of pNFS, a layout denied by LAYOUTGET) due to resource exhaustion is now available.

CB_NOTIFY_DEVICEID (Section 20.12) notifies the client of changes to device IDs.

12.4.  pNFS Attributes

A number of attributes specific to pNFS are listed and described in Section 5.12.
12.5.  Layout Semantics

12.5.1.  Guarantees Provided by Layouts

Layouts grant to the client the ability to access data located at a storage device with the appropriate storage protocol.  The client is guaranteed the layout will be recalled when one of two things occurs: either a conflicting layout is requested or the state encapsulated by the layout becomes invalid (this can happen when an event directly or indirectly modifies the layout).  When a layout is recalled and returned by the client, the client continues with the ability to access file data with normal NFSv4.1 operations through the metadata server.  Only the ability to access the storage devices is affected.

The requirement of NFSv4.1 that all user access rights MUST be obtained through the appropriate OPEN, LOCK, and ACCESS operations is not modified by the existence of layouts.  Layouts are provided to NFSv4.1 clients, and user access still follows the rules of the protocol as if they did not exist.  For a client to access a storage device, it is a requirement that the client hold a layout.  If a storage device receives an I/O request for a byte-range for which the client does not hold a layout, the storage device SHOULD reject that I/O request.  Note that the act of modifying a file for which a layout is held does not necessarily conflict with the holding of the layout that describes the file being modified.  Therefore, it is the requirement of the storage protocol or layout type that determines the necessary behavior.  For example, block/volume layout types require that the layout's iomode agree with the type of I/O being performed.

Depending upon the layout type and storage protocol in use, storage device access permissions may be granted by LAYOUTGET and may be encoded within the type-specific layout.
For an example of storage device access permissions, see an object-based protocol such as [57].  If access permissions are encoded within the layout, the metadata server SHOULD recall the layout when those permissions become invalid for any reason -- for example, when a file becomes unwritable or inaccessible to a client.  Note that clients are still required to perform the appropriate OPEN, LOCK, and ACCESS operations as described above.  The degree to which it is possible for the client to circumvent these operations and the consequences of doing so must be clearly specified by the individual layout type specifications.  In addition, these specifications must be clear about the requirements and non-requirements for the checking performed by the server.

In the presence of pNFS functionality, mandatory byte-range locks MUST behave as they would without pNFS.  Therefore, if mandatory file locks and layouts are provided simultaneously, the storage device MUST be able to enforce the mandatory byte-range locks.  For example, if one client obtains a mandatory byte-range lock and a second client accesses the storage device, the storage device MUST appropriately restrict I/O for the range of the mandatory byte-range lock.  If the storage device is incapable of providing this check in the presence of mandatory byte-range locks, then the metadata server MUST NOT grant layouts and mandatory byte-range locks simultaneously.

12.5.2.  Getting a Layout

A client obtains a layout with the LAYOUTGET operation.  The metadata server will grant layouts of a particular type (e.g., block/volume, object, or file).  The client selects an appropriate layout type that the server supports and the client is prepared to use.
The layout returned to the client might not exactly match the requested byte-range as described in Section 18.43.3.  As needed, a client may send multiple LAYOUTGET operations; these might result in multiple overlapping, non-conflicting layouts (see Section 12.2.8).

In order to get a layout, the client must first have opened the file via the OPEN operation.  When a client has no layout on a file, it MUST present an open stateid, a delegation stateid, or a byte-range lock stateid in the loga_stateid argument.  A successful LAYOUTGET result includes a layout stateid.  The first successful LAYOUTGET processed by the server using a non-layout stateid as an argument MUST have the "seqid" field of the layout stateid in the response set to one.  Thereafter, the client MUST use a layout stateid (see Section 12.5.3) on future invocations of LAYOUTGET on the file, and the "seqid" MUST NOT be set to zero.  Once the layout has been retrieved, it can be held across multiple OPEN and CLOSE sequences.  Therefore, a client may hold a layout for a file that is not currently open by any user on the client.  This allows for the caching of layouts beyond CLOSE.

The storage protocol used by the client to access the data on the storage device is determined by the layout's type.  The client is responsible for matching the layout type with an available method to interpret and use the layout.  The method for this layout type selection is outside the scope of the pNFS functionality.

Although the metadata server is in control of the layout for a file, the pNFS client can provide hints to the server when a file is opened or created about the preferred layout type and aggregation schemes.
pNFS introduces a layout_hint attribute (Section 5.12.4) that the client can set at file creation time to provide a hint to the server for new files.  Setting this attribute separately, after the file has been created, might make it difficult, or impossible, for the server implementation to comply.

Because the EXCLUSIVE4 createmode4 does not allow the setting of attributes at file creation time, NFSv4.1 introduces the EXCLUSIVE4_1 createmode4, which does allow attributes to be set at file creation time.  In addition, if the session is created with persistent reply caches, EXCLUSIVE4_1 is neither necessary nor allowed.  Instead, GUARDED4 both works better and is prescribed.  Table 10 in Section 18.16.3 summarizes how a client is allowed to send an exclusive create.

12.5.3.  Layout Stateid

As with all other stateids, the layout stateid consists of a "seqid" and "other" field.  Once a layout stateid is established, the "other" field will stay constant unless the stateid is revoked or the client returns all layouts on the file and the server disposes of the stateid.  The "seqid" field is initially set to one, and is never zero on any NFSv4.1 operation that uses layout stateids, whether it is a fore channel or backchannel operation.  After the layout stateid is established, the server increments by one the value of the "seqid" in each subsequent LAYOUTGET and LAYOUTRETURN response, and in each CB_LAYOUTRECALL request.

Given the design goal of pNFS to provide parallelism, the layout stateid differs from other stateid types in that the client is expected to send LAYOUTGET and LAYOUTRETURN operations in parallel.  The "seqid" value is used by the client to properly sort responses to LAYOUTGET and LAYOUTRETURN.  The "seqid" is also used to prevent race conditions between LAYOUTGET and CB_LAYOUTRECALL.
Given that the processing rules differ between layout stateids and other stateid types, only the pNFS sections of this document should be considered to determine proper layout stateid handling.

Once the client receives a layout stateid, it MUST use the correct "seqid" for subsequent LAYOUTGET or LAYOUTRETURN operations.  The correct "seqid" is defined as the highest "seqid" value from responses of fully processed LAYOUTGET or LAYOUTRETURN operations or arguments of a fully processed CB_LAYOUTRECALL operation.  Since the server is incrementing the "seqid" value on each layout operation, the client may determine the order of operation processing by inspecting the "seqid" value.  In the case of overlapping layout ranges, the ordering information will provide the client the knowledge of which layout ranges are held.  Note that overlapping layout ranges may occur because of the client's specific requests or because the server is allowed to expand the range of a requested layout and notify the client in the LAYOUTRETURN results.  Additional layout stateid sequencing requirements are provided in Section 12.5.5.2.

The client's receipt of a "seqid" is not sufficient for subsequent use.  The client must fully process the operations before the "seqid" can be used.  For LAYOUTGET results, if the client is not using the forgetful model (Section 12.5.5.1), it MUST first update its record of what ranges of the file's layout it has before using the seqid.  For LAYOUTRETURN results, the client MUST delete the range from its record of what ranges of the file's layout it had before using the seqid.  For CB_LAYOUTRECALL arguments, the client MUST send a response to the recall before using the seqid.  The fundamental requirement in client processing is that the "seqid" is used to provide the order of processing.
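The "correct seqid" rule above (the highest "seqid" among fully processed operations) could be modeled as in the following client-side sketch; the class and method names are ours, not part of the protocol.

```python
class LayoutStateid:
    """Minimal model of client-side layout stateid tracking.  The
    usable "seqid" is the highest one taken from fully processed
    LAYOUTGET/LAYOUTRETURN responses or CB_LAYOUTRECALL arguments."""

    def __init__(self, other: bytes):
        self.other = other   # stays constant while layouts are held
        self.seqid = 0       # no operation fully processed yet

    def operation_fully_processed(self, seqid: int) -> None:
        # Called only after the client has updated (or deleted) its
        # record of held ranges, or has responded to the recall, as
        # required before a received "seqid" may be used.
        self.seqid = max(self.seqid, seqid)


sid = LayoutStateid(other=b"\x01" * 12)
for s in (2, 1, 3):          # replies may be fully processed out of order
    sid.operation_fully_processed(s)
```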
LAYOUTGET results may be processed in parallel.  LAYOUTRETURN results may be processed in parallel.  LAYOUTGET and LAYOUTRETURN responses may be processed in parallel as long as the ranges do not overlap.  CB_LAYOUTRECALL requests MUST be processed in "seqid" order at all times.

Once a client has no more layouts on a file, the layout stateid is no longer valid and MUST NOT be used.  Any attempt to use such a layout stateid will result in NFS4ERR_BAD_STATEID.

12.5.4.  Committing a Layout

Allowing for varying storage protocol capabilities, the pNFS protocol does not require the metadata server and storage devices to have a consistent view of file attributes and data location mappings.  Data location mapping refers to aspects such as which offsets store data as opposed to storing holes (see Section 13.4.4 for a discussion).  Related issues arise for storage protocols where a layout may hold provisionally allocated blocks where the allocation of those blocks does not survive a complete restart of both the client and server.  Because of this inconsistency, it is necessary to resynchronize the client with the metadata server and its storage devices and make any potential changes available to other clients.  This is accomplished by use of the LAYOUTCOMMIT operation.

The LAYOUTCOMMIT operation is responsible for committing a modified layout to the metadata server.  The data should be written and committed to the appropriate storage devices before the LAYOUTCOMMIT occurs.  The scope of the LAYOUTCOMMIT operation depends on the storage protocol in use.  It is important to note that the level of synchronization is from the point of view of the client that sent the LAYOUTCOMMIT.
The updated state on the metadata server need only reflect the state as of the client's last operation previous to the LAYOUTCOMMIT.  The metadata server is not REQUIRED to maintain a global view that accounts for other clients' I/O that may have occurred within the same time frame.

For block/volume-based layouts, LAYOUTCOMMIT may require updating the block list that comprises the file and committing this layout to stable storage.  For file-based layouts, synchronization of attributes between the metadata and storage devices, primarily the size attribute, is required.

The control protocol is free to synchronize the attributes before it receives a LAYOUTCOMMIT; however, upon successful completion of a LAYOUTCOMMIT, state that exists on the metadata server that describes the file MUST be synchronized with the state that exists on the storage devices that comprise that file as of the client's last sent operation.  Thus, a client that queries the size of a file between a WRITE to a storage device and the LAYOUTCOMMIT might observe a size that does not reflect the actual data written.

The client MUST have a layout in order to send a LAYOUTCOMMIT operation.

12.5.4.1.  LAYOUTCOMMIT and change/time_modify

The change and time_modify attributes may be updated by the server when the LAYOUTCOMMIT operation is processed.  The reason for this is that some layout types do not support the update of these attributes when the storage devices process I/O operations.  If a client has a layout with the LAYOUTIOMODE4_RW iomode on the file, the client MAY provide a suggested value to the server for time_modify within the arguments to LAYOUTCOMMIT.  Based on the layout type, the provided value may or may not be used.  The server should sanity-check the client-provided values before they are used.
For example, the server should ensure that time does not flow backwards.  The client always has the option to set time_modify through an explicit SETATTR operation.

For some layout protocols, the storage device is able to notify the metadata server of the occurrence of an I/O; as a result, the change and time_modify attributes may be updated at the metadata server.  For a metadata server that is capable of monitoring updates to the change and time_modify attributes, LAYOUTCOMMIT processing is not required to update the change attribute.  In this case, the metadata server must ensure that no further update to the data has occurred since the last update of the attributes; file-based protocols may have enough information to make this determination or may update the change attribute upon each file modification.  This also applies for the time_modify attribute.  If the server implementation is able to determine that the file has not been modified since the last time_modify update, the server need not update time_modify at LAYOUTCOMMIT.  At LAYOUTCOMMIT completion, the updated attributes should be visible if that file was modified since the latest previous LAYOUTCOMMIT or LAYOUTGET.

12.5.4.2.  LAYOUTCOMMIT and size

The size of a file may be updated when the LAYOUTCOMMIT operation is used by the client.  One of the fields in the argument to LAYOUTCOMMIT is loca_last_write_offset; this field indicates the highest byte offset written but not yet committed with the LAYOUTCOMMIT operation.  The data type of loca_last_write_offset is newoffset4 and is switched on a boolean value, no_newoffset, that indicates if a previous write occurred or not.  If no_newoffset is FALSE, an offset is not given.
If the client has a layout with LAYOUTIOMODE4_RW iomode on the file, with a byte-range (denoted by the values of lo_offset and lo_length) that overlaps loca_last_write_offset, then the client MAY set no_newoffset to TRUE and provide an offset that will update the file size.  Keep in mind that offset is not the same as length, though they are related.  For example, a loca_last_write_offset value of zero means that one byte was written at offset zero, and so the length of the file is at least one byte.

The metadata server may do one of the following:

1.  Update the file's size using the last write offset provided by the client as either the true file size or as a hint of the file size.  If the metadata server has a method available, any new value for file size should be sanity-checked.  For example, the file must not be truncated if the client presents a last write offset less than the file's current size.

2.  Ignore the client-provided last write offset; the metadata server must have sufficient knowledge from other sources to determine the file's size.  For example, the metadata server queries the storage devices with the control protocol.

The method chosen to update the file's size will depend on the storage device's and/or the control protocol's capabilities.  For example, if the storage devices are block devices with no knowledge of file size, the metadata server must rely on the client to set the last write offset appropriately.

The results of LAYOUTCOMMIT contain a new size value in the form of a newsize4 union data type.  If the file's size is set as a result of LAYOUTCOMMIT, the metadata server must reply with the new size; otherwise, the new size is not provided.
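As a worked example of option 1 above: since loca_last_write_offset is an offset rather than a length, the candidate size is the offset plus one, and the sanity check refuses to truncate the file.  A minimal sketch (the helper name is ours):

```python
def size_after_layoutcommit(current_size: int, no_newoffset: bool,
                            last_write_offset: int) -> int:
    """Candidate file size when the metadata server chooses to trust
    the client's loca_last_write_offset (option 1 above)."""
    if not no_newoffset:
        return current_size               # no offset was provided
    candidate = last_write_offset + 1     # offset 0 => size is >= 1 byte
    # Sanity check: never truncate based on a last write offset that
    # is less than the file's current size.
    return max(current_size, candidate)
```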
If the file size is updated, the metadata server SHOULD update the storage devices such that the new file size is reflected when LAYOUTCOMMIT processing is complete.  For example, the client should be able to read up to the new file size.

The client can extend the length of a file or truncate a file by sending a SETATTR operation to the metadata server with the size attribute specified.  If the size specified is larger than the current size of the file, the file is "zero extended", i.e., zeros are implicitly added between the file's previous EOF and the new EOF.  (In many implementations, the zero-extended byte-range of the file consists of unallocated holes in the file.)  When the client writes past EOF via WRITE, the SETATTR operation does not need to be used.

12.5.4.3.  LAYOUTCOMMIT and layoutupdate

The LAYOUTCOMMIT argument contains a loca_layoutupdate field (Section 18.42.1) of data type layoutupdate4 (Section 3.3.18).  This argument is a layout-type-specific structure.  The structure can be used to pass arbitrary layout-type-specific information from the client to the metadata server at LAYOUTCOMMIT time.  For example, if using a block/volume layout, the client can indicate to the metadata server which reserved or allocated blocks the client used or did not use.  The content of loca_layoutupdate (field lou_body) need not be the same layout-type-specific content returned by LAYOUTGET (Section 18.43.2) in the loc_body field of the lo_content field of the logr_layout field.  The content of loca_layoutupdate is defined by the layout type specification and is opaque to LAYOUTCOMMIT.

12.5.5.  Recalling a Layout

Since a layout protects a client's access to a file via a direct client-storage-device path, a layout need only be recalled when it is semantically unable to serve this function.
Typically, this occurs when the layout no longer encapsulates the true location of the file over the byte-range it represents.  Any operation or action, such as server-driven restriping or load balancing, that changes the layout will result in a recall of the layout.  A layout is recalled by the CB_LAYOUTRECALL callback operation (see Section 20.3) and returned with LAYOUTRETURN (see Section 18.44).  The CB_LAYOUTRECALL operation may recall a layout identified by a byte-range, all layouts associated with a file system ID (FSID), or all layouts associated with a client ID.  Section 12.5.5.2 discusses sequencing issues surrounding the getting, returning, and recalling of layouts.

An iomode is also specified when recalling a layout.  Generally, the iomode in the recall request must match the layout being returned; for example, a recall with an iomode of LAYOUTIOMODE4_RW should cause the client to only return LAYOUTIOMODE4_RW layouts and not LAYOUTIOMODE4_READ layouts.  However, a special LAYOUTIOMODE4_ANY enumeration is defined to enable recalling a layout of any iomode; in other words, the client must return both LAYOUTIOMODE4_READ and LAYOUTIOMODE4_RW layouts.

A REMOVE operation SHOULD cause the metadata server to recall the layout to prevent the client from accessing a non-existent file and to reclaim state stored on the client.  Since a REMOVE may be delayed until the last close of the file has occurred, the recall may also be delayed until this time.  After the last reference on the file has been released and the file has been removed, the client should no longer be able to perform I/O using the layout.  In the case of a file-based layout, the data server SHOULD return NFS4ERR_STALE in response to any operation on the removed file.
Once a layout has been returned, the client MUST NOT send I/Os to the storage devices for the file, byte-range, and iomode represented by the returned layout.  If a client does send an I/O to a storage device for which it does not hold a layout, the storage device SHOULD reject the I/O.

Although pNFS does not alter the file data caching capabilities of clients, or their semantics, it recognizes that some clients may perform more aggressive write-behind caching to optimize the benefits provided by pNFS.  However, write-behind caching may negatively affect the latency in returning a layout in response to a CB_LAYOUTRECALL; this is similar to file delegations and the impact that file data caching has on DELEGRETURN.  Client implementations SHOULD limit the amount of unwritten data they have outstanding at any one time in order to prevent excessively long responses to CB_LAYOUTRECALL.  Once a layout is recalled, a server MUST wait one lease period before taking further action.  As soon as a lease period has passed, the server may choose to fence the client's access to the storage devices if the server perceives that the client has taken too long to return a layout.  However, just as in the case of data delegation and DELEGRETURN, the server may choose to wait, given that the client is showing forward progress on its way to returning the layout.  This forward progress can take the form of successful interaction with the storage devices or of sub-portions of the layout being returned by the client.  The server can also limit exposure to these problems by limiting the byte-ranges initially provided in the layouts and thus the amount of outstanding modified data.

12.5.5.1.  Layout Recall Callback Robustness

It has been assumed thus far that pNFS client state (layout ranges and iomode) for a file exactly matches that of the pNFS server for that file.  This assumption leads to the implication that any callback results in a LAYOUTRETURN or set of LAYOUTRETURNs that exactly match the range in the callback, since both client and server agree about the state being maintained.  However, it can be useful if this assumption does not always hold.  For example:

o  If conflicts that require callbacks are very rare, and a server can use a multi-file callback to recover per-client resources (e.g., via an FSID recall or a multi-file recall within a single CB_COMPOUND), the result may be significantly less client-server pNFS traffic.

o  It may be useful for servers to maintain information about what ranges are held by a client on a coarse-grained basis, leading to the server's layout ranges being beyond those actually held by the client.  In the extreme, a server could manage conflicts on a per-file basis, only sending whole-file callbacks even though clients may request and be granted sub-file ranges.

o  It may be useful for clients to "forget" details about what layouts and ranges the client actually has, leading to the server's layout ranges being beyond those that the client "thinks" it has.  As long as the client does not assume it has layouts that are beyond what the server has granted, this is a safe practice.  When a client forgets what ranges and layouts it has, and it receives a CB_LAYOUTRECALL operation, the client MUST follow up with a LAYOUTRETURN for what the server recalled, or alternatively return the NFS4ERR_NOMATCHING_LAYOUT error if it has no layout to return in the recalled range.
o  In order to avoid errors, it is vital that a client not assign itself layout permissions beyond what the server has granted, and that the server not forget layout permissions that have been granted.  On the other hand, if a server believes that a client holds a layout that the client does not know about, it is useful for the client to cleanly indicate completion of the requested recall either by sending a LAYOUTRETURN operation for the entire requested range or by returning an NFS4ERR_NOMATCHING_LAYOUT error to the CB_LAYOUTRECALL.

Thus, in light of the above, it is useful for a server to be able to send callbacks for layout ranges it has not granted to a client, and for a client to return ranges it does not hold.  A pNFS client MUST always return layouts that comprise the full range specified by the recall.  Note that the full recalled layout range need not be returned as part of a single operation, but may be returned in portions.  This allows the client to stage the flushing of dirty data and the commits and returns of layouts.  Also, it indicates to the metadata server that the client is making progress.

When a layout is returned, the client MUST NOT have any outstanding I/O requests to the storage devices involved in the layout.  Rephrasing, the client MUST NOT return the layout while it has outstanding I/O requests to the storage device.

Even with this requirement for the client, it is possible that I/O requests may be presented to a storage device no longer allowed to perform them.  Since the server has no strict control as to when the client will return the layout, the server may later decide to unilaterally revoke the client's access to the storage devices as provided by the layout.
In choosing to revoke access, the server must deal with the possibility of lingering I/O requests, i.e., I/O requests that are still in flight to storage devices identified by the revoked layout.  All layout type specifications MUST define whether unilateral layout revocation by the metadata server is supported; if it is, the specification must also describe how lingering writes are processed.  For example, storage devices identified by the revoked layout could be fenced off from the client that held the layout.

In order to ensure client/server convergence with regard to layout state, the final LAYOUTRETURN operation in a sequence of LAYOUTRETURN operations for a particular recall MUST specify the entire range being recalled, echoing the recalled layout type, iomode, recall/return type (FILE, FSID, or ALL), and byte-range, even if layouts pertaining to partial ranges were previously returned.  In addition, if the client holds no layouts that overlap the range being recalled, the client should return the NFS4ERR_NOMATCHING_LAYOUT error code to CB_LAYOUTRECALL.  This allows the server to update its view of the client's layout state.

12.5.5.2.  Sequencing of Layout Operations

As with other stateful operations, pNFS requires the correct sequencing of layout operations.  pNFS uses the "seqid" in the layout stateid to provide the correct sequencing between regular operations and callbacks.  It is the server's responsibility to avoid inconsistencies regarding the layouts provided and the client's responsibility to properly serialize its layout requests and layout returns.

12.5.5.2.1.  Layout Recall and Return Sequencing

One critical issue with regard to layout operations sequencing concerns callbacks.
The protocol must defend against races between 15221 the reply to a LAYOUTGET or LAYOUTRETURN operation and a subsequent 15222 CB_LAYOUTRECALL.  A client MUST NOT process a CB_LAYOUTRECALL that 15223 implies one or more outstanding LAYOUTGET or LAYOUTRETURN operations 15224 to which the client has not yet received a reply.  The client detects 15225 such a CB_LAYOUTRECALL by examining the "seqid" field of the recall's 15226 layout stateid.  If the "seqid" is not exactly one higher than what 15227 the client currently has recorded, and the client has at least one 15228 LAYOUTGET and/or LAYOUTRETURN operation outstanding, the client knows 15229 the server sent the CB_LAYOUTRECALL after sending a response to an 15230 outstanding LAYOUTGET or LAYOUTRETURN.  The client MUST wait before 15231 processing such a CB_LAYOUTRECALL until it has processed all replies 15232 for outstanding LAYOUTGET and LAYOUTRETURN operations for the 15233 corresponding file with seqid less than the seqid given by 15234 CB_LAYOUTRECALL (lor_stateid; see Section 20.3). 15236 In addition to the seqid-based mechanism, Section 2.10.6.3 describes 15237 the sessions mechanism for allowing the client to detect callback 15238 race conditions and delay processing such a CB_LAYOUTRECALL.  The 15239 server MAY reference conflicting operations in the CB_SEQUENCE that 15240 precedes the CB_LAYOUTRECALL.  Because the server has already sent 15241 replies for these operations before sending the callback, the replies 15242 may race with the CB_LAYOUTRECALL.  The client MUST wait for all the 15243 referenced calls to complete and update its view of the layout state 15244 before processing the CB_LAYOUTRECALL. 15246 12.5.5.2.1.1.  Get/Return Sequencing 15248 The protocol allows the client to send concurrent LAYOUTGET and 15249 LAYOUTRETURN operations to the server.  The protocol does not provide 15250 any means for the server to process the requests in the same order in 15251 which they were created.
However, through the use of the "seqid" 15252 field in the layout stateid, the client can determine the order in 15253 which parallel outstanding operations were processed by the server. 15254 Thus, when a layout retrieved by an outstanding LAYOUTGET operation 15255 intersects with a layout returned by an outstanding LAYOUTRETURN on 15256 the same file, the order in which the two conflicting operations are 15257 processed determines the final state of the overlapping layout. The 15258 order is determined by the "seqid" returned in each operation: the 15259 operation with the higher seqid was executed later. 15261 It is permissible for the client to send multiple parallel LAYOUTGET 15262 operations for the same file or multiple parallel LAYOUTRETURN 15263 operations for the same file or a mix of both. 15265 It is permissible for the client to use the current stateid (see 15266 Section 16.2.3.1.2) for LAYOUTGET operations, for example, when 15267 compounding LAYOUTGETs or compounding OPEN and LAYOUTGETs. It is 15268 also permissible to use the current stateid when compounding 15269 LAYOUTRETURNs. 15271 It is permissible for the client to use the current stateid when 15272 combining LAYOUTRETURN and LAYOUTGET operations for the same file in 15273 the same COMPOUND request since the server MUST process these in 15274 order. However, if a client does send such COMPOUND requests, it 15275 MUST NOT have more than one outstanding for the same file at the same 15276 time, and it MUST NOT have other LAYOUTGET or LAYOUTRETURN operations 15277 outstanding at the same time for that same file. 15279 12.5.5.2.1.2. Client Considerations 15281 Consider a pNFS client that has sent a LAYOUTGET, and before it 15282 receives the reply to LAYOUTGET, it receives a CB_LAYOUTRECALL for 15283 the same file with an overlapping range. There are two 15284 possibilities, which the client can distinguish via the layout 15285 stateid in the recall. 15287 1. 
The server processed the LAYOUTGET before sending the recall, so 15288 the LAYOUTGET must be waited for because it may be carrying 15289 layout information that will need to be returned to deal with the 15290 CB_LAYOUTRECALL. 15292 2. The server sent the callback before receiving the LAYOUTGET. The 15293 server will not respond to the LAYOUTGET until the 15294 CB_LAYOUTRECALL is processed. 15296 If these possibilities cannot be distinguished, a deadlock could 15297 result, as the client must wait for the LAYOUTGET response before 15298 processing the recall in the first case, but that response will not 15299 arrive until after the recall is processed in the second case. Note 15300 that in the first case, the "seqid" in the layout stateid of the 15301 recall is two greater than what the client has recorded; in the 15302 second case, the "seqid" is one greater than what the client has 15303 recorded. This allows the client to disambiguate between the two 15304 cases. The client thus knows precisely which possibility applies. 15306 In case 1, the client knows it needs to wait for the LAYOUTGET 15307 response before processing the recall (or the client can return 15308 NFS4ERR_DELAY). 15310 In case 2, the client will not wait for the LAYOUTGET response before 15311 processing the recall because waiting would cause deadlock. 15312 Therefore, the action at the client will only require waiting in the 15313 case that the client has not yet seen the server's earlier responses 15314 to the LAYOUTGET operation(s). 15316 The recall process can be considered completed when the final 15317 LAYOUTRETURN operation for the recalled range is completed. The 15318 LAYOUTRETURN uses the layout stateid (with seqid) specified in 15319 CB_LAYOUTRECALL. If the client uses multiple LAYOUTRETURNs in 15320 processing the recall, the first LAYOUTRETURN will use the layout 15321 stateid as specified in CB_LAYOUTRECALL. 
Subsequent LAYOUTRETURNs 15322 will use the highest seqid as is the usual case. 15324 12.5.5.2.1.3. Server Considerations 15326 Consider a race from the metadata server's point of view. The 15327 metadata server has sent a CB_LAYOUTRECALL and receives an 15328 overlapping LAYOUTGET for the same file before the LAYOUTRETURN(s) 15329 that respond to the CB_LAYOUTRECALL. There are three cases: 15331 1. The client sent the LAYOUTGET before processing the 15332 CB_LAYOUTRECALL. The "seqid" in the layout stateid of the 15333 arguments of LAYOUTGET is one less than the "seqid" in 15334 CB_LAYOUTRECALL. The server returns NFS4ERR_RECALLCONFLICT to 15335 the client, which indicates to the client that there is a pending 15336 recall. 15338 2. The client sent the LAYOUTGET after processing the 15339 CB_LAYOUTRECALL, but the LAYOUTGET arrived before the 15340 LAYOUTRETURN and the response to CB_LAYOUTRECALL that completed 15341 that processing. The "seqid" in the layout stateid of LAYOUTGET 15342 is equal to or greater than that of the "seqid" in 15343 CB_LAYOUTRECALL. The server has not received a response to the 15344 CB_LAYOUTRECALL, so it returns NFS4ERR_RECALLCONFLICT. 15346 3. The client sent the LAYOUTGET after processing the 15347 CB_LAYOUTRECALL; the server received the CB_LAYOUTRECALL 15348 response, but the LAYOUTGET arrived before the LAYOUTRETURN that 15349 completed that processing. The "seqid" in the layout stateid of 15350 LAYOUTGET is equal to that of the "seqid" in CB_LAYOUTRECALL. 15352 The server has received a response to the CB_LAYOUTRECALL, so it 15353 returns NFS4ERR_RETURNCONFLICT. 15355 12.5.5.2.1.4. Wraparound and Validation of Seqid 15357 The rules for layout stateid processing differ from other stateids in 15358 the protocol because the "seqid" value cannot be zero and the 15359 stateid's "seqid" value changes in a CB_LAYOUTRECALL operation. 
The 15360 non-zero requirement combined with the inherent parallelism of layout 15361 operations means that a set of LAYOUTGET and LAYOUTRETURN operations 15362 may contain the same value for "seqid". The server uses a slightly 15363 modified version of the modulo arithmetic as described in 15364 Section 2.10.6.1 when incrementing the layout stateid's "seqid". The 15365 difference is that zero is not a valid value for "seqid"; when the 15366 value of a "seqid" is 0xFFFFFFFF, the next valid value will be 15367 0x00000001. The modulo arithmetic is also used for the comparisons 15368 of "seqid" values in the processing of CB_LAYOUTRECALL events as 15369 described above in Section 12.5.5.2.1.3. 15371 Just as the server validates the "seqid" in the event of 15372 CB_LAYOUTRECALL usage, as described in Section 12.5.5.2.1.3, the 15373 server also validates the "seqid" value to ensure that it is within 15374 an appropriate range. This range represents the degree of 15375 parallelism the server supports for layout stateids. If the client 15376 is sending multiple layout operations to the server in parallel, by 15377 definition, the "seqid" value in the supplied stateid will not be the 15378 current "seqid" as held by the server. The range of parallelism 15379 spans from the highest or current "seqid" to a "seqid" value in the 15380 past. To assist in the discussion, the server's current "seqid" 15381 value for a layout stateid is defined as SERVER_CURRENT_SEQID. The 15382 lowest "seqid" value that is acceptable to the server is represented 15383 by PAST_SEQID. And the value for the range of valid "seqid"s or 15384 range of parallelism is VALID_SEQID_RANGE. Therefore, the following 15385 holds: VALID_SEQID_RANGE = SERVER_CURRENT_SEQID - PAST_SEQID. In the 15386 following, all arithmetic is the modulo arithmetic as described 15387 above. 15389 The server MUST support a minimum VALID_SEQID_RANGE. 
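As a non-normative illustration of the arithmetic just described, the zero-skipping increment, the modulo comparison of "seqid" values, and the server-side validation against the range of parallelism might be sketched as follows.  The function names, the error-string return convention, and the client-side helper are purely illustrative and are not part of the protocol:

```python
SEQID_MAX = 0xFFFFFFFF  # "seqid" is 32 bits wide; zero is never a valid value


def seqid_next(seqid):
    """Increment a layout stateid "seqid", wrapping 0xFFFFFFFF to 0x00000001."""
    return 1 if seqid == SEQID_MAX else seqid + 1


def seqid_distance(older, newer):
    """Number of increments leading from `older` to `newer`.

    Because zero is skipped, each wrap of the 32-bit space contains
    0xFFFFFFFF usable values, so distances are computed modulo
    0xFFFFFFFF rather than 2^32.
    """
    return (newer - older) % SEQID_MAX


def validate_seqid(supplied, server_current, valid_seqid_range):
    """Server-side check: reject zero, then enforce the range of parallelism.

    A "seqid" logically ahead of SERVER_CURRENT_SEQID also falls outside
    the range and is reported as old here; a real server may distinguish
    that case differently.
    """
    if supplied == 0:
        return "NFS4ERR_BAD_STATEID"
    if seqid_distance(supplied, server_current) > valid_seqid_range:
        return "NFS4ERR_OLD_STATEID"
    return "OK"


def recall_can_process_now(client_recorded, recall_seqid):
    """Client side: a CB_LAYOUTRECALL whose "seqid" is exactly one higher
    than the recorded value can be processed immediately; a larger gap
    means replies to outstanding LAYOUTGET/LAYOUTRETURN operations must
    be processed first (Section 12.5.5.2.1.2)."""
    return seqid_distance(client_recorded, recall_seqid) == 1
```

Under this sketch, a recall carrying a "seqid" two greater than the client's recorded value (case 1 of Section 12.5.5.2.1.2) makes recall_can_process_now() return False, prompting the client to wait for the outstanding LAYOUTGET reply.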
The minimum is 15390 defined as: VALID_SEQID_RANGE = summation over 1..N of 15391 (ca_maxoperations(i) - 1), where N is the number of session fore 15392 channels and ca_maxoperations(i) is the value of the ca_maxoperations 15393 returned from CREATE_SESSION of the i'th session. The reason for "- 15394 1" is to allow for the required SEQUENCE operation. The server MAY 15395 support a VALID_SEQID_RANGE value larger than the minimum. The 15396 maximum VALID_SEQID_RANGE is (2 ^ 32 - 2) (accounting for zero not 15397 being a valid "seqid" value). 15399 If the server finds the "seqid" is zero, the NFS4ERR_BAD_STATEID 15400 error is returned to the client. The server further validates the 15401 "seqid" to ensure it is within the range of parallelism, 15402 VALID_SEQID_RANGE. If the "seqid" value is outside of that range, 15403 the error NFS4ERR_OLD_STATEID is returned to the client. Upon 15404 receipt of NFS4ERR_OLD_STATEID, the client updates the stateid in the 15405 layout request based on processing of other layout requests and re- 15406 sends the operation to the server. 15408 12.5.5.2.1.5. Bulk Recall and Return 15410 pNFS supports recalling and returning all layouts that are for files 15411 belonging to a particular fsid (LAYOUTRECALL4_FSID, 15412 LAYOUTRETURN4_FSID) or client ID (LAYOUTRECALL4_ALL, 15413 LAYOUTRETURN4_ALL). There are no "bulk" stateids, so detection of 15414 races via the seqid is not possible. The server MUST NOT initiate 15415 bulk recall while another recall is in progress, or the corresponding 15416 LAYOUTRETURN is in progress or pending. In the event the server 15417 sends a bulk recall while the client has a pending or in-progress 15418 LAYOUTRETURN, CB_LAYOUTRECALL, or LAYOUTGET, the client returns 15419 NFS4ERR_DELAY. In the event the client sends a LAYOUTGET or 15420 LAYOUTRETURN while a bulk recall is in progress, the server returns 15421 NFS4ERR_RECALLCONFLICT. 
If the client sends a LAYOUTGET or 15422 LAYOUTRETURN after the server receives NFS4ERR_DELAY from a bulk 15423 recall, then to ensure forward progress, the server MAY return 15424 NFS4ERR_RECALLCONFLICT. 15426 Once a CB_LAYOUTRECALL of LAYOUTRECALL4_ALL is sent, the server MUST 15427 NOT allow the client to use any layout stateid except for 15428 LAYOUTCOMMIT operations.  Once the client receives a CB_LAYOUTRECALL 15429 of LAYOUTRECALL4_ALL, it MUST NOT use any layout stateid except for 15430 LAYOUTCOMMIT operations.  Once a LAYOUTRETURN of LAYOUTRETURN4_ALL is 15431 sent, all layout stateids granted to the client ID are freed.  The 15432 client MUST NOT use the layout stateids again.  It MUST use LAYOUTGET 15433 to obtain new layout stateids. 15435 Once a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID is sent, the server MUST 15436 NOT allow the client to use any layout stateid that refers to a file 15437 with the specified fsid except for LAYOUTCOMMIT operations.  Once the 15438 client receives a CB_LAYOUTRECALL of LAYOUTRECALL4_FSID, it MUST NOT 15439 use any layout stateid that refers to a file with the specified fsid 15440 except for LAYOUTCOMMIT operations.  Once a LAYOUTRETURN of 15441 LAYOUTRETURN4_FSID is sent, all layout stateids granted to the 15442 referenced fsid are freed.  The client MUST NOT use those freed 15443 layout stateids for files with the referenced fsid again. 15444 Subsequently, for any file with the referenced fsid, to use a layout, 15445 the client MUST first send a LAYOUTGET operation in order to obtain a 15446 new layout stateid for that file. 15448 If the server has sent a bulk CB_LAYOUTRECALL and receives a 15449 LAYOUTGET, or a LAYOUTRETURN with a stateid, the server MUST return 15450 NFS4ERR_RECALLCONFLICT.  If the server has sent a bulk 15451 CB_LAYOUTRECALL and receives a LAYOUTRETURN with an lr_returntype 15452 that is not equal to the lor_recalltype of the CB_LAYOUTRECALL, the 15453 server MUST return NFS4ERR_RECALLCONFLICT. 15455 12.5.6.
Revoking Layouts 15457 Parallel NFS permits servers to revoke layouts from clients that fail 15458 to respond to recalls and/or fail to renew their lease in time. 15459 Depending on the layout type, the server might revoke the layout and 15460 might take certain actions with respect to the client's I/O to data 15461 servers. 15463 12.5.7. Metadata Server Write Propagation 15465 Asynchronous writes written through the metadata server may be 15466 propagated lazily to the storage devices. For data written 15467 asynchronously through the metadata server, a client performing a 15468 read at the appropriate storage device is not guaranteed to see the 15469 newly written data until a COMMIT occurs at the metadata server. 15470 While the write is pending, reads to the storage device may give out 15471 either the old data, the new data, or a mixture of new and old. Upon 15472 completion of a synchronous WRITE or COMMIT (for asynchronously 15473 written data), the metadata server MUST ensure that storage devices 15474 give out the new data and that the data has been written to stable 15475 storage. If the server implements its storage in any way such that 15476 it cannot obey these constraints, then it MUST recall the layouts to 15477 prevent reads being done that cannot be handled correctly. Note that 15478 the layouts MUST be recalled prior to the server responding to the 15479 associated WRITE operations. 15481 12.6. pNFS Mechanics 15483 This section describes the operations flow taken by a pNFS client to 15484 a metadata server and storage device. 15486 When a pNFS client encounters a new FSID, it sends a GETATTR to the 15487 NFSv4.1 server for the fs_layout_type (Section 5.12.1) attribute. If 15488 the attribute returns at least one layout type, and the layout types 15489 returned are among the set supported by the client, the client knows 15490 that pNFS is a possibility for the file system. 
If, from the server 15491 that returned the new FSID, the client does not have a client ID that 15492 came from an EXCHANGE_ID result that returned 15493 EXCHGID4_FLAG_USE_PNFS_MDS, it MUST send an EXCHANGE_ID to the server 15494 with the EXCHGID4_FLAG_USE_PNFS_MDS bit set.  If the server's 15495 response does not have EXCHGID4_FLAG_USE_PNFS_MDS, then contrary to 15496 what the fs_layout_type attribute said, the server does not support 15497 pNFS, and the client will not be able to use pNFS with that server; in 15498 this case, the server MUST return NFS4ERR_NOTSUPP in response to any 15499 pNFS operation. 15501 The client then creates a session, requesting a persistent session, 15502 so that exclusive creates can be done with a single round trip via the 15503 createmode4 of GUARDED4.  If the session ends up not being 15504 persistent, the client will use EXCLUSIVE4_1 for exclusive creates. 15506 If a file is to be created on a pNFS-enabled file system, the client 15507 uses the OPEN operation.  With the normal set of attributes that may 15508 be provided upon OPEN used for creation, there is an OPTIONAL 15509 layout_hint attribute.  The client's use of layout_hint allows the 15510 client to express its preference for a layout type and its associated 15511 layout details.  The use of a createmode4 of UNCHECKED4, GUARDED4, or 15512 EXCLUSIVE4_1 will allow the client to provide the layout_hint 15513 attribute at create time.  The client MUST NOT use EXCLUSIVE4 (see 15514 Table 10).  It is RECOMMENDED that the client combine a GETATTR operation 15515 after the OPEN within the same COMPOUND.  The GETATTR may then 15516 retrieve the layout_type attribute for the newly created file.  The 15517 client will then know what layout type the server has chosen for the 15518 file and therefore what storage protocol the client must use. 15520 If the client wants to open an existing file, then it also includes a 15521 GETATTR to determine what layout type the file supports.
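The createmode selection just described can be summarized with a non-normative sketch.  The constants mirror the createmode4 enumeration of the NFSv4.1 XDR; the function, its arguments, and the decision logic are illustrative only:

```python
# createmode4 values as assigned in the NFSv4.1 XDR.
UNCHECKED4, GUARDED4, EXCLUSIVE4, EXCLUSIVE4_1 = 0, 1, 2, 3


def choose_createmode(exclusive_create, session_persistent):
    """Pick a createmode4 that still permits a layout_hint at create time."""
    if not exclusive_create:
        # UNCHECKED4 (or GUARDED4) carries attributes, so layout_hint
        # may be supplied with the OPEN.
        return UNCHECKED4
    if session_persistent:
        # A persistent session lets an exclusive create complete in a
        # single round trip using GUARDED4.
        return GUARDED4
    # Without a persistent session, fall back to EXCLUSIVE4_1, which also
    # carries attributes; EXCLUSIVE4 MUST NOT be used (see Table 10).
    return EXCLUSIVE4_1
```

A client following this sketch would then append a GETATTR for layout_type to the same COMPOUND as the OPEN, as recommended above.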
15523 The GETATTR in either the file creation or plain file open case can 15524 also include the layout_blksize and layout_alignment attributes so 15525 that the client can determine optimal offsets and lengths for I/O on 15526 the file. 15528 Assuming the client supports the layout type returned by GETATTR and 15529 it chooses to use pNFS for data access, it then sends LAYOUTGET using 15530 the filehandle and stateid returned by OPEN, specifying the range it 15531 wants to do I/O on.  The response is a layout, which may be a subset 15532 of the range for which the client asked.  It also includes device IDs 15533 and a description of how data is organized (or in the case of 15534 writing, how data is to be organized) across the devices.  The device 15535 IDs and data description are encoded in a format that is specific to 15536 the layout type and that the client is expected to understand. 15538 When the client wants to send an I/O, it determines to which device 15539 ID it needs to send the I/O command by examining the data description 15540 in the layout.  It then sends a GETDEVICEINFO to find the device 15541 address(es) of the device ID.  The client then sends the I/O request 15542 to one of the device ID's device addresses, using the storage protocol 15543 defined for the layout type.  Note that if a client has multiple I/Os 15544 to send, these I/O requests may be done in parallel. 15546 If the I/O was a WRITE, then at some point the client may want to use 15547 LAYOUTCOMMIT to commit the modification time and the new size of the 15548 file (if it believes it extended the file size) to the metadata 15549 server and the modified data to the file system. 15551 12.7.  Recovery 15553 Recovery is complicated by the distributed nature of the pNFS 15554 protocol.  In general, crash recovery for layouts is similar to crash 15555 recovery for delegations in the base NFSv4.1 protocol.
However, the 15556 client's ability to perform I/O without contacting the metadata 15557 server introduces subtleties that must be handled correctly if the 15558 possibility of file system corruption is to be avoided. 15560 12.7.1. Recovery from Client Restart 15562 Client recovery for layouts is similar to client recovery for other 15563 lock and delegation state. When a pNFS client restarts, it will lose 15564 all information about the layouts that it previously owned. There 15565 are two methods by which the server can reclaim these resources and 15566 allow otherwise conflicting layouts to be provided to other clients. 15568 The first is through the expiry of the client's lease. If the client 15569 recovery time is longer than the lease period, the client's lease 15570 will expire and the server will know that state may be released. For 15571 layouts, the server may release the state immediately upon lease 15572 expiry or it may allow the layout to persist, awaiting possible lease 15573 revival, as long as no other layout conflicts. 15575 The second is through the client restarting in less time than it 15576 takes for the lease period to expire. In such a case, the client 15577 will contact the server through the standard EXCHANGE_ID protocol. 15578 The server will find that the client's co_ownerid matches the 15579 co_ownerid of the previous client invocation, but that the verifier 15580 is different. The server uses this as a signal to release all layout 15581 state associated with the client's previous invocation. In this 15582 scenario, the data written by the client but not covered by a 15583 successful LAYOUTCOMMIT is in an undefined state; it may have been 15584 written or it may now be lost. This is acceptable behavior and it is 15585 the client's responsibility to use LAYOUTCOMMIT to achieve the 15586 desired level of stability. 15588 12.7.2. 
Dealing with Lease Expiration on the Client 15590 If a client believes its lease has expired, it MUST NOT send I/O to 15591 the storage device until it has validated its lease.  The client can 15592 send a SEQUENCE operation to the metadata server.  If the SEQUENCE 15593 operation is successful, but sr_status_flags has 15594 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, 15595 SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, or 15596 SEQ4_STATUS_ADMIN_STATE_REVOKED set, the client MUST NOT use 15597 currently held layouts.  The client has two choices to recover from 15598 the lease expiration.  First, for all modified but uncommitted data, 15599 the client writes it to the metadata server using the FILE_SYNC4 flag 15600 for the WRITEs, or WRITE followed by COMMIT.  Second, the client re- 15601 establishes a client ID and session with the server and obtains new 15602 layouts and device-ID-to-device-address mappings for the modified 15603 data ranges and then writes the data to the storage devices with the 15604 newly obtained layouts. 15606 If sr_status_flags from the metadata server has 15607 SEQ4_STATUS_RESTART_RECLAIM_NEEDED set (or SEQUENCE returns 15608 NFS4ERR_BAD_SESSION and CREATE_SESSION returns 15609 NFS4ERR_STALE_CLIENTID), then the metadata server has restarted, and 15610 the client SHOULD recover using the methods described in 15611 Section 12.7.4. 15613 If sr_status_flags from the metadata server has 15614 SEQ4_STATUS_LEASE_MOVED set, then the client recovers by following 15615 the procedure described in Section 11.11.9.2.  After that, the client 15616 may get an indication that the layout state was not moved with the 15617 file system.  The client recovers as in the other applicable 15618 situations discussed in the first two paragraphs of this section. 15620 If sr_status_flags reports no loss of state, then the lease for the 15621 layouts that the client holds is valid and has been renewed, and the 15622 client can once again send I/O requests to the storage devices.
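The decision procedure of this section can be summarized with a non-normative sketch.  The SEQ4_STATUS flag names are those used above, but the bit assignments shown are placeholders rather than the values assigned in the protocol's XDR, and the returned action labels are illustrative:

```python
# Placeholder bit positions; the real values are assigned in the
# NFSv4.1 XDR, not here.
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED  = 1 << 0
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 1 << 1
SEQ4_STATUS_ADMIN_STATE_REVOKED        = 1 << 2
SEQ4_STATUS_LEASE_MOVED                = 1 << 3
SEQ4_STATUS_RESTART_RECLAIM_NEEDED     = 1 << 4

_REVOKED = (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED |
            SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED |
            SEQ4_STATUS_ADMIN_STATE_REVOKED)


def layout_recovery_action(sr_status_flags):
    """Map SEQUENCE status bits to the layout-recovery step described above."""
    if sr_status_flags & _REVOKED:
        # Held layouts are unusable: either flush uncommitted data through
        # the metadata server (FILE_SYNC4 WRITEs or WRITE+COMMIT), or
        # re-establish state and fetch new layouts before touching
        # storage devices.
        return "layouts-revoked"
    if sr_status_flags & SEQ4_STATUS_RESTART_RECLAIM_NEEDED:
        return "metadata-server-restart"   # recover per Section 12.7.4
    if sr_status_flags & SEQ4_STATUS_LEASE_MOVED:
        return "lease-moved"               # recover per Section 11.11.9.2
    return "lease-valid"  # I/O to the storage devices may resume
```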
15624 While clients SHOULD NOT send I/Os to storage devices that may extend 15625 past the lease expiration time period, this is not always possible; 15626 consider, for example, an extended network partition that starts after 15627 the I/O is sent and does not heal until the I/O request is received by 15628 the storage device.  Thus, the metadata server and/or storage devices are 15629 responsible for protecting themselves from I/Os that are both sent 15630 before the lease expires and arrive after the lease expires.  See 15631 Section 12.7.3. 15633 12.7.3.  Dealing with Loss of Layout State on the Metadata Server 15635 This is a description of the case where all of the following are 15636 true: 15638 o  the metadata server has not restarted 15640 o  a pNFS client's layouts have been discarded (usually because the 15641 client's lease expired) and are invalid 15643 o  an I/O from the pNFS client arrives at the storage device 15645 The metadata server and its storage devices MUST solve this by 15646 fencing the client.  In other words, they MUST solve this by 15647 preventing the execution of I/O operations from the client to the 15648 storage devices after layout state loss.  The details of how fencing 15649 is done are specific to the layout type.  The solution for NFSv4.1 15650 file-based layouts is described in (Section 13.11), and solutions for 15651 other layout types are in their respective external specification 15652 documents. 15654 12.7.4.  Recovery from Metadata Server Restart 15656 The pNFS client will discover that the metadata server has restarted 15657 via the methods described in Section 8.4.2 and discussed in a pNFS- 15658 specific context in Section 12.7.2, Paragraph 2.  The client MUST 15659 stop using layouts and delete the device-ID-to-device-address 15660 mappings it previously received from the metadata server.
Having 15661 done that, if the client wrote data to the storage device without 15662 committing the layouts via LAYOUTCOMMIT, then the client has 15663 additional work to do in order to have the client, metadata server, 15664 and storage device(s) all synchronized on the state of the data. 15666 o If the client has data still modified and unwritten in the 15667 client's memory, the client has only two choices. 15669 1. The client can obtain a layout via LAYOUTGET after the 15670 server's grace period and write the data to the storage 15671 devices. 15673 2. The client can WRITE that data through the metadata server 15674 using the WRITE (Section 18.32) operation, and then obtain 15675 layouts as desired. 15677 o If the client asynchronously wrote data to the storage device, but 15678 still has a copy of the data in its memory, then it has available 15679 to it the recovery options listed above in the previous bullet 15680 point. If the metadata server is also in its grace period, the 15681 client has available to it the options below in the next bullet 15682 point. 15684 o The client does not have a copy of the data in its memory and the 15685 metadata server is still in its grace period. The client cannot 15686 use LAYOUTGET (within or outside the grace period) to reclaim a 15687 layout because the contents of the response from LAYOUTGET may not 15688 match what it had previously. The range might be different or the 15689 client might get the same range but the content of the layout 15690 might be different. Even if the content of the layout appears to 15691 be the same, the device IDs may map to different device addresses, 15692 and even if the device addresses are the same, the device 15693 addresses could have been assigned to a different storage device. 
15694 The option of retrieving the data from the storage device and 15695 writing it to the metadata server per the recovery scenario 15696 described above is not available because, again, the mappings of 15697 range to device ID, device ID to device address, and device 15698 address to physical device are stale, and new mappings via new 15699 LAYOUTGET do not solve the problem. 15701 The only recovery option for this scenario is to send a 15702 LAYOUTCOMMIT in reclaim mode, which the metadata server will 15703 accept as long as it is in its grace period. The use of 15704 LAYOUTCOMMIT in reclaim mode informs the metadata server that the 15705 layout has changed. It is critical that the metadata server 15706 receive this information before its grace period ends, and thus 15707 before it starts allowing updates to the file system. 15709 To send LAYOUTCOMMIT in reclaim mode, the client sets the 15710 loca_reclaim field of the operation's arguments (Section 18.42.1) 15711 to TRUE. During the metadata server's recovery grace period (and 15712 only during the recovery grace period) the metadata server is 15713 prepared to accept LAYOUTCOMMIT requests with the loca_reclaim 15714 field set to TRUE. 15716 When loca_reclaim is TRUE, the client is attempting to commit 15717 changes to the layout that occurred prior to the restart of the 15718 metadata server. The metadata server applies some consistency 15719 checks on the loca_layoutupdate field of the arguments to 15720 determine whether the client can commit the data written to the 15721 storage device to the file system. The loca_layoutupdate field is 15722 of data type layoutupdate4 and contains layout-type-specific 15723 content (in the lou_body field of loca_layoutupdate). The layout- 15724 type-specific information that loca_layoutupdate might have is 15725 discussed in Section 12.5.4.3. 
If the metadata server's 15726 consistency checks on loca_layoutupdate succeed, then the metadata 15727 server MUST commit the data (as described by the loca_offset, 15728 loca_length, and loca_layoutupdate fields of the arguments) that 15729 was written to the storage device.  If the metadata server's 15730 consistency checks on loca_layoutupdate fail, the metadata server 15731 rejects the LAYOUTCOMMIT operation and makes no changes to the 15732 file system.  However, any time LAYOUTCOMMIT with loca_reclaim 15733 TRUE fails, the pNFS client has lost all the data in the range 15734 defined by <loca_offset, loca_length>.  A client can defend 15735 against this risk by caching all data, whether written 15736 synchronously or asynchronously, in its memory, and by not 15737 releasing the cached data until a successful LAYOUTCOMMIT.  This 15738 condition does not hold true for all layout types; for example, 15739 file-based storage devices need not suffer from this limitation. 15741 o  The client does not have a copy of the data in its memory and the 15742 metadata server is no longer in its grace period; i.e., the 15743 metadata server returns NFS4ERR_NO_GRACE.  As with the scenario in 15744 the above bullet point, the failure of LAYOUTCOMMIT means the data 15745 in the range <loca_offset, loca_length> is lost.  The defense against 15746 the risk is the same -- cache all written data on the client until 15747 a successful LAYOUTCOMMIT. 15749 12.7.5.  Operations during Metadata Server Grace Period 15751 Some of the recovery scenarios thus far noted that some operations 15752 (namely, WRITE and LAYOUTGET) might be permitted during the metadata 15753 server's grace period.  The metadata server may allow these 15754 operations during its grace period.  For LAYOUTGET, the metadata 15755 server must reliably determine that servicing such a request will not 15756 conflict with an impending LAYOUTCOMMIT reclaim request.
For WRITE, 15757 the metadata server must reliably determine that servicing the 15758 request will not conflict with an impending OPEN or with a LOCK where 15759 the file has mandatory byte-range locking enabled. 15761 As mentioned previously, for expediency, the metadata server might 15762 reject some operations (namely, WRITE and LAYOUTGET) during its grace 15763 period, because the simplest correct approach is to reject all non- 15764 reclaim pNFS requests and WRITE operations by returning the 15765 NFS4ERR_GRACE error. However, depending on the storage protocol 15766 (which is specific to the layout type) and metadata server 15767 implementation, the metadata server may be able to determine that a 15768 particular request is safe. For example, a metadata server may save 15769 provisional allocation mappings for each file to stable storage, as 15770 well as information about potentially conflicting OPEN share modes 15771 and mandatory byte-range locks that might have been in effect at the 15772 time of restart, and the metadata server may use this information 15773 during the recovery grace period to determine that a WRITE request is 15774 safe. 15776 12.7.6. Storage Device Recovery 15778 Recovery from storage device restart is mostly dependent upon the 15779 layout type in use. However, there are a few general techniques a 15780 client can use if it discovers a storage device has crashed while 15781 holding modified, uncommitted data that was asynchronously written. 15782 First and foremost, it is important to realize that the client is the 15783 only one that has the information necessary to recover non-committed 15784 data since it holds the modified data and probably nothing else does. 15785 Second, the best solution is for the client to err on the side of 15786 caution and attempt to rewrite the modified data through another 15787 path. 
15789 The client SHOULD immediately WRITE the data to the metadata server, 15790 with the stable field in the WRITE4args set to FILE_SYNC4. Once it 15791 does this, there is no need to wait for the original storage device. 15793 12.8. Metadata and Storage Device Roles 15795 If the same physical hardware is used to implement both a metadata 15796 server and storage device, then the same hardware entity is to be 15797 understood to be implementing two distinct roles and it is important 15798 that it be clearly understood on behalf of which role the hardware is 15799 executing at any given time. 15801 Two sub-cases can be distinguished. 15803 1. The storage device uses NFSv4.1 as the storage protocol, i.e., 15804 the same physical hardware is used to implement both a metadata 15805 and data server. See Section 13.1 for a description of how 15806 multiple roles are handled. 15808 2. The storage device does not use NFSv4.1 as the storage protocol, 15809 and the same physical hardware is used to implement both a 15810 metadata and storage device. Whether distinct network addresses 15811 are used to access the metadata server and storage device is 15812 immaterial. This is because it is always clear to the pNFS 15813 client and server, from the upper-layer protocol being used 15814 (NFSv4.1 or non-NFSv4.1), to which role the request to the common 15815 server network address is directed. 15817 12.9. Security Considerations for pNFS 15819 pNFS separates file system metadata and data and provides access to 15820 both. There are pNFS-specific operations (listed in Section 12.3) 15821 that provide access to the metadata; all existing NFSv4.1 15822 conventional (non-pNFS) security mechanisms and features apply to 15823 accessing the metadata. 
The combination of components in a pNFS 15824 system (see Figure 1) is required to preserve the security properties 15825 of NFSv4.1 with respect to an entity that is accessing a storage 15826 device from a client, including security countermeasures to defend 15827 against threats for which NFSv4.1 provides defenses in environments 15828 where these threats are considered significant. 15830 In some cases, the security countermeasures for connections to 15831 storage devices may take the form of physical isolation or a 15832 recommendation to avoid the use of pNFS in an environment. For 15833 example, it may be impractical to provide confidentiality protection 15834 for some storage protocols to protect against eavesdropping. In 15835 environments where eavesdropping on such protocols is of sufficient 15836 concern to require countermeasures, physical isolation of the 15837 communication channel (e.g., via direct connection from client(s) to 15838 storage device(s)) and/or a decision to forgo use of pNFS (e.g., and 15839 fall back to conventional NFSv4.1) may be appropriate courses of 15840 action. 15842 Where communication with storage devices is subject to the same 15843 threats as client-to-metadata server communication, the protocols 15844 used for that communication need to provide security mechanisms as 15845 strong as or no weaker than those available via RPCSEC_GSS for 15846 NFSv4.1. Except for the storage protocol used for the 15847 LAYOUT4_NFSV4_1_FILES layout (see Section 13), i.e., except for 15848 NFSv4.1, it is beyond the scope of this document to specify the 15849 security mechanisms for storage access protocols. 15851 pNFS implementations MUST NOT remove NFSv4.1's access controls. The 15852 combination of clients, storage devices, and the metadata server are 15853 responsible for ensuring that all client-to-storage-device file data 15854 access respects NFSv4.1's ACLs and file open modes. 
This entails 15855 performing both of these checks on every access in the client, the 15856 storage device, or both (as applicable; when the storage device is an 15857 NFSv4.1 server, the storage device is ultimately responsible for 15858 controlling access as described in Section 13.9.2). If a pNFS 15859 configuration performs these checks only in the client, the risk of a 15860 misbehaving client obtaining unauthorized access is an important 15861 consideration in determining when it is appropriate to use such a 15862 pNFS configuration. Such layout types SHOULD NOT be used when 15863 client-only access checks do not provide sufficient assurance that 15864 NFSv4.1 access control is being applied correctly. (This is not a 15865 problem for the file layout type described in Section 13 because the 15866 storage access protocol for LAYOUT4_NFSV4_1_FILES is NFSv4.1, and 15867 thus the security model for storage device access via 15868 LAYOUT4_NFSV4_1_FILES is the same as that of the metadata server.) 15869 For handling of access control specific to a layout, the reader 15870 should examine the layout specification, such as the NFSv4.1/file- 15871 based layout (Section 13) of this document, the blocks layout [47], 15872 and objects layout [46]. 15874 13. NFSv4.1 as a Storage Protocol in pNFS: the File Layout Type 15876 This section describes the semantics and format of NFSv4.1 file-based 15877 layouts for pNFS. NFSv4.1 file-based layouts use the 15878 LAYOUT4_NFSV4_1_FILES layout type. The LAYOUT4_NFSV4_1_FILES type 15879 defines striping data across multiple NFSv4.1 data servers.
The roles are: 15889 o Metadata server (EXCHGID4_FLAG_USE_PNFS_MDS is set in the result 15890 eir_flags). 15892 o Data server (EXCHGID4_FLAG_USE_PNFS_DS). 15894 o Non-metadata server (EXCHGID4_FLAG_USE_NON_PNFS). This is an 15895 NFSv4.1 server that does not support operations (e.g., LAYOUTGET) 15896 or attributes that pertain to pNFS. 15898 The client MAY request zero or more of EXCHGID4_FLAG_USE_NON_PNFS, 15899 EXCHGID4_FLAG_USE_PNFS_DS, or EXCHGID4_FLAG_USE_PNFS_MDS, even though 15900 some combinations (e.g., EXCHGID4_FLAG_USE_NON_PNFS | 15901 EXCHGID4_FLAG_USE_PNFS_MDS) are contradictory. However, the server 15902 MUST only return the following acceptable combinations: 15904 +---------------------------------------------------------+ 15905 | Acceptable Results from EXCHANGE_ID | 15906 +---------------------------------------------------------+ 15907 | EXCHGID4_FLAG_USE_PNFS_MDS | 15908 | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS | 15909 | EXCHGID4_FLAG_USE_PNFS_DS | 15910 | EXCHGID4_FLAG_USE_NON_PNFS | 15911 | EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS | 15912 +---------------------------------------------------------+ 15914 As the above table implies, a server can have one or two roles. A 15915 server can be both a metadata server and a data server, or it can be 15916 both a data server and non-metadata server. In addition to returning 15917 two roles in the EXCHANGE_ID's results, and thus serving both roles 15918 via a common client ID, a server can serve two roles by returning a 15919 unique client ID and server owner for each role in each of two 15920 EXCHANGE_ID results, with each result indicating each role. 
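As a non-normative illustration, the table of acceptable EXCHANGE_ID results above can be expressed as a membership test on the pNFS role bits of eir_flags. The flag values below are the ones NFSv4.1 defines; the function name is hypothetical.

```python
# Sketch (not normative): checking a server's EXCHANGE_ID eir_flags
# result against the table of acceptable pNFS role combinations.

EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS  = 0x00040000

# The five combinations a server is allowed to return.
ACCEPTABLE_ROLE_RESULTS = {
    EXCHGID4_FLAG_USE_PNFS_MDS,
    EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS,
    EXCHGID4_FLAG_USE_PNFS_DS,
    EXCHGID4_FLAG_USE_NON_PNFS,
    EXCHGID4_FLAG_USE_PNFS_DS | EXCHGID4_FLAG_USE_NON_PNFS,
}

ROLE_MASK = (EXCHGID4_FLAG_USE_NON_PNFS |
             EXCHGID4_FLAG_USE_PNFS_MDS |
             EXCHGID4_FLAG_USE_PNFS_DS)


def roles_acceptable(eir_flags):
    """True iff the pNFS role bits in eir_flags form one of the five
    combinations a server may return."""
    return (eir_flags & ROLE_MASK) in ACCEPTABLE_ROLE_RESULTS
```

For example, a result combining EXCHGID4_FLAG_USE_NON_PNFS with EXCHGID4_FLAG_USE_PNFS_MDS fails the test, matching the contradictory combination called out above.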
15922 In the case of a server with concurrent pNFS roles that are served by 15923 a common client ID, if the EXCHANGE_ID request from the client has 15924 zero or a combination of the bits set in eia_flags, the server result 15925 should set bits that represent the higher of the acceptable 15926 combination of the server roles, with a preference to match the roles 15927 requested by the client. Thus, if a client request has 15928 (EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS | 15929 EXCHGID4_FLAG_USE_PNFS_DS) flags set, and the server is both a 15930 metadata server and a data server, serving both the roles by a common 15931 client ID, the server SHOULD return with 15932 (EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS) set. 15934 In the case of a server that has multiple concurrent pNFS roles, each 15935 role served by a unique client ID, if the client specifies zero or a 15936 combination of roles in the request, the server results SHOULD return 15937 only one of the roles from the combination specified by the client 15938 request. If the role specified by the server result does not match 15939 the intended use by the client, the client should send the 15940 EXCHANGE_ID specifying just the interested pNFS role. 15942 If a pNFS metadata client gets a layout that refers it to an NFSv4.1 15943 data server, it needs a client ID on that data server. If it does 15944 not yet have a client ID from the server that had the 15945 EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID results, then 15946 the client needs to send an EXCHANGE_ID to the data server, using the 15947 same co_ownerid as it sent to the metadata server, with the 15948 EXCHGID4_FLAG_USE_PNFS_DS flag set in the arguments. If the server's 15949 EXCHANGE_ID results have EXCHGID4_FLAG_USE_PNFS_DS set, then the 15950 client may use the client ID to create sessions that will exchange 15951 pNFS data operations. 
The client ID returned by the data server has 15952 no relationship with the client ID returned by a metadata server 15953 unless the client IDs are equal, and the server owners and server 15954 scopes of the data server and metadata server are equal. 15956 In NFSv4.1, the session ID in the SEQUENCE operation implies the 15957 client ID, which in turn might be used by the server to map the 15958 stateid to the right client/server pair. However, when a data server 15959 is presented with a READ or WRITE operation with a stateid, because 15960 the stateid is associated with a client ID on a metadata server, and 15961 because the session ID in the preceding SEQUENCE operation is tied to 15962 the client ID of the data server, the data server has no obvious way 15963 to determine the metadata server from the COMPOUND procedure, and 15964 thus has no way to validate the stateid. One RECOMMENDED approach is 15965 for pNFS servers to encode metadata server routing and/or identity 15966 information in the data server filehandles as returned in the layout. 15968 If metadata server routing and/or identity information is encoded in 15969 data server filehandles, when the metadata server identity or 15970 location changes, the data server filehandles it gave out will become 15971 invalid (stale), and so the metadata server MUST first recall the 15972 layouts. Invalidating a data server filehandle does not render the 15973 NFS client's data cache invalid. The client's cache should map a 15974 data server filehandle to a metadata server filehandle, and a 15975 metadata server filehandle to cached data. 15977 If a server is both a metadata server and a data server, the server 15978 might need to distinguish operations on files that are directed to 15979 the metadata server from those that are directed to the data server. 
15980 It is RECOMMENDED that the values of the filehandles returned by the 15981 LAYOUTGET operation be different than the value of the filehandle 15982 returned by the OPEN of the same file. 15984 Another scenario is for the metadata server and the storage device to 15985 be distinct from one client's point of view, and the roles reversed 15986 from another client's point of view. For example, in the cluster 15987 file system model, a metadata server to one client might be a data 15988 server to another client. If NFSv4.1 is being used as the storage 15989 protocol, then pNFS servers need to encode the values of filehandles 15990 according to their specific roles. 15992 13.1.1. Sessions Considerations for Data Servers 15994 Section 2.10.11.2 states that a client has to keep its lease renewed 15995 in order to prevent a session from being deleted by the server. If 15996 the reply to EXCHANGE_ID has just the EXCHGID4_FLAG_USE_PNFS_DS role 15997 set, then (as noted in Section 13.6) the client will not be able to 15998 determine the data server's lease_time attribute because GETATTR will 15999 not be permitted. Instead, the rule is that any time a client 16000 receives a layout referring it to a data server that returns just the 16001 EXCHGID4_FLAG_USE_PNFS_DS role, the client MAY assume that the 16002 lease_time attribute from the metadata server that returned the 16003 layout applies to the data server. Thus, the data server MUST be 16004 aware of the values of all lease_time attributes of all metadata 16005 servers for which it is providing I/O, and it MUST use the maximum of 16006 all such lease_time values as the lease interval for all client IDs 16007 and sessions established on it. 
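The lease-interval rule above can be modeled with a short sketch (hypothetical class and method names; not part of the protocol): the data server tracks the lease_time attribute of each metadata server it provides I/O for and applies the maximum to all client IDs and sessions.

```python
# Illustrative model of the rule in Section 13.1.1: a data server that
# returns only EXCHGID4_FLAG_USE_PNFS_DS uses the maximum lease_time of
# all metadata servers it serves as its lease interval.

class DataServerLeases:
    def __init__(self):
        self._mds_lease_times = {}  # metadata server id -> lease_time (s)

    def register_mds(self, mds_id, lease_time):
        # Called whenever the DS learns it is serving I/O for an MDS.
        self._mds_lease_times[mds_id] = lease_time

    def lease_interval(self):
        # Interval applied to every client ID and session on this DS.
        return max(self._mds_lease_times.values())
```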
16009 For example, if one metadata server has a lease_time attribute of 20 16010 seconds, and a second metadata server has a lease_time attribute of 16011 10 seconds, then if both servers return layouts that refer to an 16012 EXCHGID4_FLAG_USE_PNFS_DS-only data server, the data server MUST 16013 renew a client's lease if the interval between two SEQUENCE 16014 operations on different COMPOUND requests is less than 20 seconds. 16016 13.2. File Layout Definitions 16018 The following definitions apply to the LAYOUT4_NFSV4_1_FILES layout 16019 type and may be applicable to other layout types. 16021 Unit. A unit is a fixed-size quantity of data written to a data 16022 server. 16024 Pattern. A pattern is a method of distributing one or more equal 16025 sized units across a set of data servers. A pattern is iterated 16026 one or more times. 16028 Stripe. A stripe is a set of data distributed across a set of data 16029 servers in a pattern before that pattern repeats. 16031 Stripe Count. A stripe count is the number of units in a pattern. 16033 Stripe Width. A stripe width is the size of a stripe in bytes. The 16034 stripe width = the stripe count * the size of the stripe unit. 16036 Hereafter, this document will refer to a unit that is written in a 16037 pattern as a "stripe unit". 16039 A pattern may have more stripe units than data servers. If so, some 16040 data servers will have more than one stripe unit per stripe. A data 16041 server that has multiple stripe units per stripe MAY store each unit 16042 in a different data file (and depending on the implementation, will 16043 possibly assign a unique data filehandle to each data file). 16045 13.3. File Layout Data Types 16047 The high-level NFSv4.1 layout types are nfsv4_1_file_layouthint4, 16048 nfsv4_1_file_layout_ds_addr4, and nfsv4_1_file_layout4. 16050 The SETATTR operation supports a layout hint attribute 16051 (Section 5.12.4).
When the client sets a layout hint (data type 16052 layouthint4) with a layout type of LAYOUT4_NFSV4_1_FILES (the 16053 loh_type field), the loh_body field contains a value of data type 16054 nfsv4_1_file_layouthint4. 16056 const NFL4_UFLG_MASK = 0x0000003F; 16057 const NFL4_UFLG_DENSE = 0x00000001; 16058 const NFL4_UFLG_COMMIT_THRU_MDS = 0x00000002; 16059 const NFL4_UFLG_STRIPE_UNIT_SIZE_MASK 16060 = 0xFFFFFFC0; 16062 typedef uint32_t nfl_util4; 16063 enum filelayout_hint_care4 { 16064 NFLH4_CARE_DENSE = NFL4_UFLG_DENSE, 16066 NFLH4_CARE_COMMIT_THRU_MDS 16067 = NFL4_UFLG_COMMIT_THRU_MDS, 16069 NFLH4_CARE_STRIPE_UNIT_SIZE 16070 = 0x00000040, 16072 NFLH4_CARE_STRIPE_COUNT = 0x00000080 16073 }; 16075 /* Encoded in the loh_body field of data type layouthint4: */ 16077 struct nfsv4_1_file_layouthint4 { 16078 uint32_t nflh_care; 16079 nfl_util4 nflh_util; 16080 count4 nflh_stripe_count; 16081 }; 16083 The generic layout hint structure is described in Section 3.3.19. 16084 The client uses the layout hint in the layout_hint (Section 5.12.4) 16085 attribute to indicate the preferred type of layout to be used for a 16086 newly created file. The LAYOUT4_NFSV4_1_FILES layout-type-specific 16087 content for the layout hint is composed of three fields. The first 16088 field, nflh_care, is a set of flags indicating which values of the 16089 hint the client cares about. If the NFLH4_CARE_DENSE flag is set, 16090 then the client indicates in the second field, nflh_util, a 16091 preference for how the data file is packed (Section 13.4.4), which is 16092 controlled by the value of the expression nflh_util & NFL4_UFLG_DENSE 16093 ("&" represents the bitwise AND operator). If the 16094 NFLH4_CARE_COMMIT_THRU_MDS flag is set, then the client indicates a 16095 preference for whether the client should send COMMIT operations to 16096 the metadata server or data server (Section 13.7), which is 16097 controlled by the value of nflh_util & NFL4_UFLG_COMMIT_THRU_MDS. 
If 16098 the NFLH4_CARE_STRIPE_UNIT_SIZE flag is set, the client indicates its 16099 preferred stripe unit size, which is indicated in nflh_util & 16100 NFL4_UFLG_STRIPE_UNIT_SIZE_MASK (thus, the stripe unit size MUST be a 16101 multiple of 64 bytes). The minimum stripe unit size is 64 bytes. If 16102 the NFLH4_CARE_STRIPE_COUNT flag is set, the client indicates in the 16103 third field, nflh_stripe_count, the stripe count. The stripe count 16104 multiplied by the stripe unit size is the stripe width. 16106 When LAYOUTGET returns a LAYOUT4_NFSV4_1_FILES layout (indicated in 16107 the loc_type field of the lo_content field), the loc_body field of 16108 the lo_content field contains a value of data type 16109 nfsv4_1_file_layout4. Among other content, nfsv4_1_file_layout4 has 16110 a storage device ID (field nfl_deviceid) of data type deviceid4. The 16111 GETDEVICEINFO operation maps a device ID to a storage device address 16112 (type device_addr4). When GETDEVICEINFO returns a device address 16113 with a layout type of LAYOUT4_NFSV4_1_FILES (the da_layout_type 16114 field), the da_addr_body field contains a value of data type 16115 nfsv4_1_file_layout_ds_addr4. 16117 typedef netaddr4 multipath_list4<>; 16119 /* 16120 * Encoded in the da_addr_body field of 16121 * data type device_addr4: 16122 */ 16123 struct nfsv4_1_file_layout_ds_addr4 { 16124 uint32_t nflda_stripe_indices<>; 16125 multipath_list4 nflda_multipath_ds_list<>; 16126 }; 16128 The nfsv4_1_file_layout_ds_addr4 data type represents the device 16129 address. It is composed of two fields: 16131 1. nflda_multipath_ds_list: An array of lists of data servers, where 16132 each list can be one or more elements, and each element 16133 represents a data server address that may serve equally as the 16134 target of I/O operations (see Section 13.5). The length of this 16135 array might be different than the stripe count. 16137 2. 
nflda_stripe_indices: An array of indices used to index into 16138 nflda_multipath_ds_list. The value of each element of 16139 nflda_stripe_indices MUST be less than the number of elements in 16140 nflda_multipath_ds_list. Each element of nflda_multipath_ds_list 16141 SHOULD be referred to by one or more elements of 16142 nflda_stripe_indices. The number of elements in 16143 nflda_stripe_indices is always equal to the stripe count. 16145 /* 16146 * Encoded in the loc_body field of 16147 * data type layout_content4: 16148 */ 16149 struct nfsv4_1_file_layout4 { 16150 deviceid4 nfl_deviceid; 16151 nfl_util4 nfl_util; 16152 uint32_t nfl_first_stripe_index; 16153 offset4 nfl_pattern_offset; 16154 nfs_fh4 nfl_fh_list<>; 16155 }; 16156 The nfsv4_1_file_layout4 data type represents the layout. It is 16157 composed of the following fields: 16159 1. nfl_deviceid: The device ID that maps to a value of type 16160 nfsv4_1_file_layout_ds_addr4. 16162 2. nfl_util: Like the nflh_util field of data type 16163 nfsv4_1_file_layouthint4, a compact representation of how the 16164 data on a file on each data server is packed, whether the client 16165 should send COMMIT operations to the metadata server or data 16166 server, and the stripe unit size. If a server returns two or 16167 more overlapping layouts, each stripe unit size in each 16168 overlapping layout MUST be the same. 16170 3. nfl_first_stripe_index: The index into the first element of the 16171 nflda_stripe_indices array to use. 16173 4. nfl_pattern_offset: This field is the logical offset into the 16174 file where the striping pattern starts. It is required for 16175 converting the client's logical I/O offset (e.g., the current 16176 offset in a POSIX file descriptor before the read() or write() 16177 system call is sent) into the stripe unit number (see 16178 Section 13.4.1). 
16180 If dense packing is used, then nfl_pattern_offset is also needed 16181 to convert the client's logical I/O offset to an offset on the 16182 file on the data server corresponding to the stripe unit number 16183 (see Section 13.4.4). 16185 Note that nfl_pattern_offset is not always the same as lo_offset. 16186 For example, via the LAYOUTGET operation, a client might request 16187 a layout starting at offset 1000 of a file that has its striping 16188 pattern start at offset zero. 16190 5. nfl_fh_list: An array of data server filehandles for each list of 16191 data servers in each element of the nflda_multipath_ds_list 16192 array. The number of elements in nfl_fh_list depends on whether 16193 sparse or dense packing is being used. 16195 * If sparse packing is being used, the number of elements in 16196 nfl_fh_list MUST be one of three values: 16198 + Zero. This means that filehandles used for each data 16199 server are the same as the filehandle returned by the OPEN 16200 operation from the metadata server. 16202 + One. This means that every data server uses the same 16203 filehandle: what is specified in nfl_fh_list[0]. 16205 + The same number of elements in nflda_multipath_ds_list. 16206 Thus, in this case, when sending an I/O operation to any 16207 data server in nflda_multipath_ds_list[X], the filehandle 16208 in nfl_fh_list[X] MUST be used. 16210 See the discussion on sparse packing in Section 13.4.4. 16212 * If dense packing is being used, the number of elements in 16213 nfl_fh_list MUST be the same as the number of elements in 16214 nflda_stripe_indices. Thus, when sending an I/O operation to 16215 any data server in 16216 nflda_multipath_ds_list[nflda_stripe_indices[Y]], the 16217 filehandle in nfl_fh_list[Y] MUST be used. 
In addition, any 16218 time there exists i and j, (i != j), such that the 16219 intersection of 16220 nflda_multipath_ds_list[nflda_stripe_indices[i]] and 16221 nflda_multipath_ds_list[nflda_stripe_indices[j]] is not empty, 16222 then nfl_fh_list[i] MUST NOT equal nfl_fh_list[j]. In other 16223 words, when dense packing is being used, if a data server 16224 appears in two or more units of a striping pattern, each 16225 reference to the data server MUST use a different filehandle. 16227 Indeed, if there are multiple striping patterns, as indicated 16228 by the presence of multiple objects of data type layout4 16229 (either returned in one or multiple LAYOUTGET operations), and 16230 a data server is the target of a unit of one pattern and 16231 another unit of another pattern, then each reference to each 16232 data server MUST use a different filehandle. 16234 See the discussion on dense packing in Section 13.4.4. 16236 The details on the interpretation of the layout are in Section 13.4. 16238 13.4. Interpreting the File Layout 16240 13.4.1. Determining the Stripe Unit Number 16242 To find the stripe unit number that corresponds to the client's 16243 logical file offset, the pattern offset will also be used. The i'th 16244 stripe unit (SUi) is: 16246 relative_offset = file_offset - nfl_pattern_offset; 16247 SUi = floor(relative_offset / stripe_unit_size); 16249 13.4.2. 
Interpreting the File Layout Using Sparse Packing 16251 When sparse packing is used, the algorithm for determining the 16252 filehandle and set of data-server network addresses to write stripe 16253 unit i (SUi) to is: 16255 stripe_count = number of elements in nflda_stripe_indices; 16257 j = (SUi + nfl_first_stripe_index) % stripe_count; 16259 idx = nflda_stripe_indices[j]; 16261 fh_count = number of elements in nfl_fh_list; 16262 ds_count = number of elements in nflda_multipath_ds_list; 16264 switch (fh_count) { 16265 case ds_count: 16266 fh = nfl_fh_list[idx]; 16267 break; 16269 case 1: 16270 fh = nfl_fh_list[0]; 16271 break; 16273 case 0: 16274 fh = filehandle returned by OPEN; 16275 break; 16277 default: 16278 throw a fatal exception; 16279 break; 16280 } 16282 address_list = nflda_multipath_ds_list[idx]; 16284 The client would then select a data server from address_list, and 16285 send a READ or WRITE operation using the filehandle specified in fh. 16287 Consider the following example: 16289 Suppose we have a device address consisting of seven data servers, 16290 arranged in three equivalence (Section 13.5) classes: 16292 { A, B, C, D }, { E }, { F, G } 16294 where A through G are network addresses. 16296 Then 16298 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 16300 i.e., 16302 nflda_multipath_ds_list[0] = { A, B, C, D } 16304 nflda_multipath_ds_list[1] = { E } 16306 nflda_multipath_ds_list[2] = { F, G } 16308 Suppose the striping index array is: 16310 nflda_stripe_indices<> = { 2, 0, 1, 0 } 16312 Now suppose the client gets a layout that has a device ID that maps 16313 to the above device address. The initial index contains 16315 nfl_first_stripe_index = 2, 16317 and the filehandle list is 16319 nfl_fh_list = { 0x36, 0x87, 0x67 }. 
16321 If the client wants to write to SU0, the set of valid { network 16322 address, filehandle } combinations for SUi are determined by: 16324 nfl_first_stripe_index = 2 16326 So 16328 idx = nflda_stripe_indices[(0 + 2) % 4] 16330 = nflda_stripe_indices[2] 16332 = 1 16334 So 16336 nflda_multipath_ds_list[1] = { E } 16338 and 16340 nfl_fh_list[1] = { 0x87 } 16342 The client can thus write SU0 to { 0x87, { E } }. 16344 The destinations of the first 13 storage units are: 16346 +-----+------------+--------------+ 16347 | SUi | filehandle | data servers | 16348 +-----+------------+--------------+ 16349 | 0 | 87 | E | 16350 | 1 | 36 | A,B,C,D | 16351 | 2 | 67 | F,G | 16352 | 3 | 36 | A,B,C,D | 16353 | | | | 16354 | 4 | 87 | E | 16355 | 5 | 36 | A,B,C,D | 16356 | 6 | 67 | F,G | 16357 | 7 | 36 | A,B,C,D | 16358 | | | | 16359 | 8 | 87 | E | 16360 | 9 | 36 | A,B,C,D | 16361 | 10 | 67 | F,G | 16362 | 11 | 36 | A,B,C,D | 16363 | | | | 16364 | 12 | 87 | E | 16365 +-----+------------+--------------+ 16367 13.4.3. Interpreting the File Layout Using Dense Packing 16369 When dense packing is used, the algorithm for determining the 16370 filehandle and set of data server network addresses to write stripe 16371 unit i (SUi) to is: 16373 stripe_count = number of elements in nflda_stripe_indices; 16375 j = (SUi + nfl_first_stripe_index) % stripe_count; 16377 idx = nflda_stripe_indices[j]; 16379 fh_count = number of elements in nfl_fh_list; 16380 ds_count = number of elements in nflda_multipath_ds_list; 16382 switch (fh_count) { 16383 case stripe_count: 16384 fh = nfl_fh_list[j]; 16385 break; 16387 default: 16388 throw a fatal exception; 16389 break; 16390 } 16392 address_list = nflda_multipath_ds_list[idx]; 16394 The client would then select a data server from address_list, and 16395 send a READ or WRITE operation using the filehandle specified in fh. 
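Both selection algorithms above (Sections 13.4.2 and 13.4.3) can be combined into one runnable, non-normative model using the document's field names; the select helper and the OPEN_FH placeholder are illustrative only.

```python
# Non-normative model of the sparse and dense filehandle/address
# selection algorithms. OPEN_FH stands in for the filehandle returned
# by the OPEN operation (the zero-length nfl_fh_list case under sparse
# packing).

OPEN_FH = object()


def select(su_i, nfl_first_stripe_index, nflda_stripe_indices,
           nfl_fh_list, nflda_multipath_ds_list, dense):
    """Return the (filehandle, data server address list) pair for
    stripe unit su_i."""
    stripe_count = len(nflda_stripe_indices)
    j = (su_i + nfl_first_stripe_index) % stripe_count
    idx = nflda_stripe_indices[j]
    if dense:
        # Dense packing: nfl_fh_list must match the stripe count.
        if len(nfl_fh_list) != stripe_count:
            raise ValueError("dense layout: bad nfl_fh_list length")
        fh = nfl_fh_list[j]
    else:
        # Sparse packing: three acceptable nfl_fh_list lengths.
        fh_count = len(nfl_fh_list)
        if fh_count == len(nflda_multipath_ds_list):
            fh = nfl_fh_list[idx]
        elif fh_count == 1:
            fh = nfl_fh_list[0]
        elif fh_count == 0:
            fh = OPEN_FH
        else:
            raise ValueError("sparse layout: bad nfl_fh_list length")
    return fh, nflda_multipath_ds_list[idx]
```

With the device address and striping index array of the sparse example (nflda_stripe_indices = { 2, 0, 1, 0 }, nfl_fh_list = { 0x36, 0x87, 0x67 }), this model selects { 0x87, { E } } for SU0, as derived in the text that follows.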
16397 Consider the following example (which is the same as the sparse 16398 packing example, except for the filehandle list): 16400 Suppose we have a device address consisting of seven data servers, 16401 arranged in three equivalence (Section 13.5) classes: 16403 { A, B, C, D }, { E }, { F, G } 16405 where A through G are network addresses. 16407 Then 16409 nflda_multipath_ds_list<> = { A, B, C, D }, { E }, { F, G } 16411 i.e., 16413 nflda_multipath_ds_list[0] = { A, B, C, D } 16415 nflda_multipath_ds_list[1] = { E } 16417 nflda_multipath_ds_list[2] = { F, G } 16419 Suppose the striping index array is: 16421 nflda_stripe_indices<> = { 2, 0, 1, 0 } 16423 Now suppose the client gets a layout that has a device ID that maps 16424 to the above device address. The initial index contains 16426 nfl_first_stripe_index = 2, 16428 and 16430 nfl_fh_list = { 0x67, 0x37, 0x87, 0x36 }. 16432 The interesting examples for dense packing are SU1 and SU3 because 16433 each stripe unit refers to the same data server list, yet each stripe 16434 unit MUST use a different filehandle. If the client wants to write 16435 to SU1, the set of valid { network address, filehandle } combinations 16436 for SUi are determined by: 16438 nfl_first_stripe_index = 2 16440 So 16442 j = (1 + 2) % 4 = 3 16444 idx = nflda_stripe_indices[j] 16446 = nflda_stripe_indices[3] 16448 = 0 16450 So 16452 nflda_multipath_ds_list[0] = { A, B, C, D } 16454 and 16456 nfl_fh_list[3] = { 0x36 } 16458 The client can thus write SU1 to { 0x36, { A, B, C, D } }. 16460 For SU3, j = (3 + 2) % 4 = 1, and nflda_stripe_indices[1] = 0. Then 16461 nflda_multipath_ds_list[0] = { A, B, C, D }, and nfl_fh_list[1] = 16462 0x37. The client can thus write SU3 to { 0x37, { A, B, C, D } }. 
16464 The destinations of the first 13 storage units are: 16466 +-----+------------+--------------+ 16467 | SUi | filehandle | data servers | 16468 +-----+------------+--------------+ 16469 | 0 | 87 | E | 16470 | 1 | 36 | A,B,C,D | 16471 | 2 | 67 | F,G | 16472 | 3 | 37 | A,B,C,D | 16473 | | | | 16474 | 4 | 87 | E | 16475 | 5 | 36 | A,B,C,D | 16476 | 6 | 67 | F,G | 16477 | 7 | 37 | A,B,C,D | 16478 | | | | 16479 | 8 | 87 | E | 16480 | 9 | 36 | A,B,C,D | 16481 | 10 | 67 | F,G | 16482 | 11 | 37 | A,B,C,D | 16483 | | | | 16484 | 12 | 87 | E | 16485 +-----+------------+--------------+ 16487 13.4.4. Sparse and Dense Stripe Unit Packing 16489 The flag NFL4_UFLG_DENSE of the nfl_util4 data type (field nflh_util 16490 of the data type nfsv4_1_file_layouthint4 and field nfl_util of data 16491 type nfsv4_1_file_layout4) specifies how the data is packed 16492 within the data file on a data server. It allows for two different 16493 data packings: sparse and dense. The packing type determines the 16494 calculation that will be made to map the client-visible file offset 16495 to the offset within the data file located on the data server. 16497 If nfl_util & NFL4_UFLG_DENSE is zero, this means that sparse packing 16498 is being used. Hence, the logical offsets of the file as viewed by a 16499 client sending READs and WRITEs directly to the metadata server are 16500 the same offsets each data server uses when storing a stripe unit. 16501 The effect then, for striping patterns consisting of at least two 16502 stripe units, is for each data server file to be sparse or "holey". 16503 So for example, suppose there is a pattern with three stripe units, 16504 the stripe unit size is 4096 bytes, and there are three data servers 16505 in the pattern. Then, the file in data server 1 will have stripe 16506 units 0, 3, 6, 9, ... filled; data server 2's file will have stripe 16507 units 1, 4, 7, 10, ... filled; and data server 3's file will have 16508 stripe units 2, 5, 8, 11, ...
filled. The unfilled stripe units of 16509 each file will be holes; hence, the files in each data server are 16510 sparse. 16512 If sparse packing is being used and a client attempts I/O to one of 16513 the holes, then an error MUST be returned by the data server. Using 16514 the above example, if data server 3 received a READ or WRITE 16515 operation for block 4, the data server would return 16516 NFS4ERR_PNFS_IO_HOLE. Thus, data servers need to understand the 16517 striping pattern in order to support sparse packing. 16519 If nfl_util & NFL4_UFLG_DENSE is one, this means that dense packing 16520 is being used, and the data server files have no holes. Dense 16521 packing might be selected because the data server does not 16522 (efficiently) support holey files or because the data server cannot 16523 recognize read-ahead unless there are no holes. If dense packing is 16524 indicated in the layout, the data files will be packed. Using the 16525 same striping pattern and stripe unit size that were used for the 16526 sparse packing example, the corresponding dense packing example would 16527 have all stripe units of all data files filled as follows: 16529 o Logical stripe units 0, 3, 6, ... of the file would live on stripe 16530 units 0, 1, 2, ... of the file of data server 1. 16532 o Logical stripe units 1, 4, 7, ... of the file would live on stripe 16533 units 0, 1, 2, ... of the file of data server 2. 16535 o Logical stripe units 2, 5, 8, ... of the file would live on stripe 16536 units 0, 1, 2, ... of the file of data server 3. 16538 Because dense packing does not leave holes on the data servers, the 16539 pNFS client is allowed to write to any offset of any data file of any 16540 data server in the stripe. Thus, the data servers need not know the 16541 file's striping pattern. 
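The difference between the two packings can be illustrated by a small, non-normative sketch mapping a client-visible file offset to the corresponding offset in a data server's data file. Under sparse packing the offsets are identical; under dense packing the sketch applies the calculation given in the next paragraph. The helper name is hypothetical.

```python
# Illustrative only: map a client-visible file offset to the offset
# within the data file on the data server that holds it.

def data_file_offset(file_offset, nfl_pattern_offset,
                     stripe_unit_size, stripe_count, dense):
    if not dense:
        # Sparse packing: the data server stores each stripe unit at
        # the same offset the client sees, so the data file is "holey".
        return file_offset
    # Dense packing: units belonging to one data server are compacted.
    relative_offset = file_offset - nfl_pattern_offset
    stripe_width = stripe_unit_size * stripe_count
    return ((relative_offset // stripe_width) * stripe_unit_size
            + relative_offset % stripe_unit_size)
```

Using the example above (three stripe units of 4096 bytes across three data servers), the client-visible offset 12288 (logical stripe unit 3) maps, under dense packing, to offset 4096 of data server 1's file, while under sparse packing it stays at 12288.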
The calculation to determine the byte offset within the data file for dense data server layouts is:

   stripe_width = stripe_unit_size * N;
       where N = number of elements in nflda_stripe_indices.

   relative_offset = file_offset - nfl_pattern_offset;

   data_file_offset = floor(relative_offset / stripe_width)
                      * stripe_unit_size
                      + relative_offset % stripe_unit_size

If dense packing is being used, and a data server appears more than once in a striping pattern, then to distinguish one stripe unit from another, the data server MUST use a different filehandle. Let's suppose there are two data servers. Logical stripe units 0, 3, 6 are served by data server 1; logical stripe units 1, 4, 7 are served by data server 2; and logical stripe units 2, 5, 8 are also served by data server 2. Unless data server 2 has two filehandles (each referring to a different data file), then, for example, a write to logical stripe unit 1 overwrites the write to logical stripe unit 2 because both logical stripe units are located in the same stripe unit (0) of data server 2.

13.5. Data Server Multipathing

The NFSv4.1 file layout supports multipathing to multiple data server addresses. Data-server-level multipathing is used for bandwidth scaling via trunking (Section 2.10.5) and for higher availability in the case of a data-server failure. Multipathing allows the client to switch to another data server address, which may be that of another data server that is exporting the same data stripe unit, without having to contact the metadata server for a new layout.

To support data server multipathing, each element of the nflda_multipath_ds_list contains an array of one or more data server network addresses.
This array (data type multipath_list4) represents 16580 a list of data servers (each identified by a network address), with 16581 the possibility that some data servers will appear in the list 16582 multiple times. 16584 The client is free to use any of the network addresses as a 16585 destination to send data server requests. If some network addresses 16586 are less optimal paths to the data than others, then the MDS SHOULD 16587 NOT include those network addresses in an element of 16588 nflda_multipath_ds_list. If less optimal network addresses exist to 16589 provide failover, the RECOMMENDED method to offer the addresses is to 16590 provide them in a replacement device-ID-to-device-address mapping, or 16591 a replacement device ID. When a client finds that no data server in 16592 an element of nflda_multipath_ds_list responds, it SHOULD send a 16593 GETDEVICEINFO to attempt to replace the existing device-ID-to-device- 16594 address mappings. If the MDS detects that all data servers 16595 represented by an element of nflda_multipath_ds_list are unavailable, 16596 the MDS SHOULD send a CB_NOTIFY_DEVICEID (if the client has indicated 16597 it wants device ID notifications for changed device IDs) to change 16598 the device-ID-to-device-address mappings to the available data 16599 servers. If the device ID itself will be replaced, the MDS SHOULD 16600 recall all layouts with the device ID, and thus force the client to 16601 get new layouts and device ID mappings via LAYOUTGET and 16602 GETDEVICEINFO. 16604 Generally, if two network addresses appear in an element of 16605 nflda_multipath_ds_list, they will designate the same data server, 16606 and the two data server addresses will support the implementation of 16607 client ID or session trunking (the latter is RECOMMENDED) as defined 16608 in Section 2.10.5. The two data server addresses will share the same 16609 server owner or major ID of the server owner. 
It is not always 16610 necessary for the two data server addresses to designate the same 16611 server with trunking being used. For example, the data could be 16612 read-only, and the data consist of exact replicas. 16614 13.6. Operations Sent to NFSv4.1 Data Servers 16616 Clients accessing data on an NFSv4.1 data server MUST send only the 16617 NULL procedure and COMPOUND procedures whose operations are taken 16618 only from two restricted subsets of the operations defined as valid 16619 NFSv4.1 operations. Clients MUST use the filehandle specified by the 16620 layout when accessing data on NFSv4.1 data servers. 16622 The first of these operation subsets consists of management 16623 operations. This subset consists of the BACKCHANNEL_CTL, 16624 BIND_CONN_TO_SESSION, CREATE_SESSION, DESTROY_CLIENTID, 16625 DESTROY_SESSION, EXCHANGE_ID, SECINFO_NO_NAME, SET_SSV, and SEQUENCE 16626 operations. The client may use these operations in order to set up 16627 and maintain the appropriate client IDs, sessions, and security 16628 contexts involved in communication with the data server. Henceforth, 16629 these will be referred to as data-server housekeeping operations. 16631 The second subset consists of COMMIT, READ, WRITE, and PUTFH. These 16632 operations MUST be used with a current filehandle specified by the 16633 layout. In the case of PUTFH, the new current filehandle MUST be one 16634 taken from the layout. Henceforth, these will be referred to as 16635 data-server I/O operations. As described in Section 12.5.1, a client 16636 MUST NOT send an I/O to a data server for which it does not hold a 16637 valid layout; the data server MUST reject such an I/O. 
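As a non-normative illustration, a data server that has no non-data-server personality could filter incoming COMPOUND operations against these two subsets (the operation names come from this section; the helper name is invented):

```python
# Hypothetical sketch of a pure data server's operation filter.
# The two permitted subsets are taken verbatim from Section 13.6.
HOUSEKEEPING_OPS = {
    "BACKCHANNEL_CTL", "BIND_CONN_TO_SESSION", "CREATE_SESSION",
    "DESTROY_CLIENTID", "DESTROY_SESSION", "EXCHANGE_ID",
    "SECINFO_NO_NAME", "SET_SSV", "SEQUENCE",
}
IO_OPS = {"COMMIT", "READ", "WRITE", "PUTFH"}

def check_op(op_name):
    """Return None if the operation is permitted on a data server
    with no other personality, else the error it must return."""
    if op_name in HOUSEKEEPING_OPS or op_name in IO_OPS:
        return None
    return "NFS4ERR_NOTSUPP"
```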
16639 Unless the server has a concurrent non-data-server personality -- 16640 i.e., EXCHANGE_ID results returned (EXCHGID4_FLAG_USE_PNFS_DS | 16641 EXCHGID4_FLAG_USE_PNFS_MDS) or (EXCHGID4_FLAG_USE_PNFS_DS | 16642 EXCHGID4_FLAG_USE_NON_PNFS) see Section 13.1 -- any attempted use of 16643 operations against a data server other than those specified in the 16644 two subsets above MUST return NFS4ERR_NOTSUPP to the client. 16646 When the server has concurrent data-server and non-data-server 16647 personalities, each COMPOUND sent by the client MUST be constructed 16648 so that it is appropriate to one of the two personalities, and it 16649 MUST NOT contain operations directed to a mix of those personalities. 16650 The server MUST enforce this. To understand the constraints, 16651 operations within a COMPOUND are divided into the following three 16652 classes: 16654 1. An operation that is ambiguous regarding its personality 16655 assignment. This includes all of the data-server housekeeping 16656 operations. Additionally, if the server has assigned filehandles 16657 so that the ones defined by the layout are the same as those used 16658 by the metadata server, all operations using such filehandles are 16659 within this class, with the following exception. The exception 16660 is that if the operation uses a stateid that is incompatible with 16661 a data-server personality (e.g., a special stateid or the stateid 16662 has a non-zero "seqid" field, see Section 13.9.1), the operation 16663 is in class 3, as described below. A COMPOUND containing 16664 multiple class 1 operations (and operations of no other class) 16665 MAY be sent to a server with multiple concurrent data server and 16666 non-data-server personalities. 16668 2. An operation that is unambiguously referable to the data-server 16669 personality. This includes data-server I/O operations where the 16670 filehandle is one that can only be validly directed to the data- 16671 server personality. 16673 3. 
An operation that is unambiguously referable to the non-data-server personality. This includes all COMPOUND operations that are neither data-server housekeeping nor data-server I/O operations, plus data-server I/O operations where the current fh (or the one to be made the current fh in the case of PUTFH) is only valid on the metadata server or where a stateid is used that is incompatible with the data server, i.e., is a special stateid or has a non-zero seqid value.

When a COMPOUND first executes an operation from class 3 above, it acts as a normal COMPOUND on any other server, and the data-server personality ceases to be relevant. There are no special restrictions on the operations in the COMPOUND to limit them to those for a data server. When a PUTFH is done, filehandles derived from the layout are not valid. If their format is not normally acceptable, then NFS4ERR_BADHANDLE MUST result. Similarly, filehandles derived from layouts are not accepted as current filehandles for other operations, since they are not normally usable on the metadata server; using them will result in NFS4ERR_STALE.

When a COMPOUND first executes an operation from class 2, which would be PUTFH where the filehandle is one from a layout, the COMPOUND henceforth is interpreted with respect to the data-server personality. Operations outside the two classes discussed above MUST result in NFS4ERR_NOTSUPP. Filehandles are validated using the rules of the data server, resulting in NFS4ERR_BADHANDLE and/or NFS4ERR_STALE even when those errors would not arise if the request were addressed to the non-data-server personality. Stateids must obey the rules of the data server in that any use of special stateids or stateids with non-zero seqid values must result in NFS4ERR_BAD_STATEID.
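A minimal sketch of how the personality for a COMPOUND is resolved from these three classes (illustrative only; a real server tracks far more state than a per-operation class number):

```python
def resolve_personality(op_classes):
    """Given the class (1, 2, or 3) of each operation in a COMPOUND,
    return the personality that interprets it: 'either' while only
    class 1 (ambiguous) operations have executed, then fixed by the
    first class 2 or class 3 operation encountered."""
    for cls in op_classes:
        if cls == 2:
            return "data-server"
        if cls == 3:
            return "non-data-server"
    return "either"
```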
16704 Until the server first executes an operation from class 2 or class 3, 16705 the client MUST NOT depend on the operation being executed by either 16706 the data-server or the non-data-server personality. The server MUST 16707 pick one personality consistently for a given COMPOUND, with the only 16708 possible transition being a single one when the first operation from 16709 class 2 or class 3 is executed. 16711 Because of the complexity induced by assigning filehandles so they 16712 can be used on both a data server and a metadata server, it is 16713 RECOMMENDED that where the same server can have both personalities, 16714 the server assign separate unique filehandles to both personalities. 16715 This makes it unambiguous for which server a given request is 16716 intended. 16718 GETATTR and SETATTR MUST be directed to the metadata server. In the 16719 case of a SETATTR of the size attribute, the control protocol is 16720 responsible for propagating size updates/truncations to the data 16721 servers. In the case of extending WRITEs to the data servers, the 16722 new size must be visible on the metadata server once a LAYOUTCOMMIT 16723 has completed (see Section 12.5.4.2). Section 13.10 describes the 16724 mechanism by which the client is to handle data-server files that do 16725 not reflect the metadata server's size. 16727 13.7. COMMIT through Metadata Server 16729 The file layout provides two alternate means of providing for the 16730 commit of data written through data servers. The flag 16731 NFL4_UFLG_COMMIT_THRU_MDS in the field nfl_util of the file layout 16732 (data type nfsv4_1_file_layout4) is an indication from the metadata 16733 server to the client of the REQUIRED way of performing COMMIT, either 16734 by sending the COMMIT to the data server or the metadata server. 16735 These two methods of dealing with the issue correspond to broad 16736 styles of implementation for a pNFS server supporting the file layout 16737 type. 
o  When the flag is FALSE, COMMIT operations MUST be sent to the data server to which the corresponding WRITE operations were sent. This approach is sometimes useful when file striping is implemented within the pNFS server (instead of the file system), with the individual data servers each implementing their own file systems.

o  When the flag is TRUE, COMMIT operations MUST be sent to the metadata server, rather than to the individual data servers. This approach is sometimes useful when file striping is implemented within the clustered file system that is the backend to the pNFS server. In such an implementation, each COMMIT to each data server might result in repeated writes of metadata blocks to the detriment of write performance. Sending a single COMMIT to the metadata server can be more efficient when there exists a clustered file system capable of implementing such a coordinated COMMIT.

If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is TRUE, then in order to maintain the current NFSv4.1 commit and recovery model, the data servers MUST return a common writeverf verifier in all WRITE responses for a given file layout, and the metadata server's COMMIT implementation must return the same writeverf. The value of the writeverf verifier MUST be changed at the metadata server or any data server that is referenced in the layout, whenever there is a server event that can possibly lead to loss of uncommitted data. The scope of the verifier can be for a file or for the entire pNFS server. It might be more difficult for the server to maintain the verifier at the file level, but the benefit is that only events that impact a given file will require recovery action.
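On the client side, checking a commit-through-MDS verifier against the verifiers returned by earlier WRITEs can be sketched as follows (the data shapes are invented for illustration; only the comparison rule comes from this section):

```python
def writes_to_resend(write_responses, commit_verf):
    """write_responses: list of (offset, length, writeverf) tuples,
    one per WRITE response for the file.  commit_verf: the verifier
    returned by the single COMMIT to the metadata server.  Returns
    the (offset, length) ranges whose verifier mismatched and whose
    data must therefore be treated as a failed WRITE and re-sent."""
    return [(off, length)
            for off, length, verf in write_responses
            if verf != commit_verf]
```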
Note that if the layout specifies dense packing, then the offset used in a COMMIT to the MDS may differ from the offset used in a COMMIT to the data server.

The single COMMIT to the metadata server will return a verifier, and the client should compare it to all the verifiers from the WRITEs and fail the COMMIT if there are any mismatched verifiers. If COMMIT to the metadata server fails, the client should re-send WRITEs for all the modified data in the file. The client should treat modified data with a mismatched verifier as a WRITE failure and try to recover by resending the WRITEs to the original data server or using another path to that data if the layout has not been recalled. Alternatively, the client can obtain a new layout or it could rewrite the data directly to the metadata server. If nfl_util & NFL4_UFLG_COMMIT_THRU_MDS is FALSE, sending a COMMIT to the metadata server might have no effect; such a COMMIT should be used only to commit data that was written to the metadata server itself. See Section 12.7.6 for recovery options.

13.8. The Layout Iomode

The layout iomode need not be used by the metadata server when servicing NFSv4.1 file-based layouts, although in some circumstances it may be useful. For example, if the server implementation supports reading from read-only replicas or mirrors, it would be useful for the server to return a layout enabling the client to do so. As such, the client SHOULD set the iomode based on its intent to read or write the data. The client may default to an iomode of LAYOUTIOMODE4_RW. The iomode need not be checked by the data servers when clients perform I/O.
However, the data servers SHOULD still validate that 16802 the client holds a valid layout and return an error if the client 16803 does not. 16805 13.9. Metadata and Data Server State Coordination 16807 13.9.1. Global Stateid Requirements 16809 When the client sends I/O to a data server, the stateid used MUST NOT 16810 be a layout stateid as returned by LAYOUTGET or sent by 16811 CB_LAYOUTRECALL. Permitted stateids are based on one of the 16812 following: an OPEN stateid (the stateid field of data type OPEN4resok 16813 as returned by OPEN), a delegation stateid (the stateid field of data 16814 types open_read_delegation4 and open_write_delegation4 as returned by 16815 OPEN or WANT_DELEGATION, or as sent by CB_PUSH_DELEG), or a stateid 16816 returned by the LOCK or LOCKU operations. The stateid sent to the 16817 data server MUST be sent with the seqid set to zero, indicating the 16818 most current version of that stateid, rather than indicating a 16819 specific non-zero seqid value. In no case is the use of special 16820 stateid values allowed. 16822 The stateid used for I/O MUST have the same effect and be subject to 16823 the same validation on a data server as it would if the I/O was being 16824 performed on the metadata server itself in the absence of pNFS. This 16825 has the implication that stateids are globally valid on both the 16826 metadata and data servers. This requires the metadata server to 16827 propagate changes in LOCK and OPEN state to the data servers, so that 16828 the data servers can validate I/O accesses. This is discussed 16829 further in Section 13.9.2. Depending on when stateids are 16830 propagated, the existence of a valid stateid on the data server may 16831 act as proof of a valid layout. 16833 Clients performing I/O operations need to select an appropriate 16834 stateid based on the locks (including opens and delegations) held by 16835 the client and the various types of state-owners sending the I/O 16836 requests. 
The rules for doing so when referencing data servers are 16837 somewhat different from those discussed in Section 8.2.5, which apply 16838 when accessing metadata servers. 16840 The following rules, applied in order of decreasing priority, govern 16841 the selection of the appropriate stateid: 16843 o If the client holds a delegation for the file in question, the 16844 delegation stateid should be used. 16846 o Otherwise, there must be an OPEN stateid for the current open- 16847 owner, and that OPEN stateid for the open file in question is 16848 used, unless mandatory locking prevents that. See below. 16850 o If the data server had previously responded with NFS4ERR_LOCKED to 16851 use of the OPEN stateid, then the client should use the byte-range 16852 lock stateid whenever one exists for that open file with the 16853 current lock-owner. 16855 o Special stateids should never be used. If they are used, the data 16856 server MUST reject the I/O with an NFS4ERR_BAD_STATEID error. 16858 13.9.2. Data Server State Propagation 16860 Since the metadata server, which handles byte-range lock and open- 16861 mode state changes as well as ACLs, might not be co-located with the 16862 data servers where I/O accesses are validated, the server 16863 implementation MUST take care of propagating changes of this state to 16864 the data servers. Once the propagation to the data servers is 16865 complete, the full effect of those changes MUST be in effect at the 16866 data servers. However, some state changes need not be propagated 16867 immediately, although all changes SHOULD be propagated promptly. 16868 These state propagations have an impact on the design of the control 16869 protocol, even though the control protocol is outside of the scope of 16870 this specification. Immediate propagation refers to the synchronous 16871 propagation of state from the metadata server to the data server(s); 16872 the propagation must be complete before returning to the client. 16874 13.9.2.1. 
Lock State Propagation 16876 If the pNFS server supports mandatory byte-range locking, any 16877 mandatory byte-range locks on a file MUST be made effective at the 16878 data servers before the request that establishes them returns to the 16879 caller. The effect MUST be the same as if the mandatory byte-range 16880 lock state were synchronously propagated to the data servers, even 16881 though the details of the control protocol may avoid actual transfer 16882 of the state under certain circumstances. 16884 On the other hand, since advisory byte-range lock state is not used 16885 for checking I/O accesses at the data servers, there is no semantic 16886 reason for propagating advisory byte-range lock state to the data 16887 servers. Since updates to advisory locks neither confer nor remove 16888 privileges, these changes need not be propagated immediately, and may 16889 not need to be propagated promptly. The updates to advisory locks 16890 need only be propagated when the data server needs to resolve a 16891 question about a stateid. In fact, if byte-range locking is not 16892 mandatory (i.e., is advisory) the clients are advised to avoid using 16893 the byte-range lock-based stateids for I/O. The stateids returned by 16894 OPEN are sufficient and eliminate overhead for this kind of state 16895 propagation. 16897 If a client gets back an NFS4ERR_LOCKED error from a data server, 16898 this is an indication that mandatory byte-range locking is in force. 16899 The client recovers from this by getting a byte-range lock that 16900 covers the affected range and re-sends the I/O with the stateid of 16901 the byte-range lock. 16903 13.9.2.2. Open and Deny Mode Validation 16905 Open and deny mode validation MUST be performed against the open and 16906 deny mode(s) held by the data servers. 
When access is reduced or a 16907 deny mode made more restrictive (because of CLOSE or OPEN_DOWNGRADE), 16908 the data server MUST prevent any I/Os that would be denied if 16909 performed on the metadata server. When access is expanded, the data 16910 server MUST make sure that no requests are subsequently rejected 16911 because of open or deny issues that no longer apply, given the 16912 previous relaxation. 16914 13.9.2.3. File Attributes 16916 Since the SETATTR operation has the ability to modify state that is 16917 visible on both the metadata and data servers (e.g., the size), care 16918 must be taken to ensure that the resultant state across the set of 16919 data servers is consistent, especially when truncating or growing the 16920 file. 16922 As described earlier, the LAYOUTCOMMIT operation is used to ensure 16923 that the metadata is synchronized with changes made to the data 16924 servers. For the NFSv4.1-based data storage protocol, it is 16925 necessary to re-synchronize state such as the size attribute, and the 16926 setting of mtime/change/atime. See Section 12.5.4 for a full 16927 description of the semantics regarding LAYOUTCOMMIT and attribute 16928 synchronization. It should be noted that by using an NFSv4.1-based 16929 layout type, it is possible to synchronize this state before 16930 LAYOUTCOMMIT occurs. For example, the control protocol can be used 16931 to query the attributes present on the data servers. 16933 Any changes to file attributes that control authorization or access 16934 as reflected by ACCESS calls or READs and WRITEs on the metadata 16935 server, MUST be propagated to the data servers for enforcement on 16936 READ and WRITE I/O calls. If the changes made on the metadata server 16937 result in more restrictive access permissions for any user, those 16938 changes MUST be propagated to the data servers synchronously. 
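The synchronous-propagation requirement can be sketched as a simple predicate (illustrative; the access-rights representation is invented, and only the "more restrictive implies synchronous" rule comes from this section):

```python
def must_propagate_synchronously(old_access, new_access):
    """old_access/new_access: dict mapping a user principal to the
    set of operations that user is permitted.  Synchronous
    propagation to the data servers is required if any user loses a
    permission; purely permissive changes may be deferred."""
    return any(not new_access.get(user, set()) >= perms
               for user, perms in old_access.items())
```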
16940 The OPEN operation (Section 18.16.4) does not impose any requirement 16941 that I/O operations on an open file have the same credentials as the 16942 OPEN itself (unless EXCHGID4_FLAG_BIND_PRINC_STATEID is set when 16943 EXCHANGE_ID creates the client ID), and so it requires the server's 16944 READ and WRITE operations to perform appropriate access checking. 16946 Changes to ACLs also require new access checking by READ and WRITE on 16947 the server. The propagation of access-right changes due to changes 16948 in ACLs may be asynchronous only if the server implementation is able 16949 to determine that the updated ACL is not more restrictive for any 16950 user specified in the old ACL. Due to the relative infrequency of 16951 ACL updates, it is suggested that all changes be propagated 16952 synchronously. 16954 13.10. Data Server Component File Size 16956 A potential problem exists when a component data file on a particular 16957 data server has grown past EOF; the problem exists for both dense and 16958 sparse layouts. Imagine the following scenario: a client creates a 16959 new file (size == 0) and writes to byte 131072; the client then seeks 16960 to the beginning of the file and reads byte 100. The client should 16961 receive zeroes back as a result of the READ. However, if the 16962 striping pattern directs the client to send the READ to a data server 16963 other than the one that received the client's original WRITE, the 16964 data server servicing the READ may believe that the file's size is 16965 still 0 bytes. In that event, the data server's READ response will 16966 contain zero bytes and an indication of EOF. The data server can 16967 only return zeroes if it knows that the file's size has been 16968 extended. This would require the immediate propagation of the file's 16969 size to all data servers, which is potentially very costly. 
Therefore, the client that has initiated the extension of the file's size MUST be prepared to deal with these EOF conditions. When the offset in the arguments to READ is less than the client's view of the file size, if the READ response indicates EOF and/or contains fewer bytes than requested, the client will interpret such a response as a hole in the file, and the NFS client will substitute zeroes for the data.

The NFSv4.1 protocol only provides close-to-open file data cache semantics, meaning that when the file is closed, all modified data is written to the server. When a subsequent OPEN of the file is done, the change attribute is inspected for a difference from a cached value for the change attribute. For the case above, this means that a LAYOUTCOMMIT will be done at close (along with the data WRITEs) and will update the file's size and change attribute. Access from another client after that point will result in the appropriate size being returned.

13.11. Layout Revocation and Fencing

As described in Section 12.7, the layout-type-specific storage protocol is responsible for handling the effects of I/Os that started before lease expiration and extend through lease expiration. The LAYOUT4_NFSV4_1_FILES layout type can prevent all I/Os to data servers from being executed after lease expiration (this prevention is called "fencing"), without relying on a precise client lease timer and without requiring data servers to maintain lease timers. The LAYOUT4_NFSV4_1_FILES pNFS server has the flexibility to revoke individual layouts, and thus fence I/O on a per-file basis.

In addition to lease expiration, the reasons a layout can be revoked include: the client failing to respond to a CB_LAYOUTRECALL, the metadata server restarting, or administrative intervention.
Regardless of the reason, once a client's layout has been revoked, the pNFS server MUST prevent the client from sending I/O for the affected file from and to all data servers; in other words, it MUST fence the client from the affected file on the data servers.

Fencing works as follows. As described in Section 13.1, in COMPOUND procedure requests to the data server, the data filehandle provided by the PUTFH operation and the stateid in the READ or WRITE operation are used to ensure that the client has a valid layout for the I/O being performed; if it does not, the I/O is rejected with NFS4ERR_PNFS_NO_LAYOUT. The server can simply check the stateid and, additionally, make the data filehandle stale if the layout specified a data filehandle that is different from the metadata server's filehandle for the file (see the nfl_fh_list description in Section 13.3).

Before the metadata server takes any action to revoke layout state given out by a previous instance, it must make sure that all layout state from that previous instance is invalidated at the data servers. This has the following implications.

o  The metadata server must not restripe a file until it has contacted all of the data servers to invalidate the layouts from the previous instance.

o  The metadata server must not give out mandatory locks that conflict with layouts from the previous instance without either doing a specific layout invalidation (as it would have to do anyway) or doing a global data server invalidation.

13.12. Security Considerations for the File Layout Type

The NFSv4.1 file layout type MUST adhere to the security considerations outlined in Section 12.9. NFSv4.1 data servers MUST make all of the required access checks on each READ or WRITE I/O as determined by the NFSv4.1 protocol.
If the metadata server would deny a READ or WRITE operation on a file due to its ACL, mode attribute, open access mode, open deny mode, mandatory byte-range lock state, or any other attributes and state, the data server MUST also deny the READ or WRITE operation. This impacts the control protocol and the propagation of state from the metadata server to the data servers; see Section 13.9.2 for more details.

The methods for authentication, integrity, and privacy for data servers based on the LAYOUT4_NFSV4_1_FILES layout type are the same as those used by metadata servers. Metadata and data servers use ONC RPC security flavors to authenticate, and SECINFO and SECINFO_NO_NAME to negotiate the security mechanism and services to be used. Thus, when using the LAYOUT4_NFSV4_1_FILES layout type, the impact on the RPC-based security model due to pNFS (as alluded to in Sections 1.8.1 and 1.8.2.2) is zero.

For a given file object, a metadata server MAY require different security parameters (secinfo4 value) than the data server. For a given file object with multiple data servers, the secinfo4 value SHOULD be the same across all data servers. If the secinfo4 values across a metadata server and its data servers differ for a specific file, the mapping of the principal to the server's internal user identifier MUST be the same in order for the access-control checks based on ACL, mode, open and deny mode, and mandatory locking to be consistent across the pNFS server.

If an NFSv4.1 implementation supports pNFS and supports NFSv4.1 file layouts, then the implementation MUST support the SECINFO_NO_NAME operation on both the metadata and data servers.

14.
Internationalization

The primary issue with which NFSv4.1 needs to deal regarding internationalization, or I18N, is file names and other strings as used within the protocol. The choice of string representation must allow reasonable name/string access to clients that use various languages. The UTF-8 encoding of the UCS (Universal Multiple-Octet Coded Character Set) as defined by ISO 10646 [18] allows for this type of access and follows the policy described in "IETF Policy on Character Sets and Languages", RFC 2277 [19].

RFC 3454 [16], otherwise known as "stringprep", documents a framework for using Unicode/UTF-8 in networking protocols so as "to increase the likelihood that string input and string comparison work in ways that make sense for typical users throughout the world". A protocol must define a profile of stringprep "in order to fully specify the processing options". The remainder of this section defines the NFSv4.1 stringprep profiles. Much of the terminology used for the remainder of this section comes from stringprep.

There are three UTF-8 string types defined for NFSv4.1: utf8str_cs, utf8str_cis, and utf8str_mixed. Separate profiles are defined for each. Each profile defines the following, as required by stringprep:

o  The intended applicability of the profile.

o  The character repertoire that is the input and output to stringprep (which is Unicode 3.2 for the referenced version of stringprep). However, NFSv4.1 implementations are not limited to 3.2.

o  The mapping tables from stringprep used (as described in Section 3 of stringprep).

o  Any additional mapping tables specific to the profile.

o  The Unicode normalization used, if any (as described in Section 4 of stringprep).
17108 o The tables from the stringprep listing of characters that are 17109 prohibited as output (as described in Section 5 of stringprep). 17111 o The bidirectional string testing used, if any (as described in 17112 Section 6 of stringprep). 17114 o Any additional characters that are prohibited as output specific 17115 to the profile. 17117 Stringprep discusses Unicode characters, whereas NFSv4.1 renders 17118 UTF-8 characters. Since there is a one-to-one mapping from UTF-8 to 17119 Unicode, when the remainder of this document refers to Unicode, the 17120 reader should assume UTF-8. 17122 Much of the text for the profiles comes from RFC 3491 [20]. 17124 14.1. Stringprep Profile for the utf8str_cs Type 17126 Every use of the utf8str_cs type definition in the NFSv4 protocol 17127 specification follows the profile named nfs4_cs_prep. 17129 14.1.1. Intended Applicability of the nfs4_cs_prep Profile 17131 The utf8str_cs type is a case-sensitive string of UTF-8 characters. 17132 Its primary use in NFSv4.1 is for naming components and pathnames. 17133 Components and pathnames are stored on the server's file system. Two 17134 valid distinct UTF-8 strings might be the same after processing via 17135 the utf8str_cs profile. If the strings are two names inside a 17136 directory, the NFSv4.1 server will need to either: 17138 o disallow the creation of a second name if its post-processed form 17139 collides with that of an existing name, or 17141 o allow the creation of the second name, but arrange so that after 17142 post-processing, the second name is different than the post- 17143 processed form of the first name. 17145 14.1.2. Character Repertoire of nfs4_cs_prep 17147 The nfs4_cs_prep profile uses Unicode 3.2, as defined in stringprep's 17148 Appendix A.1. However, NFSv4.1 implementations are not limited to 17149 3.2. 17151 14.1.3. 
Mapping Used by nfs4_cs_prep 17153 The nfs4_cs_prep profile specifies mapping using the following tables 17154 from stringprep: 17156 Table B.1 17158 Table B.2 is normally not part of the nfs4_cs_prep profile as it is 17159 primarily for dealing with case-insensitive comparisons. However, if 17160 the NFSv4.1 file server supports the case_insensitive file system 17161 attribute, and if case_insensitive is TRUE, the NFSv4.1 server MUST 17162 use Table B.2 (in addition to Table B1) when processing utf8str_cs 17163 strings, and the NFSv4.1 client MUST assume Table B.2 (in addition to 17164 Table B.1) is being used. 17166 If the case_preserving attribute is present and set to FALSE, then 17167 the NFSv4.1 server MUST use Table B.2 to map case when processing 17168 utf8str_cs strings. Whether the server maps from lower to upper case 17169 or from upper to lower case is an implementation dependency. 17171 14.1.4. Normalization used by nfs4_cs_prep 17173 The nfs4_cs_prep profile does not specify a normalization form. A 17174 later revision of this specification may specify a particular 17175 normalization form. Therefore, the server and client can expect that 17176 they may receive unnormalized characters within protocol requests and 17177 responses. If the operating environment requires normalization, then 17178 the implementation must normalize utf8str_cs strings within the 17179 protocol before presenting the information to an application (at the 17180 client) or local file system (at the server). 17182 14.1.5. Prohibited Output for nfs4_cs_prep 17184 The nfs4_cs_prep profile RECOMMENDS prohibiting the use of the 17185 following tables from stringprep: 17187 Table C.5 17189 Table C.6 17191 14.1.6. Bidirectional Output for nfs4_cs_prep 17193 The nfs4_cs_prep profile does not specify any checking of 17194 bidirectional strings. 17196 14.2. 
Stringprep Profile for the utf8str_cis Type 17198 Every use of the utf8str_cis type definition in the NFSv4.1 protocol 17199 specification follows the profile named nfs4_cis_prep. 17201 14.2.1. Intended Applicability of the nfs4_cis_prep Profile 17203 The utf8str_cis type is a case-insensitive string of UTF-8 17204 characters. Its primary use in NFSv4.1 is for naming NFS servers. 17206 14.2.2. Character Repertoire of nfs4_cis_prep 17208 The nfs4_cis_prep profile uses Unicode 3.2, as defined in 17209 stringprep's Appendix A.1. However, NFSv4.1 implementations are not 17210 limited to 3.2. 17212 14.2.3. Mapping Used by nfs4_cis_prep 17214 The nfs4_cis_prep profile specifies mapping using the following 17215 tables from stringprep: 17217 Table B.1 17219 Table B.2 17221 14.2.4. Normalization Used by nfs4_cis_prep 17223 The nfs4_cis_prep profile specifies using Unicode normalization form 17224 KC, as described in stringprep. 17226 14.2.5. Prohibited Output for nfs4_cis_prep 17228 The nfs4_cis_prep profile specifies prohibiting using the following 17229 tables from stringprep: 17231 Table C.1.2 17233 Table C.2.2 17235 Table C.3 17237 Table C.4 17239 Table C.5 17241 Table C.6 17243 Table C.7 17245 Table C.8 17247 Table C.9 17249 14.2.6. Bidirectional Output for nfs4_cis_prep 17251 The nfs4_cis_prep profile specifies checking bidirectional strings as 17252 described in stringprep's Section 6. 17254 14.3. Stringprep Profile for the utf8str_mixed Type 17256 Every use of the utf8str_mixed type definition in the NFSv4.1 17257 protocol specification follows the profile named nfs4_mixed_prep. 17259 14.3.1. Intended Applicability of the nfs4_mixed_prep Profile 17261 The utf8str_mixed type is a string of UTF-8 characters, with a prefix 17262 that is case sensitive, a separator equal to '@', and a suffix that 17263 is a fully qualified domain name. Its primary use in NFSv4.1 is for 17264 naming principals identified in an Access Control Entry. 17266 14.3.2. 
Character Repertoire of nfs4_mixed_prep 17268 The nfs4_mixed_prep profile uses Unicode 3.2, as defined in 17269 stringprep's Appendix A.1. However, NFSv4.1 implementations are not 17270 limited to 3.2. 17272 14.3.3. Mapping Used by nfs4_cis_prep 17274 For the prefix and the separator of a utf8str_mixed string, the 17275 nfs4_mixed_prep profile specifies mapping using the following table 17276 from stringprep: 17278 Table B.1 17280 For the suffix of a utf8str_mixed string, the nfs4_mixed_prep profile 17281 specifies mapping using the following tables from stringprep: 17283 Table B.1 17285 Table B.2 17287 14.3.4. Normalization Used by nfs4_mixed_prep 17289 The nfs4_mixed_prep profile specifies using Unicode normalization 17290 form KC, as described in stringprep. 17292 14.3.5. Prohibited Output for nfs4_mixed_prep 17294 The nfs4_mixed_prep profile specifies prohibiting using the following 17295 tables from stringprep: 17297 Table C.1.2 17299 Table C.2.2 17301 Table C.3 17303 Table C.4 17305 Table C.5 17307 Table C.6 17309 Table C.7 17311 Table C.8 17313 Table C.9 17315 14.3.6. Bidirectional Output for nfs4_mixed_prep 17317 The nfs4_mixed_prep profile specifies checking bidirectional strings 17318 as described in stringprep's Section 6. 17320 14.4. UTF-8 Capabilities 17322 const FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1; 17323 const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 = 0x2; 17325 typedef uint32_t fs_charset_cap4; 17327 Because some operating environments and file systems do not enforce 17328 character set encodings, NFSv4.1 supports the fs_charset_cap 17329 attribute (Section 5.8.2.11) that indicates to the client a file 17330 system's UTF-8 capabilities. The attribute is an integer containing 17331 a pair of flags. 
The first flag is FSCHARSET_CAP4_CONTAINS_NON_UTF8, 17332 which, if set to one, tells the client that the file system contains 17333 non-UTF-8 characters, and the server will not convert non-UTF 17334 characters to UTF-8 if the client reads a symbolic link or directory, 17335 neither will operations with component names or pathnames in the 17336 arguments convert the strings to UTF-8. The second flag is 17337 FSCHARSET_CAP4_ALLOWS_ONLY_UTF8, which, if set to one, indicates that 17338 the server will accept (and generate) only UTF-8 characters on the 17339 file system. If FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to one, 17340 FSCHARSET_CAP4_CONTAINS_NON_UTF8 MUST be set to zero. 17341 FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 SHOULD always be set to one. 17343 14.5. UTF-8 Related Errors 17345 Where the client sends an invalid UTF-8 string, the server should 17346 return NFS4ERR_INVAL (see Table 5). This includes cases in which 17347 inappropriate prefixes are detected and where the count includes 17348 trailing bytes that do not constitute a full UCS character. 17350 Where the client-supplied string is valid UTF-8 but contains 17351 characters that are not supported by the server as a value for that 17352 string (e.g., names containing characters outside of Unicode plane 0 17353 on file systems that fail to support such characters despite their 17354 presence in the Unicode standard), the server should return 17355 NFS4ERR_BADCHAR. 17357 Where a UTF-8 string is used as a file name, and the file system 17358 (while supporting all of the characters within the name) does not 17359 allow that particular name to be used, the server should return the 17360 error NFS4ERR_BADNAME (Table 5). This includes situations in which 17361 the server file system imposes a normalization constraint on name 17362 strings, but will also include such situations as file system 17363 prohibitions of "." and ".." as file names for certain operations, 17364 and other such constraints. 17366 15. 
Error Values 17368 NFS error numbers are assigned to failed operations within a Compound 17369 (COMPOUND or CB_COMPOUND) request. A Compound request contains a 17370 number of NFS operations that have their results encoded in sequence 17371 in a Compound reply. The results of successful operations will 17372 consist of an NFS4_OK status followed by the encoded results of the 17373 operation. If an NFS operation fails, an error status will be 17374 entered in the reply and the Compound request will be terminated. 17376 15.1. Error Definitions 17378 Protocol Error Definitions 17380 +-----------------------------------+--------+-------------------+ 17381 | Error | Number | Description | 17382 +-----------------------------------+--------+-------------------+ 17383 | NFS4_OK | 0 | Section 15.1.3.1 | 17384 | NFS4ERR_ACCESS | 13 | Section 15.1.6.1 | 17385 | NFS4ERR_ATTRNOTSUPP | 10032 | Section 15.1.15.1 | 17386 | NFS4ERR_ADMIN_REVOKED | 10047 | Section 15.1.5.1 | 17387 | NFS4ERR_BACK_CHAN_BUSY | 10057 | Section 15.1.12.1 | 17388 | NFS4ERR_BADCHAR | 10040 | Section 15.1.7.1 | 17389 | NFS4ERR_BADHANDLE | 10001 | Section 15.1.2.1 | 17390 | NFS4ERR_BADIOMODE | 10049 | Section 15.1.10.1 | 17391 | NFS4ERR_BADLAYOUT | 10050 | Section 15.1.10.2 | 17392 | NFS4ERR_BADNAME | 10041 | Section 15.1.7.2 | 17393 | NFS4ERR_BADOWNER | 10039 | Section 15.1.15.2 | 17394 | NFS4ERR_BADSESSION | 10052 | Section 15.1.11.1 | 17395 | NFS4ERR_BADSLOT | 10053 | Section 15.1.11.2 | 17396 | NFS4ERR_BADTYPE | 10007 | Section 15.1.4.1 | 17397 | NFS4ERR_BADXDR | 10036 | Section 15.1.1.1 | 17398 | NFS4ERR_BAD_COOKIE | 10003 | Section 15.1.1.2 | 17399 | NFS4ERR_BAD_HIGH_SLOT | 10077 | Section 15.1.11.3 | 17400 | NFS4ERR_BAD_RANGE | 10042 | Section 15.1.8.1 | 17401 | NFS4ERR_BAD_SEQID | 10026 | Section 15.1.16.1 | 17402 | NFS4ERR_BAD_SESSION_DIGEST | 10051 | Section 15.1.12.2 | 17403 | NFS4ERR_BAD_STATEID | 10025 | Section 15.1.5.2 | 17404 | NFS4ERR_CB_PATH_DOWN | 10048 | Section 15.1.11.4 | 17405 | 
NFS4ERR_CLID_INUSE | 10017 | Section 15.1.13.2 | 17406 | NFS4ERR_CLIENTID_BUSY | 10074 | Section 15.1.13.1 | 17407 | NFS4ERR_COMPLETE_ALREADY | 10054 | Section 15.1.9.1 | 17408 | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | 10055 | Section 15.1.11.6 | 17409 | NFS4ERR_DEADLOCK | 10045 | Section 15.1.8.2 | 17410 | NFS4ERR_DEADSESSION | 10078 | Section 15.1.11.5 | 17411 | NFS4ERR_DELAY | 10008 | Section 15.1.1.3 | 17412 | NFS4ERR_DELEG_ALREADY_WANTED | 10056 | Section 15.1.14.1 | 17413 | NFS4ERR_DELEG_REVOKED | 10087 | Section 15.1.5.3 | 17414 | NFS4ERR_DENIED | 10010 | Section 15.1.8.3 | 17415 | NFS4ERR_DIRDELEG_UNAVAIL | 10084 | Section 15.1.14.2 | 17416 | NFS4ERR_DQUOT | 69 | Section 15.1.4.2 | 17417 | NFS4ERR_ENCR_ALG_UNSUPP | 10079 | Section 15.1.13.3 | 17418 | NFS4ERR_EXIST | 17 | Section 15.1.4.3 | 17419 | NFS4ERR_EXPIRED | 10011 | Section 15.1.5.4 | 17420 | NFS4ERR_FBIG | 27 | Section 15.1.4.4 | 17421 | NFS4ERR_FHEXPIRED | 10014 | Section 15.1.2.2 | 17422 | NFS4ERR_FILE_OPEN | 10046 | Section 15.1.4.5 | 17423 | NFS4ERR_GRACE | 10013 | Section 15.1.9.2 | 17424 | NFS4ERR_HASH_ALG_UNSUPP | 10072 | Section 15.1.13.4 | 17425 | NFS4ERR_INVAL | 22 | Section 15.1.1.4 | 17426 | NFS4ERR_IO | 5 | Section 15.1.4.6 | 17427 | NFS4ERR_ISDIR | 21 | Section 15.1.2.3 | 17428 | NFS4ERR_LAYOUTTRYLATER | 10058 | Section 15.1.10.3 | 17429 | NFS4ERR_LAYOUTUNAVAILABLE | 10059 | Section 15.1.10.4 | 17430 | NFS4ERR_LEASE_MOVED | 10031 | Section 15.1.16.2 | 17431 | NFS4ERR_LOCKED | 10012 | Section 15.1.8.4 | 17432 | NFS4ERR_LOCKS_HELD | 10037 | Section 15.1.8.5 | 17433 | NFS4ERR_LOCK_NOTSUPP | 10043 | Section 15.1.8.6 | 17434 | NFS4ERR_LOCK_RANGE | 10028 | Section 15.1.8.7 | 17435 | NFS4ERR_MINOR_VERS_MISMATCH | 10021 | Section 15.1.3.2 | 17436 | NFS4ERR_MLINK | 31 | Section 15.1.4.7 | 17437 | NFS4ERR_MOVED | 10019 | Section 15.1.2.4 | 17438 | NFS4ERR_NAMETOOLONG | 63 | Section 15.1.7.3 | 17439 | NFS4ERR_NOENT | 2 | Section 15.1.4.8 | 17440 | NFS4ERR_NOFILEHANDLE | 10020 | Section 15.1.2.5 | 
17441 | NFS4ERR_NOMATCHING_LAYOUT | 10060 | Section 15.1.10.5 | 17442 | NFS4ERR_NOSPC | 28 | Section 15.1.4.9 | 17443 | NFS4ERR_NOTDIR | 20 | Section 15.1.2.6 | 17444 | NFS4ERR_NOTEMPTY | 66 | Section 15.1.4.10 | 17445 | NFS4ERR_NOTSUPP | 10004 | Section 15.1.1.5 | 17446 | NFS4ERR_NOT_ONLY_OP | 10081 | Section 15.1.3.3 | 17447 | NFS4ERR_NOT_SAME | 10027 | Section 15.1.15.3 | 17448 | NFS4ERR_NO_GRACE | 10033 | Section 15.1.9.3 | 17449 | NFS4ERR_NXIO | 6 | Section 15.1.16.3 | 17450 | NFS4ERR_OLD_STATEID | 10024 | Section 15.1.5.5 | 17451 | NFS4ERR_OPENMODE | 10038 | Section 15.1.8.8 | 17452 | NFS4ERR_OP_ILLEGAL | 10044 | Section 15.1.3.4 | 17453 | NFS4ERR_OP_NOT_IN_SESSION | 10071 | Section 15.1.3.5 | 17454 | NFS4ERR_PERM | 1 | Section 15.1.6.2 | 17455 | NFS4ERR_PNFS_IO_HOLE | 10075 | Section 15.1.10.6 | 17456 | NFS4ERR_PNFS_NO_LAYOUT | 10080 | Section 15.1.10.7 | 17457 | NFS4ERR_RECALLCONFLICT | 10061 | Section 15.1.14.3 | 17458 | NFS4ERR_RECLAIM_BAD | 10034 | Section 15.1.9.4 | 17459 | NFS4ERR_RECLAIM_CONFLICT | 10035 | Section 15.1.9.5 | 17460 | NFS4ERR_REJECT_DELEG | 10085 | Section 15.1.14.4 | 17461 | NFS4ERR_REP_TOO_BIG | 10066 | Section 15.1.3.6 | 17462 | NFS4ERR_REP_TOO_BIG_TO_CACHE | 10067 | Section 15.1.3.7 | 17463 | NFS4ERR_REQ_TOO_BIG | 10065 | Section 15.1.3.8 | 17464 | NFS4ERR_RESTOREFH | 10030 | Section 15.1.16.4 | 17465 | NFS4ERR_RETRY_UNCACHED_REP | 10068 | Section 15.1.3.9 | 17466 | NFS4ERR_RETURNCONFLICT | 10086 | Section 15.1.10.8 | 17467 | NFS4ERR_ROFS | 30 | Section 15.1.4.11 | 17468 | NFS4ERR_SAME | 10009 | Section 15.1.15.4 | 17469 | NFS4ERR_SHARE_DENIED | 10015 | Section 15.1.8.9 | 17470 | NFS4ERR_SEQUENCE_POS | 10064 | Section 15.1.3.10 | 17471 | NFS4ERR_SEQ_FALSE_RETRY | 10076 | Section 15.1.11.7 | 17472 | NFS4ERR_SEQ_MISORDERED | 10063 | Section 15.1.11.8 | 17473 | NFS4ERR_SERVERFAULT | 10006 | Section 15.1.1.6 | 17474 | NFS4ERR_STALE | 70 | Section 15.1.2.7 | 17475 | NFS4ERR_STALE_CLIENTID | 10022 | Section 15.1.13.5 | 17476 | 
NFS4ERR_STALE_STATEID | 10023 | Section 15.1.16.5 | 17477 | NFS4ERR_SYMLINK | 10029 | Section 15.1.2.8 | 17478 | NFS4ERR_TOOSMALL | 10005 | Section 15.1.1.7 | 17479 | NFS4ERR_TOO_MANY_OPS | 10070 | Section 15.1.3.11 | 17480 | NFS4ERR_UNKNOWN_LAYOUTTYPE | 10062 | Section 15.1.10.9 | 17481 | NFS4ERR_UNSAFE_COMPOUND | 10069 | Section 15.1.3.12 | 17482 | NFS4ERR_WRONGSEC | 10016 | Section 15.1.6.3 | 17483 | NFS4ERR_WRONG_CRED | 10082 | Section 15.1.6.4 | 17484 | NFS4ERR_WRONG_TYPE | 10083 | Section 15.1.2.9 | 17485 | NFS4ERR_XDEV | 18 | Section 15.1.4.12 | 17486 +-----------------------------------+--------+-------------------+ 17488 Table 5 17490 15.1.1. General Errors 17492 This section deals with errors that are applicable to a broad set of 17493 different purposes. 17495 15.1.1.1. NFS4ERR_BADXDR (Error Code 10036) 17497 The arguments for this operation do not match those specified in the 17498 XDR definition. This includes situations in which the request ends 17499 before all the arguments have been seen. Note that this error 17500 applies when fixed enumerations (these include booleans) have a value 17501 within the input stream that is not valid for the enum. A replier 17502 may pre-parse all operations for a Compound procedure before doing 17503 any operation execution and return RPC-level XDR errors in that case. 17505 15.1.1.2. NFS4ERR_BAD_COOKIE (Error Code 10003) 17507 Used for operations that provide a set of information indexed by some 17508 quantity provided by the client or cookie sent by the server for an 17509 earlier invocation. Where the value cannot be used for its intended 17510 purpose, this error results. 17512 15.1.1.3. NFS4ERR_DELAY (Error Code 10008) 17514 For any of a number of reasons, the replier could not process this 17515 operation in what was deemed a reasonable time. The client should 17516 wait and then try the request with a new slot and sequence value. 
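As a non-normative illustration (not part of the protocol text), the
"wait, then retry with a new slot/sequence value" behavior a client
might apply on receiving NFS4ERR_DELAY can be sketched as follows.
The names DelayRetryClient and send are hypothetical stand-ins, not
interfaces defined by this specification:

```python
import time

NFS4_OK = 0
NFS4ERR_DELAY = 10008

class DelayRetryClient:
    """Sketch: re-send a request that drew NFS4ERR_DELAY, after a
    backoff, using a new sequence value so the replier does not serve
    the retry from its replay cache (hypothetical helper, assuming a
    send callable of the form send(slot, seq, ops) -> status)."""

    def __init__(self, send, max_tries=4, base_delay=0.1):
        self.send = send
        self.max_tries = max_tries
        self.base_delay = base_delay

    def call(self, slot, seq, ops):
        for attempt in range(self.max_tries):
            status = self.send(slot, seq, ops)
            if status != NFS4ERR_DELAY:
                return status
            # Delay, then re-send with a new sequence value so the
            # request is processed afresh rather than replayed.
            time.sleep(self.base_delay * (2 ** attempt))
            seq += 1
        return NFS4ERR_DELAY
```

The bullets that follow refine this picture: when the NFS4ERR_DELAY
was returned on the Sequence operation itself, the retry instead
reuses the same slot and sequence values.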
17518 Some examples of scenarios that might lead to this situation: 17520 o A server that supports hierarchical storage receives a request to 17521 process a file that had been migrated. 17523 o An operation requires a delegation recall to proceed, so that the 17524 need to wait for this delegation to be recalled and returned makes 17525 processing this request in a timely fashion impossible. 17527 o A request is being performed on a session being migrated from 17528 another server as described in Section 11.14.3, and the lack of 17529 full information about the state of the session on the source 17530 makes it impossible to process the request immediately. 17532 In such cases, returning the error NFS4ERR_DELAY allows necessary 17533 preparatory operations to proceed without holding up requester 17534 resources such as a session slot. After delaying for period of time, 17535 the client can then re-send the operation in question, often as part 17536 of a nearly identical request. Because of the need to avoid spurious 17537 reissues of non-idempotent operations and to avoid acting in response 17538 to NFS4ERR_DELAY errors returned on responses returned from the 17539 replier's replay cache, integration with the session-provided replay 17540 cache is necessary. There are a number of cases to deal with, each 17541 of which requires different sorts of handling by the requester and 17542 replier: 17544 o If NFS4ERR_DELAY is returned on a SEQUENCE operation, the request 17545 is retried in full with the SEQUENCE operation containing the same 17546 slot and sequence values. In this case, the replier MUST avoid 17547 returning a response containing NFS4ERR_DELAY as the response to 17548 SEQUENCE solely on the basis of its presence in the replay cache. 
17549 If the replier did this, the retries would not be effective as 17550 there would be no opportunity for the replier to see whether the 17551 condition that generated the NFS4ERR_DELAY had been rectified 17552 during the interim between the original request and the retry. 17554 o If NFS4ERR_DELAY is returned on an operation other than SEQUENCE 17555 which validly appears as the first operation of a request, 17556 handling is similar. The request can be retried in full without 17557 modification. In this case as well, the replier MUST avoid 17558 returning a response containing NFS4ERR_DELAY as the response to 17559 an initial operation of a request solely on the basis of its 17560 presence in the replay cache. If the replier did this, the 17561 retries would not be effective as there would be no opportunity 17562 for the replier to see whether the condition that generated the 17563 NFS4ERR_DELAY had been rectified during the interim between the 17564 original request and the retry. 17566 o If NFS4ERR_DELAY is returned on an operation other than the first 17567 in the request, the request when retried MUST contain a SEQUENCE 17568 operation which is different than the original one, with either 17569 the bin id or the sequence value different from that in the 17570 original request. Because requesters do this, there is no need 17571 for the replier to take special care to avoid returning an 17572 NFS4ERR_DELAY error, obtained from the replay cache. When no non- 17573 idempotent operations have been processed before the NFS4ERR_DELAY 17574 was returned, the requester should retry the request in full, with 17575 the only difference from the original request being the 17576 modification to the slot id or sequence value in the reissued 17577 SEQUENCE operation. 
17579 o When NFS4ERR_DELAY is returned on an operation other than the 17580 first within a request and there has been a non-idempotent 17581 operation processed before the NFS4ERR_DELAY was returned, 17582 reissuing the request as is normally done would incorrectly cause 17583 the re-execution of the non-idempotent operation. 17585 To avoid this situation, the client should reissue the request 17586 without the non-idempotent operation. The request still must use 17587 a SEQUENCE operation with either a different slot id or sequence 17588 value from the SEQUENCE in the original request. Because this is 17589 done, there is no way the replier could avoid spuriously re- 17590 executing the non-idempotent operation since the different 17591 SEQUENCE parameters prevent the requester from recognizing that 17592 the non-idempotent operation is being retried. 17594 Note that without the ability to return NFS4ERR_DELAY and the 17595 requester's willingness to re-send when receiving it, deadlock might 17596 result. For example, if a recall is done, and if the delegation 17597 return or operations preparatory to delegation return are held up by 17598 other operations that need the delegation to be returned, session 17599 slots might not be available. The result could be deadlock. 17601 15.1.1.4. NFS4ERR_INVAL (Error Code 22) 17603 The arguments for this operation are not valid for some reason, even 17604 though they do match those specified in the XDR definition for the 17605 request. 17607 15.1.1.5. NFS4ERR_NOTSUPP (Error Code 10004) 17609 Operation not supported, either because the operation is an OPTIONAL 17610 one and is not supported by this server or because the operation MUST 17611 NOT be implemented in the current minor version. 17613 15.1.1.6. NFS4ERR_SERVERFAULT (Error Code 10006) 17615 An error occurred on the server that does not map to any of the 17616 specific legal NFSv4.1 protocol error values. The client should 17617 translate this into an appropriate error. 
UNIX clients may choose to 17618 translate this to EIO. 17620 15.1.1.7. NFS4ERR_TOOSMALL (Error Code 10005) 17622 Used where an operation returns a variable amount of data, with a 17623 limit specified by the client. Where the data returned cannot be fit 17624 within the limit specified by the client, this error results. 17626 15.1.2. Filehandle Errors 17628 These errors deal with the situation in which the current or saved 17629 filehandle, or the filehandle passed to PUTFH intended to become the 17630 current filehandle, is invalid in some way. This includes situations 17631 in which the filehandle is a valid filehandle in general but is not 17632 of the appropriate object type for the current operation. 17634 Where the error description indicates a problem with the current or 17635 saved filehandle, it is to be understood that filehandles are only 17636 checked for the condition if they are implicit arguments of the 17637 operation in question. 17639 15.1.2.1. NFS4ERR_BADHANDLE (Error Code 10001) 17641 Illegal NFS filehandle for the current server. The current file 17642 handle failed internal consistency checks. Once accepted as valid 17643 (by PUTFH), no subsequent status change can cause the filehandle to 17644 generate this error. 17646 15.1.2.2. NFS4ERR_FHEXPIRED (Error Code 10014) 17648 A current or saved filehandle that is an argument to the current 17649 operation is volatile and has expired at the server. 17651 15.1.2.3. NFS4ERR_ISDIR (Error Code 21) 17653 The current or saved filehandle designates a directory when the 17654 current operation does not allow a directory to be accepted as the 17655 target of this operation. 17657 15.1.2.4. NFS4ERR_MOVED (Error Code 10019) 17659 The file system that contains the current filehandle object is not 17660 present at the server, or is not accessible using the network address 17661 used. 
It may have been made accessible on a different set of network 17662 addresses, relocated or migrated to another server, or it may have 17663 never been present. The client may obtain the new file system 17664 location by obtaining the "fs_locations" or "fs_locations_info" 17665 attribute for the current filehandle. For further discussion, refer 17666 to Section 11.3. 17668 As with the case of NFS4ERR_DELAY, it is possible that one or more 17669 non-idempotent operations may have been successfully executed within 17670 a COMPOUND before NFS4ERR_MOVED is returned. Because of this, once 17671 the new location is determined, the original request which received 17672 the NFS4ERR_MOVED should not be re-executed in full. Instead, the 17673 client should send a new COMPOUND, with any successfully executed 17674 non-idempotent operations removed. When the client uses the same 17675 session for the new COMPOUND, its SEQUENCE operation should use a 17676 different slot id or sequence. 17678 15.1.2.5. NFS4ERR_NOFILEHANDLE (Error Code 10020) 17680 The logical current or saved filehandle value is required by the 17681 current operation and is not set. This may be a result of a 17682 malformed COMPOUND operation (i.e., no PUTFH or PUTROOTFH before an 17683 operation that requires the current filehandle be set). 17685 15.1.2.6. NFS4ERR_NOTDIR (Error Code 20) 17687 The current (or saved) filehandle designates an object that is not a 17688 directory for an operation in which a directory is required. 17690 15.1.2.7. NFS4ERR_STALE (Error Code 70) 17692 The current or saved filehandle value designating an argument to the 17693 current operation is invalid. The file referred to by that 17694 filehandle no longer exists or access to it has been revoked. 17696 15.1.2.8. NFS4ERR_SYMLINK (Error Code 10029) 17698 The current filehandle designates a symbolic link when the current 17699 operation does not allow a symbolic link as the target. 17701 15.1.2.9. 
NFS4ERR_WRONG_TYPE (Error Code 10083) 17703 The current (or saved) filehandle designates an object that is of an 17704 invalid type for the current operation, and there is no more specific 17705 error (such as NFS4ERR_ISDIR or NFS4ERR_SYMLINK) that applies. Note 17706 that in NFSv4.0, such situations generally resulted in the less- 17707 specific error NFS4ERR_INVAL. 17709 15.1.3. Compound Structure Errors 17711 This section deals with errors that relate to the overall structure 17712 of a Compound request (by which we mean to include both COMPOUND and 17713 CB_COMPOUND), rather than to particular operations. 17715 There are a number of basic constraints on the operations that may 17716 appear in a Compound request. Sessions add to these basic 17717 constraints by requiring a Sequence operation (either SEQUENCE or 17718 CB_SEQUENCE) at the start of the Compound. 17720 15.1.3.1. NFS_OK (Error code 0) 17722 Indicates the operation completed successfully, in that all of the 17723 constituent operations completed without error. 17725 15.1.3.2. NFS4ERR_MINOR_VERS_MISMATCH (Error code 10021) 17727 The minor version specified is not one that the current listener 17728 supports. This value is returned in the overall status for the 17729 Compound but is not associated with a specific operation since the 17730 results will specify a result count of zero. 17732 15.1.3.3. NFS4ERR_NOT_ONLY_OP (Error Code 10081) 17734 Certain operations, which are allowed to be executed outside of a 17735 session, MUST be the only operation within a Compound whenever the 17736 Compound does not start with a Sequence operation. This error 17737 results when that constraint is not met. 17739 15.1.3.4. NFS4ERR_OP_ILLEGAL (Error Code 10044) 17741 The operation code is not a valid one for the current Compound 17742 procedure. The opcode in the result stream matched with this error 17743 is the ILLEGAL value, although the value that appears in the request 17744 stream may be different. 
Where an illegal value appears and the replier pre-parses all operations for a Compound procedure before doing any operation execution, an RPC-level XDR error may be returned.

15.1.3.5. NFS4ERR_OP_NOT_IN_SESSION (Error Code 10071)

Most forward operations and all callback operations are only valid within the context of a session, so that the Compound request in question MUST begin with a Sequence operation. If an attempt is made to execute these operations outside the context of a session, this error results.

15.1.3.6. NFS4ERR_REP_TOO_BIG (Error Code 10066)

The reply to a Compound would exceed the channel's negotiated maximum response size.

15.1.3.7. NFS4ERR_REP_TOO_BIG_TO_CACHE (Error Code 10067)

The reply to a Compound would exceed the channel's negotiated maximum size for replies cached in the reply cache when the Sequence for the current request specifies that this request is to be cached.

15.1.3.8. NFS4ERR_REQ_TOO_BIG (Error Code 10065)

The Compound request exceeds the channel's negotiated maximum size for requests.

15.1.3.9. NFS4ERR_RETRY_UNCACHED_REP (Error Code 10068)

The requester has attempted a retry of a Compound that it previously requested not be placed in the reply cache.

15.1.3.10. NFS4ERR_SEQUENCE_POS (Error Code 10064)

A Sequence operation appeared in a position other than the first operation of a Compound request.

15.1.3.11. NFS4ERR_TOO_MANY_OPS (Error Code 10070)

The Compound request has too many operations, exceeding the count negotiated when the session was created.

15.1.3.12. NFS4ERR_UNSAFE_COMPOUND (Error Code 10069)

The client has sent a COMPOUND request with an unsafe mix of operations -- specifically, with a non-idempotent operation that changes the current filehandle and that is not followed by a GETFH.

15.1.4. File System Errors

These errors describe situations that occurred in the underlying file system implementation rather than in the protocol or any NFSv4.x feature.

15.1.4.1. NFS4ERR_BADTYPE (Error Code 10007)

An attempt was made to create an object with an inappropriate type specified to CREATE. This may be because the type is undefined, because the type is not supported by the server, or because the type is not intended to be created by CREATE (such as a regular file or named attribute, for which OPEN is used to do the file creation).

15.1.4.2. NFS4ERR_DQUOT (Error Code 19)

Resource (quota) hard limit exceeded. The user's resource limit on the server has been exceeded.

15.1.4.3. NFS4ERR_EXIST (Error Code 17)

A file of the specified target name (when creating, renaming, or linking) already exists.

15.1.4.4. NFS4ERR_FBIG (Error Code 27)

The file is too large. The operation would have caused the file to grow beyond the server's limit.

15.1.4.5. NFS4ERR_FILE_OPEN (Error Code 10046)

The operation is not allowed because a file involved in the operation is currently open. Servers may, but are not required to, disallow linking-to, removing, or renaming open files.

15.1.4.6. NFS4ERR_IO (Error Code 5)

Indicates that an I/O error occurred for which the file system was unable to provide recovery.

15.1.4.7. NFS4ERR_MLINK (Error Code 31)

The request would have caused the server's limit for the number of hard links a file may have to be exceeded.

15.1.4.8. NFS4ERR_NOENT (Error Code 2)

Indicates no such file or directory. The file or directory name specified does not exist.

15.1.4.9. NFS4ERR_NOSPC (Error Code 28)

Indicates there is no space left on the device. The operation would have caused the server's file system to exceed its limit.

15.1.4.10. NFS4ERR_NOTEMPTY (Error Code 66)

An attempt was made to remove a directory that was not empty.

15.1.4.11. NFS4ERR_ROFS (Error Code 30)

Indicates a read-only file system. A modifying operation was attempted on a read-only file system.

15.1.4.12. NFS4ERR_XDEV (Error Code 18)

Indicates an attempt to do an operation, such as linking, that inappropriately crosses a boundary. This may be due to such boundaries as:

o that between file systems (where the fsids are different).

o that between different named attribute directories or between a named attribute directory and an ordinary directory.

o that between byte-ranges of a file system that the file system implementation treats as separate (for example, for space accounting purposes), and where cross-connection between the byte-ranges is not allowed.

15.1.5. State Management Errors

These errors indicate problems with the stateid (or one of the stateids) passed to a given operation. This includes situations in which the stateid is invalid as well as situations in which the stateid is valid but designates locking state that has been revoked. Depending on the operation, the stateid when valid may designate opens, byte-range locks, file or directory delegations, layouts, or device maps.

15.1.5.1. NFS4ERR_ADMIN_REVOKED (Error Code 10047)

A stateid designates locking state of any type that has been revoked due to administrative interaction, possibly while the lease is valid.

15.1.5.2. NFS4ERR_BAD_STATEID (Error Code 10025)

A stateid does not properly designate any valid state. See Sections 8.2.4 and 8.2.3 for a discussion of how stateids are validated.

15.1.5.3. NFS4ERR_DELEG_REVOKED (Error Code 10087)

A stateid designates recallable locking state of any type (delegation or layout) that has been revoked due to the failure of the client to return the lock when it was recalled.

15.1.5.4. NFS4ERR_EXPIRED (Error Code 10011)

A stateid designates locking state of any type that has been revoked due to expiration of the client's lease, either immediately upon lease expiration, or following a later request for a conflicting lock.

15.1.5.5. NFS4ERR_OLD_STATEID (Error Code 10024)

A stateid with a non-zero seqid value does not match the current seqid for the state designated by the user.

15.1.6. Security Errors

These are the various permission-related errors in NFSv4.1.

15.1.6.1. NFS4ERR_ACCESS (Error Code 13)

Indicates permission denied. The caller does not have the correct permission to perform the requested operation. Contrast this with NFS4ERR_PERM (Section 15.1.6.2), which restricts itself to owner or privileged-user permission failures, and NFS4ERR_WRONG_CRED (Section 15.1.6.4), which deals with appropriate permission to delete or modify transient objects based on the credentials of the user that created them.

15.1.6.2. NFS4ERR_PERM (Error Code 1)

Indicates requester is not the owner. The operation was not allowed because the caller is neither a privileged user (root) nor the owner of the target of the operation.

15.1.6.3. NFS4ERR_WRONGSEC (Error Code 10016)

Indicates that the security mechanism being used by the client for the operation does not match the server's security policy. The client should change the security mechanism being used and re-send the operation (but not with the same slot ID and sequence ID; one or both MUST be different on the re-send). SECINFO and SECINFO_NO_NAME can be used to determine the appropriate mechanism.
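The recovery sequence just described (detect NFS4ERR_WRONGSEC, consult SECINFO, re-send under a different mechanism with a fresh sequence ID) can be sketched as client-side logic. This is a non-normative illustration: the Client and ToyServer classes, the flavor strings, and the secinfo/dispatch helpers are hypothetical stand-ins, not protocol elements.

```python
# Illustrative sketch (not normative): recovery from NFS4ERR_WRONGSEC.
NFS4ERR_WRONGSEC = 10016
NFS4_OK = 0

class Client:
    def __init__(self, server, supported_flavors):
        self.server = server
        self.supported = supported_flavors
        self.seqid = {}          # slot ID -> last sequence ID used

    def _next_seq(self, slot):
        # A re-send MUST NOT reuse the same slot ID and sequence ID;
        # advancing the sequence ID on the same slot satisfies that rule.
        self.seqid[slot] = self.seqid.get(slot, 0) + 1
        return self.seqid[slot]

    def call(self, op, flavor, slot=0):
        return self.server.dispatch(op, flavor, slot, self._next_seq(slot))

    def call_with_wrongsec_recovery(self, op, flavor):
        status, result = self.call(op, flavor)
        if status != NFS4ERR_WRONGSEC:
            return status, result
        # SECINFO (or SECINFO_NO_NAME) reports which mechanisms the
        # server's policy allows for this object.
        for candidate in self.server.secinfo(op):
            if candidate in self.supported:
                return self.call(op, candidate)   # fresh sequence ID
        return NFS4ERR_WRONGSEC, None

class ToyServer:
    """Hypothetical server whose policy accepts only one GSS flavor."""
    def secinfo(self, op):
        return ["rpcsec_gss_krb5i"]
    def dispatch(self, op, flavor, slot, seq):
        if flavor != "rpcsec_gss_krb5i":
            return NFS4ERR_WRONGSEC, None
        return NFS4_OK, f"{op} done (slot={slot}, seq={seq})"
```

A client built this way first fails with AUTH_SYS, learns the acceptable mechanism from SECINFO, and succeeds on the re-send without ever reusing the original slot-ID/sequence-ID pair.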
15.1.6.4. NFS4ERR_WRONG_CRED (Error Code 10082)

An operation that manipulates state was attempted by a principal that was not allowed to modify that piece of state.

15.1.7. Name Errors

Names in NFSv4 are UTF-8 strings. When the strings are not valid UTF-8 or are of length zero, the error NFS4ERR_INVAL results. Besides this, there are a number of other errors to indicate specific problems with names.

15.1.7.1. NFS4ERR_BADCHAR (Error Code 10040)

A UTF-8 string contains a character that is not supported by the server in the context in which it is being used.

15.1.7.2. NFS4ERR_BADNAME (Error Code 10041)

A name string in a request consisted of valid UTF-8 characters supported by the server, but the name is not supported by the server as a valid name for the current operation. An example might be creating a file or directory named ".." on a server whose file system uses that name for links to parent directories.

15.1.7.3. NFS4ERR_NAMETOOLONG (Error Code 63)

Returned when the filename in an operation exceeds the server's implementation limit.

15.1.8. Locking Errors

This section deals with errors related to locking, both as to share reservations and byte-range locking. It does not deal with errors specific to the process of reclaiming locks. Those are dealt with in Section 15.1.9.

15.1.8.1. NFS4ERR_BAD_RANGE (Error Code 10042)

The byte-range of a LOCK, LOCKT, or LOCKU operation is not allowed by the server. For example, this error results when a server that only supports 32-bit ranges receives a range that cannot be handled by that server. (See Section 18.10.3.)

15.1.8.2. NFS4ERR_DEADLOCK (Error Code 10045)

The server has been able to determine a byte-range locking deadlock condition for a READW_LT or WRITEW_LT LOCK operation.

15.1.8.3. NFS4ERR_DENIED (Error Code 10010)

An attempt to lock a file is denied. Since this may be a temporary condition, the client is encouraged to re-send the lock request (but not with the same slot ID and sequence ID; one or both MUST be different on the re-send) until the lock is accepted. See Section 9.6 for a discussion of the re-send.

15.1.8.4. NFS4ERR_LOCKED (Error Code 10012)

A READ or WRITE operation was attempted on a file where there was a conflict between the I/O and an existing lock:

o There is a share reservation inconsistent with the I/O being done.

o The range to be read or written intersects an existing mandatory byte-range lock.

15.1.8.5. NFS4ERR_LOCKS_HELD (Error Code 10037)

An operation was prevented by the unexpected presence of locks.

15.1.8.6. NFS4ERR_LOCK_NOTSUPP (Error Code 10043)

A LOCK operation was attempted that would require the upgrade or downgrade of a byte-range lock range already held by the owner, and the server does not support atomic upgrade or downgrade of locks.

15.1.8.7. NFS4ERR_LOCK_RANGE (Error Code 10028)

A LOCK operation is operating on a range that overlaps in part a currently held byte-range lock for the current lock-owner and does not precisely match a single such byte-range lock, where the server does not support this type of request and thus does not implement POSIX locking semantics [21]. See Sections 18.10.4, 18.11.4, and 18.12.4 for a discussion of how this applies to LOCK, LOCKT, and LOCKU, respectively.

15.1.8.8. NFS4ERR_OPENMODE (Error Code 10038)

The client attempted a READ, WRITE, LOCK, or other operation not sanctioned by the stateid passed (e.g., writing to a file opened for read-only access).

15.1.8.9. NFS4ERR_SHARE_DENIED (Error Code 10015)

An attempt to OPEN a file with a share reservation has failed because of a share conflict.

15.1.9. Reclaim Errors

These errors relate to the process of reclaiming locks after a server restart.

15.1.9.1. NFS4ERR_COMPLETE_ALREADY (Error Code 10054)

The client previously sent a successful RECLAIM_COMPLETE operation specifying the same scope, whether that scope is global or for the same file system in the case of a per-fs RECLAIM_COMPLETE. An additional RECLAIM_COMPLETE operation is not necessary and results in this error.

15.1.9.2. NFS4ERR_GRACE (Error Code 10013)

This error is returned when the server is in its grace period with regard to the file system object for which the lock was requested. In this situation, a non-reclaim locking request cannot be granted. This can occur because either:

o The server does not have sufficient information about locks that might be potentially reclaimed to determine whether the lock could be granted.

o The request is made by a client responsible for reclaiming its locks that has not yet done the appropriate RECLAIM_COMPLETE operation, allowing it to proceed to obtain new locks.

In the case of a per-fs grace period, there may be clients (i.e., those currently using the destination file system) that might be unaware of the circumstances resulting in the initiation of the grace period. Such clients need to periodically retry the request until the grace period is over, just as other clients do.

15.1.9.3. NFS4ERR_NO_GRACE (Error Code 10033)

A reclaim of client state was attempted in circumstances in which the server cannot guarantee that conflicting state has not been provided to another client. This occurs in any of the following situations:

o There is no active grace period applying to the file system object for which the request was made.

o The client making the request has no current role in reclaiming locks.

o Previous operations have created a situation in which the server is not able to determine that a reclaim-interfering edge condition does not exist.

15.1.9.4. NFS4ERR_RECLAIM_BAD (Error Code 10034)

The server has determined that a reclaim attempted by the client is not valid, i.e., the lock specified as being reclaimed could not possibly have existed before the server restart or file system migration event. A server is not obliged to make this determination and will typically rely on the client to only reclaim locks that the client was granted prior to restart. However, when a server does have reliable information to enable it to make this determination, this error indicates that the reclaim has been rejected as invalid. This is as opposed to the error NFS4ERR_RECLAIM_CONFLICT (see Section 15.1.9.5), where the server can only determine that there has been an invalid reclaim, but cannot determine which request is invalid.

15.1.9.5. NFS4ERR_RECLAIM_CONFLICT (Error Code 10035)

The reclaim attempted by the client has encountered a conflict and cannot be satisfied. This potentially indicates a misbehaving client, although not necessarily the one receiving the error. The misbehavior might be on the part of the client that established the lock with which this client conflicted. See also Section 15.1.9.4 for the related error, NFS4ERR_RECLAIM_BAD.

15.1.10. pNFS Errors

This section deals with pNFS-related errors, including those that are associated with using NFSv4.1 to communicate with a data server.

15.1.10.1. NFS4ERR_BADIOMODE (Error Code 10049)

An invalid or inappropriate layout iomode was specified. For example, suppose a client's LAYOUTGET operation specified an iomode of LAYOUTIOMODE4_RW, and the server is neither able nor willing to let the client send write requests to data servers; the server can reply with NFS4ERR_BADIOMODE. The client would then send another LAYOUTGET with an iomode of LAYOUTIOMODE4_READ.

15.1.10.2. NFS4ERR_BADLAYOUT (Error Code 10050)

The layout specified is invalid in some way. For LAYOUTCOMMIT, this indicates that the specified layout is not held by the client or is not of mode LAYOUTIOMODE4_RW. For LAYOUTGET, it indicates that a layout matching the client's specification as to minimum length cannot be granted.

15.1.10.3. NFS4ERR_LAYOUTTRYLATER (Error Code 10058)

Layouts are temporarily unavailable for the file. The client should re-send later (but not with the same slot ID and sequence ID; one or both MUST be different on the re-send).

15.1.10.4. NFS4ERR_LAYOUTUNAVAILABLE (Error Code 10059)

Returned when layouts are not available for the current file system or the particular specified file.

15.1.10.5. NFS4ERR_NOMATCHING_LAYOUT (Error Code 10060)

Returned when layouts are recalled and the client has no layouts matching the specification of the layouts being recalled.

15.1.10.6. NFS4ERR_PNFS_IO_HOLE (Error Code 10075)

The pNFS client has attempted to read from or write to an illegal hole of a file of a data server that is using sparse packing. See Section 13.4.4.

15.1.10.7. NFS4ERR_PNFS_NO_LAYOUT (Error Code 10080)

The pNFS client has attempted to read from or write to a file (using a request to a data server) without holding a valid layout. This includes the case where the client had a layout, but the iomode does not allow a WRITE.

15.1.10.8. NFS4ERR_RETURNCONFLICT (Error Code 10086)

A layout is unavailable due to an attempt to perform the LAYOUTGET before a pending LAYOUTRETURN on the file has been received. See Section 12.5.5.2.1.3.

15.1.10.9. NFS4ERR_UNKNOWN_LAYOUTTYPE (Error Code 10062)

The client has specified a layout type that is not supported by the server.

15.1.11. Session Use Errors

This section deals with errors encountered when using sessions, that is, errors encountered when a request uses a Sequence (i.e., either SEQUENCE or CB_SEQUENCE) operation.

15.1.11.1. NFS4ERR_BADSESSION (Error Code 10052)

The specified session ID is unknown to the server to which the operation is addressed.

15.1.11.2. NFS4ERR_BADSLOT (Error Code 10053)

The requester sent a Sequence operation that attempted to use a slot the replier does not have in its slot table. It is possible the slot may have been retired.

15.1.11.3. NFS4ERR_BAD_HIGH_SLOT (Error Code 10077)

The highest_slot argument in a Sequence operation exceeds the replier's enforced highest_slotid.

15.1.11.4. NFS4ERR_CB_PATH_DOWN (Error Code 10048)

There is a problem contacting the client via the callback path. The function of this error has been mostly superseded by the use of status flags in the reply to the SEQUENCE operation (see Section 18.46).

15.1.11.5. NFS4ERR_DEADSESSION (Error Code 10078)

The specified session is a persistent session that is dead and does not accept new requests or perform new operations on existing requests (in the case in which a request was partially executed before server restart).

15.1.11.6. NFS4ERR_CONN_NOT_BOUND_TO_SESSION (Error Code 10055)

A Sequence operation was sent on a connection that has not been associated with the specified session, where the client specified that connection association was to be enforced with SP4_MACH_CRED or SP4_SSV state protection.

15.1.11.7. NFS4ERR_SEQ_FALSE_RETRY (Error Code 10076)

The requester sent a Sequence operation with a slot ID and sequence ID that are in the reply cache, but the replier has detected that the retried request is not the same as the original request. See Section 2.10.6.1.3.1.

15.1.11.8. NFS4ERR_SEQ_MISORDERED (Error Code 10063)

The requester sent a Sequence operation with an invalid sequence ID.

15.1.12. Session Management Errors

This section deals with errors associated with requests used in session management.

15.1.12.1. NFS4ERR_BACK_CHAN_BUSY (Error Code 10057)

An attempt was made to destroy a session when the session cannot be destroyed because the server has callback requests outstanding.

15.1.12.2. NFS4ERR_BAD_SESSION_DIGEST (Error Code 10051)

The digest used in a SET_SSV request is not valid.

15.1.13. Client Management Errors

This section deals with errors associated with requests used to create and manage client IDs.

15.1.13.1. NFS4ERR_CLIENTID_BUSY (Error Code 10074)

The DESTROY_CLIENTID operation has found there are sessions and/or unexpired state associated with the client ID to be destroyed.

15.1.13.2. NFS4ERR_CLID_INUSE (Error Code 10017)

While processing an EXCHANGE_ID operation, the server was presented with a co_ownerid field that matches an existing client with valid leased state, but the principal sending the EXCHANGE_ID operation differs from the principal that established the existing client. This indicates a collision (most likely due to chance) between clients. The client should recover by changing the co_ownerid and re-sending EXCHANGE_ID (but not with the same slot ID and sequence ID; one or both MUST be different on the re-send).

15.1.13.3. NFS4ERR_ENCR_ALG_UNSUPP (Error Code 10079)

An EXCHANGE_ID was sent that specified state protection via SSV, and where the set of encryption algorithms presented by the client did not include any supported by the server.

15.1.13.4. NFS4ERR_HASH_ALG_UNSUPP (Error Code 10072)

An EXCHANGE_ID was sent that specified state protection via SSV, and where the set of hashing algorithms presented by the client did not include any supported by the server.

15.1.13.5. NFS4ERR_STALE_CLIENTID (Error Code 10022)

A client ID not recognized by the server was passed to an operation. Note that unlike the case of NFSv4.0, client IDs are not passed explicitly to the server in ordinary locking operations and cannot result in this error. Instead, when there is a server restart, it is first manifested through an error on the associated session, and the staleness of the client ID is detected when trying to associate a client ID with a new session.

15.1.14. Delegation Errors

This section deals with errors associated with requesting and returning delegations.

15.1.14.1. NFS4ERR_DELEG_ALREADY_WANTED (Error Code 10056)

The client has requested a delegation when it had already registered that it wants that same delegation.

15.1.14.2. NFS4ERR_DIRDELEG_UNAVAIL (Error Code 10084)

This error is returned when the server is unable or unwilling to provide a requested directory delegation.

15.1.14.3. NFS4ERR_RECALLCONFLICT (Error Code 10061)

A recallable object (i.e., a layout or delegation) is unavailable due to a conflicting recall operation that is currently in progress for that object.

15.1.14.4. NFS4ERR_REJECT_DELEG (Error Code 10085)

The callback operation invoked to deal with a new delegation has rejected it.

15.1.15. Attribute Handling Errors

This section deals with errors specific to attribute handling within NFSv4.

15.1.15.1. NFS4ERR_ATTRNOTSUPP (Error Code 10032)

An attribute specified is not supported by the server. This error MUST NOT be returned by the GETATTR operation.

15.1.15.2. NFS4ERR_BADOWNER (Error Code 10039)

This error is returned when an owner or owner_group attribute value or the who field of an ACE within an ACL attribute value cannot be translated to a local representation.

15.1.15.3. NFS4ERR_NOT_SAME (Error Code 10027)

This error is returned by the VERIFY operation to signify that the attributes compared were not the same as those provided in the client's request.

15.1.15.4. NFS4ERR_SAME (Error Code 10009)

This error is returned by the NVERIFY operation to signify that the attributes compared were the same as those provided in the client's request.

15.1.16. Obsoleted Errors

These errors MUST NOT be generated by any NFSv4.1 operation. This can be for a number of reasons:

o The function provided by the error has been superseded by one of the status bits returned by the SEQUENCE operation.

o The new session structure and associated change in locking have made the error unnecessary.

o There has been a restructuring of some errors for NFSv4.1 that resulted in the elimination of certain errors.

15.1.16.1. NFS4ERR_BAD_SEQID (Error Code 10026)

The sequence number (seqid) in a locking request is neither the next expected number nor the last number processed. These seqids are ignored in NFSv4.1.

15.1.16.2. NFS4ERR_LEASE_MOVED (Error Code 10031)

A lease being renewed is associated with a file system that has been migrated to a new server. The error has been superseded by the SEQ4_STATUS_LEASE_MOVED status bit (see Section 18.46).

15.1.16.3. NFS4ERR_NXIO (Error Code 6)

I/O error. No such device or address. This error is for errors involving block and character device access, but because NFSv4.1 is not a device-access protocol, this error is not applicable.

15.1.16.4. NFS4ERR_RESTOREFH (Error Code 10030)

The RESTOREFH operation does not have a saved filehandle (identified by SAVEFH) to operate upon. In NFSv4.1, this error has been superseded by NFS4ERR_NOFILEHANDLE.

15.1.16.5. NFS4ERR_STALE_STATEID (Error Code 10023)

A stateid generated by an earlier server instance was used. This error is moot in NFSv4.1 because all operations that take a stateid MUST be preceded by the SEQUENCE operation, and the earlier server instance is detected by the session infrastructure that supports SEQUENCE.

15.2. Operations and Their Valid Errors

This section contains a table that gives the valid error returns for each protocol operation. The error code NFS4_OK (indicating no error) is not listed but should be understood to be returnable by all operations, with two important exceptions:

o The operations that MUST NOT be implemented: OPEN_CONFIRM, RELEASE_LOCKOWNER, RENEW, SETCLIENTID, and SETCLIENTID_CONFIRM.

o The invalid operation: ILLEGAL.
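As a non-normative illustration, the table's entries can be treated as data for a conformance-style check that a reply status is one the operation is permitted to return. The sketch below hand-codes only the ACCESS and GETFH rows plus the NFS4_OK rule above; the function and structure names are hypothetical, not part of the protocol.

```python
# Illustrative sketch (not normative): table-driven validity check.
NFS4_OK = 0

# Excerpt of the valid-error table: only two operations are shown.
VALID_ERRORS = {
    "ACCESS": {
        "NFS4ERR_ACCESS", "NFS4ERR_BADXDR", "NFS4ERR_DEADSESSION",
        "NFS4ERR_DELAY", "NFS4ERR_FHEXPIRED", "NFS4ERR_INVAL",
        "NFS4ERR_IO", "NFS4ERR_MOVED", "NFS4ERR_NOFILEHANDLE",
        "NFS4ERR_OP_NOT_IN_SESSION", "NFS4ERR_REP_TOO_BIG",
        "NFS4ERR_REP_TOO_BIG_TO_CACHE", "NFS4ERR_REQ_TOO_BIG",
        "NFS4ERR_RETRY_UNCACHED_REP", "NFS4ERR_SERVERFAULT",
        "NFS4ERR_STALE", "NFS4ERR_TOO_MANY_OPS",
    },
    "GETFH": {
        "NFS4ERR_FHEXPIRED", "NFS4ERR_MOVED", "NFS4ERR_NOFILEHANDLE",
        "NFS4ERR_OP_NOT_IN_SESSION", "NFS4ERR_STALE",
    },
}

# Operations that never return NFS4_OK: the MUST-NOT-implement
# operations and the invalid operation ILLEGAL.
NEVER_OK = {"OPEN_CONFIRM", "RELEASE_LOCKOWNER", "RENEW",
            "SETCLIENTID", "SETCLIENTID_CONFIRM", "ILLEGAL"}

def is_valid_status(op, status):
    """Return True if `status` is a permitted reply status for `op`."""
    if status == NFS4_OK:
        return op not in NEVER_OK
    return status in VALID_ERRORS.get(op, set())
```

A test harness could populate VALID_ERRORS from the full table and flag any server reply whose status is not listed for the operation.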
18398 Valid Error Returns for Each Protocol Operation 18400 +----------------------+--------------------------------------------+ 18401 | Operation | Errors | 18402 +----------------------+--------------------------------------------+ 18403 | ACCESS | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18404 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18405 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18406 | | NFS4ERR_IO, NFS4ERR_MOVED, | 18407 | | NFS4ERR_NOFILEHANDLE, | 18408 | | NFS4ERR_OP_NOT_IN_SESSION, | 18409 | | NFS4ERR_REP_TOO_BIG, | 18410 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18411 | | NFS4ERR_REQ_TOO_BIG, | 18412 | | NFS4ERR_RETRY_UNCACHED_REP, | 18413 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18414 | | NFS4ERR_TOO_MANY_OPS | 18415 | | | 18416 | BACKCHANNEL_CTL | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18417 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 18418 | | NFS4ERR_NOENT, NFS4ERR_OP_NOT_IN_SESSION, | 18419 | | NFS4ERR_REP_TOO_BIG, | 18420 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18421 | | NFS4ERR_REQ_TOO_BIG, | 18422 | | NFS4ERR_RETRY_UNCACHED_REP, | 18423 | | NFS4ERR_TOO_MANY_OPS | 18424 | | | 18425 | BIND_CONN_TO_SESSION | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 18426 | | NFS4ERR_BAD_SESSION_DIGEST, | 18427 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18428 | | NFS4ERR_INVAL, NFS4ERR_NOT_ONLY_OP, | 18429 | | NFS4ERR_REP_TOO_BIG, | 18430 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18431 | | NFS4ERR_REQ_TOO_BIG, | 18432 | | NFS4ERR_RETRY_UNCACHED_REP, | 18433 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 18434 | | | 18435 | CLOSE | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18436 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 18437 | | NFS4ERR_DELAY, NFS4ERR_EXPIRED, | 18438 | | NFS4ERR_FHEXPIRED, NFS4ERR_LOCKS_HELD, | 18439 | | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE, | 18440 | | NFS4ERR_OLD_STATEID, | 18441 | | NFS4ERR_OP_NOT_IN_SESSION, | 18442 | | NFS4ERR_REP_TOO_BIG, | 18443 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18444 | | NFS4ERR_REQ_TOO_BIG, | 18445 | | NFS4ERR_RETRY_UNCACHED_REP, | 18446 | | 
NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18447 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18448 | | | 18449 | COMMIT | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18450 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18451 | | NFS4ERR_FHEXPIRED, NFS4ERR_IO, | 18452 | | NFS4ERR_ISDIR, NFS4ERR_MOVED, | 18453 | | NFS4ERR_NOFILEHANDLE, | 18454 | | NFS4ERR_OP_NOT_IN_SESSION, | 18455 | | NFS4ERR_REP_TOO_BIG, | 18456 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18457 | | NFS4ERR_REQ_TOO_BIG, | 18458 | | NFS4ERR_RETRY_UNCACHED_REP, | 18459 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18460 | | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS, | 18461 | | NFS4ERR_WRONG_TYPE | 18462 | | | 18463 | CREATE | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP, | 18464 | | NFS4ERR_BADCHAR, NFS4ERR_BADNAME, | 18465 | | NFS4ERR_BADOWNER, NFS4ERR_BADTYPE, | 18466 | | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18467 | | NFS4ERR_DELAY, NFS4ERR_DQUOT, | 18468 | | NFS4ERR_EXIST, NFS4ERR_FHEXPIRED, | 18469 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MLINK, | 18470 | | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG, | 18471 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC, | 18472 | | NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION, | 18473 | | NFS4ERR_PERM, NFS4ERR_REP_TOO_BIG, | 18474 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18475 | | NFS4ERR_REQ_TOO_BIG, | 18476 | | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS, | 18477 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18478 | | NFS4ERR_TOO_MANY_OPS, | 18479 | | NFS4ERR_UNSAFE_COMPOUND | 18480 | | | 18481 | CREATE_SESSION | NFS4ERR_BADXDR, NFS4ERR_CLID_INUSE, | 18482 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18483 | | NFS4ERR_INVAL, NFS4ERR_NOENT, | 18484 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_NOSPC, | 18485 | | NFS4ERR_REP_TOO_BIG, | 18486 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18487 | | NFS4ERR_REQ_TOO_BIG, | 18488 | | NFS4ERR_RETRY_UNCACHED_REP, | 18489 | | NFS4ERR_SEQ_MISORDERED, | 18490 | | NFS4ERR_SERVERFAULT, | 18491 | | NFS4ERR_STALE_CLIENTID, NFS4ERR_TOOSMALL, | 18492 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18493 | | | 18494 | 
DELEGPURGE | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18495 | | NFS4ERR_DELAY, NFS4ERR_NOTSUPP, | 18496 | | NFS4ERR_OP_NOT_IN_SESSION, | 18497 | | NFS4ERR_REP_TOO_BIG, | 18498 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18499 | | NFS4ERR_REQ_TOO_BIG, | 18500 | | NFS4ERR_RETRY_UNCACHED_REP, | 18501 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18502 | | NFS4ERR_WRONG_CRED | 18503 | | | 18504 | DELEGRETURN | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR, | 18505 | | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION, | 18506 | | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED, | 18507 | | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED, | 18508 | | NFS4ERR_INVAL, NFS4ERR_MOVED, | 18509 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 18510 | | NFS4ERR_OLD_STATEID, | 18511 | | NFS4ERR_OP_NOT_IN_SESSION, | 18512 | | NFS4ERR_REP_TOO_BIG, | 18513 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18514 | | NFS4ERR_REQ_TOO_BIG, | 18515 | | NFS4ERR_RETRY_UNCACHED_REP, | 18516 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18517 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18518 | | | 18519 | DESTROY_CLIENTID | NFS4ERR_BADXDR, NFS4ERR_CLIENTID_BUSY, | 18520 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18521 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_REP_TOO_BIG, | 18522 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18523 | | NFS4ERR_REQ_TOO_BIG, | 18524 | | NFS4ERR_RETRY_UNCACHED_REP, | 18525 | | NFS4ERR_SERVERFAULT, | 18526 | | NFS4ERR_STALE_CLIENTID, | 18527 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18528 | | | 18529 | DESTROY_SESSION | NFS4ERR_BACK_CHAN_BUSY, | 18530 | | NFS4ERR_BADSESSION, NFS4ERR_BADXDR, | 18531 | | NFS4ERR_CB_PATH_DOWN, | 18532 | | NFS4ERR_CONN_NOT_BOUND_TO_SESSION, | 18533 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18534 | | NFS4ERR_NOT_ONLY_OP, NFS4ERR_REP_TOO_BIG, | 18535 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18536 | | NFS4ERR_REQ_TOO_BIG, | 18537 | | NFS4ERR_RETRY_UNCACHED_REP, | 18538 | | NFS4ERR_SERVERFAULT, | 18539 | | NFS4ERR_STALE_CLIENTID, | 18540 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED | 18541 | | | 18542 | EXCHANGE_ID | 
NFS4ERR_BADCHAR, NFS4ERR_BADXDR, | 18543 | | NFS4ERR_CLID_INUSE, NFS4ERR_DEADSESSION, | 18544 | | NFS4ERR_DELAY, NFS4ERR_ENCR_ALG_UNSUPP, | 18545 | | NFS4ERR_HASH_ALG_UNSUPP, NFS4ERR_INVAL, | 18546 | | NFS4ERR_NOENT, NFS4ERR_NOT_ONLY_OP, | 18547 | | NFS4ERR_NOT_SAME, NFS4ERR_REP_TOO_BIG, | 18548 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18549 | | NFS4ERR_REQ_TOO_BIG, | 18550 | | NFS4ERR_RETRY_UNCACHED_REP, | 18551 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS | 18552 | | | 18553 | FREE_STATEID | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID, | 18554 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18555 | | NFS4ERR_LOCKS_HELD, NFS4ERR_OLD_STATEID, | 18556 | | NFS4ERR_OP_NOT_IN_SESSION, | 18557 | | NFS4ERR_REP_TOO_BIG, | 18558 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18559 | | NFS4ERR_REQ_TOO_BIG, | 18560 | | NFS4ERR_RETRY_UNCACHED_REP, | 18561 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18562 | | NFS4ERR_WRONG_CRED | 18563 | | | 18564 | GET_DIR_DELEGATION | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18565 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18566 | | NFS4ERR_DIRDELEG_UNAVAIL, | 18567 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18568 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 18569 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR, | 18570 | | NFS4ERR_NOTSUPP, | 18571 | | NFS4ERR_OP_NOT_IN_SESSION, | 18572 | | NFS4ERR_REP_TOO_BIG, | 18573 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18574 | | NFS4ERR_REQ_TOO_BIG, | 18575 | | NFS4ERR_RETRY_UNCACHED_REP, | 18576 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 18577 | | NFS4ERR_TOO_MANY_OPS | 18578 | | | 18579 | GETATTR | NFS4ERR_ACCESS, NFS4ERR_BADXDR, | 18580 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18581 | | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE, | 18582 | | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED, | 18583 | | NFS4ERR_NOFILEHANDLE, | 18584 | | NFS4ERR_OP_NOT_IN_SESSION, | 18585 | | NFS4ERR_REP_TOO_BIG, | 18586 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18587 | | NFS4ERR_REQ_TOO_BIG, | 18588 | | NFS4ERR_RETRY_UNCACHED_REP, | 18589 | | NFS4ERR_SERVERFAULT, NFS4ERR_STALE, | 
18590 | | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE | 18591 | | | 18592 | GETDEVICEINFO | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION, | 18593 | | NFS4ERR_DELAY, NFS4ERR_INVAL, | 18594 | | NFS4ERR_NOENT, NFS4ERR_NOTSUPP, | 18595 | | NFS4ERR_OP_NOT_IN_SESSION, | 18596 | | NFS4ERR_REP_TOO_BIG, | 18597 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18598 | | NFS4ERR_REQ_TOO_BIG, | 18599 | | NFS4ERR_RETRY_UNCACHED_REP, | 18600 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOOSMALL, | 18601 | | NFS4ERR_TOO_MANY_OPS, | 18602 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 18603 | | | 18604 | GETDEVICELIST | NFS4ERR_BADXDR, NFS4ERR_BAD_COOKIE, | 18605 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18606 | | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL, | 18607 | | NFS4ERR_IO, NFS4ERR_NOFILEHANDLE, | 18608 | | NFS4ERR_NOTSUPP, NFS4ERR_NOT_SAME, | 18609 | | NFS4ERR_OP_NOT_IN_SESSION, | 18610 | | NFS4ERR_REP_TOO_BIG, | 18611 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18612 | | NFS4ERR_REQ_TOO_BIG, | 18613 | | NFS4ERR_RETRY_UNCACHED_REP, | 18614 | | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, | 18615 | | NFS4ERR_UNKNOWN_LAYOUTTYPE | 18616 | | | 18617 | GETFH | NFS4ERR_FHEXPIRED, NFS4ERR_MOVED, | 18618 | | NFS4ERR_NOFILEHANDLE, | 18619 | | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_STALE | 18620 | | | 18621 | ILLEGAL | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL | 18622 | | | 18623 | LAYOUTCOMMIT | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED, | 18624 | | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADIOMODE, | 18625 | | NFS4ERR_BADLAYOUT, NFS4ERR_BADXDR, | 18626 | | NFS4ERR_DEADSESSION, NFS4ERR_DELAY, | 18627 | | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED, | 18628 | | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED, | 18629 | | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO, | 18630 | | NFS4ERR_ISDIR NFS4ERR_MOVED, | 18631 | | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP, | 18632 | | NFS4ERR_NO_GRACE, | 18633 | | NFS4ERR_OP_NOT_IN_SESSION, | 18634 | | NFS4ERR_RECLAIM_BAD, | 18635 | | NFS4ERR_RECLAIM_CONFLICT, | 18636 | | NFS4ERR_REP_TOO_BIG, | 18637 | | NFS4ERR_REP_TOO_BIG_TO_CACHE, | 18638 | | NFS4ERR_REQ_TOO_BIG, | 
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_CRED                         |
   |                      |                                            |
   | LAYOUTGET            | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_BADIOMODE, NFS4ERR_BADLAYOUT,      |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,       |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT,      |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO,                 |
   |                      | NFS4ERR_LAYOUTTRYLATER,                    |
   |                      | NFS4ERR_LAYOUTUNAVAILABLE, NFS4ERR_LOCKED, |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_NOSPC, NFS4ERR_NOTSUPP,            |
   |                      | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,     |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_RECALLCONFLICT,                    |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOOSMALL, NFS4ERR_TOO_MANY_OPS,    |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | LAYOUTRETURN         | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR,     |
   |                      | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION,  |
   |                      | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED,      |
   |                      | NFS4ERR_EXPIRED, NFS4ERR_FHEXPIRED,        |
   |                      | NFS4ERR_GRACE, NFS4ERR_INVAL,              |
   |                      | NFS4ERR_ISDIR, NFS4ERR_MOVED,              |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP,     |
   |                      | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID,     |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE     |
   |                      |                                            |
   | LINK                 | NFS4ERR_ACCESS, NFS4ERR_BADCHAR,           |
   |                      | NFS4ERR_BADNAME, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DQUOT, NFS4ERR_EXIST,              |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN,      |
   |                      | NFS4ERR_GRACE, NFS4ERR_INVAL,              |
   |                      | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_MLINK,  |
   |                      | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG,        |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_NOTDIR, NFS4ERR_NOTSUPP,           |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_WRONGSEC, NFS4ERR_WRONG_TYPE,      |
   |                      | NFS4ERR_XDEV                               |
   |                      |                                            |
   | LOCK                 | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE,         |
   |                      | NFS4ERR_BAD_STATEID, NFS4ERR_DEADLOCK,     |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DENIED, NFS4ERR_EXPIRED,           |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_ISDIR,              |
   |                      | NFS4ERR_LOCK_NOTSUPP, NFS4ERR_LOCK_RANGE,  |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_NO_GRACE, NFS4ERR_OLD_STATEID,     |
   |                      | NFS4ERR_OPENMODE,                          |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_RECLAIM_BAD,                       |
   |                      | NFS4ERR_RECLAIM_CONFLICT,                  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_WRONG_CRED, NFS4ERR_WRONG_TYPE     |
   |                      |                                            |
   | LOCKT                | NFS4ERR_ACCESS, NFS4ERR_BADXDR,            |
   |                      | NFS4ERR_BAD_RANGE, NFS4ERR_DEADSESSION,    |
   |                      | NFS4ERR_DELAY, NFS4ERR_DENIED,             |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_ISDIR,              |
   |                      | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED,         |
   |                      | NFS4ERR_NOFILEHANDLE,                      |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_STALE, NFS4ERR_SYMLINK,            |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED,  |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | LOCKU                | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_RANGE,         |
   |                      | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION,  |
   |                      | NFS4ERR_DELAY, NFS4ERR_EXPIRED,            |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,          |
   |                      | NFS4ERR_LOCK_RANGE, NFS4ERR_MOVED,         |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_OLD_STATEID, |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED   |
   |                      |                                            |
   | LOOKUP               | NFS4ERR_ACCESS, NFS4ERR_BADCHAR,           |
   |                      | NFS4ERR_BADNAME, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,          |
   |                      | NFS4ERR_IO, NFS4ERR_MOVED,                 |
   |                      | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT,        |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR,      |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_WRONGSEC                           |
   |                      |                                            |
   | LOOKUPP              | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION,       |
   |                      | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED,          |
   |                      | NFS4ERR_IO, NFS4ERR_MOVED, NFS4ERR_NOENT,  |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR,      |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_WRONGSEC                           |
   |                      |                                            |
   | NVERIFY              | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP,       |
   |                      | NFS4ERR_BADCHAR, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NOFILEHANDLE,                      |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_SAME,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | OPEN                 | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR,      |
   |                      | NFS4ERR_BADNAME, NFS4ERR_BADOWNER,         |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,       |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DELEG_ALREADY_WANTED,              |
   |                      | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT,      |
   |                      | NFS4ERR_EXIST, NFS4ERR_EXPIRED,            |
   |                      | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED,           |
   |                      | NFS4ERR_GRACE, NFS4ERR_INVAL,              |
   |                      | NFS4ERR_ISDIR, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT,        |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_NOTDIR, NFS4ERR_NO_GRACE,          |
   |                      | NFS4ERR_OLD_STATEID,                       |
   |                      | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM,   |
   |                      | NFS4ERR_RECLAIM_BAD,                       |
   |                      | NFS4ERR_RECLAIM_CONFLICT,                  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_SHARE_DENIED, |
   |                      | NFS4ERR_STALE, NFS4ERR_SYMLINK,            |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNSAFE_COMPOUND, NFS4ERR_WRONGSEC, |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | OPEN_CONFIRM         | NFS4ERR_NOTSUPP                            |
   |                      |                                            |
   | OPEN_DOWNGRADE       | NFS4ERR_ADMIN_REVOKED, NFS4ERR_BADXDR,     |
   |                      | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION,  |
   |                      | NFS4ERR_DELAY, NFS4ERR_EXPIRED,            |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,          |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_OLD_STATEID,                       |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED   |
   |                      |                                            |
   | OPENATTR             | NFS4ERR_ACCESS, NFS4ERR_BADXDR,            |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DQUOT, NFS4ERR_FHEXPIRED,          |
   |                      | NFS4ERR_IO, NFS4ERR_MOVED, NFS4ERR_NOENT,  |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_NOTSUPP,                           |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNSAFE_COMPOUND,                   |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | PUTFH                | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,         |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_MOVED, NFS4ERR_OP_NOT_IN_SESSION,  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC     |
   |                      |                                            |
   | PUTPUBFH             | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, |
   |                      | NFS4ERR_WRONGSEC                           |
   |                      |                                            |
   | PUTROOTFH            | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS, |
   |                      | NFS4ERR_WRONGSEC                           |
   |                      |                                            |
   | READ                 | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,       |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DELEG_REVOKED, NFS4ERR_EXPIRED,    |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_ISDIR, NFS4ERR_IO,  |
   |                      | NFS4ERR_LOCKED, NFS4ERR_MOVED,             |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_OLD_STATEID, |
   |                      | NFS4ERR_OPENMODE,                          |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_PNFS_IO_HOLE,                      |
   |                      | NFS4ERR_PNFS_NO_LAYOUT,                    |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | READDIR              | NFS4ERR_ACCESS, NFS4ERR_BADXDR,            |
   |                      | NFS4ERR_BAD_COOKIE, NFS4ERR_DEADSESSION,   |
   |                      | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR,      |
   |                      | NFS4ERR_NOT_SAME,                          |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOOSMALL, NFS4ERR_TOO_MANY_OPS     |
   |                      |                                            |
   | READLINK             | NFS4ERR_ACCESS, NFS4ERR_DEADSESSION,       |
   |                      | NFS4ERR_DELAY, NFS4ERR_FHEXPIRED,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NOFILEHANDLE,                      |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE   |
   |                      |                                            |
   | RECLAIM_COMPLETE     | NFS4ERR_BADXDR, NFS4ERR_COMPLETE_ALREADY,  |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,          |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_CRED,  |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | RELEASE_LOCKOWNER    | NFS4ERR_NOTSUPP                            |
   |                      |                                            |
   | REMOVE               | NFS4ERR_ACCESS, NFS4ERR_BADCHAR,           |
   |                      | NFS4ERR_BADNAME, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN,      |
   |                      | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO,  |
   |                      | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG,        |
   |                      | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_NOTDIR, NFS4ERR_NOTEMPTY,          |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | RENAME               | NFS4ERR_ACCESS, NFS4ERR_BADCHAR,           |
   |                      | NFS4ERR_BADNAME, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DQUOT, NFS4ERR_EXIST,              |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_FILE_OPEN,      |
   |                      | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO,  |
   |                      | NFS4ERR_MLINK, NFS4ERR_MOVED,              |
   |                      | NFS4ERR_NAMETOOLONG, NFS4ERR_NOENT,        |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_NOTDIR, NFS4ERR_NOTEMPTY,          |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC,    |
   |                      | NFS4ERR_XDEV                               |
   |                      |                                            |
   | RENEW                | NFS4ERR_NOTSUPP                            |
   |                      |                                            |
   | RESTOREFH            | NFS4ERR_DEADSESSION, NFS4ERR_FHEXPIRED,    |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONGSEC     |
   |                      |                                            |
   | SAVEFH               | NFS4ERR_DEADSESSION, NFS4ERR_FHEXPIRED,    |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SECINFO              | NFS4ERR_ACCESS, NFS4ERR_BADCHAR,           |
   |                      | NFS4ERR_BADNAME, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,          |
   |                      | NFS4ERR_MOVED, NFS4ERR_NAMETOOLONG,        |
   |                      | NFS4ERR_NOENT, NFS4ERR_NOFILEHANDLE,       |
   |                      | NFS4ERR_NOTDIR, NFS4ERR_OP_NOT_IN_SESSION, |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SECINFO_NO_NAME      | NFS4ERR_ACCESS, NFS4ERR_BADXDR,            |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_INVAL,          |
   |                      | NFS4ERR_MOVED, NFS4ERR_NOENT,              |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTDIR,      |
   |                      | NFS4ERR_NOTSUPP,                           |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SEQUENCE             | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT,       |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT,     |
   |                      | NFS4ERR_CONN_NOT_BOUND_TO_SESSION,         |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SEQUENCE_POS,                      |
   |                      | NFS4ERR_SEQ_FALSE_RETRY,                   |
   |                      | NFS4ERR_SEQ_MISORDERED,                    |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SET_SSV              | NFS4ERR_BADXDR,                            |
   |                      | NFS4ERR_BAD_SESSION_DIGEST,                |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_INVAL, NFS4ERR_OP_NOT_IN_SESSION,  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_TOO_MANY_OPS                       |
   |                      |                                            |
   | SETATTR              | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_ATTRNOTSUPP, NFS4ERR_BADCHAR,      |
   |                      | NFS4ERR_BADOWNER, NFS4ERR_BADXDR,          |
   |                      | NFS4ERR_BAD_STATEID, NFS4ERR_DEADSESSION,  |
   |                      | NFS4ERR_DELAY, NFS4ERR_DELEG_REVOKED,      |
   |                      | NFS4ERR_DQUOT, NFS4ERR_EXPIRED,            |
   |                      | NFS4ERR_FBIG, NFS4ERR_FHEXPIRED,           |
   |                      | NFS4ERR_GRACE, NFS4ERR_INVAL, NFS4ERR_IO,  |
   |                      | NFS4ERR_LOCKED, NFS4ERR_MOVED,             |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,     |
   |                      | NFS4ERR_OP_NOT_IN_SESSION, NFS4ERR_PERM,   |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | SETCLIENTID          | NFS4ERR_NOTSUPP                            |
   |                      |                                            |
   | SETCLIENTID_CONFIRM  | NFS4ERR_NOTSUPP                            |
   |                      |                                            |
   | TEST_STATEID         | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION,       |
   |                      | NFS4ERR_DELAY, NFS4ERR_OP_NOT_IN_SESSION,  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_TOO_MANY_OPS  |
   |                      |                                            |
   | VERIFY               | NFS4ERR_ACCESS, NFS4ERR_ATTRNOTSUPP,       |
   |                      | NFS4ERR_BADCHAR, NFS4ERR_BADXDR,           |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOT_SAME,    |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS,                      |
   |                      | NFS4ERR_UNKNOWN_LAYOUTTYPE,                |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   | WANT_DELEGATION      | NFS4ERR_BADXDR, NFS4ERR_DEADSESSION,       |
   |                      | NFS4ERR_DELAY,                             |
   |                      | NFS4ERR_DELEG_ALREADY_WANTED,              |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_MOVED,  |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOTSUPP,     |
   |                      | NFS4ERR_NO_GRACE,                          |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_RECALLCONFLICT,                    |
   |                      | NFS4ERR_RECLAIM_BAD,                       |
   |                      | NFS4ERR_RECLAIM_CONFLICT,                  |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP,                |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_TOO_MANY_OPS, NFS4ERR_WRONG_TYPE   |
   |                      |                                            |
   | WRITE                | NFS4ERR_ACCESS, NFS4ERR_ADMIN_REVOKED,     |
   |                      | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,       |
   |                      | NFS4ERR_DEADSESSION, NFS4ERR_DELAY,        |
   |                      | NFS4ERR_DELEG_REVOKED, NFS4ERR_DQUOT,      |
   |                      | NFS4ERR_EXPIRED, NFS4ERR_FBIG,             |
   |                      | NFS4ERR_FHEXPIRED, NFS4ERR_GRACE,          |
   |                      | NFS4ERR_INVAL, NFS4ERR_IO, NFS4ERR_ISDIR,  |
   |                      | NFS4ERR_LOCKED, NFS4ERR_MOVED,             |
   |                      | NFS4ERR_NOFILEHANDLE, NFS4ERR_NOSPC,       |
   |                      | NFS4ERR_OLD_STATEID, NFS4ERR_OPENMODE,     |
   |                      | NFS4ERR_OP_NOT_IN_SESSION,                 |
   |                      | NFS4ERR_PNFS_IO_HOLE,                      |
   |                      | NFS4ERR_PNFS_NO_LAYOUT,                    |
   |                      | NFS4ERR_REP_TOO_BIG,                       |
   |                      | NFS4ERR_REP_TOO_BIG_TO_CACHE,              |
   |                      | NFS4ERR_REQ_TOO_BIG,                       |
   |                      | NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_ROFS,  |
   |                      | NFS4ERR_SERVERFAULT, NFS4ERR_STALE,        |
   |                      | NFS4ERR_SYMLINK, NFS4ERR_TOO_MANY_OPS,     |
   |                      | NFS4ERR_WRONG_TYPE                         |
   |                      |                                            |
   +----------------------+--------------------------------------------+

                                  Table 6

15.3.  Callback Operations and Their Valid Errors

   This section contains a table that gives the valid error returns for
   each callback operation.  The error code NFS4_OK (indicating no
   error) is not listed but should be understood to be returnable by all
   callback operations with the exception of CB_ILLEGAL.
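Tables of valid error returns like these lend themselves to a table-driven
conformance check on the receiving side.  The sketch below is illustrative
only and not part of the protocol: it encodes two entries transcribed from
the callback-operation table as a Python dictionary, and `error_is_valid`
is a hypothetical helper that also captures the NFS4_OK rule stated above.

```python
# Illustrative sketch (not part of the protocol): a table-driven check
# that a status code is one the specification allows for a callback
# operation.  Only two entries are transcribed here as examples.
VALID_CB_ERRORS = {
    "CB_ILLEGAL": {"NFS4ERR_BADXDR", "NFS4ERR_OP_ILLEGAL"},
    "CB_RECALL_ANY": {
        "NFS4ERR_BADXDR", "NFS4ERR_DELAY", "NFS4ERR_INVAL",
        "NFS4ERR_OP_NOT_IN_SESSION", "NFS4ERR_REP_TOO_BIG",
        "NFS4ERR_REP_TOO_BIG_TO_CACHE", "NFS4ERR_REQ_TOO_BIG",
        "NFS4ERR_RETRY_UNCACHED_REP", "NFS4ERR_TOO_MANY_OPS",
    },
}

def error_is_valid(op: str, status: str) -> bool:
    """Return True if `status` is a legal result for callback `op`."""
    if status == "NFS4_OK":
        # NFS4_OK is returnable by every callback operation except
        # CB_ILLEGAL, which always fails with NFS4ERR_OP_ILLEGAL.
        return op != "CB_ILLEGAL"
    return status in VALID_CB_ERRORS.get(op, set())
```

A full implementation would populate the dictionary mechanically from the
tables rather than by hand; the point is that the tables are precise enough
to be machine-checkable.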
        Valid Error Returns for Each Protocol Callback Operation

   +-------------------------+-----------------------------------------+
   | Callback Operation      | Errors                                  |
   +-------------------------+-----------------------------------------+
   | CB_GETATTR              | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_DELAY, NFS4ERR_INVAL,           |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_ILLEGAL              | NFS4ERR_BADXDR, NFS4ERR_OP_ILLEGAL      |
   |                         |                                         |
   | CB_LAYOUTRECALL         | NFS4ERR_BADHANDLE, NFS4ERR_BADIOMODE,   |
   |                         | NFS4ERR_BADXDR, NFS4ERR_BAD_STATEID,    |
   |                         | NFS4ERR_DELAY, NFS4ERR_INVAL,           |
   |                         | NFS4ERR_NOMATCHING_LAYOUT,              |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_TOO_MANY_OPS,                   |
   |                         | NFS4ERR_UNKNOWN_LAYOUTTYPE,             |
   |                         | NFS4ERR_WRONG_TYPE                      |
   |                         |                                         |
   | CB_NOTIFY               | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY,     |
   |                         | NFS4ERR_INVAL, NFS4ERR_NOTSUPP,         |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_NOTIFY_DEVICEID      | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_INVAL, NFS4ERR_NOTSUPP,         |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_NOTIFY_LOCK          | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY,     |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_PUSH_DELEG           | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_DELAY, NFS4ERR_INVAL,           |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REJECT_DELEG,                   |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS,                   |
   |                         | NFS4ERR_WRONG_TYPE                      |
   |                         |                                         |
   | CB_RECALL               | NFS4ERR_BADHANDLE, NFS4ERR_BADXDR,      |
   |                         | NFS4ERR_BAD_STATEID, NFS4ERR_DELAY,     |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_RECALL_ANY           | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_INVAL,                          |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_RECALLABLE_OBJ_AVAIL | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_INVAL, NFS4ERR_NOTSUPP,         |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_RECALL_SLOT          | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT,  |
   |                         | NFS4ERR_DELAY,                          |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_SEQUENCE             | NFS4ERR_BADSESSION, NFS4ERR_BADSLOT,    |
   |                         | NFS4ERR_BADXDR, NFS4ERR_BAD_HIGH_SLOT,  |
   |                         | NFS4ERR_CONN_NOT_BOUND_TO_SESSION,      |
   |                         | NFS4ERR_DELAY, NFS4ERR_REP_TOO_BIG,     |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SEQUENCE_POS,                   |
   |                         | NFS4ERR_SEQ_FALSE_RETRY,                |
   |                         | NFS4ERR_SEQ_MISORDERED,                 |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   | CB_WANTS_CANCELLED      | NFS4ERR_BADXDR, NFS4ERR_DELAY,          |
   |                         | NFS4ERR_NOTSUPP,                        |
   |                         | NFS4ERR_OP_NOT_IN_SESSION,              |
   |                         | NFS4ERR_REP_TOO_BIG,                    |
   |                         | NFS4ERR_REP_TOO_BIG_TO_CACHE,           |
   |                         | NFS4ERR_REQ_TOO_BIG,                    |
   |                         | NFS4ERR_RETRY_UNCACHED_REP,             |
   |                         | NFS4ERR_SERVERFAULT,                    |
   |                         | NFS4ERR_TOO_MANY_OPS                    |
   |                         |                                         |
   +-------------------------+-----------------------------------------+

                                  Table 7

15.4.  Errors and the Operations That Use Them

   +-----------------------------------+-------------------------------+
   | Error                             | Operations                    |
   +-----------------------------------+-------------------------------+
   | NFS4ERR_ACCESS                    | ACCESS, COMMIT, CREATE,       |
   |                                   | GETATTR, GET_DIR_DELEGATION,  |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LINK, LOCK, LOCKT, LOCKU,     |
   |                                   | LOOKUP, LOOKUPP, NVERIFY,     |
   |                                   | OPEN, OPENATTR, READ,         |
   |                                   | READDIR, READLINK, REMOVE,    |
   |                                   | RENAME, SECINFO,              |
   |                                   | SECINFO_NO_NAME, SETATTR,     |
   |                                   | VERIFY, WRITE                 |
   |                                   |                               |
   | NFS4ERR_ADMIN_REVOKED             | CLOSE, DELEGRETURN,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LOCK, LOCKU,    |
   |                                   | OPEN, OPEN_DOWNGRADE, READ,   |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_ATTRNOTSUPP               | CREATE, LAYOUTCOMMIT,         |
   |                                   | NVERIFY, OPEN, SETATTR,       |
   |                                   | VERIFY                        |
   |                                   |                               |
   | NFS4ERR_BACK_CHAN_BUSY            | DESTROY_SESSION               |
   |                                   |                               |
   | NFS4ERR_BADCHAR                   | CREATE, EXCHANGE_ID, LINK,    |
   |                                   | LOOKUP, NVERIFY, OPEN,        |
   |                                   | REMOVE, RENAME, SECINFO,      |
   |                                   | SETATTR, VERIFY               |
   |                                   |                               |
   | NFS4ERR_BADHANDLE                 | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY, CB_NOTIFY_LOCK,    |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | PUTFH                         |
   |                                   |                               |
   | NFS4ERR_BADIOMODE                 | CB_LAYOUTRECALL,              |
   |                                   | LAYOUTCOMMIT, LAYOUTGET       |
   |                                   |                               |
   | NFS4ERR_BADLAYOUT                 | LAYOUTCOMMIT, LAYOUTGET       |
   |                                   |                               |
   | NFS4ERR_BADNAME                   | CREATE, LINK, LOOKUP, OPEN,   |
   |                                   | REMOVE, RENAME, SECINFO       |
   |                                   |                               |
   | NFS4ERR_BADOWNER                  | CREATE, OPEN, SETATTR         |
   |                                   |                               |
   | NFS4ERR_BADSESSION                | BIND_CONN_TO_SESSION,         |
   |                                   | CB_SEQUENCE, DESTROY_SESSION, |
   |                                   | SEQUENCE                      |
   |                                   |                               |
   | NFS4ERR_BADSLOT                   | CB_SEQUENCE, SEQUENCE         |
   |                                   |                               |
   | NFS4ERR_BADTYPE                   | CREATE                        |
   |                                   |                               |
   | NFS4ERR_BADXDR                    | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_ILLEGAL,       |
   |                                   | CB_LAYOUTRECALL, CB_NOTIFY,   |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION, ILLEGAL,  |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | NVERIFY, OPEN, OPENATTR,      |
   |                                   | OPEN_DOWNGRADE, PUTFH, READ,  |
   |                                   | READDIR, RECLAIM_COMPLETE,    |
   |                                   | REMOVE, RENAME, SECINFO,      |
   |                                   | SECINFO_NO_NAME, SEQUENCE,    |
   |                                   | SETATTR, SET_SSV,             |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_BAD_COOKIE                | GETDEVICELIST, READDIR        |
   |                                   |                               |
   | NFS4ERR_BAD_HIGH_SLOT             | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | SEQUENCE                      |
   |                                   |                               |
   | NFS4ERR_BAD_RANGE                 | LOCK, LOCKT, LOCKU            |
   |                                   |                               |
   | NFS4ERR_BAD_SESSION_DIGEST        | BIND_CONN_TO_SESSION, SET_SSV |
   |                                   |                               |
   | NFS4ERR_BAD_STATEID               | CB_LAYOUTRECALL, CB_NOTIFY,   |
   |                                   | CB_NOTIFY_LOCK, CB_RECALL,    |
   |                                   | CLOSE, DELEGRETURN,           |
   |                                   | FREE_STATEID, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LOCK, LOCKU,    |
   |                                   | OPEN, OPEN_DOWNGRADE, READ,   |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_CB_PATH_DOWN              | DESTROY_SESSION               |
   |                                   |                               |
   | NFS4ERR_CLID_INUSE                | CREATE_SESSION, EXCHANGE_ID   |
   |                                   |                               |
   | NFS4ERR_CLIENTID_BUSY             | DESTROY_CLIENTID              |
   |                                   |                               |
   | NFS4ERR_COMPLETE_ALREADY          | RECLAIM_COMPLETE              |
   |                                   |                               |
   | NFS4ERR_CONN_NOT_BOUND_TO_SESSION | CB_SEQUENCE, DESTROY_SESSION, |
   |                                   | SEQUENCE                      |
   |                                   |                               |
   | NFS4ERR_DEADLOCK                  | LOCK                          |
   |                                   |                               |
   | NFS4ERR_DEADSESSION               | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION, CLOSE,  |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SEQUENCE, SETATTR, SET_SSV,   |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_DELAY                     | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,  |
   |                                   | CB_NOTIFY,                    |
   |                                   | CB_NOTIFY_DEVICEID,           |
   |                                   | CB_NOTIFY_LOCK,               |
   |                                   | CB_PUSH_DELEG, CB_RECALL,     |
   |                                   | CB_RECALLABLE_OBJ_AVAIL,      |
   |                                   | CB_RECALL_ANY,                |
   |                                   | CB_RECALL_SLOT, CB_SEQUENCE,  |
   |                                   | CB_WANTS_CANCELLED, CLOSE,    |
   |                                   | COMMIT, CREATE,               |
   |                                   | CREATE_SESSION, DELEGPURGE,   |
   |                                   | DELEGRETURN,                  |
   |                                   | DESTROY_CLIENTID,             |
   |                                   | DESTROY_SESSION, EXCHANGE_ID, |
   |                                   | FREE_STATEID, GETATTR,        |
   |                                   | GETDEVICEINFO, GETDEVICELIST, |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | PUTFH, PUTPUBFH, PUTROOTFH,   |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, SECINFO,              |
   |                                   | SECINFO_NO_NAME, SEQUENCE,    |
   |                                   | SETATTR, SET_SSV,             |
   |                                   | TEST_STATEID, VERIFY,         |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_DELEG_ALREADY_WANTED      | OPEN, WANT_DELEGATION         |
   |                                   |                               |
   | NFS4ERR_DELEG_REVOKED             | DELEGRETURN, LAYOUTCOMMIT,    |
   |                                   | LAYOUTGET, LAYOUTRETURN,      |
   |                                   | OPEN, READ, SETATTR, WRITE    |
   |                                   |                               |
   | NFS4ERR_DENIED                    | LOCK, LOCKT                   |
   |                                   |                               |
   | NFS4ERR_DIRDELEG_UNAVAIL          | GET_DIR_DELEGATION            |
   |                                   |                               |
   | NFS4ERR_DQUOT                     | CREATE, LAYOUTGET, LINK,      |
   |                                   | OPEN, OPENATTR, RENAME,       |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_ENCR_ALG_UNSUPP           | EXCHANGE_ID                   |
   |                                   |                               |
   | NFS4ERR_EXIST                     | CREATE, LINK, OPEN, RENAME    |
   |                                   |                               |
   | NFS4ERR_EXPIRED                   | CLOSE, DELEGRETURN,           |
   |                                   | LAYOUTCOMMIT, LAYOUTRETURN,   |
   |                                   | LOCK, LOCKU, OPEN,            |
   |                                   | OPEN_DOWNGRADE, READ,         |
   |                                   | SETATTR, WRITE                |
   |                                   |                               |
   | NFS4ERR_FBIG                      | LAYOUTCOMMIT, OPEN, SETATTR,  |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_FHEXPIRED                 | ACCESS, CLOSE, COMMIT,        |
   |                                   | CREATE, DELEGRETURN, GETATTR, |
   |                                   | GETDEVICELIST, GETFH,         |
   |                                   | GET_DIR_DELEGATION,           |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, LOCKU, LOOKUP,         |
   |                                   | LOOKUPP, NVERIFY, OPEN,       |
   |                                   | OPENATTR, OPEN_DOWNGRADE,     |
   |                                   | READ, READDIR, READLINK,      |
   |                                   | RECLAIM_COMPLETE, REMOVE,     |
   |                                   | RENAME, RESTOREFH, SAVEFH,    |
   |                                   | SECINFO, SECINFO_NO_NAME,     |
   |                                   | SETATTR, VERIFY,              |
   |                                   | WANT_DELEGATION, WRITE        |
   |                                   |                               |
   | NFS4ERR_FILE_OPEN                 | LINK, REMOVE, RENAME          |
   |                                   |                               |
   | NFS4ERR_GRACE                     | GETATTR, GET_DIR_DELEGATION,  |
   |                                   | LAYOUTCOMMIT, LAYOUTGET,      |
   |                                   | LAYOUTRETURN, LINK, LOCK,     |
   |                                   | LOCKT, NVERIFY, OPEN, READ,   |
   |                                   | REMOVE, RENAME, SETATTR,      |
   |                                   | VERIFY, WANT_DELEGATION,      |
   |                                   | WRITE                         |
   |                                   |                               |
   | NFS4ERR_HASH_ALG_UNSUPP           | EXCHANGE_ID                   |
   |                                   |                               |
   | NFS4ERR_INVAL                     | ACCESS, BACKCHANNEL_CTL,      |
   |                                   | BIND_CONN_TO_SESSION,         |
   |                                   | CB_GETATTR, CB_LAYOUTRECALL,
| 19531 | | CB_NOTIFY, | 19532 | | CB_NOTIFY_DEVICEID, | 19533 | | CB_PUSH_DELEG, | 19534 | | CB_RECALLABLE_OBJ_AVAIL, | 19535 | | CB_RECALL_ANY, CREATE, | 19536 | | CREATE_SESSION, DELEGRETURN, | 19537 | | EXCHANGE_ID, GETATTR, | 19538 | | GETDEVICEINFO, GETDEVICELIST, | 19539 | | GET_DIR_DELEGATION, | 19540 | | LAYOUTCOMMIT, LAYOUTGET, | 19541 | | LAYOUTRETURN, LINK, LOCK, | 19542 | | LOCKT, LOCKU, LOOKUP, | 19543 | | NVERIFY, OPEN, | 19544 | | OPEN_DOWNGRADE, READ, | 19545 | | READDIR, READLINK, | 19546 | | RECLAIM_COMPLETE, REMOVE, | 19547 | | RENAME, SECINFO, | 19548 | | SECINFO_NO_NAME, SETATTR, | 19549 | | SET_SSV, VERIFY, | 19550 | | WANT_DELEGATION, WRITE | 19551 | | | 19552 | NFS4ERR_IO | ACCESS, COMMIT, CREATE, | 19553 | | GETATTR, GETDEVICELIST, | 19554 | | GET_DIR_DELEGATION, | 19555 | | LAYOUTCOMMIT, LAYOUTGET, | 19556 | | LINK, LOOKUP, LOOKUPP, | 19557 | | NVERIFY, OPEN, OPENATTR, | 19558 | | READ, READDIR, READLINK, | 19559 | | REMOVE, RENAME, SETATTR, | 19560 | | VERIFY, WANT_DELEGATION, | 19561 | | WRITE | 19562 | | | 19563 | NFS4ERR_ISDIR | COMMIT, LAYOUTCOMMIT, | 19564 | | LAYOUTRETURN, LINK, LOCK, | 19565 | | LOCKT, OPEN, READ, WRITE | 19566 | | | 19567 | NFS4ERR_LAYOUTTRYLATER | LAYOUTGET | 19568 | | | 19569 | NFS4ERR_LAYOUTUNAVAILABLE | LAYOUTGET | 19570 | | | 19571 | NFS4ERR_LOCKED | LAYOUTGET, READ, SETATTR, | 19572 | | WRITE | 19573 | | | 19574 | NFS4ERR_LOCKS_HELD | CLOSE, FREE_STATEID | 19575 | | | 19576 | NFS4ERR_LOCK_NOTSUPP | LOCK | 19577 | | | 19578 | NFS4ERR_LOCK_RANGE | LOCK, LOCKT, LOCKU | 19579 | | | 19580 | NFS4ERR_MLINK | CREATE, LINK, RENAME | 19581 | | | 19582 | NFS4ERR_MOVED | ACCESS, CLOSE, COMMIT, | 19583 | | CREATE, DELEGRETURN, GETATTR, | 19584 | | GETFH, GET_DIR_DELEGATION, | 19585 | | LAYOUTCOMMIT, LAYOUTGET, | 19586 | | LAYOUTRETURN, LINK, LOCK, | 19587 | | LOCKT, LOCKU, LOOKUP, | 19588 | | LOOKUPP, NVERIFY, OPEN, | 19589 | | OPENATTR, OPEN_DOWNGRADE, | 19590 | | PUTFH, READ, READDIR, | 19591 | | READLINK, 
RECLAIM_COMPLETE, | 19592 | | REMOVE, RENAME, RESTOREFH, | 19593 | | SAVEFH, SECINFO, | 19594 | | SECINFO_NO_NAME, SETATTR, | 19595 | | VERIFY, WANT_DELEGATION, | 19596 | | WRITE | 19597 | | | 19598 | NFS4ERR_NAMETOOLONG | CREATE, LINK, LOOKUP, OPEN, | 19599 | | REMOVE, RENAME, SECINFO | 19600 | | | 19601 | NFS4ERR_NOENT | BACKCHANNEL_CTL, | 19602 | | CREATE_SESSION, EXCHANGE_ID, | 19603 | | GETDEVICEINFO, LOOKUP, | 19604 | | LOOKUPP, OPEN, OPENATTR, | 19605 | | REMOVE, RENAME, SECINFO, | 19606 | | SECINFO_NO_NAME | 19607 | | | 19608 | NFS4ERR_NOFILEHANDLE | ACCESS, CLOSE, COMMIT, | 19609 | | CREATE, DELEGRETURN, GETATTR, | 19610 | | GETDEVICELIST, GETFH, | 19611 | | GET_DIR_DELEGATION, | 19612 | | LAYOUTCOMMIT, LAYOUTGET, | 19613 | | LAYOUTRETURN, LINK, LOCK, | 19614 | | LOCKT, LOCKU, LOOKUP, | 19615 | | LOOKUPP, NVERIFY, OPEN, | 19616 | | OPENATTR, OPEN_DOWNGRADE, | 19617 | | READ, READDIR, READLINK, | 19618 | | RECLAIM_COMPLETE, REMOVE, | 19619 | | RENAME, RESTOREFH, SAVEFH, | 19620 | | SECINFO, SECINFO_NO_NAME, | 19621 | | SETATTR, VERIFY, | 19622 | | WANT_DELEGATION, WRITE | 19623 | | | 19624 | NFS4ERR_NOMATCHING_LAYOUT | CB_LAYOUTRECALL | 19625 | | | 19626 | NFS4ERR_NOSPC | CREATE, CREATE_SESSION, | 19627 | | LAYOUTGET, LINK, OPEN, | 19628 | | OPENATTR, RENAME, SETATTR, | 19629 | | WRITE | 19630 | | | 19631 | NFS4ERR_NOTDIR | CREATE, GET_DIR_DELEGATION, | 19632 | | LINK, LOOKUP, LOOKUPP, OPEN, | 19633 | | READDIR, REMOVE, RENAME, | 19634 | | SECINFO, SECINFO_NO_NAME | 19635 | | | 19636 | NFS4ERR_NOTEMPTY | REMOVE, RENAME | 19637 | | | 19638 | NFS4ERR_NOTSUPP | CB_LAYOUTRECALL, CB_NOTIFY, | 19639 | | CB_NOTIFY_DEVICEID, | 19640 | | CB_NOTIFY_LOCK, | 19641 | | CB_PUSH_DELEG, | 19642 | | CB_RECALLABLE_OBJ_AVAIL, | 19643 | | CB_WANTS_CANCELLED, | 19644 | | DELEGPURGE, DELEGRETURN, | 19645 | | GETDEVICEINFO, GETDEVICELIST, | 19646 | | GET_DIR_DELEGATION, | 19647 | | LAYOUTCOMMIT, LAYOUTGET, | 19648 | | LAYOUTRETURN, LINK, OPENATTR, | 19649 | | OPEN_CONFIRM, | 
19650 | | RELEASE_LOCKOWNER, RENEW, | 19651 | | SECINFO_NO_NAME, SETCLIENTID, | 19652 | | SETCLIENTID_CONFIRM, | 19653 | | WANT_DELEGATION | 19654 | | | 19655 | NFS4ERR_NOT_ONLY_OP | BIND_CONN_TO_SESSION, | 19656 | | CREATE_SESSION, | 19657 | | DESTROY_CLIENTID, | 19658 | | DESTROY_SESSION, EXCHANGE_ID | 19659 | | | 19660 | NFS4ERR_NOT_SAME | EXCHANGE_ID, GETDEVICELIST, | 19661 | | READDIR, VERIFY | 19662 | | | 19663 | NFS4ERR_NO_GRACE | LAYOUTCOMMIT, LAYOUTRETURN, | 19664 | | LOCK, OPEN, WANT_DELEGATION | 19665 | | | 19666 | NFS4ERR_OLD_STATEID | CLOSE, DELEGRETURN, | 19667 | | FREE_STATEID, LAYOUTGET, | 19668 | | LAYOUTRETURN, LOCK, LOCKU, | 19669 | | OPEN, OPEN_DOWNGRADE, READ, | 19670 | | SETATTR, WRITE | 19671 | | | 19672 | NFS4ERR_OPENMODE | LAYOUTGET, LOCK, READ, | 19673 | | SETATTR, WRITE | 19674 | | | 19675 | NFS4ERR_OP_ILLEGAL | CB_ILLEGAL, ILLEGAL | 19676 | | | 19677 | NFS4ERR_OP_NOT_IN_SESSION | ACCESS, BACKCHANNEL_CTL, | 19678 | | CB_GETATTR, CB_LAYOUTRECALL, | 19679 | | CB_NOTIFY, | 19680 | | CB_NOTIFY_DEVICEID, | 19681 | | CB_NOTIFY_LOCK, | 19682 | | CB_PUSH_DELEG, CB_RECALL, | 19683 | | CB_RECALLABLE_OBJ_AVAIL, | 19684 | | CB_RECALL_ANY, | 19685 | | CB_RECALL_SLOT, | 19686 | | CB_WANTS_CANCELLED, CLOSE, | 19687 | | COMMIT, CREATE, DELEGPURGE, | 19688 | | DELEGRETURN, FREE_STATEID, | 19689 | | GETATTR, GETDEVICEINFO, | 19690 | | GETDEVICELIST, GETFH, | 19691 | | GET_DIR_DELEGATION, | 19692 | | LAYOUTCOMMIT, LAYOUTGET, | 19693 | | LAYOUTRETURN, LINK, LOCK, | 19694 | | LOCKT, LOCKU, LOOKUP, | 19695 | | LOOKUPP, NVERIFY, OPEN, | 19696 | | OPENATTR, OPEN_DOWNGRADE, | 19697 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19698 | | READ, READDIR, READLINK, | 19699 | | RECLAIM_COMPLETE, REMOVE, | 19700 | | RENAME, RESTOREFH, SAVEFH, | 19701 | | SECINFO, SECINFO_NO_NAME, | 19702 | | SETATTR, SET_SSV, | 19703 | | TEST_STATEID, VERIFY, | 19704 | | WANT_DELEGATION, WRITE | 19705 | | | 19706 | NFS4ERR_PERM | CREATE, OPEN, SETATTR | 19707 | | | 19708 | NFS4ERR_PNFS_IO_HOLE | 
READ, WRITE | 19709 | | | 19710 | NFS4ERR_PNFS_NO_LAYOUT | READ, WRITE | 19711 | | | 19712 | NFS4ERR_RECALLCONFLICT | LAYOUTGET, WANT_DELEGATION | 19713 | | | 19714 | NFS4ERR_RECLAIM_BAD | LAYOUTCOMMIT, LOCK, OPEN, | 19715 | | WANT_DELEGATION | 19716 | | | 19717 | NFS4ERR_RECLAIM_CONFLICT | LAYOUTCOMMIT, LOCK, OPEN, | 19718 | | WANT_DELEGATION | 19719 | | | 19720 | NFS4ERR_REJECT_DELEG | CB_PUSH_DELEG | 19721 | | | 19722 | NFS4ERR_REP_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | 19723 | | BIND_CONN_TO_SESSION, | 19724 | | CB_GETATTR, CB_LAYOUTRECALL, | 19725 | | CB_NOTIFY, | 19726 | | CB_NOTIFY_DEVICEID, | 19727 | | CB_NOTIFY_LOCK, | 19728 | | CB_PUSH_DELEG, CB_RECALL, | 19729 | | CB_RECALLABLE_OBJ_AVAIL, | 19730 | | CB_RECALL_ANY, | 19731 | | CB_RECALL_SLOT, CB_SEQUENCE, | 19732 | | CB_WANTS_CANCELLED, CLOSE, | 19733 | | COMMIT, CREATE, | 19734 | | CREATE_SESSION, DELEGPURGE, | 19735 | | DELEGRETURN, | 19736 | | DESTROY_CLIENTID, | 19737 | | DESTROY_SESSION, EXCHANGE_ID, | 19738 | | FREE_STATEID, GETATTR, | 19739 | | GETDEVICEINFO, GETDEVICELIST, | 19740 | | GET_DIR_DELEGATION, | 19741 | | LAYOUTCOMMIT, LAYOUTGET, | 19742 | | LAYOUTRETURN, LINK, LOCK, | 19743 | | LOCKT, LOCKU, LOOKUP, | 19744 | | LOOKUPP, NVERIFY, OPEN, | 19745 | | OPENATTR, OPEN_DOWNGRADE, | 19746 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19747 | | READ, READDIR, READLINK, | 19748 | | RECLAIM_COMPLETE, REMOVE, | 19749 | | RENAME, RESTOREFH, SAVEFH, | 19750 | | SECINFO, SECINFO_NO_NAME, | 19751 | | SEQUENCE, SETATTR, SET_SSV, | 19752 | | TEST_STATEID, VERIFY, | 19753 | | WANT_DELEGATION, WRITE | 19754 | | | 19755 | NFS4ERR_REP_TOO_BIG_TO_CACHE | ACCESS, BACKCHANNEL_CTL, | 19756 | | BIND_CONN_TO_SESSION, | 19757 | | CB_GETATTR, CB_LAYOUTRECALL, | 19758 | | CB_NOTIFY, | 19759 | | CB_NOTIFY_DEVICEID, | 19760 | | CB_NOTIFY_LOCK, | 19761 | | CB_PUSH_DELEG, CB_RECALL, | 19762 | | CB_RECALLABLE_OBJ_AVAIL, | 19763 | | CB_RECALL_ANY, | 19764 | | CB_RECALL_SLOT, CB_SEQUENCE, | 19765 | | CB_WANTS_CANCELLED, CLOSE, | 19766 | 
| COMMIT, CREATE, | 19767 | | CREATE_SESSION, DELEGPURGE, | 19768 | | DELEGRETURN, | 19769 | | DESTROY_CLIENTID, | 19770 | | DESTROY_SESSION, EXCHANGE_ID, | 19771 | | FREE_STATEID, GETATTR, | 19772 | | GETDEVICEINFO, GETDEVICELIST, | 19773 | | GET_DIR_DELEGATION, | 19774 | | LAYOUTCOMMIT, LAYOUTGET, | 19775 | | LAYOUTRETURN, LINK, LOCK, | 19776 | | LOCKT, LOCKU, LOOKUP, | 19777 | | LOOKUPP, NVERIFY, OPEN, | 19778 | | OPENATTR, OPEN_DOWNGRADE, | 19779 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19780 | | READ, READDIR, READLINK, | 19781 | | RECLAIM_COMPLETE, REMOVE, | 19782 | | RENAME, RESTOREFH, SAVEFH, | 19783 | | SECINFO, SECINFO_NO_NAME, | 19784 | | SEQUENCE, SETATTR, SET_SSV, | 19785 | | TEST_STATEID, VERIFY, | 19786 | | WANT_DELEGATION, WRITE | 19787 | | | 19788 | NFS4ERR_REQ_TOO_BIG | ACCESS, BACKCHANNEL_CTL, | 19789 | | BIND_CONN_TO_SESSION, | 19790 | | CB_GETATTR, CB_LAYOUTRECALL, | 19791 | | CB_NOTIFY, | 19792 | | CB_NOTIFY_DEVICEID, | 19793 | | CB_NOTIFY_LOCK, | 19794 | | CB_PUSH_DELEG, CB_RECALL, | 19795 | | CB_RECALLABLE_OBJ_AVAIL, | 19796 | | CB_RECALL_ANY, | 19797 | | CB_RECALL_SLOT, CB_SEQUENCE, | 19798 | | CB_WANTS_CANCELLED, CLOSE, | 19799 | | COMMIT, CREATE, | 19800 | | CREATE_SESSION, DELEGPURGE, | 19801 | | DELEGRETURN, | 19802 | | DESTROY_CLIENTID, | 19803 | | DESTROY_SESSION, EXCHANGE_ID, | 19804 | | FREE_STATEID, GETATTR, | 19805 | | GETDEVICEINFO, GETDEVICELIST, | 19806 | | GET_DIR_DELEGATION, | 19807 | | LAYOUTCOMMIT, LAYOUTGET, | 19808 | | LAYOUTRETURN, LINK, LOCK, | 19809 | | LOCKT, LOCKU, LOOKUP, | 19810 | | LOOKUPP, NVERIFY, OPEN, | 19811 | | OPENATTR, OPEN_DOWNGRADE, | 19812 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19813 | | READ, READDIR, READLINK, | 19814 | | RECLAIM_COMPLETE, REMOVE, | 19815 | | RENAME, RESTOREFH, SAVEFH, | 19816 | | SECINFO, SECINFO_NO_NAME, | 19817 | | SEQUENCE, SETATTR, SET_SSV, | 19818 | | TEST_STATEID, VERIFY, | 19819 | | WANT_DELEGATION, WRITE | 19820 | | | 19821 | NFS4ERR_RETRY_UNCACHED_REP | ACCESS, BACKCHANNEL_CTL, | 19822 
| | BIND_CONN_TO_SESSION, | 19823 | | CB_GETATTR, CB_LAYOUTRECALL, | 19824 | | CB_NOTIFY, | 19825 | | CB_NOTIFY_DEVICEID, | 19826 | | CB_NOTIFY_LOCK, | 19827 | | CB_PUSH_DELEG, CB_RECALL, | 19828 | | CB_RECALLABLE_OBJ_AVAIL, | 19829 | | CB_RECALL_ANY, | 19830 | | CB_RECALL_SLOT, CB_SEQUENCE, | 19831 | | CB_WANTS_CANCELLED, CLOSE, | 19832 | | COMMIT, CREATE, | 19833 | | CREATE_SESSION, DELEGPURGE, | 19834 | | DELEGRETURN, | 19835 | | DESTROY_CLIENTID, | 19836 | | DESTROY_SESSION, EXCHANGE_ID, | 19837 | | FREE_STATEID, GETATTR, | 19838 | | GETDEVICEINFO, GETDEVICELIST, | 19839 | | GET_DIR_DELEGATION, | 19840 | | LAYOUTCOMMIT, LAYOUTGET, | 19841 | | LAYOUTRETURN, LINK, LOCK, | 19842 | | LOCKT, LOCKU, LOOKUP, | 19843 | | LOOKUPP, NVERIFY, OPEN, | 19844 | | OPENATTR, OPEN_DOWNGRADE, | 19845 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19846 | | READ, READDIR, READLINK, | 19847 | | RECLAIM_COMPLETE, REMOVE, | 19848 | | RENAME, RESTOREFH, SAVEFH, | 19849 | | SECINFO, SECINFO_NO_NAME, | 19850 | | SEQUENCE, SETATTR, SET_SSV, | 19851 | | TEST_STATEID, VERIFY, | 19852 | | WANT_DELEGATION, WRITE | 19853 | | | 19854 | NFS4ERR_ROFS | CREATE, LINK, LOCK, LOCKT, | 19855 | | OPEN, OPENATTR, | 19856 | | OPEN_DOWNGRADE, REMOVE, | 19857 | | RENAME, SETATTR, WRITE | 19858 | | | 19859 | NFS4ERR_SAME | NVERIFY | 19860 | | | 19861 | NFS4ERR_SEQUENCE_POS | CB_SEQUENCE, SEQUENCE | 19862 | | | 19863 | NFS4ERR_SEQ_FALSE_RETRY | CB_SEQUENCE, SEQUENCE | 19864 | | | 19865 | NFS4ERR_SEQ_MISORDERED | CB_SEQUENCE, CREATE_SESSION, | 19866 | | SEQUENCE | 19867 | | | 19868 | NFS4ERR_SERVERFAULT | ACCESS, BIND_CONN_TO_SESSION, | 19869 | | CB_GETATTR, CB_NOTIFY, | 19870 | | CB_NOTIFY_DEVICEID, | 19871 | | CB_NOTIFY_LOCK, | 19872 | | CB_PUSH_DELEG, CB_RECALL, | 19873 | | CB_RECALLABLE_OBJ_AVAIL, | 19874 | | CB_WANTS_CANCELLED, CLOSE, | 19875 | | COMMIT, CREATE, | 19876 | | CREATE_SESSION, DELEGPURGE, | 19877 | | DELEGRETURN, | 19878 | | DESTROY_CLIENTID, | 19879 | | DESTROY_SESSION, EXCHANGE_ID, | 19880 | | 
FREE_STATEID, GETATTR, | 19881 | | GETDEVICEINFO, GETDEVICELIST, | 19882 | | GET_DIR_DELEGATION, | 19883 | | LAYOUTCOMMIT, LAYOUTGET, | 19884 | | LAYOUTRETURN, LINK, LOCK, | 19885 | | LOCKU, LOOKUP, LOOKUPP, | 19886 | | NVERIFY, OPEN, OPENATTR, | 19887 | | OPEN_DOWNGRADE, PUTFH, | 19888 | | PUTPUBFH, PUTROOTFH, READ, | 19889 | | READDIR, READLINK, | 19890 | | RECLAIM_COMPLETE, REMOVE, | 19891 | | RENAME, RESTOREFH, SAVEFH, | 19892 | | SECINFO, SECINFO_NO_NAME, | 19893 | | SETATTR, TEST_STATEID, | 19894 | | VERIFY, WANT_DELEGATION, | 19895 | | WRITE | 19896 | | | 19897 | NFS4ERR_SHARE_DENIED | OPEN | 19898 | | | 19899 | NFS4ERR_STALE | ACCESS, CLOSE, COMMIT, | 19900 | | CREATE, DELEGRETURN, GETATTR, | 19901 | | GETFH, GET_DIR_DELEGATION, | 19902 | | LAYOUTCOMMIT, LAYOUTGET, | 19903 | | LAYOUTRETURN, LINK, LOCK, | 19904 | | LOCKT, LOCKU, LOOKUP, | 19905 | | LOOKUPP, NVERIFY, OPEN, | 19906 | | OPENATTR, OPEN_DOWNGRADE, | 19907 | | PUTFH, READ, READDIR, | 19908 | | READLINK, RECLAIM_COMPLETE, | 19909 | | REMOVE, RENAME, RESTOREFH, | 19910 | | SAVEFH, SECINFO, | 19911 | | SECINFO_NO_NAME, SETATTR, | 19912 | | VERIFY, WANT_DELEGATION, | 19913 | | WRITE | 19914 | | | 19915 | NFS4ERR_STALE_CLIENTID | CREATE_SESSION, | 19916 | | DESTROY_CLIENTID, | 19917 | | DESTROY_SESSION | 19918 | | | 19919 | NFS4ERR_SYMLINK | COMMIT, LAYOUTCOMMIT, LINK, | 19920 | | LOCK, LOCKT, LOOKUP, LOOKUPP, | 19921 | | OPEN, READ, WRITE | 19922 | | | 19923 | NFS4ERR_TOOSMALL | CREATE_SESSION, | 19924 | | GETDEVICEINFO, LAYOUTGET, | 19925 | | READDIR | 19926 | | | 19927 | NFS4ERR_TOO_MANY_OPS | ACCESS, BACKCHANNEL_CTL, | 19928 | | BIND_CONN_TO_SESSION, | 19929 | | CB_GETATTR, CB_LAYOUTRECALL, | 19930 | | CB_NOTIFY, | 19931 | | CB_NOTIFY_DEVICEID, | 19932 | | CB_NOTIFY_LOCK, | 19933 | | CB_PUSH_DELEG, CB_RECALL, | 19934 | | CB_RECALLABLE_OBJ_AVAIL, | 19935 | | CB_RECALL_ANY, | 19936 | | CB_RECALL_SLOT, CB_SEQUENCE, | 19937 | | CB_WANTS_CANCELLED, CLOSE, | 19938 | | COMMIT, CREATE, | 19939 | | 
CREATE_SESSION, DELEGPURGE, | 19940 | | DELEGRETURN, | 19941 | | DESTROY_CLIENTID, | 19942 | | DESTROY_SESSION, EXCHANGE_ID, | 19943 | | FREE_STATEID, GETATTR, | 19944 | | GETDEVICEINFO, GETDEVICELIST, | 19945 | | GET_DIR_DELEGATION, | 19946 | | LAYOUTCOMMIT, LAYOUTGET, | 19947 | | LAYOUTRETURN, LINK, LOCK, | 19948 | | LOCKT, LOCKU, LOOKUP, | 19949 | | LOOKUPP, NVERIFY, OPEN, | 19950 | | OPENATTR, OPEN_DOWNGRADE, | 19951 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19952 | | READ, READDIR, READLINK, | 19953 | | RECLAIM_COMPLETE, REMOVE, | 19954 | | RENAME, RESTOREFH, SAVEFH, | 19955 | | SECINFO, SECINFO_NO_NAME, | 19956 | | SEQUENCE, SETATTR, SET_SSV, | 19957 | | TEST_STATEID, VERIFY, | 19958 | | WANT_DELEGATION, WRITE | 19959 | | | 19960 | NFS4ERR_UNKNOWN_LAYOUTTYPE | CB_LAYOUTRECALL, | 19961 | | GETDEVICEINFO, GETDEVICELIST, | 19962 | | LAYOUTCOMMIT, LAYOUTGET, | 19963 | | LAYOUTRETURN, NVERIFY, | 19964 | | SETATTR, VERIFY | 19965 | | | 19966 | NFS4ERR_UNSAFE_COMPOUND | CREATE, OPEN, OPENATTR | 19967 | | | 19968 | NFS4ERR_WRONGSEC | LINK, LOOKUP, LOOKUPP, OPEN, | 19969 | | PUTFH, PUTPUBFH, PUTROOTFH, | 19970 | | RENAME, RESTOREFH | 19971 | | | 19972 | NFS4ERR_WRONG_CRED | CLOSE, CREATE_SESSION, | 19973 | | DELEGPURGE, DELEGRETURN, | 19974 | | DESTROY_CLIENTID, | 19975 | | DESTROY_SESSION, | 19976 | | FREE_STATEID, LAYOUTCOMMIT, | 19977 | | LAYOUTRETURN, LOCK, LOCKT, | 19978 | | LOCKU, OPEN_DOWNGRADE, | 19979 | | RECLAIM_COMPLETE | 19980 | | | 19981 | NFS4ERR_WRONG_TYPE | CB_LAYOUTRECALL, | 19982 | | CB_PUSH_DELEG, COMMIT, | 19983 | | GETATTR, LAYOUTGET, | 19984 | | LAYOUTRETURN, LINK, LOCK, | 19985 | | LOCKT, NVERIFY, OPEN, | 19986 | | OPENATTR, READ, READLINK, | 19987 | | RECLAIM_COMPLETE, SETATTR, | 19988 | | VERIFY, WANT_DELEGATION, | 19989 | | WRITE | 19990 | | | 19991 | NFS4ERR_XDEV | LINK, RENAME | 19992 | | | 19993 +-----------------------------------+-------------------------------+ 19995 Table 8 19997 16. 
16.  NFSv4.1 Procedures

   Both procedures, NULL and COMPOUND, MUST be implemented.

16.1.  Procedure 0: NULL - No Operation

16.1.1.  ARGUMENTS

   void;

16.1.2.  RESULTS

   void;

16.1.3.  DESCRIPTION

   This is the standard NULL procedure with the standard void argument
   and void response.  This procedure has no functionality associated
   with it.  Because of this, it is sometimes used to measure the
   overhead of processing a service request.  Therefore, the server
   SHOULD ensure that no unnecessary work is done in servicing this
   procedure.

16.1.4.  ERRORS

   None.

16.2.  Procedure 1: COMPOUND - Compound Operations

16.2.1.  ARGUMENTS

   enum nfs_opnum4 {
    OP_ACCESS               = 3,
    OP_CLOSE                = 4,
    OP_COMMIT               = 5,
    OP_CREATE               = 6,
    OP_DELEGPURGE           = 7,
    OP_DELEGRETURN          = 8,
    OP_GETATTR              = 9,
    OP_GETFH                = 10,
    OP_LINK                 = 11,
    OP_LOCK                 = 12,
    OP_LOCKT                = 13,
    OP_LOCKU                = 14,
    OP_LOOKUP               = 15,
    OP_LOOKUPP              = 16,
    OP_NVERIFY              = 17,
    OP_OPEN                 = 18,
    OP_OPENATTR             = 19,
    OP_OPEN_CONFIRM         = 20, /* Mandatory not-to-implement */
    OP_OPEN_DOWNGRADE       = 21,
    OP_PUTFH                = 22,
    OP_PUTPUBFH             = 23,
    OP_PUTROOTFH            = 24,
    OP_READ                 = 25,
    OP_READDIR              = 26,
    OP_READLINK             = 27,
    OP_REMOVE               = 28,
    OP_RENAME               = 29,
    OP_RENEW                = 30, /* Mandatory not-to-implement */
    OP_RESTOREFH            = 31,
    OP_SAVEFH               = 32,
    OP_SECINFO              = 33,
    OP_SETATTR              = 34,
    OP_SETCLIENTID          = 35, /* Mandatory not-to-implement */
    OP_SETCLIENTID_CONFIRM  = 36, /* Mandatory not-to-implement */
    OP_VERIFY               = 37,
    OP_WRITE                = 38,
    OP_RELEASE_LOCKOWNER    = 39, /* Mandatory not-to-implement */

    /* new operations for NFSv4.1 */
    OP_BACKCHANNEL_CTL      = 40,
    OP_BIND_CONN_TO_SESSION = 41,
    OP_EXCHANGE_ID          = 42,
    OP_CREATE_SESSION       = 43,
    OP_DESTROY_SESSION      = 44,
    OP_FREE_STATEID         = 45,
    OP_GET_DIR_DELEGATION   = 46,
    OP_GETDEVICEINFO        = 47,
    OP_GETDEVICELIST        = 48,
    OP_LAYOUTCOMMIT         = 49,
    OP_LAYOUTGET            = 50,
    OP_LAYOUTRETURN         = 51,
    OP_SECINFO_NO_NAME      = 52,
    OP_SEQUENCE             = 53,
    OP_SET_SSV              = 54,
    OP_TEST_STATEID         = 55,
    OP_WANT_DELEGATION      = 56,
    OP_DESTROY_CLIENTID     = 57,
    OP_RECLAIM_COMPLETE     = 58,

    OP_ILLEGAL              = 10044
   };

   union nfs_argop4 switch (nfs_opnum4 argop) {
    case OP_ACCESS:         ACCESS4args opaccess;
    case OP_CLOSE:          CLOSE4args opclose;
    case OP_COMMIT:         COMMIT4args opcommit;
    case OP_CREATE:         CREATE4args opcreate;
    case OP_DELEGPURGE:     DELEGPURGE4args opdelegpurge;
    case OP_DELEGRETURN:    DELEGRETURN4args opdelegreturn;
    case OP_GETATTR:        GETATTR4args opgetattr;
    case OP_GETFH:          void;
    case OP_LINK:           LINK4args oplink;
    case OP_LOCK:           LOCK4args oplock;
    case OP_LOCKT:          LOCKT4args oplockt;
    case OP_LOCKU:          LOCKU4args oplocku;
    case OP_LOOKUP:         LOOKUP4args oplookup;
    case OP_LOOKUPP:        void;
    case OP_NVERIFY:        NVERIFY4args opnverify;
    case OP_OPEN:           OPEN4args opopen;
    case OP_OPENATTR:       OPENATTR4args opopenattr;

    /* Not for NFSv4.1 */
    case OP_OPEN_CONFIRM:   OPEN_CONFIRM4args opopen_confirm;

    case OP_OPEN_DOWNGRADE:
                            OPEN_DOWNGRADE4args opopen_downgrade;

    case OP_PUTFH:          PUTFH4args opputfh;
    case OP_PUTPUBFH:       void;
    case OP_PUTROOTFH:      void;
    case OP_READ:           READ4args opread;
    case OP_READDIR:        READDIR4args opreaddir;
    case OP_READLINK:       void;
    case OP_REMOVE:         REMOVE4args opremove;
    case OP_RENAME:         RENAME4args oprename;

    /* Not for NFSv4.1 */
    case OP_RENEW:          RENEW4args oprenew;

    case OP_RESTOREFH:      void;
    case OP_SAVEFH:         void;
    case OP_SECINFO:        SECINFO4args opsecinfo;
    case OP_SETATTR:        SETATTR4args opsetattr;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID:    SETCLIENTID4args opsetclientid;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID_CONFIRM:
                            SETCLIENTID_CONFIRM4args
                                    opsetclientid_confirm;

    case OP_VERIFY:         VERIFY4args opverify;
    case OP_WRITE:          WRITE4args opwrite;

    /* Not for NFSv4.1 */
    case OP_RELEASE_LOCKOWNER:
                            RELEASE_LOCKOWNER4args
                                    oprelease_lockowner;

    /* Operations new to NFSv4.1 */
    case OP_BACKCHANNEL_CTL:
                            BACKCHANNEL_CTL4args opbackchannel_ctl;

    case OP_BIND_CONN_TO_SESSION:
                            BIND_CONN_TO_SESSION4args
                                    opbind_conn_to_session;

    case OP_EXCHANGE_ID:    EXCHANGE_ID4args opexchange_id;

    case OP_CREATE_SESSION:
                            CREATE_SESSION4args opcreate_session;

    case OP_DESTROY_SESSION:
                            DESTROY_SESSION4args opdestroy_session;

    case OP_FREE_STATEID:   FREE_STATEID4args opfree_stateid;

    case OP_GET_DIR_DELEGATION:
                            GET_DIR_DELEGATION4args
                                    opget_dir_delegation;

    case OP_GETDEVICEINFO:  GETDEVICEINFO4args opgetdeviceinfo;
    case OP_GETDEVICELIST:  GETDEVICELIST4args opgetdevicelist;
    case OP_LAYOUTCOMMIT:   LAYOUTCOMMIT4args oplayoutcommit;
    case OP_LAYOUTGET:      LAYOUTGET4args oplayoutget;
    case OP_LAYOUTRETURN:   LAYOUTRETURN4args oplayoutreturn;

    case OP_SECINFO_NO_NAME:
                            SECINFO_NO_NAME4args opsecinfo_no_name;

    case OP_SEQUENCE:       SEQUENCE4args opsequence;
    case OP_SET_SSV:        SET_SSV4args opset_ssv;
    case OP_TEST_STATEID:   TEST_STATEID4args optest_stateid;

    case OP_WANT_DELEGATION:
                            WANT_DELEGATION4args opwant_delegation;

    case OP_DESTROY_CLIENTID:
                            DESTROY_CLIENTID4args
                                    opdestroy_clientid;

    case OP_RECLAIM_COMPLETE:
                            RECLAIM_COMPLETE4args
                                    opreclaim_complete;

    /* Operations not new to NFSv4.1 */
    case OP_ILLEGAL:        void;
   };

   struct COMPOUND4args {
           utf8str_cs      tag;
           uint32_t        minorversion;
           nfs_argop4      argarray<>;
   };
16.2.2.  RESULTS

   union nfs_resop4 switch (nfs_opnum4 resop) {
    case OP_ACCESS:         ACCESS4res opaccess;
    case OP_CLOSE:          CLOSE4res opclose;
    case OP_COMMIT:         COMMIT4res opcommit;
    case OP_CREATE:         CREATE4res opcreate;
    case OP_DELEGPURGE:     DELEGPURGE4res opdelegpurge;
    case OP_DELEGRETURN:    DELEGRETURN4res opdelegreturn;
    case OP_GETATTR:        GETATTR4res opgetattr;
    case OP_GETFH:          GETFH4res opgetfh;
    case OP_LINK:           LINK4res oplink;
    case OP_LOCK:           LOCK4res oplock;
    case OP_LOCKT:          LOCKT4res oplockt;
    case OP_LOCKU:          LOCKU4res oplocku;
    case OP_LOOKUP:         LOOKUP4res oplookup;
    case OP_LOOKUPP:        LOOKUPP4res oplookupp;
    case OP_NVERIFY:        NVERIFY4res opnverify;
    case OP_OPEN:           OPEN4res opopen;
    case OP_OPENATTR:       OPENATTR4res opopenattr;

    /* Not for NFSv4.1 */
    case OP_OPEN_CONFIRM:   OPEN_CONFIRM4res opopen_confirm;

    case OP_OPEN_DOWNGRADE:
                            OPEN_DOWNGRADE4res opopen_downgrade;

    case OP_PUTFH:          PUTFH4res opputfh;
    case OP_PUTPUBFH:       PUTPUBFH4res opputpubfh;
    case OP_PUTROOTFH:      PUTROOTFH4res opputrootfh;
    case OP_READ:           READ4res opread;
    case OP_READDIR:        READDIR4res opreaddir;
    case OP_READLINK:       READLINK4res opreadlink;
    case OP_REMOVE:         REMOVE4res opremove;
    case OP_RENAME:         RENAME4res oprename;

    /* Not for NFSv4.1 */
    case OP_RENEW:          RENEW4res oprenew;

    case OP_RESTOREFH:      RESTOREFH4res oprestorefh;
    case OP_SAVEFH:         SAVEFH4res opsavefh;
    case OP_SECINFO:        SECINFO4res opsecinfo;
    case OP_SETATTR:        SETATTR4res opsetattr;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID:    SETCLIENTID4res opsetclientid;

    /* Not for NFSv4.1 */
    case OP_SETCLIENTID_CONFIRM:
                            SETCLIENTID_CONFIRM4res
                                    opsetclientid_confirm;

    case OP_VERIFY:         VERIFY4res opverify;
    case OP_WRITE:          WRITE4res opwrite;

    /* Not for NFSv4.1 */
    case OP_RELEASE_LOCKOWNER:
                            RELEASE_LOCKOWNER4res
                                    oprelease_lockowner;

    /* Operations new to NFSv4.1 */
    case OP_BACKCHANNEL_CTL:
                            BACKCHANNEL_CTL4res opbackchannel_ctl;

    case OP_BIND_CONN_TO_SESSION:
                            BIND_CONN_TO_SESSION4res
                                    opbind_conn_to_session;

    case OP_EXCHANGE_ID:    EXCHANGE_ID4res opexchange_id;

    case OP_CREATE_SESSION:
                            CREATE_SESSION4res opcreate_session;

    case OP_DESTROY_SESSION:
                            DESTROY_SESSION4res opdestroy_session;

    case OP_FREE_STATEID:   FREE_STATEID4res opfree_stateid;

    case OP_GET_DIR_DELEGATION:
                            GET_DIR_DELEGATION4res
                                    opget_dir_delegation;

    case OP_GETDEVICEINFO:  GETDEVICEINFO4res opgetdeviceinfo;
    case OP_GETDEVICELIST:  GETDEVICELIST4res opgetdevicelist;
    case OP_LAYOUTCOMMIT:   LAYOUTCOMMIT4res oplayoutcommit;
    case OP_LAYOUTGET:      LAYOUTGET4res oplayoutget;
    case OP_LAYOUTRETURN:   LAYOUTRETURN4res oplayoutreturn;

    case OP_SECINFO_NO_NAME:
                            SECINFO_NO_NAME4res opsecinfo_no_name;

    case OP_SEQUENCE:       SEQUENCE4res opsequence;
    case OP_SET_SSV:        SET_SSV4res opset_ssv;
    case OP_TEST_STATEID:   TEST_STATEID4res optest_stateid;

    case OP_WANT_DELEGATION:
                            WANT_DELEGATION4res opwant_delegation;

    case OP_DESTROY_CLIENTID:
                            DESTROY_CLIENTID4res
                                    opdestroy_clientid;

    case OP_RECLAIM_COMPLETE:
                            RECLAIM_COMPLETE4res
                                    opreclaim_complete;

    /* Operations not new to NFSv4.1 */
    case OP_ILLEGAL:        ILLEGAL4res opillegal;
   };

   struct COMPOUND4res {
           nfsstat4        status;
           utf8str_cs      tag;
           nfs_resop4      resarray<>;
   };

16.2.3.  DESCRIPTION

   The COMPOUND procedure is used to combine one or more NFSv4
   operations into a single RPC request.  The server interprets each of
   the operations in turn.  If an operation is executed by the server
   and the status of that operation is NFS4_OK, then the next operation
   in the COMPOUND procedure is executed.  The server continues this
   process until there are no more operations to be executed or until
   one of the operations has a status value other than NFS4_OK.

   In the processing of the COMPOUND procedure, the server may find
   that it does not have the available resources to execute any or all
   of the operations within the COMPOUND sequence.  See Section
   2.10.6.4 for a more detailed discussion.

   The server will generally choose between two methods of decoding the
   client's request.  The first would be the traditional one-pass XDR
   decode.  If there is an XDR decoding error in this case, the RPC XDR
   decode error would be returned.  The second method would be to make
   an initial pass to decode the basic COMPOUND request and then to XDR
   decode the individual operations; the most interesting is the decode
   of attributes.  In this case, the server may encounter an XDR decode
   error during the second pass.  If it does, the server would return
   the error NFS4ERR_BADXDR to signify the decode error.

   The COMPOUND arguments contain a "minorversion" field.  For NFSv4.1,
   the value for this field is 1.  If the server receives a COMPOUND
   procedure with a minorversion field value that it does not support,
   the server MUST return an error of NFS4ERR_MINOR_VERS_MISMATCH and a
   zero-length resultdata array.

   Contained within the COMPOUND results is a "status" field.  If the
   results array length is non-zero, this status must be equivalent to
   the status of the last operation that was executed within the
   COMPOUND procedure.  Therefore, if an operation incurred an error,
   then the "status" value will be the same error value as is being
   returned for the operation that failed.

   Note that operations zero and one are not defined for the COMPOUND
   procedure.  Operation 2 is not defined and is reserved for future
   definition and use with minor versioning.
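   As a non-normative illustration, the evaluation rules above (execute
   the operations in order, stop at the first status other than
   NFS4_OK, return the status of the last executed operation as the
   COMPOUND status, and reject an unsupported minorversion with
   NFS4ERR_MINOR_VERS_MISMATCH and a zero-length result array) can be
   sketched in Python.  The handler callables and tuple encoding are
   hypothetical stand-ins for illustration, not part of the protocol:

```python
# Sketch of server-side COMPOUND evaluation.  Each entry in argarray
# is a (name, handler) pair; a handler returns a status string.
NFS4_OK = "NFS4_OK"

def compound(minorversion, argarray, supported_minor=1):
    """Return (status, resarray) for a COMPOUND request."""
    if minorversion != supported_minor:
        # Unsupported minor version: error plus zero-length result array.
        return "NFS4ERR_MINOR_VERS_MISMATCH", []
    resarray = []
    status = NFS4_OK
    for name, handler in argarray:
        status = handler()
        resarray.append((name, status))
        if status != NFS4_OK:
            break          # remaining operations are not executed
    # The COMPOUND status equals the last executed operation's status.
    return status, resarray
```

   With a failing second operation, the result array contains exactly
   two entries and the COMPOUND status matches the failing operation's
   status.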
   If the server receives an operation array that contains operation 2
   and the minorversion field has a value of zero, an error of
   NFS4ERR_OP_ILLEGAL, as described in the next paragraph, is returned
   to the client.  If an operation array contains an operation 2 and
   the minorversion field is non-zero and the server does not support
   the minor version, the server returns an error of
   NFS4ERR_MINOR_VERS_MISMATCH.  Therefore, the
   NFS4ERR_MINOR_VERS_MISMATCH error takes precedence over all other
   errors.

   It is possible that the server receives a request that contains an
   operation that is less than the first legal operation (OP_ACCESS) or
   greater than the last legal operation (OP_RELEASE_LOCKOWNER).  In
   this case, the server's response will encode the opcode OP_ILLEGAL
   rather than the illegal opcode of the request.  The status field in
   the ILLEGAL return results will be set to NFS4ERR_OP_ILLEGAL.  The
   COMPOUND procedure's return results will also be
   NFS4ERR_OP_ILLEGAL.

   The definition of the "tag" in the request is left to the
   implementor.  It may be used to summarize the content of the
   Compound request for the benefit of packet-sniffers and engineers
   debugging implementations.  However, the value of "tag" in the
   response SHOULD be the same value as provided in the request.  This
   applies to the tag field of the CB_COMPOUND procedure as well.

16.2.3.1.  Current Filehandle and Stateid

   The COMPOUND procedure offers a simple environment for the execution
   of the operations specified by the client.  The first two relate to
   the filehandle while the second two relate to the current stateid.

16.2.3.1.1.  Current Filehandle

   The current and saved filehandles are used throughout the protocol.
   Most operations implicitly use the current filehandle as an
   argument, and many set the current filehandle as part of the
   results.
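   As a non-normative illustration, the threading of the current and
   saved filehandles through a COMPOUND (PUTROOTFH and LOOKUP setting
   the current filehandle; SAVEFH and RESTOREFH copying between current
   and saved) can be modeled as a small state machine.  The dictionary
   namespace and path-tuple "filehandles" below are hypothetical
   stand-ins for illustration, not an NFS implementation:

```python
# Sketch of current/saved filehandle threading in a COMPOUND.
class CompoundEnv:
    def __init__(self, tree):
        self.tree = tree     # nested dicts: component name -> subtree
        self.cfh = None      # current filehandle (here, a path tuple)
        self.sfh = None      # saved filehandle

    def putrootfh(self):
        self.cfh = ()        # PUTROOTFH sets the current fh to the root

    def lookup(self, comp):
        # LOOKUP replaces the current fh with that of the named component
        node = self.tree
        for c in self.cfh:
            node = node[c]
        if comp not in node:
            return "NFS4ERR_NOENT"
        self.cfh = self.cfh + (comp,)
        return "NFS4_OK"

    def savefh(self):
        self.sfh = self.cfh  # SAVEFH copies current -> saved

    def restorefh(self):
        self.cfh = self.sfh  # RESTOREFH copies saved -> current

    def getfh(self):
        return self.cfh      # GETFH returns the current fh
```

   A sequence such as PUTROOTFH, LOOKUP "compA", SAVEFH, LOOKUP
   "compB", RESTOREFH leaves the current filehandle back at "compA",
   showing how the saved filehandle acts as a scratch area.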
   The combination of client-specified sequences of operations and
   current and saved filehandle arguments and results allows for
   greater protocol flexibility.  The best or easiest example of
   current filehandle usage is a sequence like the following:

       PUTFH fh1              {fh1}
       LOOKUP "compA"         {fh2}
       GETATTR                {fh2}
       LOOKUP "compB"         {fh3}
       GETATTR                {fh3}
       LOOKUP "compC"         {fh4}
       GETATTR                {fh4}
       GETFH

                                 Figure 2

   In this example, the PUTFH (Section 18.19) operation explicitly sets
   the current filehandle value while the result of each LOOKUP
   operation sets the current filehandle value to the resultant file
   system object.  Also, the client is able to insert GETATTR
   operations using the current filehandle as an argument.

   The PUTROOTFH (Section 18.21) and PUTPUBFH (Section 18.20)
   operations also set the current filehandle.  The above example would
   replace "PUTFH fh1" with PUTROOTFH or PUTPUBFH with no filehandle
   argument in order to achieve the same effect (on the assumption that
   "compA" is directly below the root of the namespace).

   Along with the current filehandle, there is a saved filehandle.
   While the current filehandle is set as the result of operations like
   LOOKUP, the saved filehandle must be set directly with the use of
   the SAVEFH operation.  The SAVEFH operation copies the current
   filehandle value to the saved value.  The saved filehandle value is
   used in combination with the current filehandle value for the LINK
   and RENAME operations.  The RESTOREFH operation will copy the saved
   filehandle value to the current filehandle value; as a result, the
   saved filehandle value may be used as a sort of "scratch" area for
   the client's series of operations.

16.2.3.1.2.  Current Stateid

   With NFSv4.1, additions of a current stateid and a saved stateid
   have been made to the COMPOUND processing environment; this allows
   for the passing of stateids between operations.  There are no
   changes to the syntax of the protocol, only changes to the semantics
   of a few operations.

   A "current stateid" is the stateid that is associated with the
   current filehandle.  The current stateid may only be changed by an
   operation that modifies the current filehandle or returns a stateid.
   If an operation returns a stateid, it MUST set the current stateid
   to the returned value.  If an operation sets the current filehandle
   but does not return a stateid, the current stateid MUST be set to
   the all-zeros special stateid, i.e., (seqid, other) = (0, 0).  If an
   operation uses a stateid as an argument but does not return a
   stateid, the current stateid MUST NOT be changed.  For example,
   PUTFH, PUTROOTFH, and PUTPUBFH will change the current server state
   from {ocfh, (osid)} to {cfh, (0, 0)}, while LOCK will change the
   current state from {cfh, (osid)} to {cfh, (nsid)}.  Operations like
   LOOKUP that transform a current filehandle and component name into a
   new current filehandle will also change the current state to
   {0, 0}.  The SAVEFH and RESTOREFH operations will save and restore
   both the current filehandle and the current stateid as a set.

   The following example is the common case of a simple READ operation
   with a normal stateid showing that the PUTFH initializes the current
   stateid to (0, 0).  The subsequent READ with stateid (sid1) leaves
   the current stateid unchanged.

       PUTFH fh1                       -   -> {fh1, (0, 0)}
       READ (sid1), 0, 1024  {fh1, (0, 0)} -> {fh1, (0, 0)}

                                 Figure 3

   This next example performs an OPEN with the root filehandle and, as
   a result, generates stateid (sid1).
The next operation specifies the 20477 READ with the argument stateid set such that (seqid, other) are equal 20478 to (1, 0), but the current stateid set by the previous operation is 20479 actually used when the operation is evaluated. This allows correct 20480 interaction with any existing, potentially conflicting, locks. 20482 PUTROOTFH - -> {fh1, (0, 0)} 20483 OPEN "compA" {fh1, (0, 0)} -> {fh2, (sid1)} 20484 READ (1, 0), 0, 1024 {fh2, (sid1)} -> {fh2, (sid1)} 20485 CLOSE (1, 0) {fh2, (sid1)} -> {fh2, (sid2)} 20487 Figure 4 20489 This next example is similar to the second in how it passes the 20490 stateid sid2 generated by the LOCK operation to the next READ 20491 operation. This allows the client to explicitly surround a single I/ 20492 O operation with a lock and its appropriate stateid to guarantee 20493 correctness with other client locks. The example also shows how 20494 SAVEFH and RESTOREFH can save and later reuse a filehandle and 20495 stateid, passing them as the current filehandle and stateid to a READ 20496 operation. 20498 PUTFH fh1 - -> {fh1, (0, 0)} 20499 LOCK 0, 1024, (sid1) {fh1, (sid1)} -> {fh1, (sid2)} 20500 READ (1, 0), 0, 1024 {fh1, (sid2)} -> {fh1, (sid2)} 20501 LOCKU 0, 1024, (1, 0) {fh1, (sid2)} -> {fh1, (sid3)} 20502 SAVEFH {fh1, (sid3)} -> {fh1, (sid3)} 20504 PUTFH fh2 {fh1, (sid3)} -> {fh2, (0, 0)} 20505 WRITE (1, 0), 0, 1024 {fh2, (0, 0)} -> {fh2, (0, 0)} 20507 RESTOREFH {fh2, (0, 0)} -> {fh1, (sid3)} 20508 READ (1, 0), 1024, 1024 {fh1, (sid3)} -> {fh1, (sid3)} 20510 Figure 5 20512 The final example shows a disallowed use of the current stateid. The 20513 client is attempting to implicitly pass an anonymous special stateid, 20514 (0,0), to the READ operation. The server MUST return 20515 NFS4ERR_BAD_STATEID in the reply to the READ operation. 20517 PUTFH fh1 - -> {fh1, (0, 0)} 20518 READ (1, 0), 0, 1024 {fh1, (0, 0)} -> NFS4ERR_BAD_STATEID 20520 Figure 6 20522 16.2.4. 
ERRORS 20524 COMPOUND will of course return every error that each operation on the 20525 fore channel can return (see Table 6). However, if COMPOUND returns 20526 zero operations, obviously the error returned by COMPOUND has nothing 20527 to do with an error returned by an operation. The list of errors 20528 COMPOUND will return if it processes zero operations include: 20530 COMPOUND Error Returns 20532 +------------------------------+------------------------------------+ 20533 | Error | Notes | 20534 +------------------------------+------------------------------------+ 20535 | NFS4ERR_BADCHAR | The tag argument has a character | 20536 | | the replier does not support. | 20537 | NFS4ERR_BADXDR | | 20538 | NFS4ERR_DELAY | | 20539 | NFS4ERR_INVAL | The tag argument is not in UTF-8 | 20540 | | encoding. | 20541 | NFS4ERR_MINOR_VERS_MISMATCH | | 20542 | NFS4ERR_SERVERFAULT | | 20543 | NFS4ERR_TOO_MANY_OPS | | 20544 | NFS4ERR_REP_TOO_BIG | | 20545 | NFS4ERR_REP_TOO_BIG_TO_CACHE | | 20546 | NFS4ERR_REQ_TOO_BIG | | 20547 +------------------------------+------------------------------------+ 20549 Table 9 20551 17. Operations: REQUIRED, RECOMMENDED, or OPTIONAL 20553 The following tables summarize the operations of the NFSv4.1 protocol 20554 and the corresponding designation of REQUIRED, RECOMMENDED, and 20555 OPTIONAL to implement or MUST NOT implement. The designation of MUST 20556 NOT implement is reserved for those operations that were defined in 20557 NFSv4.0 and MUST NOT be implemented in NFSv4.1. 20559 For the most part, the REQUIRED, RECOMMENDED, or OPTIONAL designation 20560 for operations sent by the client is for the server implementation. 20561 The client is generally required to implement the operations needed 20562 for the operating environment for which it serves. For example, a 20563 read-only NFSv4.1 client would have no need to implement the WRITE 20564 operation and is not required to do so. 
The REQUIRED or OPTIONAL designation for callback operations sent by
the server is for both the client and server.  Generally, the client
has the option of creating the backchannel and sending the operations
on the fore channel that will be a catalyst for the server sending
callback operations.  A partial exception is CB_RECALL_SLOT; the only
way the client can avoid supporting this operation is by not creating
a backchannel.

Since this is a summary of the operations and their designation,
there are subtleties that are not presented here.  Therefore, if
there is a question of the requirements of implementation, the
operation descriptions themselves must be consulted along with other
relevant explanatory text within this specification.

The abbreviations used in the second and third columns of the table
are defined as follows.

REQ  REQUIRED to implement

REC  RECOMMENDED to implement

OPT  OPTIONAL to implement

MNI  MUST NOT implement

For the NFSv4.1 features that are OPTIONAL, the operations that
support those features are OPTIONAL, and the server would return
NFS4ERR_NOTSUPP in response to the client's use of those operations.
If an OPTIONAL feature is supported, it is possible that a set of
operations related to the feature become REQUIRED to implement.  The
third column of the table designates the feature(s) and if the
operation is REQUIRED or OPTIONAL in the presence of support for the
feature.
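The NFS4ERR_NOTSUPP behavior described above can be sketched with a
small table-driven check.  The feature groupings below are a subset
taken from the operation tables of this section; the function name
and server structure are hypothetical.

```python
# Illustrative sketch of the OPTIONAL-feature dispatch described above:
# an operation tied to an unsupported OPTIONAL feature draws
# NFS4ERR_NOTSUPP.  Constant values follow the NFSv4.1 numbering.

NFS4_OK = 0
NFS4ERR_NOTSUPP = 10004

# Operations that are OPTIONAL overall but REQUIRED once the named
# feature is supported (a subset, for illustration only).
FEATURE_OPS = {
    "pNFS":  {"GETDEVICEINFO", "LAYOUTCOMMIT", "LAYOUTGET", "LAYOUTRETURN"},
    "FDELG": {"DELEGPURGE"},
    "DDELG": {"GET_DIR_DELEGATION"},
}

def op_status(op, supported_features):
    """Status an operation draws given the server's supported features."""
    for feature, ops in FEATURE_OPS.items():
        if op in ops:
            return NFS4_OK if feature in supported_features else NFS4ERR_NOTSUPP
    return NFS4_OK   # REQUIRED operations are always implemented
```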
The OPTIONAL features identified and their abbreviations are as
follows:

pNFS   Parallel NFS

FDELG  File Delegations

DDELG  Directory Delegations

                               Operations

   +----------------------+------------+--------------+----------------+
   | Operation            | REQ, REC,  | Feature      | Definition     |
   |                      | OPT, or    | (REQ, REC,   |                |
   |                      | MNI        | or OPT)      |                |
   +----------------------+------------+--------------+----------------+
   | ACCESS               | REQ        |              | Section 18.1   |
   | BACKCHANNEL_CTL      | REQ        |              | Section 18.33  |
   | BIND_CONN_TO_SESSION | REQ        |              | Section 18.34  |
   | CLOSE                | REQ        |              | Section 18.2   |
   | COMMIT               | REQ        |              | Section 18.3   |
   | CREATE               | REQ        |              | Section 18.4   |
   | CREATE_SESSION       | REQ        |              | Section 18.36  |
   | DELEGPURGE           | OPT        | FDELG (REQ)  | Section 18.5   |
   | DELEGRETURN          | OPT        | FDELG,       | Section 18.6   |
   |                      |            | DDELG, pNFS  |                |
   |                      |            | (REQ)        |                |
   | DESTROY_CLIENTID     | REQ        |              | Section 18.50  |
   | DESTROY_SESSION      | REQ        |              | Section 18.37  |
   | EXCHANGE_ID          | REQ        |              | Section 18.35  |
   | FREE_STATEID         | REQ        |              | Section 18.38  |
   | GETATTR              | REQ        |              | Section 18.7   |
   | GETDEVICEINFO        | OPT        | pNFS (REQ)   | Section 18.40  |
   | GETDEVICELIST        | OPT        | pNFS (OPT)   | Section 18.41  |
   | GETFH                | REQ        |              | Section 18.8   |
   | GET_DIR_DELEGATION   | OPT        | DDELG (REQ)  | Section 18.39  |
   | LAYOUTCOMMIT         | OPT        | pNFS (REQ)   | Section 18.42  |
   | LAYOUTGET            | OPT        | pNFS (REQ)   | Section 18.43  |
   | LAYOUTRETURN         | OPT        | pNFS (REQ)   | Section 18.44  |
   | LINK                 | OPT        |              | Section 18.9   |
   | LOCK                 | REQ        |              | Section 18.10  |
   | LOCKT                | REQ        |              | Section 18.11  |
   | LOCKU                | REQ        |              | Section 18.12  |
   | LOOKUP               | REQ        |              | Section 18.13  |
   | LOOKUPP              | REQ        |              | Section 18.14  |
   | NVERIFY              | REQ        |              | Section 18.15  |
   | OPEN                 | REQ        |              | Section 18.16  |
   | OPENATTR             | OPT        |              | Section 18.17  |
   | OPEN_CONFIRM         | MNI        |              | N/A            |
   | OPEN_DOWNGRADE       | REQ        |              | Section 18.18  |
   | PUTFH                | REQ        |              | Section 18.19  |
   | PUTPUBFH             | REQ        |              | Section 18.20  |
   | PUTROOTFH            | REQ        |              | Section 18.21  |
   | READ                 | REQ        |              | Section 18.22  |
   | READDIR              | REQ        |              | Section 18.23  |
   | READLINK             | OPT        |              | Section 18.24  |
   | RECLAIM_COMPLETE     | REQ        |              | Section 18.51  |
   | RELEASE_LOCKOWNER    | MNI        |              | N/A            |
   | REMOVE               | REQ        |              | Section 18.25  |
   | RENAME               | REQ        |              | Section 18.26  |
   | RENEW                | MNI        |              | N/A            |
   | RESTOREFH            | REQ        |              | Section 18.27  |
   | SAVEFH               | REQ        |              | Section 18.28  |
   | SECINFO              | REQ        |              | Section 18.29  |
   | SECINFO_NO_NAME      | REC        | pNFS file    | Section 18.45, |
   |                      |            | layout (REQ) | Section 13.12  |
   | SEQUENCE             | REQ        |              | Section 18.46  |
   | SETATTR              | REQ        |              | Section 18.30  |
   | SETCLIENTID          | MNI        |              | N/A            |
   | SETCLIENTID_CONFIRM  | MNI        |              | N/A            |
   | SET_SSV              | REQ        |              | Section 18.47  |
   | TEST_STATEID         | REQ        |              | Section 18.48  |
   | VERIFY               | REQ        |              | Section 18.31  |
   | WANT_DELEGATION      | OPT        | FDELG (OPT)  | Section 18.49  |
   | WRITE                | REQ        |              | Section 18.32  |
   +----------------------+------------+--------------+----------------+

                          Callback Operations

   +-------------------------+------------+---------------+------------+
   | Operation               | REQ, REC,  | Feature (REQ, | Definition |
   |                         | OPT, or    | REC, or OPT)  |            |
   |                         | MNI        |               |            |
   +-------------------------+------------+---------------+------------+
   | CB_GETATTR              | OPT        | FDELG (REQ)   | Section    |
   |                         |            |               | 20.1       |
   | CB_LAYOUTRECALL         | OPT        | pNFS (REQ)    | Section    |
   |                         |            |               | 20.3       |
   | CB_NOTIFY               | OPT        | DDELG (REQ)   | Section    |
   |                         |            |               | 20.4       |
   | CB_NOTIFY_DEVICEID      | OPT        | pNFS (OPT)    | Section    |
   |                         |            |               | 20.12      |
   | CB_NOTIFY_LOCK          | OPT        |               | Section    |
   |                         |            |               | 20.11      |
   | CB_PUSH_DELEG           | OPT        | FDELG (OPT)   | Section    |
   |                         |            |               | 20.5       |
   | CB_RECALL               | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.2       |
   | CB_RECALL_ANY           | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.6       |
   | CB_RECALL_SLOT          | REQ        |               | Section    |
   |                         |            |               | 20.8       |
   | CB_RECALLABLE_OBJ_AVAIL | OPT        | DDELG, pNFS   | Section    |
   |                         |            | (REQ)         | 20.7       |
   | CB_SEQUENCE             | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.9       |
   | CB_WANTS_CANCELLED      | OPT        | FDELG, DDELG, | Section    |
   |                         |            | pNFS (REQ)    | 20.10      |
   +-------------------------+------------+---------------+------------+

18.  NFSv4.1 Operations

18.1.  Operation 3: ACCESS - Check Access Rights

18.1.1.  ARGUMENTS

   const ACCESS4_READ       = 0x00000001;
   const ACCESS4_LOOKUP     = 0x00000002;
   const ACCESS4_MODIFY     = 0x00000004;
   const ACCESS4_EXTEND     = 0x00000008;
   const ACCESS4_DELETE     = 0x00000010;
   const ACCESS4_EXECUTE    = 0x00000020;

   struct ACCESS4args {
           /* CURRENT_FH: object */
           uint32_t        access;
   };

18.1.2.  RESULTS

   struct ACCESS4resok {
           uint32_t        supported;
           uint32_t        access;
   };

   union ACCESS4res switch (nfsstat4 status) {
    case NFS4_OK:
            ACCESS4resok   resok4;
    default:
            void;
   };

18.1.3.  DESCRIPTION

ACCESS determines the access rights that a user, as identified by the
credentials in the RPC request, has with respect to the file system
object specified by the current filehandle.  The client encodes the
set of access rights that are to be checked in the bit mask "access".
The server checks the permissions encoded in the bit mask.  If a
status of NFS4_OK is returned, two bit masks are included in the
response.  The first, "supported", represents the access rights that
the server can verify reliably.  The second, "access", represents the
access rights available to the user for the filehandle provided.  On
success, the current filehandle retains its value.
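The two reply bit masks can be modeled as a minimal sketch, using the
bit values from the ARGUMENTS section above; the inputs representing
what the server can verify and what the user actually holds are
hypothetical placeholders.

```python
# Sketch of the ACCESS bit-mask handling described above.  The bit
# values follow the ARGUMENTS section; 'verifiable' and 'granted' are
# hypothetical stand-ins for server-side permission sources.

ACCESS4_READ    = 0x00000001
ACCESS4_LOOKUP  = 0x00000002
ACCESS4_MODIFY  = 0x00000004
ACCESS4_EXTEND  = 0x00000008
ACCESS4_DELETE  = 0x00000010
ACCESS4_EXECUTE = 0x00000020

def access_reply(requested, verifiable, granted):
    """Build the (supported, access) reply masks.

    'verifiable' is the set of rights this server can check reliably,
    'granted' the rights the user actually holds.  Both reply fields
    are limited to the bits the client asked about.
    """
    supported = requested & verifiable
    access = supported & granted   # access is a subset of supported
    return supported, access
```

For example, a UNIX-like server that cannot reliably check
ACCESS4_DELETE on non-directory objects would simply leave that bit
out of `verifiable`, so it never appears in the reply's supported
field.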
Note that the reply's supported and access fields MUST NOT contain
more values than originally set in the request's access field.  For
example, if the client sends an ACCESS operation with just the
ACCESS4_READ value set and the server supports this value, the server
MUST NOT set more than ACCESS4_READ in the supported field even if it
could have reliably checked other values.

The reply's access field MUST NOT contain more values than the
supported field.

The results of this operation are necessarily advisory in nature.  A
return status of NFS4_OK and the appropriate bit set in the bit mask
do not imply that such access will be allowed to the file system
object in the future.  This is because access rights can be revoked
by the server at any time.

The following access permissions may be requested:

ACCESS4_READ     Read data from file or read a directory.

ACCESS4_LOOKUP   Look up a name in a directory (no meaning for non-
                 directory objects).

ACCESS4_MODIFY   Rewrite existing file data or modify existing
                 directory entries.

ACCESS4_EXTEND   Write new data or add directory entries.

ACCESS4_DELETE   Delete an existing directory entry.

ACCESS4_EXECUTE  Execute a regular file (no meaning for a directory).

On success, the current filehandle retains its value.

ACCESS4_EXECUTE is a challenging semantic to implement because NFS
provides remote file access, not remote execution.  This leads to the
following:

o  Whether or not a regular file is executable ought to be the
   responsibility of the NFS client and not the server.  And yet the
   ACCESS operation is specified to seemingly require a server to own
   that responsibility.

o  When a client executes a regular file, it has to read the file
   from the server.  Strictly speaking, the server should not allow
   the client to read a file being executed unless the user has read
   permissions on the file.  Requiring explicit read permissions on
   executable files in order to access them over NFS is not going to
   be acceptable to some users and storage administrators.
   Historically, NFS servers have allowed a user to READ a file if
   the user has execute access to the file.

As a practical example, the UNIX specification [59] states that an
implementation claiming conformance to UNIX may indicate in the
access() programming interface's result that a privileged user has
execute rights, even if no execute permission bits are set on the
regular file's attributes.  It is possible to claim conformance to
the UNIX specification and instead not indicate execute rights in
that situation, which is true for some operating environments.
Suppose the operating environments of the client and server are
implementing the access() semantics for privileged users differently,
and the ACCESS operation implementations of the client and server
follow their respective access() semantics.  This can cause undesired
behavior:

o  Suppose the client's access() interface returns X_OK if the user
   is privileged and no execute permission bits are set on the
   regular file's attribute, and the server's access() interface does
   not return X_OK in that situation.  Then the client will be unable
   to execute files stored on the NFS server that could be executed
   if stored on a non-NFS file system.

o  Suppose the client's access() interface does not return X_OK if
   the user is privileged and no execute permission bits are set on
   the regular file's attribute, and the server's access() interface
   does return X_OK in that situation.  Then:

   *  The client will be able to execute files stored on the NFS
      server that could be executed if stored on a non-NFS file
      system, unless the client's execution subsystem also checks for
      execute permission bits.

   *  Even if the execution subsystem is checking for execute
      permission bits, there are more potential issues.  For example,
      suppose the client is invoking access() to build a "path search
      table" of all executable files in the user's "search path",
      where the path is a list of directories, each containing
      executable files.  Suppose there are two files, each in a
      separate directory of the search path, with the same component
      name.  In the first directory the file has no execute
      permission bits set, and in the second directory the file has
      execute bits set.  The path search table will indicate that the
      first directory has the executable file, but the execute
      subsystem will fail to execute it.  The command shell might
      fail to try the second file in the second directory.  And even
      if it did, this is a potential performance issue.  Clearly, the
      desired outcome for the client is for the path search table to
      not contain the first file.

To deal with the problems described above, the "smart client, stupid
server" principle is used.  The client owns overall responsibility
for determining execute access and relies on the server to parse the
execution permissions within the file's mode, acl, and dacl
attributes.  The rules for the client and server follow:

o  If the client is sending ACCESS in order to determine if the user
   can read the file, the client SHOULD set ACCESS4_READ in the
   request's access field.
o  If the client's operating environment only grants execution to the
   user if the user has execute access according to the execute
   permissions in the mode, acl, and dacl attributes, then if the
   client wants to determine execute access, the client SHOULD send
   an ACCESS request with the ACCESS4_EXECUTE bit set in the
   request's access field.

o  If the client's operating environment grants execution to the user
   even if the user does not have execute access according to the
   execute permissions in the mode, acl, and dacl attributes, then if
   the client wants to determine execute access, it SHOULD send an
   ACCESS request with both the ACCESS4_EXECUTE and ACCESS4_READ bits
   set in the request's access field.  This way, if any read or
   execute permission grants the user read or execute access (or if
   the server interprets the user as privileged), as indicated by the
   presence of ACCESS4_EXECUTE and/or ACCESS4_READ in the reply's
   access field, the client will be able to grant the user execute
   access to the file.

o  If the server supports execute permission bits, or some other
   method for denoting executability (e.g., the suffix of the name of
   the file might indicate execute), it MUST check only execute
   permissions, not read permissions, when determining whether or not
   the reply will have ACCESS4_EXECUTE set in the access field.  The
   server MUST NOT also examine read permission bits when determining
   whether or not the reply will have ACCESS4_EXECUTE set in the
   access field.  Even if the server's operating environment would
   grant execute access to the user (e.g., the user is privileged),
   the server MUST NOT reply with ACCESS4_EXECUTE set in the reply's
   access field unless there is at least one execute permission bit
   set in the mode, acl, or dacl attributes.  In the case of acl and
   dacl, the "one execute permission bit" MUST be an ACE4_EXECUTE bit
   set in an ALLOW ACE.

o  If the server does not support execute permission bits or some
   other method for denoting executability, it MUST NOT set
   ACCESS4_EXECUTE in the reply's supported and access fields.  If
   the client set ACCESS4_EXECUTE in the ACCESS request's access
   field, and ACCESS4_EXECUTE is not set in the reply's supported
   field, then the client will have to send an ACCESS request with
   the ACCESS4_READ bit set in the request's access field.

o  If the server supports read permission bits, it MUST only check
   for read permissions in the mode, acl, and dacl attributes when it
   receives an ACCESS request with ACCESS4_READ set in the access
   field.  The server MUST NOT also examine execute permission bits
   when determining whether the reply will have ACCESS4_READ set in
   the access field or not.

Note that if the ACCESS reply has ACCESS4_READ or ACCESS4_EXECUTE
set, then the user also has permissions to OPEN (Section 18.16) or
READ (Section 18.22) the file.  In other words, if the client sends
an ACCESS request with the ACCESS4_READ and ACCESS4_EXECUTE bits set
in the access field (or two separate requests, one with ACCESS4_READ
set and the other with ACCESS4_EXECUTE set), and the reply has just
ACCESS4_EXECUTE set in the access field (or just one reply has
ACCESS4_EXECUTE set), then the user has authorization to OPEN or READ
the file.

18.1.4.  IMPLEMENTATION

In general, it is not sufficient for the client to attempt to deduce
access permissions by inspecting the uid, gid, and mode fields in the
file attributes or by attempting to interpret the contents of the ACL
attribute.  This is because the server may perform uid or gid mapping
or enforce additional access-control restrictions.  It is also
possible that the server may not be in the same ID space as the
client.  In these cases (and perhaps others), the client cannot
reliably perform an access check with only current file attributes.

In the NFSv2 protocol, the only reliable way to determine whether an
operation was allowed was to try it and see if it succeeded or
failed.  Using the ACCESS operation in the NFSv4.1 protocol, the
client can ask the server to indicate whether or not one or more
classes of operations are permitted.  The ACCESS operation is
provided to allow clients to check before doing a series of
operations that will result in an access failure.  The OPEN operation
provides a point where the server can verify access to the file
object and a method to return that information to the client.  The
ACCESS operation is still useful for directory operations or for use
in the case that the UNIX interface access() is used on the client.

The information returned by the server in response to an ACCESS call
is not permanent.  It was correct at the exact time that the server
performed the checks, but not necessarily afterwards.  The server can
revoke access permission at any time.

The client should use the effective credentials of the user to build
the authentication information in the ACCESS request used to
determine access rights.  It is the effective user and group
credentials that are used in subsequent READ and WRITE operations.

Many implementations do not directly support the ACCESS4_DELETE
permission.  Operating systems like UNIX will ignore the
ACCESS4_DELETE bit if set on an access request on a non-directory
object.  In these systems, delete permission on a file is determined
by the access permissions on the directory in which the file resides,
instead of being determined by the permissions of the file itself.
Therefore, the mask returned enumerating which access rights can be
determined will have the ACCESS4_DELETE value set to 0.  This
indicates to the client that the server was unable to check that
particular access right.  The ACCESS4_DELETE bit in the access mask
returned will then be ignored by the client.

18.2.  Operation 4: CLOSE - Close File

18.2.1.  ARGUMENTS

   struct CLOSE4args {
           /* CURRENT_FH: object */
           seqid4          seqid;
           stateid4        open_stateid;
   };

18.2.2.  RESULTS

   union CLOSE4res switch (nfsstat4 status) {
    case NFS4_OK:
            stateid4       open_stateid;
    default:
            void;
   };

18.2.3.  DESCRIPTION

The CLOSE operation releases share reservations for the regular or
named attribute file as specified by the current filehandle.  The
share reservations and other state information released at the server
as a result of this CLOSE are only those associated with the supplied
stateid.  State associated with other OPENs is not affected.

If byte-range locks are held, the client SHOULD release all locks
before sending a CLOSE.  The server MAY free all outstanding locks on
CLOSE, but some servers may not support the CLOSE of a file that
still has byte-range locks held.  The server MUST return failure if
any locks would exist after the CLOSE.

The argument seqid MAY have any value, and the server MUST ignore
seqid.

On success, the current filehandle retains its value.

The server MAY require that the combination of principal, security
flavor, and, if applicable, GSS mechanism that sent the OPEN request
also be the one to CLOSE the file.  This might not be possible if
credentials for the principal are no longer available.  The server
MAY allow the machine credential or SSV credential (see
Section 18.35) to send CLOSE.

18.2.4.  IMPLEMENTATION

Even though CLOSE returns a stateid, this stateid is not useful to
the client and should be treated as deprecated.  CLOSE "shuts down"
the state associated with all OPENs for the file by a single open-
owner.  As noted above, CLOSE will either release all file-locking
state or return an error.  Therefore, the stateid returned by CLOSE
is not useful for operations that follow.  To help find any uses of
this stateid by clients, the server SHOULD return the invalid special
stateid (the "other" value is zero and the "seqid" field is
NFS4_UINT32_MAX, see Section 8.2.3).

A CLOSE operation may make delegations grantable where they were not
previously.  Servers may choose to respond immediately if there are
pending delegation want requests or may respond to the situation at a
later time.

18.3.  Operation 5: COMMIT - Commit Cached Data

18.3.1.  ARGUMENTS

   struct COMMIT4args {
           /* CURRENT_FH: file */
           offset4         offset;
           count4          count;
   };

18.3.2.  RESULTS

   struct COMMIT4resok {
           verifier4       writeverf;
   };

   union COMMIT4res switch (nfsstat4 status) {
    case NFS4_OK:
            COMMIT4resok   resok4;
    default:
            void;
   };

18.3.3.  DESCRIPTION

The COMMIT operation forces or flushes uncommitted, modified data to
stable storage for the file specified by the current filehandle.  The
flushed data is that which was previously written with one or more
WRITE operations that had the "committed" field of their results
field set to UNSTABLE4.

The offset specifies the position within the file where the flush is
to begin.  An offset value of zero means to flush data starting at
the beginning of the file.  The count specifies the number of bytes
of data to flush.  If the count is zero, a flush from the offset to
the end of the file is done.
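The offset/count convention just described (a count of zero meaning
from the offset through end of file) can be sketched as follows; the
helper name and the end-of-file parameter are hypothetical.

```python
# Sketch of the COMMIT offset/count semantics described above.  'eof'
# is a hypothetical stand-in for the current file length; the helper
# returns the half-open byte range [start, end) that COMMIT flushes.

def commit_range(offset, count, eof):
    """Byte range a COMMIT with the given arguments covers."""
    if count == 0:
        return offset, eof      # count of zero: flush through end of file
    return offset, offset + count
```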
The server returns a write verifier upon successful completion of the
COMMIT.  The write verifier is used by the client to determine if the
server has restarted between the initial WRITE operations and the
COMMIT.  The client does this by comparing the write verifier
returned from the initial WRITE operations and the verifier returned
by the COMMIT operation.  The server must vary the value of the write
verifier at each server event or instantiation that may lead to a
loss of uncommitted data.  Most commonly this occurs when the server
is restarted; however, other events at the server may result in
uncommitted data loss as well.

On success, the current filehandle retains its value.

18.3.4.  IMPLEMENTATION

The COMMIT operation is similar in operation and semantics to the
POSIX fsync() [22] system interface that synchronizes a file's state
with the disk (file data and metadata are flushed to disk or stable
storage).  COMMIT performs the same operation for a client, flushing
any unsynchronized data and metadata on the server to the server's
disk or stable storage for the specified file.  Like fsync(), it may
be that there is some modified data or no modified data to
synchronize.  The data may have been synchronized by the server's
normal periodic buffer synchronization activity.  COMMIT should
return NFS4_OK, unless there has been an unexpected error.

COMMIT differs from fsync() in that it is possible for the client to
flush a range of the file (most likely triggered by a buffer-
reclamation scheme on the client before the file has been completely
written).

The server implementation of COMMIT is reasonably simple.  If the
server receives a full file COMMIT request, that is, one starting at
offset zero with count zero, it should do the equivalent of applying
fsync() to the entire file.  Otherwise, it should arrange for the
modified data in the range specified by offset and count to be
flushed to stable storage.  In both cases, any metadata associated
with the file must be flushed to stable storage before returning.  It
is not an error for there to be nothing to flush on the server.  This
means that the data and metadata that needed to be flushed have
already been flushed or lost during the last server failure.

The client implementation of COMMIT is a little more complex.  There
are two reasons for wanting to commit a client buffer to stable
storage.  The first is that the client wants to reuse a buffer.  In
this case, the offset and count of the buffer are sent to the server
in the COMMIT request.  The server then flushes any modified data
based on the offset and count, and flushes any modified metadata
associated with the file.  It then returns the status of the flush
and the write verifier.  The second reason for the client to generate
a COMMIT is for a full file flush, such as may be done at close.  In
this case, the client would gather all of the buffers for this file
that contain uncommitted data, do the COMMIT operation with an offset
of zero and count of zero, and then free all of those buffers.  Any
other dirty buffers would be sent to the server in the normal
fashion.

After a buffer is written (via the WRITE operation) by the client
with the "committed" field in the result of WRITE set to UNSTABLE4,
the buffer must be considered as modified by the client until the
buffer has either been flushed via a COMMIT operation or written via
a WRITE operation with the "committed" field in the result set to
FILE_SYNC4 or DATA_SYNC4.  This is done to prevent the buffer from
being freed and reused before the data can be flushed to stable
storage on the server.
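The verifier comparison that decides whether uncommitted buffers must
be sent again can be modeled with a small client-side sketch; the
helper and its buffer bookkeeping are hypothetical, not part of the
protocol.

```python
# Sketch of the client-side write-verifier check described in the
# DESCRIPTION above: if COMMIT returns a verifier different from the
# one the earlier UNSTABLE4 WRITEs returned, the server may have
# restarted and lost the uncommitted data, so those buffers must be
# retransmitted.  The buffer model here is a hypothetical stand-in.

def buffers_to_retransmit(write_verifier, commit_verifier, uncommitted):
    """Return the buffers the client must send again after a COMMIT."""
    if commit_verifier != write_verifier:
        return list(uncommitted)   # server state lost; resend everything
    return []                      # verifier stable; data is now durable
```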
When a response is returned from either a WRITE or a COMMIT operation and it contains a write verifier that differs from that previously returned by the server, the client will need to retransmit all of the buffers containing uncommitted data to the server. How this is to be done is up to the implementor. If there is only one buffer of interest, then it should be sent in a WRITE request with the FILE_SYNC4 stable parameter. If there is more than one buffer, it might be worthwhile retransmitting all of the buffers in WRITE operations with the stable parameter set to UNSTABLE4 and then retransmitting the COMMIT operation to flush all of the data on the server to stable storage. However, if the server repeatedly returns from COMMIT a verifier that differs from that returned by WRITE, the only way to ensure progress is to retransmit all of the buffers with WRITE requests with the FILE_SYNC4 stable parameter.

The above description applies to page-cache-based systems as well as buffer-cache-based systems. In the former systems, the virtual memory system will need to be modified instead of the buffer cache.

18.4. Operation 6: CREATE - Create a Non-Regular File Object

18.4.1. ARGUMENTS

   union createtype4 switch (nfs_ftype4 type) {
   case NF4LNK:
           linktext4 linkdata;
   case NF4BLK:
   case NF4CHR:
           specdata4 devdata;
   case NF4SOCK:
   case NF4FIFO:
   case NF4DIR:
           void;
   default:
           void;  /* server should return NFS4ERR_BADTYPE */
   };

   struct CREATE4args {
           /* CURRENT_FH: directory for creation */
           createtype4     objtype;
           component4      objname;
           fattr4          createattrs;
   };

18.4.2.
RESULTS

   struct CREATE4resok {
           change_info4    cinfo;
           bitmap4         attrset;        /* attributes set */
   };

   union CREATE4res switch (nfsstat4 status) {
   case NFS4_OK:
           /* new CURRENTFH: created object */
           CREATE4resok resok4;
   default:
           void;
   };

18.4.3. DESCRIPTION

The CREATE operation creates a file object other than an ordinary file in a directory with a given name. The OPEN operation MUST be used to create a regular file or a named attribute.

The current filehandle must be a directory: an object of type NF4DIR. If the current filehandle is an attribute directory (type NF4ATTRDIR), the error NFS4ERR_WRONG_TYPE is returned. If the current filehandle designates any other type of object, the error NFS4ERR_NOTDIR results.

The objname specifies the name for the new object. The objtype determines the type of object to be created: directory, symlink, etc. If the object type specified is that of an ordinary file, a named attribute, or a named attribute directory, the error NFS4ERR_BADTYPE results.

If an object of the same name already exists in the directory, the server will return the error NFS4ERR_EXIST.

For the directory where the new file object was created, the server returns change_info4 information in cinfo. With the atomic field of the change_info4 data type, the server will indicate if the before and after change attributes were obtained atomically with respect to the file object creation.

If the objname has a length of zero, or if objname does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.

The current filehandle is replaced by that of the new object.

The createattrs specifies the initial set of attributes for the object. The set of attributes may include any writable attribute valid for the object type.
When the operation is successful, the server will return to the client an attribute mask signifying which attributes were successfully set for the object.

If createattrs includes neither the owner attribute nor an ACL with an ACE for the owner, and if the server's file system both supports and requires an owner attribute (or an owner ACE), then the server MUST derive the owner (or the owner ACE). This would typically be from the principal indicated in the RPC credentials of the call, but the server's operating environment or file system semantics may dictate other methods of derivation. Similarly, if createattrs includes neither the group attribute nor a group ACE, and if the server's file system both supports and requires the notion of a group attribute (or group ACE), the server MUST derive the group attribute (or the corresponding group ACE) for the file. This could be from the RPC call's credentials, such as the group principal if the credentials include it (such as with AUTH_SYS), from the group identifier associated with the principal in the credentials (e.g., POSIX systems have a user database [23] that has a group identifier for every user identifier), inherited from the directory in which the object is created, or whatever else the server's operating environment or file system semantics dictate. This applies to the OPEN operation too.

Conversely, it is possible that the client will specify in createattrs an owner attribute, group attribute, or ACL for which the principal indicated in the RPC call's credentials does not have permission to create files. The error to be returned in this instance is NFS4ERR_PERM. This applies to the OPEN operation too.
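One way a server might apply the owner/group derivation rules above can be sketched as follows. This is an illustrative sketch under assumed names — the function, its parameters, and the attribute keys are ours, not protocol-defined; actual derivation policy is left to the server's operating environment.

```python
# Illustrative sketch of the CREATE owner/group derivation rules.
# All names here are ours (hypothetical), not part of the protocol.

def derive_create_attrs(createattrs, cred_owner, cred_group, dir_group=None):
    """Fill in owner/group when the client omitted them from createattrs.

    createattrs: dict of attributes supplied by the client
    cred_owner/cred_group: principal and group from the RPC credentials
    dir_group: group of the parent directory (BSD-style inheritance),
               used here only when present
    """
    attrs = dict(createattrs)
    if "owner" not in attrs:
        # The server MUST derive the owner, typically from the credentials.
        attrs["owner"] = cred_owner
    if "owner_group" not in attrs:
        # Group may come from the credentials or be inherited from the
        # parent directory, per server semantics.
        attrs["owner_group"] = dir_group if dir_group is not None else cred_group
    return attrs
```

Attributes the client did supply are left untouched; whether the principal may actually set them is a separate permission check (NFS4ERR_PERM).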
If the current filehandle designates a directory for which another client holds a directory delegation, then, unless the delegation is such that the situation can be resolved by sending a notification, the delegation MUST be recalled, and the CREATE operation MUST NOT proceed until the delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while delegation remains outstanding.

When the current filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, NOTIFY4_ADD_ENTRY will be generated as a result of this operation.

If the capability FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set (Section 14.4), and a symbolic link is being created, then the content of the symbolic link MUST be in UTF-8 encoding.

18.4.4. IMPLEMENTATION

If the client desires to set attribute values after the create, a SETATTR operation can be added to the COMPOUND request so that the appropriate attributes will be set.

18.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery

18.5.1. ARGUMENTS

   struct DELEGPURGE4args {
           clientid4       clientid;
   };

18.5.2. RESULTS

   struct DELEGPURGE4res {
           nfsstat4        status;
   };

18.5.3. DESCRIPTION

This operation purges all of the delegations awaiting recovery for a given client. This is useful for clients that do not commit delegation information to stable storage to indicate that conflicting requests need not be delayed by the server awaiting recovery of delegation information.

The client is NOT specified by the clientid field of the request. The client SHOULD set the clientid field to zero, and the server MUST ignore the clientid field.
Instead, the server MUST derive the client ID from the value of the session ID in the arguments of the SEQUENCE operation that precedes DELEGPURGE in the COMPOUND request.

The DELEGPURGE operation should be used by clients that record delegation information on stable storage on the client. In this case, after the client recovers all delegations it knows of, it should immediately send a DELEGPURGE operation. Doing so will notify the server that no additional delegations for the client will be recovered, allowing it to free resources and avoid delaying other clients that make requests conflicting with the unrecovered delegations. The set of delegations known to the server and the client might be different. The reason for this is that after sending a request that resulted in a delegation, the client might experience a failure before it both received the delegation and committed the delegation to the client's stable storage.

The server MAY support DELEGPURGE, but if it does not, it MUST NOT support CLAIM_DELEGATE_PREV and MUST NOT support CLAIM_DELEG_PREV_FH.

18.6. Operation 8: DELEGRETURN - Return Delegation

18.6.1. ARGUMENTS

   struct DELEGRETURN4args {
           /* CURRENT_FH: delegated object */
           stateid4        deleg_stateid;
   };

18.6.2. RESULTS

   struct DELEGRETURN4res {
           nfsstat4        status;
   };

18.6.3. DESCRIPTION

The DELEGRETURN operation returns the delegation represented by the current filehandle and stateid.

Delegations may be returned voluntarily (i.e., before the server has recalled them) or when recalled. In either case, the client must properly propagate state changed under the context of the delegation to the server before returning the delegation.
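The obligation above — push any state changed under the delegation to the server before returning it — can be sketched as follows. This is an illustrative sketch; the function, the send_ops callback, and the dict keys are our own assumptions, not protocol elements.

```python
# Illustrative sketch (names are ours): before DELEGRETURN, the client
# must propagate to the server any state it changed under the delegation.

def return_delegation(deleg, send_ops):
    """Flush locally cached changes, then return the delegation.

    deleg: dict with 'stateid' plus any locally cached state
    send_ops(op, arg): hypothetical transport hook sending one operation
    """
    for write in deleg.get("cached_writes", []):
        send_ops("WRITE", write)        # propagate cached file data
    for lock in deleg.get("local_locks", []):
        send_ops("LOCK", lock)          # propagate locally granted locks
    send_ops("DELEGRETURN", deleg["stateid"])  # only now give it back
```

The ordering is the point: DELEGRETURN goes last, after every piece of locally handled state has been made visible to the server.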
The server MAY require that the combination of principal, security flavor, and (if applicable) GSS mechanism that acquired the delegation also be the one to send DELEGRETURN on the file. This might not be possible if credentials for the principal are no longer available. The server MAY allow the machine credential or SSV credential (see Section 18.35) to send DELEGRETURN.

18.7. Operation 9: GETATTR - Get Attributes

18.7.1. ARGUMENTS

   struct GETATTR4args {
           /* CURRENT_FH: object */
           bitmap4         attr_request;
   };

18.7.2. RESULTS

   struct GETATTR4resok {
           fattr4          obj_attributes;
   };

   union GETATTR4res switch (nfsstat4 status) {
   case NFS4_OK:
           GETATTR4resok  resok4;
   default:
           void;
   };

18.7.3. DESCRIPTION

The GETATTR operation will obtain attributes for the file system object specified by the current filehandle. The client sets a bit in the bitmap argument for each attribute value that it would like the server to return. The server returns an attribute bitmap that indicates the attribute values that it was able to return, which will include all attributes requested by the client that are attributes supported by the server for the target file system. This bitmap is followed by the attribute values ordered lowest attribute number first.

The server MUST return a value for each attribute that the client requests if the attribute is supported by the server for the target file system. If the server does not support a particular attribute on the target file system, then it MUST NOT return the attribute value and MUST NOT set the attribute bit in the result bitmap. The server MUST return an error if it supports an attribute on the target but cannot obtain its value. In that case, no attribute values will be returned.
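The bitmap handling described above can be sketched as follows; a minimal sketch assuming bitmap4 is an array of 32-bit words with attribute number n occupying bit (n mod 32) of word (n div 32). The function names are ours.

```python
# Illustrative sketch of bitmap4 handling for GETATTR: attribute numbers
# are packed into 32-bit words, lowest attribute number in word 0, bit 0.

def encode_bitmap(attr_numbers):
    """Pack attribute numbers into a list of 32-bit words (bitmap4)."""
    words = []
    for n in attr_numbers:
        word, bit = n // 32, n % 32
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words

def result_bitmap(requested, supported):
    """Server side: set bits only for requested attributes it supports."""
    size = max(len(requested), len(supported))
    req = requested + [0] * (size - len(requested))
    sup = supported + [0] * (size - len(supported))
    return [r & s for r, s in zip(req, sup)]
```

The result bitmap is simply the intersection of what was requested and what the server supports for the target file system, which matches the MUST NOT rules above.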
File systems that are absent should be treated as having support for a very small set of attributes as described in Section 11.4.1, even if previously, when the file system was present, more attributes were supported.

All servers MUST support the REQUIRED attributes as specified in Section 5.6, for all file systems, with the exception of absent file systems.

On success, the current filehandle retains its value.

18.7.4. IMPLEMENTATION

Suppose there is an OPEN_DELEGATE_WRITE delegation held by another client for the file in question and size and/or change are among the set of attributes being interrogated. The server has two choices. First, the server can obtain the actual current value of these attributes from the client holding the delegation by using the CB_GETATTR callback. Second, the server, particularly when the delegated client is unresponsive, can recall the delegation in question. The GETATTR MUST NOT proceed until one of the following occurs:

o  The requested attribute values are returned in the response to CB_GETATTR.

o  The OPEN_DELEGATE_WRITE delegation is returned.

o  The OPEN_DELEGATE_WRITE delegation is revoked.

Unless one of the above happens very quickly, one or more NFS4ERR_DELAY errors will be returned while a delegation is outstanding.

18.8. Operation 10: GETFH - Get Current Filehandle

18.8.1. ARGUMENTS

   /* CURRENT_FH: */
   void;

18.8.2. RESULTS

   struct GETFH4resok {
           nfs_fh4         object;
   };

   union GETFH4res switch (nfsstat4 status) {
   case NFS4_OK:
           GETFH4resok     resok4;
   default:
           void;
   };

18.8.3. DESCRIPTION

This operation returns the current filehandle value.

On success, the current filehandle retains its value.
As described in Section 2.10.6.4, GETFH is REQUIRED or RECOMMENDED to immediately follow certain operations, and servers are free to reject such operations if the client fails to insert GETFH in the request as REQUIRED or RECOMMENDED. Section 18.16.4.1 provides additional justification for why GETFH MUST follow OPEN.

18.8.4. IMPLEMENTATION

Operations that change the current filehandle, like LOOKUP or CREATE, do not automatically return the new filehandle as a result. For instance, if a client needs to look up a directory entry and obtain its filehandle, then the following request is needed.

   PUTFH  (directory filehandle)
   LOOKUP (entry name)
   GETFH

18.9. Operation 11: LINK - Create Link to a File

18.9.1. ARGUMENTS

   struct LINK4args {
           /* SAVED_FH: source object */
           /* CURRENT_FH: target directory */
           component4      newname;
   };

18.9.2. RESULTS

   struct LINK4resok {
           change_info4    cinfo;
   };

   union LINK4res switch (nfsstat4 status) {
   case NFS4_OK:
           LINK4resok resok4;
   default:
           void;
   };

18.9.3. DESCRIPTION

The LINK operation creates an additional newname for the file represented by the saved filehandle, as set by the SAVEFH operation, in the directory represented by the current filehandle. The existing file and the target directory must reside within the same file system on the server. On success, the current filehandle will continue to be the target directory. If an object exists in the target directory with the same name as newname, the server must return NFS4ERR_EXIST.

For the target directory, the server returns change_info4 information in cinfo. With the atomic field of the change_info4 data type, the server will indicate if the before and after change attributes were obtained atomically with respect to the link creation.
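One common client use of the change_info4 result just described can be sketched as follows: if the before/after pair was obtained atomically and the "before" value matches the client's cached change attribute for the directory, the cache can be rolled forward instead of invalidated. This is an illustrative sketch; the function and cache layout are our own assumptions.

```python
# Illustrative sketch (not from the spec): using change_info4 after a
# directory-modifying operation such as LINK to keep a directory cache.

def update_dir_cache(cache, cinfo_atomic, before, after):
    """cache maps 'change' -> the client's cached change attribute."""
    if cinfo_atomic and cache.get("change") == before:
        # No other modification slipped in between before and after;
        # roll the cached change attribute forward.
        cache["change"] = after
        return True
    cache.clear()  # a concurrent change was possible; invalidate
    return False
```

When atomic is FALSE, or the "before" value does not match, the client cannot tell whether another modification intervened and must revalidate the directory.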
If the newname has a length of zero, or if newname does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.

18.9.4. IMPLEMENTATION

The server MAY impose restrictions on the LINK operation such that LINK may not be done when the file is open or when that open is done by particular protocols, or with particular options or access modes. When LINK is rejected because of such restrictions, the error NFS4ERR_FILE_OPEN is returned.

If a server does implement such restrictions and those restrictions include cases of NFSv4 opens preventing successful execution of a link, the server needs to recall any delegations that could hide the existence of opens relevant to that decision. The reason is that when a client holds a delegation, the server might not have an accurate account of the opens for that client, since the client may execute OPENs and CLOSEs locally. The LINK operation must be delayed only until a definitive result can be obtained. For example, suppose there are multiple delegations and one of them establishes an open whose presence would prevent the link. Given the server's semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon as that delegation is returned without waiting for other delegations to be returned. Similarly, if such opens are not associated with delegations, NFS4ERR_FILE_OPEN can be returned immediately with no delegation recall being done.

If the current filehandle designates a directory for which another client holds a directory delegation, then, unless the delegation is such that the situation can be resolved by sending a notification, the delegation MUST be recalled, and the operation cannot be performed successfully until the delegation is returned or revoked.
Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while delegation remains outstanding.

When the current filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, instead of a recall, NOTIFY4_ADD_ENTRY will be generated as a result of the LINK operation.

If the current file system supports the numlinks attribute, and other clients have delegations to the file being linked, then those delegations MUST be recalled and the LINK operation MUST NOT proceed until all delegations are returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while delegation remains outstanding.

Changes to any property of the "hard" linked files are reflected in all of the linked files. When a link is made to a file, the attributes for the file should have a value for numlinks that is one greater than the value before the LINK operation.

The statement "file and the target directory must reside within the same file system on the server" means that the fsid fields in the attributes for the objects are the same. If they reside on different file systems, the error NFS4ERR_XDEV is returned. This error may be returned by some servers when there is an internal partitioning of a file system that the LINK operation would violate.

On some servers, "." and ".." are illegal values for newname and the error NFS4ERR_BADNAME will be returned if they are specified.

When the current filehandle designates a named attribute directory and the object to be linked (the saved filehandle) is not a named attribute for the same object, the error NFS4ERR_XDEV MUST be returned.
When the saved filehandle designates a named attribute and the current filehandle is not the appropriate named attribute directory, the error NFS4ERR_XDEV MUST also be returned.

When the current filehandle designates a named attribute directory and the object to be linked (the saved filehandle) is a named attribute within that directory, the server may return the error NFS4ERR_NOTSUPP.

In the case that newname is already linked to the file represented by the saved filehandle, the server will return NFS4ERR_EXIST.

Note that symbolic links are created with the CREATE operation.

18.10. Operation 12: LOCK - Create Lock

18.10.1. ARGUMENTS

   /*
    * For LOCK, transition from open_stateid and lock_owner
    * to a lock stateid.
    */
   struct open_to_lock_owner4 {
           seqid4          open_seqid;
           stateid4        open_stateid;
           seqid4          lock_seqid;
           lock_owner4     lock_owner;
   };

   /*
    * For LOCK, existing lock stateid continues to request new
    * file lock for the same lock_owner and open_stateid.
    */
   struct exist_lock_owner4 {
           stateid4        lock_stateid;
           seqid4          lock_seqid;
   };

   union locker4 switch (bool new_lock_owner) {
   case TRUE:
           open_to_lock_owner4     open_owner;
   case FALSE:
           exist_lock_owner4       lock_owner;
   };

   /*
    * LOCK/LOCKT/LOCKU: Record lock management
    */
   struct LOCK4args {
           /* CURRENT_FH: file */
           nfs_lock_type4  locktype;
           bool            reclaim;
           offset4         offset;
           length4         length;
           locker4         locker;
   };

18.10.2.
RESULTS

   struct LOCK4denied {
           offset4         offset;
           length4         length;
           nfs_lock_type4  locktype;
           lock_owner4     owner;
   };

   struct LOCK4resok {
           stateid4        lock_stateid;
   };

   union LOCK4res switch (nfsstat4 status) {
   case NFS4_OK:
           LOCK4resok     resok4;
   case NFS4ERR_DENIED:
           LOCK4denied    denied;
   default:
           void;
   };

18.10.3. DESCRIPTION

The LOCK operation requests a byte-range lock for the byte-range specified by the offset and length parameters, and lock type specified in the locktype parameter. If this is a reclaim request, the reclaim parameter will be TRUE.

Bytes in a file may be locked even if those bytes are not currently allocated to the file. To lock the file from a specific offset through the end-of-file (no matter how long the file actually is), use a length field equal to NFS4_UINT64_MAX. The server MUST return NFS4ERR_INVAL under the following combinations of length and offset:

o  Length is equal to zero.

o  Length is not equal to NFS4_UINT64_MAX, and the sum of length and offset exceeds NFS4_UINT64_MAX.

32-bit servers are servers that support locking for byte offsets that fit within 32 bits (i.e., less than or equal to NFS4_UINT32_MAX). If the client specifies a range that overlaps one or more bytes beyond offset NFS4_UINT32_MAX but does not end at offset NFS4_UINT64_MAX, then such a 32-bit server MUST return the error NFS4ERR_BAD_RANGE.

If the server returns NFS4ERR_DENIED, the owner, offset, and length of a conflicting lock are returned.

The locker argument specifies the lock-owner that is associated with the LOCK operation. The locker4 structure is a switched union that indicates whether the client has already created byte-range locking state associated with the current open file and lock-owner.
In the case in which it has, the argument is just a stateid representing the set of locks associated with that open file and lock-owner, together with a lock_seqid value that MAY be any value and MUST be ignored by the server. In the case where no byte-range locking state has been established, or the client does not have the stateid available, the argument contains the stateid of the open file with which this lock is to be associated, together with the lock-owner with which the lock is to be associated. The open_to_lock_owner case covers the very first lock done by a lock-owner for a given open file and offers a method to use the established state of the open_stateid to transition to the use of a lock stateid.

The following fields of the locker parameter MAY be set to any value by the client and MUST be ignored by the server:

o  The clientid field of the lock_owner field of the open_owner field (locker.open_owner.lock_owner.clientid). The reason the server MUST ignore the clientid field is that the server MUST derive the client ID from the session ID from the SEQUENCE operation of the COMPOUND request.

o  The open_seqid and lock_seqid fields of the open_owner field (locker.open_owner.open_seqid and locker.open_owner.lock_seqid).

o  The lock_seqid field of the lock_owner field (locker.lock_owner.lock_seqid).

Note that the client ID appearing in a LOCK4denied structure is the actual client associated with the conflicting lock, whether this is the client ID associated with the current session or a different one. Thus, if the server returns NFS4ERR_DENIED, it MUST set the clientid field of the owner field of the denied field.

If the current filehandle is not an ordinary file, an error will be returned to the client.
In the case that the current filehandle represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.

On success, the current filehandle retains its value.

18.10.4. IMPLEMENTATION

If the server is unable to determine the exact offset and length of the conflicting byte-range lock, the same offset and length that were provided in the arguments should be returned in the denied results.

LOCK operations are subject to permission checks and to checks against the access type of the associated file. However, the specific rights and modes required for various types of locks reflect the semantics of the server-exported file system, and are not specified by the protocol. For example, Windows 2000 allows a write lock of a file open for read access, while a POSIX-compliant system does not.

When the client sends a LOCK operation that corresponds to a range that the lock-owner has locked already (with the same or different lock type), or to a sub-range of such a range, or to a byte-range that includes multiple locks already granted to that lock-owner, in whole or in part, and the server does not support such locking operations (i.e., does not support POSIX locking semantics), the server will return the error NFS4ERR_LOCK_RANGE. In that case, the client may return an error, or it may emulate the required operations, using only LOCK for ranges that do not include any bytes already locked by that lock-owner and LOCKU of locks held by that lock-owner (specifying an exactly matching range and type).
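The offset/length validation rules from the DESCRIPTION in Section 18.10.3 can be sketched as follows. This is an illustrative approximation (the function name is ours); the 32-bit check in particular is a simplified reading of the NFS4ERR_BAD_RANGE rule.

```python
# Illustrative sketch of the LOCK range checks from Section 18.10.3.

NFS4_UINT32_MAX = 2**32 - 1
NFS4_UINT64_MAX = 2**64 - 1

def check_lock_range(offset, length, server_is_32bit=False):
    """Return the error a server must give for this range, or None."""
    if length == 0:
        return "NFS4ERR_INVAL"
    if length != NFS4_UINT64_MAX and offset + length > NFS4_UINT64_MAX:
        return "NFS4ERR_INVAL"
    if server_is_32bit:
        # Range reaches beyond NFS4_UINT32_MAX without ending at
        # offset NFS4_UINT64_MAX (the "lock to EOF" case).
        end = NFS4_UINT64_MAX if length == NFS4_UINT64_MAX \
            else offset + length - 1
        if end > NFS4_UINT32_MAX and end != NFS4_UINT64_MAX:
            return "NFS4ERR_BAD_RANGE"
    return None
```

Note that a length of NFS4_UINT64_MAX (lock through end-of-file) passes both checks, including on a 32-bit server, since such a range ends at offset NFS4_UINT64_MAX.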
Similarly, when the client sends a LOCK operation that amounts to upgrading (changing from a READ_LT lock to a WRITE_LT lock) or downgrading (changing from a WRITE_LT lock to a READ_LT lock) an existing byte-range lock, and the server does not support such a lock, the server will return NFS4ERR_LOCK_NOTSUPP. Such operations may not perfectly reflect the required semantics in the face of conflicting LOCK operations from other clients.

When a client holds an OPEN_DELEGATE_WRITE delegation, the client holding that delegation is assured that there are no opens by other clients. Thus, there can be no conflicting LOCK operations from such clients. Therefore, the client may be handling locking requests locally, without doing LOCK operations on the server. If it does that, it must be prepared to update the lock status on the server, by sending appropriate LOCK and LOCKU operations before returning the delegation.

When one or more clients hold OPEN_DELEGATE_READ delegations, any LOCK operation where the server is implementing mandatory locking semantics MUST result in the recall of all such delegations. The LOCK operation may not be granted until all such delegations are returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding.

18.11. Operation 13: LOCKT - Test for Lock

18.11.1. ARGUMENTS

   struct LOCKT4args {
           /* CURRENT_FH: file */
           nfs_lock_type4  locktype;
           offset4         offset;
           length4         length;
           lock_owner4     owner;
   };

18.11.2. RESULTS

   union LOCKT4res switch (nfsstat4 status) {
   case NFS4ERR_DENIED:
           LOCK4denied    denied;
   case NFS4_OK:
           void;
   default:
           void;
   };

18.11.3.
DESCRIPTION

The LOCKT operation tests the lock as specified in the arguments. If a conflicting lock exists, the owner, offset, length, and type of the conflicting lock are returned. The owner field in the results includes the client ID of the owner of the conflicting lock, whether this is the client ID associated with the current session or a different client ID. If no lock is held, nothing other than NFS4_OK is returned. Lock types READ_LT and READW_LT are processed in the same way in that a conflicting lock test is done without regard to blocking or non-blocking. The same is true for WRITE_LT and WRITEW_LT.

The ranges are specified as for LOCK. The NFS4ERR_INVAL and NFS4ERR_BAD_RANGE errors are returned under the same circumstances as for LOCK.

The clientid field of the owner MAY be set to any value by the client and MUST be ignored by the server. The reason the server MUST ignore the clientid field is that the server MUST derive the client ID from the session ID from the SEQUENCE operation of the COMPOUND request.

If the current filehandle is not an ordinary file, an error will be returned to the client. In the case that the current filehandle represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.

On success, the current filehandle retains its value.

18.11.4. IMPLEMENTATION

If the server is unable to determine the exact offset and length of the conflicting lock, the same offset and length that were provided in the arguments should be returned in the denied results.

LOCKT uses a lock_owner4 rather than a stateid4, as is used in LOCK, to identify the owner.
This is because the client does not have to open the file to test for the existence of a lock, so a stateid might not be available.

As noted in Section 18.10.4, some servers may return NFS4ERR_LOCK_RANGE to certain (otherwise non-conflicting) LOCK operations that overlap ranges already granted to the current lock-owner.

The LOCKT operation's test for conflicting locks SHOULD exclude locks for the current lock-owner, and thus should return NFS4_OK in such cases. Note that this means that a server might return NFS4_OK to a LOCKT request even though a LOCK operation for the same range and lock-owner would fail with NFS4ERR_LOCK_RANGE.

When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose (see Section 18.10.4) to handle LOCK requests locally. In such a case, LOCKT requests will similarly be handled locally.

18.12. Operation 14: LOCKU - Unlock File

18.12.1. ARGUMENTS

   struct LOCKU4args {
           /* CURRENT_FH: file */
           nfs_lock_type4  locktype;
           seqid4          seqid;
           stateid4        lock_stateid;
           offset4         offset;
           length4         length;
   };

18.12.2. RESULTS

   union LOCKU4res switch (nfsstat4 status) {
   case NFS4_OK:
           stateid4       lock_stateid;
   default:
           void;
   };

18.12.3. DESCRIPTION

The LOCKU operation unlocks the byte-range lock specified by the parameters. The client may set the locktype field to any value that is legal for the nfs_lock_type4 enumerated type, and the server MUST accept any legal value for locktype. Any legal value for locktype has no effect on the success or failure of the LOCKU operation.

The ranges are specified as for LOCK. The NFS4ERR_INVAL and NFS4ERR_BAD_RANGE errors are returned under the same circumstances as for LOCK.

The seqid parameter MAY be any value and the server MUST ignore it.
21890 If the current filehandle is not an ordinary file, an error will be 21891 returned to the client. In the case that the current filehandle 21892 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 21893 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 21894 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 21896 On success, the current filehandle retains its value. 21898 The server MAY require that the principal, security flavor, and if 21899 applicable, the GSS mechanism, combination that sent a LOCK operation 21900 also be the one to send LOCKU on the file. This might not be 21901 possible if credentials for the principal are no longer available. 21902 The server MAY allow the machine credential or SSV credential (see 21903 Section 18.35) to send LOCKU. 21905 18.12.4. IMPLEMENTATION 21907 If the area to be unlocked does not correspond exactly to a lock 21908 actually held by the lock-owner, the server may return the error 21909 NFS4ERR_LOCK_RANGE. This includes the case in which the area is not 21910 locked, where the area is a sub-range of the area locked, where it 21911 overlaps the area locked without matching exactly, or the area 21912 specified includes multiple locks held by the lock-owner. In all of 21913 these cases, allowed by POSIX locking [21] semantics, a client 21914 receiving this error should, if it desires support for such 21915 operations, simulate the operation using LOCKU on ranges 21916 corresponding to locks it actually holds, possibly followed by LOCK 21917 operations for the sub-ranges not being unlocked. 21919 When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose 21920 (see Section 18.10.4) to handle LOCK requests locally. In such a 21921 case, LOCKU operations will similarly be handled locally. 21923 18.13. Operation 15: LOOKUP - Lookup Filename 21925 18.13.1. 
ARGUMENTS 21927 struct LOOKUP4args { 21928 /* CURRENT_FH: directory */ 21929 component4 objname; 21930 }; 21932 18.13.2. RESULTS 21934 struct LOOKUP4res { 21935 /* New CURRENT_FH: object */ 21936 nfsstat4 status; 21937 }; 21939 18.13.3. DESCRIPTION 21941 The LOOKUP operation looks up or finds a file system object using the 21942 directory specified by the current filehandle. LOOKUP evaluates the 21943 component and if the object exists, the current filehandle is 21944 replaced with the component's filehandle. 21946 If the component cannot be evaluated either because it does not exist 21947 or because the client does not have permission to evaluate the 21948 component, then an error will be returned and the current filehandle 21949 will be unchanged. 21951 If the component is a zero-length string or if any component does not 21952 obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned. 21954 18.13.4. IMPLEMENTATION 21956 If the client wants to achieve the effect of a multi-component look 21957 up, it may construct a COMPOUND request such as (and obtain each 21958 filehandle): 21960 PUTFH (directory filehandle) 21961 LOOKUP "pub" 21962 GETFH 21963 LOOKUP "foo" 21964 GETFH 21965 LOOKUP "bar" 21966 GETFH 21968 Unlike NFSv3, NFSv4.1 allows LOOKUP requests to cross mountpoints on 21969 the server. The client can detect a mountpoint crossing by comparing 21970 the fsid attribute of the directory with the fsid attribute of the 21971 directory looked up. If the fsids are different, then the new 21972 directory is a server mountpoint. UNIX clients that detect a 21973 mountpoint crossing will need to mount the server's file system. 21974 This needs to be done to maintain the file object identity checking 21975 mechanisms common to UNIX clients. 
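The fsid comparison described above can be sketched as follows. This is an illustrative client-side check, not part of the protocol; the names Fsid and crossed_mountpoint are assumptions for the example.

```python
from collections import namedtuple

# fsid4 is a pair of unsigned 64-bit integers (major, minor)
Fsid = namedtuple("Fsid", ["major", "minor"])

def crossed_mountpoint(parent_fsid, child_fsid):
    """A LOOKUP crossed a server mountpoint iff the fsid of the
    directory looked up differs from that of its parent."""
    return parent_fsid != child_fsid
```

A UNIX client would perform such a check after each LOOKUP and, on a crossing, establish a new local mount for the server file system so that file object identity checks continue to work.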
21977 Servers that limit NFS access to "shared" or "exported" file systems 21978 should provide a pseudo file system into which the exported file 21979 systems can be integrated, so that clients can browse the server's 21980 namespace. The client's view of a pseudo file system will be limited 21981 to paths that lead to exported file systems. 21983 Note: previous versions of the protocol assigned special semantics to 21984 the names "." and "..". NFSv4.1 assigns no special semantics to 21985 these names. The LOOKUPP operation must be used to look up a parent 21986 directory. 21988 Note that this operation does not follow symbolic links. The client 21989 is responsible for all parsing of filenames, including filenames that 21990 are modified by symbolic links encountered during the look up 21991 process. 21993 If the current filehandle supplied is not a directory but a symbolic 21994 link, the error NFS4ERR_SYMLINK is returned. For all 21995 other non-directory file types, the error NFS4ERR_NOTDIR is returned. 21997 18.14. Operation 16: LOOKUPP - Lookup Parent Directory 21999 18.14.1. ARGUMENTS 22001 /* CURRENT_FH: object */ 22002 void; 22004 18.14.2. RESULTS 22006 struct LOOKUPP4res { 22007 /* new CURRENT_FH: parent directory */ 22008 nfsstat4 status; 22009 }; 22011 18.14.3. DESCRIPTION 22013 The current filehandle is assumed to refer to a regular directory or 22014 a named attribute directory. LOOKUPP assigns the filehandle for its 22015 parent directory to be the current filehandle. If there is no parent 22016 directory, an NFS4ERR_NOENT error must be returned. Therefore, 22017 NFS4ERR_NOENT will be returned by the server when the current 22018 filehandle is at the root or top of the server's file tree. 22020 As is the case with LOOKUP, LOOKUPP will also cross mountpoints. 22022 If the current filehandle is not a directory or named attribute 22023 directory, the error NFS4ERR_NOTDIR is returned.
22025 If the requester's security flavor does not match that configured for 22026 the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC 22027 (a future minor revision of NFSv4 may upgrade this to MUST) in the 22028 LOOKUPP response. However, if the server does so, it MUST support 22029 the SECINFO_NO_NAME operation (Section 18.45), so that the client can 22030 gracefully determine the correct security flavor. 22032 If the current filehandle is a named attribute directory that is 22033 associated with a file system object via OPENATTR (i.e., not a sub- 22034 directory of a named attribute directory), LOOKUPP SHOULD return the 22035 filehandle of the associated file system object. 22037 18.14.4. IMPLEMENTATION 22039 An issue to note is upward navigation from named attribute 22040 directories. The named attribute directories are essentially 22041 detached from the namespace, and this property should be safely 22042 represented in the client operating environment. LOOKUPP on a named 22043 attribute directory may return the filehandle of the associated file, 22044 and conveying this to applications might be unsafe as many 22045 applications expect the parent of an object to always be a directory. 22046 Therefore, the client may want to hide the parent of named attribute 22047 directories (represented as ".." in UNIX) or represent the named 22048 attribute directory as its own parent (as is typically done for the 22049 file system root directory in UNIX). 22051 18.15. Operation 17: NVERIFY - Verify Difference in Attributes 22053 18.15.1. ARGUMENTS 22055 struct NVERIFY4args { 22056 /* CURRENT_FH: object */ 22057 fattr4 obj_attributes; 22058 }; 22060 18.15.2. RESULTS 22062 struct NVERIFY4res { 22063 nfsstat4 status; 22064 }; 22066 18.15.3. DESCRIPTION 22068 This operation is used to prefix a sequence of operations to be 22069 performed if one or more attributes have changed on some file system 22070 object. 
If all the attributes match, then the error NFS4ERR_SAME 22071 MUST be returned. 22073 On success, the current filehandle retains its value. 22075 18.15.4. IMPLEMENTATION 22077 This operation is useful as a cache validation operator. If the 22078 object to which the attributes belong has changed, then the following 22079 operations may obtain new data associated with that object, for 22080 instance, to check if a file has been changed and obtain new data if 22081 it has: 22083 SEQUENCE 22084 PUTFH fh 22085 NVERIFY attrbits attrs 22086 READ 0 32767 22088 Contrast this with NFSv3, which would first send a GETATTR in one 22089 request/reply round trip, and then if attributes indicated that the 22090 client's cache was stale, then send a READ in another request/reply 22091 round trip. 22093 In the case that a RECOMMENDED attribute is specified in the NVERIFY 22094 operation and the server does not support that attribute for the file 22095 system object, the error NFS4ERR_ATTRNOTSUPP is returned to the 22096 client. 22098 When the attribute rdattr_error or any set-only attribute (e.g., 22099 time_modify_set) is specified, the error NFS4ERR_INVAL is returned to 22100 the client. 22102 18.16. Operation 18: OPEN - Open a Regular File 22104 18.16.1. ARGUMENTS 22106 /* 22107 * Various definitions for OPEN 22108 */ 22109 enum createmode4 { 22110 UNCHECKED4 = 0, 22111 GUARDED4 = 1, 22112 /* Deprecated in NFSv4.1. */ 22113 EXCLUSIVE4 = 2, 22114 /* 22115 * New to NFSv4.1. If session is persistent, 22116 * GUARDED4 MUST be used. Otherwise, use 22117 * EXCLUSIVE4_1 instead of EXCLUSIVE4. 
22118 */ 22119 EXCLUSIVE4_1 = 3 22120 }; 22122 struct creatverfattr { 22123 verifier4 cva_verf; 22124 fattr4 cva_attrs; 22125 }; 22127 union createhow4 switch (createmode4 mode) { 22128 case UNCHECKED4: 22129 case GUARDED4: 22130 fattr4 createattrs; 22131 case EXCLUSIVE4: 22132 verifier4 createverf; 22133 case EXCLUSIVE4_1: 22134 creatverfattr ch_createboth; 22135 }; 22137 enum opentype4 { 22138 OPEN4_NOCREATE = 0, 22139 OPEN4_CREATE = 1 22140 }; 22142 union openflag4 switch (opentype4 opentype) { 22143 case OPEN4_CREATE: 22144 createhow4 how; 22145 default: 22146 void; 22147 }; 22149 /* Next definitions used for OPEN delegation */ 22150 enum limit_by4 { 22151 NFS_LIMIT_SIZE = 1, 22152 NFS_LIMIT_BLOCKS = 2 22153 /* others as needed */ 22154 }; 22156 struct nfs_modified_limit4 { 22157 uint32_t num_blocks; 22158 uint32_t bytes_per_block; 22159 }; 22161 union nfs_space_limit4 switch (limit_by4 limitby) { 22162 /* limit specified as file size */ 22163 case NFS_LIMIT_SIZE: 22164 uint64_t filesize; 22165 /* limit specified by number of blocks */ 22166 case NFS_LIMIT_BLOCKS: 22167 nfs_modified_limit4 mod_blocks; 22168 } ; 22170 /* 22171 * Share Access and Deny constants for open argument 22172 */ 22173 const OPEN4_SHARE_ACCESS_READ = 0x00000001; 22174 const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; 22175 const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; 22177 const OPEN4_SHARE_DENY_NONE = 0x00000000; 22178 const OPEN4_SHARE_DENY_READ = 0x00000001; 22179 const OPEN4_SHARE_DENY_WRITE = 0x00000002; 22180 const OPEN4_SHARE_DENY_BOTH = 0x00000003; 22182 /* new flags for share_access field of OPEN4args */ 22183 const OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00; 22184 const OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE = 0x0000; 22185 const OPEN4_SHARE_ACCESS_WANT_READ_DELEG = 0x0100; 22186 const OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200; 22187 const OPEN4_SHARE_ACCESS_WANT_ANY_DELEG = 0x0300; 22188 const OPEN4_SHARE_ACCESS_WANT_NO_DELEG = 0x0400; 22189 const OPEN4_SHARE_ACCESS_WANT_CANCEL = 
0x0500; 22191 const 22192 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 22193 = 0x10000; 22195 const 22196 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 22197 = 0x20000; 22199 enum open_delegation_type4 { 22200 OPEN_DELEGATE_NONE = 0, 22201 OPEN_DELEGATE_READ = 1, 22202 OPEN_DELEGATE_WRITE = 2, 22203 OPEN_DELEGATE_NONE_EXT = 3 /* new to v4.1 */ 22204 }; 22206 enum open_claim_type4 { 22207 /* 22208 * Not a reclaim. 22209 */ 22210 CLAIM_NULL = 0, 22212 CLAIM_PREVIOUS = 1, 22213 CLAIM_DELEGATE_CUR = 2, 22214 CLAIM_DELEGATE_PREV = 3, 22216 /* 22217 * Not a reclaim. 22218 * 22219 * Like CLAIM_NULL, but object identified 22220 * by the current filehandle. 22221 */ 22222 CLAIM_FH = 4, /* new to v4.1 */ 22224 /* 22225 * Like CLAIM_DELEGATE_CUR, but object identified 22226 * by current filehandle. 22227 */ 22228 CLAIM_DELEG_CUR_FH = 5, /* new to v4.1 */ 22230 /* 22231 * Like CLAIM_DELEGATE_PREV, but object identified 22232 * by current filehandle. 22234 */ 22235 CLAIM_DELEG_PREV_FH = 6 /* new to v4.1 */ 22236 }; 22238 struct open_claim_delegate_cur4 { 22239 stateid4 delegate_stateid; 22240 component4 file; 22241 }; 22243 union open_claim4 switch (open_claim_type4 claim) { 22244 /* 22245 * No special rights to file. 22246 * Ordinary OPEN of the specified file. 22247 */ 22248 case CLAIM_NULL: 22249 /* CURRENT_FH: directory */ 22250 component4 file; 22251 /* 22252 * Right to the file established by an 22253 * open previous to server reboot. File 22254 * identified by filehandle obtained at 22255 * that time rather than by name. 22256 */ 22257 case CLAIM_PREVIOUS: 22258 /* CURRENT_FH: file being reclaimed */ 22259 open_delegation_type4 delegate_type; 22261 /* 22262 * Right to file based on a delegation 22263 * granted by the server. File is 22264 * specified by name. 
22265 */ 22266 case CLAIM_DELEGATE_CUR: 22267 /* CURRENT_FH: directory */ 22268 open_claim_delegate_cur4 delegate_cur_info; 22270 /* 22271 * Right to file based on a delegation 22272 * granted to a previous boot instance 22273 * of the client. File is specified by name. 22274 */ 22275 case CLAIM_DELEGATE_PREV: 22276 /* CURRENT_FH: directory */ 22277 component4 file_delegate_prev; 22279 /* 22280 * Like CLAIM_NULL. No special rights 22281 * to file. Ordinary OPEN of the 22282 * specified file by current filehandle. 22283 */ 22284 case CLAIM_FH: /* new to v4.1 */ 22285 /* CURRENT_FH: regular file to open */ 22286 void; 22288 /* 22289 * Like CLAIM_DELEGATE_PREV. Right to file based on a 22290 * delegation granted to a previous boot 22291 * instance of the client. File is identified 22292 * by filehandle. 22293 */ 22294 case CLAIM_DELEG_PREV_FH: /* new to v4.1 */ 22295 /* CURRENT_FH: file being opened */ 22296 void; 22298 /* 22299 * Like CLAIM_DELEGATE_CUR. Right to file based on 22300 * a delegation granted by the server. 22301 * File is identified by filehandle. 22302 */ 22303 case CLAIM_DELEG_CUR_FH: /* new to v4.1 */ 22304 /* CURRENT_FH: file being opened */ 22305 stateid4 oc_delegate_stateid; 22307 }; 22309 /* 22310 * OPEN: Open a file, potentially receiving an OPEN delegation 22311 */ 22312 struct OPEN4args { 22313 seqid4 seqid; 22314 uint32_t share_access; 22315 uint32_t share_deny; 22316 open_owner4 owner; 22317 openflag4 openhow; 22318 open_claim4 claim; 22319 }; 22321 18.16.2.
RESULTS 22323 struct open_read_delegation4 { 22324 stateid4 stateid; /* Stateid for delegation*/ 22325 bool recall; /* Pre-recalled flag for 22326 delegations obtained 22327 by reclaim (CLAIM_PREVIOUS) */ 22329 nfsace4 permissions; /* Defines users who don't 22330 need an ACCESS call to 22331 open for read */ 22332 }; 22334 struct open_write_delegation4 { 22335 stateid4 stateid; /* Stateid for delegation */ 22336 bool recall; /* Pre-recalled flag for 22337 delegations obtained 22338 by reclaim 22339 (CLAIM_PREVIOUS) */ 22341 nfs_space_limit4 22342 space_limit; /* Defines condition that 22343 the client must check to 22344 determine whether the 22345 file needs to be flushed 22346 to the server on close. */ 22348 nfsace4 permissions; /* Defines users who don't 22349 need an ACCESS call as 22350 part of a delegated 22351 open. */ 22352 }; 22354 enum why_no_delegation4 { /* new to v4.1 */ 22355 WND4_NOT_WANTED = 0, 22356 WND4_CONTENTION = 1, 22357 WND4_RESOURCE = 2, 22358 WND4_NOT_SUPP_FTYPE = 3, 22359 WND4_WRITE_DELEG_NOT_SUPP_FTYPE = 4, 22360 WND4_NOT_SUPP_UPGRADE = 5, 22361 WND4_NOT_SUPP_DOWNGRADE = 6, 22362 WND4_CANCELLED = 7, 22363 WND4_IS_DIR = 8 22364 }; 22366 union open_none_delegation4 /* new to v4.1 */ 22367 switch (why_no_delegation4 ond_why) { 22368 case WND4_CONTENTION: 22369 bool ond_server_will_push_deleg; 22370 case WND4_RESOURCE: 22371 bool ond_server_will_signal_avail; 22372 default: 22373 void; 22374 }; 22375 union open_delegation4 22376 switch (open_delegation_type4 delegation_type) { 22377 case OPEN_DELEGATE_NONE: 22378 void; 22379 case OPEN_DELEGATE_READ: 22380 open_read_delegation4 read; 22381 case OPEN_DELEGATE_WRITE: 22382 open_write_delegation4 write; 22383 case OPEN_DELEGATE_NONE_EXT: /* new to v4.1 */ 22384 open_none_delegation4 od_whynone; 22385 }; 22387 /* 22388 * Result flags 22389 */ 22391 /* Client must confirm open */ 22392 const OPEN4_RESULT_CONFIRM = 0x00000002; 22393 /* Type of file locking behavior at the server */ 22394 const 
OPEN4_RESULT_LOCKTYPE_POSIX = 0x00000004; 22395 /* Server will preserve file if removed while open */ 22396 const OPEN4_RESULT_PRESERVE_UNLINKED = 0x00000008; 22398 /* 22399 * Server may use CB_NOTIFY_LOCK on locks 22400 * derived from this open 22401 */ 22402 const OPEN4_RESULT_MAY_NOTIFY_LOCK = 0x00000020; 22404 struct OPEN4resok { 22405 stateid4 stateid; /* Stateid for open */ 22406 change_info4 cinfo; /* Directory Change Info */ 22407 uint32_t rflags; /* Result flags */ 22408 bitmap4 attrset; /* attribute set for create*/ 22409 open_delegation4 delegation; /* Info on any open 22410 delegation */ 22411 }; 22413 union OPEN4res switch (nfsstat4 status) { 22414 case NFS4_OK: 22415 /* New CURRENT_FH: opened file */ 22416 OPEN4resok resok4; 22417 default: 22418 void; 22419 }; 22421 18.16.3. DESCRIPTION 22423 The OPEN operation opens a regular file in a directory with the 22424 provided name or filehandle. OPEN can also create a file if a name 22425 is provided, and the client specifies it wants to create a file. 22426 Specification of whether or not a file is to be created, and the 22427 method of creation, is via the openhow parameter. The openhow 22428 parameter consists of a switched union (data type openflag4), which 22429 switches on the value of opentype (OPEN4_NOCREATE or OPEN4_CREATE). 22430 If OPEN4_CREATE is specified, this leads to another switched union 22431 (data type createhow4) that supports four cases of creation methods: 22432 UNCHECKED4, GUARDED4, EXCLUSIVE4, or EXCLUSIVE4_1. If opentype is 22433 OPEN4_CREATE, then the claim field of the OPEN argument MUST be one of 22434 CLAIM_NULL, CLAIM_DELEGATE_CUR, or CLAIM_DELEGATE_PREV, because these 22435 claim methods include a component of a file name. 22437 Upon success (which might entail creation of a new file), the current 22438 filehandle is replaced by that of the created or existing object.
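The rule that an OPEN4_CREATE open must use a claim type carrying a component name can be sketched as a client-side sanity check. This is purely illustrative; the function name and the representation of claim types as integers are assumptions for the example, matching the open_claim_type4 enum values.

```python
# opentype4 and open_claim_type4 values from the XDR above
OPEN4_NOCREATE, OPEN4_CREATE = 0, 1
CLAIM_NULL, CLAIM_PREVIOUS, CLAIM_DELEGATE_CUR, CLAIM_DELEGATE_PREV = 0, 1, 2, 3
CLAIM_FH, CLAIM_DELEG_CUR_FH, CLAIM_DELEG_PREV_FH = 4, 5, 6

# Claim types that identify the file by a component name rather than
# by the current filehandle
NAMED_CLAIMS = {CLAIM_NULL, CLAIM_DELEGATE_CUR, CLAIM_DELEGATE_PREV}

def create_claim_valid(opentype, claim):
    """OPEN4_CREATE requires a claim that includes a file name component."""
    return opentype != OPEN4_CREATE or claim in NAMED_CLAIMS
```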
22440 If the current filehandle is a named attribute directory, OPEN will 22441 then create or open a named attribute file. Note that exclusive 22442 create of a named attribute is not supported. If the createmode is 22443 EXCLUSIVE4 or EXCLUSIVE4_1 and the current filehandle is a named 22444 attribute directory, the server will return NFS4ERR_INVAL. 22446 UNCHECKED4 means that the file should be created if a file of that 22447 name does not exist and encountering an existing regular file of that 22448 name is not an error. For this type of create, createattrs specifies 22449 the initial set of attributes for the file. The set of attributes 22450 may include any writable attribute valid for regular files. When an 22451 UNCHECKED4 create encounters an existing file, the attributes 22452 specified by createattrs are not used, except that when createattrs 22453 specifies the size attribute with a size of zero, the existing file 22454 is truncated. 22456 If GUARDED4 is specified, the server checks for the presence of a 22457 duplicate object by name before performing the create. If a 22458 duplicate exists, NFS4ERR_EXIST is returned. If the object does not 22459 exist, the request is performed as described for UNCHECKED4. 22461 For the UNCHECKED4 and GUARDED4 cases, where the operation is 22462 successful, the server will return to the client an attribute mask 22463 signifying which attributes were successfully set for the object. 22465 EXCLUSIVE4_1 and EXCLUSIVE4 specify that the server is to follow 22466 exclusive creation semantics, using the verifier to ensure exclusive 22467 creation of the target. The server should check for the presence of 22468 a duplicate object by name. If the object does not exist, the server 22469 creates the object and stores the verifier with the object. If the 22470 object does exist and the stored verifier matches the client-provided 22471 verifier, the server uses the existing object as the newly created 22472 object.
If the stored verifier does not match, then an error of 22473 NFS4ERR_EXIST is returned. 22475 If using EXCLUSIVE4, and if the server uses attributes to store the 22476 exclusive create verifier, the server will signify which attributes 22477 it used by setting the appropriate bits in the attribute mask that is 22478 returned in the results. Unlike UNCHECKED4, GUARDED4, and 22479 EXCLUSIVE4_1, EXCLUSIVE4 does not support the setting of attributes 22480 at file creation, and after a successful OPEN via EXCLUSIVE4, the 22481 client MUST send a SETATTR to set attributes to a known state. 22483 In NFSv4.1, EXCLUSIVE4 has been deprecated in favor of EXCLUSIVE4_1. 22484 Unlike EXCLUSIVE4, attributes may be provided in the EXCLUSIVE4_1 22485 case, but because the server may use attributes of the target object 22486 to store the verifier, the set of allowable attributes may be fewer 22487 than the set of attributes SETATTR allows. The allowable attributes 22488 for EXCLUSIVE4_1 are indicated in the suppattr_exclcreat 22489 (Section 5.8.1.14) attribute. If the client attempts to set in 22490 cva_attrs an attribute that is not in suppattr_exclcreat, the server 22491 MUST return NFS4ERR_INVAL. The response field, attrset, indicates 22492 both which attributes the server set from cva_attrs and which 22493 attributes the server used to store the verifier. As described in 22494 Section 18.16.4, the client can compare cva_attrs.attrmask with 22495 attrset to determine which attributes were used to store the 22496 verifier. 22498 With the addition of persistent sessions and pNFS, under some 22499 conditions EXCLUSIVE4 MUST NOT be used by the client or supported by 22500 the server. 
The following table summarizes the appropriate and 22501 mandated exclusive create methods for implementations of NFSv4.1: 22503 Required methods for exclusive create 22505 +----------------+-----------+---------------+----------------------+ 22506 | Persistent | Server | Server | Client Allowed | 22507 | Reply Cache | Supports | REQUIRED | | 22508 | Enabled | pNFS | | | 22509 +----------------+-----------+---------------+----------------------+ 22510 | no | no | EXCLUSIVE4_1 | EXCLUSIVE4_1 | 22511 | | | and | (SHOULD) or | 22512 | | | EXCLUSIVE4 | EXCLUSIVE4 (SHOULD | 22513 | | | | NOT) | 22514 | no | yes | EXCLUSIVE4_1 | EXCLUSIVE4_1 | 22515 | yes | no | GUARDED4 | GUARDED4 | 22516 | yes | yes | GUARDED4 | GUARDED4 | 22517 +----------------+-----------+---------------+----------------------+ 22519 Table 10 22521 If CREATE_SESSION4_FLAG_PERSIST is set in the results of 22522 CREATE_SESSION, the reply cache is persistent (see Section 18.36). 22523 If the EXCHGID4_FLAG_USE_PNFS_MDS flag is set in the results from 22524 EXCHANGE_ID, the server is a pNFS server (see Section 18.35). If the 22525 client attempts to use EXCLUSIVE4 on a persistent session, or a 22526 session derived from an EXCHGID4_FLAG_USE_PNFS_MDS client ID, the 22527 server MUST return NFS4ERR_INVAL. 22529 With persistent sessions, exclusive create semantics are fully 22530 achievable via GUARDED4, and so EXCLUSIVE4 or EXCLUSIVE4_1 MUST NOT 22531 be used. When pNFS is being used, the layout_hint attribute might 22532 not be supported after the file is created. Only the EXCLUSIVE4_1 22533 and GUARDED methods of exclusive file creation allow the atomic 22534 setting of attributes. 22536 For the target directory, the server returns change_info4 information 22537 in cinfo. With the atomic field of the change_info4 data type, the 22538 server will indicate if the before and after change attributes were 22539 obtained atomically with respect to the link creation. 
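The createmode rules summarized in Table 10, together with the EXCLUSIVE4_1 attrset comparison described above, can be sketched from the client's perspective. This is an illustrative decision helper, not normative; the function names are assumptions for the example, and attribute bitmaps are modeled as Python sets of attribute names.

```python
def exclusive_create_mode(persistent_reply_cache, pnfs_mds):
    """Pick an exclusive-create method per Table 10."""
    if persistent_reply_cache:
        # Exclusive semantics are fully achievable via the persistent
        # reply cache; EXCLUSIVE4/EXCLUSIVE4_1 MUST NOT be used.
        return "GUARDED4"
    if pnfs_mds:
        # EXCLUSIVE4 would draw NFS4ERR_INVAL from a pNFS server.
        return "EXCLUSIVE4_1"
    # EXCLUSIVE4 is still allowed here, but the client SHOULD NOT use it.
    return "EXCLUSIVE4_1"

def verifier_only_attrs(cva_attrmask, attrset):
    """Attributes in the returned attrset that the client did not request
    in cva_attrs were used by the server to store the verifier."""
    return attrset - cva_attrmask
```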
22541 The OPEN operation provides for Windows share reservation capability 22542 with the use of the share_access and share_deny fields of the OPEN 22543 arguments. The client specifies at OPEN the required share_access 22544 and share_deny modes. For clients that do not directly support 22545 SHAREs (i.e., UNIX), the expected deny value is 22546 OPEN4_SHARE_DENY_NONE. In the case that there is an existing SHARE 22547 reservation that conflicts with the OPEN request, the server returns 22548 the error NFS4ERR_SHARE_DENIED. For additional discussion of SHARE 22549 semantics, see Section 9.7. 22551 For each OPEN, the client provides a value for the owner field of the 22552 OPEN argument. The owner field is of data type open_owner4, and 22553 contains a field called clientid and a field called owner. The 22554 client can set the clientid field to any value and the server MUST 22555 ignore it. Instead, the server MUST derive the client ID from the 22556 session ID of the SEQUENCE operation of the COMPOUND request. 22558 The "seqid" field of the request is not used in NFSv4.1, but it MAY 22559 be any value and the server MUST ignore it. 22561 In the case that the client is recovering state from a server 22562 failure, the claim field of the OPEN argument is used to signify that 22563 the request is meant to reclaim state previously held. 22565 The "claim" field of the OPEN argument is used to specify the file to 22566 be opened and the state information that the client claims to 22567 possess. There are seven claim types as follows: 22569 +----------------------+--------------------------------------------+ 22570 | open type | description | 22571 +----------------------+--------------------------------------------+ 22572 | CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN request | 22573 | | and there is no previous state associated | 22574 | | with the file for the client. 
With | 22575 | | CLAIM_NULL, the file is identified by the | 22576 | | current filehandle and the specified | 22577 | | component name. With CLAIM_FH (new to | 22578 | | NFSv4.1), the file is identified by just | 22579 | | the current filehandle. | 22580 | CLAIM_PREVIOUS | The client is claiming basic OPEN state | 22581 | | for a file that was held previous to a | 22582 | | server restart. Generally used when a | 22583 | | server is returning persistent | 22584 | | filehandles; the client may not have the | 22585 | | file name to reclaim the OPEN. | 22586 | CLAIM_DELEGATE_CUR, | The client is claiming a delegation for | 22587 | CLAIM_DELEG_CUR_FH | OPEN as granted by the server. Generally, | 22588 | | this is done as part of recalling a | 22589 | | delegation. With CLAIM_DELEGATE_CUR, the | 22590 | | file is identified by the current | 22591 | | filehandle and the specified component | 22592 | | name. With CLAIM_DELEG_CUR_FH (new to | 22593 | | NFSv4.1), the file is identified by just | 22594 | | the current filehandle. | 22595 | CLAIM_DELEGATE_PREV, | The client is claiming a delegation | 22596 | CLAIM_DELEG_PREV_FH | granted to a previous client instance; | 22597 | | used after the client restarts. The server | 22598 | | MAY support CLAIM_DELEGATE_PREV and/or | 22599 | | CLAIM_DELEG_PREV_FH (new to NFSv4.1). If | 22600 | | it does support either claim type, | 22601 | | CREATE_SESSION MUST NOT remove the | 22602 | | client's delegation state, and the server | 22603 | | MUST support the DELEGPURGE operation. | 22604 +----------------------+--------------------------------------------+ 22606 For OPEN requests that reach the server during the grace period, the 22607 server returns an error of NFS4ERR_GRACE. The following claim types 22608 are exceptions: 22610 o OPEN requests specifying the claim type CLAIM_PREVIOUS are devoted 22611 to reclaiming opens after a server restart and are typically only 22612 valid during the grace period. 
22614 o OPEN requests specifying the claim types CLAIM_DELEGATE_CUR and 22615 CLAIM_DELEG_CUR_FH are valid both during and after the grace 22616 period. Since the granting of the delegation that they are 22617 subordinate to assures that there is no conflict with locks to be 22618 reclaimed by other clients, the server need not return 22619 NFS4ERR_GRACE when these are received during the grace period. 22621 For any OPEN request, the server may return an OPEN delegation, which 22622 allows further opens and closes to be handled locally on the client 22623 as described in Section 10.4. Note that delegation is up to the 22624 server to decide. The client should never assume that delegation 22625 will or will not be granted in a particular instance. It should 22626 always be prepared for either case. A partial exception is the 22627 reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. 22628 In this case, delegation will always be granted, although the server 22629 may specify an immediate recall in the delegation structure. 22631 The rflags returned by a successful OPEN allow the server to return 22632 information governing how the open file is to be handled. 22634 o OPEN4_RESULT_CONFIRM is deprecated and MUST NOT be returned by an 22635 NFSv4.1 server. 22637 o OPEN4_RESULT_LOCKTYPE_POSIX indicates that the server's byte-range 22638 locking behavior supports the complete set of POSIX locking 22639 techniques [21]. From this, the client can choose to manage byte- 22640 range locking state in a way to handle a mismatch of byte-range 22641 locking management. 22643 o OPEN4_RESULT_PRESERVE_UNLINKED indicates that the server will 22644 preserve the open file if the client (or any other client) removes 22645 the file as long as it is open. Furthermore, the server promises 22646 to preserve the file through the grace period after server 22647 restart, thereby giving the client the opportunity to reclaim its 22648 open. 
22650 o OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt 22651 CB_NOTIFY_LOCK callbacks for locks on this file. This flag is a 22652 hint only, and may be safely ignored by the client. 22654 If the component is of zero length, NFS4ERR_INVAL will be returned. 22655 The component is also subject to the normal UTF-8, character support, 22656 and name checks. See Section 14.5 for further discussion. 22658 When an OPEN is done and the specified open-owner already has the 22659 resulting filehandle open, the result is to "OR" together the new 22660 share and deny status with the existing status. In this 22661 case, only a single CLOSE need be done, even though multiple OPENs 22662 were completed. When such an OPEN is done, checking of share 22663 reservations for the new OPEN proceeds normally, with no exception 22664 for the existing OPEN held by the same open-owner. In this case, the 22665 stateid returned has an "other" field that matches that of the 22666 previous open, while the "seqid" field is incremented to reflect the 22667 change in status due to the new open. 22669 If the underlying file system at the server is only accessible in a 22670 read-only mode and the OPEN request has specified ACCESS_WRITE or 22671 ACCESS_BOTH, the server will return NFS4ERR_ROFS to indicate a read- 22672 only file system. 22674 As with the CREATE operation, the server MUST derive the owner, owner 22675 ACE, group, or group ACE if any of the four attributes are required 22676 and supported by the server's file system. For an OPEN with the 22677 EXCLUSIVE4 createmode, the server has no choice, since such OPEN 22678 calls do not include the createattrs field. Conversely, if 22679 createattrs (UNCHECKED4 or GUARDED4) or cva_attrs (EXCLUSIVE4_1) is 22680 specified, and includes an owner, owner_group, or ACE that the 22681 principal in the RPC call's credentials does not have authorization 22682 to create files for, then the server may return NFS4ERR_PERM.
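The "OR" rule for a repeated OPEN by the same open-owner, together with the accompanying seqid bump on the stateid, can be sketched as follows. This is a simplified illustration of server-side state handling, with illustrative names; real servers track considerably more per-open state.

```python
# Share access/deny bits from the XDR above
OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE = 0x1, 0x2
OPEN4_SHARE_DENY_NONE = 0x0

def upgrade_open(state, new_access, new_deny):
    """Merge a new OPEN by the same open-owner into existing open state.
    'state' is a dict with 'access', 'deny', 'seqid', 'other' (illustrative)."""
    state["access"] |= new_access   # union of share access modes
    state["deny"] |= new_deny       # union of share deny modes
    state["seqid"] += 1             # same "other" field, incremented seqid
    return state

s = {"access": OPEN4_SHARE_ACCESS_READ, "deny": OPEN4_SHARE_DENY_NONE,
     "seqid": 1, "other": b"\x00" * 12}
s = upgrade_open(s, OPEN4_SHARE_ACCESS_WRITE, OPEN4_SHARE_DENY_NONE)
# s now covers both READ and WRITE access under a single open; one CLOSE suffices
```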
22684 In the case of an OPEN that specifies a size of zero (e.g., 22685 truncation) and the file has named attributes, the named attributes 22686 are left as is and are not removed. 22688 NFSv4.1 gives more precise control to clients over acquisition of 22689 delegations via the following new flags for the share_access field of 22690 OPEN4args: 22692 OPEN4_SHARE_ACCESS_WANT_READ_DELEG 22694 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 22696 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 22698 OPEN4_SHARE_ACCESS_WANT_NO_DELEG 22700 OPEN4_SHARE_ACCESS_WANT_CANCEL 22702 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL 22704 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED 22706 If (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is not zero, 22707 then the client will have specified one and only one of: 22709 OPEN4_SHARE_ACCESS_WANT_READ_DELEG 22711 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 22712 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 22714 OPEN4_SHARE_ACCESS_WANT_NO_DELEG 22716 OPEN4_SHARE_ACCESS_WANT_CANCEL 22718 Otherwise, the client is neither indicating a desire nor a non-desire 22719 for a delegation, and the server MAY or MAY not return a delegation 22720 in the OPEN response. 22722 If the server supports the new _WANT_ flags and the client sends one 22723 or more of the new flags, then in the event the server does not 22724 return a delegation, it MUST return a delegation type of 22725 OPEN_DELEGATE_NONE_EXT. The field ond_why in the reply indicates why 22726 no delegation was returned and will be one of: 22728 WND4_NOT_WANTED The client specified 22729 OPEN4_SHARE_ACCESS_WANT_NO_DELEG. 22731 WND4_CONTENTION There is a conflicting delegation or open on the 22732 file. 22734 WND4_RESOURCE Resource limitations prevent the server from granting 22735 a delegation. 22737 WND4_NOT_SUPP_FTYPE The server does not support delegations on this 22738 file type. 22740 WND4_WRITE_DELEG_NOT_SUPP_FTYPE The server does not support 22741 OPEN_DELEGATE_WRITE delegations on this file type. 
22743 WND4_NOT_SUPP_UPGRADE The server does not support atomic upgrade of 22744 an OPEN_DELEGATE_READ delegation to an OPEN_DELEGATE_WRITE 22745 delegation. 22747 WND4_NOT_SUPP_DOWNGRADE The server does not support atomic downgrade 22748 of an OPEN_DELEGATE_WRITE delegation to an OPEN_DELEGATE_READ 22749 delegation. 22751 WND4_CANCELED The client specified OPEN4_SHARE_ACCESS_WANT_CANCEL 22752 and now any "want" for this file object is cancelled. 22754 WND4_IS_DIR The specified file object is a directory, and the 22755 operation is OPEN or WANT_DELEGATION, which do not support 22756 delegations on directories. 22758 OPEN4_SHARE_ACCESS_WANT_READ_DELEG, 22759 OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG, or 22760 OPEN4_SHARE_ACCESS_WANT_ANY_DELEG mean, respectively, that the client 22761 wants an OPEN_DELEGATE_READ, OPEN_DELEGATE_WRITE, or any delegation 22762 regardless of which of OPEN4_SHARE_ACCESS_READ, 22763 OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH is set. If the 22764 client has an OPEN_DELEGATE_READ delegation on a file and requests an 22765 OPEN_DELEGATE_WRITE delegation, then the client is requesting atomic 22766 upgrade of its OPEN_DELEGATE_READ delegation to an 22767 OPEN_DELEGATE_WRITE delegation. If the client has an 22768 OPEN_DELEGATE_WRITE delegation on a file and requests an 22769 OPEN_DELEGATE_READ delegation, then the client is requesting atomic 22770 downgrade to an OPEN_DELEGATE_READ delegation. A server MAY support 22771 atomic upgrade or downgrade. If it does, then a returned 22772 delegation_type of OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE that is 22773 different from the delegation type the client currently has 22774 indicates a successful upgrade or downgrade. If the server does not 22775 support atomic delegation upgrade or downgrade, then ond_why will be 22776 set to WND4_NOT_SUPP_UPGRADE or WND4_NOT_SUPP_DOWNGRADE. 22778 OPEN4_SHARE_ACCESS_WANT_NO_DELEG means that the client wants no 22779 delegation.
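Because the WANT values are an enumeration carried within OPEN4_SHARE_ACCESS_WANT_DELEG_MASK rather than independent flag bits, the "one and only one" rule above reduces to a membership test. The sketch below is illustrative: the constant values are those assigned in the NFSv4.1 XDR, while the function name is invented.

```python
# WANT values from the NFSv4.1 XDR; they occupy a byte-wide field inside
# share_access rather than being independent flag bits.
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK  = 0xFF00
OPEN4_SHARE_ACCESS_WANT_READ_DELEG  = 0x0100
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG   = 0x0300
OPEN4_SHARE_ACCESS_WANT_NO_DELEG    = 0x0400
OPEN4_SHARE_ACCESS_WANT_CANCEL      = 0x0500

VALID_WANTS = {
    OPEN4_SHARE_ACCESS_WANT_READ_DELEG,
    OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG,
    OPEN4_SHARE_ACCESS_WANT_ANY_DELEG,
    OPEN4_SHARE_ACCESS_WANT_NO_DELEG,
    OPEN4_SHARE_ACCESS_WANT_CANCEL,
}

def want_field_valid(share_access):
    """True if share_access carries no WANT value or exactly one of the
    five defined WANT values."""
    want = share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK
    return want == 0 or want in VALID_WANTS
```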
22781 OPEN4_SHARE_ACCESS_WANT_CANCEL means that the client wants no 22782 delegation and wants to cancel any previously registered "want" for a 22783 delegation. 22785 The client may set one or both of 22786 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and 22787 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED. However, they 22788 will have no effect unless one of the following is set: 22790 o OPEN4_SHARE_ACCESS_WANT_READ_DELEG 22792 o OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG 22794 o OPEN4_SHARE_ACCESS_WANT_ANY_DELEG 22796 If the client specifies 22797 OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL, then it wishes 22798 to register a "want" for a delegation, in the event the OPEN results 22799 do not include a delegation. If so and the server denies the 22800 delegation due to insufficient resources, the server MAY later inform 22801 the client, via the CB_RECALLABLE_OBJ_AVAIL operation, that the 22802 resource limitation condition has eased. The server will tell the 22803 client that it intends to send a future CB_RECALLABLE_OBJ_AVAIL 22804 operation by setting delegation_type in the results to 22805 OPEN_DELEGATE_NONE_EXT, ond_why to WND4_RESOURCE, and 22806 ond_server_will_signal_avail set to TRUE. If 22807 ond_server_will_signal_avail is set to TRUE, the server MUST later 22808 send a CB_RECALLABLE_OBJ_AVAIL operation. 22810 If the client specifies 22811 OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED, then it wishes 22812 to register a "want" for a delegation, in the event the OPEN results 22813 do not include a delegation. If so and the server denies the 22814 delegation due to contention, the server MAY later inform the client, 22815 via the CB_PUSH_DELEG operation, that the contention condition has 22816 eased.
The server will tell the client that it intends to send a 22817 future CB_PUSH_DELEG operation by setting delegation_type in the 22818 results to OPEN_DELEGATE_NONE_EXT, ond_why to WND4_CONTENTION, and 22819 ond_server_will_push_deleg to TRUE. If ond_server_will_push_deleg is 22820 TRUE, the server MUST later send a CB_PUSH_DELEG operation. 22822 If the client has previously registered a want for a delegation on a 22823 file, and then sends a request to register a want for a delegation on 22824 the same file, the server MUST return a new error: 22825 NFS4ERR_DELEG_ALREADY_WANTED. If the client wishes to register a 22826 different type of delegation want for the same file, it MUST cancel 22827 the existing delegation want. 22829 18.16.4. IMPLEMENTATION 22831 In the absence of a persistent session, the client invokes exclusive 22832 create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1. 22833 In these cases, the client provides a verifier that can reasonably be 22834 expected to be unique. A combination of a client identifier, perhaps 22835 the client network address, and a unique number generated by the 22836 client, perhaps the RPC transaction identifier, may be appropriate. 22838 If the object does not exist, the server creates the object and 22839 stores the verifier in stable storage. For file systems that do not 22840 provide a mechanism for the storage of arbitrary file attributes, the 22841 server may use one or more elements of the object's metadata to store 22842 the verifier. The verifier MUST be stored in stable storage to 22843 prevent erroneous failure on retransmission of the request. It is 22844 assumed that an exclusive create is being performed because exclusive 22845 semantics are critical to the application. Because of the expected 22846 usage, exclusive CREATE does not rely solely on the server's reply 22847 cache for storage of the verifier.
A nonpersistent reply cache does 22848 not survive a crash and the session and reply cache may be deleted 22849 after a network partition that exceeds the lease time, thus opening 22850 failure windows. 22852 An NFSv4.1 server SHOULD NOT store the verifier in any of the file's 22853 RECOMMENDED or REQUIRED attributes. If it does, the server SHOULD 22854 use time_modify_set or time_access_set to store the verifier. The 22855 server SHOULD NOT store the verifier in the following attributes: 22857 acl (it is desirable for access control to be established at 22858 creation), 22860 dacl (ditto), 22862 mode (ditto), 22864 owner (ditto), 22866 owner_group (ditto), 22868 retentevt_set (it may be desired to establish retention at 22869 creation), 22871 retention_hold (ditto), 22873 retention_set (ditto), 22875 sacl (it is desirable for auditing control to be established at 22876 creation), 22878 size (on some servers, size may have a limited range of values), 22880 mode_set_masked (as with mode), 22882 and 22884 time_creation (a meaningful file creation time should be set when the 22885 file is created). 22887 Another alternative for the server is to use a named attribute to 22888 store the verifier. 22890 Because the EXCLUSIVE4 create method does not specify initial 22891 attributes, when processing an EXCLUSIVE4 create, the server 22893 o SHOULD set the owner of the file to that corresponding to the 22894 credential of the request's RPC header. 22896 o SHOULD NOT leave the file's access control to anyone but the owner 22897 of the file. 22899 If the server cannot support exclusive create semantics, possibly 22900 because of the requirement to commit the verifier to stable storage, 22901 it should fail the OPEN request with the error NFS4ERR_NOTSUPP. 22903 During an exclusive CREATE request, if the object already exists, the 22904 server reconstructs the object's verifier and compares it with the 22905 verifier in the request.
If they match, the server treats the 22906 request as a success. The request is presumed to be a duplicate of 22907 an earlier, successful request for which the reply was lost and that 22908 the server duplicate request cache mechanism did not detect. If the 22909 verifiers do not match, the request is rejected with the status 22910 NFS4ERR_EXIST. 22912 After the client has performed a successful exclusive create, the 22913 attrset response indicates which attributes were used to store the 22914 verifier. If EXCLUSIVE4 was used, the attributes set in attrset were 22915 used for the verifier. If EXCLUSIVE4_1 was used, the client 22916 determines the attributes used for the verifier by comparing attrset 22917 with cva_attrs.attrmask; any bits set in the former but not the 22918 latter identify the attributes used to store the verifier. The 22919 client MUST immediately send a SETATTR to set attributes used to 22920 store the verifier. Until it does so, the attributes used to store 22921 the verifier cannot be relied upon. The subsequent SETATTR MUST NOT 22922 occur in the same COMPOUND request as the OPEN. 22924 Unless a persistent session is used, use of the GUARDED4 attribute 22925 does not provide exactly once semantics. In particular, if a reply 22926 is lost and the server does not detect the retransmission of the 22927 request, the operation can fail with NFS4ERR_EXIST, even though the 22928 create was performed successfully. The client would use this 22929 behavior in the case that the application has not requested an 22930 exclusive create but has asked to have the file truncated when the 22931 file is opened. In the case of the client timing out and 22932 retransmitting the create request, the client can use GUARDED4 to 22933 prevent against a sequence like create, write, create (retransmitted) 22934 from occurring. 
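The exclusive-create verifier handling described above can be sketched as follows. This is a hypothetical in-memory model only (a real server must keep the verifier in stable storage, as the text requires); the function and dictionary names are invented for the sketch.

```python
# Hypothetical sketch of server-side EXCLUSIVE4 processing. 'objects'
# stands in for the file system; a real implementation stores the
# verifier in stable storage so a retransmitted request still matches
# after a server restart.

def exclusive_create(objects, name, verifier):
    if name not in objects:
        objects[name] = {"verf": verifier}   # create and record verifier
        return "NFS4_OK"
    if objects[name].get("verf") == verifier:
        # Same verifier: presumed retransmission of an earlier,
        # successful request whose reply was lost.
        return "NFS4_OK"
    return "NFS4ERR_EXIST"                   # genuinely conflicting create
```

After a successful exclusive create, the client must still issue the follow-up SETATTR described above to set the attributes that were used to store the verifier.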
22936 For SHARE reservations, the value of the expression (share_access & 22937 ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) MUST be one of 22938 OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or 22939 OPEN4_SHARE_ACCESS_BOTH. If not, the server MUST return 22940 NFS4ERR_INVAL. The value of share_deny MUST be one of 22941 OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, 22942 or OPEN4_SHARE_DENY_BOTH. If not, the server MUST return 22943 NFS4ERR_INVAL. 22945 Based on the share_access value (OPEN4_SHARE_ACCESS_READ, 22946 OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH), the server 22947 should check that the requester has the proper access rights to 22948 perform the specified operation. This would generally be the result 22949 of applying the ACL access rules to the file for the current 22950 requester. However, just as with the ACCESS operation, the client 22951 should not attempt to second-guess the server's decisions, as access 22952 rights may change and may be subject to server administrative 22953 controls outside the ACL framework. If the requester's READ or WRITE 22954 operation is not authorized (depending on the share_access value), 22955 the server MUST return NFS4ERR_ACCESS. 22957 Note that if the client ID was not created with the 22958 EXCHGID4_FLAG_BIND_PRINC_STATEID capability set in the reply to 22959 EXCHANGE_ID, then the server MUST NOT impose any requirement that 22960 READs and WRITEs sent for an open file have the same credentials as 22961 the OPEN itself, and the server is REQUIRED to perform access 22962 checking on the READs and WRITEs themselves. Otherwise, if the reply 22963 to EXCHANGE_ID did have EXCHGID4_FLAG_BIND_PRINC_STATEID set, then 22964 with one exception, the credentials used in the OPEN request MUST 22965 match those used in the READs and WRITEs, and the stateids in the 22966 READs and WRITEs MUST match, or be derived from, the stateid from the 22967 reply to OPEN.
The exception is if SP4_SSV or SP4_MACH_CRED state 22968 protection is used, and the spo_must_allow result of EXCHANGE_ID 22969 includes the READ and/or WRITE operations. In that case, the machine 22970 or SSV credential will be allowed to send READ and/or WRITE. See 22971 Section 18.35. 22973 If the component provided to OPEN is a symbolic link, the error 22974 NFS4ERR_SYMLINK will be returned to the client, while if it is a 22975 directory the error NFS4ERR_ISDIR will be returned. If the component 22976 is neither of those but is not an ordinary file, the error 22977 NFS4ERR_WRONG_TYPE is returned. If the current filehandle is not a 22978 directory, the error NFS4ERR_NOTDIR will be returned. 22980 The use of the OPEN4_RESULT_PRESERVE_UNLINKED result flag allows a 22981 client to avoid the common implementation practice of renaming an 22982 open file to ".nfs" after it removes the file. After 22983 the server returns OPEN4_RESULT_PRESERVE_UNLINKED, if a client sends 22984 a REMOVE operation that would reduce the file's link count to zero, 22985 the server SHOULD report a value of zero for the numlinks attribute 22986 on the file. 22988 If another client has a delegation of the file being opened that 22989 conflicts with the open being done (sometimes depending on the 22990 share_access or share_deny value specified), the delegation(s) MUST 22991 be recalled, and the operation cannot proceed until each such 22992 delegation is returned or revoked. Except where this happens very 22993 quickly, one or more NFS4ERR_DELAY errors will be returned to 22994 requests made while the delegation remains outstanding. In the case 22995 of an OPEN_DELEGATE_WRITE delegation, any open by a different client 22996 will conflict, while for an OPEN_DELEGATE_READ delegation, only opens 22997 with one of the following characteristics will be considered 22998 conflicting: 23000 o The value of share_access includes the bit 23001 OPEN4_SHARE_ACCESS_WRITE.
23003 o The value of share_deny specifies OPEN4_SHARE_DENY_READ or 23004 OPEN4_SHARE_DENY_BOTH. 23006 o OPEN4_CREATE is specified together with UNCHECKED4, the size 23007 attribute is specified as zero (for truncation), and an existing 23008 file is truncated. 23010 If OPEN4_CREATE is specified and the file does not exist and the 23011 current filehandle designates a directory for which another client 23012 holds a directory delegation, then, unless the delegation is such 23013 that the situation can be resolved by sending a notification, the 23014 delegation MUST be recalled, and the operation cannot proceed until 23015 the delegation is returned or revoked. Except where this happens 23016 very quickly, one or more NFS4ERR_DELAY errors will be returned to 23017 requests made while delegation remains outstanding. 23019 If OPEN4_CREATE is specified and the file does not exist and the 23020 current filehandle designates a directory for which one or more 23021 directory delegations exist, then, when those delegations request 23022 such notifications, NOTIFY4_ADD_ENTRY will be generated as a result 23023 of this operation. 23025 18.16.4.1. Warning to Client Implementors 23027 OPEN resembles LOOKUP in that it generates a filehandle for the 23028 client to use. Unlike LOOKUP though, OPEN creates server state on 23029 the filehandle. In normal circumstances, the client can only release 23030 this state with a CLOSE operation. CLOSE uses the current filehandle 23031 to determine which file to close. Therefore, the client MUST follow 23032 every OPEN operation with a GETFH operation in the same COMPOUND 23033 procedure. This will supply the client with the filehandle such that 23034 CLOSE can be used appropriately. 23036 Simply waiting for the lease on the file to expire is insufficient 23037 because the server may maintain the state indefinitely as long as 23038 another client does not attempt to make a conflicting access to the 23039 same file. 
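The delegation-conflict criteria listed earlier in this IMPLEMENTATION discussion (any open by another client conflicts with an OPEN_DELEGATE_WRITE delegation; only certain opens conflict with an OPEN_DELEGATE_READ delegation) can be sketched as follows. The constant values come from the NFSv4.1 XDR; the function itself is an invented illustration, not normative.

```python
# Constants from the NFSv4.1 XDR; the predicate is an illustrative sketch.
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_DENY_READ    = 0x00000001
OPEN4_SHARE_DENY_BOTH    = 0x00000003

def open_conflicts_with_deleg(deleg_type, share_access, share_deny,
                              truncating_create):
    # Any open by another client conflicts with a write delegation.
    if deleg_type == "OPEN_DELEGATE_WRITE":
        return True
    # For a read delegation, only the three listed cases conflict.
    if share_access & OPEN4_SHARE_ACCESS_WRITE:
        return True
    if share_deny in (OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_BOTH):
        return True
    return truncating_create   # OPEN4_CREATE/UNCHECKED4 with size zero
```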
23041 See also Section 2.10.6.4. 23043 18.17. Operation 19: OPENATTR - Open Named Attribute Directory 23044 18.17.1. ARGUMENTS 23046 struct OPENATTR4args { 23047 /* CURRENT_FH: object */ 23048 bool createdir; 23049 }; 23051 18.17.2. RESULTS 23053 struct OPENATTR4res { 23054 /* 23055 * If status is NFS4_OK, 23056 * new CURRENT_FH: named attribute 23057 * directory 23058 */ 23059 nfsstat4 status; 23060 }; 23062 18.17.3. DESCRIPTION 23064 The OPENATTR operation is used to obtain the filehandle of the named 23065 attribute directory associated with the current filehandle. The 23066 result of the OPENATTR will be a filehandle to an object of type 23067 NF4ATTRDIR. From this filehandle, READDIR and LOOKUP operations can 23068 be used to obtain filehandles for the various named attributes 23069 associated with the original file system object. Filehandles 23070 returned within the named attribute directory will designate objects 23071 of type NF4NAMEDATTR. 23073 The createdir argument allows the client to signify if a named 23074 attribute directory should be created as a result of the OPENATTR 23075 operation. Some clients may use the OPENATTR operation with a value 23076 of FALSE for createdir to determine if any named attributes exist for 23077 the object. If none exist, then NFS4ERR_NOENT will be returned. If 23078 createdir has a value of TRUE and no named attribute directory 23079 exists, one is created and its filehandle becomes the current 23080 filehandle. On the other hand, if createdir has a value of TRUE and 23081 the named attribute directory already exists, no error results and 23082 the filehandle of the existing directory becomes the current 23083 filehandle. The creation of a named attribute directory assumes that 23084 the server has implemented named attribute support in this fashion 23085 and is not required to do so by this definition.
23087 If the current file handle designates an object of type NF4NAMEDATTR 23088 (a named attribute) or NF4ATTRDIR (a named attribute directory), an 23089 error of NFS4ERR_WRONG_TYPE is returned to the client. Named 23090 attributes or a named attribute directory MUST NOT have their own 23091 named attributes. 23093 18.17.4. IMPLEMENTATION 23095 If the server does not support named attributes for the current 23096 filehandle, an error of NFS4ERR_NOTSUPP will be returned to the 23097 client. 23099 18.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access 23101 18.18.1. ARGUMENTS 23103 struct OPEN_DOWNGRADE4args { 23104 /* CURRENT_FH: opened file */ 23105 stateid4 open_stateid; 23106 seqid4 seqid; 23107 uint32_t share_access; 23108 uint32_t share_deny; 23109 }; 23111 18.18.2. RESULTS 23113 struct OPEN_DOWNGRADE4resok { 23114 stateid4 open_stateid; 23115 }; 23117 union OPEN_DOWNGRADE4res switch(nfsstat4 status) { 23118 case NFS4_OK: 23119 OPEN_DOWNGRADE4resok resok4; 23120 default: 23121 void; 23122 }; 23124 18.18.3. DESCRIPTION 23126 This operation is used to adjust the access and deny states for a 23127 given open. This is necessary when a given open-owner opens the same 23128 file multiple times with different access and deny values. In this 23129 situation, a close of one of the opens may change the appropriate 23130 share_access and share_deny flags to remove bits associated with 23131 opens no longer in effect. 23133 Valid values for the expression (share_access & 23134 ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) are OPEN4_SHARE_ACCESS_READ, 23135 OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH. If the client 23136 specifies other values, the server MUST reply with NFS4ERR_INVAL. 23138 Valid values for the share_deny field are OPEN4_SHARE_DENY_NONE, 23139 OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or 23140 OPEN4_SHARE_DENY_BOTH. If the client specifies other values, the 23141 server MUST reply with NFS4ERR_INVAL. 
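The OPEN_DOWNGRADE argument checks in the DESCRIPTION above amount to two membership tests after masking off any WANT values. A sketch follows; the constants are the values from the NFSv4.1 XDR, while the function name is invented for illustration.

```python
# Constants from the NFSv4.1 XDR; the checker is an illustrative sketch.
OPEN4_SHARE_ACCESS_READ  = 0x00000001
OPEN4_SHARE_ACCESS_WRITE = 0x00000002
OPEN4_SHARE_ACCESS_BOTH  = 0x00000003
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00

OPEN4_SHARE_DENY_NONE  = 0x00000000
OPEN4_SHARE_DENY_READ  = 0x00000001
OPEN4_SHARE_DENY_WRITE = 0x00000002
OPEN4_SHARE_DENY_BOTH  = 0x00000003

def check_open_downgrade_args(share_access, share_deny):
    # (share_access & ~WANT_DELEG_MASK) must be READ, WRITE, or BOTH.
    access = share_access & ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK
    if access not in (OPEN4_SHARE_ACCESS_READ,
                      OPEN4_SHARE_ACCESS_WRITE,
                      OPEN4_SHARE_ACCESS_BOTH):
        return "NFS4ERR_INVAL"
    # share_deny must be one of the four defined deny values.
    if share_deny not in (OPEN4_SHARE_DENY_NONE,
                          OPEN4_SHARE_DENY_READ,
                          OPEN4_SHARE_DENY_WRITE,
                          OPEN4_SHARE_DENY_BOTH):
        return "NFS4ERR_INVAL"
    return "NFS4_OK"
```

Only after these checks pass does the server apply the subset-union constraints given in the text that follows.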
23143 After checking for valid values of share_access and share_deny, the 23144 server replaces the current access and deny modes on the file with 23145 share_access and share_deny subject to the following constraints: 23147 o The bits in share_access SHOULD equal the union of the 23148 share_access bits (not including OPEN4_SHARE_WANT_* bits) 23149 specified for some subset of the OPENs in effect for the current 23150 open-owner on the current file. 23152 o The bits in share_deny SHOULD equal the union of the share_deny 23153 bits specified for some subset of the OPENs in effect for the 23154 current open-owner on the current file. 23156 If the above constraints are not respected, the server SHOULD return 23157 the error NFS4ERR_INVAL. Since share_access and share_deny bits 23158 should be subsets of those already granted, short of a defect in the 23159 client or server implementation, it is not possible for the 23160 OPEN_DOWNGRADE request to be denied because of conflicting share 23161 reservations. 23163 The seqid argument is not used in NFSv4.1, MAY be any value, and MUST 23164 be ignored by the server. 23166 On success, the current filehandle retains its value. 23168 18.18.4. IMPLEMENTATION 23170 An OPEN_DOWNGRADE operation may make OPEN_DELEGATE_READ delegations 23171 grantable where they were not previously. Servers may choose to 23172 respond immediately if there are pending delegation want requests or 23173 may respond to the situation at a later time. 23175 18.19. Operation 22: PUTFH - Set Current Filehandle 23177 18.19.1. ARGUMENTS 23179 struct PUTFH4args { 23180 nfs_fh4 object; 23181 }; 23183 18.19.2. RESULTS 23185 struct PUTFH4res { 23186 /* 23187 * If status is NFS4_OK, 23188 * new CURRENT_FH: argument to PUTFH 23189 */ 23190 nfsstat4 status; 23191 }; 23193 18.19.3. DESCRIPTION 23195 This operation replaces the current filehandle with the filehandle 23196 provided as an argument. It clears the current stateid. 
23198 If the security mechanism used by the requester does not meet the 23199 requirements of the filehandle provided to this operation, the server 23200 MUST return NFS4ERR_WRONGSEC. 23202 See Section 16.2.3.1.1 for more details on the current filehandle. 23204 See Section 16.2.3.1.2 for more details on the current stateid. 23206 18.19.4. IMPLEMENTATION 23208 This operation is used in an NFS request to set the context for file 23209 accessing operations that follow in the same COMPOUND request. 23211 18.20. Operation 23: PUTPUBFH - Set Public Filehandle 23213 18.20.1. ARGUMENT 23215 void; 23217 18.20.2. RESULT 23219 struct PUTPUBFH4res { 23220 /* 23221 * If status is NFS4_OK, 23222 * new CURRENT_FH: public fh 23223 */ 23224 nfsstat4 status; 23225 }; 23227 18.20.3. DESCRIPTION 23229 This operation replaces the current filehandle with the filehandle 23230 that represents the public filehandle of the server's namespace. 23231 This filehandle may be different from the "root" filehandle that may 23232 be associated with some other directory on the server. 23234 PUTPUBFH also clears the current stateid. 23236 The public filehandle represents the concepts embodied in RFC 2054 23237 [48], RFC 2055 [49], and RFC 2224 [60]. The intent for NFSv4.1 is 23238 that the public filehandle (represented by the PUTPUBFH operation) be 23239 used as a method of providing WebNFS server compatibility with NFSv3. 23241 The public filehandle and the root filehandle (represented by the 23242 PUTROOTFH operation) SHOULD be equivalent. If the public and root 23243 filehandles are not equivalent, then the directory corresponding to 23244 the public filehandle MUST be a descendant of the directory 23245 corresponding to the root filehandle. 23247 See Section 16.2.3.1.1 for more details on the current filehandle. 23249 See Section 16.2.3.1.2 for more details on the current stateid. 23251 18.20.4. 
IMPLEMENTATION 23253 This operation is used in an NFS request to set the context for file 23254 accessing operations that follow in the same COMPOUND request. 23256 With the NFSv3 public filehandle, the client is able to specify 23257 whether the pathname provided in the LOOKUP should be evaluated as 23258 either an absolute path relative to the server's root or relative to 23259 the public filehandle. RFC 2224 [60] contains further discussion of 23260 the functionality. With NFSv4.1, that type of specification is not 23261 directly available in the LOOKUP operation. The reason for this is 23262 because the component separators needed to specify absolute vs. 23263 relative are not allowed in NFSv4. Therefore, the client is 23264 responsible for constructing its request such that the use of either 23265 PUTROOTFH or PUTPUBFH signifies absolute or relative evaluation of an 23266 NFS URL, respectively. 23268 Note that there are warnings mentioned in RFC 2224 [60] with respect 23269 to the use of absolute evaluation and the restrictions the server may 23270 place on that evaluation with respect to how much of its namespace 23271 has been made available. These same warnings apply to NFSv4.1. It 23272 is likely, therefore, that because of server implementation details, 23273 an NFSv3 absolute public filehandle look up may behave differently 23274 than an NFSv4.1 absolute resolution. 23276 There is a form of security negotiation as described in RFC 2755 [61] 23277 that uses the public filehandle and an overloading of the pathname. 23278 This method is not available with NFSv4.1 as filehandles are not 23279 overloaded with special meaning and therefore do not provide the same 23280 framework as NFSv3. Clients should therefore use the security 23281 negotiation mechanisms described in Section 2.6. 23283 18.21. Operation 24: PUTROOTFH - Set Root Filehandle 23285 18.21.1. ARGUMENTS 23287 void; 23289 18.21.2. 
RESULTS 23291 struct PUTROOTFH4res { 23292 /* 23293 * If status is NFS4_OK, 23294 * new CURRENT_FH: root fh 23295 */ 23296 nfsstat4 status; 23297 }; 23299 18.21.3. DESCRIPTION 23301 This operation replaces the current filehandle with the filehandle 23302 that represents the root of the server's namespace. From this 23303 filehandle, a LOOKUP operation can locate any other filehandle on the 23304 server. This filehandle may be different from the "public" 23305 filehandle that may be associated with some other directory on the 23306 server. 23308 PUTROOTFH also clears the current stateid. 23310 See Section 16.2.3.1.1 for more details on the current filehandle. 23312 See Section 16.2.3.1.2 for more details on the current stateid. 23314 18.21.4. IMPLEMENTATION 23316 This operation is used in an NFS request to set the context for file 23317 accessing operations that follow in the same COMPOUND request. 23319 18.22. Operation 25: READ - Read from File 23321 18.22.1. ARGUMENTS 23323 struct READ4args { 23324 /* CURRENT_FH: file */ 23325 stateid4 stateid; 23326 offset4 offset; 23327 count4 count; 23328 }; 23330 18.22.2. RESULTS 23332 struct READ4resok { 23333 bool eof; 23334 opaque data<>; 23335 }; 23337 union READ4res switch (nfsstat4 status) { 23338 case NFS4_OK: 23339 READ4resok resok4; 23340 default: 23341 void; 23342 }; 23344 18.22.3. DESCRIPTION 23346 The READ operation reads data from the regular file identified by the 23347 current filehandle. 23349 The client provides an offset of where the READ is to start and a 23350 count of how many bytes are to be read. An offset of zero means to 23351 read data starting at the beginning of the file. If offset is 23352 greater than or equal to the size of the file, the status NFS4_OK is 23353 returned with a data length set to zero and eof is set to TRUE. The 23354 READ is subject to access permissions checking. 
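The offset, count, and eof rules of the READ DESCRIPTION (including the end-of-file and empty-file cases) can be modeled as below. This is a hypothetical sketch over an in-memory byte string, not an implementation; the function name is invented.

```python
def nfs_read(data, offset, count):
    """Model of READ result computation: returns (data, eof)."""
    size = len(data)
    if offset >= size:
        # Reading at or past end-of-file: success, zero-length data,
        # eof TRUE. This also covers a successful READ of an empty file.
        return b"", True
    chunk = data[offset:offset + count]   # server may return fewer bytes
    eof = (offset + len(chunk)) >= size   # did the read reach end-of-file?
    return chunk, eof
```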
23356 If the client specifies a count value of zero, the READ succeeds and 23357 returns zero bytes of data again subject to access permissions 23358 checking. The server may choose to return fewer bytes than specified 23359 by the client. The client needs to check for this condition and 23360 handle the condition appropriately. 23362 Except when special stateids are used, the stateid value for a READ 23363 request represents a value returned from a previous byte-range lock 23364 or share reservation request or the stateid associated with a 23365 delegation. The stateid identifies the associated owners if any and 23366 is used by the server to verify that the associated locks are still 23367 valid (e.g., have not been revoked). 23369 If the read ended at the end-of-file (formally, in a correctly formed 23370 READ operation, if offset + count is equal to the size of the file), 23371 or the READ operation extends beyond the size of the file (if offset 23372 + count is greater than the size of the file), eof is returned as 23373 TRUE; otherwise, it is FALSE. A successful READ of an empty file 23374 will always return eof as TRUE. 23376 If the current filehandle is not an ordinary file, an error will be 23377 returned to the client. In the case that the current filehandle 23378 represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If 23379 the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is 23380 returned. In all other cases, NFS4ERR_WRONG_TYPE is returned. 23382 For a READ with a stateid value of all bits equal to zero, the server 23383 MAY allow the READ to be serviced subject to mandatory byte-range 23384 locks or the current share deny modes for the file. For a READ with 23385 a stateid value of all bits equal to one, the server MAY allow READ 23386 operations to bypass locking checks at the server. 23388 On success, the current filehandle retains its value. 23390 18.22.4. 
IMPLEMENTATION 23392 If the server returns a "short read" (i.e., fewer data than requested 23393 and eof is set to FALSE), the client should send another READ to get 23394 the remaining data. A server may return less data than requested 23395 under several circumstances. The file may have been truncated by 23396 another client or perhaps on the server itself, changing the file 23397 size from what the requesting client believes to be the case. This 23398 would reduce the actual amount of data available to the client. It 23399 is possible that the server has reduced the transfer size and so 23400 returned a short read result. Server resource exhaustion may also 23401 result in a short read. 23403 If mandatory byte-range locking is in effect for the file, and if the 23404 byte-range corresponding to the data to be read from the file is 23405 WRITE_LT locked by an owner not associated with the stateid, the 23406 server will return the NFS4ERR_LOCKED error. The client should try 23407 to get the appropriate READ_LT via the LOCK operation before re- 23408 attempting the READ. When the READ completes, the client should 23409 release the byte-range lock via LOCKU. 23411 If another client has an OPEN_DELEGATE_WRITE delegation for the file 23412 being read, the delegation must be recalled, and the operation cannot 23413 proceed until that delegation is returned or revoked. Except where 23414 this happens very quickly, one or more NFS4ERR_DELAY errors will be 23415 returned to requests made while the delegation remains outstanding. 23416 Normally, delegations will not be recalled as a result of a READ 23417 operation since the recall will occur as a result of an earlier OPEN. 23418 However, since it is possible for a READ to be done with a special 23419 stateid, the server needs to check for this case even though the 23420 client should have done an OPEN previously. 23422 18.23. Operation 26: READDIR - Read Directory 23424 18.23.1.
ARGUMENTS 23426 struct READDIR4args { 23427 /* CURRENT_FH: directory */ 23428 nfs_cookie4 cookie; 23429 verifier4 cookieverf; 23430 count4 dircount; 23431 count4 maxcount; 23432 bitmap4 attr_request; 23433 }; 23435 18.23.2. RESULTS 23436 struct entry4 { 23437 nfs_cookie4 cookie; 23438 component4 name; 23439 fattr4 attrs; 23440 entry4 *nextentry; 23441 }; 23443 struct dirlist4 { 23444 entry4 *entries; 23445 bool eof; 23446 }; 23448 struct READDIR4resok { 23449 verifier4 cookieverf; 23450 dirlist4 reply; 23451 }; 23453 union READDIR4res switch (nfsstat4 status) { 23454 case NFS4_OK: 23455 READDIR4resok resok4; 23456 default: 23457 void; 23458 }; 23460 18.23.3. DESCRIPTION 23462 The READDIR operation retrieves a variable number of entries from a 23463 file system directory and returns client-requested attributes for 23464 each entry along with information to allow the client to request 23465 additional directory entries in a subsequent READDIR. 23467 The arguments contain a cookie value that represents where the 23468 READDIR should start within the directory. A value of zero for the 23469 cookie is used to start reading at the beginning of the directory. 23470 For subsequent READDIR requests, the client specifies a cookie value 23471 that is provided by the server on a previous READDIR request. 23473 The request's cookieverf field should be set to 0 (zero) when the 23474 request's cookie field is zero (first read of the directory). On 23475 subsequent requests, the cookieverf field must match the cookieverf 23476 returned by the READDIR in which the cookie was acquired. If the 23477 server determines that the cookieverf is no longer valid for the 23478 directory, the error NFS4ERR_NOT_SAME must be returned. 23480 The dircount field of the request is a hint of the maximum number of 23481 bytes of directory information that should be returned. This value 23482 represents the total length of the names of the directory entries and 23483 the cookie value for these entries.
This length represents the XDR 23484 encoding of the data (names and cookies) and not the length in the 23485 native format of the server. 23487 The maxcount field of the request represents the maximum total size 23488 of all of the data being returned within the READDIR4resok structure 23489 and includes the XDR overhead. The server MAY return less data. If 23490 the server is unable to return a single directory entry within the 23491 maxcount limit, the error NFS4ERR_TOOSMALL MUST be returned to the 23492 client. 23494 Finally, the request's attr_request field represents the list of 23495 attributes to be returned for each directory entry supplied by the 23496 server. 23498 A successful reply consists of a list of directory entries. Each of 23499 these entries contains the name of the directory entry, a cookie 23500 value for that entry, and the associated attributes as requested. 23501 The "eof" flag has a value of TRUE if there are no more entries in 23502 the directory. 23504 The cookie value is only meaningful to the server and is used as a 23505 cursor for the directory entry. As mentioned, this cookie is used by 23506 the client for subsequent READDIR operations so that it may continue 23507 reading a directory. The cookie is similar in concept to a READ 23508 offset but MUST NOT be interpreted as such by the client. Ideally, 23509 the cookie value SHOULD NOT change if the directory is modified since 23510 the client may be caching these values. 23512 In some cases, the server may encounter an error while obtaining the 23513 attributes for a directory entry. Instead of returning an error for 23514 the entire READDIR operation, the server can instead return the 23515 attribute rdattr_error (Section 5.8.1.12). With this, the server is 23516 able to communicate the failure to the client and not fail the entire 23517 operation in the instance of what might be a transient failure. 
23518 Obviously, the client must request the fattr4_rdattr_error attribute 23519 for this method to work properly. If the client does not request the 23520 attribute, the server has no choice but to return failure for the 23521 entire READDIR operation. 23523 For some file system environments, the directory entries "." and ".." 23524 have special meaning, and in other environments, they do not. If the 23525 server supports these special entries within a directory, they SHOULD 23526 NOT be returned to the client as part of the READDIR response. To 23527 enable some client environments, the cookie values of zero, one, and 23528 two are to be considered reserved. Note that the UNIX client will use 23529 these values when combining the server's response and local 23530 representations to enable a fully formed UNIX directory presentation 23531 to the application. 23533 For READDIR arguments, cookie values of one and two SHOULD NOT be 23534 used, and for READDIR results, cookie values of zero, one, and two 23535 SHOULD NOT be returned. 23537 On success, the current filehandle retains its value. 23539 18.23.4. IMPLEMENTATION 23541 The server's file system directory representations can differ 23542 greatly. A client's programming interfaces may also be bound to the 23543 local operating environment in a way that does not translate well 23544 into the NFS protocol. Therefore, the dircount and 23545 maxcount fields are provided to enable the client to give hints to 23546 the server. If the client is aggressive about attribute collection 23547 during a READDIR, the server has an idea of how to limit the encoded 23548 response. 23550 If dircount is zero, the server bounds the reply's size based on the 23551 request's maxcount field. 23553 The cookieverf may be used by the server to help manage cookie values 23554 that may become stale.
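As a non-normative illustration of the cookie/cookieverf rules above, a client's READDIR loop might be sketched as follows; readdir() here is a hypothetical RPC stub (not part of the protocol) returning (status, cookieverf, entries, eof):

```python
NFS4_OK = 0
NFS4ERR_NOT_SAME = 10027  # server declared the cookieverf stale for this directory

def read_whole_directory(readdir):
    """Drive READDIR to completion.

    readdir(cookie, cookieverf) is a hypothetical stub returning
    (status, cookieverf, entries, eof), where each entry is
    (cookie, name) as in the entry4 structure.
    """
    entries = []
    cookie, cookieverf = 0, 0          # first read: cookie 0, cookieverf 0 (zero)
    while True:
        status, verf, batch, eof = readdir(cookie, cookieverf)
        if status == NFS4ERR_NOT_SAME:
            # The server invalidated our cookies; restart from the beginning.
            entries, cookie, cookieverf = [], 0, 0
            continue
        if status != NFS4_OK:
            raise IOError(status)
        entries.extend(batch)
        if eof or not batch:
            return entries
        cookie = batch[-1][0]          # resume from the last entry's cookie
        cookieverf = verf              # echo the verifier the cookie came with
```

Note how the verifier is always the one returned by the READDIR in which the final cookie of the batch was acquired, as the description requires.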
It should be a rare occurrence that a server 23555 is unable to continue properly reading a directory with the provided 23556 cookie/cookieverf pair. The server SHOULD make every effort to avoid 23557 this condition since the application at the client might be unable to 23558 properly handle this type of failure. 23560 The use of the cookieverf will also protect the client from using 23561 READDIR cookie values that might be stale. For example, if the file 23562 system has been migrated, the server might or might not be able to 23563 use the same cookie values to service READDIR as the previous server 23564 used. With the client providing the cookieverf, the server is able 23565 to provide the appropriate response to the client. This prevents the 23566 case where the server accepts a cookie value but the underlying 23567 directory has changed and the response is invalid from the client's 23568 context of its previous READDIR. 23570 Since some servers will not be returning "." and ".." entries as has 23571 been done with previous versions of the NFS protocol, the client that 23572 requires these entries be present in READDIR responses must fabricate 23573 them. 23575 18.24. Operation 27: READLINK - Read Symbolic Link 23577 18.24.1. ARGUMENTS 23579 /* CURRENT_FH: symlink */ 23580 void; 23582 18.24.2. RESULTS 23584 struct READLINK4resok { 23585 linktext4 link; 23586 }; 23588 union READLINK4res switch (nfsstat4 status) { 23589 case NFS4_OK: 23590 READLINK4resok resok4; 23591 default: 23592 void; 23593 }; 23595 18.24.3. DESCRIPTION 23597 READLINK reads the data associated with a symbolic link. Depending 23598 on the value of the UTF-8 capability attribute (Section 14.4), the 23599 data is encoded in UTF-8. Whether created by an NFS client or 23600 created locally on the server, the data in a symbolic link is not 23601 interpreted (except possibly to check for proper UTF-8 encoding) when 23602 created, but is simply stored. 
23604 On success, the current filehandle retains its value. 23606 18.24.4. IMPLEMENTATION 23608 A symbolic link is nominally a pointer to another file. The data is 23609 not necessarily interpreted by the server, just stored in the file. 23610 It is possible for a client implementation to store a pathname that 23611 is not meaningful to the server operating system in a symbolic link. 23612 A READLINK operation returns the data to the client for 23613 interpretation. If different implementations want to share access to 23614 symbolic links, then they must agree on the interpretation of the 23615 data in the symbolic link. 23617 The READLINK operation is only allowed on objects of type NF4LNK. 23618 The server should return the error NFS4ERR_WRONG_TYPE if the object 23619 is not of type NF4LNK. 23621 18.25. Operation 28: REMOVE - Remove File System Object 23623 18.25.1. ARGUMENTS 23625 struct REMOVE4args { 23626 /* CURRENT_FH: directory */ 23627 component4 target; 23628 }; 23630 18.25.2. RESULTS 23632 struct REMOVE4resok { 23633 change_info4 cinfo; 23634 }; 23636 union REMOVE4res switch (nfsstat4 status) { 23637 case NFS4_OK: 23638 REMOVE4resok resok4; 23639 default: 23640 void; 23641 }; 23643 18.25.3. DESCRIPTION 23645 The REMOVE operation removes (deletes) a directory entry named by 23646 filename from the directory corresponding to the current filehandle. 23647 If the entry in the directory was the last reference to the 23648 corresponding file system object, the object may be destroyed. The 23649 directory may be either of type NF4DIR or NF4ATTRDIR. 23651 For the directory where the filename was removed, the server returns 23652 change_info4 information in cinfo. With the atomic field of the 23653 change_info4 data type, the server will indicate if the before and 23654 after change attributes were obtained atomically with respect to the 23655 removal. 
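The change_info4 returned by REMOVE is typically used by clients to decide whether a cached copy of the directory can be patched in place. A minimal sketch of that decision, with hypothetical names (ChangeInfo4, cache_still_valid are illustrative, not protocol elements):

```python
from dataclasses import dataclass

@dataclass
class ChangeInfo4:
    atomic: bool
    before: int   # directory change attribute before the removal
    after: int    # directory change attribute after the removal

def cache_still_valid(cinfo, cached_change):
    """Return True if a cached directory listing may be updated in place.

    If before/after were captured atomically with the removal and
    'before' matches the change attribute the client cached, no other
    modification intervened: the client may drop the removed entry and
    record 'after' as the new change value.  Otherwise the cached
    listing must be invalidated and re-read.
    """
    return cinfo.atomic and cinfo.before == cached_change
```

This is one conventional client-side interpretation of the atomic field; servers only guarantee the semantics stated above.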
23657 If the target has a length of zero, or if the target does not obey 23658 the UTF-8 definition (and the server is enforcing UTF-8 encoding; see 23659 Section 14.4), the error NFS4ERR_INVAL will be returned. 23661 On success, the current filehandle retains its value. 23663 18.25.4. IMPLEMENTATION 23665 NFSv3 required a different operator RMDIR for directory removal and 23666 REMOVE for non-directory removal. This allowed clients to skip 23667 checking the file type when being passed a non-directory delete 23668 system call (e.g., unlink() [24] in POSIX) to remove a directory, as 23669 well as the converse (e.g., a rmdir() on a non-directory) because 23670 they knew the server would check the file type. NFSv4.1 REMOVE can 23671 be used to delete any directory entry independent of its file type. 23672 The implementor of an NFSv4.1 client's entry points from the unlink() 23673 and rmdir() system calls should first check the file type against the 23674 types the system call is allowed to remove before sending a REMOVE 23675 operation. Alternatively, the implementor can produce a COMPOUND 23676 call that includes a LOOKUP/VERIFY sequence of operations to verify 23677 the file type before a REMOVE operation in the same COMPOUND call. 23679 The concept of last reference is server specific. However, if the 23680 numlinks field in the previous attributes of the object had the value 23681 1, the client should not rely on referring to the object via a 23682 filehandle. Likewise, the client should not rely on the resources 23683 (disk space, directory entry, and so on) formerly associated with the 23684 object becoming immediately available. Thus, if a client needs to be 23685 able to continue to access a file after using REMOVE to remove it, 23686 the client should take steps to make sure that the file will still be 23687 accessible. 
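The entry-point type check suggested above for unlink() and rmdir() can be sketched as follows (a non-normative illustration; check_remove_type is a hypothetical helper, and the object type would in practice come from a cached or freshly fetched type attribute):

```python
# NFSv4 object types (subset) from the nfs_ftype4 enum
NF4REG, NF4DIR, NF4LNK = 1, 2, 5

def check_remove_type(obj_type, from_rmdir):
    """Client-side gate before sending REMOVE: rmdir() may only remove
    directories, and unlink() only non-directories, even though the
    REMOVE operation itself would accept either."""
    if from_rmdir and obj_type != NF4DIR:
        raise NotADirectoryError("rmdir() on a non-directory")
    if not from_rmdir and obj_type == NF4DIR:
        raise IsADirectoryError("unlink() on a directory")
```

Alternatively, as the text notes, the same effect can be had server-side by prefixing the REMOVE with a LOOKUP/VERIFY sequence in the same COMPOUND.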
While the traditional mechanism used is to RENAME the 23688 file from its old name to a new hidden name, the NFSv4.1 OPEN 23689 operation MAY return a result flag, OPEN4_RESULT_PRESERVE_UNLINKED, 23690 which indicates to the client that the file will be preserved if the 23691 file has an outstanding open (see Section 18.16). 23693 If the server finds that the file is still open when the REMOVE 23694 arrives: 23696 o The server SHOULD NOT delete the file's directory entry if the 23697 file was opened with OPEN4_SHARE_DENY_WRITE or 23698 OPEN4_SHARE_DENY_BOTH. 23700 o If the file was not opened with OPEN4_SHARE_DENY_WRITE or 23701 OPEN4_SHARE_DENY_BOTH, the server SHOULD delete the file's 23702 directory entry. However, until last CLOSE of the file, the 23703 server MAY continue to allow access to the file via its 23704 filehandle. 23706 o The server MUST NOT delete the directory entry if the reply from 23707 OPEN had the flag OPEN4_RESULT_PRESERVE_UNLINKED set. 23709 The server MAY implement its own restrictions on removal of a file 23710 while it is open. The server might disallow such a REMOVE (or a 23711 removal that occurs as part of RENAME). The conditions that 23712 influence the restrictions on removal of a file while it is still 23713 open include: 23715 o Whether certain access protocols (i.e., not just NFS) are holding 23716 the file open. 23718 o Whether particular options, access modes, or policies on the 23719 server are enabled. 23721 If a file has an outstanding OPEN and this prevents the removal of 23722 the file's directory entry, the error NFS4ERR_FILE_OPEN is returned. 23724 Where the determination above cannot be made definitively because 23725 delegations are being held, they MUST be recalled to allow processing 23726 of the REMOVE to continue. 
When a delegation is held, the server has 23727 no reliable knowledge of the status of OPENs for that client, so 23728 unless there are files opened with the particular deny modes by 23729 clients without delegations, the determination cannot be made until 23730 delegations are recalled, and the operation cannot proceed until each 23731 sufficient delegation has been returned or revoked to allow the 23732 server to make a correct determination. 23734 In all cases in which delegations are recalled, the server is likely 23735 to return one or more NFS4ERR_DELAY errors while delegations remain 23736 outstanding. 23738 If the current filehandle designates a directory for which another 23739 client holds a directory delegation, then, unless the situation can 23740 be resolved by sending a notification, the directory delegation MUST 23741 be recalled, and the operation MUST NOT proceed until the delegation 23742 is returned or revoked. Except where this happens very quickly, one 23743 or more NFS4ERR_DELAY errors will be returned to requests made while 23744 delegation remains outstanding. 23746 When the current filehandle designates a directory for which one or 23747 more directory delegations exist, then, when those delegations 23748 request such notifications, NOTIFY4_REMOVE_ENTRY will be generated as 23749 a result of this operation. 23751 Note that when a remove occurs as a result of a RENAME, 23752 NOTIFY4_REMOVE_ENTRY will only be generated if the removal happens as 23753 a separate operation. In the case in which the removal is integrated 23754 and atomic with RENAME, the notification of the removal is integrated 23755 with notification for the RENAME. See the discussion of the 23756 NOTIFY4_RENAME_ENTRY notification in Section 20.4. 23758 18.26. Operation 29: RENAME - Rename Directory Entry 23760 18.26.1. 
ARGUMENTS 23762 struct RENAME4args { 23763 /* SAVED_FH: source directory */ 23764 component4 oldname; 23765 /* CURRENT_FH: target directory */ 23766 component4 newname; 23767 }; 23769 18.26.2. RESULTS 23771 struct RENAME4resok { 23772 change_info4 source_cinfo; 23773 change_info4 target_cinfo; 23774 }; 23776 union RENAME4res switch (nfsstat4 status) { 23777 case NFS4_OK: 23778 RENAME4resok resok4; 23779 default: 23780 void; 23781 }; 23783 18.26.3. DESCRIPTION 23785 The RENAME operation renames the object identified by oldname in the 23786 source directory corresponding to the saved filehandle, as set by the 23787 SAVEFH operation, to newname in the target directory corresponding to 23788 the current filehandle. The operation is required to be atomic to 23789 the client. Source and target directories MUST reside on the same 23790 file system on the server. On success, the current filehandle will 23791 continue to be the target directory. 23793 If the target directory already contains an entry with the name 23794 newname, the source object MUST be compatible with the target: either 23795 both are non-directories or both are directories and the target MUST 23796 be empty. If compatible, the existing target is removed before the 23797 rename occurs or, preferably, the target is removed atomically as 23798 part of the rename. See Section 18.25.4 for client and server 23799 actions whenever a target is removed. Note however that when the 23800 removal is performed atomically with the rename, certain parts of the 23801 removal described there are integrated with the rename. For example, 23802 notification of the removal will not be via a NOTIFY4_REMOVE_ENTRY 23803 but will be indicated as part of the NOTIFY4_ADD_ENTRY or 23804 NOTIFY4_RENAME_ENTRY generated by the rename. 23806 If the source object and the target are not compatible or if the 23807 target is a directory but not empty, the server will return the error 23808 NFS4ERR_EXIST. 
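The target-compatibility rules above can be summarized in a short sketch (non-normative; target_compatibility and the is_dir/is_empty attributes are hypothetical, standing in for whatever the server's file system exposes):

```python
NFS4_OK, NFS4ERR_EXIST = 0, 17

def target_compatibility(source_is_dir, target):
    """Apply the rules for an existing newname.

    'target' is None if newname does not exist, else an object with
    is_dir and is_empty attributes.  Returns NFS4_OK if the rename may
    proceed (removing any existing target), else NFS4ERR_EXIST.
    """
    if target is None:
        return NFS4_OK
    if source_is_dir != target.is_dir:
        return NFS4ERR_EXIST            # dir vs. non-dir: incompatible
    if target.is_dir and not target.is_empty:
        return NFS4ERR_EXIST            # a target directory must be empty
    return NFS4_OK
```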
23810 If oldname and newname both refer to the same file (e.g., they might 23811 be hard links of each other), then unless the file is open (see 23812 Section 18.26.4), RENAME MUST perform no action and return NFS4_OK. 23814 For both directories involved in the RENAME, the server returns 23815 change_info4 information. With the atomic field of the change_info4 23816 data type, the server will indicate if the before and after change 23817 attributes were obtained atomically with respect to the rename. 23819 If oldname refers to a named attribute and the saved and current 23820 filehandles refer to different file system objects, the server will 23821 return NFS4ERR_XDEV just as if the saved and current filehandles 23822 represented directories on different file systems. 23824 If oldname or newname has a length of zero, or if oldname or newname 23825 does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be 23826 returned. 23828 18.26.4. IMPLEMENTATION 23830 The server MAY impose restrictions on the RENAME operation such that 23831 RENAME may not be done when the file being renamed is open or when 23832 that open is done by particular protocols, or with particular options 23833 or access modes. Similar restrictions may be applied when a file 23834 exists with the target name and is open. When RENAME is rejected 23835 because of such restrictions, the error NFS4ERR_FILE_OPEN is 23836 returned. 23838 When oldname and newname refer to the same file and that file is open 23839 in a fashion such that RENAME would normally be rejected with 23840 NFS4ERR_FILE_OPEN if oldname and newname were different files, then 23841 RENAME SHOULD be rejected with NFS4ERR_FILE_OPEN. 23843 If a server does implement such restrictions and those restrictions 23844 include cases of NFSv4 opens preventing successful execution of a 23845 rename, the server needs to recall any delegations that could hide 23846 the existence of opens relevant to that decision.
This is because 23847 when a client holds a delegation, the server might not have an 23848 accurate account of the opens for that client, since the client may 23849 execute OPENs and CLOSEs locally. The RENAME operation need only be 23850 delayed until a definitive result can be obtained. For example, if 23851 there are multiple delegations and one of them establishes an open 23852 whose presence would prevent the rename, given the server's 23853 semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon as 23854 that delegation is returned without waiting for other delegations to 23855 be returned. Similarly, if such opens are not associated with 23856 delegations, NFS4ERR_FILE_OPEN can be returned immediately with no 23857 delegation recall being done. 23859 If the current filehandle or the saved filehandle designates a 23860 directory for which another client holds a directory delegation, 23861 then, unless the situation can be resolved by sending a notification, 23862 the delegation MUST be recalled, and the operation cannot proceed 23863 until the delegation is returned or revoked. Except where this 23864 happens very quickly, one or more NFS4ERR_DELAY errors will be 23865 returned to requests made while delegation remains outstanding. 23867 When the current and saved filehandles are the same and they 23868 designate a directory for which one or more directory delegations 23869 exist, then, when those delegations request such notifications, a 23870 notification of type NOTIFY4_RENAME_ENTRY will be generated as a 23871 result of this operation. When oldname and newname refer to the same 23872 file, no notification is generated (because, as Section 18.26.3 23873 states, the server MUST take no action). When a file is removed 23874 because it has the same name as the target, if that removal is done 23875 atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification will 23876 not be generated.
Instead, the deletion of the file will be reported 23877 as part of the NOTIFY4_RENAME_ENTRY notification. 23879 When the current and saved filehandles are not the same: 23881 o If the current filehandle designates a directory for which one or 23882 more directory delegations exist, then, when those delegations 23883 request such notifications, NOTIFY4_ADD_ENTRY will be generated as 23884 a result of this operation. When a file is removed because it has 23885 the same name as the target, if that removal is done atomically 23886 with the rename, a NOTIFY4_REMOVE_ENTRY notification will not be 23887 generated. Instead, the deletion of the file will be reported as 23888 part of the NOTIFY4_ADD_ENTRY notification. 23890 o If the saved filehandle designates a directory for which one or 23891 more directory delegations exist, then, when those delegations 23892 request such notifications, NOTIFY4_REMOVE_ENTRY will be generated 23893 as a result of this operation. 23895 If the object being renamed has file delegations held by clients 23896 other than the one doing the RENAME, the delegations MUST be 23897 recalled, and the operation cannot proceed until each such delegation 23898 is returned or revoked. Note that in the case of multiply linked 23899 files, the delegation recall requirement applies even if the 23900 delegation was obtained through a different name than the one being 23901 renamed. In all cases in which delegations are recalled, the server 23902 is likely to return one or more NFS4ERR_DELAY errors while the 23903 delegation(s) remains outstanding, although it might not do that if 23904 the delegations are returned quickly. 23906 The RENAME operation must be atomic to the client. The statement 23907 "source and target directories MUST reside on the same file system on 23908 the server" means that the fsid fields in the attributes for the 23909 directories are the same. If they reside on different file systems, 23910 the error NFS4ERR_XDEV is returned. 
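The notification cases above can be condensed into a small decision sketch (non-normative; symbolic string stand-ins are used here in place of the protocol's notify4 bit values, and it is assumed the delegations requested these notifications and that any target removal is atomic with the rename):

```python
NOTIFY4_ADD_ENTRY = 'add_entry'
NOTIFY4_REMOVE_ENTRY = 'remove_entry'
NOTIFY4_RENAME_ENTRY = 'rename_entry'

def rename_notifications(same_directory, same_file):
    """Which directory notifications a RENAME produces.

    same_directory: current and saved filehandles designate the same
    directory.  same_file: oldname and newname refer to the same file,
    in which case the server takes no action and notifies nothing.
    """
    if same_file:
        return []
    if same_directory:
        return [NOTIFY4_RENAME_ENTRY]
    # Distinct source and target directories: a removal is reported at
    # the source and an addition at the target.
    return [NOTIFY4_REMOVE_ENTRY, NOTIFY4_ADD_ENTRY]
```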
23912 Based on the value of the fh_expire_type attribute for the object, 23913 the filehandle may or may not expire on a RENAME. However, server 23914 implementors are strongly encouraged to attempt to keep filehandles 23915 from expiring in this fashion. 23917 On some servers, the file names "." and ".." are illegal as either 23918 oldname or newname, and will result in the error NFS4ERR_BADNAME. In 23919 addition, on many servers the case of oldname or newname being an 23920 alias for the source directory will be checked for. Such servers 23921 will return the error NFS4ERR_INVAL in these cases. 23923 If either of the source or target filehandles are not directories, 23924 the server will return NFS4ERR_NOTDIR. 23926 18.27. Operation 31: RESTOREFH - Restore Saved Filehandle 23928 18.27.1. ARGUMENTS 23930 /* SAVED_FH: */ 23931 void; 23933 18.27.2. RESULTS 23935 struct RESTOREFH4res { 23936 /* 23937 * If status is NFS4_OK, 23938 * new CURRENT_FH: value of saved fh 23939 */ 23940 nfsstat4 status; 23941 }; 23943 18.27.3. DESCRIPTION 23945 The RESTOREFH operation sets the current filehandle and stateid to 23946 the values in the saved filehandle and stateid. If there is no saved 23947 filehandle, then the server will return the error 23948 NFS4ERR_NOFILEHANDLE. 23950 See Section 16.2.3.1.1 for more details on the current filehandle. 23952 See Section 16.2.3.1.2 for more details on the current stateid. 23954 18.27.4. IMPLEMENTATION 23956 Operations like OPEN and LOOKUP use the current filehandle to 23957 represent a directory and replace it with a new filehandle. Assuming 23958 that the previous filehandle was saved with a SAVEFH operator, the 23959 previous filehandle can be restored as the current filehandle. 
This 23960 is commonly used to obtain post-operation attributes for the 23961 directory, e.g., 23963 PUTFH (directory filehandle) 23964 SAVEFH 23965 GETATTR attrbits (pre-op dir attrs) 23966 CREATE optbits "foo" attrs 23967 GETATTR attrbits (file attributes) 23968 RESTOREFH 23969 GETATTR attrbits (post-op dir attrs) 23971 18.28. Operation 32: SAVEFH - Save Current Filehandle 23973 18.28.1. ARGUMENTS 23975 /* CURRENT_FH: */ 23976 void; 23978 18.28.2. RESULTS 23980 struct SAVEFH4res { 23981 /* 23982 * If status is NFS4_OK, 23983 * new SAVED_FH: value of current fh 23984 */ 23985 nfsstat4 status; 23986 }; 23988 18.28.3. DESCRIPTION 23990 The SAVEFH operation saves the current filehandle and stateid. If a 23991 previous filehandle was saved, then it is no longer accessible. The 23992 saved filehandle can be restored as the current filehandle with the 23993 RESTOREFH operator. 23995 On success, the current filehandle retains its value. 23997 See Section 16.2.3.1.1 for more details on the current filehandle. 23999 See Section 16.2.3.1.2 for more details on the current stateid. 24001 18.28.4. IMPLEMENTATION 24003 18.29. Operation 33: SECINFO - Obtain Available Security 24005 18.29.1. ARGUMENTS 24007 struct SECINFO4args { 24008 /* CURRENT_FH: directory */ 24009 component4 name; 24010 }; 24012 18.29.2. 
RESULTS 24013 /* 24014 * From RFC 2203 24015 */ 24016 enum rpc_gss_svc_t { 24017 RPC_GSS_SVC_NONE = 1, 24018 RPC_GSS_SVC_INTEGRITY = 2, 24019 RPC_GSS_SVC_PRIVACY = 3 24020 }; 24022 struct rpcsec_gss_info { 24023 sec_oid4 oid; 24024 qop4 qop; 24025 rpc_gss_svc_t service; 24026 }; 24028 /* RPCSEC_GSS has a value of '6' - See RFC 2203 */ 24029 union secinfo4 switch (uint32_t flavor) { 24030 case RPCSEC_GSS: 24031 rpcsec_gss_info flavor_info; 24032 default: 24033 void; 24034 }; 24036 typedef secinfo4 SECINFO4resok<>; 24038 union SECINFO4res switch (nfsstat4 status) { 24039 case NFS4_OK: 24040 /* CURRENTFH: consumed */ 24041 SECINFO4resok resok4; 24042 default: 24043 void; 24044 }; 24046 18.29.3. DESCRIPTION 24048 The SECINFO operation is used by the client to obtain a list of valid 24049 RPC authentication flavors for a specific directory filehandle, file 24050 name pair. SECINFO should apply the same access methodology used for 24051 LOOKUP when evaluating the name. Therefore, if the requester does 24052 not have the appropriate access to LOOKUP the name, then SECINFO MUST 24053 behave the same way and return NFS4ERR_ACCESS. 24055 The result will contain an array that represents the security 24056 mechanisms available, with an order corresponding to the server's 24057 preferences, the most preferred being first in the array. The client 24058 is free to pick whatever security mechanism it both desires and 24059 supports, or to pick in the server's preference order the first one 24060 it supports. The array entries are represented by the secinfo4 24061 structure. The field 'flavor' will contain a value of AUTH_NONE, 24062 AUTH_SYS (as defined in RFC 5531 [3]), or RPCSEC_GSS (as defined in 24063 RFC 2203 [4]). The field flavor can also be any other security 24064 flavor registered with IANA. 24066 For the flavors AUTH_NONE and AUTH_SYS, no additional security 24067 information is returned. 
The same is true of many (if not most) 24068 other security flavors, including AUTH_DH. For a return value of 24069 RPCSEC_GSS, a security triple is returned that contains the mechanism 24070 object identifier (OID, as defined in RFC 2743 [7]), the quality of 24071 protection (as defined in RFC 2743 [7]), and the service type (as 24072 defined in RFC 2203 [4]). It is possible for SECINFO to return 24073 multiple entries with flavor equal to RPCSEC_GSS with different 24074 security triple values. 24076 On success, the current filehandle is consumed (see 24077 Section 2.6.3.1.1.8), and if the next operation after SECINFO tries 24078 to use the current filehandle, that operation will fail with the 24079 status NFS4ERR_NOFILEHANDLE. 24081 If the name has a length of zero, or if the name does not obey the 24082 UTF-8 definition (assuming UTF-8 capabilities are enabled; see 24083 Section 14.4), the error NFS4ERR_INVAL will be returned. 24085 See Section 2.6 for additional information on the use of SECINFO. 24087 18.29.4. IMPLEMENTATION 24089 The SECINFO operation is expected to be used by the NFS client when 24090 the error value of NFS4ERR_WRONGSEC is returned from another NFS 24091 operation. This signifies to the client that the server's security 24092 policy is different from what the client is currently using. At this 24093 point, the client is expected to obtain a list of possible security 24094 flavors and choose what best suits its policies. 24096 As mentioned, the server's security policies will determine when a 24097 client request receives NFS4ERR_WRONGSEC. See Table 8 for a list of 24098 operations that can return NFS4ERR_WRONGSEC. In addition, when 24099 READDIR returns attributes, the rdattr_error (Section 5.8.1.12) can 24100 contain NFS4ERR_WRONGSEC. Note that CREATE and REMOVE MUST NOT 24101 return NFS4ERR_WRONGSEC. 
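A client consuming the SECINFO result array might implement the "first supported entry in server preference order" strategy described above; a minimal sketch (pick_flavor and the tuple shape are hypothetical, not protocol elements):

```python
AUTH_NONE, AUTH_SYS, RPCSEC_GSS = 0, 1, 6   # flavor numbers per RFCs 5531 and 2203

def pick_flavor(secinfo_reply, supported):
    """Choose a security flavor from a SECINFO result.

    secinfo_reply is the server-ordered array (most preferred first);
    each item is (flavor, flavor_info), with flavor_info None except
    for RPCSEC_GSS, where it would carry the (oid, qop, service)
    triple.  'supported' is a client predicate (e.g., "do I have this
    GSS mechanism?").  Taking the first supported entry honors the
    server's preference order.
    """
    for entry in secinfo_reply:
        if supported(entry):
            return entry
    return None
```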
The rationale for CREATE is that unless the 24102 target name exists, it cannot have a separate security policy from 24103 the parent directory, and the security policy of the parent was 24104 checked when its filehandle was injected into the COMPOUND request's 24105 operations stream (for similar reasons, an OPEN operation that 24106 creates the target MUST NOT return NFS4ERR_WRONGSEC). If the target 24107 name exists, while it might have a separate security policy, that is 24108 irrelevant because CREATE MUST return NFS4ERR_EXIST. The rationale 24109 for REMOVE is that while that target might have a separate security 24110 policy, the target is going to be removed, and so the security policy 24111 of the parent trumps that of the object being removed. RENAME and 24112 LINK MAY return NFS4ERR_WRONGSEC, but the NFS4ERR_WRONGSEC error 24113 applies only to the saved filehandle (see Section 2.6.3.1.2). Any 24114 NFS4ERR_WRONGSEC error on the current filehandle used by LINK and 24115 RENAME MUST be returned by the PUTFH, PUTPUBFH, PUTROOTFH, or 24116 RESTOREFH operation that injected the current filehandle. 24118 With the exception of LINK and RENAME, the set of operations that can 24119 return NFS4ERR_WRONGSEC represents the point at which the client can 24120 inject a filehandle into the "current filehandle" at the server. The 24121 filehandle is either provided by the client (PUTFH, PUTPUBFH, 24122 PUTROOTFH), generated as a result of a name-to-filehandle translation 24123 (LOOKUP and OPEN), or generated from the saved filehandle via 24124 RESTOREFH. As Section 2.6.3.1.1.1 states, a put filehandle operation 24125 followed by SAVEFH MUST NOT return NFS4ERR_WRONGSEC. Thus, the 24126 RESTOREFH operation, under certain conditions (see 24127 Section 2.6.3.1.1), is permitted to return NFS4ERR_WRONGSEC so that 24128 security policies can be honored. 24130 The READDIR operation will not directly return the NFS4ERR_WRONGSEC 24131 error. 
However, if the READDIR request included a request for 24132 attributes, it is possible that the READDIR request's security triple 24133 did not match that of a directory entry. If this is the case and the 24134 client has requested the rdattr_error attribute, the server will 24135 return the NFS4ERR_WRONGSEC error in rdattr_error for the entry. 24137 To resolve an error return of NFS4ERR_WRONGSEC, the client does the 24138 following: 24140 o For LOOKUP and OPEN, the client will use SECINFO with the same 24141 current filehandle and name as provided in the original LOOKUP or 24142 OPEN to enumerate the available security triples. 24144 o For the rdattr_error, the client will use SECINFO with the same 24145 current filehandle as provided in the original READDIR. The name 24146 passed to SECINFO will be that of the directory entry (as returned 24147 from READDIR) that had the NFS4ERR_WRONGSEC error in the 24148 rdattr_error attribute. 24150 o For PUTFH, PUTROOTFH, PUTPUBFH, RESTOREFH, LINK, and RENAME, the 24151 client will use SECINFO_NO_NAME { style = 24152 SECINFO_STYLE4_CURRENT_FH }. The client will prefix the 24153 SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or 24154 PUTROOTFH operation that provides the filehandle originally 24155 provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH 24156 operation. 24158 NOTE: In NFSv4.0, the client was required to use SECINFO, and had 24159 to reconstruct the parent of the original filehandle and the 24160 component name of the original filehandle. The introduction in 24161 NFSv4.1 of SECINFO_NO_NAME obviates the need for reconstruction. 24163 o For LOOKUPP, the client will use SECINFO_NO_NAME { style = 24164 SECINFO_STYLE4_PARENT } and provide the filehandle that equals the 24165 filehandle originally provided to LOOKUPP. 24167 See Section 21 for a discussion on the recommendations for the 24168 security flavor used by SECINFO and SECINFO_NO_NAME. 24170 18.30. 
Operation 34: SETATTR - Set Attributes 24172 18.30.1. ARGUMENTS 24174 struct SETATTR4args { 24175 /* CURRENT_FH: target object */ 24176 stateid4 stateid; 24177 fattr4 obj_attributes; 24178 }; 24180 18.30.2. RESULTS 24182 struct SETATTR4res { 24183 nfsstat4 status; 24184 bitmap4 attrsset; 24185 }; 24187 18.30.3. DESCRIPTION 24189 The SETATTR operation changes one or more of the attributes of a file 24190 system object. The new attributes are specified with a bitmap and 24191 the attributes that follow the bitmap in bit order. 24193 The stateid argument for SETATTR is used to provide byte-range 24194 locking context that is necessary for SETATTR requests that set the 24195 size attribute. Since setting the size attribute modifies the file's 24196 data, it has the same locking requirements as a corresponding WRITE. 24197 Any SETATTR that sets the size attribute is incompatible with a share 24198 reservation that specifies OPEN4_SHARE_DENY_WRITE. The area between 24199 the old end-of-file and the new end-of-file is considered to be 24200 modified just as would have been the case had the area in question 24201 been specified as the target of WRITE, for the purpose of checking 24202 conflicts with byte-range locks, for those cases in which a server is 24203 implementing mandatory byte-range locking behavior. A valid stateid 24204 SHOULD always be specified. When the file size attribute is not set, 24205 the special stateid consisting of all bits equal to zero MAY be 24206 passed. 24208 On either success or failure of the operation, the server will return 24209 the attrsset bitmask to represent what (if any) attributes were 24210 successfully set. The attrsset in the response is a subset of the 24211 attrmask field of the obj_attributes field in the argument. 24213 On success, the current filehandle retains its value. 24215 18.30.4. 
IMPLEMENTATION 24217 If the request specifies the owner attribute to be set, the server 24218 SHOULD allow the operation to succeed if the current owner of the 24219 object matches the value specified in the request. Some servers may 24220 be implemented in a way as to prohibit the setting of the owner 24221 attribute unless the requester has privilege to do so. If the server 24222 is lenient in this one case of matching owner values, the client 24223 implementation may be simplified in cases of creation of an object 24224 (e.g., an exclusive create via OPEN) followed by a SETATTR. 24226 The file size attribute is used to request changes to the size of a 24227 file. A value of zero causes the file to be truncated, a value less 24228 than the current size of the file causes data from new size to the 24229 end of the file to be discarded, and a size greater than the current 24230 size of the file causes logically zeroed data bytes to be added to 24231 the end of the file. Servers are free to implement this using 24232 unallocated bytes (holes) or allocated data bytes set to zero. 24233 Clients should not make any assumptions regarding a server's 24234 implementation of this feature, beyond that the bytes in the affected 24235 byte-range returned by READ will be zeroed. Servers MUST support 24236 extending the file size via SETATTR. 24238 SETATTR is not guaranteed to be atomic. A failed SETATTR may 24239 partially change a file's attributes, hence the reason why the reply 24240 always includes the status and the list of attributes that were set. 24242 If the object whose attributes are being changed has a file 24243 delegation that is held by a client other than the one doing the 24244 SETATTR, the delegation(s) must be recalled, and the operation cannot 24245 proceed to actually change an attribute until each such delegation is 24246 returned or revoked. 
In all cases in which delegations are recalled, 24247 the server is likely to return one or more NFS4ERR_DELAY errors while 24248 the delegation(s) remains outstanding, although it might not do that 24249 if the delegations are returned quickly. 24251 If the object whose attributes are being set is a directory and 24252 another client holds a directory delegation for that directory, then 24253 if enabled, asynchronous notifications will be generated when the set 24254 of attributes changed has a non-null intersection with the set of 24255 attributes for which notification is requested. Notifications of 24256 type NOTIFY4_CHANGE_DIR_ATTRS will be sent to the appropriate 24257 client(s), but the SETATTR is not delayed by waiting for these 24258 notifications to be sent. 24260 If the object whose attributes are being set is a member of the 24261 directory for which another client holds a directory delegation, then 24262 asynchronous notifications will be generated when the set of 24263 attributes changed has a non-null intersection with the set of 24264 attributes for which notification is requested. Notifications of 24265 type NOTIFY4_CHANGE_CHILD_ATTRS will be sent to the appropriate 24266 clients, but the SETATTR is not delayed by waiting for these 24267 notifications to be sent. 24269 Changing the size of a file with SETATTR indirectly changes the 24270 time_modify and change attributes. A client must account for this as 24271 size changes can result in data deletion. 24273 The attributes time_access_set and time_modify_set are write-only 24274 attributes constructed as a switched union so the client can direct 24275 the server in setting the time values. If the switched union 24276 specifies SET_TO_CLIENT_TIME4, the client has provided an nfstime4 to 24277 be used for the operation. If the switch union does not specify 24278 SET_TO_CLIENT_TIME4, the server is to use its current time for the 24279 SETATTR operation. 
24281 If server and client times differ, programs that compare client time 24282 to file times can break. A time synchronization protocol should be 24283 used to limit client/server time skew. 24285 Use of a COMPOUND containing a VERIFY operation specifying only the 24286 change attribute, immediately followed by a SETATTR, provides a means 24287 whereby a client may specify a request that emulates the 24288 functionality of the SETATTR guard mechanism of NFSv3. Since the 24289 function of the guard mechanism is to avoid changes to the file 24290 attributes based on stale information, delays between checking of the 24291 guard condition and the setting of the attributes have the potential 24292 to compromise this function, as would the corresponding delay in the 24293 NFSv4 emulation. Therefore, NFSv4.1 servers SHOULD take care to 24294 avoid such delays, to the degree possible, when executing such a 24295 request. 24297 If the server does not support an attribute as requested by the 24298 client, the server SHOULD return NFS4ERR_ATTRNOTSUPP. 24300 A mask of the attributes actually set is returned by SETATTR in all 24301 cases. That mask MUST NOT include attribute bits not requested to be 24302 set by the client. If the attribute masks in the request and reply 24303 are equal, the status field in the reply MUST be NFS4_OK. 24305 18.31. Operation 37: VERIFY - Verify Same Attributes 24307 18.31.1. ARGUMENTS 24309 struct VERIFY4args { 24310 /* CURRENT_FH: object */ 24311 fattr4 obj_attributes; 24312 }; 24314 18.31.2. RESULTS 24316 struct VERIFY4res { 24317 nfsstat4 status; 24318 }; 24320 18.31.3. DESCRIPTION 24322 The VERIFY operation is used to verify that attributes have the value 24323 assumed by the client before proceeding with the following operations 24324 in the COMPOUND request. If any of the attributes do not match, then 24325 the error NFS4ERR_NOT_SAME must be returned. 
The current filehandle 24326 retains its value after successful completion of the operation. 24328 18.31.4. IMPLEMENTATION 24330 One possible use of the VERIFY operation is the following series of 24331 operations. With this, the client is attempting to verify that the 24332 file being removed will match what the client expects to be removed. 24333 This series can help prevent the unintended deletion of a file. 24335 PUTFH (directory filehandle) 24336 LOOKUP (file name) 24337 VERIFY (filehandle == fh) 24338 PUTFH (directory filehandle) 24339 REMOVE (file name) 24341 This series does not prevent a second client from removing and 24342 creating a new file in the middle of this sequence, but it does help 24343 avoid the unintended result. 24345 In the case that a RECOMMENDED attribute is specified in the VERIFY 24346 operation and the server does not support that attribute for the file 24347 system object, the error NFS4ERR_ATTRNOTSUPP is returned to the 24348 client. 24350 When the attribute rdattr_error or any set-only attribute (e.g., 24351 time_modify_set) is specified, the error NFS4ERR_INVAL is returned to 24352 the client. 24354 18.32. Operation 38: WRITE - Write to File 24356 18.32.1. ARGUMENTS 24358 enum stable_how4 { 24359 UNSTABLE4 = 0, 24360 DATA_SYNC4 = 1, 24361 FILE_SYNC4 = 2 24362 }; 24364 struct WRITE4args { 24365 /* CURRENT_FH: file */ 24366 stateid4 stateid; 24367 offset4 offset; 24368 stable_how4 stable; 24369 opaque data<>; 24370 }; 24372 18.32.2. RESULTS 24374 struct WRITE4resok { 24375 count4 count; 24376 stable_how4 committed; 24377 verifier4 writeverf; 24378 }; 24380 union WRITE4res switch (nfsstat4 status) { 24381 case NFS4_OK: 24382 WRITE4resok resok4; 24383 default: 24384 void; 24385 }; 24387 18.32.3. DESCRIPTION 24389 The WRITE operation is used to write data to a regular file. The 24390 target file is specified by the current filehandle. The offset 24391 specifies the offset where the data should be written. 
An offset of 24392 zero specifies that the write should start at the beginning of the 24393 file. The count, as encoded as part of the opaque data parameter, 24394 represents the number of bytes of data that are to be written. If 24395 the count is zero, the WRITE will succeed and return a count of zero 24396 subject to permissions checking. The server MAY write fewer bytes 24397 than requested by the client. 24399 The client specifies with the stable parameter the method of how the 24400 data is to be processed by the server. If stable is FILE_SYNC4, the 24401 server MUST commit the data written plus all file system metadata to 24402 stable storage before returning results. This corresponds to the 24403 NFSv2 protocol semantics. Any other behavior constitutes a protocol 24404 violation. If stable is DATA_SYNC4, then the server MUST commit all 24405 of the data to stable storage and enough of the metadata to retrieve 24406 the data before returning. The server implementor is free to 24407 implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a 24408 possible performance drop. If stable is UNSTABLE4, the server is 24409 free to commit any part of the data and the metadata to stable 24410 storage, including all or none, before returning a reply to the 24411 client. There is no guarantee whether or when any uncommitted data 24412 will subsequently be committed to stable storage. The only 24413 guarantees made by the server are that it will not destroy any data 24414 without changing the value of writeverf and that it will not commit 24415 the data and metadata at a level less than that requested by the 24416 client. 24418 Except when special stateids are used, the stateid value for a WRITE 24419 request represents a value returned from a previous byte-range LOCK 24420 or OPEN request or the stateid associated with a delegation. 
The 24421 stateid identifies the associated owners if any and is used by the 24422 server to verify that the associated locks are still valid (e.g., 24423 have not been revoked). 24425 Upon successful completion, the following results are returned. The 24426 count result is the number of bytes of data written to the file. The 24427 server may write fewer bytes than requested. If so, the actual 24428 number of bytes written starting at location, offset, is returned. 24430 The server also returns an indication of the level of commitment of 24431 the data and metadata via committed. Per Table 11, 24433 o The server MAY commit the data at a stronger level than requested. 24435 o The server MUST commit the data at a level at least as high as 24436 that committed. 24438 Valid combinations of the fields stable in the request and committed 24439 in the reply. 24441 +------------+-----------------------------------+ 24442 | stable | committed | 24443 +------------+-----------------------------------+ 24444 | UNSTABLE4 | FILE_SYNC4, DATA_SYNC4, UNSTABLE4 | 24445 | DATA_SYNC4 | FILE_SYNC4, DATA_SYNC4 | 24446 | FILE_SYNC4 | FILE_SYNC4 | 24447 +------------+-----------------------------------+ 24449 Table 11 24451 The final portion of the result is the field writeverf. This field 24452 is the write verifier and is a cookie that the client can use to 24453 determine whether a server has changed instance state (e.g., server 24454 restart) between a call to WRITE and a subsequent call to either 24455 WRITE or COMMIT. This cookie MUST be unchanged during a single 24456 instance of the NFSv4.1 server and MUST be unique between instances 24457 of the NFSv4.1 server. If the cookie changes, then the client MUST 24458 assume that any data written with an UNSTABLE4 value for committed 24459 and an old writeverf in the reply has been lost and will need to be 24460 recovered. 
24462 If a client writes data to the server with the stable argument set to 24463 UNSTABLE4 and the reply yields a committed response of DATA_SYNC4 or 24464 UNSTABLE4, the client will follow up some time in the future with a 24465 COMMIT operation to synchronize outstanding asynchronous data and 24466 metadata with the server's stable storage, barring client error. It 24467 is possible that due to client crash or other error that a subsequent 24468 COMMIT will not be received by the server. 24470 For a WRITE with a stateid value of all bits equal to zero, the 24471 server MAY allow the WRITE to be serviced subject to mandatory byte- 24472 range locks or the current share deny modes for the file. For a 24473 WRITE with a stateid value of all bits equal to 1, the server MUST 24474 NOT allow the WRITE operation to bypass locking checks at the server 24475 and otherwise is treated as if a stateid of all bits equal to zero 24476 were used. 24478 On success, the current filehandle retains its value. 24480 18.32.4. IMPLEMENTATION 24482 It is possible for the server to write fewer bytes of data than 24483 requested by the client. In this case, the server SHOULD NOT return 24484 an error unless no data was written at all. If the server writes 24485 less than the number of bytes specified, the client will need to send 24486 another WRITE to write the remaining data. 24488 It is assumed that the act of writing data to a file will cause the 24489 time_modified and change attributes of the file to be updated. 24490 However, these attributes SHOULD NOT be changed unless the contents 24491 of the file are changed. Thus, a WRITE request with count set to 24492 zero SHOULD NOT cause the time_modified and change attributes of the 24493 file to be updated. 24495 Stable storage is persistent storage that survives: 24497 1. Repeated power failures. 24499 2. Hardware failures (of any board, power supply, etc.). 24501 3. Repeated software crashes and restarts. 
24503 This definition does not address failure of the stable storage module 24504 itself. 24506 The verifier is defined to allow a client to detect different 24507 instances of an NFSv4.1 protocol server over which cached, 24508 uncommitted data may be lost. In the most likely case, the verifier 24509 allows the client to detect server restarts. This information is 24510 required so that the client can safely determine whether the server 24511 could have lost cached data. If the server fails unexpectedly and 24512 the client has uncommitted data from previous WRITE requests (done 24513 with the stable argument set to UNSTABLE4 and in which the result 24514 committed was returned as UNSTABLE4 as well), the server might not 24515 have flushed cached data to stable storage. The burden of recovery 24516 is on the client, and the client will need to retransmit the data to 24517 the server. 24519 A suggested verifier would be to use the time that the server was 24520 last started (if restarting the server results in lost buffers). 24522 The reply's committed field allows the client to do more effective 24523 caching. If the server is committing all WRITE requests to stable 24524 storage, then it SHOULD return with committed set to FILE_SYNC4, 24525 regardless of the value of the stable field in the arguments. A 24526 server that uses an NVRAM accelerator may choose to implement this 24527 policy. The client can use this to increase the effectiveness of the 24528 cache by discarding cached data that has already been committed on 24529 the server. 24531 Some implementations may return NFS4ERR_NOSPC instead of 24532 NFS4ERR_DQUOT when a user's quota is exceeded. 24534 In the case that the current filehandle is of type NF4DIR, the server 24535 will return NFS4ERR_ISDIR. If the current file is a symbolic link, 24536 the error NFS4ERR_SYMLINK will be returned. 
Otherwise, if the 24537 current filehandle does not designate an ordinary file, the server 24538 will return NFS4ERR_WRONG_TYPE. 24540 If mandatory byte-range locking is in effect for the file, and the 24541 corresponding byte-range of the data to be written to the file is 24542 READ_LT or WRITE_LT locked by an owner that is not associated with 24543 the stateid, the server MUST return NFS4ERR_LOCKED. If so, the 24544 client MUST check if the owner corresponding to the stateid used with 24545 the WRITE operation has a conflicting READ_LT lock that overlaps with 24546 the byte-range that was to be written. If the stateid's owner has no 24547 conflicting READ_LT lock, then the client SHOULD try to get the 24548 appropriate write byte-range lock via the LOCK operation before re- 24549 attempting the WRITE. When the WRITE completes, the client SHOULD 24550 release the byte-range lock via LOCKU. 24552 If the stateid's owner had a conflicting READ_LT lock, then the 24553 client has no choice but to return an error to the application that 24554 attempted the WRITE. The reason is that since the stateid's owner 24555 had a READ_LT lock, either the server attempted to temporarily 24556 effectively upgrade this READ_LT lock to a WRITE_LT lock or the 24557 server has no upgrade capability. If the server attempted to upgrade 24558 the READ_LT lock and failed, it is pointless for the client to re- 24559 attempt the upgrade via the LOCK operation, because there might be 24560 another client also trying to upgrade. If two clients are blocked 24561 trying to upgrade the same lock, the clients deadlock. If the server 24562 has no upgrade capability, then it is pointless to try a LOCK 24563 operation to upgrade. 24565 If one or more other clients have delegations for the file being 24566 written, those delegations MUST be recalled, and the operation cannot 24567 proceed until those delegations are returned or revoked. 
Except 24568 where this happens very quickly, one or more NFS4ERR_DELAY errors 24569 will be returned to requests made while the delegation remains 24570 outstanding. Normally, delegations will not be recalled as a result 24571 of a WRITE operation since the recall will occur as a result of an 24572 earlier OPEN. However, since it is possible for a WRITE to be done 24573 with a special stateid, the server needs to check for this case even 24574 though the client should have done an OPEN previously. 24576 18.33. Operation 40: BACKCHANNEL_CTL - Backchannel Control 24578 18.33.1. ARGUMENT 24580 typedef opaque gsshandle4_t<>; 24582 struct gss_cb_handles4 { 24583 rpc_gss_svc_t gcbp_service; /* RFC 2203 */ 24584 gsshandle4_t gcbp_handle_from_server; 24585 gsshandle4_t gcbp_handle_from_client; 24586 }; 24588 union callback_sec_parms4 switch (uint32_t cb_secflavor) { 24589 case AUTH_NONE: 24590 void; 24591 case AUTH_SYS: 24592 authsys_parms cbsp_sys_cred; /* RFC 1831 */ 24593 case RPCSEC_GSS: 24594 gss_cb_handles4 cbsp_gss_handles; 24595 }; 24597 struct BACKCHANNEL_CTL4args { 24598 uint32_t bca_cb_program; 24599 callback_sec_parms4 bca_sec_parms<>; 24600 }; 24602 18.33.2. RESULT 24604 struct BACKCHANNEL_CTL4res { 24605 nfsstat4 bcr_status; 24606 }; 24608 18.33.3. DESCRIPTION 24610 The BACKCHANNEL_CTL operation replaces the backchannel's callback 24611 program number and adds (not replaces) RPCSEC_GSS handles for use by 24612 the backchannel. 24614 The arguments of the BACKCHANNEL_CTL call are a subset of the 24615 CREATE_SESSION parameters. In the arguments of BACKCHANNEL_CTL, the 24616 bca_cb_program field and bca_sec_parms fields correspond respectively 24617 to the csa_cb_program and csa_sec_parms fields of the arguments of 24618 CREATE_SESSION (Section 18.36). 24620 BACKCHANNEL_CTL MUST appear in a COMPOUND that starts with SEQUENCE. 
24622 If the RPCSEC_GSS handle identified by gcbp_handle_from_server does 24623 not exist on the server, the server MUST return NFS4ERR_NOENT. 24625 If an RPCSEC_GSS handle is using the SSV context (see 24626 Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a 24627 common SSV GSS context, there are security considerations specific to 24628 this situation discussed in Section 2.10.10. 24630 18.34. Operation 41: BIND_CONN_TO_SESSION - Associate Connection with 24631 Session 24633 18.34.1. ARGUMENT 24635 enum channel_dir_from_client4 { 24636 CDFC4_FORE = 0x1, 24637 CDFC4_BACK = 0x2, 24638 CDFC4_FORE_OR_BOTH = 0x3, 24639 CDFC4_BACK_OR_BOTH = 0x7 24640 }; 24642 struct BIND_CONN_TO_SESSION4args { 24643 sessionid4 bctsa_sessid; 24645 channel_dir_from_client4 24646 bctsa_dir; 24648 bool bctsa_use_conn_in_rdma_mode; 24649 }; 24651 18.34.2. RESULT 24652 enum channel_dir_from_server4 { 24653 CDFS4_FORE = 0x1, 24654 CDFS4_BACK = 0x2, 24655 CDFS4_BOTH = 0x3 24656 }; 24658 struct BIND_CONN_TO_SESSION4resok { 24659 sessionid4 bctsr_sessid; 24661 channel_dir_from_server4 24662 bctsr_dir; 24664 bool bctsr_use_conn_in_rdma_mode; 24665 }; 24667 union BIND_CONN_TO_SESSION4res 24668 switch (nfsstat4 bctsr_status) { 24670 case NFS4_OK: 24671 BIND_CONN_TO_SESSION4resok 24672 bctsr_resok4; 24674 default: void; 24675 }; 24677 18.34.3. DESCRIPTION 24679 BIND_CONN_TO_SESSION is used to associate additional connections with 24680 a session. It MUST be used on the connection being associated with 24681 the session. It MUST be the only operation in the COMPOUND 24682 procedure. If SP4_NONE (Section 18.35) state protection is used, any 24683 principal, security flavor, or RPCSEC_GSS context MAY be used to 24684 invoke the operation. If SP4_MACH_CRED is used, RPCSEC_GSS MUST be 24685 used with the integrity or privacy services, using the principal that 24686 created the client ID. 
If SP4_SSV is used, RPCSEC_GSS with the SSV 24687 GSS mechanism (Section 2.10.9) and integrity or privacy MUST be used. 24689 If, when the client ID was created, the client opted for SP4_NONE 24690 state protection, the client is not required to use 24691 BIND_CONN_TO_SESSION to associate the connection with the session, 24692 unless the client wishes to associate the connection with the 24693 backchannel. When SP4_NONE protection is used, simply sending a 24694 COMPOUND request with a SEQUENCE operation is sufficient to associate 24695 the connection with the session specified in SEQUENCE. 24697 The field bctsa_dir indicates whether the client wants to associate 24698 the connection with the fore channel or the backchannel or both 24699 channels. The value CDFC4_FORE_OR_BOTH indicates that the client 24700 wants to associate the connection with both the fore channel and 24701 backchannel, but will accept the connection being associated to just 24702 the fore channel. The value CDFC4_BACK_OR_BOTH indicates that the 24703 client wants to associate with both the fore channel and backchannel, 24704 but will accept the connection being associated with just the 24705 backchannel. The server replies in bctsr_dir which channel(s) the 24706 connection is associated with. If the client specified CDFC4_FORE, 24707 the server MUST return CDFS4_FORE. If the client specified 24708 CDFC4_BACK, the server MUST return CDFS4_BACK. If the client 24709 specified CDFC4_FORE_OR_BOTH, the server MUST return CDFS4_FORE or 24710 CDFS4_BOTH. If the client specified CDFC4_BACK_OR_BOTH, the server 24711 MUST return CDFS4_BACK or CDFS4_BOTH. 24713 See the CREATE_SESSION operation (Section 18.36), and the description 24714 of the argument csa_use_conn_in_rdma_mode to understand 24715 bctsa_use_conn_in_rdma_mode, and the description of 24716 csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode. 
24718 Invoking BIND_CONN_TO_SESSION on a connection already associated with 24719 the specified session has no effect, and the server MUST respond with 24720 NFS4_OK, unless the client is demanding changes to the set of 24721 channels the connection is associated with. If so, the server MUST 24722 return NFS4ERR_INVAL. 24724 18.34.4. IMPLEMENTATION 24726 If a session's channel loses all connections, depending on the client 24727 ID's state protection and type of channel, the client might need to 24728 use BIND_CONN_TO_SESSION to associate a new connection. If the 24729 server restarted and does not keep the reply cache in stable storage, 24730 the server will not recognize the session ID. The client will 24731 ultimately have to invoke EXCHANGE_ID to create a new client ID and 24732 session. 24734 Suppose SP4_SSV state protection is being used, and 24735 BIND_CONN_TO_SESSION is among the operations included in the 24736 spo_must_enforce set when the client ID was created (Section 18.35). 24737 If so, there is an issue if SET_SSV is sent, no response is returned, 24738 and the last connection associated with the client ID drops. The 24739 client, per the sessions model, MUST retry the SET_SSV. But it needs 24740 a new connection to do so, and MUST associate that connection with 24741 the session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS 24742 mechanism. The problem is that the RPCSEC_GSS message integrity 24743 codes use a subkey derived from the SSV as the key and the SSV may 24744 have changed. While there are multiple recovery strategies, a 24745 single, general strategy is described here. 24747 o The client reconnects. 24749 o The client assumes that the SET_SSV was executed, and so sends 24750 BIND_CONN_TO_SESSION with the subkey (derived from the new SSV, 24751 i.e., what SET_SSV would have set the SSV to) used as the key for 24752 the RPCSEC_GSS credential message integrity codes. 
24754 o If the request succeeds, this means that the original attempted 24755 SET_SSV did execute successfully. The client re-sends the 24756 original SET_SSV, which the server will reply to via the reply 24757 cache. 24759 o If the server returns an RPC authentication error, this means that 24760 the server's current SSV was not changed (and the SET_SSV was 24761 likely not executed). The client then tries BIND_CONN_TO_SESSION 24762 with the subkey derived from the old SSV as the key for the 24763 RPCSEC_GSS message integrity codes. 24765 o The attempted BIND_CONN_TO_SESSION with the old SSV should 24766 succeed. If so, the client re-sends the original SET_SSV. If the 24767 original SET_SSV was not executed, then the server executes it. 24768 If the original SET_SSV was executed but failed, the server will 24769 return the SET_SSV from the reply cache. 24771 18.35. Operation 42: EXCHANGE_ID - Instantiate Client ID 24773 The EXCHANGE_ID operation exchanges long-hand client and server 24774 identifiers (owners), and provides access to a client ID, creating 24775 one if necessary. This client ID becomes associated with the 24776 connection on which the operation is done, so that it is available 24777 when a CREATE_SESSION is done or when the connection is used to issue 24778 a request on an existing session associated with the current client. 24780 18.35.1. 
ARGUMENT 24782 const EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001; 24783 const EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002; 24785 const EXCHGID4_FLAG_BIND_PRINC_STATEID = 0x00000100; 24787 const EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000; 24788 const EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000; 24789 const EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000; 24791 const EXCHGID4_FLAG_MASK_PNFS = 0x00070000; 24793 const EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000; 24794 const EXCHGID4_FLAG_CONFIRMED_R = 0x80000000; 24796 struct state_protect_ops4 { 24797 bitmap4 spo_must_enforce; 24798 bitmap4 spo_must_allow; 24799 }; 24801 struct ssv_sp_parms4 { 24802 state_protect_ops4 ssp_ops; 24803 sec_oid4 ssp_hash_algs<>; 24804 sec_oid4 ssp_encr_algs<>; 24805 uint32_t ssp_window; 24806 uint32_t ssp_num_gss_handles; 24807 }; 24809 enum state_protect_how4 { 24810 SP4_NONE = 0, 24811 SP4_MACH_CRED = 1, 24812 SP4_SSV = 2 24813 }; 24815 union state_protect4_a switch(state_protect_how4 spa_how) { 24816 case SP4_NONE: 24817 void; 24818 case SP4_MACH_CRED: 24819 state_protect_ops4 spa_mach_ops; 24820 case SP4_SSV: 24821 ssv_sp_parms4 spa_ssv_parms; 24822 }; 24824 struct EXCHANGE_ID4args { 24825 client_owner4 eia_clientowner; 24826 uint32_t eia_flags; 24827 state_protect4_a eia_state_protect; 24828 nfs_impl_id4 eia_client_impl_id<1>; 24829 }; 24831 18.35.2. 
RESULT 24832 struct ssv_prot_info4 { 24833 state_protect_ops4 spi_ops; 24834 uint32_t spi_hash_alg; 24835 uint32_t spi_encr_alg; 24836 uint32_t spi_ssv_len; 24837 uint32_t spi_window; 24838 gsshandle4_t spi_handles<>; 24839 }; 24841 union state_protect4_r switch(state_protect_how4 spr_how) { 24842 case SP4_NONE: 24843 void; 24844 case SP4_MACH_CRED: 24845 state_protect_ops4 spr_mach_ops; 24846 case SP4_SSV: 24847 ssv_prot_info4 spr_ssv_info; 24848 }; 24850 struct EXCHANGE_ID4resok { 24851 clientid4 eir_clientid; 24852 sequenceid4 eir_sequenceid; 24853 uint32_t eir_flags; 24854 state_protect4_r eir_state_protect; 24855 server_owner4 eir_server_owner; 24856 opaque eir_server_scope; 24857 nfs_impl_id4 eir_server_impl_id<1>; 24858 }; 24860 union EXCHANGE_ID4res switch (nfsstat4 eir_status) { 24861 case NFS4_OK: 24862 EXCHANGE_ID4resok eir_resok4; 24864 default: 24865 void; 24866 }; 24868 18.35.3. DESCRIPTION 24870 The client uses the EXCHANGE_ID operation to register a particular 24871 client_owner with the server. However, when the client_owner has 24872 already been registered by other means (e.g. Transparent State 24873 Migration), the client may still use EXCHANGE_ID to obtain the client 24874 ID assigned previously. 24876 The client ID returned from this operation will be associated with 24877 the connection on which the EXCHANGE_ID is received and will serve as 24878 a parent object for sessions created by the client on this connection 24879 or to which the connection is bound. As a result of using those 24880 sessions to make requests involving the creation of state, that state 24881 will become associated with the client ID returned. 24883 In situations in which the registration of the client_owner has not 24884 occurred previously, the client ID must first be used, along with the 24885 returned eir_sequenceid, in creating an associated session using 24886 CREATE_SESSION. 
24888 If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the result, 24889 eir_flags, then it is an indication that the registration of the 24890 client_owner has already occurred and that a further CREATE_SESSION 24891 is not needed to confirm it. Of course, subsequent CREATE_SESSION 24892 operations may be needed for other reasons. 24894 The value eir_sequenceid is used to establish an initial sequence 24895 value associate with the client ID returned. In cases in which a 24896 CREATE_SESSION has already been done, there is no need for this 24897 value, since sequencing of such request has already been established 24898 and the client has no need for this value and will ignore it 24900 EXCHANGE_ID MAY be sent in a COMPOUND procedure that starts with 24901 SEQUENCE. However, when a client communicates with a server for the 24902 first time, it will not have a session, so using SEQUENCE will not be 24903 possible. If EXCHANGE_ID is sent without a preceding SEQUENCE, then 24904 it MUST be the only operation in the COMPOUND procedure's request. 24905 If it is not, the server MUST return NFS4ERR_NOT_ONLY_OP. 24907 The eia_clientowner field is composed of a co_verifier field and a 24908 co_ownerid string. As noted in Section 2.4, the co_ownerid 24909 identifies the client, and the co_verifier specifies a particular 24910 incarnation of that client. An EXCHANGE_ID sent with a new 24911 incarnation of the client will lead to the server removing lock state 24912 of the old incarnation. On the other hand, an EXCHANGE_ID sent with 24913 the current incarnation and co_ownerid will, when it does not result 24914 in an unrelated error, potentially update an existing client ID's 24915 properties, or simply return information about the existing 24916 client_id. That latter would happen when this operation is done to 24917 the same server using different network addresses as part of creating 24918 trunked connections. 
24920 A server MUST NOT provide the same client ID to two different 24921 incarnations of an eia_clientowner. 24923 In addition to the client ID and sequence ID, the server returns a 24924 server owner (eir_server_owner) and server scope (eir_server_scope). 24925 The former field is used in connection with network trunking as 24926 described in Section 2.10.5. The latter field is used to allow 24927 clients to determine when client IDs sent by one server may be 24928 recognized by another in the event of file system migration (see 24929 Section 11.11.9 of the current document). 24931 The client ID returned by EXCHANGE_ID is only unique relative to the 24932 combination of eir_server_owner.so_major_id and eir_server_scope. 24933 Thus, if two servers return the same client ID, the onus is on the 24934 client to distinguish the client IDs on the basis of 24935 eir_server_owner.so_major_id and eir_server_scope. In the event two 24936 different servers claim matching server_owner.so_major_id and 24937 eir_server_scope, the client can use the verification techniques 24938 discussed in Section 2.10.5.1 to determine if the servers are 24939 distinct. If they are distinct, then the client will need to note 24940 the destination network addresses of the connections used with each 24941 server and use the network address as the final discriminator. 24943 The server, as defined by the unique identity expressed in the 24944 so_major_id of the server owner and the server scope, needs to track 24945 several properties of each client ID it hands out. The properties 24946 apply to the client ID and all sessions associated with the client 24947 ID. The properties are derived from the arguments and results of 24948 EXCHANGE_ID. 
The client ID properties include: 24950 o The capabilities expressed by the following bits, which come from 24951 the results of EXCHANGE_ID: 24953 * EXCHGID4_FLAG_SUPP_MOVED_REFER 24955 * EXCHGID4_FLAG_SUPP_MOVED_MIGR 24957 * EXCHGID4_FLAG_BIND_PRINC_STATEID 24959 * EXCHGID4_FLAG_USE_NON_PNFS 24961 * EXCHGID4_FLAG_USE_PNFS_MDS 24963 * EXCHGID4_FLAG_USE_PNFS_DS 24965 These properties may be updated by subsequent EXCHANGE_ID 24966 operations on confirmed client IDs though the server MAY refuse to 24967 change them. 24969 o The state protection method used, one of SP4_NONE, SP4_MACH_CRED, 24970 or SP4_SSV, as set by the spa_how field of the arguments to 24971 EXCHANGE_ID. Once the client ID is confirmed, this property 24972 cannot be updated by subsequent EXCHANGE_ID operations. 24974 o For SP4_MACH_CRED or SP4_SSV state protection: 24976 * The list of operations (spo_must_enforce) that MUST use the 24977 specified state protection. This list comes from the results 24978 of EXCHANGE_ID. 24980 * The list of operations (spo_must_allow) that MAY use the 24981 specified state protection. This list comes from the results 24982 of EXCHANGE_ID. 24984 Once the client ID is confirmed, these properties cannot be 24985 updated by subsequent EXCHANGE_ID requests. 24987 o For SP4_SSV protection: 24989 * The OID of the hash algorithm. This property is represented by 24990 one of the algorithms in the ssp_hash_algs field of the 24991 EXCHANGE_ID arguments. Once the client ID is confirmed, this 24992 property cannot be updated by subsequent EXCHANGE_ID requests. 24994 * The OID of the encryption algorithm. This property is 24995 represented by one of the algorithms in the ssp_encr_algs field 24996 of the EXCHANGE_ID arguments. Once the client ID is confirmed, 24997 this property cannot be updated by subsequent EXCHANGE_ID 24998 requests. 25000 * The length of the SSV. This property is represented by the 25001 spi_ssv_len field in the EXCHANGE_ID results. 
Once the client 25002 ID is confirmed, this property cannot be updated by subsequent 25003 EXCHANGE_ID operations. 25005 There are REQUIRED and RECOMMENDED relationships among the 25006 length of the key of the encryption algorithm ("key length"), 25007 the length of the output of the hash algorithm ("hash length"), and 25008 the length of the SSV ("SSV length"). 25010 + key length MUST be <= hash length. This is because the keys 25011 used for the encryption algorithm are actually subkeys 25012 derived from the SSV, and the derivation is via the hash 25013 algorithm. The selection of an encryption algorithm with a 25014 key length that exceeded the length of the output of the 25015 hash algorithm would require padding, and thus weaken the 25016 use of the encryption algorithm. 25018 + hash length SHOULD be <= SSV length. This is because the 25019 SSV is a key used to derive subkeys via an HMAC, and it is 25020 recommended that the key used as input to an HMAC be at 25021 least as long as the length of the HMAC's hash algorithm's 25022 output (see Section 3 of [51]). 25024 + key length SHOULD be <= SSV length. This is a transitive 25025 result of the above two invariants. 25027 + key length SHOULD be >= hash length / 2. This is because 25028 the subkey derivation is via an HMAC and it is recommended 25029 that if the HMAC has to be truncated, it should not be 25030 truncated to less than half the hash length (see Section 4 25031 of RFC 2104 [51]). 25033 * Number of concurrent versions of the SSV the client and server 25034 will support (see Section 2.10.9). This property is 25035 represented by spi_window in the EXCHANGE_ID results. The 25036 property may be updated by subsequent EXCHANGE_ID operations. 25038 o The client's implementation ID as represented by the 25039 eia_client_impl_id field of the arguments. The property may be 25040 updated by subsequent EXCHANGE_ID requests.
25042 o The server's implementation ID as represented by the 25043 eir_server_impl_id field of the reply. The property may be 25044 updated by replies to subsequent EXCHANGE_ID requests. 25046 The eia_flags passed as part of the arguments and the eir_flags 25047 results allow the client and server to inform each other of their 25048 capabilities as well as indicate how the client ID will be used. 25049 Whether a bit is set or cleared on the arguments' flags does not 25050 force the server to set or clear the same bit on the results' side. 25051 Bits not defined above cannot be set in the eia_flags field. If they 25052 are, the server MUST reject the operation with NFS4ERR_INVAL. 25054 The EXCHGID4_FLAG_UPD_CONFIRMED_REC_A bit can only be set in 25055 eia_flags; it is always off in eir_flags. The 25056 EXCHGID4_FLAG_CONFIRMED_R bit can only be set in eir_flags; it is 25057 always off in eia_flags. If the server recognizes the co_ownerid and 25058 co_verifier as mapping to a confirmed client ID, it sets 25059 EXCHGID4_FLAG_CONFIRMED_R in eir_flags. The 25060 EXCHGID4_FLAG_CONFIRMED_R flag allows a client to tell if the client 25061 ID it is trying to create already exists and is confirmed. 25063 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set in eia_flags, this means 25064 that the client is attempting to update properties of an existing 25065 confirmed client ID (if the client wants to update properties of an 25066 unconfirmed client ID, it MUST NOT set 25067 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A). If so, it is RECOMMENDED that 25068 the client send the update EXCHANGE_ID operation in the same COMPOUND 25069 as a SEQUENCE so that the EXCHANGE_ID is executed exactly once. 25070 Whether the client can update the properties of the client ID depends 25071 on the state protection it selected when the client ID was created, 25072 and the principal and security flavor it used when sending the 25073 EXCHANGE_ID operation.
The situations described in items 6, 7, 8, or 25074 9 of the second numbered list of Section 18.35.4 below will apply. 25075 Note that if the operation succeeds and returns a client ID that is 25076 already confirmed, the server MUST set the EXCHGID4_FLAG_CONFIRMED_R 25077 bit in eir_flags. 25079 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in eia_flags, this 25080 means that the client is trying to establish a new client ID; it is 25081 attempting to trunk data communication to the server (see 25082 Section 2.10.5); or it is attempting to update properties of an 25083 unconfirmed client ID. The situations described in items 1, 2, 3, 4, 25084 or 5 of the second numbered list of Section 18.35.4 below will apply. 25085 Note that if the operation succeeds and returns a client ID that was 25086 previously confirmed, the server MUST set the 25087 EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags. 25089 When the EXCHGID4_FLAG_SUPP_MOVED_REFER flag bit is set, the client 25090 indicates that it is capable of dealing with an NFS4ERR_MOVED error 25091 as part of a referral sequence. When this bit is not set, it is 25092 still legal for the server to perform a referral sequence. However, 25093 a server may use the fact that the client is incapable of correctly 25094 responding to a referral by avoiding it for that particular client. 25095 It may, for instance, act as a proxy for that particular file system, 25096 at some cost in performance, although it is not obligated to do so. 25097 If the server will potentially perform a referral, it MUST set 25098 EXCHGID4_FLAG_SUPP_MOVED_REFER in eir_flags. 25100 When the EXCHGID4_FLAG_SUPP_MOVED_MIGR flag bit is set, the client 25101 indicates that it is capable of dealing with an NFS4ERR_MOVED error 25102 as part of a file system migration sequence. When this bit is not 25103 set, it is still legal for the server to indicate that a file system 25104 has moved, when this in fact happens.
However, a server may use the fact that 25105 the client is incapable of correctly responding to a migration in its 25106 scheduling of file systems to migrate so as to avoid migration of 25107 file systems being actively used. It may also hide actual migrations 25108 from clients unable to deal with them by acting as a proxy for a 25109 migrated file system for particular clients, at some cost in 25110 performance, although it is not obligated to do so. If the server 25111 will potentially perform a migration, it MUST set 25112 EXCHGID4_FLAG_SUPP_MOVED_MIGR in eir_flags. 25114 When EXCHGID4_FLAG_BIND_PRINC_STATEID is set, the client indicates 25115 that it wants the server to bind the stateid to the principal. This 25116 means that when a principal creates a stateid, it has to be the one 25117 to use the stateid. If the server will perform binding, it will 25118 return EXCHGID4_FLAG_BIND_PRINC_STATEID. The server MAY return 25119 EXCHGID4_FLAG_BIND_PRINC_STATEID even if the client does not request 25120 it. If an update to the client ID changes the value of 25121 EXCHGID4_FLAG_BIND_PRINC_STATEID's client ID property, the effect 25122 applies only to new stateids. Existing stateids (and all stateids 25123 with the same "other" field) that were created with stateid to 25124 principal binding in force will continue to have binding in force. 25125 Existing stateids (and all stateids with the same "other" field) that 25126 were created with stateid to principal binding not in force will 25127 continue to have binding not in force. 25129 The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and 25130 EXCHGID4_FLAG_USE_PNFS_DS bits are described in Section 2.10.2.2 and 25131 convey roles the client ID is to be used for in a pNFS environment. 25132 The server MUST set one of the acceptable combinations of these bits 25133 (roles) in eir_flags, as specified in that section. Note that the 25134 same client owner/server owner pair can have multiple roles.
25135 Multiple roles can be associated with the same client ID or with 25136 different client IDs. Thus, if a client sends EXCHANGE_ID from the 25137 same client owner to the same server owner multiple times, but 25138 specifies different pNFS roles each time, the server might return 25139 different client IDs. Given that different pNFS roles might have 25140 different client IDs, the client may ask for different properties for 25141 each role/client ID. 25143 The spa_how field of the eia_state_protect field specifies how the 25144 client wants to protect its client, locking, and session states from 25145 unauthorized changes (Section 2.10.8.3): 25147 o SP4_NONE. The client does not request the NFSv4.1 server to 25148 enforce state protection. The NFSv4.1 server MUST NOT enforce 25149 state protection for the returned client ID. 25151 o SP4_MACH_CRED. If spa_how is SP4_MACH_CRED, then the client MUST 25152 send the EXCHANGE_ID operation with RPCSEC_GSS as the security 25153 flavor, and with a service of RPC_GSS_SVC_INTEGRITY or 25154 RPC_GSS_SVC_PRIVACY. If SP4_MACH_CRED is specified, then the 25155 client wants to use an RPCSEC_GSS-based machine credential to 25156 protect its state. The server MUST note the principal the 25157 EXCHANGE_ID operation was sent with, and the GSS mechanism used. 25158 These notes collectively comprise the machine credential. 25160 After the client ID is confirmed, as long as the lease associated 25161 with the client ID is unexpired, a subsequent EXCHANGE_ID 25162 operation that uses the same eia_clientowner.co_owner as the first 25163 EXCHANGE_ID MUST also use the same machine credential as the first 25164 EXCHANGE_ID. The server returns the same client ID for the 25165 subsequent EXCHANGE_ID as that returned from the first 25166 EXCHANGE_ID. 25168 o SP4_SSV. 
If spa_how is SP4_SSV, then the client MUST send the 25169 EXCHANGE_ID operation with RPCSEC_GSS as the security flavor, and 25170 with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY. 25171 If SP4_SSV is specified, then the client wants to use the SSV to 25172 protect its state. The server records the credential used in the 25173 request as the machine credential (as defined above) for the 25174 eia_clientowner.co_owner. The CREATE_SESSION operation that 25175 confirms the client ID MUST use the same machine credential. 25177 When a client specifies SP4_MACH_CRED or SP4_SSV, it also provides 25178 two lists of operations (each expressed as a bitmap). The first list 25179 is spo_must_enforce and consists of those operations the client MUST 25180 send (subject to the server confirming the list of operations in the 25181 result of EXCHANGE_ID) with the machine credential (if SP4_MACH_CRED 25182 protection is specified) or the SSV-based credential (if SP4_SSV 25183 protection is used). The client MUST send the operations with 25184 RPCSEC_GSS credentials that specify the RPC_GSS_SVC_INTEGRITY or 25185 RPC_GSS_SVC_PRIVACY security service. Typically, the first list of 25186 operations includes EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, 25187 DESTROY_SESSION, BIND_CONN_TO_SESSION, and DESTROY_CLIENTID. The 25188 client SHOULD NOT specify in this list any operations that require a 25189 filehandle because the server's access policies MAY conflict with the 25190 client's choice, and thus the client would then be unable to access a 25191 subset of the server's namespace. 25193 Note that if SP4_SSV protection is specified, and the client 25194 indicates that CREATE_SESSION must be protected with SP4_SSV, because 25195 the SSV cannot exist without a confirmed client ID, the first 25196 CREATE_SESSION MUST instead be sent using the machine credential, and 25197 the server MUST accept the machine credential. 
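The operation lists above are conveyed as bitmap4 values indexed by operation number (bit n of 32-bit word n/32). As a rough illustration, not normative, the following Python sketch builds the typical spo_must_enforce list mentioned above; the helper function is hypothetical, but the operation numbers are the ones NFSv4.1 assigns.

```python
# Sketch: encoding the typical spo_must_enforce operation list as an
# NFSv4.1 bitmap4 (an array of 32-bit words; operation number n sets
# bit n % 32 of word n // 32).

OP_DELEGPURGE = 7              # operation numbers assigned by NFSv4.1
OP_BIND_CONN_TO_SESSION = 41
OP_EXCHANGE_ID = 42
OP_CREATE_SESSION = 43
OP_DESTROY_SESSION = 44
OP_DESTROY_CLIENTID = 57


def ops_to_bitmap(ops):
    """Encode a collection of operation numbers as a bitmap4 word list
    (hypothetical helper)."""
    words = [0] * (max(ops) // 32 + 1)
    for op in ops:
        words[op // 32] |= 1 << (op % 32)
    return words


spo_must_enforce = ops_to_bitmap([
    OP_EXCHANGE_ID, OP_CREATE_SESSION, OP_DELEGPURGE,
    OP_DESTROY_SESSION, OP_BIND_CONN_TO_SESSION, OP_DESTROY_CLIENTID,
])
```

With operation 57 (DESTROY_CLIENTID) included, the encoded bitmap needs two 32-bit words.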
25199 There is a corresponding result, also called spo_must_enforce, of the 25200 operations for which the server will require SP4_MACH_CRED or SP4_SSV 25201 protection. Normally, the server's result equals the client's 25202 argument, but the result MAY be different. If the client requests 25203 one or more operations in the set { EXCHANGE_ID, CREATE_SESSION, 25204 DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, DESTROY_CLIENTID 25205 }, then the result spo_must_enforce MUST include the operations the 25206 client requested from that set. 25208 If spo_must_enforce in the results has BIND_CONN_TO_SESSION set, then 25209 connection binding enforcement is enabled, and the client MUST use 25210 the machine (if SP4_MACH_CRED protection is used) or SSV (if SP4_SSV 25211 protection is used) credential on calls to BIND_CONN_TO_SESSION. 25213 The second list is spo_must_allow and consists of those operations 25214 the client wants to have the option of sending with the machine 25215 credential or the SSV-based credential, even if the object the 25216 operations are performed on is not owned by the machine or SSV 25217 credential. 25219 The corresponding result, also called spo_must_allow, consists of the 25220 operations the server will allow the client to use SP4_SSV or 25221 SP4_MACH_CRED credentials with. Normally, the server's result equals 25222 the client's argument, but the result MAY be different. 25224 The purpose of spo_must_allow is to allow clients to solve the 25225 following conundrum. Suppose the client ID is confirmed with 25226 EXCHGID4_FLAG_BIND_PRINC_STATEID, and it calls OPEN with the 25227 RPCSEC_GSS credentials of a normal user. Now suppose the user's 25228 credentials expire, and cannot be renewed (e.g., a Kerberos ticket 25229 granting ticket expires, and the user has logged off and will not be 25230 acquiring a new ticket granting ticket). 
The client will be unable 25231 to send CLOSE without the user's credentials, which is to say the 25232 client has to either leave the state on the server or re-send 25233 EXCHANGE_ID with a new verifier to clear all state, that is, unless 25234 the client includes CLOSE on the list of operations in spo_must_allow 25235 and the server agrees. 25237 The SP4_SSV protection parameters also have: 25239 ssp_hash_algs: 25241 This is the set of algorithms the client supports for the purpose 25242 of computing the digests needed for the internal SSV GSS mechanism 25243 and for the SET_SSV operation. Each algorithm is specified as an 25244 object identifier (OID). The REQUIRED algorithms for a server are 25245 id-sha1, id-sha224, id-sha256, id-sha384, and id-sha512 [25]. 25247 Due to known weaknesses in id-sha1, it is RECOMMENDED that the 25248 client specify at least one algorithm within ssp_hash_algs other 25249 than id-sha1. 25251 The algorithm the server selects among the set is indicated in 25252 spi_hash_alg, a field of spr_ssv_prot_info. The field 25253 spi_hash_alg is an index into the array ssp_hash_algs. Because of 25254 the known weaknesses in id-sha1, it is RECOMMENDED that it not be 25255 selected by the server as long as ssp_hash_algs contains any other 25256 supported algorithm. 25258 If the server does not support any of the offered algorithms, it 25259 returns NFS4ERR_HASH_ALG_UNSUPP. If ssp_hash_algs is empty, the 25260 server MUST return NFS4ERR_INVAL. 25262 ssp_encr_algs: 25264 This is the set of algorithms the client supports for the purpose 25265 of providing privacy protection for the internal SSV GSS 25266 mechanism. Each algorithm is specified as an OID. The REQUIRED 25267 algorithm for a server is id-aes256-CBC. The RECOMMENDED 25268 algorithms are id-aes192-CBC and id-aes128-CBC [26]. The selected 25269 algorithm is returned in spi_encr_alg, an index into 25270 ssp_encr_algs.
If the server does not support any of the offered 25271 algorithms, it returns NFS4ERR_ENCR_ALG_UNSUPP. If ssp_encr_algs 25272 is empty, the server MUST return NFS4ERR_INVAL. Note that due to 25273 previously stated requirements and recommendations on the 25274 relationships between key length and hash length, some 25275 combinations of RECOMMENDED and REQUIRED encryption algorithm and 25276 hash algorithm either SHOULD NOT or MUST NOT be used. Table 12 25277 summarizes the illegal and discouraged combinations. 25279 ssp_window: 25281 This is the number of SSV versions the client wants the server to 25282 maintain (i.e., each successful call to SET_SSV produces a new 25283 version of the SSV). If ssp_window is zero, the server MUST 25284 return NFS4ERR_INVAL. The server responds with spi_window, which 25285 MUST NOT exceed ssp_window and MUST be at least one. Any requests 25286 on the backchannel or fore channel that are using a version of the 25287 SSV that is outside the window will fail with an ONC RPC 25288 authentication error, and the requester will have to retry them 25289 with the same slot ID and sequence ID. 25291 ssp_num_gss_handles: 25293 This is the number of RPCSEC_GSS handles the server should create 25294 that are based on the GSS SSV mechanism (see Section 2.10.9). It 25295 is not the total number of RPCSEC_GSS handles for the client ID. 25296 Indeed, subsequent calls to EXCHANGE_ID will add RPCSEC_GSS 25297 handles. The server responds with a list of handles in 25298 spi_handles. If the client asks for at least one handle and the 25299 server cannot create it, the server MUST return an error. The 25300 handles in spi_handles are not available for use until the client 25301 ID is confirmed, which could be immediately if EXCHANGE_ID returns 25302 EXCHGID4_FLAG_CONFIRMED_R, or upon successful confirmation from 25303 CREATE_SESSION. 
25305 While a client ID can span all the connections that are connected 25306 to a server sharing the same eir_server_owner.so_major_id, the 25307 RPCSEC_GSS handles returned in spi_handles can only be used on 25308 connections connected to a server that returns the same 25309 eir_server_owner.so_major_id and eir_server_owner.so_minor_id on 25310 each connection. It is permissible for the client to set 25311 ssp_num_gss_handles to zero; the client can create more handles 25312 with another EXCHANGE_ID call. 25314 Because each SSV RPCSEC_GSS handle shares a common SSV GSS 25315 context, there are security considerations specific to this 25316 situation discussed in Section 2.10.10. 25318 The seq_window (see Section 5.2.3.1 of RFC 2203 [4]) of each 25319 RPCSEC_GSS handle in spi_handles MUST be the same as the seq_window 25320 of the RPCSEC_GSS handle used for the credential of the RPC 25321 request that the EXCHANGE_ID operation was sent as a part of. 25323 +-------------------+----------------------+------------------------+ 25324 | Encryption | MUST NOT be combined | SHOULD NOT be combined | 25325 | Algorithm | with | with | 25326 +-------------------+----------------------+------------------------+ 25327 | id-aes128-CBC | | id-sha384, id-sha512 | 25328 | id-aes192-CBC | id-sha1 | id-sha512 | 25329 | id-aes256-CBC | id-sha1, id-sha224 | | 25330 +-------------------+----------------------+------------------------+ 25332 Table 12 25334 The arguments include an array of up to one element in length called 25335 eia_client_impl_id. If eia_client_impl_id is present, it contains 25336 the information identifying the implementation of the client. 25337 Similarly, the results include an array of up to one element in 25338 length called eir_server_impl_id that identifies the implementation 25339 of the server. Servers MUST accept a zero-length eia_client_impl_id 25340 array, and clients MUST accept a zero-length eir_server_impl_id 25341 array.
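The entries in Table 12 follow mechanically from the two key-length invariants stated earlier ("key length MUST be <= hash length" and "key length SHOULD be >= hash length / 2"). The following sketch (Python, purely illustrative; the classifier function is hypothetical, while the byte lengths are the standard AES key sizes and SHA digest sizes) reproduces the table's classifications:

```python
# Sketch: deriving Table 12 from the key-length/hash-length invariants.
# Key lengths (bytes) of the REQUIRED/RECOMMENDED encryption algorithms,
# and digest lengths (bytes) of the REQUIRED hash algorithms.

KEY_LEN = {"id-aes128-CBC": 16, "id-aes192-CBC": 24, "id-aes256-CBC": 32}
HASH_LEN = {"id-sha1": 20, "id-sha224": 28, "id-sha256": 32,
            "id-sha384": 48, "id-sha512": 64}


def combination(encr, hash_alg):
    """Classify an encryption/hash pairing (hypothetical helper)."""
    key, digest = KEY_LEN[encr], HASH_LEN[hash_alg]
    if key > digest:           # key length MUST be <= hash length
        return "MUST NOT"
    if 2 * key < digest:       # key length SHOULD be >= hash length / 2
        return "SHOULD NOT"
    return "OK"
```

For example, id-aes192-CBC with id-sha1 yields "MUST NOT" (24-byte key exceeds the 20-byte digest), matching the second row of Table 12.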
25343 A possible use for implementation identifiers would be in diagnostic 25344 software that extracts this information in an attempt to identify 25345 interoperability problems, performance workload behaviors, or general 25346 usage statistics. Since the intent of having access to this 25347 information is for planning or general diagnosis only, the client and 25348 server MUST NOT interpret this implementation identity information in 25349 a way that affects how the implementation interacts with its peer. 25350 The client and server are not allowed to depend on the peer's 25351 manifesting a particular allowed behavior based on an implementation 25352 identifier but are required to interoperate as specified elsewhere in 25353 the protocol specification. 25355 Because it is possible that some implementations might violate the 25356 protocol specification and interpret the identity information, 25357 implementations MUST provide facilities to allow the NFSv4 client and 25358 server to be configured to set the contents of the nfs_impl_id 25359 structures sent to any specified value. 25361 18.35.4. IMPLEMENTATION 25363 A server's client record is a 5-tuple: 25365 1. co_ownerid: 25367 The client identifier string, from the eia_clientowner 25368 structure of the EXCHANGE_ID4args structure. 25370 2. co_verifier: 25372 A client-specific value used to indicate incarnations (where a 25373 client restart represents a new incarnation), from the 25374 eia_clientowner structure of the EXCHANGE_ID4args structure. 25376 3. principal: 25378 The principal that was defined in the RPC header's credential 25379 and/or verifier at the time the client record was established. 25381 4. client ID: 25383 The shorthand client identifier, generated by the server and 25384 returned via the eir_clientid field in the EXCHANGE_ID4resok 25385 structure. 25387 5. confirmed: 25389 A private field on the server indicating whether or not a 25390 client record has been confirmed.
A client record is 25391 confirmed if there has been a successful CREATE_SESSION 25392 operation to confirm it. Otherwise, it is unconfirmed. An 25393 unconfirmed record is established by an EXCHANGE_ID call. Any 25394 unconfirmed record that is not confirmed within a lease period 25395 SHOULD be removed. 25397 The following identifiers represent special values for the fields in 25398 the records. 25400 ownerid_arg: 25402 The value of the eia_clientowner.co_ownerid subfield of the 25403 EXCHANGE_ID4args structure of the current request. 25405 verifier_arg: 25407 The value of the eia_clientowner.co_verifier subfield of the 25408 EXCHANGE_ID4args structure of the current request. 25410 old_verifier_arg: 25412 A value of the eia_clientowner.co_verifier field of a client 25413 record received in a previous request; this is distinct from 25414 verifier_arg. 25416 principal_arg: 25418 The value of the RPCSEC_GSS principal for the current request. 25420 old_principal_arg: 25422 A value of the principal of a client record as defined by the RPC 25423 header's credential or verifier of a previous request. This is 25424 distinct from principal_arg. 25426 clientid_ret: 25428 The value of the eir_clientid field the server will return in the 25429 EXCHANGE_ID4resok structure for the current request. 25431 old_clientid_ret: 25433 The value of the eir_clientid field the server returned in the 25434 EXCHANGE_ID4resok structure for a previous request. This is 25435 distinct from clientid_ret. 25437 confirmed: 25439 The client ID has been confirmed. 25441 unconfirmed: 25443 The client ID has not been confirmed. 25445 Since EXCHANGE_ID is a non-idempotent operation, we must consider the 25446 possibility that retries occur as a result of a client restart, 25447 network partition, malfunctioning router, etc. Retries are 25448 identified by the value of the eia_clientowner field of 25449 EXCHANGE_ID4args, and the method for dealing with them is outlined in 25450 the scenarios below. 
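The 5-tuple and the retry-detection key can be modeled directly. A minimal sketch (Python, purely illustrative; the type and helper are hypothetical, with field names following the 5-tuple above):

```python
# Sketch: the server's client record as the 5-tuple of Section 18.35.4.
from dataclasses import dataclass


@dataclass
class ClientRecord:
    co_ownerid: str       # client identifier string (eia_clientowner)
    co_verifier: bytes    # incarnation verifier (eia_clientowner)
    principal: str        # RPC credential principal at record creation
    clientid: int         # shorthand client ID generated by the server
    confirmed: bool       # confirmed by a successful CREATE_SESSION?


def records_for_owner(records, ownerid_arg):
    """Retries and restarts are recognized by co_ownerid alone; the
    verifier and principal then select among the scenarios that follow."""
    return [r for r in records if r.co_ownerid == ownerid_arg]
```

Each scenario in the list below begins by selecting the records whose co_ownerid matches ownerid_arg, then branches on the verifier, principal, and confirmation status.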
25452 The scenarios are described in terms of the client record(s) a server 25453 has for a given co_ownerid. Note that if the client ID was created 25454 specifying SP4_SSV state protection and EXCHANGE_ID as one of the 25455 operations in spo_must_allow, then the server MUST authorize 25456 EXCHANGE_IDs with the SSV principal in addition to the principal that 25457 created the client ID. 25459 1. New Owner ID 25461 If the server has no client records with 25462 eia_clientowner.co_ownerid matching ownerid_arg, and 25463 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in the 25464 EXCHANGE_ID, then a new shorthand client ID (let us call it 25465 clientid_ret) is generated, and the following unconfirmed 25466 record is added to the server's state. 25468 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25469 unconfirmed } 25471 Subsequently, the server returns clientid_ret. 25473 2. Non-Update on Existing Client ID 25475 If the server has the following confirmed record, and the 25476 request does not have EXCHGID4_FLAG_UPD_CONFIRMED_REC_A set, 25477 then the request is the result of a retried request due to a 25478 faulty router or lost connection, or the client is trying to 25479 determine if it can perform trunking. 25481 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25482 confirmed } 25484 Since the record has been confirmed, the client must have 25485 received the server's reply from the initial EXCHANGE_ID 25486 request. Since the server has a confirmed record, and since 25487 EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, with the 25488 possible exception of eir_server_owner.so_minor_id, the server 25489 returns the same result it did when the client ID's properties 25490 were last updated (or if never updated, the result when the 25491 client ID was created). The confirmed record is unchanged. 25493 3.
Client Collision 25495 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the 25496 server has the following confirmed record, then this request 25497 is likely the result of a chance collision between the values 25498 of the eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args 25499 for two different clients. 25501 { ownerid_arg, *, old_principal_arg, old_clientid_ret, 25502 confirmed } 25504 If there is currently no state associated with 25505 old_clientid_ret, or if there is state but the lease has 25506 expired, then this case is effectively equivalent to the New 25507 Owner ID case of Paragraph 1. The confirmed record is 25508 deleted, the old_clientid_ret and its lock state are deleted, 25509 a new shorthand client ID is generated, and the following 25510 unconfirmed record is added to the server's state. 25512 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25513 unconfirmed } 25515 Subsequently, the server returns clientid_ret. 25517 If old_clientid_ret has an unexpired lease with state, then no 25518 state of old_clientid_ret is changed or deleted. The server 25519 returns NFS4ERR_CLID_INUSE to indicate that the client should 25520 retry with a different value for the 25521 eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args. The 25522 client record is not changed. 25524 4. Replacement of Unconfirmed Record 25526 If the EXCHGID4_FLAG_UPD_CONFIRMED_REC_A flag is not set, and 25527 the server has the following unconfirmed record, then the 25528 client is attempting EXCHANGE_ID again on an unconfirmed 25529 client ID, perhaps due to a retry, a client restart before 25530 client ID confirmation (i.e., before CREATE_SESSION was 25531 called), or some other reason. 25533 { ownerid_arg, *, *, old_clientid_ret, unconfirmed } 25535 It is possible that the properties of old_clientid_ret are 25536 different than those specified in the current EXCHANGE_ID. 
25537 Whether or not the properties are being updated, to eliminate 25538 ambiguity, the server deletes the unconfirmed record, 25539 generates a new client ID (clientid_ret), and establishes the 25540 following unconfirmed record: 25542 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25543 unconfirmed } 25545 5. Client Restart 25547 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the 25548 server has the following confirmed client record, then this 25549 request is likely from a previously confirmed client that has 25550 restarted. 25552 { ownerid_arg, old_verifier_arg, principal_arg, 25553 old_clientid_ret, confirmed } 25555 Since the previous incarnation of the same client will no 25556 longer be making requests, once the new client ID is confirmed 25557 by CREATE_SESSION, byte-range locks and share reservations 25558 should be released immediately rather than forcing the new 25559 incarnation to wait for the lease time on the previous 25560 incarnation to expire. Furthermore, session state should be 25561 removed since if the client had maintained that information 25562 across restart, this request would not have been sent. If the 25563 server supports neither the CLAIM_DELEGATE_PREV nor 25564 CLAIM_DELEG_PREV_FH claim types, associated delegations should 25565 be purged as well; otherwise, delegations are retained and 25566 recovery proceeds according to Section 10.2.1. 25568 After processing, clientid_ret is returned to the client and 25569 this client record is added: 25571 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25572 unconfirmed } 25574 The previously described confirmed record continues to exist, 25575 and thus the same ownerid_arg exists in both a confirmed and 25576 unconfirmed state at the same time. The number of states can 25577 collapse to one once the server receives an applicable 25578 CREATE_SESSION or EXCHANGE_ID. 
25580 + If the server subsequently receives a successful 25581 CREATE_SESSION that confirms clientid_ret, then the server 25582 atomically destroys the confirmed record and makes the 25583 unconfirmed record confirmed as described in 25584 Section 18.36.3. 25586 + If the server instead subsequently receives an EXCHANGE_ID 25587 with the client owner equal to ownerid_arg, one strategy is 25588 to simply delete the unconfirmed record, and process the 25589 EXCHANGE_ID as described in the entirety of 25590 Section 18.35.4. 25592 6. Update 25594 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server 25595 has the following confirmed record, then this request is an 25596 attempt at an update. 25598 { ownerid_arg, verifier_arg, principal_arg, clientid_ret, 25599 confirmed } 25601 Since the record has been confirmed, the client must have 25602 received the server's reply from the initial EXCHANGE_ID 25603 request. The server allows the update, and the client record 25604 is left intact. 25606 7. Update but No Confirmed Record 25608 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server 25609 has no confirmed record corresponding to ownerid_arg, then the 25610 server returns NFS4ERR_NOENT and leaves any unconfirmed record 25611 intact. 25613 8. Update but Wrong Verifier 25615 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server 25616 has the following confirmed record, then this request is an 25617 illegal attempt at an update, perhaps because of a retry from 25618 a previous client incarnation. 25620 { ownerid_arg, old_verifier_arg, *, clientid_ret, confirmed } 25622 The server returns NFS4ERR_NOT_SAME and leaves the client 25623 record intact. 25625 9. Update but Wrong Principal 25627 If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server 25628 has the following confirmed record, then this request is an 25629 illegal attempt at an update by an unauthorized principal.
25631 { ownerid_arg, verifier_arg, old_principal_arg, clientid_ret, 25632 confirmed } 25634 The server returns NFS4ERR_PERM and leaves the client record 25635 intact. 25637 18.36. Operation 43: CREATE_SESSION - Create New Session and Confirm 25638 Client ID 25640 18.36.1. ARGUMENT 25642 struct channel_attrs4 { 25643 count4 ca_headerpadsize; 25644 count4 ca_maxrequestsize; 25645 count4 ca_maxresponsesize; 25646 count4 ca_maxresponsesize_cached; 25647 count4 ca_maxoperations; 25648 count4 ca_maxrequests; 25649 uint32_t ca_rdma_ird<1>; 25650 }; 25652 const CREATE_SESSION4_FLAG_PERSIST = 0x00000001; 25653 const CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002; 25654 const CREATE_SESSION4_FLAG_CONN_RDMA = 0x00000004; 25656 struct CREATE_SESSION4args { 25657 clientid4 csa_clientid; 25658 sequenceid4 csa_sequence; 25660 uint32_t csa_flags; 25662 channel_attrs4 csa_fore_chan_attrs; 25663 channel_attrs4 csa_back_chan_attrs; 25665 uint32_t csa_cb_program; 25666 callback_sec_parms4 csa_sec_parms<>; 25667 }; 25669 18.36.2. RESULT 25670 struct CREATE_SESSION4resok { 25671 sessionid4 csr_sessionid; 25672 sequenceid4 csr_sequence; 25674 uint32_t csr_flags; 25676 channel_attrs4 csr_fore_chan_attrs; 25677 channel_attrs4 csr_back_chan_attrs; 25678 }; 25680 union CREATE_SESSION4res switch (nfsstat4 csr_status) { 25681 case NFS4_OK: 25682 CREATE_SESSION4resok csr_resok4; 25683 default: 25684 void; 25685 }; 25687 18.36.3. DESCRIPTION 25689 This operation is used by the client to create new session objects on 25690 the server. 25692 CREATE_SESSION can be sent with or without a preceding SEQUENCE 25693 operation in the same COMPOUND procedure. If CREATE_SESSION is sent 25694 with a preceding SEQUENCE operation, any session created by 25695 CREATE_SESSION has no direct relation to the session specified in the 25696 SEQUENCE operation, although the two sessions might be associated 25697 with the same client ID. 
If CREATE_SESSION is sent without a 25698 preceding SEQUENCE, then it MUST be the only operation in the 25699 COMPOUND procedure's request. If it is not, the server MUST return 25700 NFS4ERR_NOT_ONLY_OP. 25702 In addition to creating a session, CREATE_SESSION has the following 25703 effects: 25705 o The first session created with a new client ID serves to confirm 25706 the creation of that client's state on the server. The server 25707 returns the parameter values for the new session. 25709 o The connection over which CREATE_SESSION is sent is associated 25710 with the session's fore channel. 25712 The arguments and results of CREATE_SESSION are described as follows: 25714 csa_clientid: 25716 This is the client ID with which the new session will be 25717 associated. The corresponding result is csr_sessionid, the 25718 session ID of the new session. 25720 csa_sequence: 25722 Each client ID serializes CREATE_SESSION via a per-client ID 25723 sequence number (see Section 18.36.4). The corresponding result 25724 is csr_sequence, which MUST be equal to csa_sequence. 25726 In the next three arguments, the client offers a value that is to be 25727 a property of the session. Except where stated otherwise, it is 25728 RECOMMENDED that the server accept the value. If it is not 25729 acceptable, the server MAY use a different value. Regardless, the 25730 server MUST return the value the session will use (which will be 25731 either what the client offered, or what the server is insisting on) 25732 to the client. 25734 csa_flags: 25736 The csa_flags field contains a list of the following flag bits: 25738 CREATE_SESSION4_FLAG_PERSIST: 25740 If CREATE_SESSION4_FLAG_PERSIST is set, the client wants the 25741 server to provide a persistent reply cache. For sessions in 25742 which only idempotent operations will be used (e.g., a read- 25743 only session), clients SHOULD NOT set 25744 CREATE_SESSION4_FLAG_PERSIST.
If the server does not or cannot 25745 provide a persistent reply cache, the server MUST NOT set 25746 CREATE_SESSION4_FLAG_PERSIST in the field csr_flags. 25748 If the server is a pNFS metadata server, for reasons described 25749 in Section 12.5.2 it SHOULD support 25750 CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint 25751 (Section 5.12.4) attribute. 25753 CREATE_SESSION4_FLAG_CONN_BACK_CHAN: 25755 If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is set in csa_flags, the 25756 client is requesting that the connection over which the 25757 CREATE_SESSION operation arrived be associated with the 25758 session's backchannel in addition to its fore channel. If the 25759 server agrees, it sets CREATE_SESSION4_FLAG_CONN_BACK_CHAN in 25760 the result field csr_flags. If 25761 CREATE_SESSION4_FLAG_CONN_BACK_CHAN is not set in csa_flags, 25762 then CREATE_SESSION4_FLAG_CONN_BACK_CHAN MUST NOT be set in 25763 csr_flags. 25765 CREATE_SESSION4_FLAG_CONN_RDMA: 25767 If CREATE_SESSION4_FLAG_CONN_RDMA is set in csa_flags, and if 25768 the connection over which the CREATE_SESSION operation arrived 25769 is currently in non-RDMA mode but has the capability to operate 25770 in RDMA mode, then the client is requesting that the server 25771 "step up" to RDMA mode on the connection. If the server 25772 agrees, it sets CREATE_SESSION4_FLAG_CONN_RDMA in the result 25773 field csr_flags. If CREATE_SESSION4_FLAG_CONN_RDMA is not set 25774 in csa_flags, then CREATE_SESSION4_FLAG_CONN_RDMA MUST NOT be 25775 set in csr_flags. Note that once the server agrees to step up, 25776 it and the client MUST exchange all future traffic on the 25777 connection with RPC RDMA framing and not Record Marking ([32]). 
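The rules above for the three csa_flags bits share one shape: the server grants a requested capability only if it supports it, and it MUST NOT set a result bit the client did not request. The following is a non-normative Python sketch of that negotiation; the function name and the server-capability parameters are invented for the example:

```python
# Non-normative sketch of deriving csr_flags from csa_flags. The three
# flag constants come from the CREATE_SESSION ARGUMENT definition; the
# boolean capability parameters are hypothetical server state.

CREATE_SESSION4_FLAG_PERSIST        = 0x00000001
CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002
CREATE_SESSION4_FLAG_CONN_RDMA      = 0x00000004

def compute_csr_flags(csa_flags: int,
                      persist_ok: bool,
                      backchan_ok: bool,
                      rdma_ok: bool) -> int:
    """Grant each capability only if requested AND supported."""
    csr_flags = 0
    if (csa_flags & CREATE_SESSION4_FLAG_PERSIST) and persist_ok:
        csr_flags |= CREATE_SESSION4_FLAG_PERSIST
    if (csa_flags & CREATE_SESSION4_FLAG_CONN_BACK_CHAN) and backchan_ok:
        csr_flags |= CREATE_SESSION4_FLAG_CONN_BACK_CHAN
    if (csa_flags & CREATE_SESSION4_FLAG_CONN_RDMA) and rdma_ok:
        csr_flags |= CREATE_SESSION4_FLAG_CONN_RDMA
    return csr_flags
```

Because each result bit is ANDed with the corresponding request bit, the sketch can never set a flag in csr_flags that was absent from csa_flags, matching the MUST NOT requirements above.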
25779 csa_fore_chan_attrs, csa_back_chan_attrs: 25781 The csa_fore_chan_attrs and csa_back_chan_attrs fields apply to 25782 attributes of the fore channel (which conveys requests originating 25783 from the client to the server), and the backchannel (the channel 25784 that conveys callback requests originating from the server to the 25785 client), respectively. The results are in corresponding 25786 structures called csr_fore_chan_attrs and csr_back_chan_attrs. 25787 The results establish attributes for each channel, and on all 25788 subsequent use of each channel of the session. Each structure has 25789 the following fields: 25791 ca_headerpadsize: 25793 The maximum amount of padding the requester is willing to apply 25794 to ensure that write payloads are aligned on some boundary at 25795 the replier. For each channel, the server 25797 + will reply in ca_headerpadsize with its preferred value, or 25798 zero if padding is not in use, and 25800 + MAY decrease this value but MUST NOT increase it. 25802 ca_maxrequestsize: 25804 The maximum size of a COMPOUND or CB_COMPOUND request that will 25805 be sent. This size represents the XDR encoded size of the 25806 request, including the RPC headers (including security flavor 25807 credentials and verifiers) but excludes any RPC transport 25808 framing headers. Imagine a request coming over a non-RDMA TCP/ 25809 IP connection, and that it has a single Record Marking header 25810 preceding it. The maximum allowable count encoded in the 25811 header will be ca_maxrequestsize. If a requester sends a 25812 request that exceeds ca_maxrequestsize, the error 25813 NFS4ERR_REQ_TOO_BIG will be returned per the description in 25814 Section 2.10.6.4. For each channel, the server MAY decrease 25815 this value but MUST NOT increase it.
25817 ca_maxresponsesize: 25819 The maximum size of a COMPOUND or CB_COMPOUND reply that the 25820 requester will accept from the replier including RPC headers 25821 (see the ca_maxrequestsize definition). For each channel, the 25822 server MAY decrease this value, but MUST NOT increase it. 25823 However, if the client selects a value for ca_maxresponsesize 25824 such that a replier on a channel could never send a response, 25825 the server SHOULD return NFS4ERR_TOOSMALL in the CREATE_SESSION 25826 reply. After the session is created, if a requester sends a 25827 request for which the size of the reply would exceed this 25828 value, the replier will return NFS4ERR_REP_TOO_BIG, per the 25829 description in Section 2.10.6.4. 25831 ca_maxresponsesize_cached: 25833 Like ca_maxresponsesize, but the maximum size of a reply that 25834 will be stored in the reply cache (Section 2.10.6.1). For each 25835 channel, the server MAY decrease this value, but MUST NOT 25836 increase it. If, in the reply to CREATE_SESSION, the value of 25837 ca_maxresponsesize_cached of a channel is less than the value 25838 of ca_maxresponsesize of the same channel, then this is an 25839 indication to the requester that it needs to be selective about 25840 which replies it directs the replier to cache; for example, 25841 large replies from nonidempotent operations (e.g., COMPOUND 25842 requests with a READ operation) should not be cached. The 25843 requester decides which replies to cache via an argument to the 25844 SEQUENCE (the sa_cachethis field, see Section 18.46) or 25845 CB_SEQUENCE (the csa_cachethis field, see Section 20.9) 25846 operations. After the session is created, if a requester sends 25847 a request for which the size of the reply would exceed 25848 ca_maxresponsesize_cached, the replier will return 25849 NFS4ERR_REP_TOO_BIG_TO_CACHE, per the description in 25850 Section 2.10.6.4. 
25852 ca_maxoperations: 25854 The maximum number of operations the replier will accept in a 25855 COMPOUND or CB_COMPOUND. For the backchannel, the server MUST 25856 NOT change the value the client offers. For the fore channel, 25857 the server MAY change the requested value. After the session 25858 is created, if a requester sends a COMPOUND or CB_COMPOUND with 25859 more operations than ca_maxoperations, the replier MUST return 25860 NFS4ERR_TOO_MANY_OPS. 25862 ca_maxrequests: 25864 The maximum number of concurrent COMPOUND or CB_COMPOUND 25865 requests the requester will send on the session. Subsequent 25866 requests will each be assigned a slot identifier by the 25867 requester within the range zero to ca_maxrequests - 1 25868 inclusive. For the backchannel, the server MUST NOT change the 25869 value the client offers. For the fore channel, the server MAY 25870 change the requested value. 25872 ca_rdma_ird: 25874 This array has a maximum of one element. If this array has one 25875 element, then the element contains the inbound RDMA read queue 25876 depth (IRD). For each channel, the server MAY decrease this 25877 value, but MUST NOT increase it. 25879 csa_cb_program 25881 This is the ONC RPC program number the server MUST use in any 25882 callbacks sent through the backchannel to the client. The server 25883 MUST specify an ONC RPC program number equal to csa_cb_program and 25884 an ONC RPC version number equal to 4 in callbacks sent to the 25885 client. If a CB_COMPOUND is sent to the client, the server MUST 25886 use a minor version number of 1. There is no corresponding 25887 result. 25889 csa_sec_parms 25891 The field csa_sec_parms is an array of acceptable security 25892 credentials the server can use on the session's backchannel. 25893 Three security flavors are supported: AUTH_NONE, AUTH_SYS, and 25894 RPCSEC_GSS. 
If AUTH_NONE is specified for a credential, then the 25895 client is authorizing the server to use AUTH_NONE on all 25896 callbacks for the session. If AUTH_SYS is specified, then the 25897 client is authorizing the server to use AUTH_SYS on all callbacks, 25898 using the credential specified in cbsp_sys_cred. If RPCSEC_GSS is 25899 specified, then the server is allowed to use the RPCSEC_GSS 25900 context specified in cbsp_gss_parms as the RPCSEC_GSS context in 25901 the credential of the RPC header of callbacks to the client. 25902 There is no corresponding result. 25904 The RPCSEC_GSS context for the backchannel is specified via a pair 25905 of values of data type gsshandle4_t. The data type gsshandle4_t 25906 represents an RPCSEC_GSS handle, and is precisely the same as the 25907 data type of the "handle" field of the rpc_gss_init_res data type 25908 defined in Section 5.2.3.1, "Context Creation Response - 25909 Successful Acceptance", of [4]. 25911 The first RPCSEC_GSS handle, gcbp_handle_from_server, is the fore 25912 handle the server returned to the client (either in the handle 25913 field of data type rpc_gss_init_res or as one of the elements of 25914 the spi_handles field returned in the reply to EXCHANGE_ID) when 25915 the RPCSEC_GSS context was created on the server. The second 25916 handle, gcbp_handle_from_client, is the back handle to which the 25917 client will map the RPCSEC_GSS context. The server can 25918 immediately use the value of gcbp_handle_from_client in the 25919 RPCSEC_GSS credential in callback RPCs. That is, the value in 25920 gcbp_handle_from_client can be used as the value of the field 25921 "handle" in data type rpc_gss_cred_t (see Section 5, "Elements of 25922 the RPCSEC_GSS Security Protocol", of [4]) in callback RPCs.
The 25923 server MUST use the RPCSEC_GSS security service specified in 25924 gcbp_service, i.e., it MUST set the "service" field of the 25925 rpc_gss_cred_t data type in the RPCSEC_GSS credential to the value of 25926 gcbp_service (see Section 5.3.1, "RPC Request Header", of [4]). 25928 If the RPCSEC_GSS handle identified by gcbp_handle_from_server 25929 does not exist on the server, the server will return 25930 NFS4ERR_NOENT. 25932 Within each element of csa_sec_parms, the fore and back RPCSEC_GSS 25933 contexts MUST share the same GSS context and MUST have the same 25934 seq_window (see Section 5.2.3.1 of RFC2203 [4]). The fore and 25935 back RPCSEC_GSS context states are independent of each other as far 25936 as the RPCSEC_GSS sequence number (see the seq_num field in the 25937 rpc_gss_cred_t data type of Sections 5 and 5.3.1 of [4]) is 25938 concerned. 25939 If an RPCSEC_GSS handle is using the SSV context (see 25940 Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a 25941 common SSV GSS context, there are security considerations specific 25942 to this situation discussed in Section 2.10.10. 25944 Once the session is created, the first SEQUENCE or CB_SEQUENCE 25945 received on a slot MUST have a sequence ID equal to 1; if not, the 25946 replier MUST return NFS4ERR_SEQ_MISORDERED. 25948 18.36.4. IMPLEMENTATION 25950 To describe a possible implementation, the same notation for client 25951 records introduced in the description of EXCHANGE_ID is used with the 25952 following addition: 25954 clientid_arg: The value of the csa_clientid field of the 25955 CREATE_SESSION4args structure of the current request. 25957 Since CREATE_SESSION is a non-idempotent operation, we need to 25958 consider the possibility that retries may occur as a result of a 25959 client restart, network partition, malfunctioning router, etc.
For 25960 each client ID created by EXCHANGE_ID, the server maintains a 25961 separate reply cache (called the CREATE_SESSION reply cache) similar 25962 to the session reply cache used for SEQUENCE operations, with two 25963 distinctions. 25965 o First, this is a reply cache just for detecting and processing 25966 CREATE_SESSION requests for a given client ID. 25968 o Second, the size of the client ID reply cache is one slot (and 25969 as a result, the CREATE_SESSION request does not carry a slot 25970 number). This means that at most one CREATE_SESSION request for a 25971 given client ID can be outstanding. 25973 As previously stated, CREATE_SESSION can be sent with or without a 25974 preceding SEQUENCE operation. Even if a SEQUENCE precedes 25975 CREATE_SESSION, the server MUST maintain the CREATE_SESSION reply 25976 cache, which is separate from the reply cache for the session 25977 associated with a SEQUENCE. If CREATE_SESSION was originally sent by 25978 itself, the client MAY send a retry of the CREATE_SESSION operation 25979 within a COMPOUND preceded by a SEQUENCE. If CREATE_SESSION was 25980 originally sent in a COMPOUND that started with a SEQUENCE, then the 25981 client SHOULD send a retry in a COMPOUND that starts with a SEQUENCE 25982 that has the same session ID as the SEQUENCE of the original request. 25983 However, the client MAY send a retry in a COMPOUND that either has no 25984 preceding SEQUENCE, or has a preceding SEQUENCE that refers to a 25985 different session than the original CREATE_SESSION. This might be 25986 necessary if the client sends a CREATE_SESSION in a COMPOUND preceded 25987 by a SEQUENCE with session ID X, and session X no longer exists. 25988 Regardless, any retry of CREATE_SESSION, with or without a preceding 25989 SEQUENCE, MUST use the same value of csa_sequence as the original.
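The single-slot CREATE_SESSION reply cache just described can be sketched as follows. This is a non-normative Python illustration, with invented class and method names: the slot holds the sequence ID of the last processed request together with its cached reply, a retry carrying the same csa_sequence gets the cached reply, the next sequence ID is processed and cached, and anything else is misordered.

```python
# Non-normative sketch of the per-client-ID, single-slot CREATE_SESSION
# reply cache. Names are invented; error strings stand in for nfsstat4.

class CreateSessionSlot:
    def __init__(self, seqid, cached_reply):
        self.seqid = seqid              # sequence ID of last processed request
        self.cached_reply = cached_reply

    def handle(self, csa_sequence, process):
        """Replay, reject, or process a CREATE_SESSION request."""
        if csa_sequence == self.seqid:
            return self.cached_reply                  # retry: replay the cache
        if csa_sequence != (self.seqid + 1) % 2**32:  # not next ID (w/ wraparound)
            return "NFS4ERR_SEQ_MISORDERED"           # slot left unchanged
        self.seqid = csa_sequence                     # advance slot and cache
        self.cached_reply = process()
        return self.cached_reply
```

Because only one slot exists, at most one CREATE_SESSION per client ID can usefully be outstanding, and a retry MUST carry the same csa_sequence value to hit the cache.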
25991 After the client has received a reply to an EXCHANGE_ID operation 25992 that contains a new, unconfirmed client ID, the server expects the 25993 client to follow with a CREATE_SESSION operation to confirm the client ID. 25994 The server expects the value of csa_sequenceid in the arguments to that 25995 CREATE_SESSION to be equal to the value of the field eir_sequenceid 25996 that was returned in the results of the EXCHANGE_ID that returned the 25997 unconfirmed client ID. Before the server replies to that EXCHANGE_ID 25998 operation, it initializes the client ID slot to be equal to 25999 eir_sequenceid - 1 (accounting for underflow), and records a 26000 contrived CREATE_SESSION result with a "cached" result of 26001 NFS4ERR_SEQ_MISORDERED. With the client ID slot thus initialized, 26002 the processing of the CREATE_SESSION operation is divided into four 26003 phases: 26005 1. Client record look up. The server looks up the client ID in its 26006 client record table. If the server contains no records with 26007 client ID equal to clientid_arg, then most likely the client's 26008 state has been purged during a period of inactivity, possibly due 26009 to a loss of connectivity. NFS4ERR_STALE_CLIENTID is returned, 26010 and no changes are made to any client records on the server. 26011 Otherwise, the server goes to phase 2. 26013 2. Sequence ID processing. If csa_sequenceid is equal to the 26014 sequence ID in the client ID's slot, then this is a replay of the 26015 previous CREATE_SESSION request, and the server returns the 26016 cached result. If csa_sequenceid is not equal to the sequence ID 26017 in the slot, and is more than one greater (accounting for 26018 wraparound), then the server returns the error 26019 NFS4ERR_SEQ_MISORDERED, and does not change the slot.
If 26020 csa_sequenceid is equal to the slot's sequence ID + 1 (accounting 26021 for wraparound), then the slot's sequence ID is set to 26022 csa_sequenceid, and the CREATE_SESSION processing goes to the 26023 next phase. A subsequent new CREATE_SESSION call over the same 26024 client ID MUST use a csa_sequenceid that is one greater than the 26025 sequence ID in the slot. 26027 3. Client ID confirmation. If this would be the first session for 26028 the client ID, the CREATE_SESSION operation serves to confirm the 26029 client ID. Otherwise, the client ID confirmation phase is 26030 skipped and only the session creation phase occurs. Any case in 26031 which there is more than one record with identical values for 26032 client ID represents a server implementation error. Operation in 26033 the potentially valid cases is summarized as follows. 26035 * Successful Confirmation 26037 If the server has the following unconfirmed record, then 26038 this is the expected confirmation of an unconfirmed record. 26040 { ownerid, verifier, principal_arg, clientid_arg, 26041 unconfirmed } 26043 As noted in Section 18.35.4, the server might also have the 26044 following confirmed record. 26046 { ownerid, old_verifier, principal_arg, old_clientid, 26047 confirmed } 26049 The server schedules the replacement of both records with: 26051 { ownerid, verifier, principal_arg, clientid_arg, confirmed 26052 } 26053 The processing of CREATE_SESSION continues on to session 26054 creation. Once the session is successfully created, the 26055 scheduled client record replacement is committed. If the 26056 session is not successfully created, then no changes are 26057 made to any client records on the server. 26059 * Unsuccessful Confirmation 26061 If the server has the following record, then the client has 26062 changed principals after the previous EXCHANGE_ID request, 26063 or there has been a chance collision between shorthand 26064 client identifiers.
26066 { *, *, old_principal_arg, clientid_arg, * } 26068 Neither of these cases is permissible. Processing stops 26069 and NFS4ERR_CLID_INUSE is returned to the client. No 26070 changes are made to any client records on the server. 26072 4. Session creation. The server confirmed the client ID, either in 26073 this CREATE_SESSION operation, or a previous CREATE_SESSION 26074 operation. The server examines the remaining fields of the 26075 arguments. 26077 The server creates the session by recording the parameter values 26078 used (including whether the CREATE_SESSION4_FLAG_PERSIST flag is 26079 set and has been accepted by the server) and allocating space for 26080 the session reply cache (if there is not enough space, the server 26081 returns NFS4ERR_NOSPC). For each slot in the reply cache, the 26082 server sets the sequence ID to zero, and records an entry 26083 containing a COMPOUND reply with zero operations and the error 26084 NFS4ERR_SEQ_MISORDERED. This way, if the first SEQUENCE request 26085 sent has a sequence ID equal to zero, the server can simply 26086 return what is in the reply cache: NFS4ERR_SEQ_MISORDERED. The 26087 client initializes its reply cache for receiving callbacks in the 26088 same way, and similarly, the first CB_SEQUENCE operation on a 26089 slot after session creation MUST have a sequence ID of one. 26091 If the session state is created successfully, the server 26092 associates the session with the client ID provided by the client. 26094 When a request that had CREATE_SESSION4_FLAG_CONN_RDMA set needs 26095 to be retried, the retry MUST be done on a new connection that is 26096 in non-RDMA mode. If properties of the new connection are 26097 different enough that the arguments to CREATE_SESSION need to 26098 change, then a non-retry MUST be sent. The server will 26099 eventually dispose of any session that was created on the 26100 original connection. 
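The client ID confirmation phase (phase 3 above) can be sketched in Python as follows. This is non-normative: records use the (ownerid, verifier, principal, clientid, confirmed) tuple notation from this section, and the function name and status strings are invented for the example.

```python
# Non-normative sketch of phase 3 of CREATE_SESSION processing. Records
# are tuples (ownerid, verifier, principal, clientid, confirmed); the new
# record list is applied only once session creation succeeds.

def confirm_client_id(records, ownerid, verifier, principal, clientid_arg):
    """Return (status, new_records) for the confirmation phase."""
    unconfirmed = (ownerid, verifier, principal, clientid_arg, False)
    if unconfirmed in records:
        # Successful confirmation: schedule replacement of the unconfirmed
        # record (and any old confirmed record for the same owner and
        # principal) with a single confirmed record.
        kept = [r for r in records
                if r != unconfirmed
                and not (r[0] == ownerid and r[2] == principal and r[4])]
        kept.append((ownerid, verifier, principal, clientid_arg, True))
        return "NFS4_OK", kept
    if any(r[3] == clientid_arg and r[2] != principal for r in records):
        # Principal changed, or shorthand client ID collision: neither is
        # permissible, and no client records change.
        return "NFS4ERR_CLID_INUSE", records
    # Client ID was already confirmed by an earlier CREATE_SESSION:
    # skip confirmation and proceed to session creation unchanged.
    return "NFS4_OK", records
```

Note the replacement list is returned rather than applied in place, mirroring the rule that the scheduled record replacement is committed only after the session is successfully created.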
26102 On the backchannel, the client and server might wish to have many 26103 slots, in some cases perhaps more than on the fore channel, in order to 26104 deal with situations where the network link has high latency and 26105 is the primary bottleneck for response to recalls. If so, and if the 26106 client provides too few slots to the backchannel, the server might 26107 limit the number of recallable objects it gives to the client. 26109 Implementing RPCSEC_GSS callback support requires changes to both the 26110 client and server implementations of RPCSEC_GSS. One possible set of 26111 changes includes: 26113 o Adding a data structure that wraps the GSS-API context with a 26114 reference count. 26116 o Adding functions to increment and decrement the reference count. If 26117 the reference count is decremented to zero, the wrapper data 26118 structure and the GSS-API context it refers to would be freed. 26120 o Changing RPCSEC_GSS to create the wrapper data structure upon 26121 receiving a GSS-API context from gss_accept_sec_context() and 26122 gss_init_sec_context(). The reference count would be initialized 26123 to 1. 26125 o Adding a function to map an existing RPCSEC_GSS handle to a 26126 pointer to the wrapper data structure. The reference count would 26127 be incremented. 26129 o Adding a function to create a new RPCSEC_GSS handle from a pointer 26130 to the wrapper data structure. The reference count would be 26131 incremented. 26133 o Replacing calls from RPCSEC_GSS that free GSS-API contexts, with 26134 calls to decrement the reference count on the wrapper data 26135 structure. 26137 18.37. Operation 44: DESTROY_SESSION - Destroy a Session 26139 18.37.1. ARGUMENT 26141 struct DESTROY_SESSION4args { 26142 sessionid4 dsa_sessionid; 26143 }; 26145 18.37.2. RESULT 26147 struct DESTROY_SESSION4res { 26148 nfsstat4 dsr_status; 26149 }; 26151 18.37.3.
DESCRIPTION 26153 The DESTROY_SESSION operation closes the session and discards the 26154 session's reply cache, if any. Any remaining connections associated 26155 with the session are immediately disassociated. If the connection 26156 has no remaining associated sessions, the connection MAY be closed by 26157 the server. Locks, delegations, layouts, wants, and the lease, which 26158 are all tied to the client ID, are not affected by DESTROY_SESSION. 26160 DESTROY_SESSION MUST be invoked on a connection that is associated 26161 with the session being destroyed. In addition, if SP4_MACH_CRED 26162 state protection was specified when the client ID was created, the 26163 RPCSEC_GSS principal that created the session MUST be the one that 26164 destroys the session, using RPCSEC_GSS privacy or integrity. If 26165 SP4_SSV state protection was specified when the client ID was 26166 created, RPCSEC_GSS using the SSV mechanism (Section 2.10.9) MUST be 26167 used, with integrity or privacy. 26169 If the COMPOUND request starts with SEQUENCE, and if the sessionids 26170 specified in SEQUENCE and DESTROY_SESSION are the same, then 26172 o DESTROY_SESSION MUST be the final operation in the COMPOUND 26173 request. 26175 o It is advisable to avoid placing DESTROY_SESSION in a COMPOUND 26176 request with other state-modifying operations, because the 26177 DESTROY_SESSION will destroy the reply cache. 26179 o Because the session and its reply cache are destroyed, a client 26180 that retries the request may receive an error in reply to the 26181 retry, even though the original request was successful. 26183 If the COMPOUND request starts with SEQUENCE, and if the sessionids 26184 specified in SEQUENCE and DESTROY_SESSION are different, then 26185 DESTROY_SESSION can appear in any position of the COMPOUND request 26186 (except for the first position). The two sessionids can belong to 26187 different client IDs. 
26189 If the COMPOUND request does not start with SEQUENCE, and if 26190 DESTROY_SESSION is not the sole operation, then the server MUST return 26191 NFS4ERR_NOT_ONLY_OP. 26193 If there is a backchannel on the session and the server has 26194 outstanding CB_COMPOUND operations for the session that have not 26195 been replied to, then the server MAY refuse to destroy the session 26196 and return an error. If so, then in the event the backchannel is 26197 down, the server SHOULD return NFS4ERR_CB_PATH_DOWN to inform the 26198 client that the backchannel needs to be repaired before the server 26199 will allow the session to be destroyed. Otherwise, the error 26200 NFS4ERR_BACK_CHAN_BUSY SHOULD be returned to indicate that there are 26201 CB_COMPOUNDs that need to be replied to. The client SHOULD reply to 26202 all outstanding CB_COMPOUNDs before re-sending DESTROY_SESSION. 26204 18.38. Operation 45: FREE_STATEID - Free Stateid with No Locks 26206 18.38.1. ARGUMENT 26208 struct FREE_STATEID4args { 26209 stateid4 fsa_stateid; 26210 }; 26212 18.38.2. RESULT 26214 struct FREE_STATEID4res { 26215 nfsstat4 fsr_status; 26216 }; 26218 18.38.3. DESCRIPTION 26220 The FREE_STATEID operation is used to free a stateid that no longer 26221 has any associated locks (including opens, byte-range locks, 26222 delegations, and layouts). This may be because of client LOCKU 26223 operations or because of server revocation. If there are valid locks 26224 (of any kind) associated with the stateid in question, the error 26225 NFS4ERR_LOCKS_HELD will be returned, and the associated stateid will 26226 not be freed. 26228 When a stateid is freed that had been associated with revoked locks, 26229 by sending the FREE_STATEID operation, the client acknowledges the 26230 loss of those locks. This allows the server, once all such revoked 26231 state is acknowledged, to allow that client again to reclaim locks, 26232 without encountering the edge conditions discussed in Section 8.4.2.
26234 Once a successful FREE_STATEID is done for a given stateid, any 26235 subsequent use of that stateid will result in an NFS4ERR_BAD_STATEID 26236 error. 26238 18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory Delegation 26240 18.39.1. ARGUMENT 26242 typedef nfstime4 attr_notice4; 26244 struct GET_DIR_DELEGATION4args { 26245 /* CURRENT_FH: delegated directory */ 26246 bool gdda_signal_deleg_avail; 26247 bitmap4 gdda_notification_types; 26248 attr_notice4 gdda_child_attr_delay; 26249 attr_notice4 gdda_dir_attr_delay; 26250 bitmap4 gdda_child_attributes; 26251 bitmap4 gdda_dir_attributes; 26252 }; 26254 18.39.2. RESULT 26255 struct GET_DIR_DELEGATION4resok { 26256 verifier4 gddr_cookieverf; 26257 /* Stateid for get_dir_delegation */ 26258 stateid4 gddr_stateid; 26259 /* Which notifications can the server support */ 26260 bitmap4 gddr_notification; 26261 bitmap4 gddr_child_attributes; 26262 bitmap4 gddr_dir_attributes; 26263 }; 26265 enum gddrnf4_status { 26266 GDD4_OK = 0, 26267 GDD4_UNAVAIL = 1 26268 }; 26270 union GET_DIR_DELEGATION4res_non_fatal 26271 switch (gddrnf4_status gddrnf_status) { 26272 case GDD4_OK: 26273 GET_DIR_DELEGATION4resok gddrnf_resok4; 26274 case GDD4_UNAVAIL: 26275 bool gddrnf_will_signal_deleg_avail; 26276 }; 26278 union GET_DIR_DELEGATION4res 26279 switch (nfsstat4 gddr_status) { 26280 case NFS4_OK: 26281 GET_DIR_DELEGATION4res_non_fatal gddr_res_non_fatal4; 26282 default: 26283 void; 26284 }; 26286 18.39.3. DESCRIPTION 26288 The GET_DIR_DELEGATION operation is used by a client to request a 26289 directory delegation. The directory is represented by the current 26290 filehandle. The client also specifies whether it wants the server to 26291 notify it when the directory changes in certain ways by setting one 26292 or more bits in a bitmap. The server may refuse to grant the 26293 delegation. In that case, the server will return 26294 NFS4ERR_DIRDELEG_UNAVAIL. 
If the server decides to hand out the 26295 delegation, it will return a cookie verifier for that directory. If 26296 the cookie verifier changes when the client is holding the 26297 delegation, the delegation will be recalled unless the client has 26298 asked for notification for this event. 26300 The server will also return a directory delegation stateid, 26301 gddr_stateid, as a result of the GET_DIR_DELEGATION operation. This 26302 stateid will appear in callback messages related to the delegation, 26303 such as notifications and delegation recalls. The client will use 26304 this stateid to return the delegation voluntarily or upon recall. A 26305 delegation is returned by calling the DELEGRETURN operation. 26307 The server might not be able to support notifications of certain 26308 events. If the client asks for such notifications, the server MUST 26309 inform the client of its inability to do so as part of the 26310 GET_DIR_DELEGATION reply by not setting the appropriate bits in the 26311 supported notifications bitmask, gddr_notification, contained in the 26312 reply. The server MUST NOT add bits to gddr_notification that the 26313 client did not request. 26315 The GET_DIR_DELEGATION operation can be used for both normal and 26316 named attribute directories. 26318 If the client sets gdda_signal_deleg_avail to TRUE, then it is 26319 registering with the server a "want" for a directory delegation. If 26320 the delegation is not available, and the server supports and will 26321 honor the "want", the results will have 26322 gddrnf_will_signal_deleg_avail set to TRUE and no error will be 26323 indicated on return. If so, the client should expect a future 26324 CB_RECALLABLE_OBJ_AVAIL operation to indicate that a directory 26325 delegation is available. If the server does not wish to honor the 26326 "want" or is not able to do so, it returns the error 26327 NFS4ERR_DIRDELEG_UNAVAIL.
If the delegation is immediately 26328 available, the server SHOULD return it with the response to the 26329 operation, rather than via a callback. 26331 When a client makes a request for a directory delegation while it 26332 already holds a directory delegation for that directory (including 26333 the case where it has been recalled but not yet returned by the 26334 client or revoked by the server), the server MUST reply with the 26335 value of gddr_status set to NFS4_OK, the value of gddrnf_status set 26336 to GDD4_UNAVAIL, and the value of gddrnf_will_signal_deleg_avail set 26337 to FALSE. The delegation the client held before the request remains 26338 intact, and its state is unchanged. The current stateid is not 26339 changed (see Section 16.2.3.1.2 for a description of the current 26340 stateid). 26342 18.39.4. IMPLEMENTATION 26344 Directory delegations provide the benefit of improving cache 26345 consistency of namespace information. This is done through 26346 synchronous callbacks. A server must support synchronous callbacks 26347 in order to support directory delegations. In addition to that, 26348 asynchronous notifications provide a way to reduce network traffic as 26349 well as improve client performance in certain conditions. 26351 Notifications are specified in terms of potential changes to the 26352 directory. A client can ask to be notified of events by setting one 26353 or more bits in gdda_notification_types. The client can ask for 26354 notifications on addition of entries to a directory (by setting the 26355 NOTIFY4_ADD_ENTRY bit in gdda_notification_types), notifications on entry 26356 removal (NOTIFY4_REMOVE_ENTRY), renames (NOTIFY4_RENAME_ENTRY), 26357 directory attribute changes (NOTIFY4_CHANGE_DIR_ATTRIBUTES), and 26358 cookie verifier changes (NOTIFY4_CHANGE_COOKIE_VERIFIER) by setting 26359 one or more corresponding bits in the gdda_notification_types field.
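The bitmap handling just described can be sketched as follows. This is a non-normative Python illustration: a bitmap4 is modeled as an array of 32-bit words with bit N stored in word N // 32, the helper names are invented, and the numeric bit positions are assumed to match the notify_type4 enumeration defined elsewhere in this document.

```python
# Non-normative sketch of packing notification requests into a
# bitmap4-style word array and deriving the server's gddr_notification
# reply. Bit positions assumed from the notify_type4 enumeration.

NOTIFY4_CHANGE_CHILD_ATTRS     = 0
NOTIFY4_CHANGE_DIR_ATTRS       = 1
NOTIFY4_REMOVE_ENTRY           = 2
NOTIFY4_ADD_ENTRY              = 3
NOTIFY4_RENAME_ENTRY           = 4
NOTIFY4_CHANGE_COOKIE_VERIFIER = 5

def bitmap4_set(words, bit):
    """Set one bit, growing the word array as needed."""
    word = bit // 32
    while len(words) <= word:
        words.append(0)
    words[word] |= 1 << (bit % 32)
    return words

def gddr_notification(requested, supported):
    """Server reply: the requested bits it supports, and nothing more."""
    return [r & s for r, s in zip(requested, supported)]
```

The per-word AND in gddr_notification enforces the rule that the server MUST NOT add bits to the reply bitmask that the client did not request.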
The client can also ask for notifications of changes to attributes of directory entries (NOTIFY4_CHANGE_CHILD_ATTRIBUTES) in order to keep its attribute cache up to date.  However, any changes made to child attributes do not cause the delegation to be recalled.  If a client is interested in directory entry caching or negative name caching, it can set gdda_notification_types appropriately for its particular needs, and the server will notify it of all changes that would otherwise invalidate its name cache.  The kind of notification a client asks for may depend on the directory size, its rate of change, and the applications being used to access that directory.  The enumeration of the conditions under which a client might ask for a notification is outside the scope of this specification.

For attribute notifications, the client will set bits in the gdda_dir_attributes bitmap to indicate which attributes it wants to be notified of.  If the server does not support notifications for changes to a certain attribute, it SHOULD NOT set that attribute in the supported attribute bitmap specified in the reply (gddr_dir_attributes).  The client will also set in the gdda_child_attributes bitmap the attributes of directory entries it wants to be notified of, and the server will indicate in gddr_child_attributes which attributes of directory entries it will notify the client of.

The client will also let the server know whether it wants to get the notification as soon as the attribute change occurs or after a certain delay, by setting a delay factor; gdda_child_attr_delay is for attribute changes to directory entries and gdda_dir_attr_delay is for attribute changes to the directory.
If this delay factor is set to zero, that indicates to the server that the client wants to be notified of any attribute changes as soon as they occur.  If the delay factor is set to N seconds, the server will make a best-effort guarantee that attribute updates are synchronized within N seconds.  If the client asks for a delay factor that the server does not support, or one that may cause significant resource consumption on the server by causing the server to send many notifications, the server should not commit to sending out notifications for those attributes and therefore must not set the corresponding bits in the gddr_child_attributes and gddr_dir_attributes bitmaps in the response.

The client MUST use a security tuple (Section 2.6.1) that the directory or its applicable ancestor (Section 2.6) is exported with.  If not, the server MUST return NFS4ERR_WRONGSEC to the operation that both precedes GET_DIR_DELEGATION and sets the current filehandle (see Section 2.6.3.1).

The directory delegation covers all the entries in the directory except the parent entry.  That means that if a directory and its parent both hold directory delegations, any changes to the parent will not cause a notification to be sent for the child, even though the child's parent entry points to the parent directory.

18.40.  Operation 47: GETDEVICEINFO - Get Device Information

18.40.1.  ARGUMENT

   struct GETDEVICEINFO4args {
           deviceid4       gdia_device_id;
           layouttype4     gdia_layout_type;
           count4          gdia_maxcount;
           bitmap4         gdia_notify_types;
   };

18.40.2.  RESULT
   struct GETDEVICEINFO4resok {
           device_addr4    gdir_device_addr;
           bitmap4         gdir_notification;
   };

   union GETDEVICEINFO4res switch (nfsstat4 gdir_status) {
   case NFS4_OK:
           GETDEVICEINFO4resok gdir_resok4;
   case NFS4ERR_TOOSMALL:
           count4          gdir_mincount;
   default:
           void;
   };

18.40.3.  DESCRIPTION

The GETDEVICEINFO operation returns pNFS storage device address information for the specified device ID.  The client identifies the device information to be returned by providing the gdia_device_id and gdia_layout_type that uniquely identify the device.  The client provides gdia_maxcount to limit the number of bytes for the result.  This maximum size represents all of the data being returned within the GETDEVICEINFO4resok structure and includes the XDR overhead.  The server may return less data.  If the server is unable to return any information within the gdia_maxcount limit, the error NFS4ERR_TOOSMALL will be returned.  However, if gdia_maxcount is zero, NFS4ERR_TOOSMALL MUST NOT be returned.

The da_layout_type field of the gdir_device_addr returned by the server MUST be equal to the gdia_layout_type specified by the client.  If it is not equal, the client SHOULD ignore the response as invalid and behave as if the server returned an error, even if the client does have support for the layout type returned.

The client also provides a notification bitmap, gdia_notify_types, for the device ID mapping notifications it is interested in receiving; the server must support device ID notifications for the notification request to have effect.  The notification mask is composed in the same manner as the bitmap for file attributes (Section 3.3.7).  The numbers of bit positions are listed in the notify_device_type4 enumeration type (Section 20.12).
Only two enumerated values of notify_device_type4 currently apply to GETDEVICEINFO: NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE (see Section 20.12).

The notification bitmap applies only to the specified device ID.  If a client sends a GETDEVICEINFO operation on a device ID multiple times, the last notification bitmap is used by the server for subsequent notifications.  If the bitmap is zero or empty, then the device ID's notifications are turned off.

If the client merely wants to update or turn off notifications, it MAY send a GETDEVICEINFO operation with gdia_maxcount set to zero.  In that event, if the device ID is valid, the reply's da_addr_body field of the gdir_device_addr field will be of zero length.

If an unknown device ID is given in gdia_device_id, the server returns NFS4ERR_NOENT.  Otherwise, the device address information is returned in gdir_device_addr.  Finally, if the server supports notifications for device ID mappings, the gdir_notification result will contain a bitmap of which notifications it will actually send to the client (via CB_NOTIFY_DEVICEID; see Section 20.12).

If NFS4ERR_TOOSMALL is returned, the results also contain gdir_mincount.  The value of gdir_mincount represents the minimum size necessary to obtain the device information.

18.40.4.  IMPLEMENTATION

Aside from updating or turning off notifications, another use case for setting gdia_maxcount to zero is to validate a device ID.

The client SHOULD request notification of changes to or deletion of a device ID to device address mapping, so that the server can allow the client to adopt a new mapping gracefully, without having pending I/O fail abruptly and without forcing layouts that use the device ID to be recalled or revoked.
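The gdia_maxcount and notification-registration rules above can be summarized in a server-side sketch.  The registry dictionaries and the simplified size check (which treats the encoded reply size as a plain byte length, ignoring XDR framing details) are assumptions of this illustration, not part of the protocol.

```python
# Status codes from the nfsstat4 enumeration.
NFS4_OK = 0
NFS4ERR_NOENT = 2
NFS4ERR_TOOSMALL = 10005

def getdeviceinfo(device_table, notif_table, dev_id, maxcount, notify_types):
    """Sketch of server-side GETDEVICEINFO handling.

    device_table maps device ID -> encoded device_addr4 body (a
    hypothetical stand-in for the server's device registry);
    notif_table records the last notification bitmap per device ID.
    Returns (status, da_addr_body, gdir_mincount)."""
    if dev_id not in device_table:
        return NFS4ERR_NOENT, None, None
    # The last notification bitmap sent wins; zero/empty turns
    # notifications for this device ID off.
    notif_table[dev_id] = notify_types
    body = device_table[dev_id]
    if maxcount == 0:
        # Pure registration/validation: zero-length da_addr_body,
        # and NFS4ERR_TOOSMALL MUST NOT be returned.
        return NFS4_OK, b"", None
    if len(body) > maxcount:
        # Report the minimum size needed in gdir_mincount.
        return NFS4ERR_TOOSMALL, None, len(body)
    return NFS4_OK, body, None
```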
It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the client gets and processes the response to GETDEVICEINFO or GETDEVICELIST.  The analysis of the race leverages the fact that the server MUST NOT delete a device ID that is referred to by a layout the client holds.

o  CB_NOTIFY_DEVICEID deletes a device ID.  If the client believes it has layouts that refer to the device ID, then it is possible that layouts referring to the deleted device ID have been revoked.  The client should send a TEST_STATEID request using the stateid for each layout that might have been revoked.  If TEST_STATEID indicates that any layouts have been revoked, the client must recover from layout revocation as described in Section 12.5.6.  If TEST_STATEID indicates that at least one layout has not been revoked, the client should send a GETDEVICEINFO operation on the supposedly deleted device ID to verify that the device ID has been deleted.

   If GETDEVICEINFO indicates that the device ID does not exist, then the client assumes the server is faulty and recovers by sending an EXCHANGE_ID operation.  If GETDEVICEINFO indicates that the device ID does exist, then while the server is faulty for sending an erroneous device ID deletion notification, the degree to which it is faulty does not require the client to create a new client ID.

   If the client does not have layouts that refer to the device ID, no harm is done.  The client should mark the device ID as deleted, and when GETDEVICEINFO or GETDEVICELIST results are received that indicate that the device ID has in fact been deleted, the device ID should be removed from the client's cache.

o  CB_NOTIFY_DEVICEID indicates that a device ID's device addressing mappings have changed.
The client should assume that the results from the in-progress GETDEVICEINFO will be stale for the device ID once received, and so it should send another GETDEVICEINFO on the device ID.

18.41.  Operation 48: GETDEVICELIST - Get All Device Mappings for a File System

18.41.1.  ARGUMENT

   struct GETDEVICELIST4args {
           /* CURRENT_FH: object belonging to the file system */
           layouttype4     gdla_layout_type;

           /* number of deviceIDs to return */
           count4          gdla_maxdevices;

           nfs_cookie4     gdla_cookie;
           verifier4       gdla_cookieverf;
   };

18.41.2.  RESULT

   struct GETDEVICELIST4resok {
           nfs_cookie4     gdlr_cookie;
           verifier4       gdlr_cookieverf;
           deviceid4       gdlr_deviceid_list<>;
           bool            gdlr_eof;
   };

   union GETDEVICELIST4res switch (nfsstat4 gdlr_status) {
   case NFS4_OK:
           GETDEVICELIST4resok gdlr_resok4;
   default:
           void;
   };

18.41.3.  DESCRIPTION

This operation is used by the client to enumerate all of the device IDs that a server's file system uses.

The client provides a current filehandle of a file object that belongs to the file system (i.e., all file objects sharing the same fsid as that of the current filehandle) and the layout type in gdla_layout_type.  Since this operation might require multiple calls to enumerate all the device IDs (and is thus similar to the READDIR (Section 18.23) operation), the client also provides gdla_cookie and gdla_cookieverf to specify the current cursor position in the list.  When the client wants to read from the beginning of the file system's device mappings, it sets gdla_cookie to zero.  The field gdla_cookieverf MUST be ignored by the server when gdla_cookie is zero.  The client provides gdla_maxdevices to limit the number of device IDs in the result.  If gdla_maxdevices is zero, the server MUST return NFS4ERR_INVAL.
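The READDIR-like cursor scheme just described can be sketched from the client's side.  The `getdevicelist` callback below is a hypothetical transport hook, not part of this specification; it stands in for issuing the COMPOUND and decoding GETDEVICELIST4resok.

```python
def enumerate_device_ids(getdevicelist, layout_type, maxdevices=64):
    """Walk a file system's device IDs with the cookie/verifier cursor.

    getdevicelist(layout_type, maxdevices, cookie, cookieverf) is
    assumed to return (gdlr_cookie, gdlr_cookieverf,
    gdlr_deviceid_list, gdlr_eof)."""
    assert maxdevices > 0           # zero draws NFS4ERR_INVAL from the server
    # gdla_cookieverf is ignored by the server when gdla_cookie is
    # zero, so any initial verifier value works.
    cookie, verf = 0, b"\x00" * 8
    ids = []
    while True:
        cookie, verf, batch, eof = getdevicelist(
            layout_type, maxdevices, cookie, verf)
        ids.extend(batch)
        if eof:
            return ids
```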
The server MAY return fewer device IDs.

The successful response to the operation will contain the cookie, gdlr_cookie, and the cookie verifier, gdlr_cookieverf, to be used on the subsequent GETDEVICELIST.  A gdlr_eof value of TRUE signifies that there are no remaining entries in the server's device list.  Each element of gdlr_deviceid_list contains a device ID.

18.41.4.  IMPLEMENTATION

An example of the use of this operation is for pNFS clients and servers that use LAYOUT4_BLOCK_VOLUME layouts.  In these environments it may be helpful for a client to determine device accessibility upon first file system access.

18.42.  Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a Layout

18.42.1.  ARGUMENT

   union newtime4 switch (bool nt_timechanged) {
   case TRUE:
           nfstime4       nt_time;
   case FALSE:
           void;
   };

   union newoffset4 switch (bool no_newoffset) {
   case TRUE:
           offset4        no_offset;
   case FALSE:
           void;
   };

   struct LAYOUTCOMMIT4args {
           /* CURRENT_FH: file */
           offset4                loca_offset;
           length4                loca_length;
           bool                   loca_reclaim;
           stateid4               loca_stateid;
           newoffset4             loca_last_write_offset;
           newtime4               loca_time_modify;
           layoutupdate4          loca_layoutupdate;
   };

18.42.2.  RESULT

   union newsize4 switch (bool ns_sizechanged) {
   case TRUE:
           length4        ns_size;
   case FALSE:
           void;
   };

   struct LAYOUTCOMMIT4resok {
           newsize4       locr_newsize;
   };

   union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) {
   case NFS4_OK:
           LAYOUTCOMMIT4resok locr_resok4;
   default:
           void;
   };

18.42.3.  DESCRIPTION

The LAYOUTCOMMIT operation commits changes in the layout represented by the current filehandle, client ID (derived from the session ID in the preceding SEQUENCE operation), byte-range, and stateid.
Since layouts are sub-dividable, a smaller portion of a layout, retrieved via LAYOUTGET, can be committed.  The byte-range being committed is specified through the byte-range (loca_offset and loca_length).  This byte-range MUST overlap with one or more existing layouts previously granted via LAYOUTGET (Section 18.43), each with an iomode of LAYOUTIOMODE4_RW.  In the case where the iomode of any held layout segment is not LAYOUTIOMODE4_RW, the server should return the error NFS4ERR_BAD_IOMODE.  For the case where the client does not hold matching layout segment(s) for the defined byte-range, the server should return the error NFS4ERR_BAD_LAYOUT.

The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET.  The client may have only written a subset of the data range it previously requested.  LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end-of-file.  The layout referenced by LAYOUTCOMMIT is still valid after the operation completes and can continue to be referenced by the client ID, filehandle, byte-range, layout type, and stateid.

If the loca_reclaim field is set to TRUE, this indicates that the client is attempting to commit changes to a layout after the restart of the metadata server during the metadata server's recovery grace period (see Section 12.7.4).  This type of request may be necessary when the client has uncommitted writes to provisionally allocated byte-ranges of a file that were sent to the storage devices before the restart of the metadata server.  In this case, the layout provided by the client MUST be a subset of a writable layout that the client held immediately before the restart of the metadata server.
The value of the field loca_stateid MUST be a value that the metadata server returned before it restarted.  The metadata server is free to accept or reject this request based on its own internal metadata consistency checks.  If the metadata server finds that the layout provided by the client does not pass its consistency checks, it MUST reject the request with the status NFS4ERR_RECLAIM_BAD.  The successful completion of the LAYOUTCOMMIT request with loca_reclaim set to TRUE does NOT provide the client with a layout for the file.  It simply commits the changes to the layout specified in the loca_layoutupdate field.  To obtain a layout for the file, the client must send a LAYOUTGET request to the server after the server's grace period has expired.  If the metadata server receives a LAYOUTCOMMIT request with loca_reclaim set to TRUE when the metadata server is not in its recovery grace period, it MUST reject the request with the status NFS4ERR_NO_GRACE.

Setting the loca_reclaim field to TRUE is required if and only if the committed layout was acquired before the metadata server restart.  If the client is committing a layout that was acquired during the metadata server's grace period, it MUST set the "reclaim" field to FALSE.

The loca_stateid is a layout stateid value as returned by previously successful layout operations (see Section 12.5.3).

The loca_last_write_offset field specifies the offset of the last byte written by the client previous to the LAYOUTCOMMIT.  Note that this value is never equal to the file's size (at most it is one byte less than the file's size) and MUST be less than or equal to NFS4_MAXFILEOFF.  Also, loca_last_write_offset MUST overlap the range described by loca_offset and loca_length.
The metadata server may use this information to determine whether the file's size needs to be updated.  If the metadata server updates the file's size as the result of the LAYOUTCOMMIT operation, it must return the new size (locr_newsize.ns_size) as part of the results.

The loca_time_modify field allows the client to suggest a modification time it would like the metadata server to set.  The metadata server may use the suggestion or it may use the time of the LAYOUTCOMMIT operation to set the modification time.  If the metadata server uses the client-provided modification time, it should ensure that time does not flow backwards.  If the client wants to force the metadata server to set an exact time, the client should use a SETATTR operation in a COMPOUND right after LAYOUTCOMMIT.  See Section 12.5.4 for more details.  If the client desires the resultant modification time, it should construct the COMPOUND so that a GETATTR follows the LAYOUTCOMMIT.

The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism for a client to provide layout-specific updates to the metadata server.  For example, the layout update can describe what byte-ranges of the original layout have been used and what byte-ranges can be deallocated.  There is no NFSv4.1 file layout-specific layoutupdate4 structure.

The layout information is more verbose for block devices than for objects and files because the latter two hide the details of block allocation behind their storage protocols.  At the minimum, the client needs to communicate changes to the end-of-file location back to the server, and, if desired, its view of the file's modification time.  For block/volume layouts, it needs to specify precisely which blocks have been used.
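The end-of-file update described above, where loca_last_write_offset may feed a new locr_newsize, can be sketched as a small decision function.  This is an illustrative model of one plausible metadata-server check, not normative behavior.

```python
def layoutcommit_newsize(current_size, no_newoffset, last_write_offset):
    """Sketch of the metadata server's size check for LAYOUTCOMMIT.

    The last byte written is at most one less than the file's size,
    so a last write at offset N implies the file is at least N + 1
    bytes long.  Returns the new size to report in
    locr_newsize.ns_size, or None when ns_sizechanged is FALSE."""
    if not no_newoffset:            # client sent no_newoffset == FALSE
        return None
    implied_size = last_write_offset + 1
    if implied_size > current_size:
        return implied_size         # server grows the file
    return None                     # no size change to report
```

For example, a last write at offset 149 against a 100-byte file implies a new size of 150 bytes.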
If the layout identified in the arguments does not exist, the error NFS4ERR_BADLAYOUT is returned.  The layout being committed may also be rejected if it does not correspond to an existing layout with an iomode of LAYOUTIOMODE4_RW.

On success, the current filehandle retains its value and the current stateid retains its value.

18.42.4.  IMPLEMENTATION

The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set to TRUE to convey hints about modified file attributes or to report layout-type specific information such as I/O errors for object-based storage layouts, as would be done during normal operation.  Doing so may help the metadata server to recover files more efficiently after restart.  For example, some file system implementations may require expansive recovery of file system objects if the metadata server does not get a positive indication from all clients holding a LAYOUTIOMODE4_RW layout that they have successfully completed all their writes.  Sending a LAYOUTCOMMIT (if required) and then following with LAYOUTRETURN can provide such an indication and allow for graceful and efficient recovery.

If loca_reclaim is TRUE, the metadata server is free to either examine or ignore the value in the field loca_stateid.  The metadata server implementation might or might not encode in its layout stateid information that allows the metadata server to perform a consistency check on the LAYOUTCOMMIT request.

18.43.  Operation 50: LAYOUTGET - Get Layout Information

18.43.1.  ARGUMENT

   struct LAYOUTGET4args {
           /* CURRENT_FH: file */
           bool                   loga_signal_layout_avail;
           layouttype4            loga_layout_type;
           layoutiomode4          loga_iomode;
           offset4                loga_offset;
           length4                loga_length;
           length4                loga_minlength;
           stateid4               loga_stateid;
           count4                 loga_maxcount;
   };

18.43.2.  RESULT
   struct LAYOUTGET4resok {
           bool               logr_return_on_close;
           stateid4           logr_stateid;
           layout4            logr_layout<>;
   };

   union LAYOUTGET4res switch (nfsstat4 logr_status) {
   case NFS4_OK:
           LAYOUTGET4resok logr_resok4;
   case NFS4ERR_LAYOUTTRYLATER:
           bool logr_will_signal_layout_avail;
   default:
           void;
   };

18.43.3.  DESCRIPTION

The LAYOUTGET operation requests a layout from the metadata server for reading or writing the file given by the filehandle at the byte-range specified by offset and length.  Layouts are identified by the client ID (derived from the session ID in the preceding SEQUENCE operation), current filehandle, layout type (loga_layout_type), and the layout stateid (loga_stateid).  The use of the loga_iomode field depends upon the layout type, but should reflect the client's data access intent.

If the metadata server is in a grace period, and does not persist layouts and device ID to device address mappings, then it MUST return NFS4ERR_GRACE (see Section 8.4.2.1).

The LAYOUTGET operation returns layout information for the specified byte-range: a layout.  The client actually specifies two ranges, both starting at the offset in the loga_offset field.  The first range is between loga_offset and loga_offset + loga_length - 1 inclusive.  This range indicates the desired range the client wants the layout to cover.  The second range is between loga_offset and loga_offset + loga_minlength - 1 inclusive.  This range indicates the required range the client needs the layout to cover.  Thus, loga_minlength MUST be less than or equal to loga_length.
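The two ranges can be sketched numerically.  The helper below is illustrative only; it returns inclusive (start, end) pairs, with None standing for an empty required range when loga_minlength is zero.

```python
NFS4_UINT64_MAX = 2**64 - 1

def layoutget_ranges(loga_offset, loga_length, loga_minlength):
    """Compute the desired and required ranges of a LAYOUTGET request.

    Both ranges start at loga_offset; a length of NFS4_UINT64_MAX
    means 'through end-of-file' regardless of the file's length."""
    assert loga_minlength <= loga_length    # else NFS4ERR_INVAL

    def end(length):
        if length == NFS4_UINT64_MAX:
            return NFS4_UINT64_MAX          # through EOF
        return loga_offset + length - 1

    desired = (loga_offset, end(loga_length))
    required = (None if loga_minlength == 0
                else (loga_offset, end(loga_minlength)))
    return desired, required
```

For example, loga_offset = 100, loga_length = 50, loga_minlength = 10 yields a desired range of [100, 149] and a required range of [100, 109].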
When a length field is set to NFS4_UINT64_MAX, this indicates a desire (when loga_length is NFS4_UINT64_MAX) or requirement (when loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset through the end-of-file, regardless of the file's length.

The following rules govern the relationships among, and the minima of, loga_length, loga_minlength, and loga_offset.

o  If loga_length is less than loga_minlength, the metadata server MUST return NFS4ERR_INVAL.

o  If loga_minlength is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available".  Readily is subjective, and depends on the layout type and the pNFS server implementation.  For example, some metadata servers might have to pre-allocate stable storage when they receive a request for a range of a file that goes beyond the file's current length.  If loga_minlength is zero and loga_length is greater than zero, this tells the metadata server what range of the layout the client would prefer to have.  If loga_length and loga_minlength are both zero, then the client is indicating that it desires a layout of any length with the ending offset of the range no less than the value specified by loga_offset, and the starting offset at or below loga_offset.  If the metadata server does not have a layout that is readily available, then it MUST return NFS4ERR_LAYOUTTRYLATER.

o  If the sum of loga_offset and loga_minlength exceeds NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL MUST result.

o  If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX, and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL MUST result.
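The four rules above can be condensed into a small argument check.  This is a sketch: the `readily_available` flag stands in for the server's own (subjective) notion of a readily available layout, consulted only when loga_minlength is zero.

```python
NFS4_UINT64_MAX = 2**64 - 1

# Status codes from the nfsstat4 enumeration.
NFS4_OK = 0
NFS4ERR_INVAL = 22
NFS4ERR_LAYOUTTRYLATER = 10058

def check_layoutget_range(offset, length, minlength, readily_available):
    """Apply the loga_offset/loga_length/loga_minlength rules in order."""
    if length < minlength:
        return NFS4ERR_INVAL
    # Overflow checks do not apply to the special to-EOF value.
    if minlength != NFS4_UINT64_MAX and offset + minlength > NFS4_UINT64_MAX:
        return NFS4ERR_INVAL
    if length != NFS4_UINT64_MAX and offset + length > NFS4_UINT64_MAX:
        return NFS4ERR_INVAL
    if minlength == 0 and not readily_available:
        return NFS4ERR_LAYOUTTRYLATER
    return NFS4_OK
```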
After the metadata server has performed the above checks on loga_offset, loga_minlength, and loga_length, the metadata server MUST return a layout according to the rules in Table 13.

Acceptable layouts based on loga_minlength.  Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength.

   +-----------+-----------+----------+----------+---------------------+
   | Layout    | Layout    | Layout   | Layout   | Layout length of    |
   | iomode of | a_minlen  | iomode   | offset   | reply               |
   | request   | of        | of reply | of reply |                     |
   |           | request   |          |          |                     |
   +-----------+-----------+----------+----------+---------------------+
   | _READ     | u64m      | MAY be   | MUST be  | MUST be >= file     |
   |           |           | _READ    | <= a_off | length - layout     |
   |           |           |          |          | offset              |
   | _READ     | u64m      | MAY be   | MUST be  | MUST be u64m        |
   |           |           | _RW      | <= a_off |                     |
   | _READ     | > 0 and < | MAY be   | MUST be  | MUST be >= MIN(file |
   |           | u64m      | _READ    | <= a_off | length, a_minlen +  |
   |           |           |          |          | a_off) - layout     |
   |           |           |          |          | offset              |
   | _READ     | > 0 and < | MAY be   | MUST be  | MUST be >= a_off -  |
   |           | u64m      | _RW      | <= a_off | layout offset +     |
   |           |           |          |          | a_minlen            |
   | _READ     | 0         | MAY be   | MUST be  | MUST be > 0         |
   |           |           | _READ    | <= a_off |                     |
   | _READ     | 0         | MAY be   | MUST be  | MUST be > 0         |
   |           |           | _RW      | <= a_off |                     |
   | _RW       | u64m      | MUST be  | MUST be  | MUST be u64m        |
   |           |           | _RW      | <= a_off |                     |
   | _RW       | > 0 and < | MUST be  | MUST be  | MUST be >= a_off -  |
   |           | u64m      | _RW      | <= a_off | layout offset +     |
   |           |           |          |          | a_minlen            |
   | _RW       | 0         | MUST be  | MUST be  | MUST be > 0         |
   |           |           | _RW      | <= a_off |                     |
   +-----------+-----------+----------+----------+---------------------+

                               Table 13

If loga_minlength is not zero and the metadata server cannot return a layout according to the rules in Table 13, then the metadata server MUST return the error
NFS4ERR_BADLAYOUT.  If loga_minlength is zero and the metadata server cannot or will not return a layout according to the rules in Table 13, then the metadata server MUST return the error NFS4ERR_LAYOUTTRYLATER.  Assuming that loga_length is greater than loga_minlength or equal to zero, the metadata server SHOULD return a layout according to the rules in Table 14.

Desired layouts based on loga_length.  The rules of Table 13 MUST be applied first.  Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; a_len = loga_length.

   +------------+------------+-----------+-----------+-----------------+
   | Layout     | Layout     | Layout    | Layout    | Layout length   |
   | iomode of  | a_len of   | iomode of | offset of | of reply        |
   | request    | request    | reply     | reply     |                 |
   +------------+------------+-----------+-----------+-----------------+
   | _READ      | u64m       | MAY be    | MUST be   | SHOULD be u64m  |
   |            |            | _READ     | <= a_off  |                 |
   | _READ      | u64m       | MAY be    | MUST be   | SHOULD be u64m  |
   |            |            | _RW       | <= a_off  |                 |
   | _READ      | > 0 and <  | MAY be    | MUST be   | SHOULD be >=    |
   |            | u64m       | _READ     | <= a_off  | a_off - layout  |
   |            |            |           |           | offset + a_len  |
   | _READ      | > 0 and <  | MAY be    | MUST be   | SHOULD be >=    |
   |            | u64m       | _RW       | <= a_off  | a_off - layout  |
   |            |            |           |           | offset + a_len  |
   | _READ      | 0          | MAY be    | MUST be   | SHOULD be >     |
   |            |            | _READ     | <= a_off  | a_off - layout  |
   |            |            |           |           | offset          |
   | _READ      | 0          | MAY be    | MUST be   | SHOULD be >     |
   |            |            | _RW       | <= a_off  | a_off - layout  |
   |            |            |           |           | offset          |
   | _RW        | u64m       | MUST be   | MUST be   | SHOULD be u64m  |
   |            |            | _RW       | <= a_off  |                 |
   | _RW        | > 0 and <  | MUST be   | MUST be   | SHOULD be >=    |
   |            | u64m       | _RW       | <= a_off  | a_off - layout  |
   |            |            |           |           | offset + a_len  |
   | _RW        | 0          | MUST be   | MUST be   | SHOULD be >     |
   |            |            | _RW       | <= a_off  | a_off - layout  |
   |            |            |           |           | offset          |
   +------------+------------+-----------+-----------+-----------------+

                               Table 14

The loga_stateid field specifies a valid stateid.  If a layout is not currently held by the client, the loga_stateid field represents a stateid reflecting the correspondingly valid open, byte-range lock, or delegation stateid.  Once a layout is held on the file by the client, the loga_stateid field MUST be a stateid as returned from a previous LAYOUTGET or LAYOUTRETURN operation or provided by a CB_LAYOUTRECALL operation (see Section 12.5.3).

The loga_maxcount field specifies the maximum layout size (in bytes) that the client can handle.  If the size of the layout structure exceeds the size specified by maxcount, the metadata server will return the NFS4ERR_TOOSMALL error.

The returned layout is expressed as an array, logr_layout, with each element of type layout4.  If a file has a single striping pattern, then logr_layout SHOULD contain just one entry.  Otherwise, if the requested range overlaps more than one striping pattern, logr_layout will contain the required number of entries.  The elements of logr_layout MUST be sorted in ascending order of the value of the lo_offset field of each element.  There MUST be no gaps or overlaps in the range between two successive elements of logr_layout.  The lo_iomode field in each element of logr_layout MUST be the same.

Table 13 and Table 14 both refer to a returned layout iomode, offset, and length.  Because the returned layout is encoded in the logr_layout array, more description is required.

iomode

   The value of the returned layout iomode listed in Table 13 and Table 14 is equal to the value of the lo_iomode field in each element of logr_layout.
   As shown in Table 13 and Table 14, the metadata server MAY return a layout with an lo_iomode different from the requested iomode (field loga_iomode of the request).  If it does so, it MUST ensure that the lo_iomode is more permissive than the loga_iomode requested.  For example, this behavior allows an implementation to upgrade LAYOUTIOMODE4_READ requests to LAYOUTIOMODE4_RW requests at its discretion, within the limits of the layout type specific protocol.  A lo_iomode of either LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned.

offset

   The value of the returned layout offset listed in Table 13 and Table 14 is always equal to the lo_offset field of the first element of logr_layout.

length

   When setting the value of the returned layout length, the situation is complicated by the possibility that the special layout length value NFS4_UINT64_MAX is involved.  For a logr_layout array of N elements, the lo_length field in the first N-1 elements MUST NOT be NFS4_UINT64_MAX.  The lo_length field of the last element of logr_layout can be NFS4_UINT64_MAX under some conditions as described in the following list.

   *  If an applicable rule of Table 13 states that the metadata server MUST return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout MUST be NFS4_UINT64_MAX.

   *  If an applicable rule of Table 13 states that the metadata server MUST NOT return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout MUST NOT be NFS4_UINT64_MAX.

   *  If an applicable rule of Table 14 states that the metadata server SHOULD return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout SHOULD be NFS4_UINT64_MAX.
27032 * When the value of the returned layout length of Table 13 and 27033 Table 14 is not NFS4_UINT64_MAX, then the returned layout 27034 length is equal to the sum of the lo_length fields of each 27035 element of logr_layout. 27037 The logr_return_on_close result field is a directive to return the 27038 layout before closing the file. When the metadata server sets this 27039 return value to TRUE, it MUST be prepared to recall the layout in the 27040 case in which the client fails to return the layout before close. 27041 For the metadata server that knows a layout must be returned before a 27042 close of the file, this return value can be used to communicate the 27043 desired behavior to the client and thus remove one extra step from 27044 the client's and metadata server's interaction. 27046 The logr_stateid stateid is returned to the client for use in 27047 subsequent layout related operations. See Sections 8.2, 12.5.3, and 27048 12.5.5.2 for a further discussion and requirements. 27050 The format of the returned layout (lo_content) is specific to the 27051 layout type. The value of the layout type (lo_content.loc_type) for 27052 each of the elements of the array of layouts returned by the metadata 27053 server (logr_layout) MUST be equal to the loga_layout_type specified 27054 by the client. If it is not equal, the client SHOULD ignore the 27055 response as invalid and behave as if the metadata server returned an 27056 error, even if the client does have support for the layout type 27057 returned. 27059 If neither the requested file nor its containing file system support 27060 layouts, the metadata server MUST return NFS4ERR_LAYOUTUNAVAILABLE. 27061 If the layout type is not supported, the metadata server MUST return 27062 NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout 27063 matches the client provided layout identification, the metadata 27064 server MUST return NFS4ERR_BADLAYOUT. 
If an invalid loga_iomode is 27065 specified, or a loga_iomode of LAYOUTIOMODE4_ANY is specified, the 27066 metadata server MUST return NFS4ERR_BADIOMODE. 27068 If the layout for the file is unavailable due to transient 27069 conditions, e.g., file sharing prohibits layouts, the metadata server 27070 MUST return NFS4ERR_LAYOUTTRYLATER. 27072 If the layout request is rejected due to an overlapping layout 27073 recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. See 27074 Section 12.5.5.2 for details. 27076 If the layout conflicts with a mandatory byte-range lock held on the 27077 file, and if the storage devices have no method of enforcing 27078 mandatory locks, other than through the restriction of layouts, the 27079 metadata server SHOULD return NFS4ERR_LOCKED. 27081 If the client sets loga_signal_layout_avail to TRUE, then it is 27082 registering with the metadata server a "want" for a layout in the event the 27083 layout cannot be obtained due to resource exhaustion. If the 27084 metadata server supports and will honor the "want", the results will 27085 have logr_will_signal_layout_avail set to TRUE. If so, the client 27086 should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a 27087 layout is available. 27089 On success, the current filehandle retains its value and the current 27090 stateid is updated to match the value as returned in the results. 27092 18.43.4. IMPLEMENTATION 27094 Typically, LAYOUTGET will be called as part of a COMPOUND request 27095 after an OPEN operation and results in the client having location 27096 information for the file. This requires that loga_stateid be set to 27097 the special stateid that tells the metadata server to use the current 27098 stateid, which is set by OPEN (see Section 16.2.3.1.2). A client may 27099 also hold a layout across multiple OPENs. The client specifies a 27100 layout type that limits what kind of layout the metadata server will 27101 return.
This prevents metadata servers from granting layouts that 27102 are unusable by the client. 27104 As indicated by Table 13 and Table 14, the specification of LAYOUTGET 27105 allows a pNFS client and server considerable flexibility. A pNFS 27106 client can take several strategies for sending LAYOUTGET. Some 27107 examples are as follows. 27109 o If LAYOUTGET is preceded by OPEN in the same COMPOUND request and 27110 the OPEN requests OPEN4_SHARE_ACCESS_READ access, the client might 27111 opt to request a _READ layout with loga_offset set to zero, 27112 loga_minlength set to zero, and loga_length set to 27113 NFS4_UINT64_MAX. If the file has space allocated to it, that 27114 space is striped over one or more storage devices, and there is 27115 either no conflicting layout or the concept of a conflicting 27116 layout does not apply to the pNFS server's layout type or 27117 implementation, then the metadata server might return a layout 27118 with a starting offset of zero, and a length equal to the length 27119 of the file, if not NFS4_UINT64_MAX. If the length of the file is 27120 not a multiple of the pNFS server's stripe width (see Section 13.2 27121 for a formal definition), the metadata server might round up the 27122 returned layout's length. 27124 o If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and 27125 the OPEN requests OPEN4_SHARE_ACCESS_WRITE access and does not 27126 truncate the file, the client might opt to request a _RW layout 27127 with loga_offset set to zero, loga_minlength set to zero, and 27128 loga_length set to the file's current length (if known), or 27129 NFS4_UINT64_MAX. As with the previous case, under some conditions 27130 the metadata server might return a layout that covers the entire 27131 length of the file or beyond. 27133 o This strategy is as above, but the OPEN truncates the file. 
In 27134 this case, the client might anticipate it will be writing to the 27135 file from offset zero, and so loga_offset and loga_minlength are 27136 set to zero, and loga_length is set to the value of 27137 threshold4_write_iosize. The metadata server might return a 27138 layout from offset zero with a length at least as long as 27139 threshold4_write_iosize. 27141 o A process on the client invokes a request to read from offset 27142 10000 for length 50000. The client is using buffered I/O, and has 27143 buffer sizes of 4096 bytes. The client intends to map the request 27144 of the process into a series of READ requests starting at offset 27145 8192. The end offset needs to be higher than 10000 + 50000 = 27146 60000, and the next offset that is a multiple of 4096 is 61440. 27147 The difference between 61440 and that starting offset of the 27148 layout is 53248 (which is the product of 4096 and 15). The value 27149 of threshold4_read_iosize is less than 53248, so the client sends 27150 a LAYOUTGET request with loga_offset set to 8192, loga_minlength 27151 set to 53248, and loga_length set to the file's length (if known) 27152 minus 8192 or NFS4_UINT64_MAX (if the file's length is not known). 27153 Since this LAYOUTGET request exceeds the metadata server's 27154 threshold, it grants the layout, possibly with an initial offset 27155 of zero, with an end offset of at least 8192 + 53248 - 1 = 61439, 27156 but preferably a layout with an offset aligned on the stripe width 27157 and a length that is a multiple of the stripe width. 27159 o This strategy is as above, but the client is not using buffered I/ 27160 O, and instead all internal I/O requests are sent directly to the 27161 server. The LAYOUTGET request has loga_offset equal to 10000 and 27162 loga_minlength set to 50000. The value of loga_length is set to 27163 the length of the file. 
The metadata server is free to return a 27164 layout that fully overlaps the requested range, with a starting 27165 offset and length aligned on the stripe width. 27167 o Again, a process on the client invokes a request to read from 27168 offset 10000 for length 50000 (i.e. a range with a starting offset 27169 of 10000 and an ending offset of 69999), and buffered I/O is in 27170 use. The client is expecting that the server might not be able to 27171 return the layout for the full I/O range. The client intends to 27172 map the request of the process into a series of thirteen READ 27173 requests starting at offset 8192, each with length 4096, with a 27174 total length of 53248 (which equals 13 * 4096), which fully 27175 contains the range that client's process wants to read. Because 27176 the value of threshold4_read_iosize is equal to 4096, it is 27177 practical and reasonable for the client to use several LAYOUTGET 27178 operations to complete the series of READs. The client sends a 27179 LAYOUTGET request with loga_offset set to 8192, loga_minlength set 27180 to 4096, and loga_length set to 53248 or higher. The server will 27181 grant a layout possibly with an initial offset of zero, with an 27182 end offset of at least 8192 + 4096 - 1 = 12287, but preferably a 27183 layout with an offset aligned on the stripe width and a length 27184 that is a multiple of the stripe width. This will allow the 27185 client to make forward progress, possibly sending more LAYOUTGET 27186 operations for the remainder of the range. 27188 o An NFS client detects a sequential read pattern, and so sends a 27189 LAYOUTGET operation that goes well beyond any current or pending 27190 read requests to the server. The server might likewise detect 27191 this pattern, and grant the LAYOUTGET request. 
Once the client 27192 reads from an offset of the file that represents 50% of the way 27193 through the range of the last layout it received, in order to 27194 avoid stalling I/O that would wait for a layout, the client sends 27195 more operations from an offset of the file that represents 50% of 27196 the way through the last layout it received. The client continues 27197 to request layouts with byte-ranges that are well in advance of 27198 the byte-ranges of recent and/or anticipated read requests of processes 27199 running on the client. 27201 o This strategy is as above, except that the client fails to detect the 27202 pattern while the server does. The next time the metadata server 27203 gets a LAYOUTGET, it returns a layout with a length that is well 27204 beyond loga_minlength. 27206 o A client is using buffered I/O, and has a long queue of write- 27207 behinds to process and also detects a sequential write pattern. 27208 It sends a LAYOUTGET for a layout that spans the range of the 27209 queued write-behinds and well beyond, including ranges beyond the 27210 file's current length. The client continues to send LAYOUTGET 27211 operations once the write-behind queue reaches 50% of the maximum 27212 queue length. 27214 Once the client has obtained a layout referring to a particular 27215 device ID, the metadata server MUST NOT delete the device ID until 27216 the layout is returned or revoked. 27218 CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is 27219 that LAYOUTGET returns a device ID for which the client does not have 27220 device address mappings, and the metadata server sends a 27221 CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and 27222 meanwhile the client sends GETDEVICEINFO on the device ID. This 27223 scenario is discussed in Section 18.40.4. Another scenario is that 27224 the CB_NOTIFY_DEVICEID is processed by the client before it processes 27225 the results from LAYOUTGET.
The client will send a GETDEVICEINFO on 27226 the device ID. If the results from GETDEVICEINFO are received before 27227 the client gets results from LAYOUTGET, then there is no longer a 27228 race. If the results from LAYOUTGET are received before the results 27229 from GETDEVICEINFO, the client can either wait for results of 27230 GETDEVICEINFO or send another one to get possibly more up-to-date 27231 device address mappings for the device ID. 27233 18.44. Operation 51: LAYOUTRETURN - Release Layout Information 27235 18.44.1. ARGUMENT 27236 /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ 27237 const LAYOUT4_RET_REC_FILE = 1; 27238 const LAYOUT4_RET_REC_FSID = 2; 27239 const LAYOUT4_RET_REC_ALL = 3; 27241 enum layoutreturn_type4 { 27242 LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, 27243 LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, 27244 LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL 27245 }; 27247 struct layoutreturn_file4 { 27248 offset4 lrf_offset; 27249 length4 lrf_length; 27250 stateid4 lrf_stateid; 27251 /* layouttype4 specific data */ 27252 opaque lrf_body<>; 27253 }; 27255 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 27256 case LAYOUTRETURN4_FILE: 27257 layoutreturn_file4 lr_layout; 27258 default: 27259 void; 27260 }; 27262 struct LAYOUTRETURN4args { 27263 /* CURRENT_FH: file */ 27264 bool lora_reclaim; 27265 layouttype4 lora_layout_type; 27266 layoutiomode4 lora_iomode; 27267 layoutreturn4 lora_layoutreturn; 27268 }; 27270 18.44.2. RESULT 27271 union layoutreturn_stateid switch (bool lrs_present) { 27272 case TRUE: 27273 stateid4 lrs_stateid; 27274 case FALSE: 27275 void; 27276 }; 27278 union LAYOUTRETURN4res switch (nfsstat4 lorr_status) { 27279 case NFS4_OK: 27280 layoutreturn_stateid lorr_stateid; 27281 default: 27282 void; 27283 }; 27285 18.44.3. 
DESCRIPTION 27287 This operation returns from the client to the server one or more 27288 layouts represented by the client ID (derived from the session ID in 27289 the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. 27290 When lr_returntype is LAYOUTRETURN4_FILE, the returned layout is 27291 further identified by the current filehandle, lrf_offset, lrf_length, 27292 and lrf_stateid. If the lrf_length field is NFS4_UINT64_MAX, all 27293 bytes of the layout, starting at lrf_offset, are returned. When 27294 lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used 27295 to identify the file system and all layouts matching the client ID, 27296 the fsid of the file system, lora_layout_type, and lora_iomode are 27297 returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts 27298 matching the client ID, lora_layout_type, and lora_iomode are 27299 returned and the current filehandle is not used. After this call, 27300 the client MUST NOT use the returned layout(s) and the associated 27301 storage protocol to access the file data. 27303 If the set of layouts designated in the case of LAYOUTRETURN4_FSID or 27304 LAYOUTRETURN4_ALL is empty, then no error results. In the case of 27305 LAYOUTRETURN4_FILE, the byte-range specified is returned even if it 27306 is a subdivision of a layout previously obtained with LAYOUTGET, a 27307 combination of multiple layouts previously obtained with LAYOUTGET, 27308 or a combination including some layouts previously obtained with 27309 LAYOUTGET, and one or more subdivisions of such layouts. When the 27310 byte-range does not designate any bytes for which a layout is held 27311 for the specified file, client ID, layout type and mode, no error 27312 results. See Section 12.5.5.2.1.5 for considerations with "bulk" 27313 return of layouts. 27315 The layout being returned may be a subset or superset of a layout 27316 specified by CB_LAYOUTRECALL. 
However, if it is a subset, the recall 27317 is not complete until the full recalled scope has been returned. 27319 Recalled scope refers to the byte-range in the case of 27320 LAYOUTRETURN4_FILE, the use of LAYOUTRETURN4_FSID, or the use of 27321 LAYOUTRETURN4_ALL. There must be a LAYOUTRETURN with a matching 27322 scope to complete the return even if all current layout ranges have 27323 been previously individually returned. 27325 For all lr_returntype values, an iomode of LAYOUTIOMODE4_ANY 27326 specifies that all layouts that match the other arguments to 27327 LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of current 27328 filehandle and range; fsid derived from current filehandle; or 27329 LAYOUTRETURN4_ALL) are being returned. 27331 In the case that lr_returntype is LAYOUTRETURN4_FILE, the lrf_stateid 27332 provided by the client is a layout stateid as returned from previous 27333 layout operations. Note that the "seqid" field of lrf_stateid MUST 27334 NOT be zero. See Sections 8.2, 12.5.3, and 12.5.5.2 for a further 27335 discussion and requirements. 27337 Return of a layout or all layouts does not invalidate the mapping of 27338 storage device ID to a storage device address. The mapping remains 27339 in effect until specifically changed or deleted via device ID 27340 notification callbacks. Of course if there are no remaining layouts 27341 that refer to a previously used device ID, the server is free to 27342 delete a device ID without a notification callback, which will be the 27343 case when notifications are not in effect. 27345 If the lora_reclaim field is set to TRUE, the client is attempting to 27346 return a layout that was acquired before the restart of the metadata 27347 server during the metadata server's grace period. When returning 27348 layouts that were acquired during the metadata server's grace period, 27349 the client MUST set the lora_reclaim field to FALSE. 
The 27350 lora_reclaim field MUST be set to FALSE also when lr_returntype is 27351 LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See LAYOUTCOMMIT 27352 (Section 18.42) for more details. 27354 Layouts may be returned when recalled or voluntarily (i.e., before 27355 the server has recalled them). In either case, the client must 27356 properly propagate state changed under the context of the layout to 27357 the storage device(s) or to the metadata server before returning the 27358 layout. 27360 If the client returns the layout in response to a CB_LAYOUTRECALL 27361 where the lor_recalltype field of the clora_recall field was 27362 LAYOUTRECALL4_FILE, the client should use the lor_stateid value from 27363 CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should 27364 use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid 27365 (from a previous LAYOUTRETURN result). This is done to indicate the 27366 point in time (in terms of layout stateid transitions) when the 27367 recall was sent. The client uses the precise lrf_stateid 27368 value and MUST NOT set the stateid's seqid to zero; otherwise, 27369 NFS4ERR_BAD_STATEID MUST be returned. NFS4ERR_OLD_STATEID can be 27370 returned if the client is using an old seqid, and the server knows 27371 the client should not be using the old seqid. For example, the 27372 client uses the seqid on slot 1 of the session, receives the response 27373 with the new seqid, and uses the slot to send another request with 27374 the old seqid. 27376 If a client fails to return a layout in a timely manner, then the 27377 metadata server SHOULD use its control protocol with the storage 27378 devices to fence the client from accessing the data referenced by the 27379 layout. See Section 12.5.5 for more details. 27381 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after 27382 the metadata server's grace period, NFS4ERR_NO_GRACE is returned.
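As a non-normative illustration, the lora_reclaim rules described in this section (a reclaim is valid only for LAYOUTRETURN4_FILE and only during the metadata server's grace period) might be checked server-side as sketched below. The LAYOUTRETURN4_* values match the LAYOUT4_RET_REC_* constants in the ARGUMENT section; the check_lora_reclaim function is hypothetical, and the nfsstat values shown are illustrative (see the NFSv4.1 error table for normative values).

```c
#include <stdbool.h>

/* Illustrative status subset; see the NFSv4.1 error table for the
 * normative numeric assignments. */
enum nfsstat { NFS4_OK = 0, NFS4ERR_INVAL = 22, NFS4ERR_NO_GRACE = 10033 };

/* Values match the LAYOUT4_RET_REC_* constants in the ARGUMENT section. */
enum lr_type { LAYOUTRETURN4_FILE = 1, LAYOUTRETURN4_FSID = 2,
               LAYOUTRETURN4_ALL = 3 };

/* Hypothetical server-side check of the lora_reclaim rules: a reclaim
 * is only meaningful for LAYOUTRETURN4_FILE, and only while the
 * metadata server's grace period is still in effect. */
static enum nfsstat
check_lora_reclaim(bool lora_reclaim, enum lr_type lr_returntype,
                   bool in_grace)
{
    if (!lora_reclaim)
        return NFS4_OK;            /* nothing further to validate */
    if (lr_returntype != LAYOUTRETURN4_FILE)
        return NFS4ERR_INVAL;      /* reclaim with _FSID or _ALL */
    if (!in_grace)
        return NFS4ERR_NO_GRACE;   /* grace period has ended */
    return NFS4_OK;
}
```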
27384 If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and 27385 lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, 27386 NFS4ERR_INVAL is returned. 27388 If the client sets the lr_returntype field to LAYOUTRETURN4_FILE, 27389 then the lrs_stateid field will represent the layout stateid as 27390 updated for this operation's processing; the current stateid will 27391 also be updated to match the returned value. If the last byte of any 27392 layout for the current file, client ID, and layout type is being 27393 returned and there are no remaining pending CB_LAYOUTRECALL 27394 operations for which a LAYOUTRETURN operation must be done, 27395 lrs_present MUST be FALSE, and no stateid will be returned. In 27396 addition, the COMPOUND request's current stateid will be set to the 27397 all-zeroes special stateid (see Section 16.2.3.1.2). The server MUST 27398 reject with NFS4ERR_BAD_STATEID any further use of the current 27399 stateid in that COMPOUND until the current stateid is re-established 27400 by a later stateid-returning operation. 27402 On success, the current filehandle retains its value. 27404 If the EXCHGID4_FLAG_BIND_PRINC_STATEID capability is set on the 27405 client ID (see Section 18.35), the server will require that the 27406 principal, security flavor, and if applicable, the GSS mechanism, 27407 combination that acquired the layout also be the one to send 27408 LAYOUTRETURN. This might not be possible if credentials for the 27409 principal are no longer available. The server will allow the machine 27410 credential or SSV credential (see Section 18.35) to send LAYOUTRETURN 27411 if LAYOUTRETURN's operation code was set in the spo_must_allow result 27412 of EXCHANGE_ID. 27414 18.44.4. IMPLEMENTATION 27416 The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL 27417 callback MUST be serialized with any outstanding, intersecting 27418 LAYOUTRETURN operations. 
Note that it is possible that while a 27419 client is returning the layout for some recalled range, the server 27420 may recall a superset of that range (e.g., LAYOUTRECALL4_ALL); the 27421 final return operation for the latter must block until the former 27422 layout recall is done. 27424 Returning all layouts in a file system using LAYOUTRETURN4_FSID is 27425 typically done in response to a CB_LAYOUTRECALL for that file system 27426 as the final return operation. Similarly, LAYOUTRETURN4_ALL is used 27427 in response to a recall callback for all layouts. It is possible 27428 that the client already returned some outstanding layouts via 27429 individual LAYOUTRETURN calls and the call for LAYOUTRETURN4_FSID or 27430 LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See 27431 Section 12.5.5.1 for more details. 27433 Once the client has returned all layouts referring to a particular 27434 device ID, the server MAY delete the device ID. 27436 18.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed Object 27438 18.45.1. ARGUMENT 27440 enum secinfo_style4 { 27441 SECINFO_STYLE4_CURRENT_FH = 0, 27442 SECINFO_STYLE4_PARENT = 1 27443 }; 27445 /* CURRENT_FH: object or child directory */ 27446 typedef secinfo_style4 SECINFO_NO_NAME4args; 27448 18.45.2. RESULT 27450 /* CURRENTFH: consumed if status is NFS4_OK */ 27451 typedef SECINFO4res SECINFO_NO_NAME4res; 27453 18.45.3. DESCRIPTION 27455 Like the SECINFO operation, SECINFO_NO_NAME is used by the client to 27456 obtain a list of valid RPC authentication flavors for a specific file 27457 object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that 27458 are accessed by filehandle. 27460 There are two styles of SECINFO_NO_NAME, as determined by the value 27461 of the secinfo_style4 enumeration. If SECINFO_STYLE4_CURRENT_FH is 27462 passed, then SECINFO_NO_NAME is querying for the required security 27463 for the current filehandle. 
If SECINFO_STYLE4_PARENT is passed, then 27464 SECINFO_NO_NAME is querying for the required security of the current 27465 filehandle's parent. If the style selected is SECINFO_STYLE4_PARENT, 27466 then SECINFO should apply the same access methodology used for 27467 LOOKUPP when evaluating the traversal to the parent directory. 27468 Therefore, if the requester does not have the appropriate access to 27469 LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and 27470 return NFS4ERR_ACCESS. 27472 If PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH returns NFS4ERR_WRONGSEC, 27473 then the client resolves the situation by sending a COMPOUND request 27474 that consists of PUTFH, PUTPUBFH, or PUTROOTFH immediately followed 27475 by SECINFO_NO_NAME, style SECINFO_STYLE4_CURRENT_FH. See Section 2.6 27476 for instructions on dealing with NFS4ERR_WRONGSEC error returns from 27477 PUTFH, PUTROOTFH, PUTPUBFH, or RESTOREFH. 27479 If SECINFO_STYLE4_PARENT is specified and there is no parent 27480 directory, SECINFO_NO_NAME MUST return NFS4ERR_NOENT. 27482 On success, the current filehandle is consumed (see 27483 Section 2.6.3.1.1.8), and if the next operation after SECINFO_NO_NAME 27484 tries to use the current filehandle, that operation will fail with 27485 the status NFS4ERR_NOFILEHANDLE. 27487 Everything else about SECINFO_NO_NAME is the same as SECINFO. See 27488 the discussion on SECINFO (Section 18.29.3). 27490 18.45.4. IMPLEMENTATION 27492 See the discussion on SECINFO (Section 18.29.4). 27494 18.46. Operation 53: SEQUENCE - Supply Per-Procedure Sequencing and 27495 Control 27497 18.46.1. ARGUMENT 27499 struct SEQUENCE4args { 27500 sessionid4 sa_sessionid; 27501 sequenceid4 sa_sequenceid; 27502 slotid4 sa_slotid; 27503 slotid4 sa_highest_slotid; 27504 bool sa_cachethis; 27505 }; 27507 18.46.2. 
RESULT 27509 const SEQ4_STATUS_CB_PATH_DOWN = 0x00000001; 27510 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING = 0x00000002; 27511 const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED = 0x00000004; 27512 const SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008; 27513 const SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010; 27514 const SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020; 27515 const SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040; 27516 const SEQ4_STATUS_LEASE_MOVED = 0x00000080; 27517 const SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100; 27518 const SEQ4_STATUS_CB_PATH_DOWN_SESSION = 0x00000200; 27519 const SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400; 27520 const SEQ4_STATUS_DEVID_CHANGED = 0x00000800; 27521 const SEQ4_STATUS_DEVID_DELETED = 0x00001000; 27523 struct SEQUENCE4resok { 27524 sessionid4 sr_sessionid; 27525 sequenceid4 sr_sequenceid; 27526 slotid4 sr_slotid; 27527 slotid4 sr_highest_slotid; 27528 slotid4 sr_target_highest_slotid; 27529 uint32_t sr_status_flags; 27530 }; 27532 union SEQUENCE4res switch (nfsstat4 sr_status) { 27533 case NFS4_OK: 27534 SEQUENCE4resok sr_resok4; 27535 default: 27536 void; 27537 }; 27539 18.46.3. DESCRIPTION 27541 The SEQUENCE operation is used by the server to implement session 27542 request control and the reply cache semantics. 27544 SEQUENCE MUST appear as the first operation of any COMPOUND in which 27545 it appears. The error NFS4ERR_SEQUENCE_POS will be returned when it 27546 is found in any position in a COMPOUND beyond the first. Operations 27547 other than SEQUENCE, BIND_CONN_TO_SESSION, EXCHANGE_ID, 27548 CREATE_SESSION, and DESTROY_SESSION, MUST NOT appear as the first 27549 operation in a COMPOUND. Such operations MUST yield the error 27550 NFS4ERR_OP_NOT_IN_SESSION if they do appear at the start of a 27551 COMPOUND. 
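The first-position rules above can be sketched as a per-operation check that a server might apply while parsing a COMPOUND. This is a non-normative illustration: the check_op_position function is hypothetical, and the operation numbers and error values shown are illustrative (the normative assignments are in the NFSv4.1 XDR and error tables).

```c
/* Illustrative subset of NFSv4.1 operation codes relevant to the
 * first-position rule; normative values are in the protocol XDR. */
enum nfs_opnum4 {
    OP_BIND_CONN_TO_SESSION = 41,
    OP_EXCHANGE_ID          = 42,
    OP_CREATE_SESSION       = 43,
    OP_DESTROY_SESSION      = 44,
    OP_SEQUENCE             = 53
};

/* Illustrative status subset; see the NFSv4.1 error table. */
enum nfsstat { NFS4_OK = 0,
               NFS4ERR_SEQUENCE_POS      = 10064,
               NFS4ERR_OP_NOT_IN_SESSION = 10071 };

/* Hypothetical check of an operation's position (0-based) within a
 * COMPOUND: SEQUENCE is legal only first; any other first operation
 * must be one of the session-establishment operations listed above. */
static enum nfsstat
check_op_position(enum nfs_opnum4 op, unsigned position)
{
    if (op == OP_SEQUENCE)
        return position == 0 ? NFS4_OK : NFS4ERR_SEQUENCE_POS;
    if (position == 0) {
        switch (op) {
        case OP_BIND_CONN_TO_SESSION:
        case OP_EXCHANGE_ID:
        case OP_CREATE_SESSION:
        case OP_DESTROY_SESSION:
            return NFS4_OK;
        default:
            return NFS4ERR_OP_NOT_IN_SESSION;
        }
    }
    return NFS4_OK; /* non-first positions are not constrained here */
}
```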
27553 If SEQUENCE is received on a connection not associated with the 27554 session via CREATE_SESSION or BIND_CONN_TO_SESSION, and connection 27555 association enforcement is enabled (see Section 18.35), then the 27556 server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION. 27558 The sa_sessionid argument identifies the session to which this 27559 request applies. The sr_sessionid result MUST equal sa_sessionid. 27561 The sa_slotid argument is the index in the reply cache for the 27562 request. The sa_sequenceid field is the sequence number of the 27563 request for the reply cache entry (slot). The sr_slotid result MUST 27564 equal sa_slotid. The sr_sequenceid result MUST equal sa_sequenceid. 27566 The sa_highest_slotid argument is the highest slot ID for which the 27567 client has a request outstanding; it could be equal to sa_slotid. 27568 The server returns two "highest_slotid" values: sr_highest_slotid and 27569 sr_target_highest_slotid. The former is the highest slot ID the 27570 server will accept in future SEQUENCE operations, and SHOULD NOT be 27571 less than the value of sa_highest_slotid (but see Section 2.10.6.1 27572 for an exception). The latter is the highest slot ID the server 27573 would prefer the client use on a future SEQUENCE operation. 27575 If sa_cachethis is TRUE, then the client is requesting that the 27576 server cache the entire reply in the server's reply cache; therefore, 27577 the server MUST cache the reply (see Section 2.10.6.1.3). The server 27578 MAY cache the reply if sa_cachethis is FALSE. If the server does not 27579 cache the entire reply, it MUST still record that it executed the 27580 request at the specified slot and sequence ID. 27582 The response to the SEQUENCE operation contains a word of status 27583 flags (sr_status_flags) that can provide to the client information 27584 related to the status of the client's lock state and communications 27585 paths.
Note that any status bits relating to lock state MAY be reset 27586 when lock state is lost due to a server restart (even if the session 27587 is persistent across restarts; session persistence does not imply 27588 lock state persistence) or the establishment of a new client 27589 instance. 27591 SEQ4_STATUS_CB_PATH_DOWN 27592 When set, indicates that the client has no operational backchannel 27593 path for any session associated with the client ID, making it 27594 necessary for the client to re-establish one. This bit remains 27595 set on all SEQUENCE responses on all sessions associated with the 27596 client ID until at least one backchannel is available on any 27597 session associated with the client ID. If the client fails to re- 27598 establish a backchannel for the client ID, it is subject to having 27599 recallable state revoked. 27601 SEQ4_STATUS_CB_PATH_DOWN_SESSION 27602 When set, indicates that the session has no operational 27603 backchannel. There are two reasons why 27604 SEQ4_STATUS_CB_PATH_DOWN_SESSION may be set and not 27605 SEQ4_STATUS_CB_PATH_DOWN. First is that a callback operation that 27606 applies specifically to the session (e.g., CB_RECALL_SLOT, see 27607 Section 20.8) needs to be sent. Second is that the server did 27608 send a callback operation, but the connection was lost before the 27609 reply. The server cannot be sure whether or not the client 27610 received the callback operation, and so, per rules on request 27611 retry, the server MUST retry the callback operation over the same 27612 session. The SEQ4_STATUS_CB_PATH_DOWN_SESSION bit is the 27613 indication to the client that it needs to associate a connection 27614 to the session's backchannel. This bit remains set on all 27615 SEQUENCE responses of the session until a connection is associated 27616 with the session's backchannel. If the client fails to re- 27617 establish a backchannel for the session, it is subject to having 27618 recallable state revoked.
27620 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING 27621 When set, indicates that all GSS contexts or RPCSEC_GSS handles 27622 assigned to the session's backchannel will expire within a period 27623 equal to the lease time. This bit remains set on all SEQUENCE 27624 replies until at least one of the following are true: 27626 * All SSV RPCSEC_GSS handles on the session's backchannel have 27627 been destroyed and all non-SSV GSS contexts have expired. 27629 * At least one more SSV RPCSEC_GSS handle has been added to the 27630 backchannel. 27632 * The expiration time of at least one non-SSV GSS context of an 27633 RPCSEC_GSS handle is beyond the lease period from the current 27634 time (relative to the time of when a SEQUENCE response was 27635 sent) 27637 SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED 27638 When set, indicates all non-SSV GSS contexts and all SSV 27639 RPCSEC_GSS handles assigned to the session's backchannel have 27640 expired or have been destroyed. This bit remains set on all 27641 SEQUENCE replies until at least one non-expired non-SSV GSS 27642 context for the session's backchannel has been established or at 27643 least one SSV RPCSEC_GSS handle has been assigned to the 27644 backchannel. 27646 SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED 27647 When set, indicates that the lease has expired and as a result the 27648 server released all of the client's locking state. This status 27649 bit remains set on all SEQUENCE replies until the loss of all such 27650 locks has been acknowledged by use of FREE_STATEID (see 27651 Section 18.38), or by establishing a new client instance by 27652 destroying all sessions (via DESTROY_SESSION), the client ID (via 27653 DESTROY_CLIENTID), and then invoking EXCHANGE_ID and 27654 CREATE_SESSION to establish a new client ID. 
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
   When set, indicates that some subset of the client's locks have
   been revoked due to expiration of the lease period followed by
   another client's conflicting LOCK operation.  This status bit
   remains set on all SEQUENCE replies until the loss of all such
   locks has been acknowledged by use of FREE_STATEID.

SEQ4_STATUS_ADMIN_STATE_REVOKED
   When set, indicates that one or more locks have been revoked
   without expiration of the lease period, due to administrative
   action.  This status bit remains set on all SEQUENCE replies until
   the loss of all such locks has been acknowledged by use of
   FREE_STATEID.

SEQ4_STATUS_RECALLABLE_STATE_REVOKED
   When set, indicates that one or more recallable objects have been
   revoked without expiration of the lease period, due to the
   client's failure to return them when recalled, which may be a
   consequence of there being no working backchannel and the client
   failing to re-establish a backchannel per the
   SEQ4_STATUS_CB_PATH_DOWN, SEQ4_STATUS_CB_PATH_DOWN_SESSION, or
   SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED status flags.  This status bit
   remains set on all SEQUENCE replies until the loss of all such
   locks has been acknowledged by use of FREE_STATEID.

SEQ4_STATUS_LEASE_MOVED
   When set, indicates that responsibility for lease renewal has been
   transferred to one or more new servers.  This condition will
   continue until the client receives an NFS4ERR_MOVED error and the
   server receives the subsequent GETATTR for the fs_locations or
   fs_locations_info attribute for an access to each file system for
   which a lease has been moved to a new server.  See
   Section 11.11.9.2.

SEQ4_STATUS_RESTART_RECLAIM_NEEDED
   When set, indicates that due to server restart, the client must
   reclaim locking state.
   Until the client sends a global
   RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation will
   return SEQ4_STATUS_RESTART_RECLAIM_NEEDED.

SEQ4_STATUS_BACKCHANNEL_FAULT
   The server has encountered an unrecoverable fault with the
   backchannel (e.g., it has lost track of the sequence ID for a slot
   in the backchannel).  The client MUST stop sending more requests
   on the session's fore channel, wait for all outstanding requests
   to complete on the fore and back channel, and then destroy the
   session.

SEQ4_STATUS_DEVID_CHANGED
   The client is using device ID notifications and the server has
   changed a device ID mapping held by the client.  This flag will
   stay present until the client has obtained the new mapping with
   GETDEVICEINFO.

SEQ4_STATUS_DEVID_DELETED
   The client is using device ID notifications and the server has
   deleted a device ID mapping held by the client.  This flag will
   stay in effect until the client sends a GETDEVICEINFO on the
   device ID with a null value in the argument gdia_notify_types.

The value of the sa_sequenceid argument relative to the cached
sequence ID on the slot falls into one of three cases.

o  If the difference between sa_sequenceid and the server's cached
   sequence ID at the slot ID is two (2) or more, or if sa_sequenceid
   is less than the cached sequence ID (accounting for wraparound of
   the unsigned sequence ID value), then the server MUST return
   NFS4ERR_SEQ_MISORDERED.

o  If sa_sequenceid and the cached sequence ID are the same, this is
   a retry, and the server replies with what is recorded in the reply
   cache.  The lease is possibly renewed as described below.

o  If sa_sequenceid is one greater (accounting for wraparound) than
   the cached sequence ID, then this is a new request, and the slot's
   sequence ID is incremented.
   The operations subsequent to
   SEQUENCE, if any, are processed.  If there are no other
   operations, the only other effects are to cache the SEQUENCE reply
   in the slot, maintain the session's activity, and possibly renew
   the lease.

If the client reuses a slot ID and sequence ID for a completely
different request, the server MAY treat the request as if it is a
retry of what it has already executed.  The server MAY however detect
the client's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY.

If SEQUENCE returns an error, then the state of the slot (sequence
ID, cached reply) MUST NOT change, and the associated lease MUST NOT
be renewed.

If SEQUENCE returns NFS4_OK, then the associated lease MUST be
renewed (see Section 8.3), except if
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags.

18.46.4.  IMPLEMENTATION

The server MUST maintain a mapping of session ID to client ID in
order to validate any operations that follow SEQUENCE that take a
stateid as an argument and/or result.

If the client establishes a persistent session, then a SEQUENCE
received after a server restart might encounter requests performed
and recorded in a persistent reply cache before the server restart.
In this case, SEQUENCE will be processed successfully, while requests
that were not previously performed and recorded are rejected with
NFS4ERR_DEADSESSION.

Depending on which of the operations within the COMPOUND were
successfully performed before the server restart, these operations
will also have replies sent from the server reply cache.  Note that
when these operations establish locking state, it is locking state
that applies to the previous server instance and to the previous
client ID, even though the server restart, which logically happened
after these operations, eliminated that state.
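The three-way sa_sequenceid comparison described earlier in this
section can be sketched as follows.  This is a non-normative
illustration in Python; slot management and the reply cache are
simplified away, and only the wraparound-aware classification of the
incoming sequence ID against the slot's cached value is shown.

```python
def check_slot_sequence(sa_sequenceid: int, cached_seqid: int) -> str:
    """Classify an incoming sa_sequenceid against the slot's cached
    sequence ID, accounting for wraparound of the unsigned 32-bit
    sequence ID value."""
    diff = (sa_sequenceid - cached_seqid) & 0xFFFFFFFF
    if diff == 0:
        # Same as cached: a retry; reply from the slot's reply cache.
        return "retry"
    if diff == 1:
        # One greater (modulo 2^32): a new request; the slot's
        # sequence ID is incremented and the operations are processed.
        return "new"
    # Two or more greater, or less than cached: the server MUST
    # return NFS4ERR_SEQ_MISORDERED.
    return "misordered"
```

Note that the wraparound arithmetic makes sa_sequenceid 0 a valid
successor of a cached sequence ID of 0xFFFFFFFF.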
In the case of a
partially executed COMPOUND, processing may reach an operation not
processed during the earlier server instance, making this operation a
new one and not performable on the existing session.  In this case,
NFS4ERR_DEADSESSION will be returned from that operation.

18.47.  Operation 54: SET_SSV - Update SSV for a Client ID

18.47.1.  ARGUMENT

   struct ssa_digest_input4 {
           SEQUENCE4args sdi_seqargs;
   };

   struct SET_SSV4args {
           opaque ssa_ssv<>;
           opaque ssa_digest<>;
   };

18.47.2.  RESULT

   struct ssr_digest_input4 {
           SEQUENCE4res sdi_seqres;
   };

   struct SET_SSV4resok {
           opaque ssr_digest<>;
   };

   union SET_SSV4res switch (nfsstat4 ssr_status) {
   case NFS4_OK:
           SET_SSV4resok ssr_resok4;
   default:
           void;
   };

18.47.3.  DESCRIPTION

This operation is used to update the SSV for a client ID.  Before
SET_SSV is called the first time on a client ID, the SSV is zero.
The SSV is the key used for the SSV GSS mechanism (Section 2.10.9).

SET_SSV MUST be preceded by a SEQUENCE operation in the same
COMPOUND.  It MUST NOT be used if the client did not opt for SP4_SSV
state protection when the client ID was created (see Section 18.35);
the server returns NFS4ERR_INVAL in that case.

The field ssa_digest is computed as the output of the HMAC (RFC 2104
[51]) using the subkey derived from the SSV4_SUBKEY_MIC_I2T and
current SSV as the key (see Section 2.10.9 for a description of
subkeys), and an XDR encoded value of data type ssa_digest_input4.
The field sdi_seqargs is equal to the arguments of the SEQUENCE
operation for the COMPOUND procedure that SET_SSV is within.

The argument ssa_ssv is XORed with the current SSV to produce the new
SSV.  The argument ssa_ssv SHOULD be generated randomly.
In the response, ssr_digest is the output of the HMAC using the
subkey derived from SSV4_SUBKEY_MIC_T2I and new SSV as the key, and
an XDR encoded value of data type ssr_digest_input4.  The field
sdi_seqres is equal to the results of the SEQUENCE operation for the
COMPOUND procedure that SET_SSV is within.

As noted in Section 18.35, the client and server can maintain
multiple concurrent versions of the SSV.  The client and server each
MUST maintain an internal SSV version number, which is set to one the
first time SET_SSV executes on the server and the client receives the
first SET_SSV reply.  Each subsequent SET_SSV increases the internal
SSV version number by one.  The value of this version number
corresponds to the smpt_ssv_seq, smt_ssv_seq, sspt_ssv_seq, and
ssct_ssv_seq fields of the SSV GSS mechanism tokens (see
Section 2.10.9).

18.47.4.  IMPLEMENTATION

When the server receives ssa_digest, it MUST verify the digest by
computing the digest the same way the client did and comparing it
with ssa_digest.  If the server gets a different result, this is an
error, NFS4ERR_BAD_SESSION_DIGEST.  This error might be the result of
another SET_SSV from the same client ID changing the SSV.  If so, the
client recovers by sending a SET_SSV operation again with a
recomputed digest based on the subkey of the new SSV.  If the
transport connection is dropped after the SET_SSV request is sent,
but before the SET_SSV reply is received, then there are special
considerations for recovery if the client has no more connections
associated with sessions associated with the client ID of the SSV.
See Section 18.34.4.
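The SSV update and digest verification described above can be
sketched as follows.  This is a non-normative illustration in Python:
the names subkey_i2t and xdr_seqargs stand in for the
SSV4_SUBKEY_MIC_I2T subkey derivation and the XDR encoding of
ssa_digest_input4 (both defined in Section 2.10.9 and not shown
here), and SHA-1 is used only as a placeholder hash.

```python
import hashlib
import hmac

def new_ssv(current_ssv: bytes, ssa_ssv: bytes) -> bytes:
    # The argument ssa_ssv is XORed with the current SSV to produce
    # the new SSV (both must be the same length).
    return bytes(a ^ b for a, b in zip(current_ssv, ssa_ssv))

def mic(subkey: bytes, xdr_payload: bytes) -> bytes:
    # HMAC (RFC 2104) keyed with a subkey derived from the SSV,
    # over the XDR-encoded digest input.
    return hmac.new(subkey, xdr_payload, hashlib.sha1).digest()

def server_verify(subkey_i2t: bytes, xdr_seqargs: bytes,
                  ssa_digest: bytes) -> bool:
    # The server recomputes the digest the same way the client did;
    # a mismatch corresponds to NFS4ERR_BAD_SESSION_DIGEST.
    return hmac.compare_digest(mic(subkey_i2t, xdr_seqargs), ssa_digest)
```

Because XOR is its own inverse, applying the same ssa_ssv twice
restores the previous SSV, which is one reason clients SHOULD NOT
reuse a previous ssa_ssv value.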
Clients SHOULD NOT send an ssa_ssv that is equal to a previous
ssa_ssv, nor equal to a previous or current SSV (including an ssa_ssv
equal to zero since the SSV is initialized to zero when the client ID
is created).

Clients SHOULD send SET_SSV with RPCSEC_GSS privacy.  Servers MUST
support RPCSEC_GSS with privacy for any COMPOUND that has
{ SEQUENCE, SET_SSV }.

A client SHOULD NOT send SET_SSV with the SSV GSS mechanism's
credential because the purpose of SET_SSV is to seed the SSV from
non-SSV credentials.  Instead, SET_SSV SHOULD be sent with the
credential of a user that is accessing the client ID for the first
time (Section 2.10.8.3).  However, if the client does send SET_SSV
with SSV credentials, the digest protecting the arguments uses the
value of the SSV before ssa_ssv is XORed in, and the digest
protecting the results uses the value of the SSV after the ssa_ssv is
XORed in.

18.48.  Operation 55: TEST_STATEID - Test Stateids for Validity

18.48.1.  ARGUMENT

   struct TEST_STATEID4args {
           stateid4 ts_stateids<>;
   };

18.48.2.  RESULT

   struct TEST_STATEID4resok {
           nfsstat4 tsr_status_codes<>;
   };

   union TEST_STATEID4res switch (nfsstat4 tsr_status) {
   case NFS4_OK:
           TEST_STATEID4resok tsr_resok4;
   default:
           void;
   };

18.48.3.  DESCRIPTION

The TEST_STATEID operation is used to check the validity of a set of
stateids.  It can be used at any time, but the client should
definitely use it when it receives an indication that one or more of
its stateids have been invalidated due to lock revocation.
This
occurs when the SEQUENCE operation returns with one of the following
sr_status_flags set:

o  SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED

o  SEQ4_STATUS_ADMIN_STATE_REVOKED

o  SEQ4_STATUS_RECALLABLE_STATE_REVOKED

The client can use TEST_STATEID one or more times to test the
validity of its stateids.  Each use of TEST_STATEID allows a large
set of such stateids to be tested and avoids problems with earlier
stateids in a COMPOUND request from interfering with the checking of
subsequent stateids, as would happen if individual stateids were
tested by a series of corresponding operations in a COMPOUND
request.

For each stateid, the server returns the status code that would be
returned if that stateid were to be used in normal operation.
Returning such a status indication is not an error and does not cause
COMPOUND processing to terminate.  Checks for the validity of the
stateid proceed as they would for normal operations with a number of
exceptions:

o  There is no check for the type of stateid object, as would be the
   case for normal use of a stateid.

o  There is no reference to the current filehandle.

o  Special stateids are always considered invalid (they result in the
   error code NFS4ERR_BAD_STATEID).

All stateids are interpreted as being associated with the client for
the current session.  Any possible association with a previous
instance of the client (as stale stateids) is not considered.

The valid status values in the returned status_code array are
NFS4_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID,
NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED.

18.48.4.  IMPLEMENTATION

See Sections 8.2.2 and 8.2.4 for a discussion of stateid structure,
lifetime, and validation.

18.49.  Operation 56: WANT_DELEGATION - Request Delegation

18.49.1.  ARGUMENT

   union deleg_claim4 switch (open_claim_type4 dc_claim) {
   /*
    * No special rights to object.  Ordinary delegation
    * request of the specified object.  Object identified
    * by filehandle.
    */
   case CLAIM_FH: /* new to v4.1 */
           /* CURRENT_FH: object being delegated */
           void;

   /*
    * Right to file based on a delegation granted
    * to a previous boot instance of the client.
    * File is specified by filehandle.
    */
   case CLAIM_DELEG_PREV_FH: /* new to v4.1 */
           /* CURRENT_FH: object being delegated */
           void;

   /*
    * Right to the file established by an open previous
    * to server reboot.  File identified by filehandle.
    * Used during server reclaim grace period.
    */
   case CLAIM_PREVIOUS:
           /* CURRENT_FH: object being reclaimed */
           open_delegation_type4 dc_delegate_type;
   };

   struct WANT_DELEGATION4args {
           uint32_t     wda_want;
           deleg_claim4 wda_claim;
   };

18.49.2.  RESULT

   union WANT_DELEGATION4res switch (nfsstat4 wdr_status) {
   case NFS4_OK:
           open_delegation4 wdr_resok4;
   default:
           void;
   };

18.49.3.  DESCRIPTION

Where this description mandates the return of a specific error code
for a specific condition, and where multiple conditions apply, the
server MAY return any of the mandated error codes.

This operation allows a client to:

o  Get a delegation on all types of files except directories.

o  Register a "want" for a delegation for the specified file object,
   and be notified via a callback when the delegation is available.
   The server MAY support notifications of availability via
   callbacks.
   If the server does not support registration of wants,
   it MUST NOT return an error to indicate that, and instead MUST
   return with ond_why set to WND4_CONTENTION or WND4_RESOURCE and
   ond_server_will_push_deleg or ond_server_will_signal_avail set to
   FALSE.  When the server indicates that it will notify the client
   by means of a callback, it will either provide the delegation
   using a CB_PUSH_DELEG operation or cancel its promise by sending a
   CB_WANTS_CANCELLED operation.

o  Cancel a want for a delegation.

The client SHOULD NOT set OPEN4_SHARE_ACCESS_READ and SHOULD NOT set
OPEN4_SHARE_ACCESS_WRITE in wda_want.  If it does, the server MUST
ignore them.

The meanings of the following flags in wda_want are the same as they
are in OPEN, except as noted below.

o  OPEN4_SHARE_ACCESS_WANT_READ_DELEG

o  OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG

o  OPEN4_SHARE_ACCESS_WANT_ANY_DELEG

o  OPEN4_SHARE_ACCESS_WANT_NO_DELEG.  Unlike the OPEN operation, this
   flag SHOULD NOT be set by the client in the arguments to
   WANT_DELEGATION, and MUST be ignored by the server.

o  OPEN4_SHARE_ACCESS_WANT_CANCEL

o  OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL

o  OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED

The handling of the above flags in WANT_DELEGATION is the same as in
OPEN.  Information about the delegation and/or the promises the
server is making regarding future callbacks are the same as those
described in the open_delegation4 structure.

The successful results of WANT_DELEGATION are of data type
open_delegation4, which is the same data type as the "delegation"
field in the results of the OPEN operation (see Section 18.16.3).
The server constructs wdr_resok4 the same way it constructs OPEN's
"delegation" with one difference: WANT_DELEGATION MUST NOT return a
delegation type of OPEN_DELEGATE_NONE.
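The wda_want validity check that follows can be sketched as below.
This is a non-normative Python illustration; the numeric flag values
are those assigned to OPEN's share_access word in Section 18.16 of
this specification (the "want" flags form an enumeration within the
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK field).

```python
# Values from the OPEN share_access word (Section 18.16).
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK  = 0xFF00
OPEN4_SHARE_ACCESS_WANT_READ_DELEG  = 0x0100
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG   = 0x0300
OPEN4_SHARE_ACCESS_WANT_NO_DELEG    = 0x0400
OPEN4_SHARE_ACCESS_WANT_CANCEL      = 0x0500

def wda_want_is_valid(wda_want: int) -> bool:
    """A WANT_DELEGATION request whose wda_want expresses neither a
    desire nor a non-desire for a delegation draws NFS4ERR_INVAL;
    this implements the ((wda_want & MASK) & ~NO_DELEG) != 0 test."""
    return ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK)
            & ~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) != 0
```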
If ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) &
~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) is zero, then the client is
indicating no explicit desire or non-desire for a delegation and the
server MUST return NFS4ERR_INVAL.

The client uses the OPEN4_SHARE_ACCESS_WANT_CANCEL flag in the
WANT_DELEGATION operation to cancel a previously requested want for a
delegation.  Note that if the server is in the process of sending the
delegation (via CB_PUSH_DELEG) at the time the client sends a
cancellation of the want, the delegation might still be pushed to the
client.

If WANT_DELEGATION fails to return a delegation, and the server
returns NFS4_OK, the server MUST set the delegation type to
OPEN_DELEGATE_NONE_EXT, and set od_whynone, as described in
Section 18.16.  Write delegations are not available for file types
that are not writable.  This includes file objects of types NF4BLK,
NF4CHR, NF4LNK, NF4SOCK, and NF4FIFO.  If the client requests
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG without
OPEN4_SHARE_ACCESS_WANT_READ_DELEG on an object with one of the
aforementioned file types, the server MUST set
wdr_resok4.od_whynone.ond_why to WND4_WRITE_DELEG_NOT_SUPP_FTYPE.

18.49.4.  IMPLEMENTATION

A request for a conflicting delegation is not normally intended to
trigger the recall of the existing delegation.  Servers may choose to
treat some clients as having higher priority such that their wants
will trigger recall of an existing delegation, although that is
expected to be an unusual situation.

Servers will generally recall delegations assigned by WANT_DELEGATION
on the same basis as those assigned by OPEN.  CB_RECALL will
generally be done only when other clients perform operations
inconsistent with the delegation.
The normal response to aging of
delegations is to use CB_RECALL_ANY, in order to give the client the
opportunity to keep the delegations most useful from its point of
view.

18.50.  Operation 57: DESTROY_CLIENTID - Destroy a Client ID

18.50.1.  ARGUMENT

   struct DESTROY_CLIENTID4args {
           clientid4 dca_clientid;
   };

18.50.2.  RESULT

   struct DESTROY_CLIENTID4res {
           nfsstat4 dcr_status;
   };

18.50.3.  DESCRIPTION

The DESTROY_CLIENTID operation destroys the client ID.  If there are
sessions (both idle and non-idle), opens, locks, delegations,
layouts, and/or wants (Section 18.49) associated with the unexpired
lease of the client ID, the server MUST return NFS4ERR_CLIENTID_BUSY.
DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as
the client ID derived from the session ID of SEQUENCE is not the same
as the client ID to be destroyed.  If the client IDs are the same,
then the server MUST return NFS4ERR_CLIENTID_BUSY.

If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only
operation in the COMPOUND request (otherwise, the server MUST return
NFS4ERR_NOT_ONLY_OP).  If the operation is sent without a SEQUENCE
preceding it, a client that retransmits the request may receive an
error in response, because the original request might have been
successfully executed.

18.50.4.  IMPLEMENTATION

DESTROY_CLIENTID allows a server to immediately reclaim the resources
consumed by an unused client ID, and also to forget that it ever
generated the client ID.  By forgetting that it ever generated the
client ID, the server can safely reuse the client ID on a future
EXCHANGE_ID operation.

18.51.  Operation 58: RECLAIM_COMPLETE - Indicates Reclaims Finished

18.51.1.  ARGUMENT

   struct RECLAIM_COMPLETE4args {
           /*
            * If rca_one_fs TRUE,
            *
            *    CURRENT_FH: object in
            *    file system reclaim is
            *    complete for.
            */
           bool rca_one_fs;
   };

18.51.2.  RESULTS

   struct RECLAIM_COMPLETE4res {
           nfsstat4 rcr_status;
   };

18.51.3.  DESCRIPTION

A RECLAIM_COMPLETE operation is used to indicate that the client has
reclaimed all of the locking state that it will recover using
reclaim, when it is recovering state due to either a server restart
or the migration of a file system to another server.  There are two
types of RECLAIM_COMPLETE operations:

o  When rca_one_fs is FALSE, a global RECLAIM_COMPLETE is being done.
   This indicates that recovery of all locks that the client held on
   the previous server instance has been completed.  The current
   filehandle need not be set in this case.

o  When rca_one_fs is TRUE, a file system-specific RECLAIM_COMPLETE
   is being done.  This indicates that recovery of locks for a single
   fs (the one designated by the current filehandle) due to the
   migration of the file system has been completed.  Presence of a
   current filehandle is required when rca_one_fs is set to TRUE.
   When the current filehandle designates a filehandle in a file
   system not in the process of migration, the operation returns
   NFS4_OK and is otherwise ignored.

Once a RECLAIM_COMPLETE is done, there can be no further reclaim
operations for locks whose scope is defined as having completed
recovery.  Once the client sends RECLAIM_COMPLETE, the server will
not allow the client to do subsequent reclaims of locking state for
that scope and, if these are attempted, will return NFS4ERR_NO_GRACE.
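The interplay between reclaim operations, RECLAIM_COMPLETE, and the
grace-period errors described above can be sketched with a toy
per-client state machine.  This is a non-normative Python
illustration: error codes are represented by name only, and a real
server would additionally track per-fs reclaim scopes and the edge
conditions of Section 8.4.3.

```python
class ClientReclaimState:
    """Tracks whether a client has declared reclaim complete for
    a given scope (here, the global scope only)."""

    def __init__(self):
        self.reclaim_complete = False

    def reclaim_lock(self) -> str:
        # Reclaims are refused with NFS4ERR_NO_GRACE once the client
        # has declared reclaim done for this scope.
        return "NFS4ERR_NO_GRACE" if self.reclaim_complete else "NFS4_OK"

    def new_lock(self) -> str:
        # Non-reclaim locking before RECLAIM_COMPLETE draws
        # NFS4ERR_GRACE.
        return "NFS4_OK" if self.reclaim_complete else "NFS4ERR_GRACE"

    def do_reclaim_complete(self) -> str:
        # A second RECLAIM_COMPLETE for the same scope yields
        # NFS4ERR_COMPLETE_ALREADY.
        if self.reclaim_complete:
            return "NFS4ERR_COMPLETE_ALREADY"
        self.reclaim_complete = True
        return "NFS4_OK"
```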
Whenever a client establishes a new client ID and before it does the
first non-reclaim operation that obtains a lock, it MUST send a
RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no
locks to reclaim.  If non-reclaim locking operations are done before
the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.

Similarly, when the client accesses a migrated file system on a new
server, before it sends the first non-reclaim operation that obtains
a lock on this new server, it MUST send a RECLAIM_COMPLETE with
rca_one_fs set to TRUE and current filehandle within that file
system, even if there are no locks to reclaim.  If non-reclaim
locking operations are done on that file system before the
RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.

It should be noted that there are situations in which a client needs
to issue both forms of RECLAIM_COMPLETE.  An example is an instance
of file system migration in which the file system is migrated to a
server for which the client has no clientid.  As a result, the client
needs to obtain a clientid from the server (incurring the
responsibility to do RECLAIM_COMPLETE with rca_one_fs set to FALSE)
as well as RECLAIM_COMPLETE with rca_one_fs set to TRUE to complete
the per-fs grace period associated with the file system migration.
These two may be done in any order as long as all necessary lock
reclaims have been done before issuing either of them.

Any locks not reclaimed at the point at which RECLAIM_COMPLETE is
done become non-reclaimable.  The client MUST NOT attempt to reclaim
them, either during the current server instance or in any subsequent
server instance, or on another server to which responsibility for
that file system is transferred.
If the client were to do so, it
would be violating the protocol by representing itself as owning
locks that it does not own, and so has no right to reclaim.  See
Section 8.4.3 of [65] for a discussion of edge conditions related to
lock reclaim.

By sending a RECLAIM_COMPLETE, the client indicates readiness to
proceed to do normal non-reclaim locking operations.  The client
should be aware that such operations may temporarily result in
NFS4ERR_GRACE errors until the server is ready to terminate its grace
period.

18.51.4.  IMPLEMENTATION

Servers will typically use the information as to when reclaim
activity is complete to reduce the length of the grace period.  When
the server maintains in persistent storage a list of clients that
might have had locks, it is able to use the fact that all such
clients have done a RECLAIM_COMPLETE to terminate the grace period
and begin normal operations (i.e., grant requests for new locks)
sooner than it might otherwise.

Latency can be minimized by doing a RECLAIM_COMPLETE as part of the
COMPOUND request in which the last lock-reclaiming operation is done.
When there are no reclaims to be done, RECLAIM_COMPLETE should be
done immediately in order to allow the grace period to end as soon as
possible.

RECLAIM_COMPLETE should only be done once for each server instance or
occasion of the transition of a file system.  If it is done a second
time, the error NFS4ERR_COMPLETE_ALREADY will result.  Note that
because of the session feature's retry protection, retries of
COMPOUND requests containing a RECLAIM_COMPLETE operation will not
result in this error.

When a RECLAIM_COMPLETE is sent, the client effectively acknowledges
any locks not yet reclaimed as lost.
This allows the server to
re-enable the client to recover locks if the occurrence of edge
conditions, as described in Section 8.4.3, had caused the server to
disable the client's ability to recover locks.

Because previous descriptions of RECLAIM_COMPLETE were not
sufficiently explicit about the circumstances in which use of
RECLAIM_COMPLETE with rca_one_fs set to TRUE was appropriate, there
have been cases in which it has been misused by clients that have
issued RECLAIM_COMPLETE with rca_one_fs set to TRUE when it should
not have been.  There have also been cases in which servers have, in
various ways, not responded to such misuse as described above, either
ignoring the rca_one_fs setting (treating the operation as a global
RECLAIM_COMPLETE) or ignoring the entire operation.

While clients SHOULD NOT misuse this feature and servers SHOULD
respond to such misuse as described above, implementers need to be
aware of the following considerations as they make necessary
tradeoffs between interoperability with existing implementations and
proper support for facilities to allow lock recovery in the event of
file system migration.

o  When servers have no support for becoming the destination server
   of a file system subject to migration, there is no possibility of
   a per-fs RECLAIM_COMPLETE being done legitimately and occurrences
   of it SHOULD be ignored.  However, the negative consequences of
   accepting such mistaken use are quite limited as long as the
   client does not issue it before all necessary reclaims are done.

o  When a server might become the destination for a file system being
   migrated, inappropriate use of per-fs RECLAIM_COMPLETE is more
   concerning.
   In the case in which the file system designated is not within a
   per-fs grace period, the per-fs RECLAIM_COMPLETE SHOULD be ignored,
   with the negative consequences of accepting it being limited, as in
   the case in which migration is not supported.  However, if the
   server encounters a file system undergoing migration, the operation
   cannot be accepted as if it were a global RECLAIM_COMPLETE without
   invalidating its intended use.

18.52.  Operation 10044: ILLEGAL - Illegal Operation

18.52.1.  ARGUMENTS

   void;

18.52.2.  RESULTS

   struct ILLEGAL4res {
           nfsstat4        status;
   };

18.52.3.  DESCRIPTION

   This operation is a placeholder for encoding a result to handle the
   case of the client sending an operation code within COMPOUND that is
   not supported.  See the COMPOUND procedure description for more
   details.

   The status field of ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL.

18.52.4.  IMPLEMENTATION

   A client will probably not send an operation with code OP_ILLEGAL,
   but if it does, the response will be ILLEGAL4res just as it would be
   with any other invalid operation code.  Note that if the server gets
   an illegal operation code that is not OP_ILLEGAL, and if the server
   checks for legal operation codes during the XDR decode phase, then
   the ILLEGAL4res would not be returned.

19.  NFSv4.1 Callback Procedures

   The procedures used for callbacks are defined in the following
   sections.  In the interest of clarity, the terms "client" and
   "server" refer to NFS clients and servers, despite the fact that for
   an individual callback RPC, the sense of these terms would be
   precisely the opposite.

   Both procedures, CB_NULL and CB_COMPOUND, MUST be implemented.

19.1.  Procedure 0: CB_NULL - No Operation

19.1.1.  ARGUMENTS

   void;

19.1.2.  RESULTS

   void;

19.1.3.  DESCRIPTION

   CB_NULL is the standard ONC RPC NULL procedure, with the standard
   void argument and void response.  Even though there is no direct
   functionality associated with this procedure, the server will use
   CB_NULL to confirm the existence of a path for RPCs from the server
   to client.

19.1.4.  ERRORS

   None.

19.2.  Procedure 1: CB_COMPOUND - Compound Operations

19.2.1.  ARGUMENTS

   enum nfs_cb_opnum4 {
       OP_CB_GETATTR               = 3,
       OP_CB_RECALL                = 4,
   /* Callback operations new to NFSv4.1 */
       OP_CB_LAYOUTRECALL          = 5,
       OP_CB_NOTIFY                = 6,
       OP_CB_PUSH_DELEG            = 7,
       OP_CB_RECALL_ANY            = 8,
       OP_CB_RECALLABLE_OBJ_AVAIL  = 9,
       OP_CB_RECALL_SLOT           = 10,
       OP_CB_SEQUENCE              = 11,
       OP_CB_WANTS_CANCELLED       = 12,
       OP_CB_NOTIFY_LOCK           = 13,
       OP_CB_NOTIFY_DEVICEID       = 14,

       OP_CB_ILLEGAL               = 10044
   };

   union nfs_cb_argop4 switch (unsigned argop) {
    case OP_CB_GETATTR:
         CB_GETATTR4args           opcbgetattr;
    case OP_CB_RECALL:
         CB_RECALL4args            opcbrecall;
    case OP_CB_LAYOUTRECALL:
         CB_LAYOUTRECALL4args      opcblayoutrecall;
    case OP_CB_NOTIFY:
         CB_NOTIFY4args            opcbnotify;
    case OP_CB_PUSH_DELEG:
         CB_PUSH_DELEG4args        opcbpush_deleg;
    case OP_CB_RECALL_ANY:
         CB_RECALL_ANY4args        opcbrecall_any;
    case OP_CB_RECALLABLE_OBJ_AVAIL:
         CB_RECALLABLE_OBJ_AVAIL4args
                                   opcbrecallable_obj_avail;
    case OP_CB_RECALL_SLOT:
         CB_RECALL_SLOT4args       opcbrecall_slot;
    case OP_CB_SEQUENCE:
         CB_SEQUENCE4args          opcbsequence;
    case OP_CB_WANTS_CANCELLED:
         CB_WANTS_CANCELLED4args   opcbwants_cancelled;
    case OP_CB_NOTIFY_LOCK:
         CB_NOTIFY_LOCK4args       opcbnotify_lock;
    case OP_CB_NOTIFY_DEVICEID:
         CB_NOTIFY_DEVICEID4args   opcbnotify_deviceid;
    case OP_CB_ILLEGAL:            void;
   };

   struct CB_COMPOUND4args {
           utf8str_cs      tag;
           uint32_t        minorversion;
           uint32_t        callback_ident;
           nfs_cb_argop4   argarray<>;
   };

19.2.2.  RESULTS

   union nfs_cb_resop4 switch (unsigned resop) {
    case OP_CB_GETATTR:
         CB_GETATTR4res            opcbgetattr;
    case OP_CB_RECALL:
         CB_RECALL4res             opcbrecall;

    /* New NFSv4.1 operations */
    case OP_CB_LAYOUTRECALL:
         CB_LAYOUTRECALL4res       opcblayoutrecall;
    case OP_CB_NOTIFY:
         CB_NOTIFY4res             opcbnotify;
    case OP_CB_PUSH_DELEG:
         CB_PUSH_DELEG4res         opcbpush_deleg;
    case OP_CB_RECALL_ANY:
         CB_RECALL_ANY4res         opcbrecall_any;
    case OP_CB_RECALLABLE_OBJ_AVAIL:
         CB_RECALLABLE_OBJ_AVAIL4res
                                   opcbrecallable_obj_avail;
    case OP_CB_RECALL_SLOT:
         CB_RECALL_SLOT4res        opcbrecall_slot;
    case OP_CB_SEQUENCE:
         CB_SEQUENCE4res           opcbsequence;
    case OP_CB_WANTS_CANCELLED:
         CB_WANTS_CANCELLED4res    opcbwants_cancelled;
    case OP_CB_NOTIFY_LOCK:
         CB_NOTIFY_LOCK4res        opcbnotify_lock;
    case OP_CB_NOTIFY_DEVICEID:
         CB_NOTIFY_DEVICEID4res    opcbnotify_deviceid;

    /* Not a new operation */
    case OP_CB_ILLEGAL:
         CB_ILLEGAL4res            opcbillegal;
   };

   struct CB_COMPOUND4res {
           nfsstat4        status;
           utf8str_cs      tag;
           nfs_cb_resop4   resarray<>;
   };

19.2.3.  DESCRIPTION

   The CB_COMPOUND procedure is used to combine one or more of the
   callback operations into a single RPC request.  The main callback
   RPC program has two main procedures: CB_NULL and CB_COMPOUND.  All
   other operations use the CB_COMPOUND procedure as a wrapper.

   During the processing of the CB_COMPOUND procedure, the client may
   find that it does not have the available resources to execute any or
   all of the operations within the CB_COMPOUND sequence.  Refer to
   Section 2.10.6.4 for details.

   The minorversion field of the arguments MUST be the same as the
   minorversion of the COMPOUND procedure used to create the client ID
   and session.  For NFSv4.1, minorversion MUST be set to 1.
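   As a rough illustration of the minorversion rule above, the
   following sketch models how a client might triage an incoming
   CB_COMPOUND before dispatching its operations.  The function and
   its return shape are hypothetical stand-ins, not part of the
   protocol's XDR; only the constants and the minorversion rule come
   from the specification.

```python
NFS4_OK = 0
NFS4ERR_MINOR_VERS_MISMATCH = 10021

# Operation codes from the nfs_cb_opnum4 enum: OP_CB_GETATTR (3)
# through OP_CB_NOTIFY_DEVICEID (14); anything else is illegal.
KNOWN_CB_OPS = set(range(3, 15))

def triage_cb_compound(minorversion, opcodes):
    """Return (status, per-operation dispositions) for a CB_COMPOUND."""
    if minorversion != 1:   # NFSv4.1 sessions: minorversion MUST be 1
        return NFS4ERR_MINOR_VERS_MISMATCH, []
    # One possible handling of unknown opcodes: flag the slot so an
    # ILLEGAL-style result can be encoded for it (see Section 18.52).
    return NFS4_OK, ["ok" if op in KNOWN_CB_OPS else "illegal"
                     for op in opcodes]
```

   For example, a request with minorversion 0 is answered with
   NFS4ERR_MINOR_VERS_MISMATCH before any operation is looked at,
   while an unknown opcode affects only its own slot.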
   Contained within the CB_COMPOUND results is a "status" field.  This
   status MUST be equal to the status of the last operation that was
   executed within the CB_COMPOUND procedure.  Therefore, if an
   operation incurred an error, then the "status" value will be the
   same error value as is being returned for the operation that failed.

   The "tag" field is handled the same way as that of the COMPOUND
   procedure (see Section 16.2.3).

   Illegal operation codes are handled in the same way as they are
   handled for the COMPOUND procedure.

19.2.4.  IMPLEMENTATION

   The CB_COMPOUND procedure is used to combine individual operations
   into a single RPC request.  The client interprets each of the
   operations in turn.  If an operation is executed by the client and
   the status of that operation is NFS4_OK, then the next operation in
   the CB_COMPOUND procedure is executed.  The client continues this
   process until there are no more operations to be executed or one of
   the operations has a status value other than NFS4_OK.

19.2.5.  ERRORS

   CB_COMPOUND will of course return every error that each operation
   on the backchannel can return (see Table 7).  However, if
   CB_COMPOUND returns zero operations, obviously the error returned
   by CB_COMPOUND has nothing to do with an error returned by an
   operation.  The list of errors CB_COMPOUND will return if it
   processes zero operations includes:

                       CB_COMPOUND error returns

   +------------------------------+------------------------------------+
   | Error                        | Notes                              |
   +------------------------------+------------------------------------+
   | NFS4ERR_BADCHAR              | The tag argument has a character   |
   |                              | the replier does not support.      |
   | NFS4ERR_BADXDR               |                                    |
   | NFS4ERR_DELAY                |                                    |
   | NFS4ERR_INVAL                | The tag argument is not in UTF-8   |
   |                              | encoding.                          |
   | NFS4ERR_MINOR_VERS_MISMATCH  |                                    |
   | NFS4ERR_SERVERFAULT          |                                    |
   | NFS4ERR_TOO_MANY_OPS         |                                    |
   | NFS4ERR_REP_TOO_BIG          |                                    |
   | NFS4ERR_REP_TOO_BIG_TO_CACHE |                                    |
   | NFS4ERR_REQ_TOO_BIG          |                                    |
   +------------------------------+------------------------------------+

                                Table 15

20.  NFSv4.1 Callback Operations

20.1.  Operation 3: CB_GETATTR - Get Attributes

20.1.1.  ARGUMENT

   struct CB_GETATTR4args {
           nfs_fh4 fh;
           bitmap4 attr_request;
   };

20.1.2.  RESULT

   struct CB_GETATTR4resok {
           fattr4  obj_attributes;
   };

   union CB_GETATTR4res switch (nfsstat4 status) {
    case NFS4_OK:
         CB_GETATTR4resok       resok4;
    default:
         void;
   };

20.1.3.  DESCRIPTION

   The CB_GETATTR operation is used by the server to obtain the
   current modified state of a file that has been OPEN_DELEGATE_WRITE
   delegated.  The size and change attributes are the only ones
   guaranteed to be serviced by the client.  See Section 10.4.3 for a
   full description of how the client and server are to interact with
   the use of CB_GETATTR.

   If the filehandle specified is not one for which the client holds
   an OPEN_DELEGATE_WRITE delegation, an NFS4ERR_BADHANDLE error is
   returned.

20.1.4.  IMPLEMENTATION

   The client returns attrmask bits and the associated attribute
   values only for the change attribute and the attributes it may
   change (time_modify and size).

20.2.  Operation 4: CB_RECALL - Recall a Delegation

20.2.1.  ARGUMENT

   struct CB_RECALL4args {
           stateid4        stateid;
           bool            truncate;
           nfs_fh4         fh;
   };

20.2.2.  RESULT

   struct CB_RECALL4res {
           nfsstat4        status;
   };

20.2.3.  DESCRIPTION

   The CB_RECALL operation is used to begin the process of recalling a
   delegation and returning it to the server.
   The truncate flag is used to optimize recall for a file object that
   is a regular file and is about to be truncated to zero.  When it is
   TRUE, the client is freed of the obligation to propagate modified
   data for the file to the server, since this data is irrelevant.

   If the handle specified is not one for which the client holds a
   delegation, an NFS4ERR_BADHANDLE error is returned.

   If the stateid specified is not one corresponding to an OPEN
   delegation for the file specified by the filehandle, an
   NFS4ERR_BAD_STATEID is returned.

20.2.4.  IMPLEMENTATION

   The client SHOULD reply to the callback immediately.  Replying does
   not complete the recall except when the value of the reply's status
   field is neither NFS4ERR_DELAY nor NFS4_OK.  The recall is not
   complete until the delegation is returned using a DELEGRETURN
   operation.

20.3.  Operation 5: CB_LAYOUTRECALL - Recall Layout from Client

20.3.1.  ARGUMENT

   /*
    * NFSv4.1 callback arguments and results
    */

   enum layoutrecall_type4 {
           LAYOUTRECALL4_FILE = LAYOUT4_RET_REC_FILE,
           LAYOUTRECALL4_FSID = LAYOUT4_RET_REC_FSID,
           LAYOUTRECALL4_ALL  = LAYOUT4_RET_REC_ALL
   };

   struct layoutrecall_file4 {
           nfs_fh4         lor_fh;
           offset4         lor_offset;
           length4         lor_length;
           stateid4        lor_stateid;
   };

   union layoutrecall4 switch(layoutrecall_type4 lor_recalltype) {
    case LAYOUTRECALL4_FILE:
         layoutrecall_file4 lor_layout;
    case LAYOUTRECALL4_FSID:
         fsid4              lor_fsid;
    case LAYOUTRECALL4_ALL:
         void;
   };

   struct CB_LAYOUTRECALL4args {
           layouttype4             clora_type;
           layoutiomode4           clora_iomode;
           bool                    clora_changed;
           layoutrecall4           clora_recall;
   };

20.3.2.  RESULT

   struct CB_LAYOUTRECALL4res {
           nfsstat4        clorr_status;
   };

20.3.3.  DESCRIPTION

   The CB_LAYOUTRECALL operation is used by the server to recall
   layouts from the client; as a result, the client will begin the
   process of returning layouts via LAYOUTRETURN.  The CB_LAYOUTRECALL
   operation specifies one of three forms of recall processing with
   the value of layoutrecall_type4.  The recall is for one of the
   following: a specific layout of a specific file
   (LAYOUTRECALL4_FILE), an entire file system ID
   (LAYOUTRECALL4_FSID), or all file systems (LAYOUTRECALL4_ALL).

   The behavior of the operation varies based on the value of the
   layoutrecall_type4.  The value and behaviors are:

   LAYOUTRECALL4_FILE

      For a layout to match the recall request, the values of the
      following fields must match those of the layout: clora_type,
      clora_iomode, lor_fh, and the byte-range specified by lor_offset
      and lor_length.  The clora_iomode field may have a special value
      of LAYOUTIOMODE4_ANY.  The special value LAYOUTIOMODE4_ANY will
      match any iomode originally returned in a layout; therefore, it
      acts as a wild card.  The other special value used is for
      lor_length.  If lor_length has a value of NFS4_UINT64_MAX, the
      lor_length field means the maximum possible file size.  If a
      matching layout is found, it MUST be returned using the
      LAYOUTRETURN operation (see Section 18.44).  An example of the
      field's special value use is if clora_iomode is
      LAYOUTIOMODE4_ANY, lor_offset is zero, and lor_length is
      NFS4_UINT64_MAX, then the entire layout is to be returned.

      The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the
      client does not hold layouts for the file or if the client does
      not have any overlapping layouts for the specification in the
      layout recall.

   LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL

      If LAYOUTRECALL4_FSID is specified, the fsid specifies the file
      system for which any outstanding layouts MUST be returned.  If
      LAYOUTRECALL4_ALL is specified, all outstanding layouts MUST be
      returned.  In addition, LAYOUTRECALL4_FSID and LAYOUTRECALL4_ALL
      specify that all the storage device ID to storage device address
      mappings in the affected file system(s) are also recalled.  The
      respective LAYOUTRETURN with either LAYOUTRETURN4_FSID or
      LAYOUTRETURN4_ALL acknowledges to the server that the client
      invalidated the said device mappings.  See Section 12.5.5.2.1.5
      for considerations with "bulk" recall of layouts.

      The NFS4ERR_NOMATCHING_LAYOUT error is only returned when the
      client does not hold layouts and does not have valid deviceid
      mappings.

   In processing the layout recall request, the client also varies its
   behavior based on the value of the clora_changed field.  This field
   is used by the server to provide additional context for the reason
   why the layout is being recalled.  A FALSE value for clora_changed
   indicates that no change in the layout is expected and the client
   may write modified data to the storage devices involved; this must
   be done prior to returning the layout via LAYOUTRETURN.  A TRUE
   value for clora_changed indicates that the server is changing the
   layout.  Examples of layout changes and reasons for a TRUE
   indication are the following: the metadata server is restriping the
   file or a permanent error has occurred on a storage device and the
   metadata server would like to provide a new layout for the file.
   Therefore, a clora_changed value of TRUE indicates some level of
   change for the layout and the client SHOULD NOT write and commit
   modified data to the storage devices.  In this case, the client
   writes and commits data through the metadata server.

   See Section 12.5.3 for a description of how the lor_stateid field
   in the arguments is to be constructed.  Note that the "seqid" field
   of lor_stateid MUST NOT be zero.  See Sections 8.2, 12.5.3, and
   12.5.5.2 for a further discussion and requirements.

20.3.4.  IMPLEMENTATION

   The client's processing for CB_LAYOUTRECALL is similar to CB_RECALL
   (recall of file delegations) in that the client responds to the
   request before actually returning layouts via the LAYOUTRETURN
   operation.  While the client responds to the CB_LAYOUTRECALL
   immediately, the operation is not considered complete (i.e., it is
   considered pending) until all affected layouts are returned to the
   server via the LAYOUTRETURN operation.

   Before returning the layout to the server via LAYOUTRETURN, the
   client should wait for the response from in-process or in-flight
   READ, WRITE, or COMMIT operations that use the recalled layout.

   If the client is holding modified data that is affected by a
   recalled layout, the client has various options for writing the
   data to the server.  As always, the client may write the data
   through the metadata server.  In fact, the client may not have a
   choice other than writing to the metadata server when the
   clora_changed argument is TRUE and a new layout is unavailable from
   the server.  However, the client may be able to write the modified
   data to the storage device if the clora_changed argument is FALSE;
   this needs to be done before returning the layout via LAYOUTRETURN.
   If the client were to obtain a new layout covering the modified
   data's byte-range, then writing to the storage devices is an
   available alternative.  Note that before obtaining a new layout,
   the client must first return the original layout.
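   The LAYOUTRECALL4_FILE matching rules above (exact clora_type and
   lor_fh, LAYOUTIOMODE4_ANY acting as a wildcard iomode, and a
   lor_length of NFS4_UINT64_MAX meaning "to the maximum file size")
   can be sketched as follows.  This is an illustrative model using
   plain dictionaries in place of the XDR structures; the function
   name and the overlap test are assumptions of this sketch, not text
   from the protocol.

```python
NFS4_UINT64_MAX = 2**64 - 1
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2
LAYOUTIOMODE4_ANY = 3   # wildcard: matches any iomode originally granted

def _end(seg):
    """Exclusive end of a byte-range; a length of NFS4_UINT64_MAX
    means 'from offset to the maximum possible file size'."""
    if seg["length"] == NFS4_UINT64_MAX:
        return NFS4_UINT64_MAX
    return seg["offset"] + seg["length"]

def recall_matches_layout(recall, layout):
    """Model of LAYOUTRECALL4_FILE matching against one held layout.

    `recall` and `layout` are dicts with keys: type, iomode, fh,
    offset, length -- stand-ins for the XDR structures above.
    """
    if recall["type"] != layout["type"] or recall["fh"] != layout["fh"]:
        return False
    if (recall["iomode"] != LAYOUTIOMODE4_ANY
            and recall["iomode"] != layout["iomode"]):
        return False
    # The recalled byte-range must overlap the held layout's range;
    # if no held layout overlaps, NFS4ERR_NOMATCHING_LAYOUT applies.
    return (recall["offset"] < _end(layout)
            and layout["offset"] < _end(recall))
```

   With clora_iomode set to LAYOUTIOMODE4_ANY, lor_offset zero, and
   lor_length NFS4_UINT64_MAX, this model matches every layout held
   for the file, mirroring the "entire layout" example above.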
   In the case of modified data being written while the layout is
   held, the client must use LAYOUTCOMMIT operations at the
   appropriate time; as required, LAYOUTCOMMIT must be done before the
   LAYOUTRETURN.  If a large amount of modified data is outstanding,
   the client may send LAYOUTRETURNs for portions of the recalled
   layout; this allows the server to monitor the client's progress and
   adherence to the original recall request.  However, the last
   LAYOUTRETURN in a sequence of returns MUST specify the full range
   being recalled (see Section 12.5.5.1 for details).

   If a server needs to delete a device ID and there are layouts
   referring to the device ID, CB_LAYOUTRECALL MUST be invoked to
   cause the client to return all layouts referring to the device ID
   before the server can delete the device ID.  If the client does not
   return the affected layouts, the server MAY revoke the layouts.

20.4.  Operation 6: CB_NOTIFY - Notify Client of Directory Changes

20.4.1.  ARGUMENT

   /*
    * Directory notification types.
    */
   enum notify_type4 {
           NOTIFY4_CHANGE_CHILD_ATTRS = 0,
           NOTIFY4_CHANGE_DIR_ATTRS = 1,
           NOTIFY4_REMOVE_ENTRY = 2,
           NOTIFY4_ADD_ENTRY = 3,
           NOTIFY4_RENAME_ENTRY = 4,
           NOTIFY4_CHANGE_COOKIE_VERIFIER = 5
   };

   /* Changed entry information. */
   struct notify_entry4 {
           component4      ne_file;
           fattr4          ne_attrs;
   };

   /* Previous entry information */
   struct prev_entry4 {
           notify_entry4   pe_prev_entry;
           /* what READDIR returned for this entry */
           nfs_cookie4     pe_prev_entry_cookie;
   };

   struct notify_remove4 {
           notify_entry4   nrm_old_entry;
           nfs_cookie4     nrm_old_entry_cookie;
   };

   struct notify_add4 {
           /*
            * Information on object
            * possibly renamed over.
            */
           notify_remove4      nad_old_entry<1>;
           notify_entry4       nad_new_entry;
           /* what READDIR would have returned for this entry */
           nfs_cookie4         nad_new_entry_cookie<1>;
           prev_entry4         nad_prev_entry<1>;
           bool                nad_last_entry;
   };

   struct notify_attr4 {
           notify_entry4   na_changed_entry;
   };

   struct notify_rename4 {
           notify_remove4  nrn_old_entry;
           notify_add4     nrn_new_entry;
   };

   struct notify_verifier4 {
           verifier4       nv_old_cookieverf;
           verifier4       nv_new_cookieverf;
   };

   /*
    * Objects of type notify_<>4 and
    * notify_device_<>4 are encoded in this.
    */
   typedef opaque notifylist4<>;

   struct notify4 {
           /* composed from notify_type4 or notify_deviceid_type4 */
           bitmap4         notify_mask;
           notifylist4     notify_vals;
   };

   struct CB_NOTIFY4args {
           stateid4    cna_stateid;
           nfs_fh4     cna_fh;
           notify4     cna_changes<>;
   };

20.4.2.  RESULT

   struct CB_NOTIFY4res {
           nfsstat4    cnr_status;
   };

20.4.3.  DESCRIPTION

   The CB_NOTIFY operation is used by the server to send notifications
   to clients about changes to delegated directories.  The
   registration of notifications for the directories occurs when the
   delegation is established using GET_DIR_DELEGATION.  These
   notifications are sent over the backchannel.  The notification is
   sent once the original request has been processed on the server.
   The server will send an array of notifications for changes that
   might have occurred in the directory.  The notifications are sent
   as a list of pairs of bitmaps and values.  See Section 3.3.7 for a
   description of how NFSv4.1 bitmaps work.

   If the server has more notifications than can fit in the
   CB_COMPOUND request, it SHOULD send a sequence of serial
   CB_COMPOUND requests so that the client's view of the directory
   does not become confused.  For example, if the server indicates
   that a file named "foo" is added and that the file "foo" is
   removed, the order in which the client receives these notifications
   needs to be the same as the order in which the corresponding
   operations occurred on the server.

   If the client holding the delegation makes any changes in the
   directory that cause files or sub-directories to be added or
   removed, the server will notify that client of the resulting
   change(s).  If the client holding the delegation is making
   attribute or cookie verifier changes only, the server does not need
   to send notifications to that client.  The server will send the
   following information for each operation:

   NOTIFY4_ADD_ENTRY
      The server will send information about the new directory entry
      being created along with the cookie for that entry.  The entry
      information (data type notify_add4) includes the component name
      of the entry and attributes.  The server will send this type of
      entry when a file is actually being created, when an entry is
      being added to a directory as a result of a rename across
      directories (see below), and when a hard link is being created
      to an existing file.  If this entry is added to the end of the
      directory, the server will set the nad_last_entry flag to TRUE.
      If the file is added such that there is at least one entry
      before it, the server will also return the previous entry
      information (nad_prev_entry, a variable-length array of up to
      one element; if the array is of zero length, there is no
      previous entry), along with its cookie.  This is to help clients
      find the right location in their file name caches and directory
      caches where this entry should be cached.  If the new entry's
      cookie is available, it will be in the nad_new_entry_cookie
      (another variable-length array of up to one element) field.  If
      the addition of the entry causes another entry to be deleted
      (which can only happen in the rename case) atomically with the
      addition, then information on this entry is reported in
      nad_old_entry.

   NOTIFY4_REMOVE_ENTRY
      The server will send information about the directory entry being
      deleted.  The server will also send the cookie value for the
      deleted entry so that clients can get to the cached information
      for this entry.

   NOTIFY4_RENAME_ENTRY
      The server will send information about both the old entry and
      the new entry.  This includes the name and attributes for each
      entry.  In addition, if the rename causes the deletion of an
      entry (i.e., the case of a file renamed over), then this is
      reported in nrn_new_entry.nad_old_entry.  This notification is
      only sent if both entries are in the same directory.  If the
      rename is across directories, the server will send a remove
      notification to one directory and an add notification to the
      other directory, assuming both have a directory delegation.

   NOTIFY4_CHANGE_CHILD_ATTRS/NOTIFY4_CHANGE_DIR_ATTRS
      The client will use the attribute mask to inform the server of
      attributes for which it wants to receive notifications.  This
      change notification can be requested for changes to the
      attributes of the directory as well as changes to any file's
      attributes in the directory by using two separate attribute
      masks.  The client cannot ask for change attribute notification
      for a specific file.  One attribute mask covers all the files in
      the directory.  Upon any attribute change, the server will send
      back the values of changed attributes.  Notifications might not
      make sense for some file system-wide attributes, and it is up to
      the server to decide which subset it wants to support.
      The client can negotiate the frequency of attribute
      notifications by letting the server know how often it wants to
      be notified of an attribute change.  The server will return
      supported notification frequencies or an indication that no
      notification is permitted for directory or child attributes by
      setting the dir_notif_delay and dir_entry_notif_delay
      attributes, respectively.

   NOTIFY4_CHANGE_COOKIE_VERIFIER
      If the cookie verifier changes while a client is holding a
      delegation, the server will notify the client so that it can
      invalidate its cookies and re-send a READDIR to get the new set
      of cookies.

20.5.  Operation 7: CB_PUSH_DELEG - Offer Previously Requested
       Delegation to Client

20.5.1.  ARGUMENT

   struct CB_PUSH_DELEG4args {
           nfs_fh4          cpda_fh;
           open_delegation4 cpda_delegation;
   };

20.5.2.  RESULT

   struct CB_PUSH_DELEG4res {
           nfsstat4 cpdr_status;
   };

20.5.3.  DESCRIPTION

   CB_PUSH_DELEG is used by the server both to signal to the client
   that the delegation it wants (previously indicated via a want
   established from an OPEN or WANT_DELEGATION operation) is available
   and to simultaneously offer the delegation to the client.  The
   client has the choice of accepting the delegation by returning
   NFS4_OK to the server, delaying the decision to accept the offered
   delegation by returning NFS4ERR_DELAY, or permanently rejecting the
   offer of the delegation by returning NFS4ERR_REJECT_DELEG.  When a
   delegation is rejected in this fashion, the want previously
   established is permanently deleted and the delegation is subject to
   acquisition by another client.

20.5.4.  IMPLEMENTATION

   If the client does return NFS4ERR_DELAY and there is a conflicting
   delegation request, the server MAY process it at the expense of the
   client that returned NFS4ERR_DELAY.  The client's want will not be
   cancelled, but MAY be processed behind other delegation requests or
   registered wants.

   When a client returns a status other than NFS4_OK, NFS4ERR_DELAY,
   or NFS4ERR_REJECT_DELEG, the want remains pending, although servers
   may decide to cancel the want by sending a CB_WANTS_CANCELLED.

20.6.  Operation 8: CB_RECALL_ANY - Keep Any N Recallable Objects

20.6.1.  ARGUMENT

   const RCA4_TYPE_MASK_RDATA_DLG          = 0;
   const RCA4_TYPE_MASK_WDATA_DLG          = 1;
   const RCA4_TYPE_MASK_DIR_DLG            = 2;
   const RCA4_TYPE_MASK_FILE_LAYOUT        = 3;
   const RCA4_TYPE_MASK_BLK_LAYOUT         = 4;
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN     = 8;
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX     = 9;
   const RCA4_TYPE_MASK_OTHER_LAYOUT_MIN   = 12;
   const RCA4_TYPE_MASK_OTHER_LAYOUT_MAX   = 15;

   struct CB_RECALL_ANY4args {
           uint32_t        craa_objects_to_keep;
           bitmap4         craa_type_mask;
   };

20.6.2.  RESULT

   struct CB_RECALL_ANY4res {
           nfsstat4        crar_status;
   };

20.6.3.  DESCRIPTION

   The server may decide that it cannot hold all of the state for
   recallable objects, such as delegations and layouts, without
   running out of resources.  In such a case, while not optimal, the
   server is free to recall individual objects to reduce the load.

   Because the general purpose of such recallable objects as
   delegations is to eliminate client interaction with the server, the
   server cannot interpret lack of recent use as indicating that the
   object is no longer useful.  The absence of visible use is
   consistent with a delegation keeping potential operations from
   being sent to the server.
   In the case of layouts, while it is true that the usefulness of a
   layout is indicated by the use of the layout when storage devices
   receive I/O requests, because there is no mandate that a storage
   device indicate to the metadata server any past or present use of a
   layout, the metadata server is not likely to know which layouts are
   good candidates to recall in response to low resources.

   In order to implement an effective reclaim scheme for such objects,
   the server's knowledge of available resources must be used to
   determine when objects must be recalled, with the clients selecting
   the actual objects to be returned.

   Server implementations may differ in their resource allocation
   requirements.  For example, one server may share resources among
   all classes of recallable objects, whereas another may use separate
   resource pools for layouts and for delegations, or further separate
   resources by types of delegations.

   When a given resource pool is over-utilized, the server can send a
   CB_RECALL_ANY to clients holding recallable objects of the types
   involved, allowing it to keep a certain number of such objects and
   return any excess.  A mask specifies which types of objects are to
   be limited.  The client chooses, based on its own knowledge of
   current usefulness, which of the objects in that class should be
   returned.

   A number of bits are defined.  For some of these, ranges are
   defined and it is up to the definition of the storage protocol to
   specify how these are to be used.  There are ranges reserved for
   object-based storage protocols and for other experimental storage
   protocols.  An RFC defining such a storage protocol needs to
   specify how particular bits within its range are to be used.  For
   example, it may specify a mapping between attributes of the layout
   (read vs. write, size of area) and the bit to be used, or it may
   define a field in the layout where the associated bit position is
   made available by the server to the client.

   RCA4_TYPE_MASK_RDATA_DLG

      The client is to return OPEN_DELEGATE_READ delegations on non-
      directory file objects.

   RCA4_TYPE_MASK_WDATA_DLG

      The client is to return OPEN_DELEGATE_WRITE delegations on
      regular file objects.

   RCA4_TYPE_MASK_DIR_DLG

      The client is to return directory delegations.

   RCA4_TYPE_MASK_FILE_LAYOUT

      The client is to return layouts of type LAYOUT4_NFSV4_1_FILES.

   RCA4_TYPE_MASK_BLK_LAYOUT

      See [47] for a description.

   RCA4_TYPE_MASK_OBJ_LAYOUT_MIN to RCA4_TYPE_MASK_OBJ_LAYOUT_MAX

      See [46] for a description.

   RCA4_TYPE_MASK_OTHER_LAYOUT_MIN to RCA4_TYPE_MASK_OTHER_LAYOUT_MAX

      This range is reserved for telling the client to recall layouts
      of experimental or site-specific layout types (see
      Section 3.3.13).

   When a bit is set in the type mask that corresponds to an undefined
   type of recallable object, NFS4ERR_INVAL MUST be returned.  When a
   bit is set that corresponds to a defined type of object but the
   client does not support an object of the type, NFS4ERR_INVAL MUST
   NOT be returned.  Future minor versions of NFSv4 may expand the set
   of valid type mask bits.

   CB_RECALL_ANY specifies a count of objects that the client may
   keep, as opposed to a count that the client must return.  This is
   to avoid a potential race between a CB_RECALL_ANY that had a count
   of objects to free and a set of client-originated operations to
   return layouts or delegations.  As a result of the race, the client
   and server would have differing ideas as to how many objects to
   return.  Hence, the client could mistakenly free too many.
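   The keep-count arithmetic above, and the rule (discussed in the
   following paragraphs) that overlapping CB_RECALL_ANYs for the same
   type mask combine as if only the lowest count had been sent, can be
   illustrated with a small sketch.  The helper names are hypothetical;
   only the semantics come from the specification.

```python
def objects_to_return(held, craa_objects_to_keep):
    """CB_RECALL_ANY names a count the client may KEEP, not a count
    to return; the client returns only its excess over that count."""
    return max(0, held - craa_objects_to_keep)

def effective_keep_limit(limits_received):
    """Until the client has met its obligation, outstanding
    CB_RECALL_ANYs for the same type mask combine to the lowest
    count received; a later, higher count cannot cancel an earlier,
    lower one, and reordering of the callbacks is harmless."""
    return min(limits_received)
```

   For example, a client holding 10 delegations of a masked type that
   receives craa_objects_to_keep = 7 returns 3 of them; a client
   already at or below the limit returns nothing, and the callback has
   no effect.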
29111 If resource demands prompt it, the server may send another 29112 CB_RECALL_ANY with a lower count, even if it has not yet received an 29113 acknowledgment from the client for a previous CB_RECALL_ANY with the 29114 same type mask. Although the possibility exists that these will be 29115 received by the client in an order different from the order in which 29116 they were sent, any such permutation of the callback stream is 29117 harmless. It is the job of the client to bring down the size of the 29118 recallable object set in line with each CB_RECALL_ANY received, and 29119 until that obligation is met, it cannot be cancelled or modified by 29120 any subsequent CB_RECALL_ANY for the same type mask. Thus, if the 29121 server sends two CB_RECALL_ANYs, the effect will be the same as if 29122 the lower count was sent, whatever the order of recall receipt. Note 29123 that this means that a server may not cancel the effect of a 29124 CB_RECALL_ANY by sending another recall with a higher count. When a 29125 CB_RECALL_ANY is received and the count is already within the limit 29126 set or is above a limit that the client is working to get down to, 29127 that callback has no effect. 29129 Servers are generally free to deny recallable objects when 29130 insufficient resources are available. Note that the effect of such a 29131 policy is implicitly to give precedence to existing objects relative 29132 to requested ones, with the result that resources might not be 29133 optimally used. To prevent this, servers are well advised to make 29134 the point at which they start sending CB_RECALL_ANY callbacks 29135 somewhat below that at which they cease to give out new delegations 29136 and layouts. This allows the client to purge its less-used objects 29137 whenever appropriate and so continue to have its subsequent requests 29138 given new resources freed up by object returns. 29140 20.6.4. 
IMPLEMENTATION 29142 The client can choose to return any type of object specified by the 29143 mask. If a server wishes to limit the use of objects of a specific 29144 type, it should only specify that type in the mask it sends. Should 29145 the client fail to return requested objects, it is up to the server 29146 to handle this situation, typically by sending specific recalls 29147 (i.e., sending CB_RECALL operations) to properly limit resource 29148 usage. The server should give the client enough time to return 29149 objects before proceeding to specific recalls. This time should not 29150 be less than the lease period. 29152 20.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal Resources for 29153 Recallable Objects 29155 20.7.1. ARGUMENT 29157 typedef CB_RECALL_ANY4args CB_RECALLABLE_OBJ_AVAIL4args; 29159 20.7.2. RESULT 29161 struct CB_RECALLABLE_OBJ_AVAIL4res { 29162 nfsstat4 croa_status; 29163 }; 29165 20.7.3. DESCRIPTION 29167 CB_RECALLABLE_OBJ_AVAIL is used by the server to signal the client 29168 that the server has resources to grant recallable objects that might 29169 previously have been denied by OPEN, WANT_DELEGATION, GET_DIR_DELEG, 29170 or LAYOUTGET. 29172 The argument craa_objects_to_keep means the total number of 29173 recallable objects of the types indicated in the argument type_mask 29174 that the server believes it can allow the client to have, including 29175 the number of such objects the client already has. A client that 29176 tries to acquire more recallable objects than the server informs it 29177 can have runs the risk of having objects recalled. 29179 The server is not obligated to reserve the difference between the 29180 number of the objects the client currently has and the value of 29181 craa_objects_to_keep, nor does delaying the reply to 29182 CB_RECALLABLE_OBJ_AVAIL prevent the server from using the resources 29183 of the recallable objects for another purpose. 
Indeed, if a client responds slowly to CB_RECALLABLE_OBJ_AVAIL, the server might interpret the client as having reduced capability to manage recallable objects, and so cancel or reduce any reservation it is maintaining on behalf of the client.  Thus, if the client desires to acquire more recallable objects, it needs to reply quickly to CB_RECALLABLE_OBJ_AVAIL, and then send the appropriate operations to acquire recallable objects.

20.8.  Operation 10: CB_RECALL_SLOT - Change Flow Control Limits

20.8.1.  ARGUMENT

   struct CB_RECALL_SLOT4args {
           slotid4       rsa_target_highest_slotid;
   };

20.8.2.  RESULT

   struct CB_RECALL_SLOT4res {
           nfsstat4      rsr_status;
   };

20.8.3.  DESCRIPTION

The CB_RECALL_SLOT operation requests the client to return session slots, and if applicable, transport credits (e.g., RDMA credits for connections associated with the operations channel) of the session's fore channel.  CB_RECALL_SLOT specifies rsa_target_highest_slotid, the value of the target highest slot ID the server wants for the session.  The client MUST then progress toward reducing the session's highest slot ID to the target value.

If the session has only non-RDMA connections associated with its operations channel, then the client need only wait for all outstanding requests with a slot ID > rsa_target_highest_slotid to complete, then send a single COMPOUND consisting of a single SEQUENCE operation, with the sa_highest_slotid field set to rsa_target_highest_slotid.  If there are RDMA-based connections associated with the operations channel, then the client needs to also send enough zero-length "RDMA Send" messages to take the total RDMA credit count to rsa_target_highest_slotid + 1 or below.

20.8.4.  IMPLEMENTATION

If the client fails to reduce the highest slot ID it has on the fore channel to what the server requests, the server can force the issue by asserting flow control on the receive side of all connections bound to the fore channel, and then finish servicing all outstanding requests that are in slots greater than rsa_target_highest_slotid.  Once that is done, the server can then open the flow control, and any time the client sends a new request on a slot greater than rsa_target_highest_slotid, the server can return NFS4ERR_BADSLOT.

20.9.  Operation 11: CB_SEQUENCE - Supply Backchannel Sequencing and Control

20.9.1.  ARGUMENT

   struct referring_call4 {
           sequenceid4     rc_sequenceid;
           slotid4         rc_slotid;
   };

   struct referring_call_list4 {
           sessionid4      rcl_sessionid;
           referring_call4 rcl_referring_calls<>;
   };

   struct CB_SEQUENCE4args {
           sessionid4           csa_sessionid;
           sequenceid4          csa_sequenceid;
           slotid4              csa_slotid;
           slotid4              csa_highest_slotid;
           bool                 csa_cachethis;
           referring_call_list4 csa_referring_call_lists<>;
   };

20.9.2.  RESULT

   struct CB_SEQUENCE4resok {
           sessionid4      csr_sessionid;
           sequenceid4     csr_sequenceid;
           slotid4         csr_slotid;
           slotid4         csr_highest_slotid;
           slotid4         csr_target_highest_slotid;
   };

   union CB_SEQUENCE4res switch (nfsstat4 csr_status) {
   case NFS4_OK:
           CB_SEQUENCE4resok   csr_resok4;
   default:
           void;
   };

20.9.3.  DESCRIPTION

The CB_SEQUENCE operation is used to manage operational accounting for the backchannel of the session on which a request is sent.
The contents include the session ID to which this request belongs, the slot ID and sequence ID used by the server to implement session request control and exactly once semantics, and exchanged slot ID maxima that are used to adjust the size of the reply cache.  In each CB_COMPOUND request, CB_SEQUENCE MUST appear once and MUST be the first operation.  The error NFS4ERR_SEQUENCE_POS MUST be returned when CB_SEQUENCE is found in any position in a CB_COMPOUND beyond the first.  If any other operation is in the first position of CB_COMPOUND, NFS4ERR_OP_NOT_IN_SESSION MUST be returned.

See Section 18.46.3 for a description of how slots are processed.

If csa_cachethis is TRUE, then the server is requesting that the client cache the reply in the callback reply cache.  The client MUST cache the reply (see Section 2.10.6.1.3).

The csa_referring_call_lists array is the list of COMPOUND requests, identified by session ID, slot ID, and sequence ID.  These are requests that the client previously sent to the server.  These previous requests created state that some operation(s) in the same CB_COMPOUND as the csa_referring_call_lists are identifying.  A session ID is included because leased state is tied to a client ID, and a client ID can have multiple sessions.  See Section 2.10.6.3.

The value of the csa_sequenceid argument relative to the cached sequence ID on the slot falls into one of three cases.

o  If the difference between csa_sequenceid and the client's cached sequence ID at the slot ID is two (2) or more, or if csa_sequenceid is less than the cached sequence ID (accounting for wraparound of the unsigned sequence ID value), then the client MUST return NFS4ERR_SEQ_MISORDERED.
o  If csa_sequenceid and the cached sequence ID are the same, this is a retry, and the client returns the CB_COMPOUND request's cached reply.

o  If csa_sequenceid is one greater (accounting for wraparound) than the cached sequence ID, then this is a new request, and the slot's sequence ID is incremented.  The operations subsequent to CB_SEQUENCE, if any, are processed.  If there are no other operations, the only other effects are to cache the CB_SEQUENCE reply in the slot, maintain the session's activity, and, when the server receives the CB_SEQUENCE reply, renew the lease of state related to the client ID.

If the server reuses a slot ID and sequence ID for a completely different request, the client MAY treat the request as if it is a retry of what it has already executed.  The client MAY, however, detect the server's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY.

If CB_SEQUENCE returns an error, then the state of the slot (sequence ID, cached reply) MUST NOT change.  See Section 2.10.6.1.3 for the conditions when the error NFS4ERR_RETRY_UNCACHED_REP might be returned.

The client returns two "highest_slotid" values: csr_highest_slotid and csr_target_highest_slotid.  The former is the highest slot ID the client will accept in a future CB_SEQUENCE operation, and SHOULD NOT be less than the value of csa_highest_slotid (but see Section 2.10.6.1 for an exception).  The latter is the highest slot ID the client would prefer the server use on a future CB_SEQUENCE operation.

20.10.  Operation 12: CB_WANTS_CANCELLED - Cancel Pending Delegation Wants

20.10.1.  ARGUMENT

   struct CB_WANTS_CANCELLED4args {
           bool    cwca_contended_wants_cancelled;
           bool    cwca_resourced_wants_cancelled;
   };

20.10.2.  RESULT

   struct CB_WANTS_CANCELLED4res {
           nfsstat4        cwcr_status;
   };

20.10.3.  DESCRIPTION

The CB_WANTS_CANCELLED operation is used to notify the client that some or all of the wants it registered for recallable delegations and layouts have been cancelled.

If cwca_contended_wants_cancelled is TRUE, this indicates that the server will not be pushing to the client any delegations that become available after contention passes.

If cwca_resourced_wants_cancelled is TRUE, this indicates that the server will not notify the client when there are resources on the server to grant delegations or layouts.

After receiving a CB_WANTS_CANCELLED operation, the client is free to attempt to acquire the delegations or layouts it was waiting for, and possibly re-register wants.

20.10.4.  IMPLEMENTATION

If a client has an OPEN, WANT_DELEGATION, or GET_DIR_DELEGATION request outstanding when a CB_WANTS_CANCELLED is sent, the server may need to make clear to the client whether a promise to signal delegation availability happened before the CB_WANTS_CANCELLED, and is thus covered by it, or after the CB_WANTS_CANCELLED, in which case it was not covered by it.  The server can make this distinction by putting the appropriate requests into the list of referring calls in the associated CB_SEQUENCE.

20.11.  Operation 13: CB_NOTIFY_LOCK - Notify Client of Possible Lock Availability

20.11.1.  ARGUMENT

   struct CB_NOTIFY_LOCK4args {
           nfs_fh4     cnla_fh;
           lock_owner4 cnla_lock_owner;
   };

20.11.2.  RESULT

   struct CB_NOTIFY_LOCK4res {
           nfsstat4    cnlr_status;
   };

20.11.3.  DESCRIPTION

The server can use this operation to indicate that a byte-range lock for the given file and lock-owner, previously requested by the client via an unsuccessful LOCK operation, might be available.
This callback is meant to be used by servers to help reduce the latency of blocking locks in the case where they recognize that a client that has been polling for a blocking byte-range lock may now be able to acquire the lock.  If the server supports this callback for a given file, it MUST set the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when responding to successful opens for that file.  This does not commit the server to the use of CB_NOTIFY_LOCK, but the client may use this as a hint to decide how frequently to poll for locks derived from that open.

If an OPEN operation results in an upgrade, in which the stateid returned has an "other" value matching that of a stateid already allocated, with a new "seqid" indicating a change in the lock being represented, then the value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag when responding to that new OPEN controls handling from that point going forward.  When parallel OPENs are done on the same file and open-owner, the ordering of the "seqid" fields of the returned stateids (subject to wraparound) is to be used to select the controlling value of the OPEN4_RESULT_MAY_NOTIFY_LOCK flag.

20.11.4.  IMPLEMENTATION

The server MUST NOT grant the byte-range lock to the client unless and until it receives a LOCK operation from the client.  Similarly, the client receiving this callback cannot assume that it now has the lock or that a subsequent LOCK operation for the lock will be successful.

The server is not required to implement this callback, and even if it does, it is not required to use it in any particular case.  Therefore, the client must still rely on polling for blocking locks, as described in Section 9.6.

Similarly, the client is not required to implement this callback, and even if it does, it is still free to ignore it.
Therefore, the server MUST NOT assume that the client will act based on the callback.

20.12.  Operation 14: CB_NOTIFY_DEVICEID - Notify Client of Device ID Changes

20.12.1.  ARGUMENT

   /*
    * Device notification types.
    */
   enum notify_deviceid_type4 {
           NOTIFY_DEVICEID4_CHANGE = 1,
           NOTIFY_DEVICEID4_DELETE = 2
   };

   /* For NOTIFY4_DEVICEID4_DELETE */
   struct notify_deviceid_delete4 {
           layouttype4     ndd_layouttype;
           deviceid4       ndd_deviceid;
   };

   /* For NOTIFY4_DEVICEID4_CHANGE */
   struct notify_deviceid_change4 {
           layouttype4     ndc_layouttype;
           deviceid4       ndc_deviceid;
           bool            ndc_immediate;
   };

   struct CB_NOTIFY_DEVICEID4args {
           notify4 cnda_changes<>;
   };

20.12.2.  RESULT

   struct CB_NOTIFY_DEVICEID4res {
           nfsstat4        cndr_status;
   };

20.12.3.  DESCRIPTION

The CB_NOTIFY_DEVICEID operation is used by the server to send notifications to clients about changes to pNFS device IDs.  The registration of device ID notifications is optional and is done via GETDEVICEINFO.  These notifications are sent over the backchannel once the original request has been processed on the server.  The server will send an array of notifications, cnda_changes, as a list of pairs of bitmaps and values.  See Section 3.3.7 for a description of how NFSv4.1 bitmaps work.

As with CB_NOTIFY (Section 20.4.3), it is possible the server has more notifications than can fit in a CB_COMPOUND, thus requiring multiple CB_COMPOUNDs.  Unlike CB_NOTIFY, serialization is not an issue because unlike directory entries, device IDs cannot be re-used after being deleted (Section 12.2.10).

All device ID notifications contain a device ID and a layout type.
The layout type is necessary because two different layout types can share the same device ID, and the common device ID can have completely different mappings for each layout type.

The server will send the following notifications:

NOTIFY_DEVICEID4_CHANGE
   A previously provided device-ID-to-device-address mapping has changed and the client uses GETDEVICEINFO to obtain the updated mapping.  The notification is encoded in a value of data type notify_deviceid_change4.  This data type also contains a boolean field, ndc_immediate, which if TRUE indicates that the change will be enforced immediately, and so the client might not be able to complete any pending I/O to the device ID.  If ndc_immediate is FALSE, then for an indefinite time, the client can complete pending I/O.  After pending I/O is complete, the client SHOULD get the new device-ID-to-device-address mappings before sending new I/O requests to the storage devices addressed by the device ID.

NOTIFY_DEVICEID4_DELETE
   Deletes a device ID from the mappings.  This notification MUST NOT be sent if the client has a layout that refers to the device ID.  In other words, if the server is sending a delete device ID notification, one of the following is true for layouts associated with the layout type:

   *  The client never had a layout referring to that device ID.

   *  The client has returned all layouts referring to that device ID.

   *  The server has revoked all layouts referring to that device ID.

   The notification is encoded in a value of data type notify_deviceid_delete4.  After a server deletes a device ID, it MUST NOT reuse that device ID for the same layout type until the client ID is deleted.

20.13.  Operation 10044: CB_ILLEGAL - Illegal Callback Operation

20.13.1.  ARGUMENT

   void;

20.13.2.  RESULT

   /*
    * CB_ILLEGAL: Response for illegal operation numbers
    */
   struct CB_ILLEGAL4res {
           nfsstat4        status;
   };

20.13.3.  DESCRIPTION

This operation is a placeholder for encoding a result to handle the case of the server sending an operation code within CB_COMPOUND that is not defined in the NFSv4.1 specification.  See Section 19.2.3 for more details.

The status field of CB_ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL.

20.13.4.  IMPLEMENTATION

A server will probably not send an operation with code OP_CB_ILLEGAL, but if it does, the response will be CB_ILLEGAL4res, just as it would be with any other invalid operation code.  Note that if the client gets an illegal operation code that is not OP_ILLEGAL, and if the client checks for legal operation codes during the XDR decode phase, then an instance of data type CB_ILLEGAL4res will not be returned.

21.  Security Considerations

Historically, the authentication model of NFS was based on the entire machine being the NFS client, with the NFS server trusting the NFS client to authenticate the end-user.  The NFS server in turn shared its files only to specific clients, as identified by the client's source network address.  Given this model, the AUTH_SYS RPC security flavor simply identified the end-user using the client to the NFS server.  When processing NFS responses, the client ensured that the responses came from the same network address and port number to which the request was sent.  While such a model is easy to implement and simple to deploy and use, it is unsafe.
Thus, NFSv4.1 implementations are REQUIRED to support a security model that uses end-to-end authentication, where an end-user on a client mutually authenticates (via cryptographic schemes that do not expose passwords or keys in the clear on the network) to a principal on an NFS server.

Consideration is also given to the integrity and privacy of NFS requests and responses.  The issues of end-to-end mutual authentication, integrity, and privacy are discussed in Section 2.2.1.1.1.  There are specific considerations when using Kerberos V5, as described in Section 2.2.1.1.1.2.1.1.

Note that being REQUIRED to implement does not mean REQUIRED to use; AUTH_SYS can be used by NFSv4.1 clients and servers.  However, AUTH_SYS is merely an OPTIONAL security flavor in NFSv4.1, and so interoperability via AUTH_SYS is not assured.

For reasons of reduced administration overhead, better performance, and/or reduction of CPU utilization, users of NFSv4.1 implementations might decline to use security mechanisms that enable integrity protection on each remote procedure call and response.  The use of mechanisms without integrity leaves the user vulnerable to a man-in-the-middle between the NFS client and server that modifies the RPC request and/or the response.  While implementations are free to provide the option to use weaker security mechanisms, there are three operations in particular that warrant the implementation overriding user choices.

o  The first two such operations are SECINFO and SECINFO_NO_NAME.  It is RECOMMENDED that the client send both operations such that they are protected with a security flavor that has integrity protection, such as RPCSEC_GSS with either the rpc_gss_svc_integrity or rpc_gss_svc_privacy service.
   Without integrity protection encapsulating SECINFO and SECINFO_NO_NAME and their results, a man-in-the-middle could modify results such that the client might select a weaker algorithm in the set allowed by the server, making the client and/or server vulnerable to further attacks.

o  The third operation that SHOULD use integrity protection is any GETATTR for the fs_locations and fs_locations_info attributes, in order to mitigate the severity of a man-in-the-middle attack.  The attack has two steps.  First, the attacker modifies the unprotected results of some operation to return NFS4ERR_MOVED.  Second, when the client follows up with a GETATTR for the fs_locations or fs_locations_info attributes, the attacker modifies the results to cause the client to migrate its traffic to a server controlled by the attacker.  With integrity protection, this attack is mitigated.

Relative to previous NFS versions, NFSv4.1 has additional security considerations for pNFS (see Sections 12.9 and 13.12), locking and session state (see Section 2.10.8.3), and state recovery during the grace period (see Section 8.4.2.1.1).  With respect to locking and session state, if SP4_SSV state protection is being used, Section 2.10.10 has specific security considerations for the NFSv4.1 client and server.

Security considerations for lock reclaim differ between the two different situations in which state reclaim is to be done.  The server failure situation is discussed in Section 8.4.2.1.1, while the per-fs state reclaim done in support of migration/replication is discussed in Section 11.11.9.1.
The use of the multi-server namespace features described in Section 11 raises the possibility that requests to determine the set of network addresses corresponding to a given server might be interfered with or have their responses modified in flight.  In light of this possibility, the following considerations should be taken note of:

o  When DNS is used to convert server names to addresses and DNSSEC [29] is not available, the validity of the network addresses returned generally cannot be relied upon.  However, when combined with a trusted resolver, DNS over TLS [30] and DNS over HTTPS [34] can be relied upon to provide valid address resolutions.

   In situations in which the validity of the provided addresses cannot be relied upon and the client uses RPCSEC_GSS to access the designated server, it is possible for mutual authentication to discover invalid server addresses, as long as the RPCSEC_GSS implementation used does not use insecure DNS queries to canonicalize the hostname components of the service principal names, as explained in [28].

o  The fetching of attributes containing file system location information SHOULD be performed using integrity protection.  It is important to note here that a client making a request of this sort without using integrity protection needs to be aware of the negative consequences of doing so, which can lead to invalid host names or network addresses being returned.  These include cases in which the client is directed to a server under the control of an attacker, who might get access to data written or provide incorrect values for data read.  In light of this, the client needs to recognize that using such returned location information to access an NFSv4 server without use of RPCSEC_GSS (i.e., by using AUTH_SYS) poses dangers, as it can result in the client interacting with such an attacker-controlled server without any authentication facilities to verify the server's identity.

o  Despite the fact that it is a requirement that implementations provide "support" for use of RPCSEC_GSS, it cannot be assumed that use of RPCSEC_GSS is always available between any particular client-server pair.

o  When a client has the network addresses of a server but not the associated host names, the absence of host names would interfere with its ability to use RPCSEC_GSS.

In light of the above, a server SHOULD present file system location entries that correspond to file systems on other servers using a host name.  This allows the client to interrogate the fs_locations on the destination server to obtain trunking information (as well as replica information) using integrity protection, validating the name provided while assuring that the response has not been modified in flight.

When RPCSEC_GSS is not available on a server, the client needs to be aware of the fact that the location entries are subject to modification in flight and so cannot be relied upon.  In the case of a client being directed to another server after NFS4ERR_MOVED, this could vitiate the authentication provided by the use of RPCSEC_GSS on the designated destination server.  Even when RPCSEC_GSS authentication is available on the destination, the server might still properly authenticate as the server to which the client was erroneously directed.  Without a way to decide whether the server is a valid one, the client can only determine, using RPCSEC_GSS, that the server corresponds to the name provided, with no basis for trusting that server.
As a result, the client SHOULD NOT use such unverified location entries as a basis for migration, even though RPCSEC_GSS might be available on the destination.

When a file system location attribute is fetched upon connecting with an NFS server, it SHOULD, as stated above, be done with integrity protection.  When this is not possible, it is generally best for the client to ignore trunking and replica information or simply not fetch the location information for these purposes.

When location information cannot be verified, it can be subjected to additional filtering to prevent the client from being inappropriately directed.  For example, if a range of network addresses can be determined that assures that the servers and clients using AUTH_SYS are subject to the appropriate set of constraints (e.g., physical network isolation, administrative controls on the operating systems used), then network addresses in the appropriate range can be used, with others discarded or restricted in their use of AUTH_SYS.

To summarize considerations regarding the use of RPCSEC_GSS in fetching location information, we need to consider the following possibilities for requests to interrogate location information, with interrogation approaches on the referring and destination servers arrived at separately:

o  The use of integrity protection is RECOMMENDED in all cases, since the absence of integrity protection exposes the client to the possibility of the results being modified in transit.

o  The use of requests issued without RPCSEC_GSS (i.e., using AUTH_SYS, which has no provision to avoid modification of data in flight), while undesirable and a potential security exposure, may not be avoidable in all cases.
   Where the use of the returned information cannot be avoided, it is made subject to filtering as described above to eliminate the possibility that the client would treat an invalid address as if it were an NFSv4 server.  The specifics will vary depending on the degree of network isolation and whether the request is to the referring or destination servers.

Even if such requests are not interfered with in flight, it is possible for a compromised server to direct the client to use inappropriate servers, such as those under the control of the attacker.  It is not clear that being directed to such servers represents a greater threat to the client than the damage that could be done by the compromised server itself.  However, it is possible that some sorts of transient server compromises might be taken advantage of to direct a client to a server capable of doing greater damage over a longer time.  One useful step to guard against this possibility is to issue requests to fetch location data using RPCSEC_GSS, even if no mapping to an RPCSEC_GSS principal is available.  In this case, RPCSEC_GSS would not be used, as it typically is, to identify the client principal to the server, but rather to make sure (via RPCSEC_GSS mutual authentication) that the server being contacted is the one intended.

Similar considerations apply if the threat to be avoided is the redirection of client traffic to inappropriate (i.e., poorly performing) servers.  In both cases, there is no reason for the information returned to depend on the identity of the client principal requesting it, while the validity of the server information, which has the capability to affect all client principals, is of considerable importance.

22.  IANA Considerations

This section uses terms that are defined in [62].

22.1.  IANA Actions Needed

This update does not require any modification of or additions to registry entries or registry rules associated with NFSv4.1.  However, since this document is intended to obsolete RFC 5661, it will be necessary for IANA to update all registry entries and registry rules references that point to RFC 5661 to point to this document instead.

Previous actions by IANA related to NFSv4.1 are listed in the remaining subsections of Section 22.

22.2.  Named Attribute Definitions

IANA created a registry called the "NFSv4 Named Attribute Definitions Registry".

The NFSv4.1 protocol supports the association of a file with zero or more named attributes.  The namespace identifiers for these attributes are defined as string names.  The protocol does not define the specific assignment of the namespace for these file attributes.  The IANA registry promotes interoperability where common interests exist.  While application developers are allowed to define and use attributes as needed, they are encouraged to register the attributes with IANA.

Such registered named attributes are presumed to apply to all minor versions of NFSv4, including those defined subsequently to the registration.  If the named attribute is intended to be limited to specific minor versions, this will be clearly stated in the registry's assignment.

All assignments to the registry are made on a First Come First Served basis, per Section 4.1 of [62].  The policy for each assignment is Specification Required, per Section 4.1 of [62].

Under the NFSv4.1 specification, the name of a named attribute can in theory be up to 2^32 - 1 bytes in length, but in practice NFSv4.1 clients and servers will be unable to handle a string that long.  IANA should reject any assignment request with a named attribute that exceeds 128 UTF-8 characters.
To give the IESG the flexibility to set up bases of assignment of Experimental Use and Standards Action, the prefixes of "EXPE" and "STDS" are Reserved.  The named attribute with a zero-length name is Reserved.

The prefix "PRIV" is designated for Private Use.  A site that wants to make use of unregistered named attributes without risk of conflicting with an assignment in IANA's registry should use the prefix "PRIV" in all of its named attributes.

Because some NFSv4.1 clients and servers have case-insensitive semantics, the fifteen additional lower case and mixed case permutations of each of "EXPE", "PRIV", and "STDS" are Reserved (e.g., "expe", "expE", "exPe", etc. are Reserved).  Similarly, IANA must not allow two assignments that would conflict if both named attributes were converted to a common case.

The registry of named attributes is a list of assignments, each containing three fields.

1.  A US-ASCII string name that is the actual name of the attribute.  This name must be unique.  This string name can be 1 to 128 UTF-8 characters long.

2.  A reference to the specification of the named attribute.  The reference can consume up to 256 bytes (or more if IANA permits).

3.  The point of contact of the registrant.  The point of contact can consume up to 256 bytes (or more if IANA permits).

22.2.1.  Initial Registry

There is no initial registry.

22.2.2.  Updating Registrations

The registrant is always permitted to update the point of contact field.  Any other change will require Expert Review or IESG Approval.

22.3.  Device ID Notifications

IANA created a registry called the "NFSv4 Device ID Notifications Registry".

The potential exists for new notification types to be added to the CB_NOTIFY_DEVICEID operation (see Section 20.12).
This can be done 29862 via changes to the operations that register notifications, or by 29863 adding new operations to NFSv4. This requires a new minor version of 29864 NFSv4, and requires a Standards Track document from the IETF. 29865 Another way to add a notification is to specify a new layout type 29866 (see Section 22.5). 29868 Hence, all assignments to the registry are made on a Standards Action 29869 basis per Section 4.1 of [62], with Expert Review required. 29871 The registry is a list of assignments, each containing five fields. 29874 1. The name of the notification type. This name must have the 29875 prefix "NOTIFY_DEVICEID4_". This name must be unique. 29877 2. The value of the notification. IANA will assign this number, and 29878 the request from the registrant will use TBD1 instead of an 29879 actual value. IANA MUST use a whole number that can be no higher 29880 than 2^32-1, and should be the next available value. The value 29881 assigned must be unique. A Designated Expert must be used to 29882 ensure that when the name of the notification type and its value 29883 are added to the NFSv4.1 notify_deviceid_type4 enumerated data 29884 type in the NFSv4.1 XDR description ([10]), the result continues 29885 to be a valid XDR description. 29887 3. The Standards Track RFC(s) that describe the notification. If 29888 the RFC(s) have not yet been published, the registrant will use 29889 RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 29891 4. How the RFC introduces the notification. This is indicated by a 29892 single US-ASCII value. If the value is N, it means a minor 29893 revision to the NFSv4 protocol. If the value is L, it means a 29894 new pNFS layout type. Other values can be used with IESG 29895 Approval. 29897 5. The minor versions of NFSv4 that are allowed to use the 29898 notification.
While these are numeric values, IANA will not 29899 allocate and assign them; the author of the relevant RFCs with 29900 IESG Approval assigns these numbers. Each time there is a new 29901 minor version of NFSv4 approved, a Designated Expert should 29902 review the registry to make recommended updates as needed. 29904 22.3.1. Initial Registry 29906 The initial registry is in Table 16. Note that the next available 29907 value is zero. 29909 +-------------------------+-------+---------+-----+----------------+ 29910 | Notification Name | Value | RFC | How | Minor Versions | 29911 +-------------------------+-------+---------+-----+----------------+ 29912 | NOTIFY_DEVICEID4_CHANGE | 1 | RFC5661 | N | 1 | 29913 | NOTIFY_DEVICEID4_DELETE | 2 | RFC5661 | N | 1 | 29914 +-------------------------+-------+---------+-----+----------------+ 29916 Table 16: Initial Device ID Notification Assignments 29918 22.3.2. Updating Registrations 29920 The update of a registration will require IESG Approval on the advice 29921 of a Designated Expert. 29923 22.4. Object Recall Types 29925 IANA created a registry called the "NFSv4 Recallable Object Types 29926 Registry". 29928 The potential exists for new object types to be added to the 29929 CB_RECALL_ANY operation (see Section 20.6). This can be done via 29930 changes to the operations that add recallable types, or by adding new 29931 operations to NFSv4. This requires a new minor version of NFSv4, and 29932 requires a Standards Track document from IETF. Another way to add a 29933 new recallable object is to specify a new layout type (see 29934 Section 22.5). 29936 All assignments to the registry are made on a Standards Action basis 29937 per Section 4.1 of [62], with Expert Review required. 29939 Recallable object types are 32-bit unsigned numbers. There are no 29940 Reserved values. Values in the range 12 through 15, inclusive, are 29941 designated for Private Use. 
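Since recallable object types name bit positions in the type mask given to CB_RECALL_ANY (see Section 20.6), the value rules above can be illustrated with a small sketch. The constants mirror the initial registry; the helper functions are hypothetical and, for simplicity, handle only a single 32-bit mask word:

```python
# Illustrative sketch only: recallable object types act as bit
# positions in the type mask passed to CB_RECALL_ANY.  The constants
# mirror the initial registry; the helpers are hypothetical.

RCA4_TYPE_MASK_RDATA_DLG = 0      # read delegations
RCA4_TYPE_MASK_WDATA_DLG = 1      # write delegations
RCA4_TYPE_MASK_DIR_DLG = 2        # directory delegations
RCA4_TYPE_MASK_FILE_LAYOUT = 3    # file layouts
RCA4_TYPE_MASK_BLK_LAYOUT = 4     # block/volume layouts

def recall_any_mask(*object_types):
    """Combine recallable object types into a single-word bitmask."""
    mask = 0
    for t in object_types:
        if not 0 <= t <= 31:
            # Larger 32-bit type values would need additional mask words.
            raise ValueError("bit position out of range for one mask word")
        mask |= 1 << t
    return mask

def is_private_use(object_type):
    """Values 12 through 15, inclusive, are designated for Private Use."""
    return 12 <= object_type <= 15
```

For example, `recall_any_mask(RCA4_TYPE_MASK_RDATA_DLG, RCA4_TYPE_MASK_WDATA_DLG)` yields a mask covering both delegation types.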
29943 The registry is a list of assignments, each containing five fields. 29946 1. The name of the recallable object type. This name must have the 29947 prefix "RCA4_TYPE_MASK_". The name must be unique. 29949 2. The value of the recallable object type. IANA will assign this 29950 number, and the request from the registrant will use TBD1 instead 29951 of an actual value. IANA MUST use a whole number that can be no 29952 higher than 2^32-1, and should be the next available value. The 29953 value must be unique. A Designated Expert must be used to ensure 29954 that when the name of the recallable type and its value are added 29955 to the NFSv4 XDR description [10], the result continues to be a 29956 valid XDR description. 29958 3. The Standards Track RFC(s) that describe the recallable object 29959 type. If the RFC(s) have not yet been published, the registrant 29960 will use RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 29962 4. How the RFC introduces the recallable object type. This is 29963 indicated by a single US-ASCII value. If the value is N, it 29964 means a minor revision to the NFSv4 protocol. If the value is L, 29965 it means a new pNFS layout type. Other values can be used with 29966 IESG Approval. 29968 5. The minor versions of NFSv4 that are allowed to use the 29969 recallable object type. While these are numeric values, IANA 29970 will not allocate and assign them; the author of the relevant 29971 RFCs with IESG Approval assigns these numbers. Each time there 29972 is a new minor version of NFSv4 approved, a Designated Expert 29973 should review the registry to make recommended updates as needed. 29975 22.4.1. Initial Registry 29977 The initial registry is in Table 17. Note that the next available 29978 value is five.
29980 +-------------------------------+-------+--------+-----+------------+ 29981 | Recallable Object Type Name | Value | RFC | How | Minor | 29982 | | | | | Versions | 29983 +-------------------------------+-------+--------+-----+------------+ 29984 | RCA4_TYPE_MASK_RDATA_DLG | 0 | RFC | N | 1 | 29985 | | | 5661 | | | 29986 | RCA4_TYPE_MASK_WDATA_DLG | 1 | RFC | N | 1 | 29987 | | | 5661 | | | 29988 | RCA4_TYPE_MASK_DIR_DLG | 2 | RFC | N | 1 | 29989 | | | 5661 | | | 29990 | RCA4_TYPE_MASK_FILE_LAYOUT | 3 | RFC | N | 1 | 29991 | | | 5661 | | | 29992 | RCA4_TYPE_MASK_BLK_LAYOUT | 4 | RFC | L | 1 | 29993 | | | 5661 | | | 29994 | RCA4_TYPE_MASK_OBJ_LAYOUT_MIN | 8 | RFC | L | 1 | 29995 | | | 5661 | | | 29996 | RCA4_TYPE_MASK_OBJ_LAYOUT_MAX | 9 | RFC | L | 1 | 29997 | | | 5661 | | | 29998 +-------------------------------+-------+--------+-----+------------+ 30000 Table 17: Initial Recallable Object Type Assignments 30002 22.4.2. Updating Registrations 30004 The update of a registration will require IESG Approval on the advice 30005 of a Designated Expert. 30007 22.5. Layout Types 30009 IANA created a registry called the "pNFS Layout Types Registry". 30011 All assignments to the registry are made on a Standards Action basis, 30012 with Expert Review required. 30014 Layout types are 32-bit numbers. The value zero is Reserved. Values 30015 in the range 0x80000000 to 0xFFFFFFFF inclusive are designated for 30016 Private Use. IANA will assign numbers from the range 0x00000001 to 30017 0x7FFFFFFF inclusive. 30019 The registry is a list of assignments, each containing five fields. 30021 1. The name of the layout type. This name must have the prefix 30022 "LAYOUT4_". The name must be unique. 30024 2. The value of the layout type. IANA will assign this number, and 30025 the request from the registrant will use TBD1 instead of an 30026 actual value. The value assigned must be unique. 
A Designated 30027 Expert must be used to ensure that when the name of the layout 30028 type and its value are added to the NFSv4.1 layouttype4 30029 enumerated data type in the NFSv4.1 XDR description ([10]), the 30030 result continues to be a valid XDR description. 30032 3. The Standards Track RFC(s) that describe the layout type. If 30033 the RFC(s) have not yet been published, the registrant will use 30034 RFCTBD2, RFCTBD3, etc. instead of an actual RFC number. 30035 Collectively, the RFC(s) must adhere to the guidelines listed in 30036 Section 22.5.3. 30038 4. How the RFC introduces the layout type. This is indicated by a 30039 single US-ASCII value. If the value is N, it means a minor 30040 revision to the NFSv4 protocol. If the value is L, it means a 30041 new pNFS layout type. Other values can be used with IESG 30042 Approval. 30044 5. The minor versions of NFSv4 that are allowed to use the 30045 layout type. While these are numeric values, IANA will not 30046 allocate and assign them; the author of the relevant RFCs with 30047 IESG Approval assigns these numbers. Each time there is a new 30048 minor version of NFSv4 approved, a Designated Expert should 30049 review the registry to make recommended updates as needed. 30051 22.5.1. Initial Registry 30053 The initial registry is in Table 18. 30055 +-----------------------+-------+----------+-----+----------------+ 30056 | Layout Type Name | Value | RFC | How | Minor Versions | 30057 +-----------------------+-------+----------+-----+----------------+ 30058 | LAYOUT4_NFSV4_1_FILES | 0x1 | RFC 5661 | N | 1 | 30059 | LAYOUT4_OSD2_OBJECTS | 0x2 | RFC 5664 | L | 1 | 30060 | LAYOUT4_BLOCK_VOLUME | 0x3 | RFC 5663 | L | 1 | 30061 +-----------------------+-------+----------+-----+----------------+ 30063 Table 18: Initial Layout Type Assignments 30065 22.5.2. Updating Registrations 30067 The update of a registration will require IESG Approval on the advice 30068 of a Designated Expert. 30070 22.5.3.
Guidelines for Writing Layout Type Specifications 30072 The author of a new pNFS layout specification must follow these steps 30073 to obtain acceptance of the layout type as a Standards Track RFC: 30075 1. The author devises the new layout specification. 30077 2. The new layout type specification MUST, at a minimum: 30079 * Define the contents of the layout-type-specific fields of the 30080 following data types: 30082 + the da_addr_body field of the device_addr4 data type; 30084 + the loh_body field of the layouthint4 data type; 30086 + the loc_body field of the layout_content4 data type (which in 30087 turn is the lo_content field of the layout4 data type); 30089 + the lou_body field of the layoutupdate4 data type; 30091 * Describe or define the storage access protocol used to access 30092 the storage devices. 30094 * Describe whether revocation of layouts is supported. 30096 * At a minimum, describe the methods of recovery from: 30098 1. Failure and restart for client, server, storage device. 30100 2. Lease expiration from the perspective of the active client, 30101 server, storage device. 30103 3. Loss of layout state resulting in fencing of client access 30104 to storage devices (for an example, see Section 12.7.3). 30106 * Include an IANA considerations section, which will in turn 30107 include: 30109 + A request to IANA for a new layout type per Section 22.5. 30111 + A list of requests to IANA for any new recallable object 30112 types for CB_RECALL_ANY; each entry is to be presented in 30113 the form described in Section 22.4. 30115 + A list of requests to IANA for any new notification values 30116 for CB_NOTIFY_DEVICEID; each entry is to be presented in 30117 the form described in Section 22.3. 30119 * Include a security considerations section. This section MUST 30120 explain how the NFSv4.1 authentication, authorization, and 30121 access-control models are preserved.
That is, if a metadata 30122 server would restrict a READ or WRITE operation, how would 30123 pNFS via the layout similarly restrict a corresponding input 30124 or output operation? 30126 3. The author documents the new layout specification as an Internet- 30127 Draft. 30129 4. The author submits the Internet-Draft for review through the IETF 30130 standards process as defined in "The Internet Standards Process-- 30131 Revision 3" (BCP 9). The new layout specification will be 30132 submitted for eventual publication as a Standards Track RFC. 30134 5. The layout specification progresses through the IETF standards 30135 process. 30137 22.6. Path Variable Definitions 30139 This section deals with the IANA considerations associated with the 30140 variable substitution feature for location names as described in 30141 Section 11.17.3. As described there, variables subject to 30142 substitution consist of a domain name and a specific name within that 30143 domain, with the two separated by a colon. There are two sets of 30144 IANA considerations here: 30146 1. The list of variable names. 30148 2. For each variable name, the list of possible values. 30150 Thus, there will be one registry for the list of variable names, and 30151 possibly one registry for listing the values of each variable name. 30153 22.6.1. Path Variables Registry 30155 IANA created a registry called the "NFSv4 Path Variables Registry". 30157 22.6.1.1. Path Variable Values 30159 Variable names are of the form "${", followed by a domain name, 30160 followed by a colon (":"), followed by a domain-specific portion of 30161 the variable name, followed by "}". When the domain name is 30162 "ietf.org", all variable names must be registered with IANA on a 30163 Standards Action basis, with Expert Review required.
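The variable-name syntax just described (a "${", a domain name, a colon, a domain-specific part, and a closing "}") can be checked mechanically. The parsing function below is a hypothetical sketch, not part of the protocol:

```python
# Hypothetical sketch: split a substitution variable of the form
# "${<domain>:<name>}" into its domain and domain-specific parts,
# returning None for anything that is not well formed.
import re

_PATH_VARIABLE = re.compile(r'^\$\{([^:{}]+):([^:{}]+)\}$')

def split_path_variable(variable):
    match = _PATH_VARIABLE.match(variable)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For example, `split_path_variable("${ietf.org:CPU_ARCH}")` yields `("ietf.org", "CPU_ARCH")`, while a string lacking the colon separator yields `None`.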
Path variables 30164 with registered domain names neither part of nor equal to ietf.org 30165 are assigned on a Hierarchical Allocation basis (delegating to the 30166 domain owner) and are thus of no concern to IANA, unless the domain owner 30167 chooses to register a variable name from its domain. If the domain 30168 owner chooses to do so, IANA will make the assignment on a First Come First Served 30169 basis. To accommodate registrants who do not have their own domain, 30170 IANA will accept requests to register variables with the prefix 30171 "${FCFS.ietf.org:" on a First Come First Served basis. Assignments 30172 on a First Come First Served basis do not require Expert Review, unless the 30173 registrant also wants IANA to establish a registry for the values of 30174 the registered variable. 30176 The registry is a list of assignments, each containing three fields. 30178 1. The name of the variable. The name of this variable must start 30179 with a "${" followed by a registered domain name, followed by 30180 ":", or it must start with "${FCFS.ietf.org:". The name must be 30181 no more than 64 UTF-8 characters long. The name must be unique. 30183 2. For assignments made on a Standards Action basis, the Standards 30184 Track RFC(s) that describe the variable. If the RFC(s) have not 30185 yet been published, the registrant will use RFCTBD1, RFCTBD2, 30186 etc. instead of an actual RFC number. Note that the RFCs do not 30187 have to be a part of an NFS minor version. For assignments made 30188 on a First Come First Served basis, an explanation (consuming no 30189 more than 1024 bytes, or more if IANA permits) of the purpose of 30190 the variable. A reference to the explanation can be substituted. 30192 3. The point of contact, including an email address. The point of 30193 contact can consume up to 256 bytes (or more if IANA permits). 30194 For assignments made on a Standards Action basis, the point of 30195 contact is always IESG. 30197 22.6.1.1.1.
Initial Registry 30199 The initial registry is in Table 19. 30201 +------------------------+----------+------------------+ 30202 | Variable Name | RFC | Point of Contact | 30203 +------------------------+----------+------------------+ 30204 | ${ietf.org:CPU_ARCH} | RFC 5661 | IESG | 30205 | ${ietf.org:OS_TYPE} | RFC 5661 | IESG | 30206 | ${ietf.org:OS_VERSION} | RFC 5661 | IESG | 30207 +------------------------+----------+------------------+ 30209 Table 19: Initial List of Path Variables 30211 IANA has created registries for the values of the variable names 30212 ${ietf.org:CPU_ARCH} and ${ietf.org:OS_TYPE}. See Sections 22.6.2 and 30213 22.6.3. 30215 For the values of the variable ${ietf.org:OS_VERSION}, no registry is 30216 needed as the specifics of the values of the variable will vary with 30217 the value of ${ietf.org:OS_TYPE}. Thus, values for 30218 ${ietf.org:OS_VERSION} are on a Hierarchical Allocation basis and are 30219 of no concern to IANA. 30221 22.6.1.1.2. Updating Registrations 30223 The update of an assignment made on a Standards Action basis will 30224 require IESG Approval on the advice of a Designated Expert. 30226 The registrant can always update the point of contact of an 30227 assignment made on a First Come First Served basis. Any other update 30228 will require Expert Review. 30230 22.6.2. Values for the ${ietf.org:CPU_ARCH} Variable 30232 IANA created a registry called the "NFSv4 ${ietf.org:CPU_ARCH} Value 30233 Registry". 30235 Assignments to the registry are made on a First Come First Served 30236 basis. The zero-length value of ${ietf.org:CPU_ARCH} is Reserved. 30237 Values with a prefix of "PRIV" are designated for Private Use. 30239 The registry is a list of assignments, each containing three fields. 30241 1. A value of the ${ietf.org:CPU_ARCH} variable. The value must be 30242 1 to 32 UTF-8 characters long. The value must be unique. 30244 2.
An explanation (consuming no more than 1024 bytes, or more if 30245 IANA permits) of what CPU architecture the value denotes. A 30246 reference to the explanation can be substituted. 30248 3. The point of contact, including an email address. The point of 30249 contact can consume up to 256 bytes (or more if IANA permits). 30251 22.6.2.1. Initial Registry 30253 There is no initial registry. 30255 22.6.2.2. Updating Registrations 30257 The registrant is free to update the assignment, i.e., change the 30258 explanation and/or point of contact fields. 30260 22.6.3. Values for the ${ietf.org:OS_TYPE} Variable 30262 IANA created a registry called the "NFSv4 ${ietf.org:OS_TYPE} Value 30263 Registry". 30265 Assignments to the registry are made on a First Come First Served 30266 basis. The zero-length value of ${ietf.org:OS_TYPE} is Reserved. 30267 Values with a prefix of "PRIV" are designated for Private Use. 30269 The registry is a list of assignments, each containing three fields. 30271 1. A value of the ${ietf.org:OS_TYPE} variable. The value must be 1 30272 to 32 UTF-8 characters long. The value must be unique. 30274 2. An explanation (consuming no more than 1024 bytes, or more if 30275 IANA permits) of what operating system the value denotes. A 30276 reference to the explanation can be substituted. 30278 3. The point of contact, including an email address. The point of 30279 contact can consume up to 256 bytes (or more if IANA permits). 30281 22.6.3.1. Initial Registry 30283 There is no initial registry. 30285 22.6.3.2. Updating Registrations 30287 The registrant is free to update the assignment, i.e., change the 30288 explanation and/or point of contact fields. 30290 23. References 30292 23.1. Normative References 30294 [1] Bradner, S., "Key words for use in RFCs to Indicate 30295 Requirement Levels", BCP 14, RFC 2119, March 1997. 30297 [2] Eisler, M., Ed., "XDR: External Data Representation 30298 Standard", STD 67, RFC 4506, May 2006.
30300 [3] Thurlow, R., "RPC: Remote Procedure Call Protocol 30301 Specification Version 2", RFC 5531, May 2009. 30303 [4] Eisler, M., Chiu, A., and L. Ling, "RPCSEC_GSS Protocol 30304 Specification", RFC 2203, September 1997. 30306 [5] Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos 30307 Version 5 Generic Security Service Application Program 30308 Interface (GSS-API) Mechanism Version 2", RFC 4121, July 30309 2005. 30311 [6] The Open Group, "Section 3.191 of Chapter 3 of Base 30312 Definitions of The Open Group Base Specifications Issue 6 30313 IEEE Std 1003.1, 2004 Edition, HTML Version 30314 (www.opengroup.org), ISBN 1931624232", 2004. 30316 [7] Linn, J., "Generic Security Service Application Program 30317 Interface Version 2, Update 1", RFC 2743, January 2000. 30319 [8] Recio, R., Metzler, B., Culley, P., Hilland, J., and D. 30320 Garcia, "A Remote Direct Memory Access Protocol 30321 Specification", RFC 5040, October 2007. 30323 [9] Eisler, M., "RPCSEC_GSS Version 2", RFC 5403, February 30324 2009. 30326 [10] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 30327 "Network File System (NFS) Version 4 Minor Version 1 30328 External Data Representation Standard (XDR) Description", 30329 RFC 5662, January 2010. 30331 [11] The Open Group, "Section 3.372 of Chapter 3 of Base 30332 Definitions of The Open Group Base Specifications Issue 6 30333 IEEE Std 1003.1, 2004 Edition, HTML Version 30334 (www.opengroup.org), ISBN 1931624232", 2004. 30336 [12] Eisler, M., "IANA Considerations for Remote Procedure Call 30337 (RPC) Network Identifiers and Universal Address Formats", 30338 RFC 5665, January 2010. 30340 [13] The Open Group, "Section 'read()' of System Interfaces of 30341 The Open Group Base Specifications Issue 6 IEEE Std 30342 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30343 ISBN 1931624232", 2004. 
30345 [14] The Open Group, "Section 'readdir()' of System Interfaces 30346 of The Open Group Base Specifications Issue 6 IEEE Std 30347 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30348 ISBN 1931624232", 2004. 30350 [15] The Open Group, "Section 'write()' of System Interfaces of 30351 The Open Group Base Specifications Issue 6 IEEE Std 30352 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30353 ISBN 1931624232", 2004. 30355 [16] Hoffman, P. and M. Blanchet, "Preparation of 30356 Internationalized Strings ("stringprep")", RFC 3454, 30357 December 2002. 30359 [17] The Open Group, "Section 'chmod()' of System Interfaces of 30360 The Open Group Base Specifications Issue 6 IEEE Std 30361 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30362 ISBN 1931624232", 2004. 30364 [18] International Organization for Standardization, 30365 "Information Technology - Universal Multiple-octet coded 30366 Character Set (UCS) - Part 1: Architecture and Basic 30367 Multilingual Plane", ISO Standard 10646-1, May 1993. 30369 [19] Alvestrand, H., "IETF Policy on Character Sets and 30370 Languages", BCP 18, RFC 2277, January 1998. 30372 [20] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 30373 Profile for Internationalized Domain Names (IDN)", 30374 RFC 3491, March 2003. 30376 [21] The Open Group, "Section 'fcntl()' of System Interfaces of 30377 The Open Group Base Specifications Issue 6 IEEE Std 30378 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30379 ISBN 1931624232", 2004. 30381 [22] The Open Group, "Section 'fsync()' of System Interfaces of 30382 The Open Group Base Specifications Issue 6 IEEE Std 30383 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30384 ISBN 1931624232", 2004. 30386 [23] The Open Group, "Section 'getpwnam()' of System Interfaces 30387 of The Open Group Base Specifications Issue 6 IEEE Std 30388 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30389 ISBN 1931624232", 2004. 
30391 [24] The Open Group, "Section 'unlink()' of System Interfaces 30392 of The Open Group Base Specifications Issue 6 IEEE Std 30393 1003.1, 2004 Edition, HTML Version (www.opengroup.org), 30394 ISBN 1931624232", 2004. 30396 [25] Schaad, J., Kaliski, B., and R. Housley, "Additional 30397 Algorithms and Identifiers for RSA Cryptography for use in 30398 the Internet X.509 Public Key Infrastructure Certificate 30399 and Certificate Revocation List (CRL) Profile", RFC 4055, 30400 June 2005. 30402 [26] National Institute of Standards and Technology, 30403 "Cryptographic Algorithm Object Registration", URL 30404 http://csrc.nist.gov/groups/ST/crypto_apps_infra/csor/ 30405 algorithms.html, November 2007. 30407 [27] Adamson, A. and N. Williams, "Remote Procedure Call (RPC) 30408 Security Version 3", RFC 7861, DOI 10.17487/RFC7861, 30409 November 2016, . 30411 [28] Neuman, C., Yu, T., Hartman, S., and K. Raeburn, "The 30412 Kerberos Network Authentication Service (V5)", RFC 4120, 30413 DOI 10.17487/RFC4120, July 2005, 30414 . 30416 [29] Arends, R., Austein, R., Larson, M., Massey, D., and S. 30417 Rose, "DNS Security Introduction and Requirements", 30418 RFC 4033, DOI 10.17487/RFC4033, March 2005, 30419 . 30421 [30] Hu, Z., Zhu, L., Heidemann, J., Mankin, A., Wessels, D., 30422 and P. Hoffman, "Specification for DNS over Transport 30423 Layer Security (TLS)", RFC 7858, DOI 10.17487/RFC7858, May 30424 2016, . 30426 [31] Adamson, A. and N. Williams, "Requirements for NFSv4 30427 Multi-Domain Namespace Deployment", RFC 8000, 30428 DOI 10.17487/RFC8000, November 2016, 30429 . 30431 [32] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct 30432 Memory Access Transport for Remote Procedure Call Version 30433 1", RFC 8166, DOI 10.17487/RFC8166, June 2017, 30434 . 30436 [33] Lever, C., "Network File System (NFS) Upper-Layer Binding 30437 to RPC-over-RDMA Version 1", RFC 8267, 30438 DOI 10.17487/RFC8267, October 2017, 30439 . 30441 [34] Hoffman, P. and P. 
McManus, "DNS Queries over HTTPS 30442 (DoH)", RFC 8484, DOI 10.17487/RFC8484, October 2018, 30443 . 30445 23.2. Informative References 30447 [35] Roach, A., "Process for Handling Non-Major Revisions to 30448 Existing RFCs", draft-roach-bis-documents-00 (work in 30449 progress), May 2019. 30451 [36] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., 30452 Beame, C., Eisler, M., and D. Noveck, "Network File System 30453 (NFS) version 4 Protocol", RFC 3530, April 2003. 30455 [37] Callaghan, B., Pawlowski, B., and P. Staubach, "NFS 30456 Version 3 Protocol Specification", RFC 1813, June 1995. 30458 [38] Eisler, M., "LIPKEY - A Low Infrastructure Public Key 30459 Mechanism Using SPKM", RFC 2847, June 2000. 30461 [39] Eisler, M., "NFS Version 2 and Version 3 Security Issues 30462 and the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5", 30463 RFC 2623, June 1999. 30465 [40] Juszczak, C., "Improving the Performance and Correctness 30466 of an NFS Server", USENIX Conference Proceedings , June 30467 1990. 30469 [41] Reynolds, J., Ed., "Assigned Numbers: RFC 1700 is Replaced 30470 by an On-line Database", RFC 3232, January 2002. 30472 [42] Srinivasan, R., "Binding Protocols for ONC RPC Version 2", 30473 RFC 1833, August 1995. 30475 [43] Werme, R., "RPC XID Issues", USENIX Conference 30476 Proceedings , February 1996. 30478 [44] Nowicki, B., "NFS: Network File System Protocol 30479 specification", RFC 1094, March 1989. 30481 [45] Bhide, A., Elnozahy, E., and S. Morgan, "A Highly 30482 Available Network Server", USENIX Conference Proceedings , 30483 January 1991. 30485 [46] Halevy, B., Welch, B., and J. Zelenka, "Object-Based 30486 Parallel NFS (pNFS) Operations", RFC 5664, January 2010. 30488 [47] Black, D., Glasgow, J., and S. Fridella, "Parallel NFS 30489 (pNFS) Block/Volume Layout", RFC 5663, January 2010. 30491 [48] Callaghan, B., "WebNFS Client Specification", RFC 2054, 30492 October 1996. 
30494 [49] Callaghan, B., "WebNFS Server Specification", RFC 2055, 30495 October 1996. 30497 [50] IESG, "IESG Processing of RFC Errata for the IETF Stream", 30498 July 2008. 30500 [51] Krawczyk, H., Bellare, M., and R. Canetti, "HMAC: Keyed- 30501 Hashing for Message Authentication", RFC 2104, February 30502 1997. 30504 [52] Shepler, S., "NFS Version 4 Design Considerations", 30505 RFC 2624, June 1999. 30507 [53] The Open Group, "Protocols for Interworking: XNFS, Version 30508 3W, ISBN 1-85912-184-5", February 1998. 30510 [54] Floyd, S. and V. Jacobson, "The Synchronization of 30511 Periodic Routing Messages", IEEE/ACM Transactions on 30512 Networking 2(2), pp. 122-136, April 1994. 30514 [55] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., 30515 and E. Zeidner, "Internet Small Computer Systems Interface 30516 (iSCSI)", RFC 3720, April 2004. 30518 [56] Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version 30519 (FCP-2)", ANSI/INCITS 350-2003, Oct 2003. 30521 [57] Weber, R., "Object-Based Storage Device Commands (OSD)", 30522 ANSI/INCITS 400-2004, July 2004, 30523 . 30525 [58] Carns, P., Ligon III, W., Ross, R., and R. Thakur, "PVFS: 30526 A Parallel File System for Linux Clusters.", Proceedings 30527 of the 4th Annual Linux Showcase and Conference , 2000. 30529 [59] The Open Group, "The Open Group Base Specifications Issue 30530 6, IEEE Std 1003.1, 2004 Edition", 2004. 30532 [60] Callaghan, B., "NFS URL Scheme", RFC 2224, October 1997. 30534 [61] Chiu, A., Eisler, M., and B. Callaghan, "Security 30535 Negotiation for WebNFS", RFC 2755, January 2000. 30537 [62] Narten, T. and H. Alvestrand, "Guidelines for Writing an 30538 IANA Considerations Section in RFCs", BCP 26, RFC 5226, 30539 May 2008. 30541 [63] Eisler, M., "Errata 2006 for RFC 5661", January 2010, 30542 . 30544 [64] Spasojevic, M. and M. Satayanarayanan, "An Empirical Study 30545 of a Wide-Area Distributed File System", May 1996, 30546 . 30549 [65] Shepler, S., Ed., Eisler, M., Ed., and D. 
Noveck, Ed., 30550 "Network File System (NFS) Version 4 Minor Version 1 30551 Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010, 30552 . 30554 [66] Noveck, D., "Rules for NFSv4 Extensions and Minor 30555 Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017, 30556 . 30558 [67] Haynes, T., Ed. and D. Noveck, Ed., "Network File System 30559 (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, 30560 March 2015, . 30562 [68] Noveck, D., Ed., Shivam, P., Lever, C., and B. Baker, 30563 "NFSv4.0 Migration: Specification Update", RFC 7931, 30564 DOI 10.17487/RFC7931, July 2016, 30565 . 30567 [69] Haynes, T., "Requirements for Parallel NFS (pNFS) Layout 30568 Types", RFC 8434, DOI 10.17487/RFC8434, August 2018, 30569 . 30571 [70] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an 30572 Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May 30573 2014, . 30575 [71] Rescorla, E. and B. Korver, "Guidelines for Writing RFC 30576 Text on Security Considerations", BCP 72, RFC 3552, 30577 DOI 10.17487/RFC3552, July 2003, 30578 . 30580 Appendix A. Need for this Update 30582 This document includes an explanation of how clients and servers are 30583 to determine the particular network access paths to be used to access 30584 a file system. This includes describing how changes to the specific 30585 replica to be used or to the set of addresses to be used to access it 30586 are to be dealt with, and how transfers of responsibility that need 30587 to be made can be dealt with transparently. This includes cases in 30588 which there is a shift between one replica and another and those in 30589 which different network access paths are used to access the same 30590 replica. 30592 As a result of the following problems in RFC5661 [65], it is 30593 necessary to provide the specific updates which are made by this 30594 document. 
These updates are described in Appendix B. 30596 o RFC5661 [65], while it dealt with situations in which various 30597 forms of clustering allowed co-ordination of the state assigned by 30598 co-operating servers to be used, made no provisions for 30599 Transparent State Migration. Within NFSv4.0, Transparent 30600 Migration was first explained clearly in RFC7530 [67] and then 30601 corrected and clarified by RFC7931 [68]. No corresponding 30602 explanation for NFSv4.1 had been provided. 30604 o Although NFSv4.1 provided a clear definition of how 30605 trunking detection was to be done, there was no clear 30606 specification of how trunking discovery was to be done, despite 30607 the fact that the specification clearly indicated that this 30608 information could be made available via the file system location 30609 attributes. 30611 o Because the existence of multiple network access paths to the same 30612 file system was dealt with as if there were multiple replicas, 30613 issues relating to transitions between replicas could never be 30614 clearly distinguished from trunking-related transitions between 30615 the addresses used to access a particular file system instance. 30616 As a result, in situations in which both migration and trunking 30617 configuration changes were involved, neither of these could be 30618 clearly dealt with and the relationship between these two features 30619 was not seriously addressed. 30621 o Because use of two network access paths to the same file system 30622 instance (i.e., trunking) was often treated as if two replicas were 30623 involved, it was considered that two replicas were being used 30624 simultaneously.
   As a result, the treatment of replicas being used simultaneously
   in RFC5661 [65] was not clear, as it covered two distinct cases:
   a single file system instance being accessed via two different
   network access paths, and two replicas being accessed
   simultaneously, with the limitations of the latter case not
   clearly laid out.

The majority of the consequences of these issues are dealt with by
presenting in Section 11 a replacement for Section 11 of RFC5661
[65].  This replacement modifies existing sub-sections within that
section and adds new ones, as described in Appendix B.1.  Also, some
existing sections are deleted.  These changes were made in order to:

o  Reorganize the description so that the case of two network access
   paths to the same file system instance is clearly distinguished
   from the case of two different replicas since, in the former
   case, locking state is shared and there can also be sharing of
   session state.

o  Provide a clear statement regarding the desirability of
   transparent transfer of state between replicas, together with a
   recommendation that either that or a single-fs grace period be
   provided.

o  Specifically delineate how such transfers are to be dealt with by
   the client, taking into account the differences from the
   treatment in [68] made necessary by the major protocol changes
   made in NFSv4.1.

o  Provide discussion of the relationship between transparent state
   transfer and Parallel NFS (pNFS).

o  Provide clarification of the fs_locations_info attribute in order
   to specify which portions of the information provided apply to a
   specific network access path and which apply to the replica which
   that path is used to access.
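To illustrate the distinction drawn above between trunked paths to a
single file system instance and distinct replicas, the following
sketch (in Python, with hypothetical type and field names standing in
for the XDR results of EXCHANGE_ID) classifies the relationship
between two network addresses by comparing the server scope and
server_owner values each returns, in the manner the trunking
detection rules of NFSv4.1 describe:

```python
# Hypothetical sketch: two addresses are trunked paths to the same
# server (sharing locking state) only when the server scope and the
# server_owner major ID returned via EXCHANGE_ID match; session
# trunking additionally requires a matching minor ID.

from dataclasses import dataclass

@dataclass(frozen=True)
class ExchangeIdResult:
    server_scope: bytes    # models eir_server_scope
    major_id: bytes        # models eir_server_owner.so_major_id
    minor_id: int          # models eir_server_owner.so_minor_id

def classify(a: ExchangeIdResult, b: ExchangeIdResult) -> str:
    """Classify the relationship between two network access paths."""
    if a.server_scope != b.server_scope or a.major_id != b.major_id:
        # Different servers: the two addresses reach distinct
        # replicas, and no locking state is shared between them.
        return "distinct-servers"
    if a.minor_id == b.minor_id:
        # Same server instance: a session may be used across both
        # connections (session trunking).
        return "session-trunking"
    # Same server (client ID trunking): the client ID is shared, but
    # each address needs its own session.
    return "clientid-trunking"
```

Only in the first case does a transition between the two addresses
amount to a shift between replicas; in the other two it is a
trunking-related shift within a single instance.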
In addition, there are updates to other sections of RFC5661 [65],
where the consequences of the incorrect assumptions underlying the
current treatment of multi-server namespace issues also needed to be
corrected.  These are dealt with as described in Appendices B.2
through B.4.

o  A revised introductory section regarding multi-server namespace
   facilities is provided.

o  A more realistic treatment of server scope is provided, which
   reflects the more limited co-ordination of locking state adopted
   by servers actually sharing a common server scope.

o  Some confusing text regarding changes in server_owner has been
   clarified.

o  The description of some existing errors has been modified to
   explain certain error situations more clearly, reflecting the
   existence of trunking and the possible use of fs-specific grace
   periods.  For details, see Appendix B.3.

o  New descriptions of certain existing operations are provided,
   either because the existing treatment did not account for
   situations that would arise in dealing with transparent state
   migration, or because some types of reclaim issues were not
   adequately dealt with in the context of fs-specific grace
   periods.  For details, see Appendix B.2.

Appendix B.  Changes in this Update

B.1.  Revisions Made to Section 11 of RFC5661

A number of areas needed to be revised or extended, in many cases
replacing existing sub-sections within Section 11 of RFC5661 [65]:

o  New introductory material, including a terminology section,
   replaces the existing material in RFC5661 [65] ranging from the
   start of the existing Section 11 up to and including the existing
   Section 11.1.  The new material starts at the beginning of
   Section 11 and continues through Section 11.2 below.
o  A significant reorganization of the material in the existing
   Sections 11.4 and 11.5 (of RFC5661 [65]) is necessary.  The
   reasons for the reorganization of these sections into a single
   section with multiple subsections are discussed in Appendix B.1.1
   below.  This replacement appears as Section 11.5 below.

   New material relating to the handling of the file system location
   attributes is contained in Sections 11.5.1 and 11.5.7 below.

o  A new section describing requirements for user and group handling
   within a multi-server namespace has been added as Section 11.7.

o  A major replacement for the existing Section 11.7 of RFC5661
   [65], entitled "Effecting File System Transitions", appears as
   Sections 11.9 through 11.14.  The reasons for the reorganization
   of this section into multiple sections are discussed in
   Appendix B.1.2.

o  A replacement for the existing Section 11.10 of RFC5661 [65],
   entitled "The Attribute fs_locations_info", appears as
   Section 11.17, with Appendix B.1.3 describing the differences
   between the new section and the treatment within [65].  A revised
   treatment is necessary because the existing treatment did not
   make clear how the added attribute information relates to the
   case of trunked paths to the same replica.  These issues were not
   addressed in RFC5661 [65], where the concepts of a replica and a
   network path used to access a replica were not clearly
   distinguished.

B.1.1.  Re-organization of Sections 11.4 and 11.5 of RFC5661

Previously, issues related to the fact that multiple location
entries directed the client to the same file system instance were
dealt with in a separate Section 11.5 of RFC5661 [65].  Because of
the new treatment of trunking, these issues now belong within
Section 11.5 below.
In this new section, trunking is dealt with in Section 11.5.2,
together with the other uses of file system location information
described in Sections 11.5.3 through 11.5.6.

As a result, Section 11.5, which replaces Section 11.4 of RFC5661
[65], is substantially different from the section it replaces, in
that some existing sections are replaced by corresponding sections
below while, at the same time, new sections are added, resulting in
a replacement containing some renumbered sections, as follows:

o  The material in Section 11.5, exclusive of subsections, replaces
   the material in Section 11.4 of RFC5661 [65] exclusive of
   subsections.

o  Section 11.5.1 is a new first subsection of the overall section.

o  Section 11.5.2 is a new second subsection of the overall section.

o  Each of the Sections 11.5.4, 11.5.5, and 11.5.6 replaces (in
   order) one of the corresponding Sections 11.4.1, 11.4.2, and
   11.4.3 of RFC5661 [65].

o  Section 11.5.7 is a new final subsection of the overall section.

B.1.2.  Re-organization of Material Dealing with File System
        Transitions

The material relating to file system transitions, previously
contained in Section 11.7 of RFC5661 [65], has been reorganized and
augmented as described below:

o  Because there can be a shift of the network access paths used to
   access a file system instance without any shift between replicas,
   a new Section 11.9 distinguishes between those cases in which
   there is a shift between distinct replicas and those involving a
   shift in network access paths with no shift between replicas.
   As a result, a new Section 11.10 deals with network address
   transitions, while the bulk of the former Section 11.7 (in
   RFC5661 [65]) is extensively modified, as reflected in
   Section 11.11, which is now limited to cases in which there is a
   shift between two different sets of replicas.

o  The additional Section 11.12 discusses the case in which a shift
   to a different replica is made and state is transferred, allowing
   the client continued access to its accumulated locking state on
   the new server.

o  The additional Section 11.13 discusses the client's response to
   access transitions, how it determines whether migration has
   occurred, and how it gets access to any transferred locking and
   session state.

o  The additional Section 11.14 discusses the responsibilities of
   the source and destination servers when transferring locking and
   session state.

This re-organization has caused a renumbering of the sections within
Section 11 of [65], as described below:

o  The new Sections 11.9 and 11.10 have caused the existing sections
   with these numbers to be renumbered.

o  Section 11.7 of [65] is substantially modified and appears as
   Section 11.11.  The necessary modifications reflect the fact that
   this section now deals only with transitions between replicas,
   while transitions between network addresses are dealt with in
   other sections.  Details of the reorganization are described
   later in this section.

o  The additional Sections 11.12, 11.13, and 11.14 have been added.

o  Consequently, Sections 11.8, 11.9, 11.10, and 11.11 of [65] now
   appear as Sections 11.13, 11.14, 11.15, and 11.16, respectively.
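The client decision process that the reorganized sections deal with,
distinguishing a mere shift of network access path from a migration
between replicas, can be sketched as follows.  This is a hypothetical
illustration, not protocol text: the addresses use the documentation
range 192.0.2.0/24, and the same_server() predicate stands in for
trunking detection done via EXCHANGE_ID results.

```python
# Hypothetical sketch of a client's response when access through its
# current address fails (e.g., NFS4ERR_MOVED is returned): consult
# the file system location attribute and decide whether this is a
# shift in network access paths to the same instance (state remains
# valid) or an actual migration between replicas (state may have
# been transparently transferred or may need to be reclaimed).

def respond_to_moved(current_addr, location_entries, same_server):
    """
    location_entries: addresses listed in the location attribute.
    same_server(addr): hypothetical predicate, True when trunking
    detection shows addr reaches the same server instance as the
    one previously used.
    """
    for addr in location_entries:
        if addr != current_addr and same_server(addr):
            # A surviving path to the same instance: shift traffic;
            # sessions and locking state remain usable.
            return ("network-path-shift", addr)
    if location_entries:
        # No path to the old instance survives: migration to a new
        # replica.  The client probes for transparently transferred
        # state and otherwise reclaims within any grace period.
        return ("migration", location_entries[0])
    return ("unavailable", None)
```

For example, if the location attribute lists an address that trunking
detection shows to reach the same server, the client treats the event
as a path shift rather than a migration.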
As part of this general re-organization, Section 11.7 of RFC5661
[65] will be modified as described below:

o  Sections 11.7 and 11.7.1 of RFC5661 [65] are to be replaced by
   Sections 11.11 and 11.11.1, respectively.

o  Section 11.7.2 (and included subsections) of RFC5661 [65] is to
   be deleted.

o  Sections 11.7.3, 11.7.4, 11.7.5, 11.7.5.1, and 11.7.6 of RFC5661
   [65] are to be replaced by Sections 11.11.2, 11.11.3, 11.11.4,
   11.11.4.1, and 11.11.5, respectively, in this document.

o  Section 11.7.7 of RFC5661 [65] is to be replaced by
   Section 11.11.9.  This sub-section has been moved to the end of
   the section dealing with file system transitions.

o  Sections 11.7.8, 11.7.9, and 11.7.10 of RFC5661 [65] are to be
   replaced by Sections 11.11.6, 11.11.7, and 11.11.8, respectively,
   in this document.

B.1.3.  Updates to Treatment of fs_locations_info

Various elements of the fs_locations_info attribute contain
information that applies either to a specific file system replica or
to a network path or set of network paths used to access such a
replica.  The existing treatment of fs_locations_info (in
Section 11.10 of RFC5661 [65]) does not clearly distinguish these
cases, in part because that document did not clearly distinguish
replicas from the paths used to access them.

In addition, special clarification needed to be provided with regard
to the following fields:

o  With regard to the handling of FSLI4GF_GOING, it needs to be made
   clear that this flag applies only to the unavailability of a
   replica, not to a path used to access a replica.

o  In describing the appropriate value for a server to use for
   fli_valid_for, it needs to be made clear that there is no need
   for the client to fetch the fs_locations_info value frequently in
   order to be prepared for shifts in trunking patterns.
o  Clarification of the rules for extensions to the fls_info field
   needs to be provided.  The existing treatment reflects the
   extension model in effect at the time RFC5661 [65] was written,
   and needed to be updated in accordance with the extension model
   described in RFC8178 [66].

B.2.  Revisions Made to Operations in RFC5661

Revised descriptions were needed to address issues that arose in
effecting necessary changes to multi-server namespace features.

o  The existing treatment of EXCHANGE_ID (in Section 18.35 of
   RFC5661 [65]) assumes that client IDs cannot be created/confirmed
   other than by the EXCHANGE_ID and CREATE_SESSION operations.
   Also, the necessary use of EXCHANGE_ID in recovery from migration
   and related situations is not addressed clearly.  A revised
   treatment of EXCHANGE_ID is necessary; it appears in
   Section 18.35, while the specific differences between it and the
   treatment within [65] are explained in Appendix B.2.1 below.

o  The existing treatment of RECLAIM_COMPLETE (in Section 18.51 of
   RFC5661 [65]) is not sufficiently clear about the purpose and use
   of rca_one_fs and about how the server is to deal with
   inappropriate values of this argument.  Because the resulting
   confusion raises interoperability issues, a new treatment of
   RECLAIM_COMPLETE is necessary; it appears in Section 18.51 below,
   while the specific differences between it and the treatment
   within RFC5661 [65] are discussed in Appendix B.2.2 below.  In
   addition, the definitions of the reclaim-related errors receive
   an updated treatment in Section 15.1.9 to reflect the fact that
   there are multiple contexts for lock reclaim operations.

B.2.1.
Revision to Treatment of EXCHANGE_ID

There are a number of issues in the original treatment of
EXCHANGE_ID (in RFC5661 [65]) that cause problems for Transparent
State Migration and for the transfer of access between different
network access paths to the same file system instance.

These issues arise from the fact that this treatment was written:

o  Assuming that a client ID can only become known to a server by
   having been created by executing an EXCHANGE_ID, with
   confirmation of the ID only possible by execution of a
   CREATE_SESSION.

o  Considering the interactions between a client and a server as
   occurring only on a single network address.

As these assumptions have become invalid in the context of
Transparent State Migration and active use of trunking, the
treatment has been modified in several respects.

o  It had been assumed that an EXCHANGE_ID executed when the server
   is already aware of a given client instance must be either
   updating associated parameters (e.g., with respect to callbacks)
   or a lingering retransmission to deal with a previously lost
   reply.  As a result, any slot sequence returned by that operation
   would be of no use.  The existing treatment goes so far as to say
   that it "MUST NOT" be used, although this usage is not in accord
   with [1].

   This created a difficulty when an EXCHANGE_ID is done after
   Transparent State Migration, since that slot sequence would need
   to be used in a subsequent CREATE_SESSION.

   In the updated treatment, CREATE_SESSION is one way that client
   IDs are confirmed, but it is understood that other ways are
   possible.  The slot sequence can be used as needed, and cases in
   which it would be of no use are appropriately noted.
o  It was assumed that the only functions of EXCHANGE_ID were to
   inform the server of the client, create the client ID, and
   communicate it to the client.  When multiple simultaneous
   connections are involved, as often happens with trunking, that
   treatment was inadequate in that it ignored the role of
   EXCHANGE_ID in associating the client ID with the connection on
   which it was done, so that it could be used by a subsequent
   CREATE_SESSION, whose parameters do not include an explicit
   client ID.

   The new treatment explicitly discusses the role of EXCHANGE_ID in
   associating the client ID with the connection, so that it can be
   used by CREATE_SESSION, and in associating a connection with an
   existing session.

The new treatment can be found in Section 18.35 above.  It
supersedes the treatment in Section 18.35 of RFC5661 [65].

B.2.2.  Revision to Treatment of RECLAIM_COMPLETE

The following changes were made to the treatment of RECLAIM_COMPLETE
in RFC5661 [65] to arrive at the treatment in Section 18.51.

o  In a number of places, the text is made more explicit about the
   purpose of rca_one_fs and its connection to file system
   migration.

o  There is a discussion of situations in which particular forms of
   RECLAIM_COMPLETE would need to be done.

o  There is a discussion of interoperability issues among
   implementations that may have arisen due to the lack of clarity
   of the previous treatment of RECLAIM_COMPLETE.

B.3.  Revisions Made to Error Definitions in RFC5661

The new handling of various situations required revisions of some
existing error definitions:

o  Because of the need to appropriately address trunking-related
   issues, some uses of the term "replica" in RFC5661 [65] have
   become problematic, since a shift in network access paths was
   considered to be a shift to a different replica.
   As a result, the existing definition of NFS4ERR_MOVED (in
   Section 15.1.2.4 of RFC5661 [65]) needs to be updated to reflect
   the different handling of unavailability of a particular fs via a
   specific network address.

   Since such a situation is no longer considered to constitute
   unavailability of a file system instance, the description needs
   to change even though the set of circumstances in which the error
   is to be returned remains the same.  The new paragraph explicitly
   recognizes that a different network address might be used,
   whereas the previous description misleadingly treated this as a
   shift between two replicas when only a single file system
   instance might be involved.  The updated description appears in
   Section 15.1.2.4 below.

o  Because of the need to accommodate the use of fs-specific grace
   periods, it is necessary to clarify some of the definitions of
   reclaim-related errors in Section 15 of RFC5661 [65], so that the
   text applies properly to reclaims for all types of grace periods.
   The updated descriptions appear within Section 15.1.9 below.

o  Because of the need to provide the clarifications in errata
   report 2006 [63] and to adapt them to properly explain the
   interaction of NFS4ERR_DELAY with the replay cache, a revised
   description of NFS4ERR_DELAY appears in Section 15.1.1.3.  This
   errata report, unlike many other RFC5661 errata reports, is
   addressed in this document because of the extensive use of
   NFS4ERR_DELAY in connection with state migration and session
   migration.

B.4.
Other Revisions Made to RFC5661

Besides the major reworking of Section 11 and the associated
revisions to existing operations and errors, there are a number of
related changes that are necessary:

o  The summary that appeared in Section 1.7.3.3 of RFC5661 [65] was
   revised to reflect the changes made in the revised Section 11
   above.  The updated summary appears as Section 1.8.3.3 above.

o  The discussion of server scope which appeared in Section 2.10.4
   of RFC5661 [65] needed to be replaced, since the previous text
   appears to require a level of inter-server co-ordination
   incompatible with its basic function of avoiding the need for a
   globally uniform means of assigning server_owner values.  A
   revised treatment appears in Section 2.10.4.

o  The discussion of trunking which appeared in Section 2.10.5 of
   RFC5661 [65] needed to be revised, to more clearly explain the
   multiple types of trunking support and how the client can be made
   aware of the existing trunking configuration.  In addition, while
   the last paragraph (exclusive of sub-sections) of that section,
   dealing with server_owner changes, is literally true, it has been
   a source of confusion.  Since the existing paragraph can be read
   as suggesting that such changes be dealt with non-disruptively,
   the issue needs to be clarified in the revised section, which
   appears in Section 2.10.5.

Appendix C.  Security Issues that Need to be Addressed

The following issues in the treatment of security within the NFSv4.1
specification need to be addressed:

o  The Security Considerations section of RFC5661 [65] is not
   written in accord with RFC3552 [71] (also BCP 72).  Of particular
   concern is the fact that the section does not contain a threat
   analysis.
o  Initial analysis of the existing security issues with NFSv4.1
   suggests that a revised Security Considerations section for the
   existing protocol (one containing a threat analysis) would be
   likely to conclude that NFSv4.1 does not meet the goal of secure
   use on the internet.

The Security Considerations section of this document (Section 21)
has not been thoroughly revised to correct the difficulties
mentioned above.  Instead, it has been modified to take proper
account of issues related to the multi-server namespace features
discussed in Section 11, leaving the incomplete discussion and
security weaknesses pretty much as they were.

The following major security issues need to be addressed in a
satisfactory fashion before an updated Security Considerations
section can be published as part of a bis document for NFSv4.1:

o  The continued use of AUTH_SYS and the security exposures it
   creates need to be addressed.  Addressing this issue must not be
   limited to the questions of whether the designation of AUTH_SYS
   as OPTIONAL was justified and whether it should be changed.

   In any event, it may not be possible, at this point, to correct
   the security problems created by continued use of AUTH_SYS simply
   by revising this designation.

o  The lack of attention within the protocol to the possibility of
   pervasive monitoring attacks, such as those described in RFC7258
   [70] (also BCP 188), needs to be addressed.

   In that connection, the use of CREATE_SESSION without privacy
   protection needs to be addressed, as it exposes the session ID to
   view by an attacker.
   This is worrisome, as this is precisely the type of protocol
   artifact alluded to in RFC7258, which can enable further mischief
   on the part of the attacker: it enables denial-of-service attacks
   which can be executed effectively with only a single, normally
   low-value, credential, even when RPCSEC_GSS authentication is in
   use.

o  The lack of effective use of privacy and integrity, even where
   the infrastructure to support use of RPCSEC_GSS is present, needs
   to be addressed.

   In light of the security exposures that this situation creates,
   it is not enough to define a protocol that could, with the
   provision of sufficient resources, address the problem.  Instead,
   what is needed is a way to provide the necessary security with
   very limited performance costs and without requiring security
   infrastructure that experience has shown is difficult for many
   clients and servers to provide.

In trying to provide a major security upgrade for a deployed
protocol such as NFSv4.1, the working group and the internet
community are likely to find themselves dealing with a number of
considerations such as the following:

o  The need to accommodate existing deployments of existing
   protocols as specified previously in existing Proposed Standards.

o  The difficulty of effecting changes to existing interoperating
   implementations.

o  The difficulty of making changes to NFSv4 protocols other than
   those in the form of OPTIONAL extensions.

o  The tendency of those responsible for existing NFSv4 deployments
   to ignore security flaws in the context of local area networks,
   under the mistaken impression that network isolation provides, in
   and of itself, isolation from all potential attackers.
Given that the difficulties mentioned above apply to minor version
zero as well, it may make sense to deal with these security issues
in a common document applying to all NFSv4 minor versions.  If that
approach is taken, the Security Considerations section of an
eventual NFSv4.1 bis document would reference that common document,
and the defining RFCs for other minor versions might do so as well.

Appendix D.  Acknowledgments

D.1.  Acknowledgments for this Update

The authors wish to acknowledge the important role of Andy Adamson
of Netapp in clarifying the need for trunking discovery
functionality, and exploring the role of the file system location
attributes in providing the necessary support.

The authors wish to thank Tom Haynes of Hammerspace for drawing our
attention to the fact that internationalization and security might
best be handled in documents dealing with such protocol issues as
they apply to all NFSv4 minor versions.

The authors also wish to acknowledge the work of Xuan Qi of Oracle
with NFSv4.1 client and server prototypes of transparent state
migration functionality.

The authors wish to thank others who brought attention to important
issues.  The comments of Trond Myklebust of Primary Data related to
trunking helped to clarify the role of DNS in trunking discovery.
Rick Macklem's comments brought attention to problems in the
handling of the per-fs version of RECLAIM_COMPLETE.

The authors wish to thank Olga Kornievskaia of Netapp for her
helpful review comments.

D.2.  Acknowledgments for RFC5661

The initial text for the SECINFO extensions was edited by Mike
Eisler, with contributions from Peng Dai, Sergey Klyushin, and Carl
Burnett.
The initial text for the SESSIONS extensions was edited by Tom
Talpey, Spencer Shepler, and Jon Bauman, with contributions from
Charles Antonelli, Brent Callaghan, Mike Eisler, John Howard, Chet
Juszczak, Trond Myklebust, Dave Noveck, John Scott, Mike Stolarchuk,
and Mark Wittle.

Initial text relating to multi-server namespace features, including
the concept of referrals, was contributed by Dave Noveck, Carl
Burnett, and Charles Fan, with contributions from Ted Anderson, Neil
Brown, and Jon Haswell.

The initial text for the Directory Delegations support was
contributed by Saadia Khan, with input from Dave Noveck, Mike
Eisler, Carl Burnett, Ted Anderson, and Tom Talpey.

The initial text for the ACL explanations was contributed by Sam
Falkner and Lisa Week.

The pNFS work was inspired by the NASD and OSD work done by Garth
Gibson.  Gary Grider has also been a champion of high-performance
parallel I/O.  Garth Gibson and Peter Corbett started the pNFS
effort with a problem statement document for the IETF that formed
the basis for the pNFS work in NFSv4.1.

The initial text for the parallel NFS support was edited by Brent
Welch and Garth Goodson.  Additional authors for those documents
were Benny Halevy, David Black, and Andy Adamson.  Additional input
came from the informal group that contributed to the construction of
the initial pNFS drafts; specific acknowledgment goes to Gary
Grider, Peter Corbett, Dave Noveck, Peter Honeyman, and Stephen
Fridella.

Fredric Isaman found several errors in draft versions of the ONC RPC
XDR description of the NFSv4.1 protocol.

Audrey Van Belleghem provided, in numerous ways, essential
co-ordination and management of the process of editing the
specification documents.

Richard Jernigan gave feedback on the file layout's striping pattern
design.
Several formal inspection teams were formed to review various areas
of the protocol.  All of the inspections found significant errors
and room for improvement.  NFSv4.1's inspection teams were:

o  ACLs, with the following inspectors: Sam Falkner, Bruce Fields,
   Rahul Iyer, Saadia Khan, Dave Noveck, Lisa Week, Mario Wurzl, and
   Alan Yoder.

o  Sessions, with the following inspectors: William Brown, Tom
   Doeppner, Robert Gordon, Benny Halevy, Fredric Isaman, Rick
   Macklem, Trond Myklebust, Dave Noveck, Karen Rochford, John
   Scott, and Peter Shah.

o  Initial pNFS inspection, with the following inspectors: Andy
   Adamson, David Black, Mike Eisler, Marc Eshel, Sam Falkner, Garth
   Goodson, Benny Halevy, Rahul Iyer, Trond Myklebust, Spencer
   Shepler, and Lisa Week.

o  Global namespace, with the following inspectors: Mike Eisler, Dan
   Ellard, Craig Everhart, Fredric Isaman, Trond Myklebust, Dave
   Noveck, Theresa Raj, Spencer Shepler, Renu Tewari, and Robert
   Thurlow.

o  NFSv4.1 file layout type, with the following inspectors: Andy
   Adamson, Marc Eshel, Sam Falkner, Garth Goodson, Rahul Iyer,
   Trond Myklebust, and Lisa Week.

o  NFSv4.1 locking and directory delegations, with the following
   inspectors: Mike Eisler, Pranoop Erasani, Robert Gordon, Saadia
   Khan, Eric Kustarz, Dave Noveck, Spencer Shepler, and Amy Weaver.

o  EXCHANGE_ID and DESTROY_CLIENTID, with the following inspectors:
   Mike Eisler, Pranoop Erasani, Robert Gordon, Benny Halevy,
   Fredric Isaman, Saadia Khan, Ricardo Labiaga, Rick Macklem, Trond
   Myklebust, Spencer Shepler, and Brent Welch.
o  Final pNFS inspection, with the following inspectors: Andy
   Adamson, Mike Eisler, Mark Eshel, Sam Falkner, Jason Glasgow,
   Garth Goodson, Robert Gordon, Benny Halevy, Dean Hildebrand,
   Rahul Iyer, Suchit Kaura, Trond Myklebust, Anatoly Pinchuk,
   Spencer Shepler, Renu Tewari, Lisa Week, and Brent Welch.

A review team worked together to generate the tables of assignments
of error sets to operations and to make sure that each such
assignment had two or more people validating it.  Participating in
the process were Andy Adamson, Mike Eisler, Sam Falkner, Garth
Goodson, Robert Gordon, Trond Myklebust, Dave Noveck, Spencer
Shepler, Tom Talpey, Amy Weaver, and Lisa Week.

Jari Arkko, David Black, Scott Bradner, Lisa Dusseault, Lars Eggert,
Chris Newman, and Tim Polk provided valuable review and guidance.

Olga Kornievskaia found several errors in the SSV specification.

Ricardo Labiaga found several places where the use of RPCSEC_GSS was
underspecified.

Those who provided miscellaneous comments include: Andy Adamson,
Sunil Bhargo, Alex Burlyga, Pranoop Erasani, Bruce Fields, Vadim
Finkelstein, Jason Goldschmidt, Vijay K. Gurbani, Sergey Klyushin,
Ricardo Labiaga, James Lentini, Anshul Madan, Daniel Muntz, Daniel
Picken, Archana Ramani, Jim Rees, Mahesh Siddheshwar, Tom Talpey,
and Peter Varga.

Authors' Addresses

David Noveck (editor)
NetApp
1601 Trapelo Road, Suite 16
Waltham, MA 02451
USA

Phone: +1-781-768-5347
EMail: dnoveck@netapp.com

Charles Lever
Oracle Corporation
1015 Granger Avenue
Ann Arbor, MI 48104
United States of America

Phone: +1 248 614 5091
EMail: chuck.lever@oracle.com