2 NFSv4 B. Halevy 3 Internet-Draft 4 Intended status: Standards Track T.
Haynes 5 Expires: November 4, 2018 Hammerspace 6 May 03, 2018 8 Parallel NFS (pNFS) Flexible File Layout 9 draft-ietf-nfsv4-flex-files-19.txt 11 Abstract 13 The Parallel Network File System (pNFS) allows a separation between 14 the metadata (onto a metadata server) and data (onto a storage 15 device) for a file. The flexible file layout type is defined in this 16 document as an extension to pNFS that allows the use of storage 17 devices in a fashion that requires only a limited 18 degree of interaction with the metadata server, using already-existing 19 protocols. Client-side mirroring is also added to provide 20 replication of files. 22 Status of This Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on November 4, 2018. 39 Copyright Notice 41 Copyright (c) 2018 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document.
Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 57 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 4 58 1.2. Requirements Language . . . . . . . . . . . . . . . . . . 6 59 2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 60 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 7 61 2.2. Fencing Clients from the Storage Device . . . . . . . . . 7 62 2.2.1. Implementation Notes for Synthetic uids/gids . . . . 8 63 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 9 64 2.3. State and Locking Models . . . . . . . . . . . . . . . . 10 65 2.3.1. Loosely Coupled Locking Model . . . . . . . . . . . . 10 66 2.3.2. Tightly Coupled Locking Model . . . . . . . . . . . . 12 67 3. XDR Description of the Flexible File Layout Type . . . . . . 13 68 3.1. Code Components Licensing Notice . . . . . . . . . . . . 14 69 4. Device Addressing and Discovery . . . . . . . . . . . . . . . 15 70 4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 16 71 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 17 72 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 18 73 5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 19 74 5.1.1. Error Codes from LAYOUTGET . . . . . . . . . . . . . 22 75 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 23 76 5.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 23 77 5.3. Interactions Between Devices and Layouts . . . . . . . . 23 78 5.4. Handling Version Errors . . . . . . . . . . . . . . . . . 23 79 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 24 80 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 24 81 8. Mirroring . . . . . . . . . . . . 
. . . . . . . . . . . . . . 25 82 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 26 83 8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 26 84 8.2.1. Single Storage Device Updates Mirrors . . . . . . . . 26 85 8.2.2. Client Updates All Mirrors . . . . . . . . . . . . . 27 86 8.2.3. Handling Write Errors . . . . . . . . . . . . . . . . 27 87 8.2.4. Handling Write COMMITs . . . . . . . . . . . . . . . 28 88 8.3. Metadata Server Resilvering of the File . . . . . . . . . 28 89 9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 28 90 9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 30 91 9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 30 92 9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 30 93 9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 31 94 9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 32 95 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 32 96 9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 33 98 10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 34 99 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 34 100 12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 34 101 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 35 102 13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 35 103 13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 36 104 14. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 37 105 15. Security Considerations . . . . . . . . . . . . . . . . . . . 37 106 15.1. RPCSEC_GSS and Security Services . . . . . . . . . . . . 38 107 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 38 108 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 39 109 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 110 17. References . . . . . . . . . . . . . . . . . . . . . . . . . 39 111 17.1. Normative References . . 
. . . . . . . . . . . . . . . . 39 112 17.2. Informative References . . . . . . . . . . . . . . . . . 41 113 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 41 114 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 41 115 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 42 117 1. Introduction 119 In the parallel Network File System (pNFS), the metadata server 120 returns layout type structures that describe where file data is 121 located. There are different layout types for different storage 122 systems and methods of arranging data on storage devices. This 123 document defines the flexible file layout type used with file-based 124 data servers that are accessed using the Network File System (NFS) 125 protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661], and 126 NFSv4.2 [RFC7862]. 128 To provide a global state model equivalent to that of the files 129 layout type, a back-end control protocol might be implemented between 130 the metadata server and NFSv4.1+ storage devices. This document does 131 not provide a standards-track control protocol. An implementation can 132 either define its own mechanism or define a control protocol 133 in a standards-track document. The requirements for a control 134 protocol are specified in [RFC5661] and clarified in [pNFSLayouts]. 136 The control protocol described in this document is based on NFS. The 137 storage devices are configured such that the metadata server has full 138 access rights to the data file system and then the metadata server 139 uses synthetic ids to control client access to individual files. 141 In traditional mirroring of data, the server is responsible for 142 replicating, validating, and repairing copies of the data file. With 143 client-side mirroring, the metadata server provides a layout which 144 presents the available mirrors to the client.
It is then the client 145 which picks a mirror to read from and ensures that all writes go to 146 all mirrors. Only if all mirrors are successfully updated, does the 147 client consider the write transaction to have succeeded. In case of 148 error, the client can use the LAYOUTERROR operation to inform the 149 metadata server, which is then responsible for the repairing of the 150 mirrored copies of the file. 152 1.1. Definitions 154 control communication requirements: is the specification for 155 information on layouts, stateids, file metadata, and file data 156 which must be communicated between the metadata server and the 157 storage devices. There is a separate set of requirements for each 158 layout type. 160 control protocol: is the particular mechanism that an implementation 161 of a layout type would use to meet the control communication 162 requirement for that layout type. This need not be a protocol as 163 normally understood. In some cases the same protocol may be used 164 as a control protocol and storage protocol. 166 client-side mirroring: is a feature in which the client and not the 167 server is responsible for updating all of the mirrored copies of a 168 layout segment. 170 (file) data: is that part of the file system object which contains 171 the content. 173 data server (DS): is another term for storage device. 175 fencing: is the process by which the metadata server prevents the 176 storage devices from processing I/O from a specific client to a 177 specific file. 179 file layout type: is a layout type in which the storage devices are 180 accessed via the NFS protocol (see Section 13 of [RFC5661]). 182 gid: is the group id, a numeric value which identifies to which 183 group a file belongs. 185 layout: is the information a client uses to access file data on a 186 storage device. This information will include specification of 187 the protocol (layout type) and the identity of the storage devices 188 to be used. 
190 layout iomode: is a grant of either read or read/write I/O to the 191 client. 193 layout segment: is a sub-division of a layout. That sub-division 194 might be by the layout iomode (see Sections 3.3.20 and 12.2.9 of 195 [RFC5661]), a striping pattern (see Section 13.3 of [RFC5661]), or 196 requested byte range. 198 layout stateid: is a 128-bit quantity returned by a server that 199 uniquely defines the layout state provided by the server for a 200 specific layout that describes a layout type and file (see 201 Section 12.5.2 of [RFC5661]). Further, Section 12.5.3 describes 202 differences in handling between layout stateids and other stateid 203 types. 205 layout type: is a specification of both the storage protocol used to 206 access the data and the aggregation scheme used to lay out the 207 file data on the underlying storage devices. 209 loose coupling: is when the control protocol is a storage protocol. 211 (file) metadata: is that part of the file system object which 212 describes the object and not the content. E.g., it could be the 213 time since last modification, access, etc. 215 metadata server (MDS): is the pNFS server which provides metadata 216 information for a file system object. It also is responsible for 217 generating, recalling, and revoking layouts for file system 218 objects, for performing directory operations, and for performing 219 I/O operations to regular files when the clients direct these to 220 the metadata server itself. 222 mirror: is a copy of a layout segment. Note that if one copy of the 223 mirror is updated, then all copies must be updated. 225 recalling a layout: is when the metadata server uses a back channel 226 to inform the client that the layout is to be returned in a 227 graceful manner. Note that the client has the opportunity to 228 flush any writes, etc., before replying to the metadata server.
230 revoking a layout: is when the metadata server invalidates the 231 layout such that neither the metadata server nor any storage 232 device will accept any access from the client with that layout. 234 resilvering: is the act of rebuilding a mirrored copy of a layout 235 segment from a known good copy of the layout segment. Note that 236 this can also be done to create a new mirrored copy of the layout 237 segment. 239 rsize: is the data transfer buffer size used for reads. 241 stateid: is a 128-bit quantity returned by a server that uniquely 242 defines the open and locking states provided by the server for a 243 specific open-owner or lock-owner/open-owner pair for a specific 244 file and type of lock. 246 storage device: is the target to which clients may direct I/O 247 requests when they hold an appropriate layout. See Section 2.1 of 248 [pNFSLayouts] for further discussion of the difference between a 249 data store and a storage device. 251 storage protocol: is the protocol used by clients to do I/O 252 operations to the storage device. Each layout type specifies the 253 set of storage protocols. 255 tight coupling: is an arrangement in which the control protocol is 256 one designed specifically for that purpose. It may be either a 257 proprietary protocol, adapted specifically to a particular 258 metadata server, or one based on a standards-track document. 260 uid: is the user id, a numeric value which identifies which user 261 owns a file. 263 wsize: is the data transfer buffer size used for writes. 265 1.2. Requirements Language 267 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 268 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 269 "OPTIONAL" in this document are to be interpreted as described in BCP 270 14 [RFC2119] [RFC8174] when, and only when, they appear in all 271 capitals, as shown here. 273 2.
Coupling of Storage Devices 275 A server implementation may choose either a loose or tight coupling 276 model between the metadata server and the storage devices. 277 [pNFSLayouts] describes the general problems facing pNFS 278 implementations. This document details how the new Flexible File 279 Layout Type addresses these issues. To implement the tight coupling 280 model, a control protocol has to be defined. As the flex file layout 281 imposes no special requirements on the client, the control protocol 282 will need to provide: 284 (1) for the management of both security and LAYOUTCOMMITs, and, 286 (2) a global stateid model and management of these stateids. 288 When implementing the loose coupling model, the only control protocol 289 will be a version of NFS, with no ability to provide a global stateid 290 model or to prevent clients from using layouts inappropriately. To 291 enable client use in that environment, this document will specify how 292 security, state, and locking are to be managed. 294 2.1. LAYOUTCOMMIT 296 Regardless of the coupling model, the metadata server has the 297 responsibility, upon receiving a LAYOUTCOMMIT (see Section 18.42 of 298 [RFC5661]), of ensuring that the semantics of pNFS are respected (see 299 Section 3.1 of [pNFSLayouts]). These include a requirement that 300 data written to the storage device be stable before the occurrence 301 of the LAYOUTCOMMIT. 303 It is the responsibility of the client to make sure the data file is 304 stable before the metadata server begins to query the storage devices 305 about the changes to the file. If any WRITE to a storage device did 306 not result in stable_how equal to FILE_SYNC, a LAYOUTCOMMIT to the 307 metadata server MUST be preceded by a COMMIT to the storage devices 308 written to. Note that if the client has not done a COMMIT to the 309 storage device, then the LAYOUTCOMMIT might not be synchronized to 310 the last WRITE operation to the storage device. 312 2.2.
Fencing Clients from the Storage Device 314 With loosely coupled storage devices, the metadata server uses 315 synthetic uids (user ids) and gids (group ids) for the data file, 316 where the uid owner of the data file is allowed read/write access and 317 the gid owner is allowed read only access. As part of the layout 318 (see ffds_user and ffds_group in Section 5.1), the client is provided 319 with the user and group to be used in the Remote Procedure Call (RPC) 320 [RFC5531] credentials needed to access the data file. Fencing off of 321 clients is achieved by the metadata server changing the synthetic uid 322 and/or gid owners of the data file on the storage device to 323 implicitly revoke the outstanding RPC credentials. A client 324 presenting the wrong credential for the desired access will get an 325 NFS4ERR_ACCESS error. 327 With this loosely coupled model, the metadata server is not able to 328 fence off a single client; it is forced to fence off all clients. 329 However, as the other clients react to the fencing, returning their 330 layouts and trying to get new ones, the metadata server can hand out 331 a new uid and gid to allow access. 333 It is RECOMMENDED to implement common access control methods at the 334 storage device filesystem to allow only the metadata server root 335 (super user) access to the storage device, and to set the owner of 336 all directories holding data files to the root user. This approach 337 provides a practical model to enforce access control and fence off 338 cooperative clients, but it cannot protect against malicious 339 clients; hence it provides a level of security equivalent to 340 AUTH_SYS. It is RECOMMENDED that the communication between the 341 metadata server and storage device be secure from eavesdroppers and 342 man-in-the-middle protocol tampering.
The security measure could be 343 due to physical security (e.g., the servers are co-located in a 344 physically secure area), encrypted communications, or some other 345 technique. 347 With tightly coupled storage devices, the metadata server sets the 348 user and group owners, mode bits, and ACL of the data file to be the 349 same as the metadata file. And the client must authenticate with the 350 storage device and go through the same authorization process it would 351 go through via the metadata server. In the case of tight coupling, 352 fencing is the responsibility of the control protocol and is not 353 described in detail here. However, implementations of the tight 354 coupling locking model (see Section 2.3) will need a way to prevent 355 access by certain clients to specific files by invalidating the 356 corresponding stateids on the storage device. In such a scenario, 357 the client will be given an error of NFS4ERR_BAD_STATEID. 359 The client need not know the model used between the metadata server 360 and the storage device. It need only react consistently to any 361 errors in interacting with the storage device. It should both return 362 the layout and error to the metadata server and ask for a new layout. 363 At that point, the metadata server can either hand out a new layout, 364 hand out no layout (forcing the I/O through it), or deny the client 365 further access to the file. 367 2.2.1. Implementation Notes for Synthetic uids/gids 369 The selection method for the synthetic uids and gids to be used for 370 fencing in loosely coupled storage devices is strictly an 371 implementation issue. I.e., an administrator might restrict a range 372 of such ids available to the Lightweight Directory Access Protocol 373 (LDAP) 'uid' field [RFC4519]. She might also be able to choose an id 374 that would never be used to grant access.
Then when the metadata 375 server had a request to access a file, a SETATTR would be sent to the 376 storage device to set the owner and group of the data file. The user 377 and group might be selected in a round robin fashion from the range 378 of available ids. 380 Those ids would be sent back as ffds_user and ffds_group to the 381 client. And it would present them as the RPC credentials to the 382 storage device. When the client was done accessing the file and the 383 metadata server knew that no other client was accessing the file, it 384 could reset the owner and group to restrict access to the data file. 386 When the metadata server wanted to fence off a client, it would 387 change the synthetic uid and/or gid to the restricted ids. Note that 388 using a restricted id ensures that there is a change of owner and at 389 least one id available that never gets allowed access. 391 Under an AUTH_SYS security model, synthetic uids and gids of 0 SHOULD 392 be avoided. These typically either grant super access to files on a 393 storage device or are mapped to an anonymous id. In the first case, 394 even if the data file is fenced, the client might still be able to 395 access the file. In the second case, multiple ids might be mapped to 396 the anonymous ids. 398 2.2.2. Example of using Synthetic uids/gids 400 The user loghyr creates a file "ompha.c" on the metadata server and 401 it creates a corresponding data file on the storage device. 403 The metadata server entry may look like: 405 -rw-r--r-- 1 loghyr staff 1697 Dec 4 11:31 ompha.c 407 On the storage device, it may be assigned some unpredictable 408 synthetic uid/gid to deny access: 410 -rw-r----- 1 19452 28418 1697 Dec 4 11:31 data_ompha.c 412 When the file is opened on a client and accessed, it will try to get 413 a layout for the data file. Since the layout knows nothing about the 414 user (and does not care), whether the user loghyr or garbo opens the 415 file does not matter. 
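As a non-normative illustration, the access check and fencing step described above might be modeled as follows. The uid/gid values 19452, 28418, 19453, and 28419 come from this example; the class and function names are invented for this sketch and do not appear in the protocol.

```python
# Toy model of loosely coupled fencing via synthetic uids/gids.
# The uid owner of the data file gets read/write access; the gid
# owner gets read-only access; any other credential is denied
# with NFS4ERR_ACCESS.

NFS4ERR_ACCESS = "NFS4ERR_ACCESS"

class DataFile:
    def __init__(self, uid, gid):
        self.uid = uid          # synthetic uid owner (read/write)
        self.gid = gid          # synthetic gid owner (read-only)

    def check_access(self, cred_uid, cred_gid, want_write):
        """Return True if the RPC credential allows the access."""
        if cred_uid == self.uid:
            return True                      # uid owner: read/write
        if cred_gid == self.gid and not want_write:
            return True                      # gid owner: read-only
        raise PermissionError(NFS4ERR_ACCESS)

# The layout grants the client the synthetic ids from the example.
f = DataFile(uid=19452, gid=28418)
assert f.check_access(19452, 28418, want_write=True)   # write allowed
assert f.check_access(1067, 28418, want_write=False)   # read-only allowed

# Fencing: the metadata server changes the owners (cf. 19453/28419
# in the example), implicitly revoking the outstanding credentials.
f.uid, f.gid = 19453, 28419
try:
    f.check_access(19452, 28418, want_write=False)
except PermissionError as e:
    assert str(e) == NFS4ERR_ACCESS          # fenced client is denied
```

Note that the fenced client's old credential fails both checks, which is exactly why the new owners must not match any earlier or predictable values.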
The client has to present a uid of 19452 to 416 get write permission. If it presents any other value for the uid, 417 then it must give a gid of 28418 to get read access. 419 Further, if the metadata server decides to fence the file, it should 420 change the uid and/or gid such that these values neither match 421 earlier values for that file nor match a predictable change based on 422 an earlier fencing. 424 -rw-r----- 1 19453 28419 1697 Dec 4 11:31 data_ompha.c 426 The set of synthetic gids on the storage device should be selected 427 such that there is no mapping in any of the name services used by the 428 storage device. I.e., each group should have no members. 430 If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the 431 metadata server should return a synthetic uid that is not set on the 432 storage device. Only the synthetic gid would be valid. 434 The client is thus solely responsible for enforcing file permissions 435 in a loosely coupled model. To allow loghyr write access, it will 436 send an RPC to the storage device with a credential of 19452:28418. To 437 allow garbo read access, it will send an RPC to the storage device 438 with a credential of 1067:28418. The value of the uid does not matter 439 as long as it is not the synthetic uid granted when getting the 440 layout. 442 While pushing the enforcement of permission checking onto the client 443 may seem to weaken security, the client may already be responsible 444 for enforcing permissions before modifications are sent to a server. 445 With cached writes, the client is always responsible for tracking who 446 is modifying a file and making sure to not coalesce requests from 447 multiple users into one request. 449 2.3. State and Locking Models 451 An implementation can always be deployed as a loosely coupled model.
452 There is, however, no way for a storage device to indicate over an NFS 453 protocol that it can definitively participate in a tightly coupled 454 model: 456 o Storage devices implementing the NFSv3 and NFSv4.0 protocols are 457 always treated as loosely coupled. 459 o NFSv4.1+ storage devices that do not return the 460 EXCHGID4_FLAG_USE_PNFS_DS flag in response to EXCHANGE_ID are indicating 461 that they are to be treated as loosely coupled. From the locking 462 viewpoint they are treated in the same way as NFSv4.0 storage 463 devices. 465 o NFSv4.1+ storage devices that do identify themselves with the 466 EXCHGID4_FLAG_USE_PNFS_DS flag in response to EXCHANGE_ID can potentially 467 be tightly coupled. They would use a back-end control protocol to 468 implement the global stateid model as described in [RFC5661]. 470 A storage device would have to either be discovered or advertised 471 over the control protocol to enable a tight coupling model. 473 2.3.1. Loosely Coupled Locking Model 475 When locking-related operations are requested, they are primarily 476 dealt with by the metadata server, which generates the appropriate 477 stateids. When an NFSv4 version is used as the data access protocol, 478 the metadata server may make stateid-related requests of the storage 479 devices. However, it is not required to do so and the resulting 480 stateids are known only to the metadata server and the storage 481 device. 483 Given this basic structure, locking-related operations are handled as 484 follows: 486 o OPENs are dealt with by the metadata server. Stateids are 487 selected by the metadata server and associated with the client id 488 describing the client's connection to the metadata server. The 489 metadata server may need to interact with the storage device to 490 locate the file to be opened, but no locking-related functionality 491 need be used on the storage device. 493 OPEN_DOWNGRADE and CLOSE only require local execution on the 494 metadata server.
496 o Advisory byte-range locks can be implemented locally on the 497 metadata server. As in the case of OPENs, the stateids associated 498 with byte-range locks are assigned by the metadata server and only 499 used on the metadata server. 501 o Delegations are assigned by the metadata server, which initiates 502 recalls when conflicting OPENs are processed. No storage device 503 involvement is required. 505 o TEST_STATEID and FREE_STATEID are processed locally on the 506 metadata server, without storage device involvement. 508 All I/O operations to the storage device are done using the anonymous 509 stateid. Thus the storage device has no information about the 510 openowner and lockowner responsible for issuing a particular I/O 511 operation. As a result: 513 o Mandatory byte-range locking cannot be supported because the 514 storage device has no way of distinguishing I/O done on behalf of 515 the lock owner from that done by others. 517 o Enforcement of share reservations is the responsibility of the 518 client. Even though I/O is done using the anonymous stateid, the 519 client must ensure that it has a valid stateid associated with the 520 openowner that allows the I/O being done, before issuing the I/O. 522 In the event that a stateid is revoked, the metadata server is 523 responsible for preventing client access, since it has no way of 524 being sure that the client is aware that the stateid in question has 525 been revoked. 527 As the client never receives a stateid generated by a storage device, 528 there is no client lease on the storage device and no prospect of 529 lease expiration, even when access is via NFSv4 protocols. Clients 530 will have leases on the metadata server. In dealing with lease 531 expiration, the metadata server may need to use fencing to prevent 532 revoked stateids from being relied upon by a client unaware of the 533 fact that they have been revoked. 535 2.3.2.
Tightly Coupled Locking Model 537 When locking-related operations are requested, they are primarily 538 dealt with by the metadata server, which generates the appropriate 539 stateids. These stateids must be made known to the storage device 540 using control protocol facilities, the details of which are not 541 discussed in this document. 543 Given this basic structure, locking-related operations are handled as 544 follows: 546 o OPENs are dealt with primarily on the metadata server. Stateids 547 are selected by the metadata server and associated with the client 548 id describing the client's connection to the metadata server. The 549 metadata server needs to interact with the storage device to 550 locate the file to be opened, and to make the storage device aware 551 of the association between the metadata-server-chosen stateid and 552 the client and openowner that it represents. 554 OPEN_DOWNGRADE and CLOSE are executed initially on the metadata 555 server, but the state change made must be propagated to the storage 556 device. 558 o Advisory byte-range locks can be implemented locally on the 559 metadata server. As in the case of OPENs, the stateids associated 560 with byte-range locks are assigned by the metadata server and are 561 available for use on the metadata server. Because I/O operations 562 are allowed to present lock stateids, the metadata server needs 563 the ability to make the storage device aware of the association 564 between the metadata-server-chosen stateid and the corresponding 565 open stateid. 567 o Mandatory byte-range locks can be supported when both the metadata 568 server and the storage devices have the appropriate support. As 569 in the case of advisory byte-range locks, these are assigned by 570 the metadata server and are available for use on the metadata 571 server.
To enable mandatory lock enforcement on the storage 572 device, the metadata server needs the ability to make the storage 573 device aware of the association between the metadata-server-chosen 574 stateid and the client, openowner, and lock (i.e., lockowner, 575 byte-range, lock-type) that it represents. Because I/O operations 576 are allowed to present lock stateids, this information needs to be 577 propagated to all storage devices to which I/O might be directed 578 rather than only to the storage devices that contain the locked region. 580 o Delegations are assigned by the metadata server, which initiates 581 recalls when conflicting OPENs are processed. Because I/O 582 operations are allowed to present delegation stateids, the 583 metadata server requires the ability to make the storage device 584 aware of the association between the metadata-server-chosen 585 stateid and the filehandle and delegation type it represents, and 586 to break such an association. 588 o TEST_STATEID is processed locally on the metadata server, without 589 storage device involvement. 591 o FREE_STATEID is processed on the metadata server, but the metadata 592 server requires the ability to propagate the request to the 593 corresponding storage devices. 595 Because the client will possess and use stateids valid on the storage 596 device, there will be a client lease on the storage device and the 597 possibility of lease expiration does exist. The best approach for 598 the storage device is to retain these locks as a courtesy. However, 599 if it does not do so, control protocol facilities need to provide the 600 means to synchronize lock state between the metadata server and 601 storage device. 603 Clients will also have leases on the metadata server, which are 604 subject to expiration. In dealing with lease expiration, the 605 metadata server would be expected to use control protocol facilities 606 enabling it to invalidate revoked stateids on the storage device.
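As a non-normative sketch of the tightly coupled flow above, the propagation and invalidation of metadata-server-chosen stateids might be modeled as follows. The class and method names are invented for this illustration; a real control protocol carrying these associations is out of scope for this document.

```python
# Toy model of the tightly coupled stateid flow: the metadata server
# chooses stateids, uses control-protocol facilities to make each
# storage device aware of them, and invalidates them on FREE_STATEID
# or revocation. I/O with an unknown stateid gets NFS4ERR_BAD_STATEID.

NFS4ERR_BAD_STATEID = "NFS4ERR_BAD_STATEID"

class StorageDevice:
    def __init__(self):
        self.known = {}                 # stateid -> (client, openowner)

    def register(self, stateid, owner):
        self.known[stateid] = owner     # pushed over the control protocol

    def invalidate(self, stateid):
        self.known.pop(stateid, None)   # e.g., on FREE_STATEID or revocation

    def io(self, stateid):
        if stateid not in self.known:
            raise PermissionError(NFS4ERR_BAD_STATEID)
        return "ok"

class MetadataServer:
    def __init__(self, devices):
        self.devices = devices
        self.next_id = 0

    def open(self, owner):
        stateid = self.next_id = self.next_id + 1
        # Propagate the association to every device that might see I/O,
        # not only the devices holding the affected region.
        for d in self.devices:
            d.register(stateid, owner)
        return stateid

    def free_stateid(self, stateid):
        for d in self.devices:
            d.invalidate(stateid)

devs = [StorageDevice(), StorageDevice()]
mds = MetadataServer(devs)
sid = mds.open(("client1", "openowner1"))
assert all(d.io(sid) == "ok" for d in devs)
mds.free_stateid(sid)
try:
    devs[0].io(sid)
except PermissionError as e:
    assert str(e) == NFS4ERR_BAD_STATEID
```

The same invalidation path is what the metadata server would exercise when a lease expires and a non-responsive client must be fenced.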
In the event the client is not responsive, the metadata server may
need to use fencing to prevent revoked stateids from being acted upon
by the storage device.

3.  XDR Description of the Flexible File Layout Type

This document contains the External Data Representation (XDR)
[RFC4506] description of the flexible file layout type.  The XDR
description is embedded in this document in a way that makes it
simple for the reader to extract into a ready-to-compile form.  The
reader can feed this document into the following shell script to
produce the machine-readable XDR description of the flexible file
layout type:

<CODE BEGINS>

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

<CODE ENDS>

That is, if the above script is stored in a file called "extract.sh",
and this document is in a file called "spec.txt", then the reader can
do:

   sh extract.sh < spec.txt > flex_files_prot.x

The effect of the script is to remove leading white space from each
line, plus a sentinel sequence of "///".

The embedded XDR file header follows.  Subsequent XDR descriptions,
with the sentinel sequence, are embedded throughout the document.

Note that the XDR code contained in this document depends on types
from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both NFS
types that end with a 4, such as offset4 and length4, as well as more
generic types such as uint32_t and uint64_t.

3.1.  Code Components Licensing Notice

Both the XDR description and the scripts used for extracting the XDR
description are Code Components as described in Section 4 of "Legal
Provisions Relating to IETF Documents" [LEGAL].  These Code
Components are licensed according to the terms of that document.

<CODE BEGINS>

   /// /*
   ///  * Copyright (c) 2012 IETF Trust and the persons identified
   ///  * as authors of the code.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * o  Redistributions of source code must retain the above
   ///  *    copyright notice, this list of conditions and the
   ///  *    following disclaimer.
   ///  *
   ///  * o  Redistributions in binary form must reproduce the above
   ///  *    copyright notice, this list of conditions and the
   ///  *    following disclaimer in the documentation and/or other
   ///  *    materials provided with the distribution.
   ///  *
   ///  * o  Neither the name of Internet Society, IETF or IETF
   ///  *    Trust, nor the names of specific contributors, may be
   ///  *    used to endorse or promote products derived from this
   ///  *    software without specific prior written permission.
   ///  *
   ///  *    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *    AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *    WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *    FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *    EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *    EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *    NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *    SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *    LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *    OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *    IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *    ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  *
   ///  * This code was derived from RFCTBD10.
   ///  * Please reproduce this note if possible.
   ///  */
   ///
   /// /*
   ///  * flex_files_prot.x
   ///  */
   ///
   /// /*
   ///  * The following include statements are for example only.
   ///  * The actual XDR definition files are generated separately
   ///  * and independently and are likely to have a different name.
   ///  * %#include
   ///  * %#include
   ///  */
   ///

<CODE ENDS>

4.  Device Addressing and Discovery

Data operations to a storage device require the client to know the
network address of the storage device.  The NFSv4.1+ GETDEVICEINFO
operation (Section 18.40 of [RFC5661]) is used by the client to
retrieve that information.

4.1.  ff_device_addr4

The ff_device_addr4 data structure is returned by the server as the
layout-type-specific opaque field da_addr_body in the device_addr4
structure by a successful GETDEVICEINFO operation.

<CODE BEGINS>

   /// struct ff_device_versions4 {
   ///         uint32_t        ffdv_version;
   ///         uint32_t        ffdv_minorversion;
   ///         uint32_t        ffdv_rsize;
   ///         uint32_t        ffdv_wsize;
   ///         bool            ffdv_tightly_coupled;
   /// };
   ///
   /// struct ff_device_addr4 {
   ///         multipath_list4     ffda_netaddrs;
   ///         ff_device_versions4 ffda_versions<>;
   /// };
   ///

<CODE ENDS>

The ffda_netaddrs field is used to locate the storage device.  It
MUST be set by the server to a list holding one or more of the device
network addresses.

The ffda_versions array allows the metadata server to present choices
as to NFS version, minor version, and coupling strength to the
client.  The ffdv_version and ffdv_minorversion represent the NFS
protocol to be used to access the storage device.  This layout
specification defines the semantics for ffdv_versions 3 and 4.  If
ffdv_version equals 3, then the server MUST set ffdv_minorversion to
0 and ffdv_tightly_coupled to false.  The client MUST then access the
storage device using the NFSv3 protocol [RFC1813].
If ffdv_version equals 4, then the server MUST set ffdv_minorversion
to one of the NFSv4 minor version numbers, and the client MUST access
the storage device using NFSv4 with the specified minor version.

Note that while the client might determine that it cannot use any of
the configured combinations of ffdv_version, ffdv_minorversion, and
ffdv_tightly_coupled when it gets the device list from the metadata
server, there is no way for it to indicate to the metadata server
which device it is version incompatible with.  If, however, the
client waits until it retrieves the layout from the metadata server,
it can at that time clearly identify the storage device in question
(see Section 5.4).

The ffdv_rsize and ffdv_wsize are used to communicate the maximum
rsize and wsize supported by the storage device.  As the storage
device can have a different rsize or wsize than the metadata server,
the ffdv_rsize and ffdv_wsize allow the metadata server to
communicate that information on behalf of the storage device.

ffdv_tightly_coupled informs the client as to whether the metadata
server is tightly coupled with the storage devices or not.  Note that
even if the data protocol is at least NFSv4.1, it may still be the
case that there is loose coupling in effect.  If ffdv_tightly_coupled
is not set, then the client MUST commit writes to the storage devices
for the file before sending a LAYOUTCOMMIT to the metadata server.
I.e., the writes MUST be committed by the client to stable storage
via issuing WRITEs with stable_how == FILE_SYNC or by issuing a
COMMIT after WRITEs with stable_how != FILE_SYNC (see Section 3.3.7
of [RFC1813]).

4.2.  Storage Device Multipathing

The flexible file layout type supports multipathing to multiple
storage device addresses.
Storage-device-level multipathing is used
for bandwidth scaling via trunking and for higher availability in the
event of a storage device failure.  Multipathing allows the client to
switch to another storage device address, which may be that of
another storage device that is exporting the same data stripe unit,
without having to contact the metadata server for a new layout.

To support storage device multipathing, ffda_netaddrs contains an
array of one or more storage device network addresses.  This array
(data type multipath_list4) represents a list of storage devices
(each identified by a network address), with the possibility that
some storage device will appear in the list multiple times.

The client is free to use any of the network addresses as a
destination to send storage device requests.  If some network
addresses are less desirable paths to the data than others, then the
metadata server SHOULD NOT include those network addresses in
ffda_netaddrs.  If less desirable network addresses exist to provide
failover, the RECOMMENDED method to offer the addresses is to provide
them in a replacement device-ID-to-device-address mapping, or a
replacement device ID.  When a client finds no response from the
storage device using all addresses available in ffda_netaddrs, it
SHOULD send a GETDEVICEINFO to attempt to replace the existing
device-ID-to-device-address mappings.  If the metadata server detects
that all network paths represented by ffda_netaddrs are unavailable,
the metadata server SHOULD send a CB_NOTIFY_DEVICEID (if the client
has indicated it wants device ID notifications for changed device
IDs) to change the device-ID-to-device-address mappings to the
available addresses.
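The client-side failover behavior described above can be sketched as
follows.  This is a hypothetical helper, not part of the protocol:
the function names and data shapes are invented for illustration, and
the actual GETDEVICEINFO exchange is only indicated.

```python
# Hypothetical sketch: walk the ffda_netaddrs list until an address
# responds; once every address has failed, the draft recommends a
# GETDEVICEINFO to refresh the device-ID-to-device-address mapping.

def next_action(netaddrs, failed):
    """Return ("try", addr) for the next untried address, or
    ("getdeviceinfo", None) once all addresses have failed."""
    for addr in netaddrs:
        if addr not in failed:
            return ("try", addr)
    return ("getdeviceinfo", None)

# Example contents of a multipath_list4 (addresses are made up).
addrs = ["10.0.0.1", "10.0.0.2"]
assert next_action(addrs, set()) == ("try", "10.0.0.1")
assert next_action(addrs, {"10.0.0.1"}) == ("try", "10.0.0.2")
assert next_action(addrs, set(addrs)) == ("getdeviceinfo", None)
```

A real client would also honor CB_NOTIFY_DEVICEID updates, which can
replace the mapping before the client exhausts the list on its own.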
If the device ID itself will be replaced, the
metadata server SHOULD recall all layouts with the device ID, and
thus force the client to get new layouts and device ID mappings via
LAYOUTGET and GETDEVICEINFO.

Generally, if two network addresses appear in ffda_netaddrs, they
will designate the same storage device.  When the storage device is
accessed over NFSv4.1 or a higher minor version, the two storage
device addresses will support the implementation of client ID or
session trunking (the latter is RECOMMENDED) as defined in [RFC5661].
The two storage device addresses will share the same server owner or
major ID of the server owner.  It is not always necessary for the two
storage device addresses to designate the same storage device with
trunking being used.  For example, the data could be read-only, and
the data consist of exact replicas.

5.  Flexible File Layout Type

The layout4 type is defined in [RFC5662] as follows:

<CODE BEGINS>

   enum layouttype4 {
       LAYOUT4_NFSV4_1_FILES   = 1,
       LAYOUT4_OSD2_OBJECTS    = 2,
       LAYOUT4_BLOCK_VOLUME    = 3,
       LAYOUT4_FLEX_FILES      = 4
       [[RFC Editor: please modify the LAYOUT4_FLEX_FILES
         to be the layouttype assigned by IANA]]
   };

   struct layout_content4 {
       layouttype4  loc_type;
       opaque       loc_body<>;
   };

   struct layout4 {
       offset4          lo_offset;
       length4          lo_length;
       layoutiomode4    lo_iomode;
       layout_content4  lo_content;
   };

<CODE ENDS>

This document defines structures associated with the layouttype4
value LAYOUT4_FLEX_FILES.  [RFC5661] specifies the loc_body structure
as an XDR type "opaque".  The opaque layout is uninterpreted by the
generic pNFS client layers, but is interpreted by the flexible file
layout type implementation.  This section defines the structure of
this otherwise opaque value, ff_layout4.

5.1.  ff_layout4

<CODE BEGINS>

   /// const FF_FLAGS_NO_LAYOUTCOMMIT   = 0x00000001;
   /// const FF_FLAGS_NO_IO_THRU_MDS    = 0x00000002;
   /// const FF_FLAGS_NO_READ_IO        = 0x00000004;
   /// const FF_FLAGS_WRITE_ONE_MIRROR  = 0x00000008;
   /// typedef uint32_t            ff_flags4;
   ///
   /// struct ff_data_server4 {
   ///     deviceid4           ffds_deviceid;
   ///     uint32_t            ffds_efficiency;
   ///     stateid4            ffds_stateid;
   ///     nfs_fh4             ffds_fh_vers<>;
   ///     fattr4_owner        ffds_user;
   ///     fattr4_owner_group  ffds_group;
   /// };
   ///
   /// struct ff_mirror4 {
   ///     ff_data_server4     ffm_data_servers<>;
   /// };
   ///
   /// struct ff_layout4 {
   ///     length4             ffl_stripe_unit;
   ///     ff_mirror4          ffl_mirrors<>;
   ///     ff_flags4           ffl_flags;
   ///     uint32_t            ffl_stats_collect_hint;
   /// };
   ///

<CODE ENDS>

The ff_layout4 structure specifies a layout in that portion of the
data file described in the current layout segment.  It is either a
single instance or a set of mirrored copies of that portion of the
data file.  When mirroring is in effect, it protects against loss of
data in layout segments.  Note that while not explicitly shown in the
above XDR, each layout4 element returned in the logr_layout array of
LAYOUTGET4res (see Section 18.43.1 of [RFC5661]) describes a layout
segment.  Hence, each ff_layout4 also describes a layout segment.

It is possible that the file is concatenated from more than one
layout segment.  Each layout segment MAY represent different striping
parameters, applying respectively only to the layout segment byte
range.

The ffl_stripe_unit field is the stripe unit size in use for the
current layout segment.  The number of stripes is given inside each
mirror by the number of elements in ffm_data_servers.  If the number
of stripes is one, then the value for ffl_stripe_unit MUST default to
zero.  The only supported mapping scheme is sparse and is detailed in
Section 6.
Note that there is an assumption here that both the
stripe unit size and the number of stripes are the same across all
mirrors.

The ffl_mirrors field is the array of mirrored storage devices which
provide the storage for the current stripe; see Figure 1.

The ffl_stats_collect_hint field provides a hint to the client on how
often the server wants it to report LAYOUTSTATS for a file.  The time
is in seconds.

                       +-----------+
                       |           |
                       |           |
                       |   File    |
                       |           |
                       |           |
                       +-----+-----+
                             |
                +------------+------------+
                |                         |
           +----+-----+             +-----+----+
           | Mirror 1 |             | Mirror 2 |
           +----+-----+             +-----+----+
                |                         |
           +-----------+            +-----------+
           |+-----------+           |+-----------+
           ||+-----------+          ||+-----------+
           +|| Storage  |           +|| Storage  |
            +| Devices  |            +| Devices  |
             +-----------+            +-----------+

                            Figure 1

The ffl_mirrors field represents an array of state information for
each mirrored copy of the current layout segment.  Each element is
described by a ff_mirror4 type.

ffds_deviceid provides the deviceid of the storage device holding the
data file.

ffds_fh_vers is an array of filehandles of the data file matching the
available NFS versions on the given storage device.  There MUST be
exactly as many elements in ffds_fh_vers as there are in
ffda_versions.  Each element of the array corresponds to a particular
combination of ffdv_version, ffdv_minorversion, and
ffdv_tightly_coupled provided for the device.  The array allows for
server implementations which have different filehandles for different
combinations of version, minor version, and coupling strength.  See
Section 5.4 for how to handle versioning issues between the client
and storage devices.

For tight coupling, ffds_stateid provides the stateid to be used by
the client to access the file.
For loose coupling and an NFSv4
storage device, the client will have to use an anonymous stateid to
perform I/O on the storage device.  With no control protocol, the
metadata server stateid cannot be used to provide a global stateid
model.  Thus, the server MUST set the ffds_stateid to be the
anonymous stateid.

This specification of the ffds_stateid restricts both models for
NFSv4.x storage protocols:

loosely coupled:  the stateid has to be an anonymous stateid,

tightly coupled:  the stateid has to be a global stateid.

A number of issues stem from a mismatch between the fact that
ffds_stateid is defined as a single item while ffds_fh_vers is
defined as an array.  It is possible for each open file on the
storage device to require its own open stateid.  Because there are
established loosely coupled implementations of the version of the
protocol described in this document, such potential issues have not
been addressed here.  It is possible for future layout types to be
defined that address these issues, should it become important to
provide multiple stateids for the same underlying file.

For loosely coupled storage devices, ffds_user and ffds_group provide
the synthetic user and group to be used in the RPC credentials that
the client presents to the storage device to access the data files.
For tightly coupled storage devices, the user and group on the
storage device will be the same as on the metadata server.  I.e., if
ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST
ignore both ffds_user and ffds_group.

The allowed values for both ffds_user and ffds_group are specified in
Section 5.9 of [RFC5661].  For NFSv3 compatibility, user and group
strings that consist of decimal numeric values with no leading zeros
can be given a special interpretation by clients and servers that
choose to provide such support.
The receiver may treat such a user
or group string as representing the same user as would be represented
by an NFSv3 uid or gid having the corresponding numeric value.  Note
that if using Kerberos for security, the expectation is that these
values will be a name@domain string.

ffds_efficiency describes the metadata server's evaluation as to the
effectiveness of each mirror.  Note that this is per layout and not
per device, as the metric may change due to perceived load,
availability to the metadata server, etc.  Higher values denote
higher perceived utility.  The way the client can select the best
mirror to access is discussed in Section 8.1.

ffl_flags is a bitmap that allows the metadata server to inform the
client of particular conditions that may result from the more or less
tight coupling of the storage devices.

FF_FLAGS_NO_LAYOUTCOMMIT:  can be set to indicate that the client is
   not required to send LAYOUTCOMMIT to the metadata server.

FF_FLAGS_NO_IO_THRU_MDS:  can be set to indicate that the client
   should not send I/O operations to the metadata server.  I.e., even
   if the client could determine that there was a network disconnect
   to a storage device, the client should not try to proxy the I/O
   through the metadata server.

FF_FLAGS_NO_READ_IO:  can be set to indicate that the client should
   not send READ requests with the layouts of iomode
   LAYOUTIOMODE4_RW.  Instead, it should request a layout of iomode
   LAYOUTIOMODE4_READ from the metadata server.

FF_FLAGS_WRITE_ONE_MIRROR:  can be set to indicate that the client
   only needs to update one of the mirrors (see Section 8.2).

5.1.1.  Error Codes from LAYOUTGET

[RFC5661] provides little guidance as to how the client is to proceed
with a LAYOUTGET which returns an error of NFS4ERR_LAYOUTTRYLATER,
NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY.
Within the context of this document:

NFS4ERR_LAYOUTUNAVAILABLE:  there is no layout available and the I/O
   is to go to the metadata server.  Note that it is possible to have
   had a layout before a recall and not after.

NFS4ERR_LAYOUTTRYLATER:  there is some issue preventing the layout
   from being granted.  If the client already has an appropriate
   layout, it should continue with I/O to the storage devices.

NFS4ERR_DELAY:  there is some issue preventing the layout from being
   granted.  If the client already has an appropriate layout, it
   should not continue with I/O to the storage devices.

5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS

Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS
flag, the client can still perform I/O to the metadata server.  The
flag functions as a hint.  The flag indicates to the client that the
metadata server prefers to separate the metadata I/O from the data
I/O, most likely for performance reasons.

5.2.  LAYOUTCOMMIT

The flex file layout does not use lou_body.  If lou_type is
LAYOUT4_FLEX_FILES, the lou_body field MUST have a zero length.

5.3.  Interactions Between Devices and Layouts

In [RFC5661], the file layout type is defined such that the
relationship between multipathing and filehandles can result in
either 0, 1, or N filehandles (see Section 13.3).  Some rationales
for this are clustered servers which share the same filehandle or
allowing for multiple read-only copies of the file on the same
storage device.  In the flexible file layout type, while there is an
array of filehandles, they are independent of the multipathing being
used.  If the metadata server wants to provide multiple read-only
copies of the same file on the same storage device, then it should
provide multiple ff_device_addr4, each as a mirror.
The client can
then determine that, since the ffds_fh_vers are different, there are
multiple copies of the file for the current layout segment available.

5.4.  Handling Version Errors

When the metadata server provides the ffda_versions array in the
ff_device_addr4 (see Section 4.1), the client is able to determine
whether it can access a storage device with any of the supplied
combinations of ffdv_version, ffdv_minorversion, and
ffdv_tightly_coupled.  However, due to the limitations of reporting
errors in GETDEVICEINFO (see Section 18.40 of [RFC5661]), the client
is not able to specify which specific device it cannot communicate
with over one of the provided ffdv_version and ffdv_minorversion
combinations.  Using ff_ioerr4 (see Section 9.1.1) inside either the
LAYOUTRETURN (see Section 18.44 of [RFC5661]) or the LAYOUTERROR (see
Section 15.6 of [RFC7862] and Section 10 of this document), the
client can isolate the problematic storage device.

The error code to return for LAYOUTRETURN and/or LAYOUTERROR is
NFS4ERR_MINOR_VERS_MISMATCH.  It does not matter whether the mismatch
is a major version (e.g., client can use NFSv3 but not NFSv4) or
minor version (e.g., client can use NFSv4.1 but not NFSv4.2); the
error indicates that, for all the supplied combinations of
ffdv_version and ffdv_minorversion, the client cannot communicate
with the storage device.  The client can retry the GETDEVICEINFO to
see if the metadata server can provide a different combination, or it
can fall back to doing the I/O through the metadata server.

6.  Striping via Sparse Mapping

While other layout types support both dense and sparse mapping of
logical offsets to physical offsets within a file (see, for example,
Section 13.4 of [RFC5661]), the flexible file layout type only
supports a sparse mapping.
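A rough illustration of the sparse scheme, using the draft's Section 6
notation (L, W, S, N); the helper function and the example values are
hypothetical, not part of the protocol:

```python
# Hypothetical sketch of sparse mapping.  With a sparse layout, the
# logical file offset L is also the physical offset on the storage
# device; only the choice of storage device varies per stripe unit.

def sparse_map(L, stripe_unit, W):
    """Return (stripe number N, storage device index, physical offset)
    for logical offset L, given ffl_stripe_unit and the stripe width W
    (the number of elements in ffm_data_servers)."""
    S = W * stripe_unit           # S: number of bytes in a stripe
    N = L // S                    # N: stripe number
    idx = (L // stripe_unit) % W  # which data server holds this unit
    return N, idx, L              # sparse: physical offset == L

# Example: 1 MiB stripe unit across 4 storage devices.
assert sparse_map(0, 2**20, 4) == (0, 0, 0)
assert sparse_map(5 * 2**20, 2**20, 4) == (1, 1, 5 * 2**20)
```

Because the physical offset equals the logical offset, each storage
device sees holes over the byte ranges that map to the other devices,
as described in Section 13.4.4 of [RFC5661].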

With sparse mappings, the logical offset within a file (L) is also
the physical offset on the storage device.  As detailed in
Section 13.4.4 of [RFC5661], this results in holes across each
storage device which does not contain the current stripe index.

   L: logical offset into the file

   W: stripe width
      W = number of elements in ffm_data_servers

   S: number of bytes in a stripe
      S = W * ffl_stripe_unit

   N: stripe number
      N = L / S

7.  Recovering from Client I/O Errors

The pNFS client may encounter errors when directly accessing the
storage devices.  However, it is the responsibility of the metadata
server to recover from the I/O errors.  When the LAYOUT4_FLEX_FILES
layout type is used, the client MUST report the I/O errors to the
server at LAYOUTRETURN time using the ff_ioerr4 structure (see
Section 9.1.1).

The metadata server analyzes the error and determines the required
recovery operations such as recovering media failures or
reconstructing missing data files.

The metadata server MUST recall any outstanding layouts to allow it
exclusive write access to the stripes being recovered and to prevent
other clients from hitting the same error condition.  In these cases,
the server MUST complete recovery before handing out any new layouts
to the affected byte ranges.

Although the client implementation has the option to propagate a
corresponding error to the application that initiated the I/O
operation and drop any unwritten data, the client should attempt to
retry the original I/O operation by either requesting a new layout or
sending the I/O via regular NFSv4.1+ READ or WRITE operations to the
metadata server.  The client SHOULD attempt to retrieve a new layout
and retry the I/O operation using the storage device first; only if
the error persists should it retry the I/O operation via the metadata
server.

8.  Mirroring

The flexible file layout type has a simple model in place for the
mirroring of the file data constrained by a layout segment.  There is
no assumption that each copy of the mirror is stored identically on
the storage devices.  For example, one device might employ
compression or deduplication on the data.  However, the over-the-wire
transfer of the file contents MUST appear identical.  Note, this is a
constraint of the selected XDR representation in which each mirrored
copy of the layout segment has the same striping pattern (see
Figure 1).

The metadata server is responsible for determining the number of
mirrored copies and the location of each mirror.  While the client
may provide a hint to how many copies it wants (see Section 12), the
metadata server can ignore that hint, and in any event, the client
has no means to dictate either the storage device (which also means
the coupling and/or protocol levels to access the layout segments) or
the location of said storage device.

The updating of mirrored layout segments is done via client-side
mirroring.  With this approach, the client is responsible for making
sure modifications are made on all copies of the layout segments it
is informed of via the layout.  If a layout segment is being
resilvered to a storage device, that mirrored copy will not be in the
layout.  Thus, the metadata server MUST update that copy until the
client is presented it in a layout.  If the FF_FLAGS_WRITE_ONE_MIRROR
is set in ffl_flags, the client need only update one of the mirrors
(see Section 8.2).  If the client is writing to the layout segments
via the metadata server, then the metadata server MUST update all
copies of the mirror.  As seen in Section 8.3, during the
resilvering, the layout is recalled, and the client has to make
modifications via the metadata server.

8.1.  Selecting a Mirror

When the metadata server grants a layout to a client, it MAY let the
client know how fast it expects each mirror to be once the request
arrives at the storage devices via the ffds_efficiency member.  While
the algorithms to calculate that value are left to the metadata
server implementations, factors that could contribute to that
calculation include speed of the storage device, physical memory
available to the device, operating system version, current load, etc.

However, what should not be involved in that calculation is a
perceived network distance between the client and the storage device.
The client is better situated for making that determination based on
past interaction with the storage device over the different available
network interfaces between the two.  I.e., the metadata server might
not know about a transient outage between the client and storage
device because it has no presence on the given subnet.

As such, it is the client which decides which mirror to access for
reading the file.  The requirements for writing to mirrored layout
segments are presented below.

8.2.  Writing to Mirrors

8.2.1.  Single Storage Device Updates Mirrors

If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client
only needs to update one of the copies of the layout segment.  For
this case, the storage device MUST ensure that all copies of the
mirror are updated when any one of the mirrors is updated.  If the
storage device gets an error when updating one of the mirrors, then
it MUST inform the client that the original WRITE had an error.  The
client then MUST inform the metadata server (see Section 8.2.3).  The
client's responsibility with respect to COMMIT is explained in
Section 8.2.4.
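The choice of which mirrors a client must update, per the flag above
and the all-mirrors case of Section 8.2.2, can be sketched as follows.
The helper is hypothetical; only the flag value is taken from the
draft's XDR:

```python
# Hypothetical sketch: which mirrors a client updates for a WRITE.
# With FF_FLAGS_WRITE_ONE_MIRROR set, any single mirror suffices (the
# storage device fans the update out); otherwise, the client itself
# must update every mirror in the layout.

FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008  # from the ff_flags4 constants

def mirrors_to_write(ffl_flags, efficiencies):
    """efficiencies: per-mirror ffds_efficiency values, indexed by
    mirror position in ffl_mirrors."""
    if ffl_flags & FF_FLAGS_WRITE_ONE_MIRROR:
        # Free choice; reusing ffds_efficiency as for reads is one
        # reasonable policy (higher value = higher perceived utility).
        best = max(range(len(efficiencies)), key=lambda i: efficiencies[i])
        return [best]
    return list(range(len(efficiencies)))  # update all mirrors

assert mirrors_to_write(FF_FLAGS_WRITE_ONE_MIRROR, [10, 50, 20]) == [1]
assert mirrors_to_write(0, [10, 50, 20]) == [0, 1, 2]
```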
The client may choose any one of the mirrors and may
use ffds_efficiency in the same manner as for reading when making
this choice.

8.2.2.  Client Updates All Mirrors

If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the
client is responsible for updating all mirrored copies of the layout
segments that it is given in the layout.  A single failed update is
sufficient to fail the entire operation.  If all but one copy is
updated successfully and the last one provides an error, then the
client needs to inform the metadata server, via either LAYOUTRETURN
or LAYOUTERROR, that the update to that storage device failed.  If
the client is updating the mirrors serially, then it SHOULD stop at
the first error encountered and report that to the metadata server.
If the client is updating the mirrors in parallel, then it SHOULD
wait until all storage devices respond such that it can report all
errors encountered during the update.

8.2.3.  Handling Write Errors

When the client reports a write error to the metadata server, the
metadata server is responsible for determining if it wants to remove
the errant mirror from the layout, if the mirror has recovered from
some transient error, etc.  When the client tries to get a new
layout, the metadata server informs it of the decision by the
contents of the layout.  The client MUST NOT make any assumptions
that the contents of the previous layout will match those of the new
one.  If it has updates that were not committed to all mirrors, then
it MUST resend those updates to all mirrors.

There is no provision in the protocol for the metadata server to
directly determine that the client has or has not recovered from an
error.
I.e., assume that the storage device was network partitioned
from the client and all of the copies are successfully updated after
the error was reported.  There is no mechanism for the client to
report that fact, and the metadata server is forced to repair the
file across the mirror.

If the client supports NFSv4.2, it can use LAYOUTERROR and
LAYOUTRETURN to provide hints to the metadata server about the
recovery efforts.  A LAYOUTERROR on a file is for a non-fatal error.
A subsequent LAYOUTRETURN without a ff_ioerr4 indicates that the
client successfully replayed the I/O to all mirrors.  Any
LAYOUTRETURN with a ff_ioerr4 is an error that the metadata server
needs to repair.  The client MUST be prepared for the LAYOUTERROR to
trigger a CB_LAYOUTRECALL if the metadata server determines it needs
to start repairing the file.

8.2.4.  Handling Write COMMITs

When stable writes are done to the metadata server or to a single
replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is
the responsibility of the receiving node to propagate the written
data stably before replying to the client.

In the corresponding cases in which unstable writes are done, the
receiving node does not have any such obligation, although it may
choose to asynchronously propagate the updates.  However, once a
COMMIT is replied to, all replicas must reflect the writes that have
been done, and this data must have been committed to stable storage
on all replicas.

In order to avoid situations in which stale data is read from
replicas to which writes have not been propagated:

o  A client which has outstanding unstable writes made to a single
   node (metadata server or storage device) MUST do all reads from
   that same node.
1308 o When writes are flushed to the server, for example, to implement 1309 close-to-open semantics, a COMMIT must be done by the client to 1310 ensure that up-to-date written data will be available irrespective 1311 of the particular replica read. 1313 8.3. Metadata Server Resilvering of the File 1315 The metadata server may elect to create a new mirror of the layout 1316 segments at any time. This might be to resilver a copy on a storage 1317 device which was down for servicing, to provide a copy of the layout 1318 segments on storage with different performance 1319 characteristics, etc. As the client will not be aware of the new 1320 mirror and the metadata server will not be aware of updates that the 1321 client is making to the layout segments, the metadata server MUST 1322 recall the writable layout segment(s) that it is resilvering. If the 1323 client issues a LAYOUTGET for a writable layout segment which is in 1324 the process of being resilvered, then the metadata server can deny 1325 that request with an NFS4ERR_LAYOUTUNAVAILABLE. The client would then 1326 have to perform the I/O through the metadata server. 1328 9. Flexible Files Layout Type Return 1330 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 1331 layout-type specific information to the server.
It is defined in 1332 Section 18.44.1 of [RFC5661] as follows: 1334 1335 /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ 1336 const LAYOUT4_RET_REC_FILE = 1; 1337 const LAYOUT4_RET_REC_FSID = 2; 1338 const LAYOUT4_RET_REC_ALL = 3; 1340 enum layoutreturn_type4 { 1341 LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, 1342 LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, 1343 LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL 1344 }; 1346 struct layoutreturn_file4 { 1347 offset4 lrf_offset; 1348 length4 lrf_length; 1349 stateid4 lrf_stateid; 1350 /* layouttype4 specific data */ 1351 opaque lrf_body<>; 1352 }; 1354 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 1355 case LAYOUTRETURN4_FILE: 1356 layoutreturn_file4 lr_layout; 1357 default: 1358 void; 1359 }; 1361 struct LAYOUTRETURN4args { 1362 /* CURRENT_FH: file */ 1363 bool lora_reclaim; 1364 layoutreturn_stateid lora_recallstateid; 1365 layouttype4 lora_layout_type; 1366 layoutiomode4 lora_iomode; 1367 layoutreturn4 lora_layoutreturn; 1368 }; 1370 1372 If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the 1373 lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value 1374 is defined by ff_layoutreturn4 (see Section 9.3). It allows the 1375 client to report I/O error information or layout usage statistics 1376 back to the metadata server as defined below. Note that while the 1377 data structures are built on concepts introduced in NFSv4.2, the 1378 effective discriminated union (lora_layout_type combined with 1379 ff_layoutreturn4) allows an NFSv4.1 metadata server to utilize the 1380 data. 1382 9.1. I/O Error Reporting 1384 9.1.1.
ff_ioerr4 1386 1388 /// struct ff_ioerr4 { 1389 /// offset4 ffie_offset; 1390 /// length4 ffie_length; 1391 /// stateid4 ffie_stateid; 1392 /// device_error4 ffie_errors<>; 1393 /// }; 1394 /// 1396 1398 Recall that [RFC7862] defines device_error4 as: 1400 1402 struct device_error4 { 1403 deviceid4 de_deviceid; 1404 nfsstat4 de_status; 1405 nfs_opnum4 de_opnum; 1406 }; 1408 1410 The ff_ioerr4 structure is used to return error indications for data 1411 files that generated errors during data transfers. These are hints 1412 to the metadata server that there are problems with that file. For 1413 each error, ffie_errors.de_deviceid, ffie_offset, and ffie_length 1414 represent the storage device and byte range within the file in which 1415 the error occurred; ffie_errors represents the operation and type of 1416 error. The use of device_error4 is described in Section 15.6 of 1417 [RFC7862]. 1419 Even though the storage device might be accessed via NFSv3 and 1420 reports back NFSv3 errors to the client, the client is responsible 1421 for mapping these to appropriate NFSv4 status codes as de_status. 1422 Likewise, the NFSv3 operations need to be mapped to equivalent NFSv4 1423 operations. 1425 9.2. Layout Usage Statistics 1426 9.2.1. ff_io_latency4 1428 1430 /// struct ff_io_latency4 { 1431 /// uint64_t ffil_ops_requested; 1432 /// uint64_t ffil_bytes_requested; 1433 /// uint64_t ffil_ops_completed; 1434 /// uint64_t ffil_bytes_completed; 1435 /// uint64_t ffil_bytes_not_delivered; 1436 /// nfstime4 ffil_total_busy_time; 1437 /// nfstime4 ffil_aggregate_completion_time; 1438 /// }; 1439 /// 1441 1443 Both operation counts and bytes transferred are kept in the 1444 ff_io_latency4. As seen in ff_layoutupdate4 (See Section 9.2.2) read 1445 and write operations are aggregated separately. READ operations are 1446 used for the ff_io_latency4 ffl_read. Both WRITE and COMMIT 1447 operations are used for the ff_io_latency4 ffl_write. 
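As an informal illustration (not part of the protocol), the counters above can be modeled as follows; only the ffil_* field names come from the XDR, while the Python class and its methods are hypothetical:

```python
# Hypothetical model of ff_io_latency4 accounting; only the ffil_*
# names come from the XDR above, the rest is illustrative.
from dataclasses import dataclass

@dataclass
class FfIoLatency:
    ffil_ops_requested: int = 0
    ffil_bytes_requested: int = 0
    ffil_ops_completed: int = 0
    ffil_bytes_completed: int = 0
    ffil_bytes_not_delivered: int = 0

    def request(self, nbytes):
        # "Requested" counters track what the client attempts.
        self.ffil_ops_requested += 1
        self.ffil_bytes_requested += nbytes

    def complete(self, nbytes_done, nbytes_lost=0):
        # "Completed" counters track what was actually done; bytes
        # lost to error conditions accumulate separately in
        # ffil_bytes_not_delivered.
        self.ffil_ops_completed += 1
        self.ffil_bytes_completed += nbytes_done
        self.ffil_bytes_not_delivered += nbytes_lost

# WRITE and COMMIT operations would both feed the write-side record.
write_latency = FfIoLatency()
write_latency.request(4096)
write_latency.complete(4096)
```

Note that nothing here couples a completion to an earlier request; as stated below, completed results need not have matching requested results in the reported period.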
"Requested" 1448 counters track what the client is attempting to do and "completed" 1449 counters track what was done. There is no requirement that the 1450 client only report completed results that have matching requested 1451 results from the reported period. 1453 ffil_bytes_not_delivered is used to track the aggregate number of 1454 bytes requested but not fulfilled due to error conditions. 1455 ffil_total_busy_time is the aggregate time spent with outstanding RPC 1456 calls. ffil_aggregate_completion_time is the sum of all round trip 1457 times for completed RPC calls. 1459 In Section 3.3.1 of [RFC5661], nfstime4 is defined as the number 1460 of seconds and nanoseconds since midnight or zero hour January 1, 1461 1970 Coordinated Universal Time (UTC). The use of nfstime4 in 1462 ff_io_latency4 is to store the time since the start of the first I/O from 1463 the client after receiving the layout. In other words, these values are to 1464 be decoded as durations and not as dates and times. 1466 Note that LAYOUTSTATS are cumulative, i.e., not reset each time the 1467 operation is sent. If two LAYOUTSTATS operations for the same file and layout 1468 stateid, originating from the same NFS client, are processed at 1469 the same time by the metadata server, then the one containing the 1470 larger values contains the most recent time series data. 1472 9.2.2. ff_layoutupdate4 1474 1476 /// struct ff_layoutupdate4 { 1477 /// netaddr4 ffl_addr; 1478 /// nfs_fh4 ffl_fhandle; 1479 /// ff_io_latency4 ffl_read; 1480 /// ff_io_latency4 ffl_write; 1481 /// nfstime4 ffl_duration; 1482 /// bool ffl_local; 1483 /// }; 1484 /// 1486 1488 ffl_addr differentiates which network address the client connected to 1489 on the storage device. In the case of multipathing, ffl_fhandle 1490 indicates which read-only copy was selected. ffl_read and ffl_write 1491 convey the latencies for read and write operations, respectively.
1492 ffl_duration is used to indicate the time period over which the 1493 statistics were collected. ffl_local, if TRUE, indicates that the I/O 1494 was serviced by the client's cache. This flag allows the client to 1495 inform the metadata server about "hot" access to a file it would not 1496 normally be allowed to report on. 1498 9.2.3. ff_iostats4 1500 1502 /// struct ff_iostats4 { 1503 /// offset4 ffis_offset; 1504 /// length4 ffis_length; 1505 /// stateid4 ffis_stateid; 1506 /// io_info4 ffis_read; 1507 /// io_info4 ffis_write; 1508 /// deviceid4 ffis_deviceid; 1509 /// ff_layoutupdate4 ffis_layoutupdate; 1510 /// }; 1511 /// 1513 1515 Recall that [RFC7862] defines io_info4 as: 1517 1518 struct io_info4 { 1519 uint64_t ii_count; 1520 uint64_t ii_bytes; 1521 }; 1523 1525 With pNFS, data transfers are performed directly between the pNFS 1526 client and the storage devices. Therefore, the metadata server has 1527 no direct knowledge of the I/O operations being done and thus cannot 1528 create on its own statistical information about client I/O to 1529 optimize data storage location. ff_iostats4 MAY be used by the 1530 client to report I/O statistics back to the metadata server upon 1531 returning the layout. 1533 Since it is not feasible for the client to report every I/O that used 1534 the layout, the client MAY identify "hot" byte ranges for which to 1535 report I/O statistics. The definition and/or configuration mechanism 1536 of what is considered "hot" and the size of the reported byte range 1537 are outside the scope of this document. It is suggested that client 1538 implementations provide reasonable default values and an optional 1539 run-time management interface to control these parameters. For 1540 example, a client can define the default byte range resolution to be 1541 1 MB in size and the thresholds for reporting to be 1 MB/second or 10 1542 I/O operations per second.
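The example thresholds above can be turned into a simple client-side heuristic. The sketch below is hypothetical (the function and constant names are not defined by this document) and assumes fixed-size buckets with an ops-per-second trigger:

```python
# Hypothetical sketch: bucket I/Os into fixed-size byte ranges and
# report only those that exceed an ops-per-second threshold.
from collections import defaultdict

RANGE_SIZE = 1 << 20   # 1 MB byte-range resolution (example above)
OPS_PER_SEC = 10       # reporting threshold (example above)

def hot_ranges(ios, duration_secs):
    """ios: iterable of (offset, length) pairs; returns a list of
    (ffis_offset, ffis_length) byte ranges considered "hot"."""
    ops = defaultdict(int)
    for offset, length in ios:
        first = offset // RANGE_SIZE
        last = (offset + length - 1) // RANGE_SIZE
        for bucket in range(first, last + 1):
            ops[bucket] += 1
    return [(bucket * RANGE_SIZE, RANGE_SIZE)
            for bucket, count in sorted(ops.items())
            if count / duration_secs >= OPS_PER_SEC]
```

A real client would likely also weigh bytes transferred (the 1 MB/second example) and cap the number of reported ranges; those refinements are omitted here.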
1544 For each byte range, ffis_offset and ffis_length represent the 1545 starting offset of the range and the range length in bytes. 1546 ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and 1547 ffis_write.ii_bytes represent, respectively, the number of contiguous 1548 read and write I/Os and the respective aggregate number of bytes 1549 transferred within the reported byte range. 1551 The combination of ffis_deviceid and ffl_addr uniquely identifies 1552 both the storage path and the network route to it. Finally, the 1553 ffl_fhandle allows the metadata server to differentiate between 1554 multiple read-only copies of the file on the same storage device. 1556 9.3. ff_layoutreturn4 1558 1560 /// struct ff_layoutreturn4 { 1561 /// ff_ioerr4 fflr_ioerr_report<>; 1562 /// ff_iostats4 fflr_iostats_report<>; 1563 /// }; 1564 /// 1565 1567 When data file I/O operations fail, fflr_ioerr_report<> is used to 1568 report these errors to the metadata server as an array of elements of 1569 type ff_ioerr4. Each element in the array represents an error that 1570 occurred on the data file identified by ffie_errors.de_deviceid. If 1571 no errors are to be reported, the size of the fflr_ioerr_report<> 1572 array is set to zero. The client MAY also use fflr_iostats_report<> 1573 to report a list of I/O statistics as an array of elements of type 1574 ff_iostats4. Each element in the array represents statistics for a 1575 particular byte range. Byte ranges are not guaranteed to be disjoint 1576 and MAY repeat or intersect. 1578 10. Flexible Files Layout Type LAYOUTERROR 1580 If the client is using NFSv4.2 to communicate with the metadata 1581 server, then instead of waiting for a LAYOUTRETURN to send error 1582 information to the metadata server (see Section 9.1), it MAY use 1583 LAYOUTERROR (see Section 15.6 of [RFC7862]) to communicate that 1584 information. For the flexible files layout type, this means that 1585 LAYOUTERROR4args is treated the same as ff_ioerr4. 
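Whether the client reports errors via LAYOUTRETURN or LAYOUTERROR, a client reaching the storage device over NFSv3 must first map NFSv3 status codes onto NFSv4 ones for de_status (see Section 9.1.1). A partial, illustrative mapping follows; the table contents and the fallback policy are examples, not requirements of this document:

```python
# Partial, illustrative NFSv3 (RFC 1813) to NFSv4 (RFC 5661) status
# mapping for de_status; many codes share numeric values.
NFS3_TO_NFS4_STATUS = {
    0: 0,          # NFS3_OK         -> NFS4_OK
    5: 5,          # NFS3ERR_IO      -> NFS4ERR_IO
    13: 13,        # NFS3ERR_ACCES   -> NFS4ERR_ACCESS
    28: 28,        # NFS3ERR_NOSPC   -> NFS4ERR_NOSPC
    70: 70,        # NFS3ERR_STALE   -> NFS4ERR_STALE
    10008: 10008,  # NFS3ERR_JUKEBOX -> NFS4ERR_DELAY
}

def map_de_status(nfs3_status):
    # Falling back to NFS4ERR_IO (5) for unmapped codes is an
    # illustrative policy choice, not mandated by this document.
    return NFS3_TO_NFS4_STATUS.get(nfs3_status, 5)
```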
1587 11. Flexible Files Layout Type LAYOUTSTATS 1589 If the client is using NFSv4.2 to communicate with the metadata 1590 server, then instead of waiting for a LAYOUTRETURN to send I/O 1591 statistics to the metadata server (see Section 9.2), it MAY use 1592 LAYOUTSTATS (see Section 15.7 of [RFC7862]) to communicate that 1593 information. For the flexible files layout type, this means that 1594 LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same 1595 contents as in ffis_layoutupdate. 1597 12. Flexible File Layout Type Creation Hint 1599 The layouthint4 type is defined in [RFC5661] as follows: 1601 1603 struct layouthint4 { 1604 layouttype4 loh_type; 1605 opaque loh_body<>; 1606 }; 1608 1610 The layouthint4 structure is used by the client to pass a hint about 1611 the type of layout it would like created for a particular file. If 1612 the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body 1613 opaque value is defined by the ff_layouthint4 type. 1615 12.1. ff_layouthint4 1617 1619 /// union ff_mirrors_hint switch (bool ffmc_valid) { 1620 /// case TRUE: 1621 /// uint32_t ffmc_mirrors; 1622 /// case FALSE: 1623 /// void; 1624 /// }; 1625 /// 1627 /// struct ff_layouthint4 { 1628 /// ff_mirrors_hint fflh_mirrors_hint; 1629 /// }; 1630 /// 1632 1634 This type conveys hints for the desired data map. All parameters are 1635 optional, so the client can give values only for the parameters it 1636 cares about. 1638 13. Recalling a Layout 1640 While Section 12.5.5 of [RFC5661] discusses layout type independent 1641 reasons for recalling a layout, the flexible file layout type 1642 metadata server should recall outstanding layouts in the following 1643 cases: 1645 o When the file's security policy changes, i.e., Access Control 1646 Lists (ACLs) or permission mode bits are set. 1648 o When the file's layout changes, rendering outstanding layouts 1649 invalid. 1651 o When existing layouts are inconsistent with the need to enforce 1652 locking constraints.
1654 o When existing layouts are inconsistent with the requirements 1655 regarding resilvering as described in Section 8.3. 1657 13.1. CB_RECALL_ANY 1659 The metadata server can use the CB_RECALL_ANY callback operation to 1660 notify the client to return some or all of its layouts. Section 22.3 1661 of [RFC5661] defines the allowed types of the "NFSv4 Recallable 1662 Object Types Registry". 1664 1666 /// const RCA4_TYPE_MASK_FF_LAYOUT_MIN = 16; 1667 /// const RCA4_TYPE_MASK_FF_LAYOUT_MAX = 17; 1668 [[RFC Editor: please insert assigned constants]] 1669 /// 1671 struct CB_RECALL_ANY4args { 1672 uint32_t craa_layouts_to_keep; 1673 bitmap4 craa_type_mask; 1674 }; 1676 1678 Typically, CB_RECALL_ANY will be used to recall client state when the 1679 server needs to reclaim resources. The craa_type_mask bitmap 1680 specifies the type of resources that are recalled and the 1681 craa_layouts_to_keep value specifies how many of the recalled 1682 flexible file layouts the client is allowed to keep. The flexible 1683 file layout type mask flags are defined as follows: 1685 1687 /// enum ff_cb_recall_any_mask { 1688 /// FF_RCA4_TYPE_MASK_READ = -2, 1689 /// FF_RCA4_TYPE_MASK_RW = -1 1690 [[RFC Editor: please insert assigned constants]] 1691 /// }; 1692 /// 1694 1696 They represent the iomode of the recalled layouts. In response, the 1697 client SHOULD return layouts of the recalled iomode that it needs the 1698 least, keeping at most craa_layouts_to_keep flexible file layouts. 1700 The FF_RCA4_TYPE_MASK_READ flag notifies the client to return 1701 layouts of iomode LAYOUTIOMODE4_READ. Similarly, the 1702 FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts 1703 of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client 1704 is notified to return layouts of either iomode. 1706 14.
Client Fencing 1708 In cases where clients are uncommunicative and their lease has 1709 expired or when clients fail to return recalled layouts within a 1710 lease period, the server MAY revoke client layouts and reassign these 1711 resources to other clients (see Section 12.5.5 in [RFC5661]). To 1712 avoid data corruption, the metadata server MUST fence off the revoked 1713 clients from the respective data files as described in Section 2.2. 1715 15. Security Considerations 1717 The combination of components in a pNFS system is required to 1718 preserve the security properties of NFSv4.1+ with respect to an 1719 entity accessing data via a client. The pNFS feature partitions the 1720 NFSv4.1+ file system protocol into two parts, the control protocol 1721 and the data protocol. As the control protocol in this document is 1722 NFS, the security properties are equivalent to those of that version of NFS. 1723 The Flexible File Layout further divides the data protocol into 1724 metadata and data paths. The security properties of the metadata 1725 path are equivalent to those of NFSv4.1+ (see Sections 1.7.1 and 1726 2.2.1 of [RFC5661]). The security properties of the data path 1727 are equivalent to those of the version of NFS used to access the 1728 storage device, with the provision that the metadata server is 1729 responsible for authenticating client access to the data file. The 1730 metadata server provides appropriate credentials to the client to 1731 access data files on the storage device. It is also responsible for 1732 revoking access for a client to the storage device. 1734 The metadata server enforces the file access-control policy at 1735 LAYOUTGET time. The client should use RPC authorization credentials 1736 for getting the layout for the requested iomode (READ or RW) and the 1737 server verifies the permissions and ACL for these credentials, 1738 possibly returning NFS4ERR_ACCESS if the client is not allowed the 1739 requested iomode.
If the LAYOUTGET operation succeeds, the client 1740 receives, as part of the layout, a set of credentials allowing it I/O 1741 access to the specified data files corresponding to the requested 1742 iomode. When the client acts on I/O operations on behalf of its 1743 local users, it MUST authenticate and authorize the user by issuing 1744 respective OPEN and ACCESS calls to the metadata server, similar to 1745 having NFSv4 data delegations. 1747 The combination of file handle, synthetic uid, and gid in the layout 1748 is how the metadata server enforces access control to the 1749 data server. The client only has access to file handles of file 1750 objects and not directory objects. Thus, given a file handle in a 1751 layout, it is not possible to guess the parent directory file handle. 1752 Further, as the data file permissions only allow the given synthetic 1753 uid read/write permission and the given synthetic gid read 1754 permission, knowing the synthetic ids of one file does not 1755 necessarily allow access to any other data file on the storage 1756 device. 1758 The metadata server can also deny access at any time by fencing the 1759 data file, which means changing the synthetic ids. In turn, that 1760 forces the client to return its current layout and get a new layout 1761 if it wants to continue I/O to the data file. 1763 If access is allowed, the client uses the corresponding (READ or RW) 1764 credentials to perform the I/O operations at the data file's storage 1765 devices. When the metadata server receives a request to change a 1766 file's permissions or ACL, it SHOULD recall all layouts for that file 1767 and then MUST fence off any clients still holding outstanding layouts 1768 for the respective files by implicitly invalidating the previously 1769 distributed credential on all data files comprising the file in 1770 question. It is REQUIRED that this be done before committing to the 1771 new permissions and/or ACL.
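The fencing step described above amounts to assigning fresh synthetic ids to every data file comprising the file before the new permissions or ACL take effect. A minimal, hypothetical sketch (the helper name, id allocator, and data layout are all illustrative):

```python
# Hypothetical metadata-server-side sketch: fence clients holding old
# credentials by rotating the synthetic uid/gid on each data file.
import itertools

_synthetic_ids = itertools.count(1000)  # illustrative id allocator

def fence_file(data_files):
    """data_files: list of dicts with 'uid' and 'gid' keys, one per
    data file comprising the file. Returns the new (uid, gid)."""
    new_uid = next(_synthetic_ids)
    new_gid = next(_synthetic_ids)
    for df in data_files:
        # Previously distributed credentials are now invalid.
        df["uid"], df["gid"] = new_uid, new_gid
    return new_uid, new_gid
```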
By requesting new layouts, the clients 1772 will reauthorize access against the modified access control metadata. 1773 Recalling the layouts in this case is intended to prevent clients 1774 from getting an error on I/Os done after the client was fenced off. 1776 15.1. RPCSEC_GSS and Security Services 1778 Because of the special use of principals within the loose coupling 1779 model, the issues are different depending on the coupling model. 1781 15.1.1. Loosely Coupled 1783 RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] contains facilities 1784 that would allow it to be used to authorize the client to the storage 1785 device on behalf of the metadata server. Doing so would require that 1786 the metadata server, storage device, and client each 1787 implement RPCSEC_GSSv3 using an RPC-application-defined structured 1788 privilege assertion in a manner described in Section 4.9.1 of 1789 [RFC7862]. The specifics necessary to do so are not described in 1790 this document. This is principally because any such specification 1791 would require extensive implementation work on a wide range of 1792 storage devices, which would be unlikely to result in a widely usable 1793 specification for a considerable time. 1795 As a result, the layout type described in this document will not 1796 provide support for use of RPCSEC_GSS together with the loosely 1797 coupled model. However, future layout types could be specified which 1798 would allow such support, either through the use of RPCSEC_GSSv3, or 1799 in other ways. 1801 15.1.2. Tightly Coupled 1803 With tight coupling, the principal used to access the metadata file 1804 is exactly the same as used to access the data file. The storage 1805 device can use the control protocol to validate any RPC credentials. 1806 As a result, there are no security issues related to using RPCSEC_GSS 1807 with a tightly coupled system.
For example, if Kerberos V5 GSS-API 1808 [RFC4121] is used as the security mechanism, then the storage device 1809 could use a control protocol to validate the RPC credentials to the 1810 metadata server. 1812 16. IANA Considerations 1814 [RFC5661] introduced the "pNFS Layout Types Registry"; 1815 new layout type numbers in this registry need to be assigned by IANA. This 1816 document defines the protocol associated with the existing layout 1817 type number, LAYOUT4_FLEX_FILES (see Table 1). 1819 +--------------------+-------+----------+-----+----------------+ 1820 | Layout Type Name | Value | RFC | How | Minor Versions | 1821 +--------------------+-------+----------+-----+----------------+ 1822 | LAYOUT4_FLEX_FILES | 0x4 | RFCTBD10 | L | 1 | 1823 +--------------------+-------+----------+-----+----------------+ 1825 Table 1: Layout Type Assignments 1827 [RFC5661] also introduced a registry called "NFSv4 Recallable Object 1828 Types Registry". This document defines new recallable objects for 1829 RCA4_TYPE_MASK_FF_LAYOUT_MIN and RCA4_TYPE_MASK_FF_LAYOUT_MAX (see 1830 Table 2). 1832 +------------------------------+-------+----------+-----+-----------+ 1833 | Recallable Object Type Name | Value | RFC | How | Minor | 1834 | | | | | Versions | 1835 +------------------------------+-------+----------+-----+-----------+ 1836 | RCA4_TYPE_MASK_FF_LAYOUT_MIN | 16 | RFCTBD10 | L | 1 | 1837 | RCA4_TYPE_MASK_FF_LAYOUT_MAX | 17 | RFCTBD10 | L | 1 | 1838 +------------------------------+-------+----------+-----+-----------+ 1840 Table 2: Recallable Object Type Assignments 1842 17. References 1844 17.1. Normative References 1846 [LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", 1847 November 2008, . 1850 [RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813, 1851 June 1995. 1853 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1854 Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/ 1855 RFC2119, March 1997, .
1858 [RFC4121] Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos 1859 Version 5 Generic Security Service Application Program 1860 Interface (GSS-API) Mechanism Version 2", RFC 4121, July 1861 2005. 1863 [RFC4506] Eisler, M., "XDR: External Data Representation Standard", 1864 STD 67, RFC 4506, May 2006. 1866 [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol 1867 Specification Version 2", RFC 5531, May 2009. 1869 [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1870 "Network File System (NFS) Version 4 Minor Version 1 1871 Protocol", RFC 5661, January 2010. 1873 [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1874 "Network File System (NFS) Version 4 Minor Version 1 1875 External Data Representation Standard (XDR) Description", 1876 RFC 5662, January 2010. 1878 [RFC7530] Haynes, T. and D. Noveck, "Network File System (NFS) 1879 version 4 Protocol", RFC 7530, March 2015. 1881 [RFC7861] Adamson, W. and N. Williams, "Remote Procedure Call (RPC) 1882 Security Version 3", November 2016. 1884 [RFC7862] Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862, 1885 November 2016. 1887 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 1888 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 1889 May 2017, . 1891 [pNFSLayouts] 1892 Haynes, T., "Requirements for pNFS Layout Types", draft- 1893 ietf-nfsv4-layout-types-07 (Work In Progress), August 1894 2017. 1896 17.2. Informative References 1898 [RFC4519] Sciberras, A., Ed., "Lightweight Directory Access Protocol 1899 (LDAP): Schema for User Applications", RFC 4519, DOI 1900 10.17487/RFC4519, June 2006, 1901 . 1903 Appendix A. Acknowledgments 1905 Those who provided miscellaneous comments to early drafts of this 1906 document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, 1907 and Lev Solomonov. 
1909 Those who provided miscellaneous comments to the final drafts of this 1910 document include: Anand Ganesh, Robert Wipfel, Gobikrishnan 1911 Sundharraj, Trond Myklebust, Rick Macklem, and Jim Sermersheim. 1913 Idan Kedar caught a nasty bug in the interaction of client side 1914 mirroring and the minor versioning of devices. 1916 Dave Noveck provided comprehensive reviews of the document during the 1917 working group last calls. He also rewrote Section 2.3. 1919 Olga Kornievskaia made a convincing case against the use of a 1920 credential versus a principal in the fencing approach. Andy Adamson 1921 and Benjamin Kaduk helped to sharpen the focus. 1923 Benjamin Kaduk and Olga Kornievskaia also helped provide concrete 1924 scenarios for loosely coupled security mechanisms. And in the end, 1925 Olga proved that as defined, the loosely coupled model would not work 1926 with RPCSEC_GSS. 1928 Tigran Mkrtchyan provided the use case for not allowing the client to 1929 proxy the I/O through the data server. 1931 Rick Macklem provided the use case for only writing to a single 1932 mirror. 1934 Appendix B. RFC Editor Notes 1936 [RFC Editor: please remove this section prior to publishing this 1937 document as an RFC] 1939 [RFC Editor: prior to publishing this document as an RFC, please 1940 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the 1941 RFC number of this document] 1943 Authors' Addresses 1945 Benny Halevy 1947 Email: bhalevy@gmail.com 1949 Thomas Haynes 1950 Hammerspace 1951 4300 El Camino Real Ste 105 1952 Los Altos, CA 94022 1953 USA 1955 Email: loghyr@gmail.com