idnits 2.17.1 draft-ietf-nfsv4-flex-files-17.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 403 has weird spacing: '... loghyr staff...' == Line 1364 has weird spacing: '...stateid lor...' == Line 1627 has weird spacing: '...rs_hint ffl...' -- The document date (February 27, 2018) is 2243 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Possible downref: Non-RFC (?) normative reference: ref. 'LEGAL' ** Downref: Normative reference to an Informational RFC: RFC 1813 ** Obsolete normative reference: RFC 5661 (Obsoleted by RFC 8881) Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 NFSv4 B. Halevy 3 Internet-Draft 4 Intended status: Standards Track T. Haynes 5 Expires: August 31, 2018 Primary Data 6 February 27, 2018 8 Parallel NFS (pNFS) Flexible File Layout 9 draft-ietf-nfsv4-flex-files-17.txt 11 Abstract 13 The Parallel Network File System (pNFS) allows a separation between 14 the metadata (onto a metadata server) and data (onto a storage 15 device) for a file. The flexible file layout type is defined in this 16 document as an extension to pNFS which allows the use of storage 17 devices in a fashion such that they require only a quite limited 18 degree of interaction with the metadata server, using already 19 existing protocols. Client-side mirroring is also added to provide 20 replication of files. 22 Status of This Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on August 31, 2018. 39 Copyright Notice 41 Copyright (c) 2018 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. 
Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 57 1.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 4 58 1.2. Requirements Language . . . . . . . . . . . . . . . . . . 6 59 2. Coupling of Storage Devices . . . . . . . . . . . . . . . . . 6 60 2.1. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 7 61 2.2. Fencing Clients from the Storage Device . . . . . . . . . 7 62 2.2.1. Implementation Notes for Synthetic uids/gids . . . . 8 63 2.2.2. Example of using Synthetic uids/gids . . . . . . . . 9 64 2.3. State and Locking Models . . . . . . . . . . . . . . . . 10 65 2.3.1. Loosely Coupled Locking Model . . . . . . . . . . . . 10 66 2.3.2. Tightly Coupled Locking Model . . . . . . . . . . . . 12 67 3. XDR Description of the Flexible File Layout Type . . . . . . 13 68 3.1. Code Components Licensing Notice . . . . . . . . . . . . 14 69 4. Device Addressing and Discovery . . . . . . . . . . . . . . . 15 70 4.1. ff_device_addr4 . . . . . . . . . . . . . . . . . . . . . 15 71 4.2. Storage Device Multipathing . . . . . . . . . . . . . . . 17 72 5. Flexible File Layout Type . . . . . . . . . . . . . . . . . . 18 73 5.1. ff_layout4 . . . . . . . . . . . . . . . . . . . . . . . 19 74 5.1.1. Error Codes from LAYOUTGET . . . . . . . . . . . . . 22 75 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS . . 23 76 5.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 23 77 5.3. Interactions Between Devices and Layouts . . . . . . . . 23 78 5.4. Handling Version Errors . . . . . . . . . . . . . . . . . 23 79 6. Striping via Sparse Mapping . . . . . . . . . . . . . . . . . 24 80 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 24 81 8. Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . . 25 82 8.1. Selecting a Mirror . . . . . . . . . . . . . . . . . . . 26 83 8.2. Writing to Mirrors . . . . . . . . . . . . . . . . . . . 26 84 8.2.1. Single Storage Device Updates Mirrors . . . . . . . . 26 85 8.2.2. Client Updates All Mirrors . . . . . . . . . . . . . 26 86 8.2.3. Handling Write Errors . . . . . . . . . . . . . . . . 27 87 8.2.4. Handling Write COMMITs . . . . . . . . . . . . . . . 27 88 8.3. Metadata Server Resilvering of the File . . . . . . . . . 28 89 9. Flexible Files Layout Type Return . . . . . . . . . . . . . . 28 90 9.1. I/O Error Reporting . . . . . . . . . . . . . . . . . . . 29 91 9.1.1. ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . . 29 92 9.2. Layout Usage Statistics . . . . . . . . . . . . . . . . . 30 93 9.2.1. ff_io_latency4 . . . . . . . . . . . . . . . . . . . 30 94 9.2.2. ff_layoutupdate4 . . . . . . . . . . . . . . . . . . 31 95 9.2.3. ff_iostats4 . . . . . . . . . . . . . . . . . . . . . 32 96 9.3. ff_layoutreturn4 . . . . . . . . . . . . . . . . . . . . 33 98 10. Flexible Files Layout Type LAYOUTERROR . . . . . . . . . . . 34 99 11. Flexible Files Layout Type LAYOUTSTATS . . . . . . . . . . . 34 100 12. Flexible File Layout Type Creation Hint . . . . . . . . . . . 34 101 12.1. ff_layouthint4 . . . . . . . . . . . . . . . . . . . . . 35 102 13. Recalling a Layout . . . . . . . . . . . . . . . . . . . . . 35 103 13.1. CB_RECALL_ANY . . . 
. . . . . . . . . . . . . . . . . . 35
104 14. Client Fencing . . . . . . . . . . . . . . . . . . . . . 36
105 15. Security Considerations . . . . . . . . . . . . . . . . . . . 37
106 15.1. RPCSEC_GSS and Security Services . . . . . . . . . . . . 38
107 15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . 38
108 15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . 39
109 16. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39
110 17. References . . . . . . . . . . . . . . . . . . . . . . . . . 40
111 17.1. Normative References . . . . . . . . . . . . . . . . . . 40
112 17.2. Informative References . . . . . . . . . . . . . . . . . 41
113 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 41
114 Appendix B. RFC Editor Notes . . . . . . . . . . . . . . . . . . 42
115 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 42

117 1. Introduction

119 In the parallel Network File System (pNFS), the metadata server 120 returns layout type structures that describe where file data is 121 located. There are different layout types for different storage 122 systems and methods of arranging data on storage devices. This 123 document defines the flexible file layout type used with file-based 124 data servers that are accessed using the Network File System (NFS) 125 protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661], and 126 NFSv4.2 [RFC7862].

128 To provide a global state model equivalent to that of the files 129 layout type, a back-end control protocol might be implemented between 130 the metadata server and NFSv4.1+ storage devices. This document does 131 not provide a standards-track control protocol. An implementation can 132 either define its own mechanism or define a control protocol 133 in a standards-track document. The requirements for a control 134 protocol are specified in [RFC5661] and clarified in [pNFSLayouts].

136 The control protocol described in this document is based on NFS. The 137 storage devices are configured such that the metadata server has full 138 access rights to the data file system, and the metadata server then 139 uses synthetic ids to control client access to individual files.

141 In traditional mirroring of data, the server is responsible for 142 replicating, validating, and repairing copies of the data file. With 143 client-side mirroring, the metadata server provides a layout that 144 presents the available mirrors to the client. It is then the client 145 that picks a mirror to read from and ensures that all writes go to 146 all mirrors. Only if all mirrors are successfully updated does the 147 client consider the write transaction to have succeeded. In case of 148 error, the client can use the LAYOUTERROR operation to inform the 149 metadata server, which is then responsible for repairing the 150 mirrored copies of the file.

152 1.1. Definitions

154 control communication requirements: are, for a layout type, the 155 details regarding information on layouts, stateids, file metadata, 156 and file data which must be communicated between the metadata 157 server and the storage devices.

159 control protocol: is the particular mechanism that an implementation 160 of a layout type would use to meet the control communication 161 requirement for that layout type. This need not be a protocol as 162 normally understood. In some cases the same protocol may be used 163 as a control protocol and storage protocol.
165 client-side mirroring: is a feature in which the client and not the 166 server is responsible for updating all of the mirrored copies of a 167 layout segment. 169 (file) data: is that part of the file system object which contains 170 the content. 172 data server (DS): is another term for storage device. 174 fencing: is the process by which the metadata server prevents the 175 storage devices from processing I/O from a specific client to a 176 specific file. 178 file layout type: is a layout type in which the storage devices are 179 accessed via the NFS protocol (see Section 13 of [RFC5661]). 181 gid: is the group id, a numeric value which identifies to which 182 group a file belongs. 184 layout: is the information a client uses to access file data on a 185 storage device. This information will include specification of 186 the protocol (layout type) and the identity of the storage devices 187 to be used. 189 layout iomode: is a grant of either read or read/write I/O to the 190 client. 192 layout segment: is a sub-division of a layout. That sub-division 193 might be by the layout iomode (see Sections 3.3.20 and 12.2.9 of 195 [RFC5661]), a striping pattern (see Section 13.3 of [RFC5661]), or 196 requested byte range. 198 layout stateid: is a 128-bit quantity returned by a server that 199 uniquely defines the layout state provided by the server for a 200 specific layout that describes a layout type and file (see 201 Section 12.5.2 of [RFC5661]). Further, Section 12.5.3 describes 202 differences in handling between layout stateids and other stateid 203 types. 205 layout type: is a specification of both the storage protocol used to 206 access the data and the aggregation scheme used to lay out the 207 file data on the underlying storage devices. 209 loose coupling: is when the control protocol is a storage protocol. 211 (file) metadata: is that part of the file system object which 212 describes the object and not the content. E.g., it could be the 213 time since last modification, access, etc. 215 metadata server (MDS): is the pNFS server which provides metadata 216 information for a file system object. It also is responsible for 217 generating, recalling, and revoking layouts for file system 218 objects, for performing directory operations, and for performing I 219 /O operations to regular files when the clients direct these to 220 the metadata server itself. 222 mirror: is a copy of a layout segment. Note that if one copy of the 223 mirror is updated, then all copies must be updated. 225 recalling a layout: is when the metadata server uses a back channel 226 to inform the client that the layout is to be returned in a 227 graceful manner. Note that the client has the opportunity to 228 flush any writes, etc., before replying to the metadata server. 230 revoking a layout: is when the metadata server invalidates the 231 layout such that neither the metadata server nor any storage 232 device will accept any access from the client with that layout. 234 resilvering: is the act of rebuilding a mirrored copy of a layout 235 segment from a known good copy of the layout segment. Note that 236 this can also be done to create a new mirrored copy of the layout 237 segment. 239 rsize: is the data transfer buffer size used for reads. 241 stateid: is a 128-bit quantity returned by a server that uniquely 242 defines the open and locking states provided by the server for a 243 specific open-owner or lock-owner/open-owner pair for a specific 244 file and type of lock. 
246 storage device: is the target to which clients may direct I/O 247 requests when they hold an appropriate layout. See Section 2.1 of 248 [pNFSLayouts] for further discussion of the difference between a 249 data store and a storage device.

251 storage protocol: is the protocol used by clients to do I/O 252 operations to the storage device. Each layout type specifies the 253 set of storage protocols.

255 tight coupling: is an arrangement in which the control protocol is 256 one designed specifically for that purpose. It may be either a 257 proprietary protocol, adapted specifically to a particular 258 metadata server, or one based on a standards-track document.

260 uid: is the user id, a numeric value which identifies which user 261 owns a file.

263 wsize: is the data transfer buffer size used for writes.

265 1.2. Requirements Language

267 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 268 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 269 document are to be interpreted as described in [RFC2119].

271 2. Coupling of Storage Devices

273 A server implementation may choose either a loose or tight coupling 274 model between the metadata server and the storage devices. 275 [pNFSLayouts] describes the general problems facing pNFS 276 implementations. This document details how the new Flexible File 277 Layout Type addresses these issues. To implement the tight coupling 278 model, a control protocol has to be defined. As the flex file layout 279 imposes no special requirements on the client, the control protocol 280 will need to provide:

282 (1) for the management of both security and LAYOUTCOMMITs, and

284 (2) a global stateid model and management of these stateids.

286 When implementing the loose coupling model, the only control protocol 287 will be a version of NFS, with no ability to provide a global stateid 288 model or to prevent clients from using layouts inappropriately. To 289 enable client use in that environment, this document will specify how 290 security, state, and locking are to be managed.

292 2.1. LAYOUTCOMMIT

294 Regardless of the coupling model, the metadata server has the 295 responsibility, upon receiving a LAYOUTCOMMIT (see Section 18.42 of 296 [RFC5661]), of ensuring that the semantics of pNFS are respected (see 297 Section 3.1 of [pNFSLayouts]). These semantics include a requirement that 298 data written to the storage device be stable before the occurrence 299 of the LAYOUTCOMMIT.

301 It is the responsibility of the client to make sure the data file is 302 stable before the metadata server begins to query the storage devices 303 about the changes to the file. If any WRITE to a storage device did 304 not result in stable_how equal to FILE_SYNC, a LAYOUTCOMMIT to the 305 metadata server MUST be preceded by a COMMIT to the storage devices 306 written to. Note that if the client has not done a COMMIT to the 307 storage device, then the LAYOUTCOMMIT might not be synchronized to 308 the last WRITE operation to the storage device.
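The following fragment is a minimal, non-normative sketch of this ordering requirement. The data model (a list of per-device WRITE results) is hypothetical; only the stable_how values come from the protocols cited above.

    # A minimal sketch of the ordering requirement above.  The input
    # shape (a list of (device, stable_how) results) is hypothetical;
    # the stable_how values follow RFC 1813/RFC 5661.

    UNSTABLE, DATA_SYNC, FILE_SYNC = 0, 1, 2

    def devices_needing_commit(write_results):
        # write_results: (device_id, stable_how) for every WRITE issued
        # under the layout.  Each device holding a WRITE that did not
        # return FILE_SYNC must see a COMMIT before the client may send
        # LAYOUTCOMMIT to the metadata server.
        return {dev for dev, stable_how in write_results
                if stable_how != FILE_SYNC}

    # Example: a COMMIT is owed to device "ds2" only.
    assert devices_needing_commit(
        [("ds1", FILE_SYNC), ("ds2", UNSTABLE)]) == {"ds2"}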
310 2.2. Fencing Clients from the Storage Device

312 With loosely coupled storage devices, the metadata server uses 313 synthetic uids (user ids) and gids (group ids) for the data file, 314 where the uid owner of the data file is allowed read/write access and 315 the gid owner is allowed read-only access. As part of the layout 316 (see ffds_user and ffds_group in Section 5.1), the client is provided 317 with the user and group to be used in the Remote Procedure Call (RPC) 318 [RFC5531] credentials needed to access the data file. Fencing off of 319 clients is achieved by the metadata server changing the synthetic uid 320 and/or gid owners of the data file on the storage device to 321 implicitly revoke the outstanding RPC credentials. A client 322 presenting the wrong credential for the desired access will get an 323 NFS4ERR_ACCESS error.

325 With this loosely coupled model, the metadata server is not able to 326 fence off a single client; instead, it is forced to fence off all clients. 327 However, as the other clients react to the fencing, returning their 328 layouts and trying to get new ones, the metadata server can hand out 329 a new uid and gid to allow access.

331 It is RECOMMENDED to implement common access control methods at the 332 storage device filesystem to allow only the metadata server root 333 (super user) access to the storage device, and to set the owner of 334 all directories holding data files to the root user. This approach 335 provides a practical model to enforce access control and fence off 336 cooperative clients, but it cannot protect against malicious 337 clients; hence it provides a level of security equivalent to 338 AUTH_SYS. It is RECOMMENDED that the communication between the 339 metadata server and storage device be secure from eavesdroppers and 340 man-in-the-middle protocol tampering. The security measure could be 341 due to physical security (e.g., the servers are co-located in a 342 physically secure area), from encrypted communications, or some other 343 technique.

345 With tightly coupled storage devices, the metadata server sets the 346 user and group owners, mode bits, and ACL of the data file to be the 347 same as the metadata file. The client must then authenticate with the 348 storage device and go through the same authorization process it would 349 go through via the metadata server. In the case of tight coupling, 350 fencing is the responsibility of the control protocol and is not 351 described in detail here. However, implementations of the tight 352 coupling locking model (see Section 2.3) will need a way to prevent 353 access by certain clients to specific files by invalidating the 354 corresponding stateids on the storage device. In such a scenario, 355 the client will be given an error of NFS4ERR_BAD_STATEID.

357 The client need not know the model used between the metadata server 358 and the storage device. It need only react consistently to any 359 errors in interacting with the storage device. It should both return 360 the layout and error to the metadata server and ask for a new layout. 361 At that point, the metadata server can either hand out a new layout, 362 hand out no layout (forcing the I/O through it), or deny the client 363 further access to the file.
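The consistent reaction just described can be summarized in a short, non-normative sketch. The error strings are NFSv4 status names; the returned action tags are purely illustrative and not defined by this protocol.

    # Sketch of the client reaction above.  Fencing surfaces as
    # NFS4ERR_ACCESS (loose coupling: synthetic uid/gid changed) or
    # NFS4ERR_BAD_STATEID (tight coupling: stateid invalidated).

    def reaction_to_storage_error(err):
        if err in ("NFS4ERR_ACCESS", "NFS4ERR_BAD_STATEID"):
            # Return the layout with the error, then ask for a new one;
            # the metadata server decides what, if anything, to grant.
            return ["LAYOUTRETURN (with error)", "LAYOUTGET"]
        return ["retry or report via LAYOUTERROR"]

    assert "LAYOUTGET" in reaction_to_storage_error("NFS4ERR_ACCESS")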
365 2.2.1. Implementation Notes for Synthetic uids/gids

367 The selection method for the synthetic uids and gids to be used for 368 fencing in loosely coupled storage devices is strictly an 369 implementation issue. I.e., an administrator might restrict a range 370 of such ids available to the Lightweight Directory Access Protocol 371 (LDAP) 'uid' field [RFC4519]. She might also be able to choose an id 372 that would never be used to grant access. Then when the metadata 373 server had a request to access a file, a SETATTR would be sent to the 374 storage device to set the owner and group of the data file. The user 375 and group might be selected in a round-robin fashion from the range 376 of available ids.

378 Those ids would be sent back as ffds_user and ffds_group to the 379 client, which would present them as the RPC credentials to the 380 storage device. When the client was done accessing the file and the 381 metadata server knew that no other client was accessing the file, it 382 could reset the owner and group to restrict access to the data file.

384 When the metadata server wanted to fence off a client, it would 385 change the synthetic uid and/or gid to the restricted ids. Note that 386 using a restricted id ensures that there is a change of owner and at 387 least one id available that never gets allowed access.

389 Under an AUTH_SYS security model, synthetic uids and gids of 0 SHOULD 390 be avoided. These typically either grant super access to files on a 391 storage device or are mapped to an anonymous id. In the first case, 392 even if the data file is fenced, the client might still be able to 393 access the file. In the second case, multiple ids might be mapped to 394 the anonymous id.

396 2.2.2. Example of using Synthetic uids/gids

398 The user loghyr creates a file "ompha.c" on the metadata server, 399 which then creates a corresponding data file on the storage device.

401 The metadata server entry may look like:

403 -rw-r--r--  1 loghyr  staff  1697 Dec  4 11:31 ompha.c

405 On the storage device, the data file may be assigned some unpredictable 406 synthetic uid/gid to deny access:

408 -rw-r-----  1 19452  28418  1697 Dec  4 11:31 data_ompha.c

410 When the file is opened on a client and accessed, the client will try to get 411 a layout for the data file. Since the layout knows nothing about the 412 user (and does not care), whether the user loghyr or garbo opens the 413 file does not matter. The client has to present a uid of 19452 to 414 get write permission. If it presents any other value for the uid, 415 then it must present a gid of 28418 to get read access.

417 Further, if the metadata server decides to fence the file, it should 418 change the uid and/or gid such that these values neither match 419 earlier values for that file nor match a predictable change based on 420 an earlier fencing.

422 -rw-r-----  1 19453  28419  1697 Dec  4 11:31 data_ompha.c

424 The set of synthetic gids on the storage device should be selected 425 such that there is no mapping in any of the name services used by the 426 storage device. I.e., each group should have no members.

428 If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the 429 metadata server should return a synthetic uid that is not set on the 430 storage device. Only the synthetic gid would be valid.

432 The client is thus solely responsible for enforcing file permissions 433 in a loosely coupled model. To allow loghyr write access, it will 434 send an RPC to the storage device with a credential of 19452:28418. To 435 allow garbo read access, it will send an RPC to the storage device 436 with a credential of 1066:28418. The value of the uid (1066 here) does not matter 437 as long as it is not the synthetic uid (19452) granted when getting the 438 layout.

440 While pushing the enforcement of permission checking onto the client 441 may seem to weaken security, the client may already be responsible 442 for enforcing permissions before modifications are sent to a server. 443 With cached writes, the client is always responsible for tracking who 444 is modifying a file and making sure not to coalesce requests from 445 multiple users into one request.
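A non-normative sketch of one possible metadata-server strategy follows. It assumes a reserved range of synthetic ids with no mapping in any name service; the ranges and the round-robin policy are hypothetical implementation choices, as the text above notes.

    # Sketch of round-robin selection of synthetic owners for fencing.
    # The id ranges below are hypothetical and reserved (member-less).

    import itertools

    uids = itertools.cycle(range(19452, 19552))
    gids = itertools.cycle(range(28418, 28518))

    def next_owner():
        # A SETATTR of the data file to this owner/group implicitly
        # revokes every RPC credential built from the previous pair.
        return next(uids), next(gids)

    print(next_owner())   # (19452, 28418): initial assignment
    print(next_owner())   # (19453, 28419): after a fencing event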
447 2.3. State and Locking Models

449 An implementation can always be deployed as a loosely coupled model. 450 There is, however, no way for a storage device to indicate over an NFS 451 protocol that it can definitively participate in a tightly coupled 452 model:

454 o Storage devices implementing the NFSv3 and NFSv4.0 protocols are 455 always treated as loosely coupled.

457 o NFSv4.1+ storage devices that do not return the 458 EXCHGID4_FLAG_USE_PNFS_DS flag when responding to EXCHANGE_ID are indicating 459 that they are to be treated as loosely coupled. From the locking 460 viewpoint they are treated in the same way as NFSv4.0 storage 461 devices.

463 o NFSv4.1+ storage devices that do return the 464 EXCHGID4_FLAG_USE_PNFS_DS flag when responding to EXCHANGE_ID can potentially 465 be tightly coupled. They would use a back-end control protocol to 466 implement the global stateid model as described in [RFC5661].

468 A storage device would have to either be discovered or advertised 469 over the control protocol to enable a tight coupling model.

471 2.3.1. Loosely Coupled Locking Model

473 When locking-related operations are requested, they are primarily 474 dealt with by the metadata server, which generates the appropriate 475 stateids. When an NFSv4 version is used as the data access protocol, 476 the metadata server may make stateid-related requests of the storage 477 devices. However, it is not required to do so, and the resulting 478 stateids are known only to the metadata server and the storage 479 device.

481 Given this basic structure, locking-related operations are handled as 482 follows:

484 o OPENs are dealt with by the metadata server. Stateids are 485 selected by the metadata server and associated with the client id 486 describing the client's connection to the metadata server. The 487 metadata server may need to interact with the storage device to 488 locate the file to be opened, but no locking-related functionality 489 need be used on the storage device.

491 OPEN_DOWNGRADE and CLOSE only require local execution on the 492 metadata server.

494 o Advisory byte-range locks can be implemented locally on the 495 metadata server. As in the case of OPENs, the stateids associated 496 with byte-range locks are assigned by the metadata server and only 497 used on the metadata server.

499 o Delegations are assigned by the metadata server, which initiates 500 recalls when conflicting OPENs are processed. No storage device 501 involvement is required.

503 o TEST_STATEID and FREE_STATEID are processed locally on the 504 metadata server, without storage device involvement.

506 All I/O operations to the storage device are done using the anonymous 507 stateid. Thus the storage device has no information about the 508 openowner and lockowner responsible for issuing a particular I/O 509 operation. As a result:

511 o Mandatory byte-range locking cannot be supported because the 512 storage device has no way of distinguishing I/O done on behalf of 513 the lock owner from that done by others.

515 o Enforcement of share reservations is the responsibility of the 516 client. Even though I/O is done using the anonymous stateid, the 517 client must ensure, before issuing the I/O, that it has a valid stateid 518 associated with the openowner that allows the I/O being done; an illustrative sketch follows this list.
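The following is a minimal, non-normative sketch of that client-side check. The access-bit model is hypothetical; the storage device itself sees only the anonymous stateid and cannot perform this check.

    # Sketch of client-side share-reservation enforcement in the
    # loosely coupled model.

    def may_do_io(open_access, requested):
        # open_access: access bits ("READ", "WRITE") from the open
        #              stateid held against the metadata server.
        # requested:   "READ" or "WRITE" for the I/O about to be sent
        #              under the anonymous stateid.
        return requested in open_access

    assert may_do_io({"READ"}, "READ")
    assert not may_do_io({"READ"}, "WRITE")   # would violate the open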
520 In the event that a stateid is revoked, the metadata server is 521 responsible for preventing client access, since it has no way of 522 being sure that the client is aware that the stateid in question has 523 been revoked.

525 As the client never receives a stateid generated by a storage device, 526 there is no client lease on the storage device and no prospect of 527 lease expiration, even when access is via NFSv4 protocols. Clients 528 will have leases on the metadata server. In dealing with lease 529 expiration, the metadata server may need to use fencing to prevent 530 revoked stateids from being relied upon by a client unaware of the 531 fact that they have been revoked.

533 2.3.2. Tightly Coupled Locking Model

535 When locking-related operations are requested, they are primarily 536 dealt with by the metadata server, which generates the appropriate 537 stateids. These stateids must be made known to the storage device 538 using control protocol facilities, the details of which are not 539 discussed in this document.

541 Given this basic structure, locking-related operations are handled as 542 follows:

544 o OPENs are dealt with primarily on the metadata server. Stateids 545 are selected by the metadata server and associated with the client 546 id describing the client's connection to the metadata server. The 547 metadata server needs to interact with the storage device to 548 locate the file to be opened, and to make the storage device aware 549 of the association between the metadata-server-chosen stateid and 550 the client and openowner that it represents.

552 OPEN_DOWNGRADE and CLOSE are executed initially on the metadata 553 server, but the state change made must be propagated to the storage 554 device.

556 o Advisory byte-range locks can be implemented locally on the 557 metadata server. As in the case of OPENs, the stateids associated 558 with byte-range locks are assigned by the metadata server and are 559 available for use on the metadata server. Because I/O operations 560 are allowed to present lock stateids, the metadata server needs 561 the ability to make the storage device aware of the association 562 between the metadata-server-chosen lock stateid and the corresponding 563 open stateid.

565 o Mandatory byte-range locks can be supported when both the metadata 566 server and the storage devices have the appropriate support. As 567 in the case of advisory byte-range locks, these are assigned by 568 the metadata server and are available for use on the metadata 569 server. To enable mandatory lock enforcement on the storage 570 device, the metadata server needs the ability to make the storage 571 device aware of the association between the metadata-server-chosen 572 stateid and the client, openowner, and lock (i.e., lockowner, 573 byte-range, lock-type) that it represents. Because I/O operations 574 are allowed to present lock stateids, this information needs to be 575 propagated to all storage devices to which I/O might be directed 576 rather than only to storage devices that contain the locked region.

578 o Delegations are assigned by the metadata server, which initiates 579 recalls when conflicting OPENs are processed. Because I/O 580 operations are allowed to present delegation stateids, the 581 metadata server requires the ability to make the storage device 582 aware of the association between the metadata-server-chosen 583 stateid and the filehandle and delegation type it represents, and 584 to break such an association.
586 o TEST_STATEID is processed locally on the metadata server, without 587 storage device involvement. 589 o FREE_STATEID is processed on the metadata server but the metadata 590 server requires the ability to propagate the request to the 591 corresponding storage devices. 593 Because the client will possess and use stateids valid on the storage 594 device, there will be a client lease on the storage device and the 595 possibility of lease expiration does exist. The best approach for 596 the storage device is to retain these locks as a courtesy. However, 597 if it does not do so, control protocol facilities need to provide the 598 means to synchronize lock state between the metadata server and 599 storage device. 601 Clients will also have leases on the metadata server, which are 602 subject to expiration. In dealing with lease expiration, the 603 metadata server would be expected to use control protocol facilities 604 enabling it to invalidate revoked stateids on the storage device. In 605 the event the client is not responsive, the metadata server may need 606 to use fencing to prevent revoked stateids from being acted upon by 607 the storage device. 609 3. XDR Description of the Flexible File Layout Type 611 This document contains the external data representation (XDR) 612 [RFC4506] description of the flexible file layout type. The XDR 613 description is embedded in this document in a way that makes it 614 simple for the reader to extract into a ready-to-compile form. The 615 reader can feed this document into the following shell script to 616 produce the machine readable XDR description of the flexible file 617 layout type: 619 621 #!/bin/sh 622 grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' 624 626 That is, if the above script is stored in a file called "extract.sh", 627 and this document is in a file called "spec.txt", then the reader can 628 do: 630 sh extract.sh < spec.txt > flex_files_prot.x 632 The effect of the script is to remove leading white space from each 633 line, plus a sentinel sequence of "///". 635 The embedded XDR file header follows. Subsequent XDR descriptions, 636 with the sentinel sequence are embedded throughout the document. 638 Note that the XDR code contained in this document depends on types 639 from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both nfs 640 types that end with a 4, such as offset4, length4, etc., as well as 641 more generic types such as uint32_t and uint64_t. 643 3.1. Code Components Licensing Notice 645 Both the XDR description and the scripts used for extracting the XDR 646 description are Code Components as described in Section 4 of "Legal 647 Provisions Relating to IETF Documents" [LEGAL]. These Code 648 Components are licensed according to the terms of that document. 650 652 /// /* 653 /// * Copyright (c) 2012 IETF Trust and the persons identified 654 /// * as authors of the code. All rights reserved. 655 /// * 656 /// * Redistribution and use in source and binary forms, with 657 /// * or without modification, are permitted provided that the 658 /// * following conditions are met: 659 /// * 660 /// * o Redistributions of source code must retain the above 661 /// * copyright notice, this list of conditions and the 662 /// * following disclaimer. 663 /// * 664 /// * o Redistributions in binary form must reproduce the above 665 /// * copyright notice, this list of conditions and the 666 /// * following disclaimer in the documentation and/or other 667 /// * materials provided with the distribution. 
668 /// *
669 /// *  o Neither the name of Internet Society, IETF or IETF
670 /// *    Trust, nor the names of specific contributors, may be
671 /// *    used to endorse or promote products derived from this
672 /// *    software without specific prior written permission.
673 /// *
674 /// *    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
675 /// *    AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
676 /// *    WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
677 /// *    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
678 /// *    FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
679 /// *    EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
680 /// *    LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
681 /// *    EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
682 /// *    NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
683 /// *    SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
684 /// *    INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
685 /// *    LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
686 /// *    OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
687 /// *    IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
688 /// *    ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
689 /// *
690 /// * This code was derived from RFCTBD10.
691 /// * Please reproduce this note if possible.
692 /// */
693 ///
694 /// /*
695 /// * flex_files_prot.x
696 /// */
697 ///
698 /// /*
699 /// * The following include statements are for example only.
700 /// * The actual XDR definition files are generated separately
701 /// * and independently and are likely to have a different name.
702 /// * %#include <nfsv42.x>
703 /// * %#include <rpc_prot.x>
704 /// */
705 ///

707

709 4. Device Addressing and Discovery

711 Data operations to a storage device require the client to know the 712 network address of the storage device. The NFSv4.1+ GETDEVICEINFO 713 operation (Section 18.40 of [RFC5661]) is used by the client to 714 retrieve that information.

716 4.1. ff_device_addr4

718 The ff_device_addr4 data structure is returned by the server as the 719 layout-type-specific opaque field da_addr_body in the device_addr4 720 structure by a successful GETDEVICEINFO operation. 722

723 /// struct ff_device_versions4 {
724 ///         uint32_t        ffdv_version;
725 ///         uint32_t        ffdv_minorversion;
726 ///         uint32_t        ffdv_rsize;
727 ///         uint32_t        ffdv_wsize;
728 ///         bool            ffdv_tightly_coupled;
729 /// };
730 ///

732 /// struct ff_device_addr4 {
733 ///         multipath_list4     ffda_netaddrs;
734 ///         ff_device_versions4 ffda_versions<>;
735 /// };
736 ///

738

740 The ffda_netaddrs field is used to locate the storage device. It 741 MUST be set by the server to a list holding one or more of the device 742 network addresses.

744 The ffda_versions array allows the metadata server to present choices 745 as to NFS version, minor version, and coupling strength to the 746 client. The ffdv_version and ffdv_minorversion represent the NFS 747 protocol to be used to access the storage device. This layout 748 specification defines the semantics for ffdv_version values 3 and 4. If 749 ffdv_version equals 3, then the server MUST set ffdv_minorversion to 0 750 and ffdv_tightly_coupled to false. The client MUST then access the 751 storage device using the NFSv3 protocol [RFC1813]. If ffdv_version 752 equals 4, then the server MUST set ffdv_minorversion to one of the 753 NFSv4 minor version numbers, and the client MUST access the storage 754 device using NFSv4 with the specified minor version.
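The selection a client makes from ffda_versions can be sketched as follows. This is non-normative: plain (version, minorversion) tuples stand in for ff_device_versions4 entries, and the preference for the highest offered version is a hypothetical policy, not a protocol requirement.

    # Sketch of a client filtering ffda_versions against its own
    # support.

    def pick_version(ffda_versions, client_supports):
        # ffda_versions:   list of (version, minorversion) offered.
        # client_supports: set of (version, minorversion) the client
        #                  implements, e.g., {(3, 0), (4, 1), (4, 2)}.
        usable = [v for v in ffda_versions if v in client_supports]
        return max(usable) if usable else None   # None: see Section 5.4

    assert pick_version([(3, 0), (4, 1)], {(3, 0), (4, 1)}) == (4, 1)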
756 Note that while the client might determine that it cannot use any of 757 the configured combinations of ffdv_version, ffdv_minorversion, and 758 ffdv_tightly_coupled when it gets the device list from the metadata 759 server, there is no way to indicate to the metadata server 760 which device it is version incompatible with. If, however, the client 761 waits until it retrieves the layout from the metadata server, it can 762 at that time clearly identify the storage device in question (see 763 Section 5.4).

765 The ffdv_rsize and ffdv_wsize are used to communicate the maximum 766 rsize and wsize supported by the storage device. As the storage 767 device can have a different rsize or wsize than the metadata server, 768 the ffdv_rsize and ffdv_wsize allow the metadata server to 769 communicate that information on behalf of the storage device.

771 ffdv_tightly_coupled informs the client as to whether the metadata 772 server is tightly coupled with the storage devices or not. Note that 773 even if the data protocol is at least NFSv4.1, it may still be the 774 case that there is loose coupling in effect. If ffdv_tightly_coupled 775 is not set, then the client MUST commit writes to the storage devices 776 for the file before sending a LAYOUTCOMMIT to the metadata server. 777 I.e., the writes MUST be committed by the client to stable storage 778 via issuing WRITEs with stable_how == FILE_SYNC or by issuing a 779 COMMIT after WRITEs with stable_how != FILE_SYNC (see Section 3.3.7 780 of [RFC1813]).

782 4.2. Storage Device Multipathing

784 The flexible file layout type supports multipathing to multiple 785 storage device addresses. Storage-device-level multipathing is used 786 for bandwidth scaling via trunking and for higher availability of use 787 in the event of a storage device failure. Multipathing allows the 788 client to switch to another storage device address, which may be that 789 of another storage device that is exporting the same data stripe 790 unit, without having to contact the metadata server for a new layout.

792 To support storage device multipathing, ffda_netaddrs contains an 793 array of one or more storage device network addresses. This array 794 (data type multipath_list4) represents a list of storage devices 795 (each identified by a network address), with the possibility that 796 some storage devices will appear in the list multiple times.

798 The client is free to use any of the network addresses as a 799 destination to send storage device requests. If some network 800 addresses are less desirable paths to the data than others, then the 801 metadata server SHOULD NOT include those network addresses in 802 ffda_netaddrs. If less desirable network addresses exist to provide 803 failover, the RECOMMENDED method to offer the addresses is to provide 804 them in a replacement device-ID-to-device-address mapping, or a 805 replacement device ID. When a client finds no response from the 806 storage device using all addresses available in ffda_netaddrs, it 807 SHOULD send a GETDEVICEINFO to attempt to replace the existing 808 device-ID-to-device-address mappings. If the metadata server detects 809 that all network paths represented by ffda_netaddrs are unavailable, 810 the metadata server SHOULD send a CB_NOTIFY_DEVICEID (if the client 811 has indicated it wants device ID notifications for changed device 812 IDs) to change the device-ID-to-device-address mappings to the 813 available addresses.
If the device ID itself will be replaced, the 814 metadata server SHOULD recall all layouts with the device ID, and 815 thus force the client to get new layouts and device ID mappings via 816 LAYOUTGET and GETDEVICEINFO. 818 Generally, if two network addresses appear in ffda_netaddrs, they 819 will designate the same storage device. When the storage device is 820 accessed over NFSv4.1 or a higher minor version, the two storage 821 device addresses will support the implementation of client ID or 822 session trunking (the latter is RECOMMENDED) as defined in [RFC5661]. 823 The two storage device addresses will share the same server owner or 824 major ID of the server owner. It is not always necessary for the two 825 storage device addresses to designate the same storage device with 826 trunking being used. For example, the data could be read-only, and 827 the data consist of exact replicas. 829 5. Flexible File Layout Type 831 The layout4 type is defined in [RFC5662] as follows: 833 835 enum layouttype4 { 836 LAYOUT4_NFSV4_1_FILES = 1, 837 LAYOUT4_OSD2_OBJECTS = 2, 838 LAYOUT4_BLOCK_VOLUME = 3, 839 LAYOUT4_FLEX_FILES = 4 840 [[RFC Editor: please modify the LAYOUT4_FLEX_FILES 841 to be the layouttype assigned by IANA]] 842 }; 844 struct layout_content4 { 845 layouttype4 loc_type; 846 opaque loc_body<>; 847 }; 849 struct layout4 { 850 offset4 lo_offset; 851 length4 lo_length; 852 layoutiomode4 lo_iomode; 853 layout_content4 lo_content; 854 }; 856 858 This document defines structures associated with the layouttype4 859 value LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure 860 as an XDR type "opaque". The opaque layout is uninterpreted by the 861 generic pNFS client layers, but is interpreted by the flexible file 862 layout type implementation. This section defines the structure of 863 this otherwise opaque value, ff_layout4. 865 5.1. ff_layout4 867 869 /// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001; 870 /// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002; 871 /// const FF_FLAGS_NO_READ_IO = 0x00000004; 872 /// const FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008; 874 /// typedef uint32_t ff_flags4; 875 /// 877 /// struct ff_data_server4 { 878 /// deviceid4 ffds_deviceid; 879 /// uint32_t ffds_efficiency; 880 /// stateid4 ffds_stateid; 881 /// nfs_fh4 ffds_fh_vers<>; 882 /// fattr4_owner ffds_user; 883 /// fattr4_owner_group ffds_group; 884 /// }; 885 /// 887 /// struct ff_mirror4 { 888 /// ff_data_server4 ffm_data_servers<>; 889 /// }; 890 /// 892 /// struct ff_layout4 { 893 /// length4 ffl_stripe_unit; 894 /// ff_mirror4 ffl_mirrors<>; 895 /// ff_flags4 ffl_flags; 896 /// uint32_t ffl_stats_collect_hint; 897 /// }; 898 /// 900 902 The ff_layout4 structure specifies a layout in that portion of the 903 data file described in the current layout segment. It is either a 904 single instance or a set of mirrored copies of that portion of the 905 data file. When mirroring is in effect, it protects against loss of 906 data in layout segments. Note that while not explicitly shown in the 907 above XDR, each layout4 element returned in the logr_layout array of 908 LAYOUTGET4res (see Section 18.43.1 of [RFC5661]) describes a layout 909 segment. Hence each ff_layout4 also describes a layout segment. 911 It is possible that the file is concatenated from more than one 912 layout segment. Each layout segment MAY represent different striping 913 parameters, applying respectively only to the layout segment byte 914 range. 
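For illustration only, a file concatenated from two layout segments might look as follows; finding the segment that covers a given offset is a simple range lookup. The tuple layout and the offsets below are hypothetical stand-ins for layout4/ff_layout4 fields, not a defined encoding.

    # Illustrative sketch: two layout segments with different striping
    # parameters for one file.

    segments = [
        # (lo_offset, lo_length, ffl_stripe_unit)
        (0,       1048576, 65536),    # first 1 MB, 64 KB stripe unit
        (1048576, 2097152, 131072),   # next 2 MB, 128 KB stripe unit
    ]

    def segment_for(offset):
        for seg in segments:
            lo_offset, lo_length, _ = seg
            if lo_offset <= offset < lo_offset + lo_length:
                return seg
        return None   # no layout segment held; fall back to the MDS

    assert segment_for(2000000) == (1048576, 2097152, 131072)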
916 The ffl_stripe_unit field is the stripe unit size in use for the 917 current layout segment. The number of stripes is given inside each 918 mirror by the number of elements in ffm_data_servers. If the number 919 of stripes is one, then the value for ffl_stripe_unit MUST default to 920 zero. The only supported mapping scheme is sparse and is detailed in 921 Section 6. Note that there is an assumption here that both the 922 stripe unit size and the number of stripes are the same across all 923 mirrors.

925 The ffl_mirrors field is the array of mirrored storage devices which 926 provide the storage for the current stripe; see Figure 1.

928 The ffl_stats_collect_hint field provides a hint to the client on how 929 often the server wants it to report LAYOUTSTATS for a file. The time 930 is in seconds.

932                    +-----------+
933                    |           |
934                    |           |
935                    |   File    |
936                    |           |
937                    |           |
938                    +-----+-----+
939                          |
940             +------------+------------+
941             |                         |
942        +----+-----+             +-----+----+
943        | Mirror 1 |             | Mirror 2 |
944        +----+-----+             +-----+----+
945             |                         |
946        +-----------+            +-----------+
947        |+-----------+           |+-----------+
948        ||+-----------+          ||+-----------+
949        +|| Storage   |          +|| Storage   |
950         +| Devices   |           +| Devices   |
951          +-----------+            +-----------+

953                        Figure 1

955 The ffl_mirrors field represents an array of state information for 956 each mirrored copy of the current layout segment. Each element is 957 described by an ff_mirror4 type.

959 ffds_deviceid provides the deviceid of the storage device holding the 960 data file.

962 ffds_fh_vers is an array of filehandles of the data file matching 963 the available NFS versions on the given storage device. There MUST 964 be exactly as many elements in ffds_fh_vers as there are in 965 ffda_versions. Each element of the array corresponds to a particular 966 combination of ffdv_version, ffdv_minorversion, and 967 ffdv_tightly_coupled provided for the device. The array allows for 968 server implementations which have different filehandles for different 969 combinations of version, minor version, and coupling strength. See 970 Section 5.4 for how to handle versioning issues between the client 971 and storage devices.

973 For tight coupling, ffds_stateid provides the stateid to be used by 974 the client to access the file. For loose coupling and an NFSv4 975 storage device, the client will have to use an anonymous stateid to 976 perform I/O on the storage device. With no control protocol, the 977 metadata server stateid cannot be used to provide a global stateid 978 model. Thus the server MUST set the ffds_stateid to be the anonymous 979 stateid.

981 This specification of the ffds_stateid restricts both models for 982 NFSv4.x storage protocols:

984 loosely coupled: the stateid has to be an anonymous stateid;

986 tightly coupled: the stateid has to be a global stateid.

988 A number of issues stem from a mismatch between the fact that 989 ffds_stateid is defined as a single item while ffds_fh_vers is 990 defined as an array. It is possible for each open file on the 991 storage device to require its own open stateid. Because there are 992 established loosely coupled implementations of the version of the 993 protocol described in this document, such potential issues have not 994 been addressed here. It is possible for future layout types to be 995 defined that address these issues, should it become important to 996 provide multiple stateids for the same underlying file.
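The required index pairing between ffds_fh_vers and ffda_versions can be illustrated with a short, non-normative sketch; plain lists and tuples stand in for the XDR arrays.

    # Sketch: the filehandle used is the one at the same index as the
    # ff_device_versions4 entry the client selected.

    def data_fh(ffda_versions, ffds_fh_vers, chosen):
        # The two arrays MUST have exactly as many elements each.
        assert len(ffda_versions) == len(ffds_fh_vers)
        return ffds_fh_vers[ffda_versions.index(chosen)]

    # Example: the NFSv4.1 entry pairs with the second filehandle.
    assert data_fh([(3, 0), (4, 1)], [b"fh-v3", b"fh-v41"],
                   (4, 1)) == b"fh-v41"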
998 For loosely coupled storage devices, ffds_user and ffds_group provide 999 the synthetic user and group to be used in the RPC credentials that 1000 the client presents to the storage device to access the data files. 1001 For tightly coupled storage devices, the user and group on the 1002 storage device will be the same as on the metadata server. I.e., if 1003 ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST 1004 ignore both ffds_user and ffds_group.

1006 The allowed values for both ffds_user and ffds_group are specified in 1007 Section 5.9 of [RFC5661]. For NFSv3 compatibility, user and group 1008 strings that consist of decimal numeric values with no leading zeros 1009 can be given a special interpretation by clients and servers that 1010 choose to provide such support. The receiver may treat such a user 1011 or group string as representing the same user as would be represented 1012 by an NFSv3 uid or gid having the corresponding numeric value. Note 1013 that if using Kerberos for security, the expectation is that these 1014 values will be a name@domain string.

1016 ffds_efficiency describes the metadata server's evaluation as to the 1017 effectiveness of each mirror. Note that this is per layout and not 1018 per device, as the metric may change due to perceived load, 1019 availability to the metadata server, etc. Higher values denote 1020 higher perceived utility. The way the client can select the best 1021 mirror to access is discussed in Section 8.1.

1023 ffl_flags is a bitmap that allows the metadata server to inform the 1024 client of particular conditions that may result from the more or less 1025 tight coupling of the storage devices.

1027 FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is 1028 not required to send LAYOUTCOMMIT to the metadata server.

1030 FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client 1031 should not send I/O operations to the metadata server. I.e., even 1032 if the client could determine that there was a network disconnect 1033 to a storage device, the client should not try to proxy the I/O 1034 through the metadata server.

1036 FF_FLAGS_NO_READ_IO: can be set to indicate that the client should 1037 not send READ requests with layouts of iomode 1038 LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode 1039 LAYOUTIOMODE4_READ from the metadata server.

1041 FF_FLAGS_WRITE_ONE_MIRROR: can be set to indicate that the client 1042 only needs to update one of the mirrors (see Section 8.2).

1044 5.1.1. Error Codes from LAYOUTGET

1046 [RFC5661] provides little guidance as to how the client is to proceed 1047 with a LAYOUTGET which returns an error of 1048 NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY. 1049 Within the context of this document (a sketch of these distinctions follows this list):

1051 NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O 1052 is to go to the metadata server. Note that it is possible to have 1053 had a layout before a recall and not after.

1055 NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout 1056 from being granted. If the client already has an appropriate 1057 layout, it should continue with I/O to the storage devices.

1059 NFS4ERR_DELAY: there is some issue preventing the layout from being 1060 granted. If the client already has an appropriate layout, it 1061 should not continue with I/O to the storage devices.
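The sketch below restates the three cases in executable form. It is non-normative; the action strings are informal labels, not protocol elements.

    # Sketch of the LAYOUTGET error distinctions above.

    def on_layoutget_error(err, have_usable_layout):
        if err == "NFS4ERR_LAYOUTUNAVAILABLE":
            return "do I/O through the metadata server"
        if err == "NFS4ERR_LAYOUTTRYLATER" and have_usable_layout:
            return "continue I/O to the storage devices"
        if err == "NFS4ERR_DELAY" and have_usable_layout:
            return "pause I/O and retry LAYOUTGET"   # do not keep going
        return "wait and retry LAYOUTGET"

    assert on_layoutget_error("NFS4ERR_DELAY", True) \
        == "pause I/O and retry LAYOUTGET"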
1063 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS

1065 Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS 1066 flag, the client can still perform I/O to the metadata server. The 1067 flag functions as a hint. The flag indicates to the client that the 1068 metadata server prefers to separate the metadata I/O from the data 1069 I/O, most likely for performance reasons.

1071 5.2. LAYOUTCOMMIT

1073 The flex file layout does not use lou_body. If lou_type is 1074 LAYOUT4_FLEX_FILES, the lou_body field MUST have a zero length.

1076 5.3. Interactions Between Devices and Layouts

1078 In [RFC5661], the file layout type is defined such that the 1079 relationship between multipathing and filehandles can result in 1080 either 0, 1, or N filehandles (see Section 13.3). Some rationales 1081 for this are clustered servers which share the same filehandle or 1082 allowing for multiple read-only copies of the file on the same 1083 storage device. In the flexible file layout type, while there is an 1084 array of filehandles, they are independent of the multipathing being 1085 used. If the metadata server wants to provide multiple read-only 1086 copies of the same file on the same storage device, then it should 1087 provide multiple ff_device_addr4, each as a mirror. The client can 1088 then determine that, since the ffds_fh_vers are different, there 1089 are multiple copies of the file for the current layout segment 1090 available.

1092 5.4. Handling Version Errors

1094 When the metadata server provides the ffda_versions array in the 1095 ff_device_addr4 (see Section 4.1), the client is able to determine if 1096 it cannot access a storage device with any of the supplied 1097 combinations of ffdv_version, ffdv_minorversion, and 1098 ffdv_tightly_coupled. However, due to the limitations of reporting 1099 errors in GETDEVICEINFO (see Section 18.40 of [RFC5661]), the client 1100 is not able to specify which specific device it cannot communicate 1101 with over one of the provided ffdv_version and ffdv_minorversion 1102 combinations. Using ff_ioerr4 (see Section 9.1.1) inside either the 1103 LAYOUTRETURN (see Section 18.44 of [RFC5661]) or the LAYOUTERROR (see 1104 Section 15.6 of [RFC7862] and Section 10 of this document), the 1105 client can isolate the problematic storage device.

1107 The error code to return for LAYOUTRETURN and/or LAYOUTERROR is 1108 NFS4ERR_MINOR_VERS_MISMATCH. It does not matter whether the mismatch 1109 is a major version (e.g., client can use NFSv3 but not NFSv4) or 1110 minor version (e.g., client can use NFSv4.1 but not NFSv4.2); the 1111 error indicates that for all the supplied combinations of 1112 ffdv_version and ffdv_minorversion, the client cannot communicate 1113 with the storage device. The client can retry the GETDEVICEINFO to 1114 see if the metadata server can provide a different combination, or it 1115 can fall back to doing the I/O through the metadata server.

1117 6. Striping via Sparse Mapping

1119 While other layout types support both dense and sparse mapping of 1120 logical offsets to physical offsets within a file (see, for example, 1121 Section 13.4 of [RFC5661]), the flexible file layout type only 1122 supports a sparse mapping.

1124 With sparse mappings, the logical offset within a file (L) is also 1125 the physical offset on the storage device. As detailed in 1126 Section 13.4.4 of [RFC5661], this results in holes across each 1127 storage device which does not contain the current stripe index.

1129 L: logical offset into the file

1131 W: stripe width
1132    W = number of elements in ffm_data_servers

1134 S: number of bytes in a stripe
1135    S = W * ffl_stripe_unit

1137 N: stripe number
1138    N = L / S
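The arithmetic above can be restated as executable code. The device-index rule ((L / stripe_unit) mod W) follows the sparse striping of Section 13.4 of [RFC5661]; with a sparse mapping, the physical offset on the chosen storage device equals L.

    # Non-normative sketch of the sparse mapping defined above.

    def locate(L, stripe_unit, W):
        S = W * stripe_unit           # bytes in a full stripe
        N = L // S                    # stripe number
        idx = (L // stripe_unit) % W  # index into ffm_data_servers
        return N, idx, L              # stripe, device index, phys offset

    # Example: 64 KB stripe unit across 3 storage devices.
    assert locate(200000, 65536, 3) == (1, 0, 200000)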
1140 7. Recovering from Client I/O Errors

1142 The pNFS client may encounter errors when directly accessing the 1143 storage devices. However, it is the responsibility of the metadata 1144 server to recover from the I/O errors. When the LAYOUT4_FLEX_FILES 1145 layout type is used, the client MUST report the I/O errors to the 1146 server at LAYOUTRETURN time using the ff_ioerr4 structure (see 1147 Section 9.1.1).

1149 The metadata server analyzes the error and determines the required 1150 recovery operations, such as recovering media failures or 1151 reconstructing missing data files.

1153 The metadata server MUST recall any outstanding layouts to allow it 1154 exclusive write access to the stripes being recovered and to prevent 1155 other clients from hitting the same error condition. In these cases, 1156 the server MUST complete recovery before handing out any new layouts 1157 to the affected byte ranges.

1159 Although the client implementation has the option to propagate a 1160 corresponding error to the application that initiated the I/O 1161 operation and drop any unwritten data, the client should attempt to 1162 retry the original I/O operation by either requesting a new layout or 1163 sending the I/O via regular NFSv4.1+ READ or WRITE operations to the 1164 metadata server. The client SHOULD attempt to retrieve a new layout 1165 and retry the I/O operation using the storage device first; only 1166 if the error persists should it retry the I/O operation via the metadata 1167 server.

1169 8. Mirroring

1171 The flexible file layout type has a simple model in place for the 1172 mirroring of the file data constrained by a layout segment. There is 1173 no assumption that each copy of the mirror is stored identically on 1174 the storage devices. For example, one device might employ 1175 compression or deduplication on the data. However, the over-the-wire 1176 transfer of the file contents MUST appear identical. Note that this is a 1177 constraint of the selected XDR representation, in which each mirrored 1178 copy of the layout segment has the same striping pattern (see 1179 Figure 1).

1181 The metadata server is responsible for determining the number of 1182 mirrored copies and the location of each mirror. While the client 1183 may provide a hint as to how many copies it wants (see Section 12), the 1184 metadata server can ignore that hint; in any event, the client has 1185 no means to dictate either the storage device (which also means the 1186 coupling and/or protocol levels to access the layout segments) or the 1187 location of said storage device.

1189 The updating of mirrored layout segments is done via client-side 1190 mirroring. With this approach, the client is responsible for making 1191 sure modifications are made on all copies of the layout segments it 1192 is informed of via the layout. If a layout segment is being 1193 resilvered to a storage device, that mirrored copy will not be in the 1194 layout. Thus the metadata server MUST update that copy until the 1195 client is presented with it in a layout. If FF_FLAGS_WRITE_ONE_MIRROR 1196 is set in ffl_flags, the client need only update one of the mirrors 1197 (see Section 8.2). If the client is writing to the layout segments 1198 via the metadata server, then the metadata server MUST update all 1199 copies of the mirror.
8.  Mirroring

   The flexible file layout type has a simple model in place for the
   mirroring of the file data constrained by a layout segment.  There
   is no assumption that each copy of the mirror is stored identically
   on the storage devices.  For example, one device might employ
   compression or deduplication on the data.  However, the over-the-wire
   transfer of the file contents MUST appear identical.  Note, this is
   a constraint of the selected XDR representation in which each
   mirrored copy of the layout segment has the same striping pattern
   (see Figure 1).

   The metadata server is responsible for determining the number of
   mirrored copies and the location of each mirror.  While the client
   may provide a hint as to how many copies it wants (see Section 12),
   the metadata server can ignore that hint; in any event, the client
   has no means to dictate either the storage device (which also means
   the coupling and/or protocol levels to access the layout segments)
   or the location of said storage device.

   The updating of mirrored layout segments is done via client-side
   mirroring.  With this approach, the client is responsible for
   making sure modifications are made on all copies of the layout
   segments it is informed of via the layout.  If a layout segment is
   being resilvered to a storage device, that mirrored copy will not
   be in the layout.  Thus, the metadata server MUST update that copy
   until the client is presented it in a layout.  If
   FF_FLAGS_WRITE_ONE_MIRROR is set in ffl_flags, the client need only
   update one of the mirrors (see Section 8.2).  If the client is
   writing to the layout segments via the metadata server, then the
   metadata server MUST update all copies of the mirror.  As seen in
   Section 8.3, during the resilvering, the layout is recalled, and
   the client has to make modifications via the metadata server.

8.1.  Selecting a Mirror

   When the metadata server grants a layout to a client, it MAY let
   the client know via the ffds_efficiency member how fast it expects
   each mirror to be once the request arrives at the storage devices.
   While the algorithms to calculate that value are left to the
   metadata server implementations, factors that could contribute to
   that calculation include speed of the storage device, physical
   memory available to the device, operating system version, current
   load, etc.

   However, what should not be involved in that calculation is a
   perceived network distance between the client and the storage
   device.  The client is better situated for making that
   determination based on past interaction with the storage device
   over the different available network interfaces between the two.
   For example, the metadata server might not know about a transient
   outage between the client and storage device because it has no
   presence on the given subnet.

   As such, it is the client which decides which mirror to access for
   reading the file.  The requirements for writing to mirrored layout
   segments are presented below.

8.2.  Writing to Mirrors

8.2.1.  Single Storage Device Updates Mirrors

   If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the
   client only needs to update one of the copies of the layout
   segment.  For this case, the storage device MUST ensure that all
   copies of the mirror are updated when any one of the mirrors is
   updated.  If the storage device gets an error when updating one of
   the mirrors, then it MUST inform the client that the original WRITE
   had an error.  The client then MUST inform the metadata server (see
   Section 8.2.3).  The client's responsibility with respect to COMMIT
   is explained in Section 8.2.4.  The client may choose any one of
   the mirrors and may use ffds_efficiency in the same manner as for
   reading when making this choice.

8.2.2.  Client Updates All Mirrors

   If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the
   client is responsible for updating all mirrored copies of the
   layout segments that it is given in the layout.  A single failed
   update is sufficient to fail the entire operation.  If all but one
   copy is updated successfully and the last one provides an error,
   then the client needs to inform the metadata server via either
   LAYOUTRETURN or LAYOUTERROR that the update to that storage device
   failed.  If the client is updating the mirrors serially, then it
   SHOULD stop at the first error encountered and report that to the
   metadata server.  If the client is updating the mirrors in
   parallel, then it SHOULD wait until all storage devices respond
   such that it can report all errors encountered during the update.
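   A non-normative sketch of the "update all mirrors" path follows;
   write_mirror() and report_errors_to_mds() are hypothetical client
   helpers, and the simple loop stands in for parallel dispatch.

   #include <stddef.h>

   struct mirror;                  /* one ffm_data_servers entry */

   extern int  write_mirror(struct mirror *m, const void *buf,
                            size_t len);
   extern void report_errors_to_mds(const int *errors, size_t n);

   /* Write to every mirrored copy; a single failure fails the whole
    * operation, but all storage devices are waited on so that every
    * error can be reported to the metadata server. */
   static int
   ff_write_all_mirrors(struct mirror **mirrors, size_t n,
                        const void *buf, size_t len)
   {
           int ret = 0;
           int errors[16] = { 0 };   /* assume n <= 16 for brevity */
           size_t i;

           for (i = 0; i < n; i++) {
                   errors[i] = write_mirror(mirrors[i], buf, len);
                   if (errors[i] != 0)
                           ret = errors[i];  /* remember a failure */
           }
           if (ret != 0)
                   report_errors_to_mds(errors, n);
           return ret;
   }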
8.2.3.  Handling Write Errors

   When the client reports a write error to the metadata server, the
   metadata server is responsible for determining if it wants to
   remove the errant mirror from the layout, if the mirror has
   recovered from some transient error, etc.  When the client tries to
   get a new layout, the metadata server informs it of the decision by
   the contents of the layout.  The client MUST NOT make any
   assumptions that the contents of the previous layout will match
   those of the new one.  If it has updates that were not committed to
   all mirrors, then it MUST resend those updates to all mirrors.

   There is no provision in the protocol for the metadata server to
   directly determine whether the client has recovered from an error.
   For example, assume that the storage device was network partitioned
   from the client and all of the copies are successfully updated
   after the error was reported.  There is no mechanism for the client
   to report that fact, and the metadata server is forced to repair
   the file across the mirror.

   If the client supports NFSv4.2, it can use LAYOUTERROR and
   LAYOUTRETURN to provide hints to the metadata server about the
   recovery efforts.  A LAYOUTERROR on a file is for a non-fatal
   error.  A subsequent LAYOUTRETURN without an ff_ioerr4 indicates
   that the client successfully replayed the I/O to all mirrors.  Any
   LAYOUTRETURN with an ff_ioerr4 is an error that the metadata server
   needs to repair.  The client MUST be prepared for the LAYOUTERROR
   to trigger a CB_LAYOUTRECALL if the metadata server determines it
   needs to start repairing the file.

8.2.4.  Handling Write COMMITs

   When stable writes are done to the metadata server or to a single
   replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is
   the responsibility of the receiving node to propagate the written
   data stably, before replying to the client.

   In the corresponding cases in which unstable writes are done, the
   receiving node does not have any such obligation, although it may
   choose to asynchronously propagate the updates.  However, once a
   COMMIT is replied to, all replicas must reflect the writes that
   have been done, and this data must have been committed to stable
   storage on all replicas.

   In order to avoid situations in which stale data is read from
   replicas to which writes have not been propagated (see the sketch
   after this list):

   o  A client which has outstanding unstable writes made to a single
      node (metadata server or storage device) MUST do all reads from
      that same node.

   o  When writes are flushed to the server, for example, to implement
      close-to-open semantics, a COMMIT must be done by the client to
      ensure that up-to-date written data will be available
      irrespective of the particular replica read.
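   The read-affinity rule above can be sketched as follows; the
   structures and helpers are hypothetical client bookkeeping, not
   protocol elements.

   #include <stddef.h>

   struct node;               /* metadata server or storage device */

   struct open_file {
           struct node *dirty_node; /* node holding unstable writes */
   };

   /* Reads MUST go to the node holding outstanding unstable
    * writes. */
   static struct node *
   pick_read_node(struct open_file *f, struct node *preferred)
   {
           return f->dirty_node != NULL ? f->dirty_node : preferred;
   }

   /* Once a COMMIT has been replied to, all replicas reflect the
    * writes, so the read-affinity restriction can be dropped. */
   static void
   commit_done(struct open_file *f)
   {
           f->dirty_node = NULL;
   }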
8.3.  Metadata Server Resilvering of the File

   The metadata server may elect to create a new mirror of the layout
   segments at any time.  This might be to resilver a copy on a
   storage device which was down for servicing, to provide a copy of
   the layout segments on storage with different storage performance
   characteristics, etc.  As the client will not be aware of the new
   mirror and the metadata server will not be aware of updates that
   the client is making to the layout segments, the metadata server
   MUST recall the writable layout segment(s) that it is resilvering.
   If the client issues a LAYOUTGET for a writable layout segment
   which is in the process of being resilvered, then the metadata
   server can deny that request with an NFS4ERR_LAYOUTUNAVAILABLE.
   The client would then have to perform the I/O through the metadata
   server.

9.  Flexible Files Layout Type Return

   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
   layout-type specific information to the server.  It is defined in
   Section 18.44.1 of [RFC5661] as follows:

   <CODE BEGINS>

   /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */
   const LAYOUT4_RET_REC_FILE = 1;
   const LAYOUT4_RET_REC_FSID = 2;
   const LAYOUT4_RET_REC_ALL  = 3;

   enum layoutreturn_type4 {
           LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE,
           LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID,
           LAYOUTRETURN4_ALL  = LAYOUT4_RET_REC_ALL
   };

   struct layoutreturn_file4 {
           offset4   lrf_offset;
           length4   lrf_length;
           stateid4  lrf_stateid;
           /* layouttype4 specific data */
           opaque    lrf_body<>;
   };

   union layoutreturn4 switch (layoutreturn_type4 lr_returntype) {
           case LAYOUTRETURN4_FILE:
                   layoutreturn_file4  lr_layout;
           default:
                   void;
   };

   struct LAYOUTRETURN4args {
           /* CURRENT_FH: file */
           bool                  lora_reclaim;
           layoutreturn_stateid  lora_recallstateid;
           layouttype4           lora_layout_type;
           layoutiomode4         lora_iomode;
           layoutreturn4         lora_layoutreturn;
   };

   <CODE ENDS>

   If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the
   lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value
   is defined by ff_layoutreturn4 (see Section 9.3).  It allows the
   client to report I/O error information or layout usage statistics
   back to the metadata server as defined below.  Note that while the
   data structures are built on concepts introduced in NFSv4.2, the
   effective discriminated union (lora_layout_type combined with
   ff_layoutreturn4) allows for an NFSv4.1 metadata server to utilize
   the data.

9.1.  I/O Error Reporting

9.1.1.  ff_ioerr4

   <CODE BEGINS>

   /// struct ff_ioerr4 {
   ///         offset4        ffie_offset;
   ///         length4        ffie_length;
   ///         stateid4       ffie_stateid;
   ///         device_error4  ffie_errors<>;
   /// };
   ///

   <CODE ENDS>

   Recall that [RFC7862] defines device_error4 as:

   <CODE BEGINS>

   struct device_error4 {
           deviceid4   de_deviceid;
           nfsstat4    de_status;
           nfs_opnum4  de_opnum;
   };

   <CODE ENDS>

   The ff_ioerr4 structure is used to return error indications for
   data files that generated errors during data transfers.  These are
   hints to the metadata server that there are problems with that
   file.  For each error, ffie_errors.de_deviceid, ffie_offset, and
   ffie_length represent the storage device and byte range within the
   file in which the error occurred; ffie_errors represents the
   operation and type of error.  The use of device_error4 is described
   in Section 15.6 of [RFC7862].

   Even though the storage device might be accessed via NFSv3 and
   report back NFSv3 errors to the client, the client is responsible
   for mapping these to appropriate NFSv4 status codes as de_status.
   Likewise, the NFSv3 operations need to be mapped to equivalent
   NFSv4 operations.
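   As an illustration of that mapping, a client might translate NFSv3
   status codes into de_status values along the following lines.  The
   numeric values come from [RFC1813] and [RFC5661]; the set of codes
   shown (and the fallback choice) is illustrative, not exhaustive or
   mandated.

   /* A few NFSv3 status codes (RFC 1813). */
   enum { NFS3ERR_IO = 5, NFS3ERR_ACCES = 13, NFS3ERR_NOSPC = 28,
          NFS3ERR_STALE = 70, NFS3ERR_JUKEBOX = 10008 };

   /* The corresponding NFSv4 status codes (RFC 5661). */
   enum { NFS4ERR_IO = 5, NFS4ERR_ACCESS = 13, NFS4ERR_NOSPC = 28,
          NFS4ERR_STALE = 70, NFS4ERR_DELAY = 10008 };

   /* Map an NFSv3 error to an NFSv4 de_status for ff_ioerr4. */
   static int
   nfs3_to_nfs4_status(int nfs3_status)
   {
           switch (nfs3_status) {
           case NFS3ERR_IO:      return NFS4ERR_IO;
           case NFS3ERR_ACCES:   return NFS4ERR_ACCESS;
           case NFS3ERR_NOSPC:   return NFS4ERR_NOSPC;
           case NFS3ERR_STALE:   return NFS4ERR_STALE;
           case NFS3ERR_JUKEBOX: return NFS4ERR_DELAY;
           default:              return NFS4ERR_IO;
           }
   }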
9.2.  Layout Usage Statistics

9.2.1.  ff_io_latency4

   <CODE BEGINS>

   /// struct ff_io_latency4 {
   ///         uint64_t  ffil_ops_requested;
   ///         uint64_t  ffil_bytes_requested;
   ///         uint64_t  ffil_ops_completed;
   ///         uint64_t  ffil_bytes_completed;
   ///         uint64_t  ffil_bytes_not_delivered;
   ///         nfstime4  ffil_total_busy_time;
   ///         nfstime4  ffil_aggregate_completion_time;
   /// };
   ///

   <CODE ENDS>

   Both operation counts and bytes transferred are kept in the
   ff_io_latency4.  As seen in ff_layoutupdate4 (see Section 9.2.2),
   read and write operations are aggregated separately.  READ
   operations are used for the ff_io_latency4 ffl_read.  Both WRITE
   and COMMIT operations are used for the ff_io_latency4 ffl_write.
   "Requested" counters track what the client is attempting to do, and
   "completed" counters track what was done.  There is no requirement
   that the client only report completed results that have matching
   requested results from the reported period.

   ffil_bytes_not_delivered is used to track the aggregate number of
   bytes requested but not fulfilled due to error conditions.
   ffil_total_busy_time is the aggregate time spent with outstanding
   RPC calls.  ffil_aggregate_completion_time is the sum of all
   round-trip times for completed RPC calls.

   In Section 3.3.1 of [RFC5661], nfstime4 is defined as the number of
   seconds and nanoseconds since midnight or zero hour January 1, 1970
   Coordinated Universal Time (UTC).  The use of nfstime4 in
   ff_io_latency4 is to store the time since the start of the first
   I/O from the client after receiving the layout.  In other words,
   these are to be decoded as a duration and not as a date and time.

   Note that LAYOUTSTATS are cumulative, i.e., not reset each time the
   operation is sent.  If two LAYOUTSTATS operations for the same file
   and layout stateid, originating from the same NFS client, are
   processed at the same time by the metadata server, then the one
   containing the larger values contains the most recent time series
   data.

9.2.2.  ff_layoutupdate4

   <CODE BEGINS>

   /// struct ff_layoutupdate4 {
   ///         netaddr4        ffl_addr;
   ///         nfs_fh4         ffl_fhandle;
   ///         ff_io_latency4  ffl_read;
   ///         ff_io_latency4  ffl_write;
   ///         nfstime4        ffl_duration;
   ///         bool            ffl_local;
   /// };
   ///

   <CODE ENDS>

   ffl_addr differentiates which network address the client connected
   to on the storage device.  In the case of multipathing, ffl_fhandle
   indicates which read-only copy was selected.  ffl_read and
   ffl_write convey the latencies for read and write operations,
   respectively.  ffl_duration is used to indicate the time period
   over which the statistics were collected.  If ffl_local is true,
   the I/O was serviced by the client's cache.  This flag allows the
   client to inform the metadata server about "hot" access to a file
   it would not normally be allowed to report on.
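   To illustrate how these cumulative counters might be consumed, the
   following non-normative C sketch derives a mean round-trip time
   from an ff_io_latency4.  The struct declarations simply mirror the
   XDR above, and nfstime4 is decoded as a duration per Section 9.2.1;
   the helper itself is illustrative, not part of the protocol.

   #include <stdint.h>

   struct nfstime4 {
           int64_t  seconds;
           uint32_t nseconds;
   };

   struct ff_io_latency4 {
           uint64_t        ffil_ops_requested;
           uint64_t        ffil_bytes_requested;
           uint64_t        ffil_ops_completed;
           uint64_t        ffil_bytes_completed;
           uint64_t        ffil_bytes_not_delivered;
           struct nfstime4 ffil_total_busy_time;
           struct nfstime4 ffil_aggregate_completion_time;
   };

   /* Mean RTT in nanoseconds over all completed RPCs.  The counters
    * are cumulative, so a consumer would diff two samples to get a
    * rate over a reporting interval. */
   static uint64_t
   mean_rtt_ns(const struct ff_io_latency4 *lat)
   {
           uint64_t total_ns;

           if (lat->ffil_ops_completed == 0)
                   return 0;
           total_ns =
               (uint64_t)lat->ffil_aggregate_completion_time.seconds
                   * 1000000000ULL
               + lat->ffil_aggregate_completion_time.nseconds;
           return total_ns / lat->ffil_ops_completed;
   }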
9.2.3.  ff_iostats4

   <CODE BEGINS>

   /// struct ff_iostats4 {
   ///         offset4           ffis_offset;
   ///         length4           ffis_length;
   ///         stateid4          ffis_stateid;
   ///         io_info4          ffis_read;
   ///         io_info4          ffis_write;
   ///         deviceid4         ffis_deviceid;
   ///         ff_layoutupdate4  ffis_layoutupdate;
   /// };
   ///

   <CODE ENDS>

   Recall that [RFC7862] defines io_info4 as:

   <CODE BEGINS>

   struct io_info4 {
           uint64_t  ii_count;
           uint64_t  ii_bytes;
   };

   <CODE ENDS>

   With pNFS, data transfers are performed directly between the pNFS
   client and the storage devices.  Therefore, the metadata server has
   no direct knowledge of the I/O operations being done and thus
   cannot create on its own statistical information about client I/O
   to optimize the data storage location.  ff_iostats4 MAY be used by
   the client to report I/O statistics back to the metadata server
   upon returning the layout.

   Since it is not feasible for the client to report every I/O that
   used the layout, the client MAY identify "hot" byte ranges for
   which to report I/O statistics.  The definition and/or
   configuration mechanism of what is considered "hot" and the size of
   the reported byte range are out of the scope of this document.  It
   is suggested that client implementations provide reasonable default
   values and an optional run-time management interface to control
   these parameters.  For example, a client can define the default
   byte range resolution to be 1 MB in size and the thresholds for
   reporting to be 1 MB/second or 10 I/O operations per second.

   For each byte range, ffis_offset and ffis_length represent the
   starting offset of the range and the range length in bytes.
   ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and
   ffis_write.ii_bytes represent, respectively, the number of
   contiguous read and write I/Os and the respective aggregate number
   of bytes transferred within the reported byte range.

   The combination of ffis_deviceid and ffl_addr uniquely identifies
   both the storage path and the network route to it.  Finally,
   ffl_fhandle allows the metadata server to differentiate between
   multiple read-only copies of the file on the same storage device.
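   As an illustration of the suggested defaults above (1 MB ranges,
   with reporting thresholds of 1 MB/second or 10 I/O operations per
   second), a client might filter candidate ranges with a check along
   these lines.  The structure and thresholds are hypothetical client
   policy, not mandated by this document.

   #include <stdbool.h>
   #include <stdint.h>

   /* Per-range counters a client might keep for one 1 MB bucket. */
   struct range_stats {
           uint64_t ops;     /* I/Os within the range         */
           uint64_t bytes;   /* bytes within the range        */
           uint64_t seconds; /* length of the sampling window */
   };

   /* Report the range only if it crosses either threshold. */
   static bool
   range_is_hot(const struct range_stats *rs)
   {
           const uint64_t BPS_THRESHOLD  = 1048576; /* 1 MB/s  */
           const uint64_t IOPS_THRESHOLD = 10;      /* 10 IOPS */

           if (rs->seconds == 0)
                   return false;
           return rs->bytes / rs->seconds >= BPS_THRESHOLD ||
                  rs->ops / rs->seconds >= IOPS_THRESHOLD;
   }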
9.3.  ff_layoutreturn4

   <CODE BEGINS>

   /// struct ff_layoutreturn4 {
   ///         ff_ioerr4    fflr_ioerr_report<>;
   ///         ff_iostats4  fflr_iostats_report<>;
   /// };
   ///

   <CODE ENDS>

   When data file I/O operations fail, fflr_ioerr_report<> is used to
   report these errors to the metadata server as an array of elements
   of type ff_ioerr4.  Each element in the array represents an error
   that occurred on the data file identified by
   ffie_errors.de_deviceid.  If no errors are to be reported, the size
   of the fflr_ioerr_report<> array is set to zero.  The client MAY
   also use fflr_iostats_report<> to report a list of I/O statistics
   as an array of elements of type ff_iostats4.  Each element in the
   array represents statistics for a particular byte range.  Byte
   ranges are not guaranteed to be disjoint and MAY repeat or
   intersect.

10.  Flexible Files Layout Type LAYOUTERROR

   If the client is using NFSv4.2 to communicate with the metadata
   server, then instead of waiting for a LAYOUTRETURN to send error
   information to the metadata server (see Section 9.1), it MAY use
   LAYOUTERROR (see Section 15.6 of [RFC7862]) to communicate that
   information.  For the flexible files layout type, this means that
   LAYOUTERROR4args is treated the same as ff_ioerr4.

11.  Flexible Files Layout Type LAYOUTSTATS

   If the client is using NFSv4.2 to communicate with the metadata
   server, then instead of waiting for a LAYOUTRETURN to send I/O
   statistics to the metadata server (see Section 9.2), it MAY use
   LAYOUTSTATS (see Section 15.7 of [RFC7862]) to communicate that
   information.  For the flexible files layout type, this means that
   LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same
   contents as in ffis_layoutupdate.

12.  Flexible File Layout Type Creation Hint

   The layouthint4 type is defined in [RFC5661] as follows:

   <CODE BEGINS>

   struct layouthint4 {
           layouttype4  loh_type;
           opaque       loh_body<>;
   };

   <CODE ENDS>

   The layouthint4 structure is used by the client to pass a hint
   about the type of layout it would like created for a particular
   file.  If the loh_type layout type is LAYOUT4_FLEX_FILES, then the
   loh_body opaque value is defined by the ff_layouthint4 type.

12.1.  ff_layouthint4

   <CODE BEGINS>

   /// union ff_mirrors_hint switch (bool ffmc_valid) {
   ///         case TRUE:
   ///                 uint32_t  ffmc_mirrors;
   ///         case FALSE:
   ///                 void;
   /// };
   ///

   /// struct ff_layouthint4 {
   ///         ff_mirrors_hint  fflh_mirrors_hint;
   /// };
   ///

   <CODE ENDS>

   This type conveys hints for the desired data map.  All parameters
   are optional so the client can give values for only the parameter
   it cares about.

13.  Recalling a Layout

   While Section 12.5.5 of [RFC5661] discusses layout-type-independent
   reasons for recalling a layout, the flexible file layout type
   metadata server should recall outstanding layouts in the following
   cases:

   o  When the file's security policy changes, i.e., Access Control
      Lists (ACLs) or permission mode bits are set.

   o  When the file's layout changes, rendering outstanding layouts
      invalid.

   o  When existing layouts are inconsistent with the need to enforce
      locking constraints.

   o  When existing layouts are inconsistent with the requirements
      regarding resilvering as described in Section 8.3.

13.1.  CB_RECALL_ANY

   The metadata server can use the CB_RECALL_ANY callback operation to
   notify the client to return some or all of its layouts.
   Section 22.3 of [RFC5661] defines the allowed types of the "NFSv4
   Recallable Object Types Registry".

   <CODE BEGINS>

   /// const RCA4_TYPE_MASK_FF_LAYOUT_MIN = 16;
   /// const RCA4_TYPE_MASK_FF_LAYOUT_MAX = 17;
   [[RFC Editor: please insert assigned constants]]
   ///

   struct CB_RECALL_ANY4args {
           uint32_t  craa_layouts_to_keep;
           bitmap4   craa_type_mask;
   };

   <CODE ENDS>

   Typically, CB_RECALL_ANY will be used to recall client state when
   the server needs to reclaim resources.  The craa_type_mask bitmap
   specifies the type of resources that are recalled, and the
   craa_layouts_to_keep value specifies how many of the recalled
   flexible file layouts the client is allowed to keep.  The flexible
   file layout type mask flags are defined as follows:

   <CODE BEGINS>

   /// enum ff_cb_recall_any_mask {
   ///         FF_RCA4_TYPE_MASK_READ = -2,
   ///         FF_RCA4_TYPE_MASK_RW   = -1
   [[RFC Editor: please insert assigned constants]]
   /// };
   ///

   <CODE ENDS>

   They represent the iomode of the recalled layouts.  In response,
   the client SHOULD return layouts of the recalled iomode that it
   needs the least, keeping at most craa_layouts_to_keep flexible file
   layouts.

   The FF_RCA4_TYPE_MASK_READ flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
   FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts of
   iomode LAYOUTIOMODE4_RW.  When both mask flags are set, the client
   is notified to return layouts of either iomode.
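   A non-normative sketch of the client-side handling follows.  It
   assumes the standard bitmap4 encoding of [RFC5661] (bit n is bit
   n mod 32 of word n / 32) and uses the bit positions 16 and 17 from
   Table 2 as stand-ins for the constants to be assigned above;
   trim_layouts() is a hypothetical client helper.

   #include <stdbool.h>
   #include <stdint.h>

   extern void trim_layouts(bool trim_read, bool trim_rw,
                            uint32_t layouts_to_keep);

   /* Test bit 'n' of an NFSv4 bitmap4 (array of 32-bit words). */
   static bool
   bitmap4_test(const uint32_t *words, uint32_t nwords, uint32_t n)
   {
           if (n / 32 >= nwords)
                   return false;
           return (words[n / 32] >> (n % 32)) & 1;
   }

   /* Handle CB_RECALL_ANY: return the layouts of the recalled
    * iomode(s) that are needed least, keeping at most
    * craa_layouts_to_keep flexible file layouts. */
   static void
   ff_cb_recall_any(const uint32_t *craa_type_mask, uint32_t nwords,
                    uint32_t craa_layouts_to_keep)
   {
           bool trim_read = bitmap4_test(craa_type_mask, nwords, 16);
           bool trim_rw   = bitmap4_test(craa_type_mask, nwords, 17);

           if (trim_read || trim_rw)
                   trim_layouts(trim_read, trim_rw,
                                craa_layouts_to_keep);
   }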
14.  Client Fencing

   In cases where clients are uncommunicative and their lease has
   expired or when clients fail to return recalled layouts within a
   lease period, the server MAY revoke client layouts and reassign
   these resources to other clients (see Section 12.5.5 of [RFC5661]).
   To avoid data corruption, the metadata server MUST fence off the
   revoked clients from the respective data files as described in
   Section 2.2.

15.  Security Considerations

   The pNFS feature partitions the NFSv4.1+ file system protocol into
   two parts, the control path and the data path (storage protocol).
   The control path contains all the new operations described by this
   feature; all existing NFSv4 security mechanisms and features apply
   to the control path (see Sections 1.7.1 and 2.2.1 of [RFC5661]).
   The combination of components in a pNFS system is required to
   preserve the security properties of NFSv4.1+ with respect to an
   entity accessing data via a client, including security
   countermeasures to defend against threats for which NFSv4.1+
   provides defenses in environments where these threats are
   considered significant.

   The metadata server is primarily responsible for securing the data
   path.  It has to authenticate the client access and provide
   appropriate credentials to the client to access data files on the
   storage device.  Finally, it is responsible for revoking access for
   a client to the storage device.

   The metadata server enforces the file access-control policy at
   LAYOUTGET time.  The client should use RPC authorization
   credentials for getting the layout for the requested iomode (READ
   or RW), and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode.  If the LAYOUTGET operation succeeds,
   the client receives, as part of the layout, a set of credentials
   allowing it I/O access to the specified data files corresponding to
   the requested iomode.  When the client acts on I/O operations on
   behalf of its local users, it MUST authenticate and authorize the
   user by issuing respective OPEN and ACCESS calls to the metadata
   server, similar to having NFSv4 data delegations.

   The combination of filehandle, synthetic uid, and gid in the layout
   is the way that the metadata server enforces access control to the
   data server.  The directory namespace on the storage device SHOULD
   only be accessible to the metadata server and not the clients.  In
   that case, the client only has access to filehandles of file
   objects and not directory objects.  Thus, given a filehandle in a
   layout, it is not possible to guess the parent directory
   filehandle.  Further, as the data file permissions only allow the
   given synthetic uid read/write permission and the given synthetic
   gid read permission, knowing the synthetic ids of one file does not
   necessarily allow access to any other data file on the storage
   device.

   The metadata server can also deny access at any time by fencing the
   data file, which means changing the synthetic ids.  In turn, that
   forces the client to return its current layout and get a new layout
   if it wants to continue I/O to the data file.

   If the configuration of the storage device is such that clients can
   access the directory namespace, then the access control degrades to
   that of a typical NFS server with exports using a security flavor
   of AUTH_SYS.  Any client which is allowed access can forge
   credentials to access any data file.  The caveat is that the rogue
   client might have no knowledge of the data file's type or position
   in the metadata directory namespace.

   If access is allowed, the client uses the corresponding (READ or
   RW) credentials to perform the I/O operations at the data file's
   storage devices.
   When the metadata server receives a request to change a file's
   permissions or ACL, it SHOULD recall all layouts for that file and
   then MUST fence off any clients still holding outstanding layouts
   for the respective files by implicitly invalidating the previously
   distributed credential on all data files comprising the file in
   question.  It is REQUIRED that this be done before committing to
   the new permissions and/or ACL.  By requesting new layouts, the
   clients will reauthorize access against the modified access control
   metadata.  Recalling the layouts in this case is intended to
   prevent clients from getting an error on I/Os done after the client
   was fenced off.

15.1.  RPCSEC_GSS and Security Services

   Because of the special use of principals within the loose coupling
   model, the issues are different depending on the coupling model.

15.1.1.  Loosely Coupled

   RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] contains facilities
   that would allow it to be used to authorize the client to the
   storage device on behalf of the metadata server.  Doing so would
   require that each of the metadata server, storage device, and
   client implement RPCSEC_GSSv3 using an RPC-application-defined
   structured privilege assertion in a manner described in
   Section 4.9.1 of [RFC7862].  The specifics necessary to do so are
   not described in this document.  This is principally because any
   such specification would require extensive implementation work on a
   wide range of storage devices, which would be unlikely to result in
   a widely usable specification for a considerable time.

   As a result, the layout type described in this document will not
   provide support for use of RPCSEC_GSS together with the loosely
   coupled model.  However, future layout types could be specified
   that would allow such support, either through the use of
   RPCSEC_GSSv3 or in other ways.

15.1.2.  Tightly Coupled

   With tight coupling, the principal used to access the metadata file
   is exactly the same as used to access the data file.  The storage
   device can use the control protocol to validate any RPC
   credentials.  As a result, there are no security issues related to
   using RPCSEC_GSS with a tightly coupled system.  For example, if
   Kerberos V5 GSS-API [RFC4121] is used as the security mechanism,
   then the storage device could use a control protocol to validate
   the RPC credentials to the metadata server.

16.  IANA Considerations

   [RFC5661] introduced the "pNFS Layout Types Registry"; as such, new
   layout type numbers need to be assigned by IANA.  This document
   defines the protocol associated with the existing layout type
   number, LAYOUT4_FLEX_FILES (see Table 1).

   +--------------------+-------+----------+-----+----------------+
   | Layout Type Name   | Value | RFC      | How | Minor Versions |
   +--------------------+-------+----------+-----+----------------+
   | LAYOUT4_FLEX_FILES | 0x4   | RFCTBD10 | L   | 1              |
   +--------------------+-------+----------+-----+----------------+

                     Table 1: Layout Type Assignments

   [RFC5661] also introduced the "NFSv4 Recallable Object Types
   Registry".  This document defines new recallable objects for
   RCA4_TYPE_MASK_FF_LAYOUT_MIN and RCA4_TYPE_MASK_FF_LAYOUT_MAX (see
   Table 2).
   +------------------------------+-------+----------+-----+----------+
   | Recallable Object Type Name  | Value | RFC      | How | Minor    |
   |                              |       |          |     | Versions |
   +------------------------------+-------+----------+-----+----------+
   | RCA4_TYPE_MASK_FF_LAYOUT_MIN | 16    | RFCTBD10 | L   | 1        |
   | RCA4_TYPE_MASK_FF_LAYOUT_MAX | 17    | RFCTBD10 | L   | 1        |
   +------------------------------+-------+----------+-----+----------+

               Table 2: Recallable Object Type Assignments

   Note, [RFC5661] should have also defined (see Table 3):

   +---------------------------------+-------+-----------+-----+----------+
   | Recallable Object Type Name     | Value | RFC       | How | Minor    |
   |                                 |       |           |     | Versions |
   +---------------------------------+-------+-----------+-----+----------+
   | RCA4_TYPE_MASK_OTHER_LAYOUT_MIN | 12    | [RFC5661] | L   | 1        |
   | RCA4_TYPE_MASK_OTHER_LAYOUT_MAX | 15    | [RFC5661] | L   | 1        |
   +---------------------------------+-------+-----------+-----+----------+

               Table 3: Recallable Object Type Assignments

17.  References

17.1.  Normative References

   [LEGAL]    IETF Trust, "Legal Provisions Relating to IETF
              Documents", November 2008.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC4121]  Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos
              Version 5 Generic Security Service Application Program
              Interface (GSS-API) Mechanism Version 2", RFC 4121,
              July 2005.

   [RFC4506]  Eisler, M., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, May 2006.

   [RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
              Specification Version 2", RFC 5531, May 2009.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR)
              Description", RFC 5662, January 2010.

   [RFC7530]  Haynes, T. and D. Noveck, "Network File System (NFS)
              Version 4 Protocol", RFC 7530, March 2015.

   [RFC7861]  Adamson, W. and N. Williams, "Remote Procedure Call
              (RPC) Security Version 3", RFC 7861, November 2016.

   [RFC7862]  Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862,
              November 2016.

   [pNFSLayouts]
              Haynes, T., "Requirements for pNFS Layout Types",
              draft-ietf-nfsv4-layout-types-07 (Work in Progress),
              August 2017.

17.2.  Informative References

   [RFC4519]  Sciberras, A., Ed., "Lightweight Directory Access
              Protocol (LDAP): Schema for User Applications",
              RFC 4519, DOI 10.17487/RFC4519, June 2006,
              <https://www.rfc-editor.org/info/rfc4519>.

Appendix A.  Acknowledgments

   Those who provided miscellaneous comments to early drafts of this
   document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields,
   and Lev Solomonov.

   Those who provided miscellaneous comments to the final drafts of
   this document include: Anand Ganesh, Robert Wipfel, Gobikrishnan
   Sundharraj, Trond Myklebust, Rick Macklem, and Jim Sermersheim.

   Idan Kedar caught a nasty bug in the interaction of client-side
   mirroring and the minor versioning of devices.
   Dave Noveck provided comprehensive reviews of the document during
   the working group last calls.  He also rewrote Section 2.3.

   Olga Kornievskaia made a convincing case against the use of a
   credential versus a principal in the fencing approach.  Andy
   Adamson and Benjamin Kaduk helped to sharpen the focus.

   Benjamin Kaduk and Olga Kornievskaia also helped provide concrete
   scenarios for loosely coupled security mechanisms.  And in the end,
   Olga proved that, as defined, the loosely coupled model would not
   work with RPCSEC_GSS.

   Tigran Mkrtchyan provided the use case for not allowing the client
   to proxy the I/O through the data server.

   Rick Macklem provided the use case for only writing to a single
   mirror.

Appendix B.  RFC Editor Notes

   [RFC Editor: please remove this section prior to publishing this
   document as an RFC]

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
   RFC number of this document]

Authors' Addresses

   Benny Halevy

   Email: bhalevy@gmail.com

   Thomas Haynes
   Primary Data, Inc.
   4300 El Camino Real Ste 100
   Los Altos, CA 94022
   USA

   Email: loghyr@gmail.com