NFSv4                                                          B. Halevy
Internet-Draft
Intended status: Standards Track                               T. Haynes
Expires: May 24, 2018                                       Primary Data
                                                       November 20, 2017

                Parallel NFS (pNFS) Flexible File Layout
                    draft-ietf-nfsv4-flex-files-15.txt

Abstract

   The Parallel Network File System (pNFS) allows a separation between
   the metadata (onto a metadata server) and data (onto a storage
   device) for a file.  The flexible file layout type is defined in
   this document as an extension to pNFS that allows the use of storage
   devices requiring only a limited degree of interaction with the
   metadata server and using already existing protocols.  Client-side
   mirroring is also added to provide replication of files.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 24, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Definitions
     1.2.  Requirements Language
   2.  Coupling of Storage Devices
     2.1.  LAYOUTCOMMIT
     2.2.  Fencing Clients from the Storage Device
       2.2.1.  Implementation Notes for Synthetic uids/gids
       2.2.2.  Example of using Synthetic uids/gids
     2.3.  State and Locking Models
       2.3.1.  Loosely Coupled Locking Model
       2.3.2.  Tightly Coupled Locking Model
   3.  XDR Description of the Flexible File Layout Type
     3.1.  Code Components Licensing Notice
   4.  Device Addressing and Discovery
     4.1.  ff_device_addr4
     4.2.  Storage Device Multipathing
   5.  Flexible File Layout Type
     5.1.  ff_layout4
       5.1.1.  Error Codes from LAYOUTGET
       5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS
     5.2.  LAYOUTCOMMIT
     5.3.  Interactions Between Devices and Layouts
     5.4.  Handling Version Errors
   6.  Striping via Sparse Mapping
   7.  Recovering from Client I/O Errors
   8.  Mirroring
     8.1.  Selecting a Mirror
     8.2.  Writing to Mirrors
       8.2.1.  Single Storage Device Updates Mirrors
       8.2.2.  Client Updates All Mirrors
       8.2.3.  Handling Write Errors
       8.2.4.  Handling Write COMMITs
     8.3.  Metadata Server Resilvering of the File
   9.  Flexible Files Layout Type Return
     9.1.  I/O Error Reporting
       9.1.1.  ff_ioerr4
     9.2.  Layout Usage Statistics
       9.2.1.  ff_io_latency4
       9.2.2.  ff_layoutupdate4
       9.2.3.  ff_iostats4
     9.3.  ff_layoutreturn4
   10. Flexible Files Layout Type LAYOUTERROR
   11. Flexible Files Layout Type LAYOUTSTATS
   12. Flexible File Layout Type Creation Hint
     12.1.  ff_layouthint4
   13. Recalling a Layout
     13.1.  CB_RECALL_ANY
   14. Client Fencing
   15. Security Considerations
     15.1.  RPCSEC_GSS and Security Services
       15.1.1.  Loosely Coupled
       15.1.2.  Tightly Coupled
   16. IANA Considerations
   17. References
     17.1.  Normative References
     17.2.  Informative References
   Appendix A.  Acknowledgments
   Appendix B.  RFC Editor Notes
   Authors' Addresses

1.  Introduction

   In the parallel Network File System (pNFS), the metadata server
   returns layout type structures that describe where file data is
   located.  There are different layout types for different storage
   systems and methods of arranging data on storage devices.  This
   document defines the flexible file layout type used with file-based
   data servers that are accessed using the Network File System (NFS)
   protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661],
   and NFSv4.2 [RFC7862].

   To provide a global state model equivalent to that of the files
   layout type, a back-end control protocol might be implemented
   between the metadata server and NFSv4.1+ storage devices.  This
   document does not provide a standards-track control protocol.  An
   implementation can either define its own mechanism or define a
   control protocol in a standards-track document.  The requirements
   for a control protocol are specified in [RFC5661] and clarified in
   [pNFSLayouts].

1.1.  Definitions

   control communication requirements:  are, for a layout type, the
      details regarding information on layouts, stateids, file
      metadata, and file data that must be communicated between the
      metadata server and the storage devices.

   control protocol:  is the particular mechanism that an
      implementation of a layout type would use to meet the control
      communication requirements for that layout type.  This need not
      be a protocol as normally understood.  In some cases, the same
      protocol may be used as a control protocol and storage protocol.
   client-side mirroring:  is a feature in which the client, not the
      server, is responsible for updating all of the mirrored copies
      of a layout segment.

   (file) data:  is that part of the file system object which contains
      the content.

   data server (DS):  is another term for storage device.

   fencing:  is the process by which the metadata server prevents the
      storage devices from processing I/O from a specific client to a
      specific file.

   file layout type:  is a layout type in which the storage devices
      are accessed via the NFS protocol (see Section 13 of [RFC5661]).

   layout:  is the information a client uses to access file data on a
      storage device.  This information includes specification of the
      protocol (layout type) and the identity of the storage devices
      to be used.

   layout iomode:  is a grant of either read or read/write I/O to the
      client.

   layout segment:  is a sub-division of a layout.  That sub-division
      might be by the layout iomode (see Sections 3.3.20 and 12.2.9 of
      [RFC5661]), a striping pattern (see Section 13.3 of [RFC5661]),
      or a requested byte range.

   layout stateid:  is a 128-bit quantity returned by a server that
      uniquely defines the layout state provided by the server for a
      specific layout that describes a layout type and file (see
      Section 12.5.2 of [RFC5661]).  Further, Section 12.5.3 of
      [RFC5661] describes differences in handling between layout
      stateids and other stateid types.

   layout type:  is a specification of both the storage protocol used
      to access the data and the aggregation scheme used to lay out
      the file data on the underlying storage devices.

   loose coupling:  is when the control protocol is a storage
      protocol.

   (file) metadata:  is that part of the file system object which
      describes the object and not the content, e.g., the time of
      last modification or access.

   metadata server (MDS):  is the pNFS server which provides metadata
      information for a file system object.  It is also responsible
      for generating, recalling, and revoking layouts for file system
      objects, for performing directory operations, and for performing
      I/O operations to regular files when the clients direct these to
      the metadata server itself.

   mirror:  is a copy of a layout segment.  Note that if one copy of
      the mirror is updated, then all copies must be updated.

   recalling a layout:  is when the metadata server uses a back
      channel to inform the client that the layout is to be returned
      in a graceful manner.  Note that the client has the opportunity
      to flush any writes, etc., before replying to the metadata
      server.

   revoking a layout:  is when the metadata server invalidates the
      layout such that neither the metadata server nor any storage
      device will accept any access from the client with that layout.

   resilvering:  is the act of rebuilding a mirrored copy of a layout
      segment from a known good copy of the layout segment.  Note that
      this can also be done to create a new mirrored copy of the
      layout segment.

   rsize:  is the data transfer buffer size used for reads.

   stateid:  is a 128-bit quantity returned by a server that uniquely
      defines the open and locking states provided by the server for a
      specific open-owner or lock-owner/open-owner pair for a specific
      file and type of lock.

   storage device:  is the target to which clients may direct I/O
      requests when they hold an appropriate layout.  See Section 2.1
      of [pNFSLayouts] for further discussion of the difference
      between a data store and a storage device.

   storage protocol:  is the protocol used by clients to do I/O
      operations to the storage device.  Each layout type specifies
      the set of storage protocols.
   tight coupling:  is an arrangement in which the control protocol is
      one designed specifically for that purpose.  It may be either a
      proprietary protocol, adapted specifically to a particular
      metadata server, or one based on a standards-track document.

   wsize:  is the data transfer buffer size used for writes.

1.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

2.  Coupling of Storage Devices

   A server implementation may choose either a loose or tight coupling
   model between the metadata server and the storage devices.  To
   implement the tight coupling model, a control protocol has to be
   defined.  As the flexible file layout imposes no special
   requirements on the client, the control protocol will need to
   provide:

   (1) management of both security and LAYOUTCOMMITs, and

   (2) a global stateid model and management of these stateids.

   When implementing the loose coupling model, the only control
   protocol will be a version of NFS, with no ability to provide a
   global stateid model or to prevent clients from using layouts
   inappropriately.  To enable client use in that environment, this
   document specifies how security, state, and locking are to be
   managed.

2.1.  LAYOUTCOMMIT

   Regardless of the coupling model, the metadata server has the
   responsibility, upon receiving a LAYOUTCOMMIT (see Section 18.42 of
   [RFC5661]), of ensuring that the semantics of pNFS are respected
   (see Section 3.1 of [pNFSLayouts]).  These include a requirement
   that data written to a storage device be stable before the
   occurrence of the LAYOUTCOMMIT.
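   The stability requirement above implies some bookkeeping on the
   client: it must remember which storage devices returned WRITE
   replies weaker than FILE_SYNC and COMMIT to them before sending
   LAYOUTCOMMIT.  The following is a minimal sketch of that
   bookkeeping; the class and helper names are illustrative only and
   do not come from this specification (a real client would issue
   actual NFS WRITE and COMMIT operations):

```python
# Sketch of client-side bookkeeping for the rule that a COMMIT to the
# storage devices must precede LAYOUTCOMMIT whenever a WRITE did not
# complete with stable_how == FILE_SYNC.  Names are hypothetical.

class FlexFileWriter:
    def __init__(self):
        # storage devices holding data not yet known to be stable
        self.unstable_devices = set()

    def record_write_reply(self, device, stable_how):
        # FILE_SYNC means the storage device already made the data
        # (and the metadata needed to retrieve it) stable
        if stable_how != "FILE_SYNC":
            self.unstable_devices.add(device)

    def prepare_layoutcommit(self, send_commit):
        # COMMIT to every storage device written with weaker stability;
        # only then is it safe to send LAYOUTCOMMIT to the MDS.
        for device in sorted(self.unstable_devices):
            send_commit(device)
        self.unstable_devices.clear()
```

   For example, after an UNSTABLE4 write to one device and a FILE_SYNC
   write to another, only the first device needs a COMMIT before the
   LAYOUTCOMMIT is sent.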
   It is the responsibility of the client to make sure the data file
   is stable before the metadata server begins to query the storage
   devices about the changes to the file.  If any WRITE to a storage
   device did not result in stable_how equal to FILE_SYNC, a
   LAYOUTCOMMIT to the metadata server MUST be preceded by a COMMIT to
   the storage devices written to.  Note that if the client has not
   done a COMMIT to the storage device, then the LAYOUTCOMMIT might
   not be synchronized to the last WRITE operation to the storage
   device.

2.2.  Fencing Clients from the Storage Device

   With loosely coupled storage devices, the metadata server uses
   synthetic uids and gids for the data file, where the uid owner of
   the data file is allowed read/write access and the gid owner is
   allowed read-only access.  As part of the layout (see ffds_user and
   ffds_group in Section 5.1), the client is provided with the user
   and group to be used in the Remote Procedure Call (RPC) [RFC5531]
   credentials needed to access the data file.  Fencing off of clients
   is achieved by the metadata server changing the synthetic uid
   and/or gid owners of the data file on the storage device to
   implicitly revoke the outstanding RPC credentials.  A client
   presenting the wrong credential for the desired access will get an
   NFS4ERR_ACCESS error.

   With this loosely coupled model, the metadata server is not able to
   fence off a single client; it is forced to fence off all clients.
   However, as the other clients react to the fencing, returning their
   layouts and trying to get new ones, the metadata server can hand
   out a new uid and gid to allow access.

   Note: it is recommended to implement common access control methods
   at the storage device file system to allow only the metadata server
   root (super user) access to the storage device, and to set the
   owner of all directories holding data files to the root user.
   This approach provides a practical model to enforce access control
   and fence off cooperative clients, but it cannot protect against
   malicious clients; hence, it provides a level of security
   equivalent to AUTH_SYS.

   With tightly coupled storage devices, the metadata server sets the
   user and group owners, mode bits, and ACL of the data file to be
   the same as the metadata file.  The client must then authenticate
   with the storage device and go through the same authorization
   process it would go through via the metadata server.  In the case
   of tight coupling, fencing is the responsibility of the control
   protocol and is not described in detail here.  However,
   implementations of the tightly coupled locking model (see
   Section 2.3) will need a way to prevent access by certain clients
   to specific files by invalidating the corresponding stateids on the
   storage device.  In such a scenario, the client will be given an
   error of NFS4ERR_BAD_STATEID.

   The client need not know the model used between the metadata server
   and the storage device.  It need only react consistently to any
   errors in interacting with the storage device.  It should both
   return the layout and report the error to the metadata server, and
   ask for a new layout.  At that point, the metadata server can
   either hand out a new layout, hand out no layout (forcing the I/O
   through it), or deny the client further access to the file.

2.2.1.  Implementation Notes for Synthetic uids/gids

   The selection method for the synthetic uids and gids to be used for
   fencing in loosely coupled storage devices is strictly an
   implementation issue.  For example, an administrator might restrict
   a range of such ids available to the Lightweight Directory Access
   Protocol (LDAP) 'uid' field [RFC4519].  The administrator might
   also be able to choose an id that would never be used to grant
   access.
   Then, when the metadata server had a request to access a file, a
   SETATTR would be sent to the storage device to set the owner and
   group of the data file.  The user and group might be selected in a
   round-robin fashion from the range of available ids.

   Those ids would be sent back as ffds_user and ffds_group to the
   client, which would present them as the RPC credentials to the
   storage device.  When the client was done accessing the file and
   the metadata server knew that no other client was accessing the
   file, it could reset the owner and group to restrict access to the
   data file.

   When the metadata server wanted to fence off a client, it would
   change the synthetic uid and/or gid to the restricted ids.  Note
   that using a restricted id ensures that there is a change of owner
   and that at least one id is available that never gets allowed
   access.

   Under an AUTH_SYS security model, synthetic uids and gids of 0
   SHOULD be avoided.  These typically either grant super user access
   to files on a storage device or are mapped to an anonymous id.  In
   the first case, even if the data file is fenced, the client might
   still be able to access the file.  In the second case, multiple ids
   might be mapped to the anonymous ids.

2.2.2.  Example of using Synthetic uids/gids

   The user loghyr creates a file "ompha.c" on the metadata server,
   which creates a corresponding data file on the storage device.

   The metadata server entry may look like:

   -rw-r--r--  1 loghyr  staff    1697 Dec  4 11:31 ompha.c

   On the storage device, the file may be assigned some random
   synthetic uid/gid to deny access:

   -rw-r-----  1 19452  28418    1697 Dec  4 11:31 data_ompha.c

   When the file is opened on a client, since the layout knows nothing
   about the user (and does not care), whether loghyr or garbo opens
   the file does not matter.  The owner and group are modified and
   those values are returned.
   -rw-r-----  1 1066  1067     1697 Dec  4 11:31 data_ompha.c

   The set of synthetic gids on the storage device should be selected
   such that there is no mapping in any of the name services used by
   the storage device, i.e., each group should have no members.

   If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the
   metadata server should return a synthetic uid that is not set on
   the storage device.  Only the synthetic gid would be valid.

   The client is thus solely responsible for enforcing file
   permissions in a loosely coupled model.  To allow loghyr write
   access, it will send an RPC to the storage device with a credential
   of 1066:1067.  To allow garbo read access, it will send an RPC to
   the storage device with a credential of 1067:1067.  The value of
   the uid does not matter as long as it is not the synthetic uid
   granted when getting the layout.

   While pushing the enforcement of permission checking onto the
   client may seem to weaken security, the client may already be
   responsible for enforcing permissions before modifications are
   sent to a server.  With cached writes, the client is always
   responsible for tracking who is modifying a file and making sure
   to not coalesce requests from multiple users into one request.

2.3.  State and Locking Models

   An implementation can always be deployed as a loosely coupled
   model.  There is, however, no way for a storage device to indicate
   over an NFS protocol that it can definitively participate in a
   tightly coupled model:

   o  Storage devices implementing the NFSv3 and NFSv4.0 protocols are
      always treated as loosely coupled.

   o  NFSv4.1+ storage devices that do not return the
      EXCHGID4_FLAG_USE_PNFS_DS flag set in the result of EXCHANGE_ID
      are indicating that they are to be treated as loosely coupled.
      From the locking viewpoint, they are treated in the same way as
      NFSv4.0 storage devices.
   o  NFSv4.1+ storage devices that do identify themselves with the
      EXCHGID4_FLAG_USE_PNFS_DS flag set in the result of EXCHANGE_ID
      can potentially be tightly coupled.  They would use a back-end
      control protocol to implement the global stateid model as
      described in [RFC5661].

   A storage device would have to either be discovered or advertised
   over the control protocol to enable a tight coupling model.

2.3.1.  Loosely Coupled Locking Model

   When locking-related operations are requested, they are primarily
   dealt with by the metadata server, which generates the appropriate
   stateids.  When an NFSv4 version is used as the data access
   protocol, the metadata server may make stateid-related requests of
   the storage devices.  However, it is not required to do so, and the
   resulting stateids are known only to the metadata server and the
   storage device.

   Given this basic structure, locking-related operations are handled
   as follows:

   o  OPENs are dealt with by the metadata server.  Stateids are
      selected by the metadata server and associated with the client
      id describing the client's connection to the metadata server.
      The metadata server may need to interact with the storage device
      to locate the file to be opened, but no locking-related
      functionality need be used on the storage device.

      OPEN_DOWNGRADE and CLOSE only require local execution on the
      metadata server.

   o  Advisory byte-range locks can be implemented locally on the
      metadata server.  As in the case of OPENs, the stateids
      associated with byte-range locks are assigned by the metadata
      server and only used on the metadata server.

   o  Delegations are assigned by the metadata server, which initiates
      recalls when conflicting OPENs are processed.  No storage device
      involvement is required.

   o  TEST_STATEID and FREE_STATEID are processed locally on the
      metadata server, without storage device involvement.
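   The division of labor above can be sketched as a metadata server
   that resolves every locking-related operation locally, minting its
   own stateids and never consulting the storage device.  This is an
   illustrative model only; the class and field names are hypothetical
   and stand in for real NFSv4.1 operation processing:

```python
# Sketch of the loosely coupled locking model: OPEN, byte-range locks,
# delegations, TEST_STATEID, and FREE_STATEID are all handled on the
# metadata server, which mints stateids itself.  The storage device is
# never involved.  All names are hypothetical.

import itertools

class MetadataServer:
    def __init__(self):
        self._stateid_counter = itertools.count(1)
        self.stateids = {}             # stateid -> (client_id, operation)
        self.storage_device_calls = 0  # stays 0: no DS involvement

    def locking_op(self, client_id, operation):
        # The stateid is selected by the MDS and associated with the
        # client id; it is never propagated to the storage device.
        stateid = next(self._stateid_counter)
        self.stateids[stateid] = (client_id, operation)
        return stateid
```

   A tightly coupled server (next section) would differ precisely in
   that `locking_op` would also have to propagate the stateid to the
   storage devices via the control protocol.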
   All I/O operations to the storage device are done using the
   anonymous stateid.  Thus, the storage device has no information
   about the open-owner and lock-owner responsible for issuing a
   particular I/O operation.  As a result:

   o  Mandatory byte-range locking cannot be supported, because the
      storage device has no way of distinguishing I/O done on behalf
      of the lock owner from that done by others.

   o  Enforcement of share reservations is the responsibility of the
      client.  Even though I/O is done using the anonymous stateid,
      the client must ensure that it has a valid stateid associated
      with the open-owner that allows the I/O being done before
      issuing the I/O.

   In the event that a stateid is revoked, the metadata server is
   responsible for preventing client access, since it has no way of
   being sure that the client is aware that the stateid in question
   has been revoked.

   As the client never receives a stateid generated by a storage
   device, there is no client lease on the storage device and no
   prospect of lease expiration, even when access is via NFSv4
   protocols.  Clients will have leases on the metadata server.  In
   dealing with lease expiration, the metadata server may need to use
   fencing to prevent revoked stateids from being relied upon by a
   client unaware of the fact that they have been revoked.

2.3.2.  Tightly Coupled Locking Model

   When locking-related operations are requested, they are primarily
   dealt with by the metadata server, which generates the appropriate
   stateids.  These stateids must be made known to the storage device
   using control protocol facilities, the details of which are not
   discussed in this document.

   Given this basic structure, locking-related operations are handled
   as follows:

   o  OPENs are dealt with primarily on the metadata server.
      Stateids are selected by the metadata server and associated with
      the client id describing the client's connection to the metadata
      server.  The metadata server needs to interact with the storage
      device to locate the file to be opened, and to make the storage
      device aware of the association between the
      metadata-server-chosen stateid and the client and open-owner
      that it represents.

      OPEN_DOWNGRADE and CLOSE are executed initially on the metadata
      server, but the state change made must be propagated to the
      storage device.

   o  Advisory byte-range locks can be implemented locally on the
      metadata server.  As in the case of OPENs, the stateids
      associated with byte-range locks are assigned by the metadata
      server and are available for use on the metadata server.
      Because I/O operations are allowed to present lock stateids, the
      metadata server needs the ability to make the storage device
      aware of the association between the metadata-server-chosen
      stateid and the corresponding open stateid it is associated
      with.

   o  Mandatory byte-range locks can be supported when both the
      metadata server and the storage devices have the appropriate
      support.  As in the case of advisory byte-range locks, these are
      assigned by the metadata server and are available for use on the
      metadata server.  To enable mandatory lock enforcement on the
      storage device, the metadata server needs the ability to make
      the storage device aware of the association between the
      metadata-server-chosen stateid and the client, open-owner, and
      lock (i.e., lock-owner, byte-range, lock-type) that it
      represents.  Because I/O operations are allowed to present lock
      stateids, this information needs to be propagated to all storage
      devices to which I/O might be directed, rather than only to
      those storage devices that contain the locked region.
   o  Delegations are assigned by the metadata server, which initiates
      recalls when conflicting OPENs are processed.  Because I/O
      operations are allowed to present delegation stateids, the
      metadata server requires the ability to make the storage device
      aware of the association between the metadata-server-chosen
      stateid and the filehandle and delegation type it represents,
      and to break such an association.

   o  TEST_STATEID is processed locally on the metadata server,
      without storage device involvement.

   o  FREE_STATEID is processed on the metadata server, but the
      metadata server requires the ability to propagate the request to
      the corresponding storage devices.

   Because the client will possess and use stateids valid on the
   storage device, there will be a client lease on the storage device,
   and the possibility of lease expiration does exist.  The best
   approach for the storage device is to retain these locks as a
   courtesy.  However, if it does not do so, control protocol
   facilities need to provide the means to synchronize lock state
   between the metadata server and storage device.

   Clients will also have leases on the metadata server, which are
   subject to expiration.  In dealing with lease expiration, the
   metadata server would be expected to use control protocol
   facilities enabling it to invalidate revoked stateids on the
   storage device.  In the event the client is not responsive, the
   metadata server may need to use fencing to prevent revoked
   stateids from being acted upon by the storage device.

3.  XDR Description of the Flexible File Layout Type

   This document contains the external data representation (XDR)
   [RFC4506] description of the flexible file layout type.  The XDR
   description is embedded in this document in a way that makes it
   simple for the reader to extract into a ready-to-compile form.
   The reader can feed this document into the following shell script
   to produce the machine-readable XDR description of the flexible
   file layout type:

   <CODE BEGINS>

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

   <CODE ENDS>

   That is, if the above script is stored in a file called
   "extract.sh", and this document is in a file called "spec.txt",
   then the reader can do:

   sh extract.sh < spec.txt > flex_files_prot.x

   The effect of the script is to remove leading white space from
   each line, plus a sentinel sequence of "///".

   The embedded XDR file header follows.  Subsequent XDR descriptions,
   with the sentinel sequence, are embedded throughout the document.

   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both
   nfs types that end with a 4, such as offset4 and length4, as well
   as more generic types such as uint32_t and uint64_t.

3.1.  Code Components Licensing Notice

   Both the XDR description and the scripts used for extracting the
   XDR description are Code Components as described in Section 4 of
   "Legal Provisions Relating to IETF Documents" [LEGAL].  These Code
   Components are licensed according to the terms of that document.

   <CODE BEGINS>

   /// /*
   ///  * Copyright (c) 2012 IETF Trust and the persons identified
   ///  * as authors of the code.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * o Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
625 /// * 626 /// * o Redistributions in binary form must reproduce the above 627 /// * copyright notice, this list of conditions and the 628 /// * following disclaimer in the documentation and/or other 629 /// * materials provided with the distribution. 630 /// * 631 /// * o Neither the name of Internet Society, IETF or IETF 632 /// * Trust, nor the names of specific contributors, may be 633 /// * used to endorse or promote products derived from this 634 /// * software without specific prior written permission. 635 /// * 636 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 637 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 638 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 639 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 640 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 641 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 642 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 643 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 644 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 645 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 646 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 647 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 648 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 649 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 650 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 651 /// * 652 /// * This code was derived from RFCTBD10. 653 /// * Please reproduce this note if possible. 654 /// */ 655 /// 656 /// /* 657 /// * flex_files_prot.x 658 /// */ 659 /// 660 /// /* 661 /// * The following include statements are for example only. 662 /// * The actual XDR definition files are generated separately 663 /// * and independently and are likely to have a different name. 664 /// * %#include 665 /// * %#include 666 /// */ 667 /// 669 671 4. 
Device Addressing and Discovery 673 Data operations to a storage device require the client to know the 674 network address of the storage device. The NFSv4.1+ GETDEVICEINFO 675 operation (Section 18.40 of [RFC5661]) is used by the client to 676 retrieve that information. 678 4.1. ff_device_addr4 680 The ff_device_addr4 data structure is returned by the server as the 681 layout-type-specific opaque field da_addr_body in the device_addr4 682 structure by a successful GETDEVICEINFO operation. 684 686 /// struct ff_device_versions4 { 687 /// uint32_t ffdv_version; 688 /// uint32_t ffdv_minorversion; 689 /// uint32_t ffdv_rsize; 690 /// uint32_t ffdv_wsize; 691 /// bool ffdv_tightly_coupled; 692 /// }; 693 /// 695 /// struct ff_device_addr4 { 696 /// multipath_list4 ffda_netaddrs; 697 /// ff_device_versions4 ffda_versions<>; 698 /// }; 699 /// 701 703 The ffda_netaddrs field is used to locate the storage device. It 704 MUST be set by the server to a list holding one or more of the device 705 network addresses. 707 The ffda_versions array allows the metadata server to present choices 708 as to NFS version, minor version, and coupling strength to the 709 client. The ffdv_version and ffdv_minorversion represent the NFS 710 protocol to be used to access the storage device. This layout 711 specification defines the semantics for ffdv_version values 3 and 4. If 712 ffdv_version equals 3, then the server MUST set ffdv_minorversion to 0 713 and ffdv_tightly_coupled to false. The client MUST then access the 714 storage device using the NFSv3 protocol [RFC1813]. If ffdv_version 715 equals 4, then the server MUST set ffdv_minorversion to one of the 716 NFSv4 minor version numbers and the client MUST access the storage 717 device using NFSv4 with the specified minor version.
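As a sketch of the version rules above, a client might validate each ff_device_versions4 entry as follows. The helper is hypothetical and not part of the protocol; only the field names follow the XDR:

```python
# Sketch of a client-side validity check for one ff_device_versions4
# entry, per the rules above: ffdv_version 3 requires
# ffdv_minorversion 0 and loose coupling; ffdv_version 4 allows any
# defined NFSv4 minor version.  Hypothetical helper, not protocol.

def version_entry_usable(ffdv_version, ffdv_minorversion, ffdv_tightly_coupled):
    if ffdv_version == 3:
        # NFSv3: minor version MUST be 0 and coupling MUST be loose.
        return ffdv_minorversion == 0 and not ffdv_tightly_coupled
    if ffdv_version == 4:
        # NFSv4 minor versions defined at the time of this document.
        return ffdv_minorversion in (0, 1, 2)
    return False  # semantics are only defined for versions 3 and 4
```

A client rejecting an entry this way still has no means to report the incompatibility at GETDEVICEINFO time; see the discussion that follows.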
719 Note that while the client might determine that it cannot use any of 720 the configured combinations of ffdv_version, ffdv_minorversion, and 721 ffdv_tightly_coupled, when it gets the device list from the metadata 722 server, there is no way to indicate to the metadata server 723 which device it is version incompatible with. If, however, the client 724 waits until it retrieves the layout from the metadata server, it can 725 at that time clearly identify the storage device in question (see 726 Section 5.4). 728 The ffdv_rsize and ffdv_wsize are used to communicate the maximum 729 rsize and wsize supported by the storage device. As the storage 730 device can have a different rsize or wsize than the metadata server, 731 the ffdv_rsize and ffdv_wsize allow the metadata server to 732 communicate that information on behalf of the storage device. 734 ffdv_tightly_coupled informs the client as to whether the metadata 735 server is tightly coupled with the storage devices or not. Note that 736 even if the data protocol is at least NFSv4.1, it may still be the 737 case that there is loose coupling in effect. If ffdv_tightly_coupled 738 is not set, then the client MUST commit writes to the storage devices 739 for the file before sending a LAYOUTCOMMIT to the metadata server. 740 I.e., the writes MUST be committed by the client to stable storage 741 via issuing WRITEs with stable_how == FILE_SYNC or by issuing a 742 COMMIT after WRITEs with stable_how != FILE_SYNC (see Section 3.3.7 743 of [RFC1813]). 745 4.2. Storage Device Multipathing 747 The flexible file layout type supports multipathing to multiple 748 storage device addresses. Storage device level multipathing is used 749 for bandwidth scaling via trunking and for higher availability of use 750 in the event of a storage device failure.
Multipathing allows the 751 client to switch to another storage device address which may be that 752 of another storage device that is exporting the same data stripe 753 unit, without having to contact the metadata server for a new layout. 755 To support storage device multipathing, ffda_netaddrs contains an 756 array of one or more storage device network addresses. This array 757 (data type multipath_list4) represents a list of storage devices 758 (each identified by a network address), with the possibility that 759 some storage device will appear in the list multiple times. 761 The client is free to use any of the network addresses as a 762 destination to send storage device requests. If some network 763 addresses are less desirable paths to the data than others, then the 764 metadata server SHOULD NOT include those network addresses in 765 ffda_netaddrs. If less desirable network addresses exist to provide 766 failover, the RECOMMENDED method to offer the addresses is to provide 767 them in a replacement device-ID-to-device-address mapping, or a 768 replacement device ID. When a client finds no response from the 769 storage device using all addresses available in ffda_netaddrs, it 770 SHOULD send a GETDEVICEINFO to attempt to replace the existing 771 device-ID-to-device-address mappings. If the metadata server detects 772 that all network paths represented by ffda_netaddrs are unavailable, 773 the metadata server SHOULD send a CB_NOTIFY_DEVICEID (if the client 774 has indicated it wants device ID notifications for changed device 775 IDs) to change the device-ID-to-device-address mappings to the 776 available addresses. If the device ID itself will be replaced, the 777 metadata server SHOULD recall all layouts with the device ID, and 778 thus force the client to get new layouts and device ID mappings via 779 LAYOUTGET and GETDEVICEINFO. 781 Generally, if two network addresses appear in ffda_netaddrs, they 782 will designate the same storage device. 
When the storage device is 783 accessed over NFSv4.1 or a higher minor version, the two storage 784 device addresses will support the implementation of client ID or 785 session trunking (the latter is RECOMMENDED) as defined in [RFC5661]. 786 The two storage device addresses will share the same server owner or 787 major ID of the server owner. It is not always necessary for the two 788 storage device addresses to designate the same storage device with 789 trunking being used. For example, the data could be read-only, and 790 the data consist of exact replicas. 792 5. Flexible File Layout Type 794 The layout4 type is defined in [RFC5662] as follows: 796 798 enum layouttype4 { 799 LAYOUT4_NFSV4_1_FILES = 1, 800 LAYOUT4_OSD2_OBJECTS = 2, 801 LAYOUT4_BLOCK_VOLUME = 3, 802 LAYOUT4_FLEX_FILES = 4 803 [[RFC Editor: please modify the LAYOUT4_FLEX_FILES 804 to be the layouttype assigned by IANA]] 805 }; 807 struct layout_content4 { 808 layouttype4 loc_type; 809 opaque loc_body<>; 810 }; 811 struct layout4 { 812 offset4 lo_offset; 813 length4 lo_length; 814 layoutiomode4 lo_iomode; 815 layout_content4 lo_content; 816 }; 818 820 This document defines structures associated with the layouttype4 821 value LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure 822 as an XDR type "opaque". The opaque layout is uninterpreted by the 823 generic pNFS client layers, but is interpreted by the flexible file 824 layout type implementation. This section defines the structure of 825 this otherwise opaque value, ff_layout4. 827 5.1. 
ff_layout4 829 831 /// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001; 832 /// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002; 833 /// const FF_FLAGS_NO_READ_IO = 0x00000004; 834 /// const FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008; 836 /// typedef uint32_t ff_flags4; 837 /// 839 /// struct ff_data_server4 { 840 /// deviceid4 ffds_deviceid; 841 /// uint32_t ffds_efficiency; 842 /// stateid4 ffds_stateid; 843 /// nfs_fh4 ffds_fh_vers<>; 844 /// fattr4_owner ffds_user; 845 /// fattr4_owner_group ffds_group; 846 /// }; 847 /// 849 /// struct ff_mirror4 { 850 /// ff_data_server4 ffm_data_servers<>; 851 /// }; 852 /// 853 /// struct ff_layout4 { 854 /// length4 ffl_stripe_unit; 855 /// ff_mirror4 ffl_mirrors<>; 856 /// ff_flags4 ffl_flags; 857 /// uint32_t ffl_stats_collect_hint; 858 /// }; 859 /// 861 863 The ff_layout4 structure specifies a layout in that portion of the 864 data file described in the current layout segment. It is either a 865 single instance or a set of mirrored copies of that portion of the 866 data file. When mirroring is in effect, it protects against loss of 867 data in layout segments. Note that while not explicitly shown in the 868 above XDR, each layout4 element returned in the logr_layout array of 869 LAYOUTGET4res (see Section 18.43.1 of [RFC5661]) describes a layout 870 segment. Hence each ff_layout4 also describes a layout segment. 872 It is possible that the file is concatenated from more than one 873 layout segment. Each layout segment MAY represent different striping 874 parameters, applying respectively only to the layout segment byte 875 range. 877 The ffl_stripe_unit field is the stripe unit size in use for the 878 current layout segment. The number of stripes is given inside each 879 mirror by the number of elements in ffm_data_servers. If the number 880 of stripes is one, then the value for ffl_stripe_unit MUST default to 881 zero. The only supported mapping scheme is sparse and is detailed in 882 Section 6. 
Note that there is an assumption here that both the 883 stripe unit size and the number of stripes are the same across all 884 mirrors. 886 The ffl_mirrors field is the array of mirrored storage devices which 887 provide the storage for the current stripe; see Figure 1. 889 The ffl_stats_collect_hint field provides a hint to the client on how 890 often the server wants it to report LAYOUTSTATS for a file. The time 891 is in seconds. 893 +-----------+ 894 | | 895 | | 896 | File | 897 | | 898 | | 899 +-----+-----+ 900 | 901 +------------+------------+ 902 | | 903 +----+-----+ +-----+----+ 904 | Mirror 1 | | Mirror 2 | 905 +----+-----+ +-----+----+ 906 | | 907 +-----------+ +-----------+ 908 |+-----------+ |+-----------+ 909 ||+-----------+ ||+-----------+ 910 +|| Storage | +|| Storage | 911 +| Devices | +| Devices | 912 +-----------+ +-----------+ 914 Figure 1 916 The ffl_mirrors field represents an array of state information for 917 each mirrored copy of the current layout segment. Each element is 918 described by an ff_mirror4 type. 920 ffds_deviceid provides the deviceid of the storage device holding the 921 data file. 923 ffds_fh_vers is an array of filehandles of the data file matching 924 the available NFS versions on the given storage device. There MUST 925 be exactly as many elements in ffds_fh_vers as there are in 926 ffda_versions. Each element of the array corresponds to a particular 927 combination of ffdv_version, ffdv_minorversion, and 928 ffdv_tightly_coupled provided for the device. The array allows for 929 server implementations which have different filehandles for different 930 combinations of version, minor version, and coupling strength. See 931 Section 5.4 for how to handle versioning issues between the client 932 and storage devices. 934 For tight coupling, ffds_stateid provides the stateid to be used by 935 the client to access the file.
For loose coupling and an NFSv4 936 storage device, the client will have to use an anonymous stateid to 937 perform I/O on the storage device. With no control protocol, the 938 metadata server stateid cannot be used to provide a global stateid 939 model. Thus, the server MUST set the ffds_stateid to be the anonymous 940 stateid. 942 This specification of the ffds_stateid restricts both models for 943 NFSv4.x storage protocols: 945 loosely coupled: the stateid has to be an anonymous stateid, 947 tightly coupled: the stateid has to be a global stateid. 949 A number of issues stem from a mismatch between the fact that 950 ffds_stateid is defined as a single item while ffds_fh_vers is 951 defined as an array. It is possible for each open file on the 952 storage device to require its own open stateid. Because there are 953 established loosely coupled implementations of the version of the 954 protocol described in this document, such potential issues have not 955 been addressed here. It is possible for future layout types to be 956 defined that address these issues, should it become important to 957 provide multiple stateids for the same underlying file. 959 For loosely coupled storage devices, ffds_user and ffds_group provide 960 the synthetic user and group to be used in the RPC credentials that 961 the client presents to the storage device to access the data files. 962 For tightly coupled storage devices, the user and group on the 963 storage device will be the same as on the metadata server. I.e., if 964 ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST 965 ignore both ffds_user and ffds_group. 967 The allowed values for both ffds_user and ffds_group are specified in 968 Section 5.9 of [RFC5661]. For NFSv3 compatibility, user and group 969 strings that consist of decimal numeric values with no leading zeros 970 can be given a special interpretation by clients and servers that 971 choose to provide such support.
The receiver may treat such a user 972 or group string as representing the same user as would be represented 973 by an NFSv3 uid or gid having the corresponding numeric value. Note 974 that if using Kerberos for security, the expectation is that these 975 values will be a name@domain string. 977 ffds_efficiency describes the metadata server's evaluation as to the 978 effectiveness of each mirror. Note that this is per layout and not 979 per device, as the metric may change due to perceived load, 980 availability to the metadata server, etc. Higher values denote 981 higher perceived utility. The way the client can select the best 982 mirror to access is discussed in Section 8.1. 984 ffl_flags is a bitmap that allows the metadata server to inform the 985 client of particular conditions that may result from the more or less 986 tight coupling of the storage devices. 988 FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is 989 not required to send LAYOUTCOMMIT to the metadata server. 991 FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client 992 should not send I/O operations to the metadata server. I.e., even 993 if the client could determine that there was a network disconnect 994 to a storage device, the client should not try to proxy the I/O 995 through the metadata server. 997 FF_FLAGS_NO_READ_IO: can be set to indicate that the client should 998 not send READ requests with the layouts of iomode 999 LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode 1000 LAYOUTIOMODE4_READ from the metadata server. 1002 FF_FLAGS_WRITE_ONE_MIRROR: can be set to indicate that the client 1003 only needs to update one of the mirrors (see Section 8.2). 1005 5.1.1. Error Codes from LAYOUTGET 1007 [RFC5661] provides little guidance as to how the client is to proceed 1008 with a LAYOUTGET which returns an error of 1009 NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY.
1010 Within the context of this document: 1012 NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O 1013 is to go to the metadata server. Note that it is possible to have 1014 had a layout before a recall and not after. 1016 NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout 1017 from being granted. If the client already has an appropriate 1018 layout, it should continue with I/O to the storage devices. 1020 NFS4ERR_DELAY: there is some issue preventing the layout from being 1021 granted. If the client already has an appropriate layout, it 1022 should not continue with I/O to the storage devices. 1024 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS 1026 Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS 1027 flag, the client can still perform I/O to the metadata server. The 1028 flag functions as a hint: it indicates to the client that the 1029 metadata server prefers to separate the metadata I/O from the data 1030 I/O, most likely for performance reasons. 1032 5.2. LAYOUTCOMMIT 1034 The flex file layout does not use lou_body. If lou_type is 1035 LAYOUT4_FLEX_FILES, the lou_body field MUST have a zero length. 1037 5.3. Interactions Between Devices and Layouts 1039 In [RFC5661], the file layout type is defined such that the 1040 relationship between multipathing and filehandles can result in 1041 either 0, 1, or N filehandles (see Section 13.3). Some rationales 1042 for this are clustered servers which share the same filehandle or 1043 support for multiple read-only copies of the file on the same 1044 storage device. In the flexible file layout type, while there is an 1045 array of filehandles, they are independent of the multipathing being 1046 used. If the metadata server wants to provide multiple read-only 1047 copies of the same file on the same storage device, then it should 1048 provide multiple ff_device_addr4, each as a mirror.
The client can 1049 then determine that, since the ffds_fh_vers are different, there 1050 are multiple copies of the file for the current layout segment 1051 available. 1053 5.4. Handling Version Errors 1055 When the metadata server provides the ffda_versions array in the 1056 ff_device_addr4 (see Section 4.1), the client is able to determine if 1057 it cannot access a storage device with any of the supplied 1058 combinations of ffdv_version, ffdv_minorversion, and 1059 ffdv_tightly_coupled. However, due to the limitations of reporting 1060 errors in GETDEVICEINFO (see Section 18.40 of [RFC5661]), the client 1061 is not able to specify which specific device it cannot communicate 1062 with over one of the provided ffdv_version and ffdv_minorversion 1063 combinations. Using ff_ioerr4 (see Section 9.1.1) inside either the 1064 LAYOUTRETURN (see Section 18.44 of [RFC5661]) or the LAYOUTERROR (see 1065 Section 15.6 of [RFC7862] and Section 10 of this document), the 1066 client can isolate the problematic storage device. 1068 The error code to return for LAYOUTRETURN and/or LAYOUTERROR is 1069 NFS4ERR_MINOR_VERS_MISMATCH. It does not matter whether the mismatch 1070 is a major version (e.g., client can use NFSv3 but not NFSv4) or 1071 minor version (e.g., client can use NFSv4.1 but not NFSv4.2); the 1072 error indicates that for all the supplied combinations for 1073 ffdv_version and ffdv_minorversion, the client cannot communicate 1074 with the storage device. The client can retry the GETDEVICEINFO to 1075 see if the metadata server can provide a different combination, or it 1076 can fall back to doing the I/O through the metadata server. 1078 6. Striping via Sparse Mapping 1080 While other layout types support both dense and sparse mapping of 1081 logical offsets to physical offsets within a file (see, for example, 1082 Section 13.4 of [RFC5661]), the flexible file layout type only 1083 supports a sparse mapping.
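The arithmetic of this sparse mapping can be illustrated with a short sketch. The helper name is hypothetical; the quantities follow the definitions given in this section, and with a sparse mapping the physical offset on the storage device equals the logical offset:

```python
# Illustrative sketch of the sparse mapping: for logical offset L,
# stripe unit ffl_stripe_unit, and W = number of elements in
# ffm_data_servers, compute the stripe number N and the index of the
# storage device holding that byte.  With a sparse mapping, the
# physical offset on the storage device is L itself.  Hypothetical
# helper, not part of the protocol.

def sparse_map(L, stripe_unit, W):
    if W == 1 or stripe_unit == 0:
        # Unstriped case: with one stripe, ffl_stripe_unit MUST
        # default to zero and all bytes live on the single device.
        return (0, 0, L)
    S = W * stripe_unit            # bytes in a full stripe
    N = L // S                     # stripe number, as in the text
    idx = (L // stripe_unit) % W   # index into ffm_data_servers
    return (N, idx, L)             # physical offset == logical offset
```

For example, with a 4096-byte stripe unit across three data servers, offset 4096 falls in stripe 0 on the second server, while offset 12288 starts stripe 1 back on the first server.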
1085 With sparse mappings, the logical offset within a file (L) is also 1086 the physical offset on the storage device. As detailed in 1087 Section 13.4.4 of [RFC5661], this results in holes across each 1088 storage device which does not contain the current stripe index. 1090 L: logical offset into the file 1092 W: stripe width 1093 W = number of elements in ffm_data_servers 1095 S: number of bytes in a stripe 1096 S = W * ffl_stripe_unit 1098 N: stripe number 1099 N = L / S 1101 7. Recovering from Client I/O Errors 1103 The pNFS client may encounter errors when directly accessing the 1104 storage devices. However, it is the responsibility of the metadata 1105 server to recover from the I/O errors. When the LAYOUT4_FLEX_FILES 1106 layout type is used, the client MUST report the I/O errors to the 1107 server at LAYOUTRETURN time using the ff_ioerr4 structure (see 1108 Section 9.1.1). 1110 The metadata server analyzes the error and determines the required 1111 recovery operations such as recovering media failures or 1112 reconstructing missing data files. 1114 The metadata server MUST recall any outstanding layouts to allow it 1115 exclusive write access to the stripes being recovered and to prevent 1116 other clients from hitting the same error condition. In these cases, 1117 the server MUST complete recovery before handing out any new layouts 1118 to the affected byte ranges. 1120 Although the client implementation has the option to propagate a 1121 corresponding error to the application that initiated the I/O 1122 operation and drop any unwritten data, the client should attempt to 1123 retry the original I/O operation by either requesting a new layout or 1124 sending the I/O via regular NFSv4.1+ READ or WRITE operations to the 1125 metadata server. The client SHOULD attempt to retrieve a new layout 1126 and retry the I/O operation using the storage device first and only 1127 if the error persists, retry the I/O operation via the metadata 1128 server. 1130 8. 
Mirroring 1132 The flexible file layout type has a simple model in place for the 1133 mirroring of the file data constrained by a layout segment. There is 1134 no assumption that each copy of the mirror is stored identically on 1135 the storage devices. For example, one device might employ 1136 compression or deduplication on the data. However, the over-the-wire 1137 transfer of the file contents MUST appear identical. Note that this is a 1138 constraint of the selected XDR representation in which each mirrored 1139 copy of the layout segment has the same striping pattern (see 1140 Figure 1). 1142 The metadata server is responsible for determining the number of 1143 mirrored copies and the location of each mirror. While the client 1144 may provide a hint as to how many copies it wants (see Section 12), the 1145 metadata server can ignore that hint, and in any event, the client has 1146 no means to dictate either the storage device (which also means the 1147 coupling and/or protocol levels to access the layout segments) or the 1148 location of said storage device. 1150 The updating of mirrored layout segments is done via client-side 1151 mirroring. With this approach, the client is responsible for making 1152 sure modifications are made on all copies of the layout segments it 1153 is informed of via the layout. If a layout segment is being 1154 resilvered to a storage device, that mirrored copy will not be in the 1155 layout. Thus, the metadata server MUST update that copy until the 1156 client is presented with it in a layout. If the FF_FLAGS_WRITE_ONE_MIRROR flag 1157 is set in ffl_flags, the client need only update one of the mirrors 1158 (see Section 8.2). If the client is writing to the layout segments 1159 via the metadata server, then the metadata server MUST update all 1160 copies of the mirror. As seen in Section 8.3, during the 1161 resilvering, the layout is recalled, and the client has to make 1162 modifications via the metadata server. 1164 8.1.
Selecting a Mirror 1166 When the metadata server grants a layout to a client, it MAY let the 1167 client know how fast it expects each mirror to be once the request 1168 arrives at the storage devices via the ffds_efficiency member. While 1169 the algorithms to calculate that value are left to the metadata 1170 server implementations, factors that could contribute to that 1171 calculation include speed of the storage device, physical memory 1172 available to the device, operating system version, current load, etc. 1174 However, what should not be involved in that calculation is a 1175 perceived network distance between the client and the storage device. 1176 The client is better situated for making that determination based on 1177 past interaction with the storage device over the different available 1178 network interfaces between the two. I.e., the metadata server might 1179 not know about a transient outage between the client and storage 1180 device because it has no presence on the given subnet. 1182 As such, it is the client which decides which mirror to access for 1183 reading the file. The requirements for writing to mirrored layout 1184 segments are presented below. 1186 8.2. Writing to Mirrors 1188 8.2.1. Single Storage Device Updates Mirrors 1190 If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client 1191 only needs to update one of the copies of the layout segment. For 1192 this case, the storage device MUST ensure that all copies of the 1193 mirror are updated when any one of the mirrors is updated. If the 1194 storage device gets an error when updating one of the mirrors, then 1195 it MUST inform the client that the original WRITE had an error. The 1196 client then MUST inform the metadata server (see Section 8.2.3). The 1197 client's responsibility with respect to COMMIT is explained in 1198 Section 8.2.4. 
The client may choose any one of the mirrors and may 1199 use ffds_efficiency in the same manner as for reading when making 1200 this choice. 1202 8.2.2. Client Updates All Mirrors 1204 If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the 1205 client is responsible for updating all mirrored copies of the layout 1206 segments that it is given in the layout. A single failed update is 1207 sufficient to fail the entire operation. If all but one copy is 1208 updated successfully and the last one provides an error, then the 1209 client needs to inform the metadata server, via either 1210 LAYOUTRETURN or LAYOUTERROR, that the update failed to that storage 1211 device. If the client is updating the mirrors serially, then it 1212 SHOULD stop at the first error encountered and report that to the 1213 metadata server. If the client is updating the mirrors in parallel, 1214 then it SHOULD wait until all storage devices respond such that it 1215 can report all errors encountered during the update. 1217 8.2.3. Handling Write Errors 1219 When the client reports a write error to the metadata server, the 1220 metadata server is responsible for determining if it wants to remove 1221 the errant mirror from the layout, if the mirror has recovered from 1222 some transient error, etc. When the client tries to get a new 1223 layout, the metadata server informs it of the decision by the 1224 contents of the layout. The client MUST NOT make any assumptions 1225 that the contents of the previous layout will match those of the new 1226 one. If it has updates that were not committed to all mirrors, then 1227 it MUST resend those updates to all mirrors. 1229 There is no provision in the protocol for the metadata server to 1230 directly determine that the client has or has not recovered from an 1231 error.
I.e., assume that the storage device was network partitioned 1232 from the client and all of the copies are successfully updated after 1233 the error was reported. There is no mechanism for the client to 1234 report that fact, and the metadata server is forced to repair the file 1235 across the mirror. 1237 If the client supports NFSv4.2, it can use LAYOUTERROR and 1238 LAYOUTRETURN to provide hints to the metadata server about the 1239 recovery efforts. A LAYOUTERROR on a file is for a non-fatal error. 1240 A subsequent LAYOUTRETURN without an ff_ioerr4 indicates that the 1241 client successfully replayed the I/O to all mirrors. Any 1242 LAYOUTRETURN with an ff_ioerr4 is an error that the metadata server 1243 needs to repair. The client MUST be prepared for the LAYOUTERROR to 1244 trigger a CB_LAYOUTRECALL if the metadata server determines it needs 1245 to start repairing the file. 1247 8.2.4. Handling Write COMMITs 1249 When stable writes are done to the metadata server or to a single 1250 replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is 1251 the responsibility of the receiving node to propagate the written 1252 data stably, before replying to the client. 1254 In the corresponding cases in which unstable writes are done, the 1255 receiving node does not have any such obligation, although it may 1256 choose to asynchronously propagate the updates. However, once a 1257 COMMIT is replied to, all replicas must reflect the writes that have 1258 been done, and this data must have been committed to stable storage 1259 on all replicas. 1261 In order to avoid situations in which stale data is read from 1262 replicas to which writes have not been propagated: 1264 o A client which has outstanding unstable writes made to a single node 1265 (metadata server or storage device) MUST do all reads from that 1266 same node.
1268 o When writes are flushed to the server, for example, to implement 1269 close-to-open semantics, a COMMIT must be done by the client to 1270 ensure that up-to-date written data will be available irrespective 1271 of the particular replica read. 1273 8.3. Metadata Server Resilvering of the File 1275 The metadata server may elect to create a new mirror of the layout 1276 segments at any time. This might be to resilver a copy on a storage 1277 device which was down for servicing, to provide a copy of the layout 1278 segments on storage with different storage performance 1279 characteristics, etc. As the client will not be aware of the new 1280 mirror and the metadata server will not be aware of updates that the 1281 client is making to the layout segments, the metadata server MUST 1282 recall the writable layout segment(s) that it is resilvering. If the 1283 client issues a LAYOUTGET for a writable layout segment which is in 1284 the process of being resilvered, then the metadata server can deny 1285 that request with NFS4ERR_LAYOUTUNAVAILABLE. The client would then 1286 have to perform the I/O through the metadata server. 1288 9. Flexible Files Layout Type Return 1290 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 1291 layout-type-specific information to the server.
It is defined in 1292 Section 18.44.1 of [RFC5661] as follows: 1294 1296 /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ 1297 const LAYOUT4_RET_REC_FILE = 1; 1298 const LAYOUT4_RET_REC_FSID = 2; 1299 const LAYOUT4_RET_REC_ALL = 3; 1301 enum layoutreturn_type4 { 1302 LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, 1303 LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, 1304 LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL 1305 }; 1307 struct layoutreturn_file4 { 1308 offset4 lrf_offset; 1309 length4 lrf_length; 1310 stateid4 lrf_stateid; 1311 /* layouttype4 specific data */ 1312 opaque lrf_body<>; 1313 }; 1315 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 1316 case LAYOUTRETURN4_FILE: 1317 layoutreturn_file4 lr_layout; 1318 default: 1319 void; 1320 }; 1321 struct LAYOUTRETURN4args { 1322 /* CURRENT_FH: file */ 1323 bool lora_reclaim; 1324 layoutreturn_stateid lora_recallstateid; 1325 layouttype4 lora_layout_type; 1326 layoutiomode4 lora_iomode; 1327 layoutreturn4 lora_layoutreturn; 1328 }; 1330 1332 If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the 1333 lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value 1334 is defined by ff_layoutreturn4 (See Section 9.3). It allows the 1335 client to report I/O error information or layout usage statistics 1336 back to the metadata server as defined below. Note that while the 1337 data structures are built on concepts introduced in NFSv4.2, the 1338 effective discriminated union (lora_layout_type combined with 1339 ff_layoutreturn4) allows for a NFSv4.1 metadata server to utilize the 1340 data. 1342 9.1. I/O Error Reporting 1344 9.1.1. 
ff_ioerr4 1346 1348 /// struct ff_ioerr4 { 1349 /// offset4 ffie_offset; 1350 /// length4 ffie_length; 1351 /// stateid4 ffie_stateid; 1352 /// device_error4 ffie_errors<>; 1353 /// }; 1354 /// 1356 1358 Recall that [RFC7862] defines device_error4 as: 1360 1362 struct device_error4 { 1363 deviceid4 de_deviceid; 1364 nfsstat4 de_status; 1365 nfs_opnum4 de_opnum; 1366 }; 1368 1369 The ff_ioerr4 structure is used to return error indications for data 1370 files that generated errors during data transfers. These are hints 1371 to the metadata server that there are problems with that file. For 1372 each error, ffie_errors.de_deviceid, ffie_offset, and ffie_length 1373 represent the storage device and byte range within the file in which 1374 the error occurred; ffie_errors represents the operation and type of 1375 error. The use of device_error4 is described in Section 15.6 of 1376 [RFC7862]. 1378 Even though the storage device might be accessed via NFSv3 and 1379 might report back NFSv3 errors to the client, the client is responsible 1380 for mapping these to appropriate NFSv4 status codes as de_status. 1381 Likewise, the NFSv3 operations need to be mapped to equivalent NFSv4 1382 operations. 1384 9.2. Layout Usage Statistics 1386 9.2.1. ff_io_latency4 1388 1390 /// struct ff_io_latency4 { 1391 /// uint64_t ffil_ops_requested; 1392 /// uint64_t ffil_bytes_requested; 1393 /// uint64_t ffil_ops_completed; 1394 /// uint64_t ffil_bytes_completed; 1395 /// uint64_t ffil_bytes_not_delivered; 1396 /// nfstime4 ffil_total_busy_time; 1397 /// nfstime4 ffil_aggregate_completion_time; 1398 /// }; 1399 /// 1401 1403 Both operation counts and bytes transferred are kept in the 1404 ff_io_latency4. As seen in ff_layoutupdate4 (see Section 9.2.2), read 1405 and write operations are aggregated separately. READ operations are 1406 used for the ff_io_latency4 ffl_read. Both WRITE and COMMIT 1407 operations are used for the ff_io_latency4 ffl_write.
"Requested" 1408 counters track what the client is attempting to do and "completed" 1409 counters track what was done. There is no requirement that the 1410 client only report completed results that have matching requested 1411 results from the reported period. 1413 ffil_bytes_not_delivered is used to track the aggregate number of 1414 bytes requested but not fulfilled due to error conditions. 1415 ffil_total_busy_time is the aggregate time spent with outstanding RPC 1416 calls. ffil_aggregate_completion_time is the sum of all round trip 1417 times for completed RPC calls. 1419 In Section 3.3.1 of [RFC5661], nfstime4 is defined as the number 1420 of seconds and nanoseconds since midnight or zero hour January 1, 1421 1970 Coordinated Universal Time (UTC). The use of nfstime4 in 1422 ff_io_latency4 is to store time since the start of the first I/O from 1423 the client after receiving the layout. In other words, these are to 1424 be decoded as a duration and not as a date and time. 1426 Note that LAYOUTSTATS are cumulative, i.e., not reset each time the 1427 operation is sent. If two LAYOUTSTATS operations for the same file and 1428 layout stateid, originating from the same NFS client, are processed at 1429 the same time by the metadata server, then the one containing the 1430 larger values contains the most recent time series data. 1432 9.2.2. ff_layoutupdate4 1434 1436 /// struct ff_layoutupdate4 { 1437 /// netaddr4 ffl_addr; 1438 /// nfs_fh4 ffl_fhandle; 1439 /// ff_io_latency4 ffl_read; 1440 /// ff_io_latency4 ffl_write; 1441 /// nfstime4 ffl_duration; 1442 /// bool ffl_local; 1443 /// }; 1444 /// 1446 1448 ffl_addr differentiates which network address the client connected to 1449 on the storage device. In the case of multipathing, ffl_fhandle 1450 indicates which read-only copy was selected. ffl_read and ffl_write 1451 convey the latencies respectively for both read and write operations.
1452 ffl_duration is used to indicate the time period over which the 1453 statistics were collected. ffl_local, if true, indicates that the I/O 1454 was serviced by the client's cache. This flag allows the client to 1455 inform the metadata server about "hot" access to a file it would not 1456 normally be allowed to report on. 1458 9.2.3. ff_iostats4 1460 1461 /// struct ff_iostats4 { 1462 /// offset4 ffis_offset; 1463 /// length4 ffis_length; 1464 /// stateid4 ffis_stateid; 1465 /// io_info4 ffis_read; 1466 /// io_info4 ffis_write; 1467 /// deviceid4 ffis_deviceid; 1468 /// ff_layoutupdate4 ffis_layoutupdate; 1469 /// }; 1470 /// 1472 1474 Recall that [RFC7862] defines io_info4 as: 1476 1478 struct io_info4 { 1479 uint64_t ii_count; 1480 uint64_t ii_bytes; 1481 }; 1483 1485 With pNFS, data transfers are performed directly between the pNFS 1486 client and the storage devices. Therefore, the metadata server has 1487 no direct knowledge of the I/O operations being done and thus cannot 1488 create on its own statistical information about client I/O to 1489 optimize data storage location. ff_iostats4 MAY be used by the 1490 client to report I/O statistics back to the metadata server upon 1491 returning the layout. 1493 Since it is not feasible for the client to report every I/O that used 1494 the layout, the client MAY identify "hot" byte ranges for which to 1495 report I/O statistics. The definition and/or configuration mechanism 1496 of what is considered "hot" and the size of the reported byte range 1497 is out of the scope of this document. It is suggested that client 1498 implementations provide reasonable default values and an optional 1499 run-time management interface to control these parameters. For 1500 example, a client can define the default byte range resolution to be 1501 1 MB in size and the thresholds for reporting to be 1 MB/second or 10 1502 I/O operations per second.
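The example defaults above can be sketched as follows. This is a hypothetical client-side illustration, not part of the protocol: the function name, the attribution of each I/O to the 1 MB bucket containing its starting offset, and the threshold values (1 MB/second, 10 operations/second) are assumptions drawn from the example in the text.

```python
# Hypothetical sketch of "hot" byte-range detection using the example
# defaults from the text: 1 MB byte-range resolution, reporting
# thresholds of 1 MB/second or 10 I/O operations/second.  A real
# client would make all three values configurable at run time.
BUCKET = 1 << 20  # 1 MB byte-range resolution

def hot_ranges(ios, duration_secs):
    """ios: iterable of (offset, length) I/O records observed while
    holding the layout.  For simplicity, each I/O is charged entirely
    to the bucket containing its starting offset.  Returns the
    (offset, length) buckets whose rates exceed either threshold."""
    buckets = {}
    for offset, length in ios:
        b = offset // BUCKET
        ops, nbytes = buckets.get(b, (0, 0))
        buckets[b] = (ops + 1, nbytes + length)
    hot = []
    for b, (ops, nbytes) in sorted(buckets.items()):
        # Hot if either the byte rate or the op rate crosses its threshold.
        if nbytes / duration_secs >= BUCKET or ops / duration_secs >= 10:
            hot.append((b * BUCKET, BUCKET))
    return hot
```

A client would report ff_iostats4 entries only for the ranges this kind of filter selects, rather than for every I/O performed under the layout.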
1504 For each byte range, ffis_offset and ffis_length represent the 1505 starting offset of the range and the range length in bytes. 1506 ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and 1507 ffis_write.ii_bytes represent, respectively, the number of contiguous 1508 read and write I/Os and the respective aggregate number of bytes 1509 transferred within the reported byte range. 1511 The combination of ffis_deviceid and ffl_addr uniquely identifies 1512 both the storage path and the network route to it. Finally, the 1513 ffl_fhandle allows the metadata server to differentiate between 1514 multiple read-only copies of the file on the same storage device. 1516 9.3. ff_layoutreturn4 1518 1520 /// struct ff_layoutreturn4 { 1521 /// ff_ioerr4 fflr_ioerr_report<>; 1522 /// ff_iostats4 fflr_iostats_report<>; 1523 /// }; 1524 /// 1526 1528 When data file I/O operations fail, fflr_ioerr_report<> is used to 1529 report these errors to the metadata server as an array of elements of 1530 type ff_ioerr4. Each element in the array represents an error that 1531 occurred on the data file identified by ffie_errors.de_deviceid. If 1532 no errors are to be reported, the size of the fflr_ioerr_report<> 1533 array is set to zero. The client MAY also use fflr_iostats_report<> 1534 to report a list of I/O statistics as an array of elements of type 1535 ff_iostats4. Each element in the array represents statistics for a 1536 particular byte range. Byte ranges are not guaranteed to be disjoint 1537 and MAY repeat or intersect. 1539 10. Flexible Files Layout Type LAYOUTERROR 1541 If the client is using NFSv4.2 to communicate with the metadata 1542 server, then instead of waiting for a LAYOUTRETURN to send error 1543 information to the metadata server (see Section 9.1), it MAY use 1544 LAYOUTERROR (see Section 15.6 of [RFC7862]) to communicate that 1545 information. For the flexible files layout type, this means that 1546 LAYOUTERROR4args is treated the same as ff_ioerr4. 
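The shape of the ff_layoutreturn4 payload described above can be modeled as follows. This is an illustrative sketch only: the class and field names merely mirror the XDR definitions in this section, the XDR marshalling itself is omitted, and only the byte-range fields of ff_iostats4 are shown.

```python
# Illustrative model of ff_layoutreturn4 and its two variable-length
# arrays.  When no errors are to be reported, the ioerr array stays
# zero-length, as the text requires.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FFIoerr:  # mirrors ff_ioerr4
    offset: int
    length: int
    stateid: bytes
    # Each error entry mirrors device_error4:
    # (de_deviceid, de_status, de_opnum)
    errors: List[Tuple[bytes, int, int]]

@dataclass
class FFIostats:  # mirrors ff_iostats4 (byte-range fields only)
    offset: int
    length: int
    # remaining fields (ffis_stateid, ffis_read, ffis_write,
    # ffis_deviceid, ffis_layoutupdate) elided for brevity

@dataclass
class FFLayoutreturn:  # mirrors ff_layoutreturn4
    ioerr_report: List[FFIoerr] = field(default_factory=list)
    iostats_report: List[FFIostats] = field(default_factory=list)

def build_report(errors, stats):
    """Assemble the lrf_body contents for a flex-files LAYOUTRETURN.
    Byte ranges in stats need not be disjoint and may intersect."""
    return FFLayoutreturn(ioerr_report=list(errors),
                          iostats_report=list(stats))
```

An NFSv4.2 client could instead carry the same FFIoerr contents in a LAYOUTERROR and the same statistics in a LAYOUTSTATS, as Sections 10 and 11 describe.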
1548 11. Flexible Files Layout Type LAYOUTSTATS 1550 If the client is using NFSv4.2 to communicate with the metadata 1551 server, then instead of waiting for a LAYOUTRETURN to send I/O 1552 statistics to the metadata server (see Section 9.2), it MAY use 1553 LAYOUTSTATS (see Section 15.7 of [RFC7862]) to communicate that 1554 information. For the flexible files layout type, this means that 1555 LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same 1556 contents as in ffis_layoutupdate. 1558 12. Flexible File Layout Type Creation Hint 1560 The layouthint4 type is defined in [RFC5661] as follows: 1562 1564 struct layouthint4 { 1565 layouttype4 loh_type; 1566 opaque loh_body<>; 1567 }; 1569 1571 The layouthint4 structure is used by the client to pass a hint about 1572 the type of layout it would like created for a particular file. If 1573 the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body 1574 opaque value is defined by the ff_layouthint4 type. 1576 12.1. ff_layouthint4 1578 1580 /// union ff_mirrors_hint switch (bool ffmc_valid) { 1581 /// case TRUE: 1582 /// uint32_t ffmc_mirrors; 1583 /// case FALSE: 1584 /// void; 1585 /// }; 1586 /// 1588 /// struct ff_layouthint4 { 1589 /// ff_mirrors_hint fflh_mirrors_hint; 1590 /// }; 1591 /// 1593 1595 This type conveys hints for the desired data map. All parameters are 1596 optional so the client can give values for only the parameters it 1597 cares about. 1599 13. Recalling a Layout 1601 While Section 12.5.5 of [RFC5661] discusses layout type independent 1602 reasons for recalling a layout, the flexible file layout type 1603 metadata server should recall outstanding layouts in the following 1604 cases: 1606 o When the file's security policy changes, i.e., Access Control 1607 Lists (ACLs) or permission mode bits are set. 1609 o When the file's layout changes, rendering outstanding layouts 1610 invalid. 1612 o When existing layouts are inconsistent with the need to enforce 1613 locking constraints.
1615 o When existing layouts are inconsistent with the requirements 1616 regarding resilvering as described in Section 8.3. 1618 13.1. CB_RECALL_ANY 1620 The metadata server can use the CB_RECALL_ANY callback operation to 1621 notify the client to return some or all of its layouts. Section 22.3 1622 of [RFC5661] defines the allowed types of the "NFSv4 Recallable 1623 Object Types Registry". 1625 1627 /// const RCA4_TYPE_MASK_FF_LAYOUT_MIN = 16; 1628 /// const RCA4_TYPE_MASK_FF_LAYOUT_MAX = 17; 1629 [[RFC Editor: please insert assigned constants]] 1630 /// 1632 struct CB_RECALL_ANY4args { 1633 uint32_t craa_layouts_to_keep; 1634 bitmap4 craa_type_mask; 1635 }; 1637 1639 Typically, CB_RECALL_ANY will be used to recall client state when the 1640 server needs to reclaim resources. The craa_type_mask bitmap 1641 specifies the type of resources that are recalled and the 1642 craa_layouts_to_keep value specifies how many of the recalled 1643 flexible file layouts the client is allowed to keep. The flexible 1644 file layout type mask flags are defined as follows: 1646 1647 /// enum ff_cb_recall_any_mask { 1648 /// FF_RCA4_TYPE_MASK_READ = -2, 1649 /// FF_RCA4_TYPE_MASK_RW = -1 1650 [[RFC Editor: please insert assigned constants]] 1651 /// }; 1652 /// 1654 1656 They represent the iomode of the recalled layouts. In response, the 1657 client SHOULD return layouts of the recalled iomode that it needs the 1658 least, keeping at most craa_layouts_to_keep flexible file layouts. 1660 The FF_RCA4_TYPE_MASK_READ flag notifies the client to return 1661 layouts of iomode LAYOUTIOMODE4_READ. Similarly, the 1662 FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts 1663 of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client 1664 is notified to return layouts of either iomode. 1666 14.
Client Fencing 1668 In cases where clients are uncommunicative and their lease has 1669 expired or when clients fail to return recalled layouts within a 1670 lease period, the server MAY revoke client layouts and reassign these 1671 resources to other clients (see Section 12.5.5 in [RFC5661]). To 1672 avoid data corruption, the metadata server MUST fence off the revoked 1673 clients from the respective data files as described in Section 2.2. 1675 15. Security Considerations 1677 The pNFS feature partitions the NFSv4.1+ file system protocol into 1678 two parts, the control path and the data path (storage protocol). 1679 The control path contains all the new operations described by this 1680 feature; all existing NFSv4 security mechanisms and features apply to 1681 the control path (see Sections 1.7.1 and 2.2.1 of [RFC5661]). The 1682 combination of components in a pNFS system is required to preserve 1683 the security properties of NFSv4.1+ with respect to an entity 1684 accessing data via a client, including security countermeasures to 1685 defend against threats that NFSv4.1+ provides defenses for in 1686 environments where these threats are considered significant. 1688 The metadata server is primarily responsible for securing the data 1689 path. It has to authenticate the client access and provide 1690 appropriate credentials to the client to access data files on the 1691 storage device. Finally, it is responsible for revoking access for a 1692 client to the storage device. 1694 The metadata server enforces the file access-control policy at 1695 LAYOUTGET time. The client should use RPC authorization credentials 1696 for getting the layout for the requested iomode (READ or RW) and the 1697 server verifies the permissions and ACL for these credentials, 1698 possibly returning NFS4ERR_ACCESS if the client is not allowed the 1699 requested iomode. 
If the LAYOUTGET operation succeeds, the client 1700 receives, as part of the layout, a set of credentials allowing it I/O 1701 access to the specified data files corresponding to the requested 1702 iomode. When the client acts on I/O operations on behalf of its 1703 local users, it MUST authenticate and authorize the user by issuing 1704 respective OPEN and ACCESS calls to the metadata server, similar to 1705 having NFSv4 data delegations. 1707 If access is allowed, the client uses the corresponding (READ or RW) 1708 credentials to perform the I/O operations at the data file's storage 1709 devices. When the metadata server receives a request to change a 1710 file's permissions or ACL, it SHOULD recall all layouts for that file 1711 and then MUST fence off any clients still holding outstanding layouts 1712 for the respective files by implicitly invalidating the previously 1713 distributed credential on all data files comprising the file in 1714 question. It is REQUIRED that this be done before committing to the 1715 new permissions and/or ACL. By requesting new layouts, the clients 1716 will reauthorize access against the modified access control metadata. 1717 Recalling the layouts in this case is intended to prevent clients 1718 from getting an error on I/Os done after the client was fenced off. 1720 15.1. RPCSEC_GSS and Security Services 1722 Because of the special use of principals within the loose coupling 1723 model, the issues are different depending on the coupling model. 1725 15.1.1. Loosely Coupled 1727 RPCSEC_GSS version 3 (RPCSEC_GSSv3) [RFC7861] contains facilities 1728 that would allow it to be used to authorize the client to the storage 1729 device on behalf of the metadata server. Doing so would require that 1730 each of the metadata server, storage device, and client would need to 1731 implement RPCSEC_GSSv3 using an RPC-application-defined structured 1732 privilege assertion in a manner described in Section 4.9.1 of 1733 [RFC7862].
The specifics necessary to do so are not described in 1734 this document. This is principally because any such specification 1735 would require extensive implementation work on a wide range of 1736 storage devices, which would be unlikely to result in a widely usable 1737 specification for a considerable time. 1739 As a result, the layout type described in this document will not 1740 provide support for use of RPCSEC_GSS together with the loosely 1741 coupled model. However, future layout types could be specified which 1742 would allow such support, either through the use of RPCSEC_GSSv3, or 1743 in other ways. 1745 15.1.2. Tightly Coupled 1747 With tight coupling, the principal used to access the metadata file 1748 is exactly the same as used to access the data file. The storage 1749 device can use the control protocol to validate any RPC credentials. 1750 As a result, there are no security issues related to using RPCSEC_GSS 1751 with a tightly coupled system. For example, if Kerberos V5 GSS-API 1752 [RFC4121] is used as the security mechanism, then the storage device 1753 could use a control protocol to validate the RPC credentials to the 1754 metadata server. 1756 16. IANA Considerations 1758 [RFC5661] introduced the "pNFS Layout Types Registry"; as such, 1759 new layout type numbers need to be assigned by IANA. This 1760 document defines the protocol associated with the existing layout 1761 type number, LAYOUT4_FLEX_FILES (see Table 1). 1763 +--------------------+-------+----------+-----+----------------+ 1764 | Layout Type Name | Value | RFC | How | Minor Versions | 1765 +--------------------+-------+----------+-----+----------------+ 1766 | LAYOUT4_FLEX_FILES | 0x4 | RFCTBD10 | L | 1 | 1767 +--------------------+-------+----------+-----+----------------+ 1769 Table 1: Layout Type Assignments 1771 [RFC5661] also introduced a registry called "NFSv4 Recallable Object 1772 Types Registry".
This document defines new recallable objects for 1773 RCA4_TYPE_MASK_FF_LAYOUT_MIN and RCA4_TYPE_MASK_FF_LAYOUT_MAX (see 1774 Table 2). 1776 +------------------------------+-------+----------+-----+-----------+ 1777 | Recallable Object Type Name | Value | RFC | How | Minor | 1778 | | | | | Versions | 1779 +------------------------------+-------+----------+-----+-----------+ 1780 | RCA4_TYPE_MASK_FF_LAYOUT_MIN | 16 | RFCTBD10 | L | 1 | 1781 | RCA4_TYPE_MASK_FF_LAYOUT_MAX | 17 | RFCTBD10 | L | 1 | 1782 +------------------------------+-------+----------+-----+-----------+ 1784 Table 2: Recallable Object Type Assignments 1786 Note, [RFC5661] should have also defined (see Table 3): 1788 +-------------------------------+------+-----------+-----+----------+ 1789 | Recallable Object Type Name | Valu | RFC | How | Minor | 1790 | | e | | | Versions | 1791 +-------------------------------+------+-----------+-----+----------+ 1792 | RCA4_TYPE_MASK_OTHER_LAYOUT_M | 12 | [RFC5661] | L | 1 | 1793 | IN | | | | | 1794 | RCA4_TYPE_MASK_OTHER_LAYOUT_M | 15 | [RFC5661] | L | 1 | 1795 | AX | | | | | 1796 +-------------------------------+------+-----------+-----+----------+ 1798 Table 3: Recallable Object Type Assignments 1800 17. References 1802 17.1. Normative References 1804 [LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents", 1805 November 2008, . 1808 [RFC1813] IETF, "NFS Version 3 Protocol Specification", RFC 1813, 1809 June 1995. 1811 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1812 Requirement Levels", BCP 14, RFC 2119, March 1997. 1814 [RFC4121] Zhu, L., Jaganathan, K., and S. Hartman, "The Kerberos 1815 Version 5 Generic Security Service Application Program 1816 Interface (GSS-API) Mechanism Version 2", RFC 4121, July 1817 2005. 1819 [RFC4506] Eisler, M., "XDR: External Data Representation Standard", 1820 STD 67, RFC 4506, May 2006. 
1822 [RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol 1823 Specification Version 2", RFC 5531, May 2009. 1825 [RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1826 "Network File System (NFS) Version 4 Minor Version 1 1827 Protocol", RFC 5661, January 2010. 1829 [RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1830 "Network File System (NFS) Version 4 Minor Version 1 1831 External Data Representation Standard (XDR) Description", 1832 RFC 5662, January 2010. 1834 [RFC7530] Haynes, T. and D. Noveck, "Network File System (NFS) 1835 version 4 Protocol", RFC 7530, March 2015. 1837 [RFC7861] Adamson, W. and N. Williams, "Remote Procedure Call (RPC) 1838 Security Version 3", November 2016. 1840 [RFC7862] Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862, 1841 November 2016. 1843 [pNFSLayouts] 1844 Haynes, T., "Requirements for pNFS Layout Types", draft- 1845 ietf-nfsv4-layout-types-07 (Work In Progress), August 1846 2017. 1848 17.2. Informative References 1850 [RFC4519] Sciberras, A., Ed., "Lightweight Directory Access Protocol 1851 (LDAP): Schema for User Applications", RFC 4519, DOI 1852 10.17487/RFC4519, June 2006, 1853 . 1855 Appendix A. Acknowledgments 1857 Those who provided miscellaneous comments to early drafts of this 1858 document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields, 1859 and Lev Solomonov. 1861 Those who provided miscellaneous comments to the final drafts of this 1862 document include: Anand Ganesh, Robert Wipfel, Gobikrishnan 1863 Sundharraj, Trond Myklebust, Rick Macklem, and Jim Sermersheim. 1865 Idan Kedar caught a nasty bug in the interaction of client side 1866 mirroring and the minor versioning of devices. 1868 Dave Noveck provided comprehensive reviews of the document during the 1869 working group last calls. He also rewrote Section 2.3. 1871 Olga Kornievskaia made a convincing case against the use of a 1872 credential versus a principal in the fencing approach. 
Andy Adamson 1873 and Benjamin Kaduk helped to sharpen the focus. 1875 Benjamin Kaduk and Olga Kornievskaia also helped provide concrete 1876 scenarios for loosely coupled security mechanisms. And in the end, 1877 Olga proved that as defined, the loosely coupled model would not work 1878 with RPCSEC_GSS. 1880 Tigran Mkrtchyan provided the use case for not allowing the client to 1881 proxy the I/O through the data server. 1883 Rick Macklem provided the use case for only writing to a single 1884 mirror. 1886 Appendix B. RFC Editor Notes 1888 [RFC Editor: please remove this section prior to publishing this 1889 document as an RFC] 1891 [RFC Editor: prior to publishing this document as an RFC, please 1892 replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the 1893 RFC number of this document] 1895 Authors' Addresses 1897 Benny Halevy 1899 Email: bhalevy@gmail.com 1901 Thomas Haynes 1902 Primary Data, Inc. 1903 4300 El Camino Real Ste 100 1904 Los Altos, CA 94022 1905 USA 1907 Phone: +1 408 215 1519 1908 Email: thomas.haynes@primarydata.com