NFSv4                                                          B. Halevy
Internet-Draft
Intended status: Standards Track                               T. Haynes
Expires: January 19, 2018                                   Primary Data
                                                           July 18, 2017

                Parallel NFS (pNFS) Flexible File Layout
                    draft-ietf-nfsv4-flex-files-11.txt

Abstract

   The Parallel Network File System (pNFS) allows a separation between
   the metadata (onto a metadata server) and data (onto a storage
   device) for a file.  The flexible file layout type is defined in
   this document as an extension to pNFS that allows the use of storage
   devices requiring only a limited degree of interaction with the
   metadata server, using already existing protocols.  Client-side
   mirroring is also added to provide replication of files.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 19, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Definitions . . . . . . . . . . . . . . . . . . . . . . .   3
     1.2.  Difference Between a Data Server and a Storage Device . .   5
     1.3.  Requirements Language . . . . . . . . . . . . . . . . . .   6
   2.  Coupling of Storage Devices . . . . . . . . . . . . . . . . .   6
     2.1.  LAYOUTCOMMIT  . . . . . . . . . . . . . . . . . . . . . .   6
     2.2.  Fencing Clients from the Storage Device . . . . . . . . .   6
       2.2.1.  Implementation Notes for Synthetic uids/gids  . . . .   7
       2.2.2.  Example of using Synthetic uids/gids  . . . . . . . .   8
     2.3.  State and Locking Models  . . . . . . . . . . . . . . . .   9
       2.3.1.  Loosely Coupled Locking Model . . . . . . . . . . . .   9
       2.3.2.  Tightly Coupled Locking Model . . . . . . . . . . . .  10
   3.  XDR Description of the Flexible File Layout Type  . . . . . .  12
     3.1.  Code Components Licensing Notice  . . . . . . . . . . . .  13
   4.
 Device Addressing and Discovery  . . . . . . . . . . . . . .  14
     4.1.  ff_device_addr4 . . . . . . . . . . . . . . . . . . . . .  14
     4.2.  Storage Device Multipathing . . . . . . . . . . . . . . .  16
   5.  Flexible File Layout Type . . . . . . . . . . . . . . . . . .  17
     5.1.  ff_layout4  . . . . . . . . . . . . . . . . . . . . . . .  17
       5.1.1.  Error Codes from LAYOUTGET  . . . . . . . . . . . . .  21
       5.1.2.  Client Interactions with FF_FLAGS_NO_IO_THRU_MDS  . .  21
     5.2.  Interactions Between Devices and Layouts  . . . . . . . .  22
     5.3.  Handling Version Errors . . . . . . . . . . . . . . . . .  22
   6.  Striping via Sparse Mapping . . . . . . . . . . . . . . . . .  23
   7.  Recovering from Client I/O Errors . . . . . . . . . . . . . .  23
   8.  Mirroring . . . . . . . . . . . . . . . . . . . . . . . . . .  24
     8.1.  Selecting a Mirror  . . . . . . . . . . . . . . . . . . .  24
     8.2.  Writing to Mirrors  . . . . . . . . . . . . . . . . . . .  25
       8.2.1.  Single Storage Device Updates Mirrors . . . . . . . .  25
       8.2.2.  Client Updates All Mirrors  . . . . . . . . . . . . .  25
       8.2.3.  Handling Write Errors . . . . . . . . . . . . . . . .  25
       8.2.4.  Handling Write COMMITs  . . . . . . . . . . . . . . .  26
     8.3.  Metadata Server Resilvering of the File . . . . . . . . .  27
   9.  Flexible Files Layout Type Return . . . . . . . . . . . . . .  27
     9.1.  I/O Error Reporting . . . . . . . . . . . . . . . . . . .  28
       9.1.1.  ff_ioerr4 . . . . . . . . . . . . . . . . . . . . . .  28
     9.2.  Layout Usage Statistics . . . . . . . . . . . . . . . . .  29
       9.2.1.  ff_io_latency4  . . . . . . . . . . . . . . . . . . .  29
       9.2.2.  ff_layoutupdate4  . . . . . . . . . . . . . . . . . .  30
       9.2.3.  ff_iostats4 . . . . . . . . . . . . . . . . . . . . .  30
     9.3.  ff_layoutreturn4  . . . . . . . . . . . . . . . . . . . .  32
   10. Flexible Files Layout Type LAYOUTERROR  . . . . . . . . . . .  32
   11. Flexible Files Layout Type LAYOUTSTATS  . . . . . . . . . . .  32
   12.
 Flexible File Layout Type Creation Hint . . . . . . . . . . .  33
     12.1. ff_layouthint4  . . . . . . . . . . . . . . . . . . . . .  33
   13. Recalling a Layout  . . . . . . . . . . . . . . . . . . . . .  34
     13.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . .  34
   14. Client Fencing  . . . . . . . . . . . . . . . . . . . . . . .  35
   15. Security Considerations . . . . . . . . . . . . . . . . . . .  35
     15.1. Kerberized File Access  . . . . . . . . . . . . . . . . .  36
       15.1.1. Loosely Coupled . . . . . . . . . . . . . . . . . . .  36
       15.1.2. Tightly Coupled . . . . . . . . . . . . . . . . . . .  36
   16. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  37
   17. References  . . . . . . . . . . . . . . . . . . . . . . . . .  38
     17.1. Normative References  . . . . . . . . . . . . . . . . . .  38
     17.2. Informative References  . . . . . . . . . . . . . . . . .  38
   Appendix A.  Acknowledgments  . . . . . . . . . . . . . . . . . .  39
   Appendix B.  RFC Editor Notes . . . . . . . . . . . . . . . . . .  39
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  39

1.  Introduction

   In the parallel Network File System (pNFS), the metadata server
   returns layout type structures that describe where file data is
   located.  There are different layout types for different storage
   systems and methods of arranging data on storage devices.  This
   document defines the flexible file layout type used with file-based
   data servers that are accessed using the Network File System (NFS)
   protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC5661],
   and NFSv4.2 [RFC7862].

   To provide a global state model equivalent to that of the files
   layout type, a back-end control protocol MAY be implemented between
   the metadata server and NFSv4.1+ storage devices.
 It is out of
   scope for this document to specify such a protocol; however, the
   requirements for the protocol are specified in [RFC5661] and
   clarified in [pNFSLayouts].

1.1.  Definitions

   control protocol:  is a set of requirements for the communication of
      information on layouts, stateids, file metadata, and file data
      between the metadata server and the storage devices (see
      [pNFSLayouts]).

   client-side mirroring:  is when the client, not the server, is
      responsible for updating all of the mirrored copies of a layout
      segment.

   data file:  is that part of the file system object which contains
      the content.

   data server (DS):  is one of the pNFS servers which provides the
      contents of a file system object that is a regular file.
      Depending on the layout, there might be one or more data servers
      over which the data is striped.  Note that while the metadata
      server is strictly accessed over the NFSv4.1+ protocol, the data
      server could, depending on the layout type, be accessed via any
      protocol that meets the pNFS requirements.

   fencing:  is when the metadata server prevents the storage devices
      from processing I/O from a specific client to a specific file.

   file layout type:  is a layout type in which the storage devices are
      accessed via the NFS protocol (see Section 13 of [RFC5661]).

   layout:  informs a client of which storage devices it needs to
      communicate with (and over which protocol) to perform I/O on a
      file.  The layout might also provide some hints about how the
      storage is physically organized.

   layout iomode:  describes whether the layout granted to the client
      is for read or read/write I/O.

   layout segment:  describes a sub-division of a layout.  That sub-
      division might be by the iomode (see Sections 3.3.20 and 12.2.9
      of [RFC5661]), a striping pattern (see Section 13.3 of
      [RFC5661]), or requested byte range.
   layout stateid:  is a 128-bit quantity returned by a server that
      uniquely defines the layout state provided by the server for a
      specific layout that describes a layout type and file (see
      Section 12.5.2 of [RFC5661]).  Further, Section 12.5.3 of
      [RFC5661] describes the difference between a layout stateid and
      a normal stateid.

   layout type:  describes both the storage protocol used to access the
      data and the aggregation scheme used to lay out the file data on
      the underlying storage devices.

   loose coupling:  is when the metadata server and the storage devices
      do not have a control protocol present.

   metadata file:  is that part of the file system object which
      describes the object and not the content, e.g., the time of last
      modification, access time, etc.

   metadata server (MDS):  is the pNFS server which provides metadata
      information for a file system object.  It is also responsible
      for generating layouts for file system objects.  Note that the
      MDS is responsible for directory-based operations.

   mirror:  is a copy of a layout segment.  Note that if one copy of
      the mirror is updated, then all copies must be updated.

   recalling a layout:  is when the metadata server uses a back channel
      to inform the client that the layout is to be returned in a
      graceful manner.  Note that the client has the opportunity to
      flush any writes, etc., before replying to the metadata server.

   revoking a layout:  is when the metadata server invalidates the
      layout such that neither the metadata server nor any storage
      device will accept any access from the client with that layout.

   resilvering:  is the act of rebuilding a mirrored copy of a layout
      segment from a known good copy of the layout segment.  Note that
      this can also be done to create a new mirrored copy of the
      layout segment.

   rsize:  is the data transfer buffer size used for reads.
   stateid:  is a 128-bit quantity returned by a server that uniquely
      defines the open and locking states provided by the server for a
      specific open-owner or lock-owner/open-owner pair for a specific
      file and type of lock.

   storage device:  is another term used almost interchangeably with
      data server.  See Section 1.2 for the nuances between the two.

   tight coupling:  is when the metadata server and the storage devices
      do have a control protocol present.

   wsize:  is the data transfer buffer size used for writes.

1.2.  Difference Between a Data Server and a Storage Device

   We have defined a data server as a pNFS server, which implies that
   it can utilize the NFSv4.1+ protocol to communicate with the
   client.  As such, only the file layout type would currently meet
   this requirement.  The more generic concept is a storage device,
   which can use any protocol to communicate with the client.  The
   requirements for a storage device to act together with the metadata
   server to provide data to a client are that there is a layout type
   specification for the given protocol and that the metadata server
   has granted a layout to the client.  Note that nothing precludes
   there being multiple supported layout types (i.e., protocols)
   between a metadata server, storage devices, and client.

   As "storage device" is the more encompassing term, this document
   uses it in preference to "data server".

1.3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

2.  Coupling of Storage Devices

   The coupling of the metadata server with the storage devices can be
   either tight or loose.  In a tight coupling, there is a control
   protocol present to manage security, LAYOUTCOMMITs, etc.
 With a
   loose coupling, the only control protocol might be a version of
   NFS.  As such, semantics for managing security, state, and locking
   models MUST be defined.

2.1.  LAYOUTCOMMIT

   The metadata server has the responsibility, upon receiving a
   LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), of ensuring that the
   semantics of pNFS are respected (see Section 12.5.4 of [RFC5661]).
   These include a requirement that data written to the storage device
   be stable before the occurrence of the LAYOUTCOMMIT.

   It is the responsibility of the client to make sure the data file
   is stable before the metadata server begins to query the storage
   devices about the changes to the file.  If any WRITE to a storage
   device did not result in stable_how equal to FILE_SYNC, a
   LAYOUTCOMMIT to the metadata server MUST be preceded by a COMMIT to
   the storage devices written to.  Note that if the client has not
   done a COMMIT to the storage device, then the LAYOUTCOMMIT might
   not be synchronized to the last WRITE operation to the storage
   device.

2.2.  Fencing Clients from the Storage Device

   With loosely coupled storage devices, the metadata server uses
   synthetic uids and gids for the data file, where the uid owner of
   the data file is allowed read/write access and the gid owner is
   allowed read-only access.  As part of the layout (see ffds_user and
   ffds_group in Section 5.1), the client is provided with the user
   and group to be used in the Remote Procedure Call (RPC) [RFC5531]
   credentials needed to access the data file.  Fencing off of clients
   is achieved by the metadata server changing the synthetic uid
   and/or gid owners of the data file on the storage device to
   implicitly revoke the outstanding RPC credentials.

   With this loosely coupled model, the metadata server is not able to
   fence off a single client; it is forced to fence off all clients.
   However, as the other clients react to the fencing, returning their
   layouts and trying to get new ones, the metadata server can hand
   out a new uid and gid to allow access.

   Note: it is recommended to implement common access control methods
   at the storage device filesystem to allow only the metadata server
   root (super user) access to the storage device, and to set the
   owner of all directories holding data files to the root user.  This
   approach provides a practical model to enforce access control and
   fence off cooperative clients, but it cannot protect against
   malicious clients; hence it provides a level of security equivalent
   to AUTH_SYS.

   With tightly coupled storage devices, the metadata server sets the
   user and group owners, mode bits, and ACL of the data file to be
   the same as the metadata file.  The client must then authenticate
   with the storage device and go through the same authorization
   process it would go through via the metadata server.  In the case
   of tight coupling, fencing is the responsibility of the control
   protocol and is not described in detail here.  However,
   implementations of the tight coupling locking model (see
   Section 2.3) will need a way to prevent access by certain clients
   to specific files by invalidating the corresponding stateids on the
   storage device.

2.2.1.  Implementation Notes for Synthetic uids/gids

   The selection method for the synthetic uids and gids to be used for
   fencing in loosely coupled storage devices is strictly an
   implementation issue.  For example, an administrator might restrict
   a range of such ids available to the Lightweight Directory Access
   Protocol (LDAP) 'uid' field [RFC4519].  She might also be able to
   choose an id that would never be used to grant access.  Then, when
   the metadata server had a request to access a file, a SETATTR would
   be sent to the storage device to set the owner and group of the
   data file.
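   The id management described above can be sketched as follows.  This
   is a hypothetical illustration only: the class name, id ranges, and
   restricted ids are invented, and a real metadata server would issue
   an actual SETATTR to the storage device rather than merely return
   numbers.

   ```python
   # Sketch of synthetic uid/gid management for loosely coupled
   # fencing.  All names and id ranges are hypothetical.
   from itertools import cycle

   class SyntheticIdAllocator:
       def __init__(self, uid_range, gid_range,
                    restricted_uid, restricted_gid):
           # Ids are handed out round robin from an
           # administrator-chosen range.
           self._uids = cycle(uid_range)
           self._gids = cycle(gid_range)
           # The restricted ids are never granted access, so switching
           # the data file's owners to them implicitly revokes all
           # outstanding RPC credentials.
           self.restricted = (restricted_uid, restricted_gid)

       def grant(self):
           """Pick the ffds_user/ffds_group pair to return in the
           layout (and to SETATTR onto the data file)."""
           return next(self._uids), next(self._gids)

       def fence(self):
           """Owner/group to SETATTR onto the data file in order to
           fence off all clients holding the granted credentials."""
           return self.restricted

   alloc = SyntheticIdAllocator(range(1000, 2000), range(2000, 3000),
                                restricted_uid=65534, restricted_gid=65534)
   uid, gid = alloc.grant()          # returned as ffds_user/ffds_group
   assert (uid, gid) == (1000, 2000)
   assert alloc.fence() == (65534, 65534)
   ```

   Note how fencing never reuses a granted id: the restricted pair is
   outside the allocation ranges, so a change of owner is guaranteed.
   
   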
 The user
   and group might be selected in a round-robin fashion from the range
   of available ids.

   Those ids would be sent back as ffds_user and ffds_group to the
   client, which would present them as the RPC credentials to the
   storage device.  When the client was done accessing the file and
   the metadata server knew that no other client was accessing the
   file, it could reset the owner and group to restrict access to the
   data file.

   When the metadata server wanted to fence off a client, it would
   change the synthetic uid and/or gid to the restricted ids.  Note
   that using a restricted id ensures that there is a change of owner
   and at least one id available that never gets allowed access.

   Under an AUTH_SYS security model, synthetic uids and gids of 0
   SHOULD be avoided.  These typically either grant super access to
   files on a storage device or are mapped to an anonymous id.  In the
   first case, even if the data file is fenced, the client might still
   be able to access the file.  In the second case, multiple ids might
   be mapped to the anonymous ids.

2.2.2.  Example of using Synthetic uids/gids

   The user loghyr creates a file "ompha.c" on the metadata server,
   which in turn creates a corresponding data file on the storage
   device.

   The metadata server entry may look like:

   -rw-r--r--  1 loghyr  staff    1697 Dec  4 11:31 ompha.c

   On the storage device, the file may be assigned some random
   synthetic uid/gid to deny access:

   -rw-r-----  1 19452   28418    1697 Dec  4 11:31 data_ompha.c

   When the file is opened on a client, since the layout knows nothing
   about the user (and does not care), it does not matter whether
   loghyr or garbo opens the file.  The owner and group are modified
   and those values are returned.
   -rw-r-----  1 1066    1067     1697 Dec  4 11:31 data_ompha.c

   The set of synthetic gids on the storage device should be selected
   such that there is no mapping in any of the name services used by
   the storage device, i.e., each group should have no members.

   If the layout segment has an iomode of LAYOUTIOMODE4_READ, then the
   metadata server should return a synthetic uid that is not set on
   the storage device.  Only the synthetic gid would be valid.

   The client is thus solely responsible for enforcing file
   permissions in a loosely coupled model.  To allow loghyr write
   access, it will send an RPC to the storage device with a credential
   of 1066:1067.  To allow garbo read access, it will send an RPC to
   the storage device with a credential of 1067:1067.  The value of
   the uid does not matter as long as it is not the synthetic uid
   granted when getting the layout.

   While pushing the enforcement of permission checking onto the
   client may seem to weaken security, the client may already be
   responsible for enforcing permissions before modifications are sent
   to a server.  With cached writes, the client is always responsible
   for tracking who is modifying a file and making sure to not
   coalesce requests from multiple users into one request.

2.3.  State and Locking Models

   The choice of locking models is governed by the following rules:

   o  Storage devices implementing the NFSv3 and NFSv4.0 protocols are
      always treated as loosely coupled.

   o  NFSv4.1+ storage devices that do not return the
      EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID response
      are indicating that they are to be treated as loosely coupled.
      From the locking viewpoint, they are treated in the same way as
      NFSv4.0 storage devices.

   o  NFSv4.1+ storage devices that do identify themselves with the
      EXCHGID4_FLAG_USE_PNFS_DS flag set in the EXCHANGE_ID response
      are considered tightly coupled.
 They would use a
   back-end control protocol to implement the global stateid model as
   described in [RFC5661].

2.3.1.  Loosely Coupled Locking Model

   When locking-related operations are requested, they are primarily
   dealt with by the metadata server, which generates the appropriate
   stateids.  When an NFSv4 version is used as the data access
   protocol, the metadata server may make stateid-related requests of
   the storage devices.  However, it is not required to do so, and the
   resulting stateids are known only to the metadata server and the
   storage device.

   Given this basic structure, locking-related operations are handled
   as follows:

   o  OPENs are dealt with by the metadata server.  Stateids are
      selected by the metadata server and associated with the client
      id describing the client's connection to the metadata server.
      The metadata server may need to interact with the storage device
      to locate the file to be opened, but no locking-related
      functionality need be used on the storage device.

      OPEN_DOWNGRADE and CLOSE only require local execution on the
      metadata server.

   o  Advisory byte-range locks can be implemented locally on the
      metadata server.  As in the case of OPENs, the stateids
      associated with byte-range locks are assigned by the metadata
      server and only used on the metadata server.

   o  Delegations are assigned by the metadata server, which initiates
      recalls when conflicting OPENs are processed.  No storage device
      involvement is required.

   o  TEST_STATEID and FREE_STATEID are processed locally on the
      metadata server, without storage device involvement.

   All I/O operations to the storage device are done using the
   anonymous stateid.  Thus the storage device has no information
   about the openowner and lockowner responsible for issuing a
   particular I/O operation.
 As a result:

   o  Mandatory byte-range locking cannot be supported because the
      storage device has no way of distinguishing I/O done on behalf
      of the lock owner from that done by others.

   o  Enforcement of share reservations is the responsibility of the
      client.  Even though I/O is done using the anonymous stateid,
      the client must ensure that it has a valid stateid associated
      with the openowner that allows the I/O being done before issuing
      the I/O.

   In the event that a stateid is revoked, the metadata server is
   responsible for preventing client access, since it has no way of
   being sure that the client is aware that the stateid in question
   has been revoked.

   As the client never receives a stateid generated by a storage
   device, there is no client lease on the storage device and no
   prospect of lease expiration, even when access is via NFSv4
   protocols.  Clients will have leases on the metadata server.  In
   dealing with lease expiration, the metadata server may need to use
   fencing to prevent revoked stateids from being relied upon by a
   client unaware of the fact that they have been revoked.

2.3.2.  Tightly Coupled Locking Model

   When locking-related operations are requested, they are primarily
   dealt with by the metadata server, which generates the appropriate
   stateids.  These stateids must be made known to the storage device
   using control protocol facilities, the details of which are not
   discussed in this document.

   Given this basic structure, locking-related operations are handled
   as follows:

   o  OPENs are dealt with primarily on the metadata server.  Stateids
      are selected by the metadata server and associated with the
      client id describing the client's connection to the metadata
      server.
 The metadata server needs to interact with the
      storage device to locate the file to be opened and to make the
      storage device aware of the association between the
      metadata-server-chosen stateid and the client and openowner that
      it represents.

      OPEN_DOWNGRADE and CLOSE are executed initially on the metadata
      server, but the state change made must be propagated to the
      storage device.

   o  Advisory byte-range locks can be implemented locally on the
      metadata server.  As in the case of OPENs, the stateids
      associated with byte-range locks are assigned by the metadata
      server and are available for use on the metadata server.
      Because I/O operations are allowed to present lock stateids, the
      metadata server needs the ability to make the storage device
      aware of the association between the metadata-server-chosen
      stateid and the corresponding open stateid it is associated
      with.

   o  Mandatory byte-range locks can be supported when both the
      metadata server and the storage devices have the appropriate
      support.  As in the case of advisory byte-range locks, these are
      assigned by the metadata server and are available for use on the
      metadata server.  To enable mandatory lock enforcement on the
      storage device, the metadata server needs the ability to make
      the storage device aware of the association between the
      metadata-server-chosen stateid and the client, openowner, and
      lock (i.e., lockowner, byte-range, lock-type) that it
      represents.  Because I/O operations are allowed to present lock
      stateids, this information needs to be propagated to all storage
      devices to which I/O might be directed, rather than only to the
      storage devices that contain the locked region.

   o  Delegations are assigned by the metadata server, which initiates
      recalls when conflicting OPENs are processed.
 Because I/O
      operations are allowed to present delegation stateids, the
      metadata server requires the ability to make the storage device
      aware of the association between the metadata-server-chosen
      stateid and the filehandle and delegation type it represents,
      and to break such an association.

   o  TEST_STATEID is processed locally on the metadata server,
      without storage device involvement.

   o  FREE_STATEID is processed on the metadata server, but the
      metadata server requires the ability to propagate the request to
      the corresponding storage devices.

   Because the client will possess and use stateids valid on the
   storage device, there will be a client lease on the storage device,
   and the possibility of lease expiration does exist.  The best
   approach for the storage device is to retain these locks as a
   courtesy.  However, if it does not do so, control protocol
   facilities need to provide the means to synchronize lock state
   between the metadata server and storage device.

   Clients will also have leases on the metadata server, which are
   subject to expiration.  In dealing with lease expiration, the
   metadata server would be expected to use control protocol
   facilities enabling it to invalidate revoked stateids on the
   storage device.  In the event the client is not responsive, the
   metadata server may need to use fencing to prevent revoked stateids
   from being acted upon by the storage device.

3.  XDR Description of the Flexible File Layout Type

   This document contains the external data representation (XDR)
   [RFC4506] description of the flexible file layout type.  The XDR
   description is embedded in this document in a way that makes it
   simple for the reader to extract into a ready-to-compile form.
The 557 reader can feed this document into the following shell script to 558 produce the machine readable XDR description of the flexible file 559 layout type: 561 563 #!/bin/sh 564 grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' 566 568 That is, if the above script is stored in a file called "extract.sh", 569 and this document is in a file called "spec.txt", then the reader can 570 do: 572 sh extract.sh < spec.txt > flex_files_prot.x 574 The effect of the script is to remove leading white space from each 575 line, plus a sentinel sequence of "///". 577 The embedded XDR file header follows. Subsequent XDR descriptions, 578 with the sentinel sequence are embedded throughout the document. 580 Note that the XDR code contained in this document depends on types 581 from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both nfs 582 types that end with a 4, such as offset4, length4, etc., as well as 583 more generic types such as uint32_t and uint64_t. 585 3.1. Code Components Licensing Notice 587 Both the XDR description and the scripts used for extracting the XDR 588 description are Code Components as described in Section 4 of "Legal 589 Provisions Relating to IETF Documents" [LEGAL]. These Code 590 Components are licensed according to the terms of that document. 592 594 /// /* 595 /// * Copyright (c) 2012 IETF Trust and the persons identified 596 /// * as authors of the code. All rights reserved. 597 /// * 598 /// * Redistribution and use in source and binary forms, with 599 /// * or without modification, are permitted provided that the 600 /// * following conditions are met: 601 /// * 602 /// * o Redistributions of source code must retain the above 603 /// * copyright notice, this list of conditions and the 604 /// * following disclaimer. 
605 /// * 606 /// * o Redistributions in binary form must reproduce the above 607 /// * copyright notice, this list of conditions and the 608 /// * following disclaimer in the documentation and/or other 609 /// * materials provided with the distribution. 610 /// * 611 /// * o Neither the name of Internet Society, IETF or IETF 612 /// * Trust, nor the names of specific contributors, may be 613 /// * used to endorse or promote products derived from this 614 /// * software without specific prior written permission. 615 /// * 616 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 617 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 618 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 619 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 620 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 621 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 622 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 623 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 624 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 625 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 626 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 627 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 628 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 629 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 630 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 631 /// * 632 /// * This code was derived from RFCTBD10. 633 /// * Please reproduce this note if possible. 634 /// */ 635 /// 636 /// /* 637 /// * flex_files_prot.x 638 /// */ 639 /// 640 /// /* 641 /// * The following include statements are for example only. 642 /// * The actual XDR definition files are generated separately 643 /// * and independently and are likely to have a different name. 644 /// * %#include 645 /// * %#include 646 /// */ 647 /// 649 651 4. 
Device Addressing and Discovery 653 Data operations to a storage device require the client to know the 654 network address of the storage device. The NFSv4.1+ GETDEVICEINFO 655 operation (Section 18.40 of [RFC5661]) is used by the client to 656 retrieve that information. 658 4.1. ff_device_addr4 660 The ff_device_addr4 data structure is returned by the server as the 661 storage protocol specific opaque field da_addr_body in the 662 device_addr4 structure by a successful GETDEVICEINFO operation. 664 666 /// struct ff_device_versions4 { 667 /// uint32_t ffdv_version; 668 /// uint32_t ffdv_minorversion; 669 /// uint32_t ffdv_rsize; 670 /// uint32_t ffdv_wsize; 671 /// bool ffdv_tightly_coupled; 672 /// }; 673 /// 674 /// struct ff_device_addr4 { 675 /// multipath_list4 ffda_netaddrs; 676 /// ff_device_versions4 ffda_versions<>; 677 /// }; 678 /// 680 682 The ffda_netaddrs field is used to locate the storage device. It 683 MUST be set by the server to a list holding one or more of the device 684 network addresses. 686 The ffda_versions array allows the metadata server to present choices 687 as to NFS version, minor version, and coupling strength to the 688 client. The ffdv_version and ffdv_minorversion represent the NFS 689 protocol to be used to access the storage device. This layout 690 specification defines the semantics for ffdv_versions 3 and 4. If 691 ffdv_version equals 3 then the server MUST set ffdv_minorversion to 0 692 and ffdv_tightly_coupled to false. The client MUST then access the 693 storage device using the NFSv3 protocol [RFC1813]. If ffdv_version 694 equals 4 then the server MUST set ffdv_minorversion to one of the 695 NFSv4 minor version numbers and the client MUST access the storage 696 device using NFSv4 with the specified minor version. 
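The version-selection rules above can be sketched as a short, illustrative check. This is not part of the specification; the structure name merely mirrors the ff_device_versions4 XDR, and the set of NFSv4 minor versions checked is an assumption based on the versions defined at the time of writing:

```python
# Hypothetical sketch: validate a server-provided combination of
# ffdv_version / ffdv_minorversion / ffdv_tightly_coupled before the
# client selects a protocol to reach the storage device.
from dataclasses import dataclass

@dataclass
class FFDeviceVersions4:
    ffdv_version: int
    ffdv_minorversion: int
    ffdv_rsize: int
    ffdv_wsize: int
    ffdv_tightly_coupled: bool

def is_valid_combination(v: FFDeviceVersions4) -> bool:
    if v.ffdv_version == 3:
        # NFSv3: the server MUST set minorversion to 0 and
        # tightly_coupled to false.
        return v.ffdv_minorversion == 0 and not v.ffdv_tightly_coupled
    if v.ffdv_version == 4:
        # NFSv4: minorversion must be one of the NFSv4 minor version
        # numbers (0, 1, or 2 at the time of writing -- an assumption).
        return v.ffdv_minorversion in (0, 1, 2)
    # This layout specification defines semantics only for versions 3 and 4.
    return False
```

A client-side sketch only; a real implementation would also honor ffdv_rsize and ffdv_wsize when sizing its I/O requests.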
698 Note that while the client might determine that it cannot use any of 699 the configured combinations of ffdv_version, ffdv_minorversion, and 700 ffdv_tightly_coupled, when it gets the device list from the metadata 701 server, there is no way to indicate to the metadata server with 702 which device it is version incompatible. If, however, the client 703 waits until it retrieves the layout from the metadata server, it can 704 at that time clearly identify the storage device in question (see 705 Section 5.3). 707 The ffdv_rsize and ffdv_wsize are used to communicate the maximum 708 rsize and wsize supported by the storage device. As the storage 709 device can have a different rsize or wsize than the metadata server, 710 the ffdv_rsize and ffdv_wsize allow the metadata server to 711 communicate that information on behalf of the storage device. 713 ffdv_tightly_coupled informs the client as to whether the metadata 714 server is tightly coupled with the storage devices or not. Note that 715 even if the data protocol is at least NFSv4.1, it may still be the 716 case that loose coupling is in effect. If 717 ffdv_tightly_coupled is not set, then the client MUST commit writes 718 to the storage devices for the file before sending a LAYOUTCOMMIT to 719 the metadata server. I.e., the writes MUST be committed by the 720 client to stable storage via issuing WRITEs with stable_how == 721 FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how != 722 FILE_SYNC (see Section 3.3.7 of [RFC1813]). 724 4.2. Storage Device Multipathing 726 The flexible file layout type supports multipathing to multiple 727 storage device addresses. Storage device level multipathing is used 728 for bandwidth scaling via trunking and for higher availability of use 729 in the event of a storage device failure.
Multipathing allows the 730 client to switch to another storage device address which may be that 731 of another storage device that is exporting the same data stripe 732 unit, without having to contact the metadata server for a new layout. 734 To support storage device multipathing, ffda_netaddrs contains an 735 array of one or more storage device network addresses. This array 736 (data type multipath_list4) represents a list of storage devices 737 (each identified by a network address), with the possibility that 738 some storage device will appear in the list multiple times. 740 The client is free to use any of the network addresses as a 741 destination to send storage device requests. If some network 742 addresses are less desirable paths to the data than others, then the 743 MDS SHOULD NOT include those network addresses in ffda_netaddrs. If 744 less desirable network addresses exist to provide failover, the 745 RECOMMENDED method to offer the addresses is to provide them in a 746 replacement device-ID-to-device-address mapping, or a replacement 747 device ID. When a client finds no response from the storage device 748 using all addresses available in ffda_netaddrs, it SHOULD send a 749 GETDEVICEINFO to attempt to replace the existing device-ID-to-device- 750 address mappings. If the MDS detects that all network paths 751 represented by ffda_netaddrs are unavailable, the MDS SHOULD send a 752 CB_NOTIFY_DEVICEID (if the client has indicated it wants device ID 753 notifications for changed device IDs) to change the device-ID-to- 754 device-address mappings to the available addresses. If the device ID 755 itself will be replaced, the MDS SHOULD recall all layouts with the 756 device ID, and thus force the client to get new layouts and device ID 757 mappings via LAYOUTGET and GETDEVICEINFO. 759 Generally, if two network addresses appear in ffda_netaddrs, they 760 will designate the same storage device. 
When the storage device is 761 accessed over NFSv4.1 or a higher minor version, the two storage 762 device addresses will support the implementation of client ID or 763 session trunking (the latter is RECOMMENDED) as defined in [RFC5661]. 764 The two storage device addresses will share the same server owner or 765 major ID of the server owner. It is not always necessary for the two 766 storage device addresses to designate the same storage device with 767 trunking being used. For example, the data could be read-only, and 768 the data consist of exact replicas. 770 5. Flexible File Layout Type 772 The layout4 type is defined in [RFC5662] as follows: 774 776 enum layouttype4 { 777 LAYOUT4_NFSV4_1_FILES = 1, 778 LAYOUT4_OSD2_OBJECTS = 2, 779 LAYOUT4_BLOCK_VOLUME = 3, 780 LAYOUT4_FLEX_FILES = 4 781 [[RFC Editor: please modify the LAYOUT4_FLEX_FILES 782 to be the layouttype assigned by IANA]] 783 }; 785 struct layout_content4 { 786 layouttype4 loc_type; 787 opaque loc_body<>; 788 }; 790 struct layout4 { 791 offset4 lo_offset; 792 length4 lo_length; 793 layoutiomode4 lo_iomode; 794 layout_content4 lo_content; 795 }; 797 799 This document defines structure associated with the layouttype4 value 800 LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure as an 801 XDR type "opaque". The opaque layout is uninterpreted by the generic 802 pNFS client layers, but is interpreted by the flexible file layout 803 type implementation. This section defines the structure of this 804 otherwise opaque value, ff_layout4. 806 5.1. 
ff_layout4 808 810 /// const FF_FLAGS_NO_LAYOUTCOMMIT = 0x00000001; 811 /// const FF_FLAGS_NO_IO_THRU_MDS = 0x00000002; 812 /// const FF_FLAGS_NO_READ_IO = 0x00000004; 813 /// const FF_FLAGS_WRITE_ONE_MIRROR = 0x00000008; 815 /// typedef uint32_t ff_flags4; 816 /// 817 /// struct ff_data_server4 { 818 /// deviceid4 ffds_deviceid; 819 /// uint32_t ffds_efficiency; 820 /// stateid4 ffds_stateid; 821 /// nfs_fh4 ffds_fh_vers<>; 822 /// fattr4_owner ffds_user; 823 /// fattr4_owner_group ffds_group; 824 /// }; 825 /// 827 /// struct ff_mirror4 { 828 /// ff_data_server4 ffm_data_servers<>; 829 /// }; 830 /// 832 /// struct ff_layout4 { 833 /// length4 ffl_stripe_unit; 834 /// ff_mirror4 ffl_mirrors<>; 835 /// ff_flags4 ffl_flags; 836 /// uint32_t ffl_stats_collect_hint; 837 /// }; 838 /// 840 842 The ff_layout4 structure specifies a layout over a set of mirrored 843 copies of that portion of the data file described in the current 844 layout segment. This mirroring protects against loss of data in 845 layout segments. Note that while not explicitly shown in the above 846 XDR, each layout4 element returned in the logr_layout array of 847 LAYOUTGET4res (see Section 18.43.1 of [RFC5661]) describes a layout 848 segment. Hence each ff_layout4 also describes a layout segment. 850 It is possible that the file is concatenated from more than one 851 layout segment. Each layout segment MAY represent different striping 852 parameters, applying respectively only to the layout segment byte 853 range. 855 The ffl_stripe_unit field is the stripe unit size in use for the 856 current layout segment. The number of stripes is given inside each 857 mirror by the number of elements in ffm_data_servers. If the number 858 of stripes is one, then the value for ffl_stripe_unit MUST default to 859 zero. The only supported mapping scheme is sparse and is detailed in 860 Section 6. 
Note that there is an assumption here that both the 861 stripe unit size and the number of stripes are the same across all 862 mirrors. 864 The ffl_mirrors field is the array of mirrored storage devices which 865 provide the storage for the current stripe, see Figure 1. 867 The ffl_stats_collect_hint field provides a hint to the client on how 868 often the server wants it to report LAYOUTSTATS for a file. The time 869 is in seconds. 871 +-----------+ 872 | | 873 | | 874 | File | 875 | | 876 | | 877 +-----+-----+ 878 | 879 +------------+------------+ 880 | | 881 +----+-----+ +-----+----+ 882 | Mirror 1 | | Mirror 2 | 883 +----+-----+ +-----+----+ 884 | | 885 +-----------+ +-----------+ 886 |+-----------+ |+-----------+ 887 ||+-----------+ ||+-----------+ 888 +|| Storage | +|| Storage | 889 +| Devices | +| Devices | 890 +-----------+ +-----------+ 892 Figure 1 894 The ffl_mirrors field represents an array of state information for 895 each mirrored copy of the current layout segment. Each element is 896 described by a ff_mirror4 type. 898 ffds_deviceid provides the deviceid of the storage device holding the 899 data file. 901 ffds_fh_vers is an array of filehandles of the data file matching 902 the available NFS versions on the given storage device. There MUST 903 be exactly as many elements in ffds_fh_vers as there are in 904 ffda_versions. Each element of the array corresponds to a particular 905 combination of ffdv_version, ffdv_minorversion, and 906 ffdv_tightly_coupled provided for the device. The array allows for 907 server implementations which have different filehandles for different 908 combinations of version, minor version, and coupling strength. See 909 Section 5.3 for how to handle versioning issues between the client 910 and storage devices. 912 For tight coupling, ffds_stateid provides the stateid to be used by 913 the client to access the file.
For loose coupling and a NFSv4 914 storage device, the client may use an anonymous stateid to perform I/ 915 O on the storage device as there is no use for the metadata server 916 stateid (no control protocol). In such a scenario, the server MUST 917 set the ffds_stateid to be the anonymous stateid. 919 This specification of the ffds_stateid is mostly broken for the 920 tightly coupled model. There needs to exist a one to one mapping 921 from ffds_stateid to ffds_fh_vers - each open file on the storage 922 device might need an open stateid. As there are established loosely 923 coupled implementations of this version of the protocol, the only 924 viable approaches for a tightly coupled implementation would be to 925 either use an anonymous stateid for the ffds_stateid or restrict the 926 size of the ffds_fh_vers to be one. Fixing this issue will require a 927 new version of the protocol. 929 [[AI14: One reviewer points out for loosely coupled, we can use the 930 anon stateid and for tightly coupled we can use the "global stateid". 931 These make it appear that the bug in the spec was actually a feature. 932 The intent here is to own up to the bug and shipping code. Can it be 933 said nicer? --TH]] 935 For loosely coupled storage devices, ffds_user and ffds_group provide 936 the synthetic user and group to be used in the RPC credentials that 937 the client presents to the storage device to access the data files. 938 For tightly coupled storage devices, the user and group on the 939 storage device will be the same as on the metadata server. I.e., if 940 ffdv_tightly_coupled (see Section 4.1) is set, then the client MUST 941 ignore both ffds_user and ffds_group. 943 The allowed values for both ffds_user and ffds_group are specified in 944 Section 5.9 of [RFC5661]. 
For NFSv3 compatibility, user and group 945 strings that consist of decimal numeric values with no leading zeros 946 can be given a special interpretation by clients and servers that 947 choose to provide such support. The receiver may treat such a user 948 or group string as representing the same user as would be represented 949 by an NFSv3 uid or gid having the corresponding numeric value. Note 950 that if using Kerberos for security, the expectation is that these 951 values will be a name@domain string. 953 ffds_efficiency describes the metadata server's evaluation as to the 954 effectiveness of each mirror. Note that this is per layout and not 955 per device as the metric may change due to perceived load, 956 availability to the metadata server, etc. Higher values denote 957 higher perceived utility. The way the client can select the best 958 mirror to access is discussed in Section 8.1. 960 ffl_flags is a bitmap that allows the metadata server to inform the 961 client of particular conditions that may result from the more or less 962 tight coupling of the storage devices. 964 FF_FLAGS_NO_LAYOUTCOMMIT: can be set to indicate that the client is 965 not required to send LAYOUTCOMMIT to the metadata server. 967 FF_FLAGS_NO_IO_THRU_MDS: can be set to indicate that the client 968 SHOULD NOT send I/O operations to the metadata server. I.e., even 969 if the client could determine that there was a network disconnect 970 to a storage device, the client SHOULD NOT try to proxy the I/O 971 through the metadata server. 973 FF_FLAGS_NO_READ_IO: can be set to indicate that the client SHOULD 974 NOT send READ requests with the layouts of iomode 975 LAYOUTIOMODE4_RW. Instead, it should request a layout of iomode 976 LAYOUTIOMODE4_READ from the metadata server. 978 FF_FLAGS_WRITE_ONE_MIRROR: can be set to indicate that the client 979 only needs to update one of the mirrors (see Section 8.2). 981 5.1.1.
Error Codes from LAYOUTGET 983 [RFC5661] provides little guidance as to how the client is to proceed 984 with a LAYOUTGET which returns an error of 985 NFS4ERR_LAYOUTTRYLATER, NFS4ERR_LAYOUTUNAVAILABLE, or NFS4ERR_DELAY. 986 Within the context of this document: 988 NFS4ERR_LAYOUTUNAVAILABLE: there is no layout available and the I/O 989 is to go to the metadata server. Note that it is possible to have 990 had a layout before a recall and not after. 992 NFS4ERR_LAYOUTTRYLATER: there is some issue preventing the layout 993 from being granted. If the client already has an appropriate 994 layout, it should continue with I/O to the storage devices. 996 NFS4ERR_DELAY: there is some issue preventing the layout from being 997 granted. If the client already has an appropriate layout, it 998 should not continue with I/O to the storage devices. 1000 5.1.2. Client Interactions with FF_FLAGS_NO_IO_THRU_MDS 1002 Even if the metadata server provides the FF_FLAGS_NO_IO_THRU_MDS 1003 flag, the client can still perform I/O to the metadata server. The 1004 flag is at best a hint. The flag indicates to the client that 1005 the metadata server most likely wants to separate the metadata I/O 1006 from the data I/O to increase the performance of the metadata 1007 operations. If the metadata server detects that the client is 1008 performing I/O against it despite the use of the 1009 FF_FLAGS_NO_IO_THRU_MDS flag, it can recall the layout and either not 1010 set the flag on the new layout or not provide a layout (perhaps the 1011 intent was for the server to temporarily prevent data I/O to meet 1012 some goal). The client's I/O would then proceed according to the 1013 status codes as outlined in Section 5.1.1. 1015 5.2. Interactions Between Devices and Layouts 1017 In [RFC5661], the file layout type is defined such that the 1018 relationship between multipathing and filehandles can result in 1019 either 0, 1, or N filehandles (see Section 13.3).
Some rationales for 1020 this are clustered servers which share the same filehandle or 1021 allowing for multiple read-only copies of the file on the same 1022 storage device. In the flexible file layout type, while there is an 1023 array of filehandles, they are independent of the multipathing being 1024 used. If the metadata server wants to provide multiple read-only 1025 copies of the same file on the same storage device, then it should 1026 provide multiple ff_device_addr4, each as a mirror. The client can 1027 then determine that, since the ffds_fh_vers are different, there 1028 are multiple copies of the file for the current layout segment 1029 available. 1031 5.3. Handling Version Errors 1033 When the metadata server provides the ffda_versions array in the 1034 ff_device_addr4 (see Section 4.1), the client is able to determine if 1035 it cannot access a storage device with any of the supplied 1036 combinations of ffdv_version, ffdv_minorversion, and 1037 ffdv_tightly_coupled. However, due to the limitations of reporting 1038 errors in GETDEVICEINFO (see Section 18.40 in [RFC5661]), the client 1039 is not able to specify which specific device it cannot communicate 1040 with over one of the provided ffdv_version and ffdv_minorversion 1041 combinations. Using ff_ioerr4 (see Section 9.1.1) inside either the 1042 LAYOUTRETURN (see Section 18.44 of [RFC5661]) or the LAYOUTERROR (see 1043 Section 15.6 of [RFC7862] and Section 10 of this document), the 1044 client can isolate the problematic storage device. 1046 The error code to return for LAYOUTRETURN and/or LAYOUTERROR is 1047 NFS4ERR_MINOR_VERS_MISMATCH. It does not matter whether the mismatch 1048 is a major version (e.g., client can use NFSv3 but not NFSv4) or 1049 minor version (e.g., client can use NFSv4.1 but not NFSv4.2); the 1050 error indicates that for all the supplied combinations for 1051 ffdv_version and ffdv_minorversion, the client cannot communicate 1052 with the storage device.
The client can retry the GETDEVICEINFO to 1053 see if the metadata server can provide a different combination or it 1054 can fall back to doing the I/O through the metadata server. 1056 6. Striping via Sparse Mapping 1058 While other layout types support both dense and sparse mapping of 1059 logical offsets to physical offsets within a file (see for example 1060 Section 13.4 of [RFC5661]), the flexible file layout type only 1061 supports a sparse mapping. 1063 With sparse mappings, the logical offset within a file (L) is also 1064 the physical offset on the storage device. As detailed in 1065 Section 13.4.4 of [RFC5661], this results in holes across each 1066 storage device which does not contain the current stripe index. 1068 L: logical offset into the file 1070 W: stripe width 1071 W = number of elements in ffm_data_servers 1073 S: number of bytes in a stripe 1074 S = W * ffl_stripe_unit 1076 N: stripe number 1077 N = L / S 1079 7. Recovering from Client I/O Errors 1081 The pNFS client may encounter errors when directly accessing the 1082 storage devices. However, it is the responsibility of the metadata 1083 server to recover from the I/O errors. When the LAYOUT4_FLEX_FILES 1084 layout type is used, the client MUST report the I/O errors to the 1085 server at LAYOUTRETURN time using the ff_ioerr4 structure (see 1086 Section 9.1.1). 1088 The metadata server analyzes the error and determines the required 1089 recovery operations such as recovering media failures or 1090 reconstructing missing data files. 1092 The metadata server SHOULD recall any outstanding layouts to allow it 1093 exclusive write access to the stripes being recovered and to prevent 1094 other clients from hitting the same error condition. In these cases, 1095 the server MUST complete recovery before handing out any new layouts 1096 to the affected byte ranges. 
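The sparse-mapping arithmetic defined in Section 6 above can be illustrated with a brief sketch. The function and variable names are illustrative only; the data-server index calculation is standard round-robin striping, which Section 6 implies but does not spell out:

```python
def sparse_map(logical_offset, stripe_unit, num_data_servers):
    """Illustrative sketch of the Section 6 sparse mapping: with a
    sparse mapping, the logical offset L is also the physical offset
    on the storage device, so only the stripe number and the index of
    the storage device holding the byte need to be derived."""
    W = num_data_servers        # stripe width: elements in ffm_data_servers
    S = W * stripe_unit         # S: number of bytes in a stripe
    N = logical_offset // S     # N: stripe number, as defined in Section 6
    # Index into ffm_data_servers holding this byte (round-robin striping;
    # an assumption, not stated explicitly in Section 6).
    idx = (logical_offset // stripe_unit) % W
    return N, idx
```

For example, with a stripe unit of 4096 bytes across two data servers, offsets 0 and 4096 both fall in stripe 0 but on different data servers, while offset 8192 begins stripe 1.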
1098 Although the client implementation has the option to propagate a 1099 corresponding error to the application that initiated the I/O 1100 operation and drop any unwritten data, the client should attempt to 1101 retry the original I/O operation by either requesting a new layout or 1102 sending the I/O via regular NFSv4.1+ READ or WRITE operations to the 1103 metadata server. The client SHOULD attempt to retrieve a new layout 1104 and retry the I/O operation using the storage device first and only 1105 if the error persists, retry the I/O operation via the metadata 1106 server. 1108 8. Mirroring 1110 The flexible file layout type has a simple model in place for the 1111 mirroring of the file data constrained by a layout segment. There is 1112 no assumption that each copy of the mirror is stored identically on 1113 the storage devices. For example, one device might employ 1114 compression or deduplication on the data. However, the over-the-wire 1115 transfer of the file contents MUST appear identical. Note, this is a 1116 constraint of the selected XDR representation in which each mirrored 1117 copy of the layout segment has the same striping pattern (see 1118 Figure 1). 1120 The metadata server is responsible for determining the number of 1121 mirrored copies and the location of each mirror. While the client 1122 may provide a hint as to how many copies it wants (see Section 12), the 1123 metadata server can ignore that hint and in any event, the client has 1124 no means to dictate either the storage device (which also means the 1125 coupling and/or protocol levels to access the layout segments) or 1126 the location of said storage device. 1128 The updating of mirrored layout segments is done via client-side 1129 mirroring. With this approach, the client is responsible for making 1130 sure modifications are made on all copies of the layout segments it 1131 is informed of via the layout.
If a layout segment is being 1132 resilvered to a storage device, that mirrored copy will not be in the 1133 layout. Thus the metadata server MUST update that copy until the 1134 client is presented it in a layout. If the FF_FLAGS_WRITE_ONE_MIRROR 1135 is set in ffl_flags, the client need only update one of the mirrors 1136 (see Section 8.2). If the client is writing to the layout segments 1137 via the metadata server, then the metadata server MUST update all 1138 copies of the mirror. As seen in Section 8.3, during the 1139 resilvering, the layout is recalled, and the client has to make 1140 modifications via the metadata server. 1142 8.1. Selecting a Mirror 1144 When the metadata server grants a layout to a client, it MAY let the 1145 client know how fast it expects each mirror to be once the request 1146 arrives at the storage devices via the ffds_efficiency member. While 1147 the algorithms to calculate that value are left to the metadata 1148 server implementations, factors that could contribute to that 1149 calculation include speed of the storage device, physical memory 1150 available to the device, operating system version, current load, etc. 1152 However, what should not be involved in that calculation is a 1153 perceived network distance between the client and the storage device. 1154 The client is better situated for making that determination based on 1155 past interaction with the storage device over the different available 1156 network interfaces between the two. I.e., the metadata server might 1157 not know about a transient outage between the client and storage 1158 device because it has no presence on the given subnet. 1160 As such, it is the client which decides which mirror to access for 1161 reading the file. The requirements for writing to mirrored layout 1162 segments are presented below. 1164 8.2. Writing to Mirrors 1166 8.2.1.
Single Storage Device Updates Mirrors 1168 If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is set, the client 1169 only needs to update one of the copies of the layout segment. For 1170 this case, the storage device MUST ensure that all copies of the 1171 mirror are updated when any one of the mirrors is updated. If the 1172 storage device gets an error when updating one of the mirrors, then 1173 it MUST inform the client that the original WRITE had an error. The 1174 client then MUST inform the metadata server (see Section 8.2.3). The 1175 client's responsibility with respect to COMMIT is explained in 1176 Section 8.2.4. The client may choose any one of the mirrors and may 1177 use ffds_efficiency in the same manner as for reading when making 1178 this choice. 1180 8.2.2. Client Updates All Mirrors 1182 If the FF_FLAGS_WRITE_ONE_MIRROR flag in ffl_flags is not set, the 1183 client is responsible for updating all mirrored copies of the layout 1184 segments that it is given in the layout. A single failed update is 1185 sufficient to fail the entire operation. If all but one copy is 1186 updated successfully and the last one provides an error, then the 1187 client needs to inform the metadata server, via either 1188 LAYOUTRETURN or LAYOUTERROR, that the update failed to that storage 1189 device. If the client is updating the mirrors serially, then it 1190 SHOULD stop at the first error encountered and report that to the 1191 metadata server. If the client is updating the mirrors in parallel, 1192 then it SHOULD wait until all storage devices respond such that it 1193 can report all errors encountered during the update. 1195 8.2.3. Handling Write Errors 1197 When the client reports a write error to the metadata server, the 1198 metadata server is responsible for determining if it wants to remove 1199 the errant mirror from the layout, if the mirror has recovered from 1200 some transient error, etc.
When the client tries to get a new 1201 layout, the metadata server informs it of the decision by the 1202 contents of the layout. The client MUST NOT make any assumptions 1203 that the contents of the previous layout will match those of the new 1204 one. If it has updates that were not committed to all mirrors, then 1205 it MUST resend those updates to all mirrors. 1207 There is no provision in the protocol for the metadata server to 1208 directly determine that the client has or has not recovered from an 1209 error. I.e., assume that the storage device was network partitioned 1210 from the client and all of the copies are successfully updated after 1211 the error was reported. There is no mechanism for the client to 1212 report that fact and the metadata server is forced to repair the file 1213 across the mirror. 1215 If the client supports NFSv4.2, it can use LAYOUTERROR and 1216 LAYOUTRETURN to provide hints to the metadata server about the 1217 recovery efforts. A LAYOUTERROR on a file is for a non-fatal error. 1218 A subsequent LAYOUTRETURN without a ff_ioerr4 indicates that the 1219 client successfully replayed the I/O to all mirrors. Any 1220 LAYOUTRETURN with a ff_ioerr4 is an error that the metadata server 1221 needs to repair. The client MUST be prepared for the LAYOUTERROR to 1222 trigger a CB_LAYOUTRECALL if the metadata server determines it needs 1223 to start repairing the file. 1225 8.2.4. Handling Write COMMITs 1227 When stable writes are done to the metadata server or to a single 1228 replica (if allowed by the use of FF_FLAGS_WRITE_ONE_MIRROR), it is 1229 the responsibility of the receiving node to propagate the written 1230 data stably, before replying to the client. 1232 In the corresponding cases in which unstable writes are done, the 1233 receiving node does not have any such obligation, although it may 1234 choose to asynchronously propagate the updates.
However, once a 1235 COMMIT is replied to, all replicas must reflect the writes that have 1236 been done, and these data must have been committed to stable storage 1237 on all replicas. 1239 In order to avoid situations in which stale data is read from 1240 replicas to which writes have not been propagated: 1242 o A client which has outstanding unstable writes made to a single node 1243 (metadata server or storage device) MUST do all reads from that 1244 same node. 1246 o When writes are flushed to the server, for example, to implement 1247 close-to-open semantics, a COMMIT must be done by the client to 1248 ensure that up-to-date written data will be available irrespective 1249 of the particular replica read. 1251 8.3. Metadata Server Resilvering of the File 1253 The metadata server may elect to create a new mirror of the layout 1254 segments at any time. This might be to resilver a copy on a storage 1255 device which was down for servicing, to provide a copy of the layout 1256 segments on storage with different storage performance 1257 characteristics, etc. As the client will not be aware of the new 1258 mirror and the metadata server will not be aware of updates that the 1259 client is making to the layout segments, the metadata server MUST 1260 recall the writable layout segment(s) that it is resilvering. If the 1261 client issues a LAYOUTGET for a writable layout segment which is in 1262 the process of being resilvered, then the metadata server can deny 1263 that request with an NFS4ERR_LAYOUTUNAVAILABLE. The client would then 1264 have to perform the I/O through the metadata server. 1266 9. Flexible Files Layout Type Return 1268 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 1269 layout-type specific information to the server.
It is defined in 1270 Section 18.44.1 of [RFC5661] as follows: 1272 1274 /* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ 1275 const LAYOUT4_RET_REC_FILE = 1; 1276 const LAYOUT4_RET_REC_FSID = 2; 1277 const LAYOUT4_RET_REC_ALL = 3; 1279 enum layoutreturn_type4 { 1280 LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, 1281 LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, 1282 LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL 1283 }; 1285 struct layoutreturn_file4 { 1286 offset4 lrf_offset; 1287 length4 lrf_length; 1288 stateid4 lrf_stateid; 1289 /* layouttype4 specific data */ 1290 opaque lrf_body<>; 1291 }; 1292 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 1293 case LAYOUTRETURN4_FILE: 1294 layoutreturn_file4 lr_layout; 1295 default: 1296 void; 1297 }; 1299 struct LAYOUTRETURN4args { 1300 /* CURRENT_FH: file */ 1301 bool lora_reclaim; 1302 layoutreturn_stateid lora_recallstateid; 1303 layouttype4 lora_layout_type; 1304 layoutiomode4 lora_iomode; 1305 layoutreturn4 lora_layoutreturn; 1306 }; 1308 1310 If the lora_layout_type layout type is LAYOUT4_FLEX_FILES and the 1311 lr_returntype is LAYOUTRETURN4_FILE, then the lrf_body opaque value 1312 is defined by ff_layoutreturn4 (see Section 9.3). It allows the 1313 client to report I/O error information or layout usage statistics 1314 back to the metadata server as defined below. Note that while the 1315 data structures are built on concepts introduced in NFSv4.2, the 1316 effective discriminated union (lora_layout_type combined with 1317 ff_layoutreturn4) allows for an NFSv4.1 metadata server to utilize the 1318 data. 1320 9.1. I/O Error Reporting 1322 9.1.1.
ff_ioerr4

   /// struct ff_ioerr4 {
   ///         offset4        ffie_offset;
   ///         length4        ffie_length;
   ///         stateid4       ffie_stateid;
   ///         device_error4  ffie_errors<>;
   /// };
   ///

Recall that [RFC7862] defines device_error4 as:

   struct device_error4 {
           deviceid4  de_deviceid;
           nfsstat4   de_status;
           nfs_opnum4 de_opnum;
   };

The ff_ioerr4 structure is used to return error indications for data
files that generated errors during data transfers.  These are hints
to the metadata server that there are problems with that file.  For
each error, ffie_errors.de_deviceid, ffie_offset, and ffie_length
represent the storage device and byte range within the file in which
the error occurred; ffie_errors represents the operation and type of
error.  The use of device_error4 is described in Section 15.6 of
[RFC7862].

Even though the storage device might be accessed via NFSv3 and
reports back NFSv3 errors to the client, the client is responsible
for mapping these to appropriate NFSv4 status codes as de_status.
Likewise, the NFSv3 operations need to be mapped to equivalent NFSv4
operations.

9.2.  Layout Usage Statistics

9.2.1.  ff_io_latency4

   /// struct ff_io_latency4 {
   ///         uint64_t ffil_ops_requested;
   ///         uint64_t ffil_bytes_requested;
   ///         uint64_t ffil_ops_completed;
   ///         uint64_t ffil_bytes_completed;
   ///         uint64_t ffil_bytes_not_delivered;
   ///         nfstime4 ffil_total_busy_time;
   ///         nfstime4 ffil_aggregate_completion_time;
   /// };
   ///

Both operation counts and bytes transferred are kept in the
ff_io_latency4.  READ operations are used for read latencies.  Both
WRITE and COMMIT operations are used for write latencies.
"Requested" counters track what the client is attempting to do and
"completed" counters track what was done.
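As a non-normative illustration (not part of the protocol), a consumer of these counters might derive per-period metrics as follows; the dictionary keys mirror the XDR field names, and the nfstime4 values are simplified to plain seconds:

```python
# Non-normative sketch: metrics a metadata server might derive from
# ff_io_latency4 counters.  The dict representation stands in for the
# XDR type; nfstime4 is simplified to seconds as a float.

def average_latency_secs(lat):
    # Mean per-operation latency over the sampling period.
    if lat["ffil_ops_completed"] == 0:
        return 0.0
    return lat["ffil_aggregate_completion_time"] / lat["ffil_ops_completed"]

def error_bytes(lat):
    # Bytes requested but never delivered due to error conditions.
    return lat["ffil_bytes_not_delivered"]

def in_flight_ops(lat):
    # Requested vs. completed counters need not match within a single
    # report, so this difference only approximates the queue depth.
    return lat["ffil_ops_requested"] - lat["ffil_ops_completed"]
```
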
Note that there is no requirement that the client only report
completed results that have matching requested results from the
reported period.

ffil_bytes_not_delivered is used to track the aggregate number of
bytes requested but not fulfilled due to error conditions.
ffil_total_busy_time is the aggregate time spent with outstanding RPC
calls; ffil_aggregate_completion_time is the sum of all latencies for
completed RPC calls.

Note that LAYOUTSTATS are cumulative, i.e., not reset each time the
operation is sent.  If two LAYOUTSTATS operations for the same file,
layout stateid, and originating from the same NFS client are
processed at the same time by the metadata server, then the one
containing the larger values contains the most recent time series
data.

9.2.2.  ff_layoutupdate4

   /// struct ff_layoutupdate4 {
   ///         netaddr4       ffl_addr;
   ///         nfs_fh4        ffl_fhandle;
   ///         ff_io_latency4 ffl_read;
   ///         ff_io_latency4 ffl_write;
   ///         nfstime4       ffl_duration;
   ///         bool           ffl_local;
   /// };
   ///

ffl_addr differentiates which network address the client connected to
on the storage device.  In the case of multipathing, ffl_fhandle
indicates which read-only copy was selected.  ffl_read and ffl_write
convey the read and write latencies, respectively.  ffl_duration is
used to indicate the time period over which the statistics were
collected.  ffl_local, if true, indicates that the I/O was serviced
by the client's cache.  This flag allows the client to inform the
metadata server about "hot" access to a file it would not normally be
allowed to report on.

9.2.3.
ff_iostats4

   /// struct ff_iostats4 {
   ///         offset4           ffis_offset;
   ///         length4           ffis_length;
   ///         stateid4          ffis_stateid;
   ///         io_info4          ffis_read;
   ///         io_info4          ffis_write;
   ///         deviceid4         ffis_deviceid;
   ///         ff_layoutupdate4  ffis_layoutupdate;
   /// };
   ///

Recall that [RFC7862] defines io_info4 as:

   struct io_info4 {
           uint64_t ii_count;
           uint64_t ii_bytes;
   };

With pNFS, data transfers are performed directly between the pNFS
client and the storage devices.  Therefore, the metadata server has
no direct knowledge of the I/O operations being done and thus cannot
create on its own statistical information about client I/O to
optimize data storage location.  ff_iostats4 MAY be used by the
client to report I/O statistics back to the metadata server upon
returning the layout.

Since it is not feasible for the client to report every I/O that used
the layout, the client MAY identify "hot" byte ranges for which to
report I/O statistics.  The definition and/or configuration mechanism
of what is considered "hot" and the size of the reported byte range
is out of the scope of this document.  It is suggested that client
implementations provide reasonable default values and an optional
run-time management interface to control these parameters.  For
example, a client can define the default byte range resolution to be
1 MB in size and the thresholds for reporting to be 1 MB/second or 10
I/O operations per second.

For each byte range, ffis_offset and ffis_length represent the
starting offset of the range and the range length in bytes.
ffis_read.ii_count, ffis_read.ii_bytes, ffis_write.ii_count, and
ffis_write.ii_bytes represent, respectively, the number of contiguous
read and write I/Os and the respective aggregate number of bytes
transferred within the reported byte range.
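The example defaults above (1 MB byte-range resolution, reporting thresholds of 1 MB/second or 10 I/O operations per second) can be sketched non-normatively as follows; the bucketing policy and function names are illustrative, not specified by this document:

```python
# Non-normative sketch of "hot" byte-range selection: bucket I/Os
# into 1 MB ranges and report a range once it crosses 1 MB/s or
# 10 ops/s over the sampling interval.  These are the example default
# thresholds from the text, not protocol mandates.

BUCKET = 1 << 20          # 1 MB byte-range resolution
BYTES_PER_SEC = 1 << 20   # report at >= 1 MB/second ...
OPS_PER_SEC = 10          # ... or >= 10 I/O operations per second

def hot_ranges(ios, interval_secs):
    """ios: iterable of (offset, length) I/Os seen during the interval.
    Returns [(range_offset, range_length)] for buckets worth reporting."""
    buckets = {}
    for offset, length in ios:
        b = offset // BUCKET
        ops, nbytes = buckets.get(b, (0, 0))
        buckets[b] = (ops + 1, nbytes + length)
    return sorted(
        (b * BUCKET, BUCKET)
        for b, (ops, nbytes) in buckets.items()
        if ops / interval_secs >= OPS_PER_SEC
        or nbytes / interval_secs >= BYTES_PER_SEC
    )
```
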
The combination of ffis_deviceid and ffl_addr uniquely identifies
both the storage path and the network route to it.  Finally, the
ffl_fhandle allows the metadata server to differentiate between
multiple read-only copies of the file on the same storage device.

9.3.  ff_layoutreturn4

   /// struct ff_layoutreturn4 {
   ///         ff_ioerr4    fflr_ioerr_report<>;
   ///         ff_iostats4  fflr_iostats_report<>;
   /// };
   ///

When data file I/O operations fail, fflr_ioerr_report<> is used to
report these errors to the metadata server as an array of elements of
type ff_ioerr4.  Each element in the array represents an error that
occurred on the data file identified by ffie_errors.de_deviceid.  If
no errors are to be reported, the size of the fflr_ioerr_report<>
array is set to zero.  The client MAY also use fflr_iostats_report<>
to report a list of I/O statistics as an array of elements of type
ff_iostats4.  Each element in the array represents statistics for a
particular byte range.  Byte ranges are not guaranteed to be disjoint
and MAY repeat or intersect.

10.  Flexible Files Layout Type LAYOUTERROR

If the client is using NFSv4.2 to communicate with the metadata
server, then instead of waiting for a LAYOUTRETURN to send error
information to the metadata server (see Section 9.1), it MAY use
LAYOUTERROR (see Section 15.6 of [RFC7862]) to communicate that
information.  For the flexible files layout type, this means that
LAYOUTERROR4args is treated the same as ff_ioerr4.

11.  Flexible Files Layout Type LAYOUTSTATS

If the client is using NFSv4.2 to communicate with the metadata
server, then instead of waiting for a LAYOUTRETURN to send I/O
statistics to the metadata server (see Section 9.2), it MAY use
LAYOUTSTATS (see Section 15.7 of [RFC7862]) to communicate that
information.
For the flexible files layout type, this means that
LAYOUTSTATS4args.lsa_layoutupdate is overloaded with the same
contents as in ffis_layoutupdate.

12.  Flexible File Layout Type Creation Hint

The layouthint4 type is defined in [RFC5661] as follows:

   struct layouthint4 {
           layouttype4 loh_type;
           opaque      loh_body<>;
   };

The layouthint4 structure is used by the client to pass a hint about
the type of layout it would like created for a particular file.  If
the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body
opaque value is defined by the ff_layouthint4 type.

12.1.  ff_layouthint4

   /// union ff_mirrors_hint switch (bool ffmc_valid) {
   ///         case TRUE:
   ///                 uint32_t ffmc_mirrors;
   ///         case FALSE:
   ///                 void;
   /// };
   ///

   /// struct ff_layouthint4 {
   ///         ff_mirrors_hint fflh_mirrors_hint;
   /// };
   ///

This type conveys hints for the desired data map.  All parameters are
optional so the client can give values for only the parameters it
cares about.

13.  Recalling a Layout

While Section 12.5.5 of [RFC5661] discusses layout type independent
reasons for recalling a layout, the flexible file layout type
metadata server should recall outstanding layouts in the following
cases:

o  When the file's security policy changes, i.e., Access Control
   Lists (ACLs) or permission mode bits are set.

o  When the file's layout changes, rendering outstanding layouts
   invalid.

o  When existing layouts are inconsistent with the need to enforce
   locking constraints.

o  When existing layouts are inconsistent with the requirements
   regarding resilvering as described in Section 8.3.

13.1.  CB_RECALL_ANY

The metadata server can use the CB_RECALL_ANY callback operation to
notify the client to return some or all of its layouts.
Section 22.3 of [RFC5661] defines the allowed types of the "NFSv4
Recallable Object Types Registry".

   /// const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = 16;
   /// const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = 17;
   [[RFC Editor: please insert assigned constants]]
   ///

   struct CB_RECALL_ANY4args {
           uint32_t craa_layouts_to_keep;
           bitmap4  craa_type_mask;
   };

Typically, CB_RECALL_ANY will be used to recall client state when the
server needs to reclaim resources.  The craa_type_mask bitmap
specifies the type of resources that are recalled and the
craa_layouts_to_keep value specifies how many of the recalled
flexible file layouts the client is allowed to keep.  The flexible
file layout type mask flags are defined as follows:

   /// enum ff_cb_recall_any_mask {
   ///     FF_RCA4_TYPE_MASK_READ = -2,
   ///     FF_RCA4_TYPE_MASK_RW   = -1
   [[RFC Editor: please insert assigned constants]]
   /// };
   ///

They represent the iomode of the recalled layouts.  In response, the
client SHOULD return layouts of the recalled iomode that it needs the
least, keeping at most craa_layouts_to_keep flexible file layouts.

The FF_RCA4_TYPE_MASK_READ flag notifies the client to return layouts
of iomode LAYOUTIOMODE4_READ.  Similarly, the FF_RCA4_TYPE_MASK_RW
flag notifies the client to return layouts of iomode
LAYOUTIOMODE4_RW.  When both mask flags are set, the client is
notified to return layouts of either iomode.

14.  Client Fencing

In cases where clients are uncommunicative and their lease has
expired or when clients fail to return recalled layouts within a
lease period, the server MAY revoke client layouts and reassign these
resources to other clients (see Section 12.5.5 of [RFC5661]).
To avoid data corruption, the metadata server MUST fence off the
revoked clients from the respective data files as described in
Section 2.2.

15.  Security Considerations

The pNFS extension partitions the NFSv4.1+ file system protocol into
two parts, the control path and the data path (storage protocol).
The control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply
to the control path.  The combination of components in a pNFS system
is required to preserve the security properties of NFSv4.1+ with
respect to an entity accessing data via a client, including security
countermeasures to defend against threats for which NFSv4.1+ provides
defenses in environments where these threats are considered
significant.

The metadata server enforces the file access-control policy at
LAYOUTGET time.  The client should use RPC authorization credentials
(uid/gid for AUTH_SYS or tickets for Kerberos) for getting the layout
for the requested iomode (READ or RW), and the server verifies the
permissions and ACL for these credentials, possibly returning
NFS4ERR_ACCESS if the client is not allowed the requested iomode.  If
the LAYOUTGET operation succeeds, the client receives, as part of the
layout, a set of credentials allowing it I/O access to the specified
data files corresponding to the requested iomode.  When the client
acts on I/O operations on behalf of its local users, it MUST
authenticate and authorize the user by issuing respective OPEN and
ACCESS calls to the metadata server, similar to having NFSv4 data
delegations.

If access is allowed, the client uses the corresponding (READ or RW)
credentials to perform the I/O operations at the data file's storage
devices.
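As a non-normative sketch of the LAYOUTGET-time access check described above: the permission model, credential representation, and function names below are simplified placeholders, while the LAYOUTIOMODE4_* and NFS4ERR_ACCESS values match [RFC5661]:

```python
# Non-normative sketch of a metadata server's access check at
# LAYOUTGET time.  The ACL model here is a bare principal -> rights
# mapping, far simpler than real NFSv4 ACLs.

LAYOUTIOMODE4_READ, LAYOUTIOMODE4_RW = 1, 2
NFS4_OK, NFS4ERR_ACCESS = 0, 13

def layoutget_check(acl, principal, iomode):
    """acl: {principal: set of 'READ'/'WRITE'}.  Returns a (status,
    io_credentials) pair; credentials are scoped to the granted iomode."""
    perms = acl.get(principal, set())
    need = {"READ"} if iomode == LAYOUTIOMODE4_READ else {"READ", "WRITE"}
    if not need <= perms:
        return NFS4ERR_ACCESS, None
    # On success, the layout carries credentials for the granted iomode,
    # which the client then uses for I/O at the storage devices.
    return NFS4_OK, {"principal": principal, "iomode": iomode}
```
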
When the metadata server receives a request to change a file's
permissions or ACL, it SHOULD recall all layouts for that file and
then MUST fence off any clients still holding outstanding layouts for
the respective files by implicitly invalidating the previously
distributed credential on all data files comprising the file in
question.  It is REQUIRED that this be done before committing to the
new permissions and/or ACL.  By requesting new layouts, the clients
will reauthorize access against the modified access control metadata.
Recalling the layouts in this case is intended to prevent clients
from getting an error on I/Os done after the client was fenced off.

15.1.  Kerberized File Access

15.1.1.  Loosely Coupled

RPCSEC_GSS version 3 (RPCSEC_GSSv3) [rpcsec_gssv3] could be used to
authorize the client to the storage device on behalf of the metadata
server.  This would require that the metadata server, storage device,
and client all implement RPCSEC_GSSv3.  The requirement that the
storage device be modified does not match the intent of the loosely
coupled model.

Under this coupling model, the principal used to authenticate the
metadata file is different from that used to authenticate the data
file.  For the metadata server, the user credentials would be
generated by the same Kerberos server as the client.  However, for
the data storage access, the metadata server would generate the
ticket granting tickets and provide them to the client.  Fencing
would then be controlled either by expiring the ticket or by
modifying the synthetic uid or gid on the data file.

15.1.2.  Tightly Coupled

With tight coupling, the principal used to access the metadata file
is exactly the same as used to access the data file.  As a result
there are no security issues related to using Kerberos with a tightly
coupled system.
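The loosely coupled fencing mechanism mentioned in Section 15.1.1 (modifying the synthetic uid or gid on the data file, per the scheme of Section 2.2) can be sketched non-normatively; the class, the uid allocator, and the AUTH_SYS check below are all illustrative:

```python
# Non-normative sketch of loosely coupled fencing: the metadata
# server revokes previously distributed credentials by rotating the
# synthetic uid on the data file, so I/O carrying the old uid no
# longer passes the storage device's plain AUTH_SYS ownership check.

import itertools

_uids = itertools.count(1000)   # illustrative synthetic uid allocator

class DataFile:
    def __init__(self):
        self.owner_uid = next(_uids)

    def authorize(self, uid):
        # Storage device side: unmodified AUTH_SYS ownership check.
        return uid == self.owner_uid

    def fence(self):
        # Metadata server side: implicitly invalidate all outstanding
        # credentials by rotating the synthetic uid, then hand out new
        # credentials only with freshly granted layouts.
        self.owner_uid = next(_uids)
```
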
16.  IANA Considerations

[RFC5661] introduced the "pNFS Layout Types Registry"; as such, new
layout type numbers need to be assigned by IANA.  This document
defines the protocol associated with the existing layout type number,
LAYOUT4_FLEX_FILES (see Table 1).

   +--------------------+-------+--------+-----+----------------+
   | Layout Type Name   | Value | RFC    | How | Minor Versions |
   +--------------------+-------+--------+-----+----------------+
   | LAYOUT4_FLEX_FILES | 0x4   | RFCTDB | L   | 1              |
   +--------------------+-------+--------+-----+----------------+

                  Table 1: Layout Type Assignments

[RFC5661] also introduced a registry called "NFSv4 Recallable Object
Types Registry".  This document defines new recallable objects for
RCA4_TYPE_MASK_FF_LAYOUT_MIN and RCA4_TYPE_MASK_FF_LAYOUT_MAX (see
Table 2).

   +------------------------------+-------+--------+-----+----------+
   | Recallable Object Type Name  | Value | RFC    | How | Minor    |
   |                              |       |        |     | Versions |
   +------------------------------+-------+--------+-----+----------+
   | RCA4_TYPE_MASK_FF_LAYOUT_MIN | 16    | RFCTDB | L   | 1        |
   | RCA4_TYPE_MASK_FF_LAYOUT_MAX | 17    | RFCTDB | L   | 1        |
   +------------------------------+-------+--------+-----+----------+

            Table 2: Recallable Object Type Assignments

Note, [RFC5661] should have also defined (see Table 3):

   +---------------------------------+-------+-----------+-----+----------+
   | Recallable Object Type Name     | Value | RFC       | How | Minor    |
   |                                 |       |           |     | Versions |
   +---------------------------------+-------+-----------+-----+----------+
   | RCA4_TYPE_MASK_OTHER_LAYOUT_MIN | 12    | [RFC5661] | L   | 1        |
   | RCA4_TYPE_MASK_OTHER_LAYOUT_MAX | 15    | [RFC5661] | L   | 1        |
   +---------------------------------+-------+-----------+-----+----------+

            Table 3: Recallable Object Type Assignments

17.  References

17.1.
Normative References

[LEGAL]    IETF Trust, "Legal Provisions Relating to IETF Documents",
           November 2008.

[RFC1813]  IETF, "NFS Version 3 Protocol Specification", RFC 1813,
           June 1995.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
           STD 67, RFC 4506, May 2006.

[RFC5531]  Thurlow, R., "RPC: Remote Procedure Call Protocol
           Specification Version 2", RFC 5531, May 2009.

[RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
           "Network File System (NFS) Version 4 Minor Version 1
           Protocol", RFC 5661, January 2010.

[RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
           "Network File System (NFS) Version 4 Minor Version 1
           External Data Representation Standard (XDR) Description",
           RFC 5662, January 2010.

[RFC7530]  Haynes, T. and D. Noveck, "Network File System (NFS)
           Version 4 Protocol", RFC 7530, March 2015.

[RFC7862]  Haynes, T., "NFS Version 4 Minor Version 2", RFC 7862,
           November 2016.

[pNFSLayouts]
           Haynes, T., "Requirements for pNFS Layout Types", draft-
           ietf-nfsv4-layout-types-04 (Work in Progress), January
           2016.

17.2.  Informative References

[RFC4519]  Sciberras, A., Ed., "Lightweight Directory Access Protocol
           (LDAP): Schema for User Applications", RFC 4519,
           DOI 10.17487/RFC4519, June 2006.

[rpcsec_gssv3]
           Adamson, W. and N. Williams, "Remote Procedure Call (RPC)
           Security Version 3", November 2014.

Appendix A.  Acknowledgments

Those who provided miscellaneous comments to early drafts of this
document include: Matt W. Benjamin, Adam Emerson, J. Bruce Fields,
and Lev Solomonov.
Those who provided miscellaneous comments to the final drafts of this
document include: Anand Ganesh, Robert Wipfel, Gobikrishnan
Sundharraj, Trond Myklebust, and Rick Macklem.

Idan Kedar caught a nasty bug in the interaction of client-side
mirroring and the minor versioning of devices.

Dave Noveck provided comprehensive reviews of the document during the
working group last calls.  He also rewrote Section 2.3.

Olga Kornievskaia made a convincing case against the use of a
credential versus a principal in the fencing approach.  Andy Adamson
and Benjamin Kaduk helped to sharpen the focus.

Tigran Mkrtchyan provided the use case for not allowing the client to
proxy the I/O through the data server.

Rick Macklem provided the use case for only writing to a single
mirror.

Appendix B.  RFC Editor Notes

[RFC Editor: please remove this section prior to publishing this
document as an RFC]

[RFC Editor: prior to publishing this document as an RFC, please
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
RFC number of this document]

Authors' Addresses

Benny Halevy

Email: bhalevy@gmail.com

Thomas Haynes
Primary Data, Inc.
4300 El Camino Real Ste 100
Los Altos, CA 94022
USA

Phone: +1 408 215 1519
Email: thomas.haynes@primarydata.com