NFSv4                                                          B. Halevy
Internet-Draft                                                 T. Haynes
Intended status: Informational                              Primary Data
Expires: December 12, 2014                                 June 10, 2014

                Parallel NFS (pNFS) Flexible File Layout
                 draft-bhalevy-nfsv4-flex-files-03.txt

Abstract

   The Parallel Network File System (pNFS) allows a separation between
   the metadata and data for a file. The metadata file access is
   handled via Network File System version 4 (NFSv4) minor version 1
   (NFSv4.1) and the data file access is specific to the protocol being
   used between the client and storage device. The client is informed
   by the metadata server as to which protocol to use via a Layout Type.
   The Flexible File Layout Type is defined in this document as an
   extension to NFSv4.1 to allow the use of storage devices which need
   not be tightly coupled to the metadata server.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 12, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Definitions
     1.2.  Difference Between a Data Server and a Storage Device
     1.3.  Requirements Language
   2.  Coupling of Storage Devices
     2.1.  LAYOUTCOMMIT
     2.2.  Security models
     2.3.  State and Locking Models
   3.  XDR Description of the Flexible File Layout Type
     3.1.  Code Components Licensing Notice
   4.  Device Addressing and Discovery
     4.1.  ff_device_addr
     4.2.  Storage Device Multipathing
   5.  Flexible File Layout Type
     5.1.  ff_layout4
   6.  Recovering from Client I/O Errors
   7.  Flexible Files Layout Type Return
     7.1.  ff_ioerr
     7.2.  ff_iostats
     7.3.  ff_layoutreturn
   8.  Flexible Files Layout Type LAYOUTERROR
   9.  Flexible Files Layout Type LAYOUTSTATS
   10. Flexible File Layout Type Creation Hint
     10.1.  ff_layouthint4
   11. Recalling Layouts
     11.1.  CB_RECALL_ANY
   12. Client Fencing
   13. Security Considerations
   14. IANA Considerations
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Appendix A.  Acknowledgments
   Appendix B.  RFC Editor Notes
   Authors' Addresses

1. Introduction

   In the parallel Network File System (pNFS), the metadata server
   returns Layout Type structures that describe where file data is
   located. There are different Layout Types for different storage
   systems and methods of arranging data on storage devices. This
   document defines the Flexible File Layout Type used with file-based
   data servers that are accessed using the Network File System (NFS)
   protocols: NFSv3 [RFC1813], NFSv4 [RFC3530], NFSv4.1 [RFC5661], and
   NFSv4.2 [NFSv42].

   In contrast to the File Layout Type [RFC5661] that also uses NFSv4.1
   to access the data server, the Flexible File Layout Type defines a
   simple device information model suitable for aggregating standalone
   NFS servers into a centrally managed pNFS cluster. In particular,
   unlike the File Layout Type, the Flexible File Layout Type does not
   provide striping of the data file across multiple storage devices.

   To provide a global state model equivalent to that of the File
   Layout Type, a back-end control protocol MAY be implemented between
   the metadata server and NFSv4.1 storage devices.
   Specifying the wire protocol of such a control protocol is out of
   scope for this document; the requirements for the protocol are
   specified in [RFC5661] and clarified in [pNFSLayouts].

1.1. Definitions

   control protocol: is a set of requirements for the communication of
      information on layouts, stateids, file metadata, and file data
      between the metadata server and the storage devices (see
      [pNFSLayouts]).

   data file: is that part of the file system object which describes
      the payload and not the object. E.g., it is the file contents.

   Data Server (DS): is one of the pNFS servers which provide the
      contents of a file system object which is a regular file.
      Depending on the layout, there might be one or more data servers
      over which the data is striped. Note that while the metadata
      server is strictly accessed over the NFSv4.1 protocol, depending
      on the Layout Type, the data server could be accessed via any
      protocol that meets the pNFS requirements.

   fencing: is when the metadata server prevents the storage devices
      from processing I/O from a specific client to a specific file.

   File Layout Type: is a Layout Type in which the storage devices are
      accessed via the NFSv4.1 protocol. It is defined in Section 13 of
      [RFC5661].

   layout: informs a client of which storage devices it needs to
      communicate with (and over which protocol) to perform I/O on a
      file. The layout might also provide some hints about how the
      storage is physically organized.

   layout iomode: describes whether the layout granted to the client is
      for read or read/write I/O.

   layout stateid: is a 128-bit quantity returned by a server that
      uniquely defines the layout state provided by the server for a
      specific layout that describes a Layout Type and file (see
      Section 12.5.2 of [RFC5661]). Section 12.5.3 of [RFC5661] further
      describes the difference between a layout stateid and a normal
      stateid.

   Layout Type: describes both the storage protocol used to access the
      data and the aggregation scheme used to lay out the file data on
      the underlying storage devices.

   loose coupling: is when the metadata server and the storage devices
      do not have a control protocol present.

   metadata file: is that part of the file system object which
      describes the object and not the payload. E.g., it could be the
      time since last modification, access, etc.

   Metadata Server (MDS): is the pNFS server which provides metadata
      information for a file system object. It also is responsible for
      generating layouts for file system objects. Note that the MDS is
      responsible for directory-based operations.

   Object Layout Type: is a Layout Type in which the storage devices
      are accessed via the OSD protocol [ANSI400-2004]. It is defined
      in [RFC5664].

   recalling a layout: is when the metadata server uses a back channel
      to inform the client that the layout is to be returned in a
      graceful manner. Note that the client may be able to flush any
      writes, etc., before replying to the metadata server.

   revoking a layout: is when the metadata server invalidates the
      layout such that neither the metadata server nor any storage
      device will accept any access from the client with that layout.

   stateid: is a 128-bit quantity returned by a server that uniquely
      defines the open and locking states provided by the server for a
      specific open-owner or lock-owner/open-owner pair for a specific
      file and type of lock.

   storage device: is another term used almost interchangeably with
      data server. See Section 1.2 for the nuances between the two.

   tight coupling: is when the metadata server and the storage devices
      do have a control protocol present.

1.2. Difference Between a Data Server and a Storage Device

   We defined a data server as a pNFS server, which implies that it can
   utilize the NFSv4.1 protocol to communicate with the client. As
   such, only the File Layout Type would currently meet this
   requirement. The more generic concept is a storage device, which can
   use any protocol to communicate with the client. The requirements
   for a storage device to act together with the metadata server to
   provide data to a client are that there is a Layout Type
   specification for the given protocol and that the metadata server has
   granted a layout to the client. Note that nothing precludes there
   being multiple supported Layout Types (i.e., protocols) between a
   metadata server, storage devices, and client.

   As storage device is the more encompassing terminology, this document
   utilizes it over data server.

1.3. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. Coupling of Storage Devices

   The coupling of the metadata server with the storage devices can be
   either tight or loose. In a tight coupling, there is a control
   protocol present to manage security, LAYOUTCOMMITs, etc. With a
   loose coupling, the only control protocol might be a version of NFS.
   As such, semantics for managing security, state, and locking models
   MUST be defined.

   A file is split into metadata and data. The "metadata file" is that
   part of the file stored on the metadata server. The "data file" is
   that part of the file stored on the storage device. And the "file"
   is the combination of the two.

2.1. LAYOUTCOMMIT

   With a tightly coupled system, when the metadata server receives a
   LAYOUTCOMMIT (see Section 18.42 of [RFC5661]), the semantics of the
   File Layout Type MUST be met (see Section 12.5.4 of [RFC5661]). With
   a loosely coupled system, a LAYOUTCOMMIT to the metadata server MUST
   be preceded by a COMMIT to the storage device. I.e., it is the
   responsibility of the client to make sure the data file is stable
   before the metadata server begins to query the storage devices about
   the changes to the file. Note that if the client has not done a
   COMMIT to the storage device, then the LAYOUTCOMMIT might not be
   synchronized to the last WRITE operation to the storage device.

2.2. Security models

   With NFSv3 storage devices, the metadata server uses synthetic uids
   and gids for the data file, where the uid owner of the data file is
   allowed read/write access and the gid owner is allowed read-only
   access. As part of the layout, the client is provided with the RPC
   credentials to be used (see ffm_auth in Section 5.1) to access the
   data file. Fencing off clients is achieved by the server using
   SETATTR to change the uid and/or gid owners of the data file, which
   implicitly revokes the outstanding RPC credentials. Note: it is
   recommended to implement common access control methods at the storage
   device filesystem exports level to allow only the metadata server
   root (super user) access to the storage device, and to set the owner
   of all directories holding data files to the root user. This
   security method, when using weak auth flavors such as AUTH_SYS,
   provides a practical model to enforce access control and fence off
   cooperative clients, but it cannot protect against malicious
   clients; hence it provides a level of security equivalent to NFSv3.
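
   As a rough illustration of the NFSv3 fencing model described above,
   the following Python sketch models a data file whose synthetic uid
   owner has read/write access and whose gid owner has read-only
   access. The names DataFile, check_access, and fence, as well as the
   example ids, are hypothetical and not part of this specification. A
   real metadata server would send an NFSv3 SETATTR to the storage
   device instead of mutating a local object.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Toy model of a data file on an NFSv3 storage device (hypothetical)."""
    uid: int  # synthetic uid owner: read/write access
    gid: int  # synthetic gid owner: read-only access


def check_access(f: DataFile, cred_uid: int, cred_gid: int, write: bool) -> bool:
    """Return True if AUTH_SYS-style credentials would be allowed the I/O."""
    if cred_uid == f.uid:
        return True          # the uid owner may read and write
    if cred_gid == f.gid:
        return not write     # the gid owner may only read
    return False


def fence(f: DataFile, new_uid: int, new_gid: int) -> None:
    """Stand-in for the metadata server's SETATTR that changes the owners."""
    f.uid, f.gid = new_uid, new_gid


data_file = DataFile(uid=1001, gid=2001)
old_cred = (1001, 2001)      # credentials handed out in the layout
assert check_access(data_file, *old_cred, write=True)

fence(data_file, new_uid=1002, new_gid=2002)   # implicitly revokes old_cred
assert not check_access(data_file, *old_cred, write=True)
assert not check_access(data_file, *old_cred, write=False)
```

   In this sketch, a client that still holds the old credentials is
   rejected for both reads and writes once the owners change. This
   works only for cooperative clients, as the text above notes.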

   With NFSv4.x storage devices, the metadata server sets the user and
   group owners, mode bits, and ACL of the data file to be the same as
   the User File, and the client must authenticate with the storage
   device and go through the same authorization process it would go
   through via the metadata server.

2.3. State and Locking Models

   Metadata file OPEN, LOCK, and DELEGATION operations are always
   executed only against the metadata server.

   With NFSv4 storage devices, the metadata server, in response to a
   state-changing operation, executes the operation against the
   respective data files on the storage devices. It then sends the
   storage device open stateid as part of the layout (see ffm_stateid
   in Section 5.1), and the client uses that stateid for READ/WRITE
   operations against the storage device.

   Standalone NFSv4.1 storage devices that do not return the
   EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way
   as NFSv4 storage devices.

   NFSv4.1 clustered storage devices that do identify themselves with
   the EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end
   control protocol as described in [RFC5661] to implement a global
   stateid model as defined there.

3. XDR Description of the Flexible File Layout Type

   This document contains the external data representation (XDR)
   [RFC4506] description of the Flexible File Layout Type. The XDR
   description is embedded in this document in a way that makes it
   simple for the reader to extract into a ready-to-compile form. The
   reader can feed this document into the following shell script to
   produce the machine readable XDR description of the Flexible File
   Layout Type:

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called "extract.sh",
   and this document is in a file called "spec.txt", then the reader can
   do:

   sh extract.sh < spec.txt > flex_files_prot.x

   The effect of the script is to remove leading white space from each
   line, plus a sentinel sequence of "///".

   The embedded XDR file header follows. Subsequent XDR descriptions,
   with the sentinel sequence, are embedded throughout the document.

   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both nfs
   types that end with a 4, such as offset4, length4, etc., as well as
   more generic types such as uint32_t and uint64_t.

3.1. Code Components Licensing Notice

   Both the XDR description and the scripts used for extracting the XDR
   description are Code Components as described in Section 4 of "Legal
   Provisions Relating to IETF Documents" [LEGAL]. These Code
   Components are licensed according to the terms of that document.

   /// /*
   ///  * Copyright (c) 2012 IETF Trust and the persons identified
   ///  * as authors of the code. All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * o Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * o Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * o Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  *
   ///  * This code was derived from RFCTBD10.
   ///  * Please reproduce this note if possible.
   ///  */
   ///
   /// /*
   ///  * flex_files_prot.x
   ///  */
   ///
   /// /*
   ///  * The following include statements are for example only.
   ///  * The actual XDR definition files are generated separately
   ///  * and independently and are likely to have a different name.
   ///  * %#include
   ///  * %#include
   ///  */
   ///

4. Device Addressing and Discovery

   Data operations to a storage device require the client to know the
   network address of the storage device. The NFSv4.1 GETDEVICEINFO
   operation (Section 18.40 of [RFC5661]) is used by the client to
   retrieve that information.

4.1. ff_device_addr

   The ff_device_addr data structure is returned by the server as the
   storage protocol specific opaque field da_addr_body in the
   device_addr4 structure by a successful GETDEVICEINFO operation.

   /// struct ff_device_addr {
   ///         multipath_list4     ffda_netaddrs;
   ///         uint32_t            ffda_version;
   ///         uint32_t            ffda_minorversion;
   ///         bool                ffda_tightly_coupled;
   /// };
   ///

   The ffda_netaddrs field is used to locate the storage device. It
   MUST be set by the server to a list holding one or more of the device
   network addresses.

   The ffda_version and ffda_minorversion fields represent the NFS
   protocol to be used to access the storage device. This layout
   specification defines the semantics for ffda_version values 3 and 4.
   If ffda_version equals 3 then the server MUST set ffda_minorversion
   to 0 and the client MUST access the storage device using the NFSv3
   protocol [RFC1813]. If ffda_version equals 4 then the server MUST
   set ffda_minorversion to one of the NFSv4 minor version numbers and
   the client MUST access the storage device using NFSv4.

   ffda_tightly_coupled informs the client as to whether the metadata
   server is tightly coupled with the storage devices or not. Note that
   even if the data protocol is at least NFSv4.1, it may still be the
   case that there is no control protocol present. If
   ffda_tightly_coupled is not set, then the client MUST commit writes
   to the storage devices for the file before sending a LAYOUTCOMMIT to
   the metadata server. I.e., the writes MUST be committed by the
   client to stable storage via issuing WRITEs with stable_how ==
   FILE_SYNC or by issuing a COMMIT after WRITEs with stable_how !=
   FILE_SYNC (see Section 3.3.7 of [RFC1813]).

4.2. Storage Device Multipathing

   The Flexible File Layout Type supports multipathing to multiple
   storage device addresses. Storage device level multipathing is used
   for bandwidth scaling via trunking and for higher availability in
   the case of a storage device failure. Multipathing allows the
   client to switch to another storage device address which may be that
   of another storage device that is exporting the same data stripe
   unit, without having to contact the metadata server for a new layout.

   To support storage device multipathing, ffda_netaddrs contains an
   array of one or more storage device network addresses. This array
   (data type multipath_list4) represents a list of storage devices
   (each identified by a network address), with the possibility that
   some storage devices will appear in the list multiple times.

   The client is free to use any of the network addresses as a
   destination to send storage device requests. If some network
   addresses are less optimal paths to the data than others, then the
   MDS SHOULD NOT include those network addresses in ffda_netaddrs. If
   less optimal network addresses exist to provide failover, the
   RECOMMENDED method to offer the addresses is to provide them in a
   replacement device-ID-to-device-address mapping, or a replacement
   device ID. When a client finds no response from the storage device
   using all addresses available in ffda_netaddrs, it SHOULD send a
   GETDEVICEINFO to attempt to replace the existing
   device-ID-to-device-address mappings. If the MDS detects that all
   network paths represented by ffda_netaddrs are unavailable, the MDS
   SHOULD send a CB_NOTIFY_DEVICEID (if the client has indicated it
   wants device ID notifications for changed device IDs) to change the
   device-ID-to-device-address mappings to the available addresses. If
   the device ID itself will be replaced, the MDS SHOULD recall all
   layouts with the device ID, and thus force the client to get new
   layouts and device ID mappings via LAYOUTGET and GETDEVICEINFO.
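
   The client-side failover behavior described above can be sketched
   roughly as follows. This is an illustrative Python sketch only: the
   function send_with_failover and its callbacks try_send and
   refresh_device_info are hypothetical stand-ins for an RPC to the
   storage device and a GETDEVICEINFO to the metadata server, and the
   address strings are simplified placeholders rather than real netaddr4
   values. The single refresh attempt is a choice made for this sketch.

```python
from typing import Callable, List, Optional


def send_with_failover(
    netaddrs: List[str],
    try_send: Callable[[str], bool],
    refresh_device_info: Callable[[], List[str]],
) -> Optional[str]:
    """Try each address; if none responds, refresh the mapping once and retry.

    try_send and refresh_device_info are hypothetical callbacks standing in
    for an RPC to the storage device and a GETDEVICEINFO to the MDS.
    Returns the address that worked, or None if every attempt failed.
    """
    for attempt in range(2):
        for addr in netaddrs:
            if try_send(addr):
                return addr
        if attempt == 0:
            # All addresses in ffda_netaddrs failed: ask for a new mapping.
            netaddrs = refresh_device_info()
    return None


# Example: both original paths are down, the refreshed mapping works.
reachable = {"192.0.2.20:2049"}
result = send_with_failover(
    ["192.0.2.10:2049", "192.0.2.11:2049"],
    try_send=lambda addr: addr in reachable,
    refresh_device_info=lambda: ["192.0.2.20:2049"],
)
assert result == "192.0.2.20:2049"
```

   If the refreshed mapping also fails, the sketch returns None, at
   which point a client would typically fall back to the error handling
   described in Section 6.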

   Generally, if two network addresses appear in ffda_netaddrs, they
   will designate the same storage device. When the storage device is
   accessed over NFSv4.1 or a higher minor version, the two storage
   device addresses will support the implementation of client ID or
   session trunking (the latter is RECOMMENDED) as defined in [RFC5661].
   The two storage device addresses will share the same server owner or
   major ID of the server owner. It is not always necessary for the two
   storage device addresses to designate the same storage device with
   trunking being used. For example, the data could be read-only, and
   the data consist of exact replicas.

5. Flexible File Layout Type

   The layout4 type is defined in [RFC5662] as follows:

   enum layouttype4 {
       LAYOUT4_NFSV4_1_FILES   = 1,
       LAYOUT4_OSD2_OBJECTS    = 2,
       LAYOUT4_BLOCK_VOLUME    = 3,
       LAYOUT4_FLEX_FILES      = 4
   [[RFC Editor: please modify the LAYOUT4_FLEX_FILES
     to be the layouttype assigned by IANA]]
   };

   struct layout_content4 {
       layouttype4             loc_type;
       opaque                  loc_body<>;
   };

   struct layout4 {
       offset4                 lo_offset;
       length4                 lo_length;
       layoutiomode4           lo_iomode;
       layout_content4         lo_content;
   };

   This document defines the structure associated with the layouttype4
   value LAYOUT4_FLEX_FILES. [RFC5661] specifies the loc_body structure
   as an XDR type "opaque". The opaque layout is uninterpreted by the
   generic pNFS client layers, but must be interpreted by the Flexible
   File Layout Type implementation. This section defines the structure
   of this opaque value, ff_layout4.

5.1. ff_layout4

   /// struct ff_mirror4 {
   ///         deviceid4       ffm_deviceid;
   ///         nfs_fh4         ffm_fhandle;
   ///         stateid4        ffm_stateid;
   ///         opaque_auth     ffm_auth;
   /// };
   ///

   /// struct ff_layout4 {
   ///         ff_mirror4      ffl_mirrors<>;
   /// };
   ///

   The ff_layout4 structure specifies a layout over a set of mirrored
   copies of the data file. This mirroring protects against loss of
   data files.

   It is possible that the file is concatenated from more than one
   layout segment. Each layout segment MAY represent different striping
   parameters, applying respectively only to the layout segment byte
   range.

   The ffl_mirrors field represents an array of state information for
   each mirrored copy of the file. Each element is described by an
   ff_mirror4 type.

   ffm_deviceid provides the deviceid of the storage device holding the
   data file.

   ffm_fhandle provides the filehandle of the data file on the given
   storage device. For tight coupling, ffm_stateid provides the stateid
   to be used by the client to access the file. For loose coupling and
   an NFSv4 storage device, the client may use an anonymous stateid to
   perform I/O on the storage device as there is no use for the metadata
   server stateid (no control protocol). In such a scenario, the server
   MUST set the ffm_stateid to be zero.

   For NFSv3 storage devices, ffm_auth provides the RPC credentials to
   be used by the client to access the data files. For NFSv4.x storage
   devices, the server SHOULD use the AUTH_NONE flavor and a zero length
   opaque body to minimize the returned structure length. The client
   MUST ignore ffm_auth in this case. [[AI6: Even for tightly coupled
   systems, that cannot be correct! --TH]]

6. Recovering from Client I/O Errors

   The pNFS client may encounter errors when directly accessing the
   storage devices. However, it is the responsibility of the metadata
   server to recover from the I/O errors. When the LAYOUT4_FLEX_FILES
   layout type is used, the client MUST report the I/O errors to the
   server at LAYOUTRETURN time using the ff_ioerr structure (see
   Section 7.1).

   The metadata server analyzes the error and determines the required
   recovery operations such as recovering media failures or
   reconstructing missing data files.

   The metadata server SHOULD recall any outstanding layouts to allow it
   exclusive write access to the stripes being recovered and to prevent
   other clients from hitting the same error condition. In these cases,
   the server MUST complete recovery before handing out any new layouts
   to the affected byte ranges.

   Although it MAY be acceptable for the client to propagate a
   corresponding error to the application that initiated the I/O
   operation and drop any unwritten data, the client SHOULD attempt to
   retry the original I/O operation by requesting a new layout using
   LAYOUTGET and retry the I/O operation(s) using the new layout, or the
   client MAY just retry the I/O operation(s) using regular NFS READ or
   WRITE operations via the metadata server. The client SHOULD attempt
   to retrieve a new layout and retry the I/O operation using the
   storage device first and only if the error persists, retry the I/O
   operation via the metadata server.

7. Flexible Files Layout Type Return

   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
   layout-type specific information to the server.
   It is defined in [RFC5661] as follows:

   struct layoutreturn_file4 {
           offset4         lrf_offset;
           length4         lrf_length;
           stateid4        lrf_stateid;
           /* layouttype4 specific data */
           opaque          lrf_body<>;
   };

   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
           case LAYOUTRETURN4_FILE:
                   layoutreturn_file4      lr_layout;
           default:
                   void;
   };

   struct LAYOUTRETURN4args {
           /* CURRENT_FH: file */
           bool                    lora_reclaim;
           layoutreturn_stateid    lora_recallstateid;
           layouttype4             lora_layout_type;
           layoutiomode4           lora_iomode;
           layoutreturn4           lora_layoutreturn;
   };

   If the lora_layout_type layout type is LAYOUT4_FLEX_FILES, then the
   lrf_body opaque value is defined by the ff_layoutreturn4 type. The
   new type allows the client to report I/O error information or layout
   usage statistics back to the metadata server as defined below.

7.1. ff_ioerr

   /// struct ff_ioerr4 {
   ///         offset4         ffie_offset;
   ///         length4         ffie_length;
   ///         stateid4        ffie_stateid;
   ///         device_error4   ffie_errors;
   /// };
   ///

   Recall that [NFSv42] defines device_error4 as:

   struct device_error4 {
           deviceid4       de_deviceid;
           nfsstat4        de_status;
           nfs_opnum4      de_opnum;
   };

   The ff_ioerr4 structure is used to return error indications for data
   files that generated errors during data transfers. These are hints
   to the metadata server that there are problems with that file. For
   each error, ffie_errors.de_deviceid, ffie_offset, and ffie_length
   represent the storage device and byte range within the file in which
   the error occurred; ffie_errors represents the operation and type of
   error. The use of device_error4 is described in Section 16.6 of
   [NFSv42].

7.2. ff_iostats

   /// struct ff_iostats4 {
   ///         offset4         ffis_offset;
   ///         length4         ffis_length;
   ///         stateid4        ffis_stateid;
   ///         uint32_t        ffis_duration;
   ///         io_info4        ffis_read;
   ///         io_info4        ffis_write;
   ///         layoutupdate4   ffis_layoutupdate;
   /// };
   ///

   Recall that [NFSv42] defines io_info4 as:

   struct io_info4 {
           uint32_t        ii_count;
           uint64_t        ii_bytes;
   };

   With pNFS, the data transfers are performed directly between the pNFS
   client and the storage devices. Therefore, the metadata server has
   no visibility into the I/O stream and cannot use any statistical
   information about client I/O to optimize data storage location.
   ff_iostats4 MAY be used by the client to report I/O statistics back
   to the metadata server upon returning the layout. Since it is
   infeasible for the client to report every I/O that used the layout,
   the client MAY identify "hot" byte ranges for which to report I/O
   statistics. The definition and/or configuration mechanism of what is
   considered "hot" and the size of the reported byte range is out of
   the scope of this document. Client implementations are encouraged to
   provide reasonable default values and an optional run-time
   management interface to control these parameters. For example, a
   client can define the default byte range resolution to be 1 MB in
   size and the thresholds for reporting to be 1 MB/second or 10 I/O
   operations per second. For each byte range, ffis_offset and
   ffis_length represent the starting offset of the range and the range
   length in bytes. ffis_duration represents the number of seconds the
   reported burst of I/O lasted. ffis_read.ii_count,
   ffis_read.ii_bytes, ffis_write.ii_count, and ffis_write.ii_bytes
   represent, respectively, the number of contiguous read and write I/Os
   and the respective aggregate number of bytes transferred within the
   reported byte range.
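
   As a non-normative illustration of the example thresholds above (a
   1 MB byte range resolution, reporting at 1 MB/second or 10 I/O
   operations per second), the following Python sketch selects "hot"
   byte ranges from a list of observed I/Os. The function hot_ranges,
   its input format, and the choice to attribute each I/O to the range
   containing its starting offset are assumptions made for this sketch.
   1 MB is taken here to mean 2^20 bytes. ffis_stateid and
   ffis_layoutupdate are omitted.

```python
from collections import defaultdict

RANGE_SIZE = 1 << 20       # 1 MB byte-range resolution (example default)
BYTES_PER_SEC = 1 << 20    # report at >= 1 MB/second ...
OPS_PER_SEC = 10           # ... or >= 10 I/O operations per second


def hot_ranges(ios, duration):
    """Return ff_iostats4-like dicts for byte ranges that look "hot".

    ios is an iterable of (offset, length, is_write) tuples observed during
    `duration` seconds. Each I/O is attributed to the range containing its
    starting offset, a simplification chosen for this sketch.
    """
    stats = defaultdict(lambda: {"read": [0, 0], "write": [0, 0]})
    for offset, length, is_write in ios:
        entry = stats[offset // RANGE_SIZE]["write" if is_write else "read"]
        entry[0] += 1          # ii_count
        entry[1] += length     # ii_bytes
    report = []
    for index in sorted(stats):
        rd, wr = stats[index]["read"], stats[index]["write"]
        ops, nbytes = rd[0] + wr[0], rd[1] + wr[1]
        if nbytes >= BYTES_PER_SEC * duration or ops >= OPS_PER_SEC * duration:
            report.append({
                "ffis_offset": index * RANGE_SIZE,
                "ffis_length": RANGE_SIZE,
                "ffis_duration": duration,
                "ffis_read": {"ii_count": rd[0], "ii_bytes": rd[1]},
                "ffis_write": {"ii_count": wr[0], "ii_bytes": wr[1]},
            })
    return report


# 25 small reads in the first range over 2 seconds exceed 10 ops/second;
# a single write in the sixth range does not meet either threshold.
ios = [(0, 4096, False)] * 25 + [(5 << 20, 4096, True)]
report = hot_ranges(ios, duration=2)
assert [r["ffis_offset"] for r in report] == [0]
```
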
   [[AI7: Need to define whether we are using ffis_layoutupdate or not.
   --TH]] [[AI8: Actually, ffis_duration might be what we plop down in
   there. In any event, ffis_duration needs some work. --TH]]

7.3.  ff_layoutreturn

   /// struct ff_layoutreturn4 {
   ///     ff_ioerr4      fflr_ioerr_report<>;
   ///     ff_iostats4    fflr_iostats_report<>;
   /// };
   ///

   When data file I/O operations fail, fflr_ioerr_report<> is used to
   report these errors to the metadata server as an array of elements of
   type ff_ioerr4. Each element in the array represents an error that
   occurred on the data file identified by ffie_errors.de_deviceid. If
   no errors are to be reported, the size of the fflr_ioerr_report<>
   array is set to zero. The client MAY also use fflr_iostats_report<>
   to report a list of I/O statistics as an array of elements of type
   ff_iostats4. Each element in the array represents statistics for a
   particular byte range. Byte ranges are not guaranteed to be disjoint
   and MAY repeat or intersect.

8.  Flexible Files Layout Type LAYOUTERROR

   If the client is using NFSv4.2 to communicate with the metadata
   server, then instead of waiting for a LAYOUTRETURN to send error
   information to the metadata server (see Section 7.1), it can use
   LAYOUTERROR (see Section 16.6 of [NFSv42]) to communicate that
   information.

9.  Flexible Files Layout Type LAYOUTSTATS

   If the client is using NFSv4.2 to communicate with the metadata
   server, then instead of waiting for a LAYOUTRETURN to send I/O
   statistics to the metadata server (see Section 7.2), it can use
   LAYOUTSTATS (see Section 16.7 of [NFSv42]) to communicate that
   information.

10.  Flexible File Layout Type Creation Hint

   The layouthint4 type is defined in [RFC5661] as follows:

   struct layouthint4 {
           layouttype4     loh_type;
           opaque          loh_body<>;
   };

   The layouthint4 structure is used by the client to pass a hint about
   the type of layout it would like created for a particular file. If
   the loh_type layout type is LAYOUT4_FLEX_FILES, then the loh_body
   opaque value is defined by the ff_layouthint4 type.

10.1.  ff_layouthint4

   /// union ff_mirrors_hint switch (bool ffmc_valid) {
   ///     case TRUE:
   ///         uint32_t    ffmc_mirrors;
   ///     case FALSE:
   ///         void;
   /// };
   ///

   /// struct ff_layouthint4 {
   ///     ff_mirrors_hint    fflh_mirrors_hint;
   /// };
   ///

   This type conveys hints for the desired data map. All parameters are
   optional, so the client can give values for only the parameters it
   cares about.

11.  Recalling Layouts

   The Flexible File Layout Type metadata server should recall
   outstanding layouts in the following cases:

   o  When the file's security policy changes, i.e., Access Control
      Lists (ACLs) or permission mode bits are set.

   o  When the file's layout changes, rendering outstanding layouts
      invalid.

   o  When there are sharing conflicts.

11.1.  CB_RECALL_ANY

   The metadata server can use the CB_RECALL_ANY callback operation to
   notify the client to return some or all of its layouts. [RFC5661]
   defines the following types:

   const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = -2;
   const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = -1;
   [[RFC Editor: please insert assigned constants]]

   struct CB_RECALL_ANY4args {
           uint32_t        craa_layouts_to_keep;
           bitmap4         craa_type_mask;
   };

   Typically, CB_RECALL_ANY will be used to recall client state when the
   server needs to reclaim resources. The craa_type_mask bitmap
   specifies the type of resources that are recalled and the
   craa_layouts_to_keep value specifies how many of the recalled
   Flexible File Layouts the client is allowed to keep. The Flexible
   File Layout Type mask flags are defined as follows:

   /// enum ff_cb_recall_any_mask {
   ///     FF_RCA4_TYPE_MASK_READ = -2,
   ///     FF_RCA4_TYPE_MASK_RW   = -1
   [[RFC Editor: please insert assigned constants]]
   /// };
   ///

   They represent the iomode of the recalled layouts. In response, the
   client SHOULD return layouts of the recalled iomode that it needs the
   least, keeping at most craa_layouts_to_keep Flexible File Layouts.

   The FF_RCA4_TYPE_MASK_READ flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_READ. Similarly, the
   FF_RCA4_TYPE_MASK_RW flag notifies the client to return layouts of
   iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client
   is notified to return layouts of either iomode.

12.  Client Fencing

   In cases where clients are uncommunicative and their lease has
   expired, or when clients fail to return recalled layouts within at
   least a lease period, the server MAY revoke client layouts and/or
   device address mappings and reassign these resources to other
   clients (see "Recalling a Layout" in [RFC5661]). To avoid data
   corruption, the metadata server MUST fence off the revoked clients
   from the respective data files as described in Section 2.2.

13.  Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into two
   parts, the control path and the data path (storage protocol). The
   control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features apply
   to the control path. The combination of components in a pNFS system
   is required to preserve the security properties of NFSv4 with respect
   to an entity accessing data via a client, including security
   countermeasures to defend against threats that NFSv4 provides
   defenses for in environments where these threats are considered
   significant.

   The metadata server enforces the file access-control policy at
   LAYOUTGET time. The client should use suitable authorization
   credentials for getting the layout for the requested iomode (READ or
   RW) and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode. If the LAYOUTGET operation succeeds,
   the client receives, as part of the layout, a set of credentials
   allowing it I/O access to the specified data files corresponding to
   the requested iomode. When the client acts on I/O operations on
   behalf of its local users, it MUST authenticate and authorize the
   user by issuing respective OPEN and ACCESS calls to the metadata
   server, similar to having NFSv4 data delegations. If access is
   allowed, the client uses the corresponding (READ or RW) credentials
   to perform the I/O operations at the data files' storage devices.
   When the metadata server receives a request to change a file's
   permissions or ACL, it SHOULD recall all layouts for that file and it
   MUST fence off the clients holding outstanding layouts for the
   respective file by implicitly invalidating the outstanding
   credentials on all data files comprising the file before committing
   to the new permissions and ACL. Doing this will ensure that clients
   re-authorize their layouts according to the modified permissions and
   ACL by requesting new layouts. Recalling the layouts in this case is
   a courtesy of the server, intended to prevent clients from getting an
   error on I/Os done after the client was fenced off.

14.  IANA Considerations

   As described in [RFC5661], new layout type numbers have been assigned
   by IANA. This document defines the protocol associated with the
   existing layout type number, LAYOUT4_FLEX_FILES.

15.  References

15.1.  Normative References

   [LEGAL]    IETF Trust, "Legal Provisions Relating to IETF Documents",
              November 2008.

   [NFSv42]   Haynes, T., "NFS Version 4 Minor Version 2", draft-ietf-
              nfsv4-minorversion2-22 (Work In Progress), April 2014.

   [RFC1813]  IETF, "NFS Version 3 Protocol Specification", RFC 1813,
              June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3530]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
              Beame, C., Eisler, M., and D. Noveck, "Network File System
              (NFS) version 4 Protocol", RFC 3530, April 2003.

   [RFC4506]  Eisler, M., "XDR: External Data Representation Standard",
              STD 67, RFC 4506, May 2006.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR) Description",
              RFC 5662, January 2010.

   [RFC5664]  Halevy, B., Ed., Welch, B., Ed., and J. Zelenka, Ed.,
              "Object-Based Parallel NFS (pNFS) Operations", RFC 5664,
              January 2010.

   [pNFSLayouts]
              Haynes, T., "Considerations for a New pNFS Layout Type",
              draft-haynes-nfsv4-layout-types-02 (Work In Progress),
              April 2014.

15.2.  Informative References

   [ANSI400-2004]
              Weber, R., Ed., "ANSI INCITS 400-2004, Information
              Technology - SCSI Object-Based Storage Device Commands
              (OSD)", December 2004.

Appendix A.  Acknowledgments

   Those who provided miscellaneous comments to early drafts of this
   document include: Matt W. Benjamin, Adam Emerson, Tom Haynes, J.
   Bruce Fields, and Lev Solomonov.

Appendix B.  RFC Editor Notes

   [RFC Editor: please remove this section prior to publishing this
   document as an RFC]

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
   RFC number of this document]

Authors' Addresses

   Benny Halevy
   Primary Data, Inc.

   Email: bhalevy@primarydata.com
   URI:   http://www.primarydata.com

   Thomas Haynes
   Primary Data, Inc.
   4300 El Camino Real Ste 100
   Los Altos, CA 94022
   USA

   Phone: +1 408 215 1519
   Email: thomas.haynes@primarydata.com