NFSv4                                                          B. Halevy
Internet-Draft                                                 T. Haynes
Intended status: Informational                              Primary Data
Expires: October 19, 2014                                 April 17, 2014

              Parallel NFS (pNFS) Flexible Files Layout
                 draft-bhalevy-nfsv4-flex-files-02.txt

Abstract

   Parallel NFS (pNFS) extends Network File System version 4 (NFSv4)
   to allow clients to directly access file data on the storage used
   by the NFSv4 server.  This ability to bypass the server for data
   access can increase both performance and parallelism, but it
   requires additional client functionality for data access, some of
   which is dependent on the class of storage used, i.e., the Layout
   Type.  The main pNFS operations and data types in NFSv4 minor
   version 1 specify a layout-type-independent layer; layout-type-
   specific information is conveyed using opaque data structures whose
   internal structure is further defined by the particular layout type
   specification.  This document specifies the NFSv4.1 Flexible Files
   pNFS Layout as a companion to the main NFSv4 minor version 1
   specification for use of pNFS with Data Servers over NFSv4 or
   higher minor versions, using a flexible, per-file striping
   topology.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on October 19, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Method of Operation
     2.1.  Security Models
     2.2.  State and Locking Models
   3.  XDR Description of the Flexible Files Layout Protocol
     3.1.  Code Components Licensing Notice
   4.  Device Addressing and Discovery
     4.1.  pnfs_ff_device_addr
     4.2.  Data Server Multipathing
   5.  Flexible Files Layout
     5.1.  pnfs_ff_layout
     5.2.  Striping Topologies
       5.2.1.  PFSP_SPARSE_STRIPING
       5.2.2.  PFSP_DENSE_STRIPING
       5.2.3.  PFSP_RAID_4
       5.2.4.  PFSP_RAID_5
       5.2.5.  PFSP_RAID_PQ
       5.2.6.  RAID Usage and Implementation Notes
     5.3.  Mirroring
   6.  Recovering from Client I/O Errors
   7.  Flexible Files Layout Return
     7.1.  pflr_errno
     7.2.  pnfs_ff_ioerr
     7.3.  pnfs_ff_iostats
     7.4.  pnfs_ff_layoutreturn
   8.  Flexible Files Creation Layout Hint
     8.1.  pnfs_ff_layouthint
   9.  Recalling Layouts
     9.1.  CB_RECALL_ANY
   10. Client Fencing
   11. Security Considerations
   12. Striping Topologies Extensibility
   13. IANA Considerations
   14. Normative References
   Appendix A.  Acknowledgments
   Appendix B.  RFC Editor Notes
   Authors' Addresses
1.  Introduction

   In pNFS, the file server returns typed layout structures that
   describe where file data is located.  There are different layouts
   for different storage systems and methods of arranging data on
   storage devices.  This document defines the layout used with
   file-based data servers that are accessed using the Network File
   System (NFS) Protocol: NFSv3 [RFC1813], NFSv4 [RFC3530], and
   NFSv4.1 [RFC5661].

   In contrast to the LAYOUT4_NFSV4_1_FILES layout type [RFC5661],
   which also uses NFSv4.1 to access the data server, the Flexible
   Files layout defines a model of device metadata and striping
   patterns that is inspired by the object layout [RFC5664].  This
   model provides flexible, per-file striping patterns and simple
   device information suitable for aggregating standalone NFS servers
   into a centrally managed pNFS cluster.

   To provide a global state model equivalent to that of the files
   layout, a back-end control protocol may be implemented between the
   metadata server (MDS) and NFSv4.1 data servers (DSs).  It is out of
   scope for this document to specify the wire protocol of such a
   protocol; however, the requirements for such a protocol are
   specified in [RFC5661] and clarified in [pNFSLayouts].

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

2.  Method of Operation

   This section describes the semantics and format of flexible
   file-based layouts for pNFS.  Flexible file-based layouts use the
   LAYOUT4_FLEX_FILES layout type.  The LAYOUT4_FLEX_FILES type
   defines striping of data across multiple NFS Data Servers.

   For the purpose of this discussion, we distinguish between user
   files served by the metadata server, referred to as User Files, and
   user files served by Data Servers, referred to as Component
   Objects.

   Component Objects are addressable by their NFS filehandle.  Each
   Component Object may store a whole User File or parts of it, in
   case the User File is striped across multiple Component Objects.
   The striping pattern is provided by pfl_striping_pattern as defined
   below.

   Data Servers may be accessed using different versions of the NFS
   protocol.  The server MUST use Data Servers of the same NFS version
   and minor version for striping data within each layout.  The NFS
   version and minor version define the respective security, state,
   and locking models to be used, as described below.

2.1.  Security Models

   With NFSv3 Data Servers, the Metadata Server uses synthetic uids
   and gids for the Component Objects, where the uid owner of the
   Component Objects is allowed read/write access and the gid owner is
   allowed read-only access.  As part of the layout, the client is
   provided with the RPC credentials to be used (see pfcf_auth in
   Section 5.1) to access the Component Objects.  The server fences
   off clients by using SETATTR to change the uid and/or gid owners of
   the Component Objects, implicitly revoking the outstanding RPC
   credentials.  Note: it is recommended to implement common access
   control methods at the Data Server filesystem export level to allow
   only the Metadata Server root (super user) access to the Data
   Server, and to set the owner of all directories holding Component
   Objects to the root user.  This security method, when using weak
   auth flavors such as AUTH_SYS, provides a practical model to
   enforce access control and fence off cooperative clients, but it
   cannot protect against malicious clients; hence it provides a level
   of security equivalent to NFSv3.

   With NFSv4.x Data Servers, the Metadata Server sets the user and
   group owners, mode bits, and ACL of the Component Objects to be the
   same as those of the User File, and the client must authenticate
   with the Data Server and go through the same authorization process
   it would go through via the Metadata Server.
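   As an illustration of the NFSv3 fencing scheme described above, the
   following is a minimal, hypothetical sketch in Python of a metadata
   server fencing off a Component Object by rotating its synthetic
   owners.  The function and parameter names are illustrative only,
   and the sketch assumes the MDS has root access to the Data Server
   export via a local mount:

   import os

   def fence_component(ds_mount, component_path, new_uid, new_gid):
       # Changing the synthetic uid/gid owners of the Component
       # Object implicitly revokes the RPC credentials that were
       # handed to clients in previously granted layouts.
       os.chown(os.path.join(ds_mount, component_path),
                new_uid, new_gid)

   Layouts granted after this point would carry credentials matching
   the new synthetic owners.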
2.2.  State and Locking Models

   User File OPEN, LOCK, and DELEGATION operations are always executed
   only against the Metadata Server.

   With NFSv4 Data Servers, the Metadata Server, in response to
   state-changing operations, executes them against the respective
   Component Objects on the Data Server(s).  It then sends the Data
   Server open stateid as part of the layout (see pfcf_stateid in
   Section 5.1), which is then used by the client for executing READ
   and WRITE operations against the Data Server.

   Standalone NFSv4.1 Data Servers that do not return the
   EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID are used the same way
   as NFSv4 Data Servers.

   NFSv4.1 Clustered Data Servers that do identify themselves with the
   EXCHGID4_FLAG_USE_PNFS_DS flag to EXCHANGE_ID use a back-end
   control protocol as described in [RFC5661] to implement a global
   stateid model as defined there.

3.  XDR Description of the Flexible Files Layout Protocol

   This document contains the External Data Representation (XDR)
   [RFC4506] description of the NFSv4.1 flexible files layout
   protocol.  The XDR description is embedded in this document in a
   way that makes it simple for the reader to extract into a
   ready-to-compile form.  The reader can feed this document into the
   following shell script to produce the machine-readable XDR
   description of the NFSv4.1 flexible files layout protocol:

   #!/bin/sh
   grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called
   "extract.sh", and this document is in a file called "spec.txt",
   then the reader can do:

   sh extract.sh < spec.txt > pnfs_flex_files_prot.x

   The effect of the script is to remove leading white space from each
   line, plus a sentinel sequence of "///".

   The embedded XDR file header follows.  Subsequent XDR descriptions,
   with the sentinel sequence, are embedded throughout the document.

   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file [RFC5662].  This includes both
   NFS types that end with a 4, such as offset4 and length4, as well
   as more generic types such as uint32_t and uint64_t.

3.1.  Code Components Licensing Notice

   Both the XDR description and the scripts used for extracting the
   XDR description are Code Components as described in Section 4 of
   "Legal Provisions Relating to IETF Documents" [LEGAL].  These Code
   Components are licensed according to the terms of that document.

   /// /*
   ///  * Copyright (c) 2012 IETF Trust and the persons identified
   ///  * as authors of the code.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * o Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * o Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * o Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  *
   ///  * This code was derived from draft-bhalevy-nfsv4-flex-files-01.
   [[RFC Editor: please insert RFC number if needed]]
   ///  * Please reproduce this note if possible.
   ///  */
   ///
   /// /*
   ///  * pnfs_flex_files_prot.x
   ///  */
   ///
   /// /*
   ///  * The following include statements are for example only.
   ///  * The actual XDR definition files are generated separately
   ///  * and independently and are likely to have a different name.
   ///  */
   /// %#include <nfs4_prot.x>
   /// %#include <rpc_prot.x>
   ///

4.  Device Addressing and Discovery

   Data operations to a data server require the client to know the
   network address of the data server.  The GETDEVICEINFO NFSv4.1
   operation is used by the client to retrieve that information.

4.1.  pnfs_ff_device_addr

   The pnfs_ff_device_addr data structure is returned by the server as
   the storage-protocol-specific opaque field da_addr_body in the
   device_addr4 structure by a successful GETDEVICEINFO operation
   [RFC5661].

   /// struct pnfs_ff_device_addr {
   ///     multipath_list4     pfda_netaddrs;
   ///     uint32_t            pfda_version;
   ///     uint32_t            pfda_minorversion;
   ///     pathname4           pfda_path;
   /// };
   ///

   The pfda_netaddrs field is used to locate the data server.  It MUST
   be set by the server to a list holding one or more of the device
   network addresses.

   The pfda_version and pfda_minorversion fields represent the NFS
   protocol to be used to access the data server.  This layout
   specification defines the semantics for pfda_version values 3 and
   4.  If pfda_version equals 3, then the server MUST set
   pfda_minorversion to 0 and the client MUST access the data server
   using the NFSv3 protocol [RFC1813].  If pfda_version equals 4, then
   the server MUST set pfda_minorversion to either 0 or 1 and the
   client MUST access the data server using NFSv4 [RFC3530] or NFSv4.1
   [RFC5661], respectively.
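   To make the version rules above concrete, here is a minimal,
   hypothetical Python sketch of the check a client might apply before
   mounting a data server; the function name and error handling are
   illustrative, not part of the protocol:

   def ds_protocol(pfda_version, pfda_minorversion):
       # Map the device's advertised version pair to the NFS
       # protocol the client must use, rejecting combinations
       # this layout specification does not define.
       if pfda_version == 3 and pfda_minorversion == 0:
           return "NFSv3"
       if pfda_version == 4 and pfda_minorversion in (0, 1):
           return "NFSv4" if pfda_minorversion == 0 else "NFSv4.1"
       raise ValueError("undefined pfda_version/pfda_minorversion")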
If 323 pfda_version equals 4 then the server MUST set pfda_minorversion to 324 either 0 or 1 and the client MUST access the data server using NFSv4 325 [RFC3530] or NFSv4.1 [RFC5661], respectively. 327 The pfda_path MAY be set by the server to an exported path on the 328 data server for device identification. If provided, the path MUST 329 exist and be accessible to the client. If the path does not exist, 330 the client MUST ignore this device information and any layouts 331 referring to the respective deviceid until valid device information 332 is acquired. 334 4.2. Data Server Multipathing 336 The flexible file layout supports multipathing to multiple data 337 server addresses. Data-server-level multipathing is used for 338 bandwidth scaling via trunking and for higher availability of use in 339 the case of a data-server failure. Multipathing allows the client to 340 switch to another data server address which may be that of another 341 data server that is exporting the same data stripe unit, without 342 having to contact the metadata server for a new layout. 344 To support data server multipathing, pfda_netaddrs contains an array 345 of one more data server network addresses. This array (data type 346 multipath_list4) represents a list of data servers (each identified 347 by a network address), with the possibility that some data servers 348 will appear in the list multiple times. 350 The client is free to use any of the network addresses as a 351 destination to send data server requests. If some network addresses 352 are less optimal paths to the data than others, then the MDS SHOULD 353 NOT include those network addresses in pfda_netaddrs. If less 354 optimal network addresses exist to provide failover, the RECOMMENDED 355 method to offer the addresses is to provide them in a replacement 356 device-ID-to-device-address mapping, or a replacement device ID. 357 When a client finds no response from the data server using all 358 addresses available in pfda_netaddrs, it SHOULD send a GETDEVICEINFO 359 to attempt to replace the existing device-ID-to-device-address 360 mappings. If the MDS detects that all network paths represented by 361 pfda_netaddrs are unavailable, the MDS SHOULD send a 362 CB_NOTIFY_DEVICEID (if the client has indicated it wants device ID 363 notifications for changed device IDs) to change the device-ID-to- 364 device-address mappings to the available addresses. If the device ID 365 itself will be replaced, the MDS SHOULD recall all layouts with the 366 device ID, and thus force the client to get new layouts and device ID 367 mappings via LAYOUTGET and GETDEVICEINFO. 369 Generally, if two network addresses appear in pfda_netaddrs, they 370 will designate the same data server. When the data server is 371 accessed over NFSv4.1 or higher minor version the two data server 372 addresses will support the implementation of client ID or session 373 trunking (the latter is RECOMMENDED) as defined in [RFC5661]. The 374 two data server addresses will share the same server owner or major 375 ID of the server owner. It is not always necessary for the two data 376 server addresses to designate the same server with trunking being 377 used. For example, the data could be read-only, and the data consist 378 of exact replicas. 380 5. 
5.  Flexible Files Layout

   The layout4 type is defined in [RFC5662] as follows:

   /// enum layouttype4 {
   ///     LAYOUT4_NFSV4_1_FILES   = 1,
   ///     LAYOUT4_OSD2_OBJECTS    = 2,
   ///     LAYOUT4_BLOCK_VOLUME    = 3,
   ///     LAYOUT4_FLEX_FILES      = 4
   [[RFC Editor: please modify the LAYOUT4_FLEX_FILES
   to be the layouttype assigned by IANA]]
   /// };
   ///
   /// struct layout_content4 {
   ///     layouttype4             loc_type;
   ///     opaque                  loc_body<>;
   /// };
   ///
   /// struct layout4 {
   ///     offset4                 lo_offset;
   ///     length4                 lo_length;
   ///     layoutiomode4           lo_iomode;
   ///     layout_content4         lo_content;
   /// };

   This document defines the structure associated with the layouttype4
   value LAYOUT4_FLEX_FILES.  [RFC5661] specifies the loc_body
   structure as an XDR type "opaque".  The opaque layout is
   uninterpreted by the generic pNFS client layers, but it obviously
   must be interpreted by the flexible files layout driver.  This
   section defines the structure of this opaque value,
   pnfs_ff_layout.

5.1.  pnfs_ff_layout

   /// enum pnfs_ff_striping_pattern {
   ///     PFSP_SPARSE_STRIPING    = 1,
   ///     PFSP_DENSE_STRIPING     = 2,
   ///     PFSP_RAID_4             = 4,
   ///     PFSP_RAID_5             = 5,
   ///     PFSP_RAID_PQ            = 6
   /// };
   ///
   /// enum pnfs_ff_comp_type {
   ///     PNFS_FF_COMP_MISSING    = 0,
   ///     PNFS_FF_COMP_PACKED     = 1,
   ///     PNFS_FF_COMP_FULL       = 2
   /// };
   ///
   /// struct pnfs_ff_comp_full {
   ///     deviceid4               pfcf_deviceid;
   ///     nfs_fh4                 pfcf_fhandle;
   ///     stateid4                pfcf_stateid;
   ///     opaque_auth             pfcf_auth;
   ///     uint32_t                pfcf_metric;
   /// };
   ///
   /// union pnfs_ff_comp switch (pnfs_ff_comp_type pfc_type) {
   /// case PNFS_FF_COMP_MISSING:
   ///     void;
   ///
   /// case PNFS_FF_COMP_PACKED:
   ///     deviceid4               pfcp_deviceid;
   ///
   /// case PNFS_FF_COMP_FULL:
   ///     pnfs_ff_comp_full       pfcp_full;
   /// };
   ///
   /// struct pnfs_ff_layout {
   ///     pnfs_ff_striping_pattern pfl_striping_pattern;
   ///     uint32_t                pfl_num_comps;
   ///     uint32_t                pfl_mirror_cnt;
   ///     length4                 pfl_stripe_unit;
   ///     nfs_fh4                 pfl_global_fh;
   ///     uint32_t                pfl_comps_index;
   ///     pnfs_ff_comp            pfl_comps<>;
   /// };
   ///

   The pnfs_ff_layout structure specifies a layout over a set of
   Component Objects.  The layout parameterizes the algorithm that
   maps the file's contents within the returned byte range, as
   represented by lo_offset and lo_length, over the Component Objects.

   It is possible that the file is concatenated from more than one
   layout segment.  Each layout segment MAY specify different striping
   parameters, which apply only to that layout segment's byte range.

   This section provides a brief introduction to the layout
   parameters.  See Section 5.2 for a more detailed description of the
   different striping schemes and the respective interpretation of the
   layout parameters for each striping scheme.

   In addition to mapping data using simple striping schemes, where
   loss of a single component object results in data loss, the layout
   parameters support mirroring and more advanced redundancy schemes
   that protect against loss of component objects.
   pfl_striping_pattern represents the algorithm to be used for
   mapping byte offsets in the file address space to corresponding
   component objects in the returned layout and byte offsets in the
   component's address space.
   pfl_striping_pattern also represents methods for storing and
   retrieving redundant data that can be used to recover from failure
   or loss of component objects.

   pfl_num_comps is the total number of component objects the file is
   striped over within the returned byte range, not counting mirrored
   components (see pfl_mirror_cnt below).  Note that the server MAY
   grow the file by adding more components to the stripe while clients
   hold valid layouts, until the file has reached its final stripe
   width.

   pfl_mirror_cnt represents the number of mirrors each component in
   the stripe has.  If there is no mirroring, then pfl_mirror_cnt MUST
   be 0.  Otherwise, the number of entries listed in pfl_comps MUST be
   a multiple of (pfl_mirror_cnt + 1).

   pfl_stripe_unit is the number of bytes placed on one component
   before advancing to the next one in the list of components.  When
   the file is striped over a single component object (pfl_num_comps
   equals 1), the stripe unit has no use and the server SHOULD set it
   to the server default value or to zero; otherwise, pfl_stripe_unit
   MUST NOT be set to zero.

   The pfl_comps field represents an array of component objects.  The
   data placement algorithm that maps file data onto component objects
   assumes that each component object occurs exactly once in the array
   of components.  Therefore, component objects MUST appear in the
   pfl_comps array only once.  The components array may represent all
   objects comprising the file, in which case pfl_comps_index is set
   to zero and the number of entries in the pfl_comps array is equal
   to pfl_num_comps * (pfl_mirror_cnt + 1).  The server MAY return
   fewer components than pfl_num_comps, provided that the returned
   byte range represented by lo_offset and lo_length maps in whole
   into the set of returned component objects.  In this case,
   pfl_comps_index represents the logical position of the returned
   components array, pfl_comps, within the full array of components
   that comprise the file.  pfl_comps_index MUST be a multiple of
   (pfl_mirror_cnt + 1).

   Each component object in the pfl_comps array is described by the
   pnfs_ff_comp type.

   When a component object is unavailable, pfc_type is set to
   PNFS_FF_COMP_MISSING and no other information for this component is
   returned.  When a data redundancy scheme is being used, as
   represented by pfl_striping_pattern, the client MAY use a
   respective data recovery algorithm to reconstruct data that is
   logically stored on the missing component using user data and
   redundant data stored on the available components in the containing
   stripe.

   The server MUST set the same pfc_type for all available components,
   either PNFS_FF_COMP_PACKED or PNFS_FF_COMP_FULL.

   When NFSv4.1 Clustered Data Servers are used, the metadata server
   implements the global state model where all data servers share the
   same stateid and filehandle for the file.  In such a case, the
   client MUST use the open, delegation, or lock stateid returned by
   the metadata server for the file for accessing the Data Servers for
   READ and WRITE; the global filehandle to be used by the client is
   provided by pfl_global_fh.  If the metadata server filehandle for
   the file is being used by all data servers, then pfl_global_fh MAY
   be set to an empty filehandle.

   pfcp_deviceid or pfcf_deviceid provides the deviceid of the data
   server holding the Component Object.
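   The component-array geometry described above can be checked
   mechanically.  Below is a minimal, hypothetical Python sketch of
   the consistency checks a client layout driver might apply to a
   decoded pnfs_ff_layout; the attribute names mirror the XDR fields,
   but the function itself is illustrative only:

   def check_layout_geometry(l):
       group = l.pfl_mirror_cnt + 1       # components per mirror group
       full = l.pfl_num_comps * group     # size of the full array
       if len(l.pfl_comps) % group != 0:
           raise ValueError("pfl_comps not a multiple of mirror group")
       if l.pfl_comps_index % group != 0:
           raise ValueError("pfl_comps_index not group-aligned")
       if l.pfl_comps_index + len(l.pfl_comps) > full:
           raise ValueError("components exceed the file's stripe width")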
   When standalone data servers are used, either over NFSv4 or
   NFSv4.1, pfl_global_fh SHOULD be set to an empty filehandle and
   MUST be ignored by the client.  In this case, pfcf_fhandle provides
   the filehandle of the Data Server file holding the Component
   Object, and pfcf_stateid provides the stateid to be used by the
   client to access the file.

   For NFSv3 Data Servers, pfcf_auth provides the RPC credentials to
   be used by the client to access the Component Objects.  For NFSv4.x
   Data Servers, the server SHOULD use the AUTH_NONE flavor and a
   zero-length opaque body to minimize the returned structure length.
   The client MUST ignore pfcf_auth in this case.

   When pfl_mirror_cnt is not zero, pfcf_metric indicates the distance
   of the respective component object from the client; otherwise, the
   server MUST set pfcf_metric to zero.  When reading data, the client
   is advised to read from components with the lowest pfcf_metric.
   When there are several components with the same pfcf_metric, client
   implementations may implement a load distribution algorithm to
   evenly distribute the read load across several devices and thereby
   provide greater bandwidth.

5.2.  Striping Topologies

   This section describes the different data mapping schemes in
   detail.

   pnfs_ff_striping_pattern determines the algorithm and placement of
   redundant data.  This section defines the different redundancy
   algorithms.  Note: the term "RAID" (Redundant Array of Independent
   Disks) is used in this document to represent an array of Component
   Objects that store data for an individual User File.  The objects
   are stored on independent Data Servers.  User File data is encoded
   and striped across the array of Component Objects using algorithms
   developed for block-based RAID systems.

5.2.1.  PFSP_SPARSE_STRIPING

   The mapping from the logical offset within a file (L) to the
   Component Object C and object-specific offset O is direct and
   straightforward, as defined by the following equations:

   L: logical offset into the file

   W: stripe width
      W = pfl_num_comps

   S: number of bytes in a stripe
      S = W * pfl_stripe_unit

   N: stripe number
      N = L / S

   C: component index corresponding to L
      C = (L % S) / pfl_stripe_unit

   O: the component offset corresponding to L
      O = L

   Note that this computation does not accommodate the same object
   appearing in the pfl_comps array multiple times.  Therefore, the
   server may not return layouts with the same object appearing
   multiple times.  If needed, the server can return multiple layout
   segments, each covering a single instance of the object.

   PFSP_SPARSE_STRIPING means there is no parity data, so all bytes in
   the component objects are data bytes located by the above equations
   for C and O.  If a component object is marked as
   PNFS_FF_COMP_MISSING, the pNFS client MUST either return an I/O
   error when an attempt is made to read that component or,
   alternatively, retry the READ against the pNFS server.
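   As an illustration, a minimal Python sketch of the sparse mapping
   follows; the function name and arguments are hypothetical, with
   stripe_unit and num_comps standing in for pfl_stripe_unit and
   pfl_num_comps:

   def map_sparse(L, stripe_unit, num_comps):
       W = num_comps                      # stripe width
       S = W * stripe_unit                # bytes in a full stripe
       C = (L % S) // stripe_unit         # component index
       O = L                              # sparse: offset is preserved
       return C, O

   For example, with stripe_unit = 4096 and num_comps = 3, file offset
   8192 maps to component 2 at component offset 8192.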
5.2.2.  PFSP_DENSE_STRIPING

   The mapping from the logical offset within a file (L) to the
   component object C and object-specific offset O is defined by the
   following equations:

   L: logical offset into the file

   W: stripe width
      W = pfl_num_comps

   S: number of bytes in a stripe
      S = W * pfl_stripe_unit

   N: stripe number
      N = L / S

   C: component index corresponding to L
      C = (L % S) / pfl_stripe_unit

   O: the component offset corresponding to L
      O = (N * pfl_stripe_unit) + (L % pfl_stripe_unit)

   Note that this computation does not accommodate the same object
   appearing in the pfl_comps array multiple times.  Therefore, the
   server may not return layouts with the same object appearing
   multiple times.  If needed, the server can return multiple layout
   segments, each covering a single instance of the object.

   PFSP_DENSE_STRIPING means there is no parity data, so all bytes in
   the component objects are data bytes located by the above equations
   for C and O.  If a component object is marked as
   PNFS_FF_COMP_MISSING, the pNFS client MUST either return an I/O
   error when an attempt is made to read that component or,
   alternatively, retry the READ against the pNFS server.

   Note that the layout depends on the file size, which the client
   learns from the generic return parameters of LAYOUTGET, by doing
   GETATTR commands to the Metadata Server.  The client uses the file
   size to decide if it should fill holes with zeros or return a short
   read.  Striping patterns can cause cases where Component Objects
   are shorter than other components because a hole happens to
   correspond to the last part of the Component Object.
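   The dense mapping packs stripe units back to back within each
   component.  A minimal Python sketch follows, again with
   hypothetical names (stripe_unit and num_comps correspond to
   pfl_stripe_unit and pfl_num_comps):

   def map_dense(L, stripe_unit, num_comps):
       W = num_comps                      # stripe width
       S = W * stripe_unit                # bytes in a full stripe
       N = L // S                         # stripe number
       C = (L % S) // stripe_unit         # component index
       O = N * stripe_unit + (L % stripe_unit)   # packed offset
       return C, O

   With stripe_unit = 4096 and num_comps = 3, file offset 8192 again
   maps to component 2, but now to component offset 0, since each
   component stores only its own stripe units.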
5.2.3.  PFSP_RAID_4

   PFSP_RAID_4 means that the last component object in the stripe
   contains parity information computed over the rest of the stripe
   with an XOR operation.  If a Component Object is unavailable, the
   client can read the rest of the stripe units in the damaged stripe
   and recompute the missing stripe unit by XORing the other stripe
   units in the stripe.  Or the client can replay the READ against the
   pNFS server, which will presumably perform the reconstructed read
   on the client's behalf.

   When parity is present in the file, the number of parity devices is
   taken into account in the above equations when calculating (D), the
   number of data devices in a stripe, as follows:

   P: number of parity devices in each stripe
      P = 1

   D: number of data devices in a stripe
      D = W - P

   I: parity device index
      I = D

5.2.4.  PFSP_RAID_5

   PFSP_RAID_5 means that the position of the parity data is rotated
   on each stripe.  In the first stripe, the last component holds the
   parity.  In the second stripe, the next-to-last component holds the
   parity, and so on.  In this scheme, all stripe units are rotated so
   that I/O is evenly spread across objects as the file is read
   sequentially.  The rotated parity layout is illustrated here, with
   hexadecimal numbers indicating the stripe unit.

      0 1 2 P
      4 5 P 3
      8 P 6 7
      P 9 a b

   Note that the math for RAID-5 is similar to that for RAID-4, except
   that the device indices for each stripe are rotated backwards.  So
   start with the equations above for RAID-4, then compute the
   rotation as described below.

   P: number of parity devices in each stripe
      P = 1

   PC: parity cycle
      PC = W

   R: the parity rotation index
      (N is as computed in the above equations for RAID-4)
      R = N % PC

   I: parity device index
      I = (W + W - (R + 1) * P) % W

   Cr: the rotated device index
      (C is as computed in the above equations for RAID-4)
      Cr = (W + C - (R * P)) % W

   Note: W is added above to avoid negative numbers in modulo math.

5.2.5.  PFSP_RAID_PQ

   PFSP_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
   P+Q encoding scheme [ErrorCorrectingCodes].  In this layout, the
   last two component objects hold the P and Q data, respectively.  P
   is parity computed with XOR.  The Q computation is described in
   detail in [MathOfRAID-6].  The same polynomial "x^8+x^4+x^3+x^2+1"
   and Galois field size of 2^8 are used here.  Clients may simply
   choose to read data through the metadata server if two or more
   components are missing or damaged.

   The equations given above for embedded parity can be used to map a
   file offset to the correct component object by setting the number
   of parity components (P) to 2 instead of 1 as for RAID-5 and
   computing the Parity Cycle length as the Lowest Common Multiple of
   pfl_num_comps and P, divided by P, as described below.  Note: this
   algorithm can also be used for RAID-5, where P = 1.

   P: number of parity devices
      P = 2

   PC: parity cycle
      PC = LCM(W, P) / P

   Qdev: the device index holding the Q component
      (I is as computed in the above equations for RAID-5)
      Qdev = (I + 1) % W
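   To tie the RAID-4/5/PQ placement equations together, here is a
   minimal Python sketch computing the parity positions and the
   rotated device index.  It is an illustration of the equations
   above, not normative; math.lcm requires Python 3.9 or later:

   import math

   def raid_positions(W, N, C, pattern):
       # W: stripe width, N: stripe number, C: unrotated component
       # index.  Returns (rotated data index, P index, Q index|None).
       P = 2 if pattern == "PFSP_RAID_PQ" else 1
       PC = math.lcm(W, P) // P            # parity cycle length
       R = 0 if pattern == "PFSP_RAID_4" else N % PC
       I = (W + W - (R + 1) * P) % W       # parity (P) device index
       Cr = (W + C - R * P) % W            # rotated device index for C
       Qdev = (I + 1) % W if P == 2 else None
       return Cr, I, Qdev

   For PFSP_RAID_4 the rotation index R stays 0, so the parity always
   lands on the last device (I = W - 1 = D), matching Section 5.2.3.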
5.2.6.  RAID Usage and Implementation Notes

   RAID layouts with redundant data in their stripes require
   additional serialization of updates to ensure correct operation.
   Otherwise, if two clients simultaneously write to the same logical
   range of an object, the result could include different data in the
   same ranges of mirrored tuples, or corrupt parity information.  It
   is the responsibility of the metadata server to enforce
   serialization requirements such as this.  For example, the metadata
   server may do so by not granting overlapping write layouts within
   mirrored objects.

   Many alternative encoding schemes exist for P >= 2
   [ErasureCodingLibraries].  These involve P or Q equations different
   from those used in PFSP_RAID_PQ.  Thus, if one of these schemes is
   to be used in the future, a distinct value must be added to
   pnfs_ff_striping_pattern for it.  While Reed-Solomon codes are well
   understood, recently discovered schemes such as Liberation codes
   are more computationally efficient for small group widths, and
   Cauchy Reed-Solomon codes are more computationally efficient for
   higher values of P.

5.3.  Mirroring

   The pfl_mirror_cnt is used to replicate a file by replicating its
   Component Objects.  If there is no mirroring, then pfl_mirror_cnt
   MUST be 0.  If pfl_mirror_cnt is greater than zero, then the size
   of the pfl_comps array MUST be a multiple of (pfl_mirror_cnt + 1).
   Thus, for a classic mirror on two objects, pfl_mirror_cnt is one.
   Note that mirroring can be defined over any striping pattern.

   Replicas are adjacent in the pfl_comps array, and the value C
   produced by the above equations is not a direct index into the
   pfl_comps array.  Instead, the following equations determine the
   replica component index RCi, where i ranges from 0 to
   pfl_mirror_cnt.

   FW = size of pfl_comps array / (pfl_mirror_cnt + 1)

   C = component index for striping or two-level striping
       as calculated using the above equations

   i ranges from 0 to pfl_mirror_cnt, inclusive
   RCi = C * (pfl_mirror_cnt + 1) + i
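   A minimal Python sketch of the replica index computation follows
   (hypothetical names; the returned values index into pfl_comps):

   def replica_indices(C, mirror_cnt):
       # All replicas of logical component C are adjacent in
       # pfl_comps; there are mirror_cnt + 1 of them.
       group = mirror_cnt + 1
       return [C * group + i for i in range(group)]

   For a two-way mirror (pfl_mirror_cnt of 1), logical component 2
   maps to entries 4 and 5 of the pfl_comps array.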
6.  Recovering from Client I/O Errors

   The pNFS client may encounter errors when directly accessing the
   Data Servers.  However, it is the responsibility of the Metadata
   Server to recover from the I/O errors.  When the LAYOUT4_FLEX_FILES
   layout type is used, the client MUST report the I/O errors to the
   server at LAYOUTRETURN time using the pnfs_ff_ioerr structure (see
   Section 7.2).

   The metadata server analyzes the error and determines the required
   recovery operations, such as repairing any parity inconsistencies,
   recovering media failures, or reconstructing missing objects.

   The metadata server SHOULD recall any outstanding layouts to allow
   it exclusive write access to the stripes being recovered and to
   prevent other clients from hitting the same error condition.  In
   these cases, the server MUST complete recovery before handing out
   any new layouts to the affected byte ranges.

   Although it MAY be acceptable for the client to propagate a
   corresponding error to the application that initiated the I/O
   operation and drop any unwritten data, the client SHOULD attempt to
   retry the original I/O operation by requesting a new layout using
   LAYOUTGET and retry the I/O operation(s) using the new layout, or
   the client MAY just retry the I/O operation(s) using regular NFS
   READ or WRITE operations via the metadata server.  The client
   SHOULD attempt to retrieve a new layout and retry the I/O operation
   using the Data Server first, and only if the error persists, retry
   the I/O operation via the metadata server.

7.  Flexible Files Layout Return

   layoutreturn_file4 is used in the LAYOUTRETURN operation to convey
   layout-type specific information to the server.  It is defined in
   [RFC5661] as follows:

   struct layoutreturn_file4 {
           offset4         lrf_offset;
           length4         lrf_length;
           stateid4        lrf_stateid;
           /* layouttype4 specific data */
           opaque          lrf_body<>;
   };

   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
           case LAYOUTRETURN4_FILE:
                   layoutreturn_file4      lr_layout;
           default:
                   void;
   };

   struct LAYOUTRETURN4args {
           /* CURRENT_FH: file */
           bool                    lora_reclaim;
           layoutreturn_stateid    lora_recallstateid;
           layouttype4             lora_layout_type;
           layoutiomode4           lora_iomode;
           layoutreturn4           lora_layoutreturn;
   };

   If the lora_layout_type layout type is LAYOUT4_FLEX_FILES, then the
   lrf_body opaque value is defined by the pnfs_ff_layoutreturn type.

   The pnfs_ff_layoutreturn type allows the client to report I/O error
   information or layout usage statistics back to the metadata server
   as defined below.

7.1.  pflr_errno

   /// enum pflr_errno {
   ///     PNFS_FF_ERR_EIO          = 1,
   ///     PNFS_FF_ERR_NOT_FOUND    = 2,
   ///     PNFS_FF_ERR_NO_SPACE     = 3,
   ///     PNFS_FF_ERR_BAD_STATEID  = 4,
   ///     PNFS_FF_ERR_NO_ACCESS    = 5,
   ///     PNFS_FF_ERR_UNREACHABLE  = 6,
   ///     PNFS_FF_ERR_RESOURCE     = 7
   /// };
   ///

   pflr_errno is used to represent error types when read/write errors
   are reported to the metadata server.  The error codes serve as
   hints to the metadata server that may help it in diagnosing the
   exact reason for the error and in repairing it.

   PNFS_FF_ERR_EIO indicates the operation failed because the Data
      Server experienced a failure trying to access the object.  The
      most common source of these errors is media errors, but other
      internal errors might cause this as well.  In this case, the
      metadata server should examine the broken object more closely;
      hence, it should be used as the default error code.

   PNFS_FF_ERR_NOT_FOUND indicates the object ID specifies a Component
      Object that does not exist on the Data Server.

   PNFS_FF_ERR_NO_SPACE indicates the operation failed because the
      Data Server ran out of free capacity during the operation.

   PNFS_FF_ERR_BAD_STATEID indicates the stateid is not valid.

   PNFS_FF_ERR_NO_ACCESS indicates the RPC credentials do not allow
      the requested operation.  This may happen when the client is
      fenced off.  The client will need to return the layout and get a
      new one with fresh credentials.

   PNFS_FF_ERR_UNREACHABLE indicates the client did not complete the
      I/O operation at the Data Server due to a communication failure.
      Whether or not the I/O operation was executed by the Data Server
      is undetermined.

   PNFS_FF_ERR_RESOURCE indicates the client did not issue the I/O
      operation due to a local problem on the initiator (i.e., client)
      side, e.g., when running out of memory.  The client MUST
      guarantee that the Data Server WRITE operation was never sent.

7.2.  pnfs_ff_ioerr

   /// struct pnfs_ff_ioerr {
   ///     deviceid4       ioe_deviceid;
   ///     nfs_fh4         ioe_fhandle;
   ///     offset4         ioe_comp_offset;
   ///     length4         ioe_comp_length;
   ///     bool            ioe_iswrite;
   ///     pflr_errno      ioe_errno;
   /// };
   ///

   The pnfs_ff_ioerr structure is used to return error indications for
   Component Objects that generated errors during data transfers.
   These are hints to the metadata server that there are problems with
   that object.  For each error, "ioe_deviceid", "ioe_fhandle",
   "ioe_comp_offset", and "ioe_comp_length" represent the Component
   Object and the byte range within the object in which the error
   occurred; "ioe_iswrite" is set to "true" if the failed Data Server
   operation was data modifying, and "ioe_errno" represents the type
   of error.

   Component byte ranges in the optional pnfs_ff_ioerr structure are
   used for recovering the object and MUST be set by the client to
   cover all failed I/O operations to the component.
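   As a sketch of the coverage requirement above, a client might
   coalesce the failed ranges per component before encoding the
   report; the following Python fragment is illustrative only:

   def coalesce_ranges(ranges):
       # Merge overlapping or adjacent (offset, length) pairs so the
       # reported ioe_comp_offset/ioe_comp_length entries cover every
       # failed I/O to the component.
       merged = []
       for off, length in sorted(ranges):
           if merged and off <= merged[-1][0] + merged[-1][1]:
               end = max(merged[-1][0] + merged[-1][1], off + length)
               merged[-1] = (merged[-1][0], end - merged[-1][0])
           else:
               merged.append((off, length))
       return merged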
7.3.  pnfs_ff_iostats

   /// struct pnfs_ff_iostats {
   ///     offset4         ios_offset;
   ///     length4         ios_length;
   ///     uint32_t        ios_duration;
   ///     uint32_t        ios_rd_count;
   ///     uint64_t        ios_rd_bytes;
   ///     uint32_t        ios_wr_count;
   ///     uint64_t        ios_wr_bytes;
   /// };
   ///

   With pNFS, the data transfers are performed directly between the
   pNFS client and the data servers.  Therefore, the metadata server
   has no visibility into the I/O stream and cannot use any
   statistical information about client I/O to optimize data storage
   location.  pnfs_ff_iostats MAY be used by the client to report I/O
   statistics back to the metadata server upon returning the layout.
   Since it is infeasible for the client to report every I/O that used
   the layout, the client MAY identify "hot" byte ranges for which to
   report I/O statistics.  The definition and/or configuration
   mechanism of what is considered "hot" and the size of the reported
   byte range are out of the scope of this document.  It is suggested
   for client implementations to provide reasonable default values and
   an optional run-time management interface to control these
   parameters.  For example, a client can define the default byte
   range resolution to be 1 MB in size and the thresholds for
   reporting to be 1 MB/second or 10 I/O operations per second.

   For each byte range, ios_offset and ios_length represent the
   starting offset of the range and the range length in bytes.
   ios_duration represents the number of seconds the reported burst of
   I/O lasted.  ios_rd_count, ios_rd_bytes, ios_wr_count, and
   ios_wr_bytes represent, respectively, the number of contiguous read
   and write I/Os and the respective aggregate number of bytes
   transferred within the reported byte range.

7.4.  pnfs_ff_layoutreturn

   /// struct pnfs_ff_layoutreturn {
   ///     pnfs_ff_ioerr   pflr_ioerr_report<>;
   ///     pnfs_ff_iostats pflr_iostats_report<>;
   /// };
   ///

   When object I/O operations fail, "pflr_ioerr_report<>" is used to
   report these errors to the metadata server as an array of elements
   of type pnfs_ff_ioerr.  Each element in the array represents an
   error that occurred on the Component Object identified by
   ioe_deviceid and ioe_fhandle.  If no errors are to be reported, the
   size of the pflr_ioerr_report<> array is set to zero.  The client
   MAY also use "pflr_iostats_report<>" to report a list of I/O
   statistics as an array of elements of type pnfs_ff_iostats.  Each
   element in the array represents statistics for a particular byte
   range.  Byte ranges are not guaranteed to be disjoint and MAY
   repeat or intersect.

8.  Flexible Files Creation Layout Hint

   The layouthint4 type is defined in [RFC5661] as follows:

   struct layouthint4 {
           layouttype4             loh_type;
           opaque                  loh_body<>;
   };

   The layouthint4 structure is used by the client to pass a hint
   about the type of layout it would like created for a particular
   file.  If the loh_type layout type is LAYOUT4_FLEX_FILES, then the
   loh_body opaque value is defined by the pnfs_ff_layouthint type.

8.1.  pnfs_ff_layouthint

   /// union pnfs_ff_max_comps_hint switch (bool pfmx_valid) {
   /// case TRUE:
   ///     uint32_t        pfmx_max_comps;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_ff_stripe_unit_hint switch (bool pfsu_valid) {
   /// case TRUE:
   ///     length4         pfsu_stripe_unit;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_ff_mirror_cnt_hint switch (bool pfmc_valid) {
   /// case TRUE:
   ///     uint32_t        pfmc_mirror_cnt;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_ff_striping_pattern_hint switch (bool pfsp_valid) {
   /// case TRUE:
   ///     pnfs_ff_striping_pattern pfsp_striping_pattern;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// struct pnfs_ff_layouthint {
   ///     pnfs_ff_max_comps_hint        pflh_max_comps_hint;
   ///     pnfs_ff_stripe_unit_hint      pflh_stripe_unit_hint;
   ///     pnfs_ff_mirror_cnt_hint       pflh_mirror_cnt_hint;
   ///     pnfs_ff_striping_pattern_hint pflh_striping_pattern_hint;
   /// };
   ///

   This type conveys hints for the desired data map.  All parameters
   are optional so that the client can give values for only the
   parameters it cares about; e.g., it can provide a hint for the
   desired number of mirrored components, regardless of the striping
   pattern selected for the file.  The server should make an attempt
   to honor the hints, but it can ignore any or all of them at its own
   discretion and without failing the respective CREATE operation.
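   For illustration, a client wishing only to request a mirror count
   might fill in the hint as in the following hypothetical Python
   sketch, leaving the other arms invalid.  The dictionary models the
   pre-encoding form of the structure; the XDR encoder itself is
   assumed:

   def mirror_hint(mirror_cnt):
       # Request only a mirror count, leaving every other arm of
       # pnfs_ff_layouthint marked invalid so the server picks its
       # own defaults for those parameters.
       return {
           "pflh_max_comps_hint":        {"pfmx_valid": False},
           "pflh_stripe_unit_hint":      {"pfsu_valid": False},
           "pflh_mirror_cnt_hint":       {"pfmc_valid": True,
                                          "pfmc_mirror_cnt": mirror_cnt},
           "pflh_striping_pattern_hint": {"pfsp_valid": False},
       }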
9.  Recalling Layouts

   The Flexible Files metadata server should recall outstanding
   layouts in the following cases:

   o  When the file's security policy changes, i.e., Access Control
      Lists (ACLs) or permission mode bits are set.

   o  When the file's layout changes, rendering outstanding layouts
      invalid.

   o  When there are sharing conflicts.  For example, the server will
      issue stripe-aligned layout segments for RAID-5 objects.  To
      prevent corruption of the file's parity, multiple clients must
      not hold valid write layouts for the same stripes.  An
      outstanding READ/WRITE (RW) layout should be recalled when a
      conflicting LAYOUTGET is received from a different client for
      LAYOUTIOMODE4_RW and for a byte range overlapping with the
      outstanding layout segment.

9.1.  CB_RECALL_ANY

   The metadata server can use the CB_RECALL_ANY callback operation to
   notify the client to return some or all of its layouts.  [RFC5661]
   defines the following types:

   const RCA4_TYPE_MASK_FF_LAYOUT_MIN     = -2;
   const RCA4_TYPE_MASK_FF_LAYOUT_MAX     = -1;
   [[RFC Editor: please insert assigned constants]]

   struct CB_RECALL_ANY4args {
           uint32_t        craa_objects_to_keep;
           bitmap4         craa_type_mask;
   };

   Typically, CB_RECALL_ANY will be used to recall client state when
   the server needs to reclaim resources.  The craa_type_mask bitmap
   specifies the type of resources that are recalled, and the
   craa_objects_to_keep value specifies how many of the recalled
   objects the client is allowed to keep.  The Flexible Files layout
   type mask flags are defined as follows.  They represent the iomode
   of the recalled layouts.  In response, the client SHOULD return
   layouts of the recalled iomode that it needs the least, keeping at
   most craa_objects_to_keep Flexible Files layouts.

   /// enum pnfs_ff_cb_recall_any_mask {
   ///     PNFS_FF_RCA4_TYPE_MASK_READ = -2,
   ///     PNFS_FF_RCA4_TYPE_MASK_RW   = -1
   [[RFC Editor: please insert assigned constants]]
   /// };
   ///

   The PNFS_FF_RCA4_TYPE_MASK_READ flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
   PNFS_FF_RCA4_TYPE_MASK_RW flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_RW.  When both mask flags are set,
   the client is notified to return layouts of either iomode.
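   A minimal, hypothetical Python sketch of the client-side selection
   follows; layouts is assumed to be a list of objects with an iomode
   attribute and a last_used timestamp, and the mask flags are modeled
   as booleans:

   def layouts_to_return(layouts, recall_read, recall_rw, keep):
       # Return the least recently used layouts of the recalled
       # iomode(s), keeping at most craa_objects_to_keep of them.
       recalled = [l for l in layouts
                   if (recall_read and l.iomode == "READ")
                   or (recall_rw and l.iomode == "RW")]
       recalled.sort(key=lambda l: l.last_used)
       surplus = max(0, len(recalled) - keep)
       return recalled[:surplus]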
10.  Client Fencing

   In cases where clients are uncommunicative and their lease has
   expired, or when clients fail to return recalled layouts within a
   lease period, the server MAY revoke client layouts and/or device
   address mappings and reassign these resources to other clients (see
   "Recalling a Layout" in [RFC5661]).  To avoid data corruption, the
   metadata server MUST fence off the revoked clients from the
   respective objects as described in Section 2.1.

11.  Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into
   two parts, the control path and the data path (storage protocol).
   The control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features
   apply to the control path.  The combination of components in a pNFS
   system is required to preserve the security properties of NFSv4
   with respect to an entity accessing data via a client, including
   security countermeasures to defend against threats that NFSv4
   provides defenses for in environments where these threats are
   considered significant.

   The metadata server enforces the file access-control policy at
   LAYOUTGET time.  The client should use suitable authorization
   credentials for getting the layout for the requested iomode (READ
   or RW), and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode.  If the LAYOUTGET operation succeeds,
   the client receives, as part of the layout, a set of credentials
   allowing it I/O access to the specified objects corresponding to
   the requested iomode.  When the client acts on I/O operations on
   behalf of its local users, it MUST authenticate and authorize the
   user by issuing respective OPEN and ACCESS calls to the metadata
   server, similar to having NFSv4 data delegations.  If access is
   allowed, the client uses the corresponding (READ or RW) credentials
   to perform the I/O operations at the Data Servers.  When the
   metadata server receives a request to change a file's permissions
   or ACL, it SHOULD recall all layouts for that file, and it MUST
   fence off the clients holding outstanding layouts for the
   respective file by implicitly invalidating the outstanding
   credentials on all Component Objects comprising the file before
   committing to the new permissions and ACL.  Doing this will ensure
   that clients re-authorize their layouts according to the modified
   permissions and ACL by requesting new layouts.  Recalling the
   layouts in this case is a courtesy of the server, intended to
   prevent clients from getting an error on I/Os done after the client
   was fenced off.

12.  Striping Topologies Extensibility

   New striping topologies that are not specified in this document may
   be added to the pnfs_ff_striping_pattern enumerated type.  These
   must be documented in the IETF by submitting an RFC augmenting this
   protocol, provided that:

   o  New striping topologies MUST be wire-protocol compatible with
      the Flexible Files Layout protocol as specified in this
      document.

   o  Some members of the data structures specified here may be
      declared as optional or mandatory-not-to-be-used.

   o  Upon acceptance by the IETF as an RFC, new striping topology
      constants MUST be registered as described in Section 13.

13.  IANA Considerations

   As described in [RFC5661], new layout type numbers have been
   assigned by IANA.  This document defines the protocol associated
   with the existing layout type number, LAYOUT4_FLEX_FILES.

   A new IANA registry should be assigned to register new data map
   striping topologies described by the pnfs_ff_striping_pattern
   enumerated type.

14.  Normative References

   [ErasureCodingLibraries]
              Plank, J., Luo, J., Schuman, C., Xu, L., and Z.
              Wilcox-O'Hearn, "A Performance Evaluation and
              Examination of Open-source Erasure Coding Libraries for
              Storage", 2007.

   [ErrorCorrectingCodes]
              MacWilliams, F. and N. Sloane, "The Theory of
              Error-Correcting Codes, Part I", 1977.

   [LEGAL]    IETF Trust, "Legal Provisions Relating to IETF
              Documents", November 2008,
              <http://trustee.ietf.org/license-info>.
   [MathOfRAID-6]
              Anvin, H., "The Mathematics of RAID-6", May 2009,
              <https://www.kernel.org/pub/linux/kernel/people/hpa/
              raid6.pdf>.

   [RFC1813]  Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
              Version 3 Protocol Specification", RFC 1813, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3530]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
              Beame, C., Eisler, M., and D. Noveck, "Network File
              System (NFS) version 4 Protocol", RFC 3530, April 2003.

   [RFC4506]  Eisler, M., "XDR: External Data Representation
              Standard", STD 67, RFC 4506, May 2006.

   [RFC5661]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              Protocol", RFC 5661, January 2010.

   [RFC5662]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
              "Network File System (NFS) Version 4 Minor Version 1
              External Data Representation Standard (XDR)
              Description", RFC 5662, January 2010.

   [RFC5664]  Halevy, B., Ed., Welch, B., Ed., and J. Zelenka, Ed.,
              "Object-Based Parallel NFS (pNFS) Operations", RFC 5664,
              January 2010.

   [pNFSLayouts]
              Haynes, T., "Considerations for a New pNFS Layout Type",
              draft-haynes-nfsv4-layout-types-02 (Work In Progress),
              April 2014.

Appendix A.  Acknowledgments

   The pNFS Objects Layout was authored and revised by Brent Welch,
   Jim Zelenka, Benny Halevy, and Boaz Harrosh.

   Those who provided miscellaneous comments on early drafts of this
   document include: Matt W. Benjamin, Adam Emerson, Tom Haynes, J.
   Bruce Fields, and Lev Solomonov.

Appendix B.  RFC Editor Notes

   [RFC Editor: please remove this section prior to publishing this
   document as an RFC]

   [RFC Editor: prior to publishing this document as an RFC, please
   replace all occurrences of RFCTBD10 with RFCxxxx, where xxxx is the
   RFC number of this document]

Authors' Addresses

   Benny Halevy
   Primary Data, Inc.

   Email: bhalevy@primarydata.com
   URI:   http://www.primarydata.com

   Thomas Haynes
   Primary Data, Inc.
   4300 El Camino Real Ste 100
   Los Altos, CA  94022
   USA

   Phone: +1 408 215 1519
   Email: thomas.haynes@primarydata.com