2 NFSv4 B. Halevy 3 Internet-Draft Tonian 4 Intended status: Standards Track August 31, 2012 5 Expires: March 4, 2013 7 Object-Based Parallel NFS (pNFS) Operations - Version 2 8 draft-bhalevy-nfs-obj-00 10 Abstract 12 Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to 13 allow clients to directly access file data on the storage used by the 14 NFSv4 server. This ability to bypass the server for data access can 15 increase both performance and parallelism, but requires additional 16 client functionality for data access, some of which is dependent on 17 the class of storage used, a.k.a. the Layout Type. The main pNFS 18 operations and data types in NFSv4 Minor version 1 specify a 19 layout-type-independent layer; layout-type-specific information is conveyed 20 using opaque data structures whose internal structure is further 21 defined by the particular layout type specification. This document 22 specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to 23 the main NFSv4 Minor version 1 specification. Version 2 of this 24 layout type introduces the use of the Network File System protocol as 25 an object-storage protocol in addition to the T-10 OSD protocol. 27 Status of this Memo 29 This Internet-Draft is submitted to IETF in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current 35 Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on March 4, 2013. 44 Copyright Notice 46 Copyright (c) 2012 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (http://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 63 2. XDR Description of the Objects-Based Layout Protocol . . . . . 4 64 2.1. Code Components Licensing Notice . . . . . . . . . . . . . 5 65 3. Basic Data Type Definitions . . . . . . . . . . . . . . . . . 6 66 3.1. pnfs_obj_osd_objid4 . . . . . . . . . . . . . . . . . . . 7 67 3.2. pnfs_obj_nfs_objid4 . . . . . . . . . . . . . . . . . . . 7 68 3.3. pnfs_obj_type4 . . . . . . . . . . . . . . . . . . . . . . 8 69 3.4. pnfs_obj_comp4 . . . . . . . . . . . . . . . . . . . . . . 9 70 3.5. pnfs_obj_raid_algorithm4 . . . . . . . . . . . . . . . . . 10 71 4. Object Storage Device Addressing and Discovery . . . . . . . . 11 72 4.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 12 73 4.2. pnfs_obj_deviceaddr4 . . . . . . . . . . . . . . . . . . . 12 74 4.2.1. SCSI Target Identifier . . . . . . . . . . . . . 
. . . 13 75 4.2.2. Device Network Address . . . . . . . . . . . . . . . . 14 76 4.2.3. NFS Device Identifier . . . . . . . . . . . . . . . . 15 77 5. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 15 78 5.1. pnfs_obj_data_map4 . . . . . . . . . . . . . . . . . . . . 16 79 5.2. pnfs_obj_layout4 . . . . . . . . . . . . . . . . . . . . . 17 80 5.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 17 81 5.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 18 82 5.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 19 83 5.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 21 84 5.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 22 85 5.4.1. PNFS_OBJ_RAID_0 . . . . . . . . . . . . . . . . . . . 22 86 5.4.2. PNFS_OBJ_RAID_4 . . . . . . . . . . . . . . . . . . . 22 87 5.4.3. PNFS_OBJ_RAID_5 . . . . . . . . . . . . . . . . . . . 23 88 5.4.4. PNFS_OBJ_RAID_PQ . . . . . . . . . . . . . . . . . . . 24 89 5.4.5. RAID Usage and Implementation Notes . . . . . . . . . 25 90 6. Object-Based Layout Update . . . . . . . . . . . . . . . . . . 25 91 6.1. pnfs_obj_deltaspaceused4 . . . . . . . . . . . . . . . . . 26 92 6.2. pnfs_obj_layoutupdate4 . . . . . . . . . . . . . . . . . . 26 93 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 27 94 8. Object-Based Layout Return . . . . . . . . . . . . . . . . . . 27 95 8.1. pnfs_obj_errno4 . . . . . . . . . . . . . . . . . . . . . 28 96 8.2. pnfs_obj_ioerr4 . . . . . . . . . . . . . . . . . . . . . 30 97 8.3. pnfs_obj_iostats4 . . . . . . . . . . . . . . . . . . . . 30 98 8.4. pnfs_obj_layoutreturn4 . . . . . . . . . . . . . . . . . . 31 99 9. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 31 100 9.1. pnfs_obj_layouthint4 . . . . . . . . . . . . . . . . . . . 32 101 10. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 33 102 10.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 33 103 10.2. LAYOUTCOMMIT . . . . . . . . . . . . . 
. . . . . . . . . . 34 104 11. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 34 105 11.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . . 34 106 12. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . . 35 107 13. Security Considerations . . . . . . . . . . . . . . . . . . . 35 108 13.1. OSD Security Data Types . . . . . . . . . . . . . . . . . 36 109 13.2. The OSD Security Protocol . . . . . . . . . . . . . . . . 37 110 13.3. Protocol Privacy Requirements . . . . . . . . . . . . . . 38 111 13.4. Revoking Capabilities . . . . . . . . . . . . . . . . . . 39 112 13.5. Security Considerations over NFS . . . . . . . . . . . . . 39 113 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 40 114 15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40 115 15.1. Normative References . . . . . . . . . . . . . . . . . . . 40 116 15.2. Informative References . . . . . . . . . . . . . . . . . . 41 117 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 42 118 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 42 120 1. Introduction 122 In pNFS, the file server returns typed layout structures that 123 describe where file data is located. There are different layouts for 124 different storage systems and methods of arranging data on storage 125 devices. This document describes the layouts used with object-based 126 storage devices (OSDs) that are accessed according to the OSD storage 127 protocol standard (ANSI INCITS 400-2004 [1]) or the NFS protocol 128 (RFC1813 [14], RFC3530 [15], RFC5661 [2]) 130 An "object" is a container for data and attributes, and files are 131 stored in one or more objects. The Object Storage protocol specifies 132 several operations on objects, including READ, WRITE, FLUSH, GET 133 ATTRIBUTES, SET ATTRIBUTES, CREATE, and DELETE. 
However, using the 134 object-based layout, the client only uses the READ, WRITE, GET 135 ATTRIBUTES, and FLUSH commands, or in the NFS case, the READ, WRITE, 136 GETATTR, and COMMIT operations. The other commands are only used by 137 the pNFS server. 139 An object-based layout for pNFS includes object identifiers, 140 capabilities that allow clients to READ or WRITE those objects, and 141 various parameters that control how file data is striped across their 142 component objects. The OSD protocol has a capability-based security 143 scheme that allows the pNFS server to control what operations and 144 what objects can be used by clients. 146 With NFS filers used for object storage devices, the object's owner, 147 group owner, and mode are used to implement a security mechanism 148 equivalent to the OSD capability model for the purpose of client 149 fencing. 151 This scheme is described in more detail in the "Security 152 Considerations" section (Section 13). 154 1.1. Requirements Language 156 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 157 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 158 document are to be interpreted as described in RFC 2119 [3]. 160 2. XDR Description of the Objects-Based Layout Protocol 162 This document contains the external data representation (XDR [4]) 163 description of the NFSv4.1 objects layout protocol. The XDR 164 description is embedded in this document in a way that makes it 165 simple for the reader to extract into a ready-to-compile form. The 166 reader can feed this document into the following shell script to 167 produce the machine-readable XDR description of the NFSv4.1 objects 168 layout protocol: 170 #!/bin/sh 171 grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
173 That is, if the above script is stored in a file called "extract.sh", 174 and this document is in a file called "spec.txt", then the reader can 175 do: 177 sh extract.sh < spec.txt > pnfs_obj_prot.x 179 The effect of the script is to remove leading white space from each 180 line, plus a sentinel sequence of "///". 182 The embedded XDR file header follows. Subsequent XDR descriptions, 183 with the sentinel sequence, are embedded throughout the document. 185 Note that the XDR code contained in this document depends on types 186 from the NFSv4.1 nfs4_prot.x file ([5]). This includes both NFS 187 types that end with a 4, such as offset4, length4, etc., as well as 188 more generic types such as uint32_t and uint64_t. 190 2.1. Code Components Licensing Notice 192 The XDR description, marked with lines beginning with the sequence 193 "///", as well as scripts for extracting the XDR description, are Code 194 Components as described in Section 4 of "Legal Provisions Relating to 195 IETF Documents" [6]. These Code Components are licensed according to 196 the terms of Section 4 of "Legal Provisions Relating to IETF 197 Documents". 199 /// /* 200 /// * Copyright (c) 2012 IETF Trust and the persons identified 201 /// * as authors of the code. All rights reserved. 202 /// * 203 /// * Redistribution and use in source and binary forms, with 204 /// * or without modification, are permitted provided that the 205 /// * following conditions are met: 206 /// * 207 /// * o Redistributions of source code must retain the above 208 /// * copyright notice, this list of conditions and the 209 /// * following disclaimer. 210 /// * 211 /// * o Redistributions in binary form must reproduce the above 212 /// * copyright notice, this list of conditions and the 213 /// * following disclaimer in the documentation and/or other 214 /// * materials provided with the distribution.
216 /// * 217 /// * o Neither the name of Internet Society, IETF or IETF 218 /// * Trust, nor the names of specific contributors, may be 219 /// * used to endorse or promote products derived from this 220 /// * software without specific prior written permission. 221 /// * 222 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 223 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 224 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 225 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 226 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 227 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 228 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 229 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 230 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 231 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 232 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 233 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 234 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 235 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 236 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 237 /// * 238 /// * This code was derived from draft-bhalevy-nfs-obj-00. 239 [[RFC Editor: please insert RFC number if needed]] 240 /// * Please reproduce this note if possible. 241 /// */ 242 /// 243 /// /* 244 /// * pnfs_obj_prot_v2.x 245 /// */ 246 /// 247 /// /* 248 /// * The following include statements are for example only. 249 /// * The actual XDR definition files are generated separately 250 /// * and independently and are likely to have a different name. 251 /// */ 252 /// %#include 253 /// %#include 254 /// 256 3. Basic Data Type Definitions 258 The following sections define basic data types and constants used by 259 the Object-Based Layout protocol. 261 3.1. pnfs_obj_osd_objid4 263 An object is identified by a number, somewhat like an inode number. 
264 The object storage model has a two-level scheme, where the objects 265 within an object storage device are grouped into partitions. 267 /// struct pnfs_obj_osd_objid4 { 268 /// deviceid4 oid_device_id; 269 /// uint64_t oid_partition_id; 270 /// uint64_t oid_object_id; 271 /// }; 272 /// 274 The pnfs_obj_osd_objid4 type is used to identify an object within a 275 partition on a specified object storage device. "oid_device_id" 276 selects the object storage device from the set of available storage 277 devices. The device is identified with the deviceid4 type, which is 278 an index into addressing information about that device returned by 279 the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data 280 type is defined in NFSv4.1 [2]. The device MUST be identified as an 281 OSD device, represented as oda_obj_type equal to PNFS_OBJ_OSD_V1 or 282 PNFS_OBJ_OSD_V2. Within an OSD, a partition is identified with a 283 64-bit number, "oid_partition_id". Within a partition, an object is 284 identified with a 64-bit number, "oid_object_id". Creation and 285 management of partitions is outside the scope of this document, and 286 is a facility provided by the object-based storage file system. 288 3.2. pnfs_obj_nfs_objid4 290 The NFS equivalent of pnfs_obj_osd_objid4 identifies the object using 291 an NFS filehandle (See RFC1813 [14], RFC3530 [15], or RFC5661 [2]). 293 /// struct pnfs_obj_nfs_objid4 { 294 /// deviceid4 nid_device_id; 295 /// opaque nid_fhandle<>; 296 /// }; 297 /// 299 Similar to pnfs_obj_osd_objid4, "nid_device_id" selects the storage 300 device from the set of available storage devices. However, it MUST 301 refer to a device identified as an NFS device, represented as 302 oda_obj_type equal to PNFS_OBJ_NFS. 304 3.3.
pnfs_obj_type4 306 /// enum pnfs_obj_type4 { 307 /// PNFS_OBJ_MISSING = 0, 308 /// PNFS_OBJ_OSD_V1 = 1, 309 /// PNFS_OBJ_OSD_V2 = 2, 310 /// PNFS_OBJ_NFS = 3 311 /// }; 312 /// 314 pnfs_obj_type4 is used to indicate the object storage protocol type 315 and version or whether an object is missing (i.e., unavailable). 316 Some of the object-based layout-supported RAID algorithms encode 317 redundant information and can compensate for missing components, but 318 the data placement algorithm needs to know what parts are missing. 320 The second generation OSD protocol (SNIA T10/1729-D [16]) has 321 additional proposed features to support more robust error recovery, 322 snapshots, and byte-range capabilities. Therefore, the OSD version 323 is explicitly called out in the information returned in the layout. 324 (This information can also be deduced by looking inside the 325 capability type at the format field, which is the first byte. The 326 format value is 0x1 for an OSD v1 capability. However, it seems most 327 robust to call out the version explicitly.) 329 Support for the NFS protocol as the object storage protocol has been 330 added in version 2 of this layout type. In this model, the NFS 331 filehandle is used to identify the object and the NFS RPC 332 authentication credentials are used to emulate the OSD security 333 model. 335 3.4.
pnfs_obj_comp4 337 /// enum pnfs_obj_osd_cap_key_sec4 { 338 /// PNFS_OBJ_CAP_KEY_SEC_NONE = 0, 339 /// PNFS_OBJ_CAP_KEY_SEC_SSV = 1 340 /// }; 341 /// 342 /// struct pnfs_obj_osd_cred4 { 343 /// pnfs_obj_osd_objid4 ooc_object_id; 344 /// pnfs_obj_osd_cap_key_sec4 ooc_cap_key_sec; 345 /// opaque ooc_capability_key<>; 346 /// opaque ooc_capability<>; 347 /// }; 348 /// 349 /// struct pnfs_obj_nfs_cred4 { 350 /// pnfs_obj_nfs_objid4 onc_object_id; 351 /// opaque_auth onc_auth; 352 /// }; 353 /// 354 /// union pnfs_obj_comp4 switch (pnfs_obj_type4 oc_obj_type) { 355 /// case PNFS_OBJ_MISSING: 356 /// pnfs_obj_osd_objid4 oc_missing_obj_id; 357 /// 358 /// case PNFS_OBJ_OSD_V1: 359 /// case PNFS_OBJ_OSD_V2: 360 /// pnfs_obj_osd_cred4 oc_osd_cred; 361 /// 362 /// case PNFS_OBJ_NFS: 363 /// pnfs_obj_nfs_cred4 oc_nfs_cred; 364 /// }; 365 /// 367 The pnfs_obj_comp4 union is used to identify each component 368 comprising the file. "oc_obj_type" represents the object storage 369 device protocol type and version, or whether that component is 370 unavailable. When oc_obj_type indicates PNFS_OBJ_OSD_V1 or 371 PNFS_OBJ_OSD_V2, the "ooc_object_id" field identifies the component 372 object, the "ooc_capability" and "ooc_capability_key" fields, along 373 with the "ooa_systemid" from the pnfs_obj_deviceaddr4, provide the 374 OSD security credentials needed to access that object. The 375 "ooc_cap_key_sec" value denotes the method used to secure the 376 ooc_capability_key (see Section 13.1 for more details). 378 To comply with the OSD security requirements, the capability key 379 SHOULD be transferred securely to prevent eavesdropping (see 380 Section 13). Therefore, a client SHOULD either issue the LAYOUTGET 381 or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service 382 or previously establish a secret state verifier (SSV) for the 383 sessions via the NFSv4.1 SET_SSV operation. 
The 384 pnfs_obj_osd_cap_key_sec4 type is used to identify the method used by 385 the server to secure the capability key. 387 o PNFS_OBJ_CAP_KEY_SEC_NONE denotes that the ooc_capability_key is 388 not encrypted, in which case the client SHOULD issue the LAYOUTGET 389 or GETDEVICEINFO operations with RPCSEC_GSS with the privacy 390 service, or the NFSv4.1 transport should be secured by using 391 methods that are external to NFSv4.1, such as the use of IPsec [17] 392 for transporting the NFSv4.1 protocol. 394 o PNFS_OBJ_CAP_KEY_SEC_SSV denotes that the ooc_capability_key 395 contents are encrypted using the SSV GSS context and the 396 capability key as inputs to the GSS_Wrap() function (see GSS-API 397 [7]) with the conf_req_flag set to TRUE. The client MUST use the 398 secret SSV key as part of the client's GSS context to decrypt the 399 capability key using the value of the ooc_capability_key field as 400 the input_message to the GSS_unwrap() function. Note that to 401 prevent eavesdropping of the SSV key, the client SHOULD issue 402 SET_SSV via RPCSEC_GSS with the privacy service. 404 The actual method chosen depends on whether the client established an 405 SSV key with the server and whether it issued the operation with the 406 RPCSEC_GSS privacy method. Naturally, if the client did not 407 establish an SSV key via SET_SSV, the server MUST use the 408 PNFS_OBJ_CAP_KEY_SEC_NONE method. Otherwise, if the operation was 409 not issued with the RPCSEC_GSS privacy method, the server SHOULD 410 secure the ooc_capability_key with the PNFS_OBJ_CAP_KEY_SEC_SSV 411 method. The server MAY use the PNFS_OBJ_CAP_KEY_SEC_SSV method also 412 when the operation was issued with the RPCSEC_GSS privacy method.
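The MUST/SHOULD rules above reduce to a small decision table. The following Python sketch models the server's choice; the function and argument names are illustrative assumptions, not part of the protocol:

```python
# Illustrative model of the server-side choice between the two
# pnfs_obj_osd_cap_key_sec4 methods, following the rules in the text:
#   - no SSV established       -> MUST use PNFS_OBJ_CAP_KEY_SEC_NONE
#   - SSV, no RPCSEC_GSS privacy -> SHOULD use PNFS_OBJ_CAP_KEY_SEC_SSV
#   - SSV and privacy          -> MAY use either; this sketch picks SSV

PNFS_OBJ_CAP_KEY_SEC_NONE = 0
PNFS_OBJ_CAP_KEY_SEC_SSV = 1

def choose_cap_key_sec(ssv_established: bool, rpcsec_gss_privacy: bool) -> int:
    """Pick how ooc_capability_key is secured in a LAYOUTGET/GETDEVICEINFO reply."""
    if not ssv_established:
        # No SET_SSV was issued: the key goes out unencrypted, relying on
        # RPCSEC_GSS privacy or an external mechanism such as IPsec.
        return PNFS_OBJ_CAP_KEY_SEC_NONE
    if not rpcsec_gss_privacy:
        # SSV exists but the request is not privacy-protected: the server
        # SHOULD GSS_Wrap() the capability key under the SSV GSS context.
        return PNFS_OBJ_CAP_KEY_SEC_SSV
    # Both protections are available; the SSV method is still permitted.
    return PNFS_OBJ_CAP_KEY_SEC_SSV
```

Note that the only case forcing PNFS_OBJ_CAP_KEY_SEC_NONE is the absence of an established SSV key.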
414 When oc_obj_type represents PNFS_OBJ_NFS as the storage protocol, the 415 "nid_device_id" field of "onc_object_id" identifies the NFS server holding the object, 416 "nid_fhandle" provides the opaque NFS filehandle identifying the 417 object, and "onc_auth" provides the RPC credentials to be used for 418 accessing the object, encoded as struct opaque_auth (See RFC5531 419 [18]). 421 3.5. pnfs_obj_raid_algorithm4 423 /// enum pnfs_obj_raid_algorithm4 { 424 /// PNFS_OBJ_RAID_0 = 1, 425 /// PNFS_OBJ_RAID_4 = 2, 426 /// PNFS_OBJ_RAID_5 = 3, 427 /// PNFS_OBJ_RAID_PQ = 4 /* Reed-Solomon P+Q */ 428 /// }; 429 /// 430 pnfs_obj_raid_algorithm4 represents the data redundancy algorithm 431 used to protect the file's contents. See Section 5.4 for more 432 details. 434 4. Object Storage Device Addressing and Discovery 436 Data operations to an OSD require the client to know the "address" of 437 each OSD's root object. The root object is synonymous with the Small 438 Computer System Interface (SCSI) logical unit. The client specifies 439 SCSI logical units to its SCSI protocol stack using a representation 440 local to the client. Because these representations are local, 441 GETDEVICEINFO must return information that can be used by the client 442 to select the correct local representation. 444 In the block world, a set offset (logical block number or 445 track/sector) contains a disk label. This label identifies the disk 446 uniquely. In contrast, an OSD has a standard set of attributes on 447 its root object. For device identification purposes, the OSD System 448 ID (root information attribute number 3) and the OSD Name (root 449 information attribute number 9) are used as the label. These appear 450 in the pnfs_obj_deviceaddr4 type below under the "ooa_systemid" and 451 "ooa_osdname" fields. 453 In some situations, SCSI target discovery may need to be driven based 454 on information contained in the GETDEVICEINFO response.
One example 455 of this is Internet SCSI (iSCSI) targets that are not known to the 456 client until a layout has been requested. The information provided 457 as the "ooa_targetid", "ooa_netaddrs", and "ooa_lun" fields in the 458 pnfs_obj_osd_addr4 type described below (see Section 4.2) allows the 459 client to probe a specific device given its network address and 460 optionally its iSCSI Name (see iSCSI [8]), or when the device network 461 address is omitted, allows it to discover the object storage device 462 using the provided device name or SCSI Device Identifier (see SPC-3 463 [9]). 465 The ooa_systemid is implicitly used by the client, by using the 466 object credential signing key to sign each request with the request 467 integrity check value. This method protects the client from 468 unintentionally accessing a device if the device address mapping was 469 changed (or revoked). The server computes the capability key using 470 its own view of the systemid associated with the respective deviceid 471 present in the credential. If the client's view of the deviceid 472 mapping is stale, the client will use the wrong systemid (which must 473 be system-wide unique) and the I/O request to the OSD will fail to 474 pass the integrity check verification. 476 To recover from this condition, the client should report the error and 477 return the layout using LAYOUTRETURN, and invalidate all the device 478 address mappings associated with this layout. The client can then 479 ask for a new layout, if it wishes, using LAYOUTGET and resolve the 480 referenced deviceids using GETDEVICEINFO or GETDEVICELIST. 482 The server MUST provide the ooa_systemid and SHOULD also provide the 483 ooa_osdname. When the OSD name is present, the client SHOULD get the 484 root information attributes whenever it establishes communication 485 with the OSD and verify that the OSD name it got from the OSD matches 486 the one sent by the metadata server.
To do so, the client uses the 487 ooa_root_obj_cred credentials. 489 For Network Attached Devices, the server MUST provide either the 490 ona_netaddrs network address(es) or ona_fqdn to identify the device. 492 4.1. pnfs_osd_targetid_type4 494 The following enum specifies the manner in which a SCSI target can be 495 specified. The target can be specified as a SCSI Name or as a SCSI 496 Device Identifier. 498 /// enum pnfs_osd_targetid_type4 { 499 /// OBJ_TARGET_ANON = 1, 500 /// OBJ_TARGET_SCSI_NAME = 2, 501 /// OBJ_TARGET_SCSI_DEVICE_ID = 3 502 /// }; 503 /// 505 4.2. pnfs_obj_deviceaddr4 507 The "pnfs_obj_deviceaddr4" data structure is returned by the server 508 as the storage-protocol-specific opaque field da_addr_body in the 509 "device_addr4" structure by a successful GETDEVICEINFO operation 510 (see NFSv4.1 [2]). 512 The specification for an object device address is as follows: 514 /// union pnfs_obj_targetid4 switch (pnfs_osd_targetid_type4 oti_type) { 515 /// case OBJ_TARGET_SCSI_NAME: 516 /// string oti_scsi_name<>; 517 /// 518 /// case OBJ_TARGET_SCSI_DEVICE_ID: 519 /// opaque oti_scsi_device_id<>; 520 /// 521 /// default: 522 /// void; 523 /// }; 524 /// 525 /// struct pnfs_obj_osd_addr4 { 526 /// pnfs_obj_targetid4 ooa_targetid; 527 /// netaddr4 ooa_netaddrs<>; 528 /// opaque ooa_lun[8]; 529 /// opaque ooa_systemid<>; 530 /// pnfs_obj_osd_cred4 ooa_root_obj_cred; 531 /// opaque ooa_osdname<>; 532 /// }; 533 /// 534 /// struct pnfs_obj_nfs_addr4 { 535 /// uint32_t ona_version; 536 /// uint32_t ona_minorversion; 537 /// netaddr4 ona_netaddrs<>; 538 /// opaque ona_fqdn<>; 539 /// opaque ona_path<>; 540 /// }; 541 /// 542 /// union pnfs_obj_deviceaddr4 switch (pnfs_obj_type4 oda_obj_type) { 543 /// case PNFS_OBJ_OSD_V1: 544 /// case PNFS_OBJ_OSD_V2: 545 /// pnfs_obj_osd_addr4 oda_osd_addr; 546 /// 547 /// case PNFS_OBJ_NFS: 548 /// pnfs_obj_nfs_addr4 oda_nfs_addr; 549 /// }; 550 /// 552 4.2.1.
SCSI Target Identifier 554 When "ooa_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the 555 "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as 556 specified in iSCSI [8] and [10]. Note that the specification of the 557 oti_scsi_name string format is outside the scope of this document. 558 Parsing the string is based on the string prefix, e.g., "iqn.", 559 "eui.", or "naa.", and more formats MAY be specified in the future in 560 accordance with iSCSI Names properties. 562 Currently, the iSCSI Name provides for naming the target device using 563 a string formatted as an iSCSI Qualified Name (IQN) or as an Extended 564 Unique Identifier (EUI) [11] string. Those are typically used to 565 identify iSCSI or SCSI RDMA Protocol (SRP) [19] devices. The 566 Network Address Authority (NAA) string format (see [10]) provides for 567 naming the device using globally unique identifiers, as defined in 568 Fibre Channel Framing and Signaling (FC-FS) [20]. These are 569 typically used to identify Fibre Channel or SAS [21] (Serial Attached 570 SCSI) devices; in particular, devices that are dual-attached 571 both over Fibre Channel or SAS and over iSCSI. 573 When "ooa_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the 574 "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device 575 Identifier as defined in SPC-3 [9] VPD Page 83h (Section 7.6.3. 576 "Device Identification VPD Page"). If the Device Identifier is 577 identical to the OSD System ID, as given by ooa_systemid, the server 578 SHOULD provide a zero-length oti_scsi_device_id opaque value. Note 579 that similarly to the "oti_scsi_name", the specification of the 580 oti_scsi_device_id opaque contents is outside the scope of this 581 document and more formats MAY be specified in the future in 582 accordance with SPC-3. 584 The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing 585 no target identification.
In this case, only the OSD System ID, and 586 optionally the provided network address, are used to locate the 587 device. 589 4.2.2. Device Network Address 591 The optional "ooa_netaddrs" field MAY be provided by the server as a 592 hint to accelerate device discovery over, e.g., the iSCSI transport 593 protocol. The network address is given with the netaddr4 type, which 594 specifies a list of TCP/IP based endpoints (as specified in NFSv4.1 595 [2]). When given, the client SHOULD use it to probe for the SCSI 596 device at the given network address(es). The client MAY still use 597 other discovery mechanisms such as Internet Storage Name Service 598 (iSNS) [12] to locate the device using the ooa_targetid. In 599 particular, such an external name service SHOULD be used when the 600 devices may be attached to the network using multiple connections, 601 and/or multiple storage fabrics (e.g., Fibre-Channel and iSCSI). 603 The "ooa_lun" field identifies the OSD 64-bit Logical Unit Number, 604 formatted in accordance with SAM-3 [13]. The client uses the Logical 605 Unit Number to communicate with the specific OSD Logical Unit. Its 606 use is defined in detail by the SCSI transport protocol, e.g., iSCSI 607 [8]. 609 4.2.3. NFS Device Identifier 611 The PNFS_OBJ_NFS pnfs_obj_type4 is used to identify NFS filer 612 devices. In this case, ona_version and ona_minorversion represent 613 the NFS protocol version to be used to access the NFS filer. Either 614 the "ona_netaddrs" or "ona_fqdn" fields are used to locate the 615 device. "ona_netaddrs" MAY be set to a list holding one or more of 616 the device network addresses. Alternatively, "ona_fqdn" MAY be set 617 to the NFS device fully qualified domain name that can be resolved by 618 the client to locate the NFS device network address (See Domain Names 619 [22]). The server MUST provide either one of "ona_netaddrs" or 620 "ona_fqdn", but it MUST NOT provide both. 
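The exactly-one-of rule for "ona_netaddrs" and "ona_fqdn", and the FQDN resolution it implies on the client side, can be sketched in Python. The type and function names here (NfsAddr, locate_nfs_device) are illustrative assumptions; the draft defines only the XDR pnfs_obj_nfs_addr4:

```python
# Sketch of client-side handling of a pnfs_obj_nfs_addr4-like structure:
# validate that the server supplied exactly one of ona_netaddrs/ona_fqdn,
# and resolve the FQDN via DNS when no explicit addresses were given.
import socket
from dataclasses import dataclass, field
from typing import List

@dataclass
class NfsAddr:                                         # stand-in for pnfs_obj_nfs_addr4
    netaddrs: List[str] = field(default_factory=list)  # ona_netaddrs
    fqdn: str = ""                                     # ona_fqdn

def locate_nfs_device(addr: NfsAddr) -> List[str]:
    """Return the device's network address(es), resolving ona_fqdn if needed."""
    # The server MUST provide either ona_netaddrs or ona_fqdn, never both.
    if bool(addr.netaddrs) == bool(addr.fqdn):
        raise ValueError("exactly one of ona_netaddrs/ona_fqdn is required")
    if addr.netaddrs:
        return addr.netaddrs
    # Resolve the FQDN to locate the device (2049 is the standard NFS port).
    infos = socket.getaddrinfo(addr.fqdn, 2049, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

A real client would additionally honor ona_version/ona_minorversion when establishing the NFS session, and check ona_path as required by the following paragraph.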
"ona_path" MUST be set by the server to an exported path on the 623 device. The path provided MUST exist and be accessible to the 624 client. If the path does not exist, the client MUST ignore this 625 device information and any layouts referring to the respective 626 deviceid until valid device information is acquired. 628 5. Object-Based Layout 630 The layout4 type is defined in NFSv4.1 [2] as follows: 632 enum layouttype4 { 633 LAYOUT4_NFSV4_1_FILES = 1, 634 LAYOUT4_OSD2_OBJECTS = 2, 635 LAYOUT4_BLOCK_VOLUME = 3, 636 LAYOUT4_OBJECTS_V2 = 0x08010004 /* Tentatively */ 637 }; 639 struct layout_content4 { 640 layouttype4 loc_type; 641 opaque loc_body<>; 642 }; 644 struct layout4 { 645 offset4 lo_offset; 646 length4 lo_length; 647 layoutiomode4 lo_iomode; 648 layout_content4 lo_content; 649 }; 651 This document defines the structure associated with the layouttype4 652 value LAYOUT4_OBJECTS_V2. NFSv4.1 [2] specifies the loc_body 653 structure as an XDR type "opaque". The opaque layout is 654 uninterpreted by the generic pNFS client layers, but obviously must 655 be interpreted by the object storage layout driver. This section 656 defines the structure of this opaque value, pnfs_obj_layout4. 658 5.1. pnfs_obj_data_map4 660 /// struct pnfs_obj_data_map4 { 661 /// uint32_t odm_num_comps; 662 /// length4 odm_stripe_unit; 663 /// uint32_t odm_group_width; 664 /// uint32_t odm_group_depth; 665 /// uint32_t odm_mirror_cnt; 666 /// pnfs_obj_raid_algorithm4 odm_raid_algorithm; 667 /// }; 668 /// 670 The pnfs_obj_data_map4 structure parameterizes the algorithm that 671 maps a file's contents over the component objects. Instead of 672 limiting the system to a simple striping scheme, where loss of a single 673 component object results in data loss, the map parameters support 674 mirroring and more complicated schemes that protect against loss of a 675 component object. 677 "odm_num_comps" is the number of component objects the file is 678 striped over.
The server MAY grow the file by adding more components
to the stripe while clients hold valid layouts, until the file has
reached its final stripe width.  The file length in this case MUST be
limited to the number of bytes in a full stripe.

The "odm_stripe_unit" is the number of bytes placed on one component
before advancing to the next one in the list of components.  The
number of bytes in a full stripe is odm_stripe_unit times the number
of components.  In some RAID schemes, a stripe includes redundant
information (i.e., parity) that lets the system recover from loss of
or damage to a component object.

The "odm_group_width" and "odm_group_depth" parameters allow a nested
striping pattern (see Section 5.3.2 for details).  If there is no
nesting, then odm_group_width and odm_group_depth MUST be zero.  The
size of the components array MUST be a multiple of odm_group_width.

The "odm_mirror_cnt" is used to replicate a file by replicating its
component objects.  If there is no mirroring, then odm_mirror_cnt
MUST be 0.  If odm_mirror_cnt is greater than zero, then the size of
the component array MUST be a multiple of (odm_mirror_cnt+1).

See Section 5.3 for more details.

5.2.  pnfs_obj_layout4

   /// struct pnfs_obj_layout4 {
   ///     pnfs_obj_data_map4 olo_map;
   ///     uint32_t           olo_comps_index;
   ///     pnfs_obj_comp4     olo_components<>;
   /// };
   ///

The pnfs_obj_layout4 structure specifies a layout over a set of
component objects.  The "olo_components" field is an array of object
identifiers and security credentials that grant access to each
object.  The organization of the data is defined by the
pnfs_obj_data_map4 type, which specifies how the file's data is
mapped onto the component objects (i.e., the striping pattern).
The data 717 placement algorithm that maps file data onto component objects 718 assumes that each component object occurs exactly once in the array 719 of components. Therefore, component objects MUST appear in the 720 olo_components array only once. The components array may represent 721 all objects comprising the file, in which case "olo_comps_index" is 722 set to zero and the number of entries in the olo_components array is 723 equal to olo_map.odm_num_comps. The server MAY return fewer 724 components than odm_num_comps, provided that the returned components 725 are sufficient to access any byte in the layout's data range (e.g., a 726 sub-stripe of "odm_group_width" components). In this case, 727 olo_comps_index represents the position of the returned components 728 array within the full array of components that comprise the file. 730 Note that the layout depends on the file size, which the client 731 learns from the generic return parameters of LAYOUTGET, by doing 732 GETATTR commands to the metadata server. The client uses the file 733 size to decide if it should fill holes with zeros or return a short 734 read. Striping patterns can cause cases where component objects are 735 shorter than other components because a hole happens to correspond to 736 the last part of the component object. 738 5.3. Data Mapping Schemes 740 This section describes the different data mapping schemes in detail. 741 The object layout always uses a "dense" layout as described in 742 NFSv4.1 [2]. This means that the second stripe unit of the file 743 starts at offset 0 of the second component, rather than at offset 744 stripe_unit bytes. After a full stripe has been written, the next 745 stripe unit is appended to the first component object in the list 746 without any holes in the component objects. 748 5.3.1. 
Simple Striping

The mapping from the logical offset within a file (L) to the
component object C and object-specific offset O is defined by the
following equations:

   L: logical offset into the file

   W: stripe width
      W = size of olo_components array

   S: number of bytes in a stripe
      S = W * stripe_unit

   N: stripe number
      N = L / S

   C: component index corresponding to L
      C = (L % S) / stripe_unit

   O: the component offset corresponding to L
      O = (N * stripe_unit) + (L % stripe_unit)

Note that this computation does not accommodate the same object
appearing in the olo_components array multiple times.  Therefore, the
server must not return layouts with the same object appearing
multiple times; if needed, the server can return multiple layout
segments, each covering a single instance of the object.

For example, consider an object striped over four devices,
<D0 D1 D2 D3>.  The stripe_unit is 4096 bytes.  The stripe width S is
thus 4 * 4096 = 16384.

   Offset 0:
      N = 0 / 16384 = 0
      C = (0 % 16384) / 4096 = 0 (D0)
      O = (0*4096) + (0%4096) = 0

   Offset 4096:
      N = 4096 / 16384 = 0
      C = (4096 % 16384) / 4096 = 1 (D1)
      O = (0*4096) + (4096%4096) = 0

   Offset 9000:
      N = 9000 / 16384 = 0
      C = (9000 % 16384) / 4096 = 2 (D2)
      O = (0*4096) + (9000%4096) = 808

   Offset 132000:
      N = 132000 / 16384 = 8
      C = (132000 % 16384) / 4096 = 0 (D0)
      O = (8*4096) + (132000%4096) = 33696

5.3.2.  Nested Striping

The odm_group_width and odm_group_depth parameters allow a nested
striping pattern.  odm_group_width defines the width of a data
stripe, and odm_group_depth defines how many stripes are written
before advancing to the next group of components in the list of
component objects for the file.  The math used to map from a file
offset to a component object and offset within that object is shown
below.
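Before turning to the nested math, the simple-striping equations of Section 5.3.1 can be transcribed into a short Python sketch. This is illustrative only (the function name is invented, and integer division stands in for the "/" of the equations); it reproduces the four worked examples of Section 5.3.1.

```python
def map_simple(L, stripe_unit, W):
    """Section 5.3.1 mapping: logical offset L to (component index C,
    component offset O) for a W-wide simple stripe."""
    S = W * stripe_unit           # bytes in a full stripe
    N = L // S                    # stripe number
    C = (L % S) // stripe_unit    # component index
    O = N * stripe_unit + L % stripe_unit
    return C, O

# The worked examples: four components, stripe_unit of 4096 bytes.
for L, expected in [(0, (0, 0)), (4096, (1, 0)),
                    (9000, (2, 808)), (132000, (0, 33696))]:
    assert map_simple(L, 4096, 4) == expected
```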
The computations map from the logical offset L to the component
index C and relative offset O within that component object.

   L: logical offset into the file

   FW: total number of components
       FW = size of olo_components array

   W: stripe width
      W = group_width, if not zero, else FW

   group_count: number of groups
      group_count = FW / group_width, if group_width is not zero,
      else 1

   D: number of data devices in a stripe
      D = W

   U: number of data bytes in a stripe within a group
      U = D * stripe_unit

   T: number of bytes striped within a group of component objects
      (before advancing to the next group)
      T = U * group_depth

   S: number of bytes striped across all component objects
      (before the pattern repeats)
      S = T * group_count

   M: the "major" (i.e., across all components) cycle number
      M = L / S

   G: group number from the beginning of the major cycle
      G = (L % S) / T

   H: byte offset within the group
      H = (L % S) % T

   N: the "minor" (i.e., across the group) stripe number
      N = H / U

   C: component index corresponding to L
      C = (G * D) + ((H % U) / stripe_unit)

   O: the component offset corresponding to L
      O = (M * group_depth * stripe_unit) + (N * stripe_unit) +
          (L % stripe_unit)

For example, consider an object striped over 100 devices with a
group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB.
In this scheme, 500 MB are written to the first 10 components, and
5000 MB are written before the pattern wraps back around to the first
component in the array.
   Offset 0:
      W = 100
      group_count = 100 / 10 = 10
      D = 10
      U = 1 MB * 10 = 10 MB
      T = 10 MB * 50 = 500 MB
      S = 500 MB * 10 = 5000 MB
      M = 0 / 5000 MB = 0
      G = (0 % 5000 MB) / 500 MB = 0
      H = (0 % 5000 MB) % 500 MB = 0
      N = 0 / 10 MB = 0
      C = (0 * 10) + ((0 % 10 MB) / 1 MB) = 0
      O = (0 * 50 * 1 MB) + (0 * 1 MB) + (0 % 1 MB) = 0

   Offset 27 MB:
      M = 27 MB / 5000 MB = 0
      G = (27 MB % 5000 MB) / 500 MB = 0
      H = (27 MB % 5000 MB) % 500 MB = 27 MB
      N = 27 MB / 10 MB = 2
      C = (0 * 10) + ((27 MB % 10 MB) / 1 MB) = 7
      O = (0 * 50 * 1 MB) + (2 * 1 MB) + (27 MB % 1 MB) = 2 MB

   Offset 7232 MB:
      M = 7232 MB / 5000 MB = 1
      G = (7232 MB % 5000 MB) / 500 MB = 4
      H = (7232 MB % 5000 MB) % 500 MB = 232 MB
      N = 232 MB / 10 MB = 23
      C = (4 * 10) + ((232 MB % 10 MB) / 1 MB) = 42
      O = (1 * 50 * 1 MB) + (23 * 1 MB) + (7232 MB % 1 MB) = 73 MB

5.3.3.  Mirroring

The odm_mirror_cnt is used to replicate a file by replicating its
component objects.  If there is no mirroring, then odm_mirror_cnt
MUST be 0.  If odm_mirror_cnt is greater than zero, then the size of
the olo_components array MUST be a multiple of (odm_mirror_cnt+1).
Thus, for a classic mirror on two objects, odm_mirror_cnt is one.
Note that mirroring can be defined over any RAID algorithm and
striping pattern (either simple or nested).  If odm_group_width is
also non-zero, then the size of the olo_components array MUST be a
multiple of odm_group_width * (odm_mirror_cnt+1).  Note that
odm_group_width does not account for mirrors.  Replicas are adjacent
in the olo_components array, and the value C produced by the above
equations is not a direct index into the olo_components array.
Instead, the following equations determine the replica component
index RCi, where i ranges from 0 to odm_mirror_cnt.
   FW = size of olo_components array / (odm_mirror_cnt+1)

   C = component index for striping or two-level striping,
       as calculated using the above equations

   i ranges from 0 to odm_mirror_cnt, inclusive:
   RCi = C * (odm_mirror_cnt+1) + i

5.4.  RAID Algorithms

pnfs_obj_raid_algorithm4 determines the algorithm and placement of
redundant data.  This section defines the different redundancy
algorithms.  Note: the term "RAID" (Redundant Array of Independent
Disks) is used in this document to represent an array of component
objects that store data for an individual file.  The objects are
stored on independent object-based storage devices.  File data is
encoded and striped across the array of component objects using
algorithms developed for block-based RAID systems.

5.4.1.  PNFS_OBJ_RAID_0

PNFS_OBJ_RAID_0 means there is no parity data, so all bytes in the
component objects are data bytes located by the above equations for C
and O.  If a component object is marked as PNFS_OBJ_MISSING, the pNFS
client MUST either return an I/O error when an attempt is made to
read that component or retry the READ against the pNFS server.

5.4.2.  PNFS_OBJ_RAID_4

PNFS_OBJ_RAID_4 means that the last component object, or the last in
each group (if odm_group_width is greater than zero), contains parity
information computed over the rest of the stripe with an XOR
operation.  If a component object is unavailable, the client can read
the rest of the stripe units in the damaged stripe and recompute the
missing stripe unit by XORing the other stripe units in the stripe.
Alternatively, the client can replay the READ against the pNFS
server, which will presumably perform the reconstructed read on the
client's behalf.
When parity is present in the file, the number of parity devices is
taken into account in the above equations when calculating (D), the
number of data devices in a stripe, as follows:

   P: number of parity devices in each stripe
      P = 1

   D: number of data devices in a stripe
      D = W - P

   I: parity device index
      I = D

5.4.3.  PNFS_OBJ_RAID_5

PNFS_OBJ_RAID_5 means that the position of the parity data is rotated
on each stripe or each group (if odm_group_width is greater than
zero).  In the first stripe, the last component holds the parity.  In
the second stripe, the next-to-last component holds the parity, and
so on.  In this scheme, all stripe units are rotated so that I/O is
evenly spread across objects as the file is read sequentially.  The
rotated parity layout is illustrated here, with hexadecimal numbers
indicating the stripe unit:

   0 1 2 P
   4 5 P 3
   8 P 6 7
   P 9 a b

Note that the math for RAID_5 is similar to that for RAID_4, except
that the device indices for each stripe are rotated backwards.  So
start with the equations above for RAID_4, then compute the rotation
as described below.  Also note that the parity rotation cycle always
starts on group boundaries, so the first stripe in a group has its
parity at device D.

   P: number of parity devices in each stripe
      P = 1

   PC: parity cycle
       PC = W

   R: the parity rotation index
      (N is as computed in the above equations for RAID-4)
      R = N % PC

   I: parity device index
      I = (W + W - (R + 1) * P) % W

   Cr: the rotated device index
       (C is as computed in the above equations for RAID-4)
       Cr = (W + C - (R * P)) % W

Note: W is added in the above equations to avoid negative numbers in
the modulo arithmetic.

5.4.4.  PNFS_OBJ_RAID_PQ

PNFS_OBJ_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
P+Q encoding scheme [23].  In this layout, the last two component
objects hold the P and Q data, respectively.
P is parity computed with XOR.  The Q computation is described in
detail by Anvin [24].  The same polynomial "x^8+x^4+x^3+x^2+1" and
Galois field size of 2^8 are used here.  Clients may simply choose to
read data through the metadata server if two or more components are
missing or damaged.

The equations given above for embedded parity can be used to map a
file offset to the correct component object by setting the number of
parity components (P) to 2 instead of 1 for RAID-5, and by computing
the parity cycle length as the Lowest Common Multiple [25] of
odm_group_width and P, divided by P, as described below.  Note: this
algorithm can also be used for RAID-5, where P=1.

   P: number of parity devices
      P = 2

   PC: parity cycle
       PC = LCM(W, P) / P

   Qdev: the device index holding the Q component
         (I is as computed in the above equations for RAID-5)
         Qdev = (I + 1) % W

5.4.5.  RAID Usage and Implementation Notes

RAID layouts with redundant data in their stripes require additional
serialization of updates to ensure correct operation.  Otherwise, if
two clients simultaneously write to the same logical range of an
object, the result could include different data in the same ranges of
mirrored tuples, or corrupt parity information.  It is the
responsibility of the metadata server to enforce serialization
requirements such as this.  For example, the metadata server may do
so by not granting overlapping write layouts within mirrored objects.

Many alternative encoding schemes exist for P >= 2 [26].  These
involve P or Q equations different from those used in
PNFS_OBJ_RAID_PQ.  Thus, if one of these schemes is to be used in the
future, a distinct value must be added to pnfs_obj_raid_algorithm4
for it.
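As a cross-check on the arithmetic, the nested-striping map of Section 5.3.2 and the parity-rotation equations above can be transcribed into a short Python sketch. This is illustrative only: the function names are invented, the data map ignores parity devices (D = W, as in Section 5.3.2), and the rotation helper takes the RAID-4 component index and minor stripe number as inputs.

```python
from math import gcd

def map_nested(L, stripe_unit, num_comps, group_width=0, group_depth=0):
    """Sections 5.3.1/5.3.2 data mapping (no parity, D == W):
    logical offset L to (component index C, component offset O)."""
    W = group_width or num_comps        # stripe width
    gd = group_depth or 1               # stripes per group
    group_count = num_comps // W
    U = W * stripe_unit                 # data bytes in a stripe
    T = U * gd                          # bytes striped within a group
    S = T * group_count                 # bytes before the pattern repeats
    M, G = L // S, (L % S) // T         # major cycle and group number
    H = (L % S) % T                     # byte offset within the group
    N = H // U                          # minor stripe number
    C = G * W + (H % U) // stripe_unit
    O = (M * gd + N) * stripe_unit + L % stripe_unit
    return C, O

def rotate_parity(C, N, W, P=1):
    """Section 5.4.3 rotation (P=1), generalized as in 5.4.4 (P=2):
    returns (rotated data device Cr, first parity device I)."""
    PC = W // gcd(W, P)                 # parity cycle: LCM(W, P) / P
    R = N % PC                          # parity rotation index
    I = (W + W - (R + 1) * P) % W       # parity device index
    Cr = (W + C - R * P) % W            # rotated data device index
    return Cr, I

MB = 1 << 20
# Worked examples from Section 5.3.2:
assert map_nested(27 * MB, MB, 100, 10, 50) == (7, 2 * MB)
assert map_nested(7232 * MB, MB, 100, 10, 50) == (42, 73 * MB)
# RAID-5 rotation diagram (W=4): stripe 1 has parity on device 2, and
# its first data unit (RAID-4 index 0) lands on device 3.
assert rotate_parity(0, 1, 4) == (3, 2)
```

With group_width of zero, map_nested degenerates to the simple-striping equations of Section 5.3.1; with P=1, rotate_parity reduces to the RAID-5 cycle PC = W.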
While Reed- 1045 Solomon codes are well understood, recently discovered schemes such 1046 as Liberation codes are more computationally efficient for small 1047 group_widths, and Cauchy Reed-Solomon codes are more computationally 1048 efficient for higher values of P. 1050 6. Object-Based Layout Update 1052 layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates 1053 to the layout and additional information to the metadata server. It 1054 is defined in the NFSv4.1 [2] as follows: 1056 struct layoutupdate4 { 1057 layouttype4 lou_type; 1058 opaque lou_body<>; 1059 }; 1061 The layoutupdate4 type is an opaque value at the generic pNFS client 1062 level. If the lou_type layout type is LAYOUT4_OBJECTS_V2, then the 1063 lou_body opaque value is defined by the pnfs_obj_layoutupdate4 type. 1065 Object-Based pNFS clients are not allowed to modify the layout. 1066 Therefore, the information passed in pnfs_obj_layoutupdate4 is used 1067 only to update the file's attributes. In addition to the generic 1068 information the client can pass to the metadata server in 1069 LAYOUTCOMMIT such as the highest offset the client wrote to and the 1070 last time it modified the file, the client MAY use 1071 pnfs_obj_layoutupdate4 to convey the capacity consumed (or released) 1072 by writes using the layout, and to indicate that I/O errors were 1073 encountered by such writes. 1075 6.1. pnfs_obj_deltaspaceused4 1077 /// union pnfs_obj_deltaspaceused4 switch (bool dsu_valid) { 1078 /// case TRUE: 1079 /// int64_t dsu_delta; 1080 /// case FALSE: 1081 /// void; 1082 /// }; 1083 /// 1085 pnfs_obj_deltaspaceused4 is used to convey space utilization 1086 information at the time of LAYOUTCOMMIT. For the file system to 1087 properly maintain capacity-used information, it needs to track how 1088 much capacity was consumed by WRITE operations performed by the 1089 client. 
In this protocol, the OSD returns the capacity consumed by a write,
which can differ from the number of bytes written because of internal
overhead such as block-level allocation and indirect blocks, and the
client reflects this back to the pNFS server so that it can
accurately track quota.  The pNFS server can choose to trust this
information coming from the clients and therefore avoid querying the
OSDs at the time of LAYOUTCOMMIT.  If the client is unable to obtain
this information from the OSD, it simply returns olu_delta_space_used
with dsu_valid set to FALSE.

6.2.  pnfs_obj_layoutupdate4

   /// struct pnfs_obj_layoutupdate4 {
   ///     pnfs_obj_deltaspaceused4 olu_delta_space_used;
   ///     bool                     olu_ioerr_flag;
   /// };
   ///

"olu_delta_space_used" is used to convey capacity usage information
back to the metadata server.

The "olu_ioerr_flag" is used when I/O errors were encountered while
writing the file.  The client MUST report the errors using the
pnfs_obj_ioerr4 structure (see Section 8.1) at LAYOUTRETURN time.

If the client updated the file successfully before hitting the I/O
errors, it MAY use LAYOUTCOMMIT to update the metadata server as
described above.  Typically, in the error-free case, the server MAY
turn around and update the file's attributes on the storage devices.
However, if I/O errors were encountered, the server should not
attempt to write the new attributes on the storage devices until it
receives the I/O error report; therefore, the client MUST set the
olu_ioerr_flag to true.  Note that in this case, the client SHOULD
send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same
COMPOUND RPC.

7.  Recovering from Client I/O Errors

The pNFS client may encounter errors when directly accessing the
object storage devices.  However, it is the responsibility of the
metadata server to handle the I/O errors.
When the 1130 LAYOUT4_OBJECTS_V2 layout type is used, the client MUST report the 1131 I/O errors to the server at LAYOUTRETURN time using the 1132 pnfs_obj_ioerr4 structure (see Section 8.1). 1134 The metadata server analyzes the error and determines the required 1135 recovery operations such as repairing any parity inconsistencies, 1136 recovering media failures, or reconstructing missing objects. 1138 The metadata server SHOULD recall any outstanding layouts to allow it 1139 exclusive write access to the stripes being recovered and to prevent 1140 other clients from hitting the same error condition. In these cases, 1141 the server MUST complete recovery before handing out any new layouts 1142 to the affected byte ranges. 1144 Although it MAY be acceptable for the client to propagate a 1145 corresponding error to the application that initiated the I/O 1146 operation and drop any unwritten data, the client SHOULD attempt to 1147 retry the original I/O operation by requesting a new layout using 1148 LAYOUTGET and retry the I/O operation(s) using the new layout, or the 1149 client MAY just retry the I/O operation(s) using regular NFS READ or 1150 WRITE operations via the metadata server. The client SHOULD attempt 1151 to retrieve a new layout and retry the I/O operation using OSD 1152 commands first and only if the error persists, retry the I/O 1153 operation via the metadata server. 1155 8. Object-Based Layout Return 1157 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 1158 layout-type specific information to the server. 
It is defined in NFSv4.1 [2] as follows:

   struct layoutreturn_file4 {
       offset4  lrf_offset;
       length4  lrf_length;
       stateid4 lrf_stateid;
       /* layouttype4 specific data */
       opaque   lrf_body<>;
   };

   union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
   case LAYOUTRETURN4_FILE:
       layoutreturn_file4 lr_layout;
   default:
       void;
   };

   struct LAYOUTRETURN4args {
       /* CURRENT_FH: file */
       bool                 lora_reclaim;
       layoutreturn_stateid lora_recallstateid;
       layouttype4          lora_layout_type;
       layoutiomode4        lora_iomode;
       layoutreturn4        lora_layoutreturn;
   };

If the lora_layout_type layout type is LAYOUT4_OBJECTS_V2, then the
lrf_body opaque value is defined by the pnfs_obj_layoutreturn4 type.

The pnfs_obj_layoutreturn4 type allows the client to report I/O error
information or layout usage statistics back to the metadata server,
as defined below.

8.1.  pnfs_obj_errno4

   /// enum pnfs_obj_errno4 {
   ///     PNFS_OBJ_ERR_EIO         = 1,
   ///     PNFS_OBJ_ERR_NOT_FOUND   = 2,
   ///     PNFS_OBJ_ERR_NO_SPACE    = 3,
   ///     PNFS_OBJ_ERR_BAD_CRED    = 4,
   ///     PNFS_OBJ_ERR_NO_ACCESS   = 5,
   ///     PNFS_OBJ_ERR_UNREACHABLE = 6,
   ///     PNFS_OBJ_ERR_RESOURCE    = 7
   /// };
   ///

pnfs_obj_errno4 is used to represent error types when read/write
errors are reported to the metadata server.  The error codes serve as
hints to the metadata server that may help it in diagnosing the exact
reason for the error and in repairing it.

o  PNFS_OBJ_ERR_EIO indicates that the operation failed because the
   object storage device experienced a failure while trying to access
   the object.  The most common source of these errors is media
   errors, but other internal errors might cause this as well.  In
   this case, the metadata server should examine the broken object
   more closely; hence, it should be used as the default error code.
1217 o PNFS_OBJ_ERR_NOT_FOUND indicates the object ID specifies an object 1218 that does not exist on the object storage device. 1220 o PNFS_OBJ_ERR_NO_SPACE indicates the operation failed because the 1221 object storage device ran out of free capacity during the 1222 operation. 1224 o PNFS_OBJ_ERR_BAD_CRED indicates the security parameters are not 1225 valid. The primary cause of this is that the capability has 1226 expired, or the access policy tag (a.k.a., capability version 1227 number) has been changed to revoke capabilities. The client will 1228 need to return the layout and get a new one with fresh 1229 capabilities. 1231 o PNFS_OBJ_ERR_NO_ACCESS indicates the capability does not allow the 1232 requested operation. This should not occur in normal operation 1233 because the metadata server should give out correct capabilities, 1234 or none at all. 1236 o PNFS_OBJ_ERR_UNREACHABLE indicates the client did not complete the 1237 I/O operation at the object storage device due to a communication 1238 failure. Whether or not the I/O operation was executed by the OSD 1239 is undetermined. 1241 o PNFS_OBJ_ERR_RESOURCE indicates the client did not issue the I/O 1242 operation due to a local problem on the initiator (i.e., client) 1243 side, e.g., when running out of memory. The client MUST guarantee 1244 that the OSD command was never dispatched to the OSD. 1246 8.2. 
pnfs_obj_ioerr4 1248 /// union pnfs_obj_objid4 switch (pnfs_obj_type4 oc_obj_type) { 1249 /// case PNFS_OBJ_OSD_V1: 1250 /// case PNFS_OBJ_OSD_V2: 1251 /// pnfs_obj_osd_objid4 oi_osd_objid; 1252 /// 1253 /// case PNFS_OBJ_NFS: 1254 /// pnfs_obj_nfs_objid4 oi_nfs_objid; 1255 /// }; 1256 /// 1257 /// struct pnfs_obj_ioerr4 { 1258 /// pnfs_obj_objid4 oer_component; 1259 /// offset4 oer_comp_offset; 1260 /// length4 oer_comp_length; 1261 /// bool oer_iswrite; 1262 /// pnfs_obj_errno4 oer_errno; 1263 /// }; 1264 /// 1266 The pnfs_obj_ioerr4 structure is used to return error indications for 1267 objects that generated errors during data transfers. These are hints 1268 to the metadata server that there are problems with that object. For 1269 each error, "oer_component", "oer_comp_offset", and "oer_comp_length" 1270 represent the object and byte range within the component object in 1271 which the error occurred; "oer_iswrite" is set to "true" if the 1272 failed OSD operation was data modifying, and "oer_errno" represents 1273 the type of error. 1275 Component byte ranges in the optional pnfs_obj_ioerr4 structure are 1276 used for recovering the object and MUST be set by the client to cover 1277 all failed I/O operations to the component. 1279 8.3. pnfs_obj_iostats4 1281 /// struct pnfs_obj_iostats4 { 1282 /// offset4 osr_offset; 1283 /// length4 osr_length; 1284 /// uint32_t osr_duration; 1285 /// uint32_t osr_rd_count; 1286 /// uint64_t osr_rd_bytes; 1287 /// uint32_t osr_wr_count; 1288 /// uint64_t osr_wr_bytes; 1289 /// }; 1290 /// 1292 With pNFS, the data transfers are performed directly between the pNFS 1293 client and the data servers. Therefore, the metadata server has no 1294 visibility to the I/O stream and cannot use any statistical 1295 information about client I/O to optimize data storage location. 1296 pnfs_obj_iostats4 MAY be used by the client to report I/O statistics 1297 back to the metadata server upon returning the layout. 
Since it is 1298 infeasible for the client to report every I/O that used the layout, 1299 the client MAY identify "hot" byte ranges for which to report I/O 1300 statistics. The definition and/or configuration mechanism of what is 1301 considered "hot" and the size of the reported byte range is out of 1302 the scope of this document. It is suggested for client 1303 implementation to provide reasonable default values and an optional 1304 run-time management interface to control these parameters. For 1305 example, a client can define the default byte range resolution to be 1306 1 MB in size and the thresholds for reporting to be 1 MB/second or 10 1307 I/O operations per second. For each byte range, osr_offset and 1308 osr_length represent the starting offset of the range and the range 1309 length in bytes. osr_duration represents the number of seconds the 1310 reported burst of I/O lasted. osr_rd_count, osr_rd_bytes, 1311 osr_wr_count, and osr_wr_bytes represent, respectively, the number of 1312 contiguous read and write I/Os and the respective aggregate number of 1313 bytes transferred within the reported byte range. 1315 8.4. pnfs_obj_layoutreturn4 1317 /// struct pnfs_obj_layoutreturn4 { 1318 /// pnfs_obj_ioerr4 olr_ioerr_report<>; 1319 /// pnfs_obj_iostats4 olr_iostats_report<>; 1320 /// }; 1321 /// 1323 When object I/O operations failed, "olr_ioerr_report<>" is used to 1324 report these errors to the metadata server as an array of elements of 1325 type pnfs_obj_ioerr4. Each element in the array represents an error 1326 that occurred on the object specified by oer_component. If no errors 1327 are to be reported, the size of the olr_ioerr_report<> array is set 1328 to zero. The client MAY also use "olr_iostats_report<>" to report a 1329 list of I/O statistics as an array of elements of type 1330 pnfs_obj_iostats4. Each element in the array represents statistics 1331 for a particular byte range. 
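For illustration only, the example defaults above (1 MB byte-range resolution, reporting thresholds of 1 MB/second or 10 I/O operations per second) might be applied on the client as follows; the helper name and signature are hypothetical, not part of the protocol.

```python
MB = 1 << 20

def is_hot(duration_s, rd_count, rd_bytes, wr_count, wr_bytes,
           bytes_per_s=1 * MB, iops=10):
    """Decide whether a byte range's I/O burst is worth reporting in
    pnfs_obj_iostats4, using the example thresholds from the text."""
    if duration_s <= 0:
        return False
    total_bytes = rd_bytes + wr_bytes
    total_ios = rd_count + wr_count
    # A range qualifies if either the throughput threshold or the
    # I/O-rate threshold is reached.
    return (total_bytes / duration_s >= bytes_per_s
            or total_ios / duration_s >= iops)

assert is_hot(10, 5, 4 * MB, 120, 8 * MB)     # 1.2 MB/s: report
assert not is_hot(10, 3, 1 * MB, 2, 1 * MB)   # 0.2 MB/s, 0.5 IOPS: skip
```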
Byte ranges are not guaranteed to be disjoint and MAY repeat or
intersect.

9.  Object-Based Creation Layout Hint

The layouthint4 type is defined in NFSv4.1 [2] as follows:

   struct layouthint4 {
       layouttype4 loh_type;
       opaque      loh_body<>;
   };

The layouthint4 structure is used by the client to pass a hint about
the type of layout it would like created for a particular file.  If
the loh_type layout type is LAYOUT4_OBJECTS_V2, then the loh_body
opaque value is defined by the pnfs_obj_layouthint4 type.

9.1.  pnfs_obj_layouthint4

   /// union pnfs_obj_max_comps_hint4 switch (bool omx_valid) {
   /// case TRUE:
   ///     uint32_t omx_max_comps;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_stripe_unit_hint4 switch (bool osu_valid) {
   /// case TRUE:
   ///     length4 osu_stripe_unit;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_group_width_hint4 switch (bool ogw_valid) {
   /// case TRUE:
   ///     uint32_t ogw_group_width;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_group_depth_hint4 switch (bool ogd_valid) {
   /// case TRUE:
   ///     uint32_t ogd_group_depth;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_mirror_cnt_hint4 switch (bool omc_valid) {
   /// case TRUE:
   ///     uint32_t omc_mirror_cnt;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// union pnfs_obj_raid_algorithm_hint4 switch (bool ora_valid) {
   /// case TRUE:
   ///     pnfs_obj_raid_algorithm4 ora_raid_algorithm;
   /// case FALSE:
   ///     void;
   /// };
   ///
   /// struct pnfs_obj_layouthint4 {
   ///     pnfs_obj_max_comps_hint4      olh_max_comps_hint;
   ///     pnfs_obj_stripe_unit_hint4    olh_stripe_unit_hint;
   ///     pnfs_obj_group_width_hint4    olh_group_width_hint;
   ///     pnfs_obj_group_depth_hint4    olh_group_depth_hint;
   ///     pnfs_obj_mirror_cnt_hint4     olh_mirror_cnt_hint;
   ///     pnfs_obj_raid_algorithm_hint4 olh_raid_algorithm_hint;
   /// };
   ///

This type conveys hints for the desired data map.  All parameters are
optional, so the client can give values for only the parameters it
cares about; e.g., it can provide a hint for the desired number of
mirrored components, regardless of the RAID algorithm selected for
the file.  The server should make an attempt to honor the hints, but
it can ignore any or all of them at its own discretion and without
failing the respective CREATE operation.

The "olh_max_comps_hint" can be used to limit the total number of
component objects comprising the file.  All other hints correspond
directly to the different fields of pnfs_obj_data_map4.

10.  Layout Segments

The pNFS layout operations operate on logical byte ranges.  There is
no requirement in the protocol for any relationship between byte
ranges used in LAYOUTGET to acquire layouts and byte ranges used in
CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN.  However, using OSD
byte-range capabilities poses limitations on these operations since
the capabilities associated with layout segments cannot be merged or
split.  The following guidelines should be followed for proper
operation of object-based layouts.

10.1.  CB_LAYOUTRECALL and LAYOUTRETURN

In general, the object-based layout driver should keep track of each
layout segment it got, keeping a record of the segment's iomode,
offset, and length.  The server should allow the client to get
multiple overlapping layout segments but is free to recall the layout
to prevent overlap.

In response to CB_LAYOUTRECALL, the client should return all layout
segments matching the given iomode and overlapping with the recalled
range.
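A client-side sketch of this selection (the bookkeeping structure and helper names are hypothetical, not mandated by the protocol): segments are tracked whole, matched by iomode, and returned whenever they overlap the recalled range.

```python
from dataclasses import dataclass

# NFSv4.1's NFS4_UINT64_MAX length means "to the end of the file".
LAYOUT4_LEN_MAX = 0xFFFFFFFFFFFFFFFF

@dataclass(frozen=True)
class Segment:
    iomode: str      # "READ" or "RW"
    offset: int
    length: int

def overlaps(a_off, a_len, b_off, b_len):
    """True if byte ranges [a_off, a_off+a_len) and [b_off, b_off+b_len)
    intersect, treating LAYOUT4_LEN_MAX as unbounded."""
    a_end = None if a_len == LAYOUT4_LEN_MAX else a_off + a_len
    b_end = None if b_len == LAYOUT4_LEN_MAX else b_off + b_len
    return ((a_end is None or a_end > b_off)
            and (b_end is None or b_end > a_off))

def segments_to_return(held, recall_iomode, r_off, r_len):
    """Pick the held segments a CB_LAYOUTRECALL obliges the client to
    return: iomode matches ("ANY" matches both) and the ranges overlap.
    Each segment is returned whole, never as a sub-range."""
    return [s for s in held
            if (recall_iomode == "ANY" or s.iomode == recall_iomode)
            and overlaps(s.offset, s.length, r_off, r_len)]

held = [Segment("RW", 0, 100), Segment("READ", 0, 100),
        Segment("RW", 200, 100)]
# Recalling RW [50, 150) hits only the first RW segment, returned whole.
assert segments_to_return(held, "RW", 50, 100) == [Segment("RW", 0, 100)]
```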
When returning the layouts for this byte range with LAYOUTRETURN, the
client MUST NOT return a sub-range of a layout segment it holds; each
LAYOUTRETURN sent MUST completely cover at least one outstanding
layout segment.

The server, in turn, should release any segment that exactly matches
the clientid, iomode, and byte range given in LAYOUTRETURN.  If no
exact match is found, then the server should release all layout
segments matching the clientid and iomode that are fully contained in
the returned byte range.  If none are found and the byte range is a
subset of an outstanding layout segment for the same clientid and
iomode, then the client can be considered malfunctioning and the
server SHOULD recall all layouts from this client to reset its state.
If this behavior repeats, the server SHOULD deny all LAYOUTGETs from
this client.

10.2.  LAYOUTCOMMIT

LAYOUTCOMMIT is used by object-based pNFS only to convey modified-
attribute hints and/or to report the presence of I/O errors to the
metadata server (MDS).  Therefore, the offset and length in
LAYOUTCOMMIT4args are reserved for future use and should be set to 0.

11.  Recalling Layouts

The object-based metadata server should recall outstanding layouts in
the following cases:

o  When the file's security policy changes, i.e., Access Control
   Lists (ACLs) or permission mode bits are set.

o  When the file's aggregation map changes, rendering outstanding
   layouts invalid.

o  When there are sharing conflicts.  For example, the server will
   issue stripe-aligned layout segments for RAID-5 objects.  To
   prevent corruption of the file's parity, multiple clients must not
   hold valid write layouts for the same stripes.
An outstanding 1475 READ/WRITE (RW) layout should be recalled when a conflicting 1476 LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW 1477 and for a byte range overlapping with the outstanding layout 1478 segment. 1480 11.1. CB_RECALL_ANY 1482 The metadata server can use the CB_RECALL_ANY callback operation to 1483 notify the client to return some or all of its layouts. The NFSv4.1 specification 1484 [2] defines the following types: 1486 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; 1487 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; 1489 struct CB_RECALL_ANY4args { 1490 uint32_t craa_objects_to_keep; 1491 bitmap4 craa_type_mask; 1492 }; 1494 Typically, CB_RECALL_ANY will be used to recall client state when the 1495 server needs to reclaim resources. The craa_type_mask bitmap 1496 specifies the type of resources that are recalled, and the 1497 craa_objects_to_keep value specifies how many of the recalled objects 1498 the client is allowed to keep. The object-based layout type mask 1499 flags are defined as follows. They represent the iomode of the 1500 recalled layouts. In response, the client SHOULD return layouts of 1501 the recalled iomode that it needs the least, keeping at most 1502 craa_objects_to_keep object-based layouts. 1504 /// enum pnfs_obj_cb_recall_any_mask { 1505 /// PNFS_OSD_RCA4_TYPE_MASK_READ = 8, 1506 /// PNFS_OSD_RCA4_TYPE_MASK_RW = 9 1507 /// }; 1508 /// 1510 The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return 1511 layouts of iomode LAYOUTIOMODE4_READ. Similarly, the 1512 PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts 1513 of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client 1514 is notified to return layouts of either iomode. 1516 12.
Client Fencing 1518 In cases where clients are uncommunicative and their lease has 1519 expired or when clients fail to return recalled layouts within at 1520 least a lease period (see "Recalling a Layout" [2]), the server 1521 MAY revoke client layouts and/or device address mappings and reassign 1522 these resources to other clients. To avoid data corruption, the 1523 metadata server MUST fence off the revoked clients from the 1524 respective objects as described in Section 13.4. 1526 13. Security Considerations 1528 The pNFS extension partitions the NFSv4 file system protocol into two 1529 parts, the control path and the data path (storage protocol). The 1530 control path contains all the new operations described by this 1531 extension; all existing NFSv4 security mechanisms and features apply 1532 to the control path. The combination of components in a pNFS system 1533 is required to preserve the security properties of NFSv4 with respect 1534 to an entity accessing data via a client, including security 1535 countermeasures to defend against threats that NFSv4 provides 1536 defenses for in environments where these threats are considered 1537 significant. 1539 The metadata server enforces the file access-control policy at 1540 LAYOUTGET time. The client should use suitable authorization 1541 credentials for getting the layout for the requested iomode (READ or 1542 RW), and the server verifies the permissions and ACL for these 1543 credentials, possibly returning NFS4ERR_ACCESS if the client is not 1544 allowed the requested iomode. If the LAYOUTGET operation succeeds, 1545 the client receives, as part of the layout, a set of object 1546 capabilities allowing it I/O access to the specified objects 1547 corresponding to the requested iomode.
When the client acts on I/O 1548 operations on behalf of its local users, it MUST authenticate and 1549 authorize the user by issuing respective OPEN and ACCESS calls to the 1550 metadata server, similar to having NFSv4 data delegations. If access 1551 is allowed, the client uses the corresponding (READ or RW) 1552 capabilities to perform the I/O operations at the object storage 1553 devices. When the metadata server receives a request to change a 1554 file's permissions or ACL, it SHOULD recall all layouts for that file, 1555 and it MUST change the capability version attribute on all objects 1556 comprising the file to implicitly invalidate any outstanding 1557 capabilities before committing to the new permissions and ACL. Doing 1558 this will ensure that clients re-authorize their layouts according to 1559 the modified permissions and ACL by requesting new layouts. 1560 Recalling the layouts in this case is a courtesy of the server, intended 1561 to prevent clients from getting an error on I/Os done after the 1562 capability version changed. 1564 The object storage protocol MUST implement the security aspects 1565 described in version 1 of the T10 OSD protocol definition [1]. The 1566 standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and 1567 ALLDATA. To provide a minimum level of security, allowing verification 1568 and enforcement of the server access control policy using the layout 1569 security credentials, the NOSEC security method MUST NOT be used for 1570 any I/O operation. The remainder of this section gives an overview 1571 of the security mechanism described in that standard. The goal is to 1572 give the reader a basic understanding of the object security model. 1573 Any discrepancies between this text and the actual standard are 1574 to be resolved in favor of the OSD standard. 1576 13.1.
OSD Security Data Types 1578 There are three main data types associated with object security: a 1579 capability, a credential, and security parameters. The capability is 1580 a set of fields that specifies an object and what operations can be 1581 performed on it. A credential is a signed capability. Only a 1582 security manager that knows the secret device keys can correctly sign 1583 a capability to form a valid credential. In pNFS, the file server 1584 acts as the security manager and returns signed capabilities (i.e., 1585 credentials) to the pNFS client. The security parameters are values 1586 computed by the issuer of OSD commands (i.e., the client) that prove 1587 they hold valid credentials. The client uses the credential as a 1588 signing key to sign the requests it makes to OSD, and puts the 1589 resulting signatures into the security_parameters field of the OSD 1590 command. The object storage device uses the secret keys it shares 1591 with the security manager to validate the signature values in the 1592 security parameters. 1594 The security types are opaque to the generic layers of the pNFS 1595 client. The credential contents are defined as opaque within the 1596 pnfs_obj_cred4 type. Instead of repeating the definitions here, the 1597 reader is referred to Section 4.9.2.2 of the OSD standard. 1599 13.2. The OSD Security Protocol 1601 The object storage protocol relies on a cryptographically secure 1602 capability to control accesses at the object storage devices. 1603 Capabilities are generated by the metadata server, returned to the 1604 client, and used by the client as described below to authenticate 1605 their requests to the object-based storage device. Capabilities 1606 therefore achieve the required access and open mode checking. They 1607 allow the file server to define and check a policy (e.g., open mode) 1608 and the OSD to enforce that policy without knowing the details (e.g., 1609 user IDs and ACLs). 
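The division of labor described above, in which the file server defines the policy and the OSD enforces it without knowing users or ACLs, can be illustrated with a minimal sketch. This is a hypothetical Python model, not anything defined by the protocol; the `Capability` fields and `osd_allows` function are illustrative stand-ins for the actual T10 capability format.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """Illustrative stand-in for a T10 capability: what object,
    which operations, and until when."""
    object_id: int
    allowed_ops: frozenset   # e.g. frozenset({"READ"}) for iomode READ
    expires: float           # explicit expiration time (seconds since epoch)

def osd_allows(cap: Capability, object_id: int, op: str, now: float) -> bool:
    """OSD-side check: the device consults only the capability;
    it never needs user IDs or ACLs."""
    return (cap.object_id == object_id
            and op in cap.allowed_ops
            and now < cap.expires)

# A READ-iomode layout yields a read-only capability for object 17.
read_cap = Capability(object_id=17,
                      allowed_ops=frozenset({"READ"}),
                      expires=time.time() + 3600)
assert osd_allows(read_cap, 17, "READ", time.time())
assert not osd_allows(read_cap, 17, "WRITE", time.time())  # open-mode check
```

In the real protocol, the capability is additionally signed so the OSD can verify it was minted by the security manager, as the following subsections describe.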
1611 Since capabilities are tied to layouts, and since they are used to 1612 enforce access control, when the file ACL or mode changes the 1613 outstanding capabilities MUST be revoked to enforce the new access 1614 permissions. The server SHOULD recall layouts to allow clients to 1615 gracefully return their capabilities before the access permissions 1616 change. 1618 Each capability is specific to a particular object, an operation on 1619 that object, a byte range within the object (in OSDv2), and has an 1620 explicit expiration time. The capabilities are signed with a secret 1621 key that is shared by the object storage devices and the metadata 1622 managers. Clients do not have device keys so they are unable to 1623 forge the signatures in the security parameters. The combination of 1624 a capability, the OSD System ID, and a signature is called a 1625 "credential" in the OSD specification. 1627 The details of the security and privacy model for object storage are 1628 defined in the T10 OSD standard. The following sketch of the 1629 algorithm should help the reader understand the basic model. 1631 LAYOUTGET returns a CapKey and a Cap, which, together with the OSD 1632 SystemID, are also called a credential. It is a capability and a 1633 signature over that capability and the SystemID. The OSD Standard 1634 refers to the CapKey as the "Credential integrity check value" and to 1635 the ReqMAC as the "Request integrity check value". 1637 CapKey = MAC(Cap, SystemID) 1638 Credential = {Cap, SystemID, CapKey} 1640 The client uses CapKey to sign all the requests it issues for that 1641 object using the respective Cap. In other words, the Cap appears in 1642 the request to the storage device, and that request is signed with 1643 the CapKey as follows: 1645 ReqMAC = MAC(Req, ReqNonce) 1646 Request = {Cap, Req, ReqNonce, ReqMAC} 1648 The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. 
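As an illustration only, the signing flow sketched by the formulas above can be written out in Python. HMAC-SHA256 stands in for the MAC algorithm the T10 OSD standard actually specifies, the keying (the device secret key for CapKey, the CapKey for ReqMAC) is made explicit where the formulas leave it implicit, and all values (`secret_key`, `system_id`, the encoded capability and request) are hypothetical.

```python
import hmac, hashlib, os

def MAC(key: bytes, *fields: bytes) -> bytes:
    """Keyed MAC over the concatenated fields (HMAC-SHA256 here;
    the T10 standard specifies its own algorithm and encoding)."""
    return hmac.new(key, b"".join(fields), hashlib.sha256).digest()

secret_key = os.urandom(32)   # shared by MDS and OSD; clients never see it
system_id = b"osd-system-1"   # hypothetical OSD SystemID
cap = b"obj=17;ops=RW"        # encoded capability (illustrative)

# Metadata server, at LAYOUTGET time: CapKey = MAC(Cap, SystemID)
cap_key = MAC(secret_key, cap, system_id)
credential = (cap, system_id, cap_key)   # returned to the client

# Client, signing a request with CapKey: ReqMAC = MAC(Req, ReqNonce)
req, req_nonce = b"READ obj=17 off=0 len=4096", os.urandom(12)
req_mac = MAC(cap_key, req, req_nonce)
request = (cap, req, req_nonce, req_mac)  # what is sent to the OSD

# OSD, using its copy of the secret key, recomputes and compares:
local_cap_key = MAC(secret_key, cap, system_id)
local_req_mac = MAC(local_cap_key, req, req_nonce)
assert hmac.compare_digest(req_mac, local_req_mac)  # request accepted
```

Because the client lacks `secret_key`, it cannot mint a CapKey for any capability the metadata server did not sign, which is what makes the scheme enforceable at the device.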
The 1649 OSD uses the SecretKey it shares with the metadata server to compare 1650 the ReqMAC the client sent with a locally computed value: 1652 LocalCapKey = MAC(Cap, SystemID) 1653 LocalReqMAC = MAC(Req, ReqNonce) 1655 and if they match, the OSD assumes that the capabilities came from an 1656 authentic metadata server and allows access to the object, as allowed 1657 by the Cap. 1659 13.3. Protocol Privacy Requirements 1661 Note that if the server LAYOUTGET reply, holding CapKey and Cap, is 1662 snooped by another client, it can be used to generate valid OSD 1663 requests (within the Cap access restrictions). 1665 To meet the privacy requirements for the capability key 1666 returned by LAYOUTGET, the GSS-API [7] framework can be used, e.g., 1667 by using the RPCSEC_GSS privacy method to send the LAYOUTGET 1668 operation or by using the SSV key to encrypt the oc_capability_key 1669 using the GSS_Wrap() function. Two general ways to provide privacy 1670 in the absence of GSS-API that are independent of NFSv4 are either an 1671 isolated network such as a VLAN or a secure channel provided by IPsec 1672 [17]. 1674 13.4. Revoking Capabilities 1676 At any time, the metadata server may invalidate all outstanding 1677 capabilities on an object by changing its POLICY ACCESS TAG 1678 attribute. The value of the POLICY ACCESS TAG is part of a 1679 capability, and it must match the state of the object attribute. If 1680 they do not match, the OSD rejects accesses to the object with the 1681 sense key set to ILLEGAL REQUEST and an additional sense code set to 1682 INVALID FIELD IN CDB. When a client attempts to use a capability and 1683 is rejected this way, it should issue a LAYOUTCOMMIT for the object 1684 and specify PNFS_OBJ_BAD_CRED in the olr_ioerr_report parameter. The 1685 client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or 1686 LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed 1687 set of capabilities.
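The tag-mismatch check described above can be modeled with a small sketch. This is illustrative Python only; the class and method names are hypothetical and not part of the T10 or pNFS specifications. A capability embeds the POLICY ACCESS TAG it was minted against, and the OSD rejects it once the object's tag has moved on.

```python
class ObjectStore:
    """Hypothetical in-memory model of the OSD-side tag check."""

    def __init__(self):
        self.policy_access_tag = 1

    def bump_tag(self):
        """MDS-triggered revocation: changing the tag implicitly
        invalidates every outstanding capability on the object."""
        self.policy_access_tag += 1

    def check_capability(self, cap_tag: int) -> bool:
        # A mismatch maps to ILLEGAL REQUEST / INVALID FIELD IN CDB
        # in the real OSD protocol.
        return cap_tag == self.policy_access_tag

osd = ObjectStore()
cap_tag = osd.policy_access_tag   # captured when the layout was granted
assert osd.check_capability(cap_tag)
osd.bump_tag()                    # e.g., permissions changed on the object
assert not osd.check_capability(cap_tag)  # client must fetch a new layout
```

After the rejection, the client reports PNFS_OBJ_BAD_CRED and reacquires the layout, obtaining capabilities minted against the new tag.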
1689 The metadata server may elect to change the access policy tag on an 1690 object at any time, for any reason (with the understanding that there 1691 is likely an associated performance penalty, especially if there are 1692 outstanding layouts for this object). The metadata server MUST 1693 revoke outstanding capabilities when any one of the following occurs: 1695 o the permissions on the object change, 1697 o a conflicting mandatory byte-range lock is granted, or 1699 o a layout is revoked and reassigned to another client. 1701 A pNFS client will typically hold one layout for each byte range for 1702 either READ or READ/WRITE. The client's credentials are checked by 1703 the metadata server at LAYOUTGET time, and it is the client's 1704 responsibility to enforce access control among multiple users 1705 accessing the same file. It is neither required nor expected that 1706 the pNFS client will obtain a separate layout for each user accessing 1707 a shared object. The client SHOULD use OPEN and ACCESS calls to 1708 check user permissions when performing I/O so that the server's 1709 access control policies are correctly enforced. The result of the 1710 ACCESS operation may be cached while the client holds a valid layout, 1711 as the server is expected to recall layouts when the file's access 1712 permissions or ACL change. 1714 13.5. Security Considerations over NFS 1716 When NFS is used as the storage protocol, the T10 security 1717 mechanism cannot be implemented. Instead, the server uses the NFS 1718 owner and group identifiers, in combination with the RPC credentials 1719 provided with the layout (as described in Paragraph 4), to simulate 1720 the OSD CAPKEY security model. The file's mode or ACL is set to 1721 grant the file's owner READ and WRITE permissions, the file's 1722 group READ-only permissions, and no permissions to others.
1723 Respectively, the client is provided with credentials 1724 providing READ/WRITE or READ-only access matching the layout's 1725 lo_iomode. 1727 Fencing off a client over NFS is achieved by modifying the respective 1728 files' ownership attributes. This will implicitly revoke the 1729 outstanding credentials and will require the client to ask the server 1730 for new layouts. 1732 14. IANA Considerations 1734 As described in NFSv4.1 [2], new layout type numbers have been 1735 assigned by IANA. This document defines the protocol associated with 1736 the existing layout type number, LAYOUT4_OBJECTS_V2, and it requires 1737 no further actions for IANA. 1739 15. References 1741 15.1. Normative References 1743 [1] Weber, R., "Information Technology - SCSI Object-Based Storage 1744 Device Commands (OSD)", ANSI INCITS 400-2004, December 2004. 1746 [2] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network 1747 File System (NFS) Version 4 Minor Version 1 Protocol", 1748 RFC 5661, January 2010. 1750 [3] Bradner, S., "Key words for use in RFCs to Indicate Requirement 1751 Levels", BCP 14, RFC 2119, March 1997. 1753 [4] Eisler, M., "XDR: External Data Representation Standard", 1754 STD 67, RFC 4506, May 2006. 1756 [5] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., "Network 1757 File System (NFS) Version 4 Minor Version 1 External Data 1758 Representation Standard (XDR) Description", RFC 5662, 1759 January 2010. 1761 [6] IETF Trust, "Legal Provisions Relating to IETF Documents", 1762 November 2008, 1763 . 1765 [7] Linn, J., "Generic Security Service Application Program 1766 Interface Version 2, Update 1", RFC 2743, January 2000. 1768 [8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. 1769 Zeidner, "Internet Small Computer Systems Interface (iSCSI)", 1770 RFC 3720, April 2004. 1772 [9] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI 1773 INCITS 408-2005, October 2005. 1775 [10] Krueger, M., Chadalapaka, M., and R.
Elliott, "T11 Network 1776 Address Authority (NAA) Naming Format for iSCSI Node Names", 1777 RFC 3980, February 2005. 1779 [11] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) 1780 Registration Authority", 1781 . 1783 [12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and J. 1784 Souza, "Internet Storage Name Service (iSNS)", RFC 4171, 1785 September 2005. 1787 [13] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI 1788 INCITS 402-2005, February 2005. 1790 15.2. Informative References 1792 [14] IETF, "NFS Version 3 Protocol Specification", RFC 1813, 1793 June 1995. 1795 [15] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 1796 C., Eisler, M., and D. Noveck, "Network File System (NFS) 1797 version 4 Protocol", RFC 3530, April 2003. 1799 [16] Weber, R., "SCSI Object-Based Storage Device Commands -2 1800 (OSD-2)", January 2009, 1801 . 1803 [17] Kent, S. and K. Seo, "Security Architecture for the Internet 1804 Protocol", RFC 4301, December 2005. 1806 [18] IETF, "RPC: Remote Procedure Call Protocol Specification 1807 Version 2", RFC 5531, May 2009. 1809 [19] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 365-2002, 1810 December 2002. 1812 [20] T11 1619-D, "Fibre Channel Framing and Signaling - 2 1813 (FC-FS-2)", ANSI INCITS 424-2007, February 2007. 1815 [21] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI 1816 INCITS 417-2006, June 2006. 1818 [22] IETF, "DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION", 1819 RFC 1035, November 1987. 1821 [23] MacWilliams, F. and N. Sloane, "The Theory of Error-Correcting 1822 Codes, Part I", 1977. 1824 [24] Anvin, H., "The Mathematics of RAID-6", May 2009, 1825 . 1827 [25] The free encyclopedia, Wikipedia., "Least common multiple", 1828 April 2011, 1829 . 1831 [26] Plank, James S., and Luo, Jianqiang and Schuman, Catherine D. 1832 and Xu, Lihao and Wilcox-O'Hearn, Zooko, "A Performance 1833 Evaluation and Examination of Open-source Erasure Coding 1834 Libraries for Storage", 2007. 
1836 Appendix A. Acknowledgments 1838 Todd Pisek was a co-editor of the initial versions of this document. 1839 Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian 1840 E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and 1841 commented on this document. 1843 Author's Address 1845 Benny Halevy 1846 Tonian, Inc. 1848 Email: bhalevy@tonian.com 1849 URI: http://www.tonian.com/