2 NFSv4 B. Halevy 3 Internet-Draft PrimaryData 4 Intended status: Standards Track B. Harrosh 5 Expires: November 2, 2014 B. Welch 6 B. Mueller 7 Panasas 8 May 01, 2014 10 Object-Based Parallel NFS (pNFS) Operations 11 draft-ietf-nfsv4-rfc5664bis-03 13 Abstract 15 Parallel NFS (pNFS) extends Network File System version 4 (NFSv4) to 16 allow clients to directly access file data on the storage used by the 17 NFSv4 server. This ability to bypass the server for data access can 18 increase both performance and parallelism, but requires additional 19 client functionality for data access, some of which is dependent on 20 the class of storage used, a.k.a. the Layout Type. The main pNFS 21 operations and data types in NFSv4 Minor version 1 specify a layout- 22 type-independent layer; layout-type-specific information is conveyed 23 using opaque data structures whose internal structure is further 24 defined by the particular layout type specification. This document 25 specifies the NFSv4.1 Object-Based pNFS Layout Type as a companion to 26 the main NFSv4 Minor version 1 specification. This document has been 27 updated since the initial version to clarify and fix some of the 28 RAID-related computations so they match current implementations. 30 Status of This Memo 32 This Internet-Draft is submitted in full conformance with the 33 provisions of BCP 78 and BCP 79. 35 Internet-Drafts are working documents of the Internet Engineering 36 Task Force (IETF). Note that other groups may also distribute 37 working documents as Internet-Drafts. The list of current Internet- 38 Drafts is at http://datatracker.ietf.org/drafts/current/. 40 Internet-Drafts are draft documents valid for a maximum of six months 41 and may be updated, replaced, or obsoleted by other documents at any 42 time.
It is inappropriate to use Internet-Drafts as reference 43 material or to cite them other than as "work in progress." 45 This Internet-Draft will expire on November 2, 2014. 47 Copyright Notice 49 Copyright (c) 2014 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents 54 (http://trustee.ietf.org/license-info) in effect on the date of 55 publication of this document. Please review these documents 56 carefully, as they describe your rights and restrictions with respect 57 to this document. Code Components extracted from this document must 58 include Simplified BSD License text as described in Section 4.e of 59 the Trust Legal Provisions and are provided without warranty as 60 described in the Simplified BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 65 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 4 66 1.2. Overview of Changes . . . . . . . . . . . . . . . . . . . 4 67 2. XDR Description of the Objects-Based Layout Protocol . . . . 4 68 2.1. Code Components Licensing Notice . . . . . . . . . . . . 5 69 3. Basic Data Type Definitions . . . . . . . . . . . . . . . . . 6 70 3.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . . . 6 71 3.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . . . 6 72 3.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . . . 7 73 3.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . . 8 74 4. Object Storage Device Addressing and Discovery . . . . . . . 9 75 4.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 10 76 4.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . 10 77 4.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . 11 78 4.2.2. Device Network Address . . . . . . . . . . . . . . . 12 79 5. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 12 80 5.1. 
pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . 13 81 5.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . 14 82 5.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . 15 83 5.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 15 84 5.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 16 85 5.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 18 86 5.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 19 87 5.4.1. PNFS_OSD_RAID_0 . . . . . . . . . . . . . . . . . . . 19 88 5.4.2. PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 20 89 5.4.3. PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 20 90 5.4.4. PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . 21 91 5.4.5. RAID Usage and Implementation Notes . . . . . . . . . 22 92 6. Object-Based Layout Update . . . . . . . . . . . . . . . . . 22 93 6.1. pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . . 23 94 6.2. pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . 23 96 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 24 97 8. Object-Based Layout Return . . . . . . . . . . . . . . . . . 24 98 8.1. pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . . . 25 99 8.2. pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . . . 26 100 8.3. pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . 27 101 9. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 27 102 9.1. pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . 27 103 10. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 29 104 10.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . 29 105 10.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . 30 106 11. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 30 107 11.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . 30 108 12. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . 31 109 13. Security Considerations . . . . . . . . . . . . . . . . . . . 
31 110 13.1. OSD Security Data Types . . . . . . . . . . . . . . . . 32 111 13.2. The OSD Security Protocol . . . . . . . . . . . . . . . 33 112 13.3. Protocol Privacy Requirements . . . . . . . . . . . . . 34 113 13.4. Revoking Capabilities . . . . . . . . . . . . . . . . . 34 114 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 35 115 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 35 116 15.1. Normative References . . . . . . . . . . . . . . . . . . 35 117 15.2. Informative References . . . . . . . . . . . . . . . . . 36 118 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . 37 119 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 37 121 1. Introduction 123 In pNFS, the file server returns typed layout structures that 124 describe where file data is located. There are different layouts for 125 different storage systems and methods of arranging data on storage 126 devices. This document describes the layouts used with object-based 127 storage devices (OSDs) that are accessed according to the OSD storage 128 protocol standard (ANSI INCITS 400-2004 [1]). 130 An "object" is a container for data and attributes, and files are 131 stored in one or more objects. The OSD protocol specifies several 132 operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES, 133 SET ATTRIBUTES, CREATE, and DELETE. However, using the object-based 134 layout the client only uses the READ, WRITE, GET ATTRIBUTES, and 135 FLUSH commands. The other commands are only used by the pNFS server. 137 An object-based layout for pNFS includes object identifiers, 138 capabilities that allow clients to READ or WRITE those objects, and 139 various parameters that control how file data is striped across their 140 component objects. The OSD protocol has a capability-based security 141 scheme that allows the pNFS server to control what operations and 142 what objects can be used by clients. 
This scheme is described in 143 more detail in the "Security Considerations" section (Section 13). 145 1.1. Requirements Language 147 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 148 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 149 document are to be interpreted as described in RFC 2119 [2]. 151 1.2. Overview of Changes 153 This document is an update to the initial RFC. The primary changes 154 are the clarification and correction of the RAID-related 155 equations and algorithms in Section 5.3. The equations were restated 156 for clarity, and in a few places minor corrections were made to 157 ensure that this specification accurately matches current implementations. 158 In addition, minor corrections have been made to other sections. 160 2. XDR Description of the Objects-Based Layout Protocol 162 This document contains the external data representation (XDR [6]) 163 description of the NFSv4.1 objects layout protocol. The XDR 164 description is embedded in this document in a way that makes it 165 simple for the reader to extract into a ready-to-compile form. The 166 reader can feed this document into the following shell script to 167 produce the machine-readable XDR description of the NFSv4.1 objects 168 layout protocol: 170 #!/bin/sh 171 grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' 173 That is, if the above script is stored in a file called "extract.sh", 174 and this document is in a file called "spec.txt", then the reader can 175 do: 177 sh extract.sh < spec.txt > pnfs_osd_prot.x 179 The effect of the script is to remove leading white space from each 180 line, plus a sentinel sequence of "///". 182 The embedded XDR file header follows. Subsequent XDR descriptions, 183 with the sentinel sequence, are embedded throughout the document. 185 Note that the XDR code contained in this document depends on types 186 from the NFSv4.1 nfs4_prot.x file ([5]).
This includes both nfs 187 types that end with a 4, such as offset4, length4, etc., as well as 188 more generic types such as uint32_t and uint64_t. 190 2.1. Code Components Licensing Notice 192 The XDR description, marked with lines beginning with the sequence "/ 193 //", as well as scripts for extracting the XDR description are Code 194 Components as described in Section 4 of "Legal Provisions Relating to 195 IETF Documents" [3]. These Code Components are licensed according to 196 the terms of Section 4 of "Legal Provisions Relating to IETF 197 Documents". 199 /// /* 200 /// * Copyright (c) 2010 IETF Trust and the persons identified 201 /// * as authors of the code. All rights reserved. 202 /// * 203 /// * Redistribution and use in source and binary forms, with 204 /// * or without modification, are permitted provided that the 205 /// * following conditions are met: 206 /// * 207 /// * o Redistributions of source code must retain the above 208 /// * copyright notice, this list of conditions and the 209 /// * following disclaimer. 210 /// * 211 /// * o Redistributions in binary form must reproduce the above 212 /// * copyright notice, this list of conditions and the 213 /// * following disclaimer in the documentation and/or other 214 /// * materials provided with the distribution. 215 /// * 216 /// * o Neither the name of Internet Society, IETF or IETF 217 /// * Trust, nor the names of specific contributors, may be 218 /// * used to endorse or promote products derived from this 219 /// * software without specific prior written permission. 220 /// * 221 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 222 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 223 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 224 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 225 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO 226 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 227 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 228 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 229 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 230 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 231 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 232 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 233 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 234 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 235 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 236 /// * 237 /// * This code was derived from draft-ietf-nfsv4-rfc5664bis-03. 239 [[RFC Editor: please insert RFC number if needed]] 240 /// * Please reproduce this note if possible. 241 /// */ 242 /// 243 /// /* 244 /// * pnfs_osd_prot.x 245 /// */ 246 /// 247 /// %#include 248 /// 250 3. Basic Data Type Definitions 252 The following sections define basic data types and constants used by 253 the Object-Based Layout protocol. 255 3.1. pnfs_osd_objid4 257 An object is identified by a number, somewhat like an inode number. 258 The object storage model has a two-level scheme, where the objects 259 within an object storage device are grouped into partitions. 261 /// struct pnfs_osd_objid4 { 262 /// deviceid4 oid_device_id; 263 /// uint64_t oid_partition_id; 264 /// uint64_t oid_object_id; 265 /// }; 266 /// 268 The pnfs_osd_objid4 type is used to identify an object within a 269 partition on a specified object storage device. "oid_device_id" 270 selects the object storage device from the set of available storage 271 devices. The device is identified with the deviceid4 type, which is 272 an index into addressing information about that device returned by 273 the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data 274 type is defined in NFSv4.1 [4]. Within an OSD, a partition is 275 identified with a 64-bit number, "oid_partition_id". 
Within a 276 partition, an object is identified with a 64-bit number, 277 "oid_object_id". Creation and management of partitions is outside 278 the scope of this document, and is a facility provided by the object- 279 based storage file system. 281 3.2. pnfs_osd_version4 282 /// enum pnfs_osd_version4 { 283 /// PNFS_OSD_MISSING = 0, 284 /// PNFS_OSD_VERSION_1 = 1, 285 /// PNFS_OSD_VERSION_2 = 2 286 /// }; 287 /// 289 pnfs_osd_version4 is used to indicate the OSD protocol version used 290 to access an object, or whether an object is missing (i.e., 291 unavailable). Some of the RAID algorithms supported by object-based 292 layouts encode redundant information and can compensate for missing 293 components, but the data placement algorithms need to be aware of the 294 logical positions of the missing components. 296 The 1.0 version of the OSD standard has been ratified. The 2.0 297 version of the OSD standard has reached final draft status, but has 298 not been fully ratified. However, current object-based pNFS 299 implementations adhere to the OSD 2.0 protocol (SNIA T10/1729-D 300 [14]). The second generation OSD protocol has additional features to 301 support more robust error recovery, snapshots, and byte-range 302 capabilities. For completeness, and to allow for future revisions in 303 the OSD protocol, the OSD version is explicitly called out in the 304 information returned in the layout. (This information can also be 305 deduced by looking inside the capability type at the format field, 306 which is the first byte. The format value is 0x1 for an OSD v1 307 capability.) 309 3.3. 
pnfs_osd_object_cred4 311 /// enum pnfs_osd_cap_key_sec4 { 312 /// PNFS_OSD_CAP_KEY_SEC_NONE = 0, 313 /// PNFS_OSD_CAP_KEY_SEC_SSV = 1 314 /// }; 315 /// 316 /// struct pnfs_osd_object_cred4 { 317 /// pnfs_osd_objid4 oc_object_id; 318 /// pnfs_osd_version4 oc_osd_version; 319 /// pnfs_osd_cap_key_sec4 oc_cap_key_sec; 320 /// opaque oc_capability_key<>; 321 /// opaque oc_capability<>; 322 /// }; 323 /// 325 The pnfs_osd_object_cred4 structure is used to identify each 326 component comprising the file. The "oc_object_id" identifies the 327 component object, the "oc_osd_version" represents the OSD protocol 328 version, or whether that component is unavailable, and the 329 "oc_capability" and "oc_capability_key", along with the 330 "oda_systemid" from the pnfs_osd_deviceaddr4, provide the OSD 331 security credentials needed to access that object. The 332 "oc_cap_key_sec" value denotes the method used to secure the 333 oc_capability_key (see Section 13.1 for more details). 335 To comply with the OSD security requirements, the capability key 336 SHOULD be transferred securely to prevent eavesdropping (see 337 Section 13). Therefore, a client SHOULD either issue the LAYOUTGET 338 or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service 339 or previously establish a secret state verifier (SSV) for the 340 sessions via the NFSv4.1 SET_SSV operation. The 341 pnfs_osd_cap_key_sec4 type is used to identify the method used by the 342 server to secure the capability key. 344 o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is 345 not encrypted, in which case the client SHOULD issue the LAYOUTGET 346 or GETDEVICEINFO operations with RPCSEC_GSS with the privacy 347 service, or the NFSv4.1 transport should be secured by using 348 methods that are external to NFSv4.1, such as the use of IPsec [15] 349 for transporting the NFSv4.1 protocol.
351 o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key 352 contents are encrypted using the SSV GSS context and the 353 capability key as inputs to the GSS_Wrap() function (see GSS-API 354 [7]) with the conf_req_flag set to TRUE. The client MUST use the 355 secret SSV key as part of the client's GSS context to decrypt the 356 capability key using the value of the oc_capability_key field as 357 the input_message to the GSS_unwrap() function. Note that to 358 prevent eavesdropping of the SSV key, the client SHOULD issue 359 SET_SSV via RPCSEC_GSS with the privacy service. 361 The actual method chosen depends on whether the client established an 362 SSV key with the server and whether it issued the operation with the 363 RPCSEC_GSS privacy method. Naturally, if the client did not 364 establish an SSV key via SET_SSV, the server MUST use the 365 PNFS_OSD_CAP_KEY_SEC_NONE method. Otherwise, if the operation was 366 not issued with the RPCSEC_GSS privacy method, the server SHOULD 367 secure the oc_capability_key with the PNFS_OSD_CAP_KEY_SEC_SSV 368 method. The server MAY also use the PNFS_OSD_CAP_KEY_SEC_SSV method 369 when the operation was issued with the RPCSEC_GSS privacy method. 371 3.4. pnfs_osd_raid_algorithm4 372 /// enum pnfs_osd_raid_algorithm4 { 373 /// PNFS_OSD_RAID_0 = 1, 374 /// PNFS_OSD_RAID_4 = 2, 375 /// PNFS_OSD_RAID_5 = 3, 376 /// PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */ 377 /// }; 378 /// 380 pnfs_osd_raid_algorithm4 represents the data redundancy algorithm 381 used to protect the file's contents. See Section 5.4 for more 382 details. 384 4. Object Storage Device Addressing and Discovery 386 Data operations to an OSD require the client to know the "address" of 387 each OSD's root object. The root object is synonymous with the Small 388 Computer System Interface (SCSI) logical unit. The client specifies 389 SCSI logical units to its SCSI protocol stack using a representation 390 local to the client.
Because these representations are local, 391 GETDEVICEINFO must return information that can be used by the client 392 to select the correct local representation. 394 In the block world, a set offset (logical block number or track/ 395 sector) contains a disk label. This label identifies the disk 396 uniquely. In contrast, an OSD has a standard set of attributes on 397 its root object. For device identification purposes, the OSD System 398 ID (root information attribute number 3) and the OSD Name (root 399 information attribute number 9) are used as the label. These appear 400 in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and 401 "oda_osdname" fields. 403 In some situations, SCSI target discovery may need to be driven based 404 on information contained in the GETDEVICEINFO response. One example 405 of this is Internet SCSI (iSCSI) targets that are not known to the 406 client until a layout has been requested. The information provided 407 as the "oda_targetid", "oda_targetaddr", and "oda_lun" fields in the 408 pnfs_osd_deviceaddr4 type described below (see Section 4.2) allows 409 the client to probe a specific device given its network address and 410 optionally its iSCSI Name (see iSCSI [8]), or when the device network 411 address is omitted, allows it to discover the object storage device 412 using the provided device name or SCSI Device Identifier (see SPC-3 413 [10].) 415 The oda_systemid is implicitly used by the client, by using the 416 object credential signing key to sign each request with the request 417 integrity check value. This method protects the client from 418 unintentionally accessing a device if the device address mapping was 419 changed (or revoked). The server computes the capability key using 420 its own view of the systemid associated with the respective deviceid 421 present in the credential. 
If the client's view of the deviceid 422 mapping is stale, the client will use the wrong systemid (which must 423 be system-wide unique) and the I/O request to the OSD will fail to 424 pass the integrity check verification. 426 To recover from this condition, the client should report the error and 427 return the layout using LAYOUTRETURN, and invalidate all the device 428 address mappings associated with this layout. The client can then, 429 if it wishes, ask for a new layout using LAYOUTGET and resolve the 430 referenced deviceids using GETDEVICEINFO or GETDEVICELIST. 432 The server MUST provide the oda_systemid and SHOULD also provide the 433 oda_osdname. When the OSD name is present, the client SHOULD get the 434 root information attributes whenever it establishes communication 435 with the OSD and verify that the OSD name it got from the OSD matches 436 the one sent by the metadata server. To do so, the client uses the 437 root_obj_cred credentials. 439 4.1. pnfs_osd_targetid_type4 441 The following enum specifies the manner in which a SCSI target can be 442 specified. The target can be specified as a SCSI Name or as a SCSI 443 Device Identifier. 445 /// enum pnfs_osd_targetid_type4 { 446 /// OBJ_TARGET_ANON = 1, 447 /// OBJ_TARGET_SCSI_NAME = 2, 448 /// OBJ_TARGET_SCSI_DEVICE_ID = 3 449 /// }; 450 /// 452 4.2. pnfs_osd_deviceaddr4 454 The "pnfs_osd_deviceaddr4" data structure is returned by the server 455 as the storage-protocol-specific opaque field da_addr_body in the 456 "device_addr4" structure of a successful GETDEVICEINFO operation 457 (see NFSv4.1 [4]).
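The device-discovery rules of Sections 4 and 4.2.2 can be sketched as client-side logic. The following Python sketch is purely illustrative: the DeviceAddr mirror type and the strategy names are hypothetical and are not part of the protocol; the field names follow the pnfs_osd_deviceaddr4 structure specified in this section.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical client-side mirror of pnfs_osd_deviceaddr4 (not a wire-
# format decoder); field names follow the XDR in this section.
@dataclass
class DeviceAddr:
    targetid_type: str         # "ANON", "SCSI_NAME", or "SCSI_DEVICE_ID"
    targetid: bytes            # oti_scsi_name / oti_scsi_device_id contents
    targetaddr: Optional[str]  # oda_targetaddr hint, or None when absent
    lun: bytes                 # oda_lun, 8-byte Logical Unit Number
    systemid: bytes            # oda_systemid
    osdname: bytes             # oda_osdname

def discovery_strategy(dev: DeviceAddr) -> str:
    """Pick a discovery path: probe the given network address when the
    hint is present; otherwise fall back to a name service (e.g., iSNS)
    keyed by the target id; for OBJ_TARGET_ANON, only the OSD System ID
    (and optionally the OSD name) locate the device."""
    if dev.targetaddr is not None:
        return "probe-network-address"
    if dev.targetid_type in ("SCSI_NAME", "SCSI_DEVICE_ID"):
        return "name-service-lookup"
    return "lookup-by-systemid"
```

A client might still consult an external name service even when a network address hint is present, e.g., for multipathed devices, as noted in Section 4.2.2.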
459 The specification for an object device address is as follows: 461 /// union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) { 462 /// case OBJ_TARGET_SCSI_NAME: 463 /// string oti_scsi_name<>; 464 /// 465 /// case OBJ_TARGET_SCSI_DEVICE_ID: 466 /// opaque oti_scsi_device_id<>; 467 /// 468 /// default: 469 /// void; 470 /// }; 471 /// 472 /// union pnfs_osd_targetaddr4 switch (bool ota_available) { 473 /// case TRUE: 474 /// netaddr4 ota_netaddr; 475 /// case FALSE: 476 /// void; 477 /// }; 478 /// 479 /// struct pnfs_osd_deviceaddr4 { 480 /// pnfs_osd_targetid4 oda_targetid; 481 /// pnfs_osd_targetaddr4 oda_targetaddr; 482 /// opaque oda_lun[8]; 483 /// opaque oda_systemid<>; 484 /// pnfs_osd_object_cred4 oda_root_obj_cred; 485 /// opaque oda_osdname<>; 486 /// }; 487 /// 489 4.2.1. SCSI Target Identifier 491 When "oda_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the 492 "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as 493 specified in iSCSI [8] and [9]. Note that the specification of the 494 oti_scsi_name string format is outside the scope of this document. 495 Parsing the string is based on the string prefix, e.g., "iqn.", 496 "eui.", or "naa.", and more formats MAY be specified in the future in 497 accordance with iSCSI Names properties. 499 Currently, the iSCSI Name provides for naming the target device using 500 a string formatted as an iSCSI Qualified Name (IQN) or as an Extended 501 Unique Identifier (EUI) [13] string. Those are typically used to 502 identify iSCSI or SCSI RDMA Protocol (SRP) [20] devices. The 503 Network Address Authority (NAA) string format (see [9]) provides for 504 naming the device using globally unique identifiers, as defined in 505 Fibre Channel Framing and Signaling (FC-FS) [21]. These are 506 typically used to identify Fibre Channel or SAS [22] (Serial Attached 507 SCSI) devices, in particular, devices that are dual-attached 508 over both Fibre Channel or SAS and iSCSI.
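The prefix-based parsing described above can be illustrated with a small helper. This is a sketch only; the function name and return strings are hypothetical, and unknown prefixes are reported rather than rejected since more formats MAY be specified in the future.

```python
def scsi_name_format(oti_scsi_name: str) -> str:
    """Classify an oti_scsi_name string by its prefix, per the
    prefix-based parsing described in Section 4.2.1."""
    prefixes = (
        ("iqn.", "iSCSI Qualified Name"),
        ("eui.", "Extended Unique Identifier"),
        ("naa.", "Network Address Authority"),
    )
    for prefix, fmt in prefixes:
        if oti_scsi_name.startswith(prefix):
            return fmt
    # Future formats may be added; do not treat unknown prefixes as fatal.
    return "unknown"
```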
510 When "oda_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the 511 "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device 512 Identifier as defined in SPC-3 [10] VPD Page 83h (Section 7.6.3. 513 "Device Identification VPD Page"). If the Device Identifier is 514 identical to the OSD System ID, as given by oda_systemid, the server 515 SHOULD provide a zero-length oti_scsi_device_id opaque value. Note 516 that similarly to the "oti_scsi_name", the specification of the 517 oti_scsi_device_id opaque contents is outside the scope of this 518 document and more formats MAY be specified in the future in 519 accordance with SPC-3. 521 The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used for providing 522 no target identification. In this case, only the OSD System ID, and 523 optionally the provided network address, are used to locate the 524 device. 526 4.2.2. Device Network Address 528 The optional "oda_targetaddr" field MAY be provided by the server as 529 a hint to accelerate device discovery over, e.g., the iSCSI transport 530 protocol. The network address is given with the netaddr4 type, which 531 specifies a TCP/IP based endpoint (as specified in NFSv4.1 [4]). 532 When given, the client SHOULD use it to probe for the SCSI device at 533 the given network address. The client MAY still use other discovery 534 mechanisms such as Internet Storage Name Service (iSNS) [12] to 535 locate the device using the oda_targetid. In particular, such an 536 external name service SHOULD be used when the devices may be attached 537 to the network using multiple connections, and/or multiple storage 538 fabrics (e.g., Fibre-Channel and iSCSI). 540 The "oda_lun" field identifies the OSD 64-bit Logical Unit Number, 541 formatted in accordance with SAM-3 [11]. The client uses the Logical 542 Unit Number to communicate with the specific OSD Logical Unit. Its 543 use is defined in detail by the SCSI transport protocol, e.g., iSCSI 544 [8]. 546 5. 
Object-Based Layout 548 The layout4 type is defined in NFSv4.1 [4] as follows: 550 enum layouttype4 { 551 LAYOUT4_NFSV4_1_FILES = 1, 552 LAYOUT4_OSD2_OBJECTS = 2, 553 LAYOUT4_BLOCK_VOLUME = 3 554 }; 556 struct layout_content4 { 557 layouttype4 loc_type; 558 opaque loc_body<>; 559 }; 561 struct layout4 { 562 offset4 lo_offset; 563 length4 lo_length; 564 layoutiomode4 lo_iomode; 565 layout_content4 lo_content; 566 }; 568 This document defines the structure associated with the layouttype4 569 value LAYOUT4_OSD2_OBJECTS. NFSv4.1 [4] specifies the loc_body 570 structure as an XDR type "opaque". The opaque layout is 571 uninterpreted by the generic pNFS client layers, but obviously must 572 be interpreted by the object storage layout driver. This section 573 defines the structure of this opaque value, pnfs_osd_layout4. 575 5.1. pnfs_osd_data_map4 577 /// struct pnfs_osd_data_map4 { 578 /// uint32_t odm_num_comps; 579 /// length4 odm_stripe_unit; 580 /// uint32_t odm_group_width; 581 /// uint32_t odm_group_depth; 582 /// uint32_t odm_mirror_cnt; 583 /// pnfs_osd_raid_algorithm4 odm_raid_algorithm; 584 /// }; 585 /// 587 The pnfs_osd_data_map4 structure parameterizes the algorithm that 588 maps a file's contents over the component objects. Instead of 589 limiting the system to a simple striping scheme where loss of a single 590 component object results in data loss, the map parameters support 591 mirroring and more complicated schemes that protect against loss of a 592 component object. 594 "odm_num_comps" is the number of component objects the file is 595 striped over. The server MAY grow the file by adding more components 596 to the stripe while clients hold valid layouts until the file has 597 reached its final stripe width. The file length in this case MUST be 598 limited to the number of bytes in a full stripe. 600 The "odm_stripe_unit" is the number of bytes placed on one component 601 before advancing to the next one in the list of components.
The 602 number of bytes in a full stripe is odm_stripe_unit times the number 603 of components. In some RAID schemes, a stripe includes redundant 604 information (i.e., parity) that lets the system recover from loss or 605 damage to a component object. 607 The "odm_group_width" and "odm_group_depth" parameters allow a nested 608 striping pattern (see Section 5.3.2 for details). If there is no 609 nesting, then odm_group_width and odm_group_depth MUST be zero. The 610 size of the components array MUST be a multiple of odm_group_width. 612 The "odm_mirror_cnt" is used to replicate a file by replicating its 613 component objects. If there is no mirroring, then odm_mirror_cnt 614 MUST be 0. If odm_mirror_cnt is greater than zero, then the size of 615 the component array MUST be a multiple of (odm_mirror_cnt+1). 617 See Section 5.3 for more details. 619 5.2. pnfs_osd_layout4 621 /// struct pnfs_osd_layout4 { 622 /// pnfs_osd_data_map4 olo_map; 623 /// uint32_t olo_comps_index; 624 /// pnfs_osd_object_cred4 olo_components<>; 625 /// }; 626 /// 628 The pnfs_osd_layout4 structure specifies a layout over a set of 629 component objects. The "olo_components" field is an array of object 630 identifiers and security credentials that grant access to each 631 object. The organization of the data is defined by the 632 pnfs_osd_data_map4 type that specifies how the file's data is mapped 633 onto the component objects (i.e., the striping pattern). The data 634 placement algorithm that maps file data onto component objects 635 assumes that each component object occurs exactly once in the array 636 of components. Therefore, component objects MUST appear in the 637 olo_components array only once. The components array may represent 638 all objects comprising the file, in which case "olo_comps_index" is 639 set to zero and the number of entries in the olo_components array is 640 equal to olo_map.odm_num_comps. 
The server MAY return fewer 641 components than odm_num_comps, provided that the returned components 642 are sufficient to access any byte in the layout's data range (e.g., a 643 sub-stripe of "odm_group_width" components). In this case, 644 olo_comps_index represents the position of the returned components 645 array within the full array of components that comprise the file. 647 Note that the layout depends on the file size, which the client 648 learns from the generic return parameters of LAYOUTGET, or by doing 649 GETATTR commands to the metadata server. The client uses the file 650 size to decide if it should fill holes with zeros or return a short 651 read. Striping patterns can cause cases where component objects are 652 shorter than other components because a hole happens to correspond to 653 the last part of the component object. 655 5.3. Data Mapping Schemes 657 This section describes the different data mapping schemes in detail. 658 The object layout always uses a "dense" layout as described in 659 NFSv4.1 [4]. This means that the second stripe unit of the file 660 starts at offset 0 of the second component, rather than at offset 661 stripe_unit bytes. After a full stripe has been written, the next 662 stripe unit is appended to the first component object in the list 663 without any holes in the component objects. 665 5.3.1.
Simple Striping 667 The mapping from the logical offset within a file (L) to the 668 component object C and object-specific offset O is defined by the 669 following equations: 671 L: logical offset into the file 673 W: stripe width 674 W = size of olo_components array 676 S: number of bytes in a stripe 677 S = W * stripe_unit 679 N: stripe number 680 N = L / S 682 C: component index corresponding to L 683 C = (L % S) / stripe_unit 685 O: The component offset corresponding to L 686 O = (N * stripe_unit) + (L % stripe_unit) 688 Note that this computation does not accommodate the same object 689 appearing in the olo_components array multiple times. Therefore, the 690 server may not return layouts with the same object appearing multiple 691 times. If needed, the server can return multiple layout segments, each 692 covering a single instance of the object. 694 For example, consider an object striped over four devices, <D0, D1, 695 D2, D3>. The stripe_unit is 4096 bytes. The stripe width S is thus 4 * 696 4096 = 16384. 698 Offset 0: 699 N = 0 / 16384 = 0 700 C = (0 % 16384) / 4096 = 0 (D0) 701 O = 0*4096 + (0%4096) = 0 703 Offset 4096: 704 N = 4096 / 16384 = 0 705 C = (4096 % 16384) / 4096 = 1 (D1) 706 O = (0*4096)+(4096%4096) = 0 708 Offset 9000: 709 N = 9000 / 16384 = 0 710 C = (9000 % 16384) / 4096 = 2 (D2) 711 O = (0*4096)+(9000%4096) = 808 713 Offset 132000: 714 N = 132000 / 16384 = 8 715 C = (132000 % 16384) / 4096 = 0 (D0) 716 O = (8*4096) + (132000%4096) = 33696 718 5.3.2. Nested Striping 720 The odm_group_width and odm_group_depth parameters allow a nested 721 striping pattern. odm_group_width defines the width of a data stripe 722 and odm_group_depth defines how many stripes are written before 723 advancing to the next group of components in the list of component 724 objects for the file. The math used to map from a file offset to a 725 component object and offset within that object is shown below.
The 726 computations map from the logical offset L to the component index C 727 and relative offset O within that component object. 729 L: logical offset into the file 731 FW: total number of components 732 FW = size of olo_components array 734 W: stripe width 735 W = group_width, if not zero, else FW 737 group_count: number of groups 738 group_count = FW / group_width, if group_width is not zero, else 1 740 D: number of data devices in a stripe 741 D = W 743 U: number of data bytes in a stripe within a group 744 U = D * stripe_unit 746 T: number of bytes striped within a group of component objects 747 (before advancing to the next group) 748 T = U * group_depth 750 S: number of bytes striped across all component objects 751 (before the pattern repeats) 752 S = T * group_count 754 M: The "major" (i.e., across all components) cycle number 755 M = L / S 757 G: group number from the beginning of the major cycle 758 G = (L % S) / T 760 H: byte offset within the last group 761 H = (L % S) % T 763 N: The "minor" (i.e., across the group) stripe number 764 N = H / U 766 C: component index corresponding to L 767 C = (G * D) + ((H % U) / stripe_unit) 769 O: The component offset corresponding to L 770 O = (M * group_depth * stripe_unit) + (N * stripe_unit) + 771 (L % stripe_unit) 773 For example, consider an object striped over 100 devices with a 774 group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB. 775 In this scheme, 500 MB are written to the first 10 components, and 776 5000 MB are written before the pattern wraps back around to the first 777 component in the array.
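The nested mapping equations above can also be expressed in executable form. The following Python function is a non-normative sketch (the function name and defaults are illustrative, not part of the protocol); all divisions are integer divisions, as in the equations. With group_width = 0 and group_depth = 1 it reduces to the simple striping scheme of Section 5.3.1.

```python
def map_nested(L, FW, stripe_unit, group_width=0, group_depth=1):
    """Map logical file offset L to (component index C, component offset O).

    Implements the nested striping equations (W, group_count, U, T, S,
    M, G, H, N) from the layout specification.  All quantities are in
    bytes; integer division throughout.
    """
    W = group_width if group_width else FW           # stripe width
    group_count = FW // group_width if group_width else 1
    D = W                                            # data devices in a stripe
    U = D * stripe_unit                              # data bytes in a stripe
    T = U * group_depth                              # bytes striped in a group
    S = T * group_count                              # bytes before pattern repeats
    M = L // S                                       # major cycle number
    G = (L % S) // T                                 # group number in the cycle
    H = (L % S) % T                                  # byte offset within group
    N = H // U                                       # minor stripe number
    C = G * D + (H % U) // stripe_unit               # component index
    O = M * group_depth * stripe_unit + N * stripe_unit + L % stripe_unit
    return C, O
```

Running this with FW = 100, stripe_unit = 1 MB, group_width = 10, and group_depth = 50 reproduces the worked examples that follow (e.g., offset 27 MB maps to C = 7, O = 2 MB), and with group_width = 0 it reproduces the simple striping examples of Section 5.3.1.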
779 Offset 0: 780 W = 100 781 group_count = 100 / 10 = 10 782 D = 10 783 U = 1 MB * 10 = 10 MB 784 T = 10 MB * 50 = 500 MB 785 S = 500 MB * 10 = 5000 MB 786 M = 0 / 5000 MB = 0 787 G = (0 % 5000 MB) / 500 MB = 0 788 H = (0 % 5000 MB) % 500 MB = 0 789 N = 0 / 10 MB = 0 790 C = (0 * 10) + ((0 % 10 MB) / 1 MB) = 0 791 O = (0 * 50 * 1 MB) + (0 * 1 MB) + (0 % 1 MB) = 0 793 Offset 27 MB: 794 M = 27 MB / 5000 MB = 0 795 G = (27 MB % 5000 MB) / 500 MB = 0 796 H = (27 MB % 5000 MB) % 500 MB = 27 MB 797 N = 27 MB / 10 MB = 2 798 C = (0 * 10) + ((27 MB % 10 MB) / 1 MB) = 7 799 O = (0 * 50 * 1 MB) + (2 * 1 MB) + (27 MB % 1 MB) = 2 MB 801 Offset 7232 MB: 802 M = 7232 MB / 5000 MB = 1 803 G = (7232 MB % 5000 MB) / 500 MB = 4 804 H = (7232 MB % 5000 MB) % 500 MB = 232 MB 805 N = 232 MB / 10 MB = 23 806 C = (4 * 10) + ((232 MB % 10 MB) / 1 MB) = 42 807 O = (1 * 50 * 1 MB) + (23 * 1 MB) + (7232 MB % 1 MB) = 73 MB 809 5.3.3. Mirroring 811 The odm_mirror_cnt is used to replicate a file by replicating its 812 component objects. If there is no mirroring, then odm_mirror_cnt 813 MUST be 0. If odm_mirror_cnt is greater than zero, then the size of 814 the olo_components array MUST be a multiple of (odm_mirror_cnt+1). 815 Thus, for a classic mirror on two objects, odm_mirror_cnt is one. 816 Note that mirroring can be defined over any RAID algorithm and 817 striping pattern (either simple or nested). If odm_group_width is 818 also non-zero, then the size of the olo_components array MUST be a 819 multiple of odm_group_width * (odm_mirror_cnt+1). Note that 820 odm_group_width does not account for mirrors. Replicas are adjacent 821 in the olo_components array, and the value C produced by the above 822 equations is not a direct index into the olo_components array. 824 Instead, the following equations determine the replica component 825 index RCi, where i ranges from 0 to odm_mirror_cnt. 
827 FW = size of olo_components array / (odm_mirror_cnt+1) 829 C = component index for striping or two-level striping 830 as calculated using above equations 832 i ranges from 0 to odm_mirror_cnt, inclusive 833 RCi = C * (odm_mirror_cnt+1) + i 835 5.4. RAID Algorithms 837 pnfs_osd_raid_algorithm4 determines the algorithm and placement of 838 redundant data. This section defines the different redundancy 839 algorithms. Note: The term "RAID" (Redundant Array of Independent 840 Disks) is used in this document to represent an array of component 841 objects that store data for an individual file. The objects are 842 stored on independent object-based storage devices. File data is 843 encoded and striped across the array of component objects using 844 algorithms developed for block-based RAID systems. 846 The use of per-file RAID encoding in the object-layout for pNFS 847 imposes an additional responsibility on the file system client. The 848 pNFS client SHOULD generate the redundant data and write it to 849 storage along with the file data according to the RAID parameters 850 returned in the layout. However, various error conditions may 851 prevent the client from meeting its obligations, and this is 852 supported by the error information in the pnfs_osd_ioerr4 structure 853 (see Section 8.1). An explicit error return from the client, or an 854 implicit error caused by a client's failure to return a layout, MUST 855 trigger recovery action by the server to prevent access to invalid 856 data (see Section 7). It is the server's responsibility to only 857 grant layout information to files that can be safely accessed, and to 858 deny access to files that are in an inconsistent state. 860 5.4.1. PNFS_OSD_RAID_0 862 PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the 863 component objects are data bytes located by the above equations for C 864 and O.
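For PNFS_OSD_RAID_0 with simple striping, the striping equations of Section 5.3.1 and the replica-index equations of Section 5.3.3 combine directly. The following Python function is a non-normative sketch (the function name is illustrative) of locating every replica of a logical offset:

```python
def raid0_locate(L, num_comps, stripe_unit, mirror_cnt=0):
    """Locate logical offset L under PNFS_OSD_RAID_0 with simple striping.

    Returns (component offset O, list of replica indices into
    olo_components).  FW counts the distinct striping positions;
    mirrored replicas are adjacent in the olo_components array, so the
    striping index C is scaled by (mirror_cnt + 1).
    """
    FW = num_comps // (mirror_cnt + 1)    # distinct striping positions
    S = FW * stripe_unit                  # bytes in a full stripe
    N = L // S                            # stripe number
    C = (L % S) // stripe_unit            # striping component index
    O = N * stripe_unit + L % stripe_unit # offset within each replica
    replicas = [C * (mirror_cnt + 1) + i for i in range(mirror_cnt + 1)]
    return O, replicas
```

For the four-device example of Section 5.3.1, offset 9000 yields O = 808 on component 2; mirroring the same layout with odm_mirror_cnt = 1 over eight components yields the adjacent replica pair at indices 4 and 5.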
If a component object is marked as PNFS_OSD_MISSING, an I/O 865 error MUST be returned if this component is accessed. In this case, 866 the generic NFS client layer MAY elect to retry this operation 867 against the pNFS server. 869 5.4.2. PNFS_OSD_RAID_4 871 PNFS_OSD_RAID_4 means that the last component object, or the last in 872 each group (if odm_group_width is greater than zero), contains parity 873 information computed over the rest of the stripe with an XOR 874 operation. If a component object is unavailable, the client can read 875 the rest of the stripe units in the damaged stripe and recompute the 876 missing stripe unit by XORing the other stripe units in the stripe. 877 Alternatively, the client can replay the READ against the pNFS server, which will 878 presumably perform the reconstructed read on the client's behalf. 880 When parity is present in the file, the number of parity devices 881 is taken into account in the above equations when calculating (D), 882 the number of data devices in a stripe, as follows: 884 P: number of parity devices in each stripe 885 P = 1 887 D: number of data devices in a stripe 888 D = W - P 890 I: parity device index 891 I = D 893 5.4.3. PNFS_OSD_RAID_5 895 PNFS_OSD_RAID_5 means that the position of the parity data is rotated 896 on each stripe or each group (if odm_group_width is greater than 897 zero). In the first stripe, the last component holds the parity. In 898 the second stripe, the next-to-last component holds the parity, and 899 so on. In this scheme, all stripe units are rotated so that I/O is 900 evenly spread across objects as the file is read sequentially. The 901 rotated parity layout is illustrated here, with hexadecimal numbers 902 indicating the stripe unit. 904 0 1 2 P 905 4 5 P 3 906 8 P 6 7 907 P 9 a b 909 Note that the math for RAID_5 is similar to that for RAID_4, except that the 910 device indices for each stripe are rotated backwards.
So start with 911 the equations above for RAID_4, then compute the rotation as 912 described below. Also note that the parity rotation cycle always 913 starts on group boundaries, so the first stripe in a group has its 914 parity at device D. 916 P: number of parity devices in each stripe 917 P = 1 919 PC: Parity Cycle 920 PC = W 922 R: The parity rotation index 923 (N is as computed in above equations for RAID-4) 924 R = N % PC 926 I: parity device index 927 I = (W + W - (R + 1) * P) % W 929 Cr: The rotated device index 930 (C is as computed in the above equations for RAID-4) 931 Cr = (W + C - (R * P)) % W 933 Note: W is added above to avoid negative numbers in the modulo arithmetic. 935 5.4.4. PNFS_OSD_RAID_PQ 937 PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon 938 P+Q encoding scheme [16]. In this layout, the last two component 939 objects hold the P and Q data, respectively. P is parity computed 940 with XOR. The Q computation is described in detail by Anvin [17]. 941 The same polynomial "x^8+x^4+x^3+x^2+1" and Galois field size of 2^8 942 are used here. Clients may simply choose to read data through the 943 metadata server if two or more components are missing or damaged. 945 The equations given above for embedded parity can be used to map a 946 file offset to the correct component object by setting the number of 947 parity components (P) to 2 instead of 1 for RAID-5 and computing the 948 Parity Cycle length as the Lowest Common Multiple [18] of 949 odm_group_width and P, divided by P, as described below. Note: This 950 algorithm can also be used for RAID-5, where P=1. 952 P: number of parity devices 953 P = 2 955 PC: Parity cycle: 956 PC = LCM(W, P) / P 958 Q: The device index holding the Q component 959 (I is as computed in the above equations for RAID-5) 960 Qdev = (I + 1) % W 962 5.4.5.
RAID Usage and Implementation Notes 964 RAID layouts with redundant data in their stripes require additional 965 serialization of updates to ensure correct operation. Otherwise, if 966 two clients simultaneously write to the same logical range of an 967 object, the result could include different data in the same ranges of 968 mirrored tuples, or corrupt parity information. It is the 969 responsibility of the metadata server to enforce serialization 970 requirements. Serialization MUST occur at the RAID stripe boundary 971 for write operations to avoid corrupting parity by concurrent updates 972 to the same stripe. Mirrors do not have explicit stripe boundaries, 973 so it is sufficient to serialize writes to the same byte ranges. 975 Many alternative encoding schemes exist for P>=2 [19]. These involve 976 P or Q equations different than the Reed-Solomon encoding used in 977 PNFS_OSD_RAID_PQ. Thus, if one of these schemes is to be used in the 978 future, a distinct value must be added to pnfs_osd_raid_algorithm4 979 for it. 981 6. Object-Based Layout Update 983 layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates 984 to the layout and additional information to the metadata server. It 985 is defined in the NFSv4.1 [4] as follows: 987 struct layoutupdate4 { 988 layouttype4 lou_type; 989 opaque lou_body<>; 990 }; 992 The layoutupdate4 type is an opaque value at the generic pNFS client 993 level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the 994 lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type. 996 Object-Based pNFS clients are not allowed to modify the layout. 997 Therefore, the information passed in pnfs_osd_layoutupdate4 is used 998 only to update the file's attributes. 
In addition to the generic 999 information the client can pass to the metadata server in 1000 LAYOUTCOMMIT, such as the highest offset the client wrote to and the 1001 last time it modified the file, the client MAY use 1002 pnfs_osd_layoutupdate4 to convey the capacity consumed (or released) 1003 by writes using the layout, and to indicate that I/O errors were 1004 encountered by such writes. 1006 6.1. pnfs_osd_deltaspaceused4 1008 /// union pnfs_osd_deltaspaceused4 switch (bool dsu_valid) { 1009 /// case TRUE: 1010 /// int64_t dsu_delta; 1011 /// case FALSE: 1012 /// void; 1013 /// }; 1014 /// 1016 pnfs_osd_deltaspaceused4 is used to convey space utilization 1017 information at the time of LAYOUTCOMMIT. For the file system to 1018 properly maintain capacity-used information, it needs to track how 1019 much capacity was consumed by WRITE operations performed by the 1020 client. In this protocol, the OSD returns the capacity consumed by a 1021 write, which can be different from the number of bytes written 1022 because of internal overhead like block-level allocation and indirect 1023 blocks, and the client reflects this back to the pNFS server so it 1024 can accurately track quota. The pNFS server can choose to trust this 1025 information coming from the clients and therefore avoid querying the 1026 OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain 1027 this information from the OSD, it simply returns an invalid 1028 olu_delta_space_used. 1030 6.2. pnfs_osd_layoutupdate4 1032 /// struct pnfs_osd_layoutupdate4 { 1033 /// pnfs_osd_deltaspaceused4 olu_delta_space_used; 1034 /// bool olu_ioerr_flag; 1035 /// }; 1036 /// 1038 "olu_delta_space_used" is used to convey capacity usage information 1039 back to the metadata server. 1041 The "olu_ioerr_flag" is used when I/O errors were encountered while 1042 writing the file. The client MUST report the errors using the 1043 pnfs_osd_ioerr4 structure (see Section 8.1) at LAYOUTRETURN time.
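As a non-normative illustration of how the pnfs_osd_layoutupdate4 structure above is serialized for the LAYOUTCOMMIT lou_body, the following Python sketch applies the standard XDR encoding rules (a bool is a 4-byte big-endian 0 or 1, and the int64_t dsu_delta is an 8-byte big-endian signed hyper); the function name is illustrative:

```python
import struct

def encode_layoutupdate(delta_valid, delta, ioerr_flag):
    """XDR-encode a pnfs_osd_layoutupdate4 for the LAYOUTCOMMIT lou_body.

    Encodes the olu_delta_space_used union (discriminated by dsu_valid),
    followed by the olu_ioerr_flag bool, per standard XDR rules.
    """
    body = struct.pack(">i", 1 if delta_valid else 0)   # dsu_valid
    if delta_valid:
        body += struct.pack(">q", delta)                # dsu_delta (hyper)
    body += struct.pack(">i", 1 if ioerr_flag else 0)   # olu_ioerr_flag
    return body
```

A client that consumed 4096 additional bytes and saw no errors would thus produce a 16-byte body; one that could not obtain space information from the OSD omits dsu_delta, producing an 8-byte body.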
1045 If the client updated the file successfully before hitting the I/O 1046 errors, it MAY use LAYOUTCOMMIT to update the metadata server as 1047 described above. Typically, in the error-free case, the server MAY 1048 turn around and update the file's attributes on the storage devices. 1049 However, if I/O errors were encountered, the server should not 1050 attempt to write the new attributes on the storage devices until it 1051 receives the I/O error report; therefore, the client MUST set the 1052 olu_ioerr_flag to true. Note that in this case, the client SHOULD 1053 send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same 1054 COMPOUND RPC. 1056 7. Recovering from Client I/O Errors 1058 The pNFS client may encounter errors when directly accessing the 1059 object storage devices. A well-behaved client will report any such 1060 errors promptly by executing a LAYOUTRETURN. When the 1061 LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the 1062 I/O errors to the server at LAYOUTRETURN time using the 1063 pnfs_osd_ioerr4 structure (see Section 8.1). 1065 It is the responsibility of the metadata server to handle the I/O 1066 errors. The server MUST analyze the error and perform the required 1067 recovery operations such as repairing any parity inconsistencies, 1068 recovering media failures, or reconstructing missing objects. 1070 The metadata server SHOULD recall any outstanding layouts to allow it 1071 exclusive write access to the stripes being recovered and to prevent 1072 other clients from hitting the same error condition. In these cases, 1073 the server MUST complete recovery before handing out any new layouts 1074 to the affected byte ranges. 1076 The client SHOULD attempt to compensate for the error before giving 1077 up and reflecting an error to the application. The first step in 1078 error recovery is to return the layout with LAYOUTRETURN and the 1079 associated error information.
The second step is to request a new 1080 layout using LAYOUTGET and then retry the I/O operation with the new 1081 layout. Finally, if the error persists, the client may choose to 1082 retry the I/O operation using regular NFS READ or WRITE operations 1083 via the metadata server. 1085 8. Object-Based Layout Return 1087 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 1088 layout-type-specific information to the server. It is defined in the 1089 NFSv4.1 [4] as follows: 1091 struct layoutreturn_file4 { 1092 offset4 lrf_offset; 1093 length4 lrf_length; 1094 stateid4 lrf_stateid; 1095 /* layouttype4 specific data */ 1096 opaque lrf_body<>; 1097 }; 1099 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 1100 case LAYOUTRETURN4_FILE: 1101 layoutreturn_file4 lr_layout; 1102 default: 1103 void; 1104 }; 1106 struct LAYOUTRETURN4args { 1107 /* CURRENT_FH: file */ 1108 bool lora_reclaim; 1110 layouttype4 lora_layout_type; 1111 layoutiomode4 lora_iomode; 1112 layoutreturn4 lora_layoutreturn; 1113 }; 1115 If the lora_layout_type layout type is LAYOUT4_OSD2_OBJECTS, then the 1116 lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type. 1118 The pnfs_osd_layoutreturn4 type allows the client to report I/O error 1119 information back to the metadata server as defined below. 1121 8.1. pnfs_osd_errno4 1123 /// enum pnfs_osd_errno4 { 1124 /// PNFS_OSD_ERR_EIO = 1, 1125 /// PNFS_OSD_ERR_NOT_FOUND = 2, 1126 /// PNFS_OSD_ERR_NO_SPACE = 3, 1127 /// PNFS_OSD_ERR_BAD_CRED = 4, 1128 /// PNFS_OSD_ERR_NO_ACCESS = 5, 1129 /// PNFS_OSD_ERR_UNREACHABLE = 6, 1130 /// PNFS_OSD_ERR_RESOURCE = 7 1131 /// }; 1132 /// 1134 pnfs_osd_errno4 is used to represent error types when read/write 1135 errors are reported to the metadata server. The error codes serve as 1136 hints to the metadata server that may help it in diagnosing the exact 1137 reason for the error and in repairing it.
1139 o PNFS_OSD_ERR_EIO indicates the operation failed because the object 1140 storage device experienced a failure trying to access the object. 1141 The most common source of these errors is media errors, but other 1142 internal errors might cause this as well. In this case, the 1143 metadata server should go examine the broken object more closely; 1144 hence, it should be used as the default error code. 1146 o PNFS_OSD_ERR_NOT_FOUND indicates the object ID specifies an object 1147 that does not exist on the object storage device. 1149 o PNFS_OSD_ERR_NO_SPACE indicates the operation failed because the 1150 object storage device ran out of free capacity during the 1151 operation. 1153 o PNFS_OSD_ERR_BAD_CRED indicates the security parameters are not 1154 valid. The primary cause of this is that the capability has 1155 expired, or the access policy tag (a.k.a., capability version 1156 number) has been changed to revoke capabilities. The client will 1157 need to return the layout and get a new one with fresh 1158 capabilities. 1160 o PNFS_OSD_ERR_NO_ACCESS indicates the capability does not allow the 1161 requested operation. This should not occur in normal operation 1162 because the metadata server should give out correct capabilities, 1163 or none at all. 1165 o PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the 1166 I/O operation at the object storage device due to a communication 1167 failure. Whether or not the I/O operation was executed by the OSD 1168 is undetermined. 1170 o PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O 1171 operation due to a local problem on the initiator (i.e., client) 1172 side, e.g., when running out of memory. The client MUST guarantee 1173 that the OSD command was never dispatched to the OSD. 1175 8.2. 
pnfs_osd_ioerr4 1177 /// struct pnfs_osd_ioerr4 { 1178 /// pnfs_osd_objid4 oer_component; 1179 /// length4 oer_comp_offset; 1180 /// length4 oer_comp_length; 1181 /// bool oer_iswrite; 1182 /// pnfs_osd_errno4 oer_errno; 1183 /// }; 1184 /// 1185 The pnfs_osd_ioerr4 structure is used to return error indications for 1186 objects that generated errors during data transfers. These are hints 1187 to the metadata server that there are problems with that object. For 1188 each error, "oer_component", "oer_comp_offset", and "oer_comp_length" 1189 represent the object and byte range within the component object in 1190 which the error occurred; "oer_iswrite" is set to "true" if the 1191 failed OSD operation was data modifying, and "oer_errno" represents 1192 the type of error. 1194 Component byte ranges in the optional pnfs_osd_ioerr4 structure are 1195 used for recovering the object and MUST be set by the client to cover 1196 all failed I/O operations to the component. 1198 8.3. pnfs_osd_layoutreturn4 1200 /// struct pnfs_osd_layoutreturn4 { 1201 /// pnfs_osd_ioerr4 olr_ioerr_report<>; 1202 /// }; 1203 /// 1205 When OSD I/O operations failed, "olr_ioerr_report<>" is used to 1206 report these errors to the metadata server as an array of elements of 1207 type pnfs_osd_ioerr4. Each element in the array represents an error 1208 that occurred on the object specified by oer_component. If no errors 1209 are to be reported, the size of the olr_ioerr_report<> array is set 1210 to zero. 1212 9. Object-Based Creation Layout Hint 1214 The layouthint4 type is defined in the NFSv4.1 [4] as follows: 1216 struct layouthint4 { 1217 layouttype4 loh_type; 1218 opaque loh_body<>; 1219 }; 1221 The layouthint4 structure is used by the client to pass a hint about 1222 the type of layout it would like created for a particular file. If 1223 the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the loh_body 1224 opaque value is defined by the pnfs_osd_layouthint4 type. 1226 9.1. 
pnfs_osd_layouthint4 1228 /// union pnfs_osd_max_comps_hint4 switch (bool omx_valid) { 1229 /// case TRUE: 1230 /// uint32_t omx_max_comps; 1231 /// case FALSE: 1232 /// void; 1233 /// }; 1234 /// 1235 /// union pnfs_osd_stripe_unit_hint4 switch (bool osu_valid) { 1236 /// case TRUE: 1237 /// length4 osu_stripe_unit; 1238 /// case FALSE: 1239 /// void; 1240 /// }; 1241 /// 1242 /// union pnfs_osd_group_width_hint4 switch (bool ogw_valid) { 1243 /// case TRUE: 1244 /// uint32_t ogw_group_width; 1245 /// case FALSE: 1246 /// void; 1247 /// }; 1248 /// 1249 /// union pnfs_osd_group_depth_hint4 switch (bool ogd_valid) { 1250 /// case TRUE: 1251 /// uint32_t ogd_group_depth; 1252 /// case FALSE: 1253 /// void; 1254 /// }; 1255 /// 1256 /// union pnfs_osd_mirror_cnt_hint4 switch (bool omc_valid) { 1257 /// case TRUE: 1258 /// uint32_t omc_mirror_cnt; 1259 /// case FALSE: 1260 /// void; 1261 /// }; 1262 /// 1263 /// union pnfs_osd_raid_algorithm_hint4 switch (bool ora_valid) { 1264 /// case TRUE: 1265 /// pnfs_osd_raid_algorithm4 ora_raid_algorithm; 1266 /// case FALSE: 1267 /// void; 1268 /// }; 1269 /// 1270 /// struct pnfs_osd_layouthint4 { 1271 /// pnfs_osd_max_comps_hint4 olh_max_comps_hint; 1272 /// pnfs_osd_stripe_unit_hint4 olh_stripe_unit_hint; 1273 /// pnfs_osd_group_width_hint4 olh_group_width_hint; 1274 /// pnfs_osd_group_depth_hint4 olh_group_depth_hint; 1275 /// pnfs_osd_mirror_cnt_hint4 olh_mirror_cnt_hint; 1276 /// pnfs_osd_raid_algorithm_hint4 olh_raid_algorithm_hint; 1277 /// }; 1278 /// 1279 This type conveys hints for the desired data map. All parameters are 1280 optional so the client can give values for only the parameters it 1281 cares about, e.g. it can provide a hint for the desired number of 1282 mirrored components, regardless of the RAID algorithm selected for 1283 the file. 
The server should make an attempt to honor the hints, but 1284 it can ignore any or all of them at its own discretion and without 1285 failing the respective CREATE operation. 1287 The "olh_max_comps_hint" can be used to limit the total number of 1288 component objects comprising the file. All other hints correspond 1289 directly to the different fields of pnfs_osd_data_map4. 1291 10. Layout Segments 1293 The pNFS layout operations operate on logical byte ranges. There is 1294 no requirement in the protocol for any relationship between byte 1295 ranges used in LAYOUTGET to acquire layouts and byte ranges used in 1296 CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. However, using OSD 1297 byte-range capabilities poses limitations on these operations since 1298 the capabilities associated with layout segments cannot be merged or 1299 split. The following guidelines should be followed for proper 1300 operation of object-based layouts. 1302 10.1. CB_LAYOUTRECALL and LAYOUTRETURN 1304 In general, the object-based layout driver should keep track of each 1305 layout segment it acquires, keeping a record of the segment's iomode, 1306 offset, and length. The server should allow the client to get 1307 multiple overlapping layout segments but is free to recall the layout 1308 to prevent overlap. 1310 In response to CB_LAYOUTRECALL, the client should return all layout 1311 segments matching the given iomode and overlapping with the recalled 1312 range. When returning the layouts for this byte range with 1313 LAYOUTRETURN, the client MUST NOT return a sub-range of a layout 1314 segment it has; each LAYOUTRETURN sent MUST completely cover at least 1315 one outstanding layout segment. 1317 The server, in turn, should release any segment that exactly matches 1318 the clientid, iomode, and byte range given in LAYOUTRETURN.
If no 1319 exact match is found, then the server should release all layout 1320 segments matching the clientid and iomode and that are fully 1321 contained in the returned byte range. If none are found and the byte 1322 range is a subset of an outstanding layout segment for the same 1323 clientid and iomode, then the client can be considered malfunctioning 1324 and the server SHOULD recall all layouts from this client to reset 1325 its state. If this behavior repeats, the server SHOULD deny all 1326 LAYOUTGETs from this client. 1328 10.2. LAYOUTCOMMIT 1330 LAYOUTCOMMIT is only used by object-based pNFS to convey modified 1331 attribute hints and/or to report the presence of I/O errors to the 1332 metadata server (MDS). Therefore, the offset and length in 1333 LAYOUTCOMMIT4args are reserved for future use and should be set to 0. 1335 11. Recalling Layouts 1337 The object-based metadata server should recall outstanding layouts in 1338 the following cases: 1340 o When the file's security policy changes, i.e., Access Control 1341 Lists (ACLs) or permission mode bits are set. 1343 o When the file's aggregation map changes, rendering outstanding 1344 layouts invalid. 1346 o When there are sharing conflicts. For example, the server will 1347 issue stripe-aligned layout segments for RAID-5 objects. To 1348 prevent corruption of the file's parity, multiple clients must not 1349 hold valid write layouts for the same stripes. An outstanding 1350 READ/WRITE (RW) layout should be recalled when a conflicting 1351 LAYOUTGET is received from a different client for LAYOUTIOMODE4_RW 1352 and for a byte range overlapping with the outstanding layout 1353 segment. 1355 11.1. CB_RECALL_ANY 1357 The metadata server can use the CB_RECALL_ANY callback operation to 1358 notify the client to return some or all of its layouts.
The NFSv4.1 1359 [4] defines the following types: 1361 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8; 1362 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9; 1364 struct CB_RECALL_ANY4args { 1365 uint32_t craa_objects_to_keep; 1366 bitmap4 craa_type_mask; 1367 }; 1369 Typically, CB_RECALL_ANY will be used to recall client state when the 1370 server needs to reclaim resources. The craa_type_mask bitmap 1371 specifies the type of resources that are recalled and the 1372 craa_objects_to_keep value specifies how many of the recalled objects 1373 the client is allowed to keep. The object-based layout type mask 1374 flags are defined as follows. They represent the iomode of the 1375 recalled layouts. In response, the client SHOULD return layouts of 1376 the recalled iomode that it needs the least, keeping at most 1377 craa_objects_to_keep object-based layouts. 1379 /// enum pnfs_osd_cb_recall_any_mask { 1380 /// PNFS_OSD_RCA4_TYPE_MASK_READ = 8, 1381 /// PNFS_OSD_RCA4_TYPE_MASK_RW = 9 1382 /// }; 1383 /// 1385 The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return 1386 layouts of iomode LAYOUTIOMODE4_READ. Similarly, the 1387 PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return layouts 1388 of iomode LAYOUTIOMODE4_RW. When both mask flags are set, the client 1389 is notified to return layouts of either iomode. 1391 12. Client Fencing 1393 In cases where clients are uncommunicative and their lease has 1394 expired or when clients fail to return recalled layouts within at least a 1395 lease period (see "Recalling a Layout" [4]), the server 1396 MAY revoke client layouts and/or device address mappings and reassign 1397 these resources to other clients. To avoid data corruption, the 1398 metadata server MUST fence off the revoked clients from the 1399 respective objects as described in Section 13.4. 1401 13.
Security Considerations 1403 The pNFS extension partitions the NFSv4 file system protocol into two 1404 parts, the control path and the data path (storage protocol). The 1405 control path contains all the new operations described by this 1406 extension; all existing NFSv4 security mechanisms and features apply 1407 to the control path. The combination of components in a pNFS system 1408 is required to preserve the security properties of NFSv4 with respect 1409 to an entity accessing data via a client, including security 1410 countermeasures to defend against threats that NFSv4 provides 1411 defenses for in environments where these threats are considered 1412 significant. 1414 The metadata server enforces the file access-control policy at 1415 LAYOUTGET time. The client should use suitable authorization 1416 credentials for getting the layout for the requested iomode (READ or 1417 RW) and the server verifies the permissions and ACL for these 1418 credentials, possibly returning NFS4ERR_ACCESS if the client is not 1419 allowed the requested iomode. If the LAYOUTGET operation succeeds, 1420 the client receives, as part of the layout, a set of object 1421 capabilities allowing it I/O access to the specified objects 1422 corresponding to the requested iomode. When the client acts on I/O 1423 operations on behalf of its local users, it MUST authenticate and 1424 authorize the user by issuing respective OPEN and ACCESS calls to the 1425 metadata server, similar to having NFSv4 data delegations. If access 1426 is allowed, the client uses the corresponding (READ or RW) 1427 capabilities to perform the I/O operations at the object storage 1428 devices.
When the metadata server receives a request to change a 1429 file's permissions or ACL, it SHOULD recall all layouts for that file 1430 and it MUST change the capability version attribute on all objects 1431 comprising the file to implicitly invalidate any outstanding 1432 capabilities before committing to the new permissions and ACL. Doing 1433 this will ensure that clients re-authorize their layouts according to 1434 the modified permissions and ACL by requesting new layouts. 1435 Recalling the layouts in this case is a courtesy of the server, 1436 intended to prevent clients from getting an error on I/Os done after 1437 the capability version changed. 1439 The object storage protocol MUST implement the security aspects 1440 described in version 1 of the T10 OSD protocol definition [1]. The 1441 standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and 1442 ALLDATA. To provide a minimum level of security, allowing verification 1443 and enforcement of the server's access control policy using the layout 1444 security credentials, the NOSEC security method MUST NOT be used for 1445 any I/O operation. The remainder of this section gives an overview 1446 of the security mechanism described in that standard. The goal is to 1447 give the reader a basic understanding of the object security model. 1448 Any discrepancies between this text and the actual standard are 1449 to be resolved in favor of the OSD standard. 1451 13.1. OSD Security Data Types 1453 There are three main data types associated with object security: a 1454 capability, a credential, and security parameters. The capability is 1455 a set of fields that specifies an object and what operations can be 1456 performed on it. A credential is a signed capability. Only a 1457 security manager that knows the secret device keys can correctly sign 1458 a capability to form a valid credential.
In pNFS, the file server 1459 acts as the security manager and returns signed capabilities (i.e., 1460 credentials) to the pNFS client. The security parameters are values 1461 computed by the issuer of OSD commands (i.e., the client) that prove 1462 they hold valid credentials. The client uses the credential as a 1463 signing key to sign the requests it makes to OSD, and puts the 1464 resulting signatures into the security_parameters field of the OSD 1465 command. The object storage device uses the secret keys it shares 1466 with the security manager to validate the signature values in the 1467 security parameters. 1469 The security types are opaque to the generic layers of the pNFS 1470 client. The credential contents are defined as opaque within the 1471 pnfs_osd_object_cred4 type. Instead of repeating the definitions 1472 here, the reader is referred to Section 4.9.2.2 of the OSD standard. 1474 13.2. The OSD Security Protocol 1476 The object storage protocol relies on a cryptographically secure 1477 capability to control accesses at the object storage devices. 1478 Capabilities are generated by the metadata server, returned to the 1479 client, and used by the client as described below to authenticate 1480 their requests to the object-based storage device. Capabilities 1481 therefore achieve the required access and open mode checking. They 1482 allow the file server to define and check a policy (e.g., open mode) 1483 and the OSD to enforce that policy without knowing the details (e.g., 1484 user IDs and ACLs). 1486 Since capabilities are tied to layouts, and since they are used to 1487 enforce access control, when the file ACL or mode changes the 1488 outstanding capabilities MUST be revoked to enforce the new access 1489 permissions. The server SHOULD recall layouts to allow clients to 1490 gracefully return their capabilities before the access permissions 1491 change. 
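The signing and validation flow described above can be sketched in runnable form. This is an illustrative model only: HMAC-SHA256 stands in for the integrity-check algorithm, and the function names and field encodings are hypothetical; the actual algorithm, key hierarchy, and encodings are defined by the T10 OSD standard.

```python
# Sketch of the three roles above: the metadata server (security
# manager) mints a capability key, the client signs each request with
# it, and the OSD validates using the SecretKey it shares with the
# security manager.  HMAC-SHA256 is a stand-in for the real algorithm.

import hashlib
import hmac

def mac(key: bytes, *fields: bytes) -> bytes:
    """Keyed integrity check over the concatenated fields."""
    return hmac.new(key, b"".join(fields), hashlib.sha256).digest()

def issue_credential(secret_key: bytes, cap: bytes, system_id: bytes) -> bytes:
    """Security manager: compute the CapKey returned via LAYOUTGET."""
    return mac(secret_key, cap, system_id)

def sign_request(cap_key: bytes, req: bytes, nonce: bytes) -> bytes:
    """Client: compute the ReqMAC placed in the security parameters."""
    return mac(cap_key, req, nonce)

def osd_validate(secret_key: bytes, cap: bytes, system_id: bytes,
                 req: bytes, nonce: bytes, req_mac: bytes) -> bool:
    """OSD: recompute the CapKey and ReqMAC and compare."""
    local_cap_key = mac(secret_key, cap, system_id)
    local_req_mac = mac(local_cap_key, req, nonce)
    return hmac.compare_digest(req_mac, local_req_mac)
```

Because the client holds only the CapKey and never the SecretKey, it can sign requests covered by its capability but cannot mint credentials for other objects or operations.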
1493 Each capability is specific to a particular object, an operation on 1494 that object, a byte range within the object (in OSDv2), and has an 1495 explicit expiration time. The capabilities are signed with a secret 1496 key that is shared by the object storage devices and the metadata 1497 managers. Clients do not have device keys so they are unable to 1498 forge the signatures in the security parameters. The combination of 1499 a capability, the OSD System ID, and a signature is called a 1500 "credential" in the OSD specification. 1502 The details of the security and privacy model for object storage are 1503 defined in the T10 OSD standard. The following sketch of the 1504 algorithm should help the reader understand the basic model. 1506 LAYOUTGET returns a CapKey and a Cap, which, together with the OSD 1507 SystemID, are also called a credential. It is a capability and a 1508 signature over that capability and the SystemID. The OSD Standard 1509 refers to the CapKey as the "Credential integrity check value" and to 1510 the ReqMAC as the "Request integrity check value". 1512 CapKey = MAC<SecretKey>(Cap, SystemID) 1513 Credential = {Cap, SystemID, CapKey} 1515 The client uses CapKey to sign all the requests it issues for that 1516 object using the respective Cap. In other words, the Cap appears in 1517 the request to the storage device, and that request is signed with 1518 the CapKey as follows: 1520 ReqMAC = MAC<CapKey>(Req, ReqNonce) 1521 Request = {Cap, Req, ReqNonce, ReqMAC} 1523 The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}. The 1524 OSD uses the SecretKey it shares with the metadata server to compare 1525 the ReqMAC the client sent with a locally computed value: 1527 LocalCapKey = MAC<SecretKey>(Cap, SystemID) 1528 LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce) 1530 and if they match, the OSD assumes that the capabilities came from an 1531 authentic metadata server and allows access to the object, as allowed 1532 by the Cap. 1534 13.3.
Protocol Privacy Requirements 1536 Note that if the server's LAYOUTGET reply, holding the CapKey and Cap, is 1537 snooped by another client, it can be used to generate valid OSD 1538 requests (within the Cap's access restrictions). 1540 To meet the privacy requirements for the capability key 1541 returned by LAYOUTGET, the GSS-API [7] framework can be used, e.g., 1542 by using the RPCSEC_GSS privacy method to send the LAYOUTGET 1543 operation or by using the SSV key to encrypt the oc_capability_key 1544 using the GSS_Wrap() function. Two general ways to provide privacy 1545 in the absence of GSS-API that are independent of NFSv4 are either an 1546 isolated network such as a VLAN or a secure channel provided by IPsec 1547 [15]. 1549 13.4. Revoking Capabilities 1551 At any time, the metadata server may invalidate all outstanding 1552 capabilities on an object by changing its POLICY ACCESS TAG 1553 attribute. The value of the POLICY ACCESS TAG is part of a 1554 capability, and it must match the state of the object attribute. If 1555 they do not match, the OSD rejects accesses to the object with the 1556 sense key set to ILLEGAL REQUEST and an additional sense code set to 1557 INVALID FIELD IN CDB. When a client attempts to use a capability and 1558 is rejected this way, it should issue a LAYOUTCOMMIT for the object 1559 and specify PNFS_OSD_BAD_CRED in the olr_ioerr_report parameter. The 1560 client may elect to issue a compound LAYOUTRETURN/LAYOUTGET (or 1561 LAYOUTCOMMIT/LAYOUTRETURN/LAYOUTGET) to attempt to fetch a refreshed 1562 set of capabilities. 1564 The metadata server may elect to change the access policy tag on an 1565 object at any time, for any reason (with the understanding that there 1566 is likely an associated performance penalty, especially if there are 1567 outstanding layouts for this object).
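The revocation mechanism above can be modeled as follows. This is an illustrative sketch, not protocol text: the class and function names are hypothetical, and the integer tag merely stands in for the OSD attribute.

```python
# Hypothetical model of POLICY ACCESS TAG revocation as described
# above: the OSD honors a capability only while the tag embedded in
# the capability matches the object's current POLICY ACCESS TAG.

class OsdObject:
    def __init__(self, policy_access_tag: int):
        self.policy_access_tag = policy_access_tag

def change_policy_tag(obj: OsdObject) -> None:
    # Any change to the attribute implicitly revokes every outstanding
    # capability minted against the old tag value.
    obj.policy_access_tag += 1

def osd_check(obj: OsdObject, cap_tag: int) -> bool:
    # A mismatch corresponds to the OSD rejecting the access with sense
    # key ILLEGAL REQUEST and additional sense code INVALID FIELD IN CDB.
    return cap_tag == obj.policy_access_tag
```

A single attribute change thus invalidates all capabilities at once; clients holding stale capabilities must go back to the metadata server for fresh ones.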
The metadata server MUST 1568 revoke outstanding capabilities when any one of the following occurs: 1570 o the permissions on the object change, 1572 o a conflicting mandatory byte-range lock is granted, or 1574 o a layout is revoked and reassigned to another client. 1576 A pNFS client will typically hold one layout for each byte range for 1577 either READ or READ/WRITE. The client's credentials are checked by 1578 the metadata server at LAYOUTGET time and it is the client's 1579 responsibility to enforce access control among multiple users 1580 accessing the same file. It is neither required nor expected that 1581 the pNFS client will obtain a separate layout for each user accessing 1582 a shared object. The client SHOULD use OPEN and ACCESS calls to 1583 check user permissions when performing I/O so that the server's 1584 access control policies are correctly enforced. The result of the 1585 ACCESS operation may be cached while the client holds a valid layout 1586 as the server is expected to recall layouts when the file's access 1587 permissions or ACL change. 1589 14. IANA Considerations 1591 As described in NFSv4.1 [4], new layout type numbers have been 1592 assigned by IANA. This document defines the protocol associated with 1593 the existing layout type number, LAYOUT4_OSD2_OBJECTS, and it 1594 requires no further actions for IANA. 1596 15. References 1598 15.1. Normative References 1600 [1] Weber, R., "Information Technology - SCSI Object-Based 1601 Storage Device Commands (OSD)", ANSI INCITS 400-2004, 1602 December 2004. 1604 [2] Bradner, S., "Key words for use in RFCs to Indicate 1605 Requirement Levels", BCP 14, RFC 2119, March 1997. 1607 [3] IETF Trust, "Legal Provisions Relating to IETF Documents", 1608 November 2008, . 1611 [4] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed., 1612 "Network File System (NFS) Version 4 Minor Version 1 1613 Protocol", RFC 5661, January 2010. 1615 [5] Shepler, S., Ed., Eisler, M., Ed., and D. 
Noveck, Ed., 1616 "Network File System (NFS) Version 4 Minor Version 1 1617 External Data Representation Standard (XDR) Description", 1618 RFC 5662, January 2010. 1620 [6] Eisler, M., "XDR: External Data Representation Standard", 1621 STD 67, RFC 4506, May 2006. 1623 [7] Linn, J., "Generic Security Service Application Program 1624 Interface Version 2, Update 1", RFC 2743, January 2000. 1626 [8] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., 1627 and E. Zeidner, "Internet Small Computer Systems Interface 1628 (iSCSI)", RFC 3720, April 2004. 1630 [9] Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network 1631 Address Authority (NAA) Naming Format for iSCSI Node 1632 Names", RFC 3980, February 2005. 1634 [10] Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI 1635 INCITS 408-2005, October 2005. 1637 [11] Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI 1638 INCITS 402-2005, February 2005. 1640 [12] Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and 1641 J. Souza, "Internet Storage Name Service (iSNS)", RFC 1642 4171, September 2005. 1644 [13] IEEE, "Guidelines for 64-bit Global Identifier (EUI-64) 1645 Registration Authority", . 1648 15.2. Informative References 1650 [14] Weber, R., "SCSI Object-Based Storage Device Commands -2 1651 (OSD-2)", January 2009, 1652 . 1654 [15] Kent, S. and K. Seo, "Security Architecture for the 1655 Internet Protocol", RFC 4301, December 2005. 1657 [16] MacWilliams, F. and N. Sloane, "The Theory of Error- 1658 Correcting Codes, Part I", 1977. 1660 [17] Anvin, H., "The Mathematics of RAID-6", May 2009, 1661 . 1663 [18] Wikipedia, "Least common 1664 multiple", April 2011, 1665 . 1667 [19] Plank, J., Luo, J., Schuman, C., 1668 Xu, L., and Z. Wilcox-O'Hearn, "A 1669 Performance Evaluation and Examination of Open-source 1670 Erasure Coding Libraries for Storage", 2007.
1672 [20] T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 1673 365-2002, December 2002. 1675 [21] T11 1619-D, "Fibre Channel Framing and Signaling - 2 (FC- 1676 FS-2)", ANSI INCITS 424-2007, February 2007. 1678 [22] T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI 1679 INCITS 417-2006, June 2006. 1681 Appendix A. Acknowledgments 1683 Todd Pisek was a co-editor of the initial versions of this document. 1684 Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner, Brian 1685 E. Carpenter, Jari Arkko, David Black, and Jason Glasgow reviewed and 1686 commented on this document. 1688 Authors' Addresses 1690 Benny Halevy 1691 Primary Data 1693 Email: bhalevy@primarydata.com 1694 URI: http://www.primarydata.com/ 1696 Boaz Harrosh 1697 Panasas, Inc. 1698 1501 Reedsdale St. Suite 400 1699 Pittsburgh, PA 15233 1700 USA 1702 Phone: +1-412-323-3500 1703 Email: bharrosh@panasas.com 1704 URI: http://www.panasas.com/ 1705 Brent Welch 1706 Panasas, Inc. 1707 969 W. Maude Ave 1708 Sunnyvale, CA 94095 1709 USA 1711 Phone: +1-408-215-6715 1712 Email: welch@acm.org 1713 URI: http://www.panasas.com/ 1715 Brian Mueller 1716 Panasas, Inc. 1718 Email: bmueller@panasas.com 1719 URI: http://www.panasas.com/