2 NFSv4 B. Halevy
3 Internet-Draft B. Welch
4 Intended status: Standards Track J. Zelenka
5 Expires: June 18, 2009 Panasas
6 December 15, 2008

8 Object-based pNFS Operations
9 draft-ietf-nfsv4-pnfs-obj-12

11 Status of this Memo

13 This Internet-Draft is submitted to IETF in full conformance with the
14 provisions of BCP 78 and BCP 79.

16 Internet-Drafts are working documents of the Internet Engineering
17 Task Force (IETF), its areas, and its working groups.  Note that
18 other groups may also distribute working documents as Internet-
19 Drafts.
21 Internet-Drafts are draft documents valid for a maximum of six months
22 and may be updated, replaced, or obsoleted by other documents at any
23 time.  It is inappropriate to use Internet-Drafts as reference
24 material or to cite them other than as "work in progress."

26 The list of current Internet-Drafts can be accessed at
27 http://www.ietf.org/ietf/1id-abstracts.txt.

29 The list of Internet-Draft Shadow Directories can be accessed at
30 http://www.ietf.org/shadow.html.

32 This Internet-Draft will expire on June 18, 2009.

34 Copyright Notice

36 Copyright (c) 2008 IETF Trust and the persons identified as the
37 document authors.  All rights reserved.

39 This document is subject to BCP 78 and the IETF Trust's Legal
40 Provisions Relating to IETF Documents
41 (http://trustee.ietf.org/license-info) in effect on the date of
42 publication of this document.  Please review these documents
43 carefully, as they describe your rights and restrictions with respect
44 to this document.

46 Abstract

48 Parallel NFS (pNFS) extends NFSv4 to allow clients to directly access
49 file data on the storage used by the NFSv4 server.  This ability to
50 bypass the server for data access can increase both performance and
51 parallelism, but requires additional client functionality for data
52 access, some of which is dependent on the class of storage used,
53 a.k.a. the Layout Type.  The main pNFS operations and data types in
54 NFSv4 Minor Version 1 specify a layout-type-independent layer;
55 layout-type-specific information is conveyed using opaque data
56 structures whose internal structure is further defined by the
57 particular layout type specification.  This document specifies the
58 NFSv4.1 Object-based pNFS Layout Type as a companion to the main
59 NFSv4 Minor Version 1 specification.

61 Requirements Language

63 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
64 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
65 document are to be interpreted as described in RFC 2119 [1].

67 Table of Contents

69 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5
70 2. XDR Description of the Objects-Based Layout Protocol . . . . . 5
71 2.1. Code Components Licensing Notice . . . . . . . . . . . . . 6
72 3. Basic Data Type Definitions . . . . . . . . . . . . . . . . . 7
73 3.1. pnfs_osd_objid4 . . . . . . . . . . . . . . . . . . . . . 7
74 3.2. pnfs_osd_version4 . . . . . . . . . . . . . . . . . . . . 8
75 3.3. pnfs_osd_object_cred4 . . . . . . . . . . . . . . . . . . 8
76 3.4. pnfs_osd_raid_algorithm4 . . . . . . . . . . . . . . . . . 9
77 4. Object Storage Device Addressing and Discovery . . . . . . . . 10
78 4.1. pnfs_osd_targetid_type4 . . . . . . . . . . . . . . . . . 11
79 4.2. pnfs_osd_deviceaddr4 . . . . . . . . . . . . . . . . . . . 11
80 4.2.1. SCSI Target Identifier . . . . . . . . . . . . . . . . 12
81 4.2.2. Device Network Address . . . . . . . . . . . . . . . . 13
82 5. Object-Based Layout . . . . . . . . . . . . . . . . . . . . . 13
83 5.1. pnfs_osd_data_map4 . . . . . . . . . . . . . . . . . . . . 14
84 5.2. pnfs_osd_layout4 . . . . . . . . . . . . . . . . . . . . . 15
85 5.3. Data Mapping Schemes . . . . . . . . . . . . . . . . . . . 16
86 5.3.1. Simple Striping . . . . . . . . . . . . . . . . . . . 16
87 5.3.2. Nested Striping . . . . . . . . . . . . . . . . . . . 17
88 5.3.3. Mirroring . . . . . . . . . . . . . . . . . . . . . . 18
89 5.4. RAID Algorithms . . . . . . . . . . . . . . . . . . . . . 19
90 5.4.1. PNFS_OSD_RAID_0 . . . . . . . . . . .
. . . . . . . . 19 91 5.4.2. PNFS_OSD_RAID_4 . . . . . . . . . . . . . . . . . . . 19 92 5.4.3. PNFS_OSD_RAID_5 . . . . . . . . . . . . . . . . . . . 20 93 5.4.4. PNFS_OSD_RAID_PQ . . . . . . . . . . . . . . . . . . . 20 94 5.4.5. RAID Usage and Implementation Notes . . . . . . . . . 21 95 6. Object-Based Layout Update . . . . . . . . . . . . . . . . . . 21 96 6.1. pnfs_osd_deltaspaceused4 . . . . . . . . . . . . . . . . . 21 97 6.2. pnfs_osd_layoutupdate4 . . . . . . . . . . . . . . . . . . 22 98 7. Recovering from Client I/O Errors . . . . . . . . . . . . . . 22 99 8. Object-Based Layout Return . . . . . . . . . . . . . . . . . . 23 100 8.1. pnfs_osd_errno4 . . . . . . . . . . . . . . . . . . . . . 24 101 8.2. pnfs_osd_ioerr4 . . . . . . . . . . . . . . . . . . . . . 25 102 8.3. pnfs_osd_layoutreturn4 . . . . . . . . . . . . . . . . . . 26 103 9. Object-Based Creation Layout Hint . . . . . . . . . . . . . . 26 104 9.1. pnfs_osd_layouthint4 . . . . . . . . . . . . . . . . . . . 26 105 10. Layout Segments . . . . . . . . . . . . . . . . . . . . . . . 28 106 10.1. CB_LAYOUTRECALL and LAYOUTRETURN . . . . . . . . . . . . . 28 107 10.2. LAYOUTCOMMIT . . . . . . . . . . . . . . . . . . . . . . . 29 108 11. Recalling Layouts . . . . . . . . . . . . . . . . . . . . . . 29 109 11.1. CB_RECALL_ANY . . . . . . . . . . . . . . . . . . . . . . 29 110 12. Client Fencing . . . . . . . . . . . . . . . . . . . . . . . . 30 111 13. Security Considerations . . . . . . . . . . . . . . . . . . . 30 112 13.1. OSD Security Data Types . . . . . . . . . . . . . . . . . 31 113 13.2. The OSD Security Protocol . . . . . . . . . . . . . . . . 32 114 13.3. Protocol Privacy Requirements . . . . . . . . . . . . . . 33 115 13.4. Revoking Capabilities . . . . . . . . . . . . . . . . . . 33 116 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34 117 15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 34 118 15.1. Normative References . . . . . . . . . . . . . . . . . . . 34 119 15.2. Informative References . . . . . . . . . . . . . . . . . . 35 120 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 36 121 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 36 123 1. Introduction 125 In pNFS, the file server returns typed layout structures that 126 describe where file data is located. There are different layouts for 127 different storage systems and methods of arranging data on storage 128 devices. This document describes the layouts used with object-based 129 storage devices (OSD) that are accessed according to the OSD storage 130 protocol standard (ANSI INCITS 400-2004 [2]). 132 An "object" is a container for data and attributes, and files are 133 stored in one or more objects. The OSD protocol specifies several 134 operations on objects, including READ, WRITE, FLUSH, GET ATTRIBUTES, 135 SET ATTRIBUTES, CREATE and DELETE. However, using the object-based 136 layout the client only uses the READ, WRITE, GET ATTRIBUTES and FLUSH 137 commands. The other commands are only used by the pNFS server. 139 An object-based layout for pNFS includes object identifiers, 140 capabilities that allow clients to READ or WRITE those objects, and 141 various parameters that control how file data is striped across their 142 component objects. The OSD protocol has a capability-based security 143 scheme that allows the pNFS server to control what operations and 144 what objects can be used by clients. 
This scheme is described in 145 more detail in the Security Considerations section (Section 13). 147 2. XDR Description of the Objects-Based Layout Protocol 149 This document contains the external data representation (XDR [3]) 150 description of the NFSv4.1 objects layout protocol. The XDR 151 description is embedded in this document in a way that makes it 152 simple for the reader to extract into a ready to compile form. The 153 reader can feed this document into the following shell script to 154 produce the machine readable XDR description of the NFSv4.1 objects 155 layout protocol: 157 #!/bin/sh 158 grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' 160 I.e. if the above script is stored in a file called "extract.sh", and 161 this document is in a file called "spec.txt", then the reader can do: 163 sh extract.sh < spec.txt > pnfs_osd_prot.x 165 The effect of the script is to remove leading white space from each 166 line, plus a sentinel sequence of "///". 168 The embedded XDR file header follows. Subsequent XDR descriptions, 169 with the sentinel sequence are embedded throughout the document. 171 Note that the XDR code contained in this document depends on types 172 from the NFSv4.1 nfs4_prot.x file ([4]). This includes both nfs 173 types that end with a 4, such as offset4, length4, etc, as well as 174 more generic types such as uint32_t and uint64_t. 176 2.1. Code Components Licensing Notice 178 The XDR description, marked with lines beginning with the sequence 179 "///", as well as scripts for extracting the XDR description are Code 180 Components as described in Section 4 of "Legal Provisions Relating to 181 IETF Documents" [5]. These Code Components are licensed according to 182 the terms of Section 4 of "Legal Provisions Relating to IETF 183 Documents". 185 /// /* 186 /// * Copyright (c) 2008 IETF Trust and the persons identified 187 /// * as the document authors. All rights reserved. 188 /// * 189 /// * Redistribution and use in source and binary forms, with 190 /// * or without modification, are permitted provided that the 191 /// * following conditions are met: 192 /// * 193 /// * o Redistributions of source code must retain the above 194 /// * copyright notice, this list of conditions and the 195 /// * following disclaimer. 196 /// * 197 /// * o Redistributions in binary form must reproduce the above 198 /// * copyright notice, this list of conditions and the 199 /// * following disclaimer in the documentation and/or other 200 /// * materials provided with the distribution. 201 /// * 202 /// * o Neither the name of Internet Society, IETF or IETF 203 /// * Trust, nor the names of specific contributors, may be 204 /// * used to endorse or promote products derived from this 205 /// * software without specific prior written permission. 206 /// * 207 /// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS 208 /// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED 209 /// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 210 /// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 211 /// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO 212 /// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE 213 /// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 214 /// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 215 /// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 216 /// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 217 /// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 218 /// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 219 /// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING 220 /// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 221 /// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 222 /// * 223 /// * This code was derived from IETF RFC &rfc.number. 224 [[RFC Editor: please insert RFC number if needed]] 225 /// * Please reproduce this note if possible. 226 /// */ 227 /// 228 /// /* 229 /// * pnfs_osd_prot.x 230 /// */ 231 /// 232 /// %#include 233 /// 235 3. Basic Data Type Definitions 237 The following sections define basic data types and constants used by 238 the Object-Based Layout protocol. 240 3.1. pnfs_osd_objid4 242 An object is identified by a number, somewhat like an inode number. 243 The object storage model has a two level scheme, where the objects 244 within an object storage device are grouped into partitions. 246 /// struct pnfs_osd_objid4 { 247 /// deviceid4 oid_device_id; 248 /// uint64_t oid_partition_id; 249 /// uint64_t oid_object_id; 250 /// }; 251 /// 253 The pnfs_osd_objid4 type is used to identify an object within a 254 partition on a specified object storage device. "oid_device_id" 255 selects the object storage device from the set of available storage 256 devices. The device is identified with the deviceid4 type, which is 257 an index into addressing information about that device returned by 258 the GETDEVICELIST and GETDEVICEINFO operations. The deviceid4 data 259 type is defined in NFSv4.1 [6]. Within an OSD, a partition is 260 identified with a 64-bit number, "oid_partition_id". Within a 261 partition, an object is identified with a 64-bit number, 262 "oid_object_id". Creation and management of partitions is outside 263 the scope of this standard, and is a facility provided by the object 264 storage file system. 266 3.2. pnfs_osd_version4 268 /// enum pnfs_osd_version4 { 269 /// PNFS_OSD_MISSING = 0, 270 /// PNFS_OSD_VERSION_1 = 1, 271 /// PNFS_OSD_VERSION_2 = 2 272 /// }; 273 /// 275 pnfs_osd_version4 is used to indicate the OSD protocol version or 276 whether an object is missing (i.e., unavailable). Some of the 277 object-based layout supported raid algorithms encode redundant 278 information and can compensate for missing components, but the data 279 placement algorithm needs to know what parts are missing. 281 At this time the OSD standard is at version 1.0, and we anticipate a 282 version 2.0 of the standard ((SNIA T10/1729-D [14])). The second 283 generation OSD protocol has additional proposed features to support 284 more robust error recovery, snapshots, and byte-range capabilities. 285 Therefore, the OSD version is explicitly called out in the 286 information returned in the layout. (This information can also be 287 deduced by looking inside the capability type at the format field, 288 which is the first byte. The format value is 0x1 for an OSD v1 289 capability. However, it seems most robust to call out the version 290 explicitly.) 292 3.3. 
pnfs_osd_object_cred4 294 /// enum pnfs_osd_cap_key_sec4 { 295 /// PNFS_OSD_CAP_KEY_SEC_NONE = 0, 296 /// PNFS_OSD_CAP_KEY_SEC_SSV = 1 297 /// }; 298 /// 299 /// struct pnfs_osd_object_cred4 { 300 /// pnfs_osd_objid4 oc_object_id; 301 /// pnfs_osd_version4 oc_osd_version; 302 /// pnfs_osd_cap_key_sec4 oc_cap_key_sec; 303 /// opaque oc_capability_key<>; 304 /// opaque oc_capability<>; 305 /// }; 306 /// 308 The pnfs_osd_object_cred4 structure is used to identify each 309 component comprising the file. The "oc_object_id" identifies the 310 component object, the "oc_osd_version" represents the osd protocol 311 version, or whether that component is unavailable, and the 312 "oc_capability" and "oc_capability_key", along with the 313 "oda_systemid" from the pnfs_osd_deviceaddr4, provide the OSD 314 security credentials needed to access that object. The 315 "oc_cap_key_sec" value denotes the method used to secure the 316 oc_capability_key (see Section 13.1 for more details). 318 To comply with the OSD security requirements the capability key 319 SHOULD be transferred securely to prevent eavesdropping (see 320 Section 13). Therefore, a client SHOULD either issue the LAYOUTGET 321 or GETDEVICEINFO operations via RPCSEC_GSS with the privacy service 322 or to previously establish an SSV for the sessions via the NFSv4.1 323 SET_SSV operation. The pnfs_osd_cap_key_sec4 type is used to 324 identify the method used by the server to secure the capability key. 326 o PNFS_OSD_CAP_KEY_SEC_NONE denotes that the oc_capability_key is 327 not encrypted in which case the client SHOULD issue the LAYOUTGET 328 or GETDEVICEINFO operations with RPCSEC_GSS with the privacy 329 service or the NFSv4.1 transport should be secured by using 330 methods that are external to NFSv4.1 like the use of IPsec [15] 331 for transporting the NFSV4.1 protocol. 333 o PNFS_OSD_CAP_KEY_SEC_SSV denotes that the oc_capability_key 334 contents are encrypted using the SSV GSS context and the 335 capability key as inputs to the GSS_Wrap() function (see GSS-API 336 [7]) with the conf_req_flag set to TRUE. The client MUST use the 337 secret SSV key as part of the client's GSS context to decrypt the 338 capability key using the value of the oc_capability_key field as 339 the input_message to the GSS_unwrap() function. Note that to 340 prevent eavesdropping of the SSV key the client SHOULD issue 341 SET_SSV via RPCSEC_GSS with the privacy service. 343 The actual method chosen depends on whether the client established a 344 SSV key with the server and whether it issued the operation with the 345 RPCSEC_GSS privacy method. Naturally, if the client did not 346 establish a SSV key via SET_SSV the server MUST use the 347 PNFS_OSD_CAP_KEY_SEC_NONE method. Otherwise, if the operation was 348 not issued with the RPCSEC_GSS privacy method the server SHOULD 349 secure the oc_capability_key with the PNFS_OSD_CAP_KEY_SEC_SSV 350 method. The server MAY use the PNFS_OSD_CAP_KEY_SEC_SSV method also 351 when the operation was issued with the RPCSEC_GSS privacy method. 353 3.4. pnfs_osd_raid_algorithm4 355 /// enum pnfs_osd_raid_algorithm4 { 356 /// PNFS_OSD_RAID_0 = 1, 357 /// PNFS_OSD_RAID_4 = 2, 358 /// PNFS_OSD_RAID_5 = 3, 359 /// PNFS_OSD_RAID_PQ = 4 /* Reed-Solomon P+Q */ 360 /// }; 361 /// 362 pnfs_osd_raid_algorithm4 represents the data redundancy algorithm 363 used to protect the file's contents. See Section 5.4 for more 364 details. 366 4. 
Object Storage Device Addressing and Discovery 368 Data operations to an OSD require the client to know the "address" of 369 each OSD's root object. The root object is synonymous with SCSI 370 logical unit. The client specifies SCSI logical units to its SCSI 371 protocol stack using a representation local to the client. Because 372 these representations are local, GETDEVICEINFO must return 373 information that can be used by the client to select the correct 374 local representation. 376 In the block world, a set offset (logical block number or track/ 377 sector) contains a disk label. This label identifies the disk 378 uniquely. In contrast, an OSD has a standard set of attributes on 379 its root object. For device identification purposes the OSD System 380 ID (root information attribute number 3) and the OSD Name (root 381 information attribute number 9) are used as the label. These appear 382 in the pnfs_osd_deviceaddr4 type below under the "oda_systemid" and 383 "oda_osdname" fields. 385 In some situations, SCSI target discovery may need to be driven based 386 on information contained in the GETDEVICEINFO response. One example 387 of this is iSCSI targets that are not known to the client until a 388 layout has been requested. The information provided as the 389 "oda_targetid", "oda_targetaddr", and "oda_lun" fields in the 390 pnfs_osd_deviceaddr4 type described below (see Section 4.2), allows 391 the client to probe a specific device given its network address and 392 optionally its iSCSI Name (see iSCSI [8]), or when the device network 393 address is omitted, to discover the object storage device using the 394 provided device name or SCSI device identifier (See SPC-3 [9].) 396 The oda_systemid is implicitly used by the client, by using the 397 object credential signing key to sign each request with the request 398 integrity check value. This method protects the client from 399 unintentionally accessing a device if the device address mapping was 400 changed (or revoked). The server computes the capability key using 401 its own view of the systemid associated with the respective deviceid 402 present in the credential. If the client's view of the deviceid 403 mapping is stale, the client will use the wrong systemid (which must 404 be system-wide unique) and the I/O request to the OSD will fail to 405 pass the integrity check verification. 407 To recover from this condition the client should report the error and 408 return the layout using LAYOUTRETURN, and invalidate all the device 409 address mappings associated with this layout. The client can then 410 ask for a new layout if it wishes using LAYOUTGET and resolve the 411 referenced deviceids using GETDEVICEINFO or GETDEVICELIST. 413 The server MUST provide the oda_systemid and SHOULD also provide the 414 oda_osdname. When the OSD name is present the client SHOULD get the 415 root information attributes whenever it establishes communication 416 with the OSD and verify that the OSD name it got from the OSD matches 417 the one sent by the metadata server. To do so, the client uses the 418 root_obj_cred credentials. 420 4.1. pnfs_osd_targetid_type4 422 The following enum specifies the manner in which a scsi target can be 423 specified. The target can be specified as an SCSI Name, or as a SCSI 424 Device Identifier. 426 /// enum pnfs_osd_targetid_type4 { 427 /// OBJ_TARGET_ANON = 1, 428 /// OBJ_TARGET_SCSI_NAME = 2, 429 /// OBJ_TARGET_SCSI_DEVICE_ID = 3 430 /// }; 431 /// 433 4.2. 
pnfs_osd_deviceaddr4

435 The specification for an object device address is as follows:

437 /// union pnfs_osd_targetid4 switch (pnfs_osd_targetid_type4 oti_type) {
438 ///     case OBJ_TARGET_SCSI_NAME:
439 ///         string              oti_scsi_name<>;
440 ///
441 ///     case OBJ_TARGET_SCSI_DEVICE_ID:
442 ///         opaque              oti_scsi_device_id<>;
443 ///
444 ///     default:
445 ///         void;
446 /// };
447 ///
448 /// union pnfs_osd_targetaddr4 switch (bool ota_available) {
449 ///     case TRUE:
450 ///         netaddr4            ota_netaddr;
451 ///     case FALSE:
452 ///         void;
453 /// };
454 ///
455 /// struct pnfs_osd_deviceaddr4 {
456 ///     pnfs_osd_targetid4      oda_targetid;
457 ///     pnfs_osd_targetaddr4    oda_targetaddr;
458 ///     opaque                  oda_lun[8];
459 ///     opaque                  oda_systemid<>;
460 ///     pnfs_osd_object_cred4   oda_root_obj_cred;
461 ///     opaque                  oda_osdname<>;
462 /// };
463 ///

465 4.2.1. SCSI Target Identifier

467 When "oda_targetid" is specified as an OBJ_TARGET_SCSI_NAME, the
468 "oti_scsi_name" string MUST be formatted as an "iSCSI Name" as
469 specified in iSCSI [8] and [10].  Note that the specification of the
470 oti_scsi_name string format is outside the scope of this document.
471 Parsing the string is based on the string prefix, e.g., "iqn.",
472 "eui.", or "naa.", and more formats MAY be specified in the future in
473 accordance with iSCSI Names properties.

475 Currently, the iSCSI Name provides for naming the target device using
476 a string formatted as an iSCSI Qualified Name (IQN) or as an EUI [11]
477 string.  Those are typically used to identify iSCSI or SRP [16]
478 devices.  The Network Address Authority (NAA) string format (see
479 [10]) provides for naming the device using globally unique
480 identifiers, as defined in FC-FS [17].  These are typically used to
481 identify Fibre Channel or SAS [18] (Serial Attached SCSI) devices, in
482 particular devices that are dual-attached both over Fibre Channel or
483 SAS and over iSCSI.

485 When "oda_targetid" is specified as an OBJ_TARGET_SCSI_DEVICE_ID, the
486 "oti_scsi_device_id" opaque field MUST be formatted as a SCSI Device
487 Identifier as defined in SPC-3 [9] VPD Page 83h (Section 7.6.3,
488 "Device Identification VPD Page").  If the Device Identifier is
489 identical to the OSD System ID, as given by oda_systemid, the server
490 SHOULD provide a zero-length oti_scsi_device_id opaque value.  Note
491 that, similarly to the "oti_scsi_name", the specification of the
492 oti_scsi_device_id opaque contents is outside the scope of this
493 document, and more formats MAY be specified in the future in
494 accordance with SPC-3.

496 The OBJ_TARGET_ANON pnfs_osd_targetid_type4 MAY be used to provide
497 no target identification.  In this case, only the OSD System ID and,
498 optionally, the provided network address are used to locate the
499 device.

501 4.2.2. Device Network Address

503 The optional "oda_targetaddr" field MAY be provided by the server as
504 a hint to accelerate device discovery over, e.g., the iSCSI transport
505 protocol.  The network address is given with the netaddr4 type, which
506 specifies a TCP/IP based endpoint (as specified in NFSv4.1 [6]).
507 When given, the client SHOULD use it to probe for the SCSI device at
508 the given network address.  The client MAY still use other discovery
509 mechanisms such as iSNS [12] to locate the device using the
510 oda_targetid.  In particular, such an external name service SHOULD be
511 used when the devices may be attached to the network using multiple
512 connections and/or multiple storage fabrics (e.g., Fibre-Channel and
513 iSCSI).
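As a non-normative illustration of the discovery choices described above, the
following C sketch shows how a client-side layout driver might select a
discovery strategy from the fields of a pnfs_osd_deviceaddr4-like structure.
The struct layout and the plan_discovery() helper are simplified, hypothetical
equivalents of the XDR types above and are not part of this protocol.

   #include <stdbool.h>
   #include <stdio.h>

   /* Simplified C equivalents of the XDR types above (illustration only). */
   enum targetid_type { OBJ_TARGET_ANON = 1, OBJ_TARGET_SCSI_NAME = 2,
                        OBJ_TARGET_SCSI_DEVICE_ID = 3 };

   struct deviceaddr {
       enum targetid_type oti_type;
       const char *oti_scsi_name;   /* e.g., an "iqn." / "eui." / "naa." name */
       bool ota_available;          /* is oda_targetaddr present?             */
       const char *ota_netaddr;     /* netaddr4 rendered as a string          */
   };

   /* Hypothetical decision helper: report which discovery path a client
    * could take for a device address returned by GETDEVICEINFO.          */
   static void plan_discovery(const struct deviceaddr *da)
   {
       if (da->ota_available)
           printf("probe the SCSI target directly at %s\n", da->ota_netaddr);
       else if (da->oti_type == OBJ_TARGET_SCSI_NAME)
           printf("look up \"%s\" via a name service such as iSNS\n",
                  da->oti_scsi_name);
       else if (da->oti_type == OBJ_TARGET_SCSI_DEVICE_ID)
           printf("match an already-discovered LU by its SCSI Device Identifier\n");
       else /* OBJ_TARGET_ANON */
           printf("match an already-discovered LU by oda_systemid only\n");
   }

   int main(void)
   {
       struct deviceaddr da = { OBJ_TARGET_SCSI_NAME,
                                "iqn.2001-04.com.example:osd-17", false, NULL };
       plan_discovery(&da);
       return 0;
   }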
515 The "oda_lun" field identifies the OSD 64-bit Logical Unit Number, 516 formatted in accordance with SAM-3 [13]. The client uses the Logical 517 Unit Number to communicate with the specific OSD Logical Unit. Its 518 use is defined in details by the SCSI transport protocol, e.g., iSCSI 519 [8]. 521 5. Object-Based Layout 523 The layout4 type is defined in the NFSv4.1 [6] as follows: 525 enum layouttype4 { 526 LAYOUT4_NFSV4_1_FILES = 1, 527 LAYOUT4_OSD2_OBJECTS = 2, 528 LAYOUT4_BLOCK_VOLUME = 3 529 }; 531 struct layout_content4 { 532 layouttype4 loc_type; 533 opaque loc_body<>; 534 }; 536 struct layout4 { 537 offset4 lo_offset; 538 length4 lo_length; 539 layoutiomode4 lo_iomode; 540 layout_content4 lo_content; 541 }; 543 This document defines structure associated with the layouttype4 544 value, LAYOUT4_OSD2_OBJECTS. The NFSv4.1 [6] specifies the loc_body 545 structure as an XDR type "opaque". The opaque layout is 546 uninterpreted by the generic pNFS client layers, but obviously must 547 be interpreted by the object-storage layout driver. This section 548 defines the structure of this opaque value, pnfs_osd_layout4. 550 5.1. pnfs_osd_data_map4 552 /// struct pnfs_osd_data_map4 { 553 /// uint32_t odm_num_comps; 554 /// length4 odm_stripe_unit; 555 /// uint32_t odm_group_width; 556 /// uint32_t odm_group_depth; 557 /// uint32_t odm_mirror_cnt; 558 /// pnfs_osd_raid_algorithm4 odm_raid_algorithm; 559 /// }; 560 /// 562 The pnfs_osd_data_map4 structure parameterizes the algorithm that 563 maps a file's contents over the component objects. Instead of 564 limiting the system to simple striping scheme where loss of a single 565 component object results in data loss, the map parameters support 566 mirroring and more complicated schemes that protect against loss of a 567 component object. 569 "odm_num_comps" is the number of component objects the file is 570 striped over. The server MAY grow the file by adding more components 571 to the stripe while clients hold valid layouts until the file has 572 reached its final stripe width. The file length in this case MUST be 573 limited to the number of bytes in a full stripe. 575 The "odm_stripe_unit" is the number of bytes placed on one component 576 before advancing to the next one in the list of components. The 577 number of bytes in a full stripe is odm_stripe_unit times the number 578 of components. In some raid schemes, a stripe includes redundant 579 information (i.e., parity) that lets the system recover from loss or 580 damage to a component object. 582 The "odm_group_width" and "odm_group_depth" parameters allow a nested 583 striping pattern (See Section 5.3.2 for details). If there is no 584 nesting, then odm_group_width and odm_group_depth MUST be zero. The 585 size of the components array MUST be a multiple of odm_group_width. 587 The "odm_mirror_cnt" is used to replicate a file by replicating its 588 component objects. If there is no mirroring, then odm_mirror_cnt 589 MUST be 0. If odm_mirror_cnt is greater than zero, then the size of 590 the component array MUST be a multiple of (odm_mirror_cnt+1). 592 See Section 5.3 for more details. 594 5.2. pnfs_osd_layout4 596 /// struct pnfs_osd_layout4 { 597 /// pnfs_osd_data_map4 olo_map; 598 /// uint32_t olo_comps_index; 599 /// pnfs_osd_object_cred4 olo_components<>; 600 /// }; 601 /// 603 The pnfs_osd_layout4 structure specifies a layout over a set of 604 component objects. 
The "olo_components" field is an array of object 605 identifiers and security credentials that grant access to each 606 object. The organization of the data is defined by the 607 pnfs_osd_data_map4 type that specifies how the file's data is mapped 608 onto the component objects (i.e., the striping pattern). The data 609 placement algorithm that maps file data onto component objects assume 610 that each component object occurs exactly once in the array of 611 components. Therefore, component objects MUST appear in the 612 olo_components array only once. The components array may represent 613 all objects comprising the file, in which case "olo_comps_index" is 614 set to zero and the number of entries in the olo_components array is 615 equal to olo_map.odm_num_comps. The server MAY return fewer 616 components than odm_num_comps, provided that the returned components 617 are sufficient to access any byte in the layout's data range (e.g., a 618 sub-stripe of "odm_group_width" components). In this case, 619 olo_comps_index represents the position of the returned components 620 array within the full array of components that comprise the file. 622 Note that the layout depends on the file size, which the client 623 learns from the generic return parameters of LAYOUTGET, by doing 624 GETATTR commands to the metadata server. The client uses the file 625 size to decide if it should fill holes with zeros, or return a short 626 read. Striping patterns can cause cases where component objects are 627 shorter than other components because a hole happens to correspond to 628 the last part of the component object. 630 5.3. Data Mapping Schemes 632 This section describes the different data mapping schemes in detail. 633 The object layout always uses a "dense" layout as described in 634 NFSv4.1 [6]. This means that the second stripe unit of the file 635 starts at offset 0 of the second component, rather than at offset 636 stripe_unit bytes. After a full stripe has been written, the next 637 stripe unit is appended to the first component object in the list 638 without any holes in the component objects. 640 5.3.1. Simple Striping 642 The mapping from the logical offset within a file (L) to the 643 component object C and object-specific offset O is defined by the 644 following equations: 646 L = logical offset into the file 647 W = total number of components 648 S = W * stripe_unit 649 N = L / S 650 C = (L-(N*S)) / stripe_unit 651 O = (N*stripe_unit)+(L%stripe_unit) 653 In these equations, S is the number of bytes in a full stripe, and N 654 is the stripe number. C is an index into the array of components, so 655 it selects a particular object storage device. Both N and C count 656 from zero. O is the offset within the object that corresponds to the 657 file offset. Note that this computation does not accommodate the 658 same object appearing in the olo_components array multiple times. 660 For example, consider an object striped over four devices, . The stripe_unit is 4096 bytes. The stripe width S is thus 4 * 662 4096 = 16384. 664 Offset 0: 665 N = 0 / 16384 = 0 666 C = 0-0/4096 = 0 (D0) 667 O = 0*4096 + (0%4096) = 0 669 Offset 4096: 670 N = 4096 / 16384 = 0 671 C = (4096-(0*16384)) / 4096 = 1 (D1) 672 O = (0*4096)+(4096%4096) = 0 674 Offset 9000: 675 N = 9000 / 16384 = 0 676 C = (9000-(0*16384)) / 4096 = 2 (D2) 677 O = (0*4096)+(9000%4096) = 808 679 Offset 132000: 680 N = 132000 / 16384 = 8 681 C = (132000-(8*16384)) / 4096 = 0 (D0) 682 O = (8*4096) + (132000%4096) = 33696 684 5.3.2. 
Nested Striping 686 The odm_group_width and odm_group_depth parameters allow a nested 687 striping pattern. odm_group_width defines the width of a data stripe 688 and odm_group_depth defines how many stripes are written before 689 advancing to the next group of components in the list of component 690 objects for the file. The math used to map from a file offset to a 691 component object and offset within that object is shown below. The 692 computations map from the logical offset L to the component index C 693 and offset relative O within that component object. 695 L = logical offset into the file 696 W = total number of components 697 S = stripe_unit * group_depth * W 698 T = stripe_unit * group_depth * group_width 699 U = stripe_unit * group_width 700 M = L / S 701 G = (L - (M * S)) / T 702 H = (L - (M * S)) % T 703 N = H / U 704 C = (H - (N * U)) / stripe_unit + G * group_width 705 O = L % stripe_unit + N * stripe_unit + M * group_depth * stripe_unit 707 In these equations, S is the number of bytes striped across all 708 component objects before the pattern repeats. T is the number of 709 bytes striped within a group of component objects before advancing to 710 the next group. U is the number of bytes in a stripe within a group. 711 M is the "major" (i.e., across all components) stripe number, and N 712 is the "minor" (i.e., across the group) stripe number. G counts the 713 groups from the beginning of the major stripe, and H is the byte 714 offset within the group. 716 For example, consider an object striped over 100 devices with a 717 group_width of 10, a group_depth of 50, and a stripe_unit of 1 MB. 718 In this scheme, 500 MB are written to the first 10 components, and 719 5000 MB is written before the pattern wraps back around to the first 720 component in the array. 722 Offset 0: 723 W = 100 724 S = 1 MB * 50 * 100 = 5000 MB 725 T = 1 MB * 50 * 10 = 500 MB 726 U = 1 MB * 10 = 10 MB 727 M = 0 / 5000 MB = 0 728 G = (0 - (0 * 5000 MB)) / 500 MB = 0 729 H = (0 - (0 * 5000 MB)) % 500 MB = 0 730 N = 0 / 10 MB = 0 731 C = (0 - (0 * 10 MB)) / 1 MB + 0 * 10 = 0 732 O = 0 % 1 MB + 0 * 1 MB + 0 * 50 * 1 MB = 0 734 Offset 27 MB: 735 M = 27 MB / 5000 MB = 0 736 G = (27 MB - (0 * 5000 MB)) / 500 MB = 0 737 H = (27 MB - (0 * 5000 MB)) % 500 MB = 27 MB 738 N = 27 MB / 10 MB = 2 739 C = (27 MB - (2 * 10 MB)) / 1 MB + 0 * 10 = 7 740 O = 27 MB % 1 MB + 2 * 1 MB + 0 * 50 * 1 MB = 2 MB 742 Offset 7232 MB: 743 M = 7232 MB / 5000 MB = 1 744 G = (7232 MB - (1 * 5000 MB)) / 500 MB = 4 745 H = (7232 MB - (1 * 5000 MB)) % 500 MB = 232 MB 746 N = 232 MB / 10 MB = 23 747 C = (232 MB - (23 * 10 MB)) / 1 MB + 4 * 10 = 42 748 O = 7232 MB % 1 MB + 23 * 1 MB + 1 * 50 * 1 MB = 73 MB 750 5.3.3. Mirroring 752 The odm_mirror_cnt is used to replicate a file by replicating its 753 component objects. If there is no mirroring, then odm_mirror_cnt 754 MUST be 0. If odm_mirror_cnt is greater than zero, then the size of 755 the olo_components array MUST be a multiple of (odm_mirror_cnt+1). 756 Thus, for a classic mirror on two objects, odm_mirror_cnt is one. 757 Note that mirroring can be defined over any raid algorithm and 758 striping pattern (either simple or nested). If odm_group_width is 759 also non-zero, then the size of the olo_components array MUST be a 760 multiple of odm_group_width * (odm_mirror_cnt+1). Replicas are 761 adjacent in the olo_components array, and the value C produced by the 762 above equations is not a direct index into the olo_components array. 
763 Instead, the following equations determine the replica component
764 index RCi, where i ranges from 0 to odm_mirror_cnt.

766 C = component index for striping or two-level striping
767 i ranges from 0 to odm_mirror_cnt, inclusive
768 RCi = C * (odm_mirror_cnt+1) + i

770 5.4. RAID Algorithms

772 pnfs_osd_raid_algorithm4 determines the algorithm and placement of
773 redundant data.  This section defines the different RAID algorithms.

775 5.4.1. PNFS_OSD_RAID_0

777 PNFS_OSD_RAID_0 means there is no parity data, so all bytes in the
778 component objects are data bytes located by the above equations for C
779 and O.  If a component object is marked as PNFS_OSD_MISSING and a
780 read of that component is attempted, the pNFS client MUST either
781 return an I/O error or, alternatively, retry the READ against the
782 pNFS server.

784 5.4.2. PNFS_OSD_RAID_4

786 PNFS_OSD_RAID_4 means that the last component object, or the last in
787 each group (if odm_group_width is greater than zero), contains parity
788 information computed over the rest of the stripe with an XOR
789 operation.  If a component object is unavailable, the client can read
790 the rest of the stripe units in the damaged stripe and recompute the
791 missing stripe unit by XORing the other stripe units in the stripe.
792 Alternatively, the client can replay the READ against the pNFS server,
793 which will presumably perform the reconstructed read on the client's behalf.

795 When parity is present in the file, there is an additional
796 computation to map from the file offset L to the offset that accounts
797 for embedded parity, L'.  First compute L', and then use L' in the
798 above equations for C and O.

800 L = file offset, not accounting for parity
801 P = number of parity devices in each stripe
802 W = group_width, if not zero, else size of olo_components array
803 N = L / ((W-P) * stripe_unit)
804 L' = N * (W * stripe_unit) +
805      (L % ((W-P) * stripe_unit))

807 5.4.3. PNFS_OSD_RAID_5

809 PNFS_OSD_RAID_5 means that the position of the parity data is rotated
810 on each stripe or each group (if odm_group_width is greater than
811 zero).  In the first stripe, the last component holds the parity.  In
812 the second stripe, the next-to-last component holds the parity, and
813 so on.  In this scheme, all stripe units are rotated so that I/O is
814 evenly spread across objects as the file is read sequentially.  The
815 rotated parity layout is illustrated here, with numbers indicating
816 the stripe unit.

818 0 1 2 P
819 4 5 P 3
820 8 P 6 7
821 P 9 a b

823 To compute the component object C, first compute the offset that
824 accounts for parity, L', and use that to compute C.  Then rotate C to
825 get C'.  Finally, increase C' by one if the parity information comes
826 at or before C' within that stripe.  The following equations
827 illustrate this by computing I, which is the index of the component
828 that contains parity for a given stripe.

830 L = file offset, not accounting for parity
831 W = odm_group_width, if not zero, else size of olo_components array
832 N = L / ((W-1) * stripe_unit)
833 (Compute L' as described above)
834 (Compute C based on L' as described above)
835 C' = (C - (N%W)) % W
836 I = W - (N%W) - 1
837 if (C' <= I) {
838     C'++
839 }

841 5.4.4. PNFS_OSD_RAID_PQ

843 PNFS_OSD_RAID_PQ is a double-parity scheme that uses the Reed-Solomon
844 P+Q encoding scheme [19].  In this layout, the last two component
845 objects hold the P and Q data, respectively.
P is parity computed 846 with XOR, and Q is a more complex equation that is not described 847 here. The equations given above for embedded parity can be used to 848 map a file offset to the correct component object by setting the 849 number of parity components to 2 instead of 1 for RAID4 or RAID5. 850 Clients may simply choose to read data through the metadata server if 851 two components are missing or damaged. 853 5.4.5. RAID Usage and Implementation Notes 855 RAID layouts with redundant data in their stripes require additional 856 serialization of updates to ensure correct operation. Otherwise, if 857 two clients simultaneously write to the same logical range of an 858 object, the result could include different data in the same ranges of 859 mirrored tuples, or corrupt parity information. It is the 860 responsibility of the metadata server to enforce serialization 861 requirements such as this. For example, the metadata server may do 862 so by not granting overlapping write layouts within mirrored objects. 864 6. Object-Based Layout Update 866 layoutupdate4 is used in the LAYOUTCOMMIT operation to convey updates 867 to the layout and additional information to the metadata server. It 868 is defined in the NFSv4.1 [6] as follows: 870 struct layoutupdate4 { 871 layouttype4 lou_type; 872 opaque lou_body<>; 873 }; 875 The layoutupdate4 type is an opaque value at the generic pNFS client 876 level. If the lou_type layout type is LAYOUT4_OSD2_OBJECTS, then the 877 lou_body opaque value is defined by the pnfs_osd_layoutupdate4 type. 879 Object-Based pNFS clients are not allowed to modify the layout. 880 Therefore, the information passed in pnfs_osd_layoutupdate4 is used 881 only to update the file's attributes. In addition to the generic 882 information the client can pass to the metadata server in 883 LAYOUTCOMMIT such as the highest offset the client wrote to and the 884 last time it modified the file, the client MAY use 885 pnfs_osd_layoutupdate4 to convey the capacity consumed (or released) 886 by writes using the layout, and to indicate that I/O errors were 887 encountered by such writes. 889 6.1. pnfs_osd_deltaspaceused4 891 /// union pnfs_osd_deltaspaceused4 switch (bool dsu_valid) { 892 /// case TRUE: 893 /// int64_t dsu_delta; 894 /// case FALSE: 895 /// void; 896 /// }; 897 /// 899 pnfs_osd_deltaspaceused4 is used to convey space utilization 900 information at the time of LAYOUTCOMMIT. For the file system to 901 properly maintain capacity used information, it needs to track how 902 much capacity was consumed by WRITE operations performed by the 903 client. In this protocol, the OSD returns the capacity consumed by a 904 write (*), which can be different than the number of bytes written 905 because of internal overhead like block-level allocation and indirect 906 blocks, and the client reflects this back to the pNFS server so it 907 can accurately track quota. The pNFS server can choose to trust this 908 information coming from the clients and therefore avoid querying the 909 OSDs at the time of LAYOUTCOMMIT. If the client is unable to obtain 910 this information from the OSD, it simply returns invalid 911 olu_delta_space_used. 913 6.2. pnfs_osd_layoutupdate4 915 /// struct pnfs_osd_layoutupdate4 { 916 /// pnfs_osd_deltaspaceused4 olu_delta_space_used; 917 /// bool olu_ioerr_flag; 918 /// }; 919 /// 921 "olu_delta_space_used" is used to convey capacity usage information 922 back to the metadata server. 
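To make the space-accounting flow of Section 6.1 concrete, here is a minimal,
non-normative C sketch of how a client might fold the per-WRITE space deltas
reported by the OSD into the pnfs_osd_layoutupdate4 payload for LAYOUTCOMMIT.
The struct layoutupdate mirror and the account_write() helper are hypothetical
and simplified, not part of this protocol.

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Simplified C mirror of pnfs_osd_layoutupdate4 (illustration only). */
   struct layoutupdate {
       bool    dsu_valid;      /* pnfs_osd_deltaspaceused4 arm selector  */
       int64_t dsu_delta;      /* bytes consumed (+) or released (-)     */
       bool    olu_ioerr_flag;
   };

   /* Hypothetical accumulator: fold the space delta reported by one OSD
    * WRITE into the pending LAYOUTCOMMIT payload.  "known" is false when
    * the OSD did not report usage for this write, which invalidates the
    * whole delta so the server can fall back to querying the OSDs.      */
   static void account_write(struct layoutupdate *lu, bool known, int64_t delta)
   {
       if (!known)
           lu->dsu_valid = false;
       else if (lu->dsu_valid)
           lu->dsu_delta += delta;
   }

   int main(void)
   {
       struct layoutupdate lu = { true, 0, false };
       account_write(&lu, true, 65536);   /* a write consumed 64 KB        */
       account_write(&lu, true, -4096);   /* an overwrite released a block */
       printf("dsu_valid=%d dsu_delta=%lld\n", lu.dsu_valid,
              (long long)lu.dsu_delta);
       return 0;
   }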
924 The "olu_ioerr_flag" is used when I/O errors were encountered while 925 writing the file. The client MUST report the errors using the 926 pnfs_osd_ioerr4 structure (See Section 8.1) at LAYOUTRETURN time. 928 If the client updated the file successfully before hitting the I/O 929 errors it MAY use LAYOUTCOMMIT to update the metadata server as 930 described above. Typically, in the error-free case, the server MAY 931 turn around and update the file's attributes on the storage devices. 932 However, if I/O errors were encountered the server better not attempt 933 to write the new attributes on the storage devices until it receives 934 the I/O error report, therefore the client MUST set the 935 olu_ioerr_flag to true. Note that in this case, the client SHOULD 936 send both the LAYOUTCOMMIT and LAYOUTRETURN operations in the same 937 COMPOUND RPC. 939 7. Recovering from Client I/O Errors 941 The pNFS client may encounter errors when directly accessing the 942 object storage devices. However, it is the responsibility of the 943 metadata server to handle the I/O errors. When the 944 LAYOUT4_OSD2_OBJECTS layout type is used, the client MUST report the 945 I/O errors to the server at LAYOUTRETURN time using the 946 pnfs_osd_ioerr4 structure (See Section 8.1). 948 The metadata server analyzes the error and determines the required 949 recovery operations such as repairing any parity inconsistencies, 950 recovering media failures, or reconstructing missing objects. 952 The metadata server SHOULD recall any outstanding layouts to allow it 953 exclusive write access to the stripes being recovered and to prevent 954 other clients from hitting the same error condition. In these cases, 955 the server MUST complete recovery before handing out any new layouts 956 to the affected byte ranges. 958 Although is it MAY be acceptable for the client to propagate a 959 corresponding error to the application that initiated the I/O 960 operation and drop any unwritten data, the client SHOULD attempt to 961 retry the original I/O operation by requesting a new layout using 962 LAYOUTGET and retry the I/O operation(s) using the new layout or the 963 client MAY just retry the I/O operation(s) using regular NFS READ or 964 WRITE operations via the metadata server. The client SHOULD attempt 965 to retrieve a new layout and retry the I/O operation using OSD 966 commands first and only if the error persists, retry the I/O 967 operation via the metadata server. 969 8. Object-Based Layout Return 971 layoutreturn_file4 is used in the LAYOUTRETURN operation to convey 972 layout-type specific information to the server. It is defined in the 973 NFSv4.1 [6] as follows: 975 struct layoutreturn_file4 { 976 offset4 lrf_offset; 977 length4 lrf_length; 978 stateid4 lrf_stateid; 979 /* layouttype4 specific data */ 980 opaque lrf_body<>; 981 }; 983 union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { 984 case LAYOUTRETURN4_FILE: 985 layoutreturn_file4 lr_layout; 986 default: 987 void; 988 }; 990 struct LAYOUTRETURN4args { 991 /* CURRENT_FH: file */ 992 bool lora_reclaim; 993 layoutreturn_stateid lora_recallstateid; 994 layouttype4 lora_layout_type; 995 layoutiomode4 lora_iomode; 996 layoutreturn4 lora_layoutreturn; 997 }; 999 If the lora_layout_type layout type is LAYOUT4_OSD2_OBJECTS, then the 1000 lrf_body opaque value is defined by the pnfs_osd_layoutreturn4 type. 1002 The pnfs_osd_layoutreturn4 type allows the client to report I/O error 1003 information back to the metadata server as defined below. 1005 8.1. 
pnfs_osd_errno4 1007 /// enum pnfs_osd_errno4 { 1008 /// PNFS_OSD_ERR_EIO = 1, 1009 /// PNFS_OSD_ERR_NOT_FOUND = 2, 1010 /// PNFS_OSD_ERR_NO_SPACE = 3, 1011 /// PNFS_OSD_ERR_BAD_CRED = 4, 1012 /// PNFS_OSD_ERR_NO_ACCESS = 5, 1013 /// PNFS_OSD_ERR_UNREACHABLE = 6, 1014 /// PNFS_OSD_ERR_RESOURCE = 7 1015 /// }; 1016 /// 1018 pnfs_osd_errno4 is used to represent error types when read/write 1019 errors are reported to the metadata server. The error codes serve as 1020 hints to the metadata server that may help it in diagnosing the exact 1021 reason for the error and in repairing it. 1023 o PNFS_OSD_ERR_EIO indicates the operation failed because the Object 1024 Storage Device experienced a failure trying to access the object. 1025 The most common source of these errors is media errors, but other 1026 internal errors might cause this as well. In this case, the 1027 metadata server should go examine the broken object more closely, 1028 hence it should be used as the default error code. 1030 o PNFS_OSD_ERR_NOT_FOUND indicates the object ID specifies an object 1031 that does not exist on the Object Storage Device. 1033 o PNFS_OSD_ERR_NO_SPACE indicates the operation failed because the 1034 Object Storage Device ran out of free capacity during the 1035 operation. 1037 o PNFS_OSD_ERR_BAD_CRED indicates the security parameters are not 1038 valid. The primary cause of this is that the capability has 1039 expired, or the access policy tag (a.k.a, capability version 1040 number) has been changed to revoke capabilities. The client will 1041 need to return the layout and get a new one with fresh 1042 capabilities. 1044 o PNFS_OSD_ERR_NO_ACCESS indicates the capability does not allow the 1045 requested operation. This should not occur in normal operation 1046 because the metadata server should give out correct capabilities, 1047 or none at all. 1049 o PNFS_OSD_ERR_UNREACHABLE indicates the client did not complete the 1050 I/O operation at the Object Storage Device due to a communication 1051 failure. Whether the I/O operation was executed by the OSD or not 1052 is undetermined. 1054 o PNFS_OSD_ERR_RESOURCE indicates the client did not issue the I/O 1055 operation due to a local problem on the initiator (i.e. client) 1056 side, e.g., when running out of memory. The client MUST guarantee 1057 that the OSD command was never dispatched to the OSD. 1059 8.2. pnfs_osd_ioerr4 1061 /// struct pnfs_osd_ioerr4 { 1062 /// pnfs_osd_objid4 oer_component; 1063 /// length4 oer_comp_offset; 1064 /// length4 oer_comp_length; 1065 /// bool oer_iswrite; 1066 /// pnfs_osd_errno4 oer_errno; 1067 /// }; 1068 /// 1070 The pnfs_osd_ioerr4 structure is used to return error indications for 1071 objects that generated errors during data transfers. These are hints 1072 to the metadata server that there are problems with that object. For 1073 each error, "oer_component", "oer_comp_offset", and "oer_comp_length" 1074 represent the object and byte range within the component object in 1075 which the error occurred, "oer_iswrite" is set to "true" if the 1076 failed OSD operation was data modifying, and "oer_errno" represents 1077 the type of error. 1079 Component byte ranges in the optional pnfs_osd_ioerr4 structure are 1080 used for recovering the object and MUST be set by the client to cover 1081 all failed I/O operations to the component. 1083 8.3. 
pnfs_osd_layoutreturn4 1085 /// struct pnfs_osd_layoutreturn4 { 1086 /// pnfs_osd_ioerr4 olr_ioerr_report<>; 1087 /// }; 1088 /// 1090 When OSD I/O operations failed, "olr_ioerr_report<>" is used to 1091 report these errors to the metadata server as an array of elements of 1092 type pnfs_osd_ioerr4. Each element in the array represents an error 1093 that occurred on the object specified by oer_component. If no errors 1094 are to be reported, the size of the olr_ioerr_report<> array is set 1095 to zero. 1097 9. Object-Based Creation Layout Hint 1099 The layouthint4 type is defined in the NFSv4.1 [6] as follows: 1101 struct layouthint4 { 1102 layouttype4 loh_type; 1103 opaque loh_body<>; 1104 }; 1106 The layouthint4 structure is used by the client to pass in a hint 1107 about the type of layout it would like created for a particular file. 1108 If the loh_type layout type is LAYOUT4_OSD2_OBJECTS, then the 1109 loh_body opaque value is defined by the pnfs_osd_layouthint4 type. 1111 9.1. pnfs_osd_layouthint4 1113 /// union pnfs_osd_max_comps_hint4 switch (bool omx_valid) { 1114 /// case TRUE: 1115 /// uint32_t omx_max_comps; 1116 /// case FALSE: 1117 /// void; 1118 /// }; 1119 /// 1120 /// union pnfs_osd_stripe_unit_hint4 switch (bool osu_valid) { 1121 /// case TRUE: 1122 /// length4 osu_stripe_unit; 1123 /// case FALSE: 1124 /// void; 1125 /// }; 1126 /// 1127 /// union pnfs_osd_group_width_hint4 switch (bool ogw_valid) { 1128 /// case TRUE: 1129 /// uint32_t ogw_group_width; 1130 /// case FALSE: 1131 /// void; 1132 /// }; 1133 /// 1134 /// union pnfs_osd_group_depth_hint4 switch (bool ogd_valid) { 1135 /// case TRUE: 1136 /// uint32_t ogd_group_depth; 1137 /// case FALSE: 1138 /// void; 1139 /// }; 1140 /// 1141 /// union pnfs_osd_mirror_cnt_hint4 switch (bool omc_valid) { 1142 /// case TRUE: 1143 /// uint32_t omc_mirror_cnt; 1144 /// case FALSE: 1145 /// void; 1146 /// }; 1147 /// 1148 /// union pnfs_osd_raid_algorithm_hint4 switch (bool ora_valid) { 1149 /// case TRUE: 1150 /// pnfs_osd_raid_algorithm4 ora_raid_algorithm; 1151 /// case FALSE: 1152 /// void; 1153 /// }; 1154 /// 1155 /// struct pnfs_osd_layouthint4 { 1156 /// pnfs_osd_max_comps_hint4 olh_max_comps_hint; 1157 /// pnfs_osd_stripe_unit_hint4 olh_stripe_unit_hint; 1158 /// pnfs_osd_group_width_hint4 olh_group_width_hint; 1159 /// pnfs_osd_group_depth_hint4 olh_group_depth_hint; 1160 /// pnfs_osd_mirror_cnt_hint4 olh_mirror_cnt_hint; 1161 /// pnfs_osd_raid_algorithm_hint4 olh_raid_algorithm_hint; 1162 /// }; 1163 /// 1165 This type conveys hints for the desired data map. All parameters are 1166 optional so the client can give values for only the parameters it 1167 cares about, e.g. it can provide a hint for the desired number of 1168 mirrored components, regardless of the the raid algorithm selected 1169 for the file. The server should make an attempt to honor the hints 1170 but it can ignore any or all of them at its own discretion and 1171 without failing the respective CREATE operation. 1173 The "olh_max_comps_hint" can be used to limit the total number of 1174 component objects comprising the file. All other hints correspond 1175 directly to the different fields of pnfs_osd_data_map4. 1177 10. Layout Segments 1179 The pnfs layout operations operate on logical byte ranges. There is 1180 no requirement in the protocol for any relationship between byte 1181 ranges used in LAYOUTGET to acquire layouts and byte ranges used in 1182 CB_LAYOUTRECALL, LAYOUTCOMMIT, or LAYOUTRETURN. 
However, using OSD
1183 byte-range capabilities poses limitations on these operations since
1184 the capabilities associated with layout segments cannot be merged or
1185 split.  The following guidelines should be followed for proper
1186 operation of object-based layouts.

1188 10.1. CB_LAYOUTRECALL and LAYOUTRETURN

1190 In general, the object-based layout driver should keep track of each
1191 layout segment it has obtained, keeping a record of the segment's
1192 iomode, offset, and length.  The server should allow the client to get
1193 multiple overlapping layout segments but is free to recall the layout
1194 to prevent overlap.

1196 In response to CB_LAYOUTRECALL, the client should return all layout
1197 segments matching the given iomode and overlapping with the recalled
1198 range.  When returning the layouts for this byte range with
1199 LAYOUTRETURN, the client MUST NOT return a sub-range of a layout
1200 segment it has; each LAYOUTRETURN sent MUST completely cover at least
1201 one outstanding layout segment.

1203 The server, in turn, should release any segment that exactly matches
1204 the clientid, iomode, and byte range given in LAYOUTRETURN.  If no
1205 exact match is found, then the server should release all layout
1206 segments matching the clientid and iomode that are fully
1207 contained in the returned byte range.  If none are found and the byte
1208 range is a subset of an outstanding layout segment for the same
1209 clientid and iomode, then the client can be considered malfunctioning
1210 and the server SHOULD recall all layouts from this client to reset
1211 its state.  If this behavior repeats, the server SHOULD deny all
1212 LAYOUTGETs from this client.

1214 10.2. LAYOUTCOMMIT

1216 LAYOUTCOMMIT is only used by object-based pNFS to convey modified
1217 attribute hints and/or to report I/O errors to the MDS.  Therefore,
1218 the offset and length in LAYOUTCOMMIT4args are reserved for future
1219 use and should be set to 0.

1221 11. Recalling Layouts

1223 The object-based metadata server should recall outstanding layouts in
1224 the following cases:

1226 o  When the file's security policy changes, i.e., ACLs or permission
1227    mode bits are set.

1229 o  When the file's aggregation map changes, rendering outstanding
1230    layouts invalid.

1232 o  When there are sharing conflicts.  For example, the server will
1233    issue stripe-aligned layout segments for RAID-5 objects.  To
1234    prevent corruption of the file's parity, multiple clients must not
1235    hold valid write layouts for the same stripes.  An outstanding RW
1236    layout should be recalled when a conflicting LAYOUTGET is received
1237    from a different client for LAYOUTIOMODE4_RW and for a byte-range
1238    overlapping with the outstanding layout segment.

1240 11.1. CB_RECALL_ANY

1242 The metadata server can use the CB_RECALL_ANY callback operation to
1243 notify the client to return some or all of its layouts.  NFSv4.1
1244 [6] defines the following types:

1246 const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8;
1247 const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9;

1249 struct CB_RECALL_ANY4args {
1250     uint32_t  craa_objects_to_keep;
1251     bitmap4   craa_type_mask;
1252 };

1254 Typically, CB_RECALL_ANY will be used to recall client state when the
1255 server needs to reclaim resources.  The craa_type_mask bitmap
1256 specifies the type of resources that are recalled and the
1257 craa_objects_to_keep value specifies how many of the recalled objects
1258 the client is allowed to keep.  The object-based layout type mask
1259 flags are defined as follows.

10.2. LAYOUTCOMMIT

   LAYOUTCOMMIT is used by object-based pNFS only to convey modified
   attribute hints and/or to report I/O errors to the MDS.  Therefore,
   the offset and length in LAYOUTCOMMIT4args are reserved for future
   use and should be set to 0.

11. Recalling Layouts

   The object-based metadata server should recall outstanding layouts
   in the following cases:

   o  When the file's security policy changes, i.e., ACLs or
      permission mode bits are set.

   o  When the file's aggregation map changes, rendering outstanding
      layouts invalid.

   o  When there are sharing conflicts.  For example, the server will
      issue stripe-aligned layout segments for RAID-5 objects.  To
      prevent corruption of the file's parity, multiple clients must
      not hold valid write layouts for the same stripes.  An
      outstanding RW layout should be recalled when a conflicting
      LAYOUTGET is received from a different client for
      LAYOUTIOMODE4_RW and for a byte range overlapping with the
      outstanding layout segment.

11.1. CB_RECALL_ANY

   The metadata server can use the CB_RECALL_ANY callback operation to
   notify the client to return some or all of its layouts.  The
   NFSv4.1 specification [6] defines the following types:

   const RCA4_TYPE_MASK_OBJ_LAYOUT_MIN = 8;
   const RCA4_TYPE_MASK_OBJ_LAYOUT_MAX = 9;

   struct CB_RECALL_ANY4args {
       uint32_t craa_objects_to_keep;
       bitmap4  craa_type_mask;
   };

   Typically, CB_RECALL_ANY will be used to recall client state when
   the server needs to reclaim resources.  The craa_type_mask bitmap
   specifies the type of resources that are recalled and the
   craa_objects_to_keep value specifies how many of the recalled
   objects the client is allowed to keep.  The object-based layout
   type mask flags are defined as follows.  They represent the iomode
   of the recalled layouts.  In response, the client SHOULD return
   layouts of the recalled iomode that it needs the least, keeping at
   most craa_objects_to_keep object-based layouts.

   /// enum pnfs_osd_cb_recall_any_mask {
   ///     PNFS_OSD_RCA4_TYPE_MASK_READ = 8,
   ///     PNFS_OSD_RCA4_TYPE_MASK_RW   = 9
   /// };
   ///

   The PNFS_OSD_RCA4_TYPE_MASK_READ flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_READ.  Similarly, the
   PNFS_OSD_RCA4_TYPE_MASK_RW flag notifies the client to return
   layouts of iomode LAYOUTIOMODE4_RW.  When both mask flags are set,
   the client is notified to return layouts of either iomode.
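
   The non-normative C sketch below shows one way a client might
   process such a callback.  The layout bookkeeping and the
   count_object_layouts(), pick_least_needed_layout(), and
   return_layout() helpers are hypothetical client-internal
   interfaces; only the mask bit values and the craa_objects_to_keep
   semantics come from the protocol.

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   #define PNFS_OSD_RCA4_TYPE_MASK_READ 8
   #define PNFS_OSD_RCA4_TYPE_MASK_RW   9

   struct layout;                        /* client-side layout record */

   /* Hypothetical client-internal helpers. */
   uint32_t       count_object_layouts(void);
   struct layout *pick_least_needed_layout(bool want_read, bool want_rw);
   void           return_layout(struct layout *lo); /* sends LAYOUTRETURN */

   static bool bit_set(const uint32_t *bitmap, uint32_t nwords,
                       uint32_t bit)
   {
       return bit / 32 < nwords &&
              ((bitmap[bit / 32] >> (bit % 32)) & 1);
   }

   void handle_cb_recall_any(const uint32_t *craa_type_mask,
                             uint32_t nwords,
                             uint32_t craa_objects_to_keep)
   {
       bool want_read = bit_set(craa_type_mask, nwords,
                                PNFS_OSD_RCA4_TYPE_MASK_READ);
       bool want_rw   = bit_set(craa_type_mask, nwords,
                                PNFS_OSD_RCA4_TYPE_MASK_RW);

       /* Return the least-needed layouts of the recalled iomode(s)
        * until at most craa_objects_to_keep object-based layouts
        * remain. */
       while (count_object_layouts() > craa_objects_to_keep) {
           struct layout *lo =
               pick_least_needed_layout(want_read, want_rw);

           if (lo == NULL)   /* nothing left with a recalled iomode */
               break;
           return_layout(lo);
       }
   }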

12. Client Fencing

   In cases where clients are uncommunicative and their lease has
   expired, or when clients fail to return recalled layouts within at
   least a lease period (see "Recalling a Layout" [6]), the server MAY
   revoke client layouts and/or device address mappings and reassign
   these resources to other clients.  To avoid data corruption, the
   metadata server MUST fence off the revoked clients from the
   respective objects as described in Section 13.4.

13. Security Considerations

   The pNFS extension partitions the NFSv4 file system protocol into
   two parts, the control path and the data path (storage protocol).
   The control path contains all the new operations described by this
   extension; all existing NFSv4 security mechanisms and features
   apply to the control path.  The combination of components in a pNFS
   system is required to preserve the security properties of NFSv4
   with respect to an entity accessing data via a client, including
   security countermeasures to defend against threats that NFSv4
   provides defenses for in environments where these threats are
   considered significant.

   The metadata server enforces the file access-control policy at
   LAYOUTGET time.  The client should use suitable authorization
   credentials for getting the layout for the requested iomode (READ
   or RW), and the server verifies the permissions and ACL for these
   credentials, possibly returning NFS4ERR_ACCESS if the client is not
   allowed the requested iomode.  If the LAYOUTGET operation succeeds,
   the client receives, as part of the layout, a set of object
   capabilities allowing it I/O access to the specified objects
   corresponding to the requested iomode.  When the client acts on I/O
   operations on behalf of its local users, it MUST authenticate and
   authorize the user by issuing respective OPEN and ACCESS calls to
   the metadata server, similarly to having NFSv4 data delegations.
   If access is allowed, the client uses the corresponding (READ or
   RW) capabilities to perform the I/O operations at the object
   storage devices.  When the metadata server receives a request to
   change a file's permissions or ACL, it SHOULD recall all layouts
   for that file and it MUST change the capability version attribute
   on all objects comprising the file to implicitly invalidate any
   outstanding capabilities before committing to the new permissions
   and ACL.  Doing this will ensure that clients re-authorize their
   layouts according to the modified permissions and ACL by requesting
   new layouts.  Recalling the layouts in this case is a courtesy of
   the server, intended to prevent clients from getting an error on
   I/Os done after the capability version changed.

   The object storage protocol MUST implement the security aspects
   described in version 1 of the T10 OSD protocol definition [2].  The
   standard defines four security methods: NOSEC, CAPKEY, CMDRSP, and
   ALLDATA.  To provide a minimum level of security allowing
   verification and enforcement of the server's access control policy
   using the layout security credentials, the NOSEC security method
   MUST NOT be used for any I/O operation.  The remainder of this
   section gives an overview of the security mechanism described in
   that standard.  The goal is to give the reader a basic
   understanding of the object security model.  Any discrepancies
   between this text and the actual standard are to be resolved in
   favor of the OSD standard.

13.1. OSD Security Data Types

   There are three main data types associated with object security: a
   capability, a credential, and security parameters.  The capability
   is a set of fields that specifies an object and what operations can
   be performed on it.  A credential is a signed capability.  Only a
   security manager that knows the secret device keys can correctly
   sign a capability to form a valid credential.  In pNFS, the file
   server acts as the security manager and returns signed capabilities
   (i.e., credentials) to the pNFS client.  The security parameters
   are values computed by the issuer of OSD commands (i.e., the
   client) that prove it holds valid credentials.  The client uses the
   credential as a signing key to sign the requests it makes to the
   OSD, and puts the resulting signatures into the security_parameters
   field of the OSD command.  The object storage device uses the
   secret keys it shares with the security manager to validate the
   signature values in the security parameters.

   The security types are opaque to the generic layers of the pNFS
   client.  The credential contents are defined as opaque within the
   pnfs_osd_object_cred4 type.  Instead of repeating the definitions
   here, the reader is referred to section 4.9.2.2 of the OSD
   standard.

13.2. The OSD Security Protocol

   The object storage protocol relies on a cryptographically secure
   capability to control accesses at the object storage devices.
   Capabilities are generated by the metadata server, returned to the
   client, and used by the client as described below to authenticate
   its requests to the Object Storage Device (OSD).  Capabilities
   therefore achieve the required access and open-mode checking.  They
   allow the file server to define and check a policy (e.g., open
   mode) and the OSD to enforce that policy without knowing the
   details (e.g., user IDs and ACLs).

   Since capabilities are tied to layouts, and since they are used to
   enforce access control, when the file ACL or mode changes, the
   outstanding capabilities MUST be revoked to enforce the new access
   permissions.  The server SHOULD recall layouts to allow clients to
   gracefully return their capabilities before the access permissions
   change.

   Each capability is specific to a particular object, an operation on
   that object, a byte range within the object (in OSDv2), and has an
   explicit expiration time.  The capabilities are signed with a
   secret key that is shared by the object storage devices (OSDs) and
   the metadata managers.  Clients do not have device keys, so they
   are unable to forge the signatures in the security parameters.
   The combination of a capability, the OSD SystemID, and a signature
   is called a "credential" in the OSD specification.

   The details of the security and privacy model for Object Storage
   are defined in the T10 OSD standard.  The following sketch of the
   algorithm should help the reader understand the basic model.

   LAYOUTGET returns a CapKey and a Cap, which, together with the OSD
   SystemID, are also called a credential.  It is a capability and a
   signature over that capability and the SystemID.  The OSD Standard
   refers to the CapKey as the "Credential integrity check value" and
   to the ReqMAC (defined below) as the "Request integrity check
   value".

      CapKey = MAC<SecretKey>(Cap, SystemID)
      Credential = {Cap, SystemID, CapKey}

   The client uses the CapKey to sign all the requests it issues for
   that object using the respective Cap.  In other words, the Cap
   appears in the request to the storage device, and that request is
   signed with the CapKey as follows:

      ReqMAC = MAC<CapKey>(Req, ReqNonce)
      Request = {Cap, Req, ReqNonce, ReqMAC}

   The following is sent to the OSD: {Cap, Req, ReqNonce, ReqMAC}.
   The OSD uses the SecretKey it shares with the metadata server to
   compare the ReqMAC the client sent with a locally computed value:

      LocalCapKey = MAC<SecretKey>(Cap, SystemID)
      LocalReqMAC = MAC<LocalCapKey>(Req, ReqNonce)

   and if they match, the OSD assumes that the capabilities came from
   an authentic metadata server and allows access to the object, as
   allowed by the Cap.
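
   As a non-normative illustration of the scheme above, the C sketch
   below shows the three computations: credential generation at the
   metadata server, request signing at the client, and verification
   at the OSD.  The mac() function stands for whatever keyed
   integrity-check algorithm the OSD security method uses; the fixed
   digest length and buffer handling are simplifying assumptions, not
   part of this document or the OSD standard.

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   #define MAC_LEN 20                 /* illustrative digest length */

   struct buf { const uint8_t *data; size_t len; };

   /* mac(out, key, m1, m2): keyed integrity check over m1 || m2. */
   void mac(uint8_t out[MAC_LEN], struct buf key, struct buf m1,
            struct buf m2);

   /* Metadata server: CapKey = MAC<SecretKey>(Cap, SystemID).
    * {Cap, SystemID, CapKey} form the credential returned by
    * LAYOUTGET. */
   void make_capkey(uint8_t capkey[MAC_LEN], struct buf secret_key,
                    struct buf cap, struct buf system_id)
   {
       mac(capkey, secret_key, cap, system_id);
   }

   /* Client: ReqMAC = MAC<CapKey>(Req, ReqNonce).  The request sent
    * to the OSD carries {Cap, Req, ReqNonce, ReqMAC}. */
   void sign_request(uint8_t reqmac[MAC_LEN],
                     const uint8_t capkey[MAC_LEN],
                     struct buf req, struct buf req_nonce)
   {
       struct buf key = { capkey, MAC_LEN };

       mac(reqmac, key, req, req_nonce);
   }

   /* OSD: recompute LocalCapKey from the shared SecretKey and the Cap
    * carried in the request, then LocalReqMAC, and compare it with
    * the ReqMAC sent by the client.  Returns 0 when the request
    * verifies. */
   int verify_request(struct buf secret_key, struct buf cap,
                      struct buf system_id, struct buf req,
                      struct buf req_nonce,
                      const uint8_t reqmac[MAC_LEN])
   {
       uint8_t local_capkey[MAC_LEN];
       uint8_t local_reqmac[MAC_LEN];
       struct buf key = { local_capkey, MAC_LEN };

       mac(local_capkey, secret_key, cap, system_id);
       mac(local_reqmac, key, req, req_nonce);

       return memcmp(local_reqmac, reqmac, MAC_LEN);
   }

   Note that anyone holding the CapKey can produce a valid ReqMAC,
   which is why the CapKey must be kept private, as discussed in the
   next section.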

13.3. Protocol Privacy Requirements

   Note that if the server's LAYOUTGET reply, holding the CapKey and
   Cap, is snooped by another client, it can be used to generate valid
   OSD requests (within the Cap access restrictions).

   To satisfy the privacy requirements for the capability key returned
   by LAYOUTGET, the GSS-API [7] framework can be used, e.g., by using
   the RPCSEC_GSS privacy method to send the LAYOUTGET operation or by
   using the SSV key to encrypt the oc_capability_key using the
   GSS_Wrap() function.  Two general ways to provide privacy in the
   absence of GSS-API that are independent of NFSv4 are an isolated
   network such as a VLAN or a secure channel provided by IPsec [15].

13.4. Revoking Capabilities

   At any time, the metadata server may invalidate all outstanding
   capabilities on an object by changing its POLICY ACCESS TAG
   attribute.  The value of the POLICY ACCESS TAG is part of a
   capability, and it must match the state of the object attribute.
   If they do not match, the OSD rejects accesses to the object with
   the sense key set to ILLEGAL REQUEST and an additional sense code
   set to INVALID FIELD IN CDB.  When a client attempts to use a
   capability and is rejected this way, it should issue a LAYOUTCOMMIT
   for the object and specify PNFS_OSD_BAD_CRED in the
   olr_ioerr_report parameter.  The client may elect to issue a
   compound LAYOUTRETURN/LAYOUTGET (or LAYOUTCOMMIT/LAYOUTRETURN/
   LAYOUTGET) to attempt to fetch a refreshed set of capabilities; a
   sketch of this recovery sequence is given at the end of this
   section.

   The metadata server may elect to change the access policy tag on an
   object at any time, for any reason (with the understanding that
   there is likely an associated performance penalty, especially if
   there are outstanding layouts for this object).  The metadata
   server MUST revoke outstanding capabilities when any one of the
   following occurs:

   o  the permissions on the object change,

   o  a conflicting mandatory byte-range lock is granted, or

   o  a layout is revoked and reassigned to another client.

   A pNFS client will typically hold one layout for each byte range
   for either READ or READ/WRITE.  The client's credentials are
   checked by the metadata server at LAYOUTGET time, and it is the
   client's responsibility to enforce access control among multiple
   users accessing the same file.  It is neither required nor expected
   that the pNFS client will obtain a separate layout for each user
   accessing a shared object.  The client SHOULD use OPEN and ACCESS
   calls to check user permissions when performing I/O so that the
   server's access control policies are correctly enforced.  The
   result of the ACCESS operation may be cached while the client holds
   a valid layout, as the server is expected to recall layouts when
   the file's access permissions or ACL change.
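
   The following non-normative C sketch summarizes the client-side
   recovery sequence described above for a capability rejected after
   the POLICY ACCESS TAG has changed.  The osd_result fields and the
   nfs_* helpers are hypothetical client-internal interfaces; in
   particular, nfs_layoutcommit_report_bad_cred() stands for a
   LAYOUTCOMMIT whose olr_ioerr_report carries PNFS_OSD_BAD_CRED.

   #include <stdbool.h>

   struct file_ref;                      /* client-side file state */

   struct osd_result {
       bool illegal_request;             /* sense key ILLEGAL REQUEST */
       bool invalid_field_in_cdb;        /* additional sense code */
   };

   /* Hypothetical client-internal helpers. */
   void nfs_layoutcommit_report_bad_cred(struct file_ref *f);
   void nfs_layoutreturn(struct file_ref *f);
   bool nfs_layoutget(struct file_ref *f); /* true if layout granted */

   /* Returns true if the I/O may be retried with fresh capabilities. */
   bool recover_from_cap_rejection(struct file_ref *f,
                                   struct osd_result r)
   {
       if (!(r.illegal_request && r.invalid_field_in_cdb))
           return false;                 /* not a capability rejection */

       /* Report the stale capability to the metadata server ... */
       nfs_layoutcommit_report_bad_cred(f);

       /* ... then return the layout and request a new one, which will
        * carry a refreshed set of capabilities. */
       nfs_layoutreturn(f);
       return nfs_layoutget(f);
   }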

14. IANA Considerations

   As described in the NFSv4.1 specification [6], new layout type
   numbers will be requested from IANA.  This document defines the
   protocol associated with the existing layout type number,
   LAYOUT4_OSD2_OBJECTS, and it requires no further actions for IANA.

15. References

15.1. Normative References

   [1]   Bradner, S., "Key words for use in RFCs to Indicate
         Requirement Levels", RFC 2119, March 1997.

   [2]   Weber, R., "Information Technology - SCSI Object-Based
         Storage Device Commands (OSD)", ANSI INCITS 400-2004,
         December 2004.

   [3]   Eisler, M., "XDR: External Data Representation Standard",
         STD 67, RFC 4506, May 2006.

   [4]   Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor
         Version 1 XDR Description", RFC [[RFC Editor: please insert
         NFSv4 Minor Version 1 XDR Description RFC number]],
         [[RFC Editor: please insert NFSv4 Minor Version 1 XDR
         Description RFC month]] [[RFC Editor: please insert NFSv4
         Minor Version 1 XDR Description RFC year]].

   [5]   IETF Trust, "Legal Provisions Relating to IETF Documents",
         November 2008.

   [6]   Shepler, S., Eisler, M., and D. Noveck, "NFSv4 Minor
         Version 1", RFC [[RFC Editor: please insert NFSv4 Minor
         Version 1 RFC number]], [[RFC Editor: please insert NFSv4
         Minor Version 1 RFC month]] [[RFC Editor: please insert
         NFSv4 Minor Version 1 RFC year]].

   [7]   Linn, J., "Generic Security Service Application Program
         Interface Version 2, Update 1", RFC 2743, January 2000.

   [8]   Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and
         E. Zeidner, "Internet Small Computer Systems Interface
         (iSCSI)", RFC 3720, April 2004.

   [9]   Weber, R., "SCSI Primary Commands - 3 (SPC-3)", ANSI
         INCITS 408-2005, October 2005.

   [10]  Krueger, M., Chadalapaka, M., and R. Elliott, "T11 Network
         Address Authority (NAA) Naming Format for iSCSI Node Names",
         RFC 3980, February 2005.

   [11]  IEEE, "Guidelines for 64-bit Global Identifier (EUI-64)
         Registration Authority".

   [12]  Tseng, J., Gibbons, K., Travostino, F., Du Laney, C., and
         J. Souza, "Internet Storage Name Service (iSNS)", RFC 4171,
         September 2005.

   [13]  Weber, R., "SCSI Architecture Model - 3 (SAM-3)", ANSI
         INCITS 402-2005, February 2005.

15.2. Informative References

   [14]  Weber, R., "SCSI Object-Based Storage Device Commands -2
         (OSD-2)", November 2008.

   [15]  Kent, S. and K. Seo, "Security Architecture for the Internet
         Protocol", RFC 4301, December 2005.

   [16]  T10 1415-D, "SCSI RDMA Protocol (SRP)", ANSI INCITS 365-2002,
         December 2002.

   [17]  T11 1619-D, "Fibre Channel Framing and Signaling - 2
         (FC-FS-2)", ANSI INCITS 424-2007, February 2007.

   [18]  T10 1601-D, "Serial Attached SCSI - 1.1 (SAS-1.1)", ANSI
         INCITS 417-2006, June 2006.

   [19]  MacWilliams, F. and N. Sloane, "The Theory of
         Error-Correcting Codes, Part I", 1977.

Appendix A. Acknowledgments

   Todd Pisek was a co-editor of the initial drafts for this document.
   Daniel E. Messinger, Pete Wyckoff, Mike Eisler, Sean P. Turner,
   Brian E. Carpenter, Jari Arkko, David Black, and Jason Glasgow
   reviewed and commented on this document.

Authors' Addresses

   Benny Halevy
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA  15233
   USA

   Phone: +1-412-323-3500
   Email: bhalevy@panasas.com
   URI:   http://www.panasas.com/

   Brent Welch
   Panasas, Inc.
   6520 Kaiser Drive
   Fremont, CA  95444
   USA

   Phone: +1-650-608-7770
   Email: welch@panasas.com
   URI:   http://www.panasas.com/

   Jim Zelenka
   Panasas, Inc.
   1501 Reedsdale St. Suite 400
   Pittsburgh, PA  15233
   USA

   Phone: +1-412-323-3500
   Email: jimz@panasas.com
   URI:   http://www.panasas.com/