NFSv4 Working Group                                          S. Faibish
Internet-Draft                                                   P. Tao
Intended status: draft                                  EMC Corporation
Expires: November 5, 2013                                   May 5, 2013

             Parallel NFS (pNFS) Lustre Layout Operations
              draft-faibish-nfsv4-pnfs-lustre-layout-04

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on November 5, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   Parallel NFS (pNFS) extends Network File System version 4.1
   (NFSv4.1) to allow clients to directly access file data on the
   storage used by the NFSv4.1 server.
   This ability to bypass the server for data access can increase both
   performance and parallelism, but requires additional client
   functionality for data access, some of which is dependent on the
   class of storage used, a.k.a. the Layout Type.  The main pNFS
   operations and data types in NFSv4 Minor version 1 specify a
   layout-type-independent layer; layout-type-specific information is
   conveyed using opaque data structures whose internal structure is
   further defined by the particular layout type specification.  This
   document specifies the NFSv4.1 Lustre pNFS Layout Type as a
   companion to the main NFSv4 Minor version 1 specification.

Table of Contents

   1. Introduction
      1.1. pNFS Lustre Layout Protocol
      1.2. General Definitions
      1.3. Lustre Protocol Description
   2. Conventions Used in this Document
   3. XDR Description of the Lustre-Based Layout Protocol
      3.1. Code Components Licensing Notice
   4. Basic Data Type Definitions
      4.1. pnfs_lov_magic
      4.2. pnfs_los_object_cred4
      4.3. Data Striping Algorithms
   5. Object Storage Server Addressing and Discovery
      5.1. pnfs_los_targetid_type4
      5.2. pnfs_los_deviceaddr4
         5.2.1. OSS Target Identifier
         5.2.2. Device Network Address
   6. Lustre-Based Layout
      6.1. pnfs_lov_mds_md
      6.2. pnfs_los_layout4
      6.3. Data Mapping Schemes
         6.3.1. Simple Striping
      6.4. RAID Algorithms
         6.4.1. PNFS_OST_RAID_0
         6.4.2. PNFS_OST_RAID_1
   7. Lustre-Based Creation Layout Hint
      7.1. pnfs_los_layouthint4
   8. IANA Considerations
   9. References
      9.1. Normative References
   Authors' Addresses

1. Introduction

1.1. pNFS Lustre Layout Protocol

   Figure 1 shows the overall architecture of a Parallel NFS (pNFS)
   Protocol ([8]) system:

      +-----------+
      |+-----------+                                 +-----------+
      ||+-----------+                                |           |
      |||           |        NFSv4.1 + pNFS          |           |
      +||  Clients  |<------------------------------>|    MDS    |
       +|           |                                |           |
        +-----------+                                |           |
             |||                                     +-----------+
             |||                                           |
             |||                                           |
             ||| Storage        +-----------+              |
             ||| Protocol       |+-----------+             |
             ||+----------------||+-----------+  Control   |
             |+-----------------|||           |  Protocol  |
             +------------------+||  Storage  |------------+
                                 +|  Devices  |
                                  +-----------+

                       Figure 1 pNFS Architecture

   In this document, "storage device" is used as a general term for a
   data server and/or storage server for all pNFS layouts.  The
   MetaData Server (MDS) is the NFSv4.1 server that provides pNFS
   layouts to clients and handles operations on file metadata (e.g.,
   names, attributes).

   In pNFS, the file server returns typed layout structures that
   describe where file data is located.  There are different layouts
   for different storage systems and methods of arranging data on
   storage devices.
   This document describes the layouts used with Lustre object storage
   servers (OSSs) that are accessed according to the Lustre storage
   protocol ([1]).

1.2. General Definitions

   The following definitions provide an appropriate context for the
   reader.

   +-----------------+------------------------------------------------+
   | Lustre module   | Description                                    |
   +-----------------+------------------------------------------------+
   | OST             | Object Storage Targets are SCSI LUNs which     |
   |                 | store file data objects                        |
   |                 |                                                |
   | OSS             | An Object Storage Server implements the Lustre |
   |                 | data protocol and serves data                  |
   |                 |                                                |
   | OSC             | An Object Storage Client [10] is a client of   |
   |                 | the Lustre object services                     |
   |                 |                                                |
   | LOV             | LOV is the Lustre Object Volume [10]. It       |
   |                 | interprets stripe information and directs pages|
   |                 | to the correct OSCs.                           |
   |                 |                                                |
   | MDT             | A Metadata Target is a SCSI LUN that stores    |
   |                 | file metadata                                  |
   |                 |                                                |
   | MDS             | A Metadata Server implements the Lustre        |
   |                 | metadata server control protocol               |
   |                 |                                                |
   | MDC             | A Metadata Client of Lustre protocol services  |
   |                 |                                                |
   | LDLM            | The Lustre Distributed Lock Manager (LDLM) [11]|
   |                 | provides a means to ensure that data is updated|
   |                 | in a consistent fashion across multiple nodes. |
   |                 |                                                |
   | PTLRPC          | The Portal RPC subsystem [12] is a reliable    |
   |                 | messaging service layered on top of LNET. It   |
   |                 | caters for small messages and also for bulk    |
   |                 | data transfers.                                |
   |                 |                                                |
   | LNET            | LNET is the Lustre Networking sub-system [13]. |
   |                 | It hides differences of underlying network     |
   |                 | types and provides common APIs to LNET users.  |
   |                 |                                                |
   | LND             | LND is the Lustre Network Driver layer [13]. It|
   |                 | implements the interface between the generic   |
   |                 | LNET layer and the drivers for the specific    |
   |                 | network types.                                 |
   +-----------------+------------------------------------------------+

1.3. Lustre Protocol Description

   Lustre is an object-based file system.  It is composed of three
   components: Metadata servers (MDSs), object storage servers (OSSs),
   and Lustre clients.

   Lustre uses block devices (SCSI LUNs) for file data storage (OST)
   and metadata storage (MDT), and each block device can be managed by
   only one Lustre server (OSS).  The total data capacity of the Lustre
   filesystem is the sum of all individual OST capacities.  Lustre
   clients access and concurrently use data through the standard POSIX
   I/O system calls.

   A Lustre MDS provides metadata services.  One Lustre MDS manages one
   metadata target (MDT).  Each MDT stores file metadata, such as file
   names, directory structures, and access permissions.  An OSS exposes
   block devices and serves data.  Each OSS manages one or more object
   storage targets (OSTs), and OSTs store file data "objects".

   The Lustre protocol specifies several operations on objects,
   including OPEN, READ, WRITE, GET ATTRIBUTES, SET ATTRIBUTES, CREATE,
   and DELETE.  However, when using the Lustre layout, the Lustre
   client uses only the OPEN, READ, WRITE, and GET ATTRIBUTES commands.
   The other commands are used only by the Lustre server.

   A Lustre file object's layout information is defined in the extended
   attribute (EA) of the inode.  Essentially, the EA describes the
   mapping between a file object identifier and its corresponding OSTs.
   This information is also known as striping.  A Lustre-based layout
   for pNFS includes object identifiers, capabilities that allow pNFS
   clients to READ or WRITE those objects, and various parameters that
   control how file data is striped across OSTs.

   This document specifies the NFSv4.1 layout protocol and operations
   for Lustre filesystems ([1]).

2. Conventions Used in this Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [6].

3. XDR Description of the Lustre-Based Layout Protocol

   This document contains the external data representation (XDR [2])
   description of the NFSv4.1 Lustre layout protocol.  The XDR
   description is embedded in this document in a way that makes it
   simple for the reader to extract into a ready-to-compile form.  The
   reader can feed this document into the following shell script to
   produce the machine readable XDR description of the NFSv4.1 Lustre
   layout protocol:

      #!/bin/sh
      grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

   That is, if the above script is stored in a file called
   "extract.sh", and this document is in a file called "spec.txt", then
   the reader can do:

      sh extract.sh < spec.txt > pnfs_lustre_prot.x

   The effect of the script is to remove leading white space from each
   line, plus a sentinel sequence of "///".

   The embedded XDR file header follows.  Subsequent XDR descriptions,
   with the sentinel sequence, are embedded throughout the document.

   Note that the XDR code contained in this document depends on types
   from the NFSv4.1 nfs4_prot.x file ([3]).  This includes both NFS
   types that end with a 4, such as offset4, length4, etc., as well as
   more generic types such as uint32_t and uint64_t.

3.1. Code Components Licensing Notice

   The XDR description, marked with lines beginning with the sequence
   "///", as well as scripts for extracting the XDR description are
   Code Components as described in Section 4 of "Legal Provisions
   Relating to IETF Documents" [4].  These Code Components are licensed
   according to the terms of Section 4 of "Legal Provisions Relating to
   IETF Documents".
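
   For readers who prefer not to rely on grep and sed, the extraction
   step described in Section 3 can be sketched in Python.  This is an
   illustrative, non-normative equivalent of the shell pipeline; the
   function name extract_xdr is hypothetical and is not defined by this
   document.

```python
import re

# Lines of interest start with optional spaces followed by "///".
SENTINEL_LINE = re.compile(r"^ *///")


def extract_xdr(lines):
    """Yield the XDR text carried on '///' sentinel lines.

    Mirrors: grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'
    """
    for line in lines:
        line = line.rstrip("\n")
        if not SENTINEL_LINE.match(line):
            continue  # ordinary prose, not part of the XDR description
        # Drop leading spaces, the sentinel, and the one space after it.
        line = re.sub(r"^ */// ", "", line)
        # A bare sentinel with nothing after it becomes an empty line.
        line = re.sub(r"^ *///$", "", line)
        yield line
```

   A caller could feed it an open file or sys.stdin and write each
   yielded line, followed by a newline, to pnfs_lustre_prot.x.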

   /// /*
   ///  * Copyright (c) 2013 IETF Trust and the persons identified
   ///  * as authors of the code.  All rights reserved.
   ///  *
   ///  * Redistribution and use in source and binary forms, with
   ///  * or without modification, are permitted provided that the
   ///  * following conditions are met:
   ///  *
   ///  * o Redistributions of source code must retain the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer.
   ///  *
   ///  * o Redistributions in binary form must reproduce the above
   ///  *   copyright notice, this list of conditions and the
   ///  *   following disclaimer in the documentation and/or other
   ///  *   materials provided with the distribution.
   ///  *
   ///  * o Neither the name of Internet Society, IETF or IETF
   ///  *   Trust, nor the names of specific contributors, may be
   ///  *   used to endorse or promote products derived from this
   ///  *   software without specific prior written permission.
   ///  *
   ///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
   ///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
   ///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
   ///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
   ///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
   ///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
   ///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   ///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
   ///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   ///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
   ///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
   ///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
   ///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   ///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
   ///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
   ///  *
   ///  * Please reproduce this note if possible.
   ///  */
   ///
   /// /*
   ///  * pnfs_lustre_prot.x
   ///  */
   ///
   /// %#include
   ///

4. Basic Data Type Definitions

   The following sections define basic data types and constants used by
   the Lustre Layout protocol.

4.1. pnfs_lov_magic

   Lustre uses two magic numbers to identify different "lov_mds_md"
   versions.

   /// enum pnfs_lov_magic {
   ///     LOV_MAGIC_V1 = 0x0BD10BD0, /* to identify lov_mds_md_v1 */
   ///     LOV_MAGIC_V3 = 0x0BD30BD0  /* to identify lov_mds_md_v3 */
   /// };

   "pnfs_lov_magic" is used to indicate the Lustre protocol MDS
   metadata version.  The magic number is used to identify the protocol
   version and to detect the byte order of the request sent by the
   client.

   At this time, the Lustre protocol uses LOV_MAGIC_V1 and LOV_MAGIC_V3
   to mark different versions of "lov_mds_md".  The difference between
   LOV_MAGIC_V1 and LOV_MAGIC_V3 is that LOV_MAGIC_V3 supports OST
   pooling.

   The OST pools feature allows the administrator to name a group of
   OSTs for file striping purposes.  For instance, a group of local
   OSTs could be defined for faster access; a group of
   higher-performance OSTs could be defined for specific applications;
   a group of non-RAID OSTs could be defined for scratch files; or
   groups of OSTs could be defined for particular users.

   If OST pooling is configured, the MDS SHOULD return LOV_MAGIC_V3.
   If OST pooling is not configured, the MDS SHOULD return
   LOV_MAGIC_V1.  The versioning is thus used only for feature
   matching.

   Therefore, the Lustre protocol version is explicitly called out in
   the information returned in the layout.  (The format value is
   0x0BD10BD0 for version V1 capability.)

4.2. pnfs_los_object_cred4

   /// enum pnfs_los_cap_key_sec4 {
   ///     PNFS_OSS_CAP_KEY_SEC_NONE = 0,
   ///     PNFS_OSS_CAP_KEY_SEC_SSV  = 1
   /// };
   ///
   /// typedef uint64_t pnfs_los_objid4;
   ///
   /// struct pnfs_los_object_cred4 {
   ///     pnfs_los_objid4         ploc_object_id;
   ///     pnfs_los_cap_key_sec4   ploc_cap_key_sec;
   ///     opaque                  ploc_capability_key<>;
   ///     opaque                  ploc_capability<>;
   /// };
   ///

   Lustre PTLRPC supports GSS authentication.  PTLRPC implements Lustre
   communications over LNET ([1]).  Therefore, "pnfs_los_object_cred4"
   is carried inside "pnfs_los_layout4" so that credentials can be
   passed along when the network requires security.

   The pnfs_los_object_cred4 structure is used to identify each
   component comprising the file.  The "ploc_object_id" identifies the
   component object, and the "ploc_capability_key" provides the OSS
   security credentials needed to access that object.  The
   "ploc_cap_key_sec" value denotes the method used to secure the
   "ploc_capability_key".

   To comply with the Lustre security requirements, the capability key
   SHOULD be transferred securely to prevent eavesdropping.  Therefore,
   a client SHOULD either issue the LAYOUTGET or GETDEVICEINFO
   operations via RPCSEC_GSS with the privacy service or previously
   establish a secret state verifier (SSV) for the sessions via the
   NFSv4.1 SET_SSV operation.  The pnfs_los_cap_key_sec4 type is used
   to identify the method used by the server to secure the capability
   key.

   o  PNFS_OSS_CAP_KEY_SEC_NONE denotes that the "ploc_capability_key"
      is not encrypted, in which case the client SHOULD issue the
      LAYOUTGET or GETDEVICEINFO operations with RPCSEC_GSS with the
      privacy service, or the NFSv4.1 transport should be secured by
      methods that are external to NFSv4.1, such as the use of IPsec
      ([5]) for transporting the NFSv4.1 protocol.

   o  PNFS_OSS_CAP_KEY_SEC_SSV denotes that the "ploc_capability_key"
      contents are encrypted using the SSV GSS context and the
      capability key as inputs to the GSS_Wrap() function (see GSS-API
      [7]) with the conf_req_flag set to TRUE.  The client MUST use the
      secret SSV key as part of the client's GSS context to decrypt the
      capability key using the value of the "ploc_capability_key" field
      as the input_message to the GSS_Unwrap() function.  Note that to
      prevent eavesdropping of the SSV key, the client SHOULD issue
      SET_SSV via RPCSEC_GSS with the privacy service.

   The actual method chosen depends on whether the client established
   an SSV key with the server and whether it issued the operation with
   the RPCSEC_GSS privacy method.  Naturally, if the client did not
   establish an SSV key via SET_SSV, the server MUST use the
   PNFS_OSS_CAP_KEY_SEC_NONE method.  Otherwise, if the operation was
   not issued with the RPCSEC_GSS privacy method, the server SHOULD
   secure the "ploc_capability_key" with the PNFS_OSS_CAP_KEY_SEC_SSV
   method.  The server MAY also use the PNFS_OSS_CAP_KEY_SEC_SSV method
   when the operation was issued with the RPCSEC_GSS privacy method.

4.3. Data Striping Algorithms

   Currently only RAID0 is supported, but Lustre defines RAID1 as well.

   /// const LOV_PATTERN_RAID0 = 0x001;
   ///                     /* stripes are used round-robin */
   /// const LOV_PATTERN_RAID1 = 0x002;
   ///                     /* stripes are mirrors of each other */

5. Object Storage Server Addressing and Discovery

   Data operations to an OSS require the client to know the "address"
   of each OSS's root object.  The OSS exposes block devices and serves
   data.  Correspondingly, the OSC is the client of these services.
   Each OSS manages one or more OSTs, and OSTs store file data objects.
   Because these representations are local, GETDEVICEINFO must return
   information that can be used by the client to select the correct
   local representation.

5.1. pnfs_los_targetid_type4

   The following enum specifies the manner in which an OST can be
   specified.  The target can be specified by the network access
   protocol type used.

   /// enum pnfs_los_targetid_type4 {
   ///     LOS_TARGET_TCP = 1,
   ///     LOS_TARGET_IB  = 2
   /// };

   Where:

   o  LOS_TARGET_TCP denotes use of the TCP protocol

   o  LOS_TARGET_IB denotes use of the IB protocol

   Only TCP and IB are defined because these are the two most widely
   used networks in High Performance Computing deployments.

5.2. pnfs_los_deviceaddr4

   The specification (according to [9]) for an object device address is
   as follows:

   /// struct pnfs_los_deviceaddr4 {
   ///     netaddr4    lda_targetid;
   ///     opaque      lda_ossname<>;
   /// };

5.2.1. OSS Target Identifier

   When "lda_targetid" is specified, the opaque field MUST be formatted
   as the LOS name.

5.2.2. Device Network Address

   The network address is given with the netaddr4 type, which specifies
   a TCP/IP or IB based endpoint (as specified in NFSv4.1 [3]).  When
   given, the client SHOULD use it to probe for the OSS device at the
   given network address.  The client MAY still use other discovery
   mechanisms to locate the device using the "lda_targetid".  In
   particular, an external name service (external to the data protocol
   coming from LNET) SHOULD be used when the devices may be attached to
   the network using multiple connections and/or multiple storage
   fabrics (e.g., TCP or IB).

6. Lustre-Based Layout

   The layout4 type is defined in NFSv4.1 ([3]) as follows:

      enum layouttype4 {
          LAYOUT4_NFSV4_1_FILES = 0x1,
          LAYOUT4_OSD2_OBJECTS  = 0x2,
          LAYOUT4_BLOCK_VOLUME  = 0x3,
          LAYOUT4_OSS_OBJECTS   = 0x0BD30BD4 /* Tentatively */
      };

      struct layout_content4 {
          layouttype4             loc_type;
          opaque                  loc_body<>;
      };

      struct layout4 {
          offset4                 lo_offset;
          length4                 lo_length;
          layoutiomode4           lo_iomode;
          layout_content4         lo_content;
      };

   This document defines the structure associated with the layouttype4
   value LAYOUT4_OSS_OBJECTS.  NFSv4.1 ([3]) specifies the loc_body
   structure as an XDR type "opaque".  The opaque layout is
   uninterpreted by the generic pNFS client layers, but must be
   interpreted by the Lustre storage layout driver.  This section
   defines the structure of this opaque value, "pnfs_los_layout4".

6.1. pnfs_lov_mds_md

   These are the key file mapping data structures.
   "pnfs_lov_ost_data4" is the per-stripe data structure.
   "lov_mds_md" is the per-file data structure.  The difference between
   v1 and v3 is that v3 supports OST pooling.

   /// struct pnfs_lov_ost_data4 {  /* per-stripe data structure */
   ///     uint64_t l_object_id;    /* OST object ID */
   ///     uint64_t l_object_seq;   /* OST object seq number */
   ///     uint32_t l_ost_gen;
   ///                  /* generation of this l_ost_idx */
   ///     uint32_t l_ost_idx;
   ///                  /* OST index in LOV (lov_tgt_desc->tgts) */
   /// };
   ///
   /// struct pnfs_lov_mds_md_v1 {  /* LOV EA mds/wire data */
   ///     uint32_t lmm_pattern;
   ///                  /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
   ///     uint64_t lmm_object_id;     /* LOV object ID */
   ///     uint64_t lmm_object_seq;    /* LOV object seq number */
   ///     uint32_t lmm_stripe_size;   /* size of stripe in bytes */
   ///     uint16_t lmm_stripe_count;
   ///                  /* num stripes in use for this object */
   ///     uint16_t lmm_layout_gen;    /* layout generation number */
   ///
   ///     pnfs_lov_ost_data4 lmm_objects[0]; /* per-stripe data */
   /// };
   ///
   /// #define LOV_MAXPOOLNAME 16
   ///
   /// struct pnfs_lov_mds_md_v3 {  /* LOV EA mds/wire data */
   ///     uint32_t lmm_pattern;
   ///                  /* LOV_PATTERN_RAID0, LOV_PATTERN_RAID1 */
   ///     uint64_t lmm_object_id;     /* LOV object ID */
   ///     uint64_t lmm_object_seq;    /* LOV object seq number */
   ///     uint32_t lmm_stripe_size;   /* size of stripe in bytes */
   ///     uint16_t lmm_stripe_count;
   ///                  /* num stripes in use for this object */
   ///     uint16_t lmm_layout_gen;    /* layout generation number */
   ///     char     lmm_pool_name[LOV_MAXPOOLNAME];
   ///                  /* must be 32bit aligned */
   ///     pnfs_lov_ost_data4 lmm_objects[0]; /* per-stripe data */
   /// };
   ///
   /// union pnfs_lov_mds_md switch (pnfs_lov_magic lmm_magic) {
   ///     case LOV_MAGIC_V1:
   ///         pnfs_lov_mds_md_v1 mds_md1;
   ///     case LOV_MAGIC_V3:
   ///         pnfs_lov_mds_md_v3 mds_md3;
   ///     default:
   ///         void;
   /// };
   ///

   The "pnfs_lov_ost_data4" structure parameterizes the algorithm that
   maps a file's contents over the component OSTs.
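
   As a non-normative illustration of how these fields drive the data
   mapping, the following Python sketch maps a logical file offset to
   one entry of "lmm_objects" and to an offset within that object for
   a LOV_PATTERN_RAID0 layout.  It assumes conventional round-robin
   striping: stripe unit n is placed on entry (n mod stripe count), and
   the units held by one object are packed densely.  The class and
   function names (OstData, StripeMd, map_offset) are hypothetical;
   they only mirror the XDR field names above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OstData:
    """Subset of pnfs_lov_ost_data4 (hypothetical Python mirror)."""
    l_object_id: int
    l_ost_idx: int


@dataclass
class StripeMd:
    """Subset of pnfs_lov_mds_md_v1 (hypothetical Python mirror)."""
    lmm_stripe_size: int          # bytes per stripe unit
    lmm_objects: List[OstData]    # one entry per stripe


def map_offset(md: StripeMd, offset: int) -> Tuple[OstData, int]:
    """Return (component object, offset inside that object).

    Assumes round-robin RAID0 placement with dense objects.
    """
    stripe_count = len(md.lmm_objects)
    unit = offset // md.lmm_stripe_size          # stripe unit number
    component = unit % stripe_count              # which object holds it
    row = unit // stripe_count                   # full stripes before it
    obj_off = row * md.lmm_stripe_size + offset % md.lmm_stripe_size
    return md.lmm_objects[component], obj_off
```

   For example, with a 1 MiB stripe size and three objects, file offset
   9000 KiB falls in stripe unit 8, which maps to the third object at
   object offset 2856 KiB.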

   The "pnfs_lov_ost_data4" is a per-stripe data structure that defines
   the location of the stripe within an OST and which OST holds the
   data.

   "l_object_id" holds the file data's object ID on the OST.

   "l_object_seq" holds the object sequence number, which is always 0.
   "l_ost_idx" holds the OST's index in LOV, and "l_ost_gen" holds the
   OST's index generation.

   The "lmm_magic" specifies the format of the returned striping
   information.  LOV_MAGIC_V1 is used for "pnfs_lov_mds_md_v1", and
   LOV_MAGIC_V3 is used for "pnfs_lov_mds_md_v3".

   "mds_md1" and "mds_md3" hold the file's detailed striping
   information.  The two data structures share most fields, while
   "mds_md3" adds the OST pooling field "lmm_pool_name".  When
   "lmm_magic" is LOV_MAGIC_V3, the OST pool name MUST be specified in
   the "lmm_pool_name" field by the MDS, with a pool name of at most
   LOV_MAXPOOLNAME bytes.

   The "lmm_pattern" holds the file's striping pattern.  It can be
   either LOV_PATTERN_RAID0 or LOV_PATTERN_RAID1.  "lmm_object_id"
   holds the MDS object ID.  "lmm_object_seq" holds the LOV object
   sequence number.

   "lmm_stripe_size" holds the stripe size in bytes.  A file is striped
   across multiple OSTs using the same stripe size.  The
   "lmm_stripe_count" holds the number of OSTs over which the file is
   striped.

   "lmm_layout_gen" holds the generation of the current layout
   information.  Clients need to obtain the layout generation before
   I/O and check it again after I/O.  If the layout generation has
   changed, the client needs to redo the operations.

   The "lmm_objects" is an array of "lmm_stripe_count" members
   containing per-OST file information.  Each element is of type
   "pnfs_lov_ost_data4".

6.2. pnfs_los_layout4

   The following is the opaque data carried in the generic layout.

   /// struct pnfs_los_layout4 {
   ///     pnfs_lov_magic          lmm_magic;
   ///     pnfs_lov_mds_md         lov_mds_md;
   ///     pnfs_los_object_cred4   llo_component;
   /// };
   ///

   "pnfs_lov_magic" and "lov_mds_md" are defined above (Sections 4.1
   and 6.1).

   The "llo_component" is of type "pnfs_los_object_cred4" and contains
   the credentials that the Lustre client needs in order to connect to
   the OSSs.

   Note that the layout depends on the file size, which the client
   learns by issuing GETATTR commands to the pNFS metadata server.

   The pNFS client uses the file size to decide whether it should
   return a short read when trying to read beyond the end of the file.

6.3. Data Mapping Schemes

   This section describes the different data mapping schemes in detail.
   The Lustre layout always uses a "dense" layout as described in
   NFSv4.1 ([3]).  This means that the second stripe unit of the file
   starts at offset 0 of the second component, rather than at offset
   stripe_unit bytes.  After a full stripe has been written, the next
   stripe unit is appended to the first component object in the list
   without any holes in the component objects.  From the MDS point of
   view, each file is composed of multiple data objects striped on one
   or more OSTs.

6.3.1. Simple Striping

   A file object's layout information is defined in the extended
   attribute (EA) of the inode.  Essentially, the EA describes the
   mapping between a file object ID and its corresponding OSTs.

   For example, if file A has a stripe count of three, then its EA will
   look like:

      EA ---> <obj_id x, ost p>
              <obj_id y, ost q>
              <obj_id z, ost r>
              stripe size and stripe width

   In the above description, obj_id is the object identifier of a file
   fragment on a given OST (for example, object x on OST p), "stripe
   size" is the size of each file segment on one OST, and "stripe
   width" is the number of OSTs used.  So if the "stripe size" is 1MB
   and the "stripe width" is 3, then [0,1M), [3M,4M), ... are stored as
   object x, which is on OST p; [1M,2M), [4M,5M), ... are stored as
   object y, which is on OST q; and [2M,3M), [5M,6M), ... are stored as
   object z, which is on OST r.

   Before reading the file, the pNFS client will query the pNFS MDS and
   be informed that it should talk to OSTs p, q, and r for this
   operation.  This information is structured in the so-called LSM, and
   the Lustre client-side LOV (logical object volume) interprets this
   information so that the Lustre client can send requests to the OSTs.
   Here again, the Lustre client communicates with an OST through a
   client module interface known as the OSC.  Depending on the context,
   OSC can also be used to refer to an OSS client by itself.

   The mapping from the logical offset within a file (L) to the
   component object C and object-specific offset O is defined by the
   following equations, in which all divisions are integer divisions
   (the remainder is discarded):

      L = logical offset into the file
      W = stripe width
      S = stripe size
      C = ((L-(L%S))/S)%W
      O = (L/W/S)*S + (L%S)

   In these equations, S is the number of bytes in one stripe unit,
   i.e., the stripe size.  C is an index into the array of components,
   so it selects a particular OST device.  The C count starts from
   zero.  O is the offset within the object on that OST that
   corresponds to the file offset.  Note that this computation accounts
   for the fact that an object includes all the file segments that are
   located on the same OST.

   For example, consider an object striped over three devices,
   <OST0, OST1, OST2>.  The stripe size is 1024KB.  The stripe width W
   is thus 3.  Offsets below are expressed in KB.

      Offset 0KB:
         C = ((0-(0%1024))/1024)%3 = 0 (OST0)
         O = (0/3/1024)*1024 + (0%1024) = 0KB

      Offset 1024KB:
         C = ((1024-(1024%1024))/1024)%3 = 1 (OST1)
         O = (1024/3/1024)*1024 + (1024%1024) = 0KB

      Offset 9000KB:
         C = ((9000-(9000%1024))/1024)%3 = 2 (OST2)
         O = (9000/3/1024)*1024 + (9000%1024) = 2856KB

      Offset 102400KB:
         C = ((102400-(102400%1024))/1024)%3 = 1 (OST1)
         O = (102400/3/1024)*1024 + (102400%1024) = 33792KB

6.4. RAID Algorithms

   This section defines the different redundancy algorithms.
Note: The term "RAID" (Redundant Array of Independent Disks) is used
in this document to represent an array of component OSTs that store
data for an individual file. The objects are stored on independent
OST-based storage devices. File data is encoded and striped across
the array of component OSTs using algorithms developed for block-based
RAID systems.

6.4.1. PNFS_OST_RAID_0

PNFS_OST_RAID_0 means there is no parity data, so all bytes in the
component objects are data bytes located by the above equations for
C and O.

6.4.2. PNFS_OST_RAID_1

PNFS_OST_RAID_1 means there is no parity data, but each OST is
mirrored to another OST. In this case all bytes in the component
objects are still data bytes located by the equations for C and O
defined in section 6.3.1.

7. Lustre-Based Creation Layout Hint

The layouthint4 type is defined in the NFSv4.1 XDR description ([3])
as follows:

   struct layouthint4 {
       layouttype4     loh_type;
       opaque          loh_body<>;
   };

The "layouthint4" structure is used by the client to pass a hint about
the type of layout it would like to be created for a particular file.
If the "loh_type" layout type is LAYOUT4_OSS_OBJECTS, then the
"loh_body" opaque value is defined by the "pnfs_los_layouthint4" type.

7.1. pnfs_los_layouthint4

/// union pnfs_lov_stripe_count_hint4 switch (bool lsc_valid) {
///     case TRUE:
///         uint32_t    lsc_stripe_count;
///     case FALSE:
///         void;
/// };
///
/// union pnfs_lov_stripe_size_hint4 switch (bool lss_valid) {
///     case TRUE:
///         uint32_t    lss_stripe_size;
///     case FALSE:
///         void;
/// };
///
/// union pnfs_lov_stripe_offset_hint4 switch (bool lso_valid) {
///     case TRUE:
///         uint32_t    lso_stripe_offset;
///     case FALSE:
///         void;
/// };
///
/// union pnfs_lov_stripe_pattern_hint4 switch (bool lsp_valid) {
///     case TRUE:
///         uint32_t    lsp_stripe_pattern;
///     case FALSE:
///         void;
/// };
///
/// union pnfs_lov_pool_hint4 switch (bool lp_valid) {
///     case TRUE:
///         string      lp_pool_name<>;
///     case FALSE:
///         void;
/// };
///
/// struct pnfs_los_layouthint4 {
///     pnfs_lov_stripe_count_hint4     lov_stripe_count_hint;
///     pnfs_lov_stripe_size_hint4      lov_stripe_size_hint;
///     pnfs_lov_stripe_offset_hint4    lov_stripe_offset_hint;
///     pnfs_lov_stripe_pattern_hint4   lov_stripe_pattern_hint;
///     pnfs_lov_pool_hint4             lov_pool_hint;
/// };
///

"pnfs_los_layouthint4" conveys hints for the desired data map. The
hints express the client's preferences for the striping to be used for
the file. All parameters are optional, so the client can give values
for only the parameters it cares about.

"lov_stripe_count_hint", "lov_stripe_size_hint",
"lov_stripe_offset_hint" and "lov_stripe_pattern_hint" tell the server
that the client wants to create a file with the corresponding stripe
count, stripe size, stripe offset and stripe pattern. "lov_pool_hint"
tells the server that the client wants to create the file within a
specific OST pool.
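
As a non-normative illustration, the optional parameters above could
be serialized as sketched below. The sketch assumes the standard XDR
rules of [2]: a bool discriminant and a uint32_t are each four
big-endian bytes, and a string is a four-byte length followed by the
bytes, zero-padded to a multiple of four. It produces only the value
carried in "loh_body", without the length prefix of the enclosing
opaque. The function names are hypothetical and are not part of this
specification.

```python
import struct


def _opt_uint32(value):
    """Encode a bool-discriminated union holding an optional uint32_t."""
    if value is None:
        return struct.pack(">I", 0)          # FALSE arm: void
    return struct.pack(">II", 1, value)      # TRUE arm: the value


def _opt_string(value):
    """Encode a bool-discriminated union holding an optional string<>."""
    if value is None:
        return struct.pack(">I", 0)          # FALSE arm: void
    data = value.encode("ascii")
    padding = (4 - len(data) % 4) % 4        # XDR pads to 4-byte units
    return struct.pack(">II", 1, len(data)) + data + b"\x00" * padding


def encode_los_layouthint(stripe_count=None, stripe_size=None,
                          stripe_offset=None, stripe_pattern=None,
                          pool_name=None):
    """Encode pnfs_los_layouthint4; None means "no hint given"."""
    return (_opt_uint32(stripe_count)
            + _opt_uint32(stripe_size)
            + _opt_uint32(stripe_offset)
            + _opt_uint32(stripe_pattern)
            + _opt_string(pool_name))
```

A client that only cares about the stripe count and the pool would
call, for example, encode_los_layouthint(stripe_count=4,
pool_name="fast"); every other hint is then encoded as a FALSE
discriminant with no value.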

The server should make an attempt to honor the hints, but it can
ignore any or all of them at its own discretion and without failing
the respective CREATE operation.

8. IANA Considerations

As described in NFSv4.1 ([8]), new layout type numbers are assigned by
IANA. This document defines the protocol associated with a new layout
type number, LAYOUT4_OSS_OBJECTS, and requires IANA to assign a new
value for it.

9. References

9.1. Normative References

[1]  http://www.scribd.com/doc/59271212/Understanding-Lustre-File-System-Internals.
     Lustre source code is hosted at
     http://git.whamcloud.com/?p=fs/lustre-release.git;a=summary.
     The Lustre client code is also in the process of being merged
     into the Linux kernel:
     https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/staging

[2]  Eisler, M., "XDR: External Data Representation Standard",
     STD 67, RFC 4506, May 2006.

[3]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
     "Network File System (NFS) Version 4 Minor Version 1
     External Data Representation Standard (XDR) Description",
     RFC 5662, January 2010.

[4]  IETF Trust, "Legal Provisions Relating to IETF Documents",
     November 2008,
     http://trustee.ietf.org/docs/IETF-Trust-License-Policy.pdf.

[5]  Kent, S. and K. Seo, "Security Architecture for the
     Internet Protocol", RFC 4301, December 2005.

[6]  Bradner, S., "Key words for use in RFCs to Indicate
     Requirement Levels", BCP 14, RFC 2119, March 1997.

[7]  Linn, J., "Generic Security Service Application Program
     Interface Version 2, Update 1", RFC 2743, January 2000.

[8]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
     "Network File System (NFS) Version 4 Minor Version 1
     Protocol", RFC 5661, January 2010.

[9]  Eisler, M., "IANA Considerations for Remote Procedure Call
     (RPC) Network Identifiers and Universal Address Formats",
     RFC 5665, January 2010.

[10] LOV and OSC.
     http://wiki.lustre.org/lid/ulfi/ulfi_lov_osc.html

[11] Lustre Distributed Lock Manager.
     http://wiki.lustre.org/lid/agi/agi_ldlm.html

[12] Portal RPC.
     http://wiki.lustre.org/lid/agi/agi_ptlrpc.html

[13] Lustre Networking.
     http://wiki.lustre.org/lid/agi/agi_lnet.html

This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Sorin Faibish (editor)
   EMC Corporation
   228 South Street
   Hopkinton, MA 01748
   US

   Phone: +1 (508) 249-5745
   Email: sfaibish@emc.com

   Peng Tao
   EMC Corporation
   8F, Block D, SP Tower
   Tsinghua Science Park
   Zhongguancun Dong Road
   Beijing 100084
   PRC

   Phone: +86 (10) 8215 8293
   Email: tao.peng@emc.com