NFSv4                                                          M. Eisler
Internet-Draft                                                    NetApp
Intended status: Standards Track                        October 27, 2008
Expires: April 30, 2009


                       Metadata Striping for pNFS
               draft-eisler-nfsv4-pnfs-metastripe-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 30, 2009.

Abstract

   This Internet-Draft describes a means to add metadata striping to
   pNFS.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction and Motivation
   2.  Terminology
   3.  Scope of Metadata Striping
   4.  The Definition of Metadata Striping Layout
     4.1.  Name of Metadata Striping Layout Type
     4.2.  Value of Metadata Striping Layout Type
     4.3.  Definition of the da_addr_body Field of the device_addr4
           Data Type
     4.4.  Definition of the loh_body Field of the layouthint4 Data
           Type
     4.5.  Definition of the loc_body Field of the layout_content4
           Data Type
     4.6.  Definition of the lou_body Field of the layoutupdate4 Data
           Type
     4.7.  Storage Access Protocols
     4.8.  Revocation of Layouts
     4.9.  Stateids
     4.10. Lease Terms
     4.11. Layout Operations Sent to an L-MDS
     4.12. Filehandles in Metadata Layouts
     4.13. READ and WRITE Operations
     4.14. Recovery
       4.14.1.  Failure and Restart of Client
       4.14.2.  Failure and Restart of Server
       4.14.3.  Failure and Restart of Storage Device
   5.  Negotiation
   6.  Operational Recommendation for Deployment
   7.  Acknowledgements
   8.  Security Considerations
   9.  IANA Considerations
   10. Normative References
   Author's Address
   Intellectual Property and Copyright Statements

1.  Introduction and Motivation

   The NFSv4.1 specification describes pNFS [2].  In NFSv4.1, pNFS is
   limited to the data contents of regular files.  The content of
   regular files is distributed (striped) across multiple storage
   devices.  Metadata is not distributed or striped, and indeed, the
   model presented in the NFSv4.1 specification is that of a single
   metadata server.  This document describes a means to add metadata
   striping to pNFS, which includes the notion of multiple metadata
   servers.  With metadata striping, multiple metadata servers may work
   together to provide higher parallel performance.

   This document does not require a new minor version of NFSv4.
   Instead, it requires a new layout type.

   The XDR description is provided in this document in a way that makes
   it simple for the reader to extract into a ready-to-compile form.
   The reader can feed this document into the following shell script to
   produce the machine-readable XDR description of the metadata layout:

   #!/bin/sh
   grep "^ *///" | sed 's?^ */// ??' | sed 's?^.*///??'

   That is, if the above script is stored in a file called "extract.sh",
   and this document is in a file called "spec.txt", then the reader can
   do:

   sh extract.sh < spec.txt > md.x

   The effect of the script is to remove leading white space from each
   line of the specification, plus a sentinel sequence of "///".

2.  Terminology

   o  Initial Metadata Server (I-MDS).  The I-MDS is the metadata server
      the client obtains a filehandle from prior to acquiring any layout
      on the file.

   o  Layout Metadata Server (L-MDS).  The L-MDS is the metadata server
      the client obtains a filehandle from after direction from a
      layout.

   o  Regular file: An object of file type NF4REG or NF4NAMEDATTR.

3.  Scope of Metadata Striping

   This proposal assumes a model where there are two or more servers
   capable of supporting NFSv4.1 operations.  At least one server is an
   I-MDS, and the I-MDS should be thought of as a normal NFSv4.1 server,
   with the additional capability of granting metadata layouts on
   demand.  The I-MDS might also be capable of granting non-metadata
   layouts, but this is irrelevant to the scope of metadata striping.
   The model also requires at least one additional server, an L-MDS,
   that is capable of supporting NFSv4.1 operations that are directed to
   the server by the I-MDS.  It is permissible for an I-MDS to also be
   an L-MDS, and an L-MDS to also be an I-MDS.  Indeed, a simple
   submodel is for every NFSv4.1 server in a set to be both an I-MDS and
   L-MDS.

   Metadata striping applies to all NFSv4.1 operations that operate on
   file objects.  These operations can be broken down into three
   classes:

   o  Filehandle-only.  These are operations that take just filehandles
      as arguments, i.e., the current filehandle, or both the current
      filehandle and the saved filehandle, and no component names of
      files.
      When a client obtains a filehandle of a file object from
      an NFS server, it can obtain a metadata layout that indicates the
      optimal destination in the network to send filehandle-only
      operations for that file object.  For example, after obtaining the
      filehandle via OPEN, and the metadata layout via LAYOUTGET, the
      client wants to get a byte range lock on the file.  The client
      sends the LOCK request to the network address specified in the
      metadata layout.

   o  Name-based.  These are operations that take one or two filehandles
      (i.e., the current filehandle, or both the current filehandle and
      the saved filehandle) and one or two component names of files.
      When a client obtains a filehandle of a file object that is of
      type directory, it can obtain a metadata layout that indicates the
      optimal destinations in the network to send name-based operations
      for that directory.  The optimal destinations MUST apply to the
      current filehandle that the operation uses.  In other words, for
      LINK and RENAME, which take both the saved filehandle and the
      current filehandle as parameters, the pNFS client would use the
      metadata layout of the target directory (indicated in the current
      filehandle) for guidance on where to send the operation.  Note
      that if an L-MDS accepts a LINK or RENAME operation, the L-MDS
      MUST perform the operation atomically.  If it cannot, then the
      L-MDS MUST return the error NFS4ERR_XDEV, and the client MUST send
      the operation to the I-MDS.

      The choice of destination is a function of the name the client is
      requesting.  For example, after the client obtains the filehandle
      of a directory via LOOKUP and the metadata layout via LAYOUTGET,
      the client wants to open a regular file within the directory.  As
      with the LAYOUT4_NFSV4_1_FILES layout type, the client has a list
      of network addresses to send requests to.  With the
      LAYOUT4_NFSV4_1_FILES layout, the choice of the index in the list
      of network addresses is computed from the offset of the read or
      write request.  With the metadata layout, the choice of the index
      is derived from the name (or some other method, such as the name
      and one or more attributes of the directory, such as the
      filehandle, fileid, etc.) passed to OPEN.

   o  Directory-reading.  These are operations that take one filehandle
      and return the contents of a directory (currently, NFSv4 has just
      one such operation, READDIR).  When a client obtains a filehandle
      of a file object that is of type directory, it can obtain a
      metadata layout that indicates the optimal destination in the
      network to send directory-reading operations for that directory.
      For example, after the client obtains the filehandle of a
      directory via LOOKUP and the metadata layout via LAYOUTGET, the
      client wants to read the directory.  As with the
      LAYOUT4_NFSV4_1_FILES layout type, the client has a list of
      network addresses to send requests to.  With the
      LAYOUT4_NFSV4_1_FILES layout, the choice of the index in the list
      of network addresses is computed from the offset of the read or
      write request.  Since directories have cookies which resemble
      offsets, the choice of the index is computed from the "cookie"
      argument to the operation.

4.  The Definition of Metadata Striping Layout

4.1.  Name of Metadata Striping Layout Type

   The name of the metadata striping layout type is LAYOUT4_METADATA.

4.2.  Value of Metadata Striping Layout Type

   The value of the metadata striping layout type is TBD1.

4.3.  Definition of the da_addr_body Field of the device_addr4 Data Type

   /// %#include "nfs4_prot.h"
   /// union md_layout_addr4 switch (bool mdla_simple) {
   ///     case TRUE:
   ///         multipath_list4 mdla_simple_addr;
   ///     case FALSE:
   ///         nfsv4_1_file_layout_ds_addr4 mdla_complex_addr;
   /// };

                                 Figure 1

   If mdla_simple is TRUE, the remainder of the device address contains
   a list of elements (mdla_simple_addr), where each element represents
   a network address of an L-MDS which can serve equally as the target
   of metadata operations (typically the filehandle-only operations).
   See Section 13.5 of [2] for a description of how the multipath_list4
   data type supports multi-pathing.

   If mdla_simple is FALSE, the remainder of the device address is the
   same as the LAYOUT4_NFSV4_1_FILES device address, consisting of an
   array of lists of L-MDSes (nflda_multipath_ds_list), and an array of
   indices (nflda_stripe_indices).  Each element of
   nflda_multipath_ds_list contains one or more subelements, and each
   subelement represents a network address of an L-MDS which may serve
   equally as the target of name-based and directory-reading operations
   (see Section 13.5 of [2]).  The number of elements in the
   nflda_multipath_ds_list array might be different than the stripe
   count.  The stripe count is the number of elements in
   nflda_stripe_indices.  The value of each element of
   nflda_stripe_indices is an index into nflda_multipath_ds_list, and
   thus the value of each element of nflda_stripe_indices MUST be less
   than the number of elements in nflda_multipath_ds_list.

4.4.  Definition of the loh_body Field of the layouthint4 Data Type

   /// enum md_layout_hint_care4 {
   ///     MD4_CARE_STRIPE_UNIT_SIZE    = 0x040,
   ///     MD4_CARE_STRIPE_CNT_NAMEOPS  = 0x080,
   ///     MD4_CARE_STRIPE_CNT_DIRRDOPS = 0x100
   /// };
   /// %
   /// %/* Encoded in the loh_body field of type layouthint4: */
   /// %
   /// struct md_layouthint4 {
   ///     uint32_t    mdlh_care;
   ///     count4      mdlh_stripe_cnt_nameops;
   ///     count4      mdlh_stripe_cnt_dirrdops;
   ///     nfs_cookie4 mdlh_stripe_unit_size;
   /// };

                                 Figure 2

   The layout-type specific content for the LAYOUT4_METADATA layout type
   is composed of four fields.  The first field, mdlh_care, is a set of
   flags indicating which values of the hint the client cares about.  If
   MD4_CARE_STRIPE_CNT_NAMEOPS is set, then the client indicates in the
   second field, mdlh_stripe_cnt_nameops, the preferred stripe count for
   name-based operations.  If MD4_CARE_STRIPE_CNT_DIRRDOPS is set, then
   the client indicates in the third field, mdlh_stripe_cnt_dirrdops,
   the preferred stripe count for directory-reading operations.  If
   MD4_CARE_STRIPE_UNIT_SIZE is set, then the client indicates in the
   fourth field, mdlh_stripe_unit_size, the preferred stripe unit size
   for directory-reading operations.

4.5.  Definition of the loc_body Field of the layout_content4 Data Type

   /// struct md_layout_fhonly {
   ///     deviceid4 mdlf_devid;
   ///     nfs_fh4   mdlf_fh<1>;
   /// };
   ///
   /// struct md_layout_namebased {
   ///     deviceid4 mdln_devid;
   ///     uint32_t  mdln_namebased_alg;
   ///     uint32_t  mdln_first_index;
   ///     nfs_fh4   mdln_fh_list<>;
   /// };
   ///
   /// union md_layout_dirread_fhlist
   ///     switch (bool mdldf_use_namebased) {
   ///     case TRUE:
   ///         void;
   ///     case FALSE:
   ///         nfs_fh4 mdldf_fh_list<>;
   /// };
   ///
   /// struct md_layout_dirread {
   ///     deviceid4                mdld_devid;
   ///     nfs_cookie4              mdld_first_cookie;
   ///     nfs_cookie4              mdld_unit_size;
   ///     uint32_t                 mdld_first_index;
   ///     md_layout_dirread_fhlist mdld_fh_list;
   /// };
   ///
   /// struct md_layout4 {
   ///     md_layout_fhonly    mdl_fhops_layout<1>;
   ///     md_layout_namebased mdl_nameops_layout<1>;
   ///     md_layout_dirread   mdl_dirrdops_layout_segments<>;
   /// };

                                 Figure 3

   The reply to a successful LAYOUTGET request MUST contain exactly one
   element in logr_layout.  The element contains the metadata layout.
   The metadata layout consists of three variable length arrays.  At
   least one of the arrays MUST be of non-zero length.

   o  mdl_fhops_layout.  This is an array of up to one element.  If
      there is one element, the element indicates the preferred set of
      L-MDSes as the target of filehandle-only operations.  The element
      contains two fields: mdlf_devid, the pNFS device ID of the L-MDS,
      and mdlf_fh, an array of up to one filehandle.

      When the client receives a layout that has a mdl_fhops_layout
      array with one element, it uses GETDEVICEINFO to map mdlf_devid to
      a device address of data type md_layout_addr4.  The value of the
      device address field mdla_simple MUST be TRUE.  The client can
      then select any element in mdla_simple_addr as the destination of
      a filehandle-only operation.
      The field mdlf_devid MUST map to a device address with mdla_simple
      set to TRUE.  The current filehandle REQUIRED for use with the
      filehandle-only operation is either mdlf_fh[0] (if and only if
      mdlf_fh has one element) or it is the filehandle the pNFS client
      used as the current filehandle to the LAYOUTGET operation that
      returned the metadata layout.

   o  mdl_nameops_layout.  This is an array of up to one element.  If
      there is one element, the element indicates the preferred set of
      L-MDSes as the target of name-based operations.  The list of
      L-MDSes is mapped from the mdln_devid device ID.  The array
      mdln_fh_list is used to select a filehandle for accessing an
      L-MDS.  The number of elements in this array MUST be one of three
      values:

      *  Zero.  This means that the filehandles used for each L-MDS are
         the same as the filehandle used as the current filehandle to
         LAYOUTGET.

      *  One.  This means that every L-MDS uses the filehandle in
         mdln_fh_list[0].

      *  The same number of elements as
         mdla_complex_addr.nflda_multipath_ds_list.  Thus when sending a
         name-based operation to any L-MDS in
         mdla_complex_addr.nflda_multipath_ds_list[X], the filehandle in
         mdln_fh_list[X] MUST be used.

      The field mdln_first_index is the index of the first element of
      the mdla_complex_addr.nflda_stripe_indices array to use.  The
      field mdln_namebased_alg identifies the algorithm used to compute
      the actual element in the mdla_complex_addr.nflda_stripe_indices
      array to use.

      When the client receives a layout that has a mdl_nameops_layout
      array with one element, it uses GETDEVICEINFO to map mdln_devid to
      a device address of data type md_layout_addr4.  The value of the
      device address field mdla_simple MUST be set to FALSE.
      The client determines the filehandle and the set of L-MDS network
      addresses to which to send a name-based operation via the
      following algorithm:

      let F be the function designated by
         mdln_namebased_alg;

      let X = (x1, x2, x3, ...) some set of inputs for
         function F, such that x1 SHOULD be the
         component name of the file;

      stripe_unit_number = F(X);
      stripe_count = number of elements in
         mdla_complex_addr.nflda_stripe_indices;

      j = (stripe_unit_number + mdln_first_index) %
         stripe_count;

      idx = mdla_complex_addr.nflda_stripe_indices[j];

      fh_count = number of elements in mdln_fh_list;
      lmds_count = number of elements in
         mdla_complex_addr.nflda_multipath_ds_list;

      switch (fh_count) {
      case lmds_count:
         fh = mdln_fh_list[idx];
         break;

      case 1:
         fh = mdln_fh_list[0];
         break;

      case 0:
         fh = current filehandle passed to LAYOUTGET;
         break;

      default:
         throw a fatal exception;
         break;
      }

      address_list =
         mdla_complex_addr.nflda_multipath_ds_list[idx];

                                 Figure 4

      The client would then select an L-MDS from address_list, and send
      the name-based operation using the filehandle specified in fh.

   o  mdl_dirrdops_layout_segments.  This is an array of zero or more
      elements.  Each element indicates the preferred set of L-MDSes
      for directory-reading operations and the pattern by which
      directory-reading operations iterate over the L-MDSes.  The set
      of L-MDSes is mapped from the device ID in the field mdld_devid.
      The field mdld_first_cookie indicates the first directory entry
      cookie a directory-reading operation can use for the first unit
      of the pattern in this element.  E.g., the value of
      mdld_first_cookie can be used as the value of the "cookie" field
      in READDIR4args.  In the first element, mdld_first_cookie MUST be
      zero.
      The last cookie that can be used on the pattern can be no higher
      than one less than the value of mdld_first_cookie of the next
      element.  If there is no next element, then the pattern is valid
      for all cookies from mdld_first_cookie through NFS4_UINT64_MAX
      inclusive.  The field mdld_unit_size indicates the maximum number
      of cookies that can be read from each unit of a pattern, and thus
      indicates the lowest value of the "cookie" field in READDIR4args
      for each unit after the first unit.  For example, if
      mdld_unit_size is 100000, and mdld_first_cookie is zero, then the
      value of the "cookie" field in the READDIR4args of the READDIR
      operation sent to the second unit MUST be greater than or equal to
      100000, and less than 200000.  The field mdld_fh_list is used to
      select a filehandle for accessing an L-MDS.  It is a switched
      union with a boolean discriminator mdldf_use_namebased.  If
      mdldf_use_namebased is TRUE, then the filehandle is selected from
      mdl_nameops_layout.mdln_fh_list.  Otherwise, the filehandle is
      selected from the array mdldf_fh_list.  The number of elements in
      this array MUST be one of three values:

      *  Zero.  This means that the filehandles used for each L-MDS are
         the same as the filehandle used as the current filehandle to
         LAYOUTGET.

      *  One.  This means that every L-MDS uses the filehandle in
         mdldf_fh_list[0].

      *  The same number of elements as
         mdla_complex_addr.nflda_multipath_ds_list.  Thus when sending a
         directory-reading operation to any L-MDS in
         mdla_complex_addr.nflda_multipath_ds_list[X], the filehandle in
         mdldf_fh_list[X] MUST be used.

      The field mdld_first_index is the index of the first element of
      the mdla_complex_addr.nflda_stripe_indices array to use.

      When the client receives a layout that has a
      mdl_dirrdops_layout_segments array with more than zero elements,
      it uses GETDEVICEINFO to map the mdld_devid of each element of the
      array to a device address of data type md_layout_addr4.
      The value of the device address field mdla_simple MUST be set to
      FALSE.  The client determines the filehandle and the set of L-MDS
      network addresses to which to send a directory-reading operation
      via the following algorithm:

      let cookie_arg be the cookie the pNFS client will
         use as the value of the cookie argument to a
         directory reading operation;

      segment_count = number of elements in
         mdl_dirrdops_layout_segments;

      find index k, such that (cookie_arg >=
         mdl_dirrdops_layout_segments[k].mdld_first_cookie)
         && ((k == (segment_count - 1)) || (cookie_arg
         < mdl_dirrdops_layout_segments[k+1].mdld_first_cookie));

      relative_cookie = cookie_arg -
         mdl_dirrdops_layout_segments[k].mdld_first_cookie;

      stripe_unit_number = floor(relative_cookie /
         mdl_dirrdops_layout_segments[k].mdld_unit_size);

      stripe_count = number of elements in
         mdla_complex_addr.nflda_stripe_indices;

      j = (stripe_unit_number +
         mdl_dirrdops_layout_segments[k].mdld_first_index) %
         stripe_count;

      idx = mdla_complex_addr.nflda_stripe_indices[j];

      lmds_count = number of elements in
         mdla_complex_addr.nflda_multipath_ds_list;

      if (mdl_dirrdops_layout_segments[k].mdld_fh_list.
          mdldf_use_namebased == TRUE) {
         fh_count = number of elements in mdln_fh_list;
      } else {
         fh_count = number of elements in
            mdl_dirrdops_layout_segments[k].mdld_fh_list.
            mdldf_fh_list;
      }

      switch (fh_count) {
      case lmds_count:
         if (mdl_dirrdops_layout_segments[k].mdld_fh_list.
             mdldf_use_namebased == TRUE) {
            fh = mdln_fh_list[idx];
         } else {
            fh = mdl_dirrdops_layout_segments[k].mdld_fh_list.
               mdldf_fh_list[idx];
         }
         break;

      case 1:
         if (mdl_dirrdops_layout_segments[k].mdld_fh_list.
             mdldf_use_namebased == TRUE) {
            fh = mdln_fh_list[0];
         } else {
            fh = mdl_dirrdops_layout_segments[k].mdld_fh_list.
               mdldf_fh_list[0];
         }
         break;

      case 0:
         fh = current filehandle passed to LAYOUTGET;
         break;

      default:
         throw a fatal exception;
         break;
      }

      address_list = mdla_complex_addr.
         nflda_multipath_ds_list[idx];

                                 Figure 5

      The client would then select an L-MDS from address_list, and send
      the directory-reading operation using the filehandle specified in
      fh.  When the client is reading the beginning of the directory,
      cookie_arg is always zero.  Subsequent directory-reading
      operations to read the rest of the directory will use the last
      cookie returned by the L-MDS.  An MDS returning a metadata layout
      SHOULD return cookies that can be used directly with the I-MDS
      that returned the layout.  However, this might not always be
      possible.  For example, the directory design of the MDS's
      filesystem might not return cookies in ascending order, or in any
      order at all, whereas striping by definition requires an
      ordering.  In such cases, if a directory is restriped while a pNFS
      client is reading its contents from the L-MDSes, it is possible
      that the client will be unable to finish reading the directory,
      and as a result an error is returned to the process reading the
      directory.  To mitigate this, servers that have sent a
      CB_LAYOUTRECALL on the directory SHOULD NOT revoke the layout as
      long as they detect that the client is completing a read of the
      entire directory.  Once a client has received a CB_LAYOUTRECALL,
      it SHOULD NOT send a directory-reading operation to an L-MDS with
      a cookie argument of zero.  If the server has sent a
      CB_LAYOUTRECALL, the L-MDS SHOULD reject requests to read the
      directory that have a cookie argument of zero and return the error
      NFS4ERR_PNFS_NO_LAYOUT.

4.6.  Definition of the lou_body Field of the layoutupdate4 Data Type

   /// %/*
   /// % * LAYOUT4_METADATA.
   /// % * Encoded in the lou_body field of type layoutupdate4:
   /// % *      Nothing. lou_body is a zero length array of octets.
   /// % */
   /// %

                                 Figure 6

   The LAYOUT4_METADATA layout type has no content for the lou_body
   field of the layoutupdate4 data type.

4.7.  Storage Access Protocols

   The LAYOUT4_METADATA layout type uses NFSv4.1 operations (and
   potentially, operations of higher minor versions of NFSv4, subject to
   the definition of a minor version of NFSv4) to access striped
   metadata.  The LAYOUT4_METADATA layout type does not affect access to
   storage devices.  Thus a client might be able to obtain both a
   LAYOUT4_METADATA layout, and a non-LAYOUT4_METADATA layout type
   (e.g., LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or
   LAYOUT4_BLOCK_VOLUME) on the same regular file.  Of course, for a
   non-regular file, a pNFS client will be unable to get layouts of
   types LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or
   LAYOUT4_BLOCK_VOLUME.

4.8.  Revocation of Layouts

   Servers MAY revoke layouts of type LAYOUT4_METADATA.  A client
   detects that a layout has been revoked when an operation is rejected
   with NFS4ERR_PNFS_NO_LAYOUT.  In NFSv4.1, the error
   NFS4ERR_PNFS_NO_LAYOUT could be returned only by READ and WRITE.
   When the server returns a layout of type LAYOUT4_METADATA, the set of
   operations that can return NFS4ERR_PNFS_NO_LAYOUT is: ACCESS, CLOSE,
   COMMIT, CREATE, DELEGRETURN, GETATTR, LINK, LOCK, LOCKT, LOCKU,
   LOOKUP, LOOKUPP, NVERIFY, OPEN, OPENATTR, OPEN_DOWNGRADE, READ,
   READDIR, READLINK, REMOVE, RENAME, SECINFO, SETATTR, VERIFY, WRITE,
   GET_DIR_DELEGATION, SECINFO_NO_NAME, and WANT_DELEGATION.

4.9.  Stateids

   The pNFS specification for LAYOUT4_NFSV4_1_FILES states that data
   servers MUST be aware of the stateids granted by the MDS so that the
   stateids passed to READ and WRITE can be properly validated.
   This requirement extends to the LAYOUT4_METADATA layout type: the
   L-MDS MUST be aware of any non-layout stateids granted by the I-MDS,
   if and only if the client is in contact with the L-MDS under
   direction of a metadata layout returned by the I-MDS, and the I-MDS
   has not recalled or revoked that layout.  In addition, because an
   L-MDS can accept operations like OPEN and LOCK that create or modify
   stateids, the I-MDS MUST be aware of stateids that an L-MDS has
   returned to a client, if and only if the I-MDS granted the client a
   metadata layout that directed the client to the L-MDS.

   In some cases, one L-MDS MUST be aware of a stateid generated by
   another L-MDS.  For example, a client can obtain a stateid from the
   L-MDS serving as the destination of name-based operations, which
   include OPEN.  However, operations that use the stateid will be
   filehandle-only operations, and the L-MDS the OPEN operation is sent
   to might differ from the L-MDS the LOCK operation for the same target
   file is sent to.

4.10.  Lease Terms

   Any state the client obtains from an I-MDS or L-MDS is guaranteed to
   last for an interval as long as the maximum of the lease_time
   attributes of the I-MDS and of any L-MDS the client is directed to as
   the result of a metadata layout.  The client has a lease for each
   client ID it has with an I-MDS or L-MDS, and each lease MUST be
   renewed separately for each client ID.

4.11.  Layout Operations Sent to an L-MDS

   An L-MDS MAY allow a LAYOUTGET operation.  One reason the L-MDS might
   allow a LAYOUTGET operation is to allow hierarchical striping.  For
   example, for name-based operations, the pNFS server might use a radix
   tree (which the field mdln_namebased_alg would indicate).  The first
   four bytes of the component name would be combined to form a 32-bit
   stripe_unit_number.
   Once the client has contacted the L-MDS, it would repeat the
   algorithm on the second four bytes of the component name, and so on
   until the component name is exhausted.

   Once an L-MDS grants a layout, the client MUST use only the L-MDS
   that granted the layout to send LAYOUTUPDATE, LAYOUTCOMMIT, and
   LAYOUTRETURN.

4.12.  Filehandles in Metadata Layouts

   The filehandles returned in a metadata layout are subject to becoming
   stale at any time.  The L-MDS SHOULD NOT return NFS4ERR_STALE unless
   the I-MDS has recalled or revoked the corresponding layout.

4.13.  READ and WRITE Operations

   READ and WRITE are filehandle-only operations, and thus the pNFS
   client SHOULD attempt to obtain a non-metadata layout for a regular
   file.  If it cannot, then it MAY use the metadata layout to send READ
   and WRITE operations to an L-MDS.  An L-MDS MUST accept a READ or
   WRITE operation if the layout the I-MDS returned to the client
   included a filehandle-only layout.

4.14.  Recovery

   [[Comment.1: it is likely this section will follow that of the files
   layout type specified in the NFSv4.1 specification.]]

4.14.1.  Failure and Restart of Client

   TBD

4.14.2.  Failure and Restart of Server

   TBD

4.14.3.  Failure and Restart of Storage Device

   TBD

5.  Negotiation

   A pNFS client sends a GETATTR operation for attribute
   fs_layout_type.  If the reply contains the metadata layout type, then
   metadata striping is supported, subject to further verification by a
   LAYOUTGET operation.  If not, the client cannot use metadata
   striping.

6.  Operational Recommendation for Deployment

   Deploy the metadata striping layout when it is anticipated that the
   workload will involve a high fraction of non-I/O operations on
   filehandles.

7.  Acknowledgements

   Brent Welch had the idea of returning a separate device ID for
   filehandle-only operations in the metadata layout.
   Pranoop Erasani, Dave Noveck, and Richard Jernigan provided valuable
   feedback.

8.  Security Considerations

   The security considerations of Section 13.12 of [2] that are
   specific to data servers apply to L-MDSes.  In addition, each L-MDS
   server and client are, respectively, a complete NFSv4.1 server and
   client, and so the security considerations of [2] apply to any
   client or server using the metadata layout type.

9.  IANA Considerations

   This specification requires an addition to the Layout Types registry
   described in Section 22.4 of [2].  The five fields added to the
   registry are:

   1.  Name of layout type: LAYOUT4_METADATA

   2.  Value of layout type: TBD1.

   3.  Standards Track RFC that describes this layout: RFCTBD2, which
       is the RFC of this document.

   4.  How the RFC Introduces the specification: L.

   5.  Minor versions of NFSv4 that can use the layout type: 1.

   This specification requires the creation of a registry of hash
   algorithms for supporting the field mdln_namebased_alg.  Details
   TBD.

10.  Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", RFC 2119, March 1997.

   [2]  Shepler, S., Eisler, M., and D. Noveck, "NFS Version 4 Minor
        Version 1", draft-ietf-nfsv4-minorversion1-26 (work in
        progress), September 2008.

Author's Address

   Mike Eisler
   NetApp
   5765 Chase Point Circle
   Colorado Springs, CO  80919
   US

   Phone: +1-719-599-9026
   Email: mike@eisler.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.
   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.