NFSv4                                                         G. Goodson
Internet-Draft                                                    NetApp
Expires: April 10, 2006                                          B. Welch
                                                                 B. Halevy
                                                                   Panasas
                                                                  D. Black
                                                                       EMC
                                                                A. Adamson
                                                                      CITI
                                                           October 7, 2005

                          NFSv4 pNFS Extensions
                       draft-ietf-nfsv4-pnfs-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 10, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This Internet-Draft provides a description of the pNFS extension for
   NFSv4.

   The key feature of the protocol extension is the ability for clients
   to perform read and write operations that go directly from the
   client to individual storage system elements without funneling all
   such accesses through a single file server.  Of course, the file
   server must provide sufficient coordination of the client I/O so
   that the file system retains its integrity.

   The extension adds operations that query and manage layout
   information that allows parallel I/O between clients and storage
   system elements.  The layouts are managed in a similar way to
   delegations in that they are associated with leases and can be
   recalled by the server, but layout information is independent of
   delegations.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Introduction
   2.  General Definitions
     2.1   Metadata Server
     2.2   Client
     2.3   Storage Device
     2.4   Storage Protocol
     2.5   Control Protocol
     2.6   Metadata
     2.7   Layout
   3.  pNFS protocol semantics
     3.1   Definitions
       3.1.1   Layout Types
       3.1.2   Layout Iomode
       3.1.3   Layout Segments
       3.1.4   Device IDs
       3.1.5   Aggregation Schemes
     3.2   Guarantees Provided by Layouts
     3.3   Getting a Layout
     3.4   Committing a Layout
       3.4.1   LAYOUTCOMMIT and mtime/atime/change
       3.4.2   LAYOUTCOMMIT and size
       3.4.3   LAYOUTCOMMIT and layoutupdate
     3.5   Recalling a Layout
       3.5.1   Basic Operation
       3.5.2   Recall Callback Robustness
       3.5.3   Recall/Return Sequencing
     3.6   Metadata Server Write Propagation
     3.7   Crash Recovery
       3.7.1   Leases
       3.7.2   Client Recovery
       3.7.3   Metadata Server Recovery
       3.7.4   Storage Device Recovery
   4.  Security Considerations
     4.1   File Layout Security
     4.2   Object Layout Security
     4.3   Block/Volume Layout Security
   5.  The NFSv4 File Layout Type
     5.1   File Striping and Data Access
       5.1.1   Sparse and Dense Storage Device Data Layouts
       5.1.2   Metadata and Storage Device Roles
       5.1.3   Device Multipathing
       5.1.4   Operations Issued to Storage Devices
     5.2   Global Stateid Requirements
     5.3   The Layout Iomode
     5.4   Storage Device State Propagation
       5.4.1   Lock State Propagation
       5.4.2   Open-mode Validation
       5.4.3   File Attributes
     5.5   Storage Device Component File Size
     5.6   Crash Recovery Considerations
     5.7   Security Considerations
     5.8   Alternate Approaches
   6.  pNFS Typed Data Structures
     6.1   pnfs_layouttype4
     6.2   pnfs_deviceid4
     6.3   pnfs_deviceaddr4
     6.4   pnfs_devlist_item4
     6.5   pnfs_layout4
     6.6   pnfs_layoutupdate4
     6.7   pnfs_layouthint4
     6.8   pnfs_layoutiomode4
   7.  pNFS File Attributes
     7.1   pnfs_layouttype4<> FS_LAYOUT_TYPES
     7.2   pnfs_layouttype4<> FILE_LAYOUT_TYPES
     7.3   pnfs_layouthint4 FILE_LAYOUT_HINT
     7.4   uint32_t FS_LAYOUT_PREFERRED_BLOCKSIZE
     7.5   uint32_t FS_LAYOUT_PREFERRED_ALIGNMENT
   8.  pNFS Error Definitions
   9.  pNFS Operations
     9.1   LAYOUTGET - Get Layout Information
     9.2   LAYOUTCOMMIT - Commit writes made using a layout
     9.3   LAYOUTRETURN - Release Layout Information
     9.4   GETDEVICEINFO - Get Device Information
     9.5   GETDEVICELIST - Get List of Devices
   10.  Callback Operations
     10.1  CB_LAYOUTRECALL
     10.2  CB_SIZECHANGED
   11.  Layouts and Aggregation
     11.1  Simple Map
     11.2  Block Extent Map
     11.3  Striped Map (RAID 0)
     11.4  Replicated Map
     11.5  Concatenated Map
     11.6  Nested Map
   12.  References
     12.1  Normative References
     12.2  Informative References
        Authors' Addresses
   A.  Acknowledgments
        Intellectual Property and Copyright Statements

1.  Introduction

   The NFSv4 protocol [2] specifies the interaction between a client
   that accesses files and a server that provides access to files and
   is responsible for coordinating access by multiple clients.  As
   described in the pNFS problem statement, this requires that all
   access to a set of files exported by a single NFSv4 server be
   performed by that server; at high data rates the server may become
   a bottleneck.

   The parallel NFS (pNFS) extensions to NFSv4 allow data accesses to
   bypass this bottleneck by permitting direct client access to the
   storage devices containing the file data.  When file data for a
   single NFSv4 server is stored on multiple and/or higher-throughput
   storage devices (by comparison to the server's throughput
   capability), the result can be significantly better file access
   performance.  The relationship among multiple clients, a single
   server, and multiple storage devices for pNFS (server and clients
   have access to all storage devices) is shown in this diagram:

      +-----------+
      |+-----------+                              +-----------+
      ||+-----------+                             |           |
      |||           |        NFSv4 + pNFS         |           |
      +||  Clients  |<--------------------------->|   Server  |
       +|           |                             |           |
        +-----------+                             |           |
           |||                                    +-----------+
           |||                                         |
           |||                                         |
           ||| Storage             +-----------+       |
           ||| Protocol            |+-----------+      |
           ||+---------------------||+-----------+     | Control
           |+----------------------|||           |     | Protocol
           +-----------------------+||  Storage  |-----+
                                    +|  Devices  |
                                     +-----------+

                                 Figure 1

   In this structure, the responsibility for coordination of file
   access by multiple clients is shared among the server, clients, and
   storage devices.  This is in contrast to NFSv4 without pNFS
   extensions, in which this is primarily the server's responsibility,
   some of which can be delegated to clients under strictly specified
   conditions.
209 The pNFS extension to NFSv4 takes the form of new operations that 210 manage data location information called a "layout". The layout is 211 managed in a similar fashion as NFSv4 data delegations (e.g., they 212 are recallable and revocable). However, they are distinct 213 abstractions and are manipulated with new operations that are 214 described in Section 9. When a client holds a layout, it has rights 215 to access the data directly using the location information in the 216 layout. 218 There are new attributes that describe general layout 219 characteristics. However, much of the required information cannot be 220 managed solely within the attribute framework, because it will need 221 to have a strictly limited term of validity, subject to invalidation 222 by the server. This requires the use of new operations to obtain, 223 return, recall, and modify layouts, in addition to new attributes. 225 This document specifies both the NFSv4 extensions required to 226 distribute file access coordination between the server and its 227 clients and a NFSv4 file storage protocol that may be used to access 228 data stored on NFSv4 storage devices. 230 Storage protocols used to access a variety of other storage devices 231 are deliberately not specified here. These might include: 233 o Block/volume protocols such as iSCSI ([4]), and FCP ([5]). The 234 block/volume protocol support can be independent of the addressing 235 structure of the block/volume protocol used, allowing more than 236 one protocol to access the same file data and enabling 237 extensibility to other block/volume protocols. 239 o Object protocols such as OSD over iSCSI or Fibre Channel [6]. 241 o Other storage protocols, including PVFS and other file systems 242 that are in use in HPC environments. 244 pNFS is designed to accommodate these protocols and be extensible to 245 new classes of storage protocols that may be of interest. 247 The distribution of file access coordination between the server and 248 its clients increases the level of responsibility placed on clients. 249 Clients are already responsible for ensuring that suitable access 250 checks are made to cached data and that attributes are suitably 251 propagated to the server. Generally, a misbehaving client that hosts 252 only a single-user can only impact files accessible to that single 253 user. Misbehavior by a client hosting multiple users may impact 254 files accessible to all of its users. NFSv4 delegations increase the 255 level of client responsibility as a client that carries out actions 256 requiring a delegation without obtaining that delegation will cause 257 its user(s) to see unexpected and/or incorrect behavior. 259 Some uses of pNFS extend the responsibility of clients beyond 260 delegations. In some configurations, the storage devices cannot 261 perform fine-grained access checks to ensure that clients are only 262 performing accesses within the bounds permitted to them by the pNFS 263 operations with the server (e.g., the checks may only be possible at 264 file system granularity rather than file granularity). In situations 265 where this added responsibility placed on clients creates 266 unacceptable security risks, pNFS configurations in which storage 267 devices cannot perform fine-grained access checks SHOULD NOT be used. 268 All pNFS server implementations MUST support NFSv4 access to any file 269 accessible via pNFS in order to provide an interoperable means of 270 file access in such situations. 
See Section 4 on Security for 271 further discussion. 273 Finally, there are issues about how layouts interact with the 274 existing NFSv4 abstractions of data delegations and byte range 275 locking. These issues, and others, are also discussed here. 277 2. General Definitions 279 This protocol extension partitions the NFSv4 file system protocol 280 into two parts, the control path and the data path. The control path 281 is implemented by the extended (p)NFSv4 server. When the file system 282 being exported by (p)NFSv4 uses storage devices that are visible to 283 clients over the network, the data path may be implemented by direct 284 communication between the extended (p)NFSv4 file system client and 285 the storage devices. This leads to a few new terms used to describe 286 the protocol extension and some clarifications of existing terms. 288 2.1 Metadata Server 290 A pNFS "server" or "metadata server" is a server as defined by 291 RFC3530 [2], which additionally provides support of the pNFS minor 292 extension. When using the pNFS NFSv4 minor extension, the metadata 293 server may hold only the metadata associated with a file, while the 294 data can be stored on the storage devices. However, similar to 295 NFSv4, data may also be written through the metadata server. Note: 296 directory data is always accessed through the metadata server. 298 2.2 Client 300 A pNFS "client" is a client as defined by RFC3530 [2], with the 301 addition of supporting the pNFS minor extension server protocol and 302 with the addition of supporting at least one storage protocol for 303 performing I/O directly to storage devices. 305 2.3 Storage Device 307 This is a device, or server, that controls the file's data, but 308 leaves other metadata management up to the metadata server. A 309 storage device could be another NFS server, or an Object Storage 310 Device (OSD) or a block device accessed over a SAN (e.g., either 311 FiberChannel or iSCSI SAN). The goal of this extension is to allow 312 direct communication between clients and storage devices. 314 2.4 Storage Protocol 316 This is the protocol between the pNFS client and the storage device 317 used to access the file data. Three following types have been 318 described: file protocols (e.g., NFSv4), object protocols (e.g., 319 OSD), and block/volume protocols (e.g., based on SCSI-block 320 commands). These protocols are in turn realizable over a variety of 321 transport stacks. We anticipate there will be variations on these 322 storage protocols, including new protocols that are unknown at this 323 time or experimental in nature. The details of the storage protocols 324 will be described in other documents so that pNFS clients can be 325 written to use these storage protocols. Use of NFSv4 itself as a 326 file-based storage protocol is described in Section 5. 328 2.5 Control Protocol 330 This is a protocol used by the exported file system between the 331 server and storage devices. Specification of such protocols is 332 outside the scope of this draft. Such control protocols would be 333 used to control such activities as the allocation and deallocation of 334 storage and the management of state required by the storage devices 335 to perform client access control. The control protocol should not be 336 confused with protocols used to manage LUNs in a SAN and other 337 sysadmin kinds of tasks. 339 While the pNFS protocol allows for any control protocol, in practice 340 the control protocol is closely related to the storage protocol. 
   For example, if the storage devices are NFS servers, then the
   protocol between the pNFS metadata server and the storage devices
   is likely to involve NFS operations.  Similarly, when object
   storage devices are used, the pNFS metadata server will likely use
   iSCSI/OSD commands to manipulate storage.

   However, this document does not mandate any particular control
   protocol.  Instead, it just describes the requirements on the
   control protocol for maintaining attributes like modify time, the
   change attribute, and the end-of-file position.

2.6 Metadata

   This is information about a file, like its name, owner, where it is
   stored, and so forth.  The information is managed by the exported
   file system server (metadata server).  Metadata also includes
   lower-level information like block addresses and indirect block
   pointers.  Depending on the storage protocol, block-level metadata
   may be managed by the metadata server, or it may instead be managed
   by Object Storage Devices or other servers acting as storage
   devices.

2.7 Layout

   A layout defines how a file's data is organized on one or more
   storage devices.  There are many possible layout types.  They vary
   in the storage protocol used to access the data, and in the
   aggregation scheme that lays out the file data on the underlying
   storage devices.  Layouts are described in more detail below.

3. pNFS protocol semantics

   This section describes the semantics of the pNFS protocol extension
   to NFSv4; this is the protocol between the client and the metadata
   server.

3.1 Definitions

   This sub-section defines a number of terms necessary for describing
   layouts and their semantics.  In addition, it more precisely
   defines how layouts are identified and how they can be composed of
   smaller granularity layout segments.

3.1.1 Layout Types

   A layout describes the mapping of a file's data to the storage
   devices that hold the data.  A layout is said to belong to a
   specific "layout type" (see Section 6.1 for its RPC definition).
   The layout type allows for variants to handle different storage
   protocols (e.g., block/volume [7], object [8], and file [Section 5]
   layout types).  A metadata server, along with its control protocol,
   must support at least one layout type.  A private sub-range of the
   layout type name space is also defined.  Values from the private
   layout type range can be used for internal testing or
   experimentation.

   As an example, a file layout type could be an array of tuples
   (e.g., deviceID, file_handle), along with a definition of how the
   data is stored across the devices (e.g., striping).  A block/volume
   layout might be an array of tuples of the form <deviceID,
   block_number, block_count>, along with information about block size
   and the file offset of the first block.  An object layout might be
   an array of <deviceID, objectID> tuples and an additional structure
   (i.e., the aggregation map) that defines how the logical byte
   sequence of the file data is serialized into the different objects.
   Note, the actual layouts are more complex than these simple
   expository examples.

   This document defines an NFSv4 file layout type using a stripe-
   based aggregation scheme (see Section 5).  Adjunct specifications
   are being drafted that precisely define other layout formats (e.g.,
   block/volume [7] and object [8] layouts) to allow interoperability
   among clients and metadata servers.
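   As an illustration only, the expository examples above can be
   sketched as simple data structures.  The following Python fragment
   is a sketch, not the pnfs_* XDR types defined in Section 6; the
   field names are assumptions chosen to mirror the tuples described
   in the preceding paragraph.

      # Illustrative stand-ins for the three expository layout classes
      # described above; not the XDR defined later in this document.
      from dataclasses import dataclass
      from typing import List

      @dataclass
      class FileLayoutEntry:            # <deviceID, filehandle>
          device_id: int
          filehandle: bytes

      @dataclass
      class FileLayout:
          stripe_unit: int              # how data is striped across devices
          entries: List[FileLayoutEntry]

      @dataclass
      class BlockExtent:                # <deviceID, block_number, block_count>
          device_id: int
          block_number: int
          block_count: int

      @dataclass
      class BlockLayout:
          block_size: int
          first_block_file_offset: int
          extents: List[BlockExtent]

      @dataclass
      class ObjectEntry:                # <deviceID, objectID>
          device_id: int
          object_id: int

      @dataclass
      class ObjectLayout:
          aggregation_map: bytes        # maps the file's byte sequence onto objects
          components: List[ObjectEntry]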
3.1.2 Layout Iomode

   The iomode indicates to the metadata server the client's intent to
   perform either READs (only) or a mixture of I/O possibly containing
   WRITEs as well as READs (i.e., READ/WRITE).  For certain layout
   types, it is useful for a client to specify this intent at
   LAYOUTGET time; e.g., for block/volume-based protocols, block
   allocation could occur when a READ/WRITE iomode is specified.  A
   special LAYOUTIOMODE_ANY iomode is defined and can only be used for
   LAYOUTRETURN and LAYOUTRECALL, not for LAYOUTGET.  It specifies
   that layouts pertaining to both READ and RW iomodes are being
   returned or recalled, respectively.

   A storage device may validate I/O with regard to the iomode; this
   is dependent upon storage device implementation.  Thus, if the
   client's layout iomode differs from the I/O being performed, the
   storage device may reject the client's I/O with an error indicating
   that a new layout with the correct iomode should be fetched.  E.g.,
   if a client gets a layout with a READ iomode and performs a WRITE
   to a storage device, the storage device is allowed to reject that
   WRITE.

   The iomode does not conflict with OPEN share modes or lock
   requests; open share mode checks and lock enforcement always apply
   and are logically separate from the pNFS layout level.  As well,
   open modes and locks are the preferred method for restricting user
   access to data files.  E.g., an OPEN of read, deny-write does not
   conflict with a LAYOUTGET containing an iomode of READ/WRITE
   performed by another client.  Applications that depend on writing
   into the same file concurrently may use byte range locking to
   serialize their accesses.

3.1.3 Layout Segments

   Until this point, layouts have been defined in a fairly vague
   manner.  A layout is more precisely identified by the following
   tuple: <clientID, filehandle, layout type>; the FH refers to the FH
   of the file on the metadata server.  Note, layouts describe a file,
   not a byte-range of a file.

   Since a layout that describes an entire file may be very large,
   there is a desire to manage layouts in smaller chunks that
   correspond to byte-ranges of the file.  For example, the entire
   layout need not be returned, recalled, or committed.  These chunks
   are called "layout segments" and are further identified by the
   byte-range they represent.  Layout operations require the
   identification of the layout segment (i.e., clientID, FH, layout
   type, and byte-range), as well as the iomode.  This structure
   allows clients and metadata servers to aggregate the results of
   layout operations into a singly maintained layout.

   It is important to define when layout segments overlap and/or
   conflict with each other.  For a layout segment to overlap another
   layout segment, both segments must be of the same layout type,
   correspond to the same filehandle, and have the same iomode; in
   addition, the byte-ranges of the segments must overlap.  Layout
   segments conflict when they overlap and differ in the content of
   the layout (i.e., the storage device/file mapping parameters
   differ).  Note, differing iomodes do not lead to conflicting
   layouts.  It is permissible for layout segments with different
   iomodes, pertaining to the same byte range, to be held by the same
   client.

3.1.4 Device IDs

   The "deviceID" is a short name for a storage device.
In practice, a 475 significant amount of information may be required to fully identify a 476 storage device. Instead of embedding all that information in a 477 layout, a level of indirection is used. Layouts embed device IDs, 478 and a new operation (GETDEVICEINFO) is used to retrieve the complete 479 identity information about the storage device according to its layout 480 type. For example, the identity of a file server or object server 481 could be an IP address and port. The identity of a block device 482 could be a volume label. Due to multipath connectivity in a SAN 483 environment, agreement on a volume label is considered the reliable 484 way to locate a particular storage device. 486 The device ID is qualified by the layout type and unique per file 487 system (FSID). This allows different layout drivers to generate 488 device IDs without the need for co-ordination. In addition to 489 GETDEVICEINFO, another operation, GETDEVICELIST, has been added to 490 allow clients to fetch the mappings of multiple storage devices 491 attached to a metadata server. 493 Clients cannot expect the mapping between device ID and storage 494 device address to persist across server reboots, hence a client MUST 495 fetch new mappings on startup or upon detection of a metadata server 496 reboot unless it can revalidate its existing mappings. Not all 497 layout types support such revalidation, and the means of doing so is 498 layout specific. If data are reorganized from a storage device with 499 a given device ID to a different storage device (i.e., if the mapping 500 between storage device and data changes), the layout describing the 501 data MUST be recalled rather than assigning the new storage device to 502 the old device ID. 504 3.1.5 Aggregation Schemes 506 Aggregation schemes can describe layouts like simple one-to-one 507 mapping, concatenation, and striping. A general aggregation scheme 508 allows nested maps so that more complex layouts can be compactly 509 described. The canonical aggregation type for this extension is 510 striping, which allows a client to access storage devices in 511 parallel. Even a one-to-one mapping is useful for a file server that 512 wishes to distribute its load among a set of other file servers. 514 3.2 Guarantees Provided by Layouts 516 Layouts delegate to the client the ability to access data out of 517 band. The layout guarantees the holder that the layout will be 518 recalled when the state encapsulated by the layout becomes invalid 519 (e.g., through some operation that directly or indirectly modifies 520 the layout) or, possibly, when a conflicting layout is requested, as 521 determined by the layout's iomode. When a layout is recalled, and 522 then returned by the client, the client retains the ability to access 523 file data with normal NFSv4 I/O operations through the metadata 524 server. Only the right to do I/O out-of-band is affected. 526 Holding a layout does not guarantee that a user of the layout has the 527 rights to access the data represented by the layout. All user access 528 rights MUST be obtained through the appropriate open, lock, and 529 access operations (i.e., those that would be used in the absence of 530 pNFS). However, if a valid layout for a file is not held by the 531 client, the storage device should reject all I/Os to that file's byte 532 range that originate from that client. In summary, layouts and 533 ordinary file access controls are independent. 
   The act of modifying a file for which a layout is held does not
   necessarily conflict with the holding of the layout that describes
   the file being modified.  However, with certain layout types (e.g.,
   block/volume layouts), the layout's iomode must agree with the type
   of I/O being performed.

   Depending upon the layout type and storage protocol in use, storage
   device access permissions may be granted by LAYOUTGET and may be
   encoded within the type-specific layout.  If access permissions are
   encoded within the layout, the metadata server must recall the
   layout when those permissions become invalid for any reason; for
   example, when a file becomes unwritable or inaccessible to a
   client.  Note, clients are still required to perform the
   appropriate access operations as described above (e.g., open and
   lock ops).  The degree to which it is possible for the client to
   circumvent these access operations must be clearly addressed by the
   individual layout type documents, as well as the consequences of
   doing so.  In addition, these documents must be clear about the
   requirements and non-requirements for the checking performed by the
   server.

   If the pNFS metadata server supports mandatory byte range locks,
   then byte range locks must behave as specified by the NFSv4
   protocol, as observed by users of files.  If a storage device is
   unable to restrict access by a pNFS client that does not hold a
   required mandatory byte range lock, then the metadata server must
   not grant layouts to a client, for that storage device, that permit
   any access that conflicts with a mandatory byte range lock held by
   another client.  In this scenario, it is also necessary for the
   metadata server to ensure that byte range locks are not granted to
   a client if any other client holds a conflicting layout; in this
   case all conflicting layouts must be recalled and returned before
   the lock request can be granted.  This requires the pNFS server to
   understand the capabilities of its storage devices.

3.3 Getting a Layout

   A client obtains a layout through a new operation, LAYOUTGET.  The
   metadata server will give out layouts of a particular type (e.g.,
   block/volume, object, or file) and aggregation as requested by the
   client.  The client selects an appropriate layout type that the
   server supports and the client is prepared to use.  The layout
   returned to the client may not line up exactly with the requested
   byte range.  A field within the LAYOUTGET request, "minlength",
   specifies the minimum overlap that MUST exist between the requested
   layout and the layout returned by the metadata server.  The
   "minlength" field should specify a size of at least one.  A
   metadata server may give out multiple overlapping, non-conflicting
   layout segments to the same client in response to a LAYOUTGET.

   There is no implied ordering between getting a layout and
   performing a file OPEN.  For example, a layout may first be
   retrieved by placing a LAYOUTGET operation in the same compound as
   the initial file OPEN.  Once the layout has been retrieved, it can
   be held across multiple OPEN and CLOSE sequences.

   The storage protocol used by the client to access the data on the
   storage device is determined by the layout's type.  The client
   needs to select a "layout driver" that understands how to interpret
   and use that layout.
   The API used by the client to talk to its drivers is outside the
   scope of the pNFS extension.  The storage protocol between the
   client's layout driver and the actual storage is covered by other
   protocol specifications such as iSCSI (block storage), OSD (object
   storage), or NFS (file storage).

   Although the metadata server is in control of the layout for a
   file, the pNFS client can provide hints to the server about the
   preferred layout type and aggregation scheme when a file is opened
   or created.  The pNFS extension introduces a LAYOUT_HINT attribute
   that the client can set at creation time to provide a hint to the
   server for new files.  It is suggested that this attribute be set
   as one of the initial attributes to OPEN when creating a new file.
   Setting this attribute separately, after the file has been created,
   could make it difficult, or impossible, for the server
   implementation to comply.

3.4 Committing a Layout

   Due to the nature of the protocol, the file attributes and data
   location mapping (e.g., which offsets store data vs. store holes)
   that exist on the metadata server may become inconsistent with the
   data stored on the storage devices; e.g., when WRITEs occur before
   a layout has been committed (e.g., between a LAYOUTGET and a
   LAYOUTCOMMIT).  Thus, it is necessary to occasionally re-sync this
   state and make it visible to other clients through the metadata
   server.

   The LAYOUTCOMMIT operation is responsible for committing a modified
   layout segment to the metadata server.  Note: the data should be
   written and committed to the appropriate storage devices before the
   LAYOUTCOMMIT occurs.  Note, if the data is being written
   asynchronously through the metadata server, a COMMIT to the
   metadata server is required to sync the data and make it visible on
   the storage devices (see Section 3.6 for more details).  The scope
   of this operation depends on the storage protocol in use.  For
   block/volume-based layouts, it may require updating the block list
   that comprises the file and committing this layout to stable
   storage.  For file layouts, it requires some synchronization of
   attributes between the metadata server and storage devices (i.e.,
   mainly the size attribute, or EOF).  It is important to note that
   the level of synchronization is from the point of view of the
   client that issued the LAYOUTCOMMIT.  The updated state on the
   metadata server need only reflect the state as of the client's last
   operation previous to the LAYOUTCOMMIT; it need not reflect a
   globally synchronized state (e.g., other clients may be performing,
   or may have performed, I/O since the client's last operation and
   the LAYOUTCOMMIT).

   The control protocol is free to synchronize the attributes before
   it receives a LAYOUTCOMMIT; however, upon successful completion of
   a LAYOUTCOMMIT, state that exists on the metadata server that
   describes the file MUST be in sync with the state existing on the
   storage devices that comprise that file as of the issuing client's
   last operation.  Thus, a client that queries the size of a file
   between a WRITE to a storage device and the LAYOUTCOMMIT may
   observe a size that does not reflect the actual data written.
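   As a rough illustration of the ordering described above, the
   following Python sketch shows a client flushing data through its
   layout and only then issuing LAYOUTCOMMIT.  The helper names
   (write_via_storage_protocol, commit_on_storage_devices,
   layoutcommit) are hypothetical stand-ins for the real storage-
   protocol and NFSv4 operations, not part of this specification.

      # Minimal sketch of the write/commit ordering described above,
      # using hypothetical helper methods on a client object.

      def flush_and_commit_layout(client, layout_segment, dirty_ranges):
          """Make this client's writes visible through the metadata server."""
          # 1. Write the dirty data using the layout, directly to the
          #    storage devices.
          for offset, data in dirty_ranges:
              client.write_via_storage_protocol(layout_segment, offset, data)

          # 2. Ensure the data is stable on the storage devices before the
          #    layout is committed (a COMMIT through the metadata server is
          #    only needed for data written asynchronously through it).
          client.commit_on_storage_devices(layout_segment)

          # 3. LAYOUTCOMMIT re-syncs the metadata server's view (size and
          #    layout-type-specific "layoutupdate" data) as of this
          #    client's last operation; it does not imply a globally
          #    synchronized state.
          last_write_offset = max(off + len(data) - 1
                                  for off, data in dirty_ranges)
          return client.layoutcommit(layout_segment,
                                     last_write_offset=last_write_offset)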
3.4.1 LAYOUTCOMMIT and mtime/atime/change

   The change attribute and the modify/access times may be updated by
   the server at LAYOUTCOMMIT time, since for some layout types the
   change attribute and atime/mtime cannot be updated by the
   appropriate I/O operation performed at a storage device.  The
   arguments to LAYOUTCOMMIT allow the client to provide suggested
   access and modify time values to the server.  Again, depending upon
   the layout type, these client-provided values may or may not be
   used.  The server should sanity check the client-provided values
   before they are used.  For example, the server should ensure that
   time does not flow backwards.  According to the NFSv4
   specification, the client always has the option to set these
   attributes through an explicit SETATTR operation.

   As mentioned, for some layout protocols the change attribute and
   mtime/atime may be updated at or after the time the I/O occurred
   (e.g., if the storage device is able to communicate these
   attributes to the metadata server).  If, upon receiving a
   LAYOUTCOMMIT, the server implementation is able to determine that
   the file did not change since the last time the change attribute
   was updated (e.g., no WRITEs or over-writes occurred), the
   implementation need not update the change attribute; file-based
   protocols may have enough state to make this determination or may
   update the change attribute upon each file modification.  This also
   applies for mtime and atime; if the server implementation is able
   to determine that the file has not been modified since the last
   mtime update, the server need not update mtime at LAYOUTCOMMIT
   time.  Once LAYOUTCOMMIT completes, the new change attribute and
   mtime/atime should be visible if that file was modified since the
   latest previous LAYOUTCOMMIT or LAYOUTGET.

3.4.2 LAYOUTCOMMIT and size

   The file's size may be updated at LAYOUTCOMMIT time as well.  The
   LAYOUTCOMMIT operation contains an argument that indicates the last
   byte offset to which the client wrote ("last_write_offset").  Note:
   for this offset to be viewed as a file size it must be incremented
   by one byte (e.g., a write to offset 0 would map into a file size
   of 1, but the last write offset is 0).  The metadata server may do
   one of the following:

   1.  It may update the file's size based on the last write offset.
       However, to the extent possible, the metadata server should
       sanity check any value to which the file's size is going to be
       set.  E.g., it must not truncate the file based on the client
       presenting a smaller last write offset than the file's current
       size.

   2.  If it has sufficient other knowledge of file size (e.g., by
       querying the storage devices through the control protocol), it
       may ignore the client-provided argument and use the query-
       derived value.

   3.  It may use the last write offset as a hint, subject to
       correction when other information is available as above.

   The method chosen to update the file's size will depend on the
   storage device's and/or the control protocol's implementation.  For
   example, if the storage devices are block devices with no knowledge
   of file size, the metadata server must rely on the client to set
   the size appropriately.  A new size flag and length are also
   returned in the results of a LAYOUTCOMMIT.  This union indicates
   whether a new size was set, and to what length it was set.
If a new size is set as 710 a result of LAYOUTCOMMIT, then the metadata server must reply with 711 the new size. As well, if the size is updated, the metadata server 712 in conjunction with the control protocol SHOULD ensure that the new 713 size is reflected by the storage devices immediately upon return of 714 the LAYOUTCOMMIT operation; e.g., a READ up to the new file size 715 should succeed on the storage devices (assuming no intervening 716 truncations). Again, if the client wants to explicitly zero-extend 717 or truncate a file, SETATTR must be used; it need not be used when 718 simply writing past EOF. 720 Since client layout holders may be unaware of changes made to the 721 file's size, through LAYOUTCOMMIT or SETATTR, by other clients, an 722 additional callback/notification has been added for pNFS. 723 CB_SIZECHANGED is a notification that the metadata server sends to 724 layout holders to notify them of a change in file size. This is 725 preferred over issuing CB_LAYOUTRECALL to each of the layout holders. 727 3.4.3 LAYOUTCOMMIT and layoutupdate 729 The LAYOUTCOMMIT operation contains a "layoutupdate" argument. This 730 argument is a layout type specific structure. The structure can be 731 used to pass arbitrary layout type specific information from the 732 client to the metadata server at LAYOUTCOMMIT time. For example, if 733 using a block/volume layout, the client can indicate to the metadata 734 server which reserved or allocated blocks it used and which it did 735 not. The "layoutupdate" structure need not be the same structure as 736 the layout returned by LAYOUTGET. The structure is defined by the 737 layout type and is opaque to LAYOUTCOMMIT. 739 3.5 Recalling a Layout 741 3.5.1 Basic Operation 743 Since a layout protects a client's access to a file via a direct 744 client-storage-device path, a layout need only be recalled when it is 745 semantically unable to serve this function. Typically, this occurs 746 when the layout no longer encapsulates the true location of the file 747 over the byte range it represents. Any operation or action (e.g., 748 server driven restriping or load balancing) that changes the layout 749 will result in a recall of the layout. A layout is recalled by the 750 CB_LAYOUTRECALL callback operation (see Section 10.1). This callback 751 can either recall a layout segment identified by a byte range, or all 752 the layouts associated with a file system (FSID). However, there is 753 no single operation to return all layouts associated with an FSID; 754 multiple layout segments may be returned in a single compound 755 operation. Section 3.5.3 discusses sequencing issues surrounding the 756 getting, returning, and recalling of layouts. 758 The iomode is also specified when recalling a layout or layout 759 segment. Generally, the iomode in the recall request must match the 760 layout, or segment, being returned; e.g., a recall with an iomode of 761 RW should cause the client to only return RW layout segments (not R 762 segments). However, a special LAYOUTIOMODE_ANY enumeration is 763 defined to enable recalling a layout of any type (i.e., the client 764 must return both read-only and read/write layouts). 766 A REMOVE operation may cause the metadata server to recall the layout 767 to prevent the client from accessing a non-existent file and to 768 reclaim state stored on the client. Since a REMOVE may be delayed 769 until the last close of the file has occurred, the recall may also be 770 delayed until this time. 
As well, once the file has been removed, 771 after the last reference, the client SHOULD no longer be able to 772 perform I/O using the layout (e.g., with file-based layouts an error 773 such as ESTALE could be returned). 775 Although, the pNFS extension does not alter the caching capabilities 776 of clients, or their semantics, it recognizes that some clients may 777 perform more aggressive write-behind caching to optimize the benefits 778 provided by pNFS. However, write-behind caching may impact the 779 latency in returning a layout in response to a CB_LAYOUTRECALL; just 780 as caching impacts DELEGRETURN with regards to data delegations. 781 Client implementations should limit the amount of dirty data they 782 have outstanding at any one time. Server implementations may fence 783 clients from performing direct I/O to the storage devices if they 784 perceive that the client is taking too long to return a layout once 785 recalled. A server may be able to monitor client progress by 786 watching client I/Os or by observing LAYOUTRETURNs of sub-portions of 787 the recalled layout. The server can also limit the amount of dirty 788 data to be flushed to storage devices by limiting the byte ranges 789 covered in the layouts it gives out. 791 Once a layout has been returned, the client MUST NOT issue I/Os to 792 the storage devices for the file, byte range, and iomode represented 793 by the returned layout. If a client does issue an I/O to a storage 794 device for which it does not hold a layout, the storage device SHOULD 795 reject the I/O. 797 3.5.2 Recall Callback Robustness 799 For simplicity, the discussion thus far has assumed that pNFS client 800 state for a file exactly matches the pNFS server state for that file 801 and client regarding layout ranges and permissions. This assumption 802 leads to the implicit assumption that any callback results in a 803 LAYOUTRETURN or set of LAYOUTRETURNs that exactly match the range in 804 the callback, since both client and server agree about the state 805 being maintained. However, it can be useful if this assumption does 806 not always hold. For example: 808 o It may be useful for clients to be able to discard layout 809 information without calling LAYOUTRETURN. If conflicts that 810 require callbacks are very rare, and a server can use a multi-file 811 callback to recover per-client resources (e.g., via a FSID recall, 812 or a multi-file recall within a single compound), the result may 813 be significantly less client-server pNFS traffic. 815 o It may be similarly useful for servers to enhance information 816 about what layout ranges are held by a client beyond what a client 817 actually holds. In the extreme, a server could manage conflicts 818 on a per-file basis, only issuing whole-file callbacks even though 819 clients may request and be granted sub-file ranges. 821 o As well, the synchronized state assumption is not robust to minor 822 errors. A more robust design would allow for divergence between 823 client and server and the ability to recover. It is vital that a 824 client not assign itself layout permissions beyond what the server 825 has granted and that the server not forget layout permissions that 826 have been granted in order to avoid errors. On the other hand, if 827 a server believes that a client holds a layout segment that the 828 client does not know about, it's useful for the client to be able 829 to issue the LAYOUTRETURN that the server is expecting in response 830 to a recall. 
832 Thus, in light of the above, it is useful for a server to be able to 833 issue callbacks for layout ranges it has not granted to a client, and 834 for a client to return ranges it does not hold. A pNFS client must 835 always return layout segments that comprise the full range specified 836 by the recall. Note, the full recalled layout range need not be 837 returned as part of a single operation, but may be returned in 838 segments. This allows the client to stage the flushing of dirty 839 data, layout commits, and returns. Also, it indicates to the 840 metadata server that the client is making progress. 842 In order to ensure client/server convergence on the layout state, the 843 final LAYOUTRETURN operation in a sequence of returns for a 844 particular recall, SHOULD specify the entire range being recalled, 845 even if layout segments pertaining to partial ranges were previously 846 returned. In addition, if the client holds no layout segment that 847 overlaps the range being recalled, the client should return the 848 NFS4ERR_NOMATCHING_LAYOUT error code. This allows the server to 849 update its view of the client's layout state. 851 3.5.3 Recall/Return Sequencing 853 As with other stateful operations, pNFS requires the correct 854 sequencing of layout operations. This proposal assumes that sessions 855 will precede or accompany pNFS into NFSv4.x and thus, pNFS will 856 require the use of sessions. If the sessions proposal does not 857 precede pNFS, then this proposal needs to be modified to provide for 858 the correct sequencing of pNFS layout operations. Also, this 859 specification is reliant on the sessions protocol to provide the 860 correct sequencing between regular operations and callbacks. It is 861 the server's responsibility to avoid inconsistencies regarding the 862 layouts it hands out and the client's responsibility to properly 863 serialize its layout requests. 865 One critical issue with operation sequencing concerns callbacks. The 866 protocol must defend against races between the reply to a LAYOUTGET 867 operation and a subsequent CB_LAYOUTRECALL. It MUST NOT be possible 868 for a client to process the CB_LAYOUTRECALL for a layout that it has 869 not received in a reply message to a LAYOUTGET. 871 3.5.3.1 Client Side Considerations 873 Consider a pNFS client that has issued a LAYOUTGET and then receives 874 an overlapping recall callback for the same file. There are two 875 possibilities, which the client cannot distinguish when the callback 876 arrives: 878 1. The server processed the LAYOUTGET before issuing the recall, so 879 the LAYOUTGET response is in flight, and must be waited for 880 because it may be carrying layout info that will need to be 881 returned to deal with the recall callback. 883 2. The server issued the callback before receiving the LAYOUTGET. 884 The server will not respond to the LAYOUTGET until the recall 885 callback is processed. 887 This can cause deadlock, as the client must wait for the LAYOUTGET 888 response before processing the recall in the first case, but that 889 response will not arrive until after the recall is processed in the 890 second case. 
This deadlock can be avoided by adhering to the 891 following requirements: 893 o A LAYOUTGET MUST be rejected with an error (i.e., 894 NFS4ERR_RECALLCONFLICT) if there's an overlapping outstanding 895 recall callback to the same client 897 o When processing a recall, the client MUST wait for a response to 898 all conflicting outstanding LAYOUTGETs before performing any 899 RETURN that could be affected by any such response. 901 o The client SHOULD wait for responses to all operations required to 902 complete a recall before sending any LAYOUTGETs that would 903 conflict with the recall because the server is likely to return 904 errors for them. 906 Now the client can wait for the LAYOUTGET response, as it will be 907 received in both cases. 909 3.5.3.2 Server Side Considerations 911 Consider a related situation from the pNFS server's point of view. 912 The server has issued a recall callback and receives an overlapping 913 LAYOUTGET for the same file before the LAYOUTRETURN(s) that respond 914 to the recall callback. Again, there are two cases: 916 1. The client issued the LAYOUTGET before processing the recall 917 callback. 919 2. The client issued the LAYOUTGET after processing the recall 920 callback, but it arrived before the LAYOUTRETURN that completed 921 that processing. 923 The simplest approach is to always reject the overlapping LAYOUTGET. 924 The client has two ways to avoid this result - it can issue the 925 LAYOUTGET as a subsequent element of a COMPOUND containing the 926 LAYOUTRETURN that completes the recall callback, or it can wait for 927 the response to that LAYOUTRETURN. 929 This leads to a more general problem; in the absence of a callback if 930 a client issues concurrent overlapping LAYOUTGET and LAYOUTRETURN 931 operations, it is possible for the server to process them in either 932 order. Again, a client must take the appropriate precautions in 933 serializing its actions. 935 [ASIDE: HighRoad forbids a client from doing this, as the per-file 936 layout stateid will cause one of the two operations to be rejected 937 with a stale layout stateid. This approach is simpler and produces 938 better results by comparison to allowing concurrent operations, at 939 least for this sort of conflict case, because server execution of 940 operations in an order not anticipated by the client may produce 941 results that are not useful to the client (e.g., if a LAYOUTRETURN is 942 followed by a concurrent overlapping LAYOUTGET, but executed in the 943 other order, the client will not retain layout extents for the 944 overlapping range).] 946 3.6 Metadata Server Write Propagation 948 Asynchronous writes written through the metadata server may be 949 propagated lazily to the storage devices. For data written 950 asynchronously through the metadata server, a client performing a 951 read at the appropriate storage device is not guaranteed to see the 952 newly written data until a COMMIT occurs at the metadata server. 953 While the write is pending, reads to the storage device can give out 954 either the old data, the new data, or a mixture thereof. After 955 either a synchronous write completes, or a COMMIT is received (for 956 asynchronously written data), the metadata server must ensure that 957 storage devices give out the new data and that the data has been 958 written to stable storage. 
   If the server implements its storage in any way such that it cannot
   obey these constraints, then it must recall the layouts to prevent
   reads from being done that cannot be handled correctly.

3.7 Crash Recovery

   Crash recovery is complicated due to the distributed nature of the
   pNFS protocol.  In general, crash recovery for layouts is similar
   to crash recovery for delegations in the base NFSv4 protocol.
   However, the client's ability to perform I/O without contacting the
   metadata server introduces subtleties that must be handled
   correctly if file system corruption is to be avoided.

3.7.1 Leases

   The layout lease period plays a critical role in crash recovery.
   Depending on the capabilities of the storage protocol, it is
   crucial that the client is able to maintain an accurate layout
   lease timer to ensure that I/Os are not issued to storage devices
   after expiration of the layout lease period.  In order for the
   client to do so, it must know which operations renew a lease.

3.7.1.1 Lease Renewal

   The current NFSv4 specification allows for implicit lease renewals
   to occur upon receiving an I/O.  However, due to the distributed
   pNFS architecture, implicit lease renewals are limited to
   operations performed at the metadata server; this includes I/O
   performed through the metadata server.  So, a client must not
   assume that READ and WRITE I/O to storage devices implicitly renew
   lease state.

   If sessions are required for pNFS, as has been suggested, then the
   SEQUENCE operation is to be used to explicitly renew leases.  It is
   proposed that the SEQUENCE operation be extended to return all the
   specific information that RENEW does, but not as an error as RENEW
   returns it.  When using sessions, beginning each compound with the
   SEQUENCE operation allows renewals to be performed without an
   additional operation and without an additional request.  Again, the
   client must not rely on any operation to the storage devices to
   renew a lease.  Using the SEQUENCE operation for renewals
   simplifies the client's perception of lease renewal.

3.7.1.2 Client Lease Timer

   Depending on the storage protocol and layout type in use, it may be
   crucial that the client not issue I/Os to storage devices if the
   corresponding layout's lease has expired.  Doing so may lead to
   file system corruption if the layout has been given out and used by
   another client.  In order to prevent this, the client must maintain
   an accurate lease timer for all layouts held.  RFC3530 has the
   following to say regarding the maintenance of a client lease timer:

      ...the client must track operations which will renew the lease
      period.  Using the time that each such request was sent and the
      time that the corresponding reply was received, the client
      should bound the time that the corresponding renewal could have
      occurred on the server and thus determine if it is possible that
      a lease period expiration could have occurred.

   To be conservative, the client should start its lease timer based
   on the time that it issued the operation to the metadata server,
   rather than based on the time of the response.
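   The bookkeeping described above can be sketched as follows.  This
   Python fragment is illustrative only and assumes a hypothetical
   client structure; the essential points are that only operations
   sent to the metadata server count as renewals and that the timer is
   started from the request's send time, not from the reply.

      import time

      # Minimal sketch of a conservative layout lease timer.  I/O to
      # storage devices must NOT be treated as renewing the lease.

      class LayoutLeaseTimer:
          def __init__(self, lease_period_s):
              self.lease_period_s = lease_period_s
              self.last_renewal_sent = None   # send time of last renewing op

          def note_renewing_op_sent(self, send_time=None):
              """Record that a lease-renewing operation (e.g., SEQUENCE or
              RENEW) was sent to the metadata server."""
              if send_time is None:
                  send_time = time.monotonic()
              self.last_renewal_sent = send_time

          def may_have_expired(self, now=None):
              """True if the lease could already have expired on the server."""
              if self.last_renewal_sent is None:
                  return True
              if now is None:
                  now = time.monotonic()
              return (now - self.last_renewal_sent) >= self.lease_period_s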
It is also necessary to take propagation delay into account when requesting a renewal of the lease:

   ...the client should subtract it from lease times (e.g., if the
   client estimates the one-way propagation delay as 200 msec, then it
   can assume that the lease is already 200 msec old when it gets it).
   In addition, it will take another 200 msec to get a response back
   to the server.  So the client must send a lock renewal or write
   data back to the server 400 msec before the lease would expire.

Thus, the client must be aware of the one-way propagation delay and should issue renewals well in advance of lease expiration. Clients should, to the extent possible, try not to issue I/Os that may extend past the lease expiration time period. However, since this is not always possible, the storage protocol must be able to protect against the effects of in-flight I/Os, as is discussed later.

3.7.2 Client Recovery

Client recovery for layouts works in much the same way as NFSv4 client recovery works for other lock/delegation state. When an NFSv4 client reboots, it will lose all information about the layouts that it previously owned. There are two methods by which the server can reclaim these resources and allow otherwise conflicting layouts to be provided to other clients.

The first is through the expiry of the client's lease. If the client recovery time is longer than the lease period, the client's lease will expire and the server will know that state may be released. For layouts, the server may release the state immediately upon lease expiry, or it may allow the layout to persist awaiting possible lease revival, as long as there are no conflicting requests.

On the other hand, the client may recover in less time than it takes for the lease period to expire. In such a case, the client will contact the server through the standard SETCLIENTID protocol. The server will find that the client's id matches the id of the previous client invocation, but that the verifier is different. The server uses this as a signal to release all the state associated with the client's previous invocation.

3.7.3 Metadata Server Recovery

The server recovery case is slightly more complex. In general, the recovery process again follows the standard NFSv4 recovery model: the client will discover that the metadata server has rebooted when it receives an unexpected STALE_STATEID or STALE_CLIENTID reply from the server; it will then proceed to try to reclaim its previous delegations during the server's recovery grace period. However, layouts are not reclaimable in the same sense as data delegations; there is no reclaim bit, thus no guarantee of continuity between the previous and new layout. This is not necessarily required, since a layout is not required to perform I/O; I/O can always be performed through the metadata server.

[NOTE: there is no reclaim bit for getting a layout. Thus, in the case of reclaiming an old layout obtained through LAYOUTGET, there is no guarantee of continuity. If a reclaim bit existed, a block/volume layout type might benefit from knowing that it got the layout back with an assurance of continuity.
However, this would require the metadata server to trust the client to tell it the exact layout it had (i.e., the full block-list); divergence is instead avoided by having the server tell the client what is contained within the layout.]

If the client has dirty data that it needs to write out, or an outstanding LAYOUTCOMMIT, the client should try to obtain a new layout segment covering the byte range covered by the previous layout segment. However, the client might not get the same layout segment it had. The range might be different, or it might get the same range but the content of the layout might be different. For example, if using a block/volume-based layout, the blocks provisionally assigned by the layout might be different, in which case the client will have to write the corresponding blocks again; in the interest of simplicity, the client might decide to always write them again. Alternatively, the client might be unable to obtain a new layout and thus must write the data using normal NFSv4 through the metadata server.

There is an important safety concern associated with layouts that does not come into play in the standard NFSv4 case. If a standard NFSv4 client makes use of a stale delegation while reading, the consequence could be to deliver stale data to an application. If writing, using a stale delegation or a stale stateid for an open or lock would result in the rejection of the client's write with the appropriate stale stateid error.

However, the pNFS layout enables the client to directly access the file system storage; if this access is not properly managed by the NFSv4 server, the client can potentially corrupt the file system data or metadata. Thus, it is vitally important that the client discover that the metadata server has rebooted, and that the client stop using stale layouts before the metadata server gives them away to other clients. To ensure this, the client must be implemented so that layouts are never used to access the storage after the client's lease timer has expired. It is crucial that clients have precise knowledge of the lease periods of their layouts. For specific details on lease renewal and client lease timers, see Section 3.7.1.

The prohibition on using stale layouts applies to all layout-related accesses, especially the flushing of dirty data to the storage devices. If the client's lease timer expires because the client could not contact the server for any reason, the client MUST immediately stop using the layout until the server can be contacted and the layout can be officially recovered or reclaimed. However, this is only part of the solution. It is also necessary to deal with the consequences of I/Os already in flight.

The issue of the effects of I/Os started before lease expiration and possibly continuing through lease expiration is the responsibility of the data storage protocol and as such is layout type specific. There are two approaches the data storage protocol can take. The protocol may adopt a global solution which prevents all I/Os from being executed after the lease expiration and thus is safe against a client that issues I/Os after lease expiration.
This is the preferred solution and the solution used by NFSv4 file-based layouts (see Section 5.6); as well, the object storage device protocol allows storage to fence clients after lease expiration. Alternatively, the storage protocol may rely on proper client operation and only deal with the effects of lingering I/Os. These solutions may impact the client layout-driver, the metadata server layout-driver, and the control protocol.

3.7.4 Storage Device Recovery

Storage device crash recovery is mostly dependent upon the layout type in use. However, there are a few general techniques a client can use if it discovers a storage device has crashed while the client holds asynchronously written, non-committed data. First and foremost, it is important to realize that the client is the only one that has the information necessary to recover asynchronously written data, since it holds the dirty data and most probably nobody else does. Second, the best solution is for the client to err on the side of caution and attempt to re-write the dirty data through another path.

The client, rather than hold the asynchronously written data indefinitely, is encouraged to make sure that the data is written by using other paths to that data. The client may write the data to the metadata server, either synchronously or asynchronously with a subsequent COMMIT. Once it does this, there is no need to wait for the original storage device. In the event that the data range to be committed is transferred to a different storage device, as indicated in a new layout, the client may write to that storage device. Once the data has been committed at that storage device, either through a synchronous write or through a commit to that storage device (e.g., through the NFSv4 COMMIT operation for the NFSv4 file layout), the client should consider the transfer of responsibility for the data to the new server as strong evidence that this is the intended and most effective method for the client to get the data written. In either case, once the write is on stable storage (through either the storage device or metadata server), there is no need to continue attempting to commit or to synchronously write the data to the original storage device, or to wait for that storage device to become available. That storage device may never be visible to the client again.

This approach does have a "lingering write" problem, similar to regular NFSv4. Suppose a WRITE is issued to a storage device for which no response is received. The client breaks the connection, tries to re-establish a new one, and gets a recall of the layout. The client issues the I/O for the dirty data through an alternative path, for example through the metadata server, and it succeeds. The client then goes on to perform additional writes that all succeed. If, at some later time, the original WRITE to the storage device succeeds, data inconsistency could result. The same problem can occur in regular NFSv4. For example, a WRITE may be held in a switch for some period of time while other writes are issued and replied to; if the original WRITE finally succeeds, the same issues can occur. However, this is solved by sessions in NFSv4.x.

4.
Security Considerations

The pNFS extension partitions the NFSv4 file system protocol into two parts, the control path and the data path (i.e., the storage protocol). The control path contains all the new operations described by this extension; all existing NFSv4 security mechanisms and features apply to the control path. The combination of components in a pNFS system (see Figure 1) is required to preserve the security properties of NFSv4 with respect to an entity accessing data via a client, including security countermeasures to defend against threats for which NFSv4 provides defenses in environments where these threats are considered significant.

In some cases, the security countermeasures for connections to storage devices may take the form of physical isolation or a recommendation not to use pNFS in an environment. For example, it is currently infeasible to provide confidentiality protection for some storage device access protocols to protect against eavesdropping; in environments where eavesdropping on such protocols is of sufficient concern to require countermeasures, physical isolation of the communication channel (e.g., via direct connection from client(s) to storage device(s)) and/or a decision to forego use of pNFS (e.g., by falling back to NFSv4) may be appropriate courses of action.

In full generality, where communication with storage devices is subject to the same threats as client-server communication, the protocols used for that communication need to provide security mechanisms comparable to those available via RPCSEC_GSS for NFSv4. Many situations in which pNFS is likely to be used will not be subject to the overall threat profile for which NFSv4 is required to provide countermeasures.

pNFS implementations MUST NOT remove NFSv4's access controls. The combination of clients, storage devices, and the server is responsible for ensuring that all client-to-storage-device file data access respects NFSv4 ACLs and file open modes. This entails performing both of these checks on every access in the client, the storage device, or both. If a pNFS configuration performs these checks only in the client, the risk of a misbehaving client obtaining unauthorized access is an important consideration in determining when it is appropriate to use such a pNFS configuration. Such configurations SHOULD NOT be used when client-only access checks do not provide sufficient assurance that NFSv4 access control is being applied correctly.

The following subsections describe security considerations specifically applicable to each of the three major storage device protocol types supported for pNFS.

[Requiring strict equivalence to NFSv4 security mechanisms is the wrong approach. Will need to lay down a set of statements that each protocol has to make, starting with access check location/properties.]

4.1 File Layout Security

An NFSv4 file layout type is defined in Section 5; see Section 5.7 for additional security considerations and details. In summary, the NFSv4 file layout type requires that all I/O access checks MUST be performed by the storage devices, as defined by the NFSv4 specification. If another file layout type is being used, additional access checks may be required.
But in all cases, the access control performed by the storage devices must be at least as strict as that specified by the NFSv4 protocol.

4.2 Object Layout Security

The object storage protocol MUST implement the security aspects described in version 1 of the T10 OSD protocol definition [6]. The remainder of this section gives an overview of the security mechanism described in that standard. The goal is to give the reader a basic understanding of the object security model. Any discrepancies between this text and the actual standard are obviously to be resolved in favor of the OSD standard.

The object storage protocol relies on a cryptographically secure capability to control accesses at the object storage devices. Capabilities are generated by the metadata server, returned to the client, and used by the client as described below to authenticate its requests to the Object Storage Device (OSD). Capabilities therefore achieve the required access and open mode checking. They allow the file server to define and check a policy (e.g., open mode) and the OSD to check and enforce that policy without knowing the details (e.g., user IDs and ACLs). Since capabilities are tied to layouts, and since they are used to enforce access control, the server should recall layouts and revoke capabilities when the file ACL or mode changes in order to signal the clients.

Each capability is specific to a particular object, an operation on that object, and a byte range within the object, and has an explicit expiration time. The capabilities are signed with a secret key that is shared by the object storage devices (OSD) and the metadata managers. Clients do not have device keys, so they are unable to forge capabilities. The following sketch of the algorithm should help the reader understand the basic model.

LAYOUTGET returns:

   {CapKey = MAC<SecretKey>(CapArgs), CapArgs}

The client uses CapKey to sign all the requests it issues for that object using the respective CapArgs. In other words, the CapArgs appears in the request to the storage device, and that request is signed with the CapKey as follows:

   ReqMAC = MAC<CapKey>(Req, NonceIn)

The following is sent to the OSD: {CapArgs, Req, NonceIn, ReqMAC}. The OSD uses the SecretKey it shares with the metadata server to compare the ReqMAC the client sent with a locally computed:

   MAC<MAC<SecretKey>(CapArgs)>(Req, NonceIn)

and if they match, the OSD assumes that the capabilities came from an authentic metadata server and allows access to the object, as allowed by the CapArgs. Therefore, if the server's LAYOUTGET reply, holding CapKey and CapArgs, is snooped by another client, it can be used to generate valid OSD requests (within the CapArgs access restrictions).

To provide the required privacy for the capabilities returned by LAYOUTGET, the GSS-API can be used, e.g., by using a session key known to the file server and to the client to encrypt the whole layout or parts of it. Two general ways to provide privacy in the absence of GSS-API that are independent of NFSv4 are either an isolated network such as a VLAN or a secure channel provided by IPsec.
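As a non-normative illustration of the two MAC computations sketched above, the following C fragment shows a metadata server deriving CapKey from CapArgs under the shared SecretKey, and a client signing a request with that CapKey. The mac() routine, the MAC length, and the buffer handling are placeholders for whatever MAC algorithm and encoding the OSD standard actually specifies.

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   #define MAC_LEN 20   /* placeholder MAC output size */

   /* Placeholder for the MAC algorithm specified by the OSD standard. */
   void mac(const uint8_t *key, size_t keylen,
            const uint8_t *msg, size_t msglen, uint8_t out[MAC_LEN]);

   /* Metadata server: CapKey = MAC<SecretKey>(CapArgs); CapKey and
    * CapArgs are returned to the client in the LAYOUTGET reply. */
   void make_capkey(const uint8_t *secret, size_t slen,
                    const uint8_t *capargs, size_t clen,
                    uint8_t capkey[MAC_LEN])
   {
       mac(secret, slen, capargs, clen, capkey);
   }

   /* Client: ReqMAC = MAC<CapKey>(Req, NonceIn); {CapArgs, Req, NonceIn,
    * ReqMAC} is sent to the OSD, which recomputes CapKey from CapArgs
    * and the shared secret and verifies ReqMAC. */
   void sign_request(const uint8_t capkey[MAC_LEN],
                     const uint8_t *req, size_t reqlen,
                     const uint8_t *nonce, size_t noncelen,
                     uint8_t reqmac[MAC_LEN])
   {
       uint8_t buf[1024];                 /* assumes req + nonce fit */
       size_t n = 0;

       memcpy(buf + n, req, reqlen);      n += reqlen;
       memcpy(buf + n, nonce, noncelen);  n += noncelen;
       mac(capkey, MAC_LEN, buf, n, reqmac);
   }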
4.3 Block/Volume Layout Security

As typically used, block/volume protocols rely on clients to enforce file access checks, since the storage devices are generally unaware of the files they are storing and in particular are unaware of which blocks belong to which file. In such environments, the physical addresses of blocks are exported to pNFS clients via layouts. An alternative method of block/volume protocol use is for the storage devices to export virtualized block addresses, which do reflect the files to which blocks belong. These virtual block addresses are exported to pNFS clients via layouts. This allows the storage device to make appropriate access checks, while mapping virtual block addresses to physical block addresses.

In environments where access control is important and client-only access checks provide insufficient assurance of access control enforcement (e.g., there is concern about a malicious or malfunctioning client skipping the access checks) and where physical block addresses are exported to clients, the storage devices will generally be unable to compensate for these client deficiencies.

In such threat environments, block/volume protocols SHOULD NOT be used with pNFS, unless the storage device is able to implement the appropriate access checks, via use of virtualized block addresses or other means. NFSv4 without pNFS, or pNFS with a different type of storage protocol, would be a more suitable means to access files in such environments. Storage-device/protocol-specific methods (e.g., LUN masking/mapping) may be available to prevent malicious or high-risk clients from directly accessing storage devices.

5. The NFSv4 File Layout Type

This section describes the semantics and format of NFSv4 file-based layouts.

5.1 File Striping and Data Access

The file layout type describes a method for striping data across multiple devices. The data for each stripe unit is stored within an NFSv4 file located on a particular storage device. The structures used to describe the stripe layout are as follows:

   enum stripetype4 {
       STRIPE_SPARSE = 1,
       STRIPE_DENSE  = 2
   };

   struct nfsv4_file_layouthint {
       stripetype4   stripe_type;
       length4       stripe_unit;
       uint32_t      stripe_width;
   };

   struct nfsv4_file_layout {            /* Per data stripe */
       pnfs_deviceid4   dev_id<>;
       nfs_fh4          fh;
   };

   struct nfsv4_file_layouttype4 {       /* Per file */
       stripetype4        stripe_type;
       length4            stripe_unit;
       length4            file_size;
       nfsv4_file_layout  dev_list<>;
   };

The file layout specifies an ordered array of <dev_id, fh> tuples, as well as the stripe size, the type of stripe layout (discussed a little later), and the file's current size as of LAYOUTGET time. The filehandle, "fh", identifies the file on the storage device identified by "dev_id" that holds a particular stripe of the file. The "dev_id" array can be used for multipathing and is discussed further in Section 5.1.3. The stripe width is determined by the stripe unit size multiplied by the number of devices in the dev_list. The stripe held by a given <dev_id, fh> tuple is determined by that tuple's position within the device list, "dev_list". For example, consider a dev_list consisting of the following pairs:

   <(1,0x12), (2,0x13), (1,0x15)> and stripe_unit = 32KB

The stripe width is 32KB * 3 devices = 96KB.
The first entry specifies that the data file with filehandle 0x12 on device 1 holds the first 32KB of data (and every 32KB stripe beginning where the file's offset % 96KB == 0).

Devices may be repeated multiple times within the device list array; this is shown where storage device 1 holds both the first and third stripe of data. Filehandles can only be repeated if a sparse stripe type is used. Data is striped across the devices in the order listed in the device list array in increments of the stripe size. A data file stored on a storage device MUST map to a single file as defined by the metadata server; i.e., data from two files as viewed by the metadata server MUST NOT be stored within the same data file on any storage device.

The "stripe_type" field specifies how the data is laid out within the data file on a storage device. It allows for two different data layouts: sparse, and dense (or packed). The stripe type determines the calculation that must be made to map the client-visible file offset to the offset within the data file located on the storage device.

The layout hint structure is described in more detail in Section 6.7. It is used by the client, as the value of the FILE_LAYOUT_HINT attribute, to specify the type of layout to be used for a newly created file.

5.1.1 Sparse and Dense Storage Device Data Layouts

The stripe_type field allows for two storage device data file representations. Example sparse and dense storage device data layouts are illustrated below:

   Sparse file-layout (stripe_unit = 4KB)
   --------------------------------------

is represented by the following file layout on the storage devices:

   Offset   ID:0    ID:1    ID:2
   0        +--+    +--+    +--+       +--+  indicates a
            |//|    |  |    |  |       |//|  stripe that
   4KB      +--+    +--+    +--+       +--+  contains data
            |  |    |//|    |  |
   8KB      +--+    +--+    +--+
            |  |    |  |    |//|
   12KB     +--+    +--+    +--+
            |//|    |  |    |  |
   16KB     +--+    +--+    +--+
            |  |    |//|    |  |
            +--+    +--+    +--+

The sparse file-layout has holes for the byte ranges not exported by that storage device. This allows clients to access data using the real offset into the file, regardless of the storage device's position within the stripe. However, if a client writes to one of the holes (e.g., offset 4-12KB on device 1), then an error MUST be returned by the storage device. This requires that the storage device have knowledge of the layout for each file.

When using a sparse layout, the offset into the storage device data file is the same as the offset into the main file.

   Dense/packed file-layout (stripe_unit = 4KB)
   --------------------------------------------

is represented by the following file layout on the storage devices:

   Offset   ID:0    ID:1    ID:2
   0        +--+    +--+    +--+
            |//|    |//|    |//|
   4KB      +--+    +--+    +--+
            |//|    |//|    |//|
   8KB      +--+    +--+    +--+
            |//|    |//|    |//|
   12KB     +--+    +--+    +--+
            |//|    |//|    |//|
   16KB     +--+    +--+    +--+
            |//|    |//|    |//|
            +--+    +--+    +--+

The dense or packed file-layout does not leave holes on the storage devices. Each stripe unit is spread across the storage devices. As such, the storage devices need not know the file's layout, since the client is allowed to write to any offset.
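As a non-normative illustration of this mapping, the following C sketch computes, for a client-visible file offset, the index into dev_list and the offset within the per-device data file for both the sparse and dense cases, using the three-device, 32KB stripe_unit example above; it follows the calculations given next. The function names are illustrative only.

   #include <stdint.h>
   #include <stdio.h>

   /* Example geometry from above:
    * dev_list = <(1,0x12), (2,0x13), (1,0x15)>, stripe_unit = 32KB. */
   #define STRIPE_UNIT  (32 * 1024)
   #define NUM_DEVS     3

   /* Index into dev_list of the stripe containing file_offset. */
   static uint32_t dev_index(uint64_t file_offset)
   {
       return (uint32_t)((file_offset / STRIPE_UNIT) % NUM_DEVS);
   }

   /* Offset within the data file on the chosen storage device. */
   static uint64_t data_file_offset(uint64_t file_offset, int dense)
   {
       uint64_t stripe_width = (uint64_t)STRIPE_UNIT * NUM_DEVS;

       if (!dense)
           return file_offset;   /* sparse: same as the main file offset */

       /* dense/packed: stripe units are packed with no holes */
       return (file_offset / stripe_width) * STRIPE_UNIT +
              file_offset % STRIPE_UNIT;
   }

   int main(void)
   {
       uint64_t off = 96 * 1024;  /* maps to dev_list[0]: device 1, fh 0x12 */

       printf("dev_idx=%u sparse=%llu dense=%llu\n",
              dev_index(off),
              (unsigned long long)data_file_offset(off, 0),
              (unsigned long long)data_file_offset(off, 1));
       return 0;
   }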
The calculation to determine the byte offset within the data file for dense storage device layouts is:

   stripe_width = stripe_unit * N; where N = |dev_list|

   dev_offset = floor(file_offset / stripe_width) * stripe_unit +
                file_offset % stripe_unit

Regardless of the storage device data file layout, the calculation to determine the index into the device array is the same:

   dev_idx = floor(file_offset / stripe_unit) mod N

Section 5.5 describes the semantics for dealing with reads to holes within the striped file. This is of particular concern, since each individual component stripe file (i.e., the component of the striped file that lives on a particular storage device) may be of a different length. Thus, clients may experience 'short' reads when reading off the end of one of these component files.

5.1.2 Metadata and Storage Device Roles

In many cases, the metadata server and the storage device will be separate pieces of physical hardware. The specification text is written as if that were always the case. However, the same physical hardware can be used to implement both a metadata server and a storage device. In this case, the specification text's references to these two entities are to be understood as referring to the same physical hardware implementing two distinct roles, and it is important that it be clearly understood on behalf of which role the hardware is executing at any given time.

Two sub-cases can be distinguished. In the first sub-case, the same physical hardware is used to implement both a metadata server and a storage device, in which each role is addressed through a distinct network interface (e.g., the IP addresses for the metadata server and storage device are distinct). As long as the storage device address is obtained from the layout (using the device ID therein to obtain the appropriate storage device address) and is distinct from the metadata server's address, it is always clear, for any given request, to which role it is directed, based on the destination IP address.

However, it may also be the case that even though the metadata server and storage device are distinct from one client's point of view, the roles may be reversed according to another client's point of view. For example, in the cluster file system model, a metadata server to one client may be a storage device to another client. Thus, it is safer to always mark the filehandle so that operations addressed to storage devices can be distinguished.

The second sub-case is where both the metadata server and storage device have the same network address. This requires the distinction as to which role each request is directed to be made on another basis. Since the network address is the same, the request is understood as being directed at one or the other based on the first current filehandle value for the request. If the first current filehandle is one derived from a layout (i.e., it is specified within the layout), and it is recommended that these be distinguishable, then the request is to be considered as executed by a storage device. Otherwise, the operation is to be understood as executed by the metadata server.

If a current filehandle is set that is inconsistent with the role to which it is directed, then the error NFS4ERR_BADHANDLE should result.
For example, if a request is directed at the storage device, because the first current filehandle is from a layout, any attempt to set the current filehandle to a value not from a layout should be rejected. Similarly, if the first current filehandle was a value not from a layout, a subsequent attempt to set the current filehandle to a value obtained from a layout should be rejected.

5.1.3 Device Multipathing

The NFSv4 file layout supports multipathing to 'equivalent' devices. Device-level multipathing is primarily of use in the case of a data server failure; it allows the client to switch to another storage device that is exporting the same data stripe, without having to contact the metadata server for a new layout.

To support device multipathing, an array of device IDs is encoded within the data stripe portion of the file's layout. This array represents an ordered list of devices, where the first element has the highest priority. Each device in the list MUST be 'equivalent' to every other device in the list, and each device must be attempted in the order specified.

Equivalent devices MUST export the same system image (e.g., the stateids and filehandles that they use are the same) and must provide the same consistency guarantees. Two equivalent storage devices must also have sufficient connections to the storage, such that writing to one storage device is equivalent to writing to another; this also applies to reading. Also, if multiple copies of the same data exist, reading from one must provide access to all existing copies. As such, it is unlikely that multipathing will provide additional benefit in the case of an I/O error.

[NOTE: the error cases in which a client is expected to attempt an equivalent storage device should be specified.]

5.1.4 Operations Issued to Storage Devices

Clients MUST use the filehandle described within the layout when accessing data on the storage devices. When using the layout's filehandle, the client MUST only issue READ, WRITE, PUTFH, COMMIT, and NULL operations to the storage device associated with that filehandle. If a client issues an operation other than those specified above, using the filehandle and storage device listed in the client's layout, that storage device SHOULD return an error to the client. The client MUST follow the instruction implied by the layout (i.e., which filehandles to use on which devices). As described in Section 3.2, a client MUST NOT issue I/Os to storage devices for which it does not hold a valid layout. The storage devices may reject such requests.

GETATTR and SETATTR MUST be directed to the metadata server. In the case of a SETATTR of the size attribute, the control protocol is responsible for propagating size updates/truncations to the storage devices. In the case of extending WRITEs to the storage devices, the new size must be visible on the metadata server once a LAYOUTCOMMIT has completed (see Section 3.4.2). Section 5.5 describes the mechanism by which the client is to handle storage device files that do not reflect the metadata server's size.

5.2 Global Stateid Requirements

Note that there are no stateids returned embedded within the layout.
The client MUST use the stateid representing open or lock state as returned by an earlier metadata operation (e.g., OPEN, LOCK), or a special stateid, to perform I/O on the storage devices, as in regular NFSv4. Special stateid usage for I/O is subject to the NFSv4 protocol specification. The stateid used for I/O MUST have the same effect and be subject to the same validation on the storage device as it would if the I/O were being performed on the metadata server itself in the absence of pNFS. This has the implication that stateids are globally valid on both the metadata and storage devices. This requires the metadata server to propagate changes in lock and open state to the storage devices, so that the storage devices can validate I/O accesses. This is discussed further in Section 5.4. Depending on when stateids are propagated, the existence of a valid stateid on the storage device may act as proof of a valid layout.

[NOTE: a number of proposals have been made that have the possibility of limiting the amount of validation performed by the storage device; if any of these proposals are accepted or obtain consensus, the global stateid requirement can be revisited.]

5.3 The Layout Iomode

The layout iomode need not be used by the metadata server when servicing NFSv4 file-based layouts, although in some circumstances it may be useful. For example, if the server implementation supports reading from read-only replicas or mirrors, it would be useful for the server to return a layout enabling the client to do so. As such, the client should set the iomode based on its intent to read or write the data. The client may default to an iomode of READ/WRITE (LAYOUTIOMODE_RW). The iomode need not be checked by the storage devices when clients perform I/O. However, the storage devices SHOULD still validate that the client holds a valid layout and return an error if the client does not.

5.4 Storage Device State Propagation

Since the metadata server, which handles lock and open-mode state changes as well as ACLs, may not be collocated with the storage devices where I/O accesses are validated, the server implementation MUST take care of propagating changes of this state to the storage devices. Once the propagation to the storage devices is complete, the full effect of those changes must be in effect at the storage devices. However, some state changes need not be propagated immediately, although all changes SHOULD be propagated promptly. These state propagations have an impact on the design of the control protocol, even though the control protocol is outside of the scope of this specification. Immediate propagation refers to the synchronous propagation of state from the metadata server to the storage device(s); the propagation must be complete before returning to the client.

5.4.1 Lock State Propagation

Mandatory locks MUST be made effective at the storage devices before the request that establishes them returns to the caller. Thus, mandatory lock state MUST be synchronously propagated to the storage devices. On the other hand, since advisory lock state is not used for checking I/O accesses at the storage devices, there is no semantic reason for propagating advisory lock state to the storage devices.
However, since all lock, unlock, open downgrade, and upgrade operations affect the sequence ID stored within the stateid, the stateid changes, which may cause difficulty if this state is not propagated. Thus, when a client uses a stateid on a storage device for I/O with a newer sequence number than the one the storage device has, the storage device should query the metadata server and get any pending updates to that stateid. This allows stateid sequence number changes to be propagated lazily, on demand.

[NOTE: With the reliance on the sessions protocol, there is no real need for the sequence ID portion of the stateid to be validated on I/O accesses. It is proposed that sequence ID checking be obsoleted.]

Since updates to advisory locks neither confer nor remove privileges, these changes need not be propagated immediately, and may not need to be propagated promptly. The updates to advisory locks need only be propagated when the storage device needs to resolve a question about a stateid. In fact, if byte-range locking is not mandatory (i.e., is advisory), clients are advised not to use the lock-based stateids for I/O at all. The stateids returned by open are sufficient and eliminate overhead for this kind of state propagation.

5.4.2 Open-mode Validation

Open-mode validation MUST be performed against the open mode(s) held by the storage devices. However, the server implementation may not always require the immediate propagation of changes. Reductions in access because of CLOSEs or DOWNGRADEs do not have to be propagated immediately, but SHOULD be propagated promptly; whereas changes due to revocation MUST be propagated immediately. On the other hand, changes that expand access (e.g., new OPENs and upgrades) do not have to be propagated immediately, but the storage device SHOULD NOT reject a request because of mode issues without making sure that the upgrade is not in flight.

5.4.3 File Attributes

Since the SETATTR operation has the ability to modify state that is visible on both the metadata and storage devices (e.g., the size), care must be taken to ensure that the resultant state across the set of storage devices is consistent, especially when truncating or growing the file.

As described earlier, the LAYOUTCOMMIT operation is used to ensure that the metadata is synced with changes made to the storage devices. For the file-based protocol, it is necessary to re-sync state such as the size attribute and the setting of mtime/atime. See Section 3.4 for a full description of the semantics regarding LAYOUTCOMMIT and attribute synchronization. It should be noted that, by using a file-based layout type, it is possible to synchronize this state before LAYOUTCOMMIT occurs. For example, the control protocol can be used to query the attributes present on the storage devices.

Any changes to file attributes that control authorization or access, as reflected by ACCESS calls or READs and WRITEs on the metadata server, MUST be propagated to the storage devices for enforcement on READ and WRITE I/O calls. If the changes made on the metadata server result in more restrictive access permissions for any user, those changes MUST be propagated to the storage devices synchronously.
Recall that the NFSv4 protocol [2] specifies that:

   ...since the NFS version 4 protocol does not impose any requirement
   that READs and WRITEs issued for an open file have the same
   credentials as the OPEN itself, the server still must do
   appropriate access checking on the READs and WRITEs themselves.

This also includes changes to ACLs. The propagation of access right changes due to changes in ACLs may be asynchronous only if the server implementation is able to determine that the updated ACL is not more restrictive for any user specified in the old ACL. Due to the relative infrequency of ACL updates, it is suggested that all changes be propagated synchronously.

[NOTE: it has been suggested that the NFSv4 specification is in error with regard to allowing principals other than those used for OPEN to be used for file I/O. If changes within a minor version alter the behavior of NFSv4 with regard to OPEN principals and stateids, some access control checking at the storage device can be made less expensive. pNFS should be altered to take full advantage of these changes.]

5.5 Storage Device Component File Size

A potential problem exists when a component data file on a particular storage device is grown past EOF; the problem exists for both dense and sparse layouts. Imagine the following scenario: a client creates a new file (size == 0) and writes to byte 128KB; the client then seeks to the beginning of the file and reads byte 100. The client should receive 0s back as a result of the read. However, if the read falls on a different storage device from the client's original write, the storage device servicing the READ may still believe that the file's size is 0 and return no data with the EOF flag set. The storage device can only return 0s if it knows that the file's size has been extended. This would require the immediate propagation of the file's size to all storage devices, which is potentially very costly; instead, another approach is outlined below.

First, the file's size is returned within the layout by LAYOUTGET. This size must reflect the latest size at the metadata server as set by the most recent of either the last LAYOUTCOMMIT or SETATTR; however, it may be more recent. Second, if a client performs a read that is returned short (i.e., it is fully within the file's size, but the storage device indicates EOF and returns partial or no data), the client must assume that it is a hole and substitute 0s for the data not read, up until its known local file size. If a client extends the file, it must update its local file size. Third, if the metadata server receives a SETATTR of the size or a LAYOUTCOMMIT that alters the file's size, the metadata server must send out CB_SIZECHANGED messages with the new size to clients holding layouts (it need not send a notification to the client that performed the operation that resulted in the size change). Upon reception of the CB_SIZECHANGED notification, clients must update their local size for that file. As well, if a new file size is returned as a result of LAYOUTCOMMIT, the client must update its local file size.
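The following non-normative C sketch illustrates the client-side handling of such short reads; the types and helper names are illustrative only. If a READ to a storage device reports EOF at an offset the client knows to be within the file's size, the missing bytes are treated as a hole and zero-filled up to the locally known size.

   #include <stdint.h>
   #include <string.h>

   /* Illustrative result of a READ issued to a storage device. */
   struct sd_read_res {
       size_t bytes_returned;   /* may be short */
       int    eof;              /* storage device reported EOF */
   };

   /*
    * Post-process a short read.  'local_size' is the client's view of
    * the file size (from LAYOUTGET, CB_SIZECHANGED, or its own
    * extending writes).  Returns the byte count the application sees.
    */
   static size_t fixup_short_read(uint64_t offset, size_t count, char *buf,
                                  const struct sd_read_res *res,
                                  uint64_t local_size)
   {
       size_t got = res->bytes_returned;

       if (!res->eof || got >= count)
           return got;                     /* nothing to patch up */

       if (offset + count <= local_size) {
           /* Entire request is within the known file size: the missing
            * bytes are a hole on this storage device; substitute 0s. */
           memset(buf + got, 0, count - got);
           return count;
       }

       if (offset + got < local_size) {
           /* Zero-fill only up to the locally known file size. */
           size_t fill = (size_t)(local_size - offset) - got;
           memset(buf + got, 0, fill);
           return got + fill;
       }

       return got;                         /* true end of file */
   }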
1781 5.6 Crash Recovery Considerations 1783 As described in Section 3.7, the layout type specific storage 1784 protocol is responsible for handling the effects of I/Os started 1785 before lease expiration, extending through lease expiration. The 1786 NFSv4 file layout type prevents all I/Os from being executed after 1787 lease expiration, without relying on a precise client lease timer and 1788 without requiring storage devices to maintain lease timers. 1790 It works as follows. In the presence of sessions, each compound 1791 begins with a SEQUENCE operation that contains the "clientID". On 1792 the storage device, the clientID can be used to validate that the 1793 client has a valid layout for the I/O being performed, if it does 1794 not, the I/O is rejected. Before the metadata server takes any 1795 action to invalidate a layout given out by a previous instance, it 1796 must make sure that all layouts from that previous instance are 1797 invalidated at the storage devices. Note: it is sufficient to 1798 invalidate the stateids associated with the layout only if special 1799 stateids are not being used for I/O at the storage devices, otherwise 1800 the layout itself must be invalidated. 1802 This means that a metadata server may not restripe a file until it 1803 has contacted all of the storage devices to invalidate the layouts 1804 from the previous instance nor may it give out locks that conflict 1805 with locks embodied by the stateids associated with any layout from 1806 the previous instance without either doing a specific invalidation 1807 (as it would have to do anyway) or doing a global storage device 1808 invalidation. 1810 5.7 Security Considerations 1812 The NFSv4 file layout type MUST adhere to the security considerations 1813 outlined in Section 4. More specifically, storage devices must make 1814 all of the required access checks on each READ or WRITE I/O as 1815 determined by the NFSv4 protocol [2]. This impacts the control 1816 protocol and the propagation of state from the metadata server to the 1817 storage devices; see Section 5.4 for more details. 1819 5.8 Alternate Approaches 1821 Two alternate approaches exist for file-based layouts and the method 1822 used by clients to obtain stateids used for I/O. Both approaches 1823 embed stateids within the layout. 1825 However, before examining these approaches it is important to 1826 understand the distinction between clients and owners. Delegations 1827 belong to clients, while locks (e.g., record and share reservations) 1828 are held by owners which in turn belong to a specific client. As 1829 such, delegations can only protect against inter-client conflicts, 1830 not intra-client conflicts. Layouts are held by clients and SHOULD 1831 NOT be associated with state held by owners. Therefore, if stateids 1832 used for data access are embedded within a layout, these stateids can 1833 only act as delegation stateids, protecting against inter-client 1834 conflicts; stateids pertaining to an owner can not be embedded within 1835 the layout. This has the implication that the client MUST arbitrate 1836 among all intra-client conflicts (e.g., arbitrating among lock 1837 requests by different processes) before issuing pNFS operations. 1838 Using the stateids stored within the layout, storage devices can only 1839 arbitrate between clients (not owners). 
The first alternate approach is to do away with global stateids (stateids returned by OPEN/LOCK that are valid on both the metadata server and the storage devices) and use only stateids embedded within the layout. This approach has the drawback that the stateids used for I/O access cannot be validated against per-owner state, since they are only associated with the client holding the layout. It breaks the semantics of tying a stateid used for I/O to an open instance. This has the implication that clients must handle per-owner lock and open requests internally, rather than push the work onto the storage devices. The storage devices can still arbitrate and enforce inter-client lock and open state.

The second approach is a hybrid approach. This approach allows for stateids to be embedded within the layout, but also allows for the possibility of global stateids. If the stateid embedded within the layout is a special stateid of all zeros, then the stateid referring to the last successful OPEN/LOCK should be used. This approach is recommended if it is decided that using NFSv4 as a control protocol is required.

This proposal suggests the global stateid approach due to the cleaner semantics it provides regarding the relationship between stateids used for I/O and their corresponding open instance or lock state. However, it does have a profound impact on the control protocol's implementation and the state propagation that is required (as described in Section 5.4).

6. pNFS Typed Data Structures

6.1 pnfs_layouttype4

   enum pnfs_layouttype4 {
       LAYOUT_NFSV4_FILES  = 1,
       LAYOUT_OSD2_OBJECTS = 2,
       LAYOUT_BLOCK_VOLUME = 3
   };

A layout type specifies the layout being used. The implication is that clients have "layout drivers" that support one or more layout types. The file server advertises the layout types it supports through the LAYOUT_TYPES file system attribute. A client asks for layouts of a particular type in LAYOUTGET, and passes those layouts to its layout driver. The set of well-known layout types must be defined. As well, a private range of layout types is to be defined by this document. This would allow custom installations to introduce new layout types.

[OPEN ISSUE: Determine private range of layout types]

New layout types must be specified in RFCs approved by the IESG before becoming part of the pNFS specification.

The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration specifies that the object layout, as defined in [8], is to be used. Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the block/volume layout, as defined in [7], is to be used.

6.2 pnfs_deviceid4

   typedef uint64_t pnfs_deviceid4;    /* 64-bit device ID */

Layout information includes device IDs that specify a storage device through a compact handle. Addressing and type information is obtained with the GETDEVICEINFO operation. A client must not assume that device IDs are valid across metadata server reboots. The device ID is qualified by the layout type and is unique per file system (FSID). This allows different layout drivers to generate device IDs without the need for coordination. See Section 3.1.4 for more details.
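A non-normative client-side sketch of how these types might be used follows: the client dispatches layouts returned by LAYOUTGET to a per-type layout driver and keys its device cache on the layout type, the file system, and the device ID, reflecting the qualification rules above. The driver interface and cache key are purely illustrative.

   #include <stdint.h>

   typedef uint64_t pnfs_deviceid4;

   enum pnfs_layouttype4 {               /* as defined above */
       LAYOUT_NFSV4_FILES  = 1,
       LAYOUT_OSD2_OBJECTS = 2,
       LAYOUT_BLOCK_VOLUME = 3
   };

   /* Illustrative client-side layout driver interface. */
   struct layout_driver {
       int (*read)(void *layout, uint64_t off, uint64_t len, void *buf);
       int (*write)(void *layout, uint64_t off, uint64_t len,
                    const void *buf);
   };

   struct layout_driver *nfsv4_files_driver(void);   /* assumed to exist */

   /* Select a driver for a layout type advertised in LAYOUT_TYPES. */
   static struct layout_driver *driver_for(enum pnfs_layouttype4 t)
   {
       switch (t) {
       case LAYOUT_NFSV4_FILES:
           return nfsv4_files_driver();
       default:
           return 0;   /* no driver: perform I/O through the metadata server */
       }
   }

   /* Device cache key: device IDs are qualified by the layout type, are
    * unique only per file system (FSID), and do not survive metadata
    * server reboots. */
   struct device_key {
       enum pnfs_layouttype4 layout_type;
       uint64_t              fsid_major;
       uint64_t              fsid_minor;
       pnfs_deviceid4        dev_id;
   };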
1912 6.3 pnfs_deviceaddr4 1914 struct pnfs_netaddr4 { 1915 string r_netid<>; /* network ID */ 1916 string r_addr<>; /* universal address */ 1917 }; 1919 union pnfs_deviceaddr4 switch (pnfs_layouttype4 layout_type) { 1920 case LAYOUT_NFSV4_FILES: 1921 pnfs_netaddr4 netaddr; 1922 default: 1923 opaque device_addr<>; /* Other layouts */ 1924 }; 1926 The device address is used to set up a communication channel with the 1927 storage device. Different layout types will require different types 1928 of structures to define how they communicate with storage devices. 1929 The union is switched on the layout type. 1931 Currently, the only device address defined is that for the NFSv4 file 1932 layout, which identifies a storage device by network IP address and 1933 port number. This is sufficient for the clients to communicate with 1934 the NFSv4 storage devices, and may also be sufficient for object- 1935 based storage drivers to communicate with OSDs. The other device 1936 address we expect to support is a SCSI volume identifier. The final 1937 protocol specification will detail the allowed values for device_type 1938 and the format of their associated location information. 1940 [NOTE: other device addresses will be added as the respective 1941 specifications mature. It has been suggested that a separate 1942 device_type enumeration is used as a switch to the pnfs_deviceaddr4 1943 structure (e.g., if multiple types of addresses exist for the same 1944 layout type). Until such a time as a real case is made and the 1945 respective layout types have matured, the device address structure 1946 will be left as is.] 1948 6.4 pnfs_devlist_item4 1950 struct pnfs_devlist_item4 { 1951 pnfs_deviceid4 id; 1952 pnfs_deviceaddr4 addr; 1953 }; 1955 An array of these values is returned by the GETDEVICELIST operation. 1956 They define the set of devices associated with a file system. 1958 6.5 pnfs_layout4 1960 union pnfs_layoutdata4 switch (pnfs_layouttype4 layout_type) { 1961 case LAYOUT_NFSV4_FILES: 1962 nfsv4_file_layouttype4 file_layout; 1963 default: 1964 opaque layout_data<>; 1965 }; 1967 struct pnfs_layout4 { 1968 offset4 offset; 1969 length4 length; 1970 pnfs_layoutiomode4 iomode; 1971 pnfs_layoutdata4 layout; 1972 }; 1974 The pnfs_layout4 structure defines a layout for a file. The 1975 pnfs_layoutdata4 union contains the portion of the layout specific to 1976 the layout type. Currently, only the NFSv4 file layout type is 1977 defined; see Section 5.1 for its definition. Since layouts are sub- 1978 dividable, the offset and length together with the file's filehandle, 1979 the clientid, iomode, and layout type, identifies the layout. 1981 [OPEN ISSUE: there is a discussion of moving the striping 1982 information, or more generally the "aggregation scheme", up to the 1983 generic layout level. This creates a two-layer system where the top 1984 level is a switch on different data placement layouts, and the next 1985 level down is a switch on different data storage types. This lets 1986 different layouts (e.g., striping or mirroring or redundant servers) 1987 to be layered over different storage devices. This would move 1988 geometry information out of nfsv4_file_layouttype4 and up into a 1989 generic pnfs_striped_layout type that would specify a set of 1990 pnfs_deviceid4 and pnfs_devicetype4 to use for storage. Instead of 1991 nfsv4_file_layouttype4, there would be pnfs_nfsv4_devicetype4.] 
1993 6.6 pnfs_layoutupdate4 1995 union pnfs_layoutupdate4 switch (pnfs_layouttype4 layout_type) { 1996 case LAYOUT_NFSV4_FILES: 1997 void; 1998 default: 1999 opaque layout_data<>; 2000 }; 2002 The pnfs_layoutupdate4 structure is used by the client to return 2003 'updated' layout information to the metadata server at LAYOUTCOMMIT 2004 time. This provides a channel to pass layout type specific 2005 information back to the metadata server. E.g., for block/volume 2006 layout types this could include the list of reserved blocks that were 2007 written. The contents of the structure are determined by the layout 2008 type and are defined in their context. 2010 6.7 pnfs_layouthint4 2012 union pnfs_layouthint4 switch (pnfs_layouttype4 layout_type) { 2013 case LAYOUT_NFSV4_FILES: 2014 nfsv4_file_layouthint layout_hint; 2015 default: 2016 opaque layout_hint_data<>; 2017 }; 2019 The pnfs_layouthint4 structure is used by the client to pass in a 2020 hint about the type of layout it would like created for a particular 2021 file. It is the structure specified by the FILE_LAYOUT_HINT 2022 attribute described below. The metadata server may ignore the hint, 2023 or may selectively ignore fields within the hint. This hint should 2024 be provided at create time as part of the initial attributes within 2025 OPEN. The "nfsv4_file_layouthint" structure is defined in 2026 Section 5.1. 2028 6.8 pnfs_layoutiomode4 2030 enum pnfs_layoutiomode4 { 2031 LAYOUTIOMODE_READ = 1, 2032 LAYOUTIOMODE_RW = 2, 2033 LAYOUTIOMODE_ANY = 3 2034 }; 2036 The iomode specifies whether the client intends to read or write 2037 (with the possibility of reading) the data represented by the layout. 2038 The ANY iomode MUST NOT be used for LAYOUTGET, however, it can be 2039 used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies 2040 that layouts pertaining to both READ and RW iomodes are being 2041 returned or recalled, respectively. The metadata server's use of the 2042 iomode may depend on the layout type being used. The storage devices 2043 may validate I/O accesses against the iomode and reject invalid 2044 accesses. 2046 7. pNFS File Attributes 2048 7.1 pnfs_layouttype4<> FS_LAYOUT_TYPES 2050 This attribute applies to a file system and indicates what layout 2051 types are supported by the file system. We expect this attribute to 2052 be queried when a client encounters a new fsid. This attribute is 2053 used by the client to determine if it has applicable layout drivers. 2055 7.2 pnfs_layouttype4<> FILE_LAYOUT_TYPES 2057 This attribute indicates the particular layout type(s) used for a 2058 file. This is for informational purposes only. The client needs to 2059 use the LAYOUTGET operation in order to get enough information (e.g., 2060 specific device information) in order to perform I/O. 2062 7.3 pnfs_layouthint4 FILE_LAYOUT_HINT 2064 This attribute may be set on newly created files to influence the 2065 metadata server's choice for the file's layout. It is suggested that 2066 this attribute is set as one of the initial attributes within the 2067 OPEN call. The metadata server may ignore this attribute. This 2068 attribute is a sub-set of the layout structure returned by LAYOUTGET. 2069 For example, instead of specifying particular devices, this would be 2070 used to suggest the stripe width of a file. It is up to the server 2071 implementation to determine which fields within the layout it uses. 
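As an informal illustration (the chosen values and the encoding step are hypothetical), a client wishing to influence striping for a new file might fill in the nfsv4_file_layouthint structure from Section 5.1 and supply it as the FILE_LAYOUT_HINT attribute among the initial attributes of the OPEN that creates the file; per Section 5.1, the stripe width is the stripe unit size multiplied by the desired number of devices.

   #include <stdint.h>

   typedef uint64_t length4;

   enum stripetype4 { STRIPE_SPARSE = 1, STRIPE_DENSE = 2 };

   struct nfsv4_file_layouthint {        /* from Section 5.1 */
       enum stripetype4 stripe_type;
       length4          stripe_unit;
       uint32_t         stripe_width;
   };

   /* Suggest a dense layout with 64KB stripe units spread across eight
    * devices; the metadata server may ignore any or all of these
    * fields. */
   static struct nfsv4_file_layouthint make_layout_hint(void)
   {
       struct nfsv4_file_layouthint hint = {
           .stripe_type  = STRIPE_DENSE,
           .stripe_unit  = 64 * 1024,
           .stripe_width = 8 * 64 * 1024, /* stripe_unit * device count */
       };
       return hint;  /* XDR-encode as FILE_LAYOUT_HINT in OPEN's attrs */
   }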
[OPEN ISSUE: it has been suggested that the HINT is a well-defined type other than pnfs_layoutdata4, similar to pnfs_layoutupdate4.]

7.4 uint32_t FS_LAYOUT_PREFERRED_BLOCKSIZE

This attribute is a file-system-wide attribute and indicates the preferred block size for direct storage device access.

7.5 uint32_t FS_LAYOUT_PREFERRED_ALIGNMENT

This attribute is a file-system-wide attribute and indicates the preferred alignment for direct storage device access.

8. pNFS Error Definitions

NFS4ERR_BADLAYOUT          Layout specified is invalid.

NFS4ERR_BADIOMODE          Layout iomode is invalid.

NFS4ERR_LAYOUTUNAVAILABLE  Layouts are not available for the file or its containing file system.

NFS4ERR_LAYOUTTRYLATER     Layouts are temporarily unavailable for the file; the client should retry later.

NFS4ERR_NOMATCHING_LAYOUT  Client has no matching layout (segment) to return.

NFS4ERR_RECALLCONFLICT     Layout is unavailable due to a conflicting LAYOUTRECALL that is in progress.

NFS4ERR_UNKNOWN_LAYOUTTYPE Layout type is unknown.

9. pNFS Operations

9.1 LAYOUTGET - Get Layout Information

SYNOPSIS

   (cfh), clientid, layout_type, iomode, offset, length,
   minlength, maxcount -> layout

ARGUMENT

   struct LAYOUTGET4args {
       /* CURRENT_FH: file */
       clientid4           clientid;
       pnfs_layouttype4    layout_type;
       pnfs_layoutiomode4  iomode;
       offset4             offset;
       length4             length;
       length4             minlength;
       count4              maxcount;
   };

RESULT

   struct LAYOUTGET4resok {
       pnfs_layout4 layout;
   };

   union LAYOUTGET4res switch (nfsstat4 status) {
   case NFS4_OK:
       LAYOUTGET4resok resok4;
   default:
       void;
   };

DESCRIPTION

Requests a layout for reading or writing (and reading) the file given by the filehandle at the byte range specified by offset and length. Layouts are identified by the clientid, filehandle, and layout type. The use of the iomode depends upon the layout type, but it should reflect the client's data access intent.

The LAYOUTGET operation returns layout information for the specified byte range: a layout segment. To get a layout segment from a specific offset through the end-of-file, regardless of the file's length, a length field with all bits set to 1 (one) should be used. If the length is zero, or if a length that is not all bits set to one is specified and, when added to the offset, exceeds the maximum 64-bit unsigned integer value, the error NFS4ERR_INVAL will result.

The "minlength" field specifies the minimum size overlap with the requested offset and length that is to be returned. If this requirement cannot be met, a layout must not be returned; the error NFS4ERR_LAYOUTTRYLATER can be returned.

The "maxcount" field specifies the maximum layout size (in bytes) that the client can handle. If the size of the layout structure exceeds the size specified by maxcount, the metadata server will return the NFS4ERR_TOOSMALL error.

As well, the metadata server may adjust the range of the returned layout segment based on striping patterns and usage implied by the iomode. The client must be prepared to get a layout that does not line up exactly with its request; there MUST be at least an overlap of "minlength" between the layout returned by the server and the client's request, or the server SHOULD reject the request.
2175 The metadata server may also return a layout segment with an iomode 2176 other than that requested by the client. If it does so, it must 2177 ensure that the iomode is more permissive than the iomode requested. 2178 E.g., this allows an implementation to upgrade read-only requests to 2179 read/write requests at its discretion, within the limits of the 2180 layout-type-specific protocol. An iomode of either LAYOUTIOMODE_READ 2181 or LAYOUTIOMODE_RW must be returned.
2183 The format of the returned layout is specific to the underlying file 2184 system. Layout types other than the NFSv4 file layout type are 2185 specified outside of this document.
2187 If layouts are not supported for the requested file or its containing 2188 file system, the server SHOULD return NFS4ERR_LAYOUTUNAVAILABLE. If 2189 the layout type is not supported, the metadata server should return 2190 NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout 2191 matches the client-provided layout identification, the server should 2192 return NFS4ERR_BADLAYOUT. If an invalid iomode is specified, or an 2193 iomode of LAYOUTIOMODE_ANY is specified, the server should return 2194 NFS4ERR_BADIOMODE.
2196 If the layout for the file is unavailable due to transient 2197 conditions, e.g., because file sharing prohibits layouts, the server must 2198 return NFS4ERR_LAYOUTTRYLATER.
2200 If the layout request is rejected due to an overlapping layout 2201 recall, the server must return NFS4ERR_RECALLCONFLICT. See 2202 Section 3.5.3 for details.
2204 If the layout conflicts with a mandatory byte-range lock held on the 2205 file, and if the storage devices have no method of enforcing 2206 mandatory locks, other than through the restriction of layouts, the 2207 metadata server should return NFS4ERR_LOCKED.
2209 On success, the current filehandle retains its value.
2211 IMPLEMENTATION
2213 Typically, LAYOUTGET will be called as part of a compound RPC after 2214 an OPEN operation and results in the client having location 2215 information for the file; a client may also hold a layout across 2216 multiple OPENs. The client specifies a layout type that limits what 2217 kind of layout the server will return. This prevents servers from 2218 issuing layouts that are unusable by the client.
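On the request side, a non-normative sketch of preparing a whole-file LAYOUTGET follows.  The C structure mirrors the LAYOUTGET4args XDR above with types simplified to fixed-width integers; the constant name and helper are illustrative only.

   #include <stdint.h>

   /* Simplified C mirror of LAYOUTGET4args. */
   struct layoutget_args {
       uint64_t clientid;
       uint32_t layout_type;   /* e.g., LAYOUT_NFSV4_FILES */
       uint32_t iomode;        /* LAYOUTIOMODE_READ or LAYOUTIOMODE_RW only */
       uint64_t offset;
       uint64_t length;
       uint64_t minlength;
       uint32_t maxcount;
   };

   #define ALL_ONES_64 0xffffffffffffffffULL

   /* Request a layout covering the whole file, regardless of its current
    * size: offset 0 and a length of all bits set to 1.  maxcount bounds the
    * size of the layout the client is willing to parse. */
   static void layoutget_whole_file(struct layoutget_args *a, uint64_t clientid,
                                    uint32_t layout_type, uint32_t iomode,
                                    uint32_t maxcount)
   {
       a->clientid    = clientid;
       a->layout_type = layout_type;
       a->iomode      = iomode;       /* MUST NOT be LAYOUTIOMODE_ANY here */
       a->offset      = 0;
       a->length      = ALL_ONES_64;  /* from offset through end-of-file */
       a->minlength   = 1;            /* any non-empty overlap is acceptable */
       a->maxcount    = maxcount;
   }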
2220 ERRORS
2222 NFS4ERR_BADLAYOUT 2223 NFS4ERR_BADIOMODE 2224 NFS4ERR_FHEXPIRED 2225 NFS4ERR_INVAL 2226 NFS4ERR_LAYOUTUNAVAILABLE 2227 NFS4ERR_LAYOUTTRYLATER 2228 NFS4ERR_LOCKED 2229 NFS4ERR_NOFILEHANDLE 2230 NFS4ERR_NOTSUPP 2231 NFS4ERR_RECALLCONFLICT 2232 NFS4ERR_STALE 2233 NFS4ERR_STALE_CLIENTID 2234 NFS4ERR_TOOSMALL 2235 NFS4ERR_UNKNOWN_LAYOUTTYPE
2237 9.2 LAYOUTCOMMIT - Commit writes made using a layout
2238 SYNOPSIS
2240 (cfh), clientid, offset, length, last_write_offset, 2241 time_modify, time_access, layoutupdate -> newsize
2243 ARGUMENT
2245 union newtime4 switch (bool timechanged) { 2246 case TRUE: 2247 nfstime4 time; 2248 case FALSE: 2249 void; 2250 };
2252 union newsize4 switch (bool sizechanged) { 2253 case TRUE: 2254 length4 size; 2255 case FALSE: 2256 void; 2257 };
2259 struct LAYOUTCOMMIT4args { 2260 /* CURRENT_FH: file */ 2261 clientid4 clientid; 2262 offset4 offset; 2263 length4 length; 2264 length4 last_write_offset; 2265 newtime4 time_modify; 2266 newtime4 time_access; 2267 pnfs_layoutupdate4 layoutupdate; 2268 };
2270 RESULT
2272 struct LAYOUTCOMMIT4resok { 2273 newsize4 newsize; 2274 };
2276 union LAYOUTCOMMIT4res switch (nfsstat4 status) { 2277 case NFS4_OK: 2278 LAYOUTCOMMIT4resok resok4; 2279 default: 2280 void; 2281 };
2283 DESCRIPTION
2284 Commits changes in the layout segment represented by the current 2285 filehandle, clientid, and byte range. Since layouts are 2286 subdividable, a smaller portion of a layout, retrieved via LAYOUTGET, 2287 may be committed. The region being committed is specified through 2288 the byte range (length and offset). Note: the "layoutupdate" 2289 structure does not include the length and offset, as they are already 2290 specified in the arguments.
2292 The LAYOUTCOMMIT operation indicates that the client has completed 2293 writes using a layout obtained by a previous LAYOUTGET. The client 2294 may have only written a subset of the data range it previously 2295 requested. LAYOUTCOMMIT allows it to commit or discard provisionally 2296 allocated space and to update the server with a new end of file. The 2297 layout referenced by LAYOUTCOMMIT is still valid after the operation 2298 completes and can continue to be referenced by the clientid, 2299 filehandle, byte range, and layout type.
2301 The "last_write_offset" field specifies the offset of the last byte 2302 written by the client prior to the LAYOUTCOMMIT. Note: this value 2303 is never equal to the file's size (at most it is one byte less than 2304 the file's size). The metadata server may use this information to 2305 determine whether the file's size needs to be updated. If the 2306 metadata server updates the file's size as the result of the 2307 LAYOUTCOMMIT operation, it must return the new size as part of the 2308 results.
2310 The "time_modify" and "time_access" fields allow the client to 2311 suggest times it would like the metadata server to set. The metadata 2312 server may use these time values or it may use the time of the 2313 LAYOUTCOMMIT operation to set these time values. If the metadata 2314 server uses the client-provided times, it should sanity-check the 2315 values (e.g., to ensure time does not flow backwards). If the client 2316 wants to force the metadata server to set an exact time, the client 2317 should use a SETATTR operation in a compound right after 2318 LAYOUTCOMMIT. See Section 3.4 for more details. If the client 2319 desires the resultant mtime or atime, it should issue a GETATTR 2320 following the LAYOUTCOMMIT; e.g., later in the same compound.
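The relationship between last_write_offset and the file size described above can be illustrated with a short, non-normative sketch.  The helpers below are hypothetical client-side code: given the byte range the client actually wrote through the layout, they derive the last_write_offset to send in LAYOUTCOMMIT and the minimum size the metadata server would be expected to expose afterwards.

   #include <stdint.h>

   /* Given the range actually written through the layout, compute the
    * last_write_offset argument for LAYOUTCOMMIT.  The written range is
    * [write_off, write_off + write_len), with write_len > 0. */
   static uint64_t last_write_offset(uint64_t write_off, uint64_t write_len)
   {
       /* Offset of the last byte written: one less than the end offset, so
        * it is always at least one byte less than the resulting file size. */
       return write_off + write_len - 1;
   }

   /* Minimum file size implied by the commit: if the last byte written lies
    * at or beyond the current end-of-file, the server is expected to grow
    * the file and return the new size in LAYOUTCOMMIT4resok; otherwise the
    * size is unchanged and no new size need be returned. */
   static uint64_t size_after_commit(uint64_t cur_size, uint64_t last_write_off)
   {
       return (last_write_off >= cur_size) ? last_write_off + 1 : cur_size;
   }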
2322 The "layoutupdate" argument to LAYOUTCOMMIT provides a mechanism for 2323 a client to provide layout-specific updates to the metadata server. 2324 For example, the layout update can describe what regions of the 2325 original layout have been used and what regions can be deallocated. 2326 There is no layoutupdate structure specific to the NFSv4 file layout type.
2328 The layout information is more verbose for block devices than for 2329 objects and files because the latter hide the details of block 2330 allocation behind their storage protocols. At a minimum, the 2331 client needs to communicate changes to the end-of-file location back 2332 to the server, and, if desired, its view of the file modify and 2333 access time. For block/volume layouts, it needs to specify precisely 2334 which blocks have been used.
2336 If the layout identified in the arguments does not exist, the error 2337 NFS4ERR_BADLAYOUT is returned. The layout being committed may also 2338 be rejected if it does not correspond to an existing layout with an 2339 iomode of RW.
2341 On success, the current filehandle retains its value.
2343 ERRORS
2345 NFS4ERR_BADLAYOUT 2346 NFS4ERR_BADIOMODE 2347 NFS4ERR_FHEXPIRED 2348 NFS4ERR_INVAL 2349 NFS4ERR_NOFILEHANDLE 2350 NFS4ERR_STALE 2351 NFS4ERR_STALE_CLIENTID 2352 NFS4ERR_UNKNOWN_LAYOUTTYPE
2354 9.3 LAYOUTRETURN - Release Layout Information
2356 SYNOPSIS
2358 (cfh), clientid, offset, length, iomode, layout_type -> -
2360 ARGUMENT
2362 struct LAYOUTRETURN4args { 2363 /* CURRENT_FH: file */ 2364 clientid4 clientid; 2365 offset4 offset; 2366 length4 length; 2367 pnfs_layoutiomode4 iomode; 2368 pnfs_layouttype4 layout_type; 2369 };
2371 RESULT
2373 struct LAYOUTRETURN4res { 2374 nfsstat4 status; 2375 };
2377 DESCRIPTION
2378 Returns the layout segment represented by the current filehandle, 2379 clientid, byte range, iomode, and layout type. After this call, the 2380 client MUST NOT use the layout and the associated storage protocol to 2381 access the file data. The layout being returned may be a subdivision 2382 of a layout previously fetched through LAYOUTGET. It may also be 2383 a subset or superset of a layout specified by CB_LAYOUTRECALL. 2384 However, if it is a subset, the recall is not complete until the full 2385 byte range has been returned. It is also permissible, and no error 2386 should result, for a client to return a byte range covering a layout 2387 it does not hold. If the length is all 1s, the layout covers the 2388 range from offset to EOF. An iomode of ANY specifies that all 2389 layouts that match the other arguments to LAYOUTRETURN (i.e., 2390 clientid, byte range, and type) are being returned.
2392 Layouts may be returned when recalled or voluntarily (i.e., before 2393 the server has recalled them). In either case, the client must 2394 properly propagate any state changed while holding the layout to 2395 storage or to the server before returning the layout.
2397 If a client fails to return a layout in a timely manner, then the 2398 metadata server should use its control protocol with the storage 2399 devices to fence the client from accessing the data referenced by the 2400 layout. See Section 3.5 for more details.
2402 If the layout identified in the arguments does not exist, the error 2403 NFS4ERR_BADLAYOUT is returned. If a layout exists, but the iomode 2404 does not match, NFS4ERR_BADIOMODE is returned.
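The matching rules above (an iomode of ANY covers both READ and RW segments, and a length of all 1s extends to EOF) lend themselves to a small client-side predicate.  The non-normative C sketch below is hypothetical client code that decides whether a held layout segment is wholly covered by the range and iomode given in a LAYOUTRETURN; the containment interpretation is one reasonable choice for illustration, not a protocol requirement.

   #include <stdbool.h>
   #include <stdint.h>

   #define ALL_ONES_64 0xffffffffffffffffULL

   enum pnfs_layoutiomode4 {            /* from the XDR in Section 6.8 */
       LAYOUTIOMODE_READ = 1,
       LAYOUTIOMODE_RW   = 2,
       LAYOUTIOMODE_ANY  = 3
   };

   /* Exclusive end of a range; a length of all 1s means "through EOF". */
   static uint64_t range_end(uint64_t off, uint64_t len)
   {
       return (len == ALL_ONES_64 || len > ALL_ONES_64 - off) ? ALL_ONES_64
                                                              : off + len;
   }

   /* True if a held segment (seg_off, seg_len, seg_iomode) is covered by a
    * return of (off, len, iomode).  ANY matches both READ and RW segments;
    * otherwise the iomode must match exactly. */
   static bool
   segment_matches(uint64_t seg_off, uint64_t seg_len, int seg_iomode,
                   uint64_t off, uint64_t len, int iomode)
   {
       if (iomode != LAYOUTIOMODE_ANY && iomode != seg_iomode)
           return false;
       return seg_off >= off &&
              range_end(seg_off, seg_len) <= range_end(off, len);
   }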
2406 On success, the current filehandle retains its value.
2408 [OPEN ISSUE: Should LAYOUTRETURN be modified to handle FSID 2409 callbacks?]
2411 ERRORS
2413 NFS4ERR_BADLAYOUT 2414 NFS4ERR_BADIOMODE 2415 NFS4ERR_FHEXPIRED 2416 NFS4ERR_INVAL 2417 NFS4ERR_NOFILEHANDLE 2418 NFS4ERR_STALE 2419 NFS4ERR_STALE_CLIENTID 2420 NFS4ERR_UNKNOWN_LAYOUTTYPE
2422 9.4 GETDEVICEINFO - Get Device Information
2424 SYNOPSIS
2426 (cfh), device_id, layout_type, maxcount -> device_addr
2428 ARGUMENT
2430 struct GETDEVICEINFO4args { 2431 /* CURRENT_FH: file */ 2432 pnfs_deviceid4 device_id; 2433 pnfs_layouttype4 layout_type; 2434 count4 maxcount; 2435 };
2437 RESULT
2439 struct GETDEVICEINFO4resok { 2440 pnfs_deviceaddr4 device_addr; 2441 };
2443 union GETDEVICEINFO4res switch (nfsstat4 status) { 2444 case NFS4_OK: 2445 GETDEVICEINFO4resok resok4; 2446 default: 2447 void; 2448 };
2450 DESCRIPTION
2452 Returns device type and device address information for a specified 2453 device. The returned device_addr includes a type that indicates how 2454 to interpret the addressing information for that device. The current 2455 filehandle (cfh) is used to identify the file system; device IDs are 2456 unique per file system (FSID) and are qualified by the layout type.
2458 See Section 3.1.4 for more details on device ID assignment.
2460 If the size of the device address exceeds maxcount bytes, the 2461 metadata server will return the error NFS4ERR_TOOSMALL. If an 2462 invalid device ID is given, the metadata server will respond with 2463 NFS4ERR_INVAL.
2465 ERRORS
2467 NFS4ERR_FHEXPIRED 2468 NFS4ERR_INVAL 2469 NFS4ERR_TOOSMALL 2470 NFS4ERR_UNKNOWN_LAYOUTTYPE
2472 9.5 GETDEVICELIST - Get List of Devices
2474 SYNOPSIS
2476 (cfh), layout_type, maxcount, cookie, cookieverf -> 2477 cookie, cookieverf, device_addrs<>
2479 ARGUMENT
2481 struct GETDEVICELIST4args { 2482 /* CURRENT_FH: file */ 2483 pnfs_layouttype4 layout_type; 2484 count4 maxcount; 2485 nfs_cookie4 cookie; 2486 verifier4 cookieverf; 2487 };
2489 RESULT
2491 struct GETDEVICELIST4resok { 2492 nfs_cookie4 cookie; 2493 verifier4 cookieverf; 2494 pnfs_devlist_item4 device_addrs<>; 2495 };
2497 union GETDEVICELIST4res switch (nfsstat4 status) { 2498 case NFS4_OK: 2499 GETDEVICELIST4resok resok4; 2500 default: 2501 void; 2502 };
2504 DESCRIPTION
2506 In some applications, especially SAN environments, it is convenient 2507 to find out about all the devices associated with a file system. 2508 This lets a client determine if it has access to these devices, e.g., 2509 at mount time.
2511 This operation returns an array of items (pnfs_devlist_item4) that 2512 establish the association between the short pnfs_deviceid4 and the 2513 addressing information for that device, for a particular layout type. 2514 This operation may not be able to fetch all device information at 2515 once; thus, it uses a cookie-based approach, similar to READDIR, to 2516 fetch additional device information (see [2], section 14.2.24). As 2517 in GETDEVICEINFO, the current filehandle (cfh) is used to identify 2518 the file system.
2520 As in GETDEVICEINFO, maxcount specifies the maximum number of bytes 2521 to return. If the metadata server is unable to return a single 2522 device address, it will return the error NFS4ERR_TOOSMALL. If an 2523 invalid device ID is given, the metadata server will respond with 2524 NFS4ERR_INVAL.
2526 ERRORS
2528 NFS4ERR_BAD_COOKIE 2529 NFS4ERR_FHEXPIRED 2530 NFS4ERR_INVAL 2531 NFS4ERR_TOOSMALL 2532 NFS4ERR_UNKNOWN_LAYOUTTYPE
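The cookie-based retrieval described for GETDEVICELIST follows the same pattern as READDIR: start with a zero cookie and verifier, and keep re-issuing the operation with the returned cookie and verifier until the list is exhausted.  The non-normative loop below is a hypothetical client-side sketch; the transport call getdevicelist_rpc() is assumed for illustration, and terminating on an empty result array is an assumption of this sketch, since the result as defined above carries no explicit end-of-list flag.

   #include <stddef.h>
   #include <stdint.h>

   /* Simplified C mirrors of the XDR results above. */
   struct devlist_item { uint32_t device_id; /* address body omitted */ };

   struct getdevicelist_res {
       uint64_t             cookie;
       uint64_t             cookieverf;
       size_t               ndevices;
       struct devlist_item *devices;
   };

   /* Hypothetical transport call: issues GETDEVICELIST for the file system
    * identified by the current filehandle.  Returns an NFS status code. */
   int getdevicelist_rpc(uint32_t layout_type, uint32_t maxcount,
                         uint64_t cookie, uint64_t cookieverf,
                         struct getdevicelist_res *res);

   /* Fetch the device list in pieces, READDIR-style. */
   static int fetch_all_devices(uint32_t layout_type, uint32_t maxcount)
   {
       uint64_t cookie = 0, verf = 0;

       for (;;) {
           struct getdevicelist_res res;
           int status = getdevicelist_rpc(layout_type, maxcount,
                                          cookie, verf, &res);
           if (status != 0)        /* e.g., NFS4ERR_TOOSMALL, NFS4ERR_BAD_COOKIE */
               return status;
           if (res.ndevices == 0)
               break;              /* nothing more to report */
           /* ... record the (device_id, address) associations ... */
           cookie = res.cookie;    /* resume where the server left off */
           verf   = res.cookieverf;
       }
       return 0;
   }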
2534 10. Callback Operations
2535 10.1 CB_LAYOUTRECALL
2537 SYNOPSIS
2539 layout_type, iomode, layoutrecall -> -
2541 ARGUMENT
2543 enum layoutrecall_type4 { 2544 RECALL_FILE = 1, 2545 RECALL_FSID = 2 2546 };
2548 struct layoutrecall_file4 { 2549 nfs_fh4 fh; 2550 offset4 offset; 2551 length4 length; 2552 };
2554 union layoutrecall4 switch (layoutrecall_type4 recalltype) { 2555 case RECALL_FILE: 2556 layoutrecall_file4 layout; 2557 case RECALL_FSID: 2558 fsid4 fsid; 2559 };
2561 struct CB_LAYOUTRECALLargs { 2562 pnfs_layouttype4 layout_type; 2563 pnfs_layoutiomode4 iomode; 2564 layoutrecall4 layoutrecall; 2565 };
2567 RESULT
2569 struct CB_LAYOUTRECALLres { 2570 nfsstat4 status; 2571 };
2573 DESCRIPTION
2575 The CB_LAYOUTRECALL operation is used to begin the process of 2576 recalling a layout, a portion thereof, or all layouts pertaining to a 2577 particular file system (FSID). If RECALL_FILE is specified, the 2578 offset and length fields specify the portion of the layout to be 2579 returned. The iomode specifies the set of layouts to be returned. 2580 An iomode of ANY specifies that all matching layouts, regardless of 2581 iomode, must be returned; otherwise, only layouts that exactly match 2582 the iomode must be returned.
2584 If RECALL_FSID is specified, the fsid specifies the file system for 2585 which any outstanding layouts must be returned. Layouts are returned 2586 through the LAYOUTRETURN operation.
2588 If the client does not hold any layout segment either matching or 2589 overlapping with the requested layout, it returns 2590 NFS4ERR_NOMATCHING_LAYOUT. If a length of all 1s is specified, then 2591 the layout corresponding to the byte range from "offset" to the end- 2592 of-file MUST be returned.
2594 IMPLEMENTATION
2596 The client should reply to the callback immediately. Replying does 2597 not complete the recall except when an error is returned. The recall 2598 is not complete until the layout(s) are returned using a 2599 LAYOUTRETURN.
2601 The client should complete any in-flight I/O operations using the 2602 recalled layout(s) before returning them via LAYOUTRETURN. If the 2603 client has buffered dirty data, it may choose to write it directly to 2604 storage before calling LAYOUTRETURN, or to write it later using 2605 normal NFSv4 WRITE operations to the metadata server.
2607 If dirty data is flushed while the layout is held, the client must 2608 still issue LAYOUTCOMMIT operations at the appropriate time, 2609 especially before issuing the LAYOUTRETURN. If a large amount of 2610 dirty data is outstanding, the client may issue LAYOUTRETURNs for 2611 portions of the layout being recalled; this allows the server to 2612 monitor the client's progress and adherence to the callback. 2613 However, the last LAYOUTRETURN in a sequence of returns SHOULD 2614 specify the full range being recalled (see Section 3.5.2 for 2615 details).
2617 ERRORS
2619 NFS4ERR_NOMATCHING_LAYOUT
2621 10.2 CB_SIZECHANGED
2623 SYNOPSIS
2625 fh, size -> -
2627 ARGUMENT
2629 struct CB_SIZECHANGEDargs { 2630 nfs_fh4 fh; 2631 length4 size; 2632 };
2634 RESULT
2636 struct CB_SIZECHANGEDres { 2637 nfsstat4 status; 2638 };
2640 DESCRIPTION
2642 The CB_SIZECHANGED operation is used to notify the client that the 2643 size pertaining to the filehandle associated with "fh" has changed. 2644 The new size is specified. Upon reception of this notification 2645 callback, the client should update its internal size for the file. 2646 If the layout being held for the file is of the NFSv4 file layout 2647 type, then the size field within that layout should be updated (see 2648 Section 5.5). For other layout types, see Section 3.4.2 for more 2649 details.
2651 If the handle specified is not one for which the client holds a 2652 layout, an NFS4ERR_BADHANDLE error is returned.
2654 ERRORS
2656 NFS4ERR_BADHANDLE
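The IMPLEMENTATION guidance for CB_LAYOUTRECALL above amounts to a fixed ordering on the client: acknowledge the callback, quiesce I/O that uses the recalled layout, flush or redirect dirty data, commit via LAYOUTCOMMIT if anything was written through the layout, and only then return the layout.  The following non-normative C sketch captures that ordering; every function it calls is a hypothetical placeholder for client internals, not an operation defined by this document.

   #include <stdbool.h>
   #include <stdint.h>

   struct recalled_range { uint64_t offset, length; int iomode; };

   /* Hypothetical client internals. */
   int  reply_to_callback(int status);                 /* answer CB_LAYOUTRECALL */
   void wait_for_inflight_io(const struct recalled_range *r);
   bool have_dirty_data(const struct recalled_range *r);
   int  flush_dirty_data_to_storage(const struct recalled_range *r);
   int  layoutcommit(const struct recalled_range *r);  /* issues LAYOUTCOMMIT */
   int  layoutreturn(const struct recalled_range *r);  /* issues LAYOUTRETURN */

   /* Handle a CB_LAYOUTRECALL for a range the client actually holds. */
   static int handle_layout_recall(const struct recalled_range *r)
   {
       /* 1. Reply to the callback right away; this does not complete the
        *    recall. */
       reply_to_callback(0 /* NFS4_OK */);

       /* 2. Let in-flight I/O that uses the recalled layout drain. */
       wait_for_inflight_io(r);

       /* 3. Dirty data may be written through the layout now, or later via
        *    normal NFSv4 WRITEs to the metadata server; this sketch flushes
        *    it to storage and commits before returning the layout. */
       if (have_dirty_data(r)) {
           flush_dirty_data_to_storage(r);
           layoutcommit(r);
       }

       /* 4. The recall completes only once the full range has been returned. */
       return layoutreturn(r);
   }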
2658 11. Layouts and Aggregation
2660 This section describes several aggregation schemes in a semi-formal 2661 way to provide context for layout formats. These definitions will be 2662 formalized in other protocols. However, the set of understood types 2663 is part of this protocol in order to provide for basic 2664 interoperability.
2666 The layout descriptions include (deviceID, objectID) tuples that 2667 identify some storage object on some storage device. The addressing 2668 information associated with the deviceID is obtained with 2669 GETDEVICEINFO. The interpretation of the objectID depends on the 2670 storage protocol. The objectID could be a filehandle for an NFSv4 2671 storage device. It could be an OSD object ID for an object server. 2672 The layout for a block device generally includes additional block map 2673 information to enumerate blocks or extents that are part of the 2674 layout.
2676 11.1 Simple Map
2678 The data is located on a single storage device. In this case, the 2679 file server can act as the front end for several storage devices and 2680 distribute files among them. Each file is limited in its size and 2681 performance characteristics by a single storage device. The simple 2682 map consists of (deviceID, objectID).
2684 11.2 Block Extent Map
2686 The data is located on a LUN in the SAN. The layout consists of an 2687 array of (deviceID, blockID, offset, length) tuples. Each entry 2688 describes a block extent.
2690 11.3 Striped Map (RAID 0)
2692 The data is striped across storage devices. The parameters of the 2693 stripe include the number of storage devices (N) and the size of each 2694 stripe unit (U). A full stripe of data is N * U bytes. The stripe 2695 map consists of an ordered list of (deviceID, objectID) tuples and 2696 the parameter value for U. The first stripe unit (the first U bytes) 2697 is stored on the first (deviceID, objectID), the second stripe unit 2698 on the second (deviceID, objectID), and so forth until the first 2699 complete stripe. The data layout then wraps around so that byte 2700 (N*U) of the file is stored on the first (deviceID, objectID) in the 2701 list, but starting at offset U within that object. The striped 2702 layout allows a client to read or write to the component objects in 2703 parallel to achieve high bandwidth.
2705 The striped map for a block device would be slightly different. The 2706 map is an ordered list of (deviceID, blockID, blocksize), where the 2707 deviceID is rotated among a set of devices to achieve striping.
2709 11.4 Replicated Map
2711 The file data is replicated on N storage devices. The map consists 2712 of N (deviceID, objectID) tuples. When data is written using this 2713 map, it should be written to N objects in parallel. When data is 2714 read, any component object can be used.
2716 This map type is controversial because it highlights the issues with 2717 error recovery. Those issues arise with any scheme that 2718 employs redundancy. The handling of errors (e.g., only a subset of 2719 replicas get updated) is outside the scope of this protocol 2720 extension. Instead, it is a function of the storage protocol and the 2721 metadata control protocol.
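The striped map of Section 11.3 implies a simple arithmetic mapping from a file offset to the component object holding that byte.  The following non-normative C sketch illustrates that mapping for N components and stripe unit U; the tuple list itself would come from the layout returned by LAYOUTGET.

   #include <stdint.h>

   /* Location of a byte within a striped (RAID 0) layout of N components
    * with stripe unit U, as described in Section 11.3. */
   struct stripe_loc {
       uint32_t index;        /* which (deviceID, objectID) tuple in the list */
       uint64_t obj_offset;   /* byte offset within that component object */
   };

   static struct stripe_loc
   stripe_locate(uint64_t file_offset, uint32_t N, uint64_t U)
   {
       struct stripe_loc loc;
       uint64_t stripe_unit = file_offset / U;   /* which stripe unit overall */

       loc.index      = (uint32_t)(stripe_unit % N);
       /* Full stripes completed so far times U, plus the offset within the
        * unit: e.g., byte N*U of the file lands on component 0 at object
        * offset U, matching the wrap-around described above. */
       loc.obj_offset = (stripe_unit / N) * U + (file_offset % U);
       return loc;
   }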
2723 11.5 Concatenated Map
2725 The map consists of an ordered set of N (deviceID, objectID, size) 2726 tuples. Each successive tuple describes the next segment of the 2727 file.
2729 11.6 Nested Map
2731 The nested map is used to compose more complex maps out of simpler 2732 ones. The map format is an ordered set of M sub-maps; each sub-map 2733 applies to a byte range within the file and has its own type, such as 2734 one of the types introduced above. Any level of nesting is allowed in order 2735 to build up complex aggregation schemes.
2737 12. References
2739 12.1 Normative References
2741 [1] Bradner, S., "Key words for use in RFCs to Indicate Requirement 2742 Levels", BCP 14, RFC 2119, March 1997.
2744 [2] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, 2745 C., Eisler, M., and D. Noveck, "Network File System (NFS) 2746 version 4 Protocol", RFC 3530, April 2003.
2748 [3] Gibson, G., "pNFS Problem Statement", July 2004, .
2752 12.2 Informative References
2754 [4] Satran, J., Meth, K., Sapuntzakis, C., Chadalapaka, M., and E. 2755 Zeidner, "Internet Small Computer Systems Interface (iSCSI)", 2756 RFC 3720, April 2004.
2758 [5] Snively, R., "Fibre Channel Protocol for SCSI, 2nd Version 2759 (FCP-2)", ANSI/INCITS 350-2003, October 2003.
2761 [6] Weber, R., "Object-Based Storage Device Commands (OSD)", ANSI/ 2762 INCITS 400-2004, July 2004, 2763 .
2765 [7] Black, D., "pNFS Block/Volume Layout", July 2005, .
2768 [8] Zelenka, J., Welch, B., and B. Halevy, "Object-based pNFS 2769 Operations", July 2005, .
2772 Authors' Addresses
2774 Garth Goodson 2775 Network Appliance 2776 495 E. Java Dr 2777 Sunnyvale, CA 94089 2778 USA
2780 Phone: +1-408-822-6847 2781 Email: goodson@netapp.com
2783 Brent Welch 2784 Panasas, Inc. 2785 6520 Kaiser Drive 2786 Fremont, CA 95444 2787 USA
2789 Phone: +1-650-608-7770 2790 Email: welch@panasas.com 2791 URI: http://www.panasas.com/
2793 Benny Halevy 2794 Panasas, Inc. 2795 1501 Reedsdale St., #400 2796 Pittsburgh, PA 15233 2797 USA
2799 Phone: +1-412-323-3500 2800 Email: bhalevy@panasas.com 2801 URI: http://www.panasas.com/
2802 David L. Black 2803 EMC Corporation 2804 176 South Street 2805 Hopkinton, MA 01748 2806 USA
2808 Phone: +1-508-293-7953 2809 Email: black_david@emc.com
2811 Andy Adamson 2812 CITI, University of Michigan 2813 519 W. William 2814 Ann Arbor, MI 48103-4943 2815 USA
2817 Phone: +1-734-764-9465 2818 Email: andros@umich.edu
2820 Appendix A. Acknowledgments
2822 Many members of the pNFS informal working group have helped 2823 considerably. The authors would like to thank Gary Grider, Peter 2824 Corbett, Dave Noveck, and Peter Honeyman. This work is inspired by 2825 the NASD and OSD work done by Garth Gibson. Gary Grider of 2826 Los Alamos National Laboratory (LANL) has been a champion of high-performance parallel 2827 I/O.
2829 Intellectual Property Statement
2831 The IETF takes no position regarding the validity or scope of any 2832 Intellectual Property Rights or other rights that might be claimed to 2833 pertain to the implementation or use of the technology described in 2834 this document or the extent to which any license under such rights 2835 might or might not be available; nor does it represent that it has 2836 made any independent effort to identify any such rights. Information 2837 on the procedures with respect to rights in RFC documents can be 2838 found in BCP 78 and BCP 79.
2840 Copies of IPR disclosures made to the IETF Secretariat and any 2841 assurances of licenses to be made available, or the result of an 2842 attempt made to obtain a general license or permission for the use of 2843 such proprietary rights by implementers or users of this 2844 specification can be obtained from the IETF on-line IPR repository at 2845 http://www.ietf.org/ipr. 2847 The IETF invites any interested party to bring to its attention any 2848 copyrights, patents or patent applications, or other proprietary 2849 rights that may cover technology that may be required to implement 2850 this standard. Please address the information to the IETF at 2851 ietf-ipr@ietf.org. 2853 Disclaimer of Validity 2855 This document and the information contained herein are provided on an 2856 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 2857 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 2858 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 2859 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 2860 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 2861 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 2863 Copyright Statement 2865 Copyright (C) The Internet Society (2005). This document is subject 2866 to the rights, licenses and restrictions contained in BCP 78, and 2867 except as set forth therein, the authors retain all their rights. 2869 Acknowledgment 2871 Funding for the RFC Editor function is currently provided by the 2872 Internet Society.