Network Working Group                                     Robert Thurlow
Internet Draft                                                 June 2002
Document: draft-thurlow-nfsv4-repl-mig-design-00.txt

   Server-to-Server Replication/Migration Protocol Design Principles

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   Discussion and suggestions for improvement are requested. This
   document will expire in December, 2002. Distribution of this draft is
   unlimited.

Abstract

   NFS Version 4 [RFC3010] provided support for client/server
   interactions to support replication and migration, but left
   unspecified how replication and migration would be done. This
   document discusses the nature of a protocol to be used to transfer
   filesystem data and metadata for use with replication and migration
   services for NFS Version 4.

Table of Contents

   1. Introduction
   1.1. Definitions of terms
   1.1.1. Replication
   1.1.2. Migration
   1.2. Current practice
   1.3. The problem
   1.3.1. NFS clients today
   1.3.2. NFS Version 4
   1.4. The need for a transfer protocol
   2. Requirements
   2.1. Interoperability
   2.2. Transparency
   2.3. Security
   2.4. Efficiency
   2.5. Scalability
   3. What the protocol will not do (now)
   4. Design considerations
   4.1. Basic structure
   4.2. Administrative Control
   4.3. Basic environment
   4.4. Handling file changes
   4.5. Replication model
   5. Security considerations
   6. Implementation considerations
   6.1. Filehandle preservation
   6.2. Data transfer phases
   6.3. Operation on filesystem subsets
   7. Difficult issues
   7.1. Transparency violations
   7.2. Directory access
   8. Bibliography
   9. Author's Address

1. Introduction

   Though used in different circumstances, replication of data and
   migration of data share a common problem: how to accurately transfer
   data (which may be in use by applications) from one location to
   another with reasonable bandwidth usage and in reasonable time.
   Years ago, this was done by taking storage offline (or at least
   preventing write access), making a tape copy of the data files, and
   walking it to the new machine, after warning the twenty or so people
   who cared about it. Networks reduced wear on sneakers, but many of
   the data formats we use for filesystem copies tend to be little
   improved - they are either lowest-common-denominator standards like
   "tar" and "cpio" or internal dump formats which are non-standard.
   Today, with distributed filesystems like NFS Version 4, richer
   metadata including Access Control Lists (ACLs) and extended
   attributes, and potential users all over the enterprise and the
   Internet, we need something better - a standard, complete and
   extensible protocol to transfer filesystems.

   Though data replication and transfer are needed in many areas, this
   document will focus primarily on solving the problem of providing
   replication and migration support between NFS Version 4 servers. It
   is assumed that the reader has familiarity with NFS Version 4
   [RFC3010].

1.1. Definitions of terms

1.1.1. Replication

   Filesystem replication is the creation of a functionally identical
   copy of a filesystem, usually to enhance availability or provide for
   redundancy or disaster recovery. For example, a company may set up
   replicas of a customer database accessed by employees in different
   geographies. The data sets are often read-only, and initial creation
   of a replica is not as interesting a problem as maintaining the
   replica efficiently over time via incremental updates, which will
   likely be set up to push automatically.

1.1.2. Migration

   Filesystem migration is the moving of a filesystem to another server
   for load balancing purposes or because a user or server has moved.
   For example, a user may have moved from one building to another, or
   across the country, and want his home directory to follow him, or it
   may just be time to decommission an old server and move data to a new
   one. Only one data transfer is done, and it is important for this to
   be done efficiently and with the lowest possible impact on users.

1.2. Current practice

   System administrators typically have several options available to
   them to replicate or migrate files, but none of them cover the
   problem space:

   o  The pax, cpio and tar tape archivers as defined by IEEE 1003.1
      or ISO/IEC 9945-1 are often used without tape over a network for
      data transfer; these support only generic Unix-specific metadata
      and do not support ACLs or extended attributes

   o  The rdist (http://www.magnicomp.com/rdist) and rsync
      (http://samba.anu.edu.au/rsync) applications focus on
      propagating changes to replicas, but are documented only by
      source code, are not available on all platforms, and do not
      support more than generic Unix-specific metadata

   o  "cp -r" or its equivalent over NFS Version 4 could work in cases
      where capabilities of servers were the same, but if the
      destination did not support ACLs or extended attributes, would
      it do what the user wanted?

   o  Most server filesystems have a "dump" format of some kind, which
      can preserve all data and metadata as long as there are no
      architectural differences in the servers

   o  Most server vendors have products which can keep replicas in
      sync by monitoring changes at the block level below the server
      filesystem, which are again inherently tied to one architecture

   o  Most of the above tools are not set up to properly deal with
      exotic metadata which may be present on filesystems like MacOS's
      HFS or NTFS, which can result in loss of data even when
      transferring to the same platform

1.3. The problem

1.3.1. NFS clients today

   Replication and migration events both cause problems for NFS clients,
   which may have applications operating on data when the event occurs.
   Past versions of NFS did not provide any support in protocol for the
   client, and typical clients did not even attempt to find another
   replica which might provide service.

1.3.2. NFS Version 4

   NFS Version 4 [RFC3010] introduced some extra error codes and
   attributes to improve this situation. For replication, the new
   "fs_locations" attribute could be retrieved by the client to determine
   if multiple locations were available, so that when a server became
   unavailable, the client could fail over to a new location without
   hoping updated information was available in its name service. For
   migration and in the case of a decommissioned replica, the
   NFS4ERR_MOVED error would inform a client that it should consult
   "fs_locations" and make contact with a new server responsible for the
   data. In both cases, a client is required to establish a
   relationship with a new server, which may involve state recovery and
   using saved pathname information to discover new filehandles.

1.4. The need for a transfer protocol

   To support NFS Version 4, a method is needed to transfer functionally
   complete filesystem data from one server to another. The
   shortcomings listed previously in the common tools in use demonstrate
   that there is value in a standard protocol to transfer filesystem
   data.

2. Requirements

   The requirements for a replication and migration protocol are to be
   addressed in a separate document, but are approximately these:

2.1. Interoperability

   The replication/migration protocol must first and foremost be one
   which can potentially be implemented on any server. Several vendors
   already have a replication mechanism in their product lines which
   takes advantage of known properties of their servers to replicate at
   the block level, but this is inherently tied to one system.

2.2. Transparency

   When a client has been using a file which has been migrated, it
   should be able to detect this and recover the file state on the new
   server without applications needing to take action.
   Similarly, when a client has availability problems with a particular
   replica, it should be able to adapt to the use of the new replica
   without application involvement. This implies that, as far as
   possible, the replication/migration protocol must copy all
   filesystem data, as much metadata as possible, and all
   non-recoverable transient state such as outstanding lock and
   delegation state, completely and correctly. It is acceptable that
   the client must recover some state as occurs in the event of a
   server reboot.

2.3. Security

   NFS Version 4 supported strong mandatory-to-implement security
   mechanisms to protect the integrity and privacy of file data and
   metadata. The replication/migration protocol must specify
   mandatory-to-implement security to protect data in transit, and
   provide a security payload and an encryption mechanism to ensure
   strong security for each message. It is expected that the security
   mechanisms will correlate well with NFS Version 4 [RFC3010].

2.4. Efficiency

   The replication/migration protocol must get the job of data movement
   done as efficiently as possible in terms of both bandwidth and time.
   Components of this are:

   o  the protocol will conserve bandwidth by streaming data in large
      blocks with limited header overhead

   o  the protocol will transfer changed regions in files rather than
      complete files whenever possible

   o  the protocol will permit restart in the event of a server
      failure or lost connection

2.5. Scalability

   The replication/migration protocol must be able to handle both huge
   files and huge filesystems, while maintaining low enough overhead to
   work well with small filesystems as well.

3. What the protocol will not do (now)

   There have been discussions about the things a good replication
   protocol could do which are not considered part of the scope of this
   work, though some of them could be specified by future RFCs. These
   non-requirements include:

   o  being an "rdist" or "rsync" replacement

   o  being a tool to permit unprivileged users to copy file trees

   o  being used for replication of other types of data

4. Design considerations

4.1. Basic structure

   For best performance, a replication/migration protocol should be able
   to move large amounts of data without frequent small packets in the
   direction of data movement. Use of RPC [RFC1831] may be
   inappropriate; current thinking is that the protocol should be
   composed of messages encoded with XDR [RFC1832], exchanged under the
   control of a finite state machine. Groups of messages would probably
   include:

   o  Initialization and negotiation messages

   o  Filesystem information messages

   o  Data transfer messages

   o  Finalization messages

4.2. Administrative Control

   The replication and migration protocol should include nothing
   specifying how an administrative user contacts a server to initiate
   replication or migration. A separate document should define a
   mechanism suitable for this purpose.

4.3. Basic environment

   The replication/migration protocol should be available to a
   privileged context on a well-known TCP port on an NFSv4 server, able
   to authenticate and act on control messages from administration
   clients and general messages from other servers.

4.4. Handling file changes

   For replication, it should be possible to handle large files changed
   in small ways without transferring the entire file.
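
   As a rough illustration of sending only the changed regions of a
   large file, the Python sketch below compares two versions of a file
   block by block and produces (offset, data) pairs that a replica
   could apply to its old copy. It assumes the source can see both the
   previous and the current contents (for instance from a retained
   point-in-time copy). The names changed_ranges, apply_ranges and
   BLOCK_SIZE are hypothetical and are not part of any protocol
   described here.

```python
from typing import List, Optional, Tuple

BLOCK_SIZE = 4096  # hypothetical comparison granularity, not from the draft

ByteRange = Tuple[int, bytes]  # (offset into the file, replacement data)


def changed_ranges(old: bytes, new: bytes, block: int = BLOCK_SIZE) -> List[ByteRange]:
    """Return (offset, data) pairs covering every block where new differs from old.

    Runs of adjacent changed blocks are merged into a single range.
    """
    ranges: List[ByteRange] = []
    start: Optional[int] = None  # offset where the current changed run began
    for off in range(0, len(new), block):
        if new[off:off + block] != old[off:off + block]:
            if start is None:
                start = off
        elif start is not None:
            ranges.append((start, new[start:off]))
            start = None
    if start is not None:
        ranges.append((start, new[start:]))
    return ranges


def apply_ranges(old: bytes, ranges: List[ByteRange], new_size: int) -> bytes:
    """Rebuild the new contents from the replica's old copy plus the ranges."""
    # Truncate or zero-extend the old copy to the new size, then patch it.
    buf = bytearray(old[:new_size].ljust(new_size, b"\0"))
    for off, data in ranges:
        buf[off:off + len(data)] = data
    return bytes(buf)
```

   With a 100,000-byte file in which ten bytes change, this sketch
   yields a single 4096-byte range rather than the whole file. A real
   design would also have to carry the new file size and metadata
   alongside the ranges.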
   The protocol needs to be able to express changes to byte ranges
   within a file; ideally, the server will be able to extract such
   changes from some kind of change log or from internal filesystem
   data. However, this may not be practical. The existence of "rdist"
   shows that a bidirectional protocol can determine differences in
   files at a reasonable bandwidth cost, and it would be good for the
   replication/migration protocol to be able to operate in this mode.

4.5. Replication model

   Replication is usually set up as a series of read-only replicas, with
   the master copy of the filesystem generally inaccessible to the
   client or accessible through a different mount point. It is possible
   to envision a case where, along with several read-only replicas, a
   single writer is available and "marked" as such in the fs_locations
   attribute. The client would have to ensure that all reads and writes
   were directed to the writable copy from the time a particular file on
   the filesystem was first written to the time the client ceased caring
   about the file. This is considered beyond our current scope.

5. Security considerations

   NFS Version 4 is the primary impetus behind a replication/migration
   protocol, so this protocol should mandate a strong security scheme
   and security negotiation in a manner compatible with NFS Version 4.
   Since NFS Version 4 specifies RPCSEC_GSS [RFC2203], which in turn
   builds on GSS-API [RFC2078], it makes sense for a
   replication/migration protocol to specify RPCSEC_GSS if it is based
   on RPC, and GSS-API if it is not based on RPC. Kerberos Version 5
   will be used as described in [RFC1964] to provide one security
   framework. The LIPKEY GSS-API mechanism described in [RFC2847] will
   be used to provide for the use of user password and server public
   key. An initial message exchange will permit security negotiation.
   The replication/migration protocol will also specify a NULL security
   mechanism to optimize its performance when used with strong
   host-based security mechanisms such as SSH and IPsec.

6. Implementation considerations

6.1. Filehandle preservation

   Filehandles are the basic shorthand used by clients to perform most
   operations on files. They are opaque to the client, but are usually
   derived from:

   o  the fsid of the filesystem

   o  the fileid or "inode number" of the directory shared by the
      server

   o  the fileid or "inode number" of the file

   o  the "generation number", an internal field to support inode
      reuse.

   It is, in some circumstances, desirable to preserve persistent
   filehandles across a replication or migration event. The most likely
   circumstance for this is when both servers are of the same
   architecture, and when the destination server can assign values to
   these fields as data is accepted. To support this case, the
   filehandle should be available as an attribute which can be passed to
   the new server. Some operating environments will not have interfaces
   to support access to this data or a way to recreate it anew, so this
   should be negotiated so that this data is not sent unnecessarily.

   Even if a server implementation can transfer and accept persistent
   filehandles, it must ensure that the client is not falsely promised
   that this will happen. [RFC3010] specifies that a server may migrate
   a filesystem with persistent filehandles as long as the new server
   also uses persistent filehandles and the same filehandles will
   correspond to the same files after migration. In the general case,
   the decision to migrate a filesystem, perhaps to a heterogeneous
   server with different filehandles, will be made after clients have
   accessed filesystems and learned of the value of the "fh_expire_type"
   attribute.
   Thus it seems necessary that servers return an "fh_expire_type" of
   at least FH4_VOL_MIGRATION so that clients will always store partial
   pathnames for later use. It is possible for clients to attempt to
   use pre-event filehandles with the new server in the hope that
   persistent filehandles would have been transferred intact, but there
   is no way for the server to promise this unless it will never
   transfer to a server of a different implementation.

6.2. Data transfer phases

   For both replication and migration, transfer most generally happens
   in two phases: first, the bulk of the data is copied to the target
   while access to the source filesystem continues, and second, changes
   made since the start of the first phase are transferred while write
   access to the source filesystem is curtailed. This reduces the
   window during which clients will see restrictions, at the cost of
   needing a method to lock out writes to files in the file tree. For
   replication, it would be possible to bypass locking by the use of
   multiple point-in-time copies ("snapshots"), since the delta
   represented by each snapshot could be used to update the replicas.

6.3. Operation on filesystem subsets

   When NFSv4 clients discover that they must react to a replication or
   migration event, [RFC3010] states that they will recover at the
   granularity of an entire filesystem, i.e. a set of files sharing the
   same "fsid" attribute. It is possible that this protocol could be
   useful for splitting up of large filesystems to permit them to be
   replicated and migrated separately. This can most easily be done if
   the server can arrange to return distinct "fsid"s for subdirectories
   of what it manages as a single filesystem.

7. Difficult issues

7.1. Transparency violations

   When being used between servers that are sufficiently different, it
   may be impossible for the new server to support some metadata
   enumerated in the data stream, or it may be that metadata critical to
   the new server are not supported on the old. When this happens, the
   client may notice and react badly to the loss of transparency.
   Sources of this kind of problem include:

   o  Filename encoding differences

   o  Attributes supported on one server and not the other

   o  A failure of atomicity during transfer

   o  Incomplete or no transfer of locking, delegation and other state

7.2. Directory access

   When a directory is read, a series of RPCs is used to get the entries
   in small parts. The sequence of RPCs is tied together by a "cookie"
   returned by the server in each reply and used by the client in the
   next request. The sequence can be interrupted by a replication or
   migration event, which can lead to NFS4ERR_BAD_COOKIE on the new
   server, even if the servers are the same architecture, due to
   different orders of creation of the directory entries and compaction.

8. Bibliography

   [RFC1831]
   R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
   Version 2", RFC1831, August 1995.

   [RFC1832]
   R. Srinivasan, "XDR: External Data Representation Standard", RFC1832,
   August 1995.

   [RFC3010]
   S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
   Eisler, D. Noveck, "NFS version 4 Protocol", RFC3010, December 2000.

   [RDIST]
   MagniComp, Inc., "The RDist Home Page",
   http://www.magnicomp.com/rdist.

   [RSYNC]
   The Samba Team, "The rsync web pages", http://samba.anu.edu.au/rsync.

9. Author's Address

   Address comments related to this memorandum to:

   nfsv4-wg@sunroof.eng.sun.com

   Robert Thurlow
   Sun Microsystems, Inc.
   500 Eldorado Boulevard, UBRM05-171
   Broomfield, CO 80021

   Phone: 877-718-3419
   E-mail: robert.thurlow@sun.com