idnits 2.17.1 

draft-thurlow-nfsv4-repl-mig-design-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The abstract seems to contain references ([RFC3010]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 2002) is 7986 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'RFC3010' on line 448 looks like a reference

  -- Missing reference section? 'RFC1831' on line 440 looks like a reference

  -- Missing reference section? 'RFC1832' on line 444 looks like a reference

  -- Missing reference section? 'RFC2203' on line 328 looks like a reference

  -- Missing reference section? 'RFC2078' on line 329 looks like a reference

  -- Missing reference section? 'RFC1964' on line 332 looks like a reference

  -- Missing reference section? 'RFC2847' on line 333 looks like a reference

  -- Missing reference section? 'RDIST' on line 452 looks like a reference

  -- Missing reference section? 'RSYNC' on line 456 looks like a reference


     Summary: 4 errors (**), 0 flaws (~~), 1 warning (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                     Robert Thurlow
3	Internet Draft                                            June 2002
4	Document: draft-thurlow-nfsv4-repl-mig-design-00.txt

6	   Server-to-Server Replication/Migration Protocol Design Principles

8	Status of this Memo

10	   This document is an Internet-Draft and is subject to all provisions
11	   of Section 10 of RFC2026.

13	   Internet-Drafts are working documents of the Internet Engineering
14	   Task Force (IETF), its areas, and its working groups.  Note that
15	   other groups may also distribute working documents as Internet-
16	   Drafts.

18	   Internet-Drafts are draft documents valid for a maximum of six months
19	   and may be updated, replaced, or obsoleted by other documents at any
20	   time.  It is inappropriate to use Internet-Drafts as reference
21	   material or to cite them other than as "work in progress."

23	   The list of current Internet-Drafts can be accessed at
24	   http://www.ietf.org/1id-abstracts.html

26	   The list of Internet-Draft Shadow Directories can be accessed at
27	   http://www.ietf.org/shadow.html

29	   Discussion and suggestions for improvement are requested.  This
30	   document will expire in December, 2002. Distribution of this draft is
31	   unlimited.

33	Abstract

35	   NFS Version 4 [RFC3010] provided support for client/server
36	   interactions to support replication and migration, but left
37	   unspecified how replication and migration would be done.  This
38	   document discusses the nature of a protocol to be used to transfer
39	   filesystem data and metadata for use with replication and migration
40	   services for NFS Version 4.

42	Table of Contents

44	   1.  Introduction
45	   1.1.  Definitions of terms
46	   1.1.1.  Replication
47	   1.1.2.  Migration
48	   1.2.  Current practice
49	   1.3.  The problem
50	   1.3.1.  NFS clients today
51	   1.3.2.  NFS Version 4
52	   1.4.  The need for a transfer protocol
53	   2.  Requirements
54	   2.1.  Interoperability
55	   2.2.  Transparency
56	   2.3.  Security
57	   2.4.  Efficiency
58	   2.5.  Scalability
59	   3.  What the protocol will not do (now)
60	   4.  Design considerations
61	   4.1.  Basic structure
62	   4.2.  Administrative Control
63	   4.3.  Basic environment
64	   4.4.  Handling file changes
65	   4.5.  Replication model
66	   5.  Security considerations
67	   6.  Implementation considerations
68	   6.1.  Filehandle preservation
69	   6.2.  Data transfer phases
70	   6.3.  Operation on filesystem subsets
71	   7.  Difficult issues
72	   7.1.  Transparency violations
73	   7.2.  Directory access
74	   8.  Bibliography
75	   9.  Author's Address

77	1.  Introduction

79	   Though used in different circumstances, replication of data and
80	   migration of data share a common problem: how to accurately transfer
81	   data (which may be in use by applications) from one location to
82	   another with reasonable bandwidth usage and in reasonable time.
83	   Years ago, this was done by taking storage offline (or at least
84	   preventing write access), making a tape copy of the data files, and
85	   walking it to the new machine, after warning the twenty or so people
86	   who cared about it.  Networks reduced wear on sneakers, but many of
87	   the data formats we use for filesystem copies tend to be little
88	   improved - they are either lowest-common-denominator standards like
89	   "tar" and "cpio" or internal dump formats which are non-standard.
90	   Today, with distributed filesystems like NFS Version 4, richer
91	   metadata including Access Control Lists (ACLs) and extended
92	   attributes, and potential users all over the enterprise and the
93	   Internet, we need something better - a standard, complete and
94	   extensible protocol to transfer filesystems.

96	   Though data replication and transfer are needed in many areas, this
97	   document will focus primarily on solving the problem of providing
98	   replication and migration support between NFS Version 4 servers.  It
99	   is assumed that the reader has familiarity with NFS Version 4
100	   [RFC3010].

102	1.1.  Definitions of terms

104	1.1.1.  Replication

106	   Filesystem replication is the creation of a functionally identical
107	   copy of a filesystem, usually to enhance availability or provide for
108	   redundancy or disaster recovery.  For example, a company may set up
109	   replicas of a customer database accessed by employees in different
110	   geographies.  The data sets are often read-only, and initial creation
111	   of a replica is not as interesting a problem as maintaining the
112	   replica efficiently over time via incremental updates, which will
113	   likely be set up to push automatically.

115	1.1.2.  Migration

117	   Filesystem migration is the moving of a filesystem to another server
118	   for load balancing purposes or because a user or server has moved.
119	   For example, a user may have moved from one building to another, or
120	   across the country, and want his home directory to follow him, or it
121	   may just be time to decommission an old server and move data to a new
122	   one.  Only one data transfer is done, and it is important for this to
123	   be done efficiently and with the lowest possible impact on users.

125	1.2.  Current practice

127	   System administrators typically have several options available to
128	   them to replicate or migrate files, but none of them covers the whole
129	   problem space:

131	   o    The pax, cpio and tar tape archivers as defined by IEEE 1003.1
132	        or ISO/IEC 9945-1 are often used without tape over a network for
133	        data transfer; these support only generic Unix-specific metadata
134	        and do not support ACLs or extended attributes

136	   o    The rdist (http://www.magnicomp.com/rdist) and rsync
137	        (http://samba.anu.edu.au/rsync) applications focus on
138	        propagating changes to replicas, but are documented only by
139	        source code, are not available on all platforms, and do not
140	        support more than generic Unix-specific metadata

142	   o    "cp -r" or its equivalent over NFS Version 4 could work in cases
143	        where capabilities of servers were the same, but if the
144	        destination did not support ACLs or extended attributes, would
145	        it do what the user wanted?

147	   o    Most server filesystems have a "dump" format of some kind, which
148	        can preserve all data and metadata as long as there are no
149	        architectural differences in the servers

151	   o    Most server vendors have products which can keep replicas in
152	        sync by monitoring changes at the block level below the server
153	        filesystem, which are again inherently tied to one architecture

155	   o    Most of the above tools are not set up to properly deal with
156	        exotic metadata which may be present on filesystems like MacOS's
157	        HFS or NTFS, which can result in loss of data even when
158	        transferring to the same platform

160	1.3.  The problem

162	1.3.1.  NFS clients today

164	   Replication and migration events both cause problems for NFS clients,
165	   which may have applications operating on data when the event occurs.
166	   Past versions of NFS did not provide any support in the protocol for the
167	   client, and typical clients did not even attempt to find another
168	   replica which might provide service.

170	1.3.2.  NFS Version 4

172	   NFS Version 4 [RFC3010] introduced some extra error codes and
173	   attributes to improve this situation.  For replication, the new
174	   "fs_locations" attribute could be retrieved by the client to determine
175	   if multiple locations were available, so that when a server became
176	   unavailable, the client could fail over to a new location without
177	   hoping updated information was available in its name service.  For
178	   migration and in the case of a decommissioned replica, the
179	   NFS4ERR_MOVED error would inform a client that it should consult
180	   "fs_locations" and make contact with a new server responsible for the
181	   data.  In both cases, a client is required to establish a
182	   relationship with a new server, which may involve state recovery and
183	   using saved pathname information to discover new filehandles.
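
   The client-side behavior described above can be sketched roughly as
   follows.  This is only an illustration under stated assumptions: the
   FakeServer class, the Status values and the read_with_failover()
   helper are hypothetical names invented for the sketch, the status
   codes are symbolic rather than wire values, and state recovery and
   filehandle rediscovery are left out entirely.

```python
from enum import Enum, auto


class Status(Enum):
    OK = auto()
    MOVED = auto()        # stands in for NFS4ERR_MOVED (not the wire value)
    UNREACHABLE = auto()  # hypothetical: models a server that cannot be contacted


class FakeServer:
    """Hypothetical in-memory stand-in for an NFS Version 4 server."""

    def __init__(self, status, fs_locations=()):
        self.status = status
        self.fs_locations = list(fs_locations)

    def read(self, path):
        return self.status


def read_with_failover(servers, start, path, max_hops=4):
    """Try `start`, then follow fs_locations until some server answers OK.

    Returns (name of the server that answered, list of servers tried).
    """
    current = start
    # Assumption: the client fetched fs_locations while the server was healthy.
    known = list(servers[start].fs_locations)
    tried = []
    for _ in range(max_hops):
        tried.append(current)
        server = servers[current]
        status = server.read(path)
        if status is Status.OK:
            return current, tried
        if status is Status.MOVED:
            # The old server says where the filesystem now lives.
            known = list(server.fs_locations)
        candidates = [name for name in known if name not in tried]
        if not candidates:
            break
        current = candidates[0]
    raise RuntimeError("no usable location for " + path)
```

   In the migration case the old server answers with the MOVED status
   and the client follows its "fs_locations"; in the replica case the
   client falls back on the location list it cached earlier.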

185	1.4.  The need for a transfer protocol

187	   To support NFS Version 4, a method is needed to transfer functionally
188	   complete filesystem data from one server to another.  The
189	   shortcomings listed previously in the common tools in use demonstrate
190	   that there is value in a standard protocol to transfer filesystem
191	   data.

193	2.  Requirements

195	   The requirements for a replication and migration protocol are to be
196	   addressed in a separate document, but are approximately these:

198	2.1.  Interoperability

200	   The replication/migration protocol must first and foremost be one
201	   which can potentially be implemented on any server.  Several vendors
202	   already have a replication mechanism in their product lines which
203	   takes advantage of known properties of their servers to replicate at
204	   the block level, but this is inherently tied to one system.

206	2.2.  Transparency

208	   When a client has been using a file which has been migrated, it
209	   should be able to detect this and recover the file state on the new
210	   server without applications needing to take action.  Similarly, when
211	   a client has availability problems with a particular replica, it
212	   should be able to adapt to the use of the new replica without
213	   application involvement.  This implies that, as far as possible, the
214	   replication/migration protocol must copy all filesystem data, as much
215	   metadata as possible, and all non-recoverable transient state such as
216	   outstanding lock and delegation state, completely and correctly.  It
217	   is acceptable that the client must recover some state as occurs in
218	   the event of a server reboot.

220	2.3.  Security

222	   NFS Version 4 supported strong mandatory-to-implement security
223	   mechanisms to protect the integrity and privacy of file data and
224	   metadata.  The replication/migration protocol must specify
225	   mandatory-to-implement security to protect data in transit, and
226	   provide a security payload and an encryption mechanism to ensure
227	   strong security for each message.  It is expected that the security
228	   mechanisms will correlate well with NFS Version 4 [RFC3010].

230	2.4.  Efficiency

232	   The replication/migration protocol must get the job of data movement
233	   done as efficiently as possible in terms of both bandwidth and time.
234	   Components of this are:

236	   o    the protocol will conserve bandwidth by streaming data in large
237	        blocks with limited header overhead

239	   o    the protocol will transfer changed regions in files rather than
240	        complete files whenever possible

242	   o    the protocol will permit restart in the event of a server
243	        failure or lost connection

245	2.5.  Scalability

247	   The replication/migration protocol must be able to handle both huge
248	   files and huge filesystems, while maintaining low enough overhead to
249	   work well with small filesystems as well.

251	3.  What the protocol will not do (now)

253	   There have been discussions about the things a good replication
254	   protocol could do which are not considered part of the scope of this
255	   work, though some of them could be specified by future RFCs.  These
256	   non-requirements include:

258	   o    being an "rdist" or "rsync" replacement

260	   o    being a tool to permit unprivileged users to copy file trees

262	   o    being used for replication of other types of data

264	4.  Design considerations

266	4.1.  Basic structure

268	   For best performance, a replication/migration protocol should be able
269	   to move large amounts of data without frequent small packets in the
270	   direction of data movement.  Use of RPC [RFC1831] may be
271	   inappropriate; current thinking is that the protocol should be
272	   composed of messages encoded with XDR [RFC1832], exchanged under the
273	   control of a finite state machine.  Groups of messages would probably
274	   include:

276	   o    Initialization and negotiation messages

278	   o    Filesystem information messages

280	   o    Data transfer messages

282	   o    Finalization messages
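
   As a rough illustration of the four message groups above driving a
   finite state machine, a minimal sketch might look like the
   following.  The state names, message names and the transition table
   are assumptions made for this example; the draft does not define
   them.

```python
from enum import Enum


class State(Enum):
    START = "start"
    NEGOTIATED = "negotiated"
    READY = "ready"
    TRANSFERRING = "transferring"
    DONE = "done"


# Hypothetical transition table: (current state, message group) -> next state.
TRANSITIONS = {
    (State.START, "negotiate"): State.NEGOTIATED,        # initialization and negotiation
    (State.NEGOTIATED, "fs_info"): State.READY,          # filesystem information
    (State.READY, "data"): State.TRANSFERRING,           # first data transfer message
    (State.TRANSFERRING, "data"): State.TRANSFERRING,    # further data messages
    (State.READY, "finalize"): State.DONE,               # nothing to transfer
    (State.TRANSFERRING, "finalize"): State.DONE,        # finalization
}


def run_session(messages):
    """Feed a sequence of message-group names through the state machine."""
    state = State.START
    for message in messages:
        try:
            state = TRANSITIONS[(state, message)]
        except KeyError:
            raise ValueError(
                "message %r not allowed in state %s" % (message, state.name)
            ) from None
    return state
```

   A table-driven machine like this keeps the set of legal message
   sequences in one place, which makes it easy to reject out-of-order
   messages.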

284	4.2.  Administrative Control

286	   The replication and migration protocol should include nothing
287	   specifying how an administrative user contacts a server to initiate
288	   replication or migration.  A separate document should define a
289	   mechanism suitable for this purpose.

291	4.3.  Basic environment

293	   The replication/migration protocol should be available to a
294	   privileged context on a well-known TCP port on an NFSv4 server, able
295	   to authenticate and act on control messages from administration
296	   clients and general messages from other servers.

298	4.4.  Handling file changes

300	   For replication, it should be possible to handle large files changed
301	   in small ways without transferring the entire file.  The protocol
302	   needs to be able to express changes to byte ranges within a file;
303	   ideally, the server will be able to extract such changes from some
304	   kind of change log or from internal filesystem data.  However, this
305	   may not be practical.  The existence of "rdist" shows that a
306	   bidirectional protocol can determine differences in files at a
307	   reasonable bandwidth cost, and it would be good for the
308	   replication/migration protocol to be able to operate in this mode.
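
   One simple way to express changes as byte ranges, when no change log
   is available, is for the replica to send per-block digests and for
   the source to reply with only the ranges that differ.  The sketch
   below assumes fixed-size blocks and SHA-256 digests; it is not the
   algorithm used by "rdist" or "rsync", and the function names and the
   tiny block size are chosen purely for illustration.

```python
import hashlib

BLOCK = 4  # unrealistically small, so the example is easy to follow


def block_digests(data, block=BLOCK):
    """Digest of each fixed-size block, as the replica would report them."""
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def changed_ranges(new, old_digests, block=BLOCK):
    """Return [(offset, bytes)] covering blocks of `new` that differ."""
    ranges = []
    for index, offset in enumerate(range(0, len(new), block)):
        chunk = new[offset:offset + block]
        if (index < len(old_digests)
                and hashlib.sha256(chunk).digest() == old_digests[index]):
            continue
        if ranges and ranges[-1][0] + len(ranges[-1][1]) == offset:
            # Merge with the previous range when the two are adjacent.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + chunk)
        else:
            ranges.append((offset, chunk))
    return ranges


def apply_ranges(old, ranges, new_length):
    """Rebuild the new file on the replica from its old copy plus ranges."""
    buf = bytearray(old[:new_length].ljust(new_length, b"\0"))
    for offset, chunk in ranges:
        buf[offset:offset + len(chunk)] = chunk
    return bytes(buf)
```

   With fixed blocks, an insertion near the start of a file shifts
   every later block and defeats the comparison, which is one reason a
   real design might prefer a change log or a more elaborate scheme.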

310	4.5.  Replication model

312	   Replication is usually set up as a series of read-only replicas, with
313	   the master copy of the filesystem generally inaccessible to the
314	   client or accessible through a different mount point.  It is possible
315	   to envision a case where, along with several read-only replicas, a
316	   single writer is available and "marked" as such in the fs_locations
317	   attribute.  The client would have to ensure that all reads and writes
318	   were directed to the writable copy from the time a particular file on
319	   the filesystem was first written to the time the client ceased caring
320	   about the file.  This is considered beyond our current scope.

323	5.  Security considerations

325	   NFS Version 4 is the primary impetus behind a replication/migration
326	   protocol, so this protocol should mandate a strong security scheme
327	   and security negotiation in a manner compatible with NFS Version 4.
328	   Since NFS Version 4 specifies RPCSEC_GSS [RFC2203], which in turn
329	   builds on GSS-API [RFC2078], it makes sense for a
330	   replication/migration protocol to specify RPCSEC_GSS if it is based
331	   on RPC, and GSS-API if it is not based on RPC.  Kerberos Version 5
332	   will be used as described in [RFC1964] to provide one security
333	   framework.  The LIPKEY GSS-API mechanism described in [RFC2847] will
334	   be used to provide for the use of user password and server public
335	   key.  An initial message exchange will permit security negotiation.
336	   The replication/migration protocol will also specify a NULL security
337	   mechanism to optimize its performance when used with a strong
338	   host-based security mechanism such as SSH or IPsec.

340	6.  Implementation considerations

342	6.1.  Filehandle preservation

344	   Filehandles are the basic shorthand used by clients to perform most
345	   operations on files.  They are opaque to the client, but are usually
346	   derived from:

348	   o    the fsid of the filesystem

350	   o    the fileid or "inode number" of the directory shared by the
351	        server

353	   o    the fileid or "inode number" of the file

355	   o    the "generation number", an internal field to support inode
356	        reuse.

358	   It is, in some circumstances, desirable to preserve persistent
359	   filehandles across a replication or migration event.  The most likely
360	   circumstance for this is when both servers are of the same
361	   architecture, and when the destination server can assign values to
362	   these fields as data is accepted.  To support this case, the
363	   filehandle should be available as an attribute which can be passed to
364	   the new server.  Some operating environments will not have interfaces
365	   to support access to this data or a way to recreate it anew, so this
366	   should be negotiated so that this data is not sent unnecessarily.
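
   To make the filehandle composition concrete, the sketch below packs
   the four fields listed above into an opaque byte string and unpacks
   them again, as a destination server might do if it can assign those
   values itself.  The field widths, their order and the helper names
   are hypothetical; real servers use their own private layouts.

```python
import struct
from typing import NamedTuple

# Hypothetical layout: three 64-bit fields and one 32-bit field, big-endian.
_LAYOUT = struct.Struct(">QQQI")


class HandleFields(NamedTuple):
    fsid: int            # filesystem identifier
    export_fileid: int   # fileid of the directory shared by the server
    fileid: int          # fileid ("inode number") of the file itself
    generation: int      # bumped when the inode is reused


def make_filehandle(fields):
    """Build the opaque filehandle bytes handed to clients."""
    return _LAYOUT.pack(*fields)


def parse_filehandle(handle):
    """Recover the fields; only the server that owns the layout can do this."""
    return HandleFields(*_LAYOUT.unpack(handle))
```

   If the destination can recreate the same field values, the same
   bytes result and existing client filehandles remain usable; a
   changed generation number yields a different handle, which is how
   reuse of an inode is detected.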

368	   Even if a server implementation can transfer and accept persistent
369	   filehandles, it must ensure that the client is not falsely promised
370	   that this will happen.  [RFC3010] specifies that a server may migrate
371	   a filesystem with persistent filehandles as long as the new server
372	   also uses persistent filehandles and the same filehandles will
373	   correspond to the same files after migration.  In the general case,
374	   the decision to migrate a filesystem, perhaps to a heterogeneous
375	   server with different filehandles, will be made after clients have
376	   accessed filesystems and learned of the value of the "fh_expire_type"
377	   attribute.  Thus it seems necessary that servers return an
378	   "fh_expire_type" of at least FH4_VOL_MIGRATION so that clients will
379	   always store partial pathnames for later use.  It is possible for
380	   clients to attempt to use pre-event filehandles with the new server
381	   in the hope that persistent filehandles would have been transferred
382	   intact, but there is no way for the server to promise this unless it
383	   will never transfer to a server of a different implementation.

385	6.2.  Data transfer phases

387	   For both replication and migration, transfer most generally happens
388	   in two phases: first, the bulk of the data is copied to the target
389	   while access to the source filesystem continues, and second, changes
390	   made since the start of the first phase are transferred while write
391	   access to the source filesystem is curtailed.  This reduces the
392	   window during which clients will see restrictions, at the cost of
393	   needing a method to lock out writes to files in the file tree.  For
394	   replication, it would be possible to bypass locking by the use of
395	   multiple point-in-time copies ("snapshots"), since the delta
396	   represented by each snapshot could be used to update the replicas.
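
   The two phases can be sketched with an in-memory model.  Everything
   here is an assumption made for illustration: the SourceFS class, the
   dirty-set bookkeeping and the way client writes are simulated during
   the bulk copy.  File deletions and snapshot-based replication are
   not modeled.

```python
class SourceFS:
    """Hypothetical source filesystem that records which paths were written."""

    def __init__(self, files):
        self.files = dict(files)
        self.dirty = set()
        self.writes_blocked = False

    def write(self, path, data):
        if self.writes_blocked:
            raise PermissionError("write access is curtailed during the final phase")
        self.files[path] = data
        self.dirty.add(path)


def migrate(source, target, writes_during_bulk=()):
    """Copy `source` into the dict `target`; return paths resent in phase 2."""
    source.dirty.clear()

    # Phase 1: bulk copy while clients may still write to the source.
    for path, data in list(source.files.items()):
        target[path] = data
    for path, data in writes_during_bulk:  # simulated concurrent client writes
        source.write(path, data)

    # Phase 2: lock out writes, then send only what changed since phase 1 began.
    source.writes_blocked = True
    resent = sorted(source.dirty)
    for path in resent:
        target[path] = source.files[path]
    source.dirty.clear()
    return resent
```

   Only the paths touched during the bulk copy are sent again, so the
   period during which writes are refused is proportional to the amount
   of change rather than to the size of the filesystem.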

398	6.3.  Operation on filesystem subsets

400	   When NFSv4 clients discover that they must react to a replication or
401	   migration event, [RFC3010] states that they will recover at the
402	   granularity of an entire filesystem, i.e. a set of files sharing the
403	   same "fsid" attribute.  It is possible that this protocol could be
404	   useful for splitting up large filesystems to permit them to be
405	   replicated and migrated separately.  This can most easily be done if
406	   the server can arrange to return distinct "fsid"s for subdirectories
407	   of what it manages as a single filesystem.

409	7.  Difficult issues

411	7.1.  Transparency violations

413	   When the protocol is used between servers that are sufficiently
414	   different, it may be impossible for the new server to support some
415	   metadata enumerated in the data stream, or metadata critical to the
416	   new server may not be supported on the old.  When this happens, the
417	   client may notice and react badly to the loss of transparency.
418	   Sources of this kind of problem include:

420	   o    Filename encoding differences

422	   o    Attributes supported on one server and not the other

424	   o    A failure of atomicity during transfer

426	   o    Incomplete or no transfer of locking, delegation and other state

428	7.2.  Directory access

430	   When a directory is read, a series of RPCs is used to get the entries
431	   in small parts.  The sequence of RPCs is tied together by a "cookie"
432	   returned by the server in each reply and used by the client in the
433	   next request.  The sequence can be interrupted by a replication or
434	   migration event, which can lead to NFS4ERR_BAD_COOKIE on the new
435	   server, even if the servers are the same architecture, due to
436	   different orders of creation of the directory entries and compaction.
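
   One possible client strategy, shown here only as a sketch, is to
   restart the enumeration from the beginning when the new server
   rejects the cookie, and to suppress names already returned to the
   application.  The DirServer class, its cookie scheme and the
   BadCookie exception are hypothetical stand-ins; nothing in this
   document requires clients to behave this way.

```python
class BadCookie(Exception):
    """Stands in for an NFS4ERR_BAD_COOKIE reply."""


class DirServer:
    """Hypothetical server whose cookies are only meaningful to itself."""

    def __init__(self, names, cookie_base):
        self.names = list(names)
        self.base = cookie_base

    def readdir(self, cookie, count):
        if cookie == 0:
            start = 0  # cookie 0 means "start of directory"
        else:
            start = cookie - self.base
            if not 0 < start <= len(self.names):
                raise BadCookie(cookie)
        chunk = self.names[start:start + count]
        eof = start + len(chunk) >= len(self.names)
        return chunk, self.base + start + len(chunk), eof


def list_directory(get_server, count=2):
    """Read a whole directory, restarting if the cookie becomes invalid.

    `get_server` returns the server currently responsible for the data.
    """
    seen, order = set(), []
    cookie = 0
    while True:
        try:
            names, cookie, eof = get_server().readdir(cookie, count)
        except BadCookie:
            cookie = 0  # the event invalidated our position; start over
            continue
        for name in names:
            if name not in seen:  # skip entries already delivered
                seen.add(name)
                order.append(name)
        if eof:
            return order
```

   This recovers every entry present on the new server, but it cannot
   retract a name that was already delivered and no longer exists, so
   the result is best-effort rather than fully transparent.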

438	8.  Bibliography

440	   [RFC1831]
441	   R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
442	   Version 2", RFC1831, August 1995.

444	   [RFC1832]
445	   R. Srinivasan, "XDR: External Data Representation Standard", RFC1832,
446	   August 1995.

448	   [RFC3010]
449	   S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
450	   Eisler, D. Noveck, "NFS version 4 Protocol", RFC3010, December 2000.

452	   [RDIST]
453	   MagniComp, Inc., "The RDist Home Page",
454	   http://www.magnicomp.com/rdist.

456	   [RSYNC]
457	   The Samba Team, "The rsync web pages", http://samba.anu.edu.au/rsync.

459	9.  Author's Address

461	   Address comments related to this memorandum to:

463	        nfsv4-wg@sunroof.eng.sun.com

465	   Robert Thurlow
466	   Sun Microsystems, Inc.
467	   500 Eldorado Boulevard, UBRM05-171
468	   Broomfield, CO 80021

470	   Phone: 877-718-3419
471	   E-mail: robert.thurlow@sun.com