< draft-talpey-nfsv4-rdma-sess-00.txt   draft-talpey-nfsv4-rdma-sess-01.txt >
Internet-Draft Tom Talpey Internet-Draft Tom Talpey
Expires: November 2003 Network Appliance, Inc. Expires: August 2004 Network Appliance, Inc.
Spencer Shepler Spencer Shepler
Sun Microsystems, Inc. Sun Microsystems, Inc.
May, 2003 February, 2004
NFSv4 RDMA and Session Extensions NFSv4 RDMA and Session Extensions
draft-talpey-nfsv4-rdma-sess-00.txt draft-talpey-nfsv4-rdma-sess-01
Status of this Memo Status of this Memo
This document is an Internet-Draft and is subject to all provisions This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026. of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet- other groups may also distribute working documents as Internet-
Drafts. Drafts.
skipping to change at page 1, line 38 skipping to change at page 1, line 37
progress." progress."
The list of current Internet-Drafts can be accessed at The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html. http://www.ietf.org/shadow.html.
Copyright Notice Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved. Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract Abstract
Extensions are proposed to NFS version 4 which enable it to support Extensions are proposed to NFS version 4 which enable it to support
sessions, connection management, and operation atop RDMA-capable sessions, connection management, and operation atop either TCP or
RPC. These extensions enable universal support for Exactly-once RDMA-capable RPC. These extensions enable universal support for
Semantics by NFSv4 servers, enhanced security, and multipathing and exactly-once semantics by NFSv4 servers, enhanced security,
trunking of transport connections. These extensions provide multipathing and trunking of transport connections. These
identical benefit over both TCP and RDMA connection types. extensions provide identical benefits over both TCP and RDMA
connection types.
Table Of Contents Table Of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . 4 1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . 4
1.2. Problem Statement . . . . . . . . . . . . . . . . . . . 5 1.2. Problem Statement . . . . . . . . . . . . . . . . . . . 5
1.3. NFSv4 Over RDMA Characteristics . . . . . . . . . . . . 7 1.3. NFSv4 Session Extension Characteristics . . . . . . . . 6
1.4. RDMA Requirements . . . . . . . . . . . . . . . . . . . 7 2. Transport Issues . . . . . . . . . . . . . . . . . . . . . 7
2. Transport Issues . . . . . . . . . . . . . . . . . . . . . 9 2.1. Session Model . . . . . . . . . . . . . . . . . . . . . 7
2.1. Session Model . . . . . . . . . . . . . . . . . . . . . 9 2.1.1. Connection State . . . . . . . . . . . . . . . . . . . 8
2.1.1. Connection State . . . . . . . . . . . . . . . . . . . 10 2.1.2. Channels . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2. Connection Resources . . . . . . . . . . . . . . . . . 10 2.1.3. Reconnection, Trunking, Failover . . . . . . . . . . . 10
2.1.3. Channels . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.4. Server Duplicate Request Cache . . . . . . . . . . . . 11
2.1.4. Reconnection, Trunking, Failover . . . . . . . . . . . 12 2.2. RDMA . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.5. Server Duplicate Request Cache . . . . . . . . . . . . 12 2.2.1. RDMA Requirements . . . . . . . . . . . . . . . . . . 12
2.2. RDMA Negotiation . . . . . . . . . . . . . . . . . . . . 14 2.2.2. RDMA Negotiation . . . . . . . . . . . . . . . . . . . 12
2.3. RDMA Inline Model . . . . . . . . . . . . . . . . . . . 15 2.2.3. Connection Resources . . . . . . . . . . . . . . . . . 14
2.4. RDMA Direct Model . . . . . . . . . . . . . . . . . . . 18 2.2.4. Inline Transfer Model . . . . . . . . . . . . . . . . 14
2.5. Connection Models . . . . . . . . . . . . . . . . . . . 20 2.2.5. Direct Transfer Model . . . . . . . . . . . . . . . . 17
2.5.1. TCP Stream Connection Model . . . . . . . . . . . . . 21 2.3. Connection Models . . . . . . . . . . . . . . . . . . . 20
2.5.2. Negotiated RDMA Connection Model . . . . . . . . . . . 22 2.3.1. TCP Connection Model . . . . . . . . . . . . . . . . . 21
2.5.3. Automatic RDMA Connection Model . . . . . . . . . . . 23 2.3.2. Negotiated RDMA Connection Model . . . . . . . . . . . 21
2.6. Buffer Management, Transfer, Flow Control . . . . . . . 24 2.3.3. Automatic RDMA Connection Model . . . . . . . . . . . 22
2.7. Retry and Replay . . . . . . . . . . . . . . . . . . . . 27 2.4. Buffer Management, Transfer, Flow Control . . . . . . . 23
2.8. The Back Channel . . . . . . . . . . . . . . . . . . . . 27 2.5. Retry and Replay . . . . . . . . . . . . . . . . . . . . 26
2.9. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 29 2.6. The Back Channel . . . . . . . . . . . . . . . . . . . . 26
2.10. Inline Data Alignment . . . . . . . . . . . . . . . . . 30 2.7. COMPOUND Sizing Issues . . . . . . . . . . . . . . . . . 28
3. NFSv4 Integration . . . . . . . . . . . . . . . . . . . . 31 2.8. Data Alignment . . . . . . . . . . . . . . . . . . . . . 29
3.1. Minor Versioning . . . . . . . . . . . . . . . . . . . . 31 3. NFSv4 Integration . . . . . . . . . . . . . . . . . . . . 30
3.2. Stream Identifiers and Exactly-Once Semantics . . . . . 32 3.1. Minor Versioning . . . . . . . . . . . . . . . . . . . . 30
3.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 33 3.2. Stream Identifiers and Exactly-Once Semantics . . . . . 31
3.4. eXternal Data Representation Efficiency . . . . . . . . 34 3.3. COMPOUND and CB_COMPOUND . . . . . . . . . . . . . . . . 32
3.5. Effect of Sessions on Existing Operations . . . . . . . 35 3.4. eXternal Data Representation Efficiency . . . . . . . . 33
3.5. Effect of Sessions on Existing Operations . . . . . . . 34
3.6. Authentication Efficiencies . . . . . . . . . . . . . . 35 3.6. Authentication Efficiencies . . . . . . . . . . . . . . 35
4. Security Considerations . . . . . . . . . . . . . . . . . 36 4. Security Considerations . . . . . . . . . . . . . . . . . 36
5. IANA Considerations . . . . . . . . . . . . . . . . . . . 37 5. IANA Considerations . . . . . . . . . . . . . . . . . . . 37
6. NFSv4 Protocol RDMA and Session Extensions . . . . . . . . 38 6. NFSv4 Protocol Extensions . . . . . . . . . . . . . . . . 37
6.1. SESSION_CREATE . . . . . . . . . . . . . . . . . . . . . 38 6.1. SESSION_CREATE . . . . . . . . . . . . . . . . . . . . . 38
6.2. SESSION_BIND . . . . . . . . . . . . . . . . . . . . . . 39 6.2. SESSION_BIND . . . . . . . . . . . . . . . . . . . . . . 39
6.3. SESSION_DISCONNECT . . . . . . . . . . . . . . . . . . . 40 6.3. SESSION_DESTROY . . . . . . . . . . . . . . . . . . . . 41
6.4. OPERATION_CONTROL . . . . . . . . . . . . . . . . . . . 41 6.4. OPERATION_CONTROL . . . . . . . . . . . . . . . . . . . 42
6.5. CB_CREDITRECALL . . . . . . . . . . . . . . . . . . . . 42 6.5. CB_CREDITRECALL . . . . . . . . . . . . . . . . . . . . 43
7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 43 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . 43
References . . . . . . . . . . . . . . . . . . . . . . . . 43 References . . . . . . . . . . . . . . . . . . . . . . . . 43
Authors' Addresses . . . . . . . . . . . . . . . . . . . . 45 Authors' Addresses . . . . . . . . . . . . . . . . . . . . 46
Full Copyright Statement . . . . . . . . . . . . . . . . . 46 Full Copyright Statement . . . . . . . . . . . . . . . . . 46
1. Introduction 1. Introduction
This draft proposes extensions to NFS version 4 enabling it to This draft proposes extensions to NFS version 4 enabling it to
support sessions and connection management, and to support support sessions and connection management, and to support
operation atop RDMA-capable RPC over transport such as iWARP. operation atop RDMA-capable RPC over transports such as iWARP.
[RDDP] These extensions enable universal support for Exactly-once [RDMAP, DDP] These extensions enable universal support for exactly-
Semantics by NFSv4 servers, multipathing and trunking of transport once semantics by NFSv4 servers, multipathing and trunking of
connections, and enhanced security. The ability to operate over transport connections, and enhanced security. The ability to
RDMA enables greatly enhanced performance. Operation over existing operate over RDMA enables greatly enhanced performance. Operation
TCP is additionally enhanced. over existing TCP is enhanced as well.
While discussed here on IETF-chartered transports, the proposed While discussed here with respect to IETF-chartered transports, the
protocol is intended to function over other standards, such as proposed protocol is intended to function over other standards,
Infiniband. [IB] such as Infiniband. [IB]
The following are the major aspects of this proposal: The following are the major aspects of this proposal:
o Changes are proposed within the framework of NFSv4 minor o Changes are proposed within the framework of NFSv4 minor
versioning. RPC, XDR, and the NFSv4 procedures and operations versioning. RPC, XDR, and the NFSv4 procedures and operations
are preserved. The proposed minor version functions equally are preserved. The proposed minor version functions equally
well over existing transports and RDMA, and interoperates well over existing transports and RDMA, and interoperates
transparently with existing implementations, both at the local transparently with existing implementations, both at the local
programmatic interface and over the wire. programmatic interface and over the wire.
skipping to change at page 3, line 45 skipping to change at page 3, line 45
NFSv4 implementations. The NFSv4 callback channel is NFSv4 implementations. The NFSv4 callback channel is
associated with a session, and is connected by the client and associated with a session, and is connected by the client and
not the server, enhancing security and operation through not the server, enhancing security and operation through
firewalls. In fact, the callback channel will be enabled to firewalls. In fact, the callback channel will be enabled to
share the same connection as the operations channel. share the same connection as the operations channel.
o An enhanced RPC layer enables NFSv4 operation atop RDMA. The o An enhanced RPC layer enables NFSv4 operation atop RDMA. The
session is RDMA-aware, and additional facilities are provided session is RDMA-aware, and additional facilities are provided
for managing RDMA resources at both NFSv4 server and client. for managing RDMA resources at both NFSv4 server and client.
Existing NFSv4 operations continue to function as before, Existing NFSv4 operations continue to function as before,
though certain size limits are negotiated on RDMA transports. though certain size limits are negotiated. A companion draft
A companion draft to this document, "RDMA Transport for ONC to this document, "RDMA Transport for ONC RPC" [RPCRDMA] is to
RPC" [RPCRDMA] is to be referenced for details of RPC RDMA be referenced for details of RPC RDMA support.
support.
o Support for Exactly-Once Semantics (EOS) is enabled by the new o Support for exactly-once semantics ("EOS") is enabled by the
session facilities, providing to the server a way to bound the new session facilities, providing to the server a way to bound
size of the duplicate request cache for a single client, and the size of the duplicate request cache for a single client,
to manage its persistent storage. and to manage its persistent storage.
Block Diagram Block Diagram
+-------------------+------------------------------------+ +-------------------+------------------------------------+
| NFSv4 | NFSv4 + extensions | | NFSv4 | NFSv4 + extensions |
+-------------------+-----+----------------+-------------+ +-------------------+-----+----------------+-------------+
| Operations | Session | | | Operations | Session | |
+-------------------------+----------------+ | +-------------------------+----------------+ |
| RPC/XDR | | | RPC/XDR | |
+---------------------------------+--------+ | +---------------------------------+--------+ |
| Stream Transport | RDMA Transport | | Stream Transport | RDMA Transport |
+---------------------------------+----------------------+ +---------------------------------+----------------------+
1.1. Motivation 1.1. Motivation
NFS version 4 [NFSv4] has recently been granted "Proposed Standard" NFS version 4 [RFC3530] has been granted "Proposed Standard"
status. The NFSv4 protocol was developed along several design status. The NFSv4 protocol was developed along several design
points, important among them: effective operation over wide-area points, important among them: effective operation over wide-area
networks, including the Internet itself; strong security networks, including the Internet itself; strong security
integrated into the protocol; extensive cross-platform integrated into the protocol; extensive cross-platform
interoperability including integrated locking semantics compatible interoperability including integrated locking semantics compatible
with multiple operating systems; and protocol extensibility. with multiple operating systems; and protocol extensibility.
Additionally, over the past year, an effort to standardize a set of The NFS version 4 protocol, however, does not provide support for
protocols for Remote Direct Memory Access, RDMA, over the standard certain important transport aspects. For example, the protocol
Internet Protocol Suite has been chartered [RDDP]. Several drafts does not provide a way to implement exactly-once semantics for
have been proposed and are under discussion. clients, nor an interoperable way to support trunking and
multipathing of connections. This leads to inefficiencies,
Many RDMA specifications and implementations exist, both open and especially where trunking and multipathing are concerned, and
proprietary. [IB, VIA, CLAN, FCVI, MYRNET, QUAD, SVRNET] In fact, presents additional difficulties in supporting RDMA fabrics, in
at least one currently shipping implementation was developed on which endpoints may require dedicated or specialized resources.
standard TCP/IP and was submitted to IETF as an internet-draft
[VITCP]. This implementation is currently shipping from Emulex,
the GN9000/VI "Orion". [ORION]
RDMA is a general solution to the problem of CPU overhead incurred
due to data copies, primarily at the receiver. Substantial
research has addressed this and has borne out the efficacy of the
approach. An overview of this is the RDDP Problem Statement
document, [RDDPPS].
Numerous upper layer protocols achieve extremely high bandwidth and
low overhead through the use of RDMA. Products from a wide variety
of vendors employ RDMA to advantage, and prototypes have
demonstrated the effectiveness of many more. Here, we are
concerned specifically with NFS and NFS-style upper layer
protocols, examples from Network Appliance [DAFS], Sun Microsystems
[SNIA], Fujitsu Prime Software Technologies [FJNFS, FJDAFS] and
Harvard University [KM02] are all relevant.
NFS version 4 currently employs a clientid to identify clients at a
server, and provides no protocol-specified way to associate
additional connections with one another. This leads to
inefficiencies, especially where trunking and multipathing are
concerned, and presents additional difficulties in supporting RDMA
fabrics, where endpoints may require dedicated or specialized
resources.
Sessions can be employed to unify NFS-level constructs such as the Sessions can be employed to unify NFS-level constructs such as the
clientid with transport-level constructs such as transport clientid with transport-level constructs such as transport
endpoints. The endpoint is abstracted to be a member of the endpoints. The transport endpoint is abstracted to be a member of
session. Resource management can be more strictly maintained, the session. Resource management can be more strictly maintained,
leading to greater server efficiency in implementing the protocol. leading to greater server efficiency in implementing the protocol.
The enhanced operation over a session affords an opportunity to the The enhanced operation over a session affords an opportunity to the
server to implement highly reliable and exactly-once semantics. server to implement highly reliable and exactly-once semantics.
NFSv4 advances the state of high-performance local sharing, by NFSv4 advances the state of high-performance local sharing, by
virtue of its integrated security, locking, and delegation, and its virtue of its integrated security, locking, and delegation, and its
excellent coverage of the sharing semantics of multiple operating excellent coverage of the sharing semantics of multiple operating
systems. It is exactly this environment where exactly-once systems. It is precisely this environment where exactly-once
semantics become a fundamental requirement. semantics become a fundamental requirement.
Additionally, efforts to standardize a set of protocols for Remote
Direct Memory Access, RDMA, over the Internet Protocol Suite have
made significant progress. RDMA is a general solution to the
problem of CPU overhead incurred due to data copies, primarily at
the receiver. Substantial research has addressed this and has
borne out the efficacy of the approach. An overview of this is the
RDDP Problem Statement document, [RDDPPS].
Numerous upper layer protocols achieve extremely high bandwidth and
low overhead through the use of RDMA. Products from a wide variety
of vendors employ RDMA to advantage, and prototypes have
demonstrated the effectiveness of many more. Here, we are
concerned specifically with NFS and NFS-style upper layer
protocols; examples from Network Appliance [DAFS, DCK+03], Fujitsu
Prime Software Technologies [FJNFS, FJDAFS] and Harvard University
[KM02] are all relevant.
By layering a session binding for NFS version 4 directly atop a By layering a session binding for NFS version 4 directly atop a
standard RDMA transport, a greatly enhanced level of performance standard RDMA transport, a greatly enhanced level of performance
and transparency can be supported on a wide variety of operating and transparency can be supported on a wide variety of operating
system platforms. These combined capabilities alter the landscape system platforms. These combined capabilities alter the landscape
between local filesystems and network attached storage, enable a between local filesystems and network attached storage, enable a
new level of performance, and lead new classes of application to new level of performance, and lead new classes of application to
take advantage of NFS. take advantage of NFS.
1.2. Problem Statement 1.2. Problem Statement
The principal problem encountered by NFS implementations is the CPU Two issues drive the current proposal: correctness, and
overhead required to implement the protocol. Primary among the performance. Both are instances of "raising the bar" for NFS,
whereby the desire to use NFS in new classes applications can be
accommodated by providing the basic features to make such use
feasible. Such applications include tightly coupled sharing
environments such as cluster computing, high performance computing
(HPC) and information processing such as databases. These trends
are explored in depth in [NFSPS].
The first issue, correctness, exemplified among the attributes of
local filesystems, is support for exactly-once semantics. Such
semantics have not been reliably available with NFS. Server-based
duplicate request caches [CJ89] help, but do not reliably provide
strict correctness. For the type of application which is expected
to make extensive use of the high-performance RDMA-enabled
environment, the reliable provision of such semantics are a
fundamental requirement.
Introduction of a session to NFSv4 will address these issues. With
higher performance and enhanced semantics comes the problem of
enabling advanced endpoint management, for example high-speed
trunking, multipathing and failover. These characteristics enable
availability and performance. RFC3530 presents some issues in
permitting a single clientid to access a server over multiple
connections.
A second issue encountered in common by NFS implementations is the
CPU overhead required to implement the protocol. Primary among the
sources of this overhead is the movement of data from NFS protocol sources of this overhead is the movement of data from NFS protocol
messages to its eventual destination in user buffers or aligned messages to its eventual destination in user buffers or aligned
kernel buffers. The data copies consume system bus bandwidth and kernel buffers. The data copies consume system bus bandwidth and
CPU time, reducing the available system capacity for applications. CPU time, reducing the available system capacity for applications.
[RDDPPS] Achieving zero-copy with NFS has to date required [RDDPPS] Achieving zero-copy with NFS has to date required
sophisticated, "header cracking" hardware and/or extensive sophisticated, "header cracking" hardware and/or extensive
platform-specific virtual memory mapping tricks. platform-specific virtual memory mapping tricks.
Furthermore, NFSv4 will soon be challenged by emerging high-speed
network fabrics such as 10 gigabit Ethernet. Performing even raw
network I/O such as TCP is an issue at such speeds with today's
hardware. The problem is fundamental in nature and has led the
IETF to explore RDMA. [RDDPPS] IETF protocols such as NFS version 4
will clearly follow. Zero-copy techniques benefit file protocols
extensively, as they enable direct user I/O, reduce the overhead of
protocol stacks, provide perfect alignment into caches, etc. Many
studies have already shown the performance benefits of such
techniques [DCK+03, FJNFS, FJDAFS, MAF+02].
Combined in this way, NFSv4, RDMA and the emerging high-speed Combined in this way, NFSv4, RDMA and the emerging high-speed
network fabrics will enable delivery of performance which matches network fabrics will enable delivery of performance which matches
that of the fastest local filesystems, while preserving the key that of the fastest local filesystems, while preserving the key
existing local filesystem semantics. existing local filesystem semantics.
Primary among the attributes of local filesystems is support for
Exactly Once Semantics (EOS). Such semantics have not been
reliably available with NFS. Server-based duplicate request caches
[CJ89] help, but do not provide strict correctness. For the type
of application which is expected to make extensive use of the high-
performance RDMA-enabled environment, such semantics are a
fundamental requirement.
Introduction of a session to NFSv4 will address these. With higher
performance and enhanced semantics comes the problem of enabling
advanced endpoint management, for example high-speed trunking,
multipathing and failover. These characteristics enable
availability and performance. The NFSv4 specification presents
some issues in permitting a single clientid to access a server over
multiple connections.
RDMA implementations generally have other interesting properties, RDMA implementations generally have other interesting properties,
such as hardware assisted protocol access, and support for user such as hardware assisted protocol access, and support for user
space access to I/O. RDMA is compelling here for another reason; space access to I/O. RDMA is compelling here for another reason;
hardware offloaded networking support in itself does not avoid data hardware offloaded networking support in itself does not avoid data
copies, without resorting to implementing part of the NFS protocol copies, without resorting to implementing part of the NFS protocol
in the NIC. Support of RDMA by NFS enables the highest performance in the NIC. Support of RDMA by NFS enables the highest performance
at the architecture level rather than by implementation; this at the architecture level rather than by implementation; this
enables ubiquitous and interoperable solutions. enables ubiquitous and interoperable solutions.
By providing file access performance equivalent to that of local By providing file access performance equivalent to that of local
file systems, NFSv4 over RDMA will enable applications running on a file systems, NFSv4 over RDMA will enable applications running on a
set of client machines to interact through an NFSv4 file system, set of client machines to interact through an NFSv4 file system,
just as applications running on a single machine might interact just as applications running on a single machine might interact
through a local file system. through a local file system.
This raises the issue of whether additional protocol enhancements This raises the issue of whether additional protocol enhancements
to enable such interaction would be desirable and what such to enable such interaction would be desirable and what such
enhancements would be. This is a complicated issue which the enhancements would be. This is a complicated issue which the
working group needs to address. This document will not address working group needs to address and will not be further discussed in
that issue. this document.
1.3. NFSv4 Over RDMA Characteristics 1.3. NFSv4 Session Extension Characteristics
This draft will present a solution based upon minor versioning of This draft will present a solution based upon minor versioning of
NFSv4. It will describe use of RDMA by employing support within an NFSv4. It will introduce a session to collect transport issues
underlying RPC layer [RPCRDMA]. It will introduce a session to together, which in turn enables enhancements such as trunking,
collect transport issues together, which in turn enables failover and recovery. It will describe use of RDMA by employing
enhancements such as trunking, failover and recovery. Most support within an underlying RPC layer [RPCRDMA]. Most
importantly, it will focus on making the best possible use of an importantly, it will focus on making the best possible use of an
RDMA transport. RDMA transport.
These extensions are proposed as elements of a new minor revision These extensions are proposed as elements of a new minor revision
of NFS version 4. In this draft, NFS version 4 will be referred to of NFS version 4. In this draft, NFS version 4 will be referred to
generically as "NFSv4", when describing properties common to all generically as "NFSv4", when describing properties common to all
minor versions. When referring specifically to properties of the minor versions. When referring specifically to properties of the
original, minor version 0 protocol, "NFSv4.0" will be used, and original, minor version 0 protocol, "NFSv4.0" will be used, and
changes proposed here for minor version 1 will be referred to as changes proposed here for minor version 1 will be referred to as
"NFSv4.1". "NFSv4.1".
This draft proposes only changes which are strictly upward- This draft proposes only changes which are strictly upward-
compatible with existing RPC and NFS Application Programming compatible with existing RPC and NFS Application Programming
Interfaces (APIs). Interfaces (APIs).
1.4. RDMA Requirements
A connection oriented (reliable sequenced) RDMA transport is
required. There are several reasons for this. First, this model
most closely reflects the NFSv4 requirement of reliably sequenced,
congestion-controlled transports. Second, to operate correctly
over either an unreliable or unsequenced transport, or both, would
require significant complexity in the implementation and protocol
not appropriate for a strict minor version. For example,
retransmission on connected endpoints is explicitly disallowed in
the current NFSv4 draft; it would again be required with these
alternate transport characteristics. Third, the proposal assumes a
specific RDMA ordering semantic, which presents the same set of
ordering and reliability issues to the RDMA layer over such
transports.
The IETF RDDP Working group is addressing such a transport, other
examples are Infiniband "Reliable Connected" service and the
Virtual Interface Architecture.
Conceptually, any such RDMA transport implementation provides for
certain basic setup primitives, and three types of transfer.
The RDMA implementation provides for making connections to other
RDMA-capable peers. In the case of the current proposals before
the RDDP working group, these RDMA connections are preceded by a
"streaming" phase, where ordinary TCP (or NFS) traffic might flow.
However, this is not assumed here and sizes and other parameters
are explicitly negotiated prior to RDMA mode in all cases.
The RDMA implementation provides primitives for registering and
deregistering memory for RDMA access. These operations are
potentially expensive, since they require pinning of memory and
resources, as well as initializing hardware mappings. Lightweight
operations called "binding" can be used in certain circumstances.
In all cases, to achieve true zero-copy, the actual buffer destined
to receive the transferred data is ideally used, this may be a
region of user memory.
Data is transferred between RDMA peers through any of three
transfer models.
Send
Data may be transmitted into untagged receive buffers on the
remote peer via a Send operation, which typically results in a
completion being posted at the receiver. If a buffer is not
available at the receiver, or if the buffer is not large
enough to accept the entire operation, a fatal error will
result on the connection. Sends complete at the receiver in
the order in which they were issued at the sender.
RDMA Write
Data may be directly placed into tagged target buffer(s) on
the remote peer via an RDMA Write operation. This data
transfer operation does not generate a completion at the
receiver. The target buffer is described by a handle, along
with an offset and length to access byte ranges within the
region described by the handle. The handle may be used for
one operation or many. Data placed by RDMA write operations
is not guaranteed to be valid until a subsequent successful
send completion has been obtained by the receiver.
RDMA Read
Data may be directly fetched from a remote peer via an RDMA
Read operation, which does not generate any completion at the
data source. Two target buffer handles are used by RDMA Read,
one for the source and another for the destination, along with
offsets and lengths. The RDMA Read operation makes very few
guarantees as to the consistency of the data fetched with
respect to local access by processes at the data source,
however it does have certain consistency guarantees with
respect to the initiator's RDMA operations.
2. Transport Issues 2. Transport Issues
The Transport Issues section of the document explores the details The Transport Issues section of the document explores the details
of utilizing an RDMA transport. of utilizing the various supported transports.
2.1. Session Model 2.1. Session Model
The first and most evident issue in supporting diverse transports The first and most evident issue in supporting diverse transports
is how to provide for their differences. This draft proposes is how to provide for their differences. This draft proposes
introducing an explicit session. introducing an explicit session.
An initialized session will be required before processing requests An initialized session will be required before processing requests
contained within COMPOUND and CB_COMPOUND procedures of NFSv4.1. A contained within COMPOUND and CB_COMPOUND procedures of NFSv4.1. A
session introduces minimal protocol requirements, and provides for session introduces minimal protocol requirements, and provides for
a highly useful and convenient way to manage numerous endpoint- a highly useful and convenient way to manage numerous endpoint-
related issues. The session is a local construct; it represents a related issues. The session is a local construct; it represents a
named, higher-layer object to which connections can refer, and named, higher-layer object to which connections can refer, and
encapsulates properties important to each transport layer endpoint. encapsulates properties important to each transport layer endpoint.
A session is a dynamically created, persistent object created by a A session is a dynamically created, persistent object created by a
client, used over time from one or more transport connections. Its client, used over time from one or more transport connections. Its
function is to maintain the server's state relative to any single function is to maintain the server's state relative to any single
client instance. This state is entirely independent of the client instance. This state is entirely independent of the
connection itself. connection itself. The session in effect becomes the "top-level"
object representing an active client.
The session enables several things immediately. Clients may The session enables several things immediately. Clients may
disconnect and reconnect (voluntarily or not) without loss of disconnect and reconnect (voluntarily or not) without loss of
context at the server. (Of course, locks, delegations and related context at the server. (Of course, locks, delegations and related
associations require special handling which generally expires associations require special handling which generally expires
without an open connection.) Clients may connect multiple without an open connection.) Clients may connect multiple
transport endpoints to this common state. The endpoints may have transport endpoints to this common state. The endpoints may have
all the same attributes, for instance when trunked on multiple all the same attributes, for instance when trunked on multiple
physical network links for bandwidth aggregation or path failover. physical network links for bandwidth aggregation or path failover.
Or, the endpoints can have specific, special purpose attributes Or, the endpoints can have specific, special purpose attributes
skipping to change at page 10, line 23 skipping to change at page 8, line 33
semantics, authentication and authorization may be cached on a per- semantics, authentication and authorization may be cached on a per-
session basis, enabling greater efficiency in the issuing and session basis, enabling greater efficiency in the issuing and
processing of requests on both client and server. A proposal for processing of requests on both client and server. A proposal for
transparent, server-driven implementation of this in NFSv4 has been transparent, server-driven implementation of this in NFSv4 has been
made. [CCM] The existence of the session greatly adds to the made. [CCM] The existence of the session greatly adds to the
convenience of this approach. This is discussed in detail in the convenience of this approach. This is discussed in detail in the
Authentication Efficiencies section later in this draft. Authentication Efficiencies section later in this draft.
2.1.1. Connection State 2.1.1. Connection State
The normal RDMA model is connection oriented; in fact RDDP In RFC3530, the combination of a connected transport endpoint and a
proposes only connection oriented operation. Connection clientid forms the basis of connection state. While provably
orientation brings with it certain potential optimizations, such as workable, there are difficulties in correct and robust
caching of per-connection properties. implementation. The NFSv4.0 protocol must provide a clientid
negotiation (SETCLIENTID and SETCLIENTID_CONFIRM), must provide a
A session identifier is assigned upon initial session negotiation server-initiated connection for the callback channel, and must
on each connection. This identifier is used to associate carefully specify the persistence of client state at the server in
additional connections, to renegotiate after a reconnect, and to the face of transport interruptions. In effect, each transport
provide an abstraction for the various session properties. The connection is used as the server's representation of client state.
session identifier is unique within the server's scope and may be But, transport connections are potentially fragile and transitory.
subject to certain server policies such as being bounded in time.
A channel identifier is issued for each new connection in the
session.
2.1.2. Connection Resources
RDMA imposes several requirements on upper layer consumers.
Registration of memory and the need to post buffers of a specific
size and number for receive operations are a primary consideration.
Registration of memory can be a relatively high-overhead operation,
since it requires pinning of buffers, assignment of attributes
(e.g. readable/writable), and initialization of hardware
translation. Preregistration is desirable to reduce overhead.
These registrations are specific to hardware interfaces and even to
RDMA connection endpoints, therefore negotiation of their limits is
desirable to manage resources effectively.
Following the basic registration, these buffers must be posted by In this proposal, a session identifier is assigned by the server
the RPC layer to handle receives. These buffers remain in use by upon initial session negotiation on each connection. This
the RPC/NFSv4 implementation; the size and number of them must be identifier is used to associate additional connections, to
known to the remote peer in order to avoid RDMA errors which would renegotiate after a reconnect, and to provide an abstraction for
cause a fatal error on the RDMA connection. the various session properties. The session identifier is unique
within the server's scope and may be subject to certain server
policies such as being bounded in time. A channel identifier is
issued for each new connection as it binds to the session. The
channel identifier is unique within the session, and may be unique
within a wider scope, at the server's choosing.
Each channel within a session will potentially have different It is envisioned that the primary transport model will be
requirements, negotiated per-connection but accounted for per- connection oriented. Connection orientation brings with it certain
session. The session provides a natural way for the server to potential optimizations, such as caching of per-connection
manage resource allocation to each client rather than to each properties, which are easily leveraged through the generality of
transport connection itself. This enables considerable flexibility the session. However, it is possible that in future, other
in the administration of transport endpoints. transport models could be accommodated below the session and
channel abstractions.
2.1.3. Channels 2.1.2. Channels
As mentioned above, different NFSv4 operations can lead to As mentioned above, different NFSv4 operations can lead to
different resource needs. For example, server callback operations different resource needs. For example, server callback operations
(CB_RECALL) are specific, small messages which flow from server to (CB_RECALL) are specific, small messages which flow from server to
client at arbitrary times, while data transfers such as read and client at arbitrary times, while data transfers such as read and
write have very different sizes and asymmetric behaviors. It is write have very different sizes and asymmetric behaviors. It is
impractical for the RDMA peers (NFSv4 client and NFSv4 server) to impractical for the RDMA peers (NFSv4 client and NFSv4 server) to
post buffers for these various operations on a single connection. post buffers for these various operations on a single connection.
Commingling of requests with responses at the client receive queue Commingling of requests with responses at the client receive queue
is particularly troublesome, due both to the need to manage both is particularly troublesome, due both to the need to manage both
solicited and unsolicited completions, and to provision buffers for solicited and unsolicited completions, and to provision buffers for
both purposes. Due to the lack of any ordering of callback both purposes. Due to the lack of any ordering of callback
requests versus response arrivals, without any other mechanisms, requests versus response arrivals, without any other mechanisms,
the client would be forced to allocate all buffers sized to the the client would be forced to allocate all buffers sized to the
worst case. worst case.
The callback requests are likely to be handled by a different task The callback requests are likely to be handled by a different task
context from that handling the responses. Significant context from that handling the responses. Significant
demultiplexing and thread management would be required if both are demultiplexing and thread management may be required if both are
received on the same queue. received on the same queue.
If the client explicitly binds each new connection to an existing If the client explicitly binds each new connection to an existing
session, multiple connections may be conveniently used to separate session, multiple connections may be conveniently used to separate
traffic by channel identifier within a session. traffic by channel identifier within a session. For example, reads
and writes may be assigned to specific, optimized channels, or
sorted and separated by any or all of size, idempotency, etc.
To address the problems described above, this proposal defines a To address the problems described above, this proposal defines a
"channel" that is created by the act of binding a connection to a "channel" that is created by the act of binding a connection to a
session for a specific purpose. A new connection may be created session for a specific purpose. A new connection may be created
for each channel, or a single connection may be bound to more than for each channel, or a single connection may be bound to more than
one channel. There are at least two types of channels: the one channel. There are at least two types of channels: the
"operations" channel used for ordinary requests from client to "operations" channel used for ordinary requests from client to
server, and the "back" channel, used for callback requests from server, and the "back" channel, used for callback requests from
server to client. The protocol does not permit binding a server to client. The protocol does not permit binding multiple
connection to multiple operations channels. There is no benefit in duplicate operations channels to a single connection. There is no
doing so; supporting this would require increased complexity in benefit in doing so; supporting this would require increased
the server duplicate response cache. complexity in the server duplicate request cache.
Single Connection model: Single Connection model:
NFSv4.1 clientid NFSv4.1 client instance
| |
Session Session
/ \ / \
Operations_Channel [Back_Channel] Operations_Channel [Back_Channel]
\ / \ /
Connection Connection
| |
Multi-connection model (2 operations channels shown): Multi-connection model (2 operations channels shown):
NFSv4.1 clientid NFSv4.1 client instance
| |
Session Session
/ \ / \
Operations_Channels [Back_Channel] Operations_Channels [Back_Channel]
| | | | | |
Connection Connection [Connection] Connection Connection [Connection]
| | | | | |
In this way, implementation as well as resource management may be In this way, implementation as well as resource management may be
optimized. Each channel (operations, back) will have its own optimized. Each channel (operations, back) will have its own
credits and buffering. Clients which do not require certain credits and buffering. Clients which do not require certain
behaviors may optimize such resources away completely, by not even behaviors may optimize such resources away completely, by not even
creating the channels. creating the channels.
2.1.4. Reconnection, Trunking, Failover 2.1.3. Reconnection, Trunking, Failover
Reconnection after failure references potentially stored state on Reconnection after failure references potentially stored state on
the server associated with lease recovery during the grace period. the server associated with lease recovery during the grace period.
The session provides a convenient handle for storing and managing The session provides a convenient handle for storing and managing
information regarding the client's previous state on a per- information regarding the client's previous state on a per-
connection basis, e.g. to be used upon reconnection. connection basis, e.g. to be used upon reconnection. Reconnection
and rebinding to a previously existing session, and its stored
resources, are covered in the "Connection Models" section below.
For Reliability Availability and Serviceability (RAS) issues such For Reliability Availability and Serviceability (RAS) issues such
as bandwidth aggregation and multipathing, clients frequently seek as bandwidth aggregation and multipathing, clients frequently seek
to make multiple connections through multiple logical or physical to make multiple connections through multiple logical or physical
channels. The session is a convenient point to aggregate and channels. The session is a convenient point to aggregate and
manage these resources. manage these resources.
2.1.5. Server Duplicate Request Cache 2.1.4. Server Duplicate Request Cache
Server duplicate request caches, while not a part of an NFS Server duplicate request caches, while not a part of an NFS
protocol, have become a standard, even required, part of any NFS protocol, have become a standard, even required, part of any NFS
implementation. First described in [CJ89], the duplicate request implementation. First described in [CJ89], the duplicate request
cache was initially found to reduce work at the server by avoiding cache was initially found to reduce work at the server by avoiding
duplicate processing for retransmitted requests. A second, and in duplicate processing for retransmitted requests. A second, and in
the long run more important benefit, was improved correctness, as the long run more important benefit, was improved correctness, as
the cache avoided certain destructive non-idempotent requests from the cache avoided certain destructive non-idempotent requests from
being reinvoked. being reinvoked.
skipping to change at page 13, line 29 skipping to change at page 11, line 34
identifier, enables its persistent storage on a per-session basis. identifier, enables its persistent storage on a per-session basis.
This provides a single unified mechanism which provides the This provides a single unified mechanism which provides the
following guarantees required in the NFSv4 specification, while following guarantees required in the NFSv4 specification, while
extending them to all requests, rather than limiting them only to a extending them to all requests, rather than limiting them only to a
subset of state-related requests: subset of state-related requests:
"It is critical the server maintain the last response sent to "It is critical the server maintain the last response sent to
the client to provide a more reliable cache of duplicate non- the client to provide a more reliable cache of duplicate non-
idempotent requests than that of the traditional cache idempotent requests than that of the traditional cache
described in [CJ89]..." [NFSv4] described in [CJ89]..." [RFC3530]
The credit limit is the count of active operations, which bounds The credit limit is the count of active operations, which bounds
the number of entries in the cache. The size of operations the number of entries in the cache. Constraining the size of
additionally serves to limit the required storage to the product of operations additionally serves to limit the required storage to the
the current credit count and the maximum response size. This product of the current credit count and the maximum response size.
storage requirement enables server-side efficiencies. This storage requirement enables server-side efficiencies.
Session negotiation allows the server to maintain other state. An Session negotiation allows the server to maintain other state. An
NFSv4.1 client invoking the session disconnect operation will cause NFSv4.1 client invoking the session destroy operation will cause
the server to denegotiate (close) the session, allowing the server the server to denegotiate (close) the session, allowing the server
to deallocate cache entries. Clients can potentially specify that to deallocate cache entries. Clients can potentially specify that
such caches not be kept for appropriate types of sessions (for such caches not be kept for appropriate types of sessions (for
example, read-only sessions). This can enable more efficient example, read-only sessions). This can enable more efficient
server operation resulting in improved response times. server operation resulting in improved response times.
Similarly, it is important for the client to explicitly learn Similarly, it is important for the client to explicitly learn
whether the server is able to implement these semantics. Knowledge whether the server is able to implement these semantics. Knowledge
of whether exactly-once semantics are in force is critical for a of whether exactly-once semantics are in force is critical for a
highly reliable client, one which must provide transactional highly reliable client, one which must provide transactional
integrity guarantees. When clients request that the semantics be integrity guarantees. When clients request that the semantics be
enabled for a given session, the session reply must inform the enabled for a given session, the session reply must inform the
client if the mode is in fact enabled. In this way the client can client if the mode is in fact enabled. In this way the client can
confidently proceed with operations without having to implement confidently proceed with operations without having to implement
consistency facilities of its own. consistency facilities of its own.
2.2. RDMA Negotiation 2.2. RDMA
2.2.1. RDMA Requirements
A complete discussion of the operation of RPC-based protocols atop
RDMA transports is in [RPCRDMA], and a general discussion of NFS
RDMA requirements is in [RDMAREQ]. Where RDMA is considered, this
proposal assumes the use of such a layering; it addresses only the
upper layer issues relevant to making best use of RPC/RDMA.
A connection oriented (reliable sequenced) RDMA transport will be
required. There are several reasons for this. First, this model
most closely reflects the general NFSv4 requirement of long-lived
and congestion-controlled transports. Second, to operate correctly
over either an unreliable or unsequenced RDMA transport, or both,
would require significant complexity in the implementation and
protocol not appropriate for a strict minor version. For example,
retransmission on connected endpoints is explicitly disallowed in
the current NFSv4 draft; it would again be required with these
alternate transport characteristics. Third, the proposal assumes a
specific RDMA ordering semantic, which presents the same set of
ordering and reliability issues to the RDMA layer over such
transports.
The RDMA implementation provides for making connections to other
RDMA-capable peers. In the case of the current proposals before
the RDDP working group, these RDMA connections are preceded by a
"streaming" phase, where ordinary TCP (or NFS) traffic might flow.
However, this is not assumed here and sizes and other parameters
are explicitly exchanges upon entering RDMA mode in all cases.
2.2.2. RDMA Negotiation
It is proposed that session negotiation be the method to enable It is proposed that session negotiation be the method to enable
RDMA mode on an NFSv4 connection. RDMA mode on an NFSv4 connection.
On transport endpoints which support automatic RDMA mode, that is, On transport endpoints which support automatic RDMA mode, that is,
endpoints which are created in the RDMA-enabled state, a single, endpoints which are created in the RDMA-enabled state, a single,
preposted buffer must initially be provided by both peers, and the preposted buffer must initially be provided by both peers, and the
client session negotiation must be the first exchange. client session negotiation must be the first exchange.
On transport endpoints supporting dynamic negotiation, a more On transport endpoints supporting dynamic negotiation, a more
sophisticated negotiation is possible. Clients may connect to the sophisticated negotiation is possible. Clients may connect to the
server in traditional NFSv4 mode and enter RDMA mode only after a server in traditional NFSv4 mode and enter RDMA mode only after a
successful NFSv4.1 session negotiation returning the RDMA successful NFSv4.1 channel binding negotiation returning the RDMA
capability. If RDMA capability is not indicated, the session capability. If RDMA capability is not indicated, the negotiation
negotiation still completes and the benefits of the session are still completes and the benefits of the session are available on
available on the existing TCP stream connection. the existing TCP stream connection.
Some of the parameters to be exchanged at session binding time are Some of the parameters to be exchanged at session binding time are
as follows. as follows.
Maximum Credits Maximum Credits
The client's desired maximum credits (number of concurrent The client's desired maximum credits (number of concurrent
requests) is passed, in order to allow the server to size its requests) is passed, in order to allow the server to size its
response cache storage. The server may modify the client's reply cache storage. The server may modify the client's
requested limit downward (or upward) to match its local policy requested limit downward (or upward) to match its local policy
and/or resources. and/or resources. The maximum credits available on a single
bound channel may also be limited by the maximum credits for
the session. Over RDMA-capable RPC transports, the per-
request management of message credits is handled within the
RPC layer. [RPCRDMA]
Maximum Request/Response Sizes Maximum Request/Response Sizes
The maximum request and response sizes are exchanged in order The maximum request and response sizes are exchanged in order
to permit posting of appropriately sized buffers. The size to permit allocation of appropriately sized buffers and
must allow for certain protocol minima, allowing the receipt request cache entries. The size must allow for certain
of maximally sized operations (e.g. RENAME requests which protocol minima, allowing the receipt of maximally sized
contains two name strings). The server may reduce the operations (e.g. RENAME requests which contains two name
client's requested sizes. Message credits are requested (and strings). The server may reduce the client's requested sizes.
granted) in each RPC message passed across RDMA transports
[RPCRDMA].
RDMA Read Resources RDMA Read Resources
RDMA implementations must explicitly provision resources to RDMA implementations must explicitly provision resources to
support RDMA Read requests from connected peers. These values support RDMA Read requests from connected peers. These values
must be explicitly specified, to provide adequate resources must be explicitly specified, to provide adequate resources
for matching the peer's expected needs and the connection's for matching the peer's expected needs and the connection's
delay-bandwidth parameters. The values are asymmetric and are delay-bandwidth parameters. The values are asymmetric and
generally optimized to zero at the server, since clients do should be set to zero at the server in order to conserve RDMA
not issue RDMA Read operations in this proposal. The result resources, since clients do not issue RDMA Read operations in
is communicated in the session response, to permit matching of this proposal. The result is communicated in the session
values across the connection. The value may not be changed in response, to permit matching of values across the connection.
the duration of the connection, although a new value may be The value may not be changed in the duration of the
requested as part of a reconnection. connection, although a new value may be requested as part of a
reconnection.
Inline Padding/Alignment Inline Padding/Alignment
The server can inform the client of any padding which can be The server can inform the client of any padding which can be
used to deliver NFSv4 inline WRITE payloads into aligned used to deliver NFSv4 inline WRITE payloads into aligned
buffers. Such alignment can be used to avoid data copy buffers. Such alignment can be used to avoid data copy
operations at the server, even when direct RDMA is not used. operations at the server, even when direct RDMA is not used.
The client informs the server in each operation when padding The client informs the server in each operation when padding
has been applied [RPCRDMA]. has been applied [RPCRDMA].
Transport Attributes Transport Attributes
A placeholder for transport-specific attributes is provided, A placeholder for transport-specific attributes is provided,
with a format to be determined. Examples of information to be with a format to be determined. Examples of information to be
passed in this parameter include transport security attributes passed in this parameter include transport security attributes
to be used on the connection, RDMA-specific attributes, legacy to be used on the connection, RDMA-specific attributes, legacy
"private data" as used on existing RDMA fabrics, transport "private data" as used on existing RDMA fabrics, transport
Quality of Service attributes, etc. This information is to be Quality of Service attributes, etc. This information is to be
passed to the peer's transport layer by local means which is passed to the peer's transport layer by local means which is
currently outside the scope of this draft. currently outside the scope of this draft.
2.3. RDMA Inline Model 2.2.3. Connection Resources
RDMA imposes several requirements on upper layer consumers.
Registration of memory and the need to post buffers of a specific
size and number for receive operations are a primary consideration.
Registration of memory can be a relatively high-overhead operation,
since it requires pinning of buffers, assignment of attributes
(e.g. readable/writable), and initialization of hardware
translation. Preregistration is desirable to reduce overhead.
These registrations are specific to hardware interfaces and even to
RDMA connection endpoints, therefore negotiation of their limits is
desirable to manage resources effectively.
Following the basic registration, these buffers must be posted by
the RPC layer to handle receives. These buffers remain in use by
the RPC/NFSv4 implementation; the size and number of them must be
known to the remote peer in order to avoid RDMA errors which would
cause a fatal error on the RDMA connection.
Each channel within a session will potentially have different
requirements, negotiated per-connection but accounted for per-
session. The session provides a natural way for the server to
manage resource allocation to each client rather than to each
transport connection itself. This enables considerable flexibility
in the administration of transport endpoints.
2.2.4. Inline Transfer Model
The RDMA Send transfer model is used for all NFS requests and The RDMA Send transfer model is used for all NFS requests and
replies. Use of Sends is required to ensure consistency of data replies. Use of Sends is required to ensure consistency of data
and to deliver completion notifications. and to deliver completion notifications.
Sends may carry data as well as control. When a Send carries data Sends may carry data as well as control. When a Send carries data
associated with a request type, the data is referred to as associated with a request type, the data is referred to as
"inline". This method is typically used where the data payload is "inline". This method is typically used where the data payload is
small, or where for whatever reason target memory for RDMA is not small, or where for whatever reason target memory for RDMA is not
available. available.
skipping to change at page 16, line 47 skipping to change at page 16, line 17
credits or smaller buffers are provided, the connection may fail credits or smaller buffers are provided, the connection may fail
with an RDMA transport error. with an RDMA transport error.
While tempting to consider, it is not possible to use the TCP While tempting to consider, it is not possible to use the TCP
window as an RDMA operation flow control mechanism. First, to do window as an RDMA operation flow control mechanism. First, to do
so would violate layering, requiring both senders to be aware of so would violate layering, requiring both senders to be aware of
the existing TCP outbound window at all times. Second, since the existing TCP outbound window at all times. Second, since
requests are of variable size, the TCP window can hold a widely requests are of variable size, the TCP window can hold a widely
variable number of them, and since it cannot be reduced without variable number of them, and since it cannot be reduced without
actually receiving data, the receiver cannot limit the sender. actually receiving data, the receiver cannot limit the sender.
Third, any middlebox interposing on the connection will wreck any Third, any middlebox interposing on the connection would wreck any
possible scheme. [MIDTAX] Credits, in the form of explicit possible scheme. [MIDTAX] In this proposal, credits, in the form of
operation counts, must be exchanged to allow correct provisioning explicit operation counts, are exchanged to allow correct
of receive buffers. provisioning of receive buffers.
When not operating over RDMA, credits and sizes are still employed When not operating over RDMA, credits and sizes are still employed
in NFSv4.1, but instead of being required for correctness, they in NFSv4.1, but instead of being required for correctness, they
provide the basis for efficient server implementation of exactly- provide the basis for efficient server implementation of exactly-
once semantics. The limits are chosen based upon the expected once semantics. The limits are chosen based upon the expected
needs and capabilities of the client and server, and are in fact needs and capabilities of the client and server, and are in fact
arbitrary. Sizes may be specified as zero (no specific size limit) arbitrary. Sizes may be specified as zero (no specific size limit)
and credits may be chosen in proportion to the client's and credits may be chosen in proportion to the client's
capabilities. For example, a limit of 1000 allows 1000 requests to capabilities. For example, a limit of 1000 allows 1000 requests to
be in progress, which is more than adequate to keep local networks be in progress, which may generally be far more than adequate to
and servers fully utilized. keep local networks and servers fully utilized.
Both client and server have independent sizes and buffering, but Both client and server have independent sizes and buffering, but
over RDMA fabrics client credits are easily managed by posting a over RDMA fabrics client credits are easily managed by posting a
receive buffer prior to sending each request. Each such buffer may receive buffer prior to sending each request. Each such buffer may
not be completed with the corresponding reply, since responses from not be completed with the corresponding reply, since responses from
NFSv4 servers arrive in arbitrary order. When the operations NFSv4 servers arrive in arbitrary order. When the operations
channel is used for callbacks, the client must account for callback channel is used for callbacks, the client must account for callback
requests by posting additional buffers. requests by posting additional buffers. Note that implementation-
specific facilities such as a "shared receive queue" may allow
optimization of these allocations.
When a connection is bound to a session (creating a channel), the When a connection is bound to a session (creating a channel), the
client requests a preferred buffer size, and the server provides client requests a preferred buffer size, and the server provides
its answer. The server posts all buffers of at least this size. its answer. The server posts all buffers of at least this size.
The client must comply by not sending requests greater than this The client must comply by not sending requests greater than this
size. It is recommended that server implementations do all they size. It is recommended that server implementations do all they
can to accommodate a useful range of possible client requests. can to accommodate a useful range of possible client requests.
There is a provision in [RPCRDMA] to allow the sending of client There is a provision in [RPCRDMA] to allow the sending of client
requests which exceed the server's receive buffer size, but it requests which exceed the server's receive buffer size, but it
requires the server to "pull" the client's request as a "read requires the server to "pull" the client's request as a "read
chunk" via RDMA Read. This introduces at least one additional chunk" via RDMA Read. This introduces at least one additional
network roundtrip, plus other overhead such as registering memory network roundtrip, plus other overhead such as registering memory
for RDMA Read at the client and additional RDMA operations at the for RDMA Read at the client and additional RDMA operations at the
server, and is therefore to be avoided. server, and is to be avoided.
An issue therefore arises when considering the NFSv4 COMPOUND An issue therefore arises when considering the NFSv4 COMPOUND
procedures. Since an arbitrary number (total size) of operations procedures. Since an arbitrary number (total size) of operations
can be specified in a single COMPOUND procedure, its size is can be specified in a single COMPOUND procedure, its size is
effectively unbounded. This cannot be supported by RDMA Sends, and effectively unbounded. This cannot be supported by RDMA Sends, and
therefore this size negotiation places a restriction on the therefore this size negotiation places a restriction on the
construction and maximum size of both COMPOUND requests and construction and maximum size of both COMPOUND requests and
responses. If a COMPOUND results in a reply at the server that is responses. If a COMPOUND results in a reply at the server that is
larger than can be sent in an RDMA Send to the client, then the larger than can be sent in an RDMA Send to the client, then the
COMPOUND must terminate and the operation which causes the overflow COMPOUND must terminate and the operation which causes the overflow
will provide a TOOSMALL error status result. A chaining facility will provide a TOOSMALL error status result. A chaining facility
is provided to overcome some of the resulting limitations, is provided to overcome some of the resulting limitations,
described later in the draft. described later in the draft.
2.4. RDMA Direct Model 2.2.5. Direct Transfer Model
Placement of data by explicitly tagged RDMA operations is referred Placement of data by explicitly tagged RDMA operations is referred
to as "direct" transfer. This method is typically used where the to as "direct" transfer. This method is typically used where the
data payload is relatively large, that is, when RDMA setup has been data payload is relatively large, that is, when RDMA setup has been
performed prior to the operation, or when any overhead for setting performed prior to the operation, or when any overhead for setting
up and performing the transfer is regained by avoiding the overhead up and performing the transfer is regained by avoiding the overhead
of processing an ordinary receive. of processing an ordinary receive.
The client advertises RDMA buffers in this proposed model, and not The client advertises RDMA buffers in this proposed model, and not
the server. This means the "XDR Decoding with Read Chunks" the server. This means the "XDR Decoding with Read Chunks"
skipping to change at page 18, line 27 skipping to change at page 17, line 44
instead all results transferred via RDMA to the client employ "XDR instead all results transferred via RDMA to the client employ "XDR
Decoding with Write Chunks". There are several reasons for this. Decoding with Write Chunks". There are several reasons for this.
First, it allows for a correct and secure mode of transfer. The First, it allows for a correct and secure mode of transfer. The
client may advertise specific memory buffers only during specific client may advertise specific memory buffers only during specific
times, and may revoke access when it pleases. The server is not times, and may revoke access when it pleases. The server is not
required to expose copies of local file buffers for individual required to expose copies of local file buffers for individual
clients, or to lock or copy them for each client access. clients, or to lock or copy them for each client access.
Second, client credits based on fixed-size request buffers are Second, client credits based on fixed-size request buffers are
easily managed on the server, but the server additionally managing easily managed on the server, but for the server additional
buffers for client RDMA Reads is not well-bounded. For example, management of buffers for client RDMA Reads is not well-bounded.
the client may not perform these RDMA Read operations in a timely For example, the client may not perform these RDMA Read operations
fashion, therefore the server would have to protect itself against in a timely fashion, therefore the server would have to protect
denial-of-service on these resources. itself against denial-of-service on these resources.
Third, it reduces network traffic, since buffer exposure outside Third, it reduces network traffic, since buffer exposure outside
the scope and duration of a single request/response exchange the scope and duration of a single request/response exchange
necessitates additional memory management exchanges. necessitates additional memory management exchanges.
There are costs associated with this decision. Primary among them There are costs associated with this decision. Primary among them
is the need for the server to employ RDMA Read for operations such is the need for the server to employ RDMA Read for operations such
as large WRITE. The RDMA Read operation is a two-way exchange at as large WRITE. The RDMA Read operation is a two-way exchange at
the RDMA layer, which incurs additional overhead relative to RDMA the RDMA layer, which incurs additional overhead relative to RDMA
Write. Additionally, RDMA Read requires resources at the data Write. Additionally, RDMA Read requires resources at the data
source (the client in this proposal) to maintain state and generate source (the client in this proposal) to maintain state and to
replies. These costs are overcome through use of pipelining with generate replies. These costs are overcome through use of
credits, with sufficient RDMA Read resources negotiated at session pipelining with credits, with sufficient RDMA Read resources
initiation, and appropriate use of RDMA for writes by the client - negotiated at session initiation, and appropriate use of RDMA for
for example only for transfers above a certain size. writes by the client - for example only for transfers above a
certain size.
A description of which NFSv4 operations are eligible for data A description of which NFSv4 operation results are eligible for
transfer via RDMA is in [NFSDDP]. There are only two such data transfer via RDMA Write is in [NFSDDP]. There are only two
operations: READ and READLINK. When XDR encoding these requests on such operations: READ and READLINK. When XDR encoding these
an RDMA transport, the NFSv4.1 client must insert the appropriate requests on an RDMA transport, the NFSv4.1 client must insert the
xdr_write_list entries to indicate to the server whether the appropriate xdr_write_list entries to indicate to the server
results should be transferred via RDMA or inline with a Send. As whether the results should be transferred via RDMA or inline with a
described in [NFSDDP], a zero-length write chunk is used to Send. As described in [NFSDDP], a zero-length write chunk is used
indicate an inline result. In this way, it is unnecessary to to indicate an inline result. In this way, it is unnecessary to
create new operations for RDMA-mode versions of READ and READLINK. create new operations for RDMA-mode versions of READ and READLINK.
Another tool to avoid creation of new, RDMA-mode operations is the
Reply Chunk [RPCRDMA], which is used by RPC in RDMA mode to return
large replies via RDMA as if they were inline. Reply chunks are
used for operations such as READDIR, which returns large amounts of
information, but in many small XDR segments. Reply chunks are
offered by the client and the server can use them in preference to
inline. Reply chunks are transparent to upper layers such as
NFSv4.
In any very rare cases where another NFSv4.1 operation requires In any very rare cases where another NFSv4.1 operation requires
larger buffers than were negotiated at channel binding (for example larger buffers than were negotiated when the channel was bound (for
extraordinarily large RENAMEs), the underlying RPC layer may example extraordinarily large RENAMEs), the underlying RPC layer
support the use of "Message as an RDMA Read Chunk" and "RDMA Write may support the use of "Message as an RDMA Read Chunk" and "RDMA
of Long Replies" as described in [RPCRDMA]. No additional support Write of Long Replies" as described in [RPCRDMA]. No additional
is required in the NFSv4.1 client for this. The client should be support is required in the NFSv4.1 client for this. The client
certain that its requested buffer sizes are not so small as to make should be certain that its requested buffer sizes are not so small
this a frequent occurrence, however. as to make this a frequent occurrence, however.
All operations are initiated by a Send, and are completed with a All operations are initiated by a Send, and are completed with a
Send. This is exactly as in conventional NFSv4, but under RDMA has Send. This is exactly as in conventional NFSv4, but under RDMA has
a significant purpose: RDMA operations are not complete, that is, a significant purpose: RDMA operations are not complete, that is,
guaranteed consistent, at the data sink until followed by a guaranteed consistent, at the data sink until followed by a
successful Send completion (i.e. a receive). These events provide successful Send completion (i.e. a receive). These events provide
a natural opportunity for the initiator (client) to enable and a natural opportunity for the initiator (client) to enable and
later disable RDMA access to the memory which is the target of each later disable RDMA access to the memory which is the target of each
operation, in order to provide for consistent and secure operation. operation, in order to provide for consistent and secure operation.
The RDDP Send with Invalidate operation may be worth employing in The RDMAP Send with Invalidate operation may be worth employing in
this respect, as it relieves the client of certain overhead in this this respect, as it relieves the client of certain overhead in this
case. case.
A "onetime" boolean advisory to each RDMA region might become a A "onetime" boolean advisory to each RDMA region might become a
hint to the server that the client will use the three-tuple for hint to the server that the client will use the three-tuple for
only one NFSv4 operation. For a transport such as iWARP, the only one NFSv4 operation. For a transport such as iWARP, the
server can assist the client in invalidating the three-tuple by server can assist the client in invalidating the three-tuple by
performing a Send with Solicited Event and Invalidate. The server performing a Send with Solicited Event and Invalidate. The server
may ignore this hint, in which case the client must perform a local may ignore this hint, in which case the client must perform a local
invalidate after receiving the indication from the server that the invalidate after receiving the indication from the server that the
skipping to change at page 20, line 37 skipping to change at page 20, line 21
buffer : +-----------------------------> : buffer : +-----------------------------> :
: : : : : :
: [Segment] : : [Segment] :
tagged : v------------------------------ : [RDMA Read] tagged : v------------------------------ : [RDMA Read]
buffer : +-----------------------------> : buffer : +-----------------------------> :
: : : :
: Direct Write Response : : Direct Write Response :
untagged : <------------------------------ : Send (w/Inv.) untagged : <------------------------------ : Send (w/Inv.)
buffer : : buffer : :
2.5. Connection Models 2.3. Connection Models
There are three scenarios in which to discuss the connection model. There are three scenarios in which to discuss the connection model.
Each will be discussed individually, after describing the common Each will be discussed individually, after describing the common
case encountered at initial connection establishment. case encountered at initial connection establishment.
After a successful connection, the first request proceeds, in the After a successful connection, the first request proceeds, in the
case of a new client association, to initial session creation, and case of a new client association, to initial session creation, and
then to session binding, prior to regular operation. Session then to session binding, prior to regular operation. Session
binding, which creates a channel, is a required first step for binding, which creates a channel, is a required first step for
NFSv4.1 operation on each connection, and there is no change in NFSv4.1 operation on each connection, and there is no change in
skipping to change at page 21, line 20 skipping to change at page 20, line 51
server, the server will have located the previous session's state, server, the server will have located the previous session's state,
including any surviving locks, delegations, duplicate request cache including any surviving locks, delegations, duplicate request cache
entries, etc. The previous session will be reestablished with its entries, etc. The previous session will be reestablished with its
previous state, ensuring exactly-once semantics of any previously previous state, ensuring exactly-once semantics of any previously
issued NFSv4 requests. If the rebinding fails, then the server has issued NFSv4 requests. If the rebinding fails, then the server has
restarted and does not support persistent state. This would have restarted and does not support persistent state. This would have
been noted in the server's original reply to the session creation, been noted in the server's original reply to the session creation,
however. however.
Since the session is explicitly created and destroyed by the Since the session is explicitly created and destroyed by the
client, and each client is uniquely identified by its clientid, the client, and each client is uniquely identified, the server may be
server may be specifically instructed to discard unneeded specifically instructed to discard unneeded presistent state. For
presistent state. For this reason, it is expected that a server this reason, it is possible that a server will retain any previous
will retain any previous state indefinitely, and place its state indefinitely, and place its destruction under administrative
destruction under administrative control. control. Or, a server may choose to retain state for some
configurable period, provided that the period meets other NFSv4
requirements.
After successful session establishment, the traditional (TCP After successful session establishment, the traditional (TCP
stream) connection model used by NFSv4.0 and NFSv4.1 ensures the stream) connection model used by NFSv4.0 and NFSv4.1 ensures the
connection is ready to proceed with issuing requests and returning connection is ready to proceed with issuing requests and returning
responses. This mode is arrived at when the client does not responses. This mode is arrived at when the client does not
request that the connection be placed into RDMA mode. request that the connection be placed into RDMA mode.
2.5.1. TCP Stream Connection Model 2.3.1. TCP Connection Model
The following is a schematic diagram of the NFSv4.1 protocol The following is a schematic diagram of the NFSv4.1 protocol
exchanges leading up to normal operation on a TCP stream. exchanges leading up to normal operation on a TCP stream.
Client Server Client Server
TCPmode : Session Create(nfs_client_id4, : TCPmode TCPmode : Session Create(nfs_client_id4, ...) : TCPmode
: TCP mode, ...) :
: ------------------------------> : : ------------------------------> :
: : : :
: Session reply(sessionid, : : Session reply(sessionid, ...) :
: TCP mode, ...) :
: <------------------------------ : : <------------------------------ :
: : : :
: Session bind(session id, size 0, : : Session bind(session id, size S, :
: opchan, STREAM, credits N, ...): : opchan, STREAM, credits N, ...):
: ------------------------------> : : ------------------------------> :
: : : :
: Bind reply(size 0, credits N) : : Bind reply(size S', credits N') :
: <------------------------------ : : <------------------------------ :
: : : :
: <normal operation> : : <normal operation> :
: ------------------------------> : : ------------------------------> :
: <------------------------------ : : <------------------------------ :
: : : : : :
No net additional exchange is added to the initial negotiation by No net additional exchange is added to the initial negotiation by
this proposal. In the NFSv4.1 exchange, the SETCLIENTID operation this proposal. In the NFSv4.1 exchange, the SETCLIENTID and
is subsumed into the Session establishment, and there is no need SETCLIENTID_CONFIRM operations are not performed, as described
for SETCLIENTID_CONFIRM, as described later in the document. later in the document.
2.5.2. Negotiated RDMA Connection Model 2.3.2. Negotiated RDMA Connection Model
The following is a schematic diagram of the NFSv4.1 protocol The following is a schematic diagram of the NFSv4.1 protocol
exchanges negotiating upgrade to RDMA mode on a TCP stream. exchanges negotiating upgrade to RDMA mode on a TCP stream.
Client Server Client Server
TCPmode : Session Create(nfs_client_id4, : TCPmode TCPmode : Session Create(nfs_client_id4, ...) : TCPmode
: RDMA mode, ...) :
: ------------------------------> : : ------------------------------> :
: : : :
: Session reply(sessionid, : : Session reply(sessionid, ...) :
: RDMA mode, ...) :
: <------------------------------ : : <------------------------------ :
: : : :
: Session bind(session id, size S, : : Session bind(session id, size S', :
: opchan, RDMA, credits N, ...) : : opchan, RDMA, credits N, ...) :
: ------------------------------> : : ------------------------------> :
: : Prepost N receives : : Prepost N' receives
: Bind reply(size S, credits N) : of size S : Bind reply(size S', credits N') : of size S'
: <------------------------------ : RDMAMode : <------------------------------ : RDMAMode
RDMAmode : : RDMAmode : :
: <normal operation> : : <normal operation> :
: ------------------------------> : : ------------------------------> :
: <------------------------------ : : <------------------------------ :
: : : : : :
In iWARP, the Bind reply and RDMA mode entry are combined into a In iWARP, the Bind reply and RDMA mode entry are combined into a
single, atomic operation within the Provider, where the Bind reply single, atomic operation within the Provider, where the Bind reply
is sent in TCP streaming mode and RDMA mode is enabled immediately. is sent in TCP streaming mode and RDMA mode is enabled immediately.
There is no opportunity for a race between the client's first There is no opportunity for a race between the client's first
operation, the preposting of receive descriptors, and RDMA mode operation, the preposting of receive descriptors, and RDMA mode
entry at the server. entry at the server.
2.5.3. Automatic RDMA Connection Model 2.3.3. Automatic RDMA Connection Model
The following is a schematic diagram of the NFSv4.1 protocol The following is a schematic diagram of the NFSv4.1 protocol
exchanges performed on an RDMA connection. exchanges performed on an RDMA connection.
Client Server Client Server
RDMAmode : : : RDMAmode RDMAmode : : : RDMAmode
: : : : : :
Prepost : : : Prepost Prepost : : : Prepost
receive : : : receive receive : : : receive
: : : :
: Session Create(nfs_client_id4, : : Session Create(nfs_client_id4, ...) :
: RDMA mode, ...) :
: ------------------------------> : : ------------------------------> :
: : Prepost : : Prepost
: Session reply(sessionid, : receive : Session reply(sessionid, ...) : receive
: RDMA mode, ...) :
: <------------------------------ : : <------------------------------ :
Prepost : : Prepost : :
receive : Session bind(session id, size S, : receive : Session bind(session id, size S, :
: opchan, credits N, ...) : : opchan, RDMA, credits N, ...) :
: ------------------------------> : : ------------------------------> :
: : Prepost N receives : : Prepost N' receives
: Bind reply(size S, credits N) : of size S : Bind reply(size S', credits N') : of size S'
: <------------------------------ : : <------------------------------ :
: : : :
: <normal operation> : : <normal operation> :
: ------------------------------> : : ------------------------------> :
: <------------------------------ : : <------------------------------ :
: : : : : :
2.6. Buffer Management, Transfer, Flow Control 2.4. Buffer Management, Transfer, Flow Control
Inline operations in NFSv4.1 behave effectively the same as TCP Inline operations in NFSv4.1 behave effectively the same as TCP
sends. Procedure results are passed in a single message, and its sends. Procedure results are passed in a single message, and its
completion at the client signal the receiving process to inspect completion at the client signal the receiving process to inspect
the message. the message.
RDMA operations are performed solely by the server in this RDMA operations are performed solely by the server in this
proposal, as described in the previous "RDMA Direct Model" section. proposal, as described in the previous "RDMA Direct Model" section.
Since server RDMA operations do not result in a completion at the Since server RDMA operations do not result in a completion at the
client, and due to ordering rules in RDMA transports, after all client, and due to ordering rules in RDMA transports, after all
skipping to change at page 25, line 45 skipping to change at page 24, line 45
of server credits might increase its requested credits of server credits might increase its requested credits
proportionately in response. Or, a client might have a simple, proportionately in response. Or, a client might have a simple,
configurable number. configurable number.
Occasionally, a server may wish to reduce the number of credits it Occasionally, a server may wish to reduce the number of credits it
offers a certain client channel. This could be encountered if a offers a certain client channel. This could be encountered if a
client were found to be consuming its credits slowly, or not at client were found to be consuming its credits slowly, or not at
all. A client might notice this itself, and reduce its requested all. A client might notice this itself, and reduce its requested
credits in advance, for instance requesting only the count of credits in advance, for instance requesting only the count of
operations it currently has queued, plus a few as a base for operations it currently has queued, plus a few as a base for
starting up again. starting up again. Such mechanism are, however, potentially
complicated and are implementation-defined. The protocol does not
require them.
Because of the way in which RDMA fabrics function, it is not Because of the way in which RDMA fabrics function, it is not
possible for the server (or client back channel) to cancel possible for the server (or client back channel) to cancel
outstanding receive operations. Therefore, effectively only one outstanding receive operations. Therefore, effectively only one
credit can be withdrawn per receive completion. The server (or credit can be withdrawn per receive completion. The server (or
client back channel) would simply not replenish a receive operation client back channel) would simply not replenish a receive operation
when replying. The server can still reduce the available credit when replying. The server can still reduce the available credit
advertisement in its replies to the target value it desires, as a advertisement in its replies to the target value it desires, as a
hint to the client that its credit target is lower and it should hint to the client that its credit target is lower and it should
expect it to be reduced accordingly. Of course, even if the server expect it to be reduced accordingly. Of course, even if the server
skipping to change at page 27, line 5 skipping to change at page 26, line 5
efficient allocation of resources on both peers. There is an efficient allocation of resources on both peers. There is an
important requirement on reconnection: the sizes offered at important requirement on reconnection: the sizes offered at
reconnect (session bind) must be at least as large as previously reconnect (session bind) must be at least as large as previously
used, to allow recovery. Any replies that are replayed from the used, to allow recovery. Any replies that are replayed from the
server's duplicate request cache must be able to be received into server's duplicate request cache must be able to be received into
client buffers. In the case where a client has received replies to client buffers. In the case where a client has received replies to
all its retried requests (and therefore received all its expected all its retried requests (and therefore received all its expected
responses), then the client may disconnect and reconnect with responses), then the client may disconnect and reconnect with
different buffers at will, since no cache replay will be required. different buffers at will, since no cache replay will be required.
2.7. Retry and Replay 2.5. Retry and Replay
NFSv4.0 forbids retransmission on active connections over reliable NFSv4.0 forbids retransmission on active connections over reliable
transports; this includes connected-mode RDMA. This restriction transports; this includes connected-mode RDMA. This restriction
must be maintained in NFSv4.1. must be maintained in NFSv4.1.
If one peer were to retransmit a request (or reply), it would If one peer were to retransmit a request (or reply), it would
consume an additional credit on the other. If the server consume an additional credit on the other. If the server
retransmitted a reply, it would certainly result in an RDMA retransmitted a reply, it would certainly result in an RDMA
connection loss, since the client would typically only post a connection loss, since the client would typically only post a
single receive buffer for each request. If the client single receive buffer for each request. If the client
retransmitted a request, the additional credit consumed on the retransmitted a request, the additional credit consumed on the
server might lead to RDMA connection failure unless the client server might lead to RDMA connection failure unless the client
accounted for it and decreased its available credit, leading to accounted for it and decreased its available credit, leading to
wasted resources. wasted resources.
Credits present a new issue to the duplicate request cache in Credits present a new issue to the duplicate request cache in
NFSv4.1. The reply cache may be used when a connection within a NFSv4.1. The request cache may be used when a connection within a
session is lost, such as after the client reconnects and rebinds. session is lost, such as after the client reconnects and rebinds.
Credit information is a dynamic property of the channel, and stale Credit information is a dynamic property of the channel, and stale
values must not be replayed from the cache. This may occur on values must not be replayed from the cache. This may occur on
another existing channel, or a new channel, with potentially new another existing channel, or a new channel, with potentially new
credits and buffers. This implies that the reply cache contents credits and buffers. This implies that the request cache contents
must not be blindly used when replies are issued from it, and must not be blindly used when replies are issued from it, and
credit information appropriate to the channel must be refreshed by credit information appropriate to the channel must be refreshed by
the RPC layer. the RPC layer.
Finally, RDMA fabrics do not guarantee that the memory handles Finally, RDMA fabrics do not guarantee that the memory handles
(Steering Tags) within each rdma three-tuple are valid on a scope (Steering Tags) within each rdma three-tuple are valid on a scope
outside that of a single connection. Therefore, handles used by outside that of a single connection. Therefore, handles used by
the direct operations become invalid after connection loss. The the direct operations become invalid after connection loss. The
server must ensure that any RDMA operations which must be replayed server must ensure that any RDMA operations which must be replayed
from the reply cache use the newly provided handle(s) from the most from the request cache use the newly provided handle(s) from the
recent request. most recent request.
2.8. The Back Channel 2.6. The Back Channel
The NFSv4 callback operations present a significant resource The NFSv4 callback operations present a significant resource
problem for the RDMA enabled client. Clearly, their number must be problem for the RDMA enabled client. Clearly, their number must be
negotiated in the way credits are for the ordinary operations negotiated in the way credits are for the ordinary operations
channel for requests flowing from client to server. But, for channel for requests flowing from client to server. But, for
callbacks to arrive on the same RDMA endpoint as operation replies callbacks to arrive on the same RDMA endpoint as operation replies
would require dedicating additional resources, and specialized would require dedicating additional resources, and specialized
demultiplexing and event handling. It is highly desirable to demultiplexing and event handling. Or, callbacks may not require
streamline this critical path via a second communications channel. RDMA sevice at all (they do not normally carry substantial data
payloads). It is highly desirable to streamline this critical path
via a second communications channel.
The session binding facility is designed for exactly such a The session binding facility is designed for exactly such a
situation, by dynamically associating a new connected endpoint with situation, by dynamically associating a new connected endpoint with
the session, and separately negotiating sizes and counts for active the session, and separately negotiating sizes and counts for active
operations. The ChannelType designation in the session bind operations. The ChannelType designation in the session bind
operation serves to identify the channel. This information later operation serves to identify the channel. The binding operation is
overrides any cb_location information provided in the callback firewall-friendly since it does not require the server to initiate
registration performed by SETCLIENTID_CONFIRM. The binding the connection.
operation is firewall-friendly since it does not require the server
to initiate the connection.
This same method serves as well for ordinary TCP connection mode. This same method serves as well for ordinary TCP connection mode.
It is expected that all NFSv4.1 clients may make use of the session It is expected that all NFSv4.1 clients may make use of the session
binding facility to streamline their design. binding facility to streamline their design.
The back channel functions exactly the same as the operations The back channel functions exactly the same as the operations
channel except that no RDMA operations are required to perform channel except that no RDMA operations are required to perform
transfers, instead the sizes are required to be sufficiently large transfers, instead the sizes are required to be sufficiently large
to carry all data inline, and of course the client and server to carry all data inline, and of course the client and server
reverse their roles with respect to which is in control of credit reverse their roles with respect to which is in control of credit
skipping to change at page 29, line 5 skipping to change at page 28, line 5
not prepared for them. not prepared for them.
There is one special case, that where the back channel is bound in There is one special case, that where the back channel is bound in
fact to the operations channel. This configuration would be used fact to the operations channel. This configuration would be used
normally over a TCP stream connection to exactly implement the normally over a TCP stream connection to exactly implement the
NFSv4.0 behavior, but over RDMA would require complex resource and NFSv4.0 behavior, but over RDMA would require complex resource and
event management at both sides of the connection. The server is event management at both sides of the connection. The server is
not required to accept such a bind request on an RDMA connection not required to accept such a bind request on an RDMA connection
for this reason, though it is recommended. for this reason, though it is recommended.
2.9. COMPOUND Sizing Issues 2.7. COMPOUND Sizing Issues
Very large responses may pose duplicate request cache issues. Very large responses may pose duplicate request cache issues.
Since servers will want to bound the storage required for such a Since servers will want to bound the storage required for such a
cache, the unlimited size of response data in COMPOUND may be cache, the unlimited size of response data in COMPOUND may be
troublesome. If COMPOUND is used in all its generality, then a troublesome. If COMPOUND is used in all its generality, then a
non-idempotent request might include operations that return any non-idempotent request might include operations that return any
amount of data via RDMA. amount of data via RDMA.
It is not satisfactory for the server to reject COMPOUNDs at will It is not satisfactory for the server to reject COMPOUNDs at will
with NFS4ERR_RESOURCE when they pose such difficulties for the with NFS4ERR_RESOURCE when they pose such difficulties for the
skipping to change at page 30, line 8 skipping to change at page 29, line 8
request. The explicit "end" flag allows a chain to immediately request. The explicit "end" flag allows a chain to immediately
follow another. follow another.
When a chain is in effect, the current filehandle and saved When a chain is in effect, the current filehandle and saved
filehandle are maintained across chained requests as for a single filehandle are maintained across chained requests as for a single
COMPOUND. This permits passing such results forward in the chain. COMPOUND. This permits passing such results forward in the chain.
The current and saved filehandles are not available outside the The current and saved filehandles are not available outside the
chain. chain.
2.10. Inline Data Alignment 2.8. Data Alignment
A negotiated data alignment enables certain scatter/gather A negotiated data alignment enables certain scatter/gather
optimizations. A facility for this is supported by [RPCRDMA]. optimizations. A facility for this is supported by [RPCRDMA].
Where NFS file data is the payload, specific optimizations become Where NFS file data is the payload, specific optimizations become
highly attractive. highly attractive.
Header padding is requested by each peer at session initiation, and Header padding is requested by each peer at session initiation, and
may be zero (no padding). Padding leverages the useful property may be zero (no padding). Padding leverages the useful property
that RDMA receives preserve alignment of data, even when they are that RDMA receives preserve alignment of data, even when they are
placed into anonymous (untagged) buffers. If requested, client placed into anonymous (untagged) buffers. If requested, client
skipping to change at page 30, line 30 skipping to change at page 29, line 30
header to align the data payload on the specified boundary. The header to align the data payload on the specified boundary. The
client is encouraged to be optimistic and simply pad all WRITEs client is encouraged to be optimistic and simply pad all WRITEs
within the RPC layer to the negotiated size, in the expectation within the RPC layer to the negotiated size, in the expectation
that the server can use them efficiently. that the server can use them efficiently.
It is highly recommended that clients offer to pad headers to an It is highly recommended that clients offer to pad headers to an
appropriate size. Most servers can make good use of such padding, appropriate size. Most servers can make good use of such padding,
which allows them to chain receive buffers in such a way that any which allows them to chain receive buffers in such a way that any
data carried by client requests will be placed into appropriate data carried by client requests will be placed into appropriate
buffers at the server, ready for filesystem processing. The buffers at the server, ready for filesystem processing. The
receiver's RPC layer encounters no overhead from skipping over pad receiver's RPC layer encounters no overhead from skipping over pad
bytes, and the RDMA layer's high performance makes the insertion bytes, and the RDMA layer's high performance makes the insertion
and transmission of padding on the sender a significant and transmission of padding on the sender a significant
optimization. In this way, the need for servers to perform RDMA optimization. In this way, the need for servers to perform RDMA
Read to satisfy all but the largest client writes is obviated. Read to satisfy all but the largest client writes is obviated. An
added benefit is the reduction of message roundtrips on the network
- a potentially good trade, where latency is present.
The value to choose for padding is subject to a number of criteria. The value to choose for padding is subject to a number of criteria.
A primary source of variable-length data in the RPC header is the A primary source of variable-length data in the RPC header is the
authentication information, the form of which is client-determined, authentication information, the form of which is client-determined,
possibly in response to server specification. The contents of possibly in response to server specification. The contents of
COMPOUNDs, sizes of strings such as those passed to RENAME, etc. COMPOUNDs, sizes of strings such as those passed to RENAME, etc.
all go into the determination of a maximal NFSv4 request size and all go into the determination of a maximal NFSv4 request size and
therefore minimal buffer size. The client must select its offered therefore minimal buffer size. The client must select its offered
value carefully, so as not to overburden the server, and vice- value carefully, so as not to overburden the server, and vice-
versa. The payoff of an appropriate padding value is higher versa. The payoff of an appropriate padding value is higher
skipping to change at page 31, line 43 skipping to change at page 30, line 43
Minor versioning is the existing facility to extend the NFSv4 Minor versioning is the existing facility to extend the NFSv4
protocol, and this proposal takes that approach. protocol, and this proposal takes that approach.
Minor versioning of NFSv4 is relatively restrictive, and allows for Minor versioning of NFSv4 is relatively restrictive, and allows for
tightly limited changes only. In particular, it does not permit tightly limited changes only. In particular, it does not permit
adding new "procedures" (it permits adding only new "operations"). adding new "procedures" (it permits adding only new "operations").
Interoperability concerns make it impossible to consider additional Interoperability concerns make it impossible to consider additional
layering to be a minor revision. This somewhat limits the changes layering to be a minor revision. This somewhat limits the changes
that can be proposed when considering extensions. that can be proposed when considering extensions.
To support Exactly-once Semantics integrated with sessions and flow To support exactly-once semantics integrated with sessions and flow
control, it is desirable to tag each request with an identifier to control, it is desirable to tag each request with an identifier to
be called a Streamid. This identifier must be passed by NFSv4 when be called a Streamid. This identifier must be passed by NFSv4 when
running atop any transport, including traditional TCP. Therefore running atop any transport, including traditional TCP. Therefore
it is not desirable to add the Streamid to a new RPC transport, it is not desirable to add the Streamid to a new RPC transport,
even though such a transport is indicated for support of RDMA. even though such a transport is indicated for support of RDMA.
This draft and [RPCRDMA] do not propose such an approach. This draft and [RPCRDMA] do not propose such an approach.
Instead, this proposal follows these requirements faithfully, Instead, this proposal follows these requirements faithfully,
through the use of a new operation within NFSv4 COMPOUND procedures through the use of a new operation within NFSv4 COMPOUND procedures
as detailed below. as detailed below.
skipping to change at page 32, line 19 skipping to change at page 31, line 19
3.2. Stream Identifiers and Exactly-Once Semantics 3.2. Stream Identifiers and Exactly-Once Semantics
The presence of deterministic flow control on a channel enables in- The presence of deterministic flow control on a channel enables in-
progress requests to be assigned unique values with useful progress requests to be assigned unique values with useful
properties. properties.
The RPC layer provides a transaction ID (xid), which, while The RPC layer provides a transaction ID (xid), which, while
required to be unique, is not especially convenient for tracking required to be unique, is not especially convenient for tracking
requests. The transaction ID is only meaningful to the issuer requests. The transaction ID is only meaningful to the issuer
(client), it cannot be interpreted at the server except to test for (client), it cannot be interpreted at the server except to test for
equality with previously issued requests. equality with previously issued requests. Because RPC operations
may be completed by the server in any order, many transaction IDs
may be outstanding at any time. The client may therefore perform a
computationally expensive lookup operation in the process of
demultiplexing each reply.
When flow control is in effect, there is a limit to the number of When flow control is in effect, there is a limit to the number of
active requests. This immediately enables a convenient, active requests. This immediately enables a convenient,
computationally efficient index for each request which is computationally efficient index for each request which is
designated as a Stream Identifier, or streamid. designated as a Stream Identifier, or streamid.
When the client issues a new request, it selects a streamid in the When the client issues a new request, it selects a streamid in the
range 0..N-1, where N is the server's current flow control limit range 0..N-1, where N is the server's current "totalrequests" limit
granted the client on the channel over which the request is to be granted the client on the session over which the request is to be
issued. The streamid must be unused by any of the requests which issued. The streamid must be unused by any of the requests which
the client has already active on the channel. "Unused" here means the client has already active on the session. "Unused" here means
the client has no outstanding request for that streamid. Because the client has no outstanding request for that streamid. Because
the stream id is always an integer in the range 0..N-1, client the stream id is always an integer in the range 0..N-1, client
implementations can use the streamid from a server response to implementations can use the streamid from a server response to
efficiently match responses with outstanding requests, such as, for efficiently match responses with outstanding requests, such as, for
example, by using the streamid to index into a outstanding request example, by using the streamid to index into a outstanding request
array. array.
The server in turn may use this streamid, in conjunction with the The server in turn may use this streamid, in conjunction with the
transaction id within the RPC portion of the request, to maintain transaction id within the RPC portion of the request, to maintain
its duplicate request cache (DRC) for the session, as opposed to its duplicate request cache (DRC) for the session, as opposed to
the traditional approach of ONC RPC applications that use the XID the traditional approach of ONC RPC applications that use the XID
to index into the DRC. Unlike the XID, the streamid is always to index into the DRC. Unlike the XID, the streamid is always
within a specific range; this has two implications. The first within a specific range; this has two implications. The first
implication is that for a given session, the server need only cache implication is that for a given session, the server need only cache
the results of a limited number of COMPOUND requests. The second the results of a limited number of COMPOUND requests. The second
implication derives from the first, which is unlike XID indexed implication derives from the first, which is unlike XID indexed
DRCs, the streamid DRC by its nature cannot be overflowed. This DRCs, the streamid DRC by its nature cannot be overflowed. This
makes it practical to maintain all the required entries for an makes it practical to maintain all the required entries for an
effective, Exactly Once Semantics, DRC. effective, exactly-once semantics, DRC.
It is required to encode the streamid information in such a way It is required to encode the streamid information in such a way
that does not violate the minor versioning rules of the NFSv4.0 that does not violate the minor versioning rules of the NFSv4.0
specification. This is accomplished here by encoding it in a specification. This is accomplished here by encoding it in a
control operation within each NFSv4.1 COMPOUND and CB_COMPOUND control operation within each NFSv4.1 COMPOUND and CB_COMPOUND
procedure. The operation easily piggybacks within existing procedure. The operation easily piggybacks within existing
messages. The implementation section of this document describes messages. The implementation section of this document describes
the specific proposal. the specific proposal.
Exactly-once semantics completely replace the functionality Exactly-once semantics completely replace the functionality
provided by NFSv4.0 sequence numbers. It is no longer necessary to provided by NFSv4.0 sequence numbers. It is no longer necessary to
employ NFS sequence numbers and their contents must be ignored by employ NFS sequence numbers and their contents must be ignored by
NFSv4.1 servers when a session is in effect for the connection. NFSv4.1 servers when a session is in effect for the connection. As
Similarly, such server will never request open-confirmation previously discussed, such a server will never request open-
response to OPEN requests and a client issuing an OPEN_CONFIRM confirmation response to OPEN requests, and a client must not issue
operation will receive an immediate error. an OPEN_CONFIRM operation.
In the case where the server is actively adjusting its granted flow In the case where the server is actively adjusting its granted flow
control credits to the client, it may not be able to use receipt of control credits to the client, it may not be able to use receipt of
the streamid to retire a cache entry. The streamid used in an the streamid to retire a cache entry. The streamid used in an
incoming request may not reflect the server's current idea of the incoming request may not reflect the server's current idea of the
client's credit limit, because the request may have been sent from client's credit limit, because the request may have been sent from
the client before the update was received. Therefore, in the the client before the update was received. Therefore, in the
credit downward adjustment case, the server may have to retain a credit downward adjustment case, the server may have to retain a
number of duplicate request cache entries at least as large as the number of duplicate request cache entries at least as large as the
old credit value, until operation sequencing rules allow it to old credit value, until operation sequencing rules allow it to
skipping to change at page 33, line 39 skipping to change at page 32, line 43
Finally, note that the streamid is a guarantee of uniqueness only Finally, note that the streamid is a guarantee of uniqueness only
in the scope of an unbroken connection. A channel identifier, in the scope of an unbroken connection. A channel identifier,
assigned at bind time and unique within the session, provides the assigned at bind time and unique within the session, provides the
means by which this is detected. If a request is received on a means by which this is detected. If a request is received on a
channel with a channel identifier which does not match the incoming channel with a channel identifier which does not match the incoming
request, then the request must be handled as a potential retry on request, then the request must be handled as a potential retry on
the previous channel identifier. It is possible to receive the previous channel identifier. It is possible to receive
requests up to the credit limit previously in effect for the old requests up to the credit limit previously in effect for the old
channel, but new requests outside this range should be rejected. channel, but new requests outside this range should be rejected.
As in the flow control downward adjustment case, the server may As in the flow control downward adjustment case, the server may
finally retire the old channel's response cache entries based on finally retire the old channel's request cache entries based on
operation sequencing rules. operation sequencing rules.
3.3. COMPOUND and CB_COMPOUND 3.3. COMPOUND and CB_COMPOUND
Support for per-operation control can be piggybacked onto NFSv4 Support for per-operation control can be piggybacked onto NFSv4
COMPOUNDs with full transparency, by placing such facilities into COMPOUNDs with full transparency, by placing such facilities into
their own, new operation, and placing this operation first in each their own, new operation, and placing this operation first in each
COMPOUND under the new NFSv4 minor protocol revision. The contents COMPOUND under the new NFSv4 minor protocol revision. The contents
of the operation would then apply to the entire COMPOUND. of the operation would then apply to the entire COMPOUND.
Recall that the NFSv4 minor revision is contained within the Recall that the NFSv4 minor revision is contained within the
COMPOUND header, encoded prior to the COMPOUNDed operations. By COMPOUND header, encoded prior to the COMPOUNDed operations. By
simply requiring that the new operation always be contained in simply requiring that the new operation always be contained in
NFSv4 minor COMPOUNDs, the control protocol can piggyback perfectly NFSv4 minor COMPOUNDs, the control protocol can piggyback perfectly
with each request and response. with each request and response.
In this way, the NFSv4 RDMA Extensions may stay in compliance with In this way, the NFSv4 RDMA Extensions may stay in compliance with
the minor versioning requirements specified in section 10 of the minor versioning requirements specified in section 10 of
RFC3530 [NFSv4]. [RFC3530].
Referring to section 13.1 of the same document, the proposed Referring to section 13.1 of the same document, the proposed
session-enabled COMPOUND and CB_COMPOUND have the form: session-enabled COMPOUND and CB_COMPOUND have the form:
+-----+--------------+-----------+------------+-----------+---- +-----+--------------+-----------+------------+-----------+----
| tag | minorversion | numops | control op | op + args | ... | tag | minorversion | numops | control op | op + args | ...
| | (== 1) | (limited) | + args | | | | (== 1) | (limited) | + args | |
+-----+--------------+-----------+------------+-----------+---- +-----+--------------+-----------+------------+-----------+----
and the reply's structure is: and the reply's structure is:
skipping to change at page 34, line 32 skipping to change at page 33, line 36
+------------+-----+--------+-------------------------------+--// +------------+-----+--------+-------------------------------+--//
|last status | tag | numres | status + control op + results | // |last status | tag | numres | status + control op + results | //
+------------+-----+--------+-------------------------------+--// +------------+-----+--------+-------------------------------+--//
//-----------------------+---- //-----------------------+----
// status + op + results | ... // status + op + results | ...
//-----------------------+---- //-----------------------+----
The single control operation within each NFSv4.1 COMPOUND defines The single control operation within each NFSv4.1 COMPOUND defines
the context and operational session parameters which govern that the context and operational session parameters which govern that
COMPOUND request and reply. Placing it first in the COMPOUND COMPOUND request and reply. Placing it first in the COMPOUND
encoding is not strictly required, but is certainly logical and may encoding is required in order to allow its processing before other
enable certain optimizations. operations in the COMPOUND. This is especially important where
chaining is in effect, as the chain must be checked for correctness
prior to execution.
3.4. eXternal Data Representation Efficiency 3.4. eXternal Data Representation Efficiency
RDMA is a copy avoidance technology, and it is important to RDMA is a copy avoidance technology, and it is important to
maintain this efficiency when decoding received messages. maintain this efficiency when decoding received messages.
Traditional XDR implementations frequently use generated Traditional XDR implementations frequently use generated
unmarshaling code to convert objects to local form, incurring a unmarshaling code to convert objects to local form, incurring a
data copy in the process (in addition to subjecting the caller to data copy in the process (in addition to subjecting the caller to
recursive calls, etc). Often, such conversions are carried out recursive calls, etc). Often, such conversions are carried out
even when no size or byte order conversion is necessary. even when no size or byte order conversion is necessary.
skipping to change at page 35, line 16 skipping to change at page 34, line 23
operation, in which such encoding abounds. operation, in which such encoding abounds.
3.5. Effect of Sessions on Existing Operations 3.5. Effect of Sessions on Existing Operations
The use of a session and associated message credits to provide The use of a session and associated message credits to provide
exactly-once semantics allows considerable simplification of a exactly-once semantics allows considerable simplification of a
number of mechanisms in the base protocol that are all devoted in number of mechanisms in the base protocol that are all devoted in
some way to providing replay protection. In particular, the use of some way to providing replay protection. In particular, the use of
sequence id's on many operations becomes superfluous. Rather than sequence id's on many operations becomes superfluous. Rather than
replace existing operations with variants that delete the sequence replace existing operations with variants that delete the sequence
id's, the sequence id's will still be present and checked for id's, sequence id's will still be present but their value must not
correctness, but not used for replay protection. In addition, when be checked for correctness, nor used for replay protection. In
a session is in effect for the connection, the OPEN_CONFIRM addition, when a session is in effect for the connection, OPENs
operation will no longer be required; OPEN's will never require will never require confirmation, the server must not require
confirmation and the server, in NFSv4.1, must not require such confirmation, and the OPEN_CONFIRM operation must not be issued by
confirmation. the client.
Since each session will only be used by a single client, the use of Since each session will only be used by a single client, the use of
a clientid in many operations will no longer be required. Rather a clientid in many operations will no longer be required. Rather
than remove clientid parameters, the existing operations that use than remove clientid parameters, the existing operations that use
them will remain unchanged but a value of zero can be used. The them will remain unchanged but a value of zero can be used. The
determination of the client will follow from the session membership determination of the client will follow from the session membership
of the connection on which the request arrived. of the connection on which the request arrived.
A similar situation to sequence numbers, described earlier, exists
for NFSv4.0 clientid operations. There is no longer a need for
SETCLIENTID and SETCLIENTID_CONFIRM, as clientid uniqueness is
managed by the server through the session, and negotiation is both
unnecessary and redundant. Additionally, the cb_program and
cb_location which are obtained by the server in SETCLIENTID_CONFIRM
must not be used by the server, because the NFSv4.1 client performs
callback channel designation with SESSION_BIND. A server should
return an error to NFSv4.1 clients which might issue either
operation.
Finally the RENEW operation is made unnecessary when a session is
present, and the server should return an error to clients which
might issue it.
In summary, the
o OPEN_CONFIRM
o SETCLIENTID
o SETCLIENTID_CONFIRM
o RENEW
operations must not be issued or handled by client nor server when
a session is in effect.
Since the session carries the client indication with it implicitly, Since the session carries the client indication with it implicitly,
any request on a session associated with a given client will renew any request on a session associated with a given client will renew
that client's leases. that client's leases.
3.6. Authentication Efficiencies 3.6. Authentication Efficiencies
NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor NFSv4 requires the use of the RPCSEC_GSS ONC RPC security flavor
[RFC2203] to provide authentication, integrity, and privacy via [RFC2203] to provide authentication, integrity, and privacy via
cryptography. The server dictates to the client the use of cryptography. The server dictates to the client the use of
RPCSEC_GSS, the service (authentication, integrity, or privacy), RPCSEC_GSS, the service (authentication, integrity, or privacy),
skipping to change at page 36, line 4 skipping to change at page 35, line 41
If the connection's integrity is protected by an additional means If the connection's integrity is protected by an additional means
than RPCSEC_GSS, such as via IPsec, then the use of RPCSEC_GSS's than RPCSEC_GSS, such as via IPsec, then the use of RPCSEC_GSS's
integrity service is nearly redundant (See the Security integrity service is nearly redundant (See the Security
Considerations section for more explanation of why it is "nearly" Considerations section for more explanation of why it is "nearly"
and not completely redundant). Likewise, if the connection's and not completely redundant). Likewise, if the connection's
privacy is protected by additional means, then the use of both privacy is protected by additional means, then the use of both
RPCSEC_GSS's integrity and privacy services is nearly redundant. RPCSEC_GSS's integrity and privacy services is nearly redundant.
Connection protection schemes, such as IPsec, are more likely to be Connection protection schemes, such as IPsec, are more likely to be
implemented in hardware than upper layer protocols like RPCSEC_GSS. implemented in hardware than upper layer protocols like RPCSEC_GSS.
Hardware-based cryptography at the IPsec layer will be more Hardware-based cryptography at the IPsec layer will be more
efficient than software-based cryptography at the RPCSEC_GSS layer. efficient than software-based cryptography at the RPCSEC_GSS layer.
When transport integrity can be obtained, it is possible for server When transport integrity can be obtained, it is possible for server
and client to downgrade their per-operation authentication, after and client to downgrade their per-operation authentication, after
an appropriate exchange. This downgrade can in fact be as complete an appropriate exchange. This downgrade can in fact be as complete
as to establish security mechanisms that have zero cryptographic as to establish security mechanisms that have zero cryptographic
overhead, effectively using the underlying integrity and privacy overhead, effectively using the underlying integrity and privacy
services provided by transport. services provided by transport.
Based on the above observations, a new GSS-API mechanism, called Based on the above observations, a new GSS-API mechanism, called
the Context Cache Mechanism [CCM], is being defined. The CCM works the Channel Conjunction Mechanism [CCM], is being defined. The CCM
by creating a GSS-API security context using as input a cookie that works by creating a GSS-API security context using as input a
the initiator and target have previously agreed to be a handle for cookie that the initiator and target have previously agreed to be a
GSS-API context created previously over another GSS-API mechanism. handle for GSS-API context created previously over another GSS-API
mechanism.
NFSv4.1 clients and servers should support CCM and they must use as NFSv4.1 clients and servers should support CCM and they must use as
the cookie the handle from a successful RPCSEC_GSS context creation the cookie the handle from a successful RPCSEC_GSS context creation
over a non-CCM mechanism (such as Kerberos V5). The value of the over a non-CCM mechanism (such as Kerberos V5). The value of the
cookie will be equal to the handle field of the rpc_gss_init_res cookie will be equal to the handle field of the rpc_gss_init_res
structure from the RPCSEC_GSS specification. structure from the RPCSEC_GSS specification.
The [CCM] Draft provides further discussion and examples. The [CCM] Draft provides further discussion and examples.
4. Security Considerations 4. Security Considerations
skipping to change at page 37, line 5 skipping to change at page 36, line 42
efficiency which RDMA is typically employed to achieve. This is efficiency which RDMA is typically employed to achieve. This is
because such data is normally managed solely by the RDMA fabric, because such data is normally managed solely by the RDMA fabric,
and intentionally is not touched by software. Therefore when and intentionally is not touched by software. Therefore when
employing RPCSEC_GSS under CCM, and where integrity protection has employing RPCSEC_GSS under CCM, and where integrity protection has
been "downgraded", the cooperation of the RDMA transport provider been "downgraded", the cooperation of the RDMA transport provider
is critical to maintain any integrity and privacy otherwise in is critical to maintain any integrity and privacy otherwise in
place for the session. The means by which the local RPCSEC_GSS place for the session. The means by which the local RPCSEC_GSS
implementation is integrated with the RDMA data protection implementation is integrated with the RDMA data protection
facilities are outside the scope of this draft. facilities are outside the scope of this draft.
It is logical to use the same GSS context on a session's callback
channel as that used on its operations channel(s), but the issue
warrants careful analysis.
If the NFS client wishes to maintain full control over RPCSEC_GSS If the NFS client wishes to maintain full control over RPCSEC_GSS
protection, it may still perform its transfer operations using protection, it may still perform its transfer operations using
either the inline or RDMA transfer model, or of course employ either the inline or RDMA transfer model, or of course employ
traditional TCP stream operation. In the RDMA inline case, header traditional TCP stream operation. In the RDMA inline case, header
padding is recommended to optimize behavior at the server. At the padding is recommended to optimize behavior at the server. At the
client, close attention should be paid to the implementation of client, close attention should be paid to the implementation of
RPCSEC_GSS processing to minimize memory referencing and especially RPCSEC_GSS processing to minimize memory referencing and especially
copying. These are well-advised in any case! copying. These are well-advised in any case!
Proper authentication of the session binding operation of the Proper authentication of the session binding operation of the
proposed NFSv4.1 exactly follows the similar requirement on client proposed NFSv4.1 exactly follows the similar requirement on client
identifiers in NFSv4.0. It must not be possible for a client to identifiers in NFSv4.0. It must not be possible for a client to
bind to an existing session by guessing its session identifier. To bind to an existing session by guessing its session identifier. To
protect against this, NFSv4.0 requires appropriate authentication protect against this, NFSv4.0 requires appropriate authentication
and matching of the principal used. This is discussed in Section and matching of the principal used. This is discussed in Section
16, Security Considerations of [NFSv4]. The same requirement 16, Security Considerations of [RFC3530]. The same requirement
before binding to a session identifier applies here. before binding to a session identifier applies here.
The proposed session binding improves security over that provided The proposed session binding improves security over that provided
by NFSv4 for the callback channel. The connection is client- by NFSv4 for the callback channel. The connection is client-
initiated, and subject to the same firewall and routing checks as initiated, and subject to the same firewall and routing checks as
the operations channel. The connection cannot be hijacked by an the operations channel. The connection cannot be hijacked by an
attacker who connects to the client port prior to the intended attacker who connects to the client port prior to the intended
server. The connection is set up by the client with its desired server. The connection is set up by the client with its desired
attributes, such as optionally securing with IPsec or similar. The attributes, such as optionally securing with IPsec or similar. The
binding is fully authenticated before being activated. binding is fully authenticated before being activated.
The server should take care to protect itself against denial of The server should take care to protect itself against denial of
service attacks in the creation of sessions and clientids. Clients service attacks in the creation of sessions and clientids. Clients
who connect and create sessions, only to disconnect and never bind who connect and create sessions, only to disconnect and never bind
to them may leave significant state behind. The same issue applies to them may leave significant state behind. (The same issue
to NFSv4.0 with clients who may perform SETCLIENTID, then never applies to NFSv4.0 with clients who may perform SETCLIENTID, then
perform SETCLIENTID_CONFIRM. Careful authentication coupled with never perform SETCLIENTID_CONFIRM.) Careful authentication coupled
resource checks is highly recommended. with resource checks is highly recommended.
5. IANA Considerations 5. IANA Considerations
As a proposal based on minor protocol revision, any new minor As a proposal based on minor protocol revision, any new minor
number might be registered and reserved with the agreed-upon number might be registered and reserved with the agreed-upon
specification. Assigned operation numbers and any RPC constants specification. Assigned operation numbers and any RPC constants
might undergo the same process. might undergo the same process.
There are no issues stemming from RDMA use itself regarding port There are no issues stemming from RDMA use itself regarding port
number assignments not already specified by [NFSv4]. Initial number assignments not already specified by [RFC3530]. Initial
connection is via ordinary TCP stream services, operating on the connection is via ordinary TCP stream services, operating on the
same ports and under the same set of naming services. same ports and under the same set of naming services.
In the Automatic RDMA connection model described above, it is In the Automatic RDMA connection model described above, it is
possible that a new well-known port, or a new transport type possible that a new well-known port, or a new transport type
assignment (netid) as described in [NFSv4], may be desirable. assignment (netid) as described in [RFC3530], may be desirable.
6. NFSv4 Protocol RDMA and Session Extensions 6. NFSv4 Protocol Extensions
This section specifies details of the five extensions to NFSv4 This section specifies details of the five extensions to NFSv4
proposed by this document. Existing NFSv4 operations (under minor proposed by this document. Existing NFSv4 operations (under minor
version 0) continue to be fully supported, unmodified. version 0) continue to be fully supported, unmodified.
6.1. SESSION_CREATE 6.1. SESSION_CREATE
SYNOPSIS SYNOPSIS
sessionparams -> sessionresults sessionparams -> sessionresults
ARGUMENT ARGUMENT
enum ConnectionMode {
STREAM = 0,
RDMA = 1
};
struct SESSIONCREATE4args { struct SESSIONCREATE4args {
nfs_client_id4 clientid; nfs_client_id4 clientid;
bool persist; bool persist;
enum ConnectionMode mode; uint32 totalrequests;
}; };
RESULT RESULT
struct SESSIONCREATE4resok { struct SESSIONCREATE4resok {
uint64 sessionid; uint64 sessionid;
bool persist; bool persist;
enum ConnectionMode mode; uint32 totalrequests;
}; };
union SESSIONCREATE4res switch (nfsstat4 status) { union SESSIONCREATE4res switch (nfsstat4 status) {
case NFS4_OK: case NFS4_OK:
SESSIONCREATE4resok resok4; SESSIONCREATE4resok resok4;
default: default:
void; void;
}; };
DESCRIPTION DESCRIPTION
The SESSION_CREATE operation creates a session to which client The SESSION_CREATE operation creates a session to which client
connections may be bound with SESSION_BIND. connections may be bound with SESSION_BIND.
The "persist" argument indicates to the server whether the client
requires strict response caching for the session. For example, a
read-only session may set persist to FALSE. The server may choose
to change the returned value of "persist" to match its
implementation choice.
The "totalrequests" argument allows the server to size any
necessary response cache storage. It is the largest number of
outstanding requests which the client will adhere to session-wide.
Note that the SESSION_CREATE operation never appears with an
associated streamid. Therefore the SESSION_CREATE operation may
not receive the same level of exactly-once replay protection in the
face of transport failure. However, because at most one
SESSION_CREATE operation may be issued on a connection, servers can
provide "special" caching of the result (the sessionid) to
compensate for this.
... ...
ERRORS ERRORS
<tbd> <tbd>
6.2. SESSION_BIND 6.2. SESSION_BIND
SYNOPSIS SYNOPSIS
skipping to change at page 40, line 45 skipping to change at page 40, line 45
and sizes for the operations channel, while the back channel and sizes for the operations channel, while the back channel
specifies client credits and sizes for the back channel. Padding specifies client credits and sizes for the back channel. Padding
and also direct operations are generally not required on the back and also direct operations are generally not required on the back
channel. channel.
The channelid is a unique session-wide indentifier for each newly The channelid is a unique session-wide indentifier for each newly
bound connection. New requests must be issued on a channel with bound connection. New requests must be issued on a channel with
the matching identifier, while requests retried after connection the matching identifier, while requests retried after connection
failure must reissue the original identifier. failure must reissue the original identifier.
When ConnectionMode is "RDMA", the channel may be promoted to RDMA
mode by the server before replying, if supported.
The "maxrequests" value is a hint which the client may use to
communicate to the server its expected credit use on the channel.
The client must always adhere to the "totalrequests" value,
aggregated on all channels within the session, which it negotiated
with the server at session creation.
Note that the SESSION_BIND operation never appears with an
associated streamid, but also never requires replay protection. A
client which suffered a connection loss must immediately respond
with new SESSION_BIND, and never a retransmit. Also, for this
reason, it is recommended to use SESSION_BIND alone in its request.
... ...
ERRORS ERRORS
<tbd> <tbd>
6.3. SESSION_DISCONNECT 6.3. SESSION_DESTROY
SYNOPSIS SYNOPSIS
void -> status void -> status
ARGUMENT ARGUMENT
void; void;
RESULT RESULT
struct SESSION_DISCONNECTres { struct SESSION_DESTROYres {
nfsstat status; nfsstat status;
}; };
DESCRIPTION DESCRIPTION
The SESSION_DISCONNECT operation closes the session and discards The SESSION_DESTROY operation closes the session and discards any
any active state such as locks, leases, and server duplicate active state such as locks, leases, and server duplicate request
request cache entries. Any remaining connections bound to the cache entries. Any remaining connections bound to the session are
session are immediately unbound and may additionally be closed by immediately unbound and may additionally be closed by the server.
the server.
This operation must be the final, or only operation after the
required OPERATION_CONTROL in any request. Because the operation
results in destruction of the session, any duplicate request
caching for this request, as well as previously completed rewuests,
will be lost. For this reason, it is advisable to not place this
operation in a request with other state-modifying operations.
Note that because the operation will never be replayed by the
server, a client that retransmits the request may receive an error
in response, even though the session may have been successfully
destroyed.
... ...
ERRORS ERRORS
<tbd> <tbd>
6.4. OPERATION_CONTROL 6.4. OPERATION_CONTROL
SYNOPSIS SYNOPSIS
skipping to change at page 42, line 22 skipping to change at page 42, line 47
default: default:
void; void;
}; };
DESCRIPTION DESCRIPTION
The OPERATION_CONTROL operation is used to manage operational The OPERATION_CONTROL operation is used to manage operational
accounting for the channel on which the operation is sent. The accounting for the channel on which the operation is sent. The
contents include the Streamid, used by the server to implement contents include the Streamid, used by the server to implement
exactly-once semantics, and chaining flags to implement request exactly-once semantics, and chaining flags to implement request
chaining for the operations channel. This operation must be the chaining for the operations channel. This operation must appear
first in each COMPOUND and CB_COMPOUND sent in NFSv4.1 after the once as the first operation in each COMPOUND and CB_COMPOUND sent
channel is successfully bound, and any subsequent appearance is a after the channel is successfully bound, or a protocol error must
protocol error. result.
The channelid and streamid are provided in the arguments in order
to permit the server to implement duplicate request cache handling.
The streamid is provided in the results in order to assist the
client in efficiently demultiplexing the reply.
... ...
ERRORS ERRORS
Streamid out of bounds Streamid out of bounds
CHAIN_INVALID and CHAIN_BROKEN CHAIN_INVALID and CHAIN_BROKEN
6.5. CB_CREDITRECALL 6.5. CB_CREDITRECALL
SYNOPSIS SYNOPSIS
count4 -> status targetcount -> status
ARGUMENT ARGUMENT
count4 target; count4 target;
RESULT RESULT
struct CB_CREDITRECALLres { struct CB_CREDITRECALLres {
nfsstat status; nfsstat status;
}; };
skipping to change at page 43, line 4 skipping to change at page 43, line 34
count4 target; count4 target;
RESULT RESULT
struct CB_CREDITRECALLres { struct CB_CREDITRECALLres {
nfsstat status; nfsstat status;
}; };
DESCRIPTION DESCRIPTION
The CB_CREDITRECALL operation requests the client to return credits The CB_CREDITRECALL operation requests the client to return credits
at the server, by zero-length RDMA Sends or NULL NFSv4 operations. at the server, by zero-length RDMA Sends or NULL NFSv4 operations.
... ...
ERRORS ERRORS
<none> <none>
7. Acknowledgements 7. Acknowledgements
The authors wish to acknowledge the valuable contributions and The authors wish to acknowledge the valuable contributions and
review of Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak, review of Brent Callaghan, Mike Eisler, John Howard, Chet Juszczak,
Dave Noveck and Mark Wittle. Dave Noveck and Mark Wittle.
8. References 8. References
[RPCRDMA] [CCM]
B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC" M. Eisler, N. Williams, "The Channel Conjunction Mechanism
Internet-Draft Work in Progress, http://www.ietf.org/internet- (CCM) for GSS", Internet-Draft Work in Progress,
drafts/draft-callaghan-rpc-rdma-00.txt http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-ccm-02
[NFSDDP]
B. Callaghan, T. Talpey, "NFS Direct Data Placement",
Internet-Draft Work in Progress, http://www.ietf.org/internet-
drafts/draft-callaghan-nfsdirect-00.txt
[CJ89] [CJ89]
C. Juszczak, "Improving the Performance and Correctness of an C. Juszczak, "Improving the Performance and Correctness of an
NFS Server," Winter 1989 USENIX Conference Proceedings, USENIX NFS Server," Winter 1989 USENIX Conference Proceedings, USENIX
Association, Berkeley, CA, Februry 1989, pages 53-63. Association, Berkeley, CA, Februry 1989, pages 53-63.
[CLAN]
Emulex/Giganet cLAN,
http://www.emulex.com/products/legacy/vi/clan1000.html
[DAFS] [DAFS]
Direct Access File System http://www.dafscollaborative.org Direct Access File System, available from
http://www.ietf.org/internet-drafts/draft-wittle-dafs-00.txt http://www.dafscollaborative.org
[DCK+03] [DCK+03]
M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T. M. DeBergalis, P. Corbett, S. Kleiman, A. Lent, D. Noveck, T.
Talpey, M. Wittle, "The Direct Access File System", in Talpey, M. Wittle, "The Direct Access File System", in
Proceedings of 2nd USENIX Conference on File and Storage Proceedings of 2nd USENIX Conference on File and Storage
Technologies (FAST '03), San Francisco, CA, March 31 - April Technologies (FAST '03), San Francisco, CA, March 31 - April
2, 2003 2, 2003
[FCVI] [DDP]
VI over Fibre Channel Standard (ANSI T11.3 FC-VI ANSI/NCITS H. Shah, J. Pinkerton, R. Recio, P. Culley, "Direct Data
357-2001), http://www.t11.org Placement over Reliable Transports",
http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-01
[FJDAFS] [FJDAFS]
Fujitsu Prime Software Technologies, "Meet the DAFS Fujitsu Prime Software Technologies, "Meet the DAFS
Performance with DAFS/VI Kernel Implementation using cLAN", Performance with DAFS/VI Kernel Implementation using cLAN",
http://www.pst.fujitsu.com/english/dafsdemo/index.html http://www.pst.fujitsu.com/english/dafsdemo/index.html
[FJNFS] [FJNFS]
Fujitsu Prime Software Technologies, "An Adaptation of VIA to Fujitsu Prime Software Technologies, "An Adaptation of VIA to
NFS on Linux", NFS on Linux",
http://www.pst.fujitsu.com/english/nfs/index.html http://www.pst.fujitsu.com/english/nfs/index.html
[IB] InfiniBand Architecture Specification, Volume 1, Release 1.1. [IB] InfiniBand Architecture Specification, Volume 1, Release 1.1.
http://www.infinibandta.org available from http://www.infinibandta.org
[KM02] [KM02]
K. Magoutis, "Design and Implementation of a Direct Access K. Magoutis, "Design and Implementation of a Direct Access
File System (DAFS) Kernel Server for FreeBSD", in Proceedings File System (DAFS) Kernel Server for FreeBSD", in Proceedings
of USENIX BSDCon 2002 Conference, San Francisco, CA, February of USENIX BSDCon 2002 Conference, San Francisco, CA, February
11-14, 2002. 11-14, 2002.
[MAF+02] [MAF+02]
K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, D. K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, D.
Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure Gallatin, R. Kisley, R. Wickremesinghe, E. Gabber, "Structure
and Performance of the Direct Access File System (DAFS)", in and Performance of the Direct Access File System (DAFS)", in
Proceedings of 2002 USENIX Annual Technical Conference, Proceedings of 2002 USENIX Annual Technical Conference,
Monterey, CA, June 9-14, 2002. Monterey, CA, June 9-14, 2002.
[MIDTAX] [MIDTAX]
B. Carpenter, S. Brim, "Middleboxes: Taxonomy and Issues", B. Carpenter, S. Brim, "Middleboxes: Taxonomy and Issues",
Informational RFC, http://www.ietf.org/rfc/rfc3234.txt Informational RFC, http://www.ietf.org/rfc/rfc3234
[CCM]
M. Eisler, "NFSv4 Context Cache Management", Internet-Draft
Work in Progress, http://www.ietf.org/internet-drafts/draft-
eisler-nfsv4-ccm-00.txt
[MYR] [NFSDDP]
Myrinet, http://www.myrinet.com B. Callaghan, T. Talpey, "NFS Direct Data Placement",
Internet-Draft Work in Progress, http://www.ietf.org/internet-
drafts/draft-callaghan-nfsdirect-01
[NFSv4] [NFSPS]
S. Shepler, et. al., "NFS Version 4 Protocol", Standards Track T. Talpey, C. Juszczak, "NFS RDMA Problem Statement",
RFC, http://www.ietf.org/rfc/rfc3530.txt Internet-Draft Work in Progress, http://www.ietf.org/internet-
drafts/draft-talpey-nfs-rdma-problem-statement-01
[ORION] [RDMAREQ]
Emulex GN/9000VI Orion, B. Callaghan, M. Wittle, "NFS RDMA requirements", Internet-
http://www.emulex.com/products/viip/gn9000VI.html Draft Work in Progress, http://www.ietf.org/internet-
drafts/draft-callaghan-nfs-rdmareq-00
[QUAD] [RFC3530]
Quadrics Ltd., http://www.quadrics.com S. Shepler, et. al., "NFS Version 4 Protocol", Standards Track
RFC, http://www.ietf.org/rfc/rfc3530
[RDDP] [RDDP]
Remote Direct Data Placement Working Group charter, Remote Direct Data Placement Working Group charter,
http://www.ietf.org/html.charters/rddp-charter.html http://www.ietf.org/html.charters/rddp-charter.html
[RDDPPS] [RDDPPS]
Remote Direct Data Placement Working Group Problem Statement, Remote Direct Data Placement Working Group Problem Statement,
A. Romanow, J. Mogul, T. Talpey, S. Bailey, A. Romanow, J. Mogul, T. Talpey, S. Bailey,
http://www.ietf.org/internet-drafts/draft-ietf-rddp-problem- http://www.ietf.org/internet-drafts/draft-ietf-rddp-problem-
statement-00.txt statement-03
[RDMAP]
R. Recio, P. Culley, D. Garcia, J. Hilland, "An RDMA Protocol
Specification", http://www.ietf.org/internet-drafts/draft-
ietf-rddp-rdmap-01
[RPCRDMA]
B. Callaghan, T. Talpey, "RDMA Transport for ONC RPC"
Internet-Draft Work in Progress, http://www.ietf.org/internet-
drafts/draft-callaghan-rpc-rdma-01
[RFC2203] [RFC2203]
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol
Specification", Standards Track RFC, Specification", Standards Track RFC,
http://www.ietf.org/rfc/rfc2203.txt http://www.ietf.org/rfc/rfc2203
[SNIA]
B. Callaghan, "ONC RPC over RDMA Strawman",
http://www.snia.org/tech_activities/workgroups/nfs_rdma
[SVRNET]
Compaq Servernet,
http://nonstop.compaq.com/view.asp?PAGE=ServerNet
[VIA]
Virtual Interface Architecture Specification Version 1.0,
http://www.vidf.org/info/04standards.html
[VITCP]
S. DiCecco, J. Williams, "VI/TCP (Internet VI)", Internet-
Draft Work in Progress (expired),
http://www.ietf.org/internet-drafts/draft-dicecco-vitcp-00.txt
Authors' Addresses Authors' Addresses
Tom Talpey Tom Talpey
Network Appliance, Inc. Network Appliance, Inc.
375 Totten Pond Road 375 Totten Pond Road
Waltham, MA 02451 USA Waltham, MA 02451 USA
Phone: +1 781 768 5329 Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com EMail: thomas.talpey@netapp.com
skipping to change at page 46, line 4 skipping to change at page 46, line 20
Authors' Addresses Authors' Addresses
Tom Talpey Tom Talpey
Network Appliance, Inc. Network Appliance, Inc.
375 Totten Pond Road 375 Totten Pond Road
Waltham, MA 02451 USA Waltham, MA 02451 USA
Phone: +1 781 768 5329 Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com EMail: thomas.talpey@netapp.com
Spencer Shepler Spencer Shepler
Sun Microsystems, Inc. Sun Microsystems, Inc.
7808 Moonflower Drive 7808 Moonflower Drive
Austin, TX 78750 USA Austin, TX 78750 USA
Phone: +1 512 349 9376 Phone: +1 512 349 9376
EMail: spencer.shepler@sun.com EMail: spencer.shepler@sun.com
Full Copyright Statement Full Copyright Statement
Copyright (C) The Internet Society (2003). All Rights Reserved. Copyright (C) The Internet Society (2004). All Rights Reserved.
This document and translations of it may be copied and furnished to This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain others, and derivative works that comment on or otherwise explain
it or assist in its implementation may be prepared, copied, it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction published and distributed, in whole or in part, without restriction
of any kind, provided that the above copyright notice and this of any kind, provided that the above copyright notice and this
paragraph are included on all such copies and derivative works. paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such However, this document itself may not be modified in any way, such
as by removing the copyright notice or references to the Internet as by removing the copyright notice or references to the Internet
Society or other Internet organizations, except as needed for the Society or other Internet organizations, except as needed for the
 End of changes. 135 change blocks. 
480 lines changed or deleted 532 lines changed or added

This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/