NFSv4                                                          M. Eisler
Internet-Draft                                                    NetApp
Intended status: Standards Track                        October 18, 2010
Expires: April 21, 2011


Metadata Striping for pNFS
draft-eisler-nfsv4-pnfs-metastripe-02.txt

Abstract

This Internet-Draft describes a means to add metadata striping to pNFS.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

This Internet-Draft will expire on April 21, 2011.

Copyright Notice

Copyright (c) 2010 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.



Table of Contents

1.  Introduction and Motivation
2.  Terminology
3.  Scope of Metadata Striping
4.  The Definition of Metadata Striping Layout
    4.1.  Name of Metadata Striping Layout Type
    4.2.  Value of Metadata Striping Layout Type
    4.3.  Definition of the da_addr_body Field of the device_addr4 Data Type
    4.4.  Definition of the loh_body Field of the layouthint4 Data Type
    4.5.  Definition of the loc_body Field of the layout_content4 Data Type
    4.6.  Definition of the lou_body Field of the layoutupdate4 Data Type
    4.7.  Storage Access Protocols
    4.8.  Revocation of Layouts
    4.9.  Stateids
    4.10.  Lease Terms
    4.11.  Layout Operations Sent to an L-MDS
    4.12.  Filehandles in Metadata Layouts
    4.13.  READ and WRITE Operations
    4.14.  Recovery
        4.14.1.  Failure and Restart of Client
        4.14.2.  Failure and Restart of Server
        4.14.3.  Failure and Restart of Storage Device
5.  Negotiation
6.  Operational Recommendation for Deployment
7.  Acknowledgements
8.  Security Considerations
9.  IANA Considerations
10.  Normative References
Author's Address




1.  Introduction and Motivation

The NFSv4.1 specification [2] describes pNFS. In NFSv4.1, pNFS is limited to the data contents of regular files. The content of regular files is distributed (striped) across multiple storage devices. Metadata is not distributed or striped, and indeed, the model presented in the NFSv4.1 specification is that of a single metadata server. This document describes a means to add metadata striping to pNFS, which includes the notion of multiple metadata servers. With metadata striping, multiple metadata servers may work together to provide higher parallel performance.

This document does not require a new minor version of NFSv4. Instead, it requires a new layout type.

The XDR description is provided in this document in a way that makes it simple for the reader to extract it into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine-readable XDR description of the metadata layout:

#!/bin/sh
grep "^  *///" | sed 's?^  *///  ??' | sed 's?^.*///??'

That is, if the above script is stored in a file called "extract.sh", and this document is in a file called "spec.txt", then the reader can do:

 sh extract.sh < spec.txt > md.x

The effect of the script is to remove leading white space from each line of the specification, plus a sentinel sequence of "///".
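
For readers without a POSIX shell at hand, the same filtering can be sketched in Python. This is an illustrative equivalent of the shell pipeline above, not part of the specification; the function name extract_xdr is hypothetical.

```python
import re


def extract_xdr(text):
    """Mimic: grep "^  *///" | sed 's?^  *///  ??' | sed 's?^.*///??'"""
    out = []
    for line in text.splitlines():
        # grep: keep only lines with leading spaces followed by "///".
        if not re.match(r" +///", line):
            continue
        # First sed: strip the indent, the sentinel, and two spaces.
        line = re.sub(r"^ +///  ", "", line, count=1)
        # Second sed: strip what remains on sentinel-only lines.
        line = re.sub(r"^.*///", "", line, count=1)
        out.append(line)
    return "".join(l + "\n" for l in out)


sample = "   ///  struct foo {\n   ///\nprose line\n   ///  };\n"
print(extract_xdr(sample), end="")
```

Like the shell script, the sketch only selects lines in which the "///" sentinel is preceded by at least one space.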



2.  Terminology



3.  Scope of Metadata Striping

This proposal assumes a model where there are two or more servers capable of supporting NFSv4.1 operations. At least one server is an I-MDS, and the I-MDS should be thought of as a normal NFSv4.1 server, with the additional capability of granting metadata layouts on demand. The I-MDS might also be capable of granting non-metadata layouts, but this is irrelevant to the scope of metadata striping. The model also requires at least one additional server, an L-MDS, that is capable of supporting NFSv4.1 operations that are directed to the server by the I-MDS. It is permissible for an I-MDS to also be an L-MDS, and an L-MDS to also be an I-MDS. Indeed, a simple submodel is for every NFSv4.1 server in a set to be both an I-MDS and L-MDS.

Metadata striping applies to all NFSv4.1 operations that operate on file objects. These operations can be broken down into three classes:

  1. Filehandle-only operations
  2. Name-based operations
  3. Directory-reading operations



4.  The Definition of Metadata Striping Layout



4.1.  Name of Metadata Striping Layout Type

The name of the metadata striping layout type is LAYOUT4_METADATA.



4.2.  Value of Metadata Striping Layout Type

The value of the metadata striping layout type is TBD1.



4.3.  Definition of the da_addr_body Field of the device_addr4 Data Type



///  %#include "nfs4_prot.h"
///  union md_layout_addr4 switch (bool mdla_simple) {
///    case TRUE:
///      multipath_list4              mdla_simple_addr;
///    case FALSE:
///      nfsv4_1_file_layout_ds_addr4 mdla_complex_addr;
///  };

 Figure 1 

If mdla_simple is TRUE, the remainder of the device address contains a list of elements (mdla_simple_addr), where each element represents a network address of an L-MDS which can serve equally as the target of metadata operations (typically the filehandle-only operations). See Section 13.5 of [2] for a description of how the multipath_list4 data type supports multi-pathing.

If mdla_simple is FALSE, the remainder of the device address is the same as the LAYOUT4_NFSV4_1_FILES device address, consisting of an array of lists of L-MDS servers (nflda_multipath_ds_list), and an array of indices (nflda_stripe_indices). Each element of nflda_multipath_ds_list contains one or more subelements, and each subelement represents a network address of an L-MDS which may serve equally as the target of name-based and directory-reading operations (see Section 13.5 of [2]). The number of elements in the nflda_multipath_ds_list array might differ from the stripe count. The stripe count is the number of elements in nflda_stripe_indices. The value of each element of nflda_stripe_indices is an index into nflda_multipath_ds_list, and thus the value of each element of nflda_stripe_indices MUST be less than the number of elements in nflda_multipath_ds_list.
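
The relationship between the two arrays can be illustrated with a minimal Python sketch. The function names and the use of plain lists of address strings are assumptions made for illustration; only the range rule on the indices comes from the text above.

```python
def validate_complex_addr(multipath_ds_list, stripe_indices):
    """Check the index rule and return the stripe count.

    multipath_ds_list stands in for nflda_multipath_ds_list: a list whose
    elements are lists of interchangeable L-MDS network addresses.
    stripe_indices stands in for nflda_stripe_indices.
    """
    for idx in stripe_indices:
        # Each index MUST be less than the number of list elements.
        if not 0 <= idx < len(multipath_ds_list):
            raise ValueError("stripe index %d out of range" % idx)
    # The stripe count is the number of elements in the index array.
    return len(stripe_indices)


def addresses_for_stripe(multipath_ds_list, stripe_indices, stripe):
    """Return the equivalent L-MDS addresses for one stripe position."""
    validate_complex_addr(multipath_ds_list, stripe_indices)
    return multipath_ds_list[stripe_indices[stripe]]


# Two L-MDS groups, four stripes: the counts need not match.
ds_list = [["192.0.2.1:2049", "192.0.2.11:2049"], ["192.0.2.2:2049"]]
indices = [0, 1, 1, 0]
print(validate_complex_addr(ds_list, indices))
print(addresses_for_stripe(ds_list, indices, 2))
```

Any address within the selected element may be used, since the subelements of one list element serve equally as targets.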



4.4.  Definition of the loh_body Field of the layouthint4 Data Type



///  enum md_layout_hint_care4 {
///         MD4_CARE_STRIPE_UNIT_SIZE    = 0x040,
///         MD4_CARE_STRIPE_CNT_NAMEOPS  = 0x080,
///         MD4_CARE_STRIPE_CNT_DIRRDOPS = 0x100
///  };
///  %
///  %/* Encoded in the loh_body field of type layouthint4: */
///  %
///  struct md_layouthint4 {
///         uint32_t        mdlh_care;
///         count4          mdlh_stripe_cnt_nameops;
///         count4          mdlh_stripe_cnt_dirrdops;
///         nfs_cookie4     mdlh_stripe_unit_size;
///  };
 Figure 2 

The layout-type specific content for the LAYOUT4_METADATA layout type is composed of four fields. The first field, mdlh_care, is a set of flags indicating which values of the hint the client cares about. If MD4_CARE_STRIPE_CNT_NAMEOPS is set, then the client indicates in the second field, mdlh_stripe_cnt_nameops, the preferred stripe count for name-based operations. If MD4_CARE_STRIPE_CNT_DIRRDOPS is set, then the client indicates in the third field, mdlh_stripe_cnt_dirrdops, the preferred stripe count for directory-reading operations. If MD4_CARE_STRIPE_UNIT_SIZE is set, then the client indicates in the fourth field, mdlh_stripe_unit_size, the preferred stripe unit size for directory-reading operations.
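
As a rough illustration of how a client might populate the hint, the following Python sketch sets the care flags and packs the four fields in XDR order (three 32-bit unsigned integers followed by one 64-bit unsigned integer, all big-endian). The function name is hypothetical, and sending zero for fields the client does not care about is an assumption, not a requirement of this document.

```python
import struct

# Flag values from md_layout_hint_care4.
MD4_CARE_STRIPE_UNIT_SIZE = 0x040
MD4_CARE_STRIPE_CNT_NAMEOPS = 0x080
MD4_CARE_STRIPE_CNT_DIRRDOPS = 0x100


def encode_md_layouthint4(nameops_cnt=None, dirrdops_cnt=None,
                          unit_size=None):
    """Return the XDR bytes of an md_layouthint4 for loh_body."""
    care = 0
    if nameops_cnt is not None:
        care |= MD4_CARE_STRIPE_CNT_NAMEOPS
    if dirrdops_cnt is not None:
        care |= MD4_CARE_STRIPE_CNT_DIRRDOPS
    if unit_size is not None:
        care |= MD4_CARE_STRIPE_UNIT_SIZE
    # Assumption: fields without a care flag are sent as zero.
    return struct.pack(
        ">IIIQ",
        care,
        nameops_cnt if nameops_cnt is not None else 0,
        dirrdops_cnt if dirrdops_cnt is not None else 0,
        unit_size if unit_size is not None else 0,
    )


# A client that only cares about the name-based stripe count.
body = encode_md_layouthint4(nameops_cnt=4)
print(len(body), body.hex())
```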



4.5.  Definition of the loc_body Field of the layout_content4 Data Type



///  struct md_layout_fhonly {
///    deviceid4   mdlf_devid;
///    nfs_fh4     mdlf_fh<1>;
///  };
///
///  struct md_layout_namebased {
///    deviceid4   mdln_devid;
///    uint32_t    mdln_namebased_alg;
///    uint32_t    mdln_first_index;
///    nfs_fh4     mdln_fh_list<>;
///  };
///
///  union md_layout_dirread_fhlist
///        switch (bool mdldf_use_namebased) {
///    case TRUE:
///      void;
///    case FALSE:
///      nfs_fh4     mdldf_fh_list<>;
///  };
///
///  struct md_layout_dirread {
///    deviceid4                mdld_devid;
///    nfs_cookie4              mdld_first_cookie;
///    nfs_cookie4              mdld_unit_size;
///    uint32_t                 mdld_first_index;
///    md_layout_dirread_fhlist mdld_fh_list;
///  };
///
///  struct md_layout4 {
///    md_layout_fhonly    mdl_fhops_layout<1>;
///    md_layout_namebased mdl_nameops_layout<1>;
///    md_layout_dirread   mdl_dirrdops_layout_segments<>;
///  };

 Figure 3 

The reply to a successful LAYOUTGET request MUST contain exactly one element in logr_layout. The element contains the metadata layout. The metadata layout consists of three variable length arrays. At least one of the arrays MUST be of non-zero length.
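
A client could check these constraints on receipt of a layout. The sketch below is a hedged illustration in Python: the function names are hypothetical, and each XDR array is modeled as a plain list whose element contents are not inspected.

```python
def check_md_layout4(fhops_layout, nameops_layout, dirrdops_segments):
    """Check the array-length rules for md_layout4.

    The arguments stand in for mdl_fhops_layout, mdl_nameops_layout, and
    mdl_dirrdops_layout_segments respectively.
    """
    # The XDR bound <1> caps the first two arrays at one element each.
    if len(fhops_layout) > 1:
        raise ValueError("mdl_fhops_layout holds at most one element")
    if len(nameops_layout) > 1:
        raise ValueError("mdl_nameops_layout holds at most one element")
    # At least one of the three arrays MUST be of non-zero length.
    if not (fhops_layout or nameops_layout or dirrdops_segments):
        raise ValueError("all three layout arrays are empty")


def check_layoutget_reply(logr_layout):
    """logr_layout is modeled as a list of (fhops, nameops, dirrd) tuples."""
    # A successful reply MUST carry exactly one element in logr_layout.
    if len(logr_layout) != 1:
        raise ValueError("logr_layout must contain exactly one element")
    check_md_layout4(*logr_layout[0])


# A layout that covers only filehandle-only operations passes the check.
check_layoutget_reply([([{"mdlf_devid": b"\x00" * 16}], [], [])])
```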



4.6.  Definition of the lou_body Field of the layoutupdate4 Data Type



///  %/*
///  % * LAYOUT4_METADATA.
///  % * Encoded in the lou_body field of type layoutupdate4:
///  % *      Nothing. lou_body is a zero length array of octets.
///  % */
///  %
 Figure 6 

The LAYOUT4_METADATA layout type has no content for the lou_body field of the layoutupdate4 data type.



4.7.  Storage Access Protocols

The LAYOUT4_METADATA layout type uses NFSv4.1 operations (and potentially, operations of higher minor versions of NFSv4, subject to the definition of a minor version of NFSv4) to access striped metadata. The LAYOUT4_METADATA layout type does not affect access to storage devices. Thus a client might be able to obtain both a LAYOUT4_METADATA layout and a non-LAYOUT4_METADATA layout (e.g., LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or LAYOUT4_BLOCK_VOLUME) on the same regular file. Of course, for a non-regular file, a pNFS client will be unable to get layouts of types LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS, or LAYOUT4_BLOCK_VOLUME.



4.8.  Revocation of Layouts

Servers MAY revoke layouts of type LAYOUT4_METADATA. A client detects that a layout has been revoked when an operation is rejected with NFS4ERR_PNFS_NO_LAYOUT. In NFSv4.1, the error NFS4ERR_PNFS_NO_LAYOUT could be returned only by READ and WRITE. When the server returns a layout of type LAYOUT4_METADATA, the set of operations that can return NFS4ERR_PNFS_NO_LAYOUT is: ACCESS, CLOSE, COMMIT, CREATE, DELEGRETURN, GETATTR, LINK, LOCK, LOCKT, LOCKU, LOOKUP, LOOKUPP, NVERIFY, OPEN, OPENATTR, OPEN_DOWNGRADE, READ, READDIR, READLINK, REMOVE, RENAME, SECINFO, SETATTR, VERIFY, WRITE, GET_DIR_DELEGATION, SECINFO_NO_NAME, and WANT_DELEGATION.
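
A client-side check for revocation might look like the following Python sketch. The function name and the representation of operations and status codes as strings are assumptions made for illustration; what the client does after detecting revocation is outside the scope of this sketch.

```python
# Operations that can return NFS4ERR_PNFS_NO_LAYOUT under a
# LAYOUT4_METADATA layout, as listed above.
NO_LAYOUT_OPS = frozenset([
    "ACCESS", "CLOSE", "COMMIT", "CREATE", "DELEGRETURN", "GETATTR",
    "LINK", "LOCK", "LOCKT", "LOCKU", "LOOKUP", "LOOKUPP", "NVERIFY",
    "OPEN", "OPENATTR", "OPEN_DOWNGRADE", "READ", "READDIR", "READLINK",
    "REMOVE", "RENAME", "SECINFO", "SETATTR", "VERIFY", "WRITE",
    "GET_DIR_DELEGATION", "SECINFO_NO_NAME", "WANT_DELEGATION",
])


def layout_revoked(op_name, status):
    """Return True if this reply signals a revoked metadata layout."""
    return status == "NFS4ERR_PNFS_NO_LAYOUT" and op_name in NO_LAYOUT_OPS


print(layout_revoked("LOOKUP", "NFS4ERR_PNFS_NO_LAYOUT"))
print(layout_revoked("LOOKUP", "NFS4_OK"))
```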



4.9.  Stateids

The pNFS specification for LAYOUT4_NFSV4_1_FILES states that data servers MUST be aware of the stateids granted by the MDS so that the stateids passed to READ and WRITE can be properly validated. This requirement extends to the LAYOUT4_METADATA layout type: the L-MDS MUST be aware of any non-layout stateids granted by the I-MDS, if and only if the client is in contact with the L-MDS under direction of a metadata layout returned by the I-MDS, and the I-MDS has not recalled or revoked that layout. In addition, because an L-MDS can accept operations like OPEN and LOCK that create or modify stateids, the I-MDS MUST be aware of stateids that an L-MDS has returned to a client, if and only if the I-MDS granted the client a metadata layout that directed the client to the L-MDS.

In some cases, one L-MDS MUST be aware of a stateid generated by another L-MDS. For example, a client can obtain a stateid from the L-MDS serving as the destination of name-based operations, which include OPEN. However, operations that use the stateid will be filehandle-only operations, and the L-MDS the OPEN operation is sent to might differ from the L-MDS the LOCK operation for the same target file is sent to.



4.10.  Lease Terms

Any state the client obtains from an I-MDS or L-MDS is guaranteed to last for an interval as long as the maximum of the lease_time attribute of the I-MDS and that of any L-MDS the client is directed to as the result of a metadata layout. The client has a lease for each client ID it has with an I-MDS or L-MDS, and each lease MUST be renewed separately for each client ID.
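
The interval described above is simply the largest lease_time among the servers involved, as this small Python sketch shows. The function name and the sample values (in seconds) are hypothetical.

```python
def guaranteed_state_interval(imds_lease_time, lmds_lease_times):
    """Return the maximum lease_time across the I-MDS and its L-MDSes.

    imds_lease_time is the lease_time attribute of the I-MDS;
    lmds_lease_times holds the lease_time of every L-MDS the client was
    directed to by a metadata layout.
    """
    return max([imds_lease_time] + list(lmds_lease_times))


# I-MDS advertises 90 seconds; two L-MDSes advertise 60 and 120 seconds.
print(guaranteed_state_interval(90, [60, 120]))
```

The computed interval does not replace per-lease renewal: each client ID still has its own lease, which the client renews separately.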



4.11.  Layout Operations Sent to an L-MDS

An L-MDS MAY allow a LAYOUTGET operation. One reason the L-MDS might allow a LAYOUTGET operation is to allow hierarchical striping. For example, for name-based operations, the pNFS server might use a radix tree (which the field mdln_namebased_alg would indicate). The first four bytes of the component name would be combined to form a 32-bit stripe_unit_number. Once the client contacted the L-MDS, it would repeat the algorithm on the second four bytes of the component name, and so on until the component name was exhausted.
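
The radix-tree example can be sketched in Python as follows. This document does not define how the four bytes are combined or how a short final chunk is handled, so the big-endian byte order, the zero padding, the UTF-8 encoding of the name, and the function name are all assumptions made for illustration.

```python
def stripe_unit_numbers(component):
    """Yield one 32-bit stripe_unit_number per level of the hierarchy.

    Each successive four-byte chunk of the component name produces the
    number used at the next L-MDS in the walk.
    """
    data = component.encode("utf-8")
    for i in range(0, len(data), 4):
        # Assumption: a short final chunk is padded with zero bytes.
        chunk = data[i:i + 4].ljust(4, b"\0")
        # Assumption: the bytes are combined in big-endian order.
        yield int.from_bytes(chunk, "big")


# "abcdefg" needs two levels: "abcd", then "efg" padded to four bytes.
for level, number in enumerate(stripe_unit_numbers("abcdefg")):
    print(level, hex(number))
```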

Once an L-MDS grants a layout, the client MUST use only the L-MDS that granted the layout to send LAYOUTUPDATE, LAYOUTCOMMIT, and LAYOUTRETURN.



4.12.  Filehandles in Metadata Layouts

The filehandles returned in a metadata layout are subject to becoming stale at any time. The L-MDS SHOULD NOT return NFS4ERR_STALE unless the I-MDS has recalled or revoked the corresponding layout.



4.13.  READ and WRITE Operations

READ and WRITE are filehandle-only operations, and thus the pNFS client SHOULD attempt to obtain a non-metadata layout for a regular file. If it cannot, then it MAY use the metadata layout to send READ and WRITE operations to an L-MDS. An L-MDS MUST accept a READ or WRITE operation if the layout the I-MDS returned to the client included a filehandle-only layout.



4.14.  Recovery

[Comment.1] (it is likely this section will follow that of the files layout type specified in the NFSv4.1 specification.)



4.14.1.  Failure and Restart of Client

TBD



4.14.2.  Failure and Restart of Server

TBD



4.14.3.  Failure and Restart of Storage Device

TBD



5.  Negotiation

A pNFS client sends a GETATTR operation for the attribute fs_layout_type. If the reply contains the metadata layout type, then metadata striping is supported, subject to further verification by a LAYOUTGET operation. If not, the client cannot use metadata striping.
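
The client-side test reduces to a membership check on the returned attribute, as in this Python sketch. Because the value of LAYOUT4_METADATA is still TBD1, the sketch takes it as a parameter; the value 0x80000001 used in the example is a made-up placeholder, and the function name is hypothetical.

```python
# Layout type values defined by the NFSv4.1 specification.
LAYOUT4_NFSV4_1_FILES = 1
LAYOUT4_OSD2_OBJECTS = 2
LAYOUT4_BLOCK_VOLUME = 3


def may_use_metadata_striping(fs_layout_type, layout4_metadata):
    """Return True if the server advertises the metadata layout type.

    fs_layout_type is the list of layout types from the GETATTR reply.
    A True result is still subject to verification by LAYOUTGET.
    """
    return layout4_metadata in fs_layout_type


# Placeholder only: the real value of LAYOUT4_METADATA is TBD1.
PLACEHOLDER_METADATA = 0x80000001
print(may_use_metadata_striping(
    [LAYOUT4_NFSV4_1_FILES, PLACEHOLDER_METADATA], PLACEHOLDER_METADATA))
print(may_use_metadata_striping(
    [LAYOUT4_NFSV4_1_FILES], PLACEHOLDER_METADATA))
```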



6.  Operational Recommendation for Deployment

Deploy the metadata striping layout when it is anticipated that the workload will involve a high fraction of non-I/O operations on filehandles.



7.  Acknowledgements

Brent Welch had the idea of returning a separate device ID for filehandle-only operations in the metadata layout. Pranoop Erasani, Dave Noveck, and Richard Jernigan provided valuable feedback.



8.  Security Considerations

The security considerations of Section 13.12 of [2] which are specific to data servers apply to L-MDSes. In addition, each L-MDS and client are, respectively, a complete NFSv4.1 server and client, and so the security considerations of [2] apply to any client or server using the metadata layout type.



9.  IANA Considerations

This specification requires an addition to the Layout Types registry described in Section 22.4 of [2]. The five fields added to the registry are:

  1. Name of layout type: LAYOUT4_METADATA
  2. Value of layout type: TBD1.
  3. Standards Track RFC that describes this layout: RFCTBD2, which is the RFC of this document.
  4. How the RFC Introduces the specification: L.
  5. Minor versions of NFSv4 that can use the layout type: 1.

This specification requires the creation of a registry of hash algorithms for supporting the field mdln_namebased_alg. Details TBD.



10.  Normative References

[1] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” RFC 2119, March 1997.
[2] Shepler, S., Eisler, M., and D. Noveck, “NFS Version 4 Minor Version 1,” draft-ietf-nfsv4-minorversion1-26 (work in progress), Sep 2008.


Author's Address

  Mike Eisler
  NetApp
  5765 Chase Point Circle
  Colorado Springs, CO 80919
  US
Phone:  +1-719-599-9026
Email:  mike@eisler.com