NFSv4.2 open items (Haynes)
---------------------------
State of the Union: where things are stuck.
- FedFS is stuck on 3530bis.
- Labeled NFS is stuck on having time to work on it and on implementation experience.
  - No change_attribute in the Linux implementation; will it be used?
- NFSv4.2 is stuck on RPCSEC_GSSv3 and implementation experience.
- NFSv4.3 is stuck on doing 4.2.
We have to do work if we want to proceed. If we want to keep meeting and move forward, we need to remove these roadblocks.

3530bis (Haynes)
---------------------------
Didn't pass the IESG due to internationalization.
D.NOVECK: In writing the 3530 internationalization, I was trying to do something real and that met their requirements.
Yank? Pull it out, and put in a statement that says it doesn't work and that we will produce a new document.
Crank? Tom Haynes does not want to do it.
D.BLACK: Who is the victim - probably me. Trying to unscramble iSCSI; NFS is next. What didn't work WRT the IESG? Discuss this offline.
HAYNES: Just go ahead and yank.
D.BLACK: 3530bis patches stringprep.
D.NOVECK: Stringprep is a framework - but in 3530 we said that in order to be flexible we don't have to do everything. Flexibility and interoperability don't go together.
D.BLACK: 3530bis is blocking FedFS. Cashing in favors with colleagues who have internationalization experience is easier with a second doc.
D.BLACK: Which AD told you yanking would work?
HAYNES: This is the only major item; everything else is a minor nit. Implementations out there are broken; the only way to fix them is a new internationalization doc.
D.BLACK: Yank it into a separate doc and crank it; I'll find the time somehow. Probably OK to stick with stringprep - which has its problems but does interoperate.
D.NOVECK: A separate doc is good - the same doc will fix 5661 as well. One IESG complaint: only David Black reviewed it.
D.BLACK: Yes, just my review was not good enough - a separate doc will allow me to go get other reviewers. For 5661, let's lazy-evaluate. 3530bis is running code - let's get this written and then look at 5661.
D.NOVECK: Problem with stringprep - it doesn't support Unicode 6.
D.BLACK: We can still simply do stringprep.
HAYNES: fs attributes pulled back from 5661 don't fit - take them out; just go read 5661.

ACL discussion (Haynes)
---------------------------
draft-ietf-nfsv4-acl-mapping-05
Revise this document - map mode bits. Do not add it to 3530bis.
Anyone else have customers who reference this draft?

NFSv4.2 (Haynes)
---------------------------
RPCSEC_GSSv3: Nico is not working on it - who is? Holding up v4.2 server-side copy and labeled NFS.
MARK: Spencer says RPCSEC_GSSv3 should be removed.
PRANOOP: What is the level of dependency for labeled NFS?
SORIN: The labeled NFS people say the dependencies are crucial.
EISLER: The issues with labeled NFS and server-side copy are hard to fix without throwing a crypto error; very hard to move this forward. Server-side copy could finesse it by changing the copy args to communicate a shared secret (see the sketch at the end of this section). I don't see an alternative for labeled NFS.
SORIN: Two questions. Maybe we do enough for server-side copy to move 4.2 ahead, and leave labeled NFS for later. Prefer to have 4.2 ahead.
HAYNES: Will verify the situation and put together options for this.
4.2: words + code. MUST or SHOULD? Do we need implementations or just a protocol specification?
MARK: Spencer says just a protocol definition is sufficient.
LEVER: I feel that we should have an implementation. Linux labeled NFS is not a full enough implementation to verify the protocol that we have written.
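A rough sketch of what the copy-argument shared secret Eisler suggested could look like; none of these names come from the NFSv4.2 drafts, they are placeholders for the idea only:

    /* Hypothetical only: the source server mints a secret and returns it to
     * the client (e.g. in a COPY_NOTIFY-style result); the client relays it
     * in the inter-server COPY arguments so the destination can authenticate
     * itself to the source without needing RPCSEC_GSSv3. */
    #include <stdint.h>

    #define COPY_SECRET_LEN 32

    struct copy_shared_secret {
        uint32_t css_lifetime;                /* seconds the secret stays valid */
        uint8_t  css_secret[COPY_SECRET_LEN]; /* opaque value compared verbatim */
    };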
Flexible File Layouts (Halevy)
--------------------------------
The new draft is essentially a rewrite of draft-bhalevy-nfs-obj. Changes reflect feedback from IETF 85 and implementation experience.
Motivation:
1) Provide per-file striping patterns based on pnfs-obj (RFC 5664): arbitrary data placement policies (load balancing, life-cycle management, etc.)
2) Support legacy NFS data servers - standalone simple data servers a la object storage devices.
3) Support existing clustered file systems (Ceph, GlusterFS) that use NFS to access their gateway nodes.
Changes since the previous draft:
- Removed support for T10-OSD - capture the spirit of the object layout.
- Clarified the state, security, and locking models - the NFSv3 solution lowers the v4.x security level.
- Removed nested striping.
- Sparse/dense striping patterns compatible with the NFSv4.1 files layout. Use the open stateid.
State models: standalone v3 servers (state model 1), standalone v4.x servers (state model 2), clustered v4.1 (state model 3).
NOVECK: Is the back-end protocol in the doc? No.

State model 1: standalone NFSv3
- No opens; locks are advisory, against the MDS only.
- Use an arbitrary uid/gid as the file owner - owner rw, group ro.
- Access creds are used to authorize the client and enforce security policy; the client's job is to use the ACCESS call to authorize users for each open owner and new layout.
- The layout is global to the client.
- The arbitrary uid/gid is used only on the data server - it doesn't affect the MDS.
(A rough illustration of this model follows this section.)

State model 2: standalone NFSv4.0
- OPEN/DELEGATION: a DS stateid comes with the layout, to be used for I/O.
- Further opens do not result in a new stateid - can cause false fencing - and can get a new layout.
- On OPEN_DOWNGRADE the client needs to refresh the layout.
- LOCK is advisory only, via the MDS; e.g. the lock stateid is not used at the DS.
- Security and fencing: use real user/group ACLs; MDS ACL changes are reflected on the DS; close the file to fence - removes the stateid.

State model 3: clustered NFSv4.x servers
- Compatible with the NFSv4.1 files layout.

TODO: need more technical discussion on the WG mailing list; need more interested parties to implement the various state models; want to get on the WG charter - what is left to do?
Future work: a standard back-end / standard control protocol - we want NFSv4.x/pNFS to be as ubiquitous as v3, need to lower the bar - currently proprietary.
NOVECK: Have you done such an implementation of a back-end protocol?
BENNY: No - just worked on the protocol requirements.
HAYNES: I know of no vendor that ships such a back-end protocol - our clustered file system performs the task. There is another vendor that has a control protocol, but they are not shipping.
BENNY: Could create an informational document that has the requirements for a control protocol.
HAYNES: Could not mandate that vendors use such a protocol.
PRANOOP: We have talked about this before - define requirements for the back-end protocol. We don't think of our implementation as a back-end protocol.
BENNY: We want to support a mix-and-match of vendor implementations.
HAYNES: Need to have a set of requirements. Extract requirements from the 5661 files layout to get a requirements doc started.
Future work: client-based copy-on-write, a la the block layout, RFC 5663.
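A rough illustration (not the draft's XDR) of state model 1 as presented; the field names are assumptions made up for this sketch:

    /* The layout hands the client a synthetic uid/gid per standalone NFSv3
     * data server; the file on the DS is owned by that uid (mode: owner rw,
     * group ro), real credentials stay at the MDS, and the client is expected
     * to have issued ACCESS against the MDS for each open owner before using
     * the layout.  The layout itself is global to the client. */
    #include <stdint.h>

    struct ff_v3_data_server {
        char     *ds_netaddr;   /* address of the standalone NFSv3 data server */
        uint32_t  ds_fh_len;
        uint8_t  *ds_fh;        /* filehandle of the stripe's file on that DS  */
        uint32_t  ds_uid;       /* synthetic owner used only at the DS         */
        uint32_t  ds_gid;       /* synthetic group used only at the DS         */
    };

    struct ff_v3_layout {
        uint32_t                  ffl_stripe_unit;
        uint32_t                  ffl_ds_count;
        struct ff_v3_data_server *ffl_ds;     /* one entry per stripe           */
    };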
FedFS issues (Lever)
---------------------------
Status: the admin and NSDB drafts are waiting behind 3530bis.
Submitted draft-dcel-nfsv4-fedfs-security-addendum - GSSAPI support for the admin protocol, including fencing the NSDB.
Future work:
- Multi-domain authentication (from Andy).
- Best-practices SMB support.
- NSDB schema updates.
- FedFS and Samba junction co-existence.
- Pre-populating /nfs4 while avoiding the "ls /afs" problem - an auto-discovery mechanism for populating /nfs4.
- Improve LDAP use (see the sketch after this section): specify a SASL/GSSAPI NSDB security mode; replace namingContext updates with LDAP search; LDAP referrals and update-referral best practices.
- New admin operations: manage NSDB connection parameters, fence the NSDB, list NSDBs, etc.; replication management operations; common junction management (NFS, Samba, FedFS).
- Which location do you put in the NSDB? Not reachable by every client, multi-homed servers, IPv6 link-local (non-routable). Quickly pick a workable location; fileservers could sort the locations...
NOVECK: Bugs in the specifications or bugs in the protocol? Spec bugs can be fixed.
LEVER: Do we want to open a v2 of the admin protocol - questions like this.
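A sketch of the LDAP direction listed above - a SASL/GSSAPI bind plus an LDAP search instead of namingContext updates - as a fileserver might resolve a junction's FSN against an NSDB. The schema names (fedfsFsl, fedfsFsnUuid, fedfsNfsURI) are from memory and should be checked against the NSDB spec; error handling is minimal:

    #include <stdio.h>
    #include <ldap.h>
    #include <sasl/sasl.h>

    /* GSSAPI normally needs no prompts; answer anything asked with "". */
    static int nsdb_interact(LDAP *ld, unsigned flags, void *defaults, void *in)
    {
        sasl_interact_t *ia;
        for (ia = in; ia->id != SASL_CB_LIST_END; ia++) {
            ia->result = "";
            ia->len = 0;
        }
        return LDAP_SUCCESS;
    }

    int nsdb_lookup_fsn(const char *nsdb_uri, const char *nce, const char *fsn_uuid)
    {
        LDAP *ld = NULL;
        LDAPMessage *res = NULL;
        char filter[256];
        char *attrs[] = { "fedfsNfsURI", NULL };   /* assumed attribute name */
        int rc;

        if (ldap_initialize(&ld, nsdb_uri) != LDAP_SUCCESS)
            return -1;

        /* SASL/GSSAPI security mode instead of an anonymous or simple bind. */
        rc = ldap_sasl_interactive_bind_s(ld, NULL, "GSSAPI", NULL, NULL,
                                          LDAP_SASL_QUIET, nsdb_interact, NULL);
        if (rc != LDAP_SUCCESS)
            goto out;

        /* Search under the NCE for the FSL records of this FSN rather than
         * updating or walking namingContexts on the root DSE. */
        snprintf(filter, sizeof(filter),
                 "(&(objectClass=fedfsFsl)(fedfsFsnUuid=%s))", fsn_uuid);
        rc = ldap_search_ext_s(ld, nce, LDAP_SCOPE_SUBTREE, filter, attrs, 0,
                               NULL, NULL, NULL, LDAP_NO_LIMIT, &res);
        if (rc == LDAP_SUCCESS)
            printf("found %d location(s)\n", ldap_count_entries(ld, res));

        ldap_msgfree(res);
    out:
        ldap_unbind_ext_s(ld, NULL, NULL);
        return rc == LDAP_SUCCESS ? 0 : -1;
    }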
pNFS Lustre layout (Faibish)
---------------------------
1) Draft 05 - changes from 04:
- Shim layer on top of the Lustre client and server; diagram - Whamcloud git tree and kernel git tree references.
- Continue to use the Lustre data and control protocol - adding a pNFS-Lustre layout driver; a generic pNFS client sits on top of Lustre clients and talks to a generic pNFS server.
NOVECK: NFSv4.1+pNFS protocol: what is the transport? The T10 object protocol?
SORIN: No, the modified version of T10 that Lustre is using.
BENNY: Proposing to add an NFS gateway in front of the Lustre client? Look into the flexible layout to see if it fills your need.
SORIN: Performance is the reason. RDMA doesn't scale to tens of thousands of nodes; this does.
BENNY: I don't see RDMA on this diagram.
SORIN: Just a first draft of moving to this idea - don't see how the flexible layout helps - it uses the NFS protocol, which is very solid but not high performance.
Wrapper: Lustre MD inside pNFS MD. The Lustre client is ready for Linux 3.11.
The biggest problem with this draft is basing it on the Linux implementation. Moving in a different direction - a shim layer on the server to translate from the Lustre layout to pNFS MD, and a shim layer on the client to translate pNFS MD to the Lustre layout. Facilitates and simplifies implementation on OSes other than Linux - BSD, Solaris - though the majority of Lustre users are on Linux.
2) LNET-based layout. LNET is the protocol used by Lustre; better to use the Lustre transport than to translate.
- Remove the Lustre client/server layout and replace it with a pNFS client/server layout.
NOVECK: What do you use now? A modified version of T10? Different from file, objects, different from LNET? Why does Lustre need to use something specific to Lustre?
SORIN: Trying to brainstorm.
NOVECK: Then explain it!
SORIN: Lustre is using a very thin RDMA.
NOVECK: What form - link layer?
SORIN: The RDMA draft has multiple private class implementations.
BLACK: Should be a class switch that you plug RDMA into. What Sorin is after: can we cut Lustre down to use just the framework, speak file over RDMA, and bring in the smallest bit of Lustre?
NOVECK: If that is the game, that is not what the LNET slide says.
BLACK: LNET is using RDMA to do file - a thin layer on top of RDMA.
NOVECK: If that is the case, it should be usable by other stuff, and that is good :)
MATT: So this is a restart of agile RDMA?
BLACK: Groan, TBD. The game is to pick up LNET. Retargetable - what can you retarget LNET to? pNFS on top of LNET.
MATT: Restart, not retarget - a more general-purpose RDMA layer.
HAYNES: Take this offline.
Define simple layout information in Lustre MD; do not wrap Lustre MD inside, but use pNFS attributes. LNET is based on RDMA but has some overhead - so it is not clear that we want LNET, due to locking and components that slow it down; so maybe HW support...
Still use the Lustre OSS - want to support Lustre-to-Lustre clients, a separate MDS server, and a pNFS Lustre client.
Uses a control protocol from the pNFS MDS to the OSS - maybe too heavy - do we want a control protocol?
Future direction dependencies - see slides. No official RDMA verbs draft; 273 pages; write a new verbs draft.
BENNY: A verbs-layer API to transfer the protocol - need to define semantics.
BLACK: Really a functional interface draft. A functional API draft is tough.
BENNY: Standardize the LNET protocol?
SORIN: No, too heavy - something thinner. Define pNFS.

Coherent data caching for NFS (Eshel)
---------------------------
Provide POSIX read/write semantics - better than NFS close-to-open - allow apps to run unchanged over NFS.
Improve NFS client caching - allow byte-range caching to reduce data revalidations - e.g. byte-range delegations - a more coherent cache.
NOVECK: More than coherent?
ESHEL: No, coherent.
Use cases:
- Apps that require strong caching semantics.
- HPC apps that work on segments of very big files (just byte-range delegations).
ESHEL: Didn't see that a byte-range delegation draft was already proposed - will start with this. Want the group's input. Do we need this? Need POSIX semantics? Need better caching? Both?
POSIX semantics requires that a read which can be proved to occur after a write has returned returns the new data. Today, apps implement this via locks and direct I/O - expensive and intrusive (a small example of that pattern follows this section). How to achieve this? Tokens.
Brent quote (partial): if you are interested in more efficient cache consistency protocols, you want a callback scheme where the message overhead is proportional to the sharing that actually occurs. (More in the quote; see slides.)
NOVECK: Delegations are a callback scheme.
NOVECK: If you can't get a delegation, then do byte-range locking. Byte-range delegations (Trond et al., 2006).
NOVECK: In 4.1 you can ask for delegations.
BENNY: A few years back we wanted to call them recallable locks.
NOVECK: Let's decide what we're doing, then decide what to call it!
When to use byte-range delegations? NFSv4.2 hints.
NOVECK: Don't have to change the application; the client gets delegations...
MATT: 1) Fix in mind what consistency models we are talking about. 2) Make a clear distinction between the consistency and validation models. 3) Would like to not just focus on legacy apps - think of new apps as well.
EISLER: Agree with all that MATT said. Given cheaper and faster networking, POSIX just does not fit - POSIX is for when data and computation are on the same machine, which is not true anymore. Difficult to achieve POSIX; not convinced this is necessary. Append, for example, is very hard to do, just unimplementable. A very difficult problem, and I don't see the demand. Full POSIX semantics - the speed of light just gets in the way. As for a byte-range coherency model, it is not needed either - RESTful object access, etc.
BENNY: Full POSIX semantics - one more open question: what happens to a file that is removed while still open? Do we detach the remove from the destroy?
MARK: If we do POSIX semantics, then the app depends on it and we need to do it all the time. Nice to have a way to control it. When an app asks for a byte-range token, it needs to get it...
ESHEL: We have experience with a token model that we can move out to the NFS protocol. Start with Trond's byte-range-delegation draft...
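A minimal example of the lock-around-I/O workaround referred to above - what applications do today to get read-after-write visibility over NFS without protocol help (direct I/O being the other option); the function name and sizes are illustrative:

    #include <fcntl.h>
    #include <unistd.h>

    /* Read [off, off+len) and see writes completed by other clients: taking a
     * POSIX byte-range lock forces the NFS client to flush and revalidate its
     * cached data for the file, which is exactly the "expensive and intrusive"
     * cost the byte-range delegation proposal wants to avoid. */
    static ssize_t coherent_read(const char *path, off_t off, void *buf, size_t len)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct flock fl = {
            .l_type = F_RDLCK, .l_whence = SEEK_SET, .l_start = off, .l_len = len,
        };
        if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* lock: cache revalidation point */
            close(fd);
            return -1;
        }

        ssize_t n = pread(fd, buf, len, off);

        fl.l_type = F_UNLCK;                  /* a writer's unlock flushes its dirty data */
        fcntl(fd, F_SETLKW, &fl);
        close(fd);
        return n;
    }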
End-to-end Integrity (Lever)
---------------------------
Corruption can occur during transmission, while storing data at rest, and during retrieval. The goal is not to make corruption impossible but to detect it. Data is copied from stage to stage, and at each copy it can be corrupted.
End-to-end verification: detect this at each step - between NFS clients and the block storage on NFS servers.
Use cases: virtual disk devices in a hypervisor, and block-based database backends.
Open questions: NFSv4 reboot recovery, DS failover, server-to-server copy and migration, read-ahead (async), unstable writes (async).
BENNY: In-place updates of unaligned writes?
LEVER: These happen on aligned pieces of data.
BENNY: What about unaligned?
LEVER: Not possible.
BENNY: Additive checksums?
BLACK: Additive checksums have a lousy track record. If you want to write clever code, you can recalculate a CRC on the fly.
LEVER: Just to be clear, the proposed protocol allows you to do it - go there at your own peril.
BENNY: Practical reasons: a virtual disk with 4k blocks will see 512-byte writes.
LEVER: Not over NFS.
BENNY: Will see some.
BLACK: We are going down a rat hole. The whole point of end-to-end integrity is the use cases: where do you apply this? The block world only applies it in specific cases. A selective mechanism, applied where it works - if you have 512-byte I/O, don't try this!
LEVER: The reason why I said POSIX - not going there - David sums up why.
New per-fsid attribute to return an array of protection types. Extend the NFSv4.2 data_content4.
SORIN: Who will check correctness?
LEVER: Not defined in this protocol - it only defines how to transmit the info.
Multi-server considerations - when MUST vs. MAY particular protection types be supported?
BENNY: Do you have an implementation? A system call API in mind?
LEVER: No, and there is no POSIX interface for this feature, which is why we need to carefully select use cases - e.g. the hypervisor use case.
BENNY: Databases use the system call interface.
LEVER: Not necessarily - Oracle has its own system calls... exposes T10 to userspace...

Versioning Model (Noveck)
---------------------------
Does it help with fixing protocol bugs? With adding features?
No adjustments proposed now - just why NFSv4 has minor versioning: NFSv5 is frightening!
v4.1 minor versioning worked, but we didn't follow the intended paradigm:
- Add features only - now can only be done after the mandatory v4.1 features.
- Fix small protocol mistakes - specification mistakes => bis doc.
- The v4.1 barrier raises micro-versioning issues.
HAYNES: Did we fix v4.1 in v4.2: change attributes, report errors on layout? Can't move a feature from optional to mandatory...
XDR relationship: one XDR extends another (see the sketch after this section).
- The XDR-to-minor-version-number mapping prevents protocol bug fixes if done too late, e.g. putting v4.1 fs_locations_info into v4.0.
- An "add operation" option to back-fill a version would be optimal.
- Also: adding an attribute would help the internationalization issue.
HAYNES: There are implementations that use attributes above the specified range.
Feature batching is not the same as minor versioning:
- The current total view of minor versioning is feature batching.
- Works OK with related features, but is bad for unrelated features: no way to defer a feature or make it experimental; features are added speculatively...
ESHEL: Need to require that each protocol addition has an implementation.
BENNY: Modular architecture - all for it. Not dependent on the last feature's completion - let features mature over time. Client and server discover each other's features...
NOVECK: More modular => address interoperability.
BENNY: NFSv5 may be required.
PRANOOP: The challenge today: we need to qualify the same feature in v4.0/4.1/4.2; this will become much worse with independent features - the qualification matrix grows too large.
SPENCER: Rough consensus and running code - good enough. Discussion on the email alias.
Need greater energy on the currently defined work.
NOVECK: What is the difference between running code and requiring an implementation?!
HAYNES: Take this to the alias.
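One way to picture the "one XDR extends another" constraint: operation (and attribute) numbers form a single space that each minor version keeps extending, so a late fix cannot really be back-filled into an older version. The numbers below are only for illustration:

    /* Each minor version appends to the same operation-number space. */
    enum nfs_opnum_sketch {
        OP_LAST_ASSIGNED_BY_V40 = 39,  /* top of the range v4.0 ended with      */
        OP_FIRST_V41            = 40,  /* v4.1 took the next free numbers ...   */
        OP_LAST_V41             = 58,
        OP_FIRST_V42            = 59,  /* ... and v4.2 continues from there.    */

        /* A "fix for v4.0" added now cannot slot into 40..58; it has to take a
         * number above everything already shipped, so in practice it is a new
         * extension rather than a repair of the old version. */
        OP_HYPOTHETICAL_V40_BACKFILL = 73,
    };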
Metadata Striping (Benjamin)
---------------------------
Metastripe: pNFS for metadata, started by Mike Eisler - scale-out metadata making use of layouts.
Changes from the prior draft:
- Avoid the requirement to take metastripe layouts on regular files.
- Propose a stripe hint attribute per file; layouts on directories - just like the original metastripe.
- Pranoop has joined as co-author; other various changes.
New items:
- Layout subtyping so the client can ask for the correct layout type.
- Inode vs. dentry striping for metadata layouts: device lifecycle and LAYOUTRECALL as defined do not work for both inode and dentry metadata layouts.
Working on an implementation; will report progress.

-------------------------------
MARTIN: General thoughts - more energy in general: get the required stuff done first - we had a large delay. The bis doc needs a lot of work. There are a number of pending errata - it would be good if the WG looked at them to update it.
NOVECK: How do we find out about pending errata?
HAYNES: All in the errata tracker.
Time.