[nfsv4] 75th IETF
July 25-31, 2009, Stockholm, Sweden
WEDNESDAY, July 29, 2009, 0900-1015, Morning Session I

=== Agenda Bash and Note Well

No additions.

=== Server Side Copy Offload Operation (Lentini)

Client triggers a local copy on the NFS server, and perhaps an NFS server-to-server copy. Extensive feedback and comments on the NFSv4 WG mailing list starting in April 2009.

Today: to copy a file, the client reads it and then writes it back: wasteful.
Proposal: the client sends a copy operation to the server, which does the copy - saves network and client CPU resources.

Today: copying from one server to another has the same issues.
Proposal: the client arranges for the copy to take place: COPY_NOTIFY to the source server, COPY to the destination server. It is possible for the source and destination servers to communicate on a different network than the client-to-server networks.

Black: How does security work? (Covered later.)

Faibish: Is there a check that there is a connection between the source and destination servers?
Lentini: The reply to COPY_NOTIFY has a set of addresses, so the source server can tell the client which addresses the destination can use to reach it.

Lentini: Uses:
- File restore (snapshot).
- Virtualized environments - allows a hypervisor to snapshot, clone, and migrate a VM's storage.

File versus directory copy: the proposal is for file-based copy only.

Synchronous vs. asynchronous: the server decides.
- Client gets a completion message (synchronous case).
- Client gets an in-progress message with a handle to the server (asynchronous case); when the copy completes, the server sends a completion message.

Partial file copy is supported.

Space reservation: a bit to ensure there is enough destination space; prevents a destination server from thin provisioning a file that is dominated by zero-filled content.

Intra- and inter-server copies. In the two-server case, no server-to-server protocol is mandated - proprietary and standards-based protocols are both possible.

Selected pull instead of push (push would mean the source server has write permission on the destination server, which is less secure). The reply to COPY_NOTIFY has a list of URLs (addresses or services). These can be NFS, ftp, http, etc. URLs.

Black: What is the definition of an NFS URL, where is it specified, and how is the version negotiated?
Eisler: The NFS URL was defined in the 1990s (re: WebNFS) with RFCs. The version is negotiated the same way NFS negotiates versions today (in-band, via the major version in the ONC RPC header and the minor version in the COMPOUND header).

Security: requirements listed; two options: RPCSEC_GSSv3 (work in progress) or host-based (AUTH_SYS).

Black: Full delegation mechanism, or delegation restricted to the copy operation?
Lentini: Restricted to the NFS user credentials, the copy privilege, and the source file being copied.

Faibish: Is it assumed the client has access to both servers?
Eisler: Yes, the client must have access to both.

Pawlowski: Faibish to review the draft to see if there is a security concern on the copy operation from the perspective of client authorization.

Next steps:
- Get the RPCSEC_GSSv3 draft completed.
- Make copy offload part of the WG charter.
- Make it a piece of NFSv4.2.

Black: Is NFSv4.2 part of the WG charter?
Pawlowski: Nope! We should do this!

Pawlowski: COPY comes up every few years within the WG, and WG consensus has been to push back because of the lack of APIs that would use it. Inter-server COPY has also generated push back because NFS previously had no notion of a formal relationship between NFS servers. What has changed is that APIs exist, and with the introduction of pNFS, we now have an example of formal relationships between NFS servers.
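A minimal sketch of the client-orchestrated inter-server flow as presented, for illustration only: COPY_NOTIFY and COPY are the operations from the proposal, but the server objects and helper methods below are hypothetical, not the draft's actual XDR or API.

    # Hypothetical sketch of the proposed inter-server copy flow.
    # COPY_NOTIFY and COPY are the operations from the proposal; the
    # objects and method names are invented for illustration.

    def inter_server_copy(src_server, dst_server, src_fh, dst_fh):
        # 1. COPY_NOTIFY to the source server: authorize the destination
        #    to pull the file.  The reply carries the list of URLs
        #    (NFS, http, ftp, ...) the destination may pull from.
        notify = src_server.copy_notify(src_fh, destination=dst_server.address)

        # 2. COPY to the destination server: ask it to pull the data.
        res = dst_server.copy(sources=notify.urls, target=dst_fh,
                              offset=0, count=None)   # None = whole file

        # 3. The server chooses sync vs. async completion.
        if res.complete:                    # synchronous: done in one reply
            return res.bytes_copied
        while not res.complete:             # asynchronous: poll the handle
            res = dst_server.copy_status(res.handle)
        return res.bytes_copied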
=== Federated FS (Lentini)

Drafts:
- "Using DNS SRV to Specify a Global File Name Space with NFS version 4" --
- "Administration Protocol for Federated Filesystems" --
- "Requirements for Federated File Systems" --
- "NSDB Protocol for Federated Filesystems" --

Some future extensions (a root fileset FSL type for SMB, etc.) are not in the current drafts.

The requirements document has finished WG Last Call. Shepler owes a shepherding statement to Lars Eggert (Area Advisor); Shepler has committed to Monday, August 10, 2009.

Name space root discovery: defines a DNS record to find the root of the namespace. Ready for last call review in October.
Eggert: Ask the DNS directorate to review beforehand! Lentini to follow up.

Lentini would like to do another pass on the security considerations in the NSDB etc. spec.

Shepler (via Eisler): Is it possible to have WG last call on or before October?
Lentini: There is an NFSv4 bake-a-thon before October. If FedFS is going to be tested there, we would like to leverage that opportunity to shake out protocol details, so stick with October.

The NFSv4 referral mechanism is undefined in RFC 3530 (but fully defined in the NFSv4.1 spec) - there is an expired fs_locations draft. Should we resurrect the draft and include it in RFC3530bis?
Black: Lars - comment on the size of 3530?
Eggert: If you can structure it into multiple documents, that is preferred - but it is OK to put it in 3530, as we have an example that says even longer documents can be processed by the IETF and the RFC Editor.
Eisler: Actually, technically we don't have proof that NFSv4.1 can be processed by the RFC Editor (see the end of these minutes).

AI: Dave Noveck and Tom Haynes to review the course forward on resurrecting the referrals document and whether to fold it into RFC3530bis or not.

NFSv4 multi-domain access (re: Andy Adamson's I-D) will become important; this will be needed as NFS is deployed across domains.

In September, we should have feedback on implementation details of Federated Naming. Should hit last call in October though.
Pawlowski: By October, will we have testing experience that will impact the Last Call schedule?
Lentini: We won't know until September.

=== NFS operation over IPv4 and IPv6 (Alex RN)

The WG charter suggests updating the specifications for IPv6.

IPv6 allows two clients to have the same address on two private networks. The server needs a way to distinguish between these two clients.

IPv6 is a problem/issue for:
- multi-homing
- RPCBIND
- NLM
- NFSv4.0 client identification
- the reply cache
- dual- to single-stack transitions

Multi-homing: how are IPv4 and IPv6 different?
- Private address boundaries: an NFS client may boot while IPv6 is also being bootstrapped.
- IPv6 networks can potentially have the same subnet IDs for different private addresses.
- Server scope ambiguity.
- Need to store extra info for private addresses. Embedded addresses: a separate address prefix with the IPv4 address included.

RPCBIND issues:
- RPCBIND should always be used for IPv6 if the client or server supports it; it is preferred over portmap because the netid and universal address provide unambiguous information about whether a service is supported on IPv6 (see the encoding sketch below).
- Problem: advertising non-local info.
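For reference, the netid/universal-address encoding that makes RPCBIND unambiguous follows the rpcbind universal address convention (RFC 1833), with the "tcp"/"tcp6" netids used by RFC 3530; the function itself is an invented illustration.

    import ipaddress

    def to_netid_uaddr(ip: str, port: int) -> tuple[str, str]:
        """Encode an (IP, port) endpoint as an ONC RPC (netid, universal
        address) pair: the text address followed by the port as two
        dot-separated decimal octets (high byte, low byte)."""
        addr = ipaddress.ip_address(ip)
        netid = "tcp6" if addr.version == 6 else "tcp"
        return netid, f"{addr.compressed}.{port >> 8}.{port & 0xFF}"

    # The same NFS service (port 2049 = 0x0801) advertised over both
    # families gets distinct, unambiguous netid/uaddr pairs:
    print(to_netid_uaddr("192.0.2.7", 2049))    # ('tcp', '192.0.2.7.8.1')
    print(to_netid_uaddr("2001:db8::7", 2049))  # ('tcp6', '2001:db8::7.8.1')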
NLM issues:
- Assuming a dual-stack NLM client: if an IPv4 or IPv6 path goes down, locks can get 'stuck'; there is no clear-cut notion of an NLM client identifier, i.e., whether the IPv4 or the IPv6 NLM client owns the state.
- Solution: if the monitor name is the same as the client name, the server can contact the client over the IPv4 network rather than the IPv6 network.
- Client restart: the client can reconnect and re-establish locks.

NFSv4.0 client identification is needed. E.g., a client that gets a delegation through both the IPv4 and IPv6 address families causes revocation even though it is the same client. Solution: use the same client string across address families and send a SETCLIENTID whenever a new TCP connection to an NFSv4.0 server is created.

Reply cache and Exactly Once Semantics (EOS): if the transmit is from one address family and the retransmit is over another address family, the retransmit will not be recognized as a retry, will miss the reply cache, and will break EOS.
Eisler: What's the solution for NFSv3?
Alex: The solution is only for NFSv4. There is no solution being proposed for NFSv2 or NFSv3. For NFSv4.0, the solution is to include the client ID in the reply cache key, in addition to the xid (i.e., don't use the source IP address and port).
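A minimal sketch of the proposed reply cache keying (the class and all names are hypothetical illustrations, not from any draft):

    # Duplicate reply cache (DRC) keyed on (client identifier, xid)
    # instead of (source IP, source port, xid), so a retransmit arriving
    # over the other address family still hits the cache and EOS holds.

    class ReplyCache:
        def __init__(self):
            self._cache = {}          # (client_id, xid) -> cached reply

        def lookup(self, client_id: bytes, xid: int):
            """Return the cached reply for a retry, or None if new."""
            return self._cache.get((client_id, xid))

        def insert(self, client_id: bytes, xid: int, reply: bytes):
            self._cache[(client_id, xid)] = reply

    drc = ReplyCache()
    # First transmit over IPv4: miss, execute, cache the reply.
    if drc.lookup(b"client-A", 42) is None:
        drc.insert(b"client-A", 42, b"<reply>")
    # Retransmit over IPv6: same client ID and xid, so it hits the cache
    # even though the source address and port differ.
    assert drc.lookup(b"client-A", 42) == b"<reply>"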
Summary: expand the charter to include implementation advice for NFSv2, v3, v4, and v4.1 for IPv6.
Pawlowski: Why expand for NFSv[23]? I thought this was not solvable for NFSv[23]?
Eisler: Some things, like switching address families between retransmits, produce unsolvable problems. The WG could produce a document that advises implementations to strive to avoid such conditions, or at least advises them that when it has to be done, unsolvable problems will result.
Pawlowski: Nonetheless, I don't want the WG to spend lots of time on v2 and v3. There might be advice to give. We definitely need to nail v4 and v4.1.
Black: It would be useful to have advice on how existing v2 and v3 implementations will behave when confronted with IPv6.
Pawlowski: Lars, is this what a BCP would do?
Eggert: BCPs represent a class of document that is stronger than an Informational RFC. Recommendation for how to structure this: (1) If you want to run this in an IPv6-only environment, are there any changes needed to the NFS or ONC RPC protocols? If so, then there is standards-based work here, but that work would be in document(s) separate from a BCP. (2) If you are running in a mixed environment (IPv4 + IPv6 or IPv6 + IPv6, ...), what implementation advice is there? Have a BCP for the latter; the BCP would propose how to deal with this.

=== NFSv4 Multi-Domain Access (Adamson)

Work done at UMich CITI with Kevin Coffman. Proposes taking the draft under the charter. The first draft covers UID mappings; GID mappings are future work.

Definition of a name service: exports a unique UID number space. The focus is on LDAP as a name service.

David Black noted the issue of a multi-named user in DNS domains and NFS naming. E.g., if adamson.com, beepy.com, and eisler.com are hosted on a single machine, then this machine must pick one domain that is none of those three.
Eisler: This is important in light of Federated Name Spaces (multi-realm considerations).
Lentini: Pointed out that the work is applicable/important even if FedFS is not being used.
Faibish: Would like use cases for when this is important.
Eggert: Andy must run this by our LDAP advisor, Leif Johansson (listed on the NFSv4 WG charter).

=== Proposal for an NFSv4 extension to allow the use of NFS clients as pNFS data servers (Adamson, presenting for Trond)

Solutions like cachefs or NFS-server-side data replication are inefficient because the same data is cached multiple times and fetched multiple times from the same origin. The solution is a peer-to-peer approach in which pNFS clients act as data servers. The work being presented has been prototyped by NetApp for Linux and showed "correct" scaling.
Eisler: Define correct?
Adamson: Not sure what Trond meant, but likely linear.

Proposed extensions? Eisler and Black pretty much said the same thing:
- Eisler: This could easily be extended to write-through and write-back (Pawlowski: my sentiment exactly; Eisler made the case for sub-file caching in San Francisco).
- Black: Suggesting another place to dig. I see why you need weak delegation; I just want the delegation to be revocable. What if the client holds a layout? Leverage the pNFS data channel.
Eisler: A big believer in using layouts to extend NFS. Especially with write-through or write-back, this proposal could be highly disruptive.

=== Access checks and pNFS (Sorin Faibish)

Note: no accompanying I-D at this time.

Summary: The problem is not specific to a particular layout type. The proposal is to add a clarifying error code to unsnarl the errors that arise, and perhaps to describe the expected function of the MDS and DS without specifying the protocol between them.

The problem arises when the metadata server does not have permission to access a data server. If the client cannot see all the devices at mount time, the client will fall back to NFS with no indication of an issue: a silent loss of scalability. In the file/object layouts, the client does not check that it has access to the data servers at mount time. If this error is detected at mount time, the problem is much easier to fix; otherwise the admin will not notice the error and scalability will suffer.

Protocol gaps:
- The client doesn't communicate to the MDS that it had an I/O error on the DS.
- The MDS can give clients a layout with no expectation that the client has permission to access the DS.
- A permission problem is not reported at mount time.
- (Two more not covered because of questions interrupting; see slides.)

Adamson: By access, do you mean physical access or permissions?
Faibish: File permissions.
Black: "I can't get there" and "I can get there and I can't see anything" are very similar. Functional requirements/expectations on the MDS-to-DS protocol are reasonable even if the protocol is not specified.
Faibish: Agree.
Eisler: Bullet #1 is the protocol gap.
Pawlowski: This is not earth-shattering to pNFS. This is really a diagnostic error code proposal.
Black: Agree. Eisler: Agree. Faibish: Agree.

Proposed remedies:
1. Add permission checks for the client to access all the data servers (using a list sent by the MDS) at mount time; see the sketch after this list.
2. Add a new client error case for when the client cannot access a data server at mount time, and propagate it to the MDS.
3. Add a permission check of the MDS to the DS after a client reports a permission access error to that DS.
4. Add a new I/O error for when a pNFS client cannot access a DS that was accessible at mount time, and then ask for the redirect.

Adamson: For #1, this makes sense for block. For users, how would you check for every user at mount time?
Faibish: In the file case, this would be for basic client access checks.
Pawlowski: The implementations should be logging these errors. This is not totally a protocol error.
Faibish: Agree.
Black: #3 (and some other items) are implementation advice, not protocol changes.
Black: Complete the I/O successfully through the MDS, but also inform the user via the protocol. Don't fail the I/O.

Continuation of remedies:
5. The pNFS server that granted a layout to the client should check that the client has access to the storage devices (files, LUNs, or objects).
6. The pNFS client should add a new pNFS mount switch to inform the pNFS server of the client's pNFS access intention, and log on both sides (client/server) in case of failure.
7. The pNFS MDS should check that it can perform normal I/Os to any device it hands out in a pNFS layout.
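A hypothetical sketch of the mount-time check in remedy #1. It probes only TCP reachability; a real client would also verify NFS-level permissions, and every name here is invented.

    import socket

    def check_data_servers(device_addrs, timeout=5.0):
        """At mount time, probe every data server address handed out by
        the MDS and return the unreachable ones, so the failure can be
        logged and reported to the MDS instead of silently degrading to
        plain NFS.  device_addrs is a list of (host, port) pairs."""
        unreachable = []
        for host, port in device_addrs:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    pass    # reachable; a real client would go on to
                            # check NFS-level access permissions here
            except OSError as err:
                unreachable.append(((host, port), err))
        return unreachable

    failed = check_data_servers([("ds1.example.com", 2049),
                                 ("ds2.example.com", 2049)])
    for (addr, err) in failed:
        # Proposed behavior: log locally and propagate an error to the
        # MDS rather than silently losing pNFS scalability.
        print(f"pNFS: data server {addr} unreachable at mount: {err}")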
Black (commenting on item 6): The important underlying principle to write up is opportunistic versus intentional use of pNFS (a topic on the list). In the latter case, you really want to know that it is not working, because there will be major problems otherwise.

Implementation ideas:
* Add an error case into LAYOUTRETURN or LAYOUTCOMMIT.
* Add a new layout return type that is "FSID with prejudice", i.e., return all layouts for this FSID and tell the server that the reason for the return is a connectivity issue.
* Add periodic access permission check retries, and return the layout only after several retries.
* Add a new mount switch -pNFS and a possible error when the pNFS optimization didn't work and the client carries on using plain NFS (not pNFS) to the MDS.

Questions raised on the implementation ideas:
Pawlowski: Are you proposing an alternative scalability protocol to pNFS? If pNFS does not work, we are no worse off than with regular NFS.
Faibish: Described the use cases for the addition of a DS after I/O has already started. I was thinking about programs that run for many days. If DSes come and go, we want to know when the client is not using pNFS.
Black: This seems appropriate as implementation guidance, even if the pNFS MDS-to-DS protocol is not specified. The only protocol change being suggested is how errors are reported between the client and the MDS.

Slide with a list of questions:
* Should we leave this entire issue as an implementation detail?
* Should we include protocol changes to address the scalability limitation to pNFS's scalable protocol?
* If we answer yes to protocol changes, should we introduce a new layout command or modify LAYOUTGET or LAYOUTCOMMIT?
* Should we amend/enhance NFSv4.1 or leave it for v4.2?

Pawlowski: Lars, should we overload the protocol specification with a BCP?
Eggert: BCPs are a class of documents that are used for a lot of things. They are stronger than Informational. A BCP-level statement is a way of saying "the NFSv4 consensus is to do X". Lars doesn't know what we want to say; a BCP might be appropriate or it might not be. The IETF has always been careful not to tell you how to implement the spec. It is useful to communicate implementation advice, though.
Black: The last two questions on the slide are connected. We need to figure out the right way to do the error reporting channel; then we can figure out whether to put this in v4.1 or v4.2.
Faibish: This is where we fail. Early adopters of pNFS will be tripped up by these issues first.
Black: I want to see a reasonable description of the proposed solution.
Lentini: Referring to Black's point, observes that this is the third time v4.2 has come up in this meeting - where do we stand with making it a WG work item?
Pawlowski: Having an 'it takes a village' approach to v4.2. Get Falkner or Eisler to start rolling out a v4.2 potential item list.
Eisler: Will make the v4.2 list ASAP.
Pawlowski: Two weeks from today?
Eisler: Done.

=== Topic: why NFSv4.1 doesn't have a published RFC yet

The BTNS WG needs to resolve an issue with the connection latching specification, which NFSv4.1 indirectly depends on.
Eggert: For 4.1, track down Nico... talk to Russ - check his concern, and the BTNS working group. Eisler should try to meet with the BTNS WG chairs if either or both are (still) at IETF.
Eggert: WRT v4.1, maybe we want to track down (with Nico) and resolve what we need this week.
Pawlowski: I pointed Spencer to Nico... I'll put an Aug 12th timer on getting back to Lars.

=== Wrap-up (Pawlowski)