Following up on the discussion in Montreal and on today's teleconference,
this is a formal proposal to remove the CB_SIZECHANGED callback operation
from the protocol in order to reduce protocol and client complexity.
Trond noted that CB_SIZECHANGED does not provide a strictly coherent
distributed view of a file's length. The operation was only meant to
shorten the window between a change in a file's length and the
propagation of that change to other clients. The proposed use of
CB_SIZECHANGED for file truncation had other issues, especially with
respect to the atomicity of truncating SETATTR operations and the
resulting CB_SIZECHANGED callbacks.
In the absence of CB_SIZECHANGED, the NFS client is expected to keep
using the GETATTR operation to poll for the latest file attributes,
including the file size. It can do so periodically (e.g. when traffic to
the MDS is idle, or in the same compound RPC as a LAYOUTCOMMIT), or when
it detects a "short" read from a storage device while reading beyond its
local notion of the file's EOF and has to decide whether to return a
short read to the application or zero-fill the returned data. If the file
may be extended by some other writing client, the server could use
CB_GETATTR to get the latest file length from that client, similar to the
use of CB_GETATTR for write delegations.
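As a rough illustration of the short-read case above, here is a minimal
sketch of the decision the client has to make once it has refreshed the
file size from the MDS (e.g. via GETATTR). The function name, the
ReadResult type and the parameters are hypothetical and exist only for
this sketch; nothing here is taken from the protocol draft.

```python
from dataclasses import dataclass


@dataclass
class ReadResult:
    data: bytes
    eof: bool


def resolve_short_read(offset, requested, returned, current_size):
    """Decide what to hand to the application after a short read.

    offset, requested: the byte range the application asked for.
    returned: bytes actually returned by the storage device (may be short).
    current_size: file size just refreshed from the MDS (assumed input).
    """
    # Number of bytes of the request that lie inside the current file size.
    end = min(offset + requested, current_size)
    valid = max(end - offset, 0)
    eof = offset + valid >= current_size
    if valid <= len(returned):
        # The file really ends here: return a (possibly trimmed) short read.
        return ReadResult(returned[:valid], eof)
    # The file extends past what the storage device returned:
    # zero-fill the gap up to the current EOF or the requested length.
    return ReadResult(returned + b"\0" * (valid - len(returned)), eof)
```

With a refreshed size of 4 the application sees a genuine short read;
with a refreshed size of 16 the same device reply is padded with zeros
up to the requested length.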
The server is expected to keep the existing layout semantics, under
which layouts should be recalled when a file is truncated and
outstanding layouts may become invalid. Some block-based implementations
may truncate the file size but keep the file's blocks allocated on
storage, which in theory lets clients keep their current layouts.
However, using these layouts for reads may return stale data that is
still stored on disk beyond the new, shorter EOF. The attached email
(which I can't find on the ietf archive at
http://www1.ietf.org/mail-archive/web/nfsv4/current/index.html... :( )
discusses a possible way of handling this case using enhancements to
CB_LAYOUTRECALL.
Benny
------------------------------------------------------------------------
Subject: Re: [nfsv4] Block Layout and CB_SIZECHANGED
From: Garth Goodson <Garth.Goodson at netapp.com>
Date: Thu, 20 Jul 2006 18:13:38 -0700
To: Black_David at emc.com
CC: Dave.Noveck at netapp.com, nfsv4 at ietf.org, trond.myklebust at fys.uio.no
Black_David at emc.com wrote:
As I'm discussing NFS vs. the combination of Trond and Dave, I'm
clearly in a hole, and hence need to stop digging ... but Trond's
message identified where my fuzzy thinking was:
Agreed! ...and as far as a POSIX client is concerned, the operations
that guarantee visibility of the writes on the disk/server should be
well defined. I'm not sure this is an exhaustive list, it should be
close:
write(O_SYNC)/write(O_DSYNC)/write(O_DIRECT)
fcntl(F_SETLK)/fcntl(F_SETLKW);
fsync()/msync(MS_SYNC)
close()
In addition, it is common NFS client practice to flush writes on
truncate() and/or fstat().
In all those cases, the client must do a LAYOUTCOMMIT, and (assuming I
didn't miss something above) that should suffice to deal with all those
cases where the application is using some sneaky out-of-band
communication.
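To make the rule above concrete, here is a minimal sketch of a toy
client that tracks ranges written to the storage devices and issues a
LAYOUTCOMMIT at each of the listed sync points. The class, the event
labels and the shape of the recorded operation are all made up for this
sketch; it is not how any real client is structured.

```python
# Labels for the operations listed above; the strings are names chosen
# for this sketch only.
SYNC_POINTS = {
    "write_O_SYNC", "write_O_DSYNC", "write_O_DIRECT",
    "fcntl_F_SETLK", "fcntl_F_SETLKW",
    "fsync", "msync_MS_SYNC",
    "close",
}
# Common NFS client practice, per the message above.
COMMON_PRACTICE_FLUSH = {"truncate", "fstat"}


class ToyPnfsClient:
    def __init__(self):
        # (offset, length) ranges written to storage but not yet
        # LAYOUTCOMMITed.
        self.uncommitted = []
        # Operations this toy client would have sent to the MDS.
        self.sent = []

    def write(self, offset, length):
        self.uncommitted.append((offset, length))

    def on_event(self, event):
        """Issue a LAYOUTCOMMIT if the event is a visibility point."""
        if event in SYNC_POINTS or event in COMMON_PRACTICE_FLUSH:
            if self.uncommitted:
                self.sent.append(("LAYOUTCOMMIT", tuple(self.uncommitted)))
                self.uncommitted.clear()
```

An event outside the two sets (a plain read, say) leaves the pending
ranges alone, which is where any batching of LAYOUTCOMMITs would have to
happen.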
I was thinking about O_SYNC writes or the like and wanted to batch the
pNFS LAYOUTCOMMIT ops to avoid having to do one per write (in particular,
my mention of fflush() was clearly wrong). If the EOF isn't being moved,
can the LAYOUTCOMMIT be delayed as long as the data write to the pNFS
back end is synchronous? In the blocks world, it looks like the answer is
"no", as a provisionally allocated block doesn't exist until the
LAYOUTCOMMIT, hence delaying that may cause an O_SYNC write to vanish
as a consequence of a client crash, making this all a bad idea :-( .
I think we can live without CB_SIZECHANGED, although it does optimize
the case where a file is truncated to zero between rounds of a parallel
computation with threads writing to different areas of the file - we
could try advising people not to do that with pNFS (Doctor, it hurts
when I do <that>).
Courtesy of Steve Fridella, here's a different interesting idea - for
the truncate case, would it help to add a "discard" or "invalid" flag
to CB_LAYOUTRECALL to tell the client that the range being recalled has
been invalidated by a truncate, and hence the client doesn't need to
do writes or LAYOUTCOMMITs? Or is this optimizing a case that's not
worth the effort?
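A minimal sketch of how a client might act on such a flag, purely as an
illustration of the idea. The discard parameter, the Extent type and the
operation tuples are hypothetical; no such flag exists in
CB_LAYOUTRECALL as discussed here, and flushing a whole overlapping
extent is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    offset: int
    length: int
    dirty: bool = False  # written locally, not yet flushed/committed


def handle_layout_recall(extents, recall_offset, recall_length, discard):
    """Return the operations the toy client would send for a recall.

    discard is the hypothetical "range was invalidated by a truncate"
    flag: when set, dirty data in the recalled range is dropped instead
    of being written and LAYOUTCOMMITed.
    """
    ops = []
    recall_end = recall_offset + recall_length
    for e in extents:
        overlaps = e.offset < recall_end and recall_offset < e.offset + e.length
        if not overlaps:
            continue
        if e.dirty and not discard:
            ops.append(("WRITE", e.offset, e.length))
            ops.append(("LAYOUTCOMMIT", e.offset, e.length))
        # Either flushed or discarded: nothing left pending for this extent.
        e.dirty = False
    ops.append(("LAYOUTRETURN", recall_offset, recall_length))
    return ops
```

With the flag set, the only traffic for a dirty 32k-64k extent recalled
after a truncate to 4k is the layout return itself.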
There have been other suggestions in the past to add flags to
LAYOUTRECALL, specifically to give the client some idea as to why the
layout was recalled. This could help the client in figuring out its
policy of when and how often to request a new layout -- we have just
started to think about this for the Linux client implementation. For
instance, it may help the client to know whether the layout was recalled
due to sharing or due to load balancing of a DS (switching from one DS to
another). This may be another instance of that.
I am interested in what other people have to say about this.
-Garth
Thanks,
--David
----------------------------------------------------
David L. Black, Senior Technologist
EMC Corporation, 176 South St., Hopkinton, MA 01748
+1 (508) 293-7953 FAX: +1 (508) 293-7786
black_david at emc.com Mobile: +1 (978) 394-7754
----------------------------------------------------
-----Original Message-----
From: Trond Myklebust [mailto:trond.myklebust at fys.uio.no]
Sent: Friday, July 14, 2006 12:12 PM
To: Noveck, Dave
Cc: Black, David; nfsv4 at ietf.org
Subject: RE: [nfsv4] Block Layout and CB_SIZECHANGED
On Fri, 2006-07-14 at 11:52 -0400, Noveck, Dave wrote:
forcing the application above NFS to fflush() or the
equivalent (to force an earlier LAYOUTCOMMIT)
If he doesn't do a flush, then the data can be in the buffer cache and
in that case the data will reappear after the truncate, in NFS as well
as pNFS. So the client has to do something to force his writes to be
committed at least as far as necessary to ensure that they don't happen
again. Given that he has to do something, what is the difficulty with
saying that something has to include LAYOUTCOMMIT as well as WRITE and
COMMIT?
By the way, I think I was wrong about the unstable write case. While it
is true that if I do a COMMIT after truncate no additional data will be
written, if I do an unstable write and do not COMMIT, old-style NFS is
just as exposed to this issue. This is because after the unstable writes
and the truncate the server may reboot, in which case I am going to have
my COMMIT fail and I am going to redo my writes, extending the file.
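The sequence just described can be sketched with a toy model. This
assumes, for illustration only, that a "failed" COMMIT shows up as a
changed write verifier and that the truncate itself survives the reboot;
the class and function names are invented for the sketch.

```python
class ToyServer:
    def __init__(self):
        self.verifier = 1  # changes when the server reboots
        self.size = 0

    def unstable_write(self, offset, length):
        # Data lands in the server's cache; the file may grow.
        self.size = max(self.size, offset + length)
        return self.verifier

    def truncate(self, new_size):
        # Assumed to be stable, i.e. to survive a reboot.
        self.size = new_size

    def reboot(self):
        # Uncommitted data is lost; clients detect this via the verifier.
        self.verifier += 1

    def commit(self):
        return self.verifier


def client_sequence(reboot):
    """Unstable write, truncate by someone else, optional reboot, COMMIT."""
    srv = ToyServer()
    writes = [(32 * 1024, 32 * 1024)]
    verf = [srv.unstable_write(o, n) for o, n in writes][-1]
    srv.truncate(4 * 1024)
    if reboot:
        srv.reboot()
    if srv.commit() != verf:
        # Verifier changed: the client redoes its writes, extending the file.
        for o, n in writes:
            srv.unstable_write(o, n)
    return srv.size
```

Without the reboot the file stays at 4k after the COMMIT; with it, the
resent writes push the size back out to 64k.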
So I think the proper distinction here is between writes that others
must see and those that others may but don't have to see. It makes sense
that writes in that latter state, whether due to not doing a COMMIT or
not doing a LAYOUTCOMMIT, are inherently subject to appearing after a
truncate. This means that the rule is that if you want to make sure that
they are included as subject to the truncate you have to convert them
from possibly-visible-by-others to
really-done-and-I-mean-it-and-others-must-be-able-to-see-them status.
Agreed! ...and as far as a POSIX client is concerned, the operations
that guarantee visibility of the writes on the disk/server should be
well defined. I'm not sure this is an exhaustive list, it should be
close:
write(O_SYNC)/write(O_DSYNC)/write(O_DIRECT)
fcntl(F_SETLK)/fcntl(F_SETLKW);
fsync()/msync(MS_SYNC)
close()
In addition, it is common NFS client practice to flush writes on
truncate() and/or fstat().
In all those cases, the client must do a LAYOUTCOMMIT, and (assuming I
didn't miss something above) that should suffice to deal with all those
cases where the application is using some sneaky out-of-band
communication.
The "broken client" scenario need not be fixed in the protocol.
Cheers,
Trond
-----Original Message-----
From: Black_David at emc.com [mailto:Black_David at emc.com]
Sent: Friday, July 14, 2006 11:22 AM
To: trond.myklebust at fys.uio.no
Cc: nfsv4 at ietf.org
Subject: RE: [nfsv4] Block Layout and CB_SIZECHANGED
I'd argue that until you commit the layout, you are still in the
situation where the data has not been written. You have not done the
equivalent of a full NFSv4.0 unstable WRITE since a successful unstable
write must update both the data _and_ the metadata in the server's
cache.
IOW the point at which the written data becomes visible to others is
what matters, and that means after LAYOUTCOMMIT.
And if NFS were the only possible communication channel, I might agree,
but going back to my scenario (and inserting a couple of instances of
"[layout]" for clarification):
1) pNFS client takes out an extent from 32k to 64k, and writes data.
   It marks the written area as needing to be [layout] COMMIT-ed, but
   doesn't do the [layout] COMMIT.
2) Some other client uses SETATTR to truncate the file to be 4k in
   size.
Suppose that the clients are in cahoots - there was an out-of-band
communication between them, and the SETATTR was supposed to throw
away the first client's writes (and some other data). Having it
reappear because pNFS did something strange (first client does the
delayed LAYOUTCOMMIT after the SETATTR) would be peculiar, and
to my mind, forcing the application above NFS to fflush() or the
equivalent (to force an earlier LAYOUTCOMMIT) before the out-of-band
communication is tantamount to admitting that there is a problem here
but we're going to force applications to fix it. This is an NFS vs.
pNFS behavior difference that I'd prefer to eliminate.
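The two orderings in this scenario can be sketched with a toy metadata
server. This is only a model of the ordering problem: it assumes a
LAYOUTCOMMIT can move EOF out to the end of the committed range, and it
deliberately ignores any layout recall the server would issue on
truncate. ToyMds and run() are names invented for the sketch.

```python
class ToyMds:
    def __init__(self, size):
        self.size = size

    def setattr_size(self, new_size):
        # Truncating SETATTR from the second client.
        self.size = new_size

    def layoutcommit(self, offset, length):
        # Assumption: committing a written range may extend EOF.
        self.size = max(self.size, offset + length)


def run(commit_before_truncate):
    """Return the final file size for one ordering of the two operations."""
    mds = ToyMds(size=64 * 1024)
    pending = (32 * 1024, 32 * 1024)  # step 1: extent 32k..64k written
    if commit_before_truncate:
        mds.layoutcommit(*pending)
        mds.setattr_size(4 * 1024)    # step 2: truncate to 4k
    else:
        mds.setattr_size(4 * 1024)    # step 2 happens first
        mds.layoutcommit(*pending)    # delayed LAYOUTCOMMIT
    return mds.size
```

Committing first leaves a 4k file; the delayed LAYOUTCOMMIT leaves a
64k file, which is the reappearing data described above.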
Thanks,
--David
But David is not talking about cached writes but writes done to
the data server which have not been LAYOUTCOMMITed. There is no
non-pnfs equivalent of that.
The closest I can come is unstable writes done to the server
which have not been COMMITed. In this case a truncate is effective
without locking. You do the COMMIT and the file is not extended.
How you judge this case depends on what analogies you make. Is
writing to the data server more like putting things in your cache
or is it more like doing an unstable write? I'd argue that the latter
is a more appropriate analogy.
I'd argue that until you commit the layout, you are still in the
situation where the data has not been written. You have not done the
equivalent of a full NFSv4.0 unstable WRITE since a successful unstable
write must update both the data _and_ the metadata in the server's
cache.
IOW the point at which the written data becomes visible to others is
what matters, and that means after LAYOUTCOMMIT.
Cheers,
Trond
_______________________________________________
nfsv4 mailing list
nfsv4 at ietf.org
https://www1.ietf.org/mailman/listinfo/nfsv4
------------------------------------------------------------------------