As I'm discussing NFS vs. the combination of Trond and Dave, I'm
clearly in a hole, and hence need to stop digging ... but Trond's
message identified where my fuzzy thinking was:
Agreed! ...and as far as a POSIX client is concerned, the operations
that guarantee visibility of the writes on the disk/server should be
well defined. I'm not sure this is an exhaustive list, but it should be
close:
write(O_SYNC)/write(O_DSYNC)/write(O_DIRECT)
fcntl(F_SETLK)/fcntl(F_SETLKW);
fsync()/msync(MS_SYNC)
close()
In addition, it is common NFS client practice to flush writes on
truncate() and/or fstat().
In all those cases, the client must do a LAYOUTCOMMIT, and (assuming I
didn't miss something above) that should suffice to deal with all those
cases where the application is using some sneaky out-of-band
communication.
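
For illustration only (this sketch is not from the thread): the pattern
below shows an application calling fsync(), one of the operations listed
above, before it would signal a peer out of band. The helper name
write_visible and the file name are hypothetical, and a local temporary
file stands in for an NFS mount, so the code shows only the ordering an
application would follow, not any pNFS client behavior.

```python
import os
import tempfile


def write_visible(path, offset, data):
    """Write data at offset and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)
        # On a pNFS client this is one of the points where, per the list
        # above, a LAYOUTCOMMIT would be expected (assumption: not verified
        # against any particular client implementation).
        os.fsync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "shared.dat")
    # First party: write 1k at offset 32k and make it visible.
    write_visible(path, 32 * 1024, b"x" * 1024)
    size_after_write = os.path.getsize(path)
    # Only now would the out-of-band message be sent; the peer then truncates.
    os.truncate(path, 4096)
    size_after_truncate = os.path.getsize(path)
```

Because the flush happens before the out-of-band message, nothing is left
to be written back after the peer's truncate.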
I was thinking about O_SYNC writes or the like and wanted to batch the
pNFS LAYOUTCOMMIT ops to avoid having to do one per write (in particular,
my mention of fflush() was clearly wrong). If the EOF isn't being moved,
can the LAYOUTCOMMIT be delayed as long as the data write to the pNFS back
end is synchronous? In the blocks world, it looks like the answer is
"no", as a provisionally allocated block doesn't exist until the
LAYOUTCOMMIT, hence delaying that may cause an O_SYNC write to vanish
as a consequence of a client crash, making this all a bad idea :-( .
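
A toy model of that failure, as a sketch under stated assumptions: the
class and function names are hypothetical, blocks are just dictionary
keys, and "provisionally allocated" is modeled as "known only to the
client until LAYOUTCOMMIT". It is not a model of any real block layout
server.

```python
class ToyMetadataServer:
    """Tracks which blocks are part of the file (hypothetical model)."""

    def __init__(self):
        self.committed = {}

    def layoutcommit(self, blocks):
        # Provisionally allocated blocks become part of the file here.
        self.committed.update(blocks)


class ToyBlockClient:
    """Client that batches LAYOUTCOMMIT every commit_every sync writes."""

    def __init__(self, mds, commit_every):
        self.mds = mds
        self.commit_every = commit_every
        self.pending = {}

    def sync_write(self, block, data):
        # The data reaches storage synchronously, but the block is still
        # only provisionally allocated until LAYOUTCOMMIT.
        self.pending[block] = data
        if len(self.pending) >= self.commit_every:
            self.mds.layoutcommit(self.pending)
            self.pending = {}

    def crash(self):
        # A client crash loses the record of uncommitted provisional blocks.
        self.pending = {}


def surviving_blocks(commit_every, n_writes):
    """Return the blocks the server still knows about after a client crash."""
    mds = ToyMetadataServer()
    client = ToyBlockClient(mds, commit_every)
    for block in range(n_writes):
        client.sync_write(block, b"data")
    client.crash()
    return sorted(mds.committed)


per_write = surviving_blocks(commit_every=1, n_writes=3)
batched = surviving_blocks(commit_every=4, n_writes=3)
```

With one LAYOUTCOMMIT per write all three blocks survive the crash; with
a batch size of four, none of the three "synchronous" writes do.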
I think we can live without CB_SIZECHANGED, although it does optimize
the case where a file is truncated to zero between rounds of a parallel
computation with threads writing to different areas of the file - we
could try advising people not to do that with pNFS (Doctor, it hurts
when I do <that>).
Courtesy of Steve Fridella, here's a different interesting idea - for
the truncate case, would it help to add a "discard" or "invalid" flag
to CB_LAYOUTRECALL to tell the client that the range being recalled has
been invalidated by a truncate, and hence the client doesn't need to
do writes or LAYOUTCOMMITs? Or is this optimizing a case that's not
worth the effort?
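
A rough sketch of what such a flag might mean on the client side. The
"discard" parameter is the hypothetical flag proposed above, not
something that exists in the protocol, and handle_layoutrecall and the
dictionary of pending writes are made-up names for illustration.

```python
def handle_layoutrecall(pending, start, end, discard=False):
    """Split pending writes on a recall of the byte range [start, end).

    pending maps a write offset to its uncommitted data. Returns a tuple
    (to_flush, remaining): the writes the client must still push out and
    LAYOUTCOMMIT before returning the layout, and the writes that fall
    outside the recalled range.
    """
    in_range = {off: data for off, data in pending.items()
                if start <= off < end}
    remaining = {off: data for off, data in pending.items()
                 if off not in in_range}
    # With the hypothetical discard flag, the recalled range was invalidated
    # by a truncate, so the writes in it are dropped instead of flushed.
    to_flush = {} if discard else in_range
    return to_flush, remaining


pending_writes = {0: b"a", 32768: b"b", 40960: b"c"}
flush_plain, rest_plain = handle_layoutrecall(pending_writes, 4096, 65536)
flush_discard, rest_discard = handle_layoutrecall(
    pending_writes, 4096, 65536, discard=True)
```

In the plain recall the two writes above 4k must be flushed; with the
flag set they are simply forgotten, which is the saving in question.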
Thanks,
--David
----------------------------------------------------
David L. Black, Senior Technologist
EMC Corporation, 176 South St., Hopkinton, MA 01748
+1 (508) 293-7953 FAX: +1 (508) 293-7786
black_david at emc.com Mobile: +1 (978) 394-7754
----------------------------------------------------
-----Original Message-----
From: Trond Myklebust [mailto:trond.myklebust at fys.uio.no]
Sent: Friday, July 14, 2006 12:12 PM
To: Noveck, Dave
Cc: Black, David; nfsv4 at ietf.org
Subject: RE: [nfsv4] Block Layout and CB_SIZECHANGED
On Fri, 2006-07-14 at 11:52 -0400, Noveck, Dave wrote:
forcing the application above NFS to fflush() or the
equivalent (to force an earlier LAYOUTCOMMIT)
If he doesn't do a flush, then the data can be in the
buffer cache and in that case the data will reappear
after the truncate, in NFS as well as pNFS. So the
client has to do something to force his writes to be
committed at least as far as necessary to ensure that
they don't happen again. Given that he has to do
something, what is the difficulty with saying that
something has to include LAYOUTCOMMIT as well as
WRITE and COMMIT?
By the way, I think I was wrong about the unstable write case.
While it is true that if I do a COMMIT after truncate
no additional data will be written, if I do an unstable
write and do not COMMIT, old-style NFS is just as
exposed to this issue. This is because after the
unstable writes and the truncate the server may reboot,
in which case I am going to have my COMMIT fail and I
am going to redo my writes, extending the file.
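
A toy simulation of that sequence, as a sketch only: the class and
function names are hypothetical, the write verifier is a simple counter,
and the model assumes the truncate itself is durable while a reboot
changes the verifier returned by COMMIT.

```python
class ToyNfsServer:
    """Very small model of file size and write verifier (hypothetical)."""

    def __init__(self):
        self.verifier = 1
        self.size = 0

    def write_unstable(self, offset, length):
        # An unstable WRITE extends the file and returns the current verifier.
        self.size = max(self.size, offset + length)
        return self.verifier

    def truncate(self, size):
        self.size = size

    def reboot(self):
        # Uncommitted data is lost; the changed verifier tells clients so.
        self.verifier += 1

    def commit(self):
        return self.verifier


def run_scenario(server_reboots):
    """Unstable write of 32k-64k, truncate to 4k, then COMMIT."""
    server = ToyNfsServer()
    write_verifier = server.write_unstable(32768, 32768)
    server.truncate(4096)
    if server_reboots:
        server.reboot()
    if server.commit() != write_verifier:
        # Verifier mismatch: the client must redo its writes, which
        # extends the file again after the truncate.
        server.write_unstable(32768, 32768)
    return server.size


size_without_reboot = run_scenario(server_reboots=False)
size_with_reboot = run_scenario(server_reboots=True)
```

Without the reboot the file stays at 4k; with it, the replayed writes
bring the file back to 64k, which is the exposure described above.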
So I think the proper distinction here is between writes
that others may see and those that others may but don't
have to see. It makes sense that writes in that latter
state, whether due to not doing a COMMIT or not doing a
LAYOUTCOMMIT are inherently subject to appearing after
a truncate. This means that the rule is that if you
want to make sure that they are included as subject to
the truncate you have to convert them from possibly-
visible-by-others to really-done-and-I-mean-it-and-
others-must-be-able-to-see-them status.
Agreed! ...and as far as a POSIX client is concerned, the operations
that guarantee visibility of the writes on the disk/server should be
well defined. I'm not sure this is an exhaustive list, but it should be
close:
write(O_SYNC)/write(O_DSYNC)/write(O_DIRECT)
fcntl(F_SETLK)/fcntl(F_SETLKW);
fsync()/msync(MS_SYNC)
close()
In addition, it is common NFS client practice to flush writes on
truncate() and/or fstat().
In all those cases, the client must do a LAYOUTCOMMIT, and (assuming I
didn't miss something above) that should suffice to deal with all those
cases where the application is using some sneaky out-of-band
communication.
The "broken client" scenario need not be fixed in the protocol.
Cheers,
Trond
-----Original Message-----
From: Black_David at emc.com [mailto:Black_David at emc.com]
Sent: Friday, July 14, 2006 11:22 AM
To: trond.myklebust at fys.uio.no
Cc: nfsv4 at ietf.org
Subject: RE: [nfsv4] Block Layout and CB_SIZECHANGED
I'd argue that until you commit the layout, you are still in the
situation where the data has not been written. You have not done the
equivalent of a full NFSv4.0 unstable WRITE since a successful unstable
write must update both the data _and_ the metadata in the server's
cache.
IOW the point at which the written data becomes visible to others is
what matters, and that means after LAYOUTCOMMIT.
And if NFS were the only possible communication channel, I might agree,
but going back to my scenario (and inserting a couple of instances of
"[layout]" for clarification):
1) pNFS client takes out an extent from 32k to 64k, and writes data.
It marks the written area as needing to be [layout] COMMIT-ed, but
doesn't do the [layout] COMMIT.
2) Some other client uses SETATTR to truncate the file to be 4k in
size.
Suppose that the clients are in cahoots - there was an out-of-band
communication between them, and the SETATTR was supposed to throw
away the first client's writes (and some other data). Having it
reappear because pNFS did something strange (first client does the
delayed LAYOUTCOMMIT after the SETATTR) would be peculiar, and
to my mind, forcing the application above NFS to fflush() or the
equivalent (to force an earlier LAYOUTCOMMIT) before the out-of-band
communication is tantamount to admitting that there is a problem here
but we're going to force applications to fix it. This is an NFS vs.
pNFS behavior difference that I'd prefer to eliminate.
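
The two-step scenario above can be sketched as a toy model. Everything
here is an assumption made for illustration: ToyFile and final_size are
hypothetical names, and the server is assumed to apply a late
LAYOUTCOMMIT by extending the file to cover the last written byte.

```python
class ToyFile:
    """File size as seen at the metadata server (hypothetical model)."""

    def __init__(self, size=0):
        self.size = size

    def setattr_size(self, new_size):
        # Step 2: another client truncates the file.
        self.size = new_size

    def layoutcommit(self, last_write_offset):
        # Assumed behavior: committing written data extends the file so
        # that it covers the last byte written.
        self.size = max(self.size, last_write_offset + 1)


def final_size(commit_before_truncate):
    """Client A wrote the extent 32k to 64k; client B truncates to 4k."""
    f = ToyFile(size=16 * 1024)
    last_byte = 64 * 1024 - 1
    if commit_before_truncate:
        f.layoutcommit(last_byte)
        f.setattr_size(4 * 1024)
    else:
        f.setattr_size(4 * 1024)
        f.layoutcommit(last_byte)  # the delayed [layout] COMMIT
    return f.size


size_commit_first = final_size(commit_before_truncate=True)
size_commit_late = final_size(commit_before_truncate=False)
```

Committing first leaves a 4k file; the delayed commit brings the
truncated data's range back and leaves a 64k file.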
Thanks,
--David
But David is not talking about cached writes but writes done to
the data server which have not been LAYOUTCOMMITed. There is no
non-pNFS equivalent of that.
The closest I can come is unstable writes done to the server
which have not been COMMITed. In this case a truncate is effective
without locking. You do the COMMIT and the file is not extended.
How you judge this case depends on what analogies you make. Is
writing to the data server more like putting things in your cache
or is it more like doing an unstable write? I'd argue that the latter
is a more appropriate analogy.
I'd argue that until you commit the layout, you are still in the
situation where the data has not been written. You have not done the
equivalent of a full NFSv4.0 unstable WRITE since a successful unstable
write must update both the data _and_ the metadata in the server's
cache.
IOW the point at which the written data becomes visible to others is
what matters, and that means after LAYOUTCOMMIT.
Cheers,
Trond
_______________________________________________
nfsv4 mailing list
nfsv4 at ietf.org
https://www1.ietf.org/mailman/listinfo/nfsv4