idnits 2.17.1 

draft-ietf-nfsv4-internationalization-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (September 26, 2021) is 943 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. '11'

  -- Possible downref: Non-RFC (?) normative reference: ref. '12'

  -- Possible downref: Non-RFC (?) normative reference: ref. '13'

  -- Obsolete informational reference (is this intentional?): RFC 3010 (ref.
     '16') (Obsoleted by RFC 3530)

  -- Obsolete informational reference (is this intentional?): RFC 3454 (ref.
     '17') (Obsoleted by RFC 7564)

  -- Obsolete informational reference (is this intentional?): RFC 3490 (ref.
     '18') (Obsoleted by RFC 5890, RFC 5891)

  -- Obsolete informational reference (is this intentional?): RFC 3491 (ref.
     '19') (Obsoleted by RFC 5891)

  -- Obsolete informational reference (is this intentional?): RFC 3530 (ref.
     '20') (Obsoleted by RFC 7530)

  -- Obsolete informational reference (is this intentional?): RFC 5661 (ref.
     '21') (Obsoleted by RFC 8881)


     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	NFSv4                                                          D. Noveck
3	Internet-Draft                                                    NetApp
4	Updates: 8881, 7530 (if approved)                     September 26, 2021
5	Intended status: Standards Track
6	Expires: March 30, 2022

8	              Internationalization for the NFSv4 Protocols
9	                draft-ietf-nfsv4-internationalization-01

11	Abstract

13	   This document describes the handling of internationalization for all
14	   NFSv4 protocols, including NFSv4.0, NFSv4.1, NFSv4.2 and extensions
15	   thereof, and future minor versions.

17	   It updates RFC7530 and RFC8881.

19	Status of This Memo

21	   This Internet-Draft is submitted in full conformance with the
22	   provisions of BCP 78 and BCP 79.

24	   Internet-Drafts are working documents of the Internet Engineering
25	   Task Force (IETF).  Note that other groups may also distribute
26	   working documents as Internet-Drafts.  The list of current Internet-
27	   Drafts is at https://datatracker.ietf.org/drafts/current/.

29	   Internet-Drafts are draft documents valid for a maximum of six months
30	   and may be updated, replaced, or obsoleted by other documents at any
31	   time.  It is inappropriate to use Internet-Drafts as reference
32	   material or to cite them other than as "work in progress."

34	   This Internet-Draft will expire on March 30, 2022.

36	Copyright Notice

38	   Copyright (c) 2021 IETF Trust and the persons identified as the
39	   document authors.  All rights reserved.

41	   This document is subject to BCP 78 and the IETF Trust's Legal
42	   Provisions Relating to IETF Documents
43	   (https://trustee.ietf.org/license-info) in effect on the date of
44	   publication of this document.  Please review these documents
45	   carefully, as they describe your rights and restrictions with respect
46	   to this document.  Code Components extracted from this document must
47	   include Simplified BSD License text as described in Section 4.e of
48	   the Trust Legal Provisions and are provided without warranty as
49	   described in the Simplified BSD License.

51	Table of Contents

53	   1.  Introduction
54	   2.  Requirements Language
55	     2.1.  Requirements Language Definition
56	     2.2.  Requirements Language Derivation
57	   3.  Internationalization and Minor Versioning
58	   4.  Changes Relative to RFC7530
59	   5.  Limitations on Internationalization-Related Processing in the
60	       NFSv4 Context
61	   6.  Summary of Server Behavior Types
62	   7.  The Attribute Fs_charset_cap
63	     7.1.  The Attribute Fs_charset_cap in Published NFSv4.1
64	           Specifications
65	     7.2.  The Attribute Fs_charset_cap in Future NFSv4.1
66	           Specifications
67	   8.  String Encoding
68	   9.  Normalization
69	   10. Case-Insensitive Processing of File Names
70	     10.1.  Implementing Case-Insensitive Comparison of File Names
71	     10.2.  Important Examples of Case-insensitive Handling of File
72	            Names
73	   11. Internationalization-related Processing of File Names by
74	       Clients
75	     11.1.  Server Restrictions to Deal with Lack of Client
76	            Knowledge
77	     11.2.  Client Processing of File Names for Current NFSv4
78	            Protocols
79	     11.3.  Client Processing of File Names for Future NFSv4
80	            Protocols
81	   12. String Types with Processing Defined by Other Internet Areas
82	     12.1.  Effect of IDNA Changes
83	     12.2.  Potential Compatibility Issues Related to IDNA Changes
84	   13. Errors Related to UTF-8
85	   14. Servers That Accept File Component Names That Are Not Valid
86	       UTF-8 Strings
87	   15. Future Minor Versions and Extensions
88	   16. IANA Considerations
89	   17. Security Considerations
90	   18. References
91	     18.1.  Normative References
92	     18.2.  Informative References
93	   Appendix A.  History
94	   Appendix B.  Form-insensitive String Comparisons
95	     B.1.  Name Hashes
96	     B.2.  Character Tables
97	     B.3.  Outline of comparison
98	     B.4.  Comparing Base Characters
99	     B.5.  Comparing Combining Characters
100	   Acknowledgements
101	   Author's Address

103	1.  Introduction

105	   Internationalization is a complex topic with its own set of
106	   terminology (see [22]).  The topic is made more complex for the NFSv4
107	   protocols by the tangled history described in Appendix A.  In large
108	   part, this document is based on the actual behavior of NFSv4 client
109	   and server implementations (for all existing minor versions) and is
110	   intended to serve as a basis for further implementations to be
111	   developed that can interact with existing implementations as well as
112	   those to be developed in the future.

114	   Note that the behaviors on which this document is based are each
115	   demonstrated by a combination of an NFSv4 server implementation
116	   proper and a server-side physical file system.  It is common for
117	   servers and physical file systems to be configurable as to the
118	   behavior shown.  In the discussion below, each configuration that
119	   shows different behavior is considered separately.

121	   As a consequence of this choice, normative terms defined in RFC2119
122	   [1] are often derived from implementation behavior, rather than the
123	   other way around, as is more commonly the case.  The specifics are
124	   discussed in Section 2.

126	   With regard to the question of interoperability with existing
127	   specifications for NFSv4 minor versions, different minor versions
128	   pose different issues.

130	   o  With regard to NFSv4.0 as defined in RFC7530 [3], no significant
131	      interoperability issues are expected to arise because the
132	      internationalization in that specification, which is the basis for
133	      this one, was also based on the behavior of existing
134	      implementations.  Although, in a formal sense, the treatment of
135	      internationalization here supersedes that in RFC7530 [3], the
136	      treatments are intended to be essentially the same, in order to
137	      eliminate interoperability issues.

139	      Because of a change in the handling of Internationalized domain
140	      names, there are some differences from the handling in RFC7530
141	      [3], as discussed in Appendix A.  For a discussion of those
142	      differences and potential compatibility issues, see Sections 12.1
143	      and 12.2.

145	   o  With regard to NFSv4.1 as defined by RFC8881 [9], the situation is
146	      quite different.  The approach to internationalization specified
147	      in that document, based in large part on that in RFC3530, was never
148	      implemented, and implementers were either unaware of the
149	      troublesome implications of that approach or chose to ignore the
150	      existing specification as essentially unimplementable.  An
151	      internationalization approach compatible with that specified in
152	      RFC7530 [3] tended to be followed, despite the fact that, in other
153	      respects, NFSv4.1 was considered to be a separate protocol.

155	      If there were NFSv4 servers that obeyed the internationalization
156	      dictates within RFC5661 [21], or clients that expected servers to
157	      do so, they would fail to interoperate with typical clients and
158	      servers when dealing with non-UTF8 file names, which are quite
159	      common.  As no such implementations have come to our attention, it
160	      has to be assumed that they do not exist and interoperability with
161	      existing implementations as described here is an appropriate basis
162	      for this document.

164	2.  Requirements Language

166	2.1.  Requirements Language Definition

168	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
169	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
170	   document are to be interpreted as described in BCP 14 [1] [2] when,
171	   and only when, they appear in all capitals, as shown here.

173	2.2.  Requirements Language Derivation

175	   Although the key words "MUST", "SHOULD", and "MAY" retain their
176	   normal meanings, as described above, we need to explain how the
177	   statements involving these terms were arrived at:

179	   o  In the case of statements within Sections 12 and 15, these derive
180	      from the requirements of other internet specifications.

182	   o  In the case of statements within Sections 7, 10, and 11, these
183	      derive from the author's view of the appropriate normative
184	      language to use and will, when this document is advanced,
185	      represent the working group's consensus on those same matters.

187	   o  However, in other cases, namely those in sections deriving from
188	      RFC7530 [3] (i.e., Sections 5, 6, 8, 9, 13, 14, 16, and 17), this
189	      specification's descriptions were derived from existing
190	      implementation patterns.  Although this pattern is atypical, it is
191	      needed to provide a description that satisfies the goal of RFC2119
192	      [1], providing a normative description to enable future
193	      implementations to be compatible with existing ones.  This
194	      requires that we explain later in this section how the normative
195	      terms used derive from the behavior of existing implementations,
196	      in those situations in which existing implementation behavior
197	      patterns can be determined.

199	   Note that in introductory and explanatory sections of this document
200	   (i.e., Sections 1 through 4), these terms do not appear except to
201	   explain how they are used in this document.  Also, they do not appear
202	   in Appendix B, which provides non-normative implementation guidance.

204	   With regard to the parts of this document deriving from RFC7530, we
205	   explain below how the normative terms used derive from the behavior
206	   of existing implementations, in those situations in which existing
207	   implementation behavior patterns can be determined.

209	   o  Behavior implemented by all existing clients or servers is
210	      described using "MUST", since new implementations need to follow
211	      existing ones to be assured of interoperability.  While it is
212	      possible that different behavior might be workable, we have found
213	      no case where this seems reasonable.

215	      The converse holds for "MUST NOT": if a type of behavior poses
216	      interoperability problems, it MUST NOT be implemented by any
217	      existing clients or servers.

219	   o  Behavior implemented by most existing clients or servers, where
220	      that behavior is more desirable than any alternative, is described
221	      using "SHOULD", since new implementations need to follow that
222	      existing practice unless there are strong reasons to do otherwise.

224	      The converse holds for "SHOULD NOT".

226	   o  Behavior implemented by some, but not all, existing clients or
227	      servers is described using "MAY", indicating that new
228	      implementations have a choice as to whether they will behave in
229	      that way.  Thus, new implementations will have the same
230	      flexibility that existing ones do.

232	   o  Behavior implemented by all existing clients or servers, so far as
233	      is known -- but where there remains some uncertainty as to details
234	      -- is described using "should".  Such cases primarily concern
235	      details of error returns.  New implementations should follow
236	      existing practice even though such situations generally do not
237	      affect interoperability.

239	   There are also cases in which certain server behaviors, while not
240	   known to exist, cannot be reliably determined not to exist.  In part,
241	   this is a consequence of the long period of time that has elapsed
242	   since the publication of the defining specifications, resulting in a
243	   situation in which those involved in the implementation work may no
244	   longer be involved in or be aware of working group activities.

246	   In the case of possible server behavior that is neither known to
247	   exist nor known not to exist, we use "SHOULD NOT" and "MUST NOT" as
248	   follows, and similarly for "SHOULD" and "MUST".

250	   o  In some cases, the potential behavior is not known to exist but is
251	      of such a nature that, if it were in fact implemented,
252	      interoperability difficulties would be expected and reported,
253	      giving us cause to conclude that the potential behavior is not
254	      implemented.  For such behavior, we use "MUST NOT".  Similarly, we
255	      use "MUST" to apply to the contrary behavior.

257	   o  In other cases, potential behavior is not known to exist but the
258	      behavior, while undesirable, is not of such a nature that we are
259	      able to draw any conclusions about its potential existence.  In
260	      such cases, we use "SHOULD NOT".  Similarly, we use "SHOULD" to
261	      apply to the contrary behavior.

263	   In the case of a "MAY", "SHOULD", or "SHOULD NOT" that applies to
264	   servers, clients need to be aware that there are servers that may or
265	   may not take the specified action, and they need to be prepared for
266	   either eventuality.

268	3.  Internationalization and Minor Versioning

270	   Despite the fact that NFSv4.0 and subsequent minor versions have
271	   differed in many ways, the actual implementations of
272	   internationalization have remained the same and internationalized
273	   names have been handled without regard to the minor version being
274	   used.  Minor version specification documents contained different
275	   treatments of internationalization, as described in Appendix A.  Of
276	   those, only the implementation-based approach used by RFC7530 [3]
277	   resulted in a workable description, while a number of attempts to
278	   specify an approach that implementors were to follow were all
279	   ignored.

281	   It is expected that any future minor versions will follow a similar
282	   approach, even though there is nothing to prevent a future minor
283	   version from adopting a different approach as long as the rules
284	   within [8] are adhered to.  In any such case, the new minor version
285	   would have to be marked as updating or obsoleting this document.
286	   Issues relating to potential extensions within the framework
287	   specified in this document are dealt with in Section 15.

289	4.  Changes Relative to RFC7530

291	   This document follows the internationalization approach defined in
292	   RFC7530, with a number of significant necessary changes.

294	   o  The handling of internationalization specified in [3] is applied
295	      to all NFSv4 minor versions.  No compatibility issues are expected
296	      to arise because all existing implementations follow the same
297	      approach to internationalization despite the large difference
298	      between [3] and what was specified in [21].  Issues relating to
299	      potential future minor versions and protocol extensions are
300	      addressed in Section 15.

302	   o  Some changes motivated by the shift from IDNA2003 to IDNA2008 have
303	      been made.  The intention is to maintain compatibility with all
304	      existing NFSv4 minor versions.  Potential compatibility issues
305	      with regard to the IDNA shift are discussed in Section 12.2.

307	   o  There is more detailed discussion of case-insensitive handling of
308	      file names, with particular attention to the complexities that can
309	      arise when multiple language conventions in these matters need to
310	      be accommodated.  The discussion in Section 10 applies to both
311	      client and server, although issues relating to the client's
312	      knowledge are dealt with in Section 11.

314	   o  There is additional material dealing with the implications of
315	      server-side internationalization-related file name processing for
316	      clients that cache the results of READDIR operations.  This
317	      includes a discussion of options to deal with the current lack of
318	      detailed information about the server (in Section 11.2), and
319	      options for handling when more detailed information is available
320	      (in Section 11.3).

322	5.  Limitations on Internationalization-Related Processing in the NFSv4
323	    Context

325	   There are a number of noteworthy circumstances that limit the degree
326	   to which internationalization-related encoding and normalization-
327	   related restrictions can be made universal with regard to NFSv4
328	   clients and servers:

330	   o  The NFSv4 client is part of an extensive set of client-side
331	      software components whose design and internal interfaces are not
332	      within the IETF's purview, limiting the degree to which a
333	      particular character encoding might be made standard.

335	   o  Server-side handling of file component names is typically
336	      implemented within a server-side physical file system, whose
337	      handling of character encoding and normalization is not
338	      specifiable by the IETF.

340	   o  Typical implementation patterns in UNIX systems result in the
341	      NFSv4 client having no knowledge of the character encoding being
342	      used, which might even vary between processes on the same client
343	      system.

345	   o  Users may need access to files stored previously with non-UTF-8
346	      encodings, or with UTF-8 encodings that are not in accord with any
347	      particular normalization form.
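
   As an illustration of the last point, the following Python sketch
   shows one name in three stored forms.  It is not part of the
   protocol; the sample name and the helper is_valid_utf8 are
   assumptions chosen for illustration only.

```python
import unicodedata

# Illustrative sample only: the name "resume" with accented vowels,
# in three forms a server might find stored on disk.
name = "r\u00e9sum\u00e9"
nfc_bytes = unicodedata.normalize("NFC", name).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", name).encode("utf-8")
latin1_bytes = name.encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Both normalization forms are valid UTF-8, yet they differ byte for byte.
print(nfc_bytes, is_valid_utf8(nfc_bytes))
print(nfd_bytes, is_valid_utf8(nfd_bytes))
# The legacy (ISO 8859-1) encoding of the same name is not valid UTF-8.
print(latin1_bytes, is_valid_utf8(latin1_bytes))
```

   A server that compares names as raw bytes treats all three as
   different files, which is why neither a single encoding nor a single
   normalization form can simply be assumed.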

349	6.  Summary of Server Behavior Types

351	   Servers MAY reject component name strings that are not valid UTF-8.
352	   This leads to a number of types of valid server behavior, as outlined
353	   below.  When these are combined with the valid normalization-related
354	   behaviors as described in Section 8, this leads to the combined
355	   behaviors outlined below.

357	   o  Servers that limit file component names within a given file system
358	      to UTF-8 strings exist with normalization-related handling as
359	      described in Section 8.  These are best described as behaving as
360	      "UTF-8-only servers".

362	   o  Servers that do not limit file component names on particular file
363	      systems to UTF-8 strings are very common and are necessary to deal
364	      with clients/applications not oriented to the use of UTF-8.  Such
365	      servers ignore normalization-related issues, and there is no way
366	      for them to implement either normalization or representation-
367	      independent lookups.  These are best described as behaving as
368	      "UTF-8-unaware servers" for such file systems, since they treat
369	      file component names as uninterpreted strings of bytes and have no
370	      knowledge of the characters represented.  See Section 13 for
371	      details.

373	   o  It is possible for a server to allow component names that are not
374	      valid UTF-8, while still being aware of the structure of UTF-8
375	      strings.  Such servers could, in theory, implement either
376	      normalization or representation-independent lookups but apply
377	      those techniques only to valid UTF-8 strings.  Such servers are
378	      not common, but it is possible to configure at least one known
379	      server to have this behavior.  This behavior SHOULD NOT be used
380	      due to the possibility that a file name using one encoding may, by
381	      coincidence, have the appearance of a UTF-8 file name; the results
382	      of UTF-8 normalization or representation-independent lookups are
383	      unlikely to be correct in all cases, when considered from the
384	      viewpoint of the other encoding.  Such difficulties can be
385	      compounded when case-insensitive name handling is in effect.
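
   The three behavior types above can be sketched in Python as follows.
   This is a minimal illustration under stated assumptions, not a
   description of any actual server: the names ServerKind, accepts, and
   lookup_key are hypothetical, and NFC normalization stands in for
   whatever normalization or representation-independent lookup a
   UTF-8-aware server might apply.

```python
import unicodedata
from enum import Enum


class ServerKind(Enum):
    UTF8_ONLY = "UTF-8-only"
    UTF8_UNAWARE = "UTF-8-unaware"
    UTF8_AWARE_PERMISSIVE = "UTF-8-aware but accepts invalid names"


def is_valid_utf8(name: bytes) -> bool:
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def accepts(kind: ServerKind, name: bytes) -> bool:
    """Only a UTF-8-only server rejects names that are not valid UTF-8."""
    return kind is not ServerKind.UTF8_ONLY or is_valid_utf8(name)


def lookup_key(kind: ServerKind, name: bytes) -> bytes:
    """Key used to match names; assumes NFC for UTF-8-aware handling."""
    if kind is ServerKind.UTF8_UNAWARE or not is_valid_utf8(name):
        return name  # uninterpreted string of bytes
    text = name.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


# A three-character Windows-1252 name whose bytes happen to be valid UTF-8.
legacy_name = "A\u00cc\u0160".encode("cp1252")
rewritten = lookup_key(ServerKind.UTF8_AWARE_PERMISSIVE, legacy_name)
print(legacy_name, "->", rewritten, "->", rewritten.decode("cp1252"))
```

   In the last lines, the permissive UTF-8-aware server reads the legacy
   bytes as "A" followed by a combining ring and composes them into a
   single character.  Viewed through the original Windows-1252 encoding,
   the three-character name has become a different two-character name,
   which is the kind of coincidence that motivates the SHOULD NOT above.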

387	7.  The Attribute Fs_charset_cap

389	   This attribute, nominally "RECOMMENDED", appears to have been added
390	   to NFSv4.1 to let servers, while staying within the constraints of
391	   the stringprep-based specification of internationalization, allow
392	   uses of UTF-8-unaware naming by clients.  As a result, those NFSv4
393	   servers implementing internationalization as NFSv3 had done could be
394	   considered spec-compliant, as long as a later "SHOULD" was ignored.
395	   However, because use of UTF-8 was tied to existing stringprep
396	   restrictions, implementations of internationalization that were
397	   aware of Unicode canonical equivalence issues were not provided for.
398	   Although this attribute may have been implemented despite the
399	   problems noted in Section 7.1, the overall scheme was never
400	   implemented and NFSv4.1 implementations dealt with
401	   internationalization as NFSv4.0 implementations had.

403	   It is generally accepted that attributes designated "RECOMMENDED" are
404	   essentially OPTIONAL with the client having the responsibility to
405	   deal with server non-support of them.  While RFC7530 has gone so far
406	   as to explicitly exclude this use from the general statement that
407	   these terms are to be used as defined by RFC2119, no NFSv4.1
408	   specification has done so, at least through RFC8881 [9].  In this
409	   particular case, there are a number of circumstances that make this
410	   OPTIONAL status noteworthy:

412	   o  The statement "It is expected that servers will support all
413	      attributes they comfortably can and only fail to support
414	      attributes that are difficult to support in their operating
415	      environments", appearing in Section 5.2 of [9], is troublesome
416	      since it is hard to understand how a server could find this read-
417	      only attribute "difficult to support" regardless of the operating
418	      environment.

420	   o  This was added in minor version one which added a number of
421	      REQUIRED operations and could well have added a REQUIRED
422	      attribute.

424	   o  The fact that the client is to be prepared for non-support of the
425	      attribute would require specification of a default value, yet none
426	      is provided.

428	   The attribute contains two flag bits.  As discussed below, in
429	   Section 7.1, it is hard to see why two bits are required.  The
430	   implications of this issue for future NFSv4.1 specifications are
431	   discussed in Section 7.2.

433	7.1.  The Attribute Fs_charset_cap in Published NFSv4.1 Specifications

435	   We reproduce Section 14.4 of [9] below, with comments interspersed
436	   trying to make sense of what is there, in order to arrive at an
437	   appropriate replacement, to be presented in Section 7.2.  In that
438	   connection, we need to understand better a few issues:

440	   o  The use of two bits while one is clearly adequate, given the
441	      subject matter actually mentioned.

443	   o  The mention of possible "capabilities" which could not possibly be
444	      realized.

446	   o  The use of the RFC2119 keyword "SHOULD" in contexts in which this
447	      term is clearly inappropriate.

449	   Issues related to the confusion caused by mention of "UTF-8
450	   characters" and the lack of mention of Unicode will be addressed in
451	   the revision in Section 7.2 but will not be further discussed here.

453	      const FSCHARSET_CAP4_CONTAINS_NON_UTF8  = 0x1;
454	      const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8   = 0x2;

456	      typedef uint32_t        fs_charset_cap4;

458	   While it is made clear that two separate bits are to be provided,
459	   their names seem to indicate that they should be complements of one
460	   another.  As a way of understanding why two bits were specified, it
461	   is helpful to consider a possible boolean attribute as a potential
462	   replacement.  That attribute would clearly govern whether names that
463	   do not conform to the rules of UTF-8 are to be rejected, which was a
464	   "MUST" in RFC3530 [20].  Although conveying this information is
465	   clearly part of the motivation, stating so clearly might have been
466	   judged by the authors as unnecessarily provocative, given the role of
467	   IESG in arriving at the internationalization approach specified in
468	   RFC3530.

470	      Because some operating environments and file systems do not
471	      enforce character set encodings,

473	   It is clear that the ability of operating environments to enforce use
474	   of UTF-8 encoding is not an issue, since RFC3530 made this the
475	   responsibility of the server implementation.  That mandate was never
476	   followed because implementers chose not to follow it, and not because
477	   they were unable to do so.  The apparently confused statement above
478	   is best understood if one notes that its essential job is to state
479	   that the "MUST" in RFC3530 referred to above is not reasonable.

481	   However, the authors might well have felt unable to say so clearly,
482	   in light of the potential IESG reaction.

484	      NFSv4.1 supports the fs_charset_cap attribute (Section 5.8.2.11)
485	      that indicates to the client a file system's UTF-8 capabilities.

487	   The problem with the mention of (plural) capabilities is that the
488	   only capability mentioned which servers could implement is to accept
489	   strings which are not valid UTF-8.  There are other potential
490	   capabilities having to do with the implementation of canonical
491	   equivalence, but since they were not mentioned, they will not be
492	   discussed further here.

494	      The attribute is an integer containing a pair of flags.  The first
495	      flag is FSCHARSET_CAP4_CONTAINS_NON_UTF8, which, if set to one,
496	      tells the client that the file system contains non-UTF-8
497	      characters,

499	   As stated, this would mean that a server would have to keep track of
500	   a count of non-UTF-8-encoded names within the file system and change
501	   the attribute value as that count varied between zero and non-zero.
502	   Since it is most unlikely that any server would keep track of that or
503	   that any client would find it useful, we will assume that the
504	   capability to store such names is what is most likely intended.
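
   To make the cost of the literal reading concrete, the following
   Python sketch shows the bookkeeping it would imply.  The class
   LiteralReadingFs and its methods are hypothetical and are not drawn
   from any implementation; the flag value is the one quoted above.

```python
FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1


def is_valid_utf8(name: bytes) -> bool:
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


class LiteralReadingFs:
    """Hypothetical file system honoring the literal 'contains' wording."""

    def __init__(self) -> None:
        self.names: set[bytes] = set()
        self.non_utf8_count = 0  # must be maintained on every change

    def create(self, name: bytes) -> None:
        if name not in self.names:
            self.names.add(name)
            if not is_valid_utf8(name):
                self.non_utf8_count += 1

    def remove(self, name: bytes) -> None:
        if name in self.names:
            self.names.remove(name)
            if not is_valid_utf8(name):
                self.non_utf8_count -= 1

    def fs_charset_cap(self) -> int:
        # The attribute value flips as the count crosses zero.
        if self.non_utf8_count > 0:
            return FSCHARSET_CAP4_CONTAINS_NON_UTF8
        return 0
```

   Under this reading, a single create or remove of one file can change
   the attribute value seen by every client of the file system.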

506	      and the server will not convert non-UTF characters to UTF-8 if the
507	      client reads a symbolic link or directory,

509	   There is no way for the server to convert non-UTF names to UTF-8 or
510	   anything else, since it has no knowledge of the name encoding to
511	   begin with.  The alternative to treating names as UTF-8-encoded
512	   Unicode strings is to treat them as POSIX does, as uninterpreted
513	   strings of bytes.  That makes it impossible to interpret strings that
514	   do not follow the rules of UTF-8 at all, making it impossible to
515	   convert the string to UTF-8.
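
   The point can be illustrated with a short Python sketch.  The sample
   name and the list of candidate encodings are assumptions made for
   illustration; a real server has no such list to consult.

```python
# Bytes received as a component name; the client's encoding is unknown.
raw_name = b"caf\xe9"


def try_utf8(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The name is not valid UTF-8, so it cannot be taken as UTF-8 text.
print(try_utf8(raw_name))

# Each guess at the original encoding yields a different character for
# the final byte, and therefore a different UTF-8 "conversion".
candidates = {}
for codec in ("latin-1", "cp1251", "koi8_r"):
    text = raw_name.decode(codec)
    candidates[codec] = text.encode("utf-8")
    print(codec, repr(text), candidates[codec])
```

   Since every one of these conversions is equally defensible, a server
   that treats names as uninterpreted bytes has no basis for choosing
   among them.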

517	      neither will operations with component names or pathnames in the
518	      arguments convert the strings to UTF-8.

520	   As stated above, there is no way a server could ever do that.

522	      The second flag is FSCHARSET_CAP4_ALLOWS_ONLY_UTF8, which, if set
523	      to one, indicates that the server will accept (and generate) only
524	      UTF-8 characters on the file system.

526	   That is clear and so it poses no problem for a revised treatment,
527	   unlike the other flag.

529	      If FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to one,
530	      FSCHARSET_CAP4_CONTAINS_NON_UTF8 MUST be set to zero.

532	   There is no problem with this statement.  However, it does, by
533	   implication, raise the issue of what values of
534	   FSCHARSET_CAP4_CONTAINS_NON_UTF8 may be set in the case in which
535	   FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to zero.

537	      FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 SHOULD always be set to one.

539	   According to RFC2119 [1], "SHOULD" means that "there may exist valid
540	   reasons in particular circumstances to ignore a particular item, but
541	   the full implications must be understood and carefully weighed before
542	   choosing a different course".  In this context, it is unclear what
543	   these "full implications" might be given the introduction above.  The
544	   clause, "because some operating environments and file systems do not
545	   enforce character set encodings", gives one no basis for treating
546	   this as other than an unproblematic behavior variant, calling into
547	   question the use of "SHOULD".

549	   Also, consider the statement in RFC2119 that these terms (i.e. those
550	   like "SHOULD") "only be used where it is actually required for
551	   interoperation or to limit behavior which has the potential for
552	   causing harm".  Neither condition appears to be met here:

554	   o  The whole purpose of this feature is to enable interoperation and
555	      there is no basis for the implication that one particular flag
556	      value is superior to another in allowing interoperation.

558	   o  There is no basis for assuming that accepting file names that are
559	      not UTF-8-encoded Unicode has any potential for causing harm.

561	   Despite the statement in RFC2119 that "they [i.e. terms such as
562	   'SHOULD'] must not be used to try to impose a particular method on
563	   implementors", it is hard to avoid the conclusion that this is in
564	   fact the motivation for the "SHOULD", although the authors might not
565	   have had any such intention but felt that the IESG might well have
566	   such an intention.

568	7.2.  The Attribute Fs_charset_cap in Future NFSv4.1 Specifications

570	   We provide a revised version of Section 14.4 of [9] below, taking
571	   into account the issues noted in Section 7.1.  Given there was a
572	   working group consensus to adopt the confusing language discussed
573	   there, we must now adopt, by consensus, a clearer replacement that
574	   reflects the working group's intentions.  Given the passage of time
575	   and the changed context, it might not be possible to determine those
576	   intentions.  In any case, we will have to be aware of how this
577	   attribute was implemented and used, particularly with regard to the
578	   first flag, whose meaning remains obscure.

580	   The following treatment is proposed as a basis for discussion, with
581	   the understanding that it would need to be changed if it could raise
582	   interoperability issues.

584	      const FSCHARSET_CAP4_CONTAINS_NON_UTF8  = 0x1;
585	      const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8   = 0x2;

587	      typedef uint32_t        fs_charset_cap4;

589	      This attribute provides a simple way of determining whether a
590	      particular file system behaves as a UTF-8-only server and rejects
591	      file names which are not valid UTF-8 strings.  When this attribute
592	      is supported and the value returned has the
593	      FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 flag set, the error NFS4ERR_INVAL
594	      MUST be returned if any file name argument contains a string which
595	      is not a valid UTF-8 string.

597	      When this attribute is supported and the value returned has the
598	      FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 flag clear, the error
599	      NFS4ERR_INVAL will not be returned based on adherence to the rules
600	      of UTF-8.  While such file systems are generally UTF-8-unaware,
601	      this cannot be assumed, since servers are allowed (in some
602	      circumstances; it is a "SHOULD NOT") to accept non-UTF-8 names
603	      while being aware of the structure of UTF-8-conforming names, for
604	      the purposes of determining canonical equivalence, for example.
605	      See Section 6.

607	      With regard to the flag FSCHARSET_CAP4_CONTAINS_NON_UTF8, it has
608	      proved impossible to determine, from existing treatments of this
609	      attribute, any value that might be helpful here.  As a result, we
610	      are forced to assume that this flag is always the complement of
611	      FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 and that any result in which it is
612	      not is to be ignored, with the appropriate handling being the same
613	      as would apply if the attribute were not supported.

615	      When this attribute is not supported, the client can perform a
616	      LOOKUP using a name not conforming to the rules of UTF-8 and use
617	      the error returned to determine whether non-UTF-8 names are
618	      accepted.
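
   The client-side interpretation proposed above can be sketched as
   follows.  This is only an illustration under stated assumptions:
   the helper names interpret_fs_charset_cap and is_valid_utf8 are
   hypothetical, as is the use of None to stand for "attribute not
   supported".  Only the two flag values are taken from the XDR.

```python
# Flag values copied from the XDR definition above.
FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1
FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 = 0x2


def interpret_fs_charset_cap(value):
    """Classify a returned fs_charset_cap4 value (hypothetical helper).

    Returns "utf8-only", "accepts-non-utf8", or "unknown".  A value of
    None stands for "attribute not supported" (an assumption of this
    sketch, not part of the protocol).
    """
    if value is None:
        return "unknown"
    only_utf8 = bool(value & FSCHARSET_CAP4_ALLOWS_ONLY_UTF8)
    contains_non_utf8 = bool(value & FSCHARSET_CAP4_CONTAINS_NON_UTF8)
    if only_utf8 == contains_non_utf8:
        # The first flag is not the complement of the second, so the
        # result is ignored and handled as if the attribute were absent.
        return "unknown"
    return "utf8-only" if only_utf8 else "accepts-non-utf8"


def is_valid_utf8(name):
    """Return True if the byte string is well-formed UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

   In the "unknown" case, a client would fall back to the LOOKUP probe
   described in the last paragraph of the proposed text.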

620	8.  String Encoding

622	   Strings that potentially contain characters outside the ASCII range
623	   [10] are generally represented in NFSv4 using the UTF-8 encoding [7]
624	   of Unicode [11].  See [7] for precise encoding and decoding rules.

626	   Some details of the protocol treatment depend on the type of string:

628	   o  For strings that are component names, the preferred encoding for
629	      any non-ASCII characters is the UTF-8 representation of Unicode.

631	      In many cases, clients have no knowledge of the encoding being
632	      used, with the encoding done at the user level under the control
633	      of a per-process locale specification.  As a result, it may be
634	      impossible for the NFSv4 client to enforce the use of UTF-8.  The
635	      use of non-UTF-8 encodings can be problematic, since it may
636	      interfere with access to files stored using other forms of name
637	      encoding.  Also, normalization-related processing (see Section 9)
638	      of a string not encoded in UTF-8 could result in inappropriate
639	      name modification or aliasing.  In cases in which one has a non-
640	      UTF-8 encoded name that accidentally conforms to UTF-8 rules,
641	      substitution of canonically equivalent strings can change the non-
642	      UTF-8 encoded name drastically.

644	      For similar reasons, where non-UTF-8 encoded names are accepted,
645	      case-related mappings cannot be relied upon.  For this reason, the
646	      attribute case_insensitive MUST NOT be returned as TRUE for file
647	      systems which accept non-UTF-8 encoded file names.

649	      The kinds of modification and aliasing mentioned here can lead to
650	      both false negatives and false positives, depending on the strings
651	      in question, which can result in security issues such as elevation
652	      of privilege and denial of service (see [23] for further
653	      discussion).

655	   o  For strings based on domain names, non-ASCII characters MUST be
656	      represented using the UTF-8 encoding of Unicode, and additional
657	      string format restrictions may apply.  See Section 12 for details.

659	   o  The contents of symbolic links (of type linktext4 in the XDR) MUST
660	      be treated as opaque data by NFSv4 servers.  Although UTF-8
661	      encoding is often used, it need not be.  In this respect, the
662	      contents of symbolic links are like the contents of regular files
663	      in that their encoding is not within the scope of this
664	      specification.

666	   o  For other sorts of strings, any non-ASCII characters SHOULD be
667	      represented using the UTF-8 encoding of Unicode.
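
   The hazard described for component names, in which a non-UTF-8
   encoded name accidentally conforms to UTF-8 rules, can be
   illustrated with a short sketch.  The choice of ISO 8859-1 as the
   application encoding and of NFD as the server's stored form are
   assumptions made only for this example.

```python
import unicodedata

# A two-character name as written by an application using ISO 8859-1:
# LATIN CAPITAL LETTER A WITH TILDE followed by COPYRIGHT SIGN.
latin1_name = "\u00c3\u00a9".encode("latin-1")   # the bytes C3 A9

# The same bytes happen to be well-formed UTF-8: they decode as the
# single character LATIN SMALL LETTER E WITH ACUTE (U+00E9).
as_utf8 = latin1_name.decode("utf-8")

# A server that normalizes names to NFD (assumed here) would replace
# the name with "e" followed by COMBINING ACUTE ACCENT (U+0301).
stored = unicodedata.normalize("NFD", as_utf8).encode("utf-8")

# The ISO 8859-1 application now sees a different, three-character name.
seen_by_app = stored.decode("latin-1")
```

   The two-byte name has become three bytes, none of which the
   application wrote, which is the kind of drastic change referred to
   above.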

669	9.  Normalization

671	   The client and server operating environments can potentially differ
672	   in their policies and operational methods with respect to character
673	   normalization (see [11] for a discussion of normalization forms).
674	   This difference may also exist between applications on the same
675	   client.  This adds to the difficulty of providing a single
676	   normalization policy for the protocol that allows for maximal
677	   interoperability.  This issue is similar to the issues of character
678	   case where the server may or may not support case-insensitive file
679	   name matching and may or may not preserve the character case when
680	   storing file names.  The protocol does not mandate a particular
681	   behavior but allows for a range of useful behaviors.

683	   The NFSv4 protocol does not mandate the use of a particular
684	   normalization form.  A subsequent minor version of the NFSv4 protocol
685	   might specify a particular normalization form, although there would
686	   be difficulties in doing so (see Section 15 for details).  In any
687	   case, the server and client can expect that they might receive
688	   unnormalized characters within protocol requests and responses.  If
689	   the operating environment requires normalization, then the
690	   implementation will need to normalize the various UTF-8 encoded
691	   strings within the protocol before presenting the information to an
692	   application (at the client) or local file system (at the server).

694	   Server implementations MAY normalize file names to conform to a
695	   particular normalization form before using the resulting string when
696	   looking up or creating a file.  Servers MAY also perform
697	   normalization-insensitive string comparisons without modifying the
698	   names to match a particular normalization form.  Except in cases in
699	   which component names are excluded from normalization-related
700	   handling because they are not valid UTF-8 strings, a server MUST make
701	   the same choice (as to whether to normalize or not, the target form
702	   of normalization, and whether to do normalization-insensitive string
703	   comparisons) in the same way for all accesses to a particular file
704	   system.  Servers SHOULD NOT reject a file name because it does not
705	   conform to a particular normalization form, as this would deny access
706	   to clients that use a different normalization form or clients acting
707	   on behalf of applications that use a different normalization form.
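
   A normalization-insensitive comparison that leaves stored names
   unmodified might look like the following sketch.  The function name
   names_match is hypothetical, and the use of NFD as the internal
   comparison form is an arbitrary choice made for illustration.  Names
   that are not valid UTF-8 are excluded from normalization-related
   handling and compared byte for byte.

```python
import unicodedata


def names_match(a: bytes, b: bytes) -> bool:
    """Compare two component names without regard to normalization form."""
    try:
        ua, ub = a.decode("utf-8"), b.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: no normalization-related handling applies.
        return a == b
    # Canonically equivalent strings have identical NFD forms.
    return unicodedata.normalize("NFD", ua) == unicodedata.normalize("NFD", ub)
```

   Because only the comparison normalizes, a client that created a
   file using NFC and another that looks it up using NFD both reach
   the same file, and neither is rejected.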

709	10.  Case-Insensitive Processing of File Names

711	   When the server is to process file names in a case-insensitive way in
712	   a given file system, it may choose to do so in a number of ways.

714	   o  It can force all characters which have multiple forms to a common
715	      case, whether uppercase or lowercase.  Although this may cause the
716	      file name shown in the directory to be different from that
717	      specified when the file is created, these two names will be judged
718	      as equivalent when a case-insensitive comparison is used.  Such
719	      file systems are case-insensitive but not case-preserving.

721	   o  It can preserve all names, presented as valid and not subject to
722	      case-based modification, while treating two names that are
723	      equivalent when a case-insensitive comparison is used as referring
724	      to the same file.  Such file systems are both case-insensitive and
725	      case-preserving.
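
   The two server choices listed above can be contrasted with a small
   in-memory sketch.  The class CaseInsensitiveDirectory is
   hypothetical, and Python's str.casefold() (which applies Unicode
   full case folding) merely stands in for whatever case handling a
   real server implements.

```python
class CaseInsensitiveDirectory:
    """Toy directory illustrating the two case-insensitive styles."""

    def __init__(self, case_preserving):
        self.case_preserving = case_preserving
        self._entries = {}   # folded name -> name shown in the directory

    def create(self, name):
        key = name.casefold()
        if key in self._entries:
            raise FileExistsError(name)
        # A case-preserving file system keeps the name as presented;
        # otherwise every name is forced to a common (folded) case.
        self._entries[key] = name if self.case_preserving else key

    def lookup(self, name):
        """Return the name shown in the directory, or None."""
        return self._entries.get(name.casefold())
```

   With case_preserving set, creating "ReadMe.txt" and looking up
   "README.TXT" returns "ReadMe.txt".  Without it, the same lookup
   returns "readme.txt".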

727	   When a server implements case-insensitive file name handling, it is
728	   necessary that clients do so as well.  For example, if a client
729	   possessing the cached contents of a directory notes that the file
730	   "a" does not exist, it cannot immediately act on that presumed non-
731	   existence without checking for the potential existence of "A" as
732	   well.  As a result, clients need to be able to provide case-
733	   insensitive name comparisons, irrespective of whether the server
734	   handling is case-preserving or not.
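
   The cached-directory example above can be sketched as follows.  The
   function may_assume_absent is hypothetical, and str.casefold() is
   used only as a stand-in for the server's case handling, which the
   client may not know in detail.

```python
def may_assume_absent(cached_names, name, server_case_insensitive):
    """Decide whether a complete cached listing shows 'name' is absent."""
    if not server_case_insensitive:
        return name not in cached_names
    # On a case-insensitive file system, any cached entry that differs
    # from 'name' only in case refers to the same file.
    folded = name.casefold()
    return all(entry.casefold() != folded for entry in cached_names)
```

   With a cached listing containing "A", the name "a" cannot be
   treated as absent when the server is case-insensitive.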

736	   Because case-insensitive name comparisons are not always as
737	   straightforward as the above example suggests, the client, if it is
738	   to emulate the server's name handling, would need information about
739	   how certain cases are to be dealt with.  In cases in which that
740	   information is unavailable, the client needs to avoid making
741	   assumptions about the server's handling, since it will be unaware of
742	   the Unicode version implemented by the server, or many of the details
743	   of specific issues that might need to be addressed differently by
744	   different server file systems in implementing case-insensitive name
745	   handling.

747	   Many of the problematic issues with regard to the case-insensitive
748	   handling of names are discussed in Section 5.18 of the Unicode
749	   Standard [12] which deals with case mapping.  While we need to
750	   address all of these issues as well, our approach will not be exactly
751	   the same.

753	   o  Since the client will be doing case-insensitive comparisons,
754	      issues that apply only to uppercasing or lowercasing do not have
755	      the same significance.

757	   o  Many clients will have to operate correctly even in the absence of
758	      detailed information about the specifics of server case-mapping or
759	      the version of Unicode implemented by the server.

761	   o  Clients will have to accommodate server behaviors not anticipated
762	      by the Unicode Specification since it might be that neither the
763	      server nor the client would have any relevant locale knowledge
764	      when file names are processed.

766	   Another source of information about case-folding, and indirectly
767	   about case-insensitive comparisons, is the case-folding text file
768	   which is part of the Unicode Standard [13].  This file contains, for
769	   each Unicode character that can be uppercased or lowercased, a single
770	   character, or, in some cases a string of characters of the other
771	   case.  For characters in capital case, the lowercase counterpart is
772	   given.  Each of the mappings is characterized as being one of four
773	   types:

775	   o  Common case folding, denoted by a status field of "C".  These are
776	      used for mapping where a single character can be mapped to a
777	      single character of another case.  These are always valid with one
778	      potential exception being the mappings of LATIN CAPITAL LETTER I
779	      to LATIN SMALL LETTER I and vice versa, which might be superseded
780	      by the T-type mappings associated with some Turkic languages.

782	   o  Full case folding, denoted by a status field of "F".  These are
783	      used for mappings in which a single character is mapped to a
784	      multi-character string of a different case.

786	   o  Simple case folding, denoted by a status field of "S".  These
787	      provide additional single-character-to-single-character mappings
788	      which might be used when there is also an F-type mapping of the
789	      same character.  In the case of case folding, this is an
790	      alternative to the corresponding F-type, although, for the
791	      purposes of case-insensitive string comparison, it is possible for
792	      both to be considered valid at the same time.

794	   o  Special case foldings for Turkic languages, denoted by a status
795	      field of "T".  These consist of the invertible case mappings
796	      between LATIN SMALL LETTER I (U+0069) and LATIN CAPITAL LETTER I
797	      WITH DOT ABOVE (U+0130) and between LATIN CAPITAL LETTER I
798	      (U+0049) and LATIN SMALL LETTER DOTLESS I (U+0131).  The
799	      relationship between these mappings and the C-type mappings for
800	      LETTER I is discussed below in item EX8.
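
   As an illustration of the four mapping types, the following sketch
   parses a few lines written in the format of the case-folding file
   (code point, status, mapping, then a comment).  The function
   parse_case_folding and the layout of its result are hypothetical;
   the sample covers only the characters discussed in this section.

```python
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_case_folding(text):
    """Return {status: {source character: folded string}}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        source = chr(int(code, 16))
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        table.setdefault(status, {})[source] = target
    return table
```

   In the resulting table, U+1E9E appears under both "F" (mapped to
   "ss") and "S" (mapped to U+00DF), and LATIN CAPITAL LETTER I appears
   under both "C" and "T", which are the overlaps discussed above.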

802	   While the case mapping section does discuss case-insensitive string
803	   comparisons, and describes a procedure for constructing equivalence
804	   classes of Unicode characters, the description does not deal clearly
805	   with the effect of F-type mappings.  There are a number of problems
806	   with dealing with F-type mappings for case folding and basing case-
807	   insensitive string comparisons on those mappings, particularly in
808	   situations, such as file systems, in which extensive processing of
809	   strings is unlikely to be possible.

811	   o  Mappings from single characters to multi-character strings are,
812	      for case-folding purposes, not invertible.  However, case-
813	      insensitive name comparison, by its nature, requires invertible
814	      mappings, in which a multi-character string is mapped to a single
815	      character of a different case, which is not compatible with any
816	      existing simple case-mapping models.

818	   o  Scanning of names for multi-character sequences might well be too
819	      complicated, especially since such sequences might overlap in
820	      complicated ways.

822	   o  Case foldings which map single characters to multi-character
823	      sequences (see item EX4 below for an important example) would
824	      give rise, because of the invertibility of case mappings when used
825	      to determine case-insensitive string equivalence, to very large
826	      sets of equivalent strings.  For example, a string of eight copies
827	      of the letter S would give rise to a set of 256 equivalent strings
828	      plus over two thousand others when the German SHARP S characters
829	      discussed in item EX4 are included.
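
   The size of the equivalence set in the last item can be checked
   with a short calculation.  The sketch below assumes that each "s"
   position may be written as "s" or "S" and that each adjacent pair
   may instead be written as U+00DF or U+1E9E.  The function names are
   hypothetical, and the brute-force check relies on Python's
   str.casefold(), which applies Unicode full case folding.

```python
from itertools import product


def count_equivalents(n):
    """Count strings that fold to n copies of "s"."""
    prev, cur = 1, 2          # counts for lengths 0 and 1
    for _ in range(n - 1):
        # A string starts with "s"/"S" (covering one position) or with
        # one of the two sharp s characters (covering two positions).
        prev, cur = cur, 2 * cur + 2 * prev
    return cur if n >= 1 else prev


def brute_force(n):
    """Same count by enumeration; practical only for small n."""
    target = "s" * n
    alphabet = "sS\u00df\u1e9e"
    return sum(
        1
        for k in range(1, n + 1)
        for chars in product(alphabet, repeat=k)
        if "".join(chars).casefold() == target
    )
```

   For eight copies, count_equivalents(8) gives 2448: the 256 strings
   built only from "s" and "S", plus 2192 strings containing at least
   one sharp s character.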

831	   Despite these potential difficulties, case mappings involving multi-
832	   character sequences can be reversed when used as a basis for case-
833	   insensitive string comparisons and incorporated into a set of
834	   equivalence classes on name strings.

836	   o  Case-insensitive servers MAY do either case-mapping to a chosen
837	      case or case-insensitive string comparisons when providing a case-
838	      preserving implementation.  In either case, it MAY include F-type
839	      mappings, which map a single character to a multi-character
840	      string.  However, only in the case in which it is doing case-
841	      insensitive string comparison will it use the inverse of F-type
842	      mappings, in which a multi-character string is mapped to a single
843	      character of a different case.

845	      In these cases, the server can choose to use either a C-type
846	      mapping or an F-type mapping, or both, when both exist.  Similarly
847	      the server may choose to implement the C-type mappings of LATIN
848	      CAPITAL LETTER I to LATIN SMALL LETTER I and vice versa, the
849	      corresponding T-type mappings or both, although using only the
850	      second of these is NOT ALLOWED, unless there is a means of
851	      informing the client that it has been chosen.

853	   o  The client, when informed of the details of the server's handling
854	      of case, has the ability to efficiently implement an appropriate
855	      case-insensitive name comparison compatible with that of the
856	      server.  This includes the ability to handle mappings between
857	      single characters and multi-character strings.

859	   o  Implementation of case-insensitive name comparisons will typically
860	      require a case-insensitive name hash.

862	10.1.  Implementing Case-Insensitive Comparison of File Names

864	   Implementing case-insensitive string comparisons based on equivalence
865	   classes including multi-character strings can be performed as
866	   described below.  This algorithm requires that if there is more than
867	   one multi-character string within a given equivalence class, they
868	   must all be equivalent, with any equivalences derivable from case-
869	   insensitive string equivalence using single-character equivalence
870	   classes.

872	   Although other sources are possible (see items EX2 and EX3 in
873	   Section 10.2), multi-character sequences often appear in case-
874	   insensitive equivalence classes as the result of the canonical
875	   decomposition of one or more precomposed characters as elements of a
876	   case-insensitive equivalence class.

878	   While the algorithm described in this section can deal with certain
879	   case-based equivalences deriving from canonical decomposition, it is
880	   not capable of providing general handling of the combination of
881	   canonical equivalence and case-based equivalence.  While this can be
882	   addressed by normalizing strings before doing case-insensitive
883	   comparison, it is more efficient to do a general form-insensitive and
884	   case-insensitive string comparison in a single step as described in
885	   Appendix B.

887	   The following tables would be used by the comparison algorithm
888	   presented below.

890	   o  For each possible character value, the associated equivalence
891	      class for case-insensitive comparison will be identified.

893	   o  For each such equivalence class, the hash value contribution will
894	      be provided.  For equivalence classes that do not include multi-
895	      character strings, including those with only a single member,
896	      this will be the hash value contribution of one particular variant
897	      (usually lower case) of the character.

899	   o  In the case of equivalence classes that do include multi-character
900	      strings, the hash value contribution must be equivalent to the
901	      combined contribution of each character within the multi-character
902	      string.  In addition, for each such equivalence class, the length
903	      of the multi-character string will be provided together with a
904	      pointer to an array describing the multi-character string, most
905	      probably presenting each character as an equivalence class id.

907	   Case-insensitive comparison proceeds as follows:

909	   o  Implementation of case-insensitive name comparisons will typically
910	      require a case-insensitive name hash using the tables described
911	      above.  If such a hash value is kept for all cached names,
912	      comparisons of hashes can be used instead of the detailed
913	      comparison set forth below.  Using such hash comparisons, a large
914	      set of potentially equivalent names can be excluded based on the
915	      occurrence of hash mismatches, since case-equivalent names would
916	      have the same hash value.

918	   o  For names with matching hash values, a detailed case-insensitive
919	      comparison will be necessary.  This can proceed character-by-
920	      character or byte-by-byte.  However, in the byte-by-byte case,
921	      processing in the event of a mismatch must start at the start of
922	      the current character, rather than the byte at which the
923	      difference was detected.

925	   o  In cases in which there is a mismatch, the associated equivalence
926	      classes will be compared.  When these are identical, indicating
927	      the case equivalence of the two characters, the comparison of the
928	      two strings continues at the next character of each string.

930	   o  When the two equivalence classes are not identical, further
931	      comparisons are needed to determine if a single character within
932	      one string matches (except for case) a multi-character string
933	      within the other.  For each of two equivalence classes being
934	      compared that include a multi-character string, the check below
935	      must be made to determine whether the multi-character string at
936	      the corresponding position of the other string being compared is
937	      within the current equivalence class.  If neither of the two
938	      equivalence classes includes multi-character strings, the
939	      comparison terminates with a mismatch indication.

941	   o  For each equivalence class that does include a multi-character
942	      string (there might be one or two), a scan needs to be made to see
943	      if the characters at the current position of the other string
944	      match (except for case) the multi-character string which is
945	      included in the current equivalence class.  If this check
946	      succeeds, for either equivalence class, the comparison of the two
947	      strings continues at the next character of each string.  In the
948	      event of failure, the same sort of comparison is done using the
949	      other current equivalence class, if it includes multi-character
950	      strings.  Once this check fails for all equivalence classes that
951	      include multi-character strings, the comparison terminates with a
952	      mismatch indication.
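
   The tables and comparison steps described in this section can be
   sketched as follows.  This is a minimal illustration under stated
   assumptions, not a definitive implementation: the names EXPANSION,
   class_of, name_hash and names_equal are hypothetical, single-
   character equivalence classes are approximated by str.lower()
   rather than taken from real tables, and only two classes with
   multi-character strings are defined.  The comparison proceeds
   character by character.

```python
# Hypothetical table: equivalence class id -> class ids of the
# multi-character string belonging to that class.
EXPANSION = {
    "\u00df": ("s", "s"),        # sharp s is classed with "ss"
    "\ufb04": ("f", "f", "l"),   # the ffl ligature is classed with "ffl"
}


def class_of(ch):
    """Equivalence class id of one character (approximated by lower())."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def name_hash(name):
    """Case-insensitive hash; a class with a multi-character string
    contributes the combined contribution of that string."""
    h = 0
    for ch in name:
        cid = class_of(ch)
        for part in EXPANSION.get(cid, (cid,)):
            h = (h * 31 + ord(part)) & 0xFFFFFFFF
    return h


def _match_expansion(cid, other, pos):
    """Length of class cid's multi-character string if it occurs
    (except for case) at other[pos:], else 0."""
    exp = EXPANSION.get(cid)
    if exp is None or pos + len(exp) > len(other):
        return 0
    if all(class_of(other[pos + k]) == part for k, part in enumerate(exp)):
        return len(exp)
    return 0


def names_equal(a, b):
    """Case-insensitive comparison following the steps above."""
    if name_hash(a) != name_hash(b):
        return False                     # excluded by hash mismatch
    i = j = 0
    while i < len(a) and j < len(b):
        ca, cb = class_of(a[i]), class_of(b[j])
        if ca == cb:
            i, j = i + 1, j + 1
            continue
        n = _match_expansion(ca, b, j)   # a[i] against a run in b
        if n:
            i, j = i + 1, j + n
            continue
        n = _match_expansion(cb, a, i)   # b[j] against a run in a
        if n:
            i, j = i + n, j + 1
            continue
        return False
    return i == len(a) and j == len(b)
```

   With these tables, "Stra\u00dfe" and "STRASSE" compare as equal.
   The sketch does not handle overlapping sequences: a name beginning
   "s" followed by U+00DF and one beginning U+00DF followed by "s" have
   the same hash but are reported as different.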

954	10.2.  Important Examples of Case-insensitive Handling of File Names

956	   In this section, we discuss many of the interesting and/or
957	   troublesome issues that the need for case-insensitive handling gives
958	   rise to in a fully internationalized environment.  Many of these are
959	   also discussed in [12].  However, our treatment of these issues,
960	   while not inconsistent with that in [12], differs significantly for a
961	   number of reasons:

963	   o  Our primary focus is on case-insensitive string comparison rather
964	      than on case mapping per se.  While such comparison is natural
965	      for the client and allowed for servers, its greater flexibility
966	      makes it important to understand its capabilities in dealing with
967	      potentially troublesome issues in providing case-insensitive file
968	      name handling.

970	   o  Because a case mapping model forces the specification of a single
971	      case mapping result when there are multiple potentially valid
972	      results, there are inevitably cases in which the result chosen is
973	      inappropriate for some users.  These are cases in which F-type and
974	      S-type mappings are present and in which C-type and T-type
975	      mappings conflict.  Normally, an appropriate choice is selected by
976	      use of the locale, but in a filesystem environment, valid locale
977	      information might not be present.  As a result, case-insensitive
978	      string comparison, which does not force such case mapping choices,
979	      will be more desirable.

981	   The examples below present common situations that go beyond the
982	   simple invertible case mappings of Latin characters and the
983	   straightforward adaptation of that model to Greek and Cyrillic.  In
984	   EX4 and EX5 we have case-based equivalence classes including multi-
985	   character strings not derived from canonical equivalences while for
986	   EX7 and EX8 all multi-character strings are derived from canonical
987	   equivalences.  In addition, EX1, EX2, EX3 and EX6 discuss other
988	   situations in which an equivalence class has more than two elements.
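
   Several of the equivalence classes discussed in the examples can be
   explored with a short sketch.  It assumes that Python's
   str.casefold(), which applies Unicode full case folding, is an
   acceptable stand-in for the class construction discussed here; the
   EXAMPLES table and the function is_one_class are hypothetical.

```python
# Members of some of the equivalence classes discussed in the examples.
EXAMPLES = {
    "EX1": ["\u01f1", "\u01f2", "\u01f3"],               # DZ digraphs
    "EX2": ["\u2126", "\u03a9", "\u03c9"],               # ohm and omega
    "EX4": ["\u00df", "\u1e9e", "ss", "sS", "Ss", "SS"],  # sharp s
    "EX6": ["\u0345", "\u03b9", "\u0399"],               # ypogegrammeni
    "EX7": ["\u03a3", "\u03c3", "\u03c2"],               # sigma forms
}


def is_one_class(members):
    """True if all members share a single case-folded form."""
    return len({m.casefold() for m in members}) == 1
```

   Each list in EXAMPLES folds to a single value, so is_one_class
   returns True for all five entries without any locale information.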

990	   EX1:  Certain digraph characters such as LATIN SMALL LETTER DZ
991	         (U+01F3) have additional case variants to consider such as the
992	         titlecase character LATIN CAPITAL LETTER D WITH SMALL LETTER Z
993	         (U+01F2) in addition to the uppercase LATIN CAPITAL LETTER DZ
994	         (U+01F1).  While the titlecased variant would not appear in
995	         names in case-insensitive non-case-preserving file systems,
996	         case-insensitive string comparison has no problem in treating
997	         these three characters as within the same equivalence class.

999	         This equivalence class can be derived from only C-type
1000	         mappings.  The possibility of mapping these characters to the
1001	         two-character sequences they represent is not a troublesome
1002	         issue since that would be derived from a compatibility
1003	         equivalence, rather than a canonical equivalence, and there is
1004	         no F-type mapping making it an option.

1006	   EX2:  To deal with the case of the OHM SIGN (U+2126) which is
1007	         essentially identical to the GREEK CAPITAL LETTER OMEGA
1008	         (U+03A9), one can construct an equivalence class consisting of
1009	         OHM SIGN (U+2126), GREEK CAPITAL LETTER OMEGA (U+03A9), and
1010	         GREEK SMALL LETTER OMEGA (U+03C9).

1012	         This equivalence class can be derived from only C-type
1013	         mappings.  Both OHM SIGN (U+2126) and GREEK CAPITAL LETTER
1014	         OMEGA (U+03A9) lowercase to GREEK SMALL LETTER OMEGA (U+03C9),
1015	         while that character only uppercases to GREEK CAPITAL LETTER
1016	         OMEGA (U+03A9).

1018	   EX3:  To deal with the case of the ANGSTROM SIGN (U+212B) which is
1019	         essentially identical to LATIN CAPITAL LETTER A WITH RING ABOVE
1020	         (U+00C5), one can construct an equivalence class consisting of
1021	         ANGSTROM SIGN (U+212B), LATIN CAPITAL LETTER A WITH RING ABOVE
1022	         (U+00C5), LATIN SMALL LETTER A WITH RING ABOVE (U+00E5),
1023	         together with the two-character sequences involving LATIN
1024	         CAPITAL LETTER A (U+0041) or LATIN SMALL LETTER A (U+0061)
1025	         followed by COMBINING RING ABOVE (U+030A).

1027	         This equivalence class can be derived from C-type mappings
1028	         together with the ability to map characters to canonically
1029	         equivalent strings.  Both ANGSTROM SIGN (U+212B) and LATIN
1030	         CAPITAL LETTER A WITH RING ABOVE (U+00C5) lowercase to LATIN
1031	         SMALL LETTER A WITH RING ABOVE (U+00E5), while the latter only
1032	         uppercases to LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5).

1034	   EX4:  In some cases, case mapping of a single character will result
1035	         in a multi-character string.  For example, the German character
1036	         LATIN SMALL LETTER SHARP S (U+00DF) would be uppercased to
1037	         "SS", i.e. two copies of LATIN CAPITAL LETTER S (U+0053).  On
1038	         the other hand, in some situations, it would be uppercased to
1039	         the character LATIN CAPITAL LETTER SHARP S (U+1E9E), using an
1040	         S-type mapping, referred to as an instance of "Tailored
1041	         Casing".  Unfortunately, in the context of a file system, there
1042	         is unlikely to be available information that provides guidance
1043	         about which of these case mappings should be chosen.  However,
1044	         the use of case-insensitive mappings with larger equivalence
1045	         classes often provides handling that is acceptable to a wider
1046	         variety of users.  In this case, German-speakers get the
1047	         mapping they expect while those unfamiliar with these
1048	         characters only see them when they access a file whose name
1049	         contains them.

1051	         It appears that if the construction of case-based equivalence
1052	         classes were generalized to include multi-character sequences,
1053	         then all of LATIN SMALL LETTER SHARP S (U+00DF), LATIN CAPITAL
1054	         LETTER SHARP S (U+1E9E), "ss", "sS", "Ss", and "SS" would
1055	         belong to the same equivalence class and could be handled by
1056	         the general algorithm described in Section 10.1, as well as by
1057	         code specifically written to deal with this particular issue.

1059	   EX5:  Other ligatures, such as LATIN SMALL LIGATURE FFL (U+FB04),
1060	         could be handled similarly by this algorithm, if there were
1061	         felt a need to do so.  However, because the decomposition of
1062	         this character into the string consisting of the three letters
1063	         LATIN SMALL LETTER F (U+0066), LATIN SMALL LETTER F (U+0066),
1064	         LATIN SMALL LETTER L (U+006C), is a compatibility equivalence,
1065	         and the F-type mapping of this ligature to the three
1066	         constituent letters is to be treated as optional,
1067	         implementations can choose either to treat this character as
1068	         having no uppercase equivalent or treat it as part of a larger
1069	         equivalence class (including "ffl", "ffL", "fFl", etc.).

1071	   EX6:  The character COMBINING GREEK YPOGEGRAMMENI (U+0345), also
1072	         known as "iota-subscript" requires special handling when
1073	         uppercasing and lowercasing.  While the description of the
1074	         appropriate handling for this character, in the case mapping
1075	         section, is focused on multi-character sequences representing
1076	         diphthongs, case-insensitive comparisons can be performed
1077	         without consideration of multi-character sequences.  This can
1078	         be done by assigning COMBINING GREEK YPOGEGRAMMENI (U+0345),
1079	         GREEK SMALL LETTER IOTA (U+03B9), and GREEK CAPITAL LETTER IOTA
1080	         (U+0399) to the same equivalence class, even though the first
1081	         of these is a combining character and the others are not.

1083	   EX7:  In some cases context-dependent case mapping is required.  For
1084	         example, GREEK CAPITAL LETTER SIGMA (U+03A3) lowercases to
1085	         GREEK SMALL LETTER SIGMA (U+03C3) if it is followed by another
1086	         letter and to GREEK SMALL LETTER FINAL SIGMA (U+03C2) if it is
1087	         not.

1089	         Despite this, case-insensitive comparisons can be implemented,
1090	         by considering all of these characters as part of the same
1091	         equivalence class, without any context-dependence, and this
1092	         equivalence class can be derived using only C-type mappings.

1094	   EX8:  In most languages written using Latin characters, the uppercase
1095	         and lowercase varieties of the letter "I" differ in that only
1096	         the lowercase character has a dot.  In a number of Turkic
1097	         languages, there are two distinct characters derived from "I"
1098	         which differ only with regard to the presence or absence of a
1099	         dot, so that there are both capital and small i's, each having
1100	         dotted and dotless variants.  Within such languages, the dotted
1101	         and dotless I's represent different vowel sounds and are
1102	         treated as separate characters with respect to case mapping.
1103	         The uppercase of LATIN SMALL LETTER I (U+0069) is LATIN CAPITAL
1104	         LETTER I WITH DOT ABOVE (U+0130), rather than LATIN CAPITAL
1105	         LETTER I (U+0049).  Similarly the lowercase of LATIN CAPITAL
1106	         LETTER I (U+0049) is LATIN SMALL LETTER DOTLESS I (U+0131)
1107	         rather than LATIN SMALL LETTER I (U+0069).

1109	         When doing case mapping, the server must choose to uppercase
1110	         LATIN SMALL LETTER I (U+0069) to either LATIN CAPITAL LETTER I
1111	         (U+0049), based on a C-type mapping, or LATIN CAPITAL LETTER I
1112	         WITH DOT ABOVE (U+0130), based on a T-type mapping.  The former
1113	         is acceptable to most people but confusing to speakers of the
1114	         Turkic languages in question since the case mapping changes the
1115	         character to represent a different vowel sound.  On the other
1116	         hand, the latter mapping seemingly inexplicably results in a
1117	         character many users have never seen before.  Normally such
1118	         choices are dealt with based on a locale but, in a file system
1119	         environment, no locale information may be available.

1121	         In the context of case-insensitive string comparison, it is
1122	         possible to create a larger equivalence class, including all of
1123	         the letters LATIN SMALL LETTER I (U+0069), LATIN CAPITAL LETTER
1124	         I (U+0049), LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130),
1125	         LATIN SMALL LETTER DOTLESS I (U+0131) together with the two-
1126	         character string consisting of LATIN CAPITAL LETTER I (U+0049)
1127	         followed by COMBINING DOT ABOVE (U+0307).
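
   The equivalence classes described in examples EX4 through EX8 can be
   illustrated with a minimal Python sketch.  It assumes that Python's
   built-in str.casefold(), which applies full Unicode case folding
   (C-type and F-type mappings), is an acceptable stand-in for the
   tables discussed in this document.  The function names and the two
   replacements that widen the class for the dotted and dotless i's
   are hypothetical illustrations, not part of any NFSv4
   implementation.

```python
def case_class_key(name: str) -> str:
    """Return a representative of the case-based equivalence class."""
    # Full case folding: covers sharp s (EX4), the ffl ligature (EX5),
    # iota-subscript (EX6), and the three sigmas (EX7).
    folded = name.casefold()
    # Assumption: widen the class as suggested in EX8.  U+0130 and
    # "I" + U+0307 both fold to "i" + U+0307; collapse that to "i".
    folded = folded.replace("i\u0307", "i")
    # U+0131 (dotless i) has no C-type folding; map it to "i" as well.
    folded = folded.replace("\u0131", "i")
    return folded


def same_case_class(a: str, b: str) -> bool:
    """True if a and b compare equal under this sketch's relation."""
    return case_class_key(a) == case_class_key(b)
```

   Note that this sketch also places "i" followed by COMBINING DOT
   ABOVE in the same class, which is slightly broader than the class
   listed in EX8.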

1129	11.  Internationalization-related Processing of File Names by Clients

1131	   Given the way that internationalization is addressed within the NFSv4
1132	   protocols, clients and applications accessing NFS files can
1133	   generally remain unaware of the specific type of
1134	   internationalization-related processing implemented by the server.
1135	   For example, although a server MAY store all file names according to
1136	   the rules appropriate to a particular normalization form, it MUST NOT
1137	   reject names solely because they are not encoded using this
1138	   normalization form, allowing the clients and applications to avoid
1139	   knowledge of normalization choices.

1141	   However, as has been pointed out in [25], there are situations in
1142	   which clients implementing local optimizations use the saved contents
1143	   of directories fetched from the server, making it necessary that the
1144	   client's and the server's handling of internationalization-related
1145	   name mapping issues be in concord.  There are two basic ways this
1146	   issue can be addressed:

1148	   o  Where the protocol has not defined a means whereby the client can
1149	      obtain information about the details of internationalized name
1150	      handling implemented within the server, the client can avoid
1151	      conflict with the server by limiting its use of local
1152	      optimizations.  While positive name caching can be used without
1153	      adverse effects, negative name caching has to be limited to avoid
1154	      situations in which a given name is not present but an equivalent
1155	      one may exist, as far as the server is concerned.  This situation,
1156	      which applies to all current NFSv4 protocols, is discussed in
1157	      Section 11.2.

1159	   o  The client can be provided complete information about the server's
1160	      internationalization-related name handling (typically implemented
1161	      within the server-based file system).  This situation, which could
1162	      be implemented in later NFSv4 minor versions, or in an extension
1163	      to an existing extensible minor version, is discussed in
1164	      Section 11.3.

1166	   o  Note that when case-insensitive handling of file names is
1167	      implemented by a server-side filesystem, further complications can
1168	      arise.  For the most part, these are addressed in Sections 11.2
1169	      and 11.3 by treating the particulars of case-handling as another
1170	      element of the name handling implemented by the server.  However,
1171	      some of the specific complexities are addressed separately in
1172	      Section 10.

1174	11.1.  Server Restrictions to Deal with Lack of Client Knowledge

1176	   There are a number of restrictions, not previously specified in
1177	   RFC7530 [3], on server implementation of internationalized file name
1178	   handling.  These restrictions apply to both case-sensitive and case-
1179	   insensitive file systems and are designed to limit the options that
1180	   servers have in choosing server-side internationalized file name
1181	   handling so as to enable the clients to either duplicate that
1182	   handling or limit it to avoid relying on cases in which the proper
1183	   handling cannot be determined or duplicated by the client.

1185	   o  The canonical equivalence relation implemented by the server for
1186	      each internationalization-aware filesystem MUST match that defined
1187	      by some particular Unicode version equal to or later than version
1188	      4.0.

1190	   o  The case-equivalence relationship implemented by the server for
1191	      each case-insensitive filesystem MUST include all C-type case
1192	      mappings included by the particular Unicode version whose
1193	      canonical equivalence relation is implemented by the server, with
1194	      the possible exception of those conflicting with T-type case
1195	      mappings.

1198	   o  In cases in which the server provides no way of determining the
1199	      details of the case-equivalence relationship implemented by the
1200	      server for a particular file system, that mapping must include all
1201	      C-type case mappings included by the particular Unicode version
1202	      whose canonical equivalence relation is implemented by the server,
1203	      i.e. it MUST map between LATIN SMALL LETTER I (U+0069) and LATIN
1204	      CAPITAL LETTER I (U+0049).

1206	11.2.  Client Processing of File Names for Current NFSv4 Protocols

1208	   The existing minor versions, NFSv4.0 [3], NFSv4.1 [21], and NFSv4.2
1209	   [4], have very limited facilities allowing a client to get
1210	   information about the server's internationalization-related file name
1211	   handling.  Because these protocols were all defined when it was
1212	   assumed that the server's internationalized file name handling could
1213	   be specified in great detail, there was no provision for attributes
1214	   defining the server's choices.  As a result, the information
1215	   available to the client is quite limited:

1217	   o  The client can determine that the server is not performing
1218	      internationalized file name processing.  It can do this by looking
1219	      up a file name using a string which is not valid UTF-8, concluding
1220	      that if the LOOKUP is not rejected on that basis, then the file
1221	      system is not internationalization-aware, allowing the client to
1222	      ignore the potential difficulties which server-based
1223	      internationalized file name processing might give rise to.

1225	   o  The client can use the optional per-fs attributes case_insensitive
1226	      and case_preserving to determine how the server deals with
1227	      character case for a particular file system.  When one of these
1228	      attributes is not supported by a particular file system, the
1229	      client treats the attribute as if it were false.

1231	   When a file system is internationalization-unaware, the client can
1232	   use both positive and negative name caching, without any issues
1233	   arising from the potential for conflict between distinct file names
1234	   that would be considered equivalent by the server.  In other cases,
1235	   the use of negative name caching is more restricted.  The issues
1236	   with regard to case-sensitive and case-insensitive file systems
1237	   are discussed separately below.  In each case, the client has
1238	   a range of choices trading off forgone optimization opportunities
1239	   against the difficulty of implementation while avoiding negative
1240	   consequences arising from the fact that certain details of the
1241	   server's name handling are not known to it.

1243	   In the case of case-sensitive file systems, the uncertainty to be
1244	   dealt with concerns the version of Unicode implemented by the server,
1245	   given that different versions may have different canonical
1246	   equivalence relationships.  However, whether the server implements a
1247	   particular normalization form or implements form-insensitive file
1248	   name matching has no effect on client behavior.  In light of the
1249	   uncertainty created by the lack of knowledge of the precise Unicode
1250	   version used by the server to implement its canonical equivalence
1251	   relation, the following possibilities, in order of increasing
1252	   value (and difficulty of implementation), should be considered.

1254	   A1:  The client can simply decline to implement optimizations based
1255	        on negative name caching on internationalization-aware file
1256	        systems.

1258	        While this might have a negative effect on performance, it might
1259	        be the best option for clients not heavily used to access
1260	        internationalization-aware filesystems, or where, due to a lack
1261	        of directory delegation support, the client has no assurance
1262	        that it will be notified of the invalidation of a previous
1263	        assumption that a particular file does not exist.

1265	   A2:  Relatively simple name filtering can exclude the names for which
1266	        negative name caching might cause difficulties.  For example,
1267	        the client could scan file names for characters whose presence
1268	        might pose difficulties and allow negative name caching only for
1269	        strings known not to contain such characters.  Because the
1270	        Unicode version used by the server file system is not known,
1271	        this treatment would be limited to strings containing only
1272	        characters defined in the earliest version of Unicode which
1273	        could be supported, that is, Unicode 4.0.

1275	        One simple way for a client to provide such filtering would be
1276	        to establish an upper limit (e.g.  U+00ff) and disallow negative
1277	        name caching for strings containing characters above that value
1278	        or characters below that value that might cause there to be
1279	        canonically equivalent strings on the server.  A simple mask
1280	        could be used to allow each character to be examined allowing
1281	        composed and combining characters to be identified together with
1282	        code points unassigned in Unicode 4.0.

1284	        This approach would allow negative name caching to be disallowed
1285	        for strings containing those characters while allowing it for
1286	        other strings that do not.  A larger limit (and a corresponding
1287	        mask) would make sense for clients used to access many file
1288	        names containing characters from non-Latin alphabets.

1290	   A3:  A client might implement its own internationalized file name
1291	        handling paralleling that of the server.  Because the Unicode
1292	        version used by the server filesystem is unknown, strings for
1293	        which it is possible that the canonically equivalent string
1294	        might be different depending on the version of Unicode
1295	        implemented by the server will have to be identified and
1296	        excluded from using negative name caching.  This would require
1297	        that strings containing code points unassigned in Unicode
1298	        version 4.0, and those denoting combining characters that could
1299	        be parts of precomposed characters added to later versions of
1300	        Unicode be excluded from negative name caching.  The necessary
1301	        filtering could apply to all potential code points although
1302	        clients might choose to simplify implementation by excluding
1303	        strings containing code points beyond a certain point (e.g.
1304	        U+FFFF).

1306	        When a client implements internationalized name handling, it
1307	        needs to be able to detect when the apparent absence of a file
1308	        within a directory is contradicted by the occurrence of a file
1309	        with a distinct, but canonically equivalent, name.  In order to
1310	        efficiently find such names, when they exist, a client typically
1311	        needs to implement a form of name hashing which always produces
1312	        the same result for two canonically equivalent names.  This can
1313	        be done by making the contribution of any character to the name
1314	        hash, equal to the contribution of the corresponding canonical
1315	        decomposition string.
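
   The filtering of item A2 and the name hashing of item A3 can be
   sketched as follows.  This is a minimal illustration only: it uses
   Python's unicodedata module, whose tables reflect the Unicode
   version of the interpreter rather than Unicode 4.0, and the
   function names, the default limit of U+00FF, and the choice of
   SHA-256 are all assumptions made for the example.

```python
import hashlib
import unicodedata

LIMIT = 0xFF  # hypothetical upper limit, as in the A2 example


def negative_cache_allowed(name: str, limit: int = LIMIT) -> bool:
    """Return True if negative name caching looks safe for this name."""
    for ch in name:
        if ord(ch) > limit:
            return False  # above the limit: not examined, so excluded
        if unicodedata.category(ch) == "Cn":
            return False  # unassigned code point
        if unicodedata.combining(ch):
            return False  # combining character
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith("<"):
            return False  # precomposed: has a canonical decomposition
    return True


def canonical_name_hash(name: str) -> str:
    """Hash that is equal for canonically equivalent names (item A3)."""
    # Each character contributes through its canonical decomposition.
    nfd = unicodedata.normalize("NFD", name)
    return hashlib.sha256(nfd.encode("utf-8")).hexdigest()
```

   A client following this sketch would cache the absence of
   "Makefile" but not that of a name containing LATIN SMALL LETTER E
   WITH ACUTE (U+00E9), since the server might hold the same name in
   decomposed form.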

1317	   In the case of case-insensitive file systems, the uncertainty to be
1318	   dealt with includes the version of Unicode implemented by the server
1319	   as well as the details of the possible case-handling implemented by
1320	   the server.  In addition to the fact that different Unicode versions
1321	   may have different canonical equivalence relationships, the server
1322	   may implement different approaches to the handling of issues related
1323	   to the handling of dotted and dotless i, in Turkish and Azeri.
1324	   However, the question of whether the server's handling is case-
1325	   preserving has no effect on client behavior, nor does the question of
1326	   whether the server implements a particular normalization form or
1327	   implements form-insensitive file name matching.  In light of the
1328	   uncertainty created by the lack of knowledge of the details of the
1329	   case-related equivalence relation together with the precise Unicode
1330	   version used by the server to implement its canonical equivalence
1331	   relation, the following possibilities, arranged in order of
1332	   increasing value (and difficulty of implementation) should be
1333	   considered.

1335	   B1:  The client can simply decline to implement optimizations based
1336	        on negative name caching on case-insensitive file systems.

1338	        While this might have a negative effect on performance where
1339	        significant benefits from negative name caching might be
1340	        expected, it might be the best option for clients not heavily
1341	        used to access case-insensitive filesystems.

1343	   B2:  Filtering similar to that discussed in item A2 could be
1344	        implemented, although a higher limit is likely to be chosen
1345	        (e.g.  U+07ff) if significant use of non-Latin scripts is
1346	        expected.  Because of the uncertainty regarding the handling of
1347	        case relationships among the characters used for the variants of
1348	        I used by Turkic languages, this filtering would have to exclude
1349	        names containing LATIN CAPITAL LETTER I WITH DOT ABOVE and LATIN
1350	        SMALL LETTER DOTLESS I together with precomposed characters
1351	        derived from them.

1353	        In cases in which such filtering did not exclude the item from
1354	        consideration, it would need to search for files with possibly
1355	        equivalent names, including those equivalent by canonical
1356	        equivalence, case-insensitive equivalence, or a combination of
1357	        the two.  This will typically require a form of name hashing
1358	        which always produces the same hash for equivalent names,
1359	        similar to that discussed in item A3 but including case-
1360	        insensitive equivalence as well.

1362	   B3:  A client might implement its own internationalized, case-
1363	        insensitive file name handling paralleling that of the server.
1364	        Because the case mappings are uncertain and the Unicode version
1365	        used by the server filesystem is unknown, strings for which it
1366	        is possible that the equivalent string might be different
1367	        depending on the version of Unicode implemented by the server or
1368	        the choice of case mappings would have to be identified and
1369	        excluded from using negative name caching.  This would require
1370	        that strings containing code points unassigned in Unicode
1371	        version 4.0, and those denoting combining characters that could
1372	        be parts of precomposed characters added to later versions of
1373	        Unicode be excluded from negative name caching.  The necessary
1374	        filtering could apply to all potential code points although
1375	        clients might choose to simplify implementation by excluding
1376	        strings containing code points beyond a certain point (e.g.
1377	        U+FFFF).

1379	        When a client implements internationalized name handling, it
1380	        needs to be able to detect when the apparent absence of a file
1381	        within a directory is contradicted by the occurrence of a file
1382	        with a distinct, but canonically equivalent name.  In order to
1383	        efficiently find such names, when they exist, a client typically
1384	        needs to implement a form of name hashing which always produces
1385	        the same result for two canonically equivalent names.  This can
1386	        be done by making the contribution of any character to the name
1387	        hash, equal to the contribution of the corresponding canonical
1388	        decomposition string.
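
   For case-insensitive file systems, the hashing mentioned in items
   B2 and B3 needs a key that is stable under both canonical and case
   equivalence.  The following Python sketch shows one way to build
   such a key, together with the exclusion of the Turkic i's described
   in item B2.  It assumes that str.casefold() approximates the
   server's case mappings, which the client cannot actually verify
   with current NFSv4 protocols; the function names are hypothetical,
   and precomposed characters derived from the Turkic i's are not
   handled.

```python
import unicodedata

# Characters whose case handling is uncertain (item B2).
TURKIC_I = {"\u0130", "\u0131"}


def has_uncertain_case(name: str) -> bool:
    """True if the name should be excluded from negative name caching."""
    return any(ch in TURKIC_I for ch in name)


def ci_name_key(name: str) -> str:
    """Key that is equal for canonically or case-equivalent names."""
    # Decompose first so that precomposed and decomposed forms fold
    # identically, then decompose again because folding can produce
    # characters that are themselves decomposable.
    nfd = unicodedata.normalize("NFD", name)
    return unicodedata.normalize("NFD", nfd.casefold())
```

   A client could use ci_name_key() as the input to its directory hash
   table, consulting it only for names for which has_uncertain_case()
   returns False.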

1390	11.3.  Client Processing of File Names for Future NFSv4 Protocols

1392	   NFSv4 has an extension framework allowing the addition of new
1393	   attributes in later minor versions or in extensions to extensible
1394	   minor versions.  Such new attributes are likely to be optional.  They
1395	   could include a number of useful per-fs attributes to deal with the
1396	   information gaps discussed in Section 11.2:

1398	   o  The Unicode version used to define the canonical equivalence
1399	      relation implemented by the server could be provided as an fs-
1400	      scope attribute.

1402	   o  For case-insensitive filesystems, details regarding the actual
1403	      case mapping used could be provided as an fs-scope attribute.
1404	      These details would include the case mapping associated with LATIN
1405	      LETTER I (i.e. whether the C-type or T-type case mappings or both
1406	      are to be used).  Similarly for characters having F-type case
1407	      mappings, information needs to be provided about whether the
1408	      F-type mapping, the S-type mapping, or both, are to be used.

1410	   There is little prospect of such additional attributes being
1411	   REQUIRED.  Although the term "RECOMMENDED" has been used to describe
1412	   NFSv4 attributes that are not REQUIRED, any such attributes are best
1413	   considered OPTIONAL for the server to support with the client
1414	   required to deal with the case in which the attribute is not
1415	   supported.

1417	   When such attributes are defined and implemented, it would be
1418	   possible for the client and server to implement compatible
1419	   internationalization-related file name handling.  However, as a
1420	   practical matter, such compatibility would be considerably eased if
1421	   there existed unencumbered open-source implementations of the
1422	   algorithm and tables described in Appendix B.  This would allow
1423	   clients, servers, and server-based file systems, to easily adopt
1424	   compatible approaches to these issues, each calling a common set of
1425	   primitives, even though each might have a different execution
1426	   environment and might be processing file names for different
1427	   purposes.

1429	   In the case of case-sensitive file systems, the case-mapping
1430	   attribute is not relevant.  In dealing with the non-support of the
1431	   Unicode version attribute, the client is in the same position as
1432	   that of clients described in Section 11.2.  In the case in which
1433	   the Unicode version attribute is supported, the client would be
1434	   able to implement the same version of the canonical equivalence
1435	   relation implemented by the server, thus avoiding the need for the
1436	   overbroad filtering mentioned in items A2 and A3 of Section 11.2.

1438	   The case of case-insensitive file systems is more complicated, since
1439	   there are two OPTIONAL attributes to deal with:

1441	   C1:  When neither of these OPTIONAL attributes is supported, the
1442	        client is in the same position as that of clients described in
1443	        Section 11.2 in dealing with a case-insensitive file system.

1445	   C2:  When the Unicode version is available but the details of case
1446	        mapping are not, the client handling will be similar to that
1447	        specified in options B1 through B3 defined in Section 11.2.
1448	        However, in cases B2 and B3, it will be possible to reduce the
1449	        scope of the character filtering applied, by enabling names
1450	        containing characters defined after Unicode version 4.0 to be
1451	        processed, as long as none of the case mapping options for those
1452	        characters is at all problematic.

1454	   C3:  When the details of case mapping are available but the Unicode
1455	        version is not, the client handling will be similar to that
1456	        specified in options B1 through B3 defined in Section 11.2.
1457	        However, in cases B2 and B3, it will be possible to reduce the
1458	        scope of the character filtering by enabling names containing
1459	        characters of uncertain case mapping to be processed as long as
1460	        those characters were defined in Unicode version 4.0.

1463	   C4:  When both of these OPTIONAL attributes are supported, the client
1464	        has the ability, at least theoretically, to reproduce the
1465	        internationalization-related file name handling implemented by a
1466	        server for a case-insensitive file system.  However, when the
1467	        client is unable to provide such an implementation, it is free
1468	        to ignore the attributes and implement one of the options B1
1469	        through B3 defined in Section 11.2.
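
   The four cases C1 through C4 amount to deciding which of the two
   kinds of filtering from Section 11.2 still has to be applied.  A
   minimal Python sketch of that decision follows.  Since no such
   attributes have been defined, the parameter names and the result
   type are purely hypothetical, with None standing for an attribute
   that the server does not support.

```python
from typing import NamedTuple, Optional


class Filtering(NamedTuple):
    """Filtering the client still has to apply (hypothetical type)."""
    restrict_to_unicode_4_0: bool  # exclude characters added after 4.0
    exclude_uncertain_case: bool   # exclude e.g. dotted and dotless i


def residual_filtering(unicode_version: Optional[str],
                       case_mapping: Optional[str]) -> Filtering:
    """Map attribute availability (C1 through C4) to needed filtering."""
    return Filtering(
        restrict_to_unicode_4_0=unicode_version is None,
        exclude_uncertain_case=case_mapping is None,
    )
```

   With neither attribute supported (C1) both restrictions remain in
   force, whereas with both supported (C4) neither is needed by a
   client able to reproduce the server's handling.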

1471	12.  String Types with Processing Defined by Other Internet Areas

1473	   There are two types of strings that NFSv4 deals with that are based
1474	   on domain names.  Processing of such strings is defined by other
1475	   Internet standards, and hence the processing behavior for such
1476	   strings should be consistent across all server operating systems and
1477	   server file systems.

1479	   This section differs from other sections of this document in two
1480	   respects:

1482	   o  The normative statements within this section are not derived from
1483	      the behavior of existing NFSv4 implementations, but derive
1484	      instead from existing RFCs.

1486	   o  Because of the switch from IDNA2003 [18] [19] to IDNA2008 [5],
1487	      this section is necessarily different from the corresponding
1488	      section (i.e.  Section 12.6) of [3].  The differences are
1489	      discussed in Section 12.1.

1491	   Because of this shift, compatibility issues could arise between
1492	   implementations obeying Section 12.6 of [3] and those following
1493	   this document.  Whether such compatibility issues actually exist
1494	   depends on the behavior of NFSv4 implementations and how domain
1495	   names are actually used in existing implementations.  These
1496	   matters are discussed in Section 12.2.

1498	   The types of strings referred to above are as follows:

1500	   o  Server names as they appear in the fs_locations and
1501	      fs_locations_info attributes.  Note that for most purposes, such
1502	      server names will only be sent by the server to the client.  The
1503	      exception is the use of these attributes in a VERIFY or NVERIFY
1504	      operation.

1506	   o  Principal suffixes that are used to denote sets of users and
1507	      groups, and are in the form of domain names.

1509	   The general rules for handling all of these domain-related strings
1510	   are similar and independent of the role of the sender or receiver as
1511	   client or server, although the consequences of failure to obey these
1512	   rules may be different for client or server.  The server can report
1513	   errors when it is sent invalid strings, whereas the client will
1514	   simply ignore an invalid string or use a default value in its place.

1516	   The string sent SHOULD be in the form of one or more unvalidated
1517	   U-labels as defined by [5].  In cases where this cannot be done, the
1518	   string will instead be in the form of one or more LDH labels [5].
1519	   The receiver needs to be able to accept domain and server names in
1520	   any of the formats allowed.  The server MUST reject, using the error
1521	   NFS4ERR_INVAL, any of the following:

1523	   o  a string that is not valid UTF-8.

1525	   o  a string that contains an XN-label (begins with "xn--") for which
1526	      the characters after "xn--" are not valid output of the Punycode
1527	      algorithm [6].

1529	   o  a string that contains a reserved LDH label which is not an
1530	      XN-label.
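
   The three rejection rules above can be sketched in Python as
   follows.  This is a minimal illustration under stated assumptions:
   labels are split on U+002E only, non-ASCII labels are accepted as
   unvalidated U-labels, and "valid output of the Punycode algorithm"
   is approximated by a decode and re-encode round trip using Python's
   built-in "punycode" codec.  The function names are hypothetical,
   and a True result corresponds to the server returning
   NFS4ERR_INVAL.

```python
def is_reserved_ldh(label: str) -> bool:
    """Reserved LDH labels have "--" in the third and fourth positions."""
    return len(label) >= 4 and label[2:4] == "--"


def punycode_ok(tail: str) -> bool:
    """Check that tail round-trips through the Punycode codec."""
    if not tail:
        return False
    try:
        decoded = tail.encode("ascii").decode("punycode")
        return decoded.encode("punycode").decode("ascii") == tail.lower()
    except ValueError:  # UnicodeError is a subclass of ValueError
        return False


def must_reject(raw: bytes) -> bool:
    """True if the domain string falls under one of the three rules."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return True  # not valid UTF-8
    for label in text.split("."):
        if not label.isascii():
            continue  # treated as an unvalidated U-label
        if is_reserved_ldh(label):
            if not label.lower().startswith("xn--"):
                return True  # reserved LDH label that is not an XN-label
            if not punycode_ok(label[4:]):
                return True  # XN-label with an invalid Punycode part
    return False
```

   For example, must_reject(b"xn--mnchen-3ya.example") is False, while
   must_reject(b"ab--cd.example") is True.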

1532	   When a domain string is part of id@domain or group@domain, there are
1533	   two possible approaches:

1535	   1.  The server generally treats the domain string as a series of
1536	       unvalidated U-labels.  In cases where the domain string is a
1537	       series of unvalidated A-labels or Non-Reserved LDH (NR-LDH)
1538	       labels, it converts them to U-labels using the Punycode algorithm
1539	       [6].  As a result, the domain string returned within a user id on
1540	       a GETATTR may not match that sent when the user id is set using
1541	       SETATTR, although when this happens, the domain will be in the
1542	       form of an unvalidated U-label.

1544	   2.  The server treats the domain string as a series of unvalidated
1545	       U-labels.  Specifically, it does not map a domain string that is
1546	       not a U-label into a U-label using the methods described above.
1547	       As a result, the domain string returned on a GETATTR of the user
1548	       id MUST be the same as that used when setting the user id by the
1549	       SETATTR.

1551	   A server SHOULD use the first method.

1553	   For VERIFY and NVERIFY, additional string processing requirements
1554	   apply to verification of the owner and owner_group attributes; see
1555	   the section entitled "Interpreting owner and owner_group" of the
1556	   document specifying the minor version in question (RFC7530 [3],
1557	   RFC5661 [21]).

1559	12.1.  Effect of IDNA Changes

1561	   Overall, the effect of the shift to IDNA2008 is to limit the degree
1562	   of understanding of the IDNA-based restrictions on domain names that
1563	   was expected of NFSv4 in RFC7530 [3].  Despite this specification,
1564	   the degree to which implementations actually implemented such
1565	   restrictions is open to question and will be discussed in detail in
1566	   Section 12.2.

1568	   In analyzing how various cases are to be dealt with according to
1569	   RFC7530, there are a number of troubling uncertainties that arise in
1570	   trying to interpret the existing specification:

1572	   o  There are a number of cases in which "SHOULD" is used that are
1573	      confusing.  According to RFC2119 [1], "SHOULD" means that "there
1574	      may exist valid reasons in particular circumstances to ignore a
1575	      particular item, but the full implications must be understood and
1576	      carefully weighed before choosing a different course".  To fully
1577	      understand a particular "SHOULD", there needs to be enough context
1578	      to determine whether particular reasons for ignoring the item are
1579	      in fact valid, and sufficient guidance to understand the
1580	      implication of ignoring the item.  In the absence of such
1581	      information, the relevant fact is that the peer needs to deal with
1582	      the item being ignored, making the implications of a "SHOULD" hard
1583	      to distinguish from those of "MAY".

1585	   o  While the document states, "the general rules for handling all of
1586	      these domain-related strings are similar and independent of the
1587	      role of the sender or receiver as client or server", all of the
1588	      following text is explicitly about the server's options, choices
1589	      and responsibilities, leaving the client case unclear.

1591	   o  In a number of places within the paragraph describing server
1592	      approach #1, the word "can" is used as in the text "the server can
1593	      use the ToUnicode function", leaving it unclear whether the server
1594	      can choose to do anything else and if so what.

1596	   The following cases are those where RFC7530 requires use of IDNA
1597	   handling and this requirement could, if implementations follow it,
1598	   create potential compatibility issues, which need to be understood.

1600	   o  The degree to which RFC3490 [18] requires that characters other
1601	      than U+002E (full stop) be treated as label separators, including
1602	      U+3002 (ideographic full stop), U+FF0E (fullwidth full stop),
1603	      U+FF61 (halfwidth ideographic full stop).

1605	   o  The degree to which RFC3490 [18] requires that the server or
1606	      client validate a putative A-label or U-label or rectify it if
1607	      it is not valid.
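
   The first of these issues can be made concrete with a short Python
   sketch contrasting the two treatments of label separators.  The
   function names are hypothetical; the first function follows the
   four separators listed above, while the second assumes a receiver
   that recognizes only U+002E.

```python
import re

# The four label separators listed above for RFC3490.
IDNA2003_DOTS = "\u002e\u3002\uff0e\uff61"


def split_labels_idna2003(name: str) -> list:
    """Split on any of the four separators."""
    return re.split("[" + IDNA2003_DOTS + "]", name)


def split_labels_full_stop_only(name: str) -> list:
    """Split on U+002E (full stop) only."""
    return name.split(".")
```

   A name such as "example" followed by U+3002 and "com" yields two
   labels under the first function and a single label under the
   second, which is the kind of difference that could surface as a
   compatibility issue.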

1609	12.2.  Potential Compatibility Issues Related to IDNA Changes

1611	   There are a number of factors relating to the handling of domain
1612	   names within NFSv4 implementations that are important in
1613	   understanding why any compatibility issues might be less troubling
1614	   than a comparison of the two IDNA approaches might suggest:

1616	   o  Much of the potentially conflicting IDNA-related behavior required
1617	      or recommended for the server by RFC7530 [3] might not actually be
1618	      implemented, limiting the potential harmful effects of ceasing to
1619	      mandate it.

1621	   o  Even if such behavior were implemented by servers, no
1622	      compatibility issue would arise unless clients actually relied on
1623	      the server to implement it.  Given that none of this behavior is
1624	      made required, the chances of that occurring are quite small.

1626	   o  The range of potential values for user and group attributes sent
1627	      by clients is often quite small, with implementations commonly
1628	      restricting all such values to a single domain string.  This is
1629	      even though RFCs 7530 [3] and 5661 [21] are written without
1630	      mention of such restrictions.

1632	      Specification of users and groups in the "id@domain" format within
1633	      NFSv4 was adopted to enable expansion of the spaces of users and
1634	      groups beyond the 32-bit id spaces mandated in NFSv3 [15] and
1635	      NFSv2 [14].  While one obstacle to expansion was eliminated, most
1636	      implementations were unable to actually effect that expansion,
1637	      principally because the physical file systems used assume that
1638	      user and group identifiers fit in 32 bits each and the vnode
1639	      interfaces used by server implementations make similar
1640	      assumptions.

1642	      Given these restrictions, the typical implementation pattern is
1643	      for servers to accept only a single domain, specified as part of
1644	      the server configuration, together with information necessary to
1645	      effect the appropriate name-to-id mappings.

1647	   o  For the other use of domain names in NFSv4, to represent host
1648	      names in location attributes, the values are generated by the
1649	      server and will normally include only host names within DNS-
1650	      registered domains.
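
   The single-domain pattern described in the list above can be
   sketched as follows.  This is an illustration only: the configured
   domain, the mapping table, and the function name are all
   hypothetical, and the exact string comparison used for the domain is
   an assumption rather than behavior taken from any implementation.

```python
from typing import Optional

# Hypothetical server configuration: one accepted domain plus the
# name-to-id mapping information mentioned in the text.
CONFIGURED_DOMAIN = "example.org"
NAME_TO_UID = {"alice": 1001, "bob": 1002}


def owner_to_uid(owner: str) -> Optional[int]:
    """Map an "id@domain" owner string to a 32-bit uid, or None if unmappable."""
    name, sep, domain = owner.rpartition("@")
    if not sep or domain != CONFIGURED_DOMAIN:
        return None                      # no domain part, or a different domain
    uid = NAME_TO_UID.get(name)
    if uid is None or not 0 <= uid < 2**32:
        return None                      # unknown user or id outside 32 bits
    return uid


print(owner_to_uid("alice@example.org"))
print(owner_to_uid("alice@other.example"))
```

   Because the domain is compared against a single configured string,
   no IDNA processing is involved in this sketch, which is the reason
   the text above expects few compatibility problems in this area.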

1652	   Keeping the above in mind, we can see that interoperability issues,
1653	   while they might exist, are unlikely to raise major challenges, as
1654	   the following specific cases show:

1656	   o  When an internationalized domain name is used as part of a user or
1657	      group, it would need to be configured as such, with the domain
1658	      string known to both client and server.

1660	      While it is theoretically possible that a client might work with
1661	      an invalid domain string and rely on the server to correct it to
1662	      an IDNA-acceptable one, such a scenario has to be considered
1663	      extremely unlikely, since it would depend on multiple servers
1664	      implementing the same correction, especially since there is no
1665	      evidence of such corrections ever having been implemented by NFSv4
1666	      servers.

1668	   o  When an internationalized domain in a location string is meant to
1669	      specify a registered domain, similar considerations apply.

1671	      While it is theoretically possible that a client might work with
1672	      an invalid domain string and rely on the server to correct it to
1673	      the appropriate registered one, such a scenario has to be
1674	      considered extremely unlikely, since it would depend on multiple
1675	      servers implementing the same correction, especially since there
1676	      is no evidence of such corrections ever having been implemented by
1677	      NFSv4 servers.

1679	   o  When an internationalized domain in a location string is meant to
1680	      specify a non-registered domain, any such server-applied
1681	      corrections would be useless.

1683	      In this situation, any potential interoperability issue would
1684	      arise from rejecting the name, which has to be considered as what
1685	      should have been done in the first place.

1687	13.  Errors Related to UTF-8

1689	   Where the client sends an invalid UTF-8 string, the server MAY return
1690	   an NFS4ERR_INVAL error.  This includes cases in which inappropriate
1691	   prefixes are detected and where the count includes trailing bytes
1692	   that do not constitute a full Multiple-Octet Coded Universal
1693	   Character Set (UCS) character.

1695	   Requirements for server handling of component names that are not
1696	   valid UTF-8, when a server does not return NFS4ERR_INVAL in response
1697	   to receiving them, are described in Section 14.

1699	   Where the string supplied by the client is not rejected with
1700	   NFS4ERR_INVAL but contains characters that are not supported by the
1701	   server as a value for that string (e.g., names containing slashes, or
1702	   characters that do not fit into 16 bits when converted from UTF-8 to
1703	   a Unicode codepoint), the server should return an NFS4ERR_BADCHAR
1704	   error.

1706	   Where a UTF-8 string is used as a file name, and the file system,
1707	   while supporting all of the characters within the name, does not
1708	   allow that particular name to be used, the server should return the
1709	   error NFS4ERR_BADNAME.  This includes such situations as file system
1710	   prohibitions of "." and ".." as file names for certain operations,
1711	   and similar constraints.
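
   The error selection described in this section can be summarized by
   the following sketch.  It is not taken from any server
   implementation: the function name and the flag are hypothetical, the
   statuses are represented as strings rather than protocol values, and
   the specific restrictions checked (no slashes, nothing beyond 16
   bits, no "." or "..") are simply the examples given in the text
   above, assumed here to apply unconditionally.

```python
def classify_component_name(raw: bytes, reject_invalid_utf8: bool = True) -> str:
    """Return the status a hypothetical server might choose for a name."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Invalid UTF-8, including a truncated trailing character.  The
        # server MAY reject it; one that does not is covered by Section 14.
        return "NFS4ERR_INVAL" if reject_invalid_utf8 else "NFS4_OK"

    # Valid UTF-8, but containing a character this server does not support.
    if any(ch == "/" or ord(ch) > 0xFFFF for ch in text):
        return "NFS4ERR_BADCHAR"

    # All characters are supported, but the name itself is not allowed.
    if text in (".", ".."):
        return "NFS4ERR_BADNAME"

    return "NFS4_OK"


for sample in (b"caf\xc3\xa9", b"caf\xc3", b"a/b", b".."):
    print(sample, classify_component_name(sample))
```

   The order of the checks mirrors the order of the paragraphs above:
   encoding validity first, then unsupported characters, then names
   that are disallowed as a whole.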

1713	14.  Servers That Accept File Component Names That Are Not Valid UTF-8
1714	     Strings

1716	   As stated previously, servers MAY accept, on all or on some subset of
1717	   the physical file systems exported, component names that are not
1718	   valid UTF-8 strings.  A typical pattern is for a server to use
1719	   UTF-8-unaware physical file systems that treat component names as
1720	   uninterpreted strings of bytes, rather than having any awareness of
1721	   the character set being used.

1723	   Such servers SHOULD NOT change the stored representation of component
1724	   names from those received on the wire and SHOULD use an octet-by-
1725	   octet comparison of component name strings to determine equivalence
1726	   (as opposed to any broader notion of string comparison).  This is
1727	   because the server has no knowledge of the character encoding being
1728	   used.

1730	   Nonetheless, when such a server uses a broader notion of string
1731	   equivalence than what is recommended in the preceding paragraph, the
1732	   following considerations apply:

1734	   o  Outside of 7-bit ASCII, string processing that changes string
1735	      contents is usually specific to a character set and hence is
1736	      generally unsafe when the character set is unknown.  This
1737	      processing could change the file name in an unexpected fashion,
1738	      rendering the file inaccessible to the application or client that
1739	      created or renamed the file and to others expecting the original
1740	      file name.  Hence, such processing should not be performed,
1741	      because doing so is likely to result in incorrect string
1742	      modification or aliasing.

1744	   o  Unicode normalization is particularly dangerous, as such
1745	      processing assumes that the string is UTF-8.  When that assumption
1746	      is false because a different character set was used to create the
1747	      file name, normalization may corrupt the file name with respect to
1748	      that character set, rendering the file inaccessible to the
1749	      application that created it and others expecting the original file
1750	      name.  Hence, Unicode normalization SHOULD NOT be performed,
1751	      because it may cause incorrect string modification or aliasing.

1753	   When the above recommendations are not followed, the resulting string
1754	   modification and aliasing can lead to both false negatives and false
1755	   positives, depending on the strings in question, which can result in
1756	   security issues such as elevation of privilege and denial of service
1757	   (see [23] for further discussion).
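
   A small Python sketch may make the aliasing and corruption risks
   concrete.  The helper names are hypothetical, and Windows-1252 is
   used purely as an example of a non-UTF-8 character set that a client
   might have used.  The sketch shows a byte string that happens to be
   valid UTF-8 being rewritten by NFC normalization, and two names that
   are distinct octet-by-octet becoming aliases.

```python
import unicodedata


def octet_equal(a: bytes, b: bytes) -> bool:
    """The recommended comparison: octet-by-octet, no interpretation."""
    return a == b


def normalize_as_if_utf8(raw: bytes) -> bytes:
    """What a server would produce if it assumed UTF-8 and applied NFC."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")


# Bytes created by a client that uses Windows-1252 for its file names.
# They also happen to be the UTF-8 encoding of U+212B (ANGSTROM SIGN).
stored = b"\xe2\x84\xab"
changed = normalize_as_if_utf8(stored)

# The stored bytes differ after normalization, so the creating client
# no longer finds the name it wrote.
print(stored != changed)
print(stored.decode("cp1252") != changed.decode("cp1252"))

# Two names that differ octet-by-octet collapse to the same bytes.
nfd_name = b"A\xcc\x8a"      # "A" followed by U+030A (combining ring above)
nfc_name = b"\xc3\x85"       # U+00C5 as a single code point
print(octet_equal(nfd_name, nfc_name))
print(normalize_as_if_utf8(nfd_name) == nfc_name)
```

   In the first case the server has changed a name whose encoding it
   did not know.  In the second, a server that normalizes would treat
   two names as the same file even though a client working with raw
   bytes considers them different.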

1759	15.  Future Minor Versions and Extensions

1761	   As stated above, all current NFSv4 minor versions allow use of non-
1762	   UTF-8 encodings, allow servers a choice of whether to be aware of
1763	   normalization issues or not, and allow servers a number of choices
1764	   about how to address normalization issues.  This range of choices
1765	   reflects the need to accommodate existing file systems and user
1766	   expectations about character handling which in turn reflect the
1767	   assumptions of the POSIX model of handling file names.

1769	   While it is theoretically possible for a subsequent minor version to
1770	   change these aspects of the protocol (see [8]), this section will
1771	   explain why any such change is highly unlikely, making it expected
1772	   that these aspects of NFSv4 internationalization handling will be
1773	   retained indefinitely.  As a result, any new minor version
1774	   specification document that made such a change would have to be
1775	   marked as updating or obsoleting this document.

1777	   No such change could be done as an extension to an existing minor
1778	   version or in a new minor version consisting only of OPTIONAL
1779	   features.  Such a change could only be done in a new minor version,
1780	   which like minor version one, was prepared to be incompatible to some
1781	   degree with the previous minor versions.  While it appears unlikely
1782	   that such minor versions will be adopted, the possibility cannot be
1783	   excluded, so we need to explore the difficulties of changing the
1784	   aspects of internationalization handling mentioned above.

1786	   o  Establishing UTF-8 as the sole means of encoding for
1787	      internationalized characters, would make inaccessible existing
1788	      files stored with other encodings.  Further, unless there were a
1789	      corresponding change in the UNIX file interface model, it would
1790	      cause the set of valid names for local and remote files to
1791	      diverge.

1793	   o  Imposing a particular normalization form, in the sense of refusing
1794	      to create or allow access to files whose UTF-8-encoded names are
1795	      not of the selected normalization form, would give rise to similar
1796	      difficulties.

1798	   o  Defining a preferred normalization form to be returned as the
1799	      names of all internationalized files, would result in applications
1800	      having to deal with sudden unexplained changes of file names for
1801	      existing files.

1803	   None of the above appears likely since there do not seem to be any
1804	   corresponding benefits to justify the difficulties that they would
1805	   create.

1807	   There would also be difficulties in otherwise reducing the set of
1808	   three acceptable normalization handling options, without reducing it
1809	   to a single option by imposing a specific normalization form.

1811	   o  Eliminating the possibility of a single possible normalization
1812	      form, would pose similar difficulties to imposing the other one,
1813	      even if representation-independent comparisons were also allowed.

1815	      In either case, a specific normalization form would be disfavored,
1816	      with no corresponding benefit.

1818	   o  Allowing only representation-independent lookups would not impose
1819	      difficulties for clients, but there are reasons to doubt it could
1820	      be universally implemented, since such name comparisons would have
1821	      to be done within the file system itself.

1823	      Such a change could only be made once file system support for
1824	      representation-independent file lookups becomes commonly
1825	      available.  As long as the POSIX file naming model continues its
1826	      sway, that would be unlikely to happen.

1828	   One possible internationalization-related extension that the working
1829	   group could adopt would be definition of an OPTIONAL per-fs attribute
1830	   defining the internationalization-related handling for that file
1831	   system.  That would allow clients to be aware of server choices in
1832	   this area and could be adopted without disrupting existing clients
1833	   and servers.

1835	16.  IANA Considerations

1837	   The current document does not require any actions by IANA.

1839	17.  Security Considerations

1841	   Unicode in the form of UTF-8 is generally used for file component
1842	   names (i.e., both directory and file components).  However, other
1843	   character sets may also be allowed for these names.  The owner and
1844	   owner_group attributes and other sorts of strings whose form is
1845	   affected by standards outside NFSv4 (see Section 12) are always
1846	   encoded as UTF-8.  String processing (e.g., Unicode normalization)
1847	   raises security concerns for string comparison.  See Sections 12 and
1848	   9 as well as the respective Sections 5.9 of RFC7530 [3] and RFC5661
1849	   [21] for further discussion.  See [23] for related identifier
1850	   comparison security considerations.  File component names are
1851	   identifiers with respect to the identifier comparison discussion in
1852	   [23] because they are used to identify the objects to which ACLs are
1853	   applied (see the respective Sections 6 of RFC7530 [3] and RFC5661
1854	   [21]).

1856	18.  References

1858	18.1.  Normative References

1860	   [1]        Bradner, S., "Key words for use in RFCs to Indicate
1861	              Requirement Levels", BCP 14, RFC 2119,
1862	              DOI 10.17487/RFC2119, March 1997,
1863	              <https://www.rfc-editor.org/info/rfc2119>.

1865	   [2]        Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
1866	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
1867	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

1869	   [3]        Haynes, T., Ed. and D. Noveck, Ed., "Network File System
1870	              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
1871	              March 2015, <https://www.rfc-editor.org/info/rfc7530>.

1873	   [4]        Haynes, T., "Network File System (NFS) Version 4 Minor
1874	              Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
1875	              November 2016, <https://www.rfc-editor.org/info/rfc7862>.

1877	   [5]        Klensin, J., "Internationalized Domain Names for
1878	              Applications (IDNA): Definitions and Document Framework",
1879	              RFC 5890, DOI 10.17487/RFC5890, August 2010,
1880	              <https://www.rfc-editor.org/info/rfc5890>.

1882	   [6]        Costello, A., "Punycode: A Bootstring encoding of Unicode
1883	              for Internationalized Domain Names in Applications
1884	              (IDNA)", RFC 3492, DOI 10.17487/RFC3492, March 2003,
1885	              <https://www.rfc-editor.org/info/rfc3492>.

1887	   [7]        Yergeau, F., "UTF-8, a transformation format of ISO
1888	              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
1889	              2003, <https://www.rfc-editor.org/info/rfc3629>.

1891	   [8]        Noveck, D., "Rules for NFSv4 Extensions and Minor
1892	              Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017,
1893	              <https://www.rfc-editor.org/info/rfc8178>.

1895	   [9]        Noveck, D., Ed. and C. Lever, "Network File System (NFS)
1896	              Version 4 Minor Version 1 Protocol", RFC 8881,
1897	              DOI 10.17487/RFC8881, August 2020,
1898	              <https://www.rfc-editor.org/info/rfc8881>.

1900	   [10]       Cerf, V., "ASCII format for network interchange", STD 80,
1901	              RFC 20, October 1969,
1902	              <http://www.rfc-editor.org/info/rfc20>.

1904	   [11]       The Unicode Consortium, "The Unicode Standard, Version
1905	              7.0.0", (Mountain View, CA: The Unicode Consortium,
1906	              2014 ISBN 978-1-936213-09-2), June 2014,
1907	              <http://www.unicode.org/versions/Unicode7.0.0/>.

1909	   [12]       The Unicode Consortium, "The Unicode Standard, Version
1910	              13.0.0, Section 5.18 Case Mappings", (Mountain View, CA:
1911	              The Unicode Consortium, 2014 ISBN 978-1-936213-26-9),
1912	              March 2020,
1913	              <http://www.unicode.org/versions/Unicode13.0.0/
1914	              ch05.pdf#G21180>.

1916	   [13]       The Unicode Consortium, "CaseFolding-13.0.0.txt",
1917	              (Mountain View, CA: The Unicode Consortium, 2014 ISBN
1918	              978-1-936213-26-9), March 2020,
1919	              <https://www.unicode.org/Public/13.0.0/ucd/
1920	              CaseFolding.txt>.

1922	18.2.  Informative References

1924	   [14]       Nowicki, B., "NFS: Network File System Protocol
1925	              specification", RFC 1094, DOI 10.17487/RFC1094, March
1926	              1989, <https://www.rfc-editor.org/info/rfc1094>.

1928	   [15]       Callaghan, B., Pawlowski, B., and P. Staubach, "NFS
1929	              Version 3 Protocol Specification", RFC 1813,
1930	              DOI 10.17487/RFC1813, June 1995,
1931	              <https://www.rfc-editor.org/info/rfc1813>.

1933	   [16]       Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
1934	              Beame, C., Eisler, M., and D. Noveck, "NFS version 4
1935	              Protocol", RFC 3010, DOI 10.17487/RFC3010, December 2000,
1936	              <https://www.rfc-editor.org/info/rfc3010>.

1938	   [17]       Hoffman, P. and M. Blanchet, "Preparation of
1939	              Internationalized Strings ("stringprep")", RFC 3454,
1940	              DOI 10.17487/RFC3454, December 2002,
1941	              <https://www.rfc-editor.org/info/rfc3454>.

1943	   [18]       Faltstrom, P., Hoffman, P., and A. Costello,
1944	              "Internationalizing Domain Names in Applications (IDNA)",
1945	              RFC 3490, DOI 10.17487/RFC3490, March 2003,
1946	              <https://www.rfc-editor.org/info/rfc3490>.

1948	   [19]       Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1949	              Profile for Internationalized Domain Names (IDN)",
1950	              RFC 3491, DOI 10.17487/RFC3491, March 2003,
1951	              <https://www.rfc-editor.org/info/rfc3491>.

1953	   [20]       Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
1954	              Beame, C., Eisler, M., and D. Noveck, "Network File System
1955	              (NFS) version 4 Protocol", RFC 3530, DOI 10.17487/RFC3530,
1956	              April 2003, <https://www.rfc-editor.org/info/rfc3530>.

1958	   [21]       Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
1959	              "Network File System (NFS) Version 4 Minor Version 1
1960	              Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010,
1961	              <https://www.rfc-editor.org/info/rfc5661>.

1963	   [22]       Hoffman, P. and J. Klensin, "Terminology Used in
1964	              Internationalization in the IETF", BCP 166, RFC 6365,
1965	              DOI 10.17487/RFC6365, September 2011,
1966	              <https://www.rfc-editor.org/info/rfc6365>.

1968	   [23]       Thaler, D., Ed., "Issues in Identifier Comparison for
1969	              Security Purposes", RFC 6943, DOI 10.17487/RFC6943, May
1970	              2013, <https://www.rfc-editor.org/info/rfc6943>.

1972	   [24]       Beame, C., Thurlow, R., Callaghan, B., Robinson, D.,
1973	              Noveck, D., Eisler, M., and S. Shepler, "Network File
1974	              System (NFS) version 4 Protocol", draft-ietf-
1975	              nfsv4-rfc3010bis-05 (work in progress), November 2002.

1977	   [25]       Williams, N., "Internationalization Considerations for
1978	              Filesystems and Filesystem Protocols", draft-williams-
1979	              filesystem-18n-00 (work in progress), July 2020.

1981	Appendix A.  History

1983	   This section describes the history of internationalization within
1984	   NFSv4.  Despite the fact that NFSv4.0 and subsequent minor versions
1985	   have differed in many ways, the actual implementations of
1986	   internationalization have remained the same and internationalized
1987	   names have been handled without regard to the minor version being
1988	   used.  This is the reason the document is able to treat
1989	   internationalization for all NFSv4 minor versions together.

1991	   During the period from the publication of RFC3010 [16] until now, two
1992	   different perspectives with regard to internationalization have been
1993	   held and represented, to varying degrees, in specifications for NFSv4
1994	   minor versions.

1996	   o  The perspective held by NFSv4 implementers treated most aspects of
1997	      internationalization as basically outside the scope of what NFSv4
1998	      client and server implementers could deal with.  This was because
1999	      the POSIX interface treated file names as uninterpreted strings of
2000	      bytes, because the file systems used by NFSv4 servers treated file
2001	      names similarly, and because those file systems contained files
2002	      with internationalized names using a number of different encoding
2003	      methods, chosen by the users of the POSIX interface.  From this
2004	      perspective, wider support for internationalized names and general
2005	      use of universal encodings was a matter for users and applications
2006	      and not for protocol implementers or designers.

2008	   o  Within the IETF in general and in the IESG, there was a feeling
2009	      that new protocols, such as NFSv4, could not avoid dealing with
2010	      internationalization issues, making it difficult to treat these
2011	      matters, as the implementers' perspective would have it, as
2012	      essentially out of scope.

2014	   As specifications were developed, approved, and at times rewritten,
2015	   this fundamental difference of approach was never fully resolved,
2016	   although, with the publication of RFC7530 [3], a satisfactory modus
2017	   vivendi may have been arrived at.

2019	   Although many specifications were published dealing with NFSv4
2020	   internationalization, all minor versions used the same implementation
2021	   approach, even when the current specification for that minor version
2022	   specified an entirely different approach.  As a result, we need to
2023	   treat the history of NFSv4 internationalization below as an
2024	   integrated whole, rather than treating individual minor versions
2025	   separately.

2027	   o  The approach to internationalization specified in RFC3010 [16]
2028	      sidestepped the conflict of approaches cited above by discussing
2029	      the reasons that UTF-8 encoding was desirable while leaving file
2030	      names as uninterpreted strings of bytes.  The issue of string
2031	      normalization was avoided by saying "The NFS version 4 protocol
2032	      does not mandate the use of a particular normalization form at
2033	      this time."

2035	      Despite this approach's inconsistency with general IETF
2036	      expectations regarding internationalization, RFC3010 was published
2037	      as a Proposed Standard.  NFSv4.0 implementation related to
2038	      internationalization of file names followed the same paradigm used
2039	      by NFSv3, assuring interoperability with files created using that
2040	      protocol, as well as with those created using local means of file
2041	      creation.

2043	   o  When it became necessary, because of issues with byte-range
2044	      locking, to create an rfc3010bis, no change to the previously
2045	      approved approach seemed indicated and the drafts submitted up
2046	      until [24] closely followed RFC3010 as regards
2047	      internationalization.  The IESG then decided that a different
2048	      approach to internationalization was required, to be based on
2049	      stringprep [17] and rfc3010bis was accordingly revised, replacing
2050	      all of the Internationalization section, before being published as
2051	      RFC3530 [20].

2053	      These changes required the rejection of file names that were not
2054	      valid UTF-8, file names that included code points not, at the time
2055	      of publication, assigned a Unicode character (e.g. capital eszett)
2056	      or that were not allowed by stringprep (e.g.  Zero-width joiner
2057	      and non-joiner characters).  Because these restrictions would have
2058	      caused the set of valid file names to be different on NFS-mounted
2059	      and local file systems there was no chance of them ever being
2060	      implemented.

2062	      Because these specification changes were made without working
2063	      group involvement, most implementers were unaware of them while
2064	      those who were aware of the changes ignored them and continued to
2065	      develop implementations based on the internationalization approach
2066	      specified in RFC3010.

2068	   o  When NFSv4.1 was being developed, it seemed that no changes in
2069	      internationalization would be required.  Many people were unaware
2070	      of the stringprep-based requirements which made the NFSv4.0
2071	      internationalization specified in RFC3530 unimplementable.  As a
2072	      result, the internationalization specified in RFC5661 [21] was
2073	      based on that in RFC3530 [20], although the addition of the
2074	      attribute fs_charset_cap, discussed below, provided additional
2075	      flexibility.

2077	      The attribute fs_charset_cap, discussed in Section 7,
2078	      provides flags allowing the server to indicate that it accepts and
2079	      processes non-UTF-8 file names.  Rejecting them was a "MUST" in
2080	      RFC3530 and became a "SHOULD" in RFC5661, although there is no
2081	      evidence that any of these designations ever affected server
2082	      behavior.

2084	      As a result of this treatment of internationalization, even though
2085	      NFSv4.1 was a separate protocol and could have had a different
2086	      approach to internationalization, for a considerable time, the
2087	      internationalization specification for both protocols was based on
2088	      stringprep (in RFC3530 and RFC5661) while the actual
2089	      implementations of the two minor versions both followed the
2090	      approach specified in RFC3010, despite its obsoleted status.

2092	   o  When work started on rfc3530bis it was clear that issues related
2093	      to internationalization had to be addressed.  When the
2094	      implications of the stringprep references in RFC3530 were
2095	      discussed with implementers it became clear that mandating that
2096	      NFSv4.0 file names conform to stringprep was not appropriate.

2098	      While some working group members articulated the view that,
2099	      because of the need to maintain compatibility with the POSIX
2100	      interface and existing file systems, internationalization for
2101	      NFSv4 could not be successfully addressed by the IETF, the
2102	      rfc3530bis draft submitted to the IESG did not explicitly embrace
2103	      the implementers' perspective set forth above.

2105	      The draft submitted to the IESG and RFC7530 [3] as published
2106	      provided an explanation (see Section 5) as to why restrictions on
2107	      character encodings were not viable.  It allowed non-UTF-8
2108	      encodings to be used for internationalized file names while
2109	      defining UTF-8 as the preferred encoding and allowing servers to
2110	      reject non-UTF-8 strings as invalid.  Other stringprep-based
2111	      string restrictions were eliminated.  With regard to
2112	      normalization, it continued to defer the matter, leaving open the
2113	      possibility that a normalization form might be chosen later.

2115	      This approach is compatible, in implementation terms, with that
2116	      specified in RFC3010 [16], allowing it to be used compatibly with
2117	      existing implementations for all existing minor versions.  This is
2118	      despite the fact that RFC5661 [21] specifies an entirely different
2119	      approach.

2121	      As a result of discussions leading up to the publishing of
2122	      RFC7530, it was discovered that some local file systems used with
2123	      NFSv4 were configured to be both normalization-aware and
2124	      normalization-preserving, mapping all canonically equivalent file
2125	      names to the same file while preserving the form actually used to
2126	      create the file, of whatever form, normalized or not.  This
2127	      behavior, which is legal according to RFC3010, which says little
2128	      about name mapping, is probably illegal according to stringprep.
2129	      Nevertheless, it was expressly pointed out in RFC7530 as a valid
2130	      choice to deal with normalization issues, since it allows
2131	      normalization-aware processing without the difficulties that arise
2132	      in imposing a particular normalization form, as described in
2133	      Section 9.

2135	      In its discussion of internationalized domain names, RFC7530 [3]
2136	      adopted an approach compatible with IDNA2003, rather than
2137	      attempting to derive the specification from the behavior of
2138	      existing implementations.

2140	   o  When IDNA2003 was replaced by IDNA2008, the internationalization
2141	      specified by [3] was not changed.  Also, it appears unlikely that
2142	      implementations were changed to reflect that shift.

2144	   o  NFSv4.2 made no changes to internationalization.  As a result,
2145	      RFC7862 [4], which made no mention of internationalization,
2146	      implicitly aligned internationalization in NFSv4.2 with that in
2147	      NFSv4.1, as specified by RFC5661 [21].

2149	      As a result of this implicit alignment, there is no need for this
2150	      document to specifically address NFSv4.2 or be marked as updating
2151	      RFC7862.  It is sufficient that it updates RFC5661, which
2152	      specifies the internationalization for NFSv4.1, inherited by
2153	      NFSv4.2.

2155	   o  Later, as work on the predecessors of this document was underway,
2156	      [25] was submitted, making it necessary that some gaps in the
2157	      discussion of internationalization in [3] be filled in.  These
2158	      gaps primarily concerned the need of NFSv4 clients to match the
2159	      handling of the corresponding server when using cached file name
2160	      data locally, or to avoid making invalid assumptions about that
2161	      handling, when information on the details of such handling was not
2162	      available.

2164	   The above history can, for the purposes of the rest of this document,
2165	   be summarized in the following statements:

2167	   o  The actual treatment of internationalization within NFSv4 has not
2168	      been affected by the particular minor version used, despite the
2169	      fact that the specifications for the minor versions have often
2170	      differed in their treatment of internationalization.

2172	   o  With regard to file names, implementations have followed the
2173	      internationalization approach specified in RFC3010, which is
2174	      compatible with the treatment in RFC7530.

2176	   o  With regard to internationalized domain names, RFC7530 [3]
2177	      specified an approach compatible with IDNA at the time of
2178	      publication.  However, no detailed analysis was done to determine
2179	      whether NFSv4 implementations actually followed that approach.

2181	   o  RFC7530 [3] did not specifically address the special issues that
2182	      clients would face, relying on the assumption that each file is
2183	      accessible only by its name.  As this assumption is no longer true
2184	      when internationalized name handling is in effect, the appropriate
2185	      handling is discussed below.  Section 11.2 explains the options
2186	      for handling in the case in which the client has very limited
2187	      information about the details of the server's
2188	      internationalization-related handling of file names while
2189	      Section 11.3 discusses how a client might use more complete
2190	      information provided by new attributes.

2192	   In order to deal with all NFSv4 minor versions, this document follows
2193	   the internationalization approach defined in RFC7530, with some
2194	   changes discussed in Section 4 and applies that approach to all NFSv4
2195	   minor versions.

2197	Appendix B.  Form-insensitive String Comparisons

2199	   This section deals with two varieties of form-insensitive string
2200	   comparison:

2202	   o  Providing a comparison function which is form-insensitive only.
2203	      For any string, whether normalized or not, this function will
2204	      determine it to be equivalent to all canonically equivalent
2205	      strings, including, but not limited to, the normalized forms NFC
2206	      and NFD.

2208	   o  Providing a comparison function which is both form-insensitive and
2209	      case-insensitive.  This function will determine strings that only
2210	      differ in case to be equal but will also be form-insensitive, as
2211	      described above.

2213	   The non-normative guidance provided in this Appendix is intended to
2214	   be helpful to two distinct implementation areas:

2216	   o  Implementation of server-side file systems intended to be accessed
2217	      using NFSv4 protocols.  While it is often the case that such
2218	      filesystems are developed by separate organizations from those
2219	      concerned with NFSv4 server development, the internationalization-
2220	      related requirements specified in this document must be adhered to
2221	      for successful inter-operation, making this implementation
2222	      guidance apropos despite any potential organizational barriers.

2224	   o  Implementation of NFSv4 clients that need to provide matching
2225	      internationalization-related handling for reasons discussed in
2226	      Section 11.

2228	   There are three basic reasons that two strings being compared might
2229	   be canonically equivalent even though not identical.  For each such
2230	   reason, the implementation will be similar in the cases in which
2231	   form-insensitive comparison (only) is being done and in which the
2232	   comparison is both case-insensitive and form-insensitive.

2234	   o  Two strings may differ only because each has a different one of
2235	      two code points that are essentially the same.  Three code points
2236	      assigned to represent units are essentially equivalent to the
2237	      letters denoting those units.  For example, the OHM SIGN
2238	      (U+2126) is essentially identical to the GREEK CAPITAL LETTER
2239	      OMEGA (U+03A9) as MICRO SIGN (U+00B5) is to GREEK SMALL LETTER MU
2240	      (U+03BC) and ANGSTROM SIGN (U+212B) is to LATIN CAPITAL LETTER A
2241	      WITH RING ABOVE (U+00C5).

2243	      As discussed in items EX2 and EX3 in Section 10.2, it is possible
2244	      to adjust for this situation using tables designed to resolve
2245	      case-insensitive equivalence, treating the unit symbols as an
2246	      additional case variant and ignoring the fact that the graphic
2247	      representation is the same.  As a result, those doing string
2248	      comparisons that are both form-insensitive and case-insensitive
2249	      do not need to address this issue as part of form-insensitivity,
2250	      since it would be dealt with by existing case-insensitive
2251	      comparison logic.

2253	      Where there is no case-insensitive comparison logic, this function
2254	      needs to be performed using similar tables whose primary function
2255	      is to provide the decomposition of precomposed characters, as
2256	      described in Appendix B.2.

2258	   o  Two strings may differ in that one has the decomposed form
2259	      consisting of a base character and an associated combining
2260	      character while the other has a precomposed character equivalent.

2262	      Although, as discussed in item EX3 in Section 10.2, it is possible
2263	      to use tables designed to resolve case-insensitive equivalence by
2264	      providing, as a possible case-insensitively equivalent string, a
2265	      multi-character string giving the decomposition of a precomposed
2266	      character, special logic to do so is only necessary when the
2267	      decomposition is not a canonical one, i.e. it is a compatibility
2268	      equivalence.

2270	      In general, the tables used to do comparisons, whether case-
2271	      sensitive or not, need to provide information about the canonical
2272	      decomposition of precomposed characters.  See Appendix B.2 for
2273	      details.

2275	   o  Two strings may differ in that they contain combining characters
2276	      that have the same effect but differ as to the order in which
2277	      those characters appear.

2279	      There is no way this function could be performed within code
2280	      primarily devoted to case-insensitive equivalence.  However, this
2281	      function could be added to implementations providing both sorts
2282	      of equivalence, once it is determined that the base characters are
2283	      case-equivalent while there is a difference in combining
2284	      characters to be resolved.  (See Appendix B.5 for a discussion
2285	      of how sets of combining characters can be compared.)
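
   As a non-normative illustration, the three situations above can be
   checked with Python's standard unicodedata module, using full NFD
   normalization as a reference definition of canonical equivalence.
   The helper name canonically_equivalent is an assumption made for
   this sketch; an implementation following this appendix would use
   the table-driven approach rather than normalizing both strings.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Reference check (hypothetical helper): two strings are canonically
    equivalent exactly when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Reason 1: two code points that are essentially the same character.
ohm_vs_omega = canonically_equivalent("\u2126", "\u03a9")

# Reason 2: precomposed character versus base plus combining character.
precomposed_vs_decomposed = canonically_equivalent("\u00e9", "e\u0301")

# Reason 3: combining characters of different combining classes
# (COMBINING DOT BELOW, class 220, and COMBINING ACUTE ACCENT, class 230)
# appearing in different orders.
reordered_marks = canonically_equivalent("a\u0323\u0301", "a\u0301\u0323")
```

   In the Unicode normalization data, MICRO SIGN (U+00B5) has only a
   compatibility mapping to GREEK SMALL LETTER MU, so this reference
   check does not equate that pair; treating them as equal requires
   the table-based handling described above.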

2287	B.1.  Name Hashes

2289	   We discussed in Section 10.1 the construction of a case-insensitive
2290	   file name hash.  Such a hash could also be form-insensitive if
2291	   the hash contribution of every pre-composed character matched the
2292	   combined contribution of the characters that it decomposes into.

2294	   However, there is no obvious way that sort of hash could respect the
2295	   canonical equivalence of multiple combining characters modifying the
2296	   same base character, when those combining characters appear in
2297	   different orders.  Addressing that issue would require a
2298	   significantly different sort of hash, in which combining characters
2299	   are treated differently from others, so that the re-ordering of a
2300	   string of combining characters applying to the same base character
2301	   will not affect the hash.

2303	   In the hash discussed in Section 10.1, there is no guarantee that the
2304	   hash for multiple combining characters presented in different orders
2305	   will be the same.  This is because typically such hashes implement
2306	   some transformation on the existing hash, together with adding the
2307	   new character to the hash being accumulated.  Such methods of hash
2308	   construction will arrive at different values if the ordering of
2309	   combining characters changes.

2311	   In order to create a hash with the necessary characteristics, one can
2312	   construct a separate sub-hash for each composite character, which
2313	   consists of one non-combining character (possibly pre-composed)
2314	   together with the set (possibly null) of combining characters
2315	   immediately following it.  Each such composite character, whether
2316	   precomposed or not, will have its own sub-hash, which will be the
2317	   same regardless of the order of the combining characters.

2319	   If the hash is to include case-insensitivity, special handling is
2320	   needed to deal with issues arising from the handling of COMBINING
2321	   GREEK YPOGEGRAMMENI (U+0345).  That combining character, as discussed
2322	   in item EX6 of Section 10.2 is uppercased to the non-combining
2323	   character GREEK CAPITAL LETTER IOTA (U+0399) which is in turn
2324	   lowercased to the non-combining character GREEK SMALL LETTER IOTA
2325	   (U+03B9).  As a result, when computing a case-insensitive hash, when
2326	   a base character is IOTA (of either case) and the previous base
2327	   character is ALPHA, ETA, or OMEGA (of the same case as the IOTA),
2328	   that IOTA is treated, for the purpose of defining the composite
2329	   characters for which to generate sub-hashes, as if it were a combining
2330	   character.  As a result, in this case a string containing two
2331	   composite characters will be treated as if it were a single composite
2332	   character, since the iota will be treated as if it were a combining
2333	   character.  This string will have its own sub-hash, which will be the
2334	   same regardless of the order of combining characters.

2336	   The same outline will be followed for generating hashes which are to
2337	   be form-insensitive (only) and for those which are to be both form-
2338	   insensitive and case-insensitive.  The initial value, representing
2339	   the base character, will differ based on the type of hash.

2341	   o  In the case-sensitive case, the initial value of the sub-hash will
2342	      reflect the value of the base character with the only possible
2343	      need to map to a different value deriving from the existence of
2344	      OHM SIGN (U+2126), ANGSTROM SIGN (U+212B), and MICRO SIGN (U+00B5)
2345	      as characters distinct from the letters that represent these code
2346	      points.  This could be done with a mapping table but most
2347	      implementations would probably choose to implement special-purpose
2348	      code to do this.

2350	   o  In the case-insensitive case, the initial value of the sub-hash
2351	      will reflect the case-based equivalence class to which the
2352	      character belongs (the lower-case equivalent is generally
2353	      suitable).  In this context a table-based mapping is required and
2354	      this mapping can shift OHM SIGN, ANGSTROM SIGN, and MICRO SIGN to
2355	      the case-based equivalence class for the corresponding character.

2357	   Regardless of the type of hash to be produced, values based on the
2358	   following combining characters need to be reflected in the sub-hash.
2359	   In order to make the sub-hash invariant to changes in the order of
2360	   combining characters, values based on the particular combining
2361	   character are combined with the hash being computed using a
2362	   commutative associative operation, such as addition.

2364	   To reduce false positives it is desirable to make the hash relatively
2365	   wide (i.e. 32-64 bits), with the value based on the base character in
2366	   the upper portion of the word and the values for the combining
2367	   characters appearing in a wide range of bit positions in the rest of
2368	   the word, to limit the degree to which multiple distinct sets of
2369	   combining characters have values that are the same.  Although the
2370	   details will be affected by processor cache structure and the
2371	   distribution of names processed, a table of values will be used, but
2372	   typical implementations will be different in the two cases we are
2373	   dealing with, as described in Appendix B.2.

2375	   As each sub-hash is computed, it is combined into a name-wide hash.
2376	   There is no need for this computation to be order-independent and it
2377	   will probably include a circular shift of the hash computed so far to
2378	   be added to the contribution of the sub-hash for the new base or
2379	   composed character.

2381	   As described in Appendix B.3 the appropriate full name hash will have
2382	   the major role in excluding potential matches efficiently.  However,
2383	   in some small number of cases, there will be a hash match in which
2384	   the names to be compared are not equivalent, requiring more involved
2385	   processing.  It is assumed below that a given name will be used to
2386	   search for potential cached matches within the directory, so that,
2387	   for that name, one will be able to retain information used to
2388	   construct the full name hash (e.g. individual sub-hashes plus the
2389	   bounds of each composite character).  These will be compared against
2390	   cached entries where only the full (e.g. 64-bit) name hash and the
2391	   name itself will be available for comparison.

2393	B.2.  Character Tables

2395	   The per-character tables used in these algorithms have a number of
2396	   types of entries for different types of characters.  In some cases,
2397	   information for a given character type will be essentially the same
2398	   whether the comparison is to be form-insensitive or case-
2399	   insensitive.  In others, there will be differences.  Also, there may
2400	   be entry types that only exist for particular types of comparisons.
2401	   In any case, some bits within the table entry will be devoted to
2402	   representing the type of character and entry:

2404	   o  For combining characters, the entry will provide information about
2405	      the character's contribution to the composite character sub-hash
2406	      in which it appears.

2408	   o  For case-insensitive comparisons, there need to be special entries
2409	      for characters, which, while not themselves combining characters,
2410	      are the case-insensitive equivalents of combining characters.  An
2411	      example of this situation is provided in item EX6 within
2412	      Section 10.2.

2414	   o  For pre-composed characters, the entry needs to provide the
2415	      initial hash value which is to be the basis for the sub-hash for
2416	      the name substring including contributions for the base character
2417	      together with contribution of included combining characters.  In
2418	      addition, such entries will provide, separately, information about
2419	      the character's canonical decomposition.

2421	   o  For case-insensitive comparisons, there needs to be, for base
2422	      characters, entries assigning each base character to the case-
2423	      based equivalence class to which it belongs, although such entries
2424	      can be avoided if the equivalence class matches the character
2425	      (usually caseless and lowercase characters).

2427	   o  Also, for case-insensitive comparisons, there will need to be
2428	      special entries for characters which have a multi-character string
2429	      as a case-insensitive equivalent of the base character.  Examples
2430	      of this situation are provided in items EX4 and EX5 within
2431	      Section 10.2.  Such entries will need to have a hash contribution
2432	      that reflects the hash that would be computed for the
2433	      multi-character string.

2435	   o  For form-insensitive comparisons, there will be special entries to
2436	      provide special handling for those cases in which there are two
2437	      canonically equivalent single characters.  Such entries do not
2438	      exist for case-insensitive comparison since this situation can be
2439	      handled by non-standard case mapping of base characters, placing
2440	      these two characters in the same case-based equivalence class.

2442	   In the common case in which a two-stage mapping will be used, there
2443	   will be common groups of characters in which no table entry will be
2444	   required, allowing a default entry type to be used for some character
2445	   groups with entry contents easily calculable from the code point.

2447	   o  For form-insensitive comparison, this consists of all base
2448	      characters, with the hash contribution of the character derivable
2449	      by a pre-specified transformation of the code point value.

2451	   o  For case-insensitive comparison, this consists of all base
2452	      characters which are either caseless or whose equivalence class is
2453	      the same as the code point, typically lowercase characters.  As in
2454	      the form-insensitive case, the hash contribution of the character
2455	      is derivable by a pre-specified transformation of the code point
2456	      value, which matches, in this case, the id assigned to the case-
2457	      based equivalence class.

2459	B.3.  Outline of comparison

2461	   We are assuming that comparisons will be based on the hash values
2462	   computed as described in Appendix B.1, whether the comparison is to
2463	   be form-insensitive or both case-insensitive and form-insensitive.

2465	   To facilitate this comparison, the name hash will be stored with the
2466	   names to be compared.  As a result, when there is a need to
2467	   investigate a new name and whether there are existing matches, it
2468	   will be possible to search for matches with existing names cached for
2469	   that directory, using a hash for the new name which is computed and
2470	   compared to all the existing names, with the result that the detailed
2471	   comparisons described in Appendices B.4 and B.5 have to be done
2472	   relatively rarely, since non-matching names together with matching
2473	   hashes are likely to be atypical.
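
   The role of the stored hash as a filter can be sketched as follows.
   This is a non-normative sketch: the class DirectoryNameCache and the
   functions name_hash and detailed_match are hypothetical names, and
   both functions are NFD-based stand-ins for the hash of Appendix B.1
   and for the detailed comparisons of Appendices B.4 and B.5.

```python
import unicodedata
from collections import defaultdict


def name_hash(name: str) -> int:
    # Stand-in for the form-insensitive hash of Appendix B.1.
    return hash(unicodedata.normalize("NFD", name))


def detailed_match(a: str, b: str) -> bool:
    # Stand-in for the detailed comparison of Appendices B.4 and B.5.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


class DirectoryNameCache:
    """Cached names for one directory, indexed by their stored name hash."""

    def __init__(self) -> None:
        self._by_hash: dict[int, list[str]] = defaultdict(list)

    def add(self, name: str) -> None:
        self._by_hash[name_hash(name)].append(name)

    def find(self, name: str):
        # Only names whose stored hash matches are examined, so the
        # detailed comparison runs relatively rarely.
        for cached in self._by_hash.get(name_hash(name), ()):
            if cached == name or detailed_match(cached, name):
                return cached
        return None
```

   A lookup for a name in a different normalization form than the one
   cached then returns the cached spelling, for example the precomposed
   form when the decomposed form is searched for.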

2475	   Given the above, it is a reasonable assumption, which we will take
2476	   note of in the sections below, that for one of the names to be
2477	   compared, we will have access to data generated in the process of
2478	   computing the name hash while for the other names, such data would
2479	   have to be generated anew, when necessary.  When that data includes,
2480	   as we expect it will, the offset and length of the string regions
2481	   covered by each sub-hash, direct byte-by-byte comparisons between
2482	   corresponding regions of the two strings can exclude the possibility
2483	   of difference without invoking the detailed logic needed to deal
2484	   with canonical equivalence or case-based equivalence when name
2485	   segments are not identical.

2487	   In the case in which the byte-by-byte comparisons fail, further
2488	   analysis is necessary:

2490	   o  First, the associated base characters are compared, as is
2491	      discussed in Appendix B.4.  When doing form-insensitive comparison
2492	      this is straightforward.  However, when case-insensitive
2493	      comparison is to be done, there is the possibility that the sub-
2494	      hash boundaries of the two comparands are different, requiring
2495	      that a common point in both comparands be found to resume
2496	      comparison after a successful match.  For either form of
2497	      comparison, if a mismatch is found at this point then the
2498	      comparison fails, while, if there is a match, there must be a
2499	      comparison of any following combining characters, as described
2500	      below, before moving on to the sub-string covered by the next
2501	      sub-hash for each comparand.

2504	   o  If there is no mismatch as to the base characters, the set of
2505	      associated combining characters (might be null) must be compared,
2506	      as is discussed in Appendix B.5.  If a mismatch is found at this
2507	      point then the comparison fails.  This may be because the sets of
2508	      combining characters are different, because there are multiple
2509	      copies of the same combining character in one of the strings, or
2510	      because the difference in combining character order is not one
2511	      that maintains canonical equivalence (due to combining classes).

2513	   o  When both comparisons show a match, the comparison resumes at the
2514	      next substring, using a byte-by-byte comparison initially.  If the
2515	      comparison cannot be resumed because one of the strings is
2516	      exhausted, the comparison terminates, succeeding only if both
2517	      strings are exhausted while failing if only one of the strings is
2518	      exhausted.

2520	B.4.  Comparing Base Characters

2522	   In general, the task of comparing base characters is simple, using a
2523	   table lookup keyed by the numeric value of the initial character in
2524	   the substring.  When doing form-insensitive comparison this is the
2525	   base character associated with the initial (possibly pre-composed)
2526	   character, while for case-insensitive comparison it is the
2527	   case-based equivalence class associated with that character.

2529	   When doing case-insensitive comparison, issues may arise when there
2530	   is a multi-character string that is the case-insensitive equivalent
2531	   of a single base character, as discussed in items EX4 and EX5 within
2532	   Section 10.2.  These are best dealt with using the approach outlined
2533	   in Section 10.1.  When it is noted that the current base character
2534	   (for either comparand) is a character whose associated equivalence
2535	   class contains one or more multi-character strings, then these
2536	   comparisons, which normally require that each base character be
2537	   mapped to the same case-based equivalence class, must be modified to
2538	   allow the equivalences permitted by these multi-character sequences.

2540	   In such cases, there may need to be comparisons involving the multi-
2541	   character string, in addition to the normal comparisons using the
2542	   base characters' equivalence class.  As an illustration, we will
2543	   consider possible comparison results that involve character strings
2544	   within the equivalence class mentioned in item EX4 within
2545	   Section 10.2:

2547	   o  When the base characters for both comparands are either LATIN
2548	      SMALL LETTER SHARP S (U+00DF) or LATIN CAPITAL LETTER SHARP S
2549	      (U+1E9E), then a match is recognized.

2551	   o  When the base character for one comparand is either LATIN SMALL
2552	      LETTER SHARP S (U+00DF) or LATIN CAPITAL LETTER SHARP S (U+1E9E),
2553	      while the other is not, each character in that other comparand
2554	      is case-insensitively compared to the corresponding character of
2555	      the string "ss" with a match being signaled when all such
2556	      subsequent characters match, except for possibly being of a
2557	      different case.  Because that comparison will involve multiple
2558	      base characters, the overall comparison point for that comparand
2559	      will have to be adjusted to reflect characters already processed
2560	      as part of the comparison.

2562	   o  When the base character for neither comparand is LATIN SMALL
2563	      LETTER SHARP S (U+00DF) or LATIN CAPITAL LETTER SHARP S (U+1E9E),
2564	      then matching proceeds normally.  As a result, the only cases in
2565	      which character strings within the equivalence class being
2566	      discussed will match are those in which both comparands have one
2567	      of the strings "ss", "sS", "Ss", or "SS" at the current comparison
2568	      point.
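
   The three cases above can be sketched as follows.  This is a
   non-normative sketch restricted to the sharp-s equivalence class:
   the function name match_base_chars is hypothetical, combining
   characters are assumed to be absent, and Python's str.lower() stands
   in for the table mapping each base character to its case-based
   equivalence class.

```python
SHARP_S = {"\u00df", "\u1e9e"}  # LATIN SMALL / CAPITAL LETTER SHARP S


def match_base_chars(a: str, b: str) -> bool:
    """Case-insensitive match of two strings of base characters."""
    i = j = 0
    while i < len(a) and j < len(b):
        ca, cb = a[i], b[j]
        if ca in SHARP_S and cb in SHARP_S:
            # Both comparands have a sharp s: a match is recognized.
            i += 1
            j += 1
        elif ca in SHARP_S:
            # Compare the next two characters of b against "ss" and
            # advance b's comparison point past both of them.
            if b[j:j + 2].lower() != "ss":
                return False
            i += 1
            j += 2
        elif cb in SHARP_S:
            if a[i:i + 2].lower() != "ss":
                return False
            i += 2
            j += 1
        elif ca.lower() == cb.lower():
            # Neither is a sharp s: matching proceeds normally.
            i += 1
            j += 1
        else:
            return False
    # Succeed only if both strings are exhausted together.
    return i == len(a) and j == len(b)
```

   For example, match_base_chars("Stra\u00dfe", "STRASSE") consumes one
   character of the first comparand and two of the second at the sharp
   s, which is the adjustment of the comparison point described above.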

2570	B.5.  Comparing Combining Characters

2572	   In order to effect the necessary comparison, one needs to assemble,
2573	   for each comparand, the set of combining characters within the
2574	   current substring.  The means used might be different for different
2575	   comparands since there might be useful information retained from the
2576	   generation of the associated string hash for one of the comparands.
2577	   In any case, there are two potential sources for these characters:

2579	   o  Those deriving from the canonical decomposition of a pre-composed
2580	      character, treated as a null set if the base character is not a
2581	      precomposed one.

2583	   o  Those combining characters immediately following the base
2584	      character, which will be a null set if the immediately following
2585	      character is not a combining character.  Note that it is possible,
2586	      when doing case-insensitive comparison, to treat certain
2587	      characters, not normally combining characters, as if they were.
2588	      Such situations can arise when, as described in item EX6 within
2589	      Section 10.2, such non-combining characters are the uppercase or
2590	      lowercase equivalents of combining characters.

2592	   Although the two sets of characters can be checked to see if they are
2593	   identical, this is a sufficient but not a necessary condition for
2594	   equivalence since some permutations of a set of combining characters
2595	   are considered canonically equivalent.  To summarize the appropriate
2596	   equivalence rules:

2598	   o  Combining characters of different combining classes may be freely
2599	      reordered.

2601	   o  If combining characters of the same combining class are reordered,
2602	      then the result is not canonically equivalent.

2604	   The rules above do not directly apply to the case, discussed above,
2605	   in which some non-combining characters are the case-based equivalents
2606	   of combining characters such as COMBINING GREEK YPOGEGRAMMENI
2607	   (U+0345).  Nevertheless, because of this equivalence, those
2608	   implementing case-insensitive comparisons do have to deal with this
2609	   potential equivalence when considering whether two strings containing
2610	   combining characters or their case-based equivalents match.  As a
2611	   result, when comparing strings of combining characters, we need to
2612	   implement the following modified rules:

2614	   o  When one comparand has a true combining character and the other
2615	      comparand has an identical one, they may differ in location as
2616	      long as there is no permutation of combining characters of the
2617	      same combining class.

2619	   o  When one comparand has a true combining character and the other
2620	      has a case-insensitive equivalent which is not a combining
2621	      character, that character must appear last in its string while the
2622	      combining character may appear in its string in any position
2623	      except the last.  In this case, there are no restrictions based on
2624	      combining classes.

2626	   o  When both comparands contain a non-combining character that is
2627	      case-insensitively equivalent to a combining character, these
2628	      characters must appear last in their respective strings.

2630	   Although it is possible to divide combining characters based on their
2631	   combining classes, sort each of the lists, and compare them, that
2632	   approach will not be discussed here.  Even though the use of sorts
2633	   might allow use of an overall N log N algorithm, the number of
2634	   combining characters is likely to be too low for this to be a
2635	   practical benefit.  Instead, we present below an order N-squared
2636	   algorithm based on searches.

2638	   In this algorithm, one string, chosen arbitrarily, is designated the
2639	   "source string" and successive characters from it are searched for in
2640	   the other, designated the "target string".  Associated with the
2641	   target string is a mask that allows characters searched for and found
2642	   to be marked so that they will not be found a second time.  In the
2643	   treatment below, when a character is "searched for", only characters
2644	   not yet in the mask are examined and the character sought has its
2645	   associated mask bit set when it is found.

2647	   Each character in the source string is processed in turn, with the
2648	   actual processing depending on the particular character being
2649	   processed.  There are three possibilities to be dealt with:

2651	   1.  For the typical case (i.e. a combining character with no case-
2652	       insensitive equivalents), the character is searched for in the
2653	       target string with the compare failing if it is not found.

2655	       If it is found, then the region of the target string between the
2656	       point corresponding to the current position in the source string
2657	       and the character found is examined to check for characters of
2658	       the same combining class.  If any are found, the overall
2659	       comparison fails.

2661	   2.  For the case of a combining character with a case-insensitive
2662	       equivalent, the character is searched for as described in the
2663	       first paragraph of item 1.  However, the compare does not fail if
2664	       it is not found.  Instead, a case-insensitive equivalent
2665	       character is searched for at the final position of the string and
2666	       the compare fails if that is not found.

2668	   3.  For the case of a non-combining character that has a combining
2669	       character as a case-insensitive equivalent, the overall
2670	       comparison fails if the character is not in the final position
2671	       within the source string or has already been successfully
2672	       searched for.  Otherwise, the corresponding combining character
2673	       is searched for in the target as described in the first
2674	       paragraph of item 1.  The overall compare fails if it is not
2675	       found.

2677	   Once all characters in the source string have been processed, the
2678	   mask is examined to see if there are combining characters that were
2679	   not found in the matching process described above.  Normally, if
2680	   there are such characters, the overall comparison fails.  However, if
2681	   the last character of the target was not matched and if it is a non-
2682	   combining character that is case-insensitively equivalent to a
2683	   combining character, then the comparison succeeds and the remaining
2684	   character needs to be matched with the next substring in the source.
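
   A minimal, non-normative sketch of the search-based algorithm for
   the typical case (item 1), covering form-insensitive comparison
   only, is given below.  The function name combining_sets_equivalent
   is hypothetical.  As an assumption of this sketch, the same-class
   check is expressed as "no still-unmarked target character of the
   same combining class precedes the character found", which is one
   way of ensuring that characters of a single combining class appear
   in the same relative order in both strings.  The case-insensitive
   handling of items 2 and 3 is not attempted.

```python
import unicodedata


def combining_sets_equivalent(source: str, target: str) -> bool:
    """Compare two strings of combining characters that follow
    equivalent base characters (order N-squared, search based)."""
    found = [False] * len(target)  # the mask on the target string
    for ch in source:
        ccc = unicodedata.combining(ch)  # canonical combining class
        hit = -1
        for q, t in enumerate(target):
            if found[q]:
                continue  # already matched; not examined again
            if t == ch:
                hit = q
                break
            if unicodedata.combining(t) == ccc:
                # An unmatched character of the same class comes first,
                # so the two orderings are not canonically equivalent.
                return False
        if hit < 0:
            return False  # character not present in the target
        found[hit] = True
    # Any target character left unmarked has no counterpart in the source.
    return all(found)
```

   With COMBINING DOT BELOW (class 220) and COMBINING ACUTE ACCENT
   (class 230), either order is accepted, while swapping COMBINING
   GRAVE ACCENT and COMBINING ACUTE ACCENT (both class 230) is
   rejected.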

2686	Acknowledgements

2688	   This document is based, in large part, on Section 12 of [3], and all
2689	   the people who contributed to that work have helped make this
2690	   document possible, including David Black, Peter Staubach, Nico
2691	   Williams, Mike Eisler, Trond Myklebust, James Lentini, Mike Kupfer
2692	   and Peter Saint-Andre.

2694	   The author wishes to thank Tom Haynes for his timely suggestion to
2695	   pursue the task of dealing with internationalization on an NFSv4-wide
2696	   basis.

2698	   The author wishes to thank Nico Williams for his insights regarding
2699	   the need for clients implementing file access protocols to be aware
2700	   of the details of the server's internationalization-related name
2701	   processing, particularly when case-insensitive file systems are being
2702	   accessed.

2704	Author's Address

2706	   David Noveck
2707	   NetApp
2708	   1601 Trapelo Road
2709	   Waltham, MA  02451
2710	   United States of America

2712	   Phone: +1 781 572 8038
2713	   Email: davenoveck@gmail.com