NFSv4                                                          D. Noveck
Internet-Draft                                                    NetApp
Updates: 8881, 7530 (if approved)                     September 26, 2021
Intended status: Standards Track
Expires: March 30, 2022

              Internationalization for the NFSv4 Protocols
                draft-ietf-nfsv4-internationalization-01

Abstract

   This document describes the handling of internationalization for all
   NFSv4 protocols, including NFSv4.0, NFSv4.1, NFSv4.2 and extensions
   thereof, and future minor versions.

   It updates RFC7530 and RFC8881.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 30, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
     2.1.  Requirements Language Definition
     2.2.  Requirements Language Derivation
   3.  Internationalization and Minor Versioning
   4.  Changes Relative to RFC7530
   5.  Limitations on Internationalization-Related Processing in the
       NFSv4 Context
   6.  Summary of Server Behavior Types
   7.  The Attribute Fs_charset_cap
     7.1.  The Attribute Fs_charset_cap in Published NFSv4.1
           Specifications
     7.2.  The Attribute Fs_charset_cap in Future NFSv4.1
           Specifications
   8.  String Encoding
   9.  Normalization
   10. Case-Insensitive Processing of File Names
     10.1.  Implementing Case-Insensitive Comparison of File Names
     10.2.  Important Examples of Case-insensitive Handling of File
            Names
   11. Internationalization-related Processing of File Names by
       Clients
     11.1.  Server Restrictions to Deal with Lack of Client Knowledge
     11.2.  Client Processing of File Names for Current NFSv4 Protocols
     11.3.  Client Processing of File Names for Future NFSv4 Protocols
   12. String Types with Processing Defined by Other Internet Areas
     12.1.  Effect of IDNA Changes
     12.2.  Potential Compatibility Issues Related to IDNA Changes
   13. Errors Related to UTF-8
   14. Servers That Accept File Component Names That Are Not Valid
       UTF-8 Strings
   15. Future Minor Versions and Extensions
   16. IANA Considerations
   17. Security Considerations
   18. References
     18.1.  Normative References
     18.2.  Informative References
   Appendix A.  History
   Appendix B.  Form-insensitive String Comparisons
     B.1.  Name Hashes
     B.2.  Character Tables
     B.3.  Outline of comparison
     B.4.  Comparing Base Characters
     B.5.  Comparing Combining Characters
   Acknowledgements
   Author's Address

1.  Introduction

   Internationalization is a complex topic with its own set of
   terminology (see [22]).
   The topic is made more complex for the NFSv4
   protocols by the tangled history described in Appendix A.  In large
   part, this document is based on the actual behavior of NFSv4 client
   and server implementations (for all existing minor versions) and is
   intended to serve as a basis for further implementations to be
   developed that can interact with existing implementations as well as
   those to be developed in the future.

   Note that the behaviors on which this document is based are each
   demonstrated by a combination of an NFSv4 server implementation
   proper and a server-side physical file system.  It is common for
   servers and physical file systems to be configurable as to the
   behavior shown.  In the discussion below, each configuration that
   shows different behavior is considered separately.

   As a consequence of this choice, normative terms defined in RFC2119
   [1] are often derived from implementation behavior, rather than the
   other way around, as is more commonly the case.  The specifics are
   discussed in Section 2.

   With regard to the question of interoperability with existing
   specifications for NFSv4 minor versions, different minor versions
   pose different issues.

   o  With regard to NFSv4.0 as defined in RFC7530 [3], no significant
      interoperability issues are expected to arise because the
      internationalization in that specification, which is the basis for
      this one, was also based on the behavior of existing
      implementations.  Although, in a formal sense, the treatment of
      internationalization here supersedes that in RFC7530 [3], the
      treatments are intended to be essentially the same, in order to
      eliminate interoperability issues.

      Because of a change in the handling of internationalized domain
      names, there are some differences from the handling in RFC7530
      [3], as discussed in Appendix A.
      For a discussion of those
      differences and potential compatibility issues, see Sections 12.1
      and 12.2.

   o  With regard to NFSv4.1 as defined by RFC8881 [9], the situation is
      quite different.  The approach to internationalization specified
      in that document, based in large part on that in RFC3530, was
      never implemented, and implementers were either unaware of the
      troublesome implications of that approach or chose to ignore the
      existing specification as essentially unimplementable.  An
      internationalization approach compatible with that specified in
      RFC7530 [3] tended to be followed, despite the fact that, in other
      respects, NFSv4.1 was considered to be a separate protocol.

      If there were NFSv4 servers that obeyed the internationalization
      dictates within RFC5661 [21], or clients that expected servers to
      do so, they would fail to interoperate with typical clients and
      servers when dealing with non-UTF8 file names, which are quite
      common.  As no such implementations have come to our attention, it
      has to be assumed that they do not exist and interoperability with
      existing implementations as described here is an appropriate basis
      for this document.

2.  Requirements Language

2.1.  Requirements Language Definition

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [1] [2] when,
   and only when, they appear in all capitals, as shown here.

2.2.  Requirements Language Derivation

   Although the key words "MUST", "SHOULD", and "MAY" retain their
   normal meanings, as described above, we need to explain how the
   statements involving these terms were arrived at:

   o  In the case of statements within Sections 12 and 15, these derive
      from the requirements of other internet specifications.

   o  In the case of statements within Sections 7, 10, and 11, these
      derive from the author's view of the appropriate normative
      language to use and will, when this document is advanced,
      represent the working group's consensus on those same matters.

   o  However, in other cases, i.e. those in sections deriving from
      RFC7530 [3] (i.e. Sections 5, 6, 8, 9, 13, 14, 16, and 17), this
      specification's descriptions were derived from existing
      implementation patterns.  Although this pattern is atypical, it is
      needed to provide a description that satisfies the goal of RFC2119
      [1], providing a normative description to enable future
      implementations to be compatible with existing ones.  This
      requires that we explain later in this section how the normative
      terms used derive from the behavior of existing implementations,
      in those situations in which existing implementation behavior
      patterns can be determined.

   Note that in introductory and explanatory sections of this document
   (i.e. Sections 1 through 4) these terms do not appear except to
   explain how they are used in this document.  Also, they do not appear
   in Appendix B, which provides non-normative implementation guidance.

   With regard to the parts of this document deriving from RFC7530, we
   explain below how the normative terms used derive from the behavior
   of existing implementations, in those situations in which existing
   implementation behavior patterns can be determined.

   o  Behavior implemented by all existing clients or servers is
      described using "MUST", since new implementations need to follow
      existing ones to be assured of interoperability.  While it is
      possible that different behavior might be workable, we have found
      no case where this seems reasonable.

      The converse holds for "MUST NOT": if a type of behavior poses
      interoperability problems, it MUST NOT be implemented by any
      existing clients or servers.

   o  Behavior implemented by most existing clients or servers, where
      that behavior is more desirable than any alternative, is described
      using "SHOULD", since new implementations need to follow that
      existing practice unless there are strong reasons to do otherwise.

      The converse holds for "SHOULD NOT".

   o  Behavior implemented by some, but not all, existing clients or
      servers is described using "MAY", indicating that new
      implementations have a choice as to whether they will behave in
      that way.  Thus, new implementations will have the same
      flexibility that existing ones do.

   o  Behavior implemented by all existing clients or servers, so far as
      is known -- but where there remains some uncertainty as to details
      -- is described using "should".  Such cases primarily concern
      details of error returns.  New implementations should follow
      existing practice even though such situations generally do not
      affect interoperability.

   There are also cases in which certain server behaviors, while not
   known to exist, cannot be reliably determined not to exist.  In part,
   this is a consequence of the long period of time that has elapsed
   since the publication of the defining specifications, resulting in a
   situation in which those involved in the implementation work may no
   longer be involved in or be aware of working group activities.

   In the case of possible server behavior that is neither known to
   exist nor known not to exist, we use "SHOULD NOT" and "MUST NOT" as
   follows, and similarly for "SHOULD" and "MUST".

   o  In some cases, the potential behavior is not known to exist but is
      of such a nature that, if it were in fact implemented,
      interoperability difficulties would be expected and reported,
      giving us cause to conclude that the potential behavior is not
      implemented.  For such behavior, we use "MUST NOT".  Similarly, we
      use "MUST" to apply to the contrary behavior.

   o  In other cases, potential behavior is not known to exist but the
      behavior, while undesirable, is not of such a nature that we are
      able to draw any conclusions about its potential existence.  In
      such cases, we use "SHOULD NOT".  Similarly, we use "SHOULD" to
      apply to the contrary behavior.

   In the case of a "MAY", "SHOULD", or "SHOULD NOT" that applies to
   servers, clients need to be aware that there are servers that may or
   may not take the specified action, and they need to be prepared for
   either eventuality.

3.  Internationalization and Minor Versioning

   Despite the fact that NFSv4.0 and subsequent minor versions have
   differed in many ways, the actual implementations of
   internationalization have remained the same and internationalized
   names have been handled without regard to the minor version being
   used.  Minor version specification documents contained different
   treatments of internationalization, as described in Appendix A, but
   of those only the implementation-based approach used by RFC7530 [3]
   resulted in a workable description, while a number of attempts to
   specify an approach that implementors were to follow were all
   ignored.

   It is expected that any future minor versions will follow a similar
   approach, even though there is nothing to prevent a future minor
   version from adopting a different approach as long as the rules
   within [8] are adhered to.  In any such case, the new minor version
   would have to be marked as updating or obsoleting this document.
   Issues relating to potential extensions within the framework
   specified in this document are dealt with in Section 15.

4.  Changes Relative to RFC7530

   This document follows the internationalization approach defined in
   RFC7530, with a number of significant necessary changes.

   o  The handling of internationalization specified in [3] is applied
      to all NFSv4 minor versions.
      No compatibility issues are expected
      to arise because all existing implementations follow the same
      approach to internationalization despite the large difference
      between [3] and what was specified in [21].  Issues relating to
      potential future minor versions and protocol extensions are
      addressed in Section 15.

   o  Some changes motivated by the shift from IDNA2003 to IDNA2008 have
      been made.  The intention is to maintain compatibility with all
      existing NFSv4 minor versions.  Potential compatibility issues
      with regard to the IDNA shift are discussed in Section 12.2.

   o  There is more detailed discussion of case-insensitive handling of
      file names, with particular attention to the complexities that can
      arise when multiple language conventions in these matters need to
      be accommodated.  The discussion in Section 10 applies to both
      client and server, although issues relating to the client's
      knowledge are dealt with in Section 11.

   o  There is additional material dealing with the implications of
      server-side internationalization-related file name processing for
      clients that cache the results of READDIRs.  This includes a
      discussion of options to deal with the current lack of detailed
      information about the server (in Section 11.2), and options for
      handling when more detailed information is available (in
      Section 11.3).

5.  Limitations on Internationalization-Related Processing in the NFSv4
    Context

   There are a number of noteworthy circumstances that limit the degree
   to which internationalization-related encoding and normalization-
   related restrictions can be made universal with regard to NFSv4
   clients and servers:

   o  The NFSv4 client is part of an extensive set of client-side
      software components whose design and internal interfaces are not
      within the IETF's purview, limiting the degree to which a
      particular character encoding might be made standard.

   o  Server-side handling of file component names is typically
      implemented within a server-side physical file system, whose
      handling of character encoding and normalization is not
      specifiable by the IETF.

   o  Typical implementation patterns in UNIX systems result in the
      NFSv4 client having no knowledge of the character encoding being
      used, which might even vary between processes on the same client
      system.

   o  Users may need access to files stored previously with non-UTF-8
      encodings, or with UTF-8 encodings that are not in accord with any
      particular normalization form.

6.  Summary of Server Behavior Types

   Servers MAY reject component name strings that are not valid UTF-8.
   This leads to a number of types of valid server behavior, as outlined
   below.  When these are combined with the valid normalization-related
   behaviors as described in Section 9, this leads to the combined
   behaviors outlined below.

   o  Servers that limit file component names within a given file system
      to UTF-8 strings exist with normalization-related handling as
      described in Section 9.  These are best described as behaving as
      "UTF-8-only servers".

   o  Servers that do not limit file component names on particular file
      systems to UTF-8 strings are very common and are necessary to deal
      with clients/applications not oriented to the use of UTF-8.  Such
      servers ignore normalization-related issues, and there is no way
      for them to implement either normalization or representation-
      independent lookups.  These are best described as behaving as
      "UTF-8-unaware servers" for such file systems, since they treat
      file component names as uninterpreted strings of bytes and have no
      knowledge of the characters represented.  See Section 14 for
      details.

   o  It is possible for a server to allow component names that are not
      valid UTF-8, while still being aware of the structure of UTF-8
      strings.
      Such servers could, in theory, implement either
      normalization or representation-independent lookups but apply
      those techniques only to valid UTF-8 strings.  Such servers are
      not common, but it is possible to configure at least one known
      server to have this behavior.  This behavior SHOULD NOT be used
      due to the possibility that a file name using one encoding may, by
      coincidence, have the appearance of a UTF-8 file name; the results
      of UTF-8 normalization or representation-independent lookups are
      unlikely to be correct in all cases, when considered from the
      viewpoint of the other encoding.  Such difficulties can be
      compounded when case-insensitive name handling is in effect.

7.  The Attribute Fs_charset_cap

   This attribute, nominally "RECOMMENDED", appears to have been added
   to NFSv4.1 to permit servers, while staying within the constraints of
   the stringprep-based specification of internationalization, to allow
   use of UTF-8-unaware naming by clients.  As a result, those NFSv4
   servers implementing internationalization as NFSv3 had done could be
   considered spec-compliant, as long as a later "SHOULD" was ignored.
   However, because use of UTF-8 was tied to existing stringprep
   restrictions, implementations of internationalization that were
   aware of Unicode canonical equivalence issues were not provided for.
   Although this attribute may have been implemented despite the
   problems noted in Section 7.1, the overall scheme was never
   implemented and NFSv4.1 implementations dealt with
   internationalization as NFSv4.0 implementations had.

   It is generally accepted that attributes designated "RECOMMENDED" are
   essentially OPTIONAL with the client having the responsibility to
   deal with server non-support of them.
   While RFC7530 has gone so far
   as to explicitly exclude this use from the general statement that
   these terms are to be used as defined by RFC2119, no NFSv4.1
   specification has done so, at least through RFC8881 [9].  In this
   particular case, there are a number of circumstances that make this
   OPTIONAL status noteworthy:

   o  The statement "It is expected that servers will support all
      attributes they comfortably can and only fail to support
      attributes that are difficult to support in their operating
      environments", appearing in Section 5.2 of [9], is troublesome
      since it is hard to understand how a server could find this read-
      only attribute "difficult to support" regardless of the operating
      environment.

   o  This attribute was added in minor version one, which added a
      number of REQUIRED operations and could well have added a REQUIRED
      attribute.

   o  The fact that the client is to be prepared for non-support of the
      attribute would require specification of a default value, yet none
      is provided.

   The attribute contains two flag bits.  As discussed below, in
   Section 7.1, it is hard to see why two bits are required.  The
   implications of this issue for future NFSv4.1 specifications are
   discussed in Section 7.2.

7.1.  The Attribute Fs_charset_cap in Published NFSv4.1 Specifications

   We reproduce Section 14.4 of [9] below, with comments interspersed
   trying to make sense of what is there, in order to arrive at an
   appropriate replacement, to be presented in Section 7.2.  In that
   connection, we need to understand better a few issues:

   o  The use of two bits while one is clearly adequate, given the
      subject matter actually mentioned.

   o  The mention of possible "capabilities" which could not possibly be
      realized.

   o  The use of the RFC2119 keyword "SHOULD" in contexts in which this
      term is clearly inappropriate.
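
   As a rough illustration of the first issue (a sketch only, not part
   of any NFSv4 specification), the two flag bits defined for this
   attribute, FSCHARSET_CAP4_CONTAINS_NON_UTF8 (0x1) and
   FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 (0x2), admit four combinations,
   where a single boolean would admit two.  The helper names describe
   and is_permitted are hypothetical; the only rule applied is the one
   stated in Section 14.4 of [9], that the first flag is to be zero
   whenever the second is one.

```python
# Flag values as defined in the XDR for fs_charset_cap4.
FSCHARSET_CAP4_CONTAINS_NON_UTF8 = 0x1
FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 = 0x2


def describe(value):
    """Hypothetical helper: split an attribute value into its two flags."""
    return (
        bool(value & FSCHARSET_CAP4_CONTAINS_NON_UTF8),
        bool(value & FSCHARSET_CAP4_ALLOWS_ONLY_UTF8),
    )


def is_permitted(value):
    """Hypothetical helper: apply the single stated constraint, namely that
    CONTAINS_NON_UTF8 is zero whenever ALLOWS_ONLY_UTF8 is one."""
    contains_non_utf8, allows_only_utf8 = describe(value)
    return not (allows_only_utf8 and contains_non_utf8)


# Enumerate all four combinations of the two bits.
for value in range(4):
    print(value, describe(value), is_permitted(value))
```

   Under that one constraint, three of the four values remain
   permitted, which is one more than the subject matter appears to
   call for.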

   Issues related to the confusion caused by mention of "UTF-8
   characters" and the lack of mention of Unicode will be addressed in
   the revision in Section 7.2 but will not be further discussed here.

      const FSCHARSET_CAP4_CONTAINS_NON_UTF8  = 0x1;
      const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8   = 0x2;

      typedef uint32_t        fs_charset_cap4;

   While it is made clear that two separate bits are to be provided,
   their names seem to indicate that they should be complements of one
   another.  As a way of understanding why two bits were specified, it
   is helpful to consider a possible boolean attribute as a potential
   replacement.  That attribute would clearly govern whether names that
   do not conform to the rules of UTF-8 are to be rejected, which was a
   "MUST" in RFC3530 [20].  Although conveying this information is
   clearly part of the motivation, stating so clearly might have been
   judged by the authors as unnecessarily provocative, given the role of
   IESG in arriving at the internationalization approach specified in
   RFC3530.

      Because some operating environments and file systems do not
      enforce character set encodings,

   It is clear that the ability of operating environments to enforce use
   of UTF-8 encoding is not an issue, since RFC3530 made this the
   responsibility of the server implementation.  That mandate was never
   followed because implementers chose not to follow it, and not because
   they were unable to do so.  The apparently confused statement above
   is best understood if one notes that its essential job is to state
   that the "MUST" in RFC3530 referred to above is not reasonable.

   However, the authors might well have felt unable to say so clearly,
   in light of the potential IESG reaction.

      NFSv4.1 supports the fs_charset_cap attribute (Section 5.8.2.11)
      that indicates to the client a file system's UTF-8 capabilities.

   The problem with the mention of (plural) capabilities is that the
   only capability mentioned which servers could implement is to accept
   strings which are not valid UTF-8.  There are other potential
   capabilities having to do with the implementation of canonical
   equivalence, but since they were not mentioned, they will not be
   discussed further here.

      The attribute is an integer containing a pair of flags.  The first
      flag is FSCHARSET_CAP4_CONTAINS_NON_UTF8, which, if set to one,
      tells the client that the file system contains non-UTF-8
      characters,

   As stated, this would mean that a server would have to keep track of
   a count of non-UTF-8-encoded names within the file system and change
   the attribute value as that count varied between zero and non-zero.
   Since it is most unlikely that any server would keep track of that or
   that any client would find it useful, we will assume that the
   capability to store such names is what is most likely intended.

      and the server will not convert non-UTF characters to UTF-8 if the
      client reads a symbolic link or directory,

   There is no way for the server to convert non-UTF names to UTF-8 or
   anything else, since it has no knowledge of the name encoding to
   begin with.  The alternative to treating names as UTF-8-encoded
   Unicode strings is to treat them as POSIX does, as uninterpreted
   strings of bytes.  That makes it impossible to interpret strings that
   do not follow the rules of UTF-8 at all, making it impossible to
   convert the string to UTF-8.

      neither will operations with component names or pathnames in the
      arguments convert the strings to UTF-8.

   As stated above, there is no way a server could ever do that.

      The second flag is FSCHARSET_CAP4_ALLOWS_ONLY_UTF8, which, if set
      to one, indicates that the server will accept (and generate) only
      UTF-8 characters on the file system.

   That is clear and so it poses no problem for a revised treatment,
   unlike the other flag.

      If FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to one,
      FSCHARSET_CAP4_CONTAINS_NON_UTF8 MUST be set to zero.

   There is no problem with this statement.  However, it does, by
   implication, raise the issue of what values of
   FSCHARSET_CAP4_CONTAINS_NON_UTF8 may be set in the case in which
   FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set to zero.

      FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 SHOULD always be set to one.

   According to RFC2119 [1], "SHOULD" means that "there may exist valid
   reasons in particular circumstances to ignore a particular item, but
   the full implications must be understood and carefully weighed
   before choosing a different course".  In this context, it is unclear
   what these "full implications" might be given the introduction above.
   The clause, "because some operating environments and file systems do
   not enforce character set encodings", gives one no basis for treating
   this as other than an unproblematic behavior variant, calling into
   question the use of "SHOULD".

   Also, the use of "SHOULD" here is hard to reconcile with the
   statement in RFC2119 that these terms (i.e. those like "SHOULD")
   "only be used where it is actually required for interoperation or to
   limit behavior which has the potential for causing harm":

   o  The whole purpose of this feature is to enable interoperation and
      there is no basis for the implication that one particular flag
      value is superior to another in allowing interoperation.

   o  There is no basis for assuming that accepting file names that are
      not UTF-8-encoded Unicode has any potential for causing harm.

   Despite the statement in RFC2119, that "they [i.e.
   terms such as 'SHOULD'] must not be used to try to impose a
   particular method on implementors", it is hard to avoid the
   conclusion that this is in fact the motivation for the "SHOULD",
   although the authors might not have had any such intention but felt
   that the IESG might well have such an intention.

7.2.  The Attribute Fs_charset_cap in Future NFSv4.1 Specifications

   We provide a revised version of Section 14.4 of [9] below, taking
   into account the issues noted in Section 7.1.  Given there was a
   working group consensus to adopt the confusing language discussed
   there, we must now adopt, by consensus, a clearer replacement that
   reflects the working group's intentions.  Given the passage of time
   and the changed context, it might not be possible to determine those
   intentions.  In any case, we will have to be aware of how this
   attribute was implemented and used, particularly with regard to the
   first flag, whose meaning remains obscure.

   The following treatment is proposed as a basis for discussion, with
   the understanding that it would need to be changed, if it could raise
   interoperability issues.

      const FSCHARSET_CAP4_CONTAINS_NON_UTF8  = 0x1;
      const FSCHARSET_CAP4_ALLOWS_ONLY_UTF8   = 0x2;

      typedef uint32_t        fs_charset_cap4;

      This attribute provides a simple way of determining whether a
      particular file system behaves as a UTF-8-only server and rejects
      file names which are not valid UTF-8 strings.  When this attribute
      is supported and the value returned has the
      FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 flag set, the error NFS4ERR_INVAL
      MUST be returned if any file name argument contains a string which
      is not a valid UTF-8 string.

      When this attribute is supported and the value returned has the
      FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 flag clear, the error
      NFS4ERR_INVAL will not be returned based on adherence to the rules
      of UTF-8.
While such file systems are generally UTF-8-unaware, 601 this cannot be assumed, since servers are allowed (in some 602 circumstances; it is a "SHOULD NOT") to accept non-UTF-8 names 603 while being aware of the structure of UTF-8-conforming names, for 604 the purposes of determining canonical equivalence, for example. 605 See Section 6. 607 With regard to the flag FSCHARSET_CAP4_CONTAINS_NON_UTF8, it has 608 proved impossible to determine, from existing treatments of this 609 attribute, any value that might be helpful here. As a result, we 610 are forced to assume that this flag is always a complement of 611 FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 and that any result in which it is 612 not is to be ignored, with the appropriate handling being the same 613 as would apply if the attribute were not supported. 615 When this attribute is not supported, the client can perform a 616 LOOKUP using a name not conforming to the rules of UTF-8 and use 617 the error returned to determine whether non-UTF-8 names are 618 accepted. 620 8. String Encoding 622 Strings that potentially contain characters outside the ASCII range 623 [10] are generally represented in NFSv4 using the UTF-8 encoding [7] 624 of Unicode [11]. See [7] for precise encoding and decoding rules. 626 Some details of the protocol treatment depend on the type of string: 628 o For strings that are component names, the preferred encoding for 629 any non-ASCII characters is the UTF-8 representation of Unicode. 631 In many cases, clients have no knowledge of the encoding being 632 used, with the encoding done at the user level under the control 633 of a per-process locale specification. As a result, it may be 634 impossible for the NFSv4 client to enforce the use of UTF-8. The 635 use of non-UTF-8 encodings can be problematic, since it may 636 interfere with access to files stored using other forms of name 637 encoding.
Also, normalization-related processing (see Section 9) 638 of a string not encoded in UTF-8 could result in inappropriate 639 name modification or aliasing. In cases in which one has a non- 640 UTF-8 encoded name that accidentally conforms to UTF-8 rules, 641 substitution of canonically equivalent strings can change the non- 642 UTF-8 encoded name drastically. 644 For similar reasons, where non-UTF-8 encoded names are accepted, 645 case-related mappings cannot be relied upon. For this reason, the 646 attribute case_insensitive MUST NOT be returned as TRUE for file 647 systems which accept non-UTF-8 encoded file names. 649 The kinds of modification and aliasing mentioned here can lead to 650 both false negatives and false positives, depending on the strings 651 in question, which can result in security issues such as elevation 652 of privilege and denial of service (see [23] for further 653 discussion). 655 o For strings based on domain names, non-ASCII characters MUST be 656 represented using the UTF-8 encoding of Unicode, and additional 657 string format restrictions may apply. See Section 12 for details. 659 o The contents of symbolic links (of type linktext4 in the XDR) MUST 660 be treated as opaque data by NFSv4 servers. Although UTF-8 661 encoding is often used, it need not be. In this respect, the 662 contents of symbolic links are like the contents of regular files 663 in that their encoding is not within the scope of this 664 specification. 666 o For other sorts of strings, any non-ASCII characters SHOULD be 667 represented using the UTF-8 encoding of Unicode. 669 9. Normalization 671 The client and server operating environments can potentially differ 672 in their policies and operational methods with respect to character 673 normalization (see [11] for a discussion of normalization forms). 674 This difference may also exist between applications on the same 675 client. 
This adds to the difficulty of providing a single 676 normalization policy for the protocol that allows for maximal 677 interoperability. This issue is similar to the issues of character 678 case where the server may or may not support case-insensitive file 679 name matching and may or may not preserve the character case when 680 storing file names. The protocol does not mandate a particular 681 behavior but allows for a range of useful behaviors. 683 The NFSv4 protocol does not mandate the use of a particular 684 normalization form. A subsequent minor version of the NFSv4 protocol 685 might specify a particular normalization form, although there would 686 be difficulties in doing so (see Section 15 for details). In any 687 case, the server and client can expect that they might receive 688 unnormalized characters within protocol requests and responses. If 689 the operating environment requires normalization, then the 690 implementation will need to normalize the various UTF-8 encoded 691 strings within the protocol before presenting the information to an 692 application (at the client) or local file system (at the server). 694 Server implementations MAY normalize file names to conform to a 695 particular normalization form before using the resulting string when 696 looking up or creating a file. Servers MAY also perform 697 normalization-insensitive string comparisons without modifying the 698 names to match a particular normalization form. Except in cases in 699 which component names are excluded from normalization-related 700 handling because they are not valid UTF-8 strings, a server MUST make 701 the same choice (as to whether to normalize or not, the target form 702 of normalization, and whether to do normalization-insensitive string 703 comparisons) in the same way for all accesses to a particular file 704 system. 
Servers SHOULD NOT reject a file name because it does not 705 conform to a particular normalization form, as this would deny access 706 to clients that use a different normalization form or clients acting 707 on behalf of applications that use a different normalization form. 709 10. Case-Insensitive Processing of File Names 711 When the server is to process file names in a case-insensitive way in 712 a given file system, it may choose to do so in a number of ways. 714 o It can force all characters which have multiple forms to a common 715 case, whether uppercase or lowercase. Although this may cause the 716 file name shown in the directory to be different from that 717 specified when the file is created, these two names will be judged 718 as equivalent when a case-insensitive comparison is used. Such 719 file systems are case-insensitive but not case-preserving. 721 o It can preserve all names, presented as valid and not subject to 722 case-based modification, while treating two names that are 723 equivalent when a case-insensitive comparison is used as referring 724 to the same file. Such file systems are both case-insensitive and 725 case-preserving. 727 When a server implements case-insensitive file name handling, it is 728 necessary that clients do so as well. For example, if a client, 729 possessing the cached contents of a directory, notes that the file 730 "a" does not exist, it cannot immediately act on that presumed non- 731 existence, without checking for the potential existence of "A" as 732 well. As a result, clients need to be able to provide case- 733 insensitive name comparisons, irrespective of whether the server 734 handling is case-preserving or not. 736 Because case-insensitive name comparisons are not always as 737 straightforward as the above example suggests, the client, if it is 738 to emulate the server's name handling, would need information about 739 how certain cases are to be dealt with.
In cases in which that 740 information is unavailable, the client needs to avoid making 741 assumptions about the server's handling, since it will be unaware of 742 the Unicode version implemented by the server, or many of the details 743 of specific issues that might need to be addressed differently by 744 different server file systems in implementing case-insensitive name 745 handling. 747 Many of the problematic issues with regard to the case-insensitive 748 handling of names are discussed in Section 5.18 of the Unicode 749 Standard [12] which deals with case mapping. While we need to 750 address all of these issues as well, our approach will not be exactly 751 the same. 753 o Since the client will be doing case-insensitive comparisons, 754 issues that apply only to uppercasing or lowercasing do not have 755 the same significance. 757 o Many clients will have to operate correctly even in the absence of 758 detailed information about the specifics of server case-mapping or 759 the version of Unicode implemented by the server. 761 o Clients will have to accommodate server behaviors not anticipated 762 by the Unicode Specification since it might be that neither the 763 server nor the client would have any relevant locale knowledge 764 when file names are processed. 766 Another source of information about case-folding, and indirectly 767 about case-insensitive comparisons, is the case-folding text file 768 which is part of the Unicode Standard [13]. This file contains, for 769 each Unicode character that can be uppercased or lowercased, a single 770 character, or, in some cases, a string of characters of the other 771 case. For characters in capital case, the lowercase counterpart is 772 given. Each of the mappings is characterized as being of one of four 773 types: 775 o Common case folding, denoted by a status field of "C". These are 776 used for mapping where a single character can be mapped to a 777 single character of another case.
These are always valid with one 778 potential exception being the mappings of LATIN CAPITAL LETTER I 779 to LATIN SMALL LETTER I and vice versa, which might be superseded 780 by the T-type mappings associated with some Turkic languages. 782 o Full case folding, denoted by a status field of "F". These are 783 used for mappings in which a single character is mapped to a multi- 784 character string of a different case. 786 o Special case folding, denoted by a status field of "S". These 787 provide additional single-character-to-single-character mappings which 788 might be used when there is also an F-type mapping of the same 789 character. In the case of case folding, this is an alternative to 790 the corresponding F-type, although, for the purposes of case- 791 insensitive string comparison, it is possible for both to be 792 considered valid at the same time. 794 o Special case foldings for Turkic languages, denoted by a status 795 field of "T". These consist of the invertible case mappings 796 between LATIN SMALL LETTER I (U+0069) and LATIN CAPITAL LETTER I 797 WITH DOT ABOVE (U+0130) and between LATIN CAPITAL LETTER I 798 (U+0049) and LATIN SMALL LETTER DOTLESS I (U+0131). The 799 relationship between these mappings and the C-type mappings for 800 LETTER I is discussed below in item EX8. 802 While the case mapping section does discuss case-insensitive string 803 comparisons, and describes a procedure for constructing equivalence 804 classes of Unicode characters, the description does not deal clearly 805 with the effect of F-type mappings. There are a number of problems 806 with dealing with F-type mappings for case folding and basing case- 807 insensitive string comparisons on those mappings, particularly in 808 situations, such as file systems, in which extensive processing of 809 strings is unlikely to be possible. 811 o Mappings from single characters to multi-character strings, are, 812 for case-folding purposes, not invertible.
However, case- 813 insensitive name comparison, by its nature, requires invertible 814 mappings, in which a multi-character string is mapped to a single 815 character of a different case, which is not compatible with any 816 existing simple case-mapping models. 818 o Scanning of names for multi-character sequences might well be too 819 complicated, especially since such sequences might overlap in 820 complicated ways. 822 o Case foldings which map single characters to multi-character 823 sequences (see item EX4 below for an important example), would 824 give rise, because of the invertibility of case mappings when used 825 to determine case-insensitive string equivalence, to very large 826 sets of equivalent strings. For example, a string of eight copies of the 827 letter S would give rise to a set of 256 equivalent strings plus 828 over two thousand others when the German SHARP S characters 829 discussed in item EX4 are included. 831 Despite these potential difficulties, case mappings involving multi- 832 character sequences can be reversed when used as a basis for case- 833 insensitive string comparisons and incorporated into a set of 834 equivalence classes on name strings. 836 o Case-insensitive servers MAY do either case-mapping to a chosen 837 case or case-insensitive string comparisons when providing a case- 838 preserving implementation. In either case, it MAY include F-type 839 mappings, which map a single character to a multi-character 840 string. However, only in the case in which it is doing case- 841 insensitive string comparison will it use the inverse of F-type 842 mappings, in which a multi-character string is mapped to a single 843 character of a different case. 845 In these cases, the server can choose to use either a C-type 846 mapping or an F-type mapping, or both, when both exist.
Similarly 847 the server may choose to implement the C-type mappings of LATIN 848 CAPITAL LETTER I to LATIN SMALL LETTER I and vice versa, the 849 corresponding T-type mappings or both, although using only the 850 second of these is NOT ALLOWED, unless there is a means of 851 informing the client that it has been chosen. 853 o The client, when informed of the details of the server's handling 854 of case, has the ability to efficiently implement an appropriate 855 case-insensitive name comparison compatible with that of the 856 server. This includes the ability to handle mappings between 857 single characters and multi-character strings. 859 o Implementation of case-insensitive name comparisons will typically 860 require a case-insensitive name hash. 862 10.1. Implementing Case-Insensitive Comparison of File Names 864 Implementing case-insensitive string comparisons based on equivalence 865 classes including multi-character strings can be performed as 866 described below. This algorithm requires that if there is more than 867 one multi-character string within a given equivalence class, they 868 must all be equivalent, with any equivalences derivable from case- 869 insensitive string equivalence using single-character equivalence 870 classes. 872 Although other sources are possible (see items EX2 and EX3 in 873 Section 10.2), multi-character sequences often appear in case- 874 insensitive equivalence classes as the result of the canonical 875 decomposition of one or more precomposed characters as elements of a 876 case-insensitive equivalence class. 878 While the algorithm described in this section can deal with certain 879 case-based equivalences deriving from canonical decomposition, it is 880 not capable of providing general handling of the combination of 881 canonical equivalence and case-based equivalence.
While this can be 882 addressed by normalizing strings before doing case-insensitive 883 comparison, it is more efficient to do a general form-insensitive and 884 case-insensitive string comparison in a single step as described in 885 Appendix B. 887 The following tables would be used by the comparison algorithm 888 presented below. 890 o For each possible character value, the associated equivalence 891 class for case-insensitive comparison will be identified. 893 o For each such equivalence class, the hash value contribution will 894 be provided. In the case of equivalence classes that do not include 895 multi-character strings, including equivalence classes that only include a 896 single member, this will be the hash value contribution of one 897 particular variant (usually lower case) of the character. 899 o In the case of equivalence classes that do include multi-character 900 strings, the hash value contribution needs to be equivalent to the 901 combined contribution of each character within the multi-character 902 string. In addition, for each such equivalence class, the length 903 of the multi-character string will be provided together with a 904 pointer to an array describing the multi-character string, most 905 probably presenting each character as an equivalence class id. 907 Case-insensitive comparison proceeds as follows: 909 o Implementation of case-insensitive name comparisons will typically 910 require a case-insensitive name hash using the tables described 911 above. If such a hash value is kept for all cached names, 912 comparisons of hashes can be used instead of the detailed 913 comparison set forth below. Using such hash comparisons, a large 914 set of potentially equivalent names can be excluded based on the 915 occurrence of hash mismatches, since case-equivalent names would 916 have the same hash value. 918 o For names with matching hash values, a detailed case-insensitive 919 comparison will be necessary.
This can proceed character-by- 920 character or byte-by-byte. However, in the byte-by-byte case, 921 processing in the event of a mismatch must start at the start of 922 the current character, rather than the byte at which the 923 difference was detected. 925 o In cases in which there is a mismatch, the associated equivalence 926 classes will be compared. When these are identical, indicating 927 the case equivalence of the two characters, the comparison of the 928 two strings continues at the next character of each string. 930 o When the two equivalence classes are not identical, further 931 comparisons are needed to determine if a single character within one string 932 matches (except for case) a multi-character string within the 933 other. For each of two equivalence classes being compared that 934 include a multi-character string, the check below must be made to 935 determine whether the multi-character string at the corresponding 936 position of the other string being compared, is within the current 937 equivalence class. If neither of the two equivalence classes 938 includes multi-character strings, the comparison terminates with a 939 mismatch indication. 941 o For each equivalence class that does include a multi-character 942 string (there might be one or two), a scan needs to be made to see 943 if the characters at the current position of the other string 944 match (except for case) the multi-character string which is 945 included in the current equivalence class. If this check 946 succeeds, for either equivalence class, the comparison of the two 947 strings continues at the next character of each string. In the 948 event of failure, the same sort of comparison is done using the 949 other current equivalence class, if it includes multi-character 950 strings. Once this check fails for all equivalence classes that 951 include multi-character strings, the comparison terminates with a 952 mismatch indication. 954 10.2.
Important Examples of Case-insensitive Handling of File Names 956 In this section, we discuss many of the interesting and/or 957 troublesome issues that the need for case-insensitive handling gives 958 rise to in a fully internationalized environment. Many of these are 959 also discussed in [12]. However, our treatment of these issues, 960 while not inconsistent with that in [12], differs significantly for a 961 number of reasons: 963 o Our primary focus is on case-insensitive string comparison rather 964 than on case mapping per se. While such comparison is natural 965 for the client and allowed for servers, its greater flexibility 966 makes it important to understand its capabilities in dealing with 967 potentially troublesome issues in providing case-insensitive file 968 name handling. 970 o Because a case mapping model forces the specification of a single 971 case mapping result when there are multiple potentially valid 972 results, there are inevitably cases in which the result chosen is 973 inappropriate for some users. These are cases in which F-type and 974 S-type mappings are present and in which C-type and T-type 975 mappings conflict. Normally, an appropriate choice is selected by 976 use of the locale, but in a filesystem environment, valid locale 977 information might not be present. As a result, case-insensitive 978 string comparison, which does not force such case mapping choices, 979 will be more desirable. 981 The examples below present common situations that go beyond the 982 simple invertible case mappings of Latin characters and the 983 straightforward adaptation of that model to Greek and Cyrillic. In 984 EX4 and EX5 we have case-based equivalence classes including multi- 985 character strings not derived from canonical equivalences while for 986 EX7 and EX8 all multi-character strings are derived from canonical 987 equivalences.
In addition, EX1, EX2, EX3 and EX6 discuss other 988 situations in which an equivalence class has more than two elements. 990 EX1: Certain digraph characters such as LATIN SMALL LETTER DZ (U+01F3) 991 have additional case variants to consider such as the titlecase 992 character LATIN CAPITAL LETTER D WITH SMALL LETTER Z (U+01F2) in 993 addition to the uppercase LATIN CAPITAL LETTER DZ (U+01F1). 994 While the titlecased variant would not appear in names in case- 995 insensitive non-case-preserving file systems, case-insensitive 996 string comparison has no problem in treating these three 997 characters as within the same equivalence class. 999 This equivalence class can be derived from only C-type 1000 mappings. The possibility of mapping these characters to the two- 1001 character sequences they represent is not a troublesome issue 1002 since that would be derived from a compatibility equivalence, 1003 rather than a canonical equivalence, and there is no F-type 1004 mapping making it an option. 1006 EX2: To deal with the case of the OHM SIGN (U+2126) which is 1007 essentially identical to the GREEK CAPITAL LETTER OMEGA 1008 (U+03A9), one can construct an equivalence class consisting of 1009 OHM SIGN (U+2126), GREEK CAPITAL LETTER OMEGA (U+03A9), and 1010 GREEK SMALL LETTER OMEGA (U+03C9). 1012 This equivalence class can be derived only from C-type 1013 mappings. Both OHM SIGN (U+2126), and GREEK CAPITAL LETTER 1014 OMEGA (U+03A9) lowercase to GREEK SMALL LETTER OMEGA (U+03C9), while 1015 that character only uppercases to GREEK CAPITAL LETTER OMEGA 1016 (U+03A9).
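The three-member class of EX2 can be observed with a short Python sketch. This is only an illustration: it uses Python's built-in str.casefold, str.lower and str.upper as a stand-in for the C-type mappings, and the helper name case_class and the scan limit of 0x3000 are assumptions made for this example rather than anything defined by this document.

```python
import unicodedata

OHM, CAP_OMEGA, SMALL_OMEGA = "\u2126", "\u03a9", "\u03c9"


def case_class(ch, limit=0x3000):
    """Return the single characters below `limit` that fold to the same
    string as `ch` under Python's str.casefold (hypothetical helper)."""
    target = ch.casefold()
    return {chr(cp) for cp in range(limit) if chr(cp).casefold() == target}


omega_class = case_class(SMALL_OMEGA)

# List the members of the class by code point and character name.
for member in sorted(omega_class):
    print(f"U+{ord(member):04X} {unicodedata.name(member)}")

# Both capital forms lowercase to the small letter, while the small
# letter uppercases only to GREEK CAPITAL LETTER OMEGA.
print(OHM.lower() == SMALL_OMEGA, CAP_OMEGA.lower() == SMALL_OMEGA)
print(SMALL_OMEGA.upper() == CAP_OMEGA)
```

Because the mapping from OHM SIGN is not invertible, a case-mapping model never produces that character, whereas a comparison based on the equivalence class accepts it without special handling.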
1018 EX3: To deal with the case of the ANGSTROM SIGN (U+212B) which is 1019 essentially identical to LATIN CAPITAL LETTER A WITH RING ABOVE 1020 (U+00C5), one can construct an equivalence class consisting of 1021 ANGSTROM SIGN (U+212B), LATIN CAPITAL LETTER A WITH RING ABOVE 1022 (U+00C5), LATIN SMALL LETTER A WITH RING ABOVE (U+00E5), 1023 together with the two-character sequences involving LATIN 1024 CAPITAL LETTER A (U+0041) or LATIN SMALL LETTER A (U+0061) 1025 followed by COMBINING RING ABOVE (U+030A). 1027 This equivalence class can be derived from C-type mappings 1028 together with the ability to map characters to canonically 1029 equivalent strings. Both ANGSTROM SIGN (U+212B), and LATIN 1030 CAPITAL LETTER A WITH RING ABOVE (U+00C5) lowercase to LATIN 1031 SMALL LETTER A WITH RING ABOVE (U+00E5), while that character 1032 only uppercases to LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5). 1034 EX4: In some cases, case mapping of a single character will result 1035 in a multi-character string. For example, the German character 1036 LATIN SMALL LETTER SHARP S (U+00DF) would be uppercased to 1037 "SS", i.e. two copies of LATIN CAPITAL LETTER S (U+0053). On 1038 the other hand, in some situations, it would be uppercased to 1039 the character LATIN CAPITAL LETTER SHARP S (U+1E9E), using an 1040 S-type mapping, referred to as an instance of "Tailored 1041 Casing". Unfortunately, in the context of a file system, there 1042 is unlikely to be available information that provides guidance 1043 about which of these case mappings should be chosen. However, 1044 the use of case-insensitive mappings with larger equivalence 1045 classes often provides handling that is acceptable to a wider 1046 variety of users. In this case, German-speakers get the 1047 mapping they expect while those unfamiliar with these 1048 characters only see them when they access a file whose name 1049 contains them.
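As an illustration of the SHARP S handling described above, the following Python sketch compares names by their folded forms. It relies on Python's str.casefold, which applies Unicode full case folding, as a stand-in for a server's comparison; it is not the algorithm of Section 10.1, and the function name case_insensitive_equal and the sample names are assumptions made for this example.

```python
SMALL_SHARP_S = "\u00df"    # LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S


def case_insensitive_equal(a: str, b: str) -> bool:
    """Compare two names by their full case folding (illustrative only)."""
    return a.casefold() == b.casefold()


# Four spellings of one name that a case-insensitive file system
# following this approach could treat as referring to the same file.
names = [
    "stra" + SMALL_SHARP_S + "e",
    "STRA" + CAPITAL_SHARP_S + "E",
    "strasse",
    "STRASSE",
]

# Every spelling compares equal to the first one.
print(all(case_insensitive_equal(names[0], n) for n in names))

# The default uppercasing yields two letters, while lowercasing the
# capital form yields the single small character.
print(SMALL_SHARP_S.upper(), CAPITAL_SHARP_S.lower())
```

Comparing folded forms avoids having to choose between "SS" and U+1E9E as the uppercase result, which is the choice that a file system without locale information cannot make reliably.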
1051 It appears that if the construction of case-based equivalence 1052 classes were generalized to include multi-character sequences, 1053 then all of LATIN SMALL LETTER SHARP S (U+00DF), LATIN CAPITAL 1054 LETTER SHARP S (U+1E9E), "ss", "sS", "Ss", and "SS" would 1055 belong to the same equivalence class and could be handled by 1056 the general algorithm described in Section 10.1, as well as by 1057 code specifically written to deal with this particular issue. 1059 EX5: Other ligatures, such as LATIN SMALL LIGATURE FFL (U+FB04), 1060 could be handled similarly by this algorithm, if there were 1061 felt a need to do so. However, because the decomposition of 1062 this character into the string consisting of the three letters 1063 LATIN SMALL LETTER F (U+0066), LATIN SMALL LETTER F (U+0066), 1064 LATIN SMALL LETTER L (U+006C), is a compatibility equivalence, 1065 and the F-type mapping of this ligature to the three 1066 constituent letters is to be treated as optional, implementations can 1067 choose either to treat this character as having no uppercase 1068 equivalent or treat it as part of a larger equivalence class 1069 (including "ffl", "ffL", "fFl", etc.). 1071 EX6: The character COMBINING GREEK YPOGEGRAMMENI (U+0345), also 1072 known as "iota-subscript", requires special handling when 1073 uppercasing and lowercasing. While the description of the 1074 appropriate handling for this character, in the case mapping 1075 section, is focused on multi-character sequences representing 1076 diphthongs, case-insensitive comparisons can be performed 1077 without consideration of multi-character sequences. This can 1078 be done by assigning COMBINING GREEK YPOGEGRAMMENI (U+0345), 1079 GREEK SMALL LETTER IOTA (U+03B9), and GREEK CAPITAL LETTER IOTA 1080 (U+0399) to the same equivalence class, even though the first 1081 of these is a combining character and the others are not. 1083 EX7: In some cases context-dependent case mapping is required.
For 1084 example, GREEK CAPITAL LETTER SIGMA (U+03A3) lowercases to 1085 GREEK SMALL LETTER SIGMA (U+03C3) if it is followed by another 1086 letter and to GREEK SMALL LETTER FINAL SIGMA (U+03C2) if it is 1087 not. 1089 Despite this, case-insensitive comparisons can be implemented, 1090 by considering all of these characters as part of the same 1091 equivalence class, without any context-dependence, and this 1092 equivalence class can be derived using only C-type mappings. 1094 EX8: In most languages written using Latin characters, the uppercase 1095 and lowercase varieties of the letter "I" differ in that only 1096 the lowercase character has a dot. In a number of Turkic languages, 1097 there are two distinct characters derived from "I" which differ 1098 only with regard to the presence or absence of a dot so that 1099 there are both capital and small i's with each having dotted 1100 and dotless variants. Within such languages, the dotted and 1101 dotless I's represent different vowel sounds and are treated as 1102 separate characters with respect to case mapping. The 1103 uppercase of LATIN SMALL LETTER I (U+0069) is LATIN CAPITAL 1104 LETTER I WITH DOT ABOVE (U+0130), rather than LATIN CAPITAL 1105 LETTER I (U+0049). Similarly the lowercase of LATIN CAPITAL 1106 LETTER I (U+0049) is LATIN SMALL LETTER DOTLESS I (U+0131) 1107 rather than LATIN SMALL LETTER I (U+0069). 1109 When doing case mapping, the server must choose to uppercase 1110 LATIN SMALL LETTER I (U+0069) to either LATIN CAPITAL LETTER I 1111 (U+0049), based on a C-type mapping, or to LATIN CAPITAL LETTER I 1112 WITH DOT ABOVE (U+0130), based on a T-type mapping. The former 1113 is acceptable to most people but confusing to speakers of the 1114 Turkic languages in question since the case mapping changes the 1115 character to represent a different vowel sound. On the other 1116 hand, the latter mapping seemingly inexplicably results in a 1117 character many users have never seen before.
Normally such 1118 choices are dealt with based on a locale but, in a file system 1119 environment, no locale information may be available. 1121 In the context of case-insensitive string comparison, it is 1122 possible to create a larger equivalence class, including all of 1123 the letters LATIN SMALL LETTER I (U+0069), LATIN CAPITAL LETTER 1124 I (U+0049), LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130), 1125 LATIN SMALL LETTER DOTLESS I (U+0131) together with the two- 1126 character string consisting of LATIN CAPITAL LETTER I (U+0049) 1127 followed by COMBINING DOT ABOVE (U+0307). 1129 11. Internationalization-related Processing of File Names by Clients 1131 Given the way that internationalization is addressed within the NFSv4 1132 protocols, clients, and applications accessing NFS files can 1133 generally remain unaware of the specific type of 1134 internationalization-related processing implemented by the server. 1135 For example, although a server MAY store all file names according to 1136 the rules appropriate to a particular normalization form, it MUST NOT 1137 reject names solely because they are not encoded using this 1138 normalization form, allowing the clients and applications to avoid 1139 knowledge of normalization choices. 1141 However, as has been pointed out in [25], there are situations in 1142 which clients implementing local optimizations use the saved contents 1143 of directories fetched from the server, making it necessary that the 1144 client's and the server's handling of internationalization-related 1145 name mapping issues be in concord. There are two basic ways this 1146 issue can be addressed: 1148 o Where the protocol has not defined a means whereby the client can 1149 obtain information about the details of internationalized name 1150 handling implemented within the server, the client can avoid 1151 conflict with the server by limiting its use of local 1152 optimizations. 
While positive name caching can be used without 1153 adverse effects, negative name caching has to be limited to avoid 1154 situations in which a given name is not present but an equivalent 1155 one may exist, as far as the server is concerned. This situation, 1156 which applies to all current NFSv4 protocols, is discussed in 1157 Section 11.2. 1159 o The client can be provided complete information about the server's 1160 internationalization-related name handling (typically implemented 1161 within the server-based file system). This situation, which could 1162 be implemented in later NFSv4 minor versions, or in an extension 1163 to an existing extensible minor version, is discussed in 1164 Section 11.3. 1166 o Note that when case-insensitive handling of file names is 1167 implemented by a server-side filesystem, further complications can 1168 arise. For the most part, these are addressed in Sections 11.2 1169 and 11.3 by treating the particulars of case-handling as another 1170 element of the name handling implemented by the server. However, 1171 some of the specific complexities are addressed separately in 1172 Section 10. 1174 11.1. Server Restrictions to Deal with Lack of Client Knowledge 1176 There are a number of restrictions, not previously specified in 1177 RFC7530 [3], on server implementation of internationalized file name 1178 handling. These restrictions apply to both case-sensitive and case- 1179 insensitive file systems and are designed to limit the options that 1180 servers have in choosing server-side internationalized file name 1181 handling so as to enable the clients to either duplicate that 1182 handling or limit it to avoid relying on cases in which the proper 1183 handling cannot be determined or duplicated by the client. 1185 o The canonical equivalence relation implemented by the server, for 1186 each internationalization-aware filesystem MUST match that defined 1187 by some particular UNICODE version equal to or later than version 1188 4.0.
   o  The case-equivalence relationship implemented by the server for
      each case-insensitive filesystem MUST include all C-type case
      mappings included by the particular Unicode version whose
      canonical equivalence relation is implemented by the server,
      with the possible exception of those conflicting with T-type
      case mappings.

   o  In cases in which the server provides no way of determining the
      details of the case-equivalence relationship implemented by the
      server for a particular file system, that mapping must include
      all C-type case mappings included by the particular Unicode
      version whose canonical equivalence relation is implemented by
      the server, i.e., it MUST map between LATIN SMALL LETTER I
      (U+0069) and LATIN CAPITAL LETTER I (U+0049).

11.2.  Client Processing of File Names for Current NFSv4 Protocols

   The existing minor versions, NFSv4.0 [3], NFSv4.1 [21], and NFSv4.2
   [4], have very limited facilities allowing a client to get
   information about the server's internationalization-related file
   name handling.  Because these protocols were all defined when it
   was assumed that the server's internationalized file name handling
   could be specified in great detail, there was no provision for
   attributes defining the server's choices.  As a result, the
   information available to the client is quite limited:

   o  The client can determine that the server is not performing
      internationalized file name processing.  It can do this by
      looking up a file name using a string which is not valid UTF-8,
      concluding that if the LOOKUP is not rejected on that basis,
      then the file system is not internationalization-aware, allowing
      the client to ignore the potential difficulties which
      server-based internationalized file name processing might give
      rise to.
   o  The client can use the optional per-fs attributes
      case_insensitive and case_preserving to determine how the server
      deals with character case for a particular file system.  When
      one of these attributes is not supported by a particular file
      system, the client treats the attribute as if it were false.

   When a file system is internationalization-unaware, the client can
   use both positive and negative name caching, without any issues
   arising from the potential for conflict between distinct file names
   that would be considered equivalent by the server.  In other cases,
   the use of negative name caching is more restricted.  The issues
   with regard to case-sensitive and case-insensitive file systems are
   discussed separately below.  In each case, the client has a range
   of choices trading off forgone optimization opportunities against
   the difficulty of implementation while avoiding negative
   consequences arising from the fact that certain details of the
   server's name handling are not known to it.

   In the case of case-sensitive file systems, the uncertainty to be
   dealt with concerns the version of Unicode implemented by the
   server, given that different versions may have different canonical
   equivalence relationships.  However, whether the server implements
   a particular normalization form or implements form-insensitive file
   name matching has no effect on client behavior.  In light of the
   uncertainty created by the lack of knowledge of the precise Unicode
   version used by the server to implement its canonical equivalence
   relation, the following possibilities, arranged in order of
   increasing value (and difficulty of implementation), should be
   considered.

   A1:  The client can simply decline to implement optimizations based
        on negative name caching on internationalization-aware file
        systems.
        While this might have a negative effect on performance, it
        might be the best option for clients not heavily used to
        access internationalization-aware filesystems, or where, due
        to a lack of directory delegation support, the client has no
        assurance that it will be notified of the invalidation of a
        previous assumption that a particular file does not exist.

   A2:  Relatively simple name filtering can exclude the names for
        which negative name caching might cause difficulties.  For
        example, the client could scan file names for characters whose
        presence might pose difficulties and allow negative name
        caching only for strings known not to contain such characters.
        Because the Unicode version used by the server file system is
        not known, this treatment would be limited to strings
        containing only characters defined in the earliest version of
        Unicode which could be supported, that is, Unicode 4.0.

        One simple way for a client to provide such filtering would be
        to establish an upper limit (e.g., U+00FF) and disallow
        negative name caching for strings containing characters above
        that value or characters below that value that might cause
        there to be canonically equivalent strings on the server.  A
        simple mask could be used to examine each character,
        identifying composed and combining characters together with
        code points unassigned in Unicode 4.0.

        This approach would disallow negative name caching for strings
        containing those characters while allowing it for all other
        strings.  A larger limit (and a corresponding mask) would make
        sense for clients used to access many file names containing
        characters from non-Latin alphabets.

   A3:  A client might implement its own internationalized file name
        handling paralleling that of the server.
        Because the Unicode version used by the server filesystem is
        unknown, strings for which it is possible that the canonically
        equivalent string might be different depending on the version
        of Unicode implemented by the server will have to be
        identified and excluded from using negative name caching.
        This would require that strings containing code points
        unassigned in Unicode version 4.0, and those denoting
        combining characters that could be parts of precomposed
        characters added to later versions of Unicode, be excluded
        from negative name caching.  The necessary filtering could
        apply to all potential code points, although clients might
        choose to simplify implementation by excluding strings
        containing code points beyond a certain point (e.g., U+FFFF).

        When a client implements internationalized name handling, it
        needs to be able to detect when the apparent absence of a file
        within a directory is contradicted by the occurrence of a file
        with a distinct, but canonically equivalent, name.  In order
        to efficiently find such names, when they exist, a client
        typically needs to implement a form of name hashing which
        always produces the same result for two canonically equivalent
        names.  This can be done by making the contribution of any
        character to the name hash equal to the contribution of the
        corresponding canonical decomposition string.

   In the case of case-insensitive file systems, the uncertainty to be
   dealt with includes the version of Unicode implemented by the
   server as well as the details of the possible case-handling
   implemented by the server.  In addition to the fact that different
   Unicode versions may have different canonical equivalence
   relationships, the server may implement different approaches to the
   handling of dotted and dotless i in Turkish and Azeri.
   However, the question of whether the server's handling is
   case-preserving has no effect on client behavior, nor does the
   question of whether the server implements a particular
   normalization form or implements form-insensitive file name
   matching.  In light of the uncertainty created by the lack of
   knowledge of the details of the case-related equivalence relation
   together with the precise Unicode version used by the server to
   implement its canonical equivalence relation, the following
   possibilities, arranged in order of increasing value (and
   difficulty of implementation), should be considered.

   B1:  The client can simply decline to implement optimizations based
        on negative name caching on case-insensitive file systems.

        While this might have a negative effect on performance where
        significant benefits from negative name caching might be
        expected, it might be the best option for clients not heavily
        used to access case-insensitive filesystems.

   B2:  Filtering similar to that discussed in item A2 could be
        implemented, although a higher limit is likely to be chosen
        (e.g., U+07FF) if significant use of non-Latin scripts is
        expected.  Because of the uncertainty regarding the case
        relationships among the variants of the letter I used by
        Turkic languages, this filtering would have to exclude names
        containing LATIN CAPITAL LETTER I WITH DOT ABOVE and LATIN
        SMALL LETTER DOTLESS I together with precomposed characters
        derived from them.

        In cases in which such filtering did not exclude the name from
        consideration, the client would need to search for files with
        possibly equivalent names, including those equivalent by
        canonical equivalence, case-insensitive equivalence, or a
        combination of the two.
        This will typically require a form of name hashing which
        always produces the same hash for equivalent names, similar to
        that discussed in item A3 but including case-insensitive
        equivalence as well.

   B3:  A client might implement its own internationalized,
        case-insensitive file name handling paralleling that of the
        server.  Because the case mappings are uncertain and the
        Unicode version used by the server filesystem is unknown,
        strings for which it is possible that the equivalent string
        might be different depending on the version of Unicode
        implemented by the server or the choice of case mappings would
        have to be identified and excluded from using negative name
        caching.  This would require that strings containing code
        points unassigned in Unicode version 4.0, and those denoting
        combining characters that could be parts of precomposed
        characters added to later versions of Unicode, be excluded
        from negative name caching.  The necessary filtering could
        apply to all potential code points, although clients might
        choose to simplify implementation by excluding strings
        containing code points beyond a certain point (e.g., U+FFFF).

        When a client implements internationalized name handling, it
        needs to be able to detect when the apparent absence of a file
        within a directory is contradicted by the occurrence of a file
        with a distinct, but canonically equivalent, name.  In order
        to efficiently find such names, when they exist, a client
        typically needs to implement a form of name hashing which
        always produces the same result for two canonically equivalent
        names.  This can be done by making the contribution of any
        character to the name hash equal to the contribution of the
        corresponding canonical decomposition string.

11.3.  Client Processing of File Names for Future NFSv4 Protocols

   NFSv4 has an extension framework allowing the addition of new
   attributes in later minor versions or in extensions to extensible
   minor versions.  Such new attributes are likely to be optional.
   They could include a number of useful per-fs attributes to deal
   with the information gaps discussed in Section 11.2:

   o  The Unicode version used to define the canonical equivalence
      relation implemented by the server could be provided as an
      fs-scope attribute.

   o  For case-insensitive filesystems, details regarding the actual
      case mapping used could be provided as an fs-scope attribute.
      These details would include the case mapping associated with
      LATIN LETTER I (i.e., whether the C-type or T-type case mappings
      or both are to be used).  Similarly, for characters having
      F-type case mappings, information needs to be provided about
      whether the F-type mapping, the S-type mapping, or both are to
      be used.

   There is little prospect of such additional attributes being
   REQUIRED.  Although the term "RECOMMENDED" has been used to
   describe NFSv4 attributes that are not REQUIRED, any such
   attributes are best considered OPTIONAL for the server to support,
   with the client required to deal with the case in which the
   attribute is not supported.

   When such attributes are defined and implemented, it would be
   possible for the client and server to implement compatible
   internationalization-related file name handling.  However, as a
   practical matter, such compatibility would be considerably eased if
   there existed unencumbered open-source implementations of the
   algorithm and tables described in Appendix B.
   This would allow clients, servers, and server-based file systems to
   easily adopt compatible approaches to these issues, each calling a
   common set of primitives, even though each might have a different
   execution environment and might be processing file names for
   different purposes.

   In the case of case-sensitive file systems, the case-mapping
   attribute is not relevant.  When the Unicode version attribute is
   not supported, the client is in the same position as that of
   clients described in Section 11.2.  When the Unicode version
   attribute is supported, the client would be able to implement the
   same version of the canonical equivalence relation implemented by
   the server, thus avoiding the need for the sort of overbroad
   filtering mentioned in items A2 and A3 within Section 11.2.

   The case of case-insensitive file systems is more complicated,
   since there are two OPTIONAL attributes to deal with:

   C1:  When neither of these OPTIONAL attributes is supported, the
        client is in the same position as that of clients described in
        Section 11.2 in dealing with a case-insensitive file system.

   C2:  When the Unicode version is available but the details of case
        mapping are not, the client handling will be similar to that
        specified in options B1 through B3 defined in Section 11.2.
        However, in cases B2 and B3, it will be possible to reduce the
        scope of the character filtering applied, by enabling names
        containing characters defined after Unicode version 4.0 to be
        processed, as long as none of the case mapping options for
        those characters is at all problematic.

   C3:  When the details of case mapping are available but the Unicode
        version is not, the client handling will be similar to that
        specified in options B1 through B3 defined in Section 11.2.
        However, in cases B2 and B3, it will be possible to reduce the
        scope of the character filtering by enabling names containing
        characters of uncertain case mapping to be processed as long
        as those characters were defined in Unicode version 4.0.

   C4:  When both of these OPTIONAL attributes are supported, the
        client has the ability, at least theoretically, to reproduce
        the internationalization-related file name handling
        implemented by a server for a case-insensitive file system.
        However, when the client is unable to provide such an
        implementation, it is free to ignore the attributes and
        implement one of the options B1 through B3 defined in
        Section 11.2.

12.  String Types with Processing Defined by Other Internet Areas

   There are two types of strings that NFSv4 deals with that are based
   on domain names.  Processing of such strings is defined by other
   Internet standards, and hence the processing behavior for such
   strings should be consistent across all server operating systems
   and server file systems.

   This section differs from other sections of this document in two
   respects:

   o  The normative statements within this section are not derived
      from the behavior of existing NFSv4 implementations, but derive
      instead from existing RFCs.

   o  Because of the switch from IDNA2003 [18] [19] to IDNA2008 [5],
      this section is necessarily different from the corresponding
      section (i.e., Section 12.6) of [3].  The differences are
      discussed in Section 12.1.

   Because of this shift, compatibility issues might be expected
   between implementations obeying Section 12.6 of [3] and those
   following this document.  Whether such compatibility issues
   actually exist depends on the behavior of NFSv4 implementations and
   how domain names are actually used in existing implementations.
   These matters will be discussed in Section 12.2.

   The types of strings referred to above are as follows:

   o  Server names as they appear in the fs_locations and
      fs_locations_info attributes.  Note that for most purposes, such
      server names will only be sent by the server to the client.  The
      exception is the use of these attributes in a VERIFY or NVERIFY
      operation.

   o  Principal suffixes that are used to denote sets of users and
      groups, and are in the form of domain names.

   The general rules for handling all of these domain-related strings
   are similar and independent of the role of the sender or receiver
   as client or server, although the consequences of failure to obey
   these rules may be different for client or server.  The server can
   report errors when it is sent invalid strings, whereas the client
   will simply ignore an invalid string or use a default value in its
   place.

   The string sent SHOULD be in the form of one or more unvalidated
   U-labels as defined by [5].  In cases where this cannot be done,
   the string will instead be in the form of one or more LDH labels
   [5].  The receiver needs to be able to accept domain and server
   names in any of the formats allowed.  The server MUST reject, using
   the error NFS4ERR_INVAL, any of the following:

   o  a string that is not valid UTF-8.

   o  a string that contains an XN-label (begins with "xn--") for
      which the characters after "xn--" are not valid output of the
      Punycode algorithm [6].

   o  a string that contains a reserved LDH label which is not an
      XN-label.

   When a domain string is part of id@domain or group@domain, there
   are two possible approaches:

   1.  The server generally treats the domain string as a series of
       unvalidated U-labels.
       In cases where the domain string is a series of unvalidated
       A-labels or Non-Reserved LDH (NR-LDH) labels, it converts them
       to U-labels using the Punycode algorithm [6].  As a result, the
       domain string returned within a user id on a GETATTR may not
       match that sent when the user id is set using SETATTR, although
       when this happens, the domain will be in the form of an
       unvalidated U-label.

   2.  The server treats the domain string as a series of unvalidated
       U-labels.  Specifically, it does not map a domain string that
       is not a U-label into a U-label using the methods described
       above.  As a result, the domain string returned on a GETATTR of
       the user id MUST be the same as that used when setting the user
       id by the SETATTR.

   A server SHOULD use the first method.

   For VERIFY and NVERIFY, additional string processing requirements
   apply to verification of the owner and owner_group attributes; see
   the section entitled "Interpreting owner and owner_group" in the
   document specifying the minor version in question (RFC7530 [3],
   RFC5661 [21]).

12.1.  Effect of IDNA Changes

   Overall, the effect of the shift to IDNA2008 is to limit the degree
   of understanding of the IDNA-based restrictions on domain names
   that were expected of NFSv4 in RFC7530 [3].  Despite this
   specification, the degree to which implementations actually
   implemented such restrictions is open to question and will be
   discussed in detail in Section 12.2.

   In analyzing how various cases are to be dealt with according to
   RFC7530, there are a number of troubling uncertainties that arise
   in trying to interpret the existing specification:

   o  There are a number of cases in which "SHOULD" is used that are
      confusing.
      According to RFC2119 [1], "SHOULD" means that "there may exist
      valid reasons in particular circumstances to ignore a particular
      item, but the full implications must be understood and carefully
      weighed before choosing a different course".  To fully
      understand a particular "SHOULD", there needs to be enough
      context to determine whether particular reasons for ignoring the
      item are in fact valid, and sufficient guidance to understand
      the implication of ignoring the item.  In the absence of such
      information, the relevant fact is that the peer needs to deal
      with the item being ignored, making the implications of a
      "SHOULD" hard to distinguish from those of "MAY".

   o  While the document states, "the general rules for handling all
      of these domain-related strings are similar and independent of
      the role of the sender or receiver as client or server", all of
      the following text is explicitly about the server's options,
      choices and responsibilities, leaving the client case unclear.

   o  In a number of places within the paragraph describing server
      approach #1, the word "can" is used, as in the text "the server
      can use the ToUnicode function", leaving it unclear whether the
      server can choose to do anything else and, if so, what.

   The following cases are those where RFC7530 requires use of IDNA
   handling and this requirement could, if implementations follow it,
   create potential compatibility issues, which need to be understood.

   o  The degree to which RFC3490 [18] requires that characters other
      than U+002E (full stop) be treated as label separators,
      including U+3002 (ideographic full stop), U+FF0E (fullwidth full
      stop), and U+FF61 (halfwidth ideographic full stop).

   o  The degree to which RFC3490 [18] requires that the server or
      client validate a putative A-label or U-label or rectify it if
      it is not valid.

12.2.  Potential Compatibility Issues Related to IDNA Changes

   There are a number of factors relating to the handling of domain
   names within NFSv4 implementations that are important in
   understanding why any compatibility issues might be less troubling
   than a comparison of the two IDNA approaches might suggest:

   o  Much of the potentially conflicting IDNA-related behavior
      required or recommended for the server by RFC7530 [3] might not
      actually be implemented, limiting the potential harmful effects
      of ceasing to mandate it.

   o  Even if such behavior were implemented by servers, no
      compatibility issue would arise unless clients actually relied
      on the server to implement it.  Given that none of this behavior
      is required, the chance of that occurring is quite small.

   o  The range of potential values for user and group attributes sent
      by clients is often quite small, with implementations commonly
      restricting all such values to a single domain string.  This is
      the case even though RFCs 7530 [3] and 5661 [21] are written
      without mention of such restrictions.

      Specification of users and groups in the "id@domain" format
      within NFSv4 was adopted to enable expansion of the spaces of
      users and groups beyond the 32-bit id spaces mandated in NFSv3
      [15] and NFSv2 [14].  While one obstacle to expansion was
      eliminated, most implementations were unable to actually effect
      that expansion, principally because the physical file systems
      used assume that user and group identifiers fit in 32 bits each
      and the vnode interfaces used by server implementations make
      similar assumptions.

      Given these restrictions, the typical implementation pattern is
      for servers to accept only a single domain, specified as part of
      the server configuration, together with information necessary to
      effect the appropriate name-to-id mappings.
   o  For the other use of domain names in NFSv4, to represent host
      names in location attributes, the values are generated by the
      server and will normally include only host names within
      DNS-registered domains.

   Keeping the above in mind, we can see that interoperability issues,
   while they might exist, are unlikely to raise major challenges, as
   the following specific cases show:

   o  When an internationalized domain name is used as part of a user
      or group, it would need to be configured as such, with the
      domain string known to both client and server.

      While it is theoretically possible that a client might work with
      an invalid domain string and rely on the server to correct it to
      an IDNA-acceptable one, such a scenario has to be considered
      extremely unlikely, since it would depend on multiple servers
      implementing the same correction, especially since there is no
      evidence of such corrections ever having been implemented by
      NFSv4 servers.

   o  When an internationalized domain in a location string is meant
      to specify a registered domain, similar considerations apply.

      While it is theoretically possible that a client might work with
      an invalid domain string and rely on the server to correct it to
      the appropriate registered one, such a scenario has to be
      considered extremely unlikely, since it would depend on multiple
      servers implementing the same correction, especially since there
      is no evidence of such corrections ever having been implemented
      by NFSv4 servers.

   o  When an internationalized domain in a location string is meant
      to specify a non-registered domain, any such server-applied
      corrections would be useless.

      In this situation, any potential interoperability issue would
      arise from rejecting the name, which has to be considered as
      what should have been done in the first place.

13.  Errors Related to UTF-8

   Where the client sends an invalid UTF-8 string, the server MAY
   return an NFS4ERR_INVAL error.  This includes cases in which
   inappropriate prefixes are detected and where the count includes
   trailing bytes that do not constitute a full Multiple-Octet Coded
   Universal Character Set (UCS) character.

   Requirements for server handling of component names that are not
   valid UTF-8, when a server does not return NFS4ERR_INVAL in
   response to receiving them, are described in Section 14.

   Where the string supplied by the client is not rejected with
   NFS4ERR_INVAL but contains characters that are not supported by the
   server as a value for that string (e.g., names containing slashes,
   or characters that do not fit into 16 bits when converted from
   UTF-8 to a Unicode codepoint), the server should return an
   NFS4ERR_BADCHAR error.

   Where a UTF-8 string is used as a file name, and the file system,
   while supporting all of the characters within the name, does not
   allow that particular name to be used, the server should return the
   error NFS4ERR_BADNAME.  This includes such situations as file
   system prohibitions of "." and ".." as file names for certain
   operations, and similar constraints.

14.  Servers That Accept File Component Names That Are Not Valid UTF-8
     Strings

   As stated previously, servers MAY accept, on all or on some subset
   of the physical file systems exported, component names that are not
   valid UTF-8 strings.  A typical pattern is for a server to use
   UTF-8-unaware physical file systems that treat component names as
   uninterpreted strings of bytes, rather than having any awareness of
   the character set being used.
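   The distinction between component names that are and are not valid
   UTF-8 can be made concrete with a short sketch.  The Python
   fragment below is illustrative only: the function name
   is_valid_utf8 is hypothetical and is not defined by any NFSv4
   specification, and the check relies on Python's strict UTF-8
   decoder, which rejects stray continuation bytes as well as
   truncated multi-byte sequences of the kind mentioned in Section 13.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the component name decodes as strict UTF-8.

    Hypothetical helper for illustration only; a real server would use
    whatever check its own implementation environment provides.
    """
    try:
        name.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "cafe" with a precomposed e-acute (U+00E9), encoded as UTF-8: valid.
print(is_valid_utf8(b"caf\xc3\xa9"))
# The same name with its final octet missing (truncated sequence): invalid.
print(is_valid_utf8(b"caf\xc3"))
# The same name encoded as ISO 8859-1 instead of UTF-8: invalid.
print(is_valid_utf8(b"caf\xe9"))
```

   A server following the pattern described above would not apply
   such a check to the file systems concerned and would instead pass
   the bytes of the name through uninterpreted.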
   Such servers SHOULD NOT change the stored representation of
   component names from those received on the wire and SHOULD use an
   octet-by-octet comparison of component name strings to determine
   equivalence (as opposed to any broader notion of string
   comparison).  This is because the server has no knowledge of the
   character encoding being used.

   Nonetheless, when such a server uses a broader notion of string
   equivalence than what is recommended in the preceding paragraph,
   the following considerations apply:

   o  Outside of 7-bit ASCII, string processing that changes string
      contents is usually specific to a character set and hence is
      generally unsafe when the character set is unknown.  This
      processing could change the file name in an unexpected fashion,
      rendering the file inaccessible to the application or client
      that created or renamed the file and to others expecting the
      original file name.  Hence, such processing should not be
      performed, because doing so is likely to result in incorrect
      string modification or aliasing.

   o  Unicode normalization is particularly dangerous, as such
      processing assumes that the string is UTF-8.  When that
      assumption is false because a different character set was used
      to create the file name, normalization may corrupt the file name
      with respect to that character set, rendering the file
      inaccessible to the application that created it and others
      expecting the original file name.  Hence, Unicode normalization
      SHOULD NOT be performed, because it may cause incorrect string
      modification or aliasing.
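   The difference between octet-by-octet comparison and a broader
   notion of equivalence can be sketched as follows.  This Python
   fragment is an illustration under stated assumptions, not a
   description of any actual server: the function names are
   hypothetical, NFC is chosen arbitrarily as the normalization form,
   and the ISO 8859-1 reading of the byte strings is an assumed
   example of names created in a character set other than UTF-8.

```python
import unicodedata


def equal_octetwise(a: bytes, b: bytes) -> bool:
    """Comparison recommended when the character set is unknown."""
    return a == b


def equal_after_nfc(a: bytes, b: bytes) -> bool:
    """Broader comparison that assumes, perhaps wrongly, UTF-8 input."""
    try:
        ua, ub = a.decode("utf-8"), b.decode("utf-8")
    except UnicodeDecodeError:
        return a == b  # not decodable as UTF-8: fall back to octets
    return unicodedata.normalize("NFC", ua) == unicodedata.normalize("NFC", ub)


# Two byte strings that are canonically equivalent when read as UTF-8
# (precomposed U+00E9 versus "e" followed by combining U+0301) ...
name1 = b"\xc3\xa9"
name2 = b"e\xcc\x81"

# ... but that spell two different names if the creator used ISO 8859-1.
assert name1.decode("latin-1") != name2.decode("latin-1")

print(equal_octetwise(name1, name2))  # the two names remain distinct
print(equal_after_nfc(name1, name2))  # the two names are aliased
```

   Under the octet-by-octet rule the two names remain distinct.  Under
   the normalizing rule, an application that created both files using
   ISO 8859-1 would find that one name aliases the other.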
   When the above recommendations are not followed, the resulting
   string modification and aliasing can lead to both false negatives
   and false positives, depending on the strings in question, which
   can result in security issues such as elevation of privilege and
   denial of service (see [23] for further discussion).

15.  Future Minor Versions and Extensions

   As stated above, all current NFSv4 minor versions allow use of
   non-UTF-8 encodings, allow servers a choice of whether to be aware
   of normalization issues or not, and allow servers a number of
   choices about how to address normalization issues.  This range of
   choices reflects the need to accommodate existing file systems and
   user expectations about character handling, which in turn reflect
   the assumptions of the POSIX model of handling file names.

   While it is theoretically possible for a subsequent minor version
   to change these aspects of the protocol (see [8]), this section
   will explain why any such change is highly unlikely, making it
   expected that these aspects of NFSv4 internationalization handling
   will be retained indefinitely.  As a result, any new minor version
   specification document that made such a change would have to be
   marked as updating or obsoleting this document.

   No such change could be done as an extension to an existing minor
   version or in a new minor version consisting only of OPTIONAL
   features.  Such a change could only be done in a new minor version
   which, like minor version one, was prepared to be incompatible to
   some degree with the previous minor versions.  While it appears
   unlikely that such minor versions will be adopted, the possibility
   cannot be excluded, so we need to explore the difficulties of
   changing the aspects of internationalization handling mentioned
   above.

   o  Establishing UTF-8 as the sole means of encoding for
      internationalized characters would make inaccessible existing
      files stored with other encodings. Further, unless there were a
      corresponding change in the UNIX file interface model, it would
      cause the set of valid names for local and remote files to
      diverge.

   o  Imposing a particular normalization form, in the sense of refusing
      to create or allow access to files whose UTF-8-encoded names are
      not of the selected normalization form, would give rise to similar
      difficulties.

   o  Defining a preferred normalization form to be returned as the
      names of all internationalized files would result in applications
      having to deal with sudden unexplained changes of file names for
      existing files.

   None of the above appears likely since there do not seem to be any
   corresponding benefits to justify the difficulties that they would
   create.

   There would also be difficulties in otherwise reducing the set of
   three acceptable normalization handling options, without reducing it
   to a single option by imposing a specific normalization form.

   o  Eliminating the possibility of a single possible normalization
      form would pose similar difficulties to imposing the other one,
      even if representation-independent comparisons were also allowed.

      In either case, a specific normalization form would be disfavored,
      with no corresponding benefit.

   o  Allowing only representation-independent lookups would not impose
      difficulties for clients, but there are reasons to doubt it could
      be universally implemented, since such name comparisons would have
      to be done within the file system itself.

      Such a change could only be made once file system support for
      representation-independent file lookups became commonly
      available.
      As long as the POSIX file naming model continues its
      sway, that would be unlikely to happen.

   One possible internationalization-related extension that the working
   group could adopt would be definition of an OPTIONAL per-fs attribute
   defining the internationalization-related handling for that file
   system. That would allow clients to be aware of server choices in
   this area and could be adopted without disrupting existing clients
   and servers.

16.  IANA Considerations

   The current document does not require any actions by IANA.

17.  Security Considerations

   Unicode in the form of UTF-8 is generally used for file component
   names (i.e., both directory and file components). However, other
   character sets may also be allowed for these names. The owner
   and owner_group attributes and other sorts of strings whose form is
   affected by standards outside NFSv4 (see Section 12) are always
   encoded as UTF-8. String processing (e.g., Unicode normalization)
   raises security concerns for string comparison. See Sections 9 and
   12 as well as the respective Sections 5.9 of RFC7530 [3] and RFC5661
   [21] for further discussion. See [23] for related identifier
   comparison security considerations. File component names are
   identifiers with respect to the identifier comparison discussion in
   [23] because they are used to identify the objects to which ACLs are
   applied (see the respective Sections 6 of RFC7530 [3] and RFC5661
   [21]).

18.  References

18.1.  Normative References

   [1]   Bradner, S., "Key words for use in RFCs to Indicate
         Requirement Levels", BCP 14, RFC 2119,
         DOI 10.17487/RFC2119, March 1997.

   [2]   Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
         2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
         May 2017.

   [3]   Haynes, T., Ed. and D.
         Noveck, Ed., "Network File System
         (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
         March 2015.

   [4]   Haynes, T., "Network File System (NFS) Version 4 Minor
         Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
         November 2016.

   [5]   Klensin, J., "Internationalized Domain Names for
         Applications (IDNA): Definitions and Document Framework",
         RFC 5890, DOI 10.17487/RFC5890, August 2010.

   [6]   Costello, A., "Punycode: A Bootstring encoding of Unicode
         for Internationalized Domain Names in Applications
         (IDNA)", RFC 3492, DOI 10.17487/RFC3492, March 2003.

   [7]   Yergeau, F., "UTF-8, a transformation format of ISO
         10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
         2003.

   [8]   Noveck, D., "Rules for NFSv4 Extensions and Minor
         Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017.

   [9]   Noveck, D., Ed. and C. Lever, "Network File System (NFS)
         Version 4 Minor Version 1 Protocol", RFC 8881,
         DOI 10.17487/RFC8881, August 2020.

   [10]  Cerf, V., "ASCII format for network interchange", STD 80,
         RFC 20, October 1969.

   [11]  The Unicode Consortium, "The Unicode Standard, Version
         7.0.0", (Mountain View, CA: The Unicode Consortium,
         2014 ISBN 978-1-936213-09-2), June 2014.

   [12]  The Unicode Consortium, "The Unicode Standard, Version
         13.0.0, Section 5.18 Case Mappings", (Mountain View, CA:
         The Unicode Consortium, 2014 ISBN 978-1-936213-26-9),
         March 2020.

   [13]  The Unicode Consortium, "CaseFolding-13.0.0.txt",
         (Mountain View, CA: The Unicode Consortium, 2014 ISBN
         978-1-936213-26-9), March 2020.

18.2.  Informative References

   [14]  Nowicki, B., "NFS: Network File System Protocol
         specification", RFC 1094, DOI 10.17487/RFC1094, March
         1989.

   [15]  Callaghan, B., Pawlowski, B., and P.
         Staubach, "NFS
         Version 3 Protocol Specification", RFC 1813,
         DOI 10.17487/RFC1813, June 1995.

   [16]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
         Beame, C., Eisler, M., and D. Noveck, "NFS version 4
         Protocol", RFC 3010, DOI 10.17487/RFC3010, December 2000.

   [17]  Hoffman, P. and M. Blanchet, "Preparation of
         Internationalized Strings ("stringprep")", RFC 3454,
         DOI 10.17487/RFC3454, December 2002.

   [18]  Faltstrom, P., Hoffman, P., and A. Costello,
         "Internationalizing Domain Names in Applications (IDNA)",
         RFC 3490, DOI 10.17487/RFC3490, March 2003.

   [19]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
         Profile for Internationalized Domain Names (IDN)",
         RFC 3491, DOI 10.17487/RFC3491, March 2003.

   [20]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
         Beame, C., Eisler, M., and D. Noveck, "Network File System
         (NFS) version 4 Protocol", RFC 3530, DOI 10.17487/RFC3530,
         April 2003.

   [21]  Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
         "Network File System (NFS) Version 4 Minor Version 1
         Protocol", RFC 5661, DOI 10.17487/RFC5661, January 2010.

   [22]  Hoffman, P. and J. Klensin, "Terminology Used in
         Internationalization in the IETF", BCP 166, RFC 6365,
         DOI 10.17487/RFC6365, September 2011.

   [23]  Thaler, D., Ed., "Issues in Identifier Comparison for
         Security Purposes", RFC 6943, DOI 10.17487/RFC6943, May
         2013.

   [24]  Beame, C., Thurlow, R., Callaghan, B., Robinson, D.,
         Noveck, D., Eisler, M., and S. Shepler, "Network File
         System (NFS) version 4 Protocol",
         draft-ietf-nfsv4-rfc3010bis-05 (work in progress),
         November 2002.

   [25]  Williams, N., "Internationalization Considerations for
         Filesystems and Filesystem Protocols",
         draft-williams-filesystem-18n-00 (work in progress),
         July 2020.

Appendix A.  History

   This section describes the history of internationalization within
   NFSv4. Despite the fact that NFSv4.0 and subsequent minor versions
   have differed in many ways, the actual implementations of
   internationalization have remained the same and internationalized
   names have been handled without regard to the minor version being
   used. This is the reason the document is able to treat
   internationalization for all NFSv4 minor versions together.

   During the period from the publication of RFC3010 [16] until now, two
   different perspectives with regard to internationalization have been
   held and represented, to varying degrees, in specifications for NFSv4
   minor versions.

   o  The perspective held by NFSv4 implementers treated most aspects of
      internationalization as basically outside the scope of what NFSv4
      client and server implementers could deal with. This was because
      the POSIX interface treated file names as uninterpreted strings of
      bytes, because the file systems used by NFSv4 servers treated file
      names similarly, and because those file systems contained files
      with internationalized names using a number of different encoding
      methods, chosen by the users of the POSIX interface. From this
      perspective, wider support for internationalized names and general
      use of universal encodings was a matter for users and applications
      and not for protocol implementers or designers.

   o  Within the IETF in general and in the IESG, there was a feeling
      that new protocols, such as NFSv4, could not avoid dealing with
      internationalization issues, making it difficult to treat these
      matters, as the implementers' perspective would have it, as
      essentially out of scope.

   As specifications were developed, approved, and at times rewritten,
   this fundamental difference of approach was never fully resolved,
   although, with the publication of RFC7530 [3], a satisfactory modus
   vivendi may have been arrived at.

   Although many specifications were published dealing with NFSv4
   internationalization, all minor versions used the same implementation
   approach, even when the current specification for that minor version
   specified an entirely different approach. As a result, we need to
   treat the history of NFSv4 internationalization below as an
   integrated whole, rather than treating individual minor versions
   separately.

   o  The approach to internationalization specified in RFC3010 [16]
      sidestepped the conflict of approaches cited above by discussing
      the reasons that UTF-8 encoding was desirable while leaving file
      names as uninterpreted strings of bytes. The issue of string
      normalization was avoided by saying "The NFS version 4 protocol
      does not mandate the use of a particular normalization form at
      this time."

      Despite this approach's inconsistency with general IETF
      expectations regarding internationalization, RFC3010 was published
      as a Proposed Standard. NFSv4.0 implementation related to
      internationalization of file names followed the same paradigm used
      by NFSv3, assuring interoperability with files created using that
      protocol, as well as with those created using local means of file
      creation.

   o  When it became necessary, because of issues with byte-range
      locking, to create an rfc3010bis, no change to the previously
      approved approach seemed indicated and the drafts submitted up
      until [24] closely followed RFC3010 as regards
      internationalization.
      The IESG then decided that a different
      approach to internationalization was required, to be based on
      stringprep [17], and rfc3010bis was accordingly revised, replacing
      all of the Internationalization section, before being published as
      RFC3530 [20].

      These changes required the rejection of file names that were not
      valid UTF-8, file names that included code points not, at the time
      of publication, assigned a Unicode character (e.g., capital
      eszett) or that were not allowed by stringprep (e.g., zero-width
      joiner and non-joiner characters). Because these restrictions
      would have caused the set of valid file names to be different on
      NFS-mounted and local file systems, there was no chance of them
      ever being implemented.

      Because these specification changes were made without working
      group involvement, most implementers were unaware of them while
      those who were aware of the changes ignored them and continued to
      develop implementations based on the internationalization approach
      specified in RFC3010.

   o  When NFSv4.1 was being developed, it seemed that no changes in
      internationalization would be required. Many people were unaware
      of the stringprep-based requirements which made the NFSv4.0
      internationalization specified in RFC3530 unimplementable. As a
      result, the internationalization specified in RFC5661 [21] was
      based on that in RFC3530 [20], although the addition of the
      attribute fs_charset_cap, discussed below, provided additional
      flexibility.

      The attribute fs_charset_cap, discussed in Section 7,
      provides flags allowing the server to indicate that it accepts and
      processes non-UTF-8 file names. Rejecting them was a "MUST" in
      RFC3530 and became a "SHOULD" in RFC5661, although there is no
      evidence that any of these designations ever affected server
      behavior.

      As a result of this treatment of internationalization, even though
      NFSv4.1 was a separate protocol and could have had a different
      approach to internationalization, for a considerable time, the
      internationalization specification for both protocols was based on
      stringprep (in RFC3530 and RFC5661) while the actual
      implementations of the two minor versions both followed the
      approach specified in RFC3010, despite its obsoleted status.

   o  When work started on rfc3530bis, it was clear that issues related
      to internationalization had to be addressed. When the
      implications of the stringprep references in RFC3530 were
      discussed with implementers, it became clear that mandating that
      NFSv4.0 file names conform to stringprep was not appropriate.

      While some working group members articulated the view that,
      because of the need to maintain compatibility with the POSIX
      interface and existing file systems, internationalization for
      NFSv4 could not be successfully addressed by the IETF, the
      rfc3530bis draft submitted to the IESG did not explicitly embrace
      the implementers' perspective set forth above.

      The draft submitted to the IESG and RFC7530 [3] as published
      provided an explanation (see Section 5) as to why restrictions on
      character encodings were not viable. It allowed non-UTF-8
      encodings to be used for internationalized file names while
      defining UTF-8 as the preferred encoding and allowing servers to
      reject non-UTF-8 strings as invalid. Other stringprep-based string
      restrictions were eliminated. With regard to normalization, it
      continued to defer the matter, leaving open the possibility that
      a normalization form might be chosen later.

      This approach is compatible, in implementation terms, with that
      specified in RFC3010 [16], allowing it to be used compatibly with
      existing implementations for all existing minor versions.
      This is
      despite the fact that RFC5661 [21] specifies an entirely different
      approach.

      As a result of discussions leading up to the publishing of
      RFC7530, it was discovered that some local file systems used with
      NFSv4 were configured to be both normalization-aware and
      normalization-preserving, mapping all canonically equivalent file
      names to the same file while preserving the form actually used to
      create the file, whether normalized or not. This
      behavior, which is legal according to RFC3010 (which says little
      about name mapping), is probably illegal according to stringprep.
      Nevertheless, it was expressly pointed out in RFC7530 as a valid
      choice to deal with normalization issues, since it allows
      normalization-aware processing without the difficulties that arise
      in imposing a particular normalization form, as described in
      Section 9.

      In its discussion of internationalized domain names, RFC7530 [3]
      adopted an approach compatible with IDNA2003, rather than
      attempting to derive the specification from the behavior of
      existing implementations.

   o  When IDNA2003 was replaced by IDNA2008, the internationalization
      specified by [3] was not changed. Also, it appears unlikely that
      implementations were changed to reflect that shift.

   o  NFSv4.2 made no changes to internationalization. As a result,
      RFC7862 [4], which made no mention of internationalization,
      implicitly aligned internationalization in NFSv4.2 with that in
      NFSv4.1, as specified by RFC5661 [21].

      As a result of this implicit alignment, there is no need for this
      document to specifically address NFSv4.2 or be marked as updating
      RFC7862. It is sufficient that it updates RFC5661, which
      specifies the internationalization for NFSv4.1, inherited by
      NFSv4.2.

   o  Later, as work on the predecessors of this document was underway,
      [25] was submitted, making it necessary that some gaps in the
      discussion of internationalization in [3] be filled in. These
      gaps primarily concerned the need of NFSv4 clients to match the
      handling of the corresponding server when using cached file name
      data locally, or to avoid making invalid assumptions about that
      handling, when information on the details of such handling was not
      available.

   The above history can, for the purposes of the rest of this document,
   be summarized in the following statements:

   o  The actual treatment of internationalization within NFSv4 has not
      been affected by the particular minor version used, despite the
      fact that the specifications for the minor versions have often
      differed in their treatment of internationalization.

   o  With regard to file names, implementations have followed the
      internationalization approach specified in RFC3010, which is
      compatible with the treatment in RFC7530.

   o  With regard to internationalized domain names, RFC7530 [3]
      specified an approach compatible with IDNA at the time of
      publication. However, no detailed analysis was done to determine
      whether NFSv4 implementations actually followed that approach.

   o  [3] did not specifically address the special issues that clients
      would face, relying on the assumption that each file is
      accessible only by its name. As this assumption is no longer true
      when internationalized name handling is in effect, the appropriate
      handling is discussed in Section 11. Section 11.2 explains the
      options for handling in the case in which the client has very
      limited information about the details of the server's
      internationalization-related handling of file names, while
      Section 11.3 discusses how a client might use more complete
      information provided by new attributes.

   In order to deal with all NFSv4 minor versions, this document follows
   the internationalization approach defined in RFC7530, with some
   changes discussed in Section 4, and applies that approach to all
   NFSv4 minor versions.

Appendix B.  Form-insensitive String Comparisons

   This section deals with two varieties of form-insensitive string
   comparison:

   o  Providing a comparison function which is form-insensitive only.
      For any string, whether normalized or not, this function will
      determine it to be equivalent to all canonically equivalent
      strings, including, but not limited to, the normalized forms NFC
      and NFD.

   o  Providing a comparison function which is both form-insensitive and
      case-insensitive. This function will determine strings that only
      differ in case to be equal but will also be form-insensitive, as
      described above.

   The non-normative guidance provided in this Appendix is intended to
   be helpful to two distinct implementation areas:

   o  Implementation of server-side file systems intended to be accessed
      using NFSv4 protocols. While it is often the case that such
      file systems are developed by separate organizations from those
      concerned with NFSv4 server development, the
      internationalization-related requirements specified in this
      document must be adhered to for successful inter-operation, making
      this implementation guidance apropos despite any potential
      organizational barriers.

   o  Implementation of NFSv4 clients that need to provide matching
      internationalization-related handling for reasons discussed in
      Section 11.

   There are three basic reasons that two strings being compared might
   be canonically equivalent even though not identical.
   For each such
   reason, the implementation will be similar in the cases in which
   form-insensitive comparison (only) is being done and in which the
   comparison is both case-insensitive and form-insensitive.

   o  Two strings may differ only because each has a different one of
      two code points that are essentially the same. Three code points
      assigned to represent units are essentially equivalent to the
      character denoting those units. For example, the OHM SIGN
      (U+2126) is essentially identical to the GREEK CAPITAL LETTER
      OMEGA (U+03A9) as MICRO SIGN (U+00B5) is to GREEK SMALL LETTER MU
      (U+03BC) and ANGSTROM SIGN (U+212B) is to LATIN CAPITAL LETTER A
      WITH RING ABOVE (U+00C5).

      As discussed in items EX2 and EX3 in Section 10.2, it is possible
      to adjust for this situation using tables designed to resolve
      case-insensitive equivalence, treating the unit symbols as an
      additional case variant and ignoring the fact that the graphic
      representation is the same. As a result, those doing string
      comparisons that are both form-insensitive and case-insensitive do
      not need to address this issue as part of form-insensitivity,
      since it would be dealt with by existing case-insensitive
      comparison logic.

      Where there is no case-insensitive comparison logic, this function
      needs to be performed using similar tables whose primary function
      is to provide the decomposition of precomposed characters, as
      described in Appendix B.2.

   o  Two strings may differ in that one has the decomposed form
      consisting of a base character and an associated combining
      character while the other has a precomposed character equivalent.

      Although, as discussed in item EX3 in Section 10.2, it is
      possible to use tables designed to resolve case-insensitive
      equivalence by providing, as a possible case-insensitively
      equivalent string, a multi-character string giving the
      decomposition of a precomposed character, special logic to do so
      is only necessary when the decomposition is not a canonical one,
      i.e., it is a compatibility equivalence.

      In general, the tables used to do comparisons, whether
      case-sensitive or not, need to provide information about the
      canonical decomposition of precomposed characters. See
      Appendix B.2 for details.

   o  Two strings may differ in that combining characters that have
      the same effect appear in a different order in each string.

      There is no way this function could be performed within code
      primarily devoted to case-insensitive equivalence. However, this
      function could be added to implementations providing both sorts
      of equivalence, once it is determined that the base characters are
      case-equivalent while there is a difference in combining
      characters to be resolved. (See Appendix B.5 for a discussion
      of how sets of combining characters can be compared.)

B.1.  Name Hashes

   We discussed in Section 10.1 the construction of a case-insensitive
   file name hash. Such a hash could also be form-insensitive if
   the hash contribution of every pre-composed character matched the
   combined contribution of the characters that it decomposes into.

   However, there is no obvious way that sort of hash could respect the
   canonical equivalence of multiple combining characters modifying the
   same base character, when those combining characters appear in
   different orders.
   Addressing that issue would require a
   significantly different sort of hash, in which combining characters
   are treated differently from others, so that the re-ordering of a
   string of combining characters applying to the same base character
   will not affect the hash.

   In the hash discussed in Section 10.1, there is no guarantee that the
   hash for multiple combining characters presented in different orders
   will be the same. This is because typically such hashes implement
   some transformation on the existing hash, together with adding the
   new character to the hash being accumulated. Such methods of hash
   construction will arrive at different values if the ordering of
   combining characters changes.

   In order to create a hash with the necessary characteristics, one can
   construct a separate sub-hash for each composite character,
   consisting of one non-combining character (which may be pre-composed)
   together with the set (possibly null) of combining characters
   immediately following it. Each such composite character, whether
   precomposed or not, will have its own sub-hash, which will be the
   same regardless of the order of the combining characters.

   If the hash is to include case-insensitivity, special handling is
   needed to deal with issues arising from the handling of COMBINING
   GREEK YPOGEGRAMMENI (U+0345). That combining character, as discussed
   in item EX6 of Section 10.2, is uppercased to the non-combining
   character GREEK CAPITAL LETTER IOTA (U+0399), which is in turn
   lowercased to the non-combining character GREEK SMALL LETTER IOTA
   (U+03B9).
   As a result, when computing a case-insensitive hash, when
   a base character is IOTA (of either case) and the previous base
   character is ALPHA, ETA, or OMEGA (of the same case as the IOTA),
   that IOTA is treated, for the purpose of defining the composite
   characters for which to generate sub-hashes, as if it were a
   combining character. As a result, in this case a string containing
   two composite characters will be treated as if it were a single
   composite character. This string will have its own sub-hash, which
   will be the same regardless of the order of combining characters.

   The same outline will be followed for generating hashes which are to
   be form-insensitive (only) and for those which are to be both
   form-insensitive and case-insensitive. The initial value,
   representing the base character, will differ based on the type of
   hash.

   o  In the case-sensitive case, the initial value of the sub-hash will
      reflect the value of the base character with the only possible
      need to map to a different value deriving from the existence of
      OHM SIGN (U+2126), ANGSTROM SIGN (U+212B), and MICRO SIGN (U+00B5)
      as characters distinct from the letters that represent these
      units. This could be done with a mapping table but most
      implementations would probably choose to implement special-purpose
      code to do this.

   o  In the case-insensitive case, the initial value of the sub-hash
      will reflect the case-based equivalence class to which the
      character belongs (the lower-case equivalent is generally
      suitable). In this context a table-based mapping is required and
      this mapping can shift OHM SIGN, ANGSTROM SIGN, and MICRO SIGN to
      the case-based equivalence class for the corresponding character.
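
   The form-insensitive (case-sensitive) sub-hash outlined above can be
   sketched, non-normatively, in Python. Everything named here
   (composite_chars, sub_hash, the 64-bit width, the multiplier, and the
   use of addition as the order-independent combining step) is an
   assumption made for illustration, not a description of any existing
   implementation. The sketch relies on canonical decomposition from
   Python's unicodedata module in place of hand-built tables. That
   decomposition maps OHM SIGN and ANGSTROM SIGN to the corresponding
   letters but leaves MICRO SIGN unchanged, so an implementation wishing
   to map MICRO SIGN as described above would need a separate table
   entry. Sequences such as decomposed Hangul, whose parts are not
   combining characters, are not handled.

```python
import unicodedata

MASK = (1 << 64) - 1  # assumed 64-bit sub-hash width


def composite_chars(name: str) -> list:
    """Split a name into composite characters: one base character
    followed by the (possibly empty) run of combining characters."""
    groups = []
    for ch in name:
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch      # attach the mark to the current base
        else:
            groups.append(ch)     # start a new composite character
    return groups


def sub_hash(group: str) -> int:
    """Form-insensitive sub-hash of one composite character."""
    # Canonical decomposition splits precomposed characters into base
    # plus marks and replaces OHM SIGN / ANGSTROM SIGN by their letters.
    decomposed = unicodedata.normalize("NFD", group)
    base, marks = decomposed[0], decomposed[1:]
    value = (ord(base) << 32) & MASK   # contribution of the base character
    for mark in marks:
        # Addition is commutative and associative, so the order in
        # which the marks appear cannot change the result.
        value = (value + ord(mark) * 0x9E3779B1) & MASK
    return value


# Precomposed and decomposed forms yield the same sub-hash.
print(sub_hash("\u00e9") == sub_hash("e\u0301"))
# So do two orderings of the same combining characters.
print(sub_hash("a\u0301\u0323") == sub_hash("a\u0323\u0301"))
```

   Because the marks are summed, two composite characters whose marks
   differ only in order always collide, including orderings that are
   not canonically equivalent. That is acceptable for a hash, which
   only has to agree for equivalent names, but it means that a hash
   match still has to be confirmed by a detailed comparison.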

   Regardless of the type of hash to be produced, values based on the
   following combining characters need to be reflected in the sub-hash.
   In order to make the sub-hash invariant to changes in the order of
   combining characters, values based on the particular combining
   character are combined with the hash being computed using a
   commutative associative operation, such as addition.

   To reduce false positives, it is desirable to make the hash
   relatively wide (i.e., 32-64 bits), with the value based on the base
   character in the upper portion of the word and the values for the
   combining characters appearing in a wide range of bit positions in
   the rest of the word, to limit the degree to which multiple distinct
   sets of combining characters have values that are the same. Although
   the details will be affected by processor cache structure and the
   distribution of names processed, a table of values will be used, but
   typical implementations will be different in the two cases we are
   dealing with, as described in Appendix B.2.

   As each sub-hash is computed, it is combined into a name-wide hash.
   There is no need for this computation to be order-independent and it
   will probably include a circular shift of the hash computed so far to
   be added to the contribution of the sub-hash for the new base or
   composed character.

   As described in Appendix B.3, the appropriate full name hash will
   have the major role in excluding potential matches efficiently.
   However, in some small number of cases, there will be a hash match in
   which the names to be compared are not equivalent, requiring more
   involved processing. It is assumed below that a given name will be
   used to search for potential cached matches within the directory, so
   that for that name, one will be able to retain information used to
   construct the full name hash (e.g.,
   individual sub-hashes plus the bounds of each
   composite character). These will be compared against cached entries
   where only the full (e.g., 64-bit) name hash and the name itself will
   be available for comparison.

B.2.  Character Tables

   The per-character tables used in these algorithms have a number of
   types of entries for different types of characters. In some cases,
   information for a given character type will be essentially the same
   whether the comparison is to be form-insensitive or
   case-insensitive. In others, there will be differences. Also, there
   may be entry types that only exist for particular types of
   comparisons. In any case, some bits within the table entry will be
   devoted to representing the type of character and entry:

   o  For combining characters, the entry will provide information about
      the character's contribution to the composite character sub-hash
      in which it appears.

   o  For case-insensitive comparisons, there need to be special entries
      for characters which, while not themselves combining characters,
      are the case-insensitive equivalents of combining characters. An
      example of this situation is provided in item EX6 within
      Section 10.2.

   o  For pre-composed characters, the entry needs to provide the
      initial hash value which is to be the basis for the sub-hash for
      the name substring, including contributions for the base character
      together with the contribution of included combining characters.
      In addition, such entries will provide, separately, information
      about the character's canonical decomposition.

   o  For case-insensitive comparisons, there need to be, for base
      characters, entries assigning each base character to the
      case-based equivalence class to which it belongs, although such
      entries can be avoided if the equivalence class matches the
      character (usually caseless and lowercase characters).
2427 o Also, for case-insensitive comparisons, there will need to be 2428 special entries for characters which multi-character string as 2429 case-insensitive equivalent of the base character. Examples of 2430 this situation are provided in items EX4 and EX5 within 2431 Section 10.2. Such entries will need to have a hash-contribution 2432 that reflects the hash that would be computed for the multi- 2433 character string. 2435 o For form-insensitive comparisons, there will be special entries to 2436 provide special handling for those cases in which there are two 2437 canonically equivalent single characters. Such entries do not 2438 exist for case-insensitive comparison since this situation can be 2439 handled by a non-standard use of case mapping for base characters 2440 by placing these two characters in the same case-based equivalence 2442 In the common case in which a two-stage mapping will be used, there 2443 will be common groups of characters in which no table entry will be 2444 required, allowing a default entry type to be used for some character 2445 groups with entry contents easily calculable from the code point. 2447 o In the case form-insensitive comparison, this consists of all base 2448 characters, with the hash contribution of the character derivable 2449 by a pre-specified transformation of the code point value. 2451 o In the case case-insensitive comparison, this consists of all base 2452 character which are either caseless or equivalence class is the 2453 same as the code point, typically lowercase characters. As in the 2454 form-insensitive case, the hash contribution of the character is 2455 derivable by a pre-specified transformation of the code point 2456 value, which matches, in this case, the id assigned to the case- 2457 based equivalence class. 2459 B.3. 
Outline of Comparison

We are assuming that comparisons will be based on the hash values
computed as described in Appendix B.1, whether the comparison is to
be form-insensitive or both case-insensitive and form-insensitive.

To facilitate this comparison, the name hash will be stored with the
names to be compared. As a result, when there is a need to
investigate whether a new name matches any existing names, it will be
possible to compute a hash for the new name and compare it to the
hashes of the existing names cached for that directory. The detailed
comparisons described in Appendices B.4 and B.5 then have to be done
relatively rarely, since non-matching names with matching hashes are
likely to be atypical.

Given the above, it is a reasonable assumption, which we will take
note of in the sections below, that for one of the names to be
compared, we will have access to data generated in the process of
computing the name hash, while for the other names, such data would
have to be generated anew, when necessary. When that data includes,
as we expect it will, the offset and length of the string regions
covered by each sub-hash, direct byte-by-byte comparisons between
corresponding regions of the two strings can exclude the possibility
of difference. The detailed logic that deals with the possibility of
canonical equivalence or case-based equivalence is then needed only
in the absence of identical name segments.

In the case in which the byte-by-byte comparisons fail, further
analysis is necessary:

o  First, the associated base characters are compared, as is
   discussed in Appendix B.4. When doing form-insensitive comparison
   this is straightforward.
   However, when case-insensitive comparison is to be done, there is
   the possibility that the sub-hash boundaries of the two comparands
   are different, requiring that a common point in both comparands be
   found to resume comparison after a successful match. For either
   form of comparison, if a mismatch is found at this point then the
   comparison fails, while, if there is a match, there must be a
   comparison of any following combining characters, as described
   below, before moving on to the substring covered by the next
   sub-hash for each comparand.

o  If there is no mismatch as to the base characters, the set of
   associated combining characters (which might be null) must be
   compared, as is discussed in Appendix B.5. If a mismatch is found
   at this point then the comparison fails. This may be because the
   sets of combining characters are different, because there are
   multiple copies of the same combining character in one of the
   strings, or because the difference in combining characters is not
   one that maintains canonical equivalence (due to combining
   classes).

o  When both comparisons show a match, the comparison resumes at the
   next substring, using a byte-by-byte comparison initially. If the
   comparison cannot be resumed because one of the strings is
   exhausted, the comparison terminates, succeeding only if both
   strings are exhausted and failing if only one of the strings is
   exhausted.

B.4. Comparing Base Characters

In general, the task of comparing base characters is simple,
requiring a table lookup using the numeric value of the initial
character in the substring.
When doing form-insensitive comparison, the value compared is the
base character associated with the initial (possibly pre-composed)
character, while for case-insensitive comparison it is the case-based
equivalence class associated with that character.

When doing case-insensitive comparison, issues may arise when there
is a multi-character string that is the case-insensitive equivalent
of a single base character, as discussed in items EX4 and EX5 within
Section 10.2. These are best dealt with using the approach outlined
in Section 10.1. When it is noted that the current base character
(for either comparand) is a character whose associated equivalence
class contains one or more multi-character strings, then these
comparisons, which normally require that each base character be
mapped to the same case-based equivalence class, must be modified to
allow the equivalences permitted by these multi-character sequences.

In such cases, there may need to be comparisons involving the
multi-character string, in addition to the normal comparisons using
the base characters' equivalence class. As an illustration, we will
consider possible comparison results that involve character strings
within the equivalence class mentioned in item EX4 within
Section 10.2:

o  When the base character for each of the two comparands is either
   LATIN SMALL LETTER SHARP S (U+00DF) or LATIN CAPITAL LETTER SHARP
   S (U+1E9E), then a match is recognized.

o  When the base character for one comparand is either LATIN SMALL
   LETTER SHARP S (U+00DF) or LATIN CAPITAL LETTER SHARP S (U+1E9E),
   while the other is not, each character in that other comparand is
   case-insensitively compared to the corresponding character of the
   string "ss", with a match being signaled when all such subsequent
   characters match, except for possibly being of a different case.
   Because that comparison will involve multiple base characters, the
   overall comparison point for that comparand will have to be
   adjusted to reflect the characters already processed as part of
   the comparison.

o  When the base character for neither comparand is either LATIN
   SMALL LETTER SHARP S (U+00DF) or LATIN CAPITAL LETTER SHARP S
   (U+1E9E), then matching proceeds normally. As a result, the only
   cases in which character strings within the equivalence class
   being discussed will result in a match are those in which both
   comparands have one of the strings "ss", "sS", "Ss", or "SS" at
   the current comparison point.

B.5. Comparing Combining Characters

In order to effect the necessary comparison, one needs to assemble,
for each comparand, the set of combining characters within the
current substring. The means used might be different for different
comparands since there might be useful information retained from the
generation of the associated string hash for one of the comparands.
In any case, there are two potential sources for these characters:

o  Those deriving from the canonical decomposition of a pre-composed
   character, treated as a null set if the base character is not a
   pre-composed one.

o  Those combining characters that immediately follow the base
   character, which will be a null set if the immediately following
   character is not a combining character. Note that it is possible,
   when doing case-insensitive comparison, to treat certain
   characters, not normally combining characters, as if they were.
   Such situations can arise when, as described in item EX6 within
   Section 10.2, such non-combining characters are the uppercase or
   lowercase equivalents of combining characters.
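
As an illustration only, the two sources listed above can be
gathered as in the following Python sketch. It is not the table-driven
method this appendix assumes: it relies on the standard unicodedata
module, treats a "combining character" as one with a non-zero
canonical combining class, assumes the decomposition of the first
character consists of a base character followed only by combining
characters, and ignores the case-insensitive special cases of item
EX6. The function name and return shape are hypothetical.

```python
import unicodedata


def combining_set(name, start):
    """Collect the combining characters of the composite character
    beginning at name[start].

    Returns (base, marks, end) where `end` is the index just past the
    composite character.  Hypothetical helper for illustration.
    """
    # Source 1: the canonical decomposition of a pre-composed character.
    # For a character that is not pre-composed, NFD returns it unchanged,
    # so this source contributes a null set.
    decomposed = unicodedata.normalize("NFD", name[start])
    base, marks = decomposed[0], list(decomposed[1:])

    # Source 2: combining characters immediately following the base
    # character (assumed here to be those with a non-zero combining class).
    end = start + 1
    while end < len(name) and unicodedata.combining(name[end]) != 0:
        marks.append(name[end])
        end += 1
    return base, marks, end
```

For example, for the name U+00E9 U+0323 "x", a call with start 0
yields the base "e", the marks U+0301 and U+0323, and an end index
of 2.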
Although the two sets of characters can be checked to see if they are
identical, this is a sufficient but not a necessary condition for
equivalence since some permutations of a set of combining characters
are considered canonically equivalent. To summarize the appropriate
equivalence rules:

o  Combining characters of different combining classes may be freely
   reordered.

o  If combining characters of the same combining class are reordered,
   then the result is not canonically equivalent.

The rules above do not directly apply to the case, discussed above,
in which some non-combining characters are the case-based equivalents
of combining characters such as COMBINING GREEK YPOGEGRAMMENI
(U+0345). Nevertheless, because of this equivalence, those
implementing case-insensitive comparisons do have to deal with this
potential equivalence when considering whether two strings containing
combining characters or their case-based equivalents match. As a
result, when comparing strings of combining characters, we need to
implement the following modified rules:

o  When one comparand has a true combining character and the other
   comparand has an identical one, they may differ in location as
   long as there is no permutation of combining characters of the
   same combining class.

o  When one comparand has a true combining character and the other
   has a case-insensitive equivalent which is not a combining
   character, that character must appear last in its string while the
   combining character may appear in its string in any position
   except the last. In this case, there are no restrictions based on
   combining classes.

o  When both comparands contain a non-combining character
   case-insensitively equivalent to a combining character, these
   characters must appear last in their respective strings.
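
The two basic reordering rules for true combining characters, stated
earlier in this section, can be checked with the following Python
sketch. It is offered only as an illustration and is not the
comparison algorithm of this appendix: it delegates to the NFD
normalization of the standard unicodedata module, which places
combining characters in canonical order, and it does not implement
the modified rules for case-based equivalents. The function name is
hypothetical.

```python
import unicodedata


def marks_canonically_equivalent(base, marks_a, marks_b):
    """Return True if base+marks_a and base+marks_b are canonically
    equivalent, as judged by comparing their NFD forms.

    Hypothetical helper; NFD reorders marks of different combining
    classes into a canonical order but keeps the relative order of
    marks that share a combining class.
    """
    form_a = unicodedata.normalize("NFD", base + marks_a)
    form_b = unicodedata.normalize("NFD", base + marks_b)
    return form_a == form_b


# COMBINING ACUTE ACCENT and COMBINING GRAVE ACCENT share a combining
# class; COMBINING DOT BELOW belongs to a different one.
ACUTE, GRAVE, DOT_BELOW = "\u0301", "\u0300", "\u0323"

# Different combining classes: reordering keeps equivalence.
different_classes = marks_canonically_equivalent(
    "e", ACUTE + DOT_BELOW, DOT_BELOW + ACUTE)

# Same combining class: reordering breaks equivalence.
same_class = marks_canonically_equivalent(
    "e", ACUTE + GRAVE, GRAVE + ACUTE)
```

Under these assumptions, the first check reports equivalence and the
second does not, matching the two rules.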
Although it is possible to divide combining characters based on their
combining classes, sort each of the lists, and compare, that approach
will not be discussed here. Even though the use of sorts might allow
use of an overall N log N algorithm, the number of combining
characters is likely to be too low for this to be a practical
benefit. Instead, we present below an order N-squared algorithm
based on searches.

In this algorithm, one string, chosen arbitrarily, is designated the
"source string" and successive characters from it are searched for in
the other, designated the "target string". Associated with the
target string is a mask to allow characters searched for and found to
be marked so that they will not be found a second time. In the
treatment below, when a character is "searched for", only characters
not yet in the mask are examined and the character sought has its
associated mask bit set when it is found.

Each character in the source string is processed in turn, with the
actual processing depending on the particular character being
processed. There are three possibilities to be dealt with:

1.  For the typical case (i.e. a combining character with no
    case-insensitive equivalents), the character is searched for in
    the target string, with the compare failing if it is not found.

    If it is found, then the region of the target string between the
    point corresponding to the current position in the source string
    and the character found is examined to check for characters of
    the same combining class. If any are found, the overall
    comparison fails.

2.  For the case of a combining character with a case-insensitive
    equivalent, the character is searched for as described in the
    first paragraph of item 1. However, the compare does not fail if
    it is not found.
    Instead, a case-insensitive equivalent character is searched for
    at the final position of the target string and the compare fails
    if that is not found.

3.  For the case of a non-combining character that has a combining
    character as a case-insensitive equivalent, the overall
    comparison fails if the character is not in the final position
    within the source string or has already been successfully
    searched for. Otherwise, the corresponding combining character
    is searched for in the target as described in the first
    paragraph of item 1. The overall compare fails if it is not
    found.

Once all characters in the source string have been processed, the
mask associated with the target string is examined to see if there
are combining characters that were not found in the matching process
described above. Normally, if there are such characters, the overall
comparison fails. However, if the last character of the target was
not matched and if it is a non-combining character that is
case-insensitively equivalent to a combining character, then the
comparison succeeds and the remaining character needs to be matched
with the next substring in the source.

Acknowledgements

This document is based, in large part, on Section 12 of [3], and all
the people who contributed to that work have helped make this
document possible, including David Black, Peter Staubach, Nico
Williams, Mike Eisler, Trond Myklebust, James Lentini, Mike Kupfer
and Peter Saint-Andre.

The author wishes to thank Tom Haynes for his timely suggestion to
pursue the task of dealing with internationalization on an NFSv4-wide
basis.
The author wishes to thank Nico Williams for his insights regarding
the need for clients implementing file access protocols to be aware
of the details of the server's internationalization-related name
processing, particularly when case-insensitive file systems are being
accessed.

Author's Address

David Noveck
NetApp
1601 Trapelo Road
Waltham, MA 02451
United States of America

Phone: +1 781 572 8038
Email: davenoveck@gmail.com