Internet Engineering Task Force                         N. Williams, Ed.
Internet-Draft                                         Cryptonector, LLC
Intended status: Best Current Practice                      July 6, 2020
Expires: January 7, 2021

  Internationalization Considerations for Filesystems and Filesystem
                               Protocols
                    draft-williams-filesystem-18n-00

Abstract

   This document describes requirements for internationalization (I18N)
   of filesystems specifically in the context of Internet protocols, the
   architecture for filesystems in most currently popular general
   purpose operating systems, and their implications for filesystem
   I18N. From the I18N requirements for filesystems and the
   architecture of running code we derive requirements and
   recommendations for implementors of operating systems and/or
   filesystems, as well as for Internet remote filesystem protocols.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 7, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Requirements Language
     1.2. Filesystem Internationalization
       1.2.1. Canonical Equivalence (Normalization)
       1.2.2. Case Foldings for Case-Insensitivity
       1.2.3. Caching Clients
     1.3. Running Code Architecture Notes
   2. Filesystem I18N Guidelines
     2.1. Filesystem I18N Guidelines: Non-Unicode File names
     2.2. Filesystem I18N Guidelines: Case-Insensitivity
     2.3. I18N Versioning
   3. Filesystem Protocol I18N Guidelines
     3.1. I18N and Caching in Filesystem Protocol Clients
   4. Internationalization Considerations
   5. IANA Considerations
   6. Security Considerations
   7. References
     7.1. Normative References
     7.2. Informative References
     7.3. URIs
   Author's Address

1. Introduction

   [TBD: Add references galore. How to reference Unicode? How to
   reference US-ASCII? How best to reference HFS+? How best to
   reference ZFS? May have to find useful references for POSIX and
   WIN32.
   Various blog entries may be of interest -- can they be
   referenced?]

   We, the Internet community, have long concluded that we must
   internationalize all our protocols. This is generally not an easy
   task, as often we are constrained by the realities of what can be
   achieved while maintaining backwards compatibility.

   In this document we focus on filesystem internationalization (I18N),
   specifically only for file names and file paths. Here we address the
   two main issues that arise in filesystem I18N:

   o Unicode equivalence

   o Case foldings for case-insensitivity

   These two issues are different flavors of the same generic issue:
   that there can be more than one way to write text with the same
   rendering and/or semantics.

   Only I18N issues relating to file names and paths are addressed here.
   In particular, I18N issues related to representations of user
   identities and groups, for use in access control lists (ACLs) or
   other authorization systems, are out of scope for this document.
   Also out of scope here are I18N issues related to Uniform Resource
   Identifiers (URIs) [RFC3986] or Internationalized Resource
   Identifiers (IRIs) [RFC3987].

1.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.2. Filesystem Internationalization

   We must address two issues:

   o Unicode equivalence

   o Case foldings for case-insensitivity

   Unicode can represent certain character strings in multiple visually-
   and semantically-equivalent ways. For example, there are two ways to
   express LATIN SMALL LETTER A WITH ACUTE (á):

   o U+00E1

   o U+0061 U+0301

   For some glyphs there is a single way to write them. For others
   there are two. And for yet others there can be many more than two.
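
   As a minimal illustration, the two encodings of LATIN SMALL LETTER A
   WITH ACUTE listed above can be compared with Python's standard
   unicodedata module. This is only a sketch: the names composed,
   decomposed, and canonically_equivalent are ours and are not defined
   by this document or by Unicode.

```python
import unicodedata

composed = "\u00e1"          # LATIN SMALL LETTER A WITH ACUTE
decomposed = "\u0061\u0301"  # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The raw code point sequences differ, so a byte-wise or code-point-wise
# file name comparison treats them as two different names.
print(composed == decomposed)

# Under canonical equivalence they are the same text.
print(canonically_equivalent(composed, decomposed))
```

   A filesystem that compares names without any such step would allow
   both spellings to coexist in one directory as distinct entries.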

   To deal with the equivalence problem, Unicode defines Normal Forms
   (NFs), of which there are two basic ones: Normal Form Composed (NFC),
   and Normal Form Decomposed (NFD). There are also NFs that apply
   "compatibility" mappings: NFKC and NFKD. Unicode-aware applications
   can normalize text to avoid ambiguities, or they can use
   form-insensitive string comparisons, or both.

   Some filesystems support case-insensitivity, which is trivial to
   define and implement for US-ASCII, but non-trivial for Unicode,
   requiring not only larger case-folding tables, but also localized
   case-folding tables as case-folding rules might differ from locale to
   locale.

1.2.1. Canonical Equivalence (Normalization)

   For case-sensitive filesystems, only Unicode equivalence issues arise
   as to file names and file paths. These can be addressed in one of
   two ways:

   o normalize file names when created and when looked up,

   o perform form-insensitive string comparisons on lookup.

   The first option yields normalized file names on-disk and on the wire
   (e.g., when listing directories). We shall term this
   "normalize-on-CREATE", or sometimes "normalize-on-CREATE-and-LOOKUP",
   or even just "NoCL".

   The second option preserves form as originally produced by the user
   or on their behalf by their system's text input modes, but otherwise
   is form-insensitive. That is, this option permits either encoding
   of, e.g., LATIN SMALL LETTER A WITH ACUTE on-disk and on the wire,
   but permits only one form of any given name to exist in a directory
   at a time, whether normalized or not. We shall term this option
   "form-insensitive", or sometimes "form-insensitive and
   form-preserving", or just "FIP".

   Unicode compatibility equivalence allows equivalence between
   different representations of the same abstract character that may
   nonetheless have different visual appearance or behavior.
   There are two normal forms that support compatibility equivalence:
   NFKC and NFKD. Using NoCL with NFKC or NFKD may be surprising to
   users in a visual way, while form-insensitivity with NFKC or NFKD may
   surprise users who might consider two file names distinct even when
   Unicode considers them equivalent under compatibility equivalence.
   The latter seems less likely and less surprising, though that is an
   entirely subjective judgement.

   We do not recommend either of NoCL or FIP over the other.

1.2.2. Case Foldings for Case-Insensitivity

   Case-insensitivity implies folding characters of one case to another
   for comparison purposes, typically to lower-case. These case
   foldings are defined by Unicode. Generally, case-insensitive
   filesystems preserve original case just as form-insensitive
   filesystems preserve original form.

   It is possible that some case foldings may have to vary by locale. A
   commonly cited example of a character whose case folding varies by
   locale is LATIN SMALL LETTER DOTLESS I (U+0131).

   In some cases it may be possible to construct case-folding tailorings
   that are locale-neutral. For example, all of the following could be
   considered equivalent:

   o LATIN CAPITAL LETTER I (U+0049)

   o LATIN SMALL LETTER I (U+0069)

   o LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130)

   o LATIN SMALL LETTER DOTLESS I (U+0131)

   which might satisfy a mix of users including those familiar with
   Turkish and those not, using the same filesystem.

1.2.3. Caching Clients

   Remote filesystem protocols often involve caching on clients. Such
   caching may require knowledge of filesystem I18N settings so that
   local operations performed against cached directory listings work
   the same way as on the server. We do not specify any case foldings
   here.
   Instead we will either create a registry of case folding tailorings,
   or use the Common Locale Data Repository (CLDR), then require that
   filesystems and servers be able to identify what case foldings are in
   effect for case-insensitive filesystems.

1.3. Running Code Architecture Notes

   Surprisingly, almost all if not all general purpose operating systems
   in common use today have a "virtual filesystem switch" (VFS)
   [McKusick86] [wikipedia] [1] interface that permits the use of
   multiple different filesystem types on one system, all accessed
   through the same filesystem application programming interfaces
   (APIs). The VFS is essentially a pluggable layer that includes
   functionality for routing calls from user processes to the
   appropriate filesystems. The VFS has even been generalized and
   extended to support isolation; thus we have the Filesystem in
   Userspace (FUSE), which is akin to a remote filesystem protocol, but
   for use over local inter-process communications (IPC) facilities.

   The VFS architecture was developed in the 1980s, before Unicode
   adoption. It is not surprising then that in general, if not simply
   always today, the code path from the interface between a user
   application and the operating system all the way to the filesystem
   implements no I18N functionality whatsoever, and does the absolute
   minimum of character data interpretation:

   o use of US-ASCII NUL (for "C string" termination),

   o use of US-ASCII '/' and/or '\' (for file path component
     delimiting).

   For example, the 4.4BSD operating system and derivatives have a VFS
   [BSD4.4], as do Solaris and derivatives [SolarisInternals], Windows,
   OS X, and Linux. A VFS of a sort, including FUSE, may well be the
   only reasonable way to support more than one kind of filesystem while
   retaining compatibility with previously-existing filesystem APIs.
   This explains why so many modern operating systems have a VFS.

   Thus in most if not all general purpose operating systems today, the
   code path from the boundary between the application and the operating
   system to the boundary between the VFS and the filesystem is
   "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE]), with no
   attempt at normalization or case folding done anywhere in between.

   There are filesystem servers that access raw storage directly and
   implement the filesystem and the remote filesystem protocol server in
   one monolithic stack without a VFS in the way, but it is very common
   to have remote filesystem protocol servers implemented on top of the
   VFS or on top of the system calls. Even monolithic servers tend to
   support a notion of multiple filesystems in a server or volume, and
   may have different I18N settings for each filesystem. Thus it's
   common to leave I18N handling to code layers close to the filesystem
   even in monolithic server implementations.

   In practice all of the foregoing has led to I18N functionality
   residing strictly in the filesystem. Two filesystems have defined
   the best current practices in this regard:

   o HFS+, which does normalize-on-CREATE (and LOOKUP), normalizing to
     a form that is very close to NFD and is case-sensitive;

   o ZFS, which implements form-insensitive, form-preserving behavior
     and optionally implements case-insensitive, case-preserving
     behavior on a per-filesystem basis.

   Altogether, these circumstances make it very difficult to reliably
   and always locate I18N functionality above the VFS, or to not use a
   VFS at all: there are too many places to alter, and all must agree
   exactly on I18N choices.
   Moreover, implementing case-insensitive but
   case-preserving behavior above the VFS requires fully reading each
   directory, and so does implementing form-insensitive and
   form-preserving behavior at the VFS layer itself. The only behaviors
   that can be reliably implemented at or above the VFS are normalize-
   and case-fold-on-CREATE (and LOOKUP).

   Consider the set of already-running code that must all be modified in
   order to reliably implement I18N above the filesystem on general
   purpose operating systems:

   o filesystem protocol servers, including but not limited to:

     * Network File System (NFSv4) [RFC7530];

     * Hypertext Transfer Protocol (HTTP) servers serving resources
       hosted on filesystems [RFC7230];

     * SSH File Transfer Protocol (SFTP) [I-D.ietf-secsh-filexfer];

     * various remote filesystem protocols that are not Internet
       Protocols (i.e., not standards-track Internet RFCs);

   o POSIX system call layers or user process system call stub
     libraries;

   o WIN32 system call layers or user process system call stub
     libraries.

   Regarding system calls and system call stubs in user process system
   libraries, the continued use of statically-linked executables means
   that these cannot reliably be modified. Indeed, on some systems the
   Application Binary Interface (ABI) between user-space applications
   and the operating system kernel is well-defined and long-term stable.
   The system call handlers cannot reliably inspect the calling process
   to determine any attributes of its locale. Adding new system calls
   is possible, but existing running code wouldn't use them. For
   similar reasons, the VFS layer is generally (always) completely
   unaware of any attributes of the locale of applications calling it,
   whether via system calls or any other path.

   Unix-like operating systems are generally (always) "just-use-8",
   assuming only that file names and paths are C strings (i.e.,
   terminated by zero-valued bytes) and sufficiently compatible with
   US-ASCII that the file path component separator character, US-ASCII
   '/', is meaningful. As a result, it is possible to find I18N-unaware
   filesystems with one or more non-Unicode, non-ASCII codesets in use
   for file names! We leave non-ASCII and non-Unicode file names out of
   scope here.

   For these reasons it is simply not practical to implement I18N at any
   layer above the VFS.

   Even in the VFS, form- and case-insensitive and -preserving behaviors
   would be difficult to implement as performantly as in the filesystem.
   The VFS would have to list a directory completely before being able
   to apply those behaviors. It is reasonable to expect caching clients
   of remote filesystems to cache directory listings (especially for
   offline operation), but it isn't reasonable to expect the same of the
   VFS. Compare this to the filesystem itself, which can maintain a
   fast index (e.g., hash table or B-tree) where the keys are normalized
   and possibly case-folded file names, and thus may not need to read
   directories in order to perform fast lookups that are form- and even
   case-insensitive.

   The only way to implement I18N behaviors in the VFS layer rather than
   at the filesystem is to abandon form- and case-preserving behaviors.
   For case-insensitivity this would require using sentence-case, or all
   lower-case, perhaps, and all such choices would surely be surprising
   to users. At any rate, that approach would also render much running
   code "non-compliant" with any Internet filesystem protocol I18N
   specification.

   Therefore, generally speaking, only the filesystem can reliably,
   interoperably, and performantly implement I18N behaviors in general
   purpose operating systems.
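
   The fast index mentioned above can be sketched as follows. This is
   a hypothetical in-memory model in Python, not the on-disk design of
   any particular filesystem: the class DirectoryIndex and its methods
   are our own names, NFD is an arbitrary choice of key form, and the
   case folding is Python's default (untailored) str.casefold().

```python
import unicodedata
from typing import Dict, List, Optional


class DirectoryIndex:
    """Hypothetical form-insensitive, form-preserving directory index."""

    def __init__(self, case_insensitive: bool = False) -> None:
        self.case_insensitive = case_insensitive
        # Maps a normalized (and possibly case-folded) key to the name
        # exactly as it was supplied at creation time.
        self._entries: Dict[str, str] = {}

    def _key(self, name: str) -> str:
        key = unicodedata.normalize("NFD", name)
        if self.case_insensitive:
            # Folding can produce un-normalized text, so normalize again.
            key = unicodedata.normalize("NFD", key.casefold())
        return key

    def create(self, name: str) -> None:
        key = self._key(name)
        if key in self._entries:
            # An equivalent name already exists; report the stored form.
            raise FileExistsError(self._entries[key])
        self._entries[key] = name  # original form and case are preserved

    def lookup(self, name: str) -> Optional[str]:
        # One hashed probe; no need to scan the whole directory.
        return self._entries.get(self._key(name))

    def listing(self) -> List[str]:
        return list(self._entries.values())


index = DirectoryIndex(case_insensitive=True)
index.create("R\u00e9sum\u00e9.txt")               # composed, mixed case
print(index.lookup("re\u0301sume\u0301.TXT"))      # decomposed, other case
```

   Because the key is computed once per name, a lookup costs a single
   index probe, whereas a layer without such an index would have to
   read and compare every entry in the directory.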

   Note that variations in I18N behaviors can happen even on the same
   server with multiple filesystems of the same type. This can happen
   because of

   o different Unicode versions being used at the times of creation of
     various filesystems, and

   o different locale settings on various filesystems.

   Locale variations are only relevant to case-folding for
   case-insensitivity. Running code mostly uses default case-folding
   rules, but there is no reason to assume that locale-specific
   case-folding rules won't be supported by running code in the future.

   It may not be possible or easy for a filesystem to adopt new Unicode
   versions, or adopt backwards-incompatible case foldings, after
   content has been created in it that would be ambiguous under new
   rules. This implies that where a client for a remote filesystem must
   know what I18N functionality to implement for use with cached
   directory listings, the client must know specifically what profile of
   I18N functionality each cached filesystem implements.

2. Filesystem I18N Guidelines

   We begin by recognizing and accepting that much running code
   implements I18N functionality at the filesystem. Given this, we
   catalogue the range of acceptable behaviors. Filesystems adhering to
   this specification MUST implement only acceptable I18N behaviors as
   specified here. Acceptable variations may be registered in a
   to-be-determined (IANA?) registry of filesystem I18N behaviors.

2.1. Filesystem I18N Guidelines: Non-Unicode File names

   o Filesystems SHOULD reject attempts to create new non-Unicode file
     names.

   o Filesystems either MUST normalize on CREATE (and LOOKUP), or MUST
     be form-insensitive and form-preserving.

   o Filesystems MUST specify a Unicode version for their equivalence
     behaviors.

2.2. Filesystem I18N Guidelines: Case-Insensitivity

   o Filesystems MAY support case-insensitivity, in which case they
     SHOULD be case-preserving. Filesystems that are case-insensitive
     but not case-preserving MUST specify a case form, such as
     title case or sentence case.

   o Case foldings for case-insensitive filesystems MUST be identified.
     The default case foldings SHOULD be the Unicode default case
     algorithms for the identified Unicode version without additional
     tailorings. Filesystems that use case algorithms tailored to
     specific locales SHOULD use case foldings registered in a
     to-be-determined (IANA?) registry.

   o Case-insensitive filesystems MUST specify a Unicode version for
     their case-insensitive behavior.

2.3. I18N Versioning

   Each filesystem MUST identify a Unicode version for its I18N
   behaviors. Filesystem implementations SHOULD adopt new Unicode
   versions as they are produced, though it is understood that it may be
   difficult to migrate non-empty filesystems to new Unicode versions.

3. Filesystem Protocol I18N Guidelines

   Remote filesystem protocols that allow clients to perform lookups
   against cached directory listings MUST allow clients to discover all
   relevant I18N behaviors of the filesystem whence any given directory
   listing came:

   o whether the filesystem normalizes on CREATE (and LOOKUP), and if
     so, to what NF in what Unicode version;

   o whether the filesystem is form-insensitive and form-preserving,
     and if so, in what Unicode version;

   o whether the filesystem is case-insensitive and case-preserving,
     and if so, with what foldings (default or tailored, and if
     tailored, an identifier for the set of foldings), and in what
     Unicode version.

   Foldings are identified via a folding set name as registered in a
   to-be-determined (IANA?) registry.
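
   As a rough sketch of how a caching client might use such discovered
   settings, the following Python models a per-directory profile and a
   lookup against a cached listing. Everything here is hypothetical:
   the I18NProfile fields, the folding set name "default", and the
   functions can_use_cache, lookup_key, and cached_lookup are
   illustrative names, not protocol elements defined by this document.
   The sketch conservatively refuses to use the cache when its own
   Unicode data version differs from the one the server reported.

```python
import unicodedata
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class I18NProfile:
    """Hypothetical I18N settings reported for one directory."""
    unicode_version: str                # as reported by the server
    normalizes_on_create: bool = False  # NoCL
    normal_form: Optional[str] = None   # NF used when NoCL is in effect
    form_insensitive: bool = False      # FIP
    case_insensitive: bool = False
    folding_set: str = "default"        # placeholder for a registered name


def can_use_cache(profile: I18NProfile) -> bool:
    """True if this client can reproduce the profile's behavior locally."""
    if not (profile.normalizes_on_create or profile.form_insensitive
            or profile.case_insensitive):
        return True  # plain code-point comparison needs no Unicode data
    if profile.case_insensitive and profile.folding_set != "default":
        return False  # tailored foldings are not implemented in this sketch
    return profile.unicode_version == unicodedata.unidata_version


def lookup_key(profile: I18NProfile, name: str) -> str:
    """Compute the comparison key the profile implies for a name."""
    if profile.normalizes_on_create:
        form = profile.normal_form  # None here means no normalization
    elif profile.form_insensitive:
        form = "NFD"  # any single NF works for comparison purposes
    else:
        form = None
    key = unicodedata.normalize(form, name) if form else name
    if profile.case_insensitive:
        key = key.casefold()
        if form:
            key = unicodedata.normalize(form, key)
    return key


def cached_lookup(profile: I18NProfile, listing: Iterable[str],
                  name: str) -> Optional[str]:
    """Return the cached entry equivalent to name, or None."""
    if not can_use_cache(profile):
        raise ValueError("cannot reproduce the server's I18N behavior")
    wanted = lookup_key(profile, name)
    for entry in listing:
        if lookup_key(profile, entry) == wanted:
            return entry
    return None
```

   A client for which can_use_cache() is false would send the lookup
   to the server instead of answering it from the cached listing.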

   Because some filesystems might allow for different I18N settings on a
   per-directory basis, remote filesystem protocols MUST allow those
   settings to be discoverable on a per-directory basis.

   Internet filesystem servers MUST reject attempts to create new
   non-Unicode file names. (Note that this requirement is weaker
   ("SHOULD") for the actual filesystems, since those might have to
   allow non-Unicode content for legacy reasons via interfaces other
   than Internet filesystem protocols.)

3.1. I18N and Caching in Filesystem Protocol Clients

   Caching clients of remote filesystems either MUST NOT perform lookups
   against cached directory listings, or MUST query the directories'
   filesystems' I18N profiles and apply the same equivalence policies
   and the same case foldings for case-insensitivity.

4. Internationalization Considerations

   This document deals in internationalization throughout.

5. IANA Considerations

   [ALTERNATIVELY use locale names and CLDR? Need to determine the
   stability of CLDR locales... Basically, we need stable locale names,
   and stable case-folding mappings.]

   We hereby request the creation of a new IANA registry with Expert
   Review registration rules with the following fields:

   o name, an identifier-like name

   o Unicode version number

   o listing of case folding tailorings and/or references to external
     case folding tailoring specifications

   The case foldings registered here will be used by case-insensitive
   filesystems and filesystem protocols to identify tailored case
   foldings so that caching clients can implement the same
   case-insensitive behavior using cached directory listings.

6. Security Considerations

   Security considerations of Unicode and filesystem protocols apply.
   No new security considerations are added or need be noted here.

   The methods of handling equivalent Unicode strings cause aliasing.
   This is not expected to be a security problem.

   Case-insensitivity causes aliasing. This is not expected to be a
   security problem.

   No effort is made here to handle confusables. This is not expected
   to be a serious security problem in the context of file servers.

7. References

7.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
              12.1.0", May 2019.

7.2. Informative References

   [BSD4.4]   McKusick, M., Bostic, K., Karels, M., and J. Quarterman,
              "The Design and Implementation of the 4.4BSD Operating
              System", DOI 10.5555/231070, 1996.

   [I-D.ietf-secsh-filexfer]
              Galbraith, J. and O. Saarenmaa, "SSH File Transfer
              Protocol", draft-ietf-secsh-filexfer-13 (work in
              progress), July 2006.

   [McKusick86]
              McKusick, M. and M. Karels, "Towards a Compatible File
              System Interface", June 1986.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, DOI 10.17487/RFC3986, January 2005.

   [RFC3987]  Duerst, M. and M. Suignard, "Internationalized Resource
              Identifiers (IRIs)", RFC 3987, DOI 10.17487/RFC3987,
              January 2005.

   [RFC7230]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
              Protocol (HTTP/1.1): Message Syntax and Routing",
              RFC 7230, DOI 10.17487/RFC7230, June 2014.

   [RFC7530]  Haynes, T., Ed. and D. Noveck, Ed., "Network File System
              (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
              March 2015.

   [SolarisInternals]
              McDougall, R. and J. Mauro, "Solaris Internals -- Solaris
              10 and OpenSolaris Kernel Architecture", 2007.

7.3. URIs

   [1] https://en.wikipedia.org/wiki/Virtual_file_system

Author's Address

   Nico Williams (editor)
   Cryptonector, LLC
   Austin, TX
   USA

   Email: nico@cryptonector.com