idnits 2.17.1
draft-williams-filesystem-18n-00.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== There is 1 instance of lines with non-ascii characters in the document.
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the IETF Trust and authors Copyright Line does not
match the current year
-- The document date (July 6, 2020) is 1388 days in the past. Is this
intentional?
Checking references for intended status: Best Current Practice
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
-- Looks like a reference, but probably isn't: '1' on line 561
== Unused Reference: 'RFC3629' is defined on line 514, but no explicit
reference was found in the text
-- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'
-- Obsolete informational reference (is this intentional?): RFC 7230
(Obsoleted by RFC 9110, RFC 9112)
Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Internet Engineering Task Force N. Williams, Ed.
3 Internet-Draft Cryptonector, LLC
4 Intended status: Best Current Practice July 6, 2020
5 Expires: January 7, 2021
7 Internationalization Considerations for Filesystems and Filesystem
8 Protocols
9 draft-williams-filesystem-18n-00
11 Abstract
13 This document describes requirements for internationalization (I18N)
14 of filesystems specifically in the context of Internet protocols, the
15 architecture for filesystems in most currently popular general
16 purpose operating systems, and their implications for filesystem
17 I18N. From the I18N requirements for filesystems and the
18 architecture of running code we derive requirements and
19 recommendations for implementors of operating systems and/or
20 filesystems, as well as for Internet remote filesystem protocols.
22 Status of This Memo
24 This Internet-Draft is submitted in full conformance with the
25 provisions of BCP 78 and BCP 79.
27 Internet-Drafts are working documents of the Internet Engineering
28 Task Force (IETF). Note that other groups may also distribute
29 working documents as Internet-Drafts. The list of current Internet-
30 Drafts is at https://datatracker.ietf.org/drafts/current/.
32 Internet-Drafts are draft documents valid for a maximum of six months
33 and may be updated, replaced, or obsoleted by other documents at any
34 time. It is inappropriate to use Internet-Drafts as reference
35 material or to cite them other than as "work in progress."
37 This Internet-Draft will expire on January 7, 2021.
39 Copyright Notice
41 Copyright (c) 2020 IETF Trust and the persons identified as the
42 document authors. All rights reserved.
44 This document is subject to BCP 78 and the IETF Trust's Legal
45 Provisions Relating to IETF Documents
46 (https://trustee.ietf.org/license-info) in effect on the date of
47 publication of this document. Please review these documents
48 carefully, as they describe your rights and restrictions with respect
49 to this document. Code Components extracted from this document must
50 include Simplified BSD License text as described in Section 4.e of
51 the Trust Legal Provisions and are provided without warranty as
52 described in the Simplified BSD License.
54 Table of Contents
56 1. Introduction
57 1.1. Requirements Language
58 1.2. Filesystem Internationalization
59 1.2.1. Canonical Equivalence (Normalization)
60 1.2.2. Case Foldings for Case-Insensitivity
61 1.2.3. Caching Clients
62 1.3. Running Code Architecture Notes
63 2. Filesystem I18N Guidelines
64 2.1. Filesystem I18N Guidelines: Non-Unicode File names
65 2.2. Filesystem I18N Guidelines: Case-Insensitivity
66 2.3. I18N Versioning
67 3. Filesystem Protocol I18N Guidelines
68 3.1. I18N and Caching in Filesystem Protocol Clients
69 4. Internationalization Considerations
70 5. IANA Considerations
71 6. Security Considerations
72 7. References
73 7.1. Normative References
74 7.2. Informative References
75 7.3. URIs
76 Author's Address
78 1. Introduction
80 [TBD: Add references galore. How to reference Unicode? How to
81 reference US-ASCII? How best to reference HFS+? How best to
82 reference ZFS? May have to find useful references for POSIX and
83 WIN32. Various blog entries may be of interest -- can they be
84 referenced?]
86 We, the Internet community, have long concluded that we must
87 internationalize all our protocols. This is generally not an easy
88 task, as often we are constrained by the realities of what can be
89 achieved while maintaining backwards compatibility.
91 In this document we focus on filesystem internationalization (I18N),
92 specifically only for file names and file paths. Here we address the
93 two main issues that arise in filesystem I18N:
95 o Unicode equivalence
96 o Case foldings for case-insensitivity
98 These two issues are different flavors of the same generic issue:
99 that there can be more than one way to write text with the same
100 rendering and/or semantics.
102 Only I18N issues relating to file names and paths are addressed here.
103 In particular, I18N issues related to representations of user
104 identities and groups, for use in access control lists (ACLs) or
105 other authorization systems, are out of scope for this document.
106 Also out of scope here are I18N issues related to Uniform Resource
107 Identifiers (URIs) [RFC3986] or Internationalized Resource
108 Identifiers (IRIs) [RFC3987].
110 1.1. Requirements Language
112 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
113 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
114 document are to be interpreted as described in RFC 2119 [RFC2119].
116 1.2. Filesystem Internationalization
118 We must address two issues:
120 o Unicode equivalence
122 o Case foldings for case-insensitivity
124 Unicode can represent certain character strings in multiple visually-
125 and semantically-equivalent ways. For example, there are two ways to
126 express LATIN SMALL LETTER A WITH ACUTE (á):
128 o U+00E1
130 o U+0061 U+0301
132 For some glyphs there is a single way to write them. For others
133 there are two. And for yet others there can be many more than two.
135 To deal with the equivalence problem, Unicode defines Normal Forms
136 (NFs), of which there are two basic ones: Normal Form Composed (NFC),
137 and Normal Form Decomposed (NFD). There are also NFs that use
138 "compatibility" Foldings, NFKC and NFKD. Unicode-aware applications
139 can normalize text to avoid ambiguities, or they can use form-
140 insensitive string comparisons, or both.
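As an illustration only, the two encodings listed above can be compared with Python's standard unicodedata module. The helper name equivalent() is ours and is not part of any specification; the sketch assumes canonical (not compatibility) equivalence.

```python
import unicodedata

composed = "\u00e1"      # LATIN SMALL LETTER A WITH ACUTE, precomposed
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# The two strings render alike but differ code point by code point.
raw_equal = composed == decomposed

# Normalizing both operands to the same form removes the ambiguity.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))


def equivalent(a, b):
    """Form-insensitive comparison under canonical equivalence."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

An application may either store the normalized string or keep the original and call a comparison such as equivalent() whenever two names are compared.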
142 Some filesystems support case-insensitivity, which is trivial to
143 define and implement for US-ASCII, but non-trivial for Unicode,
144 requiring not only larger case-folding tables, but also localized
145 case-folding tables as case-folding rules might differ from locale to
146 locale.
148 1.2.1. Canonical Equivalence (Normalization)
150 For case-sensitive filesystems, only Unicode equivalence issues arise
151 as to file names and file paths. These can be addressed in one of
152 two ways:
154 o normalize file names when created and when looked up,
156 o perform form-insensitive string comparisons on lookup.
158 The first option yields normalized file names on-disk and on the wire
159 (e.g., when listing directories). We shall term this "normalize-on-
160 CREATE", or sometimes "normalize-on-CREATE-and-LOOKUP", or even just
161 "NoCL".
163 The second option preserves form as originally produced by the user
164 or on their behalf by their system's text input modes, but otherwise
165 is form-insensitive. That is, this option permits either encoding
166 of, e.g., LATIN SMALL LETTER A WITH ACUTE on-disk and on the wire,
167 but permits only one form of any given name in a directory, whether
168 normalized or not. We shall term this option "form-insensitive", or
169 sometimes "form-insensitive and form-preserving", or just "FIP".
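The difference between the two options can be sketched with two toy in-memory directories. This is a minimal sketch, not a description of any real filesystem: the class names, the method names, and the choice of NFD as the internal comparison key are all assumptions made for illustration.

```python
import unicodedata


class NoCLDirectory:
    """Normalize-on-CREATE-and-LOOKUP: names are stored normalized."""

    def __init__(self, form="NFC"):
        self.form = form
        self.entries = {}            # normalized name -> file data

    def create(self, name, data):
        key = unicodedata.normalize(self.form, name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = data

    def lookup(self, name):
        return self.entries[unicodedata.normalize(self.form, name)]

    def listing(self):
        # Listings expose only the normalized names.
        return list(self.entries)


class FIPDirectory:
    """Form-insensitive and form-preserving: names are stored as given."""

    def __init__(self):
        self.entries = {}            # comparison key -> (original name, data)

    @staticmethod
    def _key(name):
        # The key is used for comparison only; it is never shown to users.
        return unicodedata.normalize("NFD", name)

    def create(self, name, data):
        key = self._key(name)
        if key in self.entries:
            raise FileExistsError(name)
        self.entries[key] = (name, data)

    def lookup(self, name):
        return self.entries[self._key(name)][1]

    def listing(self):
        # Listings expose the names exactly as they were created.
        return [original for original, _ in self.entries.values()]
```

Creating "a" followed by U+0301 in the NoCL directory yields a listing that contains U+00E1, while the FIP directory lists the name as created; both resolve a lookup of either encoding, and both refuse to create the second encoding as a separate file.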
171 Unicode compatibility equivalence allows equivalence between
172 different representations of the same abstract character that may
173 nonetheless have different visual appearance or behavior. There are
174 two normal forms that support compatibility equivalence: NFKC and
175 NFKD. Using NoCL with NFKC or NFKD may be surprising to users in a
176 visual way, while form-insensitivity with NFKC or NFKD may surprise
177 users who might consider two file names distinct even when Unicode
178 considers them equivalent under compatibility equivalence. The
179 latter seems less likely and less surprising, though that is an
180 entirely subjective judgement.
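To make the compatibility case concrete, the following sketch compares a name that begins with LATIN SMALL LIGATURE FI (U+FB01) against the plain spelling. The example names are ours; the only library call used is Python's unicodedata.normalize().

```python
import unicodedata

ligature = "\ufb01le"    # begins with LATIN SMALL LIGATURE FI (U+FB01)
plain = "file"

# Canonical normalization keeps the ligature, so the names stay distinct.
canonical_same = (unicodedata.normalize("NFC", ligature)
                  == unicodedata.normalize("NFC", plain))

# Compatibility normalization replaces the ligature with "f" and "i".
compat_same = (unicodedata.normalize("NFKC", ligature)
               == unicodedata.normalize("NFKC", plain))

# Under NoCL with NFKC, this is the name that would be stored and listed.
stored_under_nfkc = unicodedata.normalize("NFKC", ligature)
```

With NoCL and NFKC the user who typed the ligature sees a visibly different name in listings; with form-insensitivity under NFKC the name is kept as typed but collides with "file".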
182 We do not recommend either of NoCL or FIP over the other.
184 1.2.2. Case Foldings for Case-Insensitivity
186 Case-insensitivity implies folding characters of one case to another
187 for comparison purposes, typically to lower-case. These case
188 foldings are defined by Unicode. Generally, case-insensitive
189 filesystems preserve original case just as form-insensitive filesystems
190 preserve original form.
192 It is possible that some case foldings may have to vary by locale. A
193 commonly used example of a character whose case folding varies by
194 locale is LATIN SMALL LETTER DOTLESS I (U+0131).
196 In some cases it may be possible to construct case-folding tailorings
197 that are locale-neutral. For example, all of the following could be
198 considered equivalent:
200 o LATIN CAPITAL LETTER I (U+0049)
202 o LATIN SMALL LETTER I (U+0069)
204 o LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130)
206 o LATIN SMALL LETTER DOTLESS I (U+0131)
208 which might satisfy a mix of users including those familiar with
209 Turkish and those not, using the same filesystem.
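A locale-neutral tailoring of this kind might be sketched as follows. The TAILORING table and the function names are hypothetical and chosen only for illustration; Python's str.casefold() stands in for the Unicode default case folding.

```python
# The four characters listed above.
I_VARIANTS = ["\u0049", "\u0069", "\u0130", "\u0131"]

# Hypothetical tailoring: map the two Turkish-specific letters to plain
# "i" before applying the default folding.
TAILORING = {0x0130: "i", 0x0131: "i"}


def default_key(name):
    """Comparison key using the default case folding only."""
    return name.casefold()


def tailored_key(name):
    """Comparison key using the hypothetical locale-neutral tailoring."""
    return name.translate(TAILORING).casefold()


default_keys = {default_key(c) for c in I_VARIANTS}
tailored_keys = {tailored_key(c) for c in I_VARIANTS}
```

Under the default folding the four characters fall into more than one equivalence class, whereas the tailored key places all four in a single class.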
211 1.2.3. Caching Clients
213 Remote filesystem protocols often involve caching on clients, which
214 caching may require knowledge of filesystem I18N settings in order to
215 permit local operations to be performed using cached directory
216 listings that work the same way as on the server. We do not specify
217 any case foldings here. Instead we will either create a registry of
218 case folding tailorings, or use the Common Locale Data Repository
219 (CLDR), then require that filesystems and servers be able to identify
220 what case foldings are in effect for case-insensitive filesystems.
222 1.3. Running Code Architecture Notes
224 Surprisingly, almost all if not all general purpose operating systems
225 in common use today have a "virtual filesystem switch" (VFS)
226 [McKusick86] [wikipedia] [1] interface that permits the use of
227 multiple different filesystem types on one system, all accessed
228 through the same filesystem application programming interfaces
229 (APIs). The VFS is essentially a pluggable layer that includes
230 functionality for routing calls from user processes to the
231 appropriate filesystems. The VFS has even been generalized and
232 extended to support isolation, thus we have the Filesystem in
233 Userspace (FUSE), which is akin to a remote filesystem protocol, but
234 for use over local inter-process communications (IPC) facilities.
236 The VFS architecture was developed in the 1980s, before Unicode
237 adoption. It is not surprising then that in general -if not simply
238 always today- the code path from the interface between a user
239 application and the operating system all the way to the filesystem
240 implements no I18N functionality whatsoever, and does the absolute
241 minimum of character data interpretation:
243 o use of US-ASCII NUL (for "C string" termination),
245 o use of US-ASCII '/' and/or '\' (for file path component
246 delimiting).
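The following sketch shows how little interpretation such a code path performs. It treats paths as byte strings, as a "just-use-8" system would; the function name split_path() is ours and does not correspond to any real VFS interface.

```python
def split_path(path):
    """Split a byte-string path into components, I18N-unaware."""
    # NUL terminates a C string, so it can never appear inside a path.
    if b"\x00" in path:
        raise ValueError("embedded NUL byte")
    # '/' delimits components; every other byte is opaque.
    return [component for component in path.split(b"/") if component]


# Two UTF-8 encodings of the same visible name: precomposed U+00E1,
# and "a" followed by U+0301.
precomposed = split_path(b"/home/\xc3\xa1/notes")
decomposed = split_path(b"/home/a\xcc\x81/notes")
```

Because the bytes are never interpreted as text, the two paths above name different directory entries as far as this layer is concerned.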
248 For example, the 4.4BSD operating system and derivatives have a VFS
249 [BSD4.4], as do Solaris and derivatives [SolarisInternals], Windows, OS
251 X, and Linux. A VFS of a sort, including FUSE, may well be the only
252 reasonable way to support more than one kind of filesystem while
253 retaining compatibility with previously-existing filesystem APIs.
254 This explains why so many modern operating systems have a VFS.
256 Thus in most if not all general purpose operating systems today, the
257 code path from the boundary between the application and the operating
258 system to the boundary between the VFS and the filesystem is
259 "just-use-8" or "just-use-16" (as in UTF-16 [UNICODE]), with no
260 attempt at normalization or case folding done anywhere in between.
262 There are filesystem servers that access raw storage directly and
263 implement the filesystem and the remote filesystem protocol server in
264 one monolithic stack without a VFS in the way, but it is very common
265 to have remote filesystem protocol servers implemented on top of the
266 VFS or on top of the system calls. Even monolithic servers tend to
267 support a notion of multiple filesystems in a server or volume, and
268 may have different I18N settings for each filesystem. Thus it's
269 common to leave I18N handling to code layers close to the filesystem
270 even in monolithic server implementations.
272 In practice all of the foregoing has led to I18N functionality residing
273 strictly in the filesystem. Two filesystems have defined the best
274 current practices in this regard:
276 o HFS+, which does normalize-on-CREATE (and LOOKUP), normalizing to
277 a form that is very close to NFD and is case-sensitive;
279 o ZFS, which implements form-insensitive, form-preserving behavior
280 and optionally implements case-insensitive, case-preserving
281 behavior on a per-filesystem basis.
283 Altogether, these circumstances make it very difficult to reliably
284 and always locate I18N functionality above the VFS, or to not use a
285 VFS at all: there are too many places to alter, and all must agree
286 exactly on I18N choices. Moreover, implementing case-insensitive but
287 case-preserving behavior above the VFS requires fully reading each
288 directory, and so does implementing form-insensitive and form-
289 preserving behavior at the VFS layer itself. The only behaviors that
290 can be reliably implemented at or above the VFS are normalize- and
291 case-fold-on-CREATE (and LOOKUP).
293 Consider the set of already-running code that must all be modified in
294 order to reliably implement I18N above the filesystem on general
295 purpose operating systems:
297 o filesystem protocol servers, including but not limited to:
299 * Network File System (NFSv4) [RFC7530];
301 * Hypertext Transfer Protocol (HTTP) servers serving resources
302 hosted on filesystems [RFC7230];
304 * SSH File Transfer Protocol (SFTP) [I-D.ietf-secsh-filexfer];
306 * various remote filesystem protocols that are not Internet
307 Protocols (i.e., not standards-track Internet RFCs);
309 o POSIX system call layers or user process system call stub
310 libraries;
312 o WIN32 system call layers or user process system call stub
313 libraries.
315 Regarding system calls and system call stubs in user process system
316 libraries, the continued use of statically-linked executables means
317 that these cannot reliably be modified. Indeed, on some systems the
318 Application Binary Interface (ABI) between user-space applications
319 and the operating system kernel is well-defined and long-term stable.
320 The system call handlers cannot reliably inspect the calling process
321 to determine any attributes of its locale. Adding new system calls
322 is possible, but existing running code wouldn't use them. For
323 similar reasons, the VFS layer is generally (always) completely
324 unaware of any attributes of the locale of applications calling it,
325 whether via system calls or any other path.
327 Unix-like operating systems are generally (always) "just-use-8",
328 assuming only that file names and paths are C strings (i.e.,
329 terminated by zero-valued bytes) and sufficiently compatible with US-
330 ASCII that the file path component separator character, US-ASCII '/',
331 is meaningful. As a result, it is possible to find I18N-unaware
332 filesystems with one or more non-Unicode, non-ASCII codesets in use
333 for file names! We leave non-ASCII and non-Unicode file names out of
334 scope here.
336 For these reasons it is simply not practical to implement I18N at any
337 layer above the VFS.
339 Even in the VFS, form- and case-insensitive and -preserving behaviors
340 would be difficult to implement as performantly as in the filesystem.
341 The VFS would have to list a directory completely before being able
342 to apply those behaviors. It is reasonable to expect caching clients
343 of remote filesystems to cache directory listings (especially for
344 offline operation), but it isn't reasonable to expect the same of the
345 VFS. Compare to the filesystem itself, which can maintain a fast
346 index (e.g., hash table or b-tree) where the keys are normalized and
347 possibly case-folded file names and thus may not need to read
348 directories in order to perform fast lookups that are form- and even
349 case-insensitive.
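The contrast can be sketched as follows, with a Python dict standing in for the on-disk hash table or b-tree. All names here are illustrative assumptions; the key derivation (normalize, case-fold, normalize again) is one plausible choice and not a requirement of this document.

```python
import unicodedata


def index_key(name, case_insensitive=False):
    """Derive the comparison key for a file name."""
    key = unicodedata.normalize("NFD", name)
    if case_insensitive:
        # Fold case, then renormalize, since folding can denormalize.
        key = unicodedata.normalize("NFD", key.casefold())
    return key


class IndexedDirectory:
    """Filesystem side: an index keyed by normalized (and folded) names."""

    def __init__(self, case_insensitive=False):
        self.case_insensitive = case_insensitive
        self.index = {}              # comparison key -> original name

    def create(self, name):
        key = index_key(name, self.case_insensitive)
        if key in self.index:
            raise FileExistsError(name)
        self.index[key] = name       # original form and case are preserved

    def lookup(self, name):
        # A single index probe; the directory is not read.
        return self.index.get(index_key(name, self.case_insensitive))


def scan_lookup(listing, name, case_insensitive=False):
    """Layer above the filesystem: must examine every listed entry."""
    wanted = index_key(name, case_insensitive)
    for entry in listing:
        if index_key(entry, case_insensitive) == wanted:
            return entry
    return None
```

IndexedDirectory.lookup() costs one index probe regardless of directory size, while scan_lookup() is linear in the number of entries and needs the complete listing in hand first.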
351 The only way to implement I18N behaviors in the VFS layer rather than
352 at the filesystem is to abandon form- and case-preserving behaviors.
353 For case-insensitivity this would require using sentence-case, or all
354 lower-case, perhaps, and all such choices would surely be surprising
355 to users. At any rate, that approach would also render much running
356 code "non-compliant" with any Internet filesystem protocol I18N
357 specification.
359 Therefore, generally speaking, only the filesystem can reliably,
360 interoperably, and performantly implement I18N behaviors in general
361 purpose operating systems.
363 Note that variations in I18N behaviors can happen even on the same
364 server with multiple filesystems of the same type. This can happen
365 because of
367 different Unicode versions being used at the times of creation of
368 various filesystems, and
370 different locale settings on various filesystems.
372 Locale variations are only relevant to case-folding for case-
373 insensitivity. Running code mostly uses default case-folding rules,
374 but there is no reason to assume that locale-specific case-folding
375 rules won't be supported by running code in the future.
377 It may not be possible or easy for a filesystem to adopt new Unicode
378 versions, or adopt backwards-incompatible case foldings, after
379 content has been created in it that would be ambiguous under new
380 rules. This implies that where a client for a remote filesystem must
381 know what I18N functionality to implement for use with cached
382 directory listings, the client must know specifically what profile of
383 I18N functionality each cached filesystem implements.
385 2. Filesystem I18N Guidelines
387 We begin by recognizing and accepting that much running code
388 implements I18N functionality at the filesystem. Given this, we
389 catalogue the range of acceptable behaviors. Filesystems adhering to
390 this specification MUST implement only acceptable I18N behaviors as
391 specified here. Acceptable variations may be registered in a to-be-
392 determined (IANA?) registry of filesystem I18N behaviors.
394 2.1. Filesystem I18N Guidelines: Non-Unicode File names
396 o Filesystems SHOULD reject attempts to create new non-Unicode file
397 names.
399 o Filesystems either MUST normalize on CREATE (and LOOKUP), or MUST
400 be form-insensitive and form-preserving.
402 o Filesystems MUST specify a Unicode version for their equivalence
403 behaviors.
405 2.2. Filesystem I18N Guidelines: Case-Insensitivity
407 o Filesystems MAY support case-insensitivity, in which case they
408 SHOULD be case-preserving. Filesystems that are case-insensitive
409 but not case-preserving MUST specify a case form, such as
410 title case or sentence case.
412 o Case foldings for case-insensitive filesystems MUST be identified.
413 The default case foldings SHOULD be the Unicode default case
414 algorithms for the identified Unicode version without additional
415 tailorings. Filesystems that use case algorithms tailored to
416 specific locales SHOULD use case foldings registered in a to-be-
417 determined (IANA?) registry.
419 o Case-insensitive filesystems MUST specify a Unicode version for
420 their case-insensitive behavior.
422 2.3. I18N Versioning
424 Each filesystem MUST identify a Unicode version for its I18N
425 behaviors. Filesystem implementations SHOULD adopt new Unicode
426 versions as they are produced, though it is understood that it may be
427 difficult to migrate non-empty filesystems to new Unicode versions.
429 3. Filesystem Protocol I18N Guidelines
431 Remote filesystem protocols that allow clients to perform lookups
432 against cached directory listings MUST allow clients to discover all
433 relevant I18N behaviors of the filesystem whence any given directory
434 listing came:
436 o whether the filesystem normalizes on CREATE (and LOOKUP), and if
437 so, to what NF in what Unicode version;
439 o whether the filesystem is form-insensitive and form-preserving,
440 and if so, in what Unicode version;
442 o whether the filesystem is case-insensitive and case-preserving,
443 and if so, with what foldings (default or tailored, and if
444 tailored provide an identifier for the set of foldings), and a
445 Unicode version.
447 Foldings are identified via a folding set name as registered in a to-
448 be-determined (IANA?) registry.
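One possible shape for such a discoverable profile is sketched below. This document does not define a wire format, so the class name, the field names, and the validate() checks are all hypothetical; they merely restate the items listed above as a data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class I18nProfile:
    """Hypothetical per-filesystem (or per-directory) I18N profile."""
    unicode_version: str                        # e.g. "12.1.0"
    normalize_on_create: Optional[str] = None   # NF name if NoCL, else None
    form_insensitive: bool = False              # True if FIP
    case_insensitive: bool = False
    folding_set: Optional[str] = None           # None means default foldings

    def validate(self):
        nocl = self.normalize_on_create is not None
        # Exactly one of the two equivalence behaviors must be in effect.
        if nocl == self.form_insensitive:
            raise ValueError("exactly one of NoCL or FIP must be selected")
        # A folding set name only makes sense with case-insensitivity.
        if self.folding_set is not None and not self.case_insensitive:
            raise ValueError("folding set given for case-sensitive filesystem")


# Illustrative profiles only; they describe no specific product.
nocl_profile = I18nProfile("12.1.0", normalize_on_create="NFD")
fip_profile = I18nProfile("12.1.0", form_insensitive=True,
                          case_insensitive=True)
```

A protocol that allows per-directory settings would return one such profile per directory rather than one per filesystem.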
450 Because some filesystems might allow for different I18N settings on a
451 per-directory basis, remote filesystem protocols MUST allow those
452 settings to be discoverable on a per-directory basis.
454 Internet filesystem servers MUST reject attempts to create new non-
455 Unicode file names. (Note that this requirement is weaker ("SHOULD")
456 for the actual filesystems, since those might have to allow non-
457 Unicode content for legacy reasons via interfaces other than Internet
458 filesystem protocols.)
460 3.1. I18N and Caching in Filesystem Protocol Clients
462 Caching clients of remote filesystems either MUST NOT perform lookups
463 against cached directory listings, or MUST query the directories'
464 filesystems' I18N profiles and apply the same I18N equivalent form
465 policies and case-insensitivity case foldings.
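A caching client honoring this rule might look roughly like the sketch below. It is a sketch under stated assumptions: the MustAskServer exception, the cached_lookup() signature, and the policy of deferring to the server on any Unicode version mismatch or non-default folding set are ours. The client's own Unicode data version is taken from Python's unicodedata.unidata_version.

```python
import unicodedata


class MustAskServer(Exception):
    """The client cannot reproduce the server's I18N behavior locally."""


def cached_lookup(listing, name, unicode_version,
                  case_insensitive=False, folding_set=None):
    """Look up name in a cached directory listing, or defer to the server."""
    # Local tables must match the filesystem's Unicode version, and this
    # sketch knows only the default foldings.
    if unicode_version != unicodedata.unidata_version:
        raise MustAskServer(name)
    if folding_set is not None:
        raise MustAskServer(name)

    def key(s):
        s = unicodedata.normalize("NFD", s)
        if case_insensitive:
            s = unicodedata.normalize("NFD", s.casefold())
        return s

    wanted = key(name)
    for entry in listing:
        if key(entry) == wanted:
            return entry             # the name as the server listed it
    return None                      # negative answer from the cache
```

The same comparison serves both normalizing and form-insensitive filesystems, since in either case two canonically equivalent names refer to the same entry.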
467 4. Internationalization Considerations
469 This document deals in internationalization throughout.
471 5. IANA Considerations
473 [ALTERNATIVELY use locale names and CLDR? Need to determine the
474 stability of CLDR locales... Basically, we need stable locale names,
475 and stable case-folding mappings.]
476 We hereby request the creation of a new IANA registry with Expert
477 Review registration rules with the following fields:
479 o name, an identifier-like name
481 o Unicode version number
483 o listing of case folding tailorings and/or references to external
484 case folding tailoring specifications
486 The case foldings registered here will be used by case-insensitive
487 filesystems and filesystem protocols to identify tailored case
488 foldings so that caching clients can implement the same case-
489 insensitive behavior using cached directory listings.
491 6. Security Considerations
493 Security considerations of Unicode and filesystem protocols apply.
494 No new security considerations are added or need be noted here.
496 The methods of handling equivalent Unicode strings cause aliasing.
497 This is not expected to be a security problem.
499 Case-insensitivity causes aliasing. This is not expected to be a
500 security problem.
502 No effort is made here to handle confusables. This is not expected
503 to be a serious security problem in the context of file servers.
505 7. References
507 7.1. Normative References
509 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
510 Requirement Levels", BCP 14, RFC 2119,
511 DOI 10.17487/RFC2119, March 1997.
514 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
515 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
516 2003.
518 [UNICODE] The Unicode Consortium, "The Unicode Standard, Version
519 12.1.0", May 2019.
522 7.2. Informative References
524 [BSD4.4] McKusick, M., Bostic, K., Karels, M., and J. Quarterman,
525 "The Design and Implementation of the 4.4BSD Operating
526 System", DOI 10.5555/231070, 1996.
528 [I-D.ietf-secsh-filexfer]
529 Galbraith, J. and O. Saarenmaa, "SSH File Transfer
530 Protocol", draft-ietf-secsh-filexfer-13 (work in
531 progress), July 2006.
533 [McKusick86]
534 McKusick, M. and M. Karels, "Towards a Compatible File
535 System Interface", Jun 1986.
537 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
538 Resource Identifier (URI): Generic Syntax", STD 66,
539 RFC 3986, DOI 10.17487/RFC3986, January 2005.
542 [RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource
543 Identifiers (IRIs)", RFC 3987, DOI 10.17487/RFC3987,
544 January 2005.
546 [RFC7230] Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
547 Protocol (HTTP/1.1): Message Syntax and Routing",
548 RFC 7230, DOI 10.17487/RFC7230, June 2014.
551 [RFC7530] Haynes, T., Ed. and D. Noveck, Ed., "Network File System
552 (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530,
553 March 2015.
555 [SolarisInternals]
556 McDougall, R. and J. Mauro, "Solaris Internals -- Solaris
557 10 and OpenSolaris Kernel Architecture", 2007.
559 7.3. URIs
561 [1] https://en.wikipedia.org/wiki/Virtual_file_system
563 Author's Address
564 Nico Williams (editor)
565 Cryptonector, LLC
566 Austin, TX
567 USA
569 Email: nico@cryptonector.com