FTPEXT Working Group                                           B. Curtin
INTERNET DRAFT                        Defense Information Systems Agency
Expires 01 December 1998                                   01 June, 1998

           Internationalization of the File Transfer Protocol

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are
   working documents of the Internet Engineering Task Force
   (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of
   six months. Internet-Drafts may be updated, replaced, or
   obsoleted by other documents at any time. It is not
   appropriate to use Internet-Drafts as reference material or
   to cite them other than as a "working draft" or "work in
   progress".

   To view the entire list of current Internet-Drafts, please check
   the "1id-abstracts.txt" listing contained in the Internet-Drafts
   Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
   (Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au
   (Pacific Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu
   (US West Coast).

   Distribution of this document is unlimited. Please send
   comments to the FTP Extension working group (FTPEXT-WG) of
   the Internet Engineering Task Force (IETF) at
   . Subscription address is
   . Discussions of the group
   are archived at .

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as
   described in RFC 2119 [RFC 2119].

Abstract

   The File Transfer Protocol, as defined in RFC 959 [RFC959]
   and RFC 1123 Section 4 [RFC1123], is one of the oldest and
   widely used protocols on the Internet. The protocol's primary
   character set, 7 bit ASCII, has served the protocol well
   through the early growth years of the Internet. However, as
   the Internet becomes more global, there is a need to support
   character sets beyond 7 bit ASCII.

   This document addresses the internationalization (I18n) of
   FTP, which includes supporting the multiple character sets
   found throughout the Internet community. This is achieved by
   extending the FTP specification and giving recommendations
   for proper internationalization support.

Table of Contents

   1 INTRODUCTION
   2 INTERNATIONALIZATION
     2.1 International Character Set
     2.2 Transfer Encoding
   3 CONFORMANCE
     3.1 General
     3.2 International Servers
     3.3 International Clients
   4 SECURITY
   5 ACKNOWLEDGMENTS
   6 GLOSSARY
   7 BIBLIOGRAPHY
   8 AUTHOR'S ADDRESS
   ANNEX A - IMPLEMENTATION CONSIDERATIONS
     A.1 General Considerations
     A.2 Transition Considerations
   ANNEX B - SAMPLE CODE AND EXAMPLES
     B.1 Valid UTF-8 check
     B.2 Conversions
       B.2.1 Conversion from local character set to UTF-8
       B.2.2 Conversion from UTF-8 to local character set
       B.2.3 ISO/IEC 8859-8 Example
       B.2.4 Vendor Codepage Example
     B.3 Pseudo Code for translating servers

1 Introduction

   As the Internet grows throughout the world the requirement to
   support character sets outside of the ASCII [ASCII] / Latin-1
   [ISO-8859] character set becomes ever more urgent. For FTP,
   because of the large installed base, it is paramount that
   this be done without breaking existing clients and servers.
   This document addresses this need. In doing so it defines a
   solution which will still allow the installed base to
   interoperate with new international clients and servers.

   This document enhances the capabilities of the File Transfer
   Protocol by removing the 7-bit restrictions on pathnames used
   in client commands and server responses, recommends the use
   of a Universal Character Set (UCS) ISO/IEC 10646 [ISO-10646],
   and recommends a UCS transformation format (UTF) UTF-8
   [UTF-8].

   The recommendations made in this document are consistent with
   the recommendations expressed by the 29 Feb - 1 Mar 1996 IAB
   Character Set Workshop as expressed in RFC 2130 [RFC 2130].

2 Internationalization

   The File Transfer Protocol was developed when the predominant
   character sets were 7 bit ASCII and 8 bit EBCDIC. Today these
   character sets cannot support the wide range of characters
   needed by multinational systems. Given that there are a
   number of character sets in current use that provide more
   characters than 7-bit ASCII, it makes sense to decide on a
   convenient way to represent the union of those possibilities.
   Working globally requires either support for a number of
   character sets and the ability to convert between them, or
   the use of a single preferred character set. To assure global
   interoperability this document RECOMMENDS the latter approach
   and defines a single character set, in addition to NVT ASCII
   and EBCDIC, which is understandable by all systems. For FTP
   this character set SHALL be ISO/IEC 10646:1993. For support
   of global compatibility it is STRONGLY RECOMMENDED that
   clients and servers use UTF-8 encoding when exchanging
   pathnames. Clients and servers are, however, under no
   obligation to perform any conversion on the contents of a
   file for operations such as STOR or RETR.

   The character set used to store files SHALL remain a local
   decision and MAY depend on the capability of local operating
   systems. Prior to exchange, pathnames should be converted
   into an ISO/IEC 10646 format and UTF-8 encoded. This
   approach, while allowing international exchange of pathnames,
   will still allow backward compatibility with older systems
   because the code set positions for ASCII characters are
   identical to the one byte sequences in UTF-8.

   Sections 2.1 and 2.2 give a brief description of the
   international character set and transfer encoding recommended
   by this document. A more thorough description of UTF-8,
   ISO/IEC 10646, and UNICODE [UNICODE], beyond that given in
   this document, can be found in RFC 2279 [RFC2279].

2.1 International Character Set

   The character set defined for international support of FTP
   SHALL be the Universal Character Set as defined in ISO
   10646:1993 as amended. This standard incorporates the
   character sets of many existing international, national, and
   corporate standards. ISO/IEC 10646 defines two alternate
   forms of encoding, UCS-4 and UCS-2. UCS-4 is a four byte (31
   bit) encoding containing 2**31 code positions divided into
   128 groups of 256 planes. Each plane consists of 256 rows of
   256 cells. UCS-2 is a 2 byte (16 bit) character set
   consisting of plane zero or the Basic Multilingual Plane
   (BMP). Currently, no codesets have been defined outside of
   the 2 byte BMP.

   The Unicode standard version 2.0 [UNICODE] is consistent with
   the UCS-2 subset of ISO/IEC 10646. The Unicode standard
   version 2.0 includes the repertoire of IS 10646 characters,
   amendments 1-7 of IS 10646, and editorial and technical
   corrigenda.

2.2 Transfer Encoding

   UCS Transformation Format 8 (UTF-8), in the past referred to
   as UTF-2 or UTF-FSS, SHALL be used as a transfer encoding to
   transmit the international character set. UTF-8 is a file
   safe encoding which avoids the use of byte values that have
   special significance during the parsing of pathname character
   strings. UTF-8 is an 8 bit encoding of the characters in the
   UCS. Some of UTF-8's benefits are that it is compatible with
   7 bit ASCII, so it doesn't affect programs that give special
   meanings to various ASCII characters; it is immune to
   synchronization errors; its encoding rules allow for easy
   identification; and it has enough space to support a large
   number of character sets.

   UTF-8 encoding represents each UCS character as a sequence of
   1 to 6 bytes in length. For all sequences of one byte the
   most significant bit is ZERO. For all sequences of more than
   one byte the number of ONE bits in the first byte, starting
   from the most significant bit position, indicates the number
   of bytes in the UTF-8 sequence; these ONE bits are followed
   by a ZERO bit. For example, the first byte of a 3 byte UTF-8
   sequence would have 1110 as its most significant bits. Each
   additional byte (continuing byte) in the UTF-8 sequence
   contains a ONE bit followed by a ZERO bit as its most
   significant bits. The remaining free bit positions in the
   continuing bytes are used to identify characters in the UCS.
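
   The lead byte rule described above can be sketched in C as
   follows. This is only an illustration under the 1 to 6 byte
   definition used in this document; the function name
   utf8_seq_len is hypothetical and does not appear elsewhere in
   this document.

```c
/* Hypothetical helper (not part of this document's sample code).
   Returns the total number of bytes (1 to 6) in the UTF-8 sequence
   introduced by lead byte c, or 0 if c cannot start a sequence. */
int utf8_seq_len(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    if ((c & 0xFC) == 0xF8) return 5;   /* 111110xx */
    if ((c & 0xFE) == 0xFC) return 6;   /* 1111110x */
    return 0;   /* 10xxxxxx (continuing byte), 0xFE or 0xFF */
}
```

   For instance, utf8_seq_len(0xE2) yields 3 because 0xE2 begins
   with the bits 1110, while a continuing byte such as 0x95
   yields 0 because it begins with the bits 10.
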
   The relationship between UCS and UTF-8 is demonstrated in the
   following table:

   UCS-4 range(hex)      UTF-8 byte sequence(binary)
   00000000 - 0000007F   0xxxxxxx
   00000080 - 000007FF   110xxxxx 10xxxxxx
   00000800 - 0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
   00010000 - 001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   00200000 - 03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx
                         10xxxxxx
   04000000 - 7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx
                         10xxxxxx 10xxxxxx

   A beneficial property of UTF-8 is that its single byte
   sequence is consistent with the ASCII character set. This
   feature will allow a transition where old ASCII-only clients
   can still interoperate with new servers that support the
   UTF-8 encoding.

   Another feature is that the encoding rules make it very
   unlikely that a character sequence from a different character
   set will be mistaken for a UTF-8 encoded character sequence.
   Clients and servers can use a simple routine to determine if
   the character set being exchanged is valid UTF-8. Section B.1
   shows a code example of this check.

3 Conformance

3.1 General

   - The 7-bit restriction for pathnames exchanged is dropped.

   - Many operating systems allow the use of spaces <SP>,
     carriage return <CR>, and line feed <LF> characters as part
     of the pathname. The exchange of pathnames with these
     special command characters will cause the pathnames to be
     parsed improperly. This is because FTP commands associated
     with pathnames have the form:

        COMMAND <SP> <pathname> <CRLF>.

     To allow the exchange of pathnames containing these
     characters, the definition of pathname is changed from

        <pathname> ::= <string>        ; in BNF format
     to
        pathname = 1*(%x01..%xFF)      ; in ABNF format [ABNF]

     To avoid mistaking these characters within pathnames as
     special command characters the following rules will apply:

       There MUST be only one <SP> between an FTP command and
       the pathname. Implementations MUST assume <SP> characters
       following the initial <SP> as part of the pathname. For
       example the pathname in STOR <SP><SP><SP>foo.bar<CRLF> is
       <SP><SP>foo.bar.

       Current implementations, which may allow multiple <SP>
       characters as separators between the command and
       pathname, MUST assure that they comply with this single
       <SP> convention. Note: Implementations which treat 3
       character commands (e.g. CWD, MKD, etc.) as a fixed 4
       character command by padding the command with a trailing
       <SP> do not comply with this specification.

       When a <CR> character is encountered as part of a
       pathname it MUST be padded with a <NUL> character prior
       to sending the command. On receipt of a pathname
       containing a <CR><NUL> sequence the <NUL> character MUST
       be stripped away. This approach is described in the
       Telnet protocol [RFC854] on pages 11 and 12. For example,
       to store a pathname foo<CR><LF>boo.bar the pathname would
       become foo<CR><NUL><LF>boo.bar prior to sending the
       command STOR <SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon
       receipt of the altered pathname the <NUL> character
       following the <CR> would be stripped away to form the
       original pathname.

   - Conforming internationalized clients and servers MUST
     support UTF-8 for the transfer and receipt of pathnames.
     Clients and servers MAY in addition give users a choice of
     specifying interpretation of pathnames in another encoding.
     Note that configuring clients and servers to use character
     sets / encodings other than UTF-8 is outside of the scope
     of this document. While it is recognized that in certain
     operational scenarios this may be desirable, this is left
     as a quality of implementation and operational issue.

   - Pathnames are sequences of bytes. The encoding of names
     that are valid UTF-8 sequences is assumed to be UTF-8. The
     character set of other names is undefined.
     Clients and servers, unless otherwise configured to support
     a specific native character set, MUST check for a valid
     UTF-8 byte sequence to determine if the pathname being
     presented is UTF-8.

   - To avoid data loss, clients and servers SHOULD use the
     UTF-8 encoded pathnames when unable to convert them to a
     usable code set.

   - There may be cases when the code set / encoding presented
     to the server or client cannot be determined. In such cases
     the raw bytes SHOULD be used.

3.2 International Servers

   - Servers MUST support the UTF-8 feature in response to the
     FEAT command [FEAT]. The UTF-8 feature is a line containing
     the exact string "UTF8". This string is not case sensitive,
     but SHOULD be transmitted in upper case. The response to a
     FEAT command SHOULD be:

        C> feat
        S> 211- <any descriptive text>
        S>  ...
        S>  UTF8
        S>  ...
        S> 211 end

     The ellipses indicate placeholders where other features may
     be included, and are not required. The one space
     indentation of the feature lines is mandatory [FEAT].

   - Mirror servers may want to exactly reflect the site that
     they are mirroring. In such cases servers MAY store and
     present the exact pathname bytes that they received from
     the main server.

3.3 International Clients

   - Clients which do not require display of pathnames are under
     no obligation to do so. Non-display clients do not need to
     conform to requirements associated with display.

   - Clients, which are presented UTF-8 pathnames by the server,
     SHOULD parse UTF-8 correctly and attempt to display the
     pathname within the limitation of the resources available.

   - Clients MUST support the FEAT command and recognize the
     "UTF8" feature (defined in 3.2 above) to determine if a
     server supports UTF-8 encoding.

   - Character semantics of other names shall remain undefined.
     If a client detects that a server is non UTF-8, it SHOULD
     change its display appropriately. How a client
     implementation handles non UTF-8 is a quality of
     implementation issue. It MAY try to assume some other
     encoding, give the user a chance to try to assume
     something, or save encoding assumptions for a server from
     one FTP session to another.

   - Glyph rendering is outside the scope of this document. How
     a client presents characters it cannot display is a quality
     of implementation issue. This document RECOMMENDS that
     octets corresponding to non-displayable characters SHOULD
     be presented in URL %HH format defined in RFC 1738
     [RFC1738]. They MAY, however, display them as question
     marks, with their UCS hexadecimal value, or in any other
     suitable fashion.

   - Many existing clients interpret 8-bit pathnames as being in
     the local character set. They MAY continue to do so for
     pathnames that are not valid UTF-8.

4 Security

   This document addresses the support of character sets beyond
   1 byte. Conformance to this document should not induce a
   security threat.

5 Acknowledgments

   The following people have contributed to this document:

      D. J. Bernstein
      Martin J. Duerst
      Mark Harris
      Paul Hethmon
      Alun Jones
      James Matthews
      Keith Moore
      Sandra O'Donnell
      Benjamin Riefenstahl
      Stephen Tihor

   (and others from the FTPEXT working group)

6 Glossary

   BIDI - abbreviation for Bi-directional, a reference to mixed
   right-to-left and left-to-right text.

   Character Set - a collection of characters used to represent
   textual information in which each character has a numeric
   value.

   Code Set - (see character set).

   Glyph - a character image represented on a display device.

   I18N - "I eighteen N", the first and last letters of the word
   "internationalization" and the eighteen letters in between.

   UCS-2 - the ISO/IEC 10646 two octet Universal Character Set
   form.

   UCS-4 - the ISO/IEC 10646 four octet Universal Character Set
   form.

   UTF-8 - the UCS Transformation Format represented in 8 bits.

   UTF-16 - A 16-bit format including the BMP (directly encoded)
   and surrogate pairs to represent characters in planes 01-16;
   equivalent to Unicode.

7 Bibliography

   [ABNF]

      D. Crocker, P. Overell, "Augmented BNF for Syntax
      Specifications: ABNF", RFC 2234, November 1997.

   [ASCII]

      ANSI X3.4:1986 Coded Character Sets - 7 Bit American
      National Standard Code for Information Interchange (7-bit
      ASCII)

   [FEAT]

      R. Elz, P. Hethmon, "Feature Negotiation Mechanism for the
      File Transfer Protocol", Work in Progress, November 1997.

   [ISO-8859]

      ISO 8859. International standard -- Information processing
      -- 8-bit single-byte coded graphic character sets -- Part 1:
      Latin alphabet No. 1 (1987) -- Part 2: Latin alphabet No. 2
      (1987) -- Part 3: Latin alphabet No. 3 (1988) -- Part 4:
      Latin alphabet No. 4 (1988) -- Part 5: Latin/Cyrillic
      alphabet (1988) -- Part 6: Latin/Arabic alphabet (1987) --
      Part 7: Latin/Greek alphabet (1987) -- Part 8: Latin/Hebrew
      alphabet (1988) -- Part 9: Latin alphabet No. 5 (1989) --
      Part 10: Latin alphabet No. 6 (1992)

   [ISO-10646]

      ISO/IEC 10646-1:1993. International standard -- Information
      technology -- Universal multiple-octet coded character set
      (UCS) -- Part 1: Architecture and basic multilingual plane.

   [RFC854]

      J. Postel, J. Reynolds, "Telnet Protocol Specification", RFC
      854, May 1983.

   [RFC959]

      J. Postel, J. Reynolds, "File Transfer Protocol (FTP)", RFC
      959, October 1985.

   [RFC1123]

      R. Braden, "Requirements for Internet Hosts -- Application
      and Support", RFC 1123, October 1989.

   [RFC1738]

      T. Berners-Lee, L. Masinter, M. McCahill, "Uniform Resource
      Locators (URL)", RFC 1738, December 1994.

   [RFC2279]

      F. Yergeau, "UTF-8, a transformation format of ISO 10646",
      RFC 2279, January 1998.

   [RFC 2119]

      S. Bradner, "Key words for use in RFCs to Indicate
      Requirement Levels", RFC 2119, March 1997.

   [RFC 2130]

      C. Weider, C. Preston, K. Simonsen, H. Alvestrand, "The
      Report of the IAB Character Set Workshop held 29 February -
      1 March, 1996", RFC 2130, April, 1997.

   [UNICODE]

      The Unicode Consortium, "The Unicode Standard - Version
      2.0", Addison-Wesley Developers Press, July 1996.

   [UTF-8]

      ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation
      Format 8 (UTF-8).

8 Author's Address

   JIEO
   Attn JEBBD (Bill Curtin)
   Ft. Monmouth, N.J.
   07703-5613
   curtinw@ftm.disa.mil

Annex A - Implementation Considerations

A.1 General Considerations

   - Implementers should ensure that their code accounts for
     potential problems, such as using a NULL character to
     terminate a string or no longer being able to steal the
     high order bit for internal use, when supporting the
     extended character set.

   - Implementers should be aware that there is a chance that
     pathnames that are non UTF-8 may be parsed as valid UTF-8.
     The probability is low for some encodings and statistically
     zero for others. A recent non-scientific analysis found
     that EUC encoded Japanese words had a 2.7% false reading;
     SJIS had a 0.0005% false reading; other encodings such as
     ASCII or KOI-8 have a 0% false reading. This probability is
     highest for short pathnames and decreases as pathname size
     increases. Implementers may want to look for signs that
     pathnames which parse as UTF-8 are not valid UTF-8, such as
     the existence of multiple local character sets in short
     pathnames.
     Hopefully, as more implementations conform to UTF-8
     transfer encoding there will be a smaller need to guess at
     the encoding.

   - Client developers should be aware that it will be possible
     for pathnames to contain mixed characters (e.g.
     /Latin1DirectoryName/HebrewFileName). They should be
     prepared to handle the Bi-directional (BIDI) display of
     these character sets (i.e. left to right display for the
     directory and right to left display for the filename).
     While bi-directional display is outside the scope of this
     document and more complicated than the above example, an
     algorithm for bi-directional display can be found in the
     UNICODE 2.0 [UNICODE] standard. Also note that pathnames
     can have different byte ordering yet be logically and
     display-wise equivalent due to the insertion of BIDI
     control characters at different points during composition.
     Mixed character sets may also present problems with font
     swapping.

   - A server that copies pathnames transparently from a local
     filesystem may continue to do so. It is then up to the
     local file creators to use UTF-8 pathnames.

   - Servers can support charset labeling of files and/or
     directories, such that different pathnames may have
     different charsets. The server should attempt to convert
     all pathnames to UTF-8, but if it can't then it should
     leave that name in its raw form.

   - Some servers' operating systems do not mandate character
     sets, but allow administrators to configure them in the FTP
     server. These servers should be configured to use a
     particular mapping table (either external or built-in).
     This will allow the flexibility of defining different
     charsets for different directories.

   - If the server's OS does not mandate the character set and
     the FTP server cannot be configured, the server should
     simply use the raw bytes in the file name. They might be
     ASCII or UTF-8.

   - If the server is a mirror, and wants to look just like the
     site it is mirroring, it should store the exact file name
     bytes that it received from the main server.

A.2 Transition Considerations

   - Clients and servers can transition to UTF-8 either by
     converting to/from the local encoding or by having users
     store UTF-8 filenames. The former approach is easier on
     tightly controlled file systems (e.g. PCs and Macs). The
     latter approach is easier on more free form file systems
     (e.g. Unix).

   - For interactive use attention should be focused on user
     interface and ease of use. Non-interactive use requires a
     consistent and controlled behavior.

   - There may be many applications which reference files under
     their old raw pathname (e.g. linked URLs). Changing the
     pathname to UTF-8 will cause access to the old URL to fail.
     A solution may be for the server to act as if there were
     two different pathnames associated with the file. This
     might be done internal to the server on controlled file
     systems or by using symbolic links on free form systems.
     While this approach may work for single file transfer
     non-interactive use, a non-interactive transfer of all of
     the files in a directory will produce duplicates.
     Interactive users may be presented with lists of files
     which are double the actual number of files.

Annex B - Sample Code and Examples

B.1 Valid UTF-8 check

   The following routine checks if a byte sequence is valid
   UTF-8. This is done by checking for the proper tagging of the
   first and following bytes to make sure they conform to the
   UTF-8 format. It then checks to assure that the data part of
   the UTF-8 sequence conforms to the proper range allowed by
   the encoding. Note: This routine will not detect characters
   that have not been assigned and therefore do not exist.

   int utf8_valid(const unsigned char *buf, unsigned int len)
   {
    const unsigned char *endbuf = buf + len;
    unsigned char byte2mask=0x00, c;
    int trailing = 0;    // trailing (continuation) bytes to follow

    while (buf != endbuf)
    {
     c = *buf++;
     if (trailing)
     {
      // Does trailing byte follow UTF-8 format?
      if ((c&0xC0) != 0x80)
       return 0;
      // Need to check 2nd byte for proper range?
      if (byte2mask)
      {
       if (!(c&byte2mask))        // Are appropriate bits set?
        return 0;
       byte2mask=0x00;
      }
      trailing--;
     }
     else if ((c&0x80) == 0x00)   // valid 1 byte UTF-8
      continue;
     else if ((c&0xE0) == 0xC0)   // valid 2 byte UTF-8
     {
      if (!(c&0x1E))              // Is UTF-8 byte in proper range?
       return 0;
      trailing = 1;
     }
     else if ((c&0xF0) == 0xE0)   // valid 3 byte UTF-8
     {
      if (!(c&0x0F))              // Is UTF-8 byte in proper range?
       byte2mask=0x20;     // If not, set mask to check next byte
      trailing = 2;
     }
     else if ((c&0xF8) == 0xF0)   // valid 4 byte UTF-8
     {
      if (!(c&0x07))              // Is UTF-8 byte in proper range?
       byte2mask=0x30;     // If not, set mask to check next byte
      trailing = 3;
     }
     else if ((c&0xFC) == 0xF8)   // valid 5 byte UTF-8
     {
      if (!(c&0x03))              // Is UTF-8 byte in proper range?
       byte2mask=0x38;     // If not, set mask to check next byte
      trailing = 4;
     }
     else if ((c&0xFE) == 0xFC)   // valid 6 byte UTF-8
     {
      if (!(c&0x01))              // Is UTF-8 byte in proper range?
       byte2mask=0x3C;     // If not, set mask to check next byte
      trailing = 5;
     }
     else
      return 0;
    }
    return trailing == 0;
   }

B.2 Conversions

   The code examples in this section closely reflect the
   algorithm in ISO 10646 and may not present the most efficient
   solution for converting to / from UTF-8 encoding. If
   efficiency is an issue, implementers should use the
   appropriate bitwise operators.
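
   As an illustration of the bitwise approach mentioned above,
   the following minimal sketch encodes a single UCS-4 value with
   shifts and masks instead of division and modulus. The function
   name ucs4_char_to_utf8 and its one-character interface are
   hypothetical; they are assumptions made for this example and
   are not taken from ISO 10646 or from the other routines in
   this annex.

```c
#include <stddef.h>

/* Hypothetical helper: encodes one UCS-4 value into out[], which must
   have room for 6 bytes. Returns the number of bytes written, or 0 if
   the value is above 0x7FFFFFFF and therefore not encodable. */
size_t ucs4_char_to_utf8(unsigned long ucs4, unsigned char *out)
{
    /* Lead byte tag for a sequence of n bytes, indexed by n. */
    static const unsigned char lead[7] =
        { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t n, i;

    if      (ucs4 <= 0x7FUL)       n = 1;
    else if (ucs4 <= 0x7FFUL)      n = 2;
    else if (ucs4 <= 0xFFFFUL)     n = 3;
    else if (ucs4 <= 0x1FFFFFUL)   n = 4;
    else if (ucs4 <= 0x3FFFFFFUL)  n = 5;
    else if (ucs4 <= 0x7FFFFFFFUL) n = 6;
    else return 0;

    /* Fill the continuing bytes from last to first, 6 bits at a time. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (ucs4 & 0x3F));
        ucs4 >>= 6;
    }
    /* The bits that remain fit in the free positions of the lead byte. */
    out[0] = (unsigned char)(lead[n] | ucs4);
    return n;
}
```

   Under these assumptions the UCS value 0x000005D5 is encoded as
   the two bytes 0xD7 0x95. The sketch does not reject the values
   that do not occur in UCS-4 (such as 0x0000D800 - 0x0000DFFF);
   a caller would need to screen those separately.
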

   Additional code examples and numerous mapping tables can be
   found at the Unicode site, HTTP://www.unicode.org or
   FTP://unicode.org.

   Note that the conversion examples below assume that the local
   character set supported in the operating system is something
   other than UCS2/UTF-16. There are some operating systems that
   already support UCS2/UTF-16 (notably Plan 9 and Windows NT).
   In this case no conversion will be necessary from the local
   character set to the UCS.

B.2.1 Conversion from local character set to UTF-8

   Conversion from the local filesystem character set to UTF-8
   will normally involve a two step process. First convert the
   local character set to the UCS; then convert the UCS to
   UTF-8.

   The first step in the process can be performed by maintaining
   a mapping table that includes the local character set code
   and the corresponding UCS code. For instance the ISO/IEC
   8859-8 [ISO-8859] code for the Hebrew letter "VAV" is 0xE5.
   The corresponding 4 byte ISO/IEC 10646 code is 0x000005D5.

   The next step is to convert the UCS character code to the
   UTF-8 encoding.
   The following routine can be used to determine and encode the
   correct number of bytes based on the UCS-4 character code:

   unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf,
                              unsigned int ucs4_len,
                              unsigned char *utf8_buf)
   {
     const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
     unsigned int utf8_len = 0;        // return value for UTF8 size
     unsigned char *t_utf8_buf = utf8_buf;  // Temporary pointer
                                            // to load UTF8 values

     while (ucs4_buf != ucs4_endbuf)
     {
       unsigned long c = *ucs4_buf++;  // current UCS-4 character

       if ( c <= 0x7F )  // ASCII chars no conversion needed
       {
         *t_utf8_buf++ = (unsigned char) c;
         utf8_len++;
       }
       else if ( c <= 0x07FF )  // In the 2 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xC0 + (c/0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
         utf8_len += 2;
       }
       else if ( c <= 0xFFFF )  /* In the 3 byte utf-8 range. The
                                   values 0x0000FFFE, 0x0000FFFF
                                   and 0x0000D800 - 0x0000DFFF do
                                   not occur in UCS-4 */
       {
         *t_utf8_buf++ = (unsigned char) (0xE0 + (c/0x1000));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
         utf8_len += 3;
       }
       else if ( c <= 0x1FFFFF )  // In the 4 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xF0 + (c/0x040000));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x1000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
         utf8_len += 4;
       }
       else if ( c <= 0x03FFFFFF )  // In the 5 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xF8 + (c/0x01000000));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x040000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x1000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
         utf8_len += 5;
       }
       else if ( c <= 0x7FFFFFFF )  // In the 6 byte utf-8 range
       {
         *t_utf8_buf++ = (unsigned char) (0xFC + (c/0x40000000));
         *t_utf8_buf++ = (unsigned char)
                           (0x80 + ((c/0x01000000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x040000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x1000)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + ((c/0x40)%0x40));
         *t_utf8_buf++ = (unsigned char) (0x80 + (c%0x40));
         utf8_len += 6;
       }
       // Values above 0x7FFFFFFF are not UCS-4 and are skipped.
     }
     return (utf8_len);
   }

B.2.2 Conversion from UTF-8 to local character set

   When moving from UTF-8 encoding to the local character set
   the reverse procedure is used. First the UTF-8 encoding is
   transformed into the UCS-4 character set. The UCS-4 is then
   converted to the local character set from a mapping table
   (i.e. the opposite of the table used to form the UCS-4
   character code).

   To convert from UTF-8 to UCS-4 the free bits (those that do
   not define UTF-8 sequence size or signify continuation bytes)
   in a UTF-8 sequence are concatenated as a bit string. The
   bits are then distributed into a four-byte sequence starting
   from the least significant bits. Bit positions in the
   four-byte sequence that receive no bit are padded with ZERO
   bits.
   The following routine converts the UTF-8 encoding to UCS-4
   character codes:

   // The input is expected to have passed utf8_valid, so that
   // every multi-byte sequence is complete.
   int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
                     unsigned char *utf8_buf)
   {
     const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
     unsigned int ucs_len = 0;

     while (utf8_buf != utf8_endbuf)
     {
       if ((*utf8_buf & 0x80) == 0x00)  /* ASCII chars no conversion
                                           needed */
       {
         *ucs4_buf++ = (unsigned long) *utf8_buf;
         utf8_buf++;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xE0) == 0xC0)  // 2 byte utf-8 range
       {
         *ucs4_buf++ = (unsigned long)
             (((*utf8_buf - 0xC0) * 0x40)
              + ( *(utf8_buf+1) - 0x80));
         utf8_buf += 2;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xF0) == 0xE0)  // 3 byte utf-8 range
       {
         *ucs4_buf++ = (unsigned long)
             (((*utf8_buf - 0xE0) * 0x1000)
              + (( *(utf8_buf+1) - 0x80) * 0x40)
              + ( *(utf8_buf+2) - 0x80));
         utf8_buf += 3;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xF8) == 0xF0)  // 4 byte utf-8 range
       {
         *ucs4_buf++ = (unsigned long)
             (((*utf8_buf - 0xF0) * 0x040000)
              + (( *(utf8_buf+1) - 0x80) * 0x1000)
              + (( *(utf8_buf+2) - 0x80) * 0x40)
              + ( *(utf8_buf+3) - 0x80));
         utf8_buf += 4;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xFC) == 0xF8)  // 5 byte utf-8 range
       {
         *ucs4_buf++ = (unsigned long)
             (((*utf8_buf - 0xF8) * 0x01000000)
              + (( *(utf8_buf+1) - 0x80) * 0x040000)
              + (( *(utf8_buf+2) - 0x80) * 0x1000)
              + (( *(utf8_buf+3) - 0x80) * 0x40)
              + ( *(utf8_buf+4) - 0x80));
         utf8_buf += 5;
         ucs_len++;
       }
       else if ((*utf8_buf & 0xFE) == 0xFC)  // 6 byte utf-8 range
       {
         *ucs4_buf++ = (unsigned long)
             (((*utf8_buf - 0xFC) * 0x40000000)
              + (( *(utf8_buf+1) - 0x80) * 0x01000000)
              + (( *(utf8_buf+2) - 0x80) * 0x040000)
              + (( *(utf8_buf+3) - 0x80) * 0x1000)
              + (( *(utf8_buf+4) - 0x80) * 0x40)
              + ( *(utf8_buf+5) - 0x80));
         utf8_buf += 6;
         ucs_len++;
       }
       else
         utf8_buf++;  // Not a valid first byte: skip it
     }
     return (ucs_len);
   }

B.2.3 ISO/IEC 8859-8 Example

   This example demonstrates mapping the ISO/IEC 8859-8 character
   set to UTF-8 and back to ISO/IEC 8859-8. As noted earlier,
   the Hebrew letter "VAV" is converted from the ISO/IEC 8859-8
   character code 0xE5 to the corresponding 4 byte ISO/IEC 10646
   code of 0x000005D5 by a simple lookup of a conversion/mapping
   file.

   The UCS-4 character code is transformed into UTF-8 using the
   ucs4_to_utf8 routine described earlier by:

   1. Because the UCS-4 character is between 0x80 and 0x07FF it
      will map to a 2 byte UTF-8 sequence.
   2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40))
      = 0xD7.
   3. The second byte is defined by (0x80 + (0x000005D5 %
      0x40)) = 0x95.

   The UTF-8 encoding is transformed back to UCS-4 by using the
   utf8_to_ucs4 routine described earlier by:

   1. Because the first byte of the sequence, when the '&'
      operator with a value of 0xE0 is applied, will produce
      0xC0 (0xD7 & 0xE0 = 0xC0) the UTF-8 is a 2 byte sequence.
   2. The four byte UCS-4 character code is produced by
      (((0xD7 - 0xC0) * 0x40) + (0x95 - 0x80)) = 0x000005D5.

   Finally, the UCS-4 character code is converted to the ISO/IEC
   8859-8 character code (using the mapping table which matches
   ISO/IEC 8859-8 to UCS-4) to produce the original 0xE5 code
   for the Hebrew letter "VAV".

B.2.4 Vendor Codepage Example

   This example demonstrates the mapping of a vendor codepage to
   UTF-8 and back to a vendor codepage. Mapping between vendor
   codepages can be done in a very similar manner as described
   above. For instance both the PC and Mac codepages reflect the
   character set from the Thai standard TIS 620-2533. The
   character code on both platforms for the Thai letter "SO SO"
   is 0xAB.
   This character can then be mapped into the UCS-4 by way of a
   conversion/mapping file to produce the UCS-4 code of 0x0E0B.

   The UCS-4 character code is transformed into UTF-8 using the
   ucs4_to_utf8 routine described earlier by:

   1. Because the UCS-4 character is between 0x0800 and 0xFFFF
      it will map to a 3 byte UTF-8 sequence.
   2. The first byte is defined by (0xE0 + (0x00000E0B /
      0x1000)) = 0xE0.
   3. The second byte is defined by (0x80 + ((0x00000E0B /
      0x40) % 0x40)) = 0xB8.
   4. The third byte is defined by (0x80 + (0x00000E0B % 0x40))
      = 0x8B.

   The UTF-8 encoding is transformed back to UCS-4 by using the
   utf8_to_ucs4 routine described earlier by:

   1. Because the first byte of the sequence, when the '&'
      operator with a value of 0xF0 is applied, will produce
      0xE0 (0xE0 & 0xF0 = 0xE0) the UTF-8 is a 3 byte sequence.
   2. The four byte UCS-4 character code is produced by
      (((0xE0 - 0xE0) * 0x1000) + ((0xB8 - 0x80) * 0x40) +
      (0x8B - 0x80)) = 0x00000E0B.

   Finally, the UCS-4 character code is converted to either the
   PC or Mac codepage character code (using the mapping table
   which matches the codepage to UCS-4) to produce the original
   0xAB code for the Thai letter "SO SO".

B.3 Pseudo Code for a high-quality translating server

   if utf8_valid(fn)
   {
     attempt to convert fn to the local charset, producing localfn
     if (conversion fails temporarily) return error
     if (conversion succeeds)
     {
       attempt to open localfn
       if (open fails temporarily) return error
       if (open succeeds) return success
     }
   }
   attempt to open fn
   if (open fails temporarily) return error
   if (open succeeds) return success
   return permanent error
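
   The pseudo code in B.3 might be sketched in C roughly as
   follows. This is only an illustration under stated assumptions:
   the status codes, the xl_hooks structure, the xl_open function
   and the demo_ stubs are all hypothetical names introduced for
   this example. A real server would supply its own validity
   check (such as the routine in B.1), its own charset conversion
   and its own file open operation in place of the stubs.

```c
#include <stdio.h>
#include <string.h>

/* All names below are hypothetical; none are defined by this draft. */
enum xl_status { XL_OK, XL_TEMP_FAIL, XL_PERM_FAIL };

struct xl_hooks {
    /* Nonzero if the len bytes at fn are valid UTF-8 (see B.1). */
    int (*valid)(const unsigned char *fn, unsigned int len);
    /* Converts fn to the local charset, writing into out. */
    enum xl_status (*to_local)(const char *fn, char *out, size_t outsz);
    /* Attempts to open the named file. */
    enum xl_status (*open_file)(const char *path);
};

enum xl_status xl_open(const struct xl_hooks *h, const char *fn)
{
    char localfn[256];
    enum xl_status st;

    if (h->valid((const unsigned char *)fn, (unsigned int)strlen(fn))) {
        st = h->to_local(fn, localfn, sizeof localfn);
        if (st == XL_TEMP_FAIL)
            return XL_TEMP_FAIL;
        if (st == XL_OK) {
            st = h->open_file(localfn);
            if (st != XL_PERM_FAIL)
                return st;      /* success or temporary failure */
        }
    }
    /* Fall back to the file name bytes exactly as received. */
    return h->open_file(fn);
}

/* Demo stubs so the sketch can be exercised without a file system. */
static const char *demo_existing = "";  /* the only name that "exists" */

static int demo_valid(const unsigned char *fn, unsigned int len)
{
    (void)fn; (void)len;
    return 1;                   /* pretend every name is valid UTF-8 */
}

static enum xl_status demo_to_local(const char *fn, char *out,
                                    size_t outsz)
{
    /* Stand-in conversion: tag the name so it differs from the raw one. */
    int n = snprintf(out, outsz, "local:%s", fn);
    return (n < 0 || (size_t)n >= outsz) ? XL_PERM_FAIL : XL_OK;
}

static enum xl_status demo_open(const char *path)
{
    return strcmp(path, demo_existing) == 0 ? XL_OK : XL_PERM_FAIL;
}
```

   With the stubs above, a name that exists only in converted form
   is opened through the conversion path, a name that exists only
   in raw form is opened through the fallback, and a name that
   exists in neither form yields the permanent error.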