FTPEXT Working Group                                           B. Curtin
INTERNET DRAFT                        Defense Information Systems Agency
Expires 7 October, 1999                                    7 April, 1999

           Internationalization of the File Transfer Protocol
                  <draft-ietf-ftpext-intl-ftp-06.txt>

Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.

Distribution of this document is unlimited. Please send comments to
the FTP Extension working group (FTPEXT-WG) of the Internet
Engineering Task Force (IETF) at .
Subscription address is . Discussions
of the group are archived at .

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14 [BCP14].

Abstract

The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
1123 Section 4 [RFC1123], is one of the oldest and most widely used
protocols on the Internet. The protocol's primary character set, 7 bit
ASCII, has served the protocol well through the early growth years of
the Internet. However, as the Internet becomes more global, there is a
need to support character sets beyond 7 bit ASCII.

This document addresses the internationalization (I18n) of FTP, which
includes supporting the multiple character sets and languages found
throughout the Internet community. This is achieved by extending the
FTP specification and giving recommendations for proper
internationalization support.

Table of Contents

ABSTRACT
1 INTRODUCTION
2 INTERNATIONALIZATION
  2.1 International Character Set
  2.2 Transfer Encoding
3 PATHNAMES
  3.1 General compliance
  3.2 Servers compliance
  3.3 Clients compliance
4 LANGUAGE SUPPORT
  4.1 The LANG command
  4.2 Syntax of the LANG command
  4.3 Feat response for LANG command
    4.3.1 Feat examples
5 SECURITY
6 ACKNOWLEDGMENTS
7 GLOSSARY
8 BIBLIOGRAPHY
9 AUTHOR'S ADDRESS
ANNEX A - IMPLEMENTATION CONSIDERATIONS
  A.1 General Considerations
  A.2 Transition Considerations
ANNEX B - SAMPLE CODE AND EXAMPLES
  B.1 Valid UTF-8 check
  B.2 Conversions
    B.2.1 Conversion from Local Character Set to UTF-8
    B.2.2 Conversion from UTF-8 to Local Character Set
    B.2.3 ISO/IEC 8859-8 Example
    B.2.4 Vendor Codepage Example
  B.3 Pseudo Code for Translating Servers

1 Introduction

As the Internet grows throughout the world the requirement to support
character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859]
character set becomes ever more urgent. For FTP, because of the large
installed base, it is paramount that this is done without breaking
existing clients and servers. This document addresses this need. In
doing so it defines a solution which will still allow the installed
base to interoperate with new clients and servers.

This document enhances the capabilities of the File Transfer Protocol
by removing the 7-bit restrictions on pathnames used in client
commands and server responses. It RECOMMENDS the use of a Universal
Character Set (UCS), ISO/IEC 10646 [ISO-10646], RECOMMENDS a UCS
transformation format (UTF), UTF-8 [UTF-8], and defines a new command
for language negotiation.

The recommendations made in this document are consistent with the
recommendations expressed by the IETF policy related to character sets
and languages as defined in RFC 2277 [RFC2277].

2 Internationalization

The File Transfer Protocol was developed when the predominant
character sets were 7 bit ASCII and 8 bit EBCDIC. Today these
character sets cannot support the wide range of characters needed by
multinational systems. Given that there are a number of character sets
in current use that provide more characters than 7-bit ASCII, it makes
sense to decide on a convenient way to represent the union of those
possibilities. Working globally requires either support for a number
of character sets and the ability to convert between them, or the use
of a single preferred character set. To assure global interoperability
this document RECOMMENDS the latter approach and defines a single
character set, in addition to NVT ASCII and EBCDIC, which is
understandable by all systems.
For FTP this character set SHALL be ISO/IEC 10646:1993.
For support of global compatibility it is STRONGLY RECOMMENDED that
clients and servers use UTF-8 encoding when exchanging pathnames.
Clients and servers are, however, under no obligation to perform any
conversion on the contents of a file for operations such as STOR or
RETR.

The character set used to store files SHALL remain a local decision
and MAY depend on the capability of local operating systems. Prior to
the exchange of pathnames they SHOULD be converted into an ISO/IEC
10646 format and UTF-8 encoded. This approach, while allowing
international exchange of pathnames, will still allow backward
compatibility with older systems because the code set positions for
ASCII characters are identical to the one byte sequence in UTF-8.

Sections 2.1 and 2.2 give a brief description of the international
character set and transfer encoding RECOMMENDED by this document. A
more thorough description of UTF-8, ISO/IEC 10646, and UNICODE
[UNICODE], beyond that given in this document, can be found in RFC
2279 [RFC2279].

2.1 International Character Set

The character set defined for international support of FTP SHALL be
the Universal Character Set as defined in ISO 10646:1993 as amended.
This standard incorporates the character sets of many existing
international, national, and corporate standards. ISO/IEC 10646
defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
four byte (31 bit) encoding containing 2**31 code positions divided
into 128 groups of 256 planes. Each plane consists of 256 rows of 256
cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
zero or the Basic Multilingual Plane (BMP). Currently, no codesets
have been defined outside of the 2 byte BMP.

The Unicode standard version 2.0 [UNICODE] is consistent with the
UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
includes the repertoire of IS 10646 characters, amendments 1-7 of IS
10646, and editorial and technical corrigenda.

2.2 Transfer Encoding

UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
or UTF-FSS, SHALL be used as a transfer encoding to transmit the
international character set. UTF-8 is a file safe encoding which
avoids the use of byte values that have special significance during
the parsing of pathname character strings. UTF-8 is an 8 bit encoding
of the characters in the UCS. Some of UTF-8's benefits are that it is
compatible with 7 bit ASCII, so it doesn't affect programs that give
special meanings to various ASCII characters; it is immune to
synchronization errors; its encoding rules allow for easy
identification; and it has enough space to support a large number of
character sets.

UTF-8 encoding represents each UCS character as a sequence of 1 to 6
bytes in length. For all sequences of one byte the most significant
bit is ZERO. For all sequences of more than one byte the number of ONE
bits in the first byte, starting from the most significant bit
position and terminated by a ZERO bit, indicates the number of bytes
in the UTF-8 sequence. For example, the first byte of a 3 byte UTF-8
sequence would have 1110 as its most significant bits. Each additional
byte (continuing byte) in the UTF-8 sequence contains a ONE bit
followed by a ZERO bit as its most significant bits. The remaining
free bit positions in the first byte and the continuing bytes are used
to identify characters in the UCS.
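
As an illustration only, the bit layout described in this section can
be sketched in Python as follows. The function name utf8_encode_ucs4
is hypothetical and is not part of this specification. The sketch
follows the 1 to 6 byte scheme described here and assumes the caller
supplies a single UCS-4 value as an integer.

```python
def utf8_encode_ucs4(code_point: int) -> bytes:
    """Encode one UCS-4 value (0..0x7FFFFFFF) using the 1 to 6 byte scheme."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 value out of range")
    if code_point < 0x80:
        # Single byte: the most significant bit is ZERO.
        return bytes([code_point])
    # Exclusive upper bound of the UCS-4 range covered by each length.
    limits = [(2, 0x800), (3, 0x10000), (4, 0x200000),
              (5, 0x4000000), (6, 0x80000000)]
    for length, limit in limits:
        if code_point < limit:
            break
    # Continuing bytes: bits "10" followed by 6 payload bits each,
    # collected from the least significant end.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # First byte: `length` ONE bits, a ZERO bit, then the leftover bits.
    lead_marker = (0xFF << (8 - length)) & 0xFF
    return bytes([lead_marker | code_point] + tail[::-1])
```

For instance, utf8_encode_ucs4(0x20AC) yields the three bytes E2 82 AC,
whose first byte begins with the bits 1110.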

The relationship between UCS and UTF-8 is demonstrated in the
following table:

UCS-4 range(hex)      UTF-8 byte sequence(binary)
00000000 - 0000007F   0xxxxxxx
00000080 - 000007FF   110xxxxx 10xxxxxx
00000800 - 0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
00010000 - 001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000 - 03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000 - 7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
                      10xxxxxx

A beneficial property of UTF-8 is that its single byte sequence is
consistent with the ASCII character set. This feature will allow a
transition where old ASCII-only clients can still interoperate with
new servers that support the UTF-8 encoding.

Another feature is that the encoding rules make it very unlikely that
a character sequence from a different character set will be mistaken
for a UTF-8 encoded character sequence. Clients and servers can use a
simple routine to determine if the character set being exchanged is
valid UTF-8. Section B.1 shows a code example of this check.

3 Pathnames

3.1 General compliance

- The 7-bit restriction for pathnames exchanged is dropped.

- Many operating systems allow the use of spaces <SP>, carriage return
  <CR>, and line feed <LF> characters as part of the pathname. The
  exchange of pathnames with these special command characters will
  cause the pathnames to be parsed improperly. This is because ftp
  commands associated with pathnames have the form:

     COMMAND <SP> <pathname> <CRLF>.

  To allow the exchange of pathnames containing these characters, the
  definition of pathname is changed from
     <pathname> ::= <string> ; in BNF format
  to
     pathname = 1*(%x01..%xFF) ; in ABNF format [ABNF].

  To avoid mistaking these characters within pathnames as special
  command characters the following rules will apply:

  There MUST be only one <SP> between a ftp command and the pathname.
  Implementations MUST assume <SP> characters following the initial
  <SP> as part of the pathname. For example the pathname in STOR
  <SP><SP><SP>foo.bar<CRLF> is <SP><SP>foo.bar.

  Current implementations, which may allow multiple <SP> characters as
  separators between the command and pathname, MUST assure that they
  comply with this single <SP> convention. Note: Implementations which
  treat 3 character commands (e.g. CWD, MKD, etc.) as a fixed 4
  character command by padding the command with a trailing <SP> are in
  non-compliance to this specification.

  When a <CR> character is encountered as part of a pathname it MUST
  be padded with a <NUL> character prior to sending the command. On
  receipt of a pathname containing a <CR><NUL> sequence the <NUL>
  character MUST be stripped away. This approach is described in the
  Telnet protocol [RFC854] on pages 11 and 12. For example, to store a
  pathname foo<CR><LF>boo.bar the pathname would become
  foo<CR><NUL><LF>boo.bar prior to sending the command STOR
  <SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon receipt of the altered
  pathname the <NUL> character following the <CR> would be stripped
  away to form the original pathname.

- Conforming clients and servers MUST support UTF-8 for the transfer
  and receipt of pathnames. Clients and servers MAY in addition give
  users a choice of specifying interpretation of pathnames in another
  encoding. Note that configuring clients and servers to use character
  sets / encoding other than UTF-8 is outside of the scope of this
  document. While it is recognized that in certain operational
  scenarios this may be desirable, this is left as a quality of
  implementation and operational issue.

- Pathnames are sequences of bytes. The encoding of names that are
  valid UTF-8 sequences is assumed to be UTF-8. The character set of
  other names is undefined.
  Clients and servers, unless otherwise
  configured to support a specific native character set, MUST check
  for a valid UTF-8 byte sequence to determine if the pathname being
  presented is UTF-8.

- To avoid data loss, clients and servers SHOULD use the UTF-8
  encoded pathnames when unable to convert them to a usable code set.

- There may be cases when the code set / encoding presented to the
  server or client cannot be determined. In such cases the raw bytes
  SHOULD be used.

3.2 Servers compliance

- Servers MUST support the UTF-8 feature in response to the FEAT
  command [2389]. The UTF-8 feature is a line containing the exact
  string "UTF8". This string is not case sensitive, but SHOULD be
  transmitted in upper case. The response to a FEAT command SHOULD be:

     C> feat
     S> 211-
     S>  ...
     S>  UTF8
     S>  ...
     S> 211 end

  The ellipses indicate placeholders where other features may be
  included, but are NOT REQUIRED. The one space indentation of the
  feature lines is mandatory [2389].

- Mirror servers may want to exactly reflect the site that they are
  mirroring. In such cases servers MAY store and present the exact
  pathname bytes that they received from the main server.

3.3 Clients compliance

- Clients which do not require display of pathnames are under no
  obligation to do so. Non-display clients do not need to conform to
  requirements associated with display.

- Clients, which are presented UTF-8 pathnames by the server, SHOULD
  parse UTF-8 correctly and attempt to display the pathname within the
  limitation of the resources available.

- Clients MUST support the FEAT command and recognize the "UTF8"
  feature (defined in 3.2 above) to determine if a server supports
  UTF-8 encoding.

- Character semantics of other names shall remain undefined. If a
  client detects that a server is non UTF-8, it SHOULD change its
  display appropriately. How a client implementation handles non UTF-8
  is a quality of implementation issue. It MAY try to assume some
  other encoding, give the user a chance to try to assume something,
  or save encoding assumptions for a server from one FTP session to
  another.

- Glyph rendering is outside the scope of this document. How a client
  presents characters it cannot display is a quality of implementation
  issue. This document RECOMMENDS that octets corresponding to
  non-displayable characters SHOULD be presented in URL %HH format
  defined in RFC 1738 [RFC1738]. They MAY, however, display them as
  question marks, with their UCS hexadecimal value, or in any other
  suitable fashion.

- Many existing clients interpret 8-bit pathnames as being in the
  local character set. They MAY continue to do so for pathnames that
  are not valid UTF-8.

4 Language Support

The Character Set Workshop Report [RFC2130] suggests that clients and
servers SHOULD negotiate a language for "greetings" and "error
messages". This specification interprets the use of the term "error
message", by RFC 2130, to mean any explanatory text string returned by
server-PI in response to a user-PI command.

Implementers SHOULD note that FTP commands and numeric responses are
protocol elements. As such, their use is not affected by any guidance
expressed by this specification.

Language support of greetings and command responses shall be the
default language supported by the server or the language supported by
the server and selected by the client.

It may be possible to achieve language support through a virtual host
as described in [MLST]. However, an FTP server might not support
virtual servers, or virtual servers might be configured to support an
environment without regard for language. To allow language negotiation
this specification defines a new LANG command.
Clients and servers that comply with this specification MUST support
the LANG command.

4.1 The LANG command

A new command "LANG" is added to the FTP command set to allow the
server-FTP process to determine in which language to present server
greetings and the textual part of command responses. The parameter
associated with the LANG command SHALL be one of the language tags
defined in RFC 1766 [RFC1766]. If a LANG command without a parameter
is issued the server's default language will be used.

Greetings and responses issued prior to language negotiation SHALL be
in the server's default language. Paragraph 4.5 of [RFC2277] states
that this "default language MUST be understandable by an English-
speaking person". This specification RECOMMENDS that the server
default language be English encoded using ASCII. This text may be
augmented by text from other languages. Once negotiated, server-PI
MUST return server messages and textual part of command responses in
the negotiated language and encoded in UTF-8. Server-PI MAY wish to
re-send previously issued server messages in the newly negotiated
language.

The LANG command only affects presentation of greeting messages and
explanatory text associated with command responses. No attempt should
be made by the server to translate protocol elements (FTP commands and
numeric responses) or data transmitted over the data connection.

User-PI MAY issue the LANG command at any time during an FTP session.
In order to gain the full benefit of this command, it SHOULD be
presented prior to authentication. In general, it will be issued after
the HOST command [MLST]. Note that the issuance of a HOST or REIN
command [RFC959] will negate the effect of the LANG command. User-PI
SHOULD be capable of supporting UTF-8 encoding for the language
negotiated. Guidance on interpretation and rendering of UTF-8, defined
in section 3, SHALL apply.

Although NOT REQUIRED by this specification, a user-PI SHOULD issue a
FEAT command [2389] prior to a LANG command. This will allow the
user-PI to determine if the server supports the LANG command and which
language options are available.

In order to aid the server in identifying whether a connection has
been established with a client which conforms to this specification or
an older client, user-PI MUST send a HOST [MLST] and/or LANG command
prior to issuing any other command (other than FEAT [2389]). If
user-PI issues a HOST command, and the server's default language is
acceptable, it need not issue a LANG command. However, if the
implementation does not support the HOST command, a LANG command MUST
be issued. Until server-PI is presented with either a HOST or LANG
command it SHOULD assume that the user-PI does not comply with this
specification.

4.2 Syntax of the LANG command

The LANG command is defined as follows:

   lang-command = "Lang" [(SP lang-tag)] CRLF
   lang-tag = Primary-tag *( "-" Sub-tag)
   Primary-tag = 1*8ALPHA
   Sub-tag = 1*8ALPHA

   lang-response = lang-ok / error-response
   lang-ok = "200" [SP *(%x00..%xFF) ] CRLF
   error-response = command-unrecognized / bad-argument /
                    not-implemented / unsupported-parameter
   command-unrecognized = "500" [SP *(%x01..%xFF) ] CRLF
   bad-argument = "501" [SP *(%x01..%xFF) ] CRLF
   not-implemented = "502" [SP *(%x01..%xFF) ] CRLF
   unsupported-parameter = "504" [SP *(%x01..%xFF) ] CRLF

The "lang" command word is case independent and may be specified in
any character case desired. Therefore "LANG", "lang", "Lang", and
"lAnG" are equivalent commands.

The OPTIONAL "Lang-tag" given as a parameter specifies the primary
language tags and zero or more sub-tags as defined in [RFC1766].
As described in [RFC1766] language tags are treated as case
insensitive. If the "Lang-tag" is omitted, server-PI MUST use the
server's default language.

Server-FTP responds to the "Lang" command with either "lang-ok" or
"error-response". "lang-ok" MUST be sent if Server-FTP supports the
"Lang" command and can support some form of the "lang-tag". Support
SHOULD be as follows:

- If server-FTP receives "Lang" with no parameters it SHOULD return
  messages and command responses in the server default language.

- If server-FTP receives "Lang" with only a primary tag argument
  (e.g. en, fr, de, ja, zh, etc.), which it can support, it SHOULD
  return messages and command responses in the language associated
  with that primary tag. It is possible that server-FTP will only
  support the primary tag when combined with a sub-tag (e.g. en-US,
  en-UK, etc.). In such cases, server-FTP MAY determine the
  appropriate variant to use during the session. How server-FTP makes
  that determination is outside the scope of this specification. If
  server-FTP cannot determine if a sub-tag variant is appropriate it
  SHOULD return an "unsupported-parameter" (504) response.

- If server-FTP receives "Lang" with a primary tag and sub-tag(s)
  argument, which is implemented, it SHOULD return messages and
  command responses in support of the language argument. It is
  possible that server-FTP can support the primary tag of the "Lang"
  argument but not the sub-tag(s). In such cases server-FTP MAY
  return messages and command responses in the most appropriate
  variant of the primary tag that has been implemented. How
  server-FTP makes that determination is outside the scope of this
  specification. If server-FTP cannot determine if a sub-tag variant
  is appropriate it SHOULD return an "unsupported-parameter" (504)
  response.
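
The three cases above can be sketched in Python as follows. This is a
minimal illustration, not a required algorithm: the function name
select_language, its arguments, and the rule of choosing the first
supported variant that shares the primary tag are assumptions made for
the example, since this specification leaves that determination to
server-FTP.

```python
import re

# lang-tag = Primary-tag *( "-" Sub-tag ), each tag being 1*8ALPHA.
LANG_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def select_language(argument, supported, default):
    """Return (reply code, chosen language tag or None) for a LANG argument.

    argument  -- the text after "LANG <SP>", or None if no parameter
    supported -- list of language tags implemented by the server
    default   -- the server's default language tag
    """
    if argument is None:
        return 200, default                  # no parameter: server default
    if not LANG_TAG.fullmatch(argument):
        return 501, None                     # not syntactically correct
    wanted = argument.lower()                # tags are case insensitive
    by_lower = {tag.lower(): tag for tag in supported}
    if wanted in by_lower:
        return 200, by_lower[wanted]         # exact match
    primary = wanted.split("-")[0]
    for tag in supported:                    # hypothetical fallback rule
        if tag.lower().split("-")[0] == primary:
            return 200, tag
    return 504, None                         # valid syntax, not implemented
```

With supported tags en-US, en-UK and fr, the argument "en-AU" is
answered with 200 and en-US under this fallback rule, while "de" is
answered with 504. A server is free to prefer a different variant.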

For example if client-FTP sends a "LANG en-AU" command and server-FTP
has implemented language tags en-US and en-UK it may decide that the
most appropriate language tag is en-UK and return "200 en-AU not
supported. Language set to en-UK". The numeric response is a protocol
element and cannot be changed. The associated string is for
illustrative purposes only.

Clients and servers that conform to this specification MUST support
the LANG command. Clients SHOULD, however, anticipate receiving a 500
or 502 command response, in cases where older or non-compliant servers
do not recognize or have not implemented the "Lang" command. A 501
response SHOULD be sent if the argument to the "Lang" command is not
syntactically correct. A 504 response SHOULD be sent if the "Lang"
argument, while syntactically correct, is not implemented. As noted
above, an argument may be considered a lexicon match even though it is
not an exact syntax match.

4.3 Feat response for LANG command

A server-FTP process that supports the LANG command, and language
support for messages and command responses, MUST include in the
response to the FEAT command [2389], a feature line indicating that
the LANG command is supported and a fact list of the supported
language tags. A response to a FEAT command SHALL be in the following
format:

   Lang-feat = SP "LANG" SP lang-fact CRLF
   lang-fact = lang-tag ["*"] *(";" lang-tag ["*"])

   lang-tag = Primary-tag *( "-" Sub-tag)
   Primary-tag = 1*8ALPHA
   Sub-tag = 1*8ALPHA

The lang-feat response contains the string "LANG" followed by a
language fact. This string is not case sensitive, but SHOULD be
transmitted in upper case, as recommended in [2389]. The initial space
shown in the Lang-feat response is REQUIRED by the FEAT command. It
MUST be a single space character. More or fewer space characters are
not permitted. The lang-fact SHALL include the lang-tags which
server-FTP can support. At least one lang-tag MUST be included with
the FEAT response. The lang-tag SHALL be in the form described earlier
in this document. The OPTIONAL asterisk, when present, SHALL indicate
the current lang-tag being used by server-FTP for messages and
responses.

4.3.1 Feat examples

   C> feat
   S> 211-
   S>  ...
   S>  LANG EN*
   S>  ...
   S> 211 end

In this example server-FTP can only support English, which is the
current language (as shown by the asterisk) being used by the server
for messages and command responses.

   C> feat
   S> 211-
   S>  ...
   S>  LANG EN*;FR
   S>  ...
   S> 211 end

   C> LANG fr
   S> 200 Le response sera changez au francais

   C> feat
   S> 211-
   S>  ...
   S>  LANG EN;FR*
   S>  ...
   S> 211 end

In this example server-FTP supports both English and French as shown
by the initial response to the FEAT command. The asterisk indicates
that English is the current language in use by server-FTP. After a
LANG command is issued to change the language to French, the FEAT
response shows French as the current language in use.

In the above examples ellipses indicate placeholders where other
features may be included, but are NOT REQUIRED.

5 Security

This document addresses the support of character sets beyond 1 byte
and a new language negotiation command. Conformance to this document
should not induce a security risk.

6 Acknowledgments

The following people have contributed to this document:

   D. J. Bernstein
   Martin J. Duerst
   Mark Harris
   Paul Hethmon
   Alun Jones
   Gregory Lundberg
   James Matthews
   Keith Moore
   Sandra O'Donnell
   Benjamin Riefenstahl
   Stephen Tihor

(and others from the FTPEXT working group)

7 Glossary

BIDI - abbreviation for Bi-directional, a reference to mixed
right-to-left and left-to-right text.
577 Character Set - a collection of characters used to represent textual 578 information in which each character has a numeric value. 580 Code Set - (see character set). 582 Glyph - a character image represented on a display device. 584 I18N - "I eighteen N", the first and last letters of the word 585 "internationalization" and the eighteen letters in between. 587 UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form. 589 UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form. 591 UTF-8 - the UCS Transformation Format represented in 8 bits. 593 UTF-16 - A 16-bit format including the BMP (directly encoded) and 594 surrogate pairs to represent characters in planes 01-16; equivalent to 595 Unicode. 597 8 Bibliography 599 [ABNF] 601 D. Crocker, P. Overell, "Augmented BNF for Syntax Specifications: 602 ABNF", RFC 2234, November 1997. 604 [ASCII] 606 ANSI X3.4:1986 Coded Character Sets - 7 Bit American National 607 Standard Code for Information Interchange (7-bit ASCII) 609 [ISO-8859] 611 ISO 8859. International standard -- Information processing -- 8-bit 612 single-byte coded graphic character sets -- Part 1: Latin alphabet 613 No. 1 (1987) -- Part 2: Latin alphabet No. 2 (1987) -- Part 3: Latin 614 alphabet No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) -- Part 615 5: Latin/Cyrillic alphabet (1988) -- Part 6: Latin/Arabic alphabet 616 (1987) -- Part 7: Latin/Greek alphabet (1987) -- Part 8: Latin/Hebrew 617 alphabet (1988) -- Part 9: Latin alphabet No. 5 (1989) -- Part 10: 618 Latin alphabet No. 6 (1992) 620 [BCP14] 622 S. Bradner, "Key words for use in RFCs to Indicate Requirement 623 Levels", BCP 14, RFC 2119, March 1997. 625 [ISO-10646] 627 ISO/IEC 10646-1:1993. International standard -- Information 628 technology -- Universal multiple-octet coded character set (UCS) -- 629 Part 1: Architecture and basic multilingual plane. 631 [MLST] 633 R. Elz, P. Hethmon, "Extensions to FTP", Work in Progress, February 1999. 636 [RFC854] 638 J.
Postel, J. Reynolds, "Telnet Protocol Specification", RFC 854, May 639 1983. 641 [RFC959] 643 J. Postel, J. Reynolds, "File Transfer Protocol (FTP)", RFC 959, 644 October 1985. 646 [RFC1123] 648 R. Braden, "Requirements for Internet Hosts -- Application and 649 Support", RFC 1123, October 1989. 651 [RFC1738] 653 T. Berners-Lee, L. Masinter, M. McCahill, "Uniform Resource Locators 654 (URL)", RFC 1738, December 1994. 656 [RFC1766] 658 H. Alvestrand, "Tags for the Identification of Languages", RFC 1766, 659 March 1995. 661 [RFC2130] 663 C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. 664 Crispin, P. Svanberg, "Character Set Workshop Report", RFC 2130, 665 April 1997. 667 [RFC2277] 669 H. Alvestrand, "IETF Policy on Character Sets and Languages", RFC 670 2277, January 1998. 672 [RFC2279] 674 F. Yergeau, "UTF-8, a transformation format of ISO 10646", RFC 2279, 675 January 1998. 677 [2389] 679 R. Elz, P. Hethmon, "Feature Negotiation Mechanism for the File 680 Transfer Protocol", RFC 2389, August 1998. 682 [UNICODE] 684 The Unicode Consortium, "The Unicode Standard - Version 2.0", 685 Addison-Wesley Developers Press, July 1996. 687 [UTF-8] 689 ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation Format 8 690 (UTF-8). 692 9 Author's Address 694 JIEO 695 Attn JEBBD (Bill Curtin) 696 Ft. Monmouth, N.J. 697 07703-5613 698 curtinw@ftm.disa.mil 699 Annex A - Implementation Considerations 701 A.1 General Considerations 703 - Implementers should ensure that their code accounts for potential 704 problems, such as using a NULL character to terminate a string or no 705 longer being able to steal the high order bit for internal use, when 706 supporting the extended character set. 708 - Implementers should be aware that there is a chance that pathnames 709 that are non UTF-8 may be parsed as valid UTF-8. The probabilities 710 are low for some encodings and statistically zero for others.
711 A recent non-scientific analysis found that EUC encoded Japanese 712 words had a 2.7% false reading; SJIS had a 0.0005% false reading; 713 other encodings such as ASCII or KOI-8 have a 0% false reading. This 714 probability is highest for short pathnames and decreases as pathname 715 size increases. Implementers may want to look for signs that 716 pathnames which parse as UTF-8 are not valid UTF-8, such as the 717 existence of multiple local character sets in short pathnames. 718 Hopefully, as more implementations conform to UTF-8 transfer 719 encoding there will be less need to guess at the encoding. 721 - Client developers should be aware that it will be possible for 722 pathnames to contain mixed characters (e.g. 723 /Latin1DirectoryName/HebrewFileName). They should be prepared to 724 handle the Bi-directional (BIDI) display of these character sets 725 (i.e. left to right display for the directory and right to left 726 display for the filename). While bi-directional display is outside 727 the scope of this document and more complicated than the above 728 example, an algorithm for bi-directional display can be found in the 729 UNICODE 2.0 [UNICODE] standard. Also note that pathnames can have 730 different byte ordering yet be logically and display-wise equivalent 731 due to the insertion of BIDI control characters at different points 732 during composition. Mixed character sets may also 733 present problems with font swapping. 735 - A server that copies pathnames transparently from a local filesystem 736 may continue to do so. It is then up to the local file creators to 737 use UTF-8 pathnames. 739 - Servers can support charset labeling of files and/or directories, 740 such that different pathnames may have different charsets. The 741 server should attempt to convert all pathnames to UTF-8, but if it 742 cannot, then it should leave that name in its raw form.
744 - Some servers' operating systems do not mandate a character set, but allow 745 administrators to configure one in the FTP server. These servers 746 should be configured to use a particular mapping table (either 747 external or built-in). This will allow the flexibility of defining 748 different charsets for different directories. 750 - If the server's OS does not mandate the character set and the FTP 751 server cannot be configured, the server should simply use the raw 752 bytes in the file name. They might be ASCII or UTF-8. 754 - If the server is a mirror, and wants to look just like the site it 755 is mirroring, it should store the exact file name bytes that it 756 received from the main server. 758 A.2 Transition Considerations 760 - Servers which support this specification, when presented with a pathname 761 from an old client (one which does not support this specification), 762 can nearly always tell whether the pathname is in UTF-8 (see B.1) or 763 in some other code set. In order to support these older clients, 764 servers may wish to default to a non UTF-8 code set. However, how a 765 server supports non UTF-8 is outside the scope of this 766 specification. 768 - Clients which support this specification will be able to determine 769 if the server can support UTF-8 (i.e. supports this specification) 770 by the ability of the server to support the FEAT command and the 771 UTF8 feature (defined in 3.2). If a newer client determines that 772 the server does not support UTF-8, it may wish to default to a 773 different code set. Client developers should take into consideration 774 that pathnames associated with older servers might be stored in 775 UTF-8. However, how a client supports non UTF-8 is outside the scope 776 of this specification. 778 - Clients and servers can transition to UTF-8 either by converting 779 to/from the local encoding or by having users store UTF-8 filenames. 780 The former approach is easier on tightly controlled file systems 781 (e.g. PCs and Macs).
The latter approach is easier on more free form 782 file systems (e.g. Unix). 784 - For interactive use, attention should be focused on user interface 785 and ease of use. Non-interactive use requires consistent and 786 controlled behavior. 788 - There may be many applications which reference files under their old 789 raw pathname (e.g. linked URLs). Changing the pathname to UTF-8 will 790 cause access to the old URL to fail. A solution may be for the 791 server to act as if there were two different pathnames associated with 792 the file. This might be done internally by the server on controlled 793 file systems or by using symbolic links on free form systems. While 794 this approach may work for single file transfer non-interactive use, 795 a non-interactive transfer of all of the files in a directory will 796 produce duplicates. Interactive users may be presented with lists of 797 files which are double the actual number of files. 799 Annex B - Sample Code and Examples 801 B.1 Valid UTF-8 check 803 The following routine checks if a byte sequence is valid UTF-8. This 804 is done by checking for the proper tagging of the first and following 805 bytes to make sure they conform to the UTF-8 format. It then checks to 806 ensure that the data part of the UTF-8 sequence conforms to the proper 807 range allowed by the encoding. Note: This routine will not detect 808 characters that have not been assigned and therefore do not exist. 810 int utf8_valid(const unsigned char *buf, unsigned int len) 811 { 812 const unsigned char *endbuf = buf + len; 813 unsigned char byte2mask=0x00, c; 814 int trailing = 0; // trailing (continuation) bytes to follow 816 while (buf != endbuf) 817 { 818 c = *buf++; 819 if (trailing) 820 if ((c&0xC0) == 0x80) // Does trailing byte follow UTF-8 format? 821 {if (byte2mask) // Need to check 2nd byte for proper range? 822 if (c&byte2mask) // Are appropriate bits set?
823 byte2mask=0x00; 824 else 825 return 0; 826 trailing--; } 827 else 828 return 0; 829 else 830 if ((c&0x80) == 0x00) continue; // valid 1 byte UTF-8 831 else if ((c&0xE0) == 0xC0) // valid 2 byte UTF-8 832 if (c&0x1E) // Is UTF-8 byte in 833 // proper range? 834 trailing =1; 835 else 836 return 0; 837 else if ((c&0xF0) == 0xE0) // valid 3 byte UTF-8 838 {if (!(c&0x0F)) // Is UTF-8 byte in 839 // proper range? 840 byte2mask=0x20; // If not set mask 841 // to check next byte 842 trailing = 2;} 843 else if ((c&0xF8) == 0xF0) // valid 4 byte UTF-8 844 {if (!(c&0x07)) // Is UTF-8 byte in 845 // proper range? 846 byte2mask=0x30; // If not set mask 847 // to check next byte 848 trailing = 3;} 849 else if ((c&0xFC) == 0xF8) // valid 5 byte UTF-8 850 {if (!(c&0x03)) // Is UTF-8 byte in 851 // proper range? 852 byte2mask=0x38; // If not set mask 853 // to check next byte 854 trailing = 4;} 855 else if ((c&0xFE) == 0xFC) // valid 6 byte UTF-8 856 {if (!(c&0x01)) // Is UTF-8 byte in 857 // proper range? 858 byte2mask=0x3C; // If not set mask 859 // to check next byte 860 trailing = 5;} 861 else return 0; 862 } 863 return trailing == 0; 864 } 866 B.2 Conversions 868 The code examples in this section closely reflect the algorithm in ISO 869 10646 and may not present the most efficient solution for converting 870 to / from UTF-8 encoding. If efficiency is an issue, implementers 871 should use the appropriate bitwise operators. 873 Additional code examples and numerous mapping tables can be found at 874 the Unicode site, HTTP://www.unicode.org or FTP://unicode.org. 876 Note that the conversion examples below assume that the local 877 character set supported in the operating system is something other 878 than UCS2/UTF-16. There are some operating systems that already 879 support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no 880 conversion will be necessary from the local character set to the UCS. 
882 B.2.1 Conversion from Local Character Set to UTF-8 884 Conversion from the local filesystem character set to UTF-8 will 885 normally involve a two step process. First convert the local character 886 set to the UCS; then convert the UCS to UTF-8. 888 The first step in the process can be performed by maintaining a 889 mapping table that includes the local character set code and the 890 corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859] 891 code for the Hebrew letter "VAV" is 0xE4. The corresponding 4 byte 892 ISO/IEC 10646 code is 0x000005D5. 894 The next step is to convert the UCS character code to the UTF-8 895 encoding. The following routine can be used to determine and encode 896 the correct number of bytes based on the UCS-4 character code: 898 unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int 899 ucs4_len, unsigned char *utf8_buf) 901 { 902 const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len; 903 unsigned int utf8_len = 0; // return value for UTF8 size 904 unsigned char *t_utf8_buf = utf8_buf; // Temporary pointer 905 // to load UTF8 values 907 while (ucs4_buf != ucs4_endbuf) 908 { 909 if ( *ucs4_buf <= 0x7F) // ASCII chars no conversion needed 910 { 911 *t_utf8_buf++ = (unsigned char) *ucs4_buf; 912 utf8_len++; 913 ucs4_buf++; 914 } 915 else 916 if ( *ucs4_buf <= 0x07FF ) // In the 2 byte utf-8 range 917 { 918 *t_utf8_buf++= (unsigned char) (0xC0 + (*ucs4_buf/0x40)); 919 *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40)); 920 utf8_len+=2; 921 ucs4_buf++; 922 } 923 else 924 if ( *ucs4_buf <= 0xFFFF ) /* In the 3 byte utf-8 range. 
The 925 values 0x0000FFFE, 0x0000FFFF 926 and 0x0000D800 - 0x0000DFFF do 927 not occur in UCS-4 */ 928 { 929 *t_utf8_buf++= (unsigned char) (0xE0 + 930 (*ucs4_buf/0x1000)); 931 *t_utf8_buf++= (unsigned char) (0x80 + 932 ((*ucs4_buf/0x40)%0x40)); 933 *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40)); 934 utf8_len+=3; 935 ucs4_buf++; 936 } 937 else 938 if ( *ucs4_buf <= 0x1FFFFF ) //In the 4 byte utf-8 range 939 { 940 *t_utf8_buf++= (unsigned char) (0xF0 + 941 (*ucs4_buf/0x040000)); 942 *t_utf8_buf++= (unsigned char) (0x80 + 943 ((*ucs4_buf/0x1000)%0x40)); 944 *t_utf8_buf++= (unsigned char) (0x80 + 945 ((*ucs4_buf/0x40)%0x40)); 946 *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40)); 947 utf8_len+=4; 948 ucs4_buf++; 950 } 951 else 952 if ( *ucs4_buf <= 0x03FFFFFF )//In the 5 byte utf-8 range 953 { 954 *t_utf8_buf++= (unsigned char) (0xF8 + 955 (*ucs4_buf/0x01000000)); 956 *t_utf8_buf++= (unsigned char) (0x80 + 957 ((*ucs4_buf/0x040000)%0x40)); 958 *t_utf8_buf++= (unsigned char) (0x80 + 959 ((*ucs4_buf/0x1000)%0x40)); 960 *t_utf8_buf++= (unsigned char) (0x80 + 961 ((*ucs4_buf/0x40)%0x40)); 962 *t_utf8_buf++= (unsigned char) (0x80 + 963 (*ucs4_buf%0x40)); 964 utf8_len+=5; 965 ucs4_buf++; 966 } 967 else 968 if ( *ucs4_buf <= 0x7FFFFFFF )//In the 6 byte utf-8 range 969 { 970 *t_utf8_buf++= (unsigned char) 971 (0xFC +(*ucs4_buf/0x40000000)); 972 *t_utf8_buf++= (unsigned char) (0x80 + 973 ((*ucs4_buf/0x01000000)%0x40)); 974 *t_utf8_buf++= (unsigned char) (0x80 + 975 ((*ucs4_buf/0x040000)%0x40)); 976 *t_utf8_buf++= (unsigned char) (0x80 + 977 ((*ucs4_buf/0x1000)%0x40)); 978 *t_utf8_buf++= (unsigned char) (0x80 + 979 ((*ucs4_buf/0x40)%0x40)); 980 *t_utf8_buf++= (unsigned char) (0x80 + 981 (*ucs4_buf%0x40)); 982 utf8_len+=6; 983 ucs4_buf++; 985 } 986 } 987 return (utf8_len); 988 } 989 B.2.2 Conversion from UTF-8 to Local Character Set 991 When moving from UTF-8 encoding to the local character set the reverse 992 procedure is used.
First the UTF-8 encoding is transformed into the 993 UCS-4 character set. The UCS-4 is then converted to the local 994 character set from a mapping table (i.e. the opposite of the table 995 used to form the UCS-4 character code). 997 To convert from UTF-8 to UCS-4 the free bits (those that do not define 998 UTF-8 sequence size or signify continuation bytes) in a UTF-8 sequence 999 are concatenated as a bit string. The bits are then distributed into a 1000 four-byte sequence starting from the least significant bits. Those 1001 bits not assigned a bit in the four-byte sequence are padded with ZERO 1002 bits. The following routine converts the UTF-8 encoding to UCS-4 1003 character codes: 1005 int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len, 1006 unsigned char *utf8_buf) 1007 { 1009 const unsigned char *utf8_endbuf = utf8_buf + utf8_len; 1010 unsigned int ucs_len=0; 1012 while (utf8_buf != utf8_endbuf) 1013 { 1015 if ((*utf8_buf & 0x80) == 0x00) /*ASCII chars no conversion 1016 needed */ 1017 { 1018 *ucs4_buf++ = (unsigned long) *utf8_buf; 1019 utf8_buf++; 1020 ucs_len++; 1021 } 1022 else 1023 if ((*utf8_buf & 0xE0)== 0xC0) //In the 2 byte utf-8 range 1024 { 1025 *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xC0) * 0x40) 1026 + ( *(utf8_buf+1) - 0x80)); 1027 utf8_buf += 2; 1028 ucs_len++; 1029 } 1030 else 1031 if ( (*utf8_buf & 0xF0) == 0xE0 ) /*In the 3 byte utf-8 1032 range */ 1033 { 1034 *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xE0) * 0x1000) 1035 + (( *(utf8_buf+1) - 0x80) * 0x40) 1036 + ( *(utf8_buf+2) - 0x80)); 1037 utf8_buf+=3; 1038 ucs_len++; 1039 } 1040 else 1041 if ((*utf8_buf & 0xF8) == 0xF0) /* In the 4 byte utf-8 1042 range */ 1043 { 1044 *ucs4_buf++ = (unsigned long) 1045 (((*utf8_buf - 0xF0) * 0x040000) 1046 + (( *(utf8_buf+1) - 0x80) * 0x1000) 1047 + (( *(utf8_buf+2) - 0x80) * 0x40) 1048 + ( *(utf8_buf+3) - 0x80)); 1049 utf8_buf+=4; 1050 ucs_len++; 1051 } 1052 else 1053 if ((*utf8_buf & 0xFC) == 0xF8) /* In the 5 byte utf-8 1054 
range */ 1055 { 1056 *ucs4_buf++ = (unsigned long) 1057 (((*utf8_buf - 0xF8) * 0x01000000) 1058 + ((*(utf8_buf+1) - 0x80) * 0x040000) 1059 + (( *(utf8_buf+2) - 0x80) * 0x1000) 1060 + (( *(utf8_buf+3) - 0x80) * 0x40) 1061 + ( *(utf8_buf+4) - 0x80)); 1062 utf8_buf+=5; 1063 ucs_len++; 1064 } 1065 else 1066 if ((*utf8_buf & 0xFE) == 0xFC) /* In the 6 byte utf-8 1067 range */ 1068 { 1069 *ucs4_buf++ = (unsigned long) 1070 (((*utf8_buf - 0xFC) * 0x40000000) 1071 + ((*(utf8_buf+1) - 0x80) * 0x01000000) 1072 + ((*(utf8_buf+2) - 0x80) * 0x040000) 1073 + (( *(utf8_buf+3) - 0x80) * 0x1000) 1074 + (( *(utf8_buf+4) - 0x80) * 0x40) 1075 + ( *(utf8_buf+5) - 0x80)); 1076 utf8_buf+=6; 1077 ucs_len++; 1078 } 1080 } 1081 return (ucs_len); 1082 } 1083 B.2.3 ISO/IEC 8859-8 Example 1085 This example demonstrates mapping the ISO/IEC 8859-8 character set to UTF- 1086 8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew letter 1087 "VAV" is converted from the ISO/IEC 8859-8 character code 0xE4 to the 1088 corresponding 4 byte ISO/IEC 10646 code of 0x000005D5 by a simple 1089 lookup of a conversion/mapping file. 1091 The UCS-4 character code is transformed into UTF-8 using the 1092 ucs4_to_utf8 routine described earlier by: 1094 1. Because the UCS-4 character is between 0x80 and 0x07FF it will map 1095 to a 2 byte UTF-8 sequence. 1096 2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) = 0xD7. 1097 3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) = 0x95. 1099 The UTF-8 encoding is transformed back to UCS-4 by using the 1100 utf8_to_ucs4 routine described earlier by: 1102 1. Because the first byte of the sequence, when the '&' operator with 1103 a value of 0xE0 is applied, will produce 0xC0 (0xD7 & 0xE0 = 0xC0) 1104 the UTF-8 is a 2 byte sequence. 1105 2. The four byte UCS-4 character code is produced by (((0xD7 - 0xC0) 1106 * 0x40) + (0x95 - 0x80)) = 0x000005D5.
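The arithmetic in the two lists above can be checked mechanically. The following is a minimal sketch, not the ucs4_to_utf8 and utf8_to_ucs4 routines themselves: it covers only the 2 byte case, and the names encode_2byte and decode_2byte are hypothetical helpers introduced for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: encode one UCS-4 value in the range
   0x80..0x07FF as a 2 byte UTF-8 sequence. */
static void encode_2byte(unsigned long ucs4, unsigned char out[2])
{
    assert(ucs4 >= 0x80UL && ucs4 <= 0x07FFUL);
    out[0] = (unsigned char)(0xC0 + (ucs4 / 0x40));   /* top 5 bits */
    out[1] = (unsigned char)(0x80 + (ucs4 % 0x40));   /* low 6 bits */
}

/* Hypothetical helper: decode a 2 byte UTF-8 sequence to UCS-4. */
static unsigned long decode_2byte(const unsigned char in[2])
{
    assert((in[0] & 0xE0) == 0xC0);   /* lead byte of a 2 byte sequence */
    assert((in[1] & 0xC0) == 0x80);   /* continuation byte */
    return (unsigned long)(((in[0] - 0xC0) * 0x40) + (in[1] - 0x80));
}
```

With the input 0x05D5, encode_2byte stores 0xD7 and 0x95, and decode_2byte applied to those two bytes returns 0x05D5, as in the steps above.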
1108 Finally, the UCS-4 character code is converted to ISO/IEC 8859-8 1109 character code (using the mapping table which matches ISO/IEC 8859-8 1110 to UCS-4) to produce the original 0xE4 code for the Hebrew letter 1111 "VAV". 1113 B.2.4 Vendor Codepage Example 1115 This example demonstrates the mapping of a codepage to UTF-8 and back 1116 to a vendor codepage. Mapping between vendor codepages can be done in 1117 a manner very similar to that described above. For instance both the PC and 1118 Mac codepages reflect the character set from the Thai standard TIS 1119 620-2533. The character code on both platforms for the Thai letter "SO 1120 SO" is 0xAB. This character can then be mapped into the UCS-4 by way 1121 of a conversion/mapping file to produce the UCS-4 code of 0x0E0B. 1123 The UCS-4 character code is transformed into UTF-8 using the 1124 ucs4_to_utf8 routine described earlier by: 1126 1. Because the UCS-4 character is between 0x0800 and 0xFFFF it will 1127 map to a 3 byte UTF-8 sequence. 1128 2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000)) = 1129 0xE0. 1130 3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) % 1131 0x40)) = 0xB8. 1132 4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) = 0x8B. 1134 The UTF-8 encoding is transformed back to UCS-4 by using the 1135 utf8_to_ucs4 routine described earlier by: 1137 1. Because the first byte of the sequence, when the '&' operator with 1138 a value of 0xF0 is applied, will produce 0xE0 (0xE0 & 0xF0 = 0xE0) 1139 the UTF-8 is a 3 byte sequence. 1140 2. The four byte UCS-4 character code is produced by (((0xE0 - 0xE0) 1141 * 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B - 0x80)) = 0x00000E0B. 1143 Finally, the UCS-4 character code is converted to either the PC or Mac 1144 codepage character code (using the mapping table which matches 1145 codepage to UCS-4) to produce the original 0xAB code for the Thai 1146 letter "SO SO".
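The Thai example can be traced end to end in the same way. In the sketch below the mapping table holds only the single entry given in the text (0xAB to 0x0E0B); a real table would cover the whole codepage. The names cp_entry, cp_table, cp_to_ucs4, encode_3byte and decode_3byte are assumptions made for this illustration and are not part of the routines in B.2.1 and B.2.2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping-table entry: local codepage byte -> UCS-4 code. */
struct cp_entry { unsigned char local; unsigned long ucs4; };

static const struct cp_entry cp_table[] = {
    { 0xAB, 0x0E0BUL }   /* Thai letter SO SO, the one pair given in the text */
};

/* Look up a codepage byte; returns 0 when the table has no entry. */
static unsigned long cp_to_ucs4(unsigned char local)
{
    size_t i;
    for (i = 0; i < sizeof cp_table / sizeof cp_table[0]; i++)
        if (cp_table[i].local == local)
            return cp_table[i].ucs4;
    return 0;
}

/* Encode one UCS-4 value in 0x0800..0xFFFF as a 3 byte UTF-8 sequence. */
static void encode_3byte(unsigned long ucs4, unsigned char out[3])
{
    assert(ucs4 >= 0x0800UL && ucs4 <= 0xFFFFUL);
    out[0] = (unsigned char)(0xE0 + (ucs4 / 0x1000));
    out[1] = (unsigned char)(0x80 + ((ucs4 / 0x40) % 0x40));
    out[2] = (unsigned char)(0x80 + (ucs4 % 0x40));
}

/* Decode a 3 byte UTF-8 sequence back to its UCS-4 value. */
static unsigned long decode_3byte(const unsigned char in[3])
{
    assert((in[0] & 0xF0) == 0xE0);   /* lead byte of a 3 byte sequence */
    return (unsigned long)(((in[0] - 0xE0) * 0x1000)
                         + ((in[1] - 0x80) * 0x40)
                         +  (in[2] - 0x80));
}
```

Passing the result of cp_to_ucs4(0xAB) to encode_3byte yields the bytes 0xE0 0xB8 0x8B, and decode_3byte applied to those bytes returns 0x0E0B, matching the steps above.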
1148 B.3 Pseudo Code for a High-Quality Translating Server 1150 if utf8_valid(fn) 1151 { 1152 attempt to convert fn to the local charset, producing localfn 1153 if (conversion fails temporarily) return error 1154 if (conversion succeeds) 1155 { 1156 attempt to open localfn 1157 if (open fails temporarily) return error 1158 if (open succeeds) return success 1159 } 1160 } 1161 attempt to open fn 1162 if (open fails temporarily) return error 1163 if (open succeeds) return success 1164 return permanent error
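The pseudo code above could be rendered in C along the following lines. This is only a sketch under stated assumptions: the result codes, the function-pointer types and the name translate_and_open are hypothetical, and the three stub functions stand in for utf8_valid, a real charset converter and a real file open so that the control flow can be exercised without touching a filesystem.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for the three outcomes in the pseudo code. */
enum xl_result { XL_SUCCESS, XL_TEMP_ERROR, XL_PERM_ERROR };

typedef int (*xl_valid_fn)(const unsigned char *buf, unsigned int len);
typedef enum xl_result (*xl_convert_fn)(const char *fn, char *localfn,
                                        size_t cap);
typedef enum xl_result (*xl_open_fn)(const char *path);

enum xl_result translate_and_open(const char *fn, xl_valid_fn valid,
                                  xl_convert_fn convert, xl_open_fn open_path)
{
    char localfn[256];
    enum xl_result r;

    if (valid((const unsigned char *)fn, (unsigned int)strlen(fn))) {
        r = convert(fn, localfn, sizeof localfn);
        if (r == XL_TEMP_ERROR) return XL_TEMP_ERROR;
        if (r == XL_SUCCESS) {
            r = open_path(localfn);
            if (r == XL_TEMP_ERROR) return XL_TEMP_ERROR;
            if (r == XL_SUCCESS) return XL_SUCCESS;
        }
    }
    /* Fall back to the pathname bytes exactly as received. */
    r = open_path(fn);
    if (r == XL_TEMP_ERROR) return XL_TEMP_ERROR;
    if (r == XL_SUCCESS) return XL_SUCCESS;
    return XL_PERM_ERROR;
}

/* Stub: accepts 7-bit names only; a server would call utf8_valid here. */
static int stub_valid(const unsigned char *buf, unsigned int len)
{
    unsigned int i;
    for (i = 0; i < len; i++)
        if (buf[i] & 0x80) return 0;
    return 1;
}

/* Stub: pretends the local charset equals the input encoding. */
static enum xl_result stub_convert(const char *fn, char *localfn, size_t cap)
{
    if (strlen(fn) + 1 > cap) return XL_PERM_ERROR;
    strcpy(localfn, fn);
    return XL_SUCCESS;
}

/* Stub: a fixed set of outcomes instead of real file I/O. */
static enum xl_result stub_open(const char *path)
{
    if (strcmp(path, "readme.txt") == 0) return XL_SUCCESS;
    if (strcmp(path, "\xE4.txt") == 0)   return XL_SUCCESS;
    if (strcmp(path, "busy.txt") == 0)   return XL_TEMP_ERROR;
    return XL_PERM_ERROR;
}
```

With these stubs, "readme.txt" succeeds through the converted path, "busy.txt" reports a temporary error, the raw name "\xE4.txt" fails the validity check and succeeds through the fallback open, and any other name ends in a permanent error.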