idnits 2.17.1

draft-ietf-ftpext-intl-ftp-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There is 1 instance of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** The abstract seems to contain references ([RFC959], [RFC1123]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     RFC 2119 keyword, line 122: '...is character set SHALL be ISO/IEC 1064...'

     RFC 2119 keyword, line 123: '...bility it is STRONGLY RECOMMENDED that...'

     RFC 2119 keyword, line 129: '...d to store files SHALL remain a local ...'

     RFC 2119 keyword, line 130: '... and MAY depend on the capability of ...'

     RFC 2119 keyword, line 131: '...f pathnames they SHOULD be converted i...'

     (66 more instances...)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 648 has weird spacing: '...ication and...'

  == Line 861 has weird spacing: '... else retur...'

  -- The exact meaning of the all-uppercase expression 'NOT REQUIRED' is not
     defined in RFC 2119.  If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'BCP14' on line 620 looks like a reference

  -- Missing reference section? 'RFC959' on line 641 looks like a reference

  -- Missing reference section? 'RFC1123' on line 646 looks like a reference

  -- Missing reference section? 'ASCII' on line 604 looks like a reference

  -- Missing reference section? 'ISO-8859' on line 890 looks like a reference

  -- Missing reference section? 'ISO-10646' on line 625 looks like a reference

  -- Missing reference section? 'UTF-8' on line 687 looks like a reference

  -- Missing reference section? 'RFC 2277' on line 107 looks like a reference

  -- Missing reference section? 'UNICODE' on line 729 looks like a reference

  -- Missing reference section? 'RFC2279' on line 672 looks like a reference

  -- Missing reference section? 'ABNF' on line 599 looks like a reference

  -- Missing reference section? 'RFC854' on line 636 looks like a reference

  -- Missing reference section? '2389' on line 677 looks like a reference

  -- Missing reference section? 'RFC1738' on line 651 looks like a reference

  -- Missing reference section? 'RFC2130' on line 661 looks like a reference

  -- Missing reference section? 'MLST' on line 631 looks like a reference

  -- Missing reference section? 'RFC1766' on line 431 looks like a reference

  -- Missing reference section? 'RFC2277' on line 667 looks like a reference


     Summary: 7 errors (**), 0 flaws (~~), 3 warnings (==), 22 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	FTPEXT Working Group                                     B. Curtin
3	INTERNET DRAFT                  Defense Information Systems Agency
4	Expires 7 October, 1999                              7 April, 1999

6	           Internationalization of the File Transfer Protocol
7	                   <draft-ietf-ftpext-intl-ftp-06.txt>

9	Status of this Memo

11	   This document is an Internet-Draft and is in full conformance with
12	   all provisions of Section 10 of RFC2026.

14	   Internet-Drafts are working documents of the Internet Engineering
15	   Task Force (IETF), its areas, and its working groups.  Note that
16	   other groups may also distribute working documents as Internet-
17	   Drafts.

19	   Internet-Drafts are draft documents valid for a maximum of six months
20	   and may be updated, replaced, or obsoleted by other documents at any
21	   time.  It is inappropriate to use Internet-Drafts as reference
22	   material or to cite them other than as "work in progress."

24	   The list of current Internet-Drafts can be accessed at
25	   http://www.ietf.org/ietf/1id-abstracts.txt.

27	   To view the list of Internet-Draft Shadow Directories, see
28	   http://www.ietf.org/shadow.html.

30	  Distribution of this document is unlimited.  Please send comments to
31	  the FTP Extension working group (FTPEXT-WG) of the Internet
32	  Engineering Task Force (IETF) at .
33	  Subscription address is . Discussions
34	  of the group are archived at .

37	  The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
38	  "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
39	  document are to be interpreted as described in BCP 14 [BCP14].

41	Abstract

43	  The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
44	  1123 Section 4 [RFC1123], is one of the oldest and most widely used
45	  protocols on the Internet. The protocol's primary character set, 7 bit
46	  ASCII, has served the protocol well through the early growth years of
47	  the Internet. However, as the Internet becomes more global, there is a
48	  need to support character sets beyond 7 bit ASCII.

50	  This document addresses the internationalization (I18n) of FTP, which
51	  includes supporting the multiple character sets and languages found
52	  throughout the Internet community.  This is achieved by extending the
53	  FTP specification and giving recommendations for proper
54	  internationalization support.

56	Table of Contents

58	ABSTRACT
59	1 INTRODUCTION
60	2 INTERNATIONALIZATION
61	 2.1 International Character Set
62	 2.2 Transfer Encoding
63	3 PATHNAMES
64	 3.1 General compliance
65	 3.2 Servers compliance
66	 3.3 Clients compliance
67	4 LANGUAGE SUPPORT
68	 4.1 The LANG command
69	 4.2 Syntax of the LANG command
70	 4.3 Feat response for LANG command
71	  4.3.1 Feat examples
72	5 SECURITY
73	6 ACKNOWLEDGMENTS
74	7 GLOSSARY
75	8 BIBLIOGRAPHY
76	9 AUTHOR'S ADDRESS
77	ANNEX A - IMPLEMENTATION CONSIDERATIONS
78	 A.1 General Considerations
79	 A.2 Transition Considerations
80	ANNEX B - SAMPLE CODE AND EXAMPLES
81	 B.1 Valid UTF-8 check
82	 B.2 Conversions
83	  B.2.1 Conversion from Local Character Set to UTF-8
84	  B.2.2 Conversion from UTF-8 to Local Character Set
85	  B.2.3 ISO/IEC 8859-8 Example
86	  B.2.4 Vendor Codepage Example
87	 B.3 Pseudo Code for Translating Servers
88	1 Introduction

90	  As the Internet grows throughout the world the requirement to support
91	  character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859]
92	  character set becomes ever more urgent.  For FTP, because of the large
93	  installed base, it is paramount that this is done without breaking
94	  existing clients and servers. This document addresses this need. In
95	  doing so it defines a solution which will still allow the installed
96	  base to interoperate with new clients and servers.

98	  This document enhances the capabilities of the File Transfer Protocol
99	  by removing the 7-bit restrictions on pathnames used in client
100	  commands and server responses. It RECOMMENDS the use of a Universal
101	  Character Set (UCS), ISO/IEC 10646 [ISO-10646], RECOMMENDS a UCS
102	  transformation format (UTF), UTF-8 [UTF-8], and defines a new command
103	  for language negotiation.

105	  The recommendations made in this document are consistent with the
106	  recommendations expressed by the IETF policy related to character sets
107	  and languages as defined in RFC 2277 [RFC 2277].

109	2 Internationalization

111	  The File Transfer Protocol was developed when the predominant
112	  character sets were 7 bit ASCII and 8 bit EBCDIC. Today these
113	  character sets cannot support the wide range of characters needed by
114	  multinational systems. Given that there are a number of character sets
115	  in current use that provide more characters than 7-bit ASCII, it makes
116	  sense to decide on a convenient way to represent the union of those
117	  possibilities. Global use requires either support for a number of
118	  character sets and conversion between them, or the use of a
119	  single preferred character set. To assure global interoperability this
120	  document RECOMMENDS the latter approach and defines a single character
121	  set, in addition to NVT ASCII and EBCDIC, which is understandable by
122	  all systems. For FTP this character set SHALL be ISO/IEC 10646:1993.
123	  For support of global compatibility it is STRONGLY RECOMMENDED that
124	  clients and servers use UTF-8 encoding when exchanging pathnames.
125	  Clients and servers are, however, under no obligation to perform any
126	  conversion on the contents of a file for operations such as STOR or
127	  RETR.

129	  The character set used to store files SHALL remain a local decision
130	  and MAY depend on the capability of local operating systems. Prior to
131	  the exchange of pathnames they SHOULD be converted into an ISO/IEC
132	  10646 format and UTF-8 encoded. This approach, while allowing
133	  international exchange of pathnames, will still allow backward
134	  compatibility with older systems because the code set positions for
135	  ASCII characters are identical to the one byte sequence in UTF-8.

137	  Sections 2.1 and 2.2 give a brief description of the international
138	  character set and transfer encoding RECOMMENDED by this document. A
139	  more thorough description of UTF-8, ISO/IEC 10646, and UNICODE
140	  [UNICODE], beyond that given in this document, can be found in RFC
141	  2279 [RFC2279].

143	2.1 International Character Set

145	  The character set defined for international support of FTP SHALL be
146	  the Universal Character Set as defined in ISO 10646:1993 as amended.
147	  This standard incorporates the character sets of many existing
148	  international, national, and corporate standards. ISO/IEC 10646
149	  defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
150	  four byte (31 bit) encoding containing 2**31 code positions divided
151	  into 128 groups of 256 planes. Each plane consists of 256 rows of 256
152	  cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
153	  zero or the Basic Multilingual Plane (BMP).  Currently, no codesets
154	  have been defined outside of the 2 byte BMP.

156	  The Unicode standard version 2.0 [UNICODE] is consistent with the UCS-
157	  2 subset of ISO/IEC 10646. The Unicode standard version 2.0 includes
158	  the repertoire of IS 10646 characters, amendments 1-7 of IS 10646, and
159	  editorial and technical corrigenda.

161	2.2 Transfer Encoding

163	  UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
164	  or UTF-FSS, SHALL be used as a transfer encoding to transmit the
165	  international character set. UTF-8 is a file safe encoding which
166	  avoids the use of byte values that have special significance during
167	  the parsing of pathname character strings. UTF-8 is an 8 bit encoding
168	  of the characters in the UCS. Some of UTF-8's benefits are that it is
169	  compatible with 7 bit ASCII, so it doesn't affect programs that give
170	  special meanings to various ASCII characters; it is immune to
171	  synchronization errors; its encoding rules allow for easy
172	  identification; and it has enough space to support a large number of
173	  character sets.

175	  UTF-8 encoding represents each UCS character as a sequence of 1 to 6
176	  bytes in length. For all sequences of one byte the most significant
177	  bit is ZERO. For all sequences of more than one byte the first byte
178	  begins, at the most significant bit position, with a run of ONE bits
179	  whose count equals the number of bytes in the UTF-8 sequence, followed
180	  by a ZERO bit. For example, the first byte of a 3 byte UTF-8 sequence
181	  would have 1110 as its most significant bits. Each additional byte
182	  (continuing byte) in the UTF-8 sequence contains a ONE bit followed
183	  by a ZERO bit as its most significant bits. The remaining free bit
184	  positions in the continuing bytes are used to identify characters in
185	  the UCS. The relationship between UCS and UTF-8 is demonstrated in the
186	  following table:

188	  UCS-4 range(hex)          UTF-8 byte sequence(binary)
189	  00000000 - 0000007F       0xxxxxxx
190	  00000080 - 000007FF       110xxxxx 10xxxxxx
191	  00000800 - 0000FFFF       1110xxxx 10xxxxxx 10xxxxxx
192	  00010000 - 001FFFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
193	  00200000 - 03FFFFFF       111110xx 10xxxxxx 10xxxxxx 10xxxxxx
194	                            10xxxxxx
195	  04000000 - 7FFFFFFF       1111110x 10xxxxxx 10xxxxxx 10xxxxxx
196	                            10xxxxxx 10xxxxxx
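
  The table above can be read as an encoding procedure. The following
  Python sketch is an illustration only and is not part of this
  specification; the function name ucs4_to_utf8 and the two lookup
  tables are hypothetical. It implements the 1 to 6 byte forms by hand
  because later revisions of UTF-8 limit sequences to 4 bytes, so
  library encoders do not produce the 5 and 6 byte rows.

```python
# Upper bound of each UCS-4 range in the table; position n-1 holds the
# largest code point that fits in an n byte sequence.
_RANGE_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

# Marker bits of the first byte for sequences of 2 to 6 bytes
# (110xxxxx, 1110xxxx, 11110xxx, 111110xx, 1111110x).
_LEAD_MARKERS = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}


def ucs4_to_utf8(code_point: int) -> bytes:
    """Encode one UCS-4 code point using the table in section 2.2."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31 bit UCS-4 range")
    if code_point <= 0x7F:
        # Single byte form: identical to 7 bit ASCII.
        return bytes([code_point])
    # Pick the shortest sequence length whose range contains the value.
    length = next(n for n, limit in enumerate(_RANGE_LIMITS, start=1)
                  if code_point <= limit)
    tail = []
    for _ in range(length - 1):
        # Each continuing byte is 10xxxxxx and carries six payload bits.
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    # The bits left over go into the first byte after its marker.
    return bytes([_LEAD_MARKERS[length] | code_point] + tail[::-1])
```

  For code points up to 0x10FFFF (surrogates excepted) the result
  agrees with Python's built-in "utf-8" codec.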

198	  A beneficial property of UTF-8 is that its single byte sequence is
199	  consistent with the ASCII character set. This feature will allow a
200	  transition where old ASCII-only clients can still interoperate with
201	  new servers that support the UTF-8 encoding.

203	  Another feature is that the encoding rules make it very unlikely that
204	  a character sequence from a different character set will be mistaken
205	  for a UTF-8 encoded character sequence. Clients and servers can use a
206	  simple routine to determine if the character set being exchanged is
207	  valid UTF-8. Section B.1 shows a code example of this check.
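
  As an informal companion to that check, the following Python sketch
  tests only the structure described in this section: the first byte
  announces the sequence length and every continuing byte starts with
  the bits 10. The function name is_valid_utf8 is hypothetical, and the
  sketch deliberately does not reject overlong encodings, which a
  stricter validator would also refuse.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Structural UTF-8 check following the 1 to 6 byte table."""
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            trailing = 0          # 0xxxxxxx
        elif 0xC0 <= first <= 0xDF:
            trailing = 1          # 110xxxxx
        elif 0xE0 <= first <= 0xEF:
            trailing = 2          # 1110xxxx
        elif 0xF0 <= first <= 0xF7:
            trailing = 3          # 11110xxx
        elif 0xF8 <= first <= 0xFB:
            trailing = 4          # 111110xx
        elif 0xFC <= first <= 0xFD:
            trailing = 5          # 1111110x
        else:
            # A stray continuing byte (10xxxxxx) or 0xFE / 0xFF.
            return False
        chunk = data[i + 1:i + 1 + trailing]
        if len(chunk) != trailing:
            return False          # sequence is truncated
        if any((b & 0xC0) != 0x80 for b in chunk):
            return False          # continuing byte lacks the 10 prefix
        i += 1 + trailing
    return True
```

  A pathname in ISO/IEC 8859-1 such as the bytes E9 74 E9 fails this
  test because E9 announces a 3 byte sequence but is followed by an
  ASCII byte.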

209	3 Pathnames

211	3.1 General compliance

213	  - The 7-bit restriction for pathnames exchanged is dropped.

215	  - Many operating systems allow the use of spaces <SP>, carriage return
216	    <CR>, and line feed <LF> characters as part of the pathname. The
217	    exchange of pathnames with these special command characters will
218	    cause the pathnames to be parsed improperly. This is because FTP
219	    commands associated with pathnames have the form:

221	       COMMAND <SP> <pathname> <CRLF>.

223	    To allow the exchange of pathnames containing these characters, the
224	    definition of pathname is changed from
225	      <pathname> ::= <string>   ; in BNF format
226	    to
227	      pathname = 1*(%x01..%xFF) ; in ABNF format [ABNF].

229	  To avoid mistaking these characters within pathnames as special
230	  command characters the following rules will apply:

232	  There MUST be only one <SP> between an FTP command and the pathname.
233	  Implementations MUST assume <SP> characters following the initial
234	  <SP> as part of the pathname. For example the pathname in STOR
235	  <SP><SP><SP>foo.bar<CRLF> is <SP><SP>foo.bar.

237	  Current implementations, which may allow multiple <SP> characters as
238	  separators between the command and pathname, MUST assure that they
239	  comply with this single <SP> convention. Note: Implementations which
240	  treat 3 character commands (e.g. CWD, MKD, etc.) as a fixed 4
241	  character command by padding the command with a trailing <SP> are
242	  not in compliance with this specification.

244	  When a <CR> character is encountered as part of a pathname it MUST
245	  be padded with a <NUL> character prior to sending the command. On
246	  receipt of a pathname containing a <CR><NUL> sequence the <NUL>
247	  character MUST be stripped away. This approach is described in the
248	  Telnet protocol [RFC854] on pages 11 and 12. For example, to store a
249	  pathname foo<CR><LF>boo.bar the pathname would become
250	  foo<CR><NUL><LF>boo.bar prior to sending the command STOR
251	  <SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon receipt of the altered
252	  pathname the <NUL> character following the <CR> would be stripped
253	  away to form the original pathname.

255	  - Conforming clients and servers MUST support UTF-8 for the transfer
256	    and receipt of pathnames. Clients and servers MAY in addition give
257	    users a choice of specifying interpretation of pathnames in another
258	    encoding. Note that configuring clients and servers to use character
259	    sets / encoding other than UTF-8 is outside of the scope of this
260	    document. While it is recognized that in certain operational
261	    scenarios this may be desirable, this is left as a quality of
262	    implementation and operational issue.

264	  - Pathnames are sequences of bytes.  The encoding of names that are
265	    valid UTF-8 sequences is assumed to be UTF-8.  The character set of
266	    other names is undefined. Clients and servers, unless otherwise
267	    configured to support a specific native character set, MUST check
268	    for a valid UTF-8 byte sequence to determine if the pathname being
269	    presented is UTF-8.

271	  - To avoid data loss, clients and servers SHOULD use the UTF-8
272	    encoded pathnames when unable to convert them to a usable code set.

274	  - There may be cases when the code set / encoding presented to the
275	    server or client cannot be determined. In such cases the raw bytes
276	    SHOULD be used.
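
  The framing rules of this section (exactly one space between the
  command and the pathname, and a NUL byte inserted after every
  carriage return inside a pathname) can be sketched in Python as
  follows. This is an illustration under stated assumptions, not a
  reference implementation; the function names are hypothetical.

```python
def escape_pathname(path: bytes) -> bytes:
    """Pad every CR in a pathname with NUL before sending."""
    return path.replace(b"\r", b"\r\x00")


def unescape_pathname(wire: bytes) -> bytes:
    """Strip the NUL that follows a CR in a received pathname."""
    return wire.replace(b"\r\x00", b"\r")


def build_command(verb: bytes, path: bytes) -> bytes:
    """Frame a command as: verb, one space, pathname, CRLF."""
    # pathname = 1*(%x01..%xFF): it is non-empty and contains no NUL.
    if not path or b"\x00" in path:
        raise ValueError("pathname must be non-empty and contain no NUL")
    return verb + b" " + escape_pathname(path) + b"\r\n"


def parse_command(line: bytes):
    """Split a received line into (verb, pathname)."""
    if not line.endswith(b"\r\n"):
        raise ValueError("command line must end with CRLF")
    # Only the first space is a separator; any further spaces belong
    # to the pathname.
    verb, _sep, argument = line[:-2].partition(b" ")
    return verb.upper(), unescape_pathname(argument)
```

  Because an embedded carriage return is always followed by NUL on the
  wire, the byte pair CR LF appears only as the command terminator, so
  a receiver can still split its input stream on CR LF.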

278	3.2 Servers compliance

280	  - Servers MUST support the UTF-8 feature in response to the FEAT
281	    command [2389]. The UTF-8 feature is a line containing the exact
282	    string "UTF8". This string is not case sensitive, but SHOULD be
283	    transmitted in upper case. The response to a FEAT command SHOULD be:

285	       C> feat
286	       S> 211- 
287	       S>  ...
288	       S>  UTF8
289	       S>  ...
290	       S> 211 end

292	    The ellipses indicate placeholders where other features may be
293	    included, but are NOT REQUIRED. The one space indentation of the
294	    feature lines is mandatory [2389].

296	  - Mirror servers may want to exactly reflect the site that they are
297	    mirroring. In such cases servers MAY store and present the exact
298	    pathname bytes that they received from the main server.
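
  A client-side reading of the FEAT response described above might look
  like the following Python sketch. It is illustrative only: the
  function names are hypothetical, and the reply is assumed to have
  already been split into lines.

```python
def parse_feat_response(lines):
    """Return the feature lines of a 211 FEAT reply, without indent."""
    features = []
    for line in lines:
        # Feature lines are indented by exactly one space; the opening
        # and closing 211 lines are not indented.
        if line.startswith(" ") and not line.startswith("  "):
            features.append(line[1:].rstrip("\r\n"))
    return features


def server_supports_utf8(lines) -> bool:
    """True if the reply carries the UTF8 feature, in any letter case."""
    return any(f.upper() == "UTF8" for f in parse_feat_response(lines))
```

  For example, a reply whose feature lines are " UTF8" and " LANG EN*"
  yields the list ["UTF8", "LANG EN*"], and server_supports_utf8
  returns True for it.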

300	3.3 Clients compliance

302	  - Clients which do not require display of pathnames are under no
303	    obligation to do so. Non-display clients do not need to conform to
304	    requirements associated with display.

306	  - Clients that are presented UTF-8 pathnames by the server SHOULD
307	    parse UTF-8 correctly and attempt to display the pathname within the
308	    limitation of the resources available.

310	  - Clients MUST support the FEAT command and recognize the "UTF8"
311	    feature (defined in 3.2 above) to determine if a server supports
312	    UTF-8 encoding.

314	  - Character semantics of other names shall remain undefined. If a
315	    client detects that a server is non UTF-8, it SHOULD change its
316	    display appropriately. How a client implementation handles non UTF-8
317	    is a quality of implementation issue. It MAY try to assume some
318	    other encoding, give the user a chance to select an encoding,
319	    or save encoding assumptions for a server from one FTP session to
320	    another.

322	  - Glyph rendering is outside the scope of this document. How a client
323	    presents characters it cannot display is a quality of implementation
324	    issue. This document RECOMMENDS that octets corresponding to non-
325	    displayable characters be presented in the URL %HH format defined
326	    in RFC 1738 [RFC1738]. They MAY, however, display them as question
327	    marks, with their UCS hexadecimal value, or in any other suitable
328	    fashion.

330	  - Many existing clients interpret 8-bit pathnames as being in the
331	    local character set. They MAY continue to do so for pathnames that are
332	    not valid UTF-8.
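
  The display guidance in this section can be sketched in Python as
  follows. This is one possible policy, not a required one: the
  function name display_pathname is hypothetical, Python's strict
  "utf-8" codec stands in for the validity check, and str.isprintable()
  is used as an approximation of "displayable".

```python
def display_pathname(raw: bytes) -> str:
    """Render pathname bytes for display, using %HH where needed."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so the character set is undefined. Show
        # printable ASCII as-is and every other octet as %HH.
        return "".join(chr(b) if 0x20 <= b <= 0x7E else f"%{b:02X}"
                       for b in raw)
    pieces = []
    for ch in text:
        if ch.isprintable():
            pieces.append(ch)
        else:
            # Present each octet of a non-displayable character as %HH.
            pieces.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(pieces)
```

  A literal "%" in a pathname is left unescaped here, so the output is
  meant for display only and cannot be converted back to the original
  bytes.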

334	4 Language Support

336	  The Character Set Workshop Report [RFC2130] suggests that clients and
337	  servers SHOULD negotiate a language for "greetings" and "error
338	  messages". This specification interprets the use of the term "error
339	  message", by RFC 2130, to mean any explanatory text string returned by
340	  server-PI in response to a user-PI command.

342	  Implementers SHOULD note that FTP commands and numeric responses are
343	  protocol elements. As such, their use is not affected by any guidance
344	  expressed by this specification.

346	  Greetings and command responses shall use either the server's
347	  default language or a language that the server supports and the
348	  client has selected.

350	  It may be possible to achieve language support through a virtual host
351	  as described in [MLST]. However, an FTP server might not support
352	  virtual servers, or virtual servers might be configured to support an
353	  environment without regard for language. To allow language negotiation
354	  this specification defines a new LANG command. Clients and servers
355	  that comply with this specification MUST support the LANG command.

357	4.1 The LANG command

359	  A new command "LANG" is added to the FTP command set to allow the
360	  server-FTP process to determine in which language to present server
361	  greetings and the textual part of command responses. The parameter
362	  associated with the LANG command SHALL be one of the language tags
363	  defined in RFC 1766 [RFC1766]. If a LANG command without a parameter
364	  is issued, the server's default language will be used.

366	  Greetings and responses issued prior to language negotiation SHALL be
367	  in the server's default language. Paragraph 4.5 of [RFC2277] states
368	  that this "default language MUST be understandable by an English-
369	  speaking person". This specification RECOMMENDS that the server
370	  default language be English encoded using ASCII. This text may be
371	  augmented by text from other languages. Once negotiated, server-PI
372	  MUST return server messages and the textual part of command responses
373	  in the negotiated language and encoded in UTF-8. Server-PI MAY wish to
374	  re-send previously issued server messages in the newly negotiated
375	  language.

377	  The LANG command only affects presentation of greeting messages and
378	  explanatory text associated with command responses. No attempt should
379	  be made by the server to translate protocol elements (FTP commands and
380	  numeric responses) or data transmitted over the data connection.

382	  User-PI MAY issue the LANG command at any time during an FTP session.
383	  In order to gain the full benefit of this command, it SHOULD be
384	  presented prior to authentication. In general, it will be issued after
385	  the HOST command [MLST]. Note that the issuance of a HOST or REIN
386	  command [RFC959] will negate the effect of the LANG command. User-PI
387	  SHOULD be capable of supporting UTF-8 encoding for the language
388	  negotiated. Guidance on interpretation and rendering of UTF-8, defined
389	  in section 3, SHALL apply.

391	  Although NOT REQUIRED by this specification, a user-PI SHOULD issue a
392	  FEAT command [2389] prior to a LANG command. This will allow the user-
393	  PI to determine whether the server supports the LANG command and which
394	  languages it offers.

396	  In order to aid the server in identifying whether a connection has
397	  been established with a client which conforms to this specification or
398	  an older client, user-PI MUST send a HOST [MLST] and/or LANG command
399	  prior to issuing any other command (other than FEAT [2389]). If user-
400	  PI issues a HOST command, and the server's default language is
401	  acceptable, it need not issue a LANG command. However, if the
402	  implementation does not support the HOST command, a LANG command MUST
403	  be issued. Until server-PI is presented with either a HOST or LANG
404	  command it SHOULD assume that the user-PI does not comply with this
405	  specification.

407	4.2 Syntax of the LANG command

409	  The LANG command is defined as follows:

411	  lang-command       = "Lang" [(SP lang-tag)] CRLF
412	  lang-tag           = Primary-tag *( "-" Sub-tag)
413	  Primary-tag        = 1*8ALPHA
414	  Sub-tag            = 1*8ALPHA

416	  lang-response      = lang-ok / error-response
417	  lang-ok            = "200" [SP *(%x00..%xFF) ] CRLF
418	  error-response     = command-unrecognized / bad-argument /
419	                    not-implemented / unsupported-parameter
420	  command-unrecognized  = "500" [SP *(%x01..%xFF) ] CRLF
421	  bad-argument       = "501" [SP *(%x01..%xFF) ] CRLF
422	  not-implemented    = "502" [SP *(%x01..%xFF) ] CRLF
423	  unsupported-parameter = "504" [SP *(%x01..%xFF) ] CRLF

425	  The "lang" command word is case independent and may be specified in
426	  any character case desired. Therefore "LANG", "lang", "Lang", and
427	  "lAnG" are equivalent commands.

429	  The OPTIONAL "Lang-tag" given as a parameter specifies the primary
430	  language tags and zero or more sub-tags as defined in [RFC1766]. As
431	  described in [RFC1766] language tags are treated as case insensitive.
432	  If omitted, server-PI MUST use the server's default language.
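
  The lang-command grammar above maps directly onto a regular
  expression. The following Python sketch is illustrative only; the
  function name parse_lang_command is hypothetical. It returns the
  language tag, or None when the command carries no parameter, and
  raises ValueError for input that a server would answer with 501.

```python
import re

# lang-command = "Lang" [(SP lang-tag)] CRLF
# lang-tag     = Primary-tag *( "-" Sub-tag), each tag being 1*8ALPHA
_LANG_COMMAND = re.compile(
    rb"LANG(?: ([A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*))?\r\n",
    re.IGNORECASE,  # the command word may be in any letter case
)


def parse_lang_command(line: bytes):
    """Return the lang-tag of a LANG command, or None if it has none."""
    match = _LANG_COMMAND.fullmatch(line)
    if match is None:
        raise ValueError("501: not a syntactically valid LANG command")
    tag = match.group(1)
    return tag.decode("ascii") if tag is not None else None
```

  The tag is returned in the letter case the client sent; since
  language tags are case insensitive, callers are expected to compare
  them without regard to case.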

434	  Server-FTP responds to the "Lang" command with either "lang-ok" or
435	  "error-response". "lang-ok" MUST be sent if Server-FTP supports the
436	  "Lang" command and can support some form of the "lang-tag". Support
437	  SHOULD be as follows:

439	  -  If server-FTP receives "Lang" with no parameters it SHOULD return
440	     messages and command responses in the server default language.

442	  -  If server-FTP receives "Lang" with only a primary tag argument
443	    (e.g. en, fr, de, ja, zh, etc.), which it can support, it SHOULD
444	     return messages and command responses in the language associated
445	     with that primary tag. It is possible that server-FTP will only
446	     support the primary tag when combined with a sub-tag (e.g. en-US,
447	     en-UK, etc.). In such cases, server-FTP MAY determine the
448	     appropriate variant to use during the session. How server-FTP makes
449	     that determination is outside the scope of this specification. If
450	     server-FTP cannot determine if a sub-tag variant is appropriate it
451	     SHOULD return an "unsupported-parameter" (504) response.

453	  -  If server-FTP receives "Lang" with a primary tag and sub-tag(s)
454	     argument, which is implemented, it SHOULD return messages and
455	     command responses in support of the language argument. It is
456	     possible that server-FTP can support the primary tag of the "Lang"
457	     argument but not the sub-tag(s). In such cases server-FTP MAY
458	     return messages and command responses in the most appropriate
459	     variant of the primary tag that has been implemented. How server-
460	     FTP makes that determination is outside the scope of this
461	     specification. If server-FTP cannot determine if a sub-tag variant
462	     is appropriate it SHOULD return an "unsupported-parameter" (504)
463	     response.

465	  For example if client-FTP sends a "LANG en-AU" command and server-FTP
466	  has implemented language tags en-US and en-UK it may decide that the
467	  most appropriate language tag is en-UK and return "200 en-AU not
468	  supported. Language set to en-UK". The numeric response is a protocol
469	  element and cannot be changed. The associated string is for
470	  illustrative purposes only.

472	  Clients and servers that conform to this specification MUST support
473	  the LANG command. Clients SHOULD, however, anticipate receiving a 500
474	  or 502 command response, in cases where older or non-compliant servers
475	  do not recognize or have not implemented the "Lang" command. A 501
476	  response SHOULD be sent if the argument to the "Lang" command is not
477	  syntactically correct. A 504 response SHOULD be sent if the "Lang"
478	  argument, while syntactically correct, is not implemented. As noted
479	  above, an argument may be considered a lexicon match even though it is
480	  not an exact syntax match.
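
  The server behaviour described in this section can be sketched in
  Python as follows. The function name negotiate_language is
  hypothetical, and the fallback policy (take the first supported tag
  that shares the requested primary tag) is an assumption made for
  illustration only, since this specification leaves the choice of
  variant to the server.

```python
def negotiate_language(requested, supported, default):
    """Return (reply code, language tag) for a LANG request.

    requested is the lang-tag sent by the client, or None when the
    LANG command had no parameter. supported lists the tags the server
    has implemented. default is the server default language.
    """
    if requested is None:
        return 200, default
    wanted = requested.lower()
    # Language tags are case insensitive, so compare in lower case.
    for tag in supported:
        if tag.lower() == wanted:
            return 200, tag
    # No exact match: fall back to a variant with the same primary tag.
    primary = wanted.split("-")[0]
    for tag in supported:
        if tag.lower().split("-")[0] == primary:
            return 200, tag
    # Syntactically correct but not implemented.
    return 504, None
```

  With supported tags en-UK and en-US listed in that order, a request
  for en-AU yields (200, "en-UK"), which mirrors the example above. A
  request for fr on the same server yields (504, None).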

482	4.3 Feat response for LANG command

484	  A server-FTP process that supports the LANG command, and language
485	  support for messages and command responses, MUST include in the
486	  response to the FEAT command [2389], a feature line indicating that
487	  the LANG command is supported and a fact list of the supported
488	  language tags. A response to a FEAT command SHALL be in the following
489	  format:

491	      Lang-feat  = SP "LANG" SP lang-fact CRLF
492	      lang-fact  = lang-tag ["*"] *(";" lang-tag ["*"])

494	      lang-tag   = Primary-tag *( "-" Sub-tag)
495	      Primary-tag= 1*8ALPHA
496	      Sub-tag    = 1*8ALPHA

498	  The lang-feat response contains the string "LANG" followed by a
499	  language fact. This string is not case sensitive, but SHOULD be
500	  transmitted in upper case, as recommended in [2389]. The initial space
501	  shown in the Lang-feat response is REQUIRED by the FEAT command. It
502	  MUST be a single space character. More or fewer space characters are
503	  not permitted. The lang-fact SHALL include the lang-tags which server-
504	  FTP can support. At least one lang-tag MUST be included with the FEAT
505	  response. The lang-tag SHALL be in the form described earlier in this
506	  document. The OPTIONAL asterisk, when present, SHALL indicate the
507	  current lang-tag being used by server-FTP for messages and responses.
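
  A client could read the Lang-feat line as in the following Python
  sketch. It is illustrative only; the function name parse_lang_feat is
  hypothetical. It returns the list of supported tags together with the
  tag marked by the asterisk, or None if no tag is marked.

```python
import re

_TAG = r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*"
# Lang-feat = SP "LANG" SP lang-fact CRLF
# lang-fact = lang-tag ["*"] *(";" lang-tag ["*"])
_LANG_FEAT = re.compile(
    rf" LANG ({_TAG}\*?(?:;{_TAG}\*?)*)",
    re.IGNORECASE,  # "LANG" is not case sensitive
)


def parse_lang_feat(line: str):
    """Return (supported tags, current tag) from a Lang-feat line."""
    match = _LANG_FEAT.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a LANG feature line")
    tags, current = [], None
    for item in match.group(1).split(";"):
        if item.endswith("*"):
            # The asterisk marks the language currently in use.
            item = item[:-1]
            current = item
        tags.append(item)
    return tags, current
```

  The pattern requires exactly one leading space, so a line indented by
  two spaces is rejected rather than silently accepted.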

509	4.3.1 Feat examples

511	       C> feat
512	       S> 211- 
513	       S>  ...
514	       S>  LANG EN*
515	       S>  ...
516	       S> 211 end

518	  In this example server-FTP can only support English, which is the
519	  current language (as shown by the asterisk) being used by the server
520	  for messages and command responses.

522	       C> feat
523	       S> 211- 
524	       S>  ...
525	       S>  LANG EN*;FR
526	       S>  ...
527	       S> 211 end

529	       C> LANG fr
530	       S> 200 Le response sera changez au francais

532	       C> feat
533	       S> 211- 
534	       S>  ...
535	       S>  LANG EN;FR*
536	       S>  ...
537	       S> 211 end

539	  In this example server-FTP supports both English and French as shown
540	  by the initial response to the FEAT command. The asterisk indicates
541	  that English is the current language in use by server-FTP. After a
542	  LANG command is issued to change the language to French, the FEAT
543	  response shows French as the current language in use.

545	  In the above examples ellipses indicate placeholders where other
546	  features may be included, but are NOT REQUIRED.

548	5 Security

550	  This document addresses the support of character sets beyond 1 byte
551	  and a new language negotiation command. Conformance to this document
552	  should not induce a security risk.

554	6 Acknowledgments

556	  The following people have contributed to this document:

558	  D. J. Bernstein
559	  Martin J. Duerst
560	  Mark Harris
561	  Paul Hethmon
562	  Alun Jones
563	  Gregory Lundberg
564	  James Matthews
565	  Keith Moore
566	  Sandra O'Donnell
567	  Benjamin Riefenstahl
568	  Stephen Tihor

570	  (and others from the FTPEXT working group)

572	7 Glossary

574	  BIDI - abbreviation for Bi-directional, a reference to mixed right-to-
575	  left and left-to-right text.

577	  Character Set - a collection of characters used to represent textual
578	  information in which each character has a numeric value

580	  Code Set -  (see character set).

582	  Glyph - a character image represented on a display device.

584	  I18N - "I eighteen N", the first and last letters of the word
585	  "internationalization" and the eighteen letters in between.

587	  UCS-2 - the ISO/IEC 10646 two octet Universal Character Set form.

589	  UCS-4 - the ISO/IEC 10646 four octet Universal Character Set form.

591	  UTF-8 - the UCS Transformation Format represented in 8 bits.

593	  UTF-16 - A 16-bit format including the BMP (directly encoded) and
594	  surrogate pairs to represent characters in planes 01-16; equivalent to
595	  Unicode.

597	8 Bibliography

599	  [ABNF]

601	    D. Crocker, P. Overell, Augmented BNF for Syntax Specifications:
602	    ABNF, RFC 2234, November 1997.

604	  [ASCII]

606	    ANSI X3.4:1986 Coded Character Sets - 7 Bit American National
607	    Standard Code for Information Interchange (7-bit ASCII)

609	  [ISO-8859]

611	    ISO 8859.  International standard -- Information processing -- 8-bit
612	    single-byte coded graphic character sets -- Part 1: Latin alphabet
613	    No. 1 (1987) -- Part 2: Latin alphabet No. 2 (1987) -- Part 3: Latin
614	    alphabet No. 3 (1988) -- Part 4: Latin alphabet No. 4 (1988) -- Part
615	    5: Latin/Cyrillic alphabet (1988) -- Part 6: Latin/Arabic alphabet
616	    (1987) -- Part 7: Latin/Greek alphabet (1987) -- Part 8:
617	    Latin/Hebrew alphabet (1988) -- Part 9: Latin alphabet No. 5 (1989)
618	    -- Part 10: Latin alphabet No. 6 (1992)

620	  [BCP14]

622	    S. Bradner, "Key words for use in RFCs to Indicate Requirement
623	    Levels", BCP 14, RFC 2119, March 1997.

625	  [ISO-10646]

627	    ISO/IEC 10646-1:1993. International standard -- Information
628	    technology -- Universal multiple-octet coded character set (UCS) --
629	    Part 1: Architecture and basic multilingual plane.

631	  [MLST]

633	    R. Elz, P. Hethmon, "Extensions to FTP", Work in Progress,
634	    February 1999.

636	  [RFC854]

638	    J. Postel, J. Reynolds, "Telnet Protocol Specification", RFC 854,
639	    May 1983.

641	  [RFC959]

643	    J. Postel, J. Reynolds, "File Transfer Protocol (FTP)", RFC 959,
644	    October 1985.

646	  [RFC1123]

648	    R. Braden, "Requirements for Internet Hosts -- Application and
649	    Support", RFC 1123, October 1989.

651	  [RFC1738]

653	    T. Berners-Lee, L. Masinter, M. McCahill, "Uniform Resource Locators
654	    (URL)", RFC 1738, December 1994.

656	  [RFC1766]

658	    H. Alvestrand, "Tags for the Identification of Languages", RFC 1766,
659	    March 1995.

661	  [RFC2130]

663	    C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M.
664	    Crispin, P. Svanberg, "Character Set Workshop Report", RFC 2130,
665	    April 1997.

667	  [RFC2277]

669	    H. Alvestrand, "IETF Policy on Character Sets and Languages", RFC
670	    2277, January 1998.

672	  [RFC2279]

674	    F. Yergeau, "UTF-8, a transformation format of ISO 10646", RFC 2279,
675	    January 1998.

677	  [2389]

679	    R. Elz, P. Hethmon, "Feature Negotiation Mechanism for the File
680	    Transfer Protocol", RFC 2389, August 1998.

682	  [UNICODE]

684	    The Unicode Consortium, "The Unicode Standard - Version 2.0",
685	    Addison-Wesley Developers Press, July 1996.

687	  [UTF-8]

689	    ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS Transformation Format 8
690	    (UTF-8).

692	9 Author's Address

694	  JIEO
695	  Attn JEBBD (Bill Curtin)
696	  Ft. Monmouth, N.J.
697	          07703-5613
698	  curtinw@ftm.disa.mil

699	                 Annex A - Implementation Considerations

701	A.1 General Considerations

703	  - Implementers should ensure that their code accounts for potential
704	    problems, such as using a NULL character to terminate a string or no
705	    longer being able to steal the high order bit for internal use, when
706	    supporting the extended character set.

708	  - Implementers should be aware that there is a chance that pathnames
709	    that are non UTF-8 may be parsed as valid UTF-8. The probabilities
710	    are low for some encodings and statistically zero for others.
711	    A recent non-scientific analysis found that EUC encoded Japanese
712	    words had a 2.7% false reading; SJIS had a 0.0005% false reading;
713	    other encodings such as ASCII or KOI-8 have a 0% false reading. This
714	    probability is highest for short pathnames and decreases as pathname
715	    size increases. Implementers may want to look for signs that
716	    pathnames which parse as UTF-8 are not actually UTF-8, such as the
717	    existence of multiple local character sets in short pathnames.
718	    Hopefully, as more implementations conform to UTF-8 transfer
719	    encoding there will be a smaller need to guess at the encoding.

721	  - Client developers should be aware that it will be possible for
722	    pathnames to contain mixed characters (e.g.
723	    /Latin1DirectoryName/HebrewFileName). They should be prepared to
724	    handle the Bi-directional (BIDI) display of these character sets
725	    (i.e. left to right display for the directory and right to left
726	    display for the filename). While bi-directional display is outside
727	    the scope of this document and more complicated than the above
728	    example, an algorithm for bi-directional display can be found in the
729	    UNICODE 2.0 [UNICODE] standard. Also note that pathnames can have
730	    different byte ordering yet be logically and display-wise equivalent
731	    due to the insertion of BIDI control characters at different points
732	    during composition. Mixed character sets may also present problems
733	    with font swapping.

735	  - A server that copies pathnames transparently from a local filesystem
736	    may continue to do so. It is then up to the local file creators to
737	    use UTF-8 pathnames.

739	  - Servers can support charset labeling of files and/or directories,
740	    such that different pathnames may have different charsets. The
741	    server should attempt to convert all pathnames to UTF-8, but if it
742	    can't then it should leave that name in its raw form.

744	  - Some servers' operating systems do not mandate a character set, but
745	    allow administrators to configure one in the FTP server. These
746	    servers should be configured to use a particular mapping table
747	    (either external or built-in). This will allow the flexibility of
748	    defining different charsets for different directories.

750	  - If the server's OS does not mandate the character set and the FTP
751	    server cannot be configured, the server should simply use the raw
752	    bytes in the file name.  They might be ASCII or UTF-8.

754	  - If the server is a mirror, and wants to look just like the site it
755	    is mirroring, it should store the exact file name bytes that it
756	    received from the main server.

758	A.2 Transition Considerations

760	  - Servers which support this specification, when presented a pathname
761	    from an old client (one which does not support this specification),
762	    can nearly always tell whether the pathname is in UTF-8 (see B.1) or
763	    in some other code set. In order to support these older clients,
764	    servers may wish to default to a non UTF-8 code set. However, how a
765	    server supports non UTF-8 is outside the scope of this
766	    specification.

768	  - Clients which support this specification will be able to determine
769	    if the server can support UTF-8 (i.e. supports this specification)
770	    by the ability of the server to support the FEAT command and the
771	    UTF8 feature (defined in 3.2). If a newer client determines that
772	    the server does not support UTF-8 it may wish to default to a
773	    different code set. Client developers should take into consideration
774	    that pathnames, associated with older servers, might be stored in
775	    UTF-8. However, how a client supports non UTF-8 is outside the scope
776	    of this specification.

778	  - Clients and servers can transition to UTF-8 by either converting
779	    to/from the local encoding, or the users can store UTF-8 filenames.
780	    The former approach is easier on tightly controlled file systems
781	    (e.g. PCs and MACs). The latter approach is easier on more free form
782	    file systems (e.g. Unix).

784	  - For interactive use attention should be focused on user interface
785	    and ease of use. Non-interactive use requires a consistent and
786	    controlled behavior.

788	  - There may be many applications which reference files under their old
789	    raw pathname (e.g. linked URLs). Changing the pathname to UTF-8 will
790	    cause access to the old URL to fail. A solution may be for the
791	    server to act as if there were two different pathnames associated
792	    with the file. This might be done internal to the server on
793	    controlled file systems or by using symbolic links on free form
794	    systems. While this approach may work for single file transfer
795	    non-interactive use, a non-interactive transfer of all of the files
796	    in a directory will produce duplicates. Interactive users may be
797	    presented with lists of files which are double the actual number of
798	    files.
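
  The FEAT-based check described in the second item above could be
  sketched as follows. This assumes the client has already split the
  FEAT response into individual lines; the function names are
  hypothetical. Feature lines are recognized by their single leading
  space, and the feature name is compared without regard to case.

```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 if "line" is a feature line naming "name", else 0.
   The name must be followed by end of line or a space, so that
   "UTF8" does not match a longer feature name. */
int feat_line_is(const char *line, const char *name)
{
    if (*line++ != ' ')
        return 0;                  /* not a feature line */
    while (*name) {
        if (toupper((unsigned char)*line) !=
            toupper((unsigned char)*name))
            return 0;
        line++;
        name++;
    }
    return *line == '\0' || *line == ' ' ||
           *line == '\r' || *line == '\n';
}

/* Return 1 if any of the "count" response lines advertises "name". */
int feat_has(const char *const *lines, size_t count, const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (feat_line_is(lines[i], name))
            return 1;
    return 0;
}
```

  A client might call feat_has(lines, count, "UTF8") once after login
  and fall back to a local code set when it returns 0.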

799	                   Annex B - Sample Code and Examples

801	B.1 Valid UTF-8 check

803	  The following routine checks if a byte sequence is valid UTF-8. This
804	  is done by checking for the proper tagging of the first and following
805	  bytes to make sure they conform to the UTF-8 format. It then checks to
806	  assure that the data part of the UTF-8 sequence conforms to the proper
807	  range allowed by the encoding. Note: This routine will not detect
808	  characters that have not been assigned and therefore do not exist.

810	  int utf8_valid(const unsigned char *buf, unsigned int len)
811	  {
812	   const unsigned char *endbuf = buf + len;
813	   unsigned char byte2mask=0x00, c;
814	   int trailing = 0;  // trailing (continuation) bytes to follow

816	   while (buf != endbuf)
817	   {
818	     c = *buf++;
819	     if (trailing)
820	      if ((c&0xC0) == 0x80)  // Does trailing byte follow UTF-8 format?
821	      {if (byte2mask)        // Need to check 2nd byte for proper range?
822	        if (c&byte2mask)     // Are appropriate bits set?
823	         byte2mask=0x00;
824	        else
825	         return 0;
826	       trailing--; }
827	      else
828	       return 0;
829	     else
830	      if ((c&0x80) == 0x00)  continue;      // valid 1 byte UTF-8
831	      else if ((c&0xE0) == 0xC0)            // valid 2 byte UTF-8
832	            if (c&0x1E)                     // Is UTF-8 byte in
833	                                            // proper range?
834	             trailing =1;
835	            else
836	             return 0;
837	      else if ((c&0xF0) == 0xE0)           // valid 3 byte UTF-8
838	            {if (!(c&0x0F))                // Is UTF-8 byte in
839	                                           // proper range?
840	              byte2mask=0x20;              // If not set mask
841	                                           // to check next byte
842	              trailing = 2;}
843	      else if ((c&0xF8) == 0xF0)           // valid 4 byte UTF-8
844	            {if (!(c&0x07))                // Is UTF-8 byte in
845	                                           // proper range?
846	              byte2mask=0x30;              // If not set mask
847	                                           // to check next byte
848	              trailing = 3;}
849	      else if ((c&0xFC) == 0xF8)           // valid 5 byte UTF-8
850	            {if (!(c&0x03))                // Is UTF-8 byte in
851	                                           // proper range?
852	              byte2mask=0x38;              // If not set mask
853	                                           // to check next byte
854	              trailing = 4;}
855	      else if ((c&0xFE) == 0xFC)           // valid 6 byte UTF-8
856	            {if (!(c&0x01))                // Is UTF-8 byte in
857	                                           // proper range?
858	              byte2mask=0x3C;              // If not set mask
859	                                           // to check next byte
860	              trailing = 5;}
861	      else  return 0;
862	   }
863	    return trailing == 0;
864	  }

866	B.2 Conversions

868	  The code examples in this section closely reflect the algorithm in ISO
869	  10646 and may not present the most efficient solution for converting
870	  to / from UTF-8 encoding. If efficiency is an issue, implementers
871	  should use the appropriate bitwise operators.

873	  Additional code examples and numerous mapping tables can be found at
874	  the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.

876	  Note that the conversion examples below assume that the local
877	  character set supported in the operating system is something other
878	  than UCS2/UTF-16. There are some operating systems that already
879	  support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
880	  conversion will be necessary from the local character set to the UCS.

882	B.2.1 Conversion from Local Character Set to UTF-8

884	  Conversion from the local filesystem character set to UTF-8 will
885	  normally involve a two step process. First convert the local character
886	  set to the UCS; then convert the UCS to UTF-8.

888	  The first step in the process can be performed by maintaining a
889	  mapping table that includes the local character set code and the
890	  corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
891	  code for the Hebrew letter "VAV" is 0xE4. The corresponding 4 byte
892	  ISO/IEC 10646 code is 0x000005D5.

894	  The next step is to convert the UCS character code to the UTF-8
895	  encoding. The following routine can be used to determine and encode
896	  the correct number of bytes based on the UCS-4 character code:

898	       unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
899	                                  ucs4_len, unsigned char *utf8_buf)

901	       {
902	        const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
903	        unsigned int utf8_len = 0;        // return value for UTF8 size
904	        unsigned char *t_utf8_buf = utf8_buf; // Temporary pointer
905	                                              // to load UTF8 values

907	        while (ucs4_buf != ucs4_endbuf)
908	        {
909	         if ( *ucs4_buf <= 0x7F)    // ASCII chars no conversion needed
910	         {
911	          *t_utf8_buf++ = (unsigned char) *ucs4_buf;
912	          utf8_len++;
913	          ucs4_buf++;
914	         }
915	         else
916	          if ( *ucs4_buf <= 0x07FF ) // In the 2 byte utf-8 range
917	          {
918	            *t_utf8_buf++= (unsigned char) (0xC0 + (*ucs4_buf/0x40));
919	            *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
920	            utf8_len+=2;
921	            ucs4_buf++;
922	          }
923	          else
924	            if ( *ucs4_buf <= 0xFFFF ) /* In the 3 byte utf-8 range. The
925	                                        values 0x0000FFFE, 0x0000FFFF
926	                                        and 0x0000D800 - 0x0000DFFF do
927	                                        not occur in UCS-4 */
928	            {
929	             *t_utf8_buf++= (unsigned char) (0xE0 +
930	                            (*ucs4_buf/0x1000));
931	             *t_utf8_buf++= (unsigned char) (0x80 +
932	                            ((*ucs4_buf/0x40)%0x40));
933	             *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
934	             utf8_len+=3;
935	             ucs4_buf++;
936	             }
937	            else
938	             if ( *ucs4_buf <= 0x1FFFFF ) //In the 4 byte utf-8 range
939	             {
940	              *t_utf8_buf++= (unsigned char) (0xF0 +
941	                             (*ucs4_buf/0x040000));
942	              *t_utf8_buf++= (unsigned char) (0x80 +
943	                             ((*ucs4_buf/0x1000)%0x40));
944	              *t_utf8_buf++= (unsigned char) (0x80 +
945	                             ((*ucs4_buf/0x40)%0x40));
946	              *t_utf8_buf++= (unsigned char) (0x80 + (*ucs4_buf%0x40));
947	              utf8_len+=4;
948	              ucs4_buf++;

950	             }
951	             else
952	              if ( *ucs4_buf <= 0x03FFFFFF )//In the 5 byte utf-8 range
953	              {
954	               *t_utf8_buf++= (unsigned char) (0xF8 +
955	                              (*ucs4_buf/0x01000000));
956	               *t_utf8_buf++= (unsigned char) (0x80 +
957	                              ((*ucs4_buf/0x040000)%0x40));
958	               *t_utf8_buf++= (unsigned char) (0x80 +
959	                              ((*ucs4_buf/0x1000)%0x40));
960	               *t_utf8_buf++= (unsigned char) (0x80 +
961	                              ((*ucs4_buf/0x40)%0x40));
962	               *t_utf8_buf++= (unsigned char) (0x80 +
963	                              (*ucs4_buf%0x40));
964	               utf8_len+=5;
965	               ucs4_buf++;
966	              }
967	              else
968	              if ( *ucs4_buf <= 0x7FFFFFFF )//In the 6 byte utf-8 range
969	               {
970	                 *t_utf8_buf++= (unsigned char)
971	                                (0xFC +(*ucs4_buf/0x40000000));
972	                 *t_utf8_buf++= (unsigned char) (0x80 +
973	                                ((*ucs4_buf/0x01000000)%0x40));
974	                 *t_utf8_buf++= (unsigned char) (0x80 +
975	                                ((*ucs4_buf/0x040000)%0x40));
976	                 *t_utf8_buf++= (unsigned char) (0x80 +
977	                                ((*ucs4_buf/0x1000)%0x40));
978	                 *t_utf8_buf++= (unsigned char) (0x80 +
979	                                ((*ucs4_buf/0x40)%0x40));
980	                 *t_utf8_buf++= (unsigned char) (0x80 +
981	                                (*ucs4_buf%0x40));
982	                 utf8_len+=6;
983	                 ucs4_buf++;

985	               }
986	        }
987	        return (utf8_len);
988	       }
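
  Where efficiency matters, the same encoding can be expressed with
  shifts and masks instead of division and remainder. The following
  is a sketch only: it encodes a single UCS-4 value, the function
  name is hypothetical, and the caller is assumed to supply an output
  buffer of at least 6 bytes.

```c
#include <stddef.h>

/* Encode one UCS-4 value as UTF-8 using bitwise operators.
   Writes up to 6 bytes to out and returns the number written,
   or 0 if the value is above 0x7FFFFFFF and cannot be encoded. */
size_t ucs4_encode_bitwise(unsigned long c, unsigned char *out)
{
    size_t n, i;
    unsigned char lead;

    if (c <= 0x7FUL) {                       /* 1 byte: plain ASCII */
        out[0] = (unsigned char)c;
        return 1;
    }
    else if (c <= 0x7FFUL)      { n = 2; lead = 0xC0; }
    else if (c <= 0xFFFFUL)     { n = 3; lead = 0xE0; }
    else if (c <= 0x1FFFFFUL)   { n = 4; lead = 0xF0; }
    else if (c <= 0x3FFFFFFUL)  { n = 5; lead = 0xF8; }
    else if (c <= 0x7FFFFFFFUL) { n = 6; lead = 0xFC; }
    else return 0;

    /* Fill the continuation bytes (10xxxxxx) from last to first,
       taking 6 bits of the value for each. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3F));
        c >>= 6;
    }
    out[0] = (unsigned char)(lead | c);      /* remaining high bits */
    return n;
}
```

  For the value 0x000005D5 this yields the two bytes 0xD7 0x95.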

989	B.2.2 Conversion from UTF-8 to Local Character Set

991	  When moving from UTF-8 encoding to the local character set the reverse
992	  procedure is used. First the UTF-8 encoding is transformed into the
993	  UCS-4 character set. The UCS-4 is then converted to the local
994	  character set from a mapping table (i.e. the opposite of the table
995	  used to form the UCS-4 character code).

997	  To convert from UTF-8 to UCS-4 the free bits (those that do not define
998	  UTF-8 sequence size or signify continuation bytes) in a UTF-8 sequence
999	  are concatenated as a bit string. The bits are then distributed into a
1000	  four-byte sequence starting from the least significant bits. Those
1001	  bits not assigned a bit in the four-byte sequence are padded with ZERO
1002	  bits. The following routine converts the UTF-8 encoding to UCS-4
1003	  character codes:

1005	       int utf8_to_ucs4 (unsigned long *ucs4_buf, unsigned int utf8_len,
1006	                         unsigned char *utf8_buf)
1007	       {

1009	       const unsigned char *utf8_endbuf = utf8_buf + utf8_len;
1010	       unsigned int ucs_len=0;

1012	        while (utf8_buf != utf8_endbuf)
1013	        {

1015	         if ((*utf8_buf & 0x80) == 0x00)  /*ASCII chars no conversion
1016	                                            needed */
1017	         {
1018	          *ucs4_buf++ = (unsigned long) *utf8_buf;
1019	          utf8_buf++;
1020	          ucs_len++;
1021	         }
1022	         else
1023	          if ((*utf8_buf & 0xE0)== 0xC0) //In the 2 byte utf-8 range
1024	          {
1025	            *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xC0) * 0x40)
1026	                           + ( *(utf8_buf+1) - 0x80));
1027	            utf8_buf += 2;
1028	            ucs_len++;
1029	          }
1030	          else
1031	            if ( (*utf8_buf & 0xF0) == 0xE0 ) /*In the 3 byte utf-8
1032	                                                range */
1033	            {
1034	            *ucs4_buf++ = (unsigned long) (((*utf8_buf - 0xE0) * 0x1000)
1035	                          + (( *(utf8_buf+1) -  0x80) * 0x40)
1036	                          + ( *(utf8_buf+2) - 0x80));
1037	             utf8_buf+=3;
1038	             ucs_len++;
1039	            }
1040	            else
1041	             if ((*utf8_buf & 0xF8) == 0xF0) /* In the 4 byte utf-8
1042	                                                range */
1043	             {
1044	              *ucs4_buf++ = (unsigned long)
1045	                              (((*utf8_buf - 0xF0) * 0x040000)
1046	                              + (( *(utf8_buf+1) -  0x80) * 0x1000)
1047	                              + (( *(utf8_buf+2) -  0x80) * 0x40)
1048	                              + ( *(utf8_buf+3) - 0x80));
1049	              utf8_buf+=4;
1050	              ucs_len++;
1051	             }
1052	             else
1053	              if ((*utf8_buf & 0xFC) == 0xF8) /* In the 5 byte utf-8
1054	                                                 range */
1055	              {
1056	               *ucs4_buf++ = (unsigned long)
1057	                              (((*utf8_buf - 0xF8) * 0x01000000)
1058	                              + ((*(utf8_buf+1) - 0x80) * 0x040000)
1059	                              + (( *(utf8_buf+2) -  0x80) * 0x1000)
1060	                              + (( *(utf8_buf+3) -  0x80) * 0x40)
1061	                              + ( *(utf8_buf+4) - 0x80));
1062	               utf8_buf+=5;
1063	               ucs_len++;
1064	              }
1065	              else
1066	               if ((*utf8_buf & 0xFE) == 0xFC) /* In the 6 byte utf-8
1067	                                                  range */
1068	               {
1069	                 *ucs4_buf++ = (unsigned long)
1070	                               (((*utf8_buf - 0xFC) * 0x40000000)
1071	                                + ((*(utf8_buf+1) - 0x80) * 0x01000000)
1072	                                + ((*(utf8_buf+2) - 0x80) * 0x040000)
1073	                                + (( *(utf8_buf+3) -  0x80) * 0x1000)
1074	                                + (( *(utf8_buf+4) -  0x80) * 0x40)
1075	                                + ( *(utf8_buf+5) - 0x80));
1076	                 utf8_buf+=6;
1077	                 ucs_len++;
1078	               }

1080	        }
1081	       return (ucs_len);
1082	       }
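
  The routine above assumes that its input is complete and well
  formed. A decoder that also guards against truncated or malformed
  input might look like the following sketch, which decodes one
  sequence at a time using shifts and masks. The function name and
  return convention are assumptions of this sketch. It does not
  reject overlong forms, so it is best used on input that has already
  passed a check such as the one in B.1.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence from buf[0..len). On success stores the
   UCS-4 value in *out and returns the number of bytes consumed.
   Returns 0 if the sequence is truncated or malformed. */
size_t utf8_decode_one(const unsigned char *buf, size_t len,
                       unsigned long *out)
{
    size_t n, i;
    unsigned long c;

    if (len == 0)
        return 0;
    /* The lead byte gives the sequence length and the first free bits. */
    if      ((buf[0] & 0x80) == 0x00) { n = 1; c = buf[0]; }
    else if ((buf[0] & 0xE0) == 0xC0) { n = 2; c = buf[0] & 0x1F; }
    else if ((buf[0] & 0xF0) == 0xE0) { n = 3; c = buf[0] & 0x0F; }
    else if ((buf[0] & 0xF8) == 0xF0) { n = 4; c = buf[0] & 0x07; }
    else if ((buf[0] & 0xFC) == 0xF8) { n = 5; c = buf[0] & 0x03; }
    else if ((buf[0] & 0xFE) == 0xFC) { n = 6; c = buf[0] & 0x01; }
    else return 0;             /* stray continuation byte, 0xFE or 0xFF */

    if (n > len)
        return 0;              /* sequence runs past the buffer */
    for (i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;          /* not a continuation byte */
        c = (c << 6) | (buf[i] & 0x3F);
    }
    *out = c;
    return n;
}
```

  Calling this in a loop and advancing by the returned count decodes a
  whole buffer; a return value of 0 stops the loop instead of reading
  beyond the end of the input.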

1083	B.2.3 ISO/IEC 8859-8 Example

1085	  This example demonstrates mapping ISO/IEC 8859-8 character set to UTF-
1086	  8 and back to ISO/IEC 8859-8. As noted earlier, the Hebrew letter
1087	  "VAV" is converted from the ISO/IEC 8859-8 character code 0xE4 to the
1088	  corresponding 4 byte ISO/IEC 10646 code of 0x000005D5 by a simple
1089	  lookup of a conversion/mapping file.

1091	  The UCS-4 character code is transformed into UTF-8 using the
1092	  ucs4_to_utf8 routine described earlier by:

1094	   1. Because the UCS-4 character is between 0x80 and 0x07FF it will map
1095	      to a 2 byte UTF-8 sequence.
1096	   2. The first byte is defined by (0xC0 + (0x000005D5 / 0x40)) = 0xD7.
1097	   3. The second byte is defined by (0x80 + (0x000005D5 % 0x40)) = 0x95.

1099	  The UTF-8 encoding is transferred back to UCS-4 by using the
1100	  utf8_to_ucs4 routine described earlier by:

1102	   1. Because the first byte of the sequence, when the '&' operator with
1103	      a value of 0xE0 is applied, will produce 0xC0 (0xD7 & 0xE0 = 0xC0)
1104	      the UTF-8 is a 2 byte sequence.
1105	   2. The four byte UCS-4 character code is produced by (((0xD7 - 0xC0)
1106	      * 0x40) + (0x95 -0x80)) = 0x000005D5.

1108	  Finally, the UCS-4 character code is converted to ISO/IEC 8859-8
1109	  character code (using the mapping table which matches ISO/IEC 8859-8
1110	  to UCS-4 ) to produce the original 0xE4 code for the Hebrew letter
1111	  "VAV".

1113	B.2.4 Vendor Codepage Example

1115	  This example demonstrates the mapping of a codepage to UTF-8 and back
1116	  to a vendor codepage. Mapping between vendor codepages can be done in
1117	  a very similar manner as described above. For instance both the PC and
1118	  Mac codepages reflect the character set from the Thai standard TIS
1119	  620-2533. The character code on both platforms for the Thai letter "SO
1120	  SO" is 0xAB. This character can then be mapped into the UCS-4 by way
1121	  of a conversion/mapping file to produce the UCS-4 code of 0x0E0B.

1123	  The UCS-4 character code is transformed into UTF-8 using the
1124	  ucs4_to_utf8 routine described earlier by:

1126	   1. Because the UCS-4 character is between 0x0800 and 0xFFFF it will
1127	      map to a 3 byte UTF-8 sequence.
1128	   2. The first byte is defined by (0xE0 + (0x00000E0B / 0x1000)) =
1129	      0xE0.
1130	   3. The second byte is defined by (0x80 + ((0x00000E0B / 0x40) %
1131	      0x40)) = 0xB8.
1132	   4. The third byte is defined by (0x80 + (0x00000E0B % 0x40)) = 0x8B.

1134	  The UTF-8 encoding is transferred back to UCS-4 by using the
1135	  utf8_to_ucs4 routine described earlier by:

1137	   1. Because the first byte of the sequence, when the '&' operator with
1138	      a value of 0xF0 is applied, will produce 0xE0 (0xE0 & 0xF0 = 0xE0)
1139	      the UTF-8 is a 3 byte sequence.
1140	   2. The four byte UCS-4 character code is produced by (((0xE0 - 0xE0)
1141	      * 0x1000) + ((0xB8 - 0x80) * 0x40) + (0x8B - 0x80)) = 0x00000E0B.

1143	  Finally, the UCS-4 character code is converted to either the PC or MAC
1144	  codepage character code (using the mapping table which matches
1145	  codepage to UCS-4 ) to produce the original 0xAB code for the Thai
1146	  letter "SO SO".
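
  The mapping step of this example can be sketched in code. The
  sketch below assumes the plain TIS 620-2533 layout, in which the
  Thai characters occupy bytes 0xA1-0xDA and 0xDF-0xFB and differ
  from their UCS codes by a constant offset of 0x0D60. A real
  implementation would normally load the vendor's published mapping
  table instead, since vendor codepages may assign further bytes. The
  function names and the use of 0xFFFD and -1 for unmapped values are
  choices of this sketch.

```c
/* Map one TIS-620 byte to its UCS-4 code. Unmapped bytes yield
   0xFFFD (a choice of this sketch, not of the specification). */
unsigned long tis620_to_ucs4(unsigned char b)
{
    if (b < 0x80)
        return b;                          /* ASCII range is unchanged */
    if ((b >= 0xA1 && b <= 0xDA) || (b >= 0xDF && b <= 0xFB))
        return (unsigned long)b + 0x0D60;  /* Thai block offset */
    return 0xFFFDUL;
}

/* Map one UCS-4 code back to a TIS-620 byte, or return -1 if the
   character has no TIS-620 equivalent. */
int ucs4_to_tis620(unsigned long c)
{
    if (c < 0x80)
        return (int)c;
    if ((c >= 0x0E01 && c <= 0x0E3A) || (c >= 0x0E3F && c <= 0x0E5B))
        return (int)(c - 0x0D60);
    return -1;
}
```

  With these helpers, the byte 0xAB for "SO SO" maps to 0x0E0B and
  back again, matching the values worked through above.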

1148	B.3 Pseudo Code for a High-Quality Translating Server

1150	  if utf8_valid(fn)
1151	    {
1152	    attempt to convert fn to the local charset, producing localfn
1153	    if (conversion fails temporarily) return error
1154	    if (conversion succeeds)
1155	      {
1156	      attempt to open localfn
1157	      if (open fails temporarily) return error
1158	      if (open succeeds) return success
1159	      }
1160	    }
1161	  attempt to open fn
1162	  if (open fails temporarily) return error
1163	  if (open succeeds) return success
1164	  return permanent error