idnits 2.17.1

draft-duerst-query-i18n-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.
     Expected boilerplate is as follows today (2024-04-26) according to
     https://trustee.ietf.org/license-info :

       IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:

          This Internet-Draft is submitted in full conformance with the
          provisions of BCP 78 and BCP 79.

       IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:

          Copyright (c) 2024 IETF Trust and the persons identified as the
          document authors. All rights reserved.

       IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:

          This document is subject to BCP 78 and the IETF Trust's Legal
          Provisions Relating to IETF Documents
          (https://trustee.ietf.org/license-info) in effect on the date of
          publication of this document. Please review these documents
          carefully, as they describe your rights and restrictions with respect
          to this document. Code Components extracted from this document must
          include Simplified BSD License text as described in Section 4.e of
          the Trust Legal Provisions and are provided without warranty as
          described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date. The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents.

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts.

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 181: '... query component SHOULD be sent back e...'

     RFC 2119 keyword, line 184: '... nent MUST be sent back according to...'

     RFC 2119 keyword, line 190: '...est-header field MUST be sent back whe...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (July 1997) is 9782 days in the past. Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'Fileupload' on line 246 looks like a reference

  -- Missing reference section? 'RFC1738' on line 264 looks like a reference

  -- Missing reference section? 'URLsyntax' on line 289 looks like a reference

  -- Missing reference section? 'RFC2044' on line 267 looks like a reference

  -- Missing reference section? 'URLprocess' on line 285 looks like a reference

  -- Missing reference section? 'IMAPURL' on line 261 looks like a reference

  -- Missing reference section? 'FTPINT' on line 249 looks like a reference

  -- Missing reference section? 'RFC2130' on line 280 looks like a reference

  -- Missing reference section? 'RFC2070' on line 274 looks like a reference

  -- Missing reference section? 'RFC2045' on line 270 looks like a reference

  -- Missing reference section? 'IANA' on line 112 looks like a reference


     Summary: 9 errors (**), 0 flaws (~~), 1 warning (==), 13 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2     Internet Draft                                              M. Duerst
3                                                      University of Zurich
4     Expires January 1998                                        July 1997

6           Handling Internationalized Query Components in URLs

8     Status of this Memo

10       This document is an Internet-Draft. Internet-Drafts are working doc-
11       uments of the Internet Engineering Task Force (IETF), its areas, and
12       its working groups. Note that other groups may also distribute work-
13       ing documents as Internet-Drafts.

15       Internet-Drafts are draft documents valid for a maximum of six
16       months. Internet-Drafts may be updated, replaced, or obsoleted by
17       other documents at any time. It is not appropriate to use Internet-
18       Drafts as reference material or to cite them other than as a "working
19       draft" or "work in progress".

21       To learn the current status of any Internet-Draft, please check the
22       1id-abstracts.txt listing contained in the Internet-Drafts Shadow
23       Directories on ds.internic.net (US East Coast), nic.nordu.net
24       (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
25       Rim).

27       Distribution of this document is unlimited. Please send comments to
28       the author (see Author's Address) or to the uri mailing list at
29       uri@bunyip.com. This document is currently a pre-draft, for
30       restricted discussion only. It is intended to become part of a suite
31       of documents related to the internationalization of URLs.

33    Abstract

35       HTTP and HTML provide the facility to query the user and return the
36       results. This is usually done in the query component of an URL. This
37       mechanism works with full satisfaction for characters of the us-
38       ascii repertoire. Due to the lack of an agreed encoding for other
39       characters, the situation is much less satisfactory for characters
40       outside the us-ascii repertoire.

42       This document makes two contributions to the problem: (1) It
43       describes an application convention mostly already respected, and
44       sufficient in many cases. (2) It introduces an addition to HTTP to
45       ease the transition to a general internationalized URL architecture.

47    Table of contents

49       1. Introduction
50         1.1 General
51         1.2 Terms
52       2. A Simple Application Convention for Browsers
53       3. Upgrading of Query Component to UTF-8
54         3.1 The Query-UTF-8 Request/Response-Header Field
55         3.2 Rationale
56       Bibliography
57       Author's Address

59    1. Introduction

61    1.1 General

63       HTTP (HyperText Transfer Protocol [HTTP1.1]) and HTML (HyperText
64       Markup Language [HTML4.0]) provide the facility to query the user
65       (with a FORM in HTML) and return the results to the server. There are
66       various ways to return the result (see in particular [Fileupload]),
67       but the one most widely used is to encode the result in the query
68       component of an URL [RFC1738, URLsyntax]. This mechanism works with
69       full satisfaction for characters of the us-ascii repertoire. Due to
70       the lack of an agreed encoding for other characters, the situation is
71       much less satisfactory for characters outside the us-ascii reper-
72       toire.

74       Ideally, the problem would be solved by agreeing on a single charac-
75       ter encoding for all query parts or all URLs. The outstanding candi-
76       date for this is UTF-8 [RFC2044]. UTF-8 is already the preferred
77       encoding for new URL schemes [URLprocess], the only encoding for a
78       recently defined URL scheme [IMAPURL], the encoding on the wire for
79       beyond-ASCII FTP filenames [FTPINT] (thus making it the encoding for
80       the ftp: URL scheme) and the encoding suggested for the Internet in
81       general [RFC2130]. UTF-8 has various important properties, in par-
82       ticular that it is completely compatible with US-ASCII and is easily
83       detectable by simple heuristics.

85       Moving to UTF-8 for URLs is most difficult for the query component.
86       This is due to the fact that for the other components, in particular
87       for the path component, the namespace is very sparse and well known
88       to the server, while it is dense and not well known in the case of
89       the query part. To increase the reliability of transmitting query
90       information, this document describes an existing convention and
91       proposes a new protocol element for HTTP.

93    1.2 Terms

95       This section contains definitions and explanations for some terms
96       that may otherwise not be clear.

98       - Accept-Charset attribute: An HTML attribute, proposed in [RFC2070]
99         and taken up in HTML 4.0 [HTML4.0]. Please note that this is not
100        the same as the Accept-Charset request-header field in HTTP.
101        Please also note that the Accept-Charset attribute is on INPUT and
102        TEXTAREA in RFC 2070, but on FORM in HTML 4.0. The HTML 4.0 syntax
103        is preferred, and assumed in this document.

105      - CGI Script: In the context of this document, a placeholder for any
106        kind of functional component used to process a response to a
107        query.

109      - Character Encoding: A mapping from an octet sequence to a sequence
110        of characters. Misleadingly called "character set" in some IETF
111        documents [RFC2045]. Denoted by the value of the "charset" para-
112        meter, with values from the corresponding IANA registry [IANA].

114      - Transcoding Server/Proxy: An HTTP server or proxy which transcodes
115        the documents it serves, to respond to an "Accept-Charset" HTTP
116        request header field.

118      - Transcoding: The act of changing the character encoding of a docu-
119        ment, while not changing it otherwise (the length of the document
120        may be affected).

122   2. A Simple Application Convention for Browsers

124      This section spells out an application convention that is in use in
125      most current and older browsers, although it is not followed, or not
126      completely followed, by all browsers, and that can be implemented
127      easily.

129      The convention is that a user agent should send back the results of a
130      query in exactly the same character encoding as the character encod-
131      ing of the document that contained the FORM, as received by the user
132      agent.

134      The advantage of this application convention is that it works nicely
135      for documents and CGI scripts that are assuming a single character
136      encoding. In the plain case, neither the server nor the CGI script
137      has to do any special processing such as trying to detect the char-
138      acter encoding of the query component or transcode the query compo-
139      nent.

141      This application convention fails if the document has been transcoded
142      by a transcoding proxy. The query component is sent back in the
143      character encoding requested by the user agent, which is the target
144      character encoding of the transcoding undergone at the proxy. The
145      query component sent back to the server, however, must not be changed
146      by the proxy (see [HTTP1.1]).

148   3. Upgrading of Query Component to UTF-8

150      For those parts of an URL that originate at the server, in particular
151      for the path component, the introduction of UTF-8 [RFC2044] as the
152      encoding of choice can be made on a per-server or per-resource basis.
153      Because the name space of the path component is usually very sparsely
154      populated, it is even possible to accept URLs with path components in
155      different character encodings for the same resource.

157      The query component of an URL, however, is in most cases generated
158      independently in the user agent, and the namespace can be very
159      densely populated. To upgrade it to UTF-8 therefore requires addi-
160      tional provisions.

162      Here, we propose to add a single header field to HTTP. The header
163      field is used both as a request header field and as a response header
164      field.

166   3.1 The Query-UTF-8 Request/Response-Header Field

168      The syntax of the Query-UTF-8 request/response-header field is
169      defined as follows:

171          query-utf-8 = "Query-UTF-8" ":" ( "Yes" | "No" )

173      Both "Yes" and "No" above are case-insensitive; that is, "Yes" as
174      well as "yes" or "yES", and so on, are acceptable.

176      As a response-header field (sent from the server to the client), the
177      field indicates whether the user agent can send back the query compo-
178      nent encoded as UTF-8 or not. If the value is "Yes", and the scheme
179      component and site component of the URL of the document containing
180      the FORM and the URL given for query submission are identical, the
181      query component SHOULD be sent back encoded as UTF-8. If the value
182      is "No", and the FORM does not have an Accept-charset attribute that
183      contains the "charset" parameter value "UTF-8", then the query compo-
184      nent MUST be sent back according to the application convention
185      described in Section 2, or in some other way by older browsers.

187      As a request-header field (sent from the client to the server; the
188      term request-header field is somewhat misleading here), the field
189      indicates whether the query component is encoded as UTF-8. A Query-
190      UTF-8 request-header field MUST be sent back when the following con-
191      ditions are all met:

193      - The URL sent back contains a query component.

195      - The document containing the FORM is received with a Query-UTF-8
196        response-header field with value "Yes" or the Accept-Charset
197        attribute of the FORM contains the charset parameter value of
198        "UTF-8".

200      - The client recognizes the corresponding syntax. (The intention of
201        this last condition is to allow Query-UTF-8 to be phased out
202        after a transitional period.)

204   3.2 Rationale

206      The availability of both the Accept-charset attribute on FORM and the
207      Query-UTF-8 response-header field may seem unnecessary. The rationale
208      for this is to allow two modes of operation, called server-driven and
209      script-driven.

211      In script-driven mode, the CGI script handles character encoding
212      negotiation and identification. Typically, the author of a FORM docu-
213      ment and the corresponding CGI script will use the Accept-charset
214      attribute on FORM with the value "UTF-8" to tell the client to send
215      back data in UTF-8. The script will then check for the presence and
216      value of the Query-UTF-8 request-header field in the response from
217      the client, and make conversions if necessary.
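
The script-driven check described above can be sketched in Python. This
is only an illustration under stated assumptions, not part of the draft:
the function names (parse_query_utf8, decode_query_value), the plain
dictionary of request headers, and the form_encoding fallback argument
are all hypothetical. The sketch parses the Query-UTF-8 value
case-insensitively, as Section 3.1 requires, and decodes a
percent-encoded query value as UTF-8 only when the client said "Yes";
otherwise it falls back to the encoding the FORM document was served in
(the convention of Section 2).

```python
from urllib.parse import unquote_to_bytes


def parse_query_utf8(value):
    """Map a Query-UTF-8 field value to True ("Yes"), False ("No"), or None.

    The draft makes the value case-insensitive; anything else, or a
    missing field, is reported as None.
    """
    if value is None:
        return None
    token = value.strip().lower()
    if token == "yes":
        return True
    if token == "no":
        return False
    return None


def decode_query_value(raw_value, headers, form_encoding):
    """Decode one percent-encoded query value (hypothetical helper).

    raw_value     -- value as it appears in the URL, e.g. "D%C3%BCrst"
    headers       -- request header fields as a dict (an assumption; a real
                     CGI script would read them from its environment)
    form_encoding -- encoding of the FORM document, used when the client
                     did not announce UTF-8
    """
    # Header field names are case-insensitive in HTTP, so normalize them.
    lowered = {name.lower(): value for name, value in headers.items()}
    is_utf8 = parse_query_utf8(lowered.get("query-utf-8"))
    encoding = "utf-8" if is_utf8 else form_encoding
    # Form submissions use "+" for a space; undo that before unescaping.
    octets = unquote_to_bytes(raw_value.replace("+", " "))
    return octets.decode(encoding)
```

For example, with these assumptions the value "D%C3%BCrst" accompanied
by "Query-UTF-8: yES" and the value "D%FCrst" sent without the field
(for a FORM document served as iso-8859-1) both decode to the same
character string.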

219      In server-driven mode, the character encoding that a CGI script
220      expects to receive is registered with the server in a similar way as
221      the character encodings of documents (including those generated by
222      CGI scripts) are registered. A server offering such a functionality
223      adds the Query-UTF-8 response-header field with value "Yes" to outgo-
224      ing documents containing FORMs, and converts from UTF-8 back to the
225      encoding the CGI script is expecting when a query arrives with
226      "Query-UTF-8: Yes".

228      The distinction between script-driven and server-driven mode is not
229      made based on whether Query-UTF-8 or the Accept-Charset attribute is
230      used. Both features are provided because it is easier for a document
231      author to use Accept-Charset, and easier for a server to add Query-
232      UTF-8. Also, because a server does not know about the facilities
233      available on other servers, "Query-UTF-8: Yes" sent from the server
234      to the client is only valid if the query result is sent back to the
235      same server. For query results sent to other servers, the Accept-
236      Charset attribute must be used.

238   Acknowledgements

240      I am grateful in particular to the following persons for their help
241      and/or criticism: Roy Fielding, Eric van der Poel, Francois Yergeau,
242      Gavin Nicol, Frank Tang, Larry Masinter, and Tim Greenwood.

244   Bibliography

246      [Fileupload]   E. Nebel and L. Masinter, "Form-based File Upload in
247                     HTML", draft-ietf-html-fileupload-03.txt, August 1995.

249      [FTPINT]       B. Curtin, "Internationalization of the File Transfer
250                     Protocol", draft-ietf-ftpext-intl-ftp-02.txt, June
251                     1997.

253      [HTTP1.1]      R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T.
254                     Berners-Lee, "Hypertext Transfer Protocol --
255                     HTTP/1.1", RFC 2068, January 1997.

257      [HTML4.0]      D. Raggett, A. Le Hors, and I. Jacobs, "HTML 4.0 Spec-
258                     ification", http://www.w3.org/TR/WD-html40/, July
259                     1997.

261      [IMAPURL]      Ch. Newman, "IMAP URL Scheme", draft-newman-url-
262                     imap-10.txt, July 1997.

264      [RFC1738]      T. Berners-Lee, L. Masinter, and M. McCahill, "Uniform
265                     Resource Locators (URL)", RFC 1738, CERN, Dec. 1994.

267      [RFC2044]      F. Yergeau, "UTF-8, A Transformation Format of Unicode
268                     and ISO 10646", RFC 2044, Alis Technologies, October 1996.

270      [RFC2045]      N. Freed, N. Borenstein, "Multipurpose Internet Mail
271                     Extensions (MIME) Part One: Format of Internet Message
272                     Bodies", RFC 2045, November 1996.

274      [RFC2070]      F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter-
275                     nationalization of the Hypertext Markup Language", RFC
276                     2070, January 1997 (Note: This RFC is currently being
277                     updated to reference Unicode 2.0 and ISO 10646 includ-
278                     ing AM-5. The new definition of UTF-8 should be used).

280      [RFC2130]      C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R.
281                     Atkinson, M. Crispin, P. Svanberg, "The Report of the
282                     IAB Character Set Workshop held 29 February - 1 March,
283                     1996", RFC 2130, April 1997.

285      [URLprocess]   L. Masinter, D. Zigmond and H. Alvestrand, "Guidelines
286                     and Process for new URL Schemes", draft-masinter-url-
287                     process-01.txt, March 1997.

289      [URLsyntax]    T. Berners-Lee, R. Fielding, L. Masinter, "Uniform
290                     Resource Locators (URL): Generic Syntax and Seman-
291                     tics", draft-fielding-url-syntax-05.txt, May 1997.

293   Author's Address

295         Martin J. Duerst
296         Multimedia-Laboratory
297         Department of Computer Science
298         University of Zurich
299         Winterthurerstrasse 190
300         CH-8057 Zurich
301         Switzerland

303         Tel: +41 1 257 43 16
304         Fax: +41 1 363 00 35
305         E-mail: mduerst@ifi.unizh.ch

307      NOTE -- Please write the author's name with u-Umlaut wherever
308      possible, e.g. in HTML as D&uuml;rst.