idnits 2.17.1

draft-kamp-httpbis-structure-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords.

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document date (October 30, 2016) is 2734 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '5' on line 402

  == Missing Reference: 'RFC7694' is mentioned on line 453, but not defined

  ** Obsolete undefined reference: RFC 7694 (Obsoleted by RFC 9110)

  == Missing Reference: 'Section 3' is mentioned on line 453, but not defined

  == Missing Reference: 'RFC7239' is mentioned on line 508, but not defined

  == Unused Reference: 'RFC5234' is defined on line 310, but no explicit
     reference was found in the text

  ** Obsolete normative reference: RFC 7230 (Obsoleted by RFC 9110, RFC 9112)


     Summary: 3 errors (**), 0 flaws (~~), 6 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                           PH. Kamp
3	Internet-Draft                                 The Varnish Cache Project
4	Intended status: Informational                          October 30, 2016
5	Expires: May 3, 2017

7	                      HTTP header common structure
8	                    draft-kamp-httpbis-structure-01

10	Abstract

12	   An abstract data model for HTTP headers, "Common Structure", and a
13	   HTTP/1 serialization of it, generalized from current HTTP headers.

15	Status of This Memo

17	   This Internet-Draft is submitted in full conformance with the
18	   provisions of BCP 78 and BCP 79.

20	   Internet-Drafts are working documents of the Internet Engineering
21	   Task Force (IETF).  Note that other groups may also distribute
22	   working documents as Internet-Drafts.  The list of current Internet-
23	   Drafts is at http://datatracker.ietf.org/drafts/current/.

25	   Internet-Drafts are draft documents valid for a maximum of six months
26	   and may be updated, replaced, or obsoleted by other documents at any
27	   time.  It is inappropriate to use Internet-Drafts as reference
28	   material or to cite them other than as "work in progress."

30	   This Internet-Draft will expire on May 3, 2017.

32	Copyright Notice

34	   Copyright (c) 2016 IETF Trust and the persons identified as the
35	   document authors.  All rights reserved.

37	   This document is subject to BCP 78 and the IETF Trust's Legal
38	   Provisions Relating to IETF Documents
39	   (http://trustee.ietf.org/license-info) in effect on the date of
40	   publication of this document.  Please review these documents
41	   carefully, as they describe your rights and restrictions with respect
42	   to this document.  Code Components extracted from this document must
43	   include Simplified BSD License text as described in Section 4.e of
44	   the Trust Legal Provisions and are provided without warranty as
45	   described in the Simplified BSD License.

47	1.  Introduction

49	   The HTTP protocol does not impose any structure or data model on the
50	   information in HTTP headers; the HTTP/1 serialization is the
51	   data model: an ASCII string without control characters.

53	   HTTP header definitions specify how the string must be formatted and,
54	   while families of similar headers exist, it still requires an
55	   uncomfortably large number of bespoke parser and validation routines
56	   to process HTTP traffic correctly.

58	   In order to improve performance, HTTP/2 and HPACK use naive text
59	   compression, which incidentally decoupled the on-the-wire
60	   serialization from the data model.

62	   During the development of HPACK it became evident that significantly
63	   bigger gains were available if semantic compression could be used,
64	   most notably with timestamps.  However, the lack of a common data
65	   structure for HTTP headers would make semantic compression one long
66	   list of special cases.

68	   Parallel to this, various proposals were floated for how to fulfill
69	   data-transportation needs and, to a lesser degree, to impose some
70	   kind of order on HTTP headers, at least going forward.

72	   All of these proposals (JSON, CBOR, etc.) run into the same basic
73	   problem: their serialization is incompatible with [RFC7230]'s ABNF
74	   definition of 'field-value'.

76	   For binary formats, such as CBOR, a wholesale base64/85
77	   reserialization would be needed, with negative results for both
78	   debuggability and bandwidth.

80	   For textual formats, such as JSON, the format must first be neutered
81	   to not violate field-value's ABNF, and then workarounds added to
82	   reintroduce the features just lost, for instance UNICODE strings, and
83	   suddenly it is no longer JSON.

85	   This proposal starts from the other end, and builds and generalizes a
86	   data structure definition from existing HTTP headers, which means
87	   that HTTP/1 serialization and 'field-value' compatibility are built
88	   in.

90	   If all future HTTP headers are defined to fit into this Common
91	   Structure, we will at least have halted the proliferation of bespoke
92	   parsers and started to pave the road for semantic compression
93	   serializations of HTTP traffic.

95	1.1.  Terminology

97	   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
98	   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
99	   and "OPTIONAL" are to be interpreted as described in BCP 14, RFC 2119
100	   [RFC2119].

102	2.  Definition of HTTP header Common Structure

104	   The data model of Common Structure is an ordered sequence of named
105	   dictionaries.  Please see Appendix A for how this model was derived.

107	   The definition of the data model is deliberately abstract, uncoupled
108	   from any protocol serialization or programming environment
109	   representation, and meant as the foundation on which all such
110	   manifestations of the model can be built.

112	   Common Structure in ABNF:

114	       import token from RFC7230
115	       import DIGIT from RFC5234

117	       common-structure = 1* ( identifier dictionary )

119	       dictionary = * ( identifier value )

121	       value = identifier /
122	               number /
123	               ascii_string /
124	               unicode_string /
125	               blob /
126	               timestamp /
127	               common-structure

129	       identifier = token  [ "/" token ]

131	       number = ["-"] 1*15 DIGIT
132	               # XXX: Not sure how to do this in ABNF:
133	               # XXX: A single "." allowed between any two digits
134	               # The range is limited to ensure it can be
135	               # correctly represented in IEEE754 64 bit
136	               # binary floating point format.

138	       ascii_string = * %x20-7e
139	               # This is a "safe" string in the sense that it
140	               # contains no control characters or multi-byte
141	               # sequences.  If that is not fancy enough, use
142	               # unicode_string.

144	       unicode_string = * unicode_codepoint
145	               # XXX: Is there a place to import this from ?
146	               # Unrestricted unicode, because there is no sane
147	               # way to restrict or otherwise make unicode "safe".

149	       blob = * %x00-ff
150	               # Intended for cryptographic data and as a general
151	               # escape mechanism for unmet requirements.

153	       timestamp = POSIX time_t with optional millisecond resolution
154	               # XXX: Is there a place to import this from ?
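
   As an illustration only, the abstract model above can be sketched in
   Python.  The names NamedDictionary, CommonStructure and is_number are
   hypothetical and not part of this document; the number check encodes
   the informal rule stated in the comments (at most 15 digits, with a
   single "." allowed between two digits), and members rely on Python
   dictionaries preserving insertion order.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# number = ["-"] 1*15 DIGIT, plus the informal rule that a single "."
# is allowed between any two digits.
_NUMBER_RE = re.compile(r"-?[0-9]+(?:\.[0-9]+)?")


def is_number(text: str) -> bool:
    """Return True if text satisfies the 'number' rule sketched above."""
    if _NUMBER_RE.fullmatch(text) is None:
        return False
    # Count digits only; the sign and the decimal point do not count.
    digits = len(text.replace("-", "").replace(".", ""))
    return digits <= 15


@dataclass
class NamedDictionary:
    """One 'identifier dictionary' pair of a common-structure."""
    name: str                                   # identifier
    members: Dict[str, object] = field(default_factory=dict)


# common-structure = 1*( identifier dictionary ): an ordered list.
CommonStructure = List[NamedDictionary]

# Hypothetical instance: two named dictionaries, each with one member.
example: CommonStructure = [
    NamedDictionary("text/html", {"q": 0.8}),
    NamedDictionary("*/*", {"q": 0.1}),
]
```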

156	3.  HTTP/1 serialization of HTTP header Common Structure

158	   In ABNF:

160	       import OWS from RFC7230
161	       import HEXDIG, DQUOTE from RFC5234
162	       import UTF8-2, UTF8-3, UTF8-4 from RFC3629

164	       h1_common-structure-header =
165	               ( field-name ":" OWS ">" h1_common_structure "<" ) /
166	                       # Self-identifying HTTP headers
167	               ( field-name ":" OWS h1_common_structure )
168	                       # legacy HTTP headers on white-list, see Section 8

170	       h1_common_structure = h1_element  * ("," h1_element)

172	       h1_element = identifier * (";" identifier ["=" h1_value])

174	       h1_value = identifier /
175	               number /
176	               h1_ascii_string /
177	               h1_unicode_string /
178	               h1_blob /
179	               h1_timestamp /
180	               h1_common-structure

182	       h1_ascii_string = DQUOTE *(
183	                       ( "\" DQUOTE ) /
184	                       ( "\" "\" ) /
185	                       %x20-21 /
186	                       %x23-5B /
187	                       %x5D-7E
188	                       ) DQUOTE
189	               # This is a proper subset of h1_unicode_string
190	               # NB only allowed backslash escapes are \" and \\

192	       h1_unicode_string = DQUOTE *(
193	                       ( "\" DQUOTE ) /
194	                       ( "\" "\" ) /
195	                       ( "\" "u" 4HEXDIG ) /
196	                       %x20-21 /
197	                       %x23-5B /
198	                       %x5D-7E /
199	                       UTF8-2 /
200	                       UTF8-3 /
201	                       UTF8-4
202	                       ) DQUOTE
203	               # This is UTF8 with HTTP1 unfriendly codepoints
204	               # (00-1f, 7f) neutered with \uXXXX escapes.

206	       h1_blob = "'" base64 "'"
207	               # XXX: where to import base64 from ?

209	       h1_timestamp = number
210	               # UNIX/POSIX time_t semantics.
211	               # fractional seconds allowed.

213	       h1_common-structure = ">" h1_common_structure "<"

215	   XXX: Allow OWS in parsers, but not in generators ?

217	   In programming environments which do not define a native
218	   representation or serialization of Common Structure, the HTTP/1
219	   serialization should be used.
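
   A minimal sketch of a generator for this serialization, in Python, may
   help make the rules concrete.  Everything named here (Identifier,
   h1_string, h1_value, h1_common_structure, h1_header and the "Example"
   field name) is hypothetical.  The sketch assumes that a Python type
   selects the rule to apply, that None stands for a parameter without a
   value, that a nested list is wrapped in ">" and "<", and that no OWS
   is generated.

```python
import base64
from decimal import Decimal


class Identifier(str):
    """Marks a str as an identifier rather than a string value."""


def h1_string(text):
    """Quote text; escape the quote, the backslash, 00-1f and 7f."""
    out = []
    for ch in text:
        if ch in '"\\':
            out.append("\\" + ch)
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append("\\u%04x" % ord(ch))
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'


def h1_value(value):
    """Serialize one value; the Python type selects the h1_* rule."""
    if isinstance(value, Identifier):
        return str(value)
    if isinstance(value, bool):
        raise TypeError("booleans are not part of the model")
    if isinstance(value, int):
        return str(value)
    if isinstance(value, Decimal):
        return format(value, "f")       # never scientific notation
    if isinstance(value, str):
        return h1_string(value)
    if isinstance(value, bytes):
        return "'" + base64.b64encode(value).decode("ascii") + "'"
    if isinstance(value, list):
        return ">" + h1_common_structure(value) + "<"
    raise TypeError("unsupported value: %r" % (value,))


def h1_common_structure(elements):
    """elements: list of (identifier, dict) pairs; None means no value."""
    parts = []
    for name, params in elements:
        item = name
        for key, value in params.items():
            item += ";" + key
            if value is not None:
                item += "=" + h1_value(value)
        parts.append(item)
    return ",".join(parts)


def h1_header(field_name, elements):
    """Emit a self-identifying header line (value inside '>' and '<')."""
    return "%s: >%s<" % (field_name, h1_common_structure(elements))
```

   Timestamps are not treated separately because h1_timestamp is defined
   as a number; a caller would pass an int or a Decimal.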

221	4.  When to use Common Structure parser

223	   All future standardized and all private HTTP headers using Common
224	   Structure should self-identify as such, in the HTTP/1 serialization
225	   by making the first character ">" and the last "<".  (These two
226	   characters are deliberately "the wrong way" to not clash with
227	   existing usages.)

229	   Legacy HTTP headers which fit into Common Structure are marked as
230	   such in the IANA Message Header Registry (see Section 8), and a
231	   snapshot of the registry can be used to trigger parsing according to
232	   Common Structure of these headers.
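
   The decision described in this section could be sketched as follows,
   in Python.  The function name and the REGISTRY_SNAPSHOT table are
   hypothetical; the table stands in for a local snapshot of the
   registry and lists only three entries taken from Appendix B and
   Appendix C.

```python
# Hypothetical snapshot of the registry's "Common Structure" field,
# keyed by lower-cased field name.  Only a few entries are shown.
REGISTRY_SNAPSHOT = {
    "accept": True,          # listed in Appendix B
    "content-length": True,  # listed in Appendix B
    "date": False,           # listed in Appendix C
}


def uses_common_structure(field_name: str, field_value: str) -> bool:
    """Decide whether the Common Structure parser applies to a header."""
    value = field_value.strip(" \t")
    # Self-identifying headers start with ">" and end with "<".
    if len(value) >= 2 and value.startswith(">") and value.endswith("<"):
        return True
    # Otherwise only legacy headers marked "True" in the snapshot qualify.
    return REGISTRY_SNAPSHOT.get(field_name.lower()) is True
```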

234	5.  Desired normative effects

236	   All new HTTP headers SHOULD use the Common Structure if at all
237	   possible.

239	6.  Open/Outstanding issues to resolve

241	6.1.  Single/multiple headers

243	   Should we allow splitting common structure data over multiple headers
244	   ?

246	   Pro:

248	   Avoids size restrictions, easier on-the-fly editing

250	   Contra:

252	   Cannot act on any such header until all headers have been received.

254	   We must define where headers can be split (between identifier and
255	   dictionary ?, in the middle of dictionaries ?)

257	   Most on-the-fly editing is hackish at best.

259	7.  Future work

261	7.1.  Redefining existing headers for better performance

263	   The HTTP/1 serialization's self-identification mechanism makes it
264	   possible to extend the definition of existing Appendix C headers into
265	   Common Structure.

267	   For instance one could imagine:

269	       Date: >1475061449.201<

271	   Which would be faster to parse and validate than the current
272	   definition of the Date header and more precise too.

274	   Some kind of signal/negotiation mechanism would be required to make
275	   this work in practice.
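
   To illustrate why such a value is cheap to process, here is a small
   Python sketch that decodes the imagined Date value above.  The
   function name is hypothetical, and the sketch assumes a non-negative
   timestamp and truncates anything finer than milliseconds.

```python
from datetime import datetime, timedelta, timezone


def parse_h1_timestamp(field_value: str) -> datetime:
    """Decode a self-identifying timestamp such as '>1475061449.201<'."""
    text = field_value.strip()
    if not (text.startswith(">") and text.endswith("<")):
        raise ValueError("not a self-identifying Common Structure value")
    seconds, _, fraction = text[1:-1].partition(".")
    if not seconds.isdigit():
        raise ValueError("expected a non-negative number")
    # Pad or truncate the fraction to exactly three digits (milliseconds).
    millis = int((fraction + "000")[:3]) if fraction else 0
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=int(seconds), milliseconds=millis)
```

   With the example above, parse_h1_timestamp(">1475061449.201<") yields
   2016-09-28 11:17:29.201 UTC.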

277	7.2.  Define a validation dictionary

279	   A machine-readable specification of the legal contents of HTTP
280	   headers would go a long way to improve efficiency and security in
281	   HTTP implementations.

283	8.  IANA considerations

285	   The IANA Message Header Registry will be extended with an additional
286	   field named "Common Structure" which can have the values "True",
287	   "False" or "Unknown".

289	   The RFC723x headers listed in Appendix B will get the value "True" in
290	   the new field.

292	   The RFC723x headers listed in Appendix C will get the value "False"
293	   in the new field.

295	   All other existing entries in the registry will be set to "Unknown"
296	   unless and until the owner of the entry requests otherwise.

298	9.  Normative References

300	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
301	              Requirement Levels", BCP 14, RFC 2119,
302	              DOI 10.17487/RFC2119, March 1997.

305	   [RFC7230]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
306	              Protocol (HTTP/1.1): Message Syntax and Routing",
307	              RFC 7230, DOI 10.17487/RFC7230, June 2014.

310	   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
311	              Specifications: ABNF", STD 68, RFC 5234,
312	              DOI 10.17487/RFC5234, January 2008.

315	Appendix A.  Do HTTP headers have any common structure ?

317	   Several proposals have been floated in recent years to use some
318	   preexisting structured data serialization or other for HTTP headers,
319	   to impose some sanity.

321	   None of these proposals have gained traction and no obvious candidate
322	   data serializations have been left unexamined.

324	   This effort tries to tackle the question from the other side, by
325	   asking if there is a common structure in existing HTTP headers we can
326	   generalize for this purpose.

328	A.1.  Survey of HTTP header structure

330	   The RFC723x family of HTTP/1 standards controls 49 entries in the
331	   IANA Message Header Registry, and they share two common motifs.

333	   The majority of RFC723x HTTP headers are lists.  A few of them are
334	   ordered ('Content-Encoding'), some are unordered ('Connection') and
335	   some are ordered by 'q=%f' weight parameters ('Accept').

337	   In most cases, the list elements are some kind of identifier, usually
338	   derived from ABNF 'token' as defined by [RFC7230].

340	   A subgroup of headers, mostly related to MIME, uses what one could
341	   call a 'qualified token':

343	       qualified_token = token_or_asterix [ "/" token_or_asterix ]

345	   The second motif is parameterized list elements.  The best known is
346	   the "q=0.5" weight parameter, but other parameters exist as well.

348	   Generalizing from these motifs, our candidate "Common Structure" data
349	   model becomes an ordered list of named dictionaries.

351	   In pidgin ABNF, ignoring white-space for the sake of clarity, the
352	   HTTP/1.1 serialization of Common Structure is something like:

354	       token_or_asterix = token from RFC7230, but also allowing "*"

356	       qualified_token = token_or_asterix [ "/" token_or_asterix ]

358	       field-name, see RFC7230

360	       Common_Structure_Header = field-name ":" 1#named_dictionary

362	       named_dictionary = qualified_token [ *(";" param) ]

364	       param = token [ "=" value ]

366	       value = we'll get back to this in a moment.

368	   Nineteen out of the RFC723x's 49 headers, almost 40%, can already be
369	   parsed using this definition, and none of the rest have requirements
370	   which could not be met by this data model.  See Appendix B and
371	   Appendix C for the full survey details.
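
   The pidgin ABNF above maps onto a very small parser.  The following
   Python sketch uses a hypothetical function name and makes two
   simplifying assumptions: values are kept as text, since 'value' is
   only discussed in the next section, and no quoted string contains ","
   or ";".

```python
def parse_named_dictionaries(field_value: str):
    """Parse '1#named_dictionary' into a list of (name, params) pairs."""
    result = []
    for element in field_value.split(","):
        element = element.strip()
        if not element:
            continue  # tolerate empty list elements
        name, *params = [part.strip() for part in element.split(";")]
        parsed = {}
        for param in params:
            if not param:
                continue  # tolerate a stray ";"
            key, sep, value = param.partition("=")
            # A parameter without "=" carries no value.
            parsed[key.strip()] = value.strip() if sep else None
        result.append((name, parsed))
    return result
```

   For instance, the 'Accept' value "text/html;q=0.8, */*;q=0.1" becomes
   two named dictionaries, each with a single "q" member.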

373	A.2.  Survey of values in HTTP headers

375	   Surveying the datatypes of HTTP headers, standardized as well as
376	   private, the following picture emerges:

378	A.2.1.  Numbers

380	   Integer and floating point are both used.  Range and precision are
381	   mostly unspecified in controlling documents.

383	   Scientific notation (9.192631770e9) does not seem to be used
384	   anywhere.

386	   The ranges used seem to be minus several thousand to plus a couple of
387	   billions, the high end almost exclusively being POSIX time_t
388	   timestamps.

390	A.2.2.  Timestamps

392	   Usually the RFC723x text format, but POSIX time_t represented as integer
393	   or floating point is not uncommon.  ISO8601 has also been spotted.

395	A.2.3.  Strings

397	   The vast majority are pure ASCII strings, with either no escapes, %xx
398	   URL-like escapes or C-style back-slash escapes, possibly with the
399	   addition of \uxxxx UNICODE escapes.

401	   Where non-ASCII character sets are used, they are almost always
402	   implicit, rather than explicit.  UTF8 and ISO-8859-1[5] seem to be
403	   most common.

405	A.2.4.  Binary blobs

407	   Often used for cryptographic data.  Usually in base64 encoding,
408	   sometimes ""-quoted, more often not.  base85 encoding is also seen,
409	   usually quoted.

411	A.2.5.  Identifiers

413	   Identifiers almost always seem to fit the RFC723x 'token' definition.

415	A.3.  Is this actually a useful thing to generalize ?

417	   The number one wishlist item seems to be UNICODE strings, with a big
418	   side order of not having to write a new parser routine every time
419	   somebody comes up with a new header.

421	   Having a common parser would indeed be a good thing, and having an
422	   underlying data model which makes it possible to define a compressed
423	   serialization, rather than rely on serialization to text followed by
424	   text compression (i.e., HPACK), seems like a good idea too.

426	   However, when using a data model and a parser general enough to
427	   transport useful data, it will have to be followed by a validation
428	   step, which checks that the data also makes sense.
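
   As a hedged illustration of such a validation step, the Python sketch
   below checks one property of already-parsed data: that an optional "q"
   weight parameter lies between 0 and 1.  The function name and the
   shape of its input (a dictionary of parameter names to text values)
   are assumptions made for this example.

```python
from decimal import Decimal, InvalidOperation


def validate_q(params: dict) -> bool:
    """Return True if an optional 'q' parameter is a weight in [0, 1]."""
    if "q" not in params:
        return True                 # the parameter is optional
    try:
        weight = Decimal(params["q"])
    except (InvalidOperation, TypeError):
        return False                # not a number, or no value at all
    # is_finite() rejects NaN and infinities before the range check.
    return weight.is_finite() and 0 <= weight <= 1
```

   The generic parser accepts "q=1.5" as well-formed; only a check of
   this kind can reject it as meaningless.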

430	   Today validation, such as it is, is often done by the bespoke
431	   parsers.

433	   This then is probably where the next big potential for improvement
434	   lies:

436	   Ideally, a machine-readable "data dictionary" would make it possible
437	   to copy that text out of RFCs and run it through a code generator
438	   which spits out validation code operating on the output of the common
439	   parser.

441	   But history has been particularly unkind to that idea.

443	   Most attempts studied as part of this effort have sunk under
444	   complexity caused by reaching for generality, but where scope has
445	   been wisely limited, it seems to be possible.

447	   So file that idea under "future work".

449	Appendix B.  RFC723x headers with "common structure"

451	       Accept              [RFC7231, Section 5.3.2]
452	       Accept-Charset      [RFC7231, Section 5.3.3]
453	       Accept-Encoding     [RFC7231, Section 5.3.4][RFC7694, Section 3]
454	       Accept-Language     [RFC7231, Section 5.3.5]
455	       Age                 [RFC7234, Section 5.1]
456	       Allow               [RFC7231, Section 7.4.1]
457	       Connection          [RFC7230, Section 6.1]
458	       Content-Encoding    [RFC7231, Section 3.1.2.2]
459	       Content-Language    [RFC7231, Section 3.1.3.2]
460	       Content-Length      [RFC7230, Section 3.3.2]
461	       Content-Type        [RFC7231, Section 3.1.1.5]
462	       Expect              [RFC7231, Section 5.1.1]
463	       Max-Forwards        [RFC7231, Section 5.1.2]
464	       MIME-Version        [RFC7231, Appendix A.1]
465	       TE                  [RFC7230, Section 4.3]
466	       Trailer             [RFC7230, Section 4.4]
467	       Transfer-Encoding   [RFC7230, Section 3.3.1]
468	       Upgrade             [RFC7230, Section 6.7]
469	       Vary                [RFC7231, Section 7.1.4]

471	Appendix C.  RFC723x headers with "uncommon structure"

473	   1 of the RFC723x headers is only reserved, and therefore has no
474	   structure at all:

476	       Close               [RFC7230, Section 8.1]

478	   5 of the RFC723x headers are HTTP dates:

480	       Date                [RFC7231, Section 7.1.1.2]
481	       Expires             [RFC7234, Section 5.3]
482	       If-Modified-Since   [RFC7232, Section 3.3]
483	       If-Unmodified-Since [RFC7232, Section 3.4]
484	       Last-Modified       [RFC7232, Section 2.2]

486	   24 of the RFC723x headers use bespoke formats which only a single or
487	   in rare cases two headers share:

489	       Accept-Ranges       [RFC7233, Section 2.3]
490	           bytes-unit / other-range-unit

492	       Authorization       [RFC7235, Section 4.2]
493	       Proxy-Authorization [RFC7235, Section 4.4]
494	           credentials

496	       Cache-Control       [RFC7234, Section 5.2]
497	           1#cache-directive

499	       Content-Location    [RFC7231, Section 3.1.4.2]
500	           absolute-URI / partial-URI

502	       Content-Range       [RFC7233, Section 4.2]
503	           byte-content-range / other-content-range

505	       ETag                [RFC7232, Section 2.3]
506	           entity-tag

508	       Forwarded           [RFC7239]
509	           1#forwarded-element

511	       From                [RFC7231, Section 5.5.1]
512	           mailbox

514	       If-Match            [RFC7232, Section 3.1]
515	       If-None-Match       [RFC7232, Section 3.2]
516	           "*" / 1#entity-tag

518	       If-Range            [RFC7233, Section 3.2]
519	           entity-tag / HTTP-date

521	       Host                [RFC7230, Section 5.4]
522	           uri-host [ ":" port ]

524	       Location            [RFC7231, Section 7.1.2]
525	           URI-reference

527	       Pragma              [RFC7234, Section 5.4]
528	           1#pragma-directive

530	       Range               [RFC7233, Section 3.1]
531	           byte-ranges-specifier / other-ranges-specifier

533	       Referer             [RFC7231, Section 5.5.2]
534	           absolute-URI / partial-URI

536	       Retry-After         [RFC7231, Section 7.1.3]
537	           HTTP-date / delay-seconds

539	       Server              [RFC7231, Section 7.4.2]
540	       User-Agent          [RFC7231, Section 5.5.3]
541	           product *( RWS ( product / comment ) )

543	       Via                 [RFC7230, Section 5.7.1]
544	           1#( received-protocol RWS received-by [ RWS comment ] )

546	       Warning             [RFC7234, Section 5.5]
547	           1#warning-value

549	       Proxy-Authenticate  [RFC7235, Section 4.3]
550	       WWW-Authenticate    [RFC7235, Section 4.1]
551	           1#challenge

553	Author's Address

555	   Poul-Henning Kamp
556	   The Varnish Cache Project

558	   Email: phk@varnish-cache.org