Network Working Group                                          J. Butler
Internet-Draft
Intended status: Informational                                    W. Lee
Expires: May 1, 2017
                                                              B. McQuade

                                                               K. Mixter
                                                        October 28, 2016

         A Proposal for Shared Dictionary Compression over HTTP
                         draft-lee-sdch-spec-00

Abstract

   This paper proposes an HTTP/1.1-compatible extension that supports
   inter-response data compression by means of a reference dictionary
   shared between user agent and server.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 1, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

1.  Introduction

   In order to reduce payload size, HTTP/1.1 supports response
   compression via the Accept-Encoding and Content-Encoding headers.
   The most commonly used HTTP response compression encoding is gzip,
   which compresses data that is repeated within a given response.
   However, HTTP/1.1 does not provide a mechanism for compressing data
   that is repeated between responses. A different class of encoding
   technique, known as delta encoding, has proven effective at
   compressing inter-response data.

   Previous efforts to extend HTTP/1.1 to support delta compression have
   focused on encoding an HTTP response as a delta of a previous version
   of that response. One such approach is discussed in RFC 3229 "Delta
   encoding in HTTP" [RFC3229]. While RFC 3229 is effective at reducing
   payload size for many types of resources, it may not be suitable for
   certain classes of responses.

   Specifically, under RFC 3229, deltas can only be applied to responses
   originating from the same URL, and the means of identifying the
   instance to delta "from" is by a Last-Modified timestamp or entity
   tag. This makes RFC 3229 unsuitable for compressing dynamically
   generated responses to a given URL with varying query parameters
   (e.g. a search results page), since these types of responses are
   difficult to identify uniquely using entity tags or last modified
   timestamps. Content hashes can be used, but false positives are
   possible. Also, storing all previous responses on the server may not
   be practical.

2.  Proposal: Shared Dictionary Compression over HTTP

   Existing techniques compress each response in isolation, and so
   cannot take advantage of cross-payload redundancy. For example,
   retrieving a set of HTML pages with the same header, footer, inlined
   JavaScript and CSS requires the retransmission of the same data
   multiple times. This paper proposes a compression technique that
   leverages this cross-payload redundancy.

   In this proposal, a dictionary is a file downloaded by the user agent
   from the server that contains strings which are likely to appear in
   subsequent HTTP responses.
   In the case described above, if the
   header, footer, JavaScript and CSS are stored in a dictionary
   possessed by both user agent and server, the server can substitute
   these elements with references to the dictionary, and the user agent
   can reconstruct the original page from these references. By
   substituting dictionary references for repeated elements in HTTP
   responses, the payload size can be reduced.

   If either the user agent or the server does not support the
   extension, then ordinary HTTP responses are served.

   If both the user agent and the server support the extension but the
   user agent does not have an applicable dictionary (as described in
   detail below), the server responds with an ordinary HTTP response
   that includes a header advertising the location of a relevant
   dictionary. This dictionary can be retrieved out-of-band by the user
   agent.

   If both the user agent and the server support the extension and the
   user agent has an applicable dictionary, then each HTTP response
   includes references to strings in the dictionary, rather than
   repeating those strings in the response. The references require
   fewer bytes to encode than the strings themselves, reducing the
   payload size.

   The HTTP header-based protocol for negotiating the presence of
   dictionaries on user agent and server is referred to in this proposal
   as the SDCH protocol. The compression scheme based on a particular
   dictionary shared between user agent and server is referred to as the
   SDCH encoding, and is built upon the VCDIFF compression data format
   [RFC3284].

3.  Syntax

   The grammar descriptions in the sections that follow depend on the
   following syntax: DIGIT (decimal digit), BASE64URLDIGIT (alphanumeric
   digit or "-" or "_"), PAYLOADBYTE (a byte), token (informally, a
   sequence of non-special, non-white space characters), rest-of-line
   (informally, a sequence of characters not including carriage return
   or line-feed). In the grammar below, HTTP_url, abs_path, and query
   are defined in RFC 7230 [RFC7230].

      header               = attr ":" value "\n"
      attr                 = token
      value                = rest-of-line
      dictionary-client-id = 1*BASE64URLDIGIT
      dictionary-server-id = 1*BASE64URLDIGIT
      payload              = 1*PAYLOADBYTE
      vcdiff-payload       = 1*PAYLOADBYTE
      partial-url          = HTTP_url | abs_path [ "?" query ]

   The attribute names (attr) are case-insensitive. White space is
   permitted between tokens.

4.  Dictionary Description

4.1.  General

   In the proposed protocol, a dictionary can only be used with a
   limited set of URLs and for a limited duration of time, referred to
   as its scope and lifetime, respectively. A dictionary is composed of
   the data used by the compression algorithm, known as the payload, as
   well as metadata describing its scope and lifetime. The scope is
   specified by a domain attribute and path attribute that are patterned
   after the same named attributes from the HTTP State Management
   Specification [RFC2965].

4.2.  Syntax of Dictionary Metadata

   The syntax of dictionary metadata is as follows:

      dictionary-metadata = 1*dictionary-header "\n"
      dictionary-header   = "domain" ":" value "\n"
                          | "path" ":" value "\n"
                          | "path-equals" ":" value "\n"
                          | "format-version" ":" value "\n"
                          | "max-age" ":" value "\n"
                          | "port" ":" <"> portlist <"> "\n"
      portlist            = 1#portnum
      portnum             = 1*DIGIT

   A complete dictionary definition then has this format:

      dictionary-definition = dictionary-metadata payload

   Informally, the metadata for a dictionary is a series of headers,
   similar in form to HTTP headers, terminated by an empty line. The
   dictionary payload begins immediately after this blank line.

   The valid dictionary header identifiers are described below:

   o  Domain: domain.

      Required. Indicates the domain to which the dictionary applies.
      The domain specification must explicitly start with a dot. For
      example, a dictionary with the domain specification ".google.com"
      may be used to compress a response served from the host name
      www.google.com, but not used to compress a response served from
      the host name www.gmail.com. Only printable ASCII characters are
      permitted in the domain value. International Domain Names must be
      specified using IDNA.

   o  Path: path.

      Optional. Indicates the set of URL paths for which this dictionary
      is valid. If unspecified, the dictionary applies to all paths
      within the given domain.

   o  Path-equals: path.

      Optional. Indicates the exact URL path for which this dictionary
      is valid. If both "path" and "path-equals" are specified, the
      dictionary applies only to those URLs which satisfy both criteria.

   o  Format-version: version.

      Optional. Indicates the version of the dictionary payload. If
      unspecified, the format version defaults to "1.0". Currently, the
      only acceptable value is "1.0".

   o  Max-age: delta-seconds.

      Optional. Indicates the amount of time that a dictionary can be
      advertised to the server by the user agent, relative to the time
      it was downloaded. If unspecified, the default is 30 days from the
      time the dictionary was downloaded by the user agent.

   o  Port: port list.

      Optional. Indicates the comma-separated list of ports to which
      this dictionary applies. If unspecified, the dictionary applies to
      all ports.

   Like HTTP headers, dictionary header identifiers are case-
   insensitive. Unknown headers will be ignored by the user agent,
   allowing other headers to be added in the future.

4.3.  Dictionary Scope

   The specific rules for when a dictionary can be applied to a URL,
   i.e. the rules that define its scope, are modeled after the rules for
   cookie scoping. The term "domain-match" is defined in RFC 2965. We
   define path-matching as follows. For two strings that represent
   paths, P1 and P2, P1 path-matches P2 if either:

   1.  P2 is equal to P1

   2.  P2 is a prefix of P1 and either the final character in P2 is "/"
       or the character following P2 in P1 is "/".

   For example, "/tec/waldo" path-matches "/tec", "/tec/", and
   "/tec/waldo", but does not path-match "/tec/wal".

   Given these definitions of domain-match and path-match, a request URL
   falls within a dictionary's scope exactly when all of the following
   are true:

   1.  The request URL's host name domain-matches the Domain attribute
       of the dictionary.

   2.  If the dictionary has a Port attribute, the request port is one
       of the ports listed in the Port attribute.

   3.  The request URL path-matches the path attribute of the
       dictionary.

   4.  The request URL's scheme matches the scheme of the dictionary.

   If a URL falls within a dictionary's scope, the dictionary is said to
   "apply" to the URL.

4.4.  Dictionary Identifier

   In communications between user agent and server, a dictionary is
   identified by the first 96 bits of the SHA-256 digest [RFC6234] of a
   dictionary's metadata and payload (see dictionary-definition above)
   exactly as it is received by the user agent from the server. Both
   user agent and server compute this identifier independently, based on
   the metadata and the payload of the dictionary. This digest should
   be unique within a dictionary's scope (domain and path) in order to
   prevent dictionary identifier collisions.

   The digest serves not only as an identifier but also as a safeguard
   against attempts to maliciously intercept or otherwise modify
   dictionary contents, since a compromised dictionary will hash to a
   different identifier and the server will not recognize it. The user
   agent identifier for a dictionary is defined as the URL-safe base64
   encoding (as described in RFC 3548, section 4 [RFC3548]) of the first
   48 bits (bits 0..47) of the dictionary's SHA-256 digest. The server
   identifier for a dictionary is the URL-safe base64 encoding of the
   second 48 bits (bits 48..95). When identifying a dictionary to the
   server, the user agent uses the user agent identifier, and similarly,
   when identifying a dictionary to the user agent, the server uses the
   server identifier. Note that both user agent and server have the
   entire dictionary and can thus compute both identifiers for the
   dictionary.

   As a consequence of this scheme, dictionaries do not need to be
   explicitly named by site maintainers, as the protocol avoids
   identifying them in any way other than the above digest-generated
   identifiers.

4.5.  Differences between Dictionaries and Cookies

   Dictionaries are similar to cookies in that they allow sharing of
   state over HTTP. Thus, we have modeled dictionaries after cookies,
   as described in RFC 2965.
   However, because dictionaries are
   typically larger than cookies, embedding a dictionary in the response
   would increase latency of the response. Thus a dictionary is always
   sent as a separate HTTP response (unlike a cookie, which is included
   in a Set-Cookie header of any HTTP response). The Get-Dictionary
   HTTP response header is used to tell the user agent that it should
   fetch a dictionary separately for use in future requests.

   Likewise, rather than including the dictionary contents in the HTTP
   request headers (like a cookie in the Cookie header), dictionary
   identifiers (described above) are used to advertise available
   dictionaries in HTTP requests from the user agent to the server.

5.  User Agent / Server Interaction Description

5.1.  User Agent Role in HTTP Request Generation

   The user agent:

   1.  Advertises support for the proposed protocol by adding the "sdch"
       token to the Accept-Encoding header of HTTP requests.

   2.  Advertises any dictionaries it possesses that apply to the URL
       being requested (per the scoping rules above) in the
       Avail-Dictionary request header.

   The Avail-Dictionary header syntax is as follows:

      avail-dictionary-header = "Avail-Dictionary" ":"
                                1#dictionary-client-id

   where dictionary-client-id is the user agent identifier part for the
   dictionary based on the SHA-256 digest as described above. The value
   of this header is informally a comma-separated list of user agent
   dictionary identifiers.

   The user agent must advertise every dictionary it has cached that
   applies to the requested URL. It is only the presence of the
   dictionary identifier in this header that indicates to the server
   that the user agent possesses the dictionary and therefore does not
   need to download it. Since the user agent must advertise every
   dictionary it has, it is the site maintainer's responsibility to
   avoid making too many dictionaries available at a given time.
   Advertising many
   dictionaries in this header can counteract the benefits of
   compression.

   Note that for each individual request the user agent has discretion
   over whether or not to add the "sdch" Accept-Encoding token and the
   Avail-Dictionary header. Since some responses, such as image data,
   are unlikely to benefit from dictionary compression, the user agent
   can reduce the size of its requests by not sending this token and
   header. The user agent may decide whether or not to add these
   headers based on file extensions in URLs or the context of the
   request. For instance, the user agent may choose to not advertise
   SDCH for URLs referenced in IMG elements.

5.2.  Server Role in HTTP Response Generation

   When a server that supports the extension receives a request that
   indicates that the user agent supports the protocol (e.g. the "sdch"
   token is present in the Accept-Encoding request header), two
   independent decisions must be made. The server must decide:

   1.  if it wants to send an encoded response.

   2.  if it wants to inform the user agent about additional
       dictionaries it can download and use in the future.

   The server may return an encoded response only if all of the
   following are true:

   1.  The Accept-Encoding request header contains the "sdch" token.

   2.  The server can send a response compressed with a dictionary whose
       dictionary-client-id is in the Avail-Dictionary request header.

   A server may return a response that is not encoded even if it
   recognizes a dictionary advertised by the user agent. If the server
   decides to not use SDCH encoding when an Avail-Dictionary header is
   present, it must include a specific HTTP header X-SDCH-Encoding with
   value "0" in the response.
   The syntax of the X-SDCH-Encoding header is:

      sdch-not-used-header = "X-SDCH-Encoding" ":" "0"

   The server indicates that an HTTP response is encoded by inserting
   the token "sdch" into the Content-Encoding header of the HTTP
   response.

   A compatible server may instruct a compatible user agent to download
   one or more new dictionaries by including the Get-Dictionary header
   in the HTTP response. The server may advertise a Get-Dictionary
   header even if the response is not encoded. The syntax of the
   Get-Dictionary header is:

      get-dictionary-header = "Get-Dictionary" ":" 1#partial-url

   where partial-url is either a complete URL, or just the absolute URL
   path (in which case the scheme, host, and port of the originating
   server would be used when requesting the dictionary). If a complete
   URL is provided, it must have the same scheme, host, and port as the
   originating server. The Content-Type header of dictionary responses
   must be application/x-sdch-dictionary. The value in the
   Get-Dictionary header is a comma-separated list of partial-url
   elements.

   The server must not advertise a dictionary with a
   dictionary-client-id that the user agent has listed in the
   Avail-Dictionary header.

   The server may use SDCH compression with a dictionary that the user
   agent has advertised and also include a Get-Dictionary header for a
   different dictionary that the user agent has not advertised.

   The server must prevent SDCH-encoded responses from being cached by
   intermediate proxies. See the section below on proxy caching for
   additional details.

   The server should limit the number of active dictionaries at any one
   time, by using well-scoped dictionaries. A server that has many
   active dictionaries with overlapping scope will cause user agents to
   generate a very long Avail-Dictionary header, the overhead of which
   can counteract the benefits of SDCH compression.
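
   As a rough illustration of the server-side decisions described in
   this section, the following Python sketch computes the two
   identifiers defined in the Dictionary Identifier section and then
   chooses the SDCH-related response headers. The function names, the
   "known" mapping, and the policy of using the first recognized
   dictionary are assumptions made for this example only; they are not
   part of the proposal.

```python
import base64
import hashlib
from typing import Dict, List, Optional, Tuple


def dictionary_ids(definition: bytes) -> Tuple[str, str]:
    """Return (user agent id, server id) for a dictionary definition.

    `definition` is the metadata plus payload, byte for byte as served.
    """
    digest = hashlib.sha256(definition).digest()
    # Bits 0..47 are the first 6 bytes; bits 48..95 are the next 6 bytes.
    # Six bytes encode to exactly 8 base64 characters, so no padding appears.
    client_id = base64.urlsafe_b64encode(digest[:6]).decode("ascii")
    server_id = base64.urlsafe_b64encode(digest[6:12]).decode("ascii")
    return client_id, server_id


def parse_avail_dictionary(value: str) -> List[str]:
    """Split an Avail-Dictionary header value into user agent ids."""
    return [item.strip() for item in value.split(",") if item.strip()]


def sdch_response_headers(accept_encoding: str,
                          avail_dictionary: Optional[str],
                          known: Dict[str, str]) -> Dict[str, str]:
    """Choose SDCH-related response headers for one request.

    `known` is a hypothetical mapping from user agent id to server id for
    the dictionaries this server is willing to encode with.
    """
    tokens = [t.split(";")[0].strip().lower()
              for t in accept_encoding.split(",")]
    if "sdch" not in tokens or avail_dictionary is None:
        # No SDCH support advertised, or no dictionary advertised:
        # serve an ordinary response.
        return {}
    for client_id in parse_avail_dictionary(avail_dictionary):
        if client_id in known:
            # The body would start with known[client_id], then a NUL byte,
            # then the VCDIFF payload.
            return {"Content-Encoding": "sdch"}
    # Avail-Dictionary was present but the response is not SDCH-encoded.
    return {"X-SDCH-Encoding": "0"}
```

   A real server would also decide whether to add a Get-Dictionary
   header and which cache directives to send; those choices are left
   out of the sketch to keep it short.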

   The server may decide to precompute and cache SDCH-encoded responses
   if a given SDCH-encoded response will be served multiple times (e.g.
   for static content).

   The server may apply multiple Content-Encodings to the response
   (e.g. sdch and gzip), in which case subsequent encoding tokens are
   appended to the Content-Encoding header, per the HTTP/1.1 RFC section
   14.11.

5.3.  User Agent Role in HTTP Response Handling

   An SDCH-compatible user agent must inspect the Content-Encoding HTTP
   response header to determine if the response is SDCH-encoded. If the
   Content-Encoding includes the "sdch" token, the user agent must
   perform SDCH decompression on the response.

   If the HTTP response includes a Get-Dictionary header, the user agent
   must verify that the partial-url specified refers to the same server
   that generated the response. If so, the user agent may download the
   dictionary at the given URL.

   There are two different URLs to consider when downloading and storing
   a dictionary. The referrer URL is the URL of the request that
   resulted in the server responding with a Get-Dictionary header.

   The dictionary URL is defined as follows:

   1.  If the partial-url is a complete URL, the dictionary URL is the
       partial-url.

   2.  If the partial-url is just a path URL, the dictionary URL is
       generated from the scheme and host name of the referrer URL and
       the path in the partial-url.

   The user agent may retrieve a dictionary if the origin of the
   dictionary matches the origin of the referrer. HTTP redirects may
   only be followed if the origin matches as well.

   Upon retrieving the dictionary, the user agent must validate the
   dictionary. Here again, the validation rules are modeled after the
   rules for when a user agent can accept an HTTP cookie. A dictionary
   is invalid and must not be stored if any of the following are true:

   1.  The dictionary has no Domain attribute.

   2.  The effective host name that derives from the referrer URL host
       name does not domain-match the Domain attribute.

   3.  The Domain attribute is a top level domain.

   4.  The referrer URL host is a host domain name (not IP address) and
       has the form HD, where D is the value of the Domain attribute,
       and H is a string that contains one or more dots.

   5.  The dictionary has a Port attribute and the referrer URL's port
       is not in the list.

   If the dictionary is valid and the user agent decides to store the
   dictionary, the scheme of the dictionary URL should also be stored
   along with the dictionary.

5.4.  SDCH-Encoded Response Body

   An SDCH-encoded response starts with the dictionary-server-id used to
   compress the response. The syntax of the SDCH-encoded response is:

      dictionary-compression-response = dictionary-server-id "\0"
                                        vcdiff-payload

6.  Examples

   For the purpose of these examples, assume the following dictionaries
   exist on the server and can be downloaded from the following URLs:

   "Search results" dictionary

   o  domain: .google.com

   o  path: /search

   o  user agent ID: TWFuIGlz

   o  server ID: JOWk0d2N

   o  download location: /dictionaries/search_dict

   "Help pages" dictionary

   o  domain: .google.com

   o  path: /

   o  user agent ID: GVhc3V48

   o  server ID: O9d2_m3-

   o  download location: /dictionaries/help_dict

   Note that the dictionary identifier consists of two parts: the user
   agent ID and the server ID. Most of the detail of the request and
   response headers has been omitted.

6.1.  Example 1: Initial Interaction, User Agent has No Dictionaries

   1.  user agent's request

      GET /search?q=sprouts HTTP/1.1
      Host: www.google.com
      Accept-Encoding: sdch, gzip

   2.  server's response

      HTTP/1.1 200 OK
      Content-type: text/html
      Content-Encoding: gzip
      Get-Dictionary: /dictionaries/search_dict, /dictionaries/help_dict
      Cache-Control: private

   Note that the response returned by the server does NOT use SDCH
   encoding, since the user agent does not have a dictionary. The
   server simply provides the locations of the dictionaries for future
   use. The user agent may choose to retrieve one or both dictionaries
   separately.

6.2.  Example 2: User Agent Requests the Dictionary

   1.  user agent's request

      GET /dictionaries/search_dict HTTP/1.1
      Host: www.google.com
      Accept-Encoding: sdch, gzip

   2.  server's response

      HTTP/1.1 200 OK
      Content-type: application/x-sdch-dictionary
      Content-Encoding: gzip

      Domain: .google.com
      Path: /search
      Format-version: 1.0

      ...dictionary contents...

   Upon receiving this response, the user agent computes the digest of
   the dictionary and determines the user agent ID is TWFuIGlz and the
   server ID is JOWk0d2N.

6.3.  Example 3: User Requests Page AND User Agent Has Already
      Downloaded the Dictionary

   1.  user agent's request

      GET /search?q=brussel+sprouts HTTP/1.1
      Host: www.google.com
      Accept-Encoding: sdch, gzip
      Avail-Dictionary: TWFuIGlz

   2.  server's response

      HTTP/1.1 200 OK
      Content-type: text/html
      Content-Encoding: sdch, gzip
      Get-Dictionary: /dictionaries/help_dict
      Cache-Control: private

      JOWk0d2N...VCDIFFed response...
      (note that the response shown above is the result of gzip
      decompression)

   The server has properly identified the dictionary using its server ID
   and the user agent can confirm that the second 48 bits of the SHA-256
   digest of the dictionary match its computation. It can then
   decompress the VCDIFF response using this dictionary.
   Even though
   the "search results" dictionary was used to decompress the response,
   the server has chosen to indicate another dictionary could be
   requested by the user agent from
   http://www.google.com/dictionaries/help_dict. This dictionary must
   be different from the "search results" dictionary, as the server must
   never request that the user agent download a dictionary it knows the
   user agent already has. Let's assume the user agent decides to
   download this dictionary.

6.4.  Example 4: User Requests with Multiple Dictionaries

   1.  user agent's request

      GET /search?q=brussels HTTP/1.1
      Host: www.google.com
      Accept-Encoding: sdch, gzip
      Avail-Dictionary: GVhc3V48,TWFuIGlz

   2.  server's response

      HTTP/1.1 200 OK
      Content-type: text/html
      Content-Encoding: sdch, gzip
      Cache-Control: private

      JOWk0d2N...VCDIFFed response...
      (note that the response shown above is the result of gzip
      decompression)

   The user agent advertises that it has already downloaded two
   dictionaries that apply. The server may compress the response with
   either dictionary. As the server has no other dictionaries that
   apply to the request, it does not advertise any dictionaries in its
   response.

7.  Implementation Considerations

7.1.  Implementation Limits

   There are practical limitations to the number and size of the
   dictionaries a user agent can store. It is suggested that
   general-use, non-mobile user agents should have the following minimum
   capabilities:

   o  At least 300 dictionaries stored

   o  At least 100KB of payload per dictionary

   o  At least 10MB of total dictionary contents

   o  At least 20 dictionaries stored per domain

7.2.  Dictionary Downloading

   The user agent always has the choice of whether or not to download a
   dictionary. It is recommended that the user agent be implemented
   with sufficient state to avoid downloading too many dictionaries from
   the same server.
   A malfunctioning server may also request that the user
   agent continually download the same dictionary. One simple method to
   avoid both of these possibilities is for the user agent to rate-limit
   downloading dictionaries from the same domain.

   When the user agent receives a response with a Get-Dictionary header
   with dictionary download URLs that it may fetch, it should perform
   the dictionary downloads in the background. This is possible as the
   dictionary to be downloaded is guaranteed to not be needed to
   decompress the response with the Get-Dictionary header. The user
   agent should be careful to abort background dictionary downloads that
   do not complete in a reasonable amount of time.

7.3.  Data Integrity

   If the dictionaries are tied to individual users or specific user
   actions, HTTP may leak this information to a passive attacker by
   allowing the Get-Dictionary info to be seen. When using HTTPS, this
   risk is prevented by design, since Get-Dictionary URLs are required
   to be same-origin with the response.

   However, downloading dictionaries over HTTPS or advertising
   dictionaries over HTTPS might introduce new security risks.

   TODO: add some examples. For example, SDCH-over-HTTPS is subject to
   compression oracle attacks similar to CRIME/BREACH, with the
   difference that the compression context is not supplied by the
   attacker. If an attacker had the contents of a dictionary, there is
   a theoretical possibility of an attack where a server sends a static
   response XOR'ed with user-provided data. If the attacker can provide
   data which reduces the size of the response when XOR'ed with the
   static response, the attacker may then be able to determine the
   contents of the static response.

   The protocol needs to ensure that the content as decompressed by the
   user agent with a given dictionary is identical to the server's
   originally intended content.
   The three areas that can cause a data
   integrity problem are discussed below.

7.3.1.  Data tampered by Proxy

   We have found incorrectly implemented proxies which tamper with an
   SDCH response and make it impossible to decompress the response to
   the server's originally intended content. The tampering may not be
   detected in the SDCH encoding itself if the proxy makes SDCH content
   look like non-SDCH content, for instance, by stripping the "sdch"
   token from the Content-Encoding header of the response or by adding
   additional encodings (like gzip) on top of the SDCH and gzipped
   response without making the Content-Encoding header match. In order
   to detect when this occurs, the HTTP header X-SDCH-Encoding must be
   added to the response by the server to inform the client that the
   response was originally not SDCH encoded by the server. Should the
   user agent advertise SDCH capability in the request but receive a
   non-SDCH encoded response without the X-SDCH-Encoding header, it
   suggests that the response was tampered with by a proxy. The user
   agent may then take action to avoid using SDCH in the future.

7.3.2.  Dictionary mismatch

   When dictionary information is exchanged between user agent and
   server, it is necessary to ensure that the dictionary identifiers are
   completely unambiguous, or the decompressed result may differ from
   the original content. To address this issue, SDCH uses the first 96
   bits of the SHA-256 digest of a dictionary's metadata and payload to
   create the dictionary identifiers used by the user agent and server
   to avoid ambiguity. (Please refer to the section "Dictionary
   Description" above for details.)

7.3.3.  Data corruption / malicious attacks

   While this issue is not specific to SDCH, it can be exacerbated due
   to the nature of the stateful compression.
For example, if the dictionary is corrupted or maliciously modified
in a persistent on-disk cache, all subsequent responses decoded using
this dictionary will be corrupt. For this reason, the user agent and
server should revalidate the integrity of dictionaries when they are
loaded from non-volatile storage.

Other issues, such as corruption of the encoded payload during
transmission, could have a much bigger adverse effect than the same
corruption in plain text. TCP provides a checksum, but it cannot
detect some errors, such as swapped bytes. To address this issue,
SDCH includes an Adler32 checksum [RFC1950] in the encoded data
shards. (Please refer to the appendix "VCDIFF Encoding Format and
SDCH" for details.)

8. Response Caching

8.1. User Agent Cache

The user agent should honor HTTP caching directives (Cache-Control,
Expires, ...) for caching responses, whether or not the responses are
SDCH-encoded. SDCH-encoded responses should be decoded before being
written to the cache. If this is not possible, the user agent may
cache SDCH-encoded responses, unless the HTTP response headers
indicate that the response is not cacheable. In this case, an
SDCH-encoded cache entry should be invalidated when (1) the
dictionary used to encode that response is deleted from the
dictionary store, (2) the SDCH decompression user agent is
uninstalled (if it is implemented as a browser add-on), or (3) the
SDCH-capable user agent is disabled.

Intermediate Caches

The server should use HTTP cache headers that prevent non-SDCH-aware
intermediate cache servers from storing the encoded contents. The
cache directive "Cache-Control: private" can be used for this
purpose.
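
The recommendation above can be illustrated with a minimal sketch of
server-side header handling. It is not part of the proposal: the
function name mark_private_if_sdch and the representation of headers
as a plain dict with canonically cased names are assumptions made
only for this example.

```python
def mark_private_if_sdch(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of `headers` with "Cache-Control: private" ensured
    when Content-Encoding lists the 'sdch' token.

    Hypothetical helper: header names are assumed to be canonically
    cased dict keys, which real HTTP stacks do not guarantee.
    """
    result = dict(headers)
    encodings = [
        token.strip().lower()
        for token in result.get("Content-Encoding", "").split(",")
        if token.strip()
    ]
    if "sdch" not in encodings:
        return result  # not SDCH-encoded: leave caching headers alone

    # Keep existing directives, drop "public", and ensure "private".
    directives = [
        d.strip()
        for d in result.get("Cache-Control", "").split(",")
        if d.strip() and d.strip().lower() != "public"
    ]
    if not any(d.lower() == "private" for d in directives):
        directives.append("private")
    result["Cache-Control"] = ", ".join(directives)
    return result
```

Responses without the 'sdch' token are returned unchanged, so the
sketch only affects content that a non-SDCH-aware cache could not
safely serve.
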
If the compressed response can be cached by proxy caches, the server
must include the HTTP header "Vary: Accept-Encoding,
Avail-Dictionary" to instruct proxies to send the cached content only
to user agents that can decode it. Note that some proxies may not
respect the Vary header, in which case non-SDCH-capable user agents
would end up downloading SDCH-encoded responses. Thus, we recommend
that SDCH-encoded responses not be cacheable by intermediate proxies
unless there is a very compelling reason. Further, a response cached
with "Vary: Accept-Encoding, Avail-Dictionary" will not match later
requests unless these request headers match exactly.

A proxy cache may provide one of three levels of support for caching
SDCH-encoded objects.

1. No support - Never cache any response if the Vary header is
   present.

2. Basic support - The proxy cache only serves cached SDCH-encoded
   content if all cache serving conditions are satisfied and the
   values of the HTTP headers specified in the Vary header of the
   cached content exactly match the corresponding headers in the
   HTTP request.

3. Full support - The proxy should understand the SDCH protocol,
   should know what dictionary is used to encode/decode the
   response, and should be able to download advertised dictionaries.
   The cache needs to have both SDCH user agent and server logic in
   it. The proxy should store the SDCH-decoded responses in its
   cache.

Dictionary Caching

User Agent Cache

Because dictionary payloads may be large compared to the size of
individual HTTP responses, it is recommended that the user agent
persistently store dictionaries in a dictionary cache (e.g., on disk)
in order to maximize latency improvements and minimize the bandwidth
overhead of downloading dictionaries.
It is suggested that the user agent implement a limit on the number
of dictionaries stored per domain, so that one domain cannot force
dictionaries for other domains out of the user agent's dictionary
cache. To implement a cache with a fixed maximum size, it is
recommended that the cache manager first evict the dictionaries that
were least recently used for decoding.

Ideally, dictionaries will be stored in the same cache as HTTP
responses and may be inspected and cleared by the user using existing
user interfaces. However, additional support may be needed so that
the user agent can quickly determine which dictionaries should be
advertised for a given request.

The user agent should be careful to validate that a dictionary
matches its original identifier before it is used for decompression,
to prevent malicious attacks on the dictionary cache. The user agent
may implicitly handle this by always recomputing the hash before
advertising the dictionary. However, to improve efficiency, the user
agent may cache the original digest of the dictionary, advertise the
dictionary with that digest, and then, only for the dictionary
selected by the server to encode the response, verify that the cached
dictionary digest still matches the digest computed from the cached
dictionary.

The user agent must not evict dictionaries from its dictionary store
that have been advertised in the Avail-Dictionary header of an HTTP
request for which a response has not yet been returned.

If a user agent downloads a dictionary that has the same identifier
as a previously downloaded dictionary and is applicable to the same
hosts, the user agent must be careful to either ignore the new
dictionary or evict the old dictionary.
If the two dictionaries with the same identifier have exactly the
same contents, the choice is not important. However, this indicates a
server error, because a server must never instruct the user agent to
download a dictionary that was advertised by the user agent. The user
agent may want to avoid downloading dictionaries from this server in
the future, as they may not be new and downloading unnecessary
dictionaries can increase latency.

Intermediate Caches

The dictionary should be treated as a regular HTTP response by
intermediate proxies. Thus, the normal HTTP caching considerations
for intermediate proxies should apply to the dictionary as well.

9. Future Directions

As currently proposed, SDCH is not applicable to another case where
differential compression would be beneficial: large files that change
infrequently and in small ways, such as JavaScript and CSS files
referenced by HTML documents.

TODO: Re-evaluate dictionary scoping rules. The current approach,
which is patterned after the same-named attributes from the HTTP
State Management Specification [RFC2965], may not be the best choice.

10. Current Status and Updates

For current information about the status of this proposal, see
https://groups.google.com/group/SDCH

11. IANA Considerations

This document makes no requests of IANA.

12. Security Considerations

Some security considerations are discussed in the data integrity
section above, but the authors anticipate further work to describe
these.

13. Acknowledgements

The authors would like to acknowledge the support of Google, Inc. for
the development of this work. Technical editor: Harriett Hardman.
Feedback and comments: Greg Badros, Chandra Chereddi, Darren Fisher,
Ted Hardie, Ashu Jain, Ian Hickson, Othman Laraki, Jim Roskind, Ryan
Sleevi, Lincoln Smith, Randy Smith, and Linus Upson.

14.
Appendix: VCDIFF Encoding Format and SDCH

Although the SDCH protocol is designed so that it can be adapted for
use with any differential-encoding format, it currently uses the
VCDIFF encoding format. This format was chosen because its definition
is publicly available as RFC 3284 [RFC3284]. The VCDIFF format is
independent of the method used for finding the longest possible
matches between the dictionary (source) data and the payload (target)
data.

An encoder and decoder for the VCDIFF format, intended for use with
SDCH, have been released as open source under the Apache license.
This package is called "open-vcdiff". It uses the Bentley/McIlroy
technique for finding matches between the dictionary and target data.
It conforms to the VCDIFF specification, with the following
exceptions:

Interleaved format

The standard VCDIFF format divides each encoded delta window into
three sections (data, instructions, and addresses), with the aim of
improving compressibility of the encoded file using a secondary
compressor such as gzip. The drawback to this approach is that none
of the target data can be reconstructed unless the entire delta
window is available. The delta window is received in packets over the
network, and it is desirable to be able to process its contents as
they arrive. In order to facilitate decoding a stream of packets from
the network, we have modified the VCDIFF format so that it
interleaves the data, instructions, and addresses instead of placing
them in three separate sections. Each instruction is followed by its
size and then by an address or literal data.

Adler32 checksum

The format can be modified to include an Adler32 checksum [RFC1950]
of the target window data.
If the checksum format is used, then bit 2 (0x04, defined as
VCD_CHECKSUM) of the Win_Indicator byte will be set, and the checksum
will appear just after the "Length of addresses for COPYs" field and
before the "Data section for ADDs and RUNs" section in the encoding.

Version header byte (Header4)

If either of the two enhancements described above is used, then the
resulting format will not conform to the VCDIFF format as described
in RFC 3284. In order to indicate this deviation from the standard,
the fourth byte in the encoding (Header4, reserved for the VCDIFF
version code) will be set to 0x53 (a capital "S" character in ASCII).
If neither enhancement is used, the fourth byte may be 0x00 (a null
character), the default value described in the standard.

VCD_TARGET flag and target COPY instructions not allowed for SDCH

The SDCH protocol is intended to produce a delta between static
dictionary data and target data. Secondary compression with gzip will
be used to eliminate redundancy within the target data. For this
reason, when using VCDIFF for SDCH, the Win_Indicator flag should
always include the VCD_SOURCE flag, never the VCD_TARGET flag. COPY
instructions should only reference addresses within the source data,
never within the previously decoded target.

The Xdelta package (http://xdelta.org) produces a format based on
VCDIFF, though not 100% compatible with RFC 3284. That package has
been released under the GNU General Public License.

15. References

15.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119,
          DOI 10.17487/RFC2119, March 1997.

[RFC7230] Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
          Protocol (HTTP/1.1): Message Syntax and Routing",
          RFC 7230, DOI 10.17487/RFC7230, June 2014.

15.2.
Informative References

[RFC3284] Korn, D., MacDonald, J., Mogul, J., and K. Vo, "The VCDIFF
          Generic Differencing and Compression Data Format",
          RFC 3284, DOI 10.17487/RFC3284, June 2002.

[RFC3229] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A.,
          Goland, Y., van Hoff, A., and D. Hellerstein, "Delta
          encoding in HTTP", RFC 3229, DOI 10.17487/RFC3229,
          January 2002.

[RFC3929] Hardie, T., "Alternative Decision Making Processes for
          Consensus-Blocked Decisions in the IETF", RFC 3929,
          DOI 10.17487/RFC3929, October 2004.

[RFC3548] Josefsson, S., Ed., "The Base16, Base32, and Base64 Data
          Encodings", RFC 3548, DOI 10.17487/RFC3548, July 2003.

[RFC2965] Kristol, D. and L. Montulli, "HTTP State Management
          Mechanism", RFC 2965, DOI 10.17487/RFC2965, October 2000.

[RFC1950] Deutsch, P. and J-L. Gailly, "ZLIB Compressed Data Format
          Specification version 3.3", RFC 1950,
          DOI 10.17487/RFC1950, May 1996.

[RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
          (SHA and SHA-based HMAC and HKDF)", RFC 6234,
          DOI 10.17487/RFC6234, May 2011.

Authors' Addresses

Jon Butler

Email: jkbutler@google.com

Wei-Hsin Lee

Email: weihsinl@google.com

Bryan McQuade

Email: mcquade@google.com

Kenneth Mixter

Email: kmixter@google.com