idnits 2.17.1 draft-ietf-trade-hiroshi-dom-hash-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 63 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 4 instances of too long lines in the document, the longest one being 6 characters in excess of 72. Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 385 has weird spacing: '...here is no 0-...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Possible downref: Non-RFC (?) normative reference: ref. 'DOM' ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref. 'MD5') -- Possible downref: Non-RFC (?) normative reference: ref. 'NAM' -- Possible downref: Non-RFC (?) normative reference: ref. 'SAX' -- Possible downref: Non-RFC (?) normative reference: ref. 'SHA' ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC 3986) -- Possible downref: Non-RFC (?) normative reference: ref. 'XML' Summary: 8 errors (**), 0 flaws (~~), 3 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Network Working Group Hirosi Maruyama 2 INTERNET-DRAFT Kent Tamura 3 September 1999 Naohiko Uramoto 4 Expires: March 2000 IBM 5 draft-ietf-trade-hiroshi-dom-hash-03.txt 7 Digest Values for DOM (DOMHASH) 8 ------ ------ --- --- --------- 10 Hiroshi Maruyama 11 Kent Tamura 12 Naohiko Uramoto 14 Status of This Document 16 This draft, file name draft-ietf-trade-hiroshi-dom-hash-03.txt, is 17 intended to be become an Informational RFC. Distribution of this 18 document is unlimited. Comments should be sent to the authors or the 19 IETF TRADE working group mailing list . 21 This document is an Internet-Draft and is in full conformance with 22 all provisions of Section 10 of RFC2026. Internet-Drafts are working 23 documents of the Internet Engineering Task Force (IETF), its areas, 24 and its working groups. Note that other groups may also distribute 25 working documents as Internet-Drafts. 27 Internet-Drafts are draft documents valid for a maximum of six 28 months. Internet-Drafts may be updated, replaced, or obsoleted by 29 other documents at any time. It is not appropriate to use Internet- 30 Drafts as reference material or to cite them other than as a 31 ``working draft'' or ``work in progress.'' 33 The list of current Internet-Drafts can be accessed at 34 http://www.ietf.org/ietf/1id-abstracts.txt 36 The list of Internet-Draft Shadow Directories can be accessed at 37 http://www.ietf.org/shadow.html. 39 To view the entire list of current Internet-Drafts, please check the 40 "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow 41 Directories as listed at . 43 Abstract 45 This internet draft defines a clear and unambiguous definition of 46 digest (hash) values of the XML objects regardless of the surface 47 string variation of XML. This definition can be used for XML digital 48 signature as well efficient replication of XML objects. 50 Table of Contents 52 Status of This Document....................................1 53 Abstract...................................................1 55 Table of Contents..........................................2 57 1. Introduction............................................3 58 2. Digest Calculation......................................4 59 2.1. Overview..............................................4 60 2.2. Namespace Considerations..............................5 61 2.3. Definition with Code Fragments........................6 62 2.3.1. Text Nodes..........................................6 63 2.3.2. ProcessingInstruction Nodes.........................7 64 2.3.3. Attr Nodes..........................................8 65 2.3.4. Element Nodes.......................................8 66 3. Discussion.............................................10 67 4. Security Considerations................................10 69 References................................................11 70 Author's Address..........................................11 71 Expiration and File Name..................................11 73 1. Introduction 75 The purpose of this document is to give a clear and unambiguous 76 definition of digest (hash) values of the XML objects [XML]. Two 77 subtrees are considered identical if their hash values are the same, 78 and different if their hash values are different. 80 There are at least two usage scenarios of DOMHASH. One is as a basis 81 for digital signatures for XML. Digital signature algorithms normally 82 require hashing a signed content before signing. DOMHASH provides a 83 concrete definition of the hash value calculation. 85 The other is to use DOMHASH when synchronizing two DOM structures 86 [DOM]. Suppose that a server program generates a DOM structure which 87 is to be rendered by clients. If the server makes frequent small 88 changes on a large DOM tree, it is desirable that only the modified 89 parts are sent over to the client. A client can initiate a request by 90 sending the root hash value of the structure in the cache memory. If 91 it matches with the root hash value of the current server structure, 92 nothing needs be sent. If not, then the server compares the client 93 hash with the older versions in the server's cache. If it finds one 94 that matches the client's version of the structure, then it locates 95 differences with the current version by recursively comparing the 96 hash values of each node. This way, the client can receive only an 97 updated portion of a large structure without requesting the whole 98 thing. 100 One way of defining digest values is to take a surface string as the 101 input for a digest algorithm. However, this approach has several 102 drawbacks. The same internal DOM structure may be represented in may 103 different ways as surface strings even if they strictly conform to 104 the XML specification. Treatment of white spaces, selection of 105 character encodings, entity references (i.e., use of ampersands), and 106 so on have impact on the generation of a surface string. If the 107 implementations of surface string generation are different, the hash 108 values would be different, resulting in unvalidatable digital 109 signatures and unsuccessful detection of identical DOM structures. 110 Therefore, it is desirable that digest of DOM is defined in the DOM 111 terms -- that is, as an unambiguous algorithm operating on a DOM 112 tree. This is the approach we take in this specification. 114 Introduction of namespace is another source of variation of surface 115 string because different namespace prefixes can be used for 116 representing the same namespace URI [URI]. In the following example, 117 the namespace prefix "edi" is bound to the URI 118 "http://ecommerce.org/schema" but this prefix can be arbitrary chosen 119 without changing the logical contents as shown in the second example. 121 122 123 124 : 125 126 128 129 130 131 : 132 133 135 The DOMHash defined in this document is designed so that the choice 136 of the namespace prefix does not affect the digest value. In the 137 above example, both the "root" elements will get the same digest 138 value. 140 2. Digest Calculation 142 2.1. Overview 144 Hash values are defined on the DOM type Node. We consider the 145 following four node types that are used for representing a DOM 146 document structure: 148 1. Element 149 2. Attr 150 3. ProcessingInstruction 151 4. Text 153 Comment nodes and Document Type Definitions (DTDs) do not participate 154 in the digest value calculation. This is because DOM does not 155 require a conformant processor to create data structures for these. 156 DOMHash is designed so that it can be computed with any XML processor 157 conformant to the DOM or SAX [SAX] specification. 159 Nodes with the node type EntityReference must be expanded prior to 160 digest calculation. 162 The digest values are defined recursively on each level of the DOM 163 tree so that only a relevant part needs to be recalculated when a 164 small portion of the tree is changed. 166 Below, we give the precise definitions of digest for these types. We 167 describe the format of the data to be supplied to a hash algorithm 168 using a figure and a simple description, followed by a Java code 169 fragment using the DOM API and the JDK 1.1 Platform Core API only. 170 Therefore, the semantics should be unambiguous. 172 As the rule of thumb, all strings are to be in UTF-16 in the network 173 byte order (Big Endian) with no byte order mark. If there is a 174 sequence of text nodes without any element nodes in between, these 175 text nodes are merged into one by concatenating them. A zero-length 176 text node is always ignored. 178 Note that validating and non-validating XML processors may generate 179 different DOM trees from the same XML document, due to attribute 180 normalization and default attributes. If DOMHash is to be used for 181 testing logical equivalence between two XML documents (as opposed to 182 DOM trees), it may be necessary to normalize attributes and supply 183 default attributes prior to DOMHash calculation. 185 Some legacy character encodings (such as ISO-2022-JP) have certain 186 ambiguity in translating into Unicode. This is again dependent on 187 XML processors. Treatment of such processor dependencies is out of 188 scope of this document. 190 2.2. Namespace Considerations 192 To avoid the dependence on the namespace prefix, we use "expanded 193 names" to do digest calculation. If an element name or an attribute 194 name is qualified either by a explicit namespace prefix or by a 195 default namespace, the name's LocalPart is prepended by the URI of 196 the namespace (the namespace name as defined in the NameSpace 197 specification [NAM]) and a colon before digest calculation. In the 198 following example, the default qualified name "order" is expanded 199 into "http://ecommerce.org/schema:order" while the explicit qualified 200 name "book:title" is expanded into "urn:loc.gov:books:title" before 201 digest calculation. 203 205 207 208 ... 209 : 210 211 213 We define an expanded name (either for element or attribute) as 214 follows: 216 If a name is not qualified, the expanded name is the name 217 itself. 219 If a name is qualified with the prefix "xmlns", the expanded 220 name is undefined. 222 If a name is qualified either by default or by an explicit 223 namespace prefix, the expanded name is URI bound to the 224 namespace + ":" + LocalPart 226 In the following example code, we assume that the getExpandedName() 227 method (which returns the expanded name as defined above) is defined 228 in both Element and Attr interfaces of DOM. 230 Note that the digest values are not defined on namespace 231 declarations. In other words, the digest value is not defined for an 232 attribute when 234 - the attribute name is "xmlns", or 235 - the namespace prefix is "xmlns". 237 In the above example, the two attributes which are namespace 238 declarations do not have digest values and therefore will not 239 participate in the calculation of the digest value of the "root" 240 element. 242 2.3. Definition with Code Fragments 244 The code fragments in the definitions below assume that they are in 245 implementation classes of Node. Therefore, a methods call without an 246 explicit object reference is for the Node itself. For example, 247 getData() returns the text data of the current node if it is a Text 248 node. The parameter digestAlgorithm is to be replaced by an 249 identifier of the digest algorithm, such as "MD5" [MD5] and "SHA-1" 250 [SHA]. 252 The computation should begin with a four byte integer that represents 253 the type of the node, such as TEXT_NODE or ELEMENT_NODE. 255 2.3.1. Text Nodes 257 The hash value of a Text node is computed on the four byte header 258 followed by the UTF-16 encoded text string. 260 - TEXT_NODE (3) in 32 bit network-byte-ordered integer 261 - Text data in UTF-16 stream (variable length) 263 public byte[] getDigest(String digestAlgorithm) { 264 MessageDigest md = MessageDigest.getInstance(digestAlgorithm); 265 md.update((byte)0); 266 md.update((byte)0); 267 md.update((byte)0); 268 md.update((byte)3); 269 md.update(getData().getBytes("UnicodeBigUnmarked")); 270 return md.digest(); 271 } 273 Here, MessageDigest is in the package java.security.*, one of the 274 built-in packages of JDK 1.1. 276 2.3.2. ProcessingInstruction Nodes 278 A ProcessingInstruction (PI) node has two components: the target and 279 the data. Accordingly, the hash is computed on the concatenation of 280 both, separated by 'x0000'. PI data is from the first non white space 281 character after the target to the character immediately preceding the 282 "?>". 284 - PROCESSING_INSTRUCTION_NODE (7) in 32 bit network-byte-ordered 285 integer 287 - PI target in UTF-16 stream (variable length) 288 - 0x00 0x00 289 - PI data in UTF-16 stream (variable length) 291 public byte[] getDigest(String digestAlgorithm) { 292 MessageDigest md = MessageDigest.getInstance(digestAlgorithm); 293 md.update((byte)0); 294 md.update((byte)(0); 295 md.update((byte)0); 296 md.update((byte)7); 297 md.update(getName().getBytes("UnicodeBigUnmarked")); 298 md.update((byte)0); 299 md.update((byte)0); 300 md.update(getData().getBytes("UnicodeBigUnmarked")); 301 return md.digest(); 302 } 304 2.3.3. Attr Nodes 306 The digest value of Attr nodes are defined similarly to PI nodes, 307 except that we need a separator between the expanded attribute name 308 and the attribute value. The '0x0000' value in UTF-16 is allowed 309 nowhere in an XML document, so it can serve as an unambiguous 310 separator. The expanded name must be used as the attribute name 311 because it may be qualified. Note that if the attribute is a 312 namespace declaration (either the attribute name is "xmlns" or its 313 prefix is "xmlns"), the digest value is undefined and the getDigest() 314 method should return null. 316 - ATTRIBUTE_NODE (2) in 32 bit network-byte-ordered integer 317 - Expanded attribute name in UTF-16 stream (variable length) 318 - 0x00 0x00 319 - Attribute value in UTF-16 stream (variable length) 321 public byte[] getDigest(String digestAlgorithm) { 322 if (getNodeName().equals("xmlns") 323 || getNodeName().startsWith("xmlns:")) 324 return null; 325 MessageDigest md = MessageDigest.getInstance(digestAlgorithm); 326 md.update((byte)0); 327 md.update((byte)0); 328 md.update((byte)0); 329 md.update((byte)2); 330 md.update(getExpandedName().getBytes("UnicodeBigUnmarked")); 331 md.update((byte)0); 332 md.update((byte)0); 333 md.update(getValue().getBytes("UnicodeBigUnmarked")); 334 return md.digest(); 335 } 337 2.3.4. Element Nodes 339 Element nodes are the most complex because they consist of other 340 nodes recursively. Hash values of these component nodes are used to 341 calculate the node's digest so that we can save computation when the 342 structure is partially changed. 344 First, all the attributes except for namespace declarations must be 345 collected. This list is sorted by the expanded attribute names. The 346 sorting is done in ascending order in terms of the UTF-16 encoded 347 expanded attribute names, using the string comparison operator 348 defined as String#compareTo() in Java. The semantics of this sorting 349 operation should be clear. 351 - ELEMENT_NODE (1) in 32 bit network-byte-ordered integer 352 - Expanded element name in UTF-16 stream (variable length) 353 - 0x00 0x00 354 - A number of non-namespace-declaration attributes in 32 bit 355 network-byte-ordered unsigned integer 356 - Sequence of digest values of non-namespace-declaration attributes, 357 sorted by String#compareTo() for attribute names 358 - A number of child nodes (except for Comment nodes) in 32bit 359 network-byte-ordered unsigned integer 360 - Sequence of digest values of each child node except for Comment 361 nodes (variable length) (A sequence of child texts is merged to one 362 text. A zero-length text and Comment nodes are not counted as child) 364 public byte[] getDigest(String digestAlgorithm) { 365 MessageDigest md = MessageDigest.getInstance(digestAlgorithm); 366 ByteArrayOutputStream baos = new ByteArrayOutputStream(); 367 DataOutputStream dos = new DataOutputStream(baos); 368 dos.writeInt(ELEMENT_NODE);//This is stored in network byte order 369 dos.write(getExpandedName().getBytes("UnicodeBigUnmarked")); 370 dos.write((byte)0); 371 dos.write((byte)0); 372 // Collect all attributes except for namespace declarations 373 NamedNodeMap nnm = this.getAttributes(); 374 int len = nnm.getLength() 375 // Find "xmlns" or "xmlns:foo" in nnm and omit it. 376 ... 377 dos.writeInt(len); // This is sorted in the network byte order 378 // Sort attributes by String#compareTo() on expanded attribute names. 379 ... 380 // Assume that `Attr[] aattr' has sorted Attribute instances. 381 for (int i = 0; i < len; i ++) 382 dos.write(aattr[i].getDigest(digestAlgorithm)); 383 Node n = this.getFirstChild(); 384 // Assume that adjoining Texts are merged, 385 // there is no 0-length Text, and 386 // comment nodes are removed. 387 len = this.getChildNodes().getLength(); 388 dos.writeInt(len); // This is stored in the network byte order 389 while (n != null) { 390 dos.write(n.getDigest(digestAlgorithm)); 391 n = n.getNextSibling(); 392 } 393 dos.close(); 394 md.update(baos.toByteArray()); 395 return md.digest(); 396 } 398 3. Discussion 400 The definition described above can be efficiently implemented with 401 any XML processor that is conformant to either DOM and SAX 402 specification. Reference implementations are available on request. 404 4. Security Considerations 406 DOMHASH is expected to be used as the basis for digital signatures 407 and other security and integrity uses. It's appropriateness for such 408 uses depends on the security of the hash algorithm used and inclusion 409 of the fundamental characteristics it is desired to check in parts of 410 the DOM model incorporated in the digest by DOMHASH. 412 References 414 [DOM] - "Document Object Model (DOM), Level 1 Specification", October 415 1998, http://www.w3.org/TR/REC-DOM-Level-1/ 417 [MD5] - RFC 1321 - R. Rivest, "The MD5 Message-Digest Algorithm", 418 April 1992. 420 [NAM] - Tim Bray, Dave Hollander, Andrew Layman, "Namespaces in XML", 421 http://www.w3.org/TR/1999/REC-xml-names-19990114. 423 [SAX] - David Megginson, "SAX 1.0: The Simple API for XML", 424 http://www.megginson.com/SAX/, May 1998. 426 [SHA] - (US) National Institute of Standards and Technology, "Federal 427 Information Processing Standards Publication 180-1: Secure Hash 428 Standard", 17 April 1995. 430 [URI] - RFC 2396 - T. Berners-Lee, R. Fielding, L. Masinter, "Uniform 431 Resource Identifiers (URI): Generic Syntax", August 1998. 433 [XML] - Tim Bray, Jean Paoli, C. M. Sperber-McQueen, "Extensible 434 Markup Language (XML) 1.0", http://www.w3.org/TR/1998/REC-xml- 435 19980210 437 Author's Address 439 Hiroshi Maruyama, 440 IBM Research, Tokyo Research Laboratory 441 email: maruyama @ jp.ibm.com 443 Kent Tamura, 444 IBM Research, Tokyo Research Laboratory 445 email: kent @ trl.ibm.co.jp 447 Naohiko Uramoto, 448 IBM Research, Tokyo Research Laboratory 449 email: uramoto @ jp.ibm.com 451 Expiration and File Name 453 This draft expires March 2000. 455 Its file name is draft-ietf-trade-hiroshi-dom-hash-03.txt.