idnits 2.17.1

draft-phillips-record-jar-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate. You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 437.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 448.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 455.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 461.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.
     If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal
     Provisions document at https://trustee.ietf.org/license-info for more
     information.)

  -- The document date (February 20, 2008) is 5909 days in the past. Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UAX31'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode'

  -- Obsolete informational reference (is this intentional?): RFC 4646
     (Obsoleted by RFC 5646)


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    Network Working Group                                   A. Phillips, Ed.
3    Internet-Draft                                               Yahoo! Inc.
4    Expires: August 23, 2008                               February 20, 2008

6                              The record-jar Format
7                          draft-phillips-record-jar-02

9    Status of this Memo

11      By submitting this Internet-Draft, each author represents that any
12      applicable patent or other IPR claims of which he or she is aware
13      have been or will be disclosed, and any of which he or she becomes
14      aware will be disclosed, in accordance with Section 6 of BCP 79.

16      Internet-Drafts are working documents of the Internet Engineering
17      Task Force (IETF), its areas, and its working groups. Note that
18      other groups may also distribute working documents as Internet-
19      Drafts.

21      Internet-Drafts are draft documents valid for a maximum of six months
22      and may be updated, replaced, or obsoleted by other documents at any
23      time. It is inappropriate to use Internet-Drafts as reference
24      material or to cite them other than as "work in progress."

26      The list of current Internet-Drafts can be accessed at
27      http://www.ietf.org/ietf/1id-abstracts.txt.

29      The list of Internet-Draft Shadow Directories can be accessed at
30      http://www.ietf.org/shadow.html.

32      This Internet-Draft will expire on August 23, 2008.

34   Copyright Notice

36      Copyright (C) The IETF Trust (2008).

38   Abstract

40      The record-jar format provides a method of storing multiple records
41      with a variable repertoire of fields in a text format. This document
42      provides a description of the format. Comments are solicited and
43      should be addressed to the mailing list 'record-jar@yahoogroups.com'
44      and/or the author.

46   Table of Contents

48      1. Introduction
49      2. Format and Grammar
50        2.1. Folding of Field Values
51        2.2. Comments
52        2.3. Characters, Encodings, and Escapes
53      3. Examples
54      4. References
55        4.1. Normative References
56        4.2. Informative References
57      Appendix A. Acknowledgements
58      Author's Address
59      Intellectual Property and Copyright Statements

61   1. Introduction

63      The record-jar format was originally described in The Art of Unix
64      Programming [AOUP]. This format is useful for storing information in
65      a human-readable text form, while making the data available for
66      machine processing. It is a flexible format, since it provides for
67      an arbitrary range of fields in any given record and can be used to
68      store data with variable length and content.

70      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
71      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
72      document are to be interpreted as described in [RFC2119].

74   2. Format and Grammar

76      The record-jar format is described by the following ABNF ([RFC4234]):

78      record-jar   = [encodingSig] [separator] *record
79      record       = 1*field separator
80      field        = ( field-name field-sep field-body CRLF )
81      field-name   = 1*character
82      field-sep    = *SP ":" *SP
83      field-body   = *(continuation 1*character)
84      continuation = ["\"] [[*SP CRLF] 1*SP]
85      separator    = [blank-line] *("%%" [comment] CRLF)
86      comment      = SP *69(character)
87      character    = SP / ASCCHAR / UNICHAR / ESCAPE
88      encodingSig  = "%%encoding" field-sep
89                     *(ALPHA / DIGIT / "-" / "_") CRLF
90      blank-line   = WSP CRLF

92      ; ASCII characters except %x26 (&) and %x5C (\)
93      ASCCHAR      = %x21-25 / %x27-5B / %x5D-7E
94      ; Unicode characters
95      UNICHAR      = %x80-10FFFF
96      ESCAPE       = "\" ("\" / "&" / "r" / "n" / "t" )
97                     / "&#x" 2*6HEXDIG ";"

99                              record-jar ABNF

101     The record-jar format uses plain text to represent data values. A
102     record-jar document consists of a sequence of records, each of which
103     contains one or more fields. Each record is separated from other
104     records by at least one line beginning with the sequence "%%"
105     (%x25.25). A record MAY contain as many or as few fields as are
106     needed to convey the data. Empty records and blank
107     lines are ignored.

109     A field is a single, logical line of characters from the Universal
110     Character Set (Unicode) [Unicode]. Each field is composed of three
111     parts: the field-name, the field-separator, and the field-body.

113     The field-name is an identifier. Field-names consist of a sequence of
114     Unicode characters. Whitespace characters and colon (":", %x3A) are
115     not permitted in a field-name.

117     An application can impose additional restrictions on field-names.
118     For example, they might be restricted to the characters permitted in
119     identifiers according to Unicode Standard Annex #31 (UAX#31)
120     [UAX31]. Or they might be restricted to a sequence of letters and
121     digits from the US-ASCII [ISO646] character repertoire.

123     Field-names are case sensitive. Uppercase and lowercase letters are
124     often used to visually break up the name, for example using
125     CamelCase. It is a common convention that field-names use an initial
126     capital letter, although this is not enforced.

128     The field separator (field-sep) is the colon character (":", %x3A).
129     The separator MAY be surrounded on either side by any amount of
130     horizontal whitespace (tab or space characters). The normal
131     convention is one space on each side.

133     The field-body contains the data value. Logically, the field-body
134     consists of a single line of text using any combination of characters
135     from the Universal Character Set followed by a CRLF (newline). The
136     carriage return, newline, and tab characters, when they occur in the
137     data value stored in the field-body, are represented by their common
138     backslash escapes ("\r", "\n", and "\t" respectively). See
139     Section 2.3 for more information on escape sequences.

141  2.1. Folding of Field Values

143     Some protocols limit total line length. For example, many Internet
144     plain-text protocols limit lines to 72 total bytes. To accommodate
145     such limits or for readability and presentational purposes, the
146     field-body portion of a field can be split into a multiple-line
147     representation; this is called "folding".

149     Successive lines in the same field-body begin with one or more
150     whitespace characters. When processing the record-jar format, the
151     linear whitespace (including the newline and any preceding spaces)
152     is consumed by the processor and the two parts of the field-body are
153     joined to form a single, logical line.
        For example:
154     Eulers-Number : 2.718281828459045235360287471
155       352662497757247093699959574966967627724076630353547
156       5945713821785251664274274663919320030599218174135...

158                       Figure 2: Example of Folding

160     Note that imposing a line length limit effectively limits the length
161     of the field-name, since the field separator MUST appear on the same
162     line with the field-name and the field-name MUST NOT be folded.
163     Also, when imposing a line length limit, note that some encodings
164     (including the Unicode encodings) can use a variable number of bytes
165     per character or commonly use more than one byte per character.
166     Characters MUST NOT be folded in the middle of a byte sequence.

168     It is RECOMMENDED that folding not occur between characters inside a
169     Unicode grapheme cluster (since this will alter the display of
170     characters in the file and might result in unintentional alteration
171     of the file's semantics). Information on grapheme clusters can be
172     found in [UAX29].

174     In some cases, the field-body contains spaces that are important to
175     the data. To accurately preserve whitespace in the document, an
176     optional line-continuation character (backslash, %x5C) MAY be
177     included to delimit and separate whitespace to be preserved from
178     whitespace that will be removed by the processor. The line-
179     continuation character and any whitespace that follows it (including
180     whitespace at the beginning of the continuing field-body on the next
181     line) MUST be consumed by the processor when reading the file.
182     Whitespace appearing before the line-continuation MUST NOT be
183     consumed. Use of the line-continuation character makes the
184     whitespace visible in the file.

186     In other cases, the field-body might contain natural language text,
187     and, while it is readily apparent that many languages use spaces to
188     separate words, others, such as Japanese or Thai, do not.
189     Implementations MAY, in the absence of line-continuation characters,
190     replace the continuation sequence (the line break and surrounding
191     whitespace) in a folded line with a single ASCII space (%x20);
192     however, implementations SHOULD just remove the continuation sequence
193     altogether in order to avoid causing unnatural breaks in the text.

195     Here are some examples:
196     SomeField : This is some running text \
197       that is continued on several lines \
198       and which preserves spaces between \
199       the words.
200     %%
201     AnotherExample: There are three spaces   \
202       between 'spaces' and 'between' in this record.
203     %%
204     SwallowingExample: There are no spaces between \
205       the numbers one and two in this example 1\
206       2.
207     %%

209           Figure 3: Example of Folding with Preserved Whitespace

211     Note that entirely blank continuation lines are not permitted. That
212     is, this record is illegal, since the field-body of "SomeText" would
213     be the empty string:

215     %%
216     SomeText: \
217       \
218       \
219     %%

221                    Figure 4: Whitespace Folding Example

223  2.2. Comments

225     Comments MAY be included in the body of the record-jar document by
226     placing them at the end of a separator line. The comment MUST be
227     separated by at least one space from the "%%" sequence that
228     introduces the record separator.

230     Multiple record separators (including comment lines) MAY appear
231     between records. Logically this appears to result in records that
232     contain no fields: records containing no fields MUST be ignored by a
233     processor.

235     Folding of comments is not permitted; instead multiple comment lines
236     MUST be used. Comments cannot appear in the body of a record. For
237     example:
238     %% this is a comment.
239     Record: goes here
240     %%
241     %% here is another sequence of comments
242     %% that appear on multiple lines
243     Record: another record
244     %% a final comment
245     %%

247                         Figure 5: Comment example

249     Although comments are not associated with any particular record in
250     the file, processors that preserve comments sometimes treat the
251     comments as if they were associated with the record just following
252     them. Reserialization of a record-jar file would thus restore the
253     comments to their logical position in the file. In many cases,
254     processing a record-jar file loses comment information associated
255     with the file.

257  2.3. Characters, Encodings, and Escapes

259     By default, a file containing a record-jar archive uses the UTF-8
260     character encoding (see [RFC3629]). If an application, protocol, or
261     specification permits a character encoding other than UTF-8 to be
262     used in the file, it SHOULD also support reading the character
263     encoding from the encoding signature.

265     The encoding signature, when present, MUST be the very first line of
266     the file. If the encoding signature is not present, an application
267     or protocol MAY attempt to infer the character encoding using other
268     means. Record-jar files SHOULD always include an encoding signature,
269     even if one is not required, whenever the application, protocol, or
270     specification permits one.

272     A file that uses the UTF-16 or UTF-32 encoding MAY also include a
273     Byte Order Mark (U+FEFF) as the first sequence of two octets (in the
274     case of UTF-16) or four octets (in the case of UTF-32) in the file,
275     just preceding the encoding signature.

277     Some applications, protocols, or specifications require that the
278     record-jar file use some other, non-Unicode, legacy character
279     encoding. In particular, some applications, protocols, or
280     specifications only support the US-ASCII character set ([ISO646]).
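
The rules above for locating the encoding signature might be sketched
as follows. This is a non-normative illustration in Python, not part
of the format itself. The names sniff_encoding, BOMS, and SIGNATURE
are hypothetical. The sketch assumes that a file without a Byte Order
Mark has an ASCII-compatible first line, and that a signature with an
empty encoding name is treated the same as a missing signature;
neither assumption is stated by this document.

```python
import codecs
import re

# Byte Order Marks, longest first: the UTF-32 LE mark begins with the
# same two octets as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Follows the encodingSig production: "%%encoding" field-sep name.
SIGNATURE = re.compile(r"^%%encoding *: *([A-Za-z0-9_-]*)$")


def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named on the first line, else a fallback."""
    bom_codec = None
    for bom, codec in BOMS:
        if data.startswith(bom):
            data = data[len(bom):]  # the BOM precedes the signature
            bom_codec = codec
            break
    # Assumption: without a BOM, the signature line is ASCII-compatible.
    text = data.decode(bom_codec or "ascii", errors="replace")
    first_line = text.split("\n", 1)[0].rstrip("\r")
    match = SIGNATURE.match(first_line)
    if match and match.group(1):
        return match.group(1)
    # Assumption: an empty or missing name counts as "no signature".
    return bom_codec or default


print(sniff_encoding(b"%%encoding:UTF-8\r\nPlanet: Mercury\r\n"))  # prints UTF-8
```

The sketch falls back to the encoding implied by the Byte Order Mark,
and then to UTF-8, which is the default stated at the start of this
section. An application that infers the encoding by other means would
replace that fallback.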

282     Here is an example of the encoding signature for the UTF-8 encoding
283     of Unicode:
284     %%encoding:UTF-8

286                 Figure 6: Example of an Encoding Signature

288     Printable ASCII characters except backslash ("\") and ampersand
289     ("&") are represented as themselves.

291     Non-ASCII values MAY be included in a record-jar file in several
292     ways. For portability, the best mechanism is to use escape sequences
293     in the field-body. Exclusive use of escape sequences results in a
294     pure ASCII text file.

296     Non-ASCII characters MAY be represented using the character's Unicode
297     value written in the Numeric Character Reference format
298     adapted from XML; the sequence "&#x" (%x26.23.78) is followed by the
299     character's Unicode scalar value in hex followed directly by the
300     semi-colon character (";", %x3B). Leading zeroes MAY be omitted.
301     For example, the EURO SIGN is U+20AC and could be represented as
302     "&#x20AC;".

304     Non-ASCII characters MAY also be represented as their associated
305     octet sequence in the file's character encoding. For example, the
306     EURO SIGN would be represented as the octet sequence %xE2.82.AC,
307     since those three bytes encode that character in UTF-8.

309     The characters for carriage return, newline, and tab when considered
310     as part of the data (and not the file format itself) are represented
311     by the traditional escape sequences "\r" (%x5C.72), "\n" (%x5C.6E),
312     and "\t" (%x5C.74) respectively. The character backslash is
313     represented by "\\" (%x5C.5C), while the ampersand character is
314     represented by "\&" (%x5C.26). A single backslash at the end of a
315     line indicates continuation, as discussed in Section 2.1. Otherwise
316     a single backslash followed by some other character in the data is an
317     error, although a record-jar processor MAY choose to interpret it as
318     a backslash.

320  3. Examples

322     Here is the canonical example from [AOUP]:
323     Planet: Mercury
324     Orbital-Radius: 57,910,000 km
325     Diameter: 4,880 km
326     Mass: 3.30e23 kg
327     %%
328     Planet: Venus
329     Orbital-Radius: 108,200,000 km
330     Diameter: 12,103.6 km
331     Mass: 4.869e24 kg
332     %%
333     Planet: Earth
334     Orbital-Radius: 149,600,000 km
335     Diameter: 12,756.3 km
336     Mass: 5.972e24 kg
337     Moons: Luna

339     A more complete example showing more of the various features in the
340     format is described in [RFC4646]. The data shown here is taken from
341     the Language Subtag Registry defined in that document:
342     %%
343     Type: language
344     Subtag: ia
345     Description: Interlingua (International Auxiliary Language \
346       Association)
347     Added: 2005-08-16
348     %%
349     Type: language
350     Subtag: id
351     Description: Indonesian
352     Added: 2005-08-16
353     Suppress-Script: Latn
354     %%
355     Type: language
356     Subtag: nb
357     Description: Norwegian Bokm&#xE5;l
358     Added: 2005-08-16
359     Suppress-Script: Latn
360     %%

362  4. References

364  4.1. Normative References

366     [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
367               Requirement Levels", BCP 14, RFC 2119, March 1997.

369     [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
370               10646", STD 63, RFC 3629, November 2003.

372     [RFC4234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
373               Specifications: ABNF", draft-crocker-abnf-rfc2234bis-00
374               (work in progress), October 2005.

377     [UAX31]   Davis, M., "Unicode Standard Annex #31: Identifier and
378               Pattern Syntax", September 2006.

380     [Unicode] Unicode Consortium, "The Unicode Consortium. The Unicode
381               Standard, Version 5.0, (Boston, MA, Addison-Wesley, 2003.
382               ISBN 0-321-49081-0)", January 2007.

384  4.2. Informative References

386     [AOUP]    Raymond, E., "The Art of Unix Programming", 2003.

389     [ISO646]  International Organization for Standardization, "ISO/IEC
390               646:1991, Information technology -- ISO 7-bit coded
391               character set for information interchange.", 1991.

393     [RFC4646] Phillips, A., Ed. and M. Davis, Ed., "Tags for the
394               Identification of Languages", September 2006.

397     [UAX29]   Davis, M., "Unicode Standard Annex #29: Text Boundaries",
398               October 2006.

400  Appendix A. Acknowledgements

402     Thanks to Eric S. Raymond for his gracious permission to both
403     reference and quote The Art of Unix Programming in this document.
404     Without his work, this document would likely not exist.

406     Contributors to this document include: Stephane Bortzmeyer, John
407     Cowan, Frank Ellerman, Doug Ewell.

409     The IETF LTRU working group adopted record-jar format on John Cowan's
410     suggestion. That effort required record-jar to be documented and
411     many people in that group contributed to this work there: the author
412     thanks everyone who participated in that effort, even though names
413     cannot be mustered here.

415  Author's Address

417     Addison Phillips (editor)
418     Yahoo! Inc.

420     Email: addison@inter-locale.com
421     URI: http://www.inter-locale.com

423  Full Copyright Statement

425     Copyright (C) The IETF Trust (2008).

427     This document is subject to the rights, licenses and restrictions
428     contained in BCP 78, and except as set forth therein, the authors
429     retain all their rights.

431     This document and the information contained herein are provided on an
432     "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
433     OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
434     THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
435     OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
436     THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
437     WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

439  Intellectual Property

441     The IETF takes no position regarding the validity or scope of any
442     Intellectual Property Rights or other rights that might be claimed to
443     pertain to the implementation or use of the technology described in
444     this document or the extent to which any license under such rights
445     might or might not be available; nor does it represent that it has
446     made any independent effort to identify any such rights. Information
447     on the procedures with respect to rights in RFC documents can be
448     found in BCP 78 and BCP 79.

450     Copies of IPR disclosures made to the IETF Secretariat and any
451     assurances of licenses to be made available, or the result of an
452     attempt made to obtain a general license or permission for the use of
453     such proprietary rights by implementers or users of this
454     specification can be obtained from the IETF on-line IPR repository at
455     http://www.ietf.org/ipr.

457     The IETF invites any interested party to bring to its attention any
458     copyrights, patents or patent applications, or other proprietary
459     rights that may cover technology that may be required to implement
460     this standard. Please address the information to the IETF at
461     ietf-ipr@ietf.org.

463  Acknowledgment

465     Funding for the RFC Editor function is provided by the IETF
466     Administrative Support Activity (IASA).