Network Working Group                                          K. Davies
Internet-Draft                                                     ICANN
Intended status: Informational                                A. Freytag
Expires: December 26, 2015                                    ASMUS Inc.
                                                           June 24, 2015


            Representing Label Generation Rulesets using XML
                       draft-davies-idntables-10

Abstract

   This document describes a method of representing rules for validating
   identifier labels and alternate representations of those labels using
   Extensible Markup Language (XML). These policies, known as "Label
   Generation Rulesets" (LGRs), are used for the implementation of
   Internationalized Domain Names (IDNs), for example. The rulesets are
   used to implement and share that aspect of policy defining which
   labels and specific Unicode code points are permitted for
   registrations, which alternative code points are considered variants,
   and what actions may be performed on labels containing those
   variants.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 26, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Design Goals
   3.  LGR Format
     3.1.  Namespace
     3.2.  Basic Structure
     3.3.  Metadata
       3.3.1.  The version Element
       3.3.2.  The date Element
       3.3.3.  The language Element
       3.3.4.  The scope Element
       3.3.5.  The description Element
       3.3.6.  The validity-start and validity-end Elements
       3.3.7.  The unicode-version Element
       3.3.8.  The references Element
   4.  Code Points and Variants
     4.1.  Sequences
     4.2.  Variants
       4.2.1.  Basic Variants
       4.2.2.  The type attribute
       4.2.3.  Null Variants
       4.2.4.  Variants with Reflexive Mapping
       4.2.5.  Conditional Variants
     4.3.  Annotations
       4.3.1.  The ref Attribute
       4.3.2.  The comment Attribute
     4.4.  Code Point Tagging
   5.  Whole Label and Context Evaluation
     5.1.  Basic Concepts
     5.2.  Character Classes
       5.2.1.  Declaring and Invoking Named Classes
       5.2.2.  Tag-based Classes
       5.2.3.  Unicode Property-based Classes
       5.2.4.  Explicitly Declared Classes
       5.2.5.  Combined Classes
     5.3.  Whole Label and Context Rules
       5.3.1.  The rule Element
       5.3.2.  The Match Operators
       5.3.3.  The count Attribute
       5.3.4.  The name and by-ref Attributes
       5.3.5.  The choice Element
       5.3.6.  Literal Code Point Sequences
       5.3.7.  The any Element
       5.3.8.  The start and end Elements
       5.3.9.  Example rule from IDNA2008
     5.4.  Parameterized Context or When Rules
       5.4.1.  The anchor Element
       5.4.2.  The look-behind and look-ahead Elements
       5.4.3.  Omitting the anchor Element
   6.  The action Element
     6.1.  The match and not-match Attributes
     6.2.  Actions with Variant Type Triggers
       6.2.1.  The all-, any- and only-variants Attributes
       6.2.2.  Example from RFC 3743 Tables
     6.3.  Recommended Disposition Values
     6.4.  Precedence
     6.5.  Implied Actions
     6.6.  Default Actions
   7.  Processing a Label Against an LGR
     7.1.  Determining Eligibility for a Label
     7.2.  Determining Variants for a Label
     7.3.  Determining a Disposition for a Label or Variant Label
   8.  Conversion to and from Other Formats
   9.  IANA Considerations
     9.1.  Media Type
     9.2.  URN Registration
   10. Security Considerations
   11. References
   Appendix A.  Example Tables
   Appendix B.  How to Translate RFC 3743 based Tables into the XML
                Format
   Appendix C.  Indic Syllable Structure Example
   Appendix D.  RelaxNG Compact Schema
   Appendix E.  Acknowledgements
   Appendix F.  Editorial Notes
     F.1.  Known Issues and Future Work
     F.2.  Change History
   Authors' Addresses

1.  Introduction

   This memo describes a method of using Extensible Markup Language
   (XML) to describe the algorithm used to determine whether a given
   identifier label is permitted, and under which conditions, based on
   the code points it contains and their context. These algorithms are
   comprised of a list of permissible code points, variant code point
   mappings, and a set of rules acting on them. These algorithms form
   part of an administrator's policies, and can be referred to as Label
   Generation Rulesets (LGRs), or IDN tables.

   There are other kinds of policies relating to labels which are not
   normally covered by Label Generation Rulesets and are therefore not
   representable by the XML format described here. These include, but
   are not limited to policies around trademarks, or prohibition of
   fraudulent or objectionable words.

   Administrators of the zones for top-level domain registries have
   historically published their LGRs using ASCII text or HTML. The
   formatting of these documents has been loosely based on the format
   used for the Language Variant Table described in [RFC3743].
   [RFC4290] also provides a "model table format" that describes a
   similar set of functionality. Common to these formats is that the
   algorithms used to evaluate the data therein are implicit or
   specified elsewhere.

   Through the first decade of IDN deployment, experience has shown that
   LGRs derived from these formats are difficult to consistently
   implement and compare due to their differing formats. A universal
   format, such as one using a structured XML format, will assist by
   improving machine-readability, consistency, reusability and
   maintainability of LGRs.

   When used to represent a simple list of permitted code points, the
   format is quite straightforward. At the cost of some complexity in
   the resulting file, it also allows for an implementation of more
   sophisticated handling of conditional variants that reflects the
   known requirements of current zone administrator policies.

   Another feature of this format is that it allows many of the
   algorithms to be made explicit and machine implementable. A
   remaining small set of implicit algorithms is described in this
   document to allow commonality in implementation.

   While the predominant usage of this specification is to represent IDN
   label policy, the format is not limited to IDN usage and may also be
   used for describing ASCII domain name label rulesets, or other types
   of identifier labels beyond those used for domain names.

2.  Design Goals

   The following goals informed the design of this format:

   o  The format needs to be implementable in a reasonably
      straightforward manner in software.

   o  The format should be able to be automatically checked for
      formatting errors, so that common mistakes can be caught.

   o  An LGR needs to be able to express the set of valid code points
      that are allowed for registration under a specific administrator's
      policies.

   o  Provide the ability to express computed alternatives to a given
      identifier based on mapping relationships between code points,
      whether one-to-one or many-to-many. These computed alternatives
      are commonly known as "variants".

   o  Variant code points should be able to be tagged with specific
      dispositions or categories that can be used to support registry
      policy (such as whether to allocate the computed variant, or to
      merely block it from usage or registration).

   o  Variants and code points must be able to be stipulated based on
      contextual information. For example, specific variants may only
      be applicable when they follow another specific code point, or
      when the code point is displayed in a specific presentation form.

   o  The data contained within an LGR must be able to be interpreted
      unambiguously, so that independent implementations that utilize
      the contents will arrive at the same results.

   o  To the largest extent possible, policy rules should be able to be
      specified in the XML format without relying on hidden or built-in
      algorithms in implementations.

   o  LGRs should be suitable for comparison and re-use, such that one
      could easily compare the contents of two or more to see the
      differences, to merge them, and so on.

   o  As many existing IDN tables as practicable should be able to be
      migrated to the LGR format with all applicable interpretation
      logic retained.

   These requirements are partly derived from reviewing the existing
   corpus of published IDN tables, plus the requirements of ICANN's work
   to implement an LGR for the DNS Root Zone [LGR-PROCEDURE]. In
   particular, Section B of that document identifies five specific
   requirements for an LGR methodology.

   The syntax and rules in [RFC5892] and [RFC3743] were also reviewed.

   It is explicitly not the goal of this format to stipulate what code
   points should be listed in an LGR by a zone administrator. Which
   registration policies are used for a particular zone is outside the
   scope of this memo.

3.  LGR Format

   An LGR is expressed as a well-formed XML Document [XML].

3.1.  Namespace

   The XML Namespace URI is "urn:ietf:params:xml:ns:lgr-1.0". [Note:
   the examples and schemas for any non-final versions of this
   specification use a namespace that is not guaranteed. Early
   implementors should consider the need to revise the namespace in
   subsequent revisions.]

   See Section 9.2 for more information.

3.2.  Basic Structure

   The basic XML framework of the document is as follows:

       <?xml version="1.0"?>
       <lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
           ...
       </lgr>

   The "lgr" element contains up to three sub-elements. First is an
   optional "meta" element that contains all meta-data associated with
   the LGR, such as its authorship, what it is used for, implementation
   notes and references. This is followed by a "data" element that
   contains the substantive code point data. Finally, an optional
   "rules" element contains information on contextual and whole-label
   evaluation rules, if any, along with any specific "action" elements
   providing for the disposition of labels and computed variant labels.

       <?xml version="1.0"?>
       <lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
           <meta>
               ...
           </meta>
           <data>
               ...
           </data>
           <rules>
               ...
           </rules>
       </lgr>

   A document MUST contain exactly one "lgr" element. Each "lgr"
   element MUST contain exactly one "data" element, optionally preceded
   by one "meta" element and optionally followed by one "rules" element.

   In the following descriptions, required, non-repeating elements or
   attributes are generally not called out explicitly, in contrast to
   optional ones or those that may be repeated. For attributes that
   take lists as values the elements are space-delimited.

3.3.  Metadata

   The optional "meta" element is used to express meta-data associated
   with the LGR. It can be used to identify the author or relevant
   contact person, explain the intended usage of the LGR, and provide
   implementation notes as well as references. With the exception of
   the "unicode-version" element, the data contained within is not
   required by software consuming the LGR in order to calculate valid
   labels, or to calculate variants. The "unicode-version" element MUST
   be used by a consumer of the table to identify that it has the
   correct Unicode property data to perform operations on the table.

3.3.1.  The version Element

   The "version" element is optional. It is used to uniquely identify
   each version of the LGR.
   No specific format is required, but it is
   RECOMMENDED that it be the decimal representation of a single
   positive integer, which is incremented with each revision of the
   file.

   An example of a typical first edition of a document:

       <version>1</version>

   The "version" element may have an optional "comment" attribute.

       <version comment="...">1</version>

3.3.2.  The date Element

   The optional "date" element is used to identify the date the LGR was
   posted. The contents of this element MUST be a valid ISO 8601
   "full-date" string as described in [RFC3339].

   Example of a date:

       <date>2009-11-01</date>

3.3.3.  The language Element

   The optional "language" element signals that the LGR is associated
   with a specific language or script. The value of the "language"
   element MUST be a valid language tag as described in [RFC5646]. The
   tag may refer to a script plus undefined language if the LGR is not
   referring to a specific language.

   Example of an English language LGR:

       <language>en</language>

   If the LGR applies to a specific script, rather than a language, the
   "und" language tag should be used followed by the relevant [RFC5646]
   script subtag. For example, for a Cyrillic script LGR:

       <language>und-Cyrl</language>

   If the LGR covers a specific set of multiple languages or scripts,
   the "language" element can be repeated. However, for cases of a
   script-specific LGR exhibiting insignificant admixture of code points
   from other scripts, it is RECOMMENDED to use a single "language"
   element identifying the predominant script. In the exceptional case
   of a multi-script LGR where no script is predominant, use Zyyy
   (Common):

       <language>und-Zyyy</language>

   Note that for the particular case of Japanese, a script tag
   "Jpan" exists that matches the mixture of scripts used in writing
   that language. The preferred "language" element would be:

       <language>und-Jpan</language>

3.3.4.  The scope Element

   This optional element refers to a scope, such as a domain, to which
   this policy is applied.
   The "type" attribute specifies the type of
   scope being defined. A type of "domain" means that the scope is a
   domain that represents the apex of the DNS zone to which the LGR is
   applied. The value must be a valid domain name, and in the case of
   the DNS root zone, should be represented as ".".

       <scope type="domain">example.com</scope>

   There may be multiple "scope" tags used, for example to reflect a
   list of domains to which the LGR is applied. Other types of scope
   are application defined, with an explanation in the "description"
   element RECOMMENDED.

3.3.5.  The description Element

   The "description" element is an optional free-form element that
   contains any additional relevant description that is useful for the
   user in its interpretation. Typically, this field contains
   authorship information, as well as additional context on how the LGR
   was formulated and how it applies, such as citations and references
   that apply to the LGR as a whole.

   This field should not be relied upon for providing instructions on
   how to parse or utilize the data contained elsewhere in the
   specification. Authors of tables should expect that software
   applications that parse and use LGRs will not use the description
   field to condition the application of the LGR's data and rules.

   The element has an optional "type" attribute, which refers to the
   internet media type of the enclosed data. Typical types would be
   "text/plain" or "text/html". The attribute SHOULD be a valid MIME
   type. If supplied, it will be assumed that the contents are of that
   media type. If the description lacks a type field, it will be
   assumed to be plain text ("text/plain").

3.3.6.  The validity-start and validity-end Elements

   The "validity-start" and "validity-end" elements are optional
   elements that describe, respectively, the date from which the
   contents of the LGR become valid (i.e., are used in registry policy)
   and the date on which the contents of the LGR cease to be used.

   The dates MUST conform to the "full-date" format described in section
   5.6 of [RFC3339].

       <validity-start>2014-03-12</validity-start>

3.3.7.  The unicode-version Element

   Whenever an LGR depends on character properties from a given version
   of the Unicode standard, the version number used in creating the LGR
   MUST be listed in the form x.y.z, where x, y, and z are positive,
   decimal integers (see [Unicode-Versions]). If any software
   processing the table does not have access to character property data
   of the requisite version, it MUST NOT perform any operations relating
   to whole-label evaluation relying on Unicode properties
   (Section 5.2.3).

   The value of a given Unicode property in [UAX42] may change between
   versions, unless such change has been explicitly disallowed in
   [Unicode-Stability]. It is RECOMMENDED to only reference properties
   defined as stable or immutable. As an alternative to referencing the
   property, the information can be presented explicitly in the LGR.

       <unicode-version>6.2.0</unicode-version>

   It is not necessary to include a "unicode-version" element for LGRs
   that do not make use of Unicode properties; however, it is
   RECOMMENDED.

3.3.8.  The references Element

   A Label Generation Ruleset may define a list of references which are
   used to associate various individual elements in the LGR to one or
   more normative references. A common use for references is to
   annotate that code points belong to an externally defined collection
   or standard, or to give normative references for rules.

   References are specified in an optional "references" element that
   contains any number of "reference" elements, each with a unique "id"
   attribute. It is RECOMMENDED that the "id" attribute be a zero-based
   integer. The value of each "reference" element SHOULD be the
   citation of a standard, dictionary or other specification in any
   suitable format.
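
   The uniqueness of "id" values, and the resolution of "ref"
   attributes (Section 4.3.1) against them, can be checked
   mechanically. The following is a minimal sketch in Python, not part
   of this specification. It assumes the namespace from Section 3.1 and
   that "ref" holds a space-delimited list of reference ids; the
   function name check_references is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI as given in Section 3.1 (may change in later revisions).
LGR_NS = "urn:ietf:params:xml:ns:lgr-1.0"


def check_references(lgr_xml):
    """Return a list of problems with reference ids and ref attributes.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(lgr_xml)
    problems = []
    seen = set()

    # Every "reference" element must carry a unique "id".
    for ref in root.iter(f"{{{LGR_NS}}}reference"):
        ref_id = ref.get("id")
        if ref_id in seen:
            problems.append(f"duplicate reference id: {ref_id}")
        seen.add(ref_id)

    # Assumption: "ref" is a space-delimited list of reference ids.
    for el in root.iter():
        for token in el.get("ref", "").split():
            if token not in seen:
                problems.append(f"undefined reference id: {token}")

    return problems
```

   An empty result means every "ref" token resolves to a declared
   reference and no "id" is repeated.
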
   In addition to an "id" attribute, a "reference"
   element may have a "comment" attribute for an optional free-form
   annotation.

       <references>
         <reference id="0">The Unicode Standard, Version 7.0</reference>
         <reference id="1">Big-5: Computer Chinese Glyph and Character
            Code Mapping Table, Technical Report C-26, 1984</reference>
         <reference id="2" comment="...">
            ISO/IEC
            10646:2012 3rd edition</reference>
         ...
       </references>
       ...
       <data>
         <char cp="..." ref="..." />
         ...
       </data>

   A reference is associated with an element by using an optional "ref"
   attribute (see Section 4.3.1). The use of "ref" attributes is
   limited to certain kinds of elements in the "data" or "rules"
   sections of the LGR, most notably those defining code points and
   rules. A "ref" attribute may neither occur on elements that are
   named references to character classes and rules nor on certain other
   element types. See the descriptions of these elements below.

4.  Code Points and Variants

   The bulk of a label generation ruleset is a description of which set
   of code points are eligible for a given label. For rulesets that
   perform operations that result in potential variants, the code
   point-level relationships between variants need to also be described.

   The code point data is collected within the "data" element. Within
   this element, a series of "char" and "range" elements describe
   eligible code points, or ranges of code points, respectively.

   Discrete permissible code points or code point sequences are declared
   with a "char" element, e.g.

       <char cp="002D" />

   Ranges of permissible code points may be stipulated with a "range"
   element, e.g.

       <range first-cp="0030" last-cp="0039" />

   The range is inclusive of the first and last code points. All
   attributes defined for a "range" element act as if applied to each
   code point within. A "range" element has no child elements.

   It is always possible to substitute a list of individually specified
   code points for a range element. The reverse is not necessarily the
   case.
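
   As a hedged illustration of substituting individually specified code
   points for a range, the following Python sketch reads the "data"
   element and expands every "range" into single code points. It is not
   part of this specification. The attribute names "first-cp" and
   "last-cp" and the function name expand_repertoire are assumptions
   made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URI as given in Section 3.1 (may change in later revisions).
LGR_NS = "urn:ietf:params:xml:ns:lgr-1.0"


def expand_repertoire(lgr_xml):
    """Return the repertoire as a list of code point tuples.

    Each "char" yields one tuple (a single code point or a sequence).
    Each "range" is expanded, inclusive of its first and last code
    points, into one single-element tuple per code point.
    """
    root = ET.fromstring(lgr_xml)
    data = root.find(f"{{{LGR_NS}}}data")
    if data is None:
        raise ValueError('an "lgr" element must contain a "data" element')

    entries = []
    for el in data:
        if el.tag == f"{{{LGR_NS}}}char":
            # "cp" is a space-delimited list of hexadecimal code points.
            cps = el.get("cp", "").split()
            entries.append(tuple(int(cp, 16) for cp in cps))
        elif el.tag == f"{{{LGR_NS}}}range":
            # Assumed attribute names: "first-cp" and "last-cp".
            first = int(el.get("first-cp"), 16)
            last = int(el.get("last-cp"), 16)
            entries.extend((cp,) for cp in range(first, last + 1))
    return entries
```

   A tool going in the other direction would have to confirm that
   consecutive code points carry identical attributes before merging
   them into a range.
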
   Whenever such a substitution is possible, it makes no
   difference in processing the data. Tools reading or writing the LGR
   format are free to aggregate sequences of consecutive code points of
   the same properties into range elements.

   Code points must be expressed in uppercase, hexadecimal, and zero
   padded to a minimum of 4 digits. In other words, they are
   represented according to the standard Unicode convention but without
   the prefix "U+". The rationale for not allowing other encoding
   formats, including native Unicode encoding in XML, is explored in
   [UAX42]. The XML conventions used in this format, including the
   element and attribute names, mirror this document where practical
   and reasonable to do so. It is RECOMMENDED to list all "char"
   elements in ascending order of the "cp" attribute.

   All "char" elements in the data section MUST have distinct "cp"
   attributes. The "range" elements MUST NOT specify code point ranges
   that overlap either another range or any single code point "char"
   elements.

4.1.  Sequences

   A sequence of two or more code points may be specified in an LGR, for
   example, when defining the source for n:m variant mappings. Another
   use of sequences would be in cases when the exact sequence of code
   points is required to occur in order for the constituent elements to
   be eligible, such as when a specific code point is only eligible when
   preceded or followed by another code point. The following would
   define the eligibility of the MIDDLE DOT (U+00B7) only when both
   preceded and followed by the LATIN SMALL LETTER L (U+006C):

       <char cp="006C 00B7 006C" />

   All sequences defined this way must be distinct, but sub-sequences
   may be defined.
   Thus, the sequence defined here may coexist with
   single code point definitions such as:

       <char cp="006C" />

   As an alternative to using sequences to define a required context, a
   "char" or "range" element may specify conditional context using an
   optional "when" attribute as described below in Section 4.2.5. The
   latter method is more flexible in that such conditional context is
   not limited to specific code points, and it allows both prohibited
   and required context to be specified.

   As described below, the "char" element, whether or not it is used for
   a single code point, or for a sequence, may have optional child
   elements defining variants. Both the "char" and "range" elements can
   take a number of optional attributes for conditional inclusion,
   commenting, cross referencing and character tagging, as described
   below.

4.2.  Variants

   Most LGRs typically only determine simple code point eligibility, and
   for them, the elements described so far would be the only ones
   required for their "data" section. Others additionally specify a
   mapping of code points to other code points, known as "variants".
   What constitutes a variant code point is a matter of policy, and
   varies for each implementation. The following examples are intended
   to demonstrate the syntax; they are not necessarily typical.

4.2.1.  Basic Variants

   Variant code points are specified using one or more "var" elements as
   children of a "char" element. The target mapping is specified using
   the "cp" attribute. Other, optional attributes for the "var" element
   are described below.

   For example, to map LATIN SMALL LETTER V (U+0076) as a variant of
   LATIN SMALL LETTER U (U+0075):

       <char cp="0075">
           <var cp="0076" />
       </char>

   A sequence of multiple code points can be specified as a variant of a
   single code point.
   For example, the sequence of LATIN SMALL LETTER O
   (U+006F) then LATIN SMALL LETTER E (U+0065) might hypothetically be
   specified as a variant for LATIN SMALL LETTER O WITH DIAERESIS
   (U+00F6) as follows:

       <char cp="00F6">
           <var cp="006F 0065" />
       </char>

   The source and target of a variant mapping may both be sequences, but
   not ranges.

   The "var" element specifies variant mappings in only one direction,
   even though the variant relation is usually considered symmetric,
   that is, if A is a variant of B then B should also be a variant of A.
   The format requires that the inverse of the variant be given
   explicitly to fully specify symmetric variant relations in the LGR.
   This has the beneficial side effect of making the symmetry explicit:

       <char cp="006F 0065">
           <var cp="00F6" />
       </char>

   Variant relations are normally not only symmetric, but also
   transitive. If A is a variant of B and B is a variant of C, then A
   is also a variant of C. As with symmetry, these transitive relations
   are spelled out explicitly in the LGR.

   All variant mappings are unique. For a given "char" element all
   "var" elements MUST have a unique combination of "cp", "when" and
   "not-when" attributes. It is RECOMMENDED to list the "var" elements
   in ascending order of their target code point sequence. (For "when"
   and "not-when" attributes, see Section 4.2.5).

4.2.2.  The type attribute

   Variants may be tagged with an optional "type" attribute. The value
   of the "type" attribute may be any non-empty value not starting with
   an underscore and not containing spaces. This value is used to
   resolve the disposition of any variant labels created using a given
   variant. (See Section 6.2.)

   By default, the values of the "type" attribute directly describe the
   target policy status (disposition) for a variant label that was
   generated using a particular variant, with any variant label being
   assigned a disposition corresponding to the most restrictive variant
   type.
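
   The default resolution described above can be sketched as follows.
   This is an illustration in Python only, not part of this
   specification. The type names "blocked" and "allocatable", their
   relative order, and the function name most_restrictive are
   assumptions; the actual values and their precedence are governed by
   Section 6.

```python
# Assumed precedence for illustration, most restrictive first.
# The real disposition values and their order come from Section 6.
ASSUMED_PRECEDENCE = ["blocked", "allocatable"]


def most_restrictive(variant_types, precedence=ASSUMED_PRECEDENCE):
    """Return the most restrictive type among the variants used in a label.

    variant_types holds the "type" value of each variant mapping that
    was applied to produce the variant label.
    """
    types = list(variant_types)
    if not types:
        raise ValueError("a variant label uses at least one variant mapping")
    unknown = [t for t in types if t not in precedence]
    if unknown:
        raise ValueError(f"no assumed precedence for types: {unknown}")
    # The earliest entry in the precedence list wins.
    return min(types, key=precedence.index)
```

   Under these assumptions, a label built from two "allocatable"
   variants and one "blocked" variant resolves to "blocked".
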
   Several conventional disposition values are predefined below
   in Section 6. Whenever these values can represent the desired
   policy, they SHOULD be used.

       <char cp="...">
           <var cp="..." type="..." />
           ...
       </char>

   By default, if a variant label contains any instance of one of the
   variants of type "blocked", the label would be blocked, but if it
   contained only instances of variants to be allocated it could be
   allocated. See the discussion about implied actions in Section 6.6.

   The XML format for the LGR makes the relation between the values of
   the "type" attribute on variants and the resulting disposition of
   variant labels fully explicit. See the discussion in Section 6.2.
   Making this relation explicit allows a generalization of the "type"
   attribute from directly reflecting dispositions to a more
   differentiated intermediate value that is used in the resolution of
   label disposition. Instead of the default action of applying the
   most restrictive disposition to the entire label, such a generalized
   resolution can be used to achieve additional goals, such as limiting
   the set of allocated variant labels, or to implement other policies
   found in existing LGRs (see for example Appendix B).

   Because variant mappings MUST be unique, it is not possible to define
   the same variant for the same "char" element with different type
   attributes (see however Section 4.2.5).

4.2.3.  Null Variants

   A null variant is a variant string that maps to no code point. This
   is used when a particular code point sequence is considered
   discretionary in the context of a whole label. To specify a null
   variant, use an empty "cp" attribute. For example, to map a string
   with a ZERO WIDTH NON-JOINER (U+200C) to the same string without the
   ZERO WIDTH NON-JOINER:

       <char cp="200C">
           <var cp="" />
       </char>

   This is useful in expressing the intent that some code points in a
   label are to be mapped away when generating a canonical variant of
   the label.
However, in tables that are designed to have symmetric
variant mappings, this could lead to combinatorial explosion, if not
handled carefully.

The symmetric form of a null variant is expressed as follows:

    [XML example missing from this copy]

A "char" element with an empty "cp" attribute MUST specify at least
one variant mapping. It is strongly RECOMMENDED to use a type of
"invalid" or equivalent when defining variant mappings from null
sequences, so that variant mappings from null sequences are removed in
variant label generation (see Section 4.2.2).

4.2.4. Variants with Reflexive Mapping

At first sight there seems to be no call for adding variant mappings
for which source and target code points are the same, that is, for
which the mapping is reflexive, or, in other words, an identity
mapping. Yet such reflexive mappings occur frequently in LGRs that
follow [RFC3743].

Adding a "var" element allows both a type and a reference id to be
specified for it. While the reference id is not used in processing,
the type of the variant can be used to trigger actions. In permuting
the label to generate all possible variants, the type associated with
a reflexive variant mapping is applied to any of the permuted labels
containing the original code point.

In the following example, the code point U+3473 exists both as a
variant of U+3447 and as a variant of itself (reflexive mapping).

Assuming an original label of "U+3473 U+3447", the permuted variant
"U+3473 U+3473" would consist of the reflexive variant of U+3473
followed by a variant of U+3447.
Accordingly, the types for both of
the variant mappings used to generate that particular permutation
would have the value "preferred" given the following definitions of
variant mappings:

    [XML example missing from this copy]

Having established the variant types in this way, a set of actions
could be defined that return a disposition of "allocate" or
"activate" for a label consisting exclusively of variants with type
"preferred", for example. (For details on how to define actions based
on variant types see Section 6.2.1.)

In general, using reflexive variant mappings in this manner makes it
possible to calculate disposition values using a uniform approach for
all labels, whether they consist of mapped variant code points,
original code points, or a mixture of both. In particular, the
dispositions for two otherwise identical labels may differ based on
which variant mappings were executed in order to generate each of
them. (For details on how to generate variants and evaluate
dispositions, see Section 7.)

Another useful convention that uses reflexive variants is described
below in Section 6.2.1.

4.2.5. Conditional Variants

Fundamentally, variants are mappings between two sequences of code
points. However, in some instances for a variant relationship to
exist, some context external to the code point sequence must be
considered. For example, a positional context may determine whether
two code point sequences are variants of each other.

One example is Arabic code points, which can have different
forms based on position, with some code points sharing forms, thus
making them variants in the positions corresponding to those forms.
Such positional context cannot be solely derived from the code point
by itself, as the code point would be the same for the various forms.

To specify a conditional variant relationship the optional "when"
attribute is used.
The variant relationship exists when the
condition in the "when" attribute is satisfied. A "not-when"
attribute may be used for conditions that must not be satisfied. The
value of each "when" or "not-when" attribute is a parameterized
context rule as described below in Section 5.4.

As described in Section 4.1 a "when" or "not-when" attribute may also
be specified on any "char" element in the data section to define
required or prohibited contextual conditions under which a code point
is valid.

Assuming the "rules" element contains suitably defined rules for
"arabic-isolated" and "arabic-final", the following example shows how
to mark ARABIC LETTER ALEF WITH WAVY HAMZA BELOW (U+0673) as a
variant of ARABIC LETTER ALEF WITH HAMZA BELOW (U+0625), but only
when it appears in its isolated or final forms:

    [XML example missing from this copy]

Only a single "when" or "not-when" attribute can be applied to any
"var" element; however, multiple "var" elements using the same
mapping, but different "when" or "not-when" attributes, may be
specified. In such a case care must be taken to ensure that for each
context at most one of the context rules for the "when" or "not-when"
attributes is satisfied; otherwise the results are undefined.

Two contexts may be complementary, as in the following example, which
shows ARABIC LETTER TEH MARBUTA (U+0629) as a variant of ARABIC
LETTER HEH (U+0647), but with two different types.

    [XML example missing from this copy]

The intent is that in final position a label that uses U+0629 instead
of U+0647 should be considered essentially the same label and
therefore allocatable to the same entity, while the same substitution
in non-final context leads to labels that are different, but
considered confusable so that either one, but not both, should be
delegatable.

For symmetry, the reverse mappings must exist, and must agree in
their "when" or "not-when" attributes.
However, symmetry does not
apply to the other attributes. For example, these are the actual
reverse mappings for the above:

    [XML example missing from this copy]

Here, both variants have the same "type" attribute. While it is
tempting to recognize that in this instance the "when" and "not-when"
attributes are complementary and therefore between them cover every
single possible context, it is STRONGLY RECOMMENDED to use the format
shown in the example that makes the symmetry easily verifiable by
parsers and tools. (The same applies to entries created for
transitivity.)

Arabic is an example of a script for which such conditional variants
have been established in at least some existing LGRs. The mechanism
defined here supports other forms of conditional variants that may
be required by other scripts.

4.3. Annotations

Two attributes, the "ref" and "comment" attributes, can be used to
annotate individual elements in the LGR. They are ignored in
machine-processing of the LGR. The "ref" attribute is intended for
formal annotations, and the "comment" attribute for free form
annotation. The latter can be applied more widely.

4.3.1. The ref Attribute

Reference information may optionally be specified by a "ref"
attribute, consisting of a space delimited sequence of reference
identifiers.

    [XML example missing from this copy]

This facility is typically used to give source information for code
points or variant relations. This information is ignored when
machine-processing an LGR. If applied to a range the "ref" attribute
applies to every code point in the range. All reference identifiers
MUST be from the set declared in the "references" element (see
Section 3.3.8). It is an error to repeat a reference identifier in
the same "ref" attribute. It is RECOMMENDED that identifiers be
listed in ascending order.
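
The constraints on the "ref" attribute lend themselves to a mechanical
check. The following Python sketch is illustrative only and not part of
this specification: the function name check_ref_attribute, the way
declared identifiers are passed in, and the ordering convention
(numeric identifiers compared by value, all others as text) are
assumptions made for the example.

```python
def check_ref_attribute(value, declared):
    """Check a space-delimited "ref" attribute value.

    value    -- the raw attribute string, e.g. "0 2 10"
    declared -- set of identifiers declared in the "references" element
    Returns (ids, warnings); raises ValueError on a hard error.
    """
    ids = value.split()
    seen = set()
    for ref_id in ids:
        if ref_id in seen:
            # Repeating an identifier in one "ref" attribute is an error.
            raise ValueError(f"reference {ref_id!r} is repeated")
        if ref_id not in declared:
            # Every identifier MUST come from the declared set.
            raise ValueError(f"reference {ref_id!r} is not declared")
        seen.add(ref_id)

    def sort_key(ref_id):
        # Assumption: numeric identifiers sort by value, others as text.
        if ref_id.isascii() and ref_id.isdigit():
            return (0, int(ref_id), "")
        return (1, 0, ref_id)

    warnings = []
    if ids != sorted(ids, key=sort_key):
        # Ascending order is only RECOMMENDED, so this is a warning.
        warnings.append("identifiers are not listed in ascending order")
    return ids, warnings
```

Treating the ordering rule as a warning rather than an error mirrors
the difference between the MUST and RECOMMENDED wording above.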

In addition to "char", "range" and "var" elements in the data
section, a "ref" attribute may be present for these elements that
appear in the rules section described below: actions, literals
("char" inside a rule), as well as for definitions of rules and
classes, but not for named references using the "by-ref" attribute
defined below. For these elements, the use of the "by-ref" and "ref"
attributes is mutually exclusive. None of the elements in the
metadata take a "ref" attribute; instead use the description element
there.

4.3.2. The comment Attribute

Any "char", "range" or "var" element in the data section may
contain an optional "comment" attribute. The contents of a "comment"
attribute are free-form plain text. Comments are ignored in machine
processing of the table. Comment attributes may also be placed on
all elements in the "rules" section of the document, such as actions
and match operators, such as literals ("char"), as well as
definitions of classes and rules, but not on child elements of the
"class" element. Finally, in the metadata, only the "version" and
"reference" elements may have "comment" attributes (to match the
syntax in [RFC3743]).

4.4. Code Point Tagging

Typically, LGRs are used to explicitly designate allowable code
points, where any label that contains a code point not explicitly
listed in the LGR is considered an ineligible label according to the
ruleset.

For more complex registry rules, there may be a need to discern one
or more subsets of code points. This can be accomplished by applying
an optional "tag" attribute to "char" or "range" elements that are
child elements of the "data" element. By collecting code points that
share the same tag value, character classes may be defined (see
Section 5.2.2) which can then be used in whole label evaluation rules
(see Section 5.3.2).
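
Collecting code points by tag value can be pictured as a simple
grouping step. The Python sketch below is an illustration under stated
assumptions, not a normative algorithm: the function name
classes_from_tags and the tuple layout of its input are hypothetical,
and the sample tag values in the usage note are invented for the
example. It assumes a "tag" attribute holds values separated by white
space and that a tag on a range applies to every code point in it.

```python
from collections import defaultdict


def classes_from_tags(entries):
    """Group code points into sets keyed by tag value.

    entries -- iterable of (first_cp, last_cp, tag_attribute) tuples;
               a single code point has first_cp == last_cp.
    """
    classes = defaultdict(set)
    for first, last, tag_attr in entries:
        for tag in tag_attr.split():
            # A tag on a range applies to all code points in the range.
            classes[tag].update(range(first, last + 1))
    return dict(classes)
```

For instance, passing (0x0061, 0x0061, "letter lower") and
(0x4E00, 0x4E00, "letter") yields a "letter" set holding both code
points and a "lower" set holding only U+0061.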

Each "tag" attribute may contain multiple values separated by white
space. A tag value is an identifier, which may also include certain
punctuation marks, such as colon. Formally, it MUST correspond to
the XML 1.0 Nmtoken (Name token) production. It is an error to
duplicate a value within the same "tag" attribute. A "tag" attribute
for a "range" element applies to all code points in the range.
Because code point sequences are not proper members of a set of code
points, a "tag" attribute MUST NOT be present in a "char" element
defining a code point sequence.

5. Whole Label and Context Evaluation

5.1. Basic Concepts

The code points in a label sometimes need to satisfy context-based
rules, for example for the label to be considered valid, or to
satisfy the context for a variant mapping (see the description of the
"when" attribute in Section 5.4).

A Whole Label Evaluation rule (WLE) is applied to the whole label.
It is used to validate both original labels and variant labels
computed from them using a permutation over all applicable variant
mappings. A conditional context rule is a specialized form of WLE
specific to the context around a single code point or code point
sequence. For example, if a rule is referenced in the "when"
attribute of a variant mapping it is used to describe the conditional
context under which the particular variant mapping is defined to
exist.

Each rule is defined in a "rule" element. A rule may contain the
following as child elements:

o literal code points or code point sequences

o character classes, which define sets of code points to be used for
  context comparisons

o context operators, which define when character classes and
  literals may appear

o nested rules, whether defined in place or invoked by reference

Collectively, these are called match operators and are listed in
Section 5.3.2.

5.2. Character Classes

Character classes are sets of characters that often share a
particular property. While they function like sets in every way,
even supporting the usual set operators, they are called character
classes here in a nod to the use of that term in regular expression
syntax. (This also avoids confusion with the term "character set" in
the sense of character encoding.)

Character classes (or sets) can be specified in several ways:

o by defining the set via matching a tag in the code point data.
  All characters with the same "tag" attribute are part of the same
  class;

o by referencing one of the Unicode character properties defined in
  the Unicode Character Database [UAX42];

o by explicitly listing all the code points in the class; or

o by defining the class as a set combination of any number of other
  classes.

5.2.1. Declaring and Invoking Named Classes

A character class has an optional "name" attribute, consisting of a
single identifier not containing spaces. All names for classes must
be unique. If the "name" attribute is omitted, the class is
anonymous and exists only inside the rule or combined class where it
is defined. A named character class is defined independently and can
be referenced by name from within any rules or as part of other
character class definitions.

    [XML example missing from this copy]

An empty "class" element with a "by-ref" attribute is a reference to
an existing named class. The "by-ref" attribute cannot be used in
the same "class" element with any of these attributes: "name",
"from-tag", "property" or "ref". The "name" attribute MUST be present if
and only if the class is a direct child element of the "rules"
element. It is an error to reference a named class for which the
definition has not been seen.

5.2.2. Tag-based Classes

The "char" or "range" elements that are child elements of the "data"
element may contain a "tag" attribute that consists of one or more
space separated tag values, for example:

    <char cp="0061" tag="letter lower"/>
    <char cp="4E00" tag="letter"/>

This defines two tags for use with code point U+0061, the tag
"letter" and the tag "lower". Use

    <class name="letter" from-tag="letter"/>
    <class name="lower" from-tag="lower"/>

to define two named character classes, "letter" and "lower",
containing all code points with the respective tags, the first with
0061 and 4E00 as elements and the latter with 0061, but not 4E00, as
an element. The "name" attribute may be omitted for an anonymous
in-place definition of a nested, tag-based class.

Tag values are typically identifiers, with the addition of a few
punctuation symbols, such as colon. Formally they MUST correspond to
the XML 1.0 Nmtoken (Name token) production. While a "tag" attribute
may contain a list of tag values, the "from-tag" attribute always
contains a single tag value.

If the document contains no "char" or "range" elements with a
corresponding tag, the character class represents the empty set.
This is valid, to allow a common "rules" element to be shared across
files. However, it is RECOMMENDED that implementations allow for a
warning to ensure that referring to an undefined tag in this way is
intentional.

5.2.3. Unicode Property-based Classes

A class is defined in terms of Unicode properties by giving the
Unicode property alias and the property value or property value
alias, separated by a colon.

    [XML example missing from this copy]

The example above selects all code points for which the Unicode
canonical combining class (ccc) value is 9. This value of the ccc is
assigned to all code points that encode viramas. The string "ccc" is
the short-alias for the canonical combining class, as defined in the
Unicode Character Database [UAX42].
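
As a rough illustration of how a processor might resolve a class
selected by "ccc:9", the Python sketch below filters a candidate range
with the standard library's unicodedata module. The function name
class_from_ccc is hypothetical. Note the assumption that the Unicode
data bundled with the Python interpreter is acceptable for the example;
a real implementation would need data for the Unicode version that the
LGR declares.

```python
import unicodedata


def class_from_ccc(value, candidates):
    """Return the code points among candidates whose canonical
    combining class (ccc) equals value."""
    return {cp for cp in candidates
            if unicodedata.combining(chr(cp)) == value}


# Example: scan the Devanagari block (U+0900..U+097F) for ccc = 9.
viramas = class_from_ccc(9, range(0x0900, 0x0980))
```

Here the resulting set contains DEVANAGARI SIGN VIRAMA (U+094D).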

Unicode properties may, in principle, change between versions of the
Unicode Standard. However, the values assigned for a given version
are fixed. If Unicode properties are used, a Unicode version MUST be
declared in the "unicode-version" element in the header. (Note: some
Unicode properties are by definition stable across versions and do
not change once assigned; see [Unicode-Stability].)

It is RECOMMENDED that all implementations processing LGR files
provide support for the following minimal set of Unicode properties:

o General Category (gc)

o Script (sc)

o Canonical Combining Class (ccc)

o Bidi Class (bc)

o Arabic Joining Type (jt)

o Indic Syllabic Category (InSC)

o Deprecated (Dep)

The short name for each property is given in parentheses.

If a program that is using an LGR to determine the validity of a
label encounters a property that it does not support, it MUST abort
with an error.

5.2.4. Explicitly Declared Classes

A class of code points may also be declared by listing the code
points that are members of the class. This is useful when tagging
cannot be used because code points are not listed individually as
part of the eligible set of code points for the given LGR, for
example because they only occur in code point sequences.

To define a class in terms of an explicit list of code points use a
space separated list of hexadecimal code point values:

    <class name="abcd">0061 0062 0063 0064</class>

This defines a class named "abcd" containing the code points for
characters "a", "b", "c" and "d". The ordering of the code points is
not material, but it is RECOMMENDED to list them in ascending order.

Code point ranges are represented by a start and end value separated
by a hyphen.
The following declaration is equivalent to the
preceding:

    <class name="abcd">0061-0064</class>

Range and code point declarations can be freely intermixed:

    <class name="abcd">0061 0062-0063 0064</class>

5.2.5. Combined Classes

Classes may be combined using operators for set complement, union,
intersection, difference and symmetric difference (exclusive-or).
Because classes fundamentally function like sets, the union of
several character classes is itself a class, for example.

+-------------------+----------------------------------------------+
| Logical Operation | Example                                      |
+-------------------+----------------------------------------------+
| Complement        | [XML example missing from this copy]         |
+-------------------+----------------------------------------------+
| Union             | [XML example missing from this copy]         |
+-------------------+----------------------------------------------+
| Intersection      | [XML example missing from this copy]         |
+-------------------+----------------------------------------------+
| Difference        | [XML example missing from this copy]         |
+-------------------+----------------------------------------------+
| Symmetric         | [XML example missing from this copy]         |
| Difference        |                                              |
+-------------------+----------------------------------------------+

                          Set Operators

The elements from this table may be arbitrarily nested inside each
other, subject to the following restriction: a "complement" element
MUST contain precisely one "class" or one of the operator elements,
while an "intersection", "symmetric-difference" or "difference"
element MUST contain precisely two, and a "union" element MUST
contain two or more of these elements.

An anonymous combined class can be defined directly inside a rule or
any of the match operator elements that allow child elements (see
Section 5.3.2) by using the set combination as the outer element.

    <rule>
        <union>
            <class by-ref="xxx"/>
            <class by-ref="yyy"/>
        </union>
    </rule>

The example shows the definition of an anonymous combined class that
represents the union of classes "xxx" and "yyy". There is no need to
wrap this union inside another "class" element, and, in fact, set
combination elements MUST NOT be nested inside a "class" element.

Lastly, to create a named combined class that can be referenced in
other classes or in rules, add a "name" attribute to the set
combination element and place it at the top level immediately below
the "rules" element (see Section 5.2.1).

    [XML example missing from this copy]

Because (as for ordinary sets) a combination of classes is itself a
class, no matter by what combinations of set operators a combined
class is created, a reference to it always uses the "class" element
as described in Section 5.2.1. That is, a named class is always
referenced via an empty "class" element using the "by-ref" attribute
containing the name of the class to be referenced.

5.3. Whole Label and Context Rules

Each rule consists of a series of matching operators that must be
satisfied in order to determine whether a label meets a given
condition. Rules may reference other rules or character classes
defined elsewhere in the table.

5.3.1. The rule Element

A matching rule is defined by a "rule" element, the child elements of
which are one of the match operators from Section 5.3.2. In
evaluating a rule, each child element is matched in order. Rule
elements may be nested.

Rules may optionally be named using a "name" attribute containing a
single identifier string with no spaces. A named rule may be
incorporated into another rule by reference. If the "name" attribute
is omitted, the rule is anonymous and may not be incorporated by
reference into another rule or referenced by an action or "when"
attribute.

A simple rule to match a label where all characters are members of
the class "preferred":

    [XML example missing from this copy]

Rules are paired with explicit and implied actions, triggering these
actions when a rule matches a label. For example, a simple explicit
action for the rule shown above would be:

    [XML example missing from this copy]

This has the effect of setting the policy disposition for a label
made up entirely of "preferred" code points to "allocate". Explicit
actions are further discussed in Section 6 and the use of rules in
conditional contexts for implied actions is discussed in
Section 4.2.5 and Section 6.5.

5.3.2. The Match Operators

The child elements of a rule are a series of match operators, which
are listed here by type and name and with a basic example or two.

+------------+-------------+------------------------------------+
| Type       | Operator    | Examples                           |
+------------+-------------+------------------------------------+
| logical    | any         | [XML example missing]              |
|            +-------------+------------------------------------+
|            | choice      | [XML example missing]              |
+------------+-------------+------------------------------------+
| positional | start       | [XML example missing]              |
|            +-------------+------------------------------------+
|            | end         | [XML example missing]              |
+------------+-------------+------------------------------------+
| literal    | char        | [XML example missing]              |
+------------+-------------+------------------------------------+
| set        | class       | [XML example missing]              |
|            |             | <class>0061 0064-0065</class>      |
+------------+-------------+------------------------------------+
| group      | rule        | [XML example missing]              |
+------------+-------------+------------------------------------+
| contextual | anchor      | [XML example missing]              |
|            +-------------+------------------------------------+
|            | look-ahead  | [XML example missing]              |
|            +-------------+------------------------------------+
|            | look-behind | [XML example missing]              |
+------------+-------------+------------------------------------+

                        Match Operators

Any element defining an anonymous class can be used as a match
operator, including any of the set combination operators (see
Section 5.2.5) as well as references to named classes.

All match operators shown as empty elements in the Examples column of
the table above do not support child elements of their own; otherwise
match operators may be nested. In particular, anonymous "rule"
elements can be used for grouping.

5.3.3. The count Attribute

The optional "count" attribute specifies the minimum required or
maximum permitted number of times a match operator is used to match
input. If the "count" attribute is

n    the match operator matches the input exactly n times, where n is
     1 or greater.

n+   the match operator matches the input at least n times, where n
     is 0 or greater.

n:m  the match operator matches the input at least n times where n is
     0 or greater, but matches the input up to m times in total,
     where m > n. If m = n and n > 0, the match operator matches the
     input exactly n times.

If there is no "count" attribute, the match operator matches the
input exactly once.

In matching, greedy evaluation is used in the sense defined for
regular expressions: beyond the required number of times, the input
is matched as many times as possible, but not so often as to prevent
a match of the remainder of the rule.

The "count" attribute MUST NOT be applied to match operators of type
"start", "end", "anchor", "look-ahead" and "look-behind" or to any
operators, such as "rule" or "choice", that contain them, whether the
latter are declared in place or used by reference. The "count"
attribute may be applied to "class" and "rule" elements only if they
do not have a "name" attribute, that is, to anonymous rules and
classes or any invocation of predefined rules or classes by
reference.
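
The three forms of the "count" attribute can be summarized in a short
parser. The following Python sketch is illustrative only; the names
parse_count and to_regex_quantifier are hypothetical, and representing
the result as a (minimum, maximum) pair with None for "no upper bound"
is an assumption of the example. Because greedy evaluation is defined
in the regular-expression sense, the sketch also shows the equivalent
greedy regular-expression quantifier.

```python
import re

_COUNT_RE = re.compile(r"([0-9]+)(?:(\+)|:([0-9]+))?")


def parse_count(value):
    """Return (minimum, maximum) for a "count" attribute value.

    maximum is None when there is no upper bound. A missing attribute
    (value is None) means "exactly once".
    """
    if value is None:
        return (1, 1)
    m = _COUNT_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"malformed count: {value!r}")
    n = int(m.group(1))
    if m.group(2):                      # form "n+", n >= 0
        return (n, None)
    if m.group(3) is not None:          # form "n:m"
        upper = int(m.group(3))
        # m > n is required, except that m = n is allowed when n > 0.
        if upper < n or (upper == n and n == 0):
            raise ValueError(f"invalid bounds in count: {value!r}")
        return (n, upper)
    if n < 1:                           # form "n", n >= 1
        raise ValueError("a plain count must be 1 or greater")
    return (n, n)


def to_regex_quantifier(value):
    """Render a count as a greedy regular-expression quantifier."""
    low, high = parse_count(value)
    if high is None:
        return "{%d,}" % low
    if low == high:
        return "{%d}" % low
    return "{%d,%d}" % (low, high)
```

For example, a count of "1:3" corresponds to the quantifier "{1,3}",
which matches as many repetitions as possible while still letting the
rest of the rule match.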

The optional "count" attribute MAY be applied to match operators of
type "any", "char" and "class", as well as to match operators
"choice" and "rule", as long as they contain none of the operators
"start", "end", "anchor", "look-ahead" and "look-behind" as direct or
indirect child elements. The same requirement applies recursively to
any "rule" element referenced inside a "choice" or "rule" with a
"count" attribute. The "count" attribute cannot appear in the same
element as a "name" attribute.

5.3.4. The name and by-ref Attributes

Like classes (see Section 5.2.1), rules declared as immediate child
elements of the "rules" element MUST be named using a unique "name"
attribute, and all other instances MUST NOT be named. Anonymous
rules and classes or references to named rules and classes can be
nested inside other match operators.

To reference a named rule or class inside a rule or match operator
use a "rule" or "class" element with an optional "by-ref" attribute
containing the name of the referenced element. It is an error to
reference a rule or class for which the definition has not been seen.
The "by-ref" attribute cannot appear in the same element as the
"name" attribute, or in an element that has any child elements.

Here is an example of a rule requiring that all labels be letters
(optionally followed by combining marks) and possibly digits. The
example shows rules and classes referenced by name.

    [XML example missing from this copy]

5.3.5. The choice Element

The "choice" element is used to represent a list of two or more
alternatives:

    [XML example missing from this copy]

Each child element of a "choice" represents one alternative. The
first matching alternative determines the match for the "choice"
element.
To express a choice where an alternative itself consists of
a sequence of elements, the sequence must be wrapped in an anonymous
rule.

5.3.6. Literal Code Point Sequences

A literal code point sequence matches a single code point or a
sequence. It is defined by a "char" element, with the code point or
sequence to be matched given by the "cp" attribute. When used as a
literal, a "char" element may contain a "count" in addition to the
"cp" attribute and optional "comment" or "ref" attributes. No other
attributes or child elements are permitted.

5.3.7. The any Element

The "any" element matches any single code point. It may have a
"count" attribute. For an example see Section 5.3.9.

Unlike a literal, the "any" element may not have a "ref" attribute.

5.3.8. The start and end Elements

To match the beginning or end of a label, use the "start" or "end"
element. An empty label would match this rule:

    [XML example missing from this copy]

Conceptually, Whole Label Evaluation Rules evaluate the label as a
whole, but in practice, many rules do not actually need to be
specified to match the entire label. For example, to express a
requirement of not starting a label with a digit, a rule needs to
describe only the initial part of a label.

This example uses the previously defined rules, together with the
"start" and "end" elements, to define a rule that requires that an
entire label is well-formed. For this example, that means that it
must start with a letter and contain no leading digits or combining
marks, nor combining marks placed on digits.

    [XML example missing from this copy]

Each "start" or "end" element occurs at most once in a rule, except
if nested inside a "choice" element in such a way that in matching
each alternative at most one occurrence of each is encountered.
Otherwise, the result is an error, as is any case where a "start" or
"end" element is not encountered as first or last element to be
matched, respectively, in matching a rule. Start and end elements do
not have a "count" or any other attribute. It is an error for any
match operator enclosing a nested "start" or "end" element to have a
"count" attribute.

5.3.9. Example rule from IDNA2008

This is an example of the whole label evaluation rule from [RFC5892]
forbidding the mixture of the Arabic-Indic and extended Arabic-Indic
digits in the same label. The example also demonstrates several
instances of the use of anonymous rules for grouping.

    [XML example missing from this copy]

The effect of this example is that a label containing a code point
from either of the two digit ranges is invalid whenever it matches
the "mixed-digits" rule, that is, anytime a code point from the other
range is also present. Note that this is not the same as
invalidating the definition of the "range" elements.

5.4. Parameterized Context or When Rules

A special type of rule provides a context for evaluating the validity
of a code point or variant mapping. This rule is invoked by the
"when" attribute described in Section 4.2.5. An action implied by a
context rule always has a disposition of "invalid" whenever the rule
is not matched (see Section 6.5). Conversely, a "not-when" attribute
results in a disposition of "invalid" whenever the rule is matched.

5.4.1. The anchor Element

Such parameterized context or "When Rules" may contain a special
place holder represented by an "anchor" element. As each When Rule
is evaluated, the "anchor" element is replaced by a literal
corresponding to the "cp" attribute of the element containing the
"when" (or "not-when") attribute.
The match to the "anchor" element
must be at the same position in the label as the code point or
variant mapping triggering the When Rule.

For example, the Greek lower numeral sign is invalid if not
immediately preceding a character in the Greek script. This is most
naturally addressed with a When Rule using look-ahead:

    [XML example missing from this copy]

In evaluating this rule, the "anchor" element is treated as if it was
replaced by a literal

    <char cp="0375"/>

but only the instance of U+0375 at the given position is evaluated.
If a label had two instances of U+0375 with the first one matching
the rule and the second not, then evaluating the When Rule MUST
succeed for the first and fail for the second instance.

Unlike other rules, When Rules containing an "anchor" element MUST
only be invoked via the "when" or "not-when" attributes on code
points or variants; otherwise their "anchor" elements cannot be
evaluated. However, it is possible to invoke rules not containing an
"anchor" element from a "when" or "not-when" attribute. (See
Section 5.4.3.)

5.4.2. The look-behind and look-ahead Elements

Context rules use the "look-behind" and "look-ahead" elements to
define context before and after the code point sequence matched by
the "anchor" element. If the "anchor" element is omitted, neither
the "look-behind" nor the "look-ahead" element may be present.

Here is an example of a rule that defines an "initial" context for an
Arabic code point:

    [XML example missing from this copy]

A When Rule contains any combination of "look-behind", "anchor" and
"look-ahead" elements in that order.
Each of these elements occurs 1533 at most once, except if nested inside a "choice" element in such a 1534 way that in matching each alternative at most one occurrence of each 1535 is encountered. Otherwise, the result is undefined. None of these 1536 elements takes a "count" attribute, nor does any enclosing match 1537 operator. Otherwise, the result is undefined. If a context rule 1538 contains a "look-ahead" or "look-behind" element, it MUST contain an 1539 "anchor" element. If, because of a choice element, a required anchor 1540 is not actually encountered, the results are undefined. 1542 5.4.3. Omitting the anchor Element 1544 If the "anchor" element is omitted, the evaluation of the context 1545 rule is not tied to the position of the code point or sequence 1546 associated with the "when" attribute. 1548 According to [RFC5892] Katakana middle dot is invalid in any label 1549 not containing at least one Japanese character anywhere in the label. 1550 Because this requirement is independent of the position of the middle 1551 dot, the rule does not require an "anchor" element. 1553 1554 1555 1556 1557 1558 1559 1560 1562 The Katakana middle dot is used only with Han, Katakana or Hiragana. 1563 The corresponding When Rule requires that at least one code point in 1564 the label is in one of these scripts, but the position of that code 1565 point is independent of the location of the middle dot and no anchor 1566 therefore required. (Note that the Katakana middle dot itself is of 1567 script Common). 1569 6. The action Element 1571 The purpose of a rule is to trigger a specific action. Often, the 1572 action simply results in blocking or invalidating a label that does 1573 not match a rule. An example of an action invalidating a label 1574 because it does not match a rule named "leading-letter" is as 1575 follows: 1577 1579 If an action is to be triggered on matching a rule, a "match" 1580 attribute is used instead. 
Actions are evaluated in the order that 1581 they appear in the XML file. Once an action is triggered by a label, 1582 the disposition defined in the "disp" attribute is assigned to the 1583 label and no other actions are evaluated for that label. 1585 The goal of the Label Generation Rules is to identify all labels and 1586 variant labels and to assign them disposition values. These 1587 dispositions are then fed into a further process that ultimately 1588 implements all aspects of policy. To allow this specification to be 1589 used with the widest range of policies, the permissible values for 1590 the "disp" attribute are neither defined nor restricted. 1591 Nevertheless, a set of commonly used disposition values is 1592 RECOMMENDED. (See Section 6.3) 1594 6.1. The match and not-match Attributes 1596 A "match" or "not-match" attribute specifies a rule that must be 1597 matched or not matched as a condition for triggering an action. Only 1598 a single rule may be named as the value of a "match" or "not-match" 1599 attribute. Because rules may be composed of other rules, this 1600 restriction to a single attribute value does not impose any 1601 limitation on the contexts that can trigger an action. 1603 An action may contain a "match" or a "not-match" attribute, but not 1604 both. An action without any attributes is triggered by all labels 1605 unconditionally. For a very simple LGR, the following action would 1606 allocate all labels that match the repertoire: 1608 1610 Since rules are evaluated for all labels, whether they are the 1611 original label or computed by permuting the defined and valid variant 1612 mappings for the label's code points, actions based on matching or 1613 not matching a rule may be triggered for both original and variant 1614 labels, but the rules are not affected by the disposition 1615 attributes of the variant mappings. 
Triggering any actions based on 1616 these dispositions requires the use of additional optional attributes 1617 for actions, described next. 1619 6.2. Actions with Variant Type Triggers 1621 6.2.1. The all-, any- and only-variants Attributes 1623 An action may contain one of the optional attributes "any-variant", 1624 "all-variants", or "only-variants" defining triggers based on variant 1625 types. The permitted value for these attributes consists of one or 1626 more variant type values, separated by spaces. When a variant label 1627 is generated, these variant type values are compared to the set of 1628 type values on the variant mappings used to generate the particular 1629 variant label (see Section 7). 1631 Any single match may trigger an action that contains an "any-variant" 1632 attribute, while for an "all-variants" or "only-variants" attribute, 1633 the variant type for all variant code points must match one or 1634 several of the type values specified in the attribute to trigger the 1635 action. There is no requirement that the entire list of variant 1636 type values be matched, as long as all variant code points match at 1637 least one of the values. 1639 An "only-variants" attribute will trigger the action only if all code 1640 points of the variant label have variant mappings from the original 1641 code points. In other words, the label contains no original code 1642 points other than those with a reflexive mapping (see Section 4.2.4). 1644 1645 1646 1647 1648 1649 1650 1651 . . . 1652 1653 1654 1656 In the example above, the label "xx" would have variant labels "xx", 1657 "xy", "yx" and "yy". The first action would result in blocking any 1658 variant label containing "y", because the variant mapping from "x" to 1659 "y" is of type "block", triggering the "any-variant" condition. 
1660 Because in this example "x" has a reflexive variant mapping to itself 1661 of type "allocate", the original label "xx" has a reflexive variant 1662 "xx" that would trigger the "only-variants" condition on the second 1663 action. 1665 A label "yy" would have the variants "xy", "yx" and "xx". Because 1666 the variant mapping from "y" to "x" is of type "allocate" and a 1667 mapping from "y" to "y" is not defined, the labels "xy" and "yx" 1668 trigger the "any-variant" condition on the third action. The variant 1669 "xx", being generated using the mapping from "y" to "x" of type 1670 "allocate", would trigger the "only-variants" condition on the 1671 second action. As there is no reflexive variant "yy", the original 1672 label "yy" cannot trigger any variant type triggers. However, it 1673 could still trigger an action defined as matching or not matching a 1674 rule. 1676 In each action, one variant type trigger may be present by itself or 1677 in conjunction with an attribute matching or not-matching a rule. If 1678 variant triggers and rule-matching triggers are used together, the 1679 label MUST "match" or respectively "not-match" the specified rule, 1680 AND satisfy the conditions on the variant type values given by the 1681 "any-variant", "all-variants", or "only-variants" attribute. 1683 A useful convention combines the "any-variant" trigger with 1684 reflexive variant mappings (Section 4.2.4). This convention is used, 1685 for example, when multiple LGRs are defined within the same registry 1686 and for overlapping repertoires. In some cases, the delegation of a 1687 label from one LGR must prohibit the delegation of another label in 1688 some other LGR. This can be done using a variant of type "blocked" 1689 as in this example from an Armenian LGR, where the Armenian, Latin 1690 and Cyrillic letters all look identical: 1692 1693 1694 1696 1698 The issue is that the target code points for these two variants are 1699 both outside the Armenian repertoire. 
By using a reflexive variant 1700 with the following convention: 1702 1703 1705 1706 1707 1708 ... 1710 and associating this with an action of the form: 1712 1714 it is possible to list the symmetric and transitive variant mappings 1715 in the LGR even where they involve out-of-repertoire code points. By 1716 associating the action shown with the special type for these 1717 reflexive mappings, any original labels containing one or more of the 1718 out-of-repertoire code points are filtered out -- just as if these 1719 code points had not been listed in the LGR in the first place. 1720 Nevertheless, they do participate in the permutation of variant 1721 labels for in-repertoire labels (Armenian in the example), and these 1722 permuted variants can be used to detect collisions with out-of- 1723 repertoire labels (see Section 7). 1725 6.2.2. Example from RFC 3743 Tables 1727 This section gives an example of using variant type triggers, 1728 combined with variants with reflexive mappings (Section 4.2.4), to 1729 achieve LGRs that implement tables like those defined according to 1730 [RFC3743], where the goal is to allow as variants only labels that 1731 consist entirely of simplified or traditional variants, in addition 1732 to the original label. 1734 Assume an LGR where all variants have been given suitable "type" 1735 attributes of "block", "simplified", "traditional", or "both", 1736 similar to the ones discussed in Appendix B. Given such an LGR, the 1737 following example actions evaluate the disposition for the variant 1738 label: 1740 1741 1742 1743 1744 1746 The first action matches any variant label for which at least one of 1747 the code point variants is of type "block". The second matches any 1748 variant label for which all of the code point variants are of type 1749 "simplified" or "both", in other words an all-simplified label. The 1750 third matches any label for which all variants are of type 1751 "traditional" or "both", that is, all traditional. 
These two actions 1752 are not triggered by any variant labels containing some original code 1753 points, unless each of those code points has a variant defined with a 1754 reflexive mapping (Section 4.2.4). 1756 The final two actions rely on the fact that actions are evaluated in 1757 sequence, and that the first action triggered also defines the final 1758 disposition for a variant label (see Section 6.4). They further rely 1759 on the assumption that the only variants with type "both" are also 1760 reflexive variants. 1762 Given these assumptions, any remaining simplified or traditional 1763 variants must then be part of a mixed label, and so are blocked; all 1764 labels surviving to the last action consist of original code points only 1765 (that is, the original label). The example assumes that an original 1766 label may be a mixed label; if that is not the case, the disposition 1767 for the last action would be set to "block". 1769 There are exceptions where the assumption on reflexive mappings made 1770 above does not hold, so this basic scheme needs some refinements to 1771 cover all cases. For a more complete example, see Appendix B. 1773 6.3. Recommended Disposition Values 1775 The precise nature of the policy action taken in response to a 1776 disposition and the names of the corresponding "disp" attribute values are 1777 only partially defined here. It is strongly RECOMMENDED to use the 1778 following dispositions only with their conventional sense. 1780 invalid The resulting string is not a valid label. This disposition 1781 may be assigned implicitly (see Section 6.5). No variant labels 1782 should be generated from a variant mapping with this type. 1784 block The resulting string is a valid label, but should be blocked 1785 from registration. This would typically apply for a derived 1786 variant that is undesirable as having no practical use or 1787 being confusingly similar to some other label. 
1789 allocate The resulting string should be reserved for use by the same 1790 operator of the origin string, but not automatically allocated 1791 for use. 1793 activate The resulting string should be activated for use. (This is 1794 the typical default action if no dispositions are defined and is 1795 known as a "preferred" variant in [RFC3743]) 1797 6.4. Precedence 1799 Actions are applied in the order of their appearance in the file. 1800 This defines their relative precedence. The first action triggered 1801 by a label defines the disposition for that label. To define a 1802 specific order of precedence, list the actions in the desired order. 1803 The conventional order of precedence for the actions defined in 1804 Section 6.3 is "invalid", "block", "allocate", then "activate". This 1805 default precedence is used for the default actions defined in 1806 Section 6.6. 1808 6.5. Implied Actions 1810 The context rules on code points ("not-when" or "when" rules) carry 1811 an implied action with a disposition of "invalid" (not eligible). 1812 These rules are evaluated at the time the code points for a label or 1813 its variant labels are checked for validity (see Section 7). In 1814 other words, they are evaluated before any of the whole-label 1815 evaluation rules and with higher precedence. The context rules for 1816 variant mappings are evaluated when variants are generated and/or 1817 when variant tables are made symmetric and transitive. They have an 1818 implied action with a disposition of "invalid" which means a putative 1819 variant mapping does not exist whenever the given context matches a 1820 "not-when" rule or fails to match a "when" rule specified for that 1821 mapping. The result of that disposition is that the variant mapping 1822 is ignored in generating variant labels and the value is therefore 1823 not accessible to trigger any explicit actions. 
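The implied action on variant mappings described above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the format: the names VariantMapping, mapping_exists, usable_positions and is_final are hypothetical, and a context rule is modeled as a plain predicate over the label and a position instead of a parsed "rule" element.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Assumption: a context rule is modeled as a predicate over (label, position).
ContextRule = Callable[[Sequence[int], int], bool]


@dataclass(frozen=True)
class VariantMapping:
    """Hypothetical in-memory form of a "var" element."""
    source: int
    target: int
    vtype: str
    when: Optional[ContextRule] = None
    not_when: Optional[ContextRule] = None


def mapping_exists(mapping: VariantMapping, label: Sequence[int], pos: int) -> bool:
    """A mapping does not exist if its "when" rule fails to match
    or its "not-when" rule matches at the given position."""
    if mapping.when is not None and not mapping.when(label, pos):
        return False
    if mapping.not_when is not None and mapping.not_when(label, pos):
        return False
    return True


def usable_positions(mapping: VariantMapping, label: Sequence[int]) -> List[int]:
    """Positions of the source code point where the mapping may be applied."""
    return [
        pos
        for pos, cp in enumerate(label)
        if cp == mapping.source and mapping_exists(mapping, label, pos)
    ]


def is_final(label: Sequence[int], pos: int) -> bool:
    """Illustrative context: the code point is the last one in the label."""
    return pos == len(label) - 1


# Illustrative mappings from "x" (U+0078) to "y" (U+0079).
FINAL_ONLY = VariantMapping(0x78, 0x79, "block", when=is_final)
NOT_FINAL = VariantMapping(0x78, 0x79, "block", not_when=is_final)
UNCONDITIONAL = VariantMapping(0x78, 0x79, "block")
```

For the label "xx", the first mapping in this sketch would be usable only at the last position and the second only at the first position; at the other positions the mapping is treated as absent, so its type value is never recorded.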
1825 Note that such a non-existing variant mapping is different from a 1826 blocked variant, which is a variant code point mapping that exists 1827 but results in a label that may not be allocated. 1829 6.6. Default Actions 1831 As described in Section 6, any variant mapping may be given a "type" 1832 attribute. An action containing an "any-variant" or "all-variants" 1833 attribute relates these type values to a resulting disposition for 1834 the entire variant label. 1836 If no actions are defined for the standard disposition values of 1837 "invalid", "block", "allocate" and "activate", then the following 1838 default actions exist, shown below in their default order of 1839 precedence (see Section 6.4). This default order for evaluating 1840 dispositions applies only to labels that triggered no explicitly 1841 defined actions and which are therefore handled by default actions. 1842 Default actions have a lower order of precedence than explicit 1843 actions (see Section 7.3). 1845 The default actions for variant labels are defined as follows: 1847 1848 1849 1850 1852 A final default action sets the disposition to "allocate" for any 1853 label matching the repertoire for which no other action has been 1854 triggered (catch-all). 1856 1858 7. Processing a Label Against an LGR 1860 7.1. Determining Eligibility for a Label 1862 In order to test a specific label for membership in the LGR, a 1863 consumer of the LGR must iterate through each code point within a 1864 given label, and test that each code point is a member of the LGR. 1865 If any code point is not a member of the LGR, the label shall be 1866 deemed invalid. 1868 An individual code point is deemed a member of the LGR when it is 1869 listed using a "char" element, or is part of a range defined with a 1870 "range" element, and all necessary conditions in any "when" or "not- 1871 when" attributes are satisfied. 
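The iteration over code points can be sketched in Python. The sketch makes several assumptions: the repertoire is held as a set of single code points plus a list of inclusive ranges, the function names are hypothetical, and "when"/"not-when" conditions and code point sequences are not modeled at all. The sample repertoire is the lower case LDH repertoire mentioned in Appendix A.

```python
from typing import List, Optional, Sequence, Set, Tuple

# Assumption: "char" elements become a set, "range" elements a list of
# inclusive (first, last) pairs. Context rules are deliberately left out.
LDH_CHARS: Set[int] = {0x2D}                              # hyphen
LDH_RANGES: List[Tuple[int, int]] = [(0x30, 0x39),        # digits 0-9
                                     (0x61, 0x7A)]        # letters a-z


def is_member(cp: int, chars: Set[int], ranges: Sequence[Tuple[int, int]]) -> bool:
    """True if the code point is listed directly or falls within a range."""
    return cp in chars or any(first <= cp <= last for first, last in ranges)


def first_ineligible(label: str, chars: Set[int],
                     ranges: Sequence[Tuple[int, int]]) -> Optional[int]:
    """Return the index of the first code point outside the repertoire,
    or None if every code point is a member."""
    for index, ch in enumerate(label):
        if not is_member(ord(ch), chars, ranges):
            return index
    return None


def is_eligible(label: str, chars: Set[int],
                ranges: Sequence[Tuple[int, int]]) -> bool:
    """A label with any non-member code point is deemed invalid."""
    return first_ineligible(label, chars, ranges) is None
```

Returning the offending position rather than a bare boolean is a design choice of the sketch; it makes it easier to report which code point caused a label to be rejected.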
1873 Alternatively, a code point is also deemed a member of the LGR when 1874 it forms part of a sequence that corresponds to a sequence listed 1875 using a "char" element for which the "cp" attribute defines a 1876 sequence, and all necessary conditions in any "when" or "not-when" 1877 attributes are satisfied. 1879 A label must also not trigger any action that results in a 1880 disposition of "invalid"; otherwise it is deemed not eligible. (This 1881 step may need to be deferred until variant code point dispositions 1882 have been determined.) 1884 For LGRs that contain reflexive variant mappings (defined in 1885 Section 4.2.4), the final evaluation of eligibility for the label 1886 must be deferred until variants are generated. In essence, LGRs that 1887 use this feature treat the original label as the (identity) variant 1888 of itself. For such LGRs, the ordinary iteration over code points 1889 would generally only exclude a subset of invalid labels, but it could 1890 be used effectively as a pre-screening. 1892 7.2. Determining Variants for a Label 1894 For a given eligible label, the set of variant labels is deemed to 1895 consist of each possible permutation of original code points and 1896 substituted code points or sequences defined in "var" elements, 1897 whereby all "when" and "not-when" attributes are satisfied 1898 for each "char" or "var" element in the given permutation and all 1899 applicable whole label evaluation rules are satisfied as follows: 1901 1. Create each possible permutation of a label, by substituting each 1902 code point or code point sequence in turn by any defined variant 1903 mapping (including any reflexive mappings) 1905 2. Apply variant mappings with "when" or "not-when" attributes only 1906 if the conditions are satisfied; otherwise they are not defined 1908 3. 
Record each of the "type" values on the variant mappings used in 1909 creating a given variant label in a disposition set; for any 1910 unmapped code point record the "type" value of any reflexive 1911 variant (see Section 4.2.4) 1913 4. Determine the disposition for each variant label per Section 7.3 1915 5. If the disposition is "invalid", remove the label from the set 1917 6. If final evaluation of the disposition for the unpermuted label 1918 per Section 7.3 results in a disposition of "invalid", remove all 1919 associated variant labels from the set. 1921 7.3. Determining a Disposition for a Label or Variant Label 1923 For a given label (variant or original), its disposition is 1924 determined by evaluating in order of their appearance all actions for 1925 which the label or variant label satisfies the conditions. 1927 1. For any label, the disposition is given by the value of the 1928 "disp" attribute for the first action triggered by the label. An 1929 action is triggered, if any of the following is true: 1931 * the label matches or doesn't match the whole label evaluation 1932 rule, given in the "match" or "not-match" attribute 1933 respectively for that action; 1935 * any or all of the recorded variant types for a variant label 1936 match the types specified in an "any-variant", "all-variants", 1937 or "only-variants" attribute, for that action, and in case of 1938 "only-variants", the label contains only code points that are 1939 the target of applied variant mappings; 1941 * the label matches or doesn't match the whole label evaluation 1942 rule, given in the "match" or "not-match" attribute 1943 respectively for that action and any or all of the recorded 1944 variant types for a variant label match the types specified in 1945 an "any-variant", "all-variants", or "only-variants" 1946 attribute, respectively, for that action, and in case of 1947 "only-variants" the label contains only code points that are 1948 the target of applied variant mappings; or 
1950 * the action does not contain any "match", "not-match", "any- 1951 variant", "all-variants", or "only-variants" attributes: 1952 catch-all. 1954 2. For any remaining variant label, assign the variant label the 1955 disposition using the default actions defined in Section 6.6. 1956 For this step, variant types outside the predefined recommended 1957 set (see Section 6.3) are ignored. 1959 3. For any remaining label, set the disposition to "allocate". 1961 8. Conversion to and from Other Formats 1963 Both [RFC3743] and [RFC4290] provide different grammars for IDN 1964 tables. These formats are unable to fully cater for the increased 1965 requirements of contemporary IDN variant policies. 1967 This specification is a superset of functionality provided by these 1968 IDN table formats, thus any table expressed in those formats can be 1969 expressed in this format. Automated conversion can be conducted 1970 between tables conformant with the grammar specified in each 1971 document. 1973 For notes on how to translate an RFC 3743-style table, see 1974 Appendix B. 1976 9. IANA Considerations 1978 9.1. Media Type 1980 IANA is asked to register the media type of "application/lgr+xml" to 1981 enable transmission of a well-formed LGR in accordance with this 1982 specification. This media type SHOULD be used to signal to an LGR- 1983 aware client that the content is designed to be interpreted as an 1984 LGR. 1986 [TODO: Add Media Type registration details per [RFC7303]] 1988 9.2. URN Registration 1990 This document uses a URN to describe the XML namespace in accordance 1991 with [RFC3688]. IANA is asked to register the following URN for this 1992 purpose. 1994 URI: urn:ietf:params:xml:ns:lgr-1.0 1996 Registrant Contact: See the Authors of this document. 1998 XML: None. 2000 10. 
Security Considerations 2002 If a system that is querying an identifier list (such as a domain 2003 zone) that uses the rules in this memo, and those rules are not 2004 implemented correctly, and that system is relying on the rules being 2005 applied, the system might fail if the rules are not applied in a 2006 predictable fashion. This could cause security problems for the 2007 querying system. 2009 11. References 2011 [ASIA-TABLE] 2012 DotAsia Organisation, ".ASIA ZH IDN Language Table". 2014 [LGR-PROCEDURE] 2015 Internet Corporation for Assigned Names and Numbers, 2016 "Procedure to Develop and Maintain the Label Generation 2017 Rules for the Root Zone in Respect of IDNA Labels". 2019 [RFC3339] Klyne, G., Ed. and C. Newman, "Date and Time on the 2020 Internet: Timestamps", RFC 3339, July 2002. 2022 [RFC3688] Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688, 2023 January 2004. 2025 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint 2026 Engineering Team (JET) Guidelines for Internationalized 2027 Domain Names (IDN) Registration and Administration for 2028 Chinese, Japanese, and Korean", RFC 3743, April 2004. 2030 [RFC4290] Klensin, J., "Suggested Practices for Registration of 2031 Internationalized Domain Names (IDN)", RFC 4290, December 2032 2005. 2034 [RFC5564] El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman, 2035 "Linguistic Guidelines for the Use of the Arabic Language 2036 in Internet Domains", RFC 5564, February 2010. 2038 [RFC5646] Phillips, A. and M. Davis, "Tags for Identifying 2039 Languages", BCP 47, RFC 5646, September 2009. 2041 [RFC5892] Faltstrom, P., "The Unicode Code Points and 2042 Internationalized Domain Names for Applications (IDNA)", 2043 RFC 5892, August 2010. 2045 [RFC7303] Thompson, H. and C. Lilley, "XML Media Types", RFC 7303, 2046 July 2014. 2048 [TDIL-HINDI] 2049 Technology Development for Indian Languages (TDIL) 2050 Programme, "Devanagari Script Behaviour for Hindi". 
2052 [UAX42] Unicode Consortium, "Unicode Character Database in XML". 2054 [Unicode-Stability] 2055 Unicode Consortium, "Unicode Encoding Stability Policy, 2056 Property Value Stability". 2058 [Unicode-Versions] 2059 Unicode Consortium, "Unicode Version Numbering". 2061 [WLE-RULES] 2062 Internet Corporation for Assigned Names and Numbers, "WLE 2063 Rules". 2065 [XML] World Wide Web Consortium, "Extensible Markup Language 2066 (XML) 1.0". 2068 Appendix A. Example Tables 2070 The following presents a minimal LGR table defining the lower case 2071 LDH (letter-digit-hyphen) repertoire and containing no rules or 2072 metadata elements. Many simple LGR tables will look quite similar, 2073 except that they would contain some metadata. 2075 2076 2077 2078 2079 2081 2083 2084 2086 The following sample LGR shows a more complete collection of the 2087 elements and attributes defined in this specification in a somewhat 2088 typical context. 2090 2092 2097 2098 2099 2100 1 2101 2010-01-01 2102 sv 2103 example 2104 2010-01-01 2105 2013-12-31 2106 2107 Swedish 2110 examples institute. 2111 ]]> 2112 2113 6.3.0 2114 2115 The 2116 Unicode Standard 6.2 2117 RFC 5892 2118 Big-5: Computer Chinese Glyph 2119 and Character Code Mapping Table, Technical Report 2120 C-26, 1984 2121 2122 2123 2124 2125 2126 2128 2129 2130 2132 2133 2135 2136 2138 2139 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2177 2179 2182 2183 0061-007A 2184 2185 0061 0065 0069 006F 0075 2186 2187 2189 2190 2191 2192 2194 2195 2196 2198 2199 2201 2202 2203 2204 2206 2208 2210 2211 2213 2214 2216 Appendix B. How to Translate RFC 3743 based Tables into the XML Format 2218 As a background, the [RFC3743] rules work as follows: 2220 1. The Original (requested) label is checked to make sure that all 2221 the code points are a subset of the repertoire. 2223 2. 
If it passes the check, the Original label is allocatable. 2225 3. Generate the all-simplified and all-traditional variant labels 2226 (union of all the labels generated using all the simplified 2227 variants of the code points) for allocation. 2229 To illustrate by example, here is one of the more complicated sets of 2230 variants: 2232 U+4E7E 2233 U+4E81 2234 U+5E72 2235 U+5E79 2236 U+69A6 2237 U+6F27 2239 The following shows the relevant section of the Chinese language 2240 table published by the .ASIA registry [ASIA-TABLE]. Its entries 2241 read: 2243 ;;; 2245 These are the lines corresponding to the set of variants listed above: 2247 U+4E7E;U+4E7E,U+5E72;U+4E7E;U+4E81,U+5E72,U+6F27,U+5E79,U+69A6 2248 U+4E81;U+5E72;U+4E7E;U+5E72,U+6F27,U+5E79,U+69A6 2249 U+5E72;U+5E72;U+5E72,U+4E7E,U+5E79;U+4E7E,U+4E81,U+69A6,U+6F27 2250 U+5E79;U+5E72;U+5E79;U+69A6,U+4E7E,U+4E81,U+6F27 2251 U+69A6;U+5E72;U+69A6;U+5E79,U+4E7E,U+4E81,U+6F27 2252 U+6F27;U+4E7E;U+6F27;U+4E81,U+5E72,U+5E79,U+69A6 2254 The corresponding data section in XML format would look like this: 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2306 Here the simplified variants have been given a type of "simp", the 2307 traditional variants one of "trad" and all other ones are given 2308 "block". 2310 Because some variant mappings appear in more than one column, while the 2311 XML format allows only a single type value, they have been given the 2312 type of "both". 2314 Note that some variant mappings map to themselves (identity); that is, 2315 the mapping is reflexive (see Section 4.2.4). In creating the 2316 permutation of all variant labels, these mappings have no effect, 2317 other than adding a value to the variant type list for the variant 2318 label containing them. 
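The assignment of type values described above can be sketched in Python. The sketch rests on assumptions: the column order (code point; simplified variants; traditional variants; other variants) is read off the sample rows and the surrounding text, the function names are hypothetical, and a target that appears in the simplified or traditional column as well as in the last column is assumed to keep its "simp" or "trad" type.

```python
from typing import Dict, List, Tuple


def parse_code_points(field: str) -> List[int]:
    """Parse a comma-separated field such as "U+4E7E,U+5E72"."""
    return [int(item.strip()[2:], 16) for item in field.split(",") if item.strip()]


def typed_variants(row: str) -> Tuple[int, Dict[int, str]]:
    """Turn one table row into (source code point, {target: type}).

    Assumed columns: code point; simplified; traditional; other.
    """
    source_field, simp_field, trad_field, other_field = row.split(";")
    source = parse_code_points(source_field)[0]
    simp = set(parse_code_points(simp_field))
    trad = set(parse_code_points(trad_field))
    other = set(parse_code_points(other_field))

    types: Dict[int, str] = {}
    for target in sorted(simp | trad | other):
        if target in simp and target in trad:
            types[target] = "both"      # listed in both preferred columns
        elif target in simp:
            types[target] = "simp"
        elif target in trad:
            types[target] = "trad"
        else:
            types[target] = "block"     # listed only as an "other" variant
    return source, types


ROW_4E7E = "U+4E7E;U+4E7E,U+5E72;U+4E7E;U+4E81,U+5E72,U+6F27,U+5E79,U+69A6"
ROW_5E72 = "U+5E72;U+5E72;U+5E72,U+4E7E,U+5E79;U+4E7E,U+4E81,U+69A6,U+6F27"
```

Applied to the row for U+4E7E, the sketch yields a reflexive mapping of type "both", a mapping to U+5E72 of type "simp", and "block" for the remaining four targets.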
2320 In the example so far, all of the entries with type="both" are also 2321 mappings where source and target are identical. That is, they are 2322 reflexive mappings as defined in Section 4.2.4. 2324 Given a label "U+4E7E U+4E81", the following labels would be ruled 2325 allocatable under [RFC3743] based on how that standard is commonly 2326 implemented in domain registries: 2328 Original label: U+4E7E U+4E81 2329 Simplified label 1: U+4E7E U+5E72 2330 Simplified label 2: U+5E72 U+5E72 2331 Traditional label: U+4E7E U+4E7E 2333 However, if allocatable labels were generated simply by a straight 2334 permutation of all variants with type other than type="block" and 2335 without regard to the simplified / traditional variants, we would end 2336 up with an extra allocatable label of "U+5E72 U+4E7E". This label is 2337 composed of both a Simplified Chinese code point and a Traditional 2338 Chinese code point and therefore shouldn't be allocatable. 2340 Resolving the dispositions more fully requires several actions to be 2341 defined as described in Section 6.2.2, which will override the default 2342 actions from Section 6.6. After blocking all labels that contain a 2343 variant with type "block", these actions will allocate labels based 2344 on the following variant types: "simp", "trad" and "both". Note that 2345 these variant types do not directly relate to dispositions for the 2346 variant label, but that the actions will resolve them to the standard 2347 dispositions on labels, to wit, "block" and "allocate". 2349 Resolving label dispositions requires five actions to be defined (in 2350 the rules section of this document); these actions apply in order, and 2351 the first one triggered defines the disposition for the label. The 2352 actions are: 2354 1. block all variant labels containing at least one blocked variant. 2356 2. allocate all labels that consist entirely of variants that are 2357 "simp" or "both" 2359 3. 
also allocate all labels that are entirely "trad" or "both" 2361 4. block all surviving labels containing any one of the variant types 2362 "simp" or "trad" or "both" because they are now known to be part 2363 of an undesirable mixed simplified/traditional label 2365 5. allocate any remaining label; the original label would be such a 2366 label. 2368 The rules declarations would be represented as: 2370 2371 2372 2373 2374 2375 2376 2377 2379 Up to now, variants with type "both" have occurred only in association 2380 with reflexive variant mappings. The "action" elements defined above 2381 rely on the assumption that this is always the case. However, 2382 consider the following set of variants: 2384 U+62E0;U+636E;U+636E;U+64DA 2385 U+636E;U+636E;U+64DA;U+62E0 2386 U+64DA;U+636E;U+64DA;U+62E0 2388 The corresponding XML would be: 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2405 Making such variant sets work requires a way to selectively trigger 2406 an action based on whether a variant type is associated with an 2407 identity or reflexive mapping, or is associated with an ordinary 2408 variant mapping. This can be done by adding a prefix "r-" to the 2409 "type" attribute on reflexive variant mappings. For example, the 2410 "trad" for code point U+64DA in the preceding figure would become 2411 "r-trad". 2413 With the type values prepared in this way, only a slight 2414 modification to the actions is needed to yield the correct set of 2415 allocatable labels: 2417 2418 2419 2420 2421 2423 The first three actions get triggered by the same labels as before. 2425 The fourth action blocks any label that combines an original code 2426 point with any mix of ordinary variant mappings; however, no labels 2427 that are a combination of only original code points (code points 2428 having either no variant mappings or a reflexive mapping) would be 2429 affected. These are the original labels and they are allocated in 2430 the last action. 
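As an illustration of how ordered actions resolve variant labels, the following Python sketch applies the first set of five actions listed earlier in this appendix (before the "r-" refinement) to the variants of the label "U+4E7E U+4E81". It is a sketch under assumptions, not a reference implementation: the type values are hard-coded from the table rows for U+4E7E and U+4E81, the function names are hypothetical, "consist entirely of variants" is read as "every code point was produced by a variant mapping, reflexive ones included", and the unmodified original label is left out because its treatment depends on how the trigger attributes are read.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Tuple

# Type values derived from the sample table rows for the two code points.
VARIANTS: Dict[int, Dict[int, str]] = {
    0x4E7E: {0x4E7E: "both", 0x5E72: "simp", 0x4E81: "block",
             0x5E79: "block", 0x69A6: "block", 0x6F27: "block"},
    0x4E81: {0x5E72: "simp", 0x4E7E: "trad", 0x5E79: "block",
             0x69A6: "block", 0x6F27: "block"},
}


def choices(cp: int) -> List[Tuple[int, Optional[str]]]:
    """All (target, type) options for one position of the label."""
    mappings = VARIANTS.get(cp, {})
    options: List[Tuple[int, Optional[str]]] = list(mappings.items())
    if cp not in mappings:
        # No reflexive mapping: the code point may stay as is, with no type.
        options.append((cp, None))
    return options


def disposition(types: Sequence[Optional[str]]) -> str:
    """Apply the five actions in order; the first one triggered wins."""
    if "block" in types:                                  # action 1
        return "block"
    if all(t in ("simp", "both") for t in types):         # action 2
        return "allocate"
    if all(t in ("trad", "both") for t in types):         # action 3
        return "allocate"
    if any(t in ("simp", "trad", "both") for t in types): # action 4
        return "block"
    return "allocate"                                     # action 5


def variant_dispositions(label: Sequence[int]) -> Dict[Tuple[int, ...], str]:
    """Disposition of every permuted label other than the original."""
    result: Dict[Tuple[int, ...], str] = {}
    for combo in product(*(choices(cp) for cp in label)):
        variant = tuple(target for target, _ in combo)
        if variant == tuple(label):
            continue  # the original label is not evaluated in this sketch
        result[variant] = disposition([vtype for _, vtype in combo])
    return result


RESULT = variant_dispositions([0x4E7E, 0x4E81])
ALLOCATED = sorted(v for v, d in RESULT.items() if d == "allocate")
```

Under these assumptions the allocated variants are "U+4E7E U+4E7E", "U+4E7E U+5E72" and "U+5E72 U+5E72", which agrees with the traditional and simplified labels listed earlier in this appendix, while the mixed label "U+5E72 U+4E7E" is blocked by the fourth action.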
2432 Using this scheme of assigning types to ordinary and reflexive 2433 variants, all RFC 3743-style tables can be converted to XML. By 2434 defining a set of actions as outlined above, the LGR will yield the 2435 correct set of allocatable variants: all variants consisting 2436 completely of variant code points preferred for simplified or 2437 traditional, respectively, will be allocated, as will be the original 2438 label. All other variant labels will be blocked. 2440 Appendix C. Indic Syllable Structure Example 2442 In LGRs for Indic scripts, it may be desirable to restrict valid 2443 labels to sequences of valid Indic syllables, or aksharas. This 2444 appendix gives a sample set of rules designed to enforce this 2445 restriction. 2447 An example of a BNF form for an akshara has been published in 2448 "Devanagari Script Behavior for Hindi" [TDIL-HINDI]. The rules for 2449 other languages and scripts used in India are expected to be generally 2450 similar. 2452 For Hindi, the BNF has the form: 2454 V[m]|{C[N]H}C[N](H|[v][m]) 2456 Where: 2458 V (upper case) is any independent vowel 2460 m is any vowel modifier (Devanagari Anusvara, Visarga, and 2461 Candrabindu) 2463 C is any consonant (with inherent vowel) 2465 N is Nukta 2467 H is a Halant (or Virama) 2469 v (lower case) is any dependent vowel sign (matra) 2471 {} encloses items which may be repeated one or more times 2473 [ ] encloses items which may or may not be present 2475 | separates items, out of which only one can be present 2477 By using the Unicode property "InSC" or "Indic_Syllable_Category", 2478 which corresponds rather directly to the classification of characters 2479 in the BNF above, we can directly translate the BNF into a set of WLE 2480 rules matching the definition of an akshara. 
2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2533 With the rules and classes as defined above, the final action assigns 2534 a disposition of "invalid" to all labels that are not composed of a 2535 sequence of well-formed aksharas, optionally interspersed with other 2536 characters, perhaps digits, for example. 2538 The relevant Unicode property could be replicated by tagging 2539 repertoire values directly in the LGR which would remove the 2540 dependency on any specific version of the Unicode Standard. 2542 Generally, dependent vowels may only follow consonant expressions; 2543 however, for some scripts, like Bengali, the Unicode Standard 2544 supports sequences of dependent vowels or their application on 2545 independent vowels. This makes the definition of akshara less 2546 restrictive. 2548 It is possible to reduce the complexity of these rules by defining 2549 alternate rules which simply define the permissible pair-wise context 2550 of adjacent code points by character class--such as the rule that a 2551 Halant can only follow a (nuktated) consonant. (See the example in 2552 [WLE-RULES]). 2554 Appendix D. RelaxNG Compact Schema 2556 default namespace = "urn:ietf:params:xml:ns:lgr-1.0" 2558 # 2559 # SIMPLE TYPES 2560 # 2562 # RFC 5646 language tag (e.g. "de", "und-Latn", etc.) 2563 language-tag = xsd:token 2565 # The scope to which the LGR applies. For the "domain" scope type it 2566 # should be a fully qualified domain name.
2567 scope-value = xsd:token { 2568 minLength = "1" 2569 } 2571 ## a single code point 2572 code-point = xsd:token { 2573 pattern = "[0-9A-F]{4,6}" 2574 } 2576 ## a space-separated sequence of code points 2577 code-point-sequence = xsd:token { 2578 pattern = "[0-9A-F]{4,6}( [0-9A-F]{4,6})+" 2579 } 2581 ## single code point, or a sequence of code points 2582 code-point-literal = code-point | code-point-sequence 2584 code-point-set-shorthand = xsd:token { 2585 pattern = "([0-9A-F]{4,6}|[0-9A-F]{4,6}-[0-9A-F]{4,6})" 2586 ~ "( ([0-9A-F]{4,6}|[0-9A-F]{4,6}-[0-9A-F]{4,6}))*" 2587 } 2589 ## dates are used in information fields in the meta section. 2590 date = xsd:token { 2591 pattern = "\d{4}-\d\d-\d\d" 2592 } 2594 ## reference to a rule name (used in "when" and "not-when" 2595 ## attributes, as well as the "by-ref" attribute of the "rule" 2596 ## element.) 2597 rule-ref = xsd:IDREF 2598 ## a space-separated list of tags. Tags should generally follow 2599 ## xsd:Name syntax. However, we are using the xsd:NMTOKENS here 2600 ## because there is no native XSD datatype for space-separated 2601 ## xsd:Name 2602 tags = xsd:NMTOKENS 2604 ## The value space of a "from-tag" attribute. Although it is closer 2605 ## to xsd:IDREF lexically and semantically, tags are not unique in 2606 ## the document. As such, we are unable to take advantage of 2607 ## facilities provided by a validator. xsd:NMTOKEN is used instead 2608 ## of the stricter xsd:Names here so as to be consistent with the 2609 ## above. 2610 tag-ref = xsd:NMTOKEN 2612 ## an identifier type (used by "name" attributes). 2613 identifier = xsd:ID 2615 ## used in the class "by-ref" attribute to reference another class 2616 ## of 2617 ## the same "name" attribute value. 2618 class-ref = xsd:IDREF 2620 ## count attribute pattern ("n", "n+" or "n:m") 2621 count-pattern = xsd:token { 2622 pattern = "\d+(\+|:\d+)?" 
2623 } 2625 # 2626 # STRUCTURES 2627 # 2629 ## Representation of a single code point, or a sequence of code 2630 ## points 2631 char = element char { 2632 attribute cp { code-point-literal }, 2633 attribute comment { text }?, 2634 attribute when { rule-ref }?, 2635 attribute not-when { rule-ref }?, 2636 attribute tag { tags }?, 2637 attribute ref { text }?, 2638 variant* 2639 } 2641 ## Representation of a range of code points 2642 range = element range { 2643 attribute first-cp { code-point }, 2644 attribute last-cp { code-point }, 2645 attribute comment { text }?, 2646 attribute tag { tags }?, 2647 attribute ref { text }? 2648 } 2650 ## Representation of a single code point (no sequences allowed, and 2651 ## no tag attribute allowed). This is used when defining the set of 2652 ## characters that constitute a class. 2653 char-simple = element char { 2654 attribute cp { code-point } 2655 } 2657 ## Representation of a range of code points, for use in defining the 2658 ## set of characters that constitute a class. 2659 range-simple = element range { 2660 attribute first-cp { code-point }, 2661 attribute last-cp { code-point } 2662 } 2664 ## Representation of a variant code point or sequence 2665 variant = element var { 2666 attribute cp { code-point-literal }, 2667 attribute type { text }?, 2668 attribute when { rule-ref }?, 2669 attribute not-when { rule-ref }?, 2670 attribute comment { text }?, 2672 attribute ref { text }? 2673 } 2675 # 2676 # Classes 2677 # 2679 ## a "class" element that references the name of another "class" 2680 ## (or set-operator like "union") defined elsewhere. 2681 ## If used as a matcher (appearing under a "rule" element), 2682 ## the "count" attribute may be present. 2683 class-invocation = element class { 2684 (attribute by-ref { class-ref } 2685 | attribute from-tag { tag-ref }), 2686 attribute count { count-pattern }?, 2687 attribute comment { text }?
2688 } 2690 ## defines a new class (set of code points) using Unicode property or 2691 ## code point literals 2692 class-declaration = element class { 2693 # "name" attribute MUST be present if this is a "top-level" class 2694 # declaration, i.e. appearing directly under the "rules" element. 2695 # Otherwise, it MUST be absent. 2696 attribute name { identifier }?, 2697 # If used as a matcher (appearing in a "rule" element), the 2698 # "count" attribute may be present. Otherwise, it MUST be absent. 2699 attribute count { count-pattern }?, 2700 attribute comment { text }?, 2701 attribute ref { text }?, 2702 ( 2703 # define the class by property (e.g. property="sc:Latn"), OR 2704 attribute property { text } 2705 # define the class by tagged code points, OR 2706 | attribute from-tag { tag-ref } 2707 # list of single code points and ranges, OR 2708 | (char-simple | range-simple)+ 2709 # text node to allow for shorthand notation e.g. 2710 # "0061 0062-0063" 2711 | code-point-set-shorthand 2712 ) 2713 } 2715 class-or-set-operator-nested = 2716 class-invocation | class-declaration | set-operator 2718 class-or-set-operator-declaration = 2719 # a "class" element or set operator (effectively defining a class) 2720 # directly in the "rules" element. 2721 class-declaration | set-operator 2723 # 2724 # Set operators 2725 # 2727 complement-operator = element complement { 2728 attribute name { identifier }?, 2729 attribute comment { text }?, 2730 attribute ref { text }?, 2731 # "count" attribute MUST only be used when this set-operator is 2732 # used as a matcher (i.e. nested in a "rule" element) 2733 attribute count { count-pattern }?, 2734 class-or-set-operator-nested 2735 } 2737 union-operator = element union { 2738 attribute name { identifier }?, 2739 attribute comment { text }?, 2740 attribute ref { text }?, 2741 # "count" attribute MUST only be used when this set-operator is 2742 # used as a matcher (i.e.
nested in a "rule" element) 2743 attribute count { count-pattern }?, 2744 class-or-set-operator-nested, 2745 # needs two or more child elements 2746 class-or-set-operator-nested+ 2747 } 2749 intersection-operator = element intersection { 2750 attribute name { identifier }?, 2751 attribute comment { text }?, 2752 attribute ref { text }?, 2753 # "count" attribute MUST only be used when this set-operator is 2754 # used as a matcher (i.e. nested in a "rule" element) 2755 attribute count { count-pattern }?, 2756 class-or-set-operator-nested, 2757 class-or-set-operator-nested 2758 } 2760 difference-operator = element difference { 2761 attribute name { identifier }?, 2762 attribute comment { text }?, 2763 attribute ref { text }?, 2764 # "count" attribute MUST only be used when this set-operator is 2765 # used as a matcher (i.e. nested in a "rule" element) 2766 attribute count { count-pattern }?, 2767 class-or-set-operator-nested, 2768 class-or-set-operator-nested 2769 } 2771 symmetric-difference-operator = element symmetric-difference { 2772 attribute name { identifier }?, 2773 attribute comment { text }?, 2774 attribute ref { text }?, 2775 # "count" attribute MUST only be used when this set-operator is 2776 # used as a matcher (i.e. nested in a "rule" element) 2777 attribute count { count-pattern }?, 2778 class-or-set-operator-nested, 2779 class-or-set-operator-nested 2780 } 2782 ## operators that transform class(es) into a new class. 2783 set-operator = complement-operator 2784 | union-operator 2785 | intersection-operator 2786 | difference-operator 2787 | symmetric-difference-operator 2789 # 2790 # Match operators (matchers) 2791 # 2793 any-matcher = element any { 2794 attribute count { count-pattern }?, 2795 attribute comment { text }?
2796 } 2798 choice-matcher = element choice { 2799 attribute count { count-pattern }?, 2800 attribute comment { text }?, 2801 # two or more match operators 2802 match-operator-choice, 2803 match-operator-choice+ 2804 } 2806 char-matcher = 2807 # for use as a matcher - like "char" but without a "tag" attribute 2808 element char { 2809 attribute cp { code-point-literal }, 2810 # If used as a matcher (appearing in a "rule" element), the 2811 # "count" attribute may be present. Otherwise, it MUST be 2812 # absent. 2813 attribute count { count-pattern }?, 2814 attribute comment { text }?, 2815 attribute ref { text }? 2816 } 2818 start-matcher = element start { 2819 attribute comment { text }? 2820 } 2822 end-matcher = element end { 2823 attribute comment { text }? 2824 } 2826 anchor-matcher = element anchor { 2827 attribute comment { text }? 2828 } 2830 look-ahead-matcher = element look-ahead { 2831 attribute comment { text }?, 2832 match-operators-non-pos 2833 } 2834 look-behind-matcher = element look-behind { 2835 attribute comment { text }?, 2836 match-operators-non-pos 2837 } 2839 ## non-positional match operator that can be used as a 2840 ## direct child element of the choice matcher. 2841 match-operator-choice = ( 2842 any-matcher | choice-matcher | start-matcher | end-matcher 2843 | char-matcher | class-or-set-operator-nested | rule-matcher 2844 ) 2846 ## non-positional match operators do not contain any anchor, 2847 ## look-behind or look-ahead elements. 2848 match-operators-non-pos = ( 2849 start-matcher?, 2850 (any-matcher | choice-matcher | char-matcher 2851 | class-or-set-operator-nested | rule-matcher)*, 2852 end-matcher? 2853 ) 2855 ## positional match operators have an anchor element, which may be 2856 ## preceded by a look-behind element, or followed by a look-ahead 2857 ## element, or both. 2858 match-operators-pos = 2859 look-behind-matcher?, anchor-matcher, look-ahead-matcher?
2861 match-operators = match-operators-non-pos | match-operators-pos 2863 # 2864 # Rules 2865 # 2867 # top-level rule must have "name" attribute 2868 rule-declaration-top = element rule { 2869 attribute name { identifier }, 2870 attribute comment { text }?, 2871 attribute ref { text }?, 2872 match-operators 2873 } 2875 ## rule element used as a matcher (either by-ref or contains other 2876 ## match operators itself) 2877 rule-matcher = 2878 element rule { 2879 attribute count { count-pattern }?, 2880 attribute comment { text }?, 2881 attribute ref { text }?, 2882 (attribute by-ref { rule-ref } | match-operators) 2883 } 2885 # 2886 # Actions 2887 # 2889 action-declaration = element action { 2890 attribute comment { text }?, 2891 attribute ref { text }?, 2892 attribute disp { text }, 2893 ( attribute match { text } | attribute not-match { text } )?, 2894 ( attribute any-variant { text } 2895 | attribute all-variants { text } 2896 | attribute only-variants { text } )? 2897 } 2899 # DOCUMENT STRUCTURE 2901 start = lgr 2902 lgr = element lgr { 2903 attribute id { text }?, 2904 meta-section?, 2905 data-section, 2906 rules-section? 2907 } 2909 ## Meta section - information recorded with a label 2910 ## generation ruleset that generally does not affect machine 2911 ## processing (except for unicode-version). However, if any 2912 ## "class-declaration" uses the "property" attribute, one or 2913 ## more unicode-version MUST be present. 2915 meta-section = element meta { 2916 element version { 2917 attribute comment { text }?, 2918 text 2919 }? 2920 & element date { 2921 xsd:token { 2922 pattern = "\d{4}-\d{2}-\d{2}" 2923 } 2924 }? 2925 & element language { language-tag }* 2926 & element scope { 2927 # type may be "domain" or an application-defined value 2928 attribute type { xsd:NCName }, 2929 scope-value 2930 }* 2931 & element validity-start { text }? 2932 & element validity-end { text }?
2933 & element unicode-version { 2934 xsd:token { 2935 pattern = "\d+\.\d+\.\d+" 2936 } 2937 }? 2938 & element description { 2939 attribute type { text }?, 2940 text 2941 }? 2942 & element references { 2943 element reference { 2944 attribute id { text }, 2945 attribute comment { text }?, 2946 text 2947 }* 2948 }? 2949 } 2951 data-section = element data { (char | range)+ } 2953 ## Note that action declarations are strictly order dependent. 2954 ## class-or-set-operator-declaration and rule-declaration-top 2955 ## are weakly order dependent, they must precede first use of the 2956 ## identifier via by-ref. 2957 rules-section = element rules { 2958 ( class-or-set-operator-declaration 2959 | rule-declaration-top 2960 | action-declaration)* 2961 } 2963 Appendix E. Acknowledgements 2965 This format builds upon the work on documenting IDN tables by many 2966 different registry operators. Notably, a comprehensive language 2967 table for Chinese, Japanese and Korean was developed by the "Joint 2968 Engineering Team" [RFC3743] that is the basis of many registry 2969 policies; and a set of guidelines for Arabic script registrations 2970 [RFC5564] was published by the Arabic-language community. 2972 Contributions that have shaped this document have been provided by 2973 Francisco Arias, Mark Davis, Paul Hoffman, Nicholas Ostler, Thomas 2974 Roessler, Steve Sheng, Michel Suignard, Andrew Sullivan, Wil Tan and 2975 John Yunker. 2977 Appendix F. Editorial Notes 2979 This appendix to be removed prior to final publication. 2981 F.1. Known Issues and Future Work 2983 o A method of specifying the origin URI for a table, and an 2984 expiration or refresh policy, as meta-data may be a useful way to 2985 declare how the table will be updated. 2987 o The "domain" element should be specified as absolute, so that the 2988 Root can be identified as needed for the Root Zone LGR. 
2990 o The recommended names for disposition ("block" and "allocate") 2991 deviate from the name in the Root Zone LGR Procedure ("blocked" 2992 and "allocatable"). The latter were chosen to highlight that the 2993 machine processing of the LGR table is just the first step; actual 2994 allocation requires additional actions, hence "allocatable". This 2995 should be resolved. 2997 F.2. Change History 2999 -00 Initial draft. 3001 -01 Add an XML Namespace, and fix other XML nits. Add support for 3002 sequences of code points. Improve on consistently using Unicode 3003 nomenclature. 3005 -02 Add support for validity periods. 3007 -03 Incorporate requirements from the Label Generation Ruleset 3008 Procedure for the DNS Root Zone. These requirements include a 3009 detailed grammar for specifying whole-label variants, and the 3010 ability to explicitly declare the actions associated with a 3011 specific variant. The document also consistently applies the 3012 term "Label Generation Ruleset", rather than "IDN table", to 3013 reflect the policy term now being used to describe these. 3015 -04 Support reference information per [RFC3743]. Update description 3016 in response to feedback. Extend the context rules to "char" 3017 elements and allow for inverse matching ("not-when"). Extend 3018 the description of label processing and implied actions, and 3019 allow for actions that reference disposition attributes on any 3020 or all variant mappings used in the generation of a variant 3021 label. 3023 -05 Change the name of the "disposition" attribute to "disp". Add 3024 comment attribute on version and reference elements. Allow 3025 empty "cp" attributes in char elements to support expressing 3026 symmetric mapping of null variants. Describe use of variants 3027 that map identically. Clarify how actions are triggered, in 3028 particular based on variant dispositions, as well as description 3029 of default actions. Revise description of processing a label 3030 and its variants.
Move example table at the head of appendices. 3031 Add "only-variants" attribute. Change "name" attribute to "by- 3032 ref" attribute for referencing named classes and rules. Change 3033 "not" to "complement". Remove "match" attribute on rules as 3034 redundant if "start" and "end" are supported. Rename "match" 3035 element to "anchor" as better fitting its function and removing 3036 confusion with both the "match" attribute on actions as well as 3037 the generic term Match Operator. Augmented the examples 3038 relevant to [RFC3743]. 3040 -06 Extend the discussion of reflexive variants and their use; 3041 includes update of the appendix on converting tables in the 3042 style of [RFC3743]. Improve description of tagging and clarify 3043 that it doesn't apply to sequences. Specify that root zone uses 3044 ".". Add an appendix with an Indic Syllable Structure example. 3045 Extend count attribute to allow maximal counts. 3047 -07 Change "byref" to "by-ref". Add list of recommended properties. 3048 Change "location" to "positional" for collective name of start/ 3049 end match operators. Use from-tag instead of by-ref for tag- 3050 based classes. Made optional or mutually exclusive nature of 3051 some attributes more explicit. Allowing "comment" attributes on 3052 all child elements of "rules" except "char" and "range" elements 3053 used as child elements of "class". Recast the design goals and 3054 requirements at the start of the document. Reword aspects of 3055 the document to make it clear the format's application is not 3056 limited only to domain names. 3058 -08 Change "domain" to scope with type="domain". Reword in several 3059 places for clarity. Flesh out note on security. Change "disp" 3060 to "type" for variants, to mark that these attributes do not 3061 necessarily correspond one-to-one to variant label dispositions. 3062 Add example of variant type triggers. Remove "long form" of 3063 class definition. 3065 -09 Grammatical updates, clarity improvements.
Altered some DNS- 3066 specific terminology. 3068 -10 Added convention for out-of-repertoire variants, additional 3069 examples of when rules in the context of symmetry, isolated 3070 minor copy editing. Use a URN as the XML namespace 3071 (provisional). Specify a media type for the file. 3073 Authors' Addresses 3075 Kim Davies 3076 Internet Corporation for Assigned Names and Numbers 3077 12025 Waterfront Drive 3078 Los Angeles, CA 90094 3079 US 3081 Phone: +1 310 301 5800 3082 Email: kim.davies@icann.org 3083 URI: http://www.icann.org/ 3085 Asmus Freytag 3086 ASMUS Inc. 3088 Email: asmus@unicode.org