PRECIS                                                    P. Saint-Andre
Internet-Draft                                                      &yet
Obsoletes: 3454 (if approved)                                M. Blanchet
Intended status: Standards Track                                Viagenie
Expires: April 26, 2015                                 October 23, 2014

      PRECIS Framework: Preparation, Enforcement, and Comparison of
           Internationalized Strings in Application Protocols
                     draft-ietf-precis-framework-19

Abstract

   Application protocols using Unicode characters in protocol strings
   need to properly handle such strings in order to enforce
   internationalization rules for strings placed in various protocol
   slots (such as addresses and identifiers) and to perform valid
   comparison operations (e.g., for purposes of authentication or
   authorization). This document defines a framework enabling
   application protocols to perform the preparation, enforcement, and
   comparison of internationalized strings ("PRECIS") in a way that
   depends on the properties of Unicode characters and thus is agile
   with respect to versions of Unicode. As a result, this framework
   provides a more sustainable approach to the handling of
   internationalized strings than the previous framework, known as
   Stringprep (RFC 3454). This document obsoletes RFC 3454.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 26, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Preparation, Enforcement, and Comparison
   4.  String Classes
     4.1.  Overview
     4.2.  IdentifierClass
       4.2.1.  Valid
       4.2.2.  Contextual Rule Required
       4.2.3.  Disallowed
       4.2.4.  Unassigned
       4.2.5.  Examples
     4.3.  FreeformClass
       4.3.1.  Valid
       4.3.2.  Contextual Rule Required
       4.3.3.  Disallowed
       4.3.4.  Unassigned
       4.3.5.  Examples
   5.  Profiles
     5.1.  Profiles Must Not Be Multiplied Beyond Necessity
     5.2.  Rules
       5.2.1.  Width Mapping Rule
       5.2.2.  Additional Mapping Rule
       5.2.3.  Case Mapping Rule
       5.2.4.  Normalization Rule
       5.2.5.  Exclusion Rule
       5.2.6.  Directionality Rule
     5.3.  Building Application-Layer Constructs
     5.4.  A Note about Spaces
   6.  Order of Operations
   7.  Code Point Properties
   8.  Category Definitions Used to Calculate Derived Property
     8.1.  LetterDigits (A)
     8.2.  Unstable (B)
     8.3.  IgnorableProperties (C)
     8.4.  IgnorableBlocks (D)
     8.5.  LDH (E)
     8.6.  Exceptions (F)
     8.7.  BackwardCompatible (G)
     8.8.  JoinControl (H)
     8.9.  OldHangulJamo (I)
     8.10. Unassigned (J)
     8.11. ASCII7 (K)
     8.12. Controls (L)
     8.13. PrecisIgnorableProperties (M)
     8.14. Spaces (N)
     8.15. Symbols (O)
     8.16. Punctuation (P)
     8.17. HasCompat (Q)
     8.18. OtherLetterDigits (R)
   9.  Guidelines for Designated Experts
   10. IANA Considerations
     10.1.  PRECIS Derived Property Value Registry
     10.2.  PRECIS Base Classes Registry
     10.3.  PRECIS Profiles Registry
   11. Security Considerations
     11.1.  General Issues
     11.2.  Use of the IdentifierClass
     11.3.  Use of the FreeformClass
     11.4.  Local Character Set Issues
     11.5.  Visually Similar Characters
     11.6.  Security of Passwords
   12. Interoperability Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
     13.3.  URIs
   Appendix A.  Acknowledgements
   Authors' Addresses

1. Introduction

   Application protocols using Unicode characters [Unicode7.0] in
   protocol strings need to properly handle such strings in order to
   enforce internationalization rules for strings placed in various
   protocol slots (such as addresses and identifiers) and to perform
   valid comparison operations (e.g., for purposes of authentication or
   authorization). This document defines a framework enabling
   application protocols to perform the preparation, enforcement, and
   comparison of internationalized strings ("PRECIS") in a way that
   depends on the properties of Unicode characters and thus is agile
   with respect to versions of Unicode.

   As described in the PRECIS problem statement [RFC6885], many IETF
   protocols have used the Stringprep framework [RFC3454] as the basis
   for preparing, enforcing, and comparing protocol strings that contain
   Unicode characters, especially characters outside the ASCII range
   [RFC20]. The Stringprep framework was developed during work on the
   original technology for internationalized domain names (IDNs), here
   called "IDNA2003" [RFC3490], and Nameprep [RFC3491] was the
   Stringprep profile for IDNs. At the time, Stringprep was designed as
   a general framework so that other application protocols could define
   their own Stringprep profiles. Indeed, a number of application
   protocols defined such profiles.

   After the publication of [RFC3454] in 2002, several significant
   issues arose with the use of Stringprep in the IDN case, as
   documented in the IAB's recommendations regarding IDNs [RFC4690]
   (most significantly, Stringprep was tied to Unicode version 3.2).
   Therefore, the newer IDNA specifications, here called "IDNA2008"
   ([RFC5890], [RFC5891], [RFC5892], [RFC5893], [RFC5894]), no longer
   use Stringprep and Nameprep. This migration away from Stringprep for
   IDNs prompted other "customers" of Stringprep to consider new
   approaches to the preparation, enforcement, and comparison of
   internationalized strings, as described in [RFC6885].

   This document defines a framework for a post-Stringprep approach to
   the preparation, enforcement, and comparison of internationalized
   strings in application protocols, based on several principles:

   1. Define a small set of string classes that specify the Unicode
      characters (i.e., specific "code points") appropriate for common
      application protocol constructs.

   2. Define each PRECIS string class in terms of Unicode code points
      and their properties so that an algorithm can be used to
      determine whether each code point or character category is (a)
      valid, (b) allowed in certain contexts, (c) disallowed, or (d)
      unassigned.

   3. Use an "inclusion model" such that a string class consists only
      of code points that are explicitly allowed, with the result that
      any code point not explicitly allowed is forbidden.

   4. Enable application protocols to define profiles of the PRECIS
      string classes if necessary, addressing matters such as width
      mapping, case mapping, Unicode normalization, directionality, and
      further excluded code points or character categories.

   It is expected that this framework will yield the following benefits:

   o  Application protocols will be agile with regard to Unicode
      versions.

   o  Implementers will be able to share code point tables and software
      code across application protocols, most likely by means of
      software libraries.

   o  End users will be able to acquire more accurate expectations about
      the characters that are acceptable in various contexts. Given
      this more uniform set of string classes, it is also expected that
      copy/paste operations between software implementing different
      application protocols will be more predictable and coherent.

   Whereas the string classes define the "baseline" code points for a
   range of applications, profiling enables application protocols to
   further restrict the allowable code points beyond those specified for
   the relevant string class (e.g., characters with special or reserved
   meaning, such as "@" and "/" when used as separators within
   identifiers) and to apply the string classes in ways that are
   appropriate for constructs such as usernames and passwords
   [I-D.ietf-precis-saslprepbis], nicknames [I-D.ietf-precis-nickname],
   the localparts of instant messaging addresses
   [I-D.ietf-xmpp-6122bis], and free-form strings
   [I-D.ietf-xmpp-6122bis]. Profiles are responsible for defining the
   handling of right-to-left characters as well as various mapping
   operations of the kind also discussed for IDNs in [RFC5895], such as
   case preservation or lowercasing, Unicode normalization, mapping of
   certain characters to other characters or to nothing, and mapping of
   full-width and half-width characters.

   When an application applies a profile of a PRECIS string class, it
   can achieve the following objectives:

   a. Determine if a given string conforms to the profile, thus
      enabling enforcement of the rules (e.g., to determine if a string
      is allowed for use in the relevant "slot" specified by an
      application protocol).

   b. Determine if any two given strings are equivalent, thus enabling
      comparison (e.g., to make an access decision for purposes of
      authentication or authorization as further described in
      [RFC6943]).

   The opportunity to define profiles naturally introduces the
   possibility of a proliferation of profiles, thus potentially
   diminishing the benefits of common code and violating user
   expectations. See Section 5 for a discussion of this important
   topic.

   Although this framework is similar to IDNA2008 and borrows some of
   the character categories defined in [RFC5892], it defines additional
   character categories to meet the needs of common application
   protocols.

   The character categories and calculation rules defined under
   Section 7 and Section 8 are normative and apply to all Unicode code
   points. The code point table that results from applying the
   character categories and calculation rules to the latest version of
   Unicode is provided in an IANA registry.

2. Terminology

   Many important terms used in this document are defined in [RFC5890],
   [RFC6365], [RFC6885], and [Unicode7.0]. The terms "left-to-right"
   (LTR) and "right-to-left" (RTL) are defined in Unicode Standard Annex
   #9 [UAX9].

   As of the date of writing, the version of Unicode published by the
   Unicode Consortium is 7.0 [Unicode7.0]; however, PRECIS is not tied
   to a specific version of Unicode. The latest version of Unicode is
   always available [UnicodeCurrent].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   [RFC2119].

3. Preparation, Enforcement, and Comparison

   This document distinguishes between three different actions that an
   entity can take with regard to a string:

   o  Enforcement entails applying all of the rules specified for a
      particular string class or profile thereof to an individual
      string, for the purpose of determining if the string can be used
      in a given protocol slot.

   o  Comparison entails applying all of the rules specified for a
      particular string class or profile thereof to two separate
      strings, for the purpose of determining if the two strings are
      equivalent.

   o  Preparation entails only ensuring that the characters in an
      individual string are allowed by the underlying PRECIS string
      class.

   In most cases, authoritative entities such as servers are responsible
   for enforcement and subsidiary entities such as clients are
   responsible only for preparation. The rationale for this distinction
   is that clients might not have the facilities (in terms of device
   memory and processing power) to enforce all the rules regarding
   internationalized strings (such as width mapping and Unicode
   normalization), although often they can limit the repertoire of
   characters they offer to an end user. By contrast, it is assumed
   that a server would have more capacity to enforce the rules, and in
   any case acts as an authority regarding allowable strings in protocol
   slots such as addresses and endpoint identifiers (since a client
   cannot necessarily be trusted to properly generate such strings,
   especially for security-sensitive contexts such as authentication and
   authorization).

4. String Classes

4.1. Overview

   Starting in 2010, various "customers" of Stringprep began to discuss
   the need to define a post-Stringprep approach to the preparation and
   comparison of internationalized strings other than IDNs. This
   community analyzed the existing Stringprep profiles and also weighed
   the costs and benefits of defining a relatively small set of Unicode
   characters that would minimize the potential for user confusion
   caused by visually similar characters (and thus be relatively "safe")
   vs. defining a much larger set of Unicode characters that would
   maximize the potential for user creativity (and thus be relatively
   "expressive"). As a result, the community concluded that most
   existing uses could be addressed by two string classes:

   IdentifierClass: a sequence of letters, numbers, and some symbols
      that is used to identify or address a network entity such as a
      user account, a venue (e.g., a chatroom), an information source
      (e.g., a data feed), or a collection of data (e.g., a file); the
      intent is that this class will minimize user confusion in a wide
      variety of application protocols, with the result that safety has
      been prioritized over expressiveness for this class.

   FreeformClass: a sequence of letters, numbers, symbols, spaces, and
      other characters that is used for free-form strings, including
      passwords as well as display elements such as human-friendly
      nicknames in chatrooms; the intent is that this class will allow
      nearly any Unicode character, with the result that expressiveness
      has been prioritized over safety for this class (e.g., protocol
      designers, application developers, service providers, and end
      users might not understand or be able to enter all of the
      characters that can be included in the FreeformClass - see
      Section 11.3 for details).

   Future specifications might define additional PRECIS string classes,
   such as a class that falls somewhere between the IdentifierClass and
   the FreeformClass. At this time, it is not clear how useful such a
   class would be. In any case, because application developers are able
   to define profiles of PRECIS string classes, a protocol needing a
   construct between the IdentifierClass and the FreeformClass could
   define a restricted profile of the FreeformClass if needed.

   The following subsections discuss the IdentifierClass and
   FreeformClass in more detail, with reference to the dimensions
   described in Section 3 of [RFC6885]. Each string class is defined by
   the following behavioral rules:

   Valid: Defines which code points and character categories are
      treated as valid input to the string.

   Contextual Rule Required: Defines which code points and character
      categories are treated as allowed only if the requirements of a
      contextual rule are met (i.e., either CONTEXTJ or CONTEXTO).

   Disallowed: Defines which code points and character categories need
      to be excluded from the string.

   Unassigned: Defines application behavior in the presence of code
      points that are unknown (i.e., not yet designated) for the version
      of Unicode used by the application.

   This document defines the valid, contextual rule required,
   disallowed, and unassigned rules for the IdentifierClass and
   FreeformClass. As described under Section 5, profiles of these
   string classes are responsible for defining the width mapping,
   additional mappings, case mapping, normalization, directionality, and
   exclusion rules.

4.2. IdentifierClass

   Most application technologies need strings that can be used to refer
   to, include, or communicate protocol strings like usernames, file
   names, data feed identifiers, and chatroom names. We group such
   strings into a class called "IdentifierClass" having the following
   features.

4.2.1. Valid

   o  Code points traditionally used as letters and numbers in writing
      systems, i.e., the LetterDigits ("A") category first defined in
      [RFC5892] and listed here under Section 8.1.

   o  Code points in the range U+0021 through U+007E, i.e., the
      (printable) ASCII7 ("K") rule defined under Section 8.11. These
      code points are "grandfathered" into PRECIS and thus are valid
      even if they would otherwise be disallowed according to the
      property-based rules specified in the next section.

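   As a rough illustration of the two Valid rules above, the following
   Python sketch tests a single code point against the ASCII7 range and
   the LetterDigits category. It is a sketch under stated assumptions,
   not a conforming implementation: the function names are hypothetical;
   the General_Category set used for LetterDigits follows the definition
   in [RFC5892]; the HasCompat test (NFKC changes the code point) is an
   assumed reading of the category defined under Section 8.17; and the
   Exceptions, JoinControl, OldHangulJamo, Controls, and
   PrecisIgnorableProperties categories are not modeled at all. Results
   also depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata

# General_Category values assumed for LetterDigits ("A"), per RFC 5892.
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_ascii7(ch: str) -> bool:
    """ASCII7 ("K"): printable ASCII, U+0021 through U+007E."""
    return 0x21 <= ord(ch) <= 0x7E


def has_compat(ch: str) -> bool:
    """HasCompat ("Q"), assumed here to mean that NFKC changes the code point."""
    return unicodedata.normalize("NFKC", ch) != ch


def identifier_valid_candidate(ch: str) -> bool:
    """Hypothetical helper approximating the IdentifierClass Valid rules."""
    if is_ascii7(ch):
        return True   # grandfathered, regardless of other properties
    if has_compat(ch):
        return False  # disallowed even if it is a letter or digit
    return unicodedata.category(ch) in LETTER_DIGITS


# Print the verdict for a few sample code points.
for ch in ("a", "@", " ", "\u00e9", "\uff21", "\u2603"):
    print(f"U+{ord(ch):04X}", identifier_valid_candidate(ch))
```

   In this sketch the ASCII7 test comes first so that printable ASCII
   symbols and punctuation remain valid, and the HasCompat test comes
   before the LetterDigits test so that, for example, FULLWIDTH LATIN
   CAPITAL LETTER A (U+FF21) is rejected even though it is an uppercase
   letter. The normative calculation is the one given under Section 7.
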
      Note: Although the PRECIS IdentifierClass re-uses the LetterDigits
      category from IDNA2008, the range of characters allowed in the
      IdentifierClass is wider than the range of characters allowed in
      IDNA2008. The main reason is that IDNA2008 applies the Unstable
      category before the LetterDigits category, thus disallowing
      uppercase characters, whereas the IdentifierClass does not apply
      the Unstable category.

4.2.2. Contextual Rule Required

   o  A number of characters from the Exceptions ("F") category defined
      under Section 8.6 (see Section 8.6 for a full list).

   o  Joining characters, i.e., the JoinControl ("H") category defined
      under Section 8.8.

4.2.3. Disallowed

   o  Old Hangul Jamo characters, i.e., the OldHangulJamo ("I") category
      defined under Section 8.9.

   o  Control characters, i.e., the Controls ("L") category defined
      under Section 8.12.

   o  Ignorable characters, i.e., the PrecisIgnorableProperties ("M")
      category defined under Section 8.13.

   o  Space characters, i.e., the Spaces ("N") category defined under
      Section 8.14.

   o  Symbol characters, i.e., the Symbols ("O") category defined under
      Section 8.15.

   o  Punctuation characters, i.e., the Punctuation ("P") category
      defined under Section 8.16.

   o  Any character that has a compatibility equivalent, i.e., the
      HasCompat ("Q") category defined under Section 8.17. These code
      points are disallowed even if they would otherwise be valid
      according to the property-based rules specified in the previous
      section.

   o  Letters and digits other than the "traditional" letters and digits
      allowed in IDNs, i.e., the OtherLetterDigits ("R") category
      defined under Section 8.18.

4.2.4. Unassigned

   Any code points that are not yet designated in the Unicode character
   set are considered Unassigned for purposes of the IdentifierClass,
   and such code points are to be treated as Disallowed.

4.2.5. Examples

   As described in the Introduction to this document, the string classes
   do not handle all issues related to string preparation and comparison
   (such as case mapping); instead, such issues are handled at the level
   of profiles. Examples for two profiles of the IdentifierClass can be
   found in [I-D.ietf-precis-saslprepbis] (the UsernameIdentifierClass
   profile) and in [I-D.ietf-xmpp-6122bis] (the JIDlocalIdentifierClass
   profile).

4.3. FreeformClass

   Some application technologies need strings that can be used in a
   free-form way, e.g., as a password in an authentication exchange (see
   [I-D.ietf-precis-saslprepbis]) or a nickname in a chatroom (see
   [I-D.ietf-precis-nickname]). We group such things into a class
   called "FreeformClass" having the following features.

      Security Warning: As mentioned, the FreeformClass prioritizes
      expressiveness over safety; Section 11.3 describes some of the
      security hazards involved with using or profiling the
      FreeformClass.

      Security Warning: Consult Section 11.6 for relevant security
      considerations when strings conforming to the FreeformClass, or a
      profile thereof, are used as passwords.

4.3.1. Valid

   o  Traditional letters and numbers, i.e., the LetterDigits ("A")
      category first defined in [RFC5892] and listed here under
      Section 8.1.

   o  Letters and digits other than the "traditional" letters and digits
      allowed in IDNs, i.e., the OtherLetterDigits ("R") category
      defined under Section 8.18.

   o  Code points in the range U+0021 through U+007E, i.e., the
      (printable) ASCII7 ("K") rule defined under Section 8.11.

   o  Any character that has a compatibility equivalent, i.e., the
      HasCompat ("Q") category defined under Section 8.17.

   o  Space characters, i.e., the Spaces ("N") category defined under
      Section 8.14.

   o  Symbol characters, i.e., the Symbols ("O") category defined under
      Section 8.15.

   o  Punctuation characters, i.e., the Punctuation ("P") category
      defined under Section 8.16.

4.3.2. Contextual Rule Required

   o  A number of characters from the Exceptions ("F") category defined
      under Section 8.6 (see Section 8.6 for a full list).

   o  Joining characters, i.e., the JoinControl ("H") category defined
      under Section 8.8.

4.3.3. Disallowed

   o  Old Hangul Jamo characters, i.e., the OldHangulJamo ("I") category
      defined under Section 8.9.

   o  Control characters, i.e., the Controls ("L") category defined
      under Section 8.12.

   o  Ignorable characters, i.e., the PrecisIgnorableProperties ("M")
      category defined under Section 8.13.

4.3.4. Unassigned

   Any code points that are not yet designated in the Unicode character
   set are considered Unassigned for purposes of the FreeformClass, and
   such code points are to be treated as Disallowed.

4.3.5. Examples

   As described in the Introduction to this document, the string classes
   do not handle all issues related to string preparation and comparison
   (such as case mapping); instead, such issues are handled at the level
   of profiles. Examples for two profiles of the FreeformClass can be
   found in [I-D.ietf-precis-nickname] (the NicknameFreeformClass
   profile) and in [I-D.ietf-xmpp-6122bis] (the
   JIDresourceIdentifierClass profile).

5. Profiles

   This framework document defines the valid, contextual-rule-required,
   disallowed, and unassigned rules for the IdentifierClass and the
   FreeformClass. A profile of a PRECIS string class MUST define the
   width mapping, additional mappings (if any), case mapping,
   normalization, directionality, and exclusion rules. A profile MAY
   also restrict the allowable characters above and beyond the
   definition of the relevant PRECIS string class (but MUST NOT add as
   valid any code points or character categories that are disallowed by
   the relevant PRECIS string class). These matters are discussed in
   the following subsections.

   Profiles of the PRECIS string classes are registered with the IANA as
   described under Section 10.3. Profile names use the following
   convention: they are of the form "ProfilenameBaseClass", where the
   "Profilename" string is a differentiator and "BaseClass" is the name
   of the PRECIS string class being profiled; for example, the profile
   of the IdentifierClass used for localparts of Jabber Identifiers
   (JIDs) in the Extensible Messaging and Presence Protocol (XMPP) is
   named "JIDlocalIdentifierClass" [I-D.ietf-xmpp-6122bis].

5.1. Profiles Must Not Be Multiplied Beyond Necessity

   The risk of profile proliferation is significant because having too
   many profiles will result in different behavior across various
   applications, thus violating what is known in user interface design
   as the Principle of Least Astonishment.

   Indeed, we already have too many profiles. Ideally we would have at
   most two or three profiles. Unfortunately, numerous application
   protocols exist with their own quirks regarding protocol strings.
   Domain names, email addresses, instant messaging addresses, chatroom
   nicknames, filenames, authentication identifiers, passwords, and
   other strings are already out there in the wild and need to be
   supported in existing application protocols such as DNS, SMTP, XMPP,
   IRC, NFS, iSCSI, EAP, and SASL among others.

   Nevertheless, profiles must not be multiplied beyond necessity.

   To help prevent profile proliferation, this document recommends
   sensible defaults for the various options offered to profile creators
   (such as width mapping and Unicode normalization). In addition, the
   guidelines for designated experts provided under Section 9 are meant
   to encourage a high level of due diligence regarding new profiles.

5.2. Rules

5.2.1. Width Mapping Rule

   The width mapping rule of a profile specifies whether width mapping
   is performed on fullwidth and halfwidth characters, and how the
   mapping is done. Typically such mapping consists of mapping
   fullwidth and halfwidth characters, i.e., code points with a
   Decomposition Type of Wide or Narrow, to their decomposition
   mappings; as an example, FULLWIDTH DIGIT ZERO (U+FF10) would be
   mapped to DIGIT ZERO (U+0030).

   The normalization form specified by a profile (see below) has an
   impact on the need for width mapping. Because width mapping is
   performed as a part of compatibility decomposition, a profile
   employing either normalization form KD (NFKD) or normalization form
   KC (NFKC) does not need to specify width mapping. However, if
   Unicode normalization form C (NFC) is used (as is recommended) then
   the profile needs to specify whether to apply width mapping; in this
   case, width mapping is in general RECOMMENDED because allowing
   fullwidth and halfwidth characters to remain unmapped to their
   compatibility variants would violate the Principle of Least
   Astonishment. For more information about the concept of width in
   East Asian scripts within Unicode, see Unicode Standard Annex #11
   [UAX11].

5.2.2. Additional Mapping Rule

   The additional mapping rule of a profile specifies whether additional
   mappings are to be applied, such as:

      Mapping of delimiter characters (such as '@', ':', '/', '+', and
      '-')

      Mapping of special characters (e.g., non-ASCII space characters to
      ASCII space or control characters to nothing).

   The PRECIS mappings document [I-D.ietf-precis-mappings] describes
   such mappings in more detail.

5.2.3. Case Mapping Rule

   The case mapping rule of a profile specifies whether case mapping is
   performed (instead of case preservation) on uppercase and titlecase
   characters, and how the mapping is done (e.g., mapping uppercase and
   titlecase characters to their lowercase equivalents).

   If case mapping is desired (instead of case preservation), it is
   RECOMMENDED to use Unicode Default Case Folding as defined in Chapter
   3 of the Unicode Standard [Unicode7.0].

      Note: Unicode Default Case Folding is not designed to handle
      various localization issues (such as so-called "dotless i" in
      several Turkic languages). The PRECIS mappings document
      [I-D.ietf-precis-mappings] describes these issues in greater
      detail and defines a "local case mapping" method that handles some
      locale-dependent and context-dependent mappings.

   In order to maximize entropy and minimize the potential for false
   positives, it is NOT RECOMMENDED for application protocols to map
   uppercase and titlecase code points to their lowercase equivalents
   when strings conforming to the FreeformClass, or a profile thereof,
   are used in passwords; instead, it is RECOMMENDED to preserve the
   case of all code points contained in such strings and then perform
   case-sensitive comparison. See also the related discussion in
   [I-D.ietf-precis-saslprepbis].

5.2.4. Normalization Rule

   The normalization rule of a profile specifies which Unicode
   normalization form (D, KD, C, or KC) is to be applied (see Unicode
   Standard Annex #15 [UAX15] for background information).

   In accordance with [RFC5198], normalization form C (NFC) is
   RECOMMENDED.

5.2.5. Exclusion Rule

   The exclusion rule of a profile specifies any particular characters
   or character categories that are not allowed in strings conforming to
   the profile, above and beyond those excluded by the string class
   being profiled.

660 That is, a profile MAY do either of the following: 662 1. Exclude specific code points that are allowed by the relevant 663 string class. 665 2. Exclude characters matching certain Unicode properties (e.g., 666 math symbols) that are included in the relevant PRECIS string 667 class. 669 As a result of such exclusions, code points that are defined as valid 670 for the PRECIS string class being profiled will be defined as 671 disallowed for the profile. 673 Typically, an exclusion rule is defined for the purpose of backward- 674 compatibility with legacy formats. Profiles for newly-defined string 675 types SHOULD NOT have an exclusion rule. 677 5.2.6. Directionality Rule 679 The directionality rule of a profile specifies how to treat strings 680 containing left-to-right (LTR) and right-to-left (RTL) characters 681 (see Unicode Standard Annex #9 [UAX9]). A profile usually specifies 682 a directionality rule that restricts strings to be entirely LTR 683 strings or entirely RTL strings and defines the allowable sequences 684 of characters in LTR and RTL strings. Possible rules include, but 685 are not limited to, (a) considering any string that contains a right- 686 to-left code point to be a right-to-left string, or (b) applying the 687 "Bidi Rule" from [RFC5893]. 689 Mixed-direction strings are not directly supported by the PRECIS 690 framework itself, since there is currently no widely accepted and 691 implemented solution for the safe display of mixed-direction strings. 692 An application protocol that uses the PRECIS framework (or an 693 extension to the framework) could define better ways to present 694 mixed-direction strings; however, that work is outside the scope of 695 this framework and would likely require a great deal of careful 696 research into the problems of displaying bidirectional text. 698 5.3. 
Building Application-Layer Constructs 700 Sometimes, an application-layer construct does not map in a 701 straightforward manner to one of the base string classes or a profile 702 thereof. Consider, for example, the "simple user name" construct in 703 the Simple Authentication and Security Layer (SASL) [RFC4422]. 704 Depending on the deployment, a simple user name might take the form 705 of a user's full name (e.g., the user's personal name followed by a 706 space and then the user's family name). Such a simple user name 707 cannot be defined as an instance of the IdentifierClass or a profile 708 thereof, since space characters are not allowed in the 709 IdentifierClass; however, it could be defined using a space-separated 710 sequence of IdentifierClass instances, as in the following ABNF 711 [RFC5234] from [I-D.ietf-precis-saslprepbis]: 713 username = userpart [1*(1*SP userpart)] 714 userpart = 1*(idbyte) 715 ; 716 ; an "idbyte" is a byte used to represent a 717 ; UTF-8 encoded Unicode code point that can be 718 ; contained in a string that conforms to the 719 ; PRECIS "IdentifierClass" 720 ; 722 Similar techniques could be used to define many application-layer 723 constructs, say of the form "user@domain" or "/path/to/file". 725 5.4. A Note about Spaces 727 With regard to the IdentifierClass, the consensus of the PRECIS 728 Working Group was that spaces are problematic for many reasons, 729 including: 731 o Many Unicode characters are confusable with ASCII space. 733 o Even if non-ASCII space characters are mapped to ASCII space 734 (U+0020), space characters are often not rendered in user 735 interfaces, leading to the possibility that a human user might 736 consider a string containing spaces to be equivalent to the same 737 string without spaces. 739 o In some locales, some devices are known to generate a character 740 other than ASCII space (such as ZERO WIDTH JOINER, U+200D) when a 741 user performs an action such as hitting the space bar on a keyboard.
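As a rough illustration of the points in the list above, the following Python sketch uses the standard "unicodedata" module to detect space separators (General_Category Zs) and to map them to ASCII space. The function names "is_space" and "map_spaces_to_ascii" are hypothetical and are not defined by this framework; the sample strings are illustrative only.

```python
import unicodedata


def is_space(ch: str) -> bool:
    """Return True if ch is a space separator (General_Category Zs)."""
    return unicodedata.category(ch) == "Zs"


def map_spaces_to_ascii(s: str) -> str:
    """Map every Zs code point (ASCII or not) to U+0020.

    Hypothetical helper: one possible "additional mapping" a profile
    of the FreeformClass might choose to apply.
    """
    return "".join(" " if is_space(ch) else ch for ch in s)


# Three strings that can look alike on screen but differ in code points:
# ASCII space, NO-BREAK SPACE (U+00A0), and IDEOGRAPHIC SPACE (U+3000).
samples = ["juliet capulet", "juliet\u00a0capulet", "juliet\u3000capulet"]
mapped = [map_spaces_to_ascii(s) for s in samples]

# ZERO WIDTH JOINER (U+200D) is a format character (Cf), not Zs, so a
# mapping keyed on Zs alone leaves it in place.
zwj_string = "juliet\u200dcapulet"
zwj_mapped = map_spaces_to_ascii(zwj_string)
```

After mapping, the three samples compare equal, whereas the string containing ZERO WIDTH JOINER is left unchanged. This is consistent with the third point above: a mapping based only on space separators does not cover every character that a device might emit in place of a space.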
743 One consequence of disallowing space characters in the 744 IdentifierClass might be to effectively discourage their use within 745 identifiers created in newer application protocols; given the 746 challenges involved in properly handling space characters (especially 747 non-ASCII space characters) in identifiers and other protocol 748 strings, the Working Group considered this to be a feature, not a 749 bug. 751 However, the FreeformClass does allow spaces, which enables 752 application protocols to define profiles of the FreeformClass that 753 are more flexible than any profiles of the IdentifierClass. In 754 addition, as explained in the previous section, application protocols 755 can also define application-layer constructs containing spaces. 757 6. Order of Operations 759 To ensure proper comparison, the rules specified for a particular 760 string class or profile MUST be applied in the following order: 762 1. Width Mapping Rule 764 2. Additional Mapping Rule 766 3. Case Mapping Rule 768 4. Normalization Rule 770 5. Exclusion Rule 772 6. Behavioral rules for determining whether a code point is valid, 773 allowed under a contextual rule, disallowed, or unassigned 775 As already described, the width mapping, additional mapping, case 776 mapping, normalization, and exclusion rules are specified for each 777 profile, whereas the behavioral rules are specified for each string 778 class. Some of the logic behind this order is provided under 779 Section 5.2.1 (see also the PRECIS mappings document 780 [I-D.ietf-precis-mappings]). 782 7. Code Point Properties 784 In order to implement the string classes described above, this 785 document does the following: 787 1. Reviews and classifies the collections of code points in the 788 Unicode character set by examining various code point properties. 790 2. Defines an algorithm for determining a derived property value, 791 which can vary depending on the string class being used by the 792 relevant application protocol. 
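As a non-normative illustration of classifying code points by their properties (item 1 above), the following Python sketch expresses several of the PRECIS-specific categories from Section 8 as predicates over the standard "unicodedata" module. All function names are hypothetical. The sketch assumes that Control(cp) can be approximated by General_Category Cc, and its results reflect the Unicode version bundled with the Python runtime rather than Unicode 7.0 specifically. It does not implement the Exceptions, BackwardCompatible, LetterDigits, or Unassigned categories, and it is not the derived property algorithm itself.

```python
import unicodedata


def gc(cp: int) -> str:
    """Return the General_Category of a code point, e.g. 'Lu' or 'Zs'."""
    return unicodedata.category(chr(cp))


def is_ascii7(cp: int) -> bool:
    """ASCII7 (K): printable, non-space 7-bit ASCII, 0021..007E."""
    return 0x21 <= cp <= 0x7E


def is_controls(cp: int) -> bool:
    """Controls (L). Assumption: approximated by General_Category Cc."""
    return gc(cp) == "Cc"


def is_spaces(cp: int) -> bool:
    """Spaces (N): General_Category(cp) is in {Zs}."""
    return gc(cp) == "Zs"


def is_symbols(cp: int) -> bool:
    """Symbols (O): General_Category(cp) is in {Sm, Sc, Sk, So}."""
    return gc(cp) in {"Sm", "Sc", "Sk", "So"}


def is_punctuation(cp: int) -> bool:
    """Punctuation (P): General_Category(cp) is in {Pc, Pd, Ps, Pe, Pi, Pf, Po}."""
    return gc(cp) in {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"}


def has_compat(cp: int) -> bool:
    """HasCompat (Q): toNFKC(cp) != cp."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


def is_other_letter_digits(cp: int) -> bool:
    """OtherLetterDigits (R): General_Category(cp) is in {Lt, Nl, No, Me}."""
    return gc(cp) in {"Lt", "Nl", "No", "Me"}


PREDICATES = {
    "ASCII7": is_ascii7,
    "Controls": is_controls,
    "Spaces": is_spaces,
    "Symbols": is_symbols,
    "Punctuation": is_punctuation,
    "HasCompat": has_compat,
    "OtherLetterDigits": is_other_letter_digits,
}


def categories(cp: int) -> set:
    """Return the names of all sketched categories that contain cp."""
    return {name for name, pred in PREDICATES.items() if pred(cp)}
```

A code point can fall into more than one category; for example, PLUS SIGN (U+002B) is in both ASCII7 and Symbols, and NO-BREAK SPACE (U+00A0) is in both Spaces and HasCompat. The ordering of the tests in the derived property algorithm decides which category determines the resulting value.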
794 This document is not intended to specify precisely how derived 795 property values are to be applied in protocol strings. That 796 information is the responsibility of the protocol specification that 797 uses or profiles a PRECIS string class from this document. The value 798 of the property is to be interpreted as follows. 800 PROTOCOL VALID Those code points that are allowed to be used in any 801 PRECIS string class (currently, IdentifierClass and 802 FreeformClass). Code points with this property value are 803 permitted for general use in any string class. The abbreviated 804 term "PVALID" is used to refer to this value in the remainder of 805 this document. 807 SPECIFIC CLASS PROTOCOL VALID Those code points that are allowed to 808 be used in specific string classes. Code points with this 809 property value are permitted for use in specific string classes. 810 In the remainder of this document, the abbreviated term *_PVAL is 811 used, where * = (ID | FREE), i.e., either "FREE_PVAL" or 812 "ID_PVAL". In practice, the derived property ID_PVAL is not used 813 in this specification, since every ID_PVAL code point is PVALID. 815 CONTEXTUAL RULE REQUIRED Some characteristics of the character, such 816 as its being invisible in certain contexts or problematic in 817 others, require that it not be used in labels unless specific 818 other characters or properties are present. As in IDNA2008, there 819 are two subdivisions of CONTEXTUAL RULE REQUIRED, the first for 820 Join_controls (called "CONTEXTJ") and the second for other 821 characters (called "CONTEXTO"). A character with the derived 822 property value CONTEXTJ or CONTEXTO MUST NOT be used unless an 823 appropriate rule has been established and the context of the 824 character is consistent with that rule. 
The most notable of the 825 CONTEXTUAL RULE REQUIRED characters are the Join Control 826 characters U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON- 827 JOINER, which have a derived property value of CONTEXTJ. See 828 Appendix A of [RFC5892] for more information. 830 DISALLOWED Those code points that are not permitted in any PRECIS 831 string class. 833 SPECIFIC CLASS DISALLOWED Those code points that are not to be 834 included in a specific string class. Code points with this 835 property value are not permitted in one of the string classes but 836 might be permitted in others. In the remainder of this document, 837 the abbreviated term *_DIS is used, where * = (ID | FREE), i.e., 838 either "FREE_DIS" or "ID_DIS". In practice, the derived property 839 FREE_DIS is not used in this specification, since every FREE_DIS 840 code point is DISALLOWED. 842 UNASSIGNED Those code points that are not designated (i.e. are 843 unassigned) in the Unicode Standard. 845 To summarize, the assigned values of the derived property are: 847 o PVALID 849 o FREE_PVAL 851 o CONTEXTJ 852 o CONTEXTO 854 o DISALLOWED 856 o UNASSIGNED 858 The algorithm to calculate the value of the derived property is as 859 follows: 861 If .cp. .in. Exceptions Then Exceptions(cp); 862 Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp); 863 Else If .cp. .in. Unassigned Then UNASSIGNED; 864 Else If .cp. .in. ASCII7 Then PVALID; 865 Else If .cp. .in. JoinControl Then CONTEXTJ; 866 Else If .cp. .in. OldHangulJamo Then DISALLOWED; 867 Else If .cp. .in. PrecisIgnorableProperties Then DISALLOWED; 868 Else If .cp. .in. Controls Then DISALLOWED; 869 Else If .cp. .in. HasCompat Then ID_DIS or FREE_PVAL; 870 Else If .cp. .in. LetterDigits Then PVALID; 871 Else If .cp. .in. OtherLetterDigits Then ID_DIS or FREE_PVAL; 872 Else If .cp. .in. Spaces Then ID_DIS or FREE_PVAL; 873 Else If .cp. .in. Symbols Then ID_DIS or FREE_PVAL; 874 Else If .cp. .in. 
Punctuation Then ID_DIS or FREE_PVAL; 875 Else DISALLOWED; 877 The value of the derived property calculated can depend on the string 878 class; for example, if an identifier used in an application protocol 879 is defined as profiling the PRECIS IdentifierClass then a space 880 character such as U+0020 would be assigned to ID_DIS, whereas if an 881 identifier is defined as profiling the PRECIS FreeformClass then the 882 character would be assigned to FREE_PVAL. For the sake of brevity, 883 the designation "FREE_PVAL" is used in the code point tables, instead 884 of the longer designation "ID_DIS or FREE_PVAL". In practice, the 885 derived properties ID_PVAL and FREE_DIS are not used in this 886 specification, since every ID_PVAL code point is PVALID and every 887 FREE_DIS code point is DISALLOWED. 889 Use of the name of a rule (such as "Exceptions") implies the set of 890 code points that the rule defines, whereas the same name as a 891 function call (such as "Exceptions(cp)") implies the value that the 892 code point has in the Exceptions table. 894 The mechanisms described here allow determination of the value of the 895 property for future versions of Unicode (including characters added 896 after Unicode 5.2 or 7.0 depending on the category, since some 897 categories in this document are reused from IDNA2008 and therefore 898 were defined at the time of Unicode 5.2). Changes in Unicode 899 properties that do not affect the outcome of this process therefore 900 do not affect this framework. For example, a character can have its 901 Unicode General_Category value (see Chapter 4 of the Unicode Standard 902 [Unicode7.0]) change from So to Sm, or from Lo to Ll, without 903 affecting the algorithm results. Moreover, even if such changes were 904 to result, the BackwardCompatible list (Section 8.7) can be adjusted 905 to ensure the stability of the results. 907 8. 
Category Definitions Used to Calculate Derived Property 909 The derived property obtains its value based on a two-step procedure: 911 1. Characters are placed in one or more character categories either 912 (1) based on core properties defined by the Unicode Standard or 913 (2) by treating the code point as an exception and addressing the 914 code point based on its code point value. These categories are 915 not mutually exclusive. 917 2. Set operations are used with these categories to determine the 918 values for a property specific to a given string class. These 919 operations are specified under Section 7. 921 Note: Unicode property names and property value names might have 922 short abbreviations, such as "gc" for the General_Category 923 property and "Ll" for the Lowercase_Letter property value of the 924 gc property. 926 In the following specification of character categories, the operation 927 that returns the value of a particular Unicode character property for 928 a code point is designated by using the formal name of that property 929 (from the Unicode PropertyAliases.txt [1]) followed by '(cp)' for 930 "code point". For example, the value of the General_Category 931 property for a code point is indicated by General_Category(cp). 933 The first ten categories (A-J) shown below were previously defined 934 for IDNA2008 and are copied directly from [RFC5892] to ease the 935 understanding of how PRECIS handles various characters. Some of 936 these categories are reused in PRECIS and some of them are not; 937 however, the lettering of categories is retained to prevent overlap 938 and to ease implementation of both IDNA2008 and PRECIS in a single 939 software application. The next eight categories (K-R) are specific 940 to PRECIS. 942 8.1. LetterDigits (A) 944 This category is defined in Section 2.1 of [RFC5892] and is included 945 by reference for use in PRECIS. 947 8.2.
Unstable (B) 949 This category is defined in Section 2.2 of [RFC5892] but not used in 950 PRECIS. 952 8.3. IgnorableProperties (C) 954 This category is defined in Section 2.3 of [RFC5892] but not used in 955 PRECIS. 957 Note: See the "PrecisIgnorableProperties (M)" category below for a 958 more inclusive category used in PRECIS identifiers. 960 8.4. IgnorableBlocks (D) 962 This category is defined in Section 2.4 of [RFC5892] but not used in 963 PRECIS. 965 8.5. LDH (E) 967 This category is defined in Section 2.5 of [RFC5892] but not used in 968 PRECIS. 970 Note: See the "ASCII7 (K)" category below for a more inclusive 971 category used in PRECIS identifiers. 973 8.6. Exceptions (F) 975 This category is defined in Section 2.6 of [RFC5892] and is included 976 by reference for use in PRECIS. 978 8.7. BackwardCompatible (G) 980 This category is defined in Section 2.7 of [RFC5892] and is included 981 by reference for use in PRECIS. 983 Note: Because of how the PRECIS string classes are defined, only 984 changes that would result in code points being added to or removed 985 from the LetterDigits ("A") category would result in backward- 986 incompatible modifications to code point assignments. Therefore, 987 management of this category is handled via the processes specified in 988 [RFC5892]. At the time of this writing (and also at the time that 989 RFC 5892 was published), this category consisted of the empty set; 990 however, that is subject to change as described in RFC 5892. 992 8.8. JoinControl (H) 994 This category is defined in Section 2.8 of [RFC5892] and is included 995 by reference for use in PRECIS. 997 8.9. OldHangulJamo (I) 999 This category is defined in Section 2.9 of [RFC5892] and is included 1000 by reference for use in PRECIS. 1002 8.10. Unassigned (J) 1004 This category is defined in Section 2.10 of [RFC5892] and is included 1005 by reference for use in PRECIS. 1007 8.11.
ASCII7 (K) 1009 This PRECIS-specific category consists of all printable, non-space 1010 characters from the 7-bit ASCII range. By applying this category, 1011 the algorithm specified under Section 7 exempts these characters from 1012 other rules that might be applied during PRECIS processing, on the 1013 assumption that these code points are in such wide use that 1014 disallowing them would be counter-productive. 1016 K: cp is in {0021..007E} 1018 8.12. Controls (L) 1020 L: Control(cp) = True 1022 8.13. PrecisIgnorableProperties (M) 1024 This PRECIS-specific category is used to group code points that are 1025 discouraged from use in PRECIS string classes. 1027 M: Default_Ignorable_Code_Point(cp) = True or 1028 Noncharacter_Code_Point(cp) = True 1030 The definition for Default_Ignorable_Code_Point can be found in the 1031 DerivedCoreProperties.txt [2] file, and at the time of Unicode 7.0 is 1032 as follows: 1034 Other_Default_Ignorable_Code_Point 1035 + Cf (Format characters) 1036 + Variation_Selector 1037 - White_Space 1038 - FFF9..FFFB (Annotation Characters) 1039 - 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters 1040 that should be visible) 1042 8.14. Spaces (N) 1044 This PRECIS-specific category is used to group code points that are 1045 space characters. 1047 N: General_Category(cp) is in {Zs} 1049 8.15. Symbols (O) 1051 This PRECIS-specific category is used to group code points that are 1052 symbols. 1054 O: General_Category(cp) is in {Sm, Sc, Sk, So} 1056 8.16. Punctuation (P) 1058 This PRECIS-specific category is used to group code points that are 1059 punctuation characters. 1061 P: General_Category(cp) is in {Pc, Pd, Ps, Pe, Pi, Pf, Po} 1063 8.17. HasCompat (Q) 1065 This PRECIS-specific category is used to group code points that have 1066 compatibility equivalents as explained in Chapter 2 and Chapter 3 of 1067 the Unicode Standard [Unicode7.0]. 
1069 Q: toNFKC(cp) != cp 1071 The toNFKC() operation returns the code point in normalization form 1072 KC. For more information, see Section 5 of Unicode Standard Annex 1073 #15 [UAX15]. 1075 8.18. OtherLetterDigits (R) 1077 This PRECIS-specific category is used to group code points that are 1078 letters and digits other than the "traditional" letters and digits 1079 grouped under the LetterDigits (A) class (see Section 8.1). 1081 R: General_Category(cp) is in {Lt, Nl, No, Me} 1083 9. Guidelines for Designated Experts 1085 Experience with internationalization in application protocols has 1086 shown that protocol designers usually do not understand the 1087 subtleties and tradeoffs involved with internationalization and that 1088 they need considerable guidance in making reasonable decisions with 1089 regard to the options before them. Therefore, although the 1090 registration policy for PRECIS profiles is Expert Review and a stable 1091 specification is not strictly required, the designated experts for 1092 profile registration requests ought to encourage applicants to 1093 provide a stable specification documenting the profile. 1095 Internationalization can be difficult and contentious; the designated 1096 experts and applicants are strongly encouraged to work together in a 1097 spirit of good faith and mutual understanding to achieve rough 1098 consensus on progressing registrations through the process. They are 1099 also encouraged to bring additional expertise into the discussion if 1100 that would be helpful in adding perspective or otherwise resolving 1101 issues. 1103 Further considerations for profile registrants and designated experts 1104 can be found under Section 10.3. 1106 10. IANA Considerations 1108 10.1. PRECIS Derived Property Value Registry 1110 IANA is requested to create a PRECIS-specific registry with the 1111 Derived Properties for the versions of Unicode that are released 1112 after (and including) version 7.0. 
The derived property value is to 1113 be calculated in cooperation with a designated expert [RFC5226] 1114 according to the rules specified under Section 8 and Section 7. 1116 The IESG is to be notified if backward-incompatible changes to the 1117 table of derived properties are discovered or if other problems arise 1118 during the process of creating the table of derived property values 1119 or during expert review. Changes to the rules defined under 1120 Section 8 and Section 7 require IETF Review. 1122 10.2. PRECIS Base Classes Registry 1124 IANA is requested to create a registry of PRECIS string classes. In 1125 accordance with [RFC5226], the registration policy is "RFC Required". 1127 The registration template is as follows: 1129 Base Class: [the name of the PRECIS string class] 1130 Description: [a brief description of the PRECIS string class and its 1131 intended use, e.g., "A sequence of letters, numbers, and symbols 1132 that is used to identify or address a network entity."] 1134 Specification: [the RFC number] 1136 The initial registrations are as follows: 1138 Base Class: FreeformClass. 1139 Description: A sequence of letters, numbers, symbols, spaces, and 1140 other code points that is used for free-form strings. 1141 Specification: Section 3.3 of this document. 1142 [Note to RFC Editor: please change "this document" 1143 to the RFC number issued for this specification.] 1145 Base Class: IdentifierClass. 1146 Description: A sequence of letters, numbers, and symbols that is 1147 used to identify or address a network entity. 1148 Specification: Section 3.2 of this document. 1149 [Note to RFC Editor: please change "this document" 1150 to the RFC number issued for this specification.] 1152 10.3. PRECIS Profiles Registry 1154 IANA is requested to create a registry of profiles that use the 1155 PRECIS string classes. In accordance with [RFC5226], the 1156 registration policy is "Expert Review". 
This policy was chosen in 1157 order to ease the burden of registration while ensuring that 1158 "customers" of PRECIS receive appropriate guidance regarding the 1159 sometimes complex and subtle internationalization issues related to 1160 profiles of PRECIS string classes. 1162 The registration template is as follows: 1164 Name: [the name of the profile] 1166 Applicability: [the specific protocol elements to which this profile 1167 applies, e.g., "Localparts in XMPP addresses."] 1169 Base Class: [which PRECIS string class is being profiled] 1171 Replaces: [the Stringprep profile that this PRECIS profile replaces, 1172 if any] 1174 Width Mapping Rule: [the behavioral rule for handling of width, 1175 e.g., "Map fullwidth and halfwidth characters to their 1176 compatibility variants."] 1178 Additional Mapping Rule: [any additional mappings that are required or 1179 recommended, e.g., "Map non-ASCII space characters to ASCII 1180 space."] 1182 Case Mapping Rule: [the behavioral rule for handling of case, e.g., 1183 "Unicode Default Case Folding"] 1185 Normalization Rule: [which Unicode normalization form is applied, 1186 e.g., "NFC"] 1188 Exclusion Rule: [a brief description of the specific code points or 1189 character categories that are excluded, e.g., "Eight legacy characters 1190 in the ASCII range" or "Any character that has a compatibility 1191 equivalent, i.e., the HasCompat category"] 1193 Directionality Rule: [the behavioral rule for handling of right-to- 1194 left code points, e.g., "The 'Bidi Rule' defined in RFC 5893 1195 applies."] 1197 Enforcement: [which entities enforce the rules, and when that 1198 enforcement occurs during protocol operations] 1200 Specification: [a pointer to relevant documentation, such as an RFC 1201 or Internet-Draft] 1203 In order to request a review, the registrant shall send a completed 1204 template to the precis@ietf.org list or its designated successor.
1206 Factors to focus on while defining profiles and reviewing profile 1207 registrations include the following: 1209 o Is the problem being addressed by this profile well-defined? 1211 o Does the specification define what kinds of applications are 1212 involved and the protocol elements to which this profile applies? 1214 o Would an existing PRECIS string class or profile solve the 1215 problem? 1217 o Is the profile clearly defined? 1219 o Is the profile based on an appropriate dividing line between user 1220 interface (culture, context, intent, locale, device limitations, 1221 etc.) and the use of conformant strings in protocol elements? 1223 o Are the width mapping, case mapping, additional mappings, 1224 normalization, exclusion, and directionality rules appropriate for 1225 the intended use? 1227 o Does the profile explain which entities enforce the rules, and 1228 when such enforcement occurs during protocol operations? 1230 o Does the profile reduce the degree to which human users could be 1231 surprised or confused by application behavior (the "Principle of 1232 Least Astonishment")? 1234 o Does the profile introduce any new security concerns such as those 1235 described under Section 11 of this document (e.g., false positives 1236 for authentication or authorization)? 1238 11. Security Considerations 1240 11.1. General Issues 1242 If input strings that appear "the same" to users are programmatically 1243 considered to be distinct in different systems, or if input strings 1244 that appear distinct to users are programmatically considered to be 1245 "the same" in different systems, then users can be confused. Such 1246 confusion can have security implications, such as the false positives 1247 and false negatives discussed in [RFC6943]. One starting goal of 1248 work on the PRECIS framework was to limit the number of times that 1249 users are confused (consistent with the "Principle of Least 1250 Astonishment").
Unfortunately, this goal has been difficult to 1251 achieve given the large number of application protocols already in 1252 existence, each with its own conventions regarding allowable 1253 characters (see for example [I-D.saintandre-username-interop] with 1254 regard to various username constructs). Despite these difficulties, 1255 profiles should not be multiplied beyond necessity. In particular, 1256 application protocol designers should think long and hard before 1257 defining a new profile instead of using one that has already been 1258 defined, and if they decide to define a new profile then they should 1259 clearly explain their reasons for doing so. 1261 The security of applications that use this framework can depend in 1262 part on the proper preparation and comparison of internationalized 1263 strings. For example, such strings can be used to make 1264 authentication and authorization decisions, and the security of an 1265 application could be compromised if an entity providing a given 1266 string is connected to the wrong account or online resource based on 1267 different interpretations of the string. 1269 Specifications of application protocols that use this framework are 1270 strongly encouraged to describe how internationalized strings are 1271 used in the protocol, including the security implications of any 1272 false positives and false negatives that might result from various 1273 comparison operations. For some helpful guidelines, refer to 1274 [RFC6943], [RFC5890], [UTR36], and [UTS39]. 1276 11.2. Use of the IdentifierClass 1278 Strings that conform to the IdentifierClass and any profile thereof 1279 are intended to be relatively safe for use in a broad range of 1280 applications, primarily because they include only letters, digits, 1281 and "grandfathered" non-space characters from the ASCII range; thus 1282 they exclude spaces, characters with compatibility equivalents, and 1283 almost all symbols and punctuation marks. 
However, because such 1284 strings can still include so-called confusable characters (see 1285 Section 11.5), protocol designers and implementers are encouraged to 1286 pay close attention to the security considerations described 1287 elsewhere in this document. 1289 11.3. Use of the FreeformClass 1291 Strings that conform to the FreeformClass and many profiles thereof 1292 can include virtually any Unicode character. This makes the 1293 FreeformClass quite expressive, but also problematic from the 1294 perspective of possible user confusion. Protocol designers are 1295 hereby warned that the FreeformClass contains code points they might 1296 not understand, and are encouraged to profile the IdentifierClass 1297 wherever feasible; however, if an application protocol requires more 1298 code points than are allowed by the IdentifierClass, protocol 1299 designers are encouraged to define a profile of the FreeformClass 1300 that restricts the allowable code points as tightly as possible. 1301 (The PRECIS Working Group considered the option of allowing 1302 superclasses as well as profiles of PRECIS string classes, but 1303 decided against allowing superclasses to reduce the likelihood of 1304 security and interoperability problems.) 1306 11.4. Local Character Set Issues 1308 When systems use local character sets other than ASCII and Unicode, 1309 this specification leaves the problem of converting between the local 1310 character set and Unicode up to the application or local system. If 1311 different applications (or different versions of one application) 1312 implement different rules for conversions among coded character sets, 1313 they could interpret the same name differently and contact different 1314 application servers or other network entities.
This problem is not 1315 solved by security protocols, such as Transport Layer Security (TLS) 1316 [RFC5246] and the Simple Authentication and Security Layer (SASL) 1317 [RFC4422], that do not take local character sets into account. 1319 11.5. Visually Similar Characters 1321 Some characters are visually similar and thus can cause confusion 1322 among humans. Such characters are often called "confusable 1323 characters" or "confusables". 1325 The problem of confusable characters is not necessarily caused by the 1326 use of Unicode code points outside the ASCII range. For example, in 1327 some presentations and to some individuals the string "ju1iet" 1328 (spelled with DIGIT ONE, U+0031, as the third character) might appear 1329 to be the same as "juliet" (spelled with LATIN SMALL LETTER L, 1330 U+006C), especially on casual visual inspection. This phenomenon is 1331 sometimes called "typejacking". 1333 However, the problem is made more serious by introducing the full 1334 range of Unicode code points into protocol strings. For example, the 1335 characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2 from the 1336 Cherokee block look similar to the ASCII characters "STPETER" as they 1337 might appear when presented using a "creative" font family. 1339 In some examples of confusable characters, it is unlikely that the 1340 average human could tell the difference between the real string and 1341 the fake string. (Indeed, there is no programmatic way to 1342 distinguish with full certainty which is the fake string and which is 1343 the real string; in some contexts, the string formed of Cherokee 1344 characters might be the real string and the string formed of ASCII 1345 characters might be the fake string.) Because PRECIS-compliant 1346 strings can contain almost any properly-encoded Unicode code point, 1347 it can be relatively easy to fake or mimic some strings in systems 1348 that use the PRECIS framework. 
The fact that some strings are easily 1349 confused introduces security vulnerabilities of the kind that have 1350 also plagued the World Wide Web, specifically the phenomenon known as 1351 phishing. 1353 Despite the fact that some specific suggestions about identification 1354 and handling of confusable characters appear in the Unicode Security 1355 Considerations [UTR36] and the Unicode Security Mechanisms [UTS39], 1356 it is also true (as noted in [RFC5890]) that "there are no 1357 comprehensive technical solutions to the problems of confusable 1358 characters". Because it is impossible to map visually similar 1359 characters without a great deal of context (such as knowing the font 1360 families used), the PRECIS framework does nothing to map similar- 1361 looking characters together, nor does it prohibit some characters 1362 because they look like others. 1364 Nevertheless, specifications for application protocols that use this 1365 framework are strongly encouraged to describe how confusable 1366 characters can be abused to compromise the security of systems that 1367 use the protocol in question, along with any protocol-specific 1368 suggestions for overcoming those threats. In particular, software 1369 implementations and service deployments that use PRECIS-based 1370 technologies are strongly encouraged to define and implement 1371 consistent policies regarding the registration, storage, and 1372 presentation of visually similar characters. The following 1373 recommendations are appropriate: 1375 1. An application service SHOULD define a policy that specifies the 1376 scripts or blocks of characters that the service will allow to be 1377 registered (e.g., in an account name) or stored (e.g., in a file 1378 name). 
Such a policy SHOULD be informed by the languages and 1379 scripts that are used to write registered account names; in 1380 particular, to reduce confusion, the service SHOULD forbid 1381 registration or storage of strings that contain characters from 1382 more than one script and SHOULD restrict registrations to 1383 characters drawn from a very small number of scripts (e.g., 1384 scripts that are well-understood by the administrators of the 1385 service, to improve manageability). 1387 2. User-oriented application software SHOULD define a policy that 1388 specifies how internationalized strings will be presented to a 1389 human user. Because every human user of such software has a 1390 preferred language or a small set of preferred languages, the 1391 software SHOULD gather that information either explicitly from 1392 the user or implicitly via the operating system of the user's 1393 device. Furthermore, because most languages are typically 1394 represented by a single script or a small set of scripts, and 1395 because most scripts are typically contained in one or more 1396 blocks of characters, the software SHOULD warn the user when 1397 presenting a string that mixes characters from more than one 1398 script or block, or that uses characters outside the normal range 1399 of the user's preferred language(s). (Such a recommendation is 1400 not intended to discourage communication across different 1401 communities of language users; instead, it recognizes the 1402 existence of such communities and encourages due caution when 1403 presenting unfamiliar scripts or characters to human users.) 1405 The challenges inherent in supporting the full range of Unicode code 1406 points have in the past led some to hope for a way to 1407 programmatically negotiate more restrictive ranges based on locale, 1408 script, or other relevant factors, to tag the locale associated with 1409 a particular string, etc. 
As a general-purpose internationalization 1410 technology, the PRECIS framework does not include such mechanisms. 1412 11.6. Security of Passwords 1414 Two goals of passwords are to maximize the amount of entropy and to 1415 minimize the potential for false positives. These goals can be 1416 achieved in part by allowing a wide range of code points and by 1417 ensuring that passwords are handled in such a way that code points 1418 are not compared aggressively. Therefore, it is NOT RECOMMENDED for 1419 application protocols to profile the FreeformClass for use in 1420 passwords in a way that removes entire categories (e.g., by 1421 disallowing symbols or punctuation). Furthermore, it is NOT 1422 RECOMMENDED for application protocols to map uppercase and titlecase 1423 code points to their lowercase equivalents in such strings; instead, 1424 it is RECOMMENDED to preserve the case of all code points contained 1425 in such strings and to compare them in a case-sensitive manner. 1427 That said, software implementers need to be aware that there exist 1428 tradeoffs between entropy and usability. For example, allowing a 1429 user to establish a password containing "uncommon" code points might 1430 make it difficult for the user to access a service when using an 1431 unfamiliar or constrained input device. 1433 Some application protocols use passwords directly, whereas others 1434 reuse technologies that themselves process passwords (one example of 1435 such a technology is the Simple Authentication and Security Layer 1436 [RFC4422]). Moreover, passwords are often carried by a sequence of 1437 protocols with backend authentication systems or data storage systems 1438 such as RADIUS [RFC2865] and LDAP [RFC4510]. Developers of 1439 application protocols are encouraged to look into reusing these 1440 profiles instead of defining new ones, so that end-user expectations 1441 about passwords are consistent no matter which application protocol 1442 is used. 
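As a minimal, non-normative sketch of the recommendations above, the following Python code preserves case, leaves symbols and punctuation intact, and compares passwords in a case-sensitive manner. The function names prepare_password and passwords_match are hypothetical. The use of Unicode Normalization Form C (NFC) is an assumption of this sketch; this section does not specify a normalization form, and the authoritative rules are those of the profile an application protocol actually adopts.

```python
import hmac
import unicodedata


def prepare_password(password: str) -> str:
    """Prepare a password without case mapping or category removal.

    Assumption (not stated in this section): apply NFC so that
    canonically equivalent sequences compare as equal.
    """
    return unicodedata.normalize("NFC", password)


def passwords_match(stored: str, supplied: str) -> bool:
    """Compare two passwords case-sensitively after preparation."""
    a = prepare_password(stored).encode("utf-8")
    b = prepare_password(supplied).encode("utf-8")
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(a, b)


if __name__ == "__main__":
    print(passwords_match("Secret", "secret"))        # case is significant
    print(passwords_match("caf\u00e9", "cafe\u0301"))  # canonically equivalent
```

Because nothing is case-mapped or stripped, every code point the user typed continues to contribute entropy; the tradeoff with usability on constrained input devices, discussed above, still applies.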
1444 In protocols that provide passwords as input to a cryptographic 1445 algorithm such as a hash function, the client will need to perform 1446 proper preparation of the password before applying the algorithm, 1447 since the password is not available to the server in plaintext form. 1449 Further discussion of password handling can be found in 1450 [I-D.ietf-precis-saslprepbis]. 1452 12. Interoperability Considerations 1454 Although strings that are consumed in PRECIS-based application 1455 protocols are often encoded using UTF-8 [RFC3629], the exact encoding 1456 is a matter for the application protocol that uses PRECIS, not for 1457 the PRECIS framework. 1459 It is known that some existing systems are unable to support the full 1460 Unicode character set, or even any characters outside the ASCII 1461 range. If two (or more) applications need to interoperate when 1462 exchanging data (e.g., for the purpose of authenticating a username 1463 or password), they will naturally need to have in common at least one 1464 coded character set (as defined by [RFC6365]). Establishing such a 1465 baseline is a matter for the application protocol that uses PRECIS, 1466 not for the PRECIS framework. 1468 Three Unicode code points underwent changes in their GeneralCategory 1469 between Unicode 5.2 (current at the time IDNA2008 was originally 1470 published) and Unicode 6.0, as described in [RFC6452]. Implementers 1471 might need to be aware that the treatment of these characters differs 1472 depending on which version of Unicode is available on the system that 1473 is using IDNA2008 or PRECIS, and that other such differences are 1474 possible between the version of Unicode current at the time of this 1475 writing (7.0) and future versions. 1477 13. References 1479 13.1. Normative References 1481 [RFC20] Cerf, V., "ASCII format for network interchange", RFC 20, 1482 October 1969. 
1484 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1485 Requirement Levels", BCP 14, RFC 2119, March 1997. 1487 [RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for Network 1488 Interchange", RFC 5198, March 2008. 1490 [Unicode7.0] 1491 The Unicode Consortium, "The Unicode Standard, Version 1492 7.0.0", 2014, 1493 . 1495 13.2. Informative References 1497 [I-D.ietf-precis-mappings] 1498 Yoneya, Y. and T. NEMOTO, "Mapping characters for PRECIS 1499 classes", draft-ietf-precis-mappings-08 (work in 1500 progress), June 2014. 1502 [I-D.ietf-precis-nickname] 1503 Saint-Andre, P., "Preparation and Comparison of 1504 Nicknames", draft-ietf-precis-nickname-11 (work in 1505 progress), October 2014. 1507 [I-D.ietf-precis-saslprepbis] 1508 Saint-Andre, P. and A. Melnikov, "Username and Password 1509 Preparation Algorithms", draft-ietf-precis-saslprepbis-09 1510 (work in progress), October 2014. 1512 [I-D.ietf-xmpp-6122bis] 1513 Saint-Andre, P., "Extensible Messaging and Presence 1514 Protocol (XMPP): Address Format", draft-ietf-xmpp- 1515 6122bis-15 (work in progress), October 2014. 1517 [I-D.saintandre-username-interop] 1518 Saint-Andre, P., "An Interoperable Subset of Characters 1519 for Internationalized Usernames", draft-saintandre- 1520 username-interop-03 (work in progress), March 2014. 1522 [RFC2865] Rigney, C., Willens, S., Rubens, A., and W. Simpson, 1523 "Remote Authentication Dial In User Service (RADIUS)", RFC 1524 2865, June 2000. 1526 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of 1527 Internationalized Strings ("stringprep")", RFC 3454, 1528 December 2002. 1530 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 1531 "Internationalizing Domain Names in Applications (IDNA)", 1532 RFC 3490, March 2003. 1534 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1535 Profile for Internationalized Domain Names (IDN)", RFC 1536 3491, March 2003. 
1538 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 1539 10646", STD 63, RFC 3629, November 2003. 1541 [RFC4422] Melnikov, A. and K. Zeilenga, "Simple Authentication and 1542 Security Layer (SASL)", RFC 4422, June 2006. 1544 [RFC4510] Zeilenga, K., "Lightweight Directory Access Protocol 1545 (LDAP): Technical Specification Road Map", RFC 4510, June 1546 2006. 1548 [RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and 1549 Recommendations for Internationalized Domain Names 1550 (IDNs)", RFC 4690, September 2006. 1552 [RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an 1553 IANA Considerations Section in RFCs", BCP 26, RFC 5226, 1554 May 2008. 1556 [RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax 1557 Specifications: ABNF", STD 68, RFC 5234, January 2008. 1559 [RFC5246] Dierks, T. and E. Rescorla, "The Transport Layer Security 1560 (TLS) Protocol Version 1.2", RFC 5246, August 2008. 1562 [RFC5890] Klensin, J., "Internationalized Domain Names for 1563 Applications (IDNA): Definitions and Document Framework", 1564 RFC 5890, August 2010. 1566 [RFC5891] Klensin, J., "Internationalized Domain Names in 1567 Applications (IDNA): Protocol", RFC 5891, August 2010. 1569 [RFC5892] Faltstrom, P., "The Unicode Code Points and 1570 Internationalized Domain Names for Applications (IDNA)", 1571 RFC 5892, August 2010. 1573 [RFC5893] Alvestrand, H. and C. Karp, "Right-to-Left Scripts for 1574 Internationalized Domain Names for Applications (IDNA)", 1575 RFC 5893, August 2010. 1577 [RFC5894] Klensin, J., "Internationalized Domain Names for 1578 Applications (IDNA): Background, Explanation, and 1579 Rationale", RFC 5894, August 2010. 1581 [RFC5895] Resnick, P. and P. Hoffman, "Mapping Characters for 1582 Internationalized Domain Names in Applications (IDNA) 1583 2008", RFC 5895, September 2010. 1585 [RFC6365] Hoffman, P. and J. 
Klensin, "Terminology Used in 1586 Internationalization in the IETF", BCP 166, RFC 6365, 1587 September 2011. 1589 [RFC6452] Faltstrom, P. and P. Hoffman, "The Unicode Code Points and 1590 Internationalized Domain Names for Applications (IDNA) - 1591 Unicode 6.0", RFC 6452, November 2011. 1593 [RFC6885] Blanchet, M. and A. Sullivan, "Stringprep Revision and 1594 Problem Statement for the Preparation and Comparison of 1595 Internationalized Strings (PRECIS)", RFC 6885, March 2013. 1597 [RFC6943] Thaler, D., "Issues in Identifier Comparison for Security 1598 Purposes", RFC 6943, May 2013. 1600 [UAX9] The Unicode Consortium, "Unicode Standard Annex #9: 1601 Unicode Bidirectional Algorithm", September 2012, 1602 . 1604 [UAX11] The Unicode Consortium, "Unicode Standard Annex #11: East 1605 Asian Width", September 2012, 1606 . 1608 [UAX15] The Unicode Consortium, "Unicode Standard Annex #15: 1609 Unicode Normalization Forms", August 2012, 1610 . 1612 [UnicodeCurrent] 1613 The Unicode Consortium, "The Unicode Standard", 1614 2014-present, . 1616 [UTR36] The Unicode Consortium, "Unicode Technical Report #36: 1617 Unicode Security Considerations", July 2012, 1618 . 1620 [UTS39] The Unicode Consortium, "Unicode Technical Standard #39: 1621 Unicode Security Mechanisms", July 2012, 1622 . 1624 13.3. URIs 1626 [1] http://unicode.org/Public/UNIDATA/PropertyAliases.txt 1628 [2] http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt 1630 Appendix A. Acknowledgements 1632 The authors would like to acknowledge the comments and contributions 1633 of the following individuals during working group discussion: David 1634 Black, Edward Burns, Dan Chiba, Mark Davis, Alan DeKok, Martin 1635 Duerst, Patrik Faltstrom, Ted Hardie, Joe Hildebrand, Bjoern 1636 Hoehrmann, Paul Hoffman, Jeffrey Hutzelman, Simon Josefsson, John 1637 Klensin, Alexey Melnikov, Takahiro Nemoto, Yoav Nir, Mike Parker, 1638 Pete Resnick, Andrew Sullivan, Dave Thaler, Yoshiro Yoneya, and 1639 Florian Zeitz. 
1641 Charlie Kaufman, Tom Taylor, and Tim Wicinski reviewed the document 1642 on behalf of the Security Directorate, the General Area Review Team, 1643 and the Operations and Management Directorate, respectively. 1645 During IESG review, Alissa Cooper, Stephen Farrell, and Barry Leiba 1646 provided comments that led to further improvements. 1648 Some algorithms and textual descriptions have been borrowed from 1649 [RFC5892]. Some text regarding security has been borrowed from 1650 [RFC5890], [I-D.ietf-precis-saslprepbis], and 1651 [I-D.ietf-xmpp-6122bis]. 1653 Peter Saint-Andre wishes to acknowledge Cisco Systems, Inc., for 1654 employing him during his work on earlier versions of this document. 1656 Authors' Addresses 1658 Peter Saint-Andre 1659 &yet 1661 Email: peter@andyet.com 1662 URI: https://andyet.com/ 1664 Marc Blanchet 1665 Viagenie 1666 246 Aberdeen 1667 Quebec, QC G1R 2E1 1668 Canada 1670 Email: Marc.Blanchet@viagenie.ca 1671 URI: http://www.viagenie.ca/