idnits 2.17.1

draft-ietf-precis-framework-18.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (September 2, 2014) is 3496 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '1' on line 1509

  -- Looks like a reference, but probably isn't: '2' on line 1511

  == Outdated reference: A later version (-12) exists of
     draft-ietf-precis-mappings-08

  == Outdated reference: A later version (-19) exists of
     draft-ietf-precis-nickname-09

  == Outdated reference: A later version (-18) exists of
     draft-ietf-precis-saslprepbis-07

  == Outdated reference: A later version (-24) exists of
     draft-ietf-xmpp-6122bis-12

  -- Obsolete informational reference (is this intentional?): RFC 3454
     (Obsoleted by RFC 7564)

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)

  -- Obsolete informational reference (is this intentional?): RFC 3491
     (Obsoleted by RFC 5891)

  -- Obsolete informational reference (is this intentional?): RFC 5226
     (Obsoleted by RFC 8126)

  -- Obsolete informational reference (is this intentional?): RFC 5246
     (Obsoleted by RFC 8446)

     Summary: 0 errors (**), 0 flaws (~~), 5 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    PRECIS                                                    P. Saint-Andre
3    Internet-Draft                                                      &yet
4    Obsoletes: 3454 (if approved)                                M. Blanchet
5    Intended status: Standards Track                                Viagenie
6    Expires: March 6, 2015                                 September 2, 2014

8        PRECIS Framework: Preparation and Comparison of Internationalized
9                        Strings in Application Protocols
10                        draft-ietf-precis-framework-18

12   Abstract

14      Application protocols using Unicode characters in protocol strings
15      need to properly prepare such strings in order to perform valid
16      comparison operations (e.g., for purposes of authentication or
17      authorization). This document defines a framework enabling
18      application protocols to perform the preparation and comparison of
19      internationalized strings ("PRECIS") in a way that depends on the
20      properties of Unicode characters and thus is agile with respect to
21      versions of Unicode. As a result, this framework provides a more
22      sustainable approach to the handling of internationalized strings
23      than the previous framework, known as Stringprep (RFC 3454). This
24      document obsoletes RFC 3454.

26   Status of This Memo

28      This Internet-Draft is submitted in full conformance with the
29      provisions of BCP 78 and BCP 79.

31      Internet-Drafts are working documents of the Internet Engineering
32      Task Force (IETF). Note that other groups may also distribute
33      working documents as Internet-Drafts. The list of current Internet-
34      Drafts is at http://datatracker.ietf.org/drafts/current/.

36      Internet-Drafts are draft documents valid for a maximum of six months
37      and may be updated, replaced, or obsoleted by other documents at any
38      time. It is inappropriate to use Internet-Drafts as reference
39      material or to cite them other than as "work in progress."

41      This Internet-Draft will expire on March 6, 2015.

43   Copyright Notice

45      Copyright (c) 2014 IETF Trust and the persons identified as the
46      document authors. All rights reserved.

48      This document is subject to BCP 78 and the IETF Trust's Legal
49      Provisions Relating to IETF Documents
50      (http://trustee.ietf.org/license-info) in effect on the date of
51      publication of this document. Please review these documents
52      carefully, as they describe your rights and restrictions with respect
53      to this document. Code Components extracted from this document must
54      include Simplified BSD License text as described in Section 4.e of
55      the Trust Legal Provisions and are provided without warranty as
56      described in the Simplified BSD License.

58   Table of Contents

60      1. Introduction
61      2. Terminology
62      3. String Classes
63        3.1. Overview
64        3.2. IdentifierClass
65          3.2.1. Valid
66          3.2.2. Contextual Rule Required
67          3.2.3. Disallowed
68          3.2.4. Unassigned
69          3.2.5. Examples
70        3.3. FreeformClass
71          3.3.1. Valid
72          3.3.2. Contextual Rule Required
73          3.3.3. Disallowed
74          3.3.4. Unassigned
75          3.3.5. Examples
76      4. Profiles
77        4.1. Principles
78          4.1.1. Width Mapping
79          4.1.2. Additional Mappings
80          4.1.3. Case Mapping
81          4.1.4. Normalization
82          4.1.5. Directionality
83          4.1.6. Exclusions
84        4.2. Building Application-Layer Constructs
85        4.3. A Note about Spaces
86      5. Order of Operations
87      6. Code Point Properties
88      7. Category Definitions Used to Calculate Derived Property
89        7.1. LetterDigits (A)
90        7.2. Unstable (B)
91        7.3. IgnorableProperties (C)
92        7.4. IgnorableBlocks (D)
93        7.5. LDH (E)
94        7.6. Exceptions (F)
95        7.7. BackwardCompatible (G)
96        7.8. JoinControl (H)
97        7.9. OldHangulJamo (I)
98        7.10. Unassigned (J)
99        7.11. ASCII7 (K)
100       7.12. Controls (L)
101       7.13. PrecisIgnorableProperties (M)
102       7.14. Spaces (N)
103       7.15. Symbols (O)
104       7.16. Punctuation (P)
105       7.17. HasCompat (Q)
106       7.18. OtherLetterDigits (R)
107     8. IANA Considerations
108       8.1. PRECIS Derived Property Value Registry
109       8.2. PRECIS Base Classes Registry
110       8.3. PRECIS Profiles Registry
111     9. Security Considerations
112       9.1. General Issues
113       9.2. Use of the IdentifierClass
114       9.3. Use of the FreeformClass
115       9.4. Local Character Set Issues
116       9.5. Visually Similar Characters
117       9.6. Security of Passwords
118     10. Interoperability Considerations
119     11. References
120       11.1. Normative References
121       11.2. Informative References
122       11.3. URIs
123     Appendix A. Acknowledgements
124     Authors' Addresses

126  1. Introduction

128     As described in the problem statement for the preparation and
129     comparison of internationalized strings ("PRECIS") [RFC6885], many
130     IETF protocols have used the Stringprep framework [RFC3454] as the
131     basis for preparing and comparing protocol strings that contain
132     Unicode characters [Unicode7.0] outside the ASCII range [RFC20]. The
133     Stringprep framework was developed during work on the original
134     technology for internationalized domain names (IDNs), here called
135     "IDNA2003" [RFC3490], and Nameprep [RFC3491] was the Stringprep
136     profile for IDNs. At the time, Stringprep was designed as a general
137     framework so that other application protocols could define their own
138     Stringprep profiles for the preparation and comparison of strings and
139     identifiers. Indeed, a number of application protocols defined such
140     profiles.

142     After the publication of [RFC3454] in 2002, several significant
143     issues arose with the use of Stringprep in the IDN case, as
144     documented in the IAB's recommendations regarding IDNs [RFC4690]
145     (most significantly, Stringprep was tied to Unicode version 3.2).
146     Therefore, the newer IDNA specifications, here called "IDNA2008"
147     ([RFC5890], [RFC5891], [RFC5892], [RFC5893], [RFC5894]), no longer
148     use Stringprep and Nameprep. This migration away from Stringprep for
149     IDNs has prompted other "customers" of Stringprep to consider new
150     approaches to the preparation and comparison of internationalized
151     strings, as described in [RFC6885].

153     This document defines a framework for a post-Stringprep approach to
154     the preparation and comparison of internationalized strings in
155     application protocols, based on several principles:

157     1.  Define a small set of string classes that specify the Unicode
158         characters (i.e., specific "code points") appropriate for common
159         application protocol constructs.

161     2.  Define each PRECIS string class in terms of Unicode code points
162         and their properties so that an algorithm can be used to
163         determine whether each code point or character category is (a)
164         valid, (b) allowed in certain contexts, (c) disallowed, or (d)
165         unassigned.

167     3.  Use an "inclusion model" such that a string class consists only
168         of code points that are explicitly allowed, with the result that
169         any code point not explicitly allowed is forbidden.

171     4.  Enable application protocols to define profiles of the PRECIS
172         string classes, addressing matters such as width mapping, case
173         folding and other forms of character mapping, Unicode
174         normalization, directionality, and further excluded code points
175         or character categories.

177     Whereas the string classes define the "baseline" code points for a
178     range of applications, profiling enables application protocols to
179     further restrict the allowable code points beyond those specified for
180     the relevant string class (e.g., characters with special or reserved
181     meaning, such as "@" and "/" when used as separators within
182     identifiers) and to apply the string classes in ways that are
183     appropriate for constructs such as usernames and passwords
184     [I-D.ietf-precis-saslprepbis], nicknames [I-D.ietf-precis-nickname],
185     the localparts of instant messaging addresses
186     [I-D.ietf-xmpp-6122bis], and free-form strings
187     [I-D.ietf-xmpp-6122bis]. Profiles are responsible for defining the
188     handling of right-to-left characters as well as various mapping
189     operations of the kind also discussed for IDNs in [RFC5895], such as
190     case preservation or lowercasing, Unicode normalization, mapping of
191     certain characters to other characters or to nothing, and mapping of
192     full-width and half-width characters.

194     When an application applies a profile of a PRECIS string class, it
195     can achieve the following objectives:

197     a.  Determine if a given string conforms to the profile (e.g., to
198         determine if it is allowed for use in the relevant "slot"
199         specified by an application protocol).

201     b.  Determine if any two given strings are equivalent (e.g., to make
202         an access decision for purposes of authentication or
203         authorization as further described in [RFC6943]).

205     It is expected that this framework will yield the following benefits:

207     o  Application protocols will be agile with regard to Unicode
208        versions.

210     o  Implementers will be able to share code point tables and software
211        code across application protocols, most likely by means of
212        software libraries.

214     o  End users will be able to acquire more accurate expectations about
215        the characters that are acceptable in various contexts. Given
216        this more uniform set of string classes, it is also expected that
217        copy/paste operations between software implementing different
218        application protocols will be more predictable and coherent.

220     Although this framework is similar to IDNA2008 and borrows some of
221     the character categories defined in [RFC5892], it defines additional
222     character categories to meet the needs of common application
223     protocols.

225     The character categories and calculation rules defined under
226     Section 7 and Section 6 are normative and apply to all Unicode code
227     points. The code point table that results from applying the
228     character categories and calculation rules to the latest version of
229     Unicode is provided in an IANA registry.

231  2. Terminology

233     Many important terms used in this document are defined in [RFC5890],
234     [RFC6365], [RFC6885], and [Unicode7.0]. The terms "left-to-right"
235     (LTR) and "right-to-left" (RTL) are defined in Unicode Standard Annex
236     #9 [UAX9].

238     As of the date of writing, the version of Unicode published by the
239     Unicode Consortium is 7.0 [Unicode7.0]; however, PRECIS is not tied
240     to a specific version of Unicode. The latest version of Unicode is
241     always available [UnicodeCurrent].

243     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
244     "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
245     "OPTIONAL" in this document are to be interpreted as described in
246     [RFC2119].

248  3. String Classes

250  3.1. Overview

252     Starting in 2010, various "customers" of Stringprep began to discuss
253     the need to define a post-Stringprep approach to the preparation and
254     comparison of internationalized strings other than IDNs. This
255     community analyzed the existing Stringprep profiles and also weighed
256     the costs and benefits of defining a relatively small set of Unicode
257     characters that would minimize the potential for user confusion
258     caused by visually similar characters (and thus be relatively "safe")
259     vs. defining a much larger set of Unicode characters that would
260     maximize the potential for user creativity (and thus be relatively
261     "expressive"). As a result, the community concluded that most
262     existing uses could be addressed by two string classes:

264     IdentifierClass: a sequence of letters, numbers, and some symbols
265        that is used to identify or address a network entity such as a
266        user account, a venue (e.g., a chatroom), an information source
267        (e.g., a data feed), or a collection of data (e.g., a file); the
268        intent is that this class will minimize user confusion in a wide
269        variety of application protocols, with the result that safety has
270        been prioritized over expressiveness for this class.

272     FreeformClass: a sequence of letters, numbers, symbols, spaces, and
273        other characters that is used for free-form strings, including
274        passwords as well as display elements such as human-friendly
275        nicknames in chatrooms; the intent is that this class will allow
276        nearly any Unicode character, with the result that expressiveness
277        has been prioritized over safety for this class (e.g., protocol
278        designers, application developers, service providers, and end
279        users might not understand or be able to enter all of the
280        characters that can be included in the FreeformClass - see
281        Section 9.3 for details).

283     Future specifications might define additional PRECIS string classes,
284     such as a class that falls somewhere between the IdentifierClass and
285     the FreeformClass. At this time, it is not clear how useful such a
286     class would be. In any case, because application developers are able
287     to define profiles of PRECIS string classes, a protocol needing a
288     construct between the IdentifierClass and the FreeformClass could
289     define a restricted profile of the FreeformClass if needed.

291     The following subsections discuss the IdentifierClass and
292     FreeformClass in more detail, with reference to the dimensions
293     described in Section 3 of [RFC6885]. Each string class is defined by
294     the following behavioral rules:

296     Valid: Defines which code points and character categories are
297        treated as valid input to the string.

299     Contextual Rule Required: Defines which code points and character
300        categories are treated as allowed only if the requirements of a
301        contextual rule are met (i.e., either CONTEXTJ or CONTEXTO).

303     Disallowed: Defines which code points and character categories need
304        to be excluded from the string.

306     Unassigned: Defines application behavior in the presence of code
307        points that are unknown (i.e., not yet designated) for the version
308        of Unicode used by the application.

310     This document defines the valid, contextual rule required,
311     disallowed, and unassigned rules for the IdentifierClass and
312     FreeformClass. As described under Section 4, profiles of these
313     string classes are responsible for defining the width mapping,
314     additional mappings, case mapping, normalization, directionality, and
315     exclusion rules.

317  3.2. IdentifierClass

319     Most application technologies need strings that can be used to refer
320     to, include, or communicate protocol strings like usernames, file
321     names, data feed identifiers, and chatroom names. We group such
322     strings into a class called "IdentifierClass" having the following
323     features.

325  3.2.1. Valid

327     o  Code points traditionally used as letters and numbers in writing
328        systems, i.e., the LetterDigits ("A") category first defined in
329        [RFC5892] and listed here under Section 7.1.

331     o  Code points in the range U+0021 through U+007E, i.e., the
332        (printable) ASCII7 ("K") rule defined under Section 7.11. These
333        code points are "grandfathered" into PRECIS and thus are valid
334        even if they would otherwise be disallowed according to the
335        property-based rules specified in the next section.
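
The two bullets above can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this section: the general categories used for LetterDigits (Ll, Lu, Lo, Nd, Lm, Mn, Mc) are taken from Section 2.1 of [RFC5892], the function names are hypothetical, and the result reflects whichever Unicode version ships with the Python interpreter. The sketch covers only these two bullets; it does not apply the Disallowed categories of the next section or the precedence among rules, so it is not a complete IdentifierClass check.

```python
import unicodedata

# Assumption: LetterDigits (A) general categories as listed in RFC 5892, Section 2.1.
LETTER_DIGIT_CATEGORIES = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_ascii7(ch: str) -> bool:
    """ASCII7 (K): printable ASCII, U+0021 through U+007E."""
    return 0x21 <= ord(ch) <= 0x7E


def is_letter_digit(ch: str) -> bool:
    """LetterDigits (A): letters, digits, and combining marks."""
    return unicodedata.category(ch) in LETTER_DIGIT_CATEGORIES


def matches_valid_bullets(ch: str) -> bool:
    """True if ch falls under either 'Valid' bullet of Section 3.2.1.

    Hypothetical helper: it ignores the Disallowed categories and the
    rule precedence, so a True result is necessary but not sufficient.
    """
    return is_ascii7(ch) or is_letter_digit(ch)


if __name__ == "__main__":
    for ch in ["a", "@", " ", "\u00e9", "\u20ac"]:
        print(f"U+{ord(ch):04X}", matches_valid_bullets(ch))
```

Note that "@" matches only through the ASCII7 rule: its general category is punctuation, which is why the text describes such code points as "grandfathered".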

337        Note: Although the PRECIS IdentifierClass re-uses the LetterDigits
338        category from IDNA2008, the range of characters allowed in the
339        IdentifierClass is wider than the range of characters allowed in
340        IDNA2008. The main reason is that IDNA2008 applies the Unstable
341        category before the LetterDigits category, thus disallowing
342        uppercase characters, whereas the IdentifierClass does not apply
343        the Unstable category.

345  3.2.2. Contextual Rule Required

347     o  A number of characters from the Exceptions ("F") category defined
348        under Section 7.6 (see Section 7.6 for a full list).

350     o  Joining characters, i.e., the JoinControl ("H") category defined
351        under Section 7.8.

353  3.2.3. Disallowed

355     o  Old Hangul Jamo characters, i.e., the OldHangulJamo ("I") category
356        defined under Section 7.9.

358     o  Control characters, i.e., the Controls ("L") category defined
359        under Section 7.12.

361     o  Ignorable characters, i.e., the PrecisIgnorableProperties ("M")
362        category defined under Section 7.13.

364     o  Space characters, i.e., the Spaces ("N") category defined under
365        Section 7.14.

367     o  Symbol characters, i.e., the Symbols ("O") category defined under
368        Section 7.15.

370     o  Punctuation characters, i.e., the Punctuation ("P") category
371        defined under Section 7.16.

373     o  Any character that has a compatibility equivalent, i.e., the
374        HasCompat ("Q") category defined under Section 7.17. These code
375        points are disallowed even if they would otherwise be valid
376        according to the property-based rules specified in the previous
377        section.

379     o  Letters and digits other than the "traditional" letters and digits
380        allowed in IDNs, i.e., the OtherLetterDigits ("R") category
381        defined under Section 7.18.

383  3.2.4. Unassigned

385     Any code points that are not yet designated in the Unicode character
386     set are considered Unassigned for purposes of the IdentifierClass,
387     and such code points are to be treated as Disallowed.

389  3.2.5. Examples

391     As described in the Introduction to this document, the string classes
392     do not handle all issues related to string preparation and comparison
393     (such as case mapping); instead, such issues are handled at the level
394     of profiles. Examples for two profiles of the IdentifierClass can be
395     found in [I-D.ietf-precis-saslprepbis] (the UsernameIdentifierClass
396     profile) and in [I-D.ietf-xmpp-6122bis] (the JIDlocalIdentifierClass
397     profile).

399  3.3. FreeformClass

401     Some application technologies need strings that can be used in a
402     free-form way, e.g., as a password in an authentication exchange (see
403     [I-D.ietf-precis-saslprepbis]) or a nickname in a chatroom (see
404     [I-D.ietf-precis-nickname]). We group such things into a class
405     called "FreeformClass" having the following features.

407        Security Warning: As mentioned, the FreeformClass prioritizes
408        expressiveness over safety; Section 9.3 describes some of the
409        security hazards involved with using or profiling the
410        FreeformClass.

412        Security Warning: Consult Section 9.6 for relevant security
413        considerations when strings conforming to the FreeformClass, or a
414        profile thereof, are used as passwords.

416  3.3.1. Valid

418     o  Traditional letters and numbers, i.e., the LetterDigits ("A")
419        category first defined in [RFC5892] and listed here under
420        Section 7.1.

422     o  Letters and digits other than the "traditional" letters and digits
423        allowed in IDNs, i.e., the OtherLetterDigits ("R") category
424        defined under Section 7.18.

426     o  Code points in the range U+0021 through U+007E, i.e., the
427        (printable) ASCII7 ("K") rule defined under Section 7.11.

429     o  Any character that has a compatibility equivalent, i.e., the
430        HasCompat ("Q") category defined under Section 7.17.

432     o  Space characters, i.e., the Spaces ("N") category defined under
433        Section 7.14.

435     o  Symbol characters, i.e., the Symbols ("O") category defined under
436        Section 7.15.

438     o  Punctuation characters, i.e., the Punctuation ("P") category
439        defined under Section 7.16.

441  3.3.2. Contextual Rule Required

443     o  A number of characters from the Exceptions ("F") category defined
444        under Section 7.6 (see Section 7.6 for a full list).

446     o  Joining characters, i.e., the JoinControl ("H") category defined
447        under Section 7.8.

449  3.3.3. Disallowed

451     o  Old Hangul Jamo characters, i.e., the OldHangulJamo ("I") category
452        defined under Section 7.9.

454     o  Control characters, i.e., the Controls ("L") category defined
455        under Section 7.12.

457     o  Ignorable characters, i.e., the PrecisIgnorableProperties ("M")
458        category defined under Section 7.13.

460  3.3.4. Unassigned

462     Any code points that are not yet designated in the Unicode character
463     set are considered Unassigned for purposes of the FreeformClass, and
464     such code points are to be treated as Disallowed.

466  3.3.5. Examples

468     As described in the Introduction to this document, the string classes
469     do not handle all issues related to string preparation and comparison
470     (such as case mapping); instead, such issues are handled at the level
471     of profiles. Examples for two profiles of the FreeformClass can be
472     found in [I-D.ietf-precis-nickname] (the NicknameFreeformClass
473     profile) and in [I-D.ietf-xmpp-6122bis] (the
474     JIDresourceIdentifierClass profile).

476  4. Profiles

478  4.1. Principles

480     This framework document defines the valid, contextual-rule-required,
481     disallowed, and unassigned rules for the IdentifierClass and the
482     FreeformClass. A profile of a PRECIS string class MUST define the
483     width mapping, additional mappings (if any), case mapping,
484     normalization, directionality, and exclusion rules. A profile MAY
485     also restrict the allowable characters above and beyond the
486     definition of the relevant PRECIS string class (but MUST NOT add as
487     valid any code points or character categories that are disallowed by
488     the relevant PRECIS string class). These matters are discussed in
489     the following subsections.

491     Profiles of the PRECIS string classes are registered with the IANA as
492     described under Section 8.3. Profile names use the following
493     convention: they are of the form "ProfilenameBaseClass", where the
494     "Profilename" string is a differentiator and "BaseClass" is the name
495     of the PRECIS string class being profiled; for example, the profile
496     of the IdentifierClass used for localparts of Jabber Identifiers
497     (JIDs) in the Extensible Messaging and Presence Protocol (XMPP) is
498     named "JIDlocalIdentifierClass" [I-D.ietf-xmpp-6122bis].

500  4.1.1. Width Mapping

502     The width mapping rule of a profile specifies whether width mapping
503     is performed on fullwidth and halfwidth characters, and how the
504     mapping is done. Typically such mapping consists of mapping
505     fullwidth and halfwidth characters, i.e., code points with a
506     Decomposition Type of Wide or Narrow, to their decomposition
507     mappings; as an example, FULLWIDTH DIGIT ZERO (U+FF10) would be
508     mapped to DIGIT ZERO (U+0030).

510     The normalization form specified by a profile (see below) has an
511     impact on the need for width mapping. Because width mapping is
512     performed as a part of compatibility decomposition, a profile
513     employing either normalization form KD (NFKD) or normalization form
514     KC (NFKC) does not need to specify width mapping. However, if
515     Unicode normalization form C (NFC) is used then the profile needs to
516     specify whether to apply width mapping; in this case, width mapping
517     is in general RECOMMENDED because allowing fullwidth and halfwidth
518     characters to remain unmapped to their compatibility variants would
519     violate the principle of least user surprise. For more information
520     about the concept of width in East Asian scripts within Unicode, see
521     Unicode Standard Annex #11 [UAX11].

523  4.1.2. Additional Mappings

525     The additional mappings rule of a profile specifies whether
526     additional mappings are to be applied, such as mapping of delimiter
527     characters and mapping of special characters (e.g., non-ASCII space
528     characters to ASCII space or certain characters to nothing).

530  4.1.3. Case Mapping

532     The case mapping rule of a profile specifies whether case mapping is
533     performed (instead of case preservation) on uppercase and titlecase
534     characters, and how the mapping is done (e.g., mapping uppercase and
535     titlecase characters to their lowercase equivalents).

537     If case mapping is desired (instead of case preservation), it is
538     RECOMMENDED to use Unicode Default Case Folding as defined in Chapter
539     3 of the Unicode Standard [Unicode7.0].

541        Note: Unicode Default Case Folding is not designed to handle
542        various localization issues (such as so-called "dotless i" in
543        several Turkic languages). The PRECIS mappings document
544        [I-D.ietf-precis-mappings] describes these issues in greater
545        detail and defines a "local case mapping" method that handles some
546        locale-dependent and context-dependent mappings.
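
As an illustration of the case folding recommended above, the following Python sketch uses the built-in str.casefold(), which applies the locale-independent full case folding described in Chapter 3 of the Unicode Standard. The helper name is hypothetical, and the sketch deliberately omits width mapping and normalization, which a real profile would also apply.

```python
def equal_after_case_folding(a: str, b: str) -> bool:
    """Compare two strings after default (locale-independent) case folding.

    Hypothetical helper for illustration; a full PRECIS comparison would
    also apply the profile's other mappings and normalization.
    """
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # Folding is more aggressive than lowercasing: sharp s folds to "ss".
    print(equal_after_case_folding("Stra\u00dfe", "STRASSE"))
    # LATIN SMALL LETTER DOTLESS I (U+0131) is left unchanged by default
    # folding, so it does not match "I" the way a Turkish user might expect.
    print(equal_after_case_folding("\u0131", "I"))
```

The second comparison shows the kind of localization gap that the Note describes: default folding treats "I" and dotless "i" as distinct, and only a locale-aware mapping could reconcile them.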

548     In order to maximize entropy and minimize the potential for false
549     positives, it is NOT RECOMMENDED for application protocols to map
550     uppercase and titlecase code points to their lowercase equivalents
551     when strings conforming to the FreeformClass, or a profile thereof,
552     are used in passwords; instead, it is RECOMMENDED to preserve the
553     case of all code points contained in such strings and then perform
554     case-sensitive comparison. See also the related discussion in
555     [I-D.ietf-precis-saslprepbis].

557  4.1.4. Normalization

559     The normalization rule of a profile specifies which Unicode
560     normalization form (D, KD, C, or KC) is to be applied (see Unicode
561     Standard Annex #15 [UAX15] for background information).

563     In accordance with [RFC5198], normalization form C (NFC) is
564     RECOMMENDED.

566  4.1.5. Directionality

568     The directionality rule of a profile specifies how to treat strings
569     containing left-to-right (LTR) and right-to-left (RTL) characters
570     (see Unicode Standard Annex #9 [UAX9]). A profile usually specifies
571     a directionality rule that restricts strings to be entirely LTR
572     strings or entirely RTL strings and defines the allowable sequences
573     of characters in LTR and RTL strings. Possible rules include, but
574     are not limited to, (a) considering any string that contains a right-
575     to-left code point to be a right-to-left string, or (b) applying the
576     "Bidi Rule" from [RFC5893].

578     Mixed-direction strings are not directly supported by the PRECIS
579     framework itself, since there is currently no widely accepted and
580     implemented solution for the safe display of mixed-direction strings.
581 An application protocol that uses the PRECIS framework (or an 582 extension to the framework) could define better ways to present 583 mixed-direction strings; however, that work is outside the scope of 584 this framework and would likely require a great deal of careful 585 research into the problems of displaying bidirectional text. 587 4.1.6. Exclusions 589 The exclusions rule of a profile specifies whether the profile 590 excludes additional code points or character categories above and 591 beyond those excluded by the string class being profiled. That is, a 592 profile MAY do either of the following: 594 1. Exclude specific code points that are allowed by the relevant 595 string class. 597 2. Exclude characters matching certain Unicode properties (e.g., 598 math symbols) that are included in the relevant PRECIS string 599 class. 601 As a result of such exclusions, code points that are defined as valid 602 for the PRECIS string class being profiled will be defined as 603 disallowed for the profile. 605 4.2. Building Application-Layer Constructs 607 Sometimes, an application-layer construct does not map in a 608 straightforward manner to one of the base string classes or a profile 609 thereof. Consider, for example, the "simple user name" construct in 610 the Simple Authentication and Security Layer (SASL) [RFC4422]. 611 Depending on the deployment, a simple user name might take the form 612 of a user's full name (e.g., the user's personal name followed by a 613 space and then the user's family name). 
   Such a simple user name cannot be defined as an instance of the
   IdentifierClass or a profile thereof, since space characters are not
   allowed in the IdentifierClass; however, it could be defined using a
   space-separated sequence of IdentifierClass instances, as in the
   following pseudo-ABNF [RFC5234]:

      fullname = namepart *(1*SP namepart)
      namepart = 1*idpoint
      ;
      ; an "idpoint" is a UTF-8 encoded Unicode code point
      ; that conforms to the PRECIS IdentifierClass

   Similar techniques could be used to define many application-layer
   constructs, say of the form "user@domain" or "/path/to/file".

4.3. A Note about Spaces

   With regard to the IdentifierClass, the consensus of the PRECIS
   Working Group was that spaces are problematic for many reasons,
   including:

   o  Many Unicode characters are confusable with ASCII space.

   o  Even if non-ASCII space characters are mapped to ASCII space
      (U+0020), space characters are often not rendered in user
      interfaces, leading to the possibility that a human user might
      consider a string containing spaces to be equivalent to the same
      string without spaces.

   o  In some locales, some devices are known to generate a character
      other than ASCII space (such as ZERO WIDTH JOINER, U+200D) when a
      user performs an action such as hitting the space bar on a
      keyboard.

   One consequence of disallowing space characters in the
   IdentifierClass might be to effectively discourage their use within
   identifiers created in newer application protocols; given the
   challenges involved in properly handling space characters (especially
   non-ASCII space characters) in identifiers and other protocol
   strings, the Working Group considered this to be a feature, not a
   bug.

   However, the FreeformClass does allow spaces, which enables
   application protocols to define profiles of the FreeformClass that
   are more flexible than any profiles of the IdentifierClass.
   In addition, as explained in the previous section, application
   protocols can also define application-layer constructs containing
   spaces.

5. Order of Operations

   To ensure proper comparison, the following order of operations is
   REQUIRED:

   1. Width mapping

   2. Optionally, additional mappings such as mapping of delimiters
      (e.g., characters such as '@', ':', '/', '+', and '-') and
      special handling of certain characters or classes of characters
      (e.g., mapping of non-ASCII spaces to ASCII space or mapping of
      control characters to nothing); the PRECIS mappings document
      [I-D.ietf-precis-mappings] describes such mappings in more detail

   3. Case mapping as described under Section 4.1.3 of this document

   4. Normalization

   5. Behavioral rules for determining whether a code point is valid,
      allowed under a contextual rule, disallowed, or unassigned

   As already described, the width mapping, additional mappings, case
   mapping, and normalization operations are specified for each profile,
   whereas the behavioral rules are specified for each string class.
   Some of the logic behind this order is provided under Section 4.1.1
   (see also the PRECIS mappings document [I-D.ietf-precis-mappings]).

6. Code Point Properties

   In order to implement the string classes described above, this
   document does the following:

   1. Reviews and classifies the collections of code points in the
      Unicode character set by examining various code point properties.

   2. Defines an algorithm for determining a derived property value,
      which can vary depending on the string class being used by the
      relevant application protocol.

   This document is not intended to specify precisely how derived
   property values are to be applied in protocol strings. That
   information is the responsibility of the protocol specification that
   uses or profiles a PRECIS string class from this document.
   The value of the property is to be interpreted as follows.

   PROTOCOL VALID  Those code points that are allowed to be used in any
      PRECIS string class (currently, IdentifierClass and
      FreeformClass). Code points with this property value are
      permitted for general use in any string class. The abbreviated
      term "PVALID" is used to refer to this value in the remainder of
      this document.

   SPECIFIC CLASS PROTOCOL VALID  Those code points that are allowed to
      be used in specific string classes. Code points with this
      property value are permitted for use in specific string classes.

      In the remainder of this document, the abbreviated term *_PVAL is
      used, where * = (ID | FREE), i.e., either "FREE_PVAL" or
      "ID_PVAL". In practice, the derived property ID_PVAL is not used
      in this specification, since every ID_PVAL code point is PVALID.

   CONTEXTUAL RULE REQUIRED  Some characteristics of the character, such
      as its being invisible in certain contexts or problematic in
      others, require that it not be used in labels unless specific
      other characters or properties are present. As in IDNA2008, there
      are two subdivisions of CONTEXTUAL RULE REQUIRED, the first for
      Join_controls (called "CONTEXTJ") and the second for other
      characters (called "CONTEXTO"). A character with the derived
      property value CONTEXTJ or CONTEXTO MUST NOT be used unless an
      appropriate rule has been established and the context of the
      character is consistent with that rule. The most notable of the
      CONTEXTUAL RULE REQUIRED characters are the Join Control
      characters U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
      NON-JOINER, which have a derived property value of CONTEXTJ. See
      Appendix A of [RFC5892] for more information.

   DISALLOWED  Those code points that are not permitted in any PRECIS
      string class.

   SPECIFIC CLASS DISALLOWED  Those code points that are not to be
      included in a specific string class.
      Code points with this property value are not permitted in one of
      the string classes but might be permitted in others. In the
      remainder of this document, the abbreviated term *_DIS is used,
      where * = (ID | FREE), i.e., either "FREE_DIS" or "ID_DIS". In
      practice, the derived property FREE_DIS is not used in this
      specification, since every FREE_DIS code point is DISALLOWED.

   UNASSIGNED  Those code points that are not designated (i.e., are
      unassigned) in the Unicode Standard.

   To summarize, the assigned values of the derived property are:

   o  PVALID

   o  FREE_PVAL

   o  CONTEXTJ

   o  CONTEXTO

   o  DISALLOWED

   o  UNASSIGNED

   The algorithm to calculate the value of the derived property is as
   follows:

   If .cp. .in. Exceptions Then Exceptions(cp);
   Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp);
   Else If .cp. .in. Unassigned Then UNASSIGNED;
   Else If .cp. .in. ASCII7 Then PVALID;
   Else If .cp. .in. JoinControl Then CONTEXTJ;
   Else If .cp. .in. OldHangulJamo Then DISALLOWED;
   Else If .cp. .in. PrecisIgnorableProperties Then DISALLOWED;
   Else If .cp. .in. Controls Then DISALLOWED;
   Else If .cp. .in. HasCompat Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. LetterDigits Then PVALID;
   Else If .cp. .in. OtherLetterDigits Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. Spaces Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. Symbols Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. Punctuation Then ID_DIS or FREE_PVAL;
   Else DISALLOWED;

   The value of the derived property calculated can depend on the string
   class; for example, if an identifier used in an application protocol
   is defined as profiling the PRECIS IdentifierClass then a space
   character such as U+0020 would be assigned to ID_DIS, whereas if an
   identifier is defined as profiling the PRECIS FreeformClass then the
   character would be assigned to FREE_PVAL.
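
   As an informal illustration only, the ordered tests above might be
   sketched in Python as follows. This is not a normative or complete
   implementation. The names derived_property, EXCEPTIONS, and
   BACKWARD_COMPATIBLE are hypothetical; the Exceptions table is a
   two-entry subset of the one in [RFC5892]; the check for
   PrecisIgnorableProperties is a coarse approximation, because
   Python's unicodedata module does not expose
   Default_Ignorable_Code_Point; and the results reflect whatever
   Unicode version the Python runtime ships, not necessarily 7.0.

```python
import unicodedata

# Hypothetical, incomplete stand-ins for the RFC 5892 tables.
EXCEPTIONS = {0x00B7: "CONTEXTO", 0x00DF: "PVALID"}  # illustrative subset only
BACKWARD_COMPATIBLE = {}  # assumed empty for this sketch

LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
OTHER_LETTER_DIGITS = {"Lt", "Nl", "No", "Me"}
SYMBOLS = {"Sm", "Sc", "Sk", "So"}
PUNCTUATION = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"}
# Code points with Hangul_Syllable_Type L, V, or T.
OLD_HANGUL_JAMO = [(0x1100, 0x11FF), (0xA960, 0xA97C),
                   (0xD7B0, 0xD7C6), (0xD7CB, 0xD7FB)]
VARIATION_SELECTORS = [(0xFE00, 0xFE0F), (0xE0100, 0xE01EF)]


def in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_noncharacter(cp):
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_precis_ignorable(cp, gc):
    # Approximation: format characters and variation selectors stand in
    # for Default_Ignorable_Code_Point, which unicodedata lacks.
    return (gc == "Cf" or in_ranges(cp, VARIATION_SELECTORS)
            or is_noncharacter(cp))


def derived_property(cp, string_class):
    """Return the derived property for cp.

    string_class is "identifier" or "freeform" (illustrative names).
    """
    ch = chr(cp)
    gc = unicodedata.category(ch)
    class_dependent = "ID_DIS" if string_class == "identifier" else "FREE_PVAL"
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    if cp in BACKWARD_COMPATIBLE:
        return BACKWARD_COMPATIBLE[cp]
    if gc == "Cn" and not is_noncharacter(cp):  # Unassigned
        return "UNASSIGNED"
    if 0x21 <= cp <= 0x7E:                      # ASCII7
        return "PVALID"
    if cp in (0x200C, 0x200D):                  # JoinControl
        return "CONTEXTJ"
    if in_ranges(cp, OLD_HANGUL_JAMO):
        return "DISALLOWED"
    if is_precis_ignorable(cp, gc):
        return "DISALLOWED"
    if gc == "Cc":                              # Controls
        return "DISALLOWED"
    if unicodedata.normalize("NFKC", ch) != ch:  # HasCompat
        return class_dependent
    if gc in LETTER_DIGITS:
        return "PVALID"
    if gc in OTHER_LETTER_DIGITS or gc == "Zs" or gc in SYMBOLS \
            or gc in PUNCTUATION:
        return class_dependent
    return "DISALLOWED"
```

   With this sketch, U+0020 yields "ID_DIS" for the identifier case and
   "FREE_PVAL" for the freeform case, matching the example given above.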
   For the sake of brevity, the designation "FREE_PVAL" is used in the
   code point tables, instead of the longer designation "ID_DIS or
   FREE_PVAL". In practice, the derived properties ID_PVAL and FREE_DIS
   are not used in this specification, since every ID_PVAL code point is
   PVALID and every FREE_DIS code point is DISALLOWED.

   Use of the name of a rule (such as "Exceptions") implies the set of
   code points that the rule defines, whereas the same name as a
   function call (such as "Exceptions(cp)") implies the value that the
   code point has in the Exceptions table.

   The mechanisms described here allow determination of the value of the
   property for future versions of Unicode (including characters added
   after Unicode 5.2 or 7.0 depending on the category, since some
   categories in this document are reused from IDNA2008 and therefore
   were defined at the time of Unicode 5.2). Changes in Unicode
   properties that do not affect the outcome of this process therefore
   do not affect this framework. For example, a character can have its
   Unicode General_Category value (see Chapter 4 of the Unicode Standard
   [Unicode7.0]) change from So to Sm, or from Lo to Ll, without
   affecting the algorithm results. Moreover, even if such changes were
   to result, the BackwardCompatible list (Section 7.7) can be adjusted
   to ensure the stability of the results.

7. Category Definitions Used to Calculate Derived Property

   The derived property obtains its value based on a two-step procedure:

   1. Characters are placed in one or more character categories either
      (1) based on core properties defined by the Unicode Standard or
      (2) by treating the code point as an exception and addressing the
      code point based on its code point value. These categories are
      not mutually exclusive.

   2. Set operations are used with these categories to determine the
      values for a property specific to a given string class.
      These operations are specified under Section 6.

      Note: Unicode property names and property value names might have
      short abbreviations, such as "gc" for the General_Category
      property and "Ll" for the Lowercase_Letter property value of the
      gc property.

   In the following specification of character categories, the operation
   that returns the value of a particular Unicode character property for
   a code point is designated by using the formal name of that property
   (from the Unicode PropertyAliases.txt [1]) followed by '(cp)' for
   "code point". For example, the value of the General_Category
   property for a code point is indicated by General_Category(cp).

   The first ten categories (A-J) shown below were previously defined
   for IDNA2008 and are copied directly from [RFC5892] to ease the
   understanding of how PRECIS handles various characters. Some of
   these categories are reused in PRECIS and some of them are not;
   however, the lettering of categories is retained to prevent overlap
   and to ease implementation of both IDNA2008 and PRECIS in a single
   software application. The next eight categories (K-R) are specific
   to PRECIS.

7.1. LetterDigits (A)

   This category is defined in Section 2.1 of [RFC5892] and is included
   by reference for use in PRECIS.

7.2. Unstable (B)

   This category is defined in Section 2.2 of [RFC5892] but not used in
   PRECIS.

7.3. IgnorableProperties (C)

   This category is defined in Section 2.3 of [RFC5892] but not used in
   PRECIS.

   Note: See the "PrecisIgnorableProperties (M)" category below for a
   more inclusive category used in PRECIS identifiers.

7.4. IgnorableBlocks (D)

   This category is defined in Section 2.4 of [RFC5892] but not used in
   PRECIS.

7.5. LDH (E)

   This category is defined in Section 2.5 of [RFC5892] but not used in
   PRECIS.

   Note: See the "ASCII7 (K)" category below for a more inclusive
   category used in PRECIS identifiers.

7.6. Exceptions (F)

   This category is defined in Section 2.6 of [RFC5892] and is included
   by reference for use in PRECIS.

7.7. BackwardCompatible (G)

   This category is defined in Section 2.7 of [RFC5892] and is included
   by reference for use in PRECIS.

   Note: Because of how the PRECIS string classes are defined, only
   changes that would result in code points being added to or removed
   from the LetterDigits ("A") category would result in
   backward-incompatible modifications to code point assignments.
   Therefore, management of this category is handled via the processes
   specified in [RFC5892]. At the time of this writing (and also at the
   time that RFC 5892 was published), this category consisted of the
   empty set; however, that is subject to change as described in
   RFC 5892.

7.8. JoinControl (H)

   This category is defined in Section 2.8 of [RFC5892] and is included
   by reference for use in PRECIS.

7.9. OldHangulJamo (I)

   This category is defined in Section 2.9 of [RFC5892] and is included
   by reference for use in PRECIS.

7.10. Unassigned (J)

   This category is defined in Section 2.10 of [RFC5892] and is included
   by reference for use in PRECIS.

7.11. ASCII7 (K)

   This PRECIS-specific category consists of all printable, non-space
   characters from the 7-bit ASCII range. By applying this category,
   the algorithm specified under Section 6 exempts these characters from
   other rules that might be applied during PRECIS processing, on the
   assumption that these code points are in such wide use that
   disallowing them would be counter-productive.

   K: cp is in {0021..007E}

7.12. Controls (L)

   L: Control(cp) = True

7.13. PrecisIgnorableProperties (M)

   This PRECIS-specific category is used to group code points that are
   discouraged from use in PRECIS string classes.

   M: Default_Ignorable_Code_Point(cp) = True or
      Noncharacter_Code_Point(cp) = True

   The definition for Default_Ignorable_Code_Point can be found in the
   DerivedCoreProperties.txt [2] file, and at the time of Unicode 7.0 is
   as follows:

      Other_Default_Ignorable_Code_Point
      + Cf (Format characters)
      + Variation_Selector
      - White_Space
      - FFF9..FFFB (Annotation Characters)
      - 0600..0604, 06DD, 070F, 110BD (exceptional Cf characters
        that should be visible)

7.14. Spaces (N)

   This PRECIS-specific category is used to group code points that are
   space characters.

   N: General_Category(cp) is in {Zs}

7.15. Symbols (O)

   This PRECIS-specific category is used to group code points that are
   symbols.

   O: General_Category(cp) is in {Sm, Sc, Sk, So}

7.16. Punctuation (P)

   This PRECIS-specific category is used to group code points that are
   punctuation characters.

   P: General_Category(cp) is in {Pc, Pd, Ps, Pe, Pi, Pf, Po}

7.17. HasCompat (Q)

   This PRECIS-specific category is used to group code points that have
   compatibility equivalents as explained in Chapter 2 and Chapter 3 of
   the Unicode Standard [Unicode7.0].

   Q: toNFKC(cp) != cp

   The toNFKC() operation returns the code point in normalization form
   KC. For more information, see Section 5 of Unicode Standard Annex
   #15 [UAX15].

7.18. OtherLetterDigits (R)

   This PRECIS-specific category is used to group code points that are
   letters and digits other than the "traditional" letters and digits
   grouped under the LetterDigits (A) class (see Section 7.1).

   R: General_Category(cp) is in {Lt, Nl, No, Me}

8. IANA Considerations

8.1. PRECIS Derived Property Value Registry

   IANA is requested to create a PRECIS-specific registry with the
   Derived Properties for the versions of Unicode that are released
   after (and including) version 7.0.
   The derived property value is to be calculated in cooperation with a
   designated expert [RFC5226] according to the rules specified under
   Section 7 and Section 6.

   The IESG is to be notified if backward-incompatible changes to the
   table of derived properties are discovered or if other problems arise
   during the process of creating the table of derived property values
   or during expert review. Changes to the rules defined under
   Section 7 and Section 6 require IETF Review.

8.2. PRECIS Base Classes Registry

   IANA is requested to create a registry of PRECIS string classes. In
   accordance with [RFC5226], the registration policy is "RFC Required".

   The registration template is as follows:

   Base Class:  [the name of the PRECIS string class]

   Description:  [a brief description of the PRECIS string class and its
      intended use, e.g., "A sequence of letters, numbers, and symbols
      that is used to identify or address a network entity."]

   Specification:  [the RFC number]

   The initial registrations are as follows:

   Base Class:  FreeformClass.
   Description:  A sequence of letters, numbers, symbols, spaces, and
      other code points that is used for free-form strings.
   Specification:  Section 3.3 of this document.
      [Note to RFC Editor: please change "this document"
      to the RFC number issued for this specification.]

   Base Class:  IdentifierClass.
   Description:  A sequence of letters, numbers, and symbols that is
      used to identify or address a network entity.
   Specification:  Section 3.2 of this document.
      [Note to RFC Editor: please change "this document"
      to the RFC number issued for this specification.]

8.3. PRECIS Profiles Registry

   IANA is requested to create a registry of profiles that use the
   PRECIS string classes. In accordance with [RFC5226], the
   registration policy is "Expert Review".
   This policy was chosen in order to ease the burden of registration
   while ensuring that "customers" of PRECIS receive appropriate
   guidance regarding the sometimes complex and subtle
   internationalization issues related to profiles of PRECIS string
   classes.

   The registration template is as follows:

   Name:  [the name of the profile]

   Applicability:  [the specific protocol elements to which this profile
      applies, e.g., "Localparts in XMPP addresses."]

   Base Class:  [which PRECIS string class is being profiled]

   Replaces:  [the Stringprep profile that this PRECIS profile replaces,
      if any]

   Width Mapping:  [the behavioral rule for handling of width, e.g.,
      "Map fullwidth and halfwidth characters to their compatibility
      variants."]

   Additional Mappings:  [any additional mappings that are required or
      recommended, e.g., "Map non-ASCII space characters to ASCII
      space."]

   Case Mapping:  [the behavioral rule for handling of case, e.g.,
      "Unicode Default Case Folding"]

   Normalization:  [which Unicode normalization form is applied, e.g.,
      "NFC"]

   Directionality:  [the behavioral rule for handling of right-to-left
      code points, e.g., "The 'Bidi Rule' defined in RFC 5893 applies."]

   Exclusions:  [a brief description of the specific code points or
      character categories that are excluded, e.g., "Eight legacy
      characters in the ASCII range" or "Any character that has a
      compatibility equivalent, i.e., the HasCompat category"]

   Enforcement:  [which entities enforce the rules, and when that
      enforcement occurs during protocol operations]

   Specification:  [a pointer to relevant documentation, such as an RFC
      or Internet-Draft]

   In order to request a review, the registrant shall send a completed
   template to the precis@ietf.org list or its designated successor.
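
   To make the relationship between the template fields and the order
   of operations in Section 5 more concrete, the following Python
   fragment is offered as a rough, non-normative sketch. The names
   Profile, map_width, prepare, and toy_identifier_allowed are all
   hypothetical, and the toy behavioral check covers only printable
   ASCII plus the LetterDigits general categories rather than the full
   IdentifierClass rules. Width mapping is assumed here to mean
   replacing characters whose Unicode decomposition is tagged <wide> or
   <narrow>, and case mapping is assumed to be Python's str.casefold().

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Profile:
    """Hypothetical record mirroring a few registration template fields."""
    name: str
    base_class: str
    width_mapping: bool = True
    additional_mappings: Dict[str, str] = field(default_factory=dict)
    case_fold: bool = False
    normalization: str = "NFC"


def map_width(s: str) -> str:
    """Replace fullwidth and halfwidth characters with their mappings."""
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<wide>", "<narrow>")):
            out.append("".join(chr(int(h, 16)) for h in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)


def prepare(s: str, profile: Profile,
            is_allowed: Callable[[int], bool]) -> str:
    """Apply the rules in the order required by Section 5."""
    if profile.width_mapping:                        # 1. width mapping
        s = map_width(s)
    if profile.additional_mappings:                  # 2. additional mappings
        s = s.translate(str.maketrans(profile.additional_mappings))
    if profile.case_fold:                            # 3. case mapping
        s = s.casefold()
    s = unicodedata.normalize(profile.normalization, s)  # 4. normalization
    for ch in s:                                     # 5. behavioral rules
        if not is_allowed(ord(ch)):
            raise ValueError("disallowed code point U+%04X" % ord(ch))
    return s


def toy_identifier_allowed(cp: int) -> bool:
    # Hypothetical stand-in for the IdentifierClass behavioral rules.
    return 0x21 <= cp <= 0x7E or unicodedata.category(chr(cp)) in {
        "Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
```

   For instance, with Profile(name="ExampleUsername",
   base_class="IdentifierClass", case_fold=True), the input consisting
   of FULLWIDTH LATIN CAPITAL LETTER J (U+FF2A) followed by "uliet" is
   prepared as "juliet", whereas "a b" is rejected because the toy check
   does not allow the space.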

   Factors to focus on while defining profiles and reviewing profile
   registrations include the following:

   o  Is the problem being addressed by this profile well-defined?

   o  Does the specification define what kinds of applications are
      involved and the protocol elements to which this profile applies?

   o  Would an existing PRECIS string class or profile solve the
      problem?

   o  Is the profile clearly defined?

   o  Is the profile based on an appropriate dividing line between user
      interface (culture, context, intent, locale, device limitations,
      etc.) and the use of conformant strings in protocol elements?

   o  Are the width mapping, case mapping, additional mappings,
      normalization, exclusion, and directionality rules appropriate for
      the intended use?

   o  Does the profile explain which entities enforce the rules, and
      when such enforcement occurs during protocol operations?

   o  Does the profile reduce the degree to which human users could be
      surprised or confused by application behavior (the "principle of
      least user surprise")?

   o  Does the profile introduce any new security concerns such as those
      described under Section 9 of this document (e.g., false positives
      for authentication or authorization)?

9. Security Considerations

9.1. General Issues

   If input strings that appear "the same" to users are programmatically
   considered to be distinct in different systems, or if input strings
   that appear distinct to users are programmatically considered to be
   "the same" in different systems, then users can be confused. Such
   confusion can have security implications, such as the false positives
   and false negatives discussed in [RFC6943]. One starting goal of
   work on the PRECIS framework was to limit the number of times that
   users are confused (consistent with the "principle of least
   astonishment").
   Unfortunately, this goal has been difficult to achieve given the
   large number of application protocols already in existence, each
   with its own conventions regarding allowable characters (see for
   example [I-D.saintandre-username-interop] with regard to various
   username constructs). Despite these difficulties, profiles should
   not be multiplied beyond necessity. In particular, application
   protocol designers should think long and hard before defining a new
   profile instead of using one that has already been defined, and if
   they decide to define a new profile then they should clearly explain
   their reasons for doing so.

   The security of applications that use this framework can depend in
   part on the proper preparation and comparison of internationalized
   strings. For example, such strings can be used to make
   authentication and authorization decisions, and the security of an
   application could be compromised if an entity providing a given
   string is connected to the wrong account or online resource based on
   different interpretations of the string.

   Specifications of application protocols that use this framework are
   strongly encouraged to describe how internationalized strings are
   used in the protocol, including the security implications of any
   false positives and false negatives that might result from various
   comparison operations. For some helpful guidelines, refer to
   [RFC6943], [RFC5890], [UTR36], and [UTS39].

9.2. Use of the IdentifierClass

   Strings that conform to the IdentifierClass and any profile thereof
   are intended to be relatively safe for use in a broad range of
   applications, primarily because they include only letters, digits,
   and "grandfathered" non-space characters from the ASCII range; thus
   they exclude spaces, characters with compatibility equivalents, and
   almost all symbols and punctuation marks.
   However, because such strings can still include so-called confusable
   characters (see Section 9.5), protocol designers and implementers
   are encouraged to pay close attention to the security considerations
   described elsewhere in this document.

9.3. Use of the FreeformClass

   Strings that conform to the FreeformClass and many profiles thereof
   can include virtually any Unicode character. This makes the
   FreeformClass quite expressive, but also problematic from the
   perspective of possible user confusion. Protocol designers are
   hereby warned that the FreeformClass contains code points they might
   not understand, and are encouraged to profile the IdentifierClass
   wherever feasible; however, if an application protocol requires more
   code points than are allowed by the IdentifierClass, protocol
   designers are encouraged to define a profile of the FreeformClass
   that restricts the allowable code points as tightly as possible.
   (The PRECIS Working Group considered the option of allowing
   superclasses as well as profiles of PRECIS string classes, but
   decided against allowing superclasses to reduce the likelihood of
   security and interoperability problems.)

9.4. Local Character Set Issues

   When systems use local character sets other than ASCII and Unicode,
   this specification leaves the problem of converting between the local
   character set and Unicode up to the application or local system. If
   different applications (or different versions of one application)
   implement different rules for conversions among coded character sets,
   they could interpret the same name differently and contact different
   application servers or other network entities.
   This problem is not solved by security protocols, such as Transport
   Layer Security (TLS) [RFC5246] and the Simple Authentication and
   Security Layer (SASL) [RFC4422], that do not take local character
   sets into account.

9.5. Visually Similar Characters

   Some characters are visually similar and thus can cause confusion
   among humans. Such characters are often called "confusable
   characters" or "confusables".

   The problem of confusable characters is not necessarily caused by the
   use of Unicode code points outside the ASCII range. For example, in
   some presentations and to some individuals the string "ju1iet"
   (spelled with DIGIT ONE, U+0031, as the third character) might appear
   to be the same as "juliet" (spelled with LATIN SMALL LETTER L,
   U+006C), especially on casual visual inspection. This phenomenon is
   sometimes called "typejacking".

   However, the problem is made more serious by introducing the full
   range of Unicode code points into protocol strings. For example, the
   characters U+13DA U+13A2 U+13B5 U+13AC U+13A2 U+13AC U+13D2 from the
   Cherokee block look similar to the ASCII characters "STPETER" as they
   might appear when presented using a "creative" font family.

   In some examples of confusable characters, it is unlikely that the
   average human could tell the difference between the real string and
   the fake string. (Indeed, there is no programmatic way to
   distinguish with full certainty which is the fake string and which is
   the real string; in some contexts, the string formed of Cherokee
   characters might be the real string and the string formed of ASCII
   characters might be the fake string.) Because PRECIS-compliant
   strings can contain almost any properly-encoded Unicode code point,
   it can be relatively easy to fake or mimic some strings in systems
   that use the PRECIS framework.
   The fact that some strings are easily confused introduces security
   vulnerabilities of the kind that have also plagued the World Wide
   Web, specifically the phenomenon known as phishing.

   Despite the fact that some specific suggestions about identification
   and handling of confusable characters appear in the Unicode Security
   Considerations [UTR36] and the Unicode Security Mechanisms [UTS39],
   it is also true (as noted in [RFC5890]) that "there are no
   comprehensive technical solutions to the problems of confusable
   characters". Because it is impossible to map visually similar
   characters without a great deal of context (such as knowing the font
   families used), the PRECIS framework does nothing to map
   similar-looking characters together, nor does it prohibit some
   characters because they look like others.

   Nevertheless, specifications for application protocols that use this
   framework are strongly encouraged to describe how confusable
   characters can be abused to compromise the security of systems that
   use the protocol in question, along with any protocol-specific
   suggestions for overcoming those threats. In particular, software
   implementations and service deployments that use PRECIS-based
   technologies are strongly encouraged to define and implement
   consistent policies regarding the registration, storage, and
   presentation of visually similar characters. The following
   recommendations are appropriate:

   1. An application service SHOULD define a policy that specifies the
      scripts or blocks of characters that the service will allow to be
      registered (e.g., in an account name) or stored (e.g., in a file
      name).
      Such a policy SHOULD be informed by the languages and scripts
      that are used to write registered account names; in particular,
      to reduce confusion, the service SHOULD forbid registration or
      storage of strings that contain characters from more than one
      script and SHOULD restrict registrations to characters drawn from
      a very small number of scripts (e.g., scripts that are
      well-understood by the administrators of the service, to improve
      manageability).

   2. User-oriented application software SHOULD define a policy that
      specifies how internationalized strings will be presented to a
      human user. Because every human user of such software has a
      preferred language or a small set of preferred languages, the
      software SHOULD gather that information either explicitly from
      the user or implicitly via the operating system of the user's
      device. Furthermore, because most languages are typically
      represented by a single script or a small set of scripts, and
      because most scripts are typically contained in one or more
      blocks of characters, the software SHOULD warn the user when
      presenting a string that mixes characters from more than one
      script or block, or that uses characters outside the normal range
      of the user's preferred language(s). (Such a recommendation is
      not intended to discourage communication across different
      communities of language users; instead, it recognizes the
      existence of such communities and encourages due caution when
      presenting unfamiliar scripts or characters to human users.)

   The challenges inherent in supporting the full range of Unicode code
   points have in the past led some to hope for a way to
   programmatically negotiate more restrictive ranges based on locale,
   script, or other relevant factors, to tag the locale associated with
   a particular string, etc.
   As a general-purpose internationalization technology, the PRECIS
   framework does not include such mechanisms.

9.6. Security of Passwords

   Two goals of passwords are to maximize the amount of entropy and to
   minimize the potential for false positives. These goals can be
   achieved in part by allowing a wide range of code points and by
   ensuring that passwords are handled in such a way that code points
   are not compared aggressively. Therefore, it is NOT RECOMMENDED for
   application protocols to profile the FreeformClass for use in
   passwords in a way that removes entire categories (e.g., by
   disallowing symbols or punctuation). Furthermore, it is NOT
   RECOMMENDED for application protocols to map uppercase and titlecase
   code points to their lowercase equivalents in such strings; instead,
   it is RECOMMENDED to preserve the case of all code points contained
   in such strings and to compare them in a case-sensitive manner.

   That said, software implementers need to be aware that there exist
   tradeoffs between entropy and usability. For example, allowing a
   user to establish a password containing "uncommon" code points might
   make it difficult for the user to access a service when using an
   unfamiliar or constrained input device.

   Some application protocols use passwords directly, whereas others
   reuse technologies that themselves process passwords (one example of
   such a technology is the Simple Authentication and Security Layer
   [RFC4422]). Moreover, passwords are often carried by a sequence of
   protocols with backend authentication systems or data storage systems
   such as RADIUS [RFC2865] and LDAP [RFC4510]. Developers of
   application protocols are encouraged to look into reusing these
   profiles instead of defining new ones, so that end-user expectations
   about passwords are consistent no matter which application protocol
   is used.
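
   The case-preserving, case-sensitive comparison recommended earlier
   in this section might look roughly like the following Python sketch.
   It is illustrative only: the function names are hypothetical, NFC is
   assumed as the normalization form (following the general
   recommendation in Section 4.1.4), and the use of a constant-time
   comparison is an implementation choice that this document does not
   discuss. The two plaintext values are compared directly only to keep
   the example short.

```python
import hmac
import unicodedata


def prepare_password(password: str) -> bytes:
    # No case mapping is applied: the case of every code point is kept.
    # NFC is an assumption of this sketch.
    return unicodedata.normalize("NFC", password).encode("utf-8")


def passwords_match(supplied: str, expected: str) -> bool:
    # Case-sensitive comparison of the prepared octets;
    # hmac.compare_digest compares in constant time.
    return hmac.compare_digest(prepare_password(supplied),
                               prepare_password(expected))
```

   In this sketch "Secret" and "secret" do not match, whereas a
   precomposed U+00E9 and the sequence U+0065 U+0301 do, because both
   normalize to the same NFC form.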

   In protocols that provide passwords as input to a cryptographic
   algorithm such as a hash function, the client will need to perform
   proper preparation of the password before applying the algorithm,
   since the password is not available to the server in plaintext form.

   Further discussion of password handling can be found in
   [I-D.ietf-precis-saslprepbis].

10.  Interoperability Considerations

   Although strings that are consumed in PRECIS-based application
   protocols are often encoded using UTF-8 [RFC3629], the exact encoding
   is a matter for the application protocol that uses PRECIS, not for
   the PRECIS framework.

   It is known that some existing systems are unable to support the full
   Unicode character set, or even any characters outside the ASCII
   range. If two (or more) applications need to interoperate when
   exchanging data (e.g., for the purpose of authenticating a username
   or password), they will naturally need to have in common at least one
   coded character set (as defined by [RFC6365]). Establishing such a
   baseline is a matter for the application protocol that uses PRECIS,
   not for the PRECIS framework.

   Three Unicode code points underwent changes in their GeneralCategory
   between Unicode 5.2 (current at the time IDNA2008 was originally
   published) and Unicode 6.0, as described in [RFC6452]. Implementers
   might need to be aware that the treatment of these characters differs
   depending on which version of Unicode is available on the system that
   is using IDNA2008 or PRECIS, and that other such differences are
   possible between the version of Unicode current at the time of this
   writing (7.0) and future versions.

11.  References

11.1.  Normative References

   [RFC20]    Cerf, V., "ASCII format for network interchange", RFC 20,
              October 1969.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
              Interchange", RFC 5198, March 2008.

   [Unicode7.0]
              The Unicode Consortium, "The Unicode Standard, Version
              7.0.0", 2014.

11.2.  Informative References

   [I-D.ietf-precis-mappings]
              Yoneya, Y. and T. NEMOTO, "Mapping characters for PRECIS
              classes", draft-ietf-precis-mappings-08 (work in
              progress), June 2014.

   [I-D.ietf-precis-nickname]
              Saint-Andre, P., "Preparation and Comparison of
              Nicknames", draft-ietf-precis-nickname-09 (work in
              progress), January 2014.

   [I-D.ietf-precis-saslprepbis]
              Saint-Andre, P. and A. Melnikov, "Username and Password
              Preparation Algorithms", draft-ietf-precis-saslprepbis-07
              (work in progress), March 2014.

   [I-D.ietf-xmpp-6122bis]
              Saint-Andre, P., "Extensible Messaging and Presence
              Protocol (XMPP): Address Format",
              draft-ietf-xmpp-6122bis-12 (work in progress), March 2014.

   [I-D.saintandre-username-interop]
              Saint-Andre, P., "An Interoperable Subset of Characters
              for Internationalized Usernames",
              draft-saintandre-username-interop-03 (work in progress),
              March 2014.

   [RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
              "Remote Authentication Dial In User Service (RADIUS)",
              RFC 2865, June 2000.

   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
              Internationalized Strings ("stringprep")", RFC 3454,
              December 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
              Profile for Internationalized Domain Names (IDN)",
              RFC 3491, March 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC4422]  Melnikov, A. and K. Zeilenga, "Simple Authentication and
              Security Layer (SASL)", RFC 4422, June 2006.

   [RFC4510]  Zeilenga, K., "Lightweight Directory Access Protocol
              (LDAP): Technical Specification Road Map", RFC 4510,
              June 2006.

   [RFC4690]  Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
              Recommendations for Internationalized Domain Names
              (IDNs)", RFC 4690, September 2006.

   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
              May 2008.

   [RFC5234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234, January 2008.

   [RFC5246]  Dierks, T. and E. Rescorla, "The Transport Layer Security
              (TLS) Protocol Version 1.2", RFC 5246, August 2008.

   [RFC5890]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Definitions and Document Framework",
              RFC 5890, August 2010.

   [RFC5891]  Klensin, J., "Internationalized Domain Names in
              Applications (IDNA): Protocol", RFC 5891, August 2010.

   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [RFC5893]  Alvestrand, H. and C. Karp, "Right-to-Left Scripts for
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5893, August 2010.

   [RFC5894]  Klensin, J., "Internationalized Domain Names for
              Applications (IDNA): Background, Explanation, and
              Rationale", RFC 5894, August 2010.

   [RFC5895]  Resnick, P. and P. Hoffman, "Mapping Characters for
              Internationalized Domain Names in Applications (IDNA)
              2008", RFC 5895, September 2010.

   [RFC6365]  Hoffman, P. and J.
              Klensin, "Terminology Used in Internationalization in the
              IETF", BCP 166, RFC 6365, September 2011.

   [RFC6452]  Faltstrom, P. and P. Hoffman, "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA) -
              Unicode 6.0", RFC 6452, November 2011.

   [RFC6885]  Blanchet, M. and A. Sullivan, "Stringprep Revision and
              Problem Statement for the Preparation and Comparison of
              Internationalized Strings (PRECIS)", RFC 6885, March 2013.

   [RFC6943]  Thaler, D., "Issues in Identifier Comparison for Security
              Purposes", RFC 6943, May 2013.

   [UAX9]     The Unicode Consortium, "Unicode Standard Annex #9:
              Unicode Bidirectional Algorithm", September 2012.

   [UAX11]    The Unicode Consortium, "Unicode Standard Annex #11: East
              Asian Width", September 2012.

   [UAX15]    The Unicode Consortium, "Unicode Standard Annex #15:
              Unicode Normalization Forms", August 2012.

   [UnicodeCurrent]
              The Unicode Consortium, "The Unicode Standard",
              2014-present.

   [UTR36]    The Unicode Consortium, "Unicode Technical Report #36:
              Unicode Security Considerations", July 2012.

   [UTS39]    The Unicode Consortium, "Unicode Technical Standard #39:
              Unicode Security Mechanisms", July 2012.

11.3.  URIs

   [1] http://unicode.org/Public/UNIDATA/PropertyAliases.txt

   [2] http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

Appendix A.  Acknowledgements

   The authors would like to acknowledge the comments and contributions
   of the following individuals during working group discussion: David
   Black, Edward Burns, Dan Chiba, Mark Davis, Alan DeKok, Martin
   Duerst, Patrik Faltstrom, Ted Hardie, Joe Hildebrand, Bjoern
   Hoehrmann, Paul Hoffman, Jeffrey Hutzelman, Simon Josefsson, John
   Klensin, Alexey Melnikov, Takahiro Nemoto, Yoav Nir, Mike Parker,
   Pete Resnick, Andrew Sullivan, Dave Thaler, Yoshiro Yoneya, and
   Florian Zeitz.

   Charlie Kaufman, Tom Taylor, and Tim Wicinski reviewed the document
   on behalf of the Security Directorate, the General Area Review Team,
   and the Operations and Management Directorate, respectively.

   During IESG review, Alissa Cooper, Stephen Farrell, and Barry Leiba
   provided comments that led to further improvements.

   Some algorithms and textual descriptions have been borrowed from
   [RFC5892]. Some text regarding security has been borrowed from
   [RFC5890], [I-D.ietf-precis-saslprepbis], and
   [I-D.ietf-xmpp-6122bis].

   Peter Saint-Andre wishes to acknowledge Cisco Systems, Inc., for
   employing him during his work on earlier versions of this document.

Authors' Addresses

   Peter Saint-Andre
   &yet
   P.O. Box 787
   Parker, CO 80134
   USA

   Email: peter@andyet.net

   Marc Blanchet
   Viagenie
   246 Aberdeen
   Quebec, QC G1R 2E1
   Canada

   Email: Marc.Blanchet@viagenie.ca
   URI:   http://www.viagenie.ca/