idnits 2.17.1 

draft-iab-identifier-comparison-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There is 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  == There are 2 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.

  ** The document seems to lack both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 415: '....  Host software MUST support this mor...'
     RFC 2119 keyword, line 420: '...dentity of an Internet host, it SHOULD...'
     RFC 2119 keyword, line 422: '...#.#.#.#") form.  The host SHOULD check...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (May 8, 2012) is 4361 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'RFC5890' is mentioned on line 545, but not defined

  == Outdated reference: A later version (-11) exists of
     draft-ietf-pkix-rfc5280-clarifications-04

  == Outdated reference: A later version (-09) exists of
     draft-ietf-precis-problem-statement-05

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)


     Summary: 1 error (**), 0 flaws (~~), 6 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                     D. Thaler, Ed.
3	Internet-Draft                                                 Microsoft
4	Intended status: Informational                               May 8, 2012
5	Expires: November 9, 2012

7	         Issues in Identifier Comparison for Security Purposes
8	                 draft-iab-identifier-comparison-02.txt

10	Abstract

12	   Identifiers such as hostnames, URIs, and email addresses are often
13	   used in security contexts to identify security principals and
14	   resources.  In such contexts, an identifier supplied via some
15	   protocol is often compared against some policy to make security
16	   decisions such as whether the principal may access the resource, what
17	   level of authentication or encryption is required, etc.  If the
18	   parties involved in a security decision use different algorithms to
19	   compare identifiers, then failure scenarios ranging from denial of
20	   service to elevation of privilege can result.

22	Status of this Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current Internet-
30	   Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   This Internet-Draft will expire on November 9, 2012.

39	Copyright Notice

41	   Copyright (c) 2012 IETF Trust and the persons identified as the
42	   document authors.  All rights reserved.

44	   This document is subject to BCP 78 and the IETF Trust's Legal
45	   Provisions Relating to IETF Documents
46	   (http://trustee.ietf.org/license-info) in effect on the date of
47	   publication of this document.  Please review these documents
48	   carefully, as they describe your rights and restrictions with respect
49	   to this document.  Code Components extracted from this document must
50	   include Simplified BSD License text as described in Section 4.e of
51	   the Trust Legal Provisions and are provided without warranty as
52	   described in the Simplified BSD License.

54	Table of Contents

56	   1.  Introduction
57	     1.1.  Canonicalization
58	   2.  Security Uses
59	     2.1.  Types of Identifiers
60	     2.2.  False Positives and Negatives
61	     2.3.  Hypothetical Example
62	   3.  Common Identifiers
63	     3.1.  Hostnames
64	       3.1.1.  IPv4 Literals
65	       3.1.2.  IPv6 Literals
66	       3.1.3.  Internationalization
67	       3.1.4.  Resolution for comparison
68	     3.2.  Ports and Service Names
69	     3.3.  URIs
70	       3.3.1.  Scheme component
71	       3.3.2.  Authority component
72	       3.3.3.  Path component
73	       3.3.4.  Query component
74	       3.3.5.  Fragment component
75	       3.3.6.  Resolution for comparison
76	     3.4.  Email Address-like Identifiers
77	   4.  General Internationalization Issues
78	   5.  Security Considerations
79	   6.  Acknowledgements
80	   7.  IANA Considerations
81	   8.  Informative References
82	   Author's Address

84	1.  Introduction

86	   In computing and the Internet, various types of "identifiers" are
87	   used to identify humans, devices, content, etc.  Before discussing
88	   security issues, we first give some background on some typical
89	   processes involving identifiers.

91	   As depicted in Figure 1, there are multiple processes relevant to our
92	   discussion.
93	   1.  An identifier must first be generated.  If the identifier is
94	       intended to be unique, the generation process includes some
95	       mechanism, such as allocation by a central authority, to help
96	       ensure uniqueness.  However, the notion of "unique" involves
97	       determining whether a putative identifier matches any other
98	       already-allocated identifier.  As we will see, for many types of
99	       identifiers, this is not simply an exact binary match.

101	       As a result of generating the identifier, it is often stored in
102	       two locations: with the requester or "holder" of the identifier,
103	       and with some repository of identifiers (e.g., DNS).  For
104	       example, if the identifier was allocated by a central authority,
105	       the repository might be that authority.  If the identifier
106	       identifies a device or content on a device, the repository might
107	       be that device.
108	   2.  The identifier must be distributed, either by the holder of the
109	       identifier or by a repository of identifiers, to others who could
110	       use the identifier.  This distribution might be electronic, but
111	       sometimes it is via other channels such as voice, business card,
112	       billboard, or other form of advertisement.  The identifier itself
113	       might be distributed directly, or it might be used to generate a
114	       portion of another type of identifier that is then distributed.
115	       For example, a URI or email address might include a server name,
116	       and hence distributing the URI or email address also inherently
117	       distributes the server name.
118	   3.  The identifier must be used by some party.  Generally the user
119	       supplies the identifier which is (directly or indirectly) sent to
120	       the repository of identifiers.  For example, using an email
121	       address to send email to the holder of an identifier may result
122	       in the email arriving at the holder's email server which has
123	       access to the mail stores.

125	       The repository of identifiers must then attempt to match the
126	       user-supplied identifier with an identifier in its repository.

128	                            +------------+
129	                            |  Holder of |     1. Generation
130	                            | identifier +<---------+
131	                            +----+-------+          |
132	                                 |                  | Match
133	                                 |                  v/
134	                                 |          +-------+-------+
135	                                 +----------+ Repository of |
136	                                 |          |  identifiers  |
137	                                 |          +-------+-------+
138	                 2. Distribution |                  ^\
139	                                 |                  | Match
140	                                 v                  |
141	                       +---------+-------+          |
142	                       |      User of    |          |
143	                       |    identifier   +----------+
144	                       +-----------------+    3. Use

146	                       Typical Identifier Processes

148	                                 Figure 1

150	   One key aspect is that the identifier values passed in generation,
151	   distribution, and use, may all be different forms.  For example,
152	   generation might be exchanged in printed form, distribution done via
153	   voice, and use done electronically.  As such, the match process can
154	   be complicated.

156	   Furthermore, in many uses, the relationship between holder,
157	   repositories, and users may be more involved.  For example, when a
158	   hierarchy of web caches exists, each cache is itself a repository of a
159	   sort, and the match process is usually intended to be the same as on
160	   the origin server.

162	1.1.  Canonicalization

164	   Perhaps the most common algorithm for comparison involves first
165	   converting each identifier to a canonical form (a process known as
166	   "canonicalization" or "normalization"), and then testing the
167	   resulting canonical representations for bitwise equality.  In so
168	   doing, it is thus critical that all entities involved agree on the
169	   same canonical form and use the same canonicalization algorithm so
170	   that the overall comparison process is also the same.

172	   Note that in some contexts, such as in internationalization, the
173	   terms "canonicalization" and "normalization" have a precise meaning.
174	   In this document, however, we use these terms synonymously in their
175	   more generic form, to mean conversion to some standard form.

177	   While the most common method of comparison includes canonicalization,
178	   comparison can also be done by defining an equivalence algorithm,
179	   where no single form is canonical.  However, in most cases, a
180	   canonical form is useful for other purposes, such as output, and so
181	   in such cases defining a canonical form suffices to define a
182	   comparison method.
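
   The canonicalize-then-compare approach described above can be
   sketched as follows.  This is a minimal illustration only: the
   function names and the canonicalization rules (drop one trailing
   dot, fold ASCII case) are assumptions made for the example and are
   not defined by this document.

```python
def canonicalize(hostname: str) -> bytes:
    """Convert a hostname to one (hypothetical) canonical form."""
    # Assumed rules for this sketch only: drop a single trailing dot and
    # fold ASCII letters to lower case. Non-ASCII input is rejected here
    # (encode() raises) rather than guessed at.
    if hostname.endswith("."):
        hostname = hostname[:-1]
    return hostname.encode("ascii").lower()


def identifiers_match(a: str, b: str) -> bool:
    """Compare the canonical representations for bitwise equality."""
    return canonicalize(a) == canonicalize(b)
```

   Two parties obtain the same answer from identifiers_match() only if
   both use the same canonicalize(); a party that, say, kept the
   trailing dot would disagree about "example.com." and "example.com".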

184	2.  Security Uses

186	   Identifiers such as hostnames, URIs, and email addresses are used in
187	   security contexts to identify principals and resources as well as
188	   other security parameters such as types and values of claims.  Those
189	   identifiers are then used to make security decisions based on an
190	   identifier supplied via some protocol.  For example:
191	   o  Authentication: a protocol might match a security principal
192	      identifier to look up expected keying material, and then match
193	      keying material.
194	   o  Authorization: a protocol might match a resource name to look up
195	      an access control list (ACL), and then look up the security
196	      principal identifier in that ACL.
197	   o  Accounting: a system might create an accounting record for a
198	      security principal identifier or resource name, and then might
199	      later need to match a supplied identifier to allow (for example)
200	      law enforcement to follow up based on the records, or add new
201	      filtering rules based on the records in order to stop an attack.

203	   If the parties involved in a security decision use different matching
204	   algorithms for the same identifiers, then failure scenarios ranging
205	   from denial of service to elevation of privilege can result, as we
206	   will see.

208	   This is especially complicated in cases involving multiple parties
209	   and multiple protocols.  For example, there are many scenarios where
210	   some form of "security token service" is used to grant to a requester
211	   permission to access a resource, where the resource is held by a
212	   third party that relies on the security token service (see Figure 2).
213	   The protocol used to request permission (e.g., Kerberos or OAuth) may
214	   be different from the protocol used to access the resource (e.g.,
215	   HTTP).  Opportunities for security problems arise when two protocols
216	   define different comparison algorithms for the same type of
217	   identifier, or when a protocol is ambiguously specified and two
218	   endpoints (e.g., a security token service and a resource holder)
219	   implement different algorithms within the same protocol.

221	        +----------+
222	        | security |
223	        |  token   |
224	        | service  |
225	        +----------+
226	             ^
227	             | 1. supply credentials and
228	             | get token for resource
229	             |                                             +--------+
230	        +----------+  2. supply token and access resource  |resource|
231	        |requester |=------------------------------------->| holder |
232	        +----------+                                       +--------+

234	                         Simple Security Exchange

236	                                 Figure 2

238	   In many cases the situation is more complex.  With certificates, the
239	   name in a certificate gets compared against names in ACLs or other
240	   things.  In the case of web site security, the name in the
241	   certificate gets compared to a portion of the URI that a user may
242	   have typed into a browser.  The fact that many different people are
243	   doing the typing, on many different types of systems, complicates the
244	   problem.

246	   Add to this the certificate enrollment step, and the certificate
247	   issuance step, and two more parties have an opportunity to adjust the
248	   encoding or worse, the software that supports them might make changes
249	   that the parties are unaware are happening.

251	2.1.  Types of Identifiers

253	   In this document we will refer to the following types of identifiers:

255	   o  Absolute: identifiers that can be compared byte-by-byte for
256	      equality.  Two identifiers that have different bytes are defined
257	      to be different.  For example, binary IP addresses are in this
258	      class.
259	   o  Definite: identifiers that have a well-defined comparison
260	      algorithm on which all parties agree.  For example, URI scheme
261	      names are required to be ASCII and are defined to match in a case-
262	      insensitive way; the comparison is thus definite since all parties
263	      agree on how to do a case-insensitive match among ASCII strings.
264	   o  Indefinite: identifiers that have no single comparison algorithm
265	      on which all parties agree.  For example, human names are in this
266	      class.  Everyone might want the comparison to be tailored for
267	      their locale, for some definition of locale.  In some cases, there
268	      may be limited subsets of parties that might be able to agree
269	      (e.g., US-ASCII users might all agree on a common comparison
270	      algorithm whereas US-ASCII users vs. Turkish users may not), but
271	      identifiers often tend to leak out of such limited environments.

273	2.2.  False Positives and Negatives

275	   It is first worth discussing in more detail the effects of errors in
276	   the comparison algorithm.  A "false positive" results when two
277	   identifiers compare as if they were equal, but in reality refer to
278	   two different objects (e.g., security principals or resources).  When
279	   privilege is granted on a match, a false positive thus results in an
280	   elevation of privilege, for example allowing execution of an
281	   operation that should not have been permitted otherwise.  When
282	   privilege is denied on a match (e.g., an entry in a block/deny list
283	   or revocation list), a false positive denies a permissible operation.
284	   At best, this can cause worse performance (e.g., a cache miss, or
285	   forcing redundant authentication), and at worst can result in a
286	   denial of service.

288	   A "false negative" results when two identifiers that in reality refer
289	   to the same thing compare as if they were different, and the effects
290	   are the reverse of those for false positives.  That is, when
291	   privilege is granted on a match, the result is at best worse
292	   performance and at worst a denial of service; when privilege is
293	   denied on a match, elevation of privilege results.

295	   Figure 3 summarizes these effects.

297	                  | "Grant on match"       | "Deny on match"
298	   ---------------+------------------------+-----------------------
299	   False positive | Elevation of privilege | Denial of service
300	   ---------------+------------------------+-----------------------
301	   False negative | Denial of service      | Elevation of privilege
302	   ---------------+------------------------+-----------------------

304	                    Effect of False Positives/Negatives

306	                                 Figure 3
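
   The table in Figure 3 can also be derived mechanically by comparing
   the decision that should have been made with the decision that was
   actually made.  The sketch below does this; the policy labels
   "grant" and "deny" and the function names are hypothetical and
   chosen only for illustration.

```python
def access_granted(policy: str, matched: bool) -> bool:
    """Return True if access is allowed under the given policy."""
    # "grant" means grant-on-match; "deny" means deny-on-match.
    return matched if policy == "grant" else not matched


def effect(policy: str, truly_same: bool, compared_equal: bool) -> str:
    """Classify the outcome of a comparison against the ground truth."""
    expected = access_granted(policy, truly_same)
    actual = access_granted(policy, compared_equal)
    if actual == expected:
        return "correct"
    # Access given that should not have been: elevation of privilege.
    # Access withheld that should have been given: denial of service.
    return "elevation of privilege" if actual else "denial of service"
```

   For example, a false positive (truly_same=False, compared_equal=True)
   under the "grant" policy yields "elevation of privilege", while the
   same error under the "deny" policy yields "denial of service".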

308	   Elevation of privilege is almost always seen as far worse than denial
309	   of service.  Hence, for URIs for example, Section 6.1 of [RFC3986]
310	   states: "comparison methods are designed to minimize false negatives
311	   while strictly avoiding false positives".

313	   Thus URIs were defined with a "grant privilege on match" paradigm in
314	   mind, where it is critical to prevent elevation of privilege while
315	   minimizing denial of service.  Using URIs in a "deny privilege on
316	   match" system can thus be problematic.

318	2.3.  Hypothetical Example

320	   In this example, both security principals and resources are
321	   identified using URIs.  Foo Corp has paid example.com for access to
322	   the Stuff service.  Foo Corp allows its employees to create accounts
323	   on the Stuff service.  Alice gets the account
324	   "http://example.com/Stuff/FooCorp/alice" and Bob gets
325	   "http://example.com/Stuff/FooCorp/bob".  It turns out, however, that
326	   Foo Corp's URI canonicalizer includes URI fragment components in
327	   comparisons whereas example.com's does not, and Foo Corp does not
328	   disallow the # character in the account name.  So Chuck, who is a
329	   malicious employee of Foo Corp, asks to create an account at
330	   example.com with the name alice#stuff.  Foo Corp's URI logic checks
331	   its records for accounts it has created with stuff and sees that
332	   there is no account with the name alice#stuff.  Hence, in its
333	   records, it associates the account alice#stuff with Chuck and will
334	   only issue tokens good for use with
335	   "http://example.com/Stuff/FooCorp/alice#stuff" to Chuck.

337	   Chuck, the attacker, goes to a security token service at Foo Corp and
338	   asks for a security token good for
339	   "http://example.com/Stuff/FooCorp/alice#stuff".  Foo Corp issues the
340	   token since Chuck is the legitimate owner (in Foo Corp's view) of the
341	   alice#stuff account.  Chuck then submits the security token in a
342	   request to "http://example.com/Stuff/FooCorp/alice".

344	   But example.com uses a URI canonicalizer that, for the purposes of
345	   checking equality, ignores fragments.  So when example.com looks in
346	   the security token to see if the requester has permission from Foo
347	   Corp to access the given account it successfully matches the URI in
348	   the security token, "http://example.com/Stuff/FooCorp/alice#stuff",
349	   with the requested resource name
350	   "http://example.com/Stuff/FooCorp/alice".

352	   Leveraging the inconsistencies in the canonicalizers used by Foo Corp
353	   and example.com, Chuck is able to successfully launch an elevation of
354	   privilege attack and access Alice's resource.
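
   The mismatch that Chuck exploits can be reproduced in a few lines.
   The two comparison functions below are hypothetical stand-ins for
   the canonicalizers of Foo Corp and example.com as described in this
   example; they are a sketch, not a model of any real deployment.

```python
from urllib.parse import urldefrag

ALICE = "http://example.com/Stuff/FooCorp/alice"
CHUCK = "http://example.com/Stuff/FooCorp/alice#stuff"


def foo_corp_equal(a: str, b: str) -> bool:
    """Foo Corp's view: the fragment component is part of the identifier."""
    return a == b


def example_com_equal(a: str, b: str) -> bool:
    """example.com's view: fragments are ignored when checking equality."""
    return urldefrag(a).url == urldefrag(b).url


# Foo Corp sees two distinct accounts, so it issues Chuck a token.
print(foo_corp_equal(ALICE, CHUCK))
# example.com sees the same resource, so it accepts Chuck's token.
print(example_com_equal(ALICE, CHUCK))
```

   The attack needs no flaw in either function taken alone; it arises
   only because the two parties disagree.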

356	   Furthermore, consider an attacker using a similar corporation such as
357	   "foocorp" (or any variation containing a non-ASCII character that
358	   some humans might expect to represent the same corporation).  If the
359	   resource holder treats them as different, but the security token
360	   service treats them as the same, then again elevation of privilege
361	   can occur.

363	3.  Common Identifiers

365	   In this section, we walk through a number of common types of
366	   identifiers and discuss various issues related to comparison that may
367	   affect security whenever they are used to identify security
368	   principals or resources.  These examples illustrate common patterns
369	   that may arise with other types of identifiers.

371	3.1.  Hostnames

373	   Hostnames (composed of dot-separated labels) are commonly used either
374	   directly as identifiers, or as components in identifiers such as in
375	   URIs and email addresses.  Another example is in [RFC5280], sections
376	   7.2 and 7.3 (and updated in section 3 of
377	   [I-D.ietf-pkix-rfc5280-clarifications]), which specify use in
378	   certificates.

380	   In this section we discuss a number of issues in comparing strings
381	   that appear to be some form of hostname.

383	   Section 3 of [RFC6055] discusses the differences between a "hostname"
384	   vs. a "DNS name", where the former is a subset of the latter by using
385	   a restricted set of characters.  If one canonicalizer uses the "DNS
386	   name" definition whereas another uses a "hostname" definition, a name
387	   might be valid in the former but invalid in the latter.  As long as
388	   invalid identifiers are denied privilege, this difference will not
389	   result in elevation of privilege.

391	   [IAB1123] briefly discusses issues with the ambiguity around whether
392	   a label will be "alphabetic", including among other issues, whether a
393	   hostname can be interpreted as an IP address.  We explore this last
394	   issue in more detail below.

396	3.1.1.  IPv4 Literals

398	   [RFC0952] defined an entry in the "Internet host table" as follows:

400	      A "name" (Net, Host, Gateway, or Domain name) is a text string up
401	      to 24 characters drawn from the alphabet (A-Z), digits (0-9),
402	      minus sign (-), and period (.).  Note that periods are only
403	      allowed when they serve to delimit components of "domain style
404	      names". [...]  No blank or space characters are permitted as part
405	      of a name.  No distinction is made between upper and lower case.
406	      The first character must be an alpha character.  The last
407	      character must not be a minus sign or period. [...]  Single
408	      character names or nicknames are not allowed.

410	   [RFC1123] section 2.1 then updates the definition with:

412	      The syntax of a legal Internet host name was specified in RFC-952
413	      [DNS:4].  One aspect of host name syntax is hereby changed: the
414	      restriction on the first character is relaxed to allow either a
415	      letter or a digit.  Host software MUST support this more liberal
416	      syntax.

418	   and

420	      Whenever a user inputs the identity of an Internet host, it SHOULD
421	      be possible to enter either (1) a host domain name or (2) an IP
422	      address in dotted-decimal ("#.#.#.#") form.  The host SHOULD check
423	      the string syntactically for a dotted-decimal number before
424	      looking it up in the Domain Name System.

426	   and

428	      This last requirement is not intended to specify the complete
429	      syntactic form for entering a dotted-decimal host number; that is
430	      considered to be a user-interface issue.

432	   In specifying the inet_addr() API, the POSIX standard [IEEE-1003.1]
433	   defines "IPv4 dotted decimal notation" as allowing not only strings
434	   of the form "10.0.1.2", but also octal and hexadecimal parts, and
435	   addresses with fewer than four parts.  For example, "10.0.258",
436	   "0xA000102", and "012.0x102" all represent the same IPv4 address in
437	   standard "IPv4 dotted decimal" notation.  We will refer to this as
438	   the "loose" syntax of an IPv4 address literal.
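
   To make the loose syntax concrete, the following sketch parses it in
   pure Python rather than calling a platform's inet_addr(), whose
   behavior may vary.  The function name parse_loose_ipv4 is
   hypothetical, and the rules implemented (a "0x" prefix means
   hexadecimal, a leading "0" means octal, and the last part fills all
   remaining bytes) are the commonly described inet_addr() conventions,
   stated here as an assumption.

```python
_HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_loose_ipv4(text: str) -> bytes:
    """Parse a "loose" IPv4 literal into its 4-byte binary form."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected one to four parts")
    values = []
    for part in parts:
        if part[:2].lower() == "0x":
            digits, base = part[2:], 16
        elif len(part) > 1 and part[0] == "0":
            digits, base = part[1:], 8
        else:
            digits, base = part, 10
        # Restrict to ASCII digits first; int() alone would also accept
        # signs, underscores, and non-ASCII digits.
        if not digits or not set(digits) <= _HEX_DIGITS:
            raise ValueError(f"bad part: {part!r}")
        values.append(int(digits, base))
    *leading, last = values
    remaining = 4 - len(leading)
    # Leading parts are single bytes; the last part fills what is left.
    if any(v > 255 for v in leading) or last >= 256 ** remaining:
        raise ValueError("part out of range")
    return bytes(leading) + last.to_bytes(remaining, "big")
```

   Under these rules "10.0.1.2", "10.0.258", "0xA000102", and
   "012.0x102" all parse to the same four bytes (10, 0, 1, 2).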

440	   In section 6.1 of [RFC3493] getaddrinfo() is defined to support the
441	   same (loose) syntax as inet_addr():

443	      If the specified address family is AF_INET or AF_UNSPEC, address
444	      strings using Internet standard dot notation as specified in
445	      inet_addr() are valid.

447	   In contrast, section 6.3 of the same RFC states, specifying
448	   inet_pton():

450	      If the af argument of inet_pton() is AF_INET, the src string shall
451	      be in the standard IPv4 dotted-decimal form: ddd.ddd.ddd.ddd where
452	      "ddd" is a one to three digit decimal number between 0 and 255.
453	      The inet_pton() function does not accept other formats (such as
454	      the octal numbers, hexadecimal numbers, and fewer than four
455	      numbers that inet_addr() accepts).

457	   As shown above, inet_pton() uses what we will refer to as the
458	   "strict" form of an IPv4 address literal.  Some platforms also use
459	   the strict form with getaddrinfo() when the AI_NUMERICHOST flag is
460	   passed to it.
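
   By way of contrast, a strict parser accepts only four decimal parts.
   The sketch below is illustrative and is not a substitute for
   inet_pton(); the name parse_strict_ipv4 is hypothetical.  The quoted
   text does not say how leading zeros are to be treated, so this
   sketch conservatively rejects them, which is an assumption.

```python
import re

_STRICT_IPV4 = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")


def parse_strict_ipv4(text: str) -> bytes:
    """Parse a "strict" ddd.ddd.ddd.ddd literal into 4 bytes."""
    match = _STRICT_IPV4.fullmatch(text)
    if match is None:
        raise ValueError("not in strict dotted-decimal form")
    octets = []
    for part in match.groups():
        # Assumption of this sketch: reject leading zeros so that a part
        # can never be mistaken for octal.
        if len(part) > 1 and part[0] == "0":
            raise ValueError(f"leading zero in {part!r}")
        value = int(part)
        if value > 255:
            raise ValueError(f"part out of range: {part!r}")
        octets.append(value)
    return bytes(octets)
```

   A string such as "10.0.258" is rejected by this function even though
   a loose parser would accept it, which is exactly the kind of
   disagreement that can lead two parties to different decisions.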

462	   Both the strict and loose forms are standard forms, and hence a
463	   protocol specification is still ambiguous if it simply defines a
464	   string to be in the "standard IPv4 dotted decimal form".  And, as a
465	   result of these differences, names like "10.11.12" are ambiguous as
466	   to whether they are an IP address or a hostname, and even
467	   "10.11.12.13" can be ambiguous because of the "SHOULD" in RFC 1123
468	   above making it optional whether to treat it as an address or a name.

470	   Protocols and data formats that can use addresses in string form for
471	   security purposes need to resolve these ambiguities.  For example,
472	   for the host component of URIs, section 3.2.2 of [RFC3986] resolves
473	   the first ambiguity by only allowing the strict form, and the second
474	   ambiguity by specifying that it is considered an IPv4 address
475	   literal.  New protocols and data formats should similarly consider
476	   using the strict form rather than the loose form in order to better
477	   match user expectations.

479	   Thus, whereas (binary) IPv4 addresses are Absolute identifiers, IPv4
480	   address literals are at best Definite identifiers, and often turn out
481	   to be Indefinite identifiers.

483	   Furthermore, when strings can contain non-ASCII characters, they can
484	   contain other characters that may look like dots or digits to a human
485	   viewing and/or entering the identifier, especially to one who might
486	   expect digits to appear in his or her native script.

488	3.1.2.  IPv6 Literals

490	   IPv6 addresses similarly have a wide variety of alternate but
491	   semantically identical string representations, as defined in section
492	   2.2 of [RFC4291].  As discussed in section 3.2.5 of [RFC5952], this
493	   fact causes problems in security contexts if comparison (such as in
494	   X.509 certificates), is done between strings rather than between the
495	   binary representations of addresses.

497	   [RFC5952] recently specified a recommended canonical string format as
498	   an attempt to solve this problem, but it may not be ubiquitously
499	   supported at present.  And, when strings can contain non-ASCII
500	   characters, the same issues (and more, since hexadecimal and colons
501	   are allowed) arise as with IPv4 literals.

503	   Whereas (binary) IPv6 addresses are Absolute identifiers, IPv6
504	   address literals are Definite identifiers, since string-to-address
505	   conversion for IPv6 address literals is unambiguous.
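
   A comparison between binary representations, rather than between
   strings, can be sketched with Python's standard ipaddress module.
   The function name ipv6_literals_equal is hypothetical, and the
   addresses come from the 2001:db8::/32 documentation prefix.

```python
import ipaddress


def ipv6_literals_equal(a: str, b: str) -> bool:
    """Compare two IPv6 literals by their 16-byte binary form."""
    return ipaddress.IPv6Address(a).packed == ipaddress.IPv6Address(b).packed


short_form = "2001:db8::1"
long_form = "2001:0DB8:0:0:0:0:0:1"

# The strings differ, so a naive string comparison reports a mismatch...
print(short_form == long_form)
# ...while the binary comparison reports that they are the same address.
print(ipv6_literals_equal(short_form, long_form))
```

   Converting both literals to one textual form before comparing would
   work equally well, provided every party uses the same form.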

507	3.1.3.  Internationalization

509	   The IETF policy on character sets and languages [RFC2277] requires
510	   support for UTF-8 in protocols, and as a result many protocols now do
511	   support non-ASCII characters.  When a hostname is sent in a UTF-8
512	   field, there are a number of ways it may be encoded.  For example,
513	   hostname labels might be encoded directly in UTF-8, or might first be
514	   Punycode-encoded [RFC3492] or percent-encoded and then encoded in
515	   UTF-8.

517	   For example, in URIs, [RFC3986] section 3.2.2 specifically allows for
518	   the use of percent-encoded UTF-8 characters in the hostname, as well
519	   as the use of IDNA encoding [RFC3490] using the Punycode algorithm.

521	   Percent-encoding is unambiguous for hostnames since the percent
522	   character cannot appear in the strict definition of a "hostname",
523	   though it can appear in a DNS name.

525	   Punycode-encoded labels (or "A-labels") on the other hand can be
526	   ambiguous if hosts are actually allowed to be named with a name
527	   starting with "xn--", and false positives can result.  While this may
528	   be extremely unlikely for normal scenarios, it nevertheless provides
529	   a possible vector for an attacker.

531	   A hostname comparator thus needs to decide whether a Punycode-encoded
532	   label should or should not be considered a valid hostname label, and
533	   if so, then whether it should match a label encoded in some other
534	   form such as a percent-encoded Unicode label (U-label).
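
   One possible policy, sketched below, is to convert every encoding to
   A-label form before comparing.  The function name to_a_label_form
   and the name "bücher.example" are illustrative assumptions.  The
   sketch relies on Python's built-in "idna" codec, which follows the
   older [RFC3490] rules; a comparator built on a different IDNA
   version could give different answers for some labels.

```python
from urllib.parse import unquote


def to_a_label_form(host: str) -> bytes:
    """Reduce a hostname to lower-case A-label (ASCII) form."""
    # Step 1: undo percent-encoding of UTF-8 octets, if any.
    decoded = unquote(host)
    # Step 2: convert non-ASCII labels to "xn--" form; labels that are
    # already ASCII (including existing A-labels) pass through as-is.
    ascii_form = decoded.encode("idna")
    # Step 3: fold ASCII case, since all-ASCII labels are not
    # case-folded by the codec.
    return ascii_form.lower()


forms = ["bücher.example", "b%C3%BCcher.example", "XN--BCHER-KVA.example"]
print({to_a_label_form(f) for f in forms})
```

   Note that this policy treats a label that literally begins with
   "xn--" as an A-label, which is the ambiguity described above; a
   comparator that makes the opposite choice would not match the third
   form against the first two.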

536	   For example, Section 3 of "Transport Layer Security (TLS) Extensions"
537	   [RFC6066], states:

539	      "HostName" contains the fully qualified DNS hostname of the
540	      server, as understood by the client.  The hostname is represented
541	      as a byte string using ASCII encoding without a trailing dot.
542	      This allows the support of internationalized domain names through
543	      the use of A-labels defined in [RFC5890].  DNS hostnames are case-
544	      insensitive.  The algorithm to compare hostnames is described in
545	      [RFC5890], Section 2.3.2.4.

547	   For some additional discussion of security issues that arise with
548	   internationalization, see [TR36].

550	3.1.4.  Resolution for comparison

552	   Some systems (specifically Java URLs [JAVAURL]) use the rule that if
553	   two hostnames resolve to the same IP address then the hostnames are
554	   considered equal.  That is, the canonicalization algorithm involves
555	   name resolution with an IP address being the canonical form.

557	   For example, if resolution was done via DNS, and DNS contained:

559	   example.com.  IN A 10.0.0.6
560	   example.net.  CNAME example.com.
561	   example.org.  IN A 10.0.0.6

563	   then the algorithm might treat all three names as equal, even though
564	   the third name might refer to a different entity.
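
   The effect can be sketched in Python with a toy in-memory table that
   stands in for the DNS data above.  No real lookups are performed, and
   the table and function names are assumptions made for illustration.

```python
# Toy stand-in for the DNS records in the example; not a real resolver.
RECORDS = {
    "example.com.": ("A", "10.0.0.6"),
    "example.net.": ("CNAME", "example.com."),
    "example.org.": ("A", "10.0.0.6"),
}


def resolve(name: str, max_chain: int = 8) -> str:
    """Follow CNAME records in RECORDS until an address is found."""
    for _ in range(max_chain):
        rtype, value = RECORDS[name]
        if rtype == "A":
            return value
        name = value  # CNAME: continue with the target name
    raise LookupError("CNAME chain too long")


def equal_by_resolution(a: str, b: str) -> bool:
    """Treat two names as equal when they resolve to the same address."""
    return resolve(a) == resolve(b)
```

   Here equal_by_resolution("example.com.", "example.org.") is true even
   though nothing in the data says the two names denote the same entity.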

566	   With the introduction of dynamic IP addresses, private IP addresses,
567	   multiple IP addresses per name, multiple address families (e.g., IPv4
568	   vs. IPv6), devices that roam to new locations, commonly deployed DNS
569	   tricks that result in the answer depending on factors such as the
570	   requester's location and the load on the server whose address is
571	   returned, etc., this method of comparison cannot be relied upon.
572	   There is no guarantee that two names for the same host will resolve
573	   to the same IP addresses, nor that the addresses resolved
574	   refer to the same entity such as when the names resolve to private IP
575	   addresses, nor even that the system has connectivity (and the
576	   willingness to wait for the delay) to resolve names at the time the
577	   answer is needed.

579	   In addition, a comparison mechanism that relies on the ability to
580	   resolve identifiers such as hostnames to other identifiers such as IP
581	   addresses leaks information about security decisions to outsiders if
582	   these queries are publicly observable.

584	3.2.  Ports and Service Names

586	   Port numbers and service names are discussed in depth in [RFC6335].
587	   Historically, there were port numbers, service names used in SRV
588	   records, and mnemonic identifiers for assigned port numbers (known as
589	   port "keywords" at [IANA-PORT]).  The latter two are now unified, and
590	   various protocols use one or more of these types in strings.  For
591	   example, the common syntax used by many URI schemes allows port
592	   numbers but not service names.  Some implementations of the
593	   getaddrinfo() API support strings that can be either port numbers or
594	   port keywords (but not service names).

596	   For protocols that use service names that must be resolved, the
597	   issues are the same as those for resolution of addresses in
598	   Section 3.1.4.  In addition, Section 5.1 of [RFC6335] clarifies that
599	   service names/port keywords must contain at least one letter.  This
600	   prevents confusion with port numbers in strings where both are
601	   allowed.
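
   A minimal Python sketch of how the "at least one letter" rule lets a
   parser tell the two kinds of string apart is shown below.  The
   function name is hypothetical, and the other syntax rules of
   [RFC6335] (length, hyphen placement) are deliberately not checked.

```python
def classify_port_token(token: str):
    """Classify token as ("port", int) or ("service", str)."""
    # A string of ASCII digits only can never be a service name.
    if token.isascii() and token.isdigit():
        number = int(token)
        if number > 65535:
            raise ValueError("port number out of range")
        return ("port", number)
    # Anything containing an ASCII letter is treated as a service name.
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return ("service", token)
    raise ValueError("neither a port number nor a service name")
```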

603	3.3.  URIs

605	   This section looks at issues related to using URIs for security
606	   purposes.  For example, [RFC5280], section 7.4, specifies comparison
607	   of URIs in certificates.  Examples of URIs in security token-based
608	   access control systems include WS-*, SAML-P and OAuth WRAP.  In such
609	   systems, a variety of participants in the security infrastructure are
610	   identified by URIs.  For example, requesters of security tokens are
611	   sometimes identified with URIs.  The issuers of security tokens and
612	   the relying parties who are intended to consume security tokens are
613	   frequently identified by URIs.  Claims in security tokens often have
614	   their types defined using URIs and the values of the claims can also
615	   be URIs.

617	   Also, when a URI is embedded in plain text (e.g., an email message),
618	   there is an additional concern because there is no termination
619	   criterion for a URL.  For example, consider
620	   http://unicode.org/cldr/utility/list-unicodeset.jsp?a=a&amp;g=gc.
621	   Some email clients will stop before the ';' while others go to the
622	   '.'.  As another point of comparison, Section 2.37 of [EE] (a
623	   standard for history citations) specifies the use of a space after a
624	   URI and before the punctuation.

626	   URIs are defined with multiple components, each of which has its own
627	   rules.  We cover each in turn below.  However, it is also important
628	   to note that there exist multiple comparison algorithms.  [RFC3986]
629	   section 6.2 states:

631	      A variety of methods are used in practice to test URI equivalence.
632	      These methods fall into a range, distinguished by the amount of
633	      processing required and the degree to which the probability of
634	      false negatives is reduced.  As noted above, false negatives
635	      cannot be eliminated.  In practice, their probability can be
636	      reduced, but this reduction requires more processing and is not
637	      cost-effective for all applications.
638	      If this range of comparison practices is considered as a ladder,
639	      the following discussion will climb the ladder, starting with
640	      practices that are cheap but have a relatively higher chance of
641	      producing false negatives, and proceeding to those that have
642	      higher computational cost and lower risk of false negatives.

644	   The ladder approach has both pros and cons.  On the pro side, it
645	   allows some uses to optimize for security, and other uses to optimize
646	   for cost, thus allowing URIs to be applicable to a wide range of
647	   uses.  A disadvantage is that when different approaches are taken by
648	   different components in the same system using the same identifiers,
649	   the inconsistencies can result in security issues.

651	3.3.1.  Scheme component

653	   [RFC3986] defines URI schemes as being case-insensitive ASCII and in
654	   section 6.2.2.1 specifies that scheme names should be normalized to
655	   lower-case characters.

657	   New schemes can be defined over time.  In general, however, two URIs
658	   with an unrecognized scheme cannot be safely compared.  This is
659	   because the canonicalization and comparison rules for the other
660	   components may vary by scheme.  For example, a new URI scheme might
661	   have a default port of X, and without that knowledge, a comparison
662	   algorithm cannot know whether "example.com" and "example.com:X"
663	   should be considered to match in the authority component.  Hence for
664	   security purposes, it is safest for unrecognized schemes to be
665	   treated as invalid identifiers.  However, if the URIs are only used
666	   with a "grant access on match" paradigm then unrecognized schemes can
667	   be supported by doing a generic case-sensitive comparison, at the
668	   expense of some false negatives.
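
   The two strategies can be sketched in Python as follows.  The table
   of default ports and the function name are assumptions made for
   illustration; a real comparator would cover more schemes and more
   normalization steps, and this sketch ignores the fragment.

```python
from urllib.parse import urlsplit

# Default ports for the schemes this sketch treats as recognized.
DEFAULT_PORTS = {"http": 80, "https": 443}


def uris_match(a: str, b: str) -> bool:
    """Compare two URIs, falling back to exact match for unknown schemes."""
    pa, pb = urlsplit(a), urlsplit(b)
    # urlsplit() already lower-cases the scheme.
    if pa.scheme != pb.scheme:
        return False
    default = DEFAULT_PORTS.get(pa.scheme)
    if default is None:
        # Unrecognized scheme: generic case-sensitive comparison.  This
        # can yield false negatives but not false positives.
        return a == b
    port_a = pa.port if pa.port is not None else default
    port_b = pb.port if pb.port is not None else default
    # The hostname attribute is lower-cased by urlsplit().
    return (pa.username, pa.password, pa.hostname, port_a, pa.path,
            pa.query) == (pb.username, pb.password, pb.hostname, port_b,
                          pb.path, pb.query)
```

   With this sketch, "HTTP://example.com:80/a" matches
   "http://EXAMPLE.com/a", while two URIs of an unknown scheme match
   only when the strings are identical.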

670	3.3.2.  Authority component

672	   The authority component is scheme-specific, but many schemes follow a
673	   common syntax that allows for userinfo, host, and port.

675	3.3.2.1.  Host

677	   Section 3.1 discussed issues with hostnames in general.  In addition,
678	   [RFC3986] section 3.2.2 allows future changes using the IPvFuture
679	   production.  As with IPv4 and IPv6 literals, IPvFuture formats may
680	   have issues with multiple semantically identical string
681	   representations, and may also be semantically identical to an IPv4 or
682	   IPv6 address.  As such, false negatives may be common if IPvFuture is
683	   used.

685	3.3.2.2.  Port

687	   See discussion in Section 3.2.

689	3.3.2.3.  Userinfo

691	   [RFC3986] defines the userinfo production that allows arbitrary data
692	   about the user of the URI to be placed before '@' signs in URIs (see
693	   also Section 3.4).  For example:
694	   "http://alice:bob:chuck@example.com/bar" has the value "alice:bob:
695	   chuck" as its userinfo.  When comparing URIs in a security context,
696	   one must decide whether to treat the userinfo as being significant or
697	   not.  Some URI comparison services for example treat
698	   "http://alice:ick@example.com" and "http://example.com" as being
699	   equal.
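
   A short Python sketch of isolating the userinfo so that a comparator
   can make this decision explicitly is given below.  The helper name is
   hypothetical.

```python
from urllib.parse import urlsplit


def split_userinfo(uri: str):
    """Return (userinfo, hostport); userinfo is None when absent."""
    netloc = urlsplit(uri).netloc
    # The last "@" in the authority separates userinfo from host[:port].
    userinfo, at, hostport = netloc.rpartition("@")
    return (userinfo if at else None, hostport)
```

   A comparator that ignores userinfo would compare only the second
   element of the result; one that treats it as significant would
   compare both.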

701	3.3.3.  Path component

703	   [RFC3986] supports the use of path segment values such as "./" or
704	   "../" for relative URLs.  Strictly speaking, including such path
705	   segment values in a fully qualified URI is syntactically illegal but
706	   [RFC3986] section 5.2.4 nevertheless defines an algorithm to remove
707	   them.
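
   The following Python function is a sketch of the dot-segment removal
   algorithm of [RFC3986], written from that document's step-by-step
   description.  It is offered as an illustration, not as a reference
   implementation.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URI path."""
    output = []  # completed segments, each with its leading "/"
    inp = path
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]  # "/./" becomes "/"
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../"):
            inp = inp[3:]  # "/../" becomes "/"
            if output:
                output.pop()  # drop the previous segment
        elif inp == "/..":
            inp = "/"
            if output:
                output.pop()
        elif inp in (".", ".."):
            inp = ""
        else:
            # Move the first segment (with any leading "/") to output.
            idx = inp.find("/", 1)
            if idx == -1:
                segment, inp = inp, ""
            else:
                segment, inp = inp[:idx], inp[idx:]
            output.append(segment)
    return "".join(output)
```

   For example, remove_dot_segments("/a/b/c/./../../g") returns "/a/g".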

709	   Unless a scheme states otherwise, the path component is defined to be
710	   case-sensitive.  However, if the resource is stored and accessed
711	   using a filesystem using case-insensitive paths, there will be many
712	   paths that refer to the same resource.  As such, false negatives can
713	   be common in this case.

715	3.3.4.  Query component

717	   There is the question as to whether "http://example.com/foo",
718	   "http://example.com/foo?", and "http://example.com/foo?bar" are each
719	   considered equal or different.

721	   Similarly, it is unspecified whether the order of values matters.
722	   For example, should "http://example.com/blah?ick=bick&foo=bar" be
723	   considered equal to "http://example.com/blah?foo=bar&ick=bick"?  And
724	   if a domain name is permitted to appear in a query component (e.g.,
725	   in a reference to another URI), the same issues in Section 3.1 apply.
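
   Because the answer is unspecified, a comparator has to pick a policy.
   The Python sketch below makes the choice an explicit parameter; the
   function names are hypothetical.  Note that parse_qsl() also undoes
   percent-encoding, which is itself a canonicalization decision.

```python
from urllib.parse import parse_qsl, urlsplit


def query_pairs(uri: str):
    """Return the query component as a list of (name, value) pairs."""
    return parse_qsl(urlsplit(uri).query, keep_blank_values=True)


def queries_match(a: str, b: str, order_matters: bool) -> bool:
    """Compare query components with or without regard to order."""
    qa, qb = query_pairs(a), query_pairs(b)
    if order_matters:
        return qa == qb
    return sorted(qa) == sorted(qb)
```

   Two components of the same system that call queries_match() with
   different values of order_matters will disagree about the example
   URIs above.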

727	3.3.5.  Fragment component

729	   Some URI formats include fragment identifiers.  These are typically
730	   handles to locations within a resource and are used for local
731	   reference.  A classic example is the use of fragments in HTTP URLs
732	   where a URL of the form "http://example.com/blah.html#ick" means
733	   retrieve the resource "http://example.com/blah.html" and, once it has
734	   arrived locally, find the HTML anchor named ick and display that.

736	   So, for example, when a user clicks on the link
737	   "http://example.com/blah.html#baz" a browser will check its cache by
738	   doing a URI comparison for "http://example.com/blah.html" and, if the
739	   resource is present in the cache, a match is declared.

741	   Hence comparisons for security purposes typically ignore the fragment
742	   component and treat all fragments as equal to the full resource.
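
   In Python, for instance, the fragment can be dropped before
   comparison with urllib.parse.urldefrag(); the wrapper name below is
   hypothetical and the sketch applies no other normalization.

```python
from urllib.parse import urldefrag


def same_resource(a: str, b: str) -> bool:
    """Compare two URIs while ignoring their fragment components."""
    return urldefrag(a).url == urldefrag(b).url
```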

744	3.3.6.  Resolution for comparison

746	   As with Section 3.1.4 for hostnames, it may be tempting to define a
747	   URI comparison algorithm based on whether they resolve to the same
748	   content.  Similar problems exist, however, including content that
749	   dynamically changes over time or based on factors such as the
750	   requester's location, potential lack of external connectivity at the
751	   time/place comparison is done, potentially undesirable delay
752	   introduced, etc.

754	   In addition, as noted in Section 3.1.4, resolution leaks information
755	   about security decisions to outsiders if the queries are publicly
756	   observable.

758	3.4.  Email Address-like Identifiers

760	   Section 3.4.1 of [RFC5322] defines the syntax of an email address-
761	   like identifier, and Section 3.2 of [RFC6532] updates it to support
762	   internationalization.  [RFC5280], section 7.5, further discusses the
763	   use of internationalized email addresses in certificates.

765	   [RFC6532] in turn points to [RFC6530]; Section 13 of that document
766	   contains a discussion of many issues resulting from
767	   internationalization.

769	   Email address-like identifiers have a local part and a domain part.
770	   The issues with the domain part are essentially the same as with
771	   hostnames, covered earlier.

773	   The local part is left for each domain to define.  People quite
774	   commonly use email addresses as usernames with web sites like banks
775	   or shopping sites, but the site doesn't know whether foo@example.com
776	   is the same person as FOO@example.com.  Thus email-like identifiers
777	   are typically Indefinite identifiers.

779	   To avoid false positives, some security mechanisms (such as
780	   [RFC5280]) compare the local part using an exact match.  Hence, like
781	   URIs, email address-like identifiers are designed for use in grant-
782	   on-match security schemes, not in deny-on-match schemes.
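
   A minimal Python sketch of such a comparison follows: exact match on
   the local part and a case-insensitive match on the domain part.  The
   function names are hypothetical, and lower-casing the domain covers
   ASCII names only; the hostname issues discussed earlier still apply.

```python
def split_address(address: str):
    """Split an address at its last "@" into (local, domain)."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError("not an email address-like identifier")
    return local, domain


def addresses_match(a: str, b: str) -> bool:
    """Exact match on the local part, case-insensitive on the domain."""
    local_a, domain_a = split_address(a)
    local_b, domain_b = split_address(b)
    return local_a == local_b and domain_a.lower() == domain_b.lower()
```

   With this rule, foo@example.com and FOO@example.com do not match,
   which may be a false negative but cannot be a false positive.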

784	4.  General Internationalization Issues

786	   In addition to the issues with hostnames discussed in Section 3.1.3,
787	   there are a number of internationalization issues that apply to many
788	   types of Definite and Indefinite identifiers.

790	   First, there is no DNS mechanism for identifying whether two strings
791	   (such as "color" and "colour", although many non-English cases occur
792	   such as Saudi numeric strings, different forms of Chinese strings,
793	   etc.) would be seen by a human as being equivalent.  Attempts to
794	   produce such alternate forms algorithmically could produce false
795	   positives and hence have an adverse effect on security.

797	   Second, some strings are visually confusable with others, and hence
798	   if a security decision is made by a user based on visual inspection,
799	   many opportunities for false positives exist.  As such, using visual
800	   inspection for security is unreliable.

802	   Determining whether a string is a valid identifier should typically
803	   be done after, or as part of, canonicalization.  Otherwise an
804	   attacker might use the canonicalization algorithm to inject (e.g.,
805	   via percent encoding, NFKC, or non-shortest-form UTF-8) delimiters
806	   such as '@' in an email address-like identifier, or a '.' in a
807	   hostname.
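
   The ordering problem can be sketched in Python as follows.  Both
   label patterns are simplified assumptions, and the function names are
   hypothetical; the point is only the order of the two steps.

```python
import re
from urllib.parse import unquote

# Simplified patterns: the raw form tolerates percent escapes, the
# canonical form does not.
RAW_LABEL = re.compile(r"(?:[A-Za-z0-9-]|%[0-9A-Fa-f]{2})+")
CANONICAL_LABEL = re.compile(r"[A-Za-z0-9-]+")


def check_then_decode(label: str) -> str:
    """Risky order: validate the raw form, then canonicalize."""
    if not RAW_LABEL.fullmatch(label):
        raise ValueError("invalid label")
    return unquote(label)


def decode_then_check(label: str) -> str:
    """Safer order: canonicalize first, then validate the result."""
    decoded = unquote(label, errors="strict")
    if not CANONICAL_LABEL.fullmatch(decoded):
        raise ValueError("invalid label")
    return decoded
```

   Given "example%2Ecom", check_then_decode() accepts the input as one
   label and returns "example.com", which contains an injected '.',
   whereas decode_then_check() rejects it.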

809	   Any case-insensitive comparisons need to define how comparison is
810	   done, since such comparisons may vary by locale of the endpoint.  As
811	   such, using case-insensitive comparisons in general often results in
812	   identifiers being either Indefinite or, if the legal character set is
813	   restricted (e.g., to ASCII), then Definite.
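
   The need to define the comparison precisely can be seen even without
   locales.  Python's string methods are locale-independent, so the
   sketch below shows three different "case-insensitive" algorithms
   rather than three locales; the function names are hypothetical.

```python
def equal_lower(a: str, b: str) -> bool:
    """Compare after simple Unicode lower-casing."""
    return a.lower() == b.lower()


def equal_casefold(a: str, b: str) -> bool:
    """Compare after Unicode case folding."""
    return a.casefold() == b.casefold()


def ascii_lower(s: str) -> str:
    """Lower-case only the ASCII letters A-Z; leave all else unchanged."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def equal_ascii_only(a: str, b: str) -> bool:
    """Compare after lower-casing ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)


# "\u00df" is LATIN SMALL LETTER SHARP S.
SHARP_S_FORM = "Stra\u00dfe"
ASCII_FORM = "STRASSE"
```

   The two sample strings are unequal under equal_lower() and
   equal_ascii_only() but equal under equal_casefold(), so two endpoints
   that each claim to compare "case-insensitively" can still disagree.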

815	   See also [WEBER] for a more visual discussion of many of these
816	   issues.

818	   Finally, the set of permitted characters and the canonical form of
819	   the characters (and hence the canonicalization algorithm) sometimes
820	   varies by protocol today, even when the intent is to use the same
821	   identifier, such as when one protocol passes identifiers to the
822	   other.  See [I-D.ietf-precis-problem-statement] for further
823	   discussion.

825	5.  Security Considerations

827	   This entire document is about security considerations.

829	   To minimize elevation of privilege issues, any system that requires
830	   the ability to use both deny and allow operations within the same
831	   identifier space should avoid the use of Indefinite identifiers in
832	   security comparisons.

834	   To minimize future security risks, any new identifiers being designed
835	   should specify an Absolute or Definite comparison algorithm, and if
836	   extensibility is allowed (e.g., as new schemes in URIs allow) then
837	   the comparison algorithm should remain invariant so that unrecognized
838	   extensions can be compared.  That is, security risks can be reduced
839	   by specifying the comparison algorithm, making sure to resolve any
840	   ambiguities pointed out in this document (e.g., "standard dotted
841	   decimal").

843	   Some issues (such as unrecognized extensions) can be mitigated by
844	   treating such identifiers as invalid.  Validity checking of
845	   identifiers is further discussed in [RFC3696].

847	   Perhaps the hardest issues arise when multiple protocols are used
848	   together, such as in the figure in Section 2, where the two protocols
849	   are defined or implemented using different comparison algorithms.
850	   When constructing an architecture that uses multiple such protocols,
851	   designers should pay attention to any differences in comparison
852	   algorithms among the protocols, in order to fully understand the
853	   security risks.  An area for future work is how to deal with such
854	   security risks in current systems.

856	6.  Acknowledgements

858	   Yaron Goland contributed to much of the discussion on URIs.  Patrick
859	   Faltstrom contributed to the background on identifiers.  Additional
860	   helpful feedback and suggestions came from Magnus Nystrom, Bernard
861	   Aboba, Mark Davis, John Klensin, and Russ Housley.

863	7.  IANA Considerations

865	   This document requires no actions by the IANA.

867	8.  Informative References

869	   [EE]       Mills, E., "Evidence Explained: Citing History Sources
870	              from Artifacts to Cyberspace", 2007.

872	   [I-D.ietf-pkix-rfc5280-clarifications]
873	              Cooper, D., "Updates to the Internet X.509 Public Key
874	              Infrastructure Certificate and Certificate Revocation List
875	              (CRL) Profile", draft-ietf-pkix-rfc5280-clarifications-04
876	              (work in progress), March 2012.

878	   [I-D.ietf-precis-problem-statement]
879	              Blanchet, M. and A. Sullivan, "Stringprep Revision Problem
880	              Statement", draft-ietf-precis-problem-statement-05 (work
881	              in progress), March 2012.

883	   [IAB1123]  IAB, "The interpretation of rules in the ICANN gTLD
884	              Applicant Guidebook", February 2012, <http://www.iab.org/
885	              documents/correspondence-reports-documents/2012-2/
886	              iab-statement-the-interpretation-of-rules-in-the-icann-
887	              gtld-applicant-guidebook>.

889	   [IANA-PORT]
890	              IANA, "PORT NUMBERS", June 2011,
891	              <http://www.iana.org/assignments/port-numbers>.

893	   [IEEE-1003.1]
894	              IEEE and The Open Group, "The Open Group Base
895	              Specifications, Issue 6 IEEE Std 1003.1, 2004 Edition",
896	              IEEE Std 1003.1, 2004.

898	   [JAVAURL]  Oracle, "Class URL, Java(TM) Platform, Standard Ed. 7",
899	              2011, <http://docs.oracle.com/javase/7/docs/api/java/net/
900	              URL.html>.

902	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
903	              host table specification", RFC 952, October 1985.

905	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
906	              and Support", STD 3, RFC 1123, October 1989.

908	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
909	              Languages", BCP 18, RFC 2277, January 1998.

911	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
912	              "Internationalizing Domain Names in Applications (IDNA)",
913	              RFC 3490, March 2003.

915	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
916	              for Internationalized Domain Names in Applications
917	              (IDNA)", RFC 3492, March 2003.

919	   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
920	              Stevens, "Basic Socket Interface Extensions for IPv6",
921	              RFC 3493, February 2003.

923	   [RFC3696]  Klensin, J., "Application Techniques for Checking and
924	              Transformation of Names", RFC 3696, February 2004.

926	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
927	              Resource Identifier (URI): Generic Syntax", STD 66,
928	              RFC 3986, January 2005.

930	   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
931	              Architecture", RFC 4291, February 2006.

933	   [RFC5280]  Cooper, D., Santesson, S., Farrell, S., Boeyen, S.,
934	              Housley, R., and W. Polk, "Internet X.509 Public Key
935	              Infrastructure Certificate and Certificate Revocation List
936	              (CRL) Profile", RFC 5280, May 2008.

938	   [RFC5322]  Resnick, P., Ed., "Internet Message Format", RFC 5322,
939	              October 2008.

941	   [RFC5952]  Kawamura, S. and M. Kawashima, "A Recommendation for IPv6
942	              Address Text Representation", RFC 5952, August 2010.

944	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
945	              Encodings for Internationalized Domain Names", RFC 6055,
946	              February 2011.

948	   [RFC6066]  Eastlake, D., "Transport Layer Security (TLS) Extensions:
949	              Extension Definitions", RFC 6066, January 2011.

951	   [RFC6335]  Cotton, M., Eggert, L., Touch, J., Westerlund, M., and S.
952	              Cheshire, "Internet Assigned Numbers Authority (IANA)
953	              Procedures for the Management of the Service Name and
954	              Transport Protocol Port Number Registry", BCP 165,
955	              RFC 6335, August 2011.

957	   [RFC6530]  Klensin, J. and Y. Ko, "Overview and Framework for
958	              Internationalized Email", RFC 6530, February 2012.

960	   [RFC6532]  Yang, A., Steele, S., and N. Freed, "Internationalized
961	              Email Headers", RFC 6532, February 2012.

963	   [TR36]     Unicode Consortium, "Unicode Security Considerations",
964	              Unicode Technical Report 36, August 2004.

966	   [WEBER]    Weber, C., "Attacking Software Globalization", March 2010,
967	              <http://www.casabasecurity.com/files/
968	              Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf>.

970	Author's Address

972	   Dave Thaler (editor)
973	   Microsoft Corporation
974	   One Microsoft Way
975	   Redmond, WA  98052
976	   USA

978	   Phone: +1 425 703 8835
979	   Email: dthaler@microsoft.com