idnits 2.17.1 

draft-iab-identifier-comparison-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  == There are 2 instances of lines with private range IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use any of the ranges defined in RFC 6890 (or successor): 192.0.2.x,
     198.51.100.x or 203.0.113.x.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 418: '....  Host software MUST support this mor...'
     RFC 2119 keyword, line 423: '...dentity of an Internet host, it SHOULD...'
     RFC 2119 keyword, line 425: '...#.#.#.#") form.  The host SHOULD check...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 16, 2012) is 4302 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'RFC5890' is mentioned on line 548, but not defined

  == Outdated reference: A later version (-06) exists of
     draft-ietf-6man-uri-zoneid-02

  == Outdated reference: A later version (-11) exists of
     draft-ietf-pkix-rfc5280-clarifications-05

  == Outdated reference: A later version (-09) exists of
     draft-ietf-precis-problem-statement-06

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)


     Summary: 1 error (**), 0 flaws (~~), 7 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                     D. Thaler, Ed.
3	Internet-Draft                                                 Microsoft
4	Intended status: Informational                             July 16, 2012
5	Expires: January 17, 2013

7	         Issues in Identifier Comparison for Security Purposes
8	                 draft-iab-identifier-comparison-03.txt

10	Abstract

12	   Identifiers such as hostnames, URIs, and email addresses are often
13	   used in security contexts to identify security principals and
14	   resources.  In such contexts, an identifier supplied via some
15	   protocol is often compared against some policy to make security
16	   decisions such as whether the principal may access the resource, what
17	   level of authentication or encryption is required, etc.  If the
18	   parties involved in a security decision use different algorithms to
19	   compare identifiers, then failure scenarios ranging from denial of
20	   service to elevation of privilege can result.

22	Status of this Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current Internet-
30	   Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   This Internet-Draft will expire on January 17, 2013.

39	Copyright Notice

41	   Copyright (c) 2012 IETF Trust and the persons identified as the
42	   document authors.  All rights reserved.

44	   This document is subject to BCP 78 and the IETF Trust's Legal
45	   Provisions Relating to IETF Documents
46	   (http://trustee.ietf.org/license-info) in effect on the date of
47	   publication of this document.  Please review these documents
48	   carefully, as they describe your rights and restrictions with respect
49	   to this document.  Code Components extracted from this document must
50	   include Simplified BSD License text as described in Section 4.e of
51	   the Trust Legal Provisions and are provided without warranty as
52	   described in the Simplified BSD License.

54	Table of Contents

56	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
57	     1.1.  Canonicalization . . . . . . . . . . . . . . . . . . . . .  4
58	   2.  Security Uses  . . . . . . . . . . . . . . . . . . . . . . . .  5
59	     2.1.  Types of Identifiers . . . . . . . . . . . . . . . . . . .  6
60	     2.2.  False Positives and Negatives  . . . . . . . . . . . . . .  7
61	     2.3.  Hypothetical Example . . . . . . . . . . . . . . . . . . .  8
62	   3.  Common Identifiers . . . . . . . . . . . . . . . . . . . . . .  9
63	     3.1.  Hostnames  . . . . . . . . . . . . . . . . . . . . . . . .  9
64	       3.1.1.  IPv4 Literals  . . . . . . . . . . . . . . . . . . . .  9
65	       3.1.2.  IPv6 Literals  . . . . . . . . . . . . . . . . . . . . 11
66	       3.1.3.  Internationalization . . . . . . . . . . . . . . . . . 12
67	       3.1.4.  Resolution for comparison  . . . . . . . . . . . . . . 12
68	     3.2.  Ports and Service Names  . . . . . . . . . . . . . . . . . 13
69	     3.3.  URIs . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
70	       3.3.1.  Scheme component . . . . . . . . . . . . . . . . . . . 15
71	       3.3.2.  Authority component  . . . . . . . . . . . . . . . . . 15
72	       3.3.3.  Path component . . . . . . . . . . . . . . . . . . . . 16
73	       3.3.4.  Query component  . . . . . . . . . . . . . . . . . . . 16
74	       3.3.5.  Fragment component . . . . . . . . . . . . . . . . . . 16
75	       3.3.6.  Resolution for comparison  . . . . . . . . . . . . . . 17
76	     3.4.  Email Address-like Identifiers . . . . . . . . . . . . . . 17
77	   4.  General Internationalization Issues  . . . . . . . . . . . . . 17
78	   5.  General Scope Issues . . . . . . . . . . . . . . . . . . . . . 19
79	   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 19
80	   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 20
81	   8.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 20
82	   9.  Informative References . . . . . . . . . . . . . . . . . . . . 20
83	   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 23

85	1.  Introduction

87	   In computing and the Internet, various types of "identifiers" are
88	   used to identify humans, devices, content, etc.  Before discussing
89	   security issues, we first give some background on some typical
90	   processes involving identifiers.

92	   As depicted in Figure 1, there are multiple processes relevant to our
93	   discussion.
94	   1.  An identifier must first be generated.  If the identifier is
95	       intended to be unique, the generation process includes some
96	       mechanism, such as allocation by a central authority, to help
97	       ensure uniqueness.  However, the notion of "unique" involves
98	       determining whether a putative identifier matches any other
99	       already-allocated identifier.  As we will see, for many types of
100	       identifiers, this is not simply an exact binary match.

102	       As a result of generating the identifier, it is often stored in
103	       two locations: with the requester or "holder" of the identifier,
104	       and with some repository of identifiers (e.g., DNS).  For
105	       example, if the identifier was allocated by a central authority,
106	       the repository might be that authority.  If the identifier
107	       identifies a device or content on a device, the repository might
108	       be that device.
109	   2.  The identifier must be distributed, either by the holder of the
110	       identifier or by a repository of identifiers, to others who could
111	       use the identifier.  This distribution might be electronic, but
112	       sometimes it is via other channels such as voice, business card,
113	       billboard, or other form of advertisement.  The identifier itself
114	       might be distributed directly, or it might be used to generate a
115	       portion of another type of identifier that is then distributed.
116	       For example, a URI or email address might include a server name,
117	       and hence distributing the URI or email address also inherently
118	       distributes the server name.
119	   3.  The identifier must be used by some party.  Generally the user
120	       supplies the identifier which is (directly or indirectly) sent to
121	       the repository of identifiers.  For example, using an email
122	       address to send email to the holder of an identifier may result
123	       in the email arriving at the holder's email server which has
124	       access to the mail stores.

126	       The repository of identifiers must then attempt to match the
127	       user-supplied identifier with an identifier in its repository.

129	                            +------------+
130	                            |  Holder of |     1. Generation
131	                            | identifier +<---------+
132	                            +----+-------+          |
133	                                 |                  | Match
134	                                 |                  v/
135	                                 |          +-------+-------+
136	                                 +----------+ Repository of |
137	                                 |          |  identifiers  |
138	                                 |          +-------+-------+
139	                 2. Distribution |                  ^\
140	                                 |                  | Match
141	                                 v                  |
142	                       +---------+-------+          |
143	                       |      User of    |          |
144	                       |    identifier   +----------+
145	                       +-----------------+    3. Use

147	                       Typical Identifier Processes

149	                                 Figure 1

151	   One key aspect is that the identifier values passed in generation,
152	   distribution, and use may all be in different forms.  For example,
153	   generation might be exchanged in printed form, distribution done via
154	   voice, and use done electronically.  As such, the match process can
155	   be complicated.

157	   Furthermore, in many uses, the relationship between holder,
158	   repositories, and users may be more involved.  For example, when a
159	   hierarchy of web caches exists, each cache is itself a repository of a
160	   sort, and the match process is usually intended to be the same as on
161	   the origin server.

163	1.1.  Canonicalization

165	   Perhaps the most common algorithm for comparison involves first
166	   converting each identifier to a canonical form (a process known as
167	   "canonicalization" or "normalization"), and then testing the
168	   resulting canonical representations for bitwise equality.  In so
169	   doing, it is thus critical that all entities involved agree on the
170	   same canonical form and use the same canonicalization algorithm so
171	   that the overall comparison process is also the same.
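
As a minimal sketch of this canonicalize-then-compare pattern, consider two parties that each reduce an identifier to a canonical byte string and test the results for bitwise equality. The function names and both canonicalization rules below are assumptions made for illustration only; they are not taken from any specification. The point is that the parties disagree as soon as their canonical forms differ.

```python
def canonicalize_a(identifier: str) -> bytes:
    """Party A (assumed rule): lowercase ASCII letters, nothing else."""
    return identifier.encode("ascii").lower()


def canonicalize_b(identifier: str) -> bytes:
    """Party B (assumed rule): lowercase, and also drop one trailing dot."""
    canonical = identifier.encode("ascii").lower()
    return canonical[:-1] if canonical.endswith(b".") else canonical


def equal(canonicalize, left: str, right: str) -> bool:
    """Compare two identifiers by bitwise equality of their canonical forms."""
    return canonicalize(left) == canonicalize(right)


# Both parties agree on this pair ...
agree = (equal(canonicalize_a, "Example.COM", "example.com"),
         equal(canonicalize_b, "Example.COM", "example.com"))

# ... but not on this one, because their canonical forms differ.
disagree = (equal(canonicalize_a, "example.com.", "example.com"),
            equal(canonicalize_b, "example.com.", "example.com"))
```

Here `agree` holds two identical answers, while `disagree` holds one answer per party, which is exactly the situation the text warns about.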

173	   Note that in some contexts, such as in internationalization, the
174	   terms "canonicalization" and "normalization" have a precise meaning.
175	   In this document, however, we use these terms synonymously in their
176	   more generic form, to mean conversion to some standard form.

178	   While the most common method of comparison includes canonicalization,
179	   comparison can also be done by defining an equivalence algorithm,
180	   where no single form is canonical.  However, in most cases, a
181	   canonical form is useful for other purposes, such as output, and so
182	   in such cases defining a canonical form suffices to define a
183	   comparison method.

185	2.  Security Uses

187	   Identifiers such as hostnames, URIs, and email addresses are used in
188	   security contexts to identify principals and resources as well as
189	   other security parameters such as types and values of claims.  Those
190	   identifiers are then used to make security decisions based on an
191	   identifier supplied via some protocol.  For example:
192	   o  Authentication: a protocol might match a security principal
193	      identifier to look up expected keying material, and then match
194	      keying material.
195	   o  Authorization: a protocol might match a resource name to look up
196	      an access control list (ACL), and then look up the security
197	      principal identifier (or a surrogate for it) in that ACL.
198	   o  Accounting: a system might create an accounting record for a
199	      security principal identifier or resource name, and then might
200	      later need to match a supplied identifier to allow (for example)
201	      law enforcement to follow up based on the records, or add new
202	      filtering rules based on the records in order to stop an attack.

204	   If the parties involved in a security decision use different matching
205	   algorithms for the same identifiers, then failure scenarios ranging
206	   from denial of service to elevation of privilege can result, as we
207	   will see.

209	   This is especially complicated in cases involving multiple parties
210	   and multiple protocols.  For example, there are many scenarios where
211	   some form of "security token service" is used to grant to a requester
212	   permission to access a resource, where the resource is held by a
213	   third party that relies on the security token service (see Figure 2).
214	   The protocol used to request permission (e.g., Kerberos or OAuth) may
215	   be different from the protocol used to access the resource (e.g.,
216	   HTTP).  Opportunities for security problems arise when two protocols
217	   define different comparison algorithms for the same type of
218	   identifier, or when a protocol is ambiguously specified and two
219	   endpoints (e.g., a security token service and a resource holder)
220	   implement different algorithms within the same protocol.

222	        +----------+
223	        | security |
224	        |  token   |
225	        | service  |
226	        +----------+
227	             ^
228	             | 1. supply credentials and
229	             | get token for resource
230	             |                                             +--------+
231	        +----------+  2. supply token and access resource  |resource|
232	        |requester |=------------------------------------->| holder |
233	        +----------+                                       +--------+

235	                         Simple Security Exchange

237	                                 Figure 2

239	   In many cases the situation is more complex.  With certificates, the
240	   name in a certificate gets compared against names in ACLs or other
241	   things.  In the case of web site security, the name in the
242	   certificate gets compared to a portion of the URI that a user may
243	   have typed into a browser.  The fact that many different people are
244	   doing the typing, on many different types of systems, complicates the
245	   problem.

247	   Add to this the certificate enrollment step, and the certificate
248	   issuance step, and two more parties have an opportunity to adjust the
249	   encoding or, worse, the software that supports them might make changes
250	   of which the parties are unaware.

252	2.1.  Types of Identifiers

254	   In this document we will refer to the following types of identifiers:

256	   o  Absolute: identifiers that can be compared byte-by-byte for
257	      equality.  Two identifiers that have different bytes are defined
258	      to be different.  For example, binary IP addresses are in this
259	      class.
260	   o  Definite: identifiers that have a well-defined comparison
261	      algorithm on which all parties agree.  For example, URI scheme
262	      names are required to be ASCII and are defined to match in a case-
263	      insensitive way; the comparison is thus definite since all parties
264	      agree on how to do a case-insensitive match among ASCII strings.
265	   o  Indefinite: identifiers that have no single comparison algorithm
266	      on which all parties agree.  For example, human names are in this
267	      class.  Everyone might want the comparison to be tailored for
268	      their locale, for some definition of locale.  In some cases, there
269	      may be limited subsets of parties that might be able to agree
270	      (e.g., ASCII users might all agree on a common comparison
271	      algorithm whereas users of other Latin scripts, such as Turkish,
272	      may not), but identifiers often tend to leak out of such limited
273	      environments.
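
The difference between a Definite and an Indefinite comparison can be sketched as follows. The first function is an ASCII-only case-insensitive match of the kind used for URI scheme names. The second is one plausible (but locale-blind) rule that a party might apply to human-readable names. Both function names are hypothetical, and the second rule is only an assumed example of what some party might choose.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I, used in Turkish


def ascii_casefold(text: str) -> str:
    """Lowercase an ASCII-only identifier; reject anything else."""
    if not text.isascii():
        raise ValueError("not an ASCII identifier")
    return text.lower()


def definite_equal(left: str, right: str) -> bool:
    """Definite: every party can implement this rule identically."""
    return ascii_casefold(left) == ascii_casefold(right)


def unicode_upper_equal(left: str, right: str) -> bool:
    """Assumed locale-blind rule: compare after default Unicode uppercasing."""
    return left.upper() == right.upper()


scheme_match = definite_equal("HTTP", "http")

# Default uppercasing maps both "i" and dotless i to "I", so this rule treats
# two letters as equal that a Turkish reader considers distinct.
merged = unicode_upper_equal("i", DOTLESS_I)
```

A party that tailors the comparison to a Turkish locale would not merge these two letters, so the two parties reach different answers for the same pair of strings.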

275	2.2.  False Positives and Negatives

277	   It is first worth discussing in more detail the effects of errors in
278	   the comparison algorithm.  A "false positive" results when two
279	   identifiers compare as if they were equal, but in reality refer to
280	   two different objects (e.g., security principals or resources).  When
281	   privilege is granted on a match, a false positive thus results in an
282	   elevation of privilege, for example allowing execution of an
283	   operation that should not have been permitted otherwise.  When
284	   privilege is denied on a match (e.g., matching an entry in a block/
285	   deny list or a revocation list), a permissible operation is denied.
286	   At best, this can cause worse performance (e.g., a cache miss, or
287	   forcing redundant authentication), and at worst can result in a
288	   denial of service.

290	   A "false negative" results when two identifiers that in reality refer
291	   to the same thing compare as if they were different, and the effects
292	   are the reverse of those for false positives.  That is, when
293	   privilege is granted on a match, the result is at best worse
294	   performance and at worst a denial of service; when privilege is
295	   denied on a match, elevation of privilege results.

297	   Figure 3 summarizes these effects.

299	                  | "Grant on match"       | "Deny on match"
300	   ---------------+------------------------+-----------------------
301	   False positive | Elevation of privilege | Denial of service
302	   ---------------+------------------------+-----------------------
303	   False negative | Denial of service      | Elevation of privilege
304	   ---------------+------------------------+-----------------------

306	                    Effect of False Positives/Negatives

308	                                 Figure 3
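
The table in Figure 3 can also be derived mechanically rather than memorized. The following sketch, with a hypothetical function name and parameter names, computes the effect of a comparison result given whether the two identifiers really refer to the same object and whether the policy grants or denies on a match.

```python
def effect(compared_equal: bool, really_same: bool, grant_on_match: bool) -> str:
    """Classify the outcome of one comparison under a given policy."""
    # What the system actually decides, based on the comparison result.
    granted = compared_equal if grant_on_match else not compared_equal
    # What it should have decided, based on the real-world relationship.
    should_grant = really_same if grant_on_match else not really_same

    if granted and not should_grant:
        return "elevation of privilege"
    if should_grant and not granted:
        return "denial of service"
    return "correct"


# A false positive is compared_equal=True with really_same=False;
# a false negative is compared_equal=False with really_same=True.
false_positive_grant = effect(True, False, grant_on_match=True)
false_positive_deny = effect(True, False, grant_on_match=False)
false_negative_grant = effect(False, True, grant_on_match=True)
false_negative_deny = effect(False, True, grant_on_match=False)
```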

310	   Elevation of privilege is almost always seen as far worse than denial
311	   of service.  Hence, for URIs for example, Section 6.1 of [RFC3986]
312	   states: "comparison methods are designed to minimize false negatives
313	   while strictly avoiding false positives".

315	   Thus URIs were defined with a "grant privilege on match" paradigm in
316	   mind, where it is critical to prevent elevation of privilege while
317	   minimizing denial of service.  Using URIs in a "deny privilege on
318	   match" system can thus be problematic.

320	2.3.  Hypothetical Example

322	   In this example, both security principals and resources are
323	   identified using URIs.  Foo Corp has paid example.com for access to
324	   the Stuff service.  Foo Corp allows its employees to create accounts
325	   on the Stuff service.  Alice gets the account
326	   "http://example.com/Stuff/FooCorp/alice" and Bob gets
327	   "http://example.com/Stuff/FooCorp/bob".  It turns out, however, that
328	   Foo Corp's URI canonicalizer includes URI fragment components in
329	   comparisons whereas example.com's does not, and Foo Corp does not
330	   disallow the # character in the account name.  So Chuck, who is a
331	   malicious employee of Foo Corp, asks to create an account at
332	   example.com with the name alice#stuff.  Foo Corp's URI logic checks
333	   its records for accounts it has created with Stuff and sees that
334	   there is no account with the name alice#stuff.  Hence, in its
335	   records, it associates the account alice#stuff with Chuck and will
336	   only issue tokens good for use with
337	   "http://example.com/Stuff/FooCorp/alice#stuff" to Chuck.

339	   Chuck, the attacker, goes to a security token service at Foo Corp and
340	   asks for a security token good for
341	   "http://example.com/Stuff/FooCorp/alice#stuff".  Foo Corp issues the
342	   token since Chuck is the legitimate owner (in Foo Corp's view) of the
343	   alice#stuff account.  Chuck then submits the security token in a
344	   request to "http://example.com/Stuff/FooCorp/alice".

346	   But example.com uses a URI canonicalizer that, for the purposes of
347	   checking equality, ignores fragments.  So when example.com looks in
348	   the security token to see if the requester has permission from Foo
349	   Corp to access the given account it successfully matches the URI in
350	   the security token, "http://example.com/Stuff/FooCorp/alice#stuff",
351	   with the requested resource name
352	   "http://example.com/Stuff/FooCorp/alice".

354	   Leveraging the inconsistencies in the canonicalizers used by Foo Corp
355	   and example.com, Chuck is able to successfully launch an elevation of
356	   privilege attack and access Alice's resource.
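
The mismatch in this example can be reproduced in a few lines. The sketch below assumes that Foo Corp compares the full URI string and that example.com strips the fragment before comparing; the two function names are hypothetical, and urldefrag() from the Python standard library stands in for example.com's canonicalizer.

```python
from urllib.parse import urldefrag

ALICE = "http://example.com/Stuff/FooCorp/alice"
CHUCK = "http://example.com/Stuff/FooCorp/alice#stuff"


def foo_corp_equal(left: str, right: str) -> bool:
    """Foo Corp (assumed): the fragment is part of the identifier."""
    return left == right


def example_com_equal(left: str, right: str) -> bool:
    """example.com (assumed): fragments are ignored when comparing."""
    return urldefrag(left).url == urldefrag(right).url


# Foo Corp sees two distinct accounts, so it issues Chuck a token for CHUCK.
issuer_sees_same_account = foo_corp_equal(CHUCK, ALICE)

# example.com matches the URI in Chuck's token against Alice's resource.
resource_holder_accepts = example_com_equal(CHUCK, ALICE)
```

The issuer's answer is "different" and the resource holder's answer is "same", which is the false positive that gives Chuck access to Alice's resource.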

358	   Furthermore, consider an attacker using a similar corporation such as
359	   "foocorp" (or any variation containing a non-ASCII character that
360	   some humans might expect to represent the same corporation).  If the
361	   resource holder treats them as different, but the security token
362	   service treats them as the same, then again elevation of privilege
363	   can occur.

365	3.  Common Identifiers

367	   In this section, we walk through a number of common types of
368	   identifiers and discuss various issues related to comparison that may
369	   affect security whenever they are used to identify security
370	   principals or resources.  These examples illustrate common patterns
371	   that may arise with other types of identifiers.

373	3.1.  Hostnames

375	   Hostnames (composed of dot-separated labels) are commonly used either
376	   directly as identifiers, or as components in identifiers such as in
377	   URIs and email addresses.  Another example is in [RFC5280], sections
378	   7.2 and 7.3 (and updated in section 3 of
379	   [I-D.ietf-pkix-rfc5280-clarifications]), which specify use in
380	   certificates.

382	   In this section we discuss a number of issues in comparing strings
383	   that appear to be some form of hostname.

385	   Section 3 of [RFC6055] discusses the differences between a "hostname"
386	   and a "DNS name", where the former is a subset of the latter by using
387	   a restricted set of characters.  If one canonicalizer uses the "DNS
388	   name" definition whereas another uses a "hostname" definition, a name
389	   might be valid in the former but invalid in the latter.  As long as
390	   invalid identifiers are denied privilege, this difference will not
391	   result in elevation of privilege.

393	   [IAB1123] briefly discusses issues with the ambiguity around whether
394	   a label will be "alphabetic", including among other issues, how
395	   "alphabetic" should be interpreted in an internationalized
396	   environment, and whether a hostname can be interpreted as an IP
397	   address.  We explore this last issue in more detail below.

399	3.1.1.  IPv4 Literals

401	   [RFC0952] defined an entry in the "Internet host table" as follows:

403	      A "name" (Net, Host, Gateway, or Domain name) is a text string up
404	      to 24 characters drawn from the alphabet (A-Z), digits (0-9),
405	      minus sign (-), and period (.).  Note that periods are only
406	      allowed when they serve to delimit components of "domain style
407	      names". [...]  No blank or space characters are permitted as part
408	      of a name.  No distinction is made between upper and lower case.
409	      The first character must be an alpha character.  The last
410	      character must not be a minus sign or period. [...]  Single
411	      character names or nicknames are not allowed.

413	   [RFC1123] section 2.1 then updates the definition with:

415	      The syntax of a legal Internet host name was specified in RFC-952
416	      [DNS:4].  One aspect of host name syntax is hereby changed: the
417	      restriction on the first character is relaxed to allow either a
418	      letter or a digit.  Host software MUST support this more liberal
419	      syntax.

421	   and

423	      Whenever a user inputs the identity of an Internet host, it SHOULD
424	      be possible to enter either (1) a host domain name or (2) an IP
425	      address in dotted-decimal ("#.#.#.#") form.  The host SHOULD check
426	      the string syntactically for a dotted-decimal number before
427	      looking it up in the Domain Name System.

429	   and

431	      This last requirement is not intended to specify the complete
432	      syntactic form for entering a dotted-decimal host number; that is
433	      considered to be a user-interface issue.

435	   In specifying the inet_addr() API, the POSIX standard [IEEE-1003.1]
436	   defines "IPv4 dotted decimal notation" as allowing not only strings
437	   of the form "10.0.1.2", but also octal and hexadecimal parts, and
438	   addresses with fewer than four parts.  For example, "10.0.258",
439	   "0xA000102", and "012.0x102" all represent the same IPv4 address in
440	   standard "IPv4 dotted decimal" notation.  We will refer to this as
441	   the "loose" syntax of an IPv4 address literal.

443	   In section 6.1 of [RFC3493] getaddrinfo() is defined to support the
444	   same (loose) syntax as inet_addr():

446	      If the specified address family is AF_INET or AF_UNSPEC, address
447	      strings using Internet standard dot notation as specified in
448	      inet_addr() are valid.

450	   In contrast, section 6.3 of the same RFC states, specifying
451	   inet_pton():

453	      If the af argument of inet_pton() is AF_INET, the src string shall
454	      be in the standard IPv4 dotted-decimal form: ddd.ddd.ddd.ddd where
455	      "ddd" is a one to three digit decimal number between 0 and 255.
456	      The inet_pton() function does not accept other formats (such as
457	      the octal numbers, hexadecimal numbers, and fewer than four
458	      numbers that inet_addr() accepts).

460	   As shown above, inet_pton() uses what we will refer to as the
461	   "strict" form of an IPv4 address literal.  Some platforms also use
462	   the strict form with getaddrinfo() when the AI_NUMERICHOST flag is
463	   passed to it.
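
To make the difference between the two syntaxes concrete, the following sketch implements both in pure Python rather than calling the platform's inet_addr() or inet_pton(), so that its behavior does not depend on the local C library. The function names are hypothetical. The loose parser follows the inet_addr() rules as described above (one to four parts; decimal, octal, or hexadecimal; the last part fills the remaining bytes). The strict parser accepts exactly four decimal parts; treating a leading zero as decimal is an assumption of this sketch, and real implementations may differ. Note that the single-part hexadecimal form of 10.0.1.2 is 0xA000102.

```python
import string


def _parse_part(part: str):
    """Parse one loose-syntax part as hex (0x), octal (leading 0) or decimal."""
    if part[:2].lower() == "0x":
        digits, base, allowed = part[2:], 16, string.hexdigits
    elif len(part) > 1 and part[0] == "0":
        digits, base, allowed = part[1:], 8, string.octdigits
    else:
        digits, base, allowed = part, 10, string.digits
    if not digits or any(c not in allowed for c in digits):
        return None
    return int(digits, base)


def parse_loose(text: str):
    """Return the address as a 32-bit integer, or None if not parseable."""
    parts = text.split(".")
    if len(parts) > 4:
        return None
    values = [_parse_part(p) for p in parts]
    if any(v is None for v in values):
        return None
    *leading, last = values
    if any(v > 0xFF for v in leading):
        return None
    last_bits = 8 * (4 - len(leading))  # the last part fills what is left
    if last >= 1 << last_bits:
        return None
    address = 0
    for v in leading:
        address = (address << 8) | v
    return (address << last_bits) | last


def parse_strict(text: str):
    """Accept only ddd.ddd.ddd.ddd with each part a decimal 0..255."""
    parts = text.split(".")
    if len(parts) != 4:
        return None
    address = 0
    for part in parts:
        if not 1 <= len(part) <= 3 or any(c not in string.digits for c in part):
            return None
        value = int(part)
        if value > 255:
            return None
        address = (address << 8) | value
    return address


loose = [parse_loose(s) for s in ("10.0.1.2", "10.0.258", "0xA000102", "012.0x102")]
strict = [parse_strict(s) for s in ("10.0.1.2", "10.0.258", "0xA000102", "012.0x102")]
```

All four strings yield the same address under the loose parser, while the strict parser accepts only the first. A component using the strict form would therefore treat the other three as hostnames or as invalid, not as addresses.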

465	   Both the strict and loose forms are standard forms, and hence a
466	   protocol specification is still ambiguous if it simply defines a
467	   string to be in the "standard IPv4 dotted decimal form".  And, as a
468	   result of these differences, names such as "10.11.12" are ambiguous
469	   as to whether they are an IP address or a hostname, and even
470	   "10.11.12.13" can be ambiguous because of the "SHOULD" in RFC 1123
471	   above making it optional whether to treat it as an address or a name.

473	   Protocols and data formats that can use addresses in string form for
474	   security purposes need to resolve these ambiguities.  For example,
475	   for the host component of URIs, section 3.2.2 of [RFC3986] resolves
476	   the first ambiguity by only allowing the strict form, and the second
477	   ambiguity by specifying that it is considered an IPv4 address
478	   literal.  New protocols and data formats should similarly consider
479	   using the strict form rather than the loose form in order to better
480	   match user expectations.

482	   Thus, whereas (binary) IPv4 addresses are Absolute identifiers, IPv4
483	   address literals are at best Definite identifiers, and often turn out
484	   to be Indefinite identifiers.

486	   Furthermore, when strings can contain non-ASCII characters, they can
487	   contain other characters that may look like dots or digits to a human
488	   viewing and/or entering the identifier, especially to one who might
489	   expect digits to appear in his or her native script.

491	3.1.2.  IPv6 Literals

493	   IPv6 addresses similarly have a wide variety of alternate but
494	   semantically identical string representations, as defined in section
495	   2.2 of [RFC4291] and section 2 of [I-D.ietf-6man-uri-zoneid].  As
496	   discussed in section 3.2.5 of [RFC5952], this fact causes problems in
497	   security contexts if comparison (such as in X.509 certificates) is
498	   done between strings rather than between the binary representations
499	   of addresses.

501	   [RFC5952] recently specified a recommended canonical string format as
502	   an attempt to solve this problem, but it may not be ubiquitously
503	   supported at present.  And, when strings can contain non-ASCII
504	   characters, the same issues (and more, since hexadecimal and colons
505	   are allowed) arise as with IPv4 literals.

507	   Whereas (binary) IPv6 addresses are Absolute identifiers, IPv6
508	   address literals are Definite identifiers, since string-to-address
509	   conversion for IPv6 address literals is unambiguous.
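
The following sketch contrasts string comparison with comparison of the binary representation, using the ipaddress module from the Python standard library. The function names and the two sample literals (taken from the 2001:db8::/32 documentation prefix) are chosen for illustration only.

```python
import ipaddress

SHORT_FORM = "2001:db8::1"
LONG_FORM = "2001:0DB8:0:0:0:0:0:1"


def strings_equal(left: str, right: str) -> bool:
    """Compare the literals as text."""
    return left == right


def addresses_equal(left: str, right: str) -> bool:
    """Compare the 16-byte binary addresses that the literals denote."""
    return (ipaddress.IPv6Address(left).packed
            == ipaddress.IPv6Address(right).packed)


text_result = strings_equal(SHORT_FORM, LONG_FORM)
binary_result = addresses_equal(SHORT_FORM, LONG_FORM)
```

The two literals differ as strings but denote the same 128-bit address, so a comparison performed on strings produces a false negative that a comparison on the binary form avoids.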

511	3.1.3.  Internationalization

513	   The IETF policy on character sets and languages [RFC2277] requires
514	   support for UTF-8 in protocols, and as a result many protocols now do
515	   support non-ASCII characters.  When a hostname is sent in a UTF-8
516	   field, there are a number of ways it may be encoded.  For example,
517	   hostname labels might be encoded directly in UTF-8, or might first be
518	   Punycode-encoded [RFC3492] or even percent-encoded from UTF-8.

520	   For example, in URIs, [RFC3986] section 3.2.2 specifically allows for
521	   the use of percent-encoded UTF-8 characters in the hostname, as well
522	   as the use of IDNA encoding [RFC3490] using the Punycode algorithm.

524	   Percent-encoding is unambiguous for hostnames since the percent
525	   character cannot appear in the strict definition of a "hostname",
526	   though it can appear in a DNS name.

528	   Punycode-encoded labels (or "A-labels") on the other hand can be
529	   ambiguous if hosts are actually allowed to be named with a name
530	   starting with "xn--", and false positives can result.  While this may
531	   be extremely unlikely for normal scenarios, it nevertheless provides
532	   a possible vector for an attacker.

534	   A hostname comparator thus needs to decide whether a Punycode-encoded
535	   label should or should not be considered a valid hostname label, and
536	   if so, then whether it should match a label encoded in some other
537	   form such as a percent-encoded Unicode label (U-label).

539	   For example, Section 3 of "Transport Layer Security (TLS) Extensions"
540	   [RFC6066], states:

542	      "HostName" contains the fully qualified DNS hostname of the
543	      server, as understood by the client.  The hostname is represented
544	      as a byte string using ASCII encoding without a trailing dot.
545	      This allows the support of internationalized domain names through
546	      the use of A-labels defined in [RFC5890].  DNS hostnames are case-
547	      insensitive.  The algorithm to compare hostnames is described in
548	      [RFC5890], Section 2.3.2.4.

550	   For some additional discussion of security issues that arise with
551	   internationalization, see [TR36].

553	3.1.4.  Resolution for comparison

555	   Some systems (specifically Java URLs [JAVAURL]) use the rule that if
556	   two hostnames resolve to the same IP address(es) then the hostnames
557	   are considered equal.  That is, the canonicalization algorithm
558	   involves name resolution with an IP address being the canonical form.

560	   For example, if resolution was done via DNS, and DNS contained:

562	   example.com.  IN A 10.0.0.6
563	   example.net.  IN CNAME example.com.
564	   example.org.  IN A 10.0.0.6

566	   then the algorithm might treat all three names as equal, even though
567	   the third name might refer to a different entity.
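
	   The effect can be sketched with a static table standing in for
	   DNS.  The table, the resolve() helper, and equal_by_resolution()
	   are all hypothetical; no real lookups are made, and the sketch
	   only illustrates the rule described above rather than any
	   particular implementation of it.

```python
# Hypothetical static data mirroring the three records above.
RECORDS = {
    "example.com.": ("A", "10.0.0.6"),
    "example.net.": ("CNAME", "example.com."),
    "example.org.": ("A", "10.0.0.6"),
}

def resolve(name: str) -> str:
    """Follow CNAMEs in the static table until an A record is found."""
    rtype, value = RECORDS[name]
    while rtype == "CNAME":
        rtype, value = RECORDS[value]
    return value

def equal_by_resolution(a: str, b: str) -> bool:
    """Treat two names as equal when they resolve to the same address."""
    return resolve(a) == resolve(b)

# All three names collapse to one address, so all compare equal,
# including example.org., which may be an unrelated entity.
for name in ("example.net.", "example.org."):
    print(name, equal_by_resolution("example.com.", name))
```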

569	   With the introduction of dynamic IP addresses, private IP addresses,
570	   multiple IP addresses per name, multiple address families (e.g., IPv4
571	   vs. IPv6), devices that roam to new locations, commonly deployed DNS
572	   tricks that result in the answer depending on factors such as the
573	   requester's location and the load on the server whose address is
574	   returned, etc., this method of comparison cannot be relied upon.
575	   There is no guarantee that two names for the same host will resolve
576	   to the same IP addresses, nor that the addresses resolved
577	   refer to the same entity such as when the names resolve to private IP
578	   addresses, nor even that the system has connectivity (and the
579	   willingness to wait for the delay) to resolve names at the time the
580	   answer is needed.

582	   In addition, a comparison mechanism that relies on the ability to
583	   resolve identifiers such as hostnames to other identifiers such as IP
584	   addresses leaks information about security decisions to outsiders if
585	   these queries are publicly observable.

587	3.2.  Ports and Service Names

589	   Port numbers and service names are discussed in depth in [RFC6335].
590	   Historically, there were port numbers, service names used in SRV
591	   records, and mnemonic identifiers for assigned port numbers (known as
592	   port "keywords" at [IANA-PORT]).  The latter two are now unified, and
593	   various protocols use one or more of these types in strings.  For
594	   example, the common syntax used by many URI schemes allows port
595	   numbers but not service names.  Some implementations of the
596	   getaddrinfo() API support strings that can be either port numbers or
597	   port keywords (but not service names).

599	   For protocols that use service names that must be resolved, the
600	   issues are the same as those for resolution of addresses in
601	   Section 3.1.4.  In addition, Section 5.1 of [RFC6335] clarifies that
602	   service names/port keywords must contain at least one letter.  This
603	   prevents confusion with port numbers in strings where both are
604	   allowed.
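
	   A rough sketch of how a parser might use that rule to tell the
	   two apart follows.  The function names are hypothetical, and the
	   syntax checks are a summary of Section 5.1 of [RFC6335] (1 to 15
	   characters from letters, digits, and hyphens, at least one
	   letter, no leading, trailing, or adjacent hyphens) rather than a
	   complete validator.

```python
def is_service_name(token: str) -> bool:
    """Check the service name syntax summarized in the lead-in."""
    if not 1 <= len(token) <= 15:
        return False
    if not all(c.isascii() and (c.isalnum() or c == "-") for c in token):
        return False
    if not any(c.isalpha() for c in token):
        return False  # all-digit strings are port numbers, not names
    if token.startswith("-") or token.endswith("-") or "--" in token:
        return False
    return True

def classify(token: str) -> str:
    """Return "port", "service", or "invalid" for a string token."""
    if token.isascii() and token.isdigit() and int(token) <= 65535:
        return "port"
    if is_service_name(token):
        return "service"
    return "invalid"

for token in ("80", "http", "8080a", "70000", "-http"):
    print(token, classify(token))
```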

606	3.3.  URIs

608	   This section looks at issues related to using URIs for security
609	   purposes.  For example, [RFC5280], section 7.4, specifies comparison
610	   of URIs in certificates.  Examples of URIs in security token-based
611	   access control systems include WS-*, SAML-P and OAuth WRAP.  In such
612	   systems, a variety of participants in the security infrastructure are
613	   identified by URIs.  For example, requesters of security tokens are
614	   sometimes identified with URIs.  The issuers of security tokens and
615	   the relying parties who are intended to consume security tokens are
616	   frequently identified by URIs.  Claims in security tokens often have
617	   their types defined using URIs and the values of the claims can also
618	   be URIs.

620	   Also, when a URI is embedded in plain text (e.g., an email message),
621	   there is an additional concern because there is no termination
622	   criterion for a URI.  For example, consider
623	   http://unicode.org/cldr/utility/list-unicodeset.jsp?a=a&amp;g=gc.
624	   Some applications that detect URIs will stop before the first '.' in
625	   the path, while others go to the last '.', and yet others may stop at the
626	   ';'.  As another point of comparison, Section 2.37 of [EE] (a
627	   standard for history citations) specifies the use of a space after a
628	   URI and before the punctuation.

630	   URIs are defined with multiple components, each of which has its own
631	   rules.  We cover each in turn below.  However, it is also important
632	   to note that there exist multiple comparison algorithms.  [RFC3986]
633	   section 6.2 states:

635	      A variety of methods are used in practice to test URI equivalence.
636	      These methods fall into a range, distinguished by the amount of
637	      processing required and the degree to which the probability of
638	      false negatives is reduced.  As noted above, false negatives
639	      cannot be eliminated.  In practice, their probability can be
640	      reduced, but this reduction requires more processing and is not
641	      cost-effective for all applications.
642	      If this range of comparison practices is considered as a ladder,
643	      the following discussion will climb the ladder, starting with
644	      practices that are cheap but have a relatively higher chance of
645	      producing false negatives, and proceeding to those that have
646	      higher computational cost and lower risk of false negatives.

648	   The ladder approach has both pros and cons.  On the pro side, it
649	   allows some uses to optimize for security, and other uses to optimize
650	   for cost, thus allowing URIs to be applicable to a wide range of
651	   uses.  A disadvantage is that when different approaches are taken by
652	   different components in the same system using the same identifiers,
653	   the inconsistencies can result in security issues.

655	3.3.1.  Scheme component

657	   [RFC3986] defines URI schemes as being case-insensitive ASCII and in
658	   section 6.2.2.1 specifies that scheme names should be normalized to
659	   lower-case characters.

661	   New schemes can be defined over time.  In general, however, two URIs
662	   with an unrecognized scheme cannot be safely compared.  This is
663	   because the canonicalization and comparison rules for the other
664	   components may vary by scheme.  For example, a new URI scheme might
665	   have a default port of X, and without that knowledge, a comparison
666	   algorithm cannot know whether "example.com" and "example.com:X"
667	   should be considered to match in the authority component.  Hence for
668	   security purposes, it is safest for unrecognized schemes to be
669	   treated as invalid identifiers.  However, if the URIs are only used
670	   with a "grant access on match" paradigm then unrecognized schemes can
671	   be supported by doing a generic case-sensitive comparison, at the
672	   expense of some false negatives.
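
	   The conservative option can be sketched as follows.  The
	   DEFAULT_PORTS table and the authority_key() helper are
	   assumptions made for illustration; the point is only that a
	   default port can be filled in when the scheme is recognized, and
	   that an unrecognized scheme is rejected rather than guessed at.

```python
from urllib.parse import urlsplit

# Hypothetical table of the schemes this comparator recognizes.
DEFAULT_PORTS = {"http": 80, "https": 443}

def authority_key(uri: str):
    """Return a (scheme, host, port) tuple usable for comparison."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        # Without knowing the scheme, its default port is unknown.
        raise ValueError("unrecognized scheme: %r" % scheme)
    port = parts.port if parts.port is not None else DEFAULT_PORTS[scheme]
    return (scheme, parts.hostname, port)

print(authority_key("HTTP://example.com/") ==
      authority_key("http://EXAMPLE.com:80/"))
```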

674	3.3.2.  Authority component

676	   The authority component is scheme-specific, but many schemes follow a
677	   common syntax that allows for userinfo, host, and port.

679	3.3.2.1.  Host

681	   Section 3.1 discussed issues with hostnames in general.  In addition,
682	   [RFC3986] section 3.2.2 allows future changes using the IPvFuture
683	   production.  As with IPv4 and IPv6 literals, IPvFuture formats may
684	   have issues with multiple semantically identical string
685	   representations, and may also be semantically identical to an IPv4 or
686	   IPv6 address.  As such, false negatives may be common if IPvFuture is
687	   used.

689	3.3.2.2.  Port

691	   See discussion in Section 3.2.

693	3.3.2.3.  Userinfo

695	   [RFC3986] defines the userinfo production that allows arbitrary data
696	   about the user of the URI to be placed before '@' signs in URIs.  For
697	   example: "http://alice:bob:chuck@example.com/bar" has the value
698	   "alice:bob:chuck" as its userinfo.  When comparing URIs in a security
699	   context, one must decide whether to treat the userinfo as being
700	   significant or not.  Some URI comparison services for example treat
701	   "http://alice:ick@example.com" and "http://example.com" as being
702	   equal.

704	   When the userinfo is treated as being significant, it has additional
705	   considerations (e.g., whether it is case-sensitive or not) which we
706	   cover in Section 3.4.
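
	   As an illustration only, the following separates the userinfo
	   from the rest of the URI so that a comparator can apply whichever
	   policy it has chosen.  The helper name split_userinfo is
	   hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def split_userinfo(uri: str):
    """Return (userinfo or None, URI with the userinfo removed)."""
    parts = urlsplit(uri)
    # Everything before the last '@' in the authority is userinfo.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    stripped = urlunsplit(parts._replace(netloc=hostport))
    return (userinfo if at else None, stripped)

print(split_userinfo("http://alice:bob:chuck@example.com/bar"))
print(split_userinfo("http://alice:ick@example.com"))
```

	   A comparator that ignores userinfo would compare only the second
	   element of each result; one that treats it as significant would
	   compare both.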

708	3.3.3.  Path component

710	   [RFC3986] supports the use of path segment values such as "./" or
711	   "../" for relative URIs.  Such segments are intended only for use
712	   within relative references ([RFC3986] section 4.1), but [RFC3986]
713	   section 5.2.4 nevertheless defines an algorithm to remove them from
714	   any path as part of normalization.
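
	   A sketch of that dot-segment removal algorithm, written from the
	   step list in [RFC3986], is given below.  It is intended as an
	   illustration of the technique, not as a vetted implementation.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments following the RFC 3986 step list."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any)
            # from the input to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)

print(remove_dot_segments("/a/b/c/./../../g"))
```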

716	   Unless a scheme states otherwise, the path component is defined to be
717	   case-sensitive.  However, if the resource is stored and accessed
718	   using a filesystem using case-insensitive paths, there will be many
719	   paths that refer to the same resource.  As such, false negatives can
720	   be common in this case.

722	3.3.4.  Query component

724	   There is the question as to whether "http://example.com/foo",
725	   "http://example.com/foo?", and "http://example.com/foo?bar" are each
726	   considered equal or different.

728	   Similarly, it is unspecified whether the order of values matters.
729	   For example, should "http://example.com/blah?ick=bick&foo=bar" be
730	   considered equal to "http://example.com/blah?foo=bar&ick=bick"?  And
731	   if a domain name is permitted to appear in a query component (e.g.,
732	   in a reference to another URI), the same issues in Section 3.1 apply.
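
	   Both questions can be made concrete with a short sketch.  The
	   helper query_pairs_sorted is hypothetical, and treating the
	   query as unordered name/value pairs is one possible policy, not
	   a rule taken from [RFC3986].

```python
from urllib.parse import urlsplit, parse_qsl

def query_pairs_sorted(uri: str):
    """Parse the query into name/value pairs and ignore their order."""
    query = urlsplit(uri).query
    return sorted(parse_qsl(query, keep_blank_values=True))

a = "http://example.com/blah?ick=bick&foo=bar"
b = "http://example.com/blah?foo=bar&ick=bick"
print(a == b)                                            # exact strings
print(query_pairs_sorted(a) == query_pairs_sorted(b))    # order ignored

# An empty query and an absent query differ as strings, but this
# parser reports an empty query string for both.
print(urlsplit("http://example.com/foo").query ==
      urlsplit("http://example.com/foo?").query)
```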

734	3.3.5.  Fragment component

736	   Some URI formats include fragment identifiers.  These are typically
737	   handles to locations within a resource and are used for local
738	   reference.  A classic example is the use of fragments in HTTP URIs
739	   where a URI of the form "http://example.com/blah.html#ick" means
740	   retrieve the resource "http://example.com/blah.html" and, once it has
741	   arrived locally, find the HTML anchor named ick and display that.

743	   So, for example, when a user clicks on the link
744	   "http://example.com/blah.html#baz" a browser will check its cache by
745	   doing a URI comparison for "http://example.com/blah.html" and, if the
746	   resource is present in the cache, a match is declared.

748	   Hence comparisons for security purposes typically ignore the fragment
749	   component and treat all fragments as equal to the full resource.
750	   However, if one were actually trying to compare the piece of a
751	   resource that was identified by the fragment identifier, ignoring it
752	   would result in potential false positives.  For example, there is at
753	   least one well known site today (Twitter) that requires the fragment
754	   component in order to uniquely identify a user profile.
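
	   The two behaviors can be contrasted in a small sketch.  The
	   helper name equal_ignoring_fragment is hypothetical; whether
	   dropping the fragment is appropriate depends on what is being
	   protected, as discussed above.

```python
from urllib.parse import urldefrag

def equal_ignoring_fragment(a: str, b: str) -> bool:
    """Compare two URIs after dropping their fragment components."""
    return urldefrag(a).url == urldefrag(b).url

a = "http://example.com/blah.html#ick"
b = "http://example.com/blah.html#baz"
print(equal_ignoring_fragment(a, b))  # same resource
print(a == b)                         # different parts of the resource
```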

756	3.3.6.  Resolution for comparison

758	   As with Section 3.1.4 for hostnames, it may be tempting to define a
759	   URI comparison algorithm based on whether they resolve to the same
760	   content.  Similar problems exist, however, including content that
761	   dynamically changes over time or based on factors such as the
762	   requester's location, potential lack of external connectivity at the
763	   time/place comparison is done, potentially undesirable delay
764	   introduced, etc.

766	   In addition, as noted in Section 3.1.4, resolution leaks information
767	   about security decisions to outsiders if the queries are publicly
768	   observable.

770	3.4.  Email Address-like Identifiers

772	   Section 3.4.1 of [RFC5322] defines the syntax of an email address-
773	   like identifier, and Section 3.2 of [RFC6532] updates it to support
774	   internationalization.  [RFC5280], section 7.5, further discusses the
775	   use of internationalized email addresses in certificates.

777	   [RFC6532] in turn points to [RFC6530]; Section 13 of that document
778	   contains a discussion of many issues resulting from
779	   internationalization.

781	   Email address-like identifiers have a local part and a domain part.
782	   The issues with the domain part are essentially the same as with
783	   hostnames, covered earlier.

785	   The local part is left for each domain to define.  People quite
786	   commonly use email addresses as usernames with web sites such as
787	   banks or shopping sites, but the site doesn't know whether
788	   foo@example.com is the same person as FOO@example.com.  Thus email-
789	   like identifiers are typically Indefinite identifiers.

791	   To avoid false positives, some security mechanisms (such as
792	   [RFC5280]) compare the local part using an exact match.  Hence, like
793	   URIs, email address-like identifiers are designed for use in grant-
794	   on-match security schemes, not in deny-on-match schemes.
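
	   An exact-match comparison of this kind might look like the
	   following sketch.  The function name email_match is
	   hypothetical, and the sketch assumes an ASCII-only domain part
	   compared case-insensitively, with the local part compared
	   exactly.

```python
def email_match(a: str, b: str) -> bool:
    """Exact match on the local part; ASCII case-insensitive domain."""
    local_a, at_a, domain_a = a.rpartition("@")
    local_b, at_b, domain_b = b.rpartition("@")
    if not at_a or not at_b:
        raise ValueError("not an email address-like identifier")
    # Assumption: domains are ASCII, so lower() is a safe case fold.
    return local_a == local_b and domain_a.lower() == domain_b.lower()

print(email_match("foo@example.com", "foo@EXAMPLE.com"))
print(email_match("foo@example.com", "FOO@example.com"))
```

	   The second comparison fails even if both addresses deliver to
	   the same mailbox, which is the false negative that a grant-on-
	   match scheme tolerates.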

796	4.  General Internationalization Issues

798	   In addition to the issues with hostnames discussed in Section 3.1.3,
799	   there are a number of internationalization issues that apply to many
800	   types of Definite and Indefinite identifiers.

802	   First, there is no DNS mechanism for identifying whether non-
803	   identical strings would be seen by a human as being equivalent.
804	   There are problematic examples even with ASCII (Basic Latin) strings
805	   including regional spelling variations such as "color" and "colour"
806	   and many non-English cases including partially-numeric strings in
807	   Arabic script contexts, Chinese strings in Simplified and Traditional
808	   forms, and so on.  Attempts to produce such alternate forms
809	   algorithmically could produce false positives and hence have an
810	   adverse effect on security.

812	   Second, some strings are visually confusable with others, and hence
813	   if a security decision is made by a user based on visual inspection,
814	   many opportunities for false positives exist.  As such, using visual
815	   inspection for security is unreliable.  In addition to the security
816	   issues, visual confusability also adversely affects the usability of
817	   identifiers distributed via visual mediums.  Similar issues can arise
818	   with audible confusability when using audio (e.g., for radio
819	   distribution, accessibility to the blind, etc.) in place of a visual
820	   medium.

822	   Determining whether a string is a valid identifier should typically
823	   be done after, or as part of, canonicalization.  Otherwise an
824	   attacker might use the canonicalization algorithm to inject (e.g.,
825	   via percent encoding, NFKC, or non-shortest-form UTF-8) delimiters
826	   such as '@' in an email address-like identifier, or a '.' in a
827	   hostname.
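
	   The ordering problem can be shown with a short sketch.  The
	   canonicalize() and is_valid_label() helpers are hypothetical and
	   deliberately simplistic; they exist only to show a delimiter
	   appearing after a validity check has already passed.

```python
import unicodedata
from urllib.parse import unquote

def canonicalize(s: str) -> str:
    """Percent-decode, then apply NFKC normalization."""
    return unicodedata.normalize("NFKC", unquote(s))

def is_valid_label(label: str) -> bool:
    """Toy check: a label must be non-empty and contain no delimiter."""
    return label != "" and "." not in label and "@" not in label

def check_label(raw: str) -> bool:
    """Validate after canonicalization, as recommended above."""
    return is_valid_label(canonicalize(raw))

# "%2E" decodes to '.', and U+FF20 (fullwidth '@') normalizes to '@'.
for raw in ("www%2Eexample", "user\uff20example"):
    print(is_valid_label(raw), check_label(raw))
```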

829	   Any case-insensitive comparisons need to define how comparison is
830	   done, since such comparisons may vary by locale of the endpoint.  As
831	   such, using case-insensitive comparisons in general often results in
832	   identifiers being either Indefinite or, if the legal character set is
833	   restricted (e.g., to ASCII), then Definite.
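
	   The difference between a restricted and an unrestricted
	   character set can be sketched as follows.  The ascii_lower()
	   helper is hypothetical; Python's str.casefold() is used only as
	   one example of a full Unicode case-folding operation.

```python
def ascii_lower(s: str) -> str:
    """Fold only the letters A-Z; leave every other character alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

# With an ASCII-only rule, the outcome is the same everywhere.
print(ascii_lower("EXAMPLE.Com") == ascii_lower("example.com"))

# With Unicode case folding, strings of different length can match,
# so the result depends on which folding rule a comparator picked.
print("STRASSE".casefold() == "stra\u00dfe".casefold())
print(ascii_lower("STRASSE") == ascii_lower("stra\u00dfe"))
```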

835	   See also [WEBER] for a more visual discussion of many of these
836	   issues.

838	   Finally, the set of permitted characters and the canonical form of
839	   the characters (and hence the canonicalization algorithm) sometimes
840	   varies by protocol today, even when the intent is to use the same
841	   identifier, such as when one protocol passes identifiers to the
842	   other.  See [I-D.ietf-precis-problem-statement] for further
843	   discussion.

845	5.  General Scope Issues

847	   Another issue arises when an identifier (e.g., "localhost",
848	   "10.11.12.13", etc.) is not globally unique.  [RFC3986] Section 1.1
849	   states:

851	      URIs have a global scope and are interpreted consistently
852	      regardless of context, though the result of that interpretation
853	      may be in relation to the end-user's context.  For example,
854	      "http://localhost/" has the same interpretation for every user of
855	      that reference, even though the network interface corresponding to
856	      "localhost" may be different for each end-user: interpretation is
857	      independent of access.

859	   Whenever a non-globally-unique identifier is passed to another entity
860	   outside of the scope of uniqueness, it may refer to a different
861	   resource, and can result in a false positive.  This problem is often
862	   addressed by using the identifier together with some other unique
863	   identifier of the context.  For example "alice" may uniquely identify
864	   a user within a system, but must be used with "example.com" (as in
865	   "alice@example.com") to uniquely identify the context outside of that
866	   system.

868	   It is also worth noting that non-globally-scoped IPv6 addresses can
869	   be written with, or otherwise associated with, a "zone ID" to
870	   identify the context (see [RFC4007] for more information).  However,
871	   zone IDs are only unique within a host, so they typically narrow,
872	   rather than expand, the scope of uniqueness of the resulting
873	   identifier.

875	6.  Security Considerations

877	   This entire document is about security considerations.

879	   To minimize elevation of privilege issues, any system that requires
880	   the ability to use both deny and allow operations within the same
881	   identifier space should avoid the use of Indefinite identifiers in
882	   security comparisons.

884	   To minimize future security risks, any new identifiers being designed
885	   should specify an Absolute or Definite comparison algorithm, and if
886	   extensibility is allowed (e.g., as new schemes in URIs allow) then
887	   the comparison algorithm should remain invariant so that unrecognized
888	   extensions can be compared.  That is, security risks can be reduced
889	   by specifying the comparison algorithm, making sure to resolve any
890	   ambiguities pointed out in this document (e.g., "standard dotted
891	   decimal").

893	   Some issues (such as unrecognized extensions) can be mitigated by
894	   treating such identifiers as invalid.  Validity checking of
895	   identifiers is further discussed in [RFC3696].

897	   Perhaps the hardest issues arise when multiple protocols are used
898	   together, such as in the figure in Section 2, where the two protocols
899	   are defined or implemented using different comparison algorithms.
900	   When constructing an architecture that uses multiple such protocols,
901	   designers should pay attention to any differences in comparison
902	   algorithms among the protocols, in order to fully understand the
903	   security risks.  An area for future work is how to deal with such
904	   security risks in current systems.

906	7.  Acknowledgements

908	   Yaron Goland contributed to the discussion on URIs.  Patrick
909	   Faltstrom contributed to the background on identifiers.  John Klensin
910	   contributed text in a number of different sections.  Additional
911	   helpful feedback and suggestions came from Bernard Aboba, Leslie
912	   Daigle, Mark Davis, Russ Housley, Magnus Nystrom, and Chris Weber.

914	8.  IANA Considerations

916	   This document requires no actions by the IANA.

918	9.  Informative References

920	   [EE]       Mills, E., "Evidence Explained: Citing History Sources
921	              from Artifacts to Cyberspace", 2007.

923	   [I-D.ietf-6man-uri-zoneid]
924	              Carpenter, B. and R. Hinden, "Representing IPv6 Zone
925	              Identifiers in Address Literals and Uniform Resource
926	              Identifiers", draft-ietf-6man-uri-zoneid-02 (work in
927	              progress), July 2012.

929	   [I-D.ietf-pkix-rfc5280-clarifications]
930	              Yee, P., "Updates to the Internet X.509 Public Key
931	              Infrastructure Certificate and Certificate Revocation List
932	              (CRL) Profile", draft-ietf-pkix-rfc5280-clarifications-05
933	              (work in progress), June 2012.

935	   [I-D.ietf-precis-problem-statement]
936	              Blanchet, M. and A. Sullivan, "Stringprep Revision and
937	              PRECIS Problem Statement",
938	              draft-ietf-precis-problem-statement-06 (work in progress),
939	              July 2012.

941	   [IAB1123]  IAB, "The interpretation of rules in the ICANN gTLD
942	              Applicant Guidebook", February 2012, <http://www.iab.org/
943	              documents/correspondence-reports-documents/2012-2/
944	              iab-statement-the-interpretation-of-rules-in-the-icann-
945	              gtld-applicant-guidebook>.

947	   [IANA-PORT]
948	              IANA, "PORT NUMBERS", June 2011,
949	              <http://www.iana.org/assignments/port-numbers>.

951	   [IEEE-1003.1]
952	              IEEE and The Open Group, "The Open Group Base
953	              Specifications, Issue 6 IEEE Std 1003.1, 2004 Edition",
954	              IEEE Std 1003.1, 2004.

956	   [JAVAURL]  Oracle, "Class URL, Java(TM) Platform, Standard Ed. 7",
957	              2011, <http://docs.oracle.com/javase/7/docs/api/java/net/
958	              URL.html>.

960	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
961	              host table specification", RFC 952, October 1985.

963	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
964	              and Support", STD 3, RFC 1123, October 1989.

966	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
967	              Languages", BCP 18, RFC 2277, January 1998.

969	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
970	              "Internationalizing Domain Names in Applications (IDNA)",
971	              RFC 3490, March 2003.

973	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
974	              for Internationalized Domain Names in Applications
975	              (IDNA)", RFC 3492, March 2003.

977	   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
978	              Stevens, "Basic Socket Interface Extensions for IPv6",
979	              RFC 3493, February 2003.

981	   [RFC3696]  Klensin, J., "Application Techniques for Checking and
982	              Transformation of Names", RFC 3696, February 2004.

984	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
985	              Resource Identifier (URI): Generic Syntax", STD 66,
986	              RFC 3986, January 2005.

988	   [RFC4007]  Deering, S., Haberman, B., Jinmei, T., Nordmark, E., and
989	              B. Zill, "IPv6 Scoped Address Architecture", RFC 4007,
990	              March 2005.

992	   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
993	              Architecture", RFC 4291, February 2006.

995	   [RFC5280]  Cooper, D., Santesson, S., Farrell, S., Boeyen, S.,
996	              Housley, R., and W. Polk, "Internet X.509 Public Key
997	              Infrastructure Certificate and Certificate Revocation List
998	              (CRL) Profile", RFC 5280, May 2008.

1000	   [RFC5322]  Resnick, P., Ed., "Internet Message Format", RFC 5322,
1001	              October 2008.

1003	   [RFC5952]  Kawamura, S. and M. Kawashima, "A Recommendation for IPv6
1004	              Address Text Representation", RFC 5952, August 2010.

1006	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
1007	              Encodings for Internationalized Domain Names", RFC 6055,
1008	              February 2011.

1010	   [RFC6066]  Eastlake, D., "Transport Layer Security (TLS) Extensions:
1011	              Extension Definitions", RFC 6066, January 2011.

1013	   [RFC6335]  Cotton, M., Eggert, L., Touch, J., Westerlund, M., and S.
1014	              Cheshire, "Internet Assigned Numbers Authority (IANA)
1015	              Procedures for the Management of the Service Name and
1016	              Transport Protocol Port Number Registry", BCP 165,
1017	              RFC 6335, August 2011.

1019	   [RFC6530]  Klensin, J. and Y. Ko, "Overview and Framework for
1020	              Internationalized Email", RFC 6530, February 2012.

1022	   [RFC6532]  Yang, A., Steele, S., and N. Freed, "Internationalized
1023	              Email Headers", RFC 6532, February 2012.

1025	   [TR36]     Unicode Consortium, "Unicode Security Considerations",
1026	              Unicode Technical Report 36, August 2004.

1028	   [WEBER]    Weber, C., "Attacking Software Globalization", March 2010,
1029	              <http://www.lookout.net/files/
1030	              Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf>.

1032	Author's Address

1034	   Dave Thaler (editor)
1035	   Microsoft Corporation
1036	   One Microsoft Way
1037	   Redmond, WA  98052
1038	   USA

1040	   Phone: +1 425 703 8835
1041	   Email: dthaler@microsoft.com