idnits 2.17.1 

draft-iab-identifier-comparison-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 388: '....  Host software MUST support this mor...'
     RFC 2119 keyword, line 393: '...dentity of an Internet host, it SHOULD...'
     RFC 2119 keyword, line 395: '...#.#.#.#") form.  The host SHOULD check...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (March 12, 2012) is 4427 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'RFC5890' is mentioned on line 518, but not defined

  == Outdated reference: A later version (-11) exists of
     draft-ietf-pkix-rfc5280-clarifications-04


     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                     D. Thaler, Ed.
3	Internet-Draft                                                 Microsoft
4	Intended status: Informational                            March 12, 2012
5	Expires: September 13, 2012

7	         Issues in Identifier Comparison for Security Purposes
8	                 draft-iab-identifier-comparison-01.txt

10	Abstract

12	   Identifiers such as hostnames, URIs/IRIs, and email addresses are
13	   often used in security contexts to identify security principals and
14	   resources.  In such contexts, an identifier supplied via some
15	   protocol is often compared against some policy to make security
16	   decisions such as whether the principal may access the resource, what
17	   level of authentication or encryption is required, etc.  If the
18	   parties involved in a security decision use different algorithms to
19	   compare identifiers, then failure scenarios ranging from denial of
20	   service to elevation of privilege can result.

22	Status of this Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current Internet-
30	   Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   This Internet-Draft will expire on September 13, 2012.

39	Copyright Notice

41	   Copyright (c) 2012 IETF Trust and the persons identified as the
42	   document authors.  All rights reserved.

44	   This document is subject to BCP 78 and the IETF Trust's Legal
45	   Provisions Relating to IETF Documents
46	   (http://trustee.ietf.org/license-info) in effect on the date of
47	   publication of this document.  Please review these documents
48	   carefully, as they describe your rights and restrictions with respect
49	   to this document.  Code Components extracted from this document must
50	   include Simplified BSD License text as described in Section 4.e of
51	   the Trust Legal Provisions and are provided without warranty as
52	   described in the Simplified BSD License.

54	Table of Contents

56	   1.  Introduction
57	   2.  Security Uses
58	     2.1.  Types of Identifiers
59	     2.2.  False Positives and Negatives
60	     2.3.  Hypothetical Example
61	   3.  Common Identifiers
62	     3.1.  Hostnames
63	       3.1.1.  IPv4 Literals
64	       3.1.2.  IPv6 Literals
65	       3.1.3.  Internationalization
66	       3.1.4.  Resolution for comparison
67	     3.2.  Ports and Service Names
68	     3.3.  URIs and IRIs
69	       3.3.1.  Scheme component
70	       3.3.2.  Authority component
71	       3.3.3.  Path component
72	       3.3.4.  Query component
73	       3.3.5.  Fragment component
74	     3.4.  Email Address-like Identifiers
75	   4.  General Internationalization Issues
76	   5.  Security Considerations
77	   6.  Acknowledgements
78	   7.  IANA Considerations
79	   8.  Informative References
80	   Author's Address

82	1.  Introduction

84	   In computing and the Internet, various types of "identifiers" are
85	   used to identify humans, devices, content, etc.  Before discussing
86	   security issues, we first give some background on some typical
87	   processes involving identifiers.

89	   As depicted in Figure 1, there are multiple processes relevant to our
90	   discussion.
91	   1.  An identifier must first be generated.  If the identifier is
92	       intended to be unique, the generation process includes some
93	       mechanism, such as allocation by a central authority, to help
94	       ensure uniqueness.  However, the notion of "unique" involves
95	       determining whether a putative identifier matches any other
96	       already-allocated identifier.  As we will see, for many types of
97	       identifiers, this is not simply an exact binary match.

99	       As a result of generating the identifier, it is often stored in
100	       two locations: with the requester or "holder" of the identifier,
101	       and with some repository of identifiers (e.g., DNS).  For
102	       example, if the identifier was allocated by a central authority,
103	       the repository might be that authority.  If the identifier
104	       identifies a device or content on a device, the repository might
105	       be that device.
106	   2.  The identifier must be distributed, either by the holder of the
107	       identifier or by a repository of identifiers, to others who could
108	       use the identifier.  This distribution might be electronic, but
109	       sometimes it is via other channels such as voice, business card,
110	       billboard, or other form of advertisement.  The identifier itself
111	       might be distributed directly, or it might be used to generate a
112	       portion of another type of identifier that is then distributed.
113	       For example, a URI or email address might include a server name,
114	       and hence distributing the URI or email address also inherently
115	       distributes the server name.
116	   3.  The identifier must be used by some party.  Generally the user
117	       supplies the identifier which is (directly or indirectly) sent to
118	       the repository of identifiers.  For example, using an email
119	       address to send email to the holder of an identifier may result
120	       in the email arriving at the holder's email server which has the
121	       repository of all email accounts on that server.

123	       The repository of identifiers must then attempt to match the
124	       user-supplied identifier with an identifier in its repository.

126	                            +------------+
127	                            |  Holder of |     1. Generation
128	                            | identifier +<---------+
129	                            +----+-------+          |
130	                                 |                  | Match
131	                                 |                  v/
132	                                 |          +-------+-------+
133	                                 +----------+ Repository of |
134	                                 |          |  identifiers  |
135	                                 |          +-------+-------+
136	                 2. Distribution |                  ^\
137	                                 |                  | Match
138	                                 v                  |
139	                       +---------+-------+          |
140	                       |      User of    |          |
141	                       |    identifier   +----------+
142	                       +-----------------+    3. Use

144	                       Typical Identifier Processes

146	                                 Figure 1

148	   One key aspect is that the identifier values passed in generation,
149	   distribution, and use may all be in different forms.  For example,
150	   the identifier might be exchanged in printed form at generation,
151	   distributed via voice, and used electronically.  As such, the match
152	   process can be complicated.

154	   Furthermore, in many uses, the relationship between holder,
155	   repositories, and users may be more involved.  For example, when a
156	   hierarchy of caches exists (as with web pages for example), each cache
157	   is itself a repository of a sort, and the match process is usually
158	   intended to be the same as on the authoritative web server.

160	2.  Security Uses

162	   Identifiers such as hostnames, URIs/IRIs, and email addresses are
163	   used in security contexts to identify principals and resources as
164	   well as other security parameters such as types and values of claims.
165	   Those identifiers are then used to make security decisions based on
166	   an identifier supplied via some protocol.  For example:
167	   o  Authentication: a protocol might match a security principal
168	      identifier to look up expected keying material, and then match
169	      keying material.
170	   o  Authorization: a protocol might match a resource name to look up
171	      an access control list (ACL), and then look up the security
172	      principal identifier in that ACL.

174	   If the parties involved in a security decision use different matching
175	   algorithms for the same identifiers, then failure scenarios ranging
176	   from denial of service to elevation of privilege can result, as we
177	   will see.

179	   This is especially complicated in cases involving multiple parties
180	   and multiple protocols.  For example, there are many scenarios where
181	   some form of "security token service" is used to grant to a requester
182	   permission to access a resource, where the resource is held by a
183	   third party that relies on the security token service (see Figure 2).
184	   The protocol used to request permission (e.g., Kerberos or OAuth) may
185	   be different from the protocol used to access the resource (e.g.,
186	   HTTP).  Opportunities for security problems arise when two protocols
187	   define different comparison algorithms for the same type of
188	   identifier, or when a protocol is ambiguously specified and two
189	   endpoints (e.g., a security token service and a resource holder)
190	   implement different algorithms within the same protocol.

192	        +----------+
193	        | security |
194	        |  token   |
195	        | service  |
196	        +----------+
197	             ^
198	             | 1. supply credentials and
199	             | get token for resource
200	             |                                             +--------+
201	        +----------+  2. supply token and access resource  |resource|
202	        |requester |=------------------------------------->| holder |
203	        +----------+                                       +--------+

205	                         Simple Security Exchange

207	                                 Figure 2

209	   In many cases the situation is more complex.  With certificates, the
210	   name in a certificate gets compared against names in ACLs or other
211	   things.  In the case of web site security, the name in the
212	   certificate gets compared to a portion of the URI that a user may
213	   have typed into a browser.  The fact that many different people are
214	   doing the typing, on many different types of systems, complicates the
215	   problem.

217	   Add to this the certificate enrollment step and the certificate
218	   issuance step, and two more parties have an opportunity to adjust the
219	   encoding; worse, the software that supports them might make changes
220	   of which the parties are unaware.

222	2.1.  Types of Identifiers

224	   In this document we will refer to the following types of identifiers:

226	   o  Absolute: identifiers that can be compared byte-by-byte for
227	      equality.  Two identifiers that have different bytes are defined
228	      to be different.  For example, binary IP addresses are in this
229	      class.
230	   o  Definite: identifiers that have a well-defined comparison
231	      algorithm on which all parties agree.  For example, URI scheme
232	      names are defined to be a case-insensitive match, where the set of
233	      permitted characters results in an unambiguous definition of case-
234	      insensitive match, since non-ASCII characters are not permitted.
235	   o  Indefinite: identifiers that have no single comparison algorithm
236	      on which all parties agree.  For example, human names are in this
237	      class.  Everyone might want the comparison to be tailored for
238	      their locale, for some definition of locale.  In some cases, there
239	      may be limited subsets of parties that might be able to agree
240	      (e.g., US-ASCII users might all agree on a common comparison
241	      algorithm whereas US-ASCII users vs. Turkish users may not), but
242	      identifiers often tend to leak out of such limited environments.
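
   As a minimal sketch of a Definite identifier comparison, the
   following Python fragment compares URI scheme names with an
   ASCII-only, case-insensitive match.  The function name is
   hypothetical, and rejecting non-ASCII input outright is an
   assumption made for illustration, not a rule taken from this
   document.

```python
def scheme_names_match(a: str, b: str) -> bool:
    """Case-insensitive match of two URI scheme names (illustrative only)."""
    for name in (a, b):
        # Scheme names do not permit non-ASCII characters, so case folding
        # is unambiguous.  Rejecting other input here is an assumption.
        if not name.isascii():
            raise ValueError(f"non-ASCII character in scheme name: {name!r}")
    return a.lower() == b.lower()


print(scheme_names_match("HTTP", "http"))
print(scheme_names_match("http", "https"))
```

   Because the permitted character set is restricted before case
   folding, every party that applies this rule reaches the same
   answer, which is what makes the identifier Definite.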

244	2.2.  False Positives and Negatives

246	   Perhaps the most common algorithm for comparison involves
247	   "canonicalization", or converting each identifier to a canonical
248	   form, and then testing the canonical representations for bitwise
249	   equality.  In so doing, it is thus critical that all entities
250	   involved agree on the same canonical form and use the same
251	   canonicalization algorithm so that the overall comparison process is
252	   also the same.  (Often the term "normalization" is used synonymously
253	   with "canonicalization", but in internationalization the term
254	   normalization has a precise meaning, and so we use the generic term
255	   canonicalization here instead.)

257	   It is first worth discussing in more detail the effects of errors in
258	   the comparison algorithm.  A "false positive" results when two
259	   identifiers compare as if they were equal, but in reality refer to
260	   two different things (e.g., security principals or resources).  When
261	   privilege is granted on a match, a false positive thus results in an
262	   elevation of privilege, for example allowing execution of an
263	   operation that should not have been permitted.  When privilege is
264	   denied on a match (e.g., matching an entry in a block/deny list or a
265	   revocation list), a permissible operation is denied.  At best, this
266	   can cause worse performance (e.g., a cache miss, or forcing redundant
267	   authentication), and at worst can result in a denial of service.

269	   A "false negative" results when two identifiers that in reality refer
270	   to the same thing compare as if they were different, and the effects
271	   are the reverse of those for false positives.  That is, when
272	   privilege is granted on a match, the result is at best worse
273	   performance and at worst a denial of service; when privilege is
274	   denied on a match, elevation of privilege results.

276	   Figure 3 summarizes these effects.

278	                  | "Grant on match"       | "Deny on match"
279	   ---------------+------------------------+-----------------------
280	   False positive | Elevation of privilege | Denial of service
281	   ---------------+------------------------+-----------------------
282	   False negative | Denial of service      | Elevation of privilege
283	   ---------------+------------------------+-----------------------

285	                    Effect of False Positives/Negatives

287	                                 Figure 3
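
   The table in Figure 3 can also be expressed as a simple lookup.
   The following Python sketch only restates the figure; the names
   EFFECTS and effect_of are hypothetical.

```python
# Effects from Figure 3, keyed by (comparison error, policy).
EFFECTS = {
    ("false positive", "grant on match"): "elevation of privilege",
    ("false positive", "deny on match"): "denial of service",
    ("false negative", "grant on match"): "denial of service",
    ("false negative", "deny on match"): "elevation of privilege",
}


def effect_of(error: str, policy: str) -> str:
    """Return the worst-case effect of a comparison error under a policy."""
    return EFFECTS[(error, policy)]


print(effect_of("false negative", "deny on match"))
```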

289	   Elevation of privilege is almost always seen as far worse than denial
290	   of service.  Hence, for URIs for example, Section 6.1 of [RFC3986]
291	   states: "comparison methods are designed to minimize false negatives
292	   while strictly avoiding false positives".

294	   Thus URIs were defined with a "grant privilege on match" paradigm in
295	   mind, where it is critical to prevent elevation of privilege while
296	   minimizing denial of service.  Using URIs in a "deny privilege on
297	   match" system can thus be problematic.

299	2.3.  Hypothetical Example

301	   In this example, both security principals and resources are
302	   identified using URIs.  Foo Corp has paid example.com for access to
303	   the stuff service.  Foo Corp allows its employees to create accounts
304	   on the stuff service.  Alice gets the account
305	   "http://example.com/stuff/FooCorp/alice" and Bob gets
306	   "http://example.com/stuff/FooCorp/bob".  It turns out, however, that
307	   Foo Corp's URI canonicalizer includes URI fragment components in
308	   comparisons whereas example.com's does not, and Foo Corp does not
309	   disallow the # character in the account name.  So Chuck, who is a
310	   malicious employee of Foo Corp, asks to create an account at
311	   example.com with the name alice#stuff.  Foo Corp's URI logic checks
312	   its records for accounts it has created with stuff and sees that
313	   there is no account with the name alice#stuff.  Hence, in its
314	   records, it associates the account alice#stuff with Chuck and will
315	   only issue tokens good for use with
316	   "http://example.com/stuff/FooCorp/alice#stuff" to Chuck.

318	   Chuck, the attacker, goes to a security token service at Foo Corp and
319	   asks for a security token good for
320	   "http://example.com/stuff/FooCorp/alice#stuff".  Foo Corp issues the
321	   token since Chuck is the legitimate owner (in Foo Corp's view) of the
322	   alice#stuff account.  Chuck then submits the security token in a
323	   request to "http://example.com/stuff/FooCorp/alice".

325	   But example.com uses a URI canonicalizer that, for the purposes of
326	   checking equality, ignores fragments.  So when example.com looks in
327	   the security token to see if the requester has permission from Foo
328	   Corp to access the given account it successfully matches the URI in
329	   the security token, "http://example.com/stuff/FooCorp/alice#stuff",
330	   with the requested resource name
331	   "http://example.com/stuff/FooCorp/alice".

333	   Leveraging the inconsistencies in the canonicalizers used by Foo Corp
334	   and example.com, Chuck is able to successfully launch an elevation of
335	   privilege attack and access Alice's resource.
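
   The mismatch in this example can be sketched in a few lines of
   Python.  Both comparator functions are hypothetical stand-ins for
   the canonicalizers of Foo Corp and example.com; only
   urllib.parse.urldefrag() is a real library call.

```python
from urllib.parse import urldefrag


def foo_corp_equal(a: str, b: str) -> bool:
    """Hypothetical Foo Corp comparator: the fragment is significant."""
    return a == b


def example_com_equal(a: str, b: str) -> bool:
    """Hypothetical example.com comparator: the fragment is ignored."""
    return urldefrag(a).url == urldefrag(b).url


alice = "http://example.com/stuff/FooCorp/alice"
chuck = "http://example.com/stuff/FooCorp/alice#stuff"

# Foo Corp sees two distinct accounts, so it issues Chuck a token.
print(foo_corp_equal(alice, chuck))
# example.com sees the same resource, so it honors the token for Alice's data.
print(example_com_equal(alice, chuck))
```

   Neither comparator is wrong in isolation; the elevation of
   privilege arises only because the two parties disagree.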

337	3.  Common Identifiers

339	   In this section, we walk through a number of common types of
340	   identifiers and discuss various issues related to comparison that may
341	   affect security whenever they are used to identify security
342	   principals or resources.  These examples illustrate common patterns
343	   that may arise with other types of identifiers.

345	3.1.  Hostnames

347	   Hostnames are commonly used either directly as identifiers, or as
348	   components in identifiers such as in URIs and email addresses.
349	   Another example is in [RFC5280], sections 7.2 and 7.3 (and updated in
350	   section 3 of [I-D.ietf-pkix-rfc5280-clarifications]), which specify
351	   use in certificates.

353	   In this section we discuss a number of issues in comparing strings
354	   that appear to be some form of hostname.

356	   Section 3 of [RFC6055] discusses the differences between a "hostname"
357	   vs. a "DNS name", where the former is a subset of the latter by using
358	   a restricted set of characters.  If one canonicalizer uses the "DNS
359	   name" definition whereas another uses a "hostname" definition, a name
360	   might be valid in the former but invalid in the latter.  As long as
361	   invalid identifiers are denied privilege, this difference will not
362	   result in elevation of privilege.

364	   [IAB1123] briefly discusses issues with the ambiguity around whether
365	   a label will be "alphabetic", including among other issues, whether a
366	   hostname can be interpreted as an IP address.  We explore this last
367	   issue in more detail below.

369	3.1.1.  IPv4 Literals

371	   [RFC0952] defined an entry in the "Internet host table" as follows:

373	      A "name" (Net, Host, Gateway, or Domain name) is a text string up
374	      to 24 characters drawn from the alphabet (A-Z), digits (0-9),
375	      minus sign (-), and period (.).  Note that periods are only
376	      allowed when they serve to delimit components of "domain style
377	      names". [...]  No blank or space characters are permitted as part
378	      of a name.  No distinction is made between upper and lower case.
379	      The first character must be an alpha character.  The last
380	      character must not be a minus sign or period. [...]  Single
381	      character names or nicknames are not allowed.

383	   [RFC1123] section 2.1 then updates the definition with:

385	      The syntax of a legal Internet host name was specified in RFC-952
386	      [DNS:4].  One aspect of host name syntax is hereby changed: the
387	      restriction on the first character is relaxed to allow either a
388	      letter or a digit.  Host software MUST support this more liberal
389	      syntax.

391	   and

393	      Whenever a user inputs the identity of an Internet host, it SHOULD
394	      be possible to enter either (1) a host domain name or (2) an IP
395	      address in dotted-decimal ("#.#.#.#") form.  The host SHOULD check
396	      the string syntactically for a dotted-decimal number before
397	      looking it up in the Domain Name System.

399	   and

401	      This last requirement is not intended to specify the complete
402	      syntactic form for entering a dotted-decimal host number; that is
403	      considered to be a user-interface issue.

405	   In specifying the inet_addr() API, the POSIX standard [IEEE-1003.1]
406	   defines "IPv4 dotted decimal notation" as allowing not only strings
407	   of the form "10.0.1.2", but also octal and hexadecimal numbers, and
408	   addresses with fewer than four parts.  For example, "10.0.258",
409	   "0xA000102", and "012.0x102" all represent the same IPv4 address in
410	   standard "IPv4 dotted decimal" notation.  We will refer to this as
411	   the "loose" syntax of an IPv4 address literal.
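
   The loose syntax can be illustrated with a small pure-Python
   parser.  This is a sketch of the commonly described inet_addr()
   rules (C-style octal and hexadecimal prefixes, with the last part
   filling all remaining bytes); the function names are hypothetical,
   and real implementations may differ in corner cases.

```python
def _parse_c_number(text: str) -> int:
    """Parse a C-style unsigned number: 0x.. is hex, 0.. is octal, else decimal."""
    if text[:2].lower() == "0x":
        digits, base = text[2:], 16
    elif len(text) > 1 and text[0] == "0":
        digits, base = text[1:], 8
    else:
        digits, base = text, 10
    allowed = "0123456789abcdef"[:base]
    if not digits or any(ch not in allowed for ch in digits.lower()):
        raise ValueError(f"invalid number: {text!r}")
    return int(digits, base)


def parse_ipv4_loose(text: str) -> bytes:
    """Parse a 'loose' IPv4 literal into its 4-byte binary form (sketch)."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"invalid IPv4 literal: {text!r}")
    *leading, last = [_parse_c_number(p) for p in parts]
    remaining_bytes = 4 - len(leading)
    # Each leading part is one byte; the last part fills the remaining bytes.
    if any(v > 0xFF for v in leading) or last >= 1 << (8 * remaining_bytes):
        raise ValueError(f"value out of range in: {text!r}")
    address = 0
    for value in leading:
        address = (address << 8) | value
    address = (address << (8 * remaining_bytes)) | last
    return address.to_bytes(4, "big")


for literal in ("10.0.1.2", "10.0.258", "0xA000102", "012.0x102"):
    print(literal, list(parse_ipv4_loose(literal)))
```

   All four literals yield the same four bytes, so a comparison
   performed on the strings disagrees with one performed on the
   binary addresses.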

413	   In section 6.1 of [RFC3493] getaddrinfo() is defined to support the
414	   same (loose) syntax as inet_addr():

416	      If the specified address family is AF_INET or AF_UNSPEC, address
417	      strings using Internet standard dot notation as specified in
418	      inet_addr() are valid.

420	   In contrast, section 6.3 of the same RFC states, specifying
421	   inet_pton():

423	      If the af argument of inet_pton() is AF_INET, the src string shall
424	      be in the standard IPv4 dotted-decimal form: ddd.ddd.ddd.ddd where
425	      "ddd" is a one to three digit decimal number between 0 and 255.
426	      The inet_pton() function does not accept other formats (such as
427	      the octal numbers, hexadecimal numbers, and fewer than four
428	      numbers that inet_addr() accepts).

430	   As shown above, inet_pton() uses what we will refer to as the
431	   "strict" form of an IPv4 address literal.  Some platforms also use
432	   the strict form with getaddrinfo() when the AI_NUMERICHOST flag is
433	   passed to it.
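
   For contrast, a strict parser can be sketched as follows.  The
   function name is hypothetical.  Rejecting leading zeros is an
   assumption borrowed from the dec-octet rule of [RFC3986]; the
   text quoted above does not say how leading zeros are treated.

```python
import re

# One decimal octet, 0-255, without leading zeros (assumption, see lead-in).
_DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
_STRICT_IPV4 = re.compile(rf"{_DEC_OCTET}(?:\.{_DEC_OCTET}){{3}}")


def parse_ipv4_strict(text: str) -> bytes:
    """Parse a 'strict' IPv4 literal (exactly four decimal parts)."""
    if not _STRICT_IPV4.fullmatch(text):
        raise ValueError(f"not a strict IPv4 literal: {text!r}")
    return bytes(int(part) for part in text.split("."))


print(list(parse_ipv4_strict("10.0.1.2")))
for literal in ("10.0.258", "012.0x102"):
    try:
        parse_ipv4_strict(literal)
    except ValueError as error:
        print(error)
```

   Using explicit [0-9] classes rather than a generic digit class
   keeps non-ASCII digits from being accepted.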

435	   Both the strict and loose forms are standard forms, and hence a
436	   protocol specification is still ambiguous if it simply defines a
437	   string to be in the "standard IPv4 dotted decimal form".  And, as a
438	   result of these differences, names like "10.11.12" are ambiguous as
439	   to whether they are an IP address or a hostname, and even
440	   "10.11.12.13" can be ambiguous because of the "SHOULD" in RFC 1123
441	   above making it optional whether to treat it as an address or a name.

443	   Protocols and data formats that can use addresses in string form for
444	   security purposes need to resolve these ambiguities.  For example,
445	   for the host component of URIs, section 3.2.2 of [RFC3986] resolves
446	   the first ambiguity by only allowing the strict form, and the second
447	   ambiguity by specifying that it is considered an IPv4 address
448	   literal.  New protocols and data formats should similarly consider
449	   using the strict form rather than the loose form in order to better
450	   match user expectations.

452	   Thus, whereas (binary) IPv4 addresses are Absolute identifiers, IPv4
453	   address literals are at best Definite identifiers, and often turn out
454	   to be Indefinite identifiers.

456	   Furthermore, when strings can contain non-ASCII characters, they can
457	   contain other characters that may look like dots or digits to a human
458	   viewing and/or entering the identifier, especially to one who might
459	   expect digits to appear in his or her native script.

461	3.1.2.  IPv6 Literals

463	   IPv6 addresses similarly have a wide variety of alternate but
464	   semantically identical string representations, as defined in section
465	   2.2 of [RFC4291].  As discussed in section 3.2.5 of [RFC5952], this
466	   fact causes problems in security contexts if comparison (such as in
467	   X.509 certificates), is done between strings rather than between the
468	   binary representations of addresses.

470	   [RFC5952] recently specified a recommended canonical string format as
471	   an attempt to solve this problem, but it may not be ubiquitously
472	   supported at present.  And, when strings can contain non-ASCII
473	   characters, the same issues (and more, since hexadecimal and colons
474	   are allowed) arise as with IPv4 literals.

476	   Whereas (binary) IPv6 addresses are Absolute identifiers, IPv6
477	   address literals are Definite identifiers, since string-to-address
478	   conversion for IPv6 address literals is unambiguous.
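
   A minimal sketch of comparing IPv6 literals by their binary form,
   rather than as strings, using the ipaddress module of the Python
   standard library.  The function name is hypothetical.

```python
import ipaddress


def ipv6_literals_equal(a: str, b: str) -> bool:
    """Compare two IPv6 literals via their 16-byte binary representation."""
    return ipaddress.IPv6Address(a).packed == ipaddress.IPv6Address(b).packed


first = "2001:db8::1"
second = "2001:0DB8:0:0:0:0:0:1"

print(first == second)                      # string comparison
print(ipv6_literals_equal(first, second))   # binary comparison
```

   The string comparison treats the two literals as different, while
   the binary comparison treats them as the same address.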

480	3.1.3.  Internationalization

482	   The IETF policy on character sets and languages [RFC2277] requires
483	   support for UTF-8 in protocols, and as a result many protocols now do
484	   support non-ASCII characters.  When a hostname is sent in a UTF-8
485	   field, there are a number of ways it may be encoded.  For example,
486	   labels might be encoded directly in UTF-8, or might first be
487	   Punycode-encoded or percent-encoded and then encoded in UTF-8.

489	   For example, in URIs, [RFC3986] section 3.2.2 specifically allows for
490	   the use of percent-encoded UTF-8 characters in the hostname, as well
491	   as the use of IDNA encoding using the Punycode algorithm.

493	   Percent-encoding is unambiguous for hostnames since the percent
494	   character cannot appear in the strict definition of a "hostname",
495	   though it can appear in a DNS name.

497	   Punycode-encoded labels (or "A-labels") on the other hand can be
498	   ambiguous if hosts are actually allowed to be named with a name
499	   starting with "xn--", and false positives can result.  While this may
500	   be extremely unlikely for normal scenarios, it nevertheless provides
501	   a possible vector for an attacker.

503	   A hostname comparator used with non-ASCII strings thus needs to
504	   decide whether a Punycode-encoded string should or should not be
505	   considered a valid hostname label, and if so, then whether it should
506	   match the equivalent Unicode string ("U-label").
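
   One possible answer to these questions is sketched below: a
   comparator that converts both names to A-label form before
   comparing, so that a U-label matches its Punycode-encoded
   equivalent.  The function names are hypothetical.  Python's
   built-in "idna" codec is used for brevity; it implements the older
   IDNA2003 rules, so this illustrates the decision rather than
   recommending a particular algorithm.

```python
def to_a_label_form(hostname: str) -> str:
    """Convert a hostname to lowercase ASCII (A-label) form (sketch)."""
    # The built-in "idna" codec applies IDNA2003 processing per label.
    return hostname.encode("idna").decode("ascii").lower()


def hostnames_match(a: str, b: str) -> bool:
    """Treat a U-label and its equivalent A-label as the same name."""
    return to_a_label_form(a) == to_a_label_form(b)


print(to_a_label_form("b\u00fccher.example"))
print(hostnames_match("b\u00fccher.example", "XN--BCHER-KVA.example"))
```

   A comparator that instead rejects labels starting with "xn--", or
   that compares strings without conversion, would return a different
   answer for the same pair of names.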

508	   For example, Section 3 of "Transport Layer Security (TLS) Extensions"
510	   [RFC6066] states:

512	      "HostName" contains the fully qualified DNS hostname of the
513	      server, as understood by the client.  The hostname is represented
514	      as a byte string using ASCII encoding without a trailing dot.
515	      This allows the support of internationalized domain names through
516	      the use of A-labels defined in [RFC5890].  DNS hostnames are case-
517	      insensitive.  The algorithm to compare hostnames is described in
518	      [RFC5890], Section 2.3.2.4.

520	   For some additional discussion of security issues that arise with
521	   internationalization, see [TR36].

523	3.1.4.  Resolution for comparison

525	   Some systems (specifically Java) used to follow the rule that if two
526	   hostnames resolved to the same IP address then the hostnames were
527	   considered equal.  That is, the canonicalization algorithm involved
528	   name resolution with an IP address being the canonical form.
529	   However, with the introduction of dynamic IP addresses, private IP
530	   addresses, multiple IP addresses per name, etc., this method of
531	   comparison cannot be relied upon.  There is no guarantee that two
532	   names for the same host will resolve to the same IP addresses, nor
533	   that the addresses resolved refer to the same entity.

535	   In addition, a comparison mechanism that relies on the ability to
536	   resolve identifiers such as hostnames to other identifiers such as IP
537	   addresses leaks information about security decisions to outsiders if
538	   these queries are publicly observable.

540	3.2.  Ports and Service Names

542	   Port numbers and service names are discussed in depth in [RFC6335].
543	   Historically, there were port numbers, service names used in SRV
544	   records, and mnemonic identifiers for assigned port numbers (known as
545	   port "keywords" at [IANA-PORT]).  The latter two are now unified, and
546	   various protocols use one or more of these types in strings.  For
547	   example, the common syntax used by many URI schemes allows port
548	   numbers but not service names.  Some implementations of the
549	   getaddrinfo() API support strings that can be either port numbers or
550	   port keywords (but not service names).

552	   For protocols that use service names that must be resolved, the
553	   issues are the same as those for resolution of addresses in
554	   Section 3.1.4.  In addition, Section 5.1 of [RFC6335] clarifies that
555	   service names/port keywords must contain at least one letter.  This
556	   prevents confusion with port numbers in strings where both are
557	   allowed.
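
	   As a rough illustration of why the at-least-one-letter rule removes
	   the ambiguity, the following Python sketch classifies a string found
	   in a port position.  The function name classify_port_token and the
	   regular expression are assumptions made for this example, written
	   from the syntax summary in Section 5.1 of [RFC6335]; they are not
	   taken from any particular implementation.

```python
import re

# Assumed summary of the service-name syntax in Section 5.1 of RFC 6335:
# 1-15 characters from [A-Za-z0-9-], at least one letter, no leading or
# trailing hyphen, and no two adjacent hyphens.
_SERVICE_NAME = re.compile(
    r"(?=.{1,15}\Z)(?=.*[A-Za-z])[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
)


def classify_port_token(token: str) -> str:
    """Return "port", "service", or "invalid" for a port-position string."""
    if token.isascii() and token.isdigit():
        # All digits: only the port-number reading is possible.
        return "port" if int(token) <= 65535 else "invalid"
    if _SERVICE_NAME.fullmatch(token):
        # Contains at least one letter, so it cannot be a port number.
        return "service"
    return "invalid"
```

	   Because an all-digit string can never match the service-name
	   pattern, the two branches cannot both apply to the same input.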

559	3.3.  URIs and IRIs

561	   This section looks at issues related to using URIs for security
562	   purposes.  For example, [RFC5280], section 7.4, specifies comparison
563	   of URIs in certificates.  Examples of URIs in security token-based
564	   access control systems include WS-*, SAML-P and OAuth WRAP.  In such
565	   systems, a variety of participants in the security infrastructure are
566	   identified by URIs.  For example, requesters of security tokens are
567	   sometimes identified with URIs.  The issuers of security tokens and
568	   the relying parties who are intended to consume security tokens are
569	   frequently identified by URIs.  Claims in security tokens often have
570	   their types defined using URIs and the values of the claims can also
571	   be URIs.

573	   Also, when a URI is embedded in plain text (e.g., an email message),
574	   there is an additional concern because there is no termination
575	   criterion for a URL.  For example, consider
576	   http://unicode.org/cldr/utility/list-unicodeset.jsp?a=a&amp;g=gc.
577	   Some email clients will stop before the ';' while others go to the
578	   '.'.  As another point of comparison, Section 2.37 of [EE] (a
579	   standard for history citations) specifies the use of a space after a
580	   URI and before the punctuation.

582	   URIs are defined with multiple components, each of which has its
583	   own rules.  We cover each in turn below.  However, it is also
584	   important to note that there exist multiple comparison algorithms.
585	   [RFC3986] section 6.2 states:

587	      A variety of methods are used in practice to test URI equivalence.
588	      These methods fall into a range, distinguished by the amount of
589	      processing required and the degree to which the probability of
590	      false negatives is reduced.  As noted above, false negatives
591	      cannot be eliminated.  In practice, their probability can be
592	      reduced, but this reduction requires more processing and is not
593	      cost-effective for all applications.
594	      If this range of comparison practices is considered as a ladder,
595	      the following discussion will climb the ladder, starting with
596	      practices that are cheap but have a relatively higher chance of
597	      producing false negatives, and proceeding to those that have
598	      higher computational cost and lower risk of false negatives.

600	   The ladder approach has both pros and cons.  On the pro side, it
601	   allows some uses to optimize for security, and other uses to optimize
602	   for cost, thus allowing URIs to be applicable to a wide range of
603	   uses.  A disadvantage is that when different approaches are taken by
604	   different components in the same system using the same identifiers,
605	   the inconsistencies can result in security issues.

607	3.3.1.  Scheme component

609	   [RFC3986] defines URI schemes as being case-insensitive ASCII and in
610	   section 6.2.2.1 specifies that scheme names should be normalized to
611	   lower-case characters.

613	   New schemes can be defined over time.  In general, however, two
614	   URIs with an unrecognized scheme cannot be safely compared.  This is
615	   because the canonicalization and comparison rules for the other
616	   components may vary by scheme.  For example, a new URI scheme might
617	   have a default port of X, and without that knowledge, a comparison
618	   algorithm cannot know whether "example.com" and "example.com:X"
619	   should be considered to match in the authority component.  Hence for
620	   security purposes, it is safest for unrecognized schemes to be
621	   treated as invalid identifiers.  However, if the URIs are only used
622	   with a "grant access on match" paradigm then unrecognized schemes can
623	   be supported by doing a generic case-sensitive comparison, at the
624	   expense of some false negatives.
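
	   The default-port problem can be sketched in Python as follows.  The
	   table KNOWN_DEFAULT_PORTS and both function names are hypothetical
	   and exist only for this example.  Recognized schemes are compared
	   after filling in the default port; an unrecognized scheme falls back
	   to an exact, case-sensitive string comparison, which is appropriate
	   only under a "grant access on match" paradigm.

```python
from urllib.parse import urlsplit

# Illustrative table of the schemes this sketch "recognizes".
KNOWN_DEFAULT_PORTS = {"http": 80, "https": 443}


def authority_key(uri: str):
    """Return (scheme, host, port) for a recognized scheme, else None."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()  # scheme names are case-insensitive
    if scheme not in KNOWN_DEFAULT_PORTS:
        return None  # default port unknown, so no safe canonical form
    port = parts.port if parts.port is not None else KNOWN_DEFAULT_PORTS[scheme]
    return (scheme, (parts.hostname or "").lower(), port)


def authorities_match(a: str, b: str) -> bool:
    """Compare scheme and authority only; other components are ignored."""
    key_a, key_b = authority_key(a), authority_key(b)
    if key_a is None or key_b is None:
        # Generic exact comparison: may yield false negatives, but it
        # never declares a match that depends on unknown scheme rules.
        return a == b
    return key_a == key_b
```

	   With this sketch, "HTTP://Example.com:80/" and "http://example.com/"
	   match, while "foo://example.com" and "foo://example.com:99" do not,
	   even if 99 happened to be the default port of the "foo" scheme.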

626	3.3.2.  Authority component

628	   The authority component is scheme-specific, but many schemes follow a
629	   common syntax that allows for userinfo, host, and port.

631	3.3.2.1.  Host

633	   Section 3.1 discussed issues with hostnames in general.  In addition,
634	   [RFC3986] section 3.2.2 allows future changes using the IPvFuture
635	   production.  As with IPv4 and IPv6 literals, IPvFuture formats may
636	   have issues with multiple semantically identical string
637	   representations, and may also be semantically identical to an IPv4 or
638	   IPv6 address.  As such, false negatives may be common if IPvFuture is
639	   used.
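
	   The multiple-representation issue noted above for IPv4 and IPv6
	   literals can be illustrated with Python's standard ipaddress module.
	   The function name same_ip_literal is an assumption of this example,
	   and treating an IPv4-mapped IPv6 address as equal to the
	   corresponding IPv4 address is a policy choice of the sketch, not a
	   rule taken from this document.  IPvFuture literals are not handled.

```python
import ipaddress


def _value(literal: str):
    """Parse a literal and unwrap an IPv4-mapped IPv6 address, if any."""
    addr = ipaddress.ip_address(literal)
    # Only IPv6Address objects carry ipv4_mapped; it is None unless the
    # address lies in ::ffff:0:0/96.
    mapped = getattr(addr, "ipv4_mapped", None)
    return mapped if mapped is not None else addr


def same_ip_literal(a: str, b: str) -> bool:
    """Compare two address literals by value rather than by spelling."""
    return _value(a) == _value(b)
```

	   A plain string comparison of "2001:db8::1" and
	   "2001:0DB8:0:0:0:0:0:1" reports a mismatch, whereas the value-based
	   comparison above treats them as the same address.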

641	3.3.2.2.  Port

643	   See discussion in Section 3.2.

645	3.3.2.3.  Userinfo

647	   [RFC3986] defines the userinfo production that allows arbitrary data
648	   about the user of the URI to be placed before '@' signs in URIs (see
649	   also Section 3.4).  For example,
650	   "http://alice:bob:chuck@example.com/bar" has the value
651	   "alice:bob:chuck" as its userinfo.  When comparing URIs in a security
652	   context, one must decide whether to treat the userinfo as significant
653	   or not.  Some URI comparison services, for example, treat
654	   "http://alice:ick@example.com" and "http://example.com" as being
655	   equal.
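
	   A minimal Python sketch of the two treatments follows.  The helper
	   names userinfo_of and without_userinfo are hypothetical; the sketch
	   only shows how a comparison that drops the userinfo ends up equating
	   the two URIs mentioned above.

```python
from urllib.parse import urlsplit, urlunsplit


def userinfo_of(uri: str) -> str:
    """Return the userinfo subcomponent, or "" if there is none."""
    netloc = urlsplit(uri).netloc
    userinfo, at, _hostport = netloc.rpartition("@")
    return userinfo if at else ""


def without_userinfo(uri: str) -> str:
    """Return the URI with any userinfo removed from the authority."""
    parts = urlsplit(uri)
    hostport = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=hostport))
```

	   A comparison that applies without_userinfo to both inputs treats the
	   userinfo as insignificant; one that compares the original strings
	   treats it as significant.  Either choice needs to be made
	   consistently by every component that uses the identifier.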

657	3.3.3.  Path component

659	   [RFC3986] supports the use of path segment values such as "./" or
660	   "../" for relative URLs.  Strictly speaking, including such path
661	   segment values in a fully qualified URI is syntactically illegal, but
662	   [RFC3986] section 5.2.4 nevertheless defines an algorithm to remove
663	   them.
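
	   The dot-segment removal routine of [RFC3986] can be sketched in
	   Python as shown below.  This is an illustrative transcription of the
	   published steps, not a reference implementation, and it operates on
	   the path component only.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URI path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]            # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]            # "/../x" becomes "/x"
            if output:
                output.pop()           # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to
            # the output, up to but not including the next "/".
            start = 1 if path.startswith("/") else 0
            cut = path.find("/", start)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)
```

	   A comparison that skips this step treats "/a/b/../g" and "/a/g" as
	   different paths, which is one source of false negatives.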

665	   Unless a scheme states otherwise, the path component is defined to be
666	   case-sensitive.  However, if the resource is stored and accessed
667	   using a filesystem using case-insensitive paths, there will be many
668	   paths that refer to the same resource.  As such, false negatives can
669	   be common in this case.

671	3.3.4.  Query component

673	   There is the question as to whether "http://example.com/foo",
674	   "http://example.com/foo?", and "http://example.com/foo?bar" are each
675	   considered equal or different.

677	   Similarly, it is unspecified whether the order of values matters.
678	   For example, should "http://example.com/blah?ick=bick&foo=bar" be
679	   considered equal to "http://example.com/blah?foo=bar&ick=bick"?  And
680	   if a domain name is permitted to appear in a query component (e.g.,
681	   in a reference to another URI), the same issues in Section 3.1 apply.
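
	   The ordering question can be made concrete with a short Python
	   sketch.  Both function names are hypothetical, and parsing the query
	   as "&"-separated name=value pairs is an assumption that holds for
	   many HTTP applications but is not required by the generic URI
	   syntax.

```python
from urllib.parse import parse_qsl, urlsplit


def query_equal_exact(a: str, b: str) -> bool:
    """Compare the query components as opaque strings."""
    return urlsplit(a).query == urlsplit(b).query


def query_equal_unordered(a: str, b: str) -> bool:
    """Compare the query components as unordered name=value pairs."""
    pairs_a = sorted(parse_qsl(urlsplit(a).query, keep_blank_values=True))
    pairs_b = sorted(parse_qsl(urlsplit(b).query, keep_blank_values=True))
    return pairs_a == pairs_b
```

	   The two functions disagree on the example above: the exact form
	   reports a mismatch and the unordered form reports a match.  Note
	   also that Python's urlsplit yields an empty query for both
	   "http://example.com/foo" and "http://example.com/foo?", so a
	   comparison built on it cannot tell those two apart.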

683	3.3.5.  Fragment component

685	   Some URI formats include fragment identifiers.  These are typically
686	   handles to locations within a resource and are used for local
687	   reference.  A classic example is the use of fragments in HTTP URLs
688	   where a URL of the form "http://example.com/blah.html#ick" means
689	   retrieve the resource "http://example.com/blah.html" and, once it has
690	   arrived locally, find the HTML anchor named ick and display that.

692	   So, for example, when a user clicks on the link
693	   "http://example.com/blah.html#baz" a browser will check its cache by
694	   doing a URI comparison for "http://example.com/blah.html" and, if the
695	   resource is present in the cache, a match is declared.

697	   Hence comparisons for security purposes typically ignore the fragment
698	   component and treat all fragments as equal to the full resource.
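
	   A comparison that ignores the fragment can be sketched with the
	   urldefrag function from Python's standard library.  The function
	   name same_resource is an assumption of this example.

```python
from urllib.parse import urldefrag


def same_resource(a: str, b: str) -> bool:
    """Compare two URIs after discarding any fragment component."""
    return urldefrag(a).url == urldefrag(b).url
```

	   With this sketch, "http://example.com/blah.html#baz" and
	   "http://example.com/blah.html" compare as equal.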

700	3.4.  Email Address-like Identifiers

702	   Section 3.4.1 of [RFC5322] defines the syntax of an email address-
703	   like identifier, and Section 3.2 of [RFC6532] updates it to support
704	   internationalization.  [RFC5280], section 7.5, further discusses the
705	   use of internationalized email addresses in certificates.

707	   [RFC6532] in turn points to [RFC6530], where Section 13 contains a
708	   discussion of many issues resulting from
709	   internationalization.

711	   Email address-like identifiers have a local part and a domain part.
712	   The issues with the domain part are essentially the same as with
713	   hostnames, covered earlier.

715	   The local part is left for each domain to define.  People quite
716	   commonly use email addresses as usernames with web sites like banks
717	   or shopping sites, but the site doesn't know whether foo@example.com
718	   is the same person as FOO@example.com.  Thus email-like identifiers
719	   are typically Indefinite identifiers.

721	   To avoid false positives, some security mechanisms (such as
722	   [RFC5280]) compare the local part using an exact match.  Hence, like
723	   URIs, email address-like identifiers are designed for use in grant-
724	   on-match security schemes, not in deny-on-match schemes.
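
	   A minimal Python sketch of such an exact-match comparison follows.
	   The function names are hypothetical, and the domain part is assumed
	   to be ASCII-only so that simple lowercasing suffices; an
	   internationalized domain would need the hostname handling discussed
	   in Section 3.1.

```python
def split_address(address: str):
    """Split an address into (local part, domain) at the last '@'."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError("not an email address-like identifier")
    return local, domain


def addresses_match(a: str, b: str) -> bool:
    """Exact match on the local part, case-insensitive on the domain."""
    local_a, domain_a = split_address(a)
    local_b, domain_b = split_address(b)
    # Assumption: ASCII-only domains, so lower() is an adequate fold.
    return local_a == local_b and domain_a.lower() == domain_b.lower()
```

	   Here "foo@example.com" and "FOO@example.com" do not match, which is
	   safe when a match grants access but not when a match denies it.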

726	4.  General Internationalization Issues

728	   In addition to the issues with hostnames discussed in Section 3.1.3,
729	   there are a number of internationalization issues that apply to many
730	   types of Definite and Indefinite identifiers.

732	   Some strings are visually confusable with others, and hence if a
733	   security decision is made by a user based on visual inspection, many
734	   opportunities for false positives exist.  As such, highly secure
735	   systems cannot rely on visual inspection.

737	   Determining whether a string is a valid identifier should typically
738	   be done after, or as part of, canonicalization.  Otherwise an
739	   attacker might use the canonicalization algorithm to inject (e.g.,
740	   via percent encoding, NFKC, or non-shortest-form UTF-8) delimiters
741	   such as '@' in an email address-like identifier, or a '.' in a
742	   hostname.
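
	   The ordering problem can be shown with a short Python sketch.  The
	   validity rule used here (a local part must be non-empty and contain
	   no '@') and both function names are assumptions made for the
	   example; the canonicalization step combines percent-decoding with
	   NFKC purely for illustration.

```python
import unicodedata
from urllib.parse import unquote


def canonicalize(value: str) -> str:
    """Percent-decode, then apply NFKC normalization."""
    return unicodedata.normalize("NFKC", unquote(value))


def is_valid_local_part(value: str) -> bool:
    """Hypothetical rule: non-empty and free of the '@' delimiter."""
    return value != "" and "@" not in value
```

	   The string "alice%40evil.example" passes the check when validated
	   first but fails once canonicalized, because "%40" decodes to '@'.
	   The same happens with U+FF20 (FULLWIDTH COMMERCIAL AT), which NFKC
	   maps to '@'.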

744	   Any case-insensitive comparisons need to define how comparison is
745	   done, since such comparisons may vary by locale of the endpoint.  As
746	   such, using case-insensitive comparisons in general often results in
747	   identifiers being either Indefinite or, if the legal character set is
748	   restricted (e.g., to ASCII), Definite.
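
	   As a small illustration of why the comparison algorithm has to be
	   specified, the Python sketch below applies two common notions of
	   "case-insensitive" to the same inputs.  The function names are
	   assumptions of this example; neither function is locale-aware, and
	   locale-specific rules would add further variation.

```python
def equal_lower(a: str, b: str) -> bool:
    """Case-insensitive comparison via simple lowercasing."""
    return a.lower() == b.lower()


def equal_casefold(a: str, b: str) -> bool:
    """Case-insensitive comparison via Unicode case folding."""
    return a.casefold() == b.casefold()
```

	   For ASCII-only input the two agree, but for "STRASSE" and the
	   string "stra" + U+00DF + "e" the lowercasing form reports a mismatch
	   while the case-folding form reports a match.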

750	   See also [WEBER] for a more visual discussion of many of these
751	   issues.

753	5.  Security Considerations

755	   This entire document is about security considerations.

757	   To minimize elevation of privilege issues, any system that requires
758	   the ability to use both deny and allow operations within the same
759	   identifier space should avoid the use of Indefinite identifiers in
760	   security comparisons.

762	   To minimize future security risks, any new identifiers being designed
763	   should specify an Absolute or Definite comparison algorithm, and if
764	   extensibility is allowed (e.g., as new schemes in URIs allow) then
765	   the comparison algorithm should remain invariant so that unrecognized
766	   extensions can be compared.  That is, security risks can be reduced
767	   by specifying the comparison algorithm, making sure to resolve any
768	   ambiguities pointed out in this document (e.g., "standard dotted
769	   decimal").

771	   Some issues (such as unrecognized extensions) can be mitigated by
772	   treating such identifiers as invalid.  Validity checking of
773	   identifiers is further discussed in [RFC3696].

775	   Perhaps the hardest issues arise when multiple protocols are used
776	   together, such as in the figure in Section 2, where the two protocols
777	   are defined or implemented using different comparison algorithms.
778	   When constructing an architecture that uses multiple such protocols,
779	   designers should pay attention to any differences in comparison
780	   algorithms among the protocols, in order to fully understand the
781	   security risks.  An area for future work is how to deal with such
782	   security risks in current systems.

784	6.  Acknowledgements

786	   Yaron Goland contributed to much of the discussion on URIs.  Patrick
787	   Faltstrom contributed to the background on identifiers.  Additional
788	   helpful feedback and suggestions came from Magnus Nystrom, Bernard
789	   Aboba, Mark Davis, John Klensin, and Russ Housley.

791	7.  IANA Considerations

793	   This document requires no actions by the IANA.

795	8.  Informative References

797	   [EE]       Mills, E., "Evidence Explained: Citing History Sources
798	              from Artifacts to Cyberspace", 2007.

800	   [I-D.ietf-pkix-rfc5280-clarifications]
801	              Cooper, D., "Updates to the Internet X.509 Public Key
802	              Infrastructure Certificate and Certificate Revocation List
803	              (CRL) Profile", draft-ietf-pkix-rfc5280-clarifications-04
804	              (work in progress), March 2012.

806	   [IAB1123]  IAB, "The interpretation of rules in the ICANN gTLD
807	              Applicant Guidebook", February 2012, <http://www.iab.org/
808	              documents/correspondence-reports-documents/2012-2/
809	              iab-statement-the-interpretation-of-rules-in-the-icann-
810	              gtld-applicant-guidebook>.

812	   [IANA-PORT]
813	              IANA, "PORT NUMBERS", June 2011,
814	              <http://www.iana.org/assignments/port-numbers>.

816	   [IEEE-1003.1]
817	              IEEE and The Open Group, "The Open Group Base
818	              Specifications, Issue 6 IEEE Std 1003.1, 2004 Edition",
819	              IEEE Std 1003.1, 2004.

821	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
822	              host table specification", RFC 952, October 1985.

824	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
825	              and Support", STD 3, RFC 1123, October 1989.

827	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
828	              Languages", BCP 18, RFC 2277, January 1998.

830	   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
831	              Stevens, "Basic Socket Interface Extensions for IPv6",
832	              RFC 3493, February 2003.

834	   [RFC3696]  Klensin, J., "Application Techniques for Checking and
835	              Transformation of Names", RFC 3696, February 2004.

837	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
838	              Resource Identifier (URI): Generic Syntax", STD 66,
839	              RFC 3986, January 2005.

841	   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
842	              Architecture", RFC 4291, February 2006.

844	   [RFC5280]  Cooper, D., Santesson, S., Farrell, S., Boeyen, S.,
845	              Housley, R., and W. Polk, "Internet X.509 Public Key
846	              Infrastructure Certificate and Certificate Revocation List
847	              (CRL) Profile", RFC 5280, May 2008.

849	   [RFC5322]  Resnick, P., Ed., "Internet Message Format", RFC 5322,
850	              October 2008.

	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
	              Applications (IDNA): Definitions and Document Framework",
	              RFC 5890, August 2010.

852	   [RFC5952]  Kawamura, S. and M. Kawashima, "A Recommendation for IPv6
853	              Address Text Representation", RFC 5952, August 2010.

855	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
856	              Encodings for Internationalized Domain Names", RFC 6055,
857	              February 2011.

859	   [RFC6066]  Eastlake, D., "Transport Layer Security (TLS) Extensions:
860	              Extension Definitions", RFC 6066, January 2011.

862	   [RFC6335]  Cotton, M., Eggert, L., Touch, J., Westerlund, M., and S.
863	              Cheshire, "Internet Assigned Numbers Authority (IANA)
864	              Procedures for the Management of the Service Name and
865	              Transport Protocol Port Number Registry", BCP 165,
866	              RFC 6335, August 2011.

868	   [RFC6530]  Klensin, J. and Y. Ko, "Overview and Framework for
869	              Internationalized Email", RFC 6530, February 2012.

871	   [RFC6532]  Yang, A., Steele, S., and N. Freed, "Internationalized
872	              Email Headers", RFC 6532, February 2012.

874	   [TR36]     Unicode Consortium, "Unicode Security Considerations",
875	              Unicode Technical Report 36, August 2004.

877	   [WEBER]    Weber, C., "Attacking Software Globalization", March 2010,
878	              <http://www.casabasecurity.com/files/
879	              Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf>.

881	Author's Address

883	   Dave Thaler (editor)
884	   Microsoft Corporation
885	   One Microsoft Way
886	   Redmond, WA  98052
887	   USA

889	   Phone: +1 425 703 8835
890	   Email: dthaler@microsoft.com