2	Network Working Group                                     D. Thaler, Ed.
3	Internet-Draft                                              July 2, 2011
4	Intended status: Informational
5	Expires: January 3, 2012

7	         Issues in Identifier Comparison for Security Purposes
8	                 draft-iab-identifier-comparison-00.txt

10	Abstract

12	   Identifiers such as hostnames, URIs/IRIs, and email addresses are
13	   often used in security contexts to identify security principals and
14	   resources.  In such contexts, an identifier supplied via some
15	   protocol is often compared against some policy to make security
16	   decisions such as whether the principal may access the resource, what
17	   level of authentication or encryption is required, etc.  If the
18	   parties involved in a security decision use different algorithms to
19	   compare identifiers, then failure scenarios ranging from denial of
20	   service to elevation of privilege can result.

22	Status of this Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current Internet-
30	   Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   This Internet-Draft will expire on January 3, 2012.

39	Copyright Notice

41	   Copyright (c) 2011 IETF Trust and the persons identified as the
42	   document authors.  All rights reserved.

44	   This document is subject to BCP 78 and the IETF Trust's Legal
45	   Provisions Relating to IETF Documents
46	   (http://trustee.ietf.org/license-info) in effect on the date of
47	   publication of this document.  Please review these documents
48	   carefully, as they describe your rights and restrictions with respect
49	   to this document.  Code Components extracted from this document must
50	   include Simplified BSD License text as described in Section 4.e of
51	   the Trust Legal Provisions and are provided without warranty as
52	   described in the Simplified BSD License.

54	Table of Contents

56	   1.  Introduction
57	   2.  Security Uses
58	     2.1.  Types of Identifiers
59	     2.2.  False Positives and Negatives
60	     2.3.  Hypothetical Example
61	   3.  Common Identifiers
62	     3.1.  Hostnames
63	       3.1.1.  IPv4 Literals
64	       3.1.2.  IPv6 Literals
65	       3.1.3.  Internationalization
66	       3.1.4.  Resolution for comparison
67	     3.2.  Ports and Service Names
68	     3.3.  URIs and IRIs
69	       3.3.1.  Scheme component
70	       3.3.2.  Authority component
71	       3.3.3.  Path component
72	       3.3.4.  Query component
73	       3.3.5.  Fragment component
74	     3.4.  Email Address-like Identifiers
75	   4.  General Internationalization Issues
76	   5.  Security Considerations
77	   6.  Acknowledgements
78	   7.  IANA Considerations
79	   8.  Informative References
80	   Author's Address

82	1.  Introduction

84	   In computing and the Internet, various types of "identifiers" are
85	   used to identify humans, devices, content, etc.  Before discussing
86	   security issues, we first give some background on the typical
87	   processes involving identifiers.

89	   As depicted in Figure 1, there are multiple processes relevant to our
90	   discussion.
91	   1.  An identifier must first be generated.  If the identifier is
92	       intended to be unique, the generation process includes some
93	       mechanism, such as allocation by a central authority, to help
94	       ensure uniqueness.  However, the notion of "unique" involves
95	       determining whether a putative identifier matches other already-
96	       allocated identifiers.  As we will see, for many types of
97	       identifiers, this is not simply an exact binary match.

99	       As a result of generating the identifier, it is typically stored
100	       in two locations: with the requester or "holder" of the
101	       identifier, and with some repository of identifiers.  For
102	       example, if the identifier was allocated by a central authority,
103	       the repository might be that authority.  If the identifier
104	       identifies a device or content on a device, the repository might
105	       be that device.
106	   2.  The identifier must be distributed, either by the holder of the
107	       identifier or by a repository of identifiers, to others who could
108	       use the identifier.  This distribution might be electronic, but
109	       sometimes it is via other channels such as voice, business card,
110	       billboard, or other form of advertisement.  The identifier itself
111	       might be distributed directly, or it might be used to generate a
112	       portion of another type of identifier that is then distributed.
113	       For example, a URI or email address might include a server name,
114	       and hence distributing the URI or email address also inherently
115	       distributes the server name.
116	   3.  The identifier must be used by some party.  Generally the user
117	       supplies the identifier which is (directly or indirectly) sent to
118	       the repository of identifiers.  For example, using an email
119	       address to send email to the holder of an identifier may result
120	       in the email arriving at the holder's email server which has the
121	       repository of all email accounts on that server.

123	       The repository of identifiers must then attempt to match the
124	       user-supplied identifier with an identifier in its repository.

126	                            +------------+
127	                            |  Holder of |     1. Generation
128	                            | identifier +<---------+
129	                            +----+-------+          |
130	                                 |                  | Match
131	                                 |                  v/
132	                                 |          +-------+-------+
133	                                 +----------+ Repository of |
134	                                 |          |  identifiers  |
135	                                 |          +-------+-------+
136	                 2. Distribution |                  ^\
137	                                 |                  | Match
138	                                 v                  |
139	                       +---------+-------+          |
140	                       |      User of    |          |
141	                       |    identifier   +----------+
142	                       +-----------------+    3. Use

144	                       Typical Identifier Processes

146	                                 Figure 1

148	   One key aspect is that the identifier values passed in generation,
149	   distribution, and use may all be in different forms.  For example,
150	   an identifier might be exchanged in printed form at generation time,
151	   distributed via voice, and then used electronically.  As such, the
152	   match process can be complicated.

154	   Furthermore, in many uses, the relationship between holder,
155	   repositories, and users may be more involved.  For example, when a
156	   hierarchy of caches exists (as with web pages, for example), each cache
157	   is itself a repository of a sort, and the match process should be the
158	   same as on the authoritative web server.

160	2.  Security Uses

162	   Identifiers such as hostnames, URIs/IRIs, and email addresses are
163	   used in security contexts to identify principals and resources as
164	   well as other security parameters such as types and values of claims.
165	   Those identifiers are then used to make security decisions based on
166	   an identifier supplied via some protocol.  For example:
167	   o  Authentication: a protocol might match a security principal
168	      identifier to look up expected keying material, and then match
169	      keying material.
170	   o  Authorization: a protocol might match a resource name to look up
171	      an access control list (ACL), and then look up the security
172	      principal identifier in that ACL.

174	   If the parties involved in a security decision use different matching
175	   algorithms for the same identifiers, then failure scenarios ranging
176	   from denial of service to elevation of privilege can result, as we
177	   will see.

179	   This is especially complicated in cases involving multiple parties
180	   and multiple protocols.  For example, there are many scenarios where
181	   some form of "security token service" is used to grant to a requester
182	   permission to access a resource, where the resource is held by a
183	   third party that relies on the security token service.  The protocol
184	   used to request permission (e.g., Kerberos or OAuth) may be different
185	   from the protocol used to access the resource (e.g., HTTP).
186	   Opportunities for security problems arise when two protocols define
187	   different comparison algorithms for the same type of identifier, or
188	   when a protocol is ambiguously specified and two endpoints (e.g., a
189	   security token service and a resource holder) implement different
190	   algorithms within the same protocol.

192	        +----------+
193	        | security |
194	        |  token   |
195	        | service  |
196	        +----------+
197	             ^
198	             | 1. supply credentials and
199	             | get token for resource
200	             |                                             +--------+
201	        +----------+  2. supply token and access resource  |resource|
202	        |requester |=------------------------------------->| holder |
203	        +----------+                                       +--------+

205	                         Simple Security Exchange

207	                                 Figure 2

209	   In many cases the situation is more complex.  With certificates, the
210	   name in a certificate gets compared to ACLs or other things.  In the
211	   case of web site security, the name in the certificate gets compared
212	   to a portion of the URI that a user may have typed into a browser.
213	   The fact that many different people are doing the typing, on many
214	   different types of systems, complicates the problem since it is not
215	   just an administrator that is typing an ACL entry.

217	   Add to this the certificate enrollment step and the certificate
218	   issuance step, and two more parties have an opportunity to adjust the
219	   encoding or, worse, the software that supports them might make
220	   changes of which the parties are unaware.

222	2.1.  Types of Identifiers

224	   In this document we will refer to the following types of identifiers:

226	   o  Absolute: identifiers that can be compared byte-by-byte for
227	      equality.  Two identifiers that have different bytes are defined
228	      to be different.  For example, binary IP addresses are in this
229	      class.
230	   o  Definite: identifiers that have a well-defined comparison
231	      algorithm on which all parties agree.  For example, URI scheme
232	      names are defined to be a case-insensitive match, where the set of
233	      permitted characters results in an unambiguous definition of case-
234	      insensitive match since non-ASCII characters are not permitted.
235	   o  Indefinite: identifiers that have no single comparison algorithm
236	      on which all parties agree.  For example, human names are in this
237	      class.  Everyone might want the comparison to be tailored for
238	      their locale, for some definition of locale.  In some cases, there
239	      may be limited subsets of parties that might be able to agree
240	      (e.g., US-ASCII users might all agree on a comparison algorithm
241	      whereas US-ASCII and Turkish users may not), but identifiers often
242	      tend to leak out of such limited environments.
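
   The Turkish case mentioned above can be made concrete.  The following
   is a minimal sketch, not taken from any specification; the function
   names are hypothetical, and it assumes Python's default (locale-
   independent) Unicode case mappings.  It shows two plausible "case-
   insensitive" comparisons that agree on US-ASCII input but disagree
   once U+0131 (LATIN SMALL LETTER DOTLESS I) appears.

```python
def equal_by_lowercasing(a: str, b: str) -> bool:
    """Compare after mapping both strings to lower case."""
    return a.lower() == b.lower()


def equal_by_uppercasing(a: str, b: str) -> bool:
    """Compare after mapping both strings to upper case."""
    return a.upper() == b.upper()


ascii_name = "title"
dotless_name = "t\u0131tle"  # second letter is U+0131, dotless i

# On pure US-ASCII input the two algorithms agree.
print(equal_by_lowercasing("Title", ascii_name),
      equal_by_uppercasing("Title", ascii_name))

# With U+0131 they disagree: upper-casing maps both "i" and U+0131 to "I",
# while lower-casing leaves U+0131 distinct from "i".
print(equal_by_lowercasing(dotless_name, ascii_name),
      equal_by_uppercasing(dotless_name, ascii_name))
```

   Two parties that each picked one of these functions would reach
   different answers for the same pair of identifiers, which is the
   defining property of an Indefinite identifier.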

244	2.2.  False Positives and Negatives

246	   Perhaps the most common algorithm for comparison involves
247	   "canonicalization", or converting each identifier to a canonical
248	   form, and then testing the canonical representations for bitwise
249	   equality.  In so doing, it is thus critical that all entities
250	   involved agree on the same canonical form and use the same
251	   canonicalization algorithm so that the overall comparison process is
252	   also the same.

254	   It is first worth discussing in more detail the effects of errors in
255	   the comparison algorithm.  A "false positive" results when two
256	   identifiers compare as if they were equal, but in reality refer to
257	   two different things (e.g., security principals or resources).  When
258	   privilege is granted on a match, a false positive thus results in an
259	   elevation of privilege, for example allowing execution of an
260	   operation that should not have been permitted.  When privilege is
261	   denied on a match (e.g., matching an entry in a block/deny list or a
262	   revocation list), a permissible operation is denied.  At best, this
263	   can cause worse performance (e.g., a cache miss, or forcing redundant
264	   authentication), and at worst can result in a denial of service.

266	   A "false negative" results when two identifiers that in reality refer
267	   to the same thing compare as if they were different, and the effects
268	   are the reverse of those for false positives.  That is, when
269	   privilege is granted on a match, the result is at best worse
270	   performance and at worst a denial of service; when privilege is
271	   denied on a match, elevation of privilege results.
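
   The four combinations described in the two paragraphs above can be
   summarized in a few lines.  This is only an illustrative sketch; the
   enumeration and function names are hypothetical and exist purely to
   restate the reasoning in executable form.

```python
from enum import Enum


class ComparisonError(Enum):
    FALSE_POSITIVE = "different things compare as equal"
    FALSE_NEGATIVE = "the same thing compares as different"


class Policy(Enum):
    GRANT_ON_MATCH = "privilege is granted on a match"
    DENY_ON_MATCH = "privilege is denied on a match"


def consequence(error: ComparisonError, policy: Policy) -> str:
    """Return the security consequence of a comparison error under a policy."""
    # A spurious match under "grant on match", or a missed match under
    # "deny on match", lets through something that should have been stopped.
    wrongly_allowed = (
        (error is ComparisonError.FALSE_POSITIVE)
        == (policy is Policy.GRANT_ON_MATCH)
    )
    if wrongly_allowed:
        return "elevation of privilege"
    return "denial of service (or, at best, worse performance)"


for error in ComparisonError:
    for policy in Policy:
        print(f"{error.name:15} {policy.name:15} -> {consequence(error, policy)}")
```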

273	   Elevation of privilege is almost always seen as far worse than denial
274	   of service.  Hence, for URIs for example, Section 6.1 of [RFC3986]
275	   states: "comparison methods are designed to minimize false negatives
276	   while strictly avoiding false positives".

278	   Thus URIs were defined with a "grant privilege on match" paradigm in
279	   mind, where it is critical to prevent elevation of privilege while
280	   minimizing denial of service.  Using URIs in a "deny privilege on
281	   match" system can thus be problematic.

283	2.3.  Hypothetical Example

285	   In this example, both security principals and resources are
286	   identified using URIs.  Foo Corp has paid example.com for access to
287	   the stuff service.  Foo Corp allows its employees to create accounts
288	   on the stuff service.  Alice gets the account
289	   "http://example.com/stuff/FooCorp/alice" and Bob gets
290	   "http://example.com/stuff/FooCorp/bob".  It turns out, however, that
291	   Foo Corp's URI canonicalizer includes URI fragment components in
292	   comparisons whereas example.com's does not, and Foo Corp does not
293	   disallow the # character in the account name.  So Chuck, who is a
294	   malicious employee of Foo Corp, asks to create an account at
295	   example.com with the name alice#stuff.  Foo Corp's URI logic checks
296	   its records for accounts it has created with stuff and sees that
297	   there is no account with the name alice#stuff.  Hence, in its
298	   records, it associates the account alice#stuff with Chuck and will
299	   only issue tokens good for use with
300	   "http://example.com/stuff/FooCorp/alice#stuff" to Chuck.

302	   Chuck, the attacker, goes to a security token service at Foo Corp and
303	   asks for a security token good for
304	   "http://example.com/stuff/FooCorp/alice#stuff".  Foo Corp issues the
305	   token since Chuck is the legitimate owner (in Foo Corp's view) of the
306	   alice#stuff account.  Chuck then submits the security token in a
307	   request to "http://example.com/stuff/FooCorp/alice".

309	   But example.com uses a URI canonicalizer that, for the purposes of
310	   checking equality, ignores fragments.  So when example.com looks in
311	   the security token to see if the requester has permission from Foo
312	   Corp to access the given account it successfully matches the URI in
313	   the security token, "http://example.com/stuff/FooCorp/alice#stuff",
314	   with the requested resource name
315	   "http://example.com/stuff/FooCorp/alice".

317	   Leveraging the inconsistencies in the canonicalizers used by Foo Corp
318	   and example.com, Chuck is able to successfully launch an elevation of
319	   privilege attack and access Alice's resource.
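
   The mismatch in this example can be sketched as follows.  The two
   comparison functions are hypothetical stand-ins for the canonicalizers
   of Foo Corp and example.com; neither is taken from a real product.
   The sketch assumes that fragment handling is the only difference
   between them and uses urldefrag() from Python's standard library to
   strip the fragment.

```python
from urllib.parse import urldefrag


def foo_corp_equal(a: str, b: str) -> bool:
    """Hypothetical Foo Corp comparison: the fragment is significant."""
    return a == b


def example_com_equal(a: str, b: str) -> bool:
    """Hypothetical example.com comparison: the fragment is ignored."""
    return urldefrag(a).url == urldefrag(b).url


ALICE = "http://example.com/stuff/FooCorp/alice"
CHUCK = "http://example.com/stuff/FooCorp/alice#stuff"

# Foo Corp sees two distinct accounts, so it registers Chuck's account
# and later issues him a token for it.
print("Foo Corp treats them as equal:   ", foo_corp_equal(ALICE, CHUCK))

# example.com sees the same account, so Chuck's token opens Alice's resource.
print("example.com treats them as equal:", example_com_equal(ALICE, CHUCK))
```

   Either comparison could be defensible on its own; the vulnerability
   comes from the two parties not using the same one.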

321	3.  Common Identifiers

323	   In this section, we walk through a number of common types of
324	   identifiers and discuss various issues related to comparison that may
325	   affect security whenever they are used to identify security
326	   principals or resources.  These examples illustrate common patterns
327	   that may arise with other types of identifiers.

329	3.1.  Hostnames

331	   Hostnames are commonly used either directly as identifiers, or as
332	   components in identifiers such as in URIs and email addresses.
333	   Another example is in [RFC5280], sections 7.2 and 7.3 (and updated in
334	   section 3 of [I-D.ietf-pkix-rfc5280-clarifications]), which specify
335	   use in certificates.

337	   In this section we discuss a number of issues in comparing strings
338	   that appear to be some form of hostname.

340	   Section 3 of [RFC6055] discusses "hostname" vs "DNS name". [[anchor6:
341	   TODO: add some discussion here of security impact of names simply
342	   being invalid vs valid-but-different.  Failing security checks for
343	   invalid names means that treating a name as invalid can cause a false
344	   negative but not a false positive.]]

346	3.1.1.  IPv4 Literals

348	   [RFC0952] defined an entry in the "Internet host table" as follows:

350	      A "name" (Net, Host, Gateway, or Domain name) is a text string up
351	      to 24 characters drawn from the alphabet (A-Z), digits (0-9),
352	      minus sign (-), and period (.).  Note that periods are only
353	      allowed when they serve to delimit components of "domain style
354	      names". [...]  No blank or space characters are permitted as part
355	      of a name.  No distinction is made between upper and lower case.
356	      The first character must be an alpha character.  The last
357	      character must not be a minus sign or period. [...]  Single
358	      character names or nicknames are not allowed.

360	   [RFC1123] section 2.1 then updates the definition with:

362	      The syntax of a legal Internet host name was specified in RFC-952
363	      [DNS:4].  One aspect of host name syntax is hereby changed: the
364	      restriction on the first character is relaxed to allow either a
365	      letter or a digit.  Host software MUST support this more liberal
366	      syntax.

368	   and

370	      Whenever a user inputs the identity of an Internet host, it SHOULD
371	      be possible to enter either (1) a host domain name or (2) an IP
372	      address in dotted-decimal ("#.#.#.#") form.  The host SHOULD check
373	      the string syntactically for a dotted-decimal number before
374	      looking it up in the Domain Name System.

376	   and

378	      This last requirement is not intended to specify the complete
379	      syntactic form for entering a dotted-decimal host number; that is
380	      considered to be a user-interface issue.

382	   In specifying the inet_addr() API, the POSIX standard [IEEE-1003.1]
383	   defines "IPv4 dotted decimal notation" as allowing not only strings
384	   of the form "10.0.1.2", but also octal and hexadecimal forms, and
385	   addresses with fewer than four parts.  For example, "10.0.258",
386	   "0xA000102", and "012.0x102" all represent the same IPv4 address in
387	   standard "IPv4 dotted decimal" notation.  We will refer to this as
388	   the "loose" syntax of an IPv4 address literal.
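
   To illustrate the loose syntax, the following sketch reimplements the
   commonly documented inet_addr() parsing rules in pure Python rather
   than calling the platform library, so that the result does not depend
   on the local C library.  The function names are hypothetical, and the
   handling of corner cases (such as a bare "0x") is an assumption.

```python
def _parse_c_number(text: str) -> int:
    """Parse one part as C does: "0x" prefix is hex, leading "0" is octal."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        digits, base = lowered[2:], 16
    elif len(lowered) > 1 and lowered.startswith("0"):
        digits, base = lowered[1:], 8
    else:
        digits, base = lowered, 10
    allowed = "0123456789abcdef"[:base]
    if not digits or any(ch not in allowed for ch in digits):
        raise ValueError(f"bad numeric part: {text!r}")
    return int(digits, base)


def parse_loose_ipv4(text: str) -> int:
    """Return the 32-bit address for a loose-form IPv4 literal."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected one to four parts: {text!r}")
    *leading, last = [_parse_c_number(part) for part in parts]
    # Every part except the last is one byte; the last part fills the
    # remaining low-order bytes of the address.
    if any(value > 0xFF for value in leading):
        raise ValueError(f"leading part out of range: {text!r}")
    remaining_bits = 8 * (4 - len(leading))
    if last >= 1 << remaining_bits:
        raise ValueError(f"last part out of range: {text!r}")
    address = 0
    for value in leading:
        address = (address << 8) | value
    return (address << remaining_bits) | last


def to_dotted_quad(address: int) -> str:
    """Format a 32-bit address as four decimal octets."""
    return ".".join(str((address >> shift) & 0xFF) for shift in (24, 16, 8, 0))


for literal in ("10.0.1.2", "10.0.258", "0xA000102", "012.0x102"):
    print(f"{literal:12} -> {to_dotted_quad(parse_loose_ipv4(literal))}")
```

   All four literals in the loop map to the same 32-bit value, even
   though no two of them are equal as strings.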

390	   In section 6.1 of [RFC3493] getaddrinfo() is defined to support the
391	   same (loose) syntax as inet_addr():

393	      If the specified address family is AF_INET or AF_UNSPEC, address
394	      strings using Internet standard dot notation as specified in
395	      inet_addr() are valid.

397	   In contrast, section 6.3 of the same RFC states, specifying
398	   inet_pton():

400	      If the af argument of inet_pton() is AF_INET, the src string shall
401	      be in the standard IPv4 dotted-decimal form: ddd.ddd.ddd.ddd where
402	      "ddd" is a one to three digit decimal number between 0 and 255.
403	      The inet_pton() function does not accept other formats (such as
404	      the octal numbers, hexadecimal numbers, and fewer than four
405	      numbers that inet_addr() accepts).

407	   As shown above, inet_pton() uses what we will refer to as the
408	   "strict" form of an IPv4 address literal.  Some platforms also use
409	   the strict form with getaddrinfo() when the AI_NUMERICHOST flag is
410	   passed to it.
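
   A strict-form check can be sketched with a regular expression.  This
   sketch assumes the "dec-octet" grammar of [RFC3986] section 3.2.2,
   which also rejects leading zeros; the quoted inet_pton() text does not
   spell out that detail, so treat it as an assumption.  The function
   name is hypothetical.

```python
import re

# One decimal octet, 0 through 255, with no leading zeros.
_DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
_STRICT_IPV4 = re.compile(rf"{_DEC_OCTET}(?:\.{_DEC_OCTET}){{3}}")


def is_strict_ipv4(text: str) -> bool:
    """Return True only for exactly four dot-separated decimal octets."""
    return _STRICT_IPV4.fullmatch(text) is not None


for literal in ("10.0.1.2", "10.0.258", "0xA000102", "012.0.1.2", "256.0.0.1"):
    print(f"{literal:12} strict={is_strict_ipv4(literal)}")
```

   A string that fails this check is not necessarily a hostname; the
   caller still has to decide whether to reject it or pass it on to name
   resolution.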

412	   Both the strict and loose forms are standard forms, and hence a
413	   protocol specification is still ambiguous if it simply defines a
414	   string to be in the "standard IPv4 dotted decimal form".  And, as a
415	   result of these differences, names like "10.11.12" are ambiguous as
416	   to whether they are an IP address or a hostname, and even
417	   "10.11.12.13" can be ambiguous because of the "SHOULD" in RFC 1123
418	   above making it optional whether to treat it as an address or a name.

420	   Protocols and data formats that can use addresses in string form for
421	   security purposes need to resolve these ambiguities.  For example,
422	   for the host component of URIs, section 3.2.2 of [RFC3986] resolves
423	   the first ambiguity by only allowing the strict form, and the second
424	   ambiguity by specifying that it is considered an IPv4 address
425	   literal.  We recommend that new protocols and data formats similarly
426	   use the strict form rather than the loose form.

428	   Thus, whereas (binary) IPv4 addresses are Absolute identifiers, IPv4
429	   address literals are at best Definite identifiers, and often turn out
430	   to be Indefinite identifiers.

432	   Furthermore, when strings can contain non-ASCII characters, they can
433	   contain other characters that may look like dots or digits to a human
434	   viewing and/or entering the identifier, especially to one who might
435	   expect digits to appear in his or her native script.

437	3.1.2.  IPv6 Literals

439	   IPv6 addresses similarly have a wide variety of alternate but
440	   semantically identical string representations, as defined in section
441	   2.2 of [RFC4291].  As discussed in section 3.2.5 of [RFC5952], this
442	   fact causes problems in security contexts if comparison (such as in
443	   X.509 certificates) is done between strings rather than between the
444	   binary representations of addresses.

446	   [RFC5952] recently specified a recommended canonical string format as
447	   an attempt to solve this problem, but it may not be ubiquitously
448	   supported at present.  And, when strings can contain non-ASCII
449	   characters, the same issues (and more, since hexadecimal and colons
450	   are allowed) arise as with IPv4 literals.

452	   Whereas (binary) IPv6 addresses are Absolute identifiers, IPv6
453	   address literals are Definite identifiers, since string-to-address
454	   conversion for IPv6 address literals is unambiguous.
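
   The difference between string comparison and binary comparison can be
   sketched with Python's standard ipaddress module.  The function name
   is hypothetical, and 2001:db8::/32 is the IPv6 documentation prefix.
   No claim is made here that the module's text output conforms to
   [RFC5952] in every case.

```python
import ipaddress


def same_ipv6_address(a: str, b: str) -> bool:
    """Compare two IPv6 literals by their 16-byte binary representation."""
    return ipaddress.IPv6Address(a).packed == ipaddress.IPv6Address(b).packed


compressed = "2001:db8::1"
expanded = "2001:0DB8:0:0:0:0:0:1"

print("equal as strings:", compressed == expanded)
print("equal as binary: ", same_ipv6_address(compressed, expanded))

# Converting back to text yields one representative string per address.
print("re-serialized:   ", str(ipaddress.IPv6Address(expanded)))
```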

456	3.1.3.  Internationalization

458	   The IETF policy on character sets and languages [RFC2277] requires
459	   support for UTF-8 in protocols, and as a result many protocols now do
460	   support non-ASCII characters.  When a hostname is sent in a UTF-8
461	   field, there are a number of ways it may be encoded.  For example,
462	   labels might be encoded directly in UTF-8, or might first be
463	   Punycode-encoded or percent-encoded and then encoded in UTF-8.

465	   For example, in URIs, [RFC3986] section 3.2.2 specifically allows for
466	   the use of percent-encoded UTF-8 characters in the hostname, as well
467	   as the use of IDNA encoding using the Punycode algorithm.

469	   Percent-encoding is unambiguous for hostnames since the percent
470	   character cannot appear in the strict definition of a "hostname",
471	   though it can appear in a DNS name.

473	   Punycode-encoded labels (or "A-labels") on the other hand can be
474	   ambiguous if hosts are actually allowed to be named with a name
475	   starting with "xn--", and false positives can result.  While this may
476	   be extremely unlikely for normal scenarios, it nevertheless provides
477	   a possible vector for an attacker.

479	   A hostname comparator used with non-ASCII strings thus needs to
480	   decide whether a Punycode-encoded string should or should not be
481	   considered a valid hostname label, and if so, then whether it should
482	   match the equivalent Unicode string ("U-label").
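
   One possible answer to that decision is sketched below: convert both
   inputs to A-label form and compare the results, so that an A-label
   matches its equivalent U-label.  The function names are hypothetical.
   The sketch relies on Python's built-in "idna" codec, which implements
   the older IDNA2003 rules, so its behavior should not be read as
   IDNA2008 conformance.

```python
def to_a_label_form(hostname: str) -> str:
    """Convert a hostname to lower-case ASCII (A-label) form."""
    return hostname.encode("idna").decode("ascii").lower()


def hostnames_match(a: str, b: str) -> bool:
    """Treat an A-label and its equivalent U-label as the same name."""
    return to_a_label_form(a) == to_a_label_form(b)


u_label_name = "b\u00fccher.example"    # first label contains U+00FC
a_label_name = "xn--bcher-kva.example"

print(to_a_label_form(u_label_name))
print("equal as strings:     ", u_label_name == a_label_name)
print("equal after conversion:", hostnames_match(u_label_name, a_label_name))
```

   A comparator that instead treated "xn--" labels as opaque ASCII would
   report these two names as different, which is the inconsistency the
   text warns about.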

484	   For example, Section 3.1 of "Transport Layer Security (TLS)
485	   Extensions" [RFC3546], states:

487	      If the hostname labels contain only US-ASCII characters, then the
488	      client MUST ensure that labels are separated only by the byte
489	      0x2E, representing the dot character U+002E (requirement 1 in
490	      section 3.1 of [IDNA] notwithstanding).  If the server needs to
491	      match the HostName against names that contain non-US-ASCII
492	      characters, it MUST perform the conversion operation described in
493	      section 4 of [IDNA], treating the HostName as a "query string"
494	      (i.e. the AllowUnassigned flag MUST be set).  Note that IDNA
495	      allows labels to be separated by any of the Unicode characters
496	      U+002E, U+3002, U+FF0E, and U+FF61, therefore servers MUST accept
497	      any of these characters as a label separator.  If the server only
498	      needs to match the HostName against names containing exclusively
499	      ASCII characters, it MUST compare ASCII names case-insensitively.

501	   [[anchor9: TODO: add observations based on the above text.  The text
502	   in RFC 3546 is now obsolete since IDNA2008 is much more restrictive
503	   about the use of dot-oids in IDNs.  In addition, conversion between
504	   A-labels and Unicode strings that claim to be labels (but not vice
505	   versa) turns slightly ambiguous if mapping is permitted and pre-
506	   mapping strings may appear.]]

508	   For some additional discussion of security issues that arise with
509	   internationalization, see [TR36].

511	3.1.4.  Resolution for comparison

513	   Some systems (specifically Java) used to follow the rule that if two
514	   hostnames resolved to the same IP address then the hostnames were
515	   considered equal.  That is, the canonicalization algorithm involved
516	   name resolution with an IP address being the canonical form.
517	   However, with the introduction of dynamic IP addresses, private IP
518	   addresses, multiple IP addresses per name, etc., this method of
519	   comparison cannot be relied upon.  There is no guarantee that two
520	   endpoints will resolve the name to the same IP addresses, nor that
521	   the addresses resolved refer to the same entity.

523	   In addition, a comparison mechanism that relies on the ability to
524	   resolve identifiers such as hostnames to other identifiers such as IP
525	   addresses leaks information about security decisions to outsiders if
526	   these queries are publicly observable.

528	3.2.  Ports and Service Names

530	   Port numbers and service names are discussed in depth in
531	   [I-D.ietf-tsvwg-iana-ports].  Historically, there were port numbers,
532	   service names used in SRV records, and mnemonic identifiers for
533	   assigned port numbers (known as port "keywords" at [IANA-PORT]).  The
534	   latter two are now unified, and various protocols use one or more of
535	   these types in strings.  For example, the common syntax used by many
536	   URI schemes allows port numbers but not service names.  Some
537	   implementations of the getaddrinfo() API support strings that can be
538	   either port numbers or port keywords (but not service names).

540	   For protocols that use service names that must be resolved, the
541	   issues are the same as those for resolution of addresses in
542	   Section 3.1.4.  In addition, Section 5.1 of
543	   [I-D.ietf-tsvwg-iana-ports] clarifies that service names/port
544	   keywords must contain at least one letter.  This prevents confusion
545	   with port numbers in strings where both are allowed.
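
	   As an illustration only, the following Python sketch shows how the
	   "at least one letter" rule lets a parser tell the two forms apart
	   in a string where both are allowed.  The function name and return
	   labels are hypothetical, and the sketch does not check the rest of
	   the service name syntax.

```python
def classify_port_string(value: str) -> str:
    """Classify a string as a port number or a service name/keyword.

    Illustrative only: the labels are hypothetical, and only the
    "at least one letter" rule is checked for service names.
    """
    # A string made only of ASCII digits can only be a port number.
    if value.isascii() and value.isdigit():
        return "port-number" if int(value) <= 65535 else "invalid"
    # Anything containing an ASCII letter cannot be a port number.
    if any(ch.isascii() and ch.isalpha() for ch in value):
        return "service-name"
    return "invalid"


print(classify_port_string("80"))    # numeric form
print(classify_port_string("http"))  # name form
```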

547	3.3.  URIs and IRIs

549	   This section looks at issues related to using URIs for security
550	   purposes.  For example, [RFC5280], section 7.4, specifies comparison
551	   of URIs in certificates.  Examples of URIs in security token-based
552	   access control systems include WS-*, SAML-P and OAuth WRAP.  In such
553	   systems, a variety of participants in the security infrastructure are
554	   identified by URIs.  For example, requesters of security tokens are
555	   sometimes identified with URIs.  The issuers of security tokens and
556	   the relying parties who are intended to consume security tokens are
557	   frequently identified by URIs.  Claims in security tokens often have
558	   their types defined using URIs and the values of the claims can also
559	   be URIs.

561	   Also, when a URI is embedded in plain text (e.g., an email message),
562	   there is an additional concern because there is no termination
563	   criterion for a URL.  For example, consider
564	   http://unicode.org/cldr/utility/list-unicodeset.jsp?a=a&g=gc.  Some
565	   email clients will stop before the ";" while others go to the ".".  As
566	   another point of comparison, Section 2.37 of [EE] (a standard for
567	   history citations) specifies the use of a space after a URI and
568	   before the punctuation.

570	   URIs are defined with multiple components, each of which has its
571	   own rules.  We cover each in turn.

573	3.3.1.  Scheme component

575	   [RFC3986] defines URI schemes as being case-insensitive and in
576	   section 6.2.2.1 specifies that scheme names should be normalized to
577	   lower-case characters.

579	   New schemes can be defined over time.  In general, however, two URIs
580	   with an unrecognized scheme cannot be safely compared.  This is
581	   because the canonicalization and comparison rules for the other
582	   components may vary by scheme.  For example, a new URI scheme might
583	   have a default port of X, and without that knowledge, a comparison
584	   algorithm cannot know whether "example.com" and "example.com:X"
585	   should be considered to match in the authority component.  Hence for
586	   security purposes, it is safest for unrecognized schemes to be
587	   treated as invalid identifiers.  However, if the URIs are only used
588	   with a "grant access on match" paradigm then unrecognized schemes can
589	   be supported by doing a generic case-sensitive comparison, at the
590	   expense of some false negatives.
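
	   A minimal Python sketch of this "grant access on match" fallback
	   follows.  The DEFAULT_PORTS table and the function name are
	   assumptions made for illustration; a real implementation would
	   need scheme-specific rules for every component, not just the port.

```python
from urllib.parse import urlsplit

# Hypothetical table of schemes this comparer recognizes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def grant_on_match(a: str, b: str) -> bool:
    """Return True only when two URIs are known to match."""
    sa, sb = urlsplit(a), urlsplit(b)
    # Scheme names are case-insensitive; urlsplit() lower-cases them.
    if sa.scheme != sb.scheme:
        return False
    if sa.scheme not in DEFAULT_PORTS:
        # Unrecognized scheme: generic case-sensitive comparison.
        # This may produce false negatives but no false positives.
        return a == b
    default = DEFAULT_PORTS[sa.scheme]
    port_a = sa.port if sa.port is not None else default
    port_b = sb.port if sb.port is not None else default
    # Userinfo is treated as significant here (a conservative choice).
    key_a = (sa.username, sa.password, sa.hostname, port_a,
             sa.path, sa.query, sa.fragment)
    key_b = (sb.username, sb.password, sb.hostname, port_b,
             sb.path, sb.query, sb.fragment)
    return key_a == key_b
```

	   With a recognized scheme, "http://example.com/" and
	   "HTTP://example.com:80/" match.  With an unrecognized scheme,
	   "foo://example.com/" and "foo://example.com:99/" do not, even if 99
	   happens to be that scheme's default port.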

592	3.3.2.  Authority component

594	   The authority component is scheme-specific, but many schemes follow a
595	   common syntax that allows for userinfo, host, and port.

597	3.3.2.1.  Host

599	   Section 3.1 discussed issues with hostnames in general.  In addition,
600	   [RFC3986] section 3.2.2 allows future changes using the IPvFuture
601	   production.  As with IPv4 and IPv6 literals, IPvFuture formats may
602	   have issues with multiple semantically identical string
603	   representations, and may also be semantically identical to an IPv4 or
604	   IPv6 address.  As such, false negatives may be common if IPvFuture is
605	   used.

607	3.3.2.2.  Port

609	   See discussion in Section 3.2.

611	3.3.2.3.  Userinfo

613	   [RFC3986] defines the userinfo production that allows arbitrary data
614	   about the user of the URI to be placed before '@' signs in URIs (see
615	   also Section 3.4).  For example:
616	   "http://alice:bob:chuck@example.com/bar" has the value "alice:bob:
617	   chuck" as its userinfo.  When comparing URIs in a security context,
618	   one must decide whether to treat the userinfo as being significant or
619	   not.  Some URI comparison services, for example, treat
620	   "http://alice:ick@example.com" and "http://example.com" as being
621	   equal.
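
	   The choice can be made explicit in code.  The Python sketch below
	   is illustrative only: the function names and the
	   userinfo_significant flag are hypothetical, and the final
	   comparison is a plain string match with no other normalization.

```python
from urllib.parse import urlsplit, urlunsplit


def split_userinfo(uri: str):
    """Return (userinfo or None, URI with the userinfo removed)."""
    parts = urlsplit(uri)
    # Userinfo is everything before the last '@' in the authority.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    stripped = urlunsplit(
        (parts.scheme, hostport, parts.path, parts.query, parts.fragment))
    return (userinfo if at else None), stripped


def uris_equal(a: str, b: str, userinfo_significant: bool) -> bool:
    """Compare two URIs under an explicit userinfo policy."""
    info_a, rest_a = split_userinfo(a)
    info_b, rest_b = split_userinfo(b)
    if userinfo_significant and info_a != info_b:
        return False
    return rest_a == rest_b


print(split_userinfo("http://alice:bob:chuck@example.com/bar")[0])
```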

623	3.3.3.  Path component

625	   [RFC3986] supports the use of path segment values such as "./" or
626	   "../" for relative URLs.  Strictly speaking, including such path
627	   segment values in a fully qualified URI is syntactically illegal but
628	   [RFC3986] section 5.2.4 nevertheless defines an algorithm to remove
629	   them.
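
	   For reference, the dot-segment removal procedure that [RFC3986]
	   calls remove_dot_segments can be sketched in Python as follows.
	   This is an illustrative transcription, not normative text; the
	   variable names are our own.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URI path."""
    output = []          # completed segments, each with its leading "/"
    inp = path           # remaining input buffer
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]            # "/./" becomes "/"
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../"):
            inp = inp[3:]            # "/../" becomes "/"
            if output:
                output.pop()         # drop the previous segment
        elif inp == "/..":
            inp = "/"
            if output:
                output.pop()
        elif inp in (".", ".."):
            inp = ""
        else:
            # Move the first segment (with any leading "/") to the output.
            idx = inp.find("/", 1)
            if idx == -1:
                output.append(inp)
                inp = ""
            else:
                output.append(inp[:idx])
                inp = inp[idx:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))  # prints /a/g
```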

631	   Unless a scheme states otherwise, the path component is defined to be
632	   case-sensitive.  However, if the resource is stored and accessed
633	   through a filesystem with case-insensitive paths, there will be many
634	   paths that refer to the same resource.  As such, false negatives can
635	   be common in this case.

637	3.3.4.  Query component

639	   There is the question as to whether "http://example.com/foo",
640	   "http://example.com/foo?", and "http://example.com/foo?bar" are each
641	   considered equal or different.

643	   Similarly, it is unspecified whether the order of values matters.
644	   For example, should "http://example.com/blah?ick=bick&foo=bar" be
645	   considered equal to "http://example.com/blah?foo=bar&ick=bick"?  And
646	   if a domain name is permitted to appear in a query component (e.g.,
647	   in a reference to another URI), the same issues in Section 3.1 apply.
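
	   The two possible answers to the ordering question can be contrasted
	   in a short Python sketch.  The function name and the order_matters
	   flag are hypothetical.  Note that parse_qsl() also percent-decodes
	   names and values, which is itself a canonicalization choice.

```python
from urllib.parse import parse_qsl, urlsplit


def queries_equal(a: str, b: str, order_matters: bool) -> bool:
    """Compare only the query components of two URIs."""
    qa, qb = urlsplit(a).query, urlsplit(b).query
    if order_matters:
        # Exact, case-sensitive match of the raw query strings.
        return qa == qb
    # Order-insensitive: compare the sorted (name, value) pairs.
    pairs_a = sorted(parse_qsl(qa, keep_blank_values=True))
    pairs_b = sorted(parse_qsl(qb, keep_blank_values=True))
    return pairs_a == pairs_b


u1 = "http://example.com/blah?ick=bick&foo=bar"
u2 = "http://example.com/blah?foo=bar&ick=bick"
print(queries_equal(u1, u2, order_matters=True))   # prints False
print(queries_equal(u1, u2, order_matters=False))  # prints True
```

	   This particular parser also reports an empty query for both
	   "http://example.com/foo" and "http://example.com/foo?", so code
	   built on it cannot distinguish the two.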

649	3.3.5.  Fragment component

651	   Some URI formats include fragment identifiers.  These are typically
652	   handles to locations within a resource and are used for local
653	   reference.  A classic example is the use of fragments in HTTP URLs
654	   where a URL of the form "http://example.com/blah.html#ick" means
655	   retrieve the resource "http://example.com/blah.html" and, once it has
656	   arrived locally, find the HTML anchor named ick and display that.

658	   So, for example, when a user clicks on the link
659	   "http://example.com/blah.html#baz" a browser will check its cache by
660	   doing a URI comparison for "http://example.com/blah.html" and, if the
661	   resource is present in the cache, a match is declared.

663	   Hence comparisons for security purposes should typically ignore the
664	   fragment component and treat all fragments as equal to the full
665	   resource.
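
	   As a minimal illustration in Python, the fragment can be dropped
	   before comparing.  The function name is hypothetical, and no other
	   normalization is applied.

```python
from urllib.parse import urldefrag


def equal_ignoring_fragment(a: str, b: str) -> bool:
    """Compare two URIs after removing any fragment component."""
    return urldefrag(a).url == urldefrag(b).url


print(equal_ignoring_fragment("http://example.com/blah.html#baz",
                              "http://example.com/blah.html"))  # prints True
```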

667	3.4.  Email Address-like Identifiers

669	   [[anchor19: TODO: this section needs work and will need to be tracked
670	   as EAI WG opinions about the permissibility of A-labels in the domain
671	   part evolve.]]

673	   Section 4.4 of [I-D.ietf-eai-rfc5335bis] defines the encoding for an
674	   internationalized email address, and [RFC5280], section 7.5,
675	   discusses the use of internationalized email addresses in
676	   certificates.

678	   For use in certificates, [I-D.ietf-eai-rfc5335bis] points to
679	   [I-D.ietf-eai-frmwrk-4952bis] section 9, which contains a discussion
680	   of many issues resulting from internationalization (though no
681	   normative text).

683	   Email address-like identifiers have a local part and a domain part.
684	   The issues with the domain part are essentially the same as with
685	   hostnames, covered earlier.

687	   The local part is left for each domain to define.  People quite
688	   commonly use email addresses as usernames with web sites like banks
689	   or shopping sites, but the site doesn't know whether foo@example.com
690	   is the same person as FOO@example.com.  Thus email-like identifiers
691	   are typically Indefinite identifiers.

693	   To avoid false positives, some security mechanisms (such as
694	   [RFC5280]) compare the local part using an exact match.  Hence, like
695	   URIs, email address-like identifiers are designed for use in grant-
696	   on-match security schemes, not in deny-on-match schemes.
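
	   A grant-on-match comparison of this kind might look like the
	   following Python sketch.  It assumes, for illustration, that the
	   domain part is ASCII and may be compared case-insensitively; the
	   internationalized hostname issues of Section 3.1 are not handled.
	   The function name is hypothetical.

```python
def email_like_equal(a: str, b: str) -> bool:
    """Exact match on the local part, ASCII case-insensitive domain."""
    local_a, at_a, domain_a = a.rpartition("@")
    local_b, at_b, domain_b = b.rpartition("@")
    if not (at_a and at_b):
        return False          # not email address-like
    if not (domain_a.isascii() and domain_b.isascii()):
        # Out of scope for this sketch: fall back to an exact match.
        return a == b
    # The local part is compared exactly; only its domain defines it.
    return local_a == local_b and domain_a.lower() == domain_b.lower()


print(email_like_equal("foo@example.com", "foo@EXAMPLE.COM"))  # prints True
print(email_like_equal("foo@example.com", "FOO@example.com"))  # prints False
```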

698	4.  General Internationalization Issues

700	   In addition to the issues with hostnames discussed in Section 3.1.3,
701	   there are a number of internationalization issues that apply to many
702	   types of Definite and Indefinite identifiers.

704	   Some strings are visually confusable with others, and hence if a
705	   security decision is made by a user based on visual inspection, many
706	   opportunities for false positives exist.  As such, highly secure
707	   systems should not rely on visual inspection.

709	   Determining whether a string is a valid identifier should typically
710	   be done after, or as part of, canonicalization.  Otherwise an
711	   attacker might use the canonicalization algorithm to inject (e.g., via
712	   percent encoding, NFKC, or non-shortest-form UTF-8) delimiters such
713	   as '@' in an email address-like identifier, or a '.' in a hostname.
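
	   The ordering problem can be demonstrated with NFKC in Python.  In
	   this sketch the validity rule (a local part must not contain '@')
	   and all function names are hypothetical simplifications.  U+FF20
	   FULLWIDTH COMMERCIAL AT normalizes to '@' under NFKC.

```python
import unicodedata


def is_valid_local_part(s: str) -> bool:
    """Hypothetical rule: non-empty and free of the '@' delimiter."""
    return s != "" and "@" not in s


def validate_then_canonicalize(raw: str) -> str:
    """Unsafe order: the check runs on the pre-canonical string."""
    if not is_valid_local_part(raw):
        raise ValueError("invalid local part")
    return unicodedata.normalize("NFKC", raw)


def canonicalize_then_validate(raw: str) -> str:
    """Safer order: the check runs on the canonical form."""
    canonical = unicodedata.normalize("NFKC", raw)
    if not is_valid_local_part(canonical):
        raise ValueError("invalid local part")
    return canonical


attack = "alice\uff20evil.example"
# The unsafe order lets a delimiter through:
print(validate_then_canonicalize(attack))  # prints alice@evil.example
```

	   Calling canonicalize_then_validate() on the same input raises
	   ValueError, because the injected delimiter is visible at the time
	   of the check.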

715	   Any case-insensitive comparisons need to define how comparison is
716	   done, since such comparisons may vary by locale of the endpoint.  As
717	   such, using case-insensitive comparisons in general often results in
718	   identifiers being either Indefinite or, if the legal character set is
719	   restricted (e.g., to ASCII), Definite.
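
	   One way to keep a case-insensitive comparison well defined is to
	   restrict the legal character set before folding case.  The Python
	   sketch below is illustrative; the function name is hypothetical,
	   and rejecting non-ASCII input is a design assumption rather than a
	   requirement of any specification cited here.

```python
def ascii_case_insensitive_equal(a: str, b: str) -> bool:
    """Compare case-insensitively, accepting only ASCII input."""
    if not (a.isascii() and b.isascii()):
        # Outside ASCII, case mappings can differ by locale; for
        # example, a Turkish mapping folds "I" to U+0131 rather
        # than to "i".
        raise ValueError("non-ASCII identifier")
    # For ASCII input, lower-casing gives the same result everywhere.
    return a.lower() == b.lower()


print(ascii_case_insensitive_equal("Example.COM", "example.com"))  # True
```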

721	   See also [WEBER] for a more visual discussion of many of these
722	   issues.

724	5.  Security Considerations

726	   This entire document is about security considerations.

728	   To minimize elevation of privilege issues, for any system that
729	   requires the ability to use both deny and allow operations within the
730	   same identifier space, we recommend the use of Absolute or Definite
731	   identifiers.  Use of Absolute identifiers typically provides the
732	   least chance of issues due to bugs.

734	   Perhaps the hardest issues arise when multiple protocols are used
735	   together, such as in the figure in Section 2, where the two protocols
736	   are defined or implemented using different comparison algorithms.
737	   For security protocols (such as security token services) designed to
738	   be used in conjunction with other protocols that access resources,
739	   either:
740	   a.  the security protocol should specify the identifier comparison
741	       algorithm, and the security protocol should only be used by
742	       protocols that use the same algorithm; or

744	   b.  the security protocol should be capable of supporting multiple
745	       comparison algorithms, and use the one indicated by the using
746	       protocol.

748	   For any new identifiers being designed, we recommend that the
749	   definition specify an Absolute or Definite comparison algorithm, and
750	   if extensibility is allowed (e.g., as new schemes in URIs allow) then
751	   the comparison algorithm should remain invariant so that unrecognized
752	   extensions can be compared.

754	   Some issues (such as unrecognized extensions) can be mitigated by
755	   treating such identifiers as invalid.  Validity checking of
756	   identifiers is further discussed in [RFC3696].

758	6.  Acknowledgements

760	   Yaron Goland contributed to much of the discussion on URIs.  Patrick
761	   Faltstrom contributed to the background on identifiers.  Additional
762	   helpful feedback and suggestions came from Magnus Nystrom, Bernard
763	   Aboba, Mark Davis, John Klensin, and Russ Housley.

765	7.  IANA Considerations

767	   This document requires no actions by the IANA.

769	8.  Informative References

771	   [EE]       Mills, E., "Evidence Explained: Citing History Sources
772	              from Artifacts to Cyberspace", 2007.

774	   [I-D.ietf-eai-frmwrk-4952bis]
775	              Klensin, J. and Y. Ko, "Overview and Framework for
776	              Internationalized Email", draft-ietf-eai-frmwrk-4952bis-10
777	              (work in progress), September 2010.

779	   [I-D.ietf-eai-rfc5335bis]
780	              Yang, A. and S. Steele, "Internationalized Email Headers",
781	              draft-ietf-eai-rfc5335bis-10 (work in progress),
782	              March 2011.

784	   [I-D.ietf-pkix-rfc5280-clarifications]
785	              Cooper, D., "Clarifications to the Internet X.509 Public
786	              Key Infrastructure Certificate and Certificate Revocation
787	              List (CRL) Profile",
788	              draft-ietf-pkix-rfc5280-clarifications-02 (work in
789	              progress), March 2011.

791	   [I-D.ietf-tsvwg-iana-ports]
792	              Cotton, M., Eggert, L., Touch, J., Westerlund, M., and S.
793	              Cheshire, "Internet Assigned Numbers Authority (IANA)
794	              Procedures for the Management of the Service Name and
795	              Transport Protocol Port Number Registry",
796	              draft-ietf-tsvwg-iana-ports-10 (work in progress),
797	              February 2011.

799	   [IANA-PORT]
800	              IANA, "PORT NUMBERS", June 2011,
801	              <http://www.iana.org/assignments/port-numbers>.

803	   [IEEE-1003.1]
804	              IEEE and The Open Group, "The Open Group Base
805	              Specifications, Issue 6 IEEE Std 1003.1, 2004 Edition",
806	              IEEE Std 1003.1, 2004.

808	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
809	              host table specification", RFC 952, October 1985.

811	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
812	              and Support", STD 3, RFC 1123, October 1989.

814	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
815	              Languages", BCP 18, RFC 2277, January 1998.

817	   [RFC3493]  Gilligan, R., Thomson, S., Bound, J., McCann, J., and W.
818	              Stevens, "Basic Socket Interface Extensions for IPv6",
819	              RFC 3493, February 2003.

821	   [RFC3546]  Blake-Wilson, S., Nystrom, M., Hopwood, D., Mikkelsen, J.,
822	              and T. Wright, "Transport Layer Security (TLS)
823	              Extensions", RFC 3546, June 2003.

825	   [RFC3696]  Klensin, J., "Application Techniques for Checking and
826	              Transformation of Names", RFC 3696, February 2004.

828	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
829	              Resource Identifier (URI): Generic Syntax", STD 66,
830	              RFC 3986, January 2005.

832	   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
833	              Architecture", RFC 4291, February 2006.

835	   [RFC5280]  Cooper, D., Santesson, S., Farrell, S., Boeyen, S.,
836	              Housley, R., and W. Polk, "Internet X.509 Public Key
837	              Infrastructure Certificate and Certificate Revocation List
838	              (CRL) Profile", RFC 5280, May 2008.

840	   [RFC5952]  Kawamura, S. and M. Kawashima, "A Recommendation for IPv6
841	              Address Text Representation", RFC 5952, August 2010.

843	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
844	              Encodings for Internationalized Domain Names", RFC 6055,
845	              February 2011.

847	   [TR36]     Unicode Consortium, "Unicode Security Considerations",
848	              Unicode Technical Report 36, August 2004.

850	   [WEBER]    Weber, C., "Attacking Software Globalization", March 2010,
851	              <http://www.casabasecurity.com/files/
852	              Chris_Weber_Character%20Transformations%20v1.7_IUC33.pdf>.

854	Author's Address

856	   Dave Thaler (editor)
857	   One Microsoft Way
858	   Redmond, WA  98052
859	   USA

861	   Phone: +1 425 703 8835
862	   Email: dthaler@microsoft.com