Locale-free UniCode Identifiers (LUCID) BOF
IETF-92, Dallas, TX, USA, March 2015
Co-chairs: Marc Blanchet, Dave Thaler

Andrew Sullivan presented draft-sullivan-lucid-prob-stmt, an overview of the problem statement and scope. This is joint work with Asmus Freytag of the Unicode Technical Committee of the Unicode Consortium.

[Slide 12]
Homoglyphs are glyphs that represent different abstract characters but look visually identical (or virtually identical). Sometimes creative fonts can help distinguish homoglyphs (the way “programmer” fonts add a slash or a dot to the zero character), but in general this can’t be safely assumed for all renditions of the glyph.

Ladar Levison: You have talked about different code points that produce characters that look the same. Are there cases of characters in the same language that look the same?
Andrew Sullivan: Characters don’t have an attached language, and the same characters are used in different languages.
Ladar Levison: Are there any situations where the different homoglyph code points have different meanings in the local language? Can’t we just map all characters that look alike to the same code point?
Andrew Sullivan: That’s an important question with an important mistake in it. The question assumes you always have a reliable binding between the writing system, the characters, and the language. You don’t. What is the language of “3com.com”? Latin characters are used for lots of languages, and different languages use the same characters differently. And different countries use the same language differently. In France, the capital form of LATIN SMALL LETTER E WITH ACUTE is undecorated; in Quebec, the capital form keeps the accent. To answer your question you need to know the location and culture of the person asking it.
Ladar Levison: If we eliminate all the duplicate characters and leave just one of each, does that limit the ability to write words in some languages?
Andrew Sullivan: Maybe. One possibility is that we work out how to do that, at least some of the time.
Ladar Levison: How do we normalize between different case forms? Unicode doesn’t define that.
Andrew Sullivan: Yes, it does. Unicode defines uppercase and lowercase, but casefold doesn’t work the way you might think.
Jeff Hodges: Is the problem that the different normalizations don’t yield the same answer?
Andrew Sullivan: No. One problem is that Normalization Form C does not do what we thought. The goal of this BoF is to get an understanding of the problem(s).
Jeff Hodges: The problem is there’s more than one Unicode Normalization Form.

[Slide 14]
Dave Thaler: In this case (ARABIC LETTER BEH WITH HAMZA ABOVE) there is no intent for the two characters to normalize to the same thing. They’re completely different characters that just happen to look identical.
Andrew Sullivan: There are only two relevant normalizations: compatibility normalization and canonical normalization. We have NFC and NFD, but there’s a 1:1 bidirectional mapping between NFC and NFD. Similarly for NFKC and NFKD.
Kerry Lynn: For a given person writing in a given language, would they ever use two different code points that produce characters that look alike?
Andrew Sullivan: It appears that, most of the time, no user of any writing system would intentionally use both forms. But you can’t assume you know what writing system an identifier was written with.
Stuart Cheshire: I can imagine a diligent writer using both the Kelvin symbol and the capital letter K in one document.
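A minimal Python sketch of the KELVIN SIGN case and of casefolding’s surprises, using only the standard unicodedata module; the printed results assume a CPython 3 build with its bundled Unicode data:

    import unicodedata

    kelvin = "\u212a"   # KELVIN SIGN
    kay = "\u004b"      # LATIN CAPITAL LETTER K

    # Two distinct code points that render identically in most fonts.
    print(kelvin == kay)                                # False
    # KELVIN SIGN has a canonical decomposition to U+004B, so canonical
    # normalization (NFC or NFD) does unify this particular pair.
    print(unicodedata.normalize("NFC", kelvin) == kay)  # True

    # Casefolding is defined by Unicode, but it is not simply lowercasing:
    print("\u212a".casefold())   # 'k'  -- KELVIN SIGN folds to 'k'
    print("\u00df".lower())      # 'ß'  -- lowercase leaves sharp s alone
    print("\u00df".casefold())   # 'ss' -- full casefolding expands it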
John Klensin: In some cases, they not only render the same but have the same name.
Nabil Benamar: If we remove one of the characters, we cannot write certain words. Why remove a character from a language? To solve what problem?
Andrew Sullivan: We’re not talking about removing anything, yet. We’re not at the point of solutions; we’re exploring the problem. If you’re referring to the IAB statement, that is calling for more understanding of the problem.
Joe Hildebrand: You said no user of any writing system would intentionally use two forms that look alike, but, of course, attackers might deliberately do this.
John Klensin: Not only do you not know the language the person decided to write in, but you don’t know which IME (input method editor) was used and what its properties are.
Tony Hansen: I know some characters have changed their properties from one Unicode version to the next. Is there a possibility of a Unicode version 7.1 that fixes this?
Andrew Sullivan: We thought we understood the normalization stability rules. We were wrong. Normalization cases are added rarely, and never removed. It is unusual for a character to move between scripts after it’s been in Unicode for a few versions. We’ve recommended not using a character until it has been in Unicode for a few versions. To be clear, BEH WITH HAMZA ABOVE is not actually used by any human. It was added for Fula, a language that has not been written in Arabic script for over 100 years. But writing systems can change.
John Klensin: We _did_ understand the normalization stability rules. What we didn’t understand was the range of choices that could be made when a new, more or less precomposed, character/code point was added. Fula has been written by humans much more recently than 100 years ago, just not by very many humans. I know people who are still writing Classical Mayan too, just not many of them and not for daily communication. (And not in Unicode, of course.)

[Slide 18]
Andrew Sullivan: A further example: TAMIL LETTER KA (க) and TAMIL DIGIT ONE (௧) look identical, but are encoded with separate code points so that they sort correctly.
Stuart Cheshire: I’m struck by the similarity with old typewriters that had no digits 0 and 1, so people typed “O” and “l” instead. Then with computers they had to learn that they couldn’t do that, because to a computer an “O” is not a “0” and an “l” is not a “1”.
Andrew Sullivan: Confusability and common substitutions are an additional problem. The problem here is more than that. We can’t solve everything. Can we do something that will get us part way there?
Yutaka Oiwa: I am Japanese. This is not a problem. In a well-designed font *all* characters are clearly distinguishable.
Andrew Sullivan: Hooray! The problem is smaller than we thought.
John Klensin: A sufficiently smart IME could deal with the KA problem by observing context (e.g., with or without other numerals or associated delimiters). Whether there are such IMEs is another question.
Andrew Sullivan: We do have identifiers that mix letters and numbers.
Joe Hildebrand: Some software lets you use a pen or your finger for entering characters.
Andrew Sullivan: We’re not just talking about confusability. Sometimes “rn” looks like “m”.

[Slide 22] Possible Directions
1. Find them, disallow new, cope with old
2. Disallow some combining sequences
3. Do nothing / just warn
4. Get 1 (or more) new Unicode property (or properties)
5. Create NFI (Normalization Form for Identifiers)
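For the Tamil pair discussed at slide 18, existing character properties already separate the letter from the digit, which hints at why direction 4 (new properties) appeals; a small illustrative check, assuming the standard unicodedata module:

    import unicodedata

    ka = "\u0b95"    # TAMIL LETTER KA
    one = "\u0be7"   # TAMIL DIGIT ONE -- rendered (near-)identically to KA

    for ch in (ka, one):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"category {unicodedata.category(ch)}")
    # U+0B95 TAMIL LETTER KA: category Lo
    # U+0BE7 TAMIL DIGIT ONE: category Nd

    # Only the digit carries a numeric value:
    print(unicodedata.digit(one))   # 1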
Kerry Lynn: Option 1 is to search for and disallow all homoglyphs?
Andrew Sullivan: Yes, and that would be a huge amount of work. It’s not just all Unicode characters, but all possible combining sequences.
Kerry Lynn: Can we just disallow combining sequences?
Andrew Sullivan: Before BEH WITH HAMZA ABOVE existed, people may have already been approximating it by writing a BEH and adding a HAMZA ABOVE.
Kerry Lynn: Are these combining sequences exceedingly rare?
Andrew Sullivan: No.
Ladar Levison: Can’t we just map all representations of digit one to the Latin “1” character?
Andrew Sullivan: This is not as easy as it sounds. See the long email thread with the subject “A digit is a digit”. What do you do with “ö” (LATIN SMALL LETTER O WITH DIAERESIS) when you can’t display the DIAERESIS?
Ladar Levison: It should map to “o” (LATIN SMALL LETTER O) without the DIAERESIS.
Henrik Levkowetz: In Swedish, if you don’t have DIAERESIS available, you’d write “ö” as “o”. A German would transcribe “ö” as “oe”.
Barry Leiba: Shall we disallow “rn” in user names because it might look like “m”? Let’s not rat-hole on that.
Shawn Steele: Calling it an ostrich is a bit presumptuous (referring to “Do nothing / just warn”). This is a case of “we’re willing to break these characters because they seem ‘fixable’”, while thousands of other cases are ignored. Are you referring to the fact that the universe of confusable characters is broader than the problem under discussion here? The “they would use the other one even though it wasn’t the character they wanted” really bothers me.
John Klensin: It bothers me too, even though I agree with what I think Andrew was trying to say.
Andrew Sullivan: I was suggesting the (hypothetical) case where someone was using BEH and adding a HAMZA ABOVE before Unicode 7 existed. It is likely that people will do these kinds of substitutions. ARABIC LETTER BEH WITH HAMZA ABOVE is not really the literal case we care about. What if a community switches its writing system (say, from Latin to Cyrillic)? It has happened in our lifetime. And when a community switches its writing system, that’s when they tend to add distinguishing marks to the text. We can’t pretend that’s not going to happen.
John Klensin: We’ve also got some identical characters with letter properties in the same script (Latin) that we may need to disallow anyway (see the comments about phonetic descriptors in one of the I-Ds).
Andrew Sullivan: Those are phonetic descriptor characters, used in typesetting to show how a word is pronounced. They look remarkably like Latin letters, but they’re not.
Oleksandr Tsaruk: In the long run, will UTF-32 solve the problem?
Andrew Sullivan: No. The problem is with the code points, not how they’re encoded.
Joe Hildebrand: Could we just solve this in software, by rendering all the characters and comparing the images to see if they match?
Barry Leiba: In what font?
Joe Hildebrand: Google Noto.
Andrew Sullivan: These rules operate on single code points and combining sequences, not entire strings. Policy about entire strings operates at a different level. Maybe we solve this problem by operating on single code points and combining sequences.
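The BEH WITH HAMZA ABOVE case that runs through this discussion can be checked directly; a Python sketch, assuming an interpreter whose unicodedata reflects Unicode 7.0 or later:

    import unicodedata

    precomposed = "\u08a1"      # ARABIC LETTER BEH WITH HAMZA ABOVE (new in 7.0)
    sequence = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE

    # U+08A1 was assigned no canonical decomposition, so no normalization
    # form unifies it with the combining sequence:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, unicodedata.normalize(form, precomposed)
                    == unicodedata.normalize(form, sequence))   # False, 4 times

    # And, per the UTF-32 question: the encoding is irrelevant; UTF-8 and
    # UTF-32 serialize exactly the same code points.
    print(precomposed.encode("utf-8"))
    print(precomposed.encode("utf-32-be"))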
Pete Resnick: There are some simple cases where characters have different properties, e.g.:
* “l” letter vs. “1” digit
* “O” letter vs. “0” digit
* “K” letter vs. “K” (kelvin) symbol
Solving the general confusability problem is impossible. I am leaning towards option 4 (new Unicode property).
Stuart Cheshire: Actually, the K (kelvin) symbol is in Unicode category “Uppercase_Letter”.
Chris Newman: The “do nothing” option is not acceptable. At least there will be warnings and guidance, which is not nothing. An example: in email, “+” is allowed, but if you use it, you get problems. Some fight it; some give up and stop using “+”. We should stop making software more complicated and work on advice and guidance instead.
Andrew Sullivan: What if we didn’t change the software, but had additional Unicode properties that affected which characters were PVALID, etc.?
Chris Newman: This adds complexity for deployments. For example, we have multiple versions of SASL: unconstrained UTF-8, then Stringprep rules (which were too strict), and now (probably) PRECIS rules.
Andrew Sullivan: So this is not new rules. It’s like a new version of PRECIS where the characters have different properties.
Chris Newman: This will cause interoperability problems.
Marc Blanchet: I agree, multiple versions is bad. Options 1, 2, and 3 are bad. I support option 4 (new Unicode property). In the Unicode table, the description of a code point is just a comment. There could be two code points with the same description, but they’re not the same abstract character.
Shawn Steele: Lots of these “identical” things depend on the font. Even when the font makes them “identical”, it’s reasonably easy to find a font designer who chose a different direction.
John Klensin: Pete, if those phonetic descriptors are counted, it probably is hundreds, but not very many hundreds.
Shawn Steele: So is bundling/blocking an option?
John Klensin: Pete, the difference in practice between option 4 (new property) and option 5 (NFI) may be very slight, except in how we define protocols.
Shawn Steele: [to Chris] +1000 on this should be advice :)
John Klensin: (Especially to Chris) As someone who continues to use “+”, please see the distinction between “coding artifact” and actual character differences that I made on the LUCID list.
John Klensin: It’s not clear that doing something here other than warn makes interoperability worse -- it might actually make it better, because of that “same user, different device and IME can yield different strings” problem, which results in unexpected non-matches.
Shawn Steele: Bundling/blocking would be a “direction”.
Andrew Sullivan: For explanation, bundling/blocking means manually creating aliases (e.g., duplicating zone data, or using DNAME records). For example, if you register “École” (with the acute accent) you get “Ecole” (no accent) for free. This depends on a competent administrator of the repository (and a registration authority to do that administration).
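A rough sketch of what a bundling rule of the kind Andrew Sullivan described might compute for the “École”/“Ecole” example; the decompose-and-strip-marks rule here is an illustrative assumption, since real registries set bundling policy per language rather than by blanket mark removal:

    import unicodedata

    def bundle_mate(label):
        # Decompose, then drop combining marks (category Mn) -- an
        # illustrative approximation of one possible bundling rule.
        decomposed = unicodedata.normalize("NFD", label)
        return "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")

    print(bundle_mate("École"))   # Ecole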
Pete Resnick: These are policy questions. They are somewhat equivalent to the mapping document we have right now. There are some things you can do to be helpful; IDNA2003 used NFKC to map all script digits to a single canonical form. I agree with Marc Blanchet: we should do option 4 (new Unicode property). If that fails, we’re left with option 3 (advice and guidance). Option 5 (the IETF creates a Normalization Form for Identifiers) would be crazy. If the Unicode Consortium creates NFI, that would be good, but they may not want to do it.
Andrew Sullivan: I agree: the Unicode Consortium should create NFI, or the new code point properties, which is functionally equivalent.
Dave Crocker: We’ve had this problem for 25 years and it continues to get worse. We just won’t solve this problem. There’s now a whole ecosystem, and we look for localized mechanisms without considering how they affect the whole ecosystem. Every proposed solution adds more complexity to the entire ecosystem. Doing more hard engineering work is not going to yield a worthwhile benefit. We should focus on dealing with what there is now, and letting the world adapt to it.
Andrew Sullivan: I have a great deal of sympathy for that point of view. Once we have a world that is shaped a certain way, we are stuck with it. It reminds me of the great quote: “When the facts change, I change my mind.”
Ladar Levison: I ran into the problem with the “+” sign in email addresses. An early RFC did not allow the “+” sign, but a later one did. I blocked those email addresses and lots of people complained to me. I don’t allow usernames that differ only in (US-ASCII) case. We should do the same for Unicode. We should separate what we display from what we store, and we should have matching rules for which names are considered equivalent.
Andrew Sullivan: You want greedy matching and parsimonious display. I can see the appeal in that. Maybe we could do this with just an operational guidance document. [A sketch of this approach appears below.]
Yutaka Oiwa: We should study the reason for confusion in each case. We should do option 4 (new Unicode property). In the SI definition, the kelvin sign is actually a capital K. In Japanese we have the katakana middle dot character, which looks like punctuation, but it’s important and we need it.
Andrew Sullivan: There are similar context rules for the Arabic-Indic digits. We can do it. It’s just that the way context rules work, and their awkwardness, is a sign of the problem.
Eliot Lear: We need to work out what will cause the right thing to happen with the least amount of work. Letting each protocol designer do their own thing will not produce consistent results. We need more reviewers for the document.
John Klensin: No, Pete, I was assuming option 5 (NFI) could be done on the basis of properties, plus probably an exception list (for the same reason we needed an exception list in IDNA). Andrew, it is also a matter of how we organize protocols, because of questions of early or late normalization. To part of Dave Crocker’s statement: the driver of added complexity is that Unicode isn’t fixed -- new code points and scripts keep getting added, with new conventions. So we either adapt or find ourselves back in the “just pick a version of Unicode and stick with it” state. If we assume that everything added in the future will be obscure, that isn’t necessarily a bad idea, but it does add some separate implementation complexity as libraries evolve differently.

Dave Thaler made a call for consensus of the room:
Do we agree there is an important problem here?
* Substantial hum in support
* Shawn Steele against (it’s a fool’s errand)
Do we agree on what the problems are?
* Mixed response

Dave Thaler: Is it possible to mitigate at least some of the problems? Is it hopeless?
Joe Hildebrand: Yes to both.
Dave Thaler: Is there something the IETF should work on with the Unicode Consortium? Is there something the IETF should work on independently?
Joe Hildebrand: I’m not convinced we know enough to know what to ask the Unicode Technical Committee.
Pete Resnick: I make a distinction between knowing what to ask the Unicode Consortium (in which case we have no work to do) and believing there is a question to be asked but we have to work out how to ask it.
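Referring back to the store/display separation Ladar Levison suggested: a minimal sketch of “greedy matching, parsimonious display”, where the NFC-plus-casefold match key is an assumption chosen for illustration (a real profile, such as a PRECIS profile, would pin down the exact operations and their order):

    import unicodedata

    def match_key(name):
        # Greedy matching: compare on a normalized, casefolded key.
        return unicodedata.normalize("NFC", name).casefold()

    accounts = {}   # match key -> display form as originally registered

    def register(name):
        key = match_key(name)
        if key in accounts:
            return False      # clashes with an existing name
        accounts[key] = name  # parsimonious display: keep the user's form
        return True

    print(register("Kelvin"))        # True
    print(register("KELVIN"))        # False -- differs only in case
    print(register("\u212aELVIN"))   # False -- KELVIN SIGN folds to 'k'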
Jeff Hodges: I want to point out that John Klensin’s draft-klensin-idna-5892upd-unicode70-04 did not get discussed.
John Klensin: The scope/list in draft-klensin-idna-5892upd-unicode70 is broader than that in draft-sullivan-lucid..., and neither addresses what IDNA thinks of as homographs.
Ladar Levison: We need a normalization form suitable for identifiers.
Kerry Lynn: Are there things we can do to mitigate the problems? Is there guidance we can provide to browser vendors?
Ladar Levison: Before we can fix situations, we need to be able to identify the situations that need fixing.
Eliot Lear: This is about what work needs to be done, not who will do it. This has to be a collaborative effort between the IETF and the Unicode Technical Committee.

Dave Thaler: Call for consensus of the room: Who supports 4/5 (new property, NFI)?
* Substantial hum in favour
* None against
Dave Crocker: Why do we think this is a practical path, given past history? Do we think the collaboration will be successful?
John Klensin: I think the 4/5 bundle is the right way to go... and, incidentally, getting a good start on it is a prerequisite to 3.
Shawn Steele: We did 4/5 in 2003, and it failed.
Dave Thaler: Who supports 3/6 (warn, bundling/blocking) as a “Plan B”?
* Substantial hum in favour
* Little against

Summary: 4/5 (new property, NFI) is the preferred direction; fall back to 3/6 (warn, bundling/blocking) if that fails.

Dave Thaler: Who will volunteer to do work on this effort (the 4/5 bundle and/or the 3/6 bundle)? The following people volunteered:
* Andrew Sullivan
* Pete Resnick
* Yoshiro Yoneya
* Alexander Francisco Arias
* Sarman Hussain
* Tony Hansen