Minutes from IDN BOF, IETF 71
Philadelphia, PA, 2008-03-12
Chair: Harald Alvestrand

The Chair brought the meeting to order.

A. Administration; Agenda; Plan of the Meeting.

Administrative matters. The Chair asked for additional agenda items. There were no additions.

The Chair reviewed the agenda.

The first part of the meeting: document review. The Chair noted that list discussion tends to focus on very specific, technical issues. In order to cover the items on the agenda, he asked that specific discussion be deferred until later in the meeting, so that the current documents could be reviewed broadly.

Following document review: charter review.

After the charter review: certain linguistic and cultural issues that might need addressing at the protocol level. Note that this item was removed late in the meeting because of lack of time.

[A note on minutes convention: items mentioned on the slide for which there was not substantive expansion are not mentioned in here.]

B. Document review.

1. Introduction to an IDN revision. John Klensin
slides: http://www3.ietf.org/proceedings/08mar/slides/idn-3.pdf

How many people had looked at recent versions of the relevant documents? Many. John said he would proceed quickly to cover the high points, relying on the fact that many people had read the documents.

"History". Emphasise that some of these problems are simply not possible to solve:
- domain names are at best a subset of a language ("can't write literature in domain names")
- Unicode not perfect (inventing a completely new encoding scheme is not a likely solution)

"Evolution". RFC 4690 might be different if it were written today. This is an open effort, and there is evidence of that openness.

"Key Issues I". It's important to understand how and why the things left out of IDNA2003 are important. The problems are real problems to users of those languages, even if there are only small numbers of such speakers.
That said, John noted that it is important to avoid the trap of thinking everything can fit in the DNS. Emphasise that the IETF does not have a consensus mechanism for solving orthographic or linguistic disputes. Emphasise also the serious problem of characters that are dangerous in themselves, but without which certain words are impossible to write at all.

"Key Issues II". Emphasise that it doesn't matter what one thinks of the linguistic foundation of complaints about IDNA2003's handling of certain scripts. The IETF hasn't a way to resolve such disputes, but must listen and try to resolve the complaints.

"Key Issues III". John noted that some of the problems arise from natural analogies with the way "traditional" DNS works (e.g. the relationship between, say, ASCII lower case and upper case; and the relationship between certain characters in some other script). John also emphasised the importance of libraries in the apparent issues: IDNA2003 specified a certain version of Unicode, but applications don't know and can't learn what version they're actually using (and even if they could, it wouldn't help). He also emphasised that the "stable list" approach in IDNA2003 turned out not to work.

"Key Goals". (no expansion)

"Current Structure". John emphasised that the point of a "why" document is to prevent people making up their own rules and principles because they don't understand why the rules are as they are.

"DNS Internationalization". John mentioned that the "common sense" that users need to have may require some education in order to develop it.

QUESTIONS FROM THE FLOOR

a. Ted Hardie asked whether a document is needed outlining which other protocols need (or need not) to pay attention to this new work. John replied that in some sense, the answer is that the need is restricted to issues of localization.
On the other hand, because of the possibility of new label separators and the ubiquity of DNS label parsing, everything is going to need to know something about it, so the idea that IDNs can be implemented entirely on the client side is dead.

b. Phillip Hallam-Baker suggested a caveat that the group will not undo what has already happened: existing registrations, even those not conforming to the new approach, will remain legal. John replied that in this case it is more important to design it to work well, rather than to address current pathologies.

c. An unnamed speaker (Lisa Dusseault?) asked why the apparent introduction of a new label separator appears to have been assumed, rather than argued for. John replied that the basic problem is an analogy made by non-ASCII users: between the period/full stop and the dot label separator on the one hand, and between a non-ASCII sentence separator and a same-character label separator on the other. There is no way to be sure the list of such candidate separators is bounded. Finally, there's the practical issue that some people complain they don't have the dot on their keyboard. An unnamed speaker (Paul Hoffman?) noted also that in IDNA2003, the "dot mapping" went into all mappings, and that isn't to be the case under the new approach.

2. Issues and Rationale. John Klensin.
slides: continue in same deck

"Issues and Rationale". (no expansion)

"Address Primary Issues". (no expansion)

"New Terminology". John emphasised the difference of approach here: the IDNA2003 documents were about labels, and not FQDNs.

"The Front End". (no expansion)

"Summary". (no expansion)

QUESTIONS FROM THE FLOOR

a. [Name not clear: Yoshiro YONEYA?] asked about the issues of usability with different label separators. John replied that he didn't have a complete answer, but he noted that there is a difference between mapping label separators and other parts of the FQDN, because different targets need to know these maps.
It's really a perspective issue, because what goes on the wire doesn't change that much between IDNA2003 and 200x; but what goes is changes.

b. Stuart [name not clear] from Apple expressed surprise that people collapsed "dot" and "full stop". John made some remarks about the historical name of the "." character, and the decimal-number separator in some European usage.

c. Harald Alvestrand observed that the main issues seem to be some protocol issues plus the matter of separators. John replied that the most important differences lie in trying to define rules and mechanisms to which one can conform rather than an algorithm one can implement. In some way it's simpler, because the mappings are gone (in contrast to IDNA2003).

d. An unnamed person noted that there seems to be a difference where in IDNA2003 the dots were "in the mapping", whereas now the mapping has to be "in before the protocol".

3. Tables. Patrik Fältström
slides: http://www3.ietf.org/proceedings/08mar/slides/idn-1.pdf

"Abstract". (no expansion)

"What is this". Patrik emphasised that he's trying to ensure that there is more than one algorithm to generate the same tables; he's had some success with this. John mentioned that doing it this way may help eventual users understand the way it works.

"Algorithm/tables". Note which parts are normative.

"Property values". These will become clearer later in the presentation.

"Category A". Each category is determined by rules; the rules decide whether some code point belongs to a category. This category encompasses the "good codepoints", which mostly means that this is where graphic characters and such like are thrown away.

"Category B". It's important that there is "stability" in normalization and casefolding. John noted that it's important to tease out different meanings of "stability", and then promptly refused to say exactly what it means.

"Category C". (no expansion)

"Category D". Whereas C calls out individual characters, D calls out whole blocks.
John noted that this means you can't write music in DNS either.

QUESTIONS FROM THE FLOOR

a. [Unnamed] asked whether this is a stable list. Patrik replied that this is what's in version 5 of the tables document; the rules have changed in different versions.

b. Ted Hardie asked whether the list has to change if a new code point is added to it. The answer is yes. John also noted that several of these characters are likely already to have been picked by other rules.

c. Phillip Hallam-Baker noted that the current approach does not specify the attacks to which it is responding. Patrik replied that the point is to extend the "LDH rule" to international contexts, so all this is doing is picking out from all the possible characters the ones that are not in "internationalized LDH". John said that the work is not motivated by security. Phillip pressed, though, asking why these ones are the ones that are "out". Patrik made an analogy with control characters or NULL in hostnames. Harald suggested the LDH principle: for IDNs to be useful, they have to be a subset of all language.

"Category E". (no expansion)

"Category F". These are important exceptions, called out individually. Notice that they have individually assigned properties. John mentioned that, except for hyphen-minus (which is historic), these are all mismatches in the way DNS and Unicode are optimized.

"Category G". This is a placeholder category for things that somehow get missed. Let's hope it stays empty.

"Category H". (no expansion)

"Category I". H and I are worded carefully to avoid cases where "can't happen" cases happen. If some unanalyzed script turns up that causes problems, this is an out ("banned unless a special rule").

"Category J". (no expansion)

"Algorithm explanation": roughly, do these in order of FGEHIBCDJA.

QUESTIONS FROM THE FLOOR

a. [Unnamed]: Has anyone come up with a mnemonic for the table order?
Patrik noted that there've been various suggestions about how to name the categories, but disadvantages in each case.

b. Tony Hansen said he couldn't tell which ones are positive categories and which negative. Patrik replied that PVALID is "ok", DISALLOWED is "bad", and "table lookup" means that it's dependent on the codepoint itself. Harald noted that the idea is simply to be able to crank out the table values over again for the next version of Unicode, with no human judgement involved. Tony asked about CONTEXTO. John replied that this is a dirty engineering trick, and asked for a better answer. The problem here is that some characters cause problems in some contexts, but are needed in others. So there are two meta-rules: if no rule, then the character is prohibited; if there's a rule, then follow that rule. He made the analogy with regular expressions: "Is this codepoint in that script?" There is a possible difference between CONTEXTO and CONTEXTJ, but not everyone agrees. Some things are really problematic and need to be checked at lookup (J): even if they somehow get registered, they should never get looked up. Others are less troublesome. If the distinction is removed, then everything has to be checked at lookup time. So this distinction is really an application optimisation.

4. Bidi. Harald Alvestrand
slides: none

In IDNA2003, only if the first character is RTL and the last character is RTL can the label be RTL. The problem with this is non-spacing or combining marks (most combining marks are non-spacing, but two aren't). Some languages (2 have been mentioned in a draft) have words with a combining mark at the end, and the combining mark has no direction. So under IDNA2003, you can't use that language at all. So Cary Karp and Harald proposed a new rule: a nonspacing mark may occur at the end of a label. After hacking up some Perl to check the way this might work, it turned out that some ASCII labels next to some RTL labels will break.
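
[Illustration, not part of the presentation: a minimal Python sketch of the two bidi rules as the minutes describe them, assuming only the first-and-last-character test. The function names are hypothetical, the other bidi constraints of IDNA2003 are deliberately ignored, and this is not the Perl mentioned above. It uses the bidirectional classes reported by the standard unicodedata module: R and AL for RTL characters, NSM for nonspacing marks.]

```python
import unicodedata

# Bidi classes treated as "RTL" in this sketch (an assumption of the sketch).
RTL_CLASSES = {"R", "AL"}


def is_rtl(ch):
    """True if the character has a right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in RTL_CLASSES


def idna2003_bidi_ok(label):
    """Simplified IDNA2003-style test: a label containing RTL characters
    must begin and end with an RTL character."""
    if not any(is_rtl(ch) for ch in label):
        return True  # no RTL characters, so the rule does not apply
    return is_rtl(label[0]) and is_rtl(label[-1])


def relaxed_bidi_ok(label):
    """Same test, but trailing nonspacing marks (class NSM) are skipped
    before looking at the last character, as in the proposed new rule."""
    if not any(is_rtl(ch) for ch in label):
        return True
    core = label
    while core and unicodedata.bidirectional(core[-1]) == "NSM":
        core = core[:-1]
    return is_rtl(core[0]) and is_rtl(core[-1])


# Hebrew BET, ALEF, then the combining point PATAH at the end of the label.
sample = "\u05D1\u05D0\u05B7"
print(idna2003_bidi_ok(sample), relaxed_bidi_ok(sample))
```

[The sample label ends in a combining mark with no direction of its own, so the first test rejects it while the relaxed test accepts it.]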
It also turns out that Arabic numbers cannot be mixed with European numbers. Nothing can start with "-" or numbers.

QUESTIONS FROM THE FLOOR

a. [Unnamed] So server5.3com.com is bad? Harald: yes, if "server5" is in Arabic.

b. Pete Resnick said that the rules seem good, but seem to break things that "should work"; so the principle must be wrong. Harald replied that he worries about what happens when a user gets email in an RTL script with an IRI in it containing a domain name: will they blame their application? Pete thought that meant the application was broken. John mentioned also the problem that no modern language is strictly RTL, because everyone uses decimal numbers. This also highlights the problems with "foo123" as a single label or a label made up of "f" "o" "o" "1" "2" "3".

c. [Unnamed: Paul Hoffman?] noted that there are names currently allowed in IDNA2003 that are in use, that will become illegal under the new rules. Harald asked for an example, but also noted that he's willing to take the hit if it will make things better (but one can only do it once).

5. Charter review. Lisa Dusseault, in role as AD.
slides: http://www3.ietf.org/proceedings/08mar/slides/idn-2.ppt

[note: this segment of the minutes is not keyed to individual slides]

AD said that she is hoping for a working group that can resolve some of the issues, call for consensus, and answer the call in a reasonable amount of time. There was some pressure not to have a working group, but there seemed not to be a consensus in external review. She asked that people identify whether they are opposed to the WG, and asked for comments in favour (even if in the latter case with scope restrictions). She asked for a show of who had seen the charter. There seemed to be many who had.

QUESTIONS AND DISCUSSION FROM THE FLOOR

a. Stephane Bortzmeyer asked that the requirements of the work be spelled out more clearly. In particular, it seems that removing references to RFC 4690 from the charter is needed.
For instance, 4690 discusses phishing extensively, but John's presentation explicitly called it one of the problems that can't be solved by this effort.
*** AD called for hum on leaving phishing out of charter.
+++ result: consensus [scribe's note: see also below, item e]

b. Phillip Hallam-Baker argued that the charter needs specific reasons for changes. AD asked for a suggestion for the charter. Patrik Fältström noted a need for explicit examples. John Klensin argued that the point is to try to support languages, not look at Unicode and figure out "what we can't have". He opposed the latter approach.
*** No proposed charter text on this item; no sense of room test

c. Ted Hardie said that the milestones in the proposed charter were not practical. AD noted that actual months were removed from the slides because of a similar comment in external review.
*** No proposed charter text on this item; no sense of room test

d. Ted Hardie noted that there is a significant need for tutorial material, and asked whether this should be added to the charter, or left for the IAB. John Klensin agreed, but worried that this would lead to specifying user interface details. AD asked for a volunteer to write a tutorial. Dave Crocker noted that tutorials are good and probably important, but audience considerations might yield many possible tutorials, which expands the scope [beyond a tractable WG charter?]. Phillip Hallam-Baker noted an apparent absurdity in writing an ASCII-only [RFC] tutorial for internationalization.
*** AD called for hum on adding tutorial matter to charter
+++ result: indistinct. Take to list

e. [Name not clear: Marcos Sanz?] noted that the earlier argument was in favour of removing references to RFC 4690, but that this was not the question posed at the hum in (a). Stephane Bortzmeyer argued that he wanted to mention explicitly in the charter that solving phishing was not a goal, because that's been one of the "hottest" issues with IDNA2003.
Phillip Hallam-Baker suggested that it would be better to exclude the subset of IDNA-related phishing cases than phishing itself; nobody ever thought all phishing was IDNA-based. Harald Alvestrand objected to removing 4690, because he'd heard no rationale. [Unnamed: Paul Hoffman?] replied that 4690 is a laundry list; and if it's mentioned, the chartered WG would have to answer every item on the list. [Original speaker: Marcos?] said he did not want to talk about perceived inadequacies of the current system, because the issue is to address what to do (positively).
*** AD called for hum on keeping or removing reference to RFC 4690.
+++ result: remove from charter

f. In response to foregoing, [Unnamed: Paul Hoffman?] suggested that the issue was with specific documents listed. It would be better to talk about goals, not specific documents. AD stated that such is a matter of charter clarity, not scoping, so it's a topic for the list.
*** No possible sense of room test.

g. [Name not clear] noted that the earlier IDN working group was established in the Internet area, and this group is contemplated in the Applications area. He wanted to know why. AD responded that the earlier effort had to be in Internet area because there was an option to change (radically) the DNS in order to achieve the goals. In this case, the relevant work is already in the Applications area. If the result of the working group is in fact a consensus that a deeper change is necessary, a recharter under the Internet area could be required. That said, this is clearly work that affects other areas, and the group cannot work in isolation from those areas. We're still part of the IETF.
*** Clarification question; no sense of room test.

h. Jelte Jansen for Simon Josefsson (on Jabber): Is this work going to obsolete IDNA2003, Stringprep, Nameprep, &c? Which things? AD responded that such is (see f) part of charter clarification, and a topic for clarification on list.
*** No sense of room possible.

i.
Stephane Bortzmeyer observed that some technical decisions appear to have been prejudged in the proposed charter; specifically, what is checked at registration, vs. what is checked at resolution. AD asked whether it would be ok if the distinction remains in the charter, as long as there is no requirement that the difference entail different classes in the ultimate work result. Stephane replied that the charter needs to be clear that the goal is not to dictate what is valid at registration time. [Name not clear: Michael?] supported Stephane, and suggested a rewording using a distinction between stored labels as opposed to lookups instead of registration and resolution. Patrik Fältström also expressed strong agreement with mentioning registration and resolution as different things.
*** AD concluded that this is complicated and needs to move to the list.

j. Tony Hansen expressed concern with whether the continued use of xn-- as a prefix is set in stone. Paul Hoffman argued that the current charter restriction must remain. The BOF was out of time, and this discussion appeared inconclusive.
*** No determination of status of xn-- prefix.

k. Phillip Hallam-Baker suggested that the charter should emphasise the goal of solving "badness" rather than technical perfection. Several respondents disagreed.
*** No text or specific charter items proposed, so no sense of room test possible.