Current Meeting Report

IETF Technical Plenary Session, 
November 12, 2009
   
1.  Introduction - Olaf Kolkman   
   
(Olaf)  It's now 4:30 and I would like to start in about a minute,
so if everybody can find themselves a seat.  So, memorize the note
well.  

Ladies and gentlemen, welcome all to the technical plenary, 
IETF 76 in Hiroshima.  During this meeting, we've got some 
supporting tools for people in jabber land, and also for the people
in this room.  There's a jabber room for the plenary, 
jabber.ietf.org, and the presentations that we will use have been 
uploaded.  You can find them on the meeting material site - so you 
can read along if you're back home.  Welcome to those back home.

There's a little bit of an opportunistic environment for an 
experiment, we had the opportunity to find somebody to transcribe 
in English, and you can see that on the screen on the left and 
right - what is being said in this meeting.  This might help people
who have difficulty with the accents of some of the speakers or some of
the people at the mic.  And we would like to know your experience 
with that - if you value this experiment.  

When you step up to the mic and ask questions, again, it's 
important to swipe your card.  I will do so now, so you can also 
see who I am.  And when you are at the mic, please keep it short 
and keep it to the point.  It's something that we heard yesterday.  
We hope we can establish a tradition.  

A short word on the agenda for today.  Aaron will start with
his report.  I will follow with the IAB's report.  And then we will 
have a session on internationalization of names and other 
identifiers, that's a session led by John Klensin, Dave Thaler and
Stuart Cheshire.  After that, somewhere between 7:00 and 7:10 we 
will have an open microphone session which will hopefully not run
past 7:30.  It all depends on the conciseness and the number of
questions asked.  Without further ado, Aaron...
 

2.  IRTF Chair Report - Aaron Falk 

(Aaron) Hello, I'm here to give a brief report on the Internet 
Research Task Force (IRTF).  My name is Aaron Falk.  The short 
status - we had four research groups meeting this week here at the
IETF.  The Host Identity Protocol (HIP) research group and the 
Scalable Adaptive Multicast (SAM) research groups have already met.
Tomorrow the Delay Tolerant Networking (DTN) and Routing Research 
Group (RRG) will meet.  Another thing we did was to have a review of
the Routing Research Group with the IAB this morning.

We have six IRTF RFCs waiting to be published.  Publication is 
currently wedged on finalizing the TLP, the trust license, and that
looks like (as was reported last night) it will happen around the 
end of the year or early next year.  The document defining the IRTF 
RFC stream was revised and is now sitting in the RFC Editor's queue,
and for those of you keeping track, it was modified so that now the
independent stream and the IRTF stream have common rights
language.  

Let's see.  What else?  

In terms of work - there's been some proposed new work.  We've been talking
about, for quite some time, I guess almost two years, an RG on
virtual networks, or network virtualization.  There's a bar BOF
starting in about 40 minutes, so those of you who are going to that
will miss the fascinating talk that's coming up later tonight, 
here in the plenary.  But hopefully, this group is finally getting
to discussion of a charter for the research group, so that would be
good progress.  

Another topic that's come up, in groups that I've been in a couple
of times this week, has been the internet of things, smart objects, and
so there's starting to be discussion about maybe there should be a
research group that's looking at an architecture for those 
technologies and how they fit into the internet architecture.  That
is just talk at this point, but I'm giving the community a heads-
up.  One thing we want to make sure of is that anything that happens
in the IRTF does not slow down the smart grid activities that are 
going on.  So make sure we work around that.
  
I'd like to give sort of a quick snapshot of the different research
groups and the energy levels - who is active and who is not.  The
groups on the right, the 'active' list, are groups that are
meeting at IETF or elsewhere, or having very active mailing
lists.  The colored one at the bottom is the SAM research group,
which has moved to the active column.  They met this week, and they
have actually had a few meetings at non-IETF locations with other
conferences.  

The quiescent groups, they're not totally inactive; they have
mailing lists going along.  The Public Key Next Generation Research
Group, PKNGRG, they had a little trouble getting started, but it 
sounds like there's some energy.  Do you have a question?

(Richard Barnes)  I was just wondering if the PKNG group will be 
meeting at IETF 79?

(Aaron)  I don't think that's been decided yet.  I don't think
there's a planned meeting now, but if you're on the mailing list, 
you would hear about it.  Is Paul in the room?  Can you confirm 
that's true?  

(Paul Hoffman) That's correct.  

(Aaron)  Moving right along.  So, another thing that I've been 
doing with the IRTF reports is to take a couple of research groups
and give a very quick snapshot of the topic area - what the group 
is up to, some of their recent work items, to give a flavor of 
what's going on in these groups.  This is really very cursory, and
just to give you a flavor of some research stuff that's happening
in the IETF and maybe help discover whether there is interest in 
getting more involved.  

The first group I want to talk about is Anti-Spam Research Group.  
It's an open research group that's looking at anti-spam research.  
In particular, at the open problems.  It's been hoped that there 
would be some standards work that would come out of that, but it's
not been as fruitful as was originally hoped.  But there's a wide 
range of participation from not only the standards folks that we
see here, but also researchers and other folks who are working in 
the area.  There's lots of industrial activity going on in
this space.  Because anti-spam is a big industry, there are lots of
other activities going on, and so it's important to understand that
the research group is not doing standards work.  There is some in 
the IETF, in DKIM.  It's not a trade group - there are several of 
those - I think the large one is MOK.  And it's not an academic
conference.  So, this has really been sort of a discussion of 
technical topics in the area, and they've worked on a couple of 
documents, but mostly the activity has been on the mailing list.  

There's a document they produced on DNS black lists and white 
lists that's waiting to be published.  And then there's another one
on black list management - a draft that has
been circulated for a while and is sort of waiting to be
finalized.  

Another topic that's been going on in this research group has been 
starting to develop a taxonomy of the different techniques for 
fighting spam, and also, of different spamming strategies.  You can 
see the URL here if you want to check it on the web.  This is really
sort of open for contributions.  I think that part of the 
motivation for this is that many people have come up with ideas, 
often the same ideas repeatedly for how to solve the spam problem.
And so, this has been described in the past as a set of
pre-printed rejection slips for why your idea won't work, so you
can be indexed into the correct part of the wiki when you have an
idea and don't have to re-circulate threads on the mailing list
over and over again.  So I think that would be good work.  This has
turned out to be, like the spam problem in general, hard to make
progress in.  And the research group - I've heard, and I think the
chair has heard, some frustration as to why they have not done 
more.  There are a lot of folks doing research in this area and 
they're focused on publishing papers, sometimes more so than doing
collaboration in the IRTF.  Also, some of these problems are 
extremely hard.  But, one of the values of what's happening on the
research group mailing list is that it's starting to capture some of
the folklore, some of the wisdom that's passed around between
practitioners.  There's also misinformation that gets stamped out,
and they're making an effort to capture these things in the wiki.
So the chair asked me to pass along that, for folks in the IETF and
elsewhere who have questions about spam and anti-spam related
technologies, this research group is intended to be a good
discussion point for bringing those topics out.

Okay.  So, the other research group that I wanted to talk about is
the Scalable Adaptive Multicast group.  I apologize, it's hard
for me to read (the slide) so it's probably hard for you to read.  
The concept behind this group, if you look at the pictures, the 
bottom one is intended to be a conventional network, hosts at the
edge and routers in the middle.  And the goal of the group is 
really to enable multicast services, taking hybrids of application
layer multicast, which is easy to deploy among consenting end 
systems, and take advantage of either IP multicast or link layer
multicast - any native multicast that might exist.  

This is what you see in the pictures: in the bottom you have the
conventional network, then you have the multicast tree, and then at
the top you have a hybrid multicast environment where you have 
native multicast in one region, and application multicast in 
regions that don't support it.  It takes advantage of the AMT 
protocol - this is a protocol for tunneling.  AMT is Automatic
Multicast Tunneling.  That connects multicast-enabled
clouds over unicast networks.  And this is technology that was 
developed in the MBONED environment, and so, it's a way of sort of
gluing these together.  So the SAM RG is trying to create a framework
and protocols for integrating these various strategies for enabling 
multicast. 

There's a bunch of different communities; this work initially came
out of the X-cast environment, where they have some protocols, and
they've got sort of one point they developed in the space.  There's
also P2P overlays and the IP multicast folks, and then applications 
include streaming and mobile networks and other kinds of 
applications.  

They've developed some drafts on developing a
framework.  This is just another illustration of another version of
the same picture where you've got networks that have neighborhood 
IP multicast.  They might have link layer multicast, application 
layer multicast, and they're glued together, and they've developed 
a protocol that's got different kinds of joins, IP multicast join, 
join-by-gateway, join-by-native-link, and so this is some of the 
work that's been going on the longest in the group.  And it's 
pretty mature, I understand.  Another thing that they've been 
working on is developing namespace support so that hosts can 
directly participate in multicast services.  And along with 
middleware to make that work.  And they've been also trying to 
build a simulation environment to allow exploration of a wide range
of networks in this space.  They started with a tool called OMNeT,
and they're extending it to support IP multicast, and then 
extending that again to support different kinds of overlay 
strategies.  

Then finally, to go beyond simulation to test beds, there's some
work that's just being discussed now about building a hybrid 
multicast test bed that has started with contributions from the
different participants in the research group.  Those contributions are
actually globally distributed, with the hope that the test bed will
grow for implementing and exploring some of these protocols.

So in a nutshell that's the status of the IRTF and two of the 
research groups, and I am open for questions if anybody has any.  
Okay.  Thank you very much.  
   
(applause)  
   

3.  IAB Report - Olaf Kolkman

As far as the IAB report, I love these little crane birds folded
out of paper, so I put one on the picture - typical for Hiroshima.
And I enjoyed making them during the social.  

Anyway about the IAB.  I show this slide every time I open the 
session - basically pointing out what we're about.  It's very hard
to give a nutshell description of what the IAB is about.  But it 
has a charter, RFC 2850, and we try to describe as much as possible
on our home page.  You can find the current membership there.  
There are links to documents, and within the documents section you 
can find our minutes.  It is the goal to have minutes posted not 
more than two meetings behind.  We're very bad at meeting that 
goal.  Just before this meeting, a batch of minutes was published 
that were approved earlier this week.

Correspondence... when we talk to other organizations we usually 
leave a trail of correspondence and that is published on our web 
site as well.  Documents are one of our outputs.  Recently we published
RFC 5620, the RFC Editor Model.  I will be talking about the
model implementation at the end of this presentation.  

There are two documents currently in auth48: RFC 5694, which is the P2P
architecture document - definition, taxonomies, examples, and
applicability.  That is about to be published; there's a final
little thing with a header.  The same goes for RFC 5704, Uncoordinated
Protocol Development Considered Harmful.  You've heard a
presentation about that in previous IAB reports.  Both of those are
very close to publication.

There is ongoing document activity.  We've been working on a 
document considering IPv6 NATs.  There was a call for comments from
the community.  Those comments have been incorporated into version 
2 of this draft and we're about to submit this to the RFC editor 
once every IAB member has had a chance to sign off on it.  So that will
be going to the RFC editor shortly.  

There's another document - IAB Thoughts on Encodings for
Internationalized Domain Names.  It's part of the inspiration for 
today's technical session.  So basically, it's a call for comments.
The technical plenary today is a working session that is based 
around this document.  

There are a bunch of draft-iab documents that are sitting
somewhere in various states, that didn't have much attention over
the last few weeks, at least not visibly.  The draft on IAB headers and
boilerplates - that document has been finished for a long time and
is sitting with the RFC Editor, and it refers to the 3932bis document.  We
found a way to get out of there by changing the reference to RFC 
3932bis itself if the situation with 3932bis does not get resolved 
pretty soon.  So we want to get that out as soon as possible.  
That document basically changes the headers of documents, and 
changes some of the boilerplates so it's more obvious if a document
is an Independent RFC or an IETF Stream RFC or IAB Stream RFC.  

We're also working, and internally we've been rehashing, this 
document that is intended to describe the IANA functions, and what
the IETF needs out of that.  An update on that is imminent, and so
is an update on the IP model (on which we've been working, and 
which will be uploaded as soon as the queue is open).
   
A little bit of news.  We had a communication with IANA on the way 
forward with respect to signing .arpa.  We received a plan of action
shortly before the IETF, in which there's a two-phase approach, where
they will proceed with a temporary set up to get .arpa signed in the
fourth quarter of 2009.  Given that quarter four of 2009 is still 
only six weeks old, we expect it will be signed before the end of 
the year, so it's really imminent.  

After the design has been finished for signing the root zone, that 
same system will be used for signing .arpa.  We responded positively
to that plan.  We find it very important to get .arpa signed,
to get its key signing key published in the IANA ITAR and in the
signed root whenever that is available, and to make sure there are
secure delegations to the signed sub-zones, and that is now being set in
motion.  So this is some progress on that front.  

We made a bunch of appointments.  We've re-appointed Thomas Narten
as the IETF liaison to the ICANN BoT, and related to that, Henk
Uijterwaal for the ICANN NomCom.  And finally, we appointed Fred Baker
to the Smart Grid Interoperability Panel.  A number of you
have been to the smart grid bar BOF yesterday and know what this is
about.  

Communication to other bodies...  There is an effort underway 
within the EU to modernize ICT standardization.
There was a white paper published by the Commission, and we've
reviewed that and basically replied with a number of facts around
our process, so that we at least are sure that there's no
misunderstanding of how the IETF works.  We also provided comments 
to the ICANN CEO and the ICANN board of trustees on a study that
appeared recently that was about scaling the root.  You can find 
those comments in the IAB correspondence section of the web page.

Something that is of a more operational nature is the 
implementation of the RFC editor model.  Just as a recap of the 
state of affairs, I've been talking about this previously, we're in
a transition period.  We're moving away from ISI as a service 
provider, and into an implementation of a model that has been 
developed over the last few years.  Within this model, we've got 
four legs: the RFC Series Editor, the Independent Submissions
Editor, the Publisher, and the Production House.  The IAOC is 
responsible for selecting the RFC production center and the RFC 
publisher, and the IAB is responsible for creating and appointing
an advisory group for helping us with the
selection of RSE and ISE candidates.  That has all been done.

Looking for the RFC Series Editor is our responsibility, and so is the
Independent Submission Editor, which is also one of the functions
within the model that is our job.

Where are we with all that?  Well, the IAOC, you heard yesterday,
has awarded the production center contract to AMS and
also the RFC publisher contract to AMS.  And the good news here is
that Sandy and Alice are the core members of the production center,
which means that the continuity of publishing and getting RFCs
online is not in danger.  This is the good news so to speak.  

As far as the Independent Submission Editor goes, that is the 
editor that assesses the technical quality of documents on the 
independent stream, we've had significant delays.  That delay has 
been because we've been focusing on trying to find an RFC Series
Editor.  However, we have candidates and we are currently
interviewing and assessing those candidates and we are basically on 
track with that now.  

As far as the RFC Series Editor function goes, we had a call in 
July, not quite half a year ago.  Nominations closed August
fifteenth, and the nominations were provided to the ACEF, the 
committee that helps us with assessing the candidates.  They 
interviewed candidates.  They've had long deliberations, and their 
conclusion was that there was no suitable match between the 
candidates, the functions, and the expectations of the role - those 
three variables didn't quite match.  And their advice was to seek 
somebody to manage the transition, to take a step back, and make sure
that the pieces are in place, and then go for the long-term solution.
"Manage the transition" was the advice.  

The IAB went over this advice, turned it around a couple of times, 
and finally decided that the transitional RFC Series Editor (RSE)
way forward is the best plan, the best way out so to speak.  We've
defined that job, and you should have seen the announcement with 
job description, and call for candidates earlier last week, mid 
last week.  

There is an ongoing call for candidates, but the evaluation of 
whatever we have will start November 20 or so.  In a week.  So, why
do we think we will be successful now, or have higher odds of
success?  Well, there are a couple of things that are different 
than we had with the situation on July 8.  First, there is less 
uncertainty about the state of the production and publication 
functions.  It is known who is going to execute those functions.
There is capable staff there, there is institutional knowledge 
which makes the job easier.  

There's also, in the job description,
more focus on the transitional aspects.  We've called out
that the person who is going to do this needs to refine the role of
the RSE after the initial transition, so that it is more clear what
the successor will be getting into.  There is an explicit task to 
propose possible modifications to the RFC editor model in order to 
see that things work better, when we go out for the more permanent 
function.  And, because this is a transitional management job, so 
to speak, it requires a different type of commitment, a different type of
personality, and also a shorter time span of commitment.  So we
hope that the pool is wider, deeper, or of different dimensions.  

One of the things that we will not do this time, in
looking for candidates, is to disclose the names of the candidates
publicly.  We think that was a mistake and we won't do that now.

As I said, the call for nominations is now open.  We will start
evaluation November 23, and we will accept nominations as long as
nobody has been announced.  

We believe that this is in line with RFC 5620, the RFC Editor Model,
and the general community consensus.  That said, we
did not go back to the community and ask "is this
all okay?", because there is time pressure - ISI will stop this
function December 31st, and on January 1 we will be starting to
implement this new model.  We couldn't afford to lose time.
That doesn't mean we're not listening if you have any comments or 
things should have been done differently.  

We have been talking, as Bob Hinden said yesterday, with ISI, and
Bob Braden in particular, about their willingness to extend the
current contract on somewhat of a consultancy basis.  So at least
we have somebody so that balls do not get dropped, and
somebody who can actually transfer some institutional memory to
whoever gets this job.  There are two small links that you can
read from the slides.
 
Finally, it is worth mentioning that this time we did not have any appeals,
and that basically closes my presentation.  

(applause)
   
With that, I would like to invite John, Stuart, and Dave for a 
session on internationalization.  And I'll start up the slide set.  


4.  Internationalization in Names and Other Identifiers
   
(John Klensin) All right.  Good afternoon, everybody.  

We've come to share some general ideas about internationalization,
and where we stand, and where we're going.  The plenary's goal is 
to try to inform the community about this topic.  This is not the 
first time the IAB has tried to do this.  We continue to learn more
and to try to share that with you.  

Internationalization is badly understood.  It is understood 
moderately well by a fairly small number of experts, most of whom 
end up realizing how little we actually understand.  But it affects
a large number of protocols, a large number of people, and should 
affect virtually everything we're doing in the IETF, and anywhere else
there's a user interface.
  
We've got a new working draft which contains some recommendations 
about choices of bindings and encodings that we'll talk about.
Current version is draft-iab-encoding-01.txt.  It is very much 
still in progress.  

And more work is needed in this area, both on this document and 
about other things, and should continue.  

Internationalization is important and timely because a lot of 
things are going on around us.  Names.  Names can have non-ASCII
characters in them.  Not everybody writes in Roman characters, 
especially undecorated ones.  We're seeing trends towards 
internationalized domain names.  They've been floating around the 
Internet, fairly widely deployed since about 2003.  Earlier than 
that in networks other than the public Internet.  Pieces of this we'll
show you later, and they can be as simple as a string of Latin
characters with accents or other decoration on some of them, or as
complex as scripts which some of the people in this room can read, 
and others cannot.  The things you can't read become problems.  The 
things you can read become opportunities.
  
We see URLs floating around.  Actually IRIs - what we discover is
that a great deal of software doesn't know the difference, and a 
great many people know even less about the difference.  

We have path names in various kinds of systems that use
internationalized identifiers, and use them in ways which are fully 
interchangeable with plain ASCII things, because the underlying
operating systems have been internationalized.  

Users want to use the Internet in their own languages.  It seems 
obvious.  It took us a long time to get there.  We're not there 
yet.  At the same time, we've been making progress.  The MIME work
which permitted non-ASCII characters in e-mail bodies in a 
standardized way was done in Internet time a very, very long time 
ago, and has been working fairly successfully since.  

In China, IDNs are being used for all government sites.  A great 
many IDNs are deployed in the .cn domain.  Users in China see 
domain names as if the top level domains are actually IDNs.  We're
told that about 35 percent of the domains in Taiwan are IDNs, and 
almost 14 percent of the domains in Korea are IDNs.  

There's demand from various parts of the world that use Arabic 
scripts and they have special problems because the script runs 
right to left under normal circumstances, and most of our 
protocols that use URLs and identifiers have been written around 
the assumption that things are written from left to right.  In 
general, right to left is not a problem.  Mixing right to left and
left to right produces various strange effects.  
   
(Stuart Cheshire) Thank you, John.

So, I'm going to go over some of the basic ideas and some of the 
terminology.  

Unicode is a set of characters which are represented using 
integers.  There's actually about a million of them, but most of 
those are not used.  Most commonly used characters fit in the first 
65,000.  Most of even the less commonly used ones fit in the first 
200,000 or so, but the unicode standard defines up to about a 
million.  

These are abstract integers that, for the most part, represent 
characters.  I say 'for the most part' because there are 
variations.  You can have an accented E as a single character, or the E
with the accent character separately.  But ignoring those details,
roughly speaking it's a set of characters with numbers assigned
to them.  

Now, you can write those numbers on paper with a pen, or on a 
blackboard with chalk.  When we use them in computer systems, we 
need some way to encode them.  And it's easy for us to forget that,
but the encoding is important.  

Here are three of the common encodings: UTF-32 is 32 bits in
memory, the normal way of representing integer values.  That
means there are endian issues.  UTF-16 is slightly more compact,
because the majority of characters fit in the first 65,000.  That means
most Unicode characters can be represented by a single 16-bit word, so
that takes half the space.  Similarly, there are endian issues with
UTF-16.  UTF-8 uses a sequence of 8-bit bytes to encode the
characters.  And UTF-8 has some interesting properties, so I'm 
going to talk a bit more about that.  

The IETF policy on character sets and protocols specifies that all 
protocols starting in January 1998 should be able to use UTF-8.  
Why is that?  Why do we like UTF-8?  Well, UTF-8 has a useful 
property of being ASCII compatible.  And what that means is that 
the Unicode code points from zero to 127 are the same as the
ASCII code points.  So, decimal 65, hexadecimal 41, represents an
upper case A, both in ASCII and Unicode.
  
I'm talking about integers here.  When you represent that integer 
Unicode value using UTF-8, you use the same byte value for values up to
127.  This may seem very obvious, but it's an important distinction
between the integer value and how you represent it in memory.

The property of this is that if I have an ASCII file which is clean
7-bit ASCII, I can wave a magic wand and say that's actually UTF-8, 
and it is actually valid UTF-8 and not just 'valid', but valid with
the same meaning - it represents the same string of characters.  
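
(Illustration, not from the talk: a minimal Python sketch of that
property, using an example string of my own.)

```python
# Sketch: a pure 7-bit ASCII byte string decodes to the same characters
# whether you treat it as ASCII or as UTF-8.
ascii_bytes = b"Hello, IETF 76!"

assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
# Re-encoding the text as UTF-8 gives back exactly the same bytes.
assert ascii_bytes.decode("ascii").encode("utf-8") == ascii_bytes
```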

For files that already have other meanings for the octet values 128
and up, like Latin-1 (ISO 8859-1), that property is not true, because
those code values have already been given other meanings.  But for
plain ASCII, UTF-8 is backwards compatible.  UTF-8 uses those octet
values above 127 to encode the high numbered code points, and
I'll explain how that works.  

So, in blue, we have the unicode characters that are the same as 
ASCII characters.  It's just a single byte in memory.  In the 
middle, we have the green ones, and those are the octets that 
start with the top two bits being one, or the top three or the top
four.  

And when you see one of those, that indicates the start of a
multi-byte sequence for encoding one Unicode code
point.  On the right in purple, we have the continuation bytes
which all have the top bits one and zero.  

The nice property of this is, by looking at any octet value in 
memory, you can tell whether it's a stand-alone character, the 
start of a sequence, or something in the middle of a multi-byte 
sequence.  This is how they look in memory.  

So, we have the ASCII characters standing alone, we have the 
two-byte sequence where we start with the 110 marker, we have the
three byte and the four byte sequence.  
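
(Illustration, not from the slides: a small Python sketch of those byte
patterns; the example characters are my own choices.)

```python
# Sketch: show the UTF-8 byte patterns for characters of different widths.
for ch in ["A",             # U+0041, 1 byte  (0xxxxxxx)
           "\u00e9",        # U+00E9 e-acute, 2 bytes (110xxxxx 10xxxxxx)
           "\u20ac",        # U+20AC euro sign, 3 bytes (1110xxxx 10xxxxxx 10xxxxxx)
           "\U0001f600"]:   # U+1F600, 4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {[format(b, '08b') for b in encoded]}")
```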

UTF-8 has some nice properties.  Part of being ASCII compatible is
that UTF-8 encoding results in no zero octets in the middle of the 
string.  That's useful for standard C APIs that expect null-
terminated strings.

The fact that the bytes are all self-describing makes it robust to 
errors.  If there is corruption in the data, or data is copied and 
pasted, inserted and deleted, maybe by software that doesn't 
understand UTF-8, it is possible to resynchronize.  If I give you a
megabyte file, and you look at a byte in the middle of the file,
then you can tell whether you've got a stand-alone character.  If
you look at the byte and the top two bits are 10, you know you're
in the middle of a sequence, so you have to go forward or back, but
you don't have to go very far before you can re-synchronize with
the byte stream, and you know how to decode the characters
correctly.  That is in contrast to other encodings that use escape
characters to switch modes, where you really have to parse the data
from the start to keep track of what mode you're in. 
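
(Illustration, not from the talk: a rough Python sketch of that
resynchronization step, using an example string of my own.)

```python
# Sketch: resynchronize to a character boundary in a UTF-8 byte string,
# starting from an arbitrary offset.
def char_start(data: bytes, offset: int) -> int:
    """Step back from 'offset' until we reach a byte that is NOT a
    continuation byte (i.e. its top two bits are not 10)."""
    while offset > 0 and (data[offset] & 0b1100_0000) == 0b1000_0000:
        offset -= 1
    return offset

data = "na\u00efve caf\u00e9".encode("utf-8")
# Offset 3 lands in the middle of the two-byte sequence for the i-umlaut.
start = char_start(data, 3)
print(start, data[start:start + 2].decode("utf-8"))
```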

Another nice property of UTF-8 is because it has this structure, 
you can tell with very high probability looking at a file whether 
it's valid UTF-8, or whether it is something else like Latin-1.
One of the properties is that if I see a byte above 127 in a file, 
it can't appear by itself, because that must be part of a multi-
byte sequence.  So there have to be at least two, and the first one
has to have the top two or three or four bits set, and the later ones
have to have the top bits be one-zero.  So the probability of typing
8859-1 text that happens to meet that pattern goes down very quickly for
all but the shortest files.  
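
(Illustration, not from the talk: a tiny Python sketch of that detection
heuristic; the fallback to Latin-1 is my own choice of example.)

```python
# Sketch: text that decodes cleanly as UTF-8 almost certainly is UTF-8;
# otherwise fall back to Latin-1, which never fails to decode.
def guess_text(data: bytes) -> str:
    try:
        return data.decode("utf-8")        # strict: rejects invalid sequences
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")

print(guess_text("h\u00e9llo".encode("utf-8")))       # decoded as UTF-8
print(guess_text("h\u00e9llo".encode("iso-8859-1")))  # falls back to Latin-1
```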

Another useful property is that a simple byte-wise comparison of 
two strings of UTF-8 bytes using standard routines results in them
sorting in the same order as sorting the Unicode code points as integers.
This is not necessarily what humans would
consider to be alphabetical order, but for software like quick
sort that needs an ordering of things, this is a suitable
comparison, which results in consistent behavior whether you're using
Unicode code points or UTF-8 bytes.
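
(Illustration, not from the talk: a small Python sketch of that ordering
property, assuming byte-wise comparison means memcmp-style lexicographic
comparison of the UTF-8 bytes; the example words are my own.)

```python
# Sketch: sorting strings by their Unicode code points and sorting their
# UTF-8 encodings byte-wise give the same order.
words = ["zebra", "\u00e9clair", "apple", "\u4e2d\u6587"]

by_code_points = sorted(words)                      # compares code points
by_utf8_bytes = [b.decode("utf-8")
                 for b in sorted(w.encode("utf-8") for w in words)]

assert by_code_points == by_utf8_bytes
print(by_code_points)
```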
  
One of the criticisms that's often raised against UTF-8 is that 
while it's great for ASCII - one character is one byte - and it's 
pretty good for European languages, since most European characters 
fit in two bytes, for Asian languages they often take three or four 
bytes per character.  And this has led to a concern that it results
in big, bloated files on disk.  While that may have been a concern 
10 or 20 years ago, I think in today's world, there are different 
trade offs we have to consider.  One thing is that everybody 
inventing their own encoding, which is locally optimal in some 
particular context, may save a few bytes of memory in that context,
but it comes at a big price in interoperability.  And when I talk 
about different contexts here, I don't mean just geographically
different places around the world, or different languages, but 
context like e-mail doing one thing and web pages doing a different 
thing.  Also in the context of applications and working groups, we 
have a tendency for each community to roll their own solution that 
they feel meets their needs best, which is different from
other people's, and we have a lot of friction at the boundaries 
when you convert between these different protocols that are using 
different encodings.  We'll have some more examples of that later.  

Another aspect is that on most of our disks today, most of that 
space is taken up with images, and audio and video.  Text
actually takes a very small amount of space.  When you view a web 
page, most of the data that's coming over the network is JPEG 
images.  If you're looking at YouTube, almost all of it is the 
video data.  Those images and video are almost always compressed, 
because it makes sense to compress them.  

Ten years ago there were web browsers that would actually gzip the 
HTML part of the file to make the download faster.  I don't believe 
anybody worries about that anymore, because the text part of the 
web page is so insignificant compared to the other media that it's 
not that important.  

Another interesting observation here is, with today's file formats
like HTML and XML, quite often the machine-readable markup
tags in that file, which are not there for end users to ever see -
they're there to tell your web browser how to render the text -
those tags are really just bytes in memory.  They have
no inherent human meaning, but it's convenient that we use mnemonic text,
so we use ASCII characters for things like title and head and body.
And even in files containing international text, a lot of that markup
is ASCII.  And I have had discussions very much like this with
engineering teams at Apple.  All the applications that Apple ships
are internationalized in multiple languages, and inside each
application - which you can see for yourself if you control-click
on it and open it up to see the contents - there are files that
contain all of the user interface text in different languages.
And we had the debate should it be in UTF-8 or UTF-16.  And clearly
for western European languages, the UTF-8 is more compact.  But the 
argument was for Asian languages that would be wasteful.  So I did 
an experiment.  

This is the file path, you can try the experiment for yourself.  I 
had a look at that file.  And in UTF-16 it was 117K.  In UTF-8,
it was barely half the size.  This is the Japanese localization.  
I'm thinking, how can that be?  

I was expecting it to be about the same size or a little bigger, 
but I wasn't expecting it to be smaller.  When I looked at the file, 
it's because of this...  

The file is full of these key equals value pairs.  And all that 
text on the left is ASCII text.  And the Japanese on the right may 
be taking three or four bytes per character, but that's not the 
only thing in the file.  So, I believe that the benefits we get 
from having a consistent text encoding so we can communicate with 
each other are worth whatever size or performance overhead
there might be.  And as this example shows, there may not be a size 
overhead in many cases.  
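
(Illustration, not from the talk: you can reproduce the spirit of that
measurement on any strings-style file; the key/value line below is an
invented example, not taken from Apple's actual file.)

```python
# Sketch: compare UTF-8 and UTF-16 sizes for strings-file style content.
line = '"CANCEL_BUTTON_TITLE" = "\u30ad\u30e3\u30f3\u30bb\u30eb";\n'

utf8_size = len(line.encode("utf-8"))
utf16_size = len(line.encode("utf-16-le"))   # without a byte-order mark
print(utf8_size, utf16_size)   # the ASCII key keeps the UTF-8 form smaller
```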

So that's UTF-8.  But we know that not everything uses UTF-8.  So 
the other thing we're going to talk about is punycode, which is what's
used in international domain names.  

Now, this is not because the DNS can't handle 8 bit data.  The DNS 
protocol itself perfectly well can.  But many of the applications 
that use DNS names have been written assuming that the only valid 
DNS names contain letters, digits, and hyphens.  So in order to
accommodate those applications, punycode was invented.  And whereas
UTF-8 encodes Unicode code points as octet values in the range from
zero up to hex F4, punycode restricts itself to a smaller range of
values, listed on the slide.  And those are the byte values that
correspond to the ASCII characters: hyphen, digits, and letters.

So, what that means is that when punycode encodes a unicode string
you get out a series of bytes, which if you interpret them as being 
ASCII, look like a sequence of characters.  If you interpret them 
as being a punycode encoding of a unicode string, and do the 
appropriate decoding, and then display it using the appropriate 
fonts, they look like rich text.  
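
(Illustration, not from the talk: a rough Python sketch of that dual view,
using Python's built-in punycode and IDNA codecs - the latter implements
the older IDNA2003 mapping - and an example label of my own.)

```python
# Sketch: the same label viewed as Unicode text, as raw punycode,
# and as an IDNA "xn--" label.
label = "b\u00fccher"                       # buecher with u-umlaut

print(label.encode("punycode"))             # ASCII letters/digits/hyphen only
print(label.encode("idna"))                 # b'xn--bcher-kva'
print(b"xn--bcher-kva".decode("idna"))      # back to the Unicode form
```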

So this is a subtle point.  We have the same sequence of bytes in
memory or on disk or on the wire in the protocol, that have two
interpretations.  They can be interpreted as letters, digits and
hyphens - not particularly helpful, as it kind of looks like opening
a JPEG in emacs.  You see a bunch of characters, but that doesn't 
really communicate what the meaning of the JPEG is.  Or, the 
letters and hyphens can be interpreted as punycode data which 
represents a unicode string.  Let me give you another example of 
that.  

Does this look like standard 7 bit U.S. ASCII or not?  Let me zoom 
in.  We'll do a hum.  Who would say this is 7 bit ASCII?  Can I 
have a hum?  Who would say this looks like rich unicode text?  Hum?  

Okay.  Let me zoom in a bit closer.  This is a plain ASCII file.
In fact it only contains Xs and spaces.  You can edit this file in
vi if you want.  

So, the same data has two interpretations.  Seen from a sufficient 
distance, it looks like Chinese characters, but it can also be 
interpreted as Xs and spaces.  So the meaning of this text depends 
very much on how you choose to look at it.  But I would argue that 
editing this file in vi would not be the most efficient way of 
writing Chinese text.  

So this problem that the same byte values in memory can be
interpreted in different ways really plagues us today.  Just a
few days ago I was buying a hard disk on Amazon and got
these upside down question marks.  I think that's supposed to be 
dashes.  This isn't even in complicated script systems - this is 
just the characters that any English or American reader would 
expect to use in plain text.  

I remember when I had my first computer, an Apple IIe, and it could
only do upper case.  And then my next one, the BBC micro, had lower
case.  And the next one which was a Macintosh, in about 1985, could
actually do curly quotes, and I could write degrees Fahrenheit with
a degrees symbol, and I could do em dashes, and I could do Greek
alpha signs.  I could write not-equals as an equals sign with a line
through it, the way I did in school when I was writing by hand -
not as 'exclamation point equals'.  We have done it for so long, we forget:
not-equals is an equals sign with a slash through it.  So by 1986 we
had gone from typewriter to some fairly nice typography where I
could type what I wanted on my Mac.  And here we are, all these years later,
and things seem to have gone backwards.  I'm not happy.  How do we 
solve this problem?  

We make the user guess from 30 different encodings - "what do you 
think this web page might be?"  This is not something that we want 
to impose on users.  The average end
user isn't even qualified to understand what they're being asked here.
  
So, international domain names don't only appear on their own.  
They appear in context.  And here are some examples.  They can 
appear in URLs, they can appear in file paths on Windows.  Of all
these different encodings, which most of the people in this room 
would probably recognize as meaning the same thing, these are the 
only ones that in my mind are really useful, if we have a goal of 
supporting international text.  

If you asked a child to draw a Greek alpha symbol, and gave her
a pencil and paper, plain pencil and paper, she would draw an alpha
symbol.  She would not write a percent sign and some hex digits and say that's an
alpha.  That's complete insanity.  That is not an alpha.  An alpha 
is this thing that looks like an A with a curly tail on the right 
side.  If we want to support international text, it's got to look 
like international text.  

But because we have all these protocols that don't have a native 
handling of international text, we keep thinking of ways to encode 
international text using printable ASCII characters.  And when you 
do that encoding, who decodes it?  There's an assumption that if I
encode it with percent something or ampersand something, then the 
thing on the receiving side will undo that and put it back to
the alpha character it was supposed to be.  Well, we got bitten by
this yesterday.  We sent out an e-mail announcing this plenary.  
This was not staged.  This is real.  And some piece of software,
somewhere decided that unicode newlines were no good.  So it was 
going to replace them with the HTML ampersand code for a unicode 
newline.  

And something on the receiving side was supposed to undo that, and
turn it back into a newline.  Well, nothing did and this is what 
you all got in your email.  

This can get really crazy.  Suppose you have a domain name which is
part of an e-mail address, which you put in a mail to URL, which is 
then appearing on a web page in HTML text.  Is the domain name 
supposed to be actual rich text as seen by the user?  Or is it
supposed to be punycode?  Because it's an email address, email has
its own printable encoding - an escape character followed by two
hexadecimal characters.
Well, in an e-mail address do we have to do that escaping?  And the 
e-mail is part of the URL.  And the whole thing is going into a web 
page, so HTML has its own escaping for representing arbitrary 
characters.  

Do we use all of these?  A lot of people say yes.  It's not clear 
which ones we wouldn't use out of those four in the nested hierarchy
of containers.  If you're looking at an HTML file in your editor, 
you are very far removed from having rich text in front of you on 
the screen.  
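
(Illustration, not from the talk: a rough Python sketch showing the same
invented domain name as it might appear at some of those layers - as an
IDNA A-label in the DNS, percent-encoded in a URL, and as HTML character
references in a page.  The domain is made up, and the choice of layers is
mine, not a statement of what any particular mailer or browser does.)

```python
# Sketch of the layered encodings discussed above.
from urllib.parse import quote

domain = "b\u00fccher.example"

idna_form = domain.encode("idna").decode("ascii")   # DNS layer: xn--bcher-kva.example
percent_form = quote(domain)                        # URL layer: b%C3%BCcher.example
html_form = domain.encode("ascii",                  # HTML layer: b&#252;cher.example
                          "xmlcharrefreplace").decode("ascii")

print(idna_form, percent_form, html_form, sep="\n")
```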

So we tried, we decided we'd try an experiment.  What would happen
if we didn't do all this encoding?  What would happen if we just 
sent straight 8 bit data over the network and decided to try this 
email test.  Now, the SMTP specification says it's 7 bit only, but
we asked the question, what if we disregarded that, and tried it 
anyway, to see what would happen?
  
So, I sent a test e-mail, where I replaced the E in my name with a
Greek epsilon, and I with an iota, and I sent this e-mail by hand, 
using netcat, so it wasn't my mail client doing the encoding.  I just
put the raw bytes on to the wire and sent them to the SMTP server 
to see how it would handle it.  I did it two ways, using the 
punycode-encoded representation of that first label of the domain 
name, xn-- something that looks like line noise.  And I did it a
second time, just using the UTF-8 representation of that, which I'm 
showing here as the actual unicode characters.  

So to make that really clear, this is the text that I sent using 
netcat to the SMTP server.  This is the first one, using punycode,
so this whole email is plain 7-bit ASCII.  No surprising byte
values in it.  The first two lines are the header, after the blank
line, the rest is the body.  I point this out because headers are
handled differently from bodies.  Header lines are processed by the 
mail system.  The body by and large is delivered to the user for 
viewing.  

The second e-mail is conceptually the same thing, except not using
punycode, using just 8 bit UTF-8.  So this is the result of the 
first test.  Not surprisingly, the punycode in the body of the
message was displayed by all the mail clients we tried as line 
noise.  Which is not surprising, because it's just text in the body
of an e-mail message.  There's no way that the mail client really 
knows that that text is actually the representation of an 
international domain name that's been encoded.  We could have some
heuristics where it looks through the e-mail.  I would not be happy 
about that.  Type the wrong thing in e-mail and it magically 
displays as something else.  That seems like going further in the 
wrong direction.  

In the from line, where we could argue that the mail client does 
know this is a domain name because it's user name angle bracket, 
user at example.com, close angle bracket, that is a clearly 
structured syntax for an e-mail address, and the mail client knows
how to reply to it.  It could conceivably decode that text and say
'this is punycode'.  The intended meaning of this text is not the
xn-- line noise, it's a rich text name with epsilons and iotas in it.  One
client did that, which was Outlook on Windows.

The second test was the raw 8 bit UTF-8 data.  And I'm very happy 
to say, in our small set of e-mail clients that we tested, 100%
of them displayed UTF-8 text in the body in a sensible way.  

We had some more interesting results from the from line.  Gmail did
this very interesting thing where it clearly received and 
understood the UTF-8 text perfectly well, because it displayed it
to the user as the punycode form.  I'm not quite sure why.  
Possibly for security reasons, because there is concern with 
confusable characters, which you will hear about in great detail in
a few minutes.  There is concern with confusable characters that 
you might get spoofed emails that look like they're from somebody 
you know but are really not.  Turning it into this punycode form, 
at some level, should avoid that.  I'm not sure it really does, 
because in a world where all of my email comes from line noise, the 
chance of me noticing that the line noise is different in this 
particular email, I don't know how much of a security feature that 
really is.  But that may be the motivation.  

Eudora 6 is an old mail client, written I think before UTF-8 was very
common.  Those characters there are what you get if you interpret the
UTF-8 bytes as being ISO 8859-1.  And the last three here, to be
fair, I don't think we should blame the Outlook clients here, 
because what appears to have happened is that the mail server that 
received the mail, went through and whacked any characters that 
were above 127 and changed them to question marks.  It didn't do 
that in the body, you see, but in the header it did do that pre-
processing.  So it's unclear right now whether it was the mail
client that did this or the mail server that messed it up before the
client even saw it.  

So, back to terminology.  Mapping is the process of converting one
string into another equivalent one.  And we'll talk a little bit 
later about what that's used for.  

Matching is the process of comparing things that are intended to be 
equivalent as far as the user is concerned, even though the unicode 
code points may be different, the bytes in memory used to represent 
those unicode code points may be different, but the user intention 
is the same.  

Sorting is a question of deciding what order things should be 
displayed to the user.  And the encoding issue has various levels 
to it.  I've talked today about how to encode unicode code points 
using UTF-8.  There is also the question that the accented E
character can be represented by a single Unicode code point for E
with accent, or as the code point for E followed by the combining
accent character.
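
(Illustration, not from the talk: a tiny Python sketch of that last point,
using the standard unicodedata module; the word is my own example.)

```python
# Sketch: precomposed vs. decomposed forms of the same user-visible string.
import unicodedata

precomposed = "caf\u00e9"          # e-acute as one code point (U+00E9)
decomposed = "cafe\u0301"          # e + combining acute accent (U+0301)

print(precomposed == decomposed)                    # False: different code points
print(unicodedata.normalize("NFC", precomposed) ==
      unicodedata.normalize("NFC", decomposed))     # True after normalization
```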

So more terminology.  In the IDNA space, an IDNA valid string is 
one that contains the Unicode characters allowed in
internationalized domain names, and those can take two forms.  The term
commonly used in the IDN community is a U-label: an IDNA-
valid string represented in Unicode, by which they mean, in
whatever is a sensible representation in that operating system.  It 
might be UTF-8, it might be UTF-16, but it is one of the natural 
forms of encoding unicode strings.
  
An A-label is that string encoded with the punycode algorithm, with an
xn-- prefix to call out the fact that that is not just a string of
characters in the DNS - this is something that's encoded by punycode, so
you have to decode it in order to get the meaning.  

So I'll wrap up my part of the presentation with an observation.  
When it comes to writing documents, or writing an e-mail to your 
family, having the most expressively rich writing tools available 
is very nice.  When it comes to identifiers that are going to be 
passed around, and are used to identify specific things, then 
it's not quite so clear.  Because the bigger the alphabet, the more 
ambiguity.  

Telephone numbers use ten digits.  And by and large, we can read 
those digits without getting too confused.  We can hear them over 
the telephone.  Most people who can work a telephone, can read and 
hear the ten digits without getting them too confused.  When we go
to domain names, we have a bigger alphabet.  We have 37 characters,
and we start to get a bit of confusion.  Os and zeros, Ls and ones
and Is - there's a bit of confusion, which is bad, but it's limited
to those few examples.  When we move to international domain names,
the alphabet is tens of thousands, and the number of characters 
that look similar or identical is much much greater.  So with more 
expressibility comes more scope for confusion.  And I will note, 
that while we're going in this direction, of bigger and bigger 
alphabets, the computer systems we use went in the opposite 
direction.  They went to binary.  Because when you only have one 
and zero, then there's a lot less scope for confusion in terms of 
signaling on the wire, with voltage levels.  If there's only two 
voltage levels that are valid, you're high or low.  If there are 
ten that are valid, then a smaller error might mean reading a 5 as
a 6.  So we know that when we build reliable computer systems that 
binary has this nice property.  So, I leave you with that.  And I 
ask Dave to come up and tell more.  

   
(Dave Thaler) I'm going to talk about matching first.  So earlier 
on when we talked about definitions, we said, you probably thought
that matching meant comparing two things in memory.  That is 
certainly one of the aspects of matching.  You look up a database
entry, and I know whether to respond or not.  There's another
problem with matching - that is the human recognition matching 
problem.  

So, let's do another eye test here.  We have two strings up here 
that could be easily confused by a human.  Can you spot the 
difference?  Hum if you can spot the difference.  

Okay.  The difference that you can spot here, is that on the left,
this is .com, and on the right this is .corn.  It seems like a 
great opportunity for some farmer's organization, doesn't it?  

This illustrates that even in plain ASCII we have confusion.  Now,
some of you who have been participating in the RFID experiment are 
aware of another type of confusion.  On this slide, these are not 
capital Is, they are lower case Ls.  More confusion with just 
ASCII.  But wait, it gets worse.  

This is the Greek alphabet.  It says 'Ethiopia', but those are not the
letters E T H I O P I A.  If you take the lower case of both of those,
then they look different.  All right.  The lower case versions of the
Greek letters there are fairly distinctive.  So as a result, we see
the current trend to actually deprecate these various forms and
revert to one standard one, or one canonical one if you will - in
various identifiers such as IRIs and so on.  In IDNA2008, some
of these characters are treated as disallowed for these types of
reasons.  

Second eye chart, okay.  Look up from your computer and stare at 
the screen.  Hum if you spot the difference.  They both look the 
same.  If you can spot the difference, you may need an eye test. 
 
The difference here is that all the characters on the right are in 
the Cyrillic alphabet.  There's no visual difference.  What's worse
is that in ASCII .py is the TLD for Paraguay.  On the right, those 
are the Cyrillic alphabet letters corresponding to .ru, which is
Russian. 
 
Now, anybody here who actually speaks Russian or is intimately 
familiar with internationalization, will be quick to point out 
one important fact - 'jessica' uses letters that are not in the 
Russian language.  For example, the letters that look like J and S do
not appear in the Russian language.  This points out that there are
alphabets, and languages that use a subset of the characters in those
alphabets.  In order to
get the letters that look like 'jessica', you have to combine 
characters from two different languages, but they're both in the 
same alphabet.  
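
(Illustration, not from the talk: a small Python sketch of why these
strings defeat byte comparison even though they fool the eye; the example
letters are my own.)

```python
# Sketch: a Latin 'a' and a Cyrillic 'a' render alike but are different
# code points, so simple comparison says they are not equal.
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430

print(latin_a == cyrillic_a)                      # False
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```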

So what this points out is that if you are a registry that is going to be
accepting, say, domain registrations under your zone, then you may
want to apply additional restrictions, such as not accepting things
that contain characters that are not in your
language.  If you're .py for Paraguay, and there are certain
characters that you don't want to allow, you can restrict that.  This
particular example requires combining characters from two different 
languages, and there are other examples that are purely from the 
same language.  Epoxy, and on the right, this may be, say, a 5 
letter acronym for some Russian organization.  The problem is 
that's at the human matching layer.  Is that the thing you're 
looking for?  Does that match or not match?  

John is going to talk about a couple more examples.  So
hopefully your eye tests have been enlightening.  


(John Klensin)  We get more interesting problems when we move 
beyond the eye tests into a human perception problem,
which is people tend to see what they kind of expect to see.  

So we have here two strings which look different, but look 
different only when they're next to each other.  And, the first one
is a restaurant, and the second one is in Latin characters and 
something different altogether.  But they look a lot alike if 
you're not sensitive to what's going on.  

In general, if you have a sufficiently creative use of fonts, and 
style sheets from a strange environment, almost anything can look 
like almost anything else.  A number of years ago I came into 
Bangkok very late at night, and I was exhausted, and I was
being driven to the city, and I saw a huge billboard and it had
three characters on it and a red, white, and blue background and 
from the characters I was firmly convinced it was USA.  Well, it 
was in Thai, the characters were decorated, and having seen them
outside of that script, maybe I would have understood the 
difference.  Maybe I would not have.  

That brings us to another perception test, which snuck up on me, and
an audience of other people, a couple of months ago.  We were sitting
in a room at an AP meeting and there was a poster in
the back of the room with the sponsors.  And on the poster we had 
these three logos, and the first one, pretend that you're not used 
to looking at Latin characters.  You look at the first one and you
don't know whether that character is an A or a star.

And then there's a reverse eye test.  See the second and third 
lines there, and convince yourself, assuming you know nothing about 
Latin alphabets, as to whether those are the same string or same 
letters or not.  Because this is the problem that you're going to 
get into when you're seeing characters in scripts and strings that 
you're not familiar with.  And it's a problem when people are not 
used to looking at Latin characters, when the fonts get fancy.  
People keep carrying out tests in which they say 'these things are 
confusable or not confusable' when they're looking at things in 
fonts which are designed to make maximum distinctions.  When people
get artistic about their writing systems, they're not trying to 
make maximum distinctions, they're trying to be artistic.  And 
artistic-ness is another source of ambiguity for people, as to 
whether two things are the same or different.  

We have other kinds of equivalence problems.  To anyone who looks 
closely, or who is vaguely familiar with Chinese, simplified 
Chinese characters do not look like traditional Chinese characters.
But they're equivalent if it's Chinese.  If it's Japanese or Korean 
instead, one of them may be completely unintelligible, which means
they are not equivalent anymore.  

As a consequence of some coding decisions which unicode made for 
perfectly good reasons, there are characters in the Arabic script 
with two different code points but which look exactly the same.  So
the two strings seen there, which are the name of the Kingdom of 
Saudi Arabia, look identical, but would not compare equal if one 
simply compared the bytes.  

Two strings, same appearance, different code points.  Little 
simple things like worrying about whether accents go over Es do not
get caught in the same way these things do.  This is another 
equivalence issue.  What you're looking at are digits from zero 
to 9 in most cases and from one to 9 in a few.  Are they equivalent?  

Well, for some purposes, numbers written in two different scripts 
are the same.  For other purposes, they're not.  

We've seen an interesting situation with Arabic, in that input 
method mechanisms in parts of the world accept what the user thinks 
of as Arabic-Indic digits going in but encode European digits.  
When they decode, they treat the situation as a localization matter, 
so users see Arabic digits going in and coming out.  But if we 
compare them to a system in which the actual Arabic-Indic digits 
are stored, we get 'not equal'.  

We've also got in unicode some western Arabic-Indic digits and some 
eastern Arabic-Indic digits.  They look the same above three, but 
below three they look different, and all of the code points are 
different.  Are they equal or not equal?  
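
As a rough Python sketch of that point (the particular digits are 
just examples): the western and eastern forms of a digit carry the 
same numeric value but are distinct code points, so a plain 
code-point comparison says they are not equal.

    import unicodedata

    a = "\u0664"   # ARABIC-INDIC DIGIT FOUR
    b = "\u06F4"   # EXTENDED ARABIC-INDIC DIGIT FOUR
    print(unicodedata.name(a), "/", unicodedata.name(b))
    print(unicodedata.digit(a), unicodedata.digit(b))  # both 4
    print(a == b)                                      # False: different code points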

And if you think that they're digits and if you think, as we've 
said several times this week in various working groups and over the
last several years, that user facing information ought to be 
internationalized, remember that we show IP addresses in URLs which
users look at and sometimes type, so now assume you see some of 
these Arabic digits, two or three of them, followed by a period.  

And then you see another two or three Arabic digits followed by a
period.  And then see another one, two or three Arabic digits 
followed by a period and then see another one, two or three Arabic
digits.  Is that an IPv4 address?  Or domain name?  And if it's an 
IPv4 address, do you know what order the octets come in?  
   
The difficulty with all of these things is that they're funny.  
And then you catch your breath and say, he's not kidding.  
These are real, serious problems, and they don't have answers, 
except from a lot of context and a lot of knowledge.  Our problems
arise not when we're working in our own scripts, but in somebody 
else's.  So now we come back to the place where people started 
becoming aware of these problems - with internationalization of the
DNS.  If I can make a string in one script, or partially in one
script, look like a string in some other script, I suddenly have an 
opportunity, especially if I'm what the security people call a bad 
guy.  But those kinds of attacks need not be deliberate - they can 
be deliberate or accidental, depending on what's going on.  We spent 
a lot of time in the early days of IDNs believing that if only 
we could prevent people from mixing scripts we'd be okay.  The 
example Dave gave shows how far from okay that is.  We're almost at
the point of believing that prohibiting mixed scripts is probably 
still worthwhile, but it makes so little difference if somebody is 
trying to mount an attack that it's really not a defense.  

If you have names in scripts that are not used in the user's area,
and the user is not familiar with them, many scripts become 
indistinguishable chicken scratch to a user who is not used to that 
script.  And all chicken scratches are indistinguishable from other 
chicken scratches, except for certain species of chickens.  

We talked from time to time about user interface design, and 
whether it should warn the user when displaying things from unknown
sources, or strange environments, or mixed scripts, but the UI may 
not be able to tell.  

We're in a situation with many applications these days that we are 
coloring, and putting into italics, and marking, and putting lines 
under or around so many things, that the user cannot keep track of 
what's a warning and what's emphasis and what's a funny name.  

And as was mentioned earlier, some browsers try to fix this problem
by displaying A labels, the xn-- form.  Our problem there is that 
those things are impossible to remember.  And one of the things we 
discovered fairly early is that if we take a user who has been 
living for years with some nasty, inadequate ASCII transliteration 
of her name, and instead of that name written properly in its own 
characters we offer her something completely non-mnemonic, starting 
with X and followed by what Stuart calls line noise, then for some 
reason the user doesn't think that's an improvement.  

We've also recently discovered another problem we should have 
noticed earlier.  There are two strings, or one string depending on 
which operating system you're using, which are confusable with 
anything.  If one of these strings shows up in your environment, 
and you don't have the fonts or rendering machinery to render it, 
the system does something.  It can turn it to blanks, which is 
pretty useless.  But what most often happens is it's turned into 
some character which the system uses to represent characters it 
can't display.  

A string of question marks can either be a string of question marks
or it can be some set of characters for which you don't have fonts.
A string of little boxes can be either a string of little boxes, or 
some six characters for which you don't have fonts, or it can be an 
approximation to a string of question marks.  

And thus two strings in an environment in which you don't have the 
fonts installed can be confusable with anything.  

Now, the question is, what does a user do?  Well, it should be a 
warning to the user that something is strange.  But we know 
something about users from our security experience, which is if we 
pop-up a box which says, 'aha! this is strange - would you like to 
go ahead anyway?' the users almost always do the same thing, which 
is click okay, and go on.  

So, this is the string which can get you what you can't even read.  
And usually, depending on the operating system, trying to copy this 
into another environment by some kind of cut-and-paste will not 
work.  There are a number of colorful ways of it not working, but 
the not working is pretty consistent.  

So, we started talking about mapping before.  In a perfect world we
would have a consistent system that performs the comparison for us.
Now, that sounds obvious.  

In the ASCII DNS, when that was defined, we wrote a rule which said 
matching was going to be case insensitive, and the server goes off 
and does something case insensitive.  Names get stored in case-
sensitive ways, more or less, but queries in one case match stored 
values in another case.  It's all done on the server; nothing else 
changes.  
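
A minimal sketch of that ASCII-era rule, just for illustration: the 
server only has to treat 'A'..'Z' and 'a'..'z' as the same (a strict 
implementation would fold only those 26 letters; lower() is used 
here for brevity on ASCII input).

    def labels_match(query: str, stored: str) -> bool:
        # ASCII case-insensitive match in the spirit of the original DNS rule.
        return query.lower() == stored.lower()

    print(labels_match("Example", "EXAMPLE"))   # True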

If you don't have intelligent mapping on the server, and you want 
to try to simulate it - which is what we've been trying to do over 
and over again in the international environment, where we're trying
not to change the server, or how we think about things, very much - 
one of the possibilities is to map both strings into some pre-
defined canonical form and compare the results.  
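
A minimal sketch of that 'map to a canonical form, then compare' 
idea; the choice of NFC plus case folding here is just one possible 
convention, not a recommendation, and it already shows the loss of 
information discussed below.

    import unicodedata

    def canonical(s: str) -> str:
        # One possible canonical form: Unicode NFC plus full case folding.
        return unicodedata.normalize("NFC", s).casefold()

    print(canonical("Stra\u00dfe") == canonical("STRASSE"))   # True after mapping
    print("Stra\u00dfe" == "STRASSE")                          # False on raw code points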

That sort of works.  It doesn't permit matching based on close-
enough principles or something fuzzy, and that's right in some 
cases and terribly wrong in others.  But when we start converting 
characters, we lose information.  If we convert a visual form of one
variety into another form which is more easily understood, that's
fine for matching purposes.  

But if we need to recover the original form, and we've made the 
conversion, we may be in trouble, depending on what we've done.  
The mapping process inherently loses information when we start 
changing one character into another one.  

Sometimes it's pretty harmless.  Case conversion may be harmless or
not harmless, depending on what it is you're doing.  Converting 
between half-width and full-width characters is normally harmless, 
depending on what you're doing.  Unicode has normalization 
operations which turn strings in one form into strings of another 
form, making the E-with-accent character and the E followed by an 
over-striking, non-spacing accent into the same kind of thing so 
they can be compared.  Usually safe.  
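
A small Python illustration of that normalization step: the 
precomposed e-with-acute and the e followed by a combining acute 
differ as code points, but compare equal after NFC normalization.

    import unicodedata

    composed   = "\u00e9"        # é as a single code point
    decomposed = "e\u0301"       # e + COMBINING ACUTE ACCENT
    print(composed == decomposed)                       # False
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))     # True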

Unicode has other operations which take characters which somebody 
thought were perfectly valid independent characters and turn them
into something else, because somebody else thought they weren't 
independent and valid enough.  If that conversion is taking a 
mathematical-script lowercase A and turning it into a plain A, 
it's probably safe.  If it's taking a character which is used in 
somebody's name and changing it into a character which is used in 
somebody else's name, it's probably not such a hot idea.  
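
For example (a sketch using Python's unicodedata; the particular 
characters are just illustrative), the compatibility normalization 
NFKC folds a mathematical-script 'a' and a fullwidth 'A' into the 
plain Latin letters:

    import unicodedata

    print(unicodedata.normalize("NFKC", "\U0001d4b6"))   # MATHEMATICAL SCRIPT SMALL A -> 'a'
    print(unicodedata.normalize("NFKC", "\uff21"))       # FULLWIDTH LATIN CAPITAL LETTER A -> 'A'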

And the difficulty is that we try to write simplified rules that 
get all these things right, and there are probably no such rules.  

So the mapping summary is, making up your own mapping system is 
probably not a very good idea.  People who are experts in it, who 
have spent years worrying about how to get it right, can't get it 
right either, because there is no single right answer.  It depends 
on context.  And finding the correct mapping for a particular use 
very often depends on the language in use, and very often, when 
we're trying to do these comparisons - DNS is a perfect example, 
but not the only one - we don't know what the language is which is
being used.  If you need language-dependent mapping, and you don't 
know the language, you're in big trouble.  If you use a non-
language-dependent mapping in an environment where the user expects
a language-dependent mapping, you can expect the user to get upset.  
In an international world, upset users are probably fate, but we 
need to get smarter about how we handle them.  

(Dave Thaler)  Our next topic is the issue of encoding, which is 
the topic that our working draft focuses on.  

So, if we look at some of the RFCs that we have right now, and we 
can step back and construct a simplified architecture, this is the 
simplified version.  We are on a host, we have an application, it 
sits on top of and uses the DNS resolver library.  That's our over-
simplified model.  There are two problems.  And by the way, the 
IDNA work, for example, talks about inserting the punycode encoding 
algorithm in between those two.  
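
As a sketch of where that conversion sits: in Python, for instance, 
the built-in 'idna' codec (which implements the older IDNA2003 
rules) applies the punycode step per label, turning a Unicode name 
into the A-label form and back.  The name used here is just the 
usual illustrative example.

    name = "b\u00fccher.example"          # "bücher.example"
    wire = name.encode("idna")            # b'xn--bcher-kva.example'
    print(wire)
    print(wire.decode("idna"))            # back to the Unicode form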

The two problems with this over-simplification: one, DNS is not the
only protocol.  Different protocols use different encodings today - 
and I'll get to this in a second...
  
And the second problem is that the public Internet name space in 
DNS is not the only name space.  As John mentioned earlier, the 
Chinese TLDs are not in the public root.  And different name spaces, 
as we'll see, use different encodings today.
  
So this is the more realistic, more complicated version of that 
previous picture.  On a host, you have an application.  That sits 
on top of some name resolution library, such as sockets or 
whatever.  Between those, they communicate with whatever the native 
encoding is of the operating system of choice.  UTF-8 and UTF-16 
are most common.  Underneath the name resolution library, you have 
some variety of protocols, and there's the union of different 
things that exist on various operating systems.  

And then this host is attached, for example, to multiple local 
LANs, each of which may or may not be connected to the public 
internet.  And it may also be connected to a VPN, for example.  
Each of these is a potentially different naming context that you 
can resolve names in.  

So let's talk about problem No. 1 first, which is a multitude of 
name resolution protocols.  Now, it turns out that many of these 
are actually defined to use the same syntax.  What that means is if 
somebody hands you a FQDN, this thing with dots, you cannot tell 
what protocol is going to be used.  It might be resolved by looking
in your host file.  It might be resolved by querying DNS, or 
resolved on the local LAN by some other protocol.  Each of these is 
defined to use the same type of identifier space, the same syntax.  

And so what happens, the name resolution library takes a request
from the application and tries to figure out where to send it, 
which protocol or protocols to try, and in what order?  And of 
course, if you have different implementations of different 
libraries that end up choosing different orders, you get 
interesting results.  

To make it more difficult, different protocols specify different 
encodings, and so when you put those things together, that means 
the application can't tell which encoding - or, in the case of 
multiple name resolution protocols being tried, which *set* of
encodings - is going to be attempted, because that's a decision 
made by the name resolution library.  

Let's talk just for a moment about the history of what is a legal 
name.  All right.  The name resolution library gets something - 
is that a legal name?  What's that something?  Let's briefly 
walk through the history, to understand where the world 
is at today.  Back in 1985, RFC 952 defined the names in the host 
file.  They may be internet host names, gateway names, domain names, 
or whatever.  This is the one that said a name contains ASCII 
letters, digits and hyphens, or LDH.
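
A minimal sketch of that LDH rule (the exact boundary conditions, 
such as leading digits, have varied over the years, so treat the 
pattern as illustrative):

    import re

    LDH = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?$")

    print(bool(LDH.match("my-host1")))     # True: letters, digits, hyphen
    print(bool(LDH.match("b\u00fccher")))  # False: non-ASCII letter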
  
In 1989 is when DNS came along, published in RFC 1034 and 1035, and 
it includes a section called 'preferred name syntax' which repeats 
the same description of LDH.  The confusion comes from the word 
'preferred' there.  Well, remember, this was before RFC 2119 
language.  Is that 'preferred' a SHOULD or a MUST?  Is preferred 
mandatory?  There's confusion there.
  
That was 1989.  By 1997, 8 years later, we had RFC 2181, which was 
a clarification to the DNS specification, because of a number of 
areas of ambiguity and confusion that were resulting.  These 
are three direct quotes with emphasis added.  First one says 'any 
binary string', whatever, can be used as the label of any resource 
record.  'Any binary string' can serve as the value of any record 
that includes a domain name.  And, as Stuart mentioned, 
applications can have restrictions imposed on what particular 
values are acceptable in their environment.  

Okay.  So, to clarify, the DNS protocol itself places no 
restrictions whatsoever, but users of entries in DNS could place 
restrictions, and many have.  

Now, that was 1997, and in that same year there was work on the 
IETF policy which was published in, I think, January of '98, which
is RFC 2277.  This is the one that Stuart referred to, and here 
are the quotes from there.  The first one you saw earlier - 'protocols 
must be able to use the UTF-8 character set'.  And it then 
continues, 'protocols may specify, in addition, how to use other 
character sets or other character encoding schemes'.  And finally, 
'using a default other than UTF-8 is acceptable.' 

What's also worth pointing out - it's not just what it says, it is
also what it doesn't say.  What it doesn't say is anything about 
case, about the E with accent, about any types of combined 
characters, how things get sorted, etc.  The IETF policy did
not talk about such cases.  

And so, as a result, two unicode strings often cannot be compared 
to yield what you'd expect without some additional processing.  
Now, since protocols must be able to use UTF-8, but could 
potentially use other things, and since the simultaneously produced
DNS RFC said any binary string is fine, that means DNS complies with
the IETF policy.  

So, since UTF-8 in DNS complies with that policy, starting in that 
year people started using UTF-8 in private namespaces.  By private 
namespaces, we mean things like enterprises, corporate networks.  By 
private namespace, again, we mean 'not resolvable outside of that 
particular network, not resolvable from the public internet.'  In 
their own world they go off and use UTF-8, and it became widely 
deployed in those private networks.  Five years after that came 
the work on punycode encoding for use in the public DNS 
name space.  

So, just to summarize here, UTF-8 is widely deployed in private 
namespaces.  Punycode-encoded strings, or A labels, are deployed in 
the public DNS name space.  

Now, within the internationalization community, there have been a 
bunch of discussions on length issues, and I think it's important 
for the wider community to understand.  DNS itself introduces a 
restriction on the length of names: 63 octets per label, 255 octets 
per name (not including a zero byte at the end if you're passing it
around in an API).
  
The point is that non-ASCII characters, as Stuart showed, use a 
variable number of octets in the encodings that are relevant here.  
Now, 255 UTF-16 octets, 255 UTF-8 octets, and 255 A-label octets 
hold different numbers of characters.  So there exist strings that 
can be represented within the length restrictions as punycode-
encoded A labels, but can't be encoded within the same length 
restrictions in UTF-8.  There also exist strings that can be encoded 
in UTF-8, but cannot fit within punycode and get an A label.  So, 
you can imagine some interesting discussions there.  
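
A rough Python sketch of why those length discussions matter: the 
same short label occupies a different number of octets in each 
encoding, so a fixed 63-octet label limit admits different sets of 
strings.  The label is just an illustrative example.

    label = "b\u00fccher"                     # "bücher"
    for form, octets in [
        ("UTF-8",   label.encode("utf-8")),
        ("UTF-16",  label.encode("utf-16-le")),
        ("A-label", label.encode("idna")),    # punycode-based xn-- form
    ]:
        print(form, len(octets), "octets; fits in 63:", len(octets) <= 63)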

Let's recap.  We've talked about multiple encodings of the same 
unicode characters.  There are things we called U labels and A 
labels.  With U, think unicode; with A, think ASCII.  A U label is 
something that is usually written out in its own characters.  A 
labels are things that start with xn--
  
You have different encodings - say, the top form and bottom form -
that are used by different protocols and different networks, even
within DNS: punycode A labels on the public Internet, UTF-8 on 
private intranets.  And there are even different applications that 
pay attention to different RFCs - ones that actually implement the 
IDNA document, and ones that don't.  Because you have all these 
differences across the protocols, networks and so on, you can 
imagine the confusion that results.  If you have one application 
that launches another application and passes it some name or URL 
or whatever to use - the launching application may have access to 
some directory of stuff, you click on something, it launches 
another application and passes it the name - then whether the 
application that just got launched can use the identifier in the 
same way, in general, all bets are off.  It may or may not.  

You may get a failure, may get to some different site than what you
got to from the launching application.  

Similarly, if you have two applications that are both trying to do 
the same thing - two jabber clients, for example - and one happens 
to work and the other one doesn't happen to work, there would be a
switching incentive to say 'all I have to do is switch to the other 
one.'  

So let's walk through a couple of examples of applications that 
have actually tried to do a bunch of work to improve the user 
experience in these cases: 'I have to deal with the multiplicity 
of encodings, and I don't want to get to the wrong place or get 
failures.'  What we found is that some applications have tried to 
improve their algorithms to deal with the multiple-encoding issues 
that the RFCs don't tell you how to handle.  

And so most of the time they actually get it right.  There are a 
couple of corner cases where they don't solve it 100 percent.  
Here's one.  You type in something into an address bar in a 
browser.  And in this example, the 'IDN-aware' application is one
that understands that there exists UTF-8 in some private namespaces
that it's connected to, and punycode in the public namespace it's 
connected to.  And so it knows which networks it's connected to, 
and may have some information about what the names are that are 
likely to appear on them, so it runs some algorithm to decide if 
this is an intranet or internet name.  At this point the string is 
being held internally in memory in, let's say, UTF-16 or UTF-8, 
whatever the native storage of the operating system is.  

In this example, let's say it decides it's going to be an intranet 
name.  So in this case it leaves it in sort-of the native encoding, 
does not run a punycode algorithm, and passes it to the name 
resolution API.  Then it goes to DNS and says we're using UTF-8, 
and sends it to the DNS server in UTF-8.  
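
A hypothetical sketch of that kind of 'IDN-aware' decision; the 
suffix list and the rule itself are invented here purely for 
illustration, and real applications use their own, more elaborate 
heuristics.

    INTRANET_SUFFIXES = (".corp.example", ".internal")   # hypothetical

    def wire_form(name: str) -> bytes:
        # Guess the naming context, then pick the encoding assumed for it.
        if name.endswith(INTRANET_SUFFIXES):
            return name.encode("utf-8")    # private namespaces here assumed to use UTF-8
        return name.encode("idna")         # public DNS expects the A-label form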

If you have host B in this example, if that one has chosen to 
register its name in DNS, in the punycode-encoded form, the A label 
form, if that's the name that actually matches, that's going to 
fail.  If the host is host A, where it's using the same type of 
algorithm as the one on the top, it's going to succeed.  

So the normal expectation is that most of the hosts in that 
environment are all cooperating, or all have the same knowledge or 
configuration, and you actually get to host A.  If instead it's in 
the mode of host B, it will fail.  

Now, let's take a case where the application decided, by looking at 
it, that it's going to be an Internet name.  (Instead of deciding by 
looking at it, it could also just try one or the other.)  In this 
case it runs punycode on it, producing the xn-- form, and this goes 
to the public DNS.  

In this example, let's say that name does not in fact exist in the
public DNS, and so the name resolution API wants to fall back and 
try a local LAN resolution - try mDNS or LLMNR.  In this case, mDNS 
is defined such that the protocol spec says 'if the name is 
registered there and resolvable, you'd better ask for it in UTF-8 or 
you won't get the answer.'  Here, though, the application put it 
into the A-label form first before passing it down, so if mDNS 
puts that out there, it's not going to find a match.  Most of the 
time the application does the right thing in both environments, but 
there are corner cases in both where things will fail.  

The next category is where you have some application that has 
become IDN-aware, and another application that doesn't do anything -
it just takes whatever the user types in and passes it directly to 
name resolution with no inspection or conversion, because the name 
resolution APIs in this example are UTF-16 APIs.  So on the left, 
if this one is IDN-aware, it will convert it using punycode to the 
A label form, and it will go out and find the registration in DNS in 
the punycode-encoded form, whereas the other application passes it 
down in UTF-16.  DNS will convert it to UTF-8, and it will go out 
and not find it.  It doesn't find it - but there actually exist 
unicode code points with those binary strings, and any string could
appear in the DNS.  

So what if the UTF-8 version magically found its way out there, 
either accidentally or intentionally.  They would get to a 
different site than what they expected.
  
Finally, the other category of differences is applications that 
want to say 'I don't know which one it's going to be, so I'm going 
to try them both.'  In some order...  So, consider 2 applications, 
one that decides to try the UTF-8 version first, and one that 
decides to try the A label first.  The one on the left converts it 
to punycode first, to the A label version first, and it goes out 
and finds the P or the A label version.  The other one might try 
UTF-8 first, and again might find a different version in which one
might be unreachable.  So you get non deterministic behavior.  Of 
course, the other one is intelligent too, so if that was 
unreachable you get the reverse.  So this is what applications 
actually do today.  
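
A small sketch of that 'try both, in some order' behavior; two 
clients that differ only in the order of attempts can end up 
resolving the same input to different registrations.  The resolve 
callback here is a made-up stand-in for whatever lookup is actually 
used.

    def candidate_forms(name, utf8_first):
        # The two wire forms an application might try, in its preferred order.
        forms = [name.encode("utf-8"), name.encode("idna")]
        return forms if utf8_first else list(reversed(forms))

    def lookup(name, utf8_first, resolve):
        # Returns whichever registered form the preferred order hits first.
        for form in candidate_forms(name, utf8_first):
            answer = resolve(form)
            if answer is not None:
                return answer
        return None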

So, the basic principle, the basic learning from these - the fact of 
physics, right - is that conversion to an A label, or UTF-8, or 
whatever else is going to appear on the wire, can only be done by 
some entity that knows which protocol or namespace is going
to be used - what encoding is appropriate for that particular 
environment, or that type of resolution.  When an application tries 
to resolve a name, the name resolution layer may try multiple of 
them.  So there's no single right choice at the application layer.  
This leads to two remaining categories of hard issues.

In general, this is about the client (using the term generically, 
whether it's a host or application or whatever), because again, 
while we're using host names in many of our examples, the problems 
we're talking about are not limited to host names.  Some of the ones 
we've talked about today may be unique to host names, but they could 
occur in other identifier spaces - 'may or may not', I should say.  

The first hard issue is the client.  The client has to guess, or 
learn, whatever encoding the server expects.  In many cases it may 
be defined by the protocol, and that's fine.  But if there are 
multiple protocols, it's part of the learning or guessing.  Names 
appear inside other types of identifiers, and each identifier 
type today often has its own encoding conventions.  What is this 
identifier space?  Is it UTF-8?  Is it A label form?  Is it percent
or ampersand form or whatever...
  
And anything that converts from one name space to another name 
space - such as extracting an e-mail address from mail, or 
extracting a host name from a URL - has to convert between those
two sets of conventions.  Now, you might say, well, if they all used
a single encoding, they wouldn't have to do any of this transcoding 
in the movement between layers. 
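
As a sketch of that kind of cross-namespace extraction (using 
Python's standard urllib; the URL itself is made up): pulling the 
host out of a URL moves you from percent-encoded URL conventions to 
the A-label conventions of DNS, and possibly back to Unicode for 
display.

    from urllib.parse import urlsplit

    url = "https://xn--bcher-kva.example/caf%C3%A9"
    host = urlsplit(url).hostname                 # 'xn--bcher-kva.example'
    print(host)
    print(host.encode("ascii").decode("idna"))    # Unicode form for display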

By comparison, that's the easy part.  That's not the hardest part 
of the problem.  That's sufficient only if the only thing you're 
going to do is display it.  All other things besides the encoding 
issue - comparison, matching, sorting - they all require more work.
So just like RFC 952 defined what ASCII characters were legal in a 
host name, we need to define the unicode subsets for other 
identifiers.  

What are the things that are legal?  The optimal subset for one 
protocol or type of identifier may be different from what's optimal
for some other one.  Now, there also exist cases where, based on, 
say, implementation differences, two things visually display 
differently.  Usually, this is due to a bug.  Now, the
problem is nobody agrees which one is the bug or the correct 
behavior.  So that's a hard issue.

Stuart - back to you.  
   
(Stuart Cheshire) Thank you, Dave.  

So, Dave is right - having a single encoding does not solve all our 
problems, although, having lots of different encodings definitely 
does add to them.  This is not news.  We've known this for a while.  

There used to be computers using different character sets, and we 
recognized that if some computers used ASCII, and some used another 
one, and the receiver had to work out which it was, this was not 
going to give a good experience.  So the wire protocols used ASCII 
when they could.  And if you had a computer that used the other one, 
you needed a mapping table so you could convert to the common 
language on the wire and convert back upon reception.  We recognized 
that in 1969, but we seem to have forgotten it now.  

To get out of the current chaos we need to go beyond the 
current recommendation.  Merely supporting UTF-8 as one of many
options doesn't solve the problem.  I think we need to move to a 
world where we only use UTF-8, and when you receive an identifier, 
or you receive a text string off the network, you don't have to 
guess what the encoding is, because there is only one encoding.  
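
A tiny sketch of what 'one encoding on the wire' buys the receiver: 
there is nothing to guess, only validation.

    text = "caf\u00e9"
    wire = text.encode("utf-8")      # sender: always UTF-8 on the wire
    print(wire.decode("utf-8"))      # receiver: decode UTF-8, or reject invalid bytes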

So the summary is, for text that end users see, we want to have 
rich text, and that means unicode.  And for compatibility on the 
wire, that means using UTF-8 to encode those unicode code points.  
The corollary of this is for identifiers that are protocol 
identifiers, that are used for communication between computers to
tell each computer what to do, and aren't seen by end users, it is
much harder to make the argument why those should be unicode, 
because the bigger the alphabet, the more the scope for confusion 
and the more chance of things not interoperating.
  
With that, I'd like to open the mic for questions.  I think we 
should do half an hour for questions on this internationalization 
presentation, and then that will leave half an hour for general 
questions to the IAB.  

We will take new questions at the middle mic and the end mics, and
the in-between ones for follow-ups.


Open Mic:

(Bob Briscoe) A question for John.  How long have we known about 
the security problems?  Because it was sort of disquieting hearing 
about them - this is the Internet, and we ought to be fixing 
these things.  
   
(John Klensin) What do you mean by 'knowing about the security 
problems?'
  
(Bob Briscoe) Well, the problems of being able to spoof one 
character with another, and change fonts, etc.
  
(John Klensin) Since long before this process started.  We've known 
about confusability in characters since we started looking at 
multiple scripts.  We've known about some of these confusion 
problems in titles of things since we deployed MIME with multiple 
character sets, and that would have been in, I'm guessing from 
memory, but something like 1990 or shortly thereafter.  I gave a 
presentation at an ICANN meeting in Melbourne that exhibited some of 
these abilities to write different things in different scripts.  At 
that time, it was a general warning about these things.  

We've certainly seen more subtlety, as we've understood these 
things better.  I used to joke that one of the properties of this 
whole internationalization situation, when one is actually trying 
to use the strings and identifiers, rather than printing them, is 
that every time we looked at a new script, we found a new and 
different set of problems.  It was like going through a field and 
turning over rocks, and each time you found something new.  So I'm
not certain how to answer your question.  

This is just epidemic in an environment where we're suddenly moving 
identifiers from a world in which the maximum number of characters 
we treat as different is around 36, to an environment where the 
maximum number of characters we treat as different is in the range 
of tens of thousands.  

(Bob Briscoe)  I guess my question is, your presentation told us 
about the problems.  If we've known about these problems for 19 
years or so, are there, you know, could we do a presentation on a
solution space?  Is there any solution space?
  
(John Klensin)  Let me give you a different answer - we've had
these problems for somewhere between two and six thousand years.  

(Bob Briscoe)  Time to fix it.  

(John Klensin)  Absolutely time to fix it.  

(Bob Briscoe)  You might do it before something goes to full 
standard.
  
(John Klensin)  The fundamental issues here really rely on two 
things.  One of which is that we can design very, very highly 
distinguishable fonts.  We could possibly design highly 
distinguishable fonts across the entire unicode set, and they
would be so ugly nobody would want to use them.  

We, in theory, could teach everyone about all of these 6,000 
separate languages, and the only slightly smaller number of scripts 
in the world, but that isn't going to happen.  

So the answer to your question is that there's a tremendous amount
of reliance on user interface design here.  And what we need to 
understand is that there's both a problem and an opportunity.  
The opportunity, which is very important, is for people to use the 
internet, in their own script, in their own language in their own 
environments.  That's really important.  

Our problems arise when we start looking at, and operating in, 
environments which one of us doesn't understand.  I'm gradually 
learning to recognize a few Chinese characters, but my ability to
read Chinese or Japanese or Korean is zero.  I don't know about 
you.
  
But if your situation with regard to Chinese characters is the same
as mine, if I send you a message in Chinese characters, we are both 
having a problem.  If I send a message in a script I can read, or 
an identifier in a script I can read, but you can't, you've got a 
whole series of problems.  You can't read the characters, you 
probably can't figure out how to put the characters in a computer 
if you can read them, and you're going to be easily tricked.  And 
we're going to have to learn how to deal with that, just as we've 
had to learn about non interoperability of the human languages.  

If I have a face-to-face conversation with you, using a language 
which only one of us understands, then at a minimum we're going 
to have an interoperability problem.  At a maximum, if I can make
that language sound enough like something that you expect to hear, 
or you can do that to me, then we may have a nasty spoofing 
problem.  

And again, these issues are thousands of years old.  And we kind of 
learned to cope.  And we learn to cope by being careful, and we 
learn to cope by remembering that those little boxes are a big 
warning sign that we may not be able to read something.
  
Many of us have started filtering out any email which arrives in a
script which we can't read, because we know we're not going to be 
able to read it anyway.  And those are the kind of things we do.
It's very, very close to user level.  And I don't think there are 
any easy answers.  

But the alternative to this situation is, say, oh, oops, terrible, 
there might be a security problem so nobody gets to use their own
script and that answer is completely unacceptable.  

(Yoshiro Yoneya)  All right.  From my experience with the
internationalization of protocols, one of the hardest issues is to
keep backward compatibility.  Inventing an encoding is a way to get 
interoperability, or backward compatibility, with an existing 
protocol.  That's the reason why there are many encodings.  So I 
hope to have generic migration guidelines for protocol 
internationalization; that would be very good future work.  
   
(Stuart Cheshire) I think one of the things we need to be careful 
of - it's easy to fall into the trap of saying 'we need to be 
backward compatible, so this ASCII actually means something else.'  
But if the thing at the receiving end doesn't know that it means 
something else, we've not got international text.  We have lots of 
percent signs.  

(Larry Masinter) This is actually a followup about how long we've 
known about the problem.  I'll take some blame.  In 1993, I think, 
there was an internet draft where I proposed internationalization 
of URLs, based on discussions in 1992, when I thought it was a 
simple problem: just use UTF-8, and there would be regular URLs
and internationalized ones.  But I think part of the problem was 
the switch in thinking about these - they weren't names, they 
weren't identifiers, they were locators.  The notion of comparing 
two of them to see if they were the same was not a requirement.  

And at the time, there were no caches.  And so, the notion of 
figuring out whether or not this URL was the same as that one, 
wasn't part of the protocol stack.  And, therefore, some of the 
problems we're seeing, that idea of phishing that would actually 
look at the name and believe something, merely because you saw it 
on your screen, didn't have anything to do with where you were 
trying to go.  That was a requirement that was added after the fact 
without a lot of thought.  

And, if you think about it, we've added on some requirements that 
maybe shouldn't be there.  And so, I think if you look at all of 
your examples, there are still some problems, even if you don't try 
to compare.  But almost all of the problems that you've listed 
really have to do with comparison - comparison of locators, as lots 
of them do.  
  
A lot of the problems have to do with that.  You had a lot of 
things - look at this and that, and are they the same or different?  
And if you didn't have the problem of a user trying to decide 
whether or not they were the same ahead of time, you wouldn't see 
a problem.
  
(Dave Thaler)  Larry, these issues exist when a system decides to
take a label or a string, which can be user input, and compare it
with something which is stored in a database - the classic 
matching and lookup problem, in DNS or otherwise.  And then the 
question is whether the answer to the question of whether or not 
those matched meets user expectations, when what's stored, as far 
as the user is concerned, is off the wall.  There's no way to avoid 
that particular problem, other than requiring the user to have 
universal knowledge of exactly what's stored.  And I do mean 
exactly.  

(Larry Masinter)  No.  I think that if you put human 
communication in the loop - you're going to print something on
the side of a bus that you want people to type into their 
computers - it is your responsibility, at the time that you print 
that on the bus, to do it in a way in which the users will have a 
satisfactory experience.  It is not the responsibility of the 
intermediate system to make up for the fact that the printing was 
something that could be an O or could be a zero, or could be an L 
or a one - you know, you get a password and I can't tell, because 
the font used was bad.  

It's the responsibility of the printer to do that in a way that 
will cause appropriate behavior, and not to choose to print things
that are unrecognizable or have ambiguous forms.  There are lots of 
systems that never go through that phase of translating into a 
perceptual representation and translating back, and expecting that 
to happen.  

So, I think that we can make progress by being more careful about 
what we choose to accept as requirements of the overall 
communication system.  
   
(Dave Thaler)  I just want to comment on one of the things you 
said, about whether most of the problems are due to such and such.
I want to summarize that we actually talked about at least two 
different, big categories of problems.  One category of problems is
when there are multiple unicode strings, in other words, multiple 
sets of unicode code point numbers that can be confused or not, or 
matched or whatever with each other.  There's one set of things 
that are inherent in that, and it's a lot about user interface, 
display, and so on.  

The second is multiple encodings of one set of unicode code points.  
Those are two fairly different sets of problems that we talked 
about tonight.  

(Larry Masinter)  I think if you follow the paths, these differing 
alternate forms that look the same don't fall from the sky.  They
don't appear magically in the middle of the system.  There's either 
some data path that transmits them and along the way is screwing
them up, or there's some human perceptual path along the way that 
involves printing things out or reading them out loud and 
transcribing them in a way that's inappropriate.  

(Dave Thaler)  Pete, do you have a follow-up on this?  

(Pete Resnick)  I do.  I actually disagree with Larry, at one 
level.  We're talking about identifiers being used for user 
interaction, that are also being used for machine interaction, for 
protocols.  And that's inevitably going to get screwed up, because
the stuff that we use for user interaction has variants, it's got  
humans involved.  Once a user has to type and interpret something, 
and there are variations of how it might be typed or interpreted 
based on context, there's nothing to be done.  

What we've done is increased the probability of that happening 
incredibly, going from those 37-odd characters to tens of thousands 
of characters.  I used to be much more in the other camp - ten years 
ago, if you had said to me that today I would say such a thing, I 
would have thought it ridiculous: 'we just have to straighten this 
out by using proper encodings, and this is done.'  

You know, e-mail is no longer reliably delivered because of spam. I 
don't care anymore if e-mail is not delivered because a user cannot 
type in the e-mail address exactly the way I put it on the screen.  
There's no way to make that precise.  If we get unlucky, the person 
who chose that e-mail address gets what they pay for.  

(Stuart Cheshire)  I want to add one clarification to Larry's 
point.  When we talk about comparing strings, we're not talking 
about showing two strings to the user and saying, do you think 
these are the same.  

(Larry Masinter)  That was one category.  

(Stuart Cheshire)  We were talking about, when a DNS server has a 
million names in its own files, and a query comes in for the name 
the user types, the DNS has to go through its own file and work out 
which record that query addresses.  

And you mentioned the subject of phishing, that's not a requirement
that the IETF decided to put on identifiers.  That's something that 
criminals decided would be lucrative for them, and we have to think 
about the consequences.  

(Larry Masinter)  Let me see if I can clarify something.  I'm not 
saying it's not a problem.  I'm just trying to point out where I 
think it is going to be most productive to look for solutions.  And 
that is putting restrictions on what is output or displayed, in such 
a way that it is less ambiguous how to enter it in a way that would 
be reliable.  
   
(Stuart Cheshire) Okay.  

(Larry Masinter) And to focus on that area.  

(John Klensin) You're asking people who design user input and 
output procedures to constrain their designs in a way which makes 
things unambiguous.  My experience with telling designers what they
can and cannot do has been pretty bad.  

(Larry Masinter) Somebody is going to have to do something, and 
trying to patch it somewhere else is not going to be effective.  

(John Klensin) Another way of looking at this is that these 
problems would be vastly diminished if we let no one on the 
internet who wasn't trained to be sophisticated about these kinds
of things.  And while there were times in my life when I probably
would have approved of 'nobody uses a computer unless they pass 
the training course and get a license', I think that's probably 
harder than constraining designers.  

(Spencer Dawkins)  Spencer Dawkins, and probably the least clued-in
person on this topic that has stood up so far.  So I'm thinking the 
kind of questions I would ask would be triage kinds of questions.  
Is this situation getting worse, or have we already hit bottom?  
   
(Stuart Cheshire)  It's still getting worse.  We think of it as an 
educational process in which we continue to learn.  

(Spencer Dawkins)  How much better does it have to get before it's
good?  Before it's okay?  I mean, how much do we have to fix?  

(Stuart Cheshire)  I think with a big identifier name space, there 
are always going to be problems.  Our goal is to minimize the 
unnecessary problems.  

(Spencer Dawkins)  I see e-mails coming through and it's 
disappointing. 
 
(Stuart Cheshire) I think people who are working on this problem 
have job security - sort of like the people working on security and 
antispam and so on.  As long as human languages continue to exist, 
as long as there are humans using the network, the problems will 
exist.  The one way to make them go away would be to remove all the 
humans.
  
(Spencer Dawkins) So, tell me if I've got this right.  That, once 
upon a time, there was ASCII and there was the other system.  And
people on each side wanted to get to the resources on the other 
side.  So, there was a death match, and we picked ASCII and life 
went on.  

Are we in any danger of being able to have that kind of a 
convergence today - I mean, do people worry that they can't get 
places in other scripts and things like that?  Do people see this 
as a problem?  And John has been, you know, demonstrating this on 
napkins and stuff like that for me for a while, just as a curiosity
kind of thing, so I congratulate you guys for managing to scare the
hell out of me yet again.  

But, like I say, I'm kind of curious about that.  So, I'll sit 
down.  

So you asked a question there at the end - I think part of the 
question you're implying is 'how often are people actually running 
into problems today', right?
  
(Spencer Dawkins) Basically, like I said, the ASCII thing is, 
there's a computer I need to get to, and I can't get there.  
There's a computer in Saudi Arabia that I can't type the name of.
How big of a problem is that?  

(Stuart Cheshire) As an example, in some of the cases I showed, 
applications are trying to deal with the fact that there are 
multiple encodings, and it's the corner cases that fail.  People run 
into that, but not very often, so people have done a good job of 
compensating for that.  But we want to keep it as rare as possible.  
And as for the phishing attacks, whenever somebody tries to be 
dangerous, hopefully that isn't accomplished either.  

I'm going to close the mic lines now; we have about 5 more minutes.
Do we have a followup there?  
  
(Bob Briscoe)  Maybe the question could be better posed as 'do we 
think there's sufficient support in the protocols and languages that 
we're standardizing for applications that need to be secure to be 
able to be secure?'  And what I'm thinking is, if you're viewing a 
font and an encoding through an application that's some business, 
you know, important thing - legal, whatever - could the application 
writer say, well, normally in your locale you'd be restricted to 
this, so if anything outside that range comes in, I can warn, et 
cetera, et cetera, and I can sign all your fonts and encodings.  Do 
you think there's enough support there for an application to do 
that?  

(Stuart Cheshire)  I think there is scope for heuristics to spot 
specific behavior, but it's trial and error, and they tend to be 
developed over a long period of time.  When you find something that 
doesn't work, you fix the particular heuristic.  


(Dave Crocker)  So I got up before Pete to ask you, Stuart, about 
the end of your presentation, but my question is predicated on 
exactly the point that Pete was making.  Which is that much of the 
mess right now - well, there are inherent complexities in the 
topics, but most of the mess is a layer violation that we created 
in simpler times, and the simpler times probably helped things a 
lot back then, in terms of making the internet usable, making the 
arpanet usable.  So that e-mail addresses, and later web URLs, and 
to a large extent domain names, had this user interface use and 
this over-the-wire use.  We made a lot of things simple that way, 
but we built the problem we have now.
  
And, we continue to try to maintain the layer violation, and say
that's okay, we have to do that.
  
The end of your presentation didn't phrase it this way, but 
essentially it was saying, no, maybe we really don't, and we 
certainly should try not to.  That is, we should go to a canonical 
over-the-wire representation.  

The piece that, in the little I've touched this area, seems to 
suffer a lot - and it will suffer even without this, but it suffers 
worse - is the difficulty of getting the distinction between the 
user interface, human factors, human-side stuff, and 
the over-the-wire stuff.  

And I totally understand the resistance to it.  But, our job is to
fix problems, and we really need to be careful we don't just 
maintain them.  

In the years that the international stuff has been worked on, 
we've been having to deal with some realities that forced us to 
make decisions that do maintain them.  So I think that your 
suggestion at the end is, I mean, charmingly '70s.  It's 'go back 
to canonical forms over the wire.'  And so the question has to do 
with achievability.  

How do we get there?  Do we get there before we get to IPv6?  Do 
we get there before we retire - well, some of us, anyhow?  I mean, 
it's clearly the right goal.  But is there anything practical about
the goal, and if so, how?  
   
(Stuart Cheshire)  I think moving to UTF-8 does not solve all of 
the problems that we talked about, not by a long way.  But it 
solves one of them: at least we know which characters we're talking 
about when we're trying to decide if they're equal.  

Who will solve it is implementers writing software - they need to
write their software that way - and working groups writing 
standards, which need to specify that.  I think I'm less pessimistic 
than you are about the prospects of moving in a good direction here.  
And in the interest of time, we'll take the last question.  
 
(John Klensin)  Dave, I think the other part of the answer is 
precisely that we have to stop taking the shortcuts - of assuming
that by dropping a mechanism for internationalized characters into 
something which was designed for ASCII only, that's a solution
to the problem.  Occasionally it will be a solution to the problem.
But we may have to start thinking for the first time in our lives, 
seriously about presentation layers and identifiers which work in 
this kind of environment, rather than things that have been patched
for a little bit of internationalization in an ASCII environment.  

And I don't think those problems are insurmountable.  I don't think
the problems of getting serious about localization sensitivity are 
insolvable.  But we need to get serious and start working on them 
at some stage.  

(Dave Crocker)  The layer violation is the reason why the Internet
is successful, and popular.  Right?  

(John Klensin)  Was.  

(Dave Crocker)  Was.  Well, is.  It is.  But I'll put in 
the process plug.  We had a bar BOF in Stockholm, and there was a 
lot of interest in internationalization of resource identifiers - in
taking that document, which is an RFC at proposed standard, forward.  
But having 9 different solutions and 9 different committees for how 
we approach the problem seems like a bad idea.  

And there are a lot of different groups working on their own 
solution for how to go about it.  And I'm hoping we can converge 
into a single, if somewhat interesting, working group.  So, I 
encourage you to consider it - the public-iri list, and I think IRI 
will be the working group.  

(Olaf Kolkman)  All right.  Thank you.  With that, I'll ask the 
rest of the IAB to come up on stage and we'll take general 
questions.  
 
So while the rest of the IAB comes to the stage for the open mic 
session - there was a suggestion yesterday to keep things short.  I 
would like to remind the audience of that; we all have our own 
responsibilities here.  

In previous sessions I wrote a mail to the audience before the 
plenary saying 'if you have a question, please write us a mail.'  
That would help us to actually think about an answer, and answer 
concisely.  And it would help to think about the question a little 
bit.  

That never happened, really.  But I think that might be a mechanism
for short mic lines and intelligent answers to your questions.  So, 
please keep that in mind for the rest of the year: the open 
mic is not the only way you can approach the IAB, or the community.  

With that said, is there anything somebody wants to bring to the 
mic?  Oh, I should introduce everybody.  Let's start at the far end 
with Jon.  



(Olaf Kolkman) Okay.  Thank you.  
   
(Tina Tsou)  So, the document that the IETF produced, which is one 
that is about NATs - what are the IAB's thoughts on it?  
 
(Dave Thaler) It is mostly a repeat of a lot of the same points 
that have been made on the topics.  What the IAB's thoughts are, if
I can sum up what the RFC or the RFC-to-be says, is that the most
important point is to preserve end-to-end transparency.  

Now, there are multiple solutions that preserve end-to-end 
transparency.  So is IPv6 NAT a solution - is it something that's in 
the solution category?  It's possible to do translation in 
ways that preserve end-to-end transparency, it's possible to use 
tunneling, and it may be possible to do other things.  

The IAB statement is that it's important to preserve end-to-end 
transparency.  There is no statement saying there must be NAT 
in IPv6 or not.  So the first main point is that, on the topic of 
whether NAT could be done in a way that preserves end-to-end 
transparency, the document is neither arguing for nor against that. 

The second main point of the document is that 
there exist a number of things that people see as advantages of 
IPv4 NATs, that they use them for - renumbering, all the things 
brought up in the IETF in the past.  Some of them were documented 
previously in RFC 4864, some of them were not entirely there, and so
we elaborated on those.  Those are things that people see as 
requirements for solutions.
  
Today the simplest solution that people see is v6 NAT, but that may
or may not be the only or the best solution.  And so, the second 
point was there are some requirements there, that the community 
needs to work on solutions for.  

That's basically it, to sum up.  Anybody else want to add anything?  
That's what the IAB's thoughts are.  Once you get into ways to 
meet those requirements, that's for the IETF to figure out.  But we 
wanted to comment on what we believe the requirements are, what 
the constraints are, and to what extent NAT does or does not 
meet those requirements.  

(Gregory Lebowitz)  I think the other thing that we tried to make 
very clear in the document was that every time you use a NAT to 
solve one of those problems, you give up something significant.  
And we tried to call out what those things were - the trade-off 
and the cost associated.  
   
(Dave Oran)  Well, I just want to mention that it is somewhat 
difficult to establish transparency in any translation system, but
it is in fact possible.  Where you run into trouble is trying to 
take the simple approach, where you would attempt to confine 
the translation state independently, in individual boxes with no 
coordination of the translation state, and that results in the non-
invertibility of the transformation and the loss of transparency.  
So, something I would encourage the community to do is to look at 
NATs not as simply 'what can I get away with doing, with the minimal 
amount of work, in order to maybe get something that I want', 
because the consequences in negative terms for transparency are 
pretty severe.  And with somewhat more work, translation-like 
approaches may in fact be quite acceptable.  
   
(Olaf Kolkman)  Yes, my apologies for being a little bit fast a 
minute ago in trying to close the mic lines; I see that people have 
queued up.  

Peter, please.  

(Peter Lothberg) Okay.  I'm Peter.  So, there was an obvious reason
why we have IPv4 NATs, and then people made them do all sorts of 
fantastic things.  I think people have them because they want 
more addresses, because they have more things inside their houses - 
and I don't want to go there.  But a major use of them is as some 
kind of gatekeeper, some kind of policy.  It's a policy box that 
sits there and implements whatever policy I decide I want to have 
for people coming into my house.  I look out the door, the doorbell 
rings, how do they look, would I let them in or not.  

So, last time we forgot to do any work in the IETF and we ended up
with a mess.  I heard talks about smart grids, intelligent houses -
and assume for a second that we use IPv6 addresses on all of 
them and they have their own unique address: we still want policy.
And those devices are so small, they probably need somebody to help 
them.  So maybe the IETF, somebody, should go look at this: okay, in 
the future, we still need a policy control device that sits at the 
boundary of something and something, in order to enforce policy, 
to make sure the pool man gets to the pool and the alarm company
gets to the alarm, and vice versa.  And let's get that done before 
people make more kludges.
  
(Gregory Lebowitz) Don't we have those - aren't they called 
firewalls?  

(Dave Oran)  Can I jump in?  So, as somebody who has been skeptical
of most things firewalls do for as long as I can remember, I 
absolutely agree with Peter.  However, in some cases, trying to 
capture the correct policy semantics simply at the individual 
packet layer, as one would with a firewall, runs into many, many, 
many problems that make things in fact worse rather than better.  

If the fire wall simply processes on a per packet basis, there's 
lots of -- if the box attempts to do packet inspection, either 
shallow or deep, I think we're all aware of the problems that that
type of approach goes into.  So there's going to be some kind of 
application intermediate, that's going to be needed for various 
applications to enforce policy.  Get used to it.  Don't try to do 
everything by per packet processing and firewalls, or even worse, 
to try and guess what the correct policy ought to be for an 
application by doing intermediate inspection of packets.  

(Peter Lothberg)  I was thinking more of the Swiss Army knife 
solution here.  I wasn't only thinking of packet inspection.  I was 
thinking of the thing where I actually have my policy stored and 
where all the devices I have in the house actually go - the system 
where it gets stored, the database, the PKI, blah, blah, blah.  

(Dave Oran) Then we agree.  But it doesn't necessarily have to be 
the gateway box that sits physically at the boundary.  

(Peter Lothberg) Correct.  

(Dave Oran) But people don't want to buy many boxes.  

(Peter Lothberg)  Yes, they only want one box that needs to get attacked.  
Right.  

(Remi Despres)  That's a follow-on to the first question.  I think 
you made the point that end-to-end transparency is something 
important.  And of course, I do agree.  Now, yesterday something 
strange happened with reference to this.  Among the interesting 
technologies which are proposed to restore end-to-end transparency 
and move in the right direction, there is one which is in IPv4, the 
extension of addresses with port ranges.  Now, there was a BOF on A 
plus P.  Some of the major birds of a feather - those people who 
are interested in the subject - were not permitted to talk, to 
present I mean, that is, to present their contribution to that.  

And the conclusion was that we would no longer be permitted to talk 
in any group on this approach to end-to-end transparency.  I still 
expect that there will be a reversal of that decision, that it will 
be possible in this area to work on A plus P.  
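
For readers unfamiliar with A plus P (address plus port), a rough 
sketch of the idea under discussion: several subscribers share one 
public IPv4 address, each is delegated a disjoint port range, and 
return traffic is steered by destination port rather than by 
per-flow NAT state.  The address and port ranges below are invented 
for illustration, not a definitive implementation:

    # Hypothetical A+P port-range delegation for one shared IPv4 address.
    SHARED_ADDRESS = "192.0.2.1"
    PORT_DELEGATIONS = {
        "subscriber-a": range(1024, 16384),
        "subscriber-b": range(16384, 31744),
        "subscriber-c": range(31744, 47104),
    }

    def subscriber_for(dst_addr: str, dst_port: int):
        """Steer an inbound packet to the subscriber that owns the port."""
        if dst_addr != SHARED_ADDRESS:
            return None
        for name, ports in PORT_DELEGATIONS.items():
            if dst_port in ports:
                return name
        return None

    print(subscriber_for("192.0.2.1", 20000))  # 'subscriber-b'

Because each subscriber controls its delegated ports end to end, the 
transformation needs no translation state in the network, which is 
the transparency property Remi refers to.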
   
(Dave Thaler)  Part of that is a question for the IESG, and part of 
it is a question for the IAB.  I'll comment on the IAB portion, 
which is about end-to-end transparency versus, say, the evolution 
of the model.  The IAB has another document about what the 
assumptions are, what the impact of changing them is, and whether 
those changes should be made.  Obviously, the whole point of the 
evolution of the IP model is to say that the IP model does evolve, 
but evolution has to happen carefully.  

Architecturally, what I think those at the BOF are trying to weigh 
is this: IPv4 has some inherent problems - we know that.  On one 
hand - and this is not we the IAB, but those in the BOF - they are 
looking at one alternative that might give better end-to-end 
transparency, but more changes (for some definition of changes).  

At the other extreme, there might be less end-to-end transparency.  
There's no single right answer, because architecturally the answer 
is to remove the limitations of IPv4 and go to IPv6; that would be 
the architectural solution which gives you end-to-end transparency 
and preserves the model.  So we see a tussle between two sets of 
requirements that people are trying to meet, but that cannot be met 
at the same time architecturally.  

(Remi Despres)  For the tussle to be resolved, it should be 
possible to talk and explain.  

(Dave Thaler)  That's a question for the IESG, not the IAB.  

(Remi Despres)  Okay.  Thank you for the information.  

(Lorenzo Coletti)  Since it's open mic: as somebody who has 
deployed an IPv6 network and services, that effort started because 
we needed more address space.  If I look at the papers - I read the 
paper about the botnet where researchers got control of the botnet 
and had full access to the machines behind it, and 80% of them had 
private IP addresses.  That's a high number, because it tells us 
that the internet would be dead and buried if we didn't have NAT, 
but let's remember that it was created to fight address shortage.  

You know the old saying: when all you have is a hammer.  Yes, we 
started doing that because of address shortage.  Does it give us 
security?  Well, not really.  Multihoming?  Kind of.  If we want 
those benefits, let's think outside the box and not do it the same 
way.  David was saying this.  People want to do things the same way 
as they're used to, but there's benefit to a clean slate.  

There's a protocol that allows you to do things in very different 
ways - apply security policies through the last 64 bits of the IP 
address, for example.  You can do all of this.  Try to think 
outside the box.  Don't do it the same way, and think of all the 
operational cost that's involved in having different scopes and 
different addresses.  

Finally, we use public IPv6 addresses internally.  And I can tell 
you, it's refreshingly simple to have one address.  You just know 
what's going on.  And all the security benefits that NAT ostensibly 
has - personally, I don't buy them.  And I don't think they would 
be comparable even to the gain that you have when you can actually 
understand things and have a clean, simple design.  
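
One possible reading of the remark about applying policy through the 
last 64 bits of the address is sketched below; the scheme, the 
addresses, and the trusted identifier are hypothetical illustrations, 
not anything specified in the discussion:

    import ipaddress

    # Hypothetical policy keyed on the interface identifier (the last 64
    # bits of an IPv6 address), so it follows the host even if the site
    # prefix is renumbered.
    TRUSTED_IIDS = {0x0000_0000_0000_00aa}

    def interface_id(addr: str) -> int:
        return int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)

    def trusted(addr: str) -> bool:
        return interface_id(addr) in TRUSTED_IIDS

    print(trusted("2001:db8:1::aa"))     # True
    print(trusted("2001:db8:ffff::aa"))  # True - same host, new prefix
    print(trusted("2001:db8:1::bb"))     # False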

(Jon Peterson)  That was not a question, but a contribution to the 
discussion?  

(Lorenzo Coletti) Open mic, right.  

(Jon Peterson)  Certainly, if 80% of the hosts are behind NATs, 
that tells you something about the security that NATs grant.
  
(Erik Kline)  I'd like to add as well: could we possibly make a 
requirement that anybody who wants to implement IPv6 NAT actually 
run a reasonably large network for, say, a year?  Because I'm 
concerned about lots of things being done based on experiments and 
not on valid requirements.  
   
(Jon Peterson) Well, we never did that with IPv4 and still they 
were developed.  

(Erik Kline)  Right.  But everybody by then had several years of 
experience with IPv4 - actual experience.  

(Jon Peterson)  I think that I'm going to disagree with that 
suggestion.  I think it's a dangerous thing, because I guarantee 
you, whoever puts that amount of effort in, will become committed 
to it.  

(Olaf Kolkman)  I'm going to carefully look around.  If there are 
no further initiatives to move to the mic - and I don't see any - 
then, again, very slowly: going, going, gone.  Thank you.  
   

Slides

Agenda and Introduction
IRTF Chair Report
IAB Chair's Report
Internationalization in Names and Other Identifiers