IETF Technical Plenary Session,
November 12, 2009
1. Introduction - Olaf Kolkman
(Olaf) It's now 4:30 and I would like to start in about a minute,
so if everybody can find themselves a seat. So, memorize the note
well.
Ladies and gentlemen, welcome all to the technical plenary,
IETF 76 in Hiroshima. During this meeting, we've got some
supporting tools for people in jabber land, and also for the people
in this room. There's a jabber room for the plenary,
jabber.ietf.org, and the presentations that we will use have been
uploaded. You can find them on the meeting material site - so you
can read along if you're back home. Welcome to those back home.
This is a little bit of an opportunistic experiment: we had the
opportunity to find somebody to transcribe
in English, and you can see that on the screen on the left and
right - what is being said in this meeting. This might help people
who do not capture the accents of some of the speakers or some of
the people at the mic. And we would like to know your experience
with that - if you value this experiment.
When you step up to the mic and ask questions, again, it's
important to swipe your card. I will do so now, so you can also
see who I am. And when you are at the mic, please keep it short
and keep it to the point. It's something that we heard yesterday.
We hope we can establish a tradition.
A short paragraph on the agenda for today. Aaron will start with
his report. I will follow with the IAB's report. And then we will
have a session on internationalization of names and other
identifiers, that's a session led by John Klensin, Dave Thaler and
Stuart Cheshire. After that, somewhere between 7:00 and 7:10 we
will have an open microphone session, which will hopefully not run
past 7:30. It all depends on the conciseness and number of the
questions asked. Without further ado, Aaron...
2. IRTF Chair Report - Aaron Falk
(Aaron) Hello, I'm here to give a brief report on the Internet
Research Task Force (IRTF). My name is Aaron Falk. The short
status - we had four research groups meeting this week here at the
IETF. The Host Identity Protocol (HIP) research group and the
Scalable Adaptive Multicast (SAM) research groups have already met.
Tomorrow the Delay Tolerant Networking (DTN) and Routing Research
Group (RRG) will meet. Another thing we did was to have a review of
the Routing Research Group with the IAB this morning.
We have six IRTF RFCs waiting to be published. Publication is
currently wedged on finalizing the TLP, the trust license, and that
looks like (as was reported last night) it will happen around the
end of the year or early next year. The document defining the IRTF
RFC stream was revised and is now sitting in the RFC editor's queue,
and for those of you keeping track, it was modified so that now the
independent stream and the IRTF stream have common rights
language.
Let's see. What else?
In work - there's been some proposed new work. We've been talking
for quite some time, I guess almost two years, about an RG on
virtual networks, or network virtualization. There's a bar BOF
starting in about 40 minutes, so those of you who are going to that
will miss the fascinating talk that's coming up later tonight,
here in the plenary. But hopefully, this group is finally getting
to discussion of a charter for the research group, so that would be
good progress.
Another topic that's come up, in groups that I've been in a couple
of times this week, has been the internet of things, smart objects, and
so there's starting to be discussion about maybe there should be a
research group that's looking at an architecture for those
technologies and how they fit into the internet architecture. That
is just talk at this point, but I'm just giving the community a heads
up. One thing we want to make sure is that anything that happens
in the IRTF does not slow down the smart grid activities that are
going on. So make sure we work around that.
I'd like to give sort of a quick snapshot of the different research
groups and the energy levels - who is active and who is not. The
groups on the right, the 'active' list, are groups that are
meeting at IETF or elsewhere, or having very active mailing
lists. The colored one at the bottom is the SAM research group,
which has moved to the active column. They met this week, and they
have actually had a few meetings at non-IETF locations with other
conferences.
The quiescent groups, they're not totally inactive; they have
mailing lists going along. The Public Key Next Generation Research
Group, PKNGRG, they had a little trouble getting started, but it
sounds like there's some energy. Do you have a question?
(Richard Barnes) I was just wondering if the PKNG group will be
meeting at IETF 79?
(Aaron) I don't think that's been decided yet. I don't think
there's a planned meeting now, but if you're on the mailing list,
you would hear about it. Is Paul in the room? Can you confirm
that's true?
(Paul Hoffman) That's correct.
(Aaron) Moving right along. So, another thing that I've been
doing with the IRTF reports is to take a couple of research groups
and give a very quick snapshot of the topic area - what the group
is up to, some of their recent work items, to give a flavor of
what's going on in these groups. This is really very cursory, and
just to give you a flavor of some research stuff that's happening
in the IETF and maybe help you discover whether there is interest in
getting more involved.
The first group I want to talk about is Anti-Spam Research Group.
It's an open research group that's looking at anti-spam research.
In particular, at the open problems. It's been hoped that there
would be some standards work that would come out of that, but it's
not been as fruitful as was originally hoped. But there's a wide
range of participation from not only the standards folks that we
see here, but also researchers and other folks who are working in
the area. There's lots of industrial activity going on in this
space. Because anti-spam is a big industry, there are lots of
other activities going on, and so it's important to understand that
the research group is not doing standards work. There is some in
the IETF, in DKIM. It's not a trade group - there are several of
those - I think the large one is MAAWG. And it's not an academic
conference. So, this has really been sort of a discussion of
technical topics in the area, and they've worked on a couple of
documents, but mostly the activity has been on the mailing list.
There's a document they produced on DNS black lists and white
lists that's waiting to be published. And then there's another one
on black list management - a draft that has been circulated for a
while and is sort of waiting to be finalized.
Another topic that's been going on in this research group has been
starting to develop a taxonomy of the different techniques for
fighting spam, and also, of different spamming strategies. You can
see the URL here if you want to check it out on the web. This is
really sort of open for contributions. I think that part of the
motivation for this is that many people have come up with ideas,
often the same ideas repeatedly, for how to solve the spam problem.
And so, it's been described in the past as a set of pre-printed
rejection slips for why your idea won't work, so you can be indexed
into the correct part of the wiki when you have an idea and don't
have to re-circulate threads on the mailing list over and over
again. So I think that would be good work. This has turned out to
be, like the spam problem in general, hard to make progress in.
And the research group - I've heard, and I think the
chair has heard, some frustration as to why they have not done
more. There are a lot of folks doing research in this area and
they're focused on publishing papers, sometimes more so than doing
collaboration in the IRTF. Also, some of these problems are
extremely hard. But one of the values of what's happening on the
research group mailing list is that it's starting to capture some of
the folklore, some of the wisdom that's passed around between
practitioners. There's also misinformation that gets stamped out,
and they're making an effort to capture these things in the wiki.
So the chair asked me to pass along that the mailing list is really
intended for folks in the IETF and elsewhere who have questions
about spam and anti-spam related technologies; this research
group is intended to be a good discussion point for bringing those
topics up.
Okay. So, the other research group that I wanted to talk about is
the Scalable Adaptive Multicast group. I apologize, it's hard
for me to read (the slide) so it's probably hard for you to read.
The concept behind this group, if you look at the pictures, the
bottom one is intended to be a conventional network, hosts at the
edge and routers in the middle. And the goal of the group is
really to enable multicast services, taking hybrids of application
layer multicast, which is easy to deploy among consenting end
systems, and take advantage of either IP multicast or link layer
multicast - any native multicast that might exist.
This is what you see in the pictures: at the bottom you have the
conventional network, then you have the multicast tree, and then at
the top you have a hybrid multicast environment where you have
native multicast in one region, and application multicast in
regions that don't support it. It takes advantage of the AMT
protocol - this is a protocol for tunneling. AMT is Automatic
Multicast Tunneling. That connects multicast enabled
clouds over unicast networks. And this is technology that was
developed in the MBONED environment, and so, it's a way of sort of
gluing these together. So the SAM RG is trying to create a framework
and protocols for integrating these various strategies for enabling
multicast.
There's a bunch of different communities. This work was initially
led out of the Xcast environment, where they have some protocols,
and they've got sort of one point they developed in the space. There's
also P2P overlays and the IP multicast folks, and then applications
include streaming and mobile networks and other kinds of
applications.
They've developed some drafts on a framework. This is just
another illustration of another version of
the same picture where you've got networks that have neighborhood
IP multicast. They might have link layer multicast, application
layer multicast, and they're glued together, and they've developed
a protocol that's got different kinds of joins, IP multicast join,
join-by-gateway, join-by-native-link, and so this is some of the
work that's been going on the longest in the group. And it's
pretty mature, I understand. Another thing that they've been
working on is developing namespace support so that hosts can
directly participate in multicast services. And along with
middleware to make that work. And they've been also trying to
build a simulation environment to allow exploration of a wide range
of networks in this space. They started with a tool called OMNeT++,
and they're extending it to support IP multicast, and then
extending that again to support different kinds of overlay
strategies.
Then finally, to go beyond simulation to test beds, there's some
work that's just being discussed now about building a hybrid
multicast test bed that's started with contributions from the
different participants in the research group. That is, they're
actually globally distributed with the hopes that that will grow
for implementing and exploring some of these protocols.
So in a nutshell that's the status of the IRTF and two of the
research groups, and I am open for questions if anybody has any.
Okay. Thank you very much.
(applause)
3. IAB Report - Olaf Kolkman
As far as the IAB report goes - I love these little crane birds folded
out of paper, so I put one on the picture; they're typical for Hiroshima.
And I enjoyed making them during the social.
Anyway about the IAB. I show this slide every time I open the
session - basically pointing out what we're about. It's very hard
to give a nutshell description of what the IAB is about. But it
has a charter, RFC 2850, and we try to describe as much as possible
on our home page. You can find the current membership there.
There are links to documents, and within the documents section you
can find our minutes. It is the goal to have minutes posted not
more than two meetings behind. We're very bad at meeting that
goal. Just before this meeting, a batch of minutes was published
that were approved earlier this week.
Correspondence... when we talk to other organizations we usually
leave a trail of correspondence and that is published on our web
site as well. Documents are one of our outputs. Recently we
published RFC 5620, the RFC Editor Model. I will be talking about the
model implementation at the end of this presentation.
There are two documents currently in AUTH48: RFC 5694, which is the
P2P architecture definition, taxonomies, examples and
applicability - that is about to be published, there's a final
little thing with a header; the same goes for RFC 5704, Uncoordinated
Protocol Development Considered Harmful. You've heard a
presentation about that in previous IAB reports. That covers those
two very briefly.
There is ongoing document activity. We've been working on a
document considering IPv6 NATs. There was a call for comments from
the community. Those comments have been incorporated into version
2 of this draft and we're about to submit this to the RFC editor
once every IAB member has had a chance to sign off on it. So that will
be going to the RFC editor shortly.
There's another document - IAB Thoughts on Encoding and
Internationalized Domain Names. It's part of the inspiration for
today's technical session. So basically, it's a call for comments.
The technical plenary today is a working session that is based
around this document.
There are a bunch of draft-iab documents that are sitting
somewhere in various states, that didn't get much attention over the
last few weeks, at least not visibly. The draft on IAB headers and
boilerplates - that document has been finished for a long time and
is sitting in the RFC editor's queue, and it refers to the 3932bis
document. We found a way to get it out of there by changing the
reference to RFC 3932bis itself if the situation with 3932bis does
not get resolved pretty soon. So we want to get that out as soon as
possible.
That document basically changes the headers of documents, and
changes some of the boilerplates so it's more obvious if a document
is an Independent RFC or an IETF Stream RFC or IAB Stream RFC.
We're also working, and internally we've been rehashing, this
document that is intended to describe the IANA functions, and what
the IETF needs out of that. An update on that is imminent, and so
is an update on the IP model (on which we've been working, and
which will be uploaded as soon as the queue is open)
A little bit of news. We had a communication with IANA on the way
forward with respect to signing .arpa. We received a plan of action
shortly before the IETF, in which there's a two-phase approach, where
they will proceed with the temporary setup to get .arpa signed in
the fourth quarter of 2009. Given that quarter four of 2009 is still
only six weeks old, we expect it will be signed before the end of
the year, so it's really imminent.
After the design has been finished for signing the root zone, that
same system will be used for signing .arpa. We responded positively
to that plan and we find it's very important to get .arpa signed,
to get its key signing key published in the IANA ITAR and in the
signed root whenever that is available, and make sure there are
secure delegations to signed subzones, and that is now being set in
motion. So this is some progress on that front.
We made a bunch of appointments. We've re-appointed Thomas Narten
as the IETF liaison to the ICANN Board, and related to that, Henk
Uijterwaal for the ICANN NomCom. And finally, we appointed Fred Baker
to the Smart Grid Interoperability Panel. A number of you
were at the smart grid bar BOF yesterday and know what this is
about.
Communication to other bodies... There is an effort underway
within the EU to modernize ICT standardization.
There was a white paper published by the Commission, and we've
reviewed that and basically replied with a number of facts about
our process, so that we at least are sure that there's no
misunderstanding of how the IETF works. We also provided comments
to the ICANN CEO and the ICANN board of trustees on a study that
appeared recently that was about scaling the root. You can find
those comments on our IAB correspondence section of the web page.
Something that is of a more operational nature is the
implementation of the RFC editor model. Just as a recap of the
state of affairs, I've been talking about this previously, we're in
a transition period. We're moving away from ISI as a service
provider, and into an implementation of a model that has been
developed over the last few years. Within this model, we've got
four legs: the RFC Series Editor, the Independent Submissions
Editor, the Publisher, and the Production House. The IAOC is
responsible for selecting the RFC production center and the RFC
publisher, and the IAB is responsible for creating and appointing an
advisory group to help us with the
selection of RSE and ISE candidates. That has all been done.
Looking for the RFC Series Editor is our responsibility, and so is
finding the Independent Submissions Editor; filling both of those
functions within the model is our job.
Where are we with all that? Well, the IAOC, as you heard yesterday,
has awarded the production center contract to AMS and
also the RFC publisher contract to AMS. And the good news here is
that Sandy and Alice are the core members of the production center,
which means, that the continuity of publishing and getting RFCs
online is not in danger. This is the good news so to speak.
As far as the Independent Submission Editor goes, that is the
editor that assesses the technical quality of documents on the
independent stream, we've had significant delays. That delay has
been because we've been focusing on trying to find an RFC Series
Editor. However, we have candidates and we are currently
interviewing and assessing those candidates and we are basically on
track with that now.
As far as the RFC Series Editor function goes, we had a call in
July, not quite half a year ago. Closed nominations, August
fifteenth, and the nominations were provided to the ACEF, the
committee that helps us with assessing the candidates. They
interviewed candidates. They've had long deliberations, and their
conclusion was that there was no suitable match between the
candidates, the functions, and the expectations of the role - those
three variables didn't quite match. And their advice was to seek
somebody to manage the transition, to take a step back, and make sure
that the pieces are in place, and then go for the long term solution.
"Manage the transition" was the advice.
The IAB went over this advice, turned it around a couple of times,
and finally decided that the transitional RFC Series Editor (RSE)
way forward is the best plan, the best way out so to speak. We've
defined that job, and you should have seen the announcement with
job description, and call for candidates earlier last week, mid
last week.
There is an ongoing call for candidates, but the evaluation of
whatever we have will start November 20 or so. In a week. So, why
do we think we will be successful now, or have higher odds of
success? Well, there are a couple of things that are different
from the situation we had on July 8. First, there is less
uncertainty about the state of the production and publication
functions. It is known who is going to execute those functions.
There is capable staff there, there is institutional knowledge
which makes the job easier.
There's also, in the job description, more focus on the
transitional aspects. We've called out
that the person who is going to do this needs to refine the role of
the RSE after the initial transition, so that it is more clear what
the successor will be getting into. There is an explicit task to
propose possible modifications to the RFC editor model in order to
see that things work better, when we go out for the more permanent
function. And, because this is a transitional management job, so
to speak, it calls for a different type of commitment, a different
type of personality, and also a shorter time commitment. So we
hope that the pool is wider, deeper, or of different dimensions.
One of the things that we will not do this time, in looking for
candidates, is to disclose the names of the candidates
publicly. We think that was a mistake and we won't do that now.
As I said, the call for nominations is now open. We will start
evaluation November 23, and we will accept nominations as long as
nobody has been announced.
We believe that this is in line with RFC 5620, the RFC Editor Model,
and the general community consensus. That doesn't mean that we wanted
to leave the community out, but we didn't go back to the community
and ask "is this all okay?", because there is time pressure - ISI
will stop this function December 31st, and on January 1 we will
start implementing this new model. We couldn't afford to lose time.
That doesn't mean we're not listening if you have any comments or
think things should have been done differently.
We have been talking, as Bob Hinden said yesterday, with ISI, and
Bob Braden in particular, about their willingness to extend the
current contract on somewhat of a consultancy basis, so that at least
we have somebody to make sure no balls get dropped, and
somebody who can actually transfer some institutional memory to
whoever gets this job. There are two links on the slides that you
can follow and read.
Finally, it is worth mentioning this time we did not have appeals,
and that basically closes my presentation.
(applause)
With that, I would like to invite John, Stuart, and Dave for a
session on internationalization. And I'll start up the slide set.
4. Internationalization in Names and Other Identifiers
(John Klensin) All right. Good afternoon, everybody.
We've come to share some general ideas about internationalization,
and where we stand, and where we're going. The plenary's goal is
to try to inform the community about this topic. This is not the
first time the IAB has tried to do this. We continue to learn more
and to try to share that with you.
Internationalization is badly understood. It is understood
moderately well by a fairly small number of experts, most of whom
end up realizing how little we actually understand. But it affects
a large number of protocols, a large number of people, and should
affect virtually everything we're doing in the IETF, and anywhere
else there's a user interface.
We've got a new working draft which contains some recommendations
about choices of binds and encodings that we'll talk about.
Current version is draft-iab-encoding-01.txt. It is very much
still in progress.
And more work is needed in this area, both on this document and
about other things, and should continue.
Internationalization is important and timely because a lot of
things are going on around us. Names. Names can have non-ASCII
characters in them. Not everybody writes in Roman characters,
especially undecorated ones. We're seeing trends towards
internationalized domain names. They've been floating around the
Internet, fairly widely deployed since about 2003. Earlier than
that in other than the public Internet. We'll show you pieces of
this later; they can be as simple as a string of Latin
characters with accents or other decoration on some of them, or as
complex as scripts which some of the people in this room can read,
and others cannot. The things you can't read become problems. The
things you can read become opportunities.
We see URLs floating around. Actually IRIs - what we discover is
that a great deal of software doesn't know the difference, and a
great many people know even less about the difference.
We have path names in various kinds of systems that use
internationalized identifiers, and use them in ways which are fully
interchangeable with plain ASCII things, because the underlying
operating systems have been internationalized.
Users want to use the Internet in their own languages. It seems
obvious. It took us a long time to get there. We're not there
yet. At the same time, we've been making progress. The MIME work
which permitted non-ASCII characters in e-mail bodies in a
standardized way was done in Internet time a very, very long time
ago, and has been working fairly successfully since.
In China, IDNs are being used for all government sites. A great
many IDNs are deployed in the .cn domain. Users in China see
domain names as if the top level domains are actually IDNs. We're
told that about 35 percent of the domains in Taiwan are IDNs, and
almost 14 percent of the domains in Korea are IDNs.
There's demand from various parts of the world that use Arabic
scripts and they have special problems because the script runs
right to left under normal circumstances, and most of our
protocols that use URLs and identifiers have been written around
the assumption that things are written from left to right. In
general, right to left is not a problem. Mixing right to left and
left to right produces various strange effects.
(Stuart Cheshire) Thank you, John.
So, I'm going to go over some of the basic ideas and some of the
terminology.
Unicode is a set of characters which are represented using
integers. There's actually about a million of them, but most of
those are not used. Most commonly used characters fit in the first
65,000. Most of even the less commonly used ones fit in the first
200,000 or so, but the unicode standard defines up to about a
million.
These are abstract integers that, for the most part, represent
characters. I say 'for the most part' because there are
variations. You can have an accented E as a single character, or the
E and the combining accent as separate characters. But ignoring those
details, roughly speaking it's a set of characters with numbers
assigned to them.
Now, you can write those numbers on paper with a pen, or on a
blackboard with chalk. When we use them in computer systems, we
need some way to encode them. And it's easy for us to forget that,
but the encoding is important.
Here are three of the common encodings: UTF-32 is 32 bits in
memory, in the normal way of representing integer values. That
means there are endian issues. UTF-16 is slightly more compact,
because the majority of characters fit in the first 65,000. That
means most unicode characters can be represented by a single 16-bit
word, so that takes half the space. Similarly, there are endian
issues with UTF-16. UTF-8 uses a sequence of 8-bit bytes to encode the
characters. And UTF-8 has some interesting properties, so I'm
going to talk a bit more about that.
The IETF policy on character sets and protocols specifies that all
protocols starting in January 1998 should be able to use UTF-8.
Why is that? Why do we like UTF-8? Well, UTF-8 has a useful
property of being ASCII compatible. And what that means is that
the unicode code points from zero to 127 are the same as the
ASCII code points. So, decimal 65, hexadecimal 41, represents an
upper case A, both in ASCII and unicode.
I'm talking about integers here. When you represent that integer
unicode value using UTF-8, you use the same byte value for code
points up to 127. This may seem very obvious, but it's an important
distinction between the integer value and how you represent it in
memory.
The property of this is that if I have an ASCII file which is clean
7-bit ASCII, I can wave a magic wand and say that's actually UTF-8,
and it is actually valid UTF-8 and not just 'valid', but valid with
the same meaning - it represents the same string of characters.
For files that already have other meanings for the octet values 128
and up, like Latin-1 (ISO 8859-1), that property is not true, because
those code values have already been given other meanings. But for
plain ASCII, UTF-8 is backwards compatible. UTF-8 uses the octet
values above 127 to encode the higher numbered code points, and
I'll explain how that works.
So, in blue, we have the unicode characters that are the same as
ASCII characters. It's just a single byte in memory. In the
middle, we have the green ones, and those are the octets that
start with the top two bits being one, or the top three or the top
four.
And when you see one of those, that indicates the start of a
multi-byte sequence for encoding one unicode code
point. On the right in purple, we have the continuation bytes,
which all have the top two bits one-zero.
The nice property of this is, by looking at any octet value in
memory, you can tell whether it's a stand-alone character, the
start of a sequence, or something in the middle of a multi-byte
sequence. This is how they look in memory.
So, we have the ASCII characters standing alone, we have the
two-byte sequence where we start with the 110 marker, we have the
three byte and the four byte sequence.
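
As a rough illustration of those byte patterns (a sketch, not part
of the talk), the following Python snippet prints the UTF-8 bytes of
one-, two-, three- and four-byte characters in binary, so the
lead-byte and continuation-byte patterns are visible:

    # Sketch: lead bytes start 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx;
    # continuation bytes always start 10xxxxxx.
    for ch in ("A",            # U+0041, one byte
               "\u00e9",       # U+00E9, e with acute accent, two bytes
               "\u3042",       # U+3042, Hiragana A, three bytes
               "\U0001F600"):  # U+1F600, four bytes
        print("U+%04X" % ord(ch),
              [format(b, "08b") for b in ch.encode("utf-8")])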
UTF-8 has some nice properties. Part of being ASCII compatible is
that UTF-8 encoding results in no zero octets in the middle of the
string. That's useful for standard C APIs that expect null-
terminated strings.
The fact that the bytes are all self-describing makes it robust to
errors. If there is corruption in the data, or data is copied and
pasted, inserted and deleted, maybe by software that doesn't
understand UTF-8, it is possible to recognize that. If I give you a
megabyte of UTF-8 text, and you look at a byte in the middle of the
file, then you can tell whether you've got a stand-alone character.
If you look at the byte and the top two bits are 10, you know you're
in the middle of a sequence, so you have to go forward or back, but
you don't have to go very far before you can re-synchronize with
the byte stream, and you know how to decode the characters
correctly. That is in contrast to other encodings that use escape
characters to switch modes, where you really have to parse the data
from the start to keep track of what mode you're in.
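
A minimal sketch (not from the talk) of that re-synchronization
idea: because continuation bytes always have the form 10xxxxxx, a
scanner only has to skip a few bytes to find the next character
boundary.

    # Sketch: find the next UTF-8 character boundary at or after 'offset'.
    def next_boundary(data, offset):
        while offset < len(data) and (data[offset] & 0xC0) == 0x80:
            offset += 1        # skip continuation bytes (top bits 10)
        return offset

    data = "\u65e5\u672c\u8a9eABC".encode("utf-8")  # three kanji, then ASCII
    print(next_boundary(data, 1))   # 3: offset 1 is mid-character
    print(next_boundary(data, 9))   # 9: 'A' is already a boundary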
Another nice property of UTF-8 is because it has this structure,
you can tell with very high probability looking at a file whether
it's valid UTF-8, or whether it is something else like Latin-1.
One of the properties is that if I see a byte above 127 in a file,
it can't appear by itself, because that must be part of a multi-
byte sequence. So there have to be at least two, and the first one
has to have the top two or three or four bits set, and the later ones
have to have the top bits be one-zero. So the probability of 8859-1
text happening to match that pattern goes down very quickly for
all but the shortest files.
Another useful property is that a simple byte-wise comparison of
two strings of UTF-8 bytes, using standard routines, results in them
sorting in the same order as sorting the unicode code points as
integers. This is not necessarily what humans would consider to be
alphabetical order, but for software like quicksort that needs an
ordering of things, this is a suitable comparison, which results in
consistent behavior whether you're working with unicode code points
or UTF-8 bytes.
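
A small sketch (not from the talk) of that ordering property in
Python: sorting by the raw UTF-8 bytes and sorting by the code point
values give the same result.

    # Sketch: byte-wise order of UTF-8 strings matches code-point order.
    words = ["zebra", "\u00e9clair", "apple", "\u3042"]
    by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
    by_code_points = sorted(words, key=lambda s: [ord(c) for c in s])
    print(by_bytes == by_code_points)   # True - consistent, if not "alphabetical"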
One of the criticisms that's often raised against UTF-8 is that
while it's great for ASCII - one character is one byte - and it's
pretty good for European languages, since most European characters
fit in two bytes, for Asian languages they often take three or four
bytes per character. And this has led to a concern that it results
in big, bloated files on disk. While that may have been a concern
10 or 20 years ago, I think in today's world, there are different
trade offs we have to consider. One thing is that everybody
inventing their own encoding, which is locally optimal in some
particular context, may save a few bytes of memory in that context,
but it comes at a big price in interoperability. And when I talk
about different contexts here, I don't mean just geographically
different places around the world, or different languages, but
context like e-mail doing one thing and web pages doing a different
thing. Also in the context of applications and working groups, we
have a tendency for each community to roll their own solution that
they feel meets their needs best, which is different from that of
other people, and we have a lot of friction at the boundaries
when you convert between these different protocols that are using
different encodings. We'll have some more examples of that later.
Another aspect is that on most of our disks today, most of that
space is taken up with images, audio, and video. Text
actually takes a very small amount of space. When you view a web
page, most of the data that's coming over the network is JPEG
images. If you're looking at YouTube, almost all of it is the
video data. Those images and video are almost always compressed,
because it makes sense to compress them.
Ten years ago there were web browsers that would actually gzip the
HTML part of the file to make the download faster. I don't believe
anybody worries about that anymore, because the text part of the
web page is so insignificant compared to the other media that it's
not that important.
Another interesting observation here is, with today's file formats
like HTML and XML, quite often the machine-readable markup
tags in that file, which are not there for end users to ever see,
they're there to tell your web browser how to render the text -
those tags really are just bytes in memory. They have
no human meanings, but it's convenient that we use mnemonic text,
so we use ASCII characters for things like title and head and body.
And even in files containing international text, a lot of that
markup is ASCII. And I have had discussions very much like this with
engineering teams at Apple. All the applications that Apple ships
are internationalized in multiple languages, and inside the
application - which you can see for yourself if you control-click
on it and open it up to see the contents - there are files that
contain all of the user interface text in different languages.
And we had the debate: should it be in UTF-8 or UTF-16? Clearly,
for western European languages, UTF-8 is more compact. But the
argument was that for Asian languages it would be wasteful. So I did
an experiment.
This is the file path, you can try the experiment for yourself. I
had a look at that file. In UTF-16 it was 117K. In UTF-8,
it was barely half the size. This is the Japanese localization.
I'm thinking, how can that be?
I was expecting it to be about the same size or a little bigger,
but I wasn't expecting it to be smaller. When I looked at the file,
it's because of this...
The file is full of these key equals value pairs. And all that
text on the left is ASCII text. And the Japanese on the right may
be taking three or four bytes per character, but that's not the
only thing in the file. So, I believe that the benefits we get
from having a consistent text encoding, so we can communicate with
each other, are worth paying whatever performance or size overhead
there might be. And as this example shows, there may not be a size
overhead in many cases.
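
A rough way to reproduce that observation (a sketch with a made-up
strings-file line, not Apple's actual file): ASCII keys and
punctuation stay at one byte each in UTF-8, so the UTF-8 form of a
typical key = value localization file can come out smaller than
UTF-16 even when the values are Japanese.

    # Sketch: hypothetical key = value line, repeated to simulate a file.
    line = '"CANCEL_BUTTON_TITLE" = "\u30ad\u30e3\u30f3\u30bb\u30eb";\n' * 1000
    print(len(line.encode("utf-8")))     # ASCII chars cost 1 byte, kana cost 3
    print(len(line.encode("utf-16-le"))) # every character costs at least 2 bytes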
So that's UTF-8. But we know that not everything uses UTF-8. So
the other thing we're going to talk about is punycode, which is what's
used in international domain names.
Now, this is not because the DNS can't handle 8 bit data. The DNS
protocol itself perfectly well can. But many of the applications
that use DNS names have been written assuming that the only valid
DNS names contain letters, digits and hyphens. So in order to
accommodate those applications, punycode was invented. And whereas
UTF-8 encodes unicode code points as octet values in the range from
zero up to hex F4, punycode restricts itself to a smaller range of
values, listed on the slide. And those are the byte values that
correspond to the ASCII characters hyphen, digits and letters.
So, what that means is that when punycode encodes a unicode string
you get out a series of bytes, which if you interpret them as being
ASCII, look like a sequence of characters. If you interpret them
as being a punycode encoding of a unicode string, and do the
appropriate decoding, and then display it using the appropriate
fonts, they look like rich text.
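
A quick sketch of that dual interpretation (not from the talk),
using Python's built-in 'idna' codec, which implements the older
IDNA2003 mapping on top of punycode:

    # Sketch: the same label seen two ways.
    label = "b\u00fccher"          # "bucher" with u-umlaut, as the user sees it
    wire = label.encode("idna")    # b'xn--bcher-kva': letters, digits, hyphens
    print(wire)
    print(wire.decode("idna"))     # decoded back to the rich-text form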
So this is a subtle point. We have the same sequence of bytes in
memory or on disk or on the wire in the protocol, and it has two
interpretations. The bytes can be interpreted as letters, digits and
hyphens - not particularly helpful, as it kind of looks like opening
a JPEG in emacs. You see a bunch of characters, but that doesn't
really communicate what the meaning of the JPEG is. Or, the
letters and hyphens can be interpreted as punycode data which
represents a unicode string. Let me give you another example of
that.
Does this look like standard 7 bit U.S. ASCII or not? Let me zoom
in. We'll do a hum. Who would say this is 7 bit ASCII? Can I
have a hum? Who would say this looks like rich unicode text? Hum?
Okay. Let me zoom in a bit closer. This is a plain ASCII file.
In fact it only contains Xs and spaces. You can edit this file in
vi if you want.
So, the same data has two interpretations. Seen from a sufficient
distance, it looks like Chinese characters, but it can also be
interpreted as Xs and spaces. So the meaning of this text depends
very much on how you choose to look at it. But I would argue that
editing this file in vi would not be the most efficient way of
writing Chinese text.
So this problem, that the same byte values in memory can be
interpreted in different ways, really plagues us today. Just a few
days ago I was buying a hard disk on Amazon and got
these upside-down question marks. I think those are supposed to be
dashes. This isn't even in complicated script systems - this is
just the characters that any English or American reader would
expect to use in plain text.
I remember when I had my first computer, an Apple IIe, and it could
only do upper case. And then my next one, the BBC micro, had lower
case. And the next one which was a Macintosh, in about 1985, could
actually do curly quotes, and I could write degrees Fahrenheit with
a degree symbol, and I could do em dashes, and I could do Greek
alpha signs. I could write not-equals as an equal sign with a line
through it, the way I did in school when I was writing with a pen -
not as an exclamation point and an equals sign. We have done it that
way for so long, we forget: not-equals is an equal sign with a slash
through it. So by 1986 we
had gone from typewriter to some fairly nice typography where I
could type what I wanted on my Mac. And here we are 10 years later
and things seem to have gone backwards. I'm not happy. How do we
solve this problem?
We make the user guess from 30 different encodings - "what do you
think this web page might be?" This is not something that we want
to impose on users. The average end
user is not even qualified to understand what they're being asked here.
So, international domain names don't only appear on their own.
They appear in context. And here are some examples. They can
appear in URLs, they can appear in file paths on windows. Of all
these different encodings, which most of the people in this room
would probably recognize as meaning the same thing, these are the
only ones that in my mind are really useful, if we have a goal of
supporting international text.
If you asked a child to draw a Greek alpha symbol, and gave her
a pencil and paper, plain pencil and paper, she would draw an alpha
symbol. She would not write % C-something, % something and say that's
an alpha. That's complete insanity. That is not an alpha. An alpha
is this thing that looks like an A with a curly tail on the right
side. If we want to support international text, it's got to look
like international text.
But because we have all these protocols that don't have a native
handling of international text, we keep thinking of ways to encode
international text using printable ASCII characters. And when you
do that encoding, who decodes it? There's an assumption that if I
encode it with percent something or ampersand something, then the
thing on the receiving side will undo that and put it back to
the alpha character it was supposed to be. Well, we got bitten by
this yesterday. We sent out an e-mail announcing this plenary.
This was not staged. This is real. And some piece of software,
somewhere decided that unicode newlines were no good. So it was
going to replace them with the HTML ampersand code for a unicode
newline.
And something on the receiving side was supposed to undo that, and
turn it back into a newline. Well, nothing did and this is what
you all got in your email.
This can get really crazy. Suppose you have a domain name which is
part of an e-mail address, which you put in a mailto URL, which is
then appearing on a web page in HTML text. Is the domain name
supposed to be actual rich text as seen by the user? Or is it
supposed to be punycode? Because it's an email address, and email
uses a printable encoding followed by two hexadecimal characters.
Well, in an e-mail address do we have to do that escaping? And the
e-mail is part of the URL. And the whole thing is going into a web
page, so HTML has its own escaping for representing arbitrary
characters.
Do we use all of these? A lot of people say yes. It's not clear
which ones we wouldn't use out of those four in the nested hierarchy
of containers. If you're looking at an HTML file in your editor,
you are very far removed from having rich text in front of you on
the screen.
So we decided we'd try an experiment. What would happen
if we didn't do all this encoding? What would happen if we just
sent straight 8-bit data over the network? We decided to try this
with email. Now, the SMTP specification says it's 7-bit only, but
we asked the question, what if we disregarded that, and tried it
anyway, to see what would happen?
So, I sent a test e-mail, where I replaced the E in my name with a
Greek epsilon, and the I with an iota, and I sent this e-mail by hand,
using netcat, so it wasn't my mail client doing the encoding. I just
put the raw bytes on to the wire and sent them to the SMTP server
to see how it would handle it. I did it two ways, using the
punycode-encoded representation of that first label of the domain
name, xn-- something that looks like line noise. And I did it a
second time, just using the UTF-8 representation of that, which I'm
showing here as the actual unicode characters.
So to make that really clear, this is the text that I sent using
netcat to the SMTP server. This is the first one, using punycode,
so this whole email is plain 7-bit ASCII. No surprising byte
values in it. The first two lines are the header, after the blank
line, the rest is the body. I point this out because headers are
handled differently from bodies. Header lines are processed by the
mail system. The body by and large is delivered to the user for
viewing.
The second e-mail is conceptually the same thing, except not using
punycode, using just 8 bit UTF-8. So this is the result of the
first test. Not surprisingly, the punycode in the body of the
message was displayed by all the mail clients we tried as line
noise. Which is not surprising, because it's just text in the body
of an e-mail message. There's no way that the mail client really
knows that that text is actually the representation of an
international domain name that's been encoded. We could have some
heuristics where it looks through the e-mail. I would not be happy
about that. Type the wrong thing in e-mail and it magically
displays as something else. That seems like going further in the
wrong direction.
In the from line, where we could argue that the mail client does
know this is a domain name because it's user name angle bracket,
user at example.com, close angle bracket - that is a clearly
structured syntax for an e-mail address, and the mail client knows
how to reply to it. It could conceivably decode that text and say
'this is punycode'. The intended meaning of this text is not the
xn-- line noise, it's a rich text name with epsilons and iotas in it.
One client did that, which was Outlook on Windows.
The second test was the raw 8 bit UTF-8 data. And I'm very happy
to say, in the small set of e-mail clients that we tested, 100%
of them displayed the UTF-8 text in the body in a sensible way.
We had some more interesting results from the from line. Gmail did
this very interesting thing where it clearly received and
understood the UTF-8 text perfectly well, because it displayed it
to the user as the punycode form. I'm not quite sure why.
Possibly for security reasons, because there is concern with
confusable characters, which you will hear about in great detail in
a few minutes. There is concern with confusable characters that
you might get spoofed emails that look like they're from somebody
you know but are really not. Turning it into this punycode form,
at some level, should avoid that. I'm not sure it really does,
because in a world where all of my email comes from line noise, the
chance of me noticing that the line noise is different in this
particular email, I don't know how much of a security feature that
really is. But that may be the motivation.
Eudora 6 is an old mail client, written I think before UTF-8 was very
common. Those characters there are what you get if you interpret the
UTF-8 bytes as being ISO 8859-1. And the last three here, to be
fair, I don't think we should blame the Outlook clients here,
because what appears to have happened is that the mail server that
received the mail went through and whacked any characters that
were above 127 and changed them to question marks. It didn't do
that in the body, you see, but in the header it did do that
pre-processing. So it's unclear right now whether it was the mail
client that did this or the mail server that messed it up before the
client even saw it.
So, back to terminology. Mapping is the process of converting one
string into another equivalent one. And we'll talk a little bit
later about what that's used for.
Matching is the process of comparing things that are intended to be
equivalent as far as the user is concerned, even though the unicode
code points may be different, the bytes in memory used to represent
those unicode code points may be different, but the user intention
is the same.
Sorting is a question of deciding what order things should be
displayed to the user. And the encoding issue has various levels
to it. I've talked today about how to encode unicode code points
using UTF-8. There is also the question that the accented E
character can be represented by a single unicode code point for
E-with-accent, or as the code point for E followed by the combining
accent character.
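
For example (a sketch, not from the talk), Python's unicodedata
module shows the two representations comparing unequal until they
are normalized:

    import unicodedata

    # Sketch: precomposed e-acute versus e plus a combining acute accent.
    precomposed = "\u00e9"      # U+00E9
    decomposed = "e\u0301"      # U+0065 + U+0301 COMBINING ACUTE ACCENT
    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True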
So more terminology. In the IDNA space, an IDNA valid string is
one that contains allowed unicode characters to go into
international domain names, and those can take two forms. The term
commonly used in the IDN community is a U-label: an IDNA-valid
string represented in unicode, by which they mean in
whatever is a sensible representation in that operating system. It
might be UTF-8, it might be UTF-16, but it is one of the natural
forms of encoding unicode strings.
An A-label is that string encoded with the punycode algorithm,
prefixed with xn-- to call out the fact that it is not just a string
of characters in the DNS; it's something that's encoded by punycode,
so you have to decode it in order to get the meaning.
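
In other words (a sketch, not from the talk, and ignoring the extra
mapping and validity rules that full IDNA applies before encoding),
an A-label is just "xn--" plus the Punycode form of the U-label:

    # Sketch: build an A-label from a U-label with the raw punycode codec.
    u_label = "b\u00fccher"
    a_label = "xn--" + u_label.encode("punycode").decode("ascii")
    print(a_label)   # xn--bcher-kva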
So I'll wrap up my part of the presentation with an observation.
When it comes to writing documents, or writing an e-mail to your
family, having the most expressively rich writing tools available
is very nice. When it comes to identifiers that are going to be
passed around, and are used to identify specific things, then
it's not quite so clear. Because the bigger the alphabet, the more
ambiguity.
Telephone numbers use ten digits. And by and large, we can read
those digits without getting too confused. We can hear them over
the telephone. Most people who can work a telephone, can read and
hear the ten digits without getting them too confused. When we go
to domain names, we have a bigger alphabet. We have 37 characters,
and we start to get a bit of confusion. Os and zeros, Ls and ones
and Is - there's a bit of confusion, which is bad, but it's limited
to those few examples. When we move to international domain names,
the alphabet is tens of thousands, and the number of characters
that look similar or identical is much much greater. So with more
expressibility comes more scope for confusion. And I will note,
that while we're going in this direction, of bigger and bigger
alphabets, the computer systems we use went in the opposite
direction. They went to binary. Because when you only have one
and zero, then there's a lot less scope for confusion in terms of
signaling on the wire, with voltage levels. If there are only two
voltage levels that are valid, you're either high or low. If there
are ten that are valid, then a small error might mean reading a 5 as
a 6. So we know that when we build reliable computer systems,
binary has this nice property. So, I leave you with that. And I
ask Dave to come up and tell more.
(Dave Thaler) I'm going to talk about matching first. So earlier
on when we talked about definitions, you probably thought
that matching meant comparing two things in memory. That is
certainly one of the aspects of matching: you look up a database
entry, and I know whether to respond or not. There's another
problem with matching - that is the human recognition matching
problem.
So, let's do another eye test here. We have two strings up here
that could be easily confused by a human. Can you spot the
difference? Hum if you can spot the difference.
Okay. The difference that you can spot here, is that on the left,
this is .com, and on the right this is .corn. It seems like a
great opportunity for some farmer's organization, doesn't it?
This illustrates that even in plain ASCII we have confusion. Now,
some of you who have been participating in the RFID experiment are
aware of another type of confusion. On this slide, these are not
capital Is, they are lower case Ls. More confusion with just
ASCII. But wait, it gets worse.
This is the Greek alphabet. It reads 'Ethiopia', but those are not
the Latin letters E T H I O P I A. If you look at the lower case of
both of those, then they look different - the lower case versions of
the Greek letters there are fairly distinctive. So as a result, we
see the current trend to actually deprecate these various forms and
revert to one standard one, or one canonical one if you will, in
various identifiers such as IRIs and so on. In IDNA2008, some
of these characters are treated as disallowed for these types of
reasons.
Second eye chart, okay. Look up from your computer and stare at
the screen. Hum if you spot the difference. They both look the
same. If you can spot the difference, you may need an eye test.
The difference here is that all the characters on the right are in
the Cyrillic alphabet. There's no visual difference. What's worse
is that in ASCII .py is the TLD for Paraguay. On the right, those
are the Cyrillic alphabet letters corresponding to .ru, which is
Russian.
Now, anybody here who actually speaks Russian or is intimately
familiar with internationalization, will be quick to point out
one important fact - 'jessica' uses letters that are not in the
Russian language. For example, J and S do not appear in the
Russian language. This points out that there are alphabets, and
languages that use a subset of the characters in those alphabets. In
order to get the letters that look like 'jessica', you have to combine
characters from two different languages, but they're both in the
same alphabet.
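
A tiny sketch (not from the talk) of why such lookalikes defeat a
naive comparison - the glyphs may be identical, but the code points
are not:

    import unicodedata

    # Sketch: Latin 'a' versus the visually identical Cyrillic 'a'.
    latin_a = "a"           # U+0061
    cyrillic_a = "\u0430"   # U+0430
    print(latin_a == cyrillic_a)   # False
    print(unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))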
So what this points out is that if you are a registry that is going
to be accepting, say, domain registrations under your zone, then you
may want to apply additional restrictions, such as not accepting
things that look like, or that contain, characters that are not in
your language. If you're .py for Paraguay, and there are certain
characters that you don't want to allow, you can restrict that. This
particular example requires combining characters from two different
languages, and there are other examples that are purely from the
same language: 'epoxy', and on the right, what may be, say, a five
letter acronym for some Russian organization. The problem is
at the human matching layer. Is that the thing you're
looking for? Does that match or not match?
John is going to talk about a couple of more examples. So
hopefully your eye tests have been enlightening.
(John Klensin) We get more interesting problems when we move
beyond the eye tests into another piece of the human perception
problem, which is that people tend to see what they expect to see.
So we have here two strings which look different, but look
different only when they're next to each other. And, the first one
is a restaurant, and the second one is in Latin characters and
something different altogether. But they look a lot alike if
you're not sensitive to what's going on.
In general, if you have a sufficiently creative use of fonts, and
style sheets from a strange environment, almost anything can look
like almost anything else. A number of years ago I came into
Bangkok very late at night, and I was exhausted, and I was being
driven to the city, and I saw a huge billboard and it had
three characters on it and a red, white, and blue background, and
from the characters I was firmly convinced it said USA. Well, it
was in Thai, the characters were decorated, and had I seen them
outside of that setting, maybe I would have understood the
difference. Maybe I would not have.
That brings us to another perception test, which snuck up on me last
month - me and an audience of other people. We were sitting in a
room, two months ago, at an AP meeting, and there was a poster in
the back of the room with the sponsors. And on the poster we had
these three logos. For the first one, pretend that you're not used
to looking at Latin characters. You look at the first one and you
don't know whether that character is an A or a star.
And then there's a reverse eye test. See the second and third
lines there, and convince yourself, assuming you know nothing about
Latin alphabets, as to whether those are the same string or same
letters or not. Because this is the problem that you're going to
get into when you're seeing characters in scripts and strings that
you're not familiar with. And it's a problem when people are not
used to looking at Latin characters, when the fonts get fancy.
People keep carrying out tests in which they say 'these things are
confusable or not confusable' when they're looking at things in
fonts which are designed to make maximum distinctions. When people
get artistic about their writing systems, they're not trying to
make maximum distinctions, they're trying to be artistic. And
artistic-ness is another source of ambiguity for people, as to
whether two things are the same or different.
We have other kinds of equivalence problems. To anyone who looks
closely, or who is vaguely familiar with Chinese, simplified
Chinese characters do not look like traditional Chinese characters.
But they're equivalent if it's Chinese. If it's Japanese or Korean
instead, one of them may be completely unintelligible, which means
they are not equivalent anymore.
As a consequence of some coding decisions which unicode made for
perfectly good reasons, there are characters in the Arabic script
with two different code points but which look exactly the same. So
the two strings seen there, which are the name of the Kingdom of
Saudi Arabia, look identical, but would not compare equal if one
simply compared the bytes.
Two strings, same appearance, different code points. Little
simple things like worrying about whether accents go over Es do not
get caught in the same way these things do. This is another
equivalence issue. What you're looking at are digits from zero
to 9 in most cases and from one to 9 in a few. Are they equivalent?
Well, for some purposes, numbers written in two different
scripts are the same. For other purposes, they're not.
We've seen an interesting situation with Arabic, in that input
method mechanisms in parts of the world accept what the user thinks
of as Arabic-Indic digits going in but encode European digits.
When they decode, they treat the situation as a localization matter,
so users see Arabic digits going in and coming out. But if we
compare them to a system in which the actual Arabic-Indic digits
are stored, we get not equal.
We've also got in unicode some western indic Arabic digits and some
eastern Arabic indic digits. They look the same above three, but
below three they look different, and all of the code points are
different. Are are they equal or not equal?
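A short Python sketch of the digit problem; whether these "should" compare equal depends entirely on the purpose of the comparison:

    import unicodedata

    european = "3"           # U+0033 DIGIT THREE
    arabic_indic = "\u0663"  # U+0663 ARABIC-INDIC DIGIT THREE
    extended = "\u06f3"      # U+06F3 EXTENDED ARABIC-INDIC DIGIT THREE

    # All three carry the numeric value 3, and int() parses any of them...
    print([unicodedata.decimal(d) for d in (european, arabic_indic, extended)])  # [3, 3, 3]
    print(int(arabic_indic), int(extended))                                      # 3 3
    # ...but as strings they are simply different code points.
    print(european == arabic_indic, arabic_indic == extended)                    # False False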
And if you think that they're digits, and if you think, as we've said several times this week in various working groups and over the last several years, that user-facing information ought to be internationalized, remember that we show IP addresses in URLs, which users look at and sometimes type. So now assume you see two or three of these Arabic digits followed by a period, then another two or three Arabic digits followed by a period, then another one, two, or three followed by a period, and then another one, two, or three. Is that an IPv4 address? Or a domain name? And if it's an IPv4 address, do you know what order the octets come in?
The difficulty with all of these things is that they're funny. And then you catch your breath and say: he's not kidding.
These are real, serious problems, and they don't have answers,
except from a lot of context and a lot of knowledge. Our problems
arise not when we're working in our own scripts, but in somebody
else's. So now we come back to the place where people started
becoming aware of these problems - with internationalization of the
DNS. If I can make a string in one script, or partially in one
script, look like a string in some other script, I suddenly have an
opportunity, especially if I'm what the security people call a bad
guy. But those kinds of attacks need not be deliberate; they can be deliberate or accidental, depending on what's going on. We spent a lot of time in the early days of IDNs believing that if only we could prevent people from mixing scripts, we'd be okay. The example Dave gave shows how far from okay that is. We're almost at the point of believing that prohibiting mixed scripts is probably still worthwhile, but it makes so little difference if somebody is trying to mount an attack that it's really not a defense.
If you have names in scripts that are not used in the user's area,
and the user is not familiar with them, many scripts become
indistinguishable chicken scratch to a user who is not used to that
script. And all chicken scratches are indistinguishable from other
chicken scratches, except for certain species of chickens.
We talked from time to time about user interface design, and
whether it should warn the user when displaying things from unknown
sources, or strange environments, or mixed scripts, but the UI may
not be able to tell.
We're in a situation with many applications these days that we are
coloring, and putting into italics, and marking, and putting lines
under or around so many things, that the user cannot keep track of
what's a warning and what's emphasis and what's a funny name.
And as was mentioned earlier, some browsers try to fix this problem by displaying A-labels, the xn-- forms. Our problem there is that those things are impossible to remember. And one of the things we discovered fairly early is that if we take a user who has been living for years with some nasty, inadequate ASCII transliteration of her name, and instead of offering her that name written properly in its own characters we offer her something completely non-mnemonic, starting with x and followed by what Stuart calls line noise, then for some reason the user doesn't think that's an improvement.
We've also recently discovered another problem we should have
noticed earlier. There are two strings, or one string depending on
which operating system you're using, which are confusable with
anything. If one of these strings shows up in your environment,
and you don't have the fonts or rendering machinery to render it,
the system does something. It can turn it to blanks, which is
pretty useless. But what most often happens is it's turned into
some character which the system uses to represent characters it
can't display.
A string of question marks can either be a string of question marks, or it can be some set of characters for which you don't have fonts. A string of little boxes can either be a string of little boxes, or some string of characters for which you don't have fonts, or it can be an approximation to a string of question marks.
And thus two strings in an environment in which you don't have the
fonts installed can be confusable with anything.
Now, the question is, what does a user do? Well, it should be a
warning to the user that something is strange. But we know
something about users from our security experience, which is if we
pop-up a box which says, 'aha! this is strange - would you like to
go ahead anyway?' the users almost always do the same thing, which
is click okay, and go on.
So, this is the string which can get you to something you can't even read. And usually, depending on the operating system, trying to copy this into another environment by some kind of cut-and-paste will not work. There are a number of colorful ways for it not to work, but the not-working is pretty consistent.
So, we started talking about mapping before. In a perfect world we would have a consistent system that performs the comparison for us. Now, that sounds obvious.
In the ASCII DNS, when that was defined, we wrote a rule which said matching was going to be case-insensitive, and the server goes off and does something case-insensitive. Names get stored in a case-sensitive way, more or less, but queries in one case match stored values in another case. It's all done on the server; nothing else changes.
If you don't have intelligent mapping on the server, and you want to try to simulate it - which is what we've been trying to do over and over again in the international environment, where we're trying not to change the server, or how we think about things, very much - one of the possibilities is to map both strings into some pre-defined canonical form and compare the results.
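A minimal Python sketch of that "map to a canonical form, then compare" idea, assuming one particular choice of mapping (Unicode NFC plus case folding); as the discussion below makes clear, whether this is the right mapping at all depends on context and language:

    import unicodedata

    def canonical(s):
        # One possible pre-defined canonical form; other applications may
        # legitimately need a different, possibly language-dependent, mapping.
        return unicodedata.normalize("NFC", s).casefold()

    print(canonical("Example") == canonical("EXAMPLE"))  # True
    # Note the loss of information: the original spelling is not recoverable
    # from the canonical form.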
That sort of works. It doesn't permit matching based on close-enough principles or anything fuzzy, and that's right in some cases and terribly wrong in others. But when we start converting characters, we lose information. If we convert a visual form of one variety into another form which is more easily compared, that's fine for matching purposes.
But if we need to recover the original form after we've made the conversion, we may be in trouble, depending on what we've done. The mapping process inherently loses information when we start changing one character into another one.
Sometimes it's pretty harmless. Case conversion may be harmless or not harmless, depending on what it is you're doing. Converting between half-width and full-width characters is normally harmless, depending on what you're doing. Unicode has normalization operations which turn strings in one form into strings of another form, making the precomposed E-with-accent character and the E followed by a non-spacing combining accent into the same kind of thing so they can be compared. Usually safe.
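For example, in Python, the precomposed and decomposed spellings of the same accented letter compare unequal until both are normalized:

    import unicodedata

    precomposed = "\u00e9"   # é as a single code point
    decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
    print(precomposed == decomposed)                   # False
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))    # True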
Unicode has other operations which take characters that somebody thought were perfectly valid independent characters and turn them into something else, because somebody else thought they weren't independent and valid enough. If that conversion is taking a mathematical-script lowercase a and turning it into a plain a, it's probably safe. If it's taking a character which is used in somebody's name and changing it into a character which is used in somebody else's name, it's probably not such a hot idea.
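The Unicode compatibility normalizations (NFKC/NFKD) are the ones that do this second, more aggressive kind of conversion; a small Python illustration:

    import unicodedata

    fullwidth_a = "\uff41"        # FULLWIDTH LATIN SMALL LETTER A
    math_script_a = "\U0001d4b6"  # MATHEMATICAL SCRIPT SMALL A
    # NFKC folds both of these into a plain ASCII 'a' - harmless here, but the
    # same kind of operation applied to characters that people consider
    # distinct in their names is exactly the "not such a hot idea" case.
    print(unicodedata.normalize("NFKC", fullwidth_a))    # 'a'
    print(unicodedata.normalize("NFKC", math_script_a))  # 'a'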
And the difficulty is that we try to write simplified rules that
get all these things right, and there are probably no such rules.
So the mapping summary is: making up your own mapping system is probably not a very good idea. People who are experts in it, who have spent years worrying about how to get it right, can't get it right either, because there is no "right". It depends on context. Finding the correct mapping for a particular use very often depends on the language in use, and very often, when we're trying to do these comparisons - DNS is a perfect example, but not the only one - we don't know what language is being used. If you need language-dependent mapping and you don't know the language, you're in big trouble. If you use a non-language-dependent mapping in an environment where the user expects a language-dependent mapping, you can expect the user to get upset. In an international world, upset users are probably fate, but we need to get smarter about how we handle them.
(Dave Thaler) Our next topic is the issue of encoding, which is
the topic that our working draft focuses on.
So, if we look at some of the RFCs that we have right now, and we
can step back and construct a simplified architecture, this is the
simplified version. We are on a host, we have an application, it
sits on top of and uses the DNS resolver library. That's our over-
simplified model. There are two problems. And by the way, the
IDNA work, for example, talks about inserting the punycode encoding
algorithm in between those two.
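To make that concrete, Python's built-in "idna" codec (an IDNA 2003 implementation, used here purely as an illustration; the domain is made up) performs exactly this punycode conversion between the Unicode form an application holds and the A-label form that goes to the public DNS:

    name = "bücher.example"
    a_label = name.encode("idna")     # per-label ToASCII, punycode-encoded
    print(a_label)                    # b'xn--bcher-kva.example'
    print(a_label.decode("idna"))     # 'bücher.example'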
The two problems with this over-simplification: one, DNS is not the only protocol. Different protocols use different encodings today - and I'll get to this in a second...
And the second problem is that the public Internet name space in DNS is not the only name space. As John mentioned earlier, the Chinese TLDs are not in the public root. And different name spaces, as we'll see, use different encodings today.
So this is the more realistic, more complicated version of that
previous picture. On a host, you have an application. That sits
on top of some name resolution library, such as sockets or
whatever. Between those, they communicate with whatever the native
encoding is of the operating system of choice. UTF-8 and UTF-16
are most common. Underneath the name resolution library, you have
some variety of protocols, and there's the union of different
things that exist on various operating systems.
And then this host is attached, for example, to multiple LANs, each of which may or may not be connected to the public internet. And it may also be connected to a VPN, for example. Each of these is a potentially different naming context that you can resolve names in.
So let's talk about problem No. 1 first, which is the multitude of name resolution protocols. Now, it turns out that many of these are actually defined to use the same syntax. What that means is that if somebody hands you an FQDN, this thing with dots, you cannot tell what protocol is going to be used. It might be resolved by looking in your hosts file. It might be resolved by querying DNS, or resolved on the local LAN by some other protocol entirely. Each of these is defined to use the same type of identifier space, the same syntax.
And so what happens is that the name resolution library takes a request from the application and tries to figure out where to send it - which protocol or protocols to try, and in what order. And of course, if you have different implementations of different libraries that end up choosing different orders, you get interesting results.
To make it more difficult, different protocols specify different encodings, so when you put those things together, the application can't tell which encoding - or, in the case of multiple name resolution protocols being tried, which set of encodings - is going to be attempted, because that's a decision made by the name resolution library.
Let's talk just for a moment about the history of what is a legal name. The name resolution library gets something - is that a legal name? What's that something? Let's briefly walk through the history, to understand where the world is at today. Back in 1985, RFC 952 defined the names in the host table file - internet host names, gateway names, domain names, or whatever. This is the one that said they contain ASCII letters, digits and hyphens, or LDH.
1989 is when DNS came along, published in RFCs 1034 and 1035, and it includes a section called 'preferred name syntax' which repeats the same description of LDH. The confusion comes from the word 'preferred' there. Well, remember, this was before RFC 2119 language. Is that 'preferred' a SHOULD or a MUST? Is 'preferred' mandatory? There's confusion there.
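For reference, the LDH rule is simple enough to express as a one-line check; this Python sketch allows a leading digit, per the later RFC 1123 relaxation:

    import re

    # Letters, digits, hyphens; no leading or trailing hyphen; at most 63 octets.
    LDH_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?\Z")

    def is_ldh_label(label):
        return LDH_LABEL.match(label) is not None

    print(is_ldh_label("example"), is_ldh_label("xn--bcher-kva"), is_ldh_label("-bad-"))
    # True True False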
That was 1989. By 1997, 8 years later, we had RFC 2181, which was a clarification to the DNS specification, because of a number of areas of ambiguity and confusion that were resulting. These are three direct quotes, with emphasis added. The first one says 'any binary string' can be used as the label of any resource record. 'Any binary string' can serve as the value of any record that includes a domain name. And, as Stuart mentioned, applications can have restrictions imposed on what particular values are acceptable in their environment.
Okay. So, to clarify: the DNS protocol itself places no restrictions whatsoever, but users of entries in DNS can place restrictions, and many have.
Now, that was 1997, and in that same year there was work on the IETF policy, which was published in, I think, January of '98 as RFC 2277. This is the one that Stuart referred to, and here are the quotes from it. The first one you saw earlier - 'protocols must be able to use the UTF-8 character set'. It then continues, 'protocols may specify, in addition, how to use other character sets or other character encoding schemes'. And finally, 'using a default other than UTF-8 is acceptable.'
What's also worth pointing out is that it's not just what it says; it is also what it doesn't say. What it doesn't say is anything about case, about the E with an accent and other combined characters, about how things get sorted, et cetera. The IETF policy did not talk about such things.
And so, as a result, two unicode strings often cannot be compared to yield what you'd expect without some additional processing.
Now, since protocols must be able to use UTF-8 but may potentially use other things, and since the simultaneously produced DNS clarification said any binary string is fine, UTF-8 in DNS complies with the IETF policy.
So, since UTF-8 in DNS complies with that policy, starting in that year people began using UTF-8 in private namespaces. By private namespaces, we mean things like enterprises, corporate networks. By private namespace, again, we mean 'not resolvable outside of that particular network, not resolvable from the public internet.' In their own world they went off and used UTF-8, and it became widely deployed in those private networks. About five years after that came the punycode encoding work for the public DNS name space.
So, just to summarize here: UTF-8 is widely deployed in private namespaces; punycode-encoded strings, or A-labels, are deployed in the public DNS name space.
Now, within the internationalization community, there's been a bunch of discussion of length issues, and I think it's important for the wider community to understand. DNS itself introduces a restriction on the length of names: 63 octets per label, 255 octets per name (not counting a zero byte at the end if you're passing it around in an API).
The point is that non-ASCII characters, as Stuart showed, use a variable number of octets in the encodings that are relevant here. 255 octets of UTF-16, 255 octets of UTF-8, and 255 octets of A-label can each represent strings of quite different lengths. So there exist strings that can be represented within the length restrictions as punycode-encoded A-labels, but can't be encoded within the same length restrictions in UTF-8. There also exist strings that can be encoded in UTF-8, but cannot fit within punycode and get an A-label. So you can imagine some interesting discussions there.
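A quick way to see the asymmetry, again using Python's "idna" codec as a stand-in for whatever IDNA implementation is in play (the label here is made up purely for illustration):

    label = "\u00fc" * 40              # forty copies of ü
    as_a_label = label.encode("idna")  # the "xn--..." form
    as_utf8 = label.encode("utf-8")
    # The A-label form fits within DNS's 63-octet label limit;
    # the UTF-8 form of the same label does not.
    print(len(as_a_label), len(as_utf8))
    print(len(as_a_label) <= 63, len(as_utf8) <= 63)   # True False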
Let's recap. We've talked about multiple encodings of the same unicode characters. There are things we call U-labels - with U, think unicode; with A, think ASCII. A U-label is something that is usually written out in its own characters. A-labels are things that start with xn--.
You have different encodings, say the top form and the bottom form, used by different protocols and different networks, even within DNS: punycode A-labels on the public Internet, UTF-8 on private intranets. You even have different applications paying attention to different RFCs - ones that actually implement the IDNA documents, and ones that don't. Because you have all these differences across protocols, networks and so on, you can imagine the confusion that results. If you have one application that launches another application and passes it some name or URL to use - the launching application may be able to access the thing, you click on something and cause it to launch another application and pass it the name - then whether the application that just got launched can use the identifier in the same way, in general, all bets are off. It may or may not.
You may get a failure, may get to some different site than what you
got to from the launching application.
Similarly, if you have two applications that are both trying to do
the same thing - two jabber clients, for example - and one happens
to work and the other one doesn't happen to work, there would be a
switching incentive to say 'all I have to do is switch to the other
one.'
So let's walk through a couple of examples of applications that have actually tried to do a bunch of work to improve the user experience in these cases: they have to deal with the multiplicity of encodings while not getting to the wrong place or getting failures. What we found is that some applications have invented algorithms to deal with the multiple-encoding issues that the RFCs don't tell you how to handle.
And most of the time they actually get it right. There are a couple of corner cases where they don't solve it 100 percent.
Here's one. You type something into an address bar in a browser. In this example, the 'IDN-aware' application is one that understands that there is UTF-8 in some private namespaces it's connected to, and punycode in the public namespace it's connected to. So it knows which networks it's connected to, may have some information about what names are likely to appear on them, and runs some algorithm to decide whether this is an intranet or an internet name. At this point the string is being held internally in memory in, let's say, UTF-16 or UTF-8, whatever the native storage of the operating system is.
In this example, let's say it decides it's going to be an intranet name. So in this case it leaves it in the native encoding, does not run the punycode algorithm, and passes it to the name resolution API. Then it goes to DNS, using UTF-8, and sends it to the DNS server in UTF-8.
Now look at host B in this example. If that one has chosen to register its name in DNS in the punycode-encoded form, the A-label form - if that's the name that actually matches - the lookup is going to fail. If it's host A, which is using the same type of algorithm as the client at the top, it's going to succeed.
So the normal expectation is that most of the hosts in that environment are all cooperating, or all have the same knowledge or configuration, and you actually get to host A. If instead it's in the mode of host B, it will fail.
Now let's take the case where the application decided, by looking at the name, that it's going to be an Internet name. In this case it runs punycode on it, producing the xn-- form, and this goes to the public DNS.
In this example, let's say that name does not in fact exist in the public DNS, and so the name resolution API wants to fall back and try local-LAN resolution, say LLMNR or mDNS. Those protocols are defined to use UTF-8: if the name is registered there and resolvable, you'd better ask for it in UTF-8 or you won't get the answer. But here the application has already put the name in the A-label form before passing it down, so when that goes out on the local LAN it's not going to find a match. Most of the time the application does the right thing in both environments, but there are corner cases in both where things will fail.
The next category is where you have one application that has become IDN-aware, and another application that doesn't do anything - it just takes whatever the user types and passes it directly to name resolution with no inspection or conversion, because the name resolution APIs in this example are UTF-16 APIs. On the left, the IDN-aware one will convert the name using punycode to the A-label form, go out, and find the registration in DNS in the punycode-encoded form. The other application passes it down in UTF-16, the resolver converts it to UTF-8, and it goes out and does not find it. It doesn't find it - but there could exist a registration with exactly that UTF-8 binary string, since any binary string can appear in the DNS.
So what if the UTF-8 version magically found its way out there, either accidentally or intentionally? The user would get to a different site than the one they expected.
Finally, the other category of differences is applications that say 'I don't know which one it's going to be, so I'm going to try them both' - in some order. So consider two applications, one that decides to try the UTF-8 version first, and one that decides to try the A-label first. The one on the left converts to punycode first, the A-label version, and goes out and finds the A-label registration. The other one tries UTF-8 first and might find a different registration, or one of the two might be unreachable. So you get non-deterministic behavior. And of course the other application is intelligent too, so if one form was unreachable you get the reverse. This is what applications actually do today.
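The underlying ambiguity can be sketched in a few lines of Python: the same user-typed name has at least two plausible wire forms, and which one a given resolver library emits, and in what order, is exactly what makes the observed behavior non-deterministic. This is only a sketch of the idea, not any particular implementation:

    def candidate_wire_forms(name):
        # Two plausible encodings of the same user-typed name. A real resolver
        # library picks one, or tries several in some order of its choosing.
        return [
            name.encode("idna"),   # A-label form, b"xn--...", for the public DNS
            name.encode("utf-8"),  # raw UTF-8, as used in some private namespaces
        ]

    for form in candidate_wire_forms("bücher.example"):
        print(form)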
So, the basic principle, the basic learning from this - the fact of physics, right - is that conversion to an A-label, or to UTF-8, or to whatever else is going to appear on the wire, can only be done by some entity that knows which protocol or namespace is going to be used, and hence what encoding is appropriate for that particular environment or that type of resolution. When an application tries to resolve a name, the name resolution library may try multiple of them. So there's no single right choice at the application layer. This leads to two remaining categories of hard issues.
In general, the client - using the term generically, whether it's a host or an application or whatever - faces these, because again, while we're using host names in many of our examples, the problems we're talking about are not limited to host names. They may or may not be unique to host names, and they can occur in other identifier spaces.
The first issue is that the client has to guess, or learn, whatever encoding the other end expects. In many cases it may be defined by the protocol, and that's fine. But if there are multiple protocols, it's part of the learning or guessing. Names appear inside other types of identifiers, and each identifier type today often has its own encoding conventions. What is this identifier space? Is it UTF-8? Is it A-label form? Is it percent or ampersand form, or whatever?
And anything that converts from one name space or identifier type to another - such as extracting an e-mail address from mail, or extracting a host name from a URL - has to convert between those two sets of conventions. Now, you could say: well, if they all used a single encoding, they wouldn't have to do any of this transcoding in the movement between layers.
But by comparison, that's the easy part. That's not the hardest part of the problem. That's sufficient only if the only thing you're going to do is display it. All the other things besides the encoding issue - comparison, matching, sorting - they all require more work.
So just like RFC 952 defined what ASCII characters were legal in a
host name, we need to define the unicode subsets for other
identifiers.
What are the things that are legal? The optimal subset for one protocol or type of identifier may be different from what's optimal for some other one. Now, there also exist cases where, based on, say, implementation differences, two things that should be the same display visually differently. Usually this is due to a bug. The problem is that nobody agrees which one is the bug and which is the correct behavior. So that's a hard issue.
Stuart - back to you.
(Stuart Cheshire) Thank you, Dave.
So, Dave is right - having a single encoding does not solve all our problems, although having lots of different encodings definitely does add to them. This is not news. We've known this for a while. There used to be computers using different character sets, and we recognized that if some computers used ASCII and some used another one, the receiver had to work out which it was, and this was not going to give a good experience. So the wire protocols used ASCII when they could, and if you had a computer that used the other one, you needed a mapping table so you could convert to the common language on the wire and convert back upon reception. We recognized that in 1969, but we seem to have forgotten it now.
To get out of the current chaos we need to go beyond the current recommendation. Merely supporting UTF-8 as one of many options doesn't solve the problem. I think we need to move to a world where we only use UTF-8, so that when you receive an identifier, or a text string off the network, you don't have to guess what the encoding is, because there is only one encoding.
So the summary is, for text that end users see, we want to have
rich text, and that means unicode. And for compatibility on the
wire, that means using UTF-8 to encode those unicode code points.
The corollary of this is for identifiers that are protocol
identifiers, that are used for communication between computers to
tell each computer what to do, and aren't seen by end users, it is
much harder to make the argument why those should be unicode,
because the bigger the alphabet, the more the scope for confusion
and the more chance of things not interoperating.
With that, I'd like to open the mic for questions. I think we
should do half an hour for questions on this internationalization
presentation, and then that will leave half an hour for general
questions to the IAB.
We will take new questions at the middle mic and the end mics, and use the in-between ones for follow-ups.
Open Mic:
(Bob Briscoe) A question for John. How long have we known about the security problems? Because it was sort of disquieting hearing about them, but this is the Internet, and we ought to be fixing these things.
(John Klensin) What do you mean by 'knowing about the security
problems?'
(Bob Briscoe) Well, the problems of being able to spoof one
character with another, and change fonts, etc.
(John Klensin) Since long before this process started. We've known about confusability of characters since we started looking at multiple scripts. We've known about some of these confusion problems in titles of things since we deployed MIME with multiple character sets, and that would have been - I'm guessing from memory - around 1990 or shortly thereafter. I gave a presentation at an ICANN meeting in Melbourne that exhibited some of these abilities to write different things in different scripts. At that time, it was a general warning about these things.
We've certainly seen more subtlety, as we've understood these
things better. I used to joke that one of the properties of this
whole internationalization situation, when one is actually trying
to use the strings and identifiers, rather than printing them, is
that every time we looked at a new script, we found a new and
different set of problems. It was like going through a field and
turning over rocks, and each time you found something new. So I'm
not certain how to answer your question.
This is just epidemic in an environment where we're suddenly moving
identifiers from a world in which the maximum number of characters
we treat as different is around 36, to an environment where the
maximum number of characters we treat as different is in the range
of tens of thousands.
(Bob Briscoe) I guess my question is: your presentation told us about the problems. If we've known about these problems for 19 years or so, could we do a presentation on a solution space? Is there any solution space?
(John Klensin) Let me give you a different answer - we've had these problems for somewhere between two and six thousand years.
(Bob Briscoe) Time to fix it.
(John Klensin) Absolutely time to fix it.
(Bob Briscoe) You might do it before something goes to full
standard.
(John Klensin) The fundamental issues here really rely on two things. One of which is that we can design very, very highly distinguishable fonts. We could possibly design highly distinguishable fonts across the entire unicode set, and they would be so ugly nobody would want to use them.
We could, in theory, teach everyone about all of these 6,000 separate languages, and the only slightly smaller number of scripts in the world, but that isn't going to happen.
So the answer to your question is that there's a tremendous amount
of reliance on user interface design here. And what we need to
understand is that there's both the problem and an opportunity.
The opportunity, which is very important, is for people to use the
internet, in their own script, in their own language in their own
environments. That's really important.
Our problems arise when we start looking at, and operating in,
environments which one of us doesn't understand. I'm gradually
learning to recognize a few Chinese characters, but my ability to
read Chinese or Japanese or Korean is zero. I don't know about
you.
But if your situation with regard to Chinese characters is the same
as mine, if I send you a message in Chinese characters, we are both
having a problem. If I send a message in a script I can read, or
an identifier in a script I can read, but you can't, you've got a
whole series of problems. You can't read the characters, you
probably can't figure out how to put the characters in a computer
if you can read them, and you're going to be easily tricked. And
we're going to have to learn how to deal with that, just as we've
had to learn about non interoperability of the human languages.
If I have a face-to-face conversation with you, using a language
which only one of us understands, then at a minimum we're going
to have an interoperability problem. At a maximum, if I can make
that language sound enough like something that you expect to hear,
or you can do that to me, then we may have a nasty spoofing
problem.
And again, these issues are thousands of years old. And we kind of
learned to cope. And we learn to cope by being careful, and we
learn to cope by remembering that those little boxes are a big
warning sign that we may not be able to read something.
Many of us have started filtering out any email which arrives in a
script which we can't read, because we know we're not going to be
able to read it anyway. And those are the kind of things we do.
It's very, very close to user level. And I don't think there are
any easy answers.
But the alternative to this situation is to say: oh, oops, terrible, there might be a security problem, so nobody gets to use their own script - and that answer is completely unacceptable.
(Yoshiro Yoneya) All right. From my experience with the internationalization of protocols, one of the hardest issues is to keep backward compatibility. Inventing an encoding is a way to get interoperability, or backward compatibility, with an existing protocol; that's the reason why there are so many encodings. So I hope to have generic migration guidelines for protocol internationalization - that would be very good future work.
(Stuart Cheshire) I think one of the things we need to be careful of - it's easy to fall into the trap of saying we need to be backward compatible, and so we send something that actually means something else. But if the thing at the receiving end doesn't know that it means something else, we've not got international text. We have lots of percent signs.
(Larry Masinter) This is actually a followup about how long we've known about the problem. I'll take some blame. In 1993, I think, there was an internet draft where I proposed internationalization of URLs, based on discussions in 1992, when I thought it was a simple problem: just use UTF-8, and have regular URLs and internationalized ones. But I think part of the problem was the switch in how we thought of these - they weren't names, they weren't identifiers, they were locators. The notion of comparing two of them to see if they were the same was not a requirement. And at the time, there were no caches, so the notion of figuring out whether or not this URL was the same as that one wasn't part of the protocol stack. And, therefore, some of the problems we're seeing - the idea of phishing, where you actually look at the name and believe something merely because you saw it on your screen, even though it didn't have anything to do with where you were trying to go - that was a requirement that was added after the fact, without a lot of thought.
And if you think about it, we've added on some requirements that maybe shouldn't be there. So, I think if you look at all of your examples, there are still some problems even if you don't try to compare. But almost all of the problems that you've listed really have to do with comparison - and comparison of locators, as lots of them are. You had a lot of things: look at this and that, and are they the same or different? And if you didn't have the problem of a user trying to decide ahead of time whether or not they were the same, you wouldn't see a problem.
(Dave Thaler) Larry, these issues exist when a system decides to take a label or a string, which can be user input, and compare it with something which is stored in a database - the classic matching and lookup problem, in DNS or otherwise. And then the question is whether the answer, whether or not those matched, meets the user's expectations. There's no way to avoid that particular problem, other than to require the user to have universal knowledge of exactly what's stored. And I do mean exactly.
(Larry Masinter) No. I think if you put human communication in the loop - you're going to print something on the side of a bus that you want people to type into their computers - it is your responsibility, at the time you print it on the bus, to do it in a way in which the users will have a satisfactory experience. It is not the responsibility of the intermediate system to make up for the fact that what was printed could be an O or could be a zero, or could be an L or a one - you know, you get a password and I can't tell, because the font used was bad.
It's the responsibility of the printer to do that in a way that will cause appropriate behavior, and not to choose to print things that are unrecognizable or have ambiguous forms. There are lots of systems that never go through that phase of translating into a perceptual representation, translating back, and expecting that to work.
So, I think that we can make progress by being more careful about
what we choose to accept as requirements of the overall
communication system.
(Dave Thaler) I just want to comment on one of the things you said, about whether most of the problems are due to such and such. I want to point out that we actually talked about at least two different, big categories of problems. One category is when there are multiple unicode strings - in other words, multiple sets of unicode code point numbers - that can be confused, or matched, or whatever, with each other. There's one set of things inherent in that, and it's a lot about user interface, display, and so on.
The second is when there is a single set of unicode code points but multiple encodings of it. Those are two fairly different sets of problems that we talked about tonight.
(Larry Masinter) I think if you follow the paths, these differing alternate forms that look the same don't fall from the sky. They don't appear magically in the middle of the system. There's either some data path that transmits them and along the way screws them up, or there's some human perceptual path that involves printing things out, or reading them out loud, and transcribing them in a way that's inappropriate.
(Dave Thaler) Pete, do you have a follow-up on this?
(Pete Resnick) I do. I actually disagree with Larry, at one level. We're talking about identifiers that are used for user interaction and are also used for machine interaction, for protocols. And that's inevitably going to get screwed up, because the stuff that we use for user interaction has variants - it's got humans involved. Once a user has to type and interpret something, and there are variations in how it might be typed or interpreted based on context, there's nothing to be done.
What we've done is increase the probability of that happening enormously, from those 37-odd characters to tens of thousands of characters. I used to be much more in the camp of 'we have to straighten this out by using proper encodings, and then this is done' - if you had said to me ten years ago that today I would say such a thing, I would have thought it ridiculous.
You know, e-mail is no longer reliably delivered because of spam. I
don't care anymore if e-mail is not delivered because a user cannot
type in the e-mail address exactly the way I put it on the screen.
There's no way to make that precise. If we get unlucky, the person
who chose that e-mail address gets what they pay for.
(Stuart Cheshire) I want to add one clarification to Larry's
point. When we talk about comparing strings, we're not talking
about showing two strings to the user and saying, do you think
these are the same.
(Larry Masinter) That was one category.
(Stuart Cheshire) We were talking about when a DNS server has a million names in its files, and a query comes in for the name the user typed: the server has to go through its files and work out which record that query addresses.
And you mentioned the subject of phishing, that's not a requirement
that the IETF decided to put on identifiers. That's something that
criminals decided would be lucrative for them, and we have to think
about the consequences.
(Larry Masinter) Let me see if I can clarify something. I'm not saying it's not a problem. I'm trying to point out where I think it is going to be most productive to look for solutions. And that is putting restrictions on what is output or displayed, in such a way that it is less ambiguous how to enter it reliably.
(Stuart Cheshire) Okay.
(Larry Masinter) And to focus on that area.
(John Klensin) You're asking people who design user input and
output procedures to constrain their designs in a way which makes
things unambiguous. My experience with telling designers what they
can and cannot do has been pretty bad.
(Larry Masinter) Somebody is going to have to do something, and
trying to patch it somewhere else is not going to be effective.
(John Klensin) Another way of looking at this is that these problems would be vastly diminished if we let no one on the internet who wasn't trained to be sophisticated about these kinds of things. And while there were times in my life when I probably would have approved of 'nobody uses a computer unless they pass the training course and get a license', I think that's probably harder to constrain than designers.
(Spencer Dawkins) Spencer Dawkins, and probably the least clued person on this topic to stand up so far. So I'm thinking the kind of questions I would ask would be triage kinds of questions. So, is this situation getting worse, or have we already hit bottom?
(Stuart Cheshire) It's still getting worse. We think of it as an
educational process in which we continue to learn.
(Spencer Dawkins) How much better does it have to get before it's
good? Before it's okay? I mean, how much do we have to fix?
(Stuart Cheshire) I think with a big identifier name space, there are always going to be problems. Our goal is to minimize the unnecessary problems.
(Spencer Dawkins) I see e-mails coming through and it's
disappointing.
(Stuart Cheshire) I think people who are working on this problem have job security - sort of like the people working on security and antispam and so on. As long as human languages continue to exist, as long as there are humans using the network, the problems will exist. The one way to make them go away would be to remove all the humans.
(Spencer Dawkins) So, tell me if I've got this right. Once upon a time, there was ASCII and there was the other system, and people on each side wanted to get to the resources on the other side. So there was a death match, we picked ASCII, and life went on.
Are we in any danger of being able to have that kind of a convergence today? I mean, do people worry that they can't get to places in other scripts and things like that? Do people see this as a problem?
And John has been, you know, demonstrating this, you know, on
napkins and stuff like that for me for a while, just as a curiosity
kind of thing, so I congratulate you guys for managing to scare the
hell out of me yet again.
But, like I say, I'm kind of curious about that. So, I'll sit
down.
So you asked a question there, I think, at the end - part of the question you're implying is 'how often are people actually running into problems today?'
(Spencer Dawkins) Basically, like I said, the ASCII thing is,
there's a computer I need to get to, and I can't get there.
There's a computer in Saudi Arabia that I can't type the name of.
How big of a problem is that?
(Stuart Cheshire) As an example, in some of the cases I showed, applications are trying to deal with the fact that there are multiple encodings, and it's the corner cases that fail. People run into that, but not very often - people have done a good job of compensating, and we keep it as rare as possible. As for the phishing attacks, whenever somebody tries to be dangerous, hopefully that isn't accomplished either.
I'm going to close the mic lines now; we have about 5 more minutes. Do we have a followup there?
(Bob Briscoe) Maybe the question could be better posed as: do we think there's sufficient support in the protocols and languages that we're standardizing for applications that need to be secure to actually be secure? What I'm thinking is, if you're viewing a font and an encoding through an application that's doing something important - business, legal, whatever - could the application writer say: well, normally in your locale you'd be restricted to this range, so if anything outside that range comes in, I can warn, et cetera, et cetera, and I can sign all your fonts and encodings. Do you think there's enough support there for an application to do that?
(Stuart Cheshire) I think there is scope for heuristics to spot specific behaviors, but it's trial and error, and they tend to be developed over a long period of time. When you find something that doesn't work, you go back and adjust the particular heuristic.
(Dave Crocker) So I got up before Pete to ask you, Stuart, about the end of your presentation, but my question is predicated on exactly the point that Pete was making. Which is that much of the mess right now - well, there are inherent complexities in the topic, but most of the mess is a layer violation that we created in simpler times. And the simpler times probably helped things a lot back then, in terms of making the internet usable, making the arpanet usable. E-mail addresses, and later web URLs, and to a large extent domain names, had this user-interface use and this over-the-wire use; we made a lot of things simple that way, but we built the problem we have now. And we continue to try to maintain the layer violation and say that's okay, we have to do that.
The end of your presentation didn't phrase it this way, but essentially was saying: no, maybe we really don't, and we certainly should try not to. That is, we should go to a canonical over-the-wire representation.
The piece that, in the little I've touched this area, seems to suffer a lot - and it will suffer even without this, but it suffers worse - is the difficulty of getting the distinction between the user interface, the human factors, the human-side stuff, and the over-the-wire use.
And I totally understand the resistance to it. But our job is to fix problems, and we really need to be careful we don't just maintain them.
In the years that the internationalization work has been going on, we've been having to deal with some realities that forced us to make decisions that do maintain them. So I think that your suggestion at the end is - I mean, it's charmingly '70s. It's 'go back to canonical forms on the wire.' And so the question has to do with achievability.
How do we get there? Do we get there before we get to IPv6? Do we get there before we retire - well, some of us anyhow? I mean, it's clearly the right goal. But is there anything practical about the goal, and if so, how?
(Stuart Cheshire) I think moving to UTF-8 does not solve all of the problems that we talked about, not by a long way. But it solves one of them: at least we know which characters we're talking about when we're trying to decide if they're equal. Who will solve it? Implementers writing software - they need to write their software that way. Working groups writing standards need to specify that. I think I'm less pessimistic than you are about the prospects of moving in a good direction here. And in the interest of time, we'll take the last question.
(John Klensin) Dave, I think the other part of the answer is precisely that we have to stop taking the shortcuts - assuming that dropping a mechanism for internationalized characters into something which was designed for ASCII only is a solution to the problem. Occasionally it will be a solution to the problem. But we may have to start thinking, for the first time in our lives, seriously about presentation layers and about identifiers which work in this kind of environment, rather than things that have been patched for a little bit of internationalization in an ASCII environment.
And I don't think those problems are insurmountable. I don't think
the problems of getting serious about localization sensitivity are
insolvable. But we need to get serious and start working on them
at some stage.
(Dave Crocker) The layer violation is the reason why the Internet
is successful, and popular. Right?
(John Klensin) Was.
(Dave Crocker) Was. Well, is. It is. But I'll put in the process plug. We had a bar BOF in Stockholm, and there was a lot of interest in internationalization of resource identifiers - in taking that document, which is a proposed standard RFC, forward. But having 9 different solutions and 9 different committees for how we approach the problem seems like a bad idea.
There are a lot of different groups working on their own solutions for how to go about it. And I'm hoping we can converge into a single, if somewhat interesting, working group. So I encourage you to consider it: the public-iri list, and I think IRI will be the working group.
(Olaf Kolkman) All right. Thank you. With that, I'll ask the
rest of the IAB to come up on stage and we'll take general
questions.
So while the rest of the IAB comes to the stage for the open mic session: there was a suggestion yesterday to keep things short. I would like to remind the audience of that; we all have our own responsibilities here.
In previous sessions I wrote a mail to the audience before the plenary saying 'if you have a question, please write us a mail.' That would help us to actually think about an answer, and to answer concisely. And it would help you to think about the question a little bit.
That never really happened. But I think that might be a mechanism for short mic lines and intelligent answers to your questions. So please keep that in mind for the rest of the year: the open mic is not the only way you can approach the IAB, or the community.
With that said, is there anything somebody wants to bring to the mic? Oh, I should introduce us all. Let's start at the far end with Jon.
(Olaf Kolkman) Okay. Thank you.
(Tina Tsou) So, there's the document that was produced, the one that is about NATs - what are the IAB's thoughts on it?
(Dave Thaler) It is mostly a repeat of a lot of the same points that have been made on the topic. If I can sum up what the RFC, or the RFC-to-be, says about the IAB's thoughts: the most important point is to preserve end-to-end transparency.
Now, there are multiple solutions that preserve end-to-end transparency. So IPv6 NAT is a solution, something that's in the solution category: it's possible to do translation in ways that preserve end-to-end transparency, it's possible to use tunneling, and it may be possible to do other things.
The IAB statement is that it's important to preserve end-to-end transparency. There is no statement saying there must or must not be NAT in IPv6. So the first main point: on the question of whether NAT could be done in a way that preserves end-to-end transparency, the document argues neither for nor against that.
The second main point of the document is that there exist a number of things that people see as advantages of IPv4 NATs and that they use them for - renumbering, all the things brought up in the IETF in the past. Some of them were documented previously in RFC 4864; some were not entirely covered there, so we elaborated on those. Those are things that people see as requirements for solutions.
Today the simplest solution that people see is v6 NAT, but that may or may not be the only or the best solution. So the second point was: there are some requirements there that the community needs to work on solutions for.
That's basically it, to sum up. Anybody else want to add anything? That's what the IAB's thoughts are. Once you get into ways to meet those requirements, that's for the IETF to figure out. But we wanted to comment on what we believe the requirements are, what the constraints are, and to what extent NAT does or does not meet those requirements.
(Gregory Lebovitz) I think the other thing that we tried to make very clear in the document was that every time you use a NAT to solve one of those problems, you give up something significant. And we tried to call out what those things were - the trade-offs and the costs associated.
(Dave Oran) Well, I just want to mention that it is somewhat difficult to establish transparency in any translation system, but it is in fact possible. Where you run into trouble is in trying to take the simple approach, where you attempt to confine the translation state to individual boxes independently, with no coordination of the translation state; that results in non-invertibility of the transformation and the loss of transparency. So, something I would encourage the community to do is to look at NATs not simply as 'what can I get away with, doing the minimal amount of work, in order to maybe get something that I want', because the consequences in negative terms for transparency are pretty severe. With somewhat more work, translation-like approaches may in fact be quite acceptable.
(Olaf Kolkman) Yes, my apologies for being a little bit quick a minute ago in trying to close the mic lines - I see that people have queued up.
Peter, please.
(Peter Lothberg) Okay. I'm Peter. So, there was an obvious reason why we have IPv4 NATs, and then people made them do all sorts of fantastic things. I think people have them because they want more addresses, because they have more things inside their houses - and I don't want to go there. But a major use of it is as some kind of gatekeeper, some kind of policy. It's a policy box that sits there and implements whatever policy I decide I want to have for people coming into my house: I look out the door, the doorbell rings, how do they look, would I let them in or not?
So, last time we forgot to do any work in the IETF and we ended up with a mess. I hear talk about smart grids, intelligent houses - and assume for a second that we use IPv6 addresses on all of them and they have their own unique addresses: we still want policy. And those devices are so small, they probably need somebody to help them. So maybe the IETF, somebody, should go look at this: okay, in the future we still need a policy control device that sits at the boundary of something and something, in order to enforce policy - to make sure the pool man gets to the pool and the alarm company gets to the alarm, and vice versa. Let's get that done before people make more kludges.
(Gregory Lebovitz) Don't we have those? Aren't they called firewalls?
(Dave Oran) Can I jump in? So, as somebody who has been skeptical of most things firewalls do for as long as I can remember, I absolutely agree with Peter. However, in some cases, trying to capture the correct policy semantics simply at the individual packet layer, as one would with a firewall, runs into many, many, many problems that make things in fact worse rather than better. If the firewall simply processes on a per-packet basis, there's lots of -- if the box attempts to do packet inspection, either shallow or deep, I think we're all aware of the problems that that type of approach runs into. So there's going to be some kind of application intermediary that's going to be needed for various applications to enforce policy. Get used to it. Don't try to do everything by per-packet processing and firewalls, or, even worse, try to guess what the correct policy ought to be for an application by doing intermediate inspection of packets.
(Peter Lothberg) I was thinking more of the Swiss Army knife solution here. I wasn't only thinking packet inspection. I was thinking of the thing where I actually have my policy stored and where all the devices I have in the house actually go - the system where it gets stored, the database, the PKI, blah, blah, blah.
(Dave Oran) Then we agree. But it doesn't necessarily have to be
the gateway box that sits physically at the boundary.
(Peter Lothberg) Correct.
(Dave Oran) But people don't want to buy many boxes.
(Peter Lothberg) Yes, they only want one box that needs to get attacked.
Right.
(Remi Despres) That's a follow-on to the first question. I think you made the point that end-to-end transparency is something important, and of course I do agree. Now, yesterday something strange happened with reference to this. Among the interesting technologies which are proposed to restore end-to-end transparency and move in the right direction, there is one, in IPv4, which is the extension of addresses with port ranges. Now, there was a BOF on A plus P. Some of the major birds of a feather - those people who are interested in the subject - were not permitted to talk, to present I mean, that is, to present their contribution to that. And the conclusion was that we would no longer be permitted to talk, in any group, about this approach to end-to-end transparency. I still expect that there will be a reversal of that decision, and that it will be possible to work on A plus P in this area.
(Dave Thaler) Part of that is a question for the IESG, and part of it is a question for the IAB. I'll comment on the IAB portion, which is about end-to-end transparency versus, say, the evolution of the model. The IAB has another document about what the assumptions are, what the impact of changing those would be, and whether those changes should be made - obviously, the whole point of the evolution-of-the-IP-model work is to say, well, the IP model does evolve, but evolution has to happen carefully.
Architecturally, what I think those at the BOF are trying to weigh is this: you have some inherent problems in IPv4 - we know that. So on one hand - and this is those in the BOF, not we the IAB - you're looking at one alternative that says: maybe there's better end-to-end transparency, but more changes (for some definition of changes). And at the other extreme, there might be less end-to-end transparency. There's no single right answer, because architecturally the answer is to remove the limitations of IPv4 and go to IPv6; that would be the architectural solution which gives you end-to-end transparency and preserves the model. So we see a tussle between two sets of requirements that are trying to be met but cannot be met at the same time architecturally.
(Remi Despres) For the tussle to be resolved, it should be
possible to talk and explain.
(Dave Thaler) That's a question for the IESG, not the IAB.
(Remi Despres) Okay. Thank you for the information.
(Lorenzo Colitti) Since it's open mic: I speak as somebody who has deployed an IPv6 network and services - and that started because we needed more address space. If I look at the papers - I read the paper about the botnet where researchers got control of the botnet and had access to the machines behind it, and 80% of them had private IP addresses. That's a high number; it tells us that the internet would be dead and buried if we didn't have NAT. But let's remember that it was created to fight address shortage. You know the old saying: when all you have is a hammer... Yes, we started doing that because of address shortage - well, it gives us security; well, not really. Multihoming - kind of. If we want those benefits, let's think outside the box, not do it the same way. David was saying this: people want to do things the same way they're used to, but there's benefit to a clean slate. There's a protocol that allows you to do things in very different ways - apply security policies through the last 64 bits of the IP address, you can do all of this. Try to think outside the box. Don't do it the same way, and think of all the operational costs involved in having different scopes and different addresses.
Finally, we use public IPv6 addresses internally. And I can tell you, it's refreshingly simple to have one address. You just know what's going on. And all the security benefits that NAT ostensibly has - personally, I don't buy them. And I don't think they would be comparable even to the gain that you have when you can actually understand things and have a clean, simple design.
(Jon Peterson) That was not a question, but a contribution to the discussion?
(Lorenzo Colitti) Open mic, right.
(Jon Peterson) Certainly. If 80% of the hosts are behind NATs, that tells you something about the security that NATs grant.
(Erik Kline) I'd like to add as well: could we possibly make a requirement that anybody who wants to implement IPv6 NAT actually run a reasonably large network for, say, oh, a year? Because I'm concerned about lots of things being done based off of experiments and not valid requirements.
(Jon Peterson) Well, we never did that with IPv4 NATs, and still they were developed.
(Erik Kline) Right. But everybody by then had several years of actual experience with IPv4.
(Jon Peterson) I think that I'm going to disagree with that
suggestion. I think it's a dangerous thing, because I guarantee
you, whoever puts that amount of effort in, will become committed
to it.
(Olaf Kolkman) I'm going to carefully look around. If there are no further initiatives to move to the mic - and I don't see any - then, very slowly: going, going, gone. Thank you.