
Re: [http-state] Ticket 3: Public Suffixes



On Sat, 16 Jan 2010 23:39:35 +0100, Adam Barth <ietf at adambarth.com> wrote:

On Sat, Jan 16, 2010 at 11:47 AM, corvid <corvid at lavabit.com> wrote:
Adam wrote:
Another alternative is to recommend a heuristic that works in many
cases and then further recommend that user agents use the full list.
The problem with this approach is that I don't know of any simple
heuristics that provide reasonable behavior.  In the past, some user
agents have used heuristics based on the length of the top-level
domain (i.e., two characters => ccTLD => foo.cc is a public suffix).
Unfortunately, this heuristic has undesirable consequences for some
small countries that let folks register domains directly in the ccTLD.

This seems good to me to both
- let implementors know that they can use the publicsuffix list
- try to provide the best heuristic we know of for user agents who might
 not have the luxury of using publicsuffix for whatever reason (or can't
 depend on it)

Here's the best heuristic I know.  The algorithm can probably be
simplified and explained more clearly.

[[
Roughly, getDomain(strFQDN) amounts to:

1> If the final label is empty, drop it for the purposes of this
1> algorithm
// Otherwise "www.example.com." would have four labels "www",
// "example", "com", "".  Instead, we drop the final label.

2> Name the labels Ln,...,L3,L2,L1; decreasing from start
(Leftmost=Ln) to finish (Rightmost=L1).
// If at any point in this algorithm the result demands >n labels,
// getDomain returns "".

3> Check n > 1.  If not, there's no domain, just a plain hostname.
Return ""; exit.
// Dotless FQDNs consist of a host only, there is no domain.

4> Check L1 == "tv".  If so, getDomain returns L2.L1; exit.
// "tv" is a special-case "completely flat" ccTLD for historical reasons.

5> Check Len(L1) > 2.  If so, getDomain returns L2.L1; exit.
// Len(L1)>2 suggests L1 is a gTLD rather than a ccTLD.
// If Len(L1)<=2 we assume L1 is a part of a ccTLD.

6> Check if L2 in gTLD list "com,edu,net,org,gov,mil,int".  If so,
getDomain returns L3.L2.L1; exit.
// gTLDs, when they appear immediately left of a ccTLD (modulo
// exception in step 4), are considered a part of the TLD.

7> If L1 is in the list "GR,PL" AND L2 is NOT in the gTLD list,
getDomain returns L2.L1; exit.
// GR and PL are considered "flat" ccTLDs EXCEPT when a gTLD appears in L2.
// getDomain("a.pl") returns "a.pl"
// getDomain("a.uk") returns ""

8> If Len(L2) < 3 getDomain returns L3.L2.L1; exit.
// getDomain("aa.bb.cc") returns "aa.bb.cc"

9> Otherwise, getDomain returns L2.L1
// getDomain("aa.bbb.cc") returns "bbb.cc"
]]
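
The steps above can be sketched in Python roughly as follows. This is only an illustration of the heuristic as described, not a recommended implementation. The names get_domain, GTLD_LABELS and FLAT_CCTLDS are made up for this sketch, and lowercasing the input is an assumption (the description writes "tv" in lower case but "GR,PL" in upper case).

```python
# Labels treated as part of the TLD when they sit directly left of a ccTLD (step 6).
GTLD_LABELS = {"com", "edu", "net", "org", "gov", "mil", "int"}
# ccTLDs treated as "flat" unless a gTLD label appears in L2 (step 7).
FLAT_CCTLDS = {"gr", "pl"}


def get_domain(fqdn: str) -> str:
    """Return the heuristic "domain" of fqdn, or "" if there is none."""
    # Assumption: compare case-insensitively by lowercasing first.
    labels = fqdn.lower().split(".")

    # Step 1: drop an empty final label ("www.example.com." -> 3 labels).
    if labels and labels[-1] == "":
        labels.pop()
    n = len(labels)

    def last(k: int) -> str:
        # Step 2 note: if the result demands more than n labels, return "".
        return ".".join(labels[-k:]) if k <= n else ""

    # Step 3: a dotless name is a plain hostname with no domain.
    if n <= 1:
        return ""
    l1, l2 = labels[-1], labels[-2]

    # Step 4: "tv" is special-cased as a completely flat ccTLD.
    if l1 == "tv":
        return last(2)
    # Step 5: more than two characters suggests a gTLD.
    if len(l1) > 2:
        return last(2)
    # Step 6: gTLD-like label directly left of a ccTLD.
    if l2 in GTLD_LABELS:
        return last(3)
    # Step 7: flat ccTLDs (L2 is already known not to be a gTLD label).
    if l1 in FLAT_CCTLDS:
        return last(2)
    # Step 8: a short L2 is assumed to be part of the suffix.
    if len(l2) < 3:
        return last(3)
    # Step 9: everything else.
    return last(2)


if __name__ == "__main__":
    for name in ("a.pl", "a.uk", "aa.bb.cc", "aa.bbb.cc", "www.example.com."):
        print(name, "->", repr(get_domain(name)))
```

Running it on the examples from the comments above gives the same results as stated there, e.g. "a.uk" yields "" because step 8 asks for three labels and only two are available.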

The heuristic is sufficiently ugly and wrong that I'd prefer to
recommend that user agents that care about security use the public
suffix list.  For example, it breaks the cookie protocol for domains
in the "to" ccTLD.  If a user agent doesn't care about security, then
it can skip the public suffix check and the protocol will still
function fine.

Just to confirm the "wrong" part:

This algorithm would classify at least two Norwegian public suffixes, vgs.no (high schools of Norway) and kommune.no (municipalities/counties of Norway), as ordinary domains, as well as 400+ others in the dot-no namespace.

It would also classify vg.no, the domain of Norway's largest print and online newspaper, as a public suffix, and likewise dn.no, the domain of Norway's largest daily economic newspaper.

IMO looking up an online public suffix repository, like Opera will be doing shortly, is probably the best option, as it does not require hardcoding a list of domains into the executable and eliminates the need to update the executable when the list changes.
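
For comparison, a list-based check could look roughly like the Python sketch below. It is not Opera's implementation, and it does not fetch anything. SAMPLE_SUFFIXES is a hand-picked stand-in containing only the suffixes mentioned in this message plus "com", and registrable_domain is a made-up name. The sketch does plain longest-match lookup only and ignores the wildcard and exception rules that the real public suffix list contains.

```python
# Hypothetical sample data: NOT the real public suffix list.
SAMPLE_SUFFIXES = {"com", "no", "vgs.no", "kommune.no"}


def registrable_domain(host: str, suffixes=SAMPLE_SUFFIXES) -> str:
    """Return the longest matching public suffix plus one label, or ""."""
    host = host.lower()
    # Drop a single trailing dot, as in "www.vg.no.".
    if host.endswith("."):
        host = host[:-1]
    labels = host.split(".")

    # Try candidates from longest to shortest so the longest suffix wins.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                # The host itself is a public suffix.
                return ""
            return ".".join(labels[i - 1:])
    # No known suffix matched (the sketch has no default rule).
    return ""


if __name__ == "__main__":
    for name in ("www.vg.no", "dn.no", "skole.vgs.no", "vgs.no"):
        print(name, "->", repr(registrable_domain(name)))
```

With that sample data, vg.no and dn.no come out as ordinary domains, and vgs.no and kommune.no are treated as suffixes, which is the opposite of what the heuristic does.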

--
Sincerely,
Yngve N. Pettersen
 
********************************************************************
Senior Developer                     Email: yngve at opera.com
Opera Software ASA                   http://www.opera.com/
Phone:  +47 24 16 42 60              Fax:    +47 24 16 40 01
********************************************************************

Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.