On Sat, Jan 16, 2010 at 11:47 AM, corvid <corvid at lavabit.com> wrote:
> Adam wrote:
>> Another alternative is to recommend a heuristic that works in many
>> cases and then further recommend that user agents use the full list.
>> The problem with this approach is that I don't know of any simple
>> heuristics that provide reasonable behavior. In the past, some user
>> agents have used heuristics based on the length of the top-level
>> domain (i.e., two characters => ccTLD => foo.cc is a public suffix).
>> Unfortunately, this heuristic has undesirable consequences for some
>> small countries that let folks register domains directly in the ccTLD.
>
> This seems good to me to both
> - let implementors know that they can use the publicsuffix list
> - try to provide the best heuristic we know of for user agents who might
> not have the luxury of using publicsuffix for whatever reason (or can't
> depend on it)
Here's the best heuristic I know. The algorithm can probably be
simplified and explained more clearly.
[[
Roughly, getDomain(strFQDN) amounts to:
1> If the final label is empty, drop it for the purposes of this
algorithm.
// Otherwise "www.example.com." would have four labels "www",
"example", "com", "". Instead, we drop the final label.
2> Name the labels Ln,...,L3,L2,L1; decreasing from start
(Leftmost=Ln) to finish (Rightmost=L1).
// If at any point in this algorithm the result demands >n labels,
getDomain returns "".
3> Check n > 1. If not, there's no domain, just a plain hostname.
Return ""; exit.
// Dotless FQDNs consist of a host only, there is no domain.
4> Check L1 == "tv". If so, getDomain returns L2.L1; exit.
// "tv" is a special-case "completely flat" ccTLD for historical reasons.
5> Check Len(L1) > 2. If so, getDomain returns L2.L1; exit.
// Len(L1)>2 suggests L1 is a gTLD rather than a ccTLD.
// If Len(L1)<=2 we assume L1 is a part of a ccTLD.
6> Check if L2 in gTLD list "com,edu,net,org,gov,mil,int". If so,
getDomain returns L3.L2.L1; exit.
// gTLDs, when they appear immediately left of a ccTLD (modulo
exception in step 4), are considered a part of the TLD.
7> If L1 is in the list "GR,PL" AND L2 is NOT in the gTLD list,
getDomain returns L2.L1; exit.
// GR and PL are considered "flat" ccTLDs EXCEPT when a gTLD appears in L2.
// getDomain("a.pl") returns "a.pl"
// getDomain("a.uk") returns ""
8> If Len(L2) < 3, getDomain returns L3.L2.L1; exit.
// getDomain("aa.bb.cc") returns "aa.bb.cc"
9> Otherwise, getDomain returns L2.L1
// getDomain("aa.bbb.cc") returns "bbb.cc"
]]
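
The steps above can be sketched in Python roughly as follows. This is
only an illustration of the heuristic as written, not a recommended
implementation. The names get_domain, GTLD_LABELS, FLAT_CCTLDS, and tail
are made up for this sketch, and lowercasing the input is an assumption
(the steps write "tv" in lowercase but "GR,PL" in uppercase, which
suggests case-insensitive matching).

```python
# Hypothetical names; the label lists are copied from steps 6 and 7.
GTLD_LABELS = {"com", "edu", "net", "org", "gov", "mil", "int"}
FLAT_CCTLDS = {"gr", "pl"}


def get_domain(fqdn: str) -> str:
    """Return the heuristic 'domain' for fqdn, or "" if there is none."""
    # Assumption: comparisons are case-insensitive.
    labels = fqdn.lower().split(".")

    # Step 1: drop an empty final label ("www.example.com." case).
    if labels and labels[-1] == "":
        labels.pop()
    n = len(labels)

    def tail(count: int) -> str:
        # Step 2 note: if the result demands more than n labels, return "".
        if count > n:
            return ""
        return ".".join(labels[-count:])

    # Step 3: a dotless name is a plain hostname with no domain.
    if n <= 1:
        return ""
    l1, l2 = labels[-1], labels[-2]

    # Step 4: "tv" is treated as a completely flat ccTLD.
    if l1 == "tv":
        return tail(2)
    # Step 5: a final label longer than two characters suggests a gTLD.
    if len(l1) > 2:
        return tail(2)
    # Step 6: a gTLD label directly left of a ccTLD counts as part of the TLD.
    if l2 in GTLD_LABELS:
        return tail(3)
    # Step 7: GR and PL are flat unless a gTLD label appears in L2.
    if l1 in FLAT_CCTLDS and l2 not in GTLD_LABELS:
        return tail(2)
    # Step 8: a short L2 is assumed to be part of the TLD.
    if len(l2) < 3:
        return tail(3)
    # Step 9: otherwise the domain is the last two labels.
    return tail(2)


if __name__ == "__main__":
    for name in ("a.pl", "a.uk", "aa.bb.cc", "aa.bbb.cc"):
        print(name, "->", repr(get_domain(name)))
```

Run against the four examples in the comments above, the sketch gives
"a.pl", "", "aa.bb.cc", and "bbb.cc" respectively.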
The heuristic is sufficiently ugly and wrong that I'd prefer to
recommend that user agents that care about security use the public
suffix list. For example, it breaks the cookie protocol for domains
in the "to" ccTLD. If a user agent doesn't care about security, then
it can skip the public suffix check and the protocol will still
function fine.
Adam