Re: [http-state] Ticket 3: Public Suffixes

"Yngve N. Pettersen (Developer Opera Software ASA)" <yngve@opera.com> Sat, 16 January 2010 23:10 UTC

Date: Sun, 17 Jan 2010 00:10:26 +0100
To: Adam Barth <ietf@adambarth.com>, corvid <corvid@lavabit.com>
From: "Yngve N. Pettersen (Developer Opera Software ASA)" <yngve@opera.com>
Organization: Opera Software AS
Content-Type: text/plain; format="flowed"; delsp="yes"; charset="iso-8859-15"
MIME-Version: 1.0
References: <7789133a1001160001h62d203b3w76e175ec22d55e6@mail.gmail.com> <20100116194716.GA3036@local.gobigwest.com> <7789133a1001161439o6873ec88jdebc911ea5dd0ebc@mail.gmail.com>
Content-Transfer-Encoding: 8bit
Message-ID: <op.u6nenop5qrq7tp@acorna.invalid.invalid>
In-Reply-To: <7789133a1001161439o6873ec88jdebc911ea5dd0ebc@mail.gmail.com>
User-Agent: Opera Mail/9.65 (Win32)
Cc: http-state <http-state@ietf.org>
Subject: Re: [http-state] Ticket 3: Public Suffixes

On Sat, 16 Jan 2010 23:39:35 +0100, Adam Barth <ietf@adambarth.com> wrote:

> On Sat, Jan 16, 2010 at 11:47 AM, corvid <corvid@lavabit.com> wrote:
>> Adam wrote:
>>> Another alternative is to recommend a heuristic that works in many
>>> cases and then further recommend that user agents use the full list.
>>> The problem with this approach is that I don't know of any simple
>>> heuristics that provide reasonable behavior.  In the past, some user
>>> agents have used heuristics based on the length of the top-level
>>> domain (i.e., two characters => ccTLD => foo.cc is a public suffix).
>>> Unfortunately, this heuristic has undesirable consequences for some
>>> small countries that let folks register domains directly in the ccTLD.
>>
>> This seems good to me to both
>> - let implementors know that they can use the publicsuffix list
>> - try to provide the best heuristic we know of for user agents who might
>>  not have the luxury of using publicsuffix for whatever reason (or can't
>>  depend on it)
>
> Here's the best heuristic I know.  The algorithm can probably be
> simplified and explained more clearly.
>
> [[
> Roughly, getDomain(strFQDN) amounts to:
>
> 1> If the final label is empty, drop it for the purposes of this
> algorithm.
> // Otherwise "www.example.com." would have four labels "www",
> "example", "com", "".  Instead, we drop the final label.
>
> 2> Name the labels Ln,...,L3,L2,L1; decreasing from start
> (Leftmost=Ln) to finish (Rightmost=L1).
> // If at any point in this algorithm the result demands >n labels,
> getDomain returns "".
>
> 3> Check n > 1.  If not, there's no domain, just a plain hostname.
> Return ""; exit.
> // Dotless FQDNs consist of a host only, there is no domain.
>
> 4> Check L1 == "tv".  If so, getDomain returns L2.L1; exit.
> // "tv" is a special-case "completely flat" ccTLD for historical reasons.
>
> 5> Check Len(L1) > 2.  If so, getDomain returns L2.L1; exit.
> // Len(L1)>2 suggests L1 is a gTLD rather than a ccTLD.
> // If Len(L1)<=2 we assume L1 is a part of a ccTLD.
>
> 6> Check if L2 in gTLD list "com,edu,net,org,gov,mil,int".  If so,
> getDomain returns L3.L2.L1; exit.
> // gTLDs, when they appear immediately left of a ccTLD (modulo
> exception in step 4), are considered a part of the TLD.
>
> 7> If L1 is in the list "GR,PL" AND L2 is NOT in the gTLD list,
> getDomain returns L2.L1; exit.
> // GR and PL are considered "flat" ccTLDs EXCEPT when a gTLD appears in L2.
> // getDomain("a.pl") returns "a.pl"
> // getDomain("a.uk") returns ""
>
> 8> If Len(L2) < 3 getDomain returns L3.L2.L1; exit.
> // getDomain("aa.bb.cc") returns "aa.bb.cc"
>
> 9> Otherwise, getDomain returns L2.L1
> // getDomain("aa.bbb.cc") returns "bbb.cc"
> ]]
>
> The heuristic is sufficiently ugly and wrong that I'd prefer to
> recommend that user agents that care about security use the public
> suffix list.  For example, it breaks the cookie protocol for domains
> in the "to" ccTLD.  If a user agent doesn't care about security, then
> it can skip the public suffix check and the protocol will still
> function fine.

Just to confirm the "wrong" part:

At a minimum, this algorithm would classify two Norwegian public suffixes,
vgs.no (high schools of Norway) and kommune.no (municipalities/counties of
Norway), as ordinary domains, as well as 400+ others in the .no namespace.

It would also classify the domain of Norway's largest print and online
newspaper, vg.no, as a public suffix, and likewise dn.no, the domain of
Norway's largest daily economic newspaper.
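
To make these misclassifications concrete, here is a rough Python sketch of
the heuristic quoted above. It is my reading of the nine steps, not a
reference implementation; the names get_domain, GTLD_LABELS and FLAT_CCTLDS
are made up for illustration.

```python
# Sketch of the quoted getDomain heuristic (steps 1-9), as I read it.
GTLD_LABELS = {"com", "edu", "net", "org", "gov", "mil", "int"}
FLAT_CCTLDS = {"gr", "pl"}


def get_domain(fqdn: str) -> str:
    labels = fqdn.lower().split(".")
    # Step 1: drop an empty final label ("www.example.com." case).
    if labels and labels[-1] == "":
        labels.pop()
    n = len(labels)

    def last(k: int) -> str:
        # Step 2 note: if the result demands more than n labels, return "".
        return ".".join(labels[-k:]) if k <= n else ""

    # Step 3: a dotless name is a plain hostname with no domain.
    if n <= 1:
        return ""
    l1, l2 = labels[-1], labels[-2]
    # Step 4: "tv" is special-cased as completely flat.
    if l1 == "tv":
        return last(2)
    # Step 5: a TLD longer than two characters is assumed to be a gTLD.
    if len(l1) > 2:
        return last(2)
    # Step 6: a gTLD label directly left of a ccTLD is part of the suffix.
    if l2 in GTLD_LABELS:
        return last(3)
    # Step 7: flat ccTLDs (L2 is already known not to be a gTLD label here).
    if l1 in FLAT_CCTLDS:
        return last(2)
    # Step 8: a short L2 is assumed to be part of the suffix.
    if len(l2) < 3:
        return last(3)
    # Step 9: everything else.
    return last(2)


print(get_domain("foo.vgs.no"))  # vgs.no: public suffix treated as a domain
print(get_domain("www.vg.no"))   # www.vg.no: vg.no treated as a public suffix
```

With this sketch, a host under kommune.no likewise comes back as
"kommune.no", and dn.no behaves the same way as vg.no.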

IMO, looking up an online public suffix repository, as Opera will be
doing shortly, is probably the best option, as it does not require
hardcoding a list of domains into the executable, and eliminates the need
to update the executable when the list changes.
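
For comparison, a list-based check can be sketched as below. This is only an
illustration under stated assumptions: the rule set is a hand-picked subset
taken from the examples in this message, the function name
registrable_domain is hypothetical, and wildcard or exception rules are not
handled. How the rules are fetched and refreshed is left out, since that
depends on the repository's interface.

```python
from typing import Set

# Hypothetical, hand-picked subset of suffix rules for illustration only.
# A real client would load the full rule set from wherever it is published.
SUFFIXES: Set[str] = {"no", "vgs.no", "kommune.no"}


def registrable_domain(host: str, suffixes: Set[str]) -> str:
    """Return the public suffix plus one label, or "" if there is none."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest, so the
    # longest matching rule wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return ""  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return ""  # no known suffix matched


print(registrable_domain("www.vg.no", SUFFIXES))   # vg.no
print(registrable_domain("foo.vgs.no", SUFFIXES))  # foo.vgs.no
```

Because the rules are plain data passed in at call time, they can be
replaced when the list changes without touching the code.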

-- 
Sincerely,
Yngve N. Pettersen
 
********************************************************************
Senior Developer                     Email: yngve@opera.com
Opera Software ASA                   http://www.opera.com/
Phone:  +47 24 16 42 60              Fax:    +47 24 16 40 01
********************************************************************