Re: [Bug-wget] libpsl design

From:	Daniel Kahn Gillmor
Subject:	Re: [Bug-wget] libpsl design
Date:	Fri, 21 Mar 2014 17:24:56 -0400
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Icedove/24.3.0

On 03/21/2014 04:54 PM, Ángel González wrote:
> On 21/03/14 21:13, Daniel Kahn Gillmor wrote:
>> i've just pushed some cleanup suggestions here:
>>
>>    https://github.com/rockdaboot/libpsl/pull/1
>>
>> i see you've pulled them already, thanks!
>>
>> i've got three more conceptual issues which warrant discussion, rather
>> than a patch, though.  If there's a better place to have this discussion
>> than this mailing list, i'm happy to move to it, please let me know
>> where.
>>
>> psl_is_tld() semantics
>> ----------------------
>>
>> the way i see it, we know what it means for psl_is_tld() to return
>> "true" -- but "false" could mean either:
>>
>> (A) "this zone is subordinate to a TLD" (as example.com is to com)
>>     or
>> (B) "this zone is superior to a TLD" (as uk is to co.uk).  Note that
>> "uk" is not a public suffix.
> Hmm, actually uk is a public suffix: since it does not match anything
> explicitly in the list, it will be caught by the implicit last-resort
> rule '*'.
> 
> Also, what would you do with a domain such as his.name?
> It is both inferior to a public suffix (.name) and superior
> (forgot.his.name).

hm, the same problem is present for amazonaws.com; it is superior to
s3.amazonaws.com (and 32 other public suffixes), and subordinate to .com

> I think it should have a different return code, though.

can you propose a specific API?  the devil is in the details.
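
To make the question concrete, here is one hypothetical shape for such a
return value: a set of flags instead of a boolean, so a name like
his.name or amazonaws.com can be reported as both below and above a
public suffix. Everything in it is an assumption made for illustration
(the REL_* names, classify_name(), and the five-entry toy list); none of
it is existing libpsl API. It also ignores wildcard rules, exception
rules, and the implicit '*' rule, and expects lowercase ASCII input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result flags; a name can carry more than one. */
enum {
    REL_IS_SUFFIX    = 1 << 0, /* name is itself a listed suffix      */
    REL_BELOW_SUFFIX = 1 << 1, /* name is subordinate to a suffix     */
    REL_ABOVE_SUFFIX = 1 << 2  /* name is superior to a listed suffix */
};

/* Toy stand-in for the real list (assumption, for illustration only). */
static const char *const toy_suffixes[] = {
    "com", "co.uk", "name", "forgot.his.name", "s3.amazonaws.com"
};

/* Nonzero if `sub` is `super` with at least one extra leading label,
 * e.g. sub = "example.com", super = "com". */
static int ends_with_label(const char *sub, const char *super)
{
    size_t sub_len = strlen(sub), super_len = strlen(super);

    if (sub_len <= super_len + 1)
        return 0; /* no room for a non-empty label plus a dot */
    return sub[sub_len - super_len - 1] == '.' &&
           strcmp(sub + sub_len - super_len, super) == 0;
}

/* Compare `name` against every toy entry and collect the flags. */
static int classify_name(const char *name)
{
    int flags = 0;

    assert(name != NULL);
    for (size_t i = 0; i < sizeof toy_suffixes / sizeof toy_suffixes[0]; i++) {
        const char *s = toy_suffixes[i];

        if (strcmp(name, s) == 0)
            flags |= REL_IS_SUFFIX;
        else if (ends_with_label(name, s))
            flags |= REL_BELOW_SUFFIX;
        else if (ends_with_label(s, name))
            flags |= REL_ABOVE_SUFFIX;
    }
    return flags;
}
```

With this toy list, classify_name("his.name") and
classify_name("amazonaws.com") both carry REL_BELOW_SUFFIX and
REL_ABOVE_SUFFIX, while "uk" carries only REL_ABOVE_SUFFIX. Whether
callers actually want all three bits is exactly the open question.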

>>    https://www.gnu.org/software/libidn/
> I would expect the input in punycode and optionally in utf-8. This means
> a preprocessing step from the original list is needed.

This implies that people wouldn't be able to use effective_tld_names.dat
as distributed, right?  I can see this working for OS-level
distributions (I can preprocess effective_tld_names.dat when
distributing it in publicsuffix for debian), but for regular users it
sounds terrible.

> If we are handed an i18n domain, punycode it with libidn if we are
> linked to it, else return an error.

How do you propose we determine that we're handed an i18n domain if
we're not linked to libidn?  just check for any byte other than
printable ascii?
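
For reference, that byte check could look roughly like the sketch below.
name_needs_idna() is a hypothetical helper, not an existing libpsl
function, and treating space as non-printable here is my own assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if any byte of `name` lies outside
 * the printable, non-space ASCII range 0x21..0x7E. Any UTF-8 encoded
 * non-ASCII character has all of its bytes >= 0x80, so it trips this
 * check; so do control characters, space, and DEL. */
static bool name_needs_idna(const char *name)
{
    assert(name != NULL);
    for (const unsigned char *p = (const unsigned char *)name; *p; p++) {
        if (*p < 0x21 || *p > 0x7E)
            return true;
    }
    return false;
}
```

Note that this only says "not plain ASCII"; it cannot tell valid UTF-8
from some other 8-bit encoding, and a punycode name such as
xn--bcher-kva.de passes as plain ASCII.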

Should we do the same thing for psl_load_file()?

If we implement something like psl_get_private_zone(), what form should
the returned name take?

> It is disgusting to do a roundtrip utf-8 -> punycode -> utf-8 for
> extracting the base domain, though.

that does sound ugly.

>> malformed inputs
>> ----------------
>>
>> What should the library do with malformed inputs?  i'm thinking about
>> super-long strings, strings starting with more than one dot, or with
>> multiple dots adjacent to each other, strings that don't match whatever
>> encoding we're expecting users to send, etc.
>
> Return an error.

I'm asking what API we think is reasonable for handling errors here.  Do
we need to distinguish between the malformed input error and the kind of
error we might get by calling psl_get_private_zone("uk")?  what makes
sense for callers?
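
As a strawman, one option is a small status enum that keeps the two
cases apart, plus a syntactic check that runs before any list lookup.
All names below (zone_status, check_name(), zone_status_text()) are
hypothetical, and the limits are the usual DNS ones (253 bytes per name
in text form, 63 bytes per label); the sketch accepts a single trailing
dot and does not look at encoding at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes separating "bad input" from "valid input
 * that has no private zone" (e.g. a name at or above a public suffix). */
typedef enum {
    ZONE_OK = 0,
    ZONE_ERR_MALFORMED,
    ZONE_ERR_NO_PRIVATE_ZONE
} zone_status;

#define MAX_NAME_LEN  253 /* DNS name limit in text form */
#define MAX_LABEL_LEN 63  /* DNS label limit */

/* Purely syntactic check: rejects NULL, empty and over-long names,
 * leading dots, adjacent dots, and over-long labels. */
static zone_status check_name(const char *name)
{
    size_t label_len = 0;

    if (name == NULL || *name == '\0')
        return ZONE_ERR_MALFORMED;
    if (strlen(name) > MAX_NAME_LEN)
        return ZONE_ERR_MALFORMED;

    for (const char *p = name; *p; p++) {
        if (*p == '.') {
            if (label_len == 0)
                return ZONE_ERR_MALFORMED; /* leading or doubled dot */
            label_len = 0;
        } else if (++label_len > MAX_LABEL_LEN) {
            return ZONE_ERR_MALFORMED;
        }
    }
    return ZONE_OK;
}

/* Human-readable text for each status, for callers that log errors. */
static const char *zone_status_text(zone_status s)
{
    switch (s) {
    case ZONE_OK:                  return "ok";
    case ZONE_ERR_MALFORMED:       return "malformed input";
    case ZONE_ERR_NO_PRIVATE_ZONE: return "no private zone";
    }
    assert(!"unhandled zone_status");
    return "unknown";
}
```

Under this scheme check_name("uk") returns ZONE_OK, since the name is
well formed; it would be the lookup itself that reports
ZONE_ERR_NO_PRIVATE_ZONE. Whether callers care about that distinction is
the thing I'd like feedback on.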

        --dkg
