
Re: [Bug-wget] [Bug-Wget] Misc. patches


From: Tim Rühsen
Subject: Re: [Bug-wget] [Bug-Wget] Misc. patches
Date: Sun, 20 Jul 2014 22:58:04 +0200
User-agent: KMail/4.12.4 (Linux/3.14-1-amd64; KDE/4.13.1; x86_64; ; )

On Monday, 21 July 2014, 00:58:49, Darshit Shah wrote:
> On Mon, Jul 7, 2014 at 8:14 PM, Tim Ruehsen <address@hidden> wrote:
> > One more comment / idea.
> > 
> > The 'cookie_domain' comes from an HTTP Set-Cookie response header and thus
> > is (must be) toASCII() encoded (= punycode). Of course this has to be
> > checked when normalizing the incoming cookie data. A cookie domain having
> > non-ASCII characters should simply be dropped.
> > 
> > The whole check only works when 'host' is also in toASCII() (punycode)
> > form.
> > 
> > Assuming this, psl_str_to_utf8lower() just reduces to an ASCII lowercase
> > converter.
> > 
> > If Wget converted every domain name input to punycode + lowercase, many
> > conversions would fall away and case-insensitive functions would not be
> > needed (e.g. calling strcmp instead of strcasecmp, the need to call
> > psl_str_to_utf8lower() would fall away, etc.).
> > 
> > What do you think?
> 
> Sounds like an interesting idea to me. However, how do you suggest we
> go about converting the domain names to lowercase?
> I'm not sure about this, so let me confirm first: after running the input
> domain names through toASCII(), can we simply pass the string to
> tolower() to get the lowercase version?

That depends on the library you use.

libidn's toASCII() has a built-in lowercase conversion. So the input case does 
not matter, the output is always lowercase ASCII.

Using libidn2, you have to convert to lowercase first yourself (e.g. using 
libunistring). The output is of course lowercase ASCII.

Using libicu, you have to convert to lowercase first yourself (but libicu is 
able to do that). The output is of course lowercase ASCII.
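
For the case where a name is already in toASCII() form, plain ASCII
lowercasing is enough. Below is a minimal sketch, not taken from Wget or
Mget; ascii_domain_lower() is a hypothetical name. It avoids tolower()
because that function is locale-dependent, and it rejects any non-ASCII
byte, so it can double as the "drop it" check for cookie domains.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Wget, Mget or libpsl.
   Copies an already toASCII()-encoded domain name into 'out' and
   lowercases A-Z on the way.
   Returns 0 on success, -1 if the input contains a non-ASCII byte or
   does not fit into 'out' (including the terminating NUL). */
static int
ascii_domain_lower (const char *in, char *out, size_t outsize)
{
  size_t i, len;

  assert (in != NULL && out != NULL);

  len = strlen (in);
  if (len >= outsize)
    return -1;                  /* buffer too small */

  for (i = 0; i < len; i++)
    {
      unsigned char c = (unsigned char) in[i];

      if (c >= 0x80)
        return -1;              /* not pure ASCII, so not punycode */

      if (c >= 'A' && c <= 'Z')
        c = (unsigned char) (c - 'A' + 'a');  /* locale-independent */

      out[i] = (char) c;
    }

  out[len] = '\0';
  return 0;
}
```

A caller would treat a return value of -1 as "reject this name" and only
compare or store names for which the function returned 0.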


What I have in mind (and what I did in Mget) is to 'normalize' every domain
name before further processing or comparing. 'Normalizing' means trimming,
percent-decoding, charset transcoding to UTF-8, and toASCII() conversion
(with or without prior lowercasing, depending on the IDN library used).
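
The first two normalization steps (trimming and percent-decoding) need no
IDN library, so they can be sketched in plain C. This is an illustration
only, not the Mget code; host_trim_and_unescape() and hex_value() are
hypothetical names. Charset transcoding and the toASCII() call are left
out on purpose because they depend on the library chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the value of a hex digit, or -1 if 'c' is not a hex digit. */
static int
hex_value (unsigned char c)
{
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

static int
is_blank_char (char c)
{
  return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper: trims surrounding whitespace and percent-decodes
   'host' in place. Malformed escapes and "%00" are copied literally.
   Returns 'host'. */
static char *
host_trim_and_unescape (char *host)
{
  char *src, *dst, *end;

  assert (host != NULL);

  src = host;
  while (is_blank_char (*src))
    src++;                              /* skip leading whitespace */

  end = src + strlen (src);
  while (end > src && is_blank_char (end[-1]))
    end--;                              /* skip trailing whitespace */

  /* dst never overtakes src, so decoding in place is safe. */
  dst = host;
  while (src < end)
    {
      int hi = -1, lo = -1;

      if (*src == '%' && end - src >= 3
          && (hi = hex_value ((unsigned char) src[1])) >= 0
          && (lo = hex_value ((unsigned char) src[2])) >= 0
          && (hi | lo) != 0)
        {
          *dst++ = (char) (hi * 16 + lo);
          src += 3;
        }
      else
        *dst++ = *src++;
    }

  *dst = '\0';
  return host;
}
```

The result would then be transcoded to UTF-8 if necessary and handed to
the IDN library's toASCII() function.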

With that in place, Wget's code just needs strcmp() to compare domains, and
$ wget übel.de Übel.de xn--bel-goa.de
should reduce to a download of a single file (xn--bel-goa.de/index.html)
(but maybe it is Wget's policy to explicitly download every URL given on the
command line, even if it is always the same!?)
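
To illustrate the point about strcmp(): once all names are normalized,
removing duplicates needs nothing more than byte-wise comparison. The
sketch below assumes the names have already been normalized;
unique_hosts() is a hypothetical name, and whether Wget should drop
duplicate command-line URLs at all is the open question raised above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes duplicates from names[0..n-1] in place,
   keeping the first occurrence of each name and the original order.
   Assumes every name is already normalized (punycode + lowercase), so
   strcmp() is sufficient. Returns the number of names kept. */
static size_t
unique_hosts (const char **names, size_t n)
{
  size_t i, j, kept = 0;

  assert (names != NULL || n == 0);

  for (i = 0; i < n; i++)
    {
      for (j = 0; j < kept; j++)
        if (strcmp (names[j], names[i]) == 0)
          break;                        /* already seen */

      if (j == kept)
        names[kept++] = names[i];       /* first occurrence, keep it */
    }

  return kept;
}
```

The quadratic loop is fine for a handful of command-line arguments; a
hash table would be the better choice for large input files.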

Domain names come in from the command line (URLs and a few options like
-D/--domains), from local files (-i/--input-file) and from remote files.

But Darshit, maybe this should have low priority. It is more a kind of 'code
polishing'. I am looking forward to starting a Wget version based on a
libwget in the next 6-12 months. Most of the code is already working in the
Mget project, but everything needs polishing (e.g. API docs and more of
Wget's functionality; -k/--convert-links was implemented last week ;-).
And then the day comes to merge Wget and Mget... if that finds any friends ;-)

> 
> > Tim
> > 
> > On Monday 07 July 2014 17:08:48 Darshit Shah wrote:
> >> +  if (psl_str_to_utf8lower (cookie_domain, NULL, NULL,
> >> +                            &cookie_domain_lower) == PSL_SUCCESS &&
> >> +      psl_str_to_utf8lower (host, NULL, NULL, &host_lower) == PSL_SUCCESS)
> >> +    {
> >> +      is_acceptable = psl_is_cookie_domain_acceptable (psl, host_lower,
> >> +                                                       cookie_domain_lower);
> >> +    }
> >> +  else
> >> +    {
> >> +        DEBUGP (("libpsl unable to parse domain name. "
> >> +                 "Falling back to simple heuristics.\n"));
> >> +        goto no_psl;
> >> +    }



