Re: [Bug-wget] [Bug-Wget] Misc. patches
From: Tim Rühsen
Subject: Re: [Bug-wget] [Bug-Wget] Misc. patches
Date: Sun, 20 Jul 2014 22:58:04 +0200
User-agent: KMail/4.12.4 (Linux/3.14-1-amd64; KDE/4.13.1; x86_64; ; )
On Monday, 21 July 2014, 00:58:49, Darshit Shah wrote:
> On Mon, Jul 7, 2014 at 8:14 PM, Tim Ruehsen <address@hidden> wrote:
> > One more comment / idea.
> >
> > The 'cookie_domain' comes from an HTTP Set-Cookie response header and thus
> > is (must be) toASCII() encoded (= punycode). Of course this has to be
> > checked when normalizing the incoming cookie data. A cookie domain having
> > non-ASCII characters should simply be dropped.
> >
> > The whole check only works when 'host' is also in toASCII() (punycode)
> > form.
> >
> > Assuming this, psl_str_to_utf8lower() just reduces to an ASCII lowercase
> > converter.
> >
> > If Wget converted every domain name input to punycode + lowercase, many
> > conversions would fall away and case-insensitive functions would not be
> > needed (e.g. calling strcmp instead of strcasecmp, the need to call
> > psl_str_to_utf8lower() would fall away, etc.).
> >
> > What do you think?
>
> Sounds like an interesting idea to me. Although, how do you suggest we
> go about converting the domain names to lowercase?
> I'm not sure about this, so let me confirm first. After running the input
> domain names through toASCII(), can we simply pass the string to
> tolower() to get the lowercase version?

That depends on the library you use.

libidn's toASCII() has a built-in lowercase conversion. So the input case does
not matter, the output is always lowercase ASCII.

Using libidn2, you have to convert to lowercase first yourself (e.g. using
libunistring). The output is of course lowercase ASCII.

Using libicu, you have to convert to lowercase first yourself (but libicu is
able to do that). The output is of course lowercase ASCII.
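
For a string that is already pure ASCII (for example a cookie domain that arrived in punycode form), the lowercase step could be sketched like this. Both helper names are made up for illustration and are not part of Wget, libpsl or any IDN library. The lowercasing is done by hand rather than with tolower(), because tolower() depends on the current locale.

```c
#include <stdbool.h>

/* Hypothetical helper (not a Wget/libpsl function):
   returns true if every byte of s is 7-bit ASCII. */
static bool domain_is_ascii(const char *s)
{
    for (; *s; s++)
        if ((unsigned char) *s > 0x7F)
            return false;
    return true;
}

/* Hypothetical helper: lowercase A-Z in place without consulting the
   locale. Bytes outside A-Z are left untouched. */
static void ascii_strlower(char *s)
{
    for (; *s; s++)
        if (*s >= 'A' && *s <= 'Z')
            *s = (char) (*s - 'A' + 'a');
}
```

A caller would first reject any cookie domain for which domain_is_ascii() returns false, and only then lowercase it.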

What I had in mind (and what I did in Mget) is to 'normalize' every domain
name before further processing/comparing. 'Normalizing' means trimming,
percent-decoding, charset transcoding to UTF-8, and toASCII() conversion (with
or without prior lowercasing, depending on the IDN library used).

Having that, Wget's code just needs strcmp() to compare domains, and
$ wget übel.de Übel.de xn--bel-goa.de
should reduce to a download of a single file (xn--bel-goa.de/index.html)
(but maybe it is Wget's policy to explicitly download every URL given on the
command line, even if it is always the same!?)
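
A rough sketch of part of such a normalization, under stated assumptions: the function name normalize_domain_sketch() is hypothetical, the input is assumed to be ASCII already, and the charset transcoding and toASCII() steps are left out because they depend on the IDN library in use. Only trimming, percent-decoding and ASCII lowercasing are shown.

```c
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

static int is_blank_char(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper: returns a malloc'ed, normalized copy of an ASCII
   domain name, or NULL if allocation fails. The caller frees the result.
   Transcoding to UTF-8 and toASCII() are NOT done here. */
static char *normalize_domain_sketch(const char *in)
{
    size_t len, i;
    char *out, *dst;

    /* Trim leading and trailing whitespace. */
    while (is_blank_char(*in))
        in++;
    len = strlen(in);
    while (len > 0 && is_blank_char(in[len - 1]))
        len--;

    out = malloc(len + 1);
    if (!out)
        return NULL;

    dst = out;
    for (i = 0; i < len; i++) {
        char c = in[i];

        /* Decode %XX, but never produce an embedded NUL byte. */
        if (c == '%' && i + 2 < len) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0 && (hi | lo) != 0) {
                c = (char) (hi * 16 + lo);
                i += 2;
            }
        }

        /* Locale-independent ASCII lowercase. */
        if (c >= 'A' && c <= 'Z')
            c = (char) (c - 'A' + 'a');

        *dst++ = c;
    }
    *dst = '\0';
    return out;
}
```

With every domain passed through such a function once on input, two domains are equal exactly when strcmp() on the normalized strings returns 0.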

Domain names come in from the command line (URLs and a few options like
-D/--domains), from local files (-i/--input-file) and from remote files.

But Darshit, maybe this should have low priority. It is more a kind of 'code
polishing'. I am looking forward to starting a Wget version based on a libwget
in the next 6-12 months. Most of the code is already working in the Mget
project, but everything needs polishing (e.g. API docs and more of Wget's
functionality; -k/--convert-links was implemented last week ;-). And then the
day will come to merge Wget and Mget... if that finds any friends ;-)
>
> > Tim
> >
> > On Monday 07 July 2014 17:08:48 Darshit Shah wrote:
> >> + if (psl_str_to_utf8lower (cookie_domain, NULL, NULL, &cookie_domain_lower) == PSL_SUCCESS &&
> >> + psl_str_to_utf8lower (host, NULL, NULL, &host_lower) == PSL_SUCCESS)
> >> + {
> >> + is_acceptable = psl_is_cookie_domain_acceptable (psl, host_lower, cookie_domain_lower);
> >> + }
> >> + else
> >> + {
> >> + DEBUGP (("libpsl unable to parse domain name. "
> >> + "Falling back to simple heuristics.\n"));
> >> + goto no_psl;
> >> + }