
Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)


From: Tim Ruehsen
Subject: Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)
Date: Thu, 17 Dec 2015 15:30:02 +0100
User-agent: KMail/4.14.10 (Linux/4.3.0-1-amd64; KDE/4.14.14; x86_64; ; )

Thanks, I pushed your changes to master.

Tim

On Tuesday 15 December 2015 18:52:01 Eli Zaretskii wrote:
> > From: Tim Ruehsen <address@hidden>
> > Cc: Eli Zaretskii <address@hidden>
> > Date: Tue, 15 Dec 2015 11:02:21 +0100
> > 
> > I pushed a conversion fix to master.
> 
> Thanks!
> 
> > There is another bug in wget that comes out with
> > wget -d --local-encoding=cp1255
> > 'http://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4'
> > 
> > Wget double escapes/converts to UTF-8... Maybe you can address this when
> > you are working on the code !?
> 
> You mean, because http redirects to https?  Yes, I've seen that
> already.  The simple patch below fixes that.  The problem seems to be
> that wget assumes the redirected URL to be encoded in the same
> encoding as the original one (which, as described earlier, starts with
> the local encoding), whereas it is much more reasonable to use the
> value provided by --remote-encoding.
> 
> And if the 'if' in the patch looks strange to you, it's rightfully
> so.  Look at this strange logic in set_uri_encoding:
> 
>   /* Set uri_encoding of struct iri i. If a remote encoding was specified,
>      use it unless force is true. */
>   void
>   set_uri_encoding (struct iri *i, const char *charset, bool force)
>   {
>     DEBUGP (("URI encoding = %s\n", charset ? quote (charset) : "None"));
>     if (!force && opt.encoding_remote)
>       return;
> 
> I understand the reason to prefer opt.encoding_remote when the 'force'
> flag is false -- the user-provided remote encoding should take
> preference.  But why return without making sure the URI's encoding is
> in fact set to that??  I guess there's some assumption that
> iri->uri_encoding is already set to opt.encoding_remote, but this
> assumption is certainly false in this case.  So I think this function
> should be changed to actually use opt.encoding_remote, if non-NULL,
> and otherwise use 'charset' even if 'force' is false.  Then the patch
> below could be simplified to avoid the test.  WDYT?
> 
> Here's the patch I promised.  With it, wget survives redirection from
> http to https and successfully retrieves that page.
> 
> 
> diff --git a/src/retr.c b/src/retr.c
> index a6a9bd7..6af26a0 100644
> --- a/src/retr.c
> +++ b/src/retr.c
> @@ -872,9 +872,11 @@ retrieve_url (struct url * orig_parsed, const char *origurl, char **file,
>        xfree (mynewloc);
>        mynewloc = construced_newloc;
> 
> -      /* Reset UTF-8 encoding state, keep the URI encoding and reset
> +      /* Reset UTF-8 encoding state, set the URI encoding and reset
>           the content encoding. */
>        iri->utf8_encode = opt.enable_iri;
> +      if (opt.encoding_remote)
> +     set_uri_encoding (iri, opt.encoding_remote, true);
>        set_content_encoding (iri, NULL);
>        xfree (iri->orig_url);
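
For reference, the change to set_uri_encoding suggested in the quoted text could look roughly like the sketch below. It is a standalone illustration, not the wget source: struct iri and opt are cut down to the two fields involved, and set_uri_encoding_sketch and copy_string are hypothetical names. It assumes one reading of the suggestion: when 'force' is false and a remote encoding was given, store that encoding instead of returning early; in every other case store 'charset'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down stand-ins for wget's structures (assumption: only the
   fields discussed in this thread are modelled). */
struct iri
{
  char *uri_encoding;           /* encoding assumed for the URI */
};

struct options
{
  const char *encoding_remote;  /* value of --remote-encoding, or NULL */
};

static struct options opt;

/* Hypothetical helper: heap copy of S, or NULL if S is NULL. */
static char *
copy_string (const char *s)
{
  if (!s)
    return NULL;
  size_t n = strlen (s) + 1;
  char *p = malloc (n);
  if (p)
    memcpy (p, s, n);
  return p;
}

/* Sketch of the suggested behaviour: prefer the user-supplied remote
   encoding unless FORCE is true, and always store the chosen value
   rather than assuming it is already set. */
static void
set_uri_encoding_sketch (struct iri *i, const char *charset, bool force)
{
  assert (i != NULL);

  const char *chosen = charset;
  if (!force && opt.encoding_remote)
    chosen = opt.encoding_remote;

  /* Copy before freeing: CHOSEN may point at the old value. */
  char *copy = copy_string (chosen);
  free (i->uri_encoding);
  i->uri_encoding = copy;
}
```

With a function shaped like this, the redirect path would not need its own 'if (opt.encoding_remote)' test, because a non-forced call already ends up with the remote encoding whenever one was specified.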


