

From: anonymous
Subject: [Bug-wget] [bug #50383] --local-encoding isn't used when converting a relative link in a recursive download
Date: Wed, 22 Feb 2017 18:10:43 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0

URL:
  <http://savannah.gnu.org/bugs/?50383>

                 Summary: --local-encoding isn't used when converting a relative link in a recursive download
                 Project: GNU Wget
            Submitted by: None
            Submitted on: Wed 22 Feb 2017 11:10:42 PM UTC
                Category: Program Logic
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: William Prescott
        Originator Email: address@hidden
             Open/Closed: Open
         Discussion Lock: Any
                 Release: 1.19
        Operating System: GNU/Linux
         Reproducibility: Every Time
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: None

    _______________________________________________________

Details:

When expanding a relative URL found on a page, Wget doesn't appear to take into
account the local encoding of the URL it was called with.

This is apparent when recursively downloading pages encoded with Shift_JIS
whose URL contains a tilde (Shift_JIS lacks ~ and has ‾ at the same code
point). While the documents themselves cannot contain a tilde, they can still
use relative links to move around within this path.

Wget currently expands relative links as if the user-provided URL were in the
document's character encoding. In my example, this changes the URL's tilde
to ‾.
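
The effect described above can be imitated with a short Python sketch. This is
only an illustration of one plausible reading of the behaviour, not Wget's
code: the helper name decode_as_jis_roman and the two-entry override table are
hypothetical, and the table assumes the JIS X 0201 Roman layout in which byte
0x7E is the overline and byte 0x5C is the yen sign.

```python
# Hypothetical illustration, not taken from Wget.
# Assumption: in the single-byte (JIS X 0201 Roman) range of Shift_JIS,
# only these two byte values differ from ASCII.
JIS_ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_as_jis_roman(raw: bytes) -> str:
    """Decode 7-bit bytes as JIS X 0201 Roman instead of ASCII."""
    chars = []
    for byte in raw:
        if byte >= 0x80:
            # Multi-byte Shift_JIS sequences are out of scope for this sketch.
            raise ValueError("only 7-bit input is handled here")
        chars.append(JIS_ROMAN_OVERRIDES.get(byte, chr(byte)))
    return "".join(chars)


# The user-supplied path is plain ASCII, but it is read as if it were
# in the document's encoding, so the tilde turns into an overline.
base_path = b"/~foo/"
misread_path = decode_as_jis_roman(base_path)
print(misread_path)
```

Every byte except 0x7E survives unchanged, which is why only the tilde in the
path is affected.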

My expectation is that Wget would use the specified local encoding for the
user-provided part of the base URL and the remote encoding for the rest of
the URL.
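
A minimal Python sketch of that expectation, using the standard library's
urllib.parse.urljoin. The function name resolve_link and its parameters are
made up for illustration; this is not how Wget is implemented. The base URL is
taken as already decoded from the local encoding and is left alone, and only
the bytes of the link found in the page are decoded with the remote encoding.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, raw_href: bytes, remote_encoding: str) -> str:
    """Resolve a link from a fetched page against the user-supplied base.

    base_url is assumed to be already decoded from the local encoding,
    so it is not re-interpreted in the document's character set.
    """
    href = raw_href.decode(remote_encoding)
    return urljoin(base_url, href)


resolved = resolve_link("http://127.0.0.1/~foo/", b"bar.html", "shift_jis")
print(resolved)  # http://127.0.0.1/~foo/bar.html
```

With this split, the tilde from the command line is never passed through the
Shift_JIS conversion, so it cannot become ‾.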

Additionally, links on a page retrieved using "IRI fallbacking" are not
followed (noticeable with bar.html in the example). This may be a separate
bug.

----------------------------------------
EXAMPLE CASE (test files attached as tar archive)
On server:
~foo/index.html
~foo/bar.html
~foo/baz.html (empty)

~foo/index.html is Shift_JIS encoded and contains
<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="bar.html">Bar</a>

~foo/bar.html is Shift_JIS encoded and contains
<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="baz.html">Baz</a>


Results for wget -np -r --local-encoding=utf-8 -d 'http://127.0.0.1/~foo/'
(using Wget 1.19):
~foo/index.html works fine and is saved to "127.0.0.1/~foo/index.html"
~foo/bar.html gets tried as "%E2%80%BEfoo/bar.html" before IRI fallbacking and
is then incorrectly saved to "127.0.0.1/‾foo/bar.html"
~foo/baz.html is never visited.
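
For reference, the escaped prefix in the attempted URL can be checked with a
few lines of Python (standard library only). This just decodes the string
quoted in the results above and says nothing about Wget's internals.

```python
import unicodedata
from urllib.parse import unquote

# The path Wget tried first, as quoted in the results above.
tried = "%E2%80%BEfoo/bar.html"

decoded = unquote(tried)  # percent-decoding assumes UTF-8 by default
first_char = decoded[0]

print(decoded)
print(f"U+{ord(first_char):04X}", unicodedata.name(first_char))
```

The first character comes out as U+203E OVERLINE rather than U+007E TILDE,
which matches the name of the directory that bar.html was saved under.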



Mailing list discussion at
http://lists.gnu.org/archive/html/bug-wget/2017-02/msg00111.html



    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Wed 22 Feb 2017 11:10:42 PM UTC  Name: wget_output.txt  Size: 6kB   By: None

<http://savannah.gnu.org/bugs/download.php?file_id=39812>
-------------------------------------------------------
Date: Wed 22 Feb 2017 11:10:42 PM UTC  Name: example.tar.gz  Size: 297B   By: None

<http://savannah.gnu.org/bugs/download.php?file_id=39813>

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?50383>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
