From: anonymous
Subject: [Bug-wget] [bug #50383] --local-encoding isn't used when converting a relative link in a recursive download
Date: Wed, 22 Feb 2017 18:10:43 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0
URL:
<http://savannah.gnu.org/bugs/?50383>
Summary: --local-encoding isn't used when converting a relative link in a recursive download
Project: GNU Wget
Submitted by: None
Submitted on: Wed 22 Feb 2017 11:10:42 PM UTC
Category: Program Logic
Severity: 3 - Normal
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Name: William Prescott
Originator Email: address@hidden
Open/Closed: Open
Discussion Lock: Any
Release: 1.19
Operating System: GNU/Linux
Reproducibility: Every Time
Fixed Release: None
Planned Release: None
Regression: None
Work Required: None
Patch Included: None
_______________________________________________________
Details:
When expanding a relative URL found on a page, Wget doesn't appear to take into
account the local encoding of the URL Wget was called with.
This is apparent when recursively downloading Shift_JIS-encoded pages whose URL
contains a tilde (Shift_JIS lacks ~ and has ‾ at the same code point, 0x7E).
While the documents themselves cannot contain a literal tilde, they can still
use relative links to move around within this path.
Wget currently expands relative links as if the user-provided URL were in the
document's character encoding. In my example, this changes the URL's tilde
to ‾.
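The effect can be illustrated with a minimal Python sketch. This is not Wget's
code; it assumes a converter that treats byte 0x7E as OVERLINE (U+203E), as in
the JIS X 0201 Roman set, and the helper names decode_as_jis_roman and
mangled_path are hypothetical.

```python
from urllib.parse import quote

# Assumption: the two code points where JIS X 0201 Roman differs from ASCII.
JIS_ROMAN_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_as_jis_roman(raw: bytes) -> str:
    """Decode 7-bit bytes as JIS X 0201 Roman (hypothetical helper).

    Only bytes below 0x80 are handled; that is enough for this example.
    """
    return "".join(JIS_ROMAN_OVERRIDES.get(b, chr(b)) for b in raw)


def mangled_path(user_path: str) -> str:
    """Reinterpret a user-typed ASCII path as remote-encoded text, then
    percent-encode the result as UTF-8, mimicking the reported behaviour."""
    raw = user_path.encode("ascii")
    as_remote = decode_as_jis_roman(raw)
    return quote(as_remote, safe="/")


print(mangled_path("~foo/bar.html"))  # %E2%80%BEfoo/bar.html
```

The printed value matches the request path that appears in the debug output for
bar.html in the example case.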
My expectation is that Wget would use the specified local encoding for the
user-provided part of the base and the remote encoding for the rest of the URL.
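A rough Python sketch of that expectation follows. It is only an illustration
under stated assumptions, not a proposed patch: resolve_link is a hypothetical
helper, and it covers only the first level of recursion, where the whole base
URL comes from the command line.

```python
from urllib.parse import urljoin


def resolve_link(base_bytes: bytes, local_encoding: str,
                 href_bytes: bytes, remote_encoding: str) -> str:
    """Decode each part with its own encoding before joining (hypothetical)."""
    # The base was typed by the user, so decode it with the local encoding.
    base = base_bytes.decode(local_encoding)
    # The link text came from the document, so decode it with the remote one.
    href = href_bytes.decode(remote_encoding)
    return urljoin(base, href)


url = resolve_link(b"http://127.0.0.1/~foo/", "utf-8", b"bar.html", "shift_jis")
print(url)  # http://127.0.0.1/~foo/bar.html
```

Because the tilde is decoded as UTF-8 and never passes through the remote
encoding, it survives unchanged in the joined URL.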
Additionally, links on a page retrieved using "IRI fallbacking" are not
followed (noticeable on bar.html in the example). This may be a separate bug.
----------------------------------------
EXAMPLE CASE (test files attached as tar archive)
On server:
~foo/index.html
~foo/bar.html
~foo/baz.html (empty)
~foo/index.html is Shift_JIS encoded and contains
<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="bar.html">Bar</a>
~foo/bar.html is Shift_JIS encoded and contains
<meta http-equiv="Content-Type" content="text/html;charset=Shift_JIS">
<a href="baz.html">Baz</a>
Results for wget -np -r --local-encoding=utf-8 -d 'http://127.0.0.1/~foo/'
(using Wget 1.19):
~foo/index.html works fine and is saved to "127.0.0.1/~foo/index.html"
~foo/bar.html gets tried as "%E2%80%BEfoo/bar.html" before IRI fallbacking and
is then incorrectly saved to "127.0.0.1/‾foo/bar.html"
~foo/baz.html is never visited.
Mailing list discussion at
http://lists.gnu.org/archive/html/bug-wget/2017-02/msg00111.html
_______________________________________________________
File Attachments:
-------------------------------------------------------
Date: Wed 22 Feb 2017 11:10:42 PM UTC Name: wget_output.txt Size: 6kB By: None
<http://savannah.gnu.org/bugs/download.php?file_id=39812>
-------------------------------------------------------
Date: Wed 22 Feb 2017 11:10:42 PM UTC Name: example.tar.gz Size: 297B By: None
<http://savannah.gnu.org/bugs/download.php?file_id=39813>
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?50383>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/