[Bug-wget] URL normalisation: consecutive forward slashes
From: Cillian Sharkey
Subject: [Bug-wget] URL normalisation: consecutive forward slashes
Date: Wed, 2 Jun 2010 17:27:06 +0100
User-agent: mutt-ng/devel-r804 (Linux)

Hi,
I've found wget does not always correctly normalise URLs by collapsing
multiple consecutive forward slashes into a single slash.
This is a problem when recursively mirroring a site, as certain kinds of
links with multiple consecutive slashes will cause wget to go into an
infinite loop, limited only by the maximum depth level.
Without complete normalisation, a link with extra slashes is seen as a
new URL that has not been visited, even if it already has been. Each
traversal adds one more slash to the URL path, causing the loop.
Example:
/index.html has href to "foo/loop.html"
/foo/loop.html has href to "..//index.html"
Results in the following link traversal:
/index.html
/foo/loop.html
//index.html
//foo/loop.html
///index.html
///foo/loop.html
[..]
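
The traversal above can be reproduced with a small simulation. This is
only a sketch, not wget's code: resolve(), collapse_slashes(), crawl()
and the LINKS table are hypothetical names made up for illustration.
resolve() handles "." and ".." but keeps empty path segments, which is
the behaviour that seems to produce the requests listed above. Keying
the visited set on the slash-collapsed path stops the loop.

```python
import re


def resolve(base_path, href):
    """Resolve href against base_path, keeping empty path segments."""
    if href.startswith("/"):
        merged = href
    else:
        # Replace the last segment of the base path with the href.
        merged = base_path[: base_path.rfind("/") + 1] + href
    segments = merged.split("/")[1:]  # drop the "" before the first "/"
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg in (".", ".."):
            if seg == ".." and out:
                out.pop()
            if is_last:
                out.append("")  # keep the trailing slash
        else:
            out.append(seg)  # empty segments are kept, so "//" survives
    return "/" + "/".join(out)


def collapse_slashes(path):
    """Collapse runs of consecutive slashes into a single slash."""
    return re.sub(r"/{2,}", "/", path)


# Hypothetical two-page site from the example: file name -> href it contains.
LINKS = {"index.html": "foo/loop.html", "loop.html": "..//index.html"}


def crawl(start, max_depth, normalise=False):
    """Follow the single link on each page until a repeat or max_depth."""
    visited, order, url = set(), [], start
    for _ in range(max_depth):
        key = collapse_slashes(url) if normalise else url
        if key in visited:
            break
        visited.add(key)
        order.append(url)
        url = resolve(url, LINKS[url.rsplit("/", 1)[-1]])
    return order


print(crawl("/index.html", 6))
print(crawl("/index.html", 6, normalise=True))
```

Without normalisation the crawl only stops at max_depth. With it, the
crawl stops after the two real pages.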
I've tried a combination of URLs with and without consecutive slashes,
to test wget's behaviour. Results as follows:
/index.html links to:

HREF:                    wget requests:              should be:
/a//../b/10.html         /a/b/10.html                /b/10.html
/a/../b/11.html          /b/11.html
/a/b/..//../c/20.html    /a/c/20.html                /c/20.html
/a/b/../../c/21.html     /c/21.html
..//30.html              //30.html                   /30.html
../31.html               /31.html
.//40.html               //40.html                   /40.html
./41.html                /41.html
//50.html                Skipped, not downloaded!
/51.html                 /51.html
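
For reference, the "should be" values can be reproduced with Python's
posixpath.normpath, which collapses consecutive slashes before it
resolves "." and "..". This is a rough sketch under stated assumptions:
normpath is a filesystem path function, not a URL normaliser, and
expected_path() is a hypothetical helper name. The //50.html row is
left out because normpath preserves exactly two leading slashes.

```python
import posixpath


def expected_path(base_path, href):
    """Join href onto the directory of base_path, then normalise.

    Caveat: normpath also strips trailing slashes, which matters for
    URLs, so this is only suitable for checking the rows below.
    """
    base_dir = base_path[: base_path.rfind("/") + 1]
    return posixpath.normpath(posixpath.join(base_dir, href))


# (HREF, path wget should request) for each row except //50.html.
ROWS = [
    ("/a//../b/10.html", "/b/10.html"),
    ("/a/../b/11.html", "/b/11.html"),
    ("/a/b/..//../c/20.html", "/c/20.html"),
    ("/a/b/../../c/21.html", "/c/21.html"),
    ("..//30.html", "/30.html"),
    ("../31.html", "/31.html"),
    (".//40.html", "/40.html"),
    ("./41.html", "/41.html"),
    ("/51.html", "/51.html"),
]

for href, want in ROWS:
    assert expected_path("/index.html", href) == want
```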
wget --version
GNU Wget 1.12 built on linux-gnu.
+digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl
-iri
Wgetrc:
/etc/wgetrc (system)
Locale: /usr/share/locale
Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
-DLOCALEDIR="/usr/share/locale" -I. -I../lib -g -O2
-D_FILE_OFFSET_BITS=64 -O2 -g -Wall
Link: gcc -g -O2 -D_FILE_OFFSET_BITS=64 -O2 -g -Wall /usr/lib/libssl.so
/usr/lib/libcrypto.so -ldl -lrt ftp-opie.o openssl.o http-ntlm.o
gen-md5.o ../lib/libgnu.a
Regards,
--
Cillian Sharkey Managed Network Services
t: +353-1-660-9040 HEAnet Limited - http://www.heanet.ie/
f: +353-1-660-3666 5 George's Dock, I.F.S.C., Dublin 1.
PGP: E1B98B66 Registered in Ireland, no. 275301