
[Bug-wget] URL normalisation: consecutive forward slashes


From: Cillian Sharkey
Subject: [Bug-wget] URL normalisation: consecutive forward slashes
Date: Wed, 2 Jun 2010 17:27:06 +0100
User-agent: mutt-ng/devel-r804 (Linux)

Hi,

I've found wget does not always correctly normalise URLs by collapsing
multiple consecutive forward slashes into a single slash.

This is a problem when recursively mirroring a site, as certain kinds of
links with multiple consecutive slashes will cause wget to go into an
infinite loop, limited only by the maximum depth level.

Without complete normalisation, a link with extra slashes is seen as a
new URL that has not been visited, even if it already has been.  With each
traversal an extra slash is added to the front of the path, causing
the loop.

Example:

/index.html has href to "foo/loop.html"
/foo/loop.html has href to "..//index.html"

Results in the following link traversal:

/index.html
/foo/loop.html
//index.html
//foo/loop.html
///index.html
///foo/loop.html
[..]
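
The loop can be reproduced with a small model, given here as a sketch
only. resolve_naive is a hypothetical resolver that removes "." and ".."
segments but keeps empty segments; that behaviour is an assumption chosen
to match the traversal above, not wget's actual code. LINKS and crawl are
likewise illustrative names.

```python
# Hypothetical model of the two-page site from the example above.
LINKS = {"index.html": "foo/loop.html", "loop.html": "..//index.html"}


def resolve_naive(base_path, href):
    """Resolve a relative href against base_path.

    Removes '.' and '..' segments but keeps empty segments, so
    consecutive slashes survive (assumed behaviour, not wget's code).
    """
    # Directory part of the base (leading '' and file name dropped) plus href.
    segments = base_path.split("/")[1:-1] + href.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if out:
                out.pop()
            continue
        out.append(seg)  # empty segments are kept on purpose
    return "/" + "/".join(out)


def crawl(start, max_depth):
    """Follow the single link on each page until a URL repeats or depth runs out."""
    visited = []
    url = start
    for _ in range(max_depth):
        if url in visited:
            break
        visited.append(url)
        page = url.rsplit("/", 1)[-1]
        url = resolve_naive(url, LINKS[page])
    return visited


for path in crawl("/index.html", 6):
    print(path)
```

Because every resolved path differs from the earlier ones as a string, the
"already visited" check never fires and only max_depth stops the crawl.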

I've tried a combination of URLs with and without consecutive slashes,
to test wget's behaviour. Results as follows:

/index.html links to:

HREF:                  wget requests:      should be:

/a//../b/10.html       /a/b/10.html        /b/10.html
/a/../b/11.html        /b/11.html
                       
/a/b/..//../c/20.html  /a/c/20.html        /c/20.html
/a/b/../../c/21.html   /c/21.html
                       
..//30.html            //30.html           /30.html
../31.html             /31.html
                       
.//40.html             //40.html           /40.html
./41.html              /41.html
                       
//50.html              Skipped, not downloaded!
/51.html               /51.html
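
The "should be" column can be reproduced with a short sketch, assuming
(as this report does) that consecutive slashes are equivalent to a single
slash. normalise_path and resolve are hypothetical names used for
illustration; this is not a patch for wget.

```python
def normalise_path(path):
    """Collapse runs of '/' and resolve '.' and '..' in an absolute path."""
    parts = path.split("/")
    stack = []
    for seg in parts:
        if seg in ("", "."):
            continue  # empty segment (extra slash) or current directory
        if seg == "..":
            if stack:
                stack.pop()  # never climb above the root
        else:
            stack.append(seg)
    out = "/" + "/".join(stack)
    # Keep a trailing slash when the input named a directory.
    if stack and parts[-1] in ("", ".", ".."):
        out += "/"
    return out


def resolve(base_path, href):
    """Resolve href against the page at base_path, then normalise it.

    Assumption: href is a plain path, as in the table above. A leading
    '//' is treated as part of the path here, although RFC 3986 reads it
    as the start of an authority (host) component.
    """
    if not href.startswith("/"):
        href = base_path.rsplit("/", 1)[0] + "/" + href
    return normalise_path(href)


for href in ("/a//../b/10.html", "/a/b/..//../c/20.html",
             "..//30.html", ".//40.html", "//50.html"):
    print(href, "->", resolve("/index.html", href))
```

Keying the set of visited URLs on the normalised path would make the
repeated links in the loop example compare equal.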


wget --version

GNU Wget 1.12 built on linux-gnu.

+digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl 
-iri 

Wgetrc: 
    /etc/wgetrc (system)
Locale: /usr/share/locale 
Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc" 
    -DLOCALEDIR="/usr/share/locale" -I. -I../lib -g -O2 
    -D_FILE_OFFSET_BITS=64 -O2 -g -Wall 
Link: gcc -g -O2 -D_FILE_OFFSET_BITS=64 -O2 -g -Wall /usr/lib/libssl.so 
    /usr/lib/libcrypto.so -ldl -lrt ftp-opie.o openssl.o http-ntlm.o 
    gen-md5.o ../lib/libgnu.a

Regards,

-- 
Cillian Sharkey             Managed Network Services
t: +353-1-660-9040          HEAnet Limited - http://www.heanet.ie/
f: +353-1-660-3666          5 George's Dock, I.F.S.C., Dublin 1.
PGP: E1B98B66               Registered in Ireland, no. 275301
