Re: [Bug-wget] URL normalisation: consecutive forward slashes
From: Giuseppe Scrivano
Date: Thu, 03 Jun 2010 14:32:45 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.50 (gnu/linux)
Hello,
Thanks for your report. I am not sure that URL normalisation should
collapse multiple consecutive forward slashes; I don't see anything
about it in RFC 1808. We can't assume that "foo//bar" is the same as
"foo/bar": the server could handle the two differently, for example
when the path is part of PATH_INFO.
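As a minimal sketch of why the two paths need not be equivalent, assume a
hypothetical CGI-style server that hands everything after the script name
to the application as PATH_INFO. The function name and the /cgi-bin/app
mount point are made up for illustration, and real servers differ in how
they treat repeated slashes:

```python
def path_info_segments(request_path: str, script_name: str) -> list[str]:
    """Split the part of request_path after script_name into segments.

    Models a hypothetical server that passes the remainder of the path
    to the application untouched (as PATH_INFO), without collapsing
    repeated slashes.
    """
    if not request_path.startswith(script_name):
        raise ValueError("request path is not under the script")
    path_info = request_path[len(script_name):]
    # Drop the empty string produced by the leading "/".
    return path_info.split("/")[1:]


# The application sees an extra, empty segment in the first case.
print(path_info_segments("/cgi-bin/app/foo//bar", "/cgi-bin/app"))
print(path_info_segments("/cgi-bin/app/foo/bar", "/cgi-bin/app"))
```

An application that dispatches on the number or position of segments
would answer these two requests differently.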
AFAICS, Firefox and Chromium don't normalize consecutive forward slashes
either.
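For reference, here is a rough sketch of relative-reference resolution
that follows the dot-segment removal steps of RFC 3986, section 5.2.4
(the successor to RFC 1808) and leaves empty segments alone. It is not
wget's code; resolve() and remove_dot_segments() are illustrative names,
and only path-only references are handled. Under those assumptions it
reproduces the growing-slash traversal from the report:

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments; empty segments ("//") are kept."""
    output = []  # each entry is one segment including its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (up to the next "/") to the output.
            nxt = path.find("/", 1)
            if nxt == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:nxt])
                path = path[nxt:]
    return "".join(output)


def resolve(base_path: str, href: str) -> str:
    """Resolve a path-only href against base_path (sketch only).

    An href starting with "//" is really a network-path reference
    and is not handled here.
    """
    if href.startswith("/"):
        merged = href
    else:
        # Replace everything after the last "/" of the base with href.
        merged = base_path[: base_path.rfind("/") + 1] + href
    return remove_dot_segments(merged)


# Follow the two links from the report twice around the loop.
page = "/index.html"
for href in ["foo/loop.html", "..//index.html"] * 2:
    page = resolve(page, href)
    print(page)
```

Each pass adds one leading slash, so a crawler that compares URLs as
plain strings never recognises the page as already visited.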
Cheers,
Giuseppe
Cillian Sharkey <address@hidden> writes:
> Hi,
>
> I've found wget does not always correctly normalise URLs by collapsing
> multiple consecutive forward slashes into a single slash.
>
> This is a problem when recursively mirroring a site, as certain kinds of
> links with multiple consecutive slashes will cause wget to go into an
> infinite loop, limited only by the maximum depth level.
>
> Without complete normalisation, a link with extra slashes is seen as a
> new URL that has not been visited, even if it has already. With each
> traversal an extra slash is cumulatively appended to the URL, causing
> the loop.
>
> Example:
>
> /index.html has href to "foo/loop.html"
> /foo/loop.html has href to "..//index.html"
>
> Results in the following link traversal:
>
> /index.html
> /foo/loop.html
> //index.html
> //foo/loop.html
> ///index.html
> ///foo/loop.html
> [..]
>
> I've tried a combination of URLs with and without consecutive slashes,
> to test wget's behaviour. Results as follows:
>
> /index.html links to:
>
> HREF:                   wget requests:             should be:
>
> /a//../b/10.html        /a/b/10.html               /b/10.html
> /a/../b/11.html         /b/11.html
>
> /a/b/..//../c/20.html   /a/c/20.html               /c/20.html
> /a/b/../../c/21.html    /c/21.html
>
> ..//30.html             //30.html                  /30.html
> ../31.html              /31.html
>
> .//40.html              //40.html                  /40.html
> ./41.html               /41.html
>
> //50.html               Skipped, not downloaded!
> /51.html                /51.html
>
>
> wget --version
>
> GNU Wget 1.12 built on linux-gnu.
>
> +digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl
> -iri
>
> Wgetrc:
> /etc/wgetrc (system)
> Locale: /usr/share/locale
> Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
> -DLOCALEDIR="/usr/share/locale" -I. -I../lib -g -O2
> -D_FILE_OFFSET_BITS=64 -O2 -g -Wall
> Link: gcc -g -O2 -D_FILE_OFFSET_BITS=64 -O2 -g -Wall /usr/lib/libssl.so
> /usr/lib/libcrypto.so -ldl -lrt ftp-opie.o openssl.o http-ntlm.o
> gen-md5.o ../lib/libgnu.a
>
> Regards,