Re: [Bug-wget] What ought to be a simple use of wget
From: Dale R. Worley
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Thu, 04 Aug 2016 11:35:58 -0400
Tim Ruehsen <address@hidden> writes:
> Sounds like "download everything from www.iana.org/assignments/ plus all page
> requisites on www.iana.org". Page requisites from other domains shouldn't be
> pulled in !?
>
> Then your first try was very close, it was basically:
> wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html
>
> With -d you can see that this page is being redirected to /protocols and thus
> no further downloading takes place since /protocols would escape the
> /assignments/ directory (not allowed due to --no-parent).
I'm getting something different than that...
First off, let's drop --page-requisites. That seems to be working
exactly as I want it, and it just complicates the discussion.
I'm also using wget 1.16.1, which is a couple of years old.
If I run the command quoted above, I get output which shows the
redirection happening, and the file is fetched successfully:
[Quote characters ASCIIized.]
$ wget -r --no-parent http://www.iana.org/assignments/index.html
--2016-08-04 11:22:48-- http://www.iana.org/assignments/index.html
Resolving www.iana.org (www.iana.org)... 192.0.32.8, 2620:0:2d0:200::8
Connecting to www.iana.org (www.iana.org)|192.0.32.8|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: /protocols [following]
--2016-08-04 11:22:48-- http://www.iana.org/protocols
Reusing existing connection to www.iana.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'www.iana.org/assignments/index.html'
www.iana.org/assign [ <=> ] 727.79K 578KB/s in 1.3s
2016-08-04 11:22:52 (578 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
FINISHED --2016-08-04 11:22:52--
Total wall clock time: 4.7s
Downloaded: 1 files, 728K in 1.3s (578 KB/s)
$ ls -lR .
.:
total 4
drwxr-xr-x. 3 worley worley 4096 Aug 4 11:22 www.iana.org
./www.iana.org:
total 4
drwxr-xr-x. 2 worley worley 4096 Aug 4 11:22 assignments
./www.iana.org/assignments:
total 728
-rw-r--r--. 1 worley worley 745252 Aug 4 11:22 index.html
$
I can argue from the wording of the man page that this is correct, as
--no-parent is described as "Do not ever ascend to the parent directory
when retrieving recursively."
What *seems* to be happening is that index.html is fetched, but its
links are not followed recursively, even though -r is given and the
links qualify under --no-parent. E.g., line 23441 of that file is
<td><a href="/assignments/yang-parameters/yang-parameters.xhtml#yang-parameters-1">YANG Module Names</a></td>
which specifies a target URL of
http://www.iana.org/assignments/yang-parameters/yang-parameters.xhtml.
And yet, that file is not fetched.
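To double-check how that href resolves, here is a minimal Python sketch
using the standard urllib.parse module. It is only an illustration of
ordinary URL resolution, not wget's own code; the variable names are
made up for the example, and the base URL is assumed to be the
post-redirect location shown in the wget output above.

```python
from urllib.parse import urljoin, urldefrag

# Assumption: the page content actually came from the redirect target.
base = "http://www.iana.org/protocols"
href = "/assignments/yang-parameters/yang-parameters.xhtml#yang-parameters-1"

# Resolve the link against the base, then split off the fragment,
# since the fragment is not part of what gets requested from the server.
resolved, fragment = urldefrag(urljoin(base, href))

print(resolved)
print(fragment)
```

Because the href starts with "/", the result is the same whether the
base is taken to be /assignments/index.html or /protocols.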
OK, using -d shows what the internal logic is: After fetching
index.html, the wget output is:
2016-08-04 11:31:13 (576 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]
Deciding whether to enqueue "http://www.iana.org/protocols".
Going to "" would escape "assignments" with no_parent on.
Decided NOT to load it.
Redirection "http://www.iana.org/protocols" failed the test.
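The test described by that debug output can be approximated with the
following Python sketch. This is a guess at the logic based only on the
messages above, not a copy of wget's implementation; url_dir and
escapes_parent are hypothetical names introduced for illustration.

```python
import posixpath
from urllib.parse import urlsplit

def url_dir(url):
    """Directory part of the URL path, without leading/trailing slashes."""
    return posixpath.dirname(urlsplit(url).path).strip("/")

def escapes_parent(start_url, candidate_url):
    """True if candidate_url lies outside the starting URL's directory."""
    start_dir = url_dir(start_url)
    cand_dir = url_dir(candidate_url)
    if start_dir == "":
        # Starting at the site root: nothing can be above it.
        return False
    # Inside means the same directory or a subdirectory of it.
    inside = cand_dir == start_dir or cand_dir.startswith(start_dir + "/")
    return not inside

start = "http://www.iana.org/assignments/index.html"

# The redirect target has directory "" versus "assignments", so it is rejected.
print(escapes_parent(start, "http://www.iana.org/protocols"))

# A link found on the page stays under assignments/, so it would pass.
print(escapes_parent(
    start,
    "http://www.iana.org/assignments/yang-parameters/yang-parameters.xhtml"))
```

Under this reading, the check is applied to the redirect target itself,
which is consistent with none of the page's links being enqueued once
the redirection fails the test.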
I'm going to have to think about that, as the behavior is rather
counter-intuitive. It seems to me that if wget is willing to *fetch* a
page, it should look at the links on the page for potential recursion.
Dale