Re: [Bug-wget] What ought to be a simple use of wget


From: Dale R. Worley
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Thu, 04 Aug 2016 11:35:58 -0400

Tim Ruehsen <address@hidden> writes:
> Sounds like "download everything from www.iana.org/assignments/ plus all page 
> requisites on www.iana.org". Page requisites from other domains shouldn't be 
> pulled in !?
>
> Then your first try was very close, it was basically:
> wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html
>
> With -d you can see that this page is being redirected to /protocols and thus
> no further downloading takes place since /protocols would escape the
> /assignments/ directory (not allowed due to --no-parent).

I'm getting something different than that...

First off, let's drop --page-requisites.  That seems to be working
exactly as I want it, and it just complicates the discussion.

I'm also using wget 1.16.1, which is a couple of years old.

If I run the command quoted above, I get output which shows the
redirection happening, and the file is fetched successfully:

[Quote characters ASCIIized.]

    $ wget -r --no-parent http://www.iana.org/assignments/index.html
    --2016-08-04 11:22:48--  http://www.iana.org/assignments/index.html
    Resolving www.iana.org (www.iana.org)... 192.0.32.8, 2620:0:2d0:200::8
    Connecting to www.iana.org (www.iana.org)|192.0.32.8|:80... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: /protocols [following]
    --2016-08-04 11:22:48--  http://www.iana.org/protocols
    Reusing existing connection to www.iana.org:80.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: 'www.iana.org/assignments/index.html'

    www.iana.org/assign     [      <=>             ] 727.79K   578KB/s   in 1.3s

    2016-08-04 11:22:52 (578 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]

    FINISHED --2016-08-04 11:22:52--
    Total wall clock time: 4.7s
    Downloaded: 1 files, 728K in 1.3s (578 KB/s)
    $ ls -lR .
    .:
    total 4
    drwxr-xr-x. 3 worley worley 4096 Aug  4 11:22 www.iana.org

    ./www.iana.org:
    total 4
    drwxr-xr-x. 2 worley worley 4096 Aug  4 11:22 assignments

    ./www.iana.org/assignments:
    total 728
    -rw-r--r--. 1 worley worley 745252 Aug  4 11:22 index.html
    $ 

I can argue from the wording of the man page that this is correct, as
--no-parent is described as "Do not ever ascend to the parent directory
when retrieving recursively."

What *seems* to be happening is that index.html is fetched, but its
links are not fetched recursively, even though -r is given and the links
qualify under --no-parent.  E.g., line 23441 of that file is

    <td><a href="/assignments/yang-parameters/yang-parameters.xhtml#yang-parameters-1">YANG Module Names</a></td>

which specifies a target URL of
http://www.iana.org/assignments/yang-parameters/yang-parameters.xhtml.
And yet, that file is not fetched.
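
To double-check the target URL, here is a minimal sketch that resolves
the href against the page URL using Python's standard urllib.parse.  It
only illustrates ordinary relative-URL resolution and is not how wget
itself computes the link; the helper name resolve() is made up for this
example.

```python
from urllib.parse import urljoin, urldefrag


def resolve(base, href):
    """Resolve href against base and drop the #fragment (hypothetical helper)."""
    absolute = urljoin(base, href)
    url, _fragment = urldefrag(absolute)
    return url


# The URL requested on the command line and the href from line 23441.
base = "http://www.iana.org/assignments/index.html"
href = "/assignments/yang-parameters/yang-parameters.xhtml#yang-parameters-1"

print(resolve(base, href))
# → http://www.iana.org/assignments/yang-parameters/yang-parameters.xhtml
```

Because the href is host-absolute (it starts with "/"), the result is the
same whether the base is the requested URL or the redirect target
http://www.iana.org/protocols.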


OK, using -d shows what the internal logic is:  After fetching
index.html, the wget output is:

    2016-08-04 11:31:13 (576 KB/s) - 'www.iana.org/assignments/index.html' saved [745252]

    Deciding whether to enqueue "http://www.iana.org/protocols".
    Going to "" would escape "assignments" with no_parent on.
    Decided NOT to load it.
    Redirection "http://www.iana.org/protocols" failed the test.

I'm going to have to think about that, as the behavior is rather
counter-intuitive.  It seems to me that if wget is willing to *fetch* a
page, it should look at the links on the page for potential recursion.
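
The check that the debug output suggests can be sketched roughly as
follows.  This is a guess at the logic based only on the messages above,
not wget's actual code; the function names url_dir() and
escapes_parent() are invented for illustration, and the sketch ignores
the host part of the URL entirely.

```python
import posixpath
from urllib.parse import urlsplit


def url_dir(url):
    """Directory part of the URL path, without leading or trailing slashes."""
    path = urlsplit(url).path
    return posixpath.dirname(path).strip("/")


def escapes_parent(start_url, candidate_url):
    """True if candidate_url lies outside the directory of start_url.

    Assumption: "no parent" means the candidate's directory must be the
    start directory itself or something beneath it.
    """
    parent = url_dir(start_url)
    candidate = url_dir(candidate_url)
    if parent == "":
        # Starting at the site root: nothing can be above it.
        return False
    return not (candidate == parent or candidate.startswith(parent + "/"))


start = "http://www.iana.org/assignments/index.html"

# The redirect target has directory "" and so escapes "assignments".
print(escapes_parent(start, "http://www.iana.org/protocols"))

# The link found inside the page stays below "assignments".
print(escapes_parent(
    start,
    "http://www.iana.org/assignments/yang-parameters/yang-parameters.xhtml"))
```

Under this reading, the redirect target fails the test while the links
inside the saved page would pass it, which is why skipping the page's
links looks counter-intuitive.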

Dale


