[Bug-wget] [bug #48708] Wget downloads file but refuses to examine it fo

From:	Dale Worley
Subject:	[Bug-wget] [bug #48708] Wget downloads file but refuses to examine it for links to follow
Date:	Fri, 5 Aug 2016 15:15:16 +0000 (UTC)
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0

URL:
  <http://savannah.gnu.org/bugs/?48708>

                 Summary: Wget downloads file but refuses to examine it for links to follow
                 Project: GNU Wget
            Submitted by: worley
            Submitted on: Fri 05 Aug 2016 03:15:13 PM GMT
                Category: Program Logic
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: worley
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any
                 Release: 1.16.1
        Operating System: GNU/Linux
         Reproducibility: Every Time
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: None

    _______________________________________________________

Details:

To demonstrate (including useful debugging output):

$ wget -d -r --include-directories=/assignments,/protocols http://www.iana.org/protocols/index.html

Naively, I expect wget to download the index.html file and scan it for links
to recurse on.

The complication seems to arise from the fact that /protocols/index.html is
redirected to http://www.iana.org/protocols.  That file is fetched and stored
as www.iana.org/protocols/index.html; that is, the file name is based on the
original URL, not the redirected one.

However, wget does not examine the file for links to follow.  Wget gives the
following messages:

Deciding whether to enqueue "http://www.iana.org/protocols".
http://www.iana.org/protocols () is excluded/not-included.
Decided NOT to load it.
Redirection "http://www.iana.org/protocols" failed the test.

This is unexpected.  I expect the file to be treated consistently with regard
to (1) whether to download it, (2) what file name to store it under, and (3)
whether to examine it for links; that is, all three decisions should be based
on the same URL, either the original one or the final redirected one.  (Using
the original URL seems to me to be the correct choice.)  But wget makes
decision (3) based on the redirected URL, not the original one.
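
To illustrate why the two URLs can get different answers from the
include-directories test, here is a minimal sketch in Python.  It is a
simplified model, not wget's actual code: the helper names url_directory and
is_included are hypothetical, and the rule "the directory is everything in
the path before the last slash" is an assumption.  That assumption does fit
the empty "()" in the debug message above.

```python
from typing import Iterable
from urllib.parse import urlsplit


def url_directory(url: str) -> str:
    """Return the directory part of the URL path (assumed rule:
    everything before the last '/', without the leading slash)."""
    path = urlsplit(url).path.lstrip("/")
    directory, _, _ = path.rpartition("/")
    return directory


def is_included(url: str, include_dirs: Iterable[str]) -> bool:
    """True if the URL's directory equals, or lies below, one of the
    listed directories (a simplified include-directories test)."""
    directory = url_directory(url)
    for inc in include_dirs:
        inc = inc.strip("/")
        if directory == inc or directory.startswith(inc + "/"):
            return True
    return False


INCLUDE = ["/assignments", "/protocols"]
ORIGINAL = "http://www.iana.org/protocols/index.html"
REDIRECTED = "http://www.iana.org/protocols"

# Original URL: directory is "protocols", so it passes the test.
print(repr(url_directory(ORIGINAL)), is_included(ORIGINAL, INCLUDE))
# Redirected URL: directory is empty, so it fails the test.
print(repr(url_directory(REDIRECTED)), is_included(REDIRECTED, INCLUDE))
```

Under this model the same document passes the test when it is named by the
original URL and fails when it is named by the redirected URL, so the answer
to decision (3) depends on which of the two URLs is tested.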

In addition, as I read the documentation, wget will read every URL named on
the command line, regardless of whether it meets the include/exclude criteria,
so I expect that with -r all of those URLs would be scanned for links.
However, it is clear that wget does not always scan a provided URL for links.






    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?48708>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
