[Bug-wget] Recursive retrieval


From: Dale R. Worley
Subject: [Bug-wget] Recursive retrieval
Date: Wed, 02 Nov 2016 12:24:03 -0400

Regarding my difficulties with recursively retrieving
http://www.iana.org/assignments/index.html:  I discovered that one page
(http://www.iana.org/assignments/forces/forces.xhtml) is reachable through
no fewer than three different URLs:

http://www.iana.org/assignments/forces/forces.xhtml
http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml
http://www.iana.org/assignments/forces

The first is the proper URL for it, and the other two redirect to the
first.

There are several other occurrences of this situation.

I also discovered that if I specify --trust-server-names, wget realizes
that the redirection target needs to be retrieved only once, and that
links to the other two URLs can be directed to that one file.  Without
--trust-server-names, wget treats all three URLs as different, even
though they redirect to the same URL, and dutifully stores essentially
the same content three times.
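
To illustrate the effect described above, here is a minimal Python sketch
of grouping requested URLs by the URL they finally redirect to.  It is
not wget's implementation: the REDIRECTS table is a hypothetical stand-in
for the server's responses, and final_url and group_by_final_url are
names invented for this example.

```python
from collections import defaultdict

FORCES = "http://www.iana.org/assignments/forces/forces.xhtml"

# Hypothetical redirect table standing in for the server's responses.
# Keys redirect to values; a URL that is not a key is served directly.
REDIRECTS = {
    "http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml": FORCES,
    "http://www.iana.org/assignments/forces": FORCES,
}


def final_url(url, redirects=REDIRECTS, max_hops=20):
    """Follow the redirect table until a URL with no redirect is reached."""
    for _ in range(max_hops):
        if url not in redirects:
            return url
        url = redirects[url]
    raise RuntimeError("too many redirects")


def group_by_final_url(urls):
    """Map each final URL to the list of requested URLs that lead to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[final_url(url)].append(url)
    return dict(groups)


requested = [
    FORCES,
    "http://www.iana.org/assignments/forces-parameters/forces-parameters.xhtml",
    "http://www.iana.org/assignments/forces",
]
groups = group_by_final_url(requested)
```

Keyed by the requested URL, the three entries would be stored as three
files; keyed by the final URL, they collapse into a single entry, which
is the behavior I see with --trust-server-names.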

It turns out that this provides me with a much better mirror of the web
site.

I've attached a patch that improves the documentation of
--trust-server-names, to clarify that if -nd is not in effect, then the
file name is constructed from the entire redirection URL, not just the
last component.
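
The naming difference can be sketched roughly as follows.  This is only
an approximation under stated assumptions, not wget's actual code: the
function local_name and its no_directories parameter are hypothetical,
the host-prefixed layout is assumed to be in effect, and details such as
query strings and trailing slashes are ignored.

```python
import posixpath
from urllib.parse import urlsplit


def local_name(redirect_url, no_directories=False):
    """Approximate the local file name derived from a redirection URL.

    no_directories=True models -nd: only the last path component is used.
    Otherwise the host and the full path are used (assumed default layout).
    """
    parts = urlsplit(redirect_url)
    path = parts.path.lstrip("/")
    if no_directories:
        return posixpath.basename(path)
    return posixpath.join(parts.netloc, path)


url = "http://www.iana.org/assignments/forces/forces.xhtml"
with_nd = local_name(url, no_directories=True)
without_nd = local_name(url)
```

Here with_nd is just the last component, while without_nd keeps the host
and the whole path of the redirection URL.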

(--trust-server-names is also mentioned in doc/metalink-standard.txt,
but that text does not seem to me to have the problem the patch
corrects.)

Dale

Attachment: 0001-Improve-documentation-of-trust-server-names.patch
Description: Text Data
