Re: [Bug-wget] What ought to be a simple use of wget
From: Tim Ruehsen
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Thu, 04 Aug 2016 11:38:33 +0200
User-agent: KMail/5.2.3 (Linux/4.6.0-1-amd64; KDE/5.23.0; x86_64; ; )
On Wednesday, August 3, 2016 11:55:55 AM CEST Dale R. Worley wrote:
> Tim Rühsen <address@hidden> writes:
> > If you have a look at 'man wget'/--page-requisites, the stuff is explained
> > quite well. To me it looks like you are missing --level 2.
> >
> > If --level 2 is not what you want, you could make your point clear by
> > making up a small document tree as an example.
>
> I definitely don't want --level 2, because that limits how many links
> the recursion can traverse. If all the links are within the
> /assignments/ directory, wget should follow an unlimited number.
>
> Here's an outline of what I want retrieved, based on Matthew White's
> listing:
>
> www.iana.org/
> Some or all of these files are OK, since they're likely page requisites:
> www.iana.org/_css/
> www.iana.org/_css/2015.1/
> www.iana.org/_css/2015.1/print.css
> www.iana.org/_css/2015.1/screen.css
> www.iana.org/_img/
> www.iana.org/_img/2011.1/
> www.iana.org/_img/2011.1/icons/
> ...
> www.iana.org/_js/
> www.iana.org/_js/2013.1/
> www.iana.org/_js/2013.1/iana.js
> www.iana.org/_js/2013.1/jquery.js
> Nothing in these directories:
> www.iana.org/about/
> www.iana.org/abuse/
> Lots and lots of files in this directory:
> www.iana.org/assignments/
> www.iana.org/assignments/_6lowpan-parameters/
> www.iana.org/assignments/_6lowpan-parameters/_6lowpan-parameters.xhtml.html
> www.iana.org/assignments/_support/
> www.iana.org/assignments/_support/iana-registry.css
> www.iana.org/assignments/_support/jquery.js
> www.iana.org/assignments/_support/sort.js
> www.iana.org/assignments/aaa-parameters/
> www.iana.org/assignments/aaa-parameters/aaa-parameters-1.csv
> www.iana.org/assignments/aaa-parameters/aaa-parameters.txt
> www.iana.org/assignments/aaa-parameters/aaa-parameters.xhtml.html
> www.iana.org/assignments/aaa-parameters/aaa-parameters.xml
> www.iana.org/assignments/abfab-parameters/
> www.iana.org/assignments/abfab-parameters/abfab-parameters.txt
> www.iana.org/assignments/abfab-parameters/abfab-parameters.xhtml.html
> www.iana.org/assignments/abfab-parameters/abfab-parameters.xml
> www.iana.org/assignments/abfab-parameters/urn-parameters.csv
> ...
> Nothing in these directories:
> www.iana.org/dnssec/
> www.iana.org/domains/
> www.iana.org/go/
> www.iana.org/help/
> www.iana.org/numbers/
> www.iana.org/procedures/
> www.iana.org/protocols/
> www.iana.org/reports/
Sounds like "download everything from www.iana.org/assignments/ plus all page
requisites on www.iana.org", with page requisites from other domains not being
pulled in. Is that right?
Then your first try was very close. It was basically:
wget -r --no-parent --page-requisites http://www.iana.org/assignments/index.html
With -d you can see that this page is being redirected to /protocols, and thus
no further downloading takes place, since /protocols would escape the
/assignments/ directory (not allowed due to --no-parent).
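One way to spot such a redirect without reading the whole debug log is to pick out the Location headers. This is only a rough sketch: show_redirects is a made-up helper name, and the match is deliberately loose because the exact layout of wget's output differs between versions and options.

```shell
# Print only the lines that mention an HTTP Location header.
# show_redirects is a hypothetical helper, not part of wget.
show_redirects() {
    grep -i 'Location:'
}

# Possible usage (needs network access, so it is left commented out here):
#   wget -d -r --no-parent --page-requisites \
#        http://www.iana.org/assignments/index.html 2>&1 | show_redirects
```

If a Location line points outside /assignments/, --no-parent will stop the recursion there.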
[It is debatable whether this behavior regarding redirections should be changed
or not, so feel free to open a bug report at
https://savannah.gnu.org/bugs/?func=additem&group=wget.]
You are currently left with what Matthew White already suggested.
A similar approach would be to extract all links from 'protocols', build a list
of all referenced links, and filter it with e.g. (e)grep:
wget -d --convert-links -r --no-parent --page-requisites http://www.iana.org/assignments/index.html 2>&1 | grep ^TO_COMPLETE | cut -d' ' -f 4 >list.txt
After editing and filtering list.txt, download all the URLs, again with
--page-requisites:
wget --convert-links --page-requisites -x -i list.txt
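The filtering step between the two commands above could look roughly like this. It is a minimal sketch, assuming list.txt holds one absolute URL per line; filter_assignments and filtered.txt are names made up for illustration.

```shell
# Keep only URLs below /assignments/ on www.iana.org and drop duplicates.
# filter_assignments is a hypothetical helper, not part of wget.
filter_assignments() {
    grep -E '^https?://www\.iana\.org/assignments/' | sort -u
}

# Possible usage (list.txt is the file written by the wget -d pipeline):
#   filter_assignments < list.txt > filtered.txt
#   wget --convert-links --page-requisites -x -i filtered.txt
```

Page requisites such as the files under /_css/ do not need to be in the list, since --page-requisites should pull them in for each downloaded page.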
Tim