Re: [Bug-wget] What ought to be a simple use of wget
From: Matthew White
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Tue, 2 Aug 2016 22:06:26 +0200
Out of curiosity I tried the following command lines (which I attach to this
message, just in case they get truncated):
wget --recursive \
--no-clobber \
--page-requisites \
--adjust-extension \
--convert-links \
--span-hosts \
--domains="www.iana.org" \
http://www.iana.org/assignments/index.html
wget --recursive \
--page-requisites \
--convert-links \
--domains="www.iana.org" \
--reject "robots.txt","reports","contact" \
--exclude-directories="/go,/assignments,/_img,/_js,/_css,/domains,/performance,/about,/protocols,/procedures,/dnssec,/reports,/help,/abuse,/numbers,/reviews,/time-zones,/2000,/2001" \
http://www.iana.org/assignments/index.html
As Tim said, you may set the depth level with --level=n (n is a number).
I had to stop the first command, ending up with 1586 files for a total of
129MB. I attach the result just to give you an idea. I don't know how far
--page-requisites will go starting from
http://www.iana.org/assignments/index.html
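A minimal sketch of how the --level suggestion could be combined with the first command. The depth of 2 is the value Tim mentioned; the variable names are only illustrative, and whether this depth yields the desired tree is untested.

```shell
# Sketch only: depth 2 is the value suggested in the thread; adjust as needed.
depth=2
start_url="http://www.iana.org/assignments/index.html"

# Build the command as positional parameters so it can be inspected first.
set -- wget \
    --recursive \
    --level="$depth" \
    --page-requisites \
    --convert-links \
    --domains="www.iana.org" \
    "$start_url"

# Print the command for review; run it with "$@" once it looks right.
echo "$@"
```

Printing before running makes it easy to try a few depth values without starting a long download by accident.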
The second command uses exclusion lists. It will download
http://www.iana.org/assignments/index.html and the files under
http://www.iana.org/, except the rejected files and the excluded directories (I
listed all the directories that existed at the time of writing, plus a few
files, as an example).
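Since the exclusion list is long, it may be easier to keep it as a space-separated variable and join it with commas. This is only a sketch: the variable names are hypothetical, and the directories shown are a subset of those in the command above.

```shell
# Hypothetical variable; the directory names are a subset of the list above.
excluded_dirs="/go /_img /_js /_css /domains"

# Join the entries with commas, the separator --exclude-directories expects.
exclude_list=$(printf '%s' "$excluded_dirs" | tr ' ' ',')

# Show the resulting option so it can be pasted into the wget command line.
echo "--exclude-directories=$exclude_list"
```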
I hope this helps. Let us know when you find your way!
On Tue, 02 Aug 2016 20:15:45 +0200
Tim Rühsen <address@hidden> wrote:
> Hi Dale,
>
> If you have a look at 'man wget'/--page-requisites, the stuff is explained
> quite well. To me it looks like you are missing --level 2.
>
> If --level 2 is not what you want, you could make your point clear by making
> up a small document tree as an example.
>
> Regards, Tim
>
> On Tuesday, 2 August 2016 12:38:25 CEST Dale R. Worley wrote:
> > I want to make a local copy of the "IANA protocol assignments" web
> > pages. It seems to me that this ought to be a simple use of wget in
> > recursive mode, and indeed, it seems like someone else must have run
> > into this need before. But I can't get a combination of wget options
> > that has the behavior I want.
> >
> > The goal is to make a local file tree that mirrors these URLs:
> >
> > http://www.iana.org/assignments/index.html
> > (That page should be in a file named 'index.html'.)
> >
> > every HTML page under http://www.iana.org/assignments/ that can be
> > reached from index.html
> >
> > page requisites for those pages, even if they aren't under
> > http://www.iana.org/assignments/
> >
> > The interference comes from all the stuff under http://www.iana.org that
> > is not under http://www.iana.org/assignments, but which is pointed to by
> > the pages listed above.
> >
> > To resolve the simple problem, it appears that --page-requisites does
> > fetch the page requisites, even if they aren't under
> > http://www.iana.org/assignments/. So that part of the solution works
> > fine.
> >
> > But I can't figure out the right combination of options to fetch the
> > HTML files that I want:
> >
> >
> > wget --mirror --convert-links --no-parent --page-requisites \
> >     http://www.iana.org/assignments/index.html
> > Follows links outside of /assignments/.
> >
> > wget --mirror --convert-links --exclude-directories=/ --page-requisites \
> >     http://www.iana.org/assignments/index.html
> > This doesn't recurse beyond index.html.
> >
> > wget --mirror --convert-links --no-parent --page-requisites \
> >     http://www.iana.org/assignments
> > Follows links outside of /assignments/.
> >
> > wget --mirror --convert-links --exclude-directories=/ --page-requisites \
> >     http://www.iana.org/assignments
> > This doesn't recurse beyond index.html.
> >
> > wget --mirror --convert-links --no-parent --page-requisites \
> >     http://www.iana.org/assignments/
> > This doesn't recurse beyond index.html.
> >
> > wget --mirror --convert-links --exclude-directories=/ --page-requisites \
> >     http://www.iana.org/assignments/
> > This doesn't recurse beyond index.html.
> >
> >
> > I'm hoping that this is a known problem and someone can tell me the
> > answer without having to think about it.
> >
> > I also think the documentation could be made clearer in some places, but
> > that can wait.
> >
> > Dale
>
--
Matthew White <address@hidden>
Attachment: commands
Description: Binary data

Attachment: result.gz
Description: Binary data