Re: [Bug-wget] What ought to be a simple use of wget


From: Matthew White
Subject: Re: [Bug-wget] What ought to be a simple use of wget
Date: Tue, 2 Aug 2016 22:06:26 +0200

Out of curiosity I tried the following command lines (which I attach to this 
message, just in case they get truncated):

wget --recursive              \
     --no-clobber             \
     --page-requisites        \
     --adjust-extension       \
     --convert-links          \
     --span-hosts             \
     --domains="www.iana.org" \
     http://www.iana.org/assignments/index.html

wget --recursive                               \
     --page-requisites                         \
     --convert-links                           \
     --domains="www.iana.org"                  \
     --reject "robots.txt","reports","contact" \
     --exclude-directories="/go,/assignments,/_img,/_js,/_css,/domains,/performance,/about,/protocols,/procedures,/dnssec,/reports,/help,/abuse,/numbers,/reviews,/time-zones,/2000,/2001" \
     http://www.iana.org/assignments/index.html

As Tim said, you can limit the recursion depth with --level=n, where n is the 
maximum number of levels to follow.
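
A minimal sketch of the first command with an explicit depth limit follows. 
The value 2 is only Tim's suggestion from the quoted message, not a tested 
setting, and the variable names are illustrative. The snippet prints the 
command instead of running it, so it can be checked first:

```shell
# Assumption: a depth of 2, as suggested in the quoted message below.
depth=2

# Build the command as a string and print it; nothing is downloaded here.
cmd="wget --recursive --level=$depth --no-clobber --page-requisites \
--adjust-extension --convert-links --span-hosts --domains=www.iana.org \
http://www.iana.org/assignments/index.html"

printf '%s\n' "$cmd"
```

Without --level, wget's documented default depth for --recursive is 5, which 
may be part of why the first run grew so large.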

I had to stop the first command; by then it had fetched 1586 files totalling 
129MB. I attach the result to give you an idea. I don't know how far 
--page-requisites would go when starting from 
http://www.iana.org/assignments/index.html

The second command uses exclusion lists. It downloads 
http://www.iana.org/assignments/index.html and the files under 
http://www.iana.org/, except for the rejected files and the excluded 
directories. (As an example, I listed every directory I found at the time of 
writing, plus a handful of files.)
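
Since the --exclude-directories value is one long comma-separated string, it 
may be easier to maintain as a list of separate names. The sketch below 
assumes a hypothetical helper, excluded_dirs, that joins its arguments with 
commas; only a few of the directories above are shown, and the final command 
is printed rather than run:

```shell
# Hypothetical helper: join all arguments into one comma-separated string.
excluded_dirs() {
    list=""
    for dir in "$@"; do
        # Prepend "<list>," only when the list is already non-empty.
        list="${list:+$list,}$dir"
    done
    printf '%s\n' "$list"
}

# A shortened, illustrative subset of the directories listed above.
exclude=$(excluded_dirs /go /_img /_js /_css /domains)

# Print the resulting command; nothing is downloaded here.
printf '%s\n' "wget --recursive --page-requisites --convert-links --domains=www.iana.org --exclude-directories=$exclude http://www.iana.org/assignments/index.html"
```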

I hope this helps. Let us know when you find your way!

On Tue, 02 Aug 2016 20:15:45 +0200
Tim Rühsen <address@hidden> wrote:

> Hi Dale,
> 
> If you have a look at 'man wget'/--page-requisites, the stuff is explained 
> quite well. To me it looks like you are missing --level 2.
> 
> If --level 2 is not what you want, you could make your point clear by making 
> up a small document tree as an example.
> 
> Regards, Tim
> 
> On Tuesday, 2 August 2016 12:38:25 CEST Dale R. Worley wrote:
> > I want to make a local copy of the "IANA protocol assignments" web
> > pages.  It seems to me that this ought to be a simple use of wget in
> > recursive mode, and indeed, it seems like someone else must have run
> > into this need before.  But I can't get a combination of wget options
> > that has the behavior I want.
> > 
> > The goal is to make a local file tree that mirrors these URLs:
> > 
> >     http://www.iana.org/assignments/index.html
> >     (That page should be in a file named 'index.html'.)
> > 
> >     every HTML page under http://www.iana.org/assignments/ that can be
> >     reached from index.html
> > 
> >     page requisites for those pages, even if they aren't under
> >     http://www.iana.org/assignments/
> > 
> > The interference comes from all the stuff under http://www.iana.org that
> > is not under http://www.iana.org/assignments, but which is pointed to by
> > the pages listed above.
> > 
> > The simple part is solved: it appears that --page-requisites does
> > fetch the page requisites, even if they aren't under
> > http://www.iana.org/assignments/.  So that part of the solution works
> > fine.
> > 
> > But I can't figure out the right combination of options to fetch the
> > HTML files that I want:
> > 
> > 
> > wget --mirror --convert-links --no-parent --page-requisites \
> >      http://www.iana.org/assignments/index.html
> > Follows links outside of /assignments/.
> > 
> > wget --mirror --convert-links --exclude-directories=/ --page-requisites \
> >      http://www.iana.org/assignments/index.html
> > This doesn't recurse beyond index.html.
> > 
> > wget --mirror --convert-links --no-parent --page-requisites \
> >      http://www.iana.org/assignments
> > Follows links outside of /assignments/.
> > 
> > wget --mirror --convert-links --exclude-directories=/ --page-requisites \
> >      http://www.iana.org/assignments
> > This doesn't recurse beyond index.html.
> > 
> > wget --mirror --convert-links --no-parent --page-requisites \
> >      http://www.iana.org/assignments/
> > This doesn't recurse beyond index.html.
> > 
> > wget --mirror --convert-links --exclude-directories=/ --page-requisites \
> >      http://www.iana.org/assignments/
> > This doesn't recurse beyond index.html.
> > 
> > 
> > I'm hoping that this is a known problem and someone can tell me the
> > answer without having to think about it.
> > 
> > I also think the documentation could be made clearer in some places, but
> > that can wait.
> > 
> > Dale
> 


-- 
Matthew White <address@hidden>

Attachment: commands
Description: Binary data

Attachment: result.gz
Description: Binary data

