bug-wget

Re: Not getting the wildcards to work in wget


From: Felix Dietrich
Subject: Re: Not getting the wildcards to work in wget
Date: Fri, 05 Feb 2021 06:25:37 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)

Hello,

Cherise Haywood <Cherise.Haywood@metoffice.gov.tt> writes:

> I am trying to download specific .zip files from this website:
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS/
>
> I have used several iterations of wget to yield only the folders (
> directories) being formed, but absolutely no data being downloaded.
>
> Here are copies of the code I have used:
>
> OPTION 1: wget --no-parent --relative --recursive --level=2
> --accept=zip --mirror -A .zip
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS/
>
> Can you assist?

It seems that wget has trouble parsing the site's /robots.txt correctly:
the empty record for “User-Agent: *” appears to make it treat all
paths as disallowed.  To work around the issue, you can stop wget from
honouring /robots.txt by adding “--execute robots=off” to your command line.

> OPTION 2: wget --no-parent --relative --recursive --level=2
> --accept=zip --mirror -A *_72*.zip --time-stamps
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS/

--time-stamps should probably have been --timestamping.

--mirror sets an infinite recursion depth (--level=inf), which most
likely overrides the --level=2 given earlier on your command line.  You
may limit the depth when using --mirror by specifying --level after
--mirror (I believe).

> OPTION 3: wget --no-parent --relative --recursive --level=2
> --accept=zip --mirror -A _72
> https://www2.census.gov/geo/tiger/TIGER2012/ROADS

When multiple patterns are specified with -A, --accept, either as
separate arguments or as a comma-separated list, a file is accepted if
*any one* of the patterns matches.  With both --accept=zip and -A _72
given, every file ending in “zip” is therefore accepted.
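
To illustrate the matching rule, here is a rough shell emulation of how
an accept list behaves.  It is a simplified sketch, not wget's actual
code: the function name "accepts" and the sample file names are made up
for illustration.  It relies on the behaviour described in the wget
manual that a list element containing a wildcard character (*, ?, [ or
]) is treated as a pattern, and otherwise as a suffix.

```shell
# Hypothetical helper: succeed if FILE is accepted by any list element.
# Usage: accepts FILE ELEMENT...
accepts() {
    file=$1
    shift
    for elem in "$@"; do
        case $elem in
            *[][*?]*)
                # Element contains a wildcard: match it as a glob pattern.
                case $file in
                    $elem) return 0 ;;
                esac
                ;;
            *)
                # No wildcard: match it as a file name suffix.
                case $file in
                    *"$elem") return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}

# Sample names are assumptions, not taken from the server listing.
accepts 'tl_2012_01001_roads.zip' zip _72     && echo "accepted by 'zip'"
accepts 'tl_2012_01001_roads.zip' '*_72*.zip' || echo "rejected"
accepts 'tl_2012_72001_roads.zip' '*_72*.zip' && echo "accepted by pattern"
```

The first call shows why combining “zip” with “_72” does not narrow the
selection: the “zip” suffix alone is enough for a file to be accepted.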

> I only want the files with *_72*.zip to be downloaded to a copy of the
> directories it comes from on my system.

This is the invocation I have come up with (backslash used as line
continuation marker):

  wget --execute robots=off --timestamping \
       --no-parent --recursive --level=1 \
       --accept '*_72*.zip' \
       'https://www2.census.gov/geo/tiger/TIGER2012/ROADS/'

Make sure to quote strings containing characters with special meaning to
your shell (like the ‘*’ often used for globbing).  --level=1 seems to be
enough to get the .zip files: they are all in the directory your URL
points to – but you should check that.
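
A small sketch of why the quoting matters, assuming a POSIX-style shell
(the Windows command prompt does not expand wildcards itself, so this
applies mainly to sh, bash and similar).  The scratch directory and file
names are invented for the demonstration.

```shell
# Scratch directory with two made-up file names that match the glob.
dir=$(mktemp -d)
touch "$dir/a_72_x.zip" "$dir/b_72_y.zip"

# Unquoted: the shell replaces the pattern with the matching file names.
set -- "$dir"/*_72*.zip
unquoted_count=$#

# Quoted: the pattern reaches the program as one literal argument.
set -- '*_72*.zip'
quoted_count=$#
quoted_arg=$1

echo "unquoted: $unquoted_count arguments"
echo "quoted: $quoted_count argument ($quoted_arg)"
rm -r "$dir"
```

If matching files happen to exist in the current directory, an unquoted
pattern would be replaced by their names before wget ever sees it.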
  
> I have attached error imgs, I captured!

It would have been better had you provided a log in text form.  Wget
can be instructed to write to a log file using --output-file or
--append-output; if you still want to see the progress bar, also add
--show-progress.  You may also use the Windows command prompt’s
redirection operator “2> /path/to/file” to write wget’s output to a file
(wget prints its messages to standard error, so a plain “>” would not
capture them).
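
One caveat when redirecting: wget writes its messages to standard error
by default, so the stream being redirected matters.  The sketch below
uses a stand-in function (“fake_wget”, hypothetical, with a made-up
message) instead of a real download to show the difference.

```shell
# Stand-in for wget: prints an example message to standard error only.
fake_wget() {
    echo "example log message" >&2
}

logdir=$(mktemp -d)

# A plain ">" redirects standard output, so this log stays empty ...
fake_wget > "$logdir/stdout.log" 2>/dev/null

# ... while "2>" captures the message.
fake_wget 2> "$logdir/stderr.log"

[ -s "$logdir/stdout.log" ] || echo "stdout.log is empty"
[ -s "$logdir/stderr.log" ] && echo "stderr.log has the message"
```

With the real program, --output-file avoids the question entirely,
since wget then writes the log itself.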

Happy data analysing, I presume.

-- 
Felix Dietrich


