Re: [Bug-wget] Why does -A not work?


From: Nils Gerlach
Subject: Re: [Bug-wget] Why does -A not work?
Date: Wed, 20 Jun 2018 18:20:11 +0200

It does not delete any HTML file or anything else. Either a file is
accepted and kept, or it is saved forever anyway.
With the tip about --accept and --accept-regex I can get wget to traverse
the links, but it does not go deep enough to get the *l.jpgs. I tried to
increase -l, but to no avail. It seems like it is going only 1 link deep.
And it does not delete anything.
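Would something along these lines be closer? This is untested; the
"display"/"page" parts of the regex and the "l.jpeg" suffix are just
guesses from the URLs I see in the search results. My guess is that the
accept regex also has to match the intermediate HTML pages, otherwise
wget never fetches them and never finds the links to the large images:

  # let the regex match the search/display pages too, not only the jpegs,
  # so wget can actually follow them; -l 0 removes the depth limit
  wget -nd -rH -l 0 -Dcomicstriplibrary.org -e robots=off -p \
       --regex-type=posix \
       --accept-regex ".*(little-nemo.*l\.jpeg|display|page)" \
       'http://comicstriplibrary.org/search?search=little+nemo'

  # afterwards, throw away everything that is not a jpeg (assuming the
  # images are all I want to keep)
  find . -type f ! -name '*.jpeg' -delete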

2018-06-20 16:58 GMT+02:00 Tim Rühsen <address@hidden>:

> Hi Nils,
>
> please always reply to the mailing list (no problem if you CC me, but
> it's not needed).
>
> It was just an example of a POSIX regex - it's up to you to work out
> the details ;-) Or maybe there is a volunteer reading this.
>
> The implicitly downloaded HTML pages should be removed after parsing
> when you use --accept-regex - except for the 'starting' page given
> explicitly on your command line.
>
> Regards, Tim
>
> On 06/20/2018 04:28 PM, Nils Gerlach wrote:
> > Hi Tim,
> >
> > I am sorry, but your command does not work. It only downloads the
> > thumbnails from the first page and follows none of the links. Open
> > the link in a browser and click on one of the pictures to get a larger
> > picture. There is a link "high quality picture"; the pictures behind
> > those links are the ones I want to download.
> > The regex would be ".*little-nemo.*n\l.jpeg". And not only from the
> > first page, but from the other search result pages, too.
> > Can you work that one out? Does this work with wget? The best result
> > would be if the visited HTML pages were deleted by wget. If they stay,
> > I can delete them afterwards, but automating it would be better;
> > that's why I am trying to use wget ;)
> >
> > Thanks for the information on the filename and path, though.
> >
> > Greetings
> >
> > 2018-06-20 16:13 GMT+02:00 Tim Rühsen <address@hidden>:
> >
> >> Hi Nils,
> >>
> >> On 06/20/2018 06:16 AM, Nils Gerlach wrote:
> >>> Hi there,
> >>>
> >>> in #wget on freenode it was suggested that I write to you:
> >>> I tried using wget to get some images:
> >>> wget -nd -rH -Dcomicstriplibrary.org -A
> >>> "little-nemo*s.jpeg","*html*","*.html.*","*.tmp","*page*","*display*"
> >>> -p -e robots=off 'http://comicstriplibrary.org/search?search=little+nemo'
> >>> I wanted to download only the images, but wget was not following any
> >>> of the links, so I added that much more to -A. But it still does not
> >>> follow the links.
> >>> Page numbers of the search results contain "page" in the link; the
> >>> links to the big pictures I want wget to download contain "display".
> >>> Both are given in -A and appear in the HTML document wget gets.
> >>> Neither is followed by wget.
> >>>
> >>> Why does this not work at all? The website is public, so anybody is
> >>> free to test. But it is not my website!
> >>
> >> -A / -R work only on the filename, not on the path. The docs (man
> >> page) are not very explicit about that.
> >>
> >> Instead, try --accept-regex / --reject-regex, which act on the
> >> complete URL - but shell wildcards won't work.
> >>
> >> For your example this means replacing '.' with '\.' and '*' with '.*'.
> >>
> >> To download those nemo jpegs:
> >> wget -d -rH -Dcomicstriplibrary.org --accept-regex
> >> ".*little-nemo.*n\.jpeg" -p -e robots=off
> >> 'http://comicstriplibrary.org/search?search=little+nemo'
> >> --regex-type=posix
> >>
> >> Regards, Tim
> >>
> >>
> >
>
>

