Re: [Bug-wget] Why does -A not work?


From: Tim Rühsen
Subject: Re: [Bug-wget] Why does -A not work?
Date: Thu, 21 Jun 2018 11:23:03 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

Just try

wget2 -nd -l2 -r -A "*little-nemo*s.jpeg"
'http://comicstriplibrary.org/search?search=little+nemo'

and you only get
little-nemo-19051015-s.jpeg
little-nemo-19051022-s.jpeg
little-nemo-19051029-s.jpeg
little-nemo-19051105-s.jpeg
little-nemo-19051112-s.jpeg
little-nemo-19051119-s.jpeg
little-nemo-19051126-s.jpeg
little-nemo-19051203-s.jpeg
little-nemo-19051210-s.jpeg
little-nemo-19051217-s.jpeg
little-nemo-19051224-s.jpeg
little-nemo-19051231-s.jpeg
little-nemo-19060107-s.jpeg
little-nemo-19060114-s.jpeg
little-nemo-19060121-s.jpeg
little-nemo-19060128-s.jpeg
little-nemo-19060204-s.jpeg
little-nemo-19060211-s.jpeg
little-nemo-19060218-s.jpeg
little-nemo-19060225-s.jpeg
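
As a side note, -A follows ordinary shell-wildcard rules, so you can check what a
pattern would accept directly in the shell. A rough sketch - the "-l" file name and
search.html below are made up purely for illustration:

for f in little-nemo-19051015-s.jpeg little-nemo-19051015-l.jpeg search.html; do
  # bash pattern matching uses essentially the same wildcard syntax as -A
  [[ $f == *little-nemo*s.jpeg ]] && echo "accepted: $f" || echo "rejected: $f"
done
# only the first, thumbnail-style name is accepted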

Regards, Tim

On 06/20/2018 09:59 PM, Tim Rühsen wrote:
> On 20.06.2018 18:20, Nils Gerlach wrote:
>> It does not delete any HTML file or anything else; either it is accepted
>> and kept, or it is saved forever.
>> With the tip about --accept and --accept-regex I can get wget to traverse
>> the links, but it does not go deep enough to get the *l.jpgs. I tried to
>> increase -l, but to no avail. It seems like it is going only 1 link deep.
>> And it does not delete anything.
> 
> Yes, my mistake. Looking at the code, the regex options are applied
> without taking --recursive or --level into account. They are dumb URL
> filters.
> 
> We are back at
> 
> wget -d -olog -r -Dcomicstriplibrary.org -A "*little-nemo*s.jpeg"
> 'http://comicstriplibrary.org/search?search=little+nemo'
> 
> that doesn't work as expected. Somehow it doesn't follow certain links
> so that little-nemo*s.jpeg files aren't found.
> 
> Interestingly, the same options with wget2 do find and download those
> files. At first glance, the reason is that those files are linked from an
> RSS / Atom feed; those aren't supported by wget, but wget2 does parse
> them for URLs.
> 
> Want to give it a try? https://gitlab.com/gnuwget/wget2
> 
> Regards, Tim
> 
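To see what the "dumb URL filter" remark above means in practice: --accept-regex /
--reject-regex are simply matched against each discovered URL string, with no
knowledge of recursion depth or page structure, so a pattern can be sanity-checked
against candidate URLs with grep. The two URLs below are made up for illustration:

printf '%s\n' \
  'http://comicstriplibrary.org/display?id=105' \
  'http://comicstriplibrary.org/images/little-nemo-19051015-s.jpeg' |
  grep -E 'little-nemo.*s\.jpeg'
# only the second URL matches the pattern

The filter itself knows nothing about which of the rejected URLs would have been
needed to reach the accepted ones.
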
>>
>> 2018-06-20 16:58 GMT+02:00 Tim Rühsen <address@hidden>:
>>
>>> Hi Niels,
>>>
>>> please always reply to the mailing list (no problem if you CC me, but
>>> it's not needed).
>>>
>>> It was just an example of POSIX regexes - it's up to you to work out
>>> the details ;-) Or maybe there is a volunteer reading this.
>>>
>>> The implicitly downloaded HTML pages should be removed after parsing
>>> when you use --accept-regex, except for the 'starting' page explicitly
>>> given on your command line.
>>>
>>> Regards, Tim
>>>
>>> On 06/20/2018 04:28 PM, Nils Gerlach wrote:
>>>> Hi Tim,
>>>>
>>>> I am sorry, but your command does not work. It only downloads the
>>>> thumbnails from the first page and follows none of the links. Open the
>>>> link in a browser and click on the pictures to get a larger picture.
>>>> There is a link "high quality picture"; the pictures behind those links
>>>> are the ones I want to download.
>>>> The regex would be ".*little-nemo.*n\l.jpeg". And not only from the
>>>> first page, but from the other search result pages, too.
>>>> Can you work that one out? Does this work with wget? The best result
>>>> would be if the visited HTML pages were deleted by wget. But if they
>>>> stay, I can delete them afterwards. Automation would be better, though;
>>>> that's why I am trying to use wget ;)
>>>>
>>>> Thanks for the information on the filename and path, though.
>>>>
>>>> Greetings
>>>>
>>>> 2018-06-20 16:13 GMT+02:00 Tim Rühsen <address@hidden>:
>>>>
>>>>> Hi Nils,
>>>>>
>>>>> On 06/20/2018 06:16 AM, Nils Gerlach wrote:
>>>>>> Hi there,
>>>>>>
>>>>>> in #wget on freenode I was suggested to write this to you:
>>>>>> I tried using wget to get some images:
>>>>>> wget -nd -rH -Dcomicstriplibrary.org -A
>>>>>> "little-nemo*s.jpeg","*html*","*.html.*","*.tmp","*page*","*display*"
>>>>>> -p -e robots=off 'http://comicstriplibrary.org/search?search=little+nemo'
>>>>>> I wanted to download the images only, but wget was not following any
>>>>>> of the links, so I put that much more into -A. But it still does not
>>>>>> follow the links.
>>>>>> Page numbers of the search result contain "page" in the link; links
>>>>>> to the big pictures I want wget to download contain "display". Both
>>>>>> are given in -A and are seen in the HTML document wget gets. Neither
>>>>>> is followed by wget.
>>>>>>
>>>>>> Why does this not work at all? The website is public, anybody is free
>>>>>> to test. But this is not my website!
>>>>>
>>>>> -A / -R works only on the filename, not on the path. The docs (man
>>>>> page) are not very explicit about it.
>>>>>
>>>>> Instead try --accept-regex / --reject-regex, which act on the complete
>>>>> URL - but shell wildcards won't work.
>>>>>
>>>>> For your example this means to replace '.' by '\.' and '*' by '.*'.
>>>>>
>>>>> To download those nemo jpegs:
>>>>> wget -d -rH -Dcomicstriplibrary.org --accept-regex
>>>>> ".*little-nemo.*n\.jpeg" -p -e robots=off
>>>>> 'http://comicstriplibrary.org/search?search=little+nemo'
>>>>> --regex-type=posix
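
Two quick shell sketches of the points above - neither is a wget feature, just
illustrations. First, -A / -R only ever look at roughly what basename would give
you for a URL; second, the '.' -> '\.' / '*' -> '.*' translation can be done
mechanically:

# roughly what the -A / -R filename test operates on:
basename 'http://comicstriplibrary.org/search?search=little+nemo'
# -> search?search=little+nemo

# turning a -A glob into an --accept-regex pattern by hand:
printf '%s\n' '*little-nemo*s.jpeg' | sed -e 's/\./\\./g' -e 's/\*/.*/g'
# -> .*little-nemo.*s\.jpeg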
>>>>>
>>>>> Regards, Tim
>>>>>
>>>>>
>>>>
>>>
>>>
> 
