Re: [Bug-wget] Wget follows "button" links


From: Tim Rühsen
Subject: Re: [Bug-wget] Wget follows "button" links
Date: Tue, 5 Jun 2018 16:37:57 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

Hi,

> "Both --no-clobber and --convert-links were specified, only
> --convert-links will be used."

Right, I missed that. The combination of both flags was buggy by design
(also in 1.12) and suffered from several flaws (not to say bugs).

The regex should look more like '.*/xpage=watch.*'. The exact syntax depends on
the regex type selected with
  --regex-type=TYPE           regex type (posix|pcre)
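
A minimal sketch of how such a pattern could be tried out before starting a
crawl, assuming the URL shapes described further down in this thread. The
sample URLs and the variable names are made up for illustration. Since the
watch action appears after a '?' in those URLs rather than after a '/', the
sketch matches '[?&]xpage=watch'. It also assumes that --regex-type=posix is
treated as POSIX extended syntax, which is what 'grep -E' uses.

```shell
#!/bin/sh
# Hypothetical sample URLs, shaped like the ones described in this thread.
urls='https://www.wiki.com/WIKI-PAGE-NAME
https://www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument
https://www.wiki.com/delete/WIKI-PAGE-NAME
https://www.wiki.com/remove/WIKI-PAGE-NAME'

# POSIX extended regex: reject /delete/ and /remove/ path components and
# any URL whose query string carries xpage=watch.
reject_regex='.*(/(delete|remove)/|[?&]xpage=watch).*'

# Dry run: print only the URLs the pattern would NOT reject.
kept=$(printf '%s\n' "$urls" | grep -Ev "$reject_regex")
printf '%s\n' "$kept"

# The same pattern could then be handed to a wget that supports
# --reject-regex (not 1.12, per this thread), for example:
#   wget --recursive --regex-type=posix \
#        --reject-regex="$reject_regex" https://www.wiki.com/
```

Only the first sample URL is printed. The leading and trailing '.*' keep the
pattern usable whether the regex is applied to the whole URL or searched
within it.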

What else can you do... try wget2. It allows the combination of
--no-clobber and --convert-links. And if you find bugs, they can be fixed
(unlike in wget 1.x, where we would have to redesign a whole lot of things).

See https://gitlab.com/gnuwget/wget2

If you'd rather not build from git, you can download a pretty recent
tarball from https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.

Signature at https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.sig

Regards, Tim

On 06/05/2018 03:52 PM, CryHard wrote:
> Hey Tim,
> 
> Please see http://savannah.gnu.org/bugs/?31781 where it was implemented,
> as of version 1.12.1.
> 
> On my personal Mac I have 1.19.5, and when I run the command with both
> arguments I get this response:
>
> "Both --no-clobber and --convert-links were specified, only --convert-links
> will be used."
> 
> Anyway, I might make do without -nc if I can use the regex argument. Could
> you give an example of how that argument would work in my case? Can I just
> use www.mywiki.com/delete/* as an argument, for example? Or .*/xpage=watch.* ?
> 
> Thanks!
> 
> 
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> 
> On June 5, 2018 2:40 PM, Tim Rühsen <address@hidden> wrote:
> 
>> Hi,
>>
>> in this case you could try it with -X / --exclude-directories.
>>
>> E.g. wget -X /delete,/remove
>>
>> That wouldn't help with "xpage=watch..." though.
>>
>> And I can't tell you whether or how well -X works with wget 1.12.
>>
>> Why (or since when) doesn't --no-clobber plus --convert-links work any
>> more?
>>
>> Please feel free to open a bug report at
>> https://savannah.gnu.org/bugs/?func=additem&group=wget with a detailed
>> description.
>>
>> Cause it works for me :-)
>>
>> Regards, Tim
>>
>> On 06/05/2018 03:11 PM, CryHard wrote:
>>
>>> Hey Tim,
>>>
>>> Thanks for the info. The wiki software we use (xwiki) appends something to
>>> wiki page URLs to express a certain behavior. For example, to "watch" a
>>> page, the button, once pressed, redirects you to
>>> "www.wiki.com/WIKI-PAGE-NAME?xpage=watch&do=adddocument"
>>>
>>> Where the only thing that changes is the "WIKI-PAGE-NAME" part.
>>>
>>> Also, for actions such as "deleting" or "reverting" a wiki page, the
>>> URL changes by adding /remove/ or /delete/ "sub-folders" to the URL. These
>>> are usually in the middle, before the actual page name. For example:
>>> www.wiki.com/delete/WIKI-PAGE-NAME. So in this case the "offending" part is
>>> in the middle of the actual wiki page URL.
>>>
>>> What I would need to do is prevent wget from visiting any
>>> www.wiki.com/delete/ or www.wiki.com/remove/ pages. I'd also need to exclude
>>> links that end with "xpage=watch&do=adddocument", which cause me to watch
>>> that page.
>>>
>>> I am using v1.12 because the most recent versions no longer allow
>>> --no-clobber and --convert-links to work together. I need --no-clobber
>>> because if the download stops, I need to be able to resume without
>>> re-downloading all the files. And I need --convert-links because this needs
>>> to work as a local copy.
>>>
>>> From my understanding, the options you mention were added after v1.12.
>>> Is there any way to achieve this?
>>>
>>> BTW, -N (timestamps) doesn't work, as the server on which the wiki is 
>>> hosted doesn't seem to support this, hence wget keeps redownloading the 
>>> same files.
>>>
>>> Thanks a lot!
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>
>>> On June 5, 2018 1:57 PM, Tim Rühsen address@hidden wrote:
>>>
>>>> On 06/05/2018 11:53 AM, CryHard wrote:
>>>>
>>>>> Hey there,
>>>>>
>>>>> I've used the following:
>>>>>
>>>>> wget --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" \
>>>>>   --user=myuser --ask-password --no-check-certificate \
>>>>>   --recursive --page-requisites --adjust-extension --span-hosts \
>>>>>   --restrict-file-names=windows --domains wiki.com --no-parent wiki.com \
>>>>>   --no-clobber --convert-links --wait=0 --quota=inf -P /home/W
>>>>>
>>>>> to download a wiki. The problem is that this will follow "button" links,
>>>>> e.g. the links that allow a user to put a page on a watchlist for further
>>>>> modifications. This has led to me watching hundreds of pages. Not only
>>>>> that, but apparently it also follows the links that lead to reverting
>>>>> changes made by others on a page.
>>>>>
>>>>> Is there a way to avoid this behavior?
>>>>
>>>> Hi,
>>>>
>>>> that depends on how these "button links" are realized.
>>>>
>>>> A button may be part of an HTML FORM tag/structure where the URL is the
>>>> value of the 'action' attribute. Wget doesn't download such URLs because
>>>> of the problem you describe.
>>>>
>>>> A dynamic web page can realize "button links" by using simple links.
>>>> Wget doesn't know about hidden semantics and so downloads these URLs -
>>>> and maybe they trigger some changes in a database.
>>>>
>>>> If this is your issue, you have to look into the HTML files and exclude
>>>> those URLs from being downloaded. Or you create a whitelist. Look at
>>>> options -A/-R and --accept-regex and --reject-regex.
>>>>
>>>>> I'm using the following version:
>>>>>
>>>>>> wget --version
>>>>>> GNU Wget 1.12 built on linux-gnu.
>>>>
>>>> Ok, you should update wget if possible. Latest version is 1.19.5.
>>>>
>>>> Regards, Tim