Re: [Bug-wget] Unexpected result with -H and -D
From: Tim Rühsen
Subject: Re: [Bug-wget] Unexpected result with -H and -D
Date: Wed, 17 Jan 2018 15:53:58 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

Hi,
this is not a PSL matching problem, so libpsl is not needed.
sufmatch() just has to be fixed to do (sub)domain matching.
Attached is a fix.
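
For readers without the attachment, here is a minimal sketch of the difference between plain suffix matching and (sub)domain matching. It is not the attached patch and not Wget's actual sufmatch(); the function names naive_suffix_match() and domain_match() are made up for illustration, and strcasecmp() assumes a POSIX system.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical illustration of the buggy behaviour: accepts any host
 * whose name merely ends with the given domain string. */
static bool naive_suffix_match(const char *domain, const char *host)
{
    assert(domain != NULL && host != NULL);

    size_t dlen = strlen(domain);
    size_t hlen = strlen(host);

    if (dlen > hlen)
        return false;
    return strcasecmp(host + (hlen - dlen), domain) == 0;
}

/* Hypothetical illustration of (sub)domain matching: the host must be
 * equal to the domain, or the matched suffix must start at a label
 * boundary, i.e. be preceded by a '.' in the host name. */
static bool domain_match(const char *domain, const char *host)
{
    assert(domain != NULL && host != NULL);

    size_t dlen = strlen(domain);
    size_t hlen = strlen(host);

    if (dlen == 0 || dlen > hlen)
        return false;
    if (strcasecmp(host + (hlen - dlen), domain) != 0)
        return false;
    if (hlen == dlen)
        return true;                      /* exact match */

    /* A domain given with a leading dot already carries the boundary;
     * otherwise the character before the suffix must be a dot. */
    return domain[0] == '.' || host[hlen - dlen - 1] == '.';
}
```

With --domains=scapino.nl, the naive check accepts werkenbijscapino.nl because the string ends in "scapino.nl", while the boundary check rejects it and still accepts cdn.scapino.nl and www.scapino.nl.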
With Best Regards, Tim
On 01/17/2018 03:01 PM, Darshit Shah wrote:
> Hi,
>
> This is a bug in Wget, apparently a really old one! It seems the bug has been
> around since at least 1997.
>
> Looking at the source, the issue is that Wget does a very simple suffix
> match of the actual host name against the list of accepted domains. This is
> obviously wrong, as you have just found out.
>
> I'm going to try and implement this correctly, but I'm currently a little short
> on time, so if anyone else wants to pick it up, please feel free to. It's
> simple: use libpsl to get the proper domain name and match against that.
>
>
> Of course, this change will require libpsl to no longer be an optional
> dependency.
>
> * Friso van Vollenhoven <address@hidden> [180117 14:40]:
>> Hello all,
>>
>> I am trying to do a recursive download of a webpage and span multiple hosts
>> within the same domain, but not cross to other domains. The issue is that
>> the crawl does extend to other domains. My full command is this:
>>
>> wget \
>> --recursive \
>> --no-clobber \
>> --page-requisites \
>> --adjust-extension \
>> --span-hosts \
>> --domains=scapino.nl \
>> --no-parent \
>> --tries=2 \
>> --wait=1 \
>> --random-wait \
>> --waitretry=2 \
>> --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
>> https://www.scapino.nl/winkels/scapino-utrecht-510061
>>
>> From this combination of --span-hosts and --domains, I would expect to
>> download assets from cdn.scapino.nl and www.scapino.nl, but not other
>> domains. For some reason that I don't understand, wget also starts to do
>> what looks like a full crawl of the domain werkenbijscapino.nl, which is
>> referenced from the original page.
>>
>> Any thoughts or direction would be much appreciated.
>>
>> I am using wget 1.18 on Debian.
>>
>>
>> Best regards,
>> Friso
>
Attachment: 0001-src-host.c-sufmatch-Fix-to-domain-matching.patch (text data)
Attachment: signature.asc (OpenPGP digital signature)