Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming

From: UukGoblin
Subject: Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming
Date: Fri, 1 May 2015 11:50:19 +0000
On Thu, Apr 30, 2015 at 11:02:31PM +0200, Tim Rühsen wrote:
> The top-down approach would be something like
> wget -r --extract-links | distributor host1 host2 ... hostN
> 'distributor' is a program that starts one instance of wget on each host
> given, takes the (absolute) URLs via stdin, and gives them to the wget
> instances (e.g. via round-robin... better would be to know whether a file
> download has finished).

Yes, something like that, although it's not quite that simple. The
distributor would have to know what has just been downloaded by each
worker, and invoke the link extractor on each newly downloaded HTML file
in order to append its links to the download queue.
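
To make that feedback loop concrete, here is a minimal Python sketch, not
a real implementation. It assumes a hypothetical fetch(worker, url)
callback that downloads a URL on the given worker and returns the HTML
text (or None for non-HTML files). The names crawl, extract_links and
LinkExtractor are made up for illustration, and the loop runs
sequentially, whereas a real distributor would run the workers in
parallel.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html, base_url):
    """Return the absolute URLs (fragments stripped) found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def crawl(start_url, fetch, workers):
    """Hand URLs to workers round-robin and queue newly found links.

    fetch(worker, url) is a hypothetical callback: it downloads url on
    the given worker and returns the HTML text, or None for non-HTML.
    Returns the list of (worker, url) assignments in order.
    """
    queue = deque([start_url])
    seen = {start_url}          # avoid queueing the same URL twice
    assignments = []
    turn = 0
    while queue:
        url = queue.popleft()
        worker = workers[turn % len(workers)]
        turn += 1
        assignments.append((worker, url))
        html = fetch(worker, url)
        if html is None:
            continue            # nothing to extract from non-HTML files
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return assignments
```

The point is only that the queue grows while the download is running, so
the distributor cannot be a plain one-way pipe.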

> I assume '-r --extract-links' does not download, but just recursively
> scans the existing files and extracts their links!?

Yes, that's exactly what I had in mind.

> Wget also has to be adjusted to start downloading immediately on the first
> URL read from stdin. Right now it collects all URLs until stdin closes and
> then starts downloading.

Ah, good point, I wasn't aware of that.
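
The difference between the two reading strategies can be sketched in
Python. This only illustrates the idea and is not how wget itself is
written; both function names are made up. The first function hands back
each URL as soon as its line arrives, the second waits for end of input
before returning anything.

```python
import io


def urls_as_they_arrive(stream):
    """Yield each URL as soon as its line has been read from stream."""
    # readline() returns "" only at end of input, so the loop processes
    # every line immediately instead of waiting for the stream to close.
    for line in iter(stream.readline, ""):
        url = line.strip()
        if url:
            yield url


def urls_after_eof(stream):
    """Collect all URLs first; nothing is returned until stream closes."""
    return [line.strip() for line in stream.read().splitlines() if line.strip()]


if __name__ == "__main__":
    demo = io.StringIO("http://example.com/a\nhttp://example.com/b\n")
    for url in urls_as_they_arrive(demo):
        print("would start downloading", url)
```

With a pipe as input, a worker fed by the first variant can start its
first download while the producer is still extracting links.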

> I wrote a C library for the nextgen Wget (I'll start moving the code to
> wget this autumn) with which you can also do the extraction part. There are
> small C examples that you might extend to work recursively. It works with
> CSS and HTML.
> https://github.com/rockdaboot/mget/tree/master/examples

Nice, thank you! I'll check it out :-)

