
Re: [Bug-wget] [Bug-Wget] Issue in recursive retrievals

From: Darshit Shah
Subject: Re: [Bug-wget] [Bug-Wget] Issue in recursive retrievals
Date: Sat, 22 Mar 2014 22:32:20 +0100

On Sat, Mar 22, 2014 at 10:21 PM, Ángel González <address@hidden> wrote:
> On 22/03/14 18:10, Darshit Shah wrote:
>> There was a case earlier today on the IRC channel that I'd like to
>> bring up here.
>> The user in question was attempting to continue a recursive retrieval.
>> The files being downloaded were large binaries. However, Wget still
>> happens to load files that have already been downloaded in an attempt
>> to find new links. Below is the debug output that the user shared:
> (...)
>> As you can see, Wget receives only an HTTP 416 response with
>> Content-Type text/html, but it still loads the complete 2GB file into
>> memory, looking for links. Since Wget does not know the filetype at
>> this moment, I agree it might be the right thing to do, but according
>> to section 7.2.1 of RFC 2616,
>> "
>>     Any HTTP/1.1 message containing an entity-body SHOULD include a
>>     Content-Type header field defining the media type of that body. If
>>     and only if the media type is not given by a Content-Type field, the
>>     recipient MAY attempt to guess the media type via inspection of its
>>     content and/or the name extension(s) of the URI used to identify the
>>     resource. If the media type remains unknown, the recipient SHOULD
>>     treat it as type "application/octet-stream".
>> "
>> Hence, Wget's behaviour seems to go against what the specification
>> mandates.
> Well, the text/html content-type in the reply seems to indicate that
> the file *is* html, so it makes sense that it scans for links
> (although I suspect that the server is wrong and it isn't).
I think the server is right in sending that Content-Type header. The
specification never states that the header has to describe the
requested resource; rather, the Content-Type describes the response
body actually being sent back. Since the 416 response body is simple
text, a content type of text/html seems fine to me.

If I've misread the specification somewhere, do let me know.
>> However, I understand that for continuing recursive retrievals, we may
>> want to scan all existing files too. Maybe Wget could write a simple
>> flat file with the relevant details in case it is aborted? This way it
>> knows which files it *should* parse and which ones it shouldn't.
>> The user reporting this issue had the problem that Wget would block
>> for almost 30 seconds on each previously downloaded file while loading
>> it into memory, whereas it simply skipped over newly downloaded files,
>> giving me the idea that the server did indeed send the right
>> Content-Type headers with its HTTP 200 responses.
>> I'm looking for comments and opinions on how Wget should handle such
>> corner cases.
> Even worse, I have seen wget trying to parse for links files bigger
> than it could load into memory (first trying to mmap, which failed,
> then slowly using read() and realloc(), until it finally crashed...)
> A simple optimization for these cases would be to quickly skip the
> link-scanning if the file looks like binary.
Yes. But I'm not sure how you'd scan and check whether the file
*looks* binary.

> A different issue we could fix for download continuation is to add a
> parameter to skip downloading existing files, i.e. if there's a file
> with the name we would use, treat it as the final file we wanted to
> download and don't ask the server about it at all.
> When continuing downloads of a large number of files, the round trips
> of continue-this / 416 can add a significant delay.

Yes, that would be a nice feature to have. I've been burnt by this
myself, and it would also reduce the load on some servers. It should
not be too difficult to implement; let me see if I can code it up
tomorrow.

Thanking You,
Darshit Shah
