Re: [Bug-wget] [Bug-Wget] Issue in recursive retrievals


From: Ángel González
Subject: Re: [Bug-wget] [Bug-Wget] Issue in recursive retrievals
Date: Sun, 23 Mar 2014 01:31:23 +0100
User-agent: Thunderbird

On 22/03/14 22:32, Darshit Shah wrote:
On Sat, Mar 22, 2014 at 10:21 PM, Ángel González<address@hidden>  wrote:
Well, the text/html content-type in the reply seems to indicate that the file *is* html, so it makes sense that it scans for links (although I suspect that the server is wrong and it isn't).
I think the server is right in sending the Content-Type header. The
specification never states that the header details have to be that for
the requested resource. Rather, the content-type being sent is about
the response being sent back. Since the response is simple text, a
content type of text/html seems fine to me.

If I've misread the specification somewhere, do let me know.
I was under the impression that 416 was one of the no-body replies, in which
case I expected the content-type to be that of the underlying resource. Turns
out I was wrong. A *body* is returned, although in my opinion the other
behavior would have been preferable.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>416 Requested Range Not Satisfiable</title>
</head><body>
<h1>Requested Range Not Satisfiable</h1>
<p>None of the range-specifier values in the Range
request-header field overlap the current extent
of the selected resource.</p>
</body></html>

So forget what I said :)


Even worse, I have seen wget trying to parse for links files bigger than it
could load into memory (first trying to mmap, which failed, then slowly using
read() and realloc(), until it finally crashed...). A simple optimization for
these cases would be to quickly skip the link-scanning if the file looks like
binary.
Yes. But I'm not sure how you'd like to scan and check if the file
*looks* binary.

I guess the best way to do that is to follow http://mimesniff.spec.whatwg.org/#rules-for-text-or-binary
Basically, it says to consider a file binary if, reading the first 512 bytes, there is any byte in the range 0x00 to 0x08 (NUL to BS), the byte 0x0B (VT), a byte in the range 0x0E to 0x1A (SO to SUB), or a byte in the range 0x1C to 0x1F (FS to US).

(those 512 bytes aren't deterministic, but since wget will be downloading full files, they seem a good bet)
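
As a rough illustration of the byte-range rule described above, here is a minimal sketch in C. It is not wget code: the names looks_binary, is_binary_data_byte and SNIFF_LEN are made up for this example, and it only implements the range check quoted from the spec, applied to a buffer the caller has already read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of leading bytes to inspect; 512 is the figure quoted above. */
#define SNIFF_LEN 512

/* Hypothetical helper: true for bytes in the ranges listed above
   (0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F). TAB, LF, FF, CR and ESC
   are deliberately not in the set. */
bool is_binary_data_byte(unsigned char c)
{
    return c <= 0x08
        || c == 0x0B
        || (c >= 0x0E && c <= 0x1A)
        || (c >= 0x1C && c <= 0x1F);
}

/* Hypothetical helper: returns true if any of the first SNIFF_LEN bytes
   of buf (or all of them, if len is smaller) is a binary data byte. */
bool looks_binary(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t n = len < SNIFF_LEN ? len : SNIFF_LEN;
    for (size_t i = 0; i < n; i++) {
        if (is_binary_data_byte(buf[i]))
            return true;
    }
    return false;
}
```

A caller would read the start of the downloaded file into a buffer and skip link scanning when looks_binary() returns true. Since only the first 512 bytes are examined, the cost is constant regardless of file size.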

A different issue we could fix for download continuation is to add a
parameter to skip download of existing files, i.e. if there's a file with the
name we would use, treat it as the final file we wanted to download and don't
ask the server at all about it.
When continuing downloads of a large number of files, the round trips of
continue-this / 416 can give a significant delay.

Yes, that would be a nice feature to have. I've myself been burnt on
this. Also, we would reduce the loads on some servers because of this.
Should not be too difficult to implement. Let me see if I can code
this up tomorrow.
Nice!




