From: Jens Schleusener
Subject: Re: [Bug-wget] New wget (1.19.2): Unexpected download behaviour for gzip-compressed tarballs (HTTP-header dependent)
Date: Fri, 3 Nov 2017 20:10:22 +0100 (CET)
User-agent: Alpine 2.20 (LSU 67 2015-01-07)

On Fri, 3 Nov 2017, Tim Rühsen wrote:

> On 11/03/2017 06:37 AM, James Cloos wrote:
>> "TR" == Tim Rühsen <address@hidden> writes:
>>
>> TR> I downloaded/tested thousands of web pages and they behave as if
>> TR> 'Content-Encoding: gzip' is a compression for the transport.
>> TR> Uncompressing it 'on-the-fly' and saving that uncompressed data
>> TR> was the correct behavior.
>>
>> Lots of servers have that misconfiguration; it was recommended in the
>> past and apache defaulted to doing that when grabbing things like tar.gz.
>>
>> The gui browsers had to learn to work around that misconfig. wget also
>> has to.
>>
>> In short, do not uncompress if the destination name has a compression
>> suffix. Or, in that case, test whether the uncompressed data starts
>> with gzip magic and complete one decompression if so, none if not.
>>
>> And the same for the other compression formats.
>
> Thanks for this insight!
>
> Looking at the Mozilla/Gecko sources shows that gzip Content-Encoding is
> just cleared for Content-Types application/x-gzip, application/gzip and
> application/x-gunzip.
>
> That makes it straightforward to go that way.

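For illustration, the Gecko-style rule quoted above could be sketched roughly as follows. This is only a minimal Python sketch, not wget's or Gecko's actual code: the function names `should_decode_gzip` and `body_to_save` are made up for this example, and it handles only the gzip coding and the three Content-Types named above.

```python
import gzip

# Content-Types for which, per the description above, a gzip
# Content-Encoding is simply cleared (i.e. the body is saved as received).
GZIP_MEDIA_TYPES = {
    "application/x-gzip",
    "application/gzip",
    "application/x-gunzip",
}


def should_decode_gzip(content_encoding, content_type):
    """Hypothetical helper: True if the gzip content coding should be undone."""
    encoding = (content_encoding or "").strip().lower()
    if encoding != "gzip":
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return media_type not in GZIP_MEDIA_TYPES


def body_to_save(body, content_encoding, content_type):
    """Hypothetical helper: return the bytes a client would write to disk."""
    if should_decode_gzip(content_encoding, content_type):
        return gzip.decompress(body)
    return body
```

With this rule a tarball served as application/x-gzip with 'Content-Encoding: gzip' is written unchanged and so stays a valid .tar.gz, while a gzip-encoded text/html page is decompressed before saving.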
That seems, at least for the gzip ones, to be a client-side correction of server behaviour that is incorrect according to RFC 7231, "Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content", section 3.1.2.2:

https://tools.ietf.org/html/rfc7231#section-3.1.2.2

   If the media type includes an inherent encoding, such as a data
   format that is always compressed, then that encoding would not be
   restated in Content-Encoding even if it happens to be the same
   algorithm as one of the content codings. Such a content coding would
   only be listed if, for some bizarre reason, it is applied a second
   time to form the representation. Likewise, an origin server might
   choose to publish the same data as multiple representations that
   differ only in whether the coding is defined as part of Content-Type
   or Content-Encoding, since some user agents will behave differently
   in their handling of each response (e.g., open a "Save as ..." dialog
   instead of automatic decompression and rendering of content).

Regards

Jens