bug-wget
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-wget] [PATCH 1-3/3] support HTTP compression


From: Yuriy M. Kaminskiy
Subject: Re: [Bug-wget] [PATCH 1-3/3] support HTTP compression
Date: Thu, 18 Dec 2014 15:32:00 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

Ángel González wrote:
> On 16/12/14 16:20, Yuriy M. Kaminskiy wrote:
> Thanks!
> 
> Comments inline.

FWIW: I found a bug with these patches (introduced by patch 2/3): option
--convert-links (and --backup-converted) behaves confusingly/broken when
--compressed set.

>> > From ccb95548926a0ab8ad4ed3787470d4803efad4ca Mon Sep 17 00:00:00 2001
>> From: "Yuriy M. Kaminskiy"<address@hidden>
>> Date: Tue, 16 Dec 2014 17:19:22 +0300
>> Subject: [PATCH 1/3] Recognize Content-Encoding header
>>
>> ---
>>   src/http.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   src/wget.h |    8 +++++++-
>>   2 files changed, 56 insertions(+), 1 deletions(-)
>>
...
>> @@ -2822,6 +2867,10 @@ read_header:
>>           {
>>             ensure_extension (hs, ".css", dt);
>>           }
>> +      if (ce_ext != NULL)
>> +        {
>> +          ensure_extension (hs, ce_ext, dt);
>> +        }
>>       }
>>
>>     if (statcode == HTTP_STATUS_RANGE_NOT_SATISFIABLE
> 
> I don't think we should adapt the extension to include the compression.
> We just want the files uncompressed, don't we?

I preferred to save file as is, without attempting to decompress them, leaving
decompression to user.
(Later patches only decompress files in-memory, when wget want to parse them for
links (-r/-p options)).

It also allows to resume compressed downloads (if server cached compressed
variants and accepts Range:'d requests).
And no unwanted decompression of .tar.gz, when server "helpfully" sets C-E: gzip
for such files.

>> diff --git a/src/wget.h b/src/wget.h
>> index 6edbfb8..be5fe51 100644
>> --- a/src/wget.h
>> +++ b/src/wget.h
>> @@ -331,7 +331,13 @@ enum
>>     SEND_NOCACHE         = 0x0008,        /* send Pragma: no-cache
>> directive */
>>     ACCEPTRANGES         = 0x0010,        /* Accept-ranges header was
>> found */
>>     ADDED_HTML_EXTENSION = 0x0020,        /* added ".html" extension
>> due to -E */
>> -  TEXTCSS              = 0x0040         /* document is of type
>> text/css */
>> +  TEXTCSS              = 0x0040,        /* document is of type
>> text/css */
>> +  CE_COMPRESS          = 0x0100,        /* Content-Encoding: compress */
>> +  CE_DEFLATE           = 0x0200,        /* Content-Encoding: deflate */
>> +  CE_GZIP              = 0x0400,        /* Content-Encoding: gzip */
>> +  CE_BZIP2             = 0x0800,        /* Content-Encoding: bzip2 */
>> +#define CE_ANY (CE_COMPRESS|CE_DEFLATE|CE_GZIP|CE_BZIP2)
>> +                                        /* Any known Content-Encoding */
>>   };
>>
>>   /* Universal error type -- used almost everywhere.  Error reporting of
>> -- 
>>
> 
>> > From b41339ba8d0e28f9033e83883d33889177ebbdc5 Mon Sep 17 00:00:00 2001
>> From: "Yuriy M. Kaminskiy"<address@hidden>
>> Date: Tue, 16 Dec 2014 17:29:34 +0300
>> Subject: [PATCH 2/3] Added option to support HTTP compression
>>
>> Send Accept-Encoding header, decompress gzip-compressed document
>> (recognize by extension, should work well with --adjust-extension option)
>> ---
>>   src/http.c    |    6 +++++-
>>   src/init.c    |    1 +
>>   src/main.c    |    3 +++
>>   src/options.h |    1 +
>>   src/utils.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
>>   5 files changed, 53 insertions(+), 1 deletions(-)
>>
>> diff --git a/src/http.c b/src/http.c
>> index c769e95..5ce6c93 100644
>> --- a/src/http.c
>> +++ b/src/http.c
>> @@ -1798,6 +1798,9 @@ gethttp (struct url *u, struct http_stat *hs,
>> int *dt, struct url *proxy,
>>                           rel_value);
>>     SET_USER_AGENT (req);
>>     request_set_header (req, "Accept", "*/*", rel_none);
>> +  if (opt.compressed)
>> +    request_set_header (req, "Accept-Encoding", "gzip, deflate",
>> rel_none);
> If we are also supporting compress and bzip2, they should be listed.

Rationale:
a) most browsers supports only gzip and deflate; (de)compressing "bzip2" is more
expensive, and "compress" is outdated, so I doubt a lot of servers support them.
b) there are no support for decompressing bzip2 and compress in my patches

Arguable, this list should be shortened to just "gzip":
1) there are confusion about format and compatibility problems with "deflate"
[is it "raw deflate" or "zlib format"?],
2) "deflate" format lacks magic/signatures [so decompress code is limited to
matching extension],
3) there no commonly distributed tools to decompress it [and my design decision
was to don't decompress saved files, leaving it to user; I have a tool to deal
with them, but what is common user expected to do with ".zlib" files?].
4) there are no advantage of "deflate" over "gzip" in compression ration/etc,
5) I doubt there are servers that support "deflate", but don't support "gzip"

If user accepts limitations, she can always override defaults with something 
like
$ wget -E --com --header 'Accept-Encoding: gzip, deflate, bzip2, compress' [...]
(as --header option overrides whatever wget sets implicitly).

>> +  else
>>     request_set_header (req, "Accept-Encoding", "identity", rel_none);
> 
> You should indent lines when they move into else branches.
>>     /* Find the username and password for authentication. */
>> @@ -2683,7 +2686,8 @@ read_header:
>>           }
>>       }
>>
>> -  if (resp_header_copy (resp, "Content-Encoding", hdrval,
>> sizeof(hdrval)))
>> +  if (opt.compressed&&
>> +      resp_header_copy (resp, "Content-Encoding", hdrval,
>> sizeof(hdrval)))
>>       {
> I would decompress it even if we didn't ask for the compression.
>>         const char *p = hdrval;
>>         if (p[0] == 'x'&&  p[1] == '-')
>> diff --git a/src/init.c b/src/init.c
>> index 088a6e9..ccf742a 100644
>> --- a/src/init.c
>> +++ b/src/init.c
>> @@ -154,6 +154,7 @@ static const struct {
>>     { "checkcertificate",&opt.check_cert,        cmd_boolean },
>>   #endif
>>     { "chooseconfig",&opt.choose_config,     cmd_file },
>> +  { "compressed",&opt.compressed,        cmd_boolean },
>>     { "connecttimeout",&opt.connect_timeout,   cmd_time },
>>     { "contentdisposition",&opt.content_disposition, cmd_boolean },
>>     { "contentonerror",&opt.content_on_error,  cmd_boolean },
>> diff --git a/src/main.c b/src/main.c
>> index 99c2819..6f0d7b2 100644
>> --- a/src/main.c
>> +++ b/src/main.c
>> @@ -186,6 +186,7 @@ static struct cmdline_option option_data[] =
>>       { IF_SSL ("certificate-type"), 0, OPT_VALUE, "certificatetype",
>> -1 },
>>       { IF_SSL ("check-certificate"), 0, OPT_BOOLEAN,
>> "checkcertificate", -1 },
>>       { "clobber", 0, OPT__CLOBBER, NULL, optional_argument },
>> +    { "compressed", 'C', OPT_BOOLEAN, "compressed", -1 },
>>       { "config", 0, OPT_VALUE, "chooseconfig", -1 },
>>       { "connect-timeout", 0, OPT_VALUE, "connecttimeout", -1 },
>>       { "continue", 'c', OPT_BOOLEAN, "continue", -1 },
>> @@ -582,6 +583,8 @@ Directories:\n"),
>>       N_("\
>>     -nH, --no-host-directories       don't create host directories.\n"),
>>       N_("\
>> +       --compressed                support HTTP compression
>> (Content-Encoding).\n"),
> Maybe it's better to name it "compression" ? (with either none or
> methods as values)

Rationale: curls call this option "--compressed"; fewer number of varying option
names => lower user confusion (sure, curl most of time takes opposite approach;
but so what?).

>> > From d48d72d891773858aa70806787466224c2e8a31f Mon Sep 17 00:00:00 2001
>> From: "Yuriy M. Kaminskiy"<address@hidden>
>> Date: Tue, 16 Dec 2014 17:35:27 +0300
>> Subject: [PATCH 3/3] Attempt to recognize compressed files by signature
>>
>> Should improve --compressed without --adjust-extension
>> ---
>>   src/utils.c |   48 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 files changed, 47 insertions(+), 1 deletions(-)
...
>> -- 
> Doesn't this last patch mean that if I download
> http://ftp.gnu.org/gnu/wget/wget-1.16.tar.gz and the compressed option
> is enabled, then I get a plain tar, not a tgz?

That's why I *didn't* want to make wget to save uncompressed files: some servers
set Content-Encoding for already compressed files.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]