Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files

From:	Tim Rühsen
Subject:	Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files
Date:	Sat, 30 Mar 2013 21:54:23 +0100
User-agent:	KMail/1.13.7 (Linux/3.7-trunk-amd64; KDE/4.8.4; x86_64; ; )

On Friday, 29 March 2013, Andy Jackson wrote:
> When using wget 1.14 to generate warc.gz files, e.g.
> 
> wget -O tempname --warc-file="output"  "http://example.com"
> 
> the files this creates do not play back well using the Internet Archive's 
> warc.gz parsers, throwing errors like 
> 
> "Invalid FExtra length/records". 
> 
> It appears wget may be creating slightly malformed GZIP skip-length 
> fields - see 
> 
> https://github.com/ukwa/warc-discovery/issues/1 
> 
> for details.
> 
> It's likely that we'll need to make the warc.gz parsers a bit more 
> robust, but I thought I'd mention it here in case this is 
> actually a bug in wget.
> 
> Thanks for your time.
> 
> Andy Jackson

Just a very quick test (before I go to bed) shows behaviour I did not
expect:

$ wget -O tempname --warc-file="output"  "http://example.com"
results in a 5065-byte file 'output.warc.gz'.

Unzipping it and zipping it again results in a 2387-byte file.
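
The same comparison can be repeated from a script. What follows is only a minimal Python sketch, not the exact steps used for the numbers above. It assumes the file is called output.warc.gz, as in the command; the names recompressed_sizes and report are made up for illustration. The recompression step writes one single gzip stream at Python's default compression level, so the result need not match what the gzip command-line tool produces.

```python
import gzip


def recompressed_sizes(data):
    """Return (original, uncompressed, recompressed) sizes in bytes.

    `data` is the complete content of a .gz file.
    """
    # gzip.decompress also accepts several gzip members concatenated together.
    raw = gzip.decompress(data)
    # Recompress everything as one single gzip stream (default level).
    again = gzip.compress(raw)
    return len(data), len(raw), len(again)


def report(path="output.warc.gz"):
    """Format the three sizes for the file at `path` (assumed file name)."""
    with open(path, "rb") as f:
        original, raw, again = recompressed_sizes(f.read())
    return (f"{path}: {original} bytes as written, "
            f"{raw} uncompressed, {again} recompressed")
```

Calling print(report()) in the directory where wget wrote its output prints the three sizes side by side.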

So, at first glance, it looks like Wget compresses far from optimally.
But I won't call it a bug before I take a deeper look (in the next few days).
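
As a possible starting point for looking at the "Invalid FExtra length/records" report, here is a hedged Python sketch that reads the optional extra field of the first gzip member, following the header layout in RFC 1952, and checks that the subfield lengths add up to the declared XLEN. The function name parse_gzip_extra is hypothetical, and the check is an assumption about what a strict parser might verify; it does not reproduce the Internet Archive parsers' actual logic.

```python
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field is present


def parse_gzip_extra(data):
    """Return [(subfield_id, payload), ...] from the first gzip member.

    Returns [] if the member has no extra field. Raises ValueError if the
    header is truncated or the subfields do not fit inside XLEN.
    """
    # Fixed header: ID1 ID2 CM FLG MTIME(4) XFL OS = 10 bytes.
    if len(data) < 10 or data[0:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    if not data[3] & FEXTRA:
        return []
    if len(data) < 12:
        raise ValueError("truncated XLEN")
    (xlen,) = struct.unpack_from("<H", data, 10)  # little-endian 16-bit
    extra = data[12:12 + xlen]
    if len(extra) != xlen:
        raise ValueError("truncated extra field")

    subfields = []
    pos = 0
    while pos < xlen:
        # Each subfield: SI1 SI2 LEN(2, little-endian) then LEN payload bytes.
        if pos + 4 > xlen:
            raise ValueError("subfield header overruns XLEN")
        subfield_id = extra[pos:pos + 2]
        (length,) = struct.unpack_from("<H", extra, pos + 2)
        pos += 4
        if pos + length > xlen:
            raise ValueError("subfield payload overruns XLEN")
        subfields.append((subfield_id, extra[pos:pos + length]))
        pos += length
    return subfields
```

Passing the first few hundred bytes of output.warc.gz to parse_gzip_extra would show whether the first record's extra field is internally consistent; it says nothing about later members in the file.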

Regards Tim
