Re: [Bug-wget] --header="Accept-encoding: gzip"
From: andreas wpv
Subject: Re: [Bug-wget] --header="Accept-encoding: gzip"
Date: Tue, 22 Sep 2015 21:47:38 -0500
Many sites use server-side compression when sending most of their data.
Google is an example; these are the response headers from a homepage load
(F12 in Chrome):
alt-svc: quic=":443"; p="1"; ma=604800
alternate-protocol: 443:quic,p=1
cache-control: private, max-age=0
content-encoding: gzip
content-type: text/html; charset=UTF-8
date: Wed, 23 Sep 2015 02:20:03 GMT
See the 'content-encoding'? Chrome shows the encoded size of the HTML
document as 53 kB. If I download it with wget, it shows me the uncompressed
size, 150 kB: wget --user-agent mozilla www.google.com
If I use wget --user-agent mozilla --header="accept-encoding: gzip"
www.google.com, it downloads a 51 kB file - much closer to what Chrome sees
(the difference might be the user agent and cookie handling, or the
download may not work properly: if I zcat the file, it seems cut off).
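One quick way to tell whether the compressed download really is cut off is
gzip's built-in integrity test. A minimal local sketch (file names here are
made up for illustration):

```shell
# Create a small gzip file, then deliberately truncate a copy of it.
printf 'hello compressed world\n' | gzip > sample.gz
head -c 20 sample.gz > truncated.gz

# gzip -t exits 0 for a complete stream and non-zero for a cut-off one.
gzip -t sample.gz && echo "sample.gz: OK"
gzip -t truncated.gz 2>/dev/null || echo "truncated.gz: cut off"
```

Running gzip -t on the file wget saved should show whether the transfer was
actually truncated or just a different size than Chrome reports.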
So now, with -p, I want to load all page elements (images, scripts, CSS,
etc.), and with -H I make sure to also get the elements from other domains.
(-r is not the right tool for this, as far as I know.) By the way, with no
user agent Google blocks the download; you actually need a full, valid
agent string. I also turn robots off so wget won't check robots.txt as
well (saves time).
Example:
wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" \
  -e robots=off -p -H "www.google.com"
This gives me a whole list of files with a total of 378 kB:
.
./ssl.gstatic.com
./ssl.gstatic.com/gb
./ssl.gstatic.com/gb/images
./ssl.gstatic.com/gb/images/i2_2ec824b0.png
./ssl.gstatic.com/gb/images/a
./ssl.gstatic.com/gb/images/a/f5cdd88b65.png
./ssl.gstatic.com/gb/images/p1_8b13e09b.png
./ssl.gstatic.com/gb/images/p2_5972b4fd.png
./ssl.gstatic.com/gb/images/i1_1967ca6a.png
./www.google.com
./www.google.com/index.html
./www.google.com/images
./www.google.com/images/nav_logo231.png
./www.google.com/images/branding
./www.google.com/images/branding/googlelogo
./www.google.com/images/branding/googlelogo/2x
./www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png
./www.google.com/images/branding/product
./www.google.com/images/branding/product/ico
./www.google.com/images/branding/product/ico/googleg_lodp.ico
Now the same, but downloading compressed whatever is served compressed:
wget --user-agent "Mozilla/5.0 (Windows NT x.y; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0" \
  -e robots=off --header="accept-encoding: gzip" -p -H "www.google.com"
This still gives me only 52 kB, and a single file: index.html.
So, Accept-encoding seems to work, but only for the main file?
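That behavior matches the explanation quoted below: wget stores the gzip
stream verbatim, so when it scans the saved page for requisites it is
looking at binary data and finds no links. A small local illustration (made-up
file names, and just a naive byte scan, not wget's actual HTML parser):

```shell
# Build a small HTML file with a few links, then store it gzip-compressed,
# the way a gzip-encoded response ends up on disk.
seq 1 20 | sed 's|.*|<link rel="stylesheet" href="style.css">|' > page-plain.html
gzip -c page-plain.html > page.html

# A naive scan of the stored bytes finds no href in the gzip stream...
grep -q 'href=' page.html || echo "no links visible in the compressed file"

# ...but the references reappear once the stream is decompressed.
zcat page.html | grep -q 'href=' && echo "links visible after zcat"
```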
On Tue, Sep 22, 2015 at 3:51 PM, Ángel González <address@hidden> wrote:
> On 22/09/15 19:57, andreas wpv wrote:
>
>> Unfortunately this only pulls the html files (because where I pull them
>> they are compressed), and not all the other scripts and stylesheets and
>> so on, although at least a few of those are compressed too.
>>
> From wget's point of view, the "html" is a binary blob. It scans it
> looking for scripts/stylesheets and finds none.
>
>> Ideas, tips?
>>
> What about implementing gzip Accept-encoding into wget? :)
>
> Someone asked about doing it not so long ago, but it wasn't done.
>
>
> * That should actually save the pages uncompressed, but I assume you are
> more interested in downloading the contents compressed than in storing
> them compressed locally. Otherwise, you can download them with current
> wget and then run a script compressing everything.
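For completeness, the local-compression fallback mentioned in the quote
could be a one-liner. A sketch over a hypothetical download directory (no
network involved; the directory and file are stand-ins for wget's output
tree):

```shell
# Pretend "site/" is the tree a plain wget -p -H run produced.
mkdir -p site/www.google.com
printf '<html>hello</html>\n' > site/www.google.com/index.html

# Compress every downloaded file in place; gzip adds the .gz suffix.
find site -type f -exec gzip -9 {} +
find site -type f -name '*.gz'
```

This stores everything compressed locally, but of course does nothing to
reduce the bytes transferred over the network, which was the original goal.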