
[Bug-wget] SOLVED! RE: Wget results VERY different from browser save


From: MB2613
Subject: [Bug-wget] SOLVED! RE: Wget results VERY different from browser save
Date: Wed, 2 Jan 2019 16:09:12 -0600

I got the following private reply:

 

=================================================

Sent: Wednesday, January 2, 2019 13:17

Subject: Re: [Bug-wget] Wget results VERY different from browser save

 

Sorry to go offlist.... Not my area of expertise....

 

Perhaps try changing the User Agent string. Some sites serve different
content, depending on the UA.

 

And this assumes you are using clean room downloads. I.e., Firefox is not
sending a cookie from a previous visit. If Firefox is not clean room, then
you might consider opening a "New Private Browsing Window" (or whatever it
is called).

=================================================

 

I don't understand the part about clean-room downloads, but I tried the user
agent idea, and it worked. First I tried it with no agent, then simulating
Firefox. The results are interesting. Details are below and in the attached
files. Here's a summary:

 


 

                            Size of     Number of
Method                      main file   additional files   Result compared to original in Firefox
--------------------------  ----------  -----------------  ------------------------------------------------
Firefox Save as, complete   218 kb      5                  Same page layout. No logo or apps selector grid.
Wget, user agent default    12 kb       3                  Different page layout.
Wget, user agent none       46 kb       6                  Another different page layout. Black header.
Wget, user agent Firefox    215 kb      12                 Identical to original!

 

So this is really interesting. Google sends out at least three different
versions of its homepage, depending on how the client identifies itself.
With the user agent simulating Firefox, the main file Wget obtains is almost
as large as the one Firefox gets by "Save as, complete," and there are more
than twice as many additional files (12 versus 5). Of the five additional
files in the Firefox save, only one matches any of those obtained by Wget:
the logo obtained with the Firefox user agent, which is the same logo that
doesn't display in the Firefox save.
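For anyone who wants to repeat the user-agent comparison without Wget, here is a minimal sketch in Python using only the standard library. The names `FIREFOX_UA`, `build_request` and `fetch_size` are made up for illustration, and the Firefox string is the one from the Wget command in the details below. Note one difference from the Wget runs: when no user agent is passed, urllib sends its own default ("Python-urllib/x.y") rather than Wget's, so the sizes will not necessarily match the table above.

```python
import urllib.request

# Same Firefox 64 string as in the Wget command in this message.
FIREFOX_UA = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) "
              "Gecko/20100101 Firefox/64.0")


def build_request(url, user_agent=None):
    """Return a Request; set User-Agent only if one is given."""
    request = urllib.request.Request(url)
    if user_agent is not None:
        request.add_header("User-Agent", user_agent)
    return request


def fetch_size(url, user_agent=None):
    """Download url and return the size of the body in bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=30) as response:
        return len(response.read())
```

Calling `fetch_size("http://www.google.com/")` and `fetch_size("http://www.google.com/", FIREFOX_UA)` and comparing the two numbers should show whether the server is varying the page by user agent.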

 

In addition to getting a working local file for the page, I like that Wget
also gives me the timestamps of the downloaded files.

 

Thank you to the anonymous respondent for the solution, even though it's
outside his area of expertise!

 

So now I'm back to my original question: how do I get the additional files
into a subfolder, separate from the main file?
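Until someone points out a Wget option that does this directly, one workaround is to post-process the download folder. The following is only a rough sketch in Python, not anything Wget provides: `move_requisites` is a hypothetical helper, and it assumes that the converted "index.html" refers to the other files by bare file name in `src` and `href` attributes. References inside CSS `url(...)` or `srcset` are not handled.

```python
import re
import shutil
from pathlib import Path


def move_requisites(folder, main_name="index.html", sub_name="_files"):
    """Move everything except the main page (and its .orig backup) into
    folder/sub_name, then point bare src/href references at the new place."""
    folder = Path(folder)
    main_file = folder / main_name
    keep = {main_name, main_name + ".orig"}
    target = folder / sub_name
    target.mkdir(exist_ok=True)

    moved = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.name not in keep:
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)

    html = main_file.read_text(encoding="utf-8", errors="surrogateescape")
    for name in moved:
        # Match only attribute values that are exactly the bare file name.
        pattern = re.compile(
            r'((?:src|href)\s*=\s*["\'])' + re.escape(name) + r'(["\'])',
            re.IGNORECASE,
        )
        html = pattern.sub(
            lambda m, n=name: m.group(1) + sub_name + "/" + n + m.group(2),
            html,
        )
    main_file.write_text(html, encoding="utf-8", errors="surrogateescape")
    return moved
```

For example, `move_requisites("download/Google4")` would leave "index.html" and "index.html.orig" in place and move the other twelve files into "download/Google4/_files". It is worth running it on a copy of the folder first.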

 

=================================================

Details:

=================================================

wget --append-output=download/Wget_Google3.log --show-progress
--no-directories --directory-prefix=download/Google3 --adjust-extension
--user-agent="" --convert-links --backup-converted --page-requisites
--span-hosts http://www.Google.com

 

Contents of folder "download\Google3":

     2016 04 20  21:17             4,682      b8_3615d64d.png
     2016 04 20  21:17             9,760      b_8d5afc09.png
     2016 12 07  19:00             5,482      googlelogo_white_background_color_272x92dp.png
     2019 01 02  14:50            46,176      index.html
     2019 01 02  14:50            46,162      index.html.orig
     2016 12 16  06:30            12,263      nav_logo229.png
     2018 12 31  03:15             2,205      robots.txt
     2018 11 16  04:00             6,913      robots.txt.1
               8 File(s)        133,643 bytes

 

Wget log: See attached "Wget_Google_UserAgentNone.log".

Result of save as viewed in Firefox: See attached "Google from
Wget_UserAgentNone.png".

=================================================

=================================================

wget --append-output=download/Wget_Google4.log --show-progress
--no-directories --directory-prefix=download/Google4 --adjust-extension
--user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0)
Gecko/20100101 Firefox/64.0" --convert-links --backup-converted
--page-requisites --span-hosts http://www.Google.com

 

Contents of folder "download\Google4":

     2016 04 20  21:17               381      f5cdd88b65.png
     2016 12 07  19:00            13,504      googlelogo_color_272x92dp.png
     2016 12 07  19:00             5,969      googlelogo_color_272x92dp.png.1
     2016 12 12  08:45             7,325      i1_1967ca6a.png
     2016 12 12  08:45            24,211      i2_2ec824b0.png
     2019 01 02  14:54           220,125      index.html
     2019 01 02  14:54           220,432      index.html.orig
     2016 12 14  14:30            16,786      nav_logo242.png
     2018 11 27  08:45            55,033      p1_254618290.png
     2018 11 27  08:45           124,778      p2_63dfd6b10.png
     2018 11 16  04:00             6,913      robots.txt
     2018 12 31  03:15             2,205      robots.txt.1
     2016 04 20  21:17             1,747      silhouette_96.png
     2016 12 07  19:00               185      wavy-underline.png
              14 File(s)        699,594 bytes

 

Wget log: See attached "Wget_Google_UserAgentFirefox.log".

Result of save as viewed in Firefox: See attached "Google from
Wget_UserAgentFirefox.png".

=================================================

 

 

 

 

From: address@hidden
Sent: Wednesday, January 2, 2019 12:30
To: 'address@hidden'
Subject: Wget results VERY different from browser save

 

On closer inspection, I've found that the results from Wget and Firefox are
very different. Neither is perfect, but the Wget results are definitely
wrong. Here are the results from both:

 

=================================================

wget --append-output=Wget_Google.log --show-progress --no-directories
--adjust-extension --directory-prefix=download/Google2 --convert-links
--backup-converted --page-requisites --span-hosts http://www.Google.com

 

Contents of folder "download\Google2":

2016 12 07  19:00             5,482      googlelogo_white_background_color_272x92dp.png
2019 01 02  11:10            11,587      index.html
2019 01 02  11:10            11,437      index.html.orig
2016 12 16  06:30            12,263      nav_logo229.png
2018 11 16  04:00             6,913      robots.txt
               5 File(s)         47,682 bytes

 

Wget log: See attached "Wget_Google.log".

Result of save as viewed in Firefox: See attached "Google from Wget.png".

=================================================

Firefox at https://www.google.com/

File > Save Page As > Save as type: Web Page, complete

 

Contents of folder:

2019 01 02  11:15           222,403      Google2.htm
2019 01 02  11:15    <DIR>               Google2_files

Contents of subfolder "Google2_files":

2019 01 02  11:15           140,084      cbgapi.loaded_0
2019 01 02  11:15            13,504      googlelogo_color_272x92dp.png
2019 01 02  11:15            85,565      msb_wizaaabdasyncdvlfootiflipv6lummusfxz7cCd
2019 01 02  11:15           140,913      rsAA2YrTv-X7m9A6GmnfpSsKdPIfvIYg06ZQ
2019 01 02  11:15           403,380      rsACT90oGMg6Rr6Oa277nSkJoiMyEfVXOeOQ
               5 File(s)        783,446 bytes

 

Result of save as viewed in Firefox: See attached "Google from Firefox.png".

=================================================

Actual appearance of the webpage: See attached "Google original.png".

=================================================

 

Observations:

  * The main file saved by Firefox is 218 kb, that by Wget is only 12 kb.

  * Firefox saves five additional files, Wget only three, and none of them
    even have the same filenames!

  * Firefox gets the page layout right, including headers and footers, but
    for some reason doesn't show the logo. Wget looks like it downloaded a
    different page. The whole layout is different. But it got the logo right.
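The file-by-file comparison behind these observations can be automated. Below is a minimal sketch in Python, with made-up helper names (`file_digests`, `compare_folders`), that reports which files in two folders share a name and which share identical contents. It looks only at the top level of each folder, so the Firefox "Google2_files" subfolder would be passed in directly.

```python
import hashlib
from pathlib import Path


def file_digests(folder):
    """Map each file name in folder (top level only) to a SHA-256 digest."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in Path(folder).iterdir()
        if path.is_file()
    }


def compare_folders(folder_a, folder_b):
    """Return (names present in both, pairs of files with identical bytes)."""
    digests_a = file_digests(folder_a)
    digests_b = file_digests(folder_b)
    same_name = sorted(set(digests_a) & set(digests_b))
    same_content = sorted(
        (name_a, name_b)
        for name_a, hash_a in digests_a.items()
        for name_b, hash_b in digests_b.items()
        if hash_a == hash_b
    )
    return same_name, same_content
```

For example, `compare_folders("download/Google2", "Google2_files")` returns the shared names first and then the pairs of byte-identical files, which also catches files that are the same but were saved under different names.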

 

What do I need to do for Wget to get the page correctly?

 

Thank you.

 

=================================================

 

 

 

 

From: address@hidden [mailto:address@hidden]
Sent: Wednesday, January 2, 2019 04:50
To: 'address@hidden'
Subject: How to simulate "Save as webpage, complete"?

 

Hi, not a bug, but a question:

 

The command:

wget --no-directories --adjust-extension --directory-prefix _files
--convert-links --page-requisites --span-hosts http://www.Google.com

 

saves the Google homepage as "index.html" along with associated files, all
together in the folder "_files". The result works nicely, but what I want is
for "index.html" to be in one folder and the associated files to be in a
subfolder of that called "_files". This is what a browser does when one asks
it to "save as webpage, complete." How do I simulate that behavior with
Wget?

 

The manual entry for -P / --directory-prefix says "the directory prefix is
the directory where all other files and subdirectories will be saved."
Because of the word "other," I thought this would do what I want, but it
didn't. It put all the files in the same directory, including "index.html".

 

I am using Wget, v. 1.20 as the Windows binary provided by Jernej Simončič
at www.eternallybored.org/misc/wget/ and running it in a DOS window
("Command Prompt") of Windows 7.

 

Thanks for your help.

 

Attachment: Wget_Google_UserAgentNone.log
Description: Binary data

Attachment: Google from Wget_UserAgentNone.png
Description: PNG image

Attachment: Wget_Google_UserAgentFirefox.log
Description: Binary data

Attachment: Google from Wget_UserAgentFirefox.png
Description: PNG image

