From: MB2613
Subject: [Bug-wget] SOLVED! RE: Wget results VERY different from browser save
Date: Wed, 2 Jan 2019 16:09:12 -0600

I got the following private reply:

=================================================
Sent: Wednesday, January 2, 2019 13:17
Subject: Re: [Bug-wget] Wget results VERY different from browser save

Sorry to go offlist.... Not my area of expertise....

Perhaps try changing the User Agent string. Some sites serve different content, depending on the UA.

And this assumes you are using clean room downloads. I.e., Firefox is not sending a cookie from a previous visit. If Firefox is not clean room, then you might consider opening a "New Private Browsing Windows" (or whatever it is called).
=================================================

I don't understand the part about clean-room downloads, but I tried the user agent idea, and it worked. First I tried it with no agent, then simulating Firefox. The results are interesting. Details are below and in the attached files. Here's a summary:

                             Size of     Number of
                             main file   additional files   Result compared to original in Firefox
Firefox Save as, complete    218 kb       5                 Same page layout. No logo or apps selector grid.
Wget, user agent default      12 kb       3                 Different page layout.
Wget, user agent none         46 kb       6                 Another different page layout. Black header.
Wget, user agent Firefox     215 kb      12                 Identical to original!

So this is really interesting. Google sends out at least three different versions of their homepage to clients identifying themselves as different browsers. With the user agent simulating Firefox, the main file obtained is almost as large as what Firefox gets by "Save as, complete," and there are more than twice as many additional files. Of the five additional files obtained in the Firefox save, only one of them is the same as any of those obtained by Wget, and it's the logo obtained with the Firefox user agent, the logo which doesn't work in the Firefox save.

In addition to getting a working local file for the page, I like that Wget also gives me the timestamps of the downloaded files.

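The three Wget runs in this thread differ mainly in the --user-agent option. As a rough sketch (not part of the original message), the Python helper below builds the argument list for each variant using only options that appear in the commands quoted in this thread. The function name build_wget_args and the output folder names are made up for illustration. Per the Wget manual, an empty --user-agent value tells Wget not to send a User-Agent header at all, while omitting the option leaves Wget's default in place.

```python
# Firefox 64 user-agent string copied from the command quoted in this thread.
FIREFOX_UA = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) "
              "Gecko/20100101 Firefox/64.0")


def build_wget_args(url, out_dir, user_agent=None):
    """Return a wget argument list (hypothetical helper, not a Wget API).

    user_agent=None -> option omitted, Wget sends its default User-Agent
    user_agent=""   -> --user-agent= (no User-Agent header is sent)
    """
    args = [
        "wget",
        "--no-directories",
        f"--directory-prefix={out_dir}",
        "--adjust-extension",
        "--convert-links",
        "--backup-converted",
        "--page-requisites",
        "--span-hosts",
    ]
    if user_agent is not None:
        args.append(f"--user-agent={user_agent}")
    args.append(url)
    return args


# Print the three variants; nothing is downloaded here.
for label, ua in [("default", None), ("none", ""), ("firefox", FIREFOX_UA)]:
    print(label, build_wget_args("http://www.Google.com", f"download/{label}", ua))
```

To actually run one variant, the list can be handed to subprocess.run(args, check=True); passing a list avoids any shell quoting problems with the spaces and parentheses in the Firefox string.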
Thank you to the anonymous respondent for the solution, even though it's outside his area of expertise!

So now I'm back to my original question. How to get the additional files in a subfolder, separate from the main file?

=================================================
Details:
=================================================
wget --append-output=download/Wget_Google3.log --show-progress --no-directories --directory-prefix=download/Google3 --adjust-extension --user-agent="" --convert-links --backup-converted --page-requisites --span-hosts http://www.Google.com

Contents of folder "download\Google3":
2016 04 20  21:17     4,682 b8_3615d64d.png
2016 04 20  21:17     9,760 b_8d5afc09.png
2016 12 07  19:00     5,482 googlelogo_white_background_color_272x92dp.png
2019 01 02  14:50    46,176 index.html
2019 01 02  14:50    46,162 index.html.orig
2016 12 16  06:30    12,263 nav_logo229.png
2018 12 31  03:15     2,205 robots.txt
2018 11 16  04:00     6,913 robots.txt.1
            8 File(s) 133,643 bytes

Wget log: See attached "Wget_Google_UserAgentNone.log".
Result of save as viewed in Firefox: See attached "Google from Wget_UserAgentNone.png".
=================================================
=================================================
wget --append-output=download/Wget_Google4.log --show-progress --no-directories --directory-prefix=download/Google4 --adjust-extension --user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0" --convert-links --backup-converted --page-requisites --span-hosts http://www.Google.com

Contents of folder "download\Google4":
2016 04 20  21:17       381 f5cdd88b65.png
2016 12 07  19:00    13,504 googlelogo_color_272x92dp.png
2016 12 07  19:00     5,969 googlelogo_color_272x92dp.png.1
2016 12 12  08:45     7,325 i1_1967ca6a.png
2016 12 12  08:45    24,211 i2_2ec824b0.png
2019 01 02  14:54   220,125 index.html
2019 01 02  14:54   220,432 index.html.orig
2016 12 14  14:30    16,786 nav_logo242.png
2018 11 27  08:45    55,033 p1_254618290.png
2018 11 27  08:45   124,778 p2_63dfd6b10.png
2018 11 16  04:00     6,913 robots.txt
2018 12 31  03:15     2,205 robots.txt.1
2016 04 20  21:17     1,747 silhouette_96.png
2016 12 07  19:00       185 wavy-underline.png
           14 File(s) 699,594 bytes

Wget log: See attached "Wget_Google_UserAgentFirefox.log".
Result of save as viewed in Firefox: See attached "Google from Wget_UserAgentFirefox.png".
=================================================

From: address@hidden
Sent: Wednesday, January 2, 2019 12:30
To: 'address@hidden'
Subject: Wget results VERY different from browser save

On closer inspection, I've found that the results from Wget and Firefox are very different. Neither is perfect, but the Wget results are definitely wrong.
Here are the results from both:

=================================================
wget --append-output=Wget_Google.log --show-progress --no-directories --adjust-extension --directory-prefix=download/Google2 --convert-links --backup-converted --page-requisites --span-hosts http://www.Google.com

Contents of folder "download\Google2":
2016 12 07  19:00     5,482 googlelogo_white_background_color_272x92dp.png
2019 01 02  11:10    11,587 index.html
2019 01 02  11:10    11,437 index.html.orig
2016 12 16  06:30    12,263 nav_logo229.png
2018 11 16  04:00     6,913 robots.txt
            5 File(s) 47,682 bytes

Wget log: See attached "Wget_Google.log".
Result of save as viewed in Firefox: See attached "Google from Wget.png".
=================================================
Firefox at https://www.google.com/
File > Save Page As > Save as type: Web Page, complete

Contents of folder:
2019 01 02  11:15   222,403 Google2.htm
2019 01 02  11:15   <DIR>   Google2_files

Contents of subfolder "Google2_files":
2019 01 02  11:15   140,084 cbgapi.loaded_0
2019 01 02  11:15    13,504 googlelogo_color_272x92dp.png
2019 01 02  11:15    85,565 msb_wizaaabdasyncdvlfootiflipv6lummusfxz7cCd
2019 01 02  11:15   140,913 rsAA2YrTv-X7m9A6GmnfpSsKdPIfvIYg06ZQ
2019 01 02  11:15   403,380 rsACT90oGMg6Rr6Oa277nSkJoiMyEfVXOeOQ
            5 File(s) 783,446 bytes

Result of save as viewed in Firefox: See attached "Google from Firefox.png".
=================================================
Actual appearance of the webpage: See attached "Google original.png".
=================================================

Observations:

* The main file saved by Firefox is 218 kb, that by Wget is only 12 kb.
* Firefox saves five additional files, Wget only three, and none of them even have the same filenames!
* Firefox gets the page layout right, including headers and footers, but for some reason doesn't show the logo. Wget looks like it downloaded a different page. The whole layout is different. But it got the logo right.

What do I need to do for Wget to get the page correctly?

Thank you.
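The observations above compare two numbers per download: the size of the main file and the count of additional files. A small Python sketch (not part of the original message) can produce both for a flat Wget output folder. The helper name summarize_download is hypothetical, and rounding the size up to whole units of 1024 bytes is an assumption, chosen because it reproduces the figures quoted in this thread (11,587 bytes shown as 12 kb). The .orig backup written by --backup-converted is not counted as an additional file, which matches the counts reported here.

```python
import math
from pathlib import Path


def summarize_download(folder, main_name="index.html"):
    """Summarize a flat Wget download folder (hypothetical helper).

    Returns the main file size in KB (rounded up, an assumption) and the
    names of all other files, ignoring the main file's .orig backup.
    """
    folder = Path(folder)
    main_file = folder / main_name
    skip = {main_name, main_name + ".orig"}
    extras = sorted(
        p.name for p in folder.iterdir() if p.is_file() and p.name not in skip
    )
    return {
        "main_kb": math.ceil(main_file.stat().st_size / 1024),
        "extra_count": len(extras),
        "extra_names": extras,
    }
```

Calling summarize_download on two folders and intersecting the extra_names lists would also show which additional files the two downloads have in common.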
=================================================

From: address@hidden [mailto:address@hidden]
Sent: Wednesday, January 2, 2019 04:50
To: 'address@hidden'
Subject: How to simulate "Save as webpage, complete"?

Hi, not a bug, but a question:

The command:

wget --no-directories --adjust-extension --directory-prefix _files --convert-links --page-requisites --span-hosts http://www.Google.com

saves the Google homepage as "index.html" along with associated files, all together in the folder "_files". The result works nicely, but what I want is for "index.html" to be in one folder and the associated files to be in a subfolder of that called "_files". This is what a browser does when one asks it to "save as webpage, complete." How do I simulate that behavior with Wget?

The manual entry for -P / --directory-prefix says "the directory prefix is the directory where all other files and subdirectories will be saved." Because of the word "other," I thought this would do what I want, but it didn't. It put all the files in the same directory, including "index.html".

I am using Wget, v. 1.20 as the Windows binary provided by Jernej Simončič at www.eternallybored.org/misc/wget/ and running it in a DOS window ("Command Prompt") of Windows 7.

Thanks for your help.

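The thread does not record a Wget option that puts the page requisites in a subfolder of their own, so one possible workaround is to post-process the flat folder after the download. The Python sketch below (not part of the original message) assumes the folder was produced with --no-directories and --convert-links, so that "index.html" refers to its requisites by bare file name. The helper name split_requisites is hypothetical. It only rewrites quoted references that match a moved file name exactly, so unquoted url(...) references or percent-encoded names would be left untouched.

```python
import re
import shutil
from pathlib import Path


def split_requisites(page_dir, main_name="index.html", sub_name="_files"):
    """Move everything except the main page into a subfolder and repoint
    quoted references in the main page (hypothetical helper)."""
    page_dir = Path(page_dir)
    main_file = page_dir / main_name
    sub_dir = page_dir / sub_name
    sub_dir.mkdir(exist_ok=True)

    moved = []
    for entry in sorted(page_dir.iterdir()):
        # Leave the main page, its .orig backup and any folders in place.
        if entry.is_dir() or entry.name in (main_name, main_name + ".orig"):
            continue
        shutil.move(str(entry), str(sub_dir / entry.name))
        moved.append(entry.name)

    # Work on raw bytes so the page's character encoding does not matter.
    html = main_file.read_bytes()
    prefix = sub_name.encode("utf-8") + b"/"
    for name in moved:
        raw = name.encode("utf-8")
        # Match "name" or 'name' as a complete quoted value.
        pattern = re.compile(rb"""(["'])""" + re.escape(raw) + rb"\1")
        html = pattern.sub(
            lambda m, raw=raw: m.group(1) + prefix + raw + m.group(1), html
        )
    main_file.write_bytes(html)
    return moved
```

For example, split_requisites("download/Google4") would leave "index.html" and "index.html.orig" where they are and move the remaining files into "download/Google4/_files". It is worth opening the page in a browser afterwards to check that nothing was missed.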
Wget_Google_UserAgentNone.log
Description: Binary data
Google from Wget_UserAgentNone.png
Description: PNG image
Wget_Google_UserAgentFirefox.log
Description: Binary data
Google from Wget_UserAgentFirefox.png
Description: PNG image