bug-wget

Re: wget -p shall honor -H (isn't used unless -r is given)


From: Elmar Stellnberger
Subject: Re: wget -p shall honor -H (isn't used unless -r is given)
Date: Mon, 9 Oct 2023 11:11:58 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0

Hi Tim

No, the images are plain <img src=...> elements that are displayed as part of the web page without any JavaScript; otherwise the grep -o | sed extraction would not have worked. The point is that the images are not downloaded by a usual invocation because they live on a different host, static.esquire.de instead of www.esquire.de. Note that for wget, -D has no effect as long as you do not also specify -H. And -r should not be needed for a wget -p, as these are two different semantics: downloading a single web page versus recursively downloading a sub-directory or one or more whole domains.
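
For clarity, this is the kind of invocation that should work for -p without -r if -p honored -H/-D (a sketch of the requested behaviour; as reported further down in the thread, current wget does not fetch the static.esquire.de images this way):
wget -p -H -D esquire.de https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur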

  The problem with:
wget -p -r -l 1 -N -H -D static.esquire.de https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
... is that it only downloads the directly referenced .html page from www.esquire.de and no supporting .css or .js files from www.esquire.de. Note also that when -H isn't given, it won't even download a single file from static.esquire.de. That is, a -D different.domain without an additional -H results in -p and -r being disregarded.

  Also:
wget -p -r -l 1 -N -H -D esquire.de --exclude-domains www.esquire.de https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
does nothing different with wget 1.21.3 (in the former email I had used a newer version of wget).

That is, apparently you first need a wget -p run for www.esquire.de and then a second run with -r to download the images from static.esquire.de. Not ideal. Forget the screen output I posted in my last email for the grep -o/sed download of the static.esquire.de images; wget must not behave like that. I do not consider it a bug of wget itself, but rather wrong behaviour exposed by the binary/CPU (see also: https://www.elstel.org/uni/, SAT-solver master thesis, Epilogue, from point 6 onwards, as well as countless other examples).
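
For reference, the two-run workaround described above would look roughly like this (a sketch only, not a verified recipe; whether the second run actually picks up all the images via -H/-D is exactly what is in question in this thread):
wget -p -N https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
wget -r -l 1 -N -H -D static.esquire.de https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur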

Giving wget2 a try will be interesting, though I'd personally consider wget -p to be a standard feature, and I'd like to see it work there as well when a page pulls its content from more than one domain, which currently has to be spelled out with -H -D xy.tld. The "-D xy.tld" part is needed if you don't want to download JavaScript from Google, although downloading that may also be of interest whenever you intend to view the page entirely offline afterwards. See my last posting "wget --page-requisites/-p should download as for a web browser", which is about pretty much the same issue:
https://lists.gnu.org/archive/html/bug-wget/2023-10/msg00008.html
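
As an illustration of the kind of invocation meant here (a sketch; the domain list is an example, and it presumes -p honoring -H/-D as argued for in this thread):
wget -p -H -D www.esquire.de,static.esquire.de https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
This would keep the page requisites from both esquire.de hosts while skipping third-party scripts such as the ones served by Google.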

To me a wget[2] -r -D without -H would be an interesting way to say that additional domains shall be considered, but without recursively fetching content from them (useful for -p as well as -r, as long as you don't download recursively from two or more domains in a single invocation). That way the first example in this email would make sense even when -H is not given. Basically, if you implement a feature like this, you could make it work for multi-domain recursive downloads as well, i.e. have separate options for the two behaviours, something like -H xy.tld,wz.tld -D add-non-recursive.tld (see the sketch below). With such a syntax the parameter of -H must not start with a minus, and you would need to write wget -H -- https://... (as I usually do anyway), here to spare wget from searching for :// to tell the URL apart from the domain list.
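
A purely hypothetical sketch of such separate options (this syntax does not exist in current wget or wget2; the domains are placeholders):
wget -p -r -l 1 -H www.esquire.de -D static.esquire.de -- https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
Here -H would name the domains to recurse into and -D the additional domains from which page requisites may be fetched non-recursively, with -- separating the URL as mentioned above.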

Regards,
Elmar


On 08.10.23 at 19:48, Tim Rühsen wrote:
Hey Elmar,

did you try the following?

wget2 -p -r -l 1 -N -D static.esquire.de https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur

It downloads 94 files, 44 of which are .jpg files under static.esquire.de/.

TBH, I am not 100% sure what you are trying to do, so excuse me if I am off track. The -p option is for downloading the files you need for displaying a page (e.g. inlined images). If the images are just links, they are not downloaded by -p. In this case, -r -l 1 helps. If the images displayed in the browser are downloaded/displayed by JavaScript, wget/wget2 won't help you.

Regards, Tim

On 9/13/23 00:14, Elmar Stellnberger wrote:
Hi to all!

   Today I wanted to download the following web page for archiving
purposes:
https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur

   The following command line did not do what I wanted:
wget -p -N -H -D esquire.de --tries=10 https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur

   The following seemed to do it:
wget -p -r -N -H -D esquire.de --exclude-domains www.esquire.de --tries=10 https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
: files downloaded:
now/static.esquire.de/1200x630/smart/images/2023-08/gettyimages-1391653079.jpg
now/www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
: dld.log:
...
BEENDET --2023-09-12 23:18:01--
Verstrichene Zeit: 1,2s
Geholt: 2 Dateien, 246K in 0,07s (3,62 MB/s)
i.e. wget's German-locale output for "two files fetched, no error"

   Without -r & --exclude-domains it downloaded 52 files (most of them
.js), all from www.esquire.de and none from static.esquire.de. Finally I
succeeded in downloading the desired images with the following (starting
from the second file, as I had downloaded the first one manually):
grep -o "https://static.esquire.de/[^ ]*\.jpg" schoenste-wasserfaelle-welt-natur.html | sed -n '2,500p' | while read line; do wget -p "$line"; done
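
Roughly equivalent (a sketch, assuming the same saved HTML file name) would be to feed the extracted URLs to a single wget process via -i -, skipping the first match with tail:
grep -o "https://static.esquire.de/[^ ]*\.jpg" schoenste-wasserfaelle-welt-natur.html | tail -n +2 | wget -N -i -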

   It might (theoretically) be a bug of wget 1.21.4 (1.mga9, i.e. Mageia 9
i686) that it did not download more than two files on the second attempt,
though whoever wants to may just as well put that down to a fallacy of the
publicly available silicon.

   BTW: 'wpdld' is my scriptlet for archiving the web pages I read. For
the pages it works on (using wget) I prefer it over a Firefox save-page,
as it keeps the web page more or less in pristine state, so that it could
be mirrored like at the Wayback Machine if necessary. Not saving to disk
what I read is something I have experienced can be nasty, because not
every news article is kept online forever, or it may simply get deleted
from the indexes of search engines (and from on-page searches). I also
have 'wpv' for viewing, but alas that isn't ready for multi-domain pages
or non-relative links. So, what about a make-relative feature for already
downloaded web pages on disk in wget2? That would be my wish, as I prefer
to download with non-relative links, and converting on disk afterwards
allows a 'dircmp' (another self-written program to compare (and sync)
directories, in use more or less since 2008).

Regards,
Elmar Stellnberger



