bug-wget

From: Elmar Stellnberger
Subject: wget -p shall honor -H (isn't used unless -r is given)
Date: Wed, 13 Sep 2023 00:14:35 +0200
User-agent: Mutt (Linux i686)

Hi to all!

  Today I wanted to download the following web page in order to archive
it:
https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur

  The following command line did not do what I wanted:
wget -p -N -H -D esquire.de --tries=10 
https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur

  The following seemed to work:
wget -p -r -N -H -D esquire.de --exclude-domains www.esquire.de --tries=10 
https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
Files downloaded:
now/static.esquire.de/1200x630/smart/images/2023-08/gettyimages-1391653079.jpg
now/www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
dld.log:
...
BEENDET --2023-09-12 23:18:01--
Verstrichene Zeit: 1,2s
Geholt: 2 Dateien, 246K in 0,07s (3,62 MB/s)
i.e. it says "two files fetched, no error"
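
  (Assuming -H really is honored only together with -r, as the subject line
suggests, a variant that keeps the recursion shallow instead of excluding
www.esquire.de might be:
wget -p -r -l 1 -N -H -D esquire.de --tries=10 https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
I have not verified that -l 1 is sufficient here; it is only a sketch of
what I would expect to work.)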

  Without -r & --exclude-domains it downloaded 52 files (most of them
.js), all from www.esquire.de and none from static.esquire.de. Finally I
succeeded in downloading the images I wanted with the following (starting
from the second file, as I had already downloaded the first one manually):
grep -o "https://static.esquire.de/[^ ]*\.jpg" schoenste-wasserfaelle-welt-natur.html \
  | sed -n '2,500p' | while read -r line; do wget -p "$line"; done
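
  In hindsight the while loop is not even needed, since wget can read the
URL list directly from standard input with -i -; e.g. (same assumptions as
above, not re-tested against this page):
grep -o "https://static.esquire.de/[^ ]*\.jpg" schoenste-wasserfaelle-welt-natur.html | sed -n '2,500p' | wget -p -i -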

  It might (theoretically) be a bug in wget 1.21.4 (1.mga9, i.e. Mageia 9
i686) that it did not download more than two files on the second attempt,
though whoever wants to assume so may just as well put that down to a
fallacy of publicly available silicon.

  BTW: 'wpdld' is my scriptlet for archiving the web pages I read. For the
pages it works on (using wget) I prefer it over a Firefox save-page, as it
keeps the web page more or less in its pristine state, so that it can be
mirrored like at the Wayback Machine, if necessary. Not saving to disk what
I read is something I have experienced to be nasty, because not every news
article is kept online forever, or it may simply be deleted from the
indexes of search engines (and from on-page searches).
I also have 'wpv' for viewing, but alas it is not yet ready for multi-domain
or non-relative links - by the way, what about a make-relative feature for
already downloaded web pages on disk in wget2? That would be my wish, as I
prefer to download with non-relative links, and converting them on disk
afterwards still allows a 'dircmp' (another self-written program to compare
(and sync) directories, which I have been using more or less since 2008).
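
  For comparison, at download time wget already offers -k / --convert-links,
which rewrites the links of the fetched documents for local viewing, e.g.:
wget -p -k -H -D esquire.de https://www.esquire.de/life/reisen/schoenste-wasserfaelle-welt-natur
What I am asking for would be the same kind of conversion applied as a
separate step to pages already sitting on disk (I have not checked whether
wget2 offers anything like that already).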

Regards,
Elmar Stellnberger

Attachment: wpdld
Description: Text document

