bug-wget
srcset lists are corrupted when converting links


From: Dan Ellis
Subject: srcset lists are corrupted when converting links
Date: Tue, 29 Dec 2020 11:43:59 -0500

I'm using wget to make a frozen, offline mirror of a wordpress.com site.
The original HTML makes extensive use of <img srcset=...> (responsive
design for different browser resolutions). wget is corrupting the
comma-separated lists of images.

e.g.


  wget --page-requisites --span-hosts https://theliteratelens.com/

downloads a set of files including theliteratelens.com/index.html which
includes the following element as the first instance of srcset (line breaks
inserted by me and irrelevant fields omitted):

<img width="350" height="248"
 src="
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
"
 class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
 alt=""
 loading="lazy"
 srcset="
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
350w,
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&amp;h=106&amp;crop=1
150w,
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&amp;h=212&amp;crop=1
300w"
 sizes="(max-width: 350px) 100vw, 350px"
... />

Note the srcset field, which references 3 versions of the image; their
decoded URL tails look like "realistfrontcover_small.jpg?w=150&h=106&crop=1"
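
To make the structure of that list explicit, here is a minimal Python sketch that splits a srcset value into (URL, descriptor) pairs and decodes the HTML entities. The function name parse_srcset is made up for this illustration, and splitting on every comma is a simplifying assumption that holds for these URLs but not for URLs that themselves contain commas.

```python
import html


def parse_srcset(value):
    """Split a srcset attribute value into (url, descriptor) pairs.

    Assumption: no candidate URL contains a comma, so a plain split on
    "," is enough. A general HTML parser has to be more careful.
    """
    candidates = []
    for part in value.split(","):
        fields = part.split()  # splits on spaces and newlines alike
        if not fields:
            continue  # ignore empty pieces, e.g. after a trailing comma
        url = html.unescape(fields[0])  # "&amp;" becomes "&"
        descriptor = fields[1] if len(fields) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


if __name__ == "__main__":
    base = ("https://theliteratelens.files.wordpress.com/2017/12/"
            "realistfrontcover_small.jpg")
    value = (base + "?w=350&amp;h=248&amp;crop=1 350w,\n"
             + base + "?w=150&amp;h=106&amp;crop=1 150w,\n"
             + base + "?w=300&amp;h=212&amp;crop=1 300w")
    for url, descriptor in parse_srcset(value):
        print(descriptor, url)
```

Each candidate is a URL followed by a width descriptor, so any link conversion has to rewrite the URL and leave the descriptor and the separating comma alone.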

However, if I add --convert-links, e.g.

  wget --page-requisites --span-hosts --convert-links https://theliteratelens.com/

the same element in theliteratelens.com/index.html becomes:

<img width="350" height="248"
 src="../
theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1
"
 class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
 alt=""
 loading="lazy"
 srcset="../
theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&amp;h=248&amp;crop=1p;crop=../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&amp;h=106&amp;crop=1h=106&a../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&amp;h=212&amp;crop=1300&amp;h=212&amp;crop=1
300w"
 sizes="(max-width: 350px) 100vw, 350px"
... />

i.e. the comma-separated list in the srcset has been badly corrupted.  For
instance, the end of the first path, which was originally

  ...h=248&amp;crop=1 350w, https://theliteratelens.files.wordpress.com/2017/12...

becomes

  ...h=248&amp;crop=1p;crop=../theliteratelens.files.wordpress.com/2017/12...

and the second boundary between elements starts as

  ...h=106&amp;crop=1 150w, https://theliteratelens.files...

but ends up as

  ...h=106&amp;crop=1h=106&a../theliteratelens.files...

What seems to be happening is that the convert-links logic finds the
absolute URLs on the second host
(https://theliteratelens.files.wordpress.com) and correctly maps them to
relative paths (../theliteratelens.files.wordpress.com/), but at the same
time it reaches back one space-delimited field too far, and overwrites the
preceding width descriptor and comma (e.g. " 350w,") with a spurious
fragment of the preceding URL.
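
For comparison, here is a minimal Python sketch of the conversion I would expect. It is not wget's code (wget is written in C) and says nothing about where the actual bug lies; it only illustrates replacing exactly the character span of each URL. The names to_local and convert_srcset are hypothetical, and the mapping rule (strip "https://" and prepend "../") is hard-coded to match this one example.

```python
import re

PREFIX = "https://"


def to_local(url):
    """Hypothetical mapping for this example: absolute URL -> ../host/path."""
    if url.startswith(PREFIX):
        return "../" + url[len(PREFIX):]
    return url


def convert_srcset(value):
    """Rewrite only the URL of each srcset candidate.

    Assumption: URLs contain no commas or whitespace, so a candidate URL
    is the first such run at the start of the value or after a comma.
    """
    out = []
    pos = 0
    for match in re.finditer(r"(?:^|,)\s*([^\s,]+)", value):
        start, end = match.span(1)  # exact span of the URL, nothing more
        out.append(value[pos:start])  # descriptors and commas, copied as-is
        out.append(to_local(value[start:end]))
        pos = end
    out.append(value[pos:])  # whatever follows the last URL
    return "".join(out)


if __name__ == "__main__":
    tail = ("theliteratelens.files.wordpress.com/2017/12/"
            "realistfrontcover_small.jpg")
    value = ("https://" + tail + "?w=350&amp;h=248&amp;crop=1 350w, "
             "https://" + tail + "?w=150&amp;h=106&amp;crop=1 150w, "
             "https://" + tail + "?w=300&amp;h=212&amp;crop=1 300w")
    print(convert_srcset(value))
```

Because everything outside each URL span is copied through unchanged, the " 350w, " and " 150w, " separators survive the conversion.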

I hope this helps identify the problem.

  Dan.

