srcset lists are corrupted when converting links
From: Dan Ellis
Subject: srcset lists are corrupted when converting links
Date: Tue, 29 Dec 2020 11:43:59 -0500
I'm using wget to make a frozen, offline mirror of a wordpress.com site.
The original HTML makes extensive use of <img srcset=...> (responsive
design for different browser resolutions). wget is corrupting the
comma-separated lists of images.
e.g.
wget --page-requisites --span-hosts https://theliteratelens.com/
downloads a set of files including theliteratelens.com/index.html which
includes the following element as the first instance of srcset (line breaks
inserted by me and irrelevant fields omitted):
<img width="350" height="248"
src="
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1
"
class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
alt=""
loading="lazy"
srcset="
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1
350w,
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&h=106&crop=1
150w,
https://theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&h=212&crop=1
300w"
sizes="(max-width: 350px) 100vw, 350px"
... />
Note that the srcset field references 3 versions of the image, whose decoded
URL tails look like "realistfrontcover_small.jpg?w=150&h=106&crop=1".
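To make the structure of that field explicit, here is a minimal Python sketch that splits a srcset value into (URL, descriptor) pairs. The helper name parse_srcset is my own, and the splitting is deliberately simplified: it assumes no URL contains a comma, which holds for this page but is not guaranteed in general.

```python
def parse_srcset(value):
    """Split a srcset value into (url, descriptor) pairs.

    Simplified sketch: assumes no URL contains a comma, so a plain
    split on "," is enough to separate the image candidates.
    """
    candidates = []
    for part in value.split(","):
        fields = part.split()
        if not fields:
            continue  # skip empty pieces, e.g. after a trailing comma
        url = fields[0]
        descriptor = fields[1] if len(fields) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


# The srcset value from the element above, rebuilt without line breaks.
BASE = ("https://theliteratelens.files.wordpress.com"
        "/2017/12/realistfrontcover_small.jpg")
srcset = (
    BASE + "?w=350&h=248&crop=1 350w, "
    + BASE + "?w=150&h=106&crop=1 150w, "
    + BASE + "?w=300&h=212&crop=1 300w"
)

for url, descriptor in parse_srcset(srcset):
    print(descriptor, url)
```

Each candidate is one URL followed by one width descriptor, so any link conversion should touch only the URL part of each pair.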
However, if I add --convert-links, e.g.
wget --page-requisites --span-hosts --convert-links https://theliteratelens.com/
the same element in theliteratelens.com/index.html becomes:
<img width="350" height="248"
src="../
theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1
"
class="attachment-suburbia-sticky size-suburbia-sticky wp-post-image"
alt=""
loading="lazy"
srcset="../
theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=350&h=248&crop=1p;crop=../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=150&h=106&crop=1h=106&a../theliteratelens.files.wordpress.com/2017/12/realistfrontcover_small.jpg?w=300&h=212&crop=1300&h=212&crop=1
300w"
sizes="(max-width: 350px) 100vw, 350px"
... />
i.e. the comma-separated list in the srcset has been badly corrupted. For
instance, the end of the first path, which was originally
...h=248&crop=1 350w, https://theliteratelens.files.wordpress.com/2017/12...
becomes
...h=248&crop=1p;crop=../theliteratelens.files.wordpress.com/2017/12...
and the second boundary between elements starts as
...h=106&crop=1 150w, https://theliteratelens.files...
but ends up as
...h=106&crop=1h=106&a../theliteratelens.files...
What seems to be happening is that the convert-links logic finds the
absolute URLs to the second host (https://theliteratelens.files.wordpress.com)
and correctly maps them to relative paths
(../theliteratelens.files.wordpress.com/), but at the same time it reaches
back one space-delimited field too far: the preceding width descriptor and
comma (e.g. " 350w, ") are overwritten with a spurious fragment copied from
the preceding URL.
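For comparison, here is a minimal Python sketch of the behaviour I would expect. It is not wget's code and makes no claim about how wget is implemented. The functions to_relative and convert_srcset are hypothetical names, and to_relative is only a stand-in for the real link mapping. The idea is to record the exact character span of each URL and replace only that span, working from the last URL to the first so that earlier offsets stay valid. Like any comma-based split, it assumes no URL contains a comma.

```python
import re


def to_relative(url):
    # Hypothetical stand-in for the real link mapping: drop the scheme
    # and prefix "../", as in the converted src attribute above.
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            return "../" + url[len(scheme):]
    return url


def convert_srcset(value):
    """Rewrite each URL in a srcset value, leaving descriptors intact."""
    spans = []
    # Each comma-separated candidate is "<url> <descriptor>".
    for candidate in re.finditer(r"[^,]+", value):
        url = re.search(r"\S+", candidate.group(0))
        if url is None:
            continue  # whitespace-only piece, nothing to rewrite
        # Offsets of the URL alone, relative to the whole attribute value.
        spans.append((candidate.start() + url.start(),
                      candidate.start() + url.end()))
    # Replace from the end so earlier offsets are not shifted.
    for start, end in reversed(spans):
        value = value[:start] + to_relative(value[start:end]) + value[end:]
    return value


BASE = ("theliteratelens.files.wordpress.com"
        "/2017/12/realistfrontcover_small.jpg")
original = (
    "https://" + BASE + "?w=350&h=248&crop=1 350w, "
    "https://" + BASE + "?w=150&h=106&crop=1 150w, "
    "https://" + BASE + "?w=300&h=212&crop=1 300w"
)
print(convert_srcset(original))
```

With this approach the " 350w, " and " 150w, " separators survive unchanged, because no replacement ever extends outside the span of a URL.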
I hope this helps identify the problem.
Dan.