Re: [Bug-wget] wget 1.18-5+deb9u1 with --hsts -E -k fails


From: Darshit Shah
Subject: Re: [Bug-wget] wget 1.18-5+deb9u1 with --hsts -E -k fails
Date: Wed, 25 Apr 2018 08:31:24 +0200
User-agent: NeoMutt/20180323

Hi Karl,

Thanks! That's a pretty detailed bug report. Definitely helpful :)

1. Yes, I'm aware of the issue. Even 1.20 is not listed on it. I'll get right
   on fixing that ASAP. 

2. and 3. These are a little more involved than I have the time to look into
   for now. They do seem to be caused by the web server having autoindex
   pages. The situation with HSTS is even stranger. I will take a look at
   the specifics, most likely next week, and try to come up with either an
   explanation or a solution.

* Karl O. Pinc <address@hidden> [180419 00:19]:
> Hello,
> 
> This is a bad bug report.  Sorry.  I'm thinking that
> you'd rather hear _something_ than nothing.
> 
> I'm using wget 1.18-5+deb9u1, which is 1.18
> on Debian Stretch (9.4).
> 
> I can't say I'm certain that there is even a bug,
> although there is a functionality problem at some
> level.
> 
> 3 different problems, from simplest and most trivial
> to more complex.
> 
> 1) The wget Savannah page makes no mention of
> version 1.19 in the news section.
> 
> 2) The situation described below (with --adjust-extension)
> produces a "doc/guide" directory and a "doc/guide.1.html"
> file.  It would be nice if the file were instead named
> "doc/guide.html", without the ".1".  (There is no such
> file.)
> 
> 3)  I am mirroring a site where the url paths ending
> in "/" deliver pages, but there are additional, longer,
> urls which extend these urls.  So --adjust-extension (-E)
> is required so that wget can write an "index file",
> ending in ".html", and create directories to hold
> additional content.
> 
> I am also using --convert-links (-k) so as to have relative
> links in the downloaded material.
> 
> The problem is that when I use --hsts I get (sometimes,
> but consistently for particular urls)
> a "foo/" directory, a "foo.1.html" file containing
> some converted links, and a "foo.html" file without
> converted links.  FYI, "foo" is downloaded by linking
> "upwards" in the url path from the targeted url to
> mirror.  The downloaded, --convert-ed, material contains
> some links to "foo.1.html" and some to "foo.html".
> 
> When using --no-hsts I get 301 (Moved Permanently) redirects
> from the mirrored site to https pages (and it seems
> in this particular case https pages on the target
> top-level domain).  I then have no problems with
> --convert-ed data.
> 
> With --hsts I get some pages on other sub-domains
> of the target domain, FYI.  This is not obviously
> related to the problem.
> 
> Now, for the specifics.  Apologies that the
> example is not clean and the site it hits may change
> in ways that make the problem not reproducible.
> 
> The goal is to mirror the Yii 1.1 reference documentation
> and user guide.  The command which "works" is:
> 
> wget --no-hsts --directory-prefix mirror --timestamping -F \
>   --no-remove-listing --domains=www.yiiframework.com,yiiframework.com \
>   --regex-type=pcre \
>   --reject-regex='^https?://www\.yiiframework\.com/(?:(?:forum)|(?:wiki)|(?:user)|(?:extension)|(?:doc-2\.0)|(?:doc/(?:(?:(?:(?:guide)|(?:api))/(?:1|2)\.0)|(?:guide/1\.1/(?:(?:de)|(?:es)|(?:fr)|(?:he)|(?:id)|(?:it)|(?:ja)|(?:pl)|(?:pt)|(?:pt-br)|(?:ro)|(?:ru)|(?:sv)|(?:uk)|(?:zh-cn)))|(?:download/yii-.*-2\.0)|(?:blog)))|(?:news)|(?:blog)|(?:team)|(?:user)|(?:badge))' \
>   --adjust-extension --recursive --level inf --convert-links \
>   --page-requisites --span-hosts --no-clobber \
>   https://www.yiiframework.com/doc/guide/1.1/en
> 
> Some notes:
> 
> I happen to know that the guide contains links to the API
> docs, and all the API docs cross reference each other, 
> so I mirrored the guide and picked up the API docs as well.
> 
> The above command downloads 521 files comprising 43MB. (!)
> Sorry.
> 
> -F probably does nothing, but I included it because
> that's what I ran with.
> 
> Leaving off the --no-hsts I get:
> 
> mirror/
>   www.yiiframework.com/
>     doc/
>       api/
>       api.1.html
>       api.html
>       guide/
>         1.1/
>           en/
>           en.1.html
>           en.html
>       guide.1.html
>       terms/
> 
> As noted, "en.html" and "api.html" contain un-converted links and
> some downloaded content links to these files.  I _think_ these get
> created late in the download.
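
For anyone trying to reproduce this, a minimal sketch of how the duplicated pages could be spotted in a finished mirror. The function name find_dup_pages is hypothetical, and the sketch only assumes the "NAME.1.html next to NAME.html" pattern from the listing above; it says nothing about why wget writes both files.

```shell
# Hypothetical helper: print every NAME.html that has a sibling NAME.1.html
# somewhere under the given mirror root.
find_dup_pages() {
    find "$1" -type f -name '*.1.html' | sort | while IFS= read -r dup; do
        # Strip the ".1.html" suffix and look for the plain ".html" twin.
        base=${dup%.1.html}.html
        if [ -f "$base" ]; then
            printf '%s\n' "$base"
        fi
    done
}
```

Running `find_dup_pages mirror` after the --hsts run should, if the listing above is representative, report the "api.html" and "en.html" paths but not "guide.1.html", which has no "guide.html" twin.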
> 
> With --no-hsts (as in the command above) I get:
> 
> mirror/
>   www.yiiframework.com/
>     doc/
>       api/
>       api.html
>       guide/
>         1.1/
>           en/
>           en.html
>       guide.1.html
>       terms/
> 
> FYI.  I first tried using multiple --reject-regex arguments, but 
> this did not seem to work.  The docs were not clear as to whether
> multiple --reject-regex arguments are allowed.  So I wrote a single
> regex.  A note in the documentation about this might be helpful.
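
As an illustration of the single-regex workaround described above, here is a small sketch that joins several path patterns into one PCRE alternation, so the pattern list stays readable. The helper name join_reject_regex and the three sample patterns are assumptions made for the example; only the URL prefix and the --regex-type/--reject-regex options come from the report.

```shell
# Hypothetical helper: join path patterns into one alternation behind a
# common prefix. The body runs in a subshell so the IFS change stays local.
join_reject_regex() (
    prefix=$1
    shift
    IFS='|'
    # "$*" joins the remaining arguments with the first character of IFS.
    printf '%s(?:%s)' "$prefix" "$*"
)

reject=$(join_reject_regex '^https?://www\.yiiframework\.com/' \
    forum wiki 'doc-2\.0')
echo "$reject"
```

The result can then be passed as one argument, for example `--regex-type=pcre --reject-regex="$reject"`, which sidesteps the question of whether repeated --reject-regex options are honoured.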
> 
> I hope that the above is useful.
> 
> Regards,
> 
> Karl <address@hidden>
> Free Software:  "You don't pay back, you pay forward."
>                  -- Robert A. Heinlein
> 

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
