bug-wget

Re: Problem downloading a website from archive.org


From: timscrim
Subject: Re: Problem downloading a website from archive.org
Date: Thu, 13 Mar 2025 18:25:14 -0000

Hi Stephane

Thank you very much indeed for your very informative reply.

Kind regards

Tim

----- Original Message ----- From: "Stephane Ascoet" <stephane.ascoet@univ-paris1.fr>
To: <bug-wget@gnu.org>
Sent: Thursday, March 13, 2025 5:52 PM
Subject: Re: Problem downloading a website from archive.org


From: <timscrim@timscrim.co.uk>

Hi Everyone

I am trying to download a complete website from archive.org using Wget but I have run into a problem.

Hi, yes, I face the same problem. I wrote to the Archive.org team around three years ago (the exchange is pasted below), and they answered that they would work on it. I fear that it was just to calm me down...

If you are a human and you are exploring an old website on archive.org, you may notice that sometimes when you click on a link from one page on the website to another, the datestamp part of the URL changes.

That's the problem, yes!

You can also end up on the same page as you were previously but with a different datestamp.

Never noticed this.
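For context, the datestamp in question is the 14-digit capture timestamp (YYYYMMDDhhmmss) that the Wayback Machine embeds between its host name and the original URL, so two links to the same page can point at different captures; for example, the two captures discussed later in this thread:
  https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/
  https://web.archive.org/web/19980126083908/http://www.imaginet.fr/ime/
To a recursive mirroring tool, each distinct timestamp looks like a brand-new URL, which is why whole-site retrieval balloons.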


Hi Paul

Thank you very much indeed for your very informative and helpful reply and for the link to your MakeStaticSite tool. I will try it out.

I'm glad you sent this because, as often happens on some lists, Paul's answer isn't in the daily digest :-(


The problem of retrieval for the likes of Wget is explained by Archive Team
  https://wiki.archiveteam.org/index.php?title=Restoring

Some sentences look exactly the same as the ones I sent them. They list a free-software tool written in Ruby for doing what we want, but the content of the page seems rather old.


As an attempted solution, I have developed a prototype tool, MakeStaticSite, which runs Wget iteratively, downloading snapshots selectively to minimise repetition, then merging them into a canonical form.
  https://makestaticsite.sh/
  https://github.com/paultraf/makestaticsite
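
To make the approach concrete, here is a minimal sketch of that iterate-and-merge idea. It is illustrative only, not MakeStaticSite's actual code; the Wget flags are borrowed from elsewhere in this thread and may need tuning:

  # Mirror each known capture of one site into its own directory.
  # The two timestamps are the captures mentioned in this thread.
  for ts in 19980110124843 19980126083908; do
    wget -m -k -E -np -nH --cut-dirs=4 -P "snap-$ts" \
      "https://web.archive.org/web/$ts/http://www.imaginet.fr/ime/"
  done
  # Naive merge into one canonical tree: the directories sort
  # oldest-first, so files from newer captures overwrite older ones.
  mkdir -p merged
  for d in snap-*; do cp -R "$d"/. merged/; done

A real tool additionally has to rewrite any remaining timestamped archive.org links into plain relative links, which is the hard part that Wget's -k only partially covers.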

Interesting. Below are the two mails I sent to Archive.org and their answers:

24 Feb 2022, to info@archive.org:
Hi, I was unable to find a more suitable contact path for submitting a feature request than this mail address. First of all, thanks and congratulations, especially for the Wayback Machine and for being visionary about the need for it since 1996. Your work is fundamental for humanity.

Well, about the feature I need (and I'm not the only one): I'm sucking some old, now-disappeared websites for various reasons, such as keeping copies (in addition to yours on Archive.org) on physical media, readable offline on old computers, for my own use, for spreading them around me, and for humanity's future memory. I use Wget and HTTrack. Sadly, it doesn't work perfectly well, because they are confused by these specific Wayback Machine additions:
1- The added header;
2- The timestamps in the URIs (and also the fact that they may be absolute instead of relative, or semi-absolute at best).

So I think the user should be able to ask for a websucker-friendly version of a displayed website. What I imagine could work like this (a little like the "printer-friendly version" that some websites have):
1- The user searches for and finds the archived website capture he wants, as usual.
2- The user asks for a "websucker-friendly version" of the currently displayed archived website.
3- The Wayback Machine provides a URL, leading to the wanted archived website, to be used by the sucker.
3-1- This URL leads to a version of the archived website without the WM header, and with the internal links of the archived website sanitized: they lead to the same linked resources as before, but appear to the sucker as relative internal links of the website.
3-2- This URL could be available for a limited time.
3-3- The advised syntax for retrieving it could be provided for popular suckers like Wget and HTTrack, for the user's information.

It's dev work, but not so hard, I think, since most of the needed code is already in the WM. It's just the behavior of the on-the-fly rewriting of internal links that must be changed in this new mode. I don't expect it to be done very quickly, but I would be pleased if you added this project to your future to-do timeline.

Patron Services Yellow, Feb 24, 2022, 9:08 PST

Hi,

I have never heard the term "websucker".

Might you please share an example?

And... this might be helpful to you.


If we have archives of this website, you would be able to find them by searching for the URL at web.archive.org - use the timeline on the top to navigate the year, and the calendar display below to select the date(s) you are interested in viewing. Dates with a circular highlight on them indicate available archives - typically, what you see is what we have.

If you own the content from the original website, and assume full responsibility for ensuring that your use of the archived content is in accordance with all applicable law, we would be very glad if the archives at web.archive.org were of assistance in helping you restore your website.

We do not have our own bulk grab tool, and we cannot guarantee results, but you are welcome to save off webpages individually (from your Web browser) or attempt to use a third-party tool to target your archived content. There are several third-party services that will help you re-build a website from archives available via our Wayback Machine at web.archive.org. Here are some that we are aware of:

 waybackrebuilder.com
 waybackdownloader.com
 waybackmachinedownloader.com
 waybackdownloads.com

Please note that we do not have direct experience working with any of these, so we are unable to give a recommendation for which might be best. Unfortunately, we cannot provide download directions nor field technical support inquiries. However, you may wish to hire a Web Developer if you need additional assistance.

For more information about how to use archive.org and the Wayback Machine, please see our Help Center: https://help.archive.org/hc/en-us

 ---
 The Internet Archive Team

I notice that a mail from me seems to be missing here.


Your request (590999) has been updated. To add additional comments, reply to this email.
----------------------------------------------

Patron Services Yellow, Feb 24, 2022, 9:32 PST

Ah... I call those web scraping or web archiving tools.

What, exactly, are you trying to do?

Hi, it's a word-for-word translation of the French "aspirateur de sites" (a "site vacuum cleaner")...
As I wrote in the original request:
"Well, about the feature I need (and I'm not the only one): I'm sucking some old, now-disappeared websites for various reasons, such as keeping copies (in addition to yours on Archive.org) on physical media, readable offline on old computers, for my own use, for spreading them around me, and for humanity's future memory. I use Wget and HTTrack."

If you want the exact case that led me to make the request, here it is: this time I want to get the whole of <https://web.archive.org/web/19980126083908/http://www.imaginet.fr/ime/>. It shouldn't be a problem; this is a good old-fashioned, strict-HTML website.

"httrack -W https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"; and "wget -c -m -k -K -E -nH --cut-dirs=4 -np -p https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"; only do the same thing as a simple "save as" in the browser, probably because they consider links elements and pages to lead to a different Website, because of their format, being absolute from Archives root and with the timestamp field.

"cd /tmp ; d="ungi" ; echo "We're about to delete "$d" in "`pwd` ; sleep 9 ; rm -rvf $d ; mkdir $d ; cd $d ; wget -c -m -k -K -E -nH --cut-dirs=6 -p --show-progress https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"; never ends, it seems to try to download all the captures of all the elements of the Website. Perhaps because of the WM header with the calendar, but I'm not sure.
I fear that I'm stuck, and will be forced to save each page from Firefox
Even if you implement my feature request in the future, it will be in a long time, if ever...
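
Worth noting as a partial workaround: the Wayback Machine serves a raw form of each capture, without the WM header and without rewritten links, when an id_ suffix is appended to the timestamp in the URL. For example, a variant of the command above (same flags, untested here, likely still needing tuning):
  wget -c -m -k -K -E -nH --cut-dirs=4 -np -p https://web.archive.org/web/19980110124843id_/http://www.imaginet.fr/ime/
Because the raw pages keep the site's original links, recursion stays inside the archive only for relative links; absolute links to the original host will still escape the mirror.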

--
Sincerely, Stephane Ascoet






