bug-wget

Re: Problem downloading a website from archive.org


From: timscrim
Subject: Re: Problem downloading a website from archive.org
Date: Thu, 13 Mar 2025 18:25:14 -0000

Hi Stephane

Thank you very much indeed for your very informative reply.

Kind regards

Tim

----- Original Message ----- From: "Stephane Ascoet" <stephane.ascoet@univ-paris1.fr>
To: <bug-wget@gnu.org>
Sent: Thursday, March 13, 2025 5:52 PM
Subject: Re: Problem downloading a website from archive.org


From: <timscrim@timscrim.co.uk>

Hi Everyone

I am trying to download a complete website from archive.org using Wget but I have run into a problem.

Hi, yes, I face the same problem. I wrote to the Archive.org team around three years ago (the exchange is pasted below), and they answered that they would work on it. I fear that it was just to calm me down...

If you are a human and you are exploring an old website on archive.org, you may notice that sometimes when you click on a link from one page on the website to another, the datestamp part of the URL changes.

That's the problem, yes!

You can also end up on the same page as you were previously but with a different datestamp.

Never noticed this.
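For context, the datestamp in question is the 14-digit capture timestamp (YYYYMMDDhhmmss) that the Wayback Machine embeds between its host name and the original URL, so two links to the same page can point at different captures; for example, the two captures discussed later in this thread:
  https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/
  https://web.archive.org/web/19980126083908/http://www.imaginet.fr/ime/
To a recursive mirroring tool, each distinct timestamp looks like a brand-new URL, which is why whole-site retrieval balloons.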


Hi Paul

Thank you very much indeed for your very informative and helpful reply and for the link to your MakeStaticSite tool. I will try it out.

I'm glad you sent this because, as often happens on some lists, Paul's answer isn't in the daily digest :-(


The problem of retrieval for the likes of Wget is explained by Archive Team
  https://wiki.archiveteam.org/index.php?title=Restoring

Some sentences look exactly the same as the ones I sent them. They list a free-software tool written in Ruby for doing what we want, but the content of the page seems rather old.


As an attempted solution, I have developed a prototype tool, MakeStaticSite, which runs Wget iteratively, downloading snapshots selectively to minimise repetition, then merging them into a canonical form.
  https://makestaticsite.sh/
  https://github.com/paultraf/makestaticsite
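
To make the approach concrete, here is a minimal sketch of that iterate-and-merge idea. It is illustrative only, not MakeStaticSite's actual code; the Wget flags are borrowed from elsewhere in this thread and may need tuning:

  # Mirror each known capture of one site into its own directory.
  # The two timestamps are the captures mentioned in this thread.
  for ts in 19980110124843 19980126083908; do
    wget -m -k -E -np -nH --cut-dirs=4 -P "snap-$ts" \
      "https://web.archive.org/web/$ts/http://www.imaginet.fr/ime/"
  done
  # Naive merge into one canonical tree: the directories sort
  # oldest-first, so files from newer captures overwrite older ones.
  mkdir -p merged
  for d in snap-*; do cp -R "$d"/. merged/; done

A real tool additionally has to rewrite any remaining timestamped archive.org links into plain relative links, which is the hard part that Wget's -k only partially covers.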

Interesting. Below are the two mails I sent to Archive.org and their answers:

24 Feb 2022, to info@archive.org:
Hi, I was unable to find a more suitable contact path for submitting a feature request than this mail address. First of all, thanks and congratulations, especially for the Wayback Machine and for being visionary about the need for it since 1996. Your work is fundamental for humanity.

Well, about the feature I need (and I'm not the only one): I'm sucking some old, now-disappeared websites for various reasons, such as keeping copies (in addition to yours on Archive.org) on physical media, readable offline on old computers, for my own use, for spreading them around me, and for humanity's future memory. I use Wget and HTTrack. Sadly, it doesn't work perfectly well, because they are confused by these specific Wayback Machine additions:
1- The added header;
2- The timestamps in the URIs (and also the fact that they may be absolute instead of relative, or semi-absolute at best).

So I think the user should be able to ask for a websucker-friendly version of a displayed website. What I imagine could work like this (a little like the "printer-friendly version" that some websites have):
1- The user searches for and finds the archived website capture he wants, as usual.
2- The user asks for a "websucker-friendly version" of the currently displayed archived website.
3- The Wayback Machine provides a URL, leading to the wanted archived website, to be used by the sucker.
3-1- This URL leads to a version of the archived website without the WM header, and with the internal links of the archived website sanitized: they lead to the same linked resources as before, but appear to the sucker as relative internal links of the website.
3-2- This URL could be available for a limited time.
3-3- The advised syntax for retrieving it could be provided for popular suckers like Wget and HTTrack, for the user's information.

It's dev work, but not so hard, I think, since most of the needed code is already in the WM. It's just the behavior of the on-the-fly rewriting of internal links that must be changed in this new mode. I don't expect it to be done very quickly, but I would be pleased if you added this project to your future to-do timeline.

Patron Services Yellow, Feb 24, 2022, 9:08 PST

Hi,

I have never heard the term "websucker".

Might you please share an example?

And... this might be helpful to you.


If we have archives of this website, you would be able to find them by searching for the URL at web.archive.org - use the timeline on the top to navigate the year, and the calendar display below to select the date(s) you are interested in viewing. Dates with a circular highlight on them indicate available archives - typically, what you see is what we have.

If you own the content from the original website, and assume full responsibility for ensuring that your use of the archived content is in accordance with all applicable law, we would be very glad if the archives at web.archive.org were of assistance in helping you restore your website.

We do not have our own bulk grab tool, and we cannot guarantee results, but you are welcome to save off webpages individually (from your Web browser) or attempt to use a third-party tool to target your archived content. There are several third-party services that will help you re-build a website from archives available via our Wayback Machine at web.archive.org. Here are some that we are aware of:

 waybackrebuilder.com
 waybackdownloader.com
 waybackmachinedownloader.com
 waybackdownloads.com

Please note that we do not have direct experience working with any of these, so we are unable to give a recommendation for which might be best. Unfortunately, we cannot provide download directions nor field technical support inquiries. However, you may wish to hire a Web Developer if you need additional assistance.

For more information about how to use archive.org and the Wayback Machine, please see our Help Center: https://help.archive.org/hc/en-us

 ---
 The Internet Archive Team

I notice that a mail from me seems to be missing here.


Your request (590999) has been updated. To add additional comments, reply to this email.
----------------------------------------------

Patron Services Yellow, Feb 24, 2022, 9:32 PST

Ah... I call those web scraping or web archiving tools.

What, exactly, are you trying to do?

Hi, it's a word-for-word translation of the French "aspirateur de sites" (a "site vacuum cleaner")...
As I wrote in the original request:
"Well, about the feature I need (and I'm not the only one): I'm sucking some old, now-disappeared websites for various reasons, such as keeping copies (in addition to yours on Archive.org) on physical media, readable offline on old computers, for my own use, for spreading them around me, and for humanity's future memory. I use Wget and HTTrack."

If you want the exact case that led me to make the request, here it is: this time I want to get the whole of <https://web.archive.org/web/19980126083908/http://www.imaginet.fr/ime/>. It shouldn't be a problem; this is a good old-fashioned, strict-HTML website.

"httrack -W https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"; and "wget -c -m -k -K -E -nH --cut-dirs=4 -np -p https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"; only do the same thing as a simple "save as" in the browser, probably because they consider links elements and pages to lead to a different Website, because of their format, being absolute from Archives root and with the timestamp field.

"cd /tmp ; d="ungi" ; echo "We're about to delete "$d" in "`pwd` ; sleep 9 ; rm -rvf $d ; mkdir $d ; cd $d ; wget -c -m -k -K -E -nH --cut-dirs=6 -p --show-progress https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"; never ends, it seems to try to download all the captures of all the elements of the Website. Perhaps because of the WM header with the calendar, but I'm not sure.
I fear that I'm stuck, and will be forced to save each page from Firefox
Even if you implement my feature request in the future, it will be in a long time, if ever...
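
Worth noting as a partial workaround: the Wayback Machine serves a raw form of each capture, without the WM header and without rewritten links, when an id_ suffix is appended to the timestamp in the URL. For example, a variant of the command above (same flags, untested here, likely still needing tuning):
  wget -c -m -k -K -E -nH --cut-dirs=4 -np -p https://web.archive.org/web/19980110124843id_/http://www.imaginet.fr/ime/
Because the raw pages keep the site's original links, recursion stays inside the archive only for relative links; absolute links to the original host will still escape the mirror.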

--
Sincerely, Stephane Ascoet






