From: <timscrim@timscrim.co.uk>
Hi Everyone
I am trying to download a complete website from archive.org using Wget
but I have run into a problem.
Hi, yes, I face the same problem and wrote to the Archive.org team around
three years ago (I paste the text below); they answered that they would
work on it. I fear that was just to calm me down...
If you are a human and you are exploring an old website on archive.org,
you may notice that sometimes when you click on a link from one page on
the website to another, the datestamp part of the URL changes.
That's the problem, yes!
You can also end up on the same page as you were previously but with a
different datestamp.
Never noticed this.
Hi Paul
Thank you very much indeed for your very informative and helpful reply
and for the link to your MakeStaticSite tool. I will try it out.
I'm glad you sent this because, as often happens on some lists, Paul's
answer isn't in the daily compilation :-(
The problem of retrieval for the likes of Wget is explained by Archive
Team
https://wiki.archiveteam.org/index.php?title=Restoring
Some sentences look exactly the same as the ones I sent to them.
They list free software in Ruby for doing what we want, but the content
of the page seems to be rather old.
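As far as I understand, tools of that kind work by listing every capture of
the site through the Wayback CDX API and then fetching each listed URL in
its raw form, so that the WM header and the link rewriting are absent. A
rough sketch of the idea (the CDX query fields and the "id_" raw-snapshot
suffix are assumptions of mine to verify, not the Ruby tool's actual code;
www.example.com is a placeholder):

site="www.example.com"
# List one capture per archived URL under the site: timestamp + original URL.
wget -q -O - "https://web.archive.org/cdx/search/cdx?url=${site}/*&fl=timestamp,original&filter=statuscode:200&collapse=urlkey" |
while read -r ts url; do
  # The "id_" suffix after the timestamp asks for the original bytes,
  # without the Wayback header and without rewritten links.
  wget -q -x "https://web.archive.org/web/${ts}id_/${url}"
done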
As an attempted solution, I have developed a prototype tool,
MakeStaticSite that runs Wget iteratively, downloading snapshots
selectively to minimise repetition, then merging them into a canonical
form.
https://makestaticsite.sh/
https://github.com/paultraf/makestaticsite
Interesting. Below are the two mails I sent to Archive.org and their
answers:
24 Feb 2022, to info@archive.org:
Hi, I was unable to find a more suitable contact path to submit a feature
request than this mail address. First of all, thanks and congrats,
especially for the Wayback Machine and for being visionary about the need
for this since 1996. Your work is fundamental for humanity.
Well, about the feature I need, and I'm not the only one: I'm sucking some
old, now-disappeared websites for various reasons, like adding copies (in
addition to yours on Archive.org) on physical media, readable offline on
old computers, for my own usage, for spreading them around me, and for
humanity's future memory. I use Wget and Httrack. Sadly, this doesn't work
well because they are confused by these specific Wayback Machine
additions:
1-The added header;
2-the timestamps in the URIs (and also the fact that they may be absolute
instead of relative, or at least semi-absolute).
So I think the user should be able to ask for a websucker-friendly version
of a displayed website. What I imagine could work like this (a little like
the "printer-friendly version" that some websites have):
1-The user searches for and finds the archived website capture he wants,
as usual.
2-The user can ask for a "websucker-friendly version" of the currently
displayed archived website.
3-The Wayback Machine provides a URL, leading to the wanted archived
website, to be used by the sucker.
3-1-This URL leads to a version of the archived website without the WM
header and with the internal links of the archived website sanitized: they
lead to the same linked resources as before, but appear to the sucker as
relative internal links of the website.
3-2-This URL could be available for a limited time.
3-3-A recommended syntax for retrieving it could be provided for popular
suckers like Wget and Httrack, for the user's information.
It's dev work, but not so hard, I think, since most of the needed code is
in the WM already. It's just the behavior of the on-the-fly internal-link
rewriting that must be changed in this new mode. I don't expect it to be
done very quickly, but I would be pleased if you added this project to
your future to-do list.
Patron Services Yellow, Feb 24, 2022, 9:08 PST
Hi,
I have never heard the term "websucker".
Might you please share an example?
And... this might be helpful to you.
If we have archives of this website, you would be able to find them by
searching for the URL at web.archive.org - use the timeline on the top to
navigate the year, and the calendar display below to select the date(s)
you are interested in viewing. Dates with a circular highlight on them
indicate available archives - typically, what you see is what we have.
If you own the content from the original website, and assume full
responsibility for ensuring that your use of the archived content is in
accordance with all applicable law, we would be very glad if the archives
at web.archive.org were of assistance in helping you restore your
website.
We do not have our own bulk grab tool, and we cannot guarantee results,
but you are welcome to save off webpages individually (from your Web
browser) or attempt to use a third-party tool to target your archived
content. There are several third-party services that will help you
re-build a website from archives available via our Wayback Machine at
web.archive.org. Here are some that we are aware of:
waybackrebuilder.com
waybackdownloader.com
waybackmachinedownloader.com
waybackdownloads.com
Please note that we do not have direct experience working with any of
these, so we are unable to give a recommendation for which might be best.
Unfortunately, we cannot provide download directions nor field technical
support inquiries. However, you may wish to hire a Web Developer if you
need additional assistance.
For more information about how to use archive.org and the Wayback
Machine, please see our Help Center: https://help.archive.org/hc/en-us
---
The Internet Archive Team
I notice that a mail from me seems to be missing here.
Your request (590999) has been updated. To add additional comments, reply
to this email.
----------------------------------------------
Patron Services Yellow, Feb 24, 2022, 9:32 PST
Ah... I call those web scraping or web archiving tools.
What, exactly, are you trying to do?
Hi, it's a word-for-word translation of the French "aspirateur de
sites"...
As I wrote in the original request:
"Well, about the feature I need, and I'm not the only one: I'm sucking
some old, now-disappeared websites for various reasons, like adding copies
(in addition to yours on Archive.org) on physical media, readable offline
on old computers, for my own usage, for spreading them around me, and for
humanity's future memory. I use Wget and Httrack."
If you want the exact case that led me to make the request, here it is:
this time I want to get the whole of
<https://web.archive.org/web/19980126083908/http://www.imaginet.fr/ime/>.
It shouldn't be a problem; this is a good-old-fashioned, full-strict-HTML
website.
"httrack -W
https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"
and "wget -c -m -k -K -E -nH --cut-dirs=4 -np -p
https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"
only do the same thing as a simple "save as" in the browser, probably
because they consider the linked elements and pages to lead to a different
website, since the links are absolute from the Archive's root and contain
the timestamp field.
"cd /tmp ; d="ungi" ; echo "We're about to delete "$d" in "`pwd` ; sleep 9
; rm -rvf $d ; mkdir $d ; cd $d ;
wget -c -m -k -K -E -nH --cut-dirs=6 -p --show-progress
https://web.archive.org/web/19980110124843/http://www.imaginet.fr/ime/"
never ends; it seems to try to download all the captures of all the
elements of the website, perhaps because of the WM header with the
calendar, but I'm not sure.
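One thing I have not really explored yet: I have read that appending "id_"
to the timestamp in a snapshot URL makes the Wayback Machine serve the raw
capture, without the WM header and without rewritten links (the links then
point at the original, now-dead domain, so recursion into the archive
breaks). A sketch of what I mean, the "id_" suffix being an assumption on
my part to check:

wget -x "https://web.archive.org/web/19980110124843id_/http://www.imaginet.fr/ime/"

But that is still page by page, so it only helps a little.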
I fear that I'm stuck and will be forced to save each page from Firefox.
Even if you implement my feature request in the future, it will be a long
time from now, if ever...
--
Sincerely, Stephane Ascoet