Re: [Bug-wget] WARC, new version

From: David H. Lipman
Subject: Re: [Bug-wget] WARC, new version
Date: Sun, 30 Oct 2011 17:42:57 -0400

From: "Gijs van Tulder" <address@hidden>

> Hi David,
> David H. Lipman wrote:
>> I have seen WARC mentioned but have not seen a definition.
> WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web
> resources. It is used for making archives of web sites. The Internet
> Archive, for example, uses it as the file format for their Wayback Machine
> and Heritrix crawler.
> The nice thing about WARC is that it lets you store all information about
> your web crawl: the files you download, of course, but also things like the
> HTTP request and response headers, information about redirects and error
> pages. WARC also provides a place to keep the related metadata. It is, in
> short, a way to store everything, in a standardized file format.
> Adding WARC to wget means that you'll be able to do things like
>    wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu
> which will produce (next to the normal wget download) a file named
> 'gnu.warc.gz' that contains every HTTP request and every HTTP response that
> wget made. This is an 'archival grade' copy of the mirrored site.
> Once you have the WARC file, you could store it in your archive, extract
> files, or run your own local Wayback Machine [2, 3].
> wget is already a very useful tool for making a quick copy of a website;
> adding WARC support helps to make the copy as complete as possible.
> Maybe that answers some of your questions?
> Regards,
> Gijs
> [1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> [2] http://archive-access.sourceforge.net/projects/wayback/
> [3] http://netpreserve.org/software/downloads.php

It answers all my questions, and now I understand.

*Thank You Gijs !*
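
For anyone else following the thread, here is a rough sketch of what "every HTTP request and every HTTP response" looks like when read back out of such a file. It assumes the usual WARC/1.0 record layout (a `WARC/1.0` version line, `Name: value` header lines, a blank line, then a block of `Content-Length` bytes) and that every record carries a `Content-Length` header. The names `iter_warc_records` and `summarize` are made up for this illustration and are not part of wget or any WARC library; `gnu.warc.gz` is just the file name from the example above.

```python
import gzip
import sys


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream.

    Assumption: records follow the WARC/1.0 layout and carry Content-Length.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines up to the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # The record block is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", "0"))
        payload = stream.read(length)
        yield headers, payload


def summarize(path):
    """Return (WARC-Type, WARC-Target-URI) pairs for a gzip-compressed WARC file."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as f:
        return [
            (h.get("WARC-Type"), h.get("WARC-Target-URI"))
            for h, _ in iter_warc_records(f)
        ]


if __name__ == "__main__" and len(sys.argv) > 1:
    for warc_type, uri in summarize(sys.argv[1]):
        print(warc_type, uri)
```

Saved under a hypothetical name such as `warc_list.py`, it could be run as `python warc_list.py gnu.warc.gz` to list the record types and target URIs. It is only meant to show the record structure; it does no validation and is no substitute for the Wayback tools Gijs linked.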

Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
