Re: [Bug-wget] WARC, new version

From:	Gijs van Tulder
Subject:	Re: [Bug-wget] WARC, new version
Date:	Sun, 30 Oct 2011 22:33:16 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110929 Thunderbird/7.0.1

Hi David,

David H. Lipman wrote:

> I have seen WARC mentioned but have not seen a definition.

WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web resources. It is used for making archives of web sites. The Internet Archive, for example, uses it as the file format for their Wayback Machine and Heritrix crawler.

The nice thing about WARC is that it lets you store all information about your web crawl: the files you download, of course, but also things like the HTTP request and response headers, and information about redirects and error pages. WARC also provides a place to keep the related metadata. It is, in short, a way to store everything in a standardized file format.
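To make the record idea concrete, here is a minimal Python sketch of how a single WARC/1.0 record is laid out: a version line, a set of named header fields, a blank line, the content block (for example a raw HTTP response), and two trailing CRLFs. The function name build_warc_record and the sample response are made up for illustration; this is not how wget writes its records, only an approximation of the format.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type: str, target_uri: str,
                      block: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 record (hypothetical helper, for illustration)."""
    headers = [
        ("WARC-Type", record_type),                      # e.g. request, response, metadata
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),             # length of the block in bytes
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The block is stored verbatim, followed by two CRLFs that end the record.
    return head.encode("utf-8") + block + b"\r\n\r\n"


# A made-up HTTP response, stored with its status line and headers intact.
http_response = (b"HTTP/1.1 200 OK\r\n"
                 b"Content-Type: text/html\r\n"
                 b"Content-Length: 13\r\n"
                 b"\r\n"
                 b"<p>hello</p>\n")
record = build_warc_record("response", "http://www.gnu.org/s/wget/",
                           http_response, "application/http; msgtype=response")
```

The point is that the HTTP headers are part of the stored block, so nothing about the original exchange is lost.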


Adding WARC to wget means that you'll be able to do things like

  wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

which will produce (next to the normal wget download) a file named 'gnu.warc.gz' that contains every HTTP request that wget made and every HTTP response that it received. This is an 'archival grade' copy of the mirrored site.

Once you have the WARC file, you could store it in your archive, extract files from it, or run your own local Wayback Machine [2, 3].

wget is already a very useful tool for making a quick copy of a website; adding WARC support helps make the copy as complete as possible.
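As a rough illustration of what ends up in such a file, the following Python sketch walks through a gzipped WARC file and lists the type, target URI and block size of each record. It assumes well-formed WARC/1.0 records with a Content-Length field; iter_warc_records and summarize are hypothetical helper names, and a real archive tool would do more validation.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of file
        if line in (b"\r\n", b"\n"):
            continue                    # separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def summarize(path: str) -> None:
    """Print one line per record, e.g. for 'gnu.warc.gz'."""
    # gzip.open reads a file made of several concatenated gzip members as one stream.
    with gzip.open(path, "rb") as stream:
        for headers, block in iter_warc_records(stream):
            print(headers.get("WARC-Type"),
                  headers.get("WARC-Target-URI", "-"),
                  len(block))
```

Calling summarize("gnu.warc.gz") after the wget run above should then list the request and response records one by one.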


Maybe that answers some of your questions?

Regards,

Gijs


[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php
