From: Benjamin Esham
Subject: [Bug-wget] Patch: Always surround the "WARC-Target-URI" value with angle brackets
Date: Fri, 3 Mar 2017 09:00:57 -0500

Hello,

When producing WARC files, Wget records the requested URI in the
"WARC-Target-URI" field. I noticed that Wget encloses the value of this field
within <angle brackets> in blocks with "WARC-Type: request", but not in blocks
of type "response", "resource", "revisit", or "metadata". Enclosing URIs
within angle brackets is required by the spec [1]. I'm attaching a patch that
adds the angle brackets for all block types.

(Doing this for "request" blocks was the subject of bug 47281 [2], which was
fixed almost exactly a year ago. My patch simply extends the use of the
warc_write_header_uri function to the other appropriate places.)
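
For readers without the Wget source at hand, the idea of writing a URI-valued
header can be sketched as below. This is only an illustration, not the code in
src/warc.c: the helper name format_header_uri, its buffer-based signature, and
its return values are assumptions made for this example. It wraps the value in
angle brackets and ends the line with CRLF, as WARC header lines do.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (NOT Wget's warc_write_header_uri): format one WARC
   header line whose value is a URI, enclosing the value in angle brackets.
   Returns the number of bytes written (excluding the trailing NUL),
   0 if VALUE is NULL (the header is skipped), or -1 if BUF is too small.  */
static int
format_header_uri (char *buf, size_t size, const char *name, const char *value)
{
  int n;

  assert (buf != NULL && name != NULL);

  if (value == NULL)
    {
      /* Nothing to write; leave an empty string behind if there is room.  */
      if (size > 0)
        buf[0] = '\0';
      return 0;
    }

  /* "Name: <uri>" followed by CRLF.  */
  n = snprintf (buf, size, "%s: <%s>\r\n", name, value);
  if (n < 0 || (size_t) n >= size)
    return -1;

  return n;
}
```

Calling it with "WARC-Target-URI" and "https://www.gnu.org/software/wget/"
would produce the bracketed form regardless of the record type, which is the
behavior the patch aims for.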

Here is a truncated example of the output from Wget 1.19.1:

    WARC/1.0
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:95D7B77A-C019-4E91-9BBB-7526B68864F2>
    WARC-Warcinfo-ID: <urn:uuid:29F863DF-B273-498B-B91C-B50B2FD1BFCD>
    WARC-Concurrent-To: <urn:uuid:EDCAF84C-D7A6-43CE-AE78-AEE16D3B7F4B>
    WARC-Target-URI: https://www.gnu.org/software/wget/

And from the patched version:

    WARC/1.0
    WARC-Type: response
    WARC-Record-ID: <urn:uuid:54F2170C-C3FA-4B05-A8B1-116466D92401>
    WARC-Warcinfo-ID: <urn:uuid:29BCF957-0D4D-4933-9CA3-F7FF2218D144>
    WARC-Concurrent-To: <urn:uuid:61FCAFA4-5DF9-4CC0-A6C6-BC233601EF1E>
    WARC-Target-URI: <https://www.gnu.org/software/wget/>
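
The difference between the two outputs above could be checked mechanically
with something like the following. This is a rough sketch, not part of the
patch: the function name target_uri_is_bracketed is made up for this example,
and for simplicity it matches the field name case-sensitively and inspects a
single header line.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical checker: return true when LINE is a "WARC-Target-URI"
   header whose value is enclosed in angle brackets.  Simplification:
   the field name is compared case-sensitively.  */
static bool
target_uri_is_bracketed (const char *line)
{
  static const char prefix[] = "WARC-Target-URI:";
  const char *value;
  size_t len;

  if (strncmp (line, prefix, sizeof prefix - 1) != 0)
    return false;

  /* Skip the whitespace between the colon and the value.  */
  value = line + (sizeof prefix - 1);
  while (*value == ' ' || *value == '\t')
    value++;

  /* Measure the value up to the end of the line (CR or LF), if any.  */
  len = strcspn (value, "\r\n");

  return len >= 2 && value[0] == '<' && value[len - 1] == '>';
}
```

With this sketch, the "WARC-Target-URI" line from the Wget 1.19.1 output would
be rejected and the line from the patched version would be accepted.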

Best regards,

Benjamin


[1] http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

[2] http://savannah.gnu.org/bugs/?47281

Attachment: 0001-src-warc.c-Use-warc_write_header_uri-for-all-WARC-Ta.patch
Description: Binary data

