bug-wget

[Bug-wget] Segfault with WARC + CDX


From: Gijs van Tulder
Subject: [Bug-wget] Segfault with WARC + CDX
Date: Wed, 30 May 2012 23:13:54 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

Hi,

There's a bug in the warc_find_duplicate_cdx_record function. If you provide a file with CDX records, Wget can segfault when a record is not found in that CDX file. In effect, deduplication currently only works if *every* new record can be found in the CDX index.

The segmentation fault is generated on these lines in src/warc.c:

  hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload, &key,
                       &rec_existing);
  if (rec_existing != NULL && strcmp (rec_existing->url, url) == 0)

Contrary to what the code expects, hash_table_get_pair does not set rec_existing to NULL if no record is found; it leaves the pointer untouched. So instead of checking for NULL, the function should check whether the return value of hash_table_get_pair is non-zero:

  int found = hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload,
                                   &key, &rec_existing);
  if (found && strcmp (rec_existing->url, url) == 0)
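
To illustrate the pattern outside of Wget, here is a minimal standalone sketch. It is not the Wget code: struct record, lookup_pair and find_duplicate are hypothetical names, and lookup_pair is a simple linear search that only mimics the contract described above (return non-zero and fill in the output pointer on a hit, return zero and leave the pointer alone on a miss).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a CDX record; only the fields needed
   for the illustration are present. */
struct record
{
  const char *digest;
  const char *url;
};

/* Hypothetical lookup that mimics the contract described above:
   returns 1 and stores the match in *value when the digest is found,
   returns 0 and leaves *value untouched otherwise. */
static int
lookup_pair (const struct record *table, size_t n, const char *digest,
             const struct record **value)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (strcmp (table[i].digest, digest) == 0)
      {
        *value = &table[i];
        return 1;
      }
  return 0;
}

/* Returns the matching record, or NULL when the digest is unknown or
   the URL differs.  Only the return value of lookup_pair decides
   whether rec_existing may be dereferenced. */
static const struct record *
find_duplicate (const struct record *table, size_t n, const char *digest,
                const char *url)
{
  const struct record *rec_existing;  /* not initialized, as in the bug */
  int found = lookup_pair (table, n, digest, &rec_existing);
  if (found && strcmp (rec_existing->url, url) == 0)
    return rec_existing;
  return NULL;
}
```

Because the && short-circuits, rec_existing is never read when the lookup misses, so find_duplicate simply returns NULL for a digest that is not in the table.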

The attached patch makes this change. With it, deduplication also works when some records are missing from the CDX index.

Regards,

Gijs

Attachment: 0001-warc-Fix-segfault-if-CDX-record-is-not-found.patch
Description: Text Data

