bug-wget

Re: [Bug-wget] Re: How to ignore errors with time stamping


From: Andre Majorel
Subject: Re: [Bug-wget] Re: How to ignore errors with time stamping
Date: Fri, 12 Dec 2008 10:22:39 +0100
User-agent: Mutt/1.5.17+20080114 (2008-01-14)

On 2008-12-12 09:03 +0100, Morten Lemvigh wrote:

> No links on a page with a missing last-modified header are
> scanned if the page is on the disk already. If I run:
>
> wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
>
> --08:51:24--  
> http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
>            => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
> Resolving eur-lex.europa.eu... 147.67.136.2, 147.67.136.102,  
> 147.67.119.2, ...
> Connecting to eur-lex.europa.eu|147.67.136.2|:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 9.709 (9.5K) [text/html]
> Last-modified header missing -- time-stamps turned off.
> 08:51:24 (82.42 KB/s) -  
> `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML' saved  
> [9709/9709]
> [....]
>
> wget will retrieve the page and continue recursively getting all the  
> linked pages, as I would expect.

OK. This is normal.

> If I issue this command a second time,  all I get is this:
>
> wget -r -N http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
> --08:53:18--  
> http://eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML
>            => `eur-lex.europa.eu/JOHtml.do?uri=OJ:L:2008:321:SOM:DA:HTML'
> Resolving eur-lex.europa.eu... 147.67.119.2, 147.67.119.102,  
> 147.67.136.2, ...
> Connecting to eur-lex.europa.eu|147.67.119.2|:80... connected.
> HTTP request sent, awaiting response... 500 Internal Server Error
> 08:53:18 ERROR 500: Internal Server Error.
> FINISHED --08:53:18--
> Downloaded: 0 bytes in 0 files
>
> So all the pages linked from this page are ignored too. It's fine
> if wget skips the problematic document, but I would prefer wget
> to continue the recursive scan.

The first time, the local file doesn't exist so Wget issues a GET
request, which succeeds (200).

The second time, the local file exists so Wget must first check
whether the resource has changed. To that end, it issues a HEAD
request. The server apparently doesn't know when the document was
last modified. It could fulfill the HEAD request by simply omitting
the Last-modified header. Instead, it rejects it with a 500.

It's not that the missing Last-modified header causes Wget to
"ignore the links". It's that there is no document to scan for
links because, when queried about it, the server replied 500.
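
The two runs can be modelled with a short Python sketch. This is only
an illustration of the sequence described above, not Wget's actual
code (Wget is written in C); the function names first_request and
has_document_to_scan are made up for the example.

```python
def first_request(local_file_exists: bool, timestamping: bool) -> str:
    """Pick the HTTP method of the first request, as described above."""
    if timestamping and local_file_exists:
        # A local copy exists: ask about the resource before downloading it.
        return "HEAD"
    # Nothing on disk (or no -N): just download the document.
    return "GET"


def has_document_to_scan(status: int) -> bool:
    """Only a successful response leaves a document to scan for links."""
    return 200 <= status < 300


# First run: no local file, so GET; the server answered 200.
print(first_request(False, True), has_document_to_scan(200))
# Second run: local file present, so HEAD; the server answered 500.
print(first_request(True, True), has_document_to_scan(500))
```

In this model the second run ends with nothing to scan, which matches
the "Downloaded: 0 bytes in 0 files" in the log.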

To work around that kind of brokenness, Wget would have to ignore
the 500 error and fall back on parsing the local file. That should
probably not be made the default behaviour, though.
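
A rough sketch of what such a fallback could look like, in Python
for brevity. Everything here is hypothetical: Wget has no such option
that I know of, and fallback_links and ignore_server_errors are names
invented for the example. The idea is simply that, on a 5xx reply, an
opt-in switch lets the links be taken from the copy already on disk.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fallback_links(head_status, local_html, ignore_server_errors=False):
    """Return links from the local copy if the server replied 5xx.

    Hypothetical opt-in behaviour: without ignore_server_errors, or for
    any status outside 500-599, nothing is returned.
    """
    if not ignore_server_errors or local_html is None:
        return []
    if not 500 <= head_status < 600:
        return []
    collector = LinkCollector()
    collector.feed(local_html)
    collector.close()
    return collector.links
```

Keeping the switch off by default reflects the point above: silently
ignoring a 500 would hide genuine server failures from most users.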

-- 
André Majorel <URL:http://www.teaser.fr/~amajorel/>
