
Re: lynx-dev The traversal limitation

From: per . magnus . banck
Subject: Re: lynx-dev The traversal limitation
Date: Fri, 23 Oct 1998 7:15:00 +0100

David Woolley <address@hidden> wrote in reply to me:

>>>>> Most sites do contain much more material than I ever want to download
>>>>> over my slow link
>> Many sites object strongly to being crawled as well, because they expend
>> bandwidth on pages not read.  IMDB is a case in point.  Please never crawl
>> that site with Lynx or you will find that Lynx gets permanently barred from
>> it.  (The other issue is that mirrored copies breach the copyright and
>> deny them the ability to obtain the advertising revenue that pays for the
>> site.)

We are in full agreement over what policy issues are involved in this.
What I tried to do was to _limit_ the number of pages being crawled.

>>>>> So I try to filter the searches from the start file via the reject.dat
>>>>> file and -realm. But in this case, the interesting pages are in /cgi-bin/
>> Pages aren't normally in /cgi-bin; rather, a program is run to create
>> the page on the fly when you reference URLs of this form.  That's
>> particularly expensive for the site, and most sites will use the robots.txt
>> file to bar access to well-behaved crawlers.  Unfortunately, Lynx is NOT
>> well behaved.
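For illustration, a site that wants well-behaved crawlers to stay out of its
CGI area publishes a robots.txt file at its document root along these lines
(paths illustrative):

```
User-agent: *
Disallow: /cgi-bin/
```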

>>>>> "The '-traversal' switch is for http URLs and cannot be used for file:"
>>>>> Is there any big security concern behind this limitation?
>>>>> If not, I suggest we skip this test altogether.
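For context, a rough sketch of the filtered traversal I had in mind (all URLs
hypothetical). In -traversal mode Lynx consults a reject.dat file in the
current directory, where a trailing * acts as a prefix wildcard, and -realm
keeps the crawl within the realm of the starting URL:

```shell
# Seed reject.dat so the traversal skips areas we never want to
# fetch; a trailing '*' rejects everything under that prefix.
cat > reject.dat <<'EOF'
http://www.example.org/cgi-bin/*
http://www.example.org/archive/*
EOF

# -crawl saves each fetched page as an lnk*.dat file.  (Commented
# out here so the sketch can be run without network access.)
# lynx -crawl -traversal -realm http://www.example.org/index.html
```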

Getting Lynx to be well behaved is surely worth a discussion of its own,
but that is far outside my limited knowledge of Lynx internals.

The point I tried to raise was merely this: is there any security concern
behind the limitation that the startfile is not allowed to be on the local
disk?

Does anybody know?

>> wget is designed for this purpose and does exist in win32 versions.  It is
>> well behaved as a crawler, but can be given an explicit list of URLs, on
>> the command line, or in a file, and will then bypass robots.txt.  Because it
>> is well behaved, it is less likely to be barred, although excessive use
>> for mirroring a set of pages could still have this effect.
>> It doesn't render the HTML, but can fixup internal URLs so that they work
>> from the local filesystem, allowing you to use Lynx, or another browser, on
>> that copy.
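A minimal sketch of such a wget run (all URLs hypothetical): an explicit list
of pages given with --input-file is fetched without consulting robots.txt,
while --wait keeps the load polite and --convert-links rewrites internal URLs
so the local copy is browsable offline:

```shell
# Hypothetical list of pages to fetch explicitly.
cat > urls.txt <<'EOF'
http://www.example.org/page1.html
http://www.example.org/page2.html
EOF

# Fetch the listed pages with a 2-second pause between requests
# and rewrite links for local viewing.  (Commented out so the
# sketch runs without network access.)
# wget --wait=2 --convert-links --input-file=urls.txt

# Alternatively, a shallow recursive mirror confined below the
# start page:
# wget -r -l 2 -np -k -w 2 http://www.example.org/docs/
```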

I will try out wget asap to see if it suits my needs - it took a while to
find a Windows port. But in case anyone else wants to try, it can be found
here:

[Banck Per Magnus DD GL]   /Magnus
Per Magnus Banck                address@hidden
Electoral Information Service,  Box 4186,  SE-10264 Stockholm (Sweden)
