
Re: LYNX-DEV Re: Lynx as a web crawler

From: Foteos Macrides
Subject: Re: LYNX-DEV Re: Lynx as a web crawler
Date: Thu, 13 Nov 1997 11:09:04 -0500 (EST)

"Matthew G. Saroff" <address@hidden> wrote:
>       I use the traverse and crawl options to work my way through a web
>site, and then I look at the text files dumped for contact information.
>One can limit the web retrievals to one site, but I'd rather limit the
>depth traversed.
>       Is there a way to do this?

        There are four control files for the traversal/crawl feature,
whose paths are defined via symbols in userdefs.h.  The one defined
via TRAVERSE_REJECT_FILE can be used analogously to a robots.txt
file for crawlers, to limit what will be traversed.  See the
distribution's CRAWL.announce file for more information.  The Lynx
traversal feature is intended for local site management, not general
Web crawling, and that's why it uses its own control files rather
than a server's robots.txt file (so you can traverse stuff you don't
want outside crawlers to traverse).  If you are using it to traverse
a server other than your own, you should fetch that server's
robots.txt file and create an analogous reject file for Lynx.
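        As an illustrative sketch only (the URLs are made up, and the
exact file syntax is documented in CRAWL.announce, so treat this as an
assumption rather than verbatim Lynx syntax), such a reject file might
list one URL per line, with a trailing asterisk acting as a prefix
wildcard:

```
http://www.example.com/cgi-bin/*
http://www.example.com/private/secret.html
```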

        For what you are doing, it might be more efficient and less
intrusive to use wget (available from the GNU software distributions).
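        Since the original question was about limiting traversal depth,
note that wget does this directly with its -l (--level) option; a
representative invocation (the URL is illustrative) might be:

```shell
# Recursive fetch limited to depth 2, never ascending above the start
# URL (-np), with a 1-second pause between requests (-w 1) to be polite:
wget -r -l 2 -np -w 1 http://www.example.com/
```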

        You also could write a simple script to invoke Lynx with -dump
and output to files for a set of URLs.  The References sections of the
output files will show any mailto URLs in those documents.
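        A minimal sketch of such a script (the URL list, output
directory, and filename scheme are all illustrative assumptions, not
part of anything distributed with Lynx):

```shell
#!/bin/sh
# Hypothetical sketch of the approach described above: run `lynx -dump`
# over a fixed list of URLs and write each dump to its own file.
urls="http://www.example.com/ http://www.example.com/contact.html"
outdir=dumps
mkdir -p "$outdir"

# Turn a URL into a safe filename: drop the scheme, map / to _.
url_to_file() {
    echo "$1" | sed -e 's|^[a-z]*://||' -e 's|/|_|g'
}

for u in $urls; do
    out="$outdir/$(url_to_file "$u").txt"
    # Only attempt the dump if lynx is actually installed.
    if command -v lynx >/dev/null 2>&1; then
        lynx -dump "$u" > "$out"
    fi
done

# mailto URLs then show up in the numbered References section of each
# dump, e.g.:  grep -h 'mailto:' "$outdir"/*.txt
```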


 Foteos Macrides            Worcester Foundation for Biomedical Research
 address@hidden         222 Maple Avenue, Shrewsbury, MA 01545
