Re: LYNX-DEV Downloading a whole web site for local offline browsing


From: David Woolley
Subject: Re: LYNX-DEV Downloading a whole web site for local offline browsing
Date: Wed, 1 Apr 1998 08:30:12 +0100 (BST)

> This is going to set some of you "off" but I've been getting hit hard
> by the bozos who download our whole site (~20MB of text)... we get
> about 19 of those a week (on average - ie. 19 complete "sucks" from
> the title page down to the last stupid icon used on a well-hidden
> page that no one in their right mind has ever downloaded before).
> 
> I'm _seriously_ thinking about "interlocking" portions of my site
> with CGI's that ask you to check off some random box that you

I assume you already have a robots.txt file and, as far as possible, the
equivalent ROBOTS META tag.  If you don't, then even the best-behaved
web grabber will feel free to walk all over you.  (Note that the last
version of wget I got had a bug which meant that robots.txt files
containing comments might not be honoured - this will be fixed in the
next version, but I don't know if it is out yet.)  Lynx doesn't process
these controls.  (On another note, people would probably just change the
user agent to make it look like AltaVista's crawler if you did implement
this in Lynx.)
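
For what it's worth, a minimal robots.txt would look something like the
following (the /icons/ and /private/ paths are only examples):

    User-agent: *
    Disallow: /icons/
    Disallow: /private/

and the per-page equivalent is a META tag in the document HEAD:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">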

Incidentally, if you want to be subtle, you should reject with the
"305 Use Proxy" HTTP status, as proper use of proxies would reduce your
problem as well.
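
Such a rejection is just an ordinary response carrying the 305 status
line and the proxy's address in the Location field, along the lines of
(the proxy address below is only a placeholder):

    HTTP/1.1 305 Use Proxy
    Location: http://proxy.example.net:3128/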

> want to proceed to next "tree" of documents. I don't have a problem
> with people who *consciously* download the whole site (though we
> have mirrors for "those" excuses and offer "off-line" ZIPped

I wish more sites would do this, and conspicuously, but the reality is
that this is beyond the technical abilities of most web users (the demand
for integration in Lynx is really to do with dumbing down as well).
Failure to use proxies is also due to lack of sophistication (although
the active content being promoted by Microsoft, etc., tends to nullify
their value).

Actually, I would rather that sites included a PDF or plain PostScript
copy of the site as a single file as well, as I find that is often a
better way of handling a whole site (maybe a consolidated Lynx -dump for
text-only users).  PDF is already compressed if produced with the Adobe
tools; ghostscript 5.10 compresses the text, but not the images.  The
same ghostscript can directly read gzipped PostScript, although in that
sort of use I would have no problem with gunzipping it myself (the
average user of the MSIE subscribe feature can be assumed not to be able
to cope with plain PostScript, though).
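
For the text-only version, a rough sketch of the sort of thing I mean
(urls.txt is just a hypothetical list of the pages to include, one URL
per line):

    : > site.txt
    while read url
    do
        lynx -dump "$url" >> site.txt
    done < urls.txt
    gzip -9 site.txt      # leaves a single site.txt.gz for download

Anyone who can cope with a ZIPped mirror can cope with that.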

> versions too) but I *DO* have a *BIG* problem with people who
> do it with automated and braindead tools just to save themselves
> some time... and guess who pays for that? Yup... the provider.
> 

> Not a _direct_ flame to people asking for this... mind you... just
> please _consider_ that if you write any tools that do this. Things
> like MOMspider at least _wait_ 5-10 seconds between fetches! Those

That's an option in wget, but not the default.  It would be totally
unacceptable to most of the people wanting to do this with Lynx.  The
main reason for grabbing sites, at least in the UK, is that it is much
cheaper to fetch them quickly and then disconnect the phone line than to
read them interactively.  I'd suggest that only people with a permanent
connection, trying to mirror a site, would throttle the transfer
deliberately, given the choice.
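
For reference, the wget option is --wait (-w); a deliberately polite
recursive fetch might look like this, with the URL obviously just a
placeholder (-np stops it wandering above the starting directory):

    wget -r -np -w 10 http://www.example.com/docs/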

> assholes from "Black Widow" and MicroShaft do NOT... so they
> load down the server like there was no tomorrow. But, why not...
> it's all free. May they roast in hell.

Generally, commercial software providers pander to customer wants, not
to the best interests of the network (see also the comment about active
content above).  However, Microsoft do honour robot controls.
