[Bug-wget] [bug #30999] wget should respect robots.txt directive crawl-d

From:	Tim Ruehsen
Subject:	[Bug-wget] [bug #30999] wget should respect robots.txt directive crawl-delay
Date:	Thu, 09 Apr 2015 20:25:43 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.6.0

Follow-up Comment #6, bug #30999 (project wget):

Crawl-delay is host/domain specific. Thus a wget -r 'domain1 domain2 domain3'
can't simply wait 'crawl-delay' seconds after each download. We need some
specific logic when dequeuing the next file. Also, how does --wait come into
play? The user might be able to override crawl-delay for domain1 but not for
domain2 and domain3.
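
To make the dequeuing problem concrete, here is a minimal sketch in Python (not wget code, and not a proposed patch). It assumes the crawl-delay values have already been parsed per host, and it assumes one possible answer to the --wait question: the larger of the two delays wins. The class name HostDelayQueue and all of its methods are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


class HostDelayQueue:
    """Hypothetical download queue that enforces a delay per host."""

    def __init__(self, crawl_delay, wait=0.0):
        # crawl_delay: host -> seconds, as read from each host's robots.txt
        # wait: global minimum delay, standing in for wget's --wait
        self.crawl_delay = dict(crawl_delay)
        self.wait = wait
        self.pending = deque()
        self.next_allowed = {}  # host -> earliest start time of next request

    def delay_for(self, host):
        # Assumed policy: the larger of --wait and the host's crawl-delay.
        return max(self.wait, self.crawl_delay.get(host, 0.0))

    def enqueue(self, url):
        self.pending.append(url)

    def dequeue(self, now):
        """Return (url, sleep_seconds) for the URL that can start soonest."""
        if not self.pending:
            return None, 0.0
        # Pick the queued URL whose host becomes available first;
        # ties keep the original queue order.
        best = min(
            self.pending,
            key=lambda u: self.next_allowed.get(urlsplit(u).hostname, 0.0),
        )
        self.pending.remove(best)
        host = urlsplit(best).hostname
        start = max(now, self.next_allowed.get(host, 0.0))
        # The delay is counted from the request start, not its completion.
        self.next_allowed[host] = start + self.delay_for(host)
        return best, start - now


q = HostDelayQueue({"domain1": 10.0, "domain2": 2.0}, wait=1.0)
for u in ("http://domain1/a", "http://domain1/b",
          "http://domain2/c", "http://domain3/d"):
    q.enqueue(u)
order = [q.dequeue(now=100.0) for _ in range(4)]
```

With these inputs the second domain1 URL is pushed to the back and is the only one that has to sleep, while domain2 and domain3 are fetched in the meantime. The sketch ignores parallel connections, redirects to other hosts, and a per-host user override, which are exactly the corner cases that would need a decision.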

Today, web servers often allow for 50+ parallel connections from one client -
I really don't see the point in implementing crawl-delay.

I could change my mind if someone has a *really* good reason for it *and* comes
up with a good algorithm / patch to handle all corner cases.


    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?30999>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
