Re: [Bug-wget] Does wget check if specified user agent is allowed in robots.txt?
Sat, 21 Jun 2014 21:40:07 +0200
Thank you so much! This is a perfect reply. Couldn't have asked for more.
While not a bug, an additional idea came to mind while reading your
reply. If this robots checking is fixed, it would be great to be able
to enable it for simple, one-off requests as well. As I understand it,
wget currently checks robots.txt only in recursive mode.
Perhaps we could have an option such as --robots on...
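To illustrate what I mean by a one-off check, here is a rough sketch in
Python using the standard library's `urllib.robotparser`. It is not how
wget works or would implement this; the robots.txt content, the helper
name `is_allowed_for` and the example.com URLs are all made up for the
example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: MyBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed_for(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("MyBot 1.0", "OtherBot"):
        ok = is_allowed_for(ROBOTS_TXT, agent, "http://example.com/index.html")
        print(agent, "allowed" if ok else "blocked")
```

The point is only that the check is made once, for a single URL, with
the user agent that will actually be sent.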
On Sat, Jun 21, 2014 at 9:31 PM, Darshit Shah <address@hidden> wrote:
> I responded to your original question on Stack Overflow. However for
> completeness and to document facts, I'll add a response here too.
> The answer to your question is: No. Sadly enough, Wget does NOT check
> for the user agent string it is using when parsing the robots file. It
> simply reads rules for `User-Agent: *` and `User-Agent: wget` giving
> preference to the rules specified for Wget alone.
> This also has another major implication. Wget seems to be reading and
> adhering to robots rules ONLY for * and wget. Which means that not
> only does Wget ignore the correct robots exclusion rules, it even
> follows the wrong set of rules if Wget is using a different User-Agent
> and the website provides a set of rules for Wget.
> This bug can be seen in action by the test case I created. Apply the
> attached patch and run the Test--UA.py test. The patch is made against
> the new python based test suite which exists in the parallel-wget
> On Fri, Jun 20, 2014 at 2:47 AM, György Chityil
> <address@hidden> wrote:
> > If I specify a custom user agent for wget, eg "MyBot 1.0 (address@hidden
> > Will wget check this in robots.txt as well, if the bot was banned, or
> > the general robot exclusions? Does wget check if "MyBot" is allowed to
> > crawl?
> > If not, this would be a nice feature. If yes, it would be great to add
> > this info in the robots overview here https://www.gnu.org/software/wget
> > I originally posted this question here, but then I found this list
> > --
> > Gyuri
> > 274 44 98
> > 06 30 5888 744
> Thanking You,
> Darshit Shah
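The difference between the behaviour described in the quoted reply
(only the `wget` and `*` groups are considered, with `wget` preferred)
and the behaviour asked about in the original question can be modelled
with a small Python sketch. This is a simplified model, not wget's
actual code: it handles only `User-agent` and `Disallow` lines, and the
function names, the sample robots.txt and the `MyBot/1.0` string are
assumptions made for the example.

```python
SAMPLE_ROBOTS = """\
User-agent: wget
Disallow: /

User-agent: MyBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def parse_groups(robots_txt):
    """Map each lower-cased User-agent token to its Disallow prefixes."""
    groups = {}
    current = []       # agents named by the group being read
    seen_rule = False  # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                current, seen_rule = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow":
            seen_rule = True
            if value:  # an empty Disallow forbids nothing
                for agent in current:
                    groups[agent].append(value)
    return groups


def rules_as_described(groups, user_agent):
    """Model of the reported behaviour: the configured agent is ignored."""
    return groups["wget"] if "wget" in groups else groups.get("*", [])


def rules_for_actual_agent(groups, user_agent):
    """Model of the requested behaviour: match on the agent's first token."""
    words = user_agent.split("/")[0].split()
    token = words[0].lower() if words else ""
    return groups[token] if token in groups else groups.get("*", [])


def path_allowed(rules, path):
    return not any(path.startswith(prefix) for prefix in rules)


if __name__ == "__main__":
    groups = parse_groups(SAMPLE_ROBOTS)
    agent = "MyBot/1.0"
    print(path_allowed(rules_as_described(groups, agent), "/index.html"))
    print(path_allowed(rules_for_actual_agent(groups, agent), "/index.html"))
```

With this sample file the two models disagree for `/index.html`: the
first applies the `wget` group even though the request would be sent as
`MyBot/1.0`, which is the mismatch the quoted reply points out.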