Re: [Bug-wget] does wget honor robots meta tag?
From: Micah Cowan
Subject: Re: [Bug-wget] does wget honor robots meta tag?
Date: Fri, 14 Nov 2008 14:50:39 -0800
User-agent: Thunderbird 2.0.0.17 (X11/20080925)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Allan Spiegel wrote:
> I have some pages with
>
> <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
>
> in the <head> section and when I crawl my site to make sure I have this
> tag in all the right pages, wget gets these pages. does wget support
> this tag?
It does (provided you don't have "robots=off" in your .wgetrc or somesuch).
However, looking at the code:
    while (*content)
      {
        /* Find the next occurrence of ',' or the end of
           the string.  */
        char *end = strchr (content, ',');
        if (end)
          ++end;
        else
          end = content + strlen (content);
        if (!strncasecmp (content, "nofollow", end - content))
          ctx->nofollow = true;
        content = end;
      }
It looks to me like it won't work if nofollow isn't the last item in the
list (because it sets end to one past the comma, so the comma itself is
included in the comparison), and additionally it doesn't like whitespace
(even though the examples at robotstxt.org show spaces). I suspect it'll
work if you remove the space after the comma (and I'll file a bug for
these, and fix 'em soon).
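
To make the two problems concrete, here is a rough sketch, not the eventual
wget fix. The helper names nofollow_quoted and nofollow_sketch are
hypothetical, and both return a bool instead of setting ctx->nofollow so they
can be tried in isolation. The first wraps the loop exactly as quoted; the
second is one possible way to skip whitespace and compare the whole item
regardless of its position in the list.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* The loop as quoted above, wrapped in a hypothetical helper that
   returns the flag instead of setting ctx->nofollow.  */
static bool
nofollow_quoted (const char *content)
{
  bool nofollow = false;
  while (*content)
    {
      const char *end = strchr (content, ',');
      if (end)
        ++end;                  /* now one past the comma */
      else
        end = content + strlen (content);
      /* For any item but the last, the length includes the comma.  */
      if (!strncasecmp (content, "nofollow", end - content))
        nofollow = true;
      content = end;
    }
  return nofollow;
}

/* One possible rewrite (an assumption, not wget's actual code):
   trim whitespace around each item and require an exact,
   case-insensitive match on the whole item.  */
static bool
nofollow_sketch (const char *content)
{
  static const char token[] = "nofollow";
  const size_t token_len = sizeof token - 1;

  while (*content)
    {
      /* Skip separators and leading whitespace.  */
      while (*content == ',' || isspace ((unsigned char) *content))
        ++content;

      /* The item runs up to the next comma or the end of the string.  */
      const char *end = content + strcspn (content, ",");

      /* Trim trailing whitespace from the item.  */
      const char *last = end;
      while (last > content && isspace ((unsigned char) last[-1]))
        --last;

      if ((size_t) (last - content) == token_len
          && !strncasecmp (content, token, token_len))
        return true;

      content = end;
    }
  return false;
}
```

With these, nofollow_quoted misses both "NOINDEX, NOFOLLOW" and
"NOFOLLOW,NOINDEX" but accepts "NOINDEX,NOFOLLOW", which matches the reading
of the code above; nofollow_sketch accepts all three.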
- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkkeALAACgkQ7M8hyUobTrFnrwCgiBXpvljZdtHmi8DP/EtiHPKU
wIgAniOVBoPCtfUBKlZJmxx0R022Cwbx
=huPt
-----END PGP SIGNATURE-----