Re: [Bug-wget] does wget honor robots meta tag?


From: Micah Cowan
Subject: Re: [Bug-wget] does wget honor robots meta tag?
Date: Fri, 14 Nov 2008 14:50:39 -0800
User-agent: Thunderbird 2.0.0.17 (X11/20080925)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Allan Spiegel wrote:
> I have some pages with
> 
>            <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
> 
> in the <head> section and when I crawl my site to make sure I have this
> tag in all the right pages, wget gets these pages. does wget support
> this tag?

It does (provided you don't have "robots=off" in your .wgetrc or somesuch).

However, looking at the code:

          while (*content)
            {
              /* Find the next occurrence of ',' or the end of
                 the string.  */
              char *end = strchr (content, ',');
              if (end)
                ++end;
              else
                end = content + strlen (content);
              if (!strncasecmp (content, "nofollow", end - content))
                ctx->nofollow = true;
              content = end;
            }

It looks to me like it won't work if nofollow isn't the last item in the
list (because it sets end to one past the comma, so the comma gets
included in the comparison), and additionally doesn't like whitespace
(even though the examples at robotstxt.org show spaces). I suspect it'll
work if you remove the space after the comma (and I'll file a bug for
these, and fix 'em soon).
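
For illustration only, here is a minimal sketch of one way the list could
be tokenized so that neither the position of nofollow nor surrounding
whitespace matters. It is not the fix that went into wget:
content_has_nofollow is a hypothetical standalone helper, whereas the real
loop sets ctx->nofollow inside the tag handler. It assumes POSIX
strncasecmp from <strings.h>.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper, not wget's actual code: returns true if the
   comma-separated CONTENT attribute of a robots meta tag contains the
   "nofollow" directive, ignoring case and surrounding whitespace.  */
static bool
content_has_nofollow (const char *content)
{
  static const char directive[] = "nofollow";
  const size_t directive_len = sizeof directive - 1;

  assert (content != NULL);

  while (*content)
    {
      /* The token ends at the next ',' or at the end of the string;
         the comma itself is not part of the token.  */
      const char *end = strchr (content, ',');
      if (!end)
        end = content + strlen (content);

      /* Trim whitespace from both sides of the token.  */
      const char *tok = content;
      const char *tok_end = end;
      while (tok < tok_end && isspace ((unsigned char) *tok))
        ++tok;
      while (tok_end > tok && isspace ((unsigned char) tok_end[-1]))
        --tok_end;

      /* Require an exact-length match so that neither a prefix such as
         "no" nor a longer word matches.  */
      if ((size_t) (tok_end - tok) == directive_len
          && !strncasecmp (tok, directive, directive_len))
        return true;

      /* Skip past the comma, if any, to the start of the next token.  */
      content = *end ? end + 1 : end;
    }
  return false;
}
```

With this version the comparison length comes from the trimmed token
rather than from a pointer one past the comma, so "NOINDEX, NOFOLLOW" and
"nofollow,noindex" are both recognized.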

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkeALAACgkQ7M8hyUobTrFnrwCgiBXpvljZdtHmi8DP/EtiHPKU
wIgAniOVBoPCtfUBKlZJmxx0R022Cwbx
=huPt
-----END PGP SIGNATURE-----
