
Re: [Bug-wget] Wget Starting Questions


From: Micah Cowan
Subject: Re: [Bug-wget] Wget Starting Questions
Date: Sun, 19 Apr 2009 12:07:15 -0700
User-agent: Thunderbird 2.0.0.21 (X11/20090318)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Jason Todd Slack-Moehrle wrote:
> Hi All,
> 
> I have some starting Wget questions that I am hoping to gain insight about.
> 
> I want to start at Dmoz.org and follow links for entertainment (like
> concerts, art gallery events, etc) and examine the link to see if I
> should get data back about it and from it.
> 
> My questions:
> 
> 1. Can Wget start at a given URL and examine every link (based upon my
> criteria)? (obviously I can write Case or If/Else or While to do this)

There are -A and -R, which let you decide whether to follow a link based
on wildcard matching; however, they don't match against the query-string
portion of the URL (whatever follows a ? in the URL). Also, any link
with a filename ending in .htm or .html will always be downloaded,
irrespective of -A or -R, so that it can be checked for further links to
recurse into; the file itself is deleted afterwards if it doesn't match
the accept/reject rules.
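
To make the matching behaviour described above concrete, here is a rough
Python model of it. This is only a sketch of the rules as stated in this
message, not wget's actual code: the function names are made up for
illustration, and it handles only wildcard patterns (it ignores, for
instance, that wget treats an -A/-R entry without wildcards as a suffix).

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit


def url_filename(url):
    """Return the last path component of a URL, without the query string."""
    return basename(urlsplit(url).path)


def should_keep(url, accept=(), reject=()):
    """Apply accept/reject wildcard patterns to the filename only.

    Hypothetical helper: models the rule that the part after '?' is
    never seen by the patterns.
    """
    name = url_filename(url)
    if any(fnmatchcase(name, pat) for pat in reject):
        return False
    if accept:
        return any(fnmatchcase(name, pat) for pat in accept)
    return True


def should_fetch(url, accept=(), reject=()):
    """Decide whether a recursive crawl would request the URL at all.

    .htm/.html files are always fetched so their links can be followed,
    even if should_keep() later says the file is to be deleted.
    """
    if url_filename(url).endswith((".htm", ".html")):
        return True
    return should_keep(url, accept, reject)
```

With these helpers, a pattern such as `*concert*` does not match
`http://example.org/list.php?cat=concert`, because only `list.php` is
compared; and `index.html` is fetched under an accept list of `*.pdf`
even though it would not be kept.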

> 2. If I find a link that has certain keywords that I find of interest,
> can I hit that link of interest and get information from that page?

It's not clear to me what you mean by that... if you mean, can wget
respond based on page content, then no.

> 3. How do I get the information about the link of interest and its
> content of interest into a MySQL database? (I know ColdFusion and MySQL
> and PHP). I think what I am asking is how do I get back to my database
> from a crawler?

Wget won't be much help in achieving that. Scanning the downloaded
content (and wget's log) after the crawl has finished might.
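
As a rough illustration of what such post-download scanning could look
like, the sketch below walks a directory tree that wget has already
written, counts keyword hits per file, and stores them in a database.
Everything specific here is an assumption: the keyword list, the table
layout and the function name are invented, and sqlite3 stands in for
MySQL only so that the example runs without a server.

```python
import os
import re
import sqlite3

# Hypothetical keywords of interest; adjust to taste.
KEYWORDS = ("concert", "gallery")


def scan_tree(root, db_path, keywords=KEYWORDS):
    """Scan every file under root and record keyword counts per file.

    Returns the number of rows inserted into the (assumed) 'hits' table.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords),
                         re.IGNORECASE)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS hits "
                 "(path TEXT, keyword TEXT, count INTEGER)")
    inserted = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Downloaded pages may be in any encoding; replace bad bytes.
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            counts = {}
            for match in pattern.finditer(text):
                kw = match.group(0).lower()
                counts[kw] = counts.get(kw, 0) + 1
            for kw, n in sorted(counts.items()):
                conn.execute("INSERT INTO hits VALUES (?, ?, ?)",
                             (os.path.relpath(path, root), kw, n))
                inserted += 1
    conn.commit()
    conn.close()
    return inserted
```

This would be run after a recursive wget run has finished, pointing
`root` at the directory wget created. Moving to MySQL would mostly mean
swapping the connection for a MySQL client library and adjusting the
parameter placeholders.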

> 4. I bought Webbots, spiders and screen scrapers in PHP and so far it is
> interesting, but I am wondering what best practices are..

I can't speak much to that myself.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknrdmIACgkQ7M8hyUobTrFdBwCff8nhEJrw/4w4+XMSfBxMbEaE
AeIAn1bNtHZJNiIVrqDAd7PjnL91UB1x
=rCco
-----END PGP SIGNATURE-----



