Re: [Bug-wget] request for help with wget (crawling search results of a website)
Sun, 3 Nov 2013 20:18:57 +0100
On 03.11.2013 at 09:13, Altug Tekin <address@hidden> wrote:
> I am trying to crawl the search results of a news website using *wget*.
> The name of the website is *www.voanews.com*.
> After typing in my *search keyword* and clicking search on the website, it
> proceeds to the results. Then I can specify a *"to" and a "from"-date* and
> hit search again.
> After this the URL becomes:
> and the actual content of the results is what i want to download.
> To achieve this I created the following wget-command:
> wget --reject=js,txt,gif,jpeg,jpg \
> --accept=html \
> --user-agent=My-Browser \
> --recursive --level=2 \
> Unfortunately, the crawler doesn't download the search results. It only
> gets into the upper link bar, which contains the "Home, USA, Africa, Asia, ..."
> links, and saves the articles they link to.
> *It seems like the crawler doesn't check the search result links at all.*
> *What am I doing wrong, and how can I modify the wget command to download
> only the links in the search results list (and of course the pages they link to)?*
You need to inspect the URLs of the search results and restrict the
crawl to only those. Maybe a --no-parent is enough.
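As a rough sketch, combining --no-parent with an --accept-regex filter
(available since wget 1.14) could look like the following. The regex
pattern and the start URL here are assumptions for illustration; inspect
the actual result links on the page and adjust them accordingly:

```shell
# Hypothetical sketch: restrict recursion to the search-result URL space.
# The '/search' pattern and the start URL are placeholders, not the real
# structure of www.voanews.com -- check the actual result URLs first.
wget --recursive --level=2 \
     --no-parent \
     --accept-regex='/search' \
     --reject=js,txt,gif,jpeg,jpg \
     --user-agent=My-Browser \
     'http://www.voanews.com/search'
```

--no-parent keeps wget from ascending above the start directory, and
--accept-regex discards any followed URL that does not match the pattern,
so links into the site-wide navigation bar would be skipped.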
"You don't become great by trying to be great, you become great by wanting to
and then doing it so hard that you become great in the process." - xkcd #896