From: Dagobert Michelsen
Subject: Re: [Bug-wget] request for help with wget (crawling search results of a website)
Date: Sun, 3 Nov 2013 20:18:57 +0100


On 3 Nov 2013, at 09:13, Altug Tekin <address@hidden> wrote:
> I am trying to crawl the search results of a news website using *wget*.
> The name of the website is *www.voanews.com <http://www.voanews.com>*.
> After typing in my *search keyword* and clicking search on the website, it
> proceeds to the results. Then I can specify a *"to" and a "from"-date* and
> hit search again.
> After this the URL becomes:
> http://www.voanews.com/search/?st=article&k=mykeyword&df=10%2F01%2F2013&dt=09%2F20%2F2013&ob=dt#article
> and the actual content of the results is what I want to download.
> To achieve this I created the following wget-command:
> wget --reject=js,txt,gif,jpeg,jpg \
>     --accept=html \
>     --user-agent=My-Browser \
>     --recursive --level=2 \
> www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt#article
> Unfortunately, the crawler doesn't download the search results. It only
> gets into the upper link bar, which contains the "Home,USA,Africa,Asia,..."
> links and saves the articles they link to.
> *It seems like the crawler doesn't check the search result links at all*.
> *What am I doing wrong, and how can I modify the wget command to download
> only the search result links (and of course the pages they link to)?*

You need to inspect the URLs of the results and make sure to
download only those. Maybe --no-parent is enough.
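
As a rough sketch of that idea (not a tested recipe for this site): in the
command as quoted, the URL is not in quotes, so the shell treats each "&" as
"run in background" and wget most likely only ever sees the part up to
"?st=article". Quoting the URL passes the whole query string. The "#article"
fragment is never sent to the server, so it is dropped here. The pattern
'/content/' is purely an assumption about what the result links look like;
the names url, accept_pattern and crawl_results are made up for this example.
--accept-regex needs wget 1.14 or later.

```shell
# Single quotes keep the shell from interpreting '&' and '?'.
url='http://www.voanews.com/search/?st=article&k=germany&df=08%2F21%2F2013&dt=09%2F20%2F2013&ob=dt'

# ASSUMPTION: result links contain "/content/". Inspect the page source
# of the search results and adjust this pattern before relying on it.
accept_pattern='/content/'

# Hypothetical helper: fetch the search page, then follow only links
# whose full URL matches accept_pattern.
crawl_results() {
    wget --recursive --level=2 \
         --accept-regex="$accept_pattern" \
         --user-agent=My-Browser \
         "$url"
}
```

Calling crawl_results starts the crawl. If the result links turn out to live
below the search URL itself, --no-parent alone may do instead of the regex.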

Best regards

  -- Dago

"You don't become great by trying to be great, you become great by
wanting to do something, and then doing it so hard that you become
great in the process." - xkcd #896
