Re: [Bug-wget] just download HTML content


From: Richard Baron Penman
Subject: Re: [Bug-wget] just download HTML content
Date: Mon, 29 Jun 2009 09:27:54 +1000

On Mon, Jun 29, 2009 at 8:08 AM, Micah Cowan <address@hidden> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Richard Baron Penman wrote:
> > hello,
> >
> > When mirroring a website how do I just download HTML content (whether
> > static, PHP, ASP, etc) and ignore images, css, js, and everything else?
> > At first I thought of creating an accept list, but I can't rely on the
> > file extension because many HTML pages do not include an extension (eg
> > http://en.wikipedia.org/wiki/Foo)
> > Then I thought of a reject list, but there are so many different kinds of
> > non-HTML content.
> >
> > Is there a way to do this with wget?
>
> Not really... at some point we'd like to supply content-type-based
> accept/reject options, but this will also tend to increase the amount of
> traffic, as we'd have to send extra requests to determine the content
> type. Perhaps a robust version of it would use a mixture of heuristics
> (e.g., when a filename extension exists, make assumptions about the
> content-type)...
>
> - --
> Micah J. Cowan
> Programmer, musician, typesetting enthusiast, gamer.
> Maintainer of GNU Wget and GNU Teseq
> http://micah.cowan.name/
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iEYEARECAAYFAkpH6d8ACgkQ7M8hyUobTrF+xwCeOAlZEyfV2ranXEYJRIYTlHnn
> pBwAn3B4BURi0sUCW/gpdMrR5JMcgmv6
> =lnUH
> -----END PGP SIGNATURE-----
>


Ah, OK. Yeah, I can't think of a clean way to do it either without those
extra requests.
As a workaround, would you recommend using something like this:
--reject=".js,.css,.jpg,.png,.gif"?
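
For what it's worth, here is a rough sketch of the mixed heuristic described above, written in Python rather than as anything wget itself provides. It trusts the filename extension when there is one and only falls back to an extra HEAD request when the URL has no recognisable extension. The function names and both extension lists are my own assumptions for illustration, not wget options or behaviour.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

# Hypothetical extension lists, picked for illustration only.
HTML_EXTENSIONS = {".html", ".htm", ".php", ".asp", ".aspx", ".jsp"}
NON_HTML_EXTENSIONS = {".js", ".css", ".jpg", ".jpeg", ".png", ".gif", ".ico", ".pdf", ".zip"}
HTML_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}


def guess_from_extension(url):
    """Return True or False when the extension decides it, None when it does not."""
    path = urlsplit(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lower()
    if ext in HTML_EXTENSIONS:
        return True
    if ext in NON_HTML_EXTENSIONS:
        return False
    return None  # no extension, or one we do not recognise


def is_html(url, timeout=10):
    """Use the extension guess if available, else pay for one HEAD request."""
    guess = guess_from_extension(url)
    if guess is not None:
        return guess
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type() in HTML_CONTENT_TYPES
```

With this, a URL like http://en.wikipedia.org/wiki/Foo gets no answer from the extension check and costs one extra request, while anything ending in .css or .png is rejected without touching the network.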

Richard

