Re: [Bug-wget] Page encoding problem

From: Micah Cowan
Subject: Re: [Bug-wget] Page encoding problem
Date: Mon, 09 Jul 2012 21:42:28 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1

On 07/09/2012 08:02 PM, Owen Watson wrote:
> I'm archiving a website that (according to FF) is UTF-8, and
> text/html; charset=iso-8859-1.
> When I look at the archived page in FF it shows text in ISO-8859-1,
> and text/html; charset=iso-8859-1, and there are various problems with
> the text (e.g. some spaces replaced by Â, and apostrophes by â€™). How
> do I correct this in the command line?

Your explanation above isn't 100% discernible to me, but I'm going to
make some assumptions about what you're saying. Note that it's always
easier to look into problems if you provide (a) the URL that's giving
you problems, (b) the version of Wget you're using, and (c) the
operating system you're running on.

If the page has <meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">, but FF shows it as UTF-8, the most likely
explanation is that the server sent the "real" Content-Type headers to
FF as being "text/html; charset=UTF-8". In wget, you could verify that
the server is doing that, by adding the -S option to wget, and watching
what the server sends for Content-Type.
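
As a rough sketch of that check (the function names and the URL below are placeholders, not anything built into Wget): `-S` prints the server's response headers on stderr, and `--spider` asks Wget to request the page without saving it, so the output can be filtered down to the Content-Type line.

```shell
# Hypothetical helper: keep only the Content-Type lines from wget -S output.
# Wget indents the response headers, so leading spaces are allowed.
show_content_type() {
  grep -i '^ *content-type:'
}

# Hypothetical wrapper: request the URL without saving it and show the
# Content-Type header(s) the server actually sends.
check_content_type() {
  wget -S --spider "$1" 2>&1 | show_content_type
}

# Usage (placeholder URL):
#   check_content_type 'http://example.com/page.html'
```

If the header reports charset=UTF-8 while the page's meta tag says iso-8859-1, that would confirm the mismatch described above.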

Such situations represent a gross misconfiguration of the file in
question: the file itself claims to be latin1, but obviously isn't.

Wget does not provide an option to "fix" files that claim to be encoded
using a different encoding than they actually used. But if you're on a
Unix-style system, it would be a relatively simple matter to run
something like:

  $ find . -regex '.*\.html?' | xargs sed -i -e \
      's/charset=iso-8859-1/charset=utf-8/g'

Of course, there's more than one way to specify latin1, so this assumes
that it's specified as "iso-8859-1", and there's no space between
charset, =, and iso-8859-1, and that the phrase charset=iso-8859-1 only
occurs where it's actually specifying charset, and not, say, real text
in the body of a file.
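
A somewhat more defensive variant, offered only as a sketch: it assumes GNU sed (for `-i`, `-E`, and the `I` flag) and an `iconv` that exits non-zero on invalid input, and `fix_charset` is a made-up name. It relabels a file only when its bytes really are valid UTF-8, and it tolerates different capitalization and spaces around the `=`. It still cannot tell a real charset declaration from the same phrase appearing in body text.

```shell
# fix_charset FILE: relabel FILE as UTF-8, but only if its bytes are valid UTF-8.
fix_charset() {
  file=$1
  # iconv fails (non-zero exit) if the file is not valid UTF-8,
  # in which case the file is left untouched.
  if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
    # Match "charset", optional spaces, "=", optional spaces, an optional
    # quote, then iso-8859-1 in any case; replace only the encoding name.
    sed -i -E 's/(charset[[:space:]]*=[[:space:]]*["'\'']?)iso-8859-1/\1utf-8/I' "$file"
  fi
}

# Usage over a mirrored tree (run from the download directory):
#   find . -regex '.*\.html?' | while IFS= read -r f; do fix_charset "$f"; done
```

Files that are genuinely latin1 (and so usually not valid UTF-8) keep their original declaration.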

Hope that helps,
