Re: Problem with national characters in XHTML
From: Tomas Zerolo
Subject: Re: Problem with national characters in XHTML
Date: Sat, 1 Oct 2005 06:29:16 +0200
User-agent: Mutt/1.5.6+20040907i
On Sat, Oct 01, 2005 at 01:02:31AM +0200, Lennart Borgman wrote:
> Piet van Oostrum wrote:
[...]
> >That is just the internal representation of the character in Emacs. It's
> >not important. What matters is what Emacs writes to your file. When you
> >write out utf-8 (for example by giving the command
[...]
> So you mean that at a - what should I call it? - "text semantic level"
> the utf-8 char and the latin-1 char has the same meaning?
Yes. You put that nicely. The *character* (a dieresis) stays the same.
The *representation* (loosely referred to as `encoding') changes.
I said loosely, because with more elaborate schemes such as utf-8 there
are actually two layers: the `character set', which maps each character
to an integer (aka `code point'; the character set in this case would be
UNICODE or ISO-10646, which nowadays are equivalent), and the
representation in a file, which may be utf-8 (most common), utf-16 or
whatnot.
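To make the two layers concrete, here is a small Python sketch (Python and the variable names are my choice for illustration, nothing from the thread). It takes one character, an a with dieresis, and shows that its code point stays put while the bytes written to a file depend on the encoding:

```python
# One character: LATIN SMALL LETTER A WITH DIAERESIS.
ch = "\u00e4"

# Layer 1: the character set assigns it a code point (an integer).
code_point = ord(ch)

# Layer 2: each encoding turns that code point into different bytes.
as_latin1 = ch.encode("latin-1")
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")

print(hex(code_point))   # 0xe4
print(as_latin1)         # b'\xe4'
print(as_utf8)           # b'\xc3\xa4'
print(as_utf16)          # b'\x00\xe4'

# Decoding each byte string with the matching encoding gives back
# the very same character.
same = (as_latin1.decode("latin-1") == as_utf8.decode("utf-8") == ch)
print(same)              # True
```

Reading the bytes back with the wrong encoding is what produces the familiar garbage on screen; the character itself never changed.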
Now the advantage of utf-8: it is a variable-width encoding, and uses up
just one byte for one ASCII character (for ASCII it uses the same byte
values). So you can interpret an ASCII file ``as-is'' as a utf-8 file.
For higher characters (the ones, for example, with codes >127 in
iso-8859-1 (aka Latin1)), you need more than one byte in utf-8: two for
the Latin1 ones. The original design allowed up to 6 bytes per
character, but since RFC 3629 the limit is 4.
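A quick way to see the variable width is to encode a few characters and count bytes. This is just a sketch in Python; the sample characters are arbitrary picks of mine:

```python
# Characters from increasingly high code point ranges.
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e4",      # U+00E4, a with dieresis (in Latin1)
    "\u20ac",      # U+20AC, euro sign
    "\U0001d11e",  # U+1D11E, musical G clef (beyond the 16-bit range)
]

# Number of bytes each one needs in utf-8.
lengths = [len(s.encode("utf-8")) for s in samples]
print(lengths)  # [1, 2, 3, 4]

# Pure ASCII bytes decode to the same text under both interpretations.
ascii_bytes = b"plain ASCII text"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print(same_text)  # True
```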
The disadvantage is: it is a variable-width encoding, so you have to
process a text sequentially, byte-for-byte to get the character
boundaries right (it's designed to re-synchronize gracefully, though).
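The re-synchronization works because continuation bytes in utf-8 always have the bit pattern 10xxxxxx, while bytes that start a character never do. A minimal Python sketch of the idea follows; the helper names `is_continuation` and `next_boundary` are made up for this example:

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (middle of a character)."""
    return byte & 0xC0 == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Starting at an arbitrary offset, skip forward to the next
    position where a character begins (or to the end of the data)."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# "naive" with i-dieresis, a space, and a euro sign: 7 characters.
data = "na\u00efve \u20ac".encode("utf-8")
print(len(data))  # 10 bytes

# Offset 3 is the second byte of the i-dieresis; we recover at offset 4.
start = next_boundary(data, 3)
print(start)                        # 4
print(data[start:].decode("utf-8")) # ve €

# Counting characters still needs a pass over all the bytes.
char_count = sum(1 for b in data if not is_continuation(b))
print(char_count)  # 7
```

So you lose at most one character when you land in the middle of a sequence, but finding, say, the 1000th character still means walking the bytes from the start.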
If you want the whole story (on UNICODE, ISO10646, UTF8), see here:
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
(highly recommended). From the perspective of a web slave, see:
<http://www.w3.org/TR/REC-html40/charset.html>
HTH
-- tomas