emacs-devel
Re: Problem with national characters in XHTML


From: Tomas Zerolo
Subject: Re: Problem with national characters in XHTML
Date: Sat, 1 Oct 2005 06:29:16 +0200
User-agent: Mutt/1.5.6+20040907i

On Sat, Oct 01, 2005 at 01:02:31AM +0200, Lennart Borgman wrote:
> Piet van Oostrum wrote:
[...]
> >That is just the internal representation of the character in Emacs. It's
> >not important. What matters is what Emacs writes to your file. When you
> >write out utf-8 (for example by giving the command
[...]
> So you mean that at a - what should I call it? - "text semantic level" 
> the utf-8 char and the latin-1 char has the same meaning?

Yes. You put that nicely. The *character* (a dieresis) stays the same.
The *representation* (loosely referred to as `encoding') changes.

I said loosely because with more complex schemes such as utf-8 there are
actually two layers: the `character set', which maps each character to an
integer (aka `code point'; here the character set is UNICODE or
ISO-10646, which nowadays are equivalent), and the representation in a
file, which may be utf-8 (most common), utf-16 or whatnot.
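
To make the two layers concrete, here is a small Python sketch. It
assumes, purely for illustration, that the character in question is
U+00E4 (a with dieresis); the variable names are mine, not anything
Emacs uses. The code point stays fixed while the bytes differ per
encoding:

```python
# One character, one code point, several byte representations.
ch = "\u00e4"  # assumed example: LATIN SMALL LETTER A WITH DIAERESIS

code_point = ord(ch)                     # the integer assigned by the character set
latin1_bytes = ch.encode("iso-8859-1")   # one byte: the code point itself
utf8_bytes = ch.encode("utf-8")          # two bytes
utf16_bytes = ch.encode("utf-16-be")     # two bytes, big-endian, no byte order mark

# Decoding each representation with its own encoding gives back the same character.
same_character = latin1_bytes.decode("iso-8859-1") == utf8_bytes.decode("utf-8")

print(hex(code_point), latin1_bytes, utf8_bytes, utf16_bytes, same_character)
```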

Now the advantage of utf-8: it is a variable-width encoding that uses
just one byte for each ASCII character (with the same byte values as
ASCII). So you can interpret an ASCII file ``as-is'' as a utf-8 file.
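
A quick way to check the ASCII compatibility, sketched in Python (the
sample string is just an arbitrary illustration): the same bytes decode
to the same text under either encoding.

```python
# Pure ASCII bytes are already valid utf-8, byte for byte.
ascii_bytes = "Hello, Emacs!".encode("ascii")

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")

identical = as_ascii == as_utf8
# Every ASCII byte has its high bit clear, i.e. a value below 0x80.
all_single_byte = all(b < 0x80 for b in ascii_bytes)

print(identical, all_single_byte)
```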

For higher characters (for example, the ones with codes >127 in
iso-8859-1 (aka Latin1)), you need more than one byte in utf-8: two
bytes for the Latin1 range, and at most 4 bytes for any character under
the current definition (RFC 3629). The original design allowed up to 6.
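
To see the variable width, a minimal Python sketch with four sample
characters of my own choosing (one from each length class). Under the
current definition of utf-8 (RFC 3629), no character needs more than
4 bytes:

```python
# Number of utf-8 bytes needed for characters from different code point ranges.
samples = [
    "A",            # U+0041, ASCII
    "\u00e4",       # U+00E4, in the Latin1 range
    "\u20ac",       # U+20AC, EURO SIGN
    "\U0001d11e",   # U+1D11E, MUSICAL SYMBOL G CLEF, outside the 16-bit range
]

lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
```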

The disadvantage is: it is a variable-width encoding, so you have to
process a text sequentially, byte by byte, to get the character
boundaries right (it's designed to re-synchronize gracefully, though:
a continuation byte can never be mistaken for the start of a character).
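
The re-synchronization works because every continuation byte has the
bit pattern 10xxxxxx, which no leading byte uses. A rough Python sketch
(the helper names `is_continuation' and `next_boundary' are made up for
this example) that finds the next character boundary from an arbitrary
byte offset:

```python
def is_continuation(byte: int) -> bool:
    """True for utf-8 continuation bytes, i.e. bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first offset >= pos that starts a character (or len(data))."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# Sample text with a two-byte and a three-byte character.
data = "na\u00efve \u20ac".encode("utf-8")

# Offset 3 lands in the middle of the two-byte character; skip to offset 4.
start = next_boundary(data, 3)
print(start, data[start:].decode("utf-8"))
```

Landing in the middle of a character costs at most a few skipped bytes,
after which decoding proceeds normally.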

If you want the whole story (on UNICODE, ISO10646, UTF8), see here:

  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

(highly recommended). From the perspective of a web slave, see:

  <http://www.w3.org/TR/REC-html40/charset.html>

HTH
-- tomas
