Which Encoding? (was Re: Unicode and Guile)
From: Stephen Compall
Subject: Which Encoding? (was Re: Unicode and Guile)
Date: 26 Oct 2003 12:34:47 +0000
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.2

Tom Lord <address@hidden> writes:
> It's culturally discriminatory to regard utf-16 as worse than utf-8
> in those regards.
>
> Or, put differently, for many potential users, utf-16 is the best of
> both worlds: it optimizes the size of the most common characters
> (for some users), and it can also handle any Unicode character.
That's the thing -- it can't, at least not thinking in fixed-width
terms, which was my goal in suggesting UCS-4. A single 16-bit unit
covers only code points up to 65535, and Unicode defines code points
beyond that range (up to U+10FFFF).
I say it's the worst of both worlds (from the C API user's point of
view), because you break ASCII compatibility for 7-bit code points,
*and* you still need surrogate pairs (i.e. variable width) for code
points above 65535 (this is the difference between UTF-16 and
UCS-2).
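To make the variable-width point concrete, here is a minimal sketch of
how one code point turns into either one or two 16-bit units. The name
utf16_encode is hypothetical, made up for this illustration; it is not
part of Guile or any existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any real library): encode one Unicode
   code point as UTF-16. Writes 1 or 2 code units into out and returns
   how many were written. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Precondition: cp is in range and is not itself a surrogate. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        /* Fits in one unit; this is the only case UCS-2 can express. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Above 65535: split the remaining 20 bits across a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

So 'A' (U+0041) takes one unit, but it is 16 bits wide rather than a
single ASCII byte, while anything above U+FFFF takes two units.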
UTF-16 suffers the same problem as UTF-8: programmers may be tempted
to simply treat the data block as a fixed-width string of 16-bit
units (8-bit for UTF-8, of course), which will break on surrogate
pairs (or on multibyte sequences in UTF-8).
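A rough sketch of that trap: the number of 16-bit units in a buffer is
not the number of characters once a surrogate pair appears. The
function name below is hypothetical, and the sketch simply counts an
unpaired surrogate as one code point instead of reporting an error.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the code points in a UTF-16 buffer that
   holds n code units. Unpaired surrogates are counted as one each. */
static size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* A high surrogate followed by a low surrogate is one code
           point spread over two units, so skip the second unit. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;
        count++;
    }
    return count;
}
```

For the four units 0x0041 0xD834 0xDD1E 0x0042 this gives three code
points, whereas code that assumes fixed width would report four and
would treat s[2] as a character of its own.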
If you want to assume that Unicode will never grow out of the 16-bit
set, then UCS-2 would be a much better choice than UTF-16, IMHO. That
way, it is clear that C programs only need deal with fixed-width,
16-bit characters.
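As an illustration only, a program that wants to rely on that
fixed-width guarantee could check its input first. The helper name
below is hypothetical; it just tests that no unit falls in the
surrogate range, in which case s[i] really is the i-th character.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if none of the n code units is a
   surrogate, i.e. the buffer can safely be indexed as fixed-width
   16-bit characters. */
static bool is_fixed_width_16(const uint16_t *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF)
            return false;
    }
    return true;
}
```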
--
Stephen Compall or s11 or sirian
Since a politician never believes what he says, he is surprised
when others believe him.
-- Charles DeGaulle