RE: [Freecats-Dev] About Unicode
From: Thierry Sourbier
Subject: RE: [Freecats-Dev] About Unicode
Date: Thu, 13 Feb 2003 12:43:24 +0100
Answers about Unicode:
- Unicode maps each character to a unique code. There is only one Unicode. The
different versions are backward compatible; most software will support up to
version 3.0 (version 3.1 introduced some characters that require more than
16 bits to encode and that are rarely dealt with properly).
- There are indeed several encodings (ways of writing the codes to disk), each
with its own advantages and disadvantages. It is very easy to go from one
encoding to another (it is just a mathematical formula), and software will
often use several encodings (e.g. UTF-16 for the internal representation of
strings, UTF-8 to exchange information via sockets). Most of the time this will
be transparent to the developer.
- My guess is that Python will use UTF-16 internally (like most languages), so
the encoding question only comes up during input/output. Decide that all
communication happens in UTF-8 (if you have to choose, of course) and nobody
will complain.
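
To make the points above concrete, here is a minimal sketch, assuming a current Python 3 interpreter. The sample string and the helper names send_text / receive_text are made up for illustration and are not part of any Free CATS code. It shows the same code points written out in three encodings, a character above U+FFFF taking two 16-bit units in UTF-16, and text being encoded to UTF-8 only at the input/output boundary.

```python
# Minimal sketch, assuming Python 3; the sample text is illustrative only.
text = "caf\u00e9 \U00010400"  # U+00E9 fits in 16 bits, U+10400 does not

# The same code points, serialized with three different encodings.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")
print(len(utf8), len(utf16), len(utf32))  # 10 14 24

# Going from one encoding to another is mechanical: decode, then re-encode.
converted = utf16.decode("utf-16-be").encode("utf-8")
assert converted == utf8

# A character above U+FFFF needs a surrogate pair (two 16-bit units) in UTF-16.
surrogate_pair = "\U00010400".encode("utf-16-be")
print(surrogate_pair.hex())  # d801dc00


def send_text(s: str) -> bytes:
    """Hypothetical helper: encode to UTF-8 just before writing to a socket or file."""
    return s.encode("utf-8")


def receive_text(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8 right after reading from a socket or file."""
    return data.decode("utf-8")


assert receive_text(send_text(text)) == text
```

Inside the program only str objects are handled; the choice of encoding shows up in the two boundary helpers and nowhere else.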
> Anyway, if it's too difficult to master, we may begin with a Windows ANSI
> version.
XML is based on Unicode. For a translation tool, I don't see Unicode support as
being optional.
T.
-----Original Message-----
From: address@hidden [mailto:address@hidden] On Behalf Of Henri Chorand
Sent: Thursday, February 13, 2003 9:55 AM
To: Free CATS Dev List
Subject: [Freecats-Dev] About Unicode
Hi all,
Sooner or later, we'll have to learn more (well, more than what I actually
know) about Unicode.
A brief look at http://www.unicode.org/ convinced me brief is not enough.
The two-level FAQ (at http://www.unicode.org/faq/utf_bom.html) seems very
interesting.
For those with some spare time still, the reference book is freely available
online at:
http://www.unicode.org/uni2book/u2.html
A possible source of concern with Unicode is, there are just so many
flavors, as seen in the FAQ:
> Which do I need to be able to use from:
> UTF8, UTF16, UTF16LE, UTF16BE, UTF32,
> UTF32LE, UTF32BE?
Things seem to get worse when you read the answer:
> Hard to say. UTF-8 will be most common on the web.
> UTF16, UTF16LE, UTF16BE are used by Java and
> Windows.
> UTF32, UTF32LE, UTF32BE are used by various Unix
> systems.
> Luckily, the conversions between all of them are
> algorithmically based and fast.
And for the curious folks who want to experiment, you can use Notepad on
Windows 2000 / XP and pick one of the following save options for text files:
- ANSI
- Unicode
- Unicode big endian
- UTF-8
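
For those without a Windows box at hand, the four options can be approximated with a short Python 3 sketch. Two assumptions here: "ANSI" is taken to mean code page 1252 (it depends on the Windows locale), and the three Unicode options are taken to write a byte order mark at the start of the file, which is how Notepad behaves as far as I know. The sample text is arbitrary.

```python
import codecs

# Assumption: "ANSI" means code page 1252, as on a Western European Windows install.
text = "d\u00e9j\u00e0 vu"

# Approximation of what each Notepad save option writes to disk.
notepad_like = {
    "ANSI": text.encode("cp1252"),
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8": codecs.BOM_UTF8 + text.encode("utf-8"),
}

for option, data in notepad_like.items():
    # Show the size and the raw bytes so the differences are visible.
    print(f"{option:20} {len(data):2d} bytes  {data.hex(' ')}")
```

The first two or three bytes of each Unicode variant are the byte order mark; everything after that is the same seven characters, serialized differently.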
Well, as usual, if somebody happens to know Unicode well enough to provide a
few directions, please <shout mode on>DO SO !</shout mode off>
In a nutshell, what we need to know is:
- little endian/big endian issues between Macs, Windows PC & Unix boxes
(Linux/BSD PC for a start)
- how Python handles these by default (it would be handy if the language knows
how to manage these issues)
- "preferred" encodings within the above (I guess, one in which character
length does not vary)
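
As a starting point for the three questions above, here is a small sketch assuming a current Python 3 interpreter (the sample characters are arbitrary). It shows the two byte orders for one code point, what the plain "utf-16" codec does on the local machine, and which encoding has a fixed width per character.

```python
import sys

ch = "\u20ac"  # EURO SIGN, code point U+20AC

# Same code point, two byte orders.
little = ch.encode("utf-16-le")  # bytes ac 20
big = ch.encode("utf-16-be")     # bytes 20 ac

# The plain "utf-16" codec writes a byte order mark (BOM) first and then
# uses the byte order of the machine it runs on.
with_bom = ch.encode("utf-16")
print(sys.byteorder, with_bom.hex(" "))

# On decoding, the BOM tells the codec which order was used, so data written
# on either kind of machine reads back the same.
assert (b"\xff\xfe" + little).decode("utf-16") == ch
assert (b"\xfe\xff" + big).decode("utf-16") == ch

# Bytes per character in each encoding: only UTF-32 is fixed width.
sample = "a\u20ac\U00010400"
widths = {
    "utf-8": [len(c.encode("utf-8")) for c in sample],
    "utf-16-le": [len(c.encode("utf-16-le")) for c in sample],
    "utf-32-le": [len(c.encode("utf-32-le")) for c in sample],
}
print(widths)
```

So the byte order only matters for the 16-bit and 32-bit encodings, and a BOM (or an explicit -le / -be codec name) settles it; UTF-8 has no byte order issue at all.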
A "typically optimistic" extract:
> Hybrid systems in which UTF-16 is used as a disk storage
> format but expanding to UTF-32 in memory is also a
> popular solution combining small long term storage space
> with ease of processing.
Had this stuff been designed with ease of use in mind... ;-)
Anyway, if it's too difficult to master, we may begin with a Windows ANSI
version.
Let me know your thoughts.
Regards,
Henri
_______________________________________________
Freecats-dev mailing list
address@hidden
http://mail.nongnu.org/mailman/listinfo/freecats-dev