
Re: [Demexp-dev] Character encoding


From: David MENTRE
Subject: Re: [Demexp-dev] Character encoding
Date: Mon, 22 Oct 2007 08:57:46 +0200

Hello Lyu,

2007/10/22, Lyu Abe <address@hidden>:
> There's one thing I do not understand in character coding of the
> server's reply. When I display, for example, tag sets, I can read this:
>
> 'a_tag_label': u'citoyennet\xe9'
>
> in which  " u'citoyennet\xe9' " corresponds to an unicode encoded text,
> right?

Yes.

> Then I do not understand why we get unicode encoded strings,
> while DEMEXP is supposed to have UTF-8 encoding...

"UTF-8 is the byte-oriented encoding form of Unicode."
http://www.unicode.org/faq/utf_bom.html#2

In other words, all strings on the server are stored in UTF-8, the
byte-oriented encoding of Unicode. All exchanges between the server and
the clients are also done in UTF-8, a byte-level convention for
representing Unicode characters.
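
To make the distinction concrete, here is a small Python sketch using the
tag label quoted above. It only illustrates the relation between a Unicode
string and its UTF-8 bytes; the variable names are made up for the example
and nothing here is taken from the demexp client or server code.

```python
# The label as displayed by the client: a Unicode string (code points).
label = u'citoyennet\xe9'

# The same text as UTF-8 bytes, the form described above for storage
# and for exchanges between server and clients.
wire_bytes = label.encode('utf-8')

print(len(label))       # number of characters (code points)
print(len(wire_bytes))  # number of bytes; the accented letter takes two
print(repr(wire_bytes))

# Decoding the bytes gives back the original Unicode string.
assert wire_bytes.decode('utf-8') == label
```

So u'citoyennet\xe9' and its UTF-8 byte form are two views of the same
text: \xe9 is the code point of the accented letter, and UTF-8 writes it
as the two bytes 0xC3 0xA9.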

After that, each platform is free to do any appropriate conversion,
e.g. use a 16- or 32-bit character encoding internally if it wishes.
However, you should take care to set the default Python encoding to
UTF-8 when you talk to the server.
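
As a rough illustration of the point about conversions, the sketch below
encodes the same label in 16- and 32-bit forms and converts explicitly to
and from UTF-8 at the boundary. The helpers to_wire and from_wire are
hypothetical names invented for this example, not part of any demexp
client. Encoding explicitly, rather than changing the interpreter-wide
default, is a choice made for this illustration only.

```python
label = u'citoyennet\xe9'

def to_wire(text):
    """Hypothetical helper: Unicode string -> UTF-8 bytes to send."""
    return text.encode('utf-8')

def from_wire(data):
    """Hypothetical helper: UTF-8 bytes received -> Unicode string."""
    return data.decode('utf-8')

# A platform may use wider encodings internally if it prefers.
utf16 = label.encode('utf-16-le')  # 2 bytes per character here
utf32 = label.encode('utf-32-le')  # 4 bytes per character

print(len(to_wire(label)), len(utf16), len(utf32))

# Whatever the internal form, the round trip through UTF-8 is lossless.
assert from_wire(to_wire(label)) == label
```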

To be honest, right now, the server does not do much checking of this
encoding. The UTF-8 convention mainly came from the GTK2 interface, which
produces UTF-8 strings. :-) But that check should be added at some point.
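
For what it is worth, a check of this kind could look roughly like the
Python sketch below. This is only an assumption about what "checking the
encoding" might mean, written in Python for illustration; it is not the
server's code, and is_valid_utf8 is a made-up name.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Proper UTF-8: the accented letter is the two bytes 0xC3 0xA9.
print(is_valid_utf8(b'citoyennet\xc3\xa9'))

# Latin-1 style bytes: a lone 0xE9 is not a complete UTF-8 sequence.
print(is_valid_utf8(b'citoyennet\xe9'))
```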

Best wishes,
d.



