help-smalltalk

Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation


From: Robin Redeker
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 14:30:24 +0200
User-agent: Mutt/1.5.11+cvs20060403

On Mon, Oct 22, 2007 at 01:37:19PM +0200, Paolo Bonzini wrote:
> >The cleanest interface for the JSON parser/serializer would be to
> >receive and produce UnicodeStrings and let the programmer worry about
> >encoding.
> 
> I see.  An alternative is, in the case when you read "\uXXXX", to just 
> return Strings.  To add a UnicodeCharacter to a String stream, you just use
> 
>    aStream display: aCharacter

Hm, but then I would have to do that for every character in a String, not
only for \uXXXX escapes, if I understand you correctly, since this is valid
JSON (encoded as UTF-8):

   {"test":"にほんじん\u306b"}
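
To make the point concrete, here is a minimal sketch in Python (not the json.st code, and not GNU Smalltalk's API). It assumes the document arrives as UTF-8 octets and relies only on the standard `json` module. Both the literal characters and the \u306b escape end up as ordinary code points in the same decoded string, so a parser cannot treat only the escapes specially:

```python
import json

# The JSON text from the message, as UTF-8 octets (assumed input form).
# The doubled backslash keeps \u306b as a JSON escape rather than a Python one.
raw = '{"test":"にほんじん\\u306b"}'.encode("utf-8")

# json.loads accepts UTF-8 bytes; literal characters and \uXXXX escapes
# are both decoded into the same Unicode string.
value = json.loads(raw)["test"]

code_points = [f"{ord(ch):04X}" for ch in value]
print(code_points)
```

The last code point (from the escape) is the same as the first one (from the literal character).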

> A full implementation would probably require adding a method like this:
> 
>     PositionableStream >> encoding
>         ^collection encoding
> 
> and I can take care of a more complete implementation of Stream encoding.
> 
> There are many ways to specify encoding, for example the following:
> 
[.snip.]

Interesting, I'll keep those in mind for the next json.st iteration :)

> >Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
> >approximation.
> >It will of course break horribly if someone chooses to use e.g. German
> >umlauts in the source code in Latin-1 encoding, or maybe not. How is the
> >encoding of a literal string determined?
> 
> It is not so far, and unless one is interested in using Strings and 
> UnicodeStrings interchangeably for hashing, you should not care.  Do you 
> have an example of prior art for other languages?

Nope, Perl has strings of "integers" which can either represent octets
or Unicode characters. The interpretation is up to the programmer. So
the internal hash only operates on the integer values.
Of course you can only use strings as keys for Perl hashes, as keys
are automatically stringified (afaik, but I may be wrong here).
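
The hashing concern quoted above can be sketched as follows. This is an illustration in Python under stated assumptions, not how GNU Smalltalk or Perl hash strings: `utf8_key` is a hypothetical helper that hashes a value's UTF-8 octets, and SHA-1 simply stands in for whatever hash function is used. A Unicode string and its UTF-8 octets then agree, while the same text stored as Latin-1 octets does not:

```python
import hashlib

def utf8_key(value):
    """Hypothetical helper: hash text via its UTF-8 octets, bytes as-is."""
    octets = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    return hashlib.sha1(octets).hexdigest()

unicode_text = "Grüße"                         # text with German umlauts
utf8_octets = unicode_text.encode("utf-8")     # same text as UTF-8 octets
latin1_octets = unicode_text.encode("latin-1") # same text as Latin-1 octets

# The Unicode string and its UTF-8 octets hash to the same key ...
same = utf8_key(unicode_text) == utf8_key(utf8_octets)
# ... but the Latin-1 octets do not, which is the breakage described above.
different = utf8_key(unicode_text) != utf8_key(latin1_octets)
print(same, different)
```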

About Unicode in Perl in general:

See this Perl script (encoded in UTF-8):

http://www.ta-sa.org/files/txt/3f0babeefe692cbf6bdd62def1dd68a2.txt

Output:

306B 307B 3093 3058 3093
E3   81   AB   E3   81   BB   E3   82   93   E3   81   98   E3   82   93
FE   FF   30   6B   30   7B   30   93   30   58   30   93

The 'use utf8' at the beginning tells the parser to interpret the
source code as UTF-8 encoded Unicode, so that $string contains
Unicode characters.

After encode() is applied, $utf8_encoded and $utf16_encoded each contain a
string of characters in the range 0 to 255, which represent the
octets of the encoded strings.
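
For readers without Perl at hand, the three output rows above can be reproduced with a rough Python analogue. This is a sketch, not a translation of the linked script: the names `text` and `hex_row` are made up here, and the big-endian byte order mark is written out explicitly because Python's plain "utf-16" codec uses the platform's byte order.

```python
text = "にほんじん"

def hex_row(values):
    """Format a sequence of integers as space-separated uppercase hex."""
    return " ".join(f"{v:X}" for v in values)

code_points = [ord(ch) for ch in text]                  # one integer per character
utf8_octets = text.encode("utf-8")                      # three octets per character here
utf16_octets = b"\xfe\xff" + text.encode("utf-16-be")   # BOM + big-endian UTF-16

print(hex_row(code_points))   # 306B 307B 3093 3058 3093
print(hex_row(utf8_octets))   # E3 81 AB E3 81 BB E3 82 93 E3 81 98 E3 82 93
print(hex_row(utf16_octets))  # FE FF 30 6B 30 7B 30 93 30 58 30 93
```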

It gets interesting if you remove the 'use utf8' statement at the top of the
script, which results in this output:

E3   81   AB   E3   81   BB   E3   82   93   E3   81   98   E3   82   93
C3   A3   C2   81   C2   AB   C3   A3   C2   81   C2   ...
FE   FF   0    E3   0    81   0    AB   0    E3   0    ...

Without 'use utf8', $string already contains the octets of the UTF-8
encoded Unicode string, so encode() treats each octet as a character
and encodes it a second time.
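
The double encoding can be imitated in Python as well. Again this is only a sketch under an assumption: decoding the UTF-8 octets as Latin-1 is used here to model "each octet becomes one character in the range 0 to 255", which is what the output above suggests happens without 'use utf8'.

```python
text = "にほんじん"

# Model a string whose "characters" are really the UTF-8 octets (0..255 each).
octets_as_chars = text.encode("utf-8").decode("latin-1")

# Encoding that string again yields the doubled-up sequences.
double_utf8 = octets_as_chars.encode("utf-8")
padded_utf16 = b"\xfe\xff" + octets_as_chars.encode("utf-16-be")

print(" ".join(f"{b:X}" for b in double_utf8[:11]))
print(" ".join(f"{b:X}" for b in padded_utf16[:10]))
```

Every octet of the original UTF-8 data is 0x80 or above, so each one becomes two octets in the second UTF-8 pass, and each one gets a zero high byte in UTF-16.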

So much for Perl strings. In general, Perl doesn't care much about
Unicode and leaves it to the programmer to handle encoding and to keep
track of whether and how strings are encoded.


Robin



