Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
From: Robin Redeker
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 14:30:24 +0200
User-agent: Mutt/1.5.11+cvs20060403
On Mon, Oct 22, 2007 at 01:37:19PM +0200, Paolo Bonzini wrote:
> >The cleanest interface for the JSON parser/serializer would be to
> >receive and produce UnicodeStrings and let the programmer worry about
> >encoding.
>
> I see. An alternative is, in the case when you read "\uXXXX", to just
> return Strings. To add a UnicodeCharacter to a String stream, you just use
>
> aStream display: aCharacter
Hm, but then I would have to do that for every character in a String, not
only for \uXXXX, if I understand you correctly, since this is valid JSON
(encoded in UTF-8):
{"test":"にほんじん\u306b"}
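To illustrate the point, here is a minimal sketch in Python (used only as a stand-in; it is neither json.st nor a GNU Smalltalk API). It assumes the parser decodes the UTF-8 octets first and then works on characters, so the raw characters and the \u306b escape end up in the same character string:

```python
import json

# The document from the message, as the UTF-8 octets a parser would receive.
# The last character is written as the JSON escape \u306b; the rest are raw.
raw = '{"test":"にほんじん\\u306b"}'.encode("utf-8")

# Decode the octets once, up front, then parse the resulting characters.
doc = json.loads(raw.decode("utf-8"))
value = doc["test"]

# Raw characters and the escaped one are now indistinguishable.
code_points = " ".join(f"{ord(c):04X}" for c in value)
print(code_points)  # 306B 307B 3093 3058 3093 306B
```

The result has six characters, every one of them outside the 0 to 255 range, so none of them fits into a plain octet string without an explicit encoding step.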
> A full implementation would probably require adding a method like this:
>
> PositionableStream >> encoding
> ^collection encoding
>
> and I can take care of a more complete implementation of Stream encoding.
>
> There are many ways to specify encoding, for example the following:
>
[.snip.]
Interesting, I'll keep those in mind for the next json.st iteration :)
> >Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
> >approximation.
> >Which will of course horribly break if someone chooses to use e.g. German
> >"umlaute" in the source code in latin-1 encoding, or maybe not. How is the
> >encoding of a literal string determined?
>
> It is not so far, and unless one is interested in using Strings and
> UnicodeStrings interchangeably for hashing, you should not care. Do you
> have example of prior art for other languages?
Nope. Perl has strings of "integers" which can represent either octets
or Unicode characters. The interpretation is up to the programmer, so
the internal hash only operates on the integer values.
Of course you can only use strings as keys for Perl hashes, as keys
are automatically stringified (afaik, but I may be wrong here).
About Unicode in Perl in general:
See this Perl script (encoded in UTF-8):
http://www.ta-sa.org/files/txt/3f0babeefe692cbf6bdd62def1dd68a2.txt
Output:
306B 307B 3093 3058 3093
E3 81 AB E3 81 BB E3 82 93 E3 81 98 E3 82 93
FE FF 30 6B 30 7B 30 93 30 58 30 93
The 'use utf8' at the beginning tells the parser to interpret the
source code as UTF-8 encoded Unicode, so that $string contains
Unicode characters.
After encode() is used, $utf8_encoded and $utf16_encoded each contain a
string of characters in the range 0 to 255, which represent the
octets of the encoded strings.
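The linked script is not reproduced in this message. As a rough Python analogue (an assumption about what the script does, not a transcription of it), the following produces the same three lines from a five-character string. The name hex_octets is a hypothetical helper:

```python
import codecs

string = "にほんじん"  # five Unicode characters


def hex_octets(data: bytes) -> str:
    """Format octets as space-separated uppercase hex, without padding."""
    return " ".join(f"{b:X}" for b in data)


code_points = " ".join(f"{ord(c):X}" for c in string)
utf8_encoded = string.encode("utf-8")
# Python's plain "utf-16" codec picks the platform byte order, so the
# big-endian BOM is prepended by hand to match the FE FF shown above.
utf16_encoded = codecs.BOM_UTF16_BE + string.encode("utf-16-be")

print(code_points)                # 306B 307B 3093 3058 3093
print(hex_octets(utf8_encoded))   # E3 81 AB E3 81 BB E3 82 93 E3 81 98 E3 82 93
print(hex_octets(utf16_encoded))  # FE FF 30 6B 30 7B 30 93 30 58 30 93
```

Unlike Perl, Python keeps characters (str) and octets (bytes) in separate types, so the encode step here changes the type as well as the content.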
It gets interesting if you remove the 'use utf8' statement at the top of the
script, which results in this output:
E3 81 AB E3 81 BB E3 82 93 E3 81 98 E3 82 93
C3 A3 C2 81 C2 AB C3 A3 C2 81 C2 ...
FE FF 0 E3 0 81 0 AB 0 E3 0 ...
Without the 'use utf8', $string already contains the octets of the
UTF-8 encoded Unicode string, so encode() encodes each octet a second time.
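The same double encoding can be sketched in Python, again only as an analogue of the Perl behaviour. The assumption here is that decoding the octets as latin-1 models a string whose characters are the octet values 0 to 255; hex_octets is a hypothetical helper:

```python
import codecs


def hex_octets(data: bytes) -> str:
    """Format octets as space-separated uppercase hex, without padding."""
    return " ".join(f"{b:X}" for b in data)


# The octets of the UTF-8 encoded source text.
octets = "にほんじん".encode("utf-8")
# Treat each octet as one character in the range 0..255; latin-1 maps
# octet values to code points one-to-one.
string = octets.decode("latin-1")

code_points = " ".join(f"{ord(c):X}" for c in string)
# Encoding again now encodes the octets, not the original characters.
utf8_encoded = string.encode("utf-8")
utf16_encoded = codecs.BOM_UTF16_BE + string.encode("utf-16-be")

print(code_points)
print(hex_octets(utf8_encoded))
print(hex_octets(utf16_encoded))
```

The first line shows the 15 octets as if they were characters, and the doubly encoded UTF-8 result is 30 octets long because every one of those values is above 0x7F and takes two octets.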
Ah, so much about Perl strings. In general Perl doesn't really care much
about Unicode and lets the programmer care about encoding and keeping
track of how and whether strings are encoded.
Robin