Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation

From:	Paolo Bonzini
Subject:	Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date:	Mon, 22 Oct 2007 13:37:19 +0200
User-agent:	Thunderbird 2.0.0.6 (Macintosh/20070728)

The cleanest interface for the JSON parser/serializer would be to
receive and produce UnicodeStrings and let the programmer worry about
encoding.

I see. An alternative is, in the case when you read "\uXXXX", to just return Strings. To add a UnicodeCharacter to a String stream, you just use


   aStream display: aCharacter

A full implementation would probably require adding a method like this:

    PositionableStream >> encoding
        ^collection encoding

and I can take care of a more complete implementation of Stream encoding.

There are many ways to specify encoding, for example the following:

1) add a #on:encoding: constructor where the encoding defaults to 'UTF-8'. When creating a String to be returned, use the same encoding as the input.

2) use the aforementioned PositionableStream >> encoding method; when creating a String to be returned, use the same encoding as the input.

3) use the aforementioned PositionableStream >> encoding method and add a #on:outputEncoding: constructor, where the encoding defaults to the same encoding as the input.

4) use the aforementioned PositionableStream >> encoding method and always return UnicodeStrings. In this case, you will never find Characters whose value is >= 128 in the input (you'll find UnicodeCharacters instead!).
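
As a rough illustration of option 1 only, the constructor pair might look like the sketch below. JSONReader, #setStream:encoding: and the instance variables are hypothetical names chosen for the example, not existing GNU Smalltalk classes or methods, and the parsing itself is left out.

```smalltalk
"Hypothetical sketch of option 1. JSONReader and all of its selectors
 are illustrative names, not part of GNU Smalltalk."
Object subclass: JSONReader [
    | stream encoding |

    JSONReader class >> on: aStream [
        "Default to UTF-8 when the caller does not name an encoding."
        ^self on: aStream encoding: 'UTF-8'
    ]

    JSONReader class >> on: aStream encoding: aString [
        ^self new setStream: aStream encoding: aString
    ]

    setStream: aStream encoding: aString [
        "Private initializer: remember the input and its encoding."
        stream := aStream.
        encoding := aString
    ]

    encoding [
        "Strings handed back to the caller would use this same encoding."
        ^encoding
    ]
]
```

With this in place, (JSONReader on: aStream) encoding would answer 'UTF-8', while (JSONReader on: aStream encoding: 'ISO-8859-1') encoding would answer the encoding that was passed in. Options 2 to 4 would instead ask the stream for its encoding rather than take it as an argument.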

Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
approximation.
Which will of course horribly break if someone chooses to use e.g. German
"umlaute"
in the source code in latin-1 encoding, or maybe not. How is the encoding of a
literal string determined?

It is not so far, and unless one is interested in using Strings and UnicodeStrings interchangeably for hashing, you should not care. Do you have an example of prior art in other languages?

ASCII characters and UTF-8 please. :-) I'm also from a Latin-1 country, but I try to think as internationally as possible. :-)
That Smalltalk source code literals come in UTF-8 encoded form is a bold
assumption (which is increasingly right these days on Linux and other OSs :-)


Yes.

Paolo
