Re: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes


From: Archie Cobbs
Subject: Re: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
Date: Thu, 18 Nov 2004 08:50:11 -0600 (CST)

Jeroen Frijters wrote:
> > This is arguable in my opinion. Does the UTF-8 specification say that
> > only currently defined Unicode characters may be encoded/decoded?
> 
> It has nothing to do with defined or undefined characters. Java strings
> do not contain characters, but UTF-16 codepoints. When a Unicode
> character above 0xFFFF is put in a Java string, the character code is
> converted to two UTF-16 codepoints (a so-called surrogate pair). These
> surrogate pair codepoints are in the range 0xD800-0xDFFF and,
> conveniently, Unicode doesn't define any characters in this range, so if
> you encounter a Java char in this range, it isn't actually a Unicode
> character, but only half of one. If a string contains half of a surrogate
> pair, this string is malformed and so the UTF-8 encoder (which encodes
> Unicode characters) is right to encode this as an invalid character.
> 
> Now, it's actually possible to have the UTF-8 encoder/decoder
> encode/decode these half surrogate pairs symmetrically so if there is a
> good reason to do so, we can certainly do that.

You say "Java strings do not contain characters, but UTF-16 codepoints".
That's wrong: Java strings contain arrays of chars, and a char can take
any value in the range 0x0000-0xFFFF. Why else would the code below
compile successfully?

There is nothing "Unicode" about Java strings until you try to
interpret their elements as characters rather than as unsigned 16-bit
numbers. So it boils down to whether "UTF-8" purports to be an
encoding of Unicode character values or an encoding of unsigned 16-bit
quantities; note that Java class files are an example of where it
is used for the latter.

I'm simply complaining that the following doesn't work:

        String s = "\ud8aa";                    // a lone high surrogate, i.e. half of a pair
        byte[] b = s.getBytes("UTF-8");
        String t = new String(b, "UTF-8");
        System.out.println(s.equals(t));        // prints false!

If you run this under the JDK, it prints "false".

In other words, there are certain String objects that Sun's UTF-8 encoding
is not capable of encoding, because it doesn't handle all possible character
values in the range 0x0000-0xFFFF.
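
For what it's worth, the "modified UTF-8" used in class files (and produced
by DataOutputStream.writeUTF) encodes each char independently, surrogates
included, so the same lone surrogate does survive a round trip there. A
quick sketch, not from the patch, just to illustrate (java.io; variable
names are mine):

        // Sketch only: writeUTF emits the class-file style "modified UTF-8",
        // which encodes each char on its own, lone surrogates included.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);       // s = "\ud8aa" from above
        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(bos.toByteArray()));
        String t2 = in.readUTF();
        System.out.println(s.equals(t2));            // prints true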

> UTF = Unicode Transformation Format. Do you have any examples of code
> that uses strings to store arbitrary binary data *and* uses UTF-8
> encoding? Since Sun's implementation doesn't support it, I think it's
> unlikely that much code depends on it.

Yes, which is how I came across this bug. There are classes in Classpath
that store arbitrary binary data within String objects. When JCVM tries to
generate the corresponding C files, it outputs a C string definition
containing the UTF-8-encoded value. But this value was being corrupted
because the "illegal" String characters were being replaced by '?'.

> > What about Java class files? They contain arbitrary 16-bit characters
> > encoded using "UTF-8"... by your logic, isn't that a violation? Etc.
> 
> I don't understand what you mean here.

What I meant was: any Java compiler written in Java is another example
of arbitrary binary data in a String needing to be UTF-8 encoded, because
String constants, which can contain arbitrary characters, are encoded via
UTF-8 in class files. E.g., compile the above code and check the output.
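
To make that concrete: in modified UTF-8 the lone surrogate \ud8aa comes out
as the three bytes ED A2 AA, and the same sequence ends up in the constant
pool entry for the string literal. A rough way to check (sketch only, java.io
again; "Test.class" below is just a placeholder for whatever you compiled the
snippet into):

        // Placeholder file name -- point this at the compiled class file.
        RandomAccessFile f = new RandomAccessFile("Test.class", "r");
        byte[] cls = new byte[(int) f.length()];
        f.readFully(cls);
        f.close();
        for (int i = 0; i + 2 < cls.length; i++)
            if ((cls[i] & 0xff) == 0xed && (cls[i + 1] & 0xff) == 0xa2
                    && (cls[i + 2] & 0xff) == 0xaa)
                System.out.println("\\ud8aa encoding found at offset " + i);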

-Archie

__________________________________________________________________________
Archie Cobbs      *        CTO, Awarix        *      http://www.awarix.com



