RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes


From: Jeroen Frijters
Subject: RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
Date: Thu, 18 Nov 2004 17:37:10 +0100

Archie Cobbs wrote:
> I'm simply complaining that the following doesn't work:
> 
>       String s = "\ud8aa";
>       byte[] b = s.getBytes("UTF-8");
>       String t = new String(b, "UTF-8");
>       System.out.println(s.equals(t));        // prints false!
> 
> If you run this under the JDK, it prints "false".

The string isn't valid Unicode (it contains an unpaired surrogate), so the
UTF-8 encoder is within its rights to encode the surrogate as an invalid
character.
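
To make the behaviour concrete, here is a small sketch of the same test using the java.nio.charset API. The class and method names are made up for illustration. It assumes a JDK-style encoder: String.getBytes substitutes replacement bytes for input it cannot encode, while a fresh CharsetEncoder reports malformed input by default. Other implementations may choose a different substitute.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Classpath or the JDK.
class LoneSurrogateUtf8 {
    // Lenient path: String.getBytes never throws; unencodable input is
    // replaced with the charset's default replacement bytes.
    static String lenientRoundTrip(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        return new String(b, StandardCharsets.UTF_8);
    }

    // Strict path: a new encoder's default action for malformed input is
    // REPORT, so encode() throws instead of substituting.
    static boolean strictEncodeFails(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "\ud8aa"; // a high surrogate with no low surrogate after it
        System.out.println(s.equals(lenientRoundTrip(s)));
        System.out.println(strictEncodeFails(s));
    }
}
```

The strict path is the one to use when silent data loss is unacceptable, since it surfaces the problem at encode time.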

> In other words, there are certain String objects that Sun's 
> UTF-8 encoding is not capable of encoding, because it doesn't
> handle all possible character values in the range
> 0x0000 - 0xffff.

I understand what you mean, but you have to face the fact that the values
in the range 0xD800-0xDFFF are surrogates, not valid Unicode characters on
their own, and as such will not be encoded by UTF-8.
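
For illustration, a string can be checked up front for the values in question. This is only a sketch, and the class and method names are hypothetical. It uses Character.isHighSurrogate and Character.isLowSurrogate to find any value in 0xD800-0xDFFF that is not part of a well-formed pair.

```java
// Hypothetical helper, not part of Classpath or the JDK.
class SurrogateCheck {
    // Returns true if s contains a surrogate that is not half of a
    // well-formed high/low pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (!paired) {
                    return true; // high surrogate without a low one
                }
                i++; // skip the low half of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasUnpairedSurrogate("\ud8aa"));
        System.out.println(hasUnpairedSurrogate("abc"));
    }
}
```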

> Yes, which is how I came across this bug. There are classes 
> in Classpath that store arbitrary binary data within String
> objects.

Class files don't use UTF-8 to encode strings, they use the format used
by DataOutputStream.writeUTF (what Sun calls "modified UTF-8").

So maybe all we need to do is make sure that
DataOutputStream.writeUTF/DataInputStream.readUTF can roundtrip *any*
string (even if it has invalid Unicode characters).
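
A quick way to check that property is a test along these lines. It is a sketch with made-up class and method names. It relies on the documented writeUTF format, which encodes each 16-bit char independently (one to three bytes) after a two-byte length prefix, so a surrogate is treated like any other value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical test class, not part of Classpath or the JDK.
class ModifiedUtf8RoundTrip {
    // Serialize s with writeUTF (two-byte length prefix + modified UTF-8).
    static byte[] write(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Read back a string previously written by write().
    static String read(byte[] bytes) throws IOException {
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        }
    }

    // Check every char value 0x0000-0xFFFF, surrogates included.
    static boolean roundTripsEveryChar() throws IOException {
        for (int c = 0; c <= 0xFFFF; c++) {
            String s = String.valueOf((char) c);
            if (!s.equals(read(write(s)))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String s = "\ud8aa";
        System.out.println(s.equals(read(write(s))));
        System.out.println(roundTripsEveryChar());
    }
}
```

Note that writeUTF rejects strings whose encoded form exceeds 65535 bytes, so this only demonstrates the per-character property, not arbitrarily long strings.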

Regards,
Jeroen



