gnustep-dev

Re: Hash computation and TFB


From: Stefan Bidi
Subject: Re: Hash computation and TFB
Date: Tue, 6 Aug 2013 08:30:10 -0500

I copied the hash algorithm straight out of -base, so they should match.  I remember that a few months ago Richard was experimenting with hash functions, and that may be what is causing issues now.

I just looked it up: the changes were made in rev 36344.

There is another issue: -base allows UTF-8 strings, which will not hash to the same value as their UTF-16 equivalents.  In my opinion, allowing UTF-8 string literals is not a good idea, and -base should revert to Latin1 as the default C string encoding.  I'm actually debating adding a configure option for UTF-16 string literals to corebase.  I believe using UTF-16 internally is the only sane solution for non-ASCII encodings.
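To make the encoding concern concrete, here is a minimal sketch using the character U+00E9 as an example. It is not code from -base or corebase; widen_bytes() and widens_to_utf16() are hypothetical helpers. The point is that zero-extending Latin1 bytes yields the UTF-16 units directly, while doing the same to UTF-8 bytes does not, so UTF-8 input would need a real decoding step before it could hash to the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero-extend each 8-bit unit to a 16-bit unit.
   This is a valid Latin1 -> UTF-16 conversion, because Latin1 values
   0x00..0xFF coincide with Unicode code points U+0000..U+00FF.
   It is NOT a valid UTF-8 -> UTF-16 conversion. */
static void widen_bytes(const uint8_t *src, size_t len, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = (uint16_t)src[i];
}

/* U+00E9 (e with acute accent) in three encodings. */
static const uint8_t  e_latin1[] = { 0xE9 };
static const uint8_t  e_utf8[]   = { 0xC3, 0xA9 };
static const uint16_t e_utf16[]  = { 0x00E9 };

/* Returns 1 if widening src reproduces the expected UTF-16 units. */
static int widens_to_utf16(const uint8_t *src, size_t len,
                           const uint16_t *expect, size_t expect_len)
{
    uint16_t buf[8];
    if (len != expect_len || len > 8)
        return 0;
    widen_bytes(src, len, buf);
    for (size_t i = 0; i < len; i++)
        if (buf[i] != expect[i])
            return 0;
    return 1;
}
```

With these definitions, widens_to_utf16(e_latin1, 1, e_utf16, 1) succeeds, while widens_to_utf16(e_utf8, 2, e_utf16, 1) fails because the unit counts already differ.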

I've tried experimenting with other hash functions that are not one-at-a-time, but unfortunately have not found anything that works consistently on both ASCII and Unicode strings.  It would be really nice to be able to work with 32- or 64-bit integers directly instead of 8- or 16-bit characters.  If we could use UTF-16 across the board, this wouldn't be a problem.

Anyway, those are my thoughts.


On Tue, Aug 6, 2013 at 8:14 AM, Luboš Doležel <address@hidden> wrote:
Hello,

hash computation with Toll-Free Bridging is a tricky subject. Do it wrong and you'll get all sorts of trouble, especially with dictionaries, which use hashes a lot.

The code in corebase currently dispatches all CFHash() calls on ObjC objects to -hash, which is bad. The following expectation breaks due to this dispatch:

CFHash(@"string") == CFHash(CFSTR("string"))

because NSString uses a different hashing algorithm than CFString.
My suggestion is to do away with the ObjC dispatch in CFHash() and alter all the CF*Hash() functions to support ObjC types.

While looking at CFStringHash(), I've also noticed that either 8-bit or 16-bit raw character data is used for hashing based on what is available. I believe this breaks the following case:

===
CFStringRef str1 = CFSTR("str");
const UniChar chars[] = { 's', 't', 'r' }; // "str" in UTF-16
CFStringRef str2 = CFStringCreateWithCharacters(NULL, chars, 3);

CFHash(str1) == CFHash(str2); // should be true
===

While the two strings are obviously identical, different bytes are used to generate the hash in each case.
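The mismatch can be reproduced outside corebase with a small sketch. It assumes a byte-wise hash; Bob Jenkins' one-at-a-time function is used here purely as a stand-in, and the function corebase actually uses may differ. The 16-bit storage is written out as explicit little-endian bytes so the result does not depend on the host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Jenkins' one-at-a-time hash over raw bytes (a stand-in, not
   necessarily the hash used by corebase or -base). */
static uint32_t oat_hash(const uint8_t *bytes, size_t nbytes)
{
    uint32_t h = 0;
    assert(bytes != NULL);
    for (size_t i = 0; i < nbytes; i++) {
        h += bytes[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* "str" as stored with 8-bit characters: 3 bytes. */
static const uint8_t str_8bit[] = { 's', 't', 'r' };

/* "str" as stored with 16-bit characters (UTF-16LE): 6 bytes. */
static const uint8_t str_16bit_le[] = { 's', 0, 't', 0, 'r', 0 };

static uint32_t hash_8bit_storage(void)
{
    return oat_hash(str_8bit, sizeof str_8bit);
}

static uint32_t hash_16bit_storage(void)
{
    return oat_hash(str_16bit_le, sizeof str_16bit_le);
}
```

Because the interleaved zero bytes are mixed into the state as well, hash_8bit_storage() and hash_16bit_storage() return different values even though both buffers spell the same string.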

This problem can be solved by converting the character data to Unicode first, which has a performance impact, but only once for every CFString.
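A rough sketch of that idea, under stated assumptions: struct sketch_string is a hypothetical stand-in for a string object and is not corebase's CFString layout, the 8-bit storage is assumed to hold ASCII or Latin1 so that zero-extension is a valid conversion, and the mixing steps are again borrowed from the one-at-a-time hash. Each character is widened to a 16-bit unit before mixing, and the result is cached so the conversion cost is paid once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string object that stores either 8-bit or 16-bit
   characters. Not corebase's actual layout. */
struct sketch_string {
    const void *chars;      /* uint8_t[] if !is_wide, uint16_t[] otherwise */
    size_t      length;     /* number of characters, not bytes */
    int         is_wide;
    uint32_t    hash;       /* cached hash, valid only if hash_valid != 0 */
    int         hash_valid;
};

/* One one-at-a-time mixing step, fed a whole 16-bit unit. */
static uint32_t mix_unit(uint32_t h, uint16_t unit)
{
    h += unit;
    h += h << 10;
    h ^= h >> 6;
    return h;
}

static uint32_t sketch_hash(struct sketch_string *s)
{
    assert(s != NULL && s->chars != NULL);
    if (s->hash_valid)
        return s->hash;     /* computed earlier, reuse it */

    uint32_t h = 0;
    for (size_t i = 0; i < s->length; i++) {
        /* Widen 8-bit characters so both storage forms feed the
           same sequence of 16-bit units into the hash. */
        uint16_t unit = s->is_wide
            ? ((const uint16_t *)s->chars)[i]
            : (uint16_t)((const uint8_t *)s->chars)[i];
        h = mix_unit(h, unit);
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;

    s->hash = h;
    s->hash_valid = 1;
    return h;
}
```

Since the hash only ever sees 16-bit unit values, the storage width no longer affects the result, and repeated calls on the same object return the cached value.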

The situation with CFHash() calls on NSStrings is worse, since corebase has nowhere to save the calculated hash, so it must be recalculated every time. But I think it's better to be slow than to be wrong. Please review the attached patch and let me know if you have any observations.

--
Luboš Doležel

