Re: New ABI NSConstantString

On Thu, Apr 5, 2018 at 3:33 PM, David Chisnall <address@hidden> wrote:

This might be slightly confusing, because your mail client doesn’t seem to do anything sane for quoting:

On 5 Apr 2018, at 20:09, Stefan Bidigaray <address@hidden> wrote:
>
> On Thu, Apr 5, 2018 at 1:41 PM, David Chisnall <address@hidden> wrote:
> On 5 Apr 2018, at 17:27, Stefan Bidigaray <address@hidden> wrote:
> >
> > Hi David,
> > I forgot to make a comment when you originally posted the idea, and I think this would be a great time to add my 2 cents.
> >
> > Regarding the structure:
> > * Would it not be better to add the flags bit field immediately after the isa pointer? My thought here is that it can be checked for if different versions of the structure exist. This is important for CoreBase since it does not have the luxury of real classes.
>
> I’m concerned with structure padding here. Even on a 64-bit platform, we either need an 8-byte flags field (which is wasteful) or end up with 4 bytes of padding. With 128-bit pointers (which are probably coming sooner than you expect) we will end up with 12 bytes of padding if we have a 32-bit flags field followed by a pointer.
>
> Well, I was hoping there is a way we can define this structure so that it can be used directly in CoreBase, without having to call the toll-free bridging mechanism. If a 32-bit hash is used, could it be combined with the "flags" variable (see the structure I included at the end of this email)? I'm hoping to be able to have use the same constant strings without having to call the bridging mechanism. It's pretty slow and cumbersome.

Can you explain why CoreBase needs to store the hash as anything other than a 32-bit value that it can zero extend when returning a 64-bit value? It the CoreFoundation and Foundation implementations of hash are compatible, then it will currently be returning a 28-bit value in a 64-bit register, so I don’t understand the issue here.

>
> By the way, I noticed there was not uint32_t flags in your original structure, making it 24 bytes in 32-bit CPUs.
>
> > * Would it be possible to make the hash variable a NSUInterger? The output of -hash is an NSUInterger, and that would allow the value to be expanded in the future.
>
> We can, though that would again increase the size quite noticeably. I think I’m happy with a 32-bit hash, because as rfm points out with a decent hash algorithm that basically gives us unique hashes.
>
> Sounds reasonable.
>
> > * Why have both count and length? Would it not make more sense to keep a single variable here called count and define it as, "The count/number of code units"? For ASCII and UTF-8 this would be # of bytes, and for UTF-16 it would be the # of 16-bit codes. The Apple documentation states "The number of UTF-16 code units in the receiver", making at least the ASCII and UTF-16 numbers correct. The way I understand the current implementation, the value for length would return the UTF-32 # of characters, which is inconsistent with the docs.
>
> If a UTF-8 string contains multi-byte sequences, then the length of the buffer and the number if UTF-16 code units will be different. If we know the number of bytes, then we can use more efficient C standard library functions for things like comparisons, though that may not be important.
>
> I guess I'm still a bit confused about the meaning and/or different of the variables count and length.

One tells you the logical number of characters, the other the length of the buffer in bytes. A lot of bytes-scanning functions are far more efficient if they know the length up front, because they can then process one word at a time until the last word.

> I know this is probably going to be rejected, but how about making constant string either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this would increase the byte count for most European languages using Latin characters, but I don't see the point of maintaining both UTF-8 and UTF-16 encoding. Everything that can be done with UTF-16 can be encoded in UTF-8 (and vise-versa), so how would the compiler pick between the two? Additionally, wouldn't sticking to just 1 of the 2 encoding simplify the code significantly?

There’s also the issue that -UTF8String is one of the most commonly used methods on NSString, so if we represent something as UTF-16 internally then it needs converting and returning in an autoreleased buffer, whereas with a UTF-8 string it can just return the pointer. On non-Windows platforms, -UTF8String is the way of getting a string that you pass to pretty much any OS function.

>
> > * I would also think that it makes more sense to have the length/count variable before the data pointer. I don't have a strong opinion about this one, but it just makes more sense in my head.
>
> Again, this gives us more padding in the structure.
>
> Would it? Isn't sizeof (long) == sizeof (void *) in all 32 and 64-bit architectures (except WIN64)? I thought a long would not be padded any more than a pointer for most applications.

Not Win64, not on anything with larger than 64-bit pointers.

> >
> > Regarding the hash function:
> > Why are we using Murmur3 hash? I know it is significantly more efficient than our current one-at-a-time approach, but how much better is it to competing hash functions? Is there a bench mark out there comparing some of the major ones? For example, how does it compare with lookup3 or SpookyHash. If we are storing the hash in the string structure, the speed of calculating the hash is not as important as the spread. Additionally, Murmur3 seems ill suited if NSUInteger is used to store the hash value since, as far as I could tell, it only outputs 32-bit and 128-bit hashes. Lookup3 and SpookyHash, for example, output 64-bit values (2 32-bit words in the case of lookup3), as well.
>
> The size of the type doesn’t necessarily give us the range. We are completely free to give only a 32-bit or even 28-bit range within an NSUInteger (which is what we do now) and if we have good coverage. A good hash function has even distribution of entropy across all bits, so taking a 32-bit or 128-bit hash and truncating it is fine. That said, I’m happy to make the hash value 8 bytes on 64-bit platforms if this seems like a good use of bits.
>
> I’m not wedded to the idea of Murmur3. We do need to use the same hash for constant and non-constant strings, so execution speed is important. I’m somewhat tempted to suggest SHA256, because it’s fairly easy to accelerate with SSE and newer CPUs have full hardware offload for it. That said, the goal is not to mandate the use of the compiler-generated hash for constant strings, it’s to provide a space to store one that the compiler initialises to something sensible.
>
> Given the analysis I’ve done in the reply to Ivan, I think it’s worth consuming space to improve performance.
>
> I agree.
>
> So how about a structure like:
>
> struct {
> id isa; /* Class pointer */
> uint64_t flags;
> /* Flags bitfield:
> Low 2 bits, enum with values:
> 0: ASCII string
> 1: UTF-16 string
> 2 and 3: Reserved for future encodings
> (1<<2) to (1<<3): 0 for one-at-a-time; 1 for murmur hash; 2 and 3 reserved for future hashes
> (1<<4) to (1<<15): Reserved for future compiler-defined flags
> (1<<16) to (1<<31): Reserved for use by the constant string class (I'm hoping this could hold the CFTypeID of a constant string so it can be identified by corebase)
> (1<<32) to (1<<63): hash
> */
> const char *data; /* Pointer to the buffer. ro_data section, so immutable. NULL-terminated */
> long count; /* Number of UTF-16 code units, not including the null terminator */
> }

I don’t see why we’d use a single uint64_t rather than a pair of uint32_ts and I don’t like the ordering (it will be annoying to have to order the fields differently on 128-bit pointer platforms). I’m not convinced that it’s worth omitting the length to save 8 bytes per string. It’s probably also not actually worth using longs for the length on 64-bit platforms, so both of these should probably be 32 bits. 4GB of string literal seems a bit excessive (for one thing, I doubt the compiler will be entirely happy with it, and I don’t know happy linkers are with 4GB symbols…).

David

From:	Stefan Bidigaray
Subject:	Re: New ABI NSConstantString
Date:	Thu, 5 Apr 2018 19:25:01 -0400