guile-devel

Re: Using libunistring for string comparisons et al


From: Ludovic Courtès
Subject: Re: Using libunistring for string comparisons et al
Date: Sun, 20 Mar 2011 23:12:57 +0100
User-agent: Gnus/5.110013 (No Gnus v0.13) Emacs/23.3 (gnu/linux)

Hi Mark,

Mark H Weaver <address@hidden> writes:

> address@hidden (Ludovic Courtès) writes:
>>> We keep wide (UTF-32) stringbufs as-is, but we change narrow stringbufs
>>> to UTF-8, along with a flag that indicates whether it is known to be
>>> ASCII-only.
>>
>> The whole point of the narrow/wide distinction was to avoid
>> variable-width encodings.  In addition, we’d end up with 3 cases (ASCII,
>> UTF-8, or UTF-32) instead of 2, which seems quite complex to me.
>
> Most functions would not care about the known_ascii_only flag, so really
> it's just two cases.  (As you know, I'd prefer to have only one case).

What about string-ref/set!?  Should they widen their argument as well
when it’s UTF-8?

>> What do you think of moving to narrow = ASCII, as I suggested earlier?
>
> The problem is that the narrow-wide cases will be much more common in
> your scheme, and none of us has a good solution to those cases.  All the
> solutions that handle those cases efficiently involve an unacceptable
> increase in code complexity.

IMO ASCII + UCS-4 can only be simpler than ASCII + UTF-8 + UCS-4.

> In your scheme, a large number of common operations will require
> widening strings, which is bad for efficiency, in both space and time
> for the common operations.

My feeling is that widening will be rare enough.  For instance, most of
the time, programs compare strings in the same language, which goes
through the fast path of ‘string=?’.
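To make the widening cost concrete, here is a rough sketch of what a
narrow = ASCII / wide = UCS-4 buffer could look like.  The ‘sbuf’ type
and all function names are made up for illustration; they are not
Guile's actual internals, and the layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stringbuf; names are illustrative, not Guile's internals. */
typedef struct {
    int wide;               /* 0: one byte per char (ASCII), 1: UCS-4 */
    size_t len;             /* length in characters */
    unsigned char *narrow;  /* valid when !wide */
    uint32_t *chars;        /* valid when wide */
} sbuf;

static void *xmalloc(size_t n)
{
    void *p = malloc(n ? n : 1);
    if (!p) abort();
    return p;
}

/* Build a buffer from code points; narrow iff every one is ASCII. */
static sbuf sbuf_make(const uint32_t *cps, size_t len)
{
    sbuf s = { 0, len, NULL, NULL };
    size_t i;
    for (i = 0; i < len; i++)
        if (cps[i] > 0x7f) s.wide = 1;
    if (s.wide) {
        s.chars = xmalloc(len * sizeof *s.chars);
        memcpy(s.chars, cps, len * sizeof *s.chars);
    } else {
        s.narrow = xmalloc(len);
        for (i = 0; i < len; i++) s.narrow[i] = (unsigned char) cps[i];
    }
    return s;
}

/* O(1) in both representations. */
static uint32_t sbuf_ref(const sbuf *s, size_t i)
{
    assert(i < s->len);
    return s->wide ? s->chars[i] : s->narrow[i];
}

/* O(1) unless a non-ASCII char lands in a narrow buffer, which widens it. */
static void sbuf_set(sbuf *s, size_t i, uint32_t c)
{
    assert(i < s->len);
    if (!s->wide && c > 0x7f) {
        uint32_t *w = xmalloc(s->len * sizeof *w);
        size_t k;
        for (k = 0; k < s->len; k++) w[k] = s->narrow[k];
        free(s->narrow);
        s->narrow = NULL;
        s->chars = w;
        s->wide = 1;
    }
    if (s->wide) s->chars[i] = c;
    else s->narrow[i] = (unsigned char) c;
}

/* Same width: one memcmp (the fast path).  Mixed: compare char by char. */
static int sbuf_equal(const sbuf *a, const sbuf *b)
{
    size_t i;
    if (a->len != b->len) return 0;
    if (a->wide == b->wide)
        return a->wide
            ? memcmp(a->chars, b->chars, a->len * sizeof *a->chars) == 0
            : memcmp(a->narrow, b->narrow, a->len) == 0;
    for (i = 0; i < a->len; i++)
        if (sbuf_ref(a, i) != sbuf_ref(b, i)) return 0;
    return 1;
}
```

In this sketch only the mixed narrow/wide comparison leaves the memcmp
path, and only a non-ASCII ‘sbuf_set’ on a narrow buffer pays for
widening.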

So here’s a plan for 2.0.x.  Remember that a design goal was to have
constant-time string-ref/set!; this is debatable, I agree, but Mike, I
and others on r6rs-discuss back then thought it was desirable.

To fix the bugs you identified in 2.0.x, I’m in favor of a
narrow = ASCII scheme and of applying Mike’s suggestion.  That should
allow us to fix our bugs with minimal changes.

For 2.1.x, things are different.  I’m happy to revisit not only the
internal storage approach but also the O(1) ref/set! (the latter should
be discussed in light of the trend in other Schemes, though.)

How does that sound?

> You may not realize the extent to which UTF-8's special properties
> mostly eliminate the usual disadvantages of variable-width encodings.
> Please allow me to explain how the most common string operations can be
> implemented on UTF-8 strings.

Thanks for the info!  I was probably not aware of all these properties.
But again, one design constraint in 2.0 was to provide O(1) random
access.  Sure, this is doable with variable-width encoding, as Clinger et
al. note at <http://trac.sacrideo.us/wg/wiki/StringRepresentations>, but
the narrow/wide scheme we came up with looked like a good trade-off.
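For comparison, here is a minimal sketch of character indexing on plain
UTF-8 without any auxiliary index.  It assumes well-formed input, and
the function names are invented for the example.  Lead bytes are easy
to tell apart from continuation bytes, but reaching the Nth character
still means scanning from the start, hence O(n) rather than O(1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Continuation bytes have the form 10xxxxxx. */
static int is_cont(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Character count: every byte that is not a continuation starts one. */
static size_t utf8_length(const unsigned char *s, size_t nbytes)
{
    size_t i, n = 0;
    for (i = 0; i < nbytes; i++)
        if (!is_cont(s[i])) n++;
    return n;
}

/* Decode the code point starting at S (assumed valid UTF-8) and store
   its byte length in *ADV. */
static uint32_t utf8_decode(const unsigned char *s, size_t *adv)
{
    if (s[0] < 0x80) {
        *adv = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *adv = 2;
        return ((uint32_t) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *adv = 3;
        return ((uint32_t) (s[0] & 0x0F) << 12)
             | ((uint32_t) (s[1] & 0x3F) << 6)
             | (s[2] & 0x3F);
    }
    *adv = 4;
    return ((uint32_t) (s[0] & 0x07) << 18)
         | ((uint32_t) (s[1] & 0x3F) << 12)
         | ((uint32_t) (s[2] & 0x3F) << 6)
         | (s[3] & 0x3F);
}

/* Return the character at INDEX by walking from the first byte: O(n). */
static uint32_t utf8_ref(const unsigned char *s, size_t nbytes, size_t index)
{
    size_t pos = 0, adv;
    for (;;) {
        uint32_t c;
        assert(pos < nbytes);
        c = utf8_decode(s + pos, &adv);
        if (index == 0) return c;
        index--;
        pos += adv;
    }
}
```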

Again, if you want to experiment with UTF-8 for internal storage, then
2.1 is yours.  ;-)

BTW, giving up O(1) access may be more work than it seems.  For instance
‘string-fold’, ‘string-map’, & co. would need to be compiled
efficiently, which won’t happen until we have an inliner in the
compiler, etc.  But all these could be worthy goals for 2.1/2.2.
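As an illustration of the kind of loop ‘string-fold’ would have to
become, here is a hedged C sketch that walks a UTF-8 buffer with a
byte cursor, so the whole traversal stays linear instead of calling an
O(n) ref at every index.  The names (‘utf8_fold’, ‘count_one’,
‘max_char’) are hypothetical, and valid UTF-8 input is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*fold_fn)(uint32_t c, uint32_t acc);

/* Fold F over the characters of S, left to right, using a byte cursor.
   Each byte is visited once, so the traversal is O(nbytes) overall. */
static uint32_t utf8_fold(fold_fn f, uint32_t acc,
                          const unsigned char *s, size_t nbytes)
{
    size_t pos = 0;
    while (pos < nbytes) {
        unsigned char b = s[pos];
        /* Sequence length from the lead byte (valid input assumed). */
        size_t n = b < 0x80 ? 1
                 : (b & 0xE0) == 0xC0 ? 2
                 : (b & 0xF0) == 0xE0 ? 3 : 4;
        /* Payload bits of the lead byte: 0x1F, 0x0F or 0x07 mask. */
        uint32_t c = n == 1 ? b : (uint32_t) (b & (0xFF >> (n + 1)));
        size_t k;
        assert(pos + n <= nbytes);
        for (k = 1; k < n; k++)
            c = (c << 6) | (s[pos + k] & 0x3F);
        acc = f(c, acc);
        pos += n;
    }
    return acc;
}

/* Example procedures to fold with. */
static uint32_t count_one(uint32_t c, uint32_t acc)
{
    (void) c;
    return acc + 1;
}

static uint32_t max_char(uint32_t c, uint32_t acc)
{
    return c > acc ? c : acc;
}
```

Here the cursor handling is written by hand; the point above is that
the compiler would need to produce something equivalent from ordinary
Scheme code.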

Thanks,
Ludo’.
