Re: new modules for grapheme cluster breaking
From: Ben Pfaff
Subject: Re: new modules for grapheme cluster breaking
Date: Tue, 28 Dec 2010 06:59:46 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

Bruno Haible <address@hidden> writes:
>> > The next modules will have a higher-level API, I imagine. You're welcome
>> > to discuss the new API with me, before you implement it.
>>
>> I am thinking of something like:
>> size_t u<#>_grapheme_len (const uint<#>_t *s, size_t n);
>> which would return the number of units in the first grapheme
>> cluster in S.
>
> "grapheme" or "grapheme cluster"? I'm a bit confused: The Unicode 3.0
> book uses the term "grapheme" to denote the entity that users consider
> to be a single character, but UAX #29 nowadays calls it "grapheme cluster".

I am being a little sloppy with terminology.  My take-away from
the Unicode glossary definitions is that a "grapheme" is a
user-perceived character, and a "grapheme cluster" is the
sequence of code points that make up a grapheme.

If I am correct about that, then properly this would be a
grapheme cluster.

> OK for this kind of API, if the grapheme break determination is context free
> (unlike the word break determination). Can you confirm that?

Yes, grapheme cluster break determination looks only at the two
code points on either side of a candidate break position.  The
uc_is_grapheme_cluster_break function is sufficient to find
breakpoints.

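
To illustrate why a pairwise predicate is enough, here is a minimal sketch of the proposed length function for UTF-32 only.  The names toy_is_grapheme_cluster_break and sketch_u32_grapheme_len are hypothetical.  The toy predicate is a stand-in for uc_is_grapheme_cluster_break and knows only about CR LF and the combining diacritical marks U+0300..U+036F, so it is not a UAX #29 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy stand-in for uc_is_grapheme_cluster_break (a, b).
   Returns true if a break is allowed between code points A and B.
   Assumption: only three rules are modelled here; the real predicate
   covers much more of UAX #29.  */
static bool
toy_is_grapheme_cluster_break (uint32_t a, uint32_t b)
{
  /* Never break inside CR LF.  */
  if (a == 0x000D && b == 0x000A)
    return false;
  /* Otherwise always break before and after CR or LF.  */
  if (a == 0x000D || a == 0x000A || b == 0x000D || b == 0x000A)
    return true;
  /* Do not break before a combining diacritical mark.  */
  if (b >= 0x0300 && b <= 0x036F)
    return false;
  return true;
}

/* Return the number of units in the first grapheme cluster of
   S[0..N-1], or 0 if N is 0.  Only adjacent pairs are examined.  */
static size_t
sketch_u32_grapheme_len (const uint32_t *s, size_t n)
{
  size_t i;

  assert (s != NULL || n == 0);
  if (n == 0)
    return 0;
  for (i = 1; i < n; i++)
    if (toy_is_grapheme_cluster_break (s[i - 1], s[i]))
      break;
  return i;
}
```

A UTF-8 or UTF-16 variant would additionally have to decode each code point before calling the predicate; that step is left out here.
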
> A function for iterating backwards, i.e. returning the grapheme
> bounds before a certain point in a string, would be useful too
> then. Like u#_next and u#_prev in <unistr.h>:
>
> const uint#_t * u#_grapheme_next (const uint#_t *s, const uint#_t *end);
> const uint#_t * u#_grapheme_prev (const uint#_t *s, const uint#_t *start);
>
> And for convenience, I would suggest an API that operates on an
> entire string, like done for the word breaks:
>
> /* Determine the grapheme [cluster?] break points in S, and store the result
> at p[0..n-1].
> p[i] = 1 means that there is a grapheme [cluster?] boundary between s[i-1]
> and s[i].
> p[i] = 0 means that s[i-1] and s[i] must not be separated.
> */
> extern void
> u8_grapheme_breaks (const uint8_t *s, size_t n, char *p);
> extern void
> u16_grapheme_breaks (const uint16_t *s, size_t n, char *p);
> extern void
> u32_grapheme_breaks (const uint32_t *s, size_t n, char *p);
> extern void
> ulc_grapheme_breaks (const char *s, size_t n, char *p);
OK, I'll look at writing those functions too.
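
As a rough sketch of what the whole-string and backward-iterating functions suggested above could look like for UTF-32, under the same pairwise assumption: the sketch_ and toy_ names are hypothetical, the predicate is a toy stand-in for uc_is_grapheme_cluster_break, and two conventions are assumptions not settled in this thread, namely that p[0] is set to 1 and that the backward iterator returns NULL when already at the start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy stand-in for uc_is_grapheme_cluster_break (a, b):
   keeps CR LF together and keeps U+0300..U+036F attached to the
   preceding code point.  Not a UAX #29 implementation.  */
static bool
toy_is_grapheme_cluster_break (uint32_t a, uint32_t b)
{
  if (a == 0x000D && b == 0x000A)
    return false;
  if (a == 0x000D || a == 0x000A || b == 0x000D || b == 0x000A)
    return true;
  if (b >= 0x0300 && b <= 0x036F)
    return false;
  return true;
}

/* Store the break points of S[0..N-1] at P[0..N-1].
   P[i] = 1 means a boundary between s[i-1] and s[i]; P[i] = 0 means
   they must not be separated.  Assumption: P[0] is always 1.  */
static void
sketch_u32_grapheme_breaks (const uint32_t *s, size_t n, char *p)
{
  size_t i;

  assert ((s != NULL && p != NULL) || n == 0);
  if (n == 0)
    return;
  p[0] = 1;
  for (i = 1; i < n; i++)
    p[i] = toy_is_grapheme_cluster_break (s[i - 1], s[i]) ? 1 : 0;
}

/* Return the start of the grapheme cluster that ends just before S,
   never going back past START.  Assumption: returns NULL if S is
   already at START.  */
static const uint32_t *
sketch_u32_grapheme_prev (const uint32_t *s, const uint32_t *start)
{
  if (s == start)
    return NULL;
  for (s--; s > start; s--)
    if (toy_is_grapheme_cluster_break (s[-1], s[0]))
      break;
  return s;
}
```
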
Thanks,
Ben.
--
Ben Pfaff
http://benpfaff.org