Re: new modules for grapheme cluster breaking

From: Ben Pfaff
Subject: Re: new modules for grapheme cluster breaking
Date: Tue, 28 Dec 2010 06:59:46 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

Bruno Haible <address@hidden> writes:

>> > The next modules will have a higher-level API, I imagine. You're welcome
>> > to discuss the new API with me, before you implement it.
>> I am thinking of something like:
>>         size_t u<#>_grapheme_len (const uint<#>_t *s, size_t n);
>> which would return the number of units in the first grapheme
>> cluster in S.
> "grapheme" or "grapheme cluster"? I'm a bit confused: The Unicode 3.0
> book uses the term "grapheme" to denote the entity that users consider
> to be a single character, but UAX #29 nowadays calls it "grapheme cluster".

I am being a little sloppy with terminology.  My take-away from
the Unicode glossary definitions is that a "grapheme" is a
user-perceived character, and a "grapheme cluster" is the
sequence of code points that make up a grapheme.

If I am correct about that, then properly this would be a
grapheme cluster.

> OK for this kind of API, if the grapheme break determination is context free
> (unlike the word break determination). Can you confirm that?

Yes, grapheme cluster break determination looks only at the two
code points on either side of a candidate boundary.  The
uc_is_grapheme_cluster_break function is sufficient to find
breakpoints.
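
To make the pairwise test concrete, here is a minimal sketch of how
u32_grapheme_len could be built on top of it.  This is only an
illustration, not the gnulib implementation: sketch_is_break is a
hypothetical stand-in for uc_is_grapheme_cluster_break that knows just
two rules (no break inside CR LF, no break before a combining mark in
U+0300..U+036F), and the sketch_ prefix marks every name as made up
for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for uc_is_grapheme_cluster_break.  It only
   knows two rules and is NOT a complete break predicate.  */
static bool
sketch_is_break (uint32_t a, uint32_t b)
{
  if (a == 0x000D && b == 0x000A)
    return false;               /* never break inside CR LF */
  if (b >= 0x0300 && b <= 0x036F)
    return false;               /* never break before a combining mark */
  return true;                  /* otherwise assume a break */
}

/* Return the number of units in the first grapheme cluster of
   S[0..N-1], or 0 if N is 0.  Only adjacent pairs are examined.  */
static size_t
sketch_u32_grapheme_len (const uint32_t *s, size_t n)
{
  size_t i;

  if (n == 0)
    return 0;
  for (i = 1; i < n; i++)
    if (sketch_is_break (s[i - 1], s[i]))
      break;
  return i;
}
```

With this stand-in, the sequence 'e', U+0301, 'x' yields a first
cluster of length 2.  The UTF-8 and UTF-16 variants would also need to
decode each code point before calling the predicate.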

> A function for iterating backwards, i.e. returning the grapheme
> bounds before a certain point in a string, would be useful too
> then. Like u#_next and u#_prev in <unistr.h>:
>   const uint#_t * u#_grapheme_next (const uint#_t *s, const uint#_t *end);
>   const uint#_t * u#_grapheme_prev (const uint#_t *s, const uint#_t *start);
> And for convenience, I would suggest an API that operates on an
> entire string, like done for the word breaks:
> /* Determine the grapheme [cluster?] break points in S, and store the result
>    at p[0..n-1].
>    p[i] = 1 means that there is a grapheme [cluster?] boundary between s[i-1]
>    and s[i].
>    p[i] = 0 means that s[i-1] and s[i] must not be separated.
>  */
> extern void
>        u8_grapheme_breaks (const uint8_t *s, size_t n, char *p);
> extern void
>        u16_grapheme_breaks (const uint16_t *s, size_t n, char *p);
> extern void
>        u32_grapheme_breaks (const uint32_t *s, size_t n, char *p);
> extern void
>        ulc_grapheme_breaks (const char *s, size_t n, char *p);

OK, I'll look at writing those functions too.
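
As a rough sketch of the whole-string variant, the same pairwise test
can fill the P array in one pass.  Again this is an assumption-laden
illustration rather than the eventual gnulib code: sketch_is_break is
a hypothetical stand-in for uc_is_grapheme_cluster_break covering only
CR LF and combining marks U+0300..U+036F, and setting p[0] = 1 is my
own guess, since the proposed comment does not say what p[0] means.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for uc_is_grapheme_cluster_break.  It only
   knows two rules and is NOT a complete break predicate.  */
static bool
sketch_is_break (uint32_t a, uint32_t b)
{
  if (a == 0x000D && b == 0x000A)
    return false;               /* never break inside CR LF */
  if (b >= 0x0300 && b <= 0x036F)
    return false;               /* never break before a combining mark */
  return true;                  /* otherwise assume a break */
}

/* Store the break points of S[0..N-1] at P[0..N-1].
   P[i] = 1 means a boundary between S[i-1] and S[i];
   P[i] = 0 means they must not be separated.
   P[0] = 1 is an assumption (start of text counts as a boundary).  */
static void
sketch_u32_grapheme_breaks (const uint32_t *s, size_t n, char *p)
{
  size_t i;

  if (n == 0)
    return;
  p[0] = 1;
  for (i = 1; i < n; i++)
    p[i] = sketch_is_break (s[i - 1], s[i]) ? 1 : 0;
}
```

A backwards iterator in the style of u#_grapheme_prev could use the
same predicate, stepping back from the end until a pair reports a
break.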


Ben Pfaff 
