Re: announcing thaiword.el?

From: Miles Bader
Subject: Re: announcing thaiword.el?
Date: Tue, 29 Mar 2005 17:35:15 +0900

On Mon, 28 Mar 2005 09:47:09 +0900 (JST), Kenichi Handa <address@hidden> wrote:
> To handle the regular expression "\\b" and "\\B" correctly
> for Thai, we need a bigger change in regex.c.  For the
> moment, I have no idea how to do that.

Current extensions to "word syntax", using `word-separating-categories'
etc., seem to do the correct thing with regexps.[*]  Perhaps some
extension to that mechanism would work.

For instance, what if entries in `word-separating-categories' could have an
optional predicate function -- in addition to the current (CAT1 . CAT2)
format, allow (CAT1 CAT2 PREDICATE-FUN), and only consider the entry to
match if PREDICATE-FUN (called with some appropriate args) also returns true?

Then a case like Thai, where you want to do more complicated tests
to establish word boundaries inside sequences of non-delimited text,
could use a "degenerate" entry in `word-separating-categories' with both
CAT1 and CAT2 the same, but also with a predicate attached to do the
more complicated test.  I suppose that would slow down word matching
whenever the predicate is called, but that would only happen for text
where such an entry applies.
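
To make the proposed matching rule concrete, here is a rough pure-Lisp
model of it.  This is only a sketch, not how Emacs behaves today: the
real test lives in C, and every `my-' name below is hypothetical.  It
also assumes the predicate receives the buffer position of the
candidate boundary as its only argument, which the proposal above
leaves open, and it ignores nil categories.

```elisp
;; Hypothetical model of the proposed rule.  An entry is either the
;; existing (CAT1 . CAT2) or the proposed (CAT1 CAT2 PREDICATE-FUN).

(defun my-entry-matches-p (entry pos)
  "Return non-nil if ENTRY puts a word boundary at buffer position POS.
Assumes PREDICATE-FUN, if present, takes POS as its only argument."
  (let* ((c1 (char-before pos))
         (c2 (char-after pos))
         (rest (cdr entry))
         ;; In a dotted pair the cdr is CAT2 itself; in the list form
         ;; the cdr is (CAT2 PREDICATE-FUN).
         (cat2 (if (consp rest) (car rest) rest))
         (pred (and (consp rest) (cadr rest))))
    (and c1 c2
         (aref (char-category-set c1) (car entry))
         (aref (char-category-set c2) cat2)
         ;; The predicate is consulted only after the cheap category
         ;; tests have already matched.
         (or (null pred) (funcall pred pos)))))

(defun my-word-boundary-p (entries pos)
  "Return non-nil if any entry in ENTRIES matches at POS."
  (let (found)
    (dolist (entry entries found)
      (when (my-entry-matches-p entry pos)
        (setq found t)))))

;; Toy stand-in for the "more complicated test": break before a capital
;; letter.  A Thai predicate would presumably consult a dictionary.
(defun my-capital-follows-p (pos)
  (let ((c (char-after pos)))
    (and c (<= ?A c ?Z))))

;; A "degenerate" entry: CAT1 and CAT2 are both ?a (ASCII), so the
;; predicate alone decides where the boundaries fall.
(with-temp-buffer
  (insert "fooBar")
  (my-word-boundary-p '((?a ?a my-capital-follows-p)) 4))
```

For Thai the entry would look something like (?t ?t my-thai-break-p),
where `my-thai-break-p' is again a made-up name.  Because the category
tests run first, the predicate cost is paid only inside runs of text in
that category.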


[*] I was surprised that this is true, and I don't understand why from
    my quick look at regex.c :-/ ... But my simple tests seem to show
    that it does really work.  E.g., I can add '(?C . ?C) to
    `word-separating-categories', and then a regexp search will suddenly
    start considering every single kanji character as a standalone word.
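
    A minimal sketch of that experiment, for anyone who wants to try
    it.  The function name is made up, and the variable is only bound
    locally so nothing global changes.  No result is given here because
    it may depend on the Emacs version.

```elisp
(defun my-first-word-with-han-split (text)
  "Return the first word a regexp search finds in TEXT.
`word-separating-categories' is temporarily extended with (?C . ?C)."
  (let ((word-separating-categories
         (cons '(?C . ?C) word-separating-categories)))
    (with-temp-buffer
      (insert text)
      (goto-char (point-min))
      ;; Non-greedy, so the match ends at the first word boundary.
      (when (re-search-forward "\\w+?\\b" nil t)
        (match-string 0)))))

(my-first-word-with-han-split "日本語")
```

    If the entry takes effect, the match should be a single character
    rather than the whole string.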

-- 
Do not taunt Happy Fun Ball.
