Re: International support
From: Samphan Raruenrom
Subject: Re: International support
Date: Mon, 11 Jan 1999 21:46:29 +0700
Kevin Atkinson wrote:
> Samphan Raruenrom wrote:
> > In Thai, we don't put spaces between words at all, so
> > the same situation happens naturally.
> > Typical Thai word-segmentation algorithms (which usually
> > do spell checking as well) use a maximal-match backtracking
> > algorithm with trie word list(s).
> > My implementation is at http://www.thai.net/libinthai/
> > IBM Classes for Unicode implementation is at
> > http://www.ibm.com/java/education/boundaries/boundaries.html
> Ok, so how do you detect boundaries of unknown or misspelled words?
The IBM ICU algorithm described at the above URL is:
: If we exhausted our possibilities without finding
: a valid sequence of words, it either means there's
: an error in the text, or the text includes a word
: that isn't in the dictionary. In either case, we restore
: the set of break positions that matched the most
: characters, advance one character past where the
: mismatch occurred in that sequence, and start over
: from there. This works pretty well: usually only
: one or two boundary positions around the error
: are in the wrong place.
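
To make the quoted description concrete, here is a minimal Python sketch of maximal-match backtracking over a trie, with the recovery step described above. It is not the ICU code or the libinthai code; the names build_trie, match_lengths, longest_attempt and segment are hypothetical, and the word list is a toy English one written without spaces, standing in for a real Thai dictionary.

```python
END = ""  # end-of-word marker; cannot collide with a one-character key


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def match_lengths(trie, text, start):
    """Lengths of all dictionary words beginning at text[start], longest first."""
    node, lengths = trie, []
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            lengths.append(i - start + 1)
    return lengths[::-1]


def longest_attempt(trie, text, start):
    """Backtracking search from `start`, trying the longest match first.

    Returns (complete, ends). `ends` holds the end offsets of the matched
    words. If no full segmentation exists, `ends` is the attempt that
    matched the most characters.
    """
    dead, path, best = set(), [], []

    def search(pos):
        nonlocal best
        if pos == len(text):
            return True
        if pos in dead:  # already known to lead nowhere
            return False
        for n in match_lengths(trie, text, pos):
            path.append(pos + n)
            if not best or path[-1] > best[-1]:
                best = list(path)
            if search(pos + n):
                return True
            path.pop()  # backtrack and try a shorter word
        dead.add(pos)
        return False

    complete = search(start)
    return complete, (path if complete else best)


def segment(text, words):
    """Split `text` into dictionary words; unknown runs become single pieces."""
    trie = build_trie(words)
    pieces, start, unknown_from = [], 0, None
    while start < len(text):
        complete, ends = longest_attempt(trie, text, start)
        if not ends:
            # Nothing matches here: extend the unknown run by one character.
            if unknown_from is None:
                unknown_from = start
            start += 1
            continue
        if unknown_from is not None:
            pieces.append(text[unknown_from:start])
            unknown_from = None
        prev = start
        for end in ends:
            pieces.append(text[prev:end])
            prev = end
        if complete:
            return pieces
        # Recovery: keep the best breaks, skip one character past the
        # mismatch, and start over from there.
        unknown_from, start = prev, prev + 1
    if unknown_from is not None:
        pieces.append(text[unknown_from:])
    return pieces


if __name__ == "__main__":
    toy_words = {"the", "them", "theme", "me", "mend", "end"}
    print(segment("themend", toy_words))
    print(segment("thexend", toy_words))
```

With the toy word list, "themend" first tries "theme", fails on "nd", backtracks to "them" and ends with ['them', 'end']. For "thexend" no full segmentation exists, so the sketch keeps "the", skips the unknown "x", and restarts, giving ['the', 'x', 'end']. Grouping consecutive skipped characters into one unknown piece is a choice made for this sketch, not something the quoted text specifies.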