[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[aspell] Compund words
From: |
Kevin Atkinson |
Subject: |
[aspell] Compund words |
Date: |
Fri, 16 Jun 2000 08:49:24 -0400 (EDT) |
On Thu, 15 Jun 2000, Geoff Kuenning wrote:
> The other issue with German compounds is the formation of really long
> words, like the famous Donaudampfschiffgesellschaftmeister (please
> forgive me if I misspelled it!). Ispell doesn't currently support
> these, for reasons of performance having to do with how I originally
> implemented compound words (they are a special case of the "missing
> space" code). The basic problem is that when you're working with a
> word like "gesellschaftmeister", ispell doesn't know whether it should
> break it up as "gesellschaft meister" or "gesell schaft meister" (is
> "schaft" a legal word? I forget.) It wouldn't be hard to add code to
> try all such possibilities, but I fear that it would run fairly
> slowly. The basic algorithm would be something like this:
>
> for i = 1 to length of word
> pull off first i characters from the front into "temp"
> if "temp" is a legal word
> pull off remaining characters into "rest"
> call yourself recursively with "rest"
> if "rest" can be broken into words, consider the whole
> thing a legal compound
>
> This algorithm will definitely work, but it will be slow in situations
> where the word has a lot of prefixes that are legal words.
Instead of just guessing how about finding out. It can be done in linear
time with the attached prefix.cc which takes a sorted word list in stdin.
When run on a German word list I get this:
8 übersättigender übersättigende übersättigend übersättigen
übersättige übersät übers über übe
8 übersättigendes übersättigende übersättigend übersättigen
übersättige übersät übers über übe
9 weitererzählendem weitererzählende weitererzählend weitererzählen
weitererzähle weiterer weitere weiter weite weit
9 weitererzählenden weitererzählende weitererzählend weitererzählen
weitererzähle weiterer weitere weiter weite weit
9 weitererzählender weitererzählende weitererzählend weitererzählen
weitererzähle weiterer weitere weiter weite weit
9 weitererzählendes weitererzählende weitererzählend weitererzählen
weitererzähle weiterer weitere weiter weite weit
9 zurückhaltenderem zurückhaltendere zurückhaltender zurückhaltende
zurückhaltend zurückhalten zurückhalte zurück zur zu
9 zurückhaltenderen zurückhaltendere zurückhaltender zurückhaltende
zurückhaltend zurückhalten zurückhalte zurück zur zu
9 zurückhaltenderer zurückhaltendere zurückhaltender zurückhaltende
zurückhaltend zurückhalten zurückhalte zurück zur zu
9 zurückhaltenderes zurückhaltendere zurückhaltender zurückhaltende
zurückhaltend zurückhalten zurückhalte zurück zur zu
OK so it looks like for some words it is pretty bad here is what the total
distribution looks like.
./a.out < german.wl | cut -b1-2 | sort | uniq -c
31204 0
59328 1
82950 2
55998 3
38768 4
17881 5
6985 6
1678 7
104 8
8 9
So it still looks pretty bad however 9 prefixes mean that roughly
size of compound word * 9
possibilities are being tried (assuming that there are no other
prefixes in the components of the compound that are not at the
beginning, which may not be the case but read on)
so with a word of size 50 there are 450 possibilities being
tried. That may seam like a lot but when coming up with words that are 1
edit distance (as Ispell does when coming up with near misses) apart
there are roughly
2 (n-1) + n + n E + (n+1) E
combinations being tried where n is the size of the word and E is the size
of the alphabet which will be at least 26.
Thus the number of combinations being tried when coming up with
suggestions is more than the number tried when attempting to see if a
compound word can be broken. Now considering that such compounds are not
that common (am I correct) the running time for spell checking a document
is still perfectly acceptable.
Now if other prefixes are found in some of the inner words the number of
combinations will go up even more however there would have to be a lot of
prefixes to make the running time unacceptable, especially considering that
such wards would be very rare.
However, in order to really tell how many different combinations are being
tried one will have to try it out on an extremely large number if words and
just count the number of combinations tried. If you are interested in
helping me count these please let me know.
Now for those of you are interested the worst case running time is:
T(s < l, l) = 0
T(s , l) = sum T(i,l) where i = l to l-s
where s is the length of the word and l is the number of possible
combinations when a limit of n is set on the length of individual
components.
The worst case is when every single possible prefix is in the dictionary.
Here is a chart of fir s = 1 .. 50 and l = 1 .. 5
s\l 1 2 3 4 5
1 0 0 0 0 0
2 1 0 0 0 0
3 3 0 0 0 0
4 7 1 0 0 0
5 15 2 0 0 0
6 31 4 1 0 0
7 63 7 2 0 0
8 127 12 3 1 0
9 255 20 5 2 0
10 511 33 8 3 1
11 1023 54 12 4 2
12 2047 88 18 6 3
13 4095 143 27 9 4
14 8191 232 40 13 5
15 16383 376 59 18 7
16 32767 609 87 25 10
17 65535 986 128 35 14
18 131071 1596 188 49 19
19 262143 2583 276 68 25
20 524287 4180 405 94 33
21 1048575 6764 594 130 44
22 2097151 10945 871 180 59
23 4194303 17710 1277 249 79
24 8388607 28656 1872 344 105
25 16777215 46367 2744 475 139
26 33554431 75024 4022 656 184
27 67108863 121392 5895 906 244
28 134217727 196417 8640 1251 324
29 268435455 317810 12663 1727 430
30 536870911 514228 18559 2384 570
31 1.07374182e+09 832039 27200 3291 755
32 2.14748365e+09 1346268 39864 4543 1000
33 4.2949673e+09 2178308 58424 6271 1325
34 8.58993459e+09 3524577 85625 8656 1756
35 1.71798692e+10 5702886 125490 11948 2327
36 3.43597384e+10 9227464 183915 16492 3083
37 6.87194767e+10 14930351 269541 22764 4084
38 1.37438953e+11 24157816 395032 31421 5410
39 2.74877907e+11 39088168 578948 43370 7167
40 5.49755814e+11 63245985 848490 59863 9495
41 1.09951163e+12 102334154 1243523 82628 12579
42 2.19902326e+12 165580140 1822472 114050 16664
43 4.39804651e+12 267914295 2670963 157421 22075
44 8.79609302e+12 433494436 3914487 217285 29243
45 1.7592186e+13 701408732 5736960 299914 38739
46 3.51843721e+13 1.13490317e+09 8407924 413965 51319
47 7.03687442e+13 1.8363119e+09 12322412 571387 67984
48 1.40737488e+14 2.97121507e+09 18059373 788673 90060
49 2.81474977e+14 4.80752698e+09 26467298 1088588 119304
50 5.62949953e+14 7.77874205e+09 38789711 1502554 158044
The source that created this is also attached. It turns out that it is
also equal to:
T(s,l) = t(s-l+1, l) - 1
where t(-l..l) = 1
t(n) = t(n-1) + t(n-l)
Which should have an explicit formula.
I am not quote sure why this is . I discovered this pattern by searching
for known sequence at Sloane's On-Line Encyclopedia of Integer Sequences
(http://www.research.att.com/~njas/sequences/).
If someone has some spare time this will be an interesting thing to
prove.
---
Kevin Atkinson
address@hidden
http://metalab.unc.edu/kevina/
prefix.cc
Description: Text document
split.cc
Description: Text document
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- [aspell] Compund words,
Kevin Atkinson <=