[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[aspell-devel] How Aspell Works: Part 1: The Compiled Dictionary Format
From: |
Kevin Atkinson |
Subject: |
[aspell-devel] How Aspell Works: Part 1: The Compiled Dictionary Format |
Date: |
Fri, 30 Sep 2005 13:57:41 -0600 (MDT) |
The details of how Aspell really works is a mystery to most everyone but
me. So I thought it is finally time to fully explain Aspell's core
algorithms and data structures. In doing so I hope people will appreciate the
amount of cleverness I put into it, and perhapses be able to improve it even
more in the future. However, doing so will take some time. Thus I will
be explaining it in small segments. This information will eventually be
included in the Aspell source in one form or another.
In this part I will explain how the data is laid out in the compiled
dictionary for Aspell 0.60. Source file: readonly_ws.cpp
Aspell's main word list is laid out as follows:
* header
* jump table for editdist 1
* jump table for editdist 2
* data block
* hash table
There is nothing particular interesting about the header. Just a bunch of
meta information. I will get back to the jump tables.
Words in the data block are grouped based on the soundslike. Each group
is as follows:
<8 bit: offset to next item><8 bit: soundslike
size><soundslike><null><words>
Each word group is as follows:
<8 bit: flags><8 bit: offset to next word><8 bit: word size><word><null>
[<affix info><null>]
The offset for the final word in each group points to the next word in
the following group. The offset for the final word and soundslike group
in the dictionary is 0.
I made some previsions for additional info to be stored with the word but
for simplicity I will leave it out. If soundslike data is not used than the
soundslike block it not used.
This format make it easy to iterate over the data without using the hash
table.
Each soundslike group can be a maximum of about 256 bytes. If this limit
is reached than the soundslike group is split. Using 2 bytes for the
soundslike offset would of solved this problem however
1) 256 is normally sufficient, thus I would of wasted some space by
using an extra byte.
2) More importantly, Using 2 bytes means I would of had to worry about
alignment issues.
The soundslike groups are sorted in more or less alphabetic order.
The hash table is a simple open address hash table. The key is the
dictionary word in all lowercase form with any accents removed (what
is known as the "clean" form of the word). The value stored in
the table is an 32 bit offset to the beginning of the word. An integer
offset is used rather than a pointer so that 1) the complied dictionary
can be mmaped in which makes loading the dictionary very fast and so that
the memory can be shared between processed, and 2) on 64 bit platforms
using pointers would of doubled the size of the hash table.
Additional information on the word can be derived from the offset:
word size: offset - 1
offset to next word: offset - 2
flags: offset - 3
I use helper functions for getting this information. Doing it this way
rather than having a data structure is slightly evil, I admit, but I have my
reasons.
In the next part I will explain how Aspell uses the jump tables to quickly
search the list for all soundslike with an edit-distance of 1 or 2.