[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[aspell-devel] How Aspell Works: Part 1: The Compiled Dictionary Format

From: Kevin Atkinson
Subject: [aspell-devel] How Aspell Works: Part 1: The Compiled Dictionary Format
Date: Fri, 30 Sep 2005 06:21:54 -0600 (MDT)

The details of how Aspell really works is a mystery to most everyone but me. So I thought it is finally time to fully explain Aspell's core algorithms and data structures. In doing so I hope people will appreciate the amount of cleverness I put into it, and perhapses be able to improve it even more in the future. However, doing so will take some time. Thus I will be explaining it in small segments. This information will eventually be included in the Aspell source in one form or another.

In this part I will explain how the data is laid out in the compiled dictionary for Aspell 0.60. Source file: readonly_ws.cpp

Aspell's main word list is laid out as follows:

* header
* jump table for editdist 1
* jump table for editdist 2
* data block
* hash table

There is nothing particular interesting about the header. Just a bunch of meta information. I will get back to the jump tables.

Words in the data block are grouped based on the soundslike. Each group is as follows:
   <8 bit: offset to next item><8 bit: soundslike size><soundslike><null><words>
Each word group is as follows:
   <8 bit: flags><8 bit: offset to next word><8 bit: word size><word><null>
   [<affix info><null>]
The offset for the final word in each group points to the next word in the following group. The offset for the final word and soundslike group in the dictionary is 0.

I made some previsions for additional info to be stored with the word but for simplicity I will leave it out. If soundslike data is not used than the soundslike block it not used.

This format make it easy to iterate over the data without using the hash table.

Each soundslike group can be a maximum of about 256 bytes. If this limit is reached than the soundslike group is split. Using 2 bytes for the soundslike offset would of solved this problem however
   1) 256 is normally sufficient, thus I would of wasted some space by
      using an extra byte.
   2) More importantly, Using 2 bytes means I would of had to worry about
      alignment issues.

The soundslike groups are sorted in more or less alphabetic order.

The hash table is a simple open address hash table. The key is the dictionary word in all lowercase form with any accents removed (what is known as the "clean" form of the word). The value stored in the table is an 32 bit offset to the beginning of the word. An integer offset is used rather than a pointer so that 1) the complied dictionary can be mmaped in which makes loading the dictionary very fast and so that the memory can be shared between processed, and 2) on 64 bit platforms using pointers would of doubled the size of the hash table.

Additional information on the word can be derived from the offset:
   word size: offset - 1
   offset to next word: offset - 2
   flags: offset - 3
I use helper functions for getting this information. Doing it this way rather than having a data structure is slightly evil, I admit, but I have my reasons.

In the next part I will explain how Aspell uses the jump tables to quickly search the list for all soundslike with an edit-distance of 1 or 2.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]