[aspell-devel] How Aspell Works: Part 1: The Compiled Dictionary Format

From:	Kevin Atkinson
Subject:	[aspell-devel] How Aspell Works: Part 1: The Compiled Dictionary Format
Date:	Fri, 30 Sep 2005 06:21:54 -0600 (MDT)

The details of how Aspell really works are a mystery to almost everyone but me. So I thought it was finally time to fully explain Aspell's core algorithms and data structures. In doing so I hope people will appreciate the amount of cleverness I put into it, and perhaps be able to improve it even more in the future. However, doing so will take some time, so I will be explaining it in small segments. This information will eventually be included in the Aspell source in one form or another.

In this part I will explain how the data is laid out in the compiled dictionary for Aspell 0.60. Source file: readonly_ws.cpp


Aspell's main word list is laid out as follows:

* header
* jump table for editdist 1
* jump table for editdist 2
* data block
* hash table

There is nothing particularly interesting about the header; it is just a bunch of meta information. I will get back to the jump tables.

Words in the data block are grouped based on the soundslike. Each group is laid out as follows:

   <8 bit: offset to next item><8 bit: soundslike size><soundslike><null><words>

Each word within a group is laid out as follows:

   <8 bit: flags><8 bit: offset to next word><8 bit: word size><word><null>
   [<affix info><null>]

The offset for the final word in each group points to the next word in the following group. The offset for the final word and soundslike group in the dictionary is 0.

I made some provisions for additional info to be stored with the word, but for simplicity I will leave it out. If soundslike data is not used then the soundslike block is not used.

This format makes it easy to iterate over the data without using the hash table.
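As a rough illustration of that kind of iteration, here is a minimal sketch that builds three word records by hand and then walks them by following the 8-bit offsets. The function names (`append_word`, `link_words`, `walk_words`) are hypothetical and not taken from readonly_ws.cpp. The sketch also assumes that "offset to next word" is the distance from the first character of one word to the first character of the next; the description above does not say what the offset is measured from. Affix info and the soundslike header are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append one record: <flags><offset to next word><word size><word><null>.
// Returns the index of the word's first character. (Hypothetical helper.)
std::size_t append_word(std::vector<std::uint8_t>& block,
                        const std::string& word, std::uint8_t flags) {
    assert(word.size() <= 255);   // the size field is only 8 bits wide
    block.push_back(flags);
    block.push_back(0);           // next-word offset, patched by link_words
    block.push_back(static_cast<std::uint8_t>(word.size()));
    std::size_t start = block.size();
    block.insert(block.end(), word.begin(), word.end());
    block.push_back('\0');
    return start;
}

// ASSUMPTION: the offset is measured from this word's first character to
// the next word's first character. 0 marks the last word (as in the post).
void link_words(std::vector<std::uint8_t>& block,
                const std::vector<std::size_t>& starts) {
    for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
        std::size_t delta = starts[i + 1] - starts[i];
        assert(delta <= 255);     // must fit in the 8-bit offset field
        block[starts[i] - 2] = static_cast<std::uint8_t>(delta);
    }
}

// Walk the chain starting at `first` without consulting any hash table.
std::vector<std::string> walk_words(const std::vector<std::uint8_t>& block,
                                    std::size_t first) {
    std::vector<std::string> out;
    std::size_t pos = first;
    for (;;) {
        std::uint8_t size = block[pos - 1];   // word size sits just before the word
        out.emplace_back(block.begin() + pos, block.begin() + pos + size);
        std::uint8_t step = block[pos - 2];   // offset to next word
        if (step == 0) break;                 // end of the chain
        pos += step;
    }
    return out;
}
```

Appending "cat", "cot" and "cut", linking them, and calling `walk_words` from the first start index returns the three words in order. Each step reads only two header bytes and never touches the hash table.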

Each soundslike group can be a maximum of about 256 bytes. If this limit is reached then the soundslike group is split. Using 2 bytes for the soundslike offset would have solved this problem, however:

   1) 256 is normally sufficient, thus I would have wasted some space by
      using an extra byte.
   2) More importantly, using 2 bytes means I would have had to worry about
      alignment issues.

The soundslike groups are sorted in more or less alphabetic order.

The hash table is a simple open address hash table. The key is the dictionary word in all lowercase form with any accents removed (what is known as the "clean" form of the word). The value stored in the table is a 32-bit offset to the beginning of the word. An integer offset is used rather than a pointer so that 1) the compiled dictionary can be mmapped in, which makes loading the dictionary very fast and allows the memory to be shared between processes, and 2) on 64-bit platforms using pointers would have doubled the size of the hash table.
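To make the idea concrete, here is a small sketch of an open address table whose slots hold 32-bit offsets into a separate data block rather than pointers. Everything in it is an assumption made for illustration: the names (`WordTable`, `clean`, `hash_str`, `EMPTY`), the use of linear probing, the FNV-1a hash, the all-ones sentinel for an empty slot, and a "clean" function that only lowercases ASCII. None of it is taken from the Aspell source.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified "clean" form: ASCII lowercase only (accent stripping omitted).
std::string clean(const std::string& w) {
    std::string out;
    for (unsigned char c : w) out.push_back(static_cast<char>(std::tolower(c)));
    return out;
}

// FNV-1a, standing in for whatever hash function Aspell really uses.
std::uint32_t hash_str(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

const std::uint32_t EMPTY = 0xFFFFFFFFu;   // assumed marker for an unused slot

struct WordTable {
    std::vector<char> data;                // null-terminated words, back to back
    std::vector<std::uint32_t> slots;      // 4-byte offsets into data, not pointers
    std::size_t used = 0;

    explicit WordTable(std::size_t n) : slots(n, EMPTY) {}

    std::string word_at(std::uint32_t off) const { return std::string(&data[off]); }

    void insert(const std::string& w) {
        assert(used + 1 < slots.size());   // keep one free slot so probing stops
        std::uint32_t off = static_cast<std::uint32_t>(data.size());
        data.insert(data.end(), w.begin(), w.end());
        data.push_back('\0');
        std::size_t i = hash_str(clean(w)) % slots.size();
        while (slots[i] != EMPTY) i = (i + 1) % slots.size();   // linear probing
        slots[i] = off;
        ++used;
    }

    // Returns the offset of the stored word, or EMPTY if it is not present.
    std::uint32_t find(const std::string& w) const {
        const std::string key = clean(w);
        std::size_t i = hash_str(key) % slots.size();
        while (slots[i] != EMPTY) {
            if (clean(word_at(slots[i])) == key) return slots[i];
            i = (i + 1) % slots.size();
        }
        return EMPTY;
    }
};
```

Because the slots contain offsets relative to the start of the data, the same bytes stay valid at whatever address the file is mapped, and each slot takes 4 bytes regardless of the platform's pointer width.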


Additional information on the word can be read from the bytes just before the offset:
   word size: offset - 1
   offset to next word: offset - 2
   flags: offset - 3

I use helper functions for getting this information. Doing it this way rather than having a data structure is slightly evil, I admit, but I have my reasons.
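Helpers of this kind might look roughly like the sketch below. The names and signatures are hypothetical and are not the ones used in readonly_ws.cpp; only the byte positions (offset - 1, offset - 2, offset - 3) come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// `block` points at the start of the data; `off` is the 32-bit offset of
// the word's first character, as it would be stored in the hash table.
inline unsigned word_size(const std::uint8_t* block, std::uint32_t off) {
    assert(off >= 3);            // the three header bytes precede every word
    return block[off - 1];
}

inline unsigned next_offset(const std::uint8_t* block, std::uint32_t off) {
    assert(off >= 3);
    return block[off - 2];
}

inline unsigned word_flags(const std::uint8_t* block, std::uint32_t off) {
    assert(off >= 3);
    return block[off - 3];
}
```

For a record holding the bytes `05 00 03 'c' 'a' 't' 00`, the word starts at offset 3, so `word_size` reads 3, `next_offset` reads 0 and `word_flags` reads 5.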

In the next part I will explain how Aspell uses the jump tables to quickly search the list for all soundslikes with an edit distance of 1 or 2.
