[aspell] Re: Aspell & international support

From: Kevin Atkinson
Subject: [aspell] Re: Aspell & international support
Date: Sun, 28 Feb 1999 17:28:10 -0500

Jean Christophe ANDRE wrote:

> I'm sorry for the rude style of my previous mail: I had to shorten it since
> the computer rooms were closing... Here again, I will propose some
> suggestions since I really love Unicode. I'm not used to English, so if you
> feel angry with my mail, please flame me before you blacklist my address! ;)

That's OK.  I didn't get angry.  Just a little aggravated.

> > I do know about Unicode.  Yes, I can fairly easily convert my characters
> > to 32-bit ints as my code is clean.  However, then how should I store the
> > word lists in memory?  As a string of ints?  Now that is using up 4
> > times more memory than characters would, and for languages that can fit
> > within an 8-bit character that is, in my view, a gross waste of memory.
> Mhhh ... Why not use memory mapping of the dictionaries in this case?
> Since hard drives are cheaper and cheaper, you could store the dictionary
> in a usable (uncompressed) form and use it directly with memory mapping.
> Then the efficiency would directly depend on the disk caching method,
> and only the used part of the dictionaries would really be loaded into
> memory. You would no longer have to load plain dictionaries into main
> memory; you'll just want to compute some indexes (or something like that)
> after mapping.

And that still leads to an extra level of complexity.  There is also the
performance issue.  It may not affect performance; however, that is
something I need to consider.
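
For readers who want to see what the memory-mapping proposal amounts to, here is a minimal Python sketch. It is not Aspell code: the file layout (UTF-32 words separated by newlines) and the names `build_wordlist`, `contains` and `demo` are assumptions made for illustration, and the linear scan stands in for the "indexes" the proposal mentions.

```python
import mmap
import os
import tempfile

ENC = "utf-32-le"  # fixed width, 4 bytes per character (illustrative choice)

def build_wordlist(path, words):
    """Write an uncompressed list with a newline before and after every word."""
    with open(path, "wb") as f:
        f.write("\n".encode(ENC))
        for w in words:
            f.write((w + "\n").encode(ENC))

def contains(path, word):
    """Search the mapped file instead of reading the whole list into a buffer."""
    needle = ("\n" + word + "\n").encode(ENC)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos = m.find(needle)
            # Skip hits that do not start on a 4-byte character boundary.
            while pos != -1 and pos % 4 != 0:
                pos = m.find(needle, pos + 1)
            return pos != -1

def demo():
    """Build a tiny list in a temporary directory and look two words up."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "words.u32")
        build_wordlist(path, ["colour", "façade", "naïve"])
        return contains(path, "naïve"), contains(path, "nai")
```

The extra bookkeeping (a file format, alignment checks, and eventually an index) is the kind of added complexity the reply above refers to, and whether the lookups are fast enough would have to be measured.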

> > So the solution is to work with the characters as 32-bit ints, then
> > convert them to a shorter representation when storing them in memory.
> > Now that can lead to an inefficiency.  I could also use short ints;
> > however, that may not be good enough to hold all future versions of
> > Unicode, and it has the same problems.
> I think converting after loading into memory is not a good idea since it's
> better to always work with integers and only convert from and to the locale
> when reading from or writing to the end user/program. That's exactly what
> the Plan 9 system from Bell Labs does: they use Unicode with UTF-8 encoding
> for the end user, but internally, this is full integer processing.
> If the encoding issue is a problem, you may pipe your input/output from/to
> recode (>= 3.4o) which is a pretty good tool for this.

Just in case there is any confusion, when I say "store them in memory" I
mean store them in memory as the shorter representation only when
storing them in the wordlists and not when storing them during the
suggestion processes.
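
A rough Python sketch of that split, for illustration only: the word list is kept in a one-byte-per-character form and a word is widened to integer code points only while it is being worked on. The function names and the choice of ISO-8859-1 as the compact character set are assumptions, not anything taken from Aspell.

```python
def wide_size(words):
    """Bytes needed if every character is stored as a 32-bit integer."""
    return sum(len(w.encode("utf-32-le")) for w in words)

def narrow_size(words, charset="iso-8859-1"):
    """Bytes needed if every character fits in one byte of the given charset."""
    return sum(len(w.encode(charset)) for w in words)

def widen(stored, charset="iso-8859-1"):
    """Expand one stored word to a list of integer code points for processing."""
    return [ord(c) for c in stored.decode(charset)]
```

For a list such as `["colour", "façade", "naïve"]`, `wide_size` is exactly four times `narrow_size`, which is the factor of 4 mentioned earlier in the thread.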

Also, doing the conversion is not a problem and the next version will do
exactly that. It will be able to read and write in UTF-8 as well as many
other encoding methods. It will just work with them internally as 8-bit
characters.
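
As a hedged illustration of converting at the boundary, the following Python sketch accepts UTF-8 on input, holds the text in an 8-bit character set internally, and re-encodes on output. The helper names and the ISO-8859-1 default are hypothetical; this is not how any particular Aspell release does it.

```python
def to_internal(utf8_bytes, charset="iso-8859-1"):
    """Decode UTF-8 input and re-encode it as one byte per character.

    Raises UnicodeEncodeError if a character has no slot in the charset.
    """
    return utf8_bytes.decode("utf-8").encode(charset)

def to_external(internal_bytes, charset="iso-8859-1"):
    """Convert the internal 8-bit form back to UTF-8 for output."""
    return internal_bytes.decode(charset).encode("utf-8")
```

The limitation of this design is visible in the error case: a word containing a character outside the chosen 8-bit set cannot be represented internally, so each language needs a character set that covers its alphabet.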

> > These were the issues I was talking about.
> OK, I did not understand it on the first reading of your pages...
> Maybe you could put this on your web pages as clearly as in your mail! ;)

I will.  I ask you again: would it be okay if I forward these messages
to the aspell mailing list?

Kevin Atkinson


Note: This message was originally posted to address@hidden,
      not address@hidden