Re: Ispell and unibyte characters

From: Eli Zaretskii
Subject: Re: Ispell and unibyte characters
Date: Mon, 26 Mar 2012 16:08:06 -0400

> Date: Mon, 26 Mar 2012 19:39:12 +0200
> From: Agustin Martin <address@hidden>
> Hi Eli,

Thanks for responding, I was beginning to think that no one was
interested.  In general, I find that ispell.el is in sore need of
modernization; at least that's my conclusion so far from playing with
hunspell (with which I want to replace my aging collection of Ispell
and its dictionaries that I have used for many years).

> At least for aspell ispell.el already uses utf8 as default communication
> encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS). 
> OTHERCHARS is guessed from aspell .dat file for given dictionary.

The question is, why isn't this done for every modern speller?  The
only one I know of that cannot handle UTF-8 is Ispell.

OTHERCHARS are not very important anyway, at least for languages I'm
interested in.

> Since currently it is not possible to ask hunspell for installed
> dictionaries (hunspell -D does not return control to the console)
> no one tried something similar for hunspell.

In what version do you have problems with -D?

In any case, hunspell supports multiple dictionaries in the same
session.  One can invoke it with, e.g., "-d en_US,de_DE,ru_RU,he_IL"
and have it spell-check mixed text that uses all these languages in
the same buffer (at least in theory; I haven't yet tried that in my
experiments).  Clearly, this can only be done with UTF-8 or a similar
multilingual encoding.
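
A minimal Python sketch of why no single 8-bit encoding can carry such
mixed text.  The sample string and the list of encodings are
illustrative assumptions, not something ispell.el or hunspell does:

```python
# Illustrative only: the sample text and the encoding list are assumptions.
mixed = "word Wörter слово מילה"  # English, German, Russian, Hebrew


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


single_byte = ["iso8859-1", "iso8859-5", "iso8859-8", "koi8-r"]
results = {enc: can_encode(mixed, enc) for enc in single_byte}
results["utf-8"] = can_encode(mixed, "utf-8")

for enc, ok in results.items():
    print(enc, "ok" if ok else "cannot encode")
```

Each of the 8-bit encodings lacks at least one of the scripts, so only
UTF-8 succeeds for the whole line.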

So I think we should deprecate the use of unibyte characters in the
ispell.el defaults, and simply use [:alpha:] for all languages.  As a
bonus, we can then get rid of the ridiculously long and
hard-to-maintain customization of each new dictionary you add to your
repertoire.  Just one entry will serve almost any language, or at
least supply an excellent default.

> > The only reason for this limitation I could find is in
> > ispell-process-line, which assumes that the byte offsets returned by
> > the speller can be used to compute character position of the
> > misspelled word in the buffer.  Are there any other places in
> > ispell.el that assume unibyte characters?
> Not sure if using utf8 and [:alpha:] has caused some problem for aspell,
> I do not remember reports about this. 

Since I wrote that, I found that the problem was due to a bug in
hunspell (which I fixed in my copy): it reported byte offsets of the
misspelled words, rather than character offsets.  After fixing that
bug, there's no issue here anymore and nothing to fix in ispell.el.
There's a bug report with a patch about that in the hunspell bug
tracker, so there's reason to believe this bug will be fixed in a
future release.
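
The difference between the two kinds of offsets can be sketched in
Python as follows.  The sample line and the helper name
byte_to_char_offset are hypothetical, chosen only for illustration;
this is not the hunspell patch:

```python
# Illustrative only: the sample line and helper name are assumptions.
def byte_to_char_offset(line, byte_offset, encoding="utf-8"):
    """Convert an offset counted in encoded bytes to one counted in characters.

    Raises UnicodeDecodeError if byte_offset falls inside a multibyte
    character.
    """
    prefix = line.encode(encoding)[:byte_offset]
    return len(prefix.decode(encoding))


line = "größe wrod"                      # "wrod" is the misspelled word
char_offset = line.index("wrod")         # offset counted in characters
byte_offset = len(line[:char_offset].encode("utf-8"))  # counted in bytes

print(char_offset, byte_offset)
print(byte_to_char_offset(line, byte_offset))
```

The two offsets agree only while the text before the word is pure
ASCII; here the two non-ASCII letters each take two bytes in UTF-8, so
a byte offset points two positions too far into the buffer.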

> IIRC, the reason to use octal escapes is mostly that they are encoding
> independent.

They aren't; their encoding is guessed by Emacs based on the locale.
Using them is asking for trouble, IMO.  We specifically discourage use
of unibyte text in the Emacs manuals, and yet we ourselves use it in a
package that is part of Emacs!
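
A small Python sketch of the ambiguity: the same octal escape names a
byte, not a character, and which character it becomes depends on the
charset assumed by whoever decodes it.  The two charsets are picked
only for illustration, and this says nothing about how Emacs makes its
guess:

```python
# Illustrative only: the two charsets below are arbitrary examples.
raw = b"\351"  # the single byte 0xE9, written as an octal escape

as_latin1 = raw.decode("iso8859-1")    # Western European reading
as_cyrillic = raw.decode("iso8859-5")  # Cyrillic reading

print(repr(as_latin1), repr(as_cyrillic))
print(as_latin1 == as_cyrillic)
```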

> Otherwise a .emacs file may have mixed unibyte/multibyte encodings.

I was talking about ispell.el, first and foremost.  There's no problem
with having ispell.el encoded in UTF-8, if needed (but I don't think
there's a need, see above).
