A soundslike problem with combined English+Russian dictionary

aspell-user

From:	Maxim Nikulin
Subject:	A soundslike problem with combined English+Russian dictionary
Date:	Tue, 22 Jun 2021 23:56:25 +0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1

Hi,

I am aware that multi-lingual dictionaries are unsupported by Aspell, but I think in some particular cases it is still possible to combine a couple of dictionaries and get a result of reasonable quality. I have almost achieved what I expected for merged English and Russian word lists, and I am quite satisfied even with the current result. Maybe I just have not yet discovered the detrimental effects of the missing affix table for English or of the combined special characters ("-" and "'").

I was hooked by the description of the metaphone algorithm, which should improve suggested corrections for misspelled words. Since I am not a native English speaker, I do not mind having such a feature if it helps me recall a word. For Russian, plain edit distance should be enough, so I tried to use a copy of en_phonet.dat with an added line (and an exact copy as well)


    remove_accents 0

that is referenced in the .dat file

    soundslike rue_phonet

To my surprise, with this configuration the whole English alphabet is suggested as replacements for a misspelled Russian word. In the following example the word "funetik" is taken from the manual to check that the phonetic rules are taken into account (another example, taff -> tough, does not work with the default suggestion mode)

echo "funetik програма" | aspell -d ./rue.rws -a
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
& funetik 26 0: fanatic, funk, fungi, Fuentes, functor, frenetic, genetic, 
kinetic, finite, fount, fungoid, funky, lunatic, phonetic, fountain, funked, Fundy, 
fined, founts, funded, font, fund, frantic, funkier, fount's, Fuentes's
& програма 100 8: программа, программ, A, B, C, D, E, F, G, H, I, J, K, L, M, 
N, O, P, Q, R, S, T, U, V, X, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, 
r, s, t, u, v, x, z, AA, AI, AR, Ar, Au, BA, BB, BO, Ba, Be, Bi, CA, CO, Ca, Ce, 
Ch, Ci, Co, Cu, DA, DD, DE, DI, Di, Du, Dy, ER, EU, Er, Eu, FY, Fe, GA, GE, GI, GU, 
Ga, Ge, HI, Ha, He, Ho, IA, IE, Ia, Io, Ir, Jo, KO, KY


That is why my current configuration uses

     soundslike generic

It gives more reasonable variants for the Russian test word:

& програма 13 8: программа, программ, программе, программу, программы, 
программах, программам, программка, параграмма, программою, проиграна, параграмм, 
погрома
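
To compare the two soundslike settings without reading the lists by eye, the pipe-mode output can be summarised. This is only a rough sketch: the function name summarize_suggestions is made up for illustration, and it assumes the usual "& word count offset: suggestions" lines as shown above, each on a single line.

```shell
# Summarise "aspell -a" output read from stdin. For every misspelled word
# print: the word, the reported number of suggestions, and how many of the
# suggestions consist of one or two ASCII letters (the noise seen with the
# English phonetic rules).
# summarize_suggestions is a hypothetical helper name, not part of Aspell.
summarize_suggestions() {
    LC_ALL=C awk '
        /^& / {
            word = $2; total = $3
            list = $0
            sub(/^[^:]*: /, "", list)        # drop "& word count offset: "
            n = split(list, sugg, ", ")      # suggestions are comma separated
            short = 0
            for (i = 1; i <= n; i++)
                if (sugg[i] ~ /^[A-Za-z][A-Za-z]?$/) short++
            print word, total, short
        }'
}
```

It can be fed from the same command as above, e.g. echo "funetik програма" | aspell -d ./rue.rws -a | summarize_suggestions, once for each soundslike setting.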

Have I done something wrong? Is it expected behavior that English phonetic rules have such a detrimental effect on variants for Russian words? I am unsure whether the observed result is a bug. (Actually the question is: `How many bugs have I run into?', with zero as a possible answer.)


More details of my configuration.

The goal is to see misspelled words in mixed-language documents with my notes. Suggested corrections are appreciated as well. It has worked in Vim for years:


    set spelllang=en,ru spell

and I would like to have a comparable feature in Emacs

    M-x flyspell-mode RET M-x ispell-change-dictionary RET rue

without special configuration of a custom dictionary in Emacs. Side note: I am certainly against the idea, which I have seen once, of binding the ispell dictionary to the input method.


There is a feature request for support of multi-lingual dictionaries
https://github.com/GNUAspell/aspell/issues/448
(and a number of similar threads in the archive of this mailing list).
People are still trying to combine dictionaries:
https://unix.stackexchange.com/questions/341714/use-multi-language-dictionary-with-aspell
https://wiki.archlinux.org/title/User:Georgek

There is no section in the manual that clarifies the possible problems of this approach.

I hope that in my particular case of English and Russian it can be done in a somewhat more accurate way.

- I rarely use letters with accents, so the alphabets are disjoint sets of characters. US-ASCII is a subset of the KOI8-R encoding.
- The cost of discarding affix data for Russian is ~30M of disk space (and almost certainly RAM as well). I am unsure whether I lose something by ignoring the affix table for English.
- The combined "special" is a kind of compromise; it should be per-language. However, I have no example of imperfect behavior yet.
- As I said above, I would prefer phonetic rules for English, but I have to use generic ones.
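
The first point depends on the English word list really being pure US-ASCII. A quick check is sketched below. The function name non_ascii_words is invented for this example, and the -a option of grep is assumed to be available (it is in the GNU and BSD versions).

```shell
# Print, with line numbers, the lines of a word list (read from stdin)
# that contain any byte outside printable US-ASCII. Such words would be
# read as Cyrillic letters under KOI8-R.
# non_ascii_words is a hypothetical helper name.
non_ascii_words() {
    # LC_ALL=C makes the bracket range byte-based; "|| true" keeps the
    # exit status zero when grep finds nothing.
    LC_ALL=C grep -a -n '[^ -~]' || true
}
```

For example: aspell --lang=en_US --encoding=iso8859-1 dump master | non_ascii_words. Empty output means the list can be merged as it is.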


--->8--- rue.dat begin --->8---

# Combined dictionary for English and Russian languages
#
# An attempt to create a dictionary suitable for spell checking
# of mixed-language texts.
#
# Something distinct from just "ru" and "en". Do not use a name longer
# than 3 characters otherwise it will not appear in "aspell dump dicts"
# thus will be ignored by other applications. Numbers, e.g. "ru2"
# make language identifier invalid as well.
name            rue
# ISO8859-1 used for "en" dictionaries is a subset of KOI8-R
# modulo accents.
# The Russian dictionary from the Ubuntu system package uses KOI8-R.
charset         koi8-r
# Combine values from "ru" and "en"
special         - -*- ' -*-
# With
#
#     soundslike rue
#
# and a copy of en_phonet.dat aspell suggests
# e.g. "phonetic" for "funetik" input.
# Unfortunately it ruins scoring of corrections for Russian.
# Even with "remove_accents 0" inside "rue_phonet.dat", abundant
# single- and two-letters variants appear as alternatives.
# However a couple of top rated suggestions are still reasonable.
# A segfault may happen on an attempt to generate the master dictionary
# when "rue_phonet.dat" is missing from the current directory.
# As a compromise, prefer better quality of correction variants
# for Russian.
soundslike      generic
# Affix compression is not enabled for "en" system dictionaries.
# At the same time it saves a lot of space for the "ru" dictionary:
# the compressed dictionary is 3Mb, the expanded one consumes 30Mb
# of disk space.
#
#     aspell --lang=ru --encoding=koi8-r dump master \
#         | aspell --lang=ru --encoding=koi8-r expand \
#         | aspell --lang=ru create master ./ru.rws
#
#     aspell --lang=ru --encoding=koi8-r dump master \
#         | aspell --lang=ru --encoding=koi8-r expand \
#         | tr ' ' '\n' \
#         | aspell --lang=ru create master ./ru-expand.rws
#
affix-compress  true
# Actually it is ignored, and "rue_affix.dat" is required
# (as a copy or symlink).
affix           ru

# Noticed differences:
#
#     echo "programm funetic" | aspell --lang en -a
#     & programm 5 0: program, programs, programmer, programmed, program's
#     & funetic 14 9: fanatic, frenetic, genetic, kinetic, lunatic, phonetic, frantic, fungi, Fuentes, antic, functor, fanatics, fungoid, fanatic's
#     # ------------------------------------------------------------^^^^^^^^
#
#     echo "programm funetic" | aspell --lang rue -a
#     & programm 6 0: program, programs, programmed, programmer, program's, pogrom
#     # --------------------------------------------------------------------^^^^^^
#     & funetic 5 9: fanatic, genetic, kinetic, lunatic, Fuentes
#
# "phonetic" is absent because of "soundslike generic". "Pogrom" is present
# in the original "en" word list.

---8<--- rue.dat end   ---8<---

--->8--- rue.multi begin --->8---

# Combined dictionary for English and Russian languages
#
# It is not possible to just add ru.multi and en.multi because the languages
# inside the dictionaries differ. Unsure if it is safe to generate a
# dictionary for the English language using a modified ru.dat
# with "special ' -*-".
# Let's generate dictionaries with "rue" as a language identifier.
#
# System-wide .rws files are created on Ubuntu in postinst scripts by
# /usr/sbin/update-dictcommon-aspell and /usr/sbin/aspell-autobuildhash
# utilities. Source word lists are provided
# in /usr/share/aspell directory.
# Example of command to unpack:
#
#     zcat /usr/share/aspell/en-wo_accents-only.cwl.gz | precat
#
# E.g. en_US dictionary is combination of en-common
# (shared with e.g. en_GB)
# and en-wo_accents-only. Unsure if I need this degree of word list
# granularity, so let's try a naive approach to create word lists.
#
# "rue_affix.dat" is required despite "affix ru" line in rue.dat
#
#     ln -s /usr/lib/aspell/ru_affix.dat rue_affix.dat
#
#     aspell --lang=ru --encoding=koi8-r dump master \
#          | aspell --lang=rue create master ./rue-ru.rws
#
# Despite warnings like
#
#     # Warning: Removing inapplicable affix 'H' from word Адель.
#
# expanded word list is the same as the original one.
add rue-ru.rws
# Specify encoding to avoid UTF-8 if some accents
# will appear accidentally.
#
#     aspell --lang=en_US --encoding=iso8859-1 dump master \
#          | aspell --lang=rue create master ./rue-en_US.rws
add rue-en_US.rws

---8<--- rue.multi end   ---8<---

Commands to generate the word lists are in the last comments in rue.multi. Finally I can run


    aspell --lang rue -a
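
For convenience, the two build commands from the rue.multi comments can be wrapped as follows. This is a sketch under assumptions, not a tested recipe: build_part is a name invented here, and it is assumed that the script runs in the directory holding rue.dat, rue.multi and the rue_affix.dat symlink, as in my setup.

```shell
# Rebuild one part of the combined dictionary.
# $1: source language, $2: source encoding, $3: output .rws file
# build_part is a hypothetical helper; the aspell invocations themselves
# are the ones quoted in the rue.multi comments.
build_part() {
    aspell --lang="$1" --encoding="$2" dump master \
        | aspell --lang=rue create master "$3"
}

# Build both parts only when aspell and the files described in this
# message are present in the current directory.
if command -v aspell >/dev/null 2>&1 && [ -f ./rue.dat ] && [ -e ./rue_affix.dat ]; then
    build_part ru koi8-r ./rue-ru.rws &&
    build_part en_US iso8859-1 ./rue-en_US.rws
fi
```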

Does this configuration have apparent problems? Is it possible to use en_phonet.dat instead of "generic" for soundslike?
