Re: [aspell-devel] Tokenization of words containing hyphens
From: Carlo Traverso
Subject: Re: [aspell-devel] Tokenization of words containing hyphens
Date: Thu, 27 Jun 2013 18:59:15 +0200 (CEST)
>>>>> "ciaran" == Ciarán Ó Duibhín
>>>>> writes:
ciaran> Languages may have words containing an internal hyphen,
ciaran> but with the components not being themselves words of
ciaran> the language (a possible English example is
ciaran> hotch-potch). In such languages it is well to allow a
ciaran> word-internal hyphen in *.dat and put such "compounds" in
ciaran> the dictionary. No new code is required for this.
ciaran> However, with the change in status of the hyphen, all
ciaran> hyphenated compounds not explicitly included in the
ciaran> dictionary will now be rejected, even if their
ciaran> components are all in the dictionary. To avoid this, new
ciaran> code is needed, for languages supporting internal
ciaran> hyphen, to examine a rejected word, and if it contains
ciaran> an internal hyphen, to check the components separately.
ciaran> If all the components are accepted, so is the compound.
ciaran> The hyphen itself will not be included in the separate
ciaran> components on either side of it.
I don't think this is a good idea, for example, for French (the
language I know of that allows a word-internal hyphen in *.dat).
Composing words with hyphens is subject to several rules, so listing
the allowed pairs is better. For example, beau, belle, frère, soeur
are allowed words, as are beau-frère and belle-soeur (brother-in-law
and sister-in-law). But beau-soeur is wrong (as is "beau soeur" without
a hyphen, btw) because the genders must match, masculine with masculine
etc. And most word pairs, even if the gender and case are OK, are
wrong. The correct ones should be included in the dictionary, and the
others considered wrong.
In German, for example, you compose words without hyphens. You would
never allow a word just because it is obtained by composing two words.
My feeling might depend on how I use a spell-checker, which is to
check OCR results of vintage texts. Hence I prefer false positives
(words that are OK but marked as erroneous) to false negatives, and OCR
quite easily inserts a hyphen in place of a speckle between words. So
this change might lead to accepting a few wrong words without accepting
a substantial number of words that are OK but currently flagged.
So if you really want to accept words that are composed freely, the
feature should be language-specific (English, I believe, might be the
only one) and hence controlled from the *.dat files. For French it is
OK as is; don't fix what isn't broken.
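
To make the difference concrete, here is a minimal sketch in Python of
the component fallback described in the quoted proposal, gated by a
per-language switch. The names check_word and accept_hyphen_compounds
are hypothetical, chosen only for illustration; they are not aspell
functions or *.dat options, and the word list is a toy one.

```python
def check_word(word, dictionary, accept_hyphen_compounds=False):
    """Return True if `word` is accepted.

    `accept_hyphen_compounds` is a hypothetical per-language switch
    (not a real aspell option) standing in for a setting in *.dat.
    """
    # Explicitly listed words, including listed compounds, always pass.
    if word in dictionary:
        return True
    # With the switch off, unlisted compounds stay rejected.
    if not accept_hyphen_compounds or "-" not in word:
        return False
    # Fallback: split on hyphens and check each component separately.
    # An empty component means a leading, trailing, or doubled hyphen,
    # which is not a word-internal hyphen, so the word is rejected.
    parts = word.split("-")
    return all(part and part in dictionary for part in parts)


# Toy French word list taken from the examples in this message.
french = {"beau", "belle", "frère", "soeur", "beau-frère", "belle-soeur"}
```

With the switch off, check_word("beau-soeur", french) is rejected while
beau-frère and belle-soeur pass because they are listed. With the
switch on, beau-soeur is accepted, which is exactly the false negative
described above for French.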
Carlo Traverso