Re: [aspell-devel] Tokenization of words containing hyphens


From: Carlo Traverso
Subject: Re: [aspell-devel] Tokenization of words containing hyphens
Date: Thu, 27 Jun 2013 18:59:15 +0200 (CEST)

>>>>> "ciaran" == Ciarán Ó Duibhín writes:

    ciaran> Languages may have words containing an internal hyphen,
    ciaran> but with the components not being themselves words of
    ciaran> the language (a possible English example is
    ciaran> hotch-potch).  In such languages it is well to allow a
    ciaran> word-internal hyphen in *.dat and put such "compounds" in
    ciaran> the dictionary.  No new code is required for this.
    ciaran> However, with the change in status of the hyphen, all
    ciaran> hyphenated compounds not explicitly included in the
    ciaran> dictionary will now be rejected, even if their
    ciaran> components are all in the dictionary.  To avoid this, new
    ciaran> code is needed, for languages supporting internal
    ciaran> hyphen, to examine a rejected word, and if it contains
    ciaran> an internal hyphen, to check the components separately.
    ciaran> If all the components are accepted, so is the compound.
    ciaran> The hyphen itself will not be included in the separate
    ciaran> components on either side of it.

I don't think this is a good idea, for example for French (the
language I know that allows a word-internal hyphen in *.dat).

Composing words with hyphens is subject to several rules, so listing
the allowed pairs is better. For example, beau, belle, frère, soeur
are allowed words, as are beau-frère and belle-soeur (brother-in-law
and sister-in-law). But beau-soeur is wrong (as is "beau soeur"
without a hyphen, btw) because the genders must match, masculine with
masculine etc. And most word pairs are wrong even if gender and case
are OK. The correct ones should be included in the dictionary, and
the others considered wrong.
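
To make the French case concrete, here is a minimal Python sketch of
the component-wise fallback proposed above. It is not aspell code: the
toy word list and the function name accepts_with_fallback are
assumptions chosen only for illustration.

```python
# Toy dictionary (an assumption): the single words plus the two valid compounds.
FRENCH_WORDS = {"beau", "belle", "frère", "soeur", "beau-frère", "belle-soeur"}


def accepts_with_fallback(word, dictionary):
    """Accept a word if listed, or if every hyphen-separated part is listed."""
    if word in dictionary:
        return True
    parts = word.split("-")
    # Only an internal hyphen qualifies: at least two parts, none empty.
    if len(parts) < 2 or not all(parts):
        return False
    return all(part in dictionary for part in parts)


for candidate in ("belle-soeur", "beau-soeur"):
    print(candidate, accepts_with_fallback(candidate, FRENCH_WORDS))
```

With the fallback both candidates are accepted, although only
belle-soeur is listed; a plain dictionary lookup would flag
beau-soeur.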

In German, for example, you compose words without hyphens. You would
never allow a word just because it is obtained by composing two words.

My feeling may depend on how I use a spell-checker, namely to check
OCR results of vintage texts. There I prefer false positives (correct
words marked as erroneous) to false negatives, and OCR quite easily
inserts a hyphen in place of a speckle between words. So the change
might lead to accepting a few wrong words without rescuing a
substantial number of correct words that are currently flagged.

So if you really want to accept freely composed words, the feature
should be language-specific (English, I believe, might be the only
one), hence set in the *.dat files. For French it is OK as is; don't
fix what isn't broken.
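
One way to picture such a language-specific switch is sketched below
in Python. The setting name hyphen_fallback and the per-language table
are hypothetical: they are not existing aspell options and only stand
in for whatever a *.dat file would declare.

```python
# Hypothetical per-language settings standing in for *.dat contents.
LANGUAGE_SETTINGS = {
    "en": {"hyphen_fallback": True},
    "fr": {"hyphen_fallback": False},  # compounds must be listed explicitly
}


def check_word(word, dictionary, lang):
    """Dictionary lookup; the component check runs only if the language opts in."""
    if word in dictionary:
        return True
    if not LANGUAGE_SETTINGS.get(lang, {}).get("hyphen_fallback", False):
        return False
    parts = word.split("-")
    # An empty part (leading or trailing hyphen) is never in the dictionary.
    return len(parts) > 1 and all(part in dictionary for part in parts)


# Toy word lists (assumptions for illustration only).
french = {"beau", "belle", "frère", "soeur", "beau-frère", "belle-soeur"}
english = {"well", "known", "hotch-potch"}

print(check_word("beau-soeur", french, "fr"))
print(check_word("well-known", english, "en"))
```

Languages without an entry default to the current behaviour, so
nothing changes for them unless they opt in.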


Carlo Traverso


