Re: [aspell-devel] Tokenization of words containing hyphens

From: Ciarán Ó Duibhín
Subject: Re: [aspell-devel] Tokenization of words containing hyphens
Date: Thu, 4 Jul 2013 18:05:05 +0100


I agree with Carlo that there is a strong case for continuing to tokenize French as at present. I therefore also agree that my proposed additional code for hyphens (change #3) ought to be made conditional on a new option in the *.dat file. The option would only be relevant when the *.dat file also contains a "special - " line with an asterisk in the medial position. A possible format for the option: "compounds -"?
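To make the proposal concrete, here is a minimal Python sketch of the conditional fallback. It is not aspell code: the function name `accepts`, the flag `compounds_enabled` (standing in for the proposed "compounds -" option) and the tiny word set are all hypothetical, chosen only to illustrate the behaviour.

```python
def accepts(word, dictionary, compounds_enabled=False):
    """Return True if `word` would be approved.

    `compounds_enabled` is a hypothetical stand-in for the proposed
    "compounds -" option in the *.dat file.
    """
    # The whole word is always tried first, as at present.
    if word in dictionary:
        return True
    # Without the option, a rejected word stays rejected.
    if not compounds_enabled or "-" not in word:
        return False
    # With the option, check the components separately; the hyphen
    # itself is not part of any component.
    parts = word.split("-")
    # An empty part means a leading, trailing or doubled hyphen,
    # which is not a word-internal hyphen.
    return all(part and part in dictionary for part in parts)


# Hypothetical mini-dictionary for illustration only.
words = {"beau", "belle", "frère", "soeur", "beau-frère", "belle-soeur"}

print(accepts("belle-soeur", words))                         # listed compound
print(accepts("beau-soeur", words))                          # option off
print(accepts("beau-soeur", words, compounds_enabled=True))  # option on
```

With the option off, "beau-soeur" is refused because it is not listed; with the option on, it is approved because both components are, which is exactly the kind of wrong approval Carlo objects to.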

Although my option is not directed at French, it is worth looking at what it would imply for French. Compared with the present behaviour, my option will wrongly approve e.g. "beau-soeur" or "avez-nous", and most users will probably agree with Carlo that such wrong approvals are more of a nuisance than any amount of wrong refusals. On the other side of the coin, French has several fairly productive types of hyphenated compound: forms such as "dit-il" or "temps-là"; words which include prefixes like "demi-", "mi-", "au-", "avant-", "arrière-" or suffixes like "-ci" or "-là"; numerals; and proper names. The dictionary can only hope to contain a selection of any of these compounds (e.g. it includes "avez-vous"; in fact, the 0.50 French dictionary has 629,570 words, in which there are 7,001 hyphens), and those compounds that it does not contain are refused, rightly or wrongly. My option will eliminate these refusals, and will allow the selection of compounds which has got into the dictionary to be removed from it.

Further, if the same decompounding behaviour were extended to the apostrophe, and c', d', j', l', m', n', qu', s', t' were added as prefixes to the dictionary, again the result would be a much smaller French dictionary, with fewer wrong refusals, but more wrong approvals (e.g. "l'dame", although one could say such an error is more to do with grammar than with spelling).
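The apostrophe extension can be sketched in the same spirit. This is only an illustration under stated assumptions: the helper names `split_components` and `accepts_decompounded` are hypothetical, the word set is invented, and I assume the apostrophe stays attached to the left-hand component (so that "l'" is what is looked up), whereas the hyphen is dropped.

```python
def split_components(word):
    """Split at hyphens (hyphen dropped) and after apostrophes
    (apostrophe kept on the left part, an assumption of this sketch)."""
    parts, current = [], ""
    for ch in word:
        if ch == "-":
            parts.append(current)
            current = ""
        elif ch == "'":
            parts.append(current + ch)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts


def accepts_decompounded(word, dictionary):
    """Approve a word if it is listed, or if all its components are."""
    if word in dictionary:
        return True
    parts = split_components(word)
    if len(parts) == 1:
        return False  # nothing to decompound
    # Empty parts (leading, trailing or doubled separators) are refused.
    return all(part and part in dictionary for part in parts)


# Hypothetical mini-dictionary: elided prefixes are stored once,
# instead of every prefix + word combination.
lexicon = {"l'", "d'", "homme", "dame", "dit", "il"}

for w in ["l'homme", "dit-il", "l'dame", "l'hom"]:
    print(w, accepts_decompounded(w, lexicon))
```

Here "l'homme" and "dit-il" are approved without being listed, "l'hom" is still refused, and "l'dame" is approved: the smaller dictionary comes at the price of that kind of wrong approval.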

Ciarán Ó Duibhín

----- Original Message ----- From: "Carlo Traverso" <address@hidden>
To: <address@hidden>
Cc: <address@hidden>
Sent: Thursday, June 27, 2013 5:59 PM
Subject: Re: [aspell-devel] Tokenization of words containing hyphens

"ciaran" == Ciarán Ó Duibhín writes:

   ciaran> Languages may have words containing an internal hyphen,
   ciaran> but with the components not being themselves words of
   ciaran> the language (a possible English example is
   ciaran> hotch-potch).  In such languages it is well to allow a
   ciaran> word-internal hyphen in *.dat and put such "compounds" in
   ciaran> the dictionary.  No new code is required for this.
   ciaran> However, with the change in status of the hyphen, all
   ciaran> hyphenated compounds not explicitly included in the
   ciaran> dictionary will now be rejected, even if their
   ciaran> components are all in the dictionary.  To avoid this, new
   ciaran> code is needed, for languages supporting internal
   ciaran> hyphen, to examine a rejected word, and if it contains
   ciaran> an internal hyphen, to check the components separately.
   ciaran> If all the components are accepted, so is the compound.
   ciaran> The hyphen itself will not be included in the separate
   ciaran> components on either side of it.

I don't think that this is a good idea, for example, for French (the
language I know that allows a word-internal hyphen in *.dat).

Composing words with hyphens is subject to several rules, so listing
the allowed pairs is better. For example, beau, belle, frère, soeur
are allowed words, as well as beau-frère, belle-soeur (brother-in-law,
sister-in-law). But beau-soeur is wrong (as is "beau soeur" without
hyphen, btw) because you need to match genders, masculine with
masculine etc. And most word pairs, even if the gender and case are
OK, are wrong. The correct ones should be included in the dictionary,
and the others considered wrong.

In German, for example, you compose words without hyphens. You would
never allow a word just because it is obtained by composing two words.

My feeling might depend on my usage of a spell-checker, which is to
check OCR results of vintage texts. Hence I prefer false positives
(marking words that are OK as erroneous) to false negatives. OCR
quite easily inserts a hyphen in place of a speckle between words, so
this change might lead to accepting a few wrong words without
accepting a substantial number of words that are OK but currently
flagged.

So if you really want to accept words that are composed freely, the
feature should be language-specific (English, I believe, might be the
only one), hence included in *.dat files. For French it is OK as is;
don't fix what isn't broken.

Carlo Traverso
