[aspell-devel] Tokenization of words containing hyphens

This is the third and last part (change #3) of my consideration of apostrophes and hyphens in aspell.

Languages may have words containing an internal hyphen whose components are not themselves words of the language (a possible English example is hotch-potch). In such languages it is sensible to allow a word-internal hyphen in *.dat and to put such "compounds" in the dictionary. No new code is required for this. However, with this change in the status of the hyphen, every hyphenated compound not explicitly included in the dictionary will now be rejected, even if its components are all in the dictionary. To avoid this, new code is needed, for languages supporting an internal hyphen, to examine a rejected word and, if it contains an internal hyphen, to check the components separately. If all the components are accepted, so is the compound. The hyphen itself is not included in the separate components on either side of it.
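To illustrate the fallback just described, here is a minimal standalone sketch. It is not aspell code: the Dictionary type and the accept_word function are hypothetical names, and a std::set stands in for the real dictionary lookup. It accepts a word found whole in the dictionary, and otherwise accepts it only if every hyphen-separated component is found.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the dictionary; not aspell's API.
using Dictionary = std::set<std::string>;

// Accept `word` if it is listed as a whole (e.g. "hotch-potch"), or if it
// contains hyphens and every component between them is listed separately.
bool accept_word(const Dictionary& dict, const std::string& word) {
    if (dict.count(word) > 0) return true;                  // whole-word match
    if (word.find('-') == std::string::npos) return false;  // nothing to split

    std::size_t start = 0;
    while (true) {
        const std::size_t hyphen = word.find('-', start);
        const std::string part =
            (hyphen == std::string::npos) ? word.substr(start)
                                          : word.substr(start, hyphen - start);
        // An empty part means a leading, trailing or doubled hyphen.
        if (part.empty() || dict.count(part) == 0) return false;
        if (hyphen == std::string::npos) return true;       // last component
        start = hyphen + 1;                                 // skip the hyphen
    }
}
```

With a dictionary holding spell, check and hotch-potch, this accepts spell-check through its components and hotch-potch as a whole entry, but rejects hotch-check because hotch is not a word on its own.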

There is something else we can do, when a hyphen is found in a token: we can check whether the component before AND INCLUDING the hyphen might be a known prefix; or whether the component after AND INCLUDING the hyphen might be a known suffix. Thus the dictionary could be allowed to include prefixes (including a final hyphen) and suffixes (including an initial hyphen), and we can modify *.dat to allow this. Code must be added to support matching of prefixes and suffixes, to be activated if *.dat allows initial/terminal hyphen, and when a rejected token contains an internal hyphen.
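The prefix and suffix idea can be sketched in the same toy setting. Everything here is an assumption made for illustration: HyphenRules mimics the begin/middle/end permissions that *.dat would grant to the hyphen, check_token is a hypothetical name, and a std::set again stands in for the dictionary. Prefixes are stored with their final hyphen and suffixes with their initial hyphen.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical toy model; none of these names belong to aspell.
struct HyphenRules {
    bool begin;   // a hyphen may start an entry (suffixes such as "-like")
    bool middle;  // a hyphen may occur inside a word
    bool end;     // a hyphen may end an entry (prefixes such as "anti-")
};

using Dictionary = std::set<std::string>;

bool check_token(const Dictionary& dict, const HyphenRules& rules,
                 const std::string& token) {
    if (dict.count(token) > 0) return true;  // accepted as a whole

    // Try each strictly internal hyphen as a split point.
    for (std::size_t i = 1; i + 1 < token.size(); ++i) {
        if (token[i] != '-') continue;
        const std::string head = token.substr(0, i);             // "anti"
        const std::string head_hyphen = token.substr(0, i + 1);  // "anti-"
        const std::string tail = token.substr(i + 1);            // "war"
        const std::string hyphen_tail = token.substr(i);         // "-war"

        // Prefix including its hyphen, then the remainder as a word.
        if (rules.end && dict.count(head_hyphen) > 0 &&
            check_token(dict, rules, tail))
            return true;

        // Plain word, then the remainder as a word or as a suffix.
        if (rules.middle && dict.count(head) > 0) {
            if (check_token(dict, rules, tail)) return true;
            if (rules.begin && check_token(dict, rules, hyphen_tail))
                return true;
        }
    }
    return false;
}
```

For example, with the entries anti-, -like, war and child and all three permissions enabled, anti-war is accepted as prefix plus word and child-like as word plus suffix, while anti-like is rejected because neither like nor anti is a plain word.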

The extra code for processing a token containing an internal hyphen, after the token has been rejected as a whole, is positioned in modules/speller/default/speller_impl.cpp, in procedure SpellerImpl::check at around line 190. The new code is placed before the check for two words run together without a space, though this may not be the best place for it. NOTE that I don't understand the purpose of parameters 3 to 6 of procedure check, or the corresponding parameters of procedure check2, and have probably not used them correctly. But the concept is shown to work.

Here is the additional code:

    unsigned i = 0;
    while (*(word+i) != 0) {
      if ((i > 0) && (i < word_end-word-1) && (*(word+i) == '-')) {
        if (lang_->special('-').end) {
          /* test up to hyphen as prefix, test remainder recursively as word */
          char t = *(word+i+1);
          *(word+i+1) = (char) 0;
          if (check2(word, try_uppercase, *ci, gi)) {
            *(word+i+1) = t;
            if (check(word+i+1, word_end, try_uppercase, run_together_limit, ci, gi))
              return true;
          }
          else
            *(word+i+1) = t;
        }
        if (lang_->special('-').middle) {
          /* test up to hyphen as word, test remainder recursively as word, then as suffix */
          *(word+i) = (char) 0;
          if (check2(word, try_uppercase, *ci, gi)) {
            *(word+i) = '-';
            if (check(word+i+1, word_end, try_uppercase, run_together_limit, ci, gi))
              return true;
            else {
              if (lang_->special('-').begin) {
                if (check(word+i, word_end, try_uppercase, run_together_limit, ci, gi))
                  return true;
              }
            }
          }
          else
            *(word+i) = '-';
        }
      }
      ++i;
    }

For this code to work as intended, change #2 is also necessary. Consider the token spell-check. We must test whether the dictionary contains a prefix spell- or a suffix -check, or the plain words spell and check. We would expect to find no such prefix or suffix, but to find the two plain words. But unless change #2 is made, the token spell- will be accepted as matching the dictionary form spell, and the process will end prematurely, albeit with the right result in this case.
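The spell-check case can be traced with a small toy model. This is only a sketch under assumptions: lookup and accepted_by are hypothetical names, and the strict flag models what this paragraph says about change #2, namely that without it a trailing hyphen is ignored when matching against the dictionary. The function reports which test accepts a token with one internal hyphen.

```cpp
#include <cstddef>
#include <set>
#include <string>

using Dictionary = std::set<std::string>;  // hypothetical stand-in

// Toy lookup. With strict == false a trailing hyphen is ignored, which is
// an assumed model of the behaviour before change #2.
bool lookup(const Dictionary& dict, std::string token, bool strict) {
    if (!strict && token.size() > 1 && token.back() == '-') token.pop_back();
    return dict.count(token) > 0;
}

// Reports which test accepts a token, splitting at its first hyphen.
std::string accepted_by(const Dictionary& dict, const std::string& token,
                        bool strict) {
    const std::size_t i = token.find('-');
    if (i == std::string::npos || i == 0 || i + 1 == token.size())
        return "not applicable";  // no internal hyphen
    const std::string tail = token.substr(i + 1);

    // First test: prefix including the hyphen, then the remainder.
    if (lookup(dict, token.substr(0, i + 1), strict) &&
        lookup(dict, tail, strict))
        return "prefix + word";

    // Second test: plain word, then the remainder as word or suffix.
    if (lookup(dict, token.substr(0, i), strict)) {
        if (lookup(dict, tail, strict)) return "word + word";
        if (lookup(dict, token.substr(i), strict)) return "word + suffix";
    }
    return "rejected";
}
```

With a dictionary containing only spell and check, the non-strict lookup accepts spell-check at the prefix test, because spell- is matched against spell. The strict lookup rejects spell- as a prefix and accepts the token as two plain words, which is the intended path.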

As before, my experiments have been conducted using the Hatier port of aspell for Windows at http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 . The changes suggested in these three messages have been made to this source and compiled using VC++ 2005. On the evidence so far, the changes appear to be working as intended, thereby solving the three problems I reported to aspell-user on 19 May 2013, and allowing aspell to treat the tokenization of apostrophes and hyphens in a similar way to the MS Word spell-checker. As far as I can see, no existing functionality is adversely affected by these changes.

Ciarán Ó Duibhín

From:	Ciarán Ó Duibhín
Subject:	[aspell-devel] Tokenization of words containing hyphens
Date:	Fri, 21 Jun 2013 12:31:48 +0100