Re: Splitting text into words and non-words

From: Kevin Atkinson
Subject: Re: Splitting text into words and non-words
Date: Sat, 02 Jan 1999 16:35:24 +0000

Asger Alstrup Nielsen wrote:
> > So I was wondering if anyone on this list has any experience in writing
> > this sort of context recognition code or could give me some pointers in
> > the right direction.
> I describe an algorithm in principle, and as a multi-pass algorithm that
> requires O(n) space.  You probably want to do something to reduce that to a
> O(1) space algorithm, and that should be pretty easy to do by using a small
> buffer.
> First convert the file into a list of strings by splitting at whitespace and
> at changes from letters to non-letters.  Also, throw away any non-letter
> characters that appear in ordinary text:  , . : ; ( ) ! ? - "
> ...
> This is a rough algorithm that probably is easy to fool,

Yes, very. It won't do well at all with function calls, as


would become 

 printCont sc soundslike wrd

and thus none of it would be considered code.
And you can't always count words between two punctuation characters as
code, because then a string like
 .  Howevr,

would mark Howevr as code when it is clearly part of a sentence.

Instead, how about this:

Count the number of occurrences of every word which appears next to any
sort of symbol, including punctuation.

Then go back and look at the surrounding symbols for all words which
appear at least X times.  If a word has a high occurrence of a
particular symbol either before or after it, mark all occurrences of
the word as correct.

For example, given the following code sample:

    case '$':
      if (cin.get() == '$') {
        switch(cin.get()) {
        case 's':
          cout << sc.score(word.c_str(),word2.c_str()) << endl;
        case 'S':
          switch(cin.get()) {
          case 'W':
          case 'w':
            cin >> word;
            cout << sc.to_soundslike(word) << endl;
          case 'L':

The counts would be (for words which would normally be misspelled):
  cin  4
  cout 2
  endl 2 
  sc   2
  str  2

Because this is a small block of code, we will let X be 2.  Thus:
  cin  ->  (cin.    3/4 times
  cout ->  cout <   2/2 times
  endl ->  << endl; 2/2 times
  sc   ->  << sc.   2/2 times
  str  ->  _str(    2/2 times

Thus all 5 words will be ignored.
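
The steps above could be sketched roughly as follows. This is only an illustration, not code from any existing spell checker: the names find_code_words, KNOWN_WORDS, min_count (playing the role of X) and min_ratio are all made up here, the tiny word set stands in for a real dictionary, and 0.75 is an assumed reading of "a high occurrence". A neighbouring symbol is taken to be the nearest non-space character on the same line, and single letters are skipped.

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[A-Za-z]{2,}")  # assumption: single letters are not checked

SAMPLE = """
    case '$':
      if (cin.get() == '$') {
        switch(cin.get()) {
        case 's':
          cout << sc.score(word.c_str(),word2.c_str()) << endl;
        case 'S':
          switch(cin.get()) {
          case 'W':
          case 'w':
            cin >> word;
            cout << sc.to_soundslike(word) << endl;
          case 'L':
"""

# Hypothetical stand-in for a real dictionary lookup.
KNOWN_WORDS = {"case", "if", "get", "switch", "score", "word", "to"}


def is_symbol(ch):
    """Anything that is neither a letter/digit nor whitespace."""
    return not ch.isalnum() and not ch.isspace()


def neighbour_symbols(line, start, end):
    """Nearest non-space character on each side, or None if it is not a symbol."""
    left = line[:start].rstrip()
    right = line[end:].lstrip()
    before = left[-1] if left and is_symbol(left[-1]) else None
    after = right[0] if right and is_symbol(right[0]) else None
    return before, after


def find_code_words(text, known_words, min_count=2, min_ratio=0.75):
    """Return {word: (occurrences, hits of its most common neighbouring symbol)}."""
    counts = Counter()
    before = defaultdict(Counter)
    after = defaultdict(Counter)

    # Pass 1: count unknown words that sit next to a symbol on either side.
    for line in text.splitlines():
        for m in WORD.finditer(line):
            word = m.group()
            if word.lower() in known_words:
                continue
            b, a = neighbour_symbols(line, m.start(), m.end())
            if b is None and a is None:
                continue
            counts[word] += 1
            if b is not None:
                before[word][b] += 1
            if a is not None:
                after[word][a] += 1

    # Pass 2: keep frequent words dominated by one symbol before or after.
    result = {}
    for word, n in counts.items():
        if n < min_count:
            continue
        best = max([*before[word].values(), *after[word].values()])
        if best / n >= min_ratio:
            result[word] = (n, best)
    return result


if __name__ == "__main__":
    for word, (n, best) in find_code_words(SAMPLE, KNOWN_WORDS).items():
        print(f"{word:5} {best}/{n}")
```

With these assumed settings the sketch flags cin, cout, endl, sc and str on the sample, while a lone "Howevr," in ordinary prose stays below min_count and is still reported as misspelled.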

Under your system, only "cout" and "endl" would be ignored.

Kevin Atkinson
