Re: Splitting text into words and non-words
From: Kevin Atkinson
Subject: Re: Splitting text into words and non-words
Date: Sat, 02 Jan 1999 16:35:24 +0000
Asger Alstrup Nielsen wrote:
>
> > So I was wondering if anyone on this list has any experience in writing
> > this sort of context recognition code or could give me some pointers in
> > the right direction.
>
> I describe an algorithm in principle, and as a multi-pass algorithm that
> requires O(n) space. You probably want to do something to reduce that to a
> O(1) space algorithm, and that should be pretty easy to do by using a small
> buffer.
>
> First convert the file into a list of strings by splitting at whitespace and
> at changes from letters to non-letters. Also, throw away any non-letter
> characters that appear in ordinary text: , . : ; ( ) ! ? - "
>
> ...
>
> This is a rough algorithm that probably is easy to fool,
Yes, very, as it won't do well at all with function calls such as
printCont(sc.soundslike(wrd));
would become
printCont sc soundslike wrd
and thus none of it would be considered code.
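
To make the failure concrete, here is a minimal sketch of the splitting step as I read the quoted description: break at whitespace and at letter/non-letter changes, and drop the listed punctuation. The function name split_and_strip and the decision to treat a dropped character as a token boundary are my assumptions, not part of the original proposal.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text into tokens at whitespace and at
// letter/non-letter transitions, discarding ordinary-text punctuation.
std::vector<std::string> split_and_strip(const std::string& text) {
    // The characters the quoted message says to throw away.
    const std::string discarded = ",.:;()!?-\"";
    std::vector<std::string> tokens;
    std::string current;
    bool current_is_letters = false;

    // Emit the token collected so far, if any.
    auto flush = [&]() {
        if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    };

    for (char ch : text) {
        const unsigned char c = static_cast<unsigned char>(ch);
        // Whitespace and discarded punctuation both end the current token.
        if (std::isspace(c) || discarded.find(ch) != std::string::npos) {
            flush();
            continue;
        }
        const bool letter = std::isalpha(c) != 0;
        // A change between letters and non-letters starts a new token.
        if (!current.empty() && letter != current_is_letters) flush();
        current_is_letters = letter;
        current += ch;
    }
    flush();
    return tokens;
}
```

Fed the function call above, this yields only the four letter tokens, so nothing is left to hint that the line was code.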
And you can't always count words between two punctuation characters as
code because then a string like
. Howevr,
would mark Howevr as code when it is clearly part of a sentence.
Instead how about this:
Count the number of occurrences of all words which appear next to any
sort of symbol--including punctuation.
Then go back and look at the surrounding symbols for all words which
appear at least X times. If a word has a high occurrence of
a particular symbol either before or after it, mark all occurrences of
the word as correct.
For example, given the following code sample:
  case '$':
    if (cin.get() == '$') {
      switch(cin.get()) {
      case 's':
        get_word_pair(word,word2);
        cout << sc.score(word.c_str(),word2.c_str()) << endl;
        break;
      case 'S':
        switch(cin.get()) {
        case 'W':
        case 'w':
          cin >> word;
          cout << sc.to_soundslike(word) << endl;
          ignore_rest();
          break;
        case 'L':
The counts would be (for words which would normally be flagged as misspelled):
cin 4
cout 2
endl 2
sc 2
str 2
Because this is a small block of code, we will let X be 2. Thus:
cin -> (cin. 3/4 times
cout -> cout << 2/2 times
endl -> << endl; 2/2 times
sc -> << sc. 2/2 times
str -> _str( 2/2 times
Thus all 5 words will be ignored.
Under your system only "cout" and "endl" would be ignored.
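
The two passes described above could be sketched roughly as follows. This is only an illustration under several assumptions of mine: words are runs of letters, the "surrounding symbol" is the nearest punctuation character on the same line (skipping spaces and tabs), and "a high occurrence" means a fraction min_ratio of the word's counted occurrences. The names find_code_words, WordStats, min_count (the X above) and min_ratio are all hypothetical, and no dictionary is consulted, so correctly spelled words can end up in the result too.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Tallies gathered for one word during the first pass.
struct WordStats {
    int count = 0;               // occurrences with a symbol on at least one side
    std::map<char, int> before;  // symbol just before the word -> frequency
    std::map<char, int> after;   // symbol just after the word  -> frequency
};

static bool is_letter(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
static bool is_symbol(char c) { return std::ispunct(static_cast<unsigned char>(c)) != 0; }
static bool is_blank(char c) { return c == ' ' || c == '\t'; }

// Highest frequency recorded for any single symbol.
static int max_freq(const std::map<char, int>& m) {
    int best = 0;
    for (const auto& kv : m) best = std::max(best, kv.second);
    return best;
}

// Returns the words that look like code and could be skipped by the checker.
std::set<std::string> find_code_words(const std::string& text,
                                      int min_count, double min_ratio) {
    assert(min_count > 0 && min_ratio > 0.0 && min_ratio <= 1.0);
    std::map<std::string, WordStats> stats;
    const std::size_t n = text.size();
    std::size_t i = 0;

    // Pass 1: count every word that sits next to a symbol.
    while (i < n) {
        if (!is_letter(text[i])) { ++i; continue; }
        const std::size_t start = i;
        while (i < n && is_letter(text[i])) ++i;  // [start, i) is one word

        // Nearest non-blank neighbours; a line break counts as "no symbol".
        std::size_t l = start;
        while (l > 0 && is_blank(text[l - 1])) --l;
        const char left = (l > 0) ? text[l - 1] : '\n';
        std::size_t r = i;
        while (r < n && is_blank(text[r])) ++r;
        const char right = (r < n) ? text[r] : '\n';

        const bool left_sym = is_symbol(left);
        const bool right_sym = is_symbol(right);
        if (!left_sym && !right_sym) continue;

        WordStats& s = stats[text.substr(start, i - start)];
        ++s.count;
        if (left_sym) ++s.before[left];
        if (right_sym) ++s.after[right];
    }

    // Pass 2: keep frequent words dominated by one symbol on either side.
    std::set<std::string> code_words;
    for (const auto& kv : stats) {
        const WordStats& s = kv.second;
        if (s.count < min_count) continue;
        const int best = std::max(max_freq(s.before), max_freq(s.after));
        if (best >= min_ratio * s.count) code_words.insert(kv.first);
    }
    return code_words;
}
```

Calling find_code_words on the sample with min_count 2 and min_ratio 0.75 should pick up the five words listed above, while a lone ". Howevr," in a sentence stays below the count threshold. Note that a misspelling repeated after the same punctuation mark would still slip through, so the thresholds would need tuning on real text.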
--
Kevin Atkinson
address@hidden
http://metalab.unc.edu/kevina/