gnumed-devel
[Top][All Lists]

## re: [Gnumed-devel] record matching

 From: sjtan Subject: re: [Gnumed-devel] record matching Date: Mon, 23 Aug 2004 15:02:49 +1000 User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040616

```

```
I think , not sure, that each record's keying fields are compared to the m * k sublists that are generated ( m is the blocking key count, k is the average number of sublists that are generated for a given threshold value) . The comparison function might run as a simple string comparison on each pair, starting on the next character after the last match, and return true if it makes it through the whole sublist. If a match occurs, each sublist is associated with a list of candidate record numbers for comparison , and the record that matches is added as a candidate. Then for the m * k sublists, the sum of the decreasing sequence j-1 to 1 comparisons are made for each j the size of the candidate list in a sublist.
this is a pretty horrible paraphrasing, trying to work out the O complexity of actually running the comparisons
```generated by Bigram indexing.
```
the example was 'ba', 'ax', 'xt', 'te', 'er' from 'baxter' a block key , so there are 5 bigrams. a threshold value of 0.8 , means 0.8 x the number of bigrams for a block key , in this 0.8 x 5 = 4, so all various inorder combinations of the 4 of the 5 bigrams are
```used for comparison.
```
ba ax xt te , ba xt te er, ba ax te er, ax xt te er below is an procedure to determine the bigram sequences ; you can change the value of t and threshold to get different sequences. It seems to resemble the logic of inductive proof, e.g. prove the sum of the sequence(1..n) is = n (n+1) / 2 ; first prove for n =1, assume it's true for n, then prove for n=n'+1 by converging LHS and RHS.
```The methodology is to define the known basic case, and then guess the logic.

def select( start, n , seq):
if n == 0:
return []
if n == 1:
ll = []
for i in range (start, len(seq)):
ll.append ( [seq[i]] )
return ll
ll = []
for i in range( start, len(seq) - n + 1):
ll2 = select ( i + 1, n -1, seq)
for l in ll2:
ll.append ( [ seq[i] ] + l )
return ll

if __name__=='__main__':
t = [ 'ba', 'ax', 'xt', 'te', 'er']
threshold = 0.6
n = int(threshold * len(t) + 0.00001)
print n
l = select ( 0, n, t)
print l

```