[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
## re: [Gnumed-devel] record matching

**From**: |
sjtan |

**Subject**: |
re: [Gnumed-devel] record matching |

**Date**: |
Mon, 23 Aug 2004 15:02:49 +1000 |

**User-agent**: |
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040616 |

`I think , not sure, that each record's keying fields are compared to
``the m * k sublists that are generated ( m is the blocking key count,
``k is the average number of
``sublists that are generated for a given threshold value) . The
``comparison function might run as a simple string comparison on each
``pair, starting on the next
``character after the last match, and return true if it makes it through
``the whole sublist. If a match occurs, each sublist is associated with a
``list of candidate
``record numbers for comparison , and the record that matches is added as
``a candidate.
``Then for the m * k sublists, the sum of the decreasing sequence j-1 to
``1 comparisons are made for each j the size of the candidate list in a
``sublist.
``
``
`

`this is a pretty horrible paraphrasing, trying to work out the O
``complexity of actually running the comparisons
`generated by Bigram indexing.

`the example was 'ba', 'ax', 'xt', 'te', 'er' from 'baxter' a block
``key , so there are 5 bigrams. a threshold value of 0.8 , means
``0.8 x the number of bigrams for a block key , in this 0.8 x 5 = 4,
``so all various inorder combinations of the 4 of the 5 bigrams are
`used for comparison.

`ba ax xt te , ba xt te er, ba ax te er, ax xt te er
``
``below is an procedure to determine the bigram sequences ; you can change
``the value of t and threshold to get different sequences.
``It seems to resemble the logic of inductive proof,
``e.g. prove the sum of the sequence(1..n) is = n (n+1) / 2 ; first
``prove for n =1, assume it's true for n, then prove for n=n'+1 by
``converging LHS and RHS.
`The methodology is to define the known basic case, and then guess the logic.
def select( start, n , seq):
if n == 0:
return []
if n == 1:
ll = []
for i in range (start, len(seq)):
ll.append ( [seq[i]] )
return ll
ll = []
for i in range( start, len(seq) - n + 1):
ll2 = select ( i + 1, n -1, seq)
for l in ll2:
ll.append ( [ seq[i] ] + l )
return ll
if __name__=='__main__':
t = [ 'ba', 'ax', 'xt', 'te', 'er']
threshold = 0.6
n = int(threshold * len(t) + 0.00001)
print n
l = select ( 0, n, t)
print l

**re: [Gnumed-devel] record matching**,
*sjtan* **<=**