[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

re: [Gnumed-devel] record matching

From: sjtan
Subject: re: [Gnumed-devel] record matching
Date: Mon, 23 Aug 2004 15:02:49 +1000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040616

I think , not sure, that each record's keying fields are compared to the m * k sublists that are generated ( m is the blocking key count, k is the average number of sublists that are generated for a given threshold value) . The comparison function might run as a simple string comparison on each pair, starting on the next character after the last match, and return true if it makes it through the whole sublist. If a match occurs, each sublist is associated with a list of candidate record numbers for comparison , and the record that matches is added as a candidate. Then for the m * k sublists, the sum of the decreasing sequence j-1 to 1 comparisons are made for each j the size of the candidate list in a sublist.
this is a pretty horrible paraphrasing, trying to work out the O complexity of actually running the comparisons
generated by Bigram indexing.
the example was 'ba', 'ax', 'xt', 'te', 'er' from 'baxter' a block key , so there are 5 bigrams. a threshold value of 0.8 , means 0.8 x the number of bigrams for a block key , in this 0.8 x 5 = 4, so all various inorder combinations of the 4 of the 5 bigrams are
used for comparison.
ba ax xt te , ba xt te er, ba ax te er, ax xt te er below is an procedure to determine the bigram sequences ; you can change the value of t and threshold to get different sequences. It seems to resemble the logic of inductive proof, e.g. prove the sum of the sequence(1..n) is = n (n+1) / 2 ; first prove for n =1, assume it's true for n, then prove for n=n'+1 by converging LHS and RHS.
The methodology is to define the known basic case, and then guess the logic.

def select( start, n , seq):
       if n == 0:
            return []
       if n == 1:
            ll = []
            for i in range (start, len(seq)):
               ll.append ( [seq[i]] )
            return ll
       ll = []
       for i in range( start, len(seq) - n + 1):
             ll2 = select ( i + 1, n -1, seq)
             for l in ll2:
                ll.append ( [ seq[i] ] + l )
       return ll

if __name__=='__main__':
       t = [ 'ba', 'ax', 'xt', 'te', 'er']
       threshold = 0.6
       n = int(threshold * len(t) + 0.00001)
       print n
       l = select ( 0, n, t)
       print l

reply via email to

[Prev in Thread] Current Thread [Next in Thread]