[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

re: [Gnumed-devel] record matching

From: Tim Churches
Subject: re: [Gnumed-devel] record matching
Date: 22 Aug 2004 20:01:11 +1000

On Mon, 2004-08-23 at 15:02, sjtan wrote:
> >
> >
> >I think , not sure, that each record's keying fields are compared to 
> >the   m  * k  sublists that are generated ( m is the blocking key count, 
> >k is the average number of
> >sublists that are generated for a given threshold value) . The 
> >comparison function might run as a simple string comparison on each 
> >pair, starting on the next
> >character after the last match, and return true if it makes it through 
> >the whole sublist.  If a match occurs, each sublist is associated with a 
> >list of candidate
> >record numbers for comparison , and the record  that matches is added as 
> >a candidate.
> >Then for the m * k sublists,  the sum of the decreasing sequence j-1 to 
> >1  comparisons are made for each j the size of the candidate list in a 
> >sublist.
> >  
> >
> this is a pretty horrible paraphrasing, trying to work out the  O 
> complexity of actually running the comparisons
> generated by Bigram indexing.
> the example was 'ba', 'ax', 'xt', 'te', 'er'   from 'baxter' a block 
> key  ,  so there are 5 bigrams. a threshold value of 0.8 , means
> 0.8  x  the number of bigrams for  a  block key , in this 0.8 x 5 = 4,  
> so  all various inorder combinations of the 4 of the 5 bigrams are
> used for comparison.
> ba ax xt te  ,   ba xt te er,  ba ax te er,      ax xt te er 

No, you are correct about "combinations" but combinations are by
definition unordered with no repeats (as opposed to permutations which
are ordered and permit repeats) - so if the threshold is 4 out 5 then
there are rCn combinations to check where rCn = n! / r! * (n - r)! (from
memory - I think that is correct).

> below is an procedure to determine the bigram sequences ; you can change 
> the value of t and threshold to get different sequences.
> It seems to resemble the logic of  inductive proof,  
> e.g.  prove the sum of the sequence(1..n) is =  n (n+1) / 2   ;  first 
> prove for n =1, assume   it's true for n, then prove  for n=n'+1 by 
> converging LHS and RHS.
> The methodology is to define the known basic case, and then guess the logic.
> def select( start, n , seq):
>         if n == 0:
>              return []
>         if n == 1:
>              ll = []
>              for i in range (start, len(seq)):
>                 ll.append ( [seq[i]] )
>              return ll
>         ll = []
>         for i in range( start, len(seq) - n + 1):
>               ll2 = select ( i + 1, n -1, seq)
>               for l in ll2:
>                  ll.append ( [ seq[i] ] + l )
>         return ll
> if __name__=='__main__':
>         t = [ 'ba', 'ax', 'xt', 'te', 'er']
>         threshold = 0.6
>         n = int(threshold * len(t) + 0.00001)
>         print n
>         l = select ( 0, n, t)
>         print l

Um, OK, I think. We use a similar recursive function (in pure Python) to
create the combinations, or the Probstat module, which is written in C,
which is faster (but speed of calculating the combinations doesn't
matter much for a single "bigram index" look-up). Anyway, its all there
in the Febrl code which is freely downloadable and re-usable.

Of course, having formed candidate record lists by doing bigram (or
other) indexed look-ups on each field (separate lookups for surnane,
given name etc), you then need to calculate a set of weighted sums (the
"match weights") and choose the highest of these, or a set above a
threshold, as the candidate matches. You can choose the weights
deterministically or you can use a probabilistic approach to choose
statistically optimal weights. Both methods are implemented, in Python,
in Febrl. However, Febrl is intended for batch matching of many records
to many records, but the single record look-up problem is just a
degenerate case of that. 

As I said, all this may be overkill for a small GP database.

OK, digging about on my hard disc I found this pure Python function to
calculate rCn combinations:

def sublist_comb(in_list, length):
  """Routine to recursively compute all combinations of sub-lists
     of the given 'in_list' of length 'len'.
     Based on Gagan Saksena's routines from Activestate Python
     cookbook, see:
     and modified by Ole Nielsen, MSI ANU, November 2002
  sub_lists = []
  in_list_len = len(in_list)  # Number of elements in in_list
  if (length == 0):
    return [[]]
    for i in range(in_list_len):
      sub = sublist_comb(in_list[i+1:], length-1)
      for l in sub:
        l.insert(0, in_list[i])
      sub_lists += sub
  return sub_lists

print sublist_comb(['ba','ax','xt','te','er'],4)


Tim C

PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B  EB37 7891 46A9 EAF9 93D0

reply via email to

[Prev in Thread] Current Thread [Next in Thread]