gnumed-devel

## [Gnumed-devel] record matching : thinking out loud

From: sjtan
Subject: [Gnumed-devel] record matching : thinking out loud
Date: Sun, 22 Aug 2004 22:23:22 +1000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040616

Thanks to Tim Churches for the references.

The problem seems to be: how to join misspelt records that refer to the same identity/locality. The paper says that for n records, without grouping records for comparison, n^2 comparisons are needed, which for large n (100,000 to millions) is expensive and time-consuming (10^5 records becomes 10^10 comparisons, and the comparison function may not be simple).

The class of algorithms that deals with this problem is called 'blocking': records are scanned in one pass (n) and put into m blocks, with up to n/m records in each block. Then (n/m)^2 comparisons are done for each of the m blocks, giving (n/m)^2 * m = n^2 / m comparisons in total. Even better, a cluster with c nodes could parallelize block processing, giving n^2 / (m * c) comparisons per node. So for, say, a cluster of 20 computers and 10,000 blocks, that's c * m = 200,000 or 2 x 10^5, and 10^5 records can now be processed with only 10^6 non-simple comparisons in total (about 5 x 10^4 per node) instead of 10^10.
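
The counting above can be sanity-checked in a few lines (n, m and c are the figures from the paragraph; these are comparison counts, not timings):

```python
n = 10**5   # records
m = 10**4   # blocks
c = 20      # cluster nodes

naive = n**2                  # all-pairs: every record against every record
blocked = (n // m)**2 * m     # (n/m)^2 comparisons in each of the m blocks
per_node = blocked // c       # blocked work spread evenly over c nodes

print(naive, blocked, per_node)   # 10^10 vs 10^6 total, 5 x 10^4 per node
```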

The paper talks about Standard Blocking, "Sort/Window" (actually Sorted Neighbourhood) Blocking, and Bigram Blocking, a variant of which is Canopy Clustering with a Term Frequency / Inverse Document Frequency (TFIDF) threshold.
This is what I could glean:
- Standard blocking uses a key, e.g. a fixed-length prefix combination of the first 4 letters of the surname and the first 4 letters of the firstname, to make the blocks (e.g. select id_identity from names order by prefixNames(lastnames, 4, firstnames, 4)).
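
A minimal sketch of that idea in Python, using plain dicts instead of a database (the `prefixNames` helper in the SQL above is hypothetical; `blocking_key` here is my guess at what it would do):

```python
from collections import defaultdict

def blocking_key(surname, firstname, k=4):
    """Fixed-length prefix key: first k letters of each name, lower-cased."""
    return (surname[:k] + firstname[:k]).lower()

def standard_blocking(records, k=4):
    """Group record ids into blocks; only records sharing a key get compared."""
    blocks = defaultdict(list)
    for rec_id, surname, firstname in records:
        blocks[blocking_key(surname, firstname, k)].append(rec_id)
    return blocks

records = [(1, "Baxter", "John"), (2, "Baxtor", "John"), (3, "Smith", "Ann")]
blocks = standard_blocking(records)
# 'Baxter' and 'Baxtor' share the prefix 'baxt', so they land in one block
```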

- Sorted Neighbourhood sorts on a key and then selects with a fixed-length window (e.g. select id_identity from names order by prefixNames(lastnames, 4, firstnames, 4) limit w offset i*w).
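
Note the limit/offset query gives disjoint windows; the usual description of Sorted Neighbourhood slides a window of size w one record at a time over the sorted list, so neighbouring windows overlap. A sketch of that sliding variant (an assumption on my part, not the paper's exact procedure):

```python
def sorted_neighbourhood_pairs(records, key, w):
    """Sort on the blocking key, then compare each record with the
    w - 1 records that follow it in sort order (sliding window)."""
    ordered = sorted(records, key=key)
    pairs = []
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.append((ordered[i], ordered[j]))
    return pairs

# with w = 2 each record is only compared to its immediate neighbour
pairs = sorted_neighbourhood_pairs([5, 1, 4, 2], key=lambda x: x, w=2)
```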

- Bigram Index blocking takes the set of keys and chops each into overlapping 2-character pairs ('baxter' becomes 'ba', 'ax', 'xt', 'te', 'er'). A threshold value between 0.0 and 1.0 is multiplied by the length of each list of pairs (the above has 5 pairs, so if the threshold is 0.8, 5 x 0.8 = 4 is the length of the sublists to find). All sublists of that length are then generated as in-order combinations of the original key's bigrams (e.g. 'ba', 'xt', 'te', 'er' is valid for threshold 0.8 on the original 5-pair list, but 'xt', 'ba', 'te', 'er' is out of order, so isn't valid). I think, but am not sure, that each record's keying fields are compared to the m * k sublists that are generated (m is the blocking key count, k is the average number of sublists generated for a given threshold value). The comparison function might run as a simple string comparison on each pair, starting at the next character after the last match, and return true if it makes it through the whole sublist. Each sublist is associated with a list of candidate record numbers for comparison, and a record that matches is added as a candidate. Then, for the m * k sublists, the sum of the decreasing sequence j-1 down to 1 (i.e. j * (j - 1) / 2) comparisons are made for each sublist whose candidate list has size j.
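
The sublist generation part is easy to sketch: itertools.combinations already yields combinations in input order, which is exactly the "in-order" constraint above.

```python
from itertools import combinations

def bigrams(key):
    """Overlapping 2-character pairs: 'baxter' -> ['ba','ax','xt','te','er']."""
    return [key[i:i + 2] for i in range(len(key) - 1)]

def bigram_sublists(key, threshold):
    """All in-order sublists whose length is threshold * number of bigrams
    (rounded down), as in the bigram-index description above."""
    grams = bigrams(key)
    n = int(len(grams) * threshold)
    return [list(c) for c in combinations(grams, n)]

# 'baxter' at threshold 0.8: 5 bigrams, sublist length 4, C(5,4) = 5 sublists
subs = bigram_sublists("baxter", 0.8)
```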

- Canopy Clustering with TFIDF blocking is not intuitive from what the paper describes. It says it chooses a record at random as the blocking key, adds records within a loose threshold function, and then removes records (including the chosen random key) that are within a tight threshold function; the threshold function has something to do with term frequency / inverse document frequency, which is unexplained. The result is the block for comparison.
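
My best guess at the loop being described, as a sketch. The similarity function `sim` and the `loose`/`tight` values here are stand-ins; the paper's actual measure is TFIDF-based and unexplained, so treat this as a guess at the control flow only:

```python
import random

def canopy_clusters(records, sim, loose, tight):
    """Sketch of the canopy loop: pick a random record as the blocking key,
    gather everything within the loose threshold into a canopy, then drop
    from the pool everything (including the key) within the tight threshold."""
    pool = set(records)
    canopies = []
    while pool:
        seed = random.choice(sorted(pool))
        canopy = {r for r in pool if sim(seed, r) >= loose}
        canopies.append(canopy)
        pool -= {r for r in canopy if sim(seed, r) >= tight}
        pool.discard(seed)  # always remove the key so the loop terminates
    return canopies

# stand-in similarity on integers, just to exercise the loop
sim = lambda a, b: 1.0 / (1 + abs(a - b))
canopies = canopy_clusters([1, 2, 10], sim, loose=0.3, tight=0.9)
```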

There was a description of the Jaro/Winkler function, which I found with Google. It is a string comparison function with the formula s = 1/3 (c/m + c/n + (c - t)/c), where m and n are the lengths of the two strings to compare, c is the number of common characters, and t is the number of out-of-order transpositions of characters. This function is the average of three terms that are each between 0 and 1.0, so s is in 0.0 to 1.0.
For identical strings, c = m = n and t = 0, so s = 1. Finding t sounds involved: it means comparing pairs of characters from the sequence of common characters in each string. The result of the function is then compared to a threshold for matching, e.g. 0.9, and used to decide whether to record-link or not.
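
A sketch of the Jaro part of the function as I understand it. Two details here are assumptions from the usual description of the algorithm, not from the paper: common characters only count if they fall within a window of max(m, n)/2 - 1 positions, and t is half the number of out-of-order matched characters (which is what makes the identical-string check s = 1 come out right):

```python
def jaro(s1, s2):
    """Jaro string comparison: s = 1/3 (c/m + c/n + (c - t)/c)."""
    m_len, n_len = len(s1), len(s2)
    if m_len == 0 or n_len == 0:
        return 0.0
    window = max(m_len, n_len) // 2 - 1
    s1_matched = [False] * m_len
    s2_matched = [False] * n_len
    c = 0  # common characters, matched within the window
    for i in range(m_len):
        for j in range(max(0, i - window), min(n_len, i + window + 1)):
            if not s2_matched[j] and s1[i] == s2[j]:
                s1_matched[i] = s2_matched[j] = True
                c += 1
                break
    if c == 0:
        return 0.0
    # t = half the number of matched characters that are out of order
    k = t = 0
    for i in range(m_len):
        if s1_matched[i]:
            while not s2_matched[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t /= 2
    return (c / m_len + c / n_len + (c - t) / c) / 3

jaro("martha", "marhta")  # one transposition (th/ht), well above 0.9
```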

Any clarifications appreciated.

BTW, here's another use case: results come in to be viewed on the EHR in use at a practice I work in, but sometimes the result filer can't match a result to a patient, and prompts with a list of identities, from which the user chooses one to file against. It may do this each time, since it doesn't remember a patient choice between result batches.
This might be a good thing, in case one's choice was wrong for a session.

Sometimes a result comes in for another practice's patient, and there is nowhere to put it, but the EHR allows one to delete the result.