Re: [Gnumed-devel] re: Soundex

From: J Busser
Subject: Re: [Gnumed-devel] re: Soundex
Date: Sat, 21 Aug 2004 17:57:48 -0700

At 9:15 AM +1000 8/22/04, Tim Churches wrote:
...another alternative is to use a technique we have dubbed
"n-gram indexes" (since we developed the method for our record linkage
project). We still haven't written a definitive paper on it, but it is
implemented in the Febrl software and described in the manual, and there
is a paper describing its relative retrieval performance - see

I plan to work on an improved implementation of this technique (in
Python of course) over the next several months for use in our public
health data collection systems (where case/patient look-up and
deduplication is vital, but where we have hundreds of thousands or
millions of records) - when this work is complete you might want to
evaluate it for use in GNUmed. It might be overkill for general practice
databases with a few thousand patients, but the technique is
conceptually simple and elegant and, unlike the phonetic indexing
functions, makes no assumptions about name or string morphology and
phonetics - thus it works equally well with alphabetic names from any
culture, including Pinyin Chinese names. It takes a set-theoretic
approach, and the faster, built-in set data type in Python 2.4 improves
its speed considerably.
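
To make the set-theoretic idea concrete, here is a minimal sketch of one way an n-gram index could work. It is not the Febrl implementation: the function names (ngrams, build_index, lookup), the choice of bigrams, the Jaccard score and the 0.5 threshold are all assumptions made for illustration, and it is written for current Python rather than 2.4.

```python
from collections import defaultdict


def ngrams(text, n=2):
    """Return the set of character n-grams in a lower-cased string."""
    s = text.lower()
    if len(s) < n:
        # Strings shorter than n are kept whole so they remain findable.
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_index(names, n=2):
    """Map each n-gram to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for gram in ngrams(name, n):
            index[gram].add(name)
    return index


def lookup(index, query, n=2, threshold=0.5):
    """Return (name, score) pairs whose Jaccard similarity to the query
    is at least threshold (an assumed, illustrative scoring rule)."""
    q = ngrams(query, n)
    # Candidate set: the union of all names sharing any n-gram with the query.
    candidates = set()
    for gram in q:
        candidates |= index.get(gram, set())
    results = []
    for name in candidates:
        g = ngrams(name, n)
        score = len(q & g) / len(q | g)
        if score >= threshold:
            results.append((name, score))
    # Best matches first; ties broken alphabetically.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


names = ["Churches", "Church", "Busser", "Zhang Wei"]
index = build_index(names)
matches = lookup(index, "Churchs")
```

Because matching relies only on shared character sequences, nothing in the sketch depends on how a name is pronounced, which is the property described above. Only names sharing at least one n-gram with the query are ever scored, so most of the work is set union and intersection.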

This sounds really interesting; I look forward to your progress.
