Subject: Re: [Gnumed-devel] experiments with gnumed - multiusers vnc, importing
Date: Wed, 26 Apr 2006 12:34:17 +0800
It would be better if the records were synthetic, based on some statistics about
the EHR, e.g. age, sex, health issues, episodes per health issue, encounter frequency,
medication names prescribed, blood pressure, test names ordered and their frequency,
clusters of frequencies of appointments and health issues dealt with,
specialty names mentioned in the narrative text, symptom names mentioned, and
sign words like "chest clear, basal, wheeze, nil added, pulse, regular, irregular, abdo, lax, mass, no masses, sclera, pallor,
conjunctiva, well, unwell, fwt, nad, nitrites, wcc, rcc".
With synthetic records one can be fairly
sure they are "deidentified". Maybe some configurable statistics and terms, plus a program that generates records from them, would be better.
Not sure if they lose their value, because maybe someone wants to use real statistical record patterns for research. Probably good enough for load testing, though?
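
A minimal sketch of what "some configuring statistics and terms, and a program" might look like. Everything in it is an assumption made for illustration: the CONFIG layout, the probabilities, the issue names and the function names are invented, and nothing here reflects the actual GNUmed schema or real practice statistics.

```python
import random

# Hypothetical configuration. All names and numbers are made up for
# illustration; real values would come from statistics computed on the EHR.
CONFIG = {
    "sex": {"F": 0.5, "M": 0.5},
    "age_range": (0, 95),
    "health_issues": {"hypertension": 0.4, "asthma": 0.3, "type 2 diabetes": 0.3},
    "episodes_per_issue": (1, 4),
    "sign_words": ["chest clear", "wheeze", "nil added", "pulse regular",
                   "abdo lax", "no masses", "nad"],
}


def weighted_choice(rng, weights):
    """Pick one key of `weights` with probability proportional to its value."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def synthetic_record(rng, config=CONFIG):
    """Build one fake record: demographics, one health issue, a few episodes."""
    low, high = config["episodes_per_issue"]
    episodes = []
    for _ in range(rng.randint(low, high)):
        # Each episode narrative is just three sign words drawn from the list.
        episodes.append(", ".join(rng.sample(config["sign_words"], k=3)))
    return {
        "age": rng.randint(*config["age_range"]),
        "sex": weighted_choice(rng, config["sex"]),
        "health_issue": weighted_choice(rng, config["health_issues"]),
        "episodes": episodes,
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a load-test data set is repeatable
    for _ in range(3):
        print(synthetic_record(rng))
```

Because no real patient contributes a row, such records carry no identity to leak, but they also only reproduce whatever patterns were put into the configuration.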
On Tue Apr 25 13:23 , Tim Churches sent:
Syan Tan wrote:
> I've processed 360,000 rows of clin.clin_narrative and parsed out all the words
> containing letters. I was thinking of using a stoplist method where any word
> on the stoplist will be replaced by 'xxxx'. The stoplist would also include all
> the names listed out from dem.names.lastnames and dem.names.firstnames.
> BTW - what about a secondary structure for clin.clin_narrative, where the narrative
> consists of a list of indexes pointing into a table of words. this is the
> simplest step before
> having some sort of semantic linking at the word level ( but not at the phrase
> whilst trying to recreate the gnumed database using a pg_dump,
> the dump reload seems to stall; I tried to turn off logging, table
> constraints, removing
> internal log table data, and fsync, which all finally worked, but I'm not
> sure what causes the stall.
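
The stoplist idea quoted above (replace any stoplisted word with 'xxxx', with the stoplist including all known first and last names) could be sketched roughly as follows. The function names are hypothetical, and the name lists are passed in as plain Python lists rather than read from the dem.names table, so this is only an illustration of the technique, not code from the thread.

```python
import re

# A "word" here is a run of ASCII letters, roughly matching
# "words containing letters" in the quoted description.
WORD = re.compile(r"[A-Za-z]+")


def build_stoplist(lastnames, firstnames, extra=()):
    """Combine name lists and any extra terms into a lowercase lookup set."""
    return {w.lower() for w in (*lastnames, *firstnames, *extra)}


def redact(text, stoplist, mask="xxxx"):
    """Replace every word found on the stoplist (case-insensitive) with mask."""
    def substitute(match):
        word = match.group(0)
        return mask if word.lower() in stoplist else word
    return WORD.sub(substitute, text)


if __name__ == "__main__":
    stop = build_stoplist(["Smith"], ["John"])
    print(redact("John Smith seen today, chest clear", stop))
```

One known weakness of this approach is that names which are also ordinary words, or misspelled names that are not on the list, are handled badly: the former get masked everywhere and the latter slip through.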
> *On Mon Apr 24 18:53 , Karsten Hilbert sent:
> On Thu, Apr 20, 2006 at 09:47:54AM +0800, Syan Tan wrote:
> > thinking about it, the only correct thing to do seems to be to preserve the
> > structure of the instance data and the health issue + episode headings, but to
> > scramble the text with word substitution, as well as name substitution, date
> > fudging, and address random relinking. would that be de-identified enough ?
> Well, I tend to think that "de-identified enough" is a range
> from "acceptably so" to "beyond use" rather than a cutoff.
> The exact value used within that range depends on what sort
> of protection you need.
> Yes, if you want to hide a patient's data securely from your
> fellow doctor next door you will have to scramble the medical
> content, too, as she might be able to match "real patient"
> to "problems/operations listed" by her own medical skills
> and thereby gain knowledge via the now re-identified EMR.
> But if you want to protect a patient's privacy from, say,
> me, it's enough to falsify the identities. I do not have
> access to your patients. I also have no idea how to find out
> who your patients actually are in order to start matching
> EMRs to patients. Hence proper protection is ensured, I dare
> say. It is akin to not storing patient names with any
> medical data and holding the EMR ID <-> patient identity
> mapping elsewhere in a secure space (say, the patient's
> In a recent discussion on the openhealth list this topic was
> chanced upon and the OpenEHR guys thought the latter
> approach would be the most secure that's practically useful
> - and they were talking real live patient data in actual
I didn't mention it on the openEHR list (maybe I should) but merely
removing the direct identifiers (names, DOB etc) does not de-identify or
anonymise that data. For example, if the record reveals "32 yr old male,
with medical visits on 23/4/04, 12/6/05 and 14/01/06" then that record
has a very high probability of being unique to an individual in even a
large population. Hence if I know your age and sex (easily discovered or
ascertained) and I know that you had medical appointments on those dates
(eg if I had access to your work leave records, as staff in the
personnel department of your employer may have), then I can fairly
easily work out which record belongs to you. Disclosure control in microdata
almost always involves some degree of obfuscation, perturbation or
allocation to broad categories - in other words, a lot of detail needs
to be removed to make real data truly anonymous (in that it cannot be
re-identified). Also, anonymity of data is a continuum - it is not
dichotomous, and often it comes down to a risk judgement and some
assumptions about what additional information an 'attacker' who might
try to re-identify records might possess. If the data are to be made
publicly available, you can't make any assumptions about what an
attacker might or might not already know about a person, so you need to
be very conservative.
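
The point about age, sex and visit dates can be illustrated with a small sketch that counts how many records are unique on those quasi-identifiers, before and after allocating them to broad categories. The record layout, the function names and the sample data are invented for illustration; the coarsening used (10-year age bands, visit counts per year) is just one possible choice, not a recommended standard.

```python
from collections import Counter


def unique_fraction(records, key):
    """Fraction of records whose quasi-identifier key occurs exactly once."""
    if not records:
        return 0.0
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)


def exact_key(record):
    """Exact age, sex and visit dates: what an attacker might already know."""
    return (record["age"], record["sex"], tuple(record["visits"]))


def coarse_key(record):
    """10-year age band, sex, and number of visits per calendar year."""
    decade = record["age"] // 10 * 10
    per_year = Counter(date[:4] for date in record["visits"])  # ISO dates
    return (decade, record["sex"], tuple(sorted(per_year.items())))


if __name__ == "__main__":
    # Made-up sample data, not taken from any real system.
    records = [
        {"age": 32, "sex": "M", "visits": ["2004-04-23", "2005-06-12", "2006-01-14"]},
        {"age": 35, "sex": "M", "visits": ["2004-02-02", "2005-11-30", "2006-03-09"]},
        {"age": 38, "sex": "M", "visits": ["2004-09-17", "2005-01-05", "2006-07-21"]},
        {"age": 61, "sex": "F", "visits": ["2005-05-05"]},
    ]
    print(unique_fraction(records, exact_key))
    print(unique_fraction(records, coarse_key))
```

Any record that stays unique after coarsening is still a re-identification candidate, which is why the amount of detail removed has to be judged against what an attacker is assumed to know.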