
Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract

From: Karsten Hilbert
Subject: Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
Date: Mon, 25 Jan 2010 23:41:03 +0100

> Some discussion of PDF indexing and scraping of PDFs makes me ask about
> GNUmed's ability to search for text across a patient record:
> 1) when a PDF was generated from source text (such as a word processor and
> "print to pdf") the text within the PDF remains recognizable to software,
> albeit not in human readable form.

AFAIK, that depends entirely on the mode in which the PDF was generated.
PDF generators do well to choose a mode that preserves the text,
but AFAIK there are other modes in which no text survives.

> Is GNUmed presently only able to query
> information stored-as-human-readable text?

Even worse, it cannot query over *any* information in any
of the documents in the archive regardless of format.

> 2) there exists apparently a form of PDF called "searchable" in which a
> PDF can be created (or appended) to contain both an image layer (such as a
> scanned paper document) but to *also* hold, in a separate layer within the
> same document (file), ASCII or perhaps UTF-8 text, as may have been generated
> through OCR or perhaps when the PDF did already contain identifiable text
> (only non-human-readable within the PDF format), into a layer of
> human-readable text.

That sounds mighty useful to me.

> For GNUmed to be able to access such a layer in within-patient searches,
> would it be necessary for such PDFs to have been imported twice, and/or to
> use some additional tool to "split" the document into two parts (one an
> image part, and one the text part)?

It would be possible to implement the access to the text part inside
GNUmed. Actually using that in a search would, however, presently
require exporting each and every document and trying to search it.

That could, indeed, only be mitigated by splitting the text part
into a separate for-search table upon import.
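
To illustrate the "split out the text part upon import" step, here is a
minimal sketch in Python. It assumes the pdftotext command line tool from
poppler-utils is installed; the function name extract_text_layer is made up
for this example and is not part of GNUmed. A PDF with no text layer (a bare
scan, say) simply yields an empty string.

```python
import subprocess


def extract_text_layer(pdf_path: str, timeout: float = 30.0) -> str:
    """Return the text embedded in a PDF, or '' if nothing could be extracted.

    Hypothetical helper, not GNUmed code. Relies on the external
    `pdftotext` tool (poppler-utils); "-" sends its output to stdout.
    """
    try:
        result = subprocess.run(
            ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
            capture_output=True,
            check=True,
            timeout=timeout,
        )
    except (OSError, subprocess.SubprocessError):
        # Tool not installed, file unreadable, non-zero exit, or timeout.
        return ""
    return result.stdout.decode("utf-8", errors="replace").strip()
```

An importer could call this once per PDF and store any non-empty result
alongside the document, so later searches never need to re-export the file.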

Except that GNUmed already has that table: blobs.doc_desc, of which
there can be any number per document. In fact, we should probably
extend the per-patient and across-patients search to look at those!

Which would then enable practices to implement just what you wanted -
they'd have to import the text version themselves, but it'd be usable
for finding stuff.
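
A rough sketch of what searching over such descriptions could look like,
per patient and across patients. This is not the actual GNUmed schema or
code: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and
the table layout, column names (fk_patient, fk_doc, text) and function names
are assumptions made for illustration only.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a toy stand-in for a documents table plus a descriptions table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE doc_med (
            pk INTEGER PRIMARY KEY,
            fk_patient INTEGER NOT NULL
        );
        -- any number of descriptions per document
        CREATE TABLE doc_desc (
            pk INTEGER PRIMARY KEY,
            fk_doc INTEGER NOT NULL REFERENCES doc_med(pk),
            text TEXT NOT NULL
        );
        """
    )
    return conn


def add_description(conn: sqlite3.Connection, doc_pk: int, text: str) -> None:
    """Attach one more searchable text (e.g. OCR output) to a document."""
    conn.execute("INSERT INTO doc_desc (fk_doc, text) VALUES (?, ?)", (doc_pk, text))


def search_descriptions(conn, term, patient_pk=None):
    """Return sorted document keys whose descriptions contain `term`.

    Case-insensitive substring match. With patient_pk=None the search
    runs across all patients, otherwise within one patient only.
    """
    sql = (
        "SELECT DISTINCT d.pk FROM doc_med d "
        "JOIN doc_desc dd ON dd.fk_doc = d.pk "
        "WHERE instr(lower(dd.text), lower(?)) > 0"
    )
    params = [term]
    if patient_pk is not None:
        sql += " AND d.fk_patient = ?"
        params.append(patient_pk)
    sql += " ORDER BY d.pk"
    return [row[0] for row in conn.execute(sql, params)]
```

A real implementation on PostgreSQL would more likely use its full-text
search facilities than a plain substring match, but the shape of the query
(join documents to their descriptions, optionally filter by patient) would
stay the same.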


