

Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract

From: Jim Busser
Subject: Re: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
Date: Tue, 26 Jan 2010 07:48:40 -0800

On 2010-01-26, at 7:20 AM, Karsten Hilbert wrote:

>> That could, indeed, only be mitigated by splitting the text part
>> into a separate for-search table upon import.
>> Except that GNUmed already has that table: blobs.doc_desc, of which
>> there can be any number per document. In fact, we should probably
>> extend the per-patient and across-patients search to look at those!
> Which we apparently already do, of course :-)
> One concept of the GNUmed document archive is that it tries
> hard *not* to concern itself with the particulars of the
> document part file types. It delegates that as far as at
> all possible. Hence splitting / appropriately importing PDF
> parts is up to the environment.

I am only wondering what constrains or otherwise defines the ability of GNUmed
(postgres) to "look inside" a part no matter its type. Is it as simple as
GNUmed looking for ASCII or UTF-8 text strings? Suppose that in this case the
PDF has some combination of
- images + text that is encumbered by PDF formatting and not readable, AND
- a layer of human-readable text
        (if the latter is, by luck, a layer in a "searchable PDF")

1) should GNUmed then be able to find this document part?
2) will this be incredibly slow, or does GNUmed (postgres) index all of the 
text that is readable "in" the parts?
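
To illustrate why "looking for ASCII or UTF-8 text strings" may not be enough, here is a minimal Python sketch. It is not GNUmed code and does not build a real PDF; it only assumes, as is common for PDFs, that the text layer is stored in a stream compressed with zlib/deflate (FlateDecode). The names contains_text, HEADER, plain_part and compressed_part are made up for this example.

```python
import zlib

def contains_text(blob: bytes, needle: str) -> bool:
    """Naive search: are the UTF-8 bytes of `needle` present in the raw blob?"""
    return needle.encode("utf-8") in blob

# Hypothetical stand-in for the human-readable text layer of a document part.
text_layer = b"Discharge summary: the patient tolerated the procedure well. " * 3

# Toy "file formats" (assumptions, not real PDF structure):
HEADER = b"%toy-header\n"
plain_part = HEADER + text_layer                      # text stored as-is
compressed_part = HEADER + zlib.compress(text_layer)  # text stored deflated

# A raw byte search finds the word only when the text is stored uncompressed.
found_plain = contains_text(plain_part, "tolerated")
found_compressed = contains_text(compressed_part, "tolerated")

# After inflating the stream, the same naive search succeeds again.
inflated = zlib.decompress(compressed_part[len(HEADER):])
found_after_inflate = contains_text(inflated, "tolerated")

print(found_plain, found_compressed, found_after_inflate)
```

Under that assumption, a plain string match over the stored bytes would miss the readable layer, which is one argument for extracting the text on import and searching the extracted copy (for example in blobs.doc_desc) rather than the part itself.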

