[Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR

From: Jim Busser
Subject: [Gnumed-devel] Re: Scanning Xsane, gscan2pdf, Simple Scan, Tesseract OCR
Date: Mon, 25 Jan 2010 13:57:01 -0800

Some discussion of PDF indexing and scraping of PDFs makes me ask about 
GNUmed's ability to search for text across a patient record:

1) when a PDF was generated from source text (such as a word processor and 
"print to pdf") the text within the PDF remains recognizable to software, 
albeit not in human readable form. Is GNUmed presently only able to query 
information stored-as-human-readable text?

2) there apparently exists a form of PDF called "searchable", in which a PDF 
can be created (or amended) to contain an image layer (such as a scanned 
paper document) and *also*, in a separate layer within the same document 
(file), human-readable ASCII or perhaps UTF-8 text. That text may have been 
generated through OCR or, where the PDF already contained identifiable text 
(only non-human-readable within the PDF format), extracted from it.

For GNUmed to be able to access such a layer in within-patient searches, would 
it be necessary for such PDFs to have been imported twice, and/or to use some 
additional tool to "split" the document into two parts (one an image part, and 
one the text part)?
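
To make the question concrete, here is a minimal sketch of pulling the text 
out of a PDF without splitting or re-importing it. It assumes the `pdftotext` 
command-line tool (shipped with Xpdf) is installed and on the PATH. The 
function names and the 20-character threshold are my own illustrative 
choices, not anything GNUmed provides. The original PDF stays untouched; 
only the extracted text would be stored alongside it for searching.

```python
import subprocess


def build_pdftotext_command(pdf_path):
    """Return the argument list for pdftotext, sending the text to stdout."""
    # A "-" in place of the output file name makes pdftotext write to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def has_text_layer(extracted_text, min_chars=20):
    """Guess whether the PDF carried real text.

    An image-only scan typically yields nothing but whitespace and form
    feeds, so count the non-whitespace characters. The threshold is an
    arbitrary assumption.
    """
    return len("".join(extracted_text.split())) >= min_chars


def extract_text_layer(pdf_path):
    """Return the PDF's text, or None if it looks image-only (OCR needed)."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path), capture_output=True, check=True
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return text if has_text_layer(text) else None
```

If `extract_text_layer()` returns None, the document would have to go 
through OCR (e.g. Tesseract) before it could be searched.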

PS the maintainer of Xpdf has a link to PdfSearch, a Python-based utility for 
searching PDF files

Also "Some useful stuff for MacOS X", although it isn't obvious that these 
solve the use case of passing a PDF to a print control dialog and presenting 
that to a user on Mac OS X.

Here were the two Oscar posts (which I merged) that reminded me to ask a 
question I've had on my mind:

> From: Rob James
> Date: January 25, 2010 10:04:05 AM PST
> To: address@hidden
> Subjects: [Oscarmcmaster-bc-users] PDF indexing / PDF scraping
> On the topic of PDF indexing and textual retrieval. ...If 
> the file is graphical in its origins, as you would expect if someone 
> prints it, then faxes/scans it, then you are obligated to fall back to 
> OCR as no true textual data is in the file.   
> If, however, the PDF is generated directly via 
> PDFdriver, the text is actually in the file,  surrounded by PDF stuff.
> The tool pdf2text ... strip PDFs for ASCII content
> For example, because most academic articles are now distributed with PDF 
> files generated in the latter format, Zotero - the remarkable 
> Firefox-based citation/bibliography manager - is able to fully index PDF 
> articles as it acquires them.  That trick would almost certainly have 
> been based on extant open-source tools.   I  assume that what Zotero 
> does is strip the PDF of all the non-text components... then indexes.

> It turns out that Zotero uses the resources of Xpdf to 
> scrape academic PDFs for textual content, and then to make the text 
> available for subsequent indexing and retrieval. Given that several 
> projects are using Xpdf in this way, it is probably a place to start.
> There are [info and] downloads for Linux and Windows available at:
> [ ]
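
The "scrape, then index" step described above can be sketched in a few 
lines. This is only an illustration of the general idea (a simple inverted 
index over already-extracted text), not how Zotero or GNUmed actually do 
it. The document names and sample text are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids containing it.

    `documents` is a dict of document id -> text extracted from that PDF.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical documents within one patient's record.
docs = {
    "letter.pdf": "Chest X-ray shows no acute disease",
    "lab.pdf": "Hemoglobin A1c 7.2 percent",
}
idx = build_index(docs)
matches = search(idx, "chest disease")
```

Here `matches` contains only "letter.pdf", since that is the one document 
holding both query words.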
