## Re: [O] [OT] Scanning for archiving

From: Karl Voit
Subject: Re: [O] [OT] Scanning for archiving
Date: Mon, 7 Nov 2011 18:44:24 +0100
User-agent: slrn/0.9.9 (Linux)

Hi!

Inspired by «Total Recall»[3], a book by two MS Research guys, I
started lifelogging on my own two months ago.

For this purpose I bought an HP OfficeJet Pro 8500A Plus, which costs
€ 250 and has a decent scanner. It can scan and print full duplex.
The scanner has a 30-page ADF which is quite reliable as long as the
paper has not been bent or stapled before.

>
> Using PDF for scanned documents results in *huge* files with a seriously
> disappointing image quality.

I cannot reproduce that at all:

,----
| address@hidden ~2d % l 2011-11-02_13-22-45.png
| -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
| address@hidden ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
| address@hidden ~2d % l 2011-11-02_13-22-45.pdf
| -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
`----

In this example, the PDF is even smaller than the original PNG
(96457 vs. 103150 bytes, roughly 6.5% less). PDF is only a container
format.
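
To put a number on that comparison, here is a minimal sketch that computes the relative size difference from the two byte counts in the listing above. The function name `size_reduction_percent` is made up for illustration; it is not part of any tool mentioned in this thread.

```python
def size_reduction_percent(original_bytes: int, converted_bytes: int) -> float:
    """Return how much smaller the converted file is, in percent of the original.

    A negative result means the converted file is larger than the original.
    """
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return (original_bytes - converted_bytes) / original_bytes * 100


# Byte counts taken from the directory listing above.
png_bytes = 103150
pdf_bytes = 96457

saving = size_reduction_percent(png_bytes, pdf_bytes)
print(f"PDF is {saving:.1f}% smaller than the PNG")
```

This is a single data point from one scan, so it only shows that wrapping an image in a PDF does not have to inflate it; other scans and other converter settings may behave differently.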

> Consider storing your scans in DjVu format
> [1], which was developed specifically for this purpose.

PDF is a common standard, whereas DjVu is something that I, as an
advanced computer user, have never come across in real life. I am not
sure whether any of my computers can handle DjVu files at all.

The goals of DjVu sound great, but I get everything with PDF too.
It is the same with audio: although I like the idea of Ogg Vorbis, I
re-ripped all my CDs as MP3 because many music devices and music
management software packages could not handle my Vorbis files.

I stick to the format *any* computer can handle without special
software products. And I do think that this gives me a better chance
of being able to read my documents twenty years from now.

For scanned images I'd prefer PNG, but the OS X software of my
OfficeJet can generate PDF files in which OCR software adds a
searchable text layer above the scanned text. This is *very*
important to me since it lets me do full-text search on the content
of my archived documents.

And I plan to archive *all* of my documents. Really all of them.

Storage space does not matter (any more) to me since I already have
more disk space than I could possibly fill with my lifetime paper
correspondence. And I do think that my disk space will continue to
grow in the future.

> I scan all docs @ 600dpi, predominantly gray-scale (only in colour when
> it's *really* necessary) and store in DjVu format, all using gscan2pdf [2].
>
> Even at that seemingly overkill resolution, single-page documents are
> generally (if they aren't too "grainy") only a few 100 KiB in size.

My HP software uses 300 dpi by default, and that is OK for me too.

Funny side fact: the grayscale scan setting produces slightly
larger files than the color one.

> gscan2pdf also supports a number of OCR utils, but the UI for this is
> clumsy (aren't they all...), so you're better off using the CLI tools
> directly.  Tesseract is recommended.

I played around with ocropus, tesseract, ocroscript, hocr2pdf,
exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
PDF documents (OCR text above the scanned images) on GNU/Linux.
Unfortunately, none of those (very cool) projects produced reliable
results on my side. The results vary from «no error, but the overlay
font size is incorrect and the layout is lost» to «library error
messages I cannot read or handle».

The HP OfficeJet, in contrast, bundles its OS X software with OCR
from Readiris, which produces perfect results even in different
languages and has a usable user interface.

> NOTE: When attempting something like this, a fast scanner with a *reliable*
> automatic document feeder will help prevent premature hair loss ;)

I have found several scanner products I was interested in:

"Canon imageFORMULA P-150": very small, portable form factor with
basic Linux support. Prices start at € 260. There is a separate
version, "P-150m", for Mac OS X.

The authors of [3] use a Fujitsu ScanSnap, which starts at € 400.

I ended up with the OfficeJet Pro (mentioned above) at € 250
because I got a flatbed scanner *and* an ADF scanner *and* a
full-duplex/full-color network printer with a very good
price-per-printed-page ratio (better than many laser printers!). And
all of this at a lower price than any scan-only product I was
interested in.

So far I am almost satisfied. «Almost»? Well, HP did a good job with
this printer, but they delivered only a 90% solution on almost all
levels, whereas 100% would have been possible with a little
additional effort. Still, those 90% are pretty usable.

[3] http://qr.cx/sAHU
--
Karl Voit
