[Bug-ocrad] Possibility of dictionary-enhanced OCR in Ocrad
From: Martin C. Doege
Subject: [Bug-ocrad] Possibility of dictionary-enhanced OCR in Ocrad
Date: Tue, 5 Jul 2005 21:32:07 +0200
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Antonio!
First of all, thanks for taking the time to develop Ocrad! I
particularly like how incredibly fast it is in comparison to, say,
GOCR. And the recognition rate is not too shabby for a character-based
OCR program.
That being said, I do wonder if it would be possible to extend Ocrad
with dictionary lookups to improve its recognition rate. So for
example, it would work on a word and assign each character a confidence
value. So it might recognize the word "heflo", and since that word is
not in the dictionary, it would tweak the characters with the lowest
confidence value first.
So in this example, maybe it would happen to have a low confidence
value for the "h" and the "f" and therefore look up variants in
the dictionary: "beflo", "bello", "hello", ... If this is not
successful it could try replacements like "m" -> "ch", "l_" -> "n",
"ii" -> "ü", or whatever other common errors are found in OCR output.
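To make the idea a bit more concrete, here is a minimal Python sketch of both steps for the "heflo" example. Everything in it is an assumption for illustration: the (character, confidence, alternatives) tuples, the confidence numbers, the function names and the tiny dictionary are made up, and Ocrad's real internal data may look quite different.

```python
from itertools import product

# Hypothetical input: one (char, confidence, alternatives) tuple per
# recognized character. This is NOT Ocrad's actual data structure.

def correct_by_confidence(chars, dictionary, max_positions=2):
    """Vary the least confident characters first until a dictionary word appears."""
    word = "".join(c for c, _, _ in chars)
    if word in dictionary:
        return word
    # Character positions ordered from least to most confident.
    order = sorted(range(len(chars)), key=lambda i: chars[i][1])
    for k in range(1, max_positions + 1):
        positions = order[:k]
        # Each chosen position may keep its character or take an alternative.
        options = [[chars[i][0]] + list(chars[i][2]) for i in positions]
        for combo in product(*options):
            letters = list(word)
            for i, ch in zip(positions, combo):
                letters[i] = ch
            candidate = "".join(letters)
            if candidate in dictionary:
                return candidate
    return None

def correct_by_rules(word, rules, dictionary):
    """Apply each (wrong, right) rule at each place it matches; return the first hit."""
    for wrong, right in rules:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in dictionary:
                return candidate
            start = word.find(wrong, start + 1)
    return None

# Made-up example data.
DICTIONARY = {"hello", "münchen"}
RULES = [("m", "ch"), ("l_", "n"), ("ii", "ü")]
HEFLO = [("h", 0.4, ["b"]), ("e", 0.9, []), ("f", 0.3, ["l", "t"]),
         ("l", 0.8, []), ("o", 0.95, [])]

print(correct_by_confidence(HEFLO, DICTIONARY))
print(correct_by_rules("miinchen", RULES, DICTIONARY))
```

The set lookup stands in for whatever dictionary backend is actually used. Limiting the search to the few least confident positions keeps the number of candidates small, so the extra cost per unknown word should stay modest.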
Of course the actual dictionary lookups could be handled by an external
program like aspell.
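As a rough sketch of such an external lookup, the following Python snippet feeds words to aspell in its ispell-compatible pipe mode ("aspell -a") and collects the unknown words together with aspell's suggestions. The function names are made up for illustration, and the snippet assumes that aspell and a dictionary for the chosen language are installed.

```python
import shutil
import subprocess

def parse_pipe_line(line):
    """Parse one result line of ispell-style pipe output.

    Returns (word, suggestions) for an unknown word, or None for a known
    word, the version banner, or a blank line.
    """
    if line.startswith("&"):
        # Format: "& word count offset: suggestion, suggestion, ..."
        head, _, tail = line.partition(":")
        word = head.split()[1]
        return word, [s.strip() for s in tail.split(",")]
    if line.startswith("#"):
        # Format: "# word offset" (unknown word, no suggestions)
        return line.split()[1], []
    return None

def aspell_misses(words, lang="en"):
    """Return {word: suggestions} for the words aspell does not know."""
    if shutil.which("aspell") is None:
        raise RuntimeError("aspell is not installed")
    # A leading "^" makes aspell treat the rest of the line as plain text.
    text = "".join("^" + w + "\n" for w in words)
    result = subprocess.run(
        ["aspell", "-a", "--lang=" + lang],
        input=text, capture_output=True, text=True, check=True,
    )
    misses = {}
    for line in result.stdout.splitlines():
        parsed = parse_pipe_line(line)
        if parsed:
            misses[parsed[0]] = parsed[1]
    return misses
```

Calling aspell_misses(["heflo", "hello"]) should then report only "heflo", with whatever suggestions the installed dictionary offers.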
Given that Ocrad is so insanely fast, I think this kind of (optional,
of course) overhead could be worthwhile. I have been working on a
larger project with Ocrad for a few days, and while I am pretty content
with Ocrad's recognition rate, I wish there was an easy way to identify
and correct the kinds of typical OCR errors which standard spell
checkers do not know how to handle.
Of course much of this could perhaps be done with a filter on the
Ocrad-generated text file after the fact, like with a modified aspell
(http://lists.gnu.org/archive/html/aspell-user/2002-07/msg00003.html).
But doing some of this in Ocrad itself might be beneficial, because
the program's internal knowledge about the characters being worked
on could be used. And in terms of programming work, this
would probably be cheaper than trying to improve the OCR engine
itself...
Any thoughts on this?
Martin
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)
iD8DBQFCyuA4mifxvst1lQIRAgBhAKDJwj2WL6UMkslCSjLDrvNZMQA7GQCdFpYe
6bxhFoCBb7Xw980EIF1t/lM=
=t67Q
-----END PGP SIGNATURE-----