guix-patches

[bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract


From: Simon South
Subject: [bug#61851] [PATCH] gnu: tesseract-ocr-tessdata-fast: Install tesseract config files.
Date: Tue, 28 Feb 2023 10:00:36 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux)

Jelle Licht <jlicht@fsfe.org> writes:
> Cunningham's law strikes again :)

Ha, interesting.  That one's new to me.

> This makes me believe the current situation was a deliberate choice...

Yes, it was, and I realize now I didn't provide much in the way of
rationale in my previous email.  So here's the background information
for anyone interested:

Tesseract normally expects to find its data files in /usr/share/tessdata
and subfolders thereof.  We'd like to use Guix's native-search-paths
functionality to pull together data from (for instance) multiple
language-specific data packages, and Tesseract conveniently honours a
TESSDATA_PREFIX environment variable that specifies its data folder's
location, so it seems we are all set.

What should TESSDATA_PREFIX be set to?  Tesseract's documentation[0]
says

  TESSDATA_PREFIX environment variable should be set to the parent
  directory of “tessdata” directory.

So "share" then, presumably, to have the data files located at
"share/tessdata".  The man page[1] seems to confirm this:

  To use a non-standard language pack named foo.traineddata, set the
  TESSDATA_PREFIX environment variable so the file can be found at
  TESSDATA_PREFIX/tessdata/foo.traineddata...
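
Read literally, the two passages quoted above describe a lookup rule
that can be sketched as follows.  This is only an illustration of the
documented convention, not Tesseract's actual code; the function name
and the example prefix are hypothetical.

```python
from pathlib import PurePosixPath


def documented_traineddata_path(tessdata_prefix: str, lang: str) -> PurePosixPath:
    """Return the path the documentation implies for a language pack.

    Per the quoted man page, the file is expected at
    TESSDATA_PREFIX/tessdata/LANG.traineddata.
    """
    return PurePosixPath(tessdata_prefix) / "tessdata" / f"{lang}.traineddata"


# Hypothetical prefix, chosen only for illustration.
print(documented_traineddata_path("/some/profile/share", "foo"))
```

Under this reading, TESSDATA_PREFIX would have to name the "share"
directory itself.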

This creates a problem, though, since defining a native-search-path of
just "share" will pull in files from virtually every single Guix
package.  The solution then is to introduce an intermediate folder,
"tesseract-ocr", that sidesteps this problem, and to configure Tesseract
appropriately at build time so it installs its data files to
"share/tesseract-ocr/tessdata" instead.  This is why the existing code
was written the way it was and what the comment you pointed out is
referring to.

However, there's a problem with this, too: patching Makefile.am the way
the code does results in only some of Tesseract's data files being
placed in "share/tesseract-ocr/tessdata"; you can see in the package
output there is still a "share/tessdata" folder that contains
Tesseract's config files.  Since these aren't also placed beneath
"share/tesseract-ocr/tessdata", Tesseract can't find them at runtime.

The solution to this seems to be to remove this phase and instead use
the "--datadir" configure flag to specify the desired data-folder path.
Doing this results in all of Tesseract's data files being installed
beneath "share/tesseract-ocr/tessdata" and the resulting package works
as you'd expect.

However, the problem with this is... none of it is necessary in the first
place!  It turns out Tesseract's documentation is simply WRONG and the
program actually expects TESSDATA_PREFIX to contain the complete path to
the "tessdata" data folder, not the path of the folder directly above
it.  So Tesseract can be built as-is, the native-search-path can be
safely defined as "share/tessdata", and everything just works.
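
The difference between the documented rule and the behaviour described
here can be sketched with a throwaway directory tree.  This is a minimal
model under stated assumptions, not Tesseract's code: both function
names are hypothetical, and "eng" is just an example language name.

```python
import tempfile
from pathlib import Path


def documented_lookup(prefix: Path, lang: str) -> Path:
    # Rule as the documentation states it: PREFIX/tessdata/LANG.traineddata
    return prefix / "tessdata" / f"{lang}.traineddata"


def reported_lookup(prefix: Path, lang: str) -> Path:
    # Rule as this email reports it: PREFIX is the tessdata folder itself
    return prefix / f"{lang}.traineddata"


def demo():
    """Return (found_by_documented_rule, found_by_reported_rule)."""
    with tempfile.TemporaryDirectory() as root:
        tessdata = Path(root) / "share" / "tessdata"
        tessdata.mkdir(parents=True)
        (tessdata / "eng.traineddata").touch()

        # Prefix as a "share/tessdata" search path would set it.
        prefix = tessdata
        return (
            documented_lookup(prefix, "eng").exists(),
            reported_lookup(prefix, "eng").exists(),
        )


print(demo())  # (False, True)
```

With the prefix pointing at "share/tessdata", only the second rule
locates the file, which is why no intermediate folder is needed.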

This is what the patch I passed on yesterday does.

-- 
Simon South
simon@simonsouth.net

[0] https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#simplest-invocation-to-ocr-an-image

[1] https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc




