
[emacs-humanities] typewriters, OCR, proofreading, hist. of math


From: Joe Corneli
Subject: [emacs-humanities] typewriters, OCR, proofreading, hist. of math
Date: Sat, 02 Jan 2021 12:50:18 +0000

§ TYPEWRITERS

https://www.gnu.org/fun/jokes/xmodmap.html

— My (2nd) so-converted typewriter is a Royal Dart.

Another nice mod is to use stamp ink to re-ink the ribbon (e.g., using
purple so that output looks a bit like a mimeograph).

§ OCR

My hope at one stage was to use OCR to create a digital archive of
typewritten pages.  My previous attempts weren’t so successful, but...

 https://github.com/tesseract-ocr/tesseract and
 https://github.com/ocropus/ocropy

seem to be getting quite mature, e.g., here are some instructions for
training Ocropus to recognize typewritten text:

 https://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html

and there’s an endorsement from Brewster Kahle here:

 https://en.wikipedia.org/wiki/Tesseract_(software)

§ OCR, CONTINUED

As a kind of bellwether, it seems Carr’s book "A synopsis of elementary
results in pure mathematics" still hasn’t been properly OCR’d:
https://archive.org/details/synopsisofelemen00carrrich

This book is one that Ramanujan learned from, so it’s of some historical
significance: https://en.wikipedia.org/wiki/Synopsis_of_Pure_Mathematics

We had at one time proposed a broader mathematics scanning project to
Wikimedia:

 https://meta.wikimedia.org/wiki/Grants:IEG/PlanetMath_Books_Project

But it wasn’t funded:

 https://meta.wikimedia.org/wiki/Grants_talk:IEG/PlanetMath_Books_Project

§ EMACS FOR PROOFREADING

Emacs support for proofreading seems a bit limited:
https://emacsnotes.wordpress.com/2018/05/15/proofreading-with-emacs-change-font-and-size/

... but there’ve been some attempts; I found this package, which seems
to need a maintainer:

https://web.archive.org/web/20050223191820/mdxi.collapsar.net/hacks/emacs/ocr-mode/
https://web.archive.org/web/20050301035829/http://mdxi.collapsar.net/hacks/emacs/ocr-mode/ocr.el

According to the proposal mentioned above, proofreading could
potentially be sped up with some intelligent assistance:

 « Based on reading the archives of the Project Gutenberg listserver, we
 have identified a range of time-saving techniques. Using these
 techniques, PG contributors were able to proofread a book in four
 hours. They noticed that the straightforward approach of reading
 through a text line by line, page by page and marking errors as they
 appear is highly inefficient. A much better approach is one in which
 the computer identifies places where an error is likely and presents
 this information to the human. Each pass should focus on errors of one
 type, such as spelling, capitalization, or punctuation. To make this
 work efficiently, we will implement an interface in which the text in
 question is highlighted and centered, with surrounding text lowlighted,
 and where the user is presented with multiple choice options as to what
 the text says (with "other" as one of the options). »
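
The "one error type per pass" idea in the quoted passage can be sketched
roughly as follows. This is only an illustration, not the interface the
proposal had in mind: the pass names, the regular expressions, and the
helper names (PASSES, Suspect, suspects) are all made up here, and a
real tool would want much better heuristics (a dictionary, OCR
confidence scores, and so on). The suspect text is marked with [[ ]] as
a stand-in for highlighting, with some surrounding context kept for the
human reader.

```python
import re
from typing import Iterator, NamedTuple


class Suspect(NamedTuple):
    start: int      # offset of the suspicious span in the text
    end: int
    text: str       # the suspicious span itself
    context: str    # span marked with [[ ]] plus surrounding text


# Hypothetical heuristics, one per proofreading pass.
PASSES = {
    # A capital inside a word, or a lowercase word right after ". ".
    "capitalization": re.compile(
        r"\b[a-z]+[A-Z][A-Za-z]*\b|(?<=[.!?] )[a-z]\w*"
    ),
    # Whitespace before punctuation, or doubled separators.
    "punctuation": re.compile(r"\s+[,.;:!?]|[,;:]{2,}"),
    # Digits mixed into words, a common OCR confusion (l/1, O/0).
    "digits-in-words": re.compile(
        r"\b[A-Za-z]+[0-9]+[A-Za-z]*\b|\b[0-9]+[A-Za-z]+\b"
    ),
}


def suspects(text: str, kind: str, width: int = 20) -> Iterator[Suspect]:
    """Yield likely errors of a single kind, each with nearby context."""
    for m in PASSES[kind].finditer(text):
        lo = max(0, m.start() - width)
        hi = min(len(text), m.end() + width)
        context = (
            text[lo:m.start()] + "[[" + m.group() + "]]" + text[m.end():hi]
        )
        yield Suspect(m.start(), m.end(), m.group(), context)


if __name__ == "__main__":
    page = "The c1ass met today. it was fine , mostly. A teSt followed."
    for kind in PASSES:
        for s in suspects(page, kind):
            print(f"{kind}: {s.context}")
```

Each pass walks the whole text looking for only one kind of problem, so
the reader never has to switch between error types; the multiple-choice
step would then sit on top of the Suspect records.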

... Makes me wonder if maybe the project should be rebooted; we bid for
21,400 USD at the time, proposing to use a non-free system for OCR
($1000 license) and to focus on the architecture around it.

But maybe it would be worthwhile to think about including the cost of
building a free math OCR system in the budget?

Building OCR isn’t my specific area of expertise, but the problem of
digitizing mathematical history doesn’t seem to be going away!  That is
interesting in its own right, given that a reasonably performant
proprietary OCR package exists.  Weird, huh?


-- 
Dr Joseph A. Corneli (https://github.com/holtzermann17)

HYPERREAL ENTERPRISES LTD is a private company limited by shares, incorporated
25th, June 2019 as Company Number 634284 on the Register of Companies for
Scotland (https://beta.companieshouse.gov.uk/company/SC634284).


