Re: Emacs as a translator's tool

From: Jean-Christophe Helary
Subject: Re: Emacs as a translator's tool
Date: Sat, 30 May 2020 12:12:08 +0900

> On May 30, 2020, at 10:33, Emanuel Berg via Users list for the GNU Emacs text 
> editor wrote:
> Can't we compile a list of what the commercial CATs
> offer? M Helary and Mr Abrahamsen?

x commercial → ○ professional, if you don't mind :)
OmegaT is very much a professional tool and certainly not a "commercial" one.

Here is how I see it, based on 20 years of practice but without much technical 
background:

1) CAT tools extract translatable contents from various file formats into an 
easy-to-handle format, and put the translated contents back into the original 
format. That way the translator does not have to worry *too much* about the 
idiosyncrasies of the original format.

→ File filters are a core part of a CAT tool *but* as was suggested in the 
thread it is possible to rely on an external filter that will output contents 
in a standard localization "intermediate" format (current "industry" standards 
are PO and XLIFF). Such filters provide export and import functions so that the 
translated files are converted back to the original format.

File filters can also accept rules that keep non-translatable text out of the 
output (the current standard is ITS).

The PO format can be handled by po4a (Perl), translate-toolkit (Python) and the 
Okapi Framework tools (Java).
XLIFF has the Okapi Framework, OpenXLIFF (Electron/Node) and the 
translate-toolkit. All are top-notch pro-grade free software and, in the case of 
Okapi and OpenXLIFF, have been developed by people who have participated in the 
standardization process (XLIFF/TMX/SRX/ITS/TBX, etc.).

→ Emacs could rely on such external filters and only specialize in one 
"intermediate" format. po-mode already does that for PO files.
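
To make the extract/merge round trip concrete, here is a minimal Python sketch. 
The key=value input format and the function names `extract` and `merge` are 
assumptions made up for illustration; they are not the API of po4a, Okapi or 
any other tool mentioned above, and a real filter would write PO or XLIFF 
instead of a Python dict.

```python
# Toy "file filter". The key=value format and all names are illustrative
# assumptions, not the interface of any real filter.

def extract(lines):
    """Return (skeleton, units): the file layout and the translatable values."""
    skeleton, units = [], {}
    for line in lines:
        if "=" in line and not line.lstrip().startswith("#"):
            key, value = line.split("=", 1)
            units[key.strip()] = value.strip()
            skeleton.append(("unit", key.strip()))
        else:
            # Comments and blank lines are kept verbatim and never shown
            # to the translator.
            skeleton.append(("raw", line))
    return skeleton, units


def merge(skeleton, units, translations):
    """Rebuild the file; untranslated units fall back to the source text."""
    out = []
    for kind, payload in skeleton:
        if kind == "unit":
            out.append(f"{payload}={translations.get(payload, units[payload])}")
        else:
            out.append(payload)
    return out


source = ["# menu labels", "open=Open file", "quit=Quit"]
skeleton, units = extract(source)
translated = merge(skeleton, units, {"open": "Ouvrir le fichier"})
print(translated)
```

The skeleton is what lets the translator ignore the original format: only the 
contents of `units` would ever reach the editor.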

2) Once the text is extracted, it needs to be segmented. Basic "no" 
segmentation usually means paragraph-based segmentation. Paragraphs are defined 
differently depending on the original format (1 or 2 line breaks for a text 
file, a block tag for XML-based formats, etc.).
Fine-grained segmentation is obtained by using a set of language-specific 
regexes that includes break rules and no-break rules. A simple example for 
English is: break after a "period followed by a space" but don't break after 
"Mr. ".
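
The break/no-break idea can be sketched in a few lines of Python. This is only 
an illustration under simple assumptions: the `BREAK` pattern and the 
`NO_BREAK` abbreviation list are invented for the example, while real tools 
read such rules from per-language rule files (SRX, for instance).

```python
import re

# Break rule: sentence-final punctuation followed by whitespace.
BREAK = re.compile(r"(?<=[.!?])\s+")
# No-break rules: abbreviations whose period does not end a sentence.
# (Hypothetical, deliberately short list for English.)
NO_BREAK = ("Mr.", "Mrs.", "Dr.")


def segment(paragraph):
    """Split a paragraph into sentence-like segments."""
    segments, start = [], 0
    for match in BREAK.finditer(paragraph):
        candidate = paragraph[start:match.start()]
        if candidate.endswith(NO_BREAK):
            continue  # a no-break rule applies, keep accumulating text
        segments.append(candidate)
        start = match.end()
    tail = paragraph[start:]
    if tail:
        segments.append(tail)
    return segments


print(segment("Mr. Smith arrived. He sat down."))
# → ['Mr. Smith arrived.', 'He sat down.']
```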

→ File filters usually handle the segmentation part based on user 
specifications. Once the file is segmented into the intermediate format, it is 
not structurally trivial to "split" or "merge" segments because the tool needs 
to remember what will go back into the original file structure.

→ Emacs could rely on the external filters to handle the segmentation.

3) The real strength of a CAT tool shows where it helps the translator handle 
all the resources needed in the translation. Let me list potential resources:

- Legacy translations, called "translation memories" (TM), usually in 
multilingual "aligned" files where a given segment has equivalents in various 
languages. Translated PO files are used as TMs; the XML standard is TMX.

- Glossaries, usually in a similar but simpler format, sometimes only TSV, 
sometimes CSV. The XML-based standard is TBX.

- Internal translations, which are produced by the translator while 
translating. Each translated segment is added to the project "memory".

- Dictionaries are a more global form of glossaries, usually monolingual, 
format varies.

- External files, either local documents or web documents, in various formats, 
usually monolingual (otherwise they'd be aligned and used as TMs).

→ Each resource format needs a way to be parsed, memorized, fetched and 
recycled efficiently during the translation.
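
As an illustration of the "parse and memorize" step for one resource type, here 
is a rough Python sketch that loads a TMX fragment into a source → target 
dictionary. The sample is heavily simplified (a real TMX header carries more 
attributes) and `load_tmx` is a name invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified sample: one translation unit (tu) with two language variants (tuv).
TMX_SAMPLE = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open file</seg></tuv>
      <tuv xml:lang="fr"><seg>Ouvrir le fichier</seg></tuv>
    </tu>
  </body>
</tmx>"""


def load_tmx(text, source_lang, target_lang):
    """Return {source segment: target segment} for one language pair."""
    memory = {}
    for tu in ET.fromstring(text).iter("tu"):
        by_lang = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if source_lang in by_lang and target_lang in by_lang:
            memory[by_lang[source_lang]] = by_lang[target_lang]
    return memory


memory = load_tmx(TMX_SAMPLE, "en", "fr")
print(memory)
# → {'Open file': 'Ouvrir le fichier'}
```

A plain dictionary only supports exact lookups; the fuzzy matching described 
in the next points needs a smarter index on top of it.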

4) Usually the process is the following:

- the translator "enters" a segment
- the tool displays "matches" from the resources that relatively closely 
correspond to the segment contents
- the translator inserts or modifies the matches
- when no matches are produced the translator enters a translation from scratch
- the translator can add glossary items to the project glossary
- the new translation is added to the "internal" memory set
- the translator moves to the next segment
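
The loop above could be modelled roughly as follows. This is a sketch, not a 
description of how any existing tool is built: the `Project` class and its 
methods are invented names, and Python's `difflib.SequenceMatcher` ratio is 
used only as a convenient stand-in for a real matching algorithm.

```python
from difflib import SequenceMatcher


class Project:
    """Hypothetical project holding the "internal" memory and a glossary."""

    def __init__(self, memory=None):
        self.memory = dict(memory or {})  # source segment -> translation
        self.glossary = {}                # source term -> target term

    def matches(self, segment, threshold=0.6):
        """Return (score, source, translation) tuples, best match first."""
        scored = [
            (SequenceMatcher(None, segment, src).ratio(), src, tgt)
            for src, tgt in self.memory.items()
        ]
        return sorted((m for m in scored if m[0] >= threshold), reverse=True)

    def commit(self, segment, translation):
        """Store a finished translation so later segments can reuse it."""
        self.memory[segment] = translation


project = Project()
first = project.matches("Open the file")      # nothing yet: translate from scratch
project.commit("Open the file", "Ouvrir le fichier")
project.glossary["file"] = "fichier"          # translator adds a glossary item
second = project.matches("Open the files")    # next segment reuses the memory
print(first, second[0][1:])
```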

5) The matching is usually some sort of Levenshtein distance-based algorithm. 
The "tokens" that are used in the "distance" calculation are usually produced 
by language-specific tokenizers (the Lucene tokenizers are quite popular).
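
For readers who have not met it, here is a small Python sketch of a token-based 
Levenshtein distance turned into a similarity score. The whitespace tokenizer 
and the normalization by the longer token count are assumptions made for the 
example; actual tools use language-aware tokenizers and their own scoring 
formulas.

```python
def tokenize(text):
    # Assumption: lower-cased whitespace tokens stand in for a real
    # language-aware tokenizer.
    return text.lower().split()


def levenshtein(a, b):
    """Minimum number of token insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        current = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(source, candidate):
    """Score from 0.0 (nothing in common) to 1.0 (identical token lists)."""
    a, b = tokenize(source), tokenize(candidate)
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest


print(similarity("Save the current file", "Save the current document"))
# → 0.75
```

One substituted token out of four gives 0.75, which a tool would typically 
display as a "75% match".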

The better the match, the more efficient the tool is at helping the translator 
recycle resources. The matching process/quality is where tools profoundly 
differ (OmegaT is generally considered to have excellent quality matches, 
sometimes better than expensive commercial tools).

Some tools propose "context" matches where the previous and next segments are 
also taken into account, some tools propose "subsegment" matches where, even if 
a whole segment won't match, significant subparts can, etc.

The matching process must sometimes apply to extremely big resources (like many 
million lines of multilingual TMs in the case of the EU legal corpora) and must 
thus be able to handle the data quickly regardless of the set size.

6) Goodies that are time savers include:

- history-based autocompletion
- glossary/TM/dictionary-based autocompletion
- MT services access
- shortcuts that auto insert predefined text chunks
- spell-checking/grammar checking
- QA checks against glossary terms, completeness/length of the translation, 
integrity of the format structure, numbers used, etc. (QA checks are also 
available as external processes in some of the solutions mentioned above, or 
related solutions.)
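
To give an idea of what such QA checks look like, here is a hedged Python 
sketch covering three of them (completeness, numbers, glossary terms). The 
function name `qa_check` and the messages are invented for the example, and the 
number comparison is naive: it ignores locale-specific number formatting.

```python
import re

# Digits, optionally grouped with "." or "," (e.g. 3, 1.5, 1,000).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def qa_check(source, target, glossary):
    """Return a list of human-readable issues found in one translated segment."""
    issues = []
    if not target.strip():
        issues.append("empty translation")
    if sorted(NUMBER.findall(source)) != sorted(NUMBER.findall(target)):
        issues.append("numbers differ")
    for term, expected in glossary.items():
        # Simple case-insensitive substring test; real tools use tokenizers.
        if term.lower() in source.lower() and expected.lower() not in target.lower():
            issues.append(f"glossary term not used: {term} -> {expected}")
    return issues


glossary = {"file": "fichier"}
print(qa_check("Save 3 files", "Enregistrer 2 fichiers", glossary))
# → ['numbers differ']
print(qa_check("Open the file", "Ouvrir le document", glossary))
# → ['glossary term not used: file -> fichier']
```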

> I'll read thru this thread tomorrow (today)
> God willing but I don't understand everything, in
> particular examples would nice to get the exact
> meaning of the desired functionality...

Go ahead if you have questions.

> With examples we can also see if Emacs already can do
> it. And if not: Elisp contest :)


> Some features are probably silly, we don't have to
> list or do them, or everything in the CATs, just what
> really makes sense and is useful on an every-day basis.

A lot of the heavy-duty tasks can be handled by external processes.

> When we are done, we put it in the wiki or in a pack.
> We can't have that Emacs doesn't have a firm grip on
> this issue. Because translation is a very common task
> with text!
> Also, let's compile a list of what Emacs already has
> to this end. It doesn't matter if some of that stuff
> already appears somewhere else, modularity is
> our friend.


Jean-Christophe Helary @brandelune
