Re: Emacs as a translator's tool
From: Jean-Christophe Helary
Subject: Re: Emacs as a translator's tool
Date: Sat, 30 May 2020 12:12:08 +0900
> On May 30, 2020, at 10:33, Emanuel Berg via Users list for the GNU Emacs text
> editor <help-gnu-emacs@gnu.org> wrote:
>
> Can't we compile a list of what the commercial CATs
> offer? M Helary and Mr Abrahamsen?
x commercial → ○ professional, if you don't mind :)
OmegaT is very much a professional tool and certainly not a "commercial" one.
My take, based on 20 years of practice but otherwise not very technically
informed, is the following:
1) CAT tools extract translatable contents from various file formats into an
easy-to-handle format, and put the translated contents back into the original
format. That way the translator does not have to worry *too much* about the
idiosyncrasies of the original format.
→ File filters are a core part of a CAT tool, *but* as was suggested in the
thread, it is possible to rely on an external filter that outputs contents
in a standard localization "intermediate" format (the current "industry"
standards are PO and XLIFF). Such filters provide export and import functions
so that the translated files are converted back to the original format.
File filters can also accept rules for excluding non-translatable text
(the current standard is ITS).
The PO format can be handled by po4a (perl), translate-toolkit (python) and the
Okapi Framework tools (java).
XLIFF has the Okapi Framework, OpenXLIFF (electron/node), and the
translate-toolkit. All are top-notch, professional-grade free software, and in
the case of Okapi and OpenXLIFF they were developed by people who participated
in the standardization process (XLIFF/TMX/SRX/ITS/TBX, etc.).
→ Emacs could rely on such external filters and specialize in only one
"intermediate" format. po-mode already does that for PO files.
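To make point 1 concrete, here is a rough sketch of what reading the PO
"intermediate" format involves. This is a deliberately naive reader (my own
toy example, not po4a or the translate-toolkit): real PO files also have
comments, multi-line strings, plural forms, and fuzzy flags, which the tools
above handle properly.

```python
import re

def parse_po(text):
    """Very small PO reader: returns (msgid, msgstr) pairs.
    Ignores comments, plurals, and multi-line strings for brevity."""
    return [(m.group(1), m.group(2))
            for m in re.finditer(r'msgid\s+"(.*)"\s*\nmsgstr\s+"(.*)"', text)]

sample = '''msgid "File"
msgstr "Fichier"

msgid "Edit"
msgstr "Édition"
'''
print(parse_po(sample))  # → [('File', 'Fichier'), ('Edit', 'Édition')]
```

The point is that once everything is funneled into one such format, the editor
only ever has to understand that format, not DOCX, HTML, XML, etc.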
2) Once the text is extracted, it needs to be segmented. Basic "no"
segmentation usually means paragraph-based segmentation. Paragraphs are defined
differently depending on the original format (one or two line breaks for a text
file, a block tag for XML-based formats, etc.).
Fine-grained segmentation is obtained with a set of language-specific regexes
comprising break rules and no-break rules. A simple example for English: break
after a period followed by a space, but do not break after "Mr. ".
→ File filters usually handle the segmentation part based on user
specifications. Once the file is segmented into the intermediate format, it is
not structurally trivial to "split" or "merge" segments because the tool needs
to remember what will go back into the original file structure.
→ Emacs could rely on the external filters to handle the segmentation.
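The break/no-break mechanism from point 2 can be sketched in a few lines. This
is SRX-style in spirit only (a hypothetical, minimal rule set I made up for
English); real SRX files carry many more rules per language.

```python
import re

# One break rule (whitespace after sentence-ending punctuation) plus a few
# no-break exceptions checked against the text before the candidate break.
NO_BREAK = [r"\bMr\.$", r"\bDr\.$", r"\be\.g\.$"]
BREAK = re.compile(r"(?<=[.!?])\s+")

def segment(text):
    segments, start = [], 0
    for m in BREAK.finditer(text):
        before = text[start:m.start()]  # candidate segment, ends at the period
        if any(re.search(p, before) for p in NO_BREAK):
            continue  # a no-break rule matched: keep going
        segments.append(before)
        start = m.end()
    segments.append(text[start:])
    return segments

print(segment("Mr. Smith arrived. He sat down."))
# → ['Mr. Smith arrived.', 'He sat down.']
```

Note that the no-break rule correctly prevents a split after "Mr." even though
the break rule matches there.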
3) The real strength of a CAT tool shows where it helps the translator handle
all the resources needed in the translation. Let me list potential resources:
- Legacy translations, called "translation memories" (TM), usually kept in
multilingual "aligned" files where a given segment has equivalents in various
languages. Translated PO files are used as TMs; the XML standard is TMX.
- Glossaries, usually in a similar but simpler format, sometimes only TSV,
sometimes CSV; the XML-based standard is TBX.
- Internal translations, which are produced by the translator while
translating; each translated segment adds to the project "memory".
- Dictionaries, a more global form of glossaries, usually monolingual; the
format varies.
- External files, either local documents or web documents, in various formats,
usually monolingual (otherwise they'd be aligned and used as TMs).
→ each resource format needs a way to be parsed, stored, fetched, and recycled
efficiently during translation.
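For the TM case, here is roughly what reading a TMX file amounts to, using
only the standard library. The sample data is my own; real TMX files also
allow inline markup inside <seg>, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

TMX = """<tmx version="1.4"><body>
<tu>
  <tuv xml:lang="en"><seg>Save file</seg></tuv>
  <tuv xml:lang="fr"><seg>Enregistrer le fichier</seg></tuv>
</tu>
</body></tmx>"""

# xml:lang expands to this qualified name in ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(data, src, tgt):
    """Extract (source, target) segment pairs for one language pair."""
    pairs = []
    for tu in ET.fromstring(data).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs

print(read_tmx(TMX, "en", "fr"))
# → [('Save file', 'Enregistrer le fichier')]
```

Because a TMX translation unit can hold more than two languages, the same file
serves any language pair it contains.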
4) Usually the process is the following:
- the translator "enters" a segment
- the tool displays "matches" from the resources that relatively closely
correspond to the segment contents
- the translator inserts or modifies the matches
- when no matches are produced the translator enters a translation from scratch
- the translator can add glossary items to the project glossary
- the new translation is added to the "internal" memory set
- the translator moves to the next segment
5) The matching is usually some sort of Levenshtein distance-based algorithm.
The "tokens" used in the "distance" calculation are usually produced by
language-specific tokenizers (the Lucene tokenizers are quite popular).
The better the match, the more efficient the tool is at helping the translator
recycle resources. The matching process/quality is where tools profoundly
differ (OmegaT is generally considered to have excellent-quality matches,
sometimes better than expensive commercial tools).
Some tools offer "context" matches where the previous and next segments are
also taken into account; some offer "subsegment" matches where, even if a
whole segment won't match, significant subparts can, etc.
The matching process must sometimes apply to extremely large resources (many
millions of lines of multilingual TMs in the case of the EU legal corpora) and
must thus handle the data quickly regardless of the set size.
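To illustrate point 5: a minimal token-level Levenshtein match, with naive
whitespace tokenization standing in for the language-aware tokenizers real
tools use. The TM contents here are invented; this is not how OmegaT scores
matches, just the general shape of the idea (and a naive implementation would
be far too slow for the multi-million-line TMs mentioned above).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ta != tb)))  # substitution
        prev = cur
    return prev[-1]

def match_score(source, candidate):
    """Similarity in [0, 1] over whitespace tokens."""
    s, c = source.split(), candidate.split()
    return 1 - levenshtein(s, c) / max(len(s), len(c))

tm = {"Save the file": "Enregistrer le fichier",
      "Close the window": "Fermer la fenêtre"}
segment = "Save the new file"
best = max(tm, key=lambda k: match_score(segment, k))
print(best, round(match_score(segment, best), 2))
# → Save the file 0.75
```

A 75% match like this one is exactly the kind of "fuzzy match" a CAT tool
would display next to the segment being translated.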
6) Goodies that are time savers include:
- history based autocompletion
- glossary/TM/dictionary based autocompletion
- MT services access
- shortcuts that auto insert predefined text chunks
- spell-checking/grammar checking
- QA checks against glossary terms, completeness/length of the translation,
integrity of the format structure, numbers used, etc. (QA checks are also
available as external processes in some of the solutions mentioned above, or
related solutions.)
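One of those QA checks, the "numbers used" one, is simple enough to sketch.
This is my own minimal version, not taken from any of the tools mentioned: it
flags numbers that appear in the source but not in the translation.

```python
import re

def qa_numbers(source, target):
    """Return numbers present in the source but missing from the translation."""
    nums = lambda s: re.findall(r"\d+(?:[.,]\d+)?", s)
    found = nums(target)
    return [n for n in nums(source) if n not in found]

print(qa_numbers("Chapter 3 has 12 pages.", "Le chapitre 3 a 21 pages."))
# → ['12']
```

Checks like this one run well as batch/external processes over a whole set of
translated files, which is why they don't strictly need to live in the editor.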
> I'll read thru this thread tomorrow (today)
> God willing but I don't understand everything, in
> particular examples would nice to get the exact
> meaning of the desired functionality...
Go ahead if you have questions.
> With examples we can also see if Emacs already can do
> it. And if not: Elisp contest :)
:)
> Some features are probably silly, we don't have to
> list or do them, or everything in the CATs, just what
> really makes sense and is useful on an every-day basis.
A lot of the heavy-duty tasks can be handled by external processes.
> When we are done, we put it in the wiki or in a pack.
>
> We can't have that Emacs doesn't have a firm grip on
> this issue. Because translation is a very common task
> with text!
>
> Also, let's compile a list of what Emacs already has
> to this end. It doesn't matter if some of that stuff
> already appears somewhere else, modularity is
> our friend.
:)
--
Jean-Christophe Helary @brandelune
http://mac4translators.blogspot.com