Re: [Freecats-Dev] Segmenting and fuzzy matching from a user's point of v


From: Henri Chorand
Subject: Re: [Freecats-Dev] Segmenting and fuzzy matching from a user's point of view
Date: Sun, 02 Mar 2003 22:05:23 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003

(... just a note here, about me being somewhat slow to answer:
I'm not as available as I would like to be, so I sometimes take a little more time in order to better understand what's been said before opening my mouth)

Henri

Dave Simons wrote:

> Some things I'm going to say here might be considered premature
> detail but at least they will have been said.
> Please file the comments away as you think fit :-)
>
> (pause while Dave puts his translator's hat on)
Same thing happening at Quimper right now.

Segmenting.
=========

> Many of the texts I translate are definitions of very precise,
> point-by-point procedures; they do not consist of eloquent
> descriptions.
>
> These and lots of other types of document (keyword lists, help
> texts, etc.) contain standalone segments which are very short
> -- so short that in a lot of cases they can't even be parsed.
> At the other end of the segmentation spectrum, due to the way
> some authors lay their documents out (I'm not saying it's the
> best way but they do have a right to their quirks) I find it
> annoying when I'm forced to end my segment at the end of a
> paragraph, because some of these are in fact "faux" paragraphs.

> That's why it's essential for me to have maximum flexibility
> and to keep control as far as segment definition goes. I
> nearly always choose the smallest possible segment boundary
> (the smallest reasonable delimiter being ":" and/or ";")
> then glue segments together when this breaks down.
> Having done that, I want to be able to continue expanding
> the segment indefinitely, repeatedly gluing the next one
> along the line right till the end of the document if
> necessary! I've never needed to go to this extreme, but I'm
> just making the point that I don't want ANYTHING to limit
> me, whether the program designers think it's good for my
> health or not.
> I, the translator, must be the one who has the final choice
> and I'm adamant about that.

I kept the whole of this quote for Keith.
Even though I won't detail how close my way of working is to Dave's, I must also insist that translators must decide for themselves how to segment a text, at least to some extent. Dave's point of view might be seen as an extreme one, but to me, he does not state extreme opinions, because from time to time, we translators do see weird files which require altering the default segmentation.

So I'll add the following:
One can't decide in advance what a given text will look like - not all documentation is an example of good writing, and pre-segmentation can't solve every possible situation.
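
To make the "split small, then glue" way of working concrete, here is a minimal Python sketch. The function names (split_segments, glue) and the delimiter set are assumptions made up for illustration; they do not come from any existing CAT tool or from a Free CATS design.

```python
import re

# Hypothetical delimiter set: sentence enders plus ":" and ";".
DELIMITERS = ".!?:;"

def split_segments(text, delimiters=DELIMITERS):
    """Split after each delimiter that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(delimiters) + r"])\s+"
    return [part for part in re.split(pattern, text.strip()) if part]

def glue(segments, index, count=1):
    """Merge segment `index` with the next `count` segments.

    The only limit is the end of the document: asking for more
    segments than remain simply glues everything up to the end.
    """
    if not 0 <= index < len(segments):
        raise IndexError("no such segment")
    end = min(index + 1 + count, len(segments))
    merged = " ".join(segments[index:end])
    return segments[:index] + [merged] + segments[end:]

text = "Open the valve; wait ten seconds: check the gauge. Close it."
segments = split_segments(text)
glued_once = glue(segments, 0)            # first two segments become one
glued_all = glue(segments, 0, count=10)   # glue right to the end of the text
```

The point of the sketch is only that gluing is a decision taken by the translator after segmentation, and that nothing in the data structure has to cap how far it goes.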

> Unfortunately, sometimes, in trados/wordfast type clients,
> without there being a visible reason, the program doesn't
> allow gluing at all in certain points in the text.
>
> I guess this might be due to "interference" from MSWord's
> proprietary formatting markers etc. As for gluing across
> paragraphs, I don't know of any CAT tool that allows this.

I don't really know Wordfast yet, but with MS Word, the reason is simple: you can't join two strings into the same segment when these strings are separated by a CR (carriage return = end of paragraph mark).

Obviously, this is (theoretically) a Good Thing - unless the document IS poorly formatted. This is where the real problems begin.

At the very least, the translator should be able to glue strings within the same paragraph. But in fact, there are lots of situations where he/she must be able to do the same despite one or more end of paragraph marks / line breaks [well, at least with line breaks] - that is, as long as the paragraph-level style applied is not different.

A (rather common) example is with some HTML files in which hard-coded line breaks occur all the time within the same logical paragraph, due to the fact that the documentation writer used some wicked HTML editor in order to produce the documents, and some option generated these hard-coded line breaks instead of performing a logical word wrapping.

We could also take another example. All of us translators, at one stage or another, are asked to translate PDF documents. Have you tried to manually delete hundreds or thousands of carriage returns? If a CAT tool is flexible enough, it should not prevent users from getting out of a real-world situation like this. I admit that powerful options, if misused, will give horrible results, but (overridable) default values will be there for the newbies.
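
As an illustration of the carriage-return cleanup described above, here is a small Python sketch. It assumes (and this is only an assumption) that a blank line marks a real paragraph break while a single line break is just hard wrapping; real PDF or HTML exports can be messier, which is exactly why the translator should be able to override the result.

```python
import re

def unwrap_hard_breaks(text):
    """Join the lines inside each paragraph into one line.

    Assumption: a blank line separates real paragraphs; a single
    line break is treated as hard-coded word wrapping.
    """
    normalized = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", normalized)
    unwrapped = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n")]
        joined = " ".join(line for line in lines if line)
        if joined:
            unwrapped.append(joined)
    return "\n\n".join(unwrapped)

raw = "Press the green\nbutton to start\nthe pump.\n\nWait for the\nlight."
clean = unwrap_hard_breaks(raw)
```

The sketch does not try to repair words hyphenated at line ends; that would need a further rule or a manual pass.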

> None of this, of course, will stop other translators using
> different segment boundaries (like paragraph markers) if
> they so wish. They will have just as much choice and control
> as I do.

Same for me - I believe a large amount of flexibility must be allowed for segmenting.

For consistency's sake, I must also add that if several translators work on the same TM, they should all try to work with the same settings. So I suggest defining the segment delimiters as a list of possible delimiters, each with a mandatory/optional/forbidden status, at the TM level (set by the TM administrator, from the Free CATS perspective).

The idea would be to provide a set of "reasonable" default values and to let the TM admin alter them when, obviously, the state of documents requires it.
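
One possible shape for such a delimiter list, sketched in Python. Everything here (the Status names, DEFAULT_RULES, the opted_in parameter) is hypothetical and only meant to show how a mandatory/optional/forbidden status could interact with a translator's own choices.

```python
import re
from enum import Enum

class Status(Enum):
    MANDATORY = "mandatory"   # always ends a segment
    OPTIONAL = "optional"     # ends a segment only if the translator opts in
    FORBIDDEN = "forbidden"   # never ends a segment, whatever the translator asks

# Hypothetical TM-level defaults that a TM administrator could alter.
DEFAULT_RULES = {
    ".": Status.MANDATORY,
    "?": Status.MANDATORY,
    "!": Status.MANDATORY,
    ":": Status.OPTIONAL,
    ";": Status.OPTIONAL,
    ",": Status.FORBIDDEN,
}

def active_delimiters(rules, opted_in=()):
    """Return the delimiters that end a segment for this translator."""
    active = set()
    for char, status in rules.items():
        if status is Status.MANDATORY:
            active.add(char)
        elif status is Status.OPTIONAL and char in opted_in:
            active.add(char)
    return active

def segment(text, rules=DEFAULT_RULES, opted_in=()):
    """Split text after every active delimiter followed by whitespace."""
    chars = "".join(sorted(active_delimiters(rules, opted_in)))
    stripped = text.strip()
    if not chars:
        return [stripped] if stripped else []
    pattern = r"(?<=[" + re.escape(chars) + r"])\s+"
    return [part for part in re.split(pattern, stripped) if part]

sample = "Stop the pump; open the valve. Done."
by_default = segment(sample)
with_semicolon = segment(sample, opted_in=(";",))
```

With this layout, translators sharing a TM get the same mandatory and forbidden delimiters, and only the optional ones are left to personal preference.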


Fuzzies
======

> I support Henri when he says translators probably don't give a damn
> if a fuzzy match turns out to be nonsense. What I do give a damn
> about is having these fuzzy matches more or less forced on me. This
> is what every CAT tool I've ever tried tends to do. They preset
> the "translated" window to the fuzzy match. Now believe me, when
> you've been translating repetitive stuff for a couple of hours and
> fatigue starts setting in, when you look at a fuzzy match --
> providing you've remained alert enough to recognize it as such
> (yes I know it's displayed in a different colour but fatigue is
> fatigue...) -- you're sometimes hard-pressed to spot exactly how
> fuzzy the match is in semantic terms rather than in % terms.
>
> (I've seen 98% matches that contain fatal flaws and 75% matches
> that are just about spot-on.) In circumstances like these, a bad
> fuzzy can easily get past your guard.

This is (and will probably remain) a "common" problem with CAT software. It's a bit like driving: on a highway, when you need some rest, you're more prone to dozing off for a few seconds (with possibly the cemetery as the next step), while on a mountain road wrecked by the winter snow and ice, you are more likely to stay awake.

More seriously, when we all started to think about how fuzzy matches work in the proprietary software we use and how we could improve it, I believe I was not alone in finding out that:
- I'm not really able to explain EXACTLY how the software I presently
  use works
- In some instances, I guess I could tell how to improve things
  (like with Trados, it's so bad with all-upper-case words, and
  it does not know how to treat a word that comes several times
  in the same fuzzy, see Keith's previous message)
- Unless I want to look a little foolish, I'd better take the time to
  think about it.
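
To have something concrete to argue about, here is a deliberately naive word-level fuzzy score in Python, built on the standard difflib module. It is NOT how Trados or any other product computes its percentages (I can't claim to know that); fuzzy_score and its ignore_case option are just assumptions for the sake of discussion.

```python
from difflib import SequenceMatcher

def fuzzy_score(source, candidate, ignore_case=True):
    """Naive word-level similarity between two source segments, in percent."""
    if ignore_case:
        source, candidate = source.casefold(), candidate.casefold()
    words_a, words_b = source.split(), candidate.split()
    return round(100 * SequenceMatcher(None, words_a, words_b).ratio())

# A one-word difference that reverses the meaning still scores very high.
negated = fuzzy_score(
    "Do not open the valve while the pump is running.",
    "Do open the valve while the pump is running.",
)

# All-upper-case text: same words, very different score depending on case handling.
folded = fuzzy_score("PRESS THE START BUTTON", "Press the start button")
strict = fuzzy_score("PRESS THE START BUTTON", "Press the start button",
                     ignore_case=False)
```

The first pair comes out in the mid-nineties although the two instructions say opposite things, which is the "high % but fatal flaw" case. The second pair shows how much a single design choice (case folding) changes the score for all-upper-case segments.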

> (snipping bits here)

So, I don't want to ask for an ideal solution right now; as long as the design is modular, there will be room for improvement.

At present, I'm ready to consider as "fit" any "perfectible" project, at least as long as a satisfactory solution is provided for layout info, which should NOT be mixed up with text contents.

> In any case, if the program really does insist on presetting
> the "translated" window to the fuzzy match, it should at least
> give me the opportunity to banish it to the pits of doom with
> a single keystroke, leaving me a virgin window to work with.

This is easy to do in Trados, where you can specify:
- The fuzzy match threshold below which no fuzzy target segment is
  copied
- Whether or not to copy the current source segment if no fuzzy match
  at or above this threshold is found
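
The two settings above boil down to a very small decision rule. Here is a Python sketch of it; the function name, parameter names and default values are assumptions of mine and are not taken from Trados.

```python
def prefill_target(source, best_match, threshold=75, copy_source=False):
    """Decide what goes into the target window before the translator types.

    best_match is a (score, target_text) pair, or None when the TM
    returned nothing for this source segment.
    """
    if best_match is not None:
        score, target = best_match
        if score >= threshold:
            return target
    # No usable fuzzy match: either copy the source or leave the window empty.
    return source if copy_source else ""

source = "Open the valve."
good = prefill_target(source, (80, "Ouvrez la vanne."))
weak = prefill_target(source, (60, "Ouvrez la vanne."))
copied = prefill_target(source, (60, "Ouvrez la vanne."), copy_source=True)
```

Dave's "single keystroke" would then simply reset the window to the empty string, whatever this rule put there.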

> Talking about fuzzies, a feature I'd like to see dropped -- or at
> least carefully rethought as regards default settings -- is
> "pre-translation".

Some people will require it, but it will become slightly less necessary anyway. A flexible CAT tool does not dictate how to work with it except when required (remember what we said about segmenting?).

A freelancer working on his/her own will be free to decide how to use a TM - unless working for a paranoid customer who does not provide him/her with the TM's admin password...

> I never use it myself and can't see why anyone would want to use
> it, but among the agent part of my clientele, there are some
> customers who insist on using it. The problem is that it's a
> dangerous weapon in their hands because they do not have the
> technical nous to configure it correctly.

I would want to use it from time to time.

> Imagine a pre-translated file full of fuzzy-matched part
> numbers and you'll see what I mean.
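
One conceivable safeguard against that situation, sketched in Python: refuse to pre-translate a segment unattended unless every token containing a digit is identical in the new source and in the TM source. The rule itself, the token pattern and the names used are all assumptions for illustration, not a description of any existing tool.

```python
import re

# Hypothetical rule: any whitespace-delimited token containing a digit
# is treated as a code (part number, reference, quantity...).
CODE_TOKEN = re.compile(r"\S*\d\S*")

def codes(text):
    """Return the sorted digit-bearing tokens, without edge punctuation."""
    return sorted(token.strip(".,;:()") for token in CODE_TOKEN.findall(text))

def safe_to_pretranslate(new_source, tm_source, score, threshold=100):
    """Accept a TM match for unattended pre-translation only if the score
    reaches the threshold AND all digit-bearing tokens are identical."""
    return score >= threshold and codes(new_source) == codes(tm_source)

swapped = safe_to_pretranslate(
    "Replace gasket P/N 4471-B.", "Replace gasket P/N 4417-B.",
    score=95, threshold=90,
)
same_code = safe_to_pretranslate(
    "Replace the gasket P/N 4471-B.", "Replace gasket P/N 4471-B.",
    score=95, threshold=90,
)
```

With the default threshold of 100, only exact matches get through, which is the conservative default one would probably want for customers who can't configure the feature themselves.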


Henri




