Re: [emacs-bidi] Suboptimal display-reordering in minibuffer

On Sun, Jun 27, 2010 at 8:25 PM, Eli Zaretskii <address@hidden> wrote:

> Date: Sun, 27 Jun 2010 06:30:27 +0300
> From: Amit Aronovitch <address@hidden>
> Cc: Eli Zaretskii <address@hidden>, address@hidden

>
> First, thanks Eli and all contributors for the remarkable effort, and all
> the recent progress!

You're most welcome.

> Note that there are two separate issues:
> (1) Directionality (I'll use here B to represent Hebrew Bet):
> Should the message be displayed "is undefined B^" (RTL paragraph dir) or
> "^B is undefined" (LTR paragraph dir)
>
> (2) Alignment (to right or left margin) - where that message is to be
> displayed. It makes sense to align to the "start" direction (i.e. right
> for RTL and left for LTR), but AFAIK this is a matter of style and not
> within the scope of the unicode standard.
>
> (2) is a relatively minor problem, while (1) could be a real source for
> confusion to the reader.

In Emacs, (2) is entirely determined by (1): a L2R paragraph is
displayed flushed all the way to the left margin of the window, while
R2L paragraphs are flushed to the right margin.

This is perfectly acceptable. I just wanted to point out the problem more clearly, as the OP named the *alignment* as being wrong (which is correlated with the actual problem, but is not exactly it).

I don't see any reason to have the paragraph and alignment be
independent. Every bidi-aware word processor I've seen behaves like I
described above, and I'm quite sure users expect that.

Of course. This is the most reasonable default.

However, word processors typically also have an option for selectively modifying the alignment without affecting the directionality (toolbars have separate buttons for directionality and alignment), and this gets them out of sync.

(Such explicit alignment information might not be saved in plain-text files, but might be useful for "rich" formats - maybe in w3 mode etc.)

One example where this might be useful is when you have a list of items (names, addresses, cited references), some of which are RTL and some LTR, and you wish the whole list to align to a single margin, to avoid a ragged appearance. Another example is within tables.

> True. There is no way to determine with 100% certainty the correct direction of
> a sentence out of context. That is why the Unicode standard leaves the
> freedom for "higher level protocol" to set that (
> http://unicode.org/reports/tr9/ HL1) .
> When such information is not available, a simple default algorithm is
> described by the standard (rules P2, P3). This is implemented by common bidi
> reordering libs, and I guess this is the reason for what you see here.

Emacs doesn't use any reordering libraries, but it does implement
UAX#9 to the letter, including determining the paragraph direction
from its first strong directional character.
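
The first-strong rule mentioned above (UAX#9 rules P2 and P3) can be sketched roughly as follows. This is only an illustration in Python using the standard unicodedata module, not how Emacs or any reordering library implements it; the function name and the "LTR"/"RTL" return values are made up for the example, and it ignores the standard's handling of embedded or isolated runs.

```python
import unicodedata

# Bidi classes that count as "strong" for paragraph-direction detection.
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def first_strong_direction(text, default="LTR"):
    """Simplified P2/P3: direction of the first strong character.

    Hypothetical helper for illustration. Neutrals and weak characters
    are skipped; if no strong character exists, `default` is returned.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "LTR"
        if bidi_class in STRONG_RTL:
            return "RTL"
    return default


BET = "\u05d1"  # HEBREW LETTER BET

# The circumflex is a neutral (class ON), so it is skipped and the
# Hebrew letter after it decides the paragraph direction.
print(unicodedata.bidirectional("^"))
print(first_strong_direction("^" + BET + " is undefined"))
print(first_strong_direction("C-x is undefined"))
```

With a logical string that starts with a circumflex followed by a Hebrew letter, the sketch reports an RTL paragraph, which matches the behaviour described in this thread.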

It would be nice to be able to specify the direction explicitly (manually) for selected paragraphs in the buffer. This could be stored in the same way that other metadata (font sizes? color? images?) is handled.

(p.s. If the buffer is plaintext, this information would probably be lost when we save it. Still it might serve as a "manual override" to help readability as long as the buffer is open).

> > Aren't problems like this the entire raison d'etre of the invisible RLM
> > and LRM characters?
> >
> >
> One of the main reasons. True. But, depending on the bidi reordering
> function used, the application might be able to achieve the results by
> providing this "higher level choice" itself. With libfribidi, the
> "pbase_dir" input parameter can be used for that.

In Emacs, we have the bidi-paragraph-direction variable, which
overrides the direction determined by the first strong character.

Is that per-buffer? What if you want to control directionality of specific paragraphs? (you should be able to do that to properly show bidi text e.g. in w3 mode).

> IMO, since the echo messages are typically one-liners, their directionality
> should be defined by their language.

But what is the language of a message that includes mixed Hebrew and
English words or letters?

In all cases I can think of, the language of the message (the messages to be displayed in the echo area) should be as specified by the locale (LC_MESSAGES). This is because if the locale is English, the message itself (the informative wrapper, the template) is actually meant to be in English, and any Hebrew parts come from quoted characters etc. (template data, runtime variables). Vice versa for the case where LC_MESSAGES=he .

(Explanation for readers who are not familiar with the terms: Typically, for i18n support in Unix apps, you write default messages in English, and print them using e.g. GNU gettext (3). If a translation file (provided by relevant translation team) exists for the language specified by the user's locale, this causes the message to be printed in that language. The translated message itself may be merely a template, which includes placeholders for inserting runtime data).
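
To make the template idea concrete, here is a rough sketch using Python's standard gettext module. The set of RTL languages, the function names, and the message text "%s is undefined" are assumptions made up for this illustration; this is not how Emacs produces or translates its echo-area messages.

```python
import gettext

# Hypothetical, incomplete set of right-to-left message languages.
RTL_LANGUAGES = {"he", "ar", "fa", "ur"}


def message_direction(lc_messages):
    """Pick a base direction from a locale name such as "he_IL.UTF-8"."""
    language = lc_messages.split("_")[0].split(".")[0].lower()
    return "RTL" if language in RTL_LANGUAGES else "LTR"


def format_message(translations, key_description):
    """Translate the template, then insert the runtime data into it."""
    template = translations.gettext("%s is undefined")
    return template % key_description


# NullTranslations has no catalog, so gettext() returns the English
# template unchanged; a real catalog would return the translated one.
english = gettext.NullTranslations()
print(format_message(english, "C-\u05d0"))
print(message_direction("en_US.UTF-8"))
print(message_direction("he_IL.UTF-8"))
```

The point of the sketch is that the direction comes from the language of the template (the locale), not from whatever characters happen to be substituted into it at run time.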

Emacs allows you to mix several scripts (a.k.a. "languages") in the
same buffer, so it is no longer clear in what "language" the document
is written.

> Don't know about "should" (because as you said, both of them look "wrong").
> However, if you let the standard Unicode algorithm reorder the logical string
> "^B is undefined" with the default auto-detected directionality, it really
> does result in what you seem to expect (the circumflex (0x5e) is a
> neutral, and gets the directionality of the run). Maybe this is not really a
> circumflex, or maybe some other magic is at work here.

If "^" were a normal character, I'd agree (and Emacs would then render
them automatically per UAX#9 anyway). But this is not the case.
Here, the string ^B or B^ is a display feature; the display engine
produces these two characters as a single display element, and cursor
motion treats them both as a single atomic entity. The question is:
within that atomic entity, how should we display the "^" part?

OK, that kind of "other magic" then :-)

Don't get me wrong: if the consensus is that we should display this as
if we had 2 distinct characters, using UAX#9 reordering rules, I'm
okay with that.

To me at least, it does seem better to show it as in UAX#9.

However, it seems that I cannot reproduce the scenario at the moment (see below).

> From a brief check, on Linux with X, with Hebrew and English
> layouts, the situation seems to be as follows:
>
> 1) On the basic X level (I used xev to test) there is a "state" (binary
> flags, indicate e.g. if ctrl was held, and also the "group" i.e. if we are
> in Hebrew or English mode), keycode (a number, which is the same for "א" and
> "t"), and an "XLookupString" which is the same (14) for both "ctrl-t" and
> "ctrl-א" (but does differentiate between them if ctrl is not held). xev
> also reports "keysym" which is the unicode point for "t" in both cases
> (ctrl-t and ctrl-א), but is the unicode point for א if control is not
> pressed.
> 2) In gtk (a higher level interface), there is "gdk_keyval_name", which is
> either "א" or "t" according to the current layout (language mode). Whether
> or not ctrl was down is determined by the mask GDK_CONTROL_MASK in the state
> of the event.
>
> Note that at both levels there is no specific code for "ctrl-א". Whatever
> it is that emacs sees is either generated by some higher level function that
> I am not aware of, or generated within emacs itself. Probably we should look
> it up in the code.
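
As a side note on the XLookupString value above: assuming the 14 reported by xev is hexadecimal, it matches the ASCII control code for "t", which is conventionally obtained by keeping only the low five bits of the letter's code. The sketch below shows just that arithmetic; the function name is made up, and it says nothing about how X or Emacs actually map the Hebrew layout key to this value.

```python
def control_code(letter):
    """ASCII control code for Ctrl+letter: keep the low five bits."""
    return ord(letter) & 0x1F


# ord("t") is 0x74; masking with 0x1F leaves 0x14.
print(hex(control_code("t")))
```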

Instead of looking in the code, it is much easier to put the cursor on
the ctrl-א thing, and type "C-u C-x =". Then Emacs will tell you what
it thinks about this character, including its codepoint.

Could you please do this? I need to know that in order to understand
why Emacs treats this "character" as strong R. I cannot produce this
strange character on MS-Windows, or else I'd do this myself.

Not sure how to do that. It only appears in the echo area and I cannot insert it in a buffer (the message disappears if I try to click the minibuffer or move the cursor there using keyboard shortcuts). By the way, the message that I see is "C-א not defined", not ^א as Larry described.

I tried binding the key to self-insert-command, and then I get a regular א inserted into the buffer.

Actually, while typing the above, I realized that while I was trying to bind the key, I had C-א appearing in the minibuffer. Checking, I saw that in that scenario I can actually move the cursor around to it, and use C-u C-x =. However, this reveals that the C-א displayed there is actually three characters (C, -, א)...

From:	Amit Aronovitch
Subject:	Re: [emacs-bidi] Suboptimal display-reordering in minibuffer
Date:	Mon, 28 Jun 2010 03:23:34 +0300