emacs-bidi

Re: [emacs-bidi] Suboptimal display-reordering in minibuffer


From: Eli Zaretskii
Subject: Re: [emacs-bidi] Suboptimal display-reordering in minibuffer
Date: Sun, 27 Jun 2010 20:25:58 +0300

> Date: Sun, 27 Jun 2010 06:30:27 +0300
> From: Amit Aronovitch <address@hidden>
> Cc: Eli Zaretskii <address@hidden>, address@hidden
> 
> First, thanks Eli and all contributors for the remarkable effort, and all
> the recent progress!

You're most welcome.

> Note that there are two separate issues:
> (1) Directionality (I'll use B here to represent Hebrew Bet):
>    Should the message be displayed as "is undefined B^" (RTL paragraph dir)
> or as "^B is undefined" (LTR paragraph dir)?
> 
> (2)  Alignment (to the right or left margin), i.e. where that message is to
> be displayed. It makes sense to align to the "start" direction (i.e. right
> for RTL and left for LTR), but AFAIK this is a matter of style and not
> within the scope of the Unicode standard.
> 
>     (2) is a relatively minor problem, while (1) could be a real source for
> confusion to the reader.

In Emacs, (2) is entirely determined by (1): an L2R paragraph is
displayed flush with the left margin of the window, while an R2L
paragraph is displayed flush with the right margin.

I don't see any reason to make the paragraph direction and the
alignment independent.  Every bidi-aware word processor I've seen
behaves as I described above, and I'm quite sure users expect that.

> True. There is no way to determine with 100% certainty the correct
> direction of a sentence out of context. That is why the Unicode standard
> leaves a "higher level protocol" free to set it (
> http://unicode.org/reports/tr9/ HL1) .
> When such information is not available, a simple default algorithm is
> described by the standard (rules P2, P3). This is implemented by common bidi
> reordering libs, and I guess this is the reason for what you see here.

Emacs doesn't use any reordering libraries, but it does implement
UAX#9 to the letter, including determining the paragraph direction
from its first strong directional character.
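
For readers who want to see the first-strong rule in isolation, here
is a minimal Python sketch of UAX#9 rules P2 and P3.  It is not Emacs
code: the function name paragraph_direction is made up for this
illustration, and the sketch ignores the isolate handling that later
versions of UAX#9 added to P2.

```python
import unicodedata


def paragraph_direction(text, default="L"):
    """Return "L" or "R" from the first strong character (UAX#9 P2/P3).

    Hypothetical helper for illustration only; it does not skip text
    inside directional isolates, which newer UAX#9 versions require.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL"):
            return "R"
    # No strong character found: fall back to the default direction.
    return default


# "^" is a neutral, so the Hebrew Bet (U+05D1) decides the direction.
print(paragraph_direction("^\u05d1 is undefined"))
print(paragraph_direction("is undefined ^\u05d1"))
```

The first call finds Bet before any Latin letter, the second finds
"i" first, so the same words yield different paragraph directions
depending only on their logical order.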

> > Aren't problems like this the entire raison d'etre of the invisible RLM
> > and LRM characters?
> >
> >
> One of the main reasons. True. But, depending on the bidi reordering
> function used, the application might be able to achieve the results by
> providing this "higher level choice" itself. With libfribidi, the
> "pbase_dir" input parameter can be used for that.

In Emacs, we have the bidi-paragraph-direction variable, which
overrides the direction determined by the first strong character.
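
The two ways of forcing a direction mentioned above (an invisible
LRM/RLM mark in the text, or a higher-level override such as
bidi-paragraph-direction or fribidi's pbase_dir) can be sketched in
Python as follows.  The function names and the override parameter are
hypothetical, chosen only for this illustration; this is not how
either Emacs or fribidi is implemented.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, an invisible strong L character
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, an invisible strong R character


def first_strong_direction(text, default="L"):
    """Direction of the first strong character (simplified P2/P3)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL"):
            return "R"
    return default


def resolve_direction(text, override=None):
    """Let a higher-level choice (in the spirit of HL1) win when given."""
    if override in ("L", "R"):
        return override
    return first_strong_direction(text)


message = "\u05d1 is undefined"  # starts with Hebrew Bet

print(resolve_direction(message))                # first strong is Bet
print(resolve_direction(LRM + message))          # the mark comes first
print(resolve_direction(message, override="L"))  # explicit override
```

Both the LRM prefix and the override turn the paragraph into a
left-to-right one; the difference is that the mark lives in the text
itself, while the override is a setting outside the text.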

> IMO, since the echo messages are typically one-liners, their directionality
> should be defined by their language.

But what is the language of a message that includes mixed Hebrew and
English words or letters?

Emacs allows you to mix several scripts (a.k.a. "languages") in the
same buffer, so it is no longer clear in what "language" the document
is written.

> Don't know about "should" (because as you said, both of them look "wrong").
> However, if you let the standard Unicode algorithm reorder the logical string
> "^B is undefined" with the default auto-detected directionality, it really
> does result in what you seem to expect (the circumflex (0x5e) is a
> neutral, and gets the directionality of the run). Maybe this is not really a
> circumflex, or maybe some other magic is at work here.

If "^" were a normal character, I'd agree (and Emacs would then render
them automatically per UAX#9 anyway).  But this is not the case.
Here, the string ^B or B^ is a display feature; the display engine
produces these two characters as a single display element, and cursor
motion treats them both as a single atomic entity.  The question is:
within that atomic entity, how should we display the "^" part?

Don't get me wrong: if the consensus is that we should display this as
if we had 2 distinct characters, using UAX#9 reordering rules, I'm
okay with that.

> From a brief check on Linux with X, with Hebrew and English
> layouts, the situation seems to be as follows:
> 
> 1) On the basic X level (I used xev to test) there is a "state" (binary
> flags, indicate e.g. if ctrl was held, and also the "group" i.e. if we are
> in Hebrew or English mode), keycode (a number, which is the same for "א" and
> "t"), and an "XLookupString" which is the same (14) for both "ctrl-t" and
> "ctrl-א" (but does differentiate between them if ctrl is not held). xev
> also reports a "keysym", which is the Unicode code point for "t" in both
> cases (ctrl-t and ctrl-א), but is the Unicode code point for א if control
> is not pressed.
> 2) In gtk (a higher level interface), there is "gdk_keyval_name", which is
> either "א" or "t" according to the current layout (language mode). Whether
> or not ctrl was down is determined by the mask GDK_CONTROL_MASK in the state
> of the event.
> 
>  Note that at both levels there is no specific code for "ctrl-א". Whatever
> it is that Emacs sees is either generated by some higher level function that
> I am not aware of, or generated within Emacs itself. Probably we should look
> it up in the code.

Instead of looking in the code, it is much easier to put the cursor on
the ctrl-א thing, and type "C-u C-x =".  Then Emacs will tell you what
it thinks about this character, including its codepoint.
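
Outside Emacs, a rough equivalent of the part of that report which
matters here (codepoint, name, and bidi class) can be obtained with
Python's unicodedata module.  This is only a sketch: the helper name
describe_char is made up for the illustration, and it says nothing
about what Emacs itself stores for the ctrl-א element.

```python
import unicodedata


def describe_char(ch):
    """Return (codepoint, name, bidi class) for a single character.

    Hypothetical helper; characters without a Unicode name (such as
    control characters) get the placeholder "<unnamed>".
    """
    codepoint = "U+%04X" % ord(ch)
    name = unicodedata.name(ch, "<unnamed>")
    bidi_class = unicodedata.bidirectional(ch)
    return codepoint, name, bidi_class


for ch in ("\u05d0", "t", "^"):
    print(describe_char(ch))
```

A bidi class of R or AL means the character is strong right-to-left,
L means strong left-to-right, and ON is one of the neutral classes.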

Could you please do this?  I need to know that in order to understand
why Emacs treats this "character" as strong R.  I cannot produce this
strange character on MS-Windows, or else I'd do this myself.



