From: Itai Berli
Subject: bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator
Date: Tue, 4 Jul 2017 18:57:33 +0300
> From: Itai Berli <itai.berli@gmail.com>
> Date: Tue, 4 Jul 2017 13:42:19 +0300
>
> I'd like to add another reason why this behavior is problematic: it breaks interoperability with other plain text
> editors, since the text will not be displayed the same way. Consider, for instance, the very same plain text file
> in GEdit: http://imgur.com/Iw4yrdQ
> in Emacs: http://imgur.com/7kfWseE
As I already explained, the behavior of GEdit is unacceptable for
Emacs, because most modes derived from Text mode tend to deal with
buffers where lines are broken by newlines, so potentially switching
paragraph direction just because a newline happens to be there would
have a devastating effect on the text as displayed.  This is perhaps in
contrast with other editors and word-processors which mostly deal with
long lines without hard newlines. That's why the notion of paragraph
in Emacs's UBA implementation was chosen to fit the traditional Emacs
definition of paragraph in text-mode and its derivatives.
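
To make the difference concrete, here is a minimal Python sketch.  It is
not Emacs's actual code: the function names are hypothetical, the
"first strong character" test is a simplified form of rules P2 and P3
that ignores isolates, and it assumes that a text-mode paragraph is
delimited by blank lines.  It computes the base direction once per line
and once per blank-line-delimited paragraph.

```python
import re
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Simplified P2/P3: direction of the first strong character.

    Isolate handling from rule P2 is deliberately left out of this sketch.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


def directions_per_line(text):
    """Treat every newline as a paragraph boundary."""
    return [first_strong_direction(line)
            for line in text.split("\n") if line.strip()]


def directions_per_blank_line_paragraph(text):
    """Assumed text-mode style: only blank lines separate paragraphs."""
    paragraphs = re.split(r"\n[ \t]*\n", text)
    return [first_strong_direction(p) for p in paragraphs if p.strip()]


# The first word is Hebrew (U+05E9 U+05DC U+05D5 U+05DD); the hard newline
# after the first line is only a line break inside one logical paragraph.
SAMPLE = ("\u05e9\u05dc\u05d5\u05dd this paragraph starts in Hebrew\n"
          "and continues after a hard newline\n"
          "\n"
          "An English paragraph")

print(directions_per_line(SAMPLE))
print(directions_per_blank_line_paragraph(SAMPLE))
```

With a boundary at every newline, the continuation line of the first
paragraph gets its own base direction, and here it flips to LTR.  With
blank-line boundaries the whole paragraph keeps one direction.
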
> Finally, the question of whether Emacs behavior is consistent with the UBA specifications is debatable, since
> when UBA section 3 states "Paragraphs may also be determined by higher-level protocols" the question is
> what exactly the "also" means: is it that the higher-level protocols (HLP) can decide that a newline character is
> not a paragraph boundary, as Emacs does, or is it that the HLP can only declare paragraph boundaries in
> addition to paragraph separator characters?
It is clear from the context and the example following the above
sentence that "also" doesn't mean "in addition".
However, the main issue is not where the paragraph boundary is; it is
how the base direction of the paragraph is determined.  No matter where
the boundary falls, if the base direction is not recalculated there,
the boundary itself makes no difference.
From Section 4.3, Higher-Level Protocols, of UAX#9:
HL1. Override P3, and set the paragraph embedding level
explicitly. This does not apply when deciding how to treat FSI
in rule X5c.
. A higher-level protocol may set any paragraph level. This can
be done on the basis of the context, such as on a table cell,
paragraph, document, or system level. (P2 may be skipped if
P3 is overridden). [...]
. A higher-level protocol may apply rules equivalent to P2 and
P3 but default to level 1 (RTL) rather than 0 (LTR) to match
overall RTL context.
. A higher-level protocol may use an entirely different
algorithm that heuristically auto-detects the paragraph
embedding level based on the paragraph text and its
context. For example, it could base it on whether there are
more RTL characters in the text than LTR. As another example,
when the paragraph contains no strong characters, its
direction could be determined by the levels of the paragraphs
before and after.
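
As an illustration of the kind of heuristic HL1 permits, here is a
minimal Python sketch of the two examples in the last bullet: a majority
count of strong characters, with a fallback to neighboring paragraphs
when a paragraph has no strong characters at all.  This is not what
Emacs or any particular application does.  The function names, the
tie-breaking toward LTR, and the choice to consult the previous
paragraph before the next one are all assumptions made for the sketch.

```python
import unicodedata


def strong_counts(text):
    """Count strong LTR (class L) and strong RTL (classes R, AL) characters."""
    ltr = rtl = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            ltr += 1
        elif bidi_class in ("R", "AL"):
            rtl += 1
    return ltr, rtl


def majority_level(text):
    """Return 0 (LTR), 1 (RTL), or None if there are no strong characters.

    Assumption: a tie between LTR and RTL counts resolves to level 0.
    """
    ltr, rtl = strong_counts(text)
    if ltr == 0 and rtl == 0:
        return None
    return 1 if rtl > ltr else 0


def paragraph_levels(paragraphs, default=0):
    """Assign an embedding level to each paragraph.

    Paragraphs without strong characters take the level of the previous
    paragraph, else of the next paragraph that has strong characters,
    else `default` (pass default=1 to match an overall RTL context).
    """
    raw = [majority_level(p) for p in paragraphs]
    resolved = []
    for i, level in enumerate(raw):
        if level is None:
            if resolved:
                level = resolved[-1]
            else:
                level = next((l for l in raw[i + 1:] if l is not None),
                             default)
        resolved.append(level)
    return resolved


paragraphs = [
    "\u05e9\u05dc\u05d5\u05dd abc \u05e2\u05d5\u05dc\u05dd",  # mostly Hebrew
    "123 ...",                                                # no strong chars
    "hello world",                                            # Latin only
]
print(paragraph_levels(paragraphs))
```
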
And Section 3.3.1, which describes the P1, P2, and P3 paragraph-level
rules, says:
Whenever a higher-level protocol specifies the paragraph level,
rules P2 and P3 may be overridden: see HL1.
So an application is allowed to override _all_ of the paragraph-level
rules, and do what suits it best. And based on some non-negligible
experience with bidi-aware applications, I submit that an application
that does _not_ employ some higher-level protocol for base paragraph
direction will violate user expectations when working with plain text.
E.g., try reading in MS Outlook an unformatted text message which has
a lot of RTL text mixed with LTR. It's unreadable; I always
copy/paste it into Emacs, and only then am I able to read it.