bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator

From: Itai Berli
Subject: bug#27526: 25.1; Nonconformance to Unicode bidirectionality algorithm due to paragraph separator
Date: Thu, 29 Jun 2017 12:16:00 +0300

According to the Emacs manual (section 37.26 Bidirectional Display)

>  Emacs provides a “Full Bidirectionality” class implementation of the
>  UBA, consistent with the requirements of the Unicode Standard v8.0.

And again (section 22.19 Bidirectional Editing)

> Emacs implements the Unicode Bidirectional Algorithm described in the Unicode 
> Standard Annex #9, for reordering of bidirectional text for display.

However, these statements are false. Emacs does not implement the Unicode
Bidirectional Algorithm correctly, and therefore does not even provide
'Implicit bidirectionality', which is the minimal level of conformance
listed in section 4.2 'Explicit Formatting Characters' of the Unicode
8.0.0 Bidirectional Algorithm specification
(www.unicode.org/reports/tr9/tr9-33.html), let alone 'Full bidirectionality'.

The reason has to do with the way the Emacs bidi implementation
recognizes separate paragraphs, which is inconsistent with the Unicode
specification.

The Unicode Bidirectional Algorithm specifies (section 3, 'Basic
Display Algorithm'):

> The algorithm reorders text only within a paragraph; characters in one
> paragraph have no effect on characters in a different
> paragraph. Paragraphs are divided by the Paragraph Separator or
> appropriate Newline Function (for guidelines on the handling of CR,
> LF, and CRLF, see Section 4.4, Directionality, and Section 5.8,
> Newline Guidelines of [Unicode]).

However, Emacs, by its own admission (section 22.19 Bidirectional
Editing), takes the following approach:

> Paragraph boundaries are empty lines, i.e., lines consisting entirely of 
> whitespace characters.

I'll repeat: according to Unicode, a paragraph ends with a paragraph
separator. What constitutes a paragraph separator is specified precisely
in section 5.8 'Newline Guidelines' of The Unicode Standard version
8.0.0. For instance, on a Mac OS X system, it is `LF` (line feed,
U+000A). The formatting effects of the bidi algorithm must not
cross the paragraph separator boundary.
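To make the difference concrete, here is a minimal Python sketch of the two notions of "paragraph". It assumes that "paragraph separator" means any character whose Unicode bidi class is B (which includes `LF`), and it approximates the rule quoted from the Emacs manual with a regular expression. The function names are hypothetical, and neither function is taken from Emacs or from any Unicode reference implementation.

```python
import re
import unicodedata


def uba_paragraphs(text):
    """Split text after every character of bidi class B (CR LF counts once)."""
    paragraphs, start, i = [], 0, 0
    while i < len(text):
        if unicodedata.bidirectional(text[i]) == "B":
            # Treat a CR immediately followed by LF as a single separator.
            if text[i] == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
            # The separator stays with the paragraph it terminates.
            paragraphs.append(text[start:i + 1])
            start = i + 1
        i += 1
    if start < len(text):
        paragraphs.append(text[start:])
    return paragraphs


def blank_line_paragraphs(text):
    """Approximate the manual's rule: paragraphs are divided by lines
    that consist entirely of whitespace."""
    return [p for p in re.split(r"\n(?:[ \t]*\n)+", text) if p]


sample = "שלום\nHello?\n"
print(len(uba_paragraphs(sample)))         # two paragraphs
print(len(blank_line_paragraphs(sample)))  # one paragraph
```

Under the first rule the Hebrew line and the English line are separate paragraphs; under the second they form a single paragraph.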

And yet in Emacs the formatting extends beyond the paragraph separator,
and this is the case on all operating systems. Consider, for instance,
the following example.

ILLUSTRATION: An English paragraph directly following a Hebrew paragraph
is formatted like Hebrew text.

The first, Hebrew paragraph is formatted correctly. However, the second,
English paragraph is formatted wrongly, as though it were a Hebrew
paragraph: it is right-justified, the question mark appears on the left,
and so does the cursor. Once an empty paragraph is inserted between the two
paragraphs, the English paragraph is formatted correctly.

ILLUSTRATION: When paragraphs are separated by an empty paragraph, they
are formatted correctly.

This is not just a theoretical question of conformance to standards;
this problem has practical consequences.

Consider, for instance, a LaTeX document for typesetting Hebrew
text. Normally, in order to eliminate the usual leading indentation of
the first line of a paragraph, a `\noindent` command is placed at the
beginning of the paragraph. However, because the Unicode bidi algorithm
determines the directionality of a paragraph from its first strongly
directional character (here, the Latin letters of the command), the
Hebrew text is formatted like English text. This is not a problem; it is
to be expected.

ILLUSTRATION: A LaTeX document for typesetting a Hebrew paragraph with
no indentation of the first line.
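The rule behind this behaviour can be sketched in a few lines of Python. This is a simplified reading of rules P2 and P3 of UAX #9: the first character of bidi class L, R or AL decides the paragraph direction, and a paragraph with no such character defaults to left-to-right. The sketch deliberately ignores directional isolates, which the full rule skips over, and the function name is hypothetical.

```python
import unicodedata


def base_direction(paragraph):
    """Return "LTR" or "RTL" from the first strong character (simplified P2/P3)."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    # No strong character found: the paragraph level defaults to left-to-right.
    return "LTR"


# The backslash is a neutral character, so the "n" of \noindent decides.
print(base_direction("\\noindent שלום עולם"))
# Without the command, the first Hebrew letter decides.
print(base_direction("שלום עולם"))
```

The first call reports a left-to-right paragraph and the second a right-to-left one, which matches the formatting described above.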

One way to resolve this would be to change the directionality of the
paragraph explicitly. Leaving aside the fact that this is not currently
possible due to a separate Emacs bug, even if it were possible it would
also move the backslash at the beginning of the `\noindent` command, so
the command would no longer look like a LaTeX command.

ILLUSTRATION: Explicitly changing the directionality of the paragraph.

(Note: This is a screenshot of the Microsoft Word application,
since due to a bug, Emacs doesn't currently allow changing the
automatically determined directionality of a paragraph.)

So the best way to resolve this problem would be to place the `\noindent`
command in a separate paragraph. Unfortunately, here Emacs' faulty
implementation of the Unicode bidi algorithm rears its ugly
head. Since Emacs doesn't recognize the paragraph separator for what it
is, it will format the Hebrew text wrongly, as though it were English text.

ILLUSTRATION: Putting the `\noindent` on a separate paragraph results in
the Hebrew text being formatted like English text
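As a rough illustration of why the two-line layout goes wrong, the following Python sketch computes a base direction for the same LaTeX source under both paragraph rules. It assumes the simplified first-strong-character rule (no isolate handling) and approximates the blank-line rule with a regular expression; the helper name and the sample text are made up for this example and are not taken from Emacs.

```python
import re
import unicodedata


def first_strong_direction(text):
    """Simplified P2/P3: the first L, R or AL character decides."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return "LTR"


# \noindent on its own line, directly followed by a line of Hebrew text.
source = "\\noindent\n" + "שלום עולם\n"

# Every LF ends a paragraph (splitlines is adequate for this LF-only sample).
per_line = [first_strong_direction(p) for p in source.splitlines()]

# Only whitespace-only lines divide paragraphs.
blocks = [p for p in re.split(r"\n(?:[ \t]*\n)+", source) if p.strip()]
per_block = [first_strong_direction(p) for p in blocks]

print(per_line)
print(per_block)
```

With newline-delimited paragraphs the Hebrew line gets its own right-to-left direction. With blank-line-delimited paragraphs both lines share one paragraph, whose direction is taken from the Latin letters of `\noindent`.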

Placing an empty paragraph between the `\noindent` command and the
Hebrew text will resolve the formatting problem inside the Emacs editor, but
now the `\noindent` command, which only affects the current LaTeX
paragraph (LaTeX paragraphs are ended by an empty line), no longer
eliminates the indentation of the first line of the Hebrew paragraph in
the typeset file.

In GNU Emacs 25.1.1 (x86_64-apple-darwin13.4.0, NS appkit-1265.21
Version 10.9.5 (Build 13F1911))
 of 2016-09-21 built on builder10-9.porkrind.org
Windowing system distributor 'Apple', version 10.3.1504
Configured using:
 'configure --with-ns '--enable-locallisppath=/Library/Application
 Support/Emacs/site-lisp' --with-modules'

Configured features:

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Fundamental

Minor modes in effect:
  ivy-mode: t
  shell-dirtrack-mode: t
  projectile-mode: t
  helm-descbinds-mode: t
  async-bytecomp-package-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
ad-handle-definition: ‘ibuffer’ got redefined
Turn on helm-projectile key bindings
For information about GNU Emacs and the GNU system, type C-h C-a.

Load-path shadows:
/Users/itaiberli/.emacs.d/elpa/seq-2.20/seq hides

(shadow sort mail-extr emacsbug message rfc822 mml mml-sec epg mm-decode
mm-bodies mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader
sendmail rfc2047 rfc2045 ietf-drums mail-utils colir color counsel
jka-compr esh-util etags xref project swiper reftex reftex-vars
two-column ivy delsel ivy-overlay helm-projectile helm-files rx
image-dired tramp tramp-compat tramp-loaddefs trampver shell pcomplete
format-spec dired-x dired-aux ffap helm-tags helm-bookmark helm-adaptive
helm-info bookmark pp helm-external helm-net browse-url xml url
url-proxy url-privacy url-expand url-methods url-history url-cookie
url-domsuf url-util url-parse auth-source gnus-util mm-util help-fns
mail-prsvr password-cache url-vars mailcap helm-buffers helm-grep
helm-regexp helm-utils helm-locate helm-help helm-types projectile grep
compile comint ansi-color ring ibuf-ext ibuffer thingatpt helm-descbinds
helm easy-mmode helm-source cl-seq eieio-compat eieio eieio-core
helm-multi-match helm-lib dired helm-config helm-easymenu cl-macs
async-bytecomp async advice edmacro kmacro finder-inf tex-site info
package epg-config seq byte-opt gv bytecomp byte-compile cl-extra
help-mode easymenu cconv cl-loaddefs pcase cl-lib time-date mule-util
tooltip eldoc electric uniquify ediff-hook vc-hooks lisp-float-type
mwheel ns-win ucs-normalize term/common-win tool-bar dnd fontset image
regexp-opt fringe tabulated-list newcomment elisp-mode lisp-mode
prog-mode register page menu-bar rfn-eshadow timer select scroll-bar
mouse jit-lock font-lock syntax facemenu font-core frame cl-generic cham
georgian utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao
korean japanese eucjp-ms cp51932 hebrew greek romanian slovak czech
european ethiopic indian cyrillic chinese charscript case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer cl-preloaded nadvice
loaddefs button faces cus-face macroexp files text-properties overlay
sha1 md5 base64 format env code-pages mule custom widget
hashtable-print-readable backquote kqueue cocoa ns multi-tty
make-network-process emacs)

Memory information:
((conses 16 312045 13704)
 (symbols 48 30403 0)
 (miscs 40 88 192)
 (strings 32 51754 11765)
 (string-bytes 1 1669992)
 (vectors 16 50218)
 (vector-slots 8 844617 7052)
 (floats 8 564 218)
 (intervals 56 242 111)
 (buffers 976 18))
