emacs-devel

Re: Bidirectional editing in Emacs -- main design decisions


From: joakim
Subject: Re: Bidirectional editing in Emacs -- main design decisions
Date: Fri, 09 Oct 2009 23:55:19 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)

Eli Zaretskii <address@hidden> writes:

> As some of you know, I'm slowly working on adding support for
> bidirectional editing in Emacs.  (Before you ask: the code is not
> publicly available yet, and won't be until Emacs switches to bzr as
> its main VCS.)

Short question, if you have time: I'm working (also slowly) on a patch to
embed GTK widgets in buffers (the xwidget patch).  It works mostly the
same way as embedding images.  From what you're writing below, it sounds
like the display of images will work as before, so my patch should
apply, hopefully cleanly, on top of bidi.  Correct?

It will be very nice to see your patch when you're ready to publish!

> While there's a lot of turf to be covered yet, I thought I'd publish
> the main design decisions up to this point.  Many of these decisions
> were discussed at length years ago on emacs-bidi mailing list, and
> since then I also talked them over in private email with a few people.
> Other decisions were made recently, as I went about changing the
> display engine.
>
> My goal, and the main driver behind these design decisions, was to
> preserve as much as possible the basic assumptions and design
> principles of the current Emacs display engine.  This is not just
> opportunism; I firmly believe that any other way would mean a total
> redesign and rewrite of the display engine, which is something we want
> to avoid.  Personally, if such a redesign were necessary, I could not
> participate in that endeavor, except as an advisor.
>
> With that preamble out of my way, here's what I can tell about the
> subject at this point:
>
> 1. Text storage
>
>    Bidirectional text in Emacs buffers and strings is stored in strict
>    logical order (a.k.a. "reading order").  This is how most (if not
>    all) other implementations handle bidirectional text.  The
>    advantage of this is that file and process I/O is trivial, as well
>    as text search.  The disadvantage is that text needs to be
>    reordered for display (see below) and also for sending to any other
>    visual-order stream, such as a printer or a file in visual-order
>    encoding.
>
> 2. Support for Unicode Bidirectional Algorithm
>
>    The Unicode Bidirectional Algorithm, described in Annex 9 of the
>    Unicode Standard (a.k.a. UAX#9, see http://www.unicode.org/reports/tr9/),
>    specifies how to reorder bidirectional text from logical to visual
>    order.  Emacs will belong to the so-called "Full Bidirectionality"
>    class of applications, which includes support for both implicit
>    bidirectional reordering and explicit directional embedding codes
>    that allow overriding the implicit reordering.  This means that
>    Emacs supports the entire spectrum of Unicode character properties
>    and special codes relevant to bidirectional text.
>
> 3. Bidi formatting codes are retained
>
>    At some point in the reordering described by UAX#9, the various
>    formatting codes are to be removed from the text, once they've
>    performed their role of forcing the order of characters for
>    display, because they are not supposed to be visible on display.
>    Contrary to this, Emacs does not remove these formatting codes; it
>    just behaves as if they were not there.  (This behavior is
>    acknowledged by UAX#9 under the "Retaining Format Codes" clause, so
>    Emacs does not break conformance here.)  This is primarily because
>    Emacs must preserve the text that was not edited; in particular,
>    visiting a file and then saving it to a different file without
>    changing anything must produce the same byte stream as the original
>    file, even if the formatting codes were part of the original file.
>    In addition, being able to show these formatting codes to the user
>    is a valuable feature, because the way reordered text looks might
>    not be otherwise understood or changed easily.
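
To make the distinction in item 3 concrete, here is a small Python
sketch (not Emacs code).  It assumes that "behaving as if the codes are
not there" can be approximated by skipping characters whose Unicode bidi
class is one of the explicit embedding or override codes.  The name
`displayable_chars` is made up for this illustration.

```python
import unicodedata

# Bidi classes of the explicit formatting codes that UAX#9 would drop
# (LRE, RLE, LRO, RLO, PDF).  Assumption: this sketch ignores the BN
# class and any codes added to Unicode after this message was written.
EXPLICIT_CODES = {"LRE", "RLE", "LRO", "RLO", "PDF"}

def displayable_chars(text):
    """Return the characters a display pass would draw.

    Explicit formatting codes are skipped, but TEXT itself is never
    modified, so saving it reproduces the original bytes.
    """
    return "".join(c for c in text
                   if unicodedata.bidirectional(c) not in EXPLICIT_CODES)

# RLE (U+202B) and PDF (U+202C) wrapped around two Hebrew letters.
stored = "abc \u202b\u05d0\u05d1\u202c def"
shown = displayable_chars(stored)

# What would be written back to the file: the codes are still there.
saved = stored.encode("utf-8")
```

The point of the sketch is only that filtering happens on the way to the
screen, never in the stored text.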
>
> 4. Reordering of text for display
>
>    Reordering for display happens as part of Emacs redisplay.  In a
>    nutshell, the current unidirectional redisplay code walks through
>    buffer text and considers each character in turn.  After each
>    character is processed and translated into a `struct glyph', which
>    includes all the information needed for displaying that character,
>    the iterator's position is incremented to the next character.
>
>    In the bidi Emacs, this _linear_ iteration through the buffer is
>    replaced with a _non-linear_ one, whereby instead of incrementing
>    buffer position, a function is called to return the next position
>    in the visual order.  Whatever position it returns is processed
>    next into a `struct glyph'.  The rest of the code that produces
>    "glyph matrices" (data structures used to decide which parts of the
>    screen need to be redrawn) is largely ignorant of the
>    bidirectionality of the text.  Of course, parts of the display
>    engine that manipulate the glyph matrices directly and assume that
>    buffer positions increase monotonically with glyph positions need
>    to be fixed or rewritten.  But these parts of the display are
>    relatively few and localized.  Also, some redisplay optimizations
>    need to be disabled when bidirectional text is rendered for
>    display.
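
The non-linear iteration in item 4 can be sketched in Python.  This is
only an illustration under simplifying assumptions: the resolved level
of every character is taken as already known and passed in as a list,
and the visual order is computed all at once with the run-reversal rule
of UAX#9 (rule L2), whereas the design described above asks a function
for the next position one step at a time.  The names `visual_order` and
`iterate_visually` are hypothetical.

```python
def visual_order(levels):
    """Map resolved embedding levels to logical indices in visual order.

    Follows UAX#9 rule L2: from the highest level down to the lowest
    odd level, reverse every contiguous run of characters whose level
    is at least that high.
    """
    order = list(range(len(levels)))
    odd_levels = [lvl for lvl in levels if lvl % 2 == 1]
    if not odd_levels:
        return order              # purely left-to-right: nothing to do
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(order):
            if levels[order[i]] >= level:
                j = i
                while j < len(order) and levels[order[j]] >= level:
                    j += 1
                order[i:j] = reversed(order[i:j])
                i = j
            else:
                i += 1
    return order

def iterate_visually(text, levels):
    """Yield (logical_position, character) pairs in visual order,
    standing in for the 'give me the next position' function."""
    for pos in visual_order(levels):
        yield pos, text[pos]

# Lowercase = left-to-right, uppercase = right-to-left (a common
# convention in bidi examples); "de" is LTR text nested inside RTL.
text = "aBCdeFg"
levels = [0, 1, 1, 2, 2, 1, 0]
visual = "".join(ch for _, ch in iterate_visually(text, levels))
```

Here `visual` comes out as "aFdeCBg": positions are visited in the order
0, 5, 3, 4, 2, 1, 6, so buffer positions no longer increase
monotonically along the line, which is exactly what the code touching
glyph matrices has to cope with.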
>
> 5. Visual-order information is volatile
>
>    There were lots of discussions several years ago about whether
>    Emacs should record in some way the information needed to reorder
>    text into visual order of the characters, to reuse it later.  In
>    UAX#9 terminology, this information is the "resolved level" of each
>    character.  Various features were suggested as a vehicle for this,
>    for example, some special text properties (except that text
>    properties, unlike resolved levels, cannot overlap).  Lots of
>    energy went into discussing how this information would be recorded
>    and how it would be reused, e.g. if a portion of the text was
>    copy-pasted into a different buffer or string.  The complications,
>    it turns out, abound.
>
>    The current design doesn't record this information at all.  It
>    simply recomputes it each time a buffer or string needs to be
>    displayed or sent to a visual-order stream.  The resolved levels
>    are computed during reordering, then forgotten.  It turns out that
>    bidirectional iteration through buffer text is not much more
>    expensive than the current unidirectional one.  The implementation
>    of UAX#9 written for Emacs is efficient enough to make any
>    long-term caching of resolved levels unnecessary.
>
> 6. Reordering of strings from `display' properties
>
>    Strings that are values of `display' text properties and overlay
>    properties are reordered individually.  This matters when such
>    properties cover adjacent portions of buffer text, back to back.
>    For example, PROP1 is associated with buffer positions P1 to P2,
>    and PROP2 immediately follows it, being associated with positions
>    P2 to P3.  The current design calls for reordering the characters
>    of the strings that are the values of PROP1 and PROP2 separately.
>    An alternative would be to feed them concatenated into the
>    reordering algorithm, in which case the characters coming from
>    PROP2 could end up displayed before (to the left of) the characters
>    coming from PROP1.  However, this alternative requires major
>    surgery on several parts of the display code.  (Interested readers
>    are advised to read the code of set_cursor_from_row in xdisp.c, as
>    just one example.)  It's not clear what is TRT to do in this case
>    anyway; I'm not aware of any other application that provides
>    similar features, so there's nothing I could compare it to.  So I
>    decided to go with the easier design.  If the application needs a
>    single long string, it can always collapse two or more `display'
>    properties into one long one.
>
>    Another, perhaps more serious implication of this design decision
>    is that strings from `display' properties are reordered separately
>    from the surrounding buffer text.  IOW, production of glyphs from
>    reordered buffer text is stopped when a `display' property is
>    found, the string that is the property's value is reordered and
>    displayed, and then the rest of text is reordered and its glyphs
>    produced.  The effect will be visible, e.g., when a `display'
>    string is embedded in right-to-left text in an otherwise
>    left-to-right paragraph.  Again, I think that in the absence of
>    clear "prior art", simplicity of design and the amount of changes
>    required in the existing display engine win here.
>
> 7. Paragraph base direction
>
>    Bidirectional text can be rendered in left-to-right or in
>    right-to-left paragraphs.  The former is used for mostly
>    left-to-right text, possibly with some embedded right-to-left text.
>    The latter is used for text that is mostly or entirely
>    right-to-left.  Right-to-left paragraphs are displayed flush
>    against the right margin of the display; this is how users of
>    right-to-left scripts expect to see text in their languages.
>
>    UAX#9 specifies how to determine whether this attribute of a
>    paragraph, called "base direction", is one or the other, by finding
>    the first strong directional character in the paragraph.  However,
>    the Unicode Character Database specifies that NL and CR characters
>    are paragraph separators, which means each line is a separate
>    paragraph, as far as UAX#9 is concerned.  If Emacs followed UAX#9
>    to the letter, each line could have a different base direction,
>    which is, of course, intolerable.  We could avoid this nonsense by
>    using the "soft newline" or similar features, but I firmly believe
>    that Emacs should DTRT with bidirectional text even in the simplest
>    modes, including the Fundamental mode, where every newline is hard.
>
>    Fortunately, UAX#9 acknowledges that applications could have other
>    ideas about what is a "paragraph".  It calls this ``higher
>    protocol''.  So I decided to use such a higher protocol -- namely,
>    the Emacs definition of a paragraph, as determined by the
>    `paragraph-start' and `paragraph-separate' regexps.  Therefore, the
>    first strong directional character after `paragraph-start' or
>    `paragraph-separate' determines the paragraph direction, and that
>    direction is kept for all the lines of the paragraph, until another
>    `paragraph-separate' is found.  (Of course, this means that
>    inserting a single character near the beginning of a paragraph
>    might affect the display of all the lines in that paragraph, so
>    some of the current redisplay optimizations which deal with changes
>    to a single line need to be disabled in this case.)
>
>    There is a buffer-specific variable `paragraph-direction' that
>    allows overriding this dynamic detection of the direction of each
>    paragraph, and forcing a certain base direction on all paragraphs in
>    the buffer.  I expect, for example, each major mode for a
>    programming language to force the left-to-right paragraph
>    direction, because programming languages are written left to right,
>    and right-to-left scripts appear in such buffers only in strings
>    embedded in the program or in comments.
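
The first-strong-character rule from item 7, applied per paragraph
rather than per line, might look roughly like this in Python.  Two
assumptions are made for the sake of a short example: paragraphs are
separated by blank lines (a stand-in for the `paragraph-start' and
`paragraph-separate' regexps), and the `override` argument plays the
role of the buffer-specific variable.  The function names are
hypothetical.

```python
import re
import unicodedata

# Bidi classes that count as "strong" and the direction each implies.
STRONG = {"L": "left-to-right", "R": "right-to-left", "AL": "right-to-left"}

def base_direction(paragraph, override=None):
    """Return the base direction of one paragraph.

    With no override, the first strong directional character decides;
    a paragraph with no strong character defaults to left-to-right.
    """
    if override is not None:
        return override
    for ch in paragraph:
        direction = STRONG.get(unicodedata.bidirectional(ch))
        if direction is not None:
            return direction
    return "left-to-right"

def paragraph_directions(text, override=None):
    """Split TEXT at blank lines (an assumption of this sketch) and
    return one base direction per paragraph, shared by all its lines."""
    paragraphs = [p for p in re.split(r"\n[ \t]*\n", text) if p.strip()]
    return [base_direction(p, override) for p in paragraphs]

text = ("\u05e9\u05dc\u05d5\u05dd world\n"   # Hebrew word first
        "second line, all Latin\n"
        "\n"
        "hello \u05e2\u05d5\u05dc\u05dd\n")  # Latin word first

detected = paragraph_directions(text)
forced = paragraph_directions(text, override="left-to-right")
```

Note that the second line of the first paragraph contains only Latin
letters, yet it inherits the right-to-left direction of its paragraph;
following UAX#9 line by line would have made it left-to-right.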
>
> 8. User control of visual order
>
>    UAX#9 does not always produce perfect results on the screen.
>    Notable cases where it doesn't are related to characters such as
>    `+' and `-' which have more than one role: they can be used in
>    mathematical context or in plain-text context; the "correct"
>    reordering turns out to be different in each case.
>
>    Again, lots of energy was invested in past discussions of how to
>    prevent these blunders.  Several clever heuristics are known that
>    avoid them.  The problem is that all those heuristics contradict
>    UAX#9, which means text that looks OK in Emacs will look different
>    (i.e. wrong) in another application.
>
>    I decided it was unjustified to deviate from UAX#9.  Its algorithm
>    already provides the solution to this problem: users can always
>    control the visual order by inserting special formatting codes at
>    strategic places.  These codes are by default not shown in the
>    displayed text, but they influence the resolved directionality of
>    the surrounding characters, and thus change their visual order.  We
>    could (and probably should) have commands in Emacs to control the
>    visual order that will work simply by inserting the appropriate
>    formatting codes.  For example, a paragraph starting with an Arabic
>    letter could nonetheless be rendered as a left-to-right paragraph
>    by inserting the LRM code before that Arabic character; Emacs could
>    have a command called, say, `make-paragraph-left-to-right' that did
>    its job simply by inserting LRM at the beginning of the paragraph.
>
>    This design kills two birds with one stone: (a) it produces text
>    that is compatible with other applications and will display there
>    the same as in Emacs, and (b) it avoids the need to invent yet
>    another Emacs infrastructure feature to keep information such as
>    paragraph direction outside of the text itself.
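
The LRM trick from item 8 is easy to try outside Emacs.  The Python
sketch below is an illustration only: `make_paragraph_left_to_right`
borrows the name suggested above, but it is a plain function on strings
here, and `first_strong` is a hypothetical helper.  It relies on the
fact that LRM (U+200E) is itself a strong left-to-right character.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, bidi class "L"

def first_strong(paragraph):
    """Return the bidi class ("L", "R" or "AL") of the first strong
    directional character in PARAGRAPH, or None if there is none."""
    for ch in paragraph:
        cls = unicodedata.bidirectional(ch)
        if cls in ("L", "R", "AL"):
            return cls
    return None

def make_paragraph_left_to_right(paragraph):
    """Force a left-to-right base direction by inserting LRM at the
    start, unless the paragraph already resolves to left-to-right."""
    if first_strong(paragraph) == "L":
        return paragraph
    return LRM + paragraph

arabic_first = "\u0633\u0644\u0627\u0645 hello"
fixed = make_paragraph_left_to_right(arabic_first)
```

Because the direction is carried by a character in the text, any other
UAX#9-conformant application reading `fixed` should arrive at the same
paragraph direction, with no side-channel metadata needed.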
>
> That is all for now.  If you have comments or questions, you are
> welcome to voice them.  However, I reserve the right to respond only
> to those I'm interested in and/or have time for. ;-)
>
-- 
Joakim Verona



