[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: readline display bug with UTF-8 and char insertion

From: Egmont Koblinger
Subject: Re: readline display bug with UTF-8 and char insertion
Date: Sun, 17 Feb 2013 14:42:17 +0100

Hi Chet,

Friendly ping - could you please look at this bug and patch, too?  I know
it's bit more complicated, and I haven't finished the non-UTF part, but
based on my findings I believe this should be really straightforward for

I know bash's UTF-8 support is contributed code and is quite complex, and I
know it improved a lot since the beginning, but unfortunately there are
still some problems with it, and it really frustrates me that handling
accented characters still causes problems in 2013, I'd really love to see
these problems disappear for good.  Please let me know if there's anything
more I could do to get this one fixed.


On Mon, Jan 21, 2013 at 9:16 PM, Egmont Koblinger <egmont@gmail.com> wrote:

> Hi,
> On a fully UTF-8 environment, in certain easily reproducible cases, the
> command line becomes messed up (the cursor ends up in the wrong column, and
> from then on it's tons of cumulative mistakes).  I'm using bash-4.2.42, but
> the bug also affects other readline-based applications.
> To reproduce:  Save the attached file as your .bash_history.  (Notice the
> UTF-8 accented characters in the middle line.  Also notice the trailing
> space in each line.)  Use either the Up arrow, or enter a common prefix of
> the lines and press a hotkey bound to history-search-backward.  You'll see
> the command line getting messed up.
> Please read on for my findings on what goes on, and my proposed fix.
> The bug occurs only when the readline's insert_some_chars() method (the
> one which inserts columns pushing some characters to the right) is
> involved.  It also needs UTF-8 characters to be present.
> My first finding was that display.c line 1684, namely:
>   twidth = _rl_col_width (nfd+lendiff, 0, temp-col_lendiff, 1);
> is clearly broken.  _rl_col_width() converts logical byte offsets into
> columns, and it is passed a column where a byte offset should be present,
> this is the typical case of "code smell", semantic incorrectness.  Changing
> the argument from "temp-col_lendiff" to "temp-lendiff" fixes the case where
> you go back from the last history entry to the previous; however it breaks
> going back from that one to the first :)
> My second finding, trying to fix the newly introduced bug, was that the
> concept of the insert_some_chars() is broken in UTF-8 environment.  Here's
> why.
> We know lendiff (the growth of the command line in bytes) and col_lendiff
> (the growth in columns).  However, the code tries to print the new text in
> two runs: first the segment that corresponds to the newly created columns
> (this happens within insert_some_chars()), and then the ones that overwrite
> the old characters (in update_line()).  In order to do that correctly we
> would need to know how many bytes would produce the given number of
> columns, (kinda the opposite of _rl_col_width()), but we have no method for
> that, so no wonder the code does it wrong.  It's not even possible to have
> such a method, e.g. when simple English letter is be replaced by a CJK
> glyph then coldiff==1, so you shift the remaining characters to the right
> by 1, and you'd need to print the left half and the right half of that CJK
> character in separate steps.  Also note that another problem with the
> source line quoted above is that nfd+lendiff might point to the middle of a
> UTF-8 character, and computing width from there doesn't make sense.
> Hence the fix would be to abandon the concept of printing new characters
> in two runs, separately for the new columns and separately for overwriting
> the old ones.  The correct behavior can only be implemented if, after
> allocating the proper amount of columns, all the differing characters are
> printed in a single run.  This is what my proposed patch does.
> insert_some_chars() is made simpler: it only inserts new colums, but
> doesn't print any text, that responsibility is given back to its caller
> update_line() which prints the whole stuff at the location where it used to
> print the first part only.  Printing the second run is then simply removed.
> Please see the attached proof-of-concept patch.  I've stress-tested it
> against my actual entries in my real .bash_history, with many mp3 filenames
> containing accented letters (and trailing space due to completion), and it
> seems to work correctly.
> Note that my patch does *not* address the non-UTF-8 branch, I believe I
> don't have the knowledge and understanding of the source to test all
> possible scenarios.  The three "else" branches starting at line 1645 also
> have to be modified to print the whole difference, not just the part
> filling the new columns.  I'm kindly asking you to do this piece of work, I
> guess it should be very easy for you based on my findings, and I'm sure
> you'll want to make cosmetic changes to my code anyway.
> Thanks very much,
> egmont

reply via email to

[Prev in Thread] Current Thread [Next in Thread]