Re: [Nmh-workers] nmh-1.7-RC1: scan with complex subjects dumps core


From: Ralph Corderoy
Subject: Re: [Nmh-workers] nmh-1.7-RC1: scan with complex subjects dumps core
Date: Tue, 08 Aug 2017 17:07:51 +0100

Hi David,

> > Sunglasses have a width of 1 here, that's why David and I don't see
> > the problem.
>
> I'm surprised that I didn't see the same behavior as Norm, because we
> use the same locale, en_US.utf8.  Any idea why?

I'm en_GB.utf8, but I don't see it either.  The locale's name isn't
what matters; it's the wcwidth(3) answer for a codepoint, and as Unicode
continues to add POO WITH SUNGLASSES, so the answers change with the
version of one's system's locale database.

    $ test/getcwidth --ctype | grep 1f576
     1f576   1  address@hidden

So here it's `print', `graph', and `punct', with a width of 1.  Norm's
gang have a width of -1 as they haven't the foggiest what it is.
http://unicode.org/cldr/utility/character.jsp?a=1f576&B1=Show says its
East Asian width is `Neutral', which is treated as `Narrow', so
getcwidth reporting 1 matches.

Nearby is http://unicode.org/cldr/utility/character.jsp?a=1f57a&B1=Show
that says it's `Wide', but the database here doesn't know anything about
that one yet, thankfully.

    $ test/getcwidth --ctype | grep 1f57a
     1f57a  -1  ------------
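
For comparison, the East Asian Width property can also be read from the
Unicode data bundled with Python, independent of the locale and of
wcwidth(3).  This is only a sketch: eaw_columns() is a made-up helper,
its wide/narrow mapping is a simplification rather than what the C
library does, and the answers depend on the interpreter's Unicode version
(unicodedata.unidata_version; Unicode 9.0 or later is assumed for U+1F57A).

```python
import unicodedata

def eaw_columns(cp):
    """Guess a column width from East Asian Width alone (a simplification)."""
    eaw = unicodedata.east_asian_width(chr(cp))
    # 'W' (Wide) and 'F' (Fullwidth) are taken as two columns; everything
    # else, including 'N' (Neutral), is treated as narrow.
    return eaw, 2 if eaw in ("W", "F") else 1

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in (0x1F576, 0x1F57A):
        eaw, cols = eaw_columns(cp)
        print(f"{cp:x} {eaw} {cols}")
```

A mismatch between this and the local getcwidth output would suggest the
system's locale data is simply older than the interpreter's tables.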

One can poke about the local definitions.

    $ test/getcwidth --ctype | awk '{print $2}' |
    > sort -n | uniq -c
      57249 -1
       1723 0
      29884 1
      95464 2
    $
    $ test/getcwidth --ctype | awk '{print $3}' |
    > LC_ALL=C sort | uniq -c
      57183 ------------
         14 -p--------sb
      15563 address@hidden
         10 -pg---dxN---
     107528 -pga----N---
       2167 -pga-l--N---
          6 -pga-l-xN---
       1772 -pgau---N---
          6 -pgau--xN---
          4 -pgaul--N---
         60 c-----------
          6 c---------s-
          1 c---------sb
    $ 

That says there are four runes that are both upper and lower!

    $ printf '%b\n' $(test/getcwidth --ctype |
    > awk '$3 ~ /ul/ {print "\\u" $1}')
    Dž
    Lj
    Nj
    Dz
    $
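
Those look like the Latin titlecase digraphs, which Unicode puts in
general category Lt; each has both an uppercase and a lowercase mapping,
which may be why the locale flags them as upper and lower at once.  A
small sketch to check that with Python's unicodedata, assuming the four
are U+01C5, U+01C8, U+01CB, and U+01F2; describe() is a made-up helper.

```python
import unicodedata

# Assumed codepoints for the four runes printed above.
DIGRAPHS = (0x01C5, 0x01C8, 0x01CB, 0x01F2)

def describe(cp):
    """Return (general category, uppercase form, lowercase form)."""
    ch = chr(cp)
    return unicodedata.category(ch), ch.upper(), ch.lower()

if __name__ == "__main__":
    for cp in DIGRAPHS:
        cat, up, lo = describe(cp)
        print(f"U+{cp:04X} {chr(cp)} {cat} upper={up} lower={lo}")
```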

And here's the first printable zero-width.

    $ test/getcwidth --ctype | grep -m1 ' 0 .*p'
        ad   0  address@hidden

U+00AD is soft hyphen.  Unicode is said to be an ISO 8859-1 superset,
and 0xAD was soft hyphen in that too, but visible, with a width of 1.
ISO used it at the end of the line to show a word had been broken, but
not by the author, allowing it to be stripped on re-formatting.  Unicode
changed that.  For them, it's a hint from the author to the renderer
that here's a potential point to break the word, thus, when rendered,
it's not visible and has zero width.  Toc toc toc!
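
The superset claim and the character's current classification are easy
to check from Python.  A sketch, where latin1_to_unicode() is a made-up
helper: it decodes byte 0xAD as ISO 8859-1 and looks up the resulting
codepoint's name and general category, with Cf meaning a format
character rather than punctuation.

```python
import unicodedata

def latin1_to_unicode(byte):
    """Decode a single ISO 8859-1 byte to its Unicode character."""
    return bytes([byte]).decode("iso-8859-1")

if __name__ == "__main__":
    shy = latin1_to_unicode(0xAD)
    print(f"U+{ord(shy):04X}", unicodedata.name(shy), unicodedata.category(shy))
```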

Terminals get this wrong.  libvte-based terminals here think it has
width.

    $ s="$(printf '\uad')"
    $ scan -format "_%4(lit foo)_\n_%4(lit £)_\n_%4(lit $s)_" .
    _foo _
    _£   _
    _­    _   [Rune after first _ isn't a space.]
    $

Dickey's venerable xterm(1) does better.

    $ s="$(printf '\uad')"
    $ scan -format "_%4(lit foo)_\n_%4(lit £)_\n_%4(lit $s)_" .
    _foo _
    _£   _
    _    _   [All four are spaces.]
    $

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy


