bug-coreutils
bug#17196: UTF-8 printf string formating problem


From: Rich Felker
Subject: bug#17196: UTF-8 printf string formating problem
Date: Thu, 10 Apr 2014 03:56:10 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
> Eric Blake <address@hidden> wrote:
>  |>>   Dan Douglas wrote:
>  |>>> ksh93 already has this feature using the "L" modifier:
>  |>>> 
>  |>>> ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>  |>>> ★★★
>  |>>
>  |>> At least there is prior art for it.
>  |> 
>  |> So we can count bytes, chars or cells (graphemes).
>  |> 
>  |> Thinking a bit more about it, I think shell level printf
>  |> should be dealing in text of the current encoding and counting cells.
>  |> In the edge case where you want to deal in bytes one can do:
>  |>   LC_ALL=C printf ...
>  |> 
>  |> I see that ksh behaves as I would expect and counts cells,
>  |> though requires the explicit %L enabler:
>  |>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★★
>  |>   $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>  |>   A★
>  |>   $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>  |>   A
>  |> 
>  |> zsh seems to just count characters:
>  |>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>  |>   á★
>  |>   $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>  |>   A★★
>  |> 
>  |> I see that dash gives invalid directive for any of %ls %Ls %S.
>  |> 
>  |> Pity there is no consensus here.
>  |> Personally I would go for:
>  |>   printf '%3s' 'blah'  # count cells
>  |>   printf '%3Ls' 'blah' # count chars
>  |>   LANG=C printf '%3Ls' 'blah' # count bytes
>  |>   LANG=C printf '%3s' 'blah'  # count bytes
>  |
>  |Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
>  |and currently states that %Ls is undefined.  But I would LOVE to have a
>  |standardized spelling for counting characters instead of bytes.  The
>  |extension %Ls looks like a good candidate for standardization, precisely
>  |because counting characters when printing a multibyte string is more
>  |useful than counting bytes (you do NOT want to end in the middle of a
>  |multibyte character), and because ksh offers it as existing practice.
>  |
>  |Your idea for counting "cells" (by which I'm assuming you mean one or
>  |more characters that all display within the same cell of the terminal,
>  |as if the end user saw only one grapheme), on the other hand, does not
>  |seem to have any precedent, and I would strongly object to having %s
>  |count by cells because %s already has a standardized (if unfortunate)
>  |meaning of counting by bytes.  Maybe yet another extension is warranted
>  |(perhaps %LLs?) as a new notion for counting by cells instead of
>  |characters, but it's harder to justify that without existing practice.
> 
> I see you are trying to invent the word character for code points
> and reserve the term "grapheme" for user-perceived characters.
> This goes in line with the GNU library which has the existing
> practice to let wcwidth(3) return the value 1 for accents and
> other combining code points as well as so-called (Unicode)
> noncharacters.  And who would call wcwidth(3) on something that is
> not to be drawn onto the screen directly afterwards.  And, of
> course, which terminal will perform the composition of code points
> written via STD I/O to characters on its own.
> I think for quite a while it is up to the input methods to combine
> into something precomposed in order to let POSIX programs finally
> work with it.

Many languages do not have precomposed forms for all the character
sequences they need. For some it would not even be practical to
provide them, and requiring precomposed input would force the use of
complex input methods instead of simple keyboard maps.
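
A rough way to see this, sketched in Python rather than shell so that
the result does not depend on the locale: NFC normalization folds a
base letter plus combining marks into a single code point only when
Unicode happens to define a precomposed form. The helper names
is_precomposable and counts are made up for this illustration, and the
cell count is only an approximation (combining marks take no cell,
East Asian wide/fullwidth characters take two); it is not what
wcwidth(3) or any particular shell actually computes.

```python
import unicodedata


def is_precomposable(seq: str) -> bool:
    """True if NFC folds the whole sequence into one code point."""
    return len(unicodedata.normalize("NFC", seq)) == 1


def counts(s: str) -> tuple:
    """Return (UTF-8 bytes, code points, approximate terminal cells)."""
    cells = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # a combining mark shares the cell of its base
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        cells += 2 if wide else 1
    return len(s.encode("utf-8")), len(s), cells


if __name__ == "__main__":
    samples = [
        "a\u0301",        # a + acute: a precomposed form exists
        "q\u0301",        # q + acute: no precomposed form
        "e\u0323\u0301",  # e + dot below + acute: only partly composable
    ]
    for s in samples:
        print(repr(s), is_precomposable(s), counts(s))
```

The last sample still needs a combining mark after normalization, so
any tool that counts code points, or that assumes one code point per
cell, will miscount it however the text was typed.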

Rich




