bug-coreutils

bug#29606: Command 'fold' dangerous with utf-8 input


From: Pádraig Brady
Subject: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Sat, 9 Dec 2017 15:50:36 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 08/12/17 19:15, Assaf Gordon wrote:
> Hello Mark,
> 
> First,
> thank you for taking the time and effort
> to test our development snapshot, and reporting results back.
> This kind of feedback is critical in getting multibyte support ready.
> 
> 
> Second,
> I can confirm the behavior you are observing, reproduced here
> with 'od' for easier output:
> 
> ## POSIX single-byte locale:
> 
> $ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
>   303  \n 237  \n
> $ echo "ß" | LC_ALL=C src/fold         --width 1 | od -tc -An
>   303  \n 237  \n
> 
> ## UTF8 locale:
> 
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
>   303 237  \n
> 
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold         --width 1 | od -tc -An
>   303 237  \n
> 
> 
> On 2017-12-08 05:04 AM, Mark Roberts wrote:
>> When --bytes is not specified, the program treats '\b', '\r' and '\t' 
>> specially. It assumes a tab width of eight (compile-time #define) and 
>> attempts to keep track of what the output will look like.
>>
>> This is absolutely not what I expected.
> 
> That is correct, and I share your sentiment: it also took me some time
> to try and track down why it behaves this way, and whether it's by 
> design or a bug.
> 
>> But of course, when the program 
>> was first written, the words byte and character meant the same thing for 
>> printable characters. Printable bytes.
> 
> The reasoning for this behavior is explained in the OpenGroup's POSIX 
> standard page for fold, in the "RATIONALE" section:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18
> 
> There, it is made clear:
>    "Historical versions of the fold utility assumed 1 byte was one
>    character and occupied one column position when written out. This is
>    no longer always true.
>    [....]
>    Note that although the width for the -b option is in bytes, a line is
>    never split in the middle of a character."
> 
> Therefore, the current implementation (of the development version) is 
> correct.
> 
>> I will attempt to suggest an improved text for the man-page so that 
>> others will not be surprised.
> 
> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.
> 
> If you find the time to send such a patch - great!
> If not, I will add it sooner or later (hopefully sooner).
> 
> As such I'm closing this bug report, but further discussion (and
> patches) are welcomed by replying to this thread.

Note that while splitting in the middle of a character is incorrect,
that doesn't preclude approximate byte counting with --bytes.
This is the approach the current i18n patch takes:

$ export LC_ALL=en_CA.UTF-8
$ echo "ßß" | fold-i18n --bytes --width 1 | od -tc -An
 303 237  \n 303 237  \n  \n
$ echo "ßß" | fold-i18n --bytes --width 2 | od -tc -An
 303 237  \n 303 237  \n  \n
$ echo "ßß" | fold-assaf --bytes --width 2 | od -tc -An
 303 237 303 237  \n
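
To make the difference between the two counting modes concrete, here is a
minimal Python sketch. It is not taken from either patch: the function name
fold_line and its parameters are hypothetical, and it only illustrates the
idea of charging each character its encoded byte length (or 1 in character
mode) while breaking only on character boundaries. It does not try to
reproduce the extra empty line seen in the fold-i18n output above.

```python
def fold_line(line, width, count_bytes, encoding="utf-8"):
    """Split one line (without its newline) into pieces of at most `width` units.

    A unit is one encoded byte when count_bytes is true, otherwise one
    character. A character is never split: if it alone costs more than
    `width`, it is placed on a piece by itself (the "approximate" part).
    Hypothetical helper for illustration only.
    """
    if width < 1:
        raise ValueError("width must be positive")
    pieces = []
    current = ""   # characters collected for the piece being built
    used = 0       # units consumed so far in the current piece
    for ch in line:
        cost = len(ch.encode(encoding)) if count_bytes else 1
        # Start a new piece only if the current one is non-empty and
        # adding this character would exceed the width.
        if current and used + cost > width:
            pieces.append(current)
            current, used = "", 0
        current += ch
        used += cost
    # Emit the last piece; an empty input line yields one empty piece.
    if current or not pieces:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    for width in (1, 2, 4):
        print(width, fold_line("ßß", width, count_bytes=True),
              fold_line("ßß", width, count_bytes=False))
```

With byte counting, "ßß" is broken after each ß at widths 1 and 2, since
each ß costs two bytes in UTF-8. With character counting, both fit on one
line at width 2.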

The i18n version of fold also has a --characters option
to operate in the current fold-assaf mode.
I'm not convinced we want to differ from the i18n patch,
in this regard at least.

cheers,
Pádraig.
