--- Begin Message ---
Subject: wc: word count incorrect when words separated only by no-break space
Date: Mon, 18 Feb 2019 02:12:15 -0600
$ wc --version
wc (GNU coreutils) 8.29
Packaged by Gentoo (8.29-r1 (p1.0))
The man page for wc states: "A word is a... sequence of characters delimited by
white space."
But its concept of white space seems to include only ASCII white space. U+00A0
NO-BREAK SPACE, for instance, is not recognized.
If your terminal uses UTF-8 encoding:
$ printf 'how are\xC2\xA0you\n'
or if your terminal uses ISO 8859-1 encoding:
$ printf 'how are\xA0you\n'
the visible output of this printf is "how are you". In either case, wc does
not recognize the no-break space between "are" and "you" as white space, so it
reports two words instead of three:
$ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
2
$ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
2
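Until wc itself treats the no-break space as a delimiter, one possible
workaround is to turn it into an ordinary space before counting. This is only
a sketch under stated assumptions: the helper name count_words_nbsp is made up
for illustration, it handles only the UTF-8 encoding of U+00A0 (bytes 0xC2
0xA0), and it uses octal escapes because not every printf accepts \x.

```shell
# Hypothetical helper (not part of coreutils): count the words on stdin,
# treating UTF-8 NO-BREAK SPACE (bytes 0xC2 0xA0) as an ordinary space.
count_words_nbsp() {
    # Build the two-byte sequence with octal escapes, which printf handles portably.
    nbsp=$(printf '\302\240')
    # Work bytewise in the C locale; tr strips the padding some wc versions add.
    LC_ALL=C sed "s/$nbsp/ /g" | LC_ALL=C wc -w | tr -d ' '
}

# Same input as the UTF-8 example in the report.
printf 'how are\302\240you\n' | count_words_nbsp
```

With the input from the report this should print 3, because the no-break
space has already become a plain space by the time wc sees it.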
--- End Message ---
--- Begin Message ---
Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space
Date: Mon, 25 Feb 2019 20:26:55 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 24/02/19 19:55, Pádraig Brady wrote:
> On 24/02/19 17:07, Pádraig Brady wrote:
>> So a non-breaking space is generally considered a word delimiter,
>> though there are complications arising from Unicode, as you detail.
>>
>> In regard to options for enabling various behaviors for wc(1),
>> I'm thinking we might keep the strict POSIX isspace() behavior
>> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
>> by default, since that's the most common operation one would want,
>> and is consistent with LibreOffice, for example.
>> I'll adjust the patch along those lines.
>
> Full patch attached.
Updated patch attached. I'll push in a few hours.
Marking this bug as done.
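The two behaviors discussed above (strict POSIX white space versus also
treating the no-break space as a delimiter) can be approximated outside wc as
a rough sketch. Nothing here is taken from wc-nbsp.patch: the function name
count_words and its "strict" argument are invented for illustration, only the
UTF-8 encoding of U+00A0 is handled, and the strict branch simply runs wc in
the C locale with POSIXLY_CORRECT set.

```shell
# Hypothetical wrapper (not from the patch): pick a word-delimiter policy.
#   count_words strict  -> only what wc accepts as white space in the C locale
#   count_words         -> additionally treat UTF-8 NO-BREAK SPACE as a delimiter
count_words() {
    if [ "${1:-}" = strict ]; then
        # Strict mode: leave the input untouched and ask wc for POSIX behavior.
        POSIXLY_CORRECT=1 LC_ALL=C wc -w | tr -d ' '
    else
        # Default mode: map bytes 0xC2 0xA0 (UTF-8 for U+00A0) to a space first.
        nbsp=$(printf '\302\240')
        LC_ALL=C sed "s/$nbsp/ /g" | LC_ALL=C wc -w | tr -d ' '
    fi
}

printf 'how are\302\240you\n' | count_words strict
printf 'how are\302\240you\n' | count_words
```

With the input from the original report, the strict call is expected to print
2 and the default call 3.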
cheers,
Pádraig.
wc-nbsp.patch
Description: Text Data
--- End Message ---