bug-coreutils



From: Pádraig Brady
Subject: bug#34524: wc: word count incorrect when words separated only by no-break space
Date: Sat, 23 Feb 2019 21:22:51 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 18/02/19 00:12, address@hidden wrote:
> $ wc --version
> wc (GNU coreutils) 8.29
> Packaged by Gentoo (8.29-r1 (p1.0))
> 
> The man page for wc states: "A word is a... sequence of characters delimited 
> by white space."
> 
> But its concept of white space only seems to include ASCII white space.  
> U+00A0 NO-BREAK SPACE, for instance, is not recognized.
> 
> If your terminal displays UTF-8 encoding:
> 
> printf 'how are\xC2\xA0you\n'
> 
> or if your terminal displays ISO 8859-1 encoding:
> 
> printf 'how are\xA0you\n'
> 
> the visible output of this printf is "how are you".  In either case, wc does 
> not recognize the second space as white space, resulting in an incorrect word 
> count:
> 
> $ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
> 2
> $ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
> 2

wc does handle multi-byte locales well, and it uses iswspace()
to decide whether a character is a word separator.
On glibc, however, NBSP is not classed as a space.
I wrote a little program to list what glibc locales consider a space:

0009 HORIZONTAL TAB
000A NEW LINE (not blank)
000B VERTICAL TAB (not blank)
000C FORM FEED (not blank)
000D CARRIAGE RETURN (not blank)
0020 SPACE
1680 OGHAM SPACE MARK
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
2028 LINE SEPARATOR (not blank)
2029 PARAGRAPH SEPARATOR (not blank)
205F MEDIUM MATHEMATICAL SPACE
3000 IDEOGRAPHIC SPACE
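
The program itself is not included in this message. A minimal sketch of how such a listing could be produced is shown here; the helper names collect_wspace and print_wspace are hypothetical, it prints code points only (character names would need a Unicode name table), and the result depends on the active locale and libc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: store every code point in [LO, HI] accepted by
   the current locale's iswspace() into OUT (at most CAP entries) and
   return the total number found.  */
static size_t
collect_wspace (wint_t lo, wint_t hi, wint_t *out, size_t cap)
{
  assert (lo <= hi);
  size_t n = 0;
  for (wint_t wc = lo; ; wc++)
    {
      if (iswspace (wc))
        {
          if (n < cap)
            out[n] = wc;
          n++;
        }
      if (wc == hi)   /* stop here so WC cannot wrap around */
        break;
    }
  return n;
}

/* Hypothetical helper: print the same code points one per line,
   marking those that iswblank() rejects.  */
static void
print_wspace (FILE *fp, wint_t lo, wint_t hi)
{
  assert (lo <= hi);
  for (wint_t wc = lo; ; wc++)
    {
      if (iswspace (wc))
        fprintf (fp, "%04lX%s\n", (unsigned long) wc,
                 iswblank (wc) ? "" : " (not blank)");
      if (wc == hi)
        break;
    }
}
```

To scan all of Unicode under the user's locale, one would call setlocale (LC_ALL, "") first and then print_wspace (stdout, 0, 0x10FFFF). Without the setlocale call the "C" locale applies, where only the six ASCII white space characters are expected.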

In the non-breaking space class we have:

00A0 NO-BREAK SPACE
2007 FIGURE SPACE
202F NARROW NO-BREAK SPACE
2060 WORD JOINER

Maybe we should consider these as word separators?
I pasted `printf '=\u00A0=\u2007=\u202F=\u2060=\n'`
into LibreOffice Writer, and its word count tool treated
all but the last as word separators.

There is some discussion of POSIX and Unicode classes at:
http://unicode.org/L2/L2003/03139-posix-classes.htm

I guess POSIX is defining lower-level functionality,
and iswspace() has to stay compatible with all of its uses,
such as line reformatting. wc(1) is higher level, though,
so perhaps it should treat the non-breaking variants
as word separators?
The following change would do that:

diff --git a/src/wc.c b/src/wc.c
index 179abbe..ca990b4 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -147,6 +147,13 @@ the following order: newline, word, character, byte, maximum line length.\n\
   exit (status);
 }

+static int _GL_ATTRIBUTE_PURE
+iswnbspace (wint_t wc)
+{
+  return  wc == L'\u00A0' || wc == L'\u2007' \
+       || wc == L'\u202F' || wc == L'\u2060';
+}
+
 /* FILE is the name of the file (or NULL for standard input)
    associated with the specified counters.  */
 static void
@@ -455,7 +462,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                           if (width > 0)
                             linepos += width;
                         }
-                      if (iswspace (wide_char))
+                      if (iswspace (wide_char) || iswnbspace (wide_char))
                         goto mb_word_separator;
                       in_word = true;
                     }
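
To see the effect of the extra test outside of wc.c, here is a stand-alone sketch. The names is_nbspace and count_words are hypothetical, the counter works on an in-memory wide string rather than on a file descriptor, and it skips the printable-character check that the real code performs, so it only approximates what wc does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Same four code points as iswnbspace() in the patch, written as
   numeric constants.  */
static bool
is_nbspace (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* Hypothetical counter: count the words in the wide string S.
   When NB_SEPARATES is true, non-breaking spaces also end a word.  */
static size_t
count_words (const wchar_t *s, bool nb_separates)
{
  assert (s != NULL);
  size_t words = 0;
  bool in_word = false;
  for (; *s != L'\0'; s++)
    {
      wint_t wc = (wint_t) *s;
      bool sep = iswspace (wc) || (nb_separates && is_nbspace (wc));
      if (sep)
        in_word = false;
      else if (!in_word)
        {
          /* First non-separator after a separator starts a new word.  */
          in_word = true;
          words++;
        }
    }
  return words;
}
```

With nb_separates set, the reporter's string "how are\u00A0you\n" yields three words. With it cleared the result depends on whether the libc's iswspace() accepts U+00A0, which on glibc it does not.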


Note that general word boundary handling is complicated:
https://www.unicode.org/reports/tr29/#Word_Boundaries
Consider this number with a figure space:
  $ printf "1\u2007234,56\n"
  1 234,56
Ideally that would be counted as one word,
but with the change above it would be counted as two.
For more sophisticated contextual processing we would need
to use some of the word break functionality from libunistring.
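
As a rough illustration of what "contextual" means here, the sketch below keeps a non-breaking space inside a word when it sits directly between two digits. This rule is an assumption made up for the example; it is neither the UAX #29 algorithm nor what libunistring implements, and the names is_nbspace, joins_digits and count_words_contextual are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* The four non-breaking code points discussed in this thread.  */
static bool
is_nbspace (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* Illustrative rule only: true if S[I] lies directly between two
   digits.  S[I] must not be the terminating NUL, so S[I + 1] is
   always readable.  */
static bool
joins_digits (const wchar_t *s, size_t i)
{
  return i > 0
         && iswdigit ((wint_t) s[i - 1])
         && iswdigit ((wint_t) s[i + 1]);
}

/* Count words in S, treating a non-breaking space as a separator
   unless it joins two digits.  */
static size_t
count_words_contextual (const wchar_t *s)
{
  assert (s != NULL);
  size_t words = 0;
  bool in_word = false;
  for (size_t i = 0; s[i] != L'\0'; i++)
    {
      wint_t wc = (wint_t) s[i];
      bool sep = iswspace (wc)
                 || (is_nbspace (wc) && !joins_digits (s, i));
      if (sep)
        in_word = false;
      else if (!in_word)
        {
          in_word = true;
          words++;
        }
    }
  return words;
}
```

Under this rule the number above counts as a single word while "how are\u00A0you" still counts as three. A real implementation would have many more cases to consider, which is the argument for using an existing word break library.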

cheers,
Pádraig




