Re: wc: expand help of '-L' (and a question)
From: Stephane Chazelas
Subject: Re: wc: expand help of '-L' (and a question)
Date: Wed, 13 May 2015 13:01:12 +0100
User-agent: Mutt/1.5.21 (2010-09-15)
2015-05-13 03:00:48 +0100, Pádraig Brady:
[...]
> Yes. You could filter with sed to adjust:
>
> sed 's/././g' | wc -L # count chars
> LC_ALL=C sed 's/././g' | wc -L # count bytes
[...]
Note that Unicode code points D800 to DFFF (reserved for UTF-16
encoding) and 110000 to 7FFFFFFF (now that they've given up on
ever having anything above 10FFFF) are not characters.
Still, GNU sed considers their UTF-8 encodings (as per the
original UTF-8 encoding, before it got limited to 4 bytes)
as characters.
$ printf '\ud800\udfff\U110000\U7fffffff\n' | sed s/././g | wc -L
4
(I'm not sure I'd object to that though).
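For reference, here is a rough sketch of the bytes involved, written out
by hand with octal escapes so it does not depend on the shell's printf
understanding \u/\U. The byte values are my own working from the
original (up to 6-byte) UTF-8 scheme, and emit_sample is just a
hypothetical helper name for this illustration:

```shell
# Hand-encoded byte sequences (octal escapes, which POSIX printf accepts):
#   U+D800     -> ED A0 80          (3 bytes)
#   U+DFFF     -> ED BF BF          (3 bytes)
#   U+110000   -> F4 90 80 80       (4 bytes)
#   U+7FFFFFFF -> FD BF BF BF BF BF (6 bytes)
emit_sample() {
  printf '\355\240\200\355\277\277\364\220\200\200\375\277\277\277\277\277\n'
}

# Byte count, independent of locale: 3 + 3 + 4 + 6 + 1 (newline) = 17.
# tr strips the padding that some wc implementations add.
emit_sample | wc -c | tr -d ' '

# Hex dump, to check the bytes by eye.
emit_sample | od -An -tx1
```

Piping emit_sample through sed 's/././g' | wc -L in a UTF-8 locale should
be comparable to the command above, though what it reports will depend on
the sed and libc in use.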
Other byte sequences that don't form valid characters are not
considered characters:
$ printf '\x80\xff' | sed s/././g | wc -L
0
--
Stephane