From: Paul Eggert
Subject: bug#18266: handling bytes not part of the charset, and other garbage
Date: Fri, 12 Sep 2014 17:57:39 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:
> I wonder whether anyone is interested in matching individual bytes in a file regarded as UTF-8 encoded. This seems weird.

It's not weird at all. For example, suppose we invent the notation [[:error:]] to match encoding errors. Then the pattern '[[:error:]]' would match all encoding errors in a file, which could well be a useful thing.
Currently, for example, the tz package <http://www.iana.org/time-zones> has a Make rule 'check_character_set' that verifies that the source files are all properly encoded. It executes this shell command:

    ! grep -nv '^.*$' file names

This relies on GNU grep's behavior that "." does not match an encoding error. But it's a command that is not obvious. It'd be simpler and clearer to write this:

    ! grep -n '[[:error:]]' file names

if such a feature were available.
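
As a rough illustration of what such a check reports, here is a minimal Python sketch that lists the lines of a file containing bytes that do not decode as UTF-8. It is not how grep or the tz Makefile works. The function name lines_with_encoding_errors is made up for this example, and it assumes the files are meant to be UTF-8 and that Python's strict UTF-8 decoder is an acceptable definition of "encoding error".

```python
def lines_with_encoding_errors(path):
    """Yield (line_number, raw_bytes) for each line that is not valid UTF-8.

    Hypothetical helper for illustration only. The file is read as bytes and
    split on newline bytes; that is safe for UTF-8 because the byte 0x0A
    never occurs inside a multibyte sequence.
    """
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                # Strict decoding raises on any malformed byte sequence.
                raw.decode("utf-8")
            except UnicodeDecodeError:
                yield number, raw
```

A caller could print each reported line in path:number form and exit with nonzero status if anything was yielded, which would mirror the way the "!" in the commands above turns "grep found something" into a failed check.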