bug-grep

Re: removing blank lines: "grep ." is really slow


From: Paul Eggert
Subject: Re: removing blank lines: "grep ." is really slow
Date: Fri, 23 Apr 2010 13:51:37 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)

Paolo Bonzini <address@hidden> writes:

> On 04/18/2010 06:32 AM, Ivan wrote:
>> So... right now, "." means "valid UTF-8 character"? Or not?
>
> Yes, if your locale is UTF-8.

Wouldn't it be better to model encoding errors as characters?  That is,
if grep sees a byte that cannot possibly be the start of a character, we
call it a "character" even though it is not in the standard Unicode
character set.  Internally, we could model it as (say) a negative
number, the negative of the byte value (so it would be in the range -255
.. -128).
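
A minimal Python sketch of this modeling, for illustration only: it is not grep's code, and the function name decode_modeling_errors is hypothetical. It assumes the input is meant to be UTF-8, and it simplifies slightly by treating any byte that does not begin a valid UTF-8 sequence at its position (including a truncated lead byte) as an encoding error.

```python
def decode_modeling_errors(data: bytes) -> list[int]:
    """Decode UTF-8 into integers, one per "character".

    Valid characters become their code point (>= 0). A byte that does
    not begin a valid UTF-8 sequence becomes the negative of its byte
    value, so such entries fall in the range -255 .. -128.
    """
    out = []
    i = 0
    while i < len(data):
        # A UTF-8 character occupies 1 to 4 bytes; try each length.
        for n in range(1, 5):
            chunk = data[i:i + n]
            try:
                ch = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            out.append(ord(ch))
            i += n
            break
        else:
            # No valid character starts here: model the byte itself
            # as a "character" with a negative value.
            out.append(-data[i])
            i += 1
    return out
```

With this sketch, the bytes "a", 0xFF, "b" decode to three entries, the middle one negative, so every byte of input belongs to exactly one "character".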

Under this approach, the regular expression "." will match all nonempty
lines, which is what most users expect.  The current approach, where "."
matches only lines that contain at least one valid UTF-8 character, is
not nearly as useful or intuitive.
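
The difference can be sketched in Python as an analogy (again, not grep's implementation; both function names are hypothetical). The "current" behavior is approximated by dropping undecodable bytes before matching. The "proposed" behavior uses Python's "surrogateescape" error handler, which keeps each undecodable byte as a stand-in code point (a lone surrogate rather than a negative number), so "." can match it.

```python
import re


def nonblank_lines_current(data: bytes) -> list[bytes]:
    """Approximate the current rule: "." needs a valid UTF-8 character."""
    return [
        line
        for line in data.split(b"\n")
        # errors="ignore" discards invalid bytes, so a line made only of
        # encoding errors decodes to "" and "." finds nothing to match.
        if re.search(".", line.decode("utf-8", "ignore"))
    ]


def nonblank_lines_proposed(data: bytes) -> list[bytes]:
    """Approximate the proposal: encoding errors count as characters."""
    return [
        line
        for line in data.split(b"\n")
        # surrogateescape maps each invalid byte to one code point,
        # so any nonempty line contains something for "." to match.
        if re.search(".", line.decode("utf-8", "surrogateescape"))
    ]


if __name__ == "__main__":
    sample = b"abc\n\n\xff\xfe\n\xc3\xa9\n"
    print(nonblank_lines_current(sample))
    print(nonblank_lines_proposed(sample))
```

On the sample input, the line consisting only of the bytes 0xFF 0xFE is dropped by the first function and kept by the second, while the empty line is dropped by both.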

This modeling could be applied consistently to both regular expressions
and input.  It's very easy to explain: surely it's much easier than
whatever the current rules are.



