Re: [PATCH 4/4] dfa: do not match invalid UTF-8

From: Paul Eggert
Subject: Re: [PATCH 4/4] dfa: do not match invalid UTF-8
Date: Wed, 18 Dec 2019 09:06:30 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2

On 12/18/19 12:48 AM, Bruno Haible wrote, regarding my recent Gnulib change
and the corresponding Grep change:

> Do I understand it correctly that, as a consequence of this change,
> 'grep' with a regex of '^.*$' will no longer match lines which contain
> an invalid UTF-8 byte sequence?

Yes and no. dfa.c's '^.*$' already rejected some lines with invalid UTF-8 byte
sequences. The change merely made dfa.c reject all such lines.

>   - Is this effect on 'grep' intended? (And the workaround is to use the
>     "C" locale.)


>   - Is it consistent with the behaviour of regex and kwset, which 'grep'
>     also uses, depending on the arguments (as far as I understand)?

No, in the sense that the matchers disagree about what to do with encoding
errors. I think regex '.' matches the first byte of an encoding error (which
would be hard to mimic in that part of dfa.c as this behavior requires
lookahead). I don't know what kwset does.
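
As a rough illustration of that guessed behavior (a sketch in Python, not
regex or dfa.c code; the function name is made up for this example), a scanner
can consume either one complete UTF-8 character or, failing that, a single
byte of an encoding error. Deciding which case applies means looking at the
following bytes, which is the lookahead mentioned above:

```python
def split_like_regex_dot(data: bytes) -> list[tuple[bytes, bool]]:
    """Split data into the units a '.' might consume one at a time.

    Each unit is (chunk, is_valid): either one complete UTF-8 character,
    or a single byte that begins an encoding error. This is only a model
    of the behavior described in the message, not the real matcher.
    """
    units = []
    i = 0
    while i < len(data):
        # Look ahead up to 4 bytes for one complete UTF-8 character.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            units.append((chunk, True))
            i += len(chunk)
            break
        else:
            # No valid character starts here: consume one byte only.
            units.append((data[i:i + 1], False))
            i += 1
    return units
```

With a truncated three-byte sequence such as b"\xe2\x82", this model yields
two separate one-byte error units rather than one two-byte unit.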

In some sense it doesn't matter, as neither POSIX nor the grep manual says what
to do when the pattern or input contains encoding errors. I installed the patch
because it seemed "wrong" to me that the "." pattern matched an invalid byte
sequence of length 2 or more, with no characters in sight.
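
To make the post-change behavior of '^.*$' concrete, here is a minimal Python
model (the helper name is hypothetical, and it relies on Python's strict UTF-8
decoder, which may not agree with dfa.c on every edge case): a line is
accepted only if every byte belongs to a valid UTF-8 character.

```python
def dot_star_accepts(line: bytes) -> bool:
    """Model of '^.*$' after the change: accept only valid UTF-8 lines.

    Assumption: Python's strict decoder stands in for dfa.c's notion of
    a valid UTF-8 byte sequence. The line is given without its newline.
    """
    try:
        line.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Under this model, b"caf\xc3\xa9" is accepted, while b"caf\xc3" (a lead byte
with no continuation byte) is rejected.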

Conversely, I suppose if the change significantly hurts performance, then it
should be reverted (but with a comment explaining why dfa.c accepts more than
just the valid UTF-8 byte sequences) or perhaps redone in a better way.

I am cc'ing this to address@hidden to give 'grep' lurkers a heads-up
about this.
