[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors
From: |
Markus Armbruster |
Subject: |
Re: [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors |
Date: |
Tue, 28 Aug 2018 06:35:12 +0200 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) |
Eric Blake <address@hidden> writes:
> On 08/27/2018 02:00 AM, Markus Armbruster wrote:
>> When the lexer chokes on an input character, it consumes the
>> character, emits a JSON error token, and enters its start state. This
>> can lead to suboptimal error recovery. For instance, input
>>
>> 0123 ,
>>
>> produces the tokens
>>
>> JSON_ERROR 01
>> JSON_INTEGER 23
>> JSON_COMMA ,
>>
>> Make the lexer skip characters after a lexical error until a
>> structural character ('[', ']', '{', '}', ':', ','), an ASCII control
>> character, or '\xFE', or '\xFF'.
>>
>> Note that we must not skip ASCII control characters, '\xFE', '\xFF',
>> because those are documented to force the JSON parser into known-good
>> state, by docs/interop/qmp-spec.txt.
>>
>> The lexer now produces
>>
>> JSON_ERROR 01
>> JSON_COMMA ,
>
> So the lexer has now completely skipped the intermediate input, but
> the resulting error message need only point at the start of where
> input went wrong, and skipping to a sane point results in fewer error
> tokens to be reported. Makes sense.
Exactly.
We could emit a recovery token to let json_message_process_token()
report what we skipped, but I don't think it's worth the trouble.
>> Update qmp-test for the nicer error recovery: QMP now report just one
>
> s/report/reports/
Will fix.
>> error for input %p instead of two. Also drop the newline after %p; it
>> was needed to tease out the second error.
>
> That's because pre-patch, 'p' is one of the input characters that
> requires lookahead to determine if it forms a complete token (and the
> newline provides the transition needed to consume it); now post-patch,
> the 'p' is consumed as part of the junk after the error is first
> detected at the '%'.
Exactly.
> And to my earlier complaint about 0x1a resulting in JSON_ERROR then
> JSON_INTEGER then JSON_KEYWORD, that sequence is likewise now
> identified as a single JSON_ERROR at the 'x', with the rest of the
> attempted hex number (invalid in JSON) silently skipped. Nice.
Correct.
>>
>> Signed-off-by: Markus Armbruster <address@hidden>
>> ---
>> qobject/json-lexer.c | 43 +++++++++++++++++++++++++++++--------------
>> tests/qmp-test.c | 6 +-----
>> 2 files changed, 30 insertions(+), 19 deletions(-)
>>
>> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
>> index 28582e17d9..39c7ce7adc 100644
>> --- a/qobject/json-lexer.c
>> +++ b/qobject/json-lexer.c
>> @@ -101,6 +101,7 @@
>> enum json_lexer_state {
>> IN_ERROR = 0, /* must really be 0, see json_lexer[] */
>> + IN_RECOVERY,
>> IN_DQ_STRING_ESCAPE,
>> IN_DQ_STRING,
>> IN_SQ_STRING_ESCAPE,
>> @@ -130,6 +131,28 @@ QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
>> static const uint8_t json_lexer[][256] = {
>> /* Relies on default initialization to IN_ERROR! */
>> + /* error recovery */
>> + [IN_RECOVERY] = {
>> + /*
>> + * Skip characters until a structural character, an ASCII
>> + * control character other than '\t', or impossible UTF-8
>> + * bytes '\xFE', '\xFF'. Structural characters and line
>> + * endings are promising resynchronization points. Clients
>> + * may use the others to force the JSON parser into known-good
>> + * state; see docs/interop/qmp-spec.txt.
>> + */
>> + [0 ... 0x1F] = IN_START | LOOKAHEAD,
>
> And here's where the LOOKAHEAD bit has to be separate, because you are
> now assigning semantics to the transition on '\0' that are distinct
> from either of the two roles previously enumerated as possible.
I could do
TERMINAL(IN_START)
[0x20 ... 0xFD] = IN_RECOVERY,
['\t'] = IN_RECOVERY,
but it would be awful :)
>> + [0x20 ... 0xFD] = IN_RECOVERY,
>> + [0xFE ... 0xFF] = IN_START | LOOKAHEAD,
>> + ['\t'] = IN_RECOVERY,
>> + ['['] = IN_START | LOOKAHEAD,
>> + [']'] = IN_START | LOOKAHEAD,
>> + ['{'] = IN_START | LOOKAHEAD,
>> + ['}'] = IN_START | LOOKAHEAD,
>> + [':'] = IN_START | LOOKAHEAD,
>> + [','] = IN_START | LOOKAHEAD,
>> + },
>
> It took me a couple of reads before I was satisfied that everything is
> initialized as desired (range assignments followed by more-specific
> re-assignment works, but isn't common), but this looks right.
>
> Reviewed-by: Eric Blake <address@hidden>
Thanks!
- [Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F', (continued)
- [Qemu-devel] [PATCH 1/6] json: Fix lexer for lookahead character beyond '\x7F', Markus Armbruster, 2018/08/27
- [Qemu-devel] [PATCH 2/6] json: Clean up how lexer consumes "end of input", Markus Armbruster, 2018/08/27
- [Qemu-devel] [PATCH 3/6] json: Make lexer's "character consumed" logic less confusing, Markus Armbruster, 2018/08/27
- [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors, Markus Armbruster, 2018/08/27
- [Qemu-devel] [PATCH 5/6] json: Eliminate lexer state IN_ERROR, Markus Armbruster, 2018/08/27
[Qemu-devel] [PATCH 6/6] json: Eliminate lexer state IN_WHITESPACE, pseudo-token JSON_SKIP, Markus Armbruster, 2018/08/27