[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v2 41/60] json: Nicer recovery from invalid lead

From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH v2 41/60] json: Nicer recovery from invalid leading zero
Date: Mon, 20 Aug 2018 13:39:49 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Eric Blake <address@hidden> writes:

> On 08/17/2018 10:05 AM, Markus Armbruster wrote:
>> For input 0123, the lexer produces the tokens
>>      JSON_ERROR    01
>>      JSON_INTEGER  23
>> Reporting an error is correct; 0123 is invalid according to RFC 7159.
>> But the error recovery isn't nice.
>> Make the finite state machine eat digits before going into the error
>> state.  The lexer now produces
>>      JSON_ERROR    0123
>> Signed-off-by: Markus Armbruster <address@hidden>
>> Reviewed-by: Eric Blake <address@hidden>
> Did you also want to reject invalid attempts at hex numbers, by adding
> [xXa-fA-F] to the set of characters eaten by IN_BAD_ZERO?

I put one foot on a slippery slope with this patch...

In review of v1, we discussed whether to try matching non-integer
numbers with redundant leading zero.  Doing that tightly in the lexer
requires duplicating six states.  A simpler alternative is to have the
lexer eat "digit salad" after redundant leading zero: 0[0-9.eE+-]+.
Your suggestion for hexadecimal numbers is digit salad with different
digits: [0-9a-fA-FxX].  Another option is their union: [0-9a-fA-FxX+-].
Even more radical would be eating anything but whitespace and structural
characters: [^][}{:, \t\n\r].  That idea pushed to the limit results in
a two-stage lexer: first stage finds token strings, where a token string
is a structural character or a sequence of non-structural,
non-whitespace characters, second stage rejects invalid token strings.

Hmm, we could try to recover from lexical errors more smartly in
general: instead of ending the JSON error token after the first
offending character, end it before the first whitespace or structural
character following the offending character.

I can try that, but I'd prefer to try it in a follow-up patch.

>>   +    [IN_BAD_ZERO] = {
>> +        ['0' ... '9'] = IN_BAD_ZERO,
>> +    },
>> +

reply via email to

[Prev in Thread] Current Thread [Next in Thread]