Re: [Qemu-devel] [PATCH] check-qjson: More thorough testing of UTF-8 in
From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH] check-qjson: More thorough testing of UTF-8 in strings
Date: Mon, 04 Feb 2013 19:09:13 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.1 (gnu/linux)
Paolo Bonzini <address@hidden> writes:
> On 04/02/2013 18:19, Markus Armbruster wrote:
>> + /* 2 Boundary condition test cases */
>> + /* 2.1 First possible sequence of a certain length */
>> + /* 2.1.5 5 bytes U+200000 */
>> + {
>> + "\"\xF8\x88\x80\x80\x80\"",
>> + NULL, /* bug: rejected */
>> + "\"\\u8200\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + "\xF8\x88\x80\x80\x80",
>> + },
>> + /* 2.1.6 6 bytes U+4000000 */
>> + {
>> + "\"\xFC\x84\x80\x80\x80\x80\"",
>> + NULL, /* bug: rejected */
>> + "\"\\uC100\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + "\xFC\x84\x80\x80\x80\x80",
>> + },
>> + },
>> + /* 2.2.4 4 bytes U+1FFFFF */
>> + {
>> + "\"\xF7\xBF\xBF\xBF\"",
>> + NULL, /* bug: rejected */
>> + "\"\\u7FFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + "\xF7\xBF\xBF\xBF",
>> + },
>> + /* 2.2.5 5 bytes U+3FFFFFF */
>> + {
>> + "\"\xFB\xBF\xBF\xBF\xBF\"",
>> + NULL, /* bug: rejected */
>> + "\"\\uBFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + "\xFB\xBF\xBF\xBF\xBF",
>> + },
>> + /* 2.2.6 6 bytes U+7FFFFFFF */
>> + {
>> + "\"\xFD\xBF\xBF\xBF\xBF\xBF\"",
>> + NULL, /* bug: rejected */
>> + "\"\\uDFFF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + "\xFD\xBF\xBF\xBF\xBF\xBF",
>> + },
>> + {
>> + /* \U+1FFFFF */
>> + "\"\xF8\x87\xBF\xBF\xBF\"",
>> + NULL, /* bug: rejected */
>> + "\"\\u81FF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + "\xF8\x87\xBF\xBF\xBF",
>> + },
>> + {
>> + /* \U+3FFFFFF */
>> + "\"\xFC\x83\xBF\xBF\xBF\xBF\"",
>> + NULL, /* bug: rejected */
>> + "\"\\uC0FF\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + "\xFC\x83\xBF\xBF\xBF\xBF",
>> + },
>> + {
>> + /* \U+0000 */
>> + "\"\xF8\x80\x80\x80\x80\"",
>> + NULL, /* bug: rejected */
>> + "\"\\u8000\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */
>> + "\xF8\x80\x80\x80\x80",
>> + },
>> + {
>> + /* \U+0000 */
>> + "\"\xFC\x80\x80\x80\x80\x80\"",
>> + NULL, /* bug: rejected */
>> + "\"\\uC000\\uFFFF\\uFFFF\\uFFFF\"", /* bug: want "\"\\u0000\"" */
>> + "\xFC\x80\x80\x80\x80\x80",
>> + },
>
> Rejecting these is not a bug IMO. Unicode is only defined up to
> U+10FFFF. Codepoints above are not valid UTF-8 at all, and in
> particular 5/6-byte sequences are never valid UTF-8 (they used to be).
See explanation of bug markers above:
+ * - bug: rejected
+ * JSON parser rejects invalid sequence(s)
+ * We may choose to define this as feature
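For reference, the rule Paolo describes (nothing above U+10FFFF, no 5- or 6-byte sequences) matches RFC 3629. Below is a minimal sketch of a decoder that enforces it. The name utf8_decode_strict and its signature are hypothetical and are not taken from the QEMU JSON parser; the sketch only illustrates which of the quoted inputs a strict decoder would refuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not QEMU code: decode one UTF-8 sequence from
 * s[0..len-1] following RFC 3629.  On success, return the code point
 * and store the number of bytes consumed in *used.  Return -1 if the
 * sequence is invalid.
 */
static int32_t utf8_decode_strict(const unsigned char *s, size_t len,
                                  size_t *used)
{
    /* Smallest code point that legitimately needs n bytes */
    static const int32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    int32_t cp;
    size_t n, i;

    assert(s != NULL && used != NULL);
    if (len == 0) {
        return -1;
    }
    if (s[0] < 0x80) {
        n = 1;
        cp = s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2;
        cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3;
        cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4;
        cp = s[0] & 0x07;
    } else {
        /* Stray continuation byte, 5/6-byte lead (0xF8..0xFD), 0xFE, 0xFF */
        return -1;
    }
    if (len < n) {
        return -1;              /* truncated sequence */
    }
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return -1;          /* not a continuation byte */
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_cp[n]) {
        return -1;              /* overlong encoding */
    }
    if (cp > 0x10FFFF) {
        return -1;              /* beyond the Unicode range */
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return -1;              /* surrogates are not valid in UTF-8 */
    }
    *used = n;
    return cp;
}
```

Under these assumptions, the 4-byte inputs for U+10000 and U+10FFFF decode normally, while U+110000, U+1FFFFF, and every 5- or 6-byte input from the quoted tests come back as -1.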
> But there are indeed other bugs...
>
>> + /* 2.1.4 4 bytes U+10000 */
>> + {
>> + "\"\xF0\x90\x80\x80\"",
>> + "\xF0\x90\x80\x80",
>> + "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
>> + },
>> + /* U+10FFFF */
>> + "\"\xF4\x8F\xBF\xBF\"",
>> + "\xF4\x8F\xBF\xBF",
>> + "\"\\u43FF\\uFFFF\"", /* bug: want "\"\\uDBFF\\uDFFF\"" */
>> + },
>> + {
>> + /* U+110000 */
>> + "\"\xF4\x90\x80\x80\"",
>> + "\xF4\x90\x80\x80",
>> + "\"\\u4400\\uFFFF\"", /* bug: want "\"\\uFFFF\"" */
>> + },
>
> ...and also some good catches here! In particular U+110000 should be
> rejected.
Thanks!