qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 21/56] json: Reject invalid UTF-8 sequences


From: Eric Blake
Subject: Re: [Qemu-devel] [PATCH 21/56] json: Reject invalid UTF-8 sequences
Date: Thu, 9 Aug 2018 17:16:48 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

On 08/08/2018 07:02 AM, Markus Armbruster wrote:
We reject bytes that can't occur in valid UTF-8 (\xC0..\xC1,
\xF5..\xFF in the lexer.  That's insufficient; there's plenty of
invalid UTF-8 not containing these bytes, as demonstrated by
check-qjson:

* Malformed sequences

   - Unexpected continuation bytes

   - Missing continuation bytes after start bytes other than
     \xC0..\xC1, \xF5..\xFD.

* Overlong sequences with start bytes other than \xC0..\xC1,
   \xF5..\xFD.

* Invalid code points

Fixing this in the lexer would be bothersome.  Fixing it in the parser
is straightforward, so do that.

Signed-off-by: Markus Armbruster <address@hidden>
---

@@ -193,12 +198,15 @@ static QString 
*qstring_from_escaped_str(JSONParserContext *ctxt,
                  goto out;
              }
          } else {
-            char dummy[2];
-
-            dummy[0] = *ptr;
-            dummy[1] = 0;
-
-            qstring_append(str, dummy);
+            cp = mod_utf8_codepoint(ptr, 6, &end);

Why are you hard-coding 6 here, rather than computing min(6, strchr(ptr,0)-ptr)? If the user passes an invalid sequence at the end of the string, can we end up making mod_utf8_codepoint() read beyond the end of our string? Would it be better to just always pass the remaining string length (mod_utf8_codepoint() only cares about stopping short of 6 bytes, but never reads beyond there even if you pass a larger number)?

+            if (cp <= 0) {
+                parse_error(ctxt, token, "invalid UTF-8 sequence in string");
+                goto out;
+            }
+            ptr = end - 1;
+            len = mod_utf8_encode(utf8_buf, sizeof(utf8_buf), cp);
+            assert(len >= 0);
+            qstring_append(str, utf8_buf);
          }
      }

+++ b/util/unicode.c
@@ -13,6 +13,21 @@
  #include "qemu/osdep.h"
  #include "qemu/unicode.h"

+ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint)
+{
+    assert(bufsz >= 5);
+
+    if (!is_valid_codepoint(codepoint)) {
+        return -1;
+    }
+
+    if (codepoint > 0 && codepoint <= 0x7F) {
+        buf[0] = codepoint & 0x7F;

Dead use of binary &. But acceptable for symmetry with the other code branches.

+        buf[1] = 0;
+        return 1;
+    }
+    if (codepoint <= 0x7FF) {
+        buf[0] = 0xC0 | ((codepoint >> 6) & 0x1F);
+        buf[1] = 0x80 | (codepoint & 0x3F);
+        buf[2] = 0;
+        return 2;
+    }
+    if (codepoint <= 0xFFFF) {
+        buf[0] = 0xE0 | ((codepoint >> 12) & 0x0F);
+        buf[1] = 0x80 | ((codepoint >> 6) & 0x3F);
+        buf[2] = 0x80 | (codepoint & 0x3F);
+        buf[3] = 0;
+        return 3;
+    }
+    buf[0] = 0xF0 | ((codepoint >> 18) & 0x07);
+    buf[1] = 0x80 | ((codepoint >> 12) & 0x3F);
+    buf[2] = 0x80 | ((codepoint >> 6) & 0x3F);
+    buf[3] = 0x80 | (codepoint & 0x3F);
+    buf[4] = 0;
+    return 4;
+}


Overall, looks nice.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



reply via email to

[Prev in Thread] Current Thread [Next in Thread]