[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 1/2] util: Refactor json-writer's string sanitizer to be publ
From: |
Richard W.M. Jones |
Subject: |
Re: [PATCH 1/2] util: Refactor json-writer's string sanitizer to be public |
Date: |
Fri, 2 Aug 2024 22:00:27 +0100 |
User-agent: |
Mutt/1.5.21 (2010-09-15) |
On Fri, Aug 02, 2024 at 02:26:05PM -0500, Eric Blake wrote:
> My next patch needs to convert text from an untrusted input into an
> output representation that is suitable for display on a terminal is
> useful to more than just the json-writer; the text should normally be
> UTF-8, but blindly allowing all Unicode code points (including ASCII
> ESC) through to a terminal risks remote-code-execution attacks on some
> terminals. Extract the existing body of json-writer's quoted_strinto
> a new helper routine mod_utf8_sanitize, and generalize it to also work
> on data that is length-limited rather than NUL-terminated. [I was
> actually surprised that glib does not have such a sanitizer already -
> Google turns up lots of examples of rolling your own string
> sanitizer.]
>
> If desired in the future, we may want to tweak whether the output is
> guaranteed to be ASCII (using lots of \u escape sequences, including
> surrogate pairs for code points outside the BMP) or if we are okay
> passing printable Unicode through (we still need to escape control
> characters). But for now, I went for minimal code churn, including
> the fact that the resulting function allows a non-UTF-8 2-byte synonym
> for U+0000.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
I have to admit I'd never heard of "modified UTF-8" before, but it's a
thing:
https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
As the patch is almost a simple code motion:
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
Rich.
> ---
> include/qemu/unicode.h | 3 ++
> qobject/json-writer.c | 47 +----------------------
> util/unicode.c | 84 ++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 88 insertions(+), 46 deletions(-)
>
> diff --git a/include/qemu/unicode.h b/include/qemu/unicode.h
> index 7fa10b8e604..e1013b29f72 100644
> --- a/include/qemu/unicode.h
> +++ b/include/qemu/unicode.h
> @@ -4,4 +4,7 @@
> int mod_utf8_codepoint(const char *s, size_t n, char **end);
> ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint);
>
> +void mod_utf8_sanitize(GString *buf, const char *str);
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len);
> +
> #endif
> diff --git a/qobject/json-writer.c b/qobject/json-writer.c
> index 309a31d57a9..5c14574efee 100644
> --- a/qobject/json-writer.c
> +++ b/qobject/json-writer.c
> @@ -104,53 +104,8 @@ static void pretty_newline_or_space(JSONWriter *writer)
>
> static void quoted_str(JSONWriter *writer, const char *str)
> {
> - const char *ptr;
> - char *end;
> - int cp;
> -
> g_string_append_c(writer->contents, '"');
> -
> - for (ptr = str; *ptr; ptr = end) {
> - cp = mod_utf8_codepoint(ptr, 6, &end);
> - switch (cp) {
> - case '\"':
> - g_string_append(writer->contents, "\\\"");
> - break;
> - case '\\':
> - g_string_append(writer->contents, "\\\\");
> - break;
> - case '\b':
> - g_string_append(writer->contents, "\\b");
> - break;
> - case '\f':
> - g_string_append(writer->contents, "\\f");
> - break;
> - case '\n':
> - g_string_append(writer->contents, "\\n");
> - break;
> - case '\r':
> - g_string_append(writer->contents, "\\r");
> - break;
> - case '\t':
> - g_string_append(writer->contents, "\\t");
> - break;
> - default:
> - if (cp < 0) {
> - cp = 0xFFFD; /* replacement character */
> - }
> - if (cp > 0xFFFF) {
> - /* beyond BMP; need a surrogate pair */
> - g_string_append_printf(writer->contents, "\\u%04X\\u%04X",
> - 0xD800 + ((cp - 0x10000) >> 10),
> - 0xDC00 + ((cp - 0x10000) & 0x3FF));
> - } else if (cp < 0x20 || cp >= 0x7F) {
> - g_string_append_printf(writer->contents, "\\u%04X", cp);
> - } else {
> - g_string_append_c(writer->contents, cp);
> - }
> - }
> - };
> -
> + mod_utf8_sanitize(writer->contents, str);
> g_string_append_c(writer->contents, '"');
> }
>
> diff --git a/util/unicode.c b/util/unicode.c
> index 8580bc598b3..a419ed4de35 100644
> --- a/util/unicode.c
> +++ b/util/unicode.c
> @@ -154,3 +154,87 @@ ssize_t mod_utf8_encode(char buf[], size_t bufsz, int
> codepoint)
> buf[4] = 0;
> return 4;
> }
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + *
> + * Append the contents of the NUL-terminated Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII
> + * and for quote characters. The result is therefore safe for output
> + * to a terminal.
> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */
> +void mod_utf8_sanitize(GString *buf, const char *str)
> +{
> + mod_utf8_sanitize_len(buf, str, -1);
> +}
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + * @len: Length of @str, or negative to stop at NUL terminator
> + *
> + * Append the contents of @len bytes of the Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII
> + * and for quote characters. The result is therefore safe for output
> + * to a terminal.
> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len)
> +{
> + const char *ptr;
> + char *end;
> + int cp;
> +
> + if (len < 0) {
> + len = strlen(str);
> + }
> +
> + for (ptr = str; *ptr; ptr = end) {
> + cp = mod_utf8_codepoint(ptr, MIN(6, str + len - ptr), &end);
> + switch (cp) {
> + case '\"':
> + g_string_append(buf, "\\\"");
> + break;
> + case '\\':
> + g_string_append(buf, "\\\\");
> + break;
> + case '\b':
> + g_string_append(buf, "\\b");
> + break;
> + case '\f':
> + g_string_append(buf, "\\f");
> + break;
> + case '\n':
> + g_string_append(buf, "\\n");
> + break;
> + case '\r':
> + g_string_append(buf, "\\r");
> + break;
> + case '\t':
> + g_string_append(buf, "\\t");
> + break;
> + default:
> + if (cp < 0) {
> + cp = 0xFFFD; /* replacement character */
> + }
> + if (cp > 0xFFFF) {
> + /* beyond BMP; need a surrogate pair */
> + g_string_append_printf(buf, "\\u%04X\\u%04X",
> + 0xD800 + ((cp - 0x10000) >> 10),
> + 0xDC00 + ((cp - 0x10000) & 0x3FF));
> + } else if (cp < 0x20 || cp >= 0x7F) {
> + g_string_append_printf(buf, "\\u%04X", cp);
> + } else {
> + g_string_append_c(buf, cp);
> + }
> + }
> + }
> +}
> --
> 2.45.2
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v