help-libidn

Re: stringprep() doesn't match documentation


From: dclarke blastwave.org
Subject: Re: stringprep() doesn't match documentation
Date: Wed, 26 Nov 2014 04:44:16 -0500 (EST)

<snip>
> > This means that, if an attacker is able to inject invalid UTF-8 into
> > the input
> > buffer used for stringprep(), the lack of error checking by
> > stringprep_utf8_to_ucs4() can be used to skip over the actual
> > terminating
> > NULL-byte, causing the stringprep call to read memory past the buffer
> > it was
> > supposed to not read outside of. Sure, this is the application's
> > fault for not
> > properly verifying the input is UTF-8, but the mismatch between the
> > documentation and the function makes this worse.
> 
> If the input string is valid UTF-8, I believe there is no problem.  Do
> you agree?
> 
> Applications should not pass unvalidated strings to stringprep(), it
> must be checked to be valid UTF-8 first.  If stringprep() receives
> non-UTF8 inputs, I believe there are other similar serious things that
> can happen.
> 
> Quoting the docstring:
> 
>  * Prepare the input zero terminated UTF-8 string according to the
>  * stringprep profile, and write back the result to the input string.
> 
> Admittedly, the library could 1) provide functions for checking
> whether a string is valid UTF-8, and/or 2) actually validate that
> inputs are UTF-8 before using them.  The latter may cause a small
> performance penalty, but probably not that much.  Further thinking or
> suggestions in this direction are welcome.

*whoa*

Even superficial checking for valid UTF-8 would require looking at
expected bit patterns, such as the leading bits that mark a byte as the
start or a continuation of a multi-byte character, and then further
checks to catch byte sequences that decode to nonsense such as null
chars or other such issues. I wrote a UTF-8 check routine on a recent
project, and it required a fair amount of thought and was not a perfect
check algorithm by any means.  Sequences such as the byte pair 0xC0 0x80
( the overlong encoding of NUL mentioned in section 10 of RFC 3629 )
would be reasonable to check for, but a complete and stringent check
for compliance could be a fair chunk of work.
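For what it is worth, here is a minimal sketch ( in C, names my own
invention, not from libidn ) of the kind of strict check being
discussed: walk the buffer, verify the lead/continuation bit shapes,
and also reject overlong forms such as C0 80, UTF-16 surrogates and
out-of-range code points:

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte offset of the first invalid sequence, or -1 when the
 * whole buffer is well-formed UTF-8.  Rejects stray continuation
 * bytes, truncated sequences, overlong forms ( e.g. C0 80 for NUL ),
 * UTF-16 surrogates and code points above U+10FFFF. */
long utf8_first_bad(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n, k;          /* expected sequence length, loop index */
        unsigned long cp;     /* decoded code point                   */
        unsigned long min;    /* smallest code point legal at length n */

        if (b < 0x80) { i++; continue; }           /* plain ASCII */
        else if (b < 0xC2) return (long)i;         /* cont. byte or C0/C1 lead */
        else if (b < 0xE0) { n = 2; cp = b & 0x1F; min = 0x80;    }
        else if (b < 0xF0) { n = 3; cp = b & 0x0F; min = 0x800;   }
        else if (b < 0xF5) { n = 4; cp = b & 0x07; min = 0x10000; }
        else return (long)i;                       /* F5..FF never valid */

        if (i + n > len) return (long)i;           /* truncated at end */
        for (k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return (long)i;              /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;
        if (cp > 0x10FFFF) return (long)i;
        i += n;
    }
    return -1;
}
```

On the 14-byte test case further down this routine returns -1, and on
a buffer starting with C0 80 it returns offset 0.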

In my own work I did write code that would check for reasonable bits in
up to 4-byte characters and could catch and remove damaged characters.
From my comment blocks ( lots of verbose detail ) I have things such as:

    /*
     * valid four byte UTF-8 characters have bit patterns like
     *
     *     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     *
     * An example thus :
     *
     *      F0        A4        AD        A2
     *
     *      11110000  10100100  10101101  10100010
     *
     * Therefore a valid test case would be :
     *
     *   14 bytes thus  e2 82 ac 43 69 61 c3 bc F0 A4 AD A2 32 35
     *
     * Another possible and reasonable character is U+20FB4
     * where we have 20FB4 = 0010 0000 1111 1011 0100 binary.
     *
     * This complex Asian character may be seen in the Unicode
     * documentation "CJK Unified Ideographs Extension B,
     * Range: 20000–2A6D6"
     *
     * This encodes in UTF8 thus :
     *
     *     break this into groups of six bits for the last bytes
     *     and a leading three bits in the front :
     *
     *           ...    ......    ......    ......
     *           000    100000    111110    110100
     *
     *     then add in the UTF8 regular pattern bits for a four
     *     byte char :
     *
     *           ...    ......    ......    ......
     *      11110000  10100000  10111110  10110100
     *
     *     Now convert to hex :
     *
     *         F   0     A   0     B   E     B   4
     *      11110000  10100000  10111110  10110100
     *
     *
     *     Result is the four byte UTF-8 char F0 A0 BE B4
     *
     * So a valid test case would be :
     *
     *  e2 82 ac 43 69 61 c3 bc F0 A4 AD A2 32 35 F0 A0 BE B4
     *
     */
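The shift-and-mask walk described in that comment could be written out
as a few lines of C ( a hypothetical helper of my own, not part of any
library ):

```c
/* Encode a code point from the U+10000..U+10FFFF range into the
 * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx shape shown in the comment
 * above.  Returns 0 on success, -1 when cp does not take four bytes. */
int utf8_encode4(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;
    out[0] = (unsigned char)(0xF0 |  (cp >> 18));         /* 11110xxx */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F)); /* 10xxxxxx */
    out[2] = (unsigned char)(0x80 | ((cp >>  6) & 0x3F)); /* 10xxxxxx */
    out[3] = (unsigned char)(0x80 | ( cp        & 0x3F)); /* 10xxxxxx */
    return 0;
}
```

Feeding it U+20FB4 reproduces the F0 A0 BE B4 bytes worked out by
hand above.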

I would then modify the bits in various positions and test for
valid UTF-8, and either silently catch and correct the damage ( remove
the invalid byte sequence and signal a "catch" ) or outright reject
the entire input buffer. After some thought, and testing with
even simple strings such as :

    /* Here we shall damage the upper two bits of the last byte
     * in the euro character.
     *
     *   Bérénice needs 5€ for Caffè
     *
     *   42 c3 a9 72 c3 a9 6e 69 63 65 20 6e 65 65 64 73
     *   20 35 e2 82 0c 20 66 6f 72 20 43 61 66 66 c3 a8
     *               ^^
     */

It became clearer to me that a stringent test for valid UTF-8 would
require real work and careful testing.  Simply checking for valid bit
structures would not be enough. There are also byte order mark issues
to consider.

I did try testing with some common Mandarin strings and was rapidly
overwhelmed. However, I went with the assumption that valid bit
patterns should be enough.  Consider :

 input : ">The Chinese Mandarin word for "family members" is 好看"

3E 54 68 65 20 43 68 69 6E 65 73 65 20 4D 61 6E  >The Chinese Man
64 61 72 69 6E 20 77 6F 72 64 20 66 6F 72 20 22  darin word for "
66 61 6D 69 6C 79 20 6D 65 6D 62 65 72 73 22 20  family members"
69 73 20 E5 A5 BD E7 9C 8B         is ......

Here we see the bytes E5 A5 BD E7 9C 8B result in a valid character
pair, but this does not mean that every well-formed three byte
sequence is a valid Mandarin character, or even an assigned character
at all.

Those six bytes are the two chars :

    U+597D    CJK UNIFIED IDEOGRAPH

        hex 0xE5 binary 11100101 first byte of a 3 byte sequence
        hex 0xA5 binary 10100101 continuation byte 1
        hex 0xBD binary 10111101 continuation byte 2.

    U+770B    CJK UNIFIED IDEOGRAPH

        hex 0xE7 binary 11100111 first byte of a 3 byte sequence
        hex 0x9C binary 10011100 continuation byte 1
        hex 0x8B binary 10001011 continuation byte 2.
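Going the other way, a small helper ( again a sketch of my own, not a
library routine ) decodes one three byte sequence back to its code
point, so the two values above can be checked mechanically:

```c
/* Decode one 3 byte sequence ( 1110xxxx 10xxxxxx 10xxxxxx ) back to
 * its code point; returns -1 when the bytes do not have that shape. */
long utf8_decode3(const unsigned char b[3])
{
    if ((b[0] & 0xF0) != 0xE0) return -1;   /* not a 3 byte lead   */
    if ((b[1] & 0xC0) != 0x80) return -1;   /* not a continuation  */
    if ((b[2] & 0xC0) != 0x80) return -1;   /* not a continuation  */
    return ((long)(b[0] & 0x0F) << 12)
         | ((long)(b[1] & 0x3F) <<  6)
         |  (long)(b[2] & 0x3F);
}
```

E5 A5 BD comes back as 0x597D and E7 9C 8B as 0x770B, matching the
breakdown above.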

In any case I certainly feel it would be great to have a routine which
catches any invalid UTF-8 and possibly even repairs a byte stream by
removing broken sequences, returning various codes to indicate the
nature of the "breakage".  My work focused on catching superficial bit
damage and then allowing reasonable insert strings for a database. I
was able to scan for what seemed reasonable across a set of languages,
but I don't think a simple bit pattern scanner would suffice as a
stringent routine for checking an input buffer.
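As a rough illustration of the repair-and-signal idea ( my own naming,
and deliberately only the "simple bit pattern scanner" kind of check,
with no overlong detection ):

```c
#include <assert.h>
#include <stddef.h>

/* Copy src into dst ( at least len bytes ), dropping any byte that
 * cannot start or continue a sequence of the expected bit shape.
 * Returns the number of bytes kept and counts removals in *dropped so
 * the caller can signal a "catch".  A bit pattern scanner only: it
 * does not reject overlong forms, surrogates or unassigned points. */
size_t utf8_scrub(const unsigned char *src, size_t len,
                  unsigned char *dst, size_t *dropped)
{
    size_t i = 0, o = 0;

    *dropped = 0;
    while (i < len) {
        unsigned char b = src[i];
        size_t n = 0, k;

        if (b < 0x80)                n = 1;   /* ASCII          */
        else if ((b & 0xE0) == 0xC0) n = 2;   /* 110xxxxx lead  */
        else if ((b & 0xF0) == 0xE0) n = 3;   /* 1110xxxx lead  */
        else if ((b & 0xF8) == 0xF0) n = 4;   /* 11110xxx lead  */

        if (n != 0 && i + n <= len) {
            for (k = 1; k < n; k++)
                if ((src[i + k] & 0xC0) != 0x80)
                    break;
            if (k == n) {                     /* whole sequence ok */
                for (k = 0; k < n; k++)
                    dst[o++] = src[i + k];
                i += n;
                continue;
            }
        }
        ++*dropped;                           /* drop one byte, resync */
        i++;
    }
    return o;
}
```

Run over the damaged "Bérénice" buffer above, it drops the orphaned
E2 and 82 bytes and reports two removals.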

Dennis Clarke
ps: sorry for jumping in but the topic is near to my heart


