Re: [Bug-wget] the libidn problem
From: Ander Juaristi
Subject: Re: [Bug-wget] the libidn problem
Date: Tue, 30 Jun 2015 09:52:29 +0200
User-agent: Thunderbird on Linux
On 06/29/2015 11:14 PM, Daniel Stenberg wrote:
> Hi,
> The libidn issue that was previously reported[1], is still outstanding and
> hasn't been fixed in libidn. This keeps wget vulnerable.
> I've just recommended[2] libcurl users to disable libidn until this gets
> resolved, as it seems it may drag on and keeping vulnerable code around is not
> good.
> [1] = https://lists.gnu.org/archive/html/bug-wget/2015-06/msg00002.html
> [2] = http://curl.haxx.se/mail/lib-2015-06/0143.html
Thanks for the reminder!
I'm usually reluctant to write hacks to work around third-party issues. One
reason is that it is easy to miss corner cases, especially if you're not an
expert in the domain of that third-party code, UTF-8 in this case. And I
myself know nothing about UTF-8 beyond the unavoidable basics. Another reason
is that all the code related to a particular domain (again, UTF-8) should live
in the same place, and that includes checking the input for correctness. This
is related to the first reason: the library user (me, us, in this case)
shouldn't have to know anything about UTF-8, so we should rely on the library
for everything UTF-8-related. Putting 10% of the UTF-8 code on the client side
and 90% on the library side is an inconsistency that is, well, inconsistent
(kudos to the kernel guys :D).
Having said that, the libidn guys really don't seem to be responsive at all.
Someone with a better understanding of what's going on in GNU, please correct
me if I'm wrong. Thus, I think in this case we should write a workaround
ourselves.
To get the discussion rolling, I've decided to dust off my confidence and
propose a simple algorithm (based on
https://en.wikipedia.org/wiki/UTF-8#Description) that should detect invalid
UTF-8 sequences based on the input length.
first_byte = input[0];
/* Parentheses are needed: in C, == binds tighter than &. */
if ((first_byte & 0x80) == 0x00) len = 1;
else if ((first_byte & 0xE0) == 0xC0) len = 2;
else if ((first_byte & 0xF0) == 0xE0) len = 3;
else if ((first_byte & 0xF8) == 0xF0) len = 4;
else if ((first_byte & 0xFC) == 0xF8) len = 5;
else if ((first_byte & 0xFE) == 0xFC) len = 6;
else return false;
/* The sequence must be exactly as long as the first byte announces. */
if (input_len != len)
  return false;
/* Every byte after the first must be a continuation byte (10xxxxxx). */
for (i = 1; i < len; i++)
{
  if ((input[i] & 0xC0) != 0x80)
    return false;
}
return true;
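The check above covers a single encoded character. As a rough sketch only,
the same idea extended to a whole buffer could look like the following. The
function name utf8_buffer_is_valid is hypothetical (it is not part of wget or
libidn), and the sketch assumes the stricter RFC 3629 rules rather than the
older 6-byte scheme, so it rejects 5- and 6-byte forms, overlong encodings,
surrogates and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of wget or libidn.
   Returns true if buf[0..len) is well-formed UTF-8 as defined by RFC 3629. */
bool
utf8_buffer_is_valid (const unsigned char *buf, size_t len)
{
  size_t i = 0;

  while (i < len)
    {
      unsigned char b = buf[i];
      size_t need;        /* continuation bytes announced by the lead byte */
      unsigned long cp;   /* code point decoded so far */
      unsigned long min;  /* smallest code point allowed for this length */

      if ((b & 0x80) == 0x00)      { need = 0; cp = b;        min = 0x0; }
      else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
      else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
      else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
      else
        return false;     /* stray continuation byte, or a 5/6-byte lead */

      /* Truncated sequence: never read past the end of the buffer. */
      if (need > len - i - 1)
        return false;

      for (size_t k = 1; k <= need; k++)
        {
          unsigned char c = buf[i + k];
          if ((c & 0xC0) != 0x80)
            return false; /* not a 10xxxxxx continuation byte */
          cp = (cp << 6) | (c & 0x3F);
        }

      if (cp < min)
        return false;     /* overlong encoding */
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;     /* UTF-16 surrogate */
      if (cp > 0x10FFFF)
        return false;     /* beyond the Unicode range */

      i += need + 1;
    }

  return true;
}
```

The caller passes an explicit length instead of relying on a terminating NUL,
so a lead byte at the very end of the input is reported as invalid rather than
causing a read beyond the buffer.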
Is it ok? Anything I've missed? Thoughts/concerns? Endianness issues?
I won't have time to send a patch myself for the next two weeks or so, so
anyone should feel free to do it.
--
Regards,
- AJ