bug-wget

Re: [Bug-wget] the libidn problem


From: Ander Juaristi
Subject: Re: [Bug-wget] the libidn problem
Date: Tue, 30 Jun 2015 09:52:29 +0200
User-agent: Thunderbird on Linux

On 06/29/2015 11:14 PM, Daniel Stenberg wrote:
> Hi,
>
> The libidn issue that was previously reported[1], is still outstanding and
> hasn't been fixed in libidn. This keeps wget vulnerable.
>
> I've just recommended[2] libcurl users to disable libidn until this gets
> resolved, as it seems it may drag on and keeping vulnerable code around is
> not good.
>
> [1] = https://lists.gnu.org/archive/html/bug-wget/2015-06/msg00002.html
> [2] = http://curl.haxx.se/mail/lib-2015-06/0143.html


Thanks for the reminder!

I'm usually reluctant to write hacks to work around third-party issues. One
reason is that it is easy to miss corner cases, especially if you are not an
expert in the domain of that third-party code, UTF-8 in this case. I myself
know little about UTF-8 beyond the unavoidable basics. Another reason is that
all the code related to a particular domain (again, UTF-8) should be grouped
together in one place, and that includes checking the input for correctness.
This is related to the first reason: the library user (us, in this case)
shouldn't have to know anything about UTF-8, so we should rely on the library
for everything UTF-8-related. Putting 10% of the UTF-8 code on the client side
and 90% on the library side is an inconsistency that is, well, inconsistent
(kudos to the kernel guys :D).

Having said that, it really doesn't look like the libidn guys are responsive
at all. Someone with a better understanding of what's going on in GNU, please
correct me if I'm wrong. Thus, I think in this case we should write a
workaround ourselves.

To get the discussion rolling, I've decided to dust off my confidence and
propose a simple algorithm (based on
https://en.wikipedia.org/wiki/UTF-8#Description) that should detect invalid
UTF-8 sequences based on the input length.

first_byte = input[0];

/* '==' binds tighter than '&' in C, so the mask needs its own parentheses */
if ((first_byte & 0x80) == 0x00) len = 1;
else if ((first_byte & 0xE0) == 0xC0) len = 2;
else if ((first_byte & 0xF0) == 0xE0) len = 3;
else if ((first_byte & 0xF8) == 0xF0) len = 4;
else if ((first_byte & 0xFC) == 0xF8) len = 5;
else if ((first_byte & 0xFE) == 0xFC) len = 6;
else return false;

/* the input must be exactly as long as the first byte announces */
if (input_len != len) return false;

/* every byte after the first must be a continuation byte (10xxxxxx) */
for (i = 1; i < len; i++)
{
  if ((input[i] & 0xC0) != 0x80) return false;
}

return true;
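
For reference, the same idea can be written as compilable C. This is only a
minimal sketch, not a patch for wget: the function name utf8_is_valid is
hypothetical, and it makes assumptions the pseudocode above does not. It walks
a whole buffer rather than a single sequence, accepts only the 1- to 4-byte
forms, and additionally rejects overlong encodings, surrogates and code points
above U+10FFFF. It works one byte at a time, so byte order should not matter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of wget or libidn.
   Returns true if the n bytes at s form well-formed UTF-8. */
bool
utf8_is_valid (const unsigned char *s, size_t n)
{
  size_t i = 0;

  while (i < n)
    {
      unsigned char b = s[i];
      unsigned long cp;   /* decoded code point */
      size_t len, k;

      /* The lead byte announces the sequence length. */
      if ((b & 0x80) == 0x00)      { len = 1; cp = b; }
      else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
      else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
      else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
      else
        return false;   /* stray continuation byte, or a 5/6-byte lead */

      if (len > n - i)
        return false;   /* sequence is cut off by the end of the buffer */

      /* Every following byte must look like 10xxxxxx. */
      for (k = 1; k < len; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return false;
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates and out-of-range values. */
      if ((len == 2 && cp < 0x80)
          || (len == 3 && cp < 0x800)
          || (len == 4 && cp < 0x10000))
        return false;
      if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

      i += len;
    }

  return true;
}
```

A caller could run this on the host name before handing it to the IDN
conversion routine and refuse the input when it returns false.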


Is it ok? Anything I've missed? Thoughts/concerns? Endianness issues?

I won't have time to send a patch myself for the next two weeks or so, so
anyone who wants to should feel free to do it.

--
Regards,
- AJ


