Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.

From: Alex Shinn
Date: Wed, 23 Jan 2013 15:42:19 +0900

On Thu, Jan 17, 2013 at 4:51 AM, Peter Bex <address@hidden> wrote:

On Tue, Jan 15, 2013 at 02:44:08PM +0900, Alex Shinn wrote:
> This result looks broken. As I noted in my previous mail, the URI
> representation already handles non-ASCII characters and escapes on output:
>
> $ csi -R uri-common
> #;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
> #<URI-common: scheme="http" port=#f host="127.0.0.1" path=(/ "삼계탕") query=#f fragment=#f>
> #;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕")))
> "http://127.0.0.1/82%BCB3%8483%95"
>

> Unrelated, the actual escaped output looks buggy - it looks like
> some characters like the leading "%EC%" are getting dropped.

OK, I took some time to investigate and I pinpointed this problem.
This appears to happen due to the use of core srfi-14 and srfi-13 in
uri-generic; its char-set operations simply don't deal with anything
beyond ASCII.
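
For reference, the UTF-8 encoding of "삼계탕" is the nine bytes EC 82 BC, EA B3 84, ED 83 95, so a correctly escaped path would read "%EC%82%BC%EA%B3%84%ED%83%95"; the output above is missing the lead byte of each character. The following is only a rough sketch of byte-wise escaping, not the uri-generic code. It assumes strings are plain byte sequences (core CHICKEN without the utf8 egg), and the names percent-encode, percent-encode-byte and unreserved-byte? are made up for illustration.

```scheme
;; Sketch only: byte-wise percent-encoding with no char-set lookups.
;; Assumes each char in the string is one byte (code 0..255).

(define hex-digits "0123456789ABCDEF")

;; Format one byte as "%XX" with uppercase hex digits.
(define (percent-encode-byte b)
  (string #\%
          (string-ref hex-digits (quotient b 16))
          (string-ref hex-digits (remainder b 16))))

;; RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
(define (unreserved-byte? b)
  (or (<= 65 b 90)                      ; A-Z
      (<= 97 b 122)                     ; a-z
      (<= 48 b 57)                      ; 0-9
      (memv b '(45 46 95 126))))        ; - . _ ~

;; Escape every byte that is not unreserved; non-ASCII bytes are
;; always escaped because they never pass unreserved-byte?.
(define (percent-encode str)
  (apply string-append
         (map (lambda (c)
                (let ((b (char->integer c)))
                  (if (unreserved-byte? b)
                      (string c)
                      (percent-encode-byte b))))
              (string->list str))))
```

Because the predicate works on byte values rather than on srfi-14 char-sets, it cannot hit the out-of-range lookup, at the cost of hard-coding the unreserved set.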

As an aside from the uri discussion, we really need to fix srfi-14.

The reference implementation is terrible. Not only does it not
handle Unicode, but it doesn't not-handle it gracefully:

#;1> (char-set-contains? char-set:full #\x100)
Error: (string-ref) out of range [...]

At a minimum we should avoid these errors, but really we
should be using a Unicode-aware implementation - there's no
barrier to doing so like there is for Unicode strings. We could
just move utf8-srfi-14 into the core, or I could patch up the
srfi-14 implementation to handle wide chars properly (but maybe
slowly) without bringing in the iset dependency.
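
To make the "at a minimum" option concrete, here is a rough sketch of a guard a caller could use in the meantime. It is not a proposed patch to srfi-14: ascii-char-set-contains? is a hypothetical name, and the guard simply treats every non-ASCII char as a non-member. That is the wrong answer for sets like char-set:full, but it is what URI escaping wants, since the URI character classes are all ASCII.

```scheme
(use srfi-14)  ; CHICKEN 4 syntax; on CHICKEN 5 this would be (import srfi-14)

;; Hypothetical helper, not part of srfi-14 or uri-generic.
;; Only consults the char-set for chars in the ASCII range, so the
;; reference implementation's table lookup can never go out of range.
(define (ascii-char-set-contains? cs c)
  (and (< (char->integer c) 128)
       (char-set-contains? cs c)))
```

A proper fix would still be a Unicode-aware char-set representation; this only avoids the error.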

Alex