
Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.


From: Alex Shinn
Subject: Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date: Wed, 23 Jan 2013 15:42:19 +0900

On Thu, Jan 17, 2013 at 4:51 AM, Peter Bex <address@hidden> wrote:
> On Tue, Jan 15, 2013 at 02:44:08PM +0900, Alex Shinn wrote:
> > This result looks broken.  As I noted in my previous mail, the URI
> > representation already handles non-ASCII characters and escapes on output:
> >
> > $ csi -R uri-common
> > #;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
> > #<URI-common: scheme="http" port=#f host="127.0.0.1" path=(/ "삼계탕") query=#f fragment=#f>
> > #;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕")))
> > "http://127.0.0.1/82%BCB3%8483%95"
> >
> > Unrelated, the actual escaped output looks buggy - it looks like
> > some characters like the leading "%EC%" are getting dropped.
>
> OK, I took some time to investigate and I pinpointed this problem.
> This appears to happen due to the use of core srfi-14 and srfi-13 in
> uri-generic; its char-set operations simply don't deal with anything
> beyond ASCII.
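
For reference, the three characters in the path are U+C0BC, U+ACC4 and
U+D0D5, whose UTF-8 bytes are EC 82 BC, EA B3 84 and ED 83 95, so the
expected escaped path is /%EC%82%BC%EA%B3%84%ED%83%95; the output quoted
above has lost the lead byte of each character.  Below is a minimal
sketch of byte-wise escaping that avoids char-sets entirely.  It is not
uri-generic's code: pct-encode-bytes and unreserved-byte? are
hypothetical names, and it assumes core strings are byte sequences (as
in CHICKEN 4), so that string-ref returns one UTF-8 byte at a time.

```scheme
;; Sketch only: percent-encode a string one byte at a time.
;; ASSUMPTION: strings are byte sequences, so string-ref yields a
;; single UTF-8 byte (as a char with code 0-255), not a code point.
(define hex-digits "0123456789ABCDEF")

;; Hypothetical helper: is byte b an RFC 3986 "unreserved" character?
(define (unreserved-byte? b)
  (or (<= 48 b 57)                  ; 0-9
      (<= 65 b 90)                  ; A-Z
      (<= 97 b 122)                 ; a-z
      (memv b '(45 46 95 126))))    ; - . _ ~

;; Hypothetical name.  Walks the string from the end so the result
;; list can be built with cons and needs no reversing.
(define (pct-encode-bytes str)
  (let loop ((i (- (string-length str) 1)) (acc '()))
    (if (< i 0)
        (list->string acc)
        (let ((b (char->integer (string-ref str i))))
          (loop (- i 1)
                (if (unreserved-byte? b)
                    (cons (integer->char b) acc)
                    (cons #\%
                          (cons (string-ref hex-digits (quotient b 16))
                                (cons (string-ref hex-digits (remainder b 16))
                                      acc)))))))))
```

For example, (pct-encode-bytes "a b/c") gives "a%20b%2Fc".  Because the
test on each byte is plain integer comparison, bytes above 127 take the
same escaping branch as any other reserved byte and none are skipped.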

As an aside from the uri discussion, we really need to fix srfi-14.

The reference implementation is terrible.  Not only does it not
handle Unicode, but it doesn't not-handle it gracefully:

#;1> (char-set-contains? char-set:full #\x100)
Error: (string-ref) out of range [...]

At a minimum we should avoid these errors, but really we
should be using a Unicode-aware implementation - there's no
barrier to doing so like there is for Unicode strings.  We could
just move utf8-srfi-14 into the core, or I could patch up the
srfi-14 implementation to handle wide chars properly (but maybe
slowly) without bringing in the iset dependency.
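
To illustrate the "at a minimum" option, here is a toy model of a
256-entry table-based char-set with a bounds check on lookup.  It is a
sketch, not the core srfi-14 code: make-toy-set and toy-contains? are
hypothetical names, and a vector of booleans stands in for whatever
table the real implementation uses.

```scheme
;; Toy model of a char-set backed by a fixed 256-entry table.
;; make-toy-set and toy-contains? are hypothetical names.
(define (make-toy-set pred)
  (let ((table (make-vector 256 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 256) table)
      ;; (and x #t) normalizes any true value to #t
      (vector-set! table i (and (pred (integer->char i)) #t)))))

;; Membership test that checks the index before touching the table,
;; so a wide char yields #f instead of an out-of-range error.
(define (toy-contains? table c)
  (let ((i (char->integer c)))
    (and (< i (vector-length table))
         (vector-ref table i))))

(define toy-letters (make-toy-set char-alphabetic?))
(toy-contains? toy-letters #\a)      ; => #t
(toy-contains? toy-letters #\x100)   ; => #f, no error
```

Note that this only avoids the error.  Treating every wide char as a
non-member is still wrong for sets such as char-set:full or any
complement, which is why a Unicode-aware representation is the real fix.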

-- 
Alex
