
Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.


From: Ivan Raikov
Subject: Re: [Chicken-users] [Q] uri-common has problem with UTF-8 uri.
Date: Tue, 15 Jan 2013 15:14:30 +0900

Oops, the second example should have been

For the string "삼계탕" the octets are EC 82 BC EA B3 84 ED 83 95 and (utf8-string->uri "http://example.com/삼계탕") produces

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "%EC%82%BC%EA%B3%84%ED%83%95") query=#f fragment=#f)

Sorry about the confusion.

  Ivan




On Tue, Jan 15, 2013 at 3:03 PM, Ivan Raikov <address@hidden> wrote:

Hi Alex,

    I understand your point about make-uri, but I want to provide a URI constructor that takes a UTF-8 input string and maps it in accordance with RFC 3986 / 3987.
So we still have to perform the path and percent-encoding normalization steps for the ASCII portions of the string. make-uri makes no attempt at such normalization and so does not strictly follow RFC 3986.
I interpreted Section 3.1 of RFC 3987 to mean that UTF-8 characters are encoded by taking each octet and applying percent-encoding to it.

So for the string "пиле" the octets are D0 BF D0 B8 D0 BB D0 B5 and (utf8-string->uri "http://example.com/пиле") produces

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "%D0%BF%D0%B8%D0%BB%D0%B5") query=#f fragment=#f)
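The octet-by-octet reading of RFC 3987 Section 3.1 described above can be checked independently of uri-common. The following is only a minimal sketch in Python, not the egg's implementation; the helper names `utf8_octets` and `pct_encode_utf8` are hypothetical, and the sketch escapes every octet, which is only appropriate for the non-ASCII part of a path segment.

```python
def utf8_octets(text: str) -> str:
    """Return the UTF-8 octets of `text` as space-separated uppercase hex."""
    return " ".join(f"{octet:02X}" for octet in text.encode("utf-8"))


def pct_encode_utf8(text: str) -> str:
    """Percent-encode every octet of the UTF-8 encoding of `text`.

    Assumption: this escapes all octets, including ASCII ones, so it is
    meant only for the non-ASCII characters discussed in this thread.
    """
    return "".join(f"%{octet:02X}" for octet in text.encode("utf-8"))


print(utf8_octets("пиле"))        # D0 BF D0 B8 D0 BB D0 B5
print(pct_encode_utf8("пиле"))    # %D0%BF%D0%B8%D0%BB%D0%B5
print(utf8_octets("삼계탕"))      # EC 82 BC EA B3 84 ED 83 95
print(pct_encode_utf8("삼계탕"))  # %EC%82%BC%EA%B3%84%ED%83%95
```

No bitmask is needed at this stage: the octets that UTF-8 produces are escaped as they are.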

For the string "삼계탕" the octets are EC 82 BC EA B3 84 ED 83 95 and (utf8-string->uri "http://example.com/삼계탕") produces

#(URI scheme=http authority=#(URIAuth host="example.com" port=#f) path=(/ "%D0%BF%D0%B8%D0%BB%D0%B5") query=#f fragment=#f)


Can you elaborate on what is broken about this? Perhaps I do not understand UTF-8 and need to apply a bitmask or something to the octets?

Percent-encoded sequences of more than one octet will not get touched by pct-decode in the current implementation, so you will not get double escaping. Percent-encoded sequences of one octet will get decoded if they fall in the "unreserved" char-set, as per RFC 3986.
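The normalization rule described here (RFC 3986, Section 6.2.2.2) can be sketched as follows. This is an illustration in Python under stated assumptions, not the actual pct-decode from the egg; the name `normalize_pct` is hypothetical. An escape is decoded only when its octet is an unreserved ASCII character; every other escape is kept, with its hex digits uppercased.

```python
import re

# Unreserved characters per RFC 3986, Section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def normalize_pct(text: str) -> str:
    """Decode escapes of unreserved characters; leave all others escaped."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Keep the escape, normalizing the hex digits to uppercase.
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


print(normalize_pct("%7Euser"))                   # ~user
print(normalize_pct("%D0%BF%D0%B8%D0%BB%D0%B5"))  # %D0%BF%D0%B8%D0%BB%D0%B5
print(normalize_pct("a%2fb"))                     # a%2Fb
```

In this sketch multi-octet UTF-8 sequences survive untouched because every octet in them is 0x80 or above, and no such octet is an unreserved character.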

  Ivan



This result looks broken.  As I noted in my previous mail, the URI representation
already handles non-ASCII characters and escapes on output:

$ csi -R uri-common
#;1> (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕"))
#<URI-common: scheme="http" port=#f host="127.0.0.1" path=(/ "삼계탕") query=#f fragment=#f>
#;2> (uri->string (make-uri scheme: "http" host: "127.0.0.1" path: '(/ "삼계탕")))

If you put percent escapes _inside_ the internal path representation,
you'll get double escaping.
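The double-escaping concern can be illustrated outside of uri-common. The sketch below uses Python's standard `urllib.parse.quote` as a stand-in for "escape on output"; it assumes nothing about how uri->string is implemented. If the stored path segment already contains percent escapes, a serializer that escapes "%" turns each of them into "%25...".

```python
from urllib.parse import quote, unquote

segment = "삼계탕"

# Escaping the raw segment once, as a serializer would do on output.
escaped_once = quote(segment, safe="")
print(escaped_once)   # %EC%82%BC%EA%B3%84%ED%83%95

# Escaping an already-escaped segment: every "%" becomes "%25".
escaped_twice = quote(escaped_once, safe="")
print(escaped_twice)  # %25EC%2582%25BC%25EA%25B3%2584%25ED%2583%2595

# One round of decoding no longer recovers the original text.
print(unquote(escaped_twice) == segment)  # False
```

Whether this happens in practice depends on whether the serializer leaves existing escapes alone.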

Parsing is a separate matter, and utf8-string->uri should return
the URI object without error, but with the unescaped values in
the path and query as resulting from the make-uri above.

Unrelated, the actual escaped output looks buggy - it looks like
some characters like the leading "%EC%" are getting dropped.

-- 
Alex



