From: Andy Bennett
Subject: Re: [Chicken-users] html->sxml (html-parser egg) does not decode entities in html attributes, ide as why?
Date: Thu, 08 May 2014 22:44:10 +0100
User-agent: Trojita/0.4.1; Qt/4.8.2; X11; Linux; Debian GNU/Linux 7.4 (wheezy)


Thanks for your email.

I'm somewhat confused by what you say. From my investigation, it seems html->sxml will decode entities so long as they aren't within an HTML element attribute. Could you clarify whether that default applies globally or just to attributes?

Yes, sorry, I misread my own code :)

The default is to _decode_ entities:

#;1> (html->sxml "&quot;")
(*TOP* "\"")

And as you say, it currently doesn't process entities inside attributes:

#;2> (html->sxml "<div data-foo=\"&quot;\">")
(*TOP* (div (@ (data-foo "&quot;"))))

I'll fix this.

Thanks for this, Alex, and sorry for taking so long to come back to you.

When Philip first reported this we were running html-parser 0.5.0 on CHICKEN 4.7.0. We're currently upgrading to CHICKEN 4.9.0 and we were trying the latest html-parser, version 0.5.2. Unfortunately we've had a couple of problems: one with empty attributes and another that seems a bit more sinister.

html-parser 0.5.0 works on both 4.7.0 and 4.9.0.
html-parser 0.5.1 and 0.5.2 don't work on either 4.7.0 or 4.9.0, so I've isolated the problem to changes introduced in 0.5.1.

Empty attributes now seem to decode to the string "()".

When &quot; is decoded inside an attribute, data from earlier in the stream seems to be inserted into the attribute value:

(define empty "<div data=\"\">empty</div>")

(define content "<br>\r\n<br>\r\n<div data=\"(sxml (@ (attr &quot;12345&quot;)) body)\">div body</div>")


With html-parser 0.5.0:

#;> (html->sxml empty)
(*TOP* (div (@ (data "")) "empty"))

#;> (html->sxml content)
(*TOP* (br) "\r\n" (br) "\r\n" (div (@ (data "(sxml (@ (attr &quot;12345&quot;)) body)")) "div body"))


With html-parser 0.5.1 and 0.5.2:

#;> (html->sxml empty)
(*TOP* (div (@ (data "()")) "empty"))

#;> (html->sxml content)
(*TOP* (br) "\r\n" (br) "\r\n" (div (@ (data "(sxml (@ (attr \"\r\nbr\r\nbr12345\"\r\nbr\r\nbr)) body)")) "div body"))

The extra text in attr seems to be taken from elsewhere in the document, in this case the names of the preceding tags and the whitespace between them:

#;> (html->sxml "<first>\r\n<br>\r\n<second /><div data=\"(sxml (@ (attr &quot;12345&quot;)) body)\">div body</div>")
(*TOP* (first "\r\n" (br) "\r\n" (second) (div (@ (data "(sxml (@ (attr \"second\r\nbr\r\n12345\"second\r\nbr\r\n)) body)")) "div body")))
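
In case it is useful, the two cases above could be turned into a small regression check along these lines. This is only a sketch: it assumes CHICKEN 4 with the test egg installed alongside html-parser, and the expected values are my assumption of the intended behaviour (an empty attribute stays "", and &quot; inside an attribute decodes to a plain double quote with nothing else added), not output from any released version.

```scheme
;; Sketch of a regression check for the two problems described above.
;; Assumes CHICKEN 4 with the html-parser and test eggs installed.
(use html-parser test)

(define empty "<div data=\"\">empty</div>")

(define content
  "<br>\r\n<br>\r\n<div data=\"(sxml (@ (attr &quot;12345&quot;)) body)\">div body</div>")

(test-group "html->sxml attribute handling"
  ;; Assumed intended result: the empty attribute remains an empty string.
  (test "empty attribute stays an empty string"
        '(*TOP* (div (@ (data "")) "empty"))
        (html->sxml empty))
  ;; Assumed intended result: &quot; becomes a double quote and no text
  ;; from earlier in the input leaks into the attribute value.
  (test "&quot; in an attribute decodes cleanly"
        '(*TOP* (br) "\r\n" (br) "\r\n"
                (div (@ (data "(sxml (@ (attr \"12345\")) body)")) "div body"))
        (html->sxml content)))

(test-exit)
```

Run with csi -s, this should fail on 0.5.0 only for the second test and on 0.5.1 and 0.5.2 for both, going by the transcripts above.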

Thanks for all your help maintaining this and, once again, sorry it took so long for us to put your newer versions into our code.


