bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region
From: Katsumi Yamaoka
Subject: bug#30789: 26.0.91; xml-parse-region works but libxml-parse-html-region doesn't
Date: Tue, 13 Mar 2018 11:28:45 +0900
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (x86_64-unknown-cygwin)

On Tue, 13 Mar 2018 01:44:22 +0100, Lars Ingebrigtsen wrote:
> libxml is more strict about correctness of the input than most other
> HTML parsers. I don't think there's anything we can do about this
> problematic input other than ponder whether Emacs should use a different
> HTML parser, which I think sounds unlikely. :-)
I see. I agree that we should leave libxml as it is. Jidanni, if you
often get such broken mails, how about trying the following patch
locally? I'm not sure it won't cause other problems, but it does fix
at least the mail in question.
--- mm-decode.el~ 2018-02-28 02:01:37.897607000 +0000
+++ mm-decode.el 2018-03-13 02:23:04.321753900 +0000
@@ -1810,6 +1810,11 @@
 (when (and (or coding
                (setq coding (mm-charset-to-coding-system charset nil t)))
            (not (eq coding 'ascii)))
+  ;; Remove extra bytes in utf-8 encoded data.
+  (when (eq coding 'utf-8)
+    (goto-char (point-min))
+    (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
+      (replace-match "\\1")))
   (insert (prog1
               (decode-coding-string (buffer-string) coding)
             (erase-buffer)
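
To try the regexp outside of Gnus first, something like the following
minimal sketch may help. It applies the same search and replace to a
unibyte string in a temporary buffer. The function name
`my-strip-ascii-before-continuation' and the sample bytes are made up
for illustration and are not part of mm-decode.el; the sample assumes
the breakage is a stray ASCII byte (here a newline) sitting in the
middle of a UTF-8 sequence.

```elisp
;; Hypothetical helper for experimenting; not part of mm-decode.el.
(defun my-strip-ascii-before-continuation (bytes)
  "Return unibyte string BYTES with the patch's replacement applied.
Every run of ASCII bytes that is directly followed by a UTF-8
continuation byte (0x80-0xbf) is deleted; the continuation byte is kept."
  (with-temp-buffer
    ;; Work on raw bytes, as the patch does on the undecoded part.
    (set-buffer-multibyte nil)
    (insert bytes)
    (goto-char (point-min))
    (while (re-search-forward "[\x00-\x7f]+\\([\x80-\xbf]\\)" nil t)
      (replace-match "\\1"))
    (buffer-string)))

;; Assumed sample: the three bytes e3 81 82 with a newline inserted
;; before the last one.  Decode the result to check it by eye.
(decode-coding-string
 (my-strip-ascii-before-continuation "\xe3\x81\n\x82")
 'utf-8)
```

Note that the regexp deletes the whole run of ASCII bytes in front of a
continuation byte, not just a single stray byte, so it can also remove
legitimate text that happens to precede a misplaced continuation byte.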