[Orgmode] org-feed XML entities and character encoding

From: Michael Brand
Subject: [Orgmode] org-feed XML entities and character encoding
Date: Tue, 10 Aug 2010 21:59:26 +0200
User-agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv: Gecko/20100317 Thunderbird/3.0.4

Hi all,

org-feed is becoming very useful to me, so far for managing podcast
episodes. I have a patch and a request for help.

1. patch for an issue with XML entities

I found that some XML entities in my feeds are not substituted. The
comments of two recent org-feed.el commits by David Maus
led me to the thread
and prompted me to replace org-feed-unescape with xml-substitute-special,
which converts more XML entities. The resulting patch below works for
me, but I would like it to be reviewed by an experienced elisp
programmer and org-feed user before it is applied.
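
As a small illustration of what the replacement function does, the
following sketch calls xml-substitute-special from xml.el on a few sample
strings. The sample strings are made up for this example, and the results
in the comments assume the behaviour of xml.el as shipped with recent
Emacs versions.

```emacs-lisp
(require 'xml)

;; Named entity references listed in `xml-entity-alist' are replaced.
(xml-substitute-special "Tom &amp; Jerry")   ; => "Tom & Jerry"

;; Numeric character references, decimal or hexadecimal, are replaced too.
(xml-substitute-special "caf&#233;")         ; => "café"
(xml-substitute-special "caf&#xE9;")         ; => "café"
```

The removed org-feed-unescape only matched the names found in
xml-entity-alist, so numeric references like the last two were left
untouched.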

2. request for help about an issue with multibyte character encoding

There is an issue with multibyte characters that appear in the input
as unescaped, multibyte-encoded characters (not as XML entities; when
written as XML entities, multibyte characters are substituted correctly). I
looked for an example with a character encoding specified in the first
line of the XML feed like
<?xml version="1.0" encoding="utf-8"?>
and found one here:

The W3C validator
seems to be happy with this feed, but when it is fed into a feeds.org file
the unescaped, multibyte-encoded characters, e.g. in the title `Screencast
076 [...]', get garbled, even with `coding: utf-8-unix' in the first line
of feeds.org. Can someone please help to get this issue
resolved? If it is easily possible, as I expect it to be, then generally
for all character encodings supported by Emacs? I would also like
UTF-8 feeds like
that do not specify the character encoding to work.
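
To make the request concrete, here is a minimal sketch of the kind of
decoding step that seems to be missing. It is not taken from org-feed.el:
the function name my/feed-decode is hypothetical, and the sketch assumes
the feed is available as a unibyte string of raw bytes and that UTF-8 is
an acceptable fallback when no encoding is declared.

```emacs-lisp
(defun my/feed-decode (raw)
  "Decode unibyte string RAW using the encoding named in its XML declaration.
Fall back to UTF-8 when no usable declaration is present.
Hypothetical helper for illustration, not part of org-feed.el."
  (let* ((declared
          ;; Look for encoding="..." in an XML declaration at the very start.
          (and (string-match
                "\\`<\\?xml[^>]*encoding=[\"']\\([^\"']+\\)[\"']" raw)
               (intern (downcase (match-string 1 raw)))))
         ;; Use the declared name only if Emacs knows it as a coding system.
         (coding (if (and declared (coding-system-p declared))
                     declared
                   'utf-8)))
    (decode-coding-string raw coding)))
```

Given the raw bytes of a feed, (my/feed-decode raw) would return a
multibyte string whose non-ASCII characters are ordinary Emacs characters.
Where in org-feed such a step would belong is left open here.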


- Michael

--- a/lisp/org-feed.el
+++ b/lisp/org-feed.el
@@ -99,6 +99,7 @@
 (declare-function xml-get-children "xml" (node child-name))
 (declare-function xml-get-attribute "xml" (node attribute))
 (declare-function xml-get-attribute-or-nil "xml" (node attribute))
+(declare-function xml-substitute-special "xml" (string))
 (defvar xml-entity-alist)

 (defgroup org-feed  nil
@@ -269,17 +270,6 @@
 (defvar org-feed-buffer "*Org feed*"
   "The buffer used to retrieve a feed.")

-(defun org-feed-unescape (s)
-  "Unescape protected entities in S."
-  (require 'xml)
-  (let ((re (concat "&\\("
-                   (mapconcat 'car xml-entity-alist "\\|")
-                   "\\);")))
-    (while (string-match re s)
-      (setq s (replace-match
-              (cdr (assoc (match-string 1 s) xml-entity-alist)) nil nil s)))
-    s))
 (defun org-feed-update-all ()
   "Get inbox items from all feeds in `org-feed-alist'."
@@ -613,6 +603,7 @@

 (defun org-feed-parse-rss-entry (entry)
   "Parse the `:item-full-text' field for xml tags and create new properties."
+  (require 'xml)
     (insert (plist-get entry :item-full-text))
     (goto-char (point-min))
@@ -620,7 +611,7 @@
                              nil t)
       (setq entry (plist-put entry
                             (intern (concat ":" (match-string 1)))
-                            (org-feed-unescape (match-string 2)))))
+                            (xml-substitute-special (match-string 2)))))
     (goto-char (point-min))
     (unless (re-search-forward "isPermaLink[ \t]*=[ \t]*\"false\"" nil t)
       (setq entry (plist-put entry :guid-permalink t))))
@@ -633,7 +624,6 @@

 The `:item-full-text' property actually contains the sexp
 formatted as a string, not the original XML data."
-  (require 'xml)
   (with-current-buffer buffer
     (let ((feed (car (xml-parse-region (point-min) (point-max)))))
@@ -654,7 +644,7 @@
     ;; Add <title/> as :title.
     (setq entry (plist-put entry :title
-                          (org-feed-unescape
+                          (xml-substitute-special
                            (car (xml-node-children
                                  (car (xml-get-children xml 'title)))))))
     (let* ((content (car (xml-get-children xml 'content)))
@@ -664,12 +654,12 @@
         ((string= type "text")
          ;; We like plain text.
          (setq entry (plist-put entry :description
-                                (org-feed-unescape
+                                (xml-substitute-special
                                  (car (xml-node-children content))))))
         ((string= type "html")
          ;; TODO: convert HTML to Org markup.
          (setq entry (plist-put entry :description
-                                (org-feed-unescape
+                                (xml-substitute-special
                                  (car (xml-node-children content))))))
         ((string= type "xhtml")
          ;; TODO: convert XHTML to Org markup.
