bug#17958: SHR: base handling broken (shr-parse-base, shr-expand-url)

From: Ivan Shmakov
Subject: bug#17958: SHR: base handling broken (shr-parse-base, shr-expand-url)
Date: Thu, 14 Aug 2014 18:50:20 +0000
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux)

retitle 17958 SHR: base handling broken (shr-parse-base, shr-expand-url) 
tag     17958 + patch

>>>>> Ivan Shmakov <address@hidden> writes:


 > However, I believe that the real culprit is shr-expand-url, which
 > mishandles the nil ‘uri’ case:

 > (mapcar (lambda (x) (shr-expand-url x "http://example.com/welcome/"))
 >         '("hello" "/world" nil))
 > ;; ⇒
 > ("http://example.com/welcome/hello"
 >  "http://example.com/world"
 >  "http://example.com")

 > My expectation for the last result would be the ‘base’ argument
 > unchanged (i. e., http://example.com/welcome/.)

 > Thus, I suggest changing shr-expand-url to return not the 0th element
 > of the (parsed) ‘base’ (see below), but the 3rd.

 >   (cond ((or (not url)
 >              (not base)
 >              (string-match "\\`[a-z]*:" url))
 >          ;; Absolute URL.
 >          (or url (car base)))
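
        To make the suggested change concrete, here is a minimal sketch
        of the nil case in isolation.  It does not call shr.el at all:
        ‘my-base’ is a hand-written stand-in for a parsed base (its
        four-element layout is inferred from the shr-parse-base results
        shown in this message), and ‘my-nil-url-result’ is a
        hypothetical helper that only mirrors the final ‘or’ form.

```elisp
;; Hand-written stand-in for a parsed base.  Assumed layout, inferred
;; from the results in this message: (SITE-ROOT DIRECTORY SCHEME FULL-URL).
(defvar my-base
  '("http://example.com" "/welcome/" "http" "http://example.com/welcome/"))

;; Hypothetical helper, not part of shr.el.
(defun my-nil-url-result (url base &optional patched)
  "Return URL, or an element of BASE when URL is nil.
With PATCHED non-nil, fall back to the 3rd element of BASE
instead of the 0th."
  (or url (if patched (nth 3 base) (car base))))

(my-nil-url-result nil my-base)     ; 0th element: the site root only
(my-nil-url-result nil my-base t)   ; 3rd element: the base, unchanged
```

        With the 3rd element, a nil ‘url’ yields the ‘base’ argument
        unchanged rather than just the site root.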

 > [1] https://tools.wmflabs.org/guc/?user=2001:db8:1337::cafe

        There appears to be one more issue with SHR “base” handling:
        the <base href="" /> URI may itself be relative, and SHR fails
        to handle that case properly.  As per [2]:

    To set the frozen base URL, resolve the value of the element's href
    content attribute relative to the Document's fallback base URL; if
    this is successful, set the frozen base URL to the resulting
    absolute URL, otherwise, set the frozen base URL to the fallback
    base URL.
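
        As a rough illustration of the rule quoted above (and not of
        what the patch itself does), the resolution step can be
        approximated with ‘url-expand-file-name’ from the url library
        bundled with Emacs.  ‘my-frozen-base’ is a hypothetical helper,
        and treating a nil or empty href as “use the fallback” is my
        simplifying assumption.

```elisp
(require 'url-expand)

;; Hypothetical helper, not part of shr.el.
(defun my-frozen-base (href fallback)
  "Resolve HREF relative to FALLBACK, a string holding an absolute URL.
Return FALLBACK itself when HREF is nil or empty (an assumption made
for this sketch)."
  (if (or (null href) (string= href ""))
      fallback
    (url-expand-file-name href fallback)))

;; A relative href is resolved against the document's own URL.
(my-frozen-base "/relative" "http://example.org/")
;; An href that is already absolute is used as given.
(my-frozen-base "http://example.net/x/" "http://example.org/")
```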

        The SHR behavior doesn’t match the above.  Consider, e. g.:

(let ((shr-base (shr-parse-base "http://example.org/")))
  (shr-tag-base '((:href . "/relative")))
  ;; Return the base as updated by shr-tag-base.
  shr-base)
;; ⇒
("" "/" nil "/relative")

        With the attached patch (which also fixes the issue described in
        my initial bug report), it instead gives what I deem to be the
        correct result:

(let ((shr-base (shr-parse-base "http://example.org/")))
  (shr-tag-base '((:href . "/relative")))
  ;; Return the base as updated by shr-tag-base.
  shr-base)
;; ⇒
("http://example.org" "/" "http" "http://example.org/relative")

        For proper compliance with the specification, SHR should also
        ignore every <base /> element except the first one, but I guess
        that can be fixed separately.

        The relative <base /> URIs appear, e. g., on the Internet
        Wayback Machine archive pages, when the original page uses the
        <base /> element.

[2] http://www.w3.org/TR/html5/document-metadata.html#the-base-element

FSF associate member #7257  http://boycottsystemd.org/  … 3013 B6A0 230E 334A
--- a/lisp/net/shr.el
+++ b/lisp/net/shr.el
@@ -574,6 +574,8 @@ size, and full-buffer size."
   ;; Always chop off anchors.
   (when (string-match "#.*" url)
     (setq url (substring url 0 (match-beginning 0))))
+  ;; NB: <base href="" > URI may itself be relative to the document’s URI
+  (setq url (shr-expand-url url))
   (let* ((parsed (url-generic-parse-url url))
         (local (url-filename parsed)))
     (setf (url-filename parsed) "")
@@ -592,6 +594,7 @@ size, and full-buffer size."
 (defun shr-expand-url (url &optional base)
   (setq base
        (if base
+           ;; shr-parse-base should never call this with non-nil base!
            (shr-parse-base base)
          ;; Bound by the parser.
@@ -600,8 +603,8 @@ size, and full-buffer size."
   (cond ((or (not url)
             (not base)
             (string-match "\\`[a-z]*:" url))
-        ;; Absolute URL.
-        (or url (car base)))
+        ;; Absolute or empty URI
+        (or url (nth 3 base)))
        ((eq (aref url 0) ?/)
         (if (and (> (length url) 1)
                  (eq (aref url 1) ?/))
