url-expand.el and url-parse.el not conforming to RFC3986

From: Alain Schneble (Realize IT GmbH)
Subject: url-expand.el and url-parse.el not conforming to RFC3986
Date: Fri, 27 Nov 2015 15:22:29 +0000


url-expand.el and url-parse.el do not seem to follow RFC3986 "Uniform
Resource Identifier (URI): Generic Syntax" in some cases. But I guess
they should. So I started to study RFC3986 in more detail and wrote
tests against url-expand-file-name and url-generic-parse-url (see
attached patch).

The tests reveal the following issues:

1. resolving relative "fragment-only" URIs against a given absolute
   base URI (see RFC3986, section 5. Reference Resolution, and
   especially 5.2.2. Transform References):

   (url-expand-file-name "#s" "http://a/b/c/d;p?q")
   => "#s" but should be "http://a/b/c/d;p?q#s"

   (url-expand-file-name "#bar" "http://host")
   => "#bar" but should be "http://host#bar"

   (url-expand-file-name "#bar" "http://host/")
   => "#bar" but should be "http://host/#bar"

   (url-expand-file-name "#bar" "http://host/foo")
   => "#bar" but should be "http://host/foo#bar"

2. resolving relative "query-only" URIs against a given absolute base
   URI (see RFC3986, same sections as mentioned in point 1.):

   (url-expand-file-name "?y" "http://a/b/c/d;p?q")
   => "http://a/b/c/?y" but should be "http://a/b/c/d;p?y"

   (url-expand-file-name "?y" "http://a/b/c/d")
   => "http://a/b/c/?y" but should be "http://a/b/c/d?y"

3. removing dot segments (see RFC3986, section 5.2.4. Remove Dot
   Segments):

   (url-expand-file-name "/./g" "http://a/b/c/d;p?q")
   => "http://a/./g" but should be "http://a/g"

   (url-expand-file-name "/../g" "http://a/b/c/d;p?q")
   => "http://a/../g" but should be "http://a/g"

4. empty fragment information is lost after parsing URI:

   (equal (url-generic-parse-url "#")
       (url-parse-make-urlobj nil nil nil nil nil "" "" nil nil))
   => nil but should be t (fragment component is actually nil instead
   of an empty string)

   Same issue with URLs having a number sign (#) as suffix:
   ... and so forth

   The problem with this is that the inverse function url-recreate-url
   won't be able to reconstruct exactly the same URI. For example:

   (url-recreate-url (url-generic-parse-url "#"))
   => "" but should be "#"
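
For reference, the dot-segment removal behind issue 3 can be sketched
as follows. This is only a minimal sketch of the algorithm described in
RFC3986, section 5.2.4; the function name my-remove-dot-segments is
hypothetical, and this is not the code from the attached patch.

```elisp
;; Hypothetical helper for illustration only; not part of url-expand.el
;; and not taken from the attached patch.
(defun my-remove-dot-segments (path)
  "Return PATH with \".\" and \"..\" segments removed (RFC3986, 5.2.4)."
  (let ((input path)
        (output '()))                  ; collected segments, newest first
    (while (not (string= input ""))
      (cond
       ;; A: drop a leading "../" or "./".
       ((string-prefix-p "../" input) (setq input (substring input 3)))
       ((string-prefix-p "./" input) (setq input (substring input 2)))
       ;; B: a leading "/./" or a complete "/." becomes "/".
       ((string-prefix-p "/./" input) (setq input (substring input 2)))
       ((string= input "/.") (setq input "/"))
       ;; C: a leading "/../" or a complete "/.." becomes "/" and
       ;; removes the last segment collected so far (if any).
       ((string-prefix-p "/../" input)
        (setq input (substring input 3))
        (pop output))
       ((string= input "/..")
        (setq input "/")
        (pop output))
       ;; D: a lone "." or ".." is dropped.
       ((member input '("." "..")) (setq input ""))
       ;; E: move the first segment, including its leading "/" if
       ;; present, from the input to the output.
       (t
        (let ((end (or (string-match "/" input 1) (length input))))
          (push (substring input 0 end) output)
          (setq input (substring input end))))))
    (apply #'concat (reverse output))))
```

Applied to the paths from issue 3, (my-remove-dot-segments "/./g") and
(my-remove-dot-segments "/../g") both return "/g", which matches the
expected "http://a/g" once scheme and authority are put back.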

To address these issues, I propose changes to url-parse.el and
url-expand.el, see attached patch. Here is the detailed summary:

- url-parse-tests.el: add tests for url-generic-parse-url
- url-expand-tests.el: add tests for url-expand-file-name

- url-generic-parse-url: keep empty fragment information in URL-struct
- url-path-and-query: do not artificially turn empty path and query
  into nil path and query, respectively
- url-expander-remove-relative-links: do not turn empty path into an
  absolute path ("/"). Remark: given the name of this function, would
  it be better to handle this case in the caller instead?
- url-expand-file-name: properly resolve fragment-only URIs instead
  of returning them unchanged. I think that this bug was due to a
  misinterpretation of RFC3986, section 5.1. Establishing a Base URI:
    "Aside from fragment-only references (Section 4.4), relative
    references are only usable when a base URI is known."
  To me, this does not mean that they should not be resolved
  properly. And the examples given in the RFC emphasize this as well.
- url-default-expander: an empty path in the relative reference URI
  should not drop the last segment.
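
To illustrate the behavior these changes aim for, here is a minimal
sketch of the empty-path branch of RFC3986, section 5.2.2, for a
reference without scheme and authority. The function name
my-resolve-empty-path-reference and its argument list are hypothetical
and not taken from the patch; nil stands for an undefined component,
while "" stands for a component that is present but empty.

```elisp
;; Hypothetical helper for illustration only; not part of url-expand.el
;; and not taken from the attached patch.
;; Convention: nil = component undefined, "" = present but empty.
(defun my-resolve-empty-path-reference (base-path base-query
                                        ref-query ref-fragment)
  "Resolve a reference with an empty path (RFC3986, section 5.2.2).
Return the path, query and fragment part of the target URI as a
string.  Scheme and authority would be taken from the base URI."
  ;; The base path is kept as is; its last segment is not dropped.
  ;; The base query is inherited only if the reference has no query.
  (let ((query (if ref-query ref-query base-query)))
    (concat base-path
            (and query (concat "?" query))
            ;; The fragment always comes from the reference, and an
            ;; empty fragment still produces a trailing "#".
            (and ref-fragment (concat "#" ref-fragment)))))
```

With the base "http://a/b/c/d;p?q" (path "/b/c/d;p", query "q"),
(my-resolve-empty-path-reference "/b/c/d;p" "q" "y" nil) returns
"/b/c/d;p?y" and (my-resolve-empty-path-reference "/b/c/d;p" "q" nil "s")
returns "/b/c/d;p?q#s", matching the expected results in points 1 and 2.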

Please let me know if I should follow a different procedure to submit
these changes. I signed the copyright assignment "GNU EMACS" this year.


Attachment: 0001-Make-relative-URL-parsing-and-resolution-consistent-.patch

