Re: [Gzz] ``canon3_file_format``: A canonical, N3-based file format
From: Tuomas Lukka
Subject: Re: [Gzz] ``canon3_file_format``: A canonical, N3-based file format
Date: Wed, 2 Apr 2003 11:25:06 +0300
User-agent: Mutt/1.4.1i

This sounds just great!
Inserted below are a lot of issues for you to resolve ;)
On Tue, Apr 01, 2003 at 09:45:50PM +0200, Benja Fallenstein wrote:
> =========================================================
> ``canon3_file_format``: A canonical, N3-based file format
> =========================================================
>
> :Author: Benja Fallenstein
> :Date: 2003-04-01
> :Revision: $Revision: 1.1 $
> :Last-Modified: $Date: 2003/03/31 09:37:41 $
> :Type: Architecture
> :Scope: Major
> :Status: Current
>
>
> We need a canonical file format for storing data in CVS
> (canonical so that diffs will only show the differences
> in structure, not changes because one RDF writer
> chose to order triples differently than another writer
> or so).
Issue: Does this also cover bags and sequences?
> This format could also be a potential candidate
> for storing versions of RDF graphs in Storm.
>
> This PEG specifies such a format.
>
>
> Specification
> =============
>
> The name of the format is *Canon3*. This version is identified
> by the URI <http://fenfire.org/2003/Canon3/1.0>. It is related to
> both `Notation 3`_ and `NTriples`_.
Issue: do we really need a *new* format?
Issue: How compatible is this with N3 and NTriples? What are the differences?
> Canon3 files
> are encoded as UTF-8, normalized to Unicode `Normalization Form C`_.
> They obey the following grammar::
>
> document ::= header (triple)*
> header ::= "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>" NEWLINE
Issue: should the encoding be allowed to be different? Is UTF-8 always
sufficient?
Issue: Should the encoding be mentioned in the header?
> triple ::= subject " " property " " object "." NEWLINE
Issue: should we support reification?
> subject ::= URItoken | anonNode
> property ::= URItoken
> object ::= URItoken | anonNode | literal
> URItoken ::= "<" URIref ">"
> anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*
> literal ::= #x22 #x22 #x22 string #x22 #x22 #x22 qualifiers
Issue: is quoting with three quotes really what we want? It
complicates the quoting and unquoting processes.
> qualifiers ::= ("@" language)? ("^^" URItoken)?
These qualifiers are not explained properly anywhere.
Please put in an example using all the syntactic tricks.
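To make the grammar above concrete, here is a minimal sketch in Python that assembles a small Canon3 document. The helper names (``uri``, ``literal``, ``triple``, ``document``) and the ``example.org`` URIs are hypothetical, chosen only for illustration. The sketch assumes the literal text needs no escaping and does not sort anything; the triples are listed by hand in the order the draft's ordering rules appear to require.

```python
# Hypothetical helpers that follow the grammar productions literally.
HEADER = "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>"


def uri(ref):
    """URItoken ::= "<" URIref ">" """
    return "<" + ref + ">"


def literal(text, language=None, datatype=None):
    """literal ::= three quotes, string, three quotes, qualifiers.

    Assumes `text` contains nothing that needs escaping.
    """
    out = '"""' + text + '"""'
    if language is not None:
        out += "@" + language
    if datatype is not None:
        out += "^^" + uri(datatype)
    return out


def triple(subject, prop, obj):
    """triple ::= subject " " property " " object "." (NEWLINE added later)"""
    return subject + " " + prop + " " + obj + "."


def document(triples):
    """document ::= header (triple)*, using LF as the NEWLINE form."""
    return "\n".join([HEADER] + list(triples)) + "\n"


example = document([
    triple(uri("http://example.org/doc"), uri("http://example.org/count"),
           literal("3", datatype="http://www.w3.org/2001/XMLSchema#integer")),
    triple(uri("http://example.org/doc"), uri("http://example.org/title"),
           literal("Hello", language="en")),
    triple("_:a1", uri("http://example.org/about"),
           uri("http://example.org/doc")),
])
```

Printing ``example`` shows the header line followed by one triple per line, covering a typed literal, a language-tagged literal, and an anonymous node.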
> The ``NEWLINE`` token may be any of CR, LF, and CRLF.
> (This is necessary for CVS to be useful across platforms.)
> In contexts where the specific form used matters,
> the newline character is LF. (In particular, when computing
> a content hash-- e.g., when creating a Canon3 Storm block.)
This is just asking for trouble!
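For reference, the rule being objected to amounts to something like the following Python sketch: every newline form is mapped to LF before hashing. The function names are hypothetical, and SHA-1 is used only as a stand-in hash; the draft does not say which hash a Storm block would use. Note that this also rewrites CR and CRLF inside literal strings, which is part of why the rule is risky.

```python
import hashlib


def normalize_newlines(data: bytes) -> bytes:
    """Map CRLF and bare CR to LF (CRLF first, so it is not doubled)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def content_hash(data: bytes) -> str:
    """Hash the LF-normalized bytes; SHA-1 is an assumption for illustration."""
    return hashlib.sha1(normalize_newlines(data)).hexdigest()
```

With this, a file checked out with CRLF line endings and the same file with LF line endings hash identically, but so do two files whose literals genuinely differ only in their newline form.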
> The triples must be ordered.
Capitalize "must" ;)
It might also be good to include language saying that a processor
MUST NOT accept faulty Canon3.
> Two triples are compared
> by comparing their subjects, properties, and objects
> in this order. Each of these parts is compared
> as follows:
>
> - Literals are lower than (go before) URIrefs,
> which go before anonymous nodes.
??????
> - URIrefs are compared character-by-character,
> in the form as defined in [RFC 2396]
> (i.e., *after* Unicode characters outside
> the ASCII range have been escaped).
> Characters are compared by Unicode code point
> value.
Is this the same as a lexicographic string comparison
of the UTF-8 encoded one?
> - Literals are compared character-by-character
> in their unescaped form (i.e., before the
> backslash escaping defined below).
Why before?
Actually, if you delimit literals with a single double quote
and *double* every quote inside (foo"bar becomes "foo""bar"),
you get the same order before and after!
> If the
> strings of two literals are equal, first
> the language tag and then the data type,
> if any, are compared in the same manner.
> Literals without language tags/data types
> go before literals with them (if the
> contents of the literals are equal).
> - Anonymous nodes are compared by their
> internal identifiers (the stuff following
> the ``_:``), also character-by-character.
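On the question of whether code-point comparison matches a byte-wise comparison of the UTF-8 encoding: it does, because UTF-8 preserves code point order. For URIrefs this is trivially true, since they are compared in their ASCII-only escaped form; it matters for literals. A small Python check, with sample strings chosen arbitrarily (UTF-16 is included only to show an encoding where the property fails):

```python
# Python compares str values by Unicode code point.
samples = ["z", "\u00e9", "\u20ac", "\U00010000", "\uffff", "a"]

by_code_point = sorted(samples)
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(samples, key=lambda s: s.encode("utf-16-be"))

# UTF-8 byte order agrees with code point order.
same_as_utf8 = by_code_point == by_utf8_bytes

# UTF-16 does not: U+10000 is encoded with surrogates (D800 DC00),
# which sort before U+FFFF.
same_as_utf16 = by_code_point == by_utf16_bytes
```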
> A triple may only be listed once; if there are two
> equal triples in the graph to be serialized, this
> triple must occur only once in the serialization.
Umm, the graph is not AFAIK a multigraph, so it *can't* occur
more than once.
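The ordering and uniqueness rules quoted above can be sketched as a sort key in Python. The tuple representation of nodes (``("literal", text, language, datatype)``, ``("uri", ref)``, ``("anon", identifier)``) and all function names are hypothetical. The sketch assumes URIrefs are already in their escaped RFC 2396 form and that literal text is unescaped.

```python
# Literals go before URIrefs, which go before anonymous nodes.
KIND_RANK = {"literal": 0, "uri": 1, "anon": 2}


def optional_key(value):
    """An absent language tag or datatype sorts before a present one."""
    return (0, "") if value is None else (1, value)


def node_key(node):
    kind = node[0]
    if kind == "literal":
        _, text, language, datatype = node
        # Text first, then language tag, then datatype.
        return (KIND_RANK[kind], text,
                optional_key(language), optional_key(datatype))
    # URIrefs and anonymous-node identifiers compare character by
    # character; Python str comparison is by code point.
    return (KIND_RANK[kind], node[1])


def triple_key(triple):
    """Compare subject, then property, then object."""
    return tuple(node_key(part) for part in triple)


def canonical_order(triples):
    """Drop duplicate triples, then sort the rest."""
    return sorted(set(triples), key=triple_key)
```

Passing the triples through ``set`` makes the "listed only once" rule hold even if the caller supplies a list with repeats; for a graph that is already a set of triples it is a no-op.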
> ``URIref`` is a URI reference as defined in [RFC 2396].
> Percent escapes (e.g. ``%2f``) should preferably
> be encoded in lower case.
Should? Ouch... better not leave any choices here.
Should probably also specify which characters shall and which shall not
be escaped.
> URIref may be either of the following:
>
> 1. An absolute URI (e.g., ``http://example.org/``).
> 2. An absolute URI plus a fragment identifier
> (e.g., ``http://example.org/#foo``).
> 3. The empty URI reference (which is a relative URI
> referring to the current document).
> 4. A standalone fragment identifier (e.g., ``#foo``),
> referring to a fragment of the current document.
>
> ``language`` is a Language-Tag as defined by [RFC 3066].
>
> A ``string`` is any UTF-8 character sequence
> encoded in the following way:
>
> - Double any backslash in the string.
> - Insert a backslash before the first of any three
> consecutive double quotes (#x22) in the string.
> (This means: In a sequence of three or more
> double quote characters, insert a backslash
> before all but the last two double quotes).
This description comes too late: you've already
referred to backslash escaping above.
> For example, the string ``f\oo"""""ba"r`` becomes
> ``f\\oo\"\"\"""ba"r``.
>
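The two escaping steps above can be sketched in Python as follows. The function name ``escape_string`` is hypothetical, and the sketch implements only the rule as stated in the draft; it does nothing special for a string that ends in a double quote.

```python
import re


def escape_string(text):
    """Apply the two encoding steps described in the draft."""
    # Step 1: double every backslash.
    text = text.replace("\\", "\\\\")
    # Step 2: put a backslash before every double quote that is
    # followed by two more double quotes. In a run of three or more
    # quotes this escapes all but the last two.
    return re.sub(r'"(?="")', r'\\"', text)


# The example from the draft.
escaped = escape_string(r'f\oo"""""ba"r')
```

Running this on the draft's own example reproduces the encoded form given there.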
> Strings may contain newlines. Like all of Canon3,
> they are encoded in Normalization Form C.
Issue: Why normalization form C?
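One possible motivation, offered here as an assumption and not as the author's stated reason: without a fixed normalization form, the same visible text can be serialized as different byte sequences, which defeats the purpose of a canonical format. A small Python illustration (the function name is hypothetical):

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent


def to_canonical_bytes(text):
    """Normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# The raw encodings differ even though both render the same way.
raw_differs = composed.encode("utf-8") != decomposed.encode("utf-8")

# After NFC normalization both spellings yield the same bytes.
normalized_equal = to_canonical_bytes(composed) == to_canonical_bytes(decomposed)
```

Any single normalization form would give this stability; the sketch does not show why form C would be preferred over the others.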
> They are enclosed in triple double quotes
> (see production ``literal``).
>
> We will register a MIME type for Canon3.
>
> \- Benja
>
>
> .. _Normalization Form C: http://www.unicode.org/unicode/reports/tr15/
> .. _NTriples: http://www.w3.org/TR/rdf-testcases/#ntriples
> .. _Notation 3: http://www.w3.org/DesignIssues/Notation3.html
Tuomas