Re: [Gzz] ``canon3_file_format``: A canonical, N3-based file format

gzz-dev

From:	Tuomas Lukka
Subject:	Re: [Gzz] ``canon3_file_format``: A canonical, N3-based file format
Date:	Wed, 2 Apr 2003 11:25:06 +0300
User-agent:	Mutt/1.4.1i

This sounds just great!

Inserted below are a lot of issues for you to resolve ;)

On Tue, Apr 01, 2003 at 09:45:50PM +0200, Benja Fallenstein wrote:
> =========================================================
> ``canon3_file_format``: A canonical, N3-based file format
> =========================================================
> 
> :Author:      Benja Fallenstein
> :Date:                2003-04-01
> :Revision:    $Revision: 1.1 $
> :Last-Modified: $Date: 2003/03/31 09:37:41 $
> :Type:                Architecture
> :Scope:               Major
> :Status:      Current
> 
> 
> We need a canonical file format for storing data in CVS
> (canonical so that diffs will only show the differences
> in structure, not changes because one RDF writer
> chose to order triples differently than another writer
> or so). 

Issue: Does this also cover bags and sequences?

> This format could also be a potential candidate
> for storing versions of RDF graphs in Storm.
> 
> This PEG specifies such a format.
> 
> 
> Specification
> =============
> 
> The name of the format is *Canon3*. This version is identified
> by the URI <http://fenfire.org/2003/Canon3/1.0>. It is related to
> both `Notation 3`_ and `NTriples`_. 

Issue: do we really need a *new* format? 

Issue: How compatible is this with N3 and NTriples? What are the differences?

> Canon3 files
> are encoded as UTF-8, normalized to Unicode `Normalization Form C`_.
> They obey the following grammar::
> 
>     document ::= header (triple)*
>     header ::= "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>" NEWLINE

Issue: should the encoding be allowed to be different? Is UTF-8 always
sufficient?

Issue: Should the encoding be mentioned there?

>     triple ::= subject " " property " " object "." NEWLINE

Issue: should we support reification?

>     subject ::= URItoken | anonNode
>     property ::= URItoken
>     object ::= URItoken | anonNode | literal
>     URItoken ::= "<" URIref ">"
>     anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*

>     literal ::= #x22 #x22 #x22 string #x22 #x22 #x22 qualifiers

Issue: is quoting with three quotes really what we want? It
complicates the quoting and unquoting processes.

>     qualifiers ::= ("@" language)? ("^^" URItoken)?

These are not explained properly anywhere.

Please put in an example using all syntactic tricks..
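
As a starting point for such an example, here is a rough Python sketch
that assembles a small document from the productions quoted above. The
helper names and the example.org URIs are made up for illustration, and
the literal strings are assumed to be already escaped.

```python
import re

# Header text copied from the "header" production above.
HEADER = "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>"

# anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*
ANON_NODE = re.compile(r"_:[A-Za-z][A-Za-z0-9]*\Z")


def uri_token(uri_ref):
    """URItoken ::= "<" URIref ">" """
    return "<" + uri_ref + ">"


def anon_node(name):
    """Build an anonNode token, rejecting names the grammar forbids."""
    token = "_:" + name
    if not ANON_NODE.match(token):
        raise ValueError("not a valid anonNode: " + token)
    return token


def literal(escaped_string, language=None, datatype=None):
    """literal ::= three quotes, string, three quotes, qualifiers."""
    token = '"""' + escaped_string + '"""'
    if language is not None:
        token += "@" + language              # ("@" language)?
    if datatype is not None:
        token += "^^" + uri_token(datatype)  # ("^^" URItoken)?
    return token


def triple_line(subject, prop, obj):
    """triple ::= subject " " property " " object "." (NEWLINE added later)."""
    return subject + " " + prop + " " + obj + "."


doc = uri_token("http://example.org/doc")  # hypothetical URIs throughout
lines = [
    HEADER,
    triple_line(doc, uri_token("http://example.org/pages"),
                literal("12", datatype="http://www.w3.org/2001/XMLSchema#integer")),
    triple_line(doc, uri_token("http://example.org/title"),
                literal("Hello", language="en")),
    triple_line(anon_node("a1"), uri_token("http://example.org/partOf"), doc),
]
document = "\n".join(lines) + "\n"
print(document, end="")
```

This covers a URI subject, an anonymous subject, and literals with a
language tag and with a datatype; it does not show a literal carrying
both qualifiers, which the grammar as written would also allow.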

> The ``NEWLINE`` token may be any of CR, LF, and CRLF.
> (This is necessary for CVS to be useful across platforms.)
> In contexts where the specific form used matters,
> the newline character is LF. (In particular, when computing
> a content hash-- e.g., when creating a Canon3 Storm block.)

This is just asking for trouble!
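
To make the concern concrete, here is a minimal sketch of what every
hashing tool would have to do under this rule. SHA-1 and the function
names are assumptions for illustration, not something the PEG specifies.

```python
import hashlib


def canonical_bytes(text):
    """Fold CRLF and bare CR to LF, then encode as UTF-8."""
    # CRLF must be replaced first, or it would turn into two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n").encode("utf-8")


def content_hash(text):
    """Hash the LF-normalized form (SHA-1 chosen only for illustration)."""
    return hashlib.sha1(canonical_bytes(text)).hexdigest()


unix_copy = "# header\n<a> <b> <c>.\n"
dos_copy = "# header\r\n<a> <b> <c>.\r\n"
assert content_hash(unix_copy) == content_hash(dos_copy)
```

Note that this also rewrites any CR or CRLF that appears inside a string
literal, since strings may contain newlines, so two literals differing
only in their line endings would hash identically.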

> The triples must be ordered. 

Capitalize "must" ;)

Might be good to include language saying that a processor MUST NOT
accept faulty Canon3.

> Two triples are compared
> by comparing their subjects, properties, and objects
> in this order. Each of these parts is compared
> as follows:
> 
> - Literals are lower than (go before) URIrefs,
>   which go before anonymous nodes.

??????

> - URIrefs are compared character-by-character,
>   in the form as defined in [RFC 2396]
>   (i.e., *after* Unicode characters outside
>   the ASCII range have been escaped).
>   Characters are compared by Unicode code point
>   value.

Is this the same as a lexicographic string comparison
of the UTF-8 encoded one?
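
For what it is worth, UTF-8 is designed so that byte-wise comparison
gives the same order as code point comparison; the same does not hold
for comparing UTF-16 code units. A small Python check, with sample
strings picked arbitrarily:

```python
# Python compares str values by Unicode code point.
samples = ["", "a", "ab", "z", "\u00e9", "\u20ac", "\uffff", "\U0001d11e"]

by_code_point = sorted(samples)
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))
by_utf16_units = sorted(samples, key=lambda s: s.encode("utf-16-be"))

# UTF-8 byte order agrees with code point order.
assert by_code_point == by_utf8_bytes

# UTF-16 code unit order does not: U+1D11E is stored as surrogates
# (D834 DD1E), which sort below U+FFFF.
assert by_code_point != by_utf16_units
```

Since the URIrefs are compared after escaping, they should be plain
ASCII at that point anyway, so the question matters mostly for literals.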

> - Literals are compared character-by-character
>   in their unescaped form (i.e., before the
>   backslash escaping defined below). 

Why before? 

Actually, if you delimit the string with a single double-quote character
on each side and *double* all quotes inside (foo"bar becomes "foo""bar"),
you have the same order before and after!

> If the
>   strings of two literals are equal, first
>   the language tag and then the data type,
>   if any, are compared in the same manner.
>   Literals without language tags/data types
>   go before literals with them (if the
>   contents of the literals are equal).
> - Anonymous nodes are compared by their
>   internal identifiers (the stuff following
>   the ``_:``), also character-by-character.

> A triple may only be listed once; if there are two
> equal triples in the graph to be serialized, this
> triple must occur only once in the serialization.

Umm, the graph is not AFAIK a multigraph, so it *can't* occur
more than once.

> ``URIref`` is a URI reference as defined in [RFC 2396].
> Percent escapes (e.g. ``%2f``) should preferably
> be encoded in lower case. 

Should? Ouch... better not leave any choices here.
Should probably also specify which characters shall and which shall not
be escaped.
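
If lower case were made mandatory, a writer could enforce it with
something like the following sketch (the function name is hypothetical).
It only touches well-formed percent escapes and leaves the rest of the
URIref alone; deciding *which* characters to escape is a separate
question it does not address.

```python
import re

# A percent sign followed by exactly two hex digits.
_PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def lowercase_percent_escapes(uri_ref):
    """Lower-case the hex digits of every percent escape in uri_ref."""
    return _PERCENT_ESCAPE.sub(lambda m: m.group(0).lower(), uri_ref)


print(lowercase_percent_escapes("http://example.org/A%2Fb%C3%A9"))
```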

> URIref may be either of the following:
> 
> 1. An absolute URI (e.g., ``http://example.org/``).
> 2. An absolute URI plus a fragment identifier
>    (e.g., ``http://example.org/#foo``).
> 3. The empty URI reference (which is a relative URI
>    referring to the current document).
> 4. A standalone fragment identifier (e.g., ``#foo``),
>    referring to a fragment of the current document.
> 
> ``language`` is a Language-Tag as defined by [RFC 3066].
> 
> A ``string`` is any UTF-8 character sequence
> encoded in the following way:
> 
> - Double any backslash in the string.
> - Insert a backslash before the first of any three
>   consecutive double quotes (#x22) in the string.
>   (This means: In a sequence of three or more
>   double quote characters, insert a backslash
>   before all but the last two double quotes).

This description is too late - you've already
referred to backslash

> For example, the string ``f\oo"""""ba"r`` becomes
> ``f\\oo\"\"\"""ba"r``.
> 
> Strings may contain newlines. Like all of Canon3,
> they are encoded in Normalization Form C.

Issue: Why normalization form C?
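
One possible motivation, not stated in the PEG: without some fixed
normalization form, the same visible text can have several byte
encodings, which would defeat the point of a canonical format. A small
Python illustration using the standard ``unicodedata`` module:

```python
import unicodedata


def to_nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Both render the same but differ as code point sequences and as bytes.
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# After NFC normalization both yield the same UTF-8 bytes.
assert to_nfc(composed).encode("utf-8") == to_nfc(decomposed).encode("utf-8")
```

This does not by itself argue for form C over the other normalization
forms.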

> They are enclosed in triple double quotes
> (see production ``literal``).
> 
> We will register a MIME type for Canon3.
> 
> \- Benja
> 
> 
> .. _Normalization Form C: http://www.unicode.org/unicode/reports/tr15/
> .. _NTriples: http://www.w3.org/TR/rdf-testcases/#ntriples
> .. _Notation 3: http://www.w3.org/DesignIssues/Notation3.html

        Tuomas
