[Gzz] ``canon3_file_format``: A canonical, N3-based file format
From: |
Benja Fallenstein |
Subject: |
[Gzz] ``canon3_file_format``: A canonical, N3-based file format |
Date: |
Tue, 01 Apr 2003 21:45:50 +0200 |
=========================================================
``canon3_file_format``: A canonical, N3-based file format
=========================================================
:Author: Benja Fallenstein
:Date: 2003-04-01
:Revision: $Revision: 1.1 $
:Last-Modified: $Date: 2003/03/31 09:37:41 $
:Type: Architecture
:Scope: Major
:Status: Current
We need a canonical file format for storing data in CVS:
canonical so that diffs show only differences in structure,
not changes that arise because one RDF writer
orders triples differently from another.
This format is also a potential candidate
for storing versions of RDF graphs in Storm.
This PEG specifies such a format.
Specification
=============
The name of the format is *Canon3*. This version is identified
by the URI <http://fenfire.org/2003/Canon3/1.0>. It is related to
both `Notation 3`_ and `NTriples`_. Canon3 files
are encoded as UTF-8, normalized to Unicode `Normalization Form C`_.
They obey the following grammar::

    document ::= header (triple)*
    header ::= "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>" NEWLINE
    triple ::= subject " " property " " object "." NEWLINE
    subject ::= URItoken | anonNode
    property ::= URItoken
    object ::= URItoken | anonNode | literal
    URItoken ::= "<" URIref ">"
    anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*
    literal ::= #x22 #x22 #x22 string #x22 #x22 #x22 qualifiers
    qualifiers ::= ("@" language)? ("^^" URItoken)?

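To make the grammar concrete, here is a minimal Python sketch that assembles a document from already-formed tokens. The helper names (``uri_token``, ``anon_node``, ``triple_line``, ``document``) are made up for illustration and are not part of the proposal; the sketch assumes the caller passes triples that are already ordered and free of duplicates, and it always writes LF as ``NEWLINE``.

```python
import re

# Header line exactly as given in the `header` production.
HEADER = "# Canon3 <http://fenfire.org/2003/Canon3/1.0/>"

# anonNode identifiers: a letter followed by letters or digits.
ANON_ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def uri_token(uriref: str) -> str:
    """URItoken ::= "<" URIref ">" (the URIref is assumed to be valid)."""
    return "<" + uriref + ">"


def anon_node(ident: str) -> str:
    """anonNode ::= "_:" [A-Za-z][A-Za-z0-9]*"""
    if not ANON_ID.fullmatch(ident):
        raise ValueError("invalid anonymous node identifier: %r" % ident)
    return "_:" + ident


def triple_line(subject: str, prop: str, obj: str) -> str:
    """triple ::= subject " " property " " object "." NEWLINE"""
    return subject + " " + prop + " " + obj + ".\n"


def document(triple_lines) -> str:
    """document ::= header (triple)*, with triples assumed pre-sorted."""
    return HEADER + "\n" + "".join(triple_lines)
```

Note that the grammar puts no space between the object and the closing ``.``, which is why ``triple_line`` appends ``".\n"`` directly.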
The ``NEWLINE`` token may be any of CR, LF, or CRLF.
(This is necessary for CVS to be useful across platforms.)
In contexts where the specific form used matters,
the newline character is LF; in particular, when computing
a content hash (e.g., when creating a Canon3 Storm block).
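The newline rule can be sketched as follows: convert CR and CRLF to LF before hashing, so that the same graph checked out on different platforms yields the same hash. The function names are hypothetical, and SHA-1 is used only as a stand-in; this document does not say which hash function Storm uses.

```python
import hashlib


def normalize_newlines(data: bytes) -> bytes:
    """Rewrite CRLF and bare CR as LF (CRLF first, so it is not split in two)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def content_hash(data: bytes) -> str:
    """Hash the LF-normalized bytes. SHA-1 is an assumption for illustration."""
    return hashlib.sha1(normalize_newlines(data)).hexdigest()
```

With this, ``content_hash(b"x\r\ny\r\n")`` and ``content_hash(b"x\ny\n")`` are equal.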
The triples must be ordered. Two triples are compared
by comparing their subjects, properties, and objects
in this order. Each of these parts is compared
as follows:
- Literals are lower than (go before) URIrefs,
which go before anonymous nodes.
- URIrefs are compared character-by-character,
in the form defined in [RFC 2396]
(i.e., *after* Unicode characters outside
the ASCII range have been escaped).
Characters are compared by Unicode code point
value.
- Literals are compared character-by-character
in their unescaped form (i.e., before the
backslash escaping defined below). If the
strings of two literals are equal, first
the language tag and then the data type,
if any, are compared in the same manner.
Literals without language tags/data types
go before literals with them (if the
contents of the literals are equal).
- Anonymous nodes are compared by their
internal identifiers (the part following
the ``_:``), also character-by-character.
A triple may be listed only once: if the graph to be
serialized contains two equal triples, that
triple must occur only once in the serialization.
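The ordering and uniqueness rules can be sketched as a sort key. The classes ``URI``, ``Anon``, and ``Literal`` and the functions below are hypothetical names for this illustration only. The sketch assumes that URIrefs are already in their escaped RFC 2396 form, that literal text is held unescaped, and it relies on Python comparing strings by Unicode code point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class URI:
    ref: str  # assumed to be already escaped per RFC 2396


@dataclass(frozen=True)
class Anon:
    ident: str  # the part following "_:"


@dataclass(frozen=True)
class Literal:
    text: str  # unescaped string contents
    language: Optional[str] = None
    datatype: Optional[str] = None  # datatype URIref, if any


def term_key(term):
    """Literals sort before URIrefs, which sort before anonymous nodes."""
    if isinstance(term, Literal):
        # Equal text: compare language tag, then datatype; absent sorts first.
        return (0, term.text,
                term.language is not None, term.language or "",
                term.datatype is not None, term.datatype or "")
    if isinstance(term, URI):
        return (1, term.ref)
    return (2, term.ident)


def triple_key(triple):
    """Compare subject, then property, then object."""
    subject, prop, obj = triple
    return (term_key(subject), term_key(prop), term_key(obj))


def canonical_order(triples):
    """Drop duplicate triples, then sort into the order described above."""
    return sorted(set(triples), key=triple_key)
```

Frozen dataclasses are used so that a URI and an anonymous node with the same string never compare equal when duplicates are removed.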
``URIref`` is a URI reference as defined in [RFC 2396].
Percent escapes (e.g. ``%2f``) should preferably
be encoded in lower case. URIref may be one of the following:
1. An absolute URI (e.g., ``http://example.org/``).
2. An absolute URI plus a fragment identifier
(e.g., ``http://example.org/#foo``).
3. The empty URI reference (which is a relative URI
referring to the current document).
4. A standalone fragment identifier (e.g., ``#foo``),
referring to a fragment of the current document.
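As a rough sketch of the four permitted forms, the following Python classifies a URIref and lower-cases its percent escapes. Both function names are invented for this example, and the check is deliberately loose: it only looks for a scheme and a ``#``, and does not validate the reference against RFC 2396.

```python
import re
from urllib.parse import urlsplit


def lowercase_percent_escapes(uriref: str) -> str:
    """Rewrite escapes such as %2F as %2f, leaving everything else untouched."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).lower(), uriref)


def uriref_form(uriref: str) -> str:
    """Return which of the four permitted forms a URIref takes."""
    if uriref == "":
        return "empty"  # refers to the current document
    if uriref.startswith("#"):
        return "fragment"  # fragment of the current document
    if not urlsplit(uriref).scheme:
        # Other relative references are not among the permitted forms.
        raise ValueError("relative URI reference not allowed: %r" % uriref)
    return "absolute+fragment" if "#" in uriref else "absolute"
```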
``language`` is a Language-Tag as defined by [RFC 3066].
A ``string`` is any UTF-8 character sequence
encoded in the following way:
- Double any backslash in the string.
- Insert a backslash before the first of any three
consecutive double quotes (#x22) in the string.
(That is, in a sequence of three or more
double quote characters, insert a backslash
before all but the last two double quotes.)
For example, the string ``f\oo"""""ba"r`` becomes
``f\\oo\"\"\"""ba"r``.
Strings may contain newlines. Like all of Canon3,
they are encoded in Normalization Form C.
They are enclosed in triple double quotes
(see production ``literal``).
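The two escaping steps can be sketched in Python as below. The function names are hypothetical. The sketch applies the rules exactly as stated above (double the backslashes, then escape all but the last two quotes in every run of three or more) and normalizes to Normalization Form C first; nothing beyond those rules is handled.

```python
import re
import unicodedata


def escape_canon3_string(s: str) -> str:
    """Apply the backslash and double-quote escaping described above."""
    s = s.replace("\\", "\\\\")  # step 1: double every backslash

    def escape_run(match):
        # Step 2: in a run of n >= 3 quotes, escape the first n - 2.
        run_length = len(match.group(0))
        return '\\"' * (run_length - 2) + '""'

    return re.sub(r'"{3,}', escape_run, s)


def canon3_literal(text, language=None, datatype=None):
    """Build a `literal` token: triple-quoted string plus optional qualifiers."""
    body = escape_canon3_string(unicodedata.normalize("NFC", text))
    token = '"""' + body + '"""'
    if language is not None:
        token += "@" + language
    if datatype is not None:
        token += "^^<" + datatype + ">"
    return token
```

Applied to the example above, ``escape_canon3_string`` turns ``f\oo"""""ba"r`` into ``f\\oo\"\"\"""ba"r``.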
We will register a MIME type for Canon3.
\- Benja
.. _Normalization Form C: http://www.unicode.org/unicode/reports/tr15/
.. _NTriples: http://www.w3.org/TR/rdf-testcases/#ntriples
.. _Notation 3: http://www.w3.org/DesignIssues/Notation3.html