Re: [lmi] an xml schema for (single|multiple)_cell_document file XML format
Tue, 10 Aug 2010 10:41:36 +0000
On 2010-08-09 16:28Z, Vadim Zeitlin wrote:
> On Sun, 08 Aug 2010 16:28:55 +0000 Greg Chicares <address@hidden> wrote:
> GC> On 2007-12-27 12:43Z, Evgeniy Tarassov wrote:
> GC> >
> GC> > The newer version of XML Schema files for cns/ill files could be
> GC> > downloaded from the lmi project download area at savannah:
> GC> > | http://download.savannah.nongnu.org/releases/lmi/cell_document.tar.bz2
> GC> If we were doing this all over today, would XML Schema still be a good
> GC> choice, or is something else like RELAX NG or Schematron clearly better now?
> I'm not aware of any dramatic changes in the XML validation area in the
> last 3 years so I'd be tempted to say no, i.e. that XML Schema still
> remains a decent choice because even though RELAX NG has its advantages
> over it (notably relative simplicity) it's still less standard/supported by
> various tools than it. As for Schematron, I believe it's mostly used in
> addition to either XML Schema or RELAX NG and not solely on its own anyhow.
> I could look more into recent developments in this area but, frankly, I
> doubt that we're going to find any earth shattering revelations. IMHO it
> would make sense to stick with XML Schema even if subjectively I like RELAX
> NG "compact syntax" (http://en.wikipedia.org/wiki/RELAX_NG#Compact_syntax)
> a lot.
Okay, we'll stick with XSD.
This is somewhat related to the census manager, so let me say how I plan
to address the OP's other points, in case there is any objection.
| Issue A (major):
| The current format contains only 'cell' nodes which represent cases,
| class and cells. To specify the number of nodes of each type helper nodes
| 'NumberOfCases' and 'NumberOfClasses' are used. Each of 'NumberOfXXX'
| is a positive integer number N, which is followed by exactly
| N 'cell' nodes.
The 'NumberOf' elements aren't really appropriate in xml.
| A simple workaround would be to rename the 'cell' nodes into
| the corresponding cell type: 'case', 'class', 'cell'. This allows
| to fix the document node structure and to get rid of the redundant
| nodes 'NumberOfXXX'.
These three categories must be distinguished somehow. I'm inclined to add
an attribute or a subelement. Changing the main element tag seems drastic.
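To make the attribute option concrete, here is a minimal sketch. The attribute name 'kind', its three values, and the enclosing 'multiple_cell_document' root are assumptions chosen for illustration, not the current file format. The point is only that, once each 'cell' says what it is, the counts can be derived and the 'NumberOf' elements become unnecessary.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the 'kind' attribute and its values are assumptions,
# not part of the existing cns/ill format.
DOC = """
<multiple_cell_document>
  <cell kind="case"/>
  <cell kind="class"/>
  <cell kind="class"/>
  <cell kind="cell"/>
  <cell kind="cell"/>
  <cell kind="cell"/>
</multiple_cell_document>
"""

def count_by_kind(xml_text):
    """Return {kind: count} for every <cell> child of the root element."""
    root = ET.fromstring(xml_text)
    counts = {}
    for cell in root.findall("cell"):
        kind = cell.get("kind")
        counts[kind] = counts.get(kind, 0) + 1
    return counts

counts = count_by_kind(DOC)
```

A subelement would work the same way, with `cell.findtext(...)` in place of `cell.get(...)`.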
| Issue B (minor):
| Most of the elements that represent an array/list/sequence are stored as
| a single string with items separated with spaces.
| The bruteforce approach solves the issue by supplying complex regular
| expressions for each sequence type. But if changing current format
| of cns/ill files could be considered, then sequence elements could be
| properly represented by a node with children nodes (instead of a single
| string) which will allow rather simple validation of array/list/sequence
| items separately.
For input sequences, generality and expressive power are important: e.g.,
10000, retirement; 0
in the 'case' cell (and replicated to the others) may suffice to specify
the premium pattern for an entire census. If that's difficult to validate
with XSD, so be it.
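As a rough illustration of why such a string does not map onto a flat list of child nodes, here is a simplified sketch that splits the example above into segments. It assumes semicolons separate segments and a comma separates a value from the point where its segment ends; that is a simplification for illustration, not lmi's actual input-sequence grammar, and split_sequence is a hypothetical helper.

```python
def split_sequence(text):
    """Split 'value, until; value' into (value, until) pairs.

    Assumed, simplified grammar: ';' separates segments, and an optional
    ',' separates a segment's value from its end point. A segment with no
    end point gets None.
    """
    segments = []
    for part in text.split(";"):
        value, _, until = part.partition(",")
        segments.append((value.strip(), until.strip() or None))
    return segments

segments = split_sequence("10000, retirement; 0")
```

The end point here is a keyword rather than a number, which is the kind of expressiveness that a per-item child-node representation would have to give up or reinvent.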
| Issue C (unsure/major):
| It is impossible to force two strings to have the same number of words.
| But this seems to be a common validity constraint in xxx_cell_document.
| The current schema ignores these constraints and does let nodes
| representing sequence to have any number of items.
Likewise: so be it.
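If such a constraint ever did need checking, it could be done outside the schema after parsing. A minimal sketch, in which the element names 'Ages' and 'Amounts' are hypothetical placeholders rather than names taken from the real documents:

```python
import xml.etree.ElementTree as ET

def same_word_count(cell, tag_a, tag_b):
    """True if two space-separated sequence elements have equally many items."""
    a = cell.findtext(tag_a, default="")
    b = cell.findtext(tag_b, default="")
    return len(a.split()) == len(b.split())

# Hypothetical element names, for illustration only.
good = ET.fromstring(
    "<cell><Ages>35 36 37</Ages><Amounts>100 200 300</Amounts></cell>")
bad = ET.fromstring(
    "<cell><Ages>35 36 37</Ages><Amounts>100 200</Amounts></cell>")
```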
| Issue D (minor):
| Enum element values could contain '_' instead of spaces (' ').
In the past, they could. Now we generally avoid that; for instance, solve
"CSV = tax basis"
which make sense to end users, who would find "Avoid_MEC" weird.