Re: [GMG-Devel] MediaGoblin, metadata, and json-ld
From: Christopher Allan Webber
Date: Mon, 05 May 2014 16:26:39 -0500
User-agent: mu4e 0.9.9.6pre2; emacs 24.1.50.1
To follow up to myself, this is some stuff related to research into the
metadata branch related to what Natalie is doing for the
academic/metadata/archival institutions grant. I'm crossposting,
basically, but I've taken out a bit specific to that discussion.
If you don't care about metadata, feel free to ignore :)
- What to accept
I think Mike is right that we should recognize all the stuff from
known prefixes at http://www.w3.org/2011/rdfa-context/rdfa-1.1
I think this won't be hard; at minimum we can just keep a list of
these prefixes mapped to their URLs and whitelist anything that comes
in from there.
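A minimal sketch of that prefix list in Python, just to show the shape
of it. The dict below is only a hand-picked subset of the prefixes in
the rdfa-1.1 context, and the names KNOWN_PREFIXES, expand_key and
filter_known are made up for illustration; none of this is existing
MediaGoblin code.

```python
# Hypothetical subset of the prefixes listed at
# http://www.w3.org/2011/rdfa-context/rdfa-1.1 (the real list is longer).
KNOWN_PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
    "cc": "http://creativecommons.org/ns#",
}


def expand_key(key):
    """Return the full URL for a 'prefix:term' key, or None if the
    prefix is not one we recognize."""
    prefix, sep, term = key.partition(":")
    if not sep or not term or prefix not in KNOWN_PREFIXES:
        return None
    return KNOWN_PREFIXES[prefix] + term


def filter_known(metadata):
    """Keep only the entries whose key uses a recognized prefix."""
    return {k: v for k, v in metadata.items() if expand_key(k) is not None}
```

Something like filter_known({"dc:title": "x", "evil:thing": "y"}) would
then keep only the dc:title entry.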
- Does MediaGoblin need its own context/schema?
I think the answer is, "not yet". Using and supporting the above
contexts seems like it would work just fine.
We should probably just provide a single context that combines the
above RDFa prefixes.
The next bit deserves its own section:
Validation and types
====================
The problem that's been trickiest so far has been figuring out how
strict to be about incoming data. Natalie asked me for clarification
on this, and I've spent a while thinking about it and doing research.
Here are the issues:
- Should we be providing typeof= stuff in the RDFa? It would be
nice, but seems not really necessary.
- Should we be collecting a full list of types for all the data? It
seems like there's no friendly, ready-to-use collection of contexts
containing the types for vocabularies like Dublin Core,
schema.org, etc. So we would have to build such a thing ourselves.
Note: if we did have that though, using jsonld's expand api, it
would be easy to get what's presumably the types of each
item... (though only if one type is possible; multiple types are
seemingly allowed in a context)
- Validation is the big question though. Should we be making sure
that incoming data /really is/ of the type that's coming in? Also,
given that types are possibly provided in a json-ld context,
couldn't we just check against those?
We had a couple of conversations with Manu Sporny about this:
- He said there's no known linter for json-ld data based on its
context.
- He initially suggested using json-schema to do validation
instead. (Natalie looked at implementing this, though she's run
into some frustrations and suggested maybe if we do validation
we should do it in python-land instead... more thoughts on that
below.)
- When I came back with further questions, Manu clarified that
json-ld is meant to be pretty loose with things that are wrong;
in his own applications they use json-schema to check the
general things, and for logical stuff use actual code.
Thinking about that, it seems like we can either do a kind of
whitelist or greylist type validation. Kind of muddling those terms,
but by this I mean for whitelist, we accept *only* the types of things
that we know how to check the type of. Anything else, we toss out.
By greylist, I mean we can just enforce the type on things that
actually, really matter.
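To make the difference concrete, here is a rough Python sketch of the
two modes. The field names in CHECKERS and the validate() function are
hypothetical, picked only for illustration; which fields "really
matter" is not decided here.

```python
def is_nonempty_string(value):
    return isinstance(value, str) and value.strip() != ""


# Hypothetical: the few fields whose type we know how to check.
CHECKERS = {
    "dc:title": is_nonempty_string,
    "dc:creator": is_nonempty_string,
}


def validate(metadata, mode="greylist"):
    """Split metadata into (accepted, rejected) dicts.

    whitelist: fields we have no checker for are rejected.
    greylist:  fields we have no checker for are accepted unchecked.
    In both modes, a field that has a checker must pass it.
    """
    accepted, rejected = {}, {}
    for key, value in metadata.items():
        checker = CHECKERS.get(key)
        if checker is None:
            if mode == "whitelist":
                rejected[key] = value
            else:
                accepted[key] = value
        elif checker(value):
            accepted[key] = value
        else:
            rejected[key] = value
    return accepted, rejected
```

The only difference between the two modes is what happens to a field
with no checker, which is exactly the tradeoff discussed here.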
It seems to me that if we're really going to allow the full list of
things from:
http://www.w3.org/2011/rdfa-context/rdfa-1.1
Doing the whitelist-only approach would be a *huge* task. We would
basically have to implement our own linter that knew how to process all
sorts of stuff... and when you consider stuff like schema.org, that
would be a lot:
http://schema.org/docs/full.html
I don't think that's worth our time. So we really would be accepting
a very finicky, narrow set of things if we do whitelisting. I'm not
sure that's very satisfying...
So it seems we're stuck with greylisting. That's probably okay; we
can type-validate the things that matter. The big risk here is that
we might accept some stuff whose type we didn't verify for now, and
at some future time we start using it and need it to be a lot
better defined. That might break some future code, and that would be
lame.
There's one possible way to get around that; if we ever run into the
future situation where we have to validate old metadata, we could have
the metadata validated at access time. We could rubber stamp the
stuff that looks good, and remove and "quarantine" the stuff that
doesn't fit our standards.
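As a rough sketch of what access-time validation could look like,
assuming a stored record is just a dict with a "metadata" key. The
"validated" and "quarantine" keys and the load_metadata() name are
invented for this example and don't correspond to any existing model.

```python
def load_metadata(record, checkers):
    """Validate stored metadata lazily, the first time it is read.

    `record` is a hypothetical dict with a "metadata" key; `checkers`
    maps field names to functions returning True for acceptable values.
    Fields with no checker are passed through unchecked.
    """
    if record.get("validated"):
        # Already rubber-stamped on an earlier access.
        return record["metadata"]

    good, bad = {}, {}
    for key, value in record["metadata"].items():
        checker = checkers.get(key)
        if checker is None or checker(value):
            good[key] = value
        else:
            bad[key] = value

    record["metadata"] = good
    # Keep the rejected values around instead of silently dropping them.
    record.setdefault("quarantine", {}).update(bad)
    record["validated"] = True
    return good
```

Old records would only pay the validation cost once, and nothing gets
thrown away for good, since the quarantined values stay on the record.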
We might never hit that stage, but I think I've sufficiently
carefully overthought all that now. :)
Basically at the moment, I think there's no need to implement
something as heavy as json-schema as Yet Another Dependency. We only
look at a few fields, and those we can validate in python.