Re: [GMG-Devel] MediaGoblin, metadata, and json-ld

Hello all,

There's been a lot of metadata stuff going around recently... lots of
reasons to think about it, so here's a big braindump. But first, a
rationale for why metadata stuff is of such interest:

- As some of you know, Natalie Foust-Pilcher (aka tilly-q) is working
on a branch that extends MediaGoblin to be more useful for archival
and academic institutions. This includes lots of metadata related
stuff.
- Federation related planning
- We've always had metadata for files, but it's always been super
loose and not well defined at all. How to make the meaning of this
metadata clearer?

Elrond and I recently sat down to talk about lots of this, informed by
conversations Natalie and I had already been having (along with
some other people). We agreed I should do a write-up of those ideas
and post them here. So! Here goes.

Where to store media metadata?
==============================

It seems logical that there are two levels to things:

- Media entry level metadata, for example:
  + The original author
  + The date it was published

- File-specific metadata, for example:
  + resolution of this specific file
  + encoding rate
  + etc

As such, it makes sense that we'd store this on the MediaEntry and
MediaFile tables respectively.

We have the latter field already, though barely used, in the
MediaFile.file_metadata attribute. This uses a json dictionary to
store metadata, and seems fine. Probably we will want to do the same
thing for the MediaEntry... MediaEntry.metadata should be fine.
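
To make the two levels concrete, here's a minimal sketch in plain
Python. The classes are stand-ins and not MediaGoblin's actual
models, the `metadata` attribute on the entry is the proposed field
rather than something that exists today, and the example keys and
values are made up for illustration:

```python
import json


class MediaFile:
    """Stand-in for one stored file belonging to a media entry."""

    def __init__(self, name, file_metadata=None):
        self.name = name
        # File-specific facts: resolution, encoding rate, etc.
        self.file_metadata = file_metadata or {}


class MediaEntry:
    """Stand-in for a media entry that may own several files."""

    def __init__(self, title, metadata=None):
        self.title = title
        # Entry-level facts: original author, publication date, etc.
        self.metadata = metadata or {}
        self.files = []


def public_metadata(entry):
    """Collect both levels of metadata for one entry."""
    return {
        "entry": entry.metadata,
        "files": {f.name: f.file_metadata for f in entry.files},
    }


entry = MediaEntry(
    "Goblin talk",
    metadata={"author": "Example Author", "published": "2014-05-01"},
)
entry.files.append(MediaFile("webm_640", {"width": 640, "height": 360}))
entry.files.append(MediaFile("original", {"width": 1920, "height": 1080}))

# Both levels round-trip through json, as they would in a text column.
serialized = json.dumps(public_metadata(entry), sort_keys=True)
restored = json.loads(serialized)
```

Note that one entry carries a single set of entry-level facts, while
each of its files has its own dictionary.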

Elrond and I were talking: we think that everything in
MediaEntry.metadata and MediaFile.file_metadata should be assumed to
be public knowledge. Assuming someone has permission to view the
media entry, they should have access to all the metadata in these
fields. This means that if plugins want to store extra data that is
private, they can store it in their own plugin extension tables, more
or less as we do now. This assumption will possibly simplify
federation, and will allow us to expose this information to those who
might find it useful.

On json-ld
==========

So, one question comes up: how to actually distinguish what all
these different things mean? Does "size" mean something like "medium"
or does it mean something like a width-by-height value? json is
very loose, and thus it's possible for definitions to conflict!

As such, we are looking to adopt a pretty cool system called json-ld:

http://json-ld.org/

This system allows you to provide "contexts" for the data you're
providing. A context makes clear what each key means, and also
lets you supply the "types" of values and so on. It's pretty cool!
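
Here's a rough idea of what a context buys you, as a sketch in plain
Python. The IRIs are example.org placeholders, not a published
schema, and `naive_expand` is a deliberately simplified stand-in I
made up for illustration: it only handles flat documents whose
context maps each term to a plain string, which is far less than
real json-ld processing does.

```python
import json

# Hypothetical context: these IRIs are placeholders, not a real schema.
CONTEXT = {
    "author": "http://example.org/terms/author",
    "size": "http://example.org/terms/pixelDimensions",
}

document = {
    "@context": CONTEXT,
    "author": "Example Author",
    "size": "640x360",
    "mystery": "this key has no definition in the context",
}


def naive_expand(doc):
    """Replace each short key with the IRI its context maps it to.

    Handles only flat documents with plain string term definitions;
    typed values, nesting and lists are out of scope for this sketch.
    """
    context = doc.get("@context", {})
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        iri = context.get(key)
        # Keys the context does not define are dropped in this sketch.
        if iri is not None:
            expanded[iri] = value
    return expanded


expanded = naive_expand(document)
print(json.dumps(expanded, indent=2, sort_keys=True))
```

After expansion, "size" is no longer ambiguous: it is whatever the
IRI it maps to says it is. With pyld the real thing would be
`jsonld.expand(document)`, whose output is more verbose than this.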

The API spec is also pretty easy to read:

http://www.w3.org/TR/json-ld-api/

A good way to play around is to start using pyld, which is what we're
going to be using:

https://github.com/digitalbazaar/pyld

Another advantage to having things within a context is that it means
we can store all kinds of metadata... plugins may even define their
own metadata, and that's just fine... we can even remember what that
field they put in means if they can provide a context we can check.
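
As a sketch of how contexts keep plugin metadata from stepping on
ours: two keys mean the same thing only if they map to the same IRI.
Everything below is hypothetical (the contexts, the IRIs, and the
helper names); it's just plain Python dictionaries, not pyld.

```python
# Hypothetical base context shipped with the core.
BASE_CONTEXT = {
    "author": "http://example.org/mediagoblin/author",
    "size": "http://example.org/mediagoblin/pixelDimensions",
}

# A hypothetical plugin that uses "size" to mean small/medium/large.
PLUGIN_CONTEXT = {
    "size": "http://example.org/plugin/sizeLabel",
    "license": "http://example.org/plugin/license",
}


def find_collisions(base, extra):
    """Return terms defined in both contexts with different meanings."""
    return sorted(
        term for term in base.keys() & extra.keys()
        if base[term] != extra[term]
    )


def same_meaning(term_a, context_a, term_b, context_b):
    """Two keys mean the same thing only if they map to the same IRI."""
    iri_a = context_a.get(term_a)
    iri_b = context_b.get(term_b)
    return iri_a is not None and iri_a == iri_b


collisions = find_collisions(BASE_CONTEXT, PLUGIN_CONTEXT)
print(collisions)
```

Here the short name "size" collides, but since each side supplies a
context we can still tell the two meanings apart.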

So where to put the context? Well a few things: it looks like you can
supply a context locally in json-ld, but you can also refer to an
external context definition.

I think we'll want to provide a base context of some of the most basic
terms we're using in MediaGoblin. We can probably install this in the
MediaGoblin package and have mediagoblin.org pull that out of the
MediaGoblin repository and serve it. (I can probably write a makefile
command in the mediagoblin-website repository to automate this.)

As for versioning the schema, we may want to do something like the
following:

  http://mediagoblin.org/schema/0.1/draft/media_entry.jsonld
    <- in flux
  http://mediagoblin.org/schema/0.1/final/media_entry.jsonld
    <- the "final" version we release

We may want to have separate contexts for files and for entries.
I'm not sure yet.
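
The URL layout above could be generated by something as small as
this. It's only a sketch: the function name and the "draft"/"final"
check are my own assumptions, and "media_file" is a hypothetical
name for a separate file-level context that hasn't been decided on.

```python
BASE_URL = "http://mediagoblin.org/schema"
STAGES = ("draft", "final")


def context_url(version, stage, name="media_entry"):
    """Build the URL of a versioned context document."""
    if stage not in STAGES:
        raise ValueError("stage must be one of %s" % (STAGES,))
    return "%s/%s/%s/%s.jsonld" % (BASE_URL, version, stage, name)


draft_url = context_url("0.1", "draft")
final_url = context_url("0.1", "final")
# Hypothetical: only if files end up with their own context.
file_url = context_url("0.1", "draft", name="media_file")
```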

As for validation, I would think that json-ld should be able to
check that values actually match the types declared for them.
However, Manu Sporny (who knows more about this than anyone) suggested
we actually look at json-schema. I don't know why we can't do it all
in one thing. That needs more investigation. Anyway:

http://json-schema.org/
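
For a feel of what schema-style validation adds on top of a context,
here's a toy checker in plain Python. It understands only `required`
and a per-property `type`, which is a tiny subset of what json-schema
describes. The schema and its keys are invented for illustration,
and in practice we'd more likely use an existing json-schema
validator than write our own.

```python
# Maps json-schema type names to Python types (toy subset only).
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "object": dict,
    "array": list,
    "boolean": bool,
}

# Hypothetical schema for file-level metadata.
FILE_METADATA_SCHEMA = {
    "required": ["width", "height"],
    "properties": {
        "width": {"type": "integer"},
        "height": {"type": "integer"},
        "codec": {"type": "string"},
    },
}


def validate(data, schema):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for key in schema.get("required", []):
        if key not in data:
            errors.append("missing required key: %s" % key)
    for key, rules in schema.get("properties", {}).items():
        if key not in data:
            continue
        expected = TYPE_MAP[rules["type"]]
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        bool_mismatch = isinstance(value, bool) and rules["type"] != "boolean"
        if bool_mismatch or not isinstance(value, expected):
            errors.append("%s should be of type %s" % (key, rules["type"]))
    return errors


good = validate({"width": 640, "height": 360}, FILE_METADATA_SCHEMA)
bad = validate({"width": "640"}, FILE_METADATA_SCHEMA)
```

The context tells you what "width" means; a schema like this tells
you whether a given value is acceptable for it.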

On federation
=============

This stuff becomes more relevant also when we get to the point of
federation. Since we're using the Pump API, everything is about
posting around ActivityStreams objects.

But what happens if part of an activity isn't defined? For example,
there are object types in the activitystreams spec for audio and
video, but not for slideshow/presentation. It might be this never
ends up being too serious of an issue for us... we might just make it
by on the basic definitions somehow... but I want to be sure that if
there's data we're providing on activity types that aren't defined
that we have a way of clearly expressing them without running into
naming collisions (eg, "presentation" means two things if you think
about a conference speaker... it means both the way they're presenting
their talk, but it also means the actual presentation content).

We should be clear on what we mean... it seems to me that json-ld is
probably our best shot here.

By the way, I asked Evan Prodromou what he thought about posting media
types that weren't defined clearly in the spec, and he suggested maybe
using the file object and maybe identifying by mimetype. I'm not so
keen on that idea... even though we use mimetype for detection, it can
mean different things potentially still (we aren't just a file sharing
application, we're a media *publishing* application, so presentational
format is important) and media entries may actually contain multiple
files. I asked about json-ld and got a "you can do that" response.
I'm not sure if Evan thinks it's a great idea or not, but we can do
it. ;)

A note on future Postgres stuff
===============================

So, the current field as it's stored in MediaGoblin is marshalled into
a text SQL field. However, there's cool stuff that's happening in the
postgres world around json:

http://clarkdave.net/2013/06/what-can-you-do-with-postgresql-and-json/

So, you can do queries in postgres that involve json in a really nice
way, and even do some indexing on the json fields' subfields (as we
used to be able to do in Mongo when we used that). We don't have a
way to do this with metadata marshalled as json into a text field, and we don't
really need it yet (we can always duplicate those fields into other
types if need be in the meanwhile), but maybe in the future it would
be nice.
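
The "duplicate those fields" workaround might look roughly like this.
It's a sketch using the sqlite3 module from the standard library; the
table and column names are made up and are not MediaGoblin's actual
schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media_entry ("
    " id INTEGER PRIMARY KEY,"
    " metadata_json TEXT NOT NULL,"  # the full metadata, marshalled to text
    " author TEXT)"                  # duplicated copy of one subfield
)
conn.execute("CREATE INDEX media_entry_author_idx ON media_entry (author)")


def add_entry(conn, metadata):
    """Store the metadata as json text and mirror 'author' into a column."""
    conn.execute(
        "INSERT INTO media_entry (metadata_json, author) VALUES (?, ?)",
        (json.dumps(metadata), metadata.get("author")),
    )


def entries_by_author(conn, author):
    """Look entries up via the indexed column, then decode the json."""
    rows = conn.execute(
        "SELECT metadata_json FROM media_entry WHERE author = ? ORDER BY id",
        (author,),
    )
    return [json.loads(row[0]) for row in rows]


add_entry(conn, {"author": "Example Author", "published": "2014-05-01"})
add_entry(conn, {"author": "Someone Else"})
add_entry(conn, {"author": "Example Author", "published": "2014-04-12"})

found = entries_by_author(conn, "Example Author")
```

The cost is that every queryable subfield needs its own column and
has to be kept in sync with the json, which is exactly what native
json querying would save us from.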

Particularly, in the future it looks like json will be *really*
efficient in postgres thanks to jsonb support:

http://www.craigkerstiens.com/2014/03/24/Postgres-9.4-Looking-up/

SQLAlchemy supports this stuff now, it looks like:

http://docs.sqlalchemy.org/en/rel_0_9/dialects/postgresql.html#sqlalchemy.dialects.postgresql.JSON

This means if we moved to postgres-only, we could get some *really
neat and efficient* json querying tools. Dropping sqlite would be a
huge decision though. One of the reasons we moved to SQL was *because
of* sqlite making it easy to run small instances. But on the other
hand, sqlite has been a huge headache:

http://dustycloud.org/blog/sqlite-alter-pain/

Anyway, we don't have to make any decisions on this soon.

One final note
==============

Many of the things discussed above are /already being implemented!/
The grant that Natalie Foust-Pilcher is working on is implementing
much of the above metadata stuff.

Speaking of which, I now need to review the present state of the code.
I had planned on jumping into research-cation after this, but I think
I'll be jumping into a brief sprint to try to implement the above,
mostly working off of the hard work that Natalie has already done.
(Thank you Natalie!)

Thoughts are welcome, encouraged!
- Chris

From:	Jason Li
Subject:	Re: [GMG-Devel] MediaGoblin, metadata, and json-ld
Date:	Fri, 2 May 2014 12:20:46 +0800