[Gzz] Storm blocks and metadata (Re: P2P and RDF)
From: Benja Fallenstein
Subject: [Gzz] Storm blocks and metadata (Re: P2P and RDF)
Date: Tue, 25 Mar 2003 11:22:34 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030319 Debian/1.3-3
Hi Reto,
[drifting towards off-topic, but leaving on www-rdf-interest for now
because it still concerns the use of RDF]
Reto Bachmann-Gmuer wrote:
I think this is a very good approach; you could use Freenet
content-hash URIs to identify the blocks.
We'll probably register our own URN namespace, among other reasons
because we want to use 'real,' registered URIs. (We're also
considering putting a MIME content type in the URI, so that a block
served up through our system would be basically as useful as a file
retrieved through HTTP, and so that we could easily serve blocks
through an HTTP proxy, too. Not yet decided, though-- some people I've
contacted commented that MIME types do not belong in URIs.)
hmm, I don't see the analogy with HTTP, since HTTP URLs should not
contain a content-type indicator but leave it to browser and
server to negotiate the best deliverable content type. Of course your
case is different, since your URI immutably references a sequence of
bytes.
Yes, that would have been my argument. However, you make a good point
below: If we refer to an RDF 'metadata' block containing the URI of the
actual block, we can include references to alternative versions-- even
allowing some degree of content negotiation. This is something I have to
mull about :-)
I strongly disagree with putting the MIME type into the URL,
because the MIME type is meta-information for which I see no reason to
be treated differently than other meta-information,
It is necessary for the interpretation of the data we get, and it's
usually easy to agree on (people won't often assign different MIME
types to the same bytes). One thing about content hashes is that when
two people put the same file into a hash-based system, they will use
the same identifier for it. With MIME types, that's still pretty much
true; with more elaborate metadata, it isn't.
Using the same identifier is important for queries like, "Which
documents include this image?" If the three documents that use the image
use three different kinds of IDs for it (because they refer to three
different kinds of metadata), you're out of luck.
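The same-bytes-same-identifier property, and how putting a MIME type into the identifier weakens it, can be sketched in a few lines. This is only an illustration: the `urn:foo:content-hash` namespace and the position of the MIME-type field are hypothetical stand-ins for whatever scheme actually gets registered.

```python
import hashlib

def block_urn(data, mime_type=None):
    """Derive a content-hash URN for a block of bytes.

    'urn:foo:content-hash' is a placeholder namespace; the optional
    MIME-type field illustrates the debated variant of the scheme.
    """
    digest = hashlib.sha1(data).hexdigest()
    if mime_type is not None:
        return "urn:foo:content-hash:%s:%s" % (mime_type, digest)
    return "urn:foo:content-hash:%s" % digest

# Two people storing the same bytes get the same identifier...
a = block_urn(b"<html>...</html>", "text/html")
b = block_urn(b"<html>...</html>", "text/html")
assert a == b
# ...but disagreeing on the MIME type splits the namespace:
c = block_urn(b"<html>...</html>", "text/plain")
assert a != c
```

With richer metadata hashed into the name, the third document in the example above would mint yet another identifier for the very same image, and the "which documents include this image?" query falls apart.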
Rather theoretically, it is possible that the same sequence of bytes
(block) represents different contents when interpreted with a
different MIME type/encoding; should the host then store the block twice?
Up to the host. Since it *is* rather unlikely, I don't think there would
be big penalties to storing the block twice in this case. I wouldn't do
it anyway, but for a different reason: Other systems do not include the
MIME type in their hash-based identifiers, and we should be able to find
blocks and serve them to those systems even when we do not know the MIME
type.
Higher-level applications should not use block URIs anyway, but deal
with an abstraction representing the content (like HTTP URLs should).
You mean as in, with content negotiation applied? You use a single URI
which maps to different representations of the same resource?
An example to be more explicit:
<urn:urn-5:G7Fj> <DC:title> "Ulisses"
<urn:urn-5:G7Fj> <DC:description> "bla bli"
This, for example, I would not include here. :-) Firstly, it is
something I would want to be versioned independently: if I change the
description of an image, that should not create a new version of the
image. Secondly, I don't see a reason why the URI of the image would
need to refer to this. Thirdly, I don't think that when a file is put
into the system-- and thus given its identifier-- is necessarily the
time to create this kind of metadata. It would seem to hold up the task
at hand. Rather, I'd like to be able to add it later on, and maybe
someone else can do that even better than me-- like a librarian who has
scientific background in giving metadata about stuff.
It seems like you could easily put this data in another block without
losing much (assuming that the second block could be easily found
through an appropriate query).
<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash:jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash:Dj&/fjkZRT68>
<urn:urn-5:lG5d> <ex:englishVersion> <urn:urn-5:G7Fj>
<urn:urn-5:lG5d> <ex:spanishVersion> <urn:urn-5:kA2L>
These, on the other hand, are very good cases, because they can be used
by the computer in ways that require a certain level of trust: We want
to retrieve only the data that the referrer intended to be retrieved,
and we want to be able to check this cryptographically-- so this
actually needs to be part of what we protect cryptographically.
One technical side note, though. We'd have two types of URIs, something
like,
urn:foo:content-hash:jv24kt5
urn:foo:ref:rs53h85p
The first would be just a plain byte stream identified by a content
hash. The second would be a content hash, too, but we'd know that the
target should be interpreted as an RDF file with data like you give
above. Now, when we retrieve this block, we need to know at which node
we need to start looking to find the block we're interested in, so I
think we'd need to write this as something like,
<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash:jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash:Dj&/fjkZRT68>
<> <ex:englishVersion> <urn:urn-5:G7Fj>
<> <ex:spanishVersion> <urn:urn-5:kA2L>
I.e., "this resource" is <> (the empty URI reference) and we start
traversing the graph from there.
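A minimal sketch of that traversal, with the triples held as plain Python tuples as a stand-in for a real RDF store; the empty string plays the role of `<>`, the empty URI reference, and the URNs are the example values from above:

```python
# The 'ref' block's triples; "" stands for <>, the empty URI reference.
triples = [
    ("urn:urn-5:G7Fj", "ex:type", "ex:text"),
    ("urn:urn-5:G7Fj", "ex:utf8-encoding", "urn:content-hash:jhKHUL7HK"),
    ("urn:urn-5:G7Fj", "ex:latin1-encoding", "urn:content-hash:Dj&/fjkZRT68"),
    ("", "ex:englishVersion", "urn:urn-5:G7Fj"),
    ("", "ex:spanishVersion", "urn:urn-5:kA2L"),
]

def objects(subject, predicate):
    """All objects of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Start at <> and walk to a concrete content hash:
english = objects("", "ex:englishVersion")[0]
utf8_block = objects(english, "ex:utf8-encoding")[0]
assert utf8_block == "urn:content-hash:jhKHUL7HK"
```

Starting at `<>` gives the traversal a well-defined root, so the same ref block can be fetched by anyone and interpreted the same way.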
I found another use case for RDF metadata: Creative Commons licenses. It
would make sense to me if this would be part of the reference, allowing
the computer to automatically conclude how data may be copied and used.
In this example, applications should reference "urn:urn-5:G7Fj" (which
does not have a MIME type) rather than "urn:content-hash:Dj&/fjkZRT68"
(which has a MIME type in a specific context) wherever possible; in many
cases a higher abstraction, "urn:urn-5:lG5d", can be used.
Um, using a urn-5 doesn't work since it's just a random number-- if we
use just a random number, we cannot check whether the data we may
retrieve from a p2p network is really what the person making the
reference wanted us to see. We would need to use "urn:foo:ref:[blah]",
which would be the above RDF data, from which we could then get the
specific representation.
While you can
only deficiently use HTTP to serve a block,
Why?
you could serve the URIs of
both abstractions (urn:urn-5:G7Fj and urn:urn-5:lG5d) directly using
HTTP 1.1 features.
(Again, you'd have to use hashes, or you could be arbitrarily spoofed.)
But am I right that this makes RDF literals obsolete for everything
but small decimals?
Hm, why? :-)
well, why use a literal if you can make a block out of it, shortening
queries and unifying handling?
Ah, that depends on many factors. Speed is one; you may need to load a
lot of blocks to get the data for all the literals in a graph. Also, if
we store each block as a file on a file system, there are some file
systems that perform badly when faced with a large number of really
small files.
And how do you split the metadata in blocks
Well, depends very much on the application. How do you split metadata
into files? :-)
Not at all ;-). The splitting into files is rudimentarily represented
metadata; if you use RDF, the filesystem is a legacy application.
Um, but if you put metadata on an HTTP server, you split it too?
The rule of thumb is: Split it in units you would want to transfer
independently. E.g. in Annotea, you would make one block = one
annotation. When putting email into RDF, you might make one block = one
email. You might want to put your FOAF data in one block. If you have
metadata about many documents, you might make a metadata block for each
document you process. If you publish your personal TV recommendations
each week, you'd make one block each week.
Of course if the granularity doesn't fit the task at hand-- you want to
send a friend all love story recommendations of the last year-- the
computer can split up those blocks automatically and reassemble them in
a different way. It's just that for many applications a certain
granularity fits usage patterns pretty well-- for example, you'd most of
the time transmit an annotation as a whole. Then, if you've downloaded
an annotation once, you never need to download it again (that's one of
the benefits of putting them in blocks: you can cache them indefinitely).
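That caching property follows directly from immutability: a hash key always names the same bytes, so a cached block never needs revalidation. A tiny sketch, with a dict standing in for the network and invented names throughout:

```python
class BlockCache:
    """Cache for immutable, hash-named blocks. Because a key can only
    ever name one sequence of bytes, entries never expire or revalidate."""

    def __init__(self):
        self._store = {}
        self.fetches = 0  # counts actual network round-trips

    def get(self, key, fetch):
        if key not in self._store:
            self.fetches += 1
            self._store[key] = fetch(key)
        return self._store[key]

# A dict plays the role of the p2p network:
network = {"urn:hash:abc": b"<annotation/>"}
cache = BlockCache()
cache.get("urn:hash:abc", network.__getitem__)
cache.get("urn:hash:abc", network.__getitem__)  # served from cache
assert cache.fetches == 1
```

This is exactly why the granularity matters: a block you transmit as a whole is a block you can cache as a whole.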
So anyway, there are a number of reasons why we need to do powerful
queries over a set of Storm blocks. For example, since we use hashes
as the identifiers for blocks, we don't have file names as hints to
humans about their content; instead, we'll use RDF metadata, stored
in *other* blocks. As a second example, on top of the unchangeable
blocks, we need to create a notion of updateable, versioned
resources. We do this by creating metadata blocks saying e.g.,
"Block X is the newest version of resource Y as of
2003-03-20T22:29:25Z" and searching for the newest such statement.
I don't quite understand: isn't there a regression problem if the
metadata is itself contained in blocks? Or is at least the timestamp
of a block something external to the blocks?
A metadata block does not usually have a 'second-level' metadata block
with information about the first metadata block, if you mean that;
Say you want to change the description of an entity, not just add a new
one; I think you should assert about another metadata block that it is
wrong (in the era starting now ;-)).
"Not usually" just meant that *most* metadata blocks do not have a
second-level metadata block, in case you were worried that we'd need an
infinite number of metametametameta blocks otherwise :)
no, timestamps are not external to the blocks.
When the user synchronizes his laptop with the home PC, I guess the
metadata may be contradictory; I thought that with an external
timestamp, contradictions could be handled (the newer one is the right
one). If the timestamp is part of the metadata, the application should
probably enforce it (while generally giving the user the maximum power
to make all sorts of metadata constructs).
The timestamp is on the assertion, "Block X is the newest version of
resource Y," and it gives the time when the user said X is the current
version (i.e., when the user saved the document). If the user saves the
document on the desktop, and then on the laptop, that would be different
saves, made at different times, so the timestamps wouldn't be
contradictory: they would simply be the timestamps of two different things.
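To make this concrete, here is a sketch with each pointer block reduced to a (resource, block, timestamp) record; all the names and URNs are illustrative, not the actual Storm vocabulary:

```python
# Each metadata block asserts:
#   "block is the current version of resource, as of timestamp".
pointers = [
    ("urn:res:Y", "urn:block:X1", "2003-03-19T10:00:00Z"),  # desktop save
    ("urn:res:Y", "urn:block:X2", "2003-03-20T22:29:25Z"),  # laptop save
]

def current_version(resource):
    """Return the block named by the newest pointer for `resource`."""
    relevant = [(ts, block) for res, block, ts in pointers
                if res == resource]
    # ISO 8601 timestamps in UTC sort lexicographically by time,
    # so max() on the string picks the newest assertion.
    return max(relevant)[1]

assert current_version("urn:res:Y") == "urn:block:X2"
```

Both pointer blocks coexist without contradiction; the timestamps just record two different saves, and the query picks the later one.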
(There's another problem in this scenario, though: If the user edited a
document independently on desktop and laptop, it wouldn't be nice if the
version saved later would supersede the other one; rather, the changes
from both should be merged. We actually use a slightly different system
for synchronization of independent systems; instead of storing a
timestamp, we store a list of obsoleted versions... but that's leading
us astray here :-) )
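The obsoleted-versions idea mentioned above can be sketched as finding the "heads" of a version graph: a version not obsoleted by any other is a head, and more than one head signals that a merge is needed. The version names and data layout here are illustrative only:

```python
# Each save records which earlier versions it obsoletes.
saves = {
    "v1": [],               # original
    "v2-desktop": ["v1"],   # edited on the desktop
    "v2-laptop": ["v1"],    # edited independently on the laptop
}

def heads(saves):
    """Versions not obsoleted by any other version."""
    obsoleted = {old for news in saves.values() for old in news}
    return sorted(v for v in saves if v not in obsoleted)

# Two heads after synchronization => the user should merge the
# changes, rather than letting the later save silently win.
assert heads(saves) == ["v2-desktop", "v2-laptop"]
```

Unlike the timestamp scheme, this makes the conflict visible instead of resolving it arbitrarily.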
- Benja