For a while, I've been considering the possibilities of creating a
standard and some software to create distributed databases (in the
sense of "collections of information", not relational
databases). davidw's article about a free
translation repository had a lot in common with my ideas, so I
decided to put some thoughts down. Note that these are rather
unfinished ideas; I'm putting them down here in the hope of some feedback
that might help refine them.
The Situation:
Today, there are a large number of online information collections that
offer their data under a free license, and many of them also offer a
query interface other than web forms. Examples of such databases
include Wikipedia (my personal
favourite lately) and NuPedia, CanonicalTomes, FreeDB, dmoz, Project Gutenberg, and many
more. However, all of these are either centralized, or allow only very
rudimentary access to their data.
In addition, there are other database projects that are non-free, but
collect large amounts of useful data, mostly metadata and compilations
of information that isn't copyrightable as pieces, but is as a
compilation. Examples of such databases include IMDB, AllMusic (with its smaller cousins
AllMovie and AllGame), and miscellaneous
scientific databases. One of the main reasons the free information
community (an equivalent to the free software community, although
smaller, which includes the people who run and work on the free sites
listed above) has not risen up to create equivalents for these nonfree
databases is that it requires a huge amount of bandwidth and human
resources to run a centralized database equivalent to, say, IMDB, and
besides, IMDB has an enormous head start.
The Vision:
What if the data could be shared easily, and setting up a mirror for
such databases was a simple, near-automated process? What if you could
submit corrections and additions to the data in the database, and have
such corrections peer reviewed and rated based on a trust metric to
decide if they should go in or not? What if the system to achieve this
was extremely simple, and simply transported arbitrary XML documents
(for instance) or, optionally, diffs to XML documents, and had just a
thin wrapper format to give everything a unique ID, and link things
together?
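To make the "thin wrapper" idea a bit more concrete, here is a rough sketch in Python of what such an envelope might look like. The element names (`item`, `link`, `payload`) and the `wrap`/`unwrap` helpers are made up for illustration and are not part of any existing spec; the point is only that the wrapper carries an ID and links and never looks inside the payload.

```python
import uuid
import xml.etree.ElementTree as ET


def wrap(payload_xml, links=(), item_id=None):
    """Wrap an arbitrary XML document in a thin envelope.

    The envelope only adds a unique ID and links to other items;
    the payload itself is treated as opaque. All element names here
    are hypothetical.
    """
    envelope = ET.Element("item", {"id": item_id or uuid.uuid4().hex})
    for target in links:
        ET.SubElement(envelope, "link", {"ref": target})
    payload = ET.SubElement(envelope, "payload")
    payload.append(ET.fromstring(payload_xml))
    return ET.tostring(envelope, encoding="unicode")


def unwrap(envelope_xml):
    """Return (id, links, payload_xml) from an envelope made by wrap()."""
    envelope = ET.fromstring(envelope_xml)
    links = [link.get("ref") for link in envelope.findall("link")]
    payload = envelope.find("payload")[0]
    return envelope.get("id"), links, ET.tostring(payload, encoding="unicode")


if __name__ == "__main__":
    doc = wrap("<album><title>Kind of Blue</title></album>", links=["artist-1"])
    print(doc)
    print(unwrap(doc))
```

A diff to an existing document could travel the same way, with a link pointing at the item it modifies.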
This could create a large database of freely available information
that could be widely mirrored and distributed, and where pretty much
everyone could contribute. The implications could be huge: for
the free information community, which would get a simple distributed
tool for maintaining and updating data, for software creators
(proprietary and free alike) who would be able to build in access to
free databases in their products (media players that do something
equivalent of a FreeDB lookup, but on much richer information, for
instance), and for the public at large, who would be able to access
the information from the web as well as from within software, download
it to their computers to work offline, and so on.
What's needed:
The system could be rather simple, a cross between Usenet and a
distributed Wiki, with rudimentary trust metrics and an endorsement
system built in. All contributions (new content, or diffs to existing
content) are cryptographically signed by the person contributing, and
are assigned a unique ID by the server they are first uploaded
to. Contributions are then distributed to other servers (this is the
Usenet-like part). Other users can look at the contributions, and
choose to endorse them (by sending a special contribution, signed in
the same way as other contributions), possibly with several levels of
endorsement. Thus, there is a flow of trust information.
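The flow described above could be sketched roughly as follows. This is only an illustration under loud assumptions: the `Contribution` and `Server` names are hypothetical, and HMAC with a per-author secret stands in for real public-key signatures (the Python standard library has none), so a real system would swap in something like PGP signatures instead.

```python
import hashlib
import hmac
import itertools
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Contribution:
    author: str
    body: str            # new content or a diff; opaque to the system
    signature: str
    endorses: str = ""   # ID of the endorsed contribution, if any
    level: int = 0       # endorsement level (0 for ordinary content)
    uid: str = ""        # filled in by the first server that sees it


def sign(key, author, body, endorses="", level=0):
    """Stand-in signature: HMAC-SHA256 over the signed fields."""
    message = json.dumps([author, endorses, level, body]).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class Server:
    def __init__(self, name, keys):
        self.name = name
        self.keys = keys              # author -> key (assumption)
        self.store = {}
        self._counter = itertools.count(1)

    def verify(self, c):
        key = self.keys.get(c.author)
        if key is None:
            return False
        expected = sign(key, c.author, c.body, c.endorses, c.level)
        return hmac.compare_digest(expected, c.signature)

    def upload(self, c):
        """First upload: check the signature and assign a unique ID."""
        if not self.verify(c):
            raise ValueError("bad signature")
        stored = replace(c, uid=f"{self.name}:{next(self._counter)}")
        self.store[stored.uid] = stored
        return stored

    def receive(self, c):
        """Usenet-like part: accept a contribution relayed by a peer."""
        if c.uid in self.store or not self.verify(c):
            return False
        self.store[c.uid] = c
        return True
```

An endorsement is then just another contribution whose `endorses` field names the ID of the thing being endorsed, for example `Contribution("bob", "", sign(bob_key, "bob", "", "s1:1", 2), endorses="s1:1", level=2)`.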
The system itself doesn't need to concern itself with trust
thresholds, for instance; that can be up to the individual server
administrator (who can also choose to manually review all
contributions, and only distribute versions that are approved by the
administrator). Likewise, the system doesn't need to concern itself
with the encoded data or its format, only the wrapper and transport
layer (which, I'd suggest, should be HTTP-based).
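As a sketch of what "up to the individual server administrator" might mean in practice, a server could apply a local policy like the one below. The scoring rule (endorsement level times the administrator's trust in the endorser, summed) is purely an assumption for illustration; the function names and the `manually_approved` override are hypothetical, and a real trust metric would likely be more involved.

```python
def endorsement_score(endorsements, trust):
    """Score a contribution from (endorser, level) pairs.

    Each endorser counts once, at their highest level, weighted by
    how much this server's administrator trusts them. Unknown
    endorsers carry no weight.
    """
    best = {}
    for endorser, level in endorsements:
        best[endorser] = max(level, best.get(endorser, 0))
    return sum(level * trust.get(endorser, 0.0)
               for endorser, level in best.items())


def should_accept(endorsements, trust, threshold, manually_approved=None):
    """Local policy: a manual decision wins, otherwise use the threshold."""
    if manually_approved is not None:
        return manually_approved
    return endorsement_score(endorsements, trust) >= threshold


if __name__ == "__main__":
    trust = {"alice": 1.0, "bob": 0.5}
    endorsements = [("alice", 2), ("bob", 1), ("bob", 2), ("mallory", 3)]
    print(endorsement_score(endorsements, trust))
    print(should_accept(endorsements, trust, threshold=3.0))
```

Two servers can carry the same contributions and endorsements and still make different decisions, simply by using different `trust` tables or thresholds.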
Obviously, there are more tools needed to make this system
useful. Each application (say, "music metadata", "movie metadata",
"encyclopedia articles", etc.) can probably benefit from separate and
dedicated tools, but these can be relatively simple, and in many cases
should probably be implemented as Bonobo components or something
similar, so they can be easily embedded in applications. Likewise,
people will construct web form interfaces to access, add to, and edit
the information. But this is similar to Usenet: the protocol doesn't
need to concern itself with the UI of the newsreader. These things
will come into existence as people need them enough to code them up.
The obvious way to go about this is to spec the protocol, make a
minimal implementation, and then create a couple of XML DTDs and
convert existing free information into this XML format. FreeDB's
downloadable database dumps are an excellent source of such seed
information, as are dmoz.org's database dumps.
This skims the surface of what I've been thinking about. It's pretty
much a brain dump, so I hope it gets the basic points across. I'd love
to see Advogato people talk about the problems, obstacles, usage
areas, implementation details, suggestions, etc., to try to judge if
it's worthwhile (or even possible; I might be overlooking something
essential here) to proceed with the project.
Since you brought it up, there's been a lot of thinking already about how
to move content between wikis. The first step is generally believed to be
defining a MeatBall:XmlRpcToWiki protocol, followed by defining
a MeatBall:WikiInterchangeFormat. The goal originally was
to make some sort of Wiki:InterWiki, but maybe in the
interim we can build a MeatBall:DistributedWikiForum. One
successful bridging technology has been
Wiki:SisterSites.
Personally, I don't think it's really necessary to use too much
cryptography. Wikis don't need it. If you think hard enough, you
won't need it for this either. For every bit of encryption you
misuse, you halve your userbase.
The major problem isn't security, it's copyright and control. The best
example was when Wikipedia forked FOLDOC. I recommended using sister sites, but Larry didn't like
that. He preferred absorbing as much content into Wikipedia as possible,
as then everything would be inside Wikipedia. Considering that FOLDOC
wasn't under the GFDL, this is reasonable (is it?). In the end, Denis
converted his entire site to the GFDL, which was nice of him.
Similarly, FOLDOC incorporates The Jargon File, which is in the public
domain. This is perhaps nicer for an automated syndication
system because it's less restrictive than the GFDL.
Insert that whole BSD vs. GPL debate.
If I didn't know better, I would have sworn that you were listening in
on the discussions jlatour and I --
tnt -- have been having about project Sand Dunes, because
virtually all of what you have said has also been said in our
conversations.
Now project Sand Dunes has
not yet been announced, because (as you may have noticed if you
visited the home page) we are
not ready (to announce it). We are in the process of writing a
white paper for it (which is the work in progress that
you see on the home page).
However, having said that, the similarities between what you have
proposed and project Sand Dunes
are so strong that I thought that we should let you know about it... and
see if you wanted to work with us.
It's too bad the white paper isn't done though.... Reading
through what we have, again, I'm noticing that much of what
your proposal and project Sand Dunes have in common hasn't even been written
up yet. But just to add to what you see on the home page, project Sand Dunes has things such as:
- Expert Groups... which are trust metric groups with
different seeds,
- Forks... the ability to fork the project... without
leaving the main system,
- A Distributed System of servers,
- Trust Metrics between expert groups... and not just people
within expert groups,
- multilingual stuff,
- and more.
Obviously that summary is very very inadequate. But combine that with
what's currently there on the home
page... and... ummm... I guess that's still inadequate; but it gives
you some idea. Anyways, e-mail me (or reply here) if you are
interested in working with us.