Exchange formats and common pools for free data

Posted 6 Apr 2002 at 06:18 UTC by Radagast

For a while, I've been considering the possibilities of creating a standard, and some software, for building distributed databases (in the sense of "collections of information", not relational databases). davidw's article about a free translation repository had a lot in common with my ideas, so I decided to put some thoughts down. Note that these are rather unfinished ideas; I'm putting them down here in the hope of some feedback that might help refine them.

The Situation:
Today, there are a large number of online information collections that offer their data under a free license, and many of them also offer a query interface other than web forms. Examples of such databases include Wikipedia (my personal favourite lately) and NuPedia, CanonicalTomes, FreeDB, dmoz, Project Gutenberg, and many more. However, all of these are either centralized, or allow only very rudimentary access to their data.

In addition, there are other database projects that are non-free but collect large amounts of useful data, mostly metadata and compilations of information that isn't copyrightable piece by piece, but is as a compilation. Examples of such databases include IMDB, AllMusic (with its smaller cousins AllMovie and AllGame), and miscellaneous scientific databases. One of the main reasons the free information community (an equivalent to the free software community, although smaller, including the people who run and work on the free sites listed above) has not risen up to create equivalents for these non-free databases is that running a centralized database on the scale of, say, IMDB requires a huge amount of bandwidth and human resources, and besides, IMDB has an enormous head start.

The Vision:
What if the data could be shared easily, and setting up a mirror for such databases was a simple, near-automated process? What if you could submit corrections and additions to the data in the database, and have such corrections peer reviewed and rated based on a trust metric to decide if they should go in or not? What if the system to achieve this was extremely simple, and simply transported arbitrary XML documents (for instance) or, optionally, diffs to XML documents, and had just a thin wrapper format to give everything a unique ID and link things together?
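For concreteness, here is a rough sketch in Python of what such a thin wrapper around an arbitrary XML payload might look like. The element names (contribution, link, body) are invented for illustration only; they don't come from any existing spec.

    # Sketch only: element names and structure are invented for illustration.
    import uuid
    import xml.etree.ElementTree as ET

    def wrap(payload_xml, links=()):
        """Wrap an arbitrary XML payload in a thin envelope that assigns it a
        unique ID and records links to related contributions."""
        envelope = ET.Element("contribution", id=str(uuid.uuid4()))
        for target in links:
            ET.SubElement(envelope, "link", ref=target)
        body = ET.SubElement(envelope, "body")
        body.append(ET.fromstring(payload_xml))
        return ET.tostring(envelope, encoding="unicode")

    print(wrap("<article><title>Example</title></article>",
               links=["urn:contribution:1234"]))

The point is that the envelope knows nothing about the payload; it only provides identity and linking.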

This could create a large database of freely available information that could be widely mirrored and distributed, and where pretty much everyone could contribute. The implications could be huge: for the free information community, which would get a simple distributed tool for maintaining and updating data; for software creators (proprietary and free alike), who would be able to build access to free databases into their products (media players that do something equivalent to a FreeDB lookup, but on much richer information, for instance); and for the public at large, who would be able to access the information from the web as well as from within software, download it to their computers to work offline, and so on.

What's needed:
The system could be rather simple, a cross between Usenet and a distributed Wiki, with rudimentary trust metrics and an endorsement system built in. All contributions (new content, or diffs to existing content) are cryptographically signed by the person contributing, and are assigned a unique ID by the server they are first uploaded to. Contributions are then distributed to other servers (this is the Usenet-like part). Other users can look at the contributions and choose to endorse them (by sending a special contribution, signed in the same way as other contributions), possibly with several levels of endorsement. Thus, there is a flow of trust information.
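As a rough illustration (not a spec: the field names are assumptions, and in practice the signature would be an OpenPGP detached signature made with the contributor's key rather than the stub shown here), a contribution and an endorsement could both be the same kind of signed record:

    # Illustrative sketch only. sign() is a stand-in for a real detached
    # signature (e.g. GnuPG); do not use this as actual cryptography.
    import hashlib
    import json
    import uuid

    def sign(data: bytes, key: str) -> str:
        return hashlib.sha256(key.encode() + data).hexdigest()

    def make_contribution(author_key, payload, kind="content"):
        record = {
            "id": str(uuid.uuid4()),  # in the article's design, assigned by the first server
            "kind": kind,             # "content", "diff", or "endorsement"
            "payload": payload,
        }
        blob = json.dumps(record, sort_keys=True).encode()
        record["signature"] = sign(blob, author_key)
        return record

    article = make_contribution("alice-key", "<article>...</article>")
    # An endorsement is just another contribution whose payload points at the
    # endorsed contribution's ID and carries an endorsement level.
    endorsement = make_contribution("bob-key",
                                    {"endorses": article["id"], "level": 2},
                                    kind="endorsement")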

The system itself doesn't need to concern itself with trust thresholds, for instance; that can be left to the individual server administrator (who can also choose to manually review all contributions, and only distribute versions that the administrator has approved). Likewise, the system doesn't need to concern itself with the encoded data or its format, only with the wrapper and transport layer (which, I'd suggest, should be HTTP-based).
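How an administrator applies a trust threshold is then a purely local policy decision. A minimal sketch, where the scoring rule and field names are invented here and not part of any proposed protocol:

    # Local server policy sketch: decide whether to distribute a contribution
    # based on endorsements from people the local administrator trusts.

    def should_distribute(contribution_id, endorsements, trusted, threshold=3):
        """endorsements: list of dicts like {"endorses": id, "by": who, "level": n}.
        trusted: per-author weights chosen by the local administrator."""
        score = sum(e["level"] * trusted.get(e["by"], 0)
                    for e in endorsements
                    if e["endorses"] == contribution_id)
        return score >= threshold

    trusted = {"alice": 1.0, "bob": 0.5}  # the administrator's own trust choices
    endorsements = [
        {"endorses": "abc123", "by": "alice", "level": 2},
        {"endorses": "abc123", "by": "bob", "level": 2},
    ]
    print(should_distribute("abc123", endorsements, trusted))  # True: 2*1.0 + 2*0.5 = 3

Two servers running different thresholds (or different trusted lists) would simply end up carrying different subsets of the pool, which is fine.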

Obviously, there are more tools needed to make this system useful. Each application (say, "music metadata", "movie metadata", "encyclopedia articles", etc.) can probably benefit from separate and dedicated tools, but these can be relatively simple, and in many cases should probably be implemented as Bonobo components or something similar, so they can be easily embedded in applications. Likewise, people will construct web form interfaces to access, add to, and edit the information. But this is similar to Usenet: the protocol doesn't need to concern itself with the UI of the newsreader. These things will come into existence as people need them enough to code them up.

The obvious way to go about this is to spec the protocol, make a minimal implementation, then create a couple of XML DTDs and convert existing free information into this XML format. FreeDB's downloadable database dumps are an excellent source of such seed information, as are dmoz.org's.
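As a rough idea of what that seeding step could look like, here is a sketch that parses one FreeDB-style entry (lines like DTITLE=Artist / Title and TTITLE0=...; the exact dump layout is from memory and should be checked against FreeDB's documentation) into a hypothetical XML record with invented element names:

    # Sketch: convert one FreeDB-style entry into a hypothetical <disc> record.
    import xml.etree.ElementTree as ET

    SAMPLE = """\
    DISCID=a60ab20d
    DTITLE=Some Artist / Some Album
    DYEAR=1999
    TTITLE0=First Track
    TTITLE1=Second Track
    """

    def freedb_to_xml(text):
        fields, tracks = {}, []
        for line in text.splitlines():
            line = line.strip()
            if "=" not in line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            if key.startswith("TTITLE"):
                tracks.append(value)
            else:
                fields[key] = value
        artist, _, album = fields.get("DTITLE", "").partition(" / ")
        disc = ET.Element("disc", id=fields.get("DISCID", ""))
        ET.SubElement(disc, "artist").text = artist
        ET.SubElement(disc, "album").text = album
        ET.SubElement(disc, "year").text = fields.get("DYEAR", "")
        for title in tracks:
            ET.SubElement(disc, "track").text = title
        return ET.tostring(disc, encoding="unicode")

    print(freedb_to_xml(SAMPLE))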

This skims the surface of what I've been thinking about. It's pretty much a brain dump, so I hope it gets the basic points across. I'd love to see Advogato people talk about the problems, obstacles, usage areas, implementation details, suggestions, etc., to try to judge if it's worthwhile (or even possible, I might be overlooking something essential here) to proceed with the project.


Libre software needs libre data, posted 6 Apr 2002 at 15:48 UTC by mlinksva » (Journeyer)

I feel pretty strongly about this, having organized a BOF on "Open Source/Open Data" at the last O'Reilly Open Source Conference and having helped found a company that's working to build an open file metadata catalog. Like the projects mentioned in the article, our catalog is currently centralized, probably for a similar reason: too few resources to work on domain problems and innovate in the area of distributed read/write datasets.

Libre datasets are increasingly important because much of the functionality provided by computers now resides in the net. Suppose you have a Debian box with only software from main installed, and your favorite services include Mapquest and Amazon book reviews. Are you still a free software saint? :-) What if the proprietary "data" you access starts to act more like code? What if your free software application can't compete in its niche without a dependency on proprietary data (e.g., imagine a world without FreeDB or MusicBrainz providing/working on free CDDB alternatives)?

Apart from the physical world and book metadata hinted at above, I'd really like to see distributed web annotations. Without them the promise of hypertext remains unfulfilled.

Moving content between wikis, posted 7 Apr 2002 at 06:35 UTC by Sunir » (Journeyer)

Since you brought it up, there's been a lot of thinking already about how to move content between wikis. The generally accepted first step is to define a MeatBall:XmlRpcToWiki protocol and then a MeatBall:WikiInterchangeFormat. The goal originally was to make some sort of Wiki:InterWiki, but maybe in the interim we can build a MeatBall:DistributedWikiForum. One successful bridging technology has been Wiki:SisterSites.

Personally, I don't think it's really necessary to use much cryptography. Wikis don't need it. If you think hard enough, you won't need it for this either. For every bit of encryption you misuse, you halve your userbase.

The major problem isn't security, it's copyright and control. The best example was when Wikipedia forked FOLDOC. I recommended using sister sites, but Larry didn't like that. He preferred absorbing as much content into Wikipedia as possible, so that everything would be inside Wikipedia. Considering that FOLDOC wasn't under the GFDL, this is reasonable (is it?). In the end, Denis converted his entire site to the GFDL, which was nice of him.

Similarly, FOLDOC incorporates The Jargon File, which is in the public domain. This is perhaps nicer for an automated syndication system because it's less restrictive than the GFDL.

Insert that whole BSD vs. GPL debate.

See also..., posted 7 Apr 2002 at 16:14 UTC by jerry » (Journeyer)

Your description captures pretty much exactly both the motivation behind and the basic structure of the Askemos project.

To clarify: the project is about building a virtual public library. The software, askemos (current prototype version 0.6.13), is basically the infrastructure you described.

And a shameless plug: we really need helping hands!

dmoz model, posted 8 Apr 2002 at 14:00 UTC by SteveMallett » (Journeyer)

We also kinda began such a thing. We publish our database in XML (yes, I know it's not that great, but we don't have a good XML guy). We did this to emulate the dmoz distribution model.

We also began talking to coopx about an implementable schema for sites that host projects so the info could be shared among everyone. Here's the mailing list archive.

I hope this is helpful.

Project Sand Dunes, posted 9 Apr 2002 at 18:51 UTC by tnt » (Master)

If I didn't know better, I would have sworn that you were listening in on the discussions jlatour and I -- tnt -- have been having about project Sand Dunes, because virtually all of what you have said has also been said in our conversations.

Now project Sand Dunes has not yet been announced, because (as you may have noticed if you visited the home page) we are not ready (to announce it). We are in the process of writing a white paper for it (which is the work in progress that you see on the home page).

However, having said that, the similarities between what you have proposed and project Sand Dunes are so close, that I thought that we should let you know about it... and see if you wanted to work with us.

It's too bad the white paper isn't done though.... Reading through what we have again, I'm noticing that much of what is similar between your proposal and project Sand Dunes hasn't even been written up yet. But just to add to what you see on the home page, project Sand Dunes has things such as:

  • Expert Groups... which are trust metric groups with different seeds,
  • Forks... the ability to fork the project... without leaving the main system,
  • A Distributed System of servers,
  • Trust Metrics between expert groups... and not just between people within expert groups,
  • multilingual stuff,
  • and more.

Obviously that summary is very, very inadequate. But combine that with what's currently there on the home page... and... ummm... I guess that's still inadequate; but it gives you some idea. Anyways, e-mail me (or reply here) if you are interested in working with us.

A Free Directory, posted 12 Apr 2002 at 15:20 UTC by chalst » (Master)

Good ideas here. I'd like to emphasise the importance of something like dmoz in providing `high-level structure' for the web.

Here is an extract of an email I sent in November 2000 to a number of people who had left/been kicked out of dmoz around the time the new copyright guidelines came into force. I resigned my categories (I was editor cas) in protest at the removal of George Ruban (gruban) at this time. I've attached it here because I think it is important that there is a free alternative to dmoz that can `scale' at least as well as dmoz.

I have been thinking for a while about the possibility of constructing a decentralised alternative to DMoz that has many of the strengths of that project but without the defects that I found to be increasingly tiresome.

The strengths of DMoz (some of which it learnt from Yahoo), as I see them, are:

  1. It promotes cooperation in editing and figuring out organisational issues;
  2. It brings together people with common interests;
  3. It turns the product of the directory into a standard, world-readable document based on open W3C standards, making a natural separation between editing and publishing;
  4. It supports editor privacy in the forums;
  5. It managed to promote a generally quite high minimum level of editing, and a high level of directory-wide consistency;
  6. It has good safeguards against editor abuse.

Its weaknesses are mostly due to its centralised form:

  1. A courtly mode of operation, where influence is based on favours granted by staff's indulgence.
  2. Editor permissions are governed by metas who generally are not sensitive to, for example, local academic standards in particular disciplines.
  3. Directory-wide rules are imposed that are frequently an irrelevance or an inconvenience to certain parts of the directory. E.g., the debate over `cooled' sites.
  4. A small number of metas govern an enormous number of editors, leading to a bureaucratic and remote feel. Many problems have to be solved that would not exist in smaller groups where everyone knows each other. And it makes volunteering feel like working for a company.
  5. Disagreements are resolved by higher-level intervention, essentially consensus at all costs, whereas the open-source model is based upon the freer and more intelligent idea of letting the parties fork and letting the users decide who was right;
  6. The centralised directory becomes a central point of failure: attractive to lawsuits, and vulnerable to malicious hackers and to changes in corporate strategy.

And some weaknesses due to the data representation:

  1. It proves terribly hard to avoid duplicating work, and to maximise opportunities for `lateral' linking without endlessly multiplying categories, which makes navigating the directory tiresome and error-prone.
  2. Changing the ontology is unnecessarily difficult. It's a matter of hard consensus, so there is little experimentation to see if alternative ontologies work better. And it is fraught with who-gets-to-edit-what issues that are irrelevant to the users of the directory.

I made a concrete proposal along these lines, which for various reasons never came to anything. The key idea was to split content from ontology, and to use a modular representation for ontology that allows ontologies to be joined and divided easily. I think the design was sound and I'd be interested in reviving the project.
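A minimal sketch of that content/ontology split (entirely my own illustration of the idea, not chalst's actual proposal): content entries live in a flat pool keyed by stable IDs, while each ontology is just a separate mapping from category paths to sets of entry IDs, so two ontologies can be merged by taking the union under each path:

    # Illustration only: content is a flat pool of entries keyed by stable IDs;
    # an ontology is a separate mapping from category paths to sets of entry IDs.

    content = {
        "e1": {"url": "http://example.org/", "title": "Example site"},
        "e2": {"url": "http://example.net/", "title": "Another site"},
    }

    ontology_a = {("Computers", "Software"): {"e1"}}
    ontology_b = {("Computers", "Software"): {"e2"},
                  ("Science", "Math"): {"e1"}}

    def merge(*ontologies):
        """Join ontologies by unioning the entry sets under each category path."""
        merged = {}
        for onto in ontologies:
            for path, ids in onto.items():
                merged.setdefault(path, set()).update(ids)
        return merged

    print(merge(ontology_a, ontology_b))
    # e.g. {('Computers', 'Software'): {'e1', 'e2'}, ('Science', 'Math'): {'e1'}}

Because an entry can appear under any number of paths in any number of ontologies, lateral linking and experimental ontologies stop being a matter of hard directory-wide consensus.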
