Wikidata begins

I regret to say that our consultant Jeroen De Dauw will not be doing any significant work for WikiWorks for at least the next year. Thankfully, that’s for a very good reason: he’s moved to Berlin to be part of the Wikidata project, which starts tomorrow.

Wikidata is headed by Denny Vrandecic, who, like Jeroen, is a friend and colleague of mine; and its goal is to bring true data to Wikipedia, in part via Semantic MediaWiki. There was a press release about it on Friday that got some significant media attention, including this good summary at TechCrunch.

I’m very excited about the project, as a MediaWiki and SMW developer, as a data enthusiast, and simply as a Wikipedia user. This project is quite different from any of the work that I’ve personally been involved with, because Wikipedia is simply a different beast from any standard wiki. There are five challenges that are specific to Wikipedia: it’s massive, it needs to be extremely fast at all times, it’s highly multi-lingual (over 200 languages currently), it requires references for all facts (at least in theory), and it has, at this point, no real top-down structure.

So their approach will be not to tag information within articles themselves, the way it’s done in Semantic MediaWiki, but rather to create a new, separate site: a “Data Commons”, where potentially hundreds of millions of facts (or more?) will be stored, each fact with its own reference. Then, each individual language Wikipedia can make use of those facts within its own infobox templates, wherever that Wikipedia’s community sees fit to use them.
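To make that architecture a little more concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not Wikidata's actual design: the `Fact` class, the `DATA_COMMONS` list, the `INFOBOX_LABELS` mapping and the country "Examplia" with its values are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One language-neutral fact; every fact carries its own reference."""
    subject: str
    prop: str
    value: object
    reference: str


# Hypothetical central store, shared by every language edition.
# The subject, values and sources are invented placeholders.
DATA_COMMONS = [
    Fact("examplia", "capital", "Sample City", "example source A"),
    Fact("examplia", "population", 1000, "example source B"),
]

# Each language community decides which properties its infobox
# shows, and how they are labeled (assumed configuration).
INFOBOX_LABELS = {
    "en": {"capital": "Capital", "population": "Population"},
    "de": {"population": "Einwohner"},
}


def render_infobox(subject, lang):
    """Return the infobox lines one language edition chooses to display."""
    labels = INFOBOX_LABELS[lang]
    return [
        f"{labels[fact.prop]}: {fact.value} [{fact.reference}]"
        for fact in DATA_COMMONS
        if fact.subject == subject and fact.prop in labels
    ]
```

Calling `render_infobox("examplia", "en")` and `render_infobox("examplia", "de")` reads the same stored facts, but each edition shows only the properties its own community has opted into.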

It’s a bold vision, and there will be a lot of work necessary to pull it off, but I have a lot of faith in the abilities of the programmers who are on the team now. Just as importantly, I see the planned outcome of Wikidata as an inevitable one for Wikipedia. Wikipedia has been incrementally evolving from a random collection of articles to a true database since the beginning, and I think this is a natural step along that process.

A set of files was discovered in 2010 that represented the state of Wikipedia after about six weeks of existence, in February 2001. If you look through those pages, you can see nearly total chaos: there’s not even a hint of a unifying structure, or guidelines as to what should constitute a Wikipedia page; over 10% of the articles related in some way to the book Atlas Shrugged, presumably added by a devoted fan.

11 years later, there’s structure everywhere: infobox templates dictate the important summary information for any one subject type, reference templates specify how references should be structured, article-tagging templates let users state precisely the areas they think need improvement. There are guidelines for the first sentence, for the introductory paragraphs (ideally, one to four of them, depending on the article’s overall length), for how detailed sections should be, for when one should link to years, and so on. There are also tens of thousands of categories (at least, on the English-language Wikipedia), with guidelines on how to use them, creating a large set of hierarchies for browsing through all the information. These are all, in my eyes, symptoms of a natural progression toward a database-like system. Why is it natural? Because, if a rule makes sense for one article, it probably makes sense for all of them. Of course, that’s not always true, and there can be sub-rules, exceptions, etc.; but still, there’s no use reinventing the wheel for every article.

People complain that the proliferation of rules and guidelines, not to mention categories and templates, drives away new users, who are increasingly afraid to edit articles for fear of doing the wrong thing. And they’re right. But the solution to this problem is not to scale back all these rules, but rather to make the software more aware of the rules, and the overall structure, to prevent users from being able to make a mistake in the first place. That, at heart, was the thinking behind my extension Semantic Forms: if there’s a specific way of creating calls to a specific template, there’s no point requiring each user to create them in that way, when you can just display a form, let the user only enter valid inputs, and have the software take care of the rest.
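As a rough illustration of that principle (this is not how Semantic Forms is implemented), the Python sketch below validates a few inputs and then generates the template call itself. The `ALLOWED_VALUES` table, the `build_template_call` function and the "Musician" template with its fields are hypothetical names chosen for the example.

```python
# Assumed rule set: the values each field is allowed to take.
ALLOWED_VALUES = {
    "genre": {"jazz", "classical", "rock"},
    "instrument": {"trumpet", "piano", "guitar"},
}


def build_template_call(template, fields):
    """Validate the inputs, then emit the wikitext template call."""
    for name, value in fields.items():
        allowed = ALLOWED_VALUES.get(name)
        # Fields without a rule are accepted as free text.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{value!r} is not a valid {name}")
    params = "".join(f"\n|{name}={value}" for name, value in fields.items())
    return "{{" + template + params + "\n}}"
```

The user only supplies values; the correct syntax, and the rejection of anything invalid, is left to the software.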

Now, Wikidata isn’t concerned with the structuring of articles, but only with the data that they contain; but the core philosophy is the same: let the software take care of anything that there’s only one right way to do. If a country has a certain population (at least, according to some source), then there’s no reason that the users of every different language Wikipedia need to independently look up and maintain that information. If every page about a multiplanetary system already has its information stored semantically, then there’s no reason to separately maintain a hand-generated list of multiplanetary systems. And if, for every page about a musician, there’s already semantic information about their genre, instrument and nationality, then there’s no reason for categories such as “Danish jazz trumpeters”. (And there’s probably much less of a need for categories in general.)
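Here is a small Python sketch of why such a category becomes redundant once the underlying data is stored. It assumes nothing about Wikidata's eventual query system; the `MUSICIANS` records and the `query` helper are invented for illustration.

```python
# Placeholder records standing in for semantic data stored per page.
MUSICIANS = [
    {"name": "Musician A", "nationality": "Danish", "genre": "jazz", "instrument": "trumpet"},
    {"name": "Musician B", "nationality": "Danish", "genre": "jazz", "instrument": "piano"},
    {"name": "Musician C", "nationality": "Swedish", "genre": "jazz", "instrument": "trumpet"},
]


def query(records, **criteria):
    """Return the names of all records that match every given property."""
    return [
        record["name"]
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# The "category" is just a query; nobody has to maintain it by hand.
danish_jazz_trumpeters = query(
    MUSICIANS, nationality="Danish", genre="jazz", instrument="trumpet"
)
```

Any other combination, such as `query(MUSICIANS, genre="jazz", instrument="trumpet")`, comes for free from the same data, whereas a category would have to be created and populated separately for each combination.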

With increased meaning/semantics on one hand, and increased structure on the other, Wikipedia will become more like a database that can be queried, than like a standard encyclopedia. And at that point, the possibilities are endless, as they say. The demand is already there; all that’s missing is the software, and that’s what they’ll be working on in Berlin. Viel Glück!

6 Responses

  1. Hi Yaron,

    Will Semantic Forms play a role in Wikidata?

    Adrian F, April 2, 2012 @ 5:25 AM
  2. Thank you very much, Yaron! That’s a great overview of what we want to achieve.

  3. Thank you for the blog post Yaron. It’s good to hear from people close to the Semantic MediaWiki efforts on Wikidata. As a huge fan of SMW it wasn’t clear to me if Wikidata is a good, bad or neutral thing for SMW. After reading your post, I come away feeling it’s somewhat neutral. Is that the correct reading?

  4. I agree that this was inevitable. Wikitext was supposed to make editing easier. But now the markup for the average Wikipedia page looks like code + mess. This will help things by getting the infoboxes out of the way.

    It will be interesting to see what kind of query system will be used. It would seem that if Google is involved then natural language queries will probably be the way most people use the data.

    Am I the first to suspect that Google is looking for this project to be a huge Siri-killer for Android? I suspect I am. Using this system, you would be able to ask the “Android Assistant” any question, not just the ones in Apple’s commercials.

    I hope the data will be made available to non-WMF wikis, similar to InstantCommons.

    Note that the actual Wikidata pages (from the mockup I saw) will use form-based input, so definitely +1 for Semantic Forms. And +2 for SMW.

  5. Just to clarify, Wikidata isn’t planning to use Semantic Forms for its data entry, although maybe that’s not what you meant.

    Anyway, the “Siri-killer” idea is an interesting one. Google the search engine has itself been in the question-answering business for a long time (when it knows the answer, it puts it above the search results), but obviously mobile is where the action is these days.

    Yaron Koren, April 2, 2012 @ 8:45 PM

Continuing the Discussion

  1. [...] understand better the rationale and benefits of Wikidata, let’s consider two common problems with structured data in [...]