
Making Wikipedia into a database

Last month I was quoted in a Technology Review article about using Wikipedia’s data, and soon after that one of the editors of Hatilda Harevi’it (“The Fourth Tilde”), a Hebrew-language semi-monthly online newsletter about Wikipedia, wrote to me, asking me to write something for Hatilda clarifying the various ideas presented there. I wrote something in English, which they dutifully translated into Hebrew (I can read Hebrew fine, but my writing leaves something to be desired). The latest edition, vol. 24, came out yesterday, with my column – you can see it here (look for “Semantic MediaWiki” :) ). It ended up being longer than I expected – it contains not just an overview of the concepts, but a technical proposal for Wikipedia.

And for the benefit of those of you who can’t read Hebrew, here’s the original version:

Making Wikipedia into a database

In July, the magazine Technology Review published an online article, Wikipedia to Add Meaning to Its Pages, about adding semantics to Wikipedia, that caused a little bit of a stir – I think for most people who read it, it was the first time they had heard about me, Semantic MediaWiki (SMW), or the consulting company WikiWorks (which is mentioned indirectly); for some, it may have been the first time they had heard of the Semantic Web. That article just provided a very brief summary of all the issues involved, so I’d like to give my view of things in more detail.

I see the history of Wikipedia as, in part, a progression from a collection of text articles into something more like a database. As the amount of information in Wikipedia has grown, the structure needed to support it has grown alongside it – that’s an entire world of categories, infobox templates, navigation templates and list pages (which you can see taken to a logical conclusion, though not the most extreme one possible, with the English Wikipedia’s “Lists of lists” category). At the same time, the importance of Wikipedia as a source of data has also grown considerably. Three online projects, all mentioned in the article, are either completely or to a large extent based on using Wikipedia’s data: DBpedia, which puts the information from the English-language Wikipedia on the web in a format that computers can query directly; Freebase, which does something similar for information from many different sources, although Wikipedia is one of the largest; and Powerset (not “PowerSet”), which according to the article gets its Wikipedia information indirectly, via Freebase (I thought it built up its store of information by doing natural-language processing on the main text of Wikipedia articles – in either case, it’s based on Wikipedia). These projects have all done well for themselves – DBpedia is literally at the center of every graph of the world of “linked data”, Powerset was bought in 2008 by Microsoft, and Metaweb, the company behind Freebase, was bought by Google about a week after the article came out (I’m guessing that’s a coincidence).

This progression from text to data, by the way, reflects a larger overall trend on the web, a trend that people have generally referred to as the Semantic Web, and sometimes “Web 3.0”. The term “Semantic Web” has been used to mean many different things, and is itself the subject of controversy, but the very basic idea is that the content of web pages should be able to be accessed and understood directly by computers. If, for instance, I want to find the names of the 10 highest-paid actors who were born in Hungary, I should be able to enter my question into the computer in some way, and then have it go to the right sources for the different sets of information, put the information together, and give me back an answer. (Explanations of the Semantic Web often involve users finding plane tickets, but I thought I would give a more interesting example.) People have been talking about the Semantic Web for almost as long as there has been a web, but in the last five years it has really picked up, and now “Web 3.0” is starting to see the same kind of hype that “Web 2.0” once did.
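
The actors question above boils down to a join across data sources plus a sort – something trivial for a computer once the data is machine-readable. Here is a minimal sketch in Python, with entirely made-up names and salary figures standing in for the real sources:

```python
# Two toy "sources", as a Semantic Web client might retrieve them.
# All names and figures are invented for the example.
birthplaces = {
    "Actor A": "Hungary",
    "Actor B": "France",
    "Actor C": "Hungary",
}

salaries = {  # yearly pay in dollars, invented values
    "Actor A": 12_000_000,
    "Actor B": 20_000_000,
    "Actor C": 9_000_000,
}

def top_paid_from(country, n=10):
    """Join the two sources: the n highest-paid actors born in `country`."""
    born_there = [a for a, c in birthplaces.items() if c == country]
    return sorted(born_there, key=lambda a: salaries.get(a, 0), reverse=True)[:n]

print(top_paid_from("Hungary"))  # -> ['Actor A', 'Actor C']
```

The hard part of the Semantic Web is not this query logic; it’s getting the two dictionaries filled in from the open web in the first place.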

So where does that leave Wikipedia? We’re at the beginning of some sort of online data revolution, and Wikipedia itself is the source for data projects worth tens of millions of dollars, yet Wikipedia’s own approach to data is quite basic – the same facts have to be entered manually by users over and over, at least once in every language and usually more than that. There is also very little ability to export any of the data in a machine-readable way. In short, it’s a wasted opportunity.

For those who want to improve access to Wikipedia’s data, one approach usually stands out: Semantic MediaWiki (SMW), an extension to MediaWiki (the software on which Wikipedia runs). It’s also a project that I’ve been involved with for four years. SMW lets users easily store information found on the wiki within the wiki’s own database, so that the information can be queried, displayed (in tables, graphs, maps, calendars, etc.) and exported elsewhere. I won’t describe SMW here in more detail than that, but if you want to read more about it, the FAQ is a good place to start. The FAQ mentions how Semantic MediaWiki in fact has its roots in a proposal for turning text into data on Wikipedia itself, and that getting SMW onto Wikipedia is still a major goal for some of its developers (though it was never a big goal of mine). Still, despite the Wikipedia connection, Semantic MediaWiki has taken on a big life outside of Wikipedia, and at this point it gets serious usage as a data-management tool within companies, government agencies and other organizations (helping such organizations use it effectively is the main business of WikiWorks).
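
To give a flavor of what that looks like in practice, here is a minimal sketch of SMW’s wiki markup. The property names (“Is capital of”, “Has population”) are invented for the example; the annotation syntax and the `#ask` parser function are SMW’s standard mechanisms:

```
<!-- On a page like "Budapest", facts are stored inline, as property annotations: -->
Budapest is [[Is capital of::Hungary]]'s capital, with a population of [[Has population::1752000]].

<!-- On any other page, the stored values can be queried and displayed: -->
{{#ask: [[Is capital of::+]]
 |?Has population
 |format=table
}}
```

The point is that the same text users already write doubles as a queryable database, with no separate data-entry step.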

That’s SMW, very briefly; but let me change direction here: I may surprise people who know me by saying that I don’t believe that Semantic MediaWiki is the right answer for Wikipedia at the moment. The biggest reason is that Wikipedia is itself a collection of over 200 sub-sites, each in a different language; and a single store of data that all of them can use is probably a better solution than what SMW could provide, which is a separate data store for each one.

What would such a database look like, then? It would have to fit some general criteria: the data would have to be easily modifiable by people who speak many different languages; the data would have to be usable in many different languages; it would have to be extremely fast; and ideally it could be usable even outside of Wikipedia, as a general-purpose data API.

For those who are curious, and have some understanding of technical concepts like APIs and parser functions, I present, in the appendix, one option for how it could be done: it would involve creating a new wiki, at its own URL, that would hold thousands (or more) of pages of raw data, probably in English; each data set would be in CSV format (which stands for “comma-separated values”), the simplest format that data can take. All these pages would be created by hand, by users. The wiki could then be queried, by URL, to get the contents of that data. Querying would be done by each language’s Wikipedia (the values would also get translated into the right language, an issue I talk about in the proposal), as well as by any outside system that wanted to easily get data from Wikipedia.
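
To make the idea concrete, here is a small sketch of what a client (a language Wikipedia, or any outside system) would do with one of those pages. The page name, columns and values below are invented; the string stands in for the text a client would fetch from the data wiki’s URL:

```python
import csv
import io

# Stand-in for the raw text of one CSV data page fetched from the
# central data wiki. Columns and values are invented for the example.
page_text = """country,capital,population
Hungary,Budapest,9660000
France,Paris,68000000
"""

def load_table(text):
    """Parse one data page's CSV into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_table(page_text)
by_country = {row["country"]: row for row in rows}
print(by_country["Hungary"]["capital"])  # -> Budapest
```

A template on, say, the Hebrew Wikipedia could do the equivalent of this lookup through a parser function, then translate the value into Hebrew before displaying it.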

Is this a “semantic” solution? That depends on who you ask. For some people, “semantic” implies the use of very specific features: semantic triples, ontologies, and data formats like RDF and OWL. For me, that’s an academic discussion – all that really matters is finding a simple way to free up Wikipedia’s data for all sorts of interesting uses.
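
The difference between the two camps is smaller than it sounds. The same fact can be written as a CSV row, where the relation is implied by column position, or as a semantic triple, where the relation is explicit – a toy comparison (values invented for the example):

```python
# One fact, two notations. In a CSV row, the relation is implicit
# in the column order (here: country, capital):
csv_row = ["Hungary", "Budapest"]

# In a semantic triple - the RDF-style building block - the relation
# is named explicitly as (subject, predicate, object):
triple = ("Hungary", "hasCapital", "Budapest")

# Either way, a program can recover the same answer:
assert csv_row[1] == triple[2]
```

The triple form is more self-describing; the CSV form is something any user can type into a wiki page. That trade-off is the whole debate in miniature.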

The appendix was kept untranslated, so you can see it here, for the full technical details.



3 Responses

  1. Most extreme? How about Wikipedia’s List of lists of lists? :) Someone out there must get enjoyment out of over-complicating everything.

    Thanks, Yaron, for a well-explained, down-to-earth article.

    I found their proposed query (given in Hebrew) interesting: “What was the $ exchange rate on the day Kennedy was assassinated?” Unless I’m mistaken, I don’t think SMW can easily handle that kind of query yet.

    It really is amazing how much data redundancy exists between all the Wikipedias. However, I can see some shying away from sort of crowning English as the primary “source language” for all data. Taking your suggestion a step backwards, perhaps the next Wikimedia project could be a translating dictionary for every word and phrase, from every language to every language. Besides being a valuable tool in its own right, it would make manually entering all interwiki links a thing of the past and serve as the backbone for something like your proposal. The source data would appear simultaneously in all languages, and an edit of that data in any language would “propagate” to all other languages. As you mentioned, this could be easily kick-started using the already comprehensive interwiki links. Hmmmm… Maybe this should be the next WikiWorks project! :) (This is not terribly simple. Word meanings vary based on context which is difficult for computers to figure out. But I don’t think it’s impossible.)

  2. Ike

    Have a look at where SMW is being used to create the translating dictionary you describe.


    filceolaire, September 7, 2010 @ 6:50 AM


Continuing the Discussion

  1. [...] This post was mentioned on Twitter by Semantic Wikis, WikiWorks. WikiWorks said: Making Wikipedia into a database – our proposal. [...]