
Making Wikipedia into a database

Last month I was quoted in a Technology Review article about using Wikipedia’s data, and soon after that one of the editors of Hatilda Harevi’it (“The Fourth Tilde”), a Hebrew-language semi-monthly online Wikipedia-based newsletter, wrote me, asking me to write something for Hatilda clarifying the various ideas presented in there. I wrote something in English, which they dutifully translated into Hebrew (I can read Hebrew fine, but my writing leaves something to be desired). The latest edition, vol. 24, came out yesterday, with my column – you can see it here (look for “Semantic MediaWiki” :) ). It ended up being longer than I thought it would – it contained not just an overview of the concepts, but a technical proposal for Wikipedia.

And for the benefit of those of you who can’t read Hebrew, here’s the original version:

Making Wikipedia into a database

In July, the magazine Technology Review published an online article, Wikipedia to Add Meaning to Its Pages, about adding semantics to Wikipedia, that caused a little bit of a stir – I think for most people who read it, it was the first time they had heard about me, Semantic MediaWiki (SMW), or the consulting company WikiWorks (which is mentioned indirectly); for some, it may have been the first time they had heard of the Semantic Web. That article just provided a very brief summary of all the issues involved, so I’d like to give my view of things in more detail.

I see the history of Wikipedia as, in part, a progression from a collection of text articles into something more like a database. As the amount of information in Wikipedia has grown, the structure needed to support it has grown alongside it – that’s an entire world of categories, infobox templates, navigation templates and list pages (which you can see taken to a logical conclusion, though not the most extreme one possible, with the English Wikipedia’s “Lists of lists” category). At the same time, the importance of Wikipedia as a source of data has also grown considerably. Three online projects, all mentioned in the article, are either completely or to a large extent based on using Wikipedia’s data: DBpedia, which puts the information from the English-language Wikipedia on the web in a format that computers can query directly; Freebase, which does something similar for information from many different sources, although Wikipedia is one of the largest; and Powerset (not “PowerSet”), which according to the article gets its Wikipedia information indirectly, via Freebase (I thought it built up its store of information by doing natural-language processing on the main text of Wikipedia articles – in either case, it’s based on Wikipedia). These projects have all done well for themselves – DBpedia is literally at the center of every graph of the world of “linked data”, Powerset was bought in 2008 by Microsoft, and Metaweb, the company behind Freebase, was bought by Google about a week after the article came out (I’m guessing that’s a coincidence).

This progression from text to data, by the way, reflects a larger overall trend in the web, a trend that people have generally referred to as the Semantic Web, and sometimes “Web 3.0”. The term “Semantic Web” has been used to mean many different things, and it’s itself the subject of controversy, but the very basic idea is that we should be able to have content from web pages accessed and understood directly by computers. If, for instance, I want to find the names of the 10 highest-paid actors who were born in Hungary, I should be able to enter my question into the computer in some way, and then have it go to the right sources for the different sets of information, put the information together, and give me back an answer. (Explanations for the Semantic Web often involve users finding plane tickets, but I thought I would give a more interesting example.) People have been talking about the Semantic Web for almost as long as there has been a web, but in the last five years it has really picked up, and now “Web 3.0” is starting to see the same kind of hype that “Web 2.0” once did.

So where does that leave Wikipedia? We’re at the beginning of some sort of online data revolution, and Wikipedia itself is the source for data projects worth tens of millions of dollars, yet Wikipedia’s own approach to data is quite basic – the same facts have to be manually entered by users over and over, at least once in every language and usually more than that. There is also very little ability to export any of the data in a machine-readable way. In short, it’s a wasted opportunity.

For those who want to improve access to Wikipedia’s data, one approach usually stands out: Semantic MediaWiki (SMW), which is an extension to MediaWiki (the software on which Wikipedia runs). It’s also a project that I’ve been involved with for four years. SMW is an extension that lets users easily store information found on the wiki within the wiki’s own database, so that that information can be queried, displayed (in tables, graphs, maps, calendars etc.) and exported elsewhere. I won’t describe SMW here in more detail than that, but if you want to read more about it, the FAQ is a good place to start. The FAQ mentions how Semantic MediaWiki in fact has its roots in a proposal for turning text into data on Wikipedia itself, and that getting SMW onto Wikipedia is still a major goal for some of its developers (though it was never a big goal of mine). Still, despite the Wikipedia connection, Semantic MediaWiki has taken on a big life outside of Wikipedia, and at this point it gets serious usage as a data-management tool within companies, government agencies and other organizations (helping such organizations to use it effectively is the main business of WikiWorks).

That’s SMW, very briefly; but let me change directions here: I may surprise people who know me by saying that I don’t believe that Semantic MediaWiki is the right answer for Wikipedia at the moment. The biggest reason for that is that Wikipedia is itself a collection of over 200 sub-sites, each in a different language; and a single store of data, that all of them can use, is probably a better solution than what SMW could provide, which is a separate data store for each one.

What would such a database look like, then? It would have to fit some general criteria: the data would have to be easily modifiable by people who speak many different languages; the data would have to be usable in many different languages; it would have to be extremely fast; and ideally it could be usable even outside of Wikipedia, as a general-purpose data API.

For those who are curious, and have some understanding of technical concepts like APIs and parser functions, I present, in the appendix, one option for how it could be done: it would involve creating a new wiki at a URL like http://data.wikipedia.org, that would hold thousands (or more) pages of raw data, probably in English; each set would be in CSV format (which stands for “comma-separated values”), the simplest format that data can take. All these pages would be created by hand, by users. The wiki could then be queried, using the URL, to get the contents of that data. Querying would be done by each language Wikipedia (the values would also get translated into the right language, an issue I talk about in the proposal), as well as by any outside system that wanted to easily get data from Wikipedia.
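To make the idea a little more concrete, here is a minimal Python sketch of what a consumer of such a data wiki might do with one page of CSV. Everything in it is an assumption for illustration only: the page contents, the column names, the translation table and the `query` helper are all hypothetical, and the appendix remains the place to look for the actual proposal. The sketch parses the CSV text directly rather than fetching it from a URL, since no such wiki exists yet.

```python
import csv
import io

# Hypothetical contents of one page on the proposed data wiki.
# The columns and rows are invented for illustration.
CAPITALS_PAGE_CSV = """Country,Capital
Hungary,Budapest
Austria,Vienna
"""

# Hypothetical per-language translation table for values stored in English.
TRANSLATIONS = {
    "de": {"Vienna": "Wien", "Hungary": "Ungarn", "Austria": "Österreich"},
}


def parse_data_page(csv_text):
    """Parse the raw CSV text of a data page into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def query(rows, key_column, key_value, wanted_column, lang="en"):
    """Return one cell from the page, translated into `lang` if possible.

    Returns None when no row matches; falls back to the stored (English)
    value when no translation is known.
    """
    for row in rows:
        if row[key_column] == key_value:
            value = row[wanted_column]
            return TRANSLATIONS.get(lang, {}).get(value, value)
    return None


rows = parse_data_page(CAPITALS_PAGE_CSV)
print(query(rows, "Country", "Austria", "Capital"))             # English value
print(query(rows, "Country", "Austria", "Capital", lang="de"))  # translated value
```

Keeping the stored values in one language and translating at query time is what would let every language Wikipedia share a single set of hand-edited pages, instead of each one re-entering the same facts.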

Is this a “semantic” solution? That depends on who you ask. For some people, “semantic” implies the use of very specific features: semantic triples, ontologies, and data formats like RDF and OWL. For me, that’s an academic discussion – all that really matters is finding a simple way to free up Wikipedia’s data for all sorts of interesting uses.

The appendix was kept untranslated, so you can see it here, for the full technical details.

Categories: Uncategorized.

Complex Operations, indeed

I found on Twitter (and we then re-tweeted) this video of Alper Caglayan explaining Semantic MediaWiki at the U.S. Naval Postgraduate School. I’m late on this – the talk happened over three months ago, and Dr. Caglayan has already blogged about the talk, and written up a blog post that summarizes his main points from the talk, in the meantime.

He apparently works for the company Milcord, and the SMW-based wiki he talks about for most of the time is the Complex Operations Wiki, which holds a lot of information about, among other things, the tribes and geography of Afghanistan. The potential uses of the wiki are quite large, though, as far as I understand, the data isn’t being used by anyone right now – at the moment, it’s just a demo wiki, though its development was funded by people within the U.S. Department of Defense.

The video is nice (between around minutes 25 and 35 is the really important part) – it’s always interesting to see how other people present the technology, and how audiences perceive it when hearing about it for the first time. I think one of the obstacles for the SMW/Semantic Forms/etc. system, maybe the main one, is that it’s so different from other technologies out there that it’s hard to explain what it does, and thus how useful it is, in ways people can understand. Dr. Caglayan goes with the approach of calling it a wiki whose data can be queried – which is reasonable, but it doesn’t quite convey the experience of reading or editing the wiki. He shows a form in use, but doesn’t explain that the form in question didn’t require any custom programming to create. Then again, I know from firsthand experience that trying to explain the whole system takes a long time – at some point, I gave a full 8-hour seminar on SMW and its related extensions, and I once saw a 3-hour talk about it that barely covered anything but the basics.

It’s heartening, though, to see the fairly positive reaction of the audience, who are mostly civilians but who seem to play a role in the military’s data policies. The U.S. military has the same data problems as just about any mid-sized-and-larger corporation, from data “silos” to information that may have lost its validity at some point in the past. Hopefully Semantic MediaWiki can be part of the solution.

Categories: Uncategorized.

New MediaWiki extension: Approved Revs

Approved Revs is my latest MediaWiki extension (with some important code contributions made by Jeroen and others), released about a week ago; version 0.2 just came out today. It’s a simple extension, that just lets administrators mark a single revision/version of any wiki page as the “approved” one – so that, when users go to that page, what they see is the approved revision, not necessarily the latest one.
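The core idea can be sketched in a few lines of Python. This is only a toy model, not the extension’s actual (PHP) code: the `Page` class and its method names are hypothetical, and the choice to show the latest revision when nothing has been approved yet is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """A wiki page modeled as an ordered list of revision texts."""
    revisions: List[str] = field(default_factory=list)
    approved_index: Optional[int] = None  # None means nothing is approved yet

    def edit(self, text: str) -> int:
        """Save a new revision and return its index."""
        self.revisions.append(text)
        return len(self.revisions) - 1

    def approve(self, index: int) -> None:
        """Mark exactly one revision as the approved one."""
        if not 0 <= index < len(self.revisions):
            raise IndexError("no such revision")
        self.approved_index = index

    def displayed_text(self) -> str:
        """Text shown to readers: the approved revision if one is set,
        otherwise the latest revision (an assumption of this sketch)."""
        if not self.revisions:
            return ""
        if self.approved_index is not None:
            return self.revisions[self.approved_index]
        return self.revisions[-1]


page = Page()
first = page.edit("Reviewed text")
page.approve(first)
page.edit("Unreviewed change")
print(page.displayed_text())  # still shows the approved revision
```

Later edits pile up in the history as usual; readers only see them once an administrator moves the approval to a newer revision.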

It’s a simple concept, and hardly original: you may be aware that there’s already an extension that does this – FlaggedRevs, which is already in use on a growing number of language Wikipedias; maybe a dozen currently – not yet the English-language one, but it’s probably just a matter of time. What’s different about Approved Revs is just its simplicity – FlaggedRevs puts in place an entire framework for evaluating the quality of any specific revision. That sort of framework approach makes sense for very large sites, like Wikipedia, where decisions about which version to approve have to be done out in the open, and agreed to by many people. For smaller sites, the framework of FlaggedRevs could be overkill – in fact, my first thought to create a new extension came after trying to install FlaggedRevs and getting scared off after around the 3rd paragraph of the documentation (though to be fair, that’s what some people have said about Semantic MediaWiki as well).

Anyway, I think Approved Revs will be an important extension, because it enables an element of workflow, something that MediaWiki has generally lacked. When you create a website with a standard CMS solution like Drupal or WordPress, you can easily save a page or posting in draft form before it gets “published”, i.e. made viewable to the public. And you can have different user types, so that one set of people is responsible for writing the content, and another is responsible for approving it. In MediaWiki it’s a different world: whatever a user last wrote is what everyone sees, whether your user base is a small group of employees or the whole world. This of course offers a big advantage in immediacy, but for some organizations it’s just not acceptable. So Approved Revs could open up the use of MediaWiki to content-management situations where previously it wasn’t a possibility.

And yes, if the page contains semantic data, it’s the data from the approved revision that gets stored by SMW, which is great. (The same behavior could be true of FlaggedRevs – I don’t know.)

Categories: Uncategorized.

WikiWorks goes Rabbinic

Since this is my first post, I’ll start by introducing myself. I started my SMW consulting career with my own company named Tosfos Development, based in Queens, NY. When Yaron started WW and asked me to join, I jumped at the opportunity.

But that’s old news. In new news (That can’t be grammatically correct and if it is, it shouldn’t be. Hey, can I have a full sentence in parentheses smack in the middle of another sentence?) I am pleased to announce that in addition to my years of experience with the semantic technologies, I have just become an ordained Rabbi! Need a kosher wiki? Trim the pork from your budget! Got milk? Then don’t add meat! I’ll end here.

I guess this adds some more diversity to an already diverse team. We have consultants on three continents, which means we cover half the world (subtracting Antarctica – poor Antarctica!). We can handle pretty much anything, such as custom extensions, skins, PHP, Semantic Bundle, wiki hosting and theological questions.

Wishing a happy Independence Day to my fellow Americans!

Categories: Uncategorized.

Six conferences

Working with MediaWiki, and usually Semantic MediaWiki as well, we find ourselves at the intersection of a number of different worlds: Wikipedia, corporate wikis, the Semantic Web, etc. To that end, we’ll be at several conferences this year and next – here are the known ones:

  • First, one from the past: the Spring 2010 SMWCon happened just about a month ago, at MIT in Cambridge, Massachusetts. I mostly put the conference together, including getting the catering (it was a challenge on a tiny budget, but we managed – the key is not to order too much fruit). There were about 30 attendees, and some quite in-depth presentations and discussions. Everything was videotaped, and though the videos aren’t online yet, I have reason to believe that they’ll be up quite soon.
  • The next SMWCon is already scheduled, and happening not long from now – it’ll be September 18-19, in Amsterdam. I’ll be there, along with possibly two other members of WikiWorks. This one looks to be a big deal, judging from the enthusiasm of the planners (not me, thankfully).
  • SemTech, the big annual semantic conference in California. None of us are there this year (it’s happening now), but I think we’ll try for next year.
  • Wikimania – the annual convention relating to all Wikimedia projects, but mostly Wikipedia – is coming up; it’ll be in Gdansk, Poland, in about two weeks. Unfortunately, I won’t be going, unlike the last two years – it just felt like too much, with everything else that was going on. But WikiWorks member Jeroen De Dauw will be there, and will be making several presentations, including one about his extensions Maps and Semantic Maps, and another about his very-eagerly-anticipated Extension Management Platform, that he’s working on now.
  • Speaking of Wikimania, the location of next year’s Wikimania was recently announced – it’ll be in Haifa, Israel next summer. It’s amazing to me that Wikimania is coming to Haifa – I grew up there in the early 1980s, I used to go back with my family fairly often after that, and the city and country are still very dear to my heart. I never thought my professional career would end up taking me back there; I plan to be there, if at all possible, and it may well be an emotional experience.
  • That brings us to RecentChangesCamp, aka “RoCoCo” (that’s the French name for it), which is happening this weekend in Montreal. It covers wikis in general, although judging from the attendees list it looks like at least half the people there will be from the MediaWiki world. I’ll be there, and I’m very much looking forward to it – it’ll be nice to talk to people from different parts of the wiki world – other people involved in development, hosting, design, and all the other things WikiWorks does on a daily basis, though not usually with much input from others. And Montreal is a great city – one of the nicest in the world, I think.

Categories: Uncategorized.

Vector-y at last

The big news today in the MediaWiki/Wikimedia world was that the default look for the English-language Wikipedia (i.e., the one shown to non-registered users, and to logged-in users who haven’t modified their defaults) changed – the logo was made smaller and a little darker, and, more importantly, the default skin was changed from “Monobook” to “Vector”.

I’ve personally known about Vector for about a year now, and six months ago I changed my own wiki, Discourse DB, to use the skin as a default, when I upgraded it to MediaWiki 1.16 (the current skin is basically just Vector done in olive-green). I personally think Vector has a fantastic look – very clean, with bigger, more obvious tabs, and an easier-to-find search area (in the top right, instead of lower down in the sidebar). That last decision, to move the search input, seems to have been a controversial one among Wikipedia users – if you read through the comments in the “new look” blog post, they’re full of pleas to restore the search input to its previous location, with one commenter helpfully stating that “The guy who decided to move the search box should be fired immediately and banished from the IT industry forever.” I’m not a usability expert, but it’s my understanding that the usability tests done by the team who created the Vector skin (many of whom I’ve met, actually) showed that the old location was hard to find for many new users – people have come to expect a search entry right at the top.

There are other complaints that seem more valid – like that parts of the sidebar show up as minimized and require you to click on an arrow in order to see them, which does seem pointless. Thankfully, this appears to be a design decision unique to Wikipedia, and isn’t the default behavior of the Vector skin. There are also complaints about the font size, which I haven’t been able to duplicate – the font size seems the same on my screen as the old version.

Outside of Wikipedia, people seem to like Vector – already we see newly-launched wikis, like Startup Linkup and Innovation Cell, using Vector; and we’re talking now to a client who wants a custom skin, and who’s decided to go with Vector as the basis for it instead of Monobook. It should be noted that this is all happening even though the first version of MediaWiki in which Vector is included, 1.16, hasn’t been officially released yet – it’s been available for a long time, but it’s still in beta, and probably will be for another few months.

What will all this mean for those of us in the MediaWiki business? I think it’s great news – the overall look of MediaWiki has, in my opinion, been one of its weaknesses for a while; even though everyone knows the look from Wikipedia, it’s still been considered clunky, and difficult for people trying to edit for the first time to understand. The tremendous power and flexibility of MediaWiki have tended to outweigh some of the problems with the appearance. But now it looks – dare we say it? – nice.

Categories: Uncategorized.

Summer memories to come

Spring is almost halfway over now, which of course means that it’s almost time… for the Google Summer of Code. For those who don’t know, it’s a program in which Google, out of the goodness of their hearts, sponsors hundreds of projects for various open-source software organizations, with high school and college students doing the actual work, each one paired with a mentor from that organization.

I had been excited about this year’s GSoC for a while. I mentored a project last year, with Jeroen De Dauw as the student, in which he created the extensions Maps and Semantic Maps. It turned out to be a bit of a “game-changer” for Semantic MediaWiki: it greatly improved the mapping capabilities of SMW, and mapping is probably the single most important visualization tool for wiki data. There’s also a rather good chance that at least the Maps extension will end up on Wikipedia itself within a year. Although I’m completely biased, I’d say it could end up being the single most successful Wikimedia GSoC project until now.

The list of accepted projects for the upcoming summer was announced today, and six projects were accepted this year for the Wikimedia Foundation. Of them, there are three that I’m particularly excited about. I’ll be mentoring another project, this one with Sanyam Goyal, from Mumbai, as a student; he’ll be improving the JavaScript in SMW and some of the related extensions to use everybody’s new favorite JavaScript library, jQuery. Jeroen is also doing another project, this one with Brion Vibber as the mentor. It’s a project that has a lot of people buzzing: setting up a framework so that extensions can be downloaded and installed via the web interface, in the same way that WordPress does it. Finally, there’s a project to add RDF-importing capability to SMW, which should be quite helpful, mentored by Denny Vrandecic and done by Samuel Lampa.

Categories: Uncategorized.

Is this thing on?

Tap, tap. Hello, hello. Yes, we are still here. One of the unfortunate aspects to having a company blog (not that there are that many) is that, if you leave it untended for a while, people will begin to suspect that you’re no longer in business. Which is unfortunate, because it’s not at all the case with us: we have a set of MediaWiki-related projects we’re working on now, and more coming up soon. And it’s doubly unfortunate, because there’s actually quite a lot to talk about, in the wiki, MediaWiki and software worlds. So, starting next week, we’ll be back to a more regular blogging schedule.

In the meantime, you can check out our Twitter feed, which is active.

Categories: Uncategorized.

SMWCon coming to Cambridge, May 22-23

I’m helping to put together the upcoming SMWCon, or Semantic MediaWiki Conference, and we just announced the date and location: May 22-23, in around two months, in Cambridge, Massachusetts, USA. I’m very excited about this one, because our last event, the SMW Camp in Karlsruhe, Germany, was a big success, with almost 50 attendees, and a lot of interesting discussions, some of which have already led to improvements in the software; this one should be equally productive, and it should be a good opportunity, especially for North-America-based SMW users, to meet each other and present what they’re doing.

It’s taking place at the CSAIL building at MIT, home of Project Simile, well-known to SMW users as the developers of Timeline and Exhibit, two applications used within the Semantic Result Formats extension.

If you’re interested in attending (it only costs $20! snacks included!), you can read more information and sign up on the Spring 2010 SMWCon page.

Categories: Uncategorized.

Cool OpenEI video

OpenEI, or Open Energy Info, is a Semantic MediaWiki-based wiki created by the National Renewable Energy Laboratory, meant to hold information on renewable-energy initiatives and companies around the world. I helped them a little with putting it together last year, shortly before WikiWorks was founded. By all measures, it looks like the site is going great for them.

They just very recently put up a video showing one of the OpenEI articles (“Vestas”), and how to edit it.

There are already a few videos on the web that show SMW in action, but this one might be my new favorite because it’s clear and to the point, and you can see some related extensions in action there: Semantic Forms, Semantic Maps, CategoryTree (within the form), and Widgets.

The narrator interestingly calls this wiki a “wikipedia”, which I guess in this context means “wiki encyclopedia” – more idiosyncratic than incorrect, I would say.

Categories: Uncategorized.