Notes from the Wikimedia Data Summit

Update: See the comments below for some important corrections/clarifications from Erik Möller.

On Friday I attended the “Wikimedia Data Summit”, which was held at the O’Reilly Media offices in Sebastopol, California. It was a one-day event that was actually more like three summits on three different topics. The first, and the one I was involved with, was meant to discuss plans for making the Wikimedia projects, mostly Wikipedia, more data-focused: essentially, the “Semantic Wikipedia” dream that some people have had for a long time. The second was a discussion on how to improve the web analytics of Wikimedia sites: improving the info and visualizations gathered about page views, editing statistics and all that. And the third was a smaller discussion about improving MediaWiki’s wiki-text parser.

The day started with everyone in the same room, first with an introductory talk by Erik Möller and then with 10-minute talks by various people related to each of the three main topics. Six of them were related in some way to Semantic MediaWiki:

  • I was tasked with explaining all of SMW and its family of extensions in 10 minutes (which I sort of managed to do, barely)
  • Denny Vrandecic talked about Shortipedia, his new-ish SMW-based research project
  • Michael Erdmann from ontoprise talked about SMW+
  • Christian Becker, best known as one of the DBpedia people, talked about his work on the SMW+ Linked Data Extension
  • Mark Greaves from Vulcan talked about Ultrapedia, Vulcan’s demonstration of an SMW-annotated subset of Wikipedia
  • Michael Dale talked about the MetaVid extension and website

You can see the notes from that first session here; and, if you’re curious, you can see the slides from my talk here.

We then split up into three groups, and I joined the “Semantic Wikipedia” group, which was led by Erik, and included the six SMW-related people, as well as people from Google, Freebase (although that’s technically Google too), DBpedia and some other groups. I volunteered to be the note-taker for the discussion, and you can see my notes here.

The basic goal, as Erik defined it, is that they want to create a “Wikimedia Data Commons”, in the spirit of Wikimedia Commons – a single site that would serve as a data repository for all the different language Wikipedias. Facts would be entered there in the form of semantic triples, although really they would include more than three things – besides the fact itself, there would be data like the source (which might include a URL), the date, the language that the information is in, etc. Every Wikipedia could then somehow query that data in order to display it within infoboxes; and the data could also be queried by outside applications.
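To make the "more than three things" idea a bit more concrete, here is a minimal Python sketch of what such a fact record and an infobox-style lookup might look like. Everything in it is an assumption for illustration: the `Fact` class, its field names, the `infobox_rows` function and the sample data are hypothetical, and none of it reflects the actual Data Commons proposal or Shortipedia's schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """A triple plus provenance fields (hypothetical layout, for illustration only)."""
    subject: str
    predicate: str
    value: str
    source: Optional[str] = None    # e.g. a URL backing the claim
    date: Optional[str] = None      # when the fact was true or recorded
    language: Optional[str] = None  # None means language-neutral


def infobox_rows(facts: List[Fact], subject: str, language: str) -> List[Tuple[str, str]]:
    """Return (predicate, value) pairs for one subject, as one language's wiki might request them.

    Facts without a language tag are treated as usable by every language.
    """
    return [
        (f.predicate, f.value)
        for f in facts
        if f.subject == subject and f.language in (None, language)
    ]


# Made-up sample data; the place, numbers and URL are placeholders.
FACTS = [
    Fact("Exampleville", "population", "12345",
         source="https://example.org/census", date="2010-12-31"),
    Fact("Exampleville", "country", "Germany", language="en"),
    Fact("Exampleville", "country", "Deutschland", language="de"),
]

if __name__ == "__main__":
    for predicate, value in infobox_rows(FACTS, "Exampleville", "de"):
        print(f"{predicate}: {value}")
```

The point of the sketch is only that one shared store can hold both language-neutral facts (like a number) and language-specific ones (like a label), and that each Wikipedia would filter for what it needs.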

Ultimately, this project is Erik’s own, and it became obvious early on that his preferred approach was to use Denny’s Shortipedia as a basis; actually, it turned out that Shortipedia had originally been developed, about five months ago, after discussions with Erik. If I can give my own perspective on Shortipedia: it’s a fact-based approach to data, as opposed to Semantic MediaWiki’s generally page-based approach. As far as I know, Shortipedia uses SMW mainly for its pre-existing database tables, as opposed to making use of much of the logic in the SMW code. Actually, Shortipedia reminds me more than anything (as I told Denny when he first showed it to me) of other semantic wiki applications I’ve seen, like AceWiki and OntoWiki. In the non-SMW semantic wikis, the semantic stuff is generally kept separate from the other content, and there’s usually some sort of mini-form that lets you keep adding triples to a page. That approach generally, from my experience, works poorly for individual wikis; but for a massive data repository like the “Wikimedia Data Commons”, it may well be the right way to go. So there you have it: if you want to see the future of data on Wikipedia, Shortipedia might be it.

As for the event itself: if I had planned it, I would have done various things differently. The major issue was that I didn’t think all three events should have been held together. The three topics are rather discrete, and I didn’t think there was much “synergy” generated by putting them all under one roof. Worse than that, it meant that various people who ideally would have taken part in two or all three of the meetings (like many of the Wikimedia development staff) could only be in one.

As for the semantic/data part of things, advance notice that Shortipedia was the way things would probably go would have been helpful – I would have spent more time beforehand thinking about the ramifications of it, something I only really started doing today. As it was, the most helpful thing I contributed at the time was probably my note-taking skills (which I’m proud of, by the way!)

Anyway, the big takeaway from the meeting wasn’t my somewhat underwhelming experience, but rather the fact that there now appears to be at least the beginnings of a path forward for data on Wikipedia. That’s probably the more important news, in the greater scheme of things. :)

  1. As a caveat to any overeager journalists and bloggers, at this point in time, we’re still figuring out how to best resource work in this area in partnership with some of the attendees of the summit. Whether we can, in the near term, turn this into a project with full-time people attached to it therefore still remains to be seen. It’s clear to me that we can’t avoid the structured data challenge indefinitely, but without help, we won’t be able to do it this year. So, we’ll talk more about it when things start moving.

    I’m not wedded to the Shortipedia architecture. Shortipedia is, at the end of the day, still a pretty quick hack, and we may end up abandoning it altogether. I’m interested in a path that, with reasonable effort, gets us to a solution that meets most of the requirements defined in the Data Commons proposal (which, as you recall, I shared with you in July), or at least can be incrementally built towards doing so. I’ll put that proposal up on Meta soon so that it can be discussed further, but want to first pursue some direct one-on-one conversations with some of the folks at the summit who can actually help us make things happen.

    I agree that the synergies weren’t perfect. I think the areas will come closer together (as more structured data becomes available, analytics folks will use it more extensively), so in some ways I see a holistic approach as groundwork for things to come.

    Erik Moeller, February 8, 2011 @ 2:24 AM

