
The Cargo extension, a late announcement

I released the Cargo extension two months ago, though I’m only blogging about it now. (I did post about it to mailing lists and so on.) I wanted to wait a little while before fully announcing it, because on first release it was somewhat more experimental than it is now, and I wasn’t entirely sure that it would really work at all. But now I can say that it does indeed have users, and it does seem to work without any major security flaws.

Cargo is – though it still feels awkward to say it – intended as an alternative to Semantic MediaWiki. And not just to SMW itself, but to the set of libraries that SMW makes use of, and to many of the extensions that have been built on top of SMW, including Semantic Result Formats and Semantic Drilldown (though not Semantic Forms). All in all, Cargo is meant to serve as a substitute for a group of around 15 MediaWiki extensions; a more complete explanation can be found here.

Cargo can take the place of all of these extensions together, with a significantly smaller set of code, because it has a simplified approach to data storage. Instead of storing data as triples (in standard Semantic Web style), it stores data directly in database tables – it constructs a separate table for each set of data, then allows, more or less, direct SQL “SELECT” calls on those data sets. MediaWiki templates are used to define and store the data – much in the same way that they’re already used for data storage in SMW, though with Cargo it’s done more formally.

Altogether, this approach means that no custom storage or querying mechanism really had to be built for Cargo, unlike with Semantic MediaWiki; instead, the code for storing and querying data is a relatively thin wrapper around SQL, though with enough code to handle security concerns and data structures that are not well supported in most database systems, like fields that hold an array of values.
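
To make the storage idea a little more concrete, here is a minimal sketch in Python, using the standard sqlite3 module. It is not Cargo's actual code or schema: the table names, the helper table for the list-valued field, and the function names are all assumptions made up for illustration. It shows one table per data set, a list field split out into its own table, and a thin SELECT wrapper that checks field names against a whitelist and binds values as parameters.

```python
import sqlite3

# Hypothetical data set: one table per set of data, as described above.
# The helper table for the list-valued field is an assumption of this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Books (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE Books__authors (book_id INTEGER, value TEXT);
""")

def store_book(title, year, authors):
    """Store one row, plus one helper row per value of the list field."""
    cur = conn.execute("INSERT INTO Books (title, year) VALUES (?, ?)", (title, year))
    conn.executemany(
        "INSERT INTO Books__authors (book_id, value) VALUES (?, ?)",
        [(cur.lastrowid, author) for author in authors],
    )

ALLOWED_FIELDS = {"title", "year"}

def select_books(fields, author=None):
    """Thin wrapper around SELECT: whitelisted field names, bound values."""
    if not fields or not set(fields) <= ALLOWED_FIELDS:
        raise ValueError("unknown field requested")
    sql = "SELECT " + ", ".join("Books." + f for f in fields) + " FROM Books"
    params = []
    if author is not None:
        sql += " JOIN Books__authors a ON a.book_id = Books.id WHERE a.value = ?"
        params.append(author)
    sql += " ORDER BY Books.id"
    return conn.execute(sql, params).fetchall()

store_book("Good Omens", 1990, ["Terry Pratchett", "Neil Gaiman"])
store_book("Mort", 1987, ["Terry Pratchett"])
```

Calling select_books(["title"], author="Neil Gaiman") then returns only the first book, which is roughly what "a field that holds an array of values" has to mean once the data lives in ordinary tables.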

There are other interesting aspects to Cargo, and the whole issue of storing data in free-form triples vs. in structured tables is a very interesting one – really a philosophical issue. But I’m not going to get into any of that here; there’s a little more about it on the Cargo page. I do want to clarify what this means for WikiWorks. We are a full-service MediaWiki consulting company, which means that we support all manner of MediaWiki customizations and extensions; I look forward to providing support for Semantic MediaWiki installations for a long time. For new MediaWiki installations, I expect that we will be recommending Cargo if there’s ever a need for a data storage component – because it’s easier to set up, use, and maintain than SMW, in my opinion. But the jury’s still out, as they say, on Cargo, and I look forward to seeing what happens with both extensions.


Categories: Uncategorized.

New MultimediaPlayer extension

We just released the MultimediaPlayer extension, which plays a list of multimedia files. It is intended for use with multimedia items hosted by an external service – not stored in the wiki.
This is, as far as I know, an approach to multimedia that no other extension supports. Instead of showing a bunch of multimedia items, such as YouTube video thumbnails, this extension uses one player. It also generates a (usually) text list of items. Clicking on an item loads the player with the correct item.
This presumably provides a performance boost. More importantly, it is a more elegant approach to playing a large number of files. Showing 20 video thumbnails on a page is not so good. It also allows using content from mixed hosts in a way that is pretty hidden from users. Why should the user care whether a video is hosted by YouTube or Vimeo or Uncle Timmy’s? It just needs to play. Items can come from multiple sources and can be a mix of audio and video.

Default sources

By default, the player supports files hosted by DailyMotion, Instagram, SoundCloud, YouTube, Vimeo and Vine. It plays these items by embedding each service’s player. These sources also ship with some CSS that makes the players responsive.

Add your own source

The player is customizable. Admins can add code for an external player. Then that source can be included with the “multimediaitem” parser function call just like any of the default sources.


This was created for the International Anesthesia Research Society and their project You can see it in action here. They are temporarily using a very old version of the extension but the functionality is similar. Most of the items on this page come from Libsyn, and either use Libsyn’s audio or video player.


The player’s documentation is on the web site.

Geeky programmer stuff

The source code can be browsed here.

Object Oriented Programming

As time goes on, I’m becoming more and more object-oriented oriented. I’ve never created an extra class and later regretted it (because it blocked some kind of functionality I later wanted to add) and had to remove that class and insert its code into some other class. But the reverse happens to me all the time. So I decided to skip all that and put everything in its own class. The classes used were:

  1. MultimediaPlayer – This defines the player as a whole, interacting with or holding the other classes.
  2. MultimediaPlayerContainer – This is the Container into which the player scripts are loaded.
  3. MultimediaPlayerItem – Defines an Item, created by the parser function and displayed to the user as a clickable link that loads a player into the container.
  4. MultimediaPlayerSources – Really a static class that just holds the code for each known source.

The Singleton Pattern

Right now, there can only be one MultimediaPlayer per page. That’s probably not ideal and may change. But it calls for only one instance to be used. So I originally created this by instantiating the MultimediaPlayer as a global. And I knew I would have to fix that. Still, I had trouble with deciding how.

In an ideal world we would create one MultimediaPlayer instance and inject it as a dependency into MultimediaPlayerHooks::renderContainer() and MultimediaPlayerHooks::renderMultimediaItem(). But since we’re using a parser hook and tag, we’re pretty set in what those functions can take as parameters.

So I went with the Singleton pattern. This is of course controversial but it’s better than using a global. So it’s an improvement. MediaWiki core uses Singletons in a number of places so I don’t feel too guilty, and I can’t think of a better way. Any ideas? Please comment.
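
For readers who haven't run into the pattern, here is a rough sketch of the idea. It is written in Python purely for illustration (the extension itself is PHP), and apart from the MultimediaPlayer class name, everything here, including the singleton() method and the two hook-like functions, is a hypothetical stand-in rather than the extension's real code.

```python
class MultimediaPlayer:
    """Illustrative stand-in for the extension's player class."""

    _instance = None  # the single shared instance, created on first use

    def __init__(self):
        self.items = []

    @classmethod
    def singleton(cls):
        """Return the one shared player, creating it the first time."""
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


# Hypothetical hook functions: their parameters are dictated by the caller,
# so they fetch the shared player themselves instead of receiving it.
def render_multimedia_item(source, item_id):
    player = MultimediaPlayer.singleton()
    player.items.append((source, item_id))
    return len(player.items)


def render_container():
    return MultimediaPlayer.singleton().items
```

Both functions end up talking to the same object without any global variable, at the cost of a hidden dependency that is harder to swap out in tests. That trade-off is the usual reason the pattern is controversial.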


Post-hackathon thoughts

About a week ago we had the NYC Enterprise MediaWiki Hackathon, a two-day event that was also the first-ever enterprise MediaWiki hackathon.

What does it mean to be an enterprise MediaWiki hackathon? It means that the focus is on code that’s used by companies and other organizations that have a MediaWiki installation – as opposed to by Wikipedia and the like. In practice, that usually means MediaWiki extensions.

There have certainly been MediaWiki hackathons before – there have been about five every year since 2011 – but the focus in all of them, as far as I know, has been in some way related to Wikipedia, whether it’s the development of core MediaWiki, extensions in use on Wikipedia, tools and gadgets for Wikipedia, the visualization of Wikipedia data, etc. Which is all to the good – but there’s also a need for this other stuff.

We first discussed having an enterprise hackathon at SMWCon in Berlin, last October. There was a good amount of interest expressed at the time; and if we had had the hackathon in Europe, there probably would have been more attendees, just by the nature of things. But an event in the US was easier for me to attend and organize, so that’s where we did this first one. I certainly hope we can have one in Europe before too long. (We also talked about naming it a “Create Camp” instead – and there are valid arguments for calling it that – but I stuck with “hackathon” for this one just to keep things simple.)

The event was held at the NYU-Poly incubator, courtesy of Ontodia, a Semantic MediaWiki-using (and -contributing) company based there – big thanks to them.

So, how did it go? Thursday, the first day of the hackathon, coincided with an epic snowstorm that dropped a foot of snow across much of the Northeast. And Friday was Valentine’s Day. And the whole event was pretty last-minute; the dates were only set a month beforehand. So turnout was certainly curtailed; but we managed to get seven people to show up on one or both days, from New York, DC and Boston; which is not bad, I think. Nearly everyone there was a Semantic MediaWiki user, and that was a big focus of the discussion and work.

The single biggest outcome of the hackathon, in my opinion, was a set of changes to Semantic Forms that Cindy Cicalese from MITRE and I worked on, that will allow for easily displaying alias values for dropdowns, radiobuttons and the like. That’s a feature that SF users have been asking about for a long time. We also got a specific Semantic Forms implementation issue resolved; people looked into setting up a semantic triplestore (though the attempt was ultimately unsuccessful); and there were various discussions about large-scale SMW-based data architecture, skinning, and other topics.

What can we learn from all this?

  • A hackathon doesn’t need to be big. More projects are of course generally better, but we managed to get a bunch of stuff done with our small size. Having a small group helped us in getting free space, which kept costs minimal. And the round-table discussions we had at the beginning, introducing ourselves and talking about the projects we wanted to see done, might have taken a lot more time with a large group. (Or simply have been prohibitive to do.)
  • It’s good to have people think about what they want to work on ahead of time, and write their ideas on the wiki page. That helps organizers, and participants, try to plan projects out ahead of time, to maximize productivity.
  • Not all hackathon results are just code – though I’m of two minds about this one. There were some good discussions, and probably necessary ones, about various aspects of organizational MediaWiki usage. In that way, this hackathon resembled the informal part of conferences – the discussions that happen during breaks, over lunch, etc. These are often just as important as the main part of conferences, and at a hackathon you can have those kinds of discussions in a really focused way. (This hackathon was certainly not unique in that respect.) Still, as a developer, I’m focused on creating and improving code, and that to me is the real measurable output of such events. So perhaps it’ll be a while before we know the full outcome of this hackathon. But judging from people’s feedback afterwards, even time spent not writing code was time well spent.


Announcing Miga

I’m excited to announce the first release of a product I’ve been working on for the last several months: the Miga Data Viewer. Miga (pronounced MEE-ga) is an open source application, mostly written in Javascript, that creates an automatic display interface around a set of structured data, where the data is contained in one or more CSV files. (CSV, which stands for “comma-separated values”, is a popular and extremely lightweight file format for storing structured data.) You supply the file(s), and a schema file that explains the type of each field contained in that file or files, and Miga does all the rest.

“Miga” means “crumb” in Spanish, and informally it can mean (I think) anything small and genial. The name helps to suggest the software’s lightweight aspect, though I confess that mostly I just picked the name because I liked the sound of it. (There’s also a resemblance to the name “MediaWiki”, but that is – I think – a coincidence.) As for the “data viewer” part, I could have called it a “data browser” instead of “data viewer” – in some ways, “browser” more accurately reflects what the software does – but that would have had the initials “DB”, which is already strongly associated with “database”. I also considered calling it a “data navigator” or “data explorer” instead, but I liked the compactness of “viewer”.

Conceptually, Miga is based almost entirely on the Semantic MediaWiki system, and on experiences gained working with it about how best to structure data. Miga began its life when I started thinking about what it would take to create a mobile interface for SMW data. It wasn’t long into that process when I realized that, if you’re making a separate set of software just for mobile display, that same set of software could in theory handle data from any system at all. That’s what led me to the “data is data” theory, which I wrote about here earlier this year. To summarize, it’s the idea that the best ways to browse and display a set of data can be determined by looking at the size and schema of the data, not by anything connected to its subject matter. And now Miga exists as, hopefully, a proof of concept of the entire theory.

The practical implementation of Miga owes a lot to Semantic MediaWiki as well. The structure of the database, and the approach to data typing, are very similar to that of Semantic MediaWiki (though the set of data types is not identical – Miga has some types that SMW lacks, like “Start time” and “End time”, that make automatic handling easier). The handling of n-ary/compound data is based on how things are done in SMW – though in SMW the entities used to store such data are called either “subobjects” or “internal objects”, while in Miga they’re called part of “unnamed categories”. And the main interface is based in large part on that of the Semantic Drilldown extension.

You can think of Miga as Semantic MediaWiki without the wiki – the data entry is all required to have been done ahead of time, and the creation of logical browsing structures is done automatically by the software.

There’s another way in which Miga differs from SMW, though, which is that the data is stored on the browser itself, using Web SQL Database (a feature of browsers that really just means that you can store databases on the user’s own computer). The data gets stored in the browser when the site is first loaded, and from then on it’s just there, including if the user closes the browser and then navigates to the page again. It makes a huge difference to have the data all be stored client-side, instead of server-side: pages load up noticeably faster, and, especially on mobile devices, the lack of requirement of the network after the initial load has a big impact on both battery usage and offline usability – if you load the data in the browser on your cell phone, then head into an area with no coverage, you can keep browsing.

The website has more information on usage of the software; I would recommend looking at the Demos page to see the actual look-and-feel, across a variety of different data sets. I’ll include here just one screenshot

Hopefully this one image encapsulates what Miga DV can do. The software sees that this table of data (which comes from infobox text within the Wikipedia pages about public parks) contains geographical coordinates, so it automatically makes a mapping display available, and handles everything related to the display of the map. You can see above the map that there’s additional filtering available, and above that that one filter has already been selected. (And you can see here this exact page in action, if you want to try playing around with all the functionality.)

Miga DV is not the only framework for browsing through arbitrary structured data. Spreadsheets offer it to some extent, via “pivoting” and the like, including online spreadsheet applications like Google Docs. The application Recline.js offers something even closer, with the ability to do mapping, charting and the like, although the standard view is much closer to a spreadsheet than Miga’s is. There are libraries like Exhibit and Freebase Parallax that allow for browsing and visualization of data that’s overall more sophisticated than what Miga offers. Still, I think Miga offers an interface that’s the closest to the type of interface that people have become used to on the web and in apps, with a separate page for each entity. That, combined with the ease of setup for administrators, makes Miga a good choice in many situations, in my opinion.

There’s also the element that Miga is open-source software. I know less about what’s going on among proprietary software in this field, but I wouldn’t be surprised if there’s similar software and/or services that cost money to use. There’s certainly no shortage of proprietary app-development software; the advantage of an open-source solution over paid software is a matter of personal opinion.

What next? There are a few features that I’m hoping to add soon-ish, the most important being internationalization (right now all the text displayed is hardcoded in English). In the longer term, my biggest goal for the software is the ability to create true mobile apps with it. There are a few important advantages that mobile apps have over web-based applications; the biggest, in my opinion, is they can be used fully offline, meaning even if the phone or device is shut off and then restarted somewhere with little or no reception. People do also like the convenience of having a separate icon for each app, though that can be replicated to some extent with URLs (which, as I understand, is easier to do on iOS than Android.)

My hope is of course that people start to make use of this software for their own data – both for public websites and for private installations, such as within corporations. Maybe Miga, or systems like it, will mean that a lot of data that otherwise would never be published, because creating an interface around it would take too much money and/or developer time, will finally get its day. And beyond that, it would be great if some sort of user and developer community grew around the software; but we’ll see.


Data is data

My new motto for 2013 is: “data is data”. What does that mean (aside from being a tautology)? It means that data has a set of behaviors and a “personality” all its own, very much distinct from its underlying subject matter, or from what format it’s in. Ultimately, the fact that a table of data holds an ID field, three string fields and a date, and that it has 300 rows, says more about how to display and interface with it than the fact that it’s an elementary school activity listing, or a description of video game characters, or top-secret military information, or the results of biotech research. And the fact that the data is in an Excel spreadsheet, or a database, or embedded as RDF in a web page, shouldn’t matter either.

What is data?

I should first define what I mean by data. I’m talking about anything that is stored as fields – slots where the meaning of some value can be determined by the name or location of that value. And it should be information that represents something about the outside world.

For that latter reason, I don’t think that information related to communication – whether it’s email, blog posts, microblogs and status updates, discussion forums and so on – is truly data, though it’s usually stored in databases. It’s not intended to represent anything beyond itself – there’s data about it (who wrote it and when, etc.), but the communication itself is not really data. Tied in with that, there’s rarely a desire to display such communications in a venue other than where it was originally created. (Communication represents the vast majority of information in social networking sites, but it’s not the only information – there’s also real data, like users’ friends, interests, biographical information and so on.)

Data can be stored in a lot of different ways. I put together a table of the different terms used in different data-storage approaches:

Database/spreadsheet | Table | Row | Column | Value, cell
Standard website | Category, page type | Page (usually) | Field | Value
Semantic MediaWiki | Category | Page (usually) | Property, field, template parameter | Value
Semantic Web | Class | Subject | Predicate, relationship | Object
Object-oriented programming | Class | Object, instance | Field, property | Value

What’s the most obvious observation here (other than maybe the fact that this too is a table of data)? That, for all of their differences, all of these storage mechanisms are dealing with the same things – they just have different ways to refer to them.

A wrapper around a database

The vast majority of websites that have ever been created have been, at heart, code around a database, where the code is mostly intended to display and modify the contents of that database. That’s true of websites from Facebook to eBay to Craigslist to Wikipedia. There’s often an email component as well, and sometimes a credit card handling component, and often peripherals like ads, but the basic concept is: there’s a database somewhere, and users can use the site to navigate around some or all of its contents, some or all users can also use the site to add or modify contents. The data structure is fixed (though of course it changes, usually getting more complex, over time), and often, all the code to run the site had to be created more or less from scratch.

Of course, not all web products are standalone websites: there’s software to let you create your own blogs, wikis, e-commerce sites, social networking sites, and so on. This software is more generic than standalone websites, but it, too, is tied to a very specific data structure.

So you have millions of hours, or possibly even billions, that have been spent creating interfaces around databases. And in a lot of those projects, the same sort of logic has been implemented over and over again, in dozens of different programming languages and hundreds of different coding styles. This is not to say that all of that work has been wasted: there has been a tremendous amount of innovation, hard work and genius that has gone into all of it, optimizing speed, user interface, interoperability and all of that. But there has also been a lot of duplicated work.

Now, as I noted before, not all data stored in a database should be considered data: blog posts, messages and the like should not, in my opinion. So my point about duplicated work in data-handling may not fully apply to blogs, social networking sites and so on. I’m sure there’s needlessly duplicated work on that side of things as well, but it’s not relevant to this essay. (Though social-networking sites like Facebook do include true structured data as well, about users’ friends, interests, biographical information, etc.)

What about [insert software library here]?

Creating a site, or web software, “from scratch” can mean different things. There are software libraries that work with a database schema, making use of the set of tables and their fields to let you create code around a database without having to do all the drudgework of creating a class from scratch for every table, etc. Ruby on Rails is the most well-known example, but there’s a wide variety of libraries in various languages that do this sort of thing: they are libraries that implement what’s known as the active record pattern. These “active record” libraries are quite helpful when you’re a programmer creating a site (I myself have created a number of websites with Ruby on Rails), but still, these are tools for programmers. A programmer still has to write code to do anything but the most basic display and editing of information.

So here’s a crazy thought: why does someone need to write any code at all, to just display the contents of a database in a user-friendly manner? Can’t there be software that takes a common-sense approach to data, displaying things in a way that makes sense for the current set of data?

No database? No problem

And, for that matter, why does the underlying data have to be in a database, as nearly all web software currently expects it to be? Why can’t code that interfaces with a database work just as well with data that’s in a spreadsheet, or an XML file, or available through an API from some other website? After all, data is data – once the data exists, you should be able to display it, and modify it if you have the appropriate permissions, no matter what format it’s in.

It’s too slow to query tens of thousands of rows of data if they’re in a spreadsheet? Fine – so have the application generate its own database tables to store all that data, and it can then query on that. There’s nothing that’s really technically challenging about doing that, even if the amount of data stretches to the hundreds of thousands of rows. And if the data or data structure is going to change in the outside spreadsheet/XML/etc., you can set up a process to have the application keep re-importing the current contents into its internal database and delete the old stuff, say once a day or once a week.
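
Here is a minimal sketch of that re-import step, in Python with the standard csv and sqlite3 modules. The function names, the table name and the sample rows are all made up for illustration, and a real tool would also need to handle data types, ragged rows and failed imports.

```python
import csv
import io
import sqlite3

def quote(name):
    """Quote an SQL identifier so names taken from the file cannot break the statement."""
    return '"' + name.replace('"', '""') + '"'

def reimport(conn, table, csv_text):
    """Replace the application's internal copy of `table` with the CSV's current contents."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(quote(c) for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"DROP TABLE IF EXISTS {quote(table)}")  # delete the old stuff
    conn.execute(f"CREATE TABLE {quote(table)} ({columns})")
    conn.executemany(f"INSERT INTO {quote(table)} VALUES ({placeholders})", data)
    conn.commit()

conn = sqlite3.connect(":memory:")
reimport(conn, "parks", "name,city\nCentral Park,New York\n")
# A later run (say, from a daily scheduled job) picks up whatever the source holds now.
reimport(conn, "parks", "name,city\nCentral Park,New York\nGolden Gate Park,San Francisco\n")
```

After the second call the internal table simply mirrors the source again, so the application can run ordinary SQL queries against it no matter what format the data originally came in.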

Conversely, if you’re sure that the underlying data isn’t being modified, you could have the application also allow users to modify its data, and then propagate the changes back to the underlying source, if it has permissions to do so.

Figuring out the data structure, and other complications

Now, you may argue that it’s not really possible to take, say, a set of Excel spreadsheets and construct an interface out of it. There’s a lot we don’t know: if there are two different tables that contain a column called “ID”, and each one has a row with the value “1234” for that column, do those rows refer to the same thing? And if there’s a column that mostly contains numbers, except for a few rows where it contains a string of letters, should that be treated as a number field, or as a string? And so on.

These are valid points – and someone who wants to use a generic tool to display a set of data will probably first have to specify some things about the data structure: which fields/columns correspond to which other fields/columns, what the data type is for each field, which fields represent a unique ID, and so on. (Though some of that information may be specified already, if the source is a database.) The administrator could potentially specify all of that “meta-data” in a settings file, or via a web interface, or some such. It’s some amount of work, yes – but it’s fairly trivial, certainly compared to programming.
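
As a rough sketch of how small that “meta-data” could be, here is one possible shape for it in Python. The settings format, the table and field names, and the 90% threshold are all assumptions invented for this example; the only idea taken from the text is that explicit declarations by the administrator settle the ambiguous cases, and the software guesses the rest.

```python
def guess_type(values, threshold=0.9):
    """Guess 'number' if at least `threshold` of the non-empty values parse as numbers."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"

    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    share = sum(is_number(v) for v in non_empty) / len(non_empty)
    return "number" if share >= threshold else "string"

# Hypothetical administrator settings: explicit declarations win over guesses.
SETTINGS = {
    "parks": {
        "id_field": "ID",
        "types": {"ID": "string"},
        "links": {"city_id": ("cities", "ID")},  # parks.city_id refers to cities.ID
    }
}

def field_type(table, field, values):
    """Use the declared type if there is one, otherwise fall back to guessing."""
    declared = SETTINGS.get(table, {}).get("types", {})
    return declared.get(field) or guess_type(values)
```

With this, a column holding 19 numbers and one “n/a” would be treated as a number field, while an ID column that happens to look numeric stays a string because the administrator said so.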

Another complication is read-access. Many sites contain information that only a small percentage of their users can access. And corporate sites of course can contain a lot of sensitive information, readable only to a small group of managers. Can all of that read-access control really be handled by a generic application?

Yes and no. If the site has some truly strange or even just detailed rules on who can view what information, then there’s probably no easy way to have a generic application mimic all of them. But if the rules are basic – like that a certain set of users cannot view the contents of certain columns, or cannot view an entire table, or cannot view the rows in a table that match certain criteria, then it seems like that, too, could be handled via some basic settings.
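
The three basic kinds of rule mentioned here (hidden tables, hidden columns, hidden rows) could be expressed in settings along these lines. This is only a sketch in Python; the rule format, the group name and the sample table are hypothetical.

```python
# Hypothetical rule format: per user group, which tables, columns and rows are hidden.
RULES = {
    "staff": {
        "hidden_tables": {"salaries"},
        "hidden_columns": {"employees": {"home_address"}},
        "hidden_rows": {"employees": lambda row: row["department"] == "Legal"},
    }
}

def visible_rows(group, table, rows):
    """Return the rows (as dicts) that `group` may see, with hidden columns removed."""
    rules = RULES.get(group, {})
    if table in rules.get("hidden_tables", set()):
        return []
    hidden_cols = rules.get("hidden_columns", {}).get(table, set())
    is_hidden = rules.get("hidden_rows", {}).get(table, lambda row: False)
    return [
        {key: value for key, value in row.items() if key not in hidden_cols}
        for row in rows
        if not is_hidden(row)
    ]
```

Note that in this sketch a group with no rules at all sees everything; a real application would probably want the opposite default, where access has to be granted explicitly.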

Some best practices

Now, let’s look at some possible “best practices” for displaying data. Here are some fairly basic ones:

  • If a table of data contains a field storing geographical coordinates (or two fields – one for latitude and one for longitude), chances are good that you’ll want to display some or all of those coordinates in a map.
  • If a table of data contains a date field, there’s a reasonable chance that you’ll want to display those rows in a calendar.
  • For any table of data holding public information, there’s a good chance that you’ll want to provide users with a faceted search interface (where there’s a text input for each field), or a faceted browsing/drill-down interface (where there are clickable/selectable values for each field), or both, or some combination of the two.

If we can make all of these assumptions, surely the software can too, and provide a default display for all of this kind of information. Perhaps having a map should be the default behavior, that happens unless you specify otherwise?
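
A minimal sketch of what such defaults might look like, in Python. The type names (“coordinates”, “latitude”, “longitude”, “date”) and the display names are placeholders chosen for this example, not taken from any existing software.

```python
def suggest_displays(schema, is_public=True):
    """`schema` maps field names to type names; returns default displays in a fixed order."""
    types = set(schema.values())
    displays = []
    # One coordinates field, or a latitude/longitude pair, suggests a map.
    if "coordinates" in types or {"latitude", "longitude"} <= types:
        displays.append("map")
    if "date" in types:
        displays.append("calendar")
    if is_public:
        displays.append("faceted search")
    return displays
```

For a schema like {"name": "string", "location": "coordinates", "opened": "date"}, this suggests a map, a calendar and a faceted search interface, and an administrator would only need to step in to switch one of them off.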

But there’s more that an application can assume than just the need for certain kinds of visualization interfaces. You can assume a good amount based on the nature of the data:

  • If there are 5 rows of data in a table, and it’s not a helper table, then it’s probably enough to just have a page for each row and be done with it. If there are 5,000 rows, on the other hand, it probably makes sense to have a complete faceted browsing interface, as well as a general text search input.
  • If there are 10 columns, then, assuming you have a page showing all the information for any one row of data, you can just display all the values on that one page, in a vertical list. But if you have 100 columns, including information from auxiliary tables, then it probably makes sense to break up the page, using tabs or “children” pages or just creative formatting (small fonts, use of alternating colors, etc.)
  • If a map has over, say, 200 points, then it should probably be displayed as a “heat map”, or a “cluster map”, or maps should only show up after the user has already done some filtering.
  • If the date field in question has a range of values spread out over a few days, then just showing a list of items for each day makes sense. If it’s spread out over a few years, then a monthly calendar interface makes sense. And if it’s spread out over centuries, then a timeline makes sense.

Somehow software never makes these kinds of assumptions.
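
The size-based rules in the list above could be sketched in Python along these lines. The 200-point limit comes from the list; every other cutoff (31 days, 20 years, 50 rows) is an arbitrary assumption standing in for "a few days", "a few years" and a small table, and the function names are made up.

```python
from datetime import date


def map_style(point_count, max_plain_points=200):
    """Plain markers for small maps, clustering beyond roughly 200 points."""
    return "plain markers" if point_count <= max_plain_points else "cluster map"


def date_display(dates, max_list_days=31, max_calendar_years=20):
    """Pick a display from how widely the dates are spread (cutoffs assumed)."""
    span_days = (max(dates) - min(dates)).days
    if span_days <= max_list_days:
        return "list per day"
    if span_days <= max_calendar_years * 365:
        return "monthly calendar"
    return "timeline"


def browsing_style(row_count, max_simple_rows=50):
    """One page per row for tiny tables, faceted browsing plus search otherwise."""
    if row_count <= max_simple_rows:
        return "page per row"
    return "faceted browsing + text search"


print(map_style(150), "/", map_style(3000))
print(date_display([date(2013, 3, 20), date(2013, 3, 22)]))
print(date_display([date(1700, 1, 1), date(2013, 1, 1)]))
print(browsing_style(5), "/", browsing_style(5000))
```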

I am guilty of that myself, by the way. My MediaWiki extension Semantic Drilldown lets you define a drill-down interface for a table of data, just by specifying a filter/facet for every column (or, in Semantic MediaWiki’s parlance, property) of data that you want filterable. So far, so good. But Semantic Drilldown doesn’t look at the data to try to figure out the most reasonable display. If a property/column has 500 different values, then a user who goes to the drilldown page (at Special:BrowseData) will see 500 different values for that filter that they can click on. (And yes, that has happened.) That’s an interface failure: either (a) those values should get aggregated into a much smaller number of values; or (b) there should be a cutoff, so that any value that appears in fewer than, say, three pages should just get grouped into “Other”; or (c) there should just be a text input there (ideally, with autocompletion), instead of a set of links, so that users can just enter the text they’re looking for; or… something. Showing a gigantic list of values does not seem like the ideal approach.
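
As an illustration of options (b) and (c), here is a small Python sketch that looks at the values before deciding how to show a filter. This is not how Semantic Drilldown works; the function `facet_widget` and both thresholds are hypothetical.

```python
from collections import Counter


def facet_widget(values, max_links=30, min_pages=3):
    """Decide how to show one filter, given the value found on each page.

    Values that appear on fewer than `min_pages` pages are folded into
    "Other" (option b). If there are still more than `max_links` distinct
    values, fall back to a text input (option c). Thresholds are assumptions.
    """
    counts = Counter(values)
    grouped = Counter()
    for value, n in counts.items():
        grouped[value if n >= min_pages else "Other"] += n
    if len(grouped) > max_links:
        return ("text_input", None)
    return ("links", dict(grouped))


colors = ["red"] * 5 + ["blue"] * 4 + ["teal", "mauve"]
print(facet_widget(colors))
```

With the sample data, the two one-off values end up under "Other", while a property with 500 well-used values would get a text input instead of 500 links.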

Similarly, for properties that are numbers, Semantic Drilldown lets you define a set of ranges for users to click on: it could be something like 0-49, 50-199, 200-499 and so on. But even if this set of ranges is well-calibrated when the wiki is first set up, it could become unbalanced as more data gets added – for example, a lot of new data could be added whose values for that property all fall in the 0-49 range. So why not have the software itself set the appropriate ranges, based on the set of data?
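
One simple way software could set the ranges itself is to use quantiles as cut points, so that each range holds roughly the same number of items. The Python sketch below shows that idea; it is one possible approach rather than anything Semantic Drilldown does, and the function name `auto_ranges` and the default of four buckets are assumptions.

```python
import statistics


def auto_ranges(values, bucket_count=4):
    """Split numeric values into ranges holding roughly equal numbers of items.

    Quantiles serve as cut points, so the ranges follow the data instead of
    being fixed when the wiki is first set up.
    """
    cuts = statistics.quantiles(values, n=bucket_count, method="inclusive")
    edges = [min(values)] + cuts + [max(values)]
    # Drop duplicate edges, which occur when many items share one value.
    edges = sorted(set(edges))
    return list(zip(edges, edges[1:]))


# Evenly spread data gives evenly sized ranges...
print(auto_ranges(list(range(101))))
# ...while data bunched up at the low end gives narrower ranges there.
print(auto_ranges([1, 2, 3, 5, 8, 10, 12, 15, 40, 450]))
```

Rounding the edges to "nice" numbers would be a sensible next step for display, but is left out here.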

And maybe the number ranges should themselves shift, as the user selects values for other filters? That’s rarely done in interfaces right now, but maybe there’s an argument to be made for doing it that way. At the very least, having intelligent software that is aware of the data it’s handling opens up those kinds of dynamic possibilities for the interface.

Mobile and the rest

Another factor that should get considered (and is also more important than the underlying subject matter) is the type of display. So far I’ve described everything in terms of standard websites, but you may want to display the data on a cell phone (via either an app or a customized web display), or on a tablet, or on a giant touch-screen kiosk, or even in a printed document. Each type of display should ideally have its own handling. For someone creating a website from scratch, that sort of thing can be a major headache – especially the mobile-friendly interface – but a generic data application could provide a reasonable default behavior for each display type.

By the way, I haven’t mentioned desktop software yet, but everything that I wrote before, about software serving as a wrapper around a database, is true of a lot of enterprise desktop software as well – especially the kind meant to hold a specific type of data: software for managing hospitals, amusement parks, car dealerships, etc. So it’s quite possible that an approach like this could be useful for creating desktop software.

Current solutions

Is there software (web, desktop or otherwise) that already does this? At the moment, I don’t know of anything that even comes close. There’s software that lets you define a data structure, either in whole or in part, and create an interface apparatus around it of form fields, drill-down, and other data visualizations. I actually think the software that’s the furthest along in that respect is the Semantic MediaWiki family of MediaWiki extensions, which provide enormous amounts of functionality around an arbitrary data structure. There’s the previously-mentioned Semantic Drilldown, as well as functionality that provides editing forms, maps, calendars, charts etc. around an arbitrary set of data. There are other applications that do some similar things – like other wiki software, and like Drupal, which lets you create custom data-entry forms, and even like Microsoft Access – but I think they all currently fall short of what SMW provides, in terms of both out-of-the-box functionality and ease of use for non-programmers. I could be wrong about that – if there’s some software I’m not aware of that does all of that, please let me know.

Anyway, even if Semantic MediaWiki is anywhere near the state of the art, it still is not a complete solution. There are areas where it could be smarter about displaying the data, as I noted before, and it has no special handling for mobile devices; but much more importantly than either of those, it doesn’t provide a good solution for data that doesn’t already live in the wiki. Perhaps all the world’s data should be stored in Semantic MediaWiki (who am I to argue otherwise?), but that will never be the case.

Now, SMW actually does provide a way to handle outside data, via the External Data extension – you can bring in data from a variety of other sources, store it in the same way as local data, and then query/visualize/etc. all of this disparate data together. I even know of some cases where all, or nearly all, of an SMW-based wiki’s data comes from external sources – the wiki is used only to store its own copy of the data, which it can then display with all of its out-of-the-box functionality like maps, calendars, bulleted lists, etc.

But that, of course, is a hack – an entire wiki apparatus around a set of data that users can’t edit – and the fact that this hack is in use just indicates the lack of other options currently available. There is no software that says, “give me your data – in any standard format – and I will construct a pleasant display interface around it”. Why not? It should be doable. Data is data, and if we can make reasonable assumptions based on its size and nature, then we can come up with a reasonable way to display it, without requiring a programmer for it.

Bringing in multiple sources

And as with SMW and its use of the External Data extension, there’s no reason that the data all has to come from one place. Why can’t one table come from a spreadsheet, and another from a database? Or why can’t the data come from two different databases? If the application can just use its own internal database for the data that it needs, there’s no limit to how many sources that data can originally have been stored in.
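
A minimal Python sketch of that idea follows: one table arrives as a CSV export from a spreadsheet, another is copied from a separate database, and both land in a single internal SQLite database where they can be queried together. The helper names `load_csv` and `copy_table`, and the sample tables, are invented for the example.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Copy a CSV export (e.g. from a spreadsheet) into the internal database."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Table and column names are trusted in this sketch; real code must validate them.
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


def copy_table(source, conn, table):
    """Copy one table from another database connection into the internal one."""
    cursor = source.execute(f'SELECT * FROM "{table}"')
    header = [col[0] for col in cursor.description]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', cursor.fetchall())


# Stand-ins for the two outside sources.
spreadsheet = "city,country\nAdelaide,Australia\nBerlin,Germany\n"
external = sqlite3.connect(":memory:")
external.execute("CREATE TABLE events (name, city)")
external.execute("INSERT INTO events VALUES ('SMWCon', 'Berlin')")

internal = sqlite3.connect(":memory:")
load_csv(internal, "cities", spreadsheet)
copy_table(external, internal, "events")

# Once everything is local, the original source no longer matters.
joined = internal.execute(
    "SELECT events.name, cities.country "
    "FROM events JOIN cities ON events.city = cities.city"
).fetchall()
print(joined)
```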

And that also goes for public APIs that provide general information that can enrich the local information one has. There are a large and growing number of general-information APIs, and the biggest one by far is yet to come: Wikidata, which will hold a queryable store of millions of facts. How many database-centered applications could benefit from additional information like the population of a city, the genre of a movie, the flag of a country (for display purposes) and so on? Probably a fair number. And a truly data-neutral application could display all such information seamlessly to the user – so there wouldn’t be any way of knowing that some information originally came from Wikidata as opposed to having been entered by hand by that site’s own creators or users.

Data is data. It shouldn’t be too hard for software to understand that, and it would be nice if it did.

Categories: Uncategorized.

SMWCon coming to New York, March 20-22

If you use Semantic MediaWiki, or are curious about it, I highly recommend going to SMWCon, the twice-yearly conference about Semantic MediaWiki. The next one will be a month and a half from now, in New York City – the conference page is here. I will be there, as will Jeroen De Dauw, who will be representing both core SMW developers and the extremely important Wikidata project; as will a host of SMW users from corporations, the US government, startups and academia. There will be a lot of interesting talks, the entrance fee is quite reasonable ($155 for a three-day event), and I’m the local chair, so I can tell you for sure that there will be some great evening events planned. (And the main conference will be at the hacker mecca ITP, which is itself a cool spot to check out if you’ve never been there.) I hope some of you can make it!


Announcing our book…

I am happy to announce a project I’ve been working on for a rather long time now: Working with MediaWiki, a general-purpose guide to MediaWiki. It finally was released officially two days ago. It’s available in print-on-demand (where it numbers roughly 300 pages), e-book (.epub and .mobi formats) and PDF form.

As anyone who knows WikiWorks and our interests might expect, Semantic MediaWiki and its related extensions get a heavy focus in the book: a little less than one third of the book is about the Semantic MediaWiki-based extensions. I think that’s generally a good ratio: anyone who wants to learn about Semantic MediaWiki can get a solid understanding of it, both conceptually and in practice; while people who don’t plan to use it still get a lot of content about just about every aspect of MediaWiki.

This book is, in a sense, an extension of our consulting business, because there’s a lot of information and advice contained there that draws directly on my and others’ experience setting up and improving MediaWiki installations for our clients. There’s a section about enabling multiple languages on the same wiki, for instance, which is a topic I’ve come to know fairly well because that’s a rather common request among clients. The same goes for controlling read- and write-access. Conversely, there is only a little information devoted to extensions that enable chat rooms within wikis, even though there are a fair number of them, because clients have never asked about installing chat stuff.

So having this book is like having access to our consultants, although of course at a lower price and with all the benefits of the written word. (And plenty of illustrations.) And I think it’s a good investment even for organizations that do work with us, to get the standard stuff out of the way so that, when it comes time to do consulting, we can focus on the challenging and unique stuff.

Once again, here is the book site: Working with MediaWiki. I do hope that everyone who’s interested in MediaWiki checks it out – my hope is that it could make using MediaWiki simpler for a lot of people.


Launch of Innovention wiki

Today is the launch date of a new WikiWorks site: Innovention wiki, which “showcases the themes of innovation and invention through stories drawn from South Australia.”


We were able to design a really nice skin for the site, based on the specs of their designer. It uses a 3-column layout which is kind of uncharted territory as far as MediaWiki skins go. Part of the challenge here was the right-hand column. The search section is a part of the skin, while the maps, photos and videos are generated by the MediaWiki page itself. This was accomplished by putting that stuff into a div which uses absolute positioning.

Another challenge was trying to fit a decent form into a very narrow middle column. The solution was to hide the right column via CSS, since the search form doesn’t really need to be on a form page. Then, the middle column is stretched to cover both columns. This was easy to do, since Semantic Forms helpfully adds a class to the body tag for any formedit page (which works for existing pages) and MediaWiki adds a class to the body tag of any Special page (for when adding a new place with Special:FormEdit). So the content area was accessed with:

.action-formedit #content, .mw-special-FormEdit #content {
   width: (a whole lot);
}

Displaying different stuff to logged in and anonymous users

While on the topic of body attributes, MediaWiki does not add any classes to the body tag which would differentiate logged in from anonymous users. This doesn’t present a problem for the skin, which can easily check if the user is logged in. But what if you wanted to have a part of the MediaWiki content page displayed only for anonymous users? A common example would be exhortations to create an account and/or sign in. That’s something that should be hidden for logged in users. Fortunately, this is easily and cleanly resolved.

Since this was a custom skin, we overrode the Skin class’s handy addToBodyAttributes function (hat tip):

function addToBodyAttributes( $out, $sk, &$bodyAttrs ) {
   $bodyClasses = array();

   /* Extend the body element by a class that tells whether the user is
      logged in or not */
   if ( $sk->getUser()->isLoggedIn() ) {
      $bodyClasses[] = 'isloggedin';
   } else {
      $bodyClasses[] = 'notloggedin';
   }

   // Append to any classes already set on the body tag
   if ( isset( $bodyAttrs['class'] ) && strlen( $bodyAttrs['class'] ) > 0 ) {
      $bodyAttrs['class'] .= ' ' . implode( ' ', $bodyClasses );
   } else {
      $bodyAttrs['class'] = implode( ' ', $bodyClasses );
   }

   return true;
}

For the built in skins, this is still easy to do. Just use the same code with the OutputPageBodyAttributes hook in your LocalSettings.php. This function adds a class to the body tag called either “isloggedin” or “notloggedin.” Then add the following CSS to your MediaWiki:SkinName.css:

.isloggedin .hideifloggedin {
   display: none;
}
.notloggedin .hideifnotloggedin {
   display: none;
}

Now in your MediaWiki code simply use these two classes to hide information from anonymous or logged in users. For example:

<span class="hideifnotloggedin">You're logged in, mate!</span>
<span class="hideifloggedin">Dude, you should really make an account.</span>

Combine with some nifty login and formedit links

Or even better, here’s a trick to generate links to edit the current page with a form:

<span class="hideifnotloggedin"> [{{fullurl:{{FULLPAGENAMEE}}|action=formedit}} Edit this page]</span>

…and a bonus trick that will log in an anonymous user and THEN bring him to the form edit page:

<span class="hideifloggedin">[{{fullurl:Special:Userlogin|returnto={{FULLPAGENAMEE}}&returntoquery=action=formedit}} Log in and edit this page.]</span>

It doesn’t get much better than that! See it in action here. Yes, you’d have to make an account to really see it work. So take my word for it.

Spam bots

While on the subject of making an account, it seems that bots have gotten way too sophisticated. One of our clients had been using ConfirmEdit with reCAPTCHA and was getting absolutely clobbered by spam. I’ve found that for low-traffic wikis, the best and easiest solution is to combine ConfirmEdit with QuestyCaptcha instead. Its questions are easily broken by an attacker who is specifically targeting that wiki, but very few wikis have gained that level of prominence. The trick is to ask a question that only a human can answer. I’ve had success with this type of question:

Please write the word, “horsse”, here (leave out the extra “s”): ______

Featured article slideshow

This site has a pretty cool main page. The main contributor to that coolness is the transitioning slideshow with various featured articles. Gone are the days when a wiki only featured one page! This was made possible by bringing the Javascript Slideshow extension up to date, which was done by our founder Yaron Koren in honor of Innovention wiki. The articles are inserted manually which gives the user complete control over the appearance. But it would be pretty simple to generate the featured pages with a Semantic MediaWiki query.

Tag Cloud

Also on the main page is a nifty tag cloud. That is done with Semantic Result Formats and its tag-cloud format (by our own Jeroen De Dauw). Maybe I’ll blog more about that as it develops.

Stay tuned

The site will be developed further over the next few weeks, with some neat stuff to come…


Wikidata begins

I regret to say that our consultant Jeroen De Dauw will not be doing any significant work for WikiWorks for at least the next year. Thankfully, that’s for a very good reason: he’s moved to Berlin to be part of the Wikidata project, which starts tomorrow.

Wikidata is headed by Denny Vrandecic, who, like Jeroen, is a friend and colleague of mine; and its goal is to bring true data to Wikipedia, in part via Semantic MediaWiki. There was a press release about it on Friday that got some significant media attention, including this good summary at TechCrunch.

I’m very excited about the project, as a MediaWiki and SMW developer, as a data enthusiast, and simply as a Wikipedia user. This project is quite different from any of the work that I’ve personally been involved with, because Wikipedia is simply a different beast from any standard wiki. There are five challenges that are specific to Wikipedia: it’s massive, it needs to be extremely fast at all times, it’s highly multi-lingual (over 200 languages currently), it requires references for all facts (at least in theory), and it has, at this point, no real top-down structure.

So the approach they will take will be not to tag information within articles themselves, the way it’s done in Semantic MediaWiki, but rather to create a new, separate site: a “Data Commons”, where potentially hundreds of millions of facts (or more?) will be stored, each fact with its own reference. Then, each individual language Wikipedia can make use of those facts within its own infobox template, where that Wikipedia’s community sees fit to use it.

It’s a bold vision, and there will be a lot of work necessary to pull it off, but I have a lot of faith in the abilities of the programmers who are on the team now. Just as importantly, I see the planned outcome of Wikidata as an inevitable one for Wikipedia. Wikipedia has been incrementally evolving from a random collection of articles to a true database since the beginning, and I think this is a natural step along that process.

A set of files was discovered in 2010 that represented the state of Wikipedia after about six weeks of existence, in February 2001. If you look through those pages, you can see nearly total chaos: there’s not even a hint of a unifying structure, or guidelines as to what should constitute a Wikipedia page; over 10% of the articles related in some way to the book Atlas Shrugged, presumably added by a devoted fan.

11 years later, there’s structure everywhere: infobox templates dictate the important summary information for any one subject type, reference templates specify how references should be structured, article-tagging templates let users state precisely the areas they think need improvement. There are guidelines for the first sentence, for the introductory paragraphs (ideally, one to four of them, depending on the article’s overall length), for how detailed sections should be, for when one should link to years, and so on. There are also tens of thousands of categories (at least, on the English-language Wikipedia), with guidelines on how to use them, creating a large set of hierarchies for browsing through all the information. These are all, in my eyes, symptoms of a natural progression toward a database-like system. Why is it natural? Because, if a rule makes sense for one article, it probably makes sense for all of them. Of course, that’s not always true, and there can be sub-rules, exceptions, etc.; but still, there’s no use reinventing the wheel for every article.

People complain that the proliferation of rules and guidelines, not to mention categories and templates, drives away new users, who are increasingly afraid to edit articles for fear of doing the wrong thing. And they’re right. But the solution to this problem is not to scale back all these rules, but rather to make the software more aware of the rules, and the overall structure, to prevent users from being able to make a mistake in the first place. That, at heart, was the thinking behind my extension Semantic Forms: if there’s a specific way of creating calls to a specific template, there’s no point requiring each user to create them in that way, when you can just display a form, let the user only enter valid inputs, and have the software take care of the rest.

Now, Wikidata isn’t concerned with the structuring of articles, but only with the data that they contain; but the core philosophy is the same: let the software take care of anything that there’s only one right way to do. If a country has a certain population (at least, according to some source), then there’s no reason that the users of every different language Wikipedia need to independently look up and maintain that information. If every page about a multiplanetary system already has its information stored semantically, then there’s no reason to separately maintain a hand-generated list of multiplanetary systems. And if, for every page about a musician, there’s already semantic information about their genre, instrument and nationality, then there’s no reason for categories such as “Danish jazz trumpeters”. (And there’s probably much less of a need for categories in general.)
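
To illustrate why a category like that becomes redundant, here is a tiny Python sketch in which the same list falls out of a query over per-page properties. The page names and property names are placeholders, and the `query` function is made up for this example; it is not Wikidata’s or Semantic MediaWiki’s query interface.

```python
# Placeholder pages with hypothetical semantic properties.
musicians = [
    {"name": "Musician A", "nationality": "Danish", "genre": "jazz", "instrument": "trumpet"},
    {"name": "Musician B", "nationality": "Danish", "genre": "rock", "instrument": "guitar"},
    {"name": "Musician C", "nationality": "Swedish", "genre": "jazz", "instrument": "trumpet"},
]


def query(pages, **conditions):
    """Return the names of pages whose properties match every condition."""
    return [
        page["name"]
        for page in pages
        if all(page.get(prop) == value for prop, value in conditions.items())
    ]


# The equivalent of the "Danish jazz trumpeters" category, with nothing to maintain by hand.
print(query(musicians, nationality="Danish", genre="jazz", instrument="trumpet"))
```

Any other combination, say Swedish jazz musicians, is just another query rather than another category to create and keep up to date.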

With increased meaning/semantics on one hand, and increased structure on the other, Wikipedia will become more like a database that can be queried, than like a standard encyclopedia. And at that point, the possibilities are endless, as they say. The demand is already there; all that’s missing is the software, and that’s what they’ll be working on in Berlin. Viel Glück!


Dynamically resizing an image

This is something that comes up, particularly when dealing with MediaWiki infoboxes. We had an infobox table floating on the right side of the page with a fixed width. Then there was an image to the left of it that was supposed to take up the remaining width of the page. The challenge was: What happens as the user shrinks or stretches the browser window? The fixed-width table would stay the same size but the image would have to grow or shrink with the browser width.

There are some scripts out there that would do this. You don’t need them. Here’s what to do:

{| style="float: right; width: 613px" ... |}
<div id="image-holder" style="margin-right: 630px;">

Then add some CSS:

div#image-holder img {
  height: auto;
  width: 100%;
}

Simple, right? Basically, the 100% width combined with the right margin gives us a “100% minus x pixels” effect. The browser responds accordingly.

See it live here.
