Data is data

My new motto for 2013 is: “data is data”. What does that mean (aside from being a tautology)? It means that data has a set of behaviors and a “personality” all its own, very much distinct from its underlying subject matter, or from what format it’s in. Ultimately, the fact that a table of data holds an ID field, three string fields and a date, and that it has 300 rows, says more about how to display and interface with it than the fact that it’s an elementary school activity listing, or a description of video game characters, or top-secret military information, or the results of biotech research. And the fact that the data is in an Excel spreadsheet, or a database, or embedded as RDF in a web page, shouldn’t matter either.

What is data?

I should first define what I mean by data. I’m talking about anything that is stored as fields – slots where the meaning of some value can be determined by the name or location of that value. And it should be information that represents something about the outside world.

For that latter reason, I don’t think that information related to communication – whether it’s email, blog posts, microblogs and status updates, discussion forums and so on – is truly data, though it’s usually stored in databases. It’s not intended to represent anything beyond itself – there’s data about it (who wrote it and when, etc.), but the communication itself is not really data. Tied in with that, there’s rarely a desire to display such communications in a venue other than where it was originally created. (Communication represents the vast majority of information in social networking sites, but it’s not the only information – there’s also real data, like users’ friends, interests, biographical information and so on.)

Data can be stored in a lot of different ways. I put together a table of the different terms used in different data-storage approaches:

Database/spreadsheet        | Table               | Row              | Column                               | Value, cell
Standard website            | Category, page type | Page (usually)   | Field                                | Value
Semantic MediaWiki          | Category            | Page (usually)   | Property, field, template parameter  | Value
Semantic Web                | Class               | Subject          | Predicate, relationship              | Object
Object-oriented programming | Class               | Object, instance | Field, property                      | Value

What’s the most obvious observation here (other than maybe the fact that this too is a table of data)? That, for all of their differences, all of these storage mechanisms are dealing with the same things – they just have different ways to refer to them.

A wrapper around a database

The vast majority of websites that have ever been created have been, at heart, code around a database, where the code is mostly intended to display and modify the contents of that database. That’s true of websites from Facebook to eBay to Craigslist to Wikipedia. There’s often an email component as well, and sometimes a credit card handling component, and often peripherals like ads, but the basic concept is: there’s a database somewhere, and users can use the site to navigate around some or all of its contents, while some or all users can also use the site to add or modify contents. The data structure is fixed (though of course it changes, usually getting more complex, over time), and often, all the code to run the site has had to be created more or less from scratch.

Of course, not all web products are standalone websites: there’s software to let you create your own blogs, wikis, e-commerce sites, social networking sites, and so on. This software is more generic than standalone websites, but it, too, is tied to a very specific data structure.

So you have millions of hours, or possibly even billions, that have been spent creating interfaces around databases. And in a lot of those projects, the same sort of logic has been implemented over and over again, in dozens of different programming languages and hundreds of different coding styles. This is not to say that all of that work has been wasted: there has been a tremendous amount of innovation, hard work and genius that has gone into all of it, optimizing speed, user interface, interoperability and all of that. But there has also been a lot of duplicated work.

Now, as I noted before, not all data stored in a database should be considered data: blog posts, messages and the like should not, in my opinion. So my point about duplicated work in data-handling may not fully apply to blogs, social networking sites and so on. I’m sure there’s needlessly duplicated work on that side of things as well, but it’s not relevant to this essay. (Though social-networking sites like Facebook do include true structured data as well, about users’ friends, interests, biographical information, etc.)

What about [insert software library here]?

Creating a site, or web software, “from scratch” can mean different things. There are software libraries that work with a database schema, making use of the set of tables and their fields to let you create code around a database without having to do all the drudgework of creating a class from scratch for every table, etc. Ruby on Rails is the best-known example, but there’s a wide variety of libraries in various languages that do this sort of thing: they are libraries that implement what’s known as the active record pattern. These “active record” libraries are quite helpful when you’re a programmer creating a site (I myself have created a number of websites with Ruby on Rails), but still, these are tools for programmers. A programmer still has to write code to do anything but the most basic display and editing of information.

So here’s a crazy thought: why does someone need to write any code at all, to just display the contents of a database in a user-friendly manner? Can’t there be software that takes a common-sense approach to data, displaying things in a way that makes sense for the current set of data?

No database? No problem

And, for that matter, why does the underlying data have to be in a database, as nearly all web software currently expects it to be? Why can’t code that interfaces with a database work just as well with data that’s in a spreadsheet, or an XML file, or available through an API from some other website? After all, data is data – once the data exists, you should be able to display it, and modify it if you have the appropriate permissions, no matter what format it’s in.

It’s too slow to query tens of thousands of rows of data if they’re in a spreadsheet? Fine – so have the application generate its own database tables to store all that data, and it can then query on that. There’s nothing that’s really technically challenging about doing that, even if the amount of data stretches to the hundreds of thousands of rows. And if the data or data structure is going to change in the outside spreadsheet/XML/etc., you can set up a process to have the application keep re-importing the current contents into its internal database and delete the old stuff, say once a day or once a week.
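As a rough illustration of that re-import idea, here is a minimal sketch in Python that copies a CSV export into an internal SQLite table and replaces the old contents on every run. The `import_csv` function, the table name and the sample data are all made up for this example, and every column is stored without a declared type.

```python
import csv
import io
import sqlite3


def quote(identifier):
    """Quote a table or column name so that spaces in it are allowed."""
    return '"' + identifier.replace('"', '""') + '"'


def import_csv(conn, table, csv_file):
    """Replace the contents of `table` with the rows read from `csv_file`."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(quote(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    with conn:  # the connection context manager commits on success
        conn.execute(f"DROP TABLE IF EXISTS {quote(table)}")
        conn.execute(f"CREATE TABLE {quote(table)} ({columns})")
        conn.executemany(
            f"INSERT INTO {quote(table)} VALUES ({placeholders})", reader
        )


# Stand-in for an exported spreadsheet; a real source would be a file.
SAMPLE = "ID,Name,Start date\n1,Chess club,2013-02-01\n2,Choir,2013-02-03\n"

conn = sqlite3.connect(":memory:")
import_csv(conn, "activities", io.StringIO(SAMPLE))
import_csv(conn, "activities", io.StringIO(SAMPLE))  # re-import replaces the old rows
row_count = conn.execute("SELECT COUNT(*) FROM activities").fetchone()[0]
```

Running the import from a daily or weekly scheduled job would give the "keep re-importing" behavior; dropping and recreating the table also picks up any change in the column list.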

Conversely, if you’re sure that the underlying data isn’t being modified, you could have the application also allow users to modify its data, and then propagate the changes back to the underlying source, if it has permissions to do so.

Figuring out the data structure, and other complications

Now, you may argue that it’s not really possible to take, say, a set of Excel spreadsheets and construct an interface out of it. There’s a lot we don’t know: if there are two different tables that contain a column called “ID”, and each one has a row with the value “1234” for that column, do those rows refer to the same thing? And if there’s a column that mostly contains numbers, except for a few rows where it contains a string of letters, should that be treated as a number field, or as a string? And so on.

These are valid points – and someone who wants to use a generic tool to display a set of data will probably first have to specify some things about the data structure: which fields/columns correspond to which other fields/columns, what the data type is for each field, which fields represent a unique ID, and so on. (Though some of that information may be specified already, if the source is a database.) The administrator could potentially specify all of that “meta-data” in a settings file, or via a web interface, or some such. It’s some amount of work, yes – but it’s fairly trivial, certainly compared to programming.
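To make that concrete, here is one possible shape for such settings, sketched in Python, together with a fallback that guesses a column's type when the administrator hasn't declared one. The setting keys (`id_field`, `types`, `links`), the type names and both functions are invented for this illustration; no existing tool is being described.

```python
from datetime import date

# Hypothetical settings: which field is the unique ID, what type each field
# has, and which fields point at fields in other tables.
SETTINGS = {
    "activities": {
        "id_field": "ID",
        "types": {"ID": "integer", "Name": "string", "Start date": "date"},
        "links": {"School ID": ("schools", "ID")},
    },
}

PARSERS = {"integer": int, "date": date.fromisoformat, "string": str}


def guess_type(values):
    """Return the strictest type whose parser accepts every non-empty value."""
    for type_name in ("integer", "date", "string"):
        parser = PARSERS[type_name]
        try:
            for value in values:
                if value != "":
                    parser(value)
        except ValueError:
            continue  # at least one value did not fit; try a looser type
        return type_name
    return "string"


def column_type(table, column, values):
    """Prefer the declared type; fall back to a guess from the data."""
    declared = SETTINGS.get(table, {}).get("types", {})
    return declared.get(column) or guess_type(values)
```

With this approach, a column that is mostly numbers but has a few stray strings gets guessed as a string, unless the settings say otherwise: `guess_type(["1", "2", "n/a"])` falls through to `"string"`, while `column_type("activities", "ID", ["1", "n/a"])` uses the declared `"integer"`.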

Another complication is read-access. Many sites contain information that only a small percentage of their users can access. And corporate sites of course can contain a lot of sensitive information, readable only to a small group of managers. Can all of that read-access control really be handled by a generic application?

Yes and no. If the site has some truly strange or even just detailed rules on who can view what information, then there’s probably no easy way to have a generic application mimic all of them. But if the rules are basic – like that a certain set of users cannot view the contents of certain columns, or cannot view an entire table, or cannot view the rows in a table that match certain criteria, then it seems like that, too, could be handled via some basic settings.
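The three basic kinds of rules just mentioned (hidden tables, hidden columns, and rows hidden by criteria) could be expressed as settings along these lines. This is only a sketch in Python: the `RULES` layout, the role name and the `visible_rows` function are assumptions made for the example, and a role with no rules is assumed to see everything.

```python
# Hypothetical per-role read rules.
RULES = {
    "guest": {
        "hidden_tables": {"salaries"},
        "hidden_columns": {"employees": {"Home address"}},
        # A row is shown only if the filter returns True for it.
        "row_filters": {"employees": lambda row: row["Department"] != "Legal"},
    },
}


def visible_rows(role, table, rows):
    """Return the rows of `table`, as dicts, that `role` is allowed to read."""
    rules = RULES.get(role, {})
    if table in rules.get("hidden_tables", set()):
        return []
    hidden = rules.get("hidden_columns", {}).get(table, set())
    keep = rules.get("row_filters", {}).get(table, lambda row: True)
    return [
        {column: value for column, value in row.items() if column not in hidden}
        for row in rows
        if keep(row)
    ]


EMPLOYEES = [
    {"Name": "Ana", "Department": "Sales", "Home address": "1 Main St"},
    {"Name": "Ben", "Department": "Legal", "Home address": "2 Oak Ave"},
]
guest_view = visible_rows("guest", "employees", EMPLOYEES)
```

Here a guest would see only Ana's row, without the address column. Anything more detailed than this is where a generic application would likely run out of road.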

Some best practices

Now, let’s look at some possible “best practices” for displaying data. Here are some fairly basic ones:

  • If a table of data contains a field storing geographical coordinates (or two fields – one for latitude and one for longitude), chances are good that you’ll want to display some or all of those coordinates in a map.
  • If a table of data contains a date field, there’s a reasonable chance that you’ll want to display those rows in a calendar.
  • For any table of data holding public information, there’s a good chance that you’ll want to provide users with a faceted search interface (where there’s a text input for each field), or a faceted browsing/drill-down interface (where there are clickable/selectable values for each field), or both, or some combination of the two.

If we can make all of these assumptions, surely the software can too, and provide a default display for all of this kind of information. Perhaps having a map should be the default behavior, that happens unless you specify otherwise?
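Those three defaults could be written down as a small function over the field types. This Python sketch assumes each field has already been labeled with a type name such as `"coordinates"`, `"latitude"`, `"longitude"` or `"date"`; those labels, and the `default_displays` function itself, are made up for illustration.

```python
def default_displays(field_types, public=True):
    """Suggest display interfaces from a {field name: type name} mapping."""
    types = set(field_types.values())
    displays = []
    # Either one coordinates field, or a latitude/longitude pair.
    if "coordinates" in types or {"latitude", "longitude"} <= types:
        displays.append("map")
    if "date" in types:
        displays.append("calendar")
    if public:
        displays.append("faceted search")
    return displays


suggested = default_displays(
    {"Name": "string", "Lat": "latitude", "Lon": "longitude", "Start": "date"}
)
```

An administrator could then switch off any suggestion that doesn't fit, rather than having to ask for each one.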

But there’s more that an application can assume than just the need for certain kinds of visualization interfaces. You can assume a good amount based on the nature of the data:

  • If there are 5 rows of data in a table, and it’s not a helper table, then it’s probably enough to just have a page for each row and be done with it. If there are 5,000 rows, on the other hand, it probably makes sense to have a complete faceted browsing interface, as well as a general text search input.
  • If there are 10 columns, then, assuming you have a page showing all the information for any one row of data, you can just display all the values on that one page, in a vertical list. But if you have 100 columns, including information from auxiliary tables, then it probably makes sense to break up the page, using tabs or “children” pages or just creative formatting (small fonts, use of alternating colors, etc.)
  • If a map has over, say, 200 points, then it should probably be displayed as a “heat map”, or a “cluster map”, or maps should only show up after the user has already done some filtering.
  • If the date field in question has a range of values spread out over a few days, then just showing a list of items for each day makes sense. If it’s spread out over a few years, then a monthly calendar interface makes sense. And if it’s spread out over centuries, then a timeline makes sense.
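The size-based rules in this list could be sketched in Python as follows. The 200-point limit comes from the list above; every other cutoff (1,000 rows, two weeks, twenty years) is a guess chosen only to make the example concrete, and the function names are invented.

```python
from datetime import date


def table_display(row_count):
    """A handful of rows needs only pages; thousands need browsing tools."""
    if row_count >= 1000:  # assumed cutoff
        return "faceted browsing with text search"
    return "one page per row"


def map_display(point_count):
    """Past roughly 200 points, individual markers stop being readable."""
    return "cluster map" if point_count > 200 else "plain map"


def date_display(earliest, latest):
    """Pick a date interface from the span between the first and last date."""
    span_days = (latest - earliest).days
    if span_days <= 14:  # assumed: "a few days"
        return "list per day"
    if span_days <= 365 * 20:  # assumed: "a few years"
        return "monthly calendar"
    return "timeline"
```

For example, `date_display(date(2013, 2, 1), date(2013, 2, 4))` gives a per-day list, while a span from 1700 to 2013 gives a timeline.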

Somehow software never makes these kinds of assumptions.

I am guilty of that myself, by the way. My MediaWiki extension Semantic Drilldown lets you define a drill-down interface for a table of data, just by specifying a filter/facet for every column (or, in Semantic MediaWiki’s parlance, property) of data that you want filterable. So far, so good. But Semantic Drilldown doesn’t look at the data to try to figure out the most reasonable display. If a property/column has 500 different values, then a user who goes to the drilldown page (at Special:BrowseData) will see 500 different values for that filter that they can click on. (And yes, that has happened.) That’s an interface failure: either (a) those values should get aggregated into a much smaller number of values; or (b) there should be a cutoff, so that any value that appears in fewer than, say, three pages should just get grouped into “Other”; or (c) there should just be a text input there (ideally, with autocompletion), instead of a set of links, so that users can just enter the text they’re looking for; or… something. Showing a gigantic list of values does not seem like the ideal approach.
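Options (b) and (c) could be combined roughly like this. To be clear, this Python sketch is not how Semantic Drilldown works; `facet_choices`, its parameters and the limit of 20 links are all hypothetical.

```python
from collections import Counter


def facet_choices(values, min_pages=3, max_links=20):
    """Decide how to present one filter, given the value found on each page."""
    counts = Counter(values)
    common = {value: n for value, n in counts.items() if n >= min_pages}
    rare_total = sum(n for n in counts.values() if n < min_pages)
    if len(common) > max_links:
        # Too many values to click through even after grouping: option (c).
        return {"widget": "text input", "choices": {}}
    if rare_total:
        # Option (b); a real value named "Other" would need separate handling.
        common["Other"] = rare_total
    return {"widget": "links", "choices": common}


colors = ["Red"] * 5 + ["Blue"] * 3 + ["Teal", "Mauve"]
color_facet = facet_choices(colors)
```

In this example the two one-off colors are folded into a single "Other" link with a count of 2, while a property with hundreds of frequently used values would fall back to a text input.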

Similarly, for properties that are numbers, Semantic Drilldown lets you define a set of ranges for users to click on: it could be something like 0-49, 50-199, 200-499 and so on. But even if this set of ranges is well-calibrated when the wiki is first set up, it could become unbalanced as more data gets added – for example, a lot of new data could be added, that all has a value for that property in the 0-49 range. So why not have the software itself set the appropriate ranges, based on the set of data?
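One simple way for software to set the ranges itself is to cut the sorted values into groups of roughly equal size. The Python sketch below does that; `balanced_ranges` is a made-up name, and the convention that each range includes its lower bound but not its upper bound (except the last range, which includes both) is an assumption of this example.

```python
def balanced_ranges(values, bucket_count=4):
    """Return (low, high) pairs that split `values` into similar-sized groups."""
    ordered = sorted(values)
    if not ordered:
        return []
    n = len(ordered)
    # Cut points at evenly spaced positions; the set removes repeated cuts.
    cuts = sorted({ordered[i * n // bucket_count] for i in range(1, bucket_count)})
    edges = [ordered[0]] + [c for c in cuts if c > ordered[0]] + [ordered[-1]]
    return list(zip(edges, edges[1:]))


# Most values are small, with a long tail of large ones.
skewed = [3, 8, 12, 20, 25, 31, 40, 44, 120, 300, 450, 900]
ranges = balanced_ranges(skewed)
```

Because the cut points follow the data, recomputing them after each import (or after each filter selection) would keep the ranges balanced as the data changes. Rounding the bounds to friendlier numbers is left out here.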

And maybe the number ranges should themselves shift, as the user selects values for other filters? That’s rarely done in interfaces right now, but maybe there’s an argument to be made for doing it that way. At the very least, having intelligent software that is aware of the data it’s handling opens up those kinds of dynamic possibilities for the interface.

Mobile and the rest

Another factor that should get considered (and is also more important than the underlying subject matter) is the type of display. So far I’ve described everything in terms of standard websites, but you may want to display the data on a cell phone (via either an app or a customized web display), or on a tablet, or on a giant touch-screen kiosk, or even in a printed document. Each type of display should ideally have its own handling. For someone creating a website from scratch, that sort of thing can be a major headache – especially the mobile-friendly interface – but a generic data application could provide a reasonable default behavior for each display type.

By the way, I haven’t mentioned desktop software yet, but everything that I wrote before, about software serving as a wrapper around a database, is true of a lot of enterprise desktop software as well – especially the kind meant to hold a specific type of data: software for managing hospitals, amusement parks, car dealerships, etc. So it’s quite possible that an approach like this could be useful for creating desktop software.

Current solutions

Is there software (web, desktop or otherwise) that already does this? At the moment, I don’t know of anything that even comes close. There’s software that lets you define a data structure, either in whole or in part, and create an interface apparatus around it of form fields, drill-down, and other data visualizations. I actually think the software that’s the furthest along in that respect is the Semantic MediaWiki family of MediaWiki extensions, which provide enormous amounts of functionality around an arbitrary data structure. There’s the previously-mentioned Semantic Drilldown, as well as functionality that provides editing forms, maps, calendars, charts etc. around an arbitrary set of data. There are other applications that do some similar things – like other wiki software, and like Drupal, which lets you create custom data-entry forms, and even like Microsoft Access – but I think they all currently fall short of what SMW provides, in terms of both out-of-the-box functionality and ease of use for non-programmers. I could be wrong about that – if there’s some software I’m not aware of that does all of that, please let me know.

Anyway, even if Semantic MediaWiki is anywhere near the state of the art, it still is not a complete solution. There are areas where it could be smarter about displaying the data, as I noted before, and it has no special handling for mobile devices; but much more importantly than either of those, it doesn’t provide a good solution for data that doesn’t already live in the wiki. Perhaps all the world’s data should be stored in Semantic MediaWiki (who am I to argue otherwise?), but that will never be the case.

Now, SMW actually does provide a way to handle outside data, via the External Data extension – you can bring in data from a variety of other sources, store it in the same way as local data, and then query/visualize/etc. all of this disparate data together. I even know of some cases where all, or nearly all, of an SMW-based wiki’s data comes from externally – the wiki is used only to store its own copy of the data, which it can then display with all of its out-of-the-box functionality like maps, calendars, bulleted lists, etc.

But that, of course, is a hack – an entire wiki apparatus around a set of data that users can’t edit – and the fact that this hack is in use just indicates the lack of other options currently available. There is no software that says, “give me your data – in any standard format – and I will construct a pleasant display interface around it”. Why not? It should be doable. Data is data, and if we can make reasonable assumptions based on its size and nature, then we can come up with a reasonable way to display it, without requiring a programmer for it.

Bringing in multiple sources

And like SMW and its use of the External Data extension, there’s no reason that the data all has to come from one place. Why can’t one table come from a spreadsheet, and another from a database? Or why can’t the data come from two different databases? If the application can just use its own internal database for the data that it needs, there’s no limit to how many sources it was originally stored in.
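As a small sketch of that, here one table is loaded from CSV (standing in for a spreadsheet) and another from JSON (standing in for some other system's export), and the two are then queried together in Python's built-in SQLite. The table names, field names and sample rows are all invented for the example.

```python
import csv
import io
import json
import sqlite3

SCHOOLS_CSV = "ID,Name\n1,Lincoln Elementary\n2,Oak Grove\n"
ACTIVITIES_JSON = (
    '[{"Title": "Chess club", "School ID": 1},'
    ' {"Title": "Choir", "School ID": 2}]'
)

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE schools ("ID" INTEGER, "Name" TEXT)')
conn.execute('CREATE TABLE activities ("Title" TEXT, "School ID" INTEGER)')

# Source 1: a CSV export.
school_rows = [
    (int(row["ID"]), row["Name"])
    for row in csv.DictReader(io.StringIO(SCHOOLS_CSV))
]
conn.executemany("INSERT INTO schools VALUES (?, ?)", school_rows)

# Source 2: a JSON export.
activity_rows = [
    (item["Title"], item["School ID"]) for item in json.loads(ACTIVITIES_JSON)
]
conn.executemany("INSERT INTO activities VALUES (?, ?)", activity_rows)

# Once everything is in the internal database, the original source no
# longer matters to the query.
joined = conn.execute(
    'SELECT a."Title", s."Name" FROM activities a '
    'JOIN schools s ON s."ID" = a."School ID" ORDER BY a."Title"'
).fetchall()
```

The join needs to know that "School ID" in one source refers to "ID" in the other, which is exactly the kind of link an administrator would declare up front.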

And that also goes for public APIs that provide general information that can enrich the local information one has. There is a large and growing number of general-information APIs, and the biggest one by far is yet to come: Wikidata, which will hold a queryable store of millions of facts. How many database-centered applications could benefit from additional information like the population of a city, the genre of a movie, the flag of a country (for display purposes) and so on? Probably a fair number. And a truly data-neutral application could display all such information seamlessly to the user – so there wouldn’t be any way of knowing that some information originally came from Wikidata as opposed to having been entered by hand by that site’s own creators or users.

Data is data. It shouldn’t be too hard for software to understand that, and it would be nice if it did.

3 Responses

  1. Good perspective Yaron. It’s interesting to compare the parallel observation “Text is Text”. Some of what you describe above has become common with text. Most systems will turn two newlines into a new paragraph. Or consider things like markdown and how it handles text conversation. It may be a stretch, but there is some potential direction there that could apply to data. I’m not sure what a “markdown for data” would be, but it’s an interesting point to explore.

  2. “Data is Data”, and once you add some more data, “Data about Data” – metadata, this could really be a powerful idea. In this age of “Big Data”, there’s no lack of both. It would be great if there are systems that can automatically create sensible interactions based on the shape and size of these haystacks.
    Eagerly awaiting what you come up with….

  3. Jamie – that’s true about simple markup languages like Markdown (and, of course, Wikitext and others), and assumptions like double newlines. With data it’s trickier, if only because it can come in a bunch of different formats. But maybe something like a separate text settings file, that just lists the basics of column names, types and their connections to one another, could do the job. Like markup languages there could of course be variants, though hopefully not too many.

    Joel – thanks! Well, I never said I would be the one to do it. :) But hopefully someone will.

    Yaron Koren, February 7, 2013 @ 3:49 PM
