Monday 7 March 2011

Visualising (dirty) data from data.gov.uk using the Dataset Publishing Language (DSPL)

A fortnight ago the Dataset Publishing Language (DSPL) was launched by the Public Data Team at Google. DSPL is an XML-based language to support the generation of rich and interactive data visualisations using the Public Data Explorer, Google's hitherto closed visualisation tool. The XML is used to describe the dataset, including informational metadata like descriptions of measures and metrics, as well as structural metadata such as relations between tables. The completed DSPL XML is then uploaded to the Public Data Explorer in a 'dataset bundle', together with the set of CSV files that hold the dataset's data.
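As a rough illustration of the bundling step, here is a minimal Python sketch that zips a DSPL XML file together with its CSV tables. It assumes the bundle is a flat zip archive; the function name `build_bundle` and any file names passed to it are hypothetical, and this is not the procedure used for the dataset discussed in this post.

```python
import zipfile
from pathlib import Path


def build_bundle(xml_path, csv_paths, bundle_path):
    """Zip one DSPL XML file and its CSV tables into a single archive."""
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in [xml_path, *csv_paths]:
            path = Path(path)
            # Store each file at the top level of the archive, without directories.
            bundle.write(path, arcname=path.name)
    return bundle_path
```

Calling `build_bundle("dataset.xml", ["income.csv", "expenditure.csv"], "bundle.zip")` would produce a single archive ready for upload, assuming those (hypothetical) files exist.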
I decided to take the DSPL for a spin using data gleaned from data.gov.uk, and visualised UK higher education income and expenditure in the years up to 2008 and 2009. This process was a little fidgety, primarily for reasons to be discussed in a moment; but it was also fidgety owing to the demands of the DSPL and the seemingly temperamental nature of the Public Data Explorer. (These technical issues are something the Public Data Team is resolving.) The dataset can be visited and enjoyed as a bar graph, bubble chart or line graph, with dimensions selected from the left-hand column and temporal dimensions under the X axis. Bubble metrics in the bubble chart can be toggled in the top right-hand corner. Note that all values are shown in units of 1000 GBP and, where necessary, rounded to the nearest 1000 GBP. Screenshots are above and below.

These data visualisations look very good indeed, and this will no doubt be a useful resource for many. But I can't help wondering if it's all too much pain for too little gain. The dataset I used is relatively simple, but it still required 140 lines of XML and a seemingly endless amount of tinkering with the original data. So unless you have a large, pristine dataset which is to form the focus of a keynote presentation at an important conference (such as those given by Prof. Hans Rosling), it is difficult to see that it is worth the effort. Added to which, ironing out errors in the DSPL is arduous because the Public Data Explorer is only clever enough to tell you that there is an error, not where the error might be. This is all very frustrating when your XML is well-formed and validates, and your CSV files appear kosher. Again, the Public Data Team is working hard, so things should improve soon. Which brings me back to the principal reason why the whole process was fidgety: data.gov.uk.
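One way to narrow down where an error might be before uploading is to run some basic checks locally. The Python sketch below is only that: it reports the position of any XML parse error and flags CSV rows whose field count differs from the header's. It does not reproduce whatever checks the Public Data Explorer performs, and a bundle that passes here may still be rejected. The function names are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET


def check_xml(xml_path):
    """Return None if the XML parses, else a message with the parser's position."""
    try:
        ET.parse(xml_path)
    except ET.ParseError as err:
        line, column = err.position
        return f"{xml_path}: line {line}, column {column}: {err}"
    return None


def check_csv(csv_path):
    """Return messages for rows whose field count differs from the header's."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return [f"{csv_path}: file is empty"]
        for row in reader:
            if len(row) != len(header):
                problems.append(
                    f"{csv_path}: line {reader.line_num}: "
                    f"expected {len(header)} fields, found {len(row)}"
                )
    return problems
```

Both functions return their findings rather than printing them, so they can be run over every file in a bundle and the results collected in one place.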

Data.gov.uk was launched a year ago by Tim Berners-Lee on behalf of the UK government. You can read about the background in your own time. Suffice to say, the raison d'être of data.gov.uk is to publish government datasets in an open, structured and interoperable way, thus stimulating new and "economically and socially valuable applications". As it currently stands, data.gov.uk does not come close to achieving this. It is not until you delve beneath the surface (as I did for the dataset above) that you appreciate that what data.gov.uk actually provides is almost the opposite: closed, unstructured and un-interoperable data! A resource like this should be based – in an ideal world – on RDF or XML, with CSV the preferred option for those unfamiliar, unwilling or unable to provide something better. But it should not be a repository for virtually every file format known to humankind, with contents structured in an arbitrary manner.

Identifying a suitable dataset for my DSPL experiments was exhausting. PDF files are commonplace; some "datasets" are simply empty or broken, or are merely bits of information (e.g. reports). Even if you are lucky enough to find a CSV-compliant dataset (and don't expect any RDF or XML), it will inevitably be dirty and require significant time to render it usable, which is why my experience was fidgety. All of these frustrations appear to be shared by developers who post on the data.gov.uk forum. To be sure, the data is "open" insofar as UK citizens can visit data.gov.uk, view data and hold public officials to account. However, it's the data.gov.uk logo (three linked orbs) - which is almost identical to the old Semantic Web logo - that seduces one into thinking data.gov.uk is a rich source of structured, interoperable, open data. None of this is entirely fair because data.gov.uk does have a page on Linked Data, and it does provide some useful RDF on MPs, legislation, etc. and some SPARQL endpoints; but in the grand scheme of 'all-things-data.gov.uk' it constitutes a very small proportion of what data.gov.uk actually provides. And all of this is very depressing. It increases barriers, alienates developers and data enthusiasts, and will ultimately fail to reach the objective: "economically and socially valuable applications".
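To give a flavour of the kind of cleaning a dirty CSV tends to need, here is a small Python sketch that turns formatted monetary strings into integers and then into units of 1000 GBP. It is illustrative only: the helper names are hypothetical, the list of placeholder markers is an assumption rather than anything taken from the data.gov.uk files, and values are assumed to be whole pounds (a decimal such as "12.50" would raise a ValueError).

```python
import re


def parse_gbp(raw):
    """Turn a string such as ' £1,234,567 ' into an integer number of pounds.

    Returns None for blanks and placeholder markers.
    """
    text = raw.strip()
    if text in {"", "-", "n/a", "N/A", ".."}:
        return None
    # Drop currency symbols, thousands separators and internal spaces.
    digits = re.sub(r"[£,\s]", "", text)
    return int(digits)


def to_thousands(pounds):
    """Round whole pounds to the nearest 1000 GBP, halves rounding up."""
    return (pounds + 500) // 1000
```

Integer arithmetic is used for the rounding because Python's built-in `round` rounds halves to the nearest even number, which may not be what a reader of the published figures expects.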
