Showing posts with label Linked Data. Show all posts

Monday, 7 March 2011

Visualising (dirty) data from data.gov.uk using the Dataset Publishing Language (DSPL)

A fortnight ago the Dataset Publishing Language (DSPL) was launched by the Public Data Team at Google. DSPL is an XML-based language to support the generation of rich and interactive data visualisations using the Public Data Explorer, Google's hitherto closed visualisation tool. The XML is used to describe the dataset, including informational metadata like descriptions of measures and metrics, as well as structural metadata such as relations between tables. The completed DSPL XML is then uploaded to the Public Data Explorer as part of a 'dataset bundle', alongside the set of CSV files that hold the dataset's data.
I decided to take DSPL for a spin using data gleaned from data.gov.uk and visualised data pertaining to UK higher education income and expenditure in the years up to 2008 and 2009. This process was a little fidgety, primarily for reasons to be discussed in a moment; but it was also fidgety owing to the demands of DSPL and the seemingly temperamental nature of the Public Data Explorer. (The Public Data Team is working to resolve these technical issues.) The dataset can be visited and enjoyed as a bar graph, bubble chart or line graph, with dimensions selected from the left-hand column and temporal dimensions under the X axis. Bubble metrics in the bubble chart can be toggled in the top right-hand corner. Note that all values are shown in units of 1000 GBP and, where necessary, rounded to the nearest 1000 GBP. Screenshots are above and below.
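The conversion to units of 1000 GBP is simple arithmetic, and can be sketched in a few lines of Python. This is only an illustration and not necessarily how the figures in the visualisation were prepared: the function name `to_thousands_gbp` is made up, and rounding halves upwards is an assumption about what "nearest" means here.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_thousands_gbp(raw: str) -> int:
    """Convert a raw GBP figure such as '£1,234,567' to whole thousands of GBP."""
    # Strip surrounding whitespace, the currency symbol and thousands separators.
    cleaned = raw.strip().replace("£", "").replace(",", "")
    # Divide by 1000, then round to a whole number.
    # ASSUMPTION: halves round upwards (2,500 GBP becomes 3, not 2).
    thousands = (Decimal(cleaned) / Decimal(1000)).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP
    )
    return int(thousands)
```

Using `Decimal` rather than the built-in `round()` avoids Python's default round-half-to-even behaviour, which would turn 2,500 GBP into 2 rather than 3.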

These data visualisations look very good indeed, and this will no doubt be a useful resource for many. But I can't help wondering if it's all too much pain for too little gain. The dataset I used is relatively simple but it still required 140 lines of XML and an endless amount of tinkering with the original data. So unless you have a large, pristine dataset which is to form the focus of a keynote presentation at an important conference (as Prof. Hans Rosling might), it is difficult to see that it is worth the effort. Added to which, ironing out errors in the DSPL is arduous because the Public Data Explorer is only clever enough to tell you that there is an error, not where the error might be. This is all very frustrating when your XML is well-formed and validates, and your CSV files appear kosher. Again, the Public Data Team is working hard so things should improve soon. Which brings me back to the principal reason why the whole process was fidgety: data.gov.uk.
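Since the Public Data Explorer will not say where an error lies, a small pre-upload check on the CSV files can at least narrow things down. The sketch below is not part of DSPL or of Google's tooling; it is a hypothetical helper (`find_csv_problems`), exercised with invented column names, that reports the line number of any row with the wrong number of fields or with a non-numeric value in a column expected to be numeric.

```python
import csv
import io


def find_csv_problems(csv_text: str, numeric_columns: set[str]) -> list[str]:
    """Return a list of problems found in the CSV, each tagged with a line number."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return ["line 1: file is empty"]
    for row in reader:
        line_no = reader.line_num  # line of the source text just read
        # A row must have exactly as many fields as the header.
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        # Values in the nominated columns must parse as numbers.
        for name, value in zip(header, row):
            if name in numeric_columns:
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"line {line_no}: column '{name}' has non-numeric value {value!r}"
                    )
    return problems


# Hypothetical sample data; the column names are invented for illustration.
sample = "institution,year,income\nA,2008,1200\nB,2009,n/a\nC,2009\n"
problems = find_csv_problems(sample, {"year", "income"})
```

A check like this will not catch a mismatch between the CSV and the DSPL XML itself, but it does rule out the most common kinds of dirt before an upload is attempted.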

Data.gov.uk was launched a year ago by Tim Berners-Lee on behalf of the UK government. You can read about the background in your own time. Suffice to say, the raison d'être of data.gov.uk is to publish government datasets in an open, structured and interoperable way, thus stimulating new and "economically and socially valuable applications". As it currently stands, data.gov.uk does not come close to achieving this. It is not until you delve beneath the surface (as I did for the dataset above) that you appreciate that what data.gov.uk actually provides is almost the opposite: closed, unstructured and un-interoperable data! A resource like this should be based, in an ideal world, on RDF or XML, with CSV the preferred option for those unfamiliar with, unwilling or unable to provide something better. But it should not be a repository for virtually every file format known to human-kind, with contents structured in an arbitrary manner.

Identifying a suitable dataset for my DSPL experiments was exhausting. PDF files are commonplace; some "datasets" are simply empty or broken, or are simply bits of information (e.g. reports). Even if you are lucky enough to find a CSV-compliant dataset (and don't expect any RDF or XML), it will inevitably be dirty and require significant time to render it usable, hence my fidgety experience. All of these frustrations appear to be shared by developers who post on the data.gov.uk forum. To be sure, the data is "open" insofar as UK citizens can visit data.gov.uk, view data and hold public officials to account. However, it's the data.gov.uk logo (three linked orbs), which is almost identical to the old Semantic Web logo, that seduces one into thinking data.gov.uk is a rich source of structured, interoperable, open data. None of this is entirely fair because data.gov.uk does have a page on Linked Data, and it does provide some useful RDF on MPs, legislation, etc. and some SPARQL endpoints; but in the grand scheme of 'all-things-data.gov.uk' it constitutes a very small proportion of what data.gov.uk actually provides. And all of this is very depressing. It increases barriers, alienates developers and data enthusiasts, and will ultimately fail to reach the objective: "economically and socially valuable applications".

Monday, 19 July 2010

Google finally gets serious about the Semantic Web?

Google has been flirting with the Semantic Web recently, and we've talked about it occasionally on this blog. However, compared with other web search engines (e.g. Yahoo!) and the state of Semantic Web activity generally, Google has been slow to dive in completely. They have restricted themselves to rich snippets, using bits of RDFa and microformats, and making up their own too. Perhaps this was because their intention was always to purchase a prominent Semantic Web start-up company instead of putting in the spade work themselves? Perhaps so.

Google has this week announced the purchase of Metaweb Technologies. None the wiser?! Metaweb is perhaps best known for providing the Semantic Web community with Freebase. Freebase cropped up last year on this blog when we discussed the emergence of Common Tags. Freebase essentially represents a not insignificant hub in the rapidly expanding Linked Data cloud, providing RDF data on 12 million entities with URIs linking to other Linked Data and Semantic Web datasets, e.g. DBpedia.

My comments are limited to the above; just thought this was probably an extremely important development and one to watch. A high level of social proof appears to be required before some tech firms or organisations will embrace the Semantic Web. But what greater social proof than Google? Google also appear committed to the Freebase ethos:
"[We] plan to maintain Freebase as a free and open database for the world. Better yet, we plan to contribute to and further develop Freebase and would be delighted if other web companies use and contribute to the data. We believe that by improving Freebase, it will be a tremendous resource to make the web richer for everyone. And to the extent the web becomes a better place, this is good for webmasters and good for users."
Very significant stuff indeed.

Wednesday, 23 June 2010

Visualising the metadata universe

No blog postings for almost three months and then two come along at once...  I thought it would be worth drawing to the attention of readers the recent work of Jenn Riley of Indiana University.  Jenn is currently metadata guru for the Indiana University Digital Library Program and yesterday on the Dublin Core list she announced the output of a project to build a conceptual model of the 'metadata universe'. 

As evidenced by some of my blogs, there are literally hundreds of metadata standards and structured data formats available, all with their own acronym.  This seems to have become more complicated with the emergence of numerous XML-based standards in the early to mid noughties, and the more recent proliferation of RDF vocabularies for the Semantic Web and the associated Linked Data drive.  What formats exist?  How do they relate to each other?  For which communities of practice are they optimised, e.g. information industry or cultural sector?  What are the metadata, technical standards and vocabularies that I should be cognisant of in my area?  And so the question list goes on...


These questions can be difficult to answer, and it is for this reason that Jenn Riley has produced a gigantic poster diagram (above) entitled, 'Seeing standards: a visualization of the metadata universe'.  The diagram achieves what a good model should, i.e. simplifying complex phenomena and presenting a large volume of information in a condensed way.  As the website blurb states:
"Each of the 105 standards listed here is evaluated on its strength of application to defined categories in each of four axes: community, domain, function, and purpose. The strength of a standard in a given category is determined by a mixture of its adoption in that category, its design intent, and its overall appropriateness for use in that category."
A useful conceptual tool for academics, practitioners and students alike.  A glossary of metadata standards in either poster or pamphlet form is also available.

Tuesday, 22 June 2010

Goulash all round: Linked Data at NSZL

I meant to blog about this as soon as the news emerged in mid-April but University bureaucracy and research project demands prevented it: Adam Horvath (Director of Informatics) at the National Széchényi Library (NSZL) (or National Library of Hungary, if you prefer) announced on the Semantic Web Linking Open Data Project email list that the NSZL have exposed their entire OPAC and digital library as Linked Data - that's correct, their entire OPAC and digital library has been published as Linked Data. This includes corresponding authority data, with all nodes represented using Cool URIs.

The RDF vocabularies used include Dublin Core RDF for bibliographic metadata, SKOS for subject indexing (in a variety of terminologies) and FOAF for name authority data. Incredible! Not only that, the FOAF descriptions include mapped owl:sameAs statements to corresponding dbpedia URIs. For example, here is FOAF data pertaining to Hungarian novelist, Jókai Mór:

<?xml version="1.0"?>
<rdf:RDF
    xmlns:dbpedia="http://dbpedia.org/property/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns="http://web.resource.org/cc/"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:zs="http://www.loc.gov/zing/srw/">
  <foaf:Person rdf:about="http://nektar.oszk.hu/resource/auth/33589">
    <dbpedia:deathYear>1904</dbpedia:deathYear>
    <dbpedia:birthYear>1825</dbpedia:birthYear>
    <foaf:familyName>Jókai</foaf:familyName>
    <foaf:givenName>Mór</foaf:givenName>
    <foaf:name>Jókai Mór (1825-1904)</foaf:name>
    <foaf:name>Mór Jókai</foaf:name>
    <foaf:name>Jókai Mór</foaf:name>
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/M%C3%B3r_J%C3%B3kai"/>
  </foaf:Person>
</rdf:RDF>


Visit the above noted dbpedia data for fun.
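To show how little is needed to consume a record like the one above, here is a minimal sketch in Python using only the standard library. It works on an abridged copy of the Jókai Mór record (unused namespace declarations and the birth and death years are dropped), and the function name `summarise_person` is hypothetical. It also assumes the record is serialised exactly in this shape; RDF/XML allows the same graph to be written in many ways, so a dedicated RDF parser would be the more robust choice for real work.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the NSZL record.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
}

# Abridged copy of the record shown in the post.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <foaf:Person rdf:about="http://nektar.oszk.hu/resource/auth/33589">
    <foaf:familyName>Jókai</foaf:familyName>
    <foaf:givenName>Mór</foaf:givenName>
    <foaf:name>Jókai Mór (1825-1904)</foaf:name>
    <foaf:name>Mór Jókai</foaf:name>
    <foaf:name>Jókai Mór</foaf:name>
    <owl:sameAs rdf:resource="http://dbpedia.org/resource/M%C3%B3r_J%C3%B3kai"/>
  </foaf:Person>
</rdf:RDF>"""


def summarise_person(rdf_xml: str) -> dict:
    """Pull the URI, name variants and owl:sameAs links out of a FOAF record."""
    # ASSUMPTION: the record has a foaf:Person element directly under rdf:RDF.
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    person = root.find("foaf:Person", NS)
    # Attributes are addressed with the {namespace}local-name form.
    rdf_about = "{%s}about" % NS["rdf"]
    rdf_resource = "{%s}resource" % NS["rdf"]
    return {
        "uri": person.get(rdf_about),
        "names": [el.text for el in person.findall("foaf:name", NS)],
        "same_as": [el.get(rdf_resource) for el in person.findall("owl:sameAs", NS)],
    }


summary = summarise_person(RDF_XML)
```

The `same_as` list is where the Linked Data payoff lies: each URI in it can be dereferenced in turn to pull in whatever DBpedia knows about the same person.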

Rich SKOS data is also available for a local information retrieval thesaurus. Follow this link for an example of the skos:prefLabel 'magyar irodalom'.

It's a herculean effort from the NSZL which must be commended. And before the Germans did it too! Goulash all round to celebrate - and a photograph of the Hungarian Parliament, methinks.