Wednesday 24 December 2008

SKOS-ifying Knowledge Organisation Systems: a continuing contradiction for the Semantic Web

A few days ago Ed Summers announced on his blog that he was shutting down lcsh.info. For those who don't know, lcsh.info was a Semantic Web demonstrator developed by Ed with the express purpose of illustrating how the Library of Congress Subject Headings (LCSH) could be represented, and its structure harnessed, using the Simple Knowledge Organisation System (SKOS). In particular, Ed was keen to explore issues pertaining to Linked Data and the representation of concepts using URIs. He even hoped that the URIs used would be Cool URIs, linking eventually to a bona fide LCSH service were one ever to be released. Sadly, it was not to be... The reasons remain unclear but were presumably related to IPR. As the lcsh.info blog entry notes, Ed was compelled to remove it by the Library of Congress itself. The fact that he was the LC's resident Semantic Web buff probably didn't help matters.

SKOS falls within my area of interest and is an initiative of the W3C Semantic Web Deployment Working Group. In brief, SKOS is an application of RDF and RDFS: a series of evolving specifications and standards supporting the use of knowledge organisation systems (KOS) (e.g. information retrieval thesauri, classification schemes, subject heading systems, taxonomies or any other controlled vocabulary) within the framework of the Semantic Web. The Semantic Web is many things of course; but it is predicated upon the assumption that there exist communities of practice willing and able to create the necessary structured data (generally applications of RDF) to make it work. This might be metadata, or it might be an ontology, or it might be a KOS represented in SKOS. The resulting data is open and can then be re-used, integrated, interconnected and queried. When large communities of practice fail to contribute, the model breaks down.
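To make "a KOS represented in SKOS" a little more concrete, here is a minimal sketch in Python that writes out a single concept as Turtle. The example.org URIs, the labels and the helper name concept_to_turtle are all invented for illustration; only the SKOS namespace and the property names (prefLabel, altLabel, broader, narrower) come from the SKOS vocabulary itself. It does no escaping of awkward characters in labels, so treat it as a toy rather than a serialiser.

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def concept_to_turtle(uri, pref_label, alt_labels=(), broader=(), narrower=(), lang="en"):
    """Serialise one concept as a Turtle fragment using SKOS properties.

    Simplification: labels are assumed to contain no quotes or backslashes.
    """
    lines = [f"<{uri}> a skos:Concept ;"]
    lines.append(f'    skos:prefLabel "{pref_label}"@{lang} ;')
    for label in alt_labels:
        lines.append(f'    skos:altLabel "{label}"@{lang} ;')
    for target in broader:
        lines.append(f"    skos:broader <{target}> ;")
    for target in narrower:
        lines.append(f"    skos:narrower <{target}> ;")
    # Swap the trailing " ;" of the last statement for " ." to close the block.
    lines[-1] = lines[-1][:-2] + " ."
    header = f"@prefix skos: <{SKOS_NS}> .\n\n"
    return header + "\n".join(lines) + "\n"


# Hypothetical concept; the URIs below do not belong to any real scheme.
turtle = concept_to_turtle(
    "http://example.org/kos/collaborative-tagging",
    "Collaborative tagging",
    alt_labels=["Social tagging"],
    broader=["http://example.org/kos/subject-indexing"],
)
print(turtle)
```

The point is simply that each heading becomes a URI with labels and broader/narrower links hanging off it, which is what allows other people's data to point at it.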

There is a sense in which the Semantic Web has been designed to bring out the schizophrenic tendencies within some quarters of the LIS community. Whilst the majority of our community has embraced SKOS (and other related specifications), appreciates its potential and actively contributes to the evolution of the standards, there is a small coterie that flirts with the technology whilst simultaneously shrinking at the thought of exposing hitherto proprietary data. It's the 'lock down' versus 'openness' contradiction again.

In a previous research post I was involved with the High-Level Thesaurus (HILT) research project and continue my involvement in a consultative capacity. HILT continues to research and develop a terminology web service providing M2M access to a plethora of terminological data, including terminology mappings. Such terminological data can be incorporated into local systems to improve local searching functionality. Improvements might include, say, implementing a dynamic hierarchical subject browsing tree, or incorporating interactive query expansion techniques as part of the search interface. An important aim of HILT, and the original motivation behind it, is to develop a 'terminology mapping server' capable of ameliorating the "limited terminological interoperability afforded between the federation of repositories, digital libraries and information services comprising the UK Joint Information Systems Committee (JISC) Information Environment" (Macgregor et al., 2007), thus enabling accurate federated subject-based information retrieval. This is a blog so detail will be avoided for now; but, in essence, HILT is an attempt to provide a terminology server in a mash-up context using open standards. To make the terminological data as usable as possible and to expose it to the Semantic Web, the data is modelled using SKOS.
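For the curious, the query expansion idea mentioned above can be sketched in a few lines of Python. This is not HILT's implementation or API: the tiny THESAURUS dictionary and the expand_query function are hypothetical stand-ins for data that a local system would fetch from a terminology service.

```python
# Hypothetical mini-thesaurus keyed on lower-cased preferred terms.
# A real system would obtain this data from a terminology service.
THESAURUS = {
    "metadata": {
        "alt": ["data about data"],
        "broader": ["information organisation"],
        "narrower": ["dublin core"],
    },
    "thesauri": {
        "alt": ["thesaurus"],
        "broader": ["controlled vocabularies"],
        "narrower": [],
    },
}


def expand_query(term, thesaurus, include_broader=False):
    """Return the user's term plus alternative and narrower terms.

    Broader terms are optional because they tend to widen the search
    at the expense of precision.
    """
    entry = thesaurus.get(term.lower())
    if entry is None:
        # Unknown term: search with it unchanged.
        return [term]
    expanded = [term] + entry.get("alt", []) + entry.get("narrower", [])
    if include_broader:
        expanded += entry.get("broader", [])
    return expanded


print(expand_query("Metadata", THESAURUS))
print(expand_query("Metadata", THESAURUS, include_broader=True))
```

In an interactive interface the extra terms would be offered to the user as suggestions rather than silently added to the search.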

But what happens to HILT when/if it becomes an operational service? Will its terminological innards be ripped out by the custodians of terminologies because they no longer want their data exposed, or will the ethos of the model be undermined as service administrators permit only HE institutions or charitable organisations to access the data? This isn't a concern for HILT yet; but it is one I anticipated several years ago. And the sad experience of lcsh.info illustrates that it's a very real concern.

Digital libraries, repositories and other information services have to decide where they want to be. This is a crossroads within a much bigger picture. Do they want their much needed data put to good use on the Web, as some are doing (e.g. AGROVOC, GEMET, UKAT)? Or do they want alternative approaches to supplant them entirely (i.e. LCSH)? What's it gonna be, punks???

Monday 8 December 2008

Wikipedia censorship: allusions to 'Smell the Glove'?

Another Wikipedia controversy rages, this time over censorship. Over the past two days, the Internet Watch Foundation informed some ISPs that an article pertaining to an album by the 'classic' German heavy metal band, Scorpions, may be illegal. Leaving aside the fact that Scorpions is one of many groups to have similar imagery on their record sleeves (the eponymous 1969 debut album by Blind Faith, Eric Clapton's supergroup, being another obvious example), am I the only person to notice the similarities with the fictional rockumentary, This Is Spinal Tap?

Like metal, censorship is a heavy topic; but I thought this tenuous linkage with This Is Spinal Tap might be a welcome distraction from the usual blog postings, which are necessarily academic. Those of you familiar with said film might recall the controversy surrounding the proposed (tasteless) artwork for Spinal Tap's new album (Smell the Glove), which in the end gets mothballed owing to its indecent nature. Getting into trouble over sleeve art is part and parcel of being in a heavy metal band, it would seem! Enjoy the winter break, people!

Friday 5 December 2008

Some general musings on tag clouds, resource discovery and pointless widgets...

The efficacy of collaborative tagging in information retrieval and resource discovery has undergone some discussion on this blog in the past. Despite emerging a good couple of years ago – and like many other Web 2.0 developments – collaborative tagging remains a topic of uncertainty: an area lacking sufficient evaluation and research. A creature of collaborative tagging which has similarly evaded adequate evaluation is the (seemingly ubiquitous!) 'tag cloud'. Invented by Flickr (Flickr tag cloud) and popularised by delicious (and aren't you glad they dropped the irritating full stops in their name and URL a few months ago?), tag clouds are everywhere, cluttering interfaces with their differently and irritatingly sized fonts.

Coincidentally, a series of tag cloud themed research papers was discussed at one of our recent ISG research group meetings. One of the papers under discussion (Sinclair & Cardew-Hall, 2008) reported an experimental study comparing the usage and effectiveness of tag clouds with traditional search interface approaches to information retrieval. Their work is welcome, as it constitutes one of the few robust evaluations of tag clouds since they emerged several years ago.

One would hate to consider tag clouds as completely useless – and I have to admit to harbouring this thought. Fortunately, Sinclair and Cardew-Hall found tag clouds to be not entirely without merit. Whilst they are not conducive to precision retrieval and often conceal relevant resources, the authors found that users reported them useful for broad browsing and/or non-specific resource discovery. They were also found to be useful in characterising the subject nature of databases to be searched, thus aiding the information seeking process. The utility of tag clouds therefore remains confined to the search behaviours of inexperienced searchers and – as the authors conclude – cannot displace traditional search facilities or taxonomic browsing structures. As always, further research is required...

The only thing saving tag clouds from being completely useless is that they can occasionally assist you in finding something useful, perhaps serendipitously. What would be the point in having a tag cloud that didn't help you retrieve any information at all? Answer: There wouldn't be any point; but this doesn't stop some people. Recently we have witnessed the emergence of 'tag cloud generation' tools. Such tools generate tag clouds for Web pages, or text entered by the user. Wordle is one such example. They look nice and create interesting visualisations, but don't seem to do anything other than take a paragraph of text and increase the size of words based on frequency. (See the screen shot of a Wordle tag cloud for my home page research interests.)
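To show how little is going on under the bonnet, here is a rough Python sketch of frequency-based sizing. It is a guess at the general technique, not Wordle's actual algorithm; the function name tag_cloud_sizes and the font size range are my own assumptions.

```python
import re
from collections import Counter


def tag_cloud_sizes(text, min_size=10, max_size=36, top_n=20):
    """Map the most frequent words in `text` to font sizes.

    Sizes are scaled linearly between min_size and max_size according to
    where each word's count falls between the lowest and highest counts
    among the top_n words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    spread = highest - lowest
    sizes = {}
    for word, count in counts:
        if spread == 0:
            # Every word is equally frequent, so give them all the same size.
            sizes[word] = max_size
        else:
            scale = (count - lowest) / spread
            sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


sizes = tag_cloud_sizes("tag tag tag cloud cloud search")
print(sizes)  # {'tag': 36, 'cloud': 23, 'search': 10}
```

Note that nothing here knows what the words mean or links them to anything, which is rather the complaint.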


OCLC have developed their very own tag cloud generator. Clearly, this widget has been created while developing their suite of nifty services, such as WorldCat, DeweyBrowser, FictionFinder, etc., so we must hold fire on the criticism. But unlike Wordle, this is something OCLC could make useful. For example, if I generate a tag cloud via this service, I expect to be able to click on a tag and immediately initiate a search on WorldCat, or a variety of OCLC services … or the Web generally! In line with good information retrieval practice, I also expect stopwords to be removed. In my example some of the largest tags are nonsense, such as "etc", "specifically", "use", etc. But I guess this is also a fundamental problem with tagging generally...
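Neither of those two wishes is hard to grant. The Python sketch below filters stopwords and attaches a search link to each surviving tag. The stopword list is a tiny illustrative sample, the function linked_tags is hypothetical, and the search URL is a placeholder on example.org rather than a real WorldCat or OCLC endpoint.

```python
from urllib.parse import quote_plus

# Tiny illustrative stopword list; a real one would be far longer.
STOPWORDS = {"etc", "specifically", "use", "the", "and", "of", "a", "to", "in"}

# Placeholder search endpoint (an assumption, not a real OCLC URL).
SEARCH_URL = "https://example.org/search?q="


def linked_tags(counts, stopwords=STOPWORDS, search_url=SEARCH_URL):
    """Drop stopwords and pair each remaining tag with a search link.

    `counts` maps each tag to its frequency; the result is a list of
    (tag, count, url) tuples in the original order.
    """
    result = []
    for tag, count in counts.items():
        if tag.lower() in stopwords:
            continue
        result.append((tag, count, search_url + quote_plus(tag)))
    return result


tags = linked_tags({"etc": 9, "subject headings": 4, "use": 7, "SKOS": 3})
for tag, count, url in tags:
    print(tag, count, url)
```

Clicking a tag would then launch a search instead of doing nothing at all.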

OCLC are also in a unique position in that they have access to numerous terminologies. This obviously cracks open the potential for cross-referencing tags with their terminological datasets so that only genuine controlled subject terms feature in the tag cloud, or so that productive linkages can be established between tags and controlled terms. This idea is almost as old as tagging itself but, again, has taken until recently to be investigated properly. The connections between tags and controlled vocabularies are something the EnTag project, in which OCLC is a partner, has been exploring. In particular, EnTag (Enhanced Tagging for Discovery) has been exploring whether tag data, normally typified by its unstructured and uncontrolled nature, can be enhanced and rendered more useful by robust terminological data. The project finished a few months ago – and a final report is eagerly anticipated, particularly as my formative research group submitted a proposal to JISC but lost out to EnTag! C'est la vie!
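As a very rough illustration of what cross-referencing tags with a controlled vocabulary might involve, here is a Python sketch that matches tags against preferred and alternative labels. It is emphatically not EnTag's method: the vocabulary, the function names and the naive matching on normalised strings are all assumptions made for the sake of the example.

```python
def normalise(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def build_index(vocabulary):
    """Map every normalised preferred or alternative label to its preferred term."""
    index = {}
    for pref, alts in vocabulary.items():
        index[normalise(pref)] = pref
        for alt in alts:
            index[normalise(alt)] = pref
    return index


def match_tags(tags, vocabulary):
    """Split tags into those that map to a controlled term and those that don't."""
    index = build_index(vocabulary)
    matched, unmatched = {}, []
    for tag in tags:
        pref = index.get(normalise(tag))
        if pref is None:
            unmatched.append(tag)
        else:
            matched[tag] = pref
    return matched, unmatched


# Hypothetical vocabulary: preferred term -> alternative labels.
VOCABULARY = {
    "Information retrieval": ["IR", "document retrieval"],
    "Thesauri": ["thesaurus"],
}

matched, unmatched = match_tags(["ir", "Thesaurus", "toread"], VOCABULARY)
print(matched)    # {'ir': 'Information retrieval', 'Thesaurus': 'Thesauri'}
print(unmatched)  # ['toread']
```

A tag cloud built from the matched side alone would contain only genuine subject terms; what to do with the likes of "toread" is the interesting research question.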