
Tuesday, 2 November 2010

Crowd-sourcing faceted information retrieval

This blog has witnessed the demise of several search engines, all of which have attempted to challenge the supremacy of the big innovators - and I would tend to include Yahoo! and Bing before the obvious market leader. Yesterday it was the turn of Blekko to be the next Cuil. Or is it?

Blekko presents a fresh attempt to move web search forward, using a style of retrieval which has hitherto only been successful in systems based on pre-coordinated indexes and combining it with crowd-sourcing techniques. Interestingly, Rich Skrenta - co-founder of Blekko - was also a principal founder of the Dmoz project. Remember Dmoz? When I worked on BUBL years and years ago, I recall considering Dmoz to be an inferior beast. But it remains alive and kicking – and remains popular and relevant to modern web developments, with weekly RDF dumps of its rich, categorised, crowd-sourced content made available for Linked Data purposes. BUBL, on the other hand, has been static for years.

Skrenta's flirtation with taxonomical organisation and categorisation (as well as crowd-sourcing) at Dmoz has obviously influenced the Blekko approach to search. Blekko provides innovation in retrieval by enabling users to define their very own vertical search indexes using so-called 'slashtags', thus (essentially) providing a quasi-faceted form of search. The advantage of this approach is that using a particular slashtag (or facet, if you prefer) in a query increases precision by removing 'irrelevant' results associated with different meanings of the search query terms. Sounds good, eh? Ranganathan would be salivating at such functionality in automatic indexing! To provide some form of critical mass, Blekko has provided hundreds of slashtags that can be used straight away; but the future of slashtags depends on users creating their own, which will be screened by Blekko before being added to their publicly available slashtags list. Blekko users can also assist in weeding out poor results and any erroneous slashtag results (see the video below), thus contributing to the improved precision Blekko purports to have and maintaining slashtag efficacy. In fact, Skrenta proposes that the Blekko approach will improve precision in the longer term. Says Skrenta on the BBC dot.Maggie blog:
"The only way to fix this [precision problem] is to bring back large-scale human curation to search combined with strong algorithms. You have to put people into the mix […] Crowdsourcing is the only way we will be able to allow search to scale to the ever-growing web".
Let's look at a typical Blekko query. I am interested in the new Microsoft Windows mobile OS, and in bona fide reviews of the new OS. Moreover, since I am tech savvy and will have read many reviews, I am only interested in reviews published recently (i.e. within the past two weeks, or so). In Blekko we can search like so…

"windows mobile 7" /tech-reviews /date

…where the /tech-reviews slashtag limits results to genuine reviews published in the technology press and/or associated websites, and the /date slashtag orders the results by date. It works, and works spectacularly well. Skrenta sticks two fingers up at his competitors when, in the Blekko promotional video, he quips, "Try doing this [type of] search anywhere else!" Blekko provides 'Five use cases where slashtags shine' which - although each uses only one slashtag - illustrate how the approach can be used in a variety of different queries. Of course, Blekko can still be used like a conventional search engine, i.e. enter a query and get results ranked according to the Blekko algorithm. And on this count – using my own personal 'search engine test queries' - Blekko appears to rank relevant results sensibly and to index pages which other search engines either ignore or, if they do index them, drown in spam results which they rank as more relevant.

There is a lot to admire about Blekko. Aside from an innovative approach to information retrieval, there is also a commitment to algorithm openness and transparency which SEO people will be pleased about; but I worry that, while a Blekko slashtag search is innovative and useful, most users will approach Blekko as just another search engine rather than buying into the importance of slashtags and, in doing so, will not hang around long enough to 'get it' (even though I intend to...). Indeed, to some extent Blekko has more in common with command line searching of online databases in days of yore. There are also some teething troubles which rigorous testing can reveal. But there are reasons to be hopeful. Blekko is presumably hoping to promote slashtag popularity and have users following slashtags just as they follow Twitter groups, thus driving website traffic and, presumably, advertising. Owning a popular slashtag could then be not only useful but highly profitable, even if Blekko remains small.


blekko: how to slash the web from blekko on Vimeo.

Wednesday, 23 June 2010

Visualising the metadata universe

No blog postings for almost three months and then two come along at once...  I thought it would be worth drawing readers' attention to the recent work of Jenn Riley of Indiana University.  Jenn is currently metadata guru for the Indiana University Digital Library Program and yesterday on the Dublin Core list she announced the output of a project to build a conceptual model of the 'metadata universe'. 

As evidenced by some of my blogs, there are literally hundreds of metadata standards and structured data formats available, all with their own acronym.  This seems to have become more complicated with the emergence of numerous XML-based standards in the early-to-mid noughties, and the more recent proliferation of RDF vocabularies for the Semantic Web and the associated Linked Data drive.  What formats exist?  How do they relate to each other?  For which communities of practice are they optimised, e.g. the information industry or the cultural sector?  Which metadata standards, technical standards and vocabularies should I be cognisant of in my area?  And so the question list goes on...


These questions can be difficult to answer, and it is for this reason that Jenn Riley has produced a gigantic poster diagram (above) entitled 'Seeing standards: a visualization of the metadata universe'.  The diagram achieves what a good model should: it simplifies complex phenomena and presents a large volume of information in a condensed way.  As the website blurb states:
"Each of the 105 standards listed here is evaluated on its strength of application to defined categories in each of four axes: community, domain, function, and purpose. The strength of a standard in a given category is determined by a mixture of its adoption in that category, its design intent, and its overall appropriateness for use in that category."
A useful conceptual tool for academics, practitioners and students alike.  A glossary of metadata standards in either poster or pamphlet form is also available.

Friday, 26 June 2009

Read all about it: interesting contributions at ISKO-UK 2009

I had the pleasure of attending the ISKO-UK 2009 conference earlier this week at University College London (UCL), organised in association with the Department of Information Studies. This was my first visit to the home of the architect of Utilitarianism, Jeremy Bentham, and my first visit to the nearby St. Pancras International since its revamp - and what a smart train station it is.

The ISKO conference theme was 'content architecture', with a particular focus on:
  • "Integration and semantic interoperability between diverse resources – text, images, audio, multimedia
  • Social networking and user participation in knowledge structuring
  • Image retrieval
  • Information architecture, metadata and faceted frameworks"
The underlying themes throughout most papers were those related to the Semantic Web, Linked Data, and other Semantic Web inspired approaches to resolving or ameliorating common problems within our disciplines. There were a great many interesting papers delivered and it is difficult to say something about them all; however, for me, there were particular highlights (in no particular order)...

Libo Eric Si (et al.) from the Department of Information Science at Loughborough University described research to develop a prototype middleware framework linking disparate terminology resources in order to facilitate subject cross-browsing of information and library portal systems. A lot of work has already been undertaken in this area (see, for example, the HILT project (a project in which I used to be involved) and CrissCross), so it was interesting to hear about his 'bag' approach in which – rather than using precise mappings between different Knowledge Organisation Systems (KOS) (e.g. thesauri, subject heading lists, taxonomies, etc.) - "a number of relevant concepts could be put into a 'bag', and the bag is mapped to an equivalent DDC concept. The bag becomes a very abstract concept that may not have a clear meaning, but based on the evaluation findings, it was widely-agreed that using a bag to combine a number of concepts together is a good idea".
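Purely to illustrate the shape of the idea, here is a toy sketch in Python (my own, not Si's implementation; the terms and the DDC notation are invented for illustration):

    # Toy sketch of the 'bag' approach described above: concepts from several
    # KOS are grouped into one bag, and the bag as a whole - not each concept -
    # is mapped to a single DDC class. All values are illustrative only.
    bags = [
        {
            "members": {
                ("LCSH", "Information retrieval"),
                ("AAT", "information retrieval"),
                ("UNESCO Thesaurus", "Information retrieval"),
            },
            "ddc": "025.524",  # hypothetical target notation
        },
    ]

    def ddc_for(kos, term):
        """Return the DDC notation of the first bag containing (kos, term)."""
        for bag in bags:
            if (kos, term) in bag["members"]:
                return bag["ddc"]
        return None

    print(ddc_for("LCSH", "Information retrieval"))  # -> 025.524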

Brian Matthews (et al.) reported on an evaluation of social tagging and KOS. In particular, they investigated ways of enhancing social tagging via KOS, with a view to improving the quality of tags and, in turn, retrieval performance. A detailed and robust methodology was provided, but essentially groups of participants were given the opportunity to tag resources using free tags, controlled terms (i.e. from KOS), or terms displayed in a tag cloud, all within a specially designed demonstrator. Participants were later asked to try the alternative tools in order to gather data on the nature of user preferences. There are numerous findings - and a pre-print of the paper is already available on the conference website so you can read these yourself - but the main ones can be summarised from their paper as follows (some were surprising):
  • "Users appreciated the benefits of consistency and vocabulary control and were potentially willing to engage with the tagging system;
  • There was evidence of support for automated suggestions if they are appropriate and relevant;
  • The quality and appropriateness of the controlled vocabulary proved to be important;
  • The main tag cloud proved problematic to use effectively; and,
  • The user interface proved important along with the visual presentation and interaction sequence."
The user preference for controlled terms was reassuring. In fact, as Matthews et al. report:
"There was general sentiment amongst the depositors that choosing terms from a controlled vocabulary was a "Good Thing" and better than choosing their own terms. The subjects could overall see the value of adding terms for information retrieval purposes, and could see the advantages of consistency of retrieval if the terms used are from an authoritative source."
Chris Town from the University of Cambridge Computer Laboratory presented two (see [1], [2]) equally interesting papers relating to image retrieval on the Web. Although images and video now comprise the majority of Web content, the vast majority of retrieval systems essentially use the text, tags, etc. that surround images in order to make assumptions about what an image might be. Of course, using any major search engine we discover that this approach is woefully inaccurate. Dr. Town has developed improved approaches to content-based image retrieval (CBIR) which provide a novel way of bridging the 'semantic gap' between the retrieval model used by the system and that of the user. His approach is founded on the "notion of an ontological query language, combined with a set of advanced automated image analysis and classification models". This approach has been so successful that he has founded his own company, Imense. The difference in performance between Imense and Google is staggering and has to be seen to be believed. Examples can be found in his presentation slides (which will be on the ISKO website soon), but can also be observed by simply messing around with the Imense Picture Search.

Chris Town's second paper essentially explored how best to do the CBIR image processing required for the retrieval system. According to Dr. Town there are approximately 20 billion images on the web, with the majority at a high resolution, meaning that by his calculation it would take 4000 years to undertake the necessary CBIR processing to facilitate retrieval! Phew! Large-scale grid computing options therefore have to be explored if the approach is to be scalable. Chris Town and his colleague Karl Harrison therefore undertook a series of CBIR processing evaluations by distributing the required computational task across thousands of Grid nodes. This distributed approach resulted in the processing of over 25 million high resolution images in less than two weeks, thus making grid processing a scalable option for CBIR.
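For the curious, the arithmetic behind that kind of claim is easy to sketch. In the back-of-envelope Python below, only the 20 billion figure comes from the talk; the per-image processing cost is my own assumption, chosen purely to show how a '4000 years' estimate arises:

    # Back-of-envelope sketch of the CBIR scaling argument.
    images_on_web = 20e9            # ~20 billion images (figure from the talk)
    seconds_per_image = 6.3         # assumed CBIR cost per image
    seconds_per_year = 3600 * 24 * 365

    serial_years = images_on_web * seconds_per_image / seconds_per_year
    print(round(serial_years))      # ~4000 years on a single machine

    # Spreading the work across N grid nodes divides the wall-clock time by
    # roughly N, which is the whole case for Grid computing here.
    for nodes in (1000, 4000, 10000):
        print(nodes, "nodes:", round(serial_years / nodes, 1), "years")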

Andreas Vlachidis (et al.) from the Hypermedia Research Unit at the University of Glamorgan described the use of 'information extraction' techniques based on Natural Language Processing (NLP) to assist in the semantic indexing of archaeological text resources. Such 'Grey Literature' is a good test bed as more established indexing techniques are insufficient in meeting user needs. The aim of the research is to create a system capable of being "semantically aware" during document indexing. Sounds complicated? Yes – a little. Vlachidis is achieving this by using a core cultural heritage ontology and the English Heritage Thesauri to support the 'information extraction' process, providing "a semantic framework in which indexed terms are capable of supporting semantic-aware access to on-line resources".

Perhaps the most interesting aspect of the conference was that it was well attended by people from outside the academic fraternity, and as such there were papers on how such organisations are doing innovative work with a range of technologies, specifications and standards which, to a large extent, remain the preserve of researchers and academics. Papers were delivered by technical teams at the World Bank and Dow Jones, for example. Perhaps the most interesting contribution from the 'real world', though, was that delivered by Tom Scott, a key member of the BBC's online and technology team. Tom is a key proponent of the Semantic Web and Linked Data at the BBC and his presentation threw light on BBC activity in this area – and rather coincidentally complemented an accidental discovery I made a few weeks ago.

Tom currently leads the BBC Earth project, which aims to bring more of the BBC's Natural History content online and bring the BBC into the Linked Data cloud, thus enabling intelligent linking, re-use and re-aggregation with what's already available. He provided interesting examples of how the BBC is exposing structured data about all forms of BBC programming on the Web by adopting a Linked Data approach, and he expressed a desire for users to be able to traverse detailed and well connected RDF graphs. Says Tom on his blog:
"To enable the sharing of this data in a structured way, we are using the linked data approach to connect and expose resources i.e. using web technologies (URLs and HTTP etc.) to identify and link to a representation of something, and that something can be person, a programme or an album release. These resources also have representations which can be machine-processable (through the use of RDF, Microformats, RDFa, etc.) and they can contain links for other web resources, allowing you to jump from one dataset to another."
Whilst Tom conceded that this work is small compared to the entire output and technical activity at the BBC, it still constitutes a huge volume of data and is significant owing to the BBC's pre-eminence in broadcasting. Tom even reported that a SPARQL end point will be made available to query this data. I had actually hoped to ask Tom a few questions during the lunch and coffee breaks, but he was such a popular guy that in the end I lost my chance, such is the existence of a popular techie from the Beeb.
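To make the 'traversing RDF graphs' point concrete, here is a minimal sketch of what following Linked Data looks like programmatically. This is my own illustration, not BBC code, and the URI is a placeholder rather than a real BBC identifier:

    # Dereference a resource URI, parse the RDF returned, and list the links
    # out to other resources - each of which can be fetched the same way.
    import rdflib

    g = rdflib.Graph()
    g.parse("http://example.org/programmes/some-programme.rdf")  # placeholder URI

    for s, p, o in g:
        if isinstance(o, rdflib.URIRef):
            print(p, "->", o)   # follow-your-nose links to other resources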

Pre-print papers from the conference are available on the proceedings page of the ISKO-UK 2009 website; however, fully peer reviewed and 'added value' papers from the conference are to be published in a future issue of Aslib Proceedings.

Tuesday, 16 June 2009

11 June 2009: the day Common Tags was born and collaborative tagging died?

Mirroring the emergence of other Web 2.0 concepts, 2004-2006 witnessed a great deal of hyperbole about collaborative tagging (or 'folksonomies' as they are sometimes known). It is now 2009 and most of us know what collaborative tagging is, so I'll avoid contributing to the pile of definitions already available. The hype subsided after 2006 (how active is Tagsonomy now?), but the implementation of tagging within services of all types didn't; tagging became, and remains, ubiquitous.

The strange thing about collaborative tagging is that when it emerged the purveyors of its hype (e.g. Clay Shirky in particular, but there were many others) drowned out the comments made by many in the information, computer and library sciences. The essence of these comments was that collaborative tagging broke so many of the well established rules of information retrieval that it would never really work in general resource discovery contexts. In fact, collaborative tagging was so flawed on a theoretical level that further exploration of its alleged benefits was considered futile. Indeed, to this day, research has been limited for this reason, and I recall attending a conference in Bangalore at which lengthy discussions ensued about tagging being ineffective and entirely unscalable. For the tagging evangelists, though, these comments simply provided proof that these communities were 'stuck in their ways' and harboured an unwillingness to break with theoretical norms. One of the most irritating aspects of the position adopted by the evangelists was that they relied on the power of persuasion and were never able to point to evidence. Moreover, even their powers of persuasion were lacking because most of them were generally 'technology evangelists' with no real understanding of the theories of information retrieval or knowledge organisation; they were simply being carried along by the hype.

The difficulties surrounding collaborative tagging for general resource discovery are multifarious and have been summarised elsewhere; but one of the intractable problems relates to the lack of vocabulary control or collocation and the effect this has on retrieval recall and precision. The Common Tags website summarises the root problem in three sentences (we'll come back to Common Tags in a moment…):
"People use tags to organize, share and discover content on the Web. However, in the absence of a common tagging format, the benefits of tagging have been limited. Individual things like New York City are often represented by multiple tags (like 'nyc', 'new_york_city', and 'newyork'), making it difficult to organize related content; and it isn’t always clear what a particular tag represents—does the tag 'jaguar' represent the animal, the car company, or the operating system?"
These problems have been recognised since the beginning and were anticipated in the theoretical arguments posited by those in our communities of practice. Research has therefore focused on how searching or browsing tags can be made more reliable for users, either by structuring them, mapping them to existing knowledge structures, or using them in conjunction with other retrieval tools (e.g. supplementing tools based on automatic indexing). In short, tags in themselves are of limited use and the trend is now towards taming them using tried and tested methods. For advocates of Web 2.0 and the social ethos it often promotes, this is really a reversal of the tagging philosophy - but it appears to be necessary.

The root difficulty relates to the use of collaborative tagging in Personal Information Management (PIM). Make no bones about it, tagging originally emerged as a PIM tool and it is here that it has been most successful. I, for example, make good use of BibSonomy to organise my bookmarks and publications. BibSonomy might be like delicious on steroids, but one of its key features is the use of tags. In late 2005 I submitted a paper to the WWW2006 Collaborative Tagging Workshop with a colleague. Submitted at the height of the tagging hyperbole, it was a theoretical paper exploring some of the difficulties with tagging as a general resource discovery tool. In particular, we aimed to explore the difficulties in expecting a tool optimised for PIM to yield benefits when used for general resource discovery, and we noted how 'PIM noise' was being introduced into users' results. How could tags that were created to organise a personal collection be expected to provide a reasonable level of recall, let alone precision? Unfortunately it wasn't accepted; but since it scored well in peer review I like to think that the organising committee were simply overwhelmed by submissions!! (It is also noteworthy that no other collaborative tagging workshops have been held since.)

Nevertheless, the basic thesis remains valid. It is precisely this tension (i.e. PIM vs. general resource discovery) which has compromised the effectiveness of collaborative tagging for anything other than PIM. Whilst patterns can be observed in collaborative tagging behaviour, we generally find that the problems summarised in the Common Tags quote above are insurmountable – and this is simply because tags are used for PIM first and foremost, and often tell us nothing about the intellectual content of the resource ('toPrint' anyone? 'toRead', 'howto', etc.). True – users of tagging systems can occasionally discover similar items tagged by other users. But how useful is this and how often do you do it? And how often do you search tags? I never do any of these things because the results are generally feeble and I'm not particularly interested in what other people have been tagging. Is anyone? So whilst tags have taken off in PIM, their utility in facilitating wider forms of information retrieval has been quite limited.

Common Tags

Last Friday the Common Tags initiative was officially launched. Common Tags is a collaboration between some established Web companies and university research centres, including DERI at the National University of Ireland and Yahoo!. It is an attempt to address the multifarious problems above and to widen the use of tags. Says the Common Tags website:
"The Common Tag format was developed to address the current shortcomings of tagging and help everyone—including end users, publishers, and developers—get more out of Web content. With Common Tag, content is tagged with unique, well-defined concepts – everything about New York City is tagged with one concept for New York City and everything about jaguar the animal is tagged with one concept for jaguar the animal. Common Tag also provides access to useful metadata that defines each concept and describes how the concepts relate to one another. For example, metadata for the Barack Obama Common Tag indicates that he's the President of the United States and that he’s married to Michelle Obama."
Great! But how is Common Tags achieving this? Answer: RDFa. What else? Common Tags enables each tag to be defined using a concept URI taken from Freebase or DBPedia (much like more formal methods, e.g. SKOS/RDF), thus permitting the unique identification of concepts and ameliorating some of our resource discovery problems (see the Common Tags workflow diagram below). A variety of participating social bookmarking websites will also enable users to bookmark using Common Tags (e.g. ZigTag, Faviki, etc.). In short, Common Tags attempts to Semantic Web-ify tags using RDFa/XHTML compliant web pages and in so doing makes tags more useful in general resource discovery contexts. Faviki even describes them as Semantic Tags and employs the logo strap line, 'tags that make sense'. Common Tags won't solve everything, but at the very least we will see some improvement in recall and increased precision in certain circumstances, as well as the benefits of Semantic Web integration.
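The essence of the fix is easy to sketch. The following toy Python illustration is mine rather than anything drawn from a Common Tag implementation (the tag strings and DBpedia URIs are illustrative), but it shows how binding tags to concept URIs collocates variant tags and splits an ambiguous string like 'jaguar' by meaning:

    # Bind surface tags to concept URIs so that variant spellings collocate
    # and ambiguous strings are separated by meaning. Mappings are illustrative.
    tag_concepts = {
        "nyc":           "http://dbpedia.org/resource/New_York_City",
        "new_york_city": "http://dbpedia.org/resource/New_York_City",
        "newyork":       "http://dbpedia.org/resource/New_York_City",
        "jaguar_animal": "http://dbpedia.org/resource/Jaguar",
        "jaguar_cars":   "http://dbpedia.org/resource/Jaguar_Cars",
    }

    def collocate(tags):
        """Group surface tags by the concept URI they identify."""
        groups = {}
        for t in tags:
            groups.setdefault(tag_concepts.get(t, t), []).append(t)
        return groups

    print(collocate(["nyc", "newyork", "new_york_city", "jaguar_cars"]))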

So, in summary, collaborative tagging hasn't died, but at least now - at long last - it might become useful for something other than PIM. There is irony in the fact that formal description methods have to be used to improve tag utility, but will the evangelists see it? Probably not.

Friday, 12 June 2009

Quasi-facetted retrieval of images using emotions?

As part of my literature catch-up I found an extremely interesting paper in JASIST by S. Schmidt and Wolfgang G. Stock entitled 'Collective indexing of emotions in images: a study in emotional information retrieval'. The motivation behind the research is simple: images tend to elicit emotional responses in people. Is it therefore possible to capture these emotional responses and use them in image retrieval?

An interesting research question indeed, and Schmidt and Stock's study found that 'yes', it is possible to capture these emotional responses and use them. In brief, their research asked circa 800 users to tag a variety of public images from Flickr using their scroll-bar tagging system. This scroll-bar tagging system allowed users to tag images according to a series of specially selected emotional responses and to indicate the intensity of these emotions. Schmidt and Stock found that users tended to have favourite emotions and this can obviously differ between users; however, for a large proportion of images the consistency of emotion tagging is very high (i.e. a large proportion of users frequently experience the same emotional response to an image). It's a complex area of study and their paper is recommended reading precisely for this reason (capturing emotions anyone?!), but their conclusions suggest that:
"…it seems possible to apply collective image emotion tagging to image information systems and to present a new search option for basic emotions."
To what extent does the image above (by D Sharon Pruitt) make you feel happiness, anger, sadness, disgust or fear? It is early days, but the future application of such tools could find a place within the growing suite of image filters that many search engines have recently unveiled. For example, yesterday Keith Trickey was commenting on the fact that the image filters in Bing are better than those in Google or Yahoo!. True. There are more filters, and they seem to work better. In fact, they provide a species of quasi-taxonomical facets: (by) size, layout, color, style and people. It's hardly Ranganathan's PMEST, but – keeping in mind that no human intervention is required - it's a useful quasi-faceted way of retrieving or filtering images, albeit a flat one.

An emotional facet, based on Schmidt and Stock's research, could easily be added to systems like Bing. In the medium term, though, it is Yahoo! that is better placed to harness the potential of emotional tagging. They own Flickr and have recently incorporated the searching and filtering of Flickr images within Yahoo! Image Search. As Yahoo! are keen for us to use Image Search to find CC images for PowerPoint presentations, or to illustrate a blog, being able to filter by emotions would be a useful addition to the filtering arsenal.

Friday, 1 May 2009

LCSH as Linked Data ... officially!

Yesterday was, in my estimation, pretty historic. The Library of Congress officially launched the LC Authorities and Vocabularies service. You might recall a previous post relating to lcsh.info in which I lamented the LC's decision to pull down a SKOS demonstrator of LCSH, explicitly designed to explore the possibilities of Linked Data and dereferenceable URIs. All the background is in the previous post; but the whole episode appears to have been a PR disaster for LC.

The great news is that the LC Authorities and Vocabularies service (let's call it LCAV henceforth, shall we?) has officially re-launched lcsh.info in a bigger, better and much improved form. The service essentially enables both humans and machines to access a plethora of LC authority data. Like lcsh.info, the service follows Linked Data principles, employing Semantic Web approaches to expose and link this data on the Web via dereferenceable URIs.

Five minutes exploring the website reveals that LCAV serves up the entire LCSH for free, with incredible search and browse functionality, leaving Connexion in the shade. The concept URIs point to detailed data modelled in SKOS, rendered as RDFa for human readability, with links to SKOS as RDF/XML, N-Triples and (the less familiar?) JSON for machine processing. RDF graphs can even be visualised by clicking, well, the 'visualize' tab – incredible. Mappings to other vocabularies are also provided.
On top of all this, LCSH can be downloaded in its entirety as RDF/XML or N-Triples (SKOS)! LCAV also indicate that further authority data will be made available soon.
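To give a flavour of why the bulk download matters, here is a minimal sketch of working with it programmatically. It assumes the N-Triples dump has been saved locally as 'lcsh.nt'; the file name and everything else here are my own illustration, not LC documentation:

    # Load the SKOS N-Triples dump of LCSH and walk the concepts and their
    # preferred labels - the sort of thing locked-down data never allowed.
    import rdflib
    from rdflib.namespace import RDF, SKOS

    g = rdflib.Graph()
    g.parse("lcsh.nt", format="nt")   # local copy of the bulk download

    for concept in g.subjects(RDF.type, SKOS.Concept):
        for label in g.objects(concept, SKOS.prefLabel):
            print(concept, label)
        break   # first concept only, as a smoke test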

Make no bones about it, this is historic stuff, not only because the service is so good but because this terminological data is no longer locked down. I think it's important to stroke our imaginary beards over the significance of the LC's change of direction. Is this the beginning of the end for locked down terminological data?! Will they be like dominoes henceforth? A fiver says DDC does the same by the end of the year. Any takers???

Wednesday, 24 December 2008

SKOS-ifying Knowledge Organisation Systems: a continuing contradiction for the Semantic Web

A few days ago Ed Summers announced on his blog that he was shutting down lcsh.info. For those that don't know, lcsh.info was a Semantic Web demonstrator developed by Ed with the express purpose of illustrating how the Library of Congress Subject Headings (LCSH) could be represented, and its structure harnessed, using Simple Knowledge Organisation Systems (SKOS). In particular, Ed was keen to explore issues pertaining to Linked Data and the representation of concepts using URIs. He even hoped that the URIs used would be Cool URIs, linking eventually to a bona fide LCSH service were one ever to be released. Sadly, it was not to be... The reasons remain unclear but were presumably related to IPR. As the lcsh.info blog entry notes, Ed was compelled to remove it by the Library of Congress itself. The fact that he was the LC's resident Semantic Web buff probably didn't help matters, I'm sure.

SKOS falls within my area of interest and is an initiative of the Semantic Web Deployment Working Group. In brief, SKOS is an application of RDF and RDFS and comprises a series of evolving specifications and standards supporting the use of knowledge organisation systems (KOS) (e.g. information retrieval thesauri, classification schemes, subject heading systems, taxonomies or any other controlled vocabulary) within the framework of the Semantic Web. The Semantic Web is many things of course; but it is predicated upon the assumption that there exist communities of practice willing and able to create the necessary structured data (generally applications of RDF) to make it work. This might be metadata, or it might be an ontology, or it might be a KOS represented in SKOS. The resulting data is open and can be re-used, integrated, interconnected and queried. When large communities of practice fail to contribute, the model breaks down.
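For anyone unfamiliar with SKOS, a single concept is easy to sketch. The following minimal example is my own, built with rdflib and using invented URIs, but it shows the basic shape of a concept with labels and a broader term:

    # Build one SKOS concept with a preferred label, an alternative label and
    # a broader concept, then print it as Turtle.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    EX = Namespace("http://example.org/vocab/")
    g = Graph()
    g.bind("skos", SKOS)

    g.add((EX.InformationRetrieval, RDF.type, SKOS.Concept))
    g.add((EX.InformationRetrieval, SKOS.prefLabel,
           Literal("Information retrieval", lang="en")))
    g.add((EX.InformationRetrieval, SKOS.altLabel,
           Literal("Information storage and retrieval", lang="en")))
    g.add((EX.InformationRetrieval, SKOS.broader, EX.InformationScience))

    print(g.serialize(format="turtle"))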

There is a sense in which the Semantic Web has been designed to bring out the schizophrenic tendencies within some quarters of the LIS community. Whilst the majority of our community has embraced SKOS (and other related specifications), can appreciate the potential and actively contributes to the evolution of the standards, there is a small coterie that flirts with the technology whilst simultaneously baulking at the thought of exposing hitherto proprietary data. It's the 'lock down' versus 'openness' contradiction again.

In a previous research post I was involved with the High-Level Thesaurus (HILT) research project, and I continue my involvement in a consultative capacity. HILT continues to research and develop a terminology web service providing M2M access to a plethora of terminological data, including terminology mappings. Such terminological data can be incorporated into local systems to improve local searching functionality. Improvements might include, say, implementing a dynamic hierarchical subject browsing tree, or incorporating interactive query expansion techniques as part of the search interface. An important goal - and the original motivation behind HILT - is to develop a 'terminology mapping server' capable of ameliorating the "limited terminological interoperability afforded between the federation of repositories, digital libraries and information services comprising the UK Joint Information Systems Committee (JISC) Information Environment" (Macgregor et al., 2007), thus enabling accurate federated subject-based information retrieval. This is a blog so detail will be avoided for now; but, in essence, HILT is an attempt to provide a terminology server in a mash-up context using open standards. To make the terminological data as usable as possible and to expose it to the Semantic Web, the data is modelled using SKOS.
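As a purely hypothetical sketch of the kind of local improvement this enables (the service URL and JSON response shape below are invented for illustration, not HILT's actual interface):

    # Ask a terminology server for terms related to the user's query and use
    # them for interactive query expansion in a local search interface.
    import json
    import urllib.parse
    import urllib.request

    def related_terms(term):
        url = ("http://terminology.example.org/lookup?term="
               + urllib.parse.quote(term))
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)   # assumed shape: {"narrower": [...], "related": [...]}
        return data.get("narrower", []) + data.get("related", [])

    query = "renewable energy"
    expanded = [query] + related_terms(query)
    print(" OR ".join('"%s"' % t for t in expanded))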

But what happens to HILT when/if it becomes an operational service? Will its terminological innards be ripped out by the custodians of terminologies because they no longer want their data exposed, or will the ethos of the model be undermined as service administrators permit only HE institutions or charitable organisations to access the data? This isn't a concern for HILT yet; but it is one I anticipated several years ago. And the sad experience of lcsh.info illustrates that it's a very real concern.

Digital libraries, repositories and other information services have to decide where they want to be. This is a crossroads within a much bigger picture. Do they want their much needed data put to a good use on the Web, as some are doing (e.g. AGROVOC, GEMET, UKAT)? Or do they want alternative approaches to supplant them entirely (i.e. LCSH)? What's it gonna be, punks???

Friday, 5 December 2008

Some general musings on tag clouds, resource discovery and pointless widgets...

The efficacy of collaborative tagging in information retrieval and resource discovery has undergone some discussion on this blog in the past. Despite emerging a good couple of years ago – and like many other Web 2.0 developments – collaborative tagging remains a topic of uncertainty; an area lacking sufficient evaluation and research. A creature of collaborative tagging which has similarly evaded adequate evaluation is the (seemingly ubiquitous!) 'tag cloud'. Invented by Flickr (Flickr tag cloud) and popularised by delicious (and aren't you glad they dropped the irritating full stops in their name and URL a few months ago?), tag clouds are everywhere, cluttering interfaces with their differently and irritatingly sized fonts.

Coincidentally, a series of tag cloud themed research papers were discussed at one of our recent ISG research group meetings. One of the papers under discussion (Sinclair & Cardew-Hall, 2008) reported an experimental study comparing the usage and effectiveness of tag clouds with traditional search interface approaches to information retrieval. Their work is welcome since it constitutes one of the few robust evaluations of tag clouds since they emerged several years ago.

One would hate to consider tag clouds completely useless – and I have to admit to harbouring this thought. Fortunately, Sinclair and Cardew-Hall found tag clouds to be not entirely without merit. Whilst they are not conducive to precision retrieval and often conceal relevant resources, the authors found that users reported them useful for broad browsing and/or non-specific resource discovery. They were also found to be useful in characterising the subject nature of the databases to be searched, thus aiding the information seeking process. The utility of tag clouds therefore remains confined to the search behaviours of inexperienced searchers and – as the authors conclude – they cannot displace traditional search facilities or taxonomic browsing structures. As always, further research is required...

The only thing saving tag clouds from being completely useless is that they can occasionally assist you in finding something useful, perhaps serendipitously. What would be the point in having a tag cloud that didn't help you retrieve any information at all? Answer: There wouldn't be any point; but this doesn't stop some people. Recently we have witnessed the emergence of 'tag cloud generation' tools. Such tools generate tag clouds for Web pages, or text entered by the user. Wordle is one such example. They look nice and create interesting visualisations, but don't seem to do anything other than take a paragraph of text and increase the size of words based on frequency. (See the screen shot of a Wordle tag cloud for my home page research interests.)


OCLC have developed their very own tag cloud generator. Clearly, this widget has been created while developing their suite of nifty services, such as WorldCat, DeweyBrowser, FictionFinder, etc., so we must hold fire on the criticism. But unlike Wordle, this is something OCLC could make useful. For example, if I generate a tag cloud via this service, I expect to be able to click on a tag and immediately initiate a search on WorldCat, or a variety of OCLC services … or the Web generally! In line with good information retrieval practice, I also expect stopwords to be removed. In my example some of the largest tags are nonsense, such as "etc", "specifically" and "use". But I guess this is also a fundamental problem with tagging generally...
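By way of illustration, the behaviour I'm asking for isn't hard to sketch. In the toy Python below the stopword list is deliberately tiny and the WorldCat search URL pattern is a guess of mine, not a documented OCLC API:

    # Frequency-sized tag cloud with stopwords removed, each tag linking to a
    # catalogue search rather than just sitting there looking pretty.
    import re
    from collections import Counter

    STOPWORDS = {"the", "and", "etc", "use", "specifically", "of", "a", "to", "in"}

    def tag_cloud(text, min_size=10, max_size=36, limit=30):
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOPWORDS]
        counts = Counter(words)
        top = counts.most_common(1)[0][1]
        for word, n in counts.most_common(limit):
            size = int(min_size + (max_size - min_size) * n / top)
            yield ('<a style="font-size:%dpx" '
                   'href="http://www.worldcat.org/search?q=%s">%s</a>'
                   % (size, word, word))

    print("\n".join(tag_cloud("information retrieval and knowledge organisation, "
                              "information use, information seeking etc")))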

OCLC are also in a unique position in that they have access to numerous terminologies. This obviously opens up the potential for cross-referencing tags with their terminological datasets so that only genuine controlled subject terms feature in the tag cloud, or so that productive linkages can be established between tags and controlled terms. This idea is almost as old as tagging itself but, again, has taken until recently to be investigated properly. Exploring the connections between tags and controlled vocabularies is something the EnTag project is investigating, a partner in which is OCLC. In particular, EnTag (Enhanced Tagging for Discovery) is exploring whether tag data, normally typified by its unstructured and uncontrolled nature, can be enhanced and rendered more useful by robust terminological data. The project finished a few months ago – and a final report is eagerly anticipated, particularly as my formative research group submitted a proposal to JISC but lost out to EnTag! C'est la vie!

Tuesday, 7 October 2008

Search engines: solving the 'Anomalous State of Knowledge'

Information retrieval (IR) remains one of the most active areas of research within the information, computing and library science communities. It also remains one of the sexiest. The growth in information retrieval sex appeal correlates clearly with the growth of the Web and the need for improvements in retrieval systems based on automatic indexing. No doubt the flurry of big name academics and Silicon Valley employees attending conferences such as SIGIR also adds glamour. Nevertheless, the allure of IR research has precipitated some of the best innovations in IR ever, as well as creating some of the most important search engines and business brands. Of course, asked to pick their favourite search engine or brand from a list, most would probably select Google.

The habitual use of Google by students (and by real people generally!) was discussed in a previous post and needn't be revisited here. Nevertheless, one of the most distressing aspects of Google (for me, at least!) is a recent malaise in its commitment to search. There have been some impressive innovations in a variety of search engines in a variety of areas. For example, Yahoo! has moved to better harness metadata and Semantic Web data on the Web. More interestingly though, some recent and impressive innovations in solving the 'ASK conundrum' are visible in a variety of search engines, but not in Google. Although Google always tells us that search is its bread and butter, is it spreading itself a little too thinly? Or - with a brand loyalty second to none and the robust PageRank algorithm deployed to good effect – is Google resting on its laurels?

In 1982 a young Nicholas J. Belkin spearheaded a series of seminal papers documenting various models of users' information needs in IR. These papers remain relevant today and are frequently cited. One of Belkin et al.'s central suppositions is that the user suffers from the so-called Anomalous State of Knowledge, which can be conveniently acronymised to 'ASK'. Their supposition can be summarised by the following quote from their JDoc paper:
"[P]eople who use IR systems do so because they have recognised an anomaly in their state of knowledge on some topic, but they are unable to specify precisely what is necessary to resolve that anomaly. ... Thus, we presume that it is unrealistic (in general) to ask the user of an IR system to say exactly what it is that she/he needs to know, since it is just the lack of that knowledge which has brought her/him to the system in the first place".
This astute deduction ushered in a branch of IR research that sought to improve retrieval by resolving the Anomalous State of Knowledge (e.g. providing the user with assistance in the query formulation process, helping users ‘fill in the blanks’ to improve recall (e.g. query expansion), etc.).

Last winter Yahoo! unveiled its 'Search Assist' facility (see screenshot above - search for 'united nations'), which provides real-time query formulation assistance to the user. Providing these facilities in systems based on metadata has always been possible owing to the use of controlled vocabularies for indexing, the use of name authority files, and even content standards such as AACR2; but providing a similar level of functionality with unstructured information is difficult – yet Yahoo! provides something ... and it can be useful and can actually help resolve the ASK conundrum!
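A toy sketch of the general idea, entirely my own and massively simplified (a real engine would mine query logs, vocabularies and authority files rather than a hand-made list):

    # Prefix-based query suggestion: as the user types, offer completions so
    # they need not articulate the whole information need themselves.
    TERMS = [
        "united nations",
        "united nations security council",
        "united kingdom",
        "unified modelling language",
    ]

    def suggest(prefix, limit=5):
        p = prefix.lower().strip()
        return [t for t in TERMS if t.startswith(p)][:limit]

    print(suggest("united n"))  # -> ['united nations', 'united nations security council']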


Similarly, meta-search engine Clusty has provided its 'clustering' techniques for quite some time. These clusters group related concepts and are designed to aid in query formulation, but also to provide some level of relevance feedback to users (see screenshot above - search for 'George Macgregor'). Of course, these clusters can be a bit hit or miss but, again, they can improve retrieval and aid the user in query formulation. Similar developments can also be found in Ask. View this canned search, for example. What help does Google provide?

The bottom line is that some search engines are innovating endlessly and putting the fruits of a sexy research area to good use. These search engines are actually moving search forward. Can the same still be said of Google?

Tuesday, 3 June 2008

Harvesting and distributing semantic data: FOAF and Firefox

One of the most attractive aspects of Mozilla Firefox continues to be the incessant supply of useful (and not-so-useful) extensions. The supply is virtually endless, and long may it continue! (What would I do without Zotero, or Firebug?!) We have seen the emergence of some useful metadata tools, such as Dublin Core metadata viewers, and more recently, Semantic Web widgets. Operator is an interesting example of the latter, harnessing microformats and eRDF in a way that allows users to interact easily with some simple semantic data (e.g. via toolbar operations). Another interesting Firefox extension is Semantic Radar.

Semantic Radar is a "semantic metadata detector" for Firefox and is a product of a wider research project based at DERI, National University of Ireland. Semantic Radar can identify the presence of FOAF, DOAP and SIOC data in a webpage (as well as more generic RDF and RDFa data) and will display the relevant icon(s) in the Firefox status bar to alert the user when such data is found. (See the screenshot for a FOAF example.) Clicking the icon(s) enables users to browse the data using an online semantic browser (e.g. FOAF Explorer, SIOC Browser). In short, Semantic Radar parses web pages for RDF auto-discovery links to discover semantic data. A neat tool to have...
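The auto-discovery part is simple enough to sketch. This is my own illustration rather than Semantic Radar's code, and the URL is a placeholder:

    # Fetch a page and collect <link> elements that advertise RDF documents
    # (e.g. a FOAF file) via type="application/rdf+xml".
    import urllib.request
    from html.parser import HTMLParser

    class RDFLinkFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "link" and a.get("type") == "application/rdf+xml":
                self.links.append(a.get("href"))

    def rdf_links(url):
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        finder = RDFLinkFinder()
        finder.feed(html)
        return finder.links

    print(rdf_links("http://example.org/"))   # placeholder URL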


Semantic Radar has been available for some time, but a recent update to the widget means that it is now possible to automatically 'ping' the Ping The Semantic Web website. Ping The Semantic Web (PTSW) is an online service harvesting, storing and distributing RDF documents. If one of those documents is updated, its author can notify PTSW to that effect by pinging it with the URL of the document. This is an amazingly efficient way to disseminate semantic data. It is also an amazingly effective way for crawlers and other software agents to discover and index RDF data on the web. (URLs indexed by PTSW can be re-used by other web services and can also be viewed on the PTSW website - see the screenshot of PTSW for my FOAF file.)


It might just be me, but I’m finally getting the sense that the volume of structured data available on the web - and the tools necessary to harness it - are beginning to reach some sort of critical mass. The recent spate of blogs echoing this topic bears testament to that (e.g. [1], [2]). Dare I use such a term, but Web 3.0 definitely feels closer than ever.

Thursday, 13 December 2007

DeweyBrowser: beta forever!

The beta 1 version of the OCLC Research DeweyBrowser has now been superseded by beta 2. We are, after all, in the age of Web 2.0, where the mantra is 'beta forever'!

The beta 2 DeweyBrowser has some nice features, sporting improved functionality and a new interface (the latter being reminiscent of the similarly slick OCLC Open WorldCat interface - also in beta). Users can search for a topic or DDC number, or drill down by clicking through the Dewey captions which are represented as Dewey clouds. New features include the ability to filter search results by format, language, and OCLC Audience Level. Users can also search within result sets, view search histories, and peruse larger Dewey clouds. Of course, the best thing about DeweyBrowser remains the fact that it provides access to – and interlinks with - one of the biggest databases in the world; a union catalogue of over 1 billion hybrid resources (i.e. the OCLC WorldCat database).

We know the story. DDC is the most widely used classification system in the world, built on sound principles that make it handy as a general knowledge organisation tool. It has expressive notation, which makes it conducive to deployment on the web for improved information retrieval (for example, see HILT or OCLC Terminology Services), as well as well-defined(ish) classes and maturely developed hierarchies for powerful retrieval within other information environments. It is good to see OCLC so active in harnessing all this structured data for doing some good. Take a look at OCLC's FRBR-inspired FictionFinder, for example, or the new-ish Open WorldCat. It's about putting all this structured information the LIS community has accrued to good use – and it's about time too!

Monday, 19 November 2007

Stop the press: Google is grim!

I always enjoy being kept abreast of scientific developments by tuning into Leading Edge on BBC Radio 4 on Thursday evenings. A riveting piece of radio journalism! Last Thursday (15th November 2007) we had reports on a team from the US that has created cloned embryos from an adult primate and an invigorating debate on the deployment of brain enhancement drugs. We also had researchers that have demonstrated how robotic cockroaches can influence the behaviour of real ones. However, Leading Edge often gives us snippets of news that impinge directly on what we do within the Information Strategy Group – and last Thursday was no exception.

Technology guru Bill Thompson explained why he believes Google is corrupting us all. This is a refreshing viewpoint from a technology commentator and not one we are accustomed to hearing (except from librarians, information scientists and some computer scientists!). Such commentators normally fail to observe the limitations of any information retrieval tool and drone on about how 'cool' it is. Not Thompson. "We have all fallen for the illusion" because it is "easy" and "simple", says Thompson. "Google and the other search engines have encouraged us to forget the fundamental difference between search and research".

Google is indeed easy and simple. To make his point Thompson (unwittingly?) revisits well-worn LIS arguments emanating from the metadata, cataloguing, and indexing areas. Some of these emerged in the 1970s when the efficacy of automatic indexing began to improve. These arguments cite issues of reconciling the terms used to describe concepts, issues of collocation (e.g. 'Java' the coffee, 'Java' the programming language, 'Java' the primary landmass of Indonesia, etc.), and the difficulty of differentiating between information about Tony Blair and information by Tony Blair. Thompson almost sounds surprised when he vents spleen over Google's inability to make sense of the word 'Quark'. Welcome to web search engines, Bill!

The most astonishing part of Thompson's report was not his LIS-tinged rant about Google, but his suggestion that librarians had themselves fallen for the Google illusion, along with the academics and school children. Pardon? What could have given him this impression??? Was it an ill-judged, off-hand comment?

The body of research and development exploring the use of metadata on the web is gigantic and is largely driven by the LIS community. The 'deep web' is almost synonymous with digital library content or database content under the custodianship of information professionals. Those in content management or information architecture will also be singing from the LIS hymn sheet. The Semantic Web is another development that seeks to resolve Thompson's 'Quark conundrum'. Even at front line services, librarians are rolling out information literacy sessions as a means of communicating the importance of using deep web tools, but also of making users aware of Google's limitations (e.g. Google only indexes a small portion of the web, problems of information authoritativeness, etc., etc.).

That is not to say that the profession doesn't flirt with Google; of course it does! It flirts with Google because Google provides a vehicle through which to introduce users to valuable information (often from the deep web). And such flirting does not immediately jettison well formed library and information management theories or principles (see an ex-colleague, for example [1], [2], [3] and [4]).

Of course, I could go on for a lot longer, but there doesn’t seem to be any point as you already know the arguments. But you can listen to Bill Thompson’s report on Leading Edge to hear the arguments of yore restyled by a technology guru. You may also feel compelled to contact Leading Edge to vent your spleen!

Thursday, 1 November 2007

MTSR 2007

A paper some ex-colleagues and I submitted to the International Conference on Metadata and Semantics Research has now been published as part of the conference proceedings. The paper entitled, 'Terminology server for improved resource discovery: analysis of functions and model', is available online for those that might be interested.

The conference took place at the Ionian University in sunny Corfu; however, owing to work pressures (since moving to LJMU) I was unable to present the paper in person and take advantage of warmer climes! Enjoy!