Tuesday 26 January 2010

Renaissance of thesaurus-enhanced information retrieval?

As my students in BSNIM3034 Content Management have been learning recently, semantics play a huge role in the level of recall and precision an information retrieval system can achieve. Put simply, computers are great at interpreting syntax but are far too dumb to understand semantics or the intricacies of human language. This has historically been – and currently remains – the trump card of metadata proponents, and it is something the Semantic Web is attempting to resolve with its use of structured data too. The creation of metadata involves a human: a cataloguer who performs a 'conceptual analysis' of the information object in question to determine its 'aboutness'. They then translate this into the concepts prescribed in a controlled vocabulary or encoding scheme (e.g. taxonomy, thesaurus, etc.) and create other forms of descriptive and administrative metadata. All this improves recall and precision (i.e. conceptually similar items are retrieved and conceptually dissimilar items are suppressed).
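For readers who want the two measures pinned down, here is a minimal sketch in Python. The function name and the document identifiers are purely illustrative assumptions; the definitions themselves are the standard ones (precision is the share of retrieved items that are relevant, recall is the share of relevant items that are retrieved).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the items the system returned
    relevant:  identifiers of the items a human judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four items retrieved, three relevant items exist,
# and two of the retrieved items are among the relevant ones.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case half of what was retrieved is relevant, and two of the three relevant items were found.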

As good as they are these days, retrieval systems based on automatic indexing (i.e. most web search engines, including Google, Bing, Yahoo!, etc.) suffer from the 'syntax problem'. They provide what appears to be high recall coupled with poor precision. This is the nature of the search engine beast. Conceptually similar items are ignored because such systems are unable to tell that 'Haematobia irritans' is a synonym of 'horn flies', and conceptually dissimilar items are retrieved because 'java' is a term fraught with homonymy (e.g. 'java' the programming language, 'java' the island within the Indonesian archipelago, 'java' the coffee, and so forth). All of this contributes to arguably the biggest problem for user searching: query formulation. Search engines also lack any structured browsing of, say, resource subjects, titles, etc. to assist users in query formulation.

This blog has discussed query formulation in the search process at various times (see this for example). The selection of search terms for query formulation remains one of the most difficult stages in users' information retrieval process. The huge body of research relating to information seeking behaviour, information retrieval, relevance feedback, and human-computer interaction attests to this. One of the techniques used to assist users in query formulation is thesaurus-assisted searching and/or query expansion. Such techniques are not particularly new and are often used successfully in search systems (see Ali Shiri's JASIST paper from 2006).
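To make the idea of thesaurus-based query expansion concrete, here is a toy sketch in Python. It is not how any particular system implements it: the thesaurus, the documents and the function names are all hypothetical, and matching is a naive substring test. The point is only that a query term is looked up in a set of equivalent terms and the search is run on all of them.

```python
# Hypothetical toy thesaurus: each set groups terms treated as equivalent
# (preferred term plus its non-preferred synonyms).
EQUIVALENCE_SETS = [
    {"horn flies", "haematobia irritans"},
]

# Hypothetical document collection (identifier -> title text).
DOCUMENTS = {
    "doc1": "Control of horn flies on pastured cattle",
    "doc2": "Haematobia irritans: life cycle and distribution",
    "doc3": "French horn repertoire for beginners",
}


def expand_query(query, equivalence_sets=EQUIVALENCE_SETS):
    """Return the query term plus any equivalent terms from the thesaurus."""
    term = query.strip().lower()
    expanded = {term}
    for group in equivalence_sets:
        if term in group:
            expanded |= group
    return sorted(expanded)


def search(query, documents, equivalence_sets=EQUIVALENCE_SETS):
    """Naive substring search over the documents using the expanded query."""
    terms = expand_query(query, equivalence_sets)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    )


print(search("horn flies", DOCUMENTS, equivalence_sets=[]))  # no thesaurus
print(search("horn flies", DOCUMENTS))                       # with expansion
```

Without the thesaurus only the document that literally contains 'horn flies' is found; with it, the document indexed under 'Haematobia irritans' is retrieved as well, while the unrelated 'French horn' document stays out because the compound term is matched as a whole.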

Last week, however, Google announced an adjustment to their search service. It is particularly significant because it is an attempt to control for synonyms. Their approach is based on 'contextual language analysis' rather than the use of information retrieval thesauri. Their blog post reads:
"Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts [...] Our synonyms system is the result of more than five years of research within our web search ranking team. We constantly monitor the quality of the system, but recently we made a special effort to analyze synonyms impact and quality."
Firstly, this is certainly positive news. Synonyms – as noted above – are a well-known phenomenon which has blighted the effectiveness of automatic indexing in retrieval. But on the negative side – and not to belittle Google's efforts, as they are dealing with unstructured data – Google are only dealing with single words: 'song lyrics' and 'song words', or 'homicide' and 'murder' (examples taken from Google's blog post). They are dealing with words in a Roget's Thesaurus sense, rather than compound terms in an information retrieval thesaurus sense – and it is the latter which will ultimately be more useful in improving recall and precision. This is, after all, why information retrieval thesauri have historically been used in searching.

More interesting will be Google's exploration of homonymous terms. Homonyms are more complex than synonyms and may, for the foreseeable future, remain an intractable problem.
