Tuesday 26 January 2010

Renaissance of thesaurus-enhanced information retrieval?

As my students in BSNIM3034 Content Management have been learning recently, semantics play a huge role in the level of recall and precision that an information retrieval system can achieve. Put simply, computers are great at interpreting syntax but are far too dumb to understand semantics or the intricacies of human language. This has historically been – and currently remains – the trump card of metadata proponents, and it is something the Semantic Web is attempting to resolve with its use of structured data too. The creation of metadata involves a human: a cataloguer who performs a 'conceptual analysis' of the information object in question to determine its 'aboutness'. They then translate this into the concepts prescribed in a controlled vocabulary or encoding scheme (e.g. taxonomy, thesaurus, etc.) and create other forms of descriptive and administrative metadata. All this improves recall and precision (i.e. conceptually similar items are retrieved and conceptually dissimilar items are suppressed).
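
For readers who want the two measures pinned down, here is a minimal sketch of how precision and recall are usually calculated for a single query. The document IDs and the relevance judgements are invented purely for illustration; in practice the 'relevant' set comes from human assessment.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of retrieved items that are relevant
    recall    = share of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, invented for this example.
relevant = {"d1", "d2", "d3", "d4"}          # what a human judged relevant
retrieved = {"d1", "d2", "d7", "d8", "d9"}   # what the system returned

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the five retrieved items are relevant (precision 0.4) and two of the four relevant items were found (recall 0.5).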

As good as they are these days, retrieval systems based on automatic indexing (i.e. most web search engines, including Google, Bing, Yahoo!, etc.) suffer from the 'syntax problem'. They provide what appears to be high recall coupled with poor precision. This is the nature of the search engine beast. Conceptually similar items are ignored because such systems are unable to tell that 'Haematobia irritans' is a synonym of 'horn flies' or that 'java' is a term fraught with homonymy (e.g. 'java' the programming language, 'java' the island within the Indonesian archipelago and 'java' the coffee, and so forth). All of the aforementioned contributes to arguably the biggest problem for user searching: query formulation. Search engines suffer from the added lack of any structured browsing of, say, resource subjects, titles, etc. to assist users in query formulation.

This blog has discussed query formulation in the search process at various times (see this for example). The selection of search terms for query formulation remains one of the most difficult stages in users' information retrieval process. The huge body of research relating to information seeking behaviour, information retrieval, relevance feedback, and human-computer interaction attests to this. One of the techniques used to assist users in query formulation is thesaurus-assisted searching and/or query expansion. Such techniques are not particularly new and are often used successfully in search systems (see Ali Shiri's JASIST paper from 2006).
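
To make thesaurus-based query expansion concrete, here is a minimal sketch. It assumes a tiny hand-made thesaurus in which every entry term points to a preferred term, and it uses naive substring matching rather than a real index. The thesaurus entries, document texts and function names are all invented for illustration, and this is not a description of how any particular system does it.

```python
# Hypothetical mini-thesaurus: entry term -> preferred term.
# Real information retrieval thesauri also carry broader, narrower
# and related terms; only equivalence is modelled here.
USE = {
    "horn flies": "haematobia irritans",
    "horn fly": "haematobia irritans",
    "haematobia irritans": "haematobia irritans",
}

# Hypothetical document collection, invented for this example.
DOCS = {
    "a": "Control of Haematobia irritans in cattle",
    "b": "Horn flies and pasture management",
    "c": "Java programming basics",
}


def expand_query(query, use=USE):
    """Return every term the thesaurus treats as equivalent to the query."""
    term = query.strip().lower()
    preferred = use.get(term)
    if preferred is None:
        return {term}  # unknown term: search for it unchanged
    return {entry for entry, pref in use.items() if pref == preferred}


def search(query, documents=DOCS, use=USE):
    """Return IDs of documents containing any of the expanded terms."""
    terms = expand_query(query, use)
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    ]


print(search("horn flies"))
```

Without expansion, a search for 'horn flies' would only match document "b"; with it, the document that uses the scientific name is retrieved as well, which is the recall gain a thesaurus is supposed to deliver.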

Last week, however, Google announced adjustments to their search service. These adjustments are particularly significant because they are an attempt to control for synonyms. Their approach is based on 'contextual language analysis' rather than the use of information retrieval thesauri. The blog reads:
"Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts [...] Our synonyms system is the result of more than five years of research within our web search ranking team. We constantly monitor the quality of the system, but recently we made a special effort to analyze synonyms impact and quality."
Firstly, this is certainly positive news. Synonyms – as noted above – are a well-known phenomenon which has blighted the effectiveness of automatic indexing in retrieval. But on the negative side – and not to belittle Google's efforts, as they are dealing with unstructured data – Google are only dealing with single words: 'song lyrics' and 'song words', or 'homicide' and 'murder' (examples provided by Google in their blog posting). They are dealing with words in a Roget's Thesaurus sense, rather than compound terms in an information retrieval thesaurus sense – and it is the latter which will ultimately be more useful in improving recall and precision. This is, after all, why information retrieval thesauri have historically been used in searching.
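
The difference between treating a query as single words and treating it as compound terms can be sketched roughly as follows. The idea is to scan the query for the longest phrase that appears in a controlled vocabulary before falling back to individual words. The vocabulary and the function name are made up for this example, and greedy longest-match is just one simple way of doing it.

```python
# Hypothetical controlled vocabulary of compound terms.
COMPOUND_TERMS = {"song lyrics", "information retrieval", "horn flies"}


def segment(query, vocabulary=COMPOUND_TERMS):
    """Split a query into units, preferring the longest vocabulary term.

    Words that are not part of any compound term are kept as single words.
    """
    words = query.lower().split()
    longest = max((len(term.split()) for term in vocabulary), default=1)
    units, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, shrinking to one word.
        for n in range(min(longest, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in vocabulary:
                units.append(candidate)
                i += n
                break
    return units


print(segment("song lyrics about horn flies"))
```

Each compound unit can then be looked up in the thesaurus as a single concept, rather than having its component words substituted one at a time.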

More interesting will be Google's exploration of homonymous terms. Homonyms are more complex than synonyms and are, perhaps for the foreseeable future, an intractable problem?

Friday 15 January 2010

Death of the book salesman...

Shopping in Glasgow prior to Christmas was a sad time. Borders, which occupied what is reputed to be the most expensive retail space in Glasgow (the old Royal Bank of Scotland building), announced that it was in administration and was flogging all stock in a gargantuan clearance sale. Borders had become an institution since it opened on Buchanan Street in 1997 (I think) and I'm sure branches in other cities were similarly iconic and located at city centre hot-spots, the London Oxford Street branch being another prime example. It was a great place to meet friends before heading out for dinner or drinks; perusing the amazing magazine or newspaper selection, or browsing the books or music. Of course, I stopped buying books there years ago because the genre classification they used made it impossible to find anything; but it nevertheless occupied a special place in my heart...

The official line is that Borders fell victim to the current economic climate, although it was a complicated concatenation of economic circumstances, including aggressive competition from online retailers and particularly supermarkets (those supermarkets again – a £3 copy of the latest Jordan autobiography anyone?), a sales downturn and, finally, a lack of credit from suppliers. Waterstone's remains the only national bookseller but today was responsible for a decline in the share price of HMV as their Chief Executive tries to administer 'bookshop CPR' (i.e. let's make our stores more cosy). Can we expect the closure of it too in the foreseeable future? That would be extremely depressing...

Of course it's all depressing news; but one can't help thinking that the demise of super-selling bookshops was a quagmire of their own making. The Net Book Agreement (NBA) – the (almost) 100-year-long price fixing of books which collapsed in 1995 – was precipitated by Waterstone's in the first place. And it was precisely the collapse of the NBA which enticed Borders to the UK and enabled Amazon to establish UK operations. Neither retailer would have been able to operate with the NBA still in operation (remember the big Amazon book discounts in the late 1990s?). I suppose none of the book retailers anticipated the level of competitive aggression they had unleashed, particularly from supermarkets. Although I think some naivety played a part...

Many years ago I recall enjoying a talk delivered by the Deputy Chairman of John Smith & Son, Willie Anderson. Despite being the oldest bookseller in the English speaking world, John Smith moved off the high street many years ago. You are now most likely to encounter them as the university campus bookseller. During his talk Willie made an interesting point about the lack of business sense in the book selling industry; that the NBA had made all book sellers blind to conventional business practice or simple economic principles such as the laws of supply and demand. Said Willie (as best as I can remember! It was well over 10 years ago!):
"Harry Potter and the Chamber of Secrets was anticipated to be a best seller and an extremely popular title. We [John Smith & Son] had large pre-orders from customers. Yet, virtually all other booksellers were slashing prices and offering ridiculous pre-order discounts on an item which commanded a high price. At John Smith we didn't offer any discounts and we sold every copy at full price, precisely because demand was high. This is normal business practice, but most book retailers appear to be oblivious to this. Book retailers have a lot to learn about competition because they have been protected from it for so long. The industry needs to learn quickly otherwise it will suffer economic difficulties in the future".
What happens now then? There is certainly money to be made in book selling, particularly with the decline of 'good' stockists. There are more books bought now than at any time in history. Perhaps the time is ripe for a renaissance in the classic independent bookshop, of which Reid of Liverpool is archetypal? Supermarkets do not – I think – occupy the same business space as such book sellers, thus allowing the independent retailer to thrive. There wouldn't be any Costa or Starbucks, nor would it occupy a prime retail site, but I think we'd be all the better for it.

(Photo: Laura-Elizabeth, Flickr, Creative Commons)