Showing posts with label search engines.

Tuesday, 2 November 2010

Crowd-sourcing faceted information retrieval

This blog has witnessed the demise of several search engines, all of which attempted to challenge the supremacy of the big innovators - and I would tend to include Yahoo! and Bing in that group before the obvious market leader. Yesterday it was the turn of Blekko to become the next Cuil. Or is it?

Blekko presents a fresh attempt to move web search forward, taking a style of retrieval which has hitherto only been successful in systems based on pre-coordinated indexes and combining it with crowd-sourcing techniques. Interestingly, Rich Skrenta - co-founder of Blekko - was also a principal founder of the Dmoz project. Remember Dmoz? When I worked on BUBL years and years ago, I recall considering Dmoz to be an inferior beast. But it remains alive and kicking – still popular and still relevant to modern web developments, with weekly RDF dumps of its rich, categorised, crowd-sourced content made available for Linked Data purposes. BUBL, on the other hand, has been static for years.

Flirting with taxonomical organisation, categorisation and crowd-sourcing at Dmoz has obviously influenced the Blekko approach to search. Blekko innovates in retrieval by enabling users to define their very own vertical search indexes using so-called 'slashtags', thus (essentially) providing a quasi form of faceted search. The advantage of this approach is that using a particular slashtag (or facet, if you prefer) in a query increases precision by removing 'irrelevant' results associated with different meanings of the search query terms. Sounds good, eh? Ranganathan would be salivating at such functionality in automatic indexing! To provide some form of critical mass, Blekko has provided hundreds of slashtags that can be used straight away; but the future of slashtags depends on users creating their own, which will be screened by Blekko before being added to the publicly available slashtags list. Blekko users can also assist in weeding out poor results and any erroneous slashtag results (see the video below), thus contributing to the improved precision Blekko purports to offer and maintaining slashtag efficacy. In fact, Skrenta proposes that the Blekko approach will improve precision in the longer term. Says Skrenta on the BBC dot.Maggie blog:
"The only way to fix this [precision problem] is to bring back large-scale human curation to search combined with strong algorithms. You have to put people into the mix […] Crowdsourcing is the only way we will be able to allow search to scale to the ever-growing web".
Let's look at a typical Blekko query. I am interested in the new Microsoft Windows mobile OS, and in bona fide reviews of the new OS. Moreover, since I am tech savvy and will have read many reviews, I am only interested in reviews published recently (i.e. within the past two weeks, or so). In Blekko we can search like so…

"windows mobile 7" /tech-reviews /date

…where the /tech-reviews slashtag limits results to genuine reviews published in the technology press and/or associated websites, and the /date slashtag orders the results by date. It works, and works spectacularly well. Skrenta sticks two fingers up at his competitors when, in the Blekko promotional video, he quips, "Try doing this [type of] search anywhere else!" Blekko provides 'Five use cases where slashtags shine' which - although only using one slashtag - illustrate how the approach can be used in a variety of different queries. Of course, Blekko can still be used like a conventional search engine, i.e. enter a query and get results ranked according to the Blekko algorithm. And on this count – using my own personal 'search engine test queries' – Blekko appears to rank relevant results sensibly and to index pages which other search engines either ignore or, if they do index them, drown in spam results ranked as more relevant.
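
For the programmatically minded, the mechanics are easy to caricature: a slashtag is, in effect, a curated vertical index intersected with the ordinary result set. A toy Python sketch (the host lists, field names and function are my own illustration, not Blekko's implementation):

# Illustrative only: a slashtag modelled as a curated set of hosts which
# filters an ordinary result set (dicts with "url" and "date" keys) down
# to a vertical.
from urllib.parse import urlparse

SLASHTAGS = {
    "tech-reviews": {"arstechnica.com", "engadget.com", "theregister.co.uk"},
}

def slashtag_search(results, slashtag, by_date=False):
    """Keep results whose host appears in the slashtag's curated list;
    /date-style ordering then sorts the survivors newest first."""
    hosts = SLASHTAGS[slashtag]
    filtered = [r for r in results if urlparse(r["url"]).netloc in hosts]
    if by_date:
        filtered.sort(key=lambda r: r["date"], reverse=True)
    return filtered

The crowd-sourcing comes in because users themselves propose, extend and weed those curated lists.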

There is a lot to admire about Blekko. Aside from an innovative approach to information retrieval, there is also a commitment to algorithm openness and transparency which SEO people will be pleased about. But I worry that while a Blekko slashtag search is innovative and useful, most users will approach Blekko as just another search engine rather than buying into the importance of slashtags and, in doing so, will not hang around long enough to 'get it' (even though I intend to...). Indeed, to some extent Blekko has more in common with command line searching of online databases in the days of yore. There are also some teething troubles which rigorous testing can reveal. But there are reasons to be hopeful. Blekko is presumably hoping to promote slashtag popularity and have users following slashtags just as they follow Twitter feeds, thus driving website traffic and, presumably, advertising. Owning a popular slashtag could then be not just useful but highly profitable, even if Blekko remains small.


blekko: how to slash the web from blekko on Vimeo.

Tuesday, 26 January 2010

Renaissance of thesaurus-enhanced information retrieval?

As my students in BSNIM3034 Content Management have been learning recently, semantics play a huge role in the level of recall and precision an information retrieval system can achieve. Put simply, computers are great at interpreting syntax but are far too dumb to understand semantics or the intricacies of human language. This has historically been – and currently remains – the trump card of metadata proponents, and it is something the Semantic Web is attempting to resolve with its use of structured data too. The creation of metadata involves a human: a cataloguer who performs a 'conceptual analysis' of the information object in question to determine its 'aboutness'. They then translate this into the concepts prescribed in a controlled vocabulary or encoding scheme (e.g. taxonomy, thesaurus, etc.) and create other forms of descriptive and administrative metadata. All this improves recall and precision (e.g. conceptually similar items are retrieved and conceptually dissimilar items are suppressed).

As good as they are these days, retrieval systems based on automatic indexing (i.e. most web search engines, including Google, Bing, Yahoo!, etc.) suffer from the 'syntax problem'. They provide what appears to be high recall coupled with poor precision. This is the nature of the search engine beast. Conceptually similar items are ignored because such systems are unable to tell that 'Haematobia irritans' is a synonym of 'horn flies', or that 'java' is a term fraught with homonymy (e.g. 'java' the programming language, 'java' the island within the Indonesian archipelago, 'java' the coffee, and so forth). All of this contributes to arguably the biggest problem in user searching: query formulation. Search engines suffer from the added lack of any structured browsing of, say, resource subjects, titles, etc. to assist users in query formulation.

This blog has discussed query formulation in the search process at various times (see this for example). The selection of search terms for query formulation remains one of the most difficult stages in users' information retrieval process. The huge body of research relating to information seeking behaviour, information retrieval, relevance feedback, and human-computer interaction attests to this. One of the techniques used to assist users in query formulation is thesaurus assisted searching and/or query expansion. Such techniques are not particularly new and are often used in search systems successfully (see Ali Shiri's JASIST paper from 2006).
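
To make the idea concrete, thesaurus-assisted query expansion typically rewrites the user's query as a disjunction of equivalent controlled terms, so that conceptually similar items indexed under a different label are still recalled. A toy Python sketch (the thesaurus entries are invented for illustration, reusing the horn flies example above):

# Illustrative sketch of thesaurus-based query expansion: each term maps
# to its equivalence set of preferred/non-preferred terms.
THESAURUS = {
    "horn flies": {"haematobia irritans"},
    "haematobia irritans": {"horn flies"},
}

def expand_query(query):
    """Expand a query into a disjunction of equivalent controlled terms."""
    terms = {query.lower()} | THESAURUS.get(query.lower(), set())
    return " OR ".join(sorted(terms))

# expand_query("horn flies") -> 'haematobia irritans OR horn flies'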

Last week, however, Google announced an adjustment to their search service. This adjustment is particularly significant because it is an attempt to control for synonyms. Their approach is based on 'contextual language analysis' rather than the use of information retrieval thesauri. The blog reads:
"Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts [...] Our synonyms system is the result of more than five years of research within our web search ranking team. We constantly monitor the quality of the system, but recently we made a special effort to analyze synonyms impact and quality."
Firstly, this is certainly positive news. Synonyms – as noted above – are a well-known phenomenon which has blighted the effectiveness of automatic indexing in retrieval. But on the negative side – and not to belittle Google's efforts, as they are dealing with unstructured data – Google are only dealing with single words and simple phrases: 'song lyrics' and 'song words', or 'homicide' and 'murder' (examples provided by Google in their blog posting). They are dealing with words in a Roget's Thesaurus sense, rather than compound terms in an information retrieval thesaurus sense – and it is the latter which will ultimately be more useful in improving recall and precision. This is, after all, why information retrieval thesauri have historically been used in searching.

More interesting will be Google's exploration of homonymous terms. Homonyms are more complex than synonyms; will they remain, for the foreseeable future, an intractable problem?

Friday, 13 November 2009

An interesting article about search engines...again...

Victor Keegan (Guardian Technology journalist) published an interesting column yesterday on the current state of search engines. The column, entitled "Why I am searching beyond Google", picks up on something that has been discussed a lot on this blog: the fact that Google really isn't that good any more. There are dozens of search engines out there which offer the user greater functionality and/or search data which Google ignores or can't be bothered indexing. Yahoo! and Bing are mentioned by Keegan, but leapfish, monitter and duckduckgo are also discussed.

Keegan also comments on the destructive monopoly that Google has within search:
"If you were to do a blind tasting of Google with Yahoo, Bing or others, you would be pushed to tell them apart. Google's power is no longer as a good search engine but as a brand and an increasingly pervasive one. Google hasn't been my default search for ages but I am irresistibly drawn to it because it is embedded on virtually every page I go to and, as a big user of other Google services (documents, videos, Reader, maps), I don't navigate to Google search, it navigates to me."
This is where Google's dominance is starting to become a problem. Competition is no longer fair. There are now several major search engines which are, in many ways, better than Google; yet, this is not reflected in their market share, partly because the search market is now so skewed in Google's favour. As Keegan notes, Google comes to him, not the other way round.

In a concurrent development, WolframAlpha is to be incorporated into Bing to augment Bing's results in areas such as nutrition, health and mathematics. Will we see Google incorporate structured data from Google Squared into their universal search soon?

I realise that this is yet another blog posting about either a) Google or b) search engines. I promise this is the last for at least, erm, 2 months. In my defence, I am simply highlighting an interesting article rather than making a bona fide blog posting!

Thursday, 8 October 2009

AJAX content made discoverable...soon

I follow the Official Google Webmaster Central Blog. It can be an interesting read at times, but on other occasions it provides humdrum information on how best to optimise a website, or answers questions to which most of us already know the answers (e.g. recently we had 'Does page metadata influence Google page rankings?'). However, the latest posting is one of the exceptions. Google have just announced that they are proposing a new standard to make AJAX-based websites indexable and, by extension, discoverable to users. Good-oh!

The advent of Web 2.0 has brought about a huge increase in interactive websites and dynamic page content, much of which has been delivered using AJAX ('Asynchronous JavaScript and XML', not a popular household cleaner!). AJAX is great and furnished me with my iGoogle page years ago; but increasingly websites use it to deliver page content which might otherwise be delivered using static web pages in XHTML. This presents a big problem for search engines because AJAX is currently un-indexable (if this is a word!) and a lot of content is therefore invisible to all search engines. Indeed, the latest web design mantra has been "don't publish in AJAX if you want your website to be visible". (There are also accessibility and usability issues, but these are an aside for this posting...)

The Webmaster Blog summarises:
"While AJAX-based websites are popular with users, search engines traditionally are not able to access any of the content on them. The last time we checked, almost 70% of the websites we know about use JavaScript in some form or another. Of course, most of that JavaScript is not AJAX, but the better that search engines could crawl and index AJAX, the more that developers could add richer features to their websites and still show up in search engines."
Google's proposal shifts the burden of making an AJAX website indexable onto its administrator/webmaster, who would set up a headless browser on the web server. (A headless browser is essentially a browser without a user interface: a piece of software that can access web documents but does not render them to human users.) The headless browser would then programmatically execute the AJAX website on the server and provide an HTML 'snapshot' to search engines when they request it - which is a clever idea. The crux of Google's proposal is a small set of URL conventions. These tell the search engine when to request the headless browser's HTML snapshot and which URL to reveal to human users.
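
The mechanics, as the proposal sketches them, hinge on a token in the URL fragment: an AJAX state addressed as #!state is requested by the crawler in an 'escaped' form, which the server answers with the headless browser's HTML snapshot. Roughly, in Python (the function is my own illustration, and percent-encoding of the state value is glossed over):

# Illustrative mapping from a crawlable AJAX URL (#! fragment) to the
# URL a crawler would fetch (_escaped_fragment_), per Google's proposal.
def crawler_url(ajax_url):
    """Rewrite http://example.com/page#!state into its crawler-fetchable form."""
    base, sep, state = ajax_url.partition("#!")
    if not sep:
        return ajax_url  # no AJAX state token; nothing to rewrite
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + state

# crawler_url("http://example.com/recipes#!lasagne")
#   -> 'http://example.com/recipes?_escaped_fragment_=lasagne'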

It's good that Google are taking the initiative; my only concern is that they start trying to re-write standards, as they have a little with RDFa. Their slides are below - enjoy!

Wednesday, 23 September 2009

Yahoo! is alive and kicking!

In a recent posting I discussed the partnership between Yahoo! and Microsoft and wondered whether this might bring an end to Yahoo!'s innovative information retrieval work. Yesterday the Official Yahoo! Search Blog announced big changes to Yahoo! Search. Many of these changes have been discussed in previous postings here (e.g. Search Assist, Search Pad, SearchMonkey, etc.); however, Yahoo! have updated their search and results interface to make better use of these tools. As they state:
"[These changes] deliver a dynamic, compelling, and integrated experience that better understands what you are looking for so you can get things done quickly on the Web."
To us it means better integration of user query formulation tools, better use of structured data on the Web (e.g. RDF data, metadata, etc.) to provide improved results and results browsing, and improved filtering tools, all of which is nicely explained in their grand tour. According to their blog, though, better integration of these innovations involved a serious overhaul of the Yahoo! Search technical architecture to make it run faster.
"Now, here's the best part: Rather than building this new experience on top of our existing front-end technology, our talented engineering and design teams rebuilt much of the foundational markup/CSS/JavaScript for the SRP design and core functionality completely from scratch. This allowed us to get rid of old cruft and take advantage of quite a few new techniques and best practices, reducing core page weight and render complexity in the process."
I sound like a sales officer for Yahoo!, but these improvements are really very good indeed and have to be experienced first hand. It's good to see that the intellectual capital of Yahoo! has not disappeared, and fingers crossed it never will. True, these updates were probably already in the pipeline months before the partnership with Microsoft; but it at least demonstrates to Microsoft why Yahoo! still has the upper hand in Web search.

Wednesday, 29 July 2009

Is it R.I.P. for Yahoo! as we know it?

And so Microsoft and Yahoo! finally agree terms of a partnership which will change the face of the web search market. Historically – and let's face it, this story has been ongoing since January 2008! – Microsoft always wanted to take over Yahoo!; but on reflection both parties probably felt that forging a partnership was most likely to give them success against the market leader. So this has to be good news, no?

Well, it'll do some good to have the dominance of Google properly challenged by the next two biggest fish - and Google will probably be concerned. But their partnership entails that Yahoo! Search be powered by Bing and, in return, Yahoo! will become the sales force for both companies' premium search advertising. We've noted recently that Bing is good and was an admirable adversary for Yahoo!, but will a Yahoo! front-end powered by a Bing back-end mean an end to some of Yahoo!'s excellent retrieval tools (often documented on this blog, see this, this, this and this, for example) and, more importantly, an end to their innovative research strategies to better harness the power of structured data on the web? Is the innovative SearchMonkey Open Search Platform to be jettisoned?

The precise details of the partnership are sketchy at the moment, but it would be tragic if this intellectual capital was to be lost or now neglected...

Tuesday, 7 July 2009

Welcome to my (Search) Pad

Search innovators at Yahoo! have today launched Search Pad. Search Pad integrates with the usual Yahoo! Search interface and allows users to take notes while conducting common information seeking tasks (e.g. researching a holiday, deciding whether to buy that new piece of gadgetry, etc.). Search Pad can track the websites users are visiting and is invoked when it considers the user to be conducting a research task. On the Yahoo! Search Blog today:
"Search Pad helps you track sites and make notes by intelligently detecting user research intent and automatically collecting sites the user visits. Search Pad turns on automatically when you're doing research, tracking sites to make document authoring a snap. You can then quickly edit and organize your notes with the Search Pad interface, which includes drag-and-drop functionality and auto-attributed pasting."
Nice. From the website and Yahoo! blog (and this video), Search Pad is in many ways reminiscent of Listas from Microsoft Live Labs (discussed on this blog before). It's possible to copy text and images, and to create lists for sharing with others, either via URL or via other services (e.g. Facebook, Twitter, Delicious). Search Pad also has an easy-to-use, menu-driven interface. Whilst it was useful in some circumstances, Listas lacked a worthy application; Search Pad builds on Listas-style functionality, incorporating an improved version of it within a traditional search interface to support something we often do while searching (i.e. taking notes about a search task).

The only problem is that I can't get it to work!! I have tried conducting a variety of 'obvious' research tasks which I anticipated Search Pad would recognise, but the Search Pad console hasn't appeared. Perhaps the 'intelligent detection' isn't as intelligent as promised? I'll keep trying, but please let me know if anyone has better luck. Still, it demonstrates the state of permanent innovation at Yahoo! Search.

Thursday, 11 June 2009

Bada Bing!

So much has been happening in the world of search engines since spring this year. This much can be evidenced from the postings on this blog. All the (best) search engines have been active in improving user tools, features, extra search functionality, etc. and there is a real sense that some serious competition is happening at the moment. It's all exciting stuff…

Last week Microsoft officially released its new Bing search engine. I've been using it, and it has found things Google hasn't been able to. The critics have been extremely impressed by Bing too, and some figures suggest that it is stealing market share and moving past Yahoo! into the number 2 spot. What about number 1?

The trouble is that it doesn't matter how good your search engine is, because it will always have difficulty interrupting users' habitual use of Google. Indeed, Google's own research has demonstrated that the mere presence of the Google logo atop a result set is a key determinant of whether a user is satisfied with their results or not. In effect, users can be shown results from Yahoo! branded as Google, and vice versa, but will always choose the results with the Google branding. Thus, users are generally unable to tell whether there is any real difference in the results (i.e. their precision, relevance, etc.) and are actually more influenced by the brand and their past experience. It's depressing, but a reality for the likes of Microsoft, Yahoo!, Ask, etc.

Francis Muir has the 'Microsoft mantra'. He predicts that in the long run Microsoft is always going to dominate Google – and I am starting to agree with him. Microsoft sits back, waits for things to unfold, and then develops something better than its previously dominant competitors. True, they were caught on the back foot with Web searching, but Bing is at least as good as Yahoo!, perhaps better, and it can only improve. Their contribution to cloud computing (SkyDrive) offers 25GB storage, integration with Office and email, etc. and is far better than anything else available. Google documents? Pah! Who are you going to share that with? And then you consider Microsoft's dominance in software, operating systems, programming frameworks, databases, etc. Integrating and interoperating with this stuff over the Web is a significant part of the Web's future. Google is unlikely to be part of this, and for once I'm pleased.

It is not Microsoft's intention to take on Google's dominance of the Web at the moment. But I reckon Bing is certainly part of the long term strategy. The Muir prophecy is one step closer methinks.

Wednesday, 27 May 2009

Image searching with Creative Commons

Student information literacy skills have been discussed on the blog before. In short, they are woeful. One area where students tend to have little understanding is intellectual property rights (IPR). The situation might be looking better for digital music, but in my experience it remains poor for other digital artefacts, particularly images. 'Twas only a few weeks ago, while I was in a lab with some undergraduate students for a web technologies module, that I discovered most of them were ripping images from the web for inclusion within their information gateways. While this can (in some circumstances) be tolerated within the confines of an educational institution, it remains copyright infringement owing to copying by 'reprographic means' - and this isn't behaviour we want to become habitual in our graduates. My brother (a graphic designer and new media guru) has spun me many a yarn about ex-colleagues who have been shown their P45 for engaging in IPR theft (e.g. reusing someone's basic design or photograph).

All of this is veering away from the original reason for this post though, which is to draw attention to some new image searching functionality on Yahoo! Image Search. Following on nicely from the Search Options post, the Yahoo! Search Blog has just announced the inclusion of some extra search filters for image result sets. Not only is it better than Google (and more accurate?), but it also includes a useful Creative Commons (CC) filter. Using an interface similar to Yahoo! Search Assist, Yahoo! Image Search allows users to tick a CC checkbox to filter images, with specific filters included for commercial reuse and/or remixing. This is particularly useful for embellishing those PowerPoint presentations, illustrating a blog, building an undergraduate information gateway - or avoiding that P45!

There appears to be a downside, unfortunately. When I saw the Yahoo! Search Blog announcement I thought (perhaps naively) that Yahoo! was starting to put into practice its commitment to metadata, Semantic Web specifications, and other structured data. Since I know my personal homepage is indexed by Yahoo! and uses XHTML+RDFa to notify intelligent agents that its page content falls under a Creative Commons Attribution 3.0 License, I thought I'd put Image Search to the test. Provided the CC (and FOAF) namespaces are referenced, the XHTML+RDFa required is simple. For example:

<p>Content on <a href="http://www.staff.ljmu.ac.uk/bsngmacg/" property="cc:attributionName" rel="cc:attributionURL">George Macgregor</a>'s website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution 3.0 License</a></p>

...and with specific CC reference to my foaf:depiction...

<span about="http://www.staff.ljmu.ac.uk/bsngmacg/" rel="foaf:depiction">
  <img src="img/georgedepiction.jpg" alt="Image of George Macgregor" rel="license" resource="http://creativecommons.org/licenses/by/3.0/" />
</span>

My filtered CC search was unsuccessful though. This disappointed me; but then I observed the following notice:
"Note: Only Flickr images are supported currently."
Flickr – which is a subsidiary of Yahoo! – has allowed users to conduct advanced searches of its publicly uploaded images for quite some time, including CC searching. And it would appear that Yahoo! has integrated Flickr searching functionality into Image Search, albeit with some nice tweaks. If I had read their blog in its entirety I would have realised this; I just clicked the link, such was my excitement about Yahoo! Image Search!

It's useful to have this functionality within a conventional searching tool, but it is disappointing that Image Search isn't using cleverer means of doing it (e.g. RDFa) and instead relies on the licensing preferences Flickr users set when they upload their images. Don't get me wrong, this is useful and most welcome, and it will save me time on occasion; but it would be exciting to crack CC image searching beyond the controlled Flickr environment. Hopefully the 'currently' in "Only Flickr images are supported currently" means my expectations will be met soon…

Monday, 18 May 2009

Light relief: Celtic fringes erased by WolframAlpha?!

With WolframAlpha launched on Friday, I spent much of my weekend trying to get a 'computable request' to compute. Not until Monday morning did a request compute – but its performance has been getting better ever since so hopefully we will all have more time to experiment with it over coming days and weeks...

Like me, Gwenda Mynott has been testing WolframAlpha, searching for topics that, a) you have a good knowledge of, and, b) WolframAlpha can easily compute. Places are good for this (e.g. countries, towns, cities, etc.), and Stephen Wolfram computes multiple locations to good effect in his demonstrations; however, Gwenda tried to 'compute' Wales and arrived at some bizarre results. Check them out. WolframAlpha doesn't retrieve data pertaining to the constituent nation of the United Kingdom of Great Britain and Northern Ireland (i.e. Wales as you or I would tend to know it!), but a small town in South Yorkshire by the name of Wales(?). The only other obvious option WolframAlpha provides is Wales (New York, USA), which is equally amiss.

Hmmmmm. If this is the result for Wales, what are the results for the rest of the UK? Well, that's equally controversial. England appears to be synonymous with the United Kingdom of Great Britain and Northern Ireland. Scotland is referred back by WolframAlpha to the Kingdom of Scotland, which ceased to exist after the Act of Union in 1707. Worse than that, Northern Ireland doesn't even exist! ("WolframAlpha isn't sure what to do with your input".) Cornish nationalists will also be dismayed to learn that Cornwall (Canada) is the only Cornwall that counts.

Is this a systematic attempt to erase the history, culture and memory of the Celtic fringes?! Of course not. The results might be strange, but from a knowledge engine point of view – and ontologically speaking – Wales, Scotland and Northern Ireland are subsumed by the larger geographical and political entity of the UK, so it's understandable that WolframAlpha computes the answer in this way. Still, the England/UK synonymy is a bit odd and must have been encoded by someone somewhere sometime!

Experiment away, folks - and I would encourage everyone to post their most bizarre / illogical data results as comments to this blog. A prize will go to the most outlandish!

Friday, 15 May 2009

Some more 'Search Options'...

I promised not to blog about Google any time soon for fear the blog becomes known as the unofficial Google blog. After some consideration I thought, 'pish posh!' Anyway, the post has a wider remit than just Google...honest!

The absence of retrieval aids for Google users (oh no, not again - I hear you cry!) has been discussed at great length on this blog before. To appreciate the extent of this deficiency we need only peruse some innovative rival search engines such as Ask (recently re-branded back to Ask Jeeves), Yahoo!, or Clusty. Google has been making changes though, and today the Official Google Blog announced some further enhancements to the universal Google search interface. Simply called 'Search Options', these tools let you "slice and dice" results, apply rudimentary filters, and generate alternative views of results. Search Options does a little more to help the user in query formulation (the area where I think Google is weakest), but also offers some useful functionality once you have your results.



Check out our usual canned search for 'communism in Russia'; click the 'Search options…' link in the top left-hand corner of the interface to reveal the Search Options tools.

Filters are available for videos, forums and reviews (the latter being fairly useful if you are shopping). Various publication time filters are also available. Nothing here is particularly mind blowing though.

Search Options gets a bit more interesting when the search display options are explored in a little detail. Firstly, it's possible to request details of related searches. These are displayed in a better page location than before and look similar to Yahoo! Search Assist. But it is now also possible to select the 'Wonder Wheel' which generates a visualisation of the related terms. I'm unsure how useful the Wonder Wheel really is, particularly as the true nature of the relationships between terms is impossible for Google to represent other than in syntactic terms; this is something the Semantic Web community is obviously trying to resolve.

Most interesting though is the 'Timeline' tool. This allows results to be displayed along, erm, yup, a timeline. The timeline is clickable, allowing the user to drill down into particular temporal zones and to view resources relating to that zone. I use the word 'interesting' because although the timeline is probably quite useful for historical research, its moment of introduction is the most interesting part. Indeed, the timeline functionality looks in part like Google bracing itself for the release of WolframAlpha, which is due any day now (or tonight?) – and I wouldn't be at all surprised if this announcement was an attempt to steal some of its thunder. This appears to have been combined with the demonstration of Google Squared at the Google Searchology conference a few days ago. No Google Squared prototypes appear to be available for us to experiment with, but TechCrunch got a sneak peek at Searchology (view the YouTube video below). Google Squared is, in essence, Google's answer to WolframAlpha.

For me the most interesting news to emerge alongside Search Options is Google's desire to make greater use of RDFa. RDFa is probably a little pedestrian for me, but it's better than nothing – and at least there is a clear intention of using some Semantic Web specifications. It's just a shame Yahoo! announced something similar but more radical almost 18 months ago.

Thursday, 30 April 2009

WolframAlpha and destructive hype

If you have been plugged into the search engine or technology news feeds over recent months you may have encountered the excitement surrounding WolframAlpha. WolframAlpha is a Web search tool to be launched in May which – apart from having a good name – is anticipated to be the next Google exterminator. Although it is being touted as a destroyer of Google, technology commentators indicate that WolframAlpha will inhabit an entirely different intellectual space on the Web.

WolframAlpha is described by its creators as a "computational knowledge engine" which, instead of retrieving resources using conventional automatic indexing methods, dynamically computes the answers to a wide variety of questions. The way in which it does this remains a mystery, but we do know that it models particular areas of knowledge. It then combines this with a vast repository of curated data harvested from disparate data sources, and some ingenious natural language processing algorithms, to represent knowledge. These knowledge representations can then be queried to answer real questions. Still sounds like an enigma; but it must work on some level given the hype around it. Mustn't it?!

The brainchild of Dr. Stephen Wolfram (purveyor of computer algebra), WolframAlpha has had information and computer scientists and technology commentators salivating for months. The trouble is that while the incessant hype continues, a growing number of people (me, but also some commentators) are increasingly cynical of its true capabilities; we want to see a demo, or some kind of prototype. Mindful that cynicism could be spreading, Wolfram unveiled his creation yesterday for the first time at the Harvard University Berkman Center for Internet & Society (via a sold-out Web cast - clip from YouTube below). This demonstration appears to have further stimulated the hype (judging by some headlines), but has simultaneously added to the cynicism. Hype and 'vapourware' exasperate people. And this is how the hype could actually be the death of WolframAlpha, rather than of Google.

In reality, it certainly sounds like WolframAlpha is not out to compete with Google; but it doesn't matter: this is how it is being described in the media, and WolframAlpha hasn't tried to dispel the myth. In his blog, Wolfram describes WolframAlpha as "a new paradigm for using computing and the Web". This immediately provides people with a Google yardstick and false expectations; most new users will not understand that WolframAlpha is an entirely different beast. But more importantly, it's setting WolframAlpha up for an almighty fall.

Remember Cuil? People also thought Cuil was going to change the face of searching, but it failed. It was hotly anticipated and hyped, arguably more than WolframAlpha. This hype did it no favours when it crashed on its launch day. It's only been 9 months since Cuil was officially launched, yet we never hear about it, nor do any of us use it. In part, this is because its indexes are so poor. My LJMU profile page was updated on 08 November 2008, almost 6 months ago; yet Cuil still returns this page as it was on 07 November 2008. This is extremely feeble when you consider that Microsoft Live Search refreshes its indexes every 20 days.

Along with the hype, this was Cuil's 'blind spot'. A blind spot is normally tolerated in the early days of an innovative Web tool, but inflated expectations breed intolerance. WolframAlpha is bound to have its own blind spot; what will it be, and will users be tolerant until it's fixed? Probably not. They therefore have to get it right on launch day.

The moral of this tale is simple. Hyperbole must end. It's destructive and in the long run it does nobody any favours.

Wednesday, 25 March 2009

ASK conundrum revisited...again!

I posted a blog about Google's eye tracking research last month. I'm loath to discuss Google again lest the ISG blog become known as the unofficial Google blog; however, the latest post on the Official Google Blog is worthy of some comment...

You might recall another post I made regarding search engine research and development, particularly in the area of information retrieval (IR) aids for users. In that posting I summarised Belkin's research and theories regarding the Anomalous State of Knowledge (ASK). Most of this and subsequent research has sought to introduce IR aids for the user so that they can better solve their ASK conundrum. This assistance varies but often takes the form of query expansion (in its various permutations), browsable subject trees to stimulate query formulation, relevance feedback, and so forth. Providing such tools in systems based on automatic indexing is difficult, but we noted that some search engines have introduced some effective retrieval aids, all designed to alleviate the ASK problem. For example, Yahoo! provides its Search Assist tool, Clusty provides related concept clusters, and Ask provides other similar tools. Their accuracy in IR varies widely, but overall they prove useful to users. Unfortunately, we also noted that Google provides few user aids comparable to those above, arguably relying more on its PageRank algorithm. Not any longer...

Today Google launched some interface functionality not dissimilar to Yahoo! Search Assist and Clusty. Their assistance provides some suggested related searches and some extra result summary text for particular results. Receiving this assistance depends on the nature of your query, so have a look at this canned search: 'communism in Russia'. This isn't bad and is better than nothing; but does it really measure up to the aids provided by competing search engines? Compare these results with the IR aids provided for the user by the systems we've discussed already.
Google's attempts appear quite pedestrian by comparison. Yahoo! and Clusty, for example, make their aids readily available so that the user can effect changes in their information seeking behaviour, but Google's tools are far less visible, less detailed, and offer far less functionality. Since a lot of research indicates that many users will not scroll below the 'golden triangle' (i.e. to the bottom of the first result set), it is entirely plausible that these 'related search' aids will go unnoticed by the disoriented information seeker.

It is good to see Google deploying user query aids and reacting to developments in other IR systems, but it appears that it will be some time before Google can be said to alleviate users' Anomalous State of Knowledge.

Friday, 6 February 2009

Information seeking behaviour at Google: eye-tracking research

Anne Aula and Kerry Rodden have just published a posting on the Official Google Blog summarising some eye-tracking research they have been conducting on Google's 'Universal Search'. Both are active in information seeking behaviour and human-computer interaction research at Google and are well published within the related literature (e.g. JASIST, IPM, SIGIR, CHI, etc.).

The motivation behind their research was to evaluate the effect that incorporating thumbnail images and video within a result set has on user information seeking behaviour. Previous information retrieval eye-tracking research indicates that users scan results in order, scanning down their results until they reach a (potentially) relevant result, or until they decide to refine their search query or abandon the search. Aula and Rodden were concerned that the inclusion of thumbnail images might disrupt the "well-established order of result evaluation". Some comparative evaluation was therefore the order of the day.
"We ran a series of eye-tracking studies where we compared how users scan the search results pages with and without thumbnail images. Our studies showed that the thumbnails did not strongly affect the order of scanning the results and seemed to make it easier for the participants to find the result they wanted."
A good finding for Google, of course; but most astonishing is the eye-tracking data itself. The speed with which users scanned result sets, and the number of points on the interface they scanned, were incredible. View the 'real time' clip below. A dot increasing in size denotes the length of time a user spent pausing at that specific point in the interface or result set. Some other interesting discoveries were made – the full posting is essential reading.

Tuesday, 7 October 2008

Search engines: solving the 'Anomalous State of Knowledge'

Information retrieval (IR) remains one of the most active areas of research within the information, computing and library science communities. It also remains one of the sexiest. The growth in information retrieval sex appeal has a clear correlation with the growth of the Web and the need for improvements in retrieval systems based on automatic indexing. No doubt the flurry of big name academics and Silicon Valley employees attending conferences such as SIGIR also adds glamour. Nevertheless, the allure of IR research has precipitated some of the best innovations in IR ever, as well as creating some of the most important search engines and business brands. Of course, asked to pick from a list their favourite search engine or brand, most would probably select Google.

The habitual use of Google by students (and by real people generally!) was discussed in a previous post and needn't be revisited here. Nevertheless, one of the most distressing aspects of Google (for me, at least!) is a recent malaise in its commitment to search. There have been some impressive innovations in a variety of search engines in a variety of areas. For example, Yahoo! is to better harness metadata and Semantic Web data on the Web. More interestingly though, some recent and impressive innovations in solving the 'ASK conundrum' are visible in a variety of search engines, but not in Google. Although Google always tells us that search is its bread and butter, is it spreading itself a little too thinly? Or – with a brand loyalty second to none and the robust PageRank algorithm deployed to good effect – is Google resting on its laurels?

In 1982 a young Nicholas J. Belkin spearheaded a series of seminal papers documenting various models of users' information needs in IR. These papers remain relevant today and are frequently cited. One of Belkin et al.'s central suppositions is that the user suffers from the so-called Anomalous State of Knowledge, which can be conveniently acronymized to 'ASK'. Their supposition can be summarised by the following quote from their JDoc paper:
"[P]eople who use IR systems do so because they have recognised an anomaly in their state of knowledge on some topic, but they are unable to specify precisely what is necessary to resolve that anomaly. ... Thus, we presume that it is unrealistic (in general) to ask the user of an IR system to say exactly what it is that she/he needs to know, since it is just the lack of that knowledge which has brought her/him to the system in the first place".
This astute deduction ushered in a branch of IR research that sought to improve retrieval by resolving the Anomalous State of Knowledge: providing the user with assistance in the query formulation process, helping users ‘fill in the blanks’ to improve recall via query expansion, and so on.

Last winter Yahoo! unveiled its 'Search Assist' facility (see screenshot above - search for 'united nations'), which provides real-time query formulation assistance to the user. Providing these facilities in systems based on metadata has always been possible owing to the use of controlled vocabularies for indexing, the use of name authority files, and even content standards such as AACR2; but providing a similar level of functionality with unstructured information is difficult – yet Yahoo! provide something ... and it can be useful and can actually help resolve the ASK conundrum!
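
Mechanically, real-time assistance of this kind can be as simple as prefix matching against a weighted list of candidate queries as the user types. A toy Python sketch (the candidates and weights are invented; the real system presumably draws on query logs and far more sophisticated ranking):

# Toy sketch of Search Assist-style suggestion: prefix matching over an
# invented list of candidate queries, ranked by (made-up) popularity.
CANDIDATES = [
    ("united nations", 980),
    ("united nations security council", 455),
    ("united airlines", 610),
]

def suggest(prefix, limit=10):
    """Return candidate queries beginning with the typed prefix, most popular first."""
    p = prefix.lower()
    hits = [(q, n) for q, n in CANDIDATES if q.startswith(p)]
    return [q for q, _ in sorted(hits, key=lambda x: -x[1])][:limit]

# suggest("united n") -> ['united nations', 'united nations security council']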


Similarly, meta-search engine Clusty has provided its 'clustering' techniques for quite some time. These clusters group related concepts and are designed to aid in query formulation, but also to provide some level of relevance feedback to users (see screenshot above - search for 'George Macgregor'). Of course, these clusters can be a bit hit or miss but, again, they can improve retrieval and aid the user in query formulation. Similar developments can also be found in Ask. View this canned search, for example. What help does Google provide?

The bottom line is that some search engines are innovating endlessly and putting the fruits of a sexy research area to good use. These search engines are actually moving search forward. Can the same still be said of Google?