Tuesday, 17 November 2009

Getting into technical debt in a recession?

I like this succinct quote from TalkTalk CIO David Cooper about the IT systems in newly acquired Tiscali.

"In addition, Tiscali faced some IT issues in the past and worked pragmatically to fix them, resulting in some discontinuities between systems – we are repairing this now," www.computing.co.uk/computing/analysis/2252848/ringing-changes-talktalk-4893004

There are, no doubt, times to go for the cheapest, quickest, most 'pragmatic' solution to getting systems to mesh together. But in reaching a solution at least cost there can be an accumulation of 'technical debt'. I know that one of my clients in my day job keeps a measure of the technical debt being accumulated in development projects. But it is difficult to persuade an organisation to contemplate such debt, let alone put it on the balance sheet.

It's never easy either to argue for spending money on technical debt when you can have shiny new functionality baubles instead. What we find we do at Village Software when working on clients' projects is try to improve things as we go along. Most clients would not be happy if we suggested they spend tens of thousands refactoring a system with little or no functional gain. Hence I suspect we often do this on the cheap, out of a possibly misplaced, or at least poorly negotiated, sense of professionalism. We have recently spent a five-figure sum refactoring our Lab Solution to improve it under the hood; I've got to say this hurts, and we are now on a functionality campaign.

I wonder what the effect of the current recession is on technical debt. In principle, resources to deal with it are cheaper than in a boom time. However, the need to gain the cost-reducing, innovation-gaining benefits of business information systems at lower investment will surely lead to an increase in technical debt across the economy. Perhaps the UK is accumulating billions in technical debt, in both the public and private sectors, to match the vast national debt in the public sector and the balance sheet retrenchments in the private sector.

The term 'technical debt', by the way, was coined by Ward Cunningham as a useful allegory. One of the great thinkers in current software development, Martin Fowler, describes it on his bliki at martinfowler.com/bliki/TechnicalDebt.html. He explains that getting things done in the way TalkTalk's David Cooper generously describes as pragmatic is like borrowing money: you start paying interest, and eventually you have to pay it back along with the principal. Cunningham has a neat little four-point summary of technical debt in software development. (For those not familiar with the term, refactoring is the practice of improving the quality of software code without adding functionality.) He describes technical debt on his wiki (all these guys have wikis) at www.c2.com/cgi/wiki?ComplexityAsDebt :-
  • Skipping design is like borrowing money.
  • Refactoring is like repaying principal.
  • Slower development due to complexity is like paying interest.
  • When the whole project caves in under the mess, is that like when the big guys come round and slam your hands in the car door for not paying up?
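Cunningham's four points can be made concrete with some toy arithmetic. The numbers and the function below are entirely my own invention, purely for illustration (nobody's published model); the point is simply that unpaid complexity taxes every subsequent piece of work the way interest taxes a borrower:

```python
# Toy model of the debt metaphor: unaddressed complexity acts like
# interest, taxing every sprint's effective output.
def effective_velocity(base_velocity, debt, interest_rate=0.05):
    """Velocity after paying 'interest' on accumulated complexity."""
    return base_velocity / (1 + interest_rate * debt)

# Shipping quick-and-dirty features adds debt; refactoring repays principal.
debt = 0.0
for sprint in range(1, 9):
    if sprint % 4 == 0:          # every fourth sprint, repay some principal
        debt = max(0.0, debt - 3)
    else:
        debt += 1                # each rushed feature borrows a little
    print(sprint, round(effective_velocity(10, debt), 2))
```

Run it and the 'velocity' dips as debt accrues, then recovers after each repayment sprint: the shape of the argument, not a measurement.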

Others describe it with the allegory of lactic acid building up in your muscles during a run. In the wider world of commercial and government ICT we can expect a build-up of such debt. For businesses without a strategic plan for ICT, without the resources to deliver a plan, or whose plan is simply to accumulate technical debt, there is going to be a backlog. Alas, 'technical debt' will not be appearing on balance sheets soon, if indeed there were a way to measure it. I would be baffled if a student asked how to go about measuring the technical debt in his company.

I have a feeling that the Information Systems people, like me, will somehow get the blame for letting such debt accumulate.

Life can be unfair, but it is indoor work no heavy lifting.

Friday, 13 November 2009

An interesting article about search engines...again...

Victor Keegan (Guardian Technology journalist) published an interesting column yesterday on the current state of search engines. The column entitled, "Why I am searching beyond Google", is an interesting discussion which picks up on something that has been discussed a lot on this blog: the fact that Google really isn't that good any more. There are dozens of search engines out there which offer the user greater functionality and/or search data which Google ignores or can't be bothered indexing. Yahoo! and Bing are mentioned by Keegan, but leapfish, monitter and duckduckgo are also discussed.

Keegan also comments on the destructive monopoly that Google has within search:
"If you were to do a blind tasting of Google with Yahoo, Bing or others, you would be pushed to tell them apart. Google's power is no longer as a good search engine but as a brand and an increasingly pervasive one. Google hasn't been my default search for ages but I am irresistibly drawn to it because it is embedded on virtually every page I go to and, as a big user of other Google services (documents, videos, Reader, maps), I don't navigate to Google search, it navigates to me."
This is where Google's dominance is starting to become a problem. Competition is no longer fair. There are now several major search engines which are, in many ways, better than Google; yet, this is not reflected in their market share, partly because the search market is now so skewed in Google's favour. As Keegan notes, Google comes to him, not the other way round.

In a concurrent development, WolframAlpha is to be incorporated into Bing to augment Bing's results in areas such as nutrition, health and mathematics. Will we see Google incorporate structured data from Google Squared into their universal search soon?

I realise that this is yet another blog posting about either, a) Google, or, b) search engines. I promise this is the last, for at least, erm, 2 months. In my defence, I am simply highlighting an interesting article rather than making a bona fide blog posting!

Thursday, 22 October 2009

Blackboard on the shopping list: do Google need reining in?

Alex Spiers (Learning Innovation & Development, LJMU) alerted me via Twitter to rumours in the 'Internet playground' that Google is considering branching out into educational software. According to the article spreading the rumour, Google plans to fulfil its recent pledge to acquire one small company per month by purchasing Blackboard.

The area of educational software is not completely alien to Google. Google Apps Education Edition (providing email, collaboration widgets, etc.) has been around for a while now (I think) and, as the article insinuates, moving deeper into educational software seems a natural progression, providing Google with clear access to a key demographic. This is all conjecture of course; but if Google acquired Blackboard I think I would suffer a schizophrenic episode. A part of me would think, "Great - Google will make Blackboard less clunky and offer more functionality and more flexibility". But the other part (which is slightly bigger, I think) would feel extremely uncomfortable that Google is yet again moving into a new area, probably with the intention of dominating it.

We forget how huge and pervasive Google is today. Google is everywhere, and now reaches far beyond its dominant position in search into virtually every significant area of web and software development. If Google were Microsoft, the US Government and the EU would be all over it like a rash for pushing the boundaries of antitrust legislation and competition law. This situation takes on a rather sinister tone when you consider the position of HE if Blackboard becomes a Google subsidiary. Edge Hill University is one of several institutions which have elected to ditch fully integrated institutional email applications (e.g. MS Outlook, Thunderbird) in favour of Google Mail. Having a VLE maintained by Google therefore sets the alarm bells ringing. The key technological interactions for a 21st century student are email, web, VLE and library. Picture it - a student existence entirely dependent upon one company and the directed advertising that goes with it: Google Mail; the web (where the first port of call is likely to be Google, of course); GoogleBoard (the name of Blackboard if they decided to re-brand it!); and a massive digital library which Google is attempting to create and which would essentially constitute a de facto digital library monopoly.

I'm probably getting ahead of myself. The acquisition of Blackboard probably won't happen, and the digital library has encountered plenty of opposition, not least from Angela Merkel; but it does get me thinking that Google finally needs reining in. Even before this news broke I was starting to think that Google was turning into a Sesame Street-style Cookie Monster, devouring everything in sight. Their ubiquity can't possibly be healthy anymore, can it? Or am I being completely paranoid?

Monday, 19 October 2009

The Kindle according to Cellan-Jones

The world in which Rory Cellan-Jones exists is an interesting place. It's one which often results in a good, hard slap to the face. He can always be relied upon for some cynicism and negativity (or realism?) in his analysis of new technologies and tech related businesses. (See the last posting about Google Wave, for example.) This can be unexpected, often because he sees through the hype or aesthetics of many technologies and evaluates stuff based squarely on utilitarian principles. His overview of the Kindle is no exception to this rule:
"The Kindle looks to me like an attractive but expensive niche product, giving a few techie bibliophiles the chance to take more books on holiday without incurring excess baggage charges. But will it force thousands of bookshops to close and transform the economics of struggling newspapers? Don't bet on it."
The thing is, Cellan-Jones often talks a lot of sense. To be sure, the Kindle looks like an extremely smart piece of kit, but when Cellan-Jones stacks up the realities of the Kindle one wonders whether it'll be the game changer everyone is expecting it to be.

The focus for the Kindle seems to be on the best-seller lists and the broadsheets. An area which appears to have eluded adequate exposition by the tech commentators is the use of this new generation of e-book readers to deliver textbooks, learning materials, etc. This was always considered an important area for the early e-book readers. Why carry lots of heavy textbooks around when you could have them all on your Kindle or Sony Reader Touch, and be in a position to browse and search the content more effectively? Or is this an extravagant use of E-Ink? E-Ink comes into its own in lengthy reading sessions (i.e. novels) rather than in dipping in and out of textbooks to complete academic tasks, something for which a netbook or mobile device might be better suited. So what happens to the future of e-book readers in academia?

Friday, 9 October 2009

Wave a washout?

This is just a brief posting to flag up a review of Google Wave on the BBC dot.life blog.

Google unveiled Wave at their Google I/O conference in late May 2009. The Wave development team presented a lengthy demonstration of what it can do and – given that it was probably a well rehearsed presentation and demo – Wave looked pretty impressive. It might be a little bit boring of me, but I was particularly impressed by the context sensitive spell checker ("Icland is an icland" – amazing!). Those of you that missed that demonstration can check it out in the video below. And try not to get annoyed at the sycophantic applause of their fellow Google developers...

Since then Wave has been hyped up by the technology press and even made mainstream news headlines at the BBC, Channel 4 News, etc. when it went on limited (invitation-only) release last week. Dot.life has reviewed Wave and the verdict was not particularly positive. Surprisingly, they (Rory Cellan-Jones, Stephen Fry, Bill Thompson and others) found it pretty difficult to use and rather chaotic. I'm now anxious to try it out myself, because I was convinced it would be amazing. Their review is funny and worth reading in full, but the main issues were noted as follows:
"Well, I'm not entirely sure that our attempt to use Google Wave to review Google Wave has been a stunning success. But I've learned a few lessons.

First of all, if you're using it to work together on a single document, then a strong leader (backed by a decent sub-editor, adds Fildes) has to take charge of the Wave, otherwise chaos ensues. And that's me - so like it or lump it, fellow Wavers.

Second, we saw a lot of bugs that still need fixing, and no very clear guide as to how to do so. For instance, there is an "upload files" option which will be vital for people wanting to work on a presentation or similar large document, but the button is greyed out and doesn't seem to work.

Third, if Wave is really going to revolutionise the way we communicate, it's going to have to be integrated with other tools like e-mail and social networks. I'd like to tell my fellow Wavers that we are nearly done and ready to roll with this review - but they're not online in Wave right now, so they can't hear me.

And finally, if such a determined - and organised - clutch of geeks and hacks struggle to turn their ripples and wavelets into one impressive giant roller, this revolution is going to struggle to capture the imagination of the masses."
My biggest concern about Wave was the important matter of critical mass, and this is something the dot.life review hints at too. A tool like Wave is only ever going to take off if large numbers of people buy into it; if your organisation suddenly dumps all existing communication and collaboration tools in favour of Wave. It's difficult to see that happening any time soon...

Thursday, 8 October 2009

AJAX content made discoverable...soon

I follow the Official Google Webmaster Central Blog. It can be an interesting read at times, but on other occasions it provides humdrum information on how best to optimise a website, or answers questions which most of us know the answers to already (e.g. recently we had, 'Does page metadata influence Google page rankings?'). However, the latest posting is one of the exceptions. Google have just announced that they are proposing a new standard to make AJAX-based websites indexable and, by extension, discoverable to users. Good ho!

The advent of Web 2.0 has brought about a huge increase in interactive websites and dynamic page content, much of which has been delivered using AJAX ('Asynchronous JavaScript and XML', not a popular household cleaner!). AJAX is great and furnished me with my iGoogle page years ago; but increasingly websites use it to deliver page content which might otherwise be delivered using static web pages in XHTML. This presents a big problem for search engines, because AJAX-delivered content is currently un-indexable (if that's a word!) and a lot of content is therefore invisible to all search engines. Indeed, the latest web design mantra has been "don't publish in AJAX if you want your website to be visible". (There are also accessibility and usability issues, but these are an aside for this posting...)

The Webmaster Blog summarises:
"While AJAX-based websites are popular with users, search engines traditionally are not able to access any of the content on them. The last time we checked, almost 70% of the websites we know about use JavaScript in some form or another. Of course, most of that JavaScript is not AJAX, but the better that search engines could crawl and index AJAX, the more that developers could add richer features to their websites and still show up in search engines."
Google's proposal involves shifting the responsibility for indexing the website to the administrator/webmaster of the website, who would set up a headless browser on the web server. (A headless browser is essentially a browser without a user interface: a piece of software that can access web documents but does not render them for human users.) The headless browser would then be used to programmatically access the AJAX website on the server and provide an HTML 'snapshot' to search engines when they request it - which is a clever idea. The crux of Google's proposal is a set of URL conventions. These control when a search engine knows to request the headless browser's output (i.e. the HTML snapshot) and which URL to reveal to human users.
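As I understand the proposal, crawlable AJAX states are marked with '#!' in the URL, and the crawler translates that fragment into a special query parameter it can actually fetch (the server answering with the headless browser's snapshot). The Python sketch below, with a made-up example.com URL, shows the sort of mapping involved; treat the details as my reading of the announcement rather than a finished standard:

```python
from urllib.parse import quote

def crawler_url(ajax_url):
    """Map a '#!' AJAX URL to the '_escaped_fragment_' URL a crawler
    would fetch to obtain the server's HTML snapshot of that state."""
    if "#!" not in ajax_url:
        return ajax_url          # not a crawlable-AJAX URL; fetch as-is
    base, fragment = ajax_url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    # The fragment is percent-encoded so it survives as a query value.
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="")

print(crawler_url("http://example.com/stocks#!page=2"))
```

The human-readable '#!' URL is what gets shown in search results; the escaped-fragment URL exists only for the crawler.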

It's good that Google are taking the initiative; my only concern is that they start trying to rewrite standards, as they have done a little with RDFa. Their slides are below - enjoy!

Wednesday, 23 September 2009

Yahoo! is alive and kicking!

In a recent posting I discussed the partnership between Yahoo! and Microsoft and wondered whether this might bring an end to Yahoo!'s innovative information retrieval work. Yesterday the Official Yahoo! Search Blog announced big changes to Yahoo! Search. Many of these changes have been discussed in previous postings here (e.g. Search Assist, Search Pad, SearchMonkey, etc.); however, Yahoo! have updated their search and results interface to make better use of these tools. As they state:
"[These changes] deliver a dynamic, compelling, and integrated experience that better understands what you are looking for so you can get things done quickly on the Web."
To us this means better integration of user query formulation tools, better use of structured data on the Web (e.g. RDF data, metadata, etc.) to provide improved results and results browsing, and improved filtering tools - something which is nicely explained in their grand tour. According to their blog, though, better integration of these innovations required a serious overhaul of the Yahoo! Search technical architecture to make it run faster.
"Now, here's the best part: Rather than building this new experience on top of our existing front-end technology, our talented engineering and design teams rebuilt much of the foundational markup/CSS/JavaScript for the SRP design and core functionality completely from scratch. This allowed us to get rid of old cruft and take advantage of quite a few new techniques and best practices, reducing core page weight and render complexity in the process."
I sound like a sales officer for Yahoo!, but these improvements really are very good indeed and have to be experienced first hand. It's good to see that the intellectual capital of Yahoo! has not disappeared, and fingers crossed it never will. True, these updates were probably already in the pipeline months before the partnership with Microsoft; but they at least demonstrate to Microsoft why Yahoo! still has the upper hand in Web search.

Thursday, 13 August 2009

Trough of disillusionment for microblogging and social software?

The IT research firm Gartner has recently published another of its technology reports for 2009: Gartner's Hype Cycle Special Report for 2009. This report is another in a long line of similar Gartner reports which do exactly what they say on the tin. That is, they provide a technology 'hype cycle' for 2009! Did you see that coming?! The technology hype cycle was a topic that Johnny Read recently discussed at an ISG research reading group, so I thought it was worth commenting on.

According to Gartner - who I believe introduced the concept of the technology hype cycle - expectations of a new or emerging technology grow far more quickly than the technology itself matures. This is obviously problematic, since user expectations get inflated only to be deflated later as the true value of the technology slowly becomes recognised. This true value is normally reached when the technology enters mainstream use (i.e. the plateau of productivity). The figure below illustrates the basic principles behind the hype cycle model.

The latest Gartner hype cycle (below) is interesting - and interesting is really as far as you can go with this, because it's unclear how the hype cycles are compiled and whether they can be used for forecasting or as a true indicator of technology trends. Nevertheless, according to the 2009 hype cycle, microblogging and social networking are on the descent into the trough of disillusionment.

From a purely personal view this is indeed good news, as it might mean I don't have to read about Twitter in virtually every technology newspaper, blog and website for much longer, or be exposed to another woeful interview with the Twitter CEO on Newsnight. But I suppose it is easy to anticipate the plateau of productivity for these technologies. Social software has been around for a while now, and my own experiences would suggest that many people are starting to withdraw from it; the novelty has worn off. And remember, it's not just users that perpetuate the hype cycle: those wishing to harness the social graph for directed advertising, marketing, etc. are probably sliding down into the trough of disillusionment too, as the promise of a captive audience has not been financially fulfilled.

It's worth perusing the Gartner report itself - interesting. The summary hype cycle figure above doesn't seem to be available in the report, so I've linked to the version available on the BBC dot.life blog, which also discusses the report.

Tuesday, 4 August 2009

Extending the FOAF vocabulary for junkets, personal travel and map generation

As we know, FOAF provides a good way of exposing machine-readable data on people, their activities and interests, and the nature of their relationships with other people, groups or things. FOAF allows us to model social networks in much the same way as a social networking service might (e.g. Facebook). The big difference is that with FOAF the resultant social graph is exposed to the Semantic Web in a distributed way for machine processing (and all the goodness that this might entail…), not held in proprietary databases.

FOAF data has always tended to be augmented with other RDF vocabularies. There is nothing strange in this; it was anticipated, and reusing and remixing vocabularies and RDF data is a key component of the Semantic Web. My FOAF profile, for example, uses numerous additional vocabularies for enrichment, including Dublin Core, the Music Ontology, and the Contact, Relationship and Basic Geo vocabularies. The latter vocabulary (Basic Geo) provides the hook for this blog posting.

The need to provide geographical coordinates and related data in RDF was recognised early in the life of the Semantic Web, and the Basic Geo (WGS84 lat/long) Vocabulary website lists obvious applications for such data. Although including geographical data within FOAF profiles presents an obvious use case (e.g. using Basic Geo to provide the latitude and longitude of, say, your office location), few people do it because few applications actually do anything with it. That was until a couple of years ago, when Richard Cyganiak (DERI, National University of Ireland) developed an experimental FOAF widget (FOAF – Where Am I?) to determine geographical coordinates using the Google Maps API and spit them out in FOAF RDF/XML for inclusion in a FOAF profile. In his words, "there's no more excuses for [not including coordinates]". With coordinates included, FOAF profiles could be mapped using Alexandre Passant's FOAFMap.net widget (also from DERI), which was developed around the same time and extracts geographical data embedded within FOAF profiles and maps it using Google Maps. Despite the presence of these useful widgets, FOAF profiles rarely contain location data because, let's face it, are we really that interested in the precise geographical location of an office?!
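For anyone wondering what this looks like in practice, a FOAF profile with Basic Geo coordinates needs only a handful of triples. A minimal sketch in Turtle (the name and the coordinates are placeholders, not anyone's real profile), using foaf:based_near to hook a geo:Point to a person:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<#me> a foaf:Person ;
    foaf:name "Example Person" ;
    foaf:based_near [ a geo:Point ;
        geo:lat  "53.405" ;
        geo:long "-2.988" ] .
```

This is exactly the sort of fragment the Cyganiak widget generates (albeit in RDF/XML) and the FOAFMap.net widget consumes.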

More interesting – and perhaps more useful – is to model personal travel within a FOAF profile. This is consistent with the recent emergence (within the past year or so) of 'smart travel' services on the web, the most notable of which is probably Dopplr. Dopplr essentially allows users to create, share and map details of future journeys and travel itineraries with friends, colleagues, business contacts, etc. so that overlaps can be discovered in journey patterns and important meetings arranged between busy persons. It is also consistent with the personal homepages of academics and researchers. For example, Ivan Herman's (W3C Semantic Web Activity Lead) website is one of many which include a section about upcoming trips. There are others too. From personal experience I can confirm that many an international research relationship has been struck by knowing who is going to be at the conference you are attending next week! People also like to record where they have been and why, and the 'Cities I've Visited' Facebook application provides yet another example of wanting to associate travel with a personal profile, albeit within Facebook.

Of course, Dopplr and Facebook applications are all well and good; but we want to expose these journeys and travel itineraries in a distributed and machine-processable way - and FOAF profiles are the obvious place to do it. It is possible to use the RDF Calendar vocabulary to model some travel, but it's a little itchy and can't really tell us the purpose of a journey. Other travel ontologies exist, but they are designed for 'serious' travel applications and are too heavyweight for a simple FOAF profile. It therefore occurred to me that there is a need for a lightweight RDF travel vocabulary, ideally for use with FOAF, which can better leverage the power of existing vocabularies such as Basic Geo and RDF Calendar. I documented my original thoughts about this on my personal blog, which I use for more technical musings. Enriching a FOAF profile with such data would not only expose it to the Semantic Web and enrich social graphs, but also make applications (similar to those described above) possible in an open way.

To this end I have started authoring the Travelogue RDF Vocabulary (TRAVOC). It's a pretty rough-and-ready effort (c'mon, 3-4 hours!) and is really just for experimental purposes, but I have published what I have so far. A formal RDF Schema is also available. Most properties apply to a foaf:Person, and I have provided brief examples on my blog.
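To give a flavour of the shape such a vocabulary takes (the real TRAVOC property names and namespace are on the blog, so everything travoc-prefixed below is a made-up stand-in, purely illustrative), a trip might be attached to a FOAF person and grounded in Basic Geo like this:

```turtle
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix geo:    <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix travoc: <http://example.org/travoc#> .   # hypothetical namespace

<#me> a foaf:Person ;
    foaf:name "Example Person" ;
    # Hypothetical properties: one past trip and one planned trip,
    # each hooked to a Basic Geo point so mapping widgets can plot them.
    travoc:visited   [ a geo:Point ; geo:lat "52.520" ; geo:long "13.405" ] ;
    travoc:willVisit [ a geo:Point ; geo:lat "48.857" ; geo:long "2.352" ] .
```

The appeal of this layering is that a consumer which knows nothing of the travel vocabulary can still extract and map the geo:Point data.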

As noted, TRAVOC has been sewn together in short order. It would therefore benefit from some further consideration, refinement and (maybe) expansion. Perhaps there's a research proposal in it? Thoughts, anyone? In particular, I would be interested to know whether any of the TRAVOC properties overlap with existing vocabularies which I haven't been able to find. If I have time - and if and when I am satisfied with the final vocabulary - I may acquire the necessary PURLs.

Wednesday, 29 July 2009

Is it R.I.P. for Yahoo! as we know it?

And so Microsoft and Yahoo! finally agree terms of a partnership which will change the face of the web search market. Historically – and let's face it, this story has been ongoing since January 2008! – Microsoft always wanted to take over Yahoo!; but on reflection both parties probably felt that forging a partnership was most likely to give them success against the market leader. So this has to be good news, no?

Well, it'll do some good to have the dominance of Google properly challenged by the next two biggest fish - and Google will probably be concerned. But their partnership entails that Yahoo! Search be powered by Bing and, in return, Yahoo! will become the sales force for both companies' premium search advertising. We've noted recently that Bing is good and was an admirable adversary for Yahoo!, but will a Yahoo! front-end powered by a Bing back-end mean an end to some of Yahoo!'s excellent retrieval tools (often documented on this blog, see this, this, this and this, for example) and, more importantly, an end to their innovative research strategies to better harness the power of structured data on the web? Is the innovative SearchMonkey Open Search Platform to be jettisoned?

The precise details of the partnership are sketchy at the moment, but it would be tragic if this intellectual capital was to be lost or now neglected...

Thursday, 16 July 2009

Broken business models again...

In a tenuous link with several previous blog postings (this one and this one), the latest BBC dot.life posting by Cellan-Jones discusses the future of the music industry. It's an interesting summary of recent research on the music habits of the British public. Surprisingly, CDs remain by far the most popular music format, even amongst teenagers. This pleases me because – although I am a man that enjoys his eMusic downloads - I am also a chap that enjoys the CD, its artwork, its liner notes, its aesthetic qualities, etc.

Of course, the big finding that people have been latching onto is the large reduction in illegal file sharing. This is indeed good news; however, whilst many of these music fans will have switched to legal download services (e.g. iTunes, eMusic, Amazon, take your pick....), many have turned instead to legal streaming services like Spotify. The trouble is, as Cellan-Jones points out, Spotify is another service lacking a robust business strategy. Advertising doesn't bring home the bacon, and Spotify is relying on users upgrading to its pay-for premium service. Unfortunately, nobody is. Without this revenue stream Spotify is doomed in the longer term. Nothing new in this; Spotify simply joins the growing number of Web 2.0 services that are failing to monetise their innovations.

By coincidence, Guardian columnist Paul Carr authored an article a few days ago entitled 'I'm calling a 'time of death' for London's internet startup industry'. The article laments the failure of London-based Web 2.0 companies to experience any modicum of success or profitability. Many of his arguments have been applied elsewhere, but the London focus makes it compelling reading, particularly because Carr was around during the first dot.com boom and has personally witnessed the mysterious nature of revenue within new media. His book, 'Bringing Nothing To The Party: True Confessions Of A New Media Whore', says it all. Like Cellan-Jones, Carr also singles out Spotify, despite professing to be "discreet with names". Says Carr:
"You see, the sad but true fact – and I've said this before, albeit in less aggressive terms – is that the London internet industry is increasingly, and terminally, screwed. I'll be discreet with names so as not to make things worse but since I've been back in town, I've met no fewer than three once-successful entrepreneurs who admit they're running out of money at a sickening rate (personally and professionally) with no prospect of raising more. I've seen two businesses close and one having its funding yanked suddenly because, basically, it was going nowhere fast. Everyone I speak to has the same story: investors aren't investing, revenues aren't coming, founders are being forced out – or leaving of their own accord – and no one seems to have the first idea what to do about it. Even Spotify, the current darling of London startups (which is actually from Sweden), might not be doing as well as it appears. The company says it's projecting profitability by the end of the year, with a senior staffer boasting about that fact to the geeks at the Juju event. Unfortunately, when one blogger challenged him to provide numbers to back it up, he was forced to admit that the profitability is less "projected" and more "hoped for". Meanwhile, rivals (and fellow London poster-children) Last.fm just saw all three of their founders depart the company leaving a huge hole at the top during a time of massive uncertainty. However you dress it up, that's not good."
No - it's not good; but when is the madness all going to end? Like many others, I keep thinking the end is 'just round the corner', but it never comes. How many insane venture capitalists are left? Will it be a house of cards and, if so, which card is going to be removed first? Perhaps a little schadenfreude is the order of the day - shall we have a sweepstake?

Tuesday, 7 July 2009

Welcome to my (Search) Pad

Search innovators at Yahoo! have today launched Search Pad. Search Pad integrates with the usual Yahoo! Search interface and allows users to take notes while conducting common information seeking tasks (e.g. researching a holiday, deciding whether to buy that new piece of gadgetry, etc.). Search Pad can track the websites users are visiting and is invoked when it considers the user to be conducting a research task. On the Yahoo! Search Blog today:
"Search Pad helps you track sites and make notes by intelligently detecting user research intent and automatically collecting sites the user visits. Search Pad turns on automatically when you're doing research, tracking sites to make document authoring a snap. You can then quickly edit and organize your notes with the Search Pad interface, which includes drag-and-drop functionality and auto-attributed pasting."
Nice. From the website and Yahoo! blog (and this video), Search Pad is in many ways reminiscent of Listas from Microsoft Live Labs (and discussed on this blog before). It's possible to copy text, images and create lists for sharing with others, either via URL or via other services (e.g. Facebook, Twitter, Delicious). Search Pad also has an easy to use menu driven interface. Whilst useful in some circumstances, Listas lacked a worthy application; Search Pad builds on Listas-style functionality, incorporating an improved version of it within a traditional search interface to support something we often do when searching (i.e. taking notes about a search task).

The only problem is that I can't get it to work!! I have tried conducting a variety of 'obvious' research tasks which I anticipated Search Pad would recognise, but the Search Pad console hasn't appeared. Perhaps the 'intelligent detection' isn't as intelligent as promised? I'll keep trying, but please let me know if anyone has better luck. Still, it demonstrates the state of permanent innovation at Yahoo! Search.

Wednesday, 1 July 2009

When Web 2.0 business models and accessibility collide with information services and e-learning...

Rory Cellan-Jones has today posted his musings on the current state of Facebook at the BBC dot.life blog. His posting was inspired by an interview with Sheryl Sandberg (Chief Operating Officer) and was originally billed as 'Will Facebook ever make any money?'. Sandberg was recruited from Google last year to help Facebook turn a financial corner. According to her interview with Cellan-Jones, Facebook is still failing to break even, but her projections are that Facebook will start to turn a profit by the end of 2010. If true, this will be good news for Facebook. Not everyone believes this of course, including Cellan-Jones judging by his questions, his raised left eyebrow and his prediction that tighter EU regulation will harm Facebook growth. Says Cellan-Jones:
"And [another] person I met at Facebook's London office symbolised the firm's determination to deal with its other challenge - regulation.

Richard Allan, a former Liberal Democrat MP and then director of European government affairs at Cisco, has been hired to lobby European regulators for Facebook.

With the EU mulling over tighter privacy rules for firms that share their users' data, and with continuing concern from politicians about issues like cyber-bullying and hate-speak on social networks, there will be plenty on Mr Allan's plate.

So, yes, Facebook suddenly looks like a mature business, poised for steady progress towards profitability and ready to engage in grown-up conversations about its place in society. Then again, so did MySpace a year ago, until it suddenly went out of fashion."
This is all by way of introduction, because a few weeks ago I attended the CILIP MmIT North West day conference on 'Emerging technologies in the library' at LJMU. A series of interesting speakers, including Nick Woolley, Russell Prue and Jane Secker, pondered the use of new technologies in e-learning, digital libraries and other information services. Of course, one of the recurring themes to emerge throughout the day was the innovative use of social networking tools in e-learning or digital library contexts. To be sure, there is some innovative work going on; but none of the speakers addressed two elephants in the room:
  • Service longevity, and;
  • Accessibility
For me these are the two biggest threats to social media use within universities.

The adoption of Facebook, YouTube, MySpace, Twitter (and the rest) within universities has been rapid. Many in the literature and at conferences evangelise about the adoption of these tools as if their use was now mandatory. Nick Woolley voiced sensible concerns over this position. An additional concern that I have – and one I had hoped to verbalise at some point during the proceedings – is whether it is appropriate for services (whether e-learning or digital libraries, or whatever) to be going to the effort of embedding these technologies within curricula or services when they are third party services over which we have little control and when their economic futures are so uncertain.

The magic word at the MmIT event was 'free'. "Make use of this tool – it's free and the kids love it!". Very few of the tools about which LISers and learning technologists get excited actually have viable business models. Google lost almost $500 million on YouTube in the year up to April 2009 and is unable to turn it into a viable business. MySpace is struggling and slashing staff, Facebook's future remains uncertain, Twitter currently has no business model at all and is being propped up by venture capitalists while it contemplates desperate ways to create revenue, and so the list continues. Will any of these services still be here next year? Well-published and straight-talking advertising consultant, George Parker, has been pondering the state of social media advertising on his blog recently (warning – he is straight talking and profanities are the order of the day!). He has insightful comments to make though on why most of these services are never going to make spectacular amounts of money from their current (failed?) model (i.e. advertising). According to Parker, advertising is just plain wrong. Niche markets where subscriptions are required will be the only way for these services to make decent money...

A more general concern relates to the usability and accessibility of social networking services. Very few of them, if any, actually come close to minimal W3C accessibility guidelines, or the DDA and the Special Educational Needs and Disability Act (SENDA) 2001. Surely there are legal and ethical questions to be asked, particularly of universities? Embedding these third party services into curricula seems like a good idea, but it's one which could potentially exclude students from the same learning experience as others. This is a concern I have had for a few years now, but I had thought that, a) it would have been resolved voluntarily by the services by now, and, b) that institutions wishing to deploy them would have taken measures to resolve it (which might mean not using them at all!). Obviously not...

There are many arguments for not engaging with Web 2.0 at university, and - where appropriate - many of these arguments were cogently made at the MmIT conference. But if adopting such technologies is considered to be imperative, should we not be making more of an effort to develop tools that replicate their functionality, thus allowing control over their longevity and accessibility? Attempts at this have hitherto been pooh-poohed on the grounds that interrupting habitual student behaviour (i.e. getting students to switch from, say, Facebook to an academic equivalent) was too onerous, or that replicating the social mass and collaborative appeal of international networking sites couldn't be done within academic environments. But have we really tried hard enough? Most have been half-baked efforts. It is also noteworthy that research conducted by Mike Thelwall and published in JASIST indicates that homophily continues within social networking websites. If this is true, then getting students to make the switch to locally hosted equivalents of Facebook or MySpace is certainly possible, particularly as the majority of their network will comprise similar people within similar academic situations.

Perhaps there is more of a need for the wider adoption of social web markup languages, such as the User Labor Markup Language (ULML), to enable users to switch between disparate social networking services whilst simultaneously allowing the portability of social capital (or 'user labour') from one service to another? This would make the decision to adopt academic equivalents far more attractive. However, if this is the case, then more research needs to be undertaken to extend ULML (and other options) to make them fully interoperable with the breadth of services currently available.

I don't like putting a downer on all the innovative and excellent work that the LIS and e-learning communities are doing in this area; it's just that many seem to be oblivious to these threats and are content to carry on regardless. Nothing good ever comes from carrying on regardless, least of all that dreadful tune by the Beautiful South. Let's just talk about it a bit more and actually acknowledge these issues...

Friday, 26 June 2009

Read all about it: interesting contributions at ISKO-UK 2009

I had the pleasure of attending the ISKO-UK 2009 conference earlier this week at University College London (UCL), organised in association with the Department of Information Studies. This was my first visit to the home of the architect of Utilitarianism, Jeremy Bentham, and the nearby St. Pancras International since it has been revamped - and what a smart train station it is.

The ISKO conference theme was 'content architecture', with a particular focus on:
  • "Integration and semantic interoperability between diverse resources – text, images, audio, multimedia
  • Social networking and user participation in knowledge structuring
  • Image retrieval
  • Information architecture, metadata and faceted frameworks"
The underlying themes throughout most papers were those related to the Semantic Web, Linked Data, and other Semantic Web inspired approaches to resolving or ameliorating common problems within our disciplines. There were a great many interesting papers delivered and it is difficult to say something about them all; however, for me, there were particular highlights (in no particular order)...

Libo Eric Si (et al.) from the Department of Information Science at Loughborough University described research to develop a prototype middleware framework between disparate terminology resources to facilitate subject cross-browsing of information and library portal systems. A lot of work has already been undertaken in this area (see for example, HILT project (a project in which I used to be involved), and CrissCross), so it was interesting to hear about his 'bag' approach in which – rather than using precise mappings between different Knowledge Organisation Systems (KOS) (e.g. thesauri, subject heading lists, taxonomies, etc.) - "a number of relevant concepts could be put into a 'bag', and the bag is mapped to an equivalent DDC concept. The bag becomes a very abstract concept that may not have a clear meaning, but based on the evaluation findings, it was widely-agreed that using a bag to combine a number of concepts together is a good idea".
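To make the 'bag' idea concrete, here is a minimal sketch of how such a mapping might be represented. All of the terms, vocabulary names and the DDC notation below are invented for illustration; they are not taken from Si et al.'s prototype.

```python
# Sketch of the 'bag' approach: concepts from different KOS are grouped
# into a bag, and the bag as a whole - rather than each term precisely -
# is mapped to a single DDC concept. Terms and notation are illustrative.

# A bag groups related concepts drawn from different vocabularies.
bag = {
    "LCSH": ["Information storage and retrieval systems"],
    "Thesaurus": ["information retrieval", "document retrieval"],
}

# The whole bag maps to one (hypothetical) DDC class.
bag_to_ddc = {
    frozenset(t for terms in bag.values() for t in terms): "025.04",
}

def ddc_for(term):
    """Return the DDC class of any bag containing `term`, else None."""
    for members, ddc in bag_to_ddc.items():
        if term in members:
            return ddc
    return None

print(ddc_for("document retrieval"))  # → 025.04
```

The point of the abstraction is visible even in this toy: none of the bag's member terms needs a precise one-to-one equivalent in DDC, yet all of them become cross-browsable through the shared target concept.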

Brian Matthews (et al.) reported on an evaluation of social tagging and KOS. In particular, they investigated ways of enhancing social tagging via KOS, with a view to improving the quality of tags and, in turn, retrieval performance. A detailed and robust methodology was provided, but essentially groups of participants were given the opportunity to tag resources using tags, controlled terms (i.e. from KOS), or terms displayed in a tag cloud, all within a specially designed demonstrator. Participants were later asked to try alternative tools in order to gather data on the nature of user preferences. There are numerous findings - and a pre-print of the paper is already available on the conference website so you can read these yourself - but the main ones, some of them surprising, can be summarised from their paper as follows:
  • "Users appreciated the benefits of consistency and vocabulary control and were potentially willing to engage with the tagging system;
  • There was evidence of support for automated suggestions if they are appropriate and relevant;
  • The quality and appropriateness of the controlled vocabulary proved to be important;
  • The main tag cloud proved problematic to use effectively; and,
  • The user interface proved important along with the visual presentation and interaction sequence."
The user preference for controlled terms was reassuring. In fact, as Matthews et al. report:
"There was general sentiment amongst the depositors that choosing terms from a controlled vocabulary was a "Good Thing" and better than choosing their own terms. The subjects could overall see the value of adding terms for information retrieval purposes, and could see the advantages of consistency of retrieval if the terms used are from an authoritative source."
Chris Town from the University of Cambridge Computer Laboratory presented two (see [1], [2]) equally interesting papers relating to image retrieval on the Web. Although images and video now comprise the majority of Web content, the vast majority of retrieval systems essentially use text, tags, etc. that surround images in order to make assumptions about what the image might be. Of course, using any major search engine we discover that this approach is woefully inaccurate. Dr. Town has developed improved approaches to content-based image retrieval (CBIR) which provide a novel way of bridging the 'semantic gap' between the retrieval model used by the system and that of the user. His approach is founded on the "notion of an ontological query language, combined with a set of advanced automated image analysis and classification models". This approach has been so successful that he has founded his own company, Imense. The difference in performance between Imense and Google is staggering and has to be seen to be believed. Examples can be found in his presentation slides (which will be on the ISKO website soon), but can also be observed by simply messing around on the Imense Picture Search.

Chris Town's second paper essentially explored how best to do the CBIR image processing required for the retrieval system. According to Dr. Town there are approximately 20 billion images on the web, with the majority at a high resolution, meaning that by his calculation it would take 4000 years to undertake the necessary CBIR processing to facilitate retrieval! Phew! Large-scale grid computing options therefore have to be explored if the approach is to be scalable. Chris Town and his colleague Karl Harrison therefore undertook a series of CBIR processing evaluations by distributing the required computational task across thousands of Grid nodes. This distributed approach resulted in the processing of over 25 million high resolution images in less than two weeks, thus making grid processing a scalable option for CBIR.
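Those figures invite a quick back-of-envelope check. Assuming 365-day years and that the 4000-year estimate refers to a single machine working through all 20 billion images:

```python
# Sanity check on the CBIR processing figures quoted above.
SECONDS_PER_YEAR = 365 * 24 * 3600

# One machine: 20 billion images in 4000 years.
single_rate = 20e9 / (4000 * SECONDS_PER_YEAR)   # images per second

# The Grid run: 25 million images in two weeks.
grid_rate = 25e6 / (14 * 24 * 3600)              # images per second

print(f"single machine: {single_rate:.2f} images/s")
print(f"grid:           {grid_rate:.1f} images/s")
print(f"speed-up:       {grid_rate / single_rate:.0f}x")
```

Interestingly, the implied speed-up is only around 130x, far less than 'thousands of nodes' would suggest, which probably means the two figures aren't directly comparable (different image sets, node availability, scheduling overheads, and so on).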

Andreas Vlachidis (et al.) from the Hypermedia Research Unit at the University of Glamorgan described the use of 'information extraction' employing Natural Language Processing (NLP) techniques to assist in the semantic indexing of archaeological text resources. Such 'Grey Literature' is a good test bed as more established indexing techniques are insufficient in meeting user needs. The aim of the research is to create a system capable of being "semantically aware" during document indexing. Sounds complicated? Yes – a little. Vlachidis is achieving this by using a core cultural heritage ontology and the English Heritage Thesauri to support the 'information extraction' process and which supports "a semantic framework in which indexed terms are capable of supporting semantic-aware access to on-line resources".

Perhaps the most interesting aspect of the conference was that it was well attended by people from outside the academic fraternity, and as such there were papers on how their organisations are doing innovative work with a range of technologies, specifications and standards which, to a large extent, remain the preserve of researchers and academics. Papers were delivered by technical teams at the World Bank and Dow Jones, for example. Perhaps the most interesting contribution from the 'real world' though was that delivered by Tom Scott, a key member of the BBC's online and technology team. Tom is a key proponent of the Semantic Web and Linked Data at the BBC and his presentation threw light on BBC activity in this area – and rather coincidentally complemented an accidental discovery I made a few weeks ago.

Tom currently leads the BBC Earth project which aims to bring more of the BBC's Natural History content online and bring the BBC into the Linked Data cloud, thus enabling intelligent linking, re-use, re-aggregation, with what's already available. He provided interesting examples of how the BBC was exposing structured data about all forms of BBC programming on the Web by adopting a Linked Data approach and he expressed a desire for users to traverse detailed and well connected RDF graphs. Says Tom on his blog:
"To enable the sharing of this data in a structured way, we are using the linked data approach to connect and expose resources i.e. using web technologies (URLs and HTTP etc.) to identify and link to a representation of something, and that something can be person, a programme or an album release. These resources also have representations which can be machine-processable (through the use of RDF, Microformats, RDFa, etc.) and they can contain links for other web resources, allowing you to jump from one dataset to another."
Whilst Tom conceded that this work is small compared to the entire output and technical activity at the BBC, it still constitutes a huge volume of data and is significant owing to the BBC's pre-eminence in broadcasting. Tom even reported that a SPARQL end point will be made available to query this data. I had actually hoped to ask Tom a few questions during the lunch and coffee breaks, but he was such a popular guy that in the end I lost my chance, such is the existence of a popular techie from the Beeb.

Pre-print papers from the conference are available on the proceedings page of the ISKO-UK 2009 website; however, fully peer reviewed and 'added value' papers from the conference are to be published in a future issue of Aslib Proceedings.

Tuesday, 16 June 2009

11 June 2009: the day Common Tags was born and collaborative tagging died?

Mirroring the emergence of other Web 2.0 concepts, 2004-2006 witnessed a great deal of hyperbole about collaborative tagging (or 'folksonomies' as they are sometimes known). It is now 2009 and most of us know what collaborative tagging is so I'll avoid contributing to the pile of definitions already available. The hype subsided after 2006 (how active is Tagsonomy now?), but the implementation of tagging within services of all types didn't; tagging became, and remains, ubiquitous.

The strange thing about collaborative tagging is that when it emerged the purveyors of its hype (e.g. Clay Shirky in particular, but there were many others) drowned out the comments made by many in the information, computer and library sciences. The essence of these comments was that collaborative tagging broke so many of the well established rules of information retrieval that it would never really work in general resource discovery contexts. In fact, collaborative tagging was so flawed on a theoretical level that further exploration of its alleged benefits was considered futile. Indeed, to this day, research has been limited for this reason, and I recall attending a conference in Bangalore in which lengthy discussions ensued about tagging being ineffective and entirely unscalable. For the tagging evangelists though, these comments simply provided proof that these communities were 'stuck in their ways' and harboured an unwillingness to break with theoretical norms. One of the most irritating aspects of the position adopted by the evangelists was that they relied on the power of persuasion and were never able to point to evidence. Moreover, even their powers of persuasion were lacking because most of them were generally 'technology evangelists' with no real understanding of the theories of information retrieval or knowledge organisation; they were simply being carried along by the hype.

The difficulties surrounding collaborative tagging for general resource discovery are multifarious and have been summarised elsewhere; but one of the intractable problems relates to the lack of vocabulary control or collocation and the effect this has on retrieval recall and precision. The Common Tags website summarises the root problem in three sentences (we'll come back to Common Tags in a moment…):
"People use tags to organize, share and discover content on the Web. However, in the absence of a common tagging format, the benefits of tagging have been limited. Individual things like New York City are often represented by multiple tags (like 'nyc', 'new_york_city', and 'newyork'), making it difficult to organize related content; and it isn’t always clear what a particular tag represents—does the tag 'jaguar' represent the animal, the car company, or the operating system?"
These problems have been recognised since the beginning and were anticipated in the theoretical arguments posited by those in our communities of practice. Research has therefore focused on how searching or browsing tags can be made more reliable for users, either by structuring them, mapping them to existing knowledge structures, or using them in conjunction with other retrieval tools (e.g. supplementing tools based on automatic indexing). In short, tags in themselves are of limited use and the trend is now towards taming them using tried and tested methods. For advocates of Web 2.0 and the social ethos it often promotes, this is really a reversal of the tagging philosophy - but it appears to be necessary.

The root difficulty relates to the use of collaborative tagging in Personal Information Management (PIM). Make no bones about it, tagging originally emerged as a PIM tool and it is here that it has been most successful. I, for example, make good use of BibSonomy to organise my bookmarks and publications. BibSonomy might be like delicious on steroids, but one of its key features is the use of tags. In late 2005 I submitted a paper to the WWW2006 Collaborative Tagging Workshop with a colleague. Submitted at the height of tagging hyperbole, it was a theoretical paper exploring some of the difficulties with tagging as a general resource discovery tool. In particular, we aimed to explore the difficulties in expecting a tool optimised for PIM to yield benefits when used for general resource discovery and we noted how 'PIM noise' was being introduced into users' results. How could tags that were created to organise a personal collection be expected to provide a reasonable level of recall, let alone precision? Unfortunately it wasn't accepted; but since it scored well in peer review I like to think that the organising committee were overwhelmed by submissions!! (It is also noteworthy that no other collaborative tagging workshops have been held since.)

Nevertheless, the basic thesis remains valid. It is precisely this tension (i.e. PIM vs. general resource discovery) which has compromised the effectiveness of collaborative tagging for anything other than PIM. Whilst patterns can be observed in collaborative tagging behaviour, we generally find that the problems summarised in the Common Tags quote above are insurmountable – and this is simply because tags are used for PIM first and foremost, and often tell us nothing about the intellectual content of the resource ('toPrint' anyone? 'toRead', 'howto', etc.). True – users of tagging systems can occasionally discover similar items tagged by other users. But how useful is this and how often do you do it? And how often do you search tags? I never do any of these things because the results are generally feeble and I'm not particularly interested in what other people have been tagging. Is anyone? So whilst tags have taken off in PIM, their utility in facilitating wider forms of information retrieval has been quite limited.

Common Tags

Last Friday the Common Tags initiative was officially launched. Common Tags is a collaboration between some established Web companies and university research centres, including DERI at the National University of Ireland and Yahoo!. It is an attempt to address the multifarious problems above and to widen the use of tags. Says the Common Tags website:
"The Common Tag format was developed to address the current shortcomings of tagging and help everyone—including end users, publishers, and developers—get more out of Web content. With Common Tag, content is tagged with unique, well-defined concepts – everything about New York City is tagged with one concept for New York City and everything about jaguar the animal is tagged with one concept for jaguar the animal. Common Tag also provides access to useful metadata that defines each concept and describes how the concepts relate to one another. For example, metadata for the Barack Obama Common Tag indicates that he's the President of the United States and that he’s married to Michelle Obama."
Great! But how is Common Tags achieving this? Answer: RDFa. What else? Common Tags enables each tag to be defined using a concept URI taken from Freebase or DBPedia (much like more formal methods, e.g. SKOS/RDF) thus permitting the unique identification of concepts and ameliorating some of our resource discovery problems (see Common Tags workflow diagram below). A variety of participating social bookmarking websites will also enable users to bookmark using Common Tags (e.g. ZigTag, Faviki, etc.). In short, Common Tags attempts to Semantic Web-ify tags using RDFa/XHTML compliant web pages and in so doing makes tags more useful in general resource discovery contexts. Faviki even describes them as Semantic Tags and employs the logo strap line, 'tags that make sense'. Common Tags won't solve everything, but it should at least improve recall and precision in certain circumstances, as well as offering the benefits of Semantic Web integration.
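The core idea is easy to sketch: free-text tags resolve to unambiguous concept URIs, so variant spellings collapse onto one concept and ambiguous strings can be told apart. The mappings below are illustrative (the DBPedia-style URIs are plausible but not taken from the Common Tag specification):

```python
# Sketch of concept-URI tagging: each surface tag resolves to a concept
# URI, so 'nyc', 'new_york_city' and 'newyork' all collocate, while the
# two senses of 'jaguar' are kept distinct. Mappings are illustrative.
tag_to_concept = {
    "nyc": "http://dbpedia.org/resource/New_York_City",
    "new_york_city": "http://dbpedia.org/resource/New_York_City",
    "newyork": "http://dbpedia.org/resource/New_York_City",
    "jaguar (animal)": "http://dbpedia.org/resource/Jaguar",
    "jaguar (car)": "http://dbpedia.org/resource/Jaguar_Cars",
}

def same_concept(tag_a, tag_b):
    """True if two surface tags denote the same underlying concept."""
    return tag_to_concept.get(tag_a) == tag_to_concept.get(tag_b)

print(same_concept("nyc", "newyork"))              # → True
print(same_concept("jaguar (animal)", "jaguar (car)"))  # → False
```

Retrieval over concept URIs rather than raw strings is what yields the recall gain (variants collocate) and the precision gain (homonyms separate) at the same time.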

So, in summary, collaborative tagging hasn't died, but at least now - at long last - it might become useful for something other than PIM. There is irony in the fact that formal description methods have to be used to improve tag utility, but will the evangelists see it? Probably not.

Friday, 12 June 2009

Serendipity reveals ontological description of BBC programmes

I have been enjoying Flight of the Conchords on BBC Four recently. Unfortunately, I missed the first couple of episodes of the new series. So that I could configure my Humax HDR to record all future episodes, I visited the BBC website to access their online schedule. It was while doing this that I discovered visible usage of the BBC's Programmes Ontology. The programme title (i.e. Flight of the Conchords) is hyperlinked to an RDF file on this schedule page.

The Semantic Web is supposed to provide machine readable data, not human readable data, and hyperlinking to an RDF/XML file is clearly a temporary glitch at the Beeb. After all, 99.99% of BBC users clicking on these links would be hoping to see further details about the programme, not to be presented with a bunch of angle brackets. Nevertheless, this glitch provides an interesting insight for us since it reveals the extent to which RDF data is being exposed on the Semantic Web about BBC programming, and the vocabularies the BBC are using. Researchers at the BBC are active in dissemination (e.g. ESWC2009, XTech 2008), but it's not often that you surreptitiously discover this sort of stuff in action at an organisation like this.

The Programme Ontology is based significantly on the Music Ontology Specification and the FOAF Vocabulary Specification; the BBC's data also deploys Dublin Core and SKOS – though in the example below these appear only in the namespace declarations.

Oh, and the next episode of Flight of the Conchords is on tonight at 23:00, BBC Four.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
xmlns:foaf = "http://xmlns.com/foaf/0.1/"
xmlns:po = "http://purl.org/ontology/po/"
xmlns:mo = "http://purl.org/ontology/mo/"
xmlns:skos = "http://www.w3.org/2008/05/skos#"
xmlns:time = "http://www.w3.org/2006/time#"
xmlns:dc = "http://purl.org/dc/elements/1.1/"
xmlns:dcterms = "http://purl.org/dc/terms/"
xmlns:wgs84_pos= "http://www.w3.org/2003/01/geo/wgs84_pos#"
xmlns:timeline = "http://purl.org/NET/c4dm/timeline.owl#"
xmlns:event = "http://purl.org/NET/c4dm/event.owl#">

<rdf:Description rdf:about="/programmes/b00l22n4.rdf">
<rdfs:label>Description of the episode Unnatural Love</rdfs:label>
<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-06-02T00:14:09+01:00</dcterms:created>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-06-02T00:14:09+01:00</dcterms:modified>
<foaf:primaryTopic rdf:resource="/programmes/b00l22n4#programme"/>
</rdf:Description>

<po:Episode rdf:about="/programmes/b00l22n4#programme">
<dc:title>Unnatural Love</dc:title>
<po:short_synopsis>Jemaine accidentally goes home with an Australian girl he meets at a nightclub.</po:short_synopsis>
<po:medium_synopsis>Comedy series about two Kiwi folk musicians in New York. When Bret and Jemaine go out nightclubbing with Dave, Jemaine accidentally goes home with an Australian girl.</po:medium_synopsis>
<po:long_synopsis>When Bret and Jemaine go out nightclubbing with Dave, Jemaine accidentally goes home with an Australian girl. At first plagued by shame and self-doubt, he comes to care about her, much to Bret and Murray&#39;s annoyance. Can their love cross the racial divide?</po:long_synopsis>
<po:masterbrand rdf:resource="/bbcfour#service"/>
<po:position rdf:datatype="http://www.w3.org/2001/XMLSchema#int">5</po:position>
<po:genre rdf:resource="/programmes/genres/comedy/music#genre" />
<po:genre rdf:resource="/programmes/genres/comedy/sitcoms#genre" />
<po:version rdf:resource="/programmes/b00l22my#programme" />
</po:Episode>

<po:Series rdf:about="/programmes/b00kkptn#programme">
<po:episode rdf:resource="/programmes/b00l22n4#programme"/>
</po:Series>

<po:Brand rdf:about="/programmes/b00kkpq8#programme">
<po:episode rdf:resource="/programmes/b00l22n4#programme"/>
</po:Brand>
</rdf:RDF>
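Out of curiosity, fields can be pulled from RDF/XML like the file above using nothing more than the Python standard library (a dedicated RDF library such as rdflib would be the more robust choice; the fragment below is trimmed from the BBC example):

```python
import xml.etree.ElementTree as ET

# A trimmed fragment of the BBC episode RDF shown above.
RDF_XML = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:po="http://purl.org/ontology/po/">
  <po:Episode rdf:about="/programmes/b00l22n4#programme">
    <dc:title>Unnatural Love</dc:title>
    <po:short_synopsis>Jemaine accidentally goes home with an Australian girl he meets at a nightclub.</po:short_synopsis>
  </po:Episode>
</rdf:RDF>"""

# Namespace prefixes used when querying the tree.
NS = {
    "po": "http://purl.org/ontology/po/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(RDF_XML)
episode = root.find("po:Episode", NS)
print(episode.find("dc:title", NS).text)           # → Unnatural Love
print(episode.find("po:short_synopsis", NS).text)
```

Treating RDF/XML as plain XML like this only works for simple, predictable serialisations; the same triples can legally be serialised in several shapes, which is exactly why a proper RDF parser is preferable for anything beyond a quick look.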

Quasi-facetted retrieval of images using emotions?

As part of my literature catch up I found an extremely interesting paper in JASIST by S. Schmidt and Wolfgang G. Stock entitled 'Collective indexing of emotions in images: a study in emotional information retrieval'. The motivation behind the research is simple: images tend to elicit emotional responses in people. Is it therefore possible to capture these emotional responses and use them in image retrieval?

An interesting research question indeed, and Schmidt and Stock's study found that 'yes', it is possible to capture these emotional responses and use them. In brief, their research asked circa 800 users to tag a variety of public images from Flickr using their scroll-bar tagging system. This scroll-bar tagging system allowed users to tag images according to a series of specially selected emotional responses and to indicate the intensity of these emotions. Schmidt and Stock found that users tended to have favourite emotions and this can obviously differ between users; however, for a large proportion of images the consistency of emotion tagging is very high (i.e. a large proportion of users frequently experience the same emotional response to an image). It's a complex area of study and their paper is recommended reading precisely for this reason (capturing emotions anyone?!), but their conclusions suggest that:
"…it seems possible to apply collective image emotion tagging to image information systems and to present a new search option for basic emotions."
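That consistency finding can be sketched in miniature. Assuming (purely for illustration; none of these names or figures come from the paper) that each tag records a user, an image, an emotion and a scroll-bar intensity, a crude per-image consensus measure looks like this:

```python
from collections import defaultdict

# Invented tag records (user, image, emotion, intensity 0-100),
# loosely mimicking a scroll-bar tagging system.
tags = [
    ("u1", "img1", "happiness", 80),
    ("u2", "img1", "happiness", 60),
    ("u3", "img1", "sadness", 40),
    ("u1", "img2", "fear", 90),
    ("u2", "img2", "fear", 70),
]

def emotion_consensus(tags):
    """Per image: the dominant emotion, the share of taggers who chose
    it (a crude consistency measure) and its mean intensity."""
    by_image = defaultdict(lambda: defaultdict(list))
    for user, image, emotion, intensity in tags:
        by_image[image][emotion].append(intensity)
    result = {}
    for image, emotions in by_image.items():
        total = sum(len(v) for v in emotions.values())
        emotion, intensities = max(emotions.items(), key=lambda kv: len(kv[1]))
        result[image] = (emotion, len(intensities) / total,
                        sum(intensities) / len(intensities))
    return result
```

With these toy records, 'img1' gets two-thirds agreement on happiness; a real system would need far more taggers before such a share meant anything.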
To what extent does the image above (by D Sharon Pruitt) make you feel happiness, anger, sadness, disgust or fear? It is early days, but the future application of such tools could find a place within the growing suite of image filters that many search engines have recently unveiled. For example, yesterday Keith Trickey was commenting on the fact that the image filters in Bing are better than Google or Yahoo!. True. There are more filters, and they seem to work better. In fact, they provide a species of quasi-taxonomical facets: (by) size, layout, color, style and people. It's hardly Ranganathan's PMEST, but – keeping in mind that no human intervention is required - it's a useful quasi-facet way of retrieving or filtering images, albeit flat.

An emotional facet, based on Schmidt and Stock's research, could easily be added to systems like Bing. In the medium term it is Yahoo! that will be more in a position to harness the potential of emotional tagging. They own Flickr and have recently incorporated the searching and filtering of Flickr images within Yahoo! Image Search. As Yahoo! are keen for us to use Image Search to find CC images for PowerPoint presentations, or to illustrate a blog, being able to filter by emotions would be a useful addition to the filtering arsenal.
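A minimal sketch of how such an emotion facet might sit alongside the existing size/colour/style filters; the field names and records below are entirely invented for illustration:

```python
# Toy image index: each record carries conventional facets plus a
# consensus emotion of the kind Schmidt and Stock's tagging produces.
images = [
    {"id": "img1", "size": "large", "colour": "colour", "emotion": "happiness"},
    {"id": "img2", "size": "small", "colour": "bw",     "emotion": "fear"},
    {"id": "img3", "size": "large", "colour": "colour", "emotion": "fear"},
]

def filter_images(images, **facets):
    """Keep only the images matching every requested facet value."""
    return [img for img in images
            if all(img.get(k) == v for k, v in facets.items())]
```

An emotion facet then costs no more than any other flat filter: `filter_images(images, size="large", emotion="fear")` narrows the set just as a colour filter would.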

Thursday, 11 June 2009

Bada Bing!

So much has been happening in the world of search engines since the spring, as the postings on this blog attest. All the (best) search engines have been active in improving user tools, features, extra search functionality, etc., and there is a real sense that some serious competition is underway. It's all exciting stuff…

Last week Microsoft officially released its new Bing search engine. I've been using it, and it has found things Google hasn't been able to. The critics have been extremely impressed by Bing too, and some figures suggest it is stealing market share and pushing Yahoo! out of the number 2 spot. What about number 1?

The trouble is that it doesn't matter how good your search engine is: it will always struggle to interrupt users' habitual use of Google. Indeed, Google's own research has demonstrated that the mere presence of the Google logo atop a result set is a key determinant of whether a user is satisfied with their results. In effect, users can be shown Yahoo! results branded as Google, and vice versa, and they will consistently prefer whichever set carries the Google branding. Users are generally unable to tell whether there is any real difference in the results (i.e. their precision, relevance, etc.) and are more influenced by the brand and their past experience. It's depressing, but a reality for the likes of Microsoft, Yahoo!, Ask, etc.

Francis Muir has the 'Microsoft mantra'. He predicts that in the long run Microsoft will always dominate Google – and I am starting to agree with him. Microsoft sits back, waits for things to unfold, and then develops something better than its previously dominant competitors. True – they were caught on the back foot with Web searching, but Bing is at least as good as Yahoo!, perhaps better, and it will only improve. Their contribution to cloud computing (SkyDrive) offers 25GB of storage, integration with Office and email, etc., and is far better than anything else available. Google documents? Pah! Who are you going to share those with? And then you consider Microsoft's dominance in software, operating systems, programming frameworks, databases, etc. Integrating and interoperating with this stuff over the Web is a significant part of the Web's future. Google is unlikely to be part of this, and for once I'm pleased.

It is not Microsoft's intention to take on Google's dominance of the Web at the moment. But I reckon Bing is certainly part of the long-term strategy. The Muir prophecy is one step closer methinks.

Cracking open metadata and cataloguing research with Resource Description & Access (RDA)

I have been taking the opportunity to catch up with some recently published literature over the past couple of weeks. While perusing the latest issue of the Bulletin of the American Society for Information Science and Technology (the magazine which complements JASIST), I read an interesting article by Shawne D. Miksa (associate professor at the College of Information, University of North Texas). Miksa's principal research interests reside in metadata, cataloguing and indexing. She has been active in disseminating work on Resource Description & Access (RDA) and has a book in the pipeline designed to demystify it.

RDA has been in development for several years now, is the successor to AACR2 and provides rules and guidance on the cataloguing of information entities. I use the phrase 'information entities' since RDA departs significantly from AACR2. The foundations of AACR2 were created prior to the advent of the Web and this remains problematic given the digital and new media information environment in which we now exist. Of course, more recent editions of AACR2 have attempted to better accommodate these developments, but fire-fighting was always the order of the day. The now re-named Joint Steering Committee for the Development of RDA has known for quite some time that an entirely new approach was required – and a few years ago radical changes to AACR2 were announced. As my ex-colleague Gordon Dunsire describes in a recent D-Lib Magazine article:
"RDA: Resource Description and Access is in development as a new standard for resource description and access designed for the digital world. It is being built on the foundation established for the Anglo-American Cataloguing Rules (AACR). Although it is being developed for use primarily in libraries, it aims to attain an effective level of alignment with the metadata standards used in related communities such as archives, museums and publishers, and to provide a better fit with emerging database technologies."
The ins and outs of RDA are a bit much for this blog; suffice to say that RDA is ultimately designed to improve the resource discovery potential of digital libraries and other retrieval systems by utilising the FRBR conceptual entity-relationship model (see this entity-relationship diagram at the FRBR blog). FRBR provides a holistic approach to users' retrieval requirements by establishing the relationships between information entities and allowing users to traverse the hierarchical relationships therein. I am an advocate of FRBR and appreciate its retrieval potential. Indeed, I often direct postgraduate students to Fiction Finder, an OCLC Research prototype which demonstrates the FRBR Work-Set Algorithm.
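As a rough illustration of the Group 1 hierarchy that FRBR-ised systems traverse (Work, Expression, Manifestation, Item), here is a deliberately simplified sketch; the attribute names are mine, not RDA's:

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    barcode: str                  # a single physical/digital copy

@dataclass
class Manifestation:
    fmt: str                      # e.g. "hardback", "ebook"
    items: list = field(default_factory=list)

@dataclass
class Expression:
    language: str                 # a realisation of the work
    manifestations: list = field(default_factory=list)

@dataclass
class Work:
    title: str                    # the abstract intellectual creation
    expressions: list = field(default_factory=list)

    def all_items(self):
        """Traverse Work -> Expression -> Manifestation -> Item."""
        return [i for e in self.expressions
                  for m in e.manifestations
                  for i in m.items]

hamlet = Work("Hamlet", [
    Expression("en", [Manifestation("hardback", [Item("b1"), Item("b2")])]),
    Expression("fr", [Manifestation("ebook", [Item("b3")])]),
])
```

The retrieval payoff is exactly this traversal: a user who finds the Work can walk down to every translation, format and copy, rather than sifting flat records.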

Reading Miksa's article was interesting for two reasons. Firstly, RDA has fallen off my radar recently. I used to be kept abreast of RDA development through the activities of my colleague Gordon, who also disseminates widely on RDA and feeds into the JSC's work. Miksa's article – which announces the official release of RDA in the second half of 2009 – was almost like being in a time machine! RDA is here already! Wow! It seems like only last week that the JSC started work on RDA (…but it was actually over five years ago…).

The development of RDA has been extremely controversial, and Miksa alludes to this in her article – metadata gurus clashing with traditional cataloguers clashing with LIS revolutionaries. It has been pretty ugly at times. But secondly – and perhaps more importantly – Miksa's article is a brilliant call to arms for more metadata research. Not only that, she notes areas where extensive research will be mandatory to bring truly FRBR-ised digital libraries to fruition. This includes consideration of how this impacts upon LIS education.

A new dawn? I think so… Can the non-believers grumble about that? Between the type of developments noted earlier and RDA, the future of information organisation is alive and kicking.

Thursday, 4 June 2009

Fight! Google Squared vs. WolframAlpha

By now we all realise that WolframAlpha is not intended to compete with Google's Universal Search; it's a 'computational knowledge engine' designed to serve up facts, data and scientific knowledge and is an entirely different beast. Nevertheless, Google is not a company to be outdone and has just announced the release of Google Squared which, if the technology press is to be believed, is Google's attempt to usurp WolframAlpha's grip on offering up facts, data and knowledge. Indeed, Google attempted to steal WolframAlpha's thunder by announcing that Google Squared was in development on the same day Stephen Wolfram was unveiling WolframAlpha for the first time a few weeks ago. Meow!

In the same way that WolframAlpha occupies a different intellectual space to most web search engines, Google Squared seems to be quite different to WolframAlpha. Says the Official Google Blog:
"Google Squared is an experimental search tool that collects facts from the web and presents them in an organized collection, similar to a spreadsheet. If you search for [roller coasters], Google Squared builds a square with rows for each of several specific roller coasters and columns for corresponding facts, such as image, height and maximum speed."
Google Squared appears to work best when the query submitted is conducive to comparing species of, say, snakes or country rock bands. With the former you retrieve a variety of snake types, images and descriptions, as well as biological taxonomic classification data; with the latter, genre and date of band formation are retrieved (including Dillard & Clark and the Flying Burrito Brothers), in addition to images and descriptions. Many of the data values are incorrect, but Google has been quite forthright in stating that Google Squared is extremely experimental ("This technology is by no means perfect"; "Google Squared is an experimental search tool"). Of course, Google wants us to explore their canned searches, such as Rollercoasters or African countries, to best appreciate what is possible.
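The 'square' itself is just an entity-by-attribute table. A toy reconstruction, using a couple of hand-entered roller coaster facts rather than anything actually harvested from the web, might look like this:

```python
# Hand-made facts standing in for whatever Google Squared scrapes;
# rows are entities, columns are attributes.
facts = {
    "Kingda Ka":         {"height_m": 139, "max_speed_kmh": 206},
    "Steel Dragon 2000": {"height_m": 97,  "max_speed_kmh": 153},
}

def build_square(facts, columns):
    """Flatten per-entity facts into one row per entity, leaving a gap
    (None) wherever an attribute is unknown."""
    return [[name] + [attrs.get(c) for c in columns]
            for name, attrs in sorted(facts.items())]

square = build_square(facts, ["height_m", "max_speed_kmh"])
```

The hard part, of course, is not laying out the table but extracting trustworthy values from the open web, which is precisely where the incorrect data creeps in.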

As we noted recently though, place names are good tests for these systems and, like WolframAlpha, some bizarre results are retrieved. A search for Liverpool seems only to retrieve facts on assorted Liverpool F.C. players, and Glasgow retrieves persons associated with Glasgow and the death of Glasgow Central train station in 1989(!) I had hoped Google Squared's comparative power might have pulled together facts and statistics on Glasgow (UK) with the ten or so places named Glasgow in the USA and Canada. A similar result would have been expected for Liverpool or Manchester (which has far, far more namesakes), but alas. This is a particular shame since much of this data is available on Wikipedia in a relatively structured format, with disambiguation pages to help.

Google Squared allows users to correct data, remove erroneous results or suggest better results. The effect of this is a dynamically evolving result set. A search for a popular topic an hour ago can yield an entirely different result an hour later. All of this will help Google Squared become more accurate and cleverer over time.
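In miniature, that correction mechanism might amount to folding user edits into the base fact table so that a later identical query sees the amended values (entities, attributes and figures below are all illustrative, not real Google Squared data):

```python
# Base table with one missing and one dubious value.
base = {
    "Liverpool": {"founded": None, "population": 435000},
}

# User-submitted corrections: (entity, attribute, new value).
corrections = [
    ("Liverpool", "founded", 1207),       # user supplies a missing value
    ("Liverpool", "population", 466400),  # user amends a wrong value
]

def apply_corrections(table, corrections):
    """Overwrite (or create) each corrected attribute in place."""
    for entity, attr, value in corrections:
        table.setdefault(entity, {})[attr] = value
    return table
```

A production system would clearly need to weigh conflicting corrections rather than take the latest edit at face value, but the evolving-result-set behaviour falls out of even this naive version.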

Although Google Squared and WolframAlpha are quite different, there is enough overlap to compare them, and on current form the score is 1-0 to WolframAlpha.