
Wednesday, 23 June 2010

Visualising the metadata universe

No blog postings for almost three months and then two come along at once... I thought it would be worth drawing readers' attention to the recent work of Jenn Riley of Indiana University. Jenn is currently the metadata guru for the Indiana University Digital Library Program, and yesterday on the Dublin Core list she announced the output of a project to build a conceptual model of the 'metadata universe'.

As evidenced by some of my blog posts, there are literally hundreds of metadata standards and structured data formats available, each with its own acronym. Matters seem to have become more complicated with the emergence of numerous XML-based standards in the early to mid noughties, and the more recent proliferation of RDF vocabularies for the Semantic Web and the associated Linked Data drive. What formats exist? How do they relate to each other? For which communities of practice are they optimised, e.g. the information industry or the cultural sector? Which metadata, technical standards and vocabularies should I be cognisant of in my area? And so the question list goes on...


These questions can be difficult to answer, and it is for this reason that Jenn Riley has produced a gigantic poster diagram (above) entitled, 'Seeing standards: a visualization of the metadata universe'.  The diagram achieves what a good model should, i.e. simplifying complex phenomena and presenting a large volume of information in a condensed way.  As the website blurb states:
"Each of the 105 standards listed here is evaluated on its strength of application to defined categories in each of four axes: community, domain, function, and purpose. The strength of a standard in a given category is determined by a mixture of its adoption in that category, its design intent, and its overall appropriateness for use in that category."
A useful conceptual tool for academics, practitioners and students alike.  A glossary of metadata standards in either poster or pamphlet form is also available.

Thursday, 8 October 2009

AJAX content made discoverable...soon

I follow the Official Google Webmaster Central Blog. It can be an interesting read at times, but on other occasions it provides humdrum information on how best to optimise a website, or answers questions to which most of us already know the answers (e.g. recently we had, 'Does page metadata influence Google page rankings?'). However, the latest posting is one of the exceptions. Google have just announced that they are proposing a new standard to make AJAX-based websites indexable and, by extension, discoverable to users. Good ho!

The advent of Web 2.0 has brought about a huge increase in interactive websites and dynamic page content, much of which has been delivered using AJAX ('Asynchronous JavaScript and XML', not a popular household cleaner!). AJAX is great and furnished me with my iGoogle page years ago; but increasingly websites use it to deliver page content which might otherwise be delivered as static XHTML pages. This presents a big problem for search engines because AJAX-delivered content is currently un-indexable (if that is a word!), and a lot of content is therefore invisible to them. Indeed, the latest web design mantra has been "don't publish in AJAX if you want your website to be visible". (There are also accessibility and usability issues, but these are an aside for this posting...)

The Webmaster Blog summarises:
"While AJAX-based websites are popular with users, search engines traditionally are not able to access any of the content on them. The last time we checked, almost 70% of the websites we know about use JavaScript in some form or another. Of course, most of that JavaScript is not AJAX, but the better that search engines could crawl and index AJAX, the more that developers could add richer features to their websites and still show up in search engines."
Google's proposal involves shifting part of the responsibility for making a website indexable onto its administrator/webmaster, who would set up a headless browser on the web server. (A headless browser is essentially a browser without a user interface: a piece of software that can access web documents but does not deliver them to human users.) The headless browser would be used to programmatically access the AJAX website on the server and provide an HTML 'snapshot' to search engines when they request it - which is a clever idea. The crux of Google's proposal is a suite of URL conventions that tell the search engine when to request the headless-browser output (i.e. the HTML snapshot) and which URL to reveal to human users.
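To make the mechanics more concrete, here is a rough sketch of the kind of URL mapping and server-side dispatch such a proposal implies. The specific '#!' token and '_escaped_fragment_' parameter are my own illustration of what a convention like this could look like, rather than details quoted from Google's announcement.

```python
# Illustrative sketch only: the "#!" token and "_escaped_fragment_" parameter
# are assumptions about how the URL convention might work, not a quote from
# the proposal itself.
from urllib.parse import quote

def crawler_url_for(pretty_url):
    """Translate an AJAX URL like /news#!story=42 into the URL a crawler
    would fetch to obtain the pre-rendered HTML snapshot."""
    if "#!" not in pretty_url:
        return pretty_url                      # nothing dynamic to expose
    base, fragment = pretty_url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return f"{base}{sep}_escaped_fragment_={quote(fragment)}"

def handle_request(url, render_snapshot):
    """Server-side dispatch: if the crawler asks for the escaped-fragment
    form, return the HTML produced by the headless browser; otherwise serve
    the normal AJAX page to human users."""
    if "_escaped_fragment_=" in url:
        state = url.split("_escaped_fragment_=", 1)[1]
        return render_snapshot(state)          # e.g. headless-browser output
    return "<html><!-- normal AJAX shell for browsers --></html>"

print(crawler_url_for("/news?lang=en#!story=42"))
# -> /news?lang=en&_escaped_fragment_=story%3D42
print(handle_request("/news?lang=en&_escaped_fragment_=story%3D42",
                     lambda state: f"<html><!-- snapshot for {state} --></html>"))
```

The key point is that the crawler never has to execute JavaScript: it simply fetches a second, static URL whose content the webmaster promises is equivalent to what users see.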

It's good that Google are taking the initiative; my only concern is that they start trying to re-write standards, as they have a little with RDFa. Their slides are below - enjoy!

Friday, 26 June 2009

Read all about it: interesting contributions at ISKO-UK 2009

I had the pleasure of attending the ISKO-UK 2009 conference earlier this week at University College London (UCL), organised in association with the Department of Information Studies. This was my first visit to the home of the architect of Utilitarianism, Jeremy Bentham, and my first look at the nearby St. Pancras International since its revamp - and what a smart train station it is.

The ISKO conference theme was 'content architecture', with a particular focus on:
  • "Integration and semantic interoperability between diverse resources – text, images, audio, multimedia
  • Social networking and user participation in knowledge structuring
  • Image retrieval
  • Information architecture, metadata and faceted frameworks"
The themes underlying most papers related to the Semantic Web and Linked Data, and to approaches inspired by them for resolving or ameliorating common problems within our disciplines. A great many interesting papers were delivered and it is difficult to say something about them all; however, for me, there were particular highlights (in no particular order)...

Libo Eric Si (et al.) from the Department of Information Science at Loughborough University described research to develop a prototype middleware framework that sits between disparate terminology resources in order to facilitate subject cross-browsing of information and library portal systems. A lot of work has already been undertaken in this area (see, for example, the HILT project (a project in which I used to be involved) and CrissCross), so it was interesting to hear about his 'bag' approach, in which - rather than using precise mappings between different Knowledge Organisation Systems (KOS) (e.g. thesauri, subject heading lists, taxonomies, etc.) - "a number of relevant concepts could be put into a 'bag', and the bag is mapped to an equivalent DDC concept. The bag becomes a very abstract concept that may not have a clear meaning, but based on the evaluation findings, it was widely-agreed that using a bag to combine a number of concepts together is a good idea".
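A minimal sketch of the 'bag' idea as I understand it is given below; the terms, schemes and target DDC class are invented purely for illustration and are not taken from the paper.

```python
# A minimal sketch of the 'bag' approach: several terms drawn from different
# KOS are grouped into a bag, and the bag as a whole (rather than each term)
# is mapped to a single DDC concept. All values here are invented examples.
bags = {
    "bag:climate-change": {
        "members": {
            "lcsh": ["Climatic changes"],
            "aat": ["global warming"],
            "local": ["Climate"],
        },
        "ddc": "551.6",   # hypothetical target Dewey class for the whole bag
    },
}

def ddc_for(scheme, term):
    """Resolve a term from any member KOS to the DDC class of its bag."""
    for bag in bags.values():
        if term in bag["members"].get(scheme, []):
            return bag["ddc"]
    return None

print(ddc_for("aat", "global warming"))   # -> 551.6
```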

Brian Matthews (et al.) reported on an evaluation of social tagging and KOS. In particular, they investigated ways of enhancing social tagging via KOS, with a view to improving the quality of tags and, in turn, retrieval performance. A detailed and robust methodology was provided, but essentially groups of participants were given the opportunity to tag resources using free-text tags, controlled terms (i.e. from KOS), or terms displayed in a tag cloud, all within a specially designed demonstrator. Participants were later asked to try the alternative tools in order to gather data on the nature of user preferences. There are numerous findings - and a pre-print of the paper is already available on the conference website so you can read these yourself - but the main ones, some of them surprising, can be summarised from their paper as follows:
  • "Users appreciated the benefits of consistency and vocabulary control and were potentially willing to engage with the tagging system;
  • There was evidence of support for automated suggestions if they are appropriate and relevant;
  • The quality and appropriateness of the controlled vocabulary proved to be important;
  • The main tag cloud proved problematic to use effectively; and,
  • The user interface proved important along with the visual presentation and interaction sequence."
The user preference for controlled terms was reassuring. In fact, as Matthews et al. report:
"There was general sentiment amongst the depositors that choosing terms from a controlled vocabulary was a "Good Thing" and better than choosing their own terms. The subjects could overall see the value of adding terms for information retrieval purposes, and could see the advantages of consistency of retrieval if the terms used are from an authoritative source."
Chris Town from the University of Cambridge Computer Laboratory presented two (see [1], [2]) equally interesting papers relating to image retrieval on the Web. Although images and video now comprise much of the Web's content, the vast majority of retrieval systems essentially use the text, tags, etc. that surround images in order to make assumptions about what an image might be. Of course, using any major search engine we discover that this approach is woefully inaccurate. Dr. Town has developed improved approaches to content-based image retrieval (CBIR) which provide a novel way of bridging the 'semantic gap' between the retrieval model used by the system and that of the user. His approach is founded on the "notion of an ontological query language, combined with a set of advanced automated image analysis and classification models". This approach has been so successful that he has founded his own company, Imense. The difference in performance between Imense and Google is staggering and has to be seen to be believed. Examples can be found in his presentation slides (which will be on the ISKO website soon), but can also be observed by simply messing around with the Imense Picture Search.

Chris Town's second paper essentially explored how best to do the CBIR image processing required for the retrieval system. According to Dr. Town there are approximately 20 billion images on the web, with the majority at a high resolution, meaning that by his calculation it would take 4000 years to undertake the necessary CBIR processing to facilitate retrieval! Phew! Large-scale grid computing options therefore have to be explored if the approach is to be scalable. Chris Town and his colleague Karl Harrison therefore undertook a series of CBIR processing evaluations by distributing the required computational task across thousands of Grid nodes. This distributed approach resulted in the processing of over 25 million high resolution images in less than two weeks, thus making grid processing a scalable option for CBIR.
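For what it's worth, here is my own back-of-the-envelope reconstruction of those figures; the per-image estimate is rough arithmetic from the numbers quoted, not anything Dr. Town presented.

```python
# My own back-of-the-envelope reconstruction of the figures quoted in the
# talk; treat the per-image estimate as rough illustration only.
SECONDS_PER_YEAR = 365 * 24 * 3600

web_images = 20e9                # ~20 billion images on the web (quoted)
single_machine_years = 4000      # quoted estimate for processing them all
secs_per_image = single_machine_years * SECONDS_PER_YEAR / web_images
print(f"~{secs_per_image:.1f} seconds of CBIR processing per image")   # ~6.3 s

# The Grid run processed ~25 million images in under two weeks...
grid_images = 25e6
one_machine_years = grid_images * secs_per_image / SECONDS_PER_YEAR
print(f"...work that would take a single machine ~{one_machine_years:.0f} years")
```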

Andreas Vlachidis (et al.) from the Hypermedia Research Unit at the University of Glamorgan described the use of 'information extraction' employing Natural Language Processing (NLP) techniques to assist in the semantic indexing of archaeological text resources. Such 'Grey Literature' is a good test bed as more established indexing techniques are insufficient in meeting user needs. The aim of the research is to create a system capable of being "semantically aware" during document indexing. Sounds complicated? Yes - a little. Vlachidis is achieving this by using a core cultural heritage ontology and the English Heritage Thesauri to support the 'information extraction' process, which in turn supports "a semantic framework in which indexed terms are capable of supporting semantic-aware access to on-line resources".
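At its crudest, the indexing step can be pictured as gazetteer-style matching of thesaurus terms against the text, as in the sketch below; the thesaurus entries and concept identifiers are invented, and the real system uses a full NLP pipeline and ontology rather than naive string matching.

```python
# A much-simplified sketch of gazetteer-style semantic indexing: terms from
# a thesaurus are matched in free text and each hit is recorded against a
# concept identifier. The entries and identifiers below are invented; the
# real work uses the English Heritage Thesauri and a cultural heritage
# ontology within an NLP pipeline.
import re

thesaurus = {
    "round barrow": "eh:monument/ROUND_BARROW",
    "flint scraper": "eh:object/FLINT_SCRAPER",
}

def annotate(text):
    """Return (matched term, concept id, character offset) triples."""
    hits = []
    for term, concept in thesaurus.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((m.group(0), concept, m.start()))
    return hits

report = "Excavation revealed a round barrow and several flint scrapers."
print(annotate(report))
```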

Perhaps the most interesting aspect of the conference was that it was well attended by people from outside the academic fraternity, and as such there were papers on how their organisations are doing innovative work with a range of technologies, specifications and standards which, to a large extent, remain the preserve of researchers and academics. Papers were delivered by technical teams at the World Bank and Dow Jones, for example. The most interesting contribution from the 'real world', though, was that delivered by Tom Scott, a key member of the BBC's online and technology team. Tom is a key proponent of the Semantic Web and Linked Data at the BBC and his presentation threw light on BBC activity in this area - and rather coincidentally complemented an accidental discovery I made a few weeks ago.

Tom currently leads the BBC Earth project, which aims to bring more of the BBC's Natural History content online and bring the BBC into the Linked Data cloud, thus enabling intelligent linking, re-use and re-aggregation with what's already available. He provided interesting examples of how the BBC is exposing structured data about all forms of BBC programming on the Web by adopting a Linked Data approach, and he expressed a desire for users to be able to traverse detailed and well-connected RDF graphs. Says Tom on his blog:
"To enable the sharing of this data in a structured way, we are using the linked data approach to connect and expose resources i.e. using web technologies (URLs and HTTP etc.) to identify and link to a representation of something, and that something can be person, a programme or an album release. These resources also have representations which can be machine-processable (through the use of RDF, Microformats, RDFa, etc.) and they can contain links for other web resources, allowing you to jump from one dataset to another."
Whilst Tom conceded that this work is small compared to the entire output and technical activity of the BBC, it still constitutes a huge volume of data and is significant owing to the BBC's pre-eminence in broadcasting. Tom even reported that a SPARQL endpoint would be made available to query this data. I had actually hoped to ask Tom a few questions during the lunch and coffee breaks, but he was such a popular guy that in the end I lost my chance, such is the existence of a popular techie from the Beeb.
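For the curious, querying such an endpoint might look something like the sketch below. Since the endpoint was only promised at the time of writing, the URL is a placeholder, and the use of the BBC Programmes Ontology ('po:') is my assumption about the vocabulary that would be exposed.

```python
# Purely illustrative: the endpoint URL is a placeholder and the vocabulary
# choice (Programmes Ontology plus Dublin Core) is an assumption on my part.
QUERY = """
PREFIX po: <http://purl.org/ontology/po/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?episode ?title WHERE {
  ?episode a po:Episode ;
           dc:title ?title .
} LIMIT 10
"""

ENDPOINT = "http://example.org/bbc/sparql"   # hypothetical endpoint

# With a real endpoint one might run this via SPARQLWrapper, e.g.:
#   from SPARQLWrapper import SPARQLWrapper, JSON
#   sparql = SPARQLWrapper(ENDPOINT)
#   sparql.setQuery(QUERY)
#   sparql.setReturnFormat(JSON)
#   results = sparql.query().convert()
print(QUERY)
```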

Pre-print papers from the conference are available on the proceedings page of the ISKO-UK 2009 website; however, fully peer reviewed and 'added value' papers from the conference are to be published in a future issue of Aslib Proceedings.

Friday, 12 June 2009

Quasi-facetted retrieval of images using emotions?

As part of my literature catch-up I found an extremely interesting paper in JASIST by S. Schmidt and Wolfgang G. Stock entitled, 'Collective indexing of emotions in images: a study in emotional information retrieval'. The motivation behind the research is simple: images tend to elicit emotional responses in people. Is it therefore possible to capture these emotional responses and use them in image retrieval?

An interesting research question indeed, and Schmidt and Stock's study found that 'yes', it is possible to capture these emotional responses and use them. In brief, their research asked circa 800 users to tag a variety of public images from Flickr using a scroll-bar tagging system, which allowed users to tag images according to a series of specially selected emotional responses and to indicate the intensity of those emotions. Schmidt and Stock found that users tended to have favourite emotions, and these obviously differ between users; however, for a large proportion of images the consistency of emotion tagging was very high (i.e. most users experienced the same emotional response to a given image). It's a complex area of study and their paper is recommended reading precisely for this reason (capturing emotions, anyone?!), but their conclusions suggest that:
"…it seems possible to apply collective image emotion tagging to image information systems and to present a new search option for basic emotions."
To what extent does the image above (by D Sharon Pruitt) make you feel happiness, anger, sadness, disgust or fear? It is early days, but the future application of such tools could find a place within the growing suite of image filters that many search engines have recently unveiled. For example, yesterday Keith Trickey was commenting on the fact that the image filters in Bing are better than those in Google or Yahoo!. True. There are more filters, and they seem to work better. In fact, they provide a species of quasi-taxonomical facets: (by) size, layout, color, style and people. It's hardly Ranganathan's PMEST, but - keeping in mind that no human intervention is required - it's a useful quasi-faceted way of retrieving or filtering images, albeit a flat one.

An emotional facet, based on Schmidt and Stock's research, could easily be added to systems like Bing. In the medium term, though, it is Yahoo! that is better placed to harness the potential of emotional tagging. They own Flickr and have recently incorporated the searching and filtering of Flickr images within Yahoo! Image Search. As Yahoo! are keen for us to use Image Search to find CC images for PowerPoint presentations, or to illustrate a blog, being able to filter by emotion would be a useful addition to the filtering arsenal.
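To picture how an emotion facet might sit alongside the existing filters, here is a toy sketch; the scores are invented, standing in for mean intensities aggregated from many users' scroll-bar ratings as in Schmidt and Stock's study.

```python
# Toy sketch of an 'emotion facet'. The scores are invented placeholders for
# aggregated user ratings; a real system would store per-image distributions.
images = [
    {"id": "beach.jpg", "emotions": {"happiness": 0.8, "sadness": 0.1}},
    {"id": "ruin.jpg",  "emotions": {"happiness": 0.2, "sadness": 0.7}},
]

def filter_by_emotion(items, emotion, threshold=0.5):
    """Keep images whose aggregated score for the chosen emotion is high."""
    return [i["id"] for i in items if i["emotions"].get(emotion, 0) >= threshold]

print(filter_by_emotion(images, "happiness"))   # -> ['beach.jpg']
```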

Friday, 5 December 2008

Some general musings on tag clouds, resource discovery and pointless widgets...

The efficacy of collaborative tagging in information retrieval and resource discovery has undergone some discussion on this blog in the past. Despite emerging a good couple of years ago - and like many other Web 2.0 developments - collaborative tagging remains a topic of uncertainty; an area lacking sufficient evaluation and research. A creature of collaborative tagging which has similarly evaded adequate evaluation is the (seemingly ubiquitous!) 'tag cloud'. Invented by Flickr (Flickr tag cloud) and popularised by delicious (and aren't you glad they dropped the irritating full stops in their name and URL a few months ago?), tag clouds are everywhere, cluttering interfaces with their differently and irritatingly sized fonts.

Coincidentally, a series of tag cloud themed research papers were discussed at one of our recent ISG research group meetings. One of the papers under discussion (Sinclair & Cardew-Hall, 2008) reported an experimental study comparing the usage and effectiveness of tag clouds with traditional search interface approaches to information retrieval. Their work is welcome since it constitutes one of the few robust evaluations of tag clouds to have appeared since they emerged several years ago.

One would hate to consider tag clouds completely useless - and I have to admit to harbouring this thought. Fortunately, Sinclair and Cardew-Hall found tag clouds to be not entirely without merit. Whilst they are not conducive to precision retrieval and often conceal relevant resources, the authors found that users reported them useful for broad browsing and/or non-specific resource discovery. They were also found to be useful in characterising the subject nature of the databases to be searched, thus aiding the information seeking process. The utility of tag clouds therefore remains confined to the search behaviours of inexperienced searchers and, as the authors conclude, they cannot displace traditional search facilities or taxonomic browsing structures. As always, further research is required...

The only thing saving tag clouds from being completely useless is that they can occasionally assist you in finding something useful, perhaps serendipitously. What would be the point in having a tag cloud that didn't help you retrieve any information at all? Answer: There wouldn't be any point; but this doesn't stop some people. Recently we have witnessed the emergence of 'tag cloud generation' tools. Such tools generate tag clouds for Web pages, or text entered by the user. Wordle is one such example. They look nice and create interesting visualisations, but don't seem to do anything other than take a paragraph of text and increase the size of words based on frequency. (See the screen shot of a Wordle tag cloud for my home page research interests.)
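Indeed, the core of such a generator can be sketched in a few lines (a simplification, of course: Wordle adds clever layout and typography on top, but the essence is just word frequency mapped to font size).

```python
# A bare-bones tag cloud generator: count word frequencies and scale each
# word's font size linearly between a minimum and maximum.
from collections import Counter
import re

def tag_cloud(text, min_px=12, max_px=48):
    words = re.findall(r"[a-z]+", text.lower())
    top = Counter(words).most_common()
    hi, lo = top[0][1], top[-1][1]
    spread = max(hi - lo, 1)
    return {w: round(min_px + (c - lo) / spread * (max_px - min_px))
            for w, c in top}

print(tag_cloud("metadata metadata indexing retrieval metadata indexing"))
# -> {'metadata': 48, 'indexing': 30, 'retrieval': 12}
```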


OCLC have developed their very own tag cloud generator. Clearly, this widget has been created while developing their suite of nifty services, such as WorldCat, DeweyBrowser, FictionFinder, etc., so we must hold fire on the criticism. But unlike Wordle, this is something OCLC could make useful. For example, if I generate a tag cloud via this service, I expect to be able to click on a tag and immediately initiate a search on WorldCat, or a variety of OCLC services … or the Web generally! In line with good information retrieval practice, I also expect stopwords to be removed. In my example some of the largest tags are nonsense, such as "etc", "specifically", "use", etc. But I guess this is also a fundamental problem with tagging generally...

OCLC are also in a unique position in that they have access to numerous terminologies. This obviously cracks open the potential for cross-referencing tags with their terminological datasets so that only genuine controlled subject terms feature in the tag cloud, or so that productive linkages can be established between tags and controlled terms. This idea is almost as old as tagging itself but, again, has taken until recently to be investigated properly. The connections between tags and controlled vocabularies are precisely what the EnTag project, in which OCLC is a partner, is exploring. In particular, EnTag (Enhanced Tagging for Discovery) is exploring whether tag data, normally typified by its unstructured and uncontrolled nature, can be enhanced and rendered more useful by robust terminological data. The project finished a few months ago - and a final report is eagerly anticipated, particularly as my formative research group submitted a proposal to JISC but lost out to EnTag! C'est la vie!
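The simplest form of such cross-referencing might look like the sketch below; the vocabulary entries and identifiers are invented for illustration, and EnTag's actual methods are doubtless far more sophisticated.

```python
# One naive way to cross-reference free tags with a controlled vocabulary:
# normalise each tag and look for an exact or alternative-label match.
# The entries and URIs are invented examples.
vocabulary = {
    "information retrieval": {"uri": "ex:C012", "altLabels": ["ir"]},
    "metadata":              {"uri": "ex:C034", "altLabels": []},
}

def match_tag(tag):
    """Return the concept URI a tag maps to, or None for an unmatched tag."""
    t = tag.strip().lower()
    for pref, entry in vocabulary.items():
        if t == pref or t in entry["altLabels"]:
            return entry["uri"]
    return None

for tag in ["IR", "Metadata", "cool stuff"]:
    print(tag, "->", match_tag(tag))
```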

Tuesday, 1 July 2008

Making the inaccessible accessible (from an information retrieval perspective)

Websites using Adobe Flash have attracted a lot of criticism over the years, and understandably so. Flash websites break all the rules that make (X)HTML great. They (generally) exemplify poor usability and remain woefully inaccessible to visually impaired users, or those with low bandwidth. Browser support also remains a big problem. However, even those web designers unwilling to relinquish Flash for the aforementioned reasons have often done so anyway because Flash has remained inaccessible to all the major search engines, a serious problem if making your website discoverable is a key concern. Even my brother - historically a huge Flash aficionado - conceded a few years ago that Flash on the web was a bad thing, primarily because of the issues it raises for search engine indexing.

Still, if you look hard enough, you will find many who insist on using it. And these chaps will be pleased to learn that the Official Google Blog has announced that Google have been developing an algorithm for crawling textual Flash content (e.g. menus, buttons and banners, "self-contained Flash websites", etc.). Improved visibility of Flash content is henceforth the order of the day.

But to my mind this is both good news and bad news (well, mainly bad news...). Aside from being championed by a particular breed of web designer, Flash has fallen out of favour with webbies precisely because of the indexing problems associated with it. This, in turn, has promoted an increase in good web design practice, such as compliance with open standards, accessibility and usability. Search engine visibility was, in essence, a big stick with which to whip the Flashers into shape (the carrot of improved website accessibility wasn’t big enough!). Now that the indexing problems have been (partly) resolved, the much celebrated decline in Flash might soon end; we may even see a resurgence of irritating animation and totally unusable navigation systems. I have little desire to visit such websites, even if they are now discoverable.

Friday, 14 March 2008

Shout 'Yahoo!' : more use of metadata and the Semantic Web

Within the lucrative world of information retrieval on the web, Yahoo! is considered an 'old media company'; a company that has gone in a different direction to, say, Google. Yahoo! has been a bit patchy when it comes to openness. It is keen on locking data and widgets down; Google is keen on unlocking data and widgets. And to Yahoo!'s detriment, Google has discovered that there is more money to be made their way, and that users and developers alike are – to a certain extent - very happy with the Google ethos. Google Code is an excellent example of this fine ethos, with the Google Book Search API being a recently announced addition to the Code arsenal.

Since there must be some within Yahoo! ranks attributing their current fragility to a lack of openness, Yahoo! have recently announced their Yahoo! Search 'open platform'. They might be a little slow in fully committing to openness, but cracking open Yahoo! Search is a big and interesting step. For me, it's particularly interesting...

Yesterday Amit Kumar (Product Management, Yahoo! Search) released further details of the new Yahoo! Search platform. This included (among many other things) a commitment to harnessing the potential of metadata and Semantic Web content. More specifically, this means greater support for Dublin Core, Friend-of-a-Friend (FOAF) and other applications of RDF (Resource Description Framework), Creative Commons, and a plethora of microformats.

Greater use of these initiatives by Yahoo! is great news for the information and computing professions, not least because it may stimulate the wider deployment of the aforementioned standards, thus making the introduction of a mainstream 'killer app' that fully harnesses the potential of structured data actually possible.

For example, if your purpose is to be discovered by Google, there is currently no real demand for Dublin Core (DC) metadata to be embedded within the XHTML of a web page, or for you to link to an XML or RDF encoded DC file. Google just doesn't use it. It may use dc.title, but that's about it. That is not to say that it's useless of course. Specialist search tools use it, Content Management Systems (CMS) use it, many national governments use it as the basis for metadata interoperability and resource discovery (e.g. eGMS), it forms the basis of many Information Architecture (IA) strategies, etc, etc. But this Google conundrum has been a big problem for the metadata, indexing and Semantic Web communities (see, for example).

Their tools provide so much potential; but this potential is generally confined to particular communities of practice. Your average information junkie has never used a Semantic Web tool in his/her life. But if a large scale retrieval device (i.e. Yahoo!) showed some real commitment to harnessing structured data, then it could usher in a new age of large scale information retrieval; one based on an intelligent blend of automatic indexing, metadata, and Semantic Web tools (e.g. OWL, SKOS, FOAF, etc.). In short, there would be a huge demand for the 'data web' outside the distinct communities of practice over which librarians, information managers and some computer scientists currently preside. And by implication this would entail greater demand for their skills. All we need to do now is get more people to create metadata and ontologies!
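For anyone who has not seen it in the wild, embedding DC metadata in a page is trivial. A minimal sketch follows, generating the conventional meta elements; the field values are made up, and a real page would of course need content behind them.

```python
# A minimal sketch of Dublin Core metadata embedded in the head of an
# (X)HTML page via meta elements; the field values are made up.
DC_FIELDS = {
    "DC.title":   "Shout 'Yahoo!': more use of metadata and the Semantic Web",
    "DC.creator": "Example Author",
    "DC.subject": "metadata; Semantic Web; Dublin Core",
    "DC.date":    "2008-03-14",
}

def dc_meta_tags(fields):
    """Render the schema declaration plus one meta element per DC field."""
    head = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />']
    head += [f'<meta name="{name}" content="{value}" />'
             for name, value in fields.items()]
    return "\n".join(head)

print(dc_meta_tags(DC_FIELDS))
```

The point of the Yahoo! announcement is that markup like this would finally be read and used by a mainstream search engine rather than ignored.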

Given the fragile state of Yahoo!, initiatives like this (if they come to fruition!) should be applauded. Shout 'Yahoo!' for Yahoo! I'm not sure if it will prevent a Microsoft takeover though...

Monday, 19 November 2007

Stop the press: Google is grim!

I always enjoy being kept abreast of scientific developments by tuning into Leading Edge on BBC Radio 4 on Thursday evenings. A riveting piece of radio journalism! Last Thursday (15th November 2007) we had reports on a team from the US that has created cloned embryos from an adult primate and an invigorating debate on the deployment of brain enhancement drugs. We also had researchers that have demonstrated how robotic cockroaches can influence the behaviour of real ones. However, Leading Edge often gives us snippets of news that impinge directly on what we do within the Information Strategy Group – and last Thursday was no exception.

Technology guru Bill Thompson explained why he believes Google is corrupting us all. This is a refreshing viewpoint from a technology commentator and not one we are accustomed to hearing (except from librarians, information scientists and some computer scientists!). Such commentators normally fail to observe the limitations of any information retrieval tool and drone on about how 'cool' it is. Not Thompson. "We have all fallen for the illusion" because it is "easy" and "simple", says Thompson. "Google and the other search engines have encouraged us to forget the fundamental difference between search and research".

Google is indeed easy and simple. To make his point Thompson (unwittingly?) revisits well-worn LIS arguments emanating from the metadata, cataloguing and indexing areas. Some of these emerged in the 1970s when the efficacy of automatic indexing began to improve. These arguments cite the difficulties of reconciling the terms used to describe concepts, issues of collocation (e.g. 'Java' the coffee, 'Java' the programming language, 'Java' the Indonesian island, etc.), and the problem of differentiating between information about Tony Blair and information by Tony Blair. Thompson almost sounds surprised when he vents spleen over Google's inability to make sense of the word 'Quark'. Welcome to web search engines, Bill!

The most astonishing part of Thompson's report was not his LIS-tinged rant about Google, but his suggestion that librarians had themselves fallen for the Google illusion, along with the academics and school children. Pardon? What could have given him this impression??? Was it an ill-judged, off-hand comment?

The body of research and development exploring the use of metadata on the web is gigantic and is largely driven by the LIS community. The 'deep web' is almost synonymous with digital library content or database content under the custodianship of information professionals. Those in content management or information architecture will also be singing from the LIS hymn sheet. The Semantic Web is another development that seeks to resolve Thompson's 'Quark conundrum'. Even at front-line services, librarians are rolling out information literacy sessions as a means of communicating the importance of using deep web tools, while also making users aware of Google's limitations (e.g. Google only indexes a small portion of the web, problems of information authoritativeness, etc., etc.).

That is not to say that the profession doesn't flirt with Google; of course it does! It flirts with Google because Google provides a vehicle through which to introduce users to valuable information (often from the deep web). And such flirting does not immediately jettison well-formed library and information management theories or principles (see an ex-colleague, for example [1], [2], [3] and [4]).

Of course, I could go on for a lot longer, but there doesn’t seem to be any point as you already know the arguments. But you can listen to Bill Thompson’s report on Leading Edge to hear the arguments of yore restyled by a technology guru. You may also feel compelled to contact Leading Edge to vent your spleen!