Wednesday 24 December 2008

SKOS-ifying Knowledge Organisation Systems: a continuing contradiction for the Semantic Web

A few days ago Ed Summers announced on his blog that he was shutting down lcsh.info. For those who don't know, lcsh.info was a Semantic Web demonstrator developed by Ed with the express purpose of illustrating how the Library of Congress Subject Headings (LCSH) could be represented, and its structure harnessed, using Simple Knowledge Organisation Systems (SKOS). In particular, Ed was keen to explore issues pertaining to Linked Data and the representation of concepts using URIs. He even hoped that the URIs used would be Cool URIs, linking eventually to a bona fide LCSH service were one ever to be released. Sadly, it was not to be... The reasons remain unclear but were presumably related to IPR. As the lcsh.info blog entry notes, Ed was compelled to remove it by the Library of Congress itself. The fact that he was the LC's resident Semantic Web buff probably didn't help matters, I'm sure.

SKOS falls within my area of interest and is an initiative of the Semantic Web Deployment Working Group. In brief, SKOS is an application of RDF and RDFS, comprising a series of evolving specifications and standards used to support the use of knowledge organisation systems (KOS) (e.g. information retrieval thesauri, classification schemes, subject heading systems, taxonomies or any other controlled vocabulary) within the framework of the Semantic Web. The Semantic Web is many things, of course; but it is predicated upon the assumption that there exist communities of practice willing and able to create the necessary structured data (generally applications of RDF) to make it work. This might be metadata, or it might be an ontology, or it might be a KOS represented in SKOS. The resulting data is open and can then be re-used, integrated, interconnected and queried. When large communities of practice fail to contribute, the model breaks down.
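To give a flavour of what this looks like in practice, here is a minimal sketch using Python and the rdflib library; the concept, labels and URIs are invented for illustration and are not real LCSH identifiers.

# A minimal SKOS sketch using rdflib; all URIs and labels are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.bind("skos", SKOS)

concept = URIRef("http://example.org/subjects/knowledge-management")
broader = URIRef("http://example.org/subjects/information-management")

g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("Knowledge management", lang="en")))
g.add((concept, SKOS.altLabel, Literal("KM", lang="en")))
g.add((concept, SKOS.broader, broader))

print(g.serialize(format="turtle"))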

There is a sense in which the Semantic Web has been designed to bring out the schizophrenic tendencies within some quarters of the LIS community. Whilst the majority of our community has embraced SKOS (and other related specifications), can appreciate the potential and actively contributes to the evolution of the standards, there is a small coterie that flirts with the technology whilst simultaneously baulking at the thought of exposing hitherto proprietary data. It's the 'lock down' versus 'openness' contradiction again.

In a previous research post I was involved with the High-Level Thesaurus (HILT) research project and continue my involvement in a consultative capacity. HILT continues to research and develop a terminology web service providing M2M access to a plethora of terminological data, including terminology mappings. Such terminological data can be incorporated into local systems to improve local searching functionality. Improvements might include implementing a dynamic hierarchical subject browsing tree, or incorporating interactive query expansion techniques as part of the search interface, for example. An important aim - and the original motivation behind HILT - is to develop a 'terminology mapping server' capable of ameliorating the "limited terminological interoperability afforded between the federation of repositories, digital libraries and information services comprising the UK Joint Information Systems Committee (JISC) Information Environment" (Macgregor et al., 2007), thus enabling accurate federated subject-based information retrieval. This is a blog so detail will be avoided for now; but, in essence, HILT is an attempt to provide a terminology server in a mash-up context using open standards. To make the terminological data as usable as possible and to expose it to the Semantic Web, the data is modelled using SKOS.
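To illustrate the query expansion idea - and this is a toy sketch of my own, not HILT code - a local system could walk the narrower terms of a SKOS-style vocabulary and OR them into the user's query to improve recall; the vocabulary fragment below is invented.

# Toy query expansion over an invented narrower-term hierarchy.
NARROWER = {
    "agriculture": ["crops", "livestock", "soil science"],
    "crops": ["cereals", "root vegetables"],
}

def expand(term, depth=1):
    """Return the term plus its narrower terms down to the given depth."""
    terms = {term}
    if depth > 0:
        for narrower_term in NARROWER.get(term, []):
            terms |= expand(narrower_term, depth - 1)
    return terms

print(expand("agriculture", depth=2))
# A search interface might OR these terms together when querying a repository.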

But what happens to HILT when/if it becomes an operational service? Will its terminological innards be ripped out by the custodians of terminologies because they no longer want their data exposed, or will the ethos of the model be undermined as service administrators permit only HE institutions or charitable organisations to access the data? This isn't a concern for HILT yet; but it is one I anticipated several years ago. And the sad experience of lcsh.info illustrates that it's a very real concern.

Digital libraries, repositories and other information services have to decide where they want to be. This is a crossroads within a much bigger picture. Do they want their much needed data put to a good use on the Web, as some are doing (e.g. AGROVOC, GEMET, UKAT)? Or do they want alternative approaches to supplant them entirely (i.e. LCSH)? What's it gonna be, punks???

Monday 8 December 2008

Wikipedia censorship: allusions to 'Smell the Glove'?

Another Wikipedia controversy rages, this time over censorship. Over the past two days, the Internet Watch Foundation informed some ISPs that an article pertaining to an album by the 'classic' German heavy metal band, Scorpions, may be illegal. Leaving aside the fact that Scorpions is one of many groups to have similar imagery on their record sleeves (the eponymous 1969 debut album by Blind Faith, Eric Clapton's supergroup, being another obvious example), am I the only person to notice the similarities with fictional rockumentary, This Is Spinal Tap?

Like metal, censorship is a heavy topic; but I thought this tenuous linkage with This Is Spinal Tap might be a welcome distraction from the usual blog postings, which are necessarily academic. Those of you familiar with said film might recall the controversy surrounding the proposed (tasteless) art work for Spinal Tap's new album (Smell the Glove), which in the end gets mothballed owing to its indecent nature. Getting into trouble over sleeve art is part and parcel of being in a heavy metal band it would seem! Enjoy the winter break, people!

Friday 5 December 2008

Some general musings on tag clouds, resource discovery and pointless widgets...

The efficacy of collaborative tagging in information retrieval and resource discovery has undergone some discussion on this blog in the past. Despite emerging a good couple of years ago – and like many other Web 2.0 developments – collaborative tagging remains a topic of uncertainty; an area lacking sufficient evaluation and research. A creature of collaborative tagging which has similarly evaded adequate evaluation is the (seemingly ubiquitous!) 'tag cloud'. Invented by Flickr (Flickr tag cloud) and popularised by delicious (and aren't you glad they dropped the irritating full stops in their name and URL a few months ago?), tag clouds are everywhere; cluttering interfaces with their differently and irritatingly sized fonts.

Coincidentally, a series of tag cloud themed research papers were discussed at one of our recent ISG research group meetings. One of the papers under discussion (Sinclair & Cardew-Hall, 2008) reported an experimental study comparing the usage and effectiveness of tag clouds with traditional search interface approaches to information retrieval. Their work is welcome since it constitutes one of the few robust evaluations of tag clouds since they emerged several years ago.

One would hate to consider tag clouds as completely useless – and I have to admit to harbouring this thought. Fortunately, Sinclair and Cardew-Hall found tag clouds to be not entirely without merit. Whilst they are not conducive to precision retrieval and often conceal relevant resources, the authors found that users reported them useful for broad browsing and/or non-specific resource discovery. They were also found to be useful in characterising the subject nature of databases to be searched, thus aiding the information seeking process. The utility of tag clouds therefore remains confined to the search behaviours of inexperienced searchers and – as the authors conclude - cannot displace traditional search facilities or taxonomic browsing structures. As always, further research is required...

The only thing saving tag clouds from being completely useless is that they can occasionally assist you in finding something useful, perhaps serendipitously. What would be the point in having a tag cloud that didn't help you retrieve any information at all? Answer: There wouldn't be any point; but this doesn't stop some people. Recently we have witnessed the emergence of 'tag cloud generation' tools. Such tools generate tag clouds for Web pages, or text entered by the user. Wordle is one such example. They look nice and create interesting visualisations, but don't seem to do anything other than take a paragraph of text and increase the size of words based on frequency. (See the screen shot of a Wordle tag cloud for my home page research interests.)


OCLC have developed their very own tag cloud generator. Clearly, this widget has been created while developing their suite of nifty services, such as WorldCat, DeweyBrowser, FictionFinder, etc., so we must hold fire on the criticism. But unlike Wordle, this is something OCLC could make useful. For example, if I generate a tag cloud via this service, I expect to be able to click on a tag and immediately initiate a search on WorldCat, or a variety of OCLC services … or the Web generally! In line with good information retrieval practice, I also expect stopwords to be removed. In my example some of the largest tags are nonsense, such as "etc", "specifically", "use", etc. But I guess this is also a fundamental problem with tagging generally...
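For what it's worth, the basic mechanism is easy to sketch in a few lines of Python: count word frequencies, drop stopwords, and scale font sizes by relative frequency. The stopword list and size range below are arbitrary choices for illustration.

# Toy tag cloud generator: frequency counting, stopword removal, font scaling.
from collections import Counter
import re

STOPWORDS = {"the", "and", "of", "to", "a", "in", "etc", "use", "specifically"}

def tag_cloud(text, min_size=10, max_size=36):
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # Font size grows linearly with a word's relative frequency.
    return {word: round(min_size + (max_size - min_size) * n / top)
            for word, n in counts.items()}

print(tag_cloud("information retrieval and resource discovery and information seeking"))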

OCLC are also in a unique position in that they have access to numerous terminologies. This obviously cracks open the potential for cross-referencing tags with their terminological datasets so that only genuine controlled subject terms feature in the tag cloud, or productive linkages can be established between tags and controlled terms. This idea is almost as old as tagging itself but, again, has taken until recently to be investigated properly. The connections between tags and controlled vocabularies are precisely what the EnTag project is investigating, a partner in which is OCLC. In particular, EnTag (Enhanced Tagging for Discovery) is exploring whether tag data, normally typified by its unstructured and uncontrolled nature, can be enhanced and rendered more useful by robust terminological data. The project finished a few months ago – and a final report is eagerly anticipated, particularly as my formative research group submitted a proposal to JISC but lost out to EnTag! C'est la vie!
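Conceptually, the matching step at the heart of this is straightforward, even if doing it well is not. The sketch below is my own invention (not EnTag's methodology): it normalises free tags and looks them up against the preferred and alternative labels of an imaginary controlled scheme.

# Toy tag-to-controlled-vocabulary matching; the vocabulary is invented.
VOCAB = {
    "information retrieval": "Information retrieval",
    "ir": "Information retrieval",
    "classification": "Classification",
    "folksonomies": "Folksonomies",
}

def normalise(tag):
    return tag.strip().lower().replace("_", " ")

def map_tags(tags):
    """Split tags into those matching a controlled term and the leftovers."""
    matched, unmatched = {}, []
    for tag in tags:
        key = normalise(tag)
        if key in VOCAB:
            matched[tag] = VOCAB[key]
        else:
            unmatched.append(tag)
    return matched, unmatched

print(map_tags(["IR", "info_retrieval", "Folksonomies"]))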

Friday 28 November 2008

Catching up with the old future of databases

We have been discussing in the group what we should be teaching on our Business Information Systems undergraduate course, which is having a bit of a revamp. One area of discussion is about the areas of 'databases', which we teach mostly in the students' second year, and 'object oriented analysis and design' (but mostly analysis), which we teach in their final year.

When I used to teach database sections 10 years ago to business students we used to teach a history of:-
  1. file access
  2. hierarchical databases
  3. network databases
  4. relational databases
  5. object oriented databases.
Of course Object Oriented Databases hadn't happened in a big way back then. The surprising thing is that they haven't happened in a big way even now; they have spent 10 years being the next big thing. Meanwhile their close relatives, UML based analysis and object oriented design and development, have swept in from all directions. Strange that my notes from 10 years back now look like 'Space 1999' in their predictive powers. In my day job at Village we had a book on the subject 10 years back, but a quick check of the bookshelf shows we have long since recycled it.

Apparently believers in and developers of Object Oriented Databases are just bemused as to why everybody hasn't followed them into the promised land, particularly as so much time is spent on 'Object Relational Mapping' technologies.

All-round software development and architecture thinker and general purpose bearded guru Martin Fowler believes that the issue isn't to do with the general capabilities of Object Oriented Databases but rather the fact that much integration in corporations occurs in the data layer, not the business layer; hence systems are dependent on standardised SQL approaches to integration. He suggests in his blog post on the subject that this shared database integration requirement has been holding back the march to the future of Object Oriented Databases, creating extra inertia. Indeed, a confession: in my day job, despite being Object Oriented N-Tier architecture developers by trade and conviction, when it came to tying our own timesheet system to our task management system we used database-level triggers. It's a bit like the fact that there are better ways to do typing than using qwerty, but we've all learnt to live with qwerty.

However, with the movement towards using Web Services and SOA-type architectures in effect making XML the lingua franca rather than SQL, Martin Fowler suggests that the field might start to loosen up. Although I wonder whether reporting is another issue. We produce some reports (in Crystal Reports and the equivalent) straight from our business objects, but other management reports really need to be produced straight off the (SQL) database. On occasion this is separate from the main thrust of the application, using a different technology stack.

Really the technology of the day in software development is Object Relational Mapping tools, or ORMs. These try to bind the Object Oriented business layer to the Data Entity Oriented database layer. Such connections are relatively straightforward in an unsophisticated design. My final year students currently angsting over a UML assignment will find that their Class Diagram is much the same as their Entity Relationship Diagram. But as you move deeper into doing things the Object way the two diverge. My Village colleague Ian Bufton and I have been discussing this in terms of lining up the two layers using either tools or code generation; you can see some of his initial ponderings on his blog.
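For the simple case, the ORM mapping really is little more than a transcription of the class diagram. The sketch below uses Python and SQLAlchemy (rather than the .NET stack we actually use at Village), and the Task/TimesheetEntry classes are invented purely for illustration.

# A minimal ORM sketch: each class maps onto one table, the association onto a foreign key.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Task(Base):
    __tablename__ = "tasks"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    entries = relationship("TimesheetEntry", back_populates="task")

class TimesheetEntry(Base):
    __tablename__ = "timesheet_entries"
    id = Column(Integer, primary_key=True)
    hours = Column(Integer)
    task_id = Column(Integer, ForeignKey("tasks.id"))  # the FK mirrors the class association
    task = relationship("Task", back_populates="entries")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    task = Task(name="UML assignment")
    session.add(TimesheetEntry(hours=3, task=task))
    session.commit()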

Luckily these kinds of contemplation are outside the scope of what our Information Systems students at the Business School need to worry about; no doubt the Computer Science department has to worry about how to teach it.

Wednesday 26 November 2008

'Gluing' searches with Yahoo!: part three in the search engine trilogy

There is plenty to comment on in the world of search engines at the moment. This post signifies the last in a series of discussions regarding search engine developments (or the lack thereof!). (Part I; Part II)

As Google SearchWiki was unveiled, Yahoo! announced the wider release of Yahoo! Glue. Originally developed and tested at Yahoo! India, Yahoo! Glue is now more widely available – although it remains (perpetually?) in 'beta'. Glue is an attempt to aggregate disparate forms of information on a single results page in response to a single query. I suppose Glue is a functional demonstration of the ultimate mashup information retrieval tool. Glue assembles heterogeneous information from all over the Web, including text, news feeds, images, video, audio, etc. Search for the Beatles and you will get a results page listing Wikipedia definitions, LastFM tracks for your listening pleasure, news feeds, YouTube videos, etc.

In an ironic twist, Glue appears to be defying the dynamic ethos of Web 2.0. Glue searches are not created on the fly and only a limited number of Glue searches are available at the moment (for example, no Liverpool!). Glue has the 'beta forever' mantra as its 'Get out of jail free' card, of course. Still, Yahoo! informs us that:
"These pages are built using an algorithm that automatically places the most relevant modules on a page, giving you a visually rich, diverse page all about the topic in which you're interested."
Glue is also an example of Yahoo! exploring the social web in retrieval, harnessing as it does users' opinions on the accuracy of this algorithm (e.g. poorly ranked or irrelevant results can be 'flagged' as inappropriate).

Glue is - and will be - for the leisure user; the person falling into the 'popular search' category in search engines. These are the users submitting the simplest queries. The teenagers searching for 'Britney Spears', and the adults searching for 'Barack Obama' or 'Strictly Come Dancing'. The serious user (e.g. student, academic, knowledge worker, etc.) need not apply. I also have reservations over whether the summarisation of results is appropriate, and whether Glue can actually assemble disparate resources that are all relevant to a query. Check out these canned searches for Stephen Stills and Glasgow. In what way is Amsterdam related to Glasgow? And why the spurious news stories for Stephen Stills? Examining the text, it is clear how it has been retrieved; but why do similar issues not affect Yahoo! Search?

Like Google with SearchWiki, Yahoo! emphasise that Glue is not a replacement for Yahoo! Search; rather it's a "standalone experience":
"… Yahoo! Glue(TM) beta is not to replace the Yahoo! Search experience [...] We're always challenging ourselves to explore innovative new ways to deliver great experiences. Glue is one of those experiments, with a goal of giving users one more visual way to browse and discover new things from across the Web. We'll be working to expand the number of Glue pages, improve the experience and incorporate your feedback into future versions."
Very good. But making this dynamic and scalable should be atop the Glue 'to do' list. No Liverpool!

Monday 24 November 2008

Wikifying search

This post follows a series of others pontificating about the efficacy of search engines in information retrieval. Over the weekend Google announced the release of Google SearchWiki. Google SearchWiki essentially allows users to customise searches by re-ranking, deleting, adding, and commenting on their results. This is personalised searching (see video below). As the Official Google Blog notes:
"With just a single click you can move the results you like to the top or add a new site. You can also write notes attached to a particular site and remove results that you don't feel belong."
The advantages of this are a little unclear at first; however, things become clearer when we learn that such changes can only be effected if you have an iGoogle account. Google have – quite understandably – been very specific about this aspect of SearchWiki. Search is their bread and butter; messing with the formula would be like dancing with the devil!

Google SearchWiki doesn't do anything further to address our Anomalous State of Knowledge (ASK), nor can I see myself using it, but it is an indication that Google is interested in better exploring the potential of social data to improve relevance feedback. Google will, of course, harvest vast amounts of data pertaining to users' information seeking behaviour which can then be channelled into improving their bread and butter. (And from my perspective, I would be interested to know how they analyse such data and effect changes in their PageRank algorithm). Their move also resonates with an increasing trend to support users in their Personal Information Management (PIM); to assist users in re-finding information they have previously located, or those frequently conducting the same searches over and over. It particularly reminds me of research undertaken by Bruce et al. (2004). For example, users increasingly choose not to bookmark a useful or interesting web page, but simply find it again – because they know they can. If you continually encounter information that is irrelevant to your area, re-rank it accordingly - so the SearchWiki ethos goes...
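In principle the personalisation layer is very thin; a conceptual sketch (nothing to do with Google's actual implementation) of applying a user's stored preferences to a result list might look like this:

# Conceptual sketch of SearchWiki-style personalisation: promoted URLs float up,
# removed URLs disappear, and notes ride along with the results.
def personalise(results, promoted=(), removed=(), notes=None):
    notes = notes or {}
    kept = [r for r in results if r not in removed]
    ordered = [r for r in promoted if r in kept] + [r for r in kept if r not in promoted]
    return [(r, notes.get(r)) for r in ordered]

results = ["example.com/a", "example.com/b", "example.com/c"]
print(personalise(results,
                  promoted=["example.com/c"],
                  removed={"example.com/b"},
                  notes={"example.com/c": "useful for the ASK lecture"}))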

Perusing recent blogs it is clear that some consider this development to have business motivations. Technology guru and Wired magazine founder, John Battelle, thinks SearchWiki is an attempt to attract more users of iGoogle (which at the moment is small), whilst simultaneously rendering iGoogle the centre of users’ personal web universe. To my mind Google is always about business. PageRank is a great free-text searching tool, thus permitting huge market penetration. SearchWiki is simply another business tool which happens to offer some (vaguely?) useful functionality.

Tuesday 7 October 2008

Search engines: solving the 'Anomalous State of Knowledge'

Information retrieval (IR) remains one of the most active areas of research within the information, computing and library science communities. It also remains one of the sexiest. The growth in information retrieval sex appeal has a clear correlation with the growth of the Web and the need for improvements in retrieval systems based on automatic indexing. No doubt the flurry of big name academics and Silicon Valley employees attending conferences such as SIGIR also adds glamour. Nevertheless, the allure of IR research has precipitated some of the best innovations in IR ever, as well as creating some of the most important search engines and business brands. Of course, asked to pick from a list their favourite search engine or brand, most would probably select Google.

The habitual use of Google by students (and by real people generally!) was discussed in a previous post and needn't be revisited here. Nevertheless, one of the most distressing aspects of Google (for me, at least!) is a recent malaise in its commitment to search. There have been some impressive innovations in a variety of search engines in a variety of areas. For example, Yahoo! is moving to better harness metadata and Semantic Web data on the Web. More interestingly though, some recent and impressive innovations in solving the 'ASK conundrum' are visible in a variety of search engines, but not in Google. Although Google always tell us that search is its bread and butter, is it spreading itself a little too thinly? Or - with a brand loyalty second to none and the robust PageRank algorithm deployed to good effect – is Google resting on its laurels?

In 1982 a young Nicholas J. Belkin spearheaded a series of seminal papers documenting various models of users' information needs in IR. These papers remain relevant today and are frequently cited. One of Belkin et al.'s central suppositions is that the user suffers from the so-called Anomalous State of Knowledge, which can be conveniently acronymized to 'ASK'. Their supposition can be summarised by the following quote from their JDoc paper:
"[P]eople who use IR systems do so because they have recognised an anomaly in their state of knowledge on some topic, but they are unable to specify precisely what is necessary to resolve that anomaly. ... Thus, we presume that it is unrealistic (in general) to ask the user of an IR system to say exactly what it is that she/he needs to know, since it is just the lack of that knowledge which has brought her/him to the system in the first place".
This astute deduction ushered in a branch of IR research that sought to improve retrieval by resolving the Anomalous State of Knowledge (e.g. providing the user with assistance in the query formulation process, helping users ‘fill in the blanks’ to improve recall (e.g. query expansion), etc.).

Last winter Yahoo! unveiled its 'Search Assist' facility (see screenshot above - search for 'united nations'), which provides real-time query formulation assistance to the user. Providing these facilities in systems based on metadata has always been possible owing to the use of controlled vocabularies for indexing, the use of name authority files, and even content standards such as AACR2; but providing a similar level of functionality with unstructured information is difficult – yet Yahoo! provide something ... and it can be useful and can actually help resolve the ASK conundrum!
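A toy sketch of the idea - with invented suggestion data, nothing like the query-log mining a real engine would do - might look like this:

# Toy query formulation assistant: suggest completions and related queries for a prefix.
SUGGESTIONS = {
    "united nations": ["united nations security council",
                       "united nations charter",
                       "united nations development programme"],
}

def assist(partial, limit=5):
    partial = partial.lower().strip()
    hits = []
    for query, related in SUGGESTIONS.items():
        if query.startswith(partial):
            hits.append(query)
            hits.extend(related)
    return hits[:limit]

print(assist("united na"))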


Similarly, meta-search engine Clusty has provided its 'clustering' techniques for quite some time. These clusters group related concepts and are designed to aid in query formulation, but also to provide some level of relevance feedback to users (see screenshot above - search for 'George Macgregor'). Of course, these clusters can be a bit hit or miss but, again, they can improve retrieval and aid the user in query formulation. Similar developments can also be found in Ask. View this canned search, for example. What help does Google provide?

The bottom line is that some search engines are innovating endlessly and putting the fruits of a sexy research area to good use. These search engines are actually moving search forward. Can the same still be said of Google?

Friday 29 August 2008

A conceptual model of e-learning: better studying effectiveness

My personal development has recently led me to explore and research the effectiveness of e-learning approaches to Higher Education (HE) teaching and learning. Since the late 1990s, e-learning has become a key focus of activity within pedagogical communities of practice (as well as those within information systems and LIS communities who often manage the necessary technology). HE is increasingly harnessing e-learning approaches to provide flexible course delivery models capable of meeting the needs of part-time study and lifelong learners. Of particular relevance, of course, is the Web, a mechanism highly conducive to disseminating knowledge and delivering a plethora of interactive learning activities (hence the role of informaticians).

The advantages of e-learning are frequently touted in the literature and are generally manifest in the Web itself. Such benefits include the ability to engage students in non-linear information access and synthesis; the availability of learning environments from any location and at any time; the ability for students to influence the level and pace of engagement with the learning process; and increased opportunities for deploying disparate learning strategies, such as group discussion and problem-based or collaborative learning, as well as delivering interactive learning materials or learning objects. Various administrative and managerial benefits are also cited, such as cost savings over traditional methods and the relative ease with which teaching materials or courses can be revised.

Although flexible course delivery remains a principal motivating factor, the use of e-learning is largely predicated upon the assumption that it can facilitate improvements in student learning and can therefore be more effective than conventional techniques. This assumption is largely supported by theoretical arguments and underpins the large amounts of government and institutional investment in e-learning (e.g. JISC e-learning); yet, it is an assumption that is not entirely supported by the academic literature, containing as it does a growing body of indifferent evidence...

In 1983, Richard E. Clark from the University of Southern California conducted a series of meta-analyses investigating the influence of media on learning. His research found little evidence of any educational benefits and concluded that media were no more effective in teaching and learning than traditional teaching techniques. Said Clark:
"[E]lectronic media have revolutionised industry and we have understandable hopes that they would also benefit instruction".
Clark's paper was/is seminal and remains a common citation in those papers reporting indifferent e-learning effectiveness findings.

Is the same true of e-learning? Is there a similar assumption fuelling the gargantuan levels of e-learning investment? I feel safe in stating that such an assumption is endemic - and I can confirm this having worked briefly on a recent e-learning project. And I am in no way casting aspersions on my colleagues during this time, as I too held the very same assumption!!!

It is clear that evidence supporting the effectiveness of e-learning in HE teaching and learning remains unconvincing (e.g. Bernard et al.; Frederickson et al.). A number of comparative studies have arrived at indifferent conclusions and support the view that e-learning is at least as effective as traditional teaching methods, but not more effective (e.g. Abraham; Dutton et al.; Johnson et al.; Leung; Piccoli et al.). However, some of these studies exemplify a lack of methodological rigour (e.g. group self-selection) and many fail to control for some of the most basic variables hypothesised to influence effectiveness (e.g. social interaction, learner control, etc.). By contrast, those studies which have been more holistic in their methodological design have found e-learning to be more effective (e.g. Liu et al.; Hui et al.). These positive results could be attributed to the fact that e-learning, as an area of study, is maturing; bringing with it an improved understanding of the variables influencing e-learning effectiveness. Perhaps electronic media will "revolutionise" instruction after all?

Although such positive research tends to employ greater control over variables, such work fails to control for all the factors considered – both empirically and theoretically - to influence whether e-learning will be effective or not. Frederickson et al. have suggested that the theoretical understanding of e-learning has been exhausted and call for a greater emphasis on empirical research; yet it is precisely because a lack of theoretical understanding exists that invalid empirical studies have been designed. It is evident that the variables influencing e-learning effectiveness are multifarious and few researchers impose adequate controls or factor any of them into research designs. Such variables include: level of learner control; social interactivity; learning styles; e-learning system design; properties of learning objects used; system or interface usability; ICT and information literacy skills; and, the manner or degree to which information is managed within the e-learning environment itself (e.g. Information Architecture). From this perspective it can be concluded that no valid e-learning effectiveness research has ever been undertaken since no study has yet attempted to control for them all.

Motivated by this confusing scenario, and informed by the literature, it is possible to propose a rudimentary conceptual model of e-learning effectiveness (see diagram above) which I intend to develop and write up formally in the literature. The model attempts to improve our theoretical understanding of e-learning effectiveness and should aid researchers in comprehending the relevant variables and the manner in which they interact. It is anticipated that such a model will assist researchers in developing future evaluative studies which are both robust and holistic in design. It can therefore be hypothesised that using the model in evaluative studies will yield more positive e-learning effectiveness results.

Apologies this was such a lengthy posting, but does anyone have any thoughts on this or fancy working it up with me?

Friday 8 August 2008

Where to next for social metadata? User Labor Markup Language?

The noughties will go down in history as a great decade for metadata, largely as a result of XML and RDF. Here are some highlights so far, off the top of my head: MARCXML, MODS, METS, MPEG-21 DIDL, IEEE LOM, FRBR, PREMIS. Even Dublin Core – an initiative born in the mid-1990s – has taken off in the noughties owing to its extensibility, variety of serialisations, growing number of application profiles and implementation contexts. Add to this other structured data, such as Semantic Web specifications (some of which are optimised for expressing indexing languages) like SKOS, OWL, FOAF, other RDF applications, and microformats. These are probably just a perplexing bunch of acronyms and jargon for most folk; but that's no reason to stop additions to the metadata acronym hall of fame quite yet...!

Spurred by social networking, so-called 'social metadata' has been emerging as a key area of metadata development in recent years. For some, developments such as collaborative tagging are considered social metadata. To my mind – and those of others – social metadata is something altogether more structured, enabling interoperability, reuse and intelligence. Semantic Web specifications such as FOAF provide an excellent example of social metadata; a means of describing and graphing social networks, inferring and describing relationships between like-minded people, establishing trust networks, facilitating DataPortability, and so forth. However, social metadata is increasingly becoming concerned with modelling users' online social interactions in a number of ways (e.g. APML).

A recently launched specification which grabbed my attention is the User Labor Markup Language (ULML). ULML is described as an "open protocol for sharing the value of user's labor across the web" and embodies the notion that making such labour metric data more readily accessible and transparent is necessary to underpin the fragile business models of social networking services and applications. According to the ULML specification:
"User labor is the work that people put in to create, improve, and maintain their existence in social web. In more detail, user labor is the sum of all activities such as:
  • generating assets (e.g. user profiles, images, videos, blog posts),
  • creating metadata (e.g. tagging, voting, commenting etc.),
  • attracting traffic (e.g. incoming views, comments, favourites),
  • socializing with other people (e.g. number of friends, social influence)
in a social web service".
In essence then, ULML simply provides a means of modelling and sharing users' online social activities. ULML is structured much like RSS, with three major document elements (action, reaction and network). Check out the simple Flickr example below (referenced from the spec.). An XML editor screen dump is also included for good measure:

<?xml version="1.0" encoding="UTF-8"?>
<ulml version="0.1">
<channel>
<title>Flickr / arikan</title>
<link>http://fickr.com/photos/arikan</link>
<description>arikan's photos on Flickr.</description>
<pubDate>Thu, 06 Feb 2008 20:55:01 GMT</pubDate>
<user>arikan</user>
<memberSince>Thu, 01 Jun 2005 20:00:01 GMT</memberSince>
<record>
<actions>
<item name="photo" type="upload">852</item>
<item name="group" type="create">4</item>
<item name="photo" type="tag">1256</item>
<item name="photo" type="comment">200</item>
<item name="photo" type="favorite">32</item>
<item name="photo" type="flag">3</item>
<item name="group" type="join">12</item>
</actions>
<reactions>
<item name="photo" type="view">26984</item>
<item name="photo" type="comment">96</item>
<item name="photo" type="favorite">25</item>
</reactions>
<network>
<item name="connection">125</item>
<item name="density">0.167</item>
<item name="betweenness">0.102</item>
<item name="closeness">0.600</item>
</network>
<pubDate>Thu, 06 Feb 2008 20:55:01 GMT</pubDate>
</record>
</channel>
</ulml>


The tag properties within the action and reaction elements are all pretty self-explanatory. Within the network element "connection" denotes the number of friend connections, "density" denotes the number of connections divided by the total number of all possible connections, "closeness" denotes the average distance of that user to all their friends, and "betweenness" the "probability that the [user] lies on the shortest path between any two other persons".
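One of ULML's virtues is how easily it can be consumed. Assuming the example above were saved as ulml.xml, a few lines of Python are enough to pull the labour metrics out of it:

# Parse the ULML example with the standard library and extract a few metrics.
import xml.etree.ElementTree as ET

channel = ET.parse("ulml.xml").getroot().find("channel")

actions = {(item.get("name"), item.get("type")): int(item.text)
           for item in channel.findall("record/actions/item")}
network = {item.get("name"): float(item.text)
           for item in channel.findall("record/network/item")}

print(actions[("photo", "upload")])   # 852 uploads
print(network["betweenness"])         # 0.102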

Although the specification is couched in a lot of labour theory jargon, ULML is quite a funky idea and is a relatively simple thing to implement at an application level. With the relevant privacy safeguards in place, service providers could make ULML files publicly available, thus better enabling them and other providers to understand users' behaviour via a series of common metrics. This, in turn, could facilitate improved systems design and personalisation since previous user expectations could be interpreted through ULML analysis. Authors of the specification also suggest that a ULML document constitutes an online curriculum vitae of users' social web experience. It provides a synopsis of user activity and work experience. It is, in essence, evidence of how they perform within social web contexts. Say the authors:
"...a ULML document is a tool for users to communicate with the web services upfront and to negotiate on how they will be rewarded in return for their labour within the service".
This latter concept is significant since – as we have discussed before – such Web 2.0 services rely on social activity (i.e. labour) to make their services useful in the first place; but such activity is ultimately necessary to make them economically viable.

Clearly, if implemented, ULML would be automatically generated metadata. It therefore doesn't really relate to the positive metadata developments documented here before, or the dark art itself; however, it is a further recognition that with structured data there lies deductive and inferential power.

Thursday 24 July 2008

Knol: Wikipedia, but not as we know it...

A while ago I posted a blog about how Wikipedia in Germany was experimenting with new editing rules in an attempt to stem the rising number of malicious edits. In essence, these new editing rules would impose greater editorial controls by only allowing trustworthy and hardened Wikipedians to effect changes. The success of this policy remains unknown (perhaps I'll investigate it further after posting this blog); but the general ethos was about improving information quality, authority and reliability.

While Wikipedia wrestle with their editorial demons, Google have officially launched Knol. According to the website, a knol is a "unit of knowledge", or more specifically, "an authoritative article about a specific topic". Each topic has an author, who has exclusive ownership of the topic associated with them. An author can allow visitors to comment on knols, or suggest changes; however, unlike Wikipedia, the author cannot be challenged. This is what Google refers to as "moderated collaboration".

Says Google:
"With Knol, we are introducing a new method for authors to work together that we call 'moderated collaboration'. With this feature, any reader can make suggested edits to a knol which the author may then choose to accept, reject, or modify before these contributions become visible to the public. This allows authors to accept suggestions from everyone in the world while remaining in control of their content. After all, their name is associated with it!"
A knol is supposed to be an authoritative and credible article, and Google have therefore placed a strong emphasis on author credentials. This is apparent from the moment you visit Knol. Medical knols are written by bona fide doctors; DIY advice is provided by a genuine handyman – and their identities are verified.

Knol is clearly a direct challenge to the supremacy of Wikipedia; yet it jettisons many of the aspects that made Wikipedia popular in the first place. And it does this to maintain information integrity. Am I sorry about this? 'Yes' and 'no'. For me Knol represents a useful halfway house; a balance between networked collaboration and information integrity. Is this elitist? No - it's just common sense.

What do you think? Register your vote on the poll!

Tuesday 1 July 2008

Making the inaccessible accessible (from an information retrieval perspective)

Websites using Adobe Flash have attracted a lot of criticism over the years, and understandably so. Flash websites break all the rules that make (X)HTML great. They (generally) exemplify poor usability and remain woefully inaccessible to visually impaired users, or those with low bandwidth. Browser support also remains a big problem. However, even those web designers unwilling to relinquish Flash for the aforementioned reasons have done so because Flash has remained inaccessible to all the major search engines, thereby causing serious problems if making your website discoverable is a key concern. Even my brother - historically a huge Flash aficionado - a few years ago conceded that Flash on the web was a bad thing – primarily because of the issues it raises for search engine indexing.

Still, if you look hard enough, you will find many that insist on using it. And these chaps will be pleased to learn that the Official Google Blog has announced that Google have been developing an algorithm for crawling textual Flash content (e.g. menus, buttons and banners, "self-contained Flash websites", etc.). Improved visibility of Flash content is henceforth the order of the day.

But to my mind this is both good news and bad news (well, mainly bad news...). Aside from being championed by a particular breed of web designer, Flash has fallen out of favour with webbies precisely because of the indexing problems associated with it. This, in turn, has promoted an increase in good web design practice, such as compliance with open standards, accessibility and usability. Search engine visibility was, in essence, a big stick with which to whip the Flashers into shape (the carrot of improved website accessibility wasn’t big enough!). Now that the indexing problems have been (partly) resolved, the much celebrated decline in Flash might soon end; we may even see a resurgence of irritating animation and totally unusable navigation systems. I have little desire to visit such websites, even if they are now discoverable.

Tuesday 3 June 2008

Harvesting and distributing semantic data: FOAF and Firefox

One of the most attractive aspects of Mozilla Firefox continues to be the incessant supply of useful (and not-so-useful) Extensions. The supply is virtually endless, and long may it continue! (What would I do without Zotero, or Firebug?!) We have seen the emergence of some useful metadata tools, such as Dublin Core metadata viewers, and more recently, Semantic Web widgets. Operator is an interesting example of the latter, harnessing microformats and eRDF in a way that allows users to interact easily with some simple semantic data (e.g. via toolbar operations). Another interesting Firefox extension is Semantic Radar.

Semantic Radar is a "semantic metadata detector" for Firefox and is a product of a wider research project based at DERI, National University of Ireland. Semantic Radar can identify the presence of FOAF, DOAP and SIOC data in a webpage (as well as more generic RDF and RDFa data) and will display the relevant icon(s) in the Firefox status bar to alert the user when such data is found. (See screen shot for FOAF example) Clicking the icon(s) enables users to browse the data using an online semantic browser (e.g. FOAF Explorer, SIOC Browser). In short, Semantic Radar parses web pages for RDF auto-discovery links to discover semantic data. A neat tool to have...
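The auto-discovery mechanism itself is simple enough to approximate in a few lines of Python. This is only a rough sketch of the idea (the real extension also handles SIOC, DOAP, RDFa and so on): scan a page's link elements for RDF auto-discovery links of the kind FOAF uses.

# Rough sketch: find RDF auto-discovery links in an (X)HTML page.
from html.parser import HTMLParser

class AutoDiscoveryParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("type") == "application/rdf+xml":
            self.rdf_links.append((attrs.get("title"), attrs.get("href")))

parser = AutoDiscoveryParser()
parser.feed('<html><head><link rel="meta" type="application/rdf+xml" '
            'title="FOAF" href="http://example.org/foaf.rdf" /></head></html>')
print(parser.rdf_links)  # [('FOAF', 'http://example.org/foaf.rdf')]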


Semantic Radar has been available for some time, but a recent update to the widget means that it is now possible to automatically 'ping' the Ping The Semantic Web website. Ping The Semantic Web (PTSW) is an online service harvesting, storing and distributing RDF documents. If one of those documents is updated, its author can notify PTSW to that effect by pinging it with the URL of the document. This is an amazingly efficient way to disseminate semantic data. It is also an amazingly effective way for crawlers and other software agents to discover and index RDF data on the web. (URLs indexed by PTSW can be re-used by other web services and can also be viewed on the PTSW website (See screenshot of PTSW for my FOAF file)).


It might just be me, but I'm finally getting the sense that the volume of structured data available on the web - and the tools necessary to harness it - are beginning to reach some sort of critical mass. The recent spate of blog posts echoing this topic bears testament to that (e.g. [1], [2]). Dare I use such a term, but Web 3.0 definitely feels closer than ever.

Friday 9 May 2008

R.I.P advertising: Business models on the Web


Rory Cellan-Jones, BBC News technology guru and blogger at BBC dot.life, posted some interesting musings about Firefox 3 yesterday. Mozilla Firefox 3 is due for final release in June 2008. One controversial development, however, is the 'awesome bar' - controversial owing to the manner in which it gathers data on user searching behaviour...

Seemingly – and probably unsurprisingly if one gives it a second thought – the Mozilla Foundation have struggled to keep their head above water. How does the "poster-child of the open-source movement" keep a staff of 160? Until fairly recently the majority of revenue was generated from Mozilla branded merchandise. Incredible that the browser you are probably using to read this blog was created by a bunch of people selling T-shirts to keep their passion afloat. As Google have demonstrated, the true Web money is in information retrieval and intelligent advertising. To this end Yahoo! is attempting to rejuvenate itself in the face of Microsoft by (hopefully) revolutionising information retrieval through leveraging structured data (e.g. metadata, Semantic Web data, microformats, etc.) – because that is the honey pot. For Mozilla, the Firefox awesome bar is set to supplement revenue generated by the default Firefox Google search box. But as Raju Vegesna acknowledges, "People are realising that advertising is not good for everything, that it's not going to make them the next Google". This is simply because not enough people look at adverts.

But this post is a general musing on Web business models. Sure - open-source is about user emancipation. And those innovative enough (i.e. Mozilla) will find good revenue generating tools, ultimately based on advertising. But what about Web journalism, or Web 2.0 services? How are they going to put dinner on the table in 3-4 years' time? Has our unbounded enthusiasm to provide everything for free during the mid to late 1990s created a business model nightmare today? Have we passed the point of no return?

These aren't new questions, but they are questions we continue to agonise over. While the Future of Journalism conference attendees do a little more agonising, the pioneers of Web 2.0 engage in their own head scratching. As we learned a few weeks ago, Bubble 2.0 might burst soon; many are finding advertising difficult and some openly acknowledge that a robust business model was a secondary concern for their Web 2.0 enterprise. Others are resorting to subscription to pay the bills. Is it time to acknowledge that – except for the few (i.e. Google et al.) – advertising, as the basis for an e-business model, is finally dying?

Thursday 27 March 2008

Bubble 2.0: brace yourself

Tim Berners-Lee would have us think that the label 'Web 2.0' is a piece of jargon and a complete misnomer. There is some validity in his argument. The majority of technologies commonly associated with Web 2.0 applications have been at the disposal of developers since well before Bubble 1.0 burst. Even the social and collaborative aspects that have - rightly or wrongly - become synonymous with Web 2.0 were well rehearsed by Amazon (since 1995!) and eBay. That the web has entered a new phase of development is incontrovertible though; but when I think of Web 2.0 I prefer to focus on the use of web service APIs to create new information services and on harnessing the potential of structured data (e.g. applications of XML, RDF, etc.).

Be that as it may, Web 2.0 has now been with us for a few years because people believe it is actually happening; that Web 2.0 is upon us all. 2005-2008 has witnessed a proliferation of so-called Web 2.0 companies, many 'built to be bought'. Was del.icio.us really worth $30 million when Yahoo! bought it in late 2005?

This frenzy of activity, aggressive over-speculation, and the impending credit crunch now have many technology and financial commentators (and protagonists) nervous. Bubble 2.0 is soon to burst. High stock values and high P/E ratios simply contribute to this perception. The storm clouds do indeed seem to be gathering over Silicon Valley and a near-certain second dot-com crash is approaching (Adam Lashinsky's article inspired this post). Perhaps if people had been listening to Tim Berners-Lee at the beginning of the Web 2.0 phenomenon things wouldn't have spiralled out of control; enthusiasm might have been tempered.

The ridiculous nature of the Web 2.0 scenario and its impending doom can never be properly articulated by the technology or business press. It takes popular culture to deliver a good, hard slap to the face. To this end the Richter Scales provide us with an astute appraisal of the current state of Web 2.0, its tenuous origins and its future, all to the tune of 'We didn't start the fire' by Billy Joel. Enjoy 'Here comes another bubble'. It's educational.


Here Comes Another Bubble - Richter Scales

Friday 14 March 2008

Shout 'Yahoo!' : more use of metadata and the Semantic Web

Within the lucrative world of information retrieval on the web, Yahoo! is considered an 'old media company'; a company that has gone in a different direction to, say, Google. Yahoo! has been a bit patchy when it comes to openness. It is keen on locking data and widgets down; Google is keen on unlocking data and widgets. And to Yahoo!'s detriment, Google has discovered that there is more money to be made their way, and that users and developers alike are – to a certain extent - very happy with the Google ethos. Google Code is an excellent example of this fine ethos, with the Google Book Search API being a recently announced addition to the Code arsenal.

Since there must be some within Yahoo! ranks attributing their current fragility to a lack of openness, Yahoo! have recently announced their Yahoo! Search 'open platform'. They might be a little slow in fully committing to openness, but cracking open Yahoo! Search is a big and interesting step. For me, it's particularly interesting...

Yesterday Amit Kumar (Product Management, Yahoo! Search) released further details of the new Yahoo! Search platform. This included (among many other things), a commitment to harnessing the potential of metadata and Semantic Web content. More specifically, this means greater support of Dublin Core, Friend-of-a-Friend (FOAF) and other applications of RDF (Resource Description Framework), Creative Commons, and a plethora of microformats.

Greater use of these initiatives by Yahoo! is great news for the information and computing professions, not least because it may stimulate the wider deployment of the aforementioned standards, thus making the introduction of a mainstream 'killer app' that fully harnesses the potential of structured data actually possible. For example, if your purpose is to be discovered by Google, there is currently no real demand for Dublin Core (DC) metadata to be embedded within the XHTML of a web page, or for you to link to an XML or RDF encoded DC file. Google just doesn't use it. It may use dc.title, but that's about it. That is not to say that it's useless of course. Specialist search tools use it, Content Management Systems (CMS) use it, many national governments use it as the basis for metadata interoperability and resource discovery (e.g. eGMS), it forms the basis of many Information Architecture (IA) strategies, etc, etc. But this Google conundrum has been a big problem for the metadata, indexing and Semantic Web communities (see, for example). Their tools provide so much potential; but this potential is generally confined to particular communities of practice. Your average information junkie has never used a Semantic Web tool in his/her life. But if a large scale retrieval device (i.e. Yahoo!) showed some real commitment to harnessing structured data, then it could usher in a new age of large scale information retrieval; one based on an intelligent blend of automatic indexing, metadata, and Semantic Web tools (e.g. OWL, SKOS, FOAF, etc.). In short, there would be a huge demand for the 'data web' outside the distinct communities of practice over which librarians, information managers and some computer scientists currently preside. And by implication this would entail greater demand for their skills. All we need to do now is get more people to create metadata and ontologies!
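For readers who have never seen embedded DC, the sort of thing we are talking about is trivial to produce; the snippet below simply prints the kind of meta elements that could sit in an XHTML head, and the field values are invented for illustration.

# Illustrative only: generate Dublin Core meta elements for an XHTML head.
record = {
    "DC.title": "Shout 'Yahoo!': more use of metadata and the Semantic Web",
    "DC.creator": "George Macgregor",
    "DC.subject": "Semantic Web; metadata; information retrieval",
    "DC.date": "2008-03-14",
}

for name, content in record.items():
    print(f'<meta name="{name}" content="{content}" />')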

Given the fragile state of Yahoo!, initiatives like this (if they come to fruition!) should be applauded. Shout 'Yahoo!' for Yahoo! I'm not sure if it will prevent a Microsoft takeover though...

Thursday 14 February 2008

Death by librarian!

Electroacoustic popsters from the antipodes, Haunted Love, provide this seductive pop ditty, paying tribute to the craft of librarianship - although I don't recall squashing patrons in between the book stacks as part of this craft. Interesting too is the absence of any ICT!

The video has been around for a while, but I thought it might be a welcome distraction from the usual blog postings, which are necessarily academic.

Monday 4 February 2008

Wearing an Information System Hair Shirt (Talk Talk)

What have we done? I’ve been forced to wear an information system hair shirt; here’s why.

Popular consciousness would have us think that scientists amble through life inventing weaponry and new life forms that probably won’t mutate into the wild. We wonder if they will ever ask themselves ‘what have we done’. There is another breed of scientist who takes the risk personally, injecting themselves with the disease and the trial vaccines. In the information systems world we are also producing mutant commercial life forms in the form of giant workflow-driven operations. My 15-month effort to get ‘Free Broadband’ with UK telco TalkTalk (www.talktalk.co.uk; other dysfunctional telephone companies are available) leaves me thinking ‘what have I done’.

Commercial IS analysts and developers seek to break down tasks into a series of workflows and then build information systems that offer operators, or via e-commerce the punters themselves, access to these workflows. This has allowed us to create low-cost call centre led support operations for giant machines such as telephone companies, other utilities and banks. This works cheaply and well until your workflow relationship with the company falls onto the path less travelled by.

It is a bit like printing out a Google map for how to get from A to B. You get a set of instructions: ‘Turn right at Missing Sign Road, 780 yards’. This all works well until you fall off the path laid down for you; then you realise you are lost and haven’t got a real map in the car.

So it seems to be with workflow-driven companies: they operate cheaply and well until you fall off the path, at which point you enter Kafka’s ‘The Trial’, strangely held down by a faceless system.

In the case of my experience with TalkTalk, my original promised go-live date of November 2006 is now a distant memory. There have been dozens of phone calls and many hours on hold, after which you are met by well-trained customer service professionals to calm and ease your mind. From my side of the phone it feels like they look up your file, see that something has gone wrong with ‘provisioning’, and then on screen have a huge button marked ‘re-provision, then wish customer a happy day’. I spend my time trying to dissuade them from pressing that button while I try to explain that their predecessors have already tried that.

That puts me on the very edge of the workflow, so in desperation they suggest I phone back on a different number or, on some occasions, put me through to the expert departments with names such as ‘sales’, ‘engineering escalation’, ‘customer retention’ and ‘complaints’. These people operate on the outer edge of the workflow and attempt to push you back into the mainstream, although in my case I think I am beyond help, as they end up assuring me that they will phone back in the next 48 hours, 72 hours tops. However much I argue that that is what the last guy said, they explain that they can’t understand what went wrong there and they will definitely phone back, scout’s honour. No one ever has.

Every time I go through this loop (I try to every 5 or 6 weeks) it makes me frustrated and angry, so why do I bother, you and my wife ask. Well, it is a bit like the scientist injecting the vaccine into himself. By living at the rough end of a large company’s workflow systems I remember my own humanity, and perhaps approach the giant Information Systems we espouse with a bit more caution.

Talk Talk is my hair shirt that reminds me of the danger of Information System sin. Perhaps next time, instead of putting me through to the department of ‘Repentance, Platitude and False Promises’, the operator at Talk Talk could put me through to the company chaplain.