
Wednesday, 23 June 2010

Visualising the metadata universe

No blog postings for almost three months and then two come along at once...  I thought it worth drawing readers' attention to the recent work of Jenn Riley of Indiana University.  Jenn is currently the metadata guru for the Indiana University Digital Library Program, and yesterday on the Dublin Core list she announced the output of a project to build a conceptual model of the 'metadata universe'.

As evidenced by some of my blog posts, there are literally hundreds of metadata standards and structured data formats available, all with their own acronym.  This seems to have become more complicated with the emergence of numerous XML-based standards in the early to mid noughties, and the more recent proliferation of RDF vocabularies for the Semantic Web and the associated Linked Data drive.  What formats exist?  How do they relate to each other?  For which communities of practice are they optimised, e.g. the information industry or the cultural sector?  What are the metadata standards, technical standards and vocabularies that I should be cognisant of in my area?  And so the question list goes on...


These questions can be difficult to answer, and it is for this reason that Jenn Riley has produced a gigantic poster diagram (above) entitled, 'Seeing standards: a visualization of the metadata universe'.  The diagram achieves what a good model should, i.e. simplifying complex phenomena and presenting a large volume of information in a condensed way.  As the website blurb states:
"Each of the 105 standards listed here is evaluated on its strength of application to defined categories in each of four axes: community, domain, function, and purpose. The strength of a standard in a given category is determined by a mixture of its adoption in that category, its design intent, and its overall appropriateness for use in that category."
A useful conceptual tool for academics, practitioners and students alike.  A glossary of metadata standards in either poster or pamphlet form is also available.

Tuesday, 22 June 2010

Goulash all round: Linked Data at NSZL

I meant to blog about this as soon as the news emerged in mid-April but University bureaucracy and research project demands prevented it: Adam Horvath (Director of Informatics) at the National Széchényi Library (NSZL) (or National Library of Hungary, if you prefer) announced on the Semantic Web Linking Open Data Project email list that the NSZL have exposed their entire OPAC and digital library as Linked Data - that's correct, their entire OPAC and digital library has been published as Linked Data. This includes corresponding authority data, with all nodes represented using Cool URIs.

The RDF vocabularies used include Dublin Core RDF for bibliographic metadata, SKOS for subject indexing (in a variety of terminologies) and FOAF for name authority data. Incredible! Not only that, the FOAF descriptions include owl:sameAs statements mapping them to the corresponding dbpedia URIs. For example, here is the FOAF data pertaining to the Hungarian novelist, Jókai Mór:

<?xml version="1.0"?>
<rdf:RDF
xmlns:dbpedia="http://dbpedia.org/property/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns="http://web.resource.org/cc/"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:zs="http://www.loc.gov/zing/srw/">
<foaf:Person rdf:about="http://nektar.oszk.hu/resource/auth/33589">
<dbpedia:deathYear>1904</dbpedia:deathYear>
<dbpedia:birthYear>1825</dbpedia:birthYear>
<foaf:familyName>Jókai</foaf:familyName>
<foaf:givenName>Mór</foaf:givenName>
<foaf:name>Jókai Mór (1825-1904)</foaf:name>
<foaf:name>Mór Jókai</foaf:name>
<foaf:name>Jókai Mór</foaf:name>
<owl:sameAs rdf:resource="http://dbpedia.org/resource/M%C3%B3r_J%C3%B3kai"/>
</foaf:Person>
</rdf:RDF>


Visit the dbpedia data noted above for fun.
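For anyone who fancies poking at this data programmatically, here is a minimal sketch using Python and rdflib (my choice of library, not anything the NSZL prescribe) which dereferences the authority URI shown above and prints out the names and the owl:sameAs link; it assumes the URI still serves the RDF shown.

from rdflib import Graph, URIRef
from rdflib.namespace import FOAF, OWL

# Authority URI taken from the record above; assumes it still
# dereferences to the RDF/XML shown.
person = URIRef("http://nektar.oszk.hu/resource/auth/33589")

g = Graph()
g.parse(str(person))  # fetch and parse the RDF behind the Cool URI

# All the names the NSZL authority data records for this person
for name in g.objects(person, FOAF.name):
    print("Name:   ", name)

# The owl:sameAs mapping out to dbpedia
for same in g.objects(person, OWL.sameAs):
    print("Same as:", same)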

Rich SKOS data is also available for a local information retrieval thesaurus. Follow this link for an example of the skos:prefLabel 'magyar irodalom'.

It's a herculean effort from the NSZL which must be commended. And before the Germans did it too! Goulash all round to celebrate - and a photograph of the Hungarian Parliament, methinks.

Tuesday, 26 January 2010

Renaissance of thesaurus-enhanced information retrieval?

As my students in BSNIM3034 Content Management have been learning recently, semantics play a huge role in the levels of recall and precision an information retrieval system can achieve. Put simply, computers are great at interpreting syntax but are far too dumb to understand semantics or the intricacies of human language. This has historically been – and currently remains – the trump card of metadata proponents, and it is something the Semantic Web is attempting to resolve with its use of structured data too. The creation of metadata involves a human: a cataloguer who performs a 'conceptual analysis' of the information object in question to determine its 'aboutness'. They then translate this into the concepts prescribed in a controlled vocabulary or encoding scheme (e.g. taxonomy, thesaurus, etc.) and create other forms of descriptive and administrative metadata. All this improves recall and precision (e.g. conceptually similar items are retrieved and conceptually dissimilar items are suppressed).

As good as they are these days, retrieval systems based on automatic indexing (i.e. most web search engines, including Google, Bing, Yahoo!, etc.) suffer from the 'syntax problem'. They provide what appears to be high recall coupled with poor precision. This is the nature of the search engine beast. Conceptually similar items are ignored because such systems are unable to tell that 'Haematobia irritans' is a synonym of 'horn flies', or that 'java' is a term fraught with homonymy (e.g. 'java' the programming language, 'java' the island within the Indonesian archipelago, 'java' the coffee, and so forth). All of the aforementioned contributes to arguably the biggest problem for user searching: query formulation. Search engines also lack any structured browsing of, say, resource subjects, titles, etc. to assist users in query formulation.

This blog has discussed query formulation in the search process at various times (see this for example). The selection of search terms for query formulation remains one of the most difficult stages in the user's information retrieval process. The huge body of research relating to information-seeking behaviour, information retrieval, relevance feedback, and human-computer interaction attests to this. One of the techniques used to assist users in query formulation is thesaurus-assisted searching and/or query expansion. Such techniques are not particularly new and are often used successfully in search systems (see Ali Shiri's JASIST paper from 2006).
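To make the idea concrete, here is a toy Python sketch of my own (not from Shiri's paper, and far cruder than any real system): the user's term is looked up in a small synonym ring and the query is expanded with the equivalent terms before being submitted.

# A toy synonym ring; a real information retrieval thesaurus would also
# carry broader, narrower and related terms, not just equivalents.
SYNONYMS = {
    "horn flies": ["haematobia irritans"],
    "haematobia irritans": ["horn flies"],
    "murder": ["homicide"],
    "homicide": ["murder"],
}

def expand_query(term):
    """Expand a search term with its equivalents, OR'd together."""
    terms = [term.lower()] + SYNONYMS.get(term.lower(), [])
    return " OR ".join('"%s"' % t for t in terms)

print(expand_query("horn flies"))
# "horn flies" OR "haematobia irritans"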

Last week, however, Google announced adjustments to their search service. The adjustments are particularly significant because they are an attempt to control for synonyms. Their approach is based on 'contextual language analysis' rather than the use of information retrieval thesauri. The blog reads:
"Our systems analyze petabytes of web documents and historical search data to build an intricate understanding of what words can mean in different contexts [...] Our synonyms system is the result of more than five years of research within our web search ranking team. We constantly monitor the quality of the system, but recently we made a special effort to analyze synonyms impact and quality."
Firstly, this is certainly positive news. Synonyms – as noted above – are a well-known phenomenon which has blighted the effectiveness of automatic indexing in retrieval. But on the negative side – and not to belittle Google's efforts, as they are dealing with unstructured data – Google are only dealing with single words or simple phrases: 'song lyrics' and 'song words', or 'homicide' and 'murder' (examples provided by Google in their blog posting). They are dealing with words in a Roget's Thesaurus sense, rather than compound terms in an information retrieval thesaurus sense – and it is the latter which will ultimately be more useful in improving recall and precision. This is, after all, why information retrieval thesauri have historically been used in searching.

More interesting will be Google's exploration of homonymous terms. Homonyms are more complex than synonyms - are they, perhaps for the foreseeable future, an intractable problem?

Friday, 26 June 2009

Read all about it: interesting contributions at ISKO-UK 2009

I had the pleasure of attending the ISKO-UK 2009 conference earlier this week at University College London (UCL), organised in association with the Department of Information Studies. This was my first visit to the home of the architect of Utilitarianism, Jeremy Bentham, and my first to the nearby St. Pancras International since it was revamped - and what a smart train station it is.

The ISKO conference theme was 'content architecture', with a particular focus on:
  • "Integration and semantic interoperability between diverse resources – text, images, audio, multimedia
  • Social networking and user participation in knowledge structuring
  • Image retrieval
  • Information architecture, metadata and faceted frameworks"
The underlying themes throughout most papers were those related to the Semantic Web, Linked Data, and other Semantic Web inspired approaches to resolving or ameliorating common problems within our disciplines. There were a great many interesting papers delivered and it is difficult to say something about them all; however, for me, there were particular highlights (in no particular order)...

Libo Eric Si (et al.) from the Department of Information Science at Loughborough University described research to develop a prototype middleware framework linking disparate terminology resources in order to facilitate subject cross-browsing of information and library portal systems. A lot of work has already been undertaken in this area (see, for example, the HILT project (a project in which I used to be involved) and CrissCross), so it was interesting to hear about his 'bag' approach in which – rather than using precise mappings between different Knowledge Organisation Systems (KOS) (e.g. thesauri, subject heading lists, taxonomies, etc.) - "a number of relevant concepts could be put into a 'bag', and the bag is mapped to an equivalent DDC concept. The bag becomes a very abstract concept that may not have a clear meaning, but based on the evaluation findings, it was widely-agreed that using a bag to combine a number of concepts together is a good idea".
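To illustrate the 'bag' idea, here is a toy Python sketch of my own (not Si et al.'s implementation; the concepts and DDC notation are chosen purely for illustration): concepts drawn from several KOS are grouped into a bag, and the bag as a whole is mapped to a single DDC concept.

# Each bag groups roughly-equivalent concepts from different KOS and is
# mapped, as a whole, to one DDC concept. All values are illustrative.
bags = [
    {
        "members": {
            ("LCSH", "Information retrieval"),
            ("Local thesaurus", "information retrieval systems"),
        },
        "ddc": "025.04",
    },
]

def ddc_for(kos, concept):
    """Return the DDC concept that a (KOS, concept) pair maps to via its bag."""
    for bag in bags:
        if (kos, concept) in bag["members"]:
            return bag["ddc"]
    return None

print(ddc_for("LCSH", "Information retrieval"))  # 025.04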

Brian Matthews (et al.) reported on an evaluation of social tagging and KOS. In particular, they investigated ways of enhancing social tagging via KOS, with a view to improving the quality of tags and, in turn, retrieval performance. A detailed and robust methodology was provided, but essentially groups of participants were given the opportunity to tag resources using free-text tags, controlled terms (i.e. from KOS), or terms displayed in a tag cloud, all within a specially designed demonstrator. Participants were later asked to try the alternative tools in order to gather data on the nature of user preferences. There are numerous findings - and a pre-print of the paper is already available on the conference website so you can read these yourself - but the main ones can be summarised from their paper as follows and were in some cases surprising:
  • "Users appreciated the benefits of consistency and vocabulary control and were potentially willing to engage with the tagging system;
  • There was evidence of support for automated suggestions if they are appropriate and relevant;
  • The quality and appropriateness of the controlled vocabulary proved to be important;
  • The main tag cloud proved problematic to use effectively; and,
  • The user interface proved important along with the visual presentation and interaction sequence."
The user preference for controlled terms was reassuring. In fact, as Matthews et al. report:
"There was general sentiment amongst the depositors that choosing terms from a controlled vocabulary was a "Good Thing" and better than choosing their own terms. The subjects could overall see the value of adding terms for information retrieval purposes, and could see the advantages of consistency of retrieval if the terms used are from an authoritative source."
Chris Town from the University of Cambridge Computer Laboratory presented two (see [1], [2]) equally interesting papers relating to image retrieval on the Web. Although images and video now comprise the majority of Web content, the vast majority of retrieval systems essentially use the text, tags, etc. that surround images in order to make assumptions about what the image might be. Of course, using any major search engine we discover that this approach is woefully inaccurate. Dr. Town has developed improved approaches to content-based image retrieval (CBIR) which provide a novel way of bridging the 'semantic gap' between the retrieval model used by the system and that of the user. His approach is founded on the "notion of an ontological query language, combined with a set of advanced automated image analysis and classification models". This approach has been so successful that he has founded his own company, Imense. The difference in performance between Imense and Google is staggering and has to be seen to be believed. Examples can be found in his presentation slides (which will be on the ISKO website soon), but can also be observed by simply messing around on the Imense Picture Search.

Chris Town's second paper essentially explored how best to do the CBIR image processing required for the retrieval system. According to Dr. Town there are approximately 20 billion images on the web, with the majority at a high resolution, meaning that by his calculation it would take 4000 years to undertake the necessary CBIR processing to facilitate retrieval! Phew! Large-scale grid computing options therefore have to be explored if the approach is to be scalable. Chris Town and his colleague Karl Harrison therefore undertook a series of CBIR processing evaluations by distributing the required computational task across thousands of Grid nodes. This distributed approach resulted in the processing of over 25 million high resolution images in less than two weeks, thus making grid processing a scalable option for CBIR.

Andreas Vlachidis (et al.) from the Hypermedia Research Unit at the University of Glamorgan described the use of 'information extraction' employing Natural Language Processing (NLP) techniques to assist in the semantic indexing of archaeological text resources. Such 'Grey Literature' is a good test bed as more established indexing techniques are insufficient for meeting user needs. The aim of the research is to create a system capable of being "semantically aware" during document indexing. Sounds complicated? Yes – a little. Vlachidis is achieving this by using a core cultural heritage ontology and the English Heritage Thesauri to support the 'information extraction' process, providing "a semantic framework in which indexed terms are capable of supporting semantic-aware access to on-line resources".

Perhaps the most interesting aspect of the conference was that it was well attended by people from outside the academic fraternity, and as such there were papers on how their organisations are doing innovative work with a range of technologies, specifications and standards which, to a large extent, remain the preserve of researchers and academics. Papers were delivered by technical teams at the World Bank and Dow Jones, for example. Perhaps the most interesting contribution from the 'real world', though, was that delivered by Tom Scott, a key member of the BBC's online and technology team. Tom is a key proponent of the Semantic Web and Linked Data at the BBC and his presentation threw light on BBC activity in this area – and rather coincidentally complemented an accidental discovery I made a few weeks ago.

Tom currently leads the BBC Earth project, which aims to bring more of the BBC's Natural History content online and bring the BBC into the Linked Data cloud, thus enabling intelligent linking, re-use and re-aggregation with what's already available. He provided interesting examples of how the BBC is exposing structured data about all forms of BBC programming on the Web by adopting a Linked Data approach, and he expressed a desire for users to traverse detailed and well-connected RDF graphs. Says Tom on his blog:
"To enable the sharing of this data in a structured way, we are using the linked data approach to connect and expose resources i.e. using web technologies (URLs and HTTP etc.) to identify and link to a representation of something, and that something can be person, a programme or an album release. These resources also have representations which can be machine-processable (through the use of RDF, Microformats, RDFa, etc.) and they can contain links for other web resources, allowing you to jump from one dataset to another."
Whilst Tom conceded that this work is small compared to the entire output and technical activity at the BBC, it still constitutes a huge volume of data and is significant owing to the BBC's pre-eminence in broadcasting. Tom even reported that a SPARQL end point will be made available to query this data. I had actually hoped to ask Tom a few questions during the lunch and coffee breaks, but he was such a popular guy that in the end I lost my chance, such is the existence of a popular techie from the Beeb.
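The endpoint was not live at the time of writing, so purely by way of illustration, here is the sort of query one might throw at it once it is, written in Python with the SPARQLWrapper library and using the Programmes Ontology I stumbled across a few weeks ago (see the post below). The endpoint URL is a placeholder and the whole thing is an assumption on my part.

from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint: the BBC SPARQL endpoint was only promised, not announced.
sparql = SPARQLWrapper("http://example.org/bbc/sparql")

# List some episode titles and short synopses via the BBC Programmes Ontology
sparql.setQuery("""
    PREFIX po: <http://purl.org/ontology/po/>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?episode ?title ?synopsis WHERE {
        ?episode a po:Episode ;
                 dc:title ?title ;
                 po:short_synopsis ?synopsis .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["title"]["value"], "-", row["synopsis"]["value"])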

Pre-print papers from the conference are available on the proceedings page of the ISKO-UK 2009 website; however, fully peer reviewed and 'added value' papers from the conference are to be published in a future issue of Aslib Proceedings.

Friday, 12 June 2009

Serendipity reveals ontological description of BBC programmes

I have been enjoying Flight of the Conchords on BBC Four recently. Unfortunately, I missed the first couple of episodes of the new series. So that I could configure my Humax HDR to record all future episodes, I visited the BBC website to access their online schedule. It was while doing this that I discovered visible usage of the BBC's Programmes Ontology. The programme title (i.e. Flight of the Conchords) is hyperlinked to an RDF file on this schedule page.

The Semantic Web is supposed to provide machine-readable data, not human-readable data, and hyperlinking to an RDF/XML file is clearly a temporary glitch at the Beeb. After all, 99.99% of BBC users clicking on these links would be hoping to see further details about the programme, not to be presented with a bunch of angled brackets. Nevertheless, this glitch provides an interesting insight for us since it reveals the extent to which RDF data about BBC programming is being exposed on the Semantic Web, and the vocabularies the BBC are using. Researchers at the BBC are active in dissemination (e.g. ESWC2009, XTech 2008), but it's not often that you serendipitously discover this sort of stuff in action at an organisation like this.

The Programmes Ontology is based significantly on the Music Ontology Specification and the FOAF Vocabulary Specification, but the BBC's data also deploys Dublin Core and SKOS (though in the example below SKOS appears only in the namespace declarations).

Oh, and the next episode of Flight of the Conchords is on tonight at 23:00, BBC Four.

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
xmlns:foaf = "http://xmlns.com/foaf/0.1/"
xmlns:po = "http://purl.org/ontology/po/"
xmlns:mo = "http://purl.org/ontology/mo/"
xmlns:skos = "http://www.w3.org/2008/05/skos#"
xmlns:time = "http://www.w3.org/2006/time#"
xmlns:dc = "http://purl.org/dc/elements/1.1/"
xmlns:dcterms = "http://purl.org/dc/terms/"
xmlns:wgs84_pos= "http://www.w3.org/2003/01/geo/wgs84_pos#"
xmlns:timeline = "http://purl.org/NET/c4dm/timeline.owl#"
xmlns:event = "http://purl.org/NET/c4dm/event.owl#">

<rdf:Description rdf:about="/programmes/b00l22n4.rdf">
<rdfs:label>Description of the episode Unnatural Love</rdfs:label>
<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-06-02T00:14:09+01:00</dcterms:created>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2009-06-02T00:14:09+01:00</dcterms:modified>
<foaf:primaryTopic rdf:resource="/programmes/b00l22n4#programme"/>
</rdf:Description>

<po:Episode rdf:about="/programmes/b00l22n4#programme">
<dc:title>Unnatural Love</dc:title>
<po:short_synopsis>Jemaine accidentally goes home with an Australian girl he meets at a nightclub.</po:short_synopsis>
<po:medium_synopsis>Comedy series about two Kiwi folk musicians in New York. When Bret and Jemaine go out nightclubbing with Dave, Jemaine accidentally goes home with an Australian girl.</po:medium_synopsis>
<po:long_synopsis>When Bret and Jemaine go out nightclubbing with Dave, Jemaine accidentally goes home with an Australian girl. At first plagued by shame and self-doubt, he comes to care about her, much to Bret and Murray&#39;s annoyance. Can their love cross the racial divide?</po:long_synopsis>
<po:masterbrand rdf:resource="/bbcfour#service"/>
<po:position rdf:datatype="http://www.w3.org/2001/XMLSchema#int">5</po:position>
<po:genre rdf:resource="/programmes/genres/comedy/music#genre" />
<po:genre rdf:resource="/programmes/genres/comedy/sitcoms#genre" />
<po:version rdf:resource="/programmes/b00l22my#programme" />
</po:Episode>

<po:Series rdf:about="/programmes/b00kkptn#programme">
<po:episode rdf:resource="/programmes/b00l22n4#programme"/>
</po:Series>

<po:Brand rdf:about="/programmes/b00kkpq8#programme">
<po:episode rdf:resource="/programmes/b00l22n4#programme"/>
</po:Brand>
</rdf:RDF>
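As a footnote, the RDF above is trivial to consume programmatically. Here is a minimal sketch using Python and rdflib (my choice of library; the full address of the .rdf file is an assumption on my part) which reads the episode description and prints the title and synopsis:

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PO = Namespace("http://purl.org/ontology/po/")
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
# Assumes the RDF/XML above is served from this address
g.parse("http://www.bbc.co.uk/programmes/b00l22n4.rdf", format="xml")

# Find every po:Episode in the graph and print its title and synopsis
for episode in g.subjects(RDF.type, PO.Episode):
    print("Title:   ", g.value(episode, DC.title))
    print("Synopsis:", g.value(episode, PO.short_synopsis))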

Thursday, 11 June 2009

Cracking open metadata and cataloguing research with Resource Description & Access (RDA)

I have been taking the opportunity to catch up with some recently published literature over the past couple of weeks. While perusing the latest issue of the Bulletin of the American Society for Information Science and Technology (the magazine which complements JASIST), I read an interesting article by Shawne D. Miksa (associate professor at the College of Information, University of North Texas). Miksa's principal research interests reside in metadata, cataloguing and indexing. She has been active in disseminating work on Resource Description & Access (RDA) and has a book in the pipeline designed to demystify it.

RDA has been in development for several years now; it is the successor to AACR2 and provides rules and guidance on the cataloguing of information entities. I use the phrase 'information entities' since RDA departs significantly from AACR2. The foundations of AACR2 were created prior to the advent of the Web and this remains problematic given the digital and new media information environment in which we now exist. Of course, more recent editions of AACR2 have attempted to better accommodate these developments, but fire fighting was always the order of the day. The now re-named Joint Steering Committee for the Development of RDA has known for quite some time that an entirely new approach was required – and a few years ago radical changes to AACR2 were announced. As my ex-colleague Gordon Dunsire describes in a recent D-Lib Magazine article:
"RDA: Resource Description and Access is in development as a new standard for resource description and access designed for the digital world. It is being built on the foundation established for the Anglo-American Cataloguing Rules (AACR). Although it is being developed for use primarily in libraries, it aims to attain an effective level of alignment with the metadata standards used in related communities such as archives, museums and publishers, and to provide a better fit with emerging database technologies."
The ins and outs of RDA are a bit much for this blog; suffice to say that RDA is ultimately designed to improve the resource discovery potential of digital libraries and other retrieval systems by utilising the FRBR conceptual entity-relationship model (see this entity-relationship diagram at the FRBR blog). FRBR provides a holistic approach to users' retrieval requirements by establishing the relationships between information entities and allowing users to traverse the hierarchical relationships therein. I am an advocate of FRBR and appreciate its retrieval potential. Indeed, I often direct postgraduate students to Fiction Finder, an OCLC Research prototype which demonstrates the FRBR Work-Set Algorithm.
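For readers unfamiliar with FRBR, here is a minimal Python sketch of the Group 1 entities – my own illustration of the hierarchy a FRBR-ised catalogue lets users traverse, not RDA's data model or OCLC's algorithm, and the example data is invented:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:               # a single copy held somewhere
    identifier: str

@dataclass
class Manifestation:      # a particular edition or publication
    publisher: str
    year: int
    items: List[Item] = field(default_factory=list)

@dataclass
class Expression:         # a realisation of the work, e.g. a translation
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)

@dataclass
class Work:               # the abstract intellectual creation
    title: str
    expressions: List[Expression] = field(default_factory=list)

# Invented example: one work, two expressions, grouped under a single result
hamlet = Work("Hamlet", [
    Expression("English", [Manifestation("Example Press", 2001, [Item("copy-1")])]),
    Expression("Hungarian", [Manifestation("Example Press", 1988, [Item("copy-2")])]),
])
print(len(hamlet.expressions), "expressions of", hamlet.title)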

Reading Miksa's article was interesting for two reasons. Firstly, RDA has fallen off my radar recently. I used to be kept abreast of RDA developments through the activities of my colleague Gordon, who also disseminates widely on RDA and feeds into the JSC's work. Miksa's article – which announces the official release of RDA in the second half of 2009 – was almost like being in a time machine! RDA is here already! Wow! It seems like only last week that the JSC started work on RDA (...but it was actually over 5 years ago…).

The development of RDA has been extremely controversial, and Miksa alludes to this in her article – metadata gurus clashing with traditional cataloguers clashing with LIS revolutionaries. It has been pretty ugly at times. But secondly – and perhaps more importantly – Miksa's article is a brilliant call to arms for more metadata research. Not only that, she notes areas where extensive research will be mandatory to bring truly FRBR-ised digital libraries to fruition. This includes consideration of how this impacts upon LIS education.

A new dawn? I think so… Can the non-believers grumble about that? Between the type of developments noted earlier and RDA, the future of information organisation is alive and kicking.

Thursday, 12 February 2009

FOAF and political social graphs

While catching up on some blogs I follow, I noticed that the Semantic Web-ite Ivan Herman posted comments regarding the US Congress SpaceBook – a US political answer to Facebook. He, in turn, was commenting on a blog post by ProgrammableWeb – the website dedicated to keeping us informed of the latest web services, mashups, and Web 2.0 APIs.

From a mashup perspective, SpaceBook is pretty incredible, incorporating (so far) 11 different Web APIs. However, for me SpaceBook is interesting because it makes use of semantic data provided via FOAF and the XFN microformat. To do this SpaceBook makes good use of the Google Social Graph API, which aims to harness such data to generate social graphs. The Social Graph API has been available for almost a year but has had quite a low profile until now. Says the API website:
"Google Search helps make this information more accessible and useful. If you take away the documents, you're left with the connections between people. Information about the public connections between people is really useful -- as a user, you might want to see who else you're connected to, and as a developer of social applications, you can provide better features for your users if you know who their public friends are. There hasn't been a good way to access this information. The Social Graph API now makes information about the public connections between people on the Web, expressed by XFN and FOAF markup and other publicly declared connections, easily available and useful for developers."
Bravo! This creates some neat connections. Unfortunately – and as Ivan Herman regrettably notes – the generated FOAF data is inserted into Hillary Clinton's page as an HTML comment, rather than as a separate .rdf file or as RDFa. The FOAF file is also a little limited, but it does include links to her Twitter account. More puzzling for me, though, is why the embedded XHTML metadata does not use Qualified Dublin Core! Let's crank up the interoperability, please!

Friday, 8 August 2008

Where to next for social metadata? User Labor Markup Language?

The noughties will go down in history as a great decade for metadata, largely as a result of XML and RDF. Here are some highlights so far, off the top of my head: MARCXML, MODS, METS, MPEG-21 DIDL, IEEE LOM, FRBR, PREMIS. Even Dublin Core – an initiative born in the mid-1990s – has taken off in the noughties owing to its extensibility, variety of serialisations, growing number of application profiles and implementation contexts. Add to this other structured data, such as Semantic Web specifications (some of which are optimised for expressing indexing languages) like SKOS, OWL, FOAF, other RDF applications, and microformats. These are probably just a perplexing bunch of acronyms and jargon for most folk; but that's no reason to stop additions to the metadata acronym hall of fame quite yet...!

Spurred by social networking, so-called 'social metadata' has been emerging as a key area of metadata development in recent years. For some, developments such as collaborative tagging are considered social metadata. To my mind – and those of others – social metadata is something altogether more structured, enabling interoperability, reuse and intelligence. Semantic Web specifications such as FOAF provide an excellent example of social metadata: a means of describing and graphing social networks, inferring and describing relationships between like-minded people, establishing trust networks, facilitating DataPortability, and so forth. However, social metadata is increasingly becoming concerned with modelling users' online social interactions in a number of ways (e.g. APML).

A recently launched specification which grabbed my attention is the User Labor Markup Language (ULML). ULML is described as an "open protocol for sharing the value of user's labor across the web" and embodies the notion that making such labour metric data more readily accessible and transparent is necessary to underpin the fragile business models of social networking services and applications. According to the ULML specification:
"User labor is the work that people put in to create, improve, and maintain their existence in social web. In more detail, user labor is the sum of all activities such as:
  • generating assets (e.g. user profiles, images, videos, blog posts),
  • creating metadata (e.g. tagging, voting, commenting etc.),
  • attracting traffic (e.g. incoming views, comments, favourites),
  • socializing with other people (e.g. number of friends, social influence)
in a social web service".
In essence then, ULML simply provides a means of modelling and sharing users' online social activities. ULML is structured much like RSS, with three major document elements (action, reaction and network). Check out the simple Flickr example below (referenced from the spec.). An XML editor screen dump is also included for good measure:

<?xml version="1.0" encoding="UTF-8"?>
<ulml version="0.1">
<channel>
<title>Flickr / arikan</title>
<link>http://flickr.com/photos/arikan</link>
<description>arikan's photos on Flickr.</description>
<pubDate>Thu, 06 Feb 2008 20:55:01 GMT</pubDate>
<user>arikan</user>
<memberSince>Thu, 01 Jun 2005 20:00:01 GMT</memberSince>
<record>
<actions>
<item name="photo" type="upload">852</item>
<item name="group" type="create">4</item>
<item name="photo" type="tag">1256</item>
<item name="photo" type="comment">200</item>
<item name="photo" type="favorite">32</item>
<item name="photo" type="flag">3</item>
<item name="group" type="join">12</item>
</actions>
<reactions>
<item name="photo" type="view">26984</item>
<item name="photo" type="comment">96</item>
<item name="photo" type="favorite">25</item>
</reactions>
<network>
<item name="connection">125</item>
<item name="density">0.167</item>
<item name="betweenness">0.102</item>
<item name="closeness">0.600</item>
</network>
<pubDate>Thu, 06 Feb 2008 20:55:01 GMT</pubDate>
</record>
</channel>
</ulml>


The tag properties within the action and reaction elements are all pretty self-explanatory. Within the network element "connection" denotes the number of friend connections, "density" denotes the number of connections divided by the total number of all possible connections, "closeness" denotes the average distance of that user to all their friends, and "betweenness" the "probability that the [user] lies on the shortest path between any two other persons".
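As a quick illustration of how little machinery is needed to consume this, here is a minimal Python sketch that parses the Flickr example above with the standard library (the local filename is my own invention):

import xml.etree.ElementTree as ET

# Parse the ULML example above, saved locally as arikan.ulml (hypothetical filename)
channel = ET.parse("arikan.ulml").getroot().find("channel")
record = channel.find("record")

# Total up the user's actions and reactions
actions = sum(int(item.text) for item in record.find("actions"))
reactions = sum(int(item.text) for item in record.find("reactions"))
print("%s: %d actions, %d reactions" % (channel.findtext("user"), actions, reactions))

# Network metrics as a simple dict, e.g. {'connection': '125', 'density': '0.167', ...}
network = {item.get("name"): item.text for item in record.find("network")}
print(network)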

Although the specification is couched in a lot of labour theory jargon, ULML is quite a funky idea and is a relatively simple thing to implement at an application level. With the relevant privacy safeguards in place, service providers could make ULML files publicly available, thus better enabling them and other providers to understand users' behaviour via a series of common metrics. This, in turn, could facilitate improved systems design and personalisation since previous user expectations could be interpreted through ULML analysis. Authors of the specification also suggest that a ULML document constitutes an online curriculum vitae of users' social web experience. It provides a synopsis of user activity and work experience. It is, in essence, evidence of how they perform within social web contexts. Say the authors:
"...a ULML document is a tool for users to communicate with the web services upfront and to negotiate on how they will be rewarded in return for their labour within the service".
This latter concept is significant since – as we have discussed before – such Web 2.0 services rely on social activity (i.e. labour) to make their services useful in the first place; more than that, such activity is ultimately necessary to make them economically viable.

Clearly, if implemented, ULML would be automatically generated metadata. It therefore doesn't really relate to the positive metadata developments documented here before, or the dark art itself; however, it is a further recognition that with structured data there lies deductive and inferential power.

Tuesday, 3 June 2008

Harvesting and distributing semantic data: FOAF and Firefox

One of the most attractive aspects of Mozilla Firefox continues to be the incessant supply of useful (and not-so-useful) Extensions. The supply is virtually endless, and long may it continue! (What would I do without Zotero, or Firebug?!) We have seen the emergence of some useful metadata tools, such as Dublin Core metadata viewers, and more recently, Semantic Web widgets. Operator is an interesting example of the latter, harnessing microformats and eRDF in a way that allows users to interact easily with some simple semantic data (e.g. via toolbar operations). Another interesting Firefox extension is Semantic Radar.

Semantic Radar is a "semantic metadata detector" for Firefox and is a product of a wider research project based at DERI, National University of Ireland. Semantic Radar can identify the presence of FOAF, DOAP and SIOC data in a webpage (as well as more generic RDF and RDFa data) and will display the relevant icon(s) in the Firefox status bar to alert the user when such data is found. (See screen shot for FOAF example) Clicking the icon(s) enables users to browse the data using an online semantic browser (e.g. FOAF Explorer, SIOC Browser). In short, Semantic Radar parses web pages for RDF auto-discovery links to discover semantic data. A neat tool to have...
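Just to illustrate the auto-discovery mechanism that Semantic Radar relies on, here is a minimal Python sketch of my own (nothing to do with the extension's actual code) that scans a page for <link> elements typed as application/rdf+xml:

from html.parser import HTMLParser
from urllib.request import urlopen

class RDFLinkFinder(HTMLParser):
    """Collect RDF auto-discovery <link> elements from a page's markup."""
    def __init__(self):
        super().__init__()
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("type") == "application/rdf+xml":
            self.rdf_links.append(attrs.get("href"))

page_url = "http://example.org/"  # hypothetical page to scan
finder = RDFLinkFinder()
finder.feed(urlopen(page_url).read().decode("utf-8", "replace"))
print(finder.rdf_links)  # e.g. links to FOAF, DOAP or SIOC documents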


Semantic Radar has been available for some time, but a recent update to the widget means that it is now possible to automatically 'ping' the Ping The Semantic Web website. Ping The Semantic Web (PTSW) is an online service harvesting, storing and distributing RDF documents. If one of those documents is updated, its author can notify PTSW to that effect by pinging it with the URL of the document. This is an amazingly efficient way to disseminate semantic data. It is also an amazingly effective way for crawlers and other software agents to discover and index RDF data on the web. (URLs indexed by PTSW can be re-used by other web services and can also be viewed on the PTSW website (see screenshot of PTSW for my FOAF file).)


It might just be me, but I’m finally getting the sense that the volume of structured data available on the web - and the tools necessary to harness it - are beginning to reach some sort of critical mass. The recent spate of blog posts echoing this topic bears testament to that (e.g. [1], [2]). Dare I use such a term, but Web 3.0 definitely feels closer than ever.

Friday, 14 March 2008

Shout 'Yahoo!' : more use of metadata and the Semantic Web

Within the lucrative world of information retrieval on the web, Yahoo! is considered an 'old media company'; a company that has gone in a different direction to, say, Google. Yahoo! has been a bit patchy when it comes to openness. It is keen on locking data and widgets down; Google is keen on unlocking data and widgets. And to Yahoo!'s detriment, Google has discovered that there is more money to be made their way, and that users and developers alike are – to a certain extent - very happy with the Google ethos. Google Code is an excellent example of this fine ethos, with the Google Book Search API being a recently announced addition to the Code arsenal.

Since there must be some within Yahoo!'s ranks attributing the company's current fragility to a lack of openness, Yahoo! have recently announced their Yahoo! Search 'open platform'. They might be a little slow in fully committing to openness, but cracking open Yahoo! Search is a big and interesting step. For me, it's particularly interesting...

Yesterday Amit Kumar (Product Management, Yahoo! Search) released further details of the new Yahoo! Search platform. This included (among many other things), a commitment to harnessing the potential of metadata and Semantic Web content. More specifically, this means greater support of Dublin Core, Friend-of-a-Friend (FOAF) and other applications of RDF (Resource Description Framework), Creative Commons, and a plethora of microformats.

Greater use of these initiatives by Yahoo! is great news for the information and computing professions, not least because it may stimulate the wider deployment of the aforementioned standards, thus making the introduction of a mainstream 'killer app' that fully harnesses the potential of structured data actually possible. For example, if your purpose is to be discovered by Google, there is currently no real demand for Dublin Core (DC) metadata to be embedded within the XHTML of a web page, or for you to link to an XML or RDF encoded DC file. Google just doesn't use it. It may use dc.title, but that's about it. That is not to say that it's useless of course. Specialist search tools use it, Content Management Systems (CMS) use it, many national governments use it as the basis for metadata interoperability and resource discovery (e.g. eGMS), it forms the basis of many Information Architecture (IA) strategies, etc, etc. But this Google conundrum has been a big problem for the metadata, indexing and Semantic Web communities (see, for example). Their tools provide so much potential; but this potential is generally confined to particular communities of practice. Your average information junkie has never used a Semantic Web tool in his/her life. But if a large scale retrieval device (i.e. Yahoo!) showed some real commitment to harnessing structured data, then it could usher in a new age of large scale information retrieval; one based on an intelligent blend of automatic indexing, metadata, and Semantic Web tools (e.g. OWL, SKOS, FOAF, etc.). In short, there would be a huge demand for the 'data web' outside the distinct communities of practice over which librarians, information managers and some computer scientists currently preside. And by implication this would entail greater demand for their skills. All we need to do now is get more people to create metadata and ontologies!

Given the fragile state of Yahoo!, initiatives like this (if they come to fruition!) should be applauded. Shout 'Yahoo!' for Yahoo! I'm not sure if it will prevent a Microsoft takeover though...