Monday 26 September 2011

New term underway: banging on about jobs and placements

A new term is under way at the Liverpool Business School, and this week is mostly induction. While attending the cohort meetings with the 2nd and final year Business Information Systems students, I have taken the chance to go on about my key rants.

These are born partly out of my experience recruiting over the summer, namely the necessity of getting something on your CV that gives evidence of your greatness.

While reading new graduate CVs in the summer I was horrified to see people who began with a splurge of waffle about what great team players and self-starters they were, backed by no evidence. Then they list the modules and technologies they covered on their courses, which everyone else on their course has also covered. And then they finish off by telling me that they like to socialise with their friends, watch a movie, play computer games and possibly stay up to date with technology.

They must think this separates them from the crowd, but it is rare to come across anyone who doesn't like spending time with their friends, or at least admits to it in a CV. And the other three things basically amount to sitting in front of a monitor of some type: watching movies, playing games and web surfing. Employers, mine included, are not going to expect this on its own to push the company forward in these tough times.

We all encourage the students to think about these things, but of course we are often frustrated that the students don't take this seriously until it is too late. As it happened, the university's societies fair was just a hundred yards away, so I desperately tried to push the students in that direction in the hope that being treasurer of the university plate balancing team might put some evidence behind their inevitable claim to be a team player, by showing that a group of peers in the team were prepared to trust them with something.

Meanwhile Matthew Baxter-Reynolds writes in the Guardian that recruitment companies in the software industry are creating a largely inefficient barrier between the 'talent' and the companies, roughly suggesting that more candidates should send their CVs direct to potential employers.

I had commented to my colleagues in my day job that we don't seem to get unsolicited CVs any more. Once upon a time we kept a file of them as they accumulated, but now they don't appear. This article seemed to confirm that this was not just my experience. Meanwhile I get dozens of unsolicited emails from recruitment agencies every week and probably about 150 phone calls a year from the same.

My conclusion is that my students should get out there and send out their CVs, looking for jobs or placements depending on their position on the course. The other rant to my (Information Systems) students was that they should get a black belt in Microsoft Office, particularly Excel but also Access, so that they can do things that others can't once they get that job or placement.

When I complete these rants, I am always optimistic that the students will have heard the urgent message and set themselves on a path to a solid career; I choose to ignore the evidence of previous years. Still, as Tesco say, every little helps.

Monday 19 September 2011

Does a 2:1 in computing mean you can write a simple computer program?

As a business person and academic I have two positions of interest in graduates and their capability. At JMU we are always trying to balance what employers want, what students want to do and what we are able to teach. These things are not always in line, of course. As an employer in software development I am looking for people with a demonstrable aptitude and broader long term promise.

At Village Software we recently advertised for a graduate trainee for £15k. We had about 50 applicants; of these we spoke to about a dozen and invited 6 to interview, of whom 4 attended. There are two things to note here. I’ll consider the most shocking one now, which is the question: can you acquire a computing related 2:1 from XYZ University without being able to write a simple computer program? Perhaps elsewhere I’ll consider the CVs that don’t get you a phone call.

We set the interviewees the common and much discussed Fizz Buzz test. There was a frenzy sometime around 2007 about the fact that people applying for programming jobs couldn’t program. The thought was that this was some kind of zombie attack of qualified people without basic competence who were flooding the industry, and we needed some way to tell the zombie programmers from the real thing. This coalesced around the Fizz Buzz test, a simple programming exercise along the lines of:


Write a program that prints the numbers from 1 to 100. But for multiples of three print "Fizz" instead of the number and for the multiples of five print "Buzz". For numbers which are multiples of both three and five print "FizzBuzz".
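For reference, one possible answer can be sketched in a few lines of Python. This is only an illustration of the standard exercise as worded above, not the variant we set at interview, and the function name `fizz_buzz` is simply a label chosen here.

```python
def fizz_buzz(n: int) -> str:
    """Return the line the Fizz Buzz exercise expects for a single number."""
    # Multiples of both three and five are exactly the multiples of 15,
    # so this case has to be checked before the other two.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


if __name__ == "__main__":
    # Print the numbers from 1 to 100 with the substitutions applied.
    for i in range(1, 101):
        print(fizz_buzz(i))
```

The only real trap is the order of the checks: testing for three or five first would never reach the "FizzBuzz" case.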

Graduates' failure to pass this test is much discussed, for example in "why can't programmers program", and there are whole blog posts on how to write answers to it, such as "Geek School Fizz Buzz". This makes it surprising that none of our four candidates had heard of the problem.

We set a slight variant on the theme, fearing, unnecessarily, that candidates might have heard of the problem and learnt a solution.

We asked other questions and had a whole stack of other things, but this question was decisive. As an employer I look at people's degree grade and subject and wonder what they tell me. There is a general question of whether a 2:1 in a software subject is a guarantee that the student can write a simple computer program. I’m afraid the answer is that it isn’t, although 2 of our candidates did very well, so it is perhaps an indication of at least a 50/50 chance that a graduate can write a computer program.

This is bad for universities. The pressure from potential and actual students is to increase our ‘value add’ and enable them to get a 2:1, otherwise they might buy elsewhere and we’ll be out of business. But the business stakeholders want to see that degree awards represent some measure of useful competence in the chosen subject. A university that lets out a computer student with a 2:1 who is unable to complete a simple program in any language of their choice is devaluing its credibility. Our evidence is anecdotal, and every student on a computer course certainly has the facility to learn to achieve this level of competence, so they only have themselves to blame.

Unless of course they didn’t have the aptitude in the first place, in which case the university has failed to select suitable students for its course. Perhaps they should be doing this test on the way in. In fact, why would someone unable to write a brief program such as this even start on a three or four year course of study in this field?

Later today I am trying out this test on some final year Liverpool University students (not represented in our interviews) looking to do a final year project with us; we shall see how they do. Hats off to Liverpool Hope, by the way: their candidate swept through the technical tests and is now sitting tapping away 10 yards away. Saving the day for the home team, we also had one John Moores candidate who pulled it off, but alas there is evidence that you can get a 2:1 from John Moores without being able to do so.

Friday 16 September 2011

Blog lifecycles

Understanding the lifecycle and the dynamics of blogs has been a topic of interest within computing and information science for many years.  Blogs exhibit peculiar social and temporal features thus making them a rich domain of study and, quite frankly, more interesting than static web pages.  Since this blog is almost four years old, now seems like an appropriate time to review the health of the ISG Blog.  It is not my intention to expose our blog to the kind of detailed analysis one would expect to find in the pages of JASIST; but let's look at some of the most basic numbers...

Now, in an ideal world, or a sensible one for that matter, one would be able to output a .csv file from Blogger which would contain a wealth of data on the number of blog posts, the hits these posts have attracted (per week and per month), number of comments, the identity of referring sites, etc, etc.  Alas, most of this data is unavailable, and any data that is available has to be generated manually, making any serious analysis difficult.  Despite these obstacles I displayed sufficient stamina to manually generate some basic blog data and to describe it using the Dataset Publishing Language (DSPL) for running through the Google Public Data Explorer.  (There still remains some XML pain but I did it anyway...).  Data available pertains to the number of blog postings, their total hits (2007-2011), number of comments per blog post and the length of postings.  Data Explorer provides a good overview of the data but doesn't perform any statistics or analysis. I have therefore included some further data analysis below. Anyway, some of the headline figures are as follows:
  • 85 blog postings have been published since October 2007.
  • George Macgregor (i.e. me) was the most prolific blogger, accounting for 87% of all posts.  Johnny Read was next in line, producing 9.41% of all posts; Francis Muir, Jack OFarrell and Keith Trickey each contributed 1.18% of the total posts.
  • 2009 was the most productive year for the blog, with 33 posts being published, accounting for 38.82% of the blog's total posts.
  • The mean number of page views was 29 per blog post (M = 29; SD = 90; IQR = 18).
  • On average, 0.8 reader comments were made in response to the blog postings (M = 0.8; SD = 1.23; IQR = 1).
  • The most read post was this one from October 2009, attracting 751 page views.
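The percentage shares in these headline figures can be checked with a little arithmetic. The sketch below is a minimal illustration: the per-author post counts are not stated in this post and have been inferred from the quoted percentages of 85 posts (an assumption), and the helper name `share` is just a label chosen here.

```python
TOTAL_POSTS = 85  # stated in the headline figures

# Per-author counts are ASSUMED: inferred from the quoted percentages,
# not taken from the underlying Blogger data.
posts_by_author = {
    "George Macgregor": 74,  # the remainder: 85 - 8 - 3
    "Johnny Read": 8,        # 9.41% of 85
    "Francis Muir": 1,       # 1.18% of 85
    "Jack OFarrell": 1,
    "Keith Trickey": 1,
}


def share(count: int, total: int = TOTAL_POSTS) -> float:
    """Percentage share of the total, rounded to two decimal places."""
    return round(100 * count / total, 2)


shares = {author: share(count) for author, count in posts_by_author.items()}
share_2009 = share(33)  # 33 posts were published in 2009

if __name__ == "__main__":
    for author, pct in shares.items():
        print(f"{author}: {pct}%")
    print(f"2009: {share_2009}%")
```

Under these inferred counts the shares sum to the full 85 posts, which is a quick sanity check on the quoted percentages.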
Let's look at the last headline figure first.

Figure 1: ISG Blog hits (2007-2011) by author, as viewed in Google Public Data Explorer. 
Blogger provides summary data on blog posting page views, or "hits" if you prefer.  I extracted these manually to get a measure of post impact.  An average of 29 page views is disappointing and – as you can see from the Data Explorer – although there are some traffic spikes which account for the high data dispersion (i.e. SD = 90; IQR = 18), some of the individual page view figures are very low.  However, we must remember two important caveats:  Google Analytics (used to compile the Blogger data) uses a rigid definition of page views in order to flush out transient visitors.  Secondly, many – and perhaps the majority of those dedicated to reading the ISG Blog – will read postings using an RSS reader.  Unfortunately, even Google Analytics can't capture data on consumption made via RSS.  It is therefore safe to assume that these figures grossly underestimate the number of ISG Blog readers.  With this in mind, the top ten most read postings were as follows:
  1. Blackboard on the shopping list (751 page views)
  2. The Kindle according to Cellan-Jones (301 page views)
  3. Some general musing on tag clouds, resource discovery and pointless widgets (235 page views)
  4. Crowd-sourcing facetted information retrieval (103 page views)
  5. Web Teaching Day – 6 Sep 2010 (74 page views)
  6. How much software is there in Liverpool and is it enough to keep me interested? (67 page views)
  7. Trough of disillusionment for microblogging and social software (56 page views)
  8. Jimmy Reid and the public library: an education like no other (52 page views)
  9. Goulash all round: Linked Data at the NSZL (50 page views)
  10. Shout "Yahoo!": more use of metadata and the Semantic Web (46 page views)
Rather surprisingly – but disappointingly given the extra time they take to compose - the top ten most read blog postings tend not to be the longer, more intellectually considered contributions; but the more ephemeral ones.  This is clear from the #1 most read posting, which was merely a brief comment on a blogosphere rumour that Google might acquire Blackboard.  This post evidently fed into the social and temporal characteristics that can typify blogs and must be considered – using the more up-to-date jargon of the Twitterati – a "trending" topic.  It attracted the highest number of page views (751) and comments (9), and to date remains popular (according to some extra data that I have...).  In fact, using Gruhl et al.'s macroscopic blog characteristics typology, this posting could be considered "Mostly Chatter".  "Mostly Chatter" postings are those that attract attention or discussion at moderate levels throughout the entire period of analysis.  The majority of other postings fall within Gruhl et al.'s "Just Spike" category, i.e. they are postings that become active but then suddenly become inactive and demonstrate a very low level of chatter.  This appears to be corroborated by the generally low page view figures for most posts and the average comment figures (M = 0.8; SD = 1.23; IQR = 1).
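The gap between the standard deviation and the interquartile range quoted for page views is what one would expect when a handful of posts spike. A minimal sketch using Python's `statistics` module is given below; the page view numbers are invented for illustration only (they are not the ISG Blog data), apart from a single spike set to 751 to mimic the most read post.

```python
import statistics

# HYPOTHETICAL page views for ten posts; only the 751 spike echoes the text.
views_with_spike = [5, 8, 10, 12, 15, 18, 20, 25, 30, 751]
views_without_spike = views_with_spike[:-1]


def summarise(values):
    """Return (mean, sample standard deviation, interquartile range)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return statistics.mean(values), statistics.stdev(values), q3 - q1


mean_with, sd_with, iqr_with = summarise(views_with_spike)
mean_without, sd_without, iqr_without = summarise(views_without_spike)

if __name__ == "__main__":
    print(f"with spike:    M = {mean_with:.1f}; SD = {sd_with:.1f}; IQR = {iqr_with:.2f}")
    print(f"without spike: M = {mean_without:.1f}; SD = {sd_without:.1f}; IQR = {iqr_without:.2f}")
```

A single outlier drags the mean and standard deviation upwards while barely moving the IQR, which is why the IQR is the more honest summary of a typical post here.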

Figure 2: Comments per ISG Blog post (2007-2011).
It is also interesting to note that although Francis Muir only made one blog post during the lifetime of the blog his post features in the top five most read contributions (74 page views).  Again, this is perhaps because it was a bursty topic and was trending at the time of publication.  It is nevertheless reassuring that at least some of the more intellectually considered contributions feature in the top ten (e.g. 3, 6 and 8).  On average though, the rest of us attracted fewer eyeballs.  For example, George Macgregor (M = 31; SD = 96; IQR = 18); Johnny Read (M = 14; SD = 31; IQR = 25).

Figure 3 provides an overview of blog post length. As a frequent author of the longest blog posts I have always been worried that I might be boring readers to death (5 posts > 1,000 words).  I always felt longer posts were necessary to cover our intellectually stimulating topics.  Yet, as it transpires, my average post was shorter than expected (M = 534; SD = 306; IQR = 355), and was actually shorter than Johnny Read's average (M = 668; SD = 154; IQR = 106).  I know, I know...  My SD and IQR are far higher, but let's not focus on that because, on the face of it, Johnny would appear to be more boring than I am! ;-)

Figure 3: Post length on the ISG Blog by author (2007-2011).
Which leads to the topic that started all this: ISG Blog health, or the blog lifecycle if you prefer.  What is the current state of health of the ISG Blog?  We noted that 2009 was the most productive year for the blog.  This can be easily observed from the graphs, most of which reveal a busy profile during 2009.  But according to the graph on total posts (Figure 4), the data reveals a spike in 2009, with a comparable number of contributions in 2010 and 2008, and a similar pattern in 2011 and 2007.  In other words, the trend in 2011 seems to be for decline and perhaps even death. 

Figure 4: Total post per year by author (2007-2011).
Researchers have been keen to model blog failure for many years.  For example, Qazvinian et al.'s research (presented at the International AAAI Conference on Weblogs and Social Media) identifies blogs that are prone to "connection failure" and "commitment failure".  As the names of these phenomena suggest, connection failure is a blog that fails to enjoy the network effect within the blogosphere, either because other blogs are not commenting or linking to that blog, or because the readers are not engaged enough to comment on postings.  Commitment failures are more difficult to interpret from Qazvinian’s data; however, their data clearly indicates that new bloggers (of circa one month) typically account for 80% of all blog failures (i.e. quits) within any given time window.  The most dangerous time in which the ISG Blog could succumb to commitment failure has therefore been and gone.  But despite making it past the one month mark by almost four years, the ISG Blog is clearly past its prime.  I made half as many posts in 2010 as I did in 2009, and I have thus far made fewer than half my 2010 contributions in 2011.  A similar trend can be observed in the number of Johnny Read's posts too.  The only tenuous consolation is that as time has gone by my average blog length appears to have increased.  However, although this appears to be borne out by the scatterplot (Figure 5 - yup, Data Explorer can't do scatterplots or trendlines) in which an upward linear regression trendline can be observed, it isn't borne out by the associated numbers (R² = 0.0442). 
Figure 5: ISG Blog post length for George Macgregor (2007-2011), with linear regression line.
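The figure of 0.0442 quoted for the trendline reads like a coefficient of determination. Assuming so, the sketch below shows how an upward least-squares line can coexist with a value that low. The data points are made up for illustration (they are not the post lengths plotted in Figure 5), and `linear_fit` is a helper name chosen here.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 is the squared correlation coefficient.
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# HYPOTHETICAL data: post order (x) against post length in words (y).
post_order = [1, 2, 3, 4, 5, 6, 7, 8]
post_length = [500, 200, 900, 300, 800, 250, 700, 600]

slope, intercept, r_squared = linear_fit(post_order, post_length)

if __name__ == "__main__":
    print(f"slope = {slope:.2f} words per post, R^2 = {r_squared:.4f}")
```

With scattered data like this the slope is positive, yet the line explains only a few percent of the variance, so the apparent upward trend carries very little weight.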
It is no surprise that my diagnosis is that the ISG Blog suffers a mixture of connection and commitment failure, and that my departure at the end of September could be the final nail in the ISG Blog coffin.  The question is can someone administer CPR after I depart to save it from near certain death?

Wednesday 3 August 2011

The fallacy of the self-organising network: the limits of cybernetic systems theory, or not...

Recently, the BBC broadcast a series of films by Adam Curtis, the eminent British documentarian. These films were broadcast under the title, All watched over by machines of loving grace, and, in general, all focused on how models of computation and systems have been applied to the world around us. Curtis is well known for his sharp journalism, particularly in areas pertaining to the politics of power. All watched over... differed from his previous documentaries owing to its focus on technology, computers and systems, but the theme of power was nevertheless omnipresent, as was the sharp journalism. I thought some of Curtis' ideas were worth further comment here.

In part two of All watched over..., entitled "The Use and Abuse of Vegetational Concepts", Curtis focused on cybernetics and systems theory. Since ex-members of our team were experts in cybernetics I was particularly interested in what he had to say. Curtis examined how cybernetics and systems theory came to be applied to natural ecosystems, and how this gave rise to a distorted view of how nature worked in reality. The very fact that ecosystems are termed "eco-systems" suggests the extent of systems thinking in nature. Indeed, Sir Arthur Tansley, the celebrated ecology pioneer, coined the term in the 1930s. Tansley was fascinated by Freud's theories of the human brain and, in particular, his theory that the brain was essentially an interconnected electrical machine carrying bursts of energy around the brain through networks, much like electrical circuits. As an ecologist, Tansley became convinced that such a model also applied to the whole of nature believing that the natural world was governed by a network of machine-like systems which were inherently stable and self-correcting. These theories of ecosystems and cybernetics were to fuse together in the late 1960s.

Jay Forrester.
One of the earliest pioneers of cybernetic systems was Jay Forrester. Prof. Forrester (or Emeritus Professor of MIT as he is now) was a key figure in the development of US defence systems in the late 1940s and early 1950s and with his colleagues he developed theories of feedback control systems and the role of feedback loops in regulating – and keeping in equilibrium – systems. The ecology movement assimilated this idea and increasingly viewed the natural world as complex natural systems as it helped to explain how stability was reached in the natural world (i.e. via natural feedback loops). Forrester's experience of developing digital combat information systems and the role of cybernetic systems in resolving such problems inspired him to explore systems difficulties in alternative domains, such as organisations. This would become known as System Dynamics. As the System Dynamics Society notes:
"Forrester's experiences as a manager led him to conclude that the biggest impediment to progress comes, not from the engineering side of industrial problems, but from the management side. This is because, he reasoned, social systems are much harder to understand and control than are physical systems. In 1956, Forrester accepted a professorship in the newly-formed MIT School of Management. His initial goal was to determine how his background in science and engineering could be brought to bear, in some useful way, on the core issues that determine the success or failure of corporations."
Forrester used computer simulations and cybernetic models to analyse social systems and predict the implications of different models. As Curtis notes in his film, Forrester and other cybernetic theorists increasingly viewed humans as nodes in networks; as machines which demonstrated predictable behaviour.

Curtis is rather unkind (and incorrect, IMHO) in his treatment of Forrester during the 70s environmental crisis. Forrester's "world model" - created under the auspices of the Club of Rome and published in the seminal "Limits to Growth" - is portrayed in a negative light in Curtis' film, as is the resulting computer model. Yet, systems theory is supposed to provide insight not clairvoyance. This isn't reflected in Curtis' film. Forrester's model appears now to have been reasonably accurate and was later amended to take account of some of the criticisms Curtis highlights. And few could argue with the premise that the destiny of the world is for zero growth; to maintain a "steady state stable equilibrium" within the capacity of the Earth.

Anyway, summarising the intricacies of Curtis' entire polemic in this brief blog posting is difficult; suffice to say the aforementioned intellectual trajectory (i.e. cybernetics, ecosystems, etc.) fostered a widespread belief that because humans were part of a global system they should demonstrate self-organising and self-correcting properties, as demonstrated by feedback control systems and most potently exemplified in the natural world by ecosystems. In particular, these ideas were adopted by the computer utopians (The California Ideology) who dreamt of a global computer network in which all participants were equal and liberated from the old power hierarchies; a self-organising network, regulated by data and information feedback loops. Of course, the emergence of the Web was considered the epitome of this model (remember the utopian predictions of the Web in the mid-90s?) and continues to inspire utopian visions of a self-organising digital society.

The inherent contradiction of the self-organising system is that despite rejecting hierarchical power structures such systems in the end actually foster concentrations of power and hierarchy. Curtis cites the failure of the hippie communes, such as Synergia which implemented failed "ecotechnics", and the relative failure of revolutions in Georgia, Kyrgyzstan, Ukraine and Iran, all of which were coordinated via the Web. And, I suppose, we could extend this to examples in the so-called Arab Spring where the desire for change during the revolution, often orchestrated via Facebook and Twitter, has not always been replicated afterwards.

On this count I feel Curtis is probably correct, and aspects of his conclusion extend further. Indeed, the utopian vision of egalitarian self-organising computer networks continues and has been rejuvenated most recently by social media and "new tech", or Web 2.0 as it is now unfashionably called. Even examples which epitomise the so-called self-organising principle, such as Wikipedia, have morphed into hierarchical systems of power. This is partly because not everyone who contributes to Wikipedia can be as trusted as the next; but it is more because groups of users with a particular world view coalesce to assert their views aggressively and religiously. Editing wars are commonplace and new article rating systems have been introduced, both of which are prone to John Stuart Mill's theories on the tyranny of the majority - all within a Wikipedia ecosystem. Contributions are increasingly governed by a hierarchy of Wikipedia Reviewers who wield their powers to scrutinise, edit, and delete flagged articles. (Ever tried creating a new article on Wikipedia? It's not as easy as you think. Within seconds you will be contacted by a reviewer. They relish their control over the type of knowledge Wikipedia publishes, and they make sure you know it…)
'Network of berries' - Quinn Dombrowski, Flickr - some rights reserved
But the same erosion of self-organisation can be applied to the disproportionate growth of particular topics within social bookmarking systems (which are supposed to provide a self-organising and egalitarian way of organising information), or those who have come to dominate the blogosphere or Twittersphere. Even a social networking behemoth like Facebook is, in itself, a quintessential mechanism of control and power. Hundreds of millions of users are subjected to Facebook's power and the control over personal data that it implies. So while some users may feel liberated within the Facebook ecosystem, aspects of their identity and, perhaps, their economic and political freedom have been relinquished. I'm not sure this is an issue Clay Shirky addressed satisfactorily in his recent monograph, so perhaps he and Curtis should arrange a chat.

Yet, it is incredible how pervasive the ecosystem metaphor has become. Discussing the new tech bubble on BBC News recently, Julia Meyer rationalised it as "ecosystem economics". Says Meyer:
"...very distinct "ecosystems" have emerged during the past half-decade […] Each of these camps are deeply social - there is a network at its core.
"Companies like LinkedIn and Groupon have significant and growing revenues. While these may not entirely support their valuations, they clearly point to the fact that business models plus their understanding of the network-orientation of all business is on the right track. For those of us who finance entrepreneurship in Europe, what this means is we're mostly going to help build "digital Davids" - companies who understand how to re-organise the economics to create robust and sustainable businesses where everybody wins - customers, retailers and ultimately of course, investors.
"So why are firms like Groupon worth billions? How can something as simple as organising a group discount be so powerful? Because ecosystem economics is at play."
Huh. Ecosystems? Networks? Sustainability where "everyone wins"? Re-organising networks? I smell something dodgy – and I'm not referring to the men's lavatory in the John Foster Building.

Monday 1 August 2011

Sifting CVs out in industry

I’ve written a couple of posts about recruitment campaigns we've had at villagesoftware.co.uk to get in graduate programmers, and we are in the middle of another one, again bringing people in at the bottom of software development for £15,000. This superstar starting salary attracted 55 CVs, a sign of the times. Last time we did this, 2 years ago, the split was:

50% were expats looking for first UK job.
30% bog standard graduates with relevant degree.
10% were experienced but out of work local developers.
10% Random people with no relevant capability

This time we had far fewer expats applying, perhaps because of a weaker pound or perhaps because the job centre had advertised things less widely. Aside from this group, the proportions stayed much the same.

10% were expats looking for first UK job.
50% bog standard graduates with relevant degree.
20% were experienced but out of work local developers.
20% Random people with no relevant capability

My colleagues and I have been through the CVs and covering letters, and some things are apparent.

Firstly, the obvious things are true. A properly written CV and a personable covering letter make for a good opening. We had one guy who didn’t even give a surname, and several who didn’t give a real address.

Very narrow interests do not excite: programming (fair enough), computer game playing and keeping up with technology developments (web surfing) are not a collection that points to a rounded candidate. An amazing number led with the fact that they liked socialising with their friends as their main interest, presumably to distinguish themselves from the candidates who ‘don’t like people’. Some watched movies!

Clearly those who had done relevant work experience or a sandwich year stood out from the crowd. My Business School colleague Alistair Beere commented, "if only we could convince the students of this".

There was some annoyance around the table here at the number of graduates of computer science related degrees who did not have relevant skills. The big surprise was that those who went to the ‘lesser’ universities (in Liverpool the traditional league is Liverpool, John Moores, Hope, Edge Hill) seemed to have the better skills. Perhaps their lecturers are more in touch with the commercial world than those at the higher brow outfits. It must be frustrating for the student who has paid and studied and is now being rejected by us for a lack of relevant background. I felt frustrated that universities are sending their graduates out poorly armed to enter their chosen profession. While degrees are not only about employability, I am reminded that we must work hard to teach what the students need to know, not what we happen to know.
The perception of grade inflation was apparent, with most students with 2:2s being kicked into touch. Where degrees were an issue, it was considered that if students couldn’t get into the 63% of students now given 2:1s they were unlikely to be fliers, despite the many fine adjectives they deployed on their CVs. It shows that not getting a 2:1 is a problem in the job market.

Interviews are later this week; fingers crossed we find the right person. Having previously published on this blog how we do interviews (interviewing-graduates.html), we will have to make a few changes.

I will hopefully take some of the lessons learnt back into my interactions with the various modules I teach on.

Friday 15 July 2011

Graduation 2011: congratulations to students from the information and PR programmes

Graduation represents the formal conferment of undergraduate and postgraduate awards. This week LJMU students have been attending the graduation ceremonies at the spectacular Liverpool Anglican Cathedral to receive the ultimate reward for all their academic studies. This includes many students graduating from our information and PR related programmes based at Liverpool Business School, including students graduating from our BSc (Hons) Business Information Systems, BA (Hons) Business Management & Information, BA (Hons) Business & Public Relations, and MA/MSc Information & Library Management.
Members of the successful BA (Hons) Business Management & Information student cohort with lecturers Janet Farrow, Elaine Ansell and Chris Taylor.

Congratulations to all the students graduating from our programmes! We wish you the best of luck in your future endeavours!

BSc (Hons) Business Information Systems students celebrate their success!


Andrew Prescott.
Several students achieved the highest academic status by achieving First Class undergraduate degrees, or attaining Distinction at MA/MSc level. A special "well done" goes to these students. However, a further special mention goes to Andrew Prescott, who aside from graduating with a First Class BA (Hons) degree in Business Management & Information, received an award for Outstanding Placement Student from the Chartered Management Institute (CMI). Andrew's industrial placement was based at General Motors in Luton where he concentrated on Supply Chain Management for a year. Andrew was the only student who had his placement extended, spending two additional months at the Ellesmere Port branch where he spearheaded a project on line side delivery optimisation – finding ways to reduce waste and overall costs of production. Andrew has since secured a two year graduate scheme in the logistics department of Bentley. Well done, Andrew!
BA (Hons) Business & Public Relations students celebrate!

All LJMU graduation ceremonies were streamed live from the Anglican Cathedral and can be watched again at the LJMU website. Information and PR programme ceremonies took place on Wednesday 13 July (AM) and Thursday 14 July (PM). Further information is also available from the LJMU website.

Tuesday 7 June 2011

Reinventing the wheel as a square: schema.org

A few days ago the big three search engines (Google, Bing and Yahoo!) announced schema.org. Schema.org is a "collaborative" effort in the area of vocabularies for structured data on the Web and specifies nearly 300 mini-schema that can be used to provide semantics within XHTML. These mini-schema are based on the Microdata specification currently under review as part of the forthcoming HTML5 specification. What? It can be used to "provide semantics"? Don't we have ways of doing this within XHTML already, like RDFa and Microformats?!

Indeed we do...

Schema.org essentially proposes the use of Microdata instead of RDFa (and/or Microformats) and – although derived from the RDF data model – is simpler, less expressive and, as Manu Sporny notes, "exclusive". The announcement has caused a ruckus in the SW blogosphere, particularly from the co-chair of the W3C RDFa Working Group (Sporny) who has declared schema.org to be a "false choice". Even Yahoo!'s resident semantic search technology research guru, Peter Mika – who was part of the team that helped develop schema.org – acknowledges that RDFa would have been preferable because "I consider it more mature and a superior standard to Microdata in many ways". So why have Microdata and a suite of new vocabularies (the mini-schema) been proposed? This appears to be the question many people are asking. Myself included.

Although schema.org cite RDFa complexity and lack of adoption as motivating factors behind their initiative, both are poor reasons and do not appear to be borne out by the evidence. RDFa can be as expressive as you like, and crucially, it can be just as simple as Microdata. Sporny provides a useful comparison of RDFa and Microdata modelling the same data, as does Gavin Carothers. And a 510% increase in RDFa usage during 2009-2010 hardly suggests slow adoption. On the contrary, I blogged about how utterly astonished I was at the uptake. (My early view is that schema.org appears to be motivated more by pure commercial considerations; this seems to be evident from perusing the available mini-schema, many of which are clearly designed to trigger richer results displays for the sale of particular products or services, and/or popular topics with clear commercial potential. SEO consultants are going to clean up...)
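To give a flavour of how similar the two syntaxes can be, here is a minimal sketch in Python. The two snippets describe the same invented person ("Jane Doe"), once as Microdata using the schema.org Person type and once as RDFa using FOAF. The helper names (PropertyCollector, extract) are hypothetical, and the parser is a toy that reads only one type attribute and one property attribute; it is not a conforming Microdata or RDFa processor, and it is not taken from Sporny's or Carothers' comparisons.

```python
from html.parser import HTMLParser

# Illustrative markup only: the person and both snippets are invented for this sketch.
MICRODATA = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
</div>
"""

RDFA = """
<div xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person">
  <span property="foaf:name">Jane Doe</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Toy collector: records the item type and (property, text) pairs."""

    def __init__(self, type_attr, prop_attr):
        super().__init__()
        self.type_attr = type_attr   # e.g. "itemtype" or "typeof"
        self.prop_attr = prop_attr   # e.g. "itemprop" or "property"
        self.item_type = None
        self.pairs = []
        self._current = None         # property still waiting for its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.type_attr in attrs:
            self.item_type = attrs[self.type_attr]
        if self.prop_attr in attrs:
            self._current = attrs[self.prop_attr]

    def handle_data(self, data):
        # Attach the first non-blank text after a property-bearing tag.
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None


def extract(html, type_attr, prop_attr):
    """Return (item type, [(property, text), ...]) for one snippet."""
    parser = PropertyCollector(type_attr, prop_attr)
    parser.feed(html)
    parser.close()
    return parser.item_type, parser.pairs


print(extract(MICRODATA, "itemtype", "itemprop"))
print(extract(RDFA, "typeof", "property"))
```

Both snippets need one attribute for the type and one for each property; the visible difference in this small case is the attribute names and the vocabulary being referenced.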

But what probably disappoints most about schema.org is the lack of commitment to re-using existing vocabularies. Isn't that an important aspect of the Semantic Web? Re-use! Minimise duplication! Schema.org duplicates the work of established vocabularies (i.e. RDF Schema) such as FOAF, Dublin Core, the Music Ontology Specification, and many others, and often in a less expressive way. Why re-invent them? But this is part of a more general phenomenon. Rather than harness existing RDF standards that have benefitted from years of developer feedback, research and development, disparate use cases and, essentially, standards that have attempted to deliver what developers have asked for, the search engines have instead declared that they would prefer standards that work better for them. Their vision of structured data is one in which they control the direction of the Semantic Web and not the Semantic Web community, the W3C, or the Web community for that matter. The true impact of schema.org is therefore more philosophical than technical – and not in a good way.

So perhaps the instinctive technical reaction from 'Semantic gurus' is a little melodramatic. Schema.org will change the structured data landscape to be sure, but it is not in the same marketplace as vanilla RDF, doesn't even try to be, and is far less expressive than RDFa. Moreover, the search engines have announced their continued support for RDFa and Microformats - although no mixing of formats, please (!!!). Some, such as Mike Bergman, even see schema.org as a stepping stone for developers; fulfilling a different purpose and encouraging developers to move onto richer forms of structured data. And at least schema.org uses URIs, thus enabling some flexibility on how they are referenced in the future.

It is interesting to note that schema.org was announced a few days prior to the SemTech conference, which kicked off yesterday in San Francisco. I wonder what the topic of conversation will be at the conference dinner? Well, we can look at Twitter for that...

Monday 7 March 2011

Visualising (dirty) data from data.gov.uk using the Dataset Publishing Language (DSPL)

A fortnight ago the Dataset Publishing Language (DSPL) was launched by the Public Data Team at Google. DSPL is an XML-based language to support the generation of rich and interactive data visualisations using the Public Data Explorer, Google's hitherto closed visualisation tool. The XML is used to describe the dataset, including informational metadata like descriptions of measures and metrics, as well as structural metadata such as relations between tables. The completed DSPL XML is then uploaded to the Public Data Explorer as part of a 'dataset bundle', which also contains the set of CSV files holding the data itself.
I decided to take the DSPL for a spin using data gleaned from data.gov.uk and visualised data pertaining to UK higher education income and expenditure in the years up to 2008 and 2009. This process was a little fiddly, primarily for reasons to be discussed in a moment; but it was also fiddly owing to the demands of the DSPL and the seemingly temperamental nature of the Public Data Explorer. (These technical issues are something the Public Data Team is resolving.) The dataset can be visited and enjoyed as a bar graph, bubble chart or line graph, with dimensions selected from the left-hand column and temporal dimensions under the X axis. Bubble metrics in the bubble graph can be toggled in the top right-hand corner. Note that all values are shown in units of 1000 GBP, and where necessary rounded to the nearest 1000 GBP. Screenshots are above and below.

These data visualisations look very good indeed, and this will no doubt be a useful resource for many. But I can't help wondering if it's all too much pain for too little gain. The dataset I used is relatively simple but it still required 140 lines of XML and an endless amount of tinkering with the original data. So unless you have a large, pristine dataset which is to form the focus of a keynote presentation at an important conference (as Prof. Hans Rosling might), it is difficult to see that it is worth the effort. Added to which, ironing out errors in the DSPL is arduous because the Public Data Explorer is only clever enough to tell you that there is an error, not where the error might be. This is all very frustrating when your XML is well-formed, validates, and your CSV files appear kosher. Again, the Public Data Team is working hard so things should improve soon. Which brings me back to the principal reason why the whole process was fiddly: data.gov.uk.

Data.gov.uk was launched a year ago by Tim Berners-Lee on behalf of the UK government. You can read about the background in your own time. Suffice to say, the raison d'être of data.gov.uk is to publish government datasets in an open, structured and interoperable way thus stimulating new and "economically and socially valuable applications". As it currently stands, data.gov.uk does not come close to achieving this. It is not until you delve beneath the surface (as I did for the dataset above) that you appreciate what data.gov.uk actually provides is almost the opposite: closed, unstructured and un-interoperable data! A resource like this should be based – in an ideal world – on RDF or XML, with CSV the preferred option for those unfamiliar, unwilling or unable to provide something better. But it should not be a repository for virtually every file format known to humankind, with contents structured in an arbitrary manner.

Identifying a suitable dataset for my DSPL experiments was exhausting. PDF files are commonplace; some "datasets" are simply empty or broken, or are simply bits of information (e.g. reports). Even if you are lucky enough to find a CSV compliant dataset (and don't expect any RDF or XML), it will inevitably be dirty and require significant time to render it usable, which is why my experiences were fiddly. All of these frustrations appear to be shared by developers that post on the data.gov.uk forum. To be sure the data is "open" insofar as UK citizens can visit data.gov.uk, view data and hold public officials to account. However, it's the data.gov.uk logo (three linked orbs) - which is almost identical to the old Semantic Web logo - that seduces one into thinking data.gov.uk is a rich source of structured, interoperable, open data. None of this is entirely fair because data.gov.uk does have a page on Linked Data, and it does provide some useful RDF on MPs, legislation, etc. and some SPARQL endpoints; but in the grand scheme of 'all-things-data.gov.uk' it constitutes a very small proportion of what data.gov.uk actually provides. And all of this is very depressing. It increases barriers, alienates the developers and data enthusiasts, and will ultimately fail to reach the objective: "economically and socially valuable applications".
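For a sense of the tidying a dirty CSV file can demand before it is fit for a DSPL bundle, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the column names, the figures, the kinds of dirt (currency symbols, thousands separators, stray whitespace, empty rows) and the function names are invented, and it is not a record of the steps actually applied to the dataset above. It converts raw GBP amounts into units of 1000 GBP, rounded to the nearest 1000 GBP.

```python
import csv
import io
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical raw extract: column names and figures are invented for illustration.
RAW_CSV = """Year , Income (GBP)
2007/08,"£23,439,626"
2008/09, 25353000 
,
"""


def to_thousands_gbp(raw):
    """Strip symbols, separators and whitespace; round to the nearest 1000 GBP."""
    digits = raw.replace("£", "").replace(",", "").strip()
    thousands = Decimal(digits) / 1000
    # ROUND_HALF_UP avoids the round-half-to-even behaviour of the built-in round().
    return int(thousands.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def clean(raw_csv):
    """Return (year, value in 1000 GBP) rows, skipping empty or incomplete records."""
    rows = []
    reader = csv.reader(io.StringIO(raw_csv), skipinitialspace=True)
    next(reader)  # skip the header row
    for record in reader:
        if len(record) < 2 or not record[0].strip() or not record[1].strip():
            continue
        rows.append((record[0].strip(), to_thousands_gbp(record[1])))
    return rows


for year, value in clean(RAW_CSV):
    print(year, value)
```

Even a toy like this has to make decisions (what counts as an empty row, how to round) that a well-structured source would have made unnecessary.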

Friday 4 February 2011

From the bottom up: growing the Semantic Web


RDFa is essentially a form of XHTML which incorporates a variety of RDF attributes and can adhere to the RDF data model. It's generally far less detailed than standalone RDF; but it does the trick for most web pages. (See earlier postings for some further information.) As a Semantic Web aficionado, I was therefore pleased to read Peter Mika's recent blog post concerning the growth of RDFa deployment.

Mika is a researcher at Yahoo! Research specialising in semantic search. His research is widely published and his role at Yahoo! has enabled him to analyse the growth of RDFa on the surface web. Mika's analysis charts the evolution of certain microformats and RDFa on the web, as a percentage of all web pages, as indexed by Yahoo! Search. This includes over 12 billion web pages. The data collection was conducted at three different time points over the past two years, thus allowing growth to be charted. There are some caveats with the data, but overall Mika found the following:
"The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hAtom microformat."
510% growth in 18 months??? What an incredible statistic. It doesn't matter that academics, the BBC, The Guardian, NY Times, data.gov.uk, DBpedia, digital libraries, the CIA, etc. participate in the Semantic Web using 'pure' RDF; it takes Yahoo! (which was leading the search engines in the use of RDFa) and Google (which recently acquired Metaweb Technologies) to motivate ordinary web developers to use RDFa. All of this is very motivating. I wonder where we'll be by October 2011? Hopefully Mika will update his blog sometime in the autumn!
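As a quick sanity check on the quoted figures, the arithmetic can be sketched in a few lines of Python. It uses only the rounded numbers that appear in the quotation (0.6%, 3.6% and a 12 billion page sample); the variable names are my own. On those rounded inputs the relative growth comes out at 500% and the page count at 432 million, so the quoted 510% and 430 million were presumably calculated from unrounded values, which is an assumption on my part.

```python
from fractions import Fraction

# Rounded figures as quoted above; Fractions keep the arithmetic exact.
share_march_2009 = Fraction(6, 1000)      # 0.6% of web pages
share_october_2010 = Fraction(36, 1000)   # 3.6% of web pages
sample_size = 12_000_000_000              # pages in the sample

# Relative growth = (new - old) / old, expressed as a percentage.
growth_percent = (share_october_2010 - share_march_2009) / share_march_2009 * 100

# Absolute number of pages carrying RDFa at the later date.
rdfa_pages = share_october_2010 * sample_size

print(f"Relative growth: {float(growth_percent):.0f}%")
print(f"Pages with RDFa: {int(rdfa_pages):,}")
```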

Monday 24 January 2011

Delicious: an obituary of sorts

University bureaucracy was swallowing up my time when this news broke, otherwise I would have commented sooner... But the leaked announcement by Yahoo! that it has decided to 'sunset' Delicious was momentous, and although Delicious may have a future elsewhere, the news remains significant. It's significant because – along with Flickr, which Yahoo! also owns – Delicious was one of the first services which epitomised the new and mysterious 'Web 2.0' concept, when it emerged in 2004.

In 2004 the concept of organising and sharing links on the Web was fresh and new, and Delicious was really the first to offer an innovative solution to save, organise and share bookmarks with friends. Delicious popularised the use of bookmarklets and practically coined the term 'social bookmarking'. It was probably the first Web 2.0 service to make a domain hack cool and not cheap looking (it was http://del.icio.us/ until 2008, and actually called itself del.icio.us initially).

More interestingly, Delicious popularised tagging and – probably more than any other service at that time – launched an avenue of functionality known as 'social tagging'. In the deluge of social tagging research papers that have been published since Delicious was launched, few will not have cited Delicious in their introductions. And, of course, social tagging - or social bookmarking, or collaborative tagging - has come to be one of the defining aspects of Web 2.0. Social tagging has sent shock waves throughout the Web, influencing the design of subsequent social media services and discovery tools, such as digital libraries. Even the ubiquitous (and infamous) tag cloud - which sister service Flickr invented - was adopted by Delicious and rendered infinitely more useful with the uniquely identifiable resources it and its users curated.

And all of the aforementioned was why Yahoo! decided to acquire Delicious in late 2005 for circa $30 million, in what commentators noted as the first attempt by a Web 1.0 company to jump on to the Web 2.0 bandwagon. Of course, since 2005 many of the original Web 2.0 names have found a Web 1.0 home in which to evolve (e.g. Delicious and Flickr @ Yahoo!; YouTube, Picasa, Blogger @ Google; MySpace @ News Corp; etc.). To be sure, Yahoo! paid too much for Delicious; but Yahoo! weren't buying the technology (which at the time of purchase wasn't particularly complex). They were buying the brand, its users and the Web 2.0 kudos. However, Yahoo! failed to capitalise on all of that, and even failed to harness the underlying technology to make Delicious a household name. One would have thought that an injection of Yahoo! R&D would have made Delicious the most innovative social bookmarking service available, with plenty of horizontal integration with other Yahoo! products. Far from innovating, Delicious has been static since its acquisition. Five years on there are numerous social bookmarking services, virtually all of which are more innovative, more exciting and ultimately more useful.

It is the end of an era to be sure. No-one talks about 'Web 2.0' any more because there is no Web 2.0 to point to, and the demise of Delicious is an example of this. Web 2.0 is now about a handful of social media behemoths. To some extent the 'sunsetting' of Delicious is a comment on the utility of tagging and the value that can be mined from tags, and, ergo, the money that can be made from them. Tightening the use of tags is something which has attracted more attention recently from the CommonTag initiative and rival social bookmarking services such as Faviki and ZigTag. TechCrunch suggest that Yahoo! could have made money from Delicious if they had wanted to and that organisational issues prevented Delicious from being profitable. Perhaps they are right. Only a small team would have been required - but it can't have been easy to make money, otherwise Yahoo! would have done it. The moment for Delicious is now gone... And this is an obituary of sorts: can you see any company thinking Delicious is a good investment?