Monday, 26 September 2011

New term under way: banging on about jobs and placements

A new term is under way at the Liverpool Business School, mostly induction this week. While going to the cohort meetings with the 2nd and final year Business Information Systems students, I have taken the chance to go on about my key rants.

These are born partly out of my experience recruiting over the summer, namely the necessity of getting something onto your CV that gives evidence of your greatness.

While reading new graduate CVs in the summer I was horrified to see people who began with a splurge of waffle about what great team players and self-starters they were, backed by no evidence. Then they list the modules and technologies they covered on their courses, which they and everyone else on their course have covered. And then they finish off by telling me that they like to socialise with their friends, watch a movie, play computer games and possibly stay up to date with technology.

They must think this separates them from the crowd, but it is rare to come across anyone who doesn't like spending time with their friends, or at least admits to it in a CV. And the other three things basically amount to sitting in front of a monitor of some type: watching movies, playing games and web surfing. Employers, mine included, are not going to expect this on its own to push the company forward in these tough times.

We all encourage the students to think about these things, but of course we are often frustrated that the students don't take this seriously until it is too late. As it happened, the university's societies fair was just a hundred yards away, so I desperately tried to push the students in that direction in the hope that being treasurer of the university plate balancing team might put some evidence behind their inevitable claim to be a team player, by showing that a group of peers was prepared to trust them with something.

Meanwhile Matthew Baxter-Reynolds writes in the Guardian, noting how in the software industry recruitment companies are creating a largely inefficient barrier between the 'talent' and the companies, and roughly suggesting that more candidates should send their CVs direct to potential employers.

I had commented to my colleagues in my day job that we don't seem to get unsolicited CVs anymore. Once upon a time we kept a file of them as they accumulated, but now they don't appear. This article seemed to confirm that this was not just my experience. Meanwhile I get dozens of unsolicited emails from recruitment agencies every week and probably about 150 phone calls a year from the same.

My conclusion is that my students should get out there and send out their CVs, looking for jobs or placements depending on their position on the course. The other rant to my (Information Systems) students was that they should get a black belt in Microsoft Office, particularly Excel but also Access, so that they can do things that others can't once they get that job or placement.

When I complete these rants, I am always optimistic that the students will have heard the urgent message and set themselves on a path to a solid career; I choose to ignore the evidence of previous years. Still, as Tesco say, every little helps.

Monday, 19 September 2011

Does a 2:1 in computing mean you can write a simple computer program?

As a business person and academic I have two positions of interest in graduates and their capability. At JMU we are always thinking about trying to balance what employers want, what students want to do and what we are able to teach. These things are not always in line of course. As an employer in Software Development I am looking for people with a demonstrable aptitude and broader long term promise.

At Village Software we recently advertised for a graduate trainee at £15k. We had about 50 applicants; of these we spoke to about a dozen and invited 6 to interview, of whom 4 attended. There are two things to note here. I’ll consider the most shocking first, which is the question: can you acquire a computing related 2:1 from XYZ University without being able to write a simple computer program? Perhaps elsewhere I’ll consider the CVs that don’t get you a phone call.

We set the interviewees the common and much discussed Fizz Buzz test. There was a frenzy some time around 2007 about the fact that people applying for programming jobs couldn’t program. The thought was that this was some kind of zombie attack, with qualified people lacking basic competence flooding the industry, and we needed some way to tell the zombie programmers from the real thing. This coalesced around the Fizz Buzz test, a simple programming exercise along the lines of:


Write a program that prints the numbers from 1 to 100. But for multiples of three print "Fizz" instead of the number and for the multiples of five print "Buzz". For numbers which are multiples of both three and five print "FizzBuzz".
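For reference, the exercise as worded above can be met in a handful of lines. What follows is only a minimal sketch in Python; the language and the function name fizzbuzz are my own choices for illustration and are not what the candidates were asked to use (they had a free choice of language, and were set a variant).

```python
def fizzbuzz(n):
    """Return the Fizz Buzz word for n, or n itself as a string."""
    # Multiples of both three and five must be tested first,
    # otherwise they would be caught by the "Fizz" branch.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


if __name__ == "__main__":
    # Print the sequence for 1 to 100 inclusive, one entry per line.
    for i in range(1, 101):
        print(fizzbuzz(i))
```

The only real trap is the order of the checks: test for multiples of fifteen before the separate tests for three and five.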

Graduates' failure to pass this test is much discussed, for example in "why can't programmers program", and there are whole blog posts on how to write answers to it, such as "Geek School Fizz Buzz". This made it surprising that none of our four candidates had heard of the problem.

We set a slight variant on the theme, fearing (unnecessarily) that candidates might have heard of the problem and learnt a solution.

We asked other questions and had a whole stack of other things, but this question was decisive. As an employer I look at people’s degree grade and subject and wonder what they tell me. There is a general question of whether a 2:1 in a software subject is a guarantee that the student can write a simple computer program. I’m afraid the answer is that it isn’t, although 2 of our candidates did very well, so it is perhaps an indication of at least a 50/50 chance that a graduate can write a computer program.

This is bad for universities. The pressure from potential and actual students is to increase our ‘value add’ and enable them to get a 2:1, otherwise they might buy elsewhere and we’ll be out of business. But the business stakeholders want to see that degree awards represent some measure of useful competence in the chosen subject. A university that lets out a computing student with a 2:1 who is unable to complete a simple program in any language of their choice is devaluing its credibility. Our evidence is anecdotal, and certainly every student on a computing course has the facility to learn to achieve this level of competence, so they only have themselves to blame.

Unless of course they didn’t have the aptitude in the first place, in which case the University has failed to select suitable students for its course. Perhaps they should be doing this test on the way in. In fact, why would someone unable to write a brief program such as this even start on a three or four year course of study in this field?

Later today I am trying out this test on some final year Liverpool University students (not represented in our interviews) looking to do a final year project with us; we shall see how they do. Hats off to Liverpool Hope, by the way: their candidate swept through the technical tests and is now sitting tapping away 10 yards away. Saving the day for the home team, we also had one John Moores candidate who pulled it off, but alas there is evidence that you can get a 2:1 from John Moores without being able to do so.

Friday, 16 September 2011

Blog lifecycles

Understanding the lifecycle and the dynamics of blogs has been a topic of interest within computing and information science for many years.  Blogs exhibit peculiar social and temporal features thus making them a rich domain of study and, quite frankly, more interesting than static web pages.  Since this blog is almost four years old, now seems like an appropriate time to review the health of the ISG Blog.  It is not my intention to expose our blog to the kind of detailed analysis one would expect to find in the pages of JASIST; but let's look at some of the most basic numbers...

Now, in an ideal world, or a sensible one for that matter, one would be able to output a .csv file from Blogger which would contain a wealth of data on the number of blog posts, the hits these posts have attracted (per week and per month), number of comments, the identity of referring sites, etc, etc.  Alas, most of this data is unavailable, and any data that is available has to be generated manually making any serious analysis difficult.  Despite these obstacles I displayed sufficient stamina to manually generate some basic blog data and to describe it using the Dataset Publishing Language (DSPL) for running through the Google Public Data Explorer.  (There still remains some XML pain but I did it anyway...).  Data available pertains to the number of blog postings, their total hits (2007-2011), number of comments per blog post and the length of postings.  Data Explorer provides a good overview of the data but doesn't perform any statistics or analysis. I have therefore included some further data analysis below. Anyway, some of the headline figures are as follows:
  • 85 blog postings have been published since October 2007.
  • George Macgregor (i.e. me) was the most prolific blogger, accounting for 87% of all posts.  Johnny Read was next in line, producing 9.41% of all posts; Francis Muir, Jack OFarrell and Keith Trickey each contributed 1.18% of the total posts.
  • 2009 was the most productive year for the blog, with 33 posts being published, accounting for 38.82% of the blog's total posts.
  • The mean number of page views was 29 per blog post (M = 29; SD = 90; IQR = 18).
  • On average, 0.8 reader comments were made in response to the blog postings (M = 0.8; SD = 1.23; IQR = 1).
  • The most read post was this one from October 2009, attracting 751 page views.
Let's look at the last headline figure first.

Figure 1: ISG Blog hits (2007-2011) by author, as viewed in Google Public Data Explorer. 
Blogger provides summary data on blog posting page views, or "hits" if you prefer.  I extracted these manually to get a measure of post impact.  An average of 29 page views is disappointing and – as you can see from the Data Explorer – although there are some traffic spikes which account for the high data dispersion (i.e. SD = 90; IQR = 18), some of the individual page view figures are very low.  However, we must remember two important caveats:  Google Analytics (used to compile the Blogger data) uses a rigid definition of page views in order to flush out transient visitors.  Secondly, many – and perhaps the majority of those dedicated to reading the ISG Blog – will read postings using an RSS reader.  Unfortunately, even Google Analytics can't capture data on consumption made via RSS.  It is therefore safe to assume that these figures grossly underestimate the number of ISG Blog readers.  With this in mind, the top ten most read postings were as follows:
  1. Blackboard on the shopping list (751 page views)
  2. The Kindle according to Cellan-Jones (301 page views)
  3. Some general musing on tag clouds, resource discovery and pointless widgets (235 page views)
  4. Crowd-sourcing facetted information retrieval (103 page views)
  5. Web Teaching Day – 6 Sep 2010 (74 page views)
  6. How much software is there in Liverpool and is it enough to keep me interested? (67 page views)
  7. Trough of disillusionment for microblogging and social software (56 page views)
  8. Jimmy Reid and the public library: an education like no other (52 page views)
  9. Goulash all round: Linked Data at the NSZL (50 page views)
  10. Shout "Yahoo!": more use of metadata and the Semantic Web (46 page views)
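For anyone wanting to reproduce summary figures of the kind quoted in this post (M, SD, IQR), here is a rough sketch using Python's standard statistics module. It uses only the ten page-view counts listed above, so its output describes the top ten and will not match the whole-blog figures; the use of the sample standard deviation and the default quartile method are assumptions, since the post does not say which were used.

```python
import statistics

# Page views for the ten most read postings, as listed above.
page_views = [751, 301, 235, 103, 74, 67, 56, 52, 50, 46]


def summarise(values):
    """Return the mean, sample standard deviation and interquartile range."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (an assumption)
        "iqr": q3 - q1,
    }


summary = summarise(page_views)
print(summary)
```

A handful of very large values pulls the mean and SD upwards far more than it moves the IQR, which is the same pattern of dispersion described for the full data set.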
Rather surprisingly – but disappointingly given the extra time they take to compose - the top ten most read blog postings tend not to be the longer, more intellectually considered contributions; but the more ephemeral ones.  This is clear from the #1 most read posting, which was merely a brief comment on a blogosphere rumour that Google might acquire Blackboard.  This post evidently fed into the social and temporal characteristics that can typify blogs and must be considered – using the more up-to-date jargon of the Twitterati – a "trending" topic.  It attracted the highest number of page views (751) and comments (9), and to date remains popular (according to some extra data that I have...).  In fact, using Gruhl et al.'s macroscopic blog characteristics typology, this posting could be considered "Mostly Chatter".  "Mostly Chatter" postings are those that attract attention or discussion at moderate levels throughout the entire period of analysis.  The majority of other postings fall within Gruhl et al.'s "Just Spike" category, i.e. they are postings that become active but then suddenly become inactive and demonstrate a very low level of chatter.  This appears to be corroborated by the generally low page view figures for most posts and the average comment figures (M = 0.8; SD = 1.23; IQR = 1).

Figure 2: Comments per ISG Blog post (2007-2011).
It is also interesting to note that although Francis Muir only made one blog post during the lifetime of the blog his post features in the top five most read contributions (74 page views).  Again, this is perhaps because it was a bursty topic and was trending at the time of publication.  It is nevertheless reassuring that at least some of the more intellectually considered contributions feature in the top ten (e.g. 3, 6 and 8).  On average though, the rest of us attracted fewer eyeballs.  For example, George Macgregor (M = 31; SD = 96; IQR = 18); Johnny Read (M = 14; SD = 31; IQR = 25).

Figure 3 provides an overview of blog post length. As a frequent author of the longest blog posts I have always been worried that I might be boring readers to death (5 posts > 1,000 words).  I always felt longer posts were necessary to cover our intellectually stimulating topics.  Yet, as it transpires, my average post was shorter than expected (M = 534; SD = 306; IQR = 355), and was actually shorter than Johnny Read's average (M = 668; SD = 154; IQR = 106).  I know, I know...  My SD and IQR are far higher, but let's not focus on that because, on the face of it, Johnny would appear to be more boring than I am! ;-)

Figure 3: Post length on the ISG Blog by author (2007-2011).
Which leads to the topic that started all this: ISG Blog health, or the blog lifecycle if you prefer.  What is the current state of health of the ISG Blog?  We noted that 2009 was the most productive year for the blog.  This can be easily observed from the graphs, most of which reveal a busy profile during 2009.  But according to the graph on total posts (Figure 4), the data reveals a spike in 2009, with a comparable number of contributions in 2010 and 2008, and a similar pattern in 2011 and 2007.  In other words, the trend in 2011 seems to be for decline and perhaps even death. 

Figure 4: Total posts per year by author (2007-2011).
Researchers have been keen to model blog failure for many years.  For example, Qazvinian et al.'s research (presented at the International AAAI Conference on Weblogs and Social Media) identifies blogs that are prone to "connection failure" and "commitment failure".  As the names of these phenomena suggest, connection failure describes a blog that fails to enjoy the network effect within the blogosphere, either because other blogs are not commenting on or linking to that blog, or because the readers are not engaged enough to comment on postings.  Commitment failures are more difficult to interpret from Qazvinian’s data; however, their data clearly indicates that new bloggers (of circa one month) typically account for 80% of all blog failures (i.e. quits) within any given time window.  The most dangerous time in which the ISG Blog could succumb to commitment failure has therefore been and gone.  But despite making it past the one month mark by almost four years, the ISG Blog has clearly passed its prime.  I made half as many posts in 2010 as I did in 2009, and I have thus far made fewer than half my 2010 contributions in 2011.  A similar trend can be observed in the number of Johnny Read's posts too.  The only tenuous consolation is that as time has gone by my average blog length appears to have increased.  However, although this appears to be borne out by the scatterplot (Figure 5 - yup, Data Explorer can't do scatterplots or trendlines) in which an upward linear regression trendline can be observed, it isn't borne out by the associated numbers (R² = 0.0442).
Figure 5: ISG Blog post length for George Macgregor (2007-2011), with linear regression line.
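A trendline can slope upwards and still explain very little of the variation in the data. The sketch below illustrates the point with an ordinary least-squares fit, assuming the figure reported for the trendline is a coefficient of determination (R²). The word counts are invented purely for illustration and are not the ISG Blog data; the function name linear_fit is likewise hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys).

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the variance in ys explained by the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical data: post sequence number against post length in words.
post_number = [1, 2, 3, 4, 5, 6, 7, 8]
word_count = [500, 200, 900, 300, 650, 250, 1000, 480]

slope, intercept, r_squared = linear_fit(post_number, word_count)
print(f"slope={slope:.1f} words per post, R squared={r_squared:.3f}")
```

With data this scattered the slope is positive but R² stays close to zero, so the apparent upward trend offers little support for the claim that posts are getting longer.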
It is no surprise that my diagnosis is that the ISG Blog suffers a mixture of connection and commitment failure, and that my departure at the end of September could be the final nail in the ISG Blog coffin.  The question is can someone administer CPR after I depart to save it from near certain death?

Wednesday, 3 August 2011

The fallacy of the self-organising network: the limits of cybernetic systems theory, or not...

Recently, the BBC broadcast a series of films by Adam Curtis, the eminent British documentarian. These films were broadcast under the title, All watched over by machines of loving grace, and, in general, all focused on how models of computation and systems have been applied to the world around us. Curtis is well known for his sharp journalism, particularly in areas pertaining to the politics of power. All watched over... differed from his previous documentaries owing to its focus on technology, computers and systems, but the theme of power was nevertheless omnipresent, as was the sharp journalism. I thought some of Curtis' ideas were worth further comment here.

In part two of All watched over..., entitled "The Use and Abuse of Vegetational Concepts", Curtis focused on cybernetics and systems theory. Since ex-members of our team were experts in cybernetics I was particularly interested in what he had to say. Curtis examined how cybernetics and systems theory came to be applied to natural ecosystems, and how this gave rise to a distorted view of how nature worked in reality. The very fact that ecosystems are termed "eco-systems" suggests the extent of systems thinking in nature. Indeed, Sir Arthur Tansley, the celebrated ecology pioneer, coined the term in the 1930s. Tansley was fascinated by Freud's theories of the human brain and, in particular, his theory that the brain was essentially an interconnected electrical machine carrying bursts of energy around the brain through networks, much like electrical circuits. As an ecologist, Tansley became convinced that such a model also applied to the whole of nature believing that the natural world was governed by a network of machine-like systems which were inherently stable and self-correcting. These theories of ecosystems and cybernetics were to fuse together in the late 1960s.

Jay Forrester.
One of the earliest pioneers of cybernetic systems was Jay Forrester. Prof. Forrester (or Emeritus Professor of MIT as he is now) was a key figure in the development of US defence systems in the late 1940s and early 1950s and with his colleagues he developed theories of feedback control systems and the role of feedback loops in regulating – and keeping in equilibrium – systems. The ecology movement assimilated this idea and increasingly viewed the natural world as complex natural systems as it helped to explain how stability was reached in the natural world (i.e. via natural feedback loops). Forrester's experience of developing digital combat information systems and the role of cybernetic systems in resolving such problems inspired him to explore systems difficulties in alternative domains, such as organisations. This would become known as System Dynamics. As the System Dynamics Society notes:
"Forrester's experiences as a manager led him to conclude that the biggest impediment to progress comes, not from the engineering side of industrial problems, but from the management side. This is because, he reasoned, social systems are much harder to understand and control than are physical systems. In 1956, Forrester accepted a professorship in the newly-formed MIT School of Management. His initial goal was to determine how his background in science and engineering could be brought to bear, in some useful way, on the core issues that determine the success or failure of corporations."
Forrester used computer simulations and cybernetic models to analyse social systems and predict the implications of different models. As Curtis notes in his film, Forrester and other cybernetic theorists increasingly viewed humans as nodes in networks; as machines which demonstrated predictable behaviour.

Curtis is rather unkind (and incorrect, IMHO) in his treatment of Forrester during the 70s environmental crisis. Forrester's "world model" - created under the auspices of the Club of Rome and published in the seminal "Limits to Growth" - is portrayed in a negative light in Curtis' film, as is the resulting computer model. Yet, systems theory is supposed to provide insight not clairvoyance. This isn't reflected in Curtis' film. Forrester's model appears now to have been reasonably accurate and was later amended to take account of some of the criticisms Curtis highlights. And few could argue with the premise that the destiny of the world is for zero growth; to maintain a "steady state stable equilibrium" within the capacity of the Earth.

Anyway, summarising the intricacies of Curtis' entire polemic in this brief blog posting is difficult; suffice to say the aforementioned intellectual trajectory (i.e. cybernetics, ecosystems, etc.) fostered a widespread belief that because humans were part of a global system they should demonstrate self-organising and self-correcting properties, as demonstrated by feedback control systems and most potently exemplified in the natural world by ecosystems. In particular, these ideas were adopted by the computer utopians (The California Ideology) who dreamt of a global computer network in which all participants were equal and liberated from the old power hierarchies; a self-organising network, regulated by data and information feedback loops. Of course, the emergence of the Web was considered the epitome of this model (remember the utopian predictions of the Web in the mid-90s?) and continues to inspire utopian visions of a self-organising digital society.

The inherent contradiction of the self-organising system is that despite rejecting hierarchical power structures such systems in the end actually foster concentrations of power and hierarchy. Curtis cites the failure of the hippie communes, such as Synergia which implemented failed "ecotechnics", and the relative failure of revolutions in Georgia, Kyrgyzstan, Ukraine and Iran, all of which were coordinated via the Web. And, I suppose, we could extend this to examples in the so-called Arab Spring where the desire for change during the revolution, often orchestrated via Facebook and Twitter, has not always been replicated afterwards.

On this count I feel Curtis is probably correct, and aspects of his conclusion extend further. Indeed, the utopian vision of egalitarian self-organising computer networks continues and has been rejuvenated most recently by social media and "new tech", or Web 2.0 as it is now unfashionably called. Even examples which epitomise the so-called self-organising principle, such as Wikipedia, have morphed into hierarchical systems of power. This is partly because not everyone who contributes to Wikipedia can be as trusted as the next; but it is more because groups of users with a particular world view coalesce to assert their views aggressively and religiously. Editing wars are commonplace and new article rating systems have been introduced, both of which are prone to John Stuart Mill's theories on the tyranny of the majority - all within a Wikipedia ecosystem. Contributions are increasingly governed by a hierarchy of Wikipedia Reviewers who wield their powers to scrutinise, edit, and delete flagged articles. (Ever tried creating a new article on Wikipedia? It's not as easy as you think. Within seconds you will be contacted by a reviewer. They relish their control over the type of knowledge Wikipedia publishes, and they make sure you know it…)
'Network of berries' - Quinn Dombrowski, Flickr - some rights reserved
But the same erosion of self-organisation can be seen in the disproportionate growth of particular topics within social bookmarking systems (which are supposed to provide a self-organising and egalitarian way of organising information), or in those who have come to dominate the blogosphere or Twittersphere. Even a social networking behemoth like Facebook is, in itself, a quintessential mechanism of control and power. Hundreds of millions of users are subjected to Facebook's power and the control over personal data that it implies. So while some users may feel liberated within the Facebook ecosystem, aspects of their identity and, perhaps, their economic and political freedom have been relinquished. I'm not sure this is an issue Clay Shirky addressed satisfactorily in his recent monograph, so perhaps he and Curtis should arrange a chat.

Yet, it is incredible how pervasive the ecosystem metaphor has become. Discussing the new tech bubble on BBC News recently, Julia Meyer rationalised it as "ecosystem economics". Says Meyer:
"...very distinct "ecosystems" have emerged during the past half-decade […] Each of these camps are deeply social - there is a network at its core.
"Companies like LinkedIn and Groupon have significant and growing revenues. While these may not entirely support their valuations, they clearly point to the fact that business models plus their understanding of the network-orientation of all business is on the right track. For those of us who finance entrepreneurship in Europe, what this means is we're mostly going to help build "digital Davids" - companies who understand how to re-organise the economics to create robust and sustainable businesses where everybody wins - customers, retailers and ultimately of course, investors.
"So why are firms like Groupon worth billions? How can something as simple as organising a group discount be so powerful? Because ecosystem economics is at play."
Huh. Ecosystems? Networks? Sustainability where "everyone wins"? Re-organising networks? I smell something dodgy – and I'm not referring to the men's lavatory in the John Foster Building.

Monday, 1 August 2011

Sifting CVs out in industry

I’ve written a couple of posts about recruitment campaigns we've had at villagesoftware.co.uk to get in graduate programmers, and we are in the middle of another one, again bringing people in at the bottom of software development for £15,000. This superstar starting salary attracted 55 CVs, a sign of the times. Last time we did this, 2 years ago, the split was:

50% were expats looking for first UK job.
30% bog standard graduates with relevant degree.
10% were experienced but out of work local developers.
10% Random people with no relevant capability

This time we had far fewer expats applying, perhaps because of a weaker pound or perhaps because the job centre had advertised things less widely. Aside from this group, the proportions stayed much the same.

10% were expats looking for first UK job.
50% bog standard graduates with relevant degree.
20% were experienced but out of work local developers.
20% Random people with no relevant capability

My colleagues and I have been through the CVs and covering letters, and some things are apparent.

Firstly, the obvious things are true. A properly written CV and a personable covering letter go down well. We had one guy who didn’t even give a surname, and several who didn’t give a real address.

Very narrow interests do not excite: programming (fair enough), computer game playing and keeping up with technology developments (web surfing) are not a collection that points to a rounded candidate. An amazing number led with the fact that they liked socialising with their friends as their main interest, presumably to distinguish themselves from the candidates who ‘don’t like people’. Some watched movies!

Clearly those who had done relevant work experience or a sandwich year stood out from the crowd. My Business School colleague Alistair Beere commented, "If only we could convince the students of this."

There was some annoyance around the table here at the number of graduates of computer science related degrees who did not have relevant skills. The big surprise was that those who went to the ‘lesser’ Universities (in Liverpool the traditional league is Liverpool, John Moores, Hope, Edge Hill) seemed to have the better skills. Perhaps their lecturers are more in touch with the commercial world than those at the higher brow outfits. It must be frustrating for students who have paid and studied and are now being rejected by us for a lack of relevant background. I felt frustrated that Universities are sending their graduates out poorly armed to enter their chosen profession. While degrees are not only about employability, I am reminded that we must work hard to teach what the students need to know, not what we happen to know.
The perception of grade inflation was apparent, with most students with 2:2’s being kicked into touch. Where degrees were an issue, it was considered that if students couldn’t get into the 63% of students now given 2:1’s they were unlikely to be fliers, despite the many fine adjectives they deployed on their CVs. It shows that not getting a 2:1 is a problem in the job market.

Interviews are later this week; fingers crossed we find the right person. Having previously published on this blog how we do interviews (interviewing-graduates.html), we will have to make a few changes.

I will hopefully take some of the lessons learnt back into my interactions with the various modules I teach on.

Friday, 15 July 2011

Graduation 2011: congratulations to students from the information and PR programmes

Graduation represents the formal conferment of undergraduate and postgraduate awards. This week LJMU students have been attending the graduation ceremonies at the spectacular Liverpool Anglican Cathedral to receive the ultimate reward for all their academic studies. This includes many students graduating from our information and PR related programmes based at Liverpool Business School, including students graduating from our BSc (Hons) Business Information Systems, BA (Hons) Business Management & Information, BA (Hons) Business & Public Relations, and MA/MSc Information & Library Management.
Members of the successful BA (Hons) Business Management & Information student cohort with lecturers Janet Farrow, Elaine Ansell and Chris Taylor.

Congratulations to all the students graduating from our programmes! We wish you the best of luck in your future endeavours!

BSc (Hons) Business Information Systems students celebrate their success!


Andrew Prescott.
Several students achieved the highest academic status by achieving First Class undergraduate degrees, or attaining Distinction at MA/MSc level. A special "well done" goes to these students. However, a further special mention goes to Andrew Prescott, who aside from graduating with a First Class BA (Hons) degree in Business Management & Information, received an award for Outstanding Placement Student from the Chartered Management Institute (CMI). Andrew's industrial placement was based at General Motors in Luton where he concentrated on Supply Chain Management for a year. Andrew was the only student who had his placement extended, spending two additional months at the Ellesmere Port branch where he spearheaded a project on line side delivery optimisation – finding ways to reduce waste and overall costs of production. Andrew has since secured a two year graduate scheme in the logistics department of Bentley. Well done, Andrew!
BA (Hons) Business & Public Relations students celebrate!

All LJMU graduation ceremonies were streamed live from the Anglican Cathedral and can be watched again at the LJMU website. Information and PR programme ceremonies took place on Wednesday 13 July (AM) and Thursday 14 July (PM). Further information is also available from the LJMU website.

Tuesday, 7 June 2011

Reinventing the wheel as a square: schema.org

A few days ago the big three search engines (Google, Bing and Yahoo!) announced schema.org. Schema.org is a "collaborative" effort in the area of vocabularies for structured data on the Web and specifies nearly 300 mini-schema that can be used to provide semantics within XHTML. These mini-schema are based on the Microdata specification currently under review as part of the forthcoming HTML5 specification. What? It can be used to "provide semantics"? Don't we have ways of doing this within XHTML already, like RDFa and Microformats?!

Indeed we do...

Schema.org essentially proposes the use of Microdata instead of RDFa (and/or Microformats) and – although derived from the RDF data model – is simpler, less expressive and, as Manu Sporny notes, "exclusive". The announcement has caused a ruckus in the SW blogosphere, particularly from the co-chair of the W3C RDFa Working Group (Sporny), who has declared schema.org to be a "false choice". Even Yahoo!'s resident semantic search technology research guru, Peter Mika – who was part of the team that helped develop schema.org – acknowledges that RDFa would have been preferable because "I consider it more mature and a superior standard to Microdata in many ways". So why have Microdata and a suite of new vocabularies (the mini-schema) been proposed? This appears to be the question many people are asking. Myself included.

Although schema.org cite RDFa complexity and lack of adoption as motivating factors behind their initiative, both are poor reasons and do not appear to be borne out by the evidence. RDFa can be as expressive as you like and, crucially, it can be just as simple as Microdata. Sporny provides a useful comparison of RDFa and Microdata modelling the same data, as does Gavin Carothers. And a 510% increase in RDFa usage during 2009-2010 hardly suggests slow adoption. On the contrary, I blogged about how utterly astonished I was at the uptake. (My early view is that schema.org appears to be motivated more by pure commercial considerations; this seems to be evident from perusing the available mini-schema, many of which are clearly designed to trigger richer results displays for the sale of particular products or services, and/or popular topics with clear commercial potential. SEO consultants are going to clean up...)
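To make the "just as simple" point concrete, here is a minimal sketch of my own (it is not the comparison published by Sporny or Carothers). It marks up the same invented person twice, once with Microdata using the schema.org Person type and once with RDFa using FOAF, and then uses Python's standard HTML parser simply to list the annotation attributes each version needs. The person, the example.org URL and the helper names are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# The same (invented) person described twice. Both snippets are illustrative only.
MICRODATA = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Alice Example</span>
  <a itemprop="url" href="http://example.org/alice">homepage</a>
</div>
"""

RDFA = """
<div xmlns:foaf="http://xmlns.com/foaf/0.1/" typeof="foaf:Person">
  <span property="foaf:name">Alice Example</span>
  <a rel="foaf:homepage" href="http://example.org/alice">homepage</a>
</div>
"""

# Attributes that carry the semantics in each syntax.
ANNOTATION_ATTRS = {"itemscope", "itemtype", "itemprop", "typeof", "property", "rel"}


class AnnotationCollector(HTMLParser):
    """Collects (tag, attribute, value) for every semantic annotation seen."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ANNOTATION_ATTRS:
                self.found.append((tag, name, value))


def annotations(markup):
    """Return the semantic annotations used in a snippet of markup."""
    collector = AnnotationCollector()
    collector.feed(markup)
    collector.close()
    return collector.found


if __name__ == "__main__":
    for label, markup in (("Microdata", MICRODATA), ("RDFa", RDFA)):
        print(label)
        for tag, name, value in annotations(markup):
            print(f"  <{tag}> {name}={value!r}")
```

This only counts attributes; it is not an RDFa or Microdata processor. Even so, it shows that for simple cases the two syntaxes ask roughly the same amount of work from the page author.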

But what probably disappoints most about schema.org is the lack of commitment to re-using existing vocabularies. Isn't that an important aspect of the Semantic Web? Re-use! Minimise duplication! Schema.org duplicates the work of established RDF vocabularies such as FOAF, Dublin Core, the Music Ontology Specification, and many others, and often in a less expressive way. Why re-invent them? But this is part of a more general phenomenon. Rather than harness existing RDF standards that have benefitted from years of developer feedback, research and development, and disparate use cases – standards that have, essentially, attempted to deliver what developers have asked for – the search engines have instead declared that they would prefer standards that work better for them. Their vision of structured data is one in which they control the direction of the Semantic Web and not the Semantic Web community, the W3C, or the Web community for that matter. The true impact of schema.org is therefore more philosophical than technical – and not in a good way.

So perhaps the instinctive technical reaction from 'Semantic gurus' is a little melodramatic. Schema.org will change the structured data landscape to be sure, but it is not in the same marketplace as vanilla RDF, doesn't even try to be, and is far less expressive than RDFa. Moreover, the search engines have announced their continued support for RDFa and Microformats - although no mixing of formats, please (!!!). Some, such as Mike Bergman, even see schema.org as a stepping stone for developers; fulfilling a different purpose and encouraging developers to move onto richer forms of structured data. And at least schema.org uses URIs, thus enabling some flexibility on how they are referenced in the future.

It is interesting to note that schema.org was announced a few days prior to the SemTech conference, which kicked off yesterday in San Francisco. I wonder what the topic of conversation will be at the conference dinner? Well, we can look at Twitter for that...

Monday, 7 March 2011

Visualising (dirty) data from data.gov.uk using the Dataset Publishing Language (DSPL)

A fortnight ago the Dataset Publishing Language (DSPL) was launched by the Public Data Team at Google. DSPL is an XML-based language to support the generation of rich and interactive data visualisations using the Public Data Explorer, Google's hitherto closed visualisation tool. The XML is used to describe the dataset, including informational metadata like descriptions of measures and metrics, as well as structural metadata such as relations between tables. The completed DSPL XML is then uploaded to the Public Data Explorer in a 'dataset bundle', which also contains the set of CSV files holding the dataset's data.
I decided to take the DSPL for a spin using data gleaned from data.gov.uk and visualised data pertaining to UK higher education income and expenditure in the years up to 2008 and 2009. This process was a little fidgety, primarily for reasons to be discussed in a moment; but it was also fidgety owing to the demands of the DSPL and the seemingly temperamental nature of the Public Data Explorer. (These technical issues are something the Public Data Team is resolving.) The dataset can be visited and enjoyed as a bar graph, bubble chart or line graph, with dimensions selected from the left-hand column and temporal dimensions under the X axis. Bubble metrics in the bubble chart can be toggled in the top right-hand corner. Note that all values are shown in units of 1000 GBP, and where necessary rounded to the nearest 1000 GBP. Screenshots are above and below.

These data visualisations look very good indeed, and this will no doubt be a useful resource for many. But I can't help wondering if it's all too much pain for too little gain. The dataset I used is relatively simple but it still required 140 lines of XML and an endless amount of tinkering with the original data. So unless you have a large, pristine dataset which is to form the focus of a keynote presentation at an important conference (in the manner of Prof. Hans Rosling), it is difficult to see how it is worth the effort. Added to which, ironing out errors in the DSPL is arduous because the Public Data Explorer is only clever enough to tell you that there is an error, not where the error might be. This is all very frustrating when your XML is well-formed, validates, and your CSV files appear kosher. Again, the Public Data Team is working hard so things should improve soon. Which brings me back to the principal reason why the whole process was fidgety: data.gov.uk.

Data.gov.uk was launched a year ago by Tim Berners-Lee on behalf of the UK government. You can read about the background in your own time. Suffice to say, the raison d'etre of data.gov.uk is to publish government datasets in an open, structured and interoperable way thus stimulating new and "economically and socially valuable applications". As it currently stands, data.gov.uk does not come close to achieving this. It is not until you delve beneath the surface (as I did for the dataset above) that you appreciate what data.gov.uk actually provides is almost the opposite: closed, unstructured and un-interoperable data! A resource like this should be based – in an ideal world – on RDF or XML, with CSV the preferred option for those unfamiliar, unwilling or unable to provide something better. But it should not be a repository for virtually every file format known to human-kind, with contents structured in an arbitrary manner.

Identifying a suitable dataset for my DSPL experiments was exhausting. PDF files are commonplace; some "datasets" are simply empty or broken, or are simply bits of information (e.g. reports). Even if you are lucky enough to find a dataset in CSV (and don't expect any RDF or XML), it will inevitably be dirty and require significant time to render it usable, which is why my experience was fidgety. All of these frustrations appear to be shared by developers who post on the data.gov.uk forum. To be sure, the data is "open" insofar as UK citizens can visit data.gov.uk, view data and hold public officials to account. However, it's the data.gov.uk logo (three linked orbs) – which is almost identical to the old Semantic Web logo – that seduces one into thinking data.gov.uk is a rich source of structured, interoperable, open data. None of this is entirely fair because data.gov.uk does have a page on Linked Data, and it does provide some useful RDF on MPs, legislation, etc. and some SPARQL endpoints; but in the grand scheme of 'all-things-data.gov.uk' it constitutes a very small proportion of what data.gov.uk actually provides. And all of this is very depressing. It increases barriers, alienates the developers and data enthusiasts, and will ultimately fail to reach the objective: "economically and socially valuable applications".

Friday, 4 February 2011

From the bottom up: growing the Semantic Web


RDFa is essentially a form of XHTML which incorporates a variety of RDF attributes and can adhere to the RDF data model. It's generally far less detailed than standalone RDF; but it does the trick for most web pages. (See earlier postings for some further information). As a Semantic Web aficionado, I was therefore pleased to read Peter Mika's recent blog concerning the recent growth of RDFa deployment.

Mika is a researcher at Yahoo! Research specialising in semantic search. His research is widely published and his role at Yahoo! has enabled him to analyse the growth of RDFa on the surface web. Mika's analysis charts the evolution of certain microformats and RDFa on the web, as a percentage of all web pages, as indexed by Yahoo! Search. This includes over 12 billion web pages. The data collection was conducted at three different time points over the past two years, thus allowing growth to be charted. There are some caveats with the data, but overall Mika found the following:
"The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa. The deployment of microformats has not advanced significantly in the same period, except for the hAtom microformat."
510% growth in 18 months??? What an incredible statistic. It doesn't matter that academics, the BBC, The Guardian, NY Times, data.gov.uk, DBpedia, digital libraries, the CIA, etc. participate in the Semantic Web using 'pure' RDF; it takes Yahoo! (which was leading the search engines in the use of RDFa) and Google (which recently acquired Metaweb Technologies) to motivate ordinary web developers to use RDFa. All of this is very motivating. I wonder where we'll be by October 2011? Hopefully Mika will update his blog sometime in the autumn!
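As a quick sanity check on those numbers, the arithmetic can be sketched in a few lines of Python. The inputs are the rounded figures from Mika's quote; the function and variable names are mine. Note that the rounded shares (0.6% and 3.6%) give an increase of roughly 500%, so the quoted 510% was presumably calculated from unrounded figures that the quote does not give.

```python
def percent_increase(old, new):
    """Percentage increase from old to new, relative to old."""
    return (new - old) / old * 100


# Rounded shares of web pages containing RDFa, as quoted by Mika.
share_march_2009 = 0.6    # per cent
share_october_2010 = 3.6  # per cent

growth = percent_increase(share_march_2009, share_october_2010)

# Cross-check the October 2010 share against the quoted page counts.
pages_with_rdfa = 430_000_000
pages_in_sample = 12_000_000_000
share_from_counts = pages_with_rdfa / pages_in_sample * 100

print(f"Increase from rounded shares: {growth:.0f}%")
print(f"Share implied by page counts: {share_from_counts:.2f}%")
```

Either way, a share that is roughly six times larger after 18 months remains an incredible statistic.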

Monday, 24 January 2011

Delicious: an obituary of sorts

University bureaucracy was swallowing up my time when this news broke, otherwise I would have commented sooner... But the leaked announcement by Yahoo! that it has decided to 'sunset' Delicious was momentous, and although Delicious may have a future elsewhere, the news remains significant. It's significant because – along with Flickr, which Yahoo! also owns – Delicious was one of the first services which epitomised the new and mysterious 'Web 2.0' concept, when it emerged in 2004.

In 2004 the concept of organising and sharing links on the Web was fresh and new, and Delicious was really the first to offer an innovative solution to save, organise and share bookmarks with friends. Delicious popularised the use of bookmarklets and practically coined the term 'social bookmarking'. It was probably the first Web 2.0 service to make a domain hack cool and not cheap looking (it was http://del.icio.us/ until 2008, and actually called itself del.icio.us initially).

More interestingly, Delicious popularised tagging and – probably more than any other service at that time – launched an avenue of functionality known as 'social tagging'. In the deluge of social tagging research papers that have been published since Delicious was launched, few will not have cited Delicious in their introductions. And, of course, social tagging – or social bookmarking, or collaborative tagging – has come to be one of the defining aspects of Web 2.0. Social tagging has sent shock waves throughout the Web, influencing the design of subsequent social media services and discovery tools, such as digital libraries. Even the ubiquitous (and infamous) tag cloud – which sister service Flickr invented – was adopted by Delicious and rendered infinitely more useful with the uniquely identifiable resources it and its users curated.

And all of the aforementioned was why Yahoo! decided to acquire Delicious in late 2005 for circa $30 million, in what commentators noted as the first attempt by a Web 1.0 company to jump on to the Web 2.0 bandwagon. Of course, since 2005 many of the original Web 2.0 names have found a Web 1.0 home in which to evolve (e.g. Delicious and Flickr @ Yahoo!; YouTube, Picasa, Blogger @ Google; MySpace @ News Corp; etc.). To be sure, Yahoo! paid too much for Delicious; but Yahoo! weren't buying the technology (which at the time of purchase wasn't particularly complex). They were buying the brand, its users and the Web 2.0 kudos. However, Yahoo! failed to capitalise on all of that, and even failed to harness the underlying technology to make Delicious a household name. One would have thought that an injection of Yahoo! R&D would have made Delicious the most innovative social bookmarking service available, with plenty of horizontal integration with other Yahoo! products. Far from innovating, Delicious has been static since its acquisition. Five years on there are numerous social bookmarking services, virtually all of which are more innovative, more exciting and ultimately more useful.

It is the end of an era to be sure. No-one talks about 'Web 2.0' any more because there is no Web 2.0 to point to, and the demise of Delicious is an example of this. Web 2.0 is now about a handful of social media behemoths. To some extent the 'sunsetting' of Delicious is a comment on the utility of tagging and the value that can be mined from tags, and, ergo, the money that can be made from them. Tightening the use of tags is something which has attracted more attention recently from the CommonTag initiative and rival social bookmarking services such as Faviki and ZigTag. TechCrunch suggest that Yahoo! could have made money from Delicious if they had wanted to and that organisational issues prevented Delicious from being profitable. Perhaps they are right. Only a small team would have been required – but it can't have been easy to make money, otherwise Yahoo! would have done it. The moment for Delicious is now gone... And this is an obituary of sorts: can you see any company thinking Delicious is a good investment?

Monday, 15 November 2010

New undergraduate degree programme: BSc Business Communications at LJMU

This blog tends to focus on research and often comments on how technological developments will alter the management of information and the computation of data. Occasionally, however, we also discuss issues within the undergraduate and postgraduate degrees (and modules) our team happens to deliver. It is therefore worthwhile announcing to all those who read the blog that our team has launched a new undergraduate degree programme for 2011: BSc Business Communications.

In BSc Business Communications at LJMU (UCAS Code: N102) students will study the strategic importance of communication, information and technology, and the role these play in the modern business organisation. Further information on the new programme can be found at our standalone BSc Business Communications website, the official LJMU BSc Business Communications website, or our Facebook group (BSc Business Communications at LJMU). BSc Business Communications is recruiting now for 2011/2012.

Tuesday, 2 November 2010

Crowd-sourcing faceted information retrieval

This blog has witnessed the demise of several search engines, all of which have attempted to challenge the supremacy of the big innovators - and I would tend to include Yahoo! and Bing before the obvious market leader. Yesterday it was the turn of Blekko to be the next Cuil. Or is it?

Blekko presents a fresh attempt to move web search forward, using a style of retrieval which has hitherto only been successful in systems based on pre-coordinated indexes and combining it with crowd-sourcing techniques. Interestingly, Rich Skrenta - co-founder of Blekko - was also a principal founder of the Dmoz project. Remember Dmoz? When I worked on BUBL years and years ago, I recall considering Dmoz to be an inferior beast. But it remains alive and kicking – and remains popular and relevant to modern web developments with weekly RDF dumps made of its rich, categorised, crowd-sourced content for Linked Data purposes. BUBL, on the other hand, has been static for years.

Flirting with taxonomical organisation and categorisation with Dmoz (as well as crowd-sourcing) has obviously influenced the Blekko approach to search. Blekko provides innovation in retrieval by enabling users to define their very own vertical search indexes using so-called 'slashtags', thus (essentially) providing a quasi form of faceted search. The advantage of this approach is that using a particular slashtag (or facet, if you prefer) in a query increases precision by removing 'irrelevant' results associated with different meanings of the search query terms. Sounds good, eh? Ranganathan would be salivating at such functionality in automatic indexing! To provide some form of critical mass, Blekko has provided hundreds of slashtags that can be used straight away; but the future of slashtags depends on users creating their own, which will be screened by Blekko before being added to their publicly available slashtags list. Blekko users can also assist in weeding out poor results and any erroneous slashtags results (see the video below) thus contributing to the improved precision Blekko purports to have and maintaining slashtag efficacy. In fact, Skrenta proposes that the Blekko approach will improve precision in the longer term. Says Skrenta on the BBC dot.Maggie blog:
"The only way to fix this [precision problem] is to bring back large-scale human curation to search combined with strong algorithms. You have to put people into the mix […] Crowdsourcing is the only way we will be able to allow search to scale to the ever-growing web".
Let's look at a typical Blekko query. I am interested in the new Microsoft Windows mobile OS, and in bona fide reviews of the new OS. Moreover, since I am tech savvy and will have read many reviews, I am only interested in reviews published recently (i.e. within the past two weeks, or so). In Blekko we can search like so…

"windows mobile 7" /tech-reviews /date

…where the /tech-reviews slashtag limits results to genuine reviews published in the technology press and/or associated websites, and the /date slashtag orders the results by date. It works, and works spectacularly well. Skrenta sticks two fingers up at his competitors when in the Blekko promotional video he quips, "Try doing this [type of] search anywhere else!" Blekko provides 'Five use cases where slashtags shine' which - although only using one slashtag - illustrate how the approach can be used in a variety of different queries. Of course, Blekko can still be used like a conventional search engine, e.g. enter a query and get results ranked according to the Blekko algorithm. And on this count – using my own personal 'search engine test queries' - Blekko appears to rank relevant results sensibly and index pages which other search engines either ignore or, if they do index them, normally drown in spam (spam results which these engines rank as more relevant).
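For those who like to see the mechanics, the general idea behind a slashtag query can be sketched in a few lines of Python. To be clear, this is a toy illustration of faceted filtering and not how Blekko actually works: the Page class, the curated site list, the sample pages and the matching on titles are all assumptions invented for the example.

```python
import shlex
from dataclasses import dataclass
from datetime import date


@dataclass
class Page:
    title: str
    site: str
    published: date


# Hypothetical curated slashtag: a hand-picked set of trusted sites.
SLASHTAGS = {"tech-reviews": {"techsite.example", "gadgets.example"}}

# Invented sample index.
PAGES = [
    Page("Windows Mobile 7 review", "techsite.example", date(2010, 10, 20)),
    Page("Cheap Windows Mobile 7 downloads", "spam.example", date(2010, 10, 30)),
    Page("Hands-on: Windows Mobile 7", "gadgets.example", date(2010, 10, 28)),
    Page("Android roundup", "techsite.example", date(2010, 10, 29)),
]


def parse_query(query):
    """Split a query into quoted phrases/terms and slashtags."""
    phrases, tags = [], []
    for token in shlex.split(query):  # shlex keeps "quoted phrases" together
        if token.startswith("/"):
            tags.append(token[1:])
        else:
            phrases.append(token.lower())
    return phrases, tags


def search(query, pages):
    """Match phrases against titles, then apply each slashtag in turn."""
    phrases, tags = parse_query(query)
    results = [p for p in pages
               if all(phrase in p.title.lower() for phrase in phrases)]
    for tag in tags:
        if tag == "date":
            # Order results newest first.
            results.sort(key=lambda p: p.published, reverse=True)
        elif tag in SLASHTAGS:
            # Keep only pages from the curated sites behind this slashtag.
            results = [p for p in results if p.site in SLASHTAGS[tag]]
        # Unknown slashtags are ignored in this sketch.
    return results


if __name__ == "__main__":
    for page in search('"windows mobile 7" /tech-reviews /date', PAGES):
        print(page.published, page.site, page.title)
```

The spam page matches the phrase but is dropped because its site is not in the curated list, which is exactly the precision gain a facet is supposed to deliver.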

There is a lot to admire about Blekko. Aside from an innovative approach to information retrieval, there is also a commitment to algorithm openness and transparency which SEO people will be pleased about; but I worry that while a Blekko slashtag search is innovative and useful, most users will approach Blekko as another search engine rather than buying into the importance of slashtags and, in doing so, will not hang around long enough to 'get it' (even though I intend to...). Indeed, to some extent Blekko has more in common with command line searching of the online databases in the days of yore. There are also some teething troubles which rigorous testing can reveal. But there are reasons to be hopeful. Blekko is presumably hoping to promote slashtag popularity and have users following slashtags just as users follow Twitter groups, thus driving website traffic and presumably advertising. Being the owner of a popular slashtag could be not only useful but also highly profitable, even if Blekko remains small.


blekko: how to slash the web from blekko on Vimeo.

Wednesday, 22 September 2010

New Google

My interest in information retrieval means that subscribing to search engine blogs (among other things) is essential. The most active blog to which I subscribe is the Official Google Blog. According to Google, the OGB provides "insights from Googlers into our products, technology, and the Google culture". More simply, the OGB is the place to look for developments in search, particularly those which Google wants to shout about.

There was a time (probably around two years ago) when updates to the OGB occurred every other week, and often the receipt of the RSS feed would compel me to post to this blog, such was the gravity of OGB announcements (see this, this and this, for example). However, in the past six months the OGB has been in overdrive. Almost every day a huge Google announcement is made on the OGB, whether it's the announcement of Google Instant or significant developments to Google Docs. Enter Google New, a new dedicated website to find all things new from Google. Here's the rationale from Google as published – yup, you guessed it – on the OGB:

"If it seems to you like every day Google releases a new product or feature, well, it seems like that to us too. The central place we tell you about most of these is through the official Google Blog Network [...] But if you want to keep up just with what’s new (or even just what Google does besides search), you’ll want to know about Google New. A few of us had a 20 percent project idea: create a single destination called Google New where people could find the latest product and feature launches from Google. It’s designed to pull in just those posts from various blogs."

Makes sense I suppose, eh?

Thursday, 9 September 2010

Web Teaching Day - 6 Sep 2010

On Monday, 6 Sept 2010 I attended a Web Teaching Day organised by Richard Eskins from Manchester Metropolitan University (his blog). We do a fair amount of web teaching in this group so I thought it would be useful to go along. Web teaching is undertaken by the Computing Department or the Art / Design department in most universities, whereas our courses tend to be very business orientated.

In many ways the conversations I had reminded me of those we have regarding Information Systems at John Moores. There's a problem relating to the range of skills required, from basic technical skills, through design skills, to high-level interpersonal skills. Our aim is to produce a "hybrid" graduate combining business with systems / technical skills. There's huge demand in industry for these graduates and our best students command very high salaries, but students find the work hard and it is difficult to recruit good students.

Web design / development courses have very similar aims.

Some of the highlights of the day:

Chris Mills from Opera talked about the Opera Web Standards Curriculum. He has been producing teaching material for students, all of which is freely available on the internet. He also talked about Mozilla's P2PU (Peer to Peer University) programme - School of Webcraft, which is aimed at delivering and assessing these skills. He is co-author of Interact with Web Standards: A Holistic Approach to Web Design (Voices That Matter).

David Watson from Greenwich University talked about the course he designed and runs - MA in Web Design and Content Planning. They've produced their own site to support the course. One interesting point he made is that he reckoned this site had increased applicants to the course significantly; they now have 60 applicants for 20 places, whereas before they struggled for numbers. He published his presentation here. He had some interesting observations on setting up and running the course. In particular: don't depend on the University to market and recruit students to your course, or you may end up with no students.

Christopher Murphy and Nik Persson (also known as the Web Standardistas) talked about their course BSc (Hons) Interactive Multimedia Design at Ulster University and the issues involved in delivering to undergraduates. I particularly liked their use of the nerd (Bill Gates) - designer (Steve Jobs) continuum to describe the difficulties of being a web builder and how you need so many disparate skills along this path. Here are some of the tools they recommend. Their book is HTML and CSS Web Standards Solutions: A Web Standardistas' Approach.

Aesha Zafar from the BBC talked about the new developments in Manchester and, in particular, the jobs that will be created there. Nicola Critchlow talked about the gap between industry's needs and the graduates being produced by universities (which is large and getting larger; nothing new there).

Finally, Andy Clarke, a freelance designer, led a group discussion and chat at the end.

So what skills does a graduate from a web design / development course need? This is my list based on the nerd - designer continuum:

  • databases
  • server side programming languages (PHP seems to be in vogue though there are others)
  • JavaScript
  • CSS
  • HTML including web development tools
  • graphics
  • design
  • people skills

It was an excellent day with lots of really inspiring speakers and it really got me fired up about the possibilities of delivering a web design / development course at John Moores. I don't believe that the course I want to offer exists here (though that's based on absolutely no research whatsoever!).

Our team has really strong skills in databases, programming, HTML, CSS and JavaScript and we teach most of these skills at various levels. The people skills elements are taught throughout all our courses and are embedded in all JMU programmes via the World of Work (WoW) programme.

Our weakness is in design / graphics; however, Liverpool School of Art & Design has huge experience in areas such as graphic design and digital media.

So, here's a great opportunity to collaborate on a new course in an area that is growing in popularity.