Tuesday 7 June 2011

Reinventing the wheel as a square: schema.org

A few days ago the big three search engines (Google, Bing and Yahoo!) announced schema.org. Schema.org is a "collaborative" effort in the area of vocabularies for structured data on the Web and specifies nearly 300 mini-schemas that can be used to provide semantics within XHTML. These mini-schemas are based on the Microdata specification, currently under review as part of the forthcoming HTML5 specification. What? They can be used to "provide semantics"? Don't we have ways of doing this within XHTML already, like RDFa and Microformats?!

Indeed we do...

Schema.org essentially proposes the use of Microdata instead of RDFa (and/or Microformats). Microdata – although derived from the RDF data model – is simpler, less expressive and, as Manu Sporny notes, "exclusive". The announcement has caused a ruckus in the SW blogosphere, particularly from the co-chair of the W3C RDFa Working Group (Sporny), who has declared schema.org to be a "false choice". Even Yahoo!'s resident semantic search technology research guru, Peter Mika – who was part of the team that helped develop schema.org – acknowledges that RDFa would have been preferable: "I consider it more mature and a superior standard to Microdata in many ways". So why have Microdata and a suite of new vocabularies (the mini-schemas) been proposed? This appears to be the question many people are asking. Myself included.

Although schema.org cites RDFa's complexity and lack of adoption as motivating factors behind the initiative, both are poor reasons and neither appears to be borne out by the evidence. RDFa can be as expressive as you like and, crucially, it can be just as simple as Microdata. Sporny provides a useful comparison of RDFa and Microdata modelling the same data, as does Gavin Carothers. And a 510% increase in RDFa usage during 2009-2010 hardly suggests slow adoption. On the contrary, I blogged about how utterly astonished I was at the uptake. (My early view is that schema.org appears to be motivated more by purely commercial considerations; this seems evident from perusing the available mini-schemas, many of which are clearly designed to trigger richer results displays for the sale of particular products or services, and/or popular topics with clear commercial potential. SEO consultants are going to clean up...)
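
To give a rough feel for the "just as simple" point, here is a minimal sketch in Python. The two snippets are my own hypothetical markup (not Sporny's or Carothers' examples) describing the same person, once in Microdata and once in RDFa using the `vocab` attribute from the RDFa 1.1 drafts. The `PropertyExtractor` class and `extract` function are illustrative names only, and the parser is a toy: it is nowhere near a conformant Microdata or RDFa processor, and simply shows that the two flat snippets carry the same type and properties.

```python
from html.parser import HTMLParser

# Hypothetical markup: the same person described in both syntaxes.
MICRODATA = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <a itemprop="url" href="http://example.org/jane">homepage</a>
</div>
"""

RDFA = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
  <a property="url" href="http://example.org/jane">homepage</a>
</div>
"""


class PropertyExtractor(HTMLParser):
    """Toy extractor that collects property/value pairs from flat markup.

    Assumption: one item, no nesting, no prefixes, no datatypes. A real
    Microdata or RDFa processor handles far more than this.
    """

    def __init__(self, type_attr, prop_attr):
        super().__init__()
        self.type_attr = type_attr  # "itemtype" or "typeof"
        self.prop_attr = prop_attr  # "itemprop" or "property"
        self.item_type = None
        self.properties = {}
        self._pending = None  # property name still waiting for its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.type_attr in attrs:
            # RDFa resolves the type against vocab; Microdata has no vocab.
            self.item_type = attrs.get("vocab", "") + attrs[self.type_attr]
        prop = attrs.get(self.prop_attr)
        if prop is None:
            return
        if "href" in attrs:
            # For links, the value is the link target, not the link text.
            self.properties[prop] = attrs["href"]
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending is not None and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None


def extract(markup, type_attr, prop_attr):
    """Return (item type, {property: value}) for one flat item."""
    parser = PropertyExtractor(type_attr, prop_attr)
    parser.feed(markup)
    parser.close()
    return parser.item_type, parser.properties


if __name__ == "__main__":
    print(extract(MICRODATA, "itemtype", "itemprop"))
    print(extract(RDFA, "typeof", "property"))
```

For a flat case like this the two syntaxes differ by little more than attribute names; the differences only start to matter once you want to mix vocabularies or express richer relationships.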

But what probably disappoints most about schema.org is the lack of commitment to re-using existing vocabularies. Isn't that an important aspect of the Semantic Web? Re-use! Minimise duplication! Schema.org duplicates the work of established RDF vocabularies such as FOAF, Dublin Core, the Music Ontology Specification, and many others, and often in a less expressive way. Why re-invent them? But this is part of a more general phenomenon. Existing RDF standards have benefitted from years of developer feedback, research and development, and disparate use cases; essentially, they have attempted to deliver what developers have asked for. Rather than harness them, the search engines have declared that they would prefer standards that work better for them. Their vision of structured data is one in which they, and not the Semantic Web community, the W3C, or the Web community for that matter, control the direction of the Semantic Web. The true impact of schema.org is therefore more philosophical than technical – and not in a good way.

So perhaps the instinctive technical reaction from 'Semantic gurus' is a little melodramatic. Schema.org will change the structured data landscape to be sure, but it is not in the same marketplace as vanilla RDF, doesn't even try to be, and is far less expressive than RDFa. Moreover, the search engines have announced their continued support for RDFa and Microformats – although no mixing of formats, please (!!!). Some, such as Mike Bergman, even see schema.org as a stepping stone for developers: it fulfils a different purpose and encourages them to move on to richer forms of structured data. And at least schema.org uses URIs, thus enabling some flexibility in how they are referenced in the future.

It is interesting to note that schema.org was announced a few days prior to the SemTech conference, which kicked off yesterday in San Francisco. I wonder what the topic of conversation will be at the conference dinner? Well, we can look at Twitter for that...