I was looking at the slides from a recent talk by Paul Rissen, Senior Data Architect at the BBC, about the history of Linked Data usage at the organisation. One of his slides, number 20 to be exact, reminded me of how quietly revolutionary the work at the BBC has been. The slide was titled ‘The Web as a Content Management System’.
Early on, the BBC decided not to mint their own IDs but to use existing URIs for musical artists from MusicBrainz, a freely available database. For the uninitiated, a URI (Uniform Resource Identifier) is a way for a computer to identify a thing, and it is one of the basic concepts in the Linked Data paradigm.
This instantly gave them a database of 50 million artists, albums and songs, which saved the BBC a huge amount of time and expense. Each MusicBrainz entry has a link to yet another data source, DBpedia, which has the text description of the artist from Wikipedia.
That’s the ‘magic’ of linked data. By magic, I mean exactly how a graph (of data) works. Everything is connected. Follow the links, gather the content.
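To make ‘follow the links, gather the content’ concrete, here is a minimal sketch in Python. It is not how the BBC built anything: the URIs, the predicate names (`sameAs`, `name`, `abstract`) and the `gather` function are all invented for illustration, and the ‘web’ is just a small in-memory list of triples rather than live data.

```python
from collections import deque

# Hypothetical (subject, predicate, object) triples. Every URI and
# predicate name here is made up for illustration only.
TRIPLES = [
    ("http://example.org/bbc/artist/1", "sameAs",
     "http://example.org/musicbrainz/artist/1"),
    ("http://example.org/musicbrainz/artist/1", "name", "Example Artist"),
    ("http://example.org/musicbrainz/artist/1", "sameAs",
     "http://example.org/dbpedia/Example_Artist"),
    ("http://example.org/dbpedia/Example_Artist", "abstract",
     "A short biography of the artist."),
]


def is_uri(value):
    """Treat anything that looks like a web address as a link to follow."""
    return value.startswith(("http://", "https://"))


def gather(start, triples):
    """Walk the graph breadth-first from `start`, collecting literal values.

    Returns a dict mapping each predicate to the list of literal
    (non-URI) values found on any node reachable from `start`.
    """
    content = {}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subject, predicate, obj in triples:
            if subject != node:
                continue
            if is_uri(obj):
                # A link: follow it once.
                if obj not in seen:
                    seen.add(obj)
                    queue.append(obj)
            else:
                # Content: gather it.
                content.setdefault(predicate, []).append(obj)
    return content


result = gather("http://example.org/bbc/artist/1", TRIPLES)
print(result)
```

Starting from a single identifier, the walk picks up the name from one source and the description from another, without either being stored locally.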
I can imagine what the conversations with the heads of editorial were like when the techies suggested the idea of using ‘wild’ data, more often called open data. I’ve been involved in similar conversations. There is always a fear of losing control.
Editorial wants to create faultless content, and it is hard for them to imagine that quality coming from anyone but their own team. The dilemma these days is how to maintain that high quality in an era of shrinking editorial budgets and ever-increasing amounts of data.
See what Jem Rayfield, Senior Technical Architect at the BBC at that time, had to say about the complexity of the data the BBC Olympics 2012 site had to manage.
The ability to automatically and reliably make use of information on the web FOR FREE must have convinced the skeptics on the editorial side of the BBC. To give you an idea of just how much information is out there: DBpedia has data for 4.58 million things (e.g. people, places, music, films, video games, organisations and species). Wikidata, another general information data source, has 26 million similar kinds of ‘items’.
The use of Linked Open Data would have been one battle that had to be fought. The BBC went further and made the strategic decision to also use its resources to help improve the MusicBrainz database. When errors were found, the BBC fixed the mistake in the external data source and not within the walled garden of the BBC’s ICT infrastructure, where only the BBC could benefit from the organisation’s editorial expertise. Of course, the BBC’s charter requires the organisation to provide ‘benefit’ to the public, and contributing to the free and open MusicBrainz database fits nicely with that public service remit.
Regardless of its public service remit, this is a strategically smart approach and one of those quietly revolutionary ideas behind ‘the Web as CMS’. The BBC’s contributions add value to a resource, the MusicBrainz database. That added value, in turn, makes the resource more attractive to others, who will use it and further improve the data. This virtuous cycle is how Wikipedia became a ubiquitous part of our lives online. The BBC is one of the main beneficiaries of its own altruism.
Today the list of MusicBrainz’s contributors includes names like Last.fm, Spotify and Universal Music, who inject Linked Open Data into their knowledge management infrastructure to enhance the effectiveness of their catalogues’ metadata.
Ten years after the BBC started down the Linked Data path, the idea still makes some editors, and even IT directors, worried.
The lack of control is still a concern. Each time an organisation looks at using open data, the same conversation has to be had. What about mistakes or deliberate errors introduced into the data sources? What matters is being able to trace the provenance of the error. Every organisation will have a means to trace the source of an error, and that doesn’t change when you are using the web as your CMS. It’s just that you have a few thousand extra pairs of eyes on the content, and they are likely to catch the error and fix it before your relatively small team does.
The choice to use open data is not an all-or-nothing proposition. Use what you need, ignore the rest. Of course, you can set your own quality guarantees for the data, create a vetting process, track deltas, etc. You can even pull the data into the walled garden of your organisation and never share or play nice with the rest of the community. Just as there is a concern about the data coming in, people worry that they will lose control of the data going out. Rest assured, you can still make the business decision on what internal data you want to share and what you feel commands a premium.
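As a rough illustration of what ‘track deltas’ could mean in practice, the sketch below compares two snapshots of an external data source and lists what changed, so that an editor can review the changes before they go live. The triples, the identifiers and the `diff_snapshots` function are hypothetical, and a real vetting process would involve far more than this.

```python
def diff_snapshots(old, new):
    """Return the triples added and removed between two snapshots.

    `old` and `new` are iterables of (subject, predicate, object) tuples.
    Results are sorted so the output is stable from run to run.
    """
    old_set, new_set = set(old), set(new)
    return {
        "added": sorted(new_set - old_set),
        "removed": sorted(old_set - new_set),
    }


# Hypothetical snapshots of the same external record, taken at two times.
yesterday = [
    ("artist/1", "name", "Example Artist"),
    ("artist/1", "country", "GB"),
]
today = [
    ("artist/1", "name", "Example Artist"),
    ("artist/1", "country", "FR"),
    ("artist/1", "genre", "pop"),
]

delta = diff_snapshots(yesterday, today)

# Anything in the delta can be queued for editorial review
# rather than being published automatically.
for change, triples in delta.items():
    for triple in triples:
        print(change, triple)
```

The point is only that incoming open data can be inspected like any other change: nothing in the approach forces you to accept an edit you have not looked at.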
Those arguments were as true ten years ago as they are today, but back then it was hard to convince organisations of the advantages of open data. The use of open data is commonplace now. There are numerous examples proving the real value that open data provides. We have moved from the bold experiments of the BBC to ‘ignore at your peril’.
The scale of content and data that an organisation must make sense of long ago went beyond what any one organisation can handle. The data problems that giants like Google, Twitter and Facebook were dealing with ten years ago are the problems that all organisations are dealing with now. As a result, organisations increasingly can’t afford to manage data and content without making use of the data that exists openly and freely on the web. The simple but radical idea of ‘The Web as a CMS’ is increasingly the norm.