Of nonsensical numbers: openness score

[Summary: A brief critique of the ‘openness score’]

A recent Cabinet Office press release, picked up by Kable Government Computing, states that: “The average openness score for all departments is 52%”.

What’s an openness score, I hear you ask? Well, apparently it’s “based on the percentage of the datasets published by each department and its arms-length bodies that achieve three stars and above against the Five Star Rating for Open Data set out in the Open Data White Paper”. That is, it’s calculated by an algorithm that looks over all the datasets a department publishes on Data.gov.uk and checks what format the linked files are in.
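Concretely, as far as I can tell, the calculation amounts to something like the sketch below (this is my reading of the press release, not the actual data.gov.uk code, and the star ratings are invented for illustration):

```python
# Rough sketch of how the 'openness score' appears to be derived:
# the percentage of a department's datasets rated at three stars
# or above on the 5-star scale.

def openness_score(star_ratings):
    """Percentage of datasets rated 3 stars or above."""
    if not star_ratings:
        return 0.0
    open_enough = sum(1 for stars in star_ratings if stars >= 3)
    return 100.0 * open_enough / len(star_ratings)

# e.g. a department with four datasets, two of them at 3+ stars:
print(openness_score([0, 2, 3, 5]))  # -> 50.0
```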

Which seems to display both a category mistake on the part of the Cabinet Office, and a rather worrying lack of statistical literacy and of awareness of how such a number might be gamed.

On the category mistake: the openness score appears to equate openness with file format – but ‘openness’ in general is not equivalent to the use of an ‘open’ file format. Firstly, even data in a machine-readable format can be non-open and effectively non-machine-readable depending on how it is formatted: a garbled CSV is, to all intents and purposes, less accessible and open than a well-formatted Excel file. Secondly, openness is not just a technical concept, and is not just about data (I’ve commented on that in more detail here). To take the number of well-formatted datasets as a proxy for departmental openness is reductive and narrow in the extreme. This may just be an issue of communication, such that the Cabinet Office should be talking about an ‘open data score’ rather than an ‘openness score’, but as an input into narratives on open government this risks creating confusion, and again muddling the relationship between openness of government in general and open data.

On nonsensical numbers: even as an ‘open data score’, the current number is practically meaningless, as it is just the ratio of machine-readable datasets to the total published. The score can be increased by removing non-machine-readable datasets from Data.gov.uk, and is skewed by how many datasets a department publishes. A department publishing two datasets, one machine-readable and one not, gets a score of 50%. If it publishes an extra dataset, full of meaningful information but not yet machine-readable, its score drops to 33%. This means the score is not only misleading, but potentially creates perverse incentives that run counter to the very notion of Tim Berners-Lee’s 5-star rating of open data (which, I should remind readers, is not a rigorously designed set of criteria, but something Tim prepared just before a conference presentation as a rough heuristic for how data should be opened) – a rating which calls for people to put data online as a first step, even if it can’t be made machine-readable right away.
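To make that arithmetic concrete, here is the same perverse incentive sketched in code (again, the star ratings are invented for illustration):

```python
# A department with one machine-readable (3-star) dataset and one
# PDF-only (1-star) dataset scores 50%...
def openness_score(star_ratings):
    return 100.0 * sum(s >= 3 for s in star_ratings) / len(star_ratings)

print(openness_score([3, 1]))     # -> 50.0

# ...publishing one more genuinely useful but not-yet-machine-readable
# dataset *lowers* the score:
print(openness_score([3, 1, 1]))  # -> 33.33...

# ...while deleting the non-machine-readable dataset raises it:
print(openness_score([3]))        # -> 100.0
```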

An openness score constructed like the current one potentially incentivises less data publishing, not more.

I hope whoever came up with the idea of the openness score is encouraged to go back to the drawing board and think both about its design and about how it is communicated.

 

11 Comments

  1. David Read

    Hi Tim,

    These are welcome thoughts on this matter, and your points are well worth addressing.

    Saying the openness score equates *just* to file format is a misunderstanding of how the scoring works and of the issues the open data community currently faces when using data.gov.uk.

    There has been massive criticism from geeks that data has been published in PDF format – impenetrable for analysis – so providing a carrot for publishing in a better format is very useful.

    The second thing it is based on – something activists have widely criticised government for – is ensuring the licence is open. Now we have a useful carrot to publish under licences such as OGL, although about 1000 datasets on data.gov.uk still refuse, stating for example ‘terms and conditions apply’, suggesting an activist can’t even include a table or a screenshot of the data in a blog post.

    And one more thing the openness score measures – ease of access. Again, a serious issue: those trying to browse, analyse or mash up government data are blighted by broken links or multiple hops to find the actual data. This year we’ve had public bodies shut down and merged, and hundreds more government websites merged, so keeping track of broken links has never been a more important issue. When a publisher provides a direct link to the data file or API, and maintains it, the publisher is rewarded in the openness score.

    Tim Berners-Lee might have put the system together on the back of an envelope, but it undeniably addresses the three key problems with published data.
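    Roughly speaking, the per-dataset check works along these lines (a deliberately simplified sketch, not the actual data.gov.uk code – the real format and licence lists are much longer):

    ```python
    # Simplified sketch of a per-dataset star rating: an open licence,
    # a working link and the file format each feed into the score.
    OPEN_LICENCES = {"uk-ogl", "cc-by", "cc-zero"}             # illustrative subset
    STARS_BY_FORMAT = {"pdf": 1, "xls": 2, "csv": 3, "rdf": 5}  # illustrative subset

    def star_rating(licence, url_resolves, file_format):
        if licence not in OPEN_LICENCES or not url_resolves:
            return 0  # closed licence or broken link: no stars
        return STARS_BY_FORMAT.get(file_format.lower(), 1)

    print(star_rating("uk-ogl", True, "csv"))   # -> 3
    print(star_rating("uk-ogl", False, "csv"))  # -> 0 (broken link)
    ```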

    Your main argument that the score is “nonsense” rests on the claim that “a garbled CSV is to all intents and purposes less accessible than a well-formatted Excel file” – can you provide any examples? Of the hundreds of CSVs I’ve personally accessed from data.gov.uk, I’ve not come across a single garbled one. They are all in neat rows, all with headings and nearly all with units. Is this merely a theoretical objection, or do you have a list of pertinent examples?

    You’ve previously commented that a serious flaw in data.gov.uk was the documentation for datasets, and that this could be included in the score. I’d like to see whether anyone else sees this as a widespread issue or not, and particular examples of datasets which are lacking.

    You wonder if publishers will ‘game’ the system and remove badly scoring datasets from data.gov.uk. We have controls on deleting, avoiding this specific issue. But more importantly, a metric that has been used a lot more than openness is the ‘number of datasets’. You could argue it has problems of its own, but it has undeniably been a carrot for publishers to keep opening data.

    But I do agree with the concern about the use of the term “openness score” in the wider context. The wording of the press release could be misconstrued as presenting the “openness score” as a direct measure of “open data commitments”. Although the sentence provides surprising detail on how the percentage is calculated, such a broadly named indicator should probably include other aspects of “open data commitments” beyond the licence, machine-readability and ease of access which are included – it doesn’t take account of how many datasets the public body has failed to open up, or has not opened promptly enough.

    David Read
    data.gov.uk developer

  2. David Read

    Thanks Tim. It’s clear we both have strong reservations about using the openness score percentage as an overall indicator for departments.

    But I disagree when you rant against the measurement of each dataset against an ‘openness score’. Although you say you have “got no major objection”, you think that the openness of some data is in the eye of the beholder, and that the complexity of the multiple dimensions, or indeed any useful sense of quality, cannot be captured in a single number. It’s clear from your article http://www.opendataimpacts.net/2011/11/evaluating-open-government-data-initiatives-can-a-5-star-framework-work/ that you’ve given this a lot of careful thought before coming to your conclusion, which I respect, but I think now that we have actual scores it needs re-evaluating.

    We need to change the mindset of the average regional care trust that sees no issue with publishing its October 2012 spending in a PDF file, without a licence, and then allowing the file to go missing six months later. The reality is that a decent chunk of data on data.gov.uk has these sorts of fundamental problems. The five-star rating is an excellent carrot to improve this, and if publishers make the small changes needed to get 3 stars, then in 99% of cases they will end up with excellent open data.

    I don’t think you should call for the five-star scores to be scrapped on the basis that they ignore details such as whether a publisher has provided complete documentation yet (plenty of people leave comments asking for things like this and get answers) or whether a category of datasets should be in Excel, CSV or both. Yes, these are good questions, but I don’t think they skew the whole rating for a dataset by anything like as much as you are suggesting.

  3. Graeme Jones

    I think a progressive ‘openness score’ would reward organisations, departments and business units open enough to put documents or data online as-is, and then to actively and iteratively improve the dataset’s quality from feedback.

  4. David Read

    Thanks for your comments so far, Tim. Really good to get to the bottom of this.

    Aggregating star ratings to get a percentage is bad – we’re agreed on that, as I’ve said. I’m afraid I had so far misunderstood: nearly all of your comments have been about this, whereas I have long used the term “openness score” to mean not the aggregated percentage, but a dataset’s star rating. So I’m sorry about that.

    But I want to challenge you more on the individual dataset ratings. I think that you’re not against any of the (six) individual dimensions of analysis that make up the Five Stars of Openness, but that you see little benefit in combining them into a single star rating.

    In my experience the individual dimensions of analysis hit on the key problems with published data in reality. You seem to be worrying that incentivising a publisher to add an open licence, or convert a PDF to CSV, or fix a broken link could somehow go wrong – so wrong that the incentive is perverse or even damaging.

    You challenge me to define excellent open data, and there are of course many aspects, and it varies from dataset to dataset. To a lesser extent it varies from user to user. But I can tell you with surety that providing it under an open licence, making access easy, and using a structured, non-proprietary format is a bloody good start. I think anyone coming to data.gov.uk would expect the majority of datasets to meet these criteria, yet that is far from the truth, so let’s quantify how well they are met and provide a summary as a star rating. Is that such a bad thing?

    On aggregating the dimensions into a star rating, your other article admits the power of communicating a single number over six individual dimensions. In my experience, that sort of power is what gets a lot of data released and improved, and it is therefore immensely valuable. I don’t think the five stars distort the dimensions very much – more stars is better in all but a few obscure cases. I think everyone would agree that data currently in PDFs would simply be better published in Excel or KML format, say. Excel vs CSV – well, there’s not a lot in it to argue over really, and it’s usually easy to include both. The worry that RDF might be published without any more accessible formats is a special case though. There is nothing bad about shared vocabularies or links as such, but RDF gets complicated, and suitable tools and knowledge are still rare. But rather than throw out the Five Stars, one could tweak it to award four or five stars only if you also provide an accessible version, such as a spreadsheet or web visualisation.
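    Sketched very roughly (illustrative code only, not a worked-out proposal), that tweak might look like:

    ```python
    ACCESSIBLE_FORMATS = {"csv", "xls", "html"}  # illustrative subset

    def tweaked_stars(base_stars, formats_published):
        """Cap at 3 stars unless an accessible format accompanies the RDF."""
        accessible = any(f.lower() in ACCESSIBLE_FORMATS for f in formats_published)
        if base_stars >= 4 and not accessible:
            return 3
        return base_stars

    print(tweaked_stars(5, ["rdf"]))         # -> 3 (RDF only: capped)
    print(tweaked_stars(5, ["rdf", "csv"]))  # -> 5
    ```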

    In conclusion, I think the Five Stars of Openness is an excellent incentive for data publishers. Your concerns are heavily outweighed by the benefits, and you should be celebrating its introduction on data.gov.uk to drive improvement in compliance with basic standards of data publishing.

  5. David Read

    Graeme, I don’t think anyone would disagree with wanting both quantity and quality.

    But how far should, say, geographic data go? Is a PDF map of local allotments good enough? KML of their lat/longs would bring lots of benefits to developers reusing the data. Overlaying them on Google Maps would be friendly to the general public and help with searching and routing to them. Some councils provide commercial web apps with more features, such as comparison with, say, geology. And RDF with common vocabularies would allow aggregation of allotments nationwide. Where is the line drawn between quality and a waste of local authority cash?

  6. David Read

    Tim, thanks again. Your example is clear, and I think the tweak I suggested to the system would cover this possibility, ensuring a CSV or Excel file was published too, whilst avoiding the complexity of your 4/5-star toggles. In addition, I don’t think the parent would have any problem opening the CSV, or the developer converting the Excel – the difference is marginal.

    I don’t think there are any more hypothetical examples beyond the spreadsheet vs RDF one. Of the 100-odd RDF datasets on data.gov.uk, I don’t think any are missing an easier way to access the data. So your example remains purely hypothetical and of little value.

    Now here’s my example of empirical evidence. This week I used the new system on data.gov.uk to discover a central government department with 80 broken links across 26 datasets. They had been broken for months without anyone realising, due to a website reorganisation. The department was very happy to hear about the issues in detail, promptly fixed most of them, and is pleased to see its 5-star scores go up from 0 to 3 in most cases.

  7. David Read

    Tim, I have sympathy with Excel being useful when viewing complicated ONS datasets. One might tweak the five-star rating to also require Excel when a CSV is supplied before awarding 3 stars. I’m not sure you could require the same with KML and Shapefiles, or require other proprietary formats, so this would be a one-off exception. It is something we could do some analysis on, to see how widespread the problem is, or whether it would require more work than it is worth.

    The old Edubase dataset is provided in RDF, CSV, JSON and as web pages, so it would get 5 stars. The new one’s closed licence and difficulty of access both limit it to 0 stars. That said, the automatic routine faces some tough choices at times, and I’ve considered a manual override for datasets which are too complicated for an automatic tool. By the way, I’ll push for it to be added to the data.gov.uk catalogue, as I agree it’s a surprising omission.

    Regarding this week’s improvements to 26 datasets, on this occasion someone central needed to take the trouble. As we grow more confident that the reports are correct, we plan to make them public, in a similar fashion to the Spend Data reports (http://data.gov.uk/data/openspending-report/index), which we’ve seen driving improvement in the data there without the need for someone central to coordinate.

    I guess we are coming at this from slightly different angles – I’m mainly concerned about getting all the 0-star datasets we have today moving in what we can mostly agree is the right direction (something better than broken links to, say, PDFs with closed licences), and you’re focused on the accuracy of the goal, for which we need more of this healthy debate.

    It’s hard to articulate and agree on where we want to be. Tim B-L set the parameters when the project kicked off, and more voices pitching in are needed to refine it. It’s great to have you involved in debating this.
