Digital landscapes: effective open data takes more than a single CSV

Notes from data exploration. I’ve been finding myself tinkering with interesting open datasets quite a lot recently – but often never quite getting round to writing anything up before I have to move on to other work. So, rather than lose the learning – I’m going to try and keep notes on this blog of some of those explorations. First up, data from UK Digital Landscape research.


Alongside the new Government Digital Strategy launched this week, Cabinet Office has published Digital Landscape Research undertaken by 2CV.

Interestingly, the report is published natively for the web, with the PDF option being secondary (you can read all the technical details of how the Digital Strategy and report pages are put together here), and at a number of points refers to the dataset published alongside it.

Getting direct access to the raw data underlying government commissioned research is surely a good thing. As I was reading through I started to make notes of some questions I had that I hoped looking at the data might be able to answer.

For example, section 5 of the report sets out ‘Groupings of people who do not use government services or information online’, plots them on axes for positive and negative perceptions of the Internet, and skills, and then offers descriptive statistics about these groups. However, the report says nothing of how much of the population these groups may represent. Knowing more about the size and socio-demographics of these groups, beyond those few factors presented in the report, would be extremely interesting. Equally, it would be interesting to see more of how these clusters have been constructed, and how the axes have been developed.

However, when I looked at the data it turned out not to be the raw data, but table upon table of cross-tabulations, all set out in a ten-thousand row CSV. I’ve not combed through every row – but from searching on possible variable names, it seems that, even amongst all these cross-tabs, the clusters, which form the most substantive presentation of results in the report, are nowhere to be seen.
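For anyone wanting to repeat that kind of search, here is a minimal sketch of one way to do it in Python. The sample rows and search terms are invented for illustration and are not taken from the published file, and `find_rows` is just a name I’ve picked for the helper.

```python
import csv
import io


def find_rows(csv_file, terms):
    """Return (row_number, row) pairs where any cell mentions any search term."""
    wanted = [term.lower() for term in terms]
    hits = []
    for number, row in enumerate(csv.reader(csv_file), start=1):
        # Join the cells so a term is found whichever column it sits in.
        text = " ".join(row).lower()
        if any(term in text for term in wanted):
            hits.append((number, row))
    return hits


# Invented stand-in for a few rows of a cross-tab export; NOT the published data.
SAMPLE = """Table 1,Internet use by age,,
,16-24,25-34,Total
Uses internet daily,90,85,175
Table 2,Occupation,,
,Banking,Retail,Other
"""

hits = find_rows(io.StringIO(SAMPLE), ["cluster", "segment", "occupation"])
for number, row in hits:
    print(number, row)
```

With a real file you would pass `open(path, newline="")` instead of the `io.StringIO` wrapper. A search like this can only show that a label is absent from the file, not that the underlying variable was never collected.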

Being able to browse the cross-tabs that are there is interesting (for example, discovering the only options in the survey for ‘Occupation’ seemed to be ‘Advertising/Marketing/Market Research; Public Relations; Government Department/Agency; Web design/content provider; Banking; Retail; Other’ – suggesting a rather skewed view of what the population does; and seeing that under 16 year olds are ignored by the research) – but without proper raw data, and a codebook that shows what each variable means, it’s very hard to use this data to do anything more than grab the odd extra statistic out of context for use in a presentation.

I did hope that perhaps the dataset had been listed on along with a little more meta-data, but alas a search there hasn’t turned up anything. I also looked to see if there was anywhere I could ask the authors of the report for more information, or somewhere I could comment on the challenges of using the data for the benefit of others: but again, nothing to be seen. (In fact, as elegant as it is, the online presentation of this report lacks any information on the authors, who is responsible for it, or who to correspond with. A ‘Contact’ link with clear details of the different routes for readers to respond would be a valuable addition to these templates in future.)

All this goes to highlight for me again that open data involves more than just putting a dataset online.

  • Data structure matters – a lot of open data advocacy ends up focussed on format, but not enough attention is paid to structure. In practice, the main use to which the cross-tabulations provided alongside the Digital Landscape Research could be put is picking out statistics by hand, rather than machine processing, and for this, a formatted Excel sheet with tabs for each section of the research, or even a PDF, would be just as functional from a user’s perspective. This isn’t a defence of PDF publishing (though as long as meaning isn’t conveyed in formatting in Excel, the availability of libraries for reading Excel files, and its ability to present more user-friendly information, means I’m not averse to seeing well-structured data published in Excel), but is rather to make the point that structure is what really matters for data re-use. Consistent columns and rows make for much easier analysis.
  • There is more than one dataset – In any research project the same underlying data might be expressed in a number of different datasets, and it may be appropriate to share any number of these for different audiences. Cross-tabulations that are the product of analysis are useful for many users; the underlying raw survey data, with a row for each response, and a column for each variable may be useful for others; so too might the table of derived variables (e.g. clusters) that has been used in producing the analysis. ‘Publish the data’ doesn’t need to mean one dataset, but could mean publishing data from various stages of research and analysis to meet different needs.
  • Meta-data matters – Without a code book, how can anyone know what the variables mean?
  • Show your working – Between the raw data (if it was shared) and report would still sit a lot of analysis. To really promote opportunities for open data to enable secondary analysis, a researcher would need to be able to see the SPSS, STATA or R commands used to generate conclusions. Sharing source alongside data is part of putting data in context.
  • Be social – If I were to make a cleaned up version of this dataset, structured to more easily support exploration and analysis – how would anyone else find it? Datasets need to be embedded within spaces that support conversation and collaboration. At the very least, for government that should mean listing them on and linking to them there where there’s a minimal comment feature. But really, it should involve a lot more focus on supporting proper open data engagement.
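To make the structure point concrete, here is a small sketch of what row-per-response data allows. Everything in it is hypothetical: the variable names (`age_group`, `uses_gov_online`, `cluster`) and the five responses are invented, since the real raw data and codebook were not published. The point is simply that a cross-tab, or the size of each cluster, falls out of raw rows in a few lines, whereas the raw rows cannot be rebuilt from a published cross-tab.

```python
from collections import Counter

# Invented example: one dict per respondent, one key per variable.
# These names and values are assumptions, not the survey's real variables.
responses = [
    {"age_group": "16-24", "uses_gov_online": "yes", "cluster": "A"},
    {"age_group": "16-24", "uses_gov_online": "no", "cluster": "B"},
    {"age_group": "25-34", "uses_gov_online": "yes", "cluster": "A"},
    {"age_group": "25-34", "uses_gov_online": "yes", "cluster": "A"},
    {"age_group": "25-34", "uses_gov_online": "no", "cluster": "B"},
]


def crosstab(rows, row_var, col_var):
    """Count respondents for each (row_var value, col_var value) pair."""
    return Counter((r[row_var], r[col_var]) for r in rows)


def cluster_shares(rows, cluster_var="cluster"):
    """Return the share of respondents falling into each derived cluster."""
    counts = Counter(r[cluster_var] for r in rows)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


table = crosstab(responses, "age_group", "uses_gov_online")
shares = cluster_shares(responses)

for (age, answer), n in sorted(table.items()):
    print(age, answer, n)
print(shares)
```

A real survey analysis would also apply the survey weights before reporting shares like these, which is one more reason a codebook and the analysis commands need to travel with the data.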

I started out planning to write a post on my explorations of the Digital Landscape Research data. Unfortunately there’s still a way to go before government is publishing truly engaging data by default. However, it’s still fantastic to see the positive step that data underlying research was published, and the team behind it deserve credit for negotiating with a research supplier that their data would be published in this way. Perhaps some part of the digital capacity building planned in the Digital Strategy will focus on all the steps and skills involved in open data publication to help build on these positive steps in future.

One Comment

  1. Graham

    Good post, hits on a really important aspect of Open Data which is (still) that data is designed to fit certain contexts – just like software is designed to work under certain assumptions. Sometimes your context is the same as the original and you can go straight ahead. Sometimes you have to tweak it a bit to meet your context. Sometimes it’s just a case of starting again…

    Once upon a time I started mapping out different contexts based on function. While you can’t predict every single context that people will be in, I suspect (more thinking needed here) that there are certain “common” contexts/functions, which are useful to a majority.

    I wonder if mapping that out would help with 3 questions:

    1. How can you make data available in a format that’s useful to N contexts?

    2. How can you make data available in a format that can be easily tweaked to fit other contexts?

    3. Which functions/contexts are likely to overlap? ie. If you can identify your likely audience(s) and who is most likely to be taking your data forwards, can you build this into your data structure from the start?
