The Guardian Data Blog have shared details today of their ‘workflow’ for cleaning and preparing data and then generating graphics and visualisations. The headline figures about the balance of effort involved (70% of time spent preparing data; just 30% visualising and presenting it) might seem to point towards continued problems with open data being non-standard and always needing a lot of data wrangling before you can do anything with it. For many, including projects like Linked Gov, there is an implicit assumption that data use should be as frictionless as possible. We should be able to spend almost 100% of the time on the ‘fun stuff’ of visualising and making things with the data.

But as I look at the different steps involved in ‘data wrangling’ (converting units; changing around columns; matching different datasets together; etc.) I notice that many of these processes have two outputs:

  1. They lead to better data to work with for visualisation and other uses;
  2. They lead to a better understanding of the nature, contents, complexities and limitations of a dataset.
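
To make the two outputs concrete, here is a minimal sketch in Python of one wrangling step: matching two datasets and converting units. It is an illustration only, not the Guardian's workflow. The function name `match_datasets`, the field names, and all the figures are invented for the example. The point is that the same step can hand back both the cleaned data and a note of what it found along the way.

```python
def match_datasets(spending, population):
    """Join two lists of dicts on 'area' (hypothetical field names).

    Returns (merged_rows, notes): the cleaned data, plus a record of
    what the matching step revealed about the data.
    """
    # Normalise area names so that 'Leeds ' and 'leeds' match.
    pop_by_area = {row["area"].strip().lower(): row for row in population}

    merged, unmatched = [], []
    for row in spending:
        key = row["area"].strip().lower()
        if key in pop_by_area:
            people = pop_by_area[key]["population"]
            merged.append({
                "area": row["area"].strip(),
                # Unit conversion: millions of pounds -> pounds per head.
                "spend_per_head": row["spend_gbp_m"] * 1_000_000 / people,
            })
        else:
            # A failed match is itself information about the data.
            unmatched.append(row["area"])

    notes = {
        "rows_in": len(spending),
        "rows_matched": len(merged),
        "unmatched_areas": unmatched,
    }
    return merged, notes


# Invented sample data, for illustration only.
spending = [
    {"area": "Leeds ", "spend_gbp_m": 2.0},
    {"area": "Bristol", "spend_gbp_m": 1.5},
    {"area": "Kingston upon Hull", "spend_gbp_m": 0.5},
]
population = [
    {"area": "leeds", "population": 800_000},
    {"area": "Bristol", "population": 500_000},
    {"area": "Hull", "population": 250_000},
]

merged, notes = match_datasets(spending, population)
print(merged)  # output 1: better data to visualise
print(notes)   # output 2: what the wrangling taught us
```

In this made-up case the notes show that one area failed to match because the two sources name it differently, which is exactly the kind of limitation a fully frictionless pipeline might hide.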

A similar point emerged in the workshop we held on Monday to prepare for Phase 2 of a Young Lives Linked Data Demonstrator, as we looked at potentially replacing the current process of wading through 100-page code-books with simple searchable online interfaces that provide access to details of the data collected and questions asked. We can dramatically increase access to specific questions; but we may also affect the cognitive processes of understanding a survey and its data – replacing browse with search, and potentially reducing friction too far, so that complex questions appear simple.

This is not to defend the poor quality of much data, or to suggest that it’s not desirable to be able to pull in and use data with little friction. Rather, it is to suggest that somewhere within the 70-30 split of time spent preparing and using data is at least some percentage of time spent ‘understanding’ the data – and we need to find creative ways to make sure this baby doesn’t get thrown out with the data-wrangling bathwater.

