Does removing the wrangling undermine understanding?

The Guardian Data Blog have shared details today of their ‘workflow’ for cleaning and preparing data and then generating graphics and visualisations. The headline figures about the balance of effort involved (70% of time spent preparing data; just 30% visualising and presenting it) might seem to point towards continued problems with open data being non-standard and always needing a lot of data wrangling before you can do anything with it. For many, including projects like Linked Gov, there is an implicit assumption that data use should be as frictionless as possible. We should be able to spend almost 100% of the time on the ‘fun stuff’ of visualising and making things with the data.

But as I look at the different steps involved in ‘data wrangling’ (converting units, rearranging columns, matching different datasets together, and so on) I notice that many of these processes have two outputs:

  1. They lead to better data to work with for visualisation and other uses;
  2. They lead to a better understanding of the nature, contents, complexities and limitations of a dataset.
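To make those two outputs concrete, here is a minimal sketch in Python of one wrangling step: matching two datasets on a shared key. Everything in it is hypothetical and only for illustration: the `match_datasets` helper, the column names and the figures are all made up. The point is that the step returns the merged rows (the better data) and also a report of what failed to match (the better understanding).

```python
def match_datasets(left, right, key):
    """Join two lists of dict rows on `key` and report what failed to match."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    unmatched_left = []
    for row in left:
        other = right_by_key.get(row[key])
        if other is None:
            # No partner row: keep a note of the key rather than dropping it silently.
            unmatched_left.append(row[key])
        else:
            merged.append({**row, **other})
    left_keys = {row[key] for row in left}
    unmatched_right = [row[key] for row in right if row[key] not in left_keys]
    report = {"unmatched_left": unmatched_left, "unmatched_right": unmatched_right}
    return merged, report


# Hypothetical example data; note the stray trailing space in "Hull ".
spending = [
    {"area": "Leeds", "spend_gbp": 1200},
    {"area": "York", "spend_gbp": 800},
    {"area": "Hull ", "spend_gbp": 500},
]
population = [
    {"area": "Leeds", "population": 750000},
    {"area": "York", "population": 200000},
    {"area": "Hull", "population": 260000},
]

merged, report = match_datasets(spending, population, "area")
print(len(merged), "rows matched")
print(report)
```

Here the report shows that ‘Hull ’ and ‘Hull’ did not match. Resolving that by hand is exactly the kind of friction that also teaches you something about how the data was entered.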

A similar point emerged in the workshop we held on Monday to prepare for Phase 2 of a Young Lives Linked Data Demonstrator, as we looked at potentially replacing the current process of wading through 100-page code-books with simple, searchable online interfaces that provide access to details of the data collected and the questions asked. We can dramatically increase access to specific questions, but we may also affect the cognitive processes of understanding a survey and its data: replacing browse with search, and potentially reducing friction too far, so that complex questions appear simple.

This is not to defend the poor quality of much data, or to suggest that it’s not desirable to be able to pull in and use any data with little friction. Rather, it is to suggest that somewhere across the 70-30 split of time spent preparing and using data there is at least some percentage of time spent ‘understanding’ the data, and we need to find creative ways to make sure this baby doesn’t get thrown out with the data-wrangling bathwater.
