Untangling the open data debate: definitions and implications

I’m exploring creating a series of short notes based on my current PhD research into open data as tools to support wider dialogue around data policy and practice. Here’s a draft of the first one, trying to set out some clear categories for understanding debates over data. It’s also available as a two-page PDF here.

Untangling the data debate: definitions and implications

Data is a hot topic right now: from big data, to open data and linked data, entrepreneurs and policy makers are making big claims about ‘data revolutions’. But, not all ‘data’ are the same, and good decision making about data involves knowing the differences.

Big data

Definition: Data that requires ‘massive’ computing power to process (Crawford & Boyd, 2011).

Massive computing power, originally only available on supercomputers, is increasingly available on desktop computers or via low cost cloud computing.

Implications: Companies and researchers can ‘data mine’ vast data resources, to identify trends and patterns. Big data is often generated by combining different datasets.

Digital traces from individuals and companies are increasingly captured and stored for their potential value as ‘big data’.

Raw data

Definition: Primary data, as collected or measured direct from the source. Or Data in a form that allows it to be easily manipulated, sorted, filtered and remixed.

Implications: Access to raw data can allows journalists, researchers and citizens to ‘fact check’ official analysis. Programmers are interested in building innovative services with raw data.

Real-time data

Definitions: Data measured and made accessible with minimal delay. Often accessed over the web as a stream of data through APIs (Application Programming Interfaces).

Implications: Real-time data supports rapid identifications trends. Data can support the development of ‘early warning systems’ (e.g. Google Flu Trends; Ushahidi). ‘Smart systems’ and ‘smart cities’ can be configured to respond to real-time data and adapt to changing circumstances.

Open data

Definition: Datasets that are made accessible in non-proprietary formats under licenses that permit unrestricted re-use (OKF – Open Knowledge Foundation, 2006). Open government data involves governments providing many of their datasets online in this way.

Implications: Third-parties can innovate with open data, generating social and economic benefits. Citizens and advocacy groups can use open government data to hold state institutions to account. Data can be shared between institutions with less friction.

Personal/ private data

Definitions: Data about an individual that they have a right to control access to. Such data might be gathered by companies, governments or other third-parties in order to provide a service to someone, or as part of regulatory and law-enforcement activities.

Implications: Many big and raw datasets are based on aggregating personal data, and combining them with other data. Effective anonymisation of personal data is difficult particularly when open data provides the pieces for ‘jigsaw identification’ of private facts about people (Ohm, 2009).

Linked data

Definitions: Datasets are published in the RDF format using URIs (web addresses) to identify the elements they contain, with links made between datasets (Berners-Lee, 2006; Shadbolt, Hall, & Berners-Lee, 2006).

Implications: A ‘web of linked data’ emerges, supporting ‘smart applications’ (Allemang & Hendler, 2008) that can follow the links between datasets. This provides the foundations for the Semantic Web.

More dimensions of data:

These are just a few different types of data commonly discussed in policy debates. There are many other data-distinctions we could also draw. For example: we can look at whether data was crowd-sourced, statistically sampled, or collected through a census. The content of a dataset also has important influence on the implications that working with that data will have: an operational dataset of performance statistics is very different from a geographical dataset describing the road network for example.

Crossovers and conflicts:

Almost all of the above types of data can be found in combination: you can have big linked raw data; real-time open data; raw personal data; and so-on.

There are some combinations that must be addressed with care. For example, ‘open data’ and ‘personal data’ are two categories that are generally kept apart for good reason: open data involves giving up control over access to a dataset, whilst personal data is the data an individual has the right to control access over.

These can be found in combination on platforms like Twitter, when individuals choose to give wider access to personal information by sharing it in a public space, but this is different from the controller of a dataset of personal data making that whole dataset openly available.

A nuanced debate:

It’s not uncommon to see claims and anecdotes about the impacts of ‘big data’ use in companies like Amazon, Google or Twitter being used to justify publishing ‘open’ and ‘raw data’ from governments, drawing on aggregating ‘personal data’. This sort of treatment glosses over the difference between types of data, the contents of the datasets, and the contexts they are used in. Looking to the potential of data use from different contexts, and looking to transfer learning between sectors can support economic and social innovation, but it also needs critical questions to be asked, such as:

  • What kind of data is this case describing?
  • Does the data I’m dealing with have similar properties?
  • Can the impacts of this data apply to the data I’m dealing with?
  • What other considerations apply to the data I’m dealing with?

Bibliography/further reading:

See http://www.opendataimpacts.net for ongoing work.

Allemang, D., & Hendler, J. A. (2008). Semantic web for the working ontologist: modeling in RDF, RDFS and OWL. Morgan Kaufmann. Retrieved from

Berners-Lee, T. (2006, July). Linked Data – Design Issues. Retrieved from http://www.w3.org/DesignIssues/LinkedData.html

Crawford, K., & Boyd, D. (2011). Six Provocations for Big Data.

Davies, T. (2010). Open data, democracy and public sector reform: A look at open government data use from data. gov. uk. Practical Participation. Retrieved from http://www.practicalparticipation.co.uk/odi/report

OKF – Open Knowledge Foundation. (2006). Open Knowledge Definition. Retrieved March 4, 2010, from http://www.opendefinition.org/

Ohm, P. (2009). Broken promises of privacy: Responding to the surprising failure of anonymization. Imagine. Retrieved from http://papers.ssrn.com/sol3/Papers.cfm?abstract_id=1450006

Shadbolt, N., Hall, W., & Berners-Lee, T. (2006). The Semantic Web Revisited. IEEE intelligent systems, 21(3), 96–101.

2 Comments

  1. Norman Gray

    Tim, hello.

    That’s an interesting set of definitions — it’s always useful to have things like this with an explicit definition one can point to.

    Here with some quibbles and one flat disagreement!

    Quibble 1 — big data: I agree that an operational definition is necessary for this, in contrast to something based simply on numbers of bytes; but simply ‘massive computing power’ is a bit vague. In http://purl.org/nxg/projects/mrd-gw/report (section 1.3) I effectively worry at some similar definitions, and for me, I think that ‘big data’ means data volumes large enough that you have to be innovative in how you manage them. Or slightly less extreme, volumes large enough that simply storing and processing them is a significant technical problem.

    Quibble 2 — personal data: here, it might be worth mentioning that the notion of ‘personal data’ has a formal definition in the DPA. That’s probably the most important definition of that category of data, and it might not be useful to use a different definition.

    Disagreement — raw data: in your definition here, I would replace ‘or’ with ‘as opposed to’. Raw data is largely useless, and shouldn’t be shared, because it’s likely to be unusable. In some cases, the processing necessary for release might be minimal (for example, turning tabular data into a .csv file and documenting what the columns mean), but for (I’d guess) the majority of science data, raw data is of less than no use — it may have a negative value, in that its release may _increase_ costs for everyone.

    All the best!

    Norman

2 Trackbacks/Pingbacks

  1. Untangling the data debate : Tim's Blog

    [...] [Cross posted from my PhD blog where I'm trying to write a bit more about issues coming up in my current research...] [...]

  2. What is a Dataset? | Lost Boy

    [...] data (as defined here!) is more useful as it supports more downstream usage. Raw data has more [...]

Leave a Reply

Open Data in Developing Countries


The focus of my work is currently on the Exploring the Emerging Impacts of Open Data in Developing Countries (ODDC) project with the Web Foundation.

MSc – Open Data & Democracy