The limited promise of (open data) research automation

[Summary: adding to the network of discussion around open data measurement]

Earlier this week the 2nd Edition of the Open Data Barometer launched: a project I’ve worked on over the last two years, helping design the original framework of indicators, setting up the data collection tools, and writing up the report (with a fun learning curve this year in trying to make it an interactive, web-first report as much as possible). Putting together a study like the Barometer is a long and, at times, torturous process. So I was really interested to see that this week the Open Data Institute also released a report on “Benchmarking Open Data Automatically”. Could there be ways around the costs and time commitments of survey-based techniques for understanding the open data landscape?

Unfortunately, at least so far as the report explores, it seems not.

Below I offer a critical reading and reflection, addressing some substantive issues for consideration in taking forward efforts around automated open data measurement, and raising some general issues about the problems of approaching research with an automation lens.

Credit to the authors: they do not at any point overstate the claims for automation, and provide a fair overview of its potential. However, the methodology the report uses to identify suggested indicators and datasets (§4) is unstated and unclear, and given it is framed as an output of the developing-country-focussed Partnership for Open Data (POD), critical consideration of the global applicability of its suggested models appears absent.

Context and environment: qualitative concerns

The report looks at the potential for automated assessment in four areas, building on the Common Assessment Methods framework for open data which covers context/environment, data supply, use & impact.

The suggested automation approaches for Context/Environment variables either involve the secondary use of global indicator data, or propose the use of text-mining approaches on legislative or communicative texts from government. Both these are problematic.

  • Firstly, the report does not appear to consider that the secondary data it wants to draw upon is far from automated itself, but is instead the product of in-depth qualitative research processes, each with their own dynamics and biases. By framing use of this data as a strategy of automation, rather than a strategy of secondary data use, it glosses over the important critical attention that needs to be given when drawing on indicators from elsewhere, and over questions of whether these proposed indicators are adequate to capture the constructs being sought. In the Open Data Barometer model, primary survey data is paired with secondary indicator data (a) to address the gaps in each; and (b) to ensure each component of the ‘Readiness’ Barometer sub-index is based on a plurality of data sources, to limit the potential effects of researcher bias.
  • Secondly, the idea that it will be possible to effectively infer meaningful information about open data policies from automated reading of legal texts is highly questionable. Not only are legal texts often inaccessible online, but the range of languages and legal cultures any analysis would have to deal with raises the question of whether it would in any case be more efficient to work with bi-lingual scholars than with computational techniques. The report fails to cite any examples of studies that have proven the value of such automated text-mining approaches in an international comparative context. Furthermore, automated techniques carry a significant risk of imposing normative assumptions (in this case particularly Western, Anglo-Saxon legal tradition assumptions) about law, policy making, and what constitutes effective policy. When I carried out my initial policy analysis of open data in six countries through a qualitative reading of policies, it became clear that it was important to look at the function, not just the form, of words in policy documents – and to understand the different ways policy operates in different countries. Should one country be assessed as having a better environment for achieving the promised impacts of open data because its laws use the words ‘open data’, while another talks instead about ‘free and digitally available public sector information’? I would be extremely cautious about such possibilities.
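To make the form-over-function worry concrete, here is a deliberately toy sketch: the policy texts and the matching function below are invented for illustration, and no real text-mining system is this naive, but it shows how keyword-driven measurement scores two substantively equivalent policies differently.

```python
# Toy illustration (invented policy texts): naive keyword matching
# captures the *form* of policy language, not its *function*.

def mentions_open_data(text: str) -> bool:
    """A crude 'policy detector' based on a literal phrase match."""
    return "open data" in text.lower()

# Two hypothetical policies with broadly equivalent substance:
policy_a = "All departments shall publish open data under an open licence."
policy_b = ("Public sector information shall be made freely and digitally "
            "available for re-use by any person.")

print(mentions_open_data(policy_a))  # True
print(mentions_open_data(policy_b))  # False -- same function, different words
```

A metric built this way would rank country A above country B despite the two policies doing much the same work.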

One of the weaknesses of the Open Data Barometer as it currently stands is that it assesses only at the National level, whereas in many federal states particular policies or datasets that it is looking for from a national government are devolved to states and sub-national jurisdictions. Working with the complexity of the real world generally requires greater resource for critical analysis and research, not more automation.

Data and use: the meta-data dilemma

The report also disappoints when it comes to automated assessment of both data publication and dataset re-use. It concludes that whilst “automated metrics exist for metadata that stems from data portals such as CKAN, Socrata, OpenDataSoft or DataPress”, such approaches are “however, subject to the existence of high-quality metadata in a consistent, standardised and complete format.”

What it doesn’t flesh out is that this makes it particularly challenging to get at research constructs that represent the range of data published, or the broad usability of data, as opposed to constructs that in fact primarily pick out the quality of metadata. It also perpetuates the idea that open data involves cataloguing data in data portals, rather than making data broadly accessible, published so that citizens can access it alongside government information, and embedding a culture of data publication across government rather than treating open data as some centralised responsibility.

Such metadata-centric metrics are also inherently open to gaming. If a metric, for example, is based on the percentage of datasets, discovered through a metadata catalogue, that conform to some automated tests for quality (e.g. the UK Government’s ‘openness score’), then the easiest way to raise your score is to drop from the catalogue those datasets that don’t meet the quality bar. Sensitivity to gaming is vital to any analysis of metrics, particularly those seeking to create behaviour change. If it’s possible to change the metric scores by changing behaviours other than those the metric aims to target, then there need to be feedback loops to control for this. These often exist simply as part of the processes of analysis in qualitative research, but need to be explicitly planned for in any automated methods.
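As a purely illustrative sketch (the dataset names and quality tests are hypothetical, and this is not the UK Government’s actual scoring logic), the snippet below shows how a catalogue-percentage metric rewards delisting weak datasets rather than improving them:

```python
# Hypothetical catalogue-based 'openness' metric: the share of catalogued
# datasets passing automated quality checks. All names/fields invented.

def openness_score(catalogue):
    """Fraction of catalogued datasets passing the automated quality bar."""
    passing = sum(1 for d in catalogue
                  if d["machine_readable"] and d["open_licence"])
    return passing / len(catalogue)

catalogue = [
    {"name": "budget",    "machine_readable": True,  "open_licence": True},
    {"name": "contracts", "machine_readable": True,  "open_licence": True},
    {"name": "spending",  "machine_readable": False, "open_licence": False},
    {"name": "land",      "machine_readable": False, "open_licence": True},
]

print(openness_score(catalogue))  # 0.5

# 'Gaming' the metric: delist the failing datasets instead of fixing them.
# Less data is published, but the automated score improves.
gamed = [d for d in catalogue if d["machine_readable"] and d["open_licence"]]
print(openness_score(gamed))  # 1.0
```

The score rises from 0.5 to 1.0 while the public actually has access to less data: exactly the behaviour change the metric was not meant to incentivise.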

When it comes to data use assessment, I suspect the bottom-up methods for assessing the value of open datasets for particular use cases, being explored by some of my Web Science DTC colleagues at the University of Southampton, are much more promising (albeit resource-intensive).

Impact: researcher support

The report notes that “measuring impact with an automated approach is inherently difficult”. The Open Data Barometer uses, as a proxy for impact, peer-reviewed ‘perceptions of impact’ amongst informed researchers, based on being able to cite media or academic literature that connects open data and specific impacts. Whilst the expert judgement can’t be automated, it would be useful to consider where automation could support the researcher in locating relevant sources.

For example, right now researchers have to go out and track down news stories, and there are some inter-coder reliability issues that are likely to arise as a result of different search strategies and approaches. But the same technologies proposed for text-analysis of legal documents, rather than producing an output metric, could produce a corpus of sources for human analysis – helping dig deeper into sources of the stories of impact. A similar approach may be relevant to researcher assessments of datasets and data uses as well: using online traces to support researchers to locate and profile dataset re-use, but recognising the need for assessments that are sensitive to different contexts and cultures (e.g. for some datasets, searching on GitHub can turn up all sorts of open source projects that are re-using that data, but this only works in countries and cultural communities where GitHub is the platform of choice for source code sharing).
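As a sketch of the kind of researcher-support tooling this suggests, the snippet below uses GitHub’s public repository search API to surface candidate re-use projects for human review. The query and helper names are illustrative assumptions, and, as noted above, the results would still need interpretation sensitive to which communities actually use GitHub.

```python
# Illustrative sketch: surfacing candidate dataset re-use on GitHub for a
# human researcher to review. The judgement itself stays with the researcher.
import json
import urllib.parse
import urllib.request

def github_repo_search_url(query: str) -> str:
    """Build a GitHub repository-search URL for a dataset name or topic."""
    return ("https://api.github.com/search/repositories?q="
            + urllib.parse.quote(query))

def find_reuse_candidates(dataset_name: str):
    """Fetch candidate repositories mentioning the named dataset.

    Returns (full_name, description) pairs as leads -- not as a metric.
    """
    with urllib.request.urlopen(github_repo_search_url(dataset_name)) as resp:
        results = json.load(resp)
    return [(item["full_name"], item.get("description"))
            for item in results.get("items", [])]

if __name__ == "__main__":
    # Print the query URL rather than calling the API in this sketch.
    print(github_repo_search_url("open contracting data"))
```

The output here is a corpus of leads for qualitative analysis, not an automated score, which is the distinction the paragraph above argues for.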

An automatic no?

Should we reject all automated methods in open data research? Clearly not: projects like the Open Data Monitor, working in the European context of a single market and already substantial cross-government standardisation, and concepts like Ulrich Atz’s Tau of Open Data, have the potential to offer very useful insights as part of a broader research framework. And the kinds of secondary data considered in the automation report can be useful heuristics for understanding the open data landscape, providing they are read critically.

However, rather than focusing on automated measurement, tools to help researchers locate sources, collate data, and make judgements may be a much more appropriate place to focus attention. Big data methods rooted in working with the volume and variety of sources, rather than metadata-driven methods that require standardisation prior to research, are likely to be much more relevant to the diverse world one encounters when exploring open data in a development context.

Data standards and inclusion in the network society

This note is based on a presentation given at the ‘Inclusion in the Network Society Roundtable’ hosted by IT For Change in Bangalore, 29th September – 1st October. It further develops a number of points that were not included in the talk due to time constraints.

The original presentation responded to themes from earlier talks, including Nishant Shah’s insightful challenge to the notion of the ‘network’ as the organising concept of our time, Mark Graham and Christopher Foster’s empirical look at inequalities of information production and use, and Roberto Bissio’s look at the ‘Data Revolution’ and trends in big data, and so may make most sense read in light of these.

The hidden regulatory power of data standards

Lawrence Lessig has famously declared that ‘Code is Law’ (Lessig 2006), drawing our attention to the regulatory function played by digital systems in our modern world. As we think about Inclusion in the Network Society, I want to also draw attention to the role of data and data standards in structuring the systems that affect inclusion and exclusion, and, for a short while, to surface these increasingly important elements of our digital infrastructure for scrutiny.

I was drawn to explore data standards and their political salience through my work on open data policies around the world. The standard narrative of open data suggests that open data is simply about transferring datasets from inside the state to being published openly on the Web. All that changes is where the dataset sits, and who can use it. Yet in practice this is not the case at all (Davies and Frank 2013). Datasets rarely exist as simple atomic things. Instead, opening data often involves the construction of new datasets: standardised, reshaped and filtered. Legitimate privacy concerns mean that much state data, based as it is on personal records, cannot be opened wholesale; instead only aggregated extracts can be published. And in presenting data to the world, governments may take disparate datasets and seek to render them using common external-facing standards.

Watching open data operate in practice, I have observed a growth in efforts at standardisation. And unlike past standardisation projects, which took place in the committees of standards organisations, this new wave of standardisation has a more ad-hoc nature to it: taking place across the Web, and often at global scale. Standards not only configure the data that gets published, they also configure the tools available to work with it, and in the process change the costs and benefits a government faces when deciding whether or not to use a particular standard. Standards both enable certain uses of data, and constrain what can be expressed, what can be known from data, who can be known, and who can know.

Data structures and standards have an interesting property when compared to the other modes of regulation in a network society. Lessig identifies four modes of regulation:

  • Law
  • Markets
  • Norms
  • Architecture

And data standards sit alongside code as part of the ‘architecture’ mode. Whilst our network society can make law, markets and norms more visible and contestable, by default code, data and standards become an embedded part of the background, rarely subjected to scrutiny, and rarely open to being shaped by those whom they affect.

Considering the regulations faced in day-to-day life, when acting we have a choice between:

  1. Taking action within the current ‘rules’ of the game (being an active consumer; voting; creating something); or
  2. Changing the rules.

The malleability of the rules depends on both the level at which they are set, and how visible they are to us as subjects of critique. Law operates, more or less, at the level of the nation state. Markets, although with many invisible aspects, are increasingly made visible to us through media technologies: and alternative models of production and consumption offer opportunities to change within the system, and to some extent to change the rules of the system. Cultural norms are similarly surfaced and challenged by communications networks. Yet architecture becomes less visible. It takes what Bowker and Star (Bowker 2000) call an ‘infrastructural inversion’ to get us to see it. We often only become aware of these infrastructures when they break down. In scrutinising the impacts of infrastructures on inequality and equality, we could look at architectures and policies at all levels of the Internet stack. Questions of net neutrality, for example, are well explored in terms of their potential to consolidate the power of the powerful at the expense of new participants in the network. However, my focus is on the emerging network of architectures for data on the Web: sitting at the content layer, and built out of data standards, shared categories and identifiers. Michael Gurstein has drawn attention to the fact that open data is likely to be more empowering to those who have the access, skills and capacity to get things out of the data (Gurstein 2011) – but I want to draw attention to the way it also empowers those who have the ability to get things into the data, and most importantly, those who can get things baked into the standards through which data will be exchanged.

The field-level tussle

To give a concrete example, let me turn to a project I’m currently working on: the Open Contracting Data Standard. The World Wide Web Foundation is leading work to prototype a data standard that will help deliver on government pledges to the Open Contracting Partnership that all stages of procurement, land and extractives contracts will be made open: published in detail online. Because lots of disparate datasets on contracting would be hard to assess for value and hard to analyse, a standard is needed to specify how such data should be released.

We’ve gone about building the standard from three directions: firstly, looking at the data that states already publish (there is little point in building a standard that can only represent data which doesn’t actually exist); secondly, looking at detailed user stories of how different communities, from corruption monitors to small firms competing for government contracts, want to make use of contracting data. Needless to say, there is not perfect alignment between these. Thirdly, we’ve been looking at existing standards and tools, seeking opportunities to re-use rather than invent approaches to representing data: identifying the best ways to maximise the number of people who will be able to access and use the data, without sacrificing too much expressiveness or extensibility.

Things that might seem like obscure technical choices, such as how to represent location in the data, are ultimately socio-technical tussles with significant potential consequences. To follow the example through: users often want to know the locations related to a contract, but which location? The jurisdiction of the procuring body? The location where goods and services are to be delivered? What about when there are multiple goods and services going to different locations? In designing a standard, decisions have to be made as to whether to allow location to be attached anywhere in the data, or whether to restrict the ways it can be expressed. If the locations of contracts, organisations, line-items and so on can all be expressed, what is the chance that different governments will actually capture this in their databases, and that tools will develop that can cope with location data at all these levels? Design choices are constrained by the existing database infrastructures of the state: but they also feed back into creating those infrastructures in the future. If only the owners of existing systems drive the standardisation process, new data needs will fail to be accommodated in the standard. But if it is based on ideals, rather than the realities of currently held data, it will struggle to see adoption.
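To illustrate the tussle (using invented field names, not the actual draft schema), here are two ways the same contract’s location information might be structured, each constraining what publishers must capture and what tools and analysts can later ask:

```python
# Two hypothetical shapes for the same contract record. Field names are
# illustrative only -- not the Open Contracting Data Standard itself.

# Option 1: a single location attached at the contract level.
# Simple to publish and parse, but cannot express multiple destinations.
contract_level = {
    "contract_id": "C-001",
    "buyer": {"name": "Ministry of Health", "jurisdiction": "Countrywide"},
    "delivery_location": "Northern District",
}

# Option 2: location attached per line-item.
# More expressive, but will governments capture it, and will tools cope?
line_item_level = {
    "contract_id": "C-001",
    "items": [
        {"description": "Vaccines",     "delivery_location": "Northern District"},
        {"description": "Cold storage", "delivery_location": "Capital City"},
    ],
}

print(contract_level["delivery_location"])
print([i["delivery_location"] for i in line_item_level["items"]])
```

Neither shape is neutral: the first forces a lossy summary of multi-site contracts, while the second demands richer source systems than many states currently run.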

Flattening the network society

If readers will permit a further excavation of the infrastructural layers of data standards, even beneath the level of discussing fields, objects and properties within a dataset, there are substantial questions and tensions to explore about the different data models being used on the Web, and their implications for inclusion and empowerment.

In the early days of open data, a lot of emphasis was placed on Linked Open Data (LOD), represented through the RDF (Resource Description Framework) model. This is a graph-based data model: essentially a way of representing data built for the network. A truly network society data model. In LOD, everything is either a node or a link. In fact, even nodes are identified using links. Theoretically, data can be distributed right across the web: the power to describe different things devolved to disparate actors and data systems through identifiers layered on top of the domain name system. Yet in practice, LOD is computationally and cognitively expensive to work with. Although graph-based data models underlie many of the most popular services on the Web (not least the social graphs of Facebook and Twitter), these run on centralised infrastructures, and a distributed Web of data has struggled to catch on.

Similarly, XML, which stands for eXtensible Markup Language, was once seen as the future data interchange language of the Web, with its strong validation and schema languages and its ability to mix document and data by adding structured, meaningful markup to text. It has since lost its place in programmers’ hearts to JSON (JavaScript Object Notation), a schema-free, simple model, easy to author by hand, capturing data in a simple tree structure (a tree structure is like the numbered paragraphs in a very detailed report, with section 1, 1.1, 1.1.1 and so on). However, earlier this year even JSON appeared to be losing ground, as Jeni Tennison, a member of the World Wide Web Consortium’s (W3C) Technical Architecture Group, and formerly one of the foremost advocates of both XML and Linked Data, declared it to be the year of CSV (Tennison 2014): Comma Separated Values, or simple flat tabular data.

CSV wins out as a ‘standard’ because it is so simple. Just about any system can parse CSV. It’s a lowest common denominator of data exchange, increasingly required because with open data it’s not just government engineers exchanging data with other government engineers with the same background and training, but government exchanging data with all sorts of firms, citizens and non-specialist users. Flat data is also computationally cheaper to process: not least in large quantities, where datasets with millions of rows can be split apart and processed using map-reduce functions that send off bits of the data to different computers to process rapidly in parallel, re-assembling the results when done. Paradoxically, the opening of data potentially drives us towards (over?)simplified expressions of messy social realities: including more people in the community of potential data users, whilst struggling to accommodate the more complex, networked data structures we might need to describe our world more inclusively.
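A minimal sketch of the trade-off (values invented): the same payments data as a nested record and as flat rows, with the flat form split into chunks and summed map-reduce style.

```python
# The same data in two shapes. Values and field names are illustrative.

# Nested ('tree') form: richer structure, harder to split for processing.
nested = {
    "department": "Health",
    "payments": [
        {"supplier": "A", "amount": 100},
        {"supplier": "B", "amount": 250},
    ],
}

# Flat ('CSV-like') form: one row per payment, trivially partitionable.
flat_rows = [
    ("Health", "A", 100),
    ("Health", "B", 250),
]

# Map-reduce in miniature: map each chunk of rows to a partial sum,
# then reduce the partial sums. Real systems ship chunks to many machines.
chunks = [flat_rows[:1], flat_rows[1:]]
partials = [sum(amount for _dept, _supplier, amount in chunk)
            for chunk in chunks]
print(sum(partials))  # 350
```

The flat rows can be processed independently and in any order, which is exactly what makes flat data computationally cheap; the nested form carries relationships the rows have discarded.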

Standardisation around flat data also carries some more basic inclusion risks. The current proposed best practices for CSV on the web are based on defining the meaning of columns by column position. Column A, for example, will always be a ‘contract ID’, column B the amount it is for, and so on. This is much more brittle than the extensible tree structures of JSON, where a user can drop in new fields and data if they require, and much less sensitive to multilingualism than data languages such as XML. The choices made in the next year over CSV on the Web standardisation potentially have big consequences for the representation of knowledge and data on the web. Most often, choices of data serialisations and standards are made pragmatically: seeking paths of least resistance given what is known about developer and user skills. Yet these choices can embed important knowledge politics, and there is a need to explore further how far the architects of open data on the web are consciously seeking to combine that politics with pragmatism.
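A small sketch of the brittleness point (the column layout is invented): a consumer that reads CSV values by position silently returns the wrong value when a new column is inserted, while key-based, JSON-style access is unaffected.

```python
# Illustration of positional vs key-based access. Layout is hypothetical.
import csv
import io

original = "C-001,15000\n"
amended  = "C-001,GBP,15000\n"   # a 'currency' column inserted at position 1

def amount_by_position(text):
    """Read the amount assuming it is always in column 1 -- the brittle bit."""
    row = next(csv.reader(io.StringIO(text)))
    return row[1]

print(amount_by_position(original))  # 15000
print(amount_by_position(amended))   # GBP -- wrong value, and no error raised

# Key-based access keeps working when new fields appear alongside old ones.
record = {"contract_id": "C-001", "currency": "GBP", "amount": 15000}
print(record["amount"])              # 15000
```

The positional consumer fails silently, which is the worst failure mode for data re-users; the keyed record tolerates extension without renegotiating column order.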

Shaping systems and infrastructures: how open data creates opportunities and challenges

If a standard is successful, then systems start adapting to fit with it. This can be seen in another global open data project: the International Aid Transparency Initiative (IATI). IATI provides a standard for talking about aid projects, and now, after five years of development, governments are starting to adapt their internal systems around the standard: shifting their patterns of data collection in order to better generate IATI data, and in the process potentially scrapping fields they used to collect. The standard may even come to trump local business requirements for data, as the costs of using it fall, and the costs of customisation remain comparatively high.

The point here is that the invisible infrastructure of data standards starts to shape the rules of the game, determining the data that states collect, what can be easily known, and what kinds of policies can be made and measured. This impact of standardisation is of course not new at all (Bowker and Star’s work (Bowker 2000) explores the 100-year-old International Classification of Diseases and its impacts on health policy making across the world), but open data standardisation brings a number of new elements to the picture:

  • Increased intergovernmental cooperation at the technical level;
  • Increased interaction with boundary partners in the generation of standards – often private sector partners;
  • Increased embeddedness of standards within the wider emerging Web of data;

Open data creates space for a different dynamic of standardisation, and opportunities for more inclusive approaches to standard setting. Yet it also risks leaving us with new standards created through formally open, pseudo-meritocratic, but in practice relatively closed, processes, which do not incorporate a critical politics of inclusion, and which generate lowest-common-denominator standards that restrict rather than enhance our ability to see the state, and to imagine and pursue a wide range of future social policy choices.

To return to Michael Gurstein’s critiques of open data, Michael has suggested we need to talk not just about ‘open data’, but about ‘open and inclusive data’. The same goes for data standards, with the need to find methods for building not just open standards, but open and inclusive standards for our networked futures.


Bowker, Geoffrey C. 2000. “Biodiversity Datadiversity.” Social Studies of Science 30 (5): 643-683.

Davies, Tim, and Mark Frank. 2013. “‘There’s No Such Thing as Raw Data’. Exploring the Socio-Technical Life of a Government Dataset.” In Web Science 2013. Paris, France.

Gurstein, Michael. 2011. “Open Data: Empowering the Empowered or Effective Data Use for Everyone?” First Monday 16 (2).

Lessig, Lawrence. 2006. Code. Version 2. Basic Books.

Tennison, Jeni. 2014. “2014: The Year of CSV.” Open Data Institute.

Call for Papers: JeDEM Special Issue on Open Government Data and Open Policies

See the full call here.

Special Issue on Open Government Data and open policies

Special Issue 01/2014: Open Government Data and Policies

Guest Editors

  • Ina Schieferdecker, Fraunhofer FOKUS, Germany
  • Marijn Janssen, Delft University of Technology, The Netherlands
  • Tim Davies, University of Southampton, UK
  • Johann Höchtl, Danube University Krems

Editorial Board

  • Marianne Fraefel (Bern University of Applied Sciences)
  • Mila Gasco, ESADE, Spain
  • Alessia Neuroni (Bern University of Applied Sciences)
  • Peter Parycek, Danube University Krems
  • Reinhard Riedl, Bern University of Applied Sciences
  • Anneke Zuiderwijk (Delft University of Technology)

In efforts to increase openness, transparency and participation, governments around the world have drafted Open Government policies and established Open Data as an integral part of modern administration. Open data and public sector information have been held out as a powerful resource to support good governance, improve public services, engage citizens, and stimulate economic growth. The promises have been high, but the results so far have been modest, and there are more and more critical voices. Policies have not resulted in the desired benefits, and implementations have been criticised for their technology orientation and neglect of the user perspective. These policies and implementations are now under scrutiny, with important questions to be asked about whether the results justify the efforts, how different outcomes from open data can be secured, and who is benefiting from open data in different countries and contexts.

JeDEM Journal for eDemocracy is inviting submissions to the following topics:

  • Ongoing and finished projects using open government data: benefits, opportunities and challenges;
  • Innovation and efficiency use-cases of open data within government;
  • Visualisation, simulation and gamification that seeks to reduce the complexity of open government data;
  • Smart cities, smart regions and the enabling effects of open data;
  • Economic aspects of open data, including open data business models;
  • Encouraging data usage by commercial and non-profit developers;
  • Policies for stimulating use, and institutional arrangements;
  • Political and legal aspects of open data, including its relationship with Right to Information/Freedom of Information policies;
  • Privacy and open data: tensions between open government data and citizens’ privacy rights;
  • Global perspectives: open data as a phenomenon of developed countries, or a global phenomenon? Differences and similarities across countries, cities and regions;
  • The impact of open data on the public sector workplace of the future: personal opinions and a human face vs. administrative decisions and procedures;
  • Open data shifting boundaries at the intersection of public administration and public sphere: citizens as public agents and civil servants as members of the community;
  • Open data infrastructures, ontologies, methods and tools and their impact.

Open data is both a social and a technical phenomenon, and studies are needed that explore the interaction of technology and policy. Many national data portals, from Germany and Austria to the USA and the Philippines, already adhere to agreed metadata standards for describing data, and the G8 Open Data Charter has committed members to harmonising metadata. However, open online data by its nature does not halt at national or organisational boundaries. To deliver on the European Digital Agenda 2020 vision of a digital common market, on the UN vision of a post-2015 ‘data revolution’ enabling greater coordination, or on the goal of advocacy organisations of joining up data from across countries to track financial flows and root out corruption, it needs to be easier to join up data across countries. The European Commission has already elaborated a metadata description to help bridge data from different EU member state administrations and to mitigate language barriers in data descriptions, and efforts are ongoing to develop a wide range of open data standards, covering issues from aid and public contracting to parliamentary records and public transport timetables. This accumulated knowledge is collected by the Share-PSI project and should inform the W3C Data on the Web working group. Increasing open data interoperability is an ongoing challenge which administrations worldwide have to deal with.

Therefore this issue of JeDEM is additionally calling for submissions in these areas:

  • Requirements, costs & benefits, as well as evaluation, of existing standardisation efforts: including a focus on metadata, naming schemes, and URI schemes;
  • Procedures for publishing open data, including identification of data, preparation of data, and handover from internal departments to Open Data Portals;
  • Efforts for interoperability of open datasets, including extending existing core vocabularies;
  • The impacts of open data interoperability demands on internal organisational change and processes;
  • The connections or conflicts between technical requirements from developers and other re-users of data, vis-a-vis the data currently supplied through Open Data Portals;
  • Legal standardisation, licensing and liability – and their impact on developer and other third-party re-use of data

Author guidelines

Length of paper: 7,500-12,000 words; all drafts should be double-spaced, and submitted in Word format for processing reasons.
JeDEM encourages scientific papers as well as case studies, project descriptions and reflections. More guidelines for authors and a Word template can be found here.

Important Dates

Call out: 24 April 2014
Submission deadline: 30 July 2014
Start of peer review: 1 August 2014
End of peer review: 1 September 2014
Editorial Decisions: 15 September 2014
Authors revisions: 20 October 2014
Publishing: 30 October 2014

A diverse open data discourse?

[Summary: on the need for greater diversity in the open data discourse in 2014]
2014 is going to see a lot of Open Data conferences and events around the world, particularly in developing countries, where open data has become part of the donor discourse. And a lot of these events are likely to be packed full of anecdotes and examples of open data applications and websites drawn from the USA and Europe, and presenters whose main contributions to open data have been made in the leading cities of high-tech, stable democracies with decades or centuries of systematic governance data collection and records. The stories they can tell are often inspiring – and can spark many ideas about how government could be done differently, or how citizens can use data to drive bottom-up change. But the stories they tell should not be taken as templates to be transferred and applied in different countries without consideration of the vastly different contexts.
As the Open Data Barometer demonstrated, many developing countries don’t have consistently collected state datasets just waiting to be opened up, and may have much smaller technology communities to draw upon in mediating raw data into useful platforms and products. In the Open Data in Developing Countries research network we spent time in Cape Town a few weeks ago discussing the need to split apart the packaged definition of open data offered by most high-profile advocates, recognising that for much of the data in the South it may make sense to focus on just one or two of ‘Proactively Published’, ‘Machine readable’, and ‘Legal permissions to use’ in the first instance, working progressively towards increased openness of data, rather than treating open data as a binary all-or-nothing state. The importance of adapting open data ideas to local contexts has been a key theme throughout the emerging research findings: but it’s not one we often hear from conference platforms.
In the conferences next year then, we need to be hearing more voices from those who have been grappling with open data from African, Asian and Latin American perspectives, as well as those from all continents who have been exploring open data outside the capital cities and with grassroots groups. Shifting the practices of the state, of its interlocutors, of citizens and businesses to be ‘open by default’, and ensuring the net-gains that can bring are fairly distributed, is not a simple task – and it’s one that even leading advocates of open data have only taken the first steps towards. Where we are bringing examples across country contexts in presentations, we need to do more to distill and express the theories of change behind open data impacts, and to open that up so that different countries can work out how to fit the open data vision and agenda into their local political, technical and social realities. And we need to explore the different theories of change emerging across different sectors and countries to understand how the core idea of open data can be assembled in many different ways to bring about change. Getting more diverse voices onto the podium in 2014 is a good way to start that.

Open Data and Improving Governance: issues of measurement

I was speaking at the Institute for International Economic Policy this afternoon at George Washington University at a conference on “Known-Knowns and Unknowns about the Internet: Measuring the Economic, Social, and Governance Impact of the Web“. The input I offered, for a session under the title ‘Has the Internet Helped Citizens and Policymakers Improve Governance? Are the effects measure-able? What new innovations might be helpful?’, was around how we approach measuring the impacts of open data. Below are the notes I prepared for the talk: the actual delivery veered off this a bit – but the version below is probably a bit more clear and concise.

You can also find a recording of the whole panel session here, including a fantastic talk from my Web Foundation colleague Bhanupriya Rao on No-tech, low-tech and high-tech for transparency.

Open Data and Improving Governance: issues of measurement

This talk will focus in on Open Data on the Internet, and through that explore one route by which the Internet is involved in changing governance. It will look at three issues: definitions; the role of measurement; and emerging impacts from a recent study of open data – the Open Data Barometer.

Definitions: Open Data

We need to start by defining our terms: what do we mean by open data, and by governance and, as a result, what kind of measurement makes sense. It is important to have a strong and focussed analytical definition of open data, to avoid it becoming an all encompassing idea. Definitions are important in developing measurement. The session titles at the Known Knowns and Unknowns conference use terms of ‘web’ and ‘Internet’ interchangeably – yet for many kinds of intervention these are not one and the same.

Similarly, we see a lot of confusion in the open data field between ideas of open data, big data, linked data and so-forth – and so drilling down into an analytical concept of open data is an important starting point for both research and practice. It is particularly important to do this with data, as just about anything can be represented as data – and if we’re not careful we end up confusing form and content. For example, if we attribute the creation of a host of mobile phone apps that use open transport data, and that generate economic impacts for both consumers and producers, to the openness of the data alone, rather than at least partially to the fact it is data about transport and, moreover, it is usually data about transport in an urban centre with good public transport systems, then we end up anticipating that the next dataset ‘opened up’ might see similar returns and impacts just by virtue of being open data. But if that next dataset is data on cattle movements from a department of agriculture, we might not see quite so many smart-phone apps emerging.

When we disentangle ‘dataset form’ from dataset function and subject matter we can more intelligently ask about the potential impacts.

So. What is the form of open data? There are three elements we use in our operational definition in the ODDC research network:

1) Proactive Publishing – the idea that governments (or other parties) should put data online without being asked for it;

2) Machine readability – the idea that it should be possible to process the data with a computer – not just read it on screen – but to sort, sift through, filter and generally manipulate it without high technical barriers. In practice this means using standard file formats which can be accessed without expensive software, and which maintain the granularity of the data.

3) Permission to re-use – the idea that there should not be legal restrictions to prevent someone re-sharing or re-using the data they have been given access to. Often government data is placed under copyrights or IP protections that prohibit re-use, and so the open data movement has advocated for the use of clear license statements that, at most, require re-users to attribute the source of the data and that place no other restrictions on those that wish to work with a dataset.
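To make the machine-readability point in (2) concrete, here is a toy sketch, with entirely invented spending figures, of why granular data in a standard format matters: once the rows are machine readable, sorting, sifting and filtering become a couple of lines of code, in a way a PDF table or on-screen summary never allows.

```python
import csv
import io

# Invented example data: a granular, machine-readable spending table
# of the kind a government might publish as a CSV file.
csv_text = """department,year,spend_gbp
Health,2013,1200000
Transport,2013,800000
Health,2012,1100000
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Filtering and totalling become one-liners once the data is granular:
health_2013 = sum(int(r["spend_gbp"]) for r in rows
                  if r["department"] == "Health" and r["year"] == "2013")
print(health_2013)  # → 1200000
```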

It is important in our definitions to be aware of different legal and cultural practices around the world, understanding open data as a socio-technical construct. For example, in some countries government data is assumed to be open regardless of the presence or absence of a license statement; in others, state data is copyrighted by default, and explicit licenses are needed to give re-users the certainty that they have permission to build upon and market products that use the data.

Definitions: Governance

The second important definitional pre-requisite to address the questions in focus in this session is for ‘governance’. Wikipedia here demonstrates the ability of a crowdsourced product to provide the best concise definition, describing governance as concerned with “decisions that define expectations, grant power, or verify performance.”

Now – it might be possible to construct a predominantly descriptive or positivist account of how open data improves the verification of government performance, against pre-set objectives or rules: but any discussion of how the Internet and open data improve governance with respect to decisions that define expectations and grant power is necessarily normative. Deciding when we have an improvement in the setting of expectations or the granting of power involves taking a stand on what counts as improvement. Whilst we may be able to agree at one level about negatives whose removal constitutes improvement – extreme corruption, for example – when it comes to a positive vision of what open data should do to governance we find quite divergent views.

Let me illustrate this point by setting out three distinct theories of change for how open data might affect how power is assigned, each belonging to different traditions of political theory.

  • Firstly, there is the idea that open data enables citizen oversight of their governments and addresses information asymmetries – enabling citizens as voters to better control elected officials as their agents in power. Here, the electoral mechanisms of governance in an electoral democracy, or indeed, the pressures of public opinion in a constrained autocracy, are left unchanged, but the ability of journalists, pressure groups and citizens to punch through the veil of secrecy around state decisions can drive decisions more in line with citizen interests.
  • Secondly, we have the idea of a consumer-democracy – one in which citizens engage in governance through individual consumption choices and selection of public services. This is the theory of change prioritised by David Cameron in his recent speech at the Open Government Partnership – where it is argued that, through open data, citizens can gain more detailed, and personalised, information on public services, and can make more informed choices about which services to select from the ‘marketplace’ of services – thus using market mechanisms to drive better services. Here, ‘governance’ happens through the operation of the market and distributed choices of individuals.
  • Lastly, we have the idea of co-production and more collaborative governance in which open data supports groups made up of citizens, civil society organisations and entrepreneurs to work together with each other, and with the state, to improve policy making and practice. Here, governance is improved when it is more inclusive, and when more people participate in determining the outcome of collectively held power.

These are of course not the only theories of change – but outline the divergent ways in which we might approach the question of what ‘improved governance’ is. Indeed – policy makers and citizens might have very different ideas in any situation of what improved governance looks like: we might hypothesise that more transparent and responsive services are compatible with more efficient and low-cost services, but whether this is true is an empirical question.


The reason for this detour into political theories is not simply to problematise the term of governance, but also to highlight that our approach to measurement also involves taking a stand on normative questions, and to provide a basis for outlining the stand I propose taking.

As Bhanupriya has outlined, many of the kinds of improved governance we want to see involve empowering and enabling groups at the grassroots to engage in policy processes, both acting locally, and speaking out for shared national and global frameworks that better allow them to act locally in the ways that meet their needs. These are not about top-down governance, or enabling policy makers to better control service delivery with a bird’s-eye view.

The site of action, then, if we are to have actionable measurements on open data and governance, is not simply at the level of policy – but is also at the level of grassroots practice. The measurements we make need to allow grassroots groups to both understand ways of engaging with open data as a resource for improving governance locally, and to make appropriate and effective claims on national and international actions to support them with the data they need for better decision making and monitoring of implementation.

ODDC and The Open Data Barometer


Now – having said all this – we might feel that it adds too much complexity to the process of developing a measurement framework, and research into impacts of open data on governance is then necessarily solely a space for action research – with no general measurement possible. But this would not engage fully with the problematisation. Global measurements will be made – and so we should work to make sure that where they are, they are sensitive to the practitioner need at the grassroots, and are a resource for practice – whilst also enabling cross-cutting and comparative global research that can illuminate macro-level trends and feed into national and local policy and practice.

That framed our exploration with the Open Data Barometer, a study launched by Sir Tim Berners-Lee at the Open Government Partnership meeting in London a few weeks ago that takes a multidimensional look at the readiness of 77 states to secure benefits from open data, the implementation of open data policy via the proxy of dataset publication, and emerging impacts of open data by the proxy of media and academic coverage of it.

It was a pilot study, but one we hope provides strong foundations for future work to understand the governance effects and impacts of opening data. Methodologically it uses an expert survey, combined with a number of secondary indicators – used to create sub-indices and an overall Barometer index number for countries to support overall comparison, and comparison along a number of different dimensions. I want to highlight three key considerations and points of learning from the development of the Barometer:

  • We build on learning from qualitative work to look at different aspects of readiness. For the last year the Web Foundation have been running a research network on Exploring the Emerging Impacts of Open Data in Developing Countries – which you can find at – and in this we’ve been working with research partners across the developing world to look at the use of open data in affecting governance. Through this the importance of a number of different aspects of government readiness has been emphasised, including the importance of RTI laws alongside open data laws; this qualitative work has also brought up issues around the importance of civil society intermediaries. Working with these qualitative insights we sought to find indicators and expert survey questions that would help us understand appropriate aspects of the context around open data in different countries.
  • We distinguish different kinds of data. Prior studies of open data publication have used a list of datasets based on those felt to be important in London and Washington, rather than looking for datasets that represent the breadth of government activity, and the breadth of theories of change about how open data operates. We put together a list of 14 dataset categories and asked our expert researchers to assess whether this data was available, online, machine readable, openly licensed and so-on. In our analysis we cluster these datasets according to those most likely to be used as part of an ‘accountability stack’, those most often used in ‘innovation’, and those with a strong impact on ‘social policy’.
  • We look at impact based on asking for narratives of change. This speaks to the question of whether effects of open data are measurable. Right now, that measurement is very difficult. Conceptually, open data can be used to achieve a wide range of impacts, so if we had gone in trying to look for one particular kind of data use – data in participatory budgeting for example – we would have risked missing lots of other potential impacts of open data. We’re still looking to find better methods here – but the approach we took was to ask our expert researchers to look for media mentions of open data having effects in a variety of domains: political, economic, environmental and so-on, and to rate the breadth and depth of impacts cited.
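The aggregation step mentioned above – expert survey scores and secondary indicators combined into sub-indices and an overall index – can be sketched roughly as follows. The countries, raw scores and equal weighting here are all invented for illustration; the actual Barometer methodology defines its own indicators, scaling and weights.

```python
# Hypothetical sketch of a Barometer-style composite index: min-max
# scale each sub-index across countries, then average the scaled
# sub-indices into one overall number per country.
raw = {
    "Countryland": {"readiness": 80, "implementation": 60, "impact": 2},
    "Examplestan": {"readiness": 40, "implementation": 30, "impact": 1},
    "Testopia":    {"readiness": 60, "implementation": 90, "impact": 5},
}

SUB_INDICES = ("readiness", "implementation", "impact")

def scaled(sub_index):
    """Min-max scale one sub-index to a 0-100 range across all countries."""
    values = {c: scores[sub_index] for c, scores in raw.items()}
    lo, hi = min(values.values()), max(values.values())
    return {c: 100 * (v - lo) / (hi - lo) for c, v in values.items()}

scaled_scores = {s: scaled(s) for s in SUB_INDICES}

# Equal-weighted average of the three sub-indices per country.
composite = {
    c: sum(scaled_scores[s][c] for s in SUB_INDICES) / len(SUB_INDICES)
    for c in raw
}

for country, score in sorted(composite.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {score:.1f}")
```

Even this toy version shows why design choices matter: min-max scaling makes every country's score relative to the best and worst performers in the sample, so adding or removing a country can shift everyone else's numbers.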

In this talk I won’t go in-depth into the actual Barometer results, as you can find those at but I’ll briefly mention a few findings:

  • Open data policy has rapidly spread across the globe – over 50% of our sample of countries had an open data policy, many with strong senior government backing.
  • However, open data availability is low – just 71 of the 1,078 datasets we looked at were available as open data, and in general, publication of data that meets all the criteria for open data I set out above was concentrated in a small number of states. Politically contentious datasets such as government spending, land registries and company registries were least likely to be available.
  • Impacts right now are very low – the average of our 10-point impact scale was below 2 for every category, and remained below 3 even when we took out countries with no open data or open data policy. In terms of the kinds of impacts researchers could locate cited – Transparency and Accountability impacts were most common, with impacts on environmental sustainability, and the inclusion of marginalised groups least likely to be cited.
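As a quick sanity check on the headline availability figure above (the total surveyed matches 77 countries × 14 dataset categories, consistent with the survey design described earlier):

```python
# The 77 countries and 14 dataset categories are the figures given in
# the study design; everything else here is arithmetic.
countries = 77
dataset_categories = 14
total_surveyed = countries * dataset_categories  # 1078 datasets

fully_open = 71  # datasets meeting all the open data criteria
share = fully_open / total_surveyed
print(f"{total_surveyed} surveyed; {share:.1%} fully open")  # → 6.6%
```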

In conclusion

Returning to the questions that frame this panel: Has the Internet helped citizens and policymakers improve governance? Are the effects measure-able? What new innovations might be helpful?

It seems fair to state as a basic assumption that information does change governance. On the flight here I was reading work by Elinor Ostrom on governance of the commons, which emphasises the central role of information in governance. Changing how information flows does impact governance – but assessing whether that impact is positive or negative involves normative questions. Right now – the impact of open data on effectively altering the flow of information across society is limited. We see highly-used apps in a limited number of settings like transport, but beyond that we see relatively few datasets that are truly available as open data. When we dig into many of the anecdotes shared about open data impacts, it often turns out that wider contextual variables are much more important in determining outcomes than the particular open properties of the data itself. And yet, policy seems to focus on a replication of a standard model of open data publication as the primary intervention.

Ultimately in taking measurement forward I’d like to suggest we look in two directions. Firstly, we need to drill down thematically, focussing on data in context in particular settings, looking to generate actionable knowledge for practitioners in these sectors that will help them to advocate bottom-up for open data – rather than focussing on over-generalised measurement that seeks to promise generalised open data impacts without understanding differences of subject matter and context. Secondly, we need to explore developing rigorous shared case study methodologies that can enable us to build measurement and research through controlled cross-case comparisons, informing macro-level assessments, but focussing on micro-level effects and theories of change around open data.

This is something we’re hoping to focus on more in the ODDC project in the coming months – so do join us on the network LinkedIn group if you would like to explore this more.

Open Data Barometer: 2013 Global Report launched

Last Thursday the study I’ve spent the last five months working on with the Web Foundation was formally launched in the Open Government Data Working Group session of the Open Government Partnership Summit. The Open Data Barometer takes a look at the context, implementation and emerging impacts of open government data in 77 countries around the world.

Last week’s launch included both an analytical report and quantitative datasets for the secondary indicator and expert survey data collected in the study. I’ll be writing more in the coming weeks here about the process of designing and carrying out the study, and reflecting on how it might evolve and be built upon in future. But for now, here’s a link to where you can download the report and data, findings from the exec summary, and a few charts pulled out from the overall report.

Executive Summary: 2013 Global Open Data Barometer Report

Open data is still in its infancy. Less than five years after the first major Open Government Data (OGD) portal went live, hundreds of national and local governments have established OGD portals, joined by international institutions, NGOs and businesses. All are exploring, in different ways, how opening data can unlock latent value, stimulate innovation and increase transparency and accountability. Against this backdrop of rapid growth of the open data field, this Open Data Barometer global report provides a snapshot of OGD practices at national level. It also outlines a country-by-country ranking. Covering a broad sample of 77 countries, it combines peer-reviewed expert survey data and secondary indicators to look at open data readiness, implementation and emerging impacts. Through this study we find that:

  • OGD policies have seen rapid diffusion over the last five years, reaching over 55% of the countries surveyed in the Barometer. The OGD initiatives launched have taken a range of different forms: from isolated open data portals launched within an e-government framework, through to ambitious government-wide OGD implementations.
  • But – there is still a long way to go: Although OGD policies have spread fast, the availability of truly open data remains low, with less than 7% of the datasets surveyed in the Barometer published both in bulk machine-readable forms, and under open licenses. This makes it unnecessarily difficult for users to access, process and work with government data, and potential entrepreneurs face significant legal uncertainty over their rights to build businesses on top of government datasets.
  • Leading countries in the ODB are investing in the creation of ‘National Data Infrastructures’ to provide a foundation for public and private innovation and efficiency. They have high-level and broad-based political backing for the OGD initiatives, and are investing in capacity building with entrepreneurs and intermediaries. They are also focussing on building communities around open data, convening government officials and outside stakeholders to understand more clearly how data can be harnessed for economic and social progress. However, no country can yet claim to be fully ‘open by default’, and embedding OGD practices across government is a key future challenge.
  • Mid-ranking countries have put in place some of the components of an OGD initiative, such as an open data portal and competitions or events to catalyse re-use of data, but have often failed to make key datasets available, and are lacking in important foundations for effective open data re-use. Absence of strong Right to Information laws may prevent citizens from using open data to hold government to account, and weak or absent Data Protection Laws may undermine citizen confidence in OGD initiatives. In addition, limited training and support for intermediaries may mean data cannot be mobilised to generate economic and social benefits.
  • Low-ranking countries have not yet started to engage with Open Data, and many developing countries lack basic foundations such as well-managed and digitised government datasets. In these countries, interventions to support OGD may look radically different from the leading OGD initiatives surveyed in the Barometer – with opportunities for open data approaches to be used to generate, as well as use, public information.
  • The Barometer ranks the UK as the most advanced country for open data readiness, implementation and impact, scoring above the USA (2nd), Sweden (3rd), New Zealand (4th), Denmark and Norway (joint 5th). The leading developing country is Kenya (21st), ranking higher than rich countries such as Ireland (29th) and Belgium (31st). However, no country can yet claim to be fully ‘open by default’.

Furthermore, in offering the first global snapshot covering both OGD policy and practice, the Barometer highlights:

  • Different countries and regions face different challenges in pursuing OGD – including the need to build government data collection and management capacity; the need to support and equip innovators and intermediaries to use data; and the need to secure civil society freedoms that will enable the use of open data for effective transparency and accountability. There is no one-size-fits-all approach to OGD.

  • Key datasets such as Land Registries and Company Registries are least likely to be available as open data[1], suggesting that OGD initiatives are not yet securing the release of politically important datasets that can be vital to holding governments and companies accountable.

  • In most countries, key datasets for entrepreneurship and improving policy are not available as open data, and when published are in non-standard formats. For example, even in the case of public transport, where data standards are well established, just 25% of countries surveyed have machine-readable data available. Mapping data is also often unavailable in digital forms, or only available for a fee, suggesting that inefficient charging for public data continues to be an issue in many countries.

  • Categories of data managed by statistical authorities are the most likely to be accessible online, but are often only released in very aggregated forms and with unclear or restrictive licenses. Adding a focus on open data to statistical agency capacity building may assist in making key datasets available as bulk, machine-readable open data, contributing positively to the ‘data revolution’ (UN, 2013).

  • Strong evidence on the impacts of OGD is almost universally lacking. Few OGD programmes have yet been evaluated, and the majority of discussion of impacts remains based on anecdote. The Barometer asked about six kinds of OGD impact (government efficiency, transparency and accountability, environmental sustainability, inclusion of marginalised groups, economic growth, and supporting entrepreneurs). In countries with some form of OGD policy (n = 43), no examples of impact could be found for 45% of the impact questions, and on average evidence of impact was scored at just 1.7 out of 10. Scores were particularly low for inclusion and environmental impacts of OGD, suggesting an area in need of further focus.

It remains very early days in the development of OGD practices. The World Wide Web has now been with us for almost 25 years, and, even so, many governments, businesses and civil society groups are still in the early stages of learning how to harness its potential. The open data vision is a bold one: but one that will take considerable work to make a reality. It cannot just be a case of ad-hoc dataset publication, but needs attention paid to legal, social, economic, technical, organisational and political dimensions of open data publication and re-use. This year’s Open Data Barometer provides a baseline for tracking how we collectively progress in the open data arena in years to come.

Web Observatories: The Governance Dimensions

Governance & Sustainability for a Web Observatory

I’m in a workshop at MIT today about plans to create a ‘Web Observatory’, collecting and curating vast quantities of data from across the web for research – in part to ensure that researchers can keep pace in their capacity to research the web with the companies and entrepreneurs who are already gathering terabytes of ‘traces’ of online behaviours in proprietary platforms. A lot of the discussion so far has looked at datasets for research gathered from platforms such as Twitter, curated data from platforms like Open Street Map, or collected in research projects focussed on sensor networks and ‘humans as sensors’. However, the vision of the Web Observatory is not just about providing a catalogue of data for secondary research, but also about providing methods and tools that enable researchers to “locate, analyse, compare and interpret useful information in a consistent and reliable way … rather than drowning in a sea of data”.

As Wendy Hall noted in opening remarks, whilst the Web Observatory work begins with emphasis on academic researchers as the users of data, in the long run, Observatories could (or should) be accessible to individuals also. The growing imbalance of power created between citizens and companies through the privileged access that corporations have to information on our collective social lives is set to become an increasingly pressing social and political issue.

Now, there are clearly big technical challenges ahead in building the Web Observatory project and the many federated Web Observatories that will result, but in this post I want to briefly explore one of the organisational ones: getting the governance and sustainability of Web Observatories right.

Lessons from linked data: sustainability

If you’ve ever spent time exploring Linked Data projects you will have likely stumbled across a lot of abandoned datasets. One-off conversions of open data, or data generated through now-defunct research projects. The Web of Linked Data is far too often a web of broken links – as the funding for research projects runs out and links go dark.

The Linked Data Around the Clock programme (on a website that’s now offline) had a slide that captures the coordination dilemma at the heart of creating and sustaining good Linked Data: the value of (linked and/or open) data accrues to a range of parties, and involves input from a range of parties. When projects are sustained through short-term grant funding, which covers all the work to create, curate and make accessible a dataset, then that data is sustainable only so long as the funding continues. It could be argued that when data is open, this is not so big a problem – as someone can simply take a copy of the data and, if the original source goes dark, bring up an alternative host for the data. But in practice, with Web Observatory datasets we’re talking big data, where simply storing the datasets can require hundreds of terabytes of storage; and datasets which cannot be entirely open due to privacy concerns or Terms of Service of the source data. The data also tends to be shaped primarily by the needs of the funding project that creates it, not by the needs of the projects that want to re-use the data. Although linked data promises distributed annotation and enhancement of data, in practice to query data it needs to be aggregated together in one place – and it’s more efficient to pool resources to enhance and maintain one data store, than to try and copy, convert and enhance multiple copies of big datasets.

So: if learning from Linked Data is anything to go by, the Web Observatory needs to be thinking critically from the start about how key datasets will be sustained, and how collaboration on enhancing data will be facilitated – recognising that there is a non-zero net cost (lots of near-zero marginal costs add up quickly in big data…) to enhancing and adding data to someone else’s data store.

Ethics issues: empowering access

Many of the datasets that might be contained in Web Observatories will raise significant privacy concerns. It might be tempting to manage these by simply deferring responsibility for judging what use can be made of the data to Institutional Review Boards and ethics committees at different participating academic institutions. But if the Web Observatory programme is to be open to partners beyond academia, then ethics processes need to be placed at the heart of the Observatory governance structures, rather than managed around the edges.

A proposal: exploring co-operative ownership and governance

There are, I think, three broad governance models available:

  • Observatories hosted and held in trust by institutions: institutions, primarily academic, use fixed-term project funds to set up Web Observatories. They let other people use these so long as their funding allows, and prioritise those requests to enhance, extract or work with the data that fit with their own research goals. At the end of the project funding, Observatories either die, or end up maintained through residual or other funds.
  • Independent foundations: the model used by large web public goods like and Wikipedia – establishing independent legal entities that maintain an Observatory. This has the value of helping Observatories out-live the projects that start them – but makes Observatories dependent upon finding their own funding, and creates an extra organisational entity over and above the partners with an interest in the data, which either ends up with its own agenda and organisational imperatives, or which leaves a collective action problem, with each of the partners waiting for the others to provide the funding to keep the lights on.
  • Data co-operatives: building on discussions convened last year in Manchester, there may be a new organisational structure the Web Observatories can build upon – that of the data cooperative. In a data cooperative, a light-weight separate entity is established, but one which is constituted and jointly owned by the researchers and research institutions with a stake in the data. Cooperatives can establish rules about the resources that members should bring to the co-op, and what control they can expect over the design and maintenance of the Observatory, and can provide procedures for easy entry and exit from the co-op. In Manchester we discussed the potential for hybrid ‘workers/suppliers’ and ‘users/consumers’ co-operatives, that could give both the creators of data, and the researchers using the data, an appropriate stake in it. Tying access to data with privacy or ethical issues to co-operative membership could also anchor ethics procedures in the governance structure.

Whilst the least developed, this third option is, I think, the one that holds most promise.

I don’t know yet if the Web Observatory programme will have an organisational research strand – but I hope so…

CfP: Open Data Track – 2014 Conference for E-Democracy and Open Government

I’m co-chairing a track on ‘Open Data, Transparency and Open Innovation’ at the next CeDEM Conference for E-Democracy and Open Government, taking place at Danube University Krems in May next year.

The full call for papers and submission details can be found here, and the details of the Open Data Track are below:

Open Data, Transparency and Open Innovation

Chairs: Johann Höchtl (Danube University Krems, AT), Ina Schieferdecker (Fraunhofer, DE), Tim Davies (University of Southampton, UK)

Open data can provide a platform for many forms of democratic engagement: from enabling citizen scrutiny of governments, to supporting the co-production of public data and services, to the emergence of innovative solutions to shared problems. This track will explore the opportunities and challenges for open data production, quality assurance, supply and use across different levels of governance. Key themes include:

  • Open data policy and politics: opportunities and challenges for governments; the global spread of open data policy; transparency and accountability, economic innovation, drivers for open data; benefits and challenges for developing countries.
  • Licensing and legal issues: copyright vs. open licenses & creative commons; Freedom of Information and the ‘right to data’; information sharing and privacy.
  • Open data technologies: technical frameworks for data and meta-data; mash-ups; data formats, standards and APIs; integration into backend systems; data visualisation; data end-users and intermediaries;
  • Open innovation and co-production: open data enabled models of public service provision; government as a platform; making open data innovation sustainable; data and democracy; connecting open data and crowdsourcing; data and information literacy;
  • Evidence and impacts: costs and benefits of providing or using open data; emerging good practices; methods for open data research; empirical data measuring open data impacts

Submissions are due by 6th January 2014.

Reflections on developing a global sectoral open data initiative: agriculture and nutrition

At the 2012 G8 Summit, leaders committed to a ‘New Alliance for Food Security and Nutrition’, and as part of the follow-up to the US G8 presidency, in April this year the World Bank hosted the ‘G8 Conference on Open Data for Agriculture’, exploring opportunities to create a global platform for sharing agriculturally relevant information. Initially driven by the UK and US, this initiative has developed into the ‘Global Open Data Initiative for Agriculture and Nutrition’, currently preparing for a launch at the October 2013 Open Government Partnership summit in London. As the open data concept continues to gain traction at a policy level, such sectoral open data initiatives are increasingly common, and they raise a wide range of questions. This post attempts to unpack some of those questions for the proposed Agriculture and Nutrition initiative.

Sector vs. supplier-centric open data

Early open data initiatives, such as the Open Government Data initiatives of the USA, UK and Kenya, have been supplier-centric. They are essentially based on the idea that a single data holder (or, in practice, an amalgamation of different departmental data holders, all from the same overall organisation) supplies the data it holds online as open datasets. An open data portal often provides a focal point for this activity.

By contrast, sectoral open data initiatives draw on data from a wide range of suppliers. Some, such as the International Aid Transparency Initiative (IATI), are primarily interested in a single flow of data (in IATI’s case, standardised datasets of aid-funded activities), while others, such as the renewable-energy-focussed Reegle project, look to aggregate and integrate a range of different open datasets with a single sectoral focus*.

An Open Data Initiative for Agriculture and Nutrition may have both supplier-focussed and sectorally-focussed elements. Some of the high-profile holders of data in the agriculture sector, such as the World Bank and the Food and Agriculture Organisation, already have their own open data initiatives. However, there are many more actors who might be suppliers of data when it comes to agriculture and nutrition. It is worth noting that existing sectoral open data initiatives such as IATI and the Reegle project are relatively limited in their scope and reach, and rely on a certain degree of centralisation (the IATI Registry and Standard in the former case, and a central data store for Reegle in the latter). An Open Data Initiative for Agriculture and Nutrition therefore potentially represents a new level of ambition and complexity, requiring more decentralised approaches to securing a wider range of relevant open data.

(*I’ve not looked here at sectoral ‘open data’ initiatives from the sciences as I’m less familiar with these. However, emerging collaborations around genomics research, for example, which seek to pool data into a resource for answering a range of shared research questions may also be relevant points-of-reference in thinking about the shape of an agriculture and nutrition open data initiative).

Why open data?

A recognition of the need for better information and data sharing in agriculture and nutrition is nothing new. For over 100 years, organisations like CABI have been producing abstract journals to transmit agricultural research more effectively to the locations on the ground where it is needed, and the agriculture and nutrition field has a well-developed network of research institutions, agricultural extension services and initiatives to harmonise information, ranging from the long-established AGROVOC vocabulary through to the more recent CIARD ‘Coherence in Information for Agricultural Research for Development’ movement, which brings together over 50 organisations collaborating “to make agricultural research information and knowledge publicly accessible to all.”

However, an initiative for open data does have some significant differences of emphasis from one focussing on making information and knowledge publicly accessible. Open data initiatives place explicit emphasis on data over information; upon making that data machine-readable in standard formats; and requiring the use of open licenses that allow the data to be re-used by anyone. A number of arguments might be put forward to justify this specific emphasis:

  • Open data principles lead to lower transaction costs for finding and accessing data – and give re-users certainty that they can work with the data. For example, existing important datasets like AGROVOC do not use an open license, and can only be accessed in bulk behind a registration, meaning that users wanting to apply AGROVOC classifications in a dataset that also contains commercial data would be prevented from doing so without negotiating with the FAO. Applying open data principles could increase the use and coherence of data.

  • Machine-readable data supports efficient and innovative re-use of data. With access to data in standard formats, users can remix, re-interpret and re-present the data, offering alternative interpretations and generating new insights that may not have been contained in shared informational publications. Open data principles also aim to support the easy combination of multiple datasets to support the identification of new trends and patterns across different datasets.

  • Open data allows new actors to get involved in addressing agriculture and nutrition challenges. Unlike data-sharing initiatives, which often work to ensure that an identifiable list of actors have access to data and information, open data is, to a degree, about allowing as-yet unknown parties to access and innovate with data. This has the potential to bring new researchers, entrepreneurs and policy actors into the process of providing solutions to key challenges, enabling more open forms of innovation.

  • Existing open data initiatives could do more for agriculture. With many governments already publishing open data, an agriculture and nutrition open data initiative can harness this momentum to secure new datasets that would not be provided through existing initiatives – and can work to make sure multi-purpose datasets, such as cadastral data, land ownership records, weather data and other resources, are provided in ways that support agriculture and nutrition activities.

I won’t assess the validity of each of these arguments here: that is a matter for empirical research – but they do highlight the kinds of areas that classic open data initiatives may focus on, and allow an assessment of how far an open data initiative might be complementary to, or connect with, other existing activities.

The scope of a global initiative

Agriculture and nutrition is a vast field, and the issues on the agenda vary widely across the world – from securing crop production and nutritional standards in developing countries, to ensuring trustworthy supply chains in Europe, and from planning for food security, to giving consumers information to choose organic or fairly traded products. A Global Open Data Initiative on Agriculture and Nutrition emerging from the Africa-focussed New Alliance for Food Security and Nutrition could choose to look only at issues of basic food security, but this might be a missed opportunity to consider the role open data can play across a wide range of agriculture and nutrition issues. For example, catalysing activity around open data on food supply chains could be driven by, and have benefits for, both food security and consumer confidence in food.

An initiative also needs to consider whether its scope is primarily public sector data and data held by research organisations, or whether it will also look at the vast quantities of privately held data on agriculture and nutrition. Many governments already use targeted transparency measures to require food producers to generate and publish nutritional information on their products, suggesting that further steps to require private sector publication of open data on various agriculture and nutrition issues might not be out of the question.

Self-selected commitments, or a shared agenda?

The International Aid Transparency Initiative sets out a clear standard for data that all signatories to the initiative should work towards publishing. As more signatories publish data to the common standard, network effects kick in, making the data more and more useful. By contrast, the Open Government Partnership invites countries to sign up to some broad principles and then to self-select what they will do, and (if choosing to focus on open data at all) the particular datasets they might release: with the areas of focus driven by domestic engagement and pressure. Somewhere between the two, the G8 Open Data Charter includes a list of core datasets that all signatories should work to publish, and then invites self-selected commitments from a long-list of suggested datasets.

A Global Open Data Initiative on Agriculture and Nutrition could identify a shared agenda based around a small number of datasets and issues, or could be driven by general principles, with members self-selecting their areas of focus. There are pros and cons to each approach, but they potentially lead to initiatives of very different characters, and consequences for the way in which different stakeholders might get involved.

What needs to go into an open data initiative?

I’ve written in the past about the ten building blocks of an open data initiative, highlighting that open data initiatives need more than datasets – they also require explicit effort on outreach and engagement, and on capacity building to enable wider use of the data that is made available. When it comes to agriculture and nutrition there is a wide range of actors who might need to be involved in these wider activities – from the infomediaries who translate research and data into actionable information for farmers and traders, through to the government planners or civil society activists seeking to improve the equitable and fair management of natural resources.

The ten building blocks of open data listed below can take many different forms, and operate at different levels of scale – but an initiative that focusses only on one or two of these building blocks to the exclusion of the others is unlikely to realise the potential impacts of open data.

  1. Leadership and bureaucratic support

  2. Datasets

  3. Licences

  4. Data standards

  5. Data portals

  6. Interpretations, interfaces and applications

  7. Outreach and engagement

  8. Capacity building

  9. Feedback loops

  10. Policy and legislative lock-in

Evidence and impact

There has been an interesting dialogue recently in the Open Data Innovations LinkedIn group about “How to monitor progress of open data”. The discussion has highlighted that, as the use to which data will be put is generally left ‘open’, coming up with concrete evaluation frameworks for measuring whether open data has had the desired impact can be challenging. This is, of course, one of the big issues we’re grappling with in the Open Data in Developing Countries project – currently focussing on qualitative case studies to understand how open data interacts with existing processes of governance on the ground.

However, it is not inconsistent to set a series of primary goals for the greater sharing of data, against which an intervention can be measured, while also developing frameworks for monitoring the secondary impacts that result from leaving data open to re-use. Such frameworks must, however, also be able to capture unintended consequences of open data re-use – noting that not all results will necessarily be positive, particularly in contexts where so much is at stake, as in the agricultural domain, where the interests of communities, agri-business, governments, environmentalists and others are not always aligned.

The ODDC Conceptual Framework seeks to outline some possible directions for such a framework, as a foundation to be further revised as our 2013/14 case studies start to report later this year.

Many more questions…

The formation of larger scale sectoral open data initiatives is an emerging phenomenon, and something that will need continued practical and research attention. From a research perspective, it will be fascinating to see how plans for the Global Open Data Initiative on Agriculture and Nutrition evolve.

(Disclosure: in my role at the Web Foundation I’ve been involved in some discussions with the convening team for the Global Open Data Initiative on Agriculture and Nutrition. This post is offered as an open reflection and an input to their ongoing dialogue, as well as a wider reflection on sectoral open data initiatives.)


Open data and privacy

Cross-posted from the Open Data Research network site.

On 1st August two IDRC research networks came together for a web meeting to explore Open Data and Privacy. The Privacy in the Developing World network, and the Open Data in Developing Countries network set out to explore whether open data and protecting privacy are inherently in tension, or whether the two can be complementary, and to identify particular issues that might come up around privacy and open data in the developing world. This post shares and develops some of the themes discussed in the meeting.


Open data is generally defined as data made accessible, in formats that can be manipulated by computers (allowing the creation of new interfaces, mash-ups and other data analysis), and without restrictions on how the data can be re-used. In essence, open data asks those who hold data (usually governments) to give up formal control over how it is used, with the idea that this allows greater scrutiny of governments, and unlocks potential for innovation with the data.

Privacy, by contrast, is concerned with control over information: who can access it, and how it is used. As Daniel Solove notes[1], this has many dimensions, from concerns about intrusive information collection, through to the risks of exposure, increased insecurity or interference in decision-making that individuals or communities are subjected to when their ‘private’ information is widely known. Privacy is generally linked to individuals, families or community groups, and is a concept that is often used to demarcate a line between a ‘private’ and a ‘public’ sphere. Article 12 of the Universal Declaration of Human Rights states “No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation”. It has been argued that privacy is a western concept, only relevant to industrialised societies – yet work by Privacy International has found privacy concerns to be widespread across developing countries, and legal systems across the world tend to recognise privacy as a concern, even if the depth of legal rights to privacy and their enforcement varies. It is worth noting, though, that few of the countries covered by the ODDC project have strong privacy protection laws in place.

Different kinds of data

One of the starting points of discussion around open data and privacy is to work out which kinds of ‘data’ might fall within the focus of each. In the context of open government data, we might think about three broad categories:

  • Infrastructural data – data held about the state of the world – for example, describing the land, transport networks, structures of government, weather measurements and so on. There are very few privacy concerns about this data (though in some states security concerns may restrict the extent to which it is shared, such as geographic border and water flow data for the rivers of Northern India).
  • Public service data – data about the activities of government – ranging from the locations of public services and their budgets, through to public registers and detailed performance statistics on schools, hospitals and other facilities. This last set can fall into a grey area, as such statistics are often built up from the aggregation of records about individual users of public services, and it is not always clear who they are about. For example, is the record of an operation and its outcome data about the patient, or about the doctor?
  • Personal data – data about individuals, and usually things that an individual would have a legitimate right to manage access to – such as information on their sexuality or their health.

In the Web Meeting, Sam Smith noted that the framers of the ‘Open Definition’, taken as a basis for much open data advocacy, were focussed specifically on non-personal data, and that open data advocates tend to make clear that they are not talking about information that could identify private information about individuals. However, as the categories above show, the dividing line between public and personal data is not always clear.

Kinds of data - infrastructure; public service; personal

This classification does make clear however that there are some kinds of data (the infrastructural data) where applying open data should be, from the privacy perspective at least, uncontroversial. The relative importance of data in the middle category to the kinds of outcomes sought from open data policy interventions then becomes an important question to ask.

It is worth noting that because of the political popularity of open data policy, there has been a tendency for other policies relating to data to be presented under an open data banner in some countries. For example, policies on the restricted sharing of medical records with pharmaceutical companies (through secure data sharing rather than as open data) were included in UK open data measures in 2011. These policies clearly need to be considered distinctly from open data policies, and their implications also weighed carefully.

Opening data disrupts past privacy practice

Steve Song offered an input into the web meeting focussed on the online publication of a dataset and mash-up map showing the locations of registered gun owners in the wake of a school shooting in Connecticut. The register of gun ownership had long been a public document, but it had taken the form of documents that could be inspected, rather than a dataset. The conversion of this public register into open data that could be easily mapped created a strong backlash: law enforcement officials worried that their addresses had been revealed online, and those with and without guns expressed concerns that the information could be used by burglars to target particular houses. The accuracy of the record was also questioned, and it was suggested that much of the information was misleading or wrong.

This case illustrated how turning existing ‘public records’ into open data can upset balances around privacy that were previously maintained by the practical difficulty of accessing those records. Previously the ‘data’ had been hidden in plain view: available, but with no-one encouraged to use it in ways that might give rise to concerns. Thought may be needed, then, not only when things previously secret are made public, but also when public records are turned into more easily manipulated and processed open data. Steve noted that this may be particularly important in contexts of ethnic or communal tension: imagine, for example, how voter registers might be used as data where ethnicity can be inferred from a voter’s name, and where an election is contested on ethnic lines.

In the United Kingdom, the recent Shakespeare Review of Public Sector Information[2] has proposed shifting the legal responsibility for misuse of data from the person who publishes the data onto the person who abuses it – suggesting a model in which privacy laws would control (ab)use of, rather than access to, data. However, such a model is tricky to envisage in a world where data can cross borders easily, there is little harmonisation of privacy laws, and harms from privacy violations can also cross borders.

Privacy as an excuse? Open as a general principle?

One of the key concerns raised in the meeting was that if arguments for open data are applied as a ‘general rule’ without sensitivity to the kinds of data in question, there are significant risks that privacy rights might be undermined. Yet, transparency and open data advocates are often concerned that ‘data protection’, or ‘protecting privacy’ might be used as excuses not to release data, or to only release data in aggregated forms that don’t permit detailed analysis of what government is doing. Neither can necessarily be used as a principle that trumps the other.

In his review of open data and privacy for the UK Government, Kieron O’Hara noted[3] that, even within open data advocacy, different groups have different requirements for what counts as good quality open data for their purposes (§2.1.4). For example, transparency campaigners may be happy with crime data covering the general geographic areas that a particular official is responsible for, whereas entrepreneurial re-users of data might want data down to the individual street and house level, to feed into risk models for insurers or to use in route-planning applications.

In our web meeting Steve Song suggested that by developing a clearer picture of the kinds of impacts open data can have, and the ways in which it might be used (a central theme in ODDC explorations), we will be better able to have an informed debate about the trade-offs between privacy and open data. This again moves away from the simple rhetorical messages of ‘open everything’ and ‘raw data now’ that many open data advocates have pushed – and suggests that deeper debate will be needed over the sharing of datasets that fall into the grey areas between public and personal. Such a debate will need to engage with questions of whether open data is being used to support public goods or private gains, and with nationally and culturally specific judgements about how to manage trade-offs between public good and personal or community privacy. For example, in some countries personal tax records are considered public and are published, yet in others these are judged to be private data.

The question of corporate confidentiality was also raised in the web meeting discussions. Although corporate confidentiality is conceptually distinct from privacy, it is another principle that might sometimes be found to be in tension with a drive towards open data, and can become the grounds of excuses for not releasing data. Distinguishing when privacy or corporate confidentiality are being used as excuses for not releasing data, or when they are based in serious and valid concerns, will be important for open data advocacy.

In practice, it wasn’t clear from web meeting participants’ experience whether privacy is actually being used as grounds for restricting access to data in developing countries, or whether privacy is being adequately considered in decisions about opening data. This will be a key issue to track in future research, to better understand how potential tensions between open data and privacy are playing out in practice.

Open data, privacy and power

At the Asia regional meeting of the ODDC project, one participant noted the curious overlap between participants in the Data Meet community (often involved in pushing for open data), and those organising ‘Crypto Parties’, teaching each other about privacy protection software. How have these individuals reconciled campaigning for both open data and privacy? If they are pushing for a balance between the two, how is such a balance to be struck? One possible way to understand the compatibility of pushing for both privacy and open data is through the lens of power and autonomy. Activists may be interested in seeking maximum autonomy from the state through protecting their privacy, and maximum control over the state through the ability to see what the state is doing via open data, and to work with state-collected data. Such a political position might be associated with the libertarianism of some open source geek cultures, but may also have different roots and political slants around the world.

The power-based analysis might also help in determining which kinds of entrepreneurial uses of open data are desirable. Cases where entrepreneurs act as intermediaries in ways that enhance the autonomy of citizens (for example, providing public transport planning applications that help citizens move more freely through space, or informational applications that help citizens collaborate to co-create or claim access to public services) may be seen as positive, whereas commercial open data re-use that leads to interference in individuals’ decisions through targeted advertising, or that drives discriminatory pricing of services and insurance, might be seen as having a negative impact on individual autonomy (although the negative effect may only be felt by some segments of the population, such as minority or marginalised groups). The question remains, however, of how such potentially negative uses of open data should be governed, particularly in developing world contexts where legal frameworks vary widely. Serious abuses of open data (whether to incite communal tensions, or to harm individuals through discriminatory pricing) could be outlawed – but where they have not been, what should those releasing data consider?


By the end of the web meeting we had opened up many more issues than we had resolved, but we had established that there can be a productive dialogue between privacy and open data, and that more work is needed to explore how the two concepts are unfolding together in both the developed and developing worlds.

If you would like to join the debate over privacy and open data, there’s a thread over on the Open Data Research network LinkedIn group.


[1] Solove, D. J. (2005). A Taxonomy of Privacy. University of Pennsylvania Law Review, 154(3), 477.

[2] Shakespeare, S. (2013). Shakespeare Review: An independent review of public sector information. London.

[3] O’Hara, K. (2011). Transparent government, not transparent citizens: a report on privacy and transparency for the Cabinet Office.

Open Data in Developing Countries

The focus of my work is currently on the Exploring the Emerging Impacts of Open Data in Developing Countries (ODDC) project with the Web Foundation.
