Open data: embracing the tough questions

Two open data related publications I’ve been working on have made it to the web in the last few days. Having spent a lot of the last few years working to support organisations to explore the possibilities of open data, these feel like they represent a more critical strand of exploring OGD, trying to embrace and engage with, rather than to avoid the tough questions. I’m hoping, however, they both offer something to the ongoing and unfolding debate about how to use open data in the interests of positive social change.

Special Issue of JoCI on Open Government Data
The first is a Special Issue of the Journal of Community Informatics on Open Government Data (OGD) bringing together four new papers, five field notes, and two editorials that critically explore how Open Government Data policies and practices are playing out across the world. All the papers and notes draw upon empirical study and grassroots experiences in order to explore key challenges of, and challenges to, OGD.

Nitya Raman’s note on “Collecting data in Chennai City and the limits of Openness” and Tom Demeyer’s account of putting together an application competition in Amsterdam explore some of the challenges of accessing and opening up government datasets in very different contexts, highlighting the complex realities involved in securing ongoing access to reliable government data. Papers from Sharadini Rath (on using government data to influence local planning in India), and Fiorella De Cindo (on designing deliberative digital spaces), explore the challenges of taking open data into civic discussions and policy making – recognising the role that platforms, politics and social dynamics play in enabling, and putting the brakes on, open data as a tool to drive change. A field note from Wolfgang Both and a point of view note from Rolie Cole on “The practice of open data as opposed to its promise” highlight that any OGD initiative involves choices about the data to prioritise, and the compromises to make between competing agendas when it comes to opening data. Shashank Srinivasan’s note on Mapping the Tso Kar basin in Ladakh, using GIS systems to represent the Changpa tribal people’s interaction with the land, also draws attention to the key role that technical systems and architectures play in making certain information visible, and the need to look for the data that is missing from official records.

Unlike many reports and white papers on OGD out there, which focus solely on potential positive benefits, a number of the papers in the issue also take the important step of looking at the potential for OGD to cause harm, or for OGD agendas to be co-opted against the interests of citizens and communities. Bhuvaneswari Raman’s paper “The Rhetoric of Transparency and its Reality: Transparent Territories, Opaque Power and Empowerment” puts power front and centre of an analysis of how the impacts of open data may play out, and Jo Bates’s paper “‘This is what modern deregulation looks like’: co-optation and contestation in the shaping of the UK’s Open Government Data Initiative” questions whether UK open data policy has become a fig-leaf for marketisation of public services and neoliberal reforms in the state.

These challenges to open government data, questioning whether OGD does (or even can?) deliver on promises to promote democratic engagement and citizen empowerment are, well, challenging. Advocates of OGD may initially want to ignore these critical cases, or to jump straight to sketching ‘patches’ and pragmatic fixes that route around these challenges. However, I suspect the positive potential of OGD will be closer when we more deeply engage with these critiques, and when in the advocacy and architecture of OGD we find ways to embrace tough questions of power and local context.

(Zainab and I have tried to provide a longer summary weaving together some of these issues in our editorial essay here, although we see this very much as the start, rather than end-point, of an exploration…)

More to come: I’ve been working on the journal issue for just over a year with my co-editor Zainab Bawa, and at the invitation of Michael Gurstein, who has also been fantastically supportive of us publishing this as a ‘rolling issue’. That means we’re going to be adding to the issue over the coming months, and this is just the first batch of papers available to start feeding into discussions and debates now, particularly ahead of the Open Government Partnership meeting in Brasilia next week where IDRC, Berkman Centre and the World Wide Web Foundation are hosting a discussion to develop future research agendas on the impacts of Open Government Data.

ICT for or against development? Exploring linked and open data in development

The second publication is a report I worked on last year with Mike Powel and Keisha Taylor for the IKM Emergent programme, under the title: “ICT for or against development? An introduction to the ongoing case of Web 3.0” (PDF). The paper asks whether the International Development sector has historically adopted ICT innovations in ways that empower the subjects of development and deliver sustainable improvements for those whose lives “are blighted by poverty, ill-health, insecurity and lack of opportunity”, and looks at where the opportunities and challenges might lie in the adoption of open and linked data technologies in the development sector. It’s online as a PDF here, and summaries are available in English, Spanish and French.

Untangling the open data debate: definitions and implications

I’m exploring creating a series of short notes based on my current PhD research into open data as tools to support wider dialogue around data policy and practice. Here’s a draft of the first one, trying to set out some clear categories for understanding debates over data. It’s also available as a two-page PDF here.

Untangling the data debate: definitions and implications

Data is a hot topic right now: from big data, to open data and linked data, entrepreneurs and policy makers are making big claims about ‘data revolutions’. But, not all ‘data’ are the same, and good decision making about data involves knowing the differences.

Big data

Definition: Data that requires ‘massive’ computing power to process (Crawford & Boyd, 2011).

Massive computing power, originally only available on supercomputers, is increasingly available on desktop computers or via low cost cloud computing.

Implications: Companies and researchers can ‘data mine’ vast data resources, to identify trends and patterns. Big data is often generated by combining different datasets.

Digital traces from individuals and companies are increasingly captured and stored for their potential value as ‘big data’.

Raw data

Definition: Primary data, as collected or measured direct from the source. Or: data in a form that allows it to be easily manipulated, sorted, filtered and remixed.

Implications: Access to raw data can allow journalists, researchers and citizens to ‘fact check’ official analysis. Programmers are interested in building innovative services with raw data.

Real-time data

Definition: Data measured and made accessible with minimal delay. Often accessed over the web as a stream of data through APIs (Application Programming Interfaces).

Implications: Real-time data supports rapid identification of trends. Data can support the development of ‘early warning systems’ (e.g. Google Flu Trends; Ushahidi). ‘Smart systems’ and ‘smart cities’ can be configured to respond to real-time data and adapt to changing circumstances.

Open data

Definition: Datasets that are made accessible in non-proprietary formats under licenses that permit unrestricted re-use (OKF – Open Knowledge Foundation, 2006). Open government data involves governments providing many of their datasets online in this way.

Implications: Third-parties can innovate with open data, generating social and economic benefits. Citizens and advocacy groups can use open government data to hold state institutions to account. Data can be shared between institutions with less friction.

Personal/private data

Definition: Data about an individual that they have a right to control access to. Such data might be gathered by companies, governments or other third-parties in order to provide a service to someone, or as part of regulatory and law-enforcement activities.

Implications: Many big and raw datasets are based on aggregating personal data, and combining them with other data. Effective anonymisation of personal data is difficult particularly when open data provides the pieces for ‘jigsaw identification’ of private facts about people (Ohm, 2009).
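
To make ‘jigsaw identification’ concrete, here is a minimal sketch in Python. Every record, field name and value below is invented for illustration, and the matching rule (an exact match on postcode, birth year and sex) is an assumption chosen for simplicity rather than a method taken from Ohm (2009). It shows how a dataset with names removed can still be re-linked to individuals when a second, openly available dataset shares the same indirect identifiers.

```python
# Toy illustration of 'jigsaw identification'. All records are invented.

# An 'anonymised' dataset: names removed, but indirect identifiers kept.
anonymised_health = [
    {"postcode": "AB1 2CD", "birth_year": 1975, "sex": "F", "condition": "asthma"},
    {"postcode": "AB1 2CD", "birth_year": 1982, "sex": "M", "condition": "diabetes"},
    {"postcode": "EF3 4GH", "birth_year": 1975, "sex": "F", "condition": "none"},
]

# A separate, openly published dataset that does include names.
public_register = [
    {"name": "Person A", "postcode": "AB1 2CD", "birth_year": 1975, "sex": "F"},
    {"name": "Person B", "postcode": "EF3 4GH", "birth_year": 1990, "sex": "M"},
]

# Fields that are harmless alone but identifying in combination (an assumption).
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def key(record):
    """Build the combined 'jigsaw piece' used to match records."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def jigsaw_matches(anonymised, named):
    """Return (name, record) pairs where the combined identifiers
    single out exactly one record in the anonymised dataset."""
    by_key = {}
    for record in anonymised:
        by_key.setdefault(key(record), []).append(record)
    matches = []
    for person in named:
        candidates = by_key.get(key(person), [])
        if len(candidates) == 1:  # a unique match reveals a private fact
            matches.append((person["name"], candidates[0]))
    return matches


reidentified = jigsaw_matches(anonymised_health, public_register)
for name, record in reidentified:
    print(name, "->", record["condition"])
```

In this toy case one named person is matched to a health condition even though neither dataset, taken alone, links names to conditions.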

Linked data

Definition: Datasets published in the RDF format, using URIs (web addresses) to identify the elements they contain, with links made between datasets (Berners-Lee, 2006; Shadbolt, Hall, & Berners-Lee, 2006).

Implications: A ‘web of linked data’ emerges, supporting ‘smart applications’ (Allemang & Hendler, 2008) that can follow the links between datasets. This provides the foundations for the Semantic Web.
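
As a rough sketch of what ‘following the links between datasets’ means, the Python below models two tiny datasets as lists of (subject, predicate, object) statements, the basic shape of RDF. It uses plain tuples rather than a real RDF library, and every URI, name and figure is a hypothetical example under example.org, not a real dataset.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# All URIs and values are hypothetical examples.
VOCAB = "http://example.org/vocab/"

# Dataset 1: published by one (imaginary) organisation.
SCHOOLS = [
    ("http://example.org/school/1", VOCAB + "name", "Hill Road School"),
    ("http://example.org/school/1", VOCAB + "locatedIn", "http://example.org/area/north"),
]

# Dataset 2: published separately, but using the same URI for the area.
AREAS = [
    ("http://example.org/area/north", VOCAB + "name", "North District"),
    ("http://example.org/area/north", VOCAB + "population", 52000),
]


def describe(uri, *datasets):
    """Collect every predicate/object pair stated about `uri` in any dataset."""
    return {p: o for dataset in datasets for (s, p, o) in dataset if s == uri}


def follow(uri, predicate, *datasets):
    """Follow a link from `uri` and describe whatever it points to."""
    target = describe(uri, *datasets).get(predicate)
    return describe(target, *datasets) if target else {}


# Start from a school and follow its 'locatedIn' link into the second dataset.
area = follow("http://example.org/school/1", VOCAB + "locatedIn", SCHOOLS, AREAS)
print(area)
```

Because both datasets use the same URI for the area, an application can move from the school to facts about its area without either publisher having to merge their data in advance.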

More dimensions of data:

These are just a few different types of data commonly discussed in policy debates. There are many other data-distinctions we could also draw. For example: we can look at whether data was crowd-sourced, statistically sampled, or collected through a census. The content of a dataset also has important influence on the implications that working with that data will have: an operational dataset of performance statistics is very different from a geographical dataset describing the road network for example.

Crossovers and conflicts:

Almost all of the above types of data can be found in combination: you can have big linked raw data; real-time open data; raw personal data; and so-on.

There are some combinations that must be addressed with care. For example, ‘open data’ and ‘personal data’ are two categories that are generally kept apart for good reason: open data involves giving up control over access to a dataset, whilst personal data is the data an individual has the right to control access to.

These can be found in combination on platforms like Twitter, when individuals choose to give wider access to personal information by sharing it in a public space, but this is different from the controller of a dataset of personal data making that whole dataset openly available.

A nuanced debate:

It’s not uncommon to see claims and anecdotes about the impacts of ‘big data’ use in companies like Amazon, Google or Twitter being used to justify publishing ‘open’ and ‘raw data’ from governments, drawing on aggregating ‘personal data’. This sort of treatment glosses over the difference between types of data, the contents of the datasets, and the contexts they are used in. Looking to the potential of data use from different contexts, and looking to transfer learning between sectors can support economic and social innovation, but it also needs critical questions to be asked, such as:

  • What kind of data is this case describing?
  • Does the data I’m dealing with have similar properties?
  • Can the impacts of this data apply to the data I’m dealing with?
  • What other considerations apply to the data I’m dealing with?

Bibliography/further reading:

See http://www.opendataimpacts.net for ongoing work.

Allemang, D., & Hendler, J. A. (2008). Semantic web for the working ontologist: modeling in RDF, RDFS and OWL. Morgan Kaufmann.

Berners-Lee, T. (2006, July). Linked Data – Design Issues. Retrieved from http://www.w3.org/DesignIssues/LinkedData.html

Crawford, K., & Boyd, D. (2011). Six Provocations for Big Data.

Davies, T. (2010). Open data, democracy and public sector reform: A look at open government data use from data.gov.uk. Practical Participation. Retrieved from http://www.practicalparticipation.co.uk/odi/report

OKF – Open Knowledge Foundation. (2006). Open Knowledge Definition. Retrieved March 4, 2010, from http://www.opendefinition.org/

Ohm, P. (2009). Broken promises of privacy: Responding to the surprising failure of anonymization. Retrieved from http://papers.ssrn.com/sol3/Papers.cfm?abstract_id=1450006

Shadbolt, N., Hall, W., & Berners-Lee, T. (2006). The Semantic Web Revisited. IEEE Intelligent Systems, 21(3), 96–101.

Open Data: Construction and Critiques (notes for guest lecture)

Below are the slides, links and further reading for a lecture given 21st Feb as part of the Digital Media Futures course at LSE taking a critical look at the open government data landscape. I’ll add to these after the class if any other links or relevant papers come up.

Class participants: your (critical) feedback is very welcome by way of comments below. This overview of open data is an ongoing work in progress.

Slides

Links:

In more depth/source material to explore:

Berners-Lee, T. (1999). Weaving the Web: The Past, Present and Future of the World Wide Web by its Inventor (p. 255).

Bowker, G. C. (2000). Biodiversity datadiversity. Social Studies of Science, 643–683. Retrieved from http://epl.scu.edu/~gbowker/biodivdatadiv.pdf

Davies, T. (2010). Open data, democracy and public sector reform: A look at open government data use from data.gov.uk. Practical Participation. Available at http://www.practicalparticipation.co.uk/odi/report/

Kundra, V. (2012). Digital Fuel of the 21st Century: Innovation through Open Data and the Network Effect. Public Policy. Cambridge, Mass.

Lathrop, D., & Ruma, L. (2010). Open Government: Collaboration, Transparency, and Participation in Practice. O’Reilly Media.

Mcclean, T. (2011). Not with a Bang but a Whimper: The Politics of Accountability and Open Data in the UK.

Scott, J. C. (1998). Seeing like a state. Yale University Press New Haven, CT.

Open Data is Not Enough – presentation from OGDCamp 2011

The Open Government Data Camp team have just published a page listing many of the presentations and slides from this year’s event that took place in Warsaw in October. Rather surprisingly (I didn’t spot the camera!) the list includes a recording of my talk sharing learning from recent work on the International Aid Transparency Initiative, and from a paper on sustainable re-use of open data under the title ‘Open Data is Not Enough’.

The audio and recording aren’t great, and I was presenting having run to the venue after a rather marathon delayed overnight train ride from Luxembourg, but I think there are some useful bits in here that I’ve not fully written up elsewhere, hence the recording is shared below:

Brief notes on user-centred evaluation of open data portals

This weekend, an interesting e-mail on the Open Spending mailing list from Angelica Peralta Ramos at the Desarrollando Latin American hack day put out a call for ideas around evaluating government data portals. What sort of factors should feature in evaluating an official government statistics site, or an open data portal?
Evaluating and rating a portal is a different matter from assessing a whole Open Government Data Initiative, but as with OGD Initiatives, it’s hard to reduce any analysis to a single dimension. Here are my brief notes on some of the considerations that could go into constructing some user-centred metrics to evaluate a portal:
It could be useful to start an assessment from asking ‘Does this provide users with all they need to use the data to do X?’ where X could be ‘directly find the fact they wanted’, or ‘visualise the data in their own way’, or ‘support more efficient and effective work in the public sector’ or ‘innovate and build on top of the data’.
Each of those has slightly different criteria for what makes a good data site (or for what is important to them), and a good site should either be clear about who it is serving, or should make a reasonable attempt to serve them all.
For example:
- To directly find the fact you want – a PDF or Excel file may be better (shock horror!) than a CSV or other data dump if it provides a layout that helps a human find facts without having to fire up an analysis tool
- To visualise the data – not only the format of the data matters, but also its structure – and that might involve looking inside a sample of datasets. Excel files with headings all over the place and non-standard coding of fields can be just as tricky to deal with as a well-structured table in a Word file, for example.
- To support more efficient/effective work with the data – it might be important to have information about its provenance alongside it, or to have contact details attached to a dataset so that you can get in touch with the data owner to talk about it more.
- To innovate and build on top of the data – good machine readable data + a clear open licence displayed alongside the datasets is likely to be important. Links to shared source code / spaces to collaborate with other users of the data / detailed documentation and APIs are all useful things too – though very few data portals provide this right now.
Some of this can be seen feeding into the Desarrollando hack at http://boletinesar.blogspot.com/ where the top score goes to data accessible in a range of formats, not just in the most machine readable.

Evaluating open government data initiatives: can a 5-star framework work?

How do you evaluate or compare open government data initiatives (ODI)? With initiatives like data.gov and data.gov.uk well established, and well over 100 national and city-level open data initiatives emerging across the globe, the question of evaluating these initiatives is coming up more and more.

As Jose Alonso of the World Wide Web Foundation has noted, the elements of an evaluation framework are, as yet, few and far between. Linked Open Data (LOD) actors often turn to the ‘5-Stars of linked data’ to provide some metrics for evaluating particular datasets, and Jose has proposed that a 5-Star framework might be extended to provide a more general evaluative framework. Identifying six dimensions which should be taken into account in an open data initiative (political, legal, organizational, technical, social and economic), Jose suggests that:

5-star scale [Open Data Initiative] is one that is 5-star on every single of the six dimensions.

Exactly what it means to be a ‘5-star’ initiative is as yet unspecified, and the World Wide Web Foundation are starting to explore how such a framework might be developed, and where it might connect with the development of a multi-dimensional composite World Wide Web Index to rank the impact of the web/open data on countries around the world.

In this post I’ll explore some of the challenges ahead in constructing a 5-star evaluation framework for open data initiatives, offering some remarks on possible routes to explore in the future.

Heading for an index? Reduction and ranking

Simple metrics and indexes are clearly very useful advocacy tools, encouraging behaviour change amongst government, civil society and business communities. ‘Official’ UN, OECD or World Bank Statistics on health or education can drive Ministerial focus in a desire for a country not to slip down the rankings; and civil society indexes like IFPRI’s Global Hunger Index, and Publish What You Fund’s recently released pilot Aid Transparency Index are useful tools in putting an issue on the public agenda. A well constructed index, based on open inputs, has the potential to balance the simplicity conventionally demanded by public advocacy, with the depth required to identify the complexity of creating change around a particular issue.

In outputting a single number and allowing rankings to be constructed, indexes can capture news attention as evaluated entities (often countries) look to see their relative positioning on the scale. But if an index is based on good open data, and this is also published clearly (as in the case of the excellent online interface to the PWYF Index), then the index also provides pointers for countries, companies or whichever institution was evaluated to identify areas where they should focus their efforts for change to get a higher ranking next time. In a good index, the input measures should each be linked directly to, or be proxies for, states and actions that have a proven connection with positive change in the overall domain the index is concerned with. For example, in an ideal context, PWYF should be able to account for how improving against each measured input of their Aid Transparency Index can support improvements in the ultimate effectiveness of a donor’s aid.

Finding the right inputs for an index is challenging: indexes tend to rely not only on reducing the output to a single number, but also on reducing each of the inputs to the index to things that are easily quantifiable and comparable, and this can introduce significant national and cultural biases. For example, the Open Knowledge Foundation Open Economics group’s pilot Open Knowledge Index includes attempts to capture the existence of an ‘Open Knowledge Society’ by looking at indicators such as “Number of Wikipedia edits per 100.000 inhabitants”, not only prioritizing a particular technical platform and failing to take into account the complexity of comparing editing practices between different countries (and their potentially diverse language communities), but also ignoring the likely double-counting of earlier index elements such as “Tertiary Education Rates” and “Fixed broadband Internet subscribers (per 100 people)” introduced by looking at edits of a written online resource per head of population. In fairness, the Open Knowledge index is just in its early stages, but its reliance upon existing comparable datasets highlights a key limitation of international index construction: it would be hard to use one input dataset for one region, and another in another region without finding some way of making these comparable.
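
The double-counting worry can be shown with a small worked sketch in Python. The three ‘countries’ and all figures are invented, and the construction used here (min-max normalisation followed by an equal-weight average) is an assumption for illustration, not the method of the Open Knowledge Index or any other real index. The point is only that adding an indicator which closely tracks one already in the index gives the underlying factor extra weight, and can reorder the ranking.

```python
def min_max(values):
    """Rescale one indicator to the 0-1 range across countries."""
    lo, hi = min(values.values()), max(values.values())
    return {country: (v - lo) / (hi - lo) for country, v in values.items()}


def composite_index(indicators):
    """Equal-weight average of the normalised indicators for each country."""
    normalised = [min_max(values) for values in indicators.values()]
    countries = next(iter(indicators.values())).keys()
    return {c: sum(n[c] for n in normalised) / len(normalised) for c in countries}


def ranking(index):
    """Countries ordered from highest to lowest index score."""
    return sorted(index, key=index.get, reverse=True)


# Invented figures for three imaginary countries.
indicators = {
    "broadband_per_100": {"A": 30, "B": 15, "C": 10},
    "open_datasets": {"A": 2, "B": 10, "C": 6},
}
before = composite_index(indicators)

# Add an indicator that simply tracks broadband access (invented figures).
indicators["wiki_edits_per_100k"] = {"A": 300, "B": 150, "C": 100}
after = composite_index(indicators)

print(ranking(before))  # broadband counts for half of the score
print(ranking(after))   # broadband now effectively counts for two thirds
```

With these made-up numbers, country B leads on the two-indicator version, but country A overtakes it once the correlated third indicator is added, even though nothing about either country has changed.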

The reductionism of indexes has a further problem: what exactly is to be compared? A number of the indexes above rank ‘countries’, but an open data initiative index might need to cover not only national government-driven open data schemes, but also local government, community and transnational projects. Putting these in a single ranking would obviously be fraught with difficulty – and it might be appropriate to weight elements differently depending on the type of initiative being evaluated. But even amongst national open government data initiatives: would it be right to plot all governments on the same axis? How would comparing Kenya, India, Moldova and the US on an index help develop open data practice in each?

Whilst the political attraction of an index might prove a strong one for open data advocates, and indexes are certainly in vogue, the reduction of indicators to an index needs careful and critical thought – and, if the driving force behind an evaluation framework, could lead to some potentially damaging distortions in its development.

5-Star scales; 6 domains; at least 2 sides

Jose’s proposal however isn’t yet for an index. Rather, the post suggests that in addition to the 5-Stars of open linked data in the technical domain, similar ‘scales’ are needed in the political, legal, organizational, social and economic domains. This raises a number of questions.

Firstly, to what extent is the existing 5-Stars of Linked Data model truly a ‘scale’? I’ve commented before on the importance of seeing the stars as incremental and cumulative actions to be taken: as a checklist to work through in order, rather than as a ‘score’ where leaping to the top score without moving through the stages before is desirable. The 5-Stars might better be conceived of as a set of ‘indicators’, with early stars setting out the foundations that future steps should build upon. In the Hear by Right framework (PDF) for Organisational Change co-authored by Bill, my colleague at Practical Participation, 49 indicators are organised around a 7-S organisational change model, and divided into ‘Emerging’, ‘Established’ and ‘Advanced’ levels – highlighting that it’s important to move through ‘Emerging’ practice, to become ‘Established’ and to aspire to ‘Advanced’ forms of practice. It might be possible to maintain a ‘5-star’ model to indicate the movement through from emerging, to established and then advanced practice, but ensuring the design of indicators is not ambiguous about their cumulative nature will be important.
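
The difference between a cumulative checklist and a simple score can be sketched in a few lines of Python. The step descriptions are my own paraphrase of the five stars of linked open data, and the two function names are hypothetical; this is an illustration of the distinction rather than an agreed rating method.

```python
# The five stars, paraphrased, in the order they are meant to be achieved.
STARS = [
    "on the web under an open licence",
    "machine-readable structured data",
    "non-proprietary format",
    "URIs used to identify things",
    "linked to other data",
]


def cumulative_stars(achieved):
    """Count stars in order, stopping at the first step not yet achieved."""
    stars = 0
    for step in STARS:
        if step not in achieved:
            break
        stars += 1
    return stars


def raw_score(achieved):
    """Count achieved steps regardless of order (treating stars as a 'score')."""
    return sum(1 for step in STARS if step in achieved)


# A dataset published as linked data, but without an open licence.
dataset = {
    "machine-readable structured data",
    "URIs used to identify things",
    "linked to other data",
}

print(raw_score(dataset))         # counts three steps
print(cumulative_stars(dataset))  # counts none: the first foundation is missing
```

Read as a score, this dataset looks more than half way there; read as a cumulative checklist, it has not yet earned its first star.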

Secondly, we should ask to what extent each dimension (technical, political, legal, organizational, social and economic) can have a single set of cumulative indicators, or to what extent we might identify multiple sets of indicators in each. For example, the five-stars of linked open data (which Jose’s post might imply could simply be adopted as the indicator set for the ‘technical’ domain) only focusses on one set of technical issues in open data publishing: the format and publishing platform (i.e. non-proprietary / linked data; and the web). However, in looking at the use of open data in practice, we find there are important further technical elements to open data initiatives – including providing tools for data discovery (catalogues), providing open source code and tools for working with data, and ensuring technical platforms can cope with demand. Similarly, there might not be one simple sequential set of indicators for the economic or political domain (for example), but rather a parallel set of states that are good to get to, including having political leadership for open data; having open data about politics available; and having open data used in political decision making.

Which leads to the third question, and perhaps one of the most fundamental for an evaluation framework: what exactly are we evaluating? The current 5-Stars of Linked Open Data is primarily a supply-side evaluation: is data being provided? But we might look at the demand or use-side of each of the dimensions Jose points to, asking not only is this domain contributing to the availability of open data, but is open data being effectively used in this domain (which is also different from asking ‘is data about this domain being used effectively?’).

Jose has already noted the further challenge with the 5-star scale in terms of working out when a star is reached. Is an open data initiative only 5-star when all the data within its ambit is published according to Linked Open Data standards, when all public organisations are fully equipped to share and work with open data, and when the whole enterprise sector is engaged with open data and using it to create new jobs? Or is there some threshold of 10% of datasets; and 50% of organisations? Or is that threshold based on ‘valuable datasets’, which, as Jose notes, raises the question “What does “most valuable” mean? For whom?”

A six-domain five-star model quickly loses its potential simplicity when we find the need to focus on both input and impact sides of the equation.

A refined proposal: creating organisational change, measuring social change

So where does this leave us? Again, turning back to learning from Hear by Right, it may be useful to draw a clear distinction between a framework for organisational change, and measuring the impacts of open data.

An organisational change framework for open data initiatives would draw upon the 5-stars already put forward as indicators for mapping and planning: organisations can map their own performance against these indicators (with the possibility of some external assessment and audit too) and can identify actions to move towards a higher level of indicator. Each indicator would identify a set of states or actions an organisation can take to effectively run an open data initiative. Each indicator should be based on a hypothesis about how that state or action increases the impact of open data, but the measurement should simply be based on whether or not the initiative has achieved that state, or taken that action. For example, in the economic dimension, an organisational change framework might include indicators for: ‘The initiative supports the development of a marketplace connecting potential infomediaries with possible sustainable sources of revenue for their services’, and would measure this on the basis of whether the initiative self-assesses (or others judge) that this is in place. The organisational change framework would not include any metrics about impact, although if, over time, it became clear an indicator did not lead to the sorts of changes it was hypothesized to support, then it might be removed or amended.

An impact framework would identify key dimensions of change which could take the form of statements about the sorts of impacts an open data strategy might have. For example, “Open data supports economic growth” in the economic dimension; or “Open data is actively used by citizens in policy making processes”. These might have ‘suggested evidence’ requirements, but it’s unlikely these will be reducible to a single number in most cases. Both organisational change and social change are, to a significant extent, subjective. Whilst we can measure certain ‘states’ (existence of organisational policies and practices; performance statistics; etc.) any measurement of organisational and social change needs also to include a narrative component – highlighting experiences of change, and how the benefits of change are distributed, as well as looking at aggregate measures of change.

In this approach, we disentangle ‘best practices’ and ‘impacts’ – and allow them to each be evaluated on their own terms. Both are still needed: pursuing organisational change without asking ‘what difference does this make?’ isn’t helpful. And equally, measuring impacts without hypothesizing about how to further them, and planning concrete steps to do so, creates massive missed opportunities.

It might even be possible to fit this approach with the elegance of a 5-star formulation.

IKM Paper: Towards Linked Development Data – Practical Issues

I’m getting closer to a final draft of an IKM Working Paper on Linked Data & Development, building on learning from a number of pilot projects working with linked data and the 2010 IKM Workshop on open linked information. It’s taken a while, both because we decided to commission some extra mapping of the open data environment around development datasets, and because trying to find the right balance between technology and policy in a paper like this has proved pretty challenging.

Even if it might take a little longer to get the balance right, I’m aware (a) that a number of projects are exploring linked and open data for development right now and the learning in here could be useful; (b) that we could definitely do with feedback to help shape the final draft; and (c) that I’ll be talking about some of this at the upcoming Development Studies Association (DSA) Conference in York in a few weeks’ time. So here’s the draft as it stands.

It’s broken down into a few PDFs, and references are incomplete at the moment. All feedback very welcome.

Introduction (PDF)

An overview of what the paper contains and the principles guiding its focus.

1 – Linked Data Primer – Introducing Open Linked Data (PDF)

An eight-page overview of what linked and open data are, including an attempt to describe the actual stack of technologies involved in linked data in practice, and addressing issues of licensing open data.

2 – Mapping Issues and Eco-systems of linked data in development (PDF)

  • Four brief case studies of projects involving linked data with shared learning and reflections (Young Lives; Global Hunger Index; IKM Vines; FAO Linked Data).
  • The results of our preliminary open development data mapping study
Shared learning with detailed considerations about the process of creating linked data: including policy and technical issues.
Designed to help newcomers to linked data to work through all the issues they might need to address in joint teams of technical and policy people.
4 – Conclusions (Not quite complete yet…)
A draft of the companion working paper mentioned in these documents (ICT for or against development? An introduction to the ongoing case of Web 3.0) is due by the end of the month. Update: A summary of the policy paper is now available here.

A whole lot of development datasets

Sharing some work in progress; and a small collection of International development datasets

What happens when you set two researchers to work looking for openly available datasets with some connection to the very broad field of International Development?

Well, for one thing you get a very large spreadsheet of 91 different datasets.

Initially I thought it might be possible to record details of different datasets against a short list of themes and categories, but the researchers quickly broke out of the pre-defined list, and my attempts to clean up the data from a top-down view in Google Refine, or to visualise the different categories by generating a graphviz file, quickly ran up against the complex and messy reality of international development data. So, after many attempts to get the big picture to fit on a small laptop screen, I gave up and turned to my favourite approach of making little paper playing cards.

You can find the spreadsheet of datasets the researchers located here (CSV), and a PDF of these datasets ready to be cut out as little playing cards here.

The cards only display some of the information recorded on each dataset – and, as I found the best way to understand a dataset was to visit its website, each card has a QR code on it. Using a little QReader application I’ve been able to find a fairly good working process: sifting through the cards and, whenever I want to find out more about the dataset described on one, holding it up to my web-cam to quickly get the relevant website on screen. Surprisingly, this is a lot easier and a more natural process than having to turn to the keyboard, looking up web addresses and typing them in (note to self: explore different ways to use QR codes to organise future research work…).

Working through the cards I started to get a very different perspective on international development data from the one my preconceptions might have suggested. Whilst Keisha Taylor had the brief of looking globally at datasets which could broadly be said to be relevant to international development, Rui Correia was focussed on datasets relevant to South Africa, and starting with Rui’s datasets was instructive. Aside from the big datasets from the World Bank and from UN agencies and institutions, well known in open data circles, there are a myriad of national and local projects collecting and publishing data on all sorts of issues. A whole network of sites share research data on biodiversity; the Africa-wide FINSCOPE project (originally funded by DFID) holds detailed data on financial readiness in different contexts, primarily selling the data to finance institutions, but also providing data-rich PDF reports for free; universities run data archives along national boundaries (as is the case for most state-funded archives across the world); smaller sites contain lists of available data, but with an e-mail address or form to request it rather than a download option (on websites that are most probably still built with desktop apps rather than Content Management Systems). Rui also included a number of sites in the mapping that I, on an initial read, was about to disregard as ‘not datasets’ – consisting instead of news websites with loose directories and listings of organisations. Compared to sites like the World Association of NGOs directory, or Kabissa, that hold structured information on organisations and projects, these sites may not appear to belong in a collection of datasets, but if our concern is with development information, and how having this information as data helps it to be shared, then such sites should be within our mapping of the potential development data environment.

I’m only a third of the way through exploring all the cards and datasets for now, and I’m not sure how much analysis I’ll be able to complete for the first draft of the working paper this research will be feeding into. I’ll try and share some more reflections as I explore more datasets – but for now I just wanted to get a post up sharing, in the spirit of open data, the mapping spreadsheet so far…

Notes
I’ve been thinking from the start of this research that we should make sure the datasets we find get listed on CKAN.net. That’s still part of the plan – but if anyone wanted to get started on that they would be most welcome as it could take a while…

 

The politics of URI choice…

For the Linked Data for Development paper I’m working on (slowly but surely…), I’ve been trying to unpack some of the implications that flow from the choices that linked data publishers make about the identifiers to use in their datasets. Here’s a quick draft of one section I’m considering for the paper on that topic:

Choosing URIs (from Linked Data for Development – IKM Technical Working Paper draft)

A lot of the benefits of linked data come when you identify things in your dataset using third-party URIs.

For example, instead of using your own identifiers for a country, when you link against the Food and Agriculture Organisation’s (FAO) geographical ontology (http://www.fao.org/countryprofiles/geoinfo/geopolitical/resource/) you find that: (a) you are using an identifier that many other people may be using in their datasets, and so it will be easier to identify where you hold data about the same things; (b) when you look up (dereference) an identifier in the FAO ontology, you will find they provide detailed additional information about countries, including their ‘codes’ in other code schemes such as ISO codes, or their identifiers in key linked data hubs like dbpedia.org. Your application, or applications querying your data, can now choose to integrate all this information.
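To make that concrete, here is a minimal sketch in Python. It models triples as plain tuples instead of using an RDF library, and it does not fetch anything from FAO: the `looked_up` list is a hand-written stand-in for what dereferencing might return. The `http://example.org/` namespace, the predicate names and the exact shape of the country URI are all assumptions made for illustration.

```python
# Triples are modelled as plain (subject, predicate, object) tuples so the
# sketch runs without any RDF library.
FAO = "http://www.fao.org/countryprofiles/geoinfo/geopolitical/resource/"
EX = "http://example.org/"  # hypothetical namespace for our own terms

# Assumption: the exact shape of an FAO country URI is illustrative only.
KENYA = FAO + "Kenya"

# Our dataset points at the third-party URI rather than a local country code.
our_data = [
    (EX + "project/42", EX + "operatesIn", KENYA),
]

# Hand-written stand-in for what dereferencing the FAO URI might return.
looked_up = [
    (KENYA, EX + "isoAlpha3", "KEN"),
    (KENYA, EX + "sameAs", "http://dbpedia.org/resource/Kenya"),
]


def facts_about(uri, *graphs):
    """Collect every (predicate, object) pair stated about `uri`."""
    return {(p, o) for graph in graphs for (s, p, o) in graph if s == uri}


def country_codes_for(project, data, lookup):
    """Follow project -> country URI -> ISO code across both graphs."""
    countries = [o for (s, p, o) in data
                 if s == project and p == EX + "operatesIn"]
    return [o for c in countries
            for (p, o) in sorted(facts_about(c, lookup))
            if p == EX + "isoAlpha3"]


print(country_codes_for(EX + "project/42", our_data, looked_up))
```

Because the project record and the looked-up description share one identifier, the ISO code can be reached without any matching on country names.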

However, in this example, a number of additional consequences can flow from the decision of which URIs to use:

  • Firstly, you establish which third-party datasets it will be easiest to integrate your dataset with. If you use exactly the same identifier as a third-party dataset, then you can mix your datasets together in a triple store or RDF-aware tool and instantly have integrated data. If you link against a source such as the FAO geopolitical ontology which provides useful mapping information (e.g. ISO codes), then you give your applications access to the information they need to integrate with a dataset that uses such codes, but the integration is likely to require some additional work, either in how queries over the data are constructed, or in using reasoning tools which look for implicit connections in the dataset and add them to a triple store.

    Given this additional effort may require time, skills, software or equipment in some cases, choices of identifiers may impact significantly on how data gets used, who it is used by, and who the burdens of integration effort fall upon.

    Sometimes there may be two or more possible sets of identifiers to use for a thing, with some datasets using one set, and others using the other, and no existing mapping between them. In these cases, if the mapping between terms is non-trivial, your choice of identifiers can place you within a particular community of datasets that can only be connected when an investment is made in mapping to integrate the two.

  • Secondly, you may influence others’ use of URIs, setting informal standards through your data publishing. There are strong network effects when it comes to choice of URIs. If you are publishing a significant dataset and choose to use a particular set of URIs, others who come along to publish after you may follow your choice. Linked data doesn’t have formal standard-setting processes, so precedents function as informal standard setting.

  • Thirdly, you decide who you are delegating authority over particular concepts to. This delegation of control can happen on two levels:

    (1) Often URIs will follow established standards devised by offline systems. For example, FAO’s country ontology of URIs only contains countries that FAO, as a UN body, has chosen to recognize. If you want to refer to a country that FAO doesn’t recognize, you won’t find an FAO URI. The choice of URI can involve a commitment to following a particular institution’s view of the world;

    (2) You delegate control, to some extent, over the definition of the thing referred to, handing it to the owner of the domain of the URI. For example, FAO could choose to start making new assertions about countries in their dataset which did not fit with your understanding of a country. Or another third party you were linking to could completely change, or cease to provide, the URIs you were using.

    Neither of these issues is insurmountable. You can mint your own URIs for concepts that a third party does not have coverage for; and you can choose not to trust particular third-party data in your applications, or to update your dataset to use alternative URIs in future if a third party ceases to provide useful data. However, if most of the entities in your dataset are linked to third-party URIs, but a small proportion are not, there is a risk these could become ‘second class citizens’ in your data or could be missed out in queries which assume everything is linked to the third-party URIs.
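The three consequences above can be roughly illustrated in Python, again with triples as plain tuples and no RDF library. This is a sketch under stated assumptions: the `http://example.org/` namespace, the short predicate names, the shape of the FAO country URIs, the locally minted place URI and the indicator values are all made up for the example.

```python
FAO = "http://www.fao.org/countryprofiles/geoinfo/geopolitical/resource/"
EX = "http://example.org/"  # hypothetical namespace for locally minted URIs

ours = {
    (EX + "project/1", "operatesIn", FAO + "Kenya"),
    (EX + "project/2", "operatesIn", FAO + "Ghana"),
    # Locally minted URI for a place the third party does not cover.
    (EX + "project/3", "operatesIn", EX + "place/Somewhere"),
}

# Case 1: another publisher uses exactly the same URIs (illustrative value).
theirs_same_uris = {(FAO + "Kenya", "indicator", 10)}

# Case 2: another publisher keys its data by ISO codes (illustrative values).
theirs_by_iso = {"KEN": 10, "GHA": 20}
# Mapping information of the kind a hub ontology might publish.
uri_to_iso = {FAO + "Kenya": "KEN", FAO + "Ghana": "GHA"}


def values_for_projects(graph, predicate):
    """Join project -> place -> value inside one merged graph."""
    places = {s: o for (s, p, o) in graph if p == "operatesIn"}
    values = {s: o for (s, p, o) in graph if p == predicate}
    return {proj: values[pl] for proj, pl in places.items() if pl in values}


def integrate_via_mapping(graph, mapping, table):
    """Extra step needed when the other dataset uses different identifiers."""
    return {s: table[mapping[o]] for (s, p, o) in graph
            if p == "operatesIn" and o in mapping and mapping[o] in table}


def projects_assuming_fao(graph):
    """A query that silently assumes every place is an FAO URI."""
    return {s for (s, p, o) in graph
            if p == "operatesIn" and o.startswith(FAO)}


direct = values_for_projects(ours | theirs_same_uris, "indicator")
mapped = integrate_via_mapping(ours, uri_to_iso, theirs_by_iso)
everything = {s for (s, p, o) in ours if p == "operatesIn"}
missed = everything - projects_assuming_fao(ours)
print(direct, mapped, missed)
```

With shared URIs the merge is a plain set union. The ISO-keyed data only joins once someone has written and maintained `integrate_via_mapping`, and the project attached to the locally minted URI is the one that drops out of the query that assumes FAO identifiers.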

One phrase that came out of the IKM Linked Data workshop to capture the decisions involved in choosing URIs was that, to gain full benefit from linked data, we must face the “Economics of integration” or the “Politics of delegation” – pointing to the need to either spend time and effort creating your own URIs and mapping these to diverse other URIs (as the FAO Ontology does), or to delegate control to third parties, making explicit or implicit choices about which concepts can be easily used in a dataset, and how those concepts are defined.

 

Comments, critique, feedback very welcome….