The politics of URI choice…

For the Linked Data for Development paper I’m working on (slowly but surely…), I’ve been trying to unpack some of the implications that flow from the choices that linked data publishers make about the identifiers to use in their datasets. Here’s a quick draft of one section I’m considering for the paper on that topic:

Choosing URIs (from Linked Data for Development – IKM Technical Working Paper draft)

A lot of the benefits of linked data come when you identify things in your dataset using third-party URIs.

For example, if, instead of using your own identifiers for a country, you link against the Food and Agriculture Organisation’s (FAO) geopolitical ontology, you find that: (a) you are using an identifier that many other people may be using in their datasets, so it will be easier to identify where you hold data about the same things; and (b) when you look up (dereference) an identifier in the FAO ontology, you will find it provides detailed additional information about countries, including their ‘codes’ in other code schemes, such as ISO codes, and their identifiers in key linked data hubs. Your application, or applications querying your data, can now choose to integrate all this information.
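
To make point (a) concrete, here is a minimal sketch in plain Python, with triples held as simple tuples rather than in a real triple store. Every URI below is a hypothetical example.org stand-in (none of them are real FAO identifiers) and the numbers are invented for illustration.

```python
# Triples are (subject, predicate, object) tuples.
# All URIs and values are hypothetical, for illustration only.
SHARED = "http://example.org/geo/"     # stand-in for a third-party vocabulary
LOCAL = "http://example.org/mydata/"   # stand-in for home-grown identifiers
TERMS = "http://example.org/terms/"    # stand-in property namespace

health_data = {
    (SHARED + "Kenya", TERMS + "clinics", 120),
    (SHARED + "Ghana", TERMS + "clinics", 85),
}
farming_data = {
    (SHARED + "Kenya", TERMS + "maizeYield", 1.6),
    # This publisher minted its own identifier for Ghana.
    (LOCAL + "country/GH", TERMS + "maizeYield", 1.9),
}


def subjects(triples):
    """Return the set of subject URIs used in a collection of triples."""
    return {s for s, _, _ in triples}


# Merging is just a set union; things described with the same URI line up.
merged = health_data | farming_data
overlap = subjects(health_data) & subjects(farming_data)
print(sorted(overlap))
```

Only the Kenya records line up automatically. The two Ghana records describe the same country, but nothing in the merged data says so, because one publisher used its own identifier.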

However, in this example, a number of additional consequences can flow from the decision of which URIs to use:

  • Firstly, you establish which third-party datasets it will be easiest to integrate your dataset with. If you use exactly the same identifier as a third-party dataset, then you can mix your datasets together in a triple store or RDF-aware tool and instantly have integrated data. If you link against a source such as the FAO geopolitical ontology, which provides useful mapping information (e.g. ISO codes), then you give your applications access to the information they need to integrate with a dataset that uses such codes. However, the integration is likely to require some additional work, either in how queries over the data are constructed, or in using reasoning tools which look for implicit connections in the dataset and add them to a triple store.

    Given that this additional effort may require time, skills, software or equipment, choices of identifiers can significantly affect how data gets used, who uses it, and on whom the burden of integration falls.

    Sometimes there may be two or more possible sets of identifiers to use for a thing, with some datasets using one set, others using another, and no existing mapping between them. In these cases, if the mapping between terms is non-trivial, your choice of identifiers can place you within a particular community of datasets, one that can only be connected to the others when someone invests in mapping between the sets.

  • Secondly, you may influence others’ use of URIs, setting informal standards through your data publishing. There are strong network effects when it comes to choice of URIs. If you are publishing a significant dataset and choose to use a particular set of URIs, others who come along to publish after you may follow your choice. Linked data doesn’t have formal standard-setting processes, so precedents function as informal standard setting.

  • Thirdly, you decide who you are delegating authority over particular concepts to. This delegation of control can happen on two levels:

    (1) Often URIs will follow established standards devised by offline systems. For example, FAO’s country ontology only contains URIs for countries that FAO, as a UN body, has chosen to recognize. If you want to refer to a country that FAO doesn’t recognize, you won’t find an FAO URI. The choice of URI can involve a commitment to following a particular institution’s view of the world;

    (2) You delegate control, to some extent, over the definition of the thing referred to, handing it to the owner of the domain of the URI. For example, FAO could choose to start making new assertions about countries in its dataset which did not fit with your understanding of a country. Or another third party you were linking to could completely change, or cease to provide, the URIs you were using.

    Neither of these issues is insurmountable. You can mint your own URIs for concepts that a third party does not cover; and you can choose not to trust particular third-party data in your applications, or update your dataset to use alternative URIs in future if a third party ceases to provide useful data. However, if most of the entities in your dataset are linked to third-party URIs, but a small proportion are not, there is a risk these could become ‘second class citizens’ in your data, or could be missed out in queries which assume everything is linked to the third-party URIs.
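
Two of the points above can be sketched together in plain Python: the extra step needed to integrate via a code mapping rather than a shared URI, and the way self-minted URIs can silently drop out of that integration. This is only an illustration under stated assumptions. The URIs, the isoCode property, the "region-x" entity and the values are all hypothetical, and a real setup would more likely use a triple store with queries or reasoning rules.

```python
# All URIs, property names and values are hypothetical, for illustration only.
THIRD_PARTY = "http://example.org/geo/"         # stand-in third-party vocabulary
OWN = "http://example.org/mydata/place/"        # stand-in self-minted URIs
HAS_CODE = "http://example.org/terms/isoCode"   # stand-in mapping property

# Mapping information published alongside the third-party URIs.
vocabulary = {
    (THIRD_PARTY + "Kenya", HAS_CODE, "KE"),
    (THIRD_PARTY + "Ghana", HAS_CODE, "GH"),
}

# Another publisher's dataset, keyed by code rather than by URI.
coded_records = {"KE": 10, "GH": 20}

# Our dataset: two third-party URIs plus one URI we minted ourselves
# for something the third party does not cover.
my_subjects = [THIRD_PARTY + "Kenya", THIRD_PARTY + "Ghana", OWN + "region-x"]


def code_for(uri, triples):
    """Return the code attached to a URI, or None if no mapping exists."""
    for s, p, o in triples:
        if s == uri and p == HAS_CODE:
            return o
    return None


def integrate(subjects, triples, records):
    """Join our subjects to the coded records via the published mapping."""
    joined, unmatched = {}, []
    for uri in subjects:
        code = code_for(uri, triples)
        if code in records:
            joined[uri] = records[code]
        else:
            unmatched.append(uri)  # no mapping: left out of the integration
    return joined, unmatched


joined, unmatched = integrate(my_subjects, vocabulary, coded_records)
print(joined)
print(unmatched)
```

The lookup step is the “additional work” that someone has to write and maintain. The self-minted URI ends up in the unmatched list, and a query that never checked for unmatched entities would not notice it was missing.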

One phrase that came out of the IKM Linked Data workshop to capture the decisions involved in choosing URIs was that, to gain full benefit from linked data, we must face the “Economics of integration” or the “Politics of delegation”, pointing to the need either to spend time and effort creating your own URIs and mapping these to diverse other URIs (as the FAO ontology does), or to delegate control to third parties, making explicit or implicit choices about which concepts can be easily used in a dataset, and how those concepts are defined.


Comments, critique, feedback very welcome….

One Comment

  1. Hugo Besemer

    Choosing URIs from a collection also implies accepting a viewpoint. In this example you choose the viewpoint of an intergovernmental UN agency about which political entities are considered countries, and which are not. In practical terms: no Kosovo here (that might upset Serbia), one Somalia while different areas are “governed” quite differently, and you end up with names like “The former Yugoslav Republic of Macedonia” (meant to keep Greece happy; afraid that did not work well).
