Have you used any data recently? How would you describe what you did with it? Did you draw upon an original ‘raw’ dataset? Or on a version someone had already cleaned up, converted or hosted in an API somehow? What other data and tools did you bring into the mix? Did you generate new source code along the way? Has someone else built upon the data you created?
There’s lots of information we might want to know about how open data is being used, and one of the big challenges of the workflow part of the Social Life of Data project is working out how to capture this:
- (a) In a friendly user interface – it would be good to create a tool which doesn’t just rely on data entered by researchers, but which presents non-technical end-users of open data with a way to easily describe what they have done.
- (b) In structured and standard form – so that we can analyse different patterns of data use, and draw conclusions about popular approaches to using data.
- (c) In a usable data model – and, particularly as this project forms part of the IKM Emergent exploration of open linked data, ideally a model that can be easily recorded in RDF.
I’ve been exploring a range of different ways to record workflows. A lot of my initial exploration looked at the Open Provenance Model, and the OPMV RDF version of it. This works on the basis that there are Artifacts (things like datasets, visualisations and source code), and each Artifact is created by some Process. Processes use other Artifacts, and are carried out at certain times by certain Agents. The model is expressive and flexible – but it’s also fairly abstract – and I wonder whether we might need an intermediate stage between OPM and a data model that works for the slightly more relaxed Social Life of Data workflows, which need to allow for some fuzziness in how the use of data is described.
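To make this a bit more concrete, here is a rough sketch (in Python, using rdflib) of how a single use of data might be recorded as OPMV-style triples. The URIs are placeholders, and the property names (opmv:used, opmv:wasGeneratedBy, opmv:wasControlledBy) reflect my reading of the OPMV vocabulary rather than a settled modelling choice:

```python
# A rough sketch (not a settled model): one data 'use' described with
# OPMV-style Artifact / Process / Agent triples, using rdflib.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

OPMV = Namespace("http://purl.org/net/opmv/ns#")   # assumed OPMV namespace
EX = Namespace("http://example.org/workflow/")      # placeholder URIs

g = Graph()
g.bind("opmv", OPMV)

# Artifacts: the source dataset and the thing built from it
g.add((EX.iati_datasets, RDF.type, OPMV.Artifact))
g.add((EX.xml_aggregator, RDF.type, OPMV.Artifact))

# The process that connects them, and the agent who carried it out
g.add((EX.aggregation, RDF.type, OPMV.Process))
g.add((EX.aggregation, OPMV.used, EX.iati_datasets))
g.add((EX.xml_aggregator, OPMV.wasGeneratedBy, EX.aggregation))
g.add((EX.aggregation, OPMV.wasControlledBy, EX.researcher))

g.add((EX.researcher, RDF.type, OPMV.Agent))
g.add((EX.researcher, RDFS.label, Literal("Example researcher")))

print(g.serialize(format="turtle"))
```

Even this tiny example shows the Artifact–Process–Agent shape that OPM imposes – the open question is whether end-users can reasonably be asked for information in quite this shape.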
For example, take the description below of what I did with data from the IATI Registry:
“I took all the datasets from the IATI Registry and I made an XML aggregator and an interface onto that aggregator.
In the process, I wrote a PHP script that uses the eXist PHP library and the PHP CKAN library to fetch each IATI Registry XML file and upload it to a local copy of eXist I am running.
This results in an XML API that can be used to fetch or query any IATI files.
To create an interface onto the data I adapted an existing PHP script that fetches an IATI XML file and uses XSLT to transform it into JSON which is then fed into the Simile Exhibit tool that generates an interface.”
At the simplest, we need a model allowing a direct connection to be made between a Dataset and a ‘use’ of that dataset. (In fact, after spending a lot of time trying to work out how to capture the full chain of events between dataset and use, I’m going to look at just solving this simple problem with some prototype code first.)
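As a first pass at that simple problem, here is the kind of minimal prototype structure I have in mind – just a direct link between a dataset and a use, with nothing in between. The field names and URLs are placeholders, not a settled schema:

```python
# A minimal prototype sketch of the 'simple problem': recording a direct
# link between a dataset and a use of it. Field names are placeholders.
from dataclasses import dataclass

@dataclass
class DataUse:
    dataset_uri: str      # the dataset (or catalog/distribution) drawn upon
    use_uri: str          # the resulting artifact, if it lives online somewhere
    description: str      # free-text account of what was done
    agent: str            # who did it

use = DataUse(
    dataset_uri="http://example.org/datasets/iati-registry",   # placeholder
    use_uri="http://example.org/iati-xml-aggregator",          # placeholder
    description="Aggregated IATI XML files into an eXist store with an API.",
    agent="http://example.org/people/researcher",
)
```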
What is a dataset anyway?
However, as the example above shows, even this can get complicated: in using all the datasets from the IATI Registry, the use has, strictly, been of 121 different datasets, run through a ‘merge’ process. A similar problem comes up if you try to describe the use of a dataset from data.gov.uk or CKAN. What is listed and counted as a ‘dataset’ (e.g. http://data.gov.uk/dataset/coins/) has multiple versions, which might cover different time periods (and thus be essentially sub-sets of the dataset), or which are the same dataset in a different format (and thus are representations of the data). Defining the boundaries of a dataset so as to be able to establish that two separate ‘workflows’ were based on the same data could get tricky! We can look at the DCAT vocabulary to get a sense of how others are handling this problem (and to see if we can remain compatible with widely used models), and we find that it has the notions of a Catalog, a CatalogRecord, a Dataset, and a Distribution.
We could then say that any workflow can begin with a Catalog, a Dataset or a Distribution. Distributions belong to Datasets, and Datasets to Catalogs. If a workflow begins with a Catalog then we could infer that it uses all the datasets within that Catalog (although this raises some temporal issues, as Catalogs may be added to over time…). At least if we record the relationship between Catalogs, Datasets and Distributions in the datastore sitting behind the workflows tool, we can allow end-users to navigate from a dataset, to the catalog it belongs to, to other uses that draw upon that catalog.
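Here is a sketch of how those relationships might be recorded and navigated, again in Python with rdflib. The dcat:dataset and dcat:distribution property names are taken from the draft DCAT vocabulary, and the URIs are invented for illustration:

```python
# A sketch of Catalog / Dataset / Distribution relationships using DCAT,
# plus a query that walks from a distribution back up to its catalog.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
EX = Namespace("http://example.org/")  # placeholder URIs

g = Graph()
g.add((EX.data_gov_uk, RDF.type, DCAT.Catalog))
g.add((EX.coins, RDF.type, DCAT.Dataset))
g.add((EX.coins_2010_csv, RDF.type, DCAT.Distribution))
g.add((EX.data_gov_uk, DCAT.dataset, EX.coins))
g.add((EX.coins, DCAT.distribution, EX.coins_2010_csv))

# Given the distribution a workflow started from, find the dataset and
# catalog it belongs to, so other uses of the same catalog can be linked.
q = """
SELECT ?dataset ?catalog WHERE {
  ?dataset dcat:distribution ?dist .
  ?catalog dcat:dataset      ?dataset .
}
"""
for row in g.query(q, initNs={"dcat": DCAT},
                   initBindings={"dist": EX.coins_2010_csv}):
    print(row.dataset, row.catalog)
```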
Whilst some uses of data may not strictly result in tangible artifacts (e.g. conversations or presentations without slides), we’ll assume that we can describe these intangible things as artifacts in our model.
Attaching processes and agents
Even if we stick to a really simple model in the first instance, which allows users to specify a starting artifact (usually a Catalog, Dataset or Distribution) and an ending artifact, we need to (a) ask what type of artifact they are ending up with (e.g. is it a report, a mash-up, a website or a visualisation); (b) find out something about the process used to create it (even if just a free-text description for now); and (c) find out about the agent (the person) responsible for it (after all, if we’ve got no people involved we can hardly have the social life of data…).
We’re going to need an interface that captures these – and some sort of simple taxonomy of data use artifacts, processes, and relationships between processes and agents.
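Something like the following sketch could capture those pieces – a starting artifact, an ending artifact typed against a small taxonomy, a free-text process description, and an agent. All the names, values and URLs here are placeholders:

```python
# A sketch of the simple first-pass model: start artifact, end artifact
# (typed against a small taxonomy), free-text process, responsible agent.
ARTIFACT_TYPES = ["report", "mash-up", "website", "visualisation", "dataset", "other"]

workflow_step = {
    "start_artifact": "http://example.org/datasets/iati-registry",  # Catalog/Dataset/Distribution
    "end_artifact": {
        "type": "mash-up",                        # (a) what sort of thing was produced
        "url": "http://example.org/iati-exhibit",
    },
    "process": "Converted IATI XML to JSON and loaded it into Exhibit.",  # (b) free text for now
    "agent": {                                    # (c) the person responsible
        "name": "Example researcher",
        "twitter": "@example",
    },
}

# The taxonomy gives us something to validate against later on.
assert workflow_step["end_artifact"]["type"] in ARTIFACT_TYPES
```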
The full chain
You will notice, though, that the example above describes two ‘uses’ of data, and that one builds upon the other – so even though we’re trying to have a really simple way of describing datasets and the artifacts (uses) resulting from them, we might want to allow that one use can build upon another.
There’s an interface challenge for the future about how we might let users insert new processes in the middle of an existing workflow, but for now we could simply allow that chains of data use can be built up sequentially.
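Building chains sequentially could then be as simple as letting each new step start from the artifact the previous step produced – roughly along these lines (placeholder URLs and field names again):

```python
# A sketch of a sequentially built chain of uses: each step starts from
# the artifact the previous step produced.
chain = [
    {
        "start": "http://example.org/datasets/iati-registry",
        "end": "http://example.org/iati-xml-api",
        "process": "Fetched IATI XML files and loaded them into an eXist store.",
    },
    {
        "start": "http://example.org/iati-xml-api",   # builds on the previous step's output
        "end": "http://example.org/iati-exhibit",
        "process": "Transformed XML to JSON with XSLT and displayed it in Exhibit.",
    },
]

def extend_chain(chain, end, process):
    """Append a new step that starts from the last step's ending artifact."""
    chain.append({"start": chain[-1]["end"], "end": end, "process": process})
```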
What would we really like to know
When you add up all the things it might be nice to know about the way data has been used (and you realise it would be helpful to know this about every stage of use), it turns out to be a pretty long list – so one of the coming challenges is going to be (a) finding interfaces that make it possible to build this information up cumulatively without overwhelming the user entering data; (b) finding other ways to fill in any of the data we might want (e.g. can we fetch information about data artifacts from a catalogue? can we find a user’s bio by asking for their Twitter ID?); and (c) finding the right compromises to make sure the workflow tool is useful for humans first and foremost.
- (M1) What did you do with this dataset?
- (M2) What data did you use to build this?
- (M3) Who is exploring, using and talking about this data?
- (A0) What sort of artifact is it?
- (A1) Describe the artifact briefly.
- (A2) Is it available online somewhere?
- (A3) Is anyone maintaining it? Who is responsible for it?
- (A4) Where can I find out more about it?
- (A5) Are there conversations about it?
- (A6) When was it last updated?
- (P1) Describe the process briefly.
- (P2) What did you use?
- (P3) Who was involved?
- (P4) When did it first happen?
- (P5) When did it last happen?
- (P6) How often is it updated?
- (AG1) Who is this?
- (AG2) Where can I find them online?
- (AG3) What organisation do they belong to?
- (AG4) Brief biography?
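One way to read the list above is as a set of optional fields on three record types – artifacts, processes and agents – all of which can be left blank and filled in cumulatively. A rough sketch, with placeholder field names keyed back to the question codes:

```python
# A sketch of the question list as optional fields on three record types,
# so information can be built up cumulatively. Field names are placeholders.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Artifact:
    kind: Optional[str] = None              # A0: what sort of artifact is it?
    description: Optional[str] = None       # A1
    url: Optional[str] = None                # A2
    maintainer: Optional[str] = None         # A3
    more_info: Optional[str] = None          # A4
    conversations: List[str] = field(default_factory=list)  # A5
    last_updated: Optional[str] = None       # A6

@dataclass
class Process:
    description: Optional[str] = None        # P1
    tools_used: List[str] = field(default_factory=list)     # P2
    people: List[str] = field(default_factory=list)         # P3
    first_happened: Optional[str] = None     # P4
    last_happened: Optional[str] = None      # P5
    update_frequency: Optional[str] = None   # P6

@dataclass
class Agent:
    name: Optional[str] = None                # AG1
    online: Optional[str] = None              # AG2
    organisation: Optional[str] = None        # AG3
    bio: Optional[str] = None                 # AG4
```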
We’re not quite there yet with a clear sense of the model we should use for capturing workflow information. Next up comes a bit of back-and-forth testing with different possible interface ideas (and tools we can draw upon) for creating and displaying workflows.
Once we’ve got a rough model together, the first iterations on generating some real workflow data can begin!