Miles and Huberman (1994) encourage mixed-methods researchers to explicitly outline their research process. Data collection (Fig. 2 above) took place in sequential but overlapping steps, each informing subsequent data collection. The analysis draws upon these data resources in parallel.
3.2.1. Exploratory research and participant observation
Between January 5th and June 30th 2010 a custom-built CAQDAS system was used to record and analyze public Twitter messages including the ‘#opendata’ hash-tag (Huang et al. 2010), presenting data in tag-clouds for exploration (Rivadeneira et al. 2007). Tweets were regularly reviewed and emerging themes recorded in a private wiki-based research journal (Borg 2001; Janesick 1999). In March and April 2010 I participated in multiple open-data events, including two ‘hack-days’, attended by over 60 people. As an announced participant-observer (Gray 2009) at these events I observed and engaged with a range of OGD uses. Further reflective journaling throughout this phase was used to identify key questions and themes for later analysis, and to deepen my understanding of issues relating to OGD use.
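The tag-cloud view described above can be illustrated with a minimal sketch. It is not the custom CAQDAS used in the study: the sample messages, the function names `tag_counts` and `font_sizes`, and the linear font scaling are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample messages; not data collected in the study.
tweets = [
    "New #opendata release on school performance #education",
    "Hack-day this weekend! #opendata #hackday #education",
    "Mapping spending with #opendata #COINS",
    "Unrelated message without the tag",
]

def tag_counts(messages, track="#opendata"):
    """Count hashtags that co-occur with the tracked tag."""
    counts = Counter()
    for text in messages:
        tags = [t.lower() for t in re.findall(r"#\w+", text)]
        if track in tags:
            # The tracked tag appears in every kept message, so leave it out.
            counts.update(t for t in tags if t != track)
    return counts

def font_sizes(counts, smallest=10, largest=32):
    """Scale each count linearly to a font size for a tag-cloud display."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {tag: smallest + (n - lo) * (largest - smallest) // span
            for tag, n in counts.items()}

counts = tag_counts(tweets)
sizes = font_sizes(counts)
```

Tags that appear more often alongside the tracked hash-tag receive larger font sizes, which is the property that makes a tag-cloud useful for spotting candidate themes before closer reading.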
3.2.2. Online survey
Initial exploratory research highlighted underrepresentation of certain OGD users and uses in mainstream digital discourses. Online survey methods offer a cost-effective way to gather input from large numbers of people (Gray 2009, 13; Fink 2006). An online survey was designed to identify wide-ranging open data uses, beyond commonly cited examples, though focusing on UK OGD. Dekkers et al. (2006) note the difficulty of generating clear sampling frames of OGD re-users. Lacking a clearly bounded population that would allow statistical sampling (Kish 1965), the study adopted opportunistic sampling whilst seeking wide dissemination of the survey. A careful balance had to be struck between obtaining adequate responses and introducing excessive ‘selection bias’ through heavy dissemination to particular OGD user communities. A prize-draw incentive was offered to reduce non-response (Couper 2000) and the project blog was developed to demonstrate the authenticity of the research and a commitment to sharing research findings (Cho & LaRose 1999). The online survey, of 35 questions, was available between May 11th and June 14th and received 72 responses, 44 of which described OGD use from data.gov.uk.
Given its non-probability sample, the survey design focused on collecting information about instances of OGD use, and on inviting respondents to indicate their views on key statements developed from prior phases of research. Exploratory factor analysis (Bartholomew et al. 2008, 7 & 9; Costello & Osborne 2005; Lawley & Maxwell 1962), bootstrapped cluster analysis (Suzuki & Shimodaira 2006) and visual analysis of correlations were conducted within the R statistical software (Becker et al. 1988). This analysis informed the description of OGD users and their motivations. A full copy of survey questions, anonymous response data and details of survey promotion can be found on the project blog.
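To illustrate the kind of correlation check that underlies this exploratory analysis, the following is a minimal sketch in Python rather than R. It is a simplification and not the procedure used in the study: it computes a pairwise Pearson correlation between two statements and a plain percentile bootstrap interval, whereas the study used factor analysis and bootstrapped cluster analysis. The response values, statement labels and function names are hypothetical.

```python
import random
from math import sqrt

# Hypothetical 1-5 agreement scores from ten respondents; not survey data.
responses = {
    "S1": [5, 4, 4, 2, 1, 5, 3, 2, 4, 1],
    "S2": [4, 5, 3, 2, 2, 5, 3, 1, 4, 2],
    "S3": [1, 2, 2, 4, 5, 1, 3, 4, 2, 5],
}

def pearson(x, y):
    """Pearson correlation; returns NaN if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def bootstrap_interval(x, y, draws=1000, level=0.95, seed=0):
    """Percentile interval from resampling respondents with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(draws):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if r == r:  # skip NaN from resamples with no variance
            estimates.append(r)
    estimates.sort()
    k = int((1 - level) / 2 * len(estimates))
    return estimates[k], estimates[-k - 1]

r12 = pearson(responses["S1"], responses["S2"])
r13 = pearson(responses["S1"], responses["S3"])
low, high = bootstrap_interval(responses["S1"], responses["S2"])
```

With a non-probability sample, an interval of this kind describes only how stable the pattern is within the respondents obtained; it does not support inference to a wider population of OGD users.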
3.2.3. Semi-structured interviews
Eight purposively sampled semi-structured interviews (Bryman 2008, chap.18) were carried out to complement survey data. Interviews invited respondents to give detailed elaboration of particular OGD uses, their reasons for working with OGD, and challenges they encountered, as well as exploring their general attitudes to OGD. All but one interview took place by phone. Interviews were recorded, transcribed and coded for key themes using TAMS Analyzer (Weinstein 2008), with GraphViz used to visualize relationships between themes (Bilgin et al. 2009).
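One way relationships between coded themes could be passed to GraphViz is sketched below. This is an assumption about the general approach, not a description of the study's workflow or of TAMS Analyzer output: the theme codes, the co-occurrence weighting and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical coded segments: each set holds the theme codes applied to
# one interview passage. These are not codes from the study.
coded_segments = [
    {"motivation", "transparency"},
    {"motivation", "technical_barriers"},
    {"transparency", "motivation"},
    {"licensing"},
]

def cooccurrence(segments):
    """Count how often each pair of codes is applied to the same passage."""
    pairs = Counter()
    for codes in segments:
        for a, b in combinations(sorted(codes), 2):
            pairs[(a, b)] += 1
    return pairs

def to_dot(pairs, name="themes"):
    """Render the pair counts as an undirected graph in the DOT language."""
    lines = [f"graph {name} {{"]
    for (a, b), weight in sorted(pairs.items()):
        # Edge label and line width both reflect the co-occurrence count.
        lines.append(f'  "{a}" -- "{b}" [label="{weight}", penwidth={weight}];')
    lines.append("}")
    return "\n".join(lines)

pairs = cooccurrence(coded_segments)
dot_source = to_dot(pairs)
```

The resulting text can be saved to a file and rendered with a GraphViz layout program such as `dot`. Codes that never co-occur with another code (here, `licensing`) produce no edge and so do not appear in this simple version.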
3.2.4. Ethical considerations
Alongside adopting good practice for informed consent in online surveys, interviews (Varnhagen et al. 2005) and participant-observation (Gray 2009, chap.4), particular attention was paid in this study’s data collection and analysis to problems of ‘privacy in public’ (Nissenbaum 1998). The inherently searchable nature of many OGD uses means that even basic descriptions can, in practice, allow readers to link quotes to individuals, even though responses were collected under offer of anonymity. This necessitates care in how respondents are quoted, and obtaining additional permission in certain instances. Sensitivity to the context in which information was shared has guided the direct use of content from exploratory/participant-observer phases of the research (Eynon et al. 2008).
3.4. Analysis and Embedded Cases
Analysis is carried out with a fully mixed-method approach, selecting appropriate methods to begin and pursue exploration on the basis of each RQ and the available data. Analysis of RQ1 starts from exploratory statistical analysis, checked against qualitative reading of survey responses and complemented by insights from the literature, interviews and participant-observation. Analyses of RQ2, RQ3 and RQ4 all follow a holistic process of analysis (Wiber 1999), predominantly seeded by exploration and visualization of data use instances. Diesing explains: “Holist theorizing should always proceed in intimate contact with particular cases, so that each theoretical step can be immediately checked against a range of examples” (1971, p.182). A key challenge for holistic researchers is finding a set of cases detailed enough to draw insights from, yet small enough to work with. The 44 data-use instances identified by the survey were too shallow, and too numerous, for detailed comparison. Consequently, two sets of embedded cases of data.gov.uk OGD use were generated from existing and additional data, using a ‘multiple case, embedded case’ design (Yin 2008; Gray 2009, p.256), to allow a more careful and in-depth analysis of emerging themes. Readers are encouraged to attend carefully to diagrams and tables, as much of the detail of findings is presented there, without repetition in the text.
3.4.1. Embedded case selection
Embedded case selection was theoretically informed, drawing on earlier research phases to create a broad-based sample allowing for both contrast and comparison between cases. Each case was written up using a range of sources and entered into the custom CAQDAS (see 3.2.1). Embedded cases were also treated separately as data use instances for §4.2, giving 55 instances of data use in total (some cases and instances overlapped).
Embedded cases are listed in §4.2 along with details of the sources for each. The first set focuses on education data from data.gov.uk, selected both because of the political saliency of education data, and because the EduBase dataset is available as RDF linked data. Case selection drew upon leads from survey responses (E5-E7) and analysis of online discussions around education OGD (E1-E4 & E8). Whilst E1 does not strictly constitute a use of data.gov.uk OGD, since it drew on datasets prior to their release as open data by scraping the relevant content, it is included to facilitate comparison against E4. The second set of embedded cases focuses on a different topic, to ensure emerging themes were not specific to education data or overly skewed by the shared topic of the datasets. Where E1-E8 cover a range of datasets, C1-C6 focus on a single dataset: COINS public spending data, selected both because of its high-profile launch during the period of study (generating considerable documentary data to analyze about each case) and because it was newly released data, not just made more accessible through directory listings or updated licensing.
The second arrow in Fig. 3 indicates the route to generalization of findings from the study. The controlled comparison approach of the core study could be adopted to check the relevance of findings ‘vertically’ against other uses of open data, or ‘horizontally’, against other specific data directories, including local government data directories. The articulation of many findings from the study in the form of typologies is also designed to facilitate their re-use in other contexts, with their validity tested on pragmatic grounds.