Sensitive government data is any data that could be used to infer personally identifiable information. It’s a term that readily applies to a very large chunk of government data. As part of research into ways of making analysis of sensitive government data more feasible, I have talked to several researchers of late. How do they find and analyze such data, and what are the major challenges in doing so? In terms of who counts as a researcher, my sample so far is biased toward those employed by national science organizations and consultants specializing in high-end analytics. Emerging from this informal survey is a consistent pattern in the challenges researchers face in the world of not-so-open data.
To begin with, there are problems that are common to all government data research, not just the sensitive stuff. Very often, researchers come across data that may be useful to them. That leads to two challenges. First, the data is often poorly documented, which leads to guesswork, assumptions, and blind alleys. Second, researchers need to perform exploratory data analysis (EDA) to find out whether the data is useful, and what kind of hypotheses might be interesting to investigate. At a minimum, this typically requires downloading a dataset, applying some transformations, loading it into a suitable tool and, finally, checking whether there is anything of potential interest in it. That can be a time-consuming process. In some cases, the data isn’t freely available, which in the best-case scenario leads to emails and spreadsheets going back and forth, and in the worst case leads to a dead end. And when the EDA is complete, the data may turn out not to be of any use after all. On to the next best guess!
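To make that "is this dataset worth my time?" check a bit more concrete, here is a minimal sketch of a first-pass profile in Python, using only the standard library. Everything in it is an assumption for illustration: the `profile_csv` helper, the column names and the tiny inline sample are made up, and a real dataset would be read from a downloaded file rather than a string. It only counts rows, missing cells and distinct values per column, which is roughly the least a researcher would want to know before investing more effort.

```python
import csv
import io
from collections import Counter


def profile_csv(text, max_top=3):
    """Summarize each column of CSV text: rows, missing cells, distinct values, most common values."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {c: 0 for c in columns}
    values = {c: Counter() for c in columns}
    rows = 0
    for row in reader:
        rows += 1
        for c in columns:
            # Treat absent or whitespace-only cells as missing.
            cell = (row.get(c) or "").strip()
            if cell == "":
                missing[c] += 1
            else:
                values[c][cell] += 1
    return {
        c: {
            "rows": rows,
            "missing": missing[c],
            "distinct": len(values[c]),
            "top": values[c].most_common(max_top),
        }
        for c in columns
    }


# Hypothetical sample standing in for a downloaded dataset.
SAMPLE = """region,age_band,visits
North,30-39,4
South,,2
North,40-49,
"""

summary = profile_csv(SAMPLE)
for column, stats in summary.items():
    print(column, stats)
```

Even a crude summary like this can flag a column that is mostly empty or a field with only one distinct value, which is often enough to decide whether to keep going or move on to the next candidate dataset.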
So far, that’s no different from analyzing open government data. However, those looking at health data, for example, where personal privacy is a critical issue, face an additional roadblock: getting approval to access the data at all. That’s an administrative headache that can dwarf the technical challenges of performing any initial exploration. For example, an Australian researcher told me how he needed to complete an ethics application with hundreds of questions on it. Given that this comes before researchers learn whether the data is of any use, how many times does a researcher choose to simply walk away empty-handed? Many research projects these days have very fast turnaround times, so this kind of approval process is really just another way of saying that the data is not available. We know there is a cost in terms of privacy violations if sensitive information is disclosed, but what is the cost of researchers not having access to valuable data sources that may contain vital statistical information, information that could lead to better policy and insights on a whole range of issues such as healthcare, social security and taxation?
I got this feedback from a small and rather biased sample. I’d really like to learn more about what kinds of EDA people need to do and the challenges in getting it done, especially where privacy issues block people from getting the information they need. There is a lot of research data containing sensitive personal information, and much of it has very high reuse value, so this is an important issue for the successful sharing of data for research. What’s your experience? Have you had to jump through hoops, or travel to a research data laboratory, to analyze sensitive government data? Could it have been simpler?




