Exploring Not-So-Open Data

August 25th, 2010 by Don McIntosh

Sensitive government data is any data that could be used to infer personally identifiable information. It’s a term that is readily applicable to a very large chunk of government data. As part of research into solutions for making analysis of sensitive government data more possible, I have talked to several researchers of late. How do they find and analyze such data and what are the major challenges in doing so? In terms of who a researcher is, my sample so far is biased toward those employed by national science organizations and consultants specializing in high-end analytics. Emerging from this informal survey is a consistent pattern in the challenges researchers are facing in the world of not-so-open data.

 

 

First, there are problems that are common to all government data research, not just sensitive stuff. Very often, researchers come across data that may be useful to them. That leads to two challenges. First, the data is often poorly documented, which leads to guesswork, assumptions, and blind alleys. Second, researchers need to perform exploratory data analysis (EDA) to find out whether the data is useful, and what kind of hypotheses might be interesting to investigate. As a minimum, this typically requires downloading a dataset, applying some transformations, loading it into a suitable tool and, finally, checking if there is anything of potential interest there. That can be a time-consuming process. In some cases, the data isn’t freely available, which in the best case scenario leads to emails and spreadsheets going back and forth, and in the worst case, leads to a dead end. And when the EDA is complete, the data may not be of any use after all. On to the next best guess!

 

So far, that’s no different from analyzing open government data. However, an additional roadblock faced by those looking at health data, for example, where personal privacy is a critical issue, is getting approval to access such data. That’s an administrative headache that can dwarf the technical challenges of performing any initial exploration. For example, an Australian researcher told me how he needed to complete an ethics application with hundreds of questions on it. Given that this is before they learn whether the data is of any use, how many times does a researcher choose to simply walk away empty handed? Many research projects these days have very fast turn-around times, so this kind of approval process is really just another way of saying that the data is not available. We know there is a cost in terms of privacy violations if sensitive information is disclosed, but what is the cost of researchers not having access to valuable data sources that may contain vital statistical information that can lead to better policy and insights on a whole range of issues such as healthcare, social security, taxation etc?

 

I got this feedback from a small and rather biased sample. I’d really like to learn more about what kinds of EDA people need to do and the challenges in getting it done, especially where privacy issues block people from getting the information they need. There is a lot of research containing sensitive personal information, and much of it has a very high reuse value, so this is an important issue for the successful sharing of data for research. What’s your experience? Have you had to jump through hoops, or travel to a research data laboratory to analyze sensitive government data? Could it have been simpler?

SDMX Web Services

June 9th, 2010 by Don McIntosh

Recently, many of us at STR have been working on implementing open data formats, specifically SDMX 2.1 and DDI 3.1. Both are extremely relevant for statistical processing: DDI assumes the key position for planning, data collection, processing and microdata dissemination, while SDMX is most suited for processing and dissemination of aggregated data. Previous blog posts and news items have provided an overview of SDMX to inform our customers about how SDMX might help them with their own business processes. This blog post is all about what we are actually delivering with our mid-year SuperSTAR Release 7.0. The following SDMX functionality will be included:

  1. SDMX output from SuperWEB
  2. Building SDMX-driven SuperVIEW interactive presentations (with no SXV4 db required)
  3. RESTful SDMX Web Services

This post focuses on the Web Services, which are arguably the most important capability. And perhaps the other reason I’m excited by them is that this is the first time that SDMX has been introduced directly to microdata. I’ll explain what I mean by this a bit later.

From the point of view of many data providers, the advantage of the Web Services is that it can provide their customers with just the data they need, no more and no less. This can free up staff devoted to responding to ad hoc queries.

From the customer point of view, it opens up new possibilities for consuming the data and building unique, useful services on top of it. For example, a third party application can convert user responses from a Web app into dynamic SDMX queries and then the results from this can in turn be used to determine how the Web app should behave. Without Web Services, such an app would previously have relied on potentially stale data that was downloaded and loaded into a local database. And thanks to the detailed data model of SDMX, apps can also work out what other data sources might sensibly be combined together to produce richer, more useful results.

The other thing I’ll mention before getting into some specifics about what we’ve done is that our implementation is actually that of a RESTful API, not a “traditional” Web Service. We’re glad to see this becoming so much more popular now. SDMX originally only had standard SOAP-based Web Services defined, but we’ve based our implementation on the proposed RESTful API for SDMX version 2.1. As developers, a RESTful API is something we find a lot easier to start using, to explore, and to scale, and we think that our customers will find the same.

What we’ve done

The SDMX API that we are focused on can be broken into three logical chunks:

  1. Metadata Discovery - what data collections are available, and what concepts/classifications are used where
  2. Database Metadata Discovery - What metadata (eg: concepts and code lists) are used within a particular SDMX dataset?
  3. Queries - Defining and pulling back a slice of an SDMX data cube

We’ve implemented parts 2 & 3.  (Part 1 we will consider for a future version, but we are also looking at solving this gap in a different way, such as leveraging existing SDMX registries, which are used to collate and manage contents that are stored in SDMX repositories. The important thing to note here is that we don’t want SuperSTAR to be an island - many of the organisations we work with would want to reuse the same search and discovery mechanism across many different types of data and applications, so we’d like to learn more about how SDMX solutions can be part of such an environment before we proceed with this.)
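To give a feel for what a query for a slice of a cube looks like, here is a minimal sketch that builds a data query URL in the general shape described by the SDMX 2.1 RESTful guidelines: a `/data/{flow}/{key}` path in which the key lists one position per dimension, separated by `.`, with `+` joining alternative codes and an empty position acting as a wildcard. The base URL, dataflow ID, dimension names and codes are all made up for illustration, and an implementation based on the proposed API may differ in its details.

```python
from urllib.parse import quote, urlencode


def build_data_query(base_url, flow_ref, dimension_order, selection,
                     start_period=None, end_period=None):
    """Build an SDMX-REST-style data query URL.

    dimension_order: dimension IDs in the order the DSD declares them.
    selection: maps a dimension ID to a list of codes; dimensions that are
               left out are wildcarded (an empty position in the key).
    """
    unknown = set(selection) - set(dimension_order)
    if unknown:
        raise ValueError(f"dimensions not in DSD: {sorted(unknown)}")

    # Codes for one dimension are OR-ed with '+'; positions are joined with '.'.
    key = ".".join("+".join(selection.get(dim, [])) for dim in dimension_order)

    url = f"{base_url.rstrip('/')}/data/{quote(flow_ref, safe=',')}/{quote(key, safe='+.')}"

    params = {}
    if start_period:
        params["startPeriod"] = start_period
    if end_period:
        params["endPeriod"] = end_period
    if params:
        url += "?" + urlencode(params)
    return url


# Hypothetical endpoint, dataflow and codes, for illustration only.
example_url = build_data_query(
    "https://example.org/sdmx/rest/",
    "CRIME_NSW",
    ["REGION", "OFFENCE", "SEX"],
    {"REGION": ["SYD", "NEW"], "SEX": ["F"]},
    start_period="2008",
    end_period="2009",
)
print(example_url)
```

A Web app could call a helper like this each time a user changes a selection, and then fetch the resulting URL to get fresh data rather than relying on a local copy.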

Our SDMX RESTful API supports access to aggregated data that is managed by SuperSTAR. This can be from several different sources:

  1. SuperSTAR data cubes
  2. SuperSTAR tables defined by SuperWEB users
  3. SuperSTAR microdata databases

The last case is worth elaborating on, and links back to the point I mentioned earlier about introducing SDMX to microdata. Up until now, SDMX use has been limited to working with pre-aggregated data. This makes sense, especially when you consider the origins of SDMX, which is a group of organizations that deal almost solely with such aggregated statistical data and only rarely with the underlying microdata from which the statistics were derived.

From our point of view, however, and I believe from the point of view of many of our customers, dealing with microdata is very much part of the production process that they are involved in. What is useful about this is that the users are not constrained to taking slices of pre-defined cubes of data, but rather exploring and dynamically defining queries to run against the microdata. This approach can generate orders of magnitude more possible outputs and therefore relieve the provider from the burden of manually addressing many ad hoc queries that can’t be satisfied by a query against an existing cube. It does occasionally introduce other problems, namely confidentiality and performance, but these are part of our core capabilities, so our solution addresses potential drawbacks in this regard.

To make it possible to use an SDMX-based API to run tabulation queries against microdata, we’ve made some necessary innovations to the SDMX standard. Firstly, while you can query for the data structure definition (DSD) of a very large virtual cube (which is actually a SuperSTAR database), we prevent clients from requesting the full dataset for this cube - it’s simply going to be too big. What we do instead is allow for any subset of dimensions in the DSD to be combined in an SDMX query.

In addition, any tables that a user defines in SuperWEB can be accessed as SDMX datasets; both the DSD and the data from such a table can be obtained through queries against the SDMX RESTful API.
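The idea of tabulating on demand against microdata, using whichever subset of dimensions the client asks for, can be sketched in a few lines. This is only a toy illustration of the concept, not how SuperSTAR does it: the records and field names are invented, and a real service would also need the confidentiality and performance handling mentioned above.

```python
from collections import Counter


def tabulate(records, dimensions):
    """Count microdata records for each combination of the chosen dimensions."""
    return Counter(tuple(rec[d] for d in dimensions) for rec in records)


# Made-up unit records; the field names are illustrative only.
records = [
    {"region": "North", "sex": "F", "age_group": "15-24"},
    {"region": "North", "sex": "M", "age_group": "15-24"},
    {"region": "South", "sex": "F", "age_group": "25-34"},
    {"region": "North", "sex": "F", "age_group": "25-34"},
]

# The same microdata answers different queries without any pre-built cube.
by_region_sex = tabulate(records, ["region", "sex"])
by_age = tabulate(records, ["age_group"])

for cell, count in sorted(by_region_sex.items()):
    print(cell, count)
```

Because the table is computed from the unit records at query time, any combination of dimensions is available, which is why the number of possible outputs grows so quickly.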

If you’ve read this whole post, you must be interested in what we are doing here. We think that the API can be very useful for many of our customers, so please leave a comment here if you have a question or something to say. Or if you want to go one step further, let us know and we’ll discuss providing you with a test package that you can use to try the API against your own data.

Embracing Advanced Visualization - apps4NSW Comp entries

March 26th, 2010 by Jo Deeker

Space-Time Research have developed two entries for the apps4NSW competition (for New South Wales, Australia) using SuperVIEW.  The apps4NSW competition, like the Mashup Australia and Apps For Democracy competitions, invited the public to submit ideas and applications that would benefit the citizens of New South Wales.

I’m excited about our two applications because they are genuinely useful online interactive publications of complex data that everyone will benefit from.  Our Why Australians Travel application presents a dataset from Tourism Research Australia that has not been made available to the public in an interactive way before.  It also includes advanced visualization in the form of a Motion Chart (Gapminder-style) which we’re very excited by! The motion chart can tell a story with data over time that you simply don’t see in static tables or reports.

The How Safe Is Your Suburb 2.0 application provides NSW Crime data in an interactive way, allowing users to analyse relative crime rates or absolute crime rates by suburb. This application is supported by one of our newest features, metadata, where explanations about the data are provided to the user to help them understand the meaning of the data.

Go check our applications out and vote for us if you like them!  And if you have any feedback on our entries please don’t hesitate to make a comment on our blog here.

Gov 2.0 Radio Interview: The Future of Privacy

March 18th, 2010 by Jo Deeker

Don McIntosh was recently a guest on Gov 2.0 Radio discussing the future of Privacy and how it relates to data.

Said Don:
“Many people, especially Gen Y, have the view that privacy is not an issue for them and to quote Eric Schmidt, ‘If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.’ I much prefer the view of Bruce Schneier, who is pretty much the world’s leading expert in information security, who points out in an excellent essay very clearly that people espousing that view ‘… accept the premise that privacy is about hiding a wrong. It’s not. Privacy is an inherent human right, and a requirement for maintaining the human condition with dignity and respect.’”

Click here to listen to the podcast.

Introducing SuperVIEW Collaboration

February 3rd, 2010 by Jo Deeker

SuperVIEW is our solution for Interactive Publication, Exploration & Visualization of Public Data. Our latest version has a new collaboration feature that we want to share with you.

Using our new SuperVIEW Collaboration features, you can make comments or invite others to make comments on your visualizations using Google Friend Connect.  You can also share your customized visualisation with others using our new Share feature. The Share feature allows you to embed a link to your view in a website, blog, Facebook, Twitter or your other favorite social networking application.

Recently Craig Thomler, a well-known active participant and leader in the Australian Gov2.0 movement, wrote a blog post on the new data.gov.uk site, which he considers the world leader in open data websites. He then goes on to make a wishlist of what we could do in Australia to the data.australia.gov.au site to make it the best in the world. Some of what he is asking for is delivered by SuperVIEW right now, including the ability for people to embed visualizations into their own sites, and for every set of data to support a discussion so that people can ask questions to clarify what the dataset contains and discuss how it could be presented in a more usable way.

View this video to see SuperVIEW Collaboration in action.

If you have any questions about SuperVIEW please contact  jo.deeker@spacetimeresearch.com

Do government agencies know enough about the limits of anonymization?

January 18th, 2010 by Don McIntosh

There is a new wave of open government data scheduled to crash over the US on January 22 resulting from the government’s Open Government Directive. Is the government paying enough attention to data privacy issues that this deluge could trigger, and how aware are agencies of the well-established fact that anonymizing data is often an inadequate means of protecting privacy in public sector information, and that in many cases more “scrubbing” of the data is needed before any part of it can be safely released for public use?

Until recently, many government agencies have not been motivated to provide data transparency. Compared with the work that directly aligns with their mission and funding, being a visionary supporter of the principles of transparent government is not really high on the agenda. In fact, in many cases, the message from up high hasn’t really reached them at all (one senior US government official’s take on Gov 2.0 was “oh, that’s a subset of Web 2.0 isn’t it?”). If you add to this reluctance the quite significant disincentives such as the risks of being too transparent, inadvertent privacy breaches, and plain and simple costs, then it’s not surprising that the average department hasn’t been as enthusiastic as the Gov 2.0 activist community might like them to be. And if the ROI on the whole deal is often external, why bother?

Well, there’s nothing like a directive straight from the top to get things moving. As of December 8, U.S. federal agencies had 45 days to get three “high-value datasets” published online and available through data.gov. Wow! Having worked with national statistics agencies for many years, I have some grasp of how long they typically take to publish data and it’s often longer than this, especially when you are dealing with data that has not previously been published. Of course, the data in some cases might be basic lists of non-sensitive material, in which case perhaps it is not too much extra work to make it suitable for public access. What I’m interested in examining is what it will take for agencies that don’t have it that easy, who will need to derive statistics from their data, or reduce it in some way to make it “safe” for public consumption.

Firstly, why bother publishing statistics if the raw data is available? Isn’t the open data community interested in getting “raw data now”, so that it’s quick for the agency and promises maximum flexibility for users? The reality in many cases — and one that seems to still be ignored by some who work in Information Management — is that even after you “de-identify” data by stripping obviously identifying attributes from it such as names, addresses, SSNs, etc, it does not necessarily protect privacy. It can still be a fairly trivial exercise for an ill-meaning data analyst, or even a non-technical person in many cases, to re-identify many of the people in the list. That is why in many cases we’ll see statistics being released about the data, rather than the raw data itself.

Associate Professor of Law Paul Ohm from the University of Colorado released a paper about the “Surprising Failure of Anonymization” last year, citing some prominent cases where anonymized data was re-identified and pointing out that there are many laws and regulations that are based on the false assumption of anonymization being a panacea for data privacy protection. In one example he describes, a researcher demonstrated how 87.1% of people in the U.S. were uniquely identified by their combined ZIP code, birth date, and sex. He also covers the AOL search data scandal, where individuals were identified from vast volumes of data by their unique search habits, uncovering some embarrassing personal information along the way.
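To see why stripping names is not enough, it helps to count how many records are the only one with their particular combination of quasi-identifiers such as ZIP code, birth date and sex. The sketch below does that on a handful of invented records; the data, field names and resulting figures are purely illustrative and say nothing about real populations.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(
        1 for r in records
        if combos[tuple(r[q] for q in quasi_identifiers)] == 1
    )
    return unique / len(records)


# Invented "de-identified" records: no names, but quasi-identifiers remain.
people = [
    {"zip": "80302", "birth_date": "1970-03-14", "sex": "F", "diagnosis": "A"},
    {"zip": "80302", "birth_date": "1970-03-14", "sex": "M", "diagnosis": "B"},
    {"zip": "80302", "birth_date": "1982-11-02", "sex": "F", "diagnosis": "C"},
    {"zip": "80303", "birth_date": "1982-11-02", "sex": "F", "diagnosis": "A"},
    {"zip": "80303", "birth_date": "1982-11-02", "sex": "F", "diagnosis": "D"},
]

combined = unique_fraction(people, ["zip", "birth_date", "sex"])
zip_only = unique_fraction(people, ["zip"])
print(f"unique on zip + birth date + sex: {combined:.0%}")
print(f"unique on zip alone: {zip_only:.0%}")
```

Each attribute on its own looks harmless, but anyone holding another list that links the same three attributes to names can re-identify the records that are unique on the combination.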

While the individual agencies may not all have a clear understanding of all the potential privacy issues related to open data, at least the federal administration does have a focus on this. The directive itself states that data can only be made available “subject to valid privacy, confidentiality ….. restrictions”. In addition, the “Concept of Operations” paper for data.gov does have privacy in its sights, stating that there will be working groups looking into privacy issues arising from how data is mashed up and/or used in applications. I would point out that these groups could make an early head start simply by reading Paul Ohm’s paper, and not wait until after this round of data has been released. It seems that for the moment at least, the idea of what constitutes adequate privacy protection for open data is really up to each agency to decide.

While the working groups deliberate how privacy issues that result from data mashups and the like should be addressed, many datasets will be posted to data.gov. Despite the proven limits of anonymization, the experience that my colleagues and I have gained from talking with people who work in Information Management in government is that key staff in at least some agencies are not sufficiently aware of those limits and that, in their view, anonymization is essentially all you need to do to make data safe for release. I’d be interested to know if this agrees with others’ observations.

My observation regarding government’s understanding of data privacy issues is based largely on anecdotal evidence collected by myself and my colleagues. Perhaps I am overstating things and agencies do have the required skills and knowledge to release data safely. It would be good to hear about how different agencies are dealing with the Open Data Directive and what you think about the challenges of releasing useful data without unduly compromising privacy.

Note: Ohm’s paper is fairly lengthy. For a very interesting summary of the paper, you can check out this post on ars technica, which sparked a lot of debate regarding the importance of privacy.

SuperVIEW Version 1.4

December 4th, 2009 by Jo Deeker

Every month we release a new build of SuperVIEW and the team behind the development are Agile masters. Each build contains new and improved features for data geeks, new visualizations, and of course fixes for the bugs… we even go on safaris to find them.

Hybrid Cloud Service

The SuperVIEW Hybrid Cloud Service consists of two components:

  • The SuperVIEW Web application in a cloud service provided by the Google App Engine.
  • The application is connected to the ‘back-end’ SuperSTAR server that cross-tabulates and processes the data.

Learn more about the Hybrid Cloud Service ….

Showcase Visualizations

Top-N charts

Top-N charts sort and filter datasets to provide an easy visual comparison of relative data event sizes. They allow you to integrate very large classifications into SuperVIEW sites by filtering in only the Top-N items in a given query. For example the top 10 locations out of 100.
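The filtering step behind a Top-N chart is simple to sketch: keep only the N items with the largest values, largest first. The snippet below is a generic illustration with made-up figures, not the SuperVIEW implementation.

```python
import heapq


def top_n(values_by_item, n):
    """Return the n (item, value) pairs with the largest values, largest first."""
    return heapq.nlargest(n, values_by_item.items(), key=lambda kv: kv[1])


# Made-up counts per location.
counts = {"Sydney": 420, "Newcastle": 95, "Wollongong": 130, "Albury": 40, "Dubbo": 55}
top_3 = top_n(counts, 3)
print(top_3)
```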


Population Pyramid

The Population Pyramid enables a visualization of demographic trends through population pyramids that stack two distributions back to back and side by side.


Previews of New Visualizations

Dual-axis Chart

The Dual Axis Chart plots two data series against each other.


Timeline Chart

The Timeline Chart is based on the Google Visualization API. This chart allows you to select a time period from a scale at the base of a chart, and then see the data updated and stretched to fit the width of the screen. You can also zoom in and out of a time range.


Side-by-Side Pie Charts

This allows you to view two pie charts side-by-side.


Features for Data Geeks

Dynamic Recodes

The configuration of the dynamic recodes feature used in the Data Selection Experience has been streamlined. You now can multi-select or de-select filters.

Want more

Contact Space-Time Research if you want more information or leave us a comment in this blog post.

Gov 2.0 for Koalas - Community vs. Government Data

December 3rd, 2009 by Don McIntosh


I heard a debate on the radio on Tuesday about whether koalas should be classified as an endangered species. There’s an article from ABC news from last month that covers the issue quite well. Oddly enough, I was reminded of it when I had a chat with Gartner analyst Andrea Di Maio that same evening when he pointed out what he called the asymmetry of Gov 2.0. What he was referring to was the fact that many communities have data of their own, and that the standards that we are demanding of government are in no way being reciprocated in terms of what is expected of communities. A question for us and for our customers (typically government agencies) is what should government do with data owned and collected by the community?

How many koalas are there? Are they a threatened species, or endangered? What do we need to do to make sure that Australia retains a diverse, healthy population of koalas? This is a hotly debated topic, with the government accused of siding with property developers at the expense of many hectares of koala habitat. As much as I’m worried about predictions of extinction of koalas within 30 yrs, I’m not trying to push either side of the argument in this post. I’ll leave that to those who are better informed about this. What I do want to do is explore what should be done with “unofficial” statistics.

So here’s the problem: we have official statistics produced by government derived from data that is objectively collected, categorized and disseminated in keeping with scientific survey practices. And then on the flip side, we have passionate communities conducting their own research which increasingly seems to involve collecting data and producing statistics. In terms of quality, I imagine that the output varies a lot. But it is data, and potentially useful data. How should government deal with that, especially where they plan or need to have data that clearly overlaps with what already exists? Could government help turn them into official statistics? I would suspect that in many cases the answer would be a rather emphatic no. However, perhaps there would be cases where there may be some benefit to government to be gained from acknowledging and making some use of community-sourced data.

At the front end of the statistical business process model published by the UNECE, there is a planning phase where existing data sources are considered for inclusion in official collections. Here’s what the UNECE says in step 1.5:

“Check Data Availability: This sub-process checks whether current data sources could meet user requirements, and the conditions under which they would be available, including any restrictions on their use. An assessment of possible alternatives would normally include research into potential administrative data sources and their methodologies, to determine whether they would be suitable for use for statistical purposes. When existing sources have been assessed, a strategy for filling any remaining gaps in the data requirement is prepared…”

I’d take away from that that the authors had absolutely no thought in their minds about community data. So, if I was a community activist, I’d say that means that there is room for it to happen. After all, it doesn’t explicitly preclude a statistician from asking around to see if anyone else is out there counting our cuddly little Aussie icons. Perhaps there would be valid cases where government could collaborate in some way so that either the quality of the output is improved, or at the very least it can be better understood and therefore used appropriately.

There are a couple of issues that spring to mind…

  • Biased and/or poor quality evidence. Communities are typically passionate and biased to a particular point of view. With no standards or checks in place to determine data quality, the government would be right to be highly skeptical of any “facts” presented. The CEO of the Australian Koala Foundation (AKF) noted that over 20 years, 2000 field sites have been looked at and over 80,000 trees. Is that enough? For what? Should the government do more? Well, at the very least, they should do some due diligence, or even better, demand a bit of transparency of the AKF, which I suspect they would be quite willing to provide. It’s right that we demand transparency of the government, but it is equally right that community groups offering evidence to support their claims should be held to a similar standard. Maybe government departments need to band together to demand Community 2.0 ?
  • Inappropriate use of anecdotal evidence. Let’s face it, right now there are no doubt many policies that are based on little more than personal opinions of government executives rather than any solid evidence. People regularly draw conclusions based on direct experiences, or from stories of those they trust. Here’s a simple case in point from a comment on the ABC’s article: “Last year in the Otways I saw koalas where I had never seen them before. Seems to me that their numbers are increasing and a good thing too.” Let’s hope that’s not from environment minister Peter Garrett. This is a great evolutionary attribute that allows us to form opinions on things that might actually affect us but it doesn’t serve us well when we choose to use it to form a model of complex, widespread populations with many different local influences at play. I hardly need to point out what a huge role data can play when it comes to making informed decisions.
  • Real experts in the public ready to make a contribution. There are many informed and passionate members both within the official communities, as well as in the public at large. What if we could give them a little bit more in the way of facts and figures to work with? As it is, there is a fair bit of scientific knowledge introduced by commenters. One knowledgeable commenter had a fascinating insight into the problem: “Koala populations are notoriously difficult to monitor. They are such a specialized animal that a minor change in habitat can lead to local extinctions in one area while they pop up somewhere else where they haven’t been seen in living memory.” Well, it sounds like he knows about it. It would be helpful if there was a way for him to easily reference credible evidence to back that up.

As one commenter noted: “At least it’s a positive step to have some dialogue over koala population density, and clearly there are big differences in estimates.” Yep, that pretty much sums it up. The question is, how do we get to the next step? Personally, I don’t know. I think that Andrea got it right when he said that government should acknowledge the existence of the community data sources. What they do next is an open question. Having surveys with tens of thousands of data points may still be unreliable depending on the use, but it may be better than making conclusions based on what Fred saw on his weekend trip to the Otways. I’d love to hear what other community vs government data debates people have had, and what the outcome was.

Australian Privacy Awards 2009 - “Hey, that’s just what we do!!”

November 13th, 2009 by Don McIntosh

Most people have an opinion about privacy these days, from Scott McNealy’s memorable throwaway line “You have zero privacy. Get over it”, to the fierce concerns many people have around how much information Google stores about each and every one of us. Well, I certainly feel it’s important and it was great to have the opportunity to meet many other like-minded people at the Australian Privacy Awards dinner last night.

Special Minister of State and Cabinet Secretary Senator Joe Ludwig started the night with a good overview of the state of play, with many people and organisations struggling to come to terms with technology advances such as social networking that have such far-reaching effects on privacy. He mentioned the need for balancing government transparency and protecting personal information so many times that I felt like jumping up and saying “Hey, that’s just what we do!!”

It certainly was an honour to receive the “highly commended” award in our category on Space-Time Research’s behalf and I’d like to thank the Office of the Privacy Commissioner for giving us the opportunity to be part of the whole event, and to meet and talk with so many people who work in this area. However, what I really wanted to mention in this post was a couple of award winners that I found particularly interesting.

Dr Roger Clarke was a worthy winner of the Australian Privacy Medal. Dr Clarke used his speech to remind the audience that there was a lot of real work that needed to be done, and that he felt his medal was a little tarnished, because some people including some of the award winners were essentially just window dressing (not his terms - but I think that was the gist of it), and not really applying a genuine effort to promote privacy. Rooms are typically politely quiet when people give speeches but I think in this case it was a pregnant, slightly awkward kind of quiet.

There’s some great information on Dr Roger Clarke’s website about information privacy. In fact, I came across one note where he mentioned data protection that has made me rethink why we are using this term. He makes a really good point that many laws focus on data protection, where the focus is protecting data about people. As he explains, the real issue is to protect the people and you do that by considering what information might be derived from the data, rather than just protecting the data itself. Very good point.

Another winner I really liked was the Victorian Department of Justice (and not just because they are our customer). Who would have thought promoting privacy practices could be so fun or entertaining? Well, the people at Department of Justice certainly do. As an example, their most recent idea is to put together a radio show based on the X-Files concept. It will be called the P-files, with some really witty variations on Scully and Mulder’s names that have totally slipped my mind. One way or another, they plan to slip in the line “is that a USB stick in your pocket or are you just pleased to see me?” It was really refreshing to hear about their work and I do hope they have inspired many people there to take an equally innovative and enthusiastic response not just to promoting privacy practices, but to many other aspects of their work. I’m sure that even Dr Clarke would agree that they were really deserving winners.

Until 18 mths ago, I’d never heard of the office of the Privacy Commissioner. Now I know a whole community of people who are working to help Australians find the right balance and have some control of what parts of their lives are public knowledge. Privacy may not seem like an important issue to many in this age of Facebook and with the attitudes of Gen Y but I think Roger summed it up very nicely in his speech: “Privacy doesn’t matter until it does.”
