Archive for the ‘General’ Category

Crowd Sourcing, Twitter, and Trust - Natstats 2010

Friday, September 17th, 2010 by Don McIntosh

Natstats opened on Wednesday evening and had its first full day today. For those who aren’t aware, it’s a conference all about statistics organized by the Australian Bureau of Statistics, with the theme of this event being around “Measuring what counts: economic development, wellbeing and progress in 21st century Australia”. It’s only the second time it has been staged but judging by the success thus far and quality of speakers, certainly not the last. I’ll leave official reports of talks etc to others but I would like to share some stories from people I’ve had the pleasure of talking with during the conference today. I invite attendees to add their own stories in the comments section.

Landlines

Here’s an odd one to start with. Associate Professor Warren Laffen works in the Institute for Social Science Research, which is part of the University of Queensland. He is involved in a huge range of research projects and one in particular caught my interest. It’s a study they only recently commenced comparing profiles of people who have only mobile phones to those who use only landlines. Why would you want to do that? Well, apparently some phone surveys only use landline numbers and the idea behind this research is to find out how the results of such surveys might be biased because of this. It struck me as rather obscure but at the same time, I could understand the purpose and value in running such a study. It’s the same with many stats collected – they may not be of value to us all but for some, they are very important and meaningful. I formed a picture in my mind of the “landliners”: homely elderly folk sitting around knitting and never having churned from Telstra to another provider, let alone considered moving to a mobile. I was a little surprised when I sat down for the Natstats dinner this evening and discovered that my 20-something year old neighbor and her husband, both with white collar jobs, living in Hobart, had only one landline to share at home and no mobiles.

Wikiprogress Crowd Sourcing

Philippa Lysaght from the OECD introduced me to an intriguing idea for a statistical Web site in Wikiprogress. Partially funded by the OECD, and also by many independent supporters, the idea behind this innovative site is to measure progress of societies from around the world. Read more about what it’s all about here. Being a wiki, it gathers statistics from any members of “the community” who choose to contribute. There are various indicators based on official stats on the site, as well as data created and shared by individual researchers, academics and the like. An obvious question statisticians would have about this is how the quality of the data is ascertained if everyone has the freedom to contribute. Well, that’s part of the challenge for Wikiprogress and something that Philippa said they are working hard to keep on top of. They certainly have plenty of officially sourced statistics that have been contributed, and working out reasonable ways of accepting and presenting statistics from related (or not) communities is something that they are managing so far. Perhaps they might help us to find a middle path that can help reconcile the gap between official and community statistics (see the earlier post about community koala data vs official sources), especially given there are so many things we want to count and only so many statistical organizations around to do the work.

Twitter

Another interesting thing to see was the level of Twitter use at the conference. Jeanette Cotterill, who was the lead person at ABS responsible for organizing the event (well done, Jeanette and team!!), explained that there had been some concerns about making Twitter an official part of Natstats communications, but the decision was taken that it was a good opportunity to engage with people and indeed, so far there has been some happy Tweeting from a number of participants, as well as from the official @Natstats2010 Twitterer. Seeing as it’s a stats conference, I’d be remiss if I didn’t note something about Twitter use in stats: there were 10 unique people on Twitter on this first day tweeting about Natstats, out of an overall attendance of around 500. Maybe someone can work out how that profile compares to the proportion of Twitter users in the population at large: are Natstats attendees more or less likely to use Twitter than other folk?
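For anyone who wants to take up that question, here is a minimal sketch of the arithmetic. The 10 and 500 come from the figures above; the population rate is deliberately left as an input because this post does not give one, and the function names are just illustrative. Bear in mind that "around 500" is approximate and that tweeting about the event is not quite the same thing as being a Twitter user, so treat the result as a rough guide only.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

tweeters, attendees = 10, 500        # figures quoted in the post
share = tweeters / attendees         # 10 / 500 = 0.02, i.e. 2%
low, high = wilson_interval(tweeters, attendees)

def compare(population_rate):
    """Say how the attendee share sits relative to a population rate.

    population_rate is whatever figure the reader trusts for Twitter
    use in the population at large; none is given in the post.
    """
    if population_rate > high:
        return "lower"        # attendees tweet less than the population
    if population_rate < low:
        return "higher"       # attendees tweet more than the population
    return "consistent"       # the interval cannot tell them apart
```

Calling `compare()` with a published population figure would then answer the question, at least to the precision that ten tweeters allow.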

Trust

I’ll close with another statistic, this one from the ABS chief statistician, Brian Pink. He noted that 92% of the Australian public trust official statistics. Oddly enough, that stat doesn’t come from the ABS but from an independent survey that the ABS has commissioned to find out more about public opinion around official statistics. You’ll have to wait for World Statistics Day on October 20 to find out more about the results.

Well, thanks so much to the many people who attended Natstats and have made it a thoroughly enjoyable experience for me and my colleague Mark Humphreys and no doubt many others. Please do feel free to add your own sentiments or stories about Natstats in the comments. I wish you well on day two and look forward to speaking with many of you at our booth (and if you mention this post, you’ll get an extra lolly of your choice).

SDMX Web Services

Wednesday, June 9th, 2010 by Don McIntosh

Recently, many of us at STR have been working on implementing open data formats, specifically SDMX 2.1 and DDI 3.1. Both are extremely relevant for statistical processing - DDI assumes the key position for planning, data collection, processing and microdata dissemination. SDMX is most suited for processing and dissemination of aggregated data. Previous blog posts and news items have provided an overview of SDMX to inform our customers about how SDMX might help them with their own business processes. This blog post is all about what we are actually delivering with our mid-year SuperSTAR Release 7.0. The following SDMX functionality will be included:

  1. SDMX output from SuperWEB
  2. Building SDMX-driven SuperVIEW interactive presentations (with no SXV4 db required)
  3. RESTful SDMX Web Services

This post focuses on the Web Services, which are arguably the most important capability. The other reason I’m excited by them is that this is the first time that SDMX has been introduced directly to microdata. I’ll explain what I mean by this a bit later.

From the point of view of many data providers, the advantage of the Web Services is that it can provide their customers with just the data they need, no more and no less. This can free up staff devoted to responding to ad hoc queries.

From the customer point of view, it opens up new possibilities for consuming the data and building unique, useful services on top of it. For example, a third party application can convert user responses from a Web app into dynamic SDMX queries and then the results from this can in turn be used to determine how the Web app should behave. Without Web Services, such an app would previously have relied on potentially stale data that was downloaded and loaded into a local database. And thanks to the detailed data model of SDMX, apps can also work out what other data sources might sensibly be combined together to produce richer, more useful results.
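To make the "user responses into dynamic SDMX queries" idea a little more concrete, here is a rough sketch of how a Web app might turn a user's selections into the key part of an SDMX data query. It assumes the key syntax from the published SDMX 2.1 REST conventions (one position per dimension separated by dots, `+` between alternative codes, an empty position meaning "any value"). The dimension names and codes are made up for illustration, and the syntax our API accepts may differ in detail.

```python
def build_key(dimension_order, selections):
    """Turn user selections into an SDMX-style series key.

    dimension_order: dimension IDs in the order the DSD lists them.
    selections: dict mapping a dimension ID to the codes the user chose.
    Dimensions the user did not touch are left empty (wildcard).
    """
    unknown = set(selections) - set(dimension_order)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    parts = []
    for dim in dimension_order:
        codes = selections.get(dim, [])
        parts.append("+".join(codes))    # "+" joins alternative codes
    return ".".join(parts)               # "." separates dimensions

# Hypothetical dimensions and codes, standing in for form input.
dims = ["REGION", "SEX", "AGE_GROUP"]
answers = {"REGION": ["TAS"], "AGE_GROUP": ["20_24", "25_29"]}
key = build_key(dims, answers)           # "TAS..20_24+25_29"
```

The app would then send the key to the service and use whatever comes back to decide what to show next, rather than relying on a local copy of the data.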

The other thing I’ll mention before getting into some specifics about what we’ve done is that our implementation is actually that of a RESTful API, not a “traditional” Web Service. We’re glad to see this becoming so much more popular now. SDMX originally only had standard SOAP based Web Services defined, but we’ve based our implementation on the proposed RESTful API for SDMX version 2.1. As developers, a RESTful API is something we find a lot easier to start using, to explore, and to scale, and we think that our customers will find the same.

What we’ve done

The SDMX API that we are focused on can be broken into three logical chunks:

  1. Metadata Discovery - what data collections are available, and what concepts/classifications are used where
  2. Database Metadata Discovery - What metadata (eg: concepts and code lists) are used within a particular SDMX dataset?
  3. Queries - Defining and pulling back a slice of an SDMX data cube

We’ve implemented parts 2 & 3.  (Part 1 we will consider for a future version, but we are also looking at solving this gap in a different way, such as leveraging existing SDMX registries, which are used to collate and manage contents that are stored in SDMX repositories. The important thing to note here is that we don’t want SuperSTAR to be an island - many of the organisations we work with would want to reuse the same search and discovery mechanism across many different types of data and applications, so we’d like to learn more about how SDMX solutions can be part of such an environment before we proceed with this.)
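As an illustration of what parts 2 and 3 look like from a client's side, here is a small sketch that builds the request URLs. The resource names (`datastructure`, `data`) and the `all` and `latest` keywords follow the published SDMX 2.1 REST conventions; the base URL and the IDs are placeholders, and the exact paths exposed by SuperSTAR may not match this.

```python
def structure_url(base_url, dsd_id, agency="all", version="latest"):
    """Part 2: URL for the data structure definition (concepts, code lists)."""
    return f"{base_url.rstrip('/')}/datastructure/{agency}/{dsd_id}/{version}"

def data_url(base_url, flow_id, key="all"):
    """Part 3: URL for a slice of a dataset, selected by key."""
    return f"{base_url.rstrip('/')}/data/{flow_id}/{key}"

# Placeholder endpoint and identifiers, not a real service.
BASE = "https://example.org/sdmx/rest/"
dsd_request = structure_url(BASE, "DSD_POPULATION")
slice_request = data_url(BASE, "POPULATION", "TAS..20_24")
```

A client would normally fetch the structure first, read the dimensions and codes out of it, and only then build a data request.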

Our SDMX RESTful API supports access to aggregated data that is managed by SuperSTAR. This can be from several different sources:

  1. SuperSTAR data cubes
  2. SuperSTAR tables defined by SuperWEB users
  3. SuperSTAR microdata databases

The last case is worth elaborating on, and links back to the point I mentioned earlier about introducing SDMX to microdata. Up until now, SDMX use has been limited to working with pre-aggregated data. This makes sense, especially when you consider the origins of SDMX, which is a group of organizations that deal almost solely with such aggregated statistical data and only rarely with the underlying microdata from which the statistics were derived.

From our point of view, however, and I believe from the point of view of many of our customers, dealing with microdata is very much part of the production process that they are involved in. What is useful about this is that the users are not constrained to taking slices of pre-defined cubes of data, but rather exploring and dynamically defining queries to run against the microdata. This approach can generate orders of magnitude more possible outputs and therefore relieve the provider from the burden of manually addressing many ad hoc queries that can’t be satisfied by a query against an existing cube. It does occasionally introduce other problems, namely confidentiality and performance, but these are part of our core capabilities, so our solution addresses potential drawbacks in this regard.
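To put a rough number on "orders of magnitude": if a database has n classification fields, every non-empty combination of them is a possible table, which gives 2^n - 1 combinations before filters or recodes are even considered. The sketch below just does that counting; the field names are invented for the example.

```python
from itertools import combinations

def count_tables(n_dimensions):
    """Number of non-empty subsets of n dimensions: 2**n - 1."""
    return 2 ** n_dimensions - 1

def list_tables(dimensions):
    """Enumerate every non-empty combination of the given dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]

# Hypothetical field names, small enough to list in full.
fields = ["REGION", "SEX", "AGE_GROUP"]
tables = list_tables(fields)             # 7 combinations
for n in (3, 10, 20):
    print(n, "fields ->", count_tables(n), "possible tables")
```

With 20 fields the count is already over a million, compared with the handful of cubes a provider could realistically prepare in advance.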

To make it possible to use an SDMX-based API to run tabulation queries against microdata, we’ve made some necessary innovations to the SDMX standard. Firstly, while you can query for the data structure definition (DSD) of a very large virtual cube (which is actually a SuperSTAR database), we prevent clients from requesting the full dataset for this cube - it’s simply going to be too big. What we do instead is allow for any subset of dimensions in the DSD to be combined in an SDMX query.
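One way to picture that rule is as a check that runs before any tabulation starts. The sketch below is not our server code, just an illustration under a stated assumption: that "the full dataset" means a request crossing every dimension in the DSD, so any smaller, valid combination is allowed. The names are hypothetical.

```python
class QueryRejected(ValueError):
    """Raised when a request falls outside the assumed policy."""

def check_dimensions(dsd_dimensions, requested):
    """Validate requested dimensions against the DSD.

    Returns the requested dimensions in DSD order, or raises
    QueryRejected. Refusing the full cube is an assumed policy.
    """
    unknown = [d for d in requested if d not in dsd_dimensions]
    if unknown:
        raise QueryRejected(f"not in the DSD: {unknown}")
    if len(set(requested)) != len(requested):
        raise QueryRejected("a dimension was requested more than once")
    if not requested:
        raise QueryRejected("at least one dimension is required")
    if set(requested) == set(dsd_dimensions):
        raise QueryRejected("the full cube cannot be requested")
    return [d for d in dsd_dimensions if d in requested]

# Hypothetical DSD for a virtual cube backed by a microdata database.
dsd = ["REGION", "SEX", "AGE_GROUP", "INCOME"]
ok = check_dimensions(dsd, ["SEX", "REGION"])   # ["REGION", "SEX"]
```

A real service would also need to apply confidentiality and size limits to the result, which this check does not attempt.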

In addition, any tables that a user defines in SuperWEB can be accessed as SDMX datasets; both the DSD and the data from such a table can be obtained through queries against the SDMX RESTful API.

If you’ve read this whole post, you must be interested in what we are doing here. We think that the API can be very useful for many of our customers, so please leave a comment here if you have a question or something to say. Or if you want to go one step further, let us know and we’ll discuss providing you with a test package that you can use to try the API against your own data.

Introducing SuperVIEW Collaboration

Wednesday, February 3rd, 2010 by Jo Deeker

SuperVIEW is our solution for Interactive Publication, Exploration & Visualization of Public Data. Our latest version has a new collaboration feature that we want to share with you.

Using our new SuperVIEW Collaboration features, you can make comments or invite others to make comments on your visualizations using Google Friend Connect.  You can also share your customized visualisation with others using our new Share feature. The Share feature allows you to embed a link to your view in a website, blog, Facebook, Twitter or your other favorite social networking application.

Recently Craig Thomler, a well-known active participant and leader in the Australian Gov2.0 movement, wrote a blog post on the new data.gov.uk site, which he considers the world leader in open data websites. He then goes on to make a wishlist of what we could do in Australia to the data.australia.gov.au site to make it the best in the world. Some of what he is asking for is delivered by SuperVIEW right now, including the ability for people to embed visualizations into their own sites, and for every set of data to support a discussion so that people can ask questions to clarify what the dataset contains and discuss how it could be presented in a more usable way.

View this video to see SuperVIEW Collaboration in action.

If you have any questions about SuperVIEW please contact jo.deeker@spacetimeresearch.com

Gov 2.0 for Koalas - Community vs. Government Data

Thursday, December 3rd, 2009 by Don McIntosh


I heard a debate on the radio on Tuesday about whether koalas should be classified as an endangered species. There’s an article from ABC news from last month that covers the issue quite well. Oddly enough, I was reminded of it when I had a chat with Gartner analyst Andrea Di Maio that same evening when he pointed out what he called the asymmetry of Gov 2.0. What he was referring to was the fact that many communities have data of their own, and that the standards that we are demanding of government are in no way being reciprocated in terms of what is expected of communities. A question for us and for our customers (typically government agencies) is what should government do with data owned and collected by the community?

How many koalas are there? Are they a threatened species, or endangered? What do we need to do to make sure that Australia retains a diverse, healthy population of koalas? This is a hotly debated topic, with the government accused of siding with property developers at the expense of many hectares of koala habitat. As much as I’m worried about predictions of extinction of koalas within 30 years, I’m not trying to push either side of the argument in this post. I’ll leave that to those who are better informed about this. What I do want to do is explore what should be done with “unofficial” statistics.

So here’s the problem: we have official statistics produced by government derived from data that is objectively collected, categorized and disseminated in keeping with scientific survey practices. And then on the flip side, we have passionate communities conducting their own research which increasingly seems to involve collecting data and producing statistics. In terms of quality, I imagine that the output varies a lot. But it is data, and potentially useful data. How should government deal with that, especially where they plan or need to have data that clearly overlaps with what already exists? Could government help turn them into official statistics? I would suspect that in many cases the answer would be a rather emphatic no. However, perhaps there would be cases where there may be some benefit to government to be gained from acknowledging and making some use of community-sourced data.

At the front end of the statistical business process model published by the UNECE, there is a planning phase where existing data sources are considered for inclusion in official collections. Here’s what the UNECE says in step 1.5:

“Check Data Availability: This sub-process checks whether current data sources could meet user requirements, and the conditions under which they would be available, including any restrictions on their use. An assessment of possible alternatives would normally include research into potential administrative data sources and their methodologies, to determine whether they would be suitable for use for statistical purposes. When existing sources have been assessed, a strategy for filling any remaining gaps in the data requirement is prepared…”

I’d take away from that that the authors had absolutely no thought in their minds about community data. So, if I was a community activist, I’d say that means that there is room for it to happen. After all, it doesn’t explicitly preclude a statistician from asking around to see if anyone else is out there counting our cuddly little Aussie icons. Perhaps there would be valid cases where government could collaborate in some way so that either the quality of the output is improved, or at the very least it can be better understood and therefore used appropriately.

There are a couple of issues that spring to mind…

  • Biased and/or poor quality evidence. Communities are typically passionate and biased to a particular point of view. With no standards or checks in place to determine data quality, the government would be right to be highly skeptical of any “facts” presented. The CEO of the Australian Koala Foundation (AKF) noted that over 20 years, 2000 field sites and over 80,000 trees have been looked at. Is that enough? For what? Should the government do more? Well, at the very least, they should do some due diligence, or even better, demand a bit of transparency of the AKF, which I suspect they would be quite willing to provide. It’s right that we demand transparency of the government, but it is equally right that community groups offering evidence to support their claims should be held to a similar standard. Maybe government departments need to band together to demand Community 2.0?
  • Inappropriate use of anecdotal evidence. Let’s face it, right now there are no doubt many policies that are based on little more than personal opinions of government executives rather than any solid evidence. People regularly draw conclusions based on direct experiences, or from stories of those they trust. Here’s a simple case in point from a comment on the ABC’s article: “Last year in the Otways I saw koalas where I had never seen them before. Seems to me that their numbers are increasing and a good thing too.” Let’s hope that’s not from environment minister Peter Garrett. This is a great evolutionary attribute that allows us to form opinions on things that might actually affect us but it doesn’t serve us well when we choose to use it to form a model of complex, widespread populations with many different local influences at play. I hardly need to point out what a huge role data can play when it comes to making informed decisions.
  • Real experts in the public ready to make a contribution. There are many informed and passionate members both within the official communities, as well as in the public at large. What if we could give them a little bit more in the way of facts and figures to work with? As it is, there is a fair bit of scientific knowledge introduced by commenters. One knowledgeable commenter had a fascinating insight into the problem: “Koala populations are notoriously difficult to monitor. They are such a specialized animal that a minor change in habitat can lead to local extinctions in one area while they pop up somewhere else where they haven’t been seen in living memory.” Well, it sounds like he knows about it. It would be helpful if there was a way for him to easily reference credible evidence to back that up.

As one commenter noted: “At least it’s a positive step to have some dialogue over koala population density, and clearly there are big differences in estimates.” Yep, that pretty much sums it up. The question is, how do we get to the next step? Personally, I don’t know. I think that Andrea got it right when he said that government should acknowledge the existence of the community data sources. What they do next is an open question. Having surveys with tens of thousands of data points may still be unreliable depending on the use, but it may be better than making conclusions based on what Fred saw on his weekend trip to the Otways. I’d love to hear what other community vs government data debates people have had, and what the outcome was.

My favourite sites

Tuesday, September 22nd, 2009 by Jo Deeker

My three favourite sites at the moment are:

As we enable public intelligence and data provision, and we’re an Australian based company, I have to keep on top of this every day. I love how fast ideas are moving.
http://gov2.net.au

For all goodness in quality and testing management. If I ever have a question or problem to solve and I’m stuck, I go here. Good for inspiration and great ideas.
https://www.stickyminds.com

Just launched by the US Government and we’re going to be on it soon with a cloud provision of SuperWEB. Any US Government Agency will be able to buy us through this process. Super-excited about this one.
http://apps.gov

Jo