Posts Tagged ‘Gov2.0’

SDMX Web Services

Wednesday, June 9th, 2010 by Don McIntosh

Recently, many of us at STR have been working on implementing open data formats, specifically SDMX 2.1 and DDI 3.1. Both are extremely relevant for statistical processing: DDI is central to planning, data collection, processing and microdata dissemination, while SDMX is best suited to processing and dissemination of aggregated data. Previous blog posts and news items have provided an overview of SDMX to inform our customers about how SDMX might help them with their own business processes. This blog post is all about what we are actually delivering with our mid-year SuperSTAR Release 7.0. The following SDMX functionality will be included:

  1. SDMX output from SuperWEB
  2. Building SDMX-driven SuperVIEW interactive presentations (with no SXV4 db required)
  3. RESTful SDMX Web Services

This post focuses on the Web Services, which are arguably the most important capability. The other reason I’m excited by them is that this is the first time that SDMX has been introduced directly to microdata. I’ll explain what I mean by this a bit later.

From the point of view of many data providers, the advantage of the Web Services is that they can give customers just the data they need, no more and no less. This can free up staff who would otherwise be devoted to responding to ad hoc queries.

From the customer point of view, it opens up new possibilities for consuming the data and building unique, useful services on top of it. For example, a third party application can convert user responses from a Web app into dynamic SDMX queries and then the results from this can in turn be used to determine how the Web app should behave. Without Web Services, such an app would previously have relied on potentially stale data that was downloaded and loaded into a local database. And thanks to the detailed data model of SDMX, apps can also work out what other data sources might sensibly be combined together to produce richer, more useful results.

The other thing I’ll mention before getting into some specifics about what we’ve done is that our implementation is actually a RESTful API, not a “traditional” Web Service. We’re glad to see this approach becoming so much more popular now. SDMX originally only had standard SOAP-based Web Services defined, but we’ve based our implementation on the proposed RESTful API for SDMX version 2.1. As developers, we find a RESTful API a lot easier to start using, to explore, and to scale, and we think that our customers will find the same.

What we’ve done

The SDMX API that we are focused on can be broken into three logical chunks:

  1. Metadata Discovery - what data collections are available, and what concepts/classifications are used where
  2. Database Metadata Discovery - what metadata (e.g. concepts and code lists) is used within a particular SDMX dataset
  3. Queries - Defining and pulling back a slice of an SDMX data cube

We’ve implemented parts 2 & 3.  (Part 1 we will consider for a future version, but we are also looking at solving this gap in a different way, such as leveraging existing SDMX registries, which are used to collate and manage contents that are stored in SDMX repositories. The important thing to note here is that we don’t want SuperSTAR to be an island - many of the organisations we work with would want to reuse the same search and discovery mechanism across many different types of data and applications, so we’d like to learn more about how SDMX solutions can be part of such an environment before we proceed with this.)

Our SDMX RESTful API supports access to aggregated data that is managed by SuperSTAR. This can be from several different sources:

  1. SuperSTAR data cubes
  2. SuperSTAR tables defined by SuperWEB users
  3. SuperSTAR microdata databases

The last case is worth elaborating on, and links back to the point I mentioned earlier about introducing SDMX to microdata. Up until now, SDMX use has been limited to working with pre-aggregated data. This makes sense, especially when you consider the origins of SDMX: it came from a group of organizations that deal almost solely with such aggregated statistical data and only rarely with the underlying microdata from which the statistics were derived.

From our point of view, however, and I believe from the point of view of many of our customers, dealing with microdata is very much part of the production process that they are involved in. What is useful about this is that users are not constrained to taking slices of pre-defined cubes of data; they can explore the microdata and dynamically define queries to run against it. This approach can generate orders of magnitude more possible outputs and therefore relieve the provider of the burden of manually addressing the many ad hoc queries that can’t be satisfied by a query against an existing cube. It does occasionally introduce other problems, namely confidentiality and performance, but these are part of our core capabilities, so our solution addresses potential drawbacks in this regard.

To make it possible to use an SDMX-based API to run tabulation queries against microdata, we’ve made some necessary innovations to the SDMX standard. Firstly, while you can query for the data structure definition (DSD) of a very large virtual cube (which is actually a SuperSTAR database), we prevent clients from requesting the full dataset for this cube - it’s simply going to be too big. What we do instead is allow for any subset of dimensions in the DSD to be combined in an SDMX query.
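
To make the idea of combining a subset of dimensions more concrete, here is a minimal sketch in Python of how a client might turn a choice of dimensions and codes into a query URL. It borrows the dot-separated key convention of the SDMX RESTful API (one position per DSD dimension, several codes joined with `+`, an empty position for an unconstrained dimension). The base URL, the flow name and the dimension IDs are invented for illustration, and the way SuperSTAR actually encodes a dimension subset may well differ.

```python
from urllib.parse import quote


def build_data_url(base_url, flow_ref, dsd_dimensions, selection):
    """Build an SDMX-style REST data URL (illustrative, not the SuperSTAR API).

    dsd_dimensions: dimension IDs in the order the DSD declares them.
    selection: maps a dimension ID to the list of codes wanted; a dimension
    left out of the mapping is left unconstrained (empty key position).
    """
    unknown = set(selection) - set(dsd_dimensions)
    if unknown:
        raise ValueError(f"dimensions not in DSD: {sorted(unknown)}")

    # One key position per DSD dimension, in DSD order.
    parts = []
    for dim in dsd_dimensions:
        codes = selection.get(dim, [])
        parts.append("+".join(quote(code, safe="") for code in codes))
    key = ".".join(parts)

    return f"{base_url.rstrip('/')}/data/{quote(flow_ref, safe='')}/{key}"


# Hypothetical example: females in two regions, all ages.
url = build_data_url(
    "https://example.org/sdmx/",
    "CENSUS",
    ["SEX", "AGE", "REGION"],
    {"SEX": ["F"], "REGION": ["NSW", "VIC"]},
)
print(url)
```

Validating the selection against the DSD before building the URL means a client can reject a bad dimension name locally instead of waiting for the server to do so.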

In addition, any tables that a user defines in SuperWEB can be accessed as SDMX datasets; both the DSD and the data from such a table can be obtained through queries against the SDMX RESTful API.

If you’ve read this whole post, you must be interested in what we are doing here. We think that the API can be very useful for many of our customers, so please leave a comment here if you have a question or something to say. Or if you want to go one step further, let us know and we’ll discuss providing you with a test package that you can use to try the API against your own data.

Embracing Advanced Visualization - apps4NSW Comp entries

Friday, March 26th, 2010 by Jo Deeker

Space-Time Research have developed two entries for the apps4NSW competition (for New South Wales, Australia) using SuperVIEW.  The apps4NSW competition, like the Mashup Australia and Apps For Democracy competitions, invited the public to submit ideas and applications that would benefit the citizens of New South Wales.

I’m excited about our two applications because they are genuinely useful online interactive publications of complex data that everyone will benefit from.  Our Why Australians Travel application presents a dataset from Tourism Research Australia that has not been made available to the public in an interactive way before.  It also includes advanced visualization in the form of a Motion Chart (Gapminder-style) which we’re very excited by! The motion chart can tell a story with data over time that you simply don’t see in static tables or reports.

The How Safe Is Your Suburb 2.0 application provides NSW Crime data in an interactive way, allowing users to analyse relative crime rates or absolute crime rates by suburb. This application is supported by one of our newest features, metadata, where explanations about the data are provided to help users understand its meaning.

Go check our applications out and vote for us if you like them!  And if you have any feedback on our entries please don’t hesitate to make a comment on our blog here.

Gov 2.0 for Koalas - Community vs. Government Data

Thursday, December 3rd, 2009 by Don McIntosh

I heard a debate on the radio on Tuesday about whether koalas should be classified as an endangered species. There’s an article from ABC news from last month that covers the issue quite well. Oddly enough, I was reminded of it when I had a chat with Gartner analyst Andrea Di Maio that same evening when he pointed out what he called the asymmetry of Gov 2.0. What he was referring to was the fact that many communities have data of their own, and that the standards that we are demanding of government are in no way being reciprocated in terms of what is expected of communities. A question for us and for our customers (typically government agencies) is what should government do with data owned and collected by the community?

How many koalas are there? Are they a threatened species, or endangered? What do we need to do to make sure that Australia retains a diverse, healthy population of koalas? This is a hotly debated topic, with the government accused of siding with property developers at the expense of many hectares of koala habitat. As much as I’m worried about predictions of extinction of koalas within 30 yrs, I’m not trying to push either side of the argument in this post. I’ll leave that to those who are better informed about this. What I do want to do is explore what should be done with “unofficial” statistics.

So here’s the problem: we have official statistics produced by government derived from data that is objectively collected, categorized and disseminated in keeping with scientific survey practices. And then on the flip side, we have passionate communities conducting their own research which increasingly seems to involve collecting data and producing statistics. In terms of quality, I imagine that the output varies a lot. But it is data, and potentially useful data. How should government deal with that, especially where they plan or need to have data that clearly overlaps with what already exists? Could government help turn them into official statistics? I would suspect that in many cases the answer would be a rather emphatic no. However, perhaps there would be cases where there may be some benefit to government to be gained from acknowledging and making some use of community-sourced data.

At the front end of the statistical business process model published by the UNECE, there is a planning phase where existing data sources are considered for inclusion in official collections. Here’s what the UNECE says in step 1.5:

“Check Data Availability: This sub-process checks whether current data sources could meet user requirements, and the conditions under which they would be available, including any restrictions on their use. An assessment of possible alternatives would normally include research into potential administrative data sources and their methodologies, to determine whether they would be suitable for use for statistical purposes. When existing sources have been assessed, a strategy for filling any remaining gaps in the data requirement is prepared…”

I’d take away from that that the authors had absolutely no thought in their minds about community data. So, if I was a community activist, I’d say that means that there is room for it to happen. After all, it doesn’t explicitly preclude a statistician from asking around to see if anyone else is out there counting our cuddly little Aussie icons. Perhaps there would be valid cases where government could collaborate in some way so that either the quality of the output is improved, or at the very least it can be better understood and therefore used appropriately.

There are a couple of issues that spring to mind…

  • Biased and/or poor quality evidence. Communities are typically passionate and biased to a particular point of view. With no standards or checks in place to determine data quality, the government would be right to be highly skeptical of any “facts” presented. The CEO of the Australian Koala Foundation (AKF) noted that over 20 years, 2000 field sites have been looked at and over 80,000 trees. Is that enough? For what? Should the government do more? Well, at the very least, they should do some due diligence, or even better, demand a bit of transparency of the AKF, which I suspect they would be quite willing to provide. It’s right that we demand transparency of the government, but it is equally right that community groups offering evidence to support their claims should be held to a similar standard. Maybe government departments need to band together to demand Community 2.0 ?
  • Inappropriate use of anecdotal evidence. Let’s face it, right now there are no doubt many policies that are based on little more than personal opinions of government executives rather than any solid evidence. People regularly draw conclusions based on direct experiences, or from stories of those they trust. Here’s a simple case in point from a comment on the ABC’s article: “Last year in the Otways I saw koalas where I had never seen them before. Seems to me that their numbers are increasing and a good thing too.” Let’s hope that’s not from environment minister Peter Garrett. This is a great evolutionary attribute that allows us to form opinions on things that might actually affect us but it doesn’t serve us well when we choose to use it to form a model of complex, widespread populations with many different local influences at play. I hardly need to point out what a huge role data can play when it comes to making informed decisions.
  • Real experts in the public ready to make a contribution. There are many informed and passionate members both within the official communities, as well as in the public at large. What if we could give them a little bit more in the way of facts and figures to work with? As it is, there is a fair bit of scientific knowledge introduced by commenters. One knowledgeable commenter had a fascinating insight into the problem: “Koala populations are notoriously difficult to monitor. They are such a specialized animal that a minor change in habitat can lead to local extinctions in one area while they pop up somewhere else where they haven’t been seen in living memory.” Well, it sounds like he knows about it. It would be helpful if there was a way for him to easily reference credible evidence to back that up.

As one commenter noted: “At least it’s a positive step to have some dialogue over koala population density, and clearly there are big differences in estimates.” Yep, that pretty much sums it up. The question is, how do we get to the next step? Personally, I don’t know. I think that Andrea got it right when he said that government should acknowledge the existence of the community data sources. What they do next is an open question. Having surveys with tens of thousands of data points may still be unreliable depending on the use, but it may be better than making conclusions based on what Fred saw on his weekend trip to the Otways. I’d love to hear what other community vs government data debates people have had, and what the outcome was.

Protecting confidentiality - some real life examples

Sunday, November 1st, 2009 by Don McIntosh

This blog post is on how we are enabling our customers to disseminate detailed information while protecting the privacy of individuals. In the context of being providers of official statistics, making data more available, and making governments more transparent, we show that it *can* be done - you *can* release data.

We are currently engaging with three customers and developing new requirements around the area of privacy protection on their data. For two of the three, the main goal is to deliver more detailed, useful data to their customers without compromising privacy concerns. The other key goals are around reducing the risk of accidentally releasing sensitive data (a goal of increasing importance given the Gov 2.0 fueled demand for more open data), and reducing costs associated with the application of privacy protection. I thought I’d write a short note to summarise our work in this area of late.

We have an API plugin architecture for applying disclosure control. Basically, you can build your own modules that do things like adjust, conceal, and/or annotate cell values based on certain rules, or reject a query if it’s deemed too sensitive for whatever reason. You can also record query details and use them to monitor for potential privacy intrusions.
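
As a rough illustration of the plugin idea (this is not the SuperSTAR plugin API, whose interfaces are not shown in this post), the sketch below models each rule as a function that receives the cells of a query result and either returns adjusted cells or rejects the query outright. Every name here (`Cell`, `QueryRejected`, `round_to_base`, `reject_if_total_below`, `apply_rules`) and the specific rules chosen are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Cell:
    value: Optional[float]  # None means the value has been concealed
    annotations: List[str] = field(default_factory=list)


class QueryRejected(Exception):
    """Raised by a rule that considers the whole query too sensitive."""


# A rule takes the cells of one query result and returns a new list of cells.
Rule = Callable[[List[Cell]], List[Cell]]


def round_to_base(base: int) -> Rule:
    """Adjust and annotate: round each value to the nearest multiple of base."""
    def rule(cells: List[Cell]) -> List[Cell]:
        result = []
        for cell in cells:
            if cell.value is None:
                result.append(Cell(None, list(cell.annotations)))
            else:
                rounded = base * round(cell.value / base)
                note = f"rounded to {base}"
                result.append(Cell(rounded, cell.annotations + [note]))
        return result
    return rule


def reject_if_total_below(minimum: float) -> Rule:
    """Reject the query when the visible values add up to less than minimum."""
    def rule(cells: List[Cell]) -> List[Cell]:
        total = sum(c.value for c in cells if c.value is not None)
        if total < minimum:
            raise QueryRejected(f"total {total} is below {minimum}")
        return cells
    return rule


def apply_rules(cells: List[Cell], rules: List[Rule]) -> List[Cell]:
    """Run the rules in order; any rule may raise QueryRejected."""
    for rule in rules:
        cells = rule(cells)
    return cells


protected = apply_rules(
    [Cell(7), Cell(13), Cell(None)],
    [reject_if_total_below(10), round_to_base(5)],
)
print([c.value for c in protected])
```

Keeping each rule as an independent function makes it easy to test different rule combinations, which is the kind of flexibility the plugin approach is meant to give.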

The work we are looking at doing in relation to current customer requests includes the following:

  • Implementing plugins with customised rounding and concealment rules. This is straightforward work as far as our current architecture is concerned, and helps customers with these requirements to implement rules that maximise the data they can make available. For one customer, we have written a plugin that will suppress numbers less than a certain value, and any related totals. So for example, if you were suppressing all numbers in a table less than or equal to 3, a simple table would show suppression of each such cell, plus any totals containing it. The example table demonstrates how a returned table would look. By suppressing the totals, you are preventing someone from back-calculating a value that has been suppressed.
Suppressed Table

  • Allowing custom selection of different rule combinations for testing and more advanced use of disclosure control. This is useful especially where you have a few in-house specialists who are authorised to be more lenient in terms of what rules need to be applied when responding to ad hoc information requests.
  • Extending confidentiality to apply to the output of calculations (SuperSTAR field derivations). For example, you might have a function that in some cases returns “..C” instead of a real value for certain cells as per the example above. Confidentiality can be extended to work with derived data. For example, it would be useful for determining a statistical mean or median and concealing the result if there were fewer than a certain number of contributors.
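
The small-value suppression rule and the contributor check described in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugin we wrote: the table is a plain list of rows of counts, the function names are invented, and the default of five contributors is an arbitrary placeholder. Only the threshold of 3 and the “..C” marker come from the text.

```python
SUPPRESSED = "..C"


def suppress_small_cells(counts, threshold=3):
    """Return the table with a totals column and row appended.

    Cells less than or equal to threshold are suppressed, together with
    every total (row, column, grand) that contains a suppressed cell.
    """
    n_rows = len(counts)
    n_cols = len(counts[0]) if counts else 0
    hidden = [[value <= threshold for value in row] for row in counts]

    table = []
    for r, row in enumerate(counts):
        out = [SUPPRESSED if hidden[r][c] else v for c, v in enumerate(row)]
        out.append(SUPPRESSED if any(hidden[r]) else sum(row))
        table.append(out)

    totals = []
    for c in range(n_cols):
        column_hidden = any(hidden[r][c] for r in range(n_rows))
        column_sum = sum(counts[r][c] for r in range(n_rows))
        totals.append(SUPPRESSED if column_hidden else column_sum)

    # The grand total contains every cell, so one hidden cell hides it too.
    anything_hidden = any(any(row) for row in hidden)
    grand_total = sum(sum(row) for row in counts)
    totals.append(SUPPRESSED if anything_hidden else grand_total)
    table.append(totals)
    return table


def confidential_mean(values, min_contributors=5):
    """Conceal a derived mean when too few records contribute to it."""
    if len(values) < min_contributors:
        return SUPPRESSED
    return sum(values) / len(values)


for row in suppress_small_cells([[12, 2], [8, 15]]):
    print(row)
```

In the example the single small cell takes its row total, its column total and the grand total with it, so it cannot be recovered by subtraction from the published figures. Note that under a "less than or equal to 3" rule a zero count is suppressed as well.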

We are really keen to hear from our customers and other interested parties. If you have some recent experience in using confidentiality in SuperSTAR or elsewhere, or would like to give us any kind of related feedback, please do feel free to leave a comment or contact us directly.

Why APIs are important for Gov2.0

Wednesday, October 21st, 2009 by Jo Deeker

I was at the Gov 2.0 conference in Canberra earlier in the week and found that compared to the talk around social engagement through Twitter and Facebook, the whole concept of open data and APIs took a back seat for much of the event. APIs were mentioned by speakers, but I did not get any sense that the majority of the attendees were thinking about APIs and mash-up-ability of data as much as I do. I also wasn’t sure that everyone knew what an API was, or why you would want one.

So we asked our Director of Product Planning, Don McIntosh, to write an article about what APIs are, and why they’re important. This is what he has to say about APIs.

With social applications, there is a clear and obvious use that everyone can understand, and the staggering traffic volumes for these sites make the topic all the more compelling. But what about open data and APIs? Why should we pay them any attention and how do we benefit from them?

An API is an Application Programming Interface. Web based APIs, sometimes referred to as Web services, are growing at a phenomenal rate. Basically, instead of information being presented in a predetermined manner through Web pages, APIs allow other applications (iPhone apps, Websites, MS Windows applications…) to extract specific chunks of information and combine them with other information in all kinds of ways to serve a specific purpose. Jim Ericson from Information Management blogged about this, and he included a good description of how Web services get used:

“Now think of all the thousands of iPhone apps and how they amalgamate all kinds of Web services. You open your commuter traffic app, it calls on traffic information services, Google maps, a weather forecast and maybe an ad for public transportation. One browser app, many (API) calls.”

Jim also mentioned how prominent APIs are becoming. For many popular websites, the network traffic generated by APIs actually exceeds the direct Web traffic. And that’s expected to continue. Perhaps even more interesting is the fact that these days, you don’t even need to be a programmer to use Web APIs. If you have played with Yahoo Pipes, or similar mashup tools, you know what I mean. Basically, these tools are empowering end users to create their own custom applications. Just drag and drop – no coding required.

So, they’re useful, widely used, accessible even to non-programming types, and becoming more popular by the day. But what in particular makes them so important in a Gov 2.0 context? I’d summarise it by saying that it’s about making it possible (and easy) for those outside of government to present statistics in a context that is meaningful and useful for them, and that can help facilitate informed discussion and decision making. If I want to provide a service to help people decide where to live, I could combine census statistics such as occupation, income, and age and mash them up with information about the location of shopping centres, pubs etc from a different service. I could achieve the same by gathering all the data into a database and building my service on top, but by accessing the data through an API, my information can remain current, and my queries can be run by calls to the API, saving me from the complexities and resources required to process the data myself. I can also leverage other services such as Google maps to present results. And of course, thanks to mashup platforms, this kind of application might just be something that a (non-programmer) individual does to satisfy their own interest. Either way, it makes it much more possible for people to take government information and use it in ways that government may never have chosen to do.
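
To give a feel for the "where to live" mashup, here is a small Python sketch of the combining step only. It assumes the two API responses have already been fetched and parsed into lists of dictionaries; the field names, the function name and the sample rows are all made up for illustration and do not come from any real service.

```python
def rank_suburbs(census_rows, amenity_rows, max_median_age, min_amenities):
    """Combine per-suburb census figures with amenity locations.

    Keeps suburbs that are young enough and have enough amenities,
    ordered by median income (highest first).
    """
    # Count amenities (shopping centres, pubs, ...) per suburb.
    amenity_count = {}
    for row in amenity_rows:
        suburb = row["suburb"]
        amenity_count[suburb] = amenity_count.get(suburb, 0) + 1

    matches = []
    for row in census_rows:
        count = amenity_count.get(row["suburb"], 0)
        if row["median_age"] <= max_median_age and count >= min_amenities:
            matches.append({
                "suburb": row["suburb"],
                "median_income": row["median_income"],
                "amenities": count,
            })
    return sorted(matches, key=lambda m: m["median_income"], reverse=True)


# Placeholder rows standing in for two separate API responses.
census = [
    {"suburb": "A", "median_age": 30, "median_income": 900},
    {"suburb": "B", "median_age": 45, "median_income": 1200},
    {"suburb": "C", "median_age": 28, "median_income": 1000},
]
amenities = [
    {"suburb": "A", "kind": "pub"},
    {"suburb": "A", "kind": "shopping centre"},
    {"suburb": "B", "kind": "pub"},
    {"suburb": "C", "kind": "pub"},
]
print(rank_suburbs(census, amenities, max_median_age=35, min_amenities=1))
```

Because both inputs would come from live API calls rather than a local copy, re-running the same function later picks up whatever the providers have published since.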

From a data provider’s perspective, there are many things to consider when looking at providing APIs for direct data access and querying.

1. API vs other means

An API can facilitate innovation, and help automate services that other organizations may provide based on the data. It can also provide transparency by not colouring the data in any particular way, but leaving it open to others to render analysis of the data in their own way. On the other hand, if representing the data in certain ways is useful in promoting an organization’s mission, then it might be best to concentrate on delivering the appropriate views and/or viewing tools for the data. Or in some cases, it might make sense to do both.

2. Risk of abuse

Gartner analyst Andrea Di Maio noted that separating data from its source and having no clear way to let consumers understand its lineage or quality runs a great risk of it being misused, or deliberately doctored to represent the “facts” that best suit the application builder. What does this mean to the organization providing the data? Providers of official statistics go to great lengths to defend against this possibility, yet by providing data through APIs, they may in some way increase the risk of this happening. Perhaps one way to look at it is to realise that this can happen anyway, without APIs. And it is probably unreasonable to expect a provider to do more than provide accurate quality information alongside their data (and even make it queryable through the API) so that users can make informed choices about what constitutes valid use of the data.

3. Data privacy protection

Many statistical agencies have “remote access data laboratory” services to give researchers the ability to perform detailed analyses on their data. There are typically manual checking processes in this, to ensure that researchers’ queries do not breach data privacy laws by identifying individuals from the data (something that is very easy to do, even when data has been anonymized). A provider would need to determine what privacy risks are posed by making the data available through an API, and ensure that appropriate safeguards are put in place.

4. Resources

An API call results in some amount of processing. Depending on the specifics, such as the type of query and the volume of data, the level of computing resources required can be quite significant. In the beginning, one option may be to limit API use to a few specific applications, and expand that over time. Alternatively, the API could impose certain limits for any single user. This is the approach that Twitter uses to manage the enormous demand it generates.
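
One simple way to impose a per-user limit is a fixed-window counter: each user gets a set number of calls per time window, and further calls are refused until the window rolls over. The Python sketch below shows that technique only; it is not a description of how Twitter or any other provider implements its limits, and the class name and parameters are invented for the example.

```python
import time


class FixedWindowLimiter:
    """Allow each user at most max_calls per window_seconds."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._windows = {}  # user_id -> (window start time, calls so far)

    def allow(self, user_id):
        now = self.clock()
        start, calls = self._windows.get(user_id, (None, 0))

        # Start a fresh window for new users or once the old one expires.
        if start is None or now - start >= self.window_seconds:
            start, calls = now, 0

        if calls >= self.max_calls:
            self._windows[user_id] = (start, calls)
            return False

        self._windows[user_id] = (start, calls + 1)
        return True


# Demonstration with a fake clock: two calls allowed per 60 seconds.
fake_now = [0.0]
limiter = FixedWindowLimiter(2, 60, clock=lambda: fake_now[0])
print(limiter.allow("alice"), limiter.allow("alice"), limiter.allow("alice"))
fake_now[0] = 61.0
print(limiter.allow("alice"))
```

A fixed window is easy to reason about but lets a user burst at the boundary between two windows; a provider with heavier queries might prefer to weight calls by cost, which this sketch does not attempt.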

We’re in the cloud! SuperWEB available now

Thursday, October 1st, 2009 by Jo Deeker

I’m really excited to announce that we aim to be among the first companies to host applications on the Apps.gov website.

To get there, we needed to get SuperWEB up into the cloud, and this week, we hosted our first application on the Amazon EC2 cloud. Yesterday, I got my first Amazon bill - $10 / day so far and we uploaded a lot of data!

Background:

Vivek Kundra, the US Federal Chief Information Officer, has launched the new Apps.gov Storefront to enable US Federal Government agencies to buy cloud computing services as easily as a consumer can acquire a Gmail or Facebook account.

Cloud computing services reduce costs through reductions in purchasing and maintaining servers, while simultaneously improving service scalability to manage peaks and troughs in usage. Kundra says that besides encouraging better collaboration among agencies, he expects cloud services to reduce energy consumption because agencies will be able to share IT infrastructures.

Space-Time Research is responding to the recent US Federal Government request for proposal for applications to be hosted via the Apps.gov website. The Apps.gov Storefront is managed by the US GSA (General Services Administration) and SuperSTAR software is already available for purchase through the GSA e-Library.

Space-Time Research cloud offerings

In September, Space-Time Research initiated a cloud offering by hosting SuperWEB Software as a Service (SaaS) on the Amazon EC2 cloud service. SuperWEB is currently in the process of being assessed for inclusion in the Apps.gov website. Once certified, SuperWEB SaaS will be available to buy as a small, medium, large or extra large implementation on a pay-by-month basis.

At the end of October, SuperVIEW will be production-ready and available via a Google App Engine hybrid cloud service. For more information, see SuperVIEW hybrid cloud service.

More about Apps.gov

Apps.gov is managed by the GSA development team, which is led by Casey Coleman, GSA’s CIO. In the article Kundra’s great experiment: Government apps ‘store front’ opens for business, Coleman says:

“Through Apps.gov, GSA can take on more of the procurement processes upfront, helping agencies to better fulfill their missions by implementing solutions more rapidly,”

“We will also work with industry to ensure cloud-based solutions are secure and compliant to increase efficiency by reducing duplication of security processes throughout government.”

Jo Deeker

My favourite sites

Tuesday, September 22nd, 2009 by Jo Deeker

My three favourite sites at the moment are:

As we enable public intelligence and data provision, and we’re an Australian based company, I have to keep on top of this every day. I love how fast ideas are moving.
http://gov2.net.au

For all goodness in quality and testing management. If I ever have a question or problem to solve and I’m stuck, I go here. Good for inspiration and great ideas.
https://www.stickyminds.com

Just launched by the US Government and we’re going to be on it soon with a cloud provision of SuperWEB. Any US Government Agency will be able to buy us through this process. Super-excited about this one.
http://apps.gov

Jo

Open Data Initiative - Free SuperVIEW hosting of data

Monday, August 17th, 2009 by Jo Deeker

Space-Time Research this week launched a new program called the Open Data Initiative at the International Statistical Institute (ISI) 2009 conference in Durban.

What is the Open Data Initiative?

The Open Data Initiative is a Web 2.0 site for disseminating public data. Users discover and explore data in a rich, interactive, and intuitive application, rather than browse or read large documents of published tables and charts. The end user can select and visualize any combination of data. It can be exported, printed, linked to, and shared in collaboration environments.

The Open Data Initiative is a freely available online service for the creation and dissemination of data for public consumption. You have the data; we have the service to disseminate it to the public.

The Open Data Initiative is hosted on the Google AppEngine Cloud, enabling providers of public data to create engaging and rich Web 2.0 experiences built on top of Space-Time Research’s SuperVIEW product suite. This provides transparent, lightning-fast web traffic responsiveness, scalability and built in redundancy no matter where in the world you are.

Data types suitable for the Open Data Initiative: Health, Transport, Education, Agriculture, Population Statistics, Labour Force, etc.

How do I sign up?

Contact us via the Open Data Initiative website

Key Benefits. The Open Data Initiative:

  • Is Cost and Time Efficient — Reduces the workload on your data analysts and researchers.
  • Provides Data that is Complete — Why compromise on providing a subset of the data? Maximize the ability of the public to self-service data of personal interest.
  • Provides Data as Service — Now you can provide a new online data service to the public.
  • Protects the Relevance of Your Brand — Provide an engaging and rewarding experience for the public. This reinforces the relationship of trust they have in your organization.
  • Delivers Data Integrity — Have confidence that the public are seeing the right numbers, graphs, and maps, and reaching the correct interpretation and understanding behind those numbers.
  • Delivers Data Responsiveness — Minimize the time between data collection and data dissemination to ensure maximum relevancy of the data to the audience.
  • Creates Communities of Users — Ensure the online experience can be captured and shared by the public in collaborative environments from Blogs to Twitter.

Frequently asked questions coming from some of our early adopters:

Q. What is the business model for Space-Time Research?
A. This is a free service and as such it has business model restrictions for customers - they cannot charge a fee for access to their created sites. It must be public and not sit behind authentication or payment gateways. We have a paid service that lifts these restrictions, but the free service is a good way to test drive the technology and the dissemination approach first. Alternatively customers can purchase a paid SuperVIEW software license and implement their own business model around a deployed SuperVIEW.

Q. What about confidentiality?
A. No confidentiality capabilities are offered with the free SuperVIEW. The Open Data Initiative will host all data in the Cloud so by its nature data provided should not contain confidential information. We can provide a confidential Cloud based service using our Hybrid connector, but this becomes a paid solution engagement.

Q. How do statistical boundaries get loaded?
A. We will detail this in the data collection process over the next week with people that sign up to our early adopter program, but we think it will be along the lines of providing a shapefile (with some size limits, i.e. pre-simplified and for particular areas) or KML to us.

Q. How does the application get integrated with the data provider’s website?

Option 1 -> provide a link that takes the user from the data provider website to the Open Data Initiative website.
Option 2 -> use an IFRAME to embed the Open Data Initiative hosted site into their website.

Jo Deeker