SDMX Web Services

Wednesday, June 9th, 2010 by Don McIntosh

Recently, many of us at STR have been working on implementing open data formats, specifically SDMX 2.1 and DDI 3.1. Both are extremely relevant for statistical processing: DDI takes the key position for planning, data collection, processing and microdata dissemination, while SDMX is best suited to processing and dissemination of aggregated data. Previous blog posts and news items have provided an overview of SDMX to show our customers how it might help them with their own business processes. This blog post is all about what we are actually delivering with our mid-year SuperSTAR Release 7.0. The following SDMX functionality will be included:

  1. SDMX output from SuperWEB
  2. Building SDMX-driven SuperVIEW interactive presentations (with no SXV4 db required)
  3. RESTful SDMX Web Services

This post focuses on the Web Services, which are arguably the most important capability. The other reason I’m excited by them is that this is the first time SDMX has been introduced directly to microdata. I’ll explain what I mean by this a bit later.

From the point of view of many data providers, the advantage of the Web Services is that they can provide customers with just the data they need, no more and no less. This can free up staff who would otherwise be devoted to responding to ad hoc queries.

From the customer point of view, it opens up new possibilities for consuming the data and building unique, useful services on top of it. For example, a third party application can convert user responses from a Web app into dynamic SDMX queries, and the results can in turn determine how the Web app should behave. Without Web Services, such an app would have had to rely on potentially stale data that had been downloaded and loaded into a local database. And thanks to the detailed data model of SDMX, apps can also work out what other data sources might sensibly be combined to produce richer, more useful results.
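To make the idea of turning user responses into a query a little more concrete, here is a minimal Python sketch. It is not taken from our product: the base URL, the flow id and the dimension and code names are all made up, and the URL layout follows the SDMX REST convention of a dot-separated key with one segment per dimension, where `+` joins alternative codes and an empty segment means "all codes".

```python
from urllib.parse import quote


def build_data_url(base_url, flow_id, dimension_order, selections):
    """Build an SDMX-style REST data query URL from a user's selections.

    dimension_order: dimension ids in the order the DSD defines them.
    selections: dict mapping a dimension id to the list of chosen codes.
    All names used here are hypothetical, for illustration only.
    """
    unknown = set(selections) - set(dimension_order)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    # One key segment per dimension: codes joined by "+", empty = all codes.
    segments = ["+".join(selections.get(dim, [])) for dim in dimension_order]
    key = ".".join(segments)
    return f"{base_url.rstrip('/')}/data/{quote(flow_id)}/{quote(key, safe='.+')}"


# A web form where the user picked two regions and one sex, and left age open.
url = build_data_url(
    "https://example.org/sdmx/rest",   # hypothetical endpoint
    "CENSUS",                          # hypothetical flow id
    ["REGION", "AGE", "SEX"],
    {"REGION": ["NSW", "VIC"], "SEX": ["F"]},
)
print(url)  # https://example.org/sdmx/rest/data/CENSUS/NSW+VIC..F
```

The app would then fetch that URL and use the response to decide what to show next, instead of reading from a local copy of the data.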

The other thing I’ll mention before getting into some specifics about what we’ve done is that our implementation is actually a RESTful API, not a “traditional” Web Service. We’re glad to see this style becoming so much more popular now. SDMX originally only had standard SOAP-based Web Services defined, but we’ve based our implementation on the proposed RESTful API for SDMX version 2.1. As developers, we find a RESTful API a lot easier to start using, to explore, and to scale, and we think that our customers will find the same.

What we’ve done

The SDMX API that we are focused on can be broken into three logical chunks:

  1. Metadata Discovery - what data collections are available, and what concepts/classifications are used where
  2. Database Metadata Discovery - what metadata (e.g. concepts and code lists) are used within a particular SDMX dataset
  3. Queries - defining and pulling back a slice of an SDMX data cube

We’ve implemented parts 2 and 3. (We will consider part 1 for a future version, but we are also looking at filling this gap in a different way, such as leveraging existing SDMX registries, which are used to collate and manage content that is stored in SDMX repositories. The important thing to note here is that we don’t want SuperSTAR to be an island. Many of the organisations we work with would want to reuse the same search and discovery mechanism across many different types of data and applications, so we’d like to learn more about how SDMX solutions can be part of such an environment before we proceed with this.)

Our SDMX RESTful API supports access to aggregated data that is managed by SuperSTAR. This can be from several different sources:

  1. SuperSTAR data cubes
  2. SuperSTAR tables defined by SuperWEB users
  3. SuperSTAR microdata databases

The last case is worth elaborating on, and links back to the point I mentioned earlier about introducing SDMX to microdata. Up until now, SDMX use has been limited to working with pre-aggregated data. This makes sense, especially when you consider that SDMX originated with a group of organizations that deal almost solely with such aggregated statistical data and only rarely with the underlying microdata from which the statistics were derived.

From our point of view, however, and I believe from the point of view of many of our customers, dealing with microdata is very much part of the production process that they are involved in. What is useful about this is that users are not constrained to taking slices of pre-defined cubes of data, but can instead explore the microdata and dynamically define queries to run against it. This approach can generate orders of magnitude more possible outputs and therefore relieve the provider of the burden of manually addressing many ad hoc queries that can’t be satisfied by a query against an existing cube. It does introduce other problems, namely confidentiality and performance, but handling these is part of our core capabilities, so our solution addresses the potential drawbacks in this regard.

To make it possible to use an SDMX-based API to run tabulation queries against microdata, we’ve made some necessary innovations to the SDMX standard. Firstly, while you can query for the data structure definition (DSD) of a very large virtual cube (which is actually a SuperSTAR database), we prevent clients from requesting the full dataset for this cube - it’s simply going to be too big. What we do instead is allow for any subset of dimensions in the DSD to be combined in an SDMX query.
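As a rough illustration of the "any subset of dimensions, but not the whole cube" rule, here is a small Python sketch of the kind of check a server could apply before running a tabulation. This is an assumption about how such a guard might look, not a description of the SuperSTAR internals: the function name is hypothetical, and treating "every dimension at once" as the full-dataset request is our simplification for the example.

```python
def check_dimension_subset(dsd_dimensions, requested):
    """Validate the dimensions a client wants to cross-tabulate.

    dsd_dimensions: every dimension id exposed by the virtual cube's DSD.
    requested: the dimension ids the client asked to combine.
    Returns the requested ids in DSD order, or raises ValueError.
    (Hypothetical helper; the rules below are assumptions for illustration.)
    """
    requested_set = set(requested)
    unknown = requested_set - set(dsd_dimensions)
    if unknown:
        raise ValueError(f"not in the DSD: {sorted(unknown)}")
    if not requested_set:
        raise ValueError("select at least one dimension")
    # Assumption: asking for every dimension at once is the "full dataset"
    # request, which would be too large to tabulate against microdata.
    if len(requested_set) == len(dsd_dimensions):
        raise ValueError("full-cube requests are not allowed")
    return [dim for dim in dsd_dimensions if dim in requested_set]


print(check_dimension_subset(["REGION", "AGE", "SEX"], ["SEX", "REGION"]))
# ['REGION', 'SEX']
```

A real service would more likely limit the request by estimated table size than by a simple count of dimensions, but the shape of the check is the same.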

In addition, any tables that a user defines in SuperWEB can be accessed as SDMX datasets; both the DSD and the data from such a table can be obtained through queries against the SDMX RESTful API.
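For a client, the usual pattern is to fetch the DSD first, read the dimensions out of it, and only then build a data query. The Python sketch below shows the middle step. The sample document is hand-written for the example, and its element names and namespace follow the SDMX-ML 2.1 structure schema as published; since our implementation is based on the proposed 2.1 API, the exact markup it returns may differ, so treat this as an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace of the SDMX-ML 2.1 structure schema.
STR_NS = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure"

# Hand-written sample DSD; ids and agency are hypothetical.
SAMPLE_DSD = f"""
<str:DataStructure xmlns:str="{STR_NS}" id="MY_TABLE" agencyID="EXAMPLE" version="1.0">
  <str:DataStructureComponents>
    <str:DimensionList id="DimensionDescriptor">
      <str:Dimension id="SEX" position="2"/>
      <str:Dimension id="REGION" position="1"/>
    </str:DimensionList>
  </str:DataStructureComponents>
</str:DataStructure>
"""


def dimension_ids(dsd_xml):
    """Return the dimension ids declared in a DSD, ordered by position."""
    root = ET.fromstring(dsd_xml)
    dims = root.findall(f".//{{{STR_NS}}}Dimension")
    dims.sort(key=lambda d: int(d.get("position", "0")))
    return [d.get("id") for d in dims]


print(dimension_ids(SAMPLE_DSD))  # ['REGION', 'SEX']
```

The ordered list of ids is what a client needs in order to put the codes of a data query key in the right positions.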

If you’ve read this whole post, you must be interested in what we are doing here. We think that the API can be very useful for many of our customers, so please leave a comment here if you have a question or something to say. Or if you want to go one step further, let us know and we’ll discuss providing you with a test package that you can use to try the API against your own data.