Governments around the world have released more than a million open data sets over the last decade. This information has helped fuel job creation and some societal changes, including increased government accountability and consumer protection, more transparent health care costs and more resilience against climate change.
Analyzing that data -- when it's accurate -- can help policymakers make better decisions, but they're only beginning to tap into this potential.
A project from the Massachusetts Institute of Technology Media Lab shows what's possible. In 2013, the Foundation for Research of the State of Minas Gerais, Brazil's version of the National Science Foundation, hired César Hidalgo, the head of the MacroConnections group at the lab, to produce a report on where industrial developments would flourish. Hidalgo decided to go beyond a stale, static document and create something he believed would be more useful and dynamic.
His team at the Media Lab released its DataViva engine, which allows users to visualize more than 500 gigabytes of Brazilian government data in 1 billion different ways. The public can use the software platform to quickly and easily mash up multiple sets of economic, demographic, trade and educational data. The idea behind DataViva, Hidalgo explained, "is to make reports obsolete."
Now the team is relaunching DataViva with information from all across Brazil -- making it, Hidalgo said, the largest data visualization platform online.
In the two years since DataViva's release, the MIT team has taken an amazing amount of information -- international trade data from more than 5,000 municipalities, employment data from 50 million workers in Brazil's formal economy, enrollment and graduation data from Brazil's university and basic education systems, and five years' worth of tax data -- and has standardized it, structured it and added it to its system.
Hidalgo says he hopes the platform will make it easier for Brazilian bureaucracies to make well-informed decisions quickly. City and state leaders can use DataViva to search data for any municipality in the country and see it visualized in a variety of ways. For example, banks can decide whether to issue a loan to a prospective small business by looking at economic data and deciding if the business would be a good fit for a certain area.
The code behind DataViva is open source and available on the code-sharing site GitHub, and all of the government data it uses is accessible as downloadable files. Hidalgo and his business partners are now exploring whether other countries, states and cities may find the data visualization engine useful.
"There is knowledge embodied in the network of places," Hidalgo said. "The relevance of industrial makeup is not just to income growth but inequality."
In May, Hidalgo showed me at the Media Lab how DataViva works, then responded to a few questions about it over email. His answers, edited and condensed for clarity, follow:
How does DataViva differ from existing open government data websites?
We do not just make data accessible through well-designed visualizations. We have organized the data into carefully curated profiles for each location, industry, occupation, university, etc. These profiles help search engine optimization and also facilitate the discovery of visualizations. In addition, we have an advanced visualizations builder that can be used to create eleven different types of charts.
DataViva has been three years in the making. We have built the platform and the technologies needed to create the platform. To the best of my knowledge, the quality and functionality of the DataViva visualization platform are unparalleled by any effort to open public data. I would love for its ideas and code base to lead by example in the future redesign of open data portals.
What's the most important thing this does that won't be immediately apparent to the general public?
Most users of the tool will have no idea of the data manipulation routines developed to ingest, index and ultimately make this data accessible at a speed that is consistent with the pace of the web. This means returning most data queries in less than one second.
The long-term vision of the site may not be apparent on first visit either. DataViva, like The Observatory of Economic Complexity or Pantheon, is similar to an encyclopedia: no one would ever read one front to back, yet its usefulness as a resource reveals itself over time. Journalists looking to cite raw numbers about the economy, decision-makers looking to validate policy with data trends, or even curious citizens wondering about the composition of their municipality or the distribution of salaries paid to people in their own occupation can consult the tool in a very structured and logical way. This is due to the many iterations made to both the [user interface] and [user experience] of the site.
We see DataViva as part of the second wave of open data sites, where data is not just made "available" as massive text files but also visualized in interesting and compelling ways.
What's the most notable feature that wasn't possible prior to launch?
Before the data was cleaned and made accessible through DataViva, it was difficult to make comparisons longitudinally, across time, and latitudinally, between entities (geographies, industries, occupations, universities, etc.). Using the compare visualization, for instance, users can immediately see which industries or occupations are earning higher wages in one location vs. another. Previously, this would have taken hours and a lot of technical know-how.
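To make the two kinds of comparison concrete, here is a minimal Python sketch. It is not DataViva's code: the records, place names, wage figures and function names are all hypothetical, invented only to show what comparing between locations and across time might look like once the data shares one clean structure.

```python
# Hypothetical, made-up records: (year, municipality, occupation, average wage).
# These values are illustrative only and are NOT DataViva data.
RECORDS = [
    (2012, "City A", "Judicial clerk", 3000.0),
    (2013, "City A", "Judicial clerk", 3300.0),
    (2012, "City B", "Judicial clerk", 2500.0),
    (2013, "City B", "Judicial clerk", 2600.0),
    (2013, "City A", "Teacher", 2000.0),
    (2013, "City B", "Teacher", 2400.0),
]


def wages_by_occupation(records, location, year):
    """Return {occupation: wage} for one location in one year."""
    return {
        occupation: wage
        for (rec_year, rec_location, occupation, wage) in records
        if rec_location == location and rec_year == year
    }


def compare_locations(records, loc_a, loc_b, year):
    """Comparison between entities: wage gap (loc_a minus loc_b) per shared occupation."""
    wages_a = wages_by_occupation(records, loc_a, year)
    wages_b = wages_by_occupation(records, loc_b, year)
    shared = sorted(wages_a.keys() & wages_b.keys())
    return {occupation: wages_a[occupation] - wages_b[occupation] for occupation in shared}


def wage_over_time(records, location, occupation):
    """Comparison across time: (year, wage) pairs for one occupation in one location."""
    return sorted(
        (rec_year, wage)
        for (rec_year, rec_location, rec_occupation, wage) in records
        if rec_location == location and rec_occupation == occupation
    )


if __name__ == "__main__":
    print(compare_locations(RECORDS, "City A", "City B", 2013))
    print(wage_over_time(RECORDS, "City A", "Judicial clerk"))
```

In this sketch a positive gap means the occupation pays more in the first location. Both questions reduce to simple filters only because every record has the same shape, which is the kind of standardization the answer above describes.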
Another concept utilized by the user interface of the site is the network of profiles. Similar to Netflix, a visit to the homepage of DataViva shows a number of pre-populated lists with links to the "richest municipalities" or "best paid occupations," bringing users one click away from their content. From these profiles, more quick statistics are displayed along the left-hand sidebar, with follow-up questions in the form of links to subsequent profiles.
For example, if a user clicks on the profile page for Magistrates (one of the highest paid occupations in Brazil), they see a link to the profile page for the Justice industry, from which they then see a link to the profile page for judicial clerks (the top occupation by number of employees in the justice industry). So, you begin to see how this web of related profiles is constructed.
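The web of related profiles can be pictured as a directed graph, with profiles as nodes and sidebar links as edges. The Python sketch below is an illustration under assumptions, not a description of how DataViva stores its profiles: the link table is hypothetical and simply mirrors the Magistrates example, and the function name is invented.

```python
from collections import deque

# Hypothetical link structure modeled on the example in the text.
# Each key is a profile page; each value lists the profiles it links to.
PROFILE_LINKS = {
    "Homepage": ["Best paid occupations"],
    "Best paid occupations": ["Magistrates"],
    "Magistrates": ["Justice industry"],
    "Justice industry": ["Judicial clerks"],
    "Judicial clerks": [],
}


def click_path(links, start, goal):
    """Breadth-first search for the shortest chain of clicks from start to goal.

    Returns the list of profiles visited, or None if goal cannot be reached.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        for neighbor in links.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    print(click_path(PROFILE_LINKS, "Magistrates", "Judicial clerks"))
```

With this toy table, the path from Magistrates to judicial clerks runs through the Justice industry profile, matching the two clicks described above.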
For the technically inclined, how will D3plus, the data visualization library you developed, be relevant to other organizations?
The library started as a way for us to avoid duplicating our efforts and to abstract away some of the core functionality involved in the general task of creating interactive visualizations for the web, given any data set. Having the code on GitHub allows the project to be part of the open source community, something we've benefited from immensely. It is a way of giving back.
Thus, we've been chipping away at the task of making D3plus a visualization library that allows users to bring their data, write some basic configuration code and output a very powerful interactive visualization for the web. This is an extremely powerful concept when considering the initial ideation phase of working with data and trying to prototype a final product.