We now live in a world where data is its own form of currency. From email addresses to social security numbers and medical histories, we are, in the grand scheme, collections of data points. To function in the modern world, we as individuals are now largely required to entrust these data points to a wide array of organizations and businesses. Unfortunately, that trust has become harder to give as more and more entities, from the U.S. Government to Walmart, have fallen victim to data breaches that ultimately leave individuals exposed.
Alex "Sandy" Pentland directs MIT's Human Dynamics Laboratory and the MIT Media Lab Entrepreneurship Program, co-leads the World Economic Forum Big Data and Personal Data initiatives, and is a member of the Advisory Boards for Google, Nissan, Motorola Mobility, Telefonica, and a variety of start-up firms. Professor Pentland also serves as a Scientific Advisor for Monument Capital Group Holdings, an international private investment firm. He has spent considerable time addressing the failings of big data systems and pushing for better infrastructure and security protocols across the board. He has even developed a consumer-level data platform that mimics what the "big guys" should be doing.
SPN: You have spoken in the past about the flaws in how big data is collected and used. Is there any one fundamental flaw we need to fix?
AP: A major risk of deploying big data operations comes from the danger of putting so much personal data into the hands of one organization and also from storing that data in a single location. Organizations must arrange big data resources in a distributed manner, with each different type of data separated and dispersed among many locations, using many different types of computer systems and encryption.
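The segmentation Pentland describes can be sketched in a few lines. This is a toy illustration only, assuming an in-memory store per data type with a distinct key per store; the XOR "encryption" is a stand-in for real cryptography, and in a real deployment each store would live on separate hardware.

```python
import os

class SegmentedStore:
    """Toy sketch: each data type gets its own store and its own key."""

    def __init__(self, data_types):
        self.stores = {t: {} for t in data_types}            # one store per type
        self.keys = {t: os.urandom(16) for t in data_types}  # one key per type

    def _xor(self, data: bytes, key: bytes) -> bytes:
        # Stand-in for real encryption; XOR with the key twice round-trips.
        return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

    def put(self, data_type, record_id, value: bytes):
        self.stores[data_type][record_id] = self._xor(value, self.keys[data_type])

    def get(self, data_type, record_id) -> bytes:
        return self._xor(self.stores[data_type][record_id], self.keys[data_type])

store = SegmentedStore(["medical", "financial", "email"])
store.put("medical", "u1", b"blood type O+")
# Compromising the "email" store or its key reveals nothing about medical data.
```

The point of the design is that any single exploit yields only one data type from one store, which is exactly the property the distributed arrangement is meant to buy.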
SPN: What about personnel access?
AP: Human resources should be organized into cells of access and permission that are localized both spatially and by data type. Both computer and human resources should always be redundant and fragmented in order to avoid overly powerful central actors who can be corrupted and then override standard security precautions. It's this issue that creates situations like the Edward Snowden affair.
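The "cells of access" idea can be made concrete with a minimal permission check. The operators, sites, and data types below are hypothetical; the sketch only shows the rule that access requires a match on both location and data type, so no single operator holds blanket access.

```python
# Hypothetical cell assignments: each operator belongs to exactly one cell,
# localized both spatially (site) and by data type.
CELLS = {
    "alice": {"site": "boston", "data_type": "tax_individual"},
    "bob":   {"site": "denver", "data_type": "tax_company"},
}

def can_access(operator: str, site: str, data_type: str) -> bool:
    """Allow access only when the operator's cell matches site AND data type."""
    cell = CELLS.get(operator)
    return cell is not None and cell["site"] == site and cell["data_type"] == data_type
```

Under this rule, even a fully compromised operator exposes only one cell's worth of data, which is the fragmentation Pentland argues prevents a Snowden-style override of security precautions.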
SPN: So segment information and limit access to it?
AP: Databases that hold different types of data, that are physically and logically distributed, and that use heterogeneous computer and encryption systems are hard to attack, both physically and through cyber-attack. This is because any single exploit is likely to gain access to only a limited part of the whole database. Similarly, the resilience of organizations with a heterogeneous, cell-like human and permissions structure is familiar from intelligence and terrorist organizations. Importantly, adopting a distributed organization to resist attack is a particularly pressing issue for centralized organizations, because unfettered access to data about consumer or citizen behavior can be a major source of risk and liability.
SPN: I see how this can make breaches more difficult overall, but how hard is it to monitor for breaches in a system like this?
AP: The key insight is that for distributed data systems, each type of data analysis operation has a characteristic pattern of communication between the various different databases and human operators. As a consequence, it is possible to monitor the functioning of the data analysis process without gaining access to, or endangering, the analysis content. In short, one can use "metadata about metadata" in order to monitor the use of metadata, and with some reasonable confidence one can ensure that only normal or usual analysis operations are being conducted without reference to specific content. Organizations that structure their data resources in this manner can more easily monitor attacks and misuse of all sorts.
For example, let's assume a system in which different types of databases are physically distributed. In this case one can observe the amount and pattern of traffic between the different databases. These patterns are characteristic of the analysis being performed, and so deviations from the normal patterns of communication between databases are cause for concern. In this manner, an independent authority can perform substantial, fairly effective monitoring of the functioning of a department that performs analysis of secret or proprietary data. In most cases it is sufficient that each element of the system monitor only local traffic. It's important to understand that all of these lessons also apply to entire cities, economic sectors, or even entire nations...any entity with large, complex databases.
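The monitoring Pentland describes amounts to anomaly detection on traffic counts, never on content. Below is a minimal sketch under assumed values: the baseline link volumes and the tolerance are illustrative, and a real monitor would learn its baseline from history.

```python
# Baseline: expected message counts between pairs of data stores per period.
# The monitor sees only these counts ("metadata about metadata"), not content.
BASELINE = {
    ("tax_individual", "audit"): 120,
    ("tax_company", "audit"): 80,
}

def flag_anomalies(observed: dict, tolerance: float = 0.5) -> list:
    """Return links whose traffic deviates from baseline by more than tolerance."""
    flagged = []
    for link, expected in BASELINE.items():
        actual = observed.get(link, 0)
        if abs(actual - expected) / expected > tolerance:
            flagged.append(link)
    # Links with no baseline at all are suspicious by default.
    flagged += [link for link in observed if link not in BASELINE]
    return flagged

print(flag_anomalies({("tax_individual", "audit"): 125,
                      ("tax_company", "audit"): 900}))
# → [('tax_company', 'audit')]
```

Note that each store only needs its own local traffic counts to contribute to this picture, matching the observation that local monitoring usually suffices.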
SPN: How so?
AP: Criminal behavior by citizens or employees, industrial espionage, and cyber-attack are among the greatest dangers that we face in the big data era. A distributed architecture of databases joined with a network that supports permissions, provenance, and auditing can reduce risk and increase resilience of such large networked systems of databases. At the same time such a distributed system can provide the sort of "many eyes" oversight that can help keep exploitative companies in check.
SPN: How complicated is it to build a system with these types of structure and protocols? Is there any best practice?
AP: For a system that relies on multiple levels of oversight, the computer architecture should have distributed data stores with permissions, provenance, and auditing for sharing among data stores. The data stores can, and probably should, be segmented by their content--for example, there would be separate data stores for tax records for individuals, tax records for companies, import records from country X to port Y, and so on. The current best practice is a system of sharing called trust networks.
Trust networks are a combination of a computer network that keeps track of user permissions for each piece of data, and a legal framework that specifies both what can and cannot be done with the data and what happens if there is a violation of the permissions. This is the model of data management that is most frequently proposed within the World Economic Forum Big Data and Personal Data Initiatives.
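The computational half of a trust network can be sketched as a store where every datum carries its own permission list and every access attempt, allowed or denied, lands in an audit log. This is a minimal illustration, not any particular product's API; the legal framework Pentland mentions has no code counterpart here.

```python
import time

class TrustNetworkStore:
    """Sketch of a trust-network node: per-datum permissions plus auditing."""

    def __init__(self):
        self.data = {}        # record_id -> (value, set of permitted users)
        self.audit_log = []   # (timestamp, user, record_id, outcome)

    def put(self, record_id, value, permitted_users):
        self.data[record_id] = (value, set(permitted_users))

    def get(self, user, record_id):
        value, permitted = self.data[record_id]
        allowed = user in permitted
        # Every access attempt is logged, so an outside authority can audit
        # usage patterns without ever seeing the stored content.
        self.audit_log.append((time.time(), user, record_id,
                               "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{user} may not read {record_id}")
        return value

tn = TrustNetworkStore()
tn.put("rec1", "2014 tax filing", permitted_users=["alice"])
```

The audit log is what makes the automated auditing Pentland describes possible: violations of the permissions are recorded as "denied" events that can trigger the legal remedies the framework specifies.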
SPN: Are there any downsides to trust networks?
AP: A concern about trust networks is the cost associated with keeping track of permissions and supporting the capability for automated auditing. Since many companies already maintain such data structures in order to support internal compliance and auditing functions, the cost concern does not appear to be a major barrier. Another concern is that a trust network system may be too complex for normal use, or that it will not inspire (or deserve) the sort of user trust that the name suggests. Again, the longstanding use of trust networks by banks around the world suggests that complexity of use is not a barrier.
SPN: When we talk trust networks are these strictly enterprise level systems?
AP: Until recently, such systems were available only to the "big guys." Today, however, the ever-decreasing cost of computing, storage, and transmission enables small companies and even individuals to have a similarly safe method of managing personal data. Towards this end, my research group at MIT has developed and built openPDS--a consumer version of this type of system. We are now testing it with a variety of industry and government partners, and commercial versions are already being deployed.
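The core idea behind a personal data store can be shown in miniature. The sketch below is illustrative of the concept behind systems like openPDS, not its actual API: raw data never leaves the owner's store, and third parties receive only answers to questions the owner has approved.

```python
class PersonalDataStore:
    """Concept sketch: third parties get answers, never the raw data."""

    # Hypothetical set of questions the owner has agreed to answer.
    APPROVED_QUERIES = {"visits_gym_weekly"}

    def __init__(self, location_log):
        self._log = location_log   # raw data stays inside this object

    def answer(self, query: str):
        if query not in self.APPROVED_QUERIES:
            raise PermissionError("query not approved by the data owner")
        if query == "visits_gym_weekly":
            # Return an aggregate yes/no, not the underlying location trail.
            return sum(1 for place in self._log if place == "gym") >= 1

pds = PersonalDataStore(["home", "work", "gym", "home"])
print(pds.answer("visits_gym_weekly"))   # → True
```

A requester learns only the approved answer; the location trail itself is never transmitted, which is what makes the model safe enough for individuals to run.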
SPN: Given what we now know about big data and structure/security protocols, why haven't large-scale businesses modified their infrastructure to better protect against the high-profile breaches we have seen over the last couple of years?
AP: The easy thing is to imagine you can solve security problems by building a strong vault and hiring a reliable policeman to watch over it. There is an entire security industry pushing this sort of solution, and hardware and software vendors like to push centralized, uniform solutions. It is hard to face the reality that attacks aren't just by amateurs, and that no matter how good your wall or your security staff, some attacks will succeed. That means you have to be prepared for losses, and focus on building a resilient defensive system that accounts for attacks on both computers and people.