In the July 1945 issue of The Atlantic Monthly, Dr. Vannevar Bush published his famous essay, "As We May Think," one of the first articles to address Big Data and information overload, or the "growing mountain of research," as he called it. The 2010 IOUG Database Growth Survey, conducted in July-August 2010, estimates that more than a zettabyte (a trillion gigabytes) of data exists in databases, and that 16 percent of organizations surveyed reported annual data growth in excess of 50 percent. A Gartner survey, also conducted in July-August 2010, reported that 47 percent of IT staffers surveyed ranked data growth among the top three challenges facing their IT organization. According to two recent IBM articles derived from its CIO Survey, one in three CIOs makes decisions based on untrusted data; one in two feels they do not have the data they need to make an informed decision; and 83 percent cite better analytics as a top concern. A recent survey conducted for MarkLogic found that 35 percent of respondents believe their unstructured data sources will surpass their structured data sources in size within the next 36 months, while 86 percent of respondents consider unstructured data important to their organization. Yet the survey also found that only 11 percent of those who consider unstructured data important have an infrastructure that addresses it.
Dr. Bush conceptualized a "private library," which he coined "memex" (mem[ory ind]ex) in his essay, that could ingest the "mountain of research" and use associative indexing -- the way we think -- to correlate trusted data in support of human decision making. Although Dr. Bush conceptualized "memex" as a desk-based device complete with levers, buttons, and a microfilm-based storage device, he recognized that future mechanisms and gadgetry would enhance the basic concepts. The core capabilities of "memex" were needed to allow man to "encompass the great record and to grow in the wisdom of race experience."
The first technology needed to tame Big Data -- derived from the "memex" concept -- is semantic technology, which loosely implements the concept of associative indexing. Dr. Bush is generally considered the godfather of hypertext, based on the associative indexing concept described in his 1945 article. The Semantic Web, paraphrased from a definition by the World Wide Web Consortium (W3C), extends hyperlinked Web pages by adding machine-readable metadata about the Web page, including relationships across Web pages, thus allowing machine agents to process the hyperlinks automatically. The W3C provides a series of standards to implement the Semantic Web, such as Web Ontology Language (OWL), Resource Description Framework (RDF), Rule Interchange Format (RIF), and several others.
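The heart of that machine-readable metadata is the RDF data model, in which every statement is a (subject, predicate, object) triple that an agent can match against a pattern. The sketch below illustrates the idea with plain Python; the `ex:` URIs are hypothetical examples, not a real vocabulary, and a real triple store would use a standard such as Turtle and a SPARQL engine rather than tuples and a loop.

```python
# A minimal sketch of the Semantic Web's core data model: RDF statements
# are (subject, predicate, object) triples that machine agents can query.
# The "ex:" identifiers below are hypothetical, illustrative URIs.

triples = {
    ("ex:AsWeMayThink", "ex:author", "ex:VannevarBush"),
    ("ex:AsWeMayThink", "ex:publishedIn", "ex:AtlanticMonthly"),
    ("ex:VannevarBush", "ex:conceived", "ex:Memex"),
}

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the given pattern (None = wildcard)."""
    return sorted(
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )

# What do we know about Vannevar Bush? Match any triple with him as subject.
for s, p, o in query(subject="ex:VannevarBush"):
    print(s, p, o)
```

Because relationships are explicit data rather than prose, an agent can traverse them automatically -- the machine-readable analogue of Dr. Bush's associative trails.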
The May 2001 Scientific American article "The Semantic Web" by Tim Berners-Lee, Jim Hendler, and Ora Lassila described a Semantic Web in which agents query ontologies representing human knowledge to find information requested by a human. An OWL ontology is based on Description Logics, which are both expressive and decidable, and which provide a foundation for developing precise models of various domains of knowledge. These ontologies provide the "memory index" that enables searches across vast amounts of data to return relevant, actionable information, while addressing key data trust challenges as well. The ability to deliver semantics to a mobile device, as the recent release of the iPhone 4S does with Siri, is an excellent step in taming the Big Data beast, since users can get the data they need when and where they need it. Big Data continues to grow, but semantic technologies provide the needed checkpoints to properly index vital information in ways that imitate how humans think, as Dr. Bush aptly noted.
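What makes ontology-backed search more powerful than keyword matching is inference: an agent can return answers that were never directly asserted, only implied by the class hierarchy. The toy sketch below (all class and individual names are hypothetical) shows the simplest such inference, transitive "is-a" reasoning, which a Description Logic reasoner performs -- along with far more -- over an OWL ontology.

```python
# A toy illustration of ontology-backed query answering. The class
# hierarchy and individuals are invented for this example. Transitive
# "is-a" reasoning lets an agent find instances of a class even when
# only a more specific subclass was asserted about them.

subclass_of = {                      # child class -> parent class
    "Cardiologist": "Physician",
    "Physician": "HealthcareProvider",
}
instance_of = {"dr_smith": "Cardiologist"}   # asserted facts

def classes_of(individual):
    """All classes an individual belongs to, asserted and inferred."""
    result = []
    cls = instance_of.get(individual)
    while cls is not None:
        result.append(cls)
        cls = subclass_of.get(cls)
    return result

def instances(cls):
    """Every individual that is, directly or by inference, of class cls."""
    return sorted(i for i in instance_of if cls in classes_of(i))

# dr_smith was only asserted to be a Cardiologist, yet a query for
# HealthcareProvider still finds them.
print(instances("HealthcareProvider"))
```

This is the mechanism that lets a query phrased at a human level of generality retrieve facts recorded at a much more specific one.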
The second technology needed to tame Big Data? Cloud. Dr. Bush's "memex" consisted of a desk-based mechanism that was capable of recording all data and providing speedy search results that included the associative indexes previously created. Today, organizations are using private, public, and hybrid clouds to ingest great volumes of data and apply cloud-based analytics that can sift through this data -- whether structured, semi-structured, or unstructured -- to index key facts. Cloud technologies (e.g., MapReduce, Hadoop) are currently used to create industrial-strength analytics that automate some of the associative indexing that required manual interaction in "memex." As data continues to grow at alarming rates, the cloud provides a platform that can grow elastically to support increasing data volumes.
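The MapReduce pattern mentioned above is simple enough to sketch in a few lines. The single-process word count below is only an illustration of the three phases -- map emits key-value pairs, shuffle groups them by key, reduce aggregates each group; frameworks such as Hadoop run the same phases in parallel across a cluster, which is what makes the pattern scale with the data.

```python
# A minimal in-process sketch of the MapReduce pattern: map emits
# (key, value) pairs, shuffle groups values by key, reduce aggregates
# each group. Distributed frameworks run these same phases in parallel.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in one input document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each group; here, sum the counts per word.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data keeps growing", "taming big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # "big" appears once in each document
```

Because each phase operates independently on its slice of the data, adding machines adds capacity -- the elasticity the paragraph above describes.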
Natural Language Processing
The third technology relevant to taming the Big Data beast is natural language processing, or NLP, a group of technologies that mine facts from unstructured data. Commercial and open source NLP software exists -- some of it deployable in a cloud -- that is very accurate at extracting key entities, events, and relationships from unstructured text. Some tools can even output RDF and OWL (semantic technology standards). Harvesting structured data from unstructured text allows cloud-based analytics to work across all of the data in the enterprise.
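To make the idea of "mining facts" concrete, the sketch below pulls a single event type out of free text with one regular-expression rule and emits the results as RDF-style triples. The company names, the `ex:acquired` predicate, and the single-pattern approach are all illustrative assumptions; production NLP systems rely on trained statistical models rather than hand-written patterns, but the end product -- structured, queryable statements harvested from prose -- is the same.

```python
# A deliberately simple, rule-based sketch of fact extraction from
# unstructured text. One regex pulls "X acquired Y" events and emits
# them as RDF-style triples. Real NLP systems use trained statistical
# models, not a single pattern; this only shows the idea of turning
# free text into structured statements. All names are invented.
import re

text = "Alpha Corp acquired Beta LLC in 2011. Gamma Inc acquired Delta Co."

# Two capitalized tokens on either side of the verb "acquired".
pattern = re.compile(r"([A-Z]\w+(?: \w+)?) acquired ([A-Z]\w+(?: \w+)?)")

triples = [(buyer, "ex:acquired", target)
           for buyer, target in pattern.findall(text)]

for triple in triples:
    print(triple)
```

Once the events are triples, they can sit in the same semantic store as the rest of the enterprise's data, which is exactly what lets analytics span structured and unstructured sources alike.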
The increasing growth of data, and particularly unstructured data, continues to concern corporate leaders. Even today, Dr. Bush's "memex" points to the technologies needed to tame Big Data: semantic technology to index data the way we think; cloud technology to process large volumes of data with analytics that mine facts; and natural language processing to harvest trusted, structured data from the rapidly expanding corpus of unstructured emails, logs, and other documents.