What Big Data Can Teach Us About Language


Philosophers and scientists have long wondered how human knowledge is acquired and organized. One area of particular interest concerns how we come to understand a word: both its relation to a physical referent in the world (i.e., referential meaning) and its relation to other words (i.e., sense meaning). Large-scale projects like Deb Roy's are becoming increasingly helpful in this regard, because databases of this scale can answer important questions beyond those obtainable from standard experiments. Recently, psycholinguistic researchers have used large-scale databases to study language processes across individuals. For instance, the English Lexicon Project collected word recognition and pronunciation times from hundreds of participants across thousands of words, producing millions of observations. The resulting database has taught researchers a great deal about the interrelations among factors critical for word recognition and production, such as word frequency, length, age of acquisition, spelling-to-sound regularity, and lexical ambiguity. However, projects like this cannot examine how the interrelations among these variables, or their relative importance, change throughout development.

To study sense relations, other studies have tested thousands of participants in word-association tasks, in which participants respond to a cue word with "the first related word that comes to mind." These association-strength measures do a reasonably good job of predicting the ability of some words to provide a semantic context that facilitates responding to later related words (called "priming"). However, association values by themselves provide an "empty" explanation for why one word facilitates another, because they do not explain how or why the two words became associated in the first place. More recently, large-scale approaches to understanding semantic memory, such as Latent Semantic Analysis, make use of large bodies of online text (such as Wikipedia) to empirically define "related" words, much like Roy's social media analyses. The resulting databases can be used to compute both a local co-occurrence measure and a global similarity measure. Local co-occurrence for two words refers to the probability that they appear in the same utterance, whereas global similarity refers to the similarity of their linguistic contexts (i.e., the other words with which each co-occurs). Thus, "shirt" and "sweater" seldom appear in the same sentence, yet have high global similarity because they are used in similar linguistic contexts.
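To make the distinction concrete, here is a minimal sketch of the two measures over a tiny invented "corpus" of utterances (the sentences and counting scheme are illustrative only, not drawn from any of the databases discussed): local co-occurrence is just the proportion of utterances containing both words, while global similarity is the cosine between the two words' context-word count vectors.

```python
# Toy illustration: local co-occurrence vs. global similarity.
# The corpus below is invented for demonstration purposes.
import math
from collections import Counter

corpus = [
    "he wore a blue shirt today",
    "she bought a new shirt yesterday",
    "he wore a warm sweater today",
    "she bought a wool sweater yesterday",
    "shirt and sweater sizes differ",
]
utterances = [u.split() for u in corpus]
vocab = sorted({w for u in utterances for w in u})

def local_cooccurrence(w1, w2):
    """Proportion of utterances in which both words appear."""
    together = sum(1 for u in utterances if w1 in u and w2 in u)
    return together / len(utterances)

def context_vector(w):
    """Counts of every other word appearing in utterances with w."""
    ctx = Counter()
    for u in utterances:
        if w in u:
            ctx.update(x for x in u if x != w)
    return ctx

def global_similarity(w1, w2):
    """Cosine similarity of the two words' context vectors."""
    v1, v2 = context_vector(w1), context_vector(w2)
    dot = sum(v1[k] * v2[k] for k in vocab)
    n1 = math.sqrt(sum(v * v for v in v1.values()))
    n2 = math.sqrt(sum(v * v for v in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(local_cooccurrence("shirt", "sweater"))  # low: one utterance in five
print(global_similarity("shirt", "sweater"))   # high: shared linguistic contexts
```

On this toy corpus "shirt" and "sweater" co-occur in only one utterance out of five, yet their context vectors overlap heavily, so their global similarity is high, mirroring the shirt/sweater example in the text.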

As I see it, the real advantage of Deb Roy's approach over current approaches is its focus on tracking the development of a single individual over time. This can demonstrate the how and why of the associations that presumably drive the development of semantic knowledge. Professor Roy's example for "water" demonstrates how this database can track referential learning through the investigation of "word landscapes" (i.e., the locations in which words are most commonly used). As expected, our knowledge of a word usually incorporates both the environment in which its referent is located and the function for which it is used. Perhaps an even more useful approach would be to examine sense relations with Dr. Roy's model, and even the interaction of referential and sense relations. For sense relations, one could ask, "What are the linguistic contexts in which words appear?" Dr. Roy could examine how indices of local co-occurrence and global similarity change across development. As a specific example of an interaction between sense and reference, Dr. Roy could compute a cosine similarity between word landscape peaks. This would allow him to expand the "linguistic context" currently used to define similarity by adding "location similarity." Presumably, related words (especially synonyms) are often learned in the same or nearby locations (e.g., salt-pepper). This could distinguish associations that emerge from close temporal/spatial proximity from other types of associations.
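A minimal sketch of what such a "location similarity" measure might look like, assuming (purely for illustration) that each word's landscape is summarized as counts of utterances per room of the home; the words, room labels, and counts below are invented, not Roy's data:

```python
# Hypothetical "location similarity": cosine between two words'
# spatial usage distributions (utterance counts per room).
# All data here is invented for illustration.
import math

usage = {
    "salt":   {"kitchen": 40, "dining": 25, "living": 2},
    "pepper": {"kitchen": 38, "dining": 22, "living": 3},
    "bath":   {"bathroom": 50, "kitchen": 1},
}

def location_similarity(w1, w2):
    """Cosine of the two words' room-usage vectors."""
    rooms = sorted(set(usage[w1]) | set(usage[w2]))
    v1 = [usage[w1].get(r, 0) for r in rooms]
    v2 = [usage[w2].get(r, 0) for r in rooms]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(location_similarity("salt", "pepper"))  # near 1: used in the same places
print(location_similarity("salt", "bath"))    # near 0: mostly disjoint locations
```

Under this toy scheme, "salt" and "pepper" score near 1 because their usage concentrates in the same rooms, while "salt" and "bath" score near 0, which is the kind of contrast the proposed sense-reference interaction would exploit.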

Finally, although this post has focused primarily on language, the most obvious use for this type of database may be for memory researchers. Having an accurate longitudinal record of events could prove invaluable for examining the development of memory processes. What events are the first to be accurately remembered? How do landmarks of language development and/or event-structure knowledge influence the development of episodic memory? Can false memories of early childhood events be implanted not only for mundane events, but for bizarre and/or traumatic ones as well? Here, an accurate record of actual events could help refute arguments that the "implanted" event might actually have occurred. From a practical standpoint, giving individuals with memory impairments access to such a database of their own lives could be enormously beneficial, allowing them to enter keywords and query their own life events as easily as searching the internet.

In summary, I am continually amazed at the brilliance of researchers who use technology and "big data" to help answer intriguing questions regarding human behavior. I have focused only on specific questions within language and memory development, but am certain Dr. Roy's approach can also be used to make breakthroughs across many disciplines.
