Whale Calls Meet Citizen Science

diving blue whale

One aspect of research on whale and dolphin communication that people find very surprising is that we categorize calls by having human judges visually assess the similarity of call spectrograms (plots of frequency versus time). Whales and dolphins tend to maintain the overall pattern of frequency changes over time in their calls, which appears as a shape on a spectrogram, but they often vary the timing and frequency of those calls (along the lines of transposing a song into a different key). These variations cause difficulties for computerized classification methods, yet humans are superb at focusing on the overall patterns and shapes in the calls (which we know are important to the animals themselves). However, visual categorization of sounds becomes very time-consuming when dealing with large data sets.
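To make the "different key" analogy concrete, here is a toy sketch. The contour values are invented for illustration (they are not real whale data), and `normalize_contour` is a hypothetical helper, not a method used in the study. It shows two calls that share a shape but differ in every raw frequency value, and that coincide once the overall pitch level is removed.

```python
import math

def normalize_contour(freqs_hz):
    """Express a frequency contour relative to its own average pitch.

    Works in log2 (octave) units, so multiplying every frequency by a
    constant, as in musical transposition, leaves the result unchanged.
    """
    logs = [math.log2(f) for f in freqs_hz]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

# Invented example: an up-then-down contour, and the same contour
# shifted upward by a constant ratio (a "different key").
call_a = [4000.0, 5000.0, 6000.0, 5000.0, 4000.0]
call_b = [f * 1.5 for f in call_a]

shape_a = normalize_contour(call_a)
shape_b = normalize_contour(call_b)

# Raw values disagree everywhere, but the normalized shapes match.
raw_differs = all(a != b for a, b in zip(call_a, call_b))
same_shape = all(abs(a - b) < 1e-9 for a, b in zip(shape_a, shape_b))
print(raw_differs, same_shape)
```

Real calls also stretch in time and vary in less regular ways, so a single normalization like this would not be enough in practice, which is part of why human judgment of overall shape remains useful.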

Recently, I was working with a team from the University of St Andrews in Scotland trying to categorize over 4,000 calls recorded from pilot whales in the Tongue of the Ocean in the Bahamas. At around the same time, Sander van Benda-Beckmann of the Netherlands Organization for Applied Scientific Research recognized the potential of a crowdsourcing approach for whale call categorizations. This approach was being used successfully to categorize galaxies, via the website Galaxy Zoo (see Peter Diamandis's blog). He contacted Chris Lintott, one of the initial developers of Galaxy Zoo, and Peter Tyack, a whale researcher, who both supported the idea. The next step was to obtain large numbers of whale calls. We, along with several other researchers, contributed our calls to the site, and the website Whale FM was launched in late 2011.

We had the opportunity to use some of the first results from Whale FM in a publication on pilot whale call types. Based on our visual assessments of similarity, we had found evidence for repeated call types, which had not previously been reported for pilot whales. However, to make the case, we really needed to see how generalizable our classifications were. We initially tried to use data from additional observers, but the vast and variable nature of the data made this very difficult. So we decided to see whether preliminary results from Whale FM might provide the corroboration of our classifications that we needed. At the time, the website had been available to the public for about one month, and there were 255 instances of users being presented with a categorized call from our study as the "main call" to be matched. Users matched calls according to our categorization scheme in 189 of these instances, or 74 percent.
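The agreement figure is simple arithmetic on the two counts above, sketched here for readers who want to check it. The Wilson interval is an addition for illustration only, a rough way to gauge the uncertainty in a proportion from 255 trials. It is not part of the analysis described here, and it assumes the instances are independent. The function names are hypothetical.

```python
import math

def percent_agreement(matches, total):
    """Share of instances in which users agreed with our categories."""
    return 100.0 * matches / total

def wilson_interval(matches, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    Assumption: each instance is an independent trial, which may not
    hold if the same users or calls appear repeatedly.
    """
    p = matches / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total
                         + z * z / (4.0 * total * total)) / denom
    return center - half, center + half

matches, total = 189, 255  # counts reported in the text
agreement = percent_agreement(matches, total)
low, high = wilson_interval(matches, total)

print(f"{agreement:.0f} percent agreement")
print(f"approximate 95% interval: {100 * low:.0f} to {100 * high:.0f} percent")
```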

We do not know how many misclassifications resulted from users not being shown a call from the same category as a possible match (a feature that has since been added to Whale FM), so the percentage of classifications that agreed with ours could have been even higher than 74 percent. Although these results are preliminary, the level of agreement was very encouraging. The Whale FM data thus provided an independent measure of the reliability of our results, and they are a testament to the usefulness of "citizen science."