Why Is Speech Recognition Technology So Difficult to Perfect?


Why isn't speech recognition software more accurate? originally appeared on Quora: the knowledge sharing network where compelling questions are answered by people with unique insights.

Answer by Sunit Sivasankaran, Research Engineer, on Quora:

Why isn't speech recognition software more accurate? This is an excellent question to start off an automatic speech recognition (ASR) interview. I would slightly rephrase the question as "Why is speech recognition hard?"

The reasons are many, and here is my take on the topic:

ASR is just like any other machine learning (ML) problem, where the objective is to classify a sound wave into one of the basic units of speech (also called a "class" in ML terminology), such as a word. The problem with human speech is the huge amount of variation that occurs while pronouncing a word. For example, below are two recordings of the word "Yes" spoken by the same person (wave source: AN4 dataset [1]). It can easily be seen that the signals differ, and the same can be verified by analyzing them in the frequency or time-frequency domain.

Figure: Comparison of two different recordings of the word "Yes" in the time domain.
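As a concrete illustration of that comparison, here is a minimal sketch, assuming librosa and matplotlib are available; the file names "yes_1.wav" and "yes_2.wav" are placeholders rather than actual AN4 file names.

```python
# Minimal sketch: compare two recordings of the same word in the time and
# time-frequency (STFT) domains. File names are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y1, sr = librosa.load("yes_1.wav", sr=16000)
y2, _ = librosa.load("yes_2.wav", sr=16000)

fig, axes = plt.subplots(2, 2, figsize=(10, 6))

# Time-domain waveforms: the signals differ even though the word is the same.
axes[0, 0].plot(np.arange(len(y1)) / sr, y1)
axes[0, 0].set_title('"Yes", recording 1 (time domain)')
axes[0, 1].plot(np.arange(len(y2)) / sr, y2)
axes[0, 1].set_title('"Yes", recording 2 (time domain)')

# Time-frequency (STFT magnitude, in dB) view of the same two recordings.
for ax, y in zip(axes[1], (y1, y2)):
    S = np.abs(librosa.stft(y, n_fft=512))
    librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                             sr=sr, x_axis="time", y_axis="hz", ax=ax)

plt.tight_layout()
plt.show()
```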

There are several reasons for this variation: stress on the vocal cords, environmental conditions, and microphone characteristics, to mention a few. To capture this variation, ML algorithms such as the hidden Markov model (HMM) [2] combined with Gaussian mixture models (GMM) are used. More recently, deep neural networks (DNN) have been shown to perform better.

One way to do ASR is to train an ML model for each word. During the training phase, the speech signal is broken down into a set of features (such as Mel-frequency cepstral coefficients, or MFCC for short), which are then used to build the model. These models are called acoustic models (AM). When a speech signal has to be "recognized" (the testing phase), features are again extracted and compared against each word model. The signal is assigned to the word whose model gives the highest probability. This way of doing ASR works pretty well for small vocabularies. When the number of words increases, we end up comparing against a very large set of models, which is computationally infeasible. There is the additional problem of finding enough data to train these models. Word models therefore fail for large-vocabulary continuous speech recognition tasks, due to the high complexity involved in decoding as well as the need for large amounts of training data.
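Here is a minimal sketch of this word-model approach, assuming librosa for MFCC extraction and hmmlearn for the GMM-HMM acoustic models; the vocabulary, file names, and hyperparameters are all illustrative, not a production recipe.

```python
# Sketch of per-word acoustic models: MFCC features + one GMM-HMM per word.
# File lists, vocabulary and hyperparameters are illustrative.
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load a recording and return a (frames x coefficients) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Training phase: fit one GMM-HMM acoustic model per word of a tiny vocabulary.
train_data = {
    "yes": ["yes_01.wav", "yes_02.wav"],  # placeholder file names
    "no":  ["no_01.wav", "no_02.wav"],
}
models = {}
for word, files in train_data.items():
    feats = [mfcc_features(f) for f in files]
    X = np.vstack(feats)                 # stacked frames of all utterances
    lengths = [len(f) for f in feats]    # frame count of each utterance
    model = hmm.GMMHMM(n_components=3, n_mix=2,
                       covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    models[word] = model

# Testing phase: extract features again, score against every word model and
# pick the word whose model assigns the highest log-likelihood.
test_feats = mfcc_features("unknown.wav")
scores = {word: m.score(test_feats) for word, m in models.items()}
print(max(scores, key=scores.get))
```

Even this toy setup hints at the scaling problem: every new word needs its own model and its own set of labeled recordings.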

To overcome this problem, we divide words into smaller units called phones. In the English language (and many Indian languages), there are approximately fifty phones that can be combined to make up any word. For example, the word "Hello" can be broken into "HH, AH, L, OW". You can look up the CMU pronouncing dictionary [6] for the phonetic expansion of English words.
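For instance, the dictionary can be queried through NLTK; a small sketch, assuming the cmudict corpus has already been fetched with nltk.download("cmudict"):

```python
# Sketch: phone sequences from the CMU Pronouncing Dictionary via NLTK.
# Run nltk.download("cmudict") once beforehand.
from nltk.corpus import cmudict

pron = cmudict.dict()
print(pron["hello"])   # [['HH', 'AH0', 'L', 'OW1']]  (digits mark vowel stress)
print(pron["speech"])  # [['S', 'P', 'IY1', 'CH']]
```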

The problem of ASR then boils down to recognizing the phone sequence instead of the word. This requires building an ML model for every phone. These models are called monophone models. If you can do a good job of recognizing the phones, you have solved a big part of the ASR problem. Unfortunately, recognizing phones is not an easy task.

If we plot the Fourier spectrum of a phone utterance, distinct peaks are visible, as can be seen in the following plot.

Figure: Formant frequencies [5].

The peak frequencies F1 and F2 are key indicators of a phone. A scatter plot of the vowels with respect to F1 and F2 is shown below. As can be seen, the spread is large, and the vowel regions very often overlap with one another.

Figure: Variation of dominant frequencies between vowels. No clear boundaries can be drawn to differentiate the vowels [3].

This overlap makes it hard for an ML algorithm to distinguish between phones.
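For a rough idea of how F1 and F2 can be estimated from a recorded vowel, here is a sketch based on linear prediction (LPC), assuming librosa is available; the file name, sample rate, and LPC order are illustrative, and real formant trackers do considerably more filtering of the candidate frequencies.

```python
# Rough sketch of estimating F1 and F2 for a recorded vowel using linear
# prediction (LPC). File name, sample rate and LPC order are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("vowel_segment.wav", sr=16000)

# Fit an all-pole (LPC) model; its poles near the unit circle correspond to
# vocal-tract resonances, i.e. the formants.
a = librosa.lpc(y, order=int(2 + sr / 1000))
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]        # keep one root per conjugate pair

# Convert pole angles to frequencies in Hz; discard implausibly low values.
freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
freqs = freqs[freqs > 90]
f1, f2 = freqs[0], freqs[1]
print(f"F1 ~ {f1:.0f} Hz, F2 ~ {f2:.0f} Hz")
```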

Another problem with monophones is that they are often influenced by the neighboring phones. The figure below shows the time-domain as well as the time-frequency (STFT) representation of the speech utterance "Heel".

Figure: Time-domain and STFT representation of the word "Heel" [Page on upenn.edu].

The word "heel" can phonetically be expanded as "HH IY L". The influence of the phone "HH" on "IY" can clearly be seen in the figure. Triphone models, also called context-dependent models, were proposed as a solution to model this context dependency. Here, a model is built for every possible triphone, with the hope that it captures enough contextual variation. The number of possible triphones is on the order of 50^3, which is a very high number. Building such a large number of models is, again, not feasible. Fortunately, not all triphones occur in the English language (or in other languages). After a few smart tricks, such as clustering acoustically similar triphone states together (state tying), we can reduce the number of classification units to the range of 5,000-10,000.
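As a rough illustration that only a fraction of the theoretically possible triphones actually occurs, the following sketch counts the distinct within-word triphone contexts in the CMU pronouncing dictionary via NLTK; the "SIL" boundary symbol and the stripping of stress digits are simplifications of my own.

```python
# Sketch: count the distinct within-word triphone contexts that actually occur
# in the CMU Pronouncing Dictionary (via NLTK). "SIL" marks word boundaries.
from nltk.corpus import cmudict

seen = set()
for pronunciations in cmudict.dict().values():
    for phones in pronunciations:
        p = [ph.rstrip("012") for ph in phones]   # drop vowel-stress digits
        padded = ["SIL"] + p + ["SIL"]
        for left, center, right in zip(padded, padded[1:], padded[2:]):
            seen.add((left, center, right))

print(len(seen), "distinct triphones observed")   # far fewer than 50**3 = 125,000
```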

Even with good phoneme recognition, it is still hard to recognize speech, because the word boundaries are not defined beforehand. This causes problems when differentiating phonetically similar sentences. A classic example is the pair "Let's wreck a nice beach" and "Let's recognize speech". These sentences are phonetically very similar, and the acoustic model can easily confuse them. Language models (LM) are used in ASR to solve this particular problem.
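To see how an LM breaks such a tie, here is a toy bigram model with made-up probabilities; a real LM would be estimated from a large text corpus, but even this sketch assigns a much higher score to the more plausible word sequence.

```python
# Toy bigram language model with made-up probabilities, showing how an LM
# prefers "let's recognize speech" over the acoustically similar alternative.
import math

bigram_prob = {                      # hypothetical values, not corpus estimates
    ("let's", "recognize"): 0.02, ("recognize", "speech"): 0.30,
    ("let's", "wreck"): 0.001, ("wreck", "a"): 0.05,
    ("a", "nice"): 0.02, ("nice", "beach"): 0.01,
}

def sentence_logprob(sentence, floor=1e-6):
    """Sum of log bigram probabilities, with a floor for unseen word pairs."""
    words = sentence.lower().split()
    return sum(math.log(bigram_prob.get((w1, w2), floor))
               for w1, w2 in zip(words, words[1:]))

for s in ["Let's recognize speech", "Let's wreck a nice beach"]:
    print(f"{s}: {sentence_logprob(s):.2f}")
```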

Another factor which bugs an ASR system is accent. Just like humans, machines too have a hard time understanding the same language spoken with different accents. See this video for an example.

This is because the classification boundaries previously learnt by a system for a particular accent do not carry over to other accents. This is the reason why ASR systems often ask for your location/speaking style (English-Indian, English-US, English-UK, for example) during the configuration process.

The complexities described so far are inherent to natural speech. Even with all this complexity, recognizing speech in noiseless environments is generally considered a solved problem. It is external influences such as noise and echoes which are the bigger culprits.

Noise and echoes are unavoidable interference while recording audio. Echoes arise from reflections of speech energy off surfaces such as walls, mirrors, and tables. This is not much of a problem when a speaker is close to the microphone. But when speech comes from a distance (making a purchase through Amazon Echo, for example), multiple reflected copies of the same signal, with different time delays and intensities, combine at the microphone. This stretches phones across time and ends up corrupting the neighboring speech information.

This phenomenon is called smearing. The process of removing the smear is called dereverberation, which is a commonly used technique to address the reverberation problem.
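To get a feel for the smearing effect itself (this is not a dereverberation method), clean speech can be convolved with a room impulse response; a sketch assuming librosa and scipy, with placeholder file names:

```python
# Sketch: simulate reverberation by convolving clean speech with a room
# impulse response (RIR). File names are placeholders.
import numpy as np
import librosa
from scipy.signal import fftconvolve

speech, sr = librosa.load("clean_speech.wav", sr=16000)
rir, _ = librosa.load("room_impulse_response.wav", sr=16000)

# Every sample of the clean signal is spread out over the length of the RIR,
# so neighbouring phones overlap ("smear") in the reverberant result.
reverberant = fftconvolve(speech, rir)[:len(speech)]
reverberant /= np.max(np.abs(reverberant))   # normalise to avoid clipping
```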

Another problem of note in ASR arises during the decoding stage. Here, the LM and AM are combined to form a huge network. Recognition is basically a search problem in such a big space: the bigger the space, the harder the search. Real-time recognition involves searching this network using the Viterbi algorithm to obtain the transcription of the speech signal.
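A minimal, self-contained sketch of the Viterbi search over an HMM looks like this; the toy dimensions stand in for the much larger decoding network used in real systems, which additionally prune the search (beam search) to stay tractable.

```python
# Minimal Viterbi search over an HMM: find the most likely state sequence
# given per-frame emission log-probabilities (e.g. produced by the AM).
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_init: (S,), log_trans: (S, S), log_emit: (T, S) -> best state path."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)         # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)    # best predecessor of each state
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # (prev state, cur state)
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_emit[t]
    # Trace the highest-scoring path back from the last frame.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Toy usage: 3 states, 10 frames of made-up emission log-probabilities.
rng = np.random.default_rng(0)
print(viterbi(np.log(np.ones(3) / 3),
              np.log(np.full((3, 3), 1 / 3)),
              np.log(rng.random((10, 3)))))
```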

This question originally appeared on Quora.
