Reaching new records in speech recognition
By George Saon, Principal Research Scientist, IBM
March 08, 2017
Depending on whom you ask, humans miss one to two words out of every 20 they hear. In a five-minute conversation, that could be as many 80 words. But, for most of us it isn’t a problem. Imagine, though, how difficult it is for a computer?
Last year, IBM announced a major milestone in conversational speech recognition: a system that achieved a 6.9 percent word error rate. Since then, we have continued to push the boundaries of speech recognition, and today we’ve reached a new industry record of 5.5 percent.
This was measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like “buying a car.” This recorded corpus, known as the “SWITCHBOARD” corpus, has been used for over two decades to benchmark speech recognition systems.
To reach this 5.5 percent breakthrough, IBM researchers focused on extending our application of deep learning technologies. We combined LSTM (Long Short Term Memory) and WaveNet language models with three strong acoustic models. Within the acoustic models used, the first two were six-layer bidirectional LSTMs. One of these has multiple feature inputs, while the other is trained with speaker-adversarial multi-task learning. The unique thing about the last model is that it not only learns from positive examples but also takes advantage of negative examples – so it gets smarter as it goes and performs better where similar speech patterns are repeated.
Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the ultimate industry goal. Others in the industry are chasing this milestone alongside us, and some have recently claimed reaching 5.9 percent as equivalent to human parity…but we’re not popping the champagne yet. As part of our process in reaching today’s milestone, we determined human parity is actually lower than what anyone has yet achieved — at 5.1 percent.
To determine this number, we worked to reproduce human-level results with the help of our partner Appen, which provides speech and search technology services. And while our breakthrough of 5.5% is a big one, this discovery of human parity at 5.1 percent proved to us we have a way to go before we can claim technology is on par with humans.
As part of our research efforts, we connected with different industry experts to get their input on this matter too. Yoshua Bengio, leader of the University of Montreal’s MILA (Montreal Institute for Learning Algorithms) Lab agrees we still have more work to do to reach human parity:
“In spite of impressive advances in recent years, reaching human-level performance in AI tasks such as speech recognition or object recognition remains a scientific challenge. Indeed, standard benchmarks do not always reveal the variations and complexities of real data. For example, different data sets can be more or less sensitive to different aspects of the task, and the results depend crucially on how human performance is evaluated, for example using skilled professional transcribers in the case of speech recognition,” says Bengio. “IBM continues to make significant strides in advancing speech recognition by applying neural networks and deep learning into acoustic and language models.”
We also realized finding a standard measurement for human parity across the industry is more complex than it seems. Beyond SWITCHBOARD, another industry corpus, known as “CallHome,” offers a different set of linguistic data that can be tested, which is created from more colloquial conversations between family members on topics that are not pre-fixed. Conversations from CallHome data are more challenging for machines to transcribe than those from SWITCHBOARD, making breakthroughs harder to achieve. (On this corpus we achieved a 10.3 percent word error rate – another industry record – but again, with Appen’s help, measured human performance in the same situation to be 6.8 percent).
In addition, with SWITCHBOARD, some of the same human voices in test speakers’ data are also included in the training data used to train the acoustic and language models. Since CallHome has no such overlap, the speech recognition models have not been exposed to test speakers’ data. Because of this, there is no repetition and this has led to its larger gap between human and machine performance. As we continue to pursue human parity, advancements in our Deep Learning technologies that can pick up on such repetitions are ever more important to finally overcoming these challenges.
Julia Hirschberg, a professor and Chair at the Department of Computer Science at Columbia University, also commented on the ongoing complex challenge of speech recognition:
“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It’s also difficult to define human performance, since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated,” she shared. “IBM’s recent achievements on the SWITCHBOARD and on the CallHome data are thus quite impressive. But I’m also impressed with the way IBM has been working to better understand human ability to understand these two, much-cited corpora. This scientific achievement is in its way as impressive as the performance of their current ASR technology, and shows that we still have a way to go for machines to match human speech understanding.”
Today’s achievement adds to recent advancements we’ve made in speech technology – for example, in December we added diarization to our Watson Speech to Text service, marking a step forward in distinguishing individual speakers in a conversation. These speech developments build on decades of research, and achieving speech recognition comparable to that of humans is a complex task. We will continue to work towards creating the technology that will one day match the complexity of how the human ear, voice and brain interact. While we are energized by our progress, our work is dependent on further research – and most importantly, staying accountable to the highest standards of accuracy possible.
To check out the white paper on this automatic speech recognition milestone, please see this link https://arxiv.org/abs/1703.02136