In 2017, Mozilla launched DeepSpeech, an initiative incubated within the machine learning team at Mozilla Research focused on open sourcing an automatic speech recognition model. Over the next four years, the DeepSpeech team released newer versions of the model capable of transcribing lectures, phone conversations, television programs, radio shows, and other live streams with “human accuracy.” But in the coming months, Mozilla plans to cease development and maintenance of DeepSpeech as the company transitions into an advisory role, which will include the launch of a grant program to fund initiatives demonstrating applications for DeepSpeech. DeepSpeech isn’t the only open source project of its kind, but it’s among the most mature.

Mozilla’s work on DeepSpeech began in late 2017, with the goal of developing a model that takes audio features (speech) as input and outputs characters directly. The team hoped to design a system that could be trained using Google’s TensorFlow framework via supervised learning, where the model learns to infer patterns from datasets of labeled speech. Modeled after research papers published by Baidu, the model is an end-to-end trainable, character-level architecture that can transcribe audio in a range of languages. One of Mozilla’s major aims was to achieve a transcription word error rate of lower than 10%, and the newest versions of the pretrained English-language model achieve that aim, averaging around a 7.5% word error rate.
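For context on that target, word error rate is conventionally the minimum number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words. The sketch below is an illustrative calculation only, not code from Mozilla's project:

```python
# Minimal word error rate (WER) sketch: (substitutions + deletions + insertions)
# divided by the number of reference words. Illustrative only; not Mozilla's code.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + cost)          # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of ten reference words gives a 10% WER.
print(wer("the quick brown fox jumps over the lazy dog today",
          "the quick brown fox jumps over a lazy dog today"))  # 0.1
```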
In the project’s early days, training a high-performing model took about a week. The Mozilla Research team started training it on a single computer with four Titan X Pascal GPUs but eventually migrated to two servers with eight Titan Xp GPUs each. The latest DeepSpeech model contains tens of millions of parameters, the parts of the model that are learned from historical training data.

In the years that followed, Mozilla worked to shrink the DeepSpeech model while boosting its performance and remaining below the 10% error rate target. In December 2019, the team managed to get DeepSpeech running “faster than real time” on a single core of a Raspberry Pi 4. The English-language model shrank from 188MB to 47MB, and memory consumption dropped by a factor of 22.

It’s Mozilla’s belief that DeepSpeech has reached the point where the next step is to work on building applications. To this end, the company plans to transition the project to “people and organizations” interested in furthering “use-case-based explorations.” Mozilla says it’s streamlined the continuous integration processes for getting DeepSpeech up and running with minimal dependencies. And as the company cleans up the documentation and prepares to end Mozilla staff upkeep of the codebase, it says it’ll publish a toolkit to help researchers, companies, and any other interested parties use DeepSpeech to build voice-based solutions.
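As a rough illustration of what building on the released artifacts looks like, the sketch below uses the `deepspeech` Python package to transcribe a short audio clip with the pretrained English model; the model and scorer file names are examples taken from the 0.9.x releases, and the input file name is a placeholder:

```python
# Sketch of transcribing a 16kHz mono WAV file with the released deepspeech
# Python package. Model/scorer file names are examples from the 0.9.x releases
# and may differ for other versions; "audio.wav" is a placeholder input.
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")                  # pretrained acoustic model
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")   # optional language model scorer

with wave.open("audio.wav", "rb") as w:
    assert w.getframerate() == model.sampleRate()              # model expects 16kHz audio
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))                                        # prints the transcript
```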