Automated speech recognition (ASR) has improved significantly in terms of accuracy, accessibility, and affordability in the past decade. Advances in deep learning and model architectures have made speech-to-text technology part of our everyday lives—from smartphones to home assistants to vehicle interfaces.
Speech recognition is also a critical component of industrial applications. Call centers, cloud phone services, video platforms, podcast producers, and other businesses use speech recognition technology both to transcribe audio or video streams and as an analytical tool. These companies use state-of-the-art speech-to-text APIs to enhance their own products with features such as speaker diarization (speaker labels), personally identifiable information (PII) redaction, topic detection, sentiment analysis, and profanity filtering.
Many developers are experimenting with building their own speech recognition models for personal projects or commercial use. If you're interested in building your own, here are a few considerations to keep in mind.
Choose the right architecture
Depending on your use case or goal, you have many different model architectures to choose from. The right choice depends on whether you need real-time or asynchronous transcription, how accurate the output must be, how much processing power you have available, and which additional analytics or features your use case requires.
Open source model architectures are a great route if you're willing to put in the work. They're a way to get started building a speech recognition model with relatively good accuracy.
Popular open source model architectures include:
Kaldi: Kaldi is one of the most popular open source speech recognition toolkits. It's written in C++ and can use CUDA for GPU acceleration. It has been widely tested in both the research community and commercially, making it a robust option to build with. With Kaldi, you can train your own models or start from its out-of-the-box models, which already offer high levels of accuracy.
Mozilla DeepSpeech: DeepSpeech is another great open source option with good out-of-the-box accuracy. Its end-to-end model architecture is based on research from Baidu, and it's implemented as an open source project by Mozilla. It uses TensorFlow and Python, making it easy to train and fine-tune on your own data. DeepSpeech can also run in real time on a wide range of devices, from a Raspberry Pi 4 to a high-powered graphics processing unit.
Wav2Letter: As part of Facebook AI Research's ASR toolkit, Wav2Letter provides decent accuracy for small projects. Wav2Letter is written in C++ and uses the ArrayFire tensor library.
CMUSphinx: CMUSphinx is an open source speech recognition toolkit designed for low-resource platforms. It supports multiple languages—from English to French to Mandarin—and has an active support community for new and seasoned developers. It also provides a range of out-of-the-box features, such as keyword spotting, pronunciation evaluation, and more.
Make sure you have enough data
Once you've chosen the best model architecture for your use case, you need to make sure you have enough training data. Any model you plan to train needs an enormous amount of data to work accurately and be robust to different speakers, dialects, background noise, and more.
How do you plan to source this data? Options include:
- Paying for training data (this can get very expensive quickly)
- Using public data sets
- Sourcing it from open source audio or video streams
- Using in-person or field-collected data sets
Make sure your data sets contain a variety of characteristics, so you won't bias your model toward one particular subset over another (for example, toward Midwestern US speech over Northeastern US speech, or toward male speakers over female speakers).
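One rough way to spot this kind of skew is to count how often each speaker attribute appears in your data set's metadata. The sketch below assumes each audio clip comes with a metadata record; the field names (`dialect`, `gender`), the sample records, and the `min_share` threshold are all hypothetical and only there for illustration.

```python
from collections import Counter


def attribute_shares(records, attribute):
    """Return the fraction of records that carry each value of `attribute`."""
    counts = Counter(record[attribute] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def underrepresented(records, attribute, min_share=0.25):
    """List attribute values whose share falls below `min_share` (an arbitrary cutoff)."""
    shares = attribute_shares(records, attribute)
    return sorted(value for value, share in shares.items() if share < min_share)


# Hypothetical metadata for five clips; real data sets will have their own schema.
clips = [
    {"clip": "a.wav", "dialect": "midwestern", "gender": "male"},
    {"clip": "b.wav", "dialect": "midwestern", "gender": "female"},
    {"clip": "c.wav", "dialect": "midwestern", "gender": "male"},
    {"clip": "d.wav", "dialect": "midwestern", "gender": "female"},
    {"clip": "e.wav", "dialect": "northeastern", "gender": "male"},
]

print(attribute_shares(clips, "dialect"))
print(underrepresented(clips, "dialect"))  # dialects below the 25% cutoff
print(underrepresented(clips, "gender"))   # genders below the 25% cutoff
```

A check like this only covers attributes you have labels for. Background noise, recording quality, and similar characteristics usually need their own annotation or a listening review.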
Choose the right metrics to evaluate the model
Finally, you need to choose the right metrics to evaluate your model.
When training a speech recognition model, the loss function is a good indicator of how well your model fits your data set. A high loss means the model's predictions are far from the target transcripts, while a lower loss indicates predictions that match the training data more closely.
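To make "high" and "low" loss concrete, here is a simplified sketch. It assumes a negative log-likelihood loss on a single token, which is the building block of cross-entropy-style losses; the loss your chosen toolkit actually uses may differ, and the function name is made up for this example.

```python
import math


def negative_log_likelihood(prob_of_correct_token):
    """Loss for one token: small when the model gives the right token high probability."""
    if not 0.0 < prob_of_correct_token <= 1.0:
        raise ValueError("probability must be in the range (0, 1]")
    return -math.log(prob_of_correct_token)


# The model assigns 90% probability to the correct token: low loss.
confident_loss = negative_log_likelihood(0.9)

# The model assigns only 10% probability to the correct token: high loss.
unsure_loss = negative_log_likelihood(0.1)

print(f"confident: {confident_loss:.3f}, unsure: {unsure_loss:.3f}")
```

In this toy case, the loss shrinks toward zero as the probability assigned to the correct token approaches 1, and it grows without bound as that probability approaches 0.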
However, minimizing loss will only get you so far. You also need to consider the word error rate (WER) of your speech recognition model, which is the standard metric for evaluating speech recognition systems.
WER adds the number of substitutions (S), deletions (D), and insertions (I) needed to turn the model's transcript into the human reference transcript, then divides that sum by the number of words in the reference (N): WER = (S + D + I) / N. The result is usually expressed as a percentage, and lower is better. However, even WER isn't foolproof. It ignores context and treats every mismatch alike, so differences in capitalization, punctuation, and similar formatting can distort the score. Always compare normalized transcripts (model versus human) to get the best picture of accuracy.
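The calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it finds the smallest total of substitutions, deletions, and insertions with a word-level edit distance, and its `normalize` step (lowercasing and stripping ASCII punctuation) is just one simple assumption about what normalization means. Production evaluations often also normalize numbers, abbreviations, and spelling variants.

```python
import string


def normalize(text):
    """Lowercase, remove ASCII punctuation, and split into words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def word_error_rate(reference, hypothesis):
    """Return (S + D + I) / N, where N is the number of words in the reference."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


human = "The cat sat on the mat."
model = "the cat sit on mat"
print(f"WER: {word_error_rate(human, model):.1%}")  # one substitution + one deletion over 6 words
```

Because insertions count as errors, WER can exceed 100% when the model outputs far more words than the reference contains.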
Build, evaluate, and repeat
By following the steps below, you'll be on your way to building a robust speech recognition model:
- Choose the best model architecture for your use case
- Source enough diverse data
- Evaluate your model effectively
Note that building a speech recognition model is a cyclical process. Once you reach the evaluation stage, you'll often find that you need to go back and retrain your model with more training data or a more diverse training set, as you continually work toward greater accuracy.
Model forums, community boards, and academic research can be great resources to learn more about the latest approaches and trends and to help you solve problems as you work.