One of the primary functions of computers is to parse data. Some data is easier to parse than other data, and voice input continues to be a work in progress. There have been many improvements in the area in recent years, though, and one of them is in the form of DeepSpeech, a project by Mozilla, the foundation that maintains the Firefox web browser. DeepSpeech is a voice-to-text command and library, making it useful for users who need to transform voice input into text and developers who want to provide voice input for their applications.
Install DeepSpeech
DeepSpeech is open source, released under the Mozilla Public License (MPL). You can download the source code from its GitHub page.
To install, first create a virtual environment for Python:
$ python3 -m pip install deepspeech --user
DeepSpeech relies on machine learning. You can train it yourself, but it's easiest just to download pre-trained model files when you're just starting.
$ mkdir DeepSpeech
$ cd Deepspeech
$ curl -LO \
https://github.com/mozilla/DeepSpeech/releases/download/vX.Y.Z/deepspeech-X.Y.Z-models.pbmm
$ curl -LO \
https://github.com/mozilla/DeepSpeech/releases/download/vX.Y.Z/deepspeech-X.Y.Z-models.scorer
Applications for users
With DeepSpeech, you can transcribe recordings of speech to written text. You get the best results from speech cleanly recorded under optimal conditions. However, in a pinch, you can try any recording, and you'll probably get something you can use as a starting point for manual transcription.
For test purposes, you might record an audio file containing the simple phrase, "This is a test. Hello world, this is a test." Save the audio as a .wav
file called hello-test.wav
.
In your DeepSpeech folder, launch a transcription by providing the model file, the scorer file, and your audio:
$ deepspeech --model deepspeech*pbmm \
--scorer deepspeech*scorer \
--audio hello-test.wav
Output is provided to the standard out (your terminal):
this is a test hello world this is a test
You can get output in JSON format by using the --json
option:
$ deepspeech --model deepspeech*pbmm \
-- json
--scorer deepspeech*scorer \
--audio hello-test.wav
This renders each word along with a timestamp:
{
"transcripts": [
{
"confidence": -42.7990608215332,
"words": [
{
"word": "this",
"start_time": 2.54,
"duration": 0.12
},
{
"word": "is",
"start_time": 2.74,
"duration": 0.1
},
{
"word": "a",
"start_time": 2.94,
"duration": 0.04
},
{
"word": "test",
"start_time": 3.06,
"duration": 0.74
},
[...]
Developers
DeepSpeech isn't just a command to transcribe pre-recorded audio. You can also use it to process audio streams in real time. The GitHub repository DeepSpeech-examples is full of JavaScript, Python, C#, and Java for Android.
Most of the hard work is already done, so integrating DeepSpeech usually is just a matter of referencing the DeepSpeech library and knowing how to obtain the audio from the host device (which you generally do through the /dev
filesystem on Linux or an SDK on Android and other platforms.)
Speech recognition
As a developer, enabling speech recognition for your application isn't just a fun trick but an important accessibility feature that makes your application easier to use by people with mobility issues, low vision, and chronic multi-taskers who like to keep their hands full. As a user, DeepSpeech is a useful transcription tool that can convert audio files into text. Regardless of your use case, try DeepSpeech and see what it can do for you.
Comments are closed.