Easy-to-use Speech Toolkit including SOTA/Streaming ASR with punctuation, influential TTS with text frontend, Speaker Verification System and End-to-End Simultaneous Speech Translation.
# DeepSpeech2 on PaddlePaddle

## Installation

```shell
sh setup.sh
```

Please replace `$PADDLE_INSTALL_DIR` with your own PaddlePaddle installation directory.

## Usage

### Preparing Data

```shell
cd datasets
sh run_all.sh
cd ..
```

`sh run_all.sh` prepares all ASR datasets (currently, only LibriSpeech is available). After it finishes, several manifest files summarizing the datasets are generated in JSON format.

A manifest file summarizes a speech dataset: each line is a JSON object containing the metadata (i.e. audio file path, transcript text, and audio duration) of one audio file in the dataset. Manifest files serve as the interface that tells our system where the speech samples are and what to read from them.
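For illustration, such a manifest can be parsed as JSON Lines, one object per line. This is a minimal sketch; the exact field names below are illustrative, not necessarily those produced by `run_all.sh`:

```python
import json

# Hypothetical two-line manifest: one JSON object per audio clip.
manifest_text = "\n".join([
    '{"audio_filepath": "wav/a.wav", "duration": 3.2, "text": "hello world"}',
    '{"audio_filepath": "wav/b.wav", "duration": 1.5, "text": "good morning"}',
])

def read_manifest(lines, max_duration=27.0, min_duration=0.0):
    """Parse manifest lines, keeping only clips within a duration range."""
    samples = []
    for line in lines:
        sample = json.loads(line)
        if min_duration <= sample["duration"] <= max_duration:
            samples.append(sample)
    return samples

samples = read_manifest(manifest_text.splitlines())
```

Duration filtering of this kind is a common reason to store the audio length in the manifest: overly long or short clips can be skipped without opening the audio files.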

More help for arguments:

```shell
python datasets/librispeech/librispeech.py --help
```

### Preparing for Training

```shell
python tools/compute_mean_std.py
```

This computes the mean and standard deviation of the audio features and saves them to a file with the default name `./mean_std.npz`. This file is used in both training and inference. The default audio feature is the power spectrum; the MFCC feature is also supported. To train and infer with MFCC features, generate this file by

```shell
python tools/compute_mean_std.py --specgram_type mfcc
```

and specify `--specgram_type mfcc` when running `train.py`, `infer.py`, `evaluate.py`, or `tune.py`.
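Conceptually, this normalization step works like the following sketch: estimate a global per-dimension mean and standard deviation from sample features, store them, and z-score normalize every feature array afterwards. Shapes, function names, and the epsilon are illustrative, not the actual `compute_mean_std.py` implementation:

```python
import io

import numpy as np

def compute_mean_std(features, file_or_path):
    """Compute per-dimension mean/std over (n_frames, feature_dim) arrays."""
    stacked = np.concatenate(features, axis=0)  # (total_frames, feature_dim)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    np.savez(file_or_path, mean=mean, std=std)  # analogous to mean_std.npz
    return mean, std

def normalize(feature, mean, std, eps=1e-20):
    """Apply z-score normalization to one feature array."""
    return (feature - mean) / (std + eps)

# Fake features drawn from N(5, 2) stand in for real spectrograms.
rng = np.random.default_rng(0)
feats = [rng.normal(5.0, 2.0, size=(100, 161)) for _ in range(10)]
mean, std = compute_mean_std(feats, io.BytesIO())
normed = normalize(feats[0], mean, std)
```

Computing the statistics once and reusing the saved file keeps training and inference consistent, which is why the same `mean_std.npz` must be passed to both.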

More help for arguments:

```shell
python tools/compute_mean_std.py --help
```

### Training

For GPU training:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
```

For CPU training:

```shell
python train.py --use_gpu False
```

More help for arguments:

```shell
python train.py --help
```

### Preparing Language Model

The following steps (inference, parameter tuning, and evaluation) require a language model during decoding. A compressed language model is provided and can be downloaded by

```shell
cd ./lm
sh run.sh
cd ..
```

### Inference

For GPU inference:

```shell
CUDA_VISIBLE_DEVICES=0 python infer.py
```

For CPU inference:

```shell
python infer.py --use_gpu=False
```

More help for arguments:

```shell
python infer.py --help
```

### Evaluating

```shell
CUDA_VISIBLE_DEVICES=0 python evaluate.py
```

More help for arguments:

```shell
python evaluate.py --help
```
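ASR evaluation typically reports the word error rate (WER): the word-level edit distance between the decoded hypothesis and the reference transcript, divided by the reference length. A minimal sketch of that metric (not the repository's actual error-rate implementation):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / len(ref)

# One substitution out of four reference words gives WER = 0.25.
wer = word_error_rate("the quick brown fox", "the quick brown dog")
```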

### Parameter Tuning

Usually, the parameters α and β of the CTC prefix beam search decoder need to be re-tuned after the acoustic model is retrained.
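Decoders of this kind typically score candidates as `log p_acoustic + α · log p_lm + β · word_count`, so tuning amounts to a grid search over (α, β) that minimizes the error rate on a held-out set. A minimal sketch, where the hypothetical `decode_and_score` stands in for running the real decoder on a dev set and measuring its WER:

```python
import itertools

def decode_and_score(alpha, beta):
    """Stand-in error surface with a minimum at alpha=2.0, beta=0.5.

    In real tuning this would decode a held-out set with the given
    LM weight (alpha) and word insertion bonus (beta) and return WER.
    """
    return (alpha - 2.0) ** 2 + (beta - 0.5) ** 2

def tune(alphas, betas):
    """Return the (alpha, beta) pair with the lowest score on the grid."""
    return min(itertools.product(alphas, betas),
               key=lambda ab: decode_and_score(*ab))

alphas = [1.0, 1.5, 2.0, 2.5, 3.0]
betas = [0.0, 0.25, 0.5, 0.75]
best_alpha, best_beta = tune(alphas, betas)
```

A coarse grid followed by a finer grid around the best point is a common refinement, since each grid cell requires a full decoding pass.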

For GPU tuning:

```shell
CUDA_VISIBLE_DEVICES=0 python tune.py
```

For CPU tuning:

```shell
python tune.py --use_gpu=False
```

More help for arguments:

```shell
python tune.py --help
```

Then reset the parameters to the tuned values before running inference or evaluation.

## Playing with the ASR Demo

A real-time ASR demo is built for users to try out the ASR model with their own voice. Please perform the following installation on the machine that will run the demo's client (it is not needed on the machine running the demo's server).

For example, on Mac OS X:

```shell
brew install portaudio
pip install pyaudio
pip install pynput
```

After both an acoustic model and a language model are prepared, we can first start the demo's server:

```shell
CUDA_VISIBLE_DEVICES=0 python demo_server.py
```

And then, in another console, start the demo's client:

```shell
python demo_client.py
```

On the client console, press and hold the space key to start talking, and release it when you finish your speech. The decoding results (the inferred transcription) will then be displayed.

The server and the client can also run on two separate machines, e.g. `demo_client.py` usually runs on a machine with a microphone, while `demo_server.py` usually runs on a remote server with powerful GPUs. First make sure the two machines can reach each other over the network, then use `--host_ip` and `--host_port` in both `demo_server.py` and `demo_client.py` to indicate the server machine's actual IP address (instead of the default localhost) and TCP port.
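A minimal sketch of such a client/server exchange over TCP is shown below. The length-prefixed framing and message contents are illustrative only, not the actual protocol spoken by `demo_server.py` and `demo_client.py`:

```python
import socket
import struct
import threading

def send_message(sock, payload):
    """Send bytes prefixed with a 4-byte big-endian length header."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_message(sock):
    """Receive one length-prefixed message."""
    (length,) = struct.unpack(">I", sock.recv(4))
    data = b""
    while len(data) < length:
        data += sock.recv(length - len(data))
    return data

def serve_once(server_sock):
    """Accept one connection, read 'audio', reply with a 'transcript'."""
    conn, _ = server_sock.accept()
    audio = recv_message(conn)  # a real server would run ASR here
    send_message(conn, b"transcript for %d bytes" % len(audio))
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))  # ephemeral port for the sketch
server.listen(1)
host, port = server.getsockname()
t = threading.Thread(target=serve_once, args=(server,))
t.start()

client = socket.socket()
client.connect((host, port))
send_message(client, b"\x00" * 3200)  # fake audio chunk
reply = recv_message(client).decode()
client.close()
t.join()
server.close()
```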

## PaddleCloud Training

If you wish to train DeepSpeech2 on PaddleCloud, please refer to Train DeepSpeech2 on PaddleCloud.