To avoid the trouble of environment setup, [running in a Docker container](#running-in-docker-container) is highly recommended. Otherwise, follow the guidelines below to install the dependencies manually.
### Prerequisites
- Python 2.7 only supported
- Python >= 3.5
- PaddlePaddle 1.8.0 or later (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
### Setup
- Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:
Docker is an open source tool to build, ship, and run distributed applications in an isolated environment. A Docker image for this project has been provided on [hub.docker.com](https://hub.docker.com) with all the dependencies installed, including the pre-built PaddlePaddle, CTC decoders, and other necessary Python and third-party packages. This Docker image requires an NVIDIA GPU, so please make sure one is available and that [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.
```bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
```
Now go back and start from the [Getting Started](#getting-started) section; you can run training, inference, and hyper-parameter tuning in the Docker container in the same way.
Several shell scripts provided in `./examples` let you quickly try most major modules, including data preparation, model training, case inference, and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you understand how to make it work with your own data.
@ -132,7 +165,7 @@ For how to generate such manifest files, please refer to `data/librispeech/libri
To perform z-score normalization (zero-mean, unit stddev) upon audio features, we have to estimate in advance the mean and standard deviation of the features, with some training samples:
```bash
python tools/compute_mean_std.py \
python3 tools/compute_mean_std.py \
--num_samples 2000 \
--specgram_type linear \
--manifest_path data/librispeech/manifest.train \
@ -147,7 +180,7 @@ It will compute the mean and standard deviation of power spectrum feature with 20
A vocabulary of possible characters is required to convert the transcription into a list of token indices for training, and in decoding, to convert from a list of indices back to text again. Such a character-based vocabulary can be built with `tools/build_vocab.py`.
```bash
python tools/build_vocab.py \
python3 tools/build_vocab.py \
--count_threshold 0 \
--vocab_path data/librispeech/eng_vocab.txt \
--manifest_paths data/librispeech/manifest.train
@ -160,9 +193,9 @@ It will write a vocabulary file `data/librispeech/eng_vocab.txt` with all transc
@ -273,13 +306,13 @@ An inference module caller `infer.py` is provided to infer, decode and visualize
- Inference with GPU:
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py
CUDA_VISIBLE_DEVICES=0 python3 infer.py
```
- Inference with CPUs:
```bash
python infer.py --use_gpu False
python3 infer.py --use_gpu False
```
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* implements the simple best-path decoding algorithm: it selects the most likely token at each timestep, which makes it greedy and only locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873), by contrast, uses a heuristic breadth-first graph search to reach a near-global optimum; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument `--decoding_method`.
@ -287,7 +320,7 @@ We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search
For more help on arguments:
```
python infer.py --help
python3 infer.py --help
```
or refer to `examples/librispeech/run_infer.sh`.
@ -298,13 +331,13 @@ To evaluate a model's performance quantitatively, please run:
The error rate (default: word error rate; can be set with `--error_rate_type`) will be printed.
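For reference, both error rates are conventionally defined as an edit distance divided by the reference length. The sketch below illustrates that definition only; the helper names are hypothetical and the project's own implementation may differ in details such as whitespace or case handling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic programming)."""
    row = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        # prev_diag holds the value for (i-1, j-1) while row is updated in place.
        prev_diag, row[0] = row[0], i
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion
                row[j - 1] + 1,    # insertion
                prev_diag + cost,  # substitution or match
            )
    return row[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Three character edits over six reference characters.
cer = char_error_rate("kitten", "sitting")
print(f"WER = {wer:.3f}, CER = {cer:.3f}")
```

Note that either rate can exceed 1.0 when the hypothesis contains many insertions.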
@ -312,7 +345,7 @@ The error rate (default: word error rate; can be set with `--error_rate_type`) w
For more help on arguments:
```bash
python test.py --help
python3 test.py --help
```
or refer to `examples/librispeech/run_test.sh`.
@ -326,7 +359,7 @@ The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertio
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python tools/tune.py \
python3 tools/tune.py \
--alpha_from 1.0 \
--alpha_to 3.2 \
--num_alphas 45 \
@ -338,7 +371,7 @@ The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertio
- Tuning with CPU:
```bash
python tools/tune.py --use_gpu False
python3 tools/tune.py --use_gpu False
```
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and optionally draw the error surface. A proper hyper-parameter range should include the global minimum of the error surface for WER/CER, as illustrated in the following figure.
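The grid search itself can be sketched in a few lines. This is only an illustration of the idea, not the logic of `tools/tune.py`: `toy_error` is a hypothetical stand-in for decoding a development set and measuring WER/CER, and the helper names and ranges are assumptions.

```python
from itertools import product


def linspace(start, stop, num):
    """Return num evenly spaced values from start to stop, both inclusive."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def grid_search(error_fn, alphas, betas):
    """Evaluate error_fn at every (alpha, beta) point and return the best one."""
    results = {(a, b): error_fn(a, b) for a, b in product(alphas, betas)}
    best_point = min(results, key=results.get)
    return best_point, results[best_point], results


def toy_error(alpha, beta):
    # Hypothetical error surface; a real run would decode with the given
    # language model weight and word insertion weight, then score the output.
    return 0.08 + 0.01 * (alpha - 2.0) ** 2 + 0.002 * (beta - 0.5) ** 2


alphas = linspace(1.0, 3.0, 5)
betas = linspace(0.0, 1.0, 5)
best_point, best_error, results = grid_search(toy_error, alphas, betas)

# If the best point sits on the edge of the grid, the range probably does not
# contain the true minimum and should be widened.
on_edge = (best_point[0] in (alphas[0], alphas[-1])
           or best_point[1] in (betas[0], betas[-1]))
print(f"best (alpha, beta) = {best_point}, error = {best_error:.4f}, on edge: {on_edge}")
```

The `on_edge` check is one simple way to act on the advice above: a minimum found on the boundary suggests rerunning with a shifted or wider range.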
@ -352,36 +385,10 @@ Usually, as the figure shows, the variation of language model weight ($\alpha$)
After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation modules to see if they really help improve the ASR performance. For more help on arguments:
```bash
python tune.py --help
python3 tune.py --help
```
or refer to `examples/librispeech/run_tune.sh`.
## Running in Docker Container
Docker is an open source tool to build, ship, and run distributed applications in an isolated environment. A Docker image for this project has been provided on [hub.docker.com](https://hub.docker.com) with all the dependencies installed, including the pre-built PaddlePaddle, CTC decoders, and other necessary Python and third-party packages. This Docker image requires an NVIDIA GPU, so please make sure one is available and that [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.
```bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
```
Now go back and start from the [Getting Started](#getting-started) section; you can run training, inference, and hyper-parameter tuning in the Docker container in the same way.
## Training for Mandarin Language
The key steps of training for Mandarin are the same as those for English, and we have also provided an example for Mandarin training with Aishell in ```examples/aishell```. As mentioned above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` to do data preparation, training, testing, and inference respectively. We have also prepared a pre-trained model (downloaded by ./models/aishell/download_model.sh) for users to try with ```sh run_infer_golden.sh``` and ```sh run_test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
@ -394,7 +401,7 @@ To start the demo's server, please run this in one console:
```bash
CUDA_VISIBLE_DEVICES=0 \
python deploy/demo_server.py \
python3 deploy/demo_server.py \
--host_ip localhost \
--host_port 8086
```
@ -413,7 +420,7 @@ Then to start the client, please run this in another console:
```bash
CUDA_VISIBLE_DEVICES=0 \
python -u deploy/demo_client.py \
python3 -u deploy/demo_client.py \
--host_ip 'localhost' \
--host_port 8086
```
@ -427,8 +434,8 @@ Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which wi