update aishell egs

4 years ago · 141109b49d
parent c246d315b5
commit 141109b49d
9 changed files with 95 additions and 104 deletions
--- a/README.md
+++ b/README.md
@@ -6,11 +6,9 @@

 ## Table of Contents
 - [Installation](#installation)
-- [Running in Docker Container](#running-in-docker-container)
 - [Getting Started](#getting-started)
 - [Data Preparation](#data-preparation)
 - [Training a Model](#training-a-model)
-- [Data Augmentation Pipeline](#data-augmentation-pipeline)
 - [Inference and Evaluation](#inference-and-evaluation)
 - [Hyper-parameters Tuning](#hyper-parameters-tuning)
 - [Training for Mandarin Language](#training-for-mandarin-language)
@@ -116,42 +114,6 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
    ```bash
    bash run.sh
    ```
-- Prepare the data
-
-    ```bash
-    sh local/run_data.sh
-    ```
-
-    `run_data.sh` will download dataset, generate manifests, collect normalizer's statistics and build vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech` and the corresponding manifest files generated in `${PWD}/data` as well as a mean stddev file and a vocabulary file. It has to be run for the very first time you run this dataset and is reusable for all further experiments.
-- Train your own ASR model
-
-    ```bash
-    sh local/run_train.sh
-    ```
-
-    `run_train.sh` will start a training job, with training logs printed to stdout and model checkpoint of every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints could be used for training resuming, inference, evaluation and deployment.
-- Case inference with an existing model
-
-    ```bash
-    sh local/run_infer.sh
-    ```
-
-    `run_infer.sh` will show us some speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good now as the current model is only trained with a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained (trained for several days, with the complete LibriSpeech) model and do the inference:
-
-    ```bash
-    sh local/run_infer_golden.sh
-    ```
-- Evaluate an existing model
-
-    ```bash
-    sh local/run_test.sh
-    ```
-
-    `run_test.sh` will evaluate the model with Word Error Rate (or Character Error Rate) measurement. Similarly, you can also download a well-trained model and test its performance:
-
-    ```bash
-    sh local/run_test_golden.sh
-    ```

 More detailed information are provided in the following sections. Wish you a happy journey with the *DeepSpeech2 on PaddlePaddle* ASR engine!

@@ -169,7 +131,7 @@ More detailed information are provided in the following sections. Wish you a hap

 To use your custom data, you only need to generate such manifest files to summarize the dataset. Given such summarized manifests, training, inference and all other modules can be aware of where to access the audio files, as well as their meta data including the transcription labels.

-For how to generate such manifest files, please refer to `PATH/TO/LIBRISPEECH/local/librispeech.py`, which will download data and generate manifest files for LibriSpeech dataset.
+For how to generate such manifest files, please refer to `examples/librispeech/local/librispeech.py`, which will download data and generate manifest files for LibriSpeech dataset.

 ### Compute Mean & Stddev for Normalizer

@@ -179,11 +141,11 @@ To perform z-score normalization (zero-mean, unit stddev) upon audio features, w
 python3 tools/compute_mean_std.py \
 --num_samples 2000 \
 --specgram_type linear \
---manifest_path PATH/TO/LIBRISPEECH/data/manifest.train \
---output_path PATH/TO/LIBRISPEECH/data/mean_std.npz
+--manifest_path examples/librispeech/data/manifest.train \
+--output_path examples/librispeech/data/mean_std.npz
 ```

-It will compute the mean and standard deviatio of power spectrum feature with 2000 random sampled audio clips listed in `PATH/TO/LIBRISPEECH/data/manifest.train` and save the results to `PATH/TO/LIBRISPEECH/data/mean_std.npz` for further usage.
+It will compute the mean and standard deviatio of power spectrum feature with 2000 random sampled audio clips listed in `examples/librispeech/data/manifest.train` and save the results to `examples/librispeech/data/mean_std.npz` for further usage.


 ### Build Vocabulary
@@ -193,18 +155,18 @@ A vocabulary of possible characters is required to convert the transcription int
 ```bash
 python3 tools/build_vocab.py \
 --count_threshold 0 \
---vocab_path PATH/TO/LIBRISPEECH/data/eng_vocab.txt \
---manifest_paths PATH/TO/LIBRISPEECH/data/manifest.train
+--vocab_path examples/librispeech/data/eng_vocab.txt \
+--manifest_paths examples/librispeech/data/manifest.train
 ```

-It will write a vocabuary file `PATH/TO/LIBRISPEEECH/data/eng_vocab.txt` with all transcription text in `PATH/TO/LIBRISPEECH/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
+It will write a vocabuary file `examples/librispeech/data/eng_vocab.txt` with all transcription text in `examples/librispeech/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).

 ### More Help

 For more help on arguments:

 ```bash
-python3 data/librispeech/librispeech.py --help
+python3 examples/librispeech/local/librispeech.py --help
 python3 tools/compute_mean_std.py --help
 python3 tools/build_vocab.py --help
 ```
@@ -240,7 +202,7 @@ python3 train.py --help
 or refer to `example/librispeech/local/run_train.sh`.


-## Data Augmentation Pipeline
+### Data Augmentation Pipeline

 Data augmentation has often been a highly effective technique to boost the deep learning performance. We augment our speech data by synthesizing new audios with small random perturbation (label-invariant transformation) added upon raw audios. You don't have to do the syntheses on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.

@@ -440,7 +402,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking

 Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` could be run on one without any audio recording hardware, e.g. any remote server machine. Just be careful to set the `host_ip` and `host_port` argument with the actual accessible IP address and port, if the server and client are running with two separate machines. Nothing should be done if they are running on one single machine.

-Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. With running `examples/mandarin/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update `--model_path` argument in the script.  
+Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. With running `examples/deploy_demo/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update `--model_path` argument in the script.  

 For more help on arguments:

--- a/examples/aishell/local/run_infer.sh
+++ b/examples/aishell/local/run_infer.sh
@@ -1,9 +1,8 @@
 #! /usr/bin/env bash

-cd ../.. > /dev/null

 # download language model
-cd models/lm > /dev/null
+cd ${MAIN_ROOT}/models/lm > /dev/null
 bash download_lm_ch.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -13,7 +12,7 @@ cd - > /dev/null

 # infer
 CUDA_VISIBLE_DEVICES=0 \
-python3 -u infer.py \
+python3 -u ${MAIN_ROOT}/infer.py \
 --num_samples=10 \
 --beam_size=300 \
 --num_proc_bsearch=8 \
@@ -27,11 +26,11 @@ python3 -u infer.py \
 --use_gru=True \
 --use_gpu=True \
 --share_rnn_weights=False \
---infer_manifest="data/aishell/manifest.test" \
---mean_std_path="data/aishell/mean_std.npz" \
---vocab_path="data/aishell/vocab.txt" \
---model_path="checkpoints/aishell/step_final" \
---lang_model_path="models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
+--infer_manifest="data/manifest.test" \
+--mean_std_path="data/mean_std.npz" \
+--vocab_path="data/vocab.txt" \
+--model_path="checkpoints/step_final" \
+--lang_model_path="${MAIN_ROOT}/models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
 --decoding_method="ctc_beam_search" \
 --error_rate_type="cer" \
 --specgram_type="linear"
--- a/examples/aishell/local/run_infer_golden.sh
+++ b/examples/aishell/local/run_infer_golden.sh
@@ -1,9 +1,7 @@
 #! /usr/bin/env bash

-cd ../.. > /dev/null
-
 # download language model
-cd models/lm > /dev/null
+cd ${MAIN_ROOT}/models/lm > /dev/null
 bash download_lm_ch.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -12,7 +10,7 @@ cd - > /dev/null


 # download well-trained model
-cd models/aishell > /dev/null
+cd ${MAIN_ROOT}/models/aishell > /dev/null
 bash download_model.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -22,7 +20,7 @@ cd - > /dev/null

 # infer
 CUDA_VISIBLE_DEVICES=0 \
-python3 -u infer.py \
+python3 -u ${MAIN_ROOT}/infer.py \
 --num_samples=10 \
 --beam_size=300 \
 --num_proc_bsearch=8 \
@@ -36,11 +34,11 @@ python3 -u infer.py \
 --use_gru=True \
 --use_gpu=False \
 --share_rnn_weights=False \
---infer_manifest="data/aishell/manifest.test" \
---mean_std_path="models/aishell/mean_std.npz" \
---vocab_path="models/aishell/vocab.txt" \
---model_path="models/aishell" \
---lang_model_path="models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
+--infer_manifest="data/manifest.test" \
+--mean_std_path="${MAIN_ROOT}/models/aishell/mean_std.npz" \
+--vocab_path="${MAIN_ROOT}/models/aishell/vocab.txt" \
+--model_path="${MAIN_ROOT}/models/aishell" \
+--lang_model_path="${MAIN_ROOT}/models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
 --decoding_method="ctc_beam_search" \
 --error_rate_type="cer" \
 --specgram_type="linear"
--- a/examples/aishell/local/run_test.sh
+++ b/examples/aishell/local/run_test.sh
@@ -1,9 +1,7 @@
 #! /usr/bin/env bash

-cd ../.. > /dev/null
-
 # download language model
-cd models/lm > /dev/null
+cd ${MAIN_ROOT}/models/lm > /dev/null
 bash download_lm_ch.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -13,7 +11,7 @@ cd - > /dev/null

 # evaluate model
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-python3 -u test.py \
+python3 -u ${MAIN_ROOT}/test.py \
 --batch_size=128 \
 --beam_size=300 \
 --num_proc_bsearch=8 \
@@ -27,11 +25,11 @@ python3 -u test.py \
 --use_gru=True \
 --use_gpu=True \
 --share_rnn_weights=False \
---test_manifest="data/aishell/manifest.test" \
---mean_std_path="data/aishell/mean_std.npz" \
---vocab_path="data/aishell/vocab.txt" \
---model_path="checkpoints/aishell/step_final" \
---lang_model_path="models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
+--test_manifest="data/manifest.test" \
+--mean_std_path="data/mean_std.npz" \
+--vocab_path="data/vocab.txt" \
+--model_path="checkpoints/step_final" \
+--lang_model_path="${MAIN_ROOT}/models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
 --decoding_method="ctc_beam_search" \
 --error_rate_type="cer" \
 --specgram_type="linear"
--- a/examples/aishell/local/run_test_golden.sh
+++ b/examples/aishell/local/run_test_golden.sh
@@ -1,9 +1,7 @@
 #! /usr/bin/env bash

-cd ../.. > /dev/null
-
 # download language model
-cd models/lm > /dev/null
+cd ${MAIN_ROOT}/models/lm > /dev/null
 bash download_lm_ch.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -12,7 +10,7 @@ cd - > /dev/null


 # download well-trained model
-cd models/aishell > /dev/null
+cd ${MAIN_ROOT}/models/aishell > /dev/null
 bash download_model.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -22,7 +20,7 @@ cd - > /dev/null

 # evaluate model
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-python3 -u test.py \
+python3 -u ${MAIN_ROOT}/test.py \
 --batch_size=128 \
 --beam_size=300 \
 --num_proc_bsearch=8 \
@@ -36,11 +34,11 @@ python3 -u test.py \
 --use_gru=True \
 --use_gpu=True \
 --share_rnn_weights=False \
---test_manifest="data/aishell/manifest.test" \
---mean_std_path="models/aishell/mean_std.npz" \
---vocab_path="models/aishell/vocab.txt" \
---model_path="models/aishell" \
---lang_model_path="models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
+--test_manifest="data/manifest.test" \
+--mean_std_path="${MAIN_ROOT}/models/aishell/mean_std.npz" \
+--vocab_path="${MAIN_ROOT}/models/aishell/vocab.txt" \
+--model_path="${MAIN_ROOT}/models/aishell" \
+--lang_model_path="${MAIN_ROOT}/models/lm/zh_giga.no_cna_cmn.prune01244.klm" \
 --decoding_method="ctc_beam_search" \
 --error_rate_type="cer" \
 --specgram_type="linear"
--- a/examples/aishell/local/run_train.sh
+++ b/examples/aishell/local/run_train.sh
@@ -1,12 +1,10 @@
 #! /usr/bin/env bash

-cd ../.. > /dev/null
-
 # train model
 # if you wish to resume from an exists model, uncomment --init_from_pretrained_model
 export FLAGS_sync_nccl_allreduce=0
 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-python3 -u train.py \
+python3 -u ${MAIN_ROOT}/train.py \
 --batch_size=64 \
 --num_epoch=50 \
 --num_conv_layers=2 \
@@ -24,12 +22,12 @@ python3 -u train.py \
 --use_gpu=True \
 --is_local=True \
 --share_rnn_weights=False \
---train_manifest="data/aishell/manifest.train" \
---dev_manifest="data/aishell/manifest.dev" \
---mean_std_path="data/aishell/mean_std.npz" \
---vocab_path="data/aishell/vocab.txt" \
---output_model_dir="./checkpoints/aishell" \
---augment_conf_path="conf/augmentation.config" \
+--train_manifest="data/manifest.train" \
+--dev_manifest="data/manifest.dev" \
+--mean_std_path="data/mean_std.npz" \
+--vocab_path="data/vocab.txt" \
+--output_model_dir="./checkpoints" \
+--augment_conf_path="${MAIN_ROOT}/conf/augmentation.config" \
 --specgram_type="linear" \
 --shuffle_method="batch_shuffle_clipped" \

--- a/examples/baidu_en8k/run_test_golden.sh
+++ b/examples/baidu_en8k/run_test_golden.sh
@@ -1,9 +1,9 @@
 #! /usr/bin/env bash

-cd ../.. > /dev/null
+source path.sh

 # download language model
-cd models/lm > /dev/null
+cd ${MAIN_ROOT}/models/lm > /dev/null
 bash download_lm_en.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -12,7 +12,7 @@ cd - > /dev/null


 # download well-trained model
-cd models/baidu_en8k > /dev/null
+cd ${MAIN_ROOT}/models/baidu_en8k > /dev/null
 bash download_model.sh
 if [ $? -ne 0 ]; then
    exit 1
@@ -22,7 +22,7 @@ cd - > /dev/null

 # evaluate model
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
-python3 -u test.py \
+python3 -u ${MAIN_ROOT}/test.py \
 --batch_size=128 \
 --beam_size=500 \
 --num_proc_bsearch=8 \
@@ -37,11 +37,11 @@ python3 -u test.py \
 --use_gru=True \
 --use_gpu=False \
 --share_rnn_weights=False \
---test_manifest="data/librispeech/manifest.test-clean" \
---mean_std_path="models/baidu_en8k/mean_std.npz" \
---vocab_path="models/baidu_en8k/vocab.txt" \
---model_path="models/baidu_en8k" \
---lang_model_path="models/lm/common_crawl_00.prune01111.trie.klm" \
+--test_manifest="data/manifest.test-clean" \
+--mean_std_path="${MAIN_ROOT}/models/baidu_en8k/mean_std.npz" \
+--vocab_path="${MAIN_ROOT}/models/baidu_en8k/vocab.txt" \
+--model_path="${MAIN_ROOT}/models/baidu_en8k" \
+--lang_model_path="${MAIN_ROOT}/models/lm/common_crawl_00.prune01111.trie.klm" \
 --decoding_method="ctc_beam_search" \
 --error_rate_type="wer" \
 --specgram_type="linear"
--- a/examples/librispeech/local/run_tune.sh
+++ b/examples/librispeech/local/run_tune.sh
@@ -2,7 +2,7 @@

 # grid-search for hyper-parameters in language model
 CUDA_VISIBLE_DEVICES=0,1,2,3 \
-python3 -u tools/tune.py \
+python3 -u ${MAIN_ROOT}tools/tune.py \
 --num_batches=-1 \
 --batch_size=128 \
 --beam_size=500 \
--- a/examples/tiny/README.md
+++ b/examples/tiny/README.md
@@ -2,3 +2,41 @@

 1. `source path.sh`
 2. `bash run.sh`
+
+## Steps
+- Prepare the data
+
+    ```bash
+    sh local/run_data.sh
+    ```
+
+    `run_data.sh` will download dataset, generate manifests, collect normalizer's statistics and build vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech` and the corresponding manifest files generated in `${PWD}/data` as well as a mean stddev file and a vocabulary file. It has to be run for the very first time you run this dataset and is reusable for all further experiments.
+- Train your own ASR model
+
+    ```bash
+    sh local/run_train.sh
+    ```
+
+    `run_train.sh` will start a training job, with training logs printed to stdout and model checkpoint of every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints could be used for training resuming, inference, evaluation and deployment.
+- Case inference with an existing model
+
+    ```bash
+    sh local/run_infer.sh
+    ```
+
+    `run_infer.sh` will show us some speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good now as the current model is only trained with a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained (trained for several days, with the complete LibriSpeech) model and do the inference:
+
+    ```bash
+    sh local/run_infer_golden.sh
+    ```
+- Evaluate an existing model
+
+    ```bash
+    sh local/run_test.sh
+    ```
+
+    `run_test.sh` will evaluate the model with Word Error Rate (or Character Error Rate) measurement. Similarly, you can also download a well-trained model and test its performance:
+
+    ```bash
+    sh local/run_test_golden.sh
+    ```
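
Review note: the scripts touched in this commit now resolve `infer.py`, `test.py`, `train.py`, `tools/tune.py` and the model and language-model paths through `${MAIN_ROOT}`, and `examples/tiny/README.md` asks the user to run `source path.sh` first. The contents of `path.sh` are not part of this diff, so the following is only a sketch under the assumption that `path.sh` exports `MAIN_ROOT`. The helper names `require_main_root` and `main_root_path` are hypothetical and do not exist in the repository. The sketch fails early when `MAIN_ROOT` is unset and joins paths with exactly one slash whether or not `MAIN_ROOT` ends in `/`, which is the case that matters for the `${MAIN_ROOT}tools/tune.py` form in `run_tune.sh`.

```bash
#! /usr/bin/env bash

# Hypothetical helpers, not part of this commit.

# Return non-zero with a message when MAIN_ROOT is unset or empty.
# Assumption: MAIN_ROOT is normally exported by `source path.sh`.
require_main_root() {
    if [ -z "${MAIN_ROOT:-}" ]; then
        echo "MAIN_ROOT is not set; run 'source path.sh' first" >&2
        return 1
    fi
}

# Print MAIN_ROOT joined with a relative path using exactly one slash.
# ${MAIN_ROOT%/} strips a single trailing slash if one is present.
main_root_path() {
    require_main_root || return 1
    printf '%s/%s\n' "${MAIN_ROOT%/}" "$1"
}
```

A script could then call, for example, `python3 -u "$(main_root_path tools/tune.py)"` and get the same path for `MAIN_ROOT=/repo` and `MAIN_ROOT=/repo/`.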