From bf40e3ff2d9505e9a8aef739d3e4780801292174 Mon Sep 17 00:00:00 2001
From: Jackwaterveg <87408988+Jackwaterveg@users.noreply.github.com>
Date: Wed, 15 Sep 2021 12:51:29 +0800
Subject: [PATCH 1/4] Update deepspeech_architecture.md

---
 doc/src/deepspeech_architecture.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/doc/src/deepspeech_architecture.md b/doc/src/deepspeech_architecture.md
index dfa60790..04c7bee7 100644
--- a/doc/src/deepspeech_architecture.md
+++ b/doc/src/deepspeech_architecture.md
@@ -20,7 +20,7 @@ The architecture of the model is shown in Fig.1.

 ### Data Preparation
 #### Vocabulary
-For English data, the vocabulary dictionary is composed of 26 English characters with " ' ", space, \<blank\> and \<unk\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character and the represents the start and the end characters. For mandarin, the vocabulary dictionary is composed of chinese characters statisticed from the training set and three additional characters are added. The added characters are \<blank\>, \<unk\> and \<eos\>. For both English and mandarin data, we set the default indexs that \<blank\>=0, \<unk\>=1 and \<eos\>= last index.
+For English data, the vocabulary dictionary is composed of the 26 English characters together with " ' ", space, \<blank\> and \<unk\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character and the \<eos\> represents the start and the end characters. For Mandarin, the vocabulary dictionary is composed of the Chinese characters collected from the training set, plus three additional characters: \<blank\>, \<unk\> and \<eos\>. For both English and Mandarin data, we set the default indices \<blank\>=0, \<unk\>=1 and \<eos\>=last index.
 ```
 # The code to build vocabulary
 cd examples/aishell/s0
@@ -65,17 +65,19 @@ python3 ../../../utils/compute_mean_std.py \
 ```
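To make the index convention above concrete, here is a minimal sketch of a vocabulary built that way; the character list and helper names are hypothetical, not the output of the repository's vocabulary-building utility:
```python
# A toy vocabulary following the stated convention: <blank>=0, <unk>=1,
# <eos>=last index. The character list is a made-up English example.
chars = list("abcdefghijklmnopqrstuvwxyz") + ["'", " "]
vocab = ["<blank>", "<unk>"] + chars + ["<eos>"]
char2idx = {c: i for i, c in enumerate(vocab)}

assert char2idx["<blank>"] == 0
assert char2idx["<unk>"] == 1
assert char2idx["<eos>"] == len(vocab) - 1

def encode(text):
    """Map a transcript to label indices, falling back to <unk>."""
    return [char2idx.get(c, char2idx["<unk>"]) for c in text.lower()]

print(encode("hello world"))
```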
 ### Encoder
-The Backbone is composed of two 2D convolution subsampling layers and a number of stacked single direction rnn layers. The 2D convolution subsampling layers extract feature represention from the raw audio feature and reduce the length of audio feature at the same time. After passing through the convolution subsampling layers, then the feature represention are input into the stacked rnn layers. For rnn layers, LSTM cell and GRU cell are provided. Adding one fully connected (fc) layer after rnn layer is optional, if the number of rnn layers is less than 5, adding one fc layer after rnn layers is recommand.
+The backbone is composed of two 2D convolution subsampling layers and a number of stacked single-direction rnn layers. The 2D convolution subsampling layers extract feature representations from the raw audio features and reduce their length at the same time. After passing through the convolution subsampling layers, the feature representations are fed into the stacked rnn layers. For the stacked rnn layers, both LSTM cells and GRU cells are provided. Adding one fully connected (fc) layer after the stacked rnn layers is optional; if the number of stacked rnn layers is less than 5, adding one fc layer after them is recommended.
 The code of Encoder is in:
 ```
 vi deepspeech/models/ds2_online/deepspeech2.py
 ```
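As a rough illustration of the shape flow this paragraph describes, here is a minimal PaddlePaddle-style sketch; the channel counts, hidden size, and layer count are illustrative assumptions, not the repository's actual deepspeech2.py implementation:
```python
import paddle
import paddle.nn as nn

class OnlineEncoderSketch(nn.Layer):
    """Sketch: two 2D conv subsampling layers + a single-direction rnn stack."""
    def __init__(self, feat_dim=80, rnn_size=1024, num_rnn_layers=3):
        super().__init__()
        # Each conv halves both the time axis and the feature axis.
        self.conv = nn.Sequential(
            nn.Conv2D(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2D(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU())
        sub = lambda n: (n - 1) // 2 + 1  # output length of one stride-2 conv
        self.rnn = nn.LSTM(input_size=32 * sub(sub(feat_dim)),
                           hidden_size=rnn_size,
                           num_layers=num_rnn_layers,
                           direction='forward')
        # Optional fc layer after the rnn stack (recommended for shallow stacks).
        self.fc = nn.Linear(rnn_size, rnn_size)

    def forward(self, feats):
        x = feats.unsqueeze(1)                 # (B, T, F) -> (B, 1, T, F)
        x = self.conv(x)                       # time and feature axes subsampled
        b, c, t, f = x.shape
        x = x.transpose([0, 2, 1, 3]).reshape([b, t, c * f])
        out, _ = self.rnn(x)                   # (B, T', rnn_size)
        return self.fc(out)

enc = OnlineEncoderSketch()
hidden = enc(paddle.randn([4, 100, 80]))       # -> shape [4, 25, 1024]
```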
 ### Decoder
-To got the character possibilities of each frame, the feature represention of each frame output from the backbone are input into a projection layer which is implemented as a dense layer to do projection. The output dim of the projection layer is same with the vocabulary size. After projection layer, the softmax function is used to make frame-level feature representation be the possibilities of characters. While making model inference, the character possibilities of each frame are input into the CTC decoder to get the final speech recognition results.
-The code of Encoder is in:
+To get the character probabilities of each frame, the feature representation of each frame output from the encoder is fed into a projection layer, which is implemented as a dense layer, to do feature projection. The output dim of the projection layer is the same as the vocabulary size. After the projection layer, the softmax function is used to transform the frame-level feature representations into the probabilities of characters. During model inference, the character probabilities of each frame are input into the CTC decoder to get the final speech recognition results.
+The code of Decoder is in:
 ```
+# The code of constructing the decoder in model
 vi deepspeech/models/ds2_online/deepspeech2.py
+# The code of CTC Decoder
 vi deepspeech/modules/ctc.py
 ```
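The repository's CTC decoder lives in deepspeech/modules/ctc.py; as a rough illustration of the idea only, here is a minimal best-path (greedy) sketch, assuming \<blank\>=0 as set during data preparation. The function name and NumPy interface are illustrative assumptions:
```python
import numpy as np

def ctc_greedy_decode(probs, vocab, blank=0):
    """Best-path CTC decoding over per-frame character probabilities.

    probs: array of shape (num_frames, vocab_size), e.g. softmax output.
    vocab: list mapping index -> character.
    """
    best = probs.argmax(axis=1)          # most probable label per frame
    out, prev = [], blank
    for idx in best:
        # Collapse consecutive repeats, then drop blanks.
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = int(idx)
    return "".join(out)
```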
@@ -119,7 +121,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     avg.sh exp/${ckpt}/checkpoints ${avg_num}
 fi
 ```
-By using the command above, the training process can be started. There are 5 stages in run.sh, and the first 3 stages are used for training process. The stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, vocabulary dictionary and CMVN file will be generated in "./data/". The stage 1 is used for training the model, the log files and model checkpoint is saved in "exp/deepspeech2_online/". The stage 2 is used to generated final model for predicting by averaging the top-k model parameters based on validation loss.
+By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 stages are used for the training process. Stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, the vocabulary dictionary and the CMVN file will be generated in "./data/". Stage 1 is used for training the model; the log files and model checkpoints are saved in "exp/deepspeech2_online/". Stage 2 is used to generate the final model for prediction by averaging the top-k model parameters based on the validation loss.

 ## Testing Process
 Using the command below, you can test the deepspeech2 online model.
@@ -152,7 +154,7 @@ After the training process, we use stage 3,4,5 for testing process. The stage 3

 ## Non-Streaming

-The deepspeech2 offline model is similarity to the deepspeech2 online model. The main difference between them is the offline model use the bi-directional rnn layers while the online model use the single direction rnn layers and the fc layer is not used.
+The deepspeech2 offline model is similar to the deepspeech2 online model. The main difference between them is that the offline model uses stacked bi-directional rnn layers while the online model uses single-direction rnn layers, and the fc layer is not used in the offline model. For the stacked bi-directional rnn layers in the offline model, both the rnn cell and the gru cell are provided. The architecture of the model is shown in Fig.2.
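As a rough sketch of this online/offline difference (using paddle.nn.GRU since the gru cell is one of the provided options; the layer sizes are illustrative assumptions, not the project's configuration):
```python
import paddle.nn as nn

# Online (streaming) variant: a single-direction rnn stack, so each output
# frame depends only on past context and inference can run incrementally.
online_rnn = nn.GRU(input_size=640, hidden_size=1024,
                    num_layers=3, direction='forward')

# Offline variant: a bi-directional rnn stack, so each output frame also
# sees future context; the whole utterance must be available before decoding.
offline_rnn = nn.GRU(input_size=640, hidden_size=1024,
                     num_layers=3, direction='bidirect')
```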
@@ -162,9 +164,9 @@ The arcitecture of the model is shown in Fig.2.

-For data preparation, decoder, the deepspeech2 offline model is same with the deepspeech2 online model.
+For data preparation and decoder, the deepspeech2 offline model is the same as the deepspeech2 online model.

- The code of encoder and decoder for deepspeech2 offline model is in:
+The code of encoder and decoder for the deepspeech2 offline model is in:
 ```
 vi deepspeech/models/ds2/deepspeech2.py
 ```

From c8d62807b39c5b2db627928af4a8ee8519dc5cec Mon Sep 17 00:00:00 2001
From: Jackwaterveg <87408988+Jackwaterveg@users.noreply.github.com>
Date: Wed, 15 Sep 2021 13:01:31 +0800
Subject: [PATCH 2/4] Update deepspeech_architecture.md

---
 docs/src/deepspeech_architecture.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/src/deepspeech_architecture.md b/docs/src/deepspeech_architecture.md
index 6c495189..f4cfcf9b 100644
--- a/docs/src/deepspeech_architecture.md
+++ b/docs/src/deepspeech_architecture.md
@@ -66,6 +66,7 @@ python3 ../../../utils/compute_mean_std.py \
 ### Encoder
 The backbone is composed of two 2D convolution subsampling layers and a number of stacked single-direction rnn layers. The 2D convolution subsampling layers extract feature representations from the raw audio features and reduce their length at the same time. After passing through the convolution subsampling layers, the feature representations are fed into the stacked rnn layers. For the stacked rnn layers, both LSTM cells and GRU cells are provided. Adding one fully connected (fc) layer after the stacked rnn layers is optional; if the number of stacked rnn layers is less than 5, adding one fc layer after them is recommended.
+
 The code of Encoder is in:
 ```
 vi deepspeech/models/ds2_online/deepspeech2.py
@@ -73,7 +74,8 @@ vi deepspeech/models/ds2_online/deepspeech2.py
 ### Decoder
 To get the character probabilities of each frame, the feature representation of each frame output from the encoder is fed into a projection layer, which is implemented as a dense layer, to do feature projection. The output dim of the projection layer is the same as the vocabulary size. After the projection layer, the softmax function is used to transform the frame-level feature representations into the probabilities of characters. During model inference, the character probabilities of each frame are input into the CTC decoder to get the final speech recognition results.
-The code of Decoder is in:
+
+The code of the decoder is in:
 ```
 # The code of constructing the decoder in model
 vi deepspeech/models/ds2_online/deepspeech2.py

From 6e5d15250340474aef0b49e573f3fc97f94cdcb3 Mon Sep 17 00:00:00 2001
From: Jackwaterveg <87408988+Jackwaterveg@users.noreply.github.com>
Date: Wed, 15 Sep 2021 13:03:02 +0800
Subject: [PATCH 3/4] Update deepspeech_architecture.md

---
 docs/src/deepspeech_architecture.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/deepspeech_architecture.md b/docs/src/deepspeech_architecture.md
index f4cfcf9b..580b1388 100644
--- a/docs/src/deepspeech_architecture.md
+++ b/docs/src/deepspeech_architecture.md
@@ -65,7 +65,7 @@ python3 ../../../utils/compute_mean_std.py \
 ```

 ### Encoder
-The backbone is composed of two 2D convolution subsampling layers and a number of stacked single-direction rnn layers. The 2D convolution subsampling layers extract feature representations from the raw audio features and reduce their length at the same time. After passing through the convolution subsampling layers, the feature representations are fed into the stacked rnn layers. For the stacked rnn layers, both LSTM cells and GRU cells are provided. Adding one fully connected (fc) layer after the stacked rnn layers is optional; if the number of stacked rnn layers is less than 5, adding one fc layer after them is recommended.
+The encoder is composed of two 2D convolution subsampling layers and a number of stacked single-direction rnn layers. The 2D convolution subsampling layers extract feature representations from the raw audio features and reduce their length at the same time. After passing through the convolution subsampling layers, the feature representations are fed into the stacked rnn layers. For the stacked rnn layers, both LSTM cells and GRU cells are provided. Adding one fully connected (fc) layer after the stacked rnn layers is optional; if the number of stacked rnn layers is less than 5, adding one fc layer after them is recommended.

 The code of Encoder is in:
 ```

From 0b2c794d88d4019462e3cf5209565c3a6bf31239 Mon Sep 17 00:00:00 2001
From: Jackwaterveg <87408988+Jackwaterveg@users.noreply.github.com>
Date: Wed, 15 Sep 2021 13:14:12 +0800
Subject: [PATCH 4/4] Emphasize the setup stage in install.md

---
 docs/src/install.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/src/install.md b/docs/src/install.md
index 01049a2f..79460737 100644
--- a/docs/src/install.md
+++ b/docs/src/install.md
@@ -6,13 +6,14 @@ To avoid the trouble of environment setup, [running in Docker container](#runnin
 - Python >= 3.7
 - PaddlePaddle 2.0.0 or later (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))

-## Setup
+## Setup (Important)
 - Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost`, `sox`, and `swig`, e.g. installing them via `apt-get`:
 ```bash
 sudo apt-get install -y sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev
 ```
+The version of `swig` should be >= 3.0
 or, installing them via `yum`: