For English data, the vocabulary dictionary is composed of 26 English characters with " ' ", space, \<blank\> and \<eos\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character and the \<eos\> represents the start and the end characters. For mandarin, the vocabulary dictionary is composed of Chinese characters statistics from the training set, and three additional characters are added. The added characters are \<blank\>, \<unk\> and \<eos\>. For both English and mandarin data, we set the default indexes that \<blank\>=0, \<unk\>=1 and \<eos\>= last index.
The encoder is composed of two 2D convolution subsampling layers and several stacked single-direction rnn layers. The 2D convolution subsampling layers extract feature representation from the raw audio feature and reduce the length of the audio feature at the same time. After passing through the convolution subsampling layers, then the feature representation is input into the stacked rnn layers. For the stacked rnn layers, LSTM cell and GRU cell are provided to use. Adding one fully connected (fc) layer after the stacked rnn layers are optional. If the number of stacked rnn layers is less than 5, adding one fc layer after stacked rnn layers are recommended.
To get the character possibilities of each frame, the feature representation of each frame output from the encoder is input into a projection layer which is implemented as a dense layer to do feature projection. The output dim of the projection layer is the same as the vocabulary size. After the projection layer, the softmax function is used to transform the frame-level feature representation be the possibilities of characters. While making model inference, the character possibilities of each frame are input into the CTC decoder to get the final speech recognition results.
By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 stages are used for the training process. Stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, vocabulary dictionary, and CMVN file will be generated in "./data/". Stage 1 is used for training the model, the log files and model checkpoint are saved in "exp/deepspeech2_online/". Stage 2 is used to generate the final model for predicting by averaging the top-k model parameters based on validation loss.
After the training process, we use stages 3,4,5 for the testing process. Stage 3 is for testing the model generated in stage 2 and provided the CER index of the test set. Stage 4 is for transforming the model from a dynamic graph to a static graph by using "paddle.jit" library. Stage 5 is for testing the model in a static graph.
The deepspeech2 offline model is similar to the deepspeech2 online model. The main difference between them is the offline model uses the stacked bi-directional rnn layers while the online model uses the single direction rnn layers and the fc layer is not used. For the stacked bi-directional rnn layers in the offline model, the rnn cell and gru cell are provided to use.