# Model Architecture

The DeepSpeech2 online model implemented here is based on the [DeepSpeech2 model](https://arxiv.org/pdf/1512.02595.pdf), with some modifications.

The model is mainly composed of a 2D convolution subsampling layer followed by unidirectional (forward-only) RNN layers. To describe the implementation clearly, it is divided into the following five parts:

1. Feature extraction
2. 2D convolution subsampling layer
3. Forward-only (unidirectional) RNN layers
4. Softmax layer
5. CTC decoder

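Stages 3 to 5 can be illustrated with a minimal pure-Python sketch. It assumes a plain Elman RNN cell (the document does not say which RNN cell is used), a linear projection to the vocabulary before the softmax, and greedy (best-path) CTC decoding as the simplest possible decoder; the actual decoder may differ (for example, beam search with a language model). All function names, weights, and sizes are hypothetical. The key property of an online model is visible in `rnn_forward`: each output depends only on the current and past frames, so results can be produced while audio is still arriving.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Sequence[float], w: Matrix, b: Sequence[float]) -> Vector:
    """Affine map y = W x + b, with W stored row-major as [out_dim][in_dim]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i for row, b_i in zip(w, b)]


def rnn_forward(frames: Sequence[Sequence[float]], w_x: Matrix, w_h: Matrix,
                b: Vector) -> List[Vector]:
    """Forward-only (unidirectional) Elman RNN: h_t = tanh(W_x x_t + W_h h_{t-1} + b).

    h_t depends only on frames[0..t], so no future context is needed.
    """
    h = [0.0] * len(b)
    outputs = []
    for x in frames:
        pre = linear(x, w_x, b)                  # input contribution plus bias
        rec = linear(h, w_h, [0.0] * len(b))     # recurrent contribution
        h = [math.tanh(p + r) for p, r in zip(pre, rec)]
        outputs.append(h)
    return outputs


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ctc_greedy_decode(probs: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Best-path CTC decoding: argmax per frame, merge repeats, then drop blanks."""
    decoded = []
    prev = None
    for p in probs:
        idx = max(range(len(p)), key=lambda k: p[k])
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


if __name__ == "__main__":
    # Hypothetical toy sizes: 2-dim input frames, 3 hidden units, vocabulary of 3 (index 0 = blank).
    frames = [[0.5, -1.0], [0.2, 0.3], [-0.7, 0.9]]
    w_x = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
    w_h = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.1]]
    b = [0.0, 0.1, -0.1]
    w_out = [[0.3, -0.1, 0.2], [0.5, 0.4, -0.3], [-0.2, 0.6, 0.1]]
    b_out = [0.0, 0.0, 0.0]
    hidden = rnn_forward(frames, w_x, w_h, b)
    probs = [softmax(linear(h, w_out, b_out)) for h in hidden]
    print(ctc_greedy_decode(probs))
```

Greedy decoding collapses repeated labels before removing blanks, which is why a blank between two identical labels (frame argmaxes `1, 1, 0, 1`) still yields two separate `1`s.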
The architecture of the model is shown in Fig. 1.

<p align="center">
<img src="../images/ds2onlineModel.png" width=800>
<br/>Fig.1 The architecture of the DeepSpeech2 online model
</p>

# Feature Extraction

Three feature extraction methods are implemented: linear (linear spectrogram), fbank (filterbank), and MFCC (mel-frequency cepstral coefficients).

Let $S = \{(x^1, y^1), (x^2, y^2), \ldots, (x^m, y^m)\}$ be the training set, where $y^i$ is the label corresponding to the utterance $x^i$. Features are extracted from each single utterance $x^i$ sampled from $S$.

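The document does not show how the features are computed. Below is a minimal pure-Python sketch of the first method, assuming that "linear" means a log power spectrogram on a linear frequency axis. The 25 ms / 10 ms framing at 16 kHz (400 / 160 samples), the Hann window, the `eps` floor, and the function names are illustrative assumptions, not the project's settings. A naive DFT is used so the example has no dependencies; a real pipeline would use an FFT.

```python
import math
from typing import List


def frame_signal(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a 1-D signal into overlapping frames; a trailing partial frame is dropped."""
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // hop
    return [samples[i * hop: i * hop + frame_len] for i in range(n_frames)]


def linear_spectrogram(samples: List[float], frame_len: int = 400, hop: int = 160,
                       eps: float = 1e-10) -> List[List[float]]:
    """Log power spectrogram on a linear frequency axis.

    Defaults (hypothetical): 25 ms windows with a 10 ms hop at 16 kHz.
    Uses a naive O(N^2) DFT per frame for clarity; use an FFT in practice.
    """
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # one-sided spectrum of a real-valued signal
    features = []
    for frame in frame_signal(samples, frame_len, hop):
        x = [s * w for s, w in zip(frame, window)]
        row = []
        for k in range(n_bins):
            re = sum(x[n] * math.cos(2 * math.pi * k * n / frame_len) for n in range(frame_len))
            im = -sum(x[n] * math.sin(2 * math.pi * k * n / frame_len) for n in range(frame_len))
            row.append(math.log(re * re + im * im + eps))  # eps avoids log(0)
        features.append(row)
    return features  # shape: [num_frames][frame_len // 2 + 1]
```

With these defaults, one second of 16 kHz audio (16000 samples) yields 1 + (16000 - 400) // 160 = 98 frames of 201 bins each. An fbank feature would instead pass each power spectrum through a bank of triangular mel filters before taking the log, and MFCC would additionally apply a DCT to the log-mel energies.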
# Backbone

The backbone is composed of a 2D convolution subsampling layer.
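
The text does not give the kernel sizes or strides of the subsampling layer. Assuming, hypothetically, two Conv2D layers with kernel size 3, stride 2, and no padding on both the time and frequency axes, the standard convolution output-size formula $\lfloor (L + 2p - k)/s \rfloor + 1$ shows how the layer shortens the sequence before it reaches the RNN. The function names and the 161-dimensional input (an assumed 320-sample window, 320 // 2 + 1 = 161 bins) are illustrative only.

```python
from typing import List, Tuple


def conv_out_len(length: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output size along one axis of a convolution: floor((L + 2p - k) / s) + 1."""
    return (length + 2 * padding - kernel) // stride + 1


def subsampled_shape(num_frames: int, feat_dim: int,
                     layers: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Apply conv_out_len to the time and frequency axes for each (kernel, stride) layer.

    Hypothetical configuration: same kernel/stride on both axes, no padding.
    """
    t, f = num_frames, feat_dim
    for kernel, stride in layers:
        t = conv_out_len(t, kernel, stride)
        f = conv_out_len(f, kernel, stride)
    return t, f


# Two hypothetical 3x3, stride-2 layers: the time axis shrinks by roughly 4x.
print(subsampled_shape(100, 161, [(3, 2), (3, 2)]))  # (24, 39)
```

Under these assumptions, 100 input frames become 24, so the forward-only RNN runs about four times fewer steps, which reduces computation per second of audio.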