fix textfrontend readme, fix imgs link

pull/941/head
TianYuan 3 years ago
parent 41526ca1b8
commit 670a68ad95

@@ -125,7 +125,7 @@ The current hyperlinks redirect to [Previous Parakeet](https://github.com/Paddle
<tr>
<td rowspan="6">Acoustic Model</td>
<td rowspan="4" >Aishell</td>
<td >2 Conv + 5 LSTM layers with only forward direction </td>
<td >2 Conv + 5 LSTM layers with only forward direction </td>
<td>
<a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz">Ds2 Online Aishell Model</a>
</td>
@@ -318,4 +318,3 @@ PaddleSpeech is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
PaddleSpeech depends on a lot of open source repos. See [references](docs/source/asr/reference.md) for more information.

@@ -13,7 +13,7 @@ In addition, the training process and the testing process are also introduced.
The architecture of the model is shown in Fig.1.
<p align="center">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/ds2onlineModel.png" width=800>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/ds2onlineModel.png" width=800>
<br/>Fig.1 The architecture of the deepspeech2 online model
</p>
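As a rough illustration of that layer stack, here is a minimal sketch in `paddle.nn`; the channel counts, feature dimension, and hidden size are assumptions for illustration, not the released configuration.

```python
import paddle
import paddle.nn as nn

class DS2OnlineSketch(nn.Layer):
    """Illustrative 2-Conv + 5-layer forward-only LSTM stack; all sizes are
    assumptions for illustration, not the released configuration."""

    def __init__(self, feat_dim=161, hidden=1024):
        super().__init__()
        # Two 2-D convolutions over (time, frequency); stride 2 on frequency only,
        # so the time axis is preserved for streaming.
        self.conv = nn.Sequential(
            nn.Conv2D(1, 32, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ReLU(),
            nn.Conv2D(32, 32, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ReLU(), )
        rnn_in = 32 * ((feat_dim + 3) // 4)   # frequency axis halved twice
        # Five stacked LSTM layers, forward direction only: no future context,
        # which is what makes online (streaming) decoding possible.
        self.lstm = nn.LSTM(rnn_in, hidden, num_layers=5, direction='forward')

    def forward(self, spec):                          # spec: (batch, time, feat_dim)
        x = self.conv(spec.unsqueeze(1))              # (batch, 32, time, feat_dim//4)
        x = x.transpose([0, 2, 1, 3]).flatten(2)      # (batch, time, rnn_in)
        out, _ = self.lstm(x)                         # (batch, time, hidden)
        return out
```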
@@ -160,7 +160,7 @@ The deepspeech2 offline model is similar to the deepspeech2 online model. The
The architecture of the model is shown in Fig.2.
<p align="center">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/ds2offlineModel.png" width=800>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/ds2offlineModel.png" width=800>
<br/>Fig.2 The architecture of the deepspeech2 offline model
</p>

@@ -54,7 +54,7 @@ CUDA_VISIBLE_DEVICES=0 bash local/tune.sh
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and can optionally draw the error surface. A proper hyper-parameter range should include the global minimum of the error surface for WER/CER, as illustrated in the following figure.
<p align="center">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tuning_error_surface.png" width=550>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/tuning_error_surface.png" width=550>
<br/>An example error surface for tuning on the dev-clean set of LibriSpeech
</p>
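The sweep itself is just an exhaustive loop over the two decoding hyper-parameters. A minimal sketch, where `decode_and_score` is a hypothetical stand-in (here a toy bowl-shaped surface) for decoding the dev set at a given LM weight `alpha` and word-insertion weight `beta`:

```python
import itertools
import numpy as np

def decode_and_score(alpha, beta):
    """Hypothetical stand-in for decoding the dev set with LM weight `alpha`
    and word-insertion weight `beta`. Here: a toy bowl-shaped error surface."""
    return (alpha - 2.2) ** 2 + (beta - 0.3) ** 2

# Exhaustively sweep the 2-D grid and keep the point with the lowest WER/CER.
alphas = np.linspace(1.0, 3.2, 45)    # illustrative search range
betas = np.linspace(0.1, 0.45, 8)
best = min(itertools.product(alphas, betas), key=lambda ab: decode_and_score(*ab))
print("best (alpha, beta):", best)
```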

@@ -27,14 +27,14 @@ At present, there are two mainstream acoustic model structures.
- Acoustic decoder (N Frames -> N Frames).
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/frame_level_am.png" width=500 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/frame_level_am.png" width=500 /> <br>
</div>
- Sequence to sequence acoustic model:
- M Tokens -> N Frames.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/seq2seq_am.png" width=500 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/seq2seq_am.png" width=500 /> <br>
</div>
### Tacotron2
@@ -54,7 +54,7 @@ At present, there are two mainstream acoustic model structures.
- CBHG postprocess.
- Vocoder: Griffin-Lim (see the usage sketch after the figure).
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron.png" width=700 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/tacotron.png" width=700 /> <br>
</div>
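Griffin-Lim reconstructs a waveform from a magnitude spectrogram by iteratively estimating the missing phase. A minimal usage sketch with `librosa` (not the code used by PaddleSpeech):

```python
import librosa
import numpy as np

y, sr = librosa.load(librosa.example('trumpet'))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude spectrogram

# Iteratively estimate the missing phase and invert back to a waveform.
y_rec = librosa.griffinlim(S, n_iter=32, hop_length=256)
```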
**Advantage of Tacotron:**
@@ -89,7 +89,7 @@ At present, there are two mainstream acoustic model structures.
- The alignment matrix of the previous time step is considered at step `t` of the decoder (see the sketch below).
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron2.png" width=500 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/tacotron2.png" width=500 /> <br>
</div>
You can find PaddleSpeech TTS's tacotron2 with LJSpeech dataset example at [examples/ljspeech/tts0](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts0).
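Using the previous alignment is the "location-sensitive" part of Tacotron2's attention: location features are extracted from the step `t-1` weights (typically with a 1-D convolution) and added to the additive-attention energy. A NumPy sketch with illustrative weight shapes (not PaddleSpeech's implementation):

```python
import numpy as np

def location_sensitive_scores(query, keys, prev_align, W, V, U, b, w):
    """query: (d,) decoder state; keys: (T, d) encoder states;
    prev_align: (T,) attention weights from step t-1. All weights illustrative."""
    kernel = np.ones(3) / 3.0    # toy location filter; real models use a wider conv
    loc = np.convolve(prev_align, kernel, mode='same')[:, None]   # (T, 1) location feats
    # Additive (Bahdanau-style) energy with an extra location term.
    energy = np.tanh(query @ W + keys @ V + loc @ U + b) @ w      # (T,)
    e = np.exp(energy - energy.max())
    return e / e.sum()                                            # new alignment

# Shape check with random weights: d=4 inputs, a=8 attention dim, T=6 encoder steps.
T, d, a = 6, 4, 8
rng = np.random.default_rng(0)
align = location_sensitive_scores(
    rng.normal(size=d), rng.normal(size=(T, d)), np.ones(T) / T,
    rng.normal(size=(d, a)), rng.normal(size=(d, a)),
    rng.normal(size=(1, a)), np.zeros(a), rng.normal(size=a))
```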
@@ -118,7 +118,7 @@ Transformer TTS is a combination of Tacotron2 and Transformer.
- Positional Encoding.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/transformer.png" width=500 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/transformer.png" width=500 /> <br>
</div>
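The positional encoding is the standard sinusoidal one from the Transformer paper; a NumPy sketch (assuming an even `d_model`):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(same angle). Assumes an even d_model."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe   # added to the embeddings so the model sees absolute positions
```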
#### Transformer TTS
@@ -138,7 +138,7 @@ Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2.
- Uniform scale position encoding may have a negative impact on input or output sequences.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/transformer_tts.png" width=500 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/transformer_tts.png" width=500 /> <br>
</div>
**Disadvantages of Transformer TTS:**
@@ -184,14 +184,14 @@ Instead of using the encoder-attention-decoder based architecture as adopted by
- Can be generated in parallel (decoding time is less affected by sequence length); see the length regulator sketch after the figure.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech.png" width=800 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/fastspeech.png" width=800 /> <br>
</div>
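The parallel generation comes from the length regulator: every encoder state is repeated according to its predicted duration, so the M token-level states become N frame-level states in one step. A minimal sketch:

```python
import numpy as np

def length_regulate(encoder_out, durations):
    """encoder_out: (M, d) token-level states; durations: (M,) frames per token.
    Returns (N, d) frame-level states with N = durations.sum()."""
    return np.repeat(encoder_out, durations, axis=0)

tokens = np.random.randn(4, 8)                       # M=4 tokens, d=8
frames = length_regulate(tokens, np.array([2, 3, 1, 4]))
print(frames.shape)                                  # (10, 8), produced in one shot
```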
#### FastPitch
[FastPitch](https://arxiv.org/abs/2006.06873) follows FastSpeech. A single pitch value is predicted for every temporal location, which improves the overall quality of synthesized speech.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastpitch.png" width=500 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/fastpitch.png" width=500 /> <br>
</div>
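At training time the target for that per-token pitch is the frame-level F0 averaged over each token's duration. A sketch of the averaging, assuming the durations are given:

```python
import numpy as np

def token_averaged_pitch(frame_f0, durations):
    """frame_f0: (N,) per-frame F0; durations: (M,) frames per token, sum == N.
    Returns (M,) one pitch target per input token."""
    bounds = np.cumsum(durations)[:-1]
    return np.array([seg.mean() if seg.size else 0.0
                     for seg in np.split(frame_f0, bounds)])

f0 = np.array([100., 110., 120., 130., 140.])
print(token_averaged_pitch(f0, np.array([2, 3])))   # [105. 130.]
```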
#### FastSpeech2
@@ -209,7 +209,7 @@ Instead of using the encoder-attention-decoder based architecture as adopted by
FastSpeech2 is similar to FastPitch but introduces more variation information of speech.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech2.png" width=800 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/fastspeech2.png" width=800 /> <br>
</div>
You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts3). We use the token-averaged pitch and energy values introduced in FastPitch rather than the frame-level ones in FastSpeech2.
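Schematically, FastSpeech2's variance adaptor sits between encoder and decoder and injects the predicted variation information back into the hidden sequence before length regulation. A sketch where the predictors and embedding are illustrative callables, not PaddleSpeech's API:

```python
import numpy as np

def variance_adaptor(h, predict_pitch, predict_energy, predict_duration, embed):
    """h: (M, d) encoder states. The callables are illustrative stand-ins for
    the trained predictor networks and the pitch/energy embedding lookup."""
    h = h + embed(predict_pitch(h))         # add pitch information per token
    h = h + embed(predict_energy(h))        # add energy information per token
    durations = predict_duration(h)         # (M,) integer frames per token
    return np.repeat(h, durations, axis=0)  # length-regulate to frame level
```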
@@ -223,7 +223,7 @@ You can find PaddleSpeech TTS's SpeedySpeech with CSMSC dataset example
- Describes a simple data augmentation technique that can be used early in training to make the teacher network robust to sequential error propagation.
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/speedyspeech.png" width=500 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/speedyspeech.png" width=500 /> <br>
</div>
You can find PaddleSpeech TTS's SpeedySpeech with CSMSC dataset example at [examples/csmsc/tts2](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts2).
@@ -289,7 +289,7 @@ You can find PaddleSpeech TTS's WaveFlow with LJSpeech dataset example at [examp
- Multi-resolution STFT loss (sketched below).
<div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/pwg.png" width=600 /> <br>
<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/pwg.png" width=600 /> <br>
</div>
You can find PaddleSpeech TTS's Parallel WaveGAN with CSMSC example at [examples/csmsc/voc1](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1).
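The multi-resolution STFT loss compares real and generated audio at several FFT resolutions, combining a spectral-convergence term and a log-magnitude term. A NumPy/librosa sketch (the three resolutions follow the Parallel WaveGAN paper, but treat the exact values as illustrative):

```python
import numpy as np
import librosa

def stft_loss(x, y, n_fft, hop):
    """Spectral convergence + log-magnitude distance at one resolution."""
    X = np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop)) + 1e-7
    Y = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) + 1e-7
    sc = np.linalg.norm(Y - X) / np.linalg.norm(Y)   # spectral convergence
    mag = np.mean(np.abs(np.log(Y) - np.log(X)))     # log STFT magnitude
    return sc + mag

def multi_resolution_stft_loss(x, y):
    """Average the single-resolution losses over three FFT settings."""
    settings = [(1024, 120), (2048, 240), (512, 50)]  # (n_fft, hop), illustrative
    return sum(stft_loss(x, y, n, h) for n, h in settings) / len(settings)
```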

@@ -21,72 +21,18 @@ Run the command below to get the test results.
```
The `avg WER` of g2p is: 0.027495061517943988
```text
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,------------------------------------------------------------------------.
| ./exp/g2p/text.g2p |
|------------------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|------+-----------------+-----------------------------------------------|
| bak | 9996 299181 | 290969 8198 14 14 8226 5249 |
|========================================================================|
| Sum | 9996 299181 | 290969 8198 14 14 8226 5249 |
|========================================================================|
| Mean |9996.0 299181.0 |290969.0 8198.0 14.0 14.0 8226.0 5249.0 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
|Median|9996.0 299181.0 |290969.0 8198.0 14.0 14.0 8226.0 5249.0 |
`------------------------------------------------------------------------'
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,--------------------------------------------------------------------.
| ./exp/g2p/text.g2p |
|--------------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+-----------------+-----------------------------------------|
| bak | 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 |
|====================================================================|
| Sum/Avg| 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 |
|====================================================================|
| Mean |9996.0 299181.0 | 97.3 2.7 0.0 0.0 2.7 52.5 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
| Median |9996.0 299181.0 | 97.3 2.7 0.0 0.0 2.7 52.5 |
`--------------------------------------------------------------------'
```
The `avg CER` of text normalization is: 0.006388318503308237
```text
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,----------------------------------------------------------------.
| ./exp/textnorm/text.tn |
|----------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|------+--------------+------------------------------------------|
| utt | 125 2254 | 2241 2 11 2 15 4 |
|================================================================|
| Sum | 125 2254 | 2241 2 11 2 15 4 |
|================================================================|
| Mean |125.0 2254.0 |2241.0 2.0 11.0 2.0 15.0 4.0 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
|Median|125.0 2254.0 |2241.0 2.0 11.0 2.0 15.0 4.0 |
`----------------------------------------------------------------'
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,-----------------------------------------------------------------.
| ./exp/textnorm/text.tn |
|-----------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+--------------+-----------------------------------------|
| utt | 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 |
|=================================================================|
| Sum/Avg| 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 |
|=================================================================|
| Mean |125.0 2254.0 | 99.4 0.1 0.5 0.1 0.7 3.2 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
| Median |125.0 2254.0 | 99.4 0.1 0.5 0.1 0.7 3.2 |
`-----------------------------------------------------------------'
```
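Both summaries are standard sclite-style tables; the reported error rate is edit distance over reference length. A minimal sketch of the underlying WER/CER computation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (DP over two rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution (or match)
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

# WER tokenizes on words; CER treats each character as a token.
print(error_rate("ni hao".split(), "ni hao ma".split()))   # 0.5 (one insertion)
```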

@@ -37,4 +37,3 @@
```bash
bash local/export.sh ckpt_path saved_jit_model_path
```
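Under the hood, this kind of export typically converts the dygraph model to a static graph and saves it for inference. A hedged sketch with `paddle.jit`, where the model and input spec are placeholders rather than the script's actual arguments:

```python
import paddle
import paddle.nn as nn
from paddle.static import InputSpec

model = nn.Linear(80, 40)    # stand-in for the trained checkpoint's network
static_model = paddle.jit.to_static(
    model,
    input_spec=[InputSpec(shape=[None, 80], dtype='float32')])  # illustrative spec
paddle.jit.save(static_model, 'saved_jit_model_path/model')     # *.pdmodel / *.pdiparams
```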

@@ -53,8 +53,8 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
peek_example = minibatch[0]
assert len(peek_example.shape) == 1, "text example is an 1D tensor"
lengths = [example.shape[0] for example in
minibatch] # assume (channel, n_samples) or (n_samples, )
lengths = [example.shape[0] for example in minibatch
] # assume (channel, n_samples) or (n_samples, )
max_len = np.max(lengths)
batch = []

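The rest of `batch_text_id` (not shown in this hunk) pads every example up to `max_len`. A self-contained sketch of the same padding logic, with the return format assumed for illustration:

```python
import numpy as np

def pad_text_batch(minibatch, pad_id=0, dtype=np.int64):
    """Pad a list of 1-D id arrays to the longest length in the batch.
    The returned (batch, lengths) format is assumed for illustration."""
    lengths = np.array([example.shape[0] for example in minibatch], dtype=np.int64)
    max_len = lengths.max()
    batch = [np.pad(example, (0, max_len - example.shape[0]),
                    mode='constant', constant_values=pad_id)
             for example in minibatch]
    return np.array(batch, dtype=dtype), lengths

ids, lens = pad_text_batch([np.array([1, 2, 3]), np.array([4, 5])])
print(ids)    # [[1 2 3]
              #  [4 5 0]]
```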
@@ -67,16 +67,19 @@ class LJSpeechCollector(object):
# Sort by text_len in descending order
texts = [
i for i, _ in sorted(
i
for i, _ in sorted(
zip(texts, text_lens), key=lambda x: x[1], reverse=True)
]
mels = [
i for i, _ in sorted(
i
for i, _ in sorted(
zip(mels, text_lens), key=lambda x: x[1], reverse=True)
]
mel_lens = [
i for i, _ in sorted(
i
for i, _ in sorted(
zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
]
