fix textfrontend readme, fix imgs link

pull/941/head
TianYuan 3 years ago
parent 41526ca1b8
commit 670a68ad95

@@ -125,7 +125,7 @@ The current hyperlinks redirect to [Previous Parakeet](https://github.com/Paddle
<tr>
<td rowspan="6">Acoustic Model</td>
<td rowspan="4" >Aishell</td>
<td >2 Conv + 5 LSTM layers with only forward direction </td>
<td>
<a href = "https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz">Ds2 Online Aishell Model</a>
</td>
@@ -318,4 +318,3 @@ PaddleSpeech is provided under the [Apache-2.0 License](./LICENSE).
## Acknowledgement
PaddleSpeech depends on a lot of open source repos. See [references](docs/source/asr/reference.md) for more information.

@@ -13,7 +13,7 @@ In addition, the training process and the testing process are also introduced.
The architecture of the model is shown in Fig.1.
<p align="center">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/ds2onlineModel.png" width=800>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/ds2onlineModel.png" width=800>
<br/>Fig.1 The architecture of the deepspeech2 online model
</p>
@@ -160,7 +160,7 @@ The deepspeech2 offline model is similar to the deepspeech2 online model. The
The architecture of the model is shown in Fig.2.
<p align="center">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/ds2offlineModel.png" width=800>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/ds2offlineModel.png" width=800>
<br/>Fig.2 The architecture of the deepspeech2 offline model
</p>

@@ -54,7 +54,7 @@ CUDA_VISIBLE_DEVICES=0 bash local/tune.sh
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and can optionally draw the error surface. A proper hyper-parameter range should include the global minimum of the error surface for WER/CER, as illustrated in the following figure.
<p align="center">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tuning_error_surface.png" width=550>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/tuning_error_surface.png" width=550>
<br/>An example error surface for tuning on the dev-clean set of LibriSpeech
</p>
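The grid search described in this hunk can be sketched in a few lines. `decode_and_score` below is a hypothetical stand-in for the repo's actual decode-and-score pipeline, and the parameter ranges are illustrative only, not the tuned values:

```python
import itertools

def grid_search(decode_and_score, alphas, betas):
    """Evaluate every (alpha, beta) pair and return the best point
    plus the full error surface.

    decode_and_score(alpha, beta) -> error rate (WER or CER);
    here a hypothetical callable, not the repo's real scorer.
    """
    surface = {}
    for alpha, beta in itertools.product(alphas, betas):
        surface[(alpha, beta)] = decode_and_score(alpha, beta)
    best = min(surface, key=surface.get)
    return best, surface

# Toy error surface with a known minimum at alpha=2.0, beta=0.5.
best, surface = grid_search(
    lambda a, b: (a - 2.0) ** 2 + (b - 0.5) ** 2,
    alphas=[1.0, 1.5, 2.0, 2.5],
    betas=[0.0, 0.5, 1.0])
```

If the true minimum sits on the boundary of the searched ranges, the ranges should be widened, which is exactly the situation the error-surface plot helps to spot.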

@@ -27,14 +27,14 @@ At present, there are two mainstream acoustic model structures.
- Acoustic decoder (N Frames -> N Frames).
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/frame_level_am.png" width=500 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/frame_level_am.png" width=500 /> <br>
</div>
- Sequence-to-sequence acoustic model:
  - M Tokens -> N Frames.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/seq2seq_am.png" width=500 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/seq2seq_am.png" width=500 /> <br>
</div>
### Tacotron2
@@ -54,7 +54,7 @@ At present, there are two mainstream acoustic model structures.
- CBHG postprocess.
- Vocoder: Griffin-Lim.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron.png" width=700 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/tacotron.png" width=700 /> <br>
</div>
**Advantage of Tacotron:**
@@ -89,7 +89,7 @@ At present, there are two mainstream acoustic model structures.
- The alignment matrix of the previous time step is considered at step `t` of the decoder.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron2.png" width=500 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/tacotron2.png" width=500 /> <br>
</div>
You can find PaddleSpeech TTS's Tacotron2 example with the LJSpeech dataset at [examples/ljspeech/tts0](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts0).
@@ -118,7 +118,7 @@ Transformer TTS is a combination of Tacotron2 and Transformer.
- Positional Encoding.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/transformer.png" width=500 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/transformer.png" width=500 /> <br>
</div>
#### Transformer TTS
@@ -138,7 +138,7 @@ Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2.
- Uniform scale position encoding may have a negative impact on input or output sequences.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/transformer_tts.png" width=500 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/transformer_tts.png" width=500 /> <br>
</div>
**Disadvantages of Transformer TTS:**
@@ -184,14 +184,14 @@ Instead of using the encoder-attention-decoder based architecture as adopted by
• Can be generated in parallel (decoding time is less affected by sequence length)
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech.png" width=800 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/fastspeech.png" width=800 /> <br>
</div>
#### FastPitch
[FastPitch](https://arxiv.org/abs/2006.06873) follows FastSpeech. A single pitch value is predicted for every temporal location, which improves the overall quality of synthesized speech.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastpitch.png" width=500 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/fastpitch.png" width=500 /> <br>
</div>
@@ -209,7 +209,7 @@ Instead of using the encoder-attention-decoder based architecture as adopted by
FastSpeech2 is similar to FastPitch but introduces more variation information of speech.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech2.png" width=800 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/fastspeech2.png" width=800 /> <br>
</div>
You can find PaddleSpeech TTS's FastSpeech2/FastPitch example with the CSMSC dataset at [examples/csmsc/tts3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts3). We use token-averaged pitch and energy values introduced in FastPitch rather than the frame-level ones in FastSpeech2.
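Token averaging, as mentioned in the hunk above, collapses a frame-level contour (pitch or energy) into one value per input token using the predicted per-token durations. A minimal NumPy sketch of that idea, with illustrative names (not the repo's actual helper):

```python
import numpy as np

def token_average(frame_values, durations):
    """Average a frame-level contour over each token's frames.

    frame_values: 1-D array of per-frame pitch/energy values.
    durations: per-token durations in frames; must sum to len(frame_values).
    """
    assert sum(durations) == len(frame_values)
    out = []
    start = 0
    for d in durations:
        # Tokens with zero duration get a neutral value of 0.0.
        out.append(float(np.mean(frame_values[start:start + d])) if d > 0 else 0.0)
        start += d
    return out

# 4 frames split over 2 tokens of 2 frames each -> one mean per token.
token_average(np.array([1.0, 2.0, 3.0, 4.0]), [2, 2])  # -> [1.5, 3.5]
```

The averaging makes the pitch/energy targets smoother and aligned with the token sequence, which is the property FastPitch exploits.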
@@ -223,7 +223,7 @@ You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example
- Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/speedyspeech.png" width=500 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/speedyspeech.png" width=500 /> <br>
</div>
You can find PaddleSpeech TTS's SpeedySpeech with CSMSC dataset example at [examples/csmsc/tts2](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts2).
@@ -289,7 +289,7 @@ You can find PaddleSpeech TTS's WaveFlow with LJSpeech dataset example at [examp
- Multi-resolution STFT loss.
<div align="left">
-<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/pwg.png" width=600 /> <br>
+<img src="https://raw.githubusercontent.com/PaddlePaddle/DeepSpeech/develop/docs/images/pwg.png" width=600 /> <br>
</div>
You can find PaddleSpeech TTS's Parallel WaveGAN with CSMSC example at [examples/csmsc/voc1](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1).

@@ -21,72 +21,18 @@ Run the command below to get the results of test.
```
The `avg WER` of g2p is: 0.027495061517943988
```text
-SYSTEM SUMMARY PERCENTAGES by SPEAKER
-,------------------------------------------------------------------------.
-| ./exp/g2p/text.g2p |
-|------------------------------------------------------------------------|
-| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
-|------+-----------------+-----------------------------------------------|
-| bak | 9996 299181 | 290969 8198 14 14 8226 5249 |
-|========================================================================|
-| Sum | 9996 299181 | 290969 8198 14 14 8226 5249 |
-|========================================================================|
-| Mean |9996.0 299181.0 |290969.0 8198.0 14.0 14.0 8226.0 5249.0 |
-| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
-|Median|9996.0 299181.0 |290969.0 8198.0 14.0 14.0 8226.0 5249.0 |
-`------------------------------------------------------------------------'
-SYSTEM SUMMARY PERCENTAGES by SPEAKER
,--------------------------------------------------------------------.
-| ./exp/g2p/text.g2p |
+| | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
-|--------------------------------------------------------------------|
-| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+-----------------+-----------------------------------------|
-| bak | 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 |
-|====================================================================|
| Sum/Avg| 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 |
-|====================================================================|
-| Mean |9996.0 299181.0 | 97.3 2.7 0.0 0.0 2.7 52.5 |
-| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
-| Median |9996.0 299181.0 | 97.3 2.7 0.0 0.0 2.7 52.5 |
`--------------------------------------------------------------------'
```
The `avg CER` of text normalization is: 0.006388318503308237
```text
-SYSTEM SUMMARY PERCENTAGES by SPEAKER
-,----------------------------------------------------------------.
-| ./exp/textnorm/text.tn |
-|----------------------------------------------------------------|
-| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
-|------+--------------+------------------------------------------|
-| utt | 125 2254 | 2241 2 11 2 15 4 |
-|================================================================|
-| Sum | 125 2254 | 2241 2 11 2 15 4 |
-|================================================================|
-| Mean |125.0 2254.0 |2241.0 2.0 11.0 2.0 15.0 4.0 |
-| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
-|Median|125.0 2254.0 |2241.0 2.0 11.0 2.0 15.0 4.0 |
-`----------------------------------------------------------------'
-SYSTEM SUMMARY PERCENTAGES by SPEAKER
,-----------------------------------------------------------------.
-| ./exp/textnorm/text.tn |
+| | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
-|-----------------------------------------------------------------|
-| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+--------------+-----------------------------------------|
-| utt | 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 |
-|=================================================================|
| Sum/Avg| 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 |
-|=================================================================|
-| Mean |125.0 2254.0 | 99.4 0.1 0.5 0.1 0.7 3.2 |
-| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
-| Median |125.0 2254.0 | 99.4 0.1 0.5 0.1 0.7 3.2 |
`-----------------------------------------------------------------'
```
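The `avg WER` and `avg CER` figures above are edit-distance error rates over words and characters respectively. A minimal sketch of how such a rate is computed (the repo's actual scorer is the sclite-style tool that produced the summary tables, not this function):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via a rolling
    one-row dynamic-programming table."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,          # deletion
                dp[j - 1] + 1,      # insertion
                prev + (r != h))    # substitution (free if tokens match)
    return dp[-1]

def wer(ref_words, hyp_words):
    """Word error rate: edits divided by reference length."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)

wer("recognition results".split(), "recognition result".split())  # -> 0.5
```

Using character lists instead of word lists in the same function yields the CER reported for text normalization.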

@@ -37,4 +37,3 @@
```bash
bash local/export.sh ckpt_path saved_jit_model_path
```

@@ -53,8 +53,8 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
    peek_example = minibatch[0]
    assert len(peek_example.shape) == 1, "text example is an 1D tensor"
-    lengths = [example.shape[0] for example in
-               minibatch]  # assume (channel, n_samples) or (n_samples, )
+    lengths = [example.shape[0] for example in minibatch
+               ]  # assume (channel, n_samples) or (n_samples, )
    max_len = np.max(lengths)
    batch = []
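For context on the hunk above: `batch_text_id` pads variable-length 1-D text-ID arrays to a common length so they can be stacked into one batch tensor. A self-contained sketch of that behavior, assuming the simplest padding scheme (the real implementation lives in the repo's collate utilities and may differ in detail):

```python
import numpy as np

def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
    """Pad 1-D text-ID arrays to the batch's max length and stack them.

    Returns the padded (batch, max_len) array and the original lengths.
    """
    lengths = [example.shape[0] for example in minibatch]
    max_len = np.max(lengths)
    batch = [
        # Right-pad each example with pad_id up to max_len.
        np.pad(example, (0, max_len - example.shape[0]),
               mode="constant", constant_values=pad_id)
        for example in minibatch
    ]
    return np.array(batch, dtype=dtype), np.array(lengths, dtype=np.int64)

padded, lens = batch_text_id([np.array([1, 2, 3]), np.array([4, 5])])
# padded -> [[1, 2, 3], [4, 5, 0]], lens -> [3, 2]
```

Keeping the original lengths alongside the padded array lets downstream code mask out the padding positions.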

@@ -67,16 +67,19 @@ class LJSpeechCollector(object):
        # Sort by text_len in descending order
        texts = [
-            i for i, _ in sorted(
+            i
+            for i, _ in sorted(
                zip(texts, text_lens), key=lambda x: x[1], reverse=True)
        ]
        mels = [
-            i for i, _ in sorted(
+            i
+            for i, _ in sorted(
                zip(mels, text_lens), key=lambda x: x[1], reverse=True)
        ]
        mel_lens = [
-            i for i, _ in sorted(
+            i
+            for i, _ in sorted(
                zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
        ]
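The three comprehensions in the hunk above all apply one idiom: reorder several parallel lists by a single shared key (`text_lens`) in descending order, so the lists stay aligned after sorting. A minimal standalone sketch of that idiom (names here are illustrative):

```python
def sort_by_key_desc(values, keys):
    """Return `values` reordered so the paired `keys` are descending."""
    return [v for v, _ in
            sorted(zip(values, keys), key=lambda x: x[1], reverse=True)]

texts = ["b", "a", "c"]
text_lens = [2, 1, 3]
sort_by_key_desc(texts, text_lens)  # -> ["c", "b", "a"]
```

Applying the same call to `texts`, `mels`, and `mel_lens` with the same `text_lens` key, as the collator does, guarantees all three lists end up in the same order.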
