remove run prefix
using ord value as text id
pull/522/head
Hui Zhang 5 years ago
parent f121f851d9
commit 5fe1b40630

@@ -197,7 +197,7 @@ For more help on arguments:
```bash
python3 train.py --help
```
or refer to `example/librispeech/local/run_train.sh`.
or refer to `example/librispeech/local/train.sh`.
### Data Augmentation Pipeline
@@ -239,7 +239,7 @@ Be careful when utilizing the data augmentation technique, as improper augmentat
### Training for Mandarin Language
The key steps of training for Mandarin are the same as those for English, and we also provide an example of Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by ./models/aishell/download_model.sh) for users to try with ```sh run_infer_golden.sh``` and ```sh run_test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
The key steps of training for Mandarin are the same as those for English, and we also provide an example of Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by ./models/aishell/download_model.sh) for users to try with ```sh infer_golden.sh``` and ```sh test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
## Inference and Evaluation
@@ -299,7 +299,7 @@ For more help on arguments:
```
python3 infer.py --help
```
or refer to `example/librispeech/local/run_infer.sh`.
or refer to `example/librispeech/local/infer.sh`.
### Evaluate a Model
@@ -324,7 +324,7 @@ For more help on arguments:
```bash
python3 test.py --help
```
or refer to `example/librispeech/local/run_test.sh`.
or refer to `example/librispeech/local/test.sh`.
## Hyper-parameters Tuning
@@ -364,7 +364,7 @@ After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation
```bash
python3 tune.py --help
```
or refer to `example/librispeech/local/run_tune.sh`.
or refer to `example/librispeech/local/tune.sh`.
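For context, the $\alpha$ and $\beta$ being tuned here are the language-model weight and the word-insertion weight of the CTC beam search decoder. In the usual Deep Speech 2 style scoring (stated here as a hedged summary; see the decoder source for the exact form), a candidate transcription $\mathbf{c}$ is ranked by

$$Q(\mathbf{c}) = \log p_{\mathrm{ctc}}(\mathbf{c} \mid \mathbf{x}) + \alpha \log p_{\mathrm{lm}}(\mathbf{c}) + \beta \, \mathrm{word\_count}(\mathbf{c})$$

so the tuning trades off the acoustic model, the external language model, and a length bonus.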
## Trying Live Demo with Your Own Voice
@@ -403,7 +403,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking
Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can be run on one without any audio recording hardware, e.g. any remote server machine. If the server and client run on two separate machines, just make sure the `host_ip` and `host_port` arguments are set to an IP address and port that are actually reachable. No extra setup is needed if they run on a single machine.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/run_demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
For more help on arguments:
@@ -427,7 +427,7 @@ VoxForge European | 30.15 | 18.64
VoxForge Indian | 53.73 | 25.51
Baidu Internal Testset | 40.75 | 8.48
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```sh run_data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```sh data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
#### Benchmark Results for Mandarin Model (Character Error Rate)

@@ -197,7 +197,7 @@ python3 tools/build_vocab.py --help
```bash
python3 train.py --help
```
or refer to `example/librispeech/local/run_train.sh`.
or refer to `example/librispeech/local/train.sh`.
### Data Augmentation Pipeline
@@ -238,7 +238,7 @@ python3 train.py --help
### Training for Mandarin Language
The key steps of training for Mandarin are the same as those for English, and we provide an example of Mandarin training with Aishell in ```examples/aishell```. As mentioned above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by running ./models/aishell/download_model.sh) for users to try with ```run_infer_golden.sh``` and ```run_test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
The key steps of training for Mandarin are the same as those for English, and we provide an example of Mandarin training with Aishell in ```examples/aishell```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by running ./models/aishell/download_model.sh) for users to try with ```infer_golden.sh``` and ```test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
@@ -300,7 +300,7 @@ bash download_lm_ch.sh
```
python3 infer.py --help
```
or refer to `example/librispeech/local/run_infer.sh`.
or refer to `example/librispeech/local/infer.sh`.
### Evaluate a Model
@@ -325,7 +325,7 @@ python3 infer.py --help
```bash
python3 test.py --help
```
or refer to `example/librispeech/local/run_test.sh`.
or refer to `example/librispeech/local/test.sh`.
@@ -367,7 +367,7 @@ python3 test.py --help
```bash
python3 tune.py --help
```
or refer to `example/librispeech/local/run_tune.sh`.
or refer to `example/librispeech/local/tune.sh`.
## Trying Live Demo with Your Own Voice
@@ -406,7 +406,7 @@ python3 -u deploy/demo_client.py \
Note that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can be run on one without any audio recording hardware, e.g. any remote server machine. If the server and client run on two separate machines, just make sure the `host_ip` and `host_port` arguments are set to an IP address and port that are actually reachable. No extra setup is needed if they run on a single machine.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/run_demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
For more help on arguments:
@@ -430,7 +430,7 @@ VoxForge European | 30.15 | 18.64
VoxForge Indian | 53.73 | 25.51
Baidu Internal Testset | 40.75 | 8.48
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```run_data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
#### Benchmark Results for Mandarin Model (Character Error Rate)

@@ -428,7 +428,7 @@ class DeepSpeech2BatchSampler(BatchSampler):
class SpeechCollator():
def __init__(self, padding_to=-1):
def __init__(self, padding_to=-1, is_training=False):
"""
Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one batch.
@@ -438,6 +438,7 @@ class SpeechCollator():
target shape (only refers to the second axis).
"""
self._padding_to = padding_to
self._is_training = is_training
def __call__(self, batch):
new_batch = []
@@ -461,7 +462,10 @@ class SpeechCollator():
audio_lens.append(audio.shape[1])
# text
padded_text = np.zeros([max_text_length])
padded_text[:len(text)] = text
if self._is_training:
padded_text[:len(text)] = text  # token ids
else:
padded_text[:len(text)] = [ord(t) for t in text]  # raw string stored as ord() code points
texts.append(padded_text)
text_lens.append(len(text))
@@ -472,61 +476,61 @@ class SpeechCollator():
return padded_audios, texts, audio_lens, text_lens
def create_dataloader(manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config='{}',
max_duration=float('inf'),
min_duration=0.0,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
random_seed=0,
keep_transcription_text=False,
is_training=False,
batch_size=1,
num_workers=0,
sortagrad=False,
shuffle_method=None,
dist=False):
dataset = DeepSpeech2Dataset(
manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config=augmentation_config,
max_duration=max_duration,
min_duration=min_duration,
stride_ms=stride_ms,
window_ms=window_ms,
max_freq=max_freq,
specgram_type=specgram_type,
use_dB_normalization=use_dB_normalization,
random_seed=random_seed,
keep_transcription_text=keep_transcription_text)
if dist:
batch_sampler = DeepSpeech2DistributedBatchSampler(
dataset,
batch_size,
num_replicas=None,
rank=None,
shuffle=is_training,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
else:
batch_sampler = DeepSpeech2BatchSampler(
dataset,
shuffle=is_training,
batch_size=batch_size,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
def padding_batch(batch, padding_to=-1, flatten=False, is_training=True):
def create_dataloader(manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config='{}',
max_duration=float('inf'),
min_duration=0.0,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
random_seed=0,
keep_transcription_text=False,
is_training=False,
batch_size=1,
num_workers=0,
sortagrad=False,
shuffle_method=None,
dist=False):
dataset = DeepSpeech2Dataset(
manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config=augmentation_config,
max_duration=max_duration,
min_duration=min_duration,
stride_ms=stride_ms,
window_ms=window_ms,
max_freq=max_freq,
specgram_type=specgram_type,
use_dB_normalization=use_dB_normalization,
random_seed=random_seed,
keep_transcription_text=keep_transcription_text)
if dist:
batch_sampler = DeepSpeech2DistributedBatchSampler(
dataset,
batch_size,
num_replicas=None,
rank=None,
shuffle=is_training,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
else:
batch_sampler = DeepSpeech2BatchSampler(
dataset,
shuffle=is_training,
batch_size=batch_size,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
def padding_batch(batch, padding_to=-1, flatten=False, is_training=True):
"""
Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one batch.
@@ -536,42 +540,45 @@ def create_dataloader(manifest_path,
target shape (only refers to the second axis).
If `flatten` is True, features will be flattened to a 1-D array.
"""
new_batch = []
"""
new_batch = []
# get target shape
max_length = max([audio.shape[1] for audio, text in batch])
if padding_to != -1:
if padding_to < max_length:
raise ValueError("If padding_to is not -1, it should be larger "
"than any instance's shape in the batch")
max_length = padding_to
max_text_length = max([len(text) for audio, text in batch])
max_length = max([audio.shape[1] for audio, text in batch])
if padding_to != -1:
if padding_to < max_length:
raise ValueError("If padding_to is not -1, it should be larger "
"than any instance's shape in the batch")
max_length = padding_to
max_text_length = max([len(text) for audio, text in batch])
# padding
padded_audios = []
audio_lens = []
texts, text_lens = [], []
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
padded_audios.append(padded_audio)
audio_lens.append(audio.shape[1])
padded_text = np.zeros([max_text_length])
padded_text[:len(text)] = text
texts.append(padded_text)
text_lens.append(len(text))
padded_audios = np.array(padded_audios).astype('float32')
audio_lens = np.array(audio_lens).astype('int64')
texts = np.array(texts).astype('int32')
text_lens = np.array(text_lens).astype('int64')
return padded_audios, texts, audio_lens, text_lens
loader = DataLoader(
dataset,
batch_sampler=batch_sampler,
collate_fn=partial(padding_batch, is_training=is_training),
num_workers=num_workers)
return loader
padded_audios = []
audio_lens = []
texts, text_lens = [], []
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
padded_audios.append(padded_audio)
audio_lens.append(audio.shape[1])
padded_text = np.zeros([max_text_length])
if is_training:
padded_text[:len(text)] = text  # token ids
else:
padded_text[:len(text)] = [ord(t) for t in text]  # raw string stored as ord() code points
texts.append(padded_text)
text_lens.append(len(text))
padded_audios = np.array(padded_audios).astype('float32')
audio_lens = np.array(audio_lens).astype('int64')
texts = np.array(texts).astype('int32')
text_lens = np.array(text_lens).astype('int64')
return padded_audios, texts, audio_lens, text_lens
loader = DataLoader(
dataset,
batch_sampler=batch_sampler,
collate_fn=partial(padding_batch, is_training=is_training),
num_workers=num_workers)
return loader
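The collator and `padding_batch` changes above are the heart of the "using ord value as text id" commit: when `is_training` is false, the transcript stays a raw string and is packed into the padded batch as `ord()` code points, to be mapped back to text later with `chr()`. Below is a minimal standalone sketch of that round trip (illustration only; the helper names are hypothetical and this is not the repository's `SpeechCollator`):

```python
import numpy as np

def pad_transcripts(texts):
    """Pack raw transcript strings into a fixed-width int array of ord() code points."""
    seqs = [[ord(ch) for ch in t] for t in texts]
    max_len = max(len(s) for s in seqs)
    padded = np.zeros([len(seqs), max_len], dtype='int32')
    lens = np.array([len(s) for s in seqs], dtype='int64')
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
    return padded, lens

def unpack_transcripts(padded, lens):
    """Inverse mapping: chr() the non-padding entries back into strings."""
    return [''.join(chr(int(c)) for c in row[:n]) for row, n in zip(padded, lens)]

texts, text_lens = pad_transcripts(["hello world", "hi"])
assert unpack_transcripts(texts, text_lens) == ["hello world", "hi"]
```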

@@ -38,7 +38,7 @@ training:
save_interval: 1000
valid_interval: 1000
decoding:
batch_size: 128
batch_size: 10
error_rate_type: cer
decoding_method: ctc_beam_search
lang_model_path: models/lm/zh_giga.no_cna_cmn.prune01244.klm
@@ -48,4 +48,3 @@ decoding:
cutoff_prob: 0.99
cutoff_top_n: 40
num_proc_bsearch: 8

@@ -9,7 +9,6 @@ fi
cd - > /dev/null
CUDA_VISIBLE_DEVICES=6 \
python3 -u ${MAIN_ROOT}/test.py \
--device 'gpu' \
--nproc 1 \

@@ -3,19 +3,19 @@
source path.sh
# prepare data
bash ./local/run_data.sh
bash ./local/data.sh
# test pretrained model
bash ./local/run_test_golden.sh
bash ./local/test_golden.sh
# infer with pretrained model
bash ./local/run_infer_golden.sh
bash ./local/infer_golden.sh
# train model
bash ./local/run_train.sh
bash ./local/train.sh
# test model
bash ./local/run_test.sh
bash ./local/test.sh
# infer model
bash ./local/run_infer.sh
bash ./local/infer.sh

@@ -0,0 +1,8 @@
[
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
}
]
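This new `augmentation.config` enables a single "shift" augmentor: with probability 1.0, each utterance is shifted in time by a random offset between -5 ms and 5 ms. A rough standalone sketch of what such a shift does to a raw waveform (an illustration under these assumptions, not the repository's augmentor implementation):

```python
import numpy as np

def random_shift(samples, sample_rate, min_shift_ms=-5.0, max_shift_ms=5.0, rng=None):
    """Shift a 1-D waveform by a random offset in milliseconds, zero-padding the gap."""
    rng = rng or np.random.default_rng()
    shift = int(sample_rate * rng.uniform(min_shift_ms, max_shift_ms) / 1000.0)
    out = np.zeros_like(samples)
    if shift > 0:          # delay the audio: content moves right
        out[shift:] = samples[:len(samples) - shift]
    elif shift < 0:        # advance the audio: content moves left
        out[:shift] = samples[-shift:]
    else:
        out[:] = samples
    return out

wav = np.random.randn(16000).astype('float32')    # 1 s of audio at 16 kHz
augmented = random_shift(wav, sample_rate=16000)  # shifted by at most +/-5 ms (80 samples)
```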

@@ -0,0 +1,51 @@
# https://yaml.org/type/float.html
data:
train_manifest: data/manifest.tiny
dev_manifest: data/manifest.tiny
test_manifest: data/manifest.tiny
mean_std_filepath: data/mean_std.npz
vocab_filepath: data/vocab.txt
augmentation_config: conf/augmentation.config
batch_size: 4
max_duration: 27.0
min_duration: 0.0
specgram_type: linear
target_sample_rate: 16000
max_freq: None
n_fft: None
stride_ms: 10.0
window_ms: 20.0
use_dB_normalization: True
target_dB: -20
random_seed: 0
keep_transcription_text: False
sortagrad: True
shuffle_method: batch_shuffle
num_workers: 0
model:
num_conv_layers: 2
num_rnn_layers: 3
rnn_layer_size: 2048
use_gru: True
share_rnn_weights: True
training:
n_epoch: 20
lr: 1e-5
weight_decay: 1e-06
global_grad_clip: 400.0
max_iteration: 500000
plot_interval: 1000
save_interval: 1000
valid_interval: 1000
decoding:
batch_size: 128
error_rate_type: wer
decoding_method: ctc_beam_search
lang_model_path: models/lm/common_crawl_00.prune01111.trie.klm
alpha: 2.5
beta: 0.3
beam_size: 500
cutoff_prob: 1.0
cutoff_top_n: 40
num_proc_bsearch: 8
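The `# https://yaml.org/type/float.html` comment at the top of this config points at a real pitfall: PyYAML follows YAML 1.1, whose float tag requires a decimal point, so a bare `1e-5` loads as a string, and `None` loads as the string `'None'` rather than a null. A small sketch of the behaviour (assuming the config is read with PyYAML; the repository's own loader may coerce types differently):

```python
import yaml

cfg = yaml.safe_load("""
training:
  lr: 1e-5              # no '.', so YAML 1.1 treats this as a string
  weight_decay: 1.0e-6  # has a '.', loads as a float
data:
  max_freq: None        # loads as the string 'None', not a null
""")

assert isinstance(cfg['training']['lr'], str)
assert isinstance(cfg['training']['weight_decay'], float)
lr = float(cfg['training']['lr'])  # an explicit cast is the safe way to consume it
```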

@@ -3,22 +3,16 @@
source path.sh
# prepare data
bash ./local/run_data.sh
# test pretrained model
bash ./local/run_test_golden.sh
# infer with pretrained model
bash ./local/run_infer_golden.sh
bash ./local/data.sh
# train model
bash ./local/run_train.sh
bash ./local/train.sh
# test model
bash ./local/run_test.sh
bash ./local/test.sh
# infer model
bash ./local/run_infer.sh
bash ./local/infer.sh
# tune model
bash ./local/run_tune.sh
#bash ./local/tune.sh

@@ -7,39 +7,39 @@
- Prepare the data
```bash
sh local/run_data.sh
bash local/data.sh
```
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech`, the corresponding manifest files generated in `${PWD}/data`, as well as a mean-stddev file and a vocabulary file. It only needs to be run the first time you use this dataset, and the results are reusable for all further experiments.
`data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech`, the corresponding manifest files generated in `${PWD}/data`, as well as a mean-stddev file and a vocabulary file. It only needs to be run the first time you use this dataset, and the results are reusable for all further experiments.
- Train your own ASR model
```bash
sh local/run_train.sh
bash local/train.sh
```
`run_train.sh` will start a training job, with training logs printed to stdout and a model checkpoint for every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints can be used for resuming training, inference, evaluation and deployment.
`train.sh` will start a training job, with training logs printed to stdout and a model checkpoint for every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints can be used for resuming training, inference, evaluation and deployment.
- Case inference with an existing model
```bash
sh local/run_infer.sh
bash local/infer.sh
```
`run_infer.sh` will show speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good for now, as the current model is only trained on a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained model (trained for several days on the complete LibriSpeech) and run inference with it:
`infer.sh` will show speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good for now, as the current model is only trained on a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained model (trained for several days on the complete LibriSpeech) and run inference with it:
```bash
sh local/run_infer_golden.sh
bash local/infer_golden.sh
```
- Evaluate an existing model
```bash
sh local/run_test.sh
bash local/test.sh
```
`run_test.sh` will evaluate the model using the Word Error Rate (or Character Error Rate) metric. Similarly, you can also download a well-trained model and test its performance:
`test.sh` will evaluate the model using the Word Error Rate (or Character Error Rate) metric. Similarly, you can also download a well-trained model and test its performance:
```bash
sh local/run_test_golden.sh
bash local/test_golden.sh
```

@@ -4,22 +4,16 @@ set -e
source path.sh
# prepare data
bash ./local/run_data.sh
## test pretrained model
#bash ./local/run_test_golden.sh
#
## infer with pretrained model
#bash ./local/run_infer_golden.sh
bash ./local/data.sh
# train model
bash ./local/run_train.sh
bash ./local/train.sh
# test model
bash ./local/run_test.sh
bash ./local/test.sh
# infer model
bash ./local/run_infer.sh
bash ./local/infer.sh
## tune model
#bash ./local/run_tune.sh
#bash ./local/tune.sh

@@ -20,12 +20,12 @@ import time
import logging
import numpy as np
from collections import defaultdict
from functools import partial
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.fluid.dygraph import base as imperative_base
from paddle.fluid import layers
from paddle.fluid import framework
@@ -51,6 +51,7 @@ from utils.error_rate import char_errors, word_errors, cer, wer
logger = logging.getLogger(__name__)
class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
def __init__(self, clip_norm):
super().__init__(clip_norm)
@@ -70,7 +71,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
square = layers.square(merge_grad)
sum_square = layers.reduce_sum(square)
logger.info(f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }")
logger.info(
f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
)
sum_square_list.append(sum_square)
# all parameters have been filtered out
@@ -85,8 +88,7 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
shape=[1], dtype=global_norm_var.dtype, value=self.clip_norm)
clip_var = layers.elementwise_div(
x=max_global_norm,
y=layers.elementwise_max(
x=global_norm_var, y=max_global_norm))
y=layers.elementwise_max(x=global_norm_var, y=max_global_norm))
for p, g in params_grads:
if g is None:
continue
@@ -94,7 +96,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
params_and_grads.append((p, g))
continue
new_grad = layers.elementwise_mul(x=g, y=clip_var)
logger.info(f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }")
logger.info(
f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
)
params_and_grads.append((p, new_grad))
return params_and_grads
@@ -106,12 +110,14 @@ def print_grads(model, logger=None):
if logger:
logger.info(msg)
def print_params(model, logger=None):
for n, p in model.named_parameters():
msg = f"param: {n}: shape: {p.shape} stop_grad: {p.stop_gradient}"
if logger:
if logger:
logger.info(msg)
class DeepSpeech2Trainer(Trainer):
def __init__(self, config, args):
super().__init__(config, args)
@@ -126,8 +132,7 @@ class DeepSpeech2Trainer(Trainer):
start = time.time()
self.model.train()
audio, text, audio_len, text_len = batch_data
outputs = self.model(audio, text, audio_len, text_len)
outputs = self.model(*batch_data)
loss = self.compute_losses(batch_data, outputs)
loss.backward()
@@ -204,7 +209,7 @@ class DeepSpeech2Trainer(Trainer):
valid_losses = defaultdict(list)
for i, batch in enumerate(self.valid_loader):
audio, text, audio_len, text_len = batch
outputs = self.model(audio, text, audio_len, text_len)
outputs = self.model(*batch)
loss = self.compute_losses(batch, outputs)
metrics = self.compute_metrics(batch, outputs)
@@ -243,8 +248,7 @@ class DeepSpeech2Trainer(Trainer):
print_params(model, self.logger)
grad_clip = MyClipGradByGlobalNorm(
config.training.global_grad_clip)
grad_clip = MyClipGradByGlobalNorm(config.training.global_grad_clip)
# optimizer = paddle.optimizer.Adam(
# learning_rate=config.training.lr,
@@ -313,7 +317,7 @@ class DeepSpeech2Trainer(Trainer):
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=False)
keep_transcription_text=True)
if self.parallel:
batch_sampler = DeepSpeech2DistributedBatchSampler(
@@ -338,14 +342,14 @@ self.train_loader = DataLoader(
self.train_loader = DataLoader(
train_dataset,
batch_sampler=batch_sampler,
collate_fn=collate_fn,
collate_fn=SpeechCollator(is_training=True),
num_workers=config.data.num_workers, )
self.valid_loader = DataLoader(
dev_dataset,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
collate_fn=SpeechCollator(is_training=True))
self.logger.info("Setup train/valid Dataloader!")
@@ -353,13 +357,14 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
def __init__(self, config, args):
super().__init__(config, args)
def id2token(self, texts, texts_len, vocab_list):
def ordid2token(self, texts, texts_len):
""" ord() id to chr() chr """
trans = []
for text, n in zip(texts, texts_len):
n = n.numpy().item()
ids = text[:n]
trans.append(''.join([vocab_list[i] for i in ids]))
return np.array(trans)
trans.append(''.join([chr(i) for i in ids]))
return trans
def compute_metrics(self, inputs, outputs):
cfg = self.config.decoding
@@ -372,10 +377,8 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
error_rate_func = cer if cfg.error_rate_type == 'cer' else wer
vocab_list = self.test_loader.dataset.vocab_list
for t in vocab_list:
self.logger.info(f"vocab: {t}")
target_transcripts = self.id2token(texts, texts_len, vocab_list)
target_transcripts = self.ordid2token(texts, texts_len)
result_transcripts = self.model.decode_probs(
probs.numpy(),
vocab_list,
@@ -513,13 +516,12 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=False)
keep_transcription_text=True)
collate_fn = SpeechCollator()
self.test_loader = DataLoader(
test_dataset,
batch_size=config.decoding.batch_size,
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
collate_fn=SpeechCollator(is_training=False))
self.logger.info("Setup test Dataloader!")

@@ -31,32 +31,6 @@ logger = logging.getLogger(__name__)
__all__ = ['DeepSpeech2', 'DeepSpeech2Loss']
def ctc_loss(logits,
labels,
input_lengths,
label_lengths,
blank=0,
reduction='mean',
norm_by_times=False):
#logger.info("my ctc loss with norm by times")
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
loss_out = paddle.fluid.layers.warpctc(
logits, labels, blank, norm_by_times, input_lengths, label_lengths)
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ")
assert reduction in ['mean', 'sum', 'none']
if reduction == 'mean':
loss_out = paddle.mean(loss_out / label_lengths)
elif reduction == 'sum':
loss_out = paddle.sum(loss_out)
logger.info(f"ctc loss: {loss_out}")
return loss_out
#F.ctc_loss = ctc_loss
def brelu(x, t_min=0.0, t_max=24.0, name=None):
t_min = paddle.to_tensor(t_min)
t_max = paddle.to_tensor(t_max)
@@ -161,7 +135,7 @@ class ConvStack(nn.Layer):
self.conv_in = ConvBn(
num_channels_in=1,
num_channels_out=32,
kernel_size=(41, 11), #[D, T]
kernel_size=(41, 11), #[D, T]
stride=(2, 3),
padding=(20, 5),
act='brelu')
@@ -330,7 +304,6 @@ class GRUCellShare(nn.RNNCellBase):
c = self._activation(x_c + r * h_c) # apply reset gate after mm
h = (pre_hidden - c) * z + c
# https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/fluid/layers/dynamic_gru_cn.html#dynamic-gru
#h = (1-z) * pre_hidden + z * c
return h, h
@@ -716,6 +689,32 @@ class DeepSpeech2(nn.Layer):
beam_beta, beam_size, cutoff_prob, cutoff_top_n, num_processes)
def ctc_loss(logits,
labels,
input_lengths,
label_lengths,
blank=0,
reduction='mean',
norm_by_times=True):
#logger.info("my ctc loss with norm by times")
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times,
input_lengths, label_lengths)
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ")
assert reduction in ['mean', 'sum', 'none']
if reduction == 'mean':
loss_out = paddle.mean(loss_out / label_lengths)
elif reduction == 'sum':
loss_out = paddle.sum(loss_out)
logger.info(f"ctc loss: {loss_out}")
return loss_out
F.ctc_loss = ctc_loss
class DeepSpeech2Loss(nn.Layer):
def __init__(self, vocab_size):
super().__init__()
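The `ctc_loss` override above (now installed via `F.ctc_loss = ctc_loss`, with `norm_by_times=True` by default) keeps a 'mean' reduction in which each utterance's raw warpctc loss is divided by its own label length before averaging, so long transcripts do not dominate the batch loss. A tiny numeric sketch of that reduction (the numbers are made up for illustration):

```python
import numpy as np

raw_loss = np.array([12.0, 3.0, 8.0])  # per-utterance CTC losses from warpctc
label_lengths = np.array([6, 2, 4])    # transcript lengths for the same utterances

mean_reduced = np.mean(raw_loss / label_lengths)  # (2.0 + 1.5 + 2.0) / 3 = 1.833...
sum_reduced = np.sum(raw_loss)                    # 23.0, used when reduction='sum'
```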
