diff --git a/README.md b/README.md index 80d95fcc3..70065f2b3 100644 --- a/README.md +++ b/README.md @@ -197,7 +197,7 @@ For more help on arguments: ```bash python3 train.py --help ``` -or refer to `example/librispeech/local/run_train.sh`. +or refer to `example/librispeech/local/train.sh`. ### Data Augmentation Pipeline @@ -239,7 +239,7 @@ Be careful when utilizing the data augmentation technique, as improper augmentat ### Training for Mandarin Language -The key steps of training for Mandarin language are same to that of English language and we have also provided an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` to do data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by ./models/aishell/download_model.sh) for users to try with ```sh run_infer_golden.sh``` and ```sh run_test_golden.sh```. Notice that, different from English LM, the Mandarin LM is character-based and please run ```tools/tune.py``` to find an optimal setting. +The key steps of training for Mandarin language are same to that of English language and we have also provided an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by ./models/aishell/download_model.sh) for users to try with ```sh infer_golden.sh``` and ```sh test_golden.sh```. Notice that, different from English LM, the Mandarin LM is character-based and please run ```tools/tune.py``` to find an optimal setting. ## Inference and Evaluation @@ -299,7 +299,7 @@ For more help on arguments: ``` python3 infer.py --help ``` -or refer to `example/librispeech/local/run_infer.sh`. +or refer to `example/librispeech/local/infer.sh`. ### Evaluate a Model @@ -324,7 +324,7 @@ For more help on arguments: ```bash python3 test.py --help ``` -or refer to `example/librispeech/local/run_test.sh`. +or refer to `example/librispeech/local/test.sh`. ## Hyper-parameters Tuning @@ -364,7 +364,7 @@ After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation ```bash python3 tune.py --help ``` -or refer to `example/librispeech/local/run_tune.sh`. +or refer to `example/librispeech/local/tune.sh`. ## Trying Live Demo with Your Own Voice @@ -403,7 +403,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` could be run on one without any audio recording hardware, e.g. any remote server machine. Just be careful to set the `host_ip` and `host_port` argument with the actual accessible IP address and port, if the server and client are running with two separate machines. Nothing should be done if they are running on one single machine. -Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. With running `examples/deploy_demo/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update `--model_path` argument in the script.   
+Please also refer to `examples/deploy_demo/english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. With running `examples/deploy_demo/demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update `--model_path` argument in the script.   For more help on arguments: @@ -427,7 +427,7 @@ VoxForge European | 30.15 | 18.64 VoxForge Indian | 53.73 | 25.51 Baidu Internal Testset  |   40.75 |   8.48 -For reproducing benchmark results on VoxForge data, we provide a script to download data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```sh run_data.sh``` to get VoxForge dialect manifest files. Notice that VoxForge data may keep updating and the generated manifest files may have difference from those we evaluated on. +For reproducing benchmark results on VoxForge data, we provide a script to download data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```sh data.sh``` to get VoxForge dialect manifest files. Notice that VoxForge data may keep updating and the generated manifest files may have difference from those we evaluated on. #### Benchmark Results for Mandarin Model (Character Error Rate) diff --git a/README_cn.md b/README_cn.md index 54971a76b..4ca6dda32 100644 --- a/README_cn.md +++ b/README_cn.md @@ -197,7 +197,7 @@ python3 tools/build_vocab.py --help ```bash python3 train.py --help ``` -或参考 `example/librispeech/local/run_train.sh`. +或参考 `example/librispeech/local/train.sh`. ### 数据增强流水线 @@ -238,7 +238,7 @@ python3 train.py --help ### 训练普通话语言 -普通话语言训练与英语训练的关键步骤相同,我们提供了一个使用 Aishell 进行普通话训练的例子```examples/aishell```。如上所述,请执行```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh```和```sh run_infer.sh```做相应的数据准备,训练,测试和推断。我们还准备了一个预训练过的模型(执行./models/aishell/download_model.sh下载)供用户使用```run_infer_golden.sh```和```run_test_golden.sh```来。请注意,与英语语言模型不同,普通话语言模型是基于汉字的,请运行```tools/tune.py```来查找最佳设置。 +普通话语言训练与英语训练的关键步骤相同,我们提供了一个使用 Aishell 进行普通话训练的例子```examples/aishell```。如上所述,请执行```sh data.sh```, ```sh train.sh```, ```sh test.sh```和```sh infer.sh```做相应的数据准备,训练,测试和推断。我们还准备了一个预训练过的模型(执行./models/aishell/download_model.sh下载)供用户使用```infer_golden.sh```和```test_golden.sh```来。请注意,与英语语言模型不同,普通话语言模型是基于汉字的,请运行```tools/tune.py```来查找最佳设置。 @@ -300,7 +300,7 @@ bash download_lm_ch.sh ``` python3 infer.py --help ``` -或参考`example/librispeech/local/run_infer.sh`. +或参考`example/librispeech/local/infer.sh`. ### 评估模型 @@ -325,7 +325,7 @@ python3 infer.py --help ```bash python3 test.py --help ``` -或参考`example/librispeech/local/run_test.sh`. +或参考`example/librispeech/local/test.sh`. @@ -367,7 +367,7 @@ python3 test.py --help ```bash python3 tune.py --help ``` -或参考`example/librispeech/local/run_tune.sh`. +或参考`example/librispeech/local/tune.sh`. 
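For intuition, the two hyper-parameters being tuned here enter the beam-search score in the usual Deep Speech 2 fashion: $\alpha$ weights the external language model and $\beta$ rewards word insertions. The sketch below is schematic only; the real scoring happens inside the CTC beam-search decoder, and the example values are simply the defaults from `examples/librispeech/conf/deepspeech2.yaml` in this change.

```python
def rescore(ctc_log_prob, lm_log_prob, word_count, alpha, beta):
    """Schematic hypothesis score: Q = log P_ctc + alpha * log P_lm + beta * word_count."""
    return ctc_log_prob + alpha * lm_log_prob + beta * word_count

# alpha=2.5, beta=0.3 are the defaults in examples/librispeech/conf/deepspeech2.yaml
print(rescore(ctc_log_prob=-12.3, lm_log_prob=-4.1, word_count=5, alpha=2.5, beta=0.3))
```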
## 用自己的声音尝试现场演示 @@ -406,7 +406,7 @@ python3 -u deploy/demo_client.py \ 请注意,`deploy/demo_client.py`必须在带麦克风设备的机器上运行,而`deploy/demo_server.py`可以在没有任何录音硬件的情况下运行,例如任何远程服务器机器。如果服务器和客户端使用两台独立的机器运行,只需要注意将`host_ip`和`host_port`参数设置为实际可访问的IP地址和端口。如果它们在单台机器上运行,则不用作任何处理。 -请参考`examples/deploy_demo/run_english_demo_server.sh`,它将首先下载一个预先训练过的英语模型(用3000小时的内部语音数据训练),然后用模型启动演示服务器。通过运行`examples/deploy_demo/run_demo_client.sh`,你可以说英语来测试它。如果您想尝试其他模型,只需更新脚本中的`--model_path`参数即可。 +请参考`examples/deploy_demo/english_demo_server.sh`,它将首先下载一个预先训练过的英语模型(用3000小时的内部语音数据训练),然后用模型启动演示服务器。通过运行`examples/deploy_demo/demo_client.sh`,你可以说英语来测试它。如果您想尝试其他模型,只需更新脚本中的`--model_path`参数即可。 获得更多帮助: @@ -430,7 +430,7 @@ VoxForge European | 30.15 | 18.64 VoxForge Indian | 53.73 | 25.51 Baidu Internal Testset  |   40.75 |   8.48 -为了在VoxForge数据上重现基准测试结果,我们提供了一个脚本来下载数据并生成VoxForge方言manifest文件。请到```data/voxforge```执行````run_data.sh```来获取VoxForge方言manifest文件。请注意,VoxForge数据可能会持续更新,生成的清单文件可能与我们评估的清单文件有所不同。 +为了在VoxForge数据上重现基准测试结果,我们提供了一个脚本来下载数据并生成VoxForge方言manifest文件。请到```data/voxforge```执行````data.sh```来获取VoxForge方言manifest文件。请注意,VoxForge数据可能会持续更新,生成的清单文件可能与我们评估的清单文件有所不同。 #### 普通话模型的baseline测试结果(字符错误率) diff --git a/data_utils/dataset.py b/data_utils/dataset.py index 1173b8d38..6b9b9aecc 100644 --- a/data_utils/dataset.py +++ b/data_utils/dataset.py @@ -428,7 +428,7 @@ class DeepSpeech2BatchSampler(BatchSampler): class SpeechCollator(): - def __init__(self, padding_to=-1): + def __init__(self, padding_to=-1, is_training=False): """ Padding audio features with zeros to make them have the same shape (or a user-defined shape) within one bach. @@ -438,6 +438,7 @@ class SpeechCollator(): target shape (only refers to the second axis). """ self._padding_to = padding_to + self._is_training = is_training def __call__(self, batch): new_batch = [] @@ -461,7 +462,10 @@ class SpeechCollator(): audio_lens.append(audio.shape[1]) # text padded_text = np.zeros([max_text_length]) - padded_text[:len(text)] = text + if self._is_training: + padded_text[:len(text)] = text #ids + else: + padded_text[:len(text)] = [ord(t) for t in text] # string texts.append(padded_text) text_lens.append(len(text)) @@ -472,61 +476,61 @@ class SpeechCollator(): return padded_audios, texts, audio_lens, text_lens -def create_dataloader(manifest_path, - vocab_filepath, - mean_std_filepath, - augmentation_config='{}', - max_duration=float('inf'), - min_duration=0.0, - stride_ms=10.0, - window_ms=20.0, - max_freq=None, - specgram_type='linear', - use_dB_normalization=True, - random_seed=0, - keep_transcription_text=False, - is_training=False, - batch_size=1, - num_workers=0, - sortagrad=False, - shuffle_method=None, - dist=False): - - dataset = DeepSpeech2Dataset( - manifest_path, - vocab_filepath, - mean_std_filepath, - augmentation_config=augmentation_config, - max_duration=max_duration, - min_duration=min_duration, - stride_ms=stride_ms, - window_ms=window_ms, - max_freq=max_freq, - specgram_type=specgram_type, - use_dB_normalization=use_dB_normalization, - random_seed=random_seed, - keep_transcription_text=keep_transcription_text) - - if dist: - batch_sampler = DeepSpeech2DistributedBatchSampler( - dataset, - batch_size, - num_replicas=None, - rank=None, - shuffle=is_training, - drop_last=is_training, - sortagrad=is_training, - shuffle_method=shuffle_method) - else: - batch_sampler = DeepSpeech2BatchSampler( - dataset, - shuffle=is_training, - batch_size=batch_size, - drop_last=is_training, - sortagrad=is_training, - shuffle_method=shuffle_method) - - def padding_batch(batch, padding_to=-1, 
flatten=False, is_training=True): +def create_dataloader(manifest_path, + vocab_filepath, + mean_std_filepath, + augmentation_config='{}', + max_duration=float('inf'), + min_duration=0.0, + stride_ms=10.0, + window_ms=20.0, + max_freq=None, + specgram_type='linear', + use_dB_normalization=True, + random_seed=0, + keep_transcription_text=False, + is_training=False, + batch_size=1, + num_workers=0, + sortagrad=False, + shuffle_method=None, + dist=False): + + dataset = DeepSpeech2Dataset( + manifest_path, + vocab_filepath, + mean_std_filepath, + augmentation_config=augmentation_config, + max_duration=max_duration, + min_duration=min_duration, + stride_ms=stride_ms, + window_ms=window_ms, + max_freq=max_freq, + specgram_type=specgram_type, + use_dB_normalization=use_dB_normalization, + random_seed=random_seed, + keep_transcription_text=keep_transcription_text) + + if dist: + batch_sampler = DeepSpeech2DistributedBatchSampler( + dataset, + batch_size, + num_replicas=None, + rank=None, + shuffle=is_training, + drop_last=is_training, + sortagrad=is_training, + shuffle_method=shuffle_method) + else: + batch_sampler = DeepSpeech2BatchSampler( + dataset, + shuffle=is_training, + batch_size=batch_size, + drop_last=is_training, + sortagrad=is_training, + shuffle_method=shuffle_method) + + def padding_batch(batch, padding_to=-1, flatten=False, is_training=True): """ Padding audio features with zeros to make them have the same shape (or a user-defined shape) within one bach. @@ -536,42 +540,45 @@ def create_dataloader(manifest_path, target shape (only refers to the second axis). If `flatten` is True, features will be flatten to 1darray. - """ - new_batch = [] + """ + new_batch = [] # get target shape - max_length = max([audio.shape[1] for audio, text in batch]) - if padding_to != -1: - if padding_to < max_length: - raise ValueError("If padding_to is not -1, it should be larger " - "than any instance's shape in the batch") - max_length = padding_to - max_text_length = max([len(text) for audio, text in batch]) + max_length = max([audio.shape[1] for audio, text in batch]) + if padding_to != -1: + if padding_to < max_length: + raise ValueError("If padding_to is not -1, it should be larger " + "than any instance's shape in the batch") + max_length = padding_to + max_text_length = max([len(text) for audio, text in batch]) # padding - padded_audios = [] - audio_lens = [] - texts, text_lens = [], [] - for audio, text in batch: - padded_audio = np.zeros([audio.shape[0], max_length]) - padded_audio[:, :audio.shape[1]] = audio - if flatten: - padded_audio = padded_audio.flatten() - padded_audios.append(padded_audio) - audio_lens.append(audio.shape[1]) - - padded_text = np.zeros([max_text_length]) - padded_text[:len(text)] = text - texts.append(padded_text) - text_lens.append(len(text)) - - padded_audios = np.array(padded_audios).astype('float32') - audio_lens = np.array(audio_lens).astype('int64') - texts = np.array(texts).astype('int32') - text_lens = np.array(text_lens).astype('int64') - return padded_audios, texts, audio_lens, text_lens - - loader = DataLoader( - dataset, - batch_sampler=batch_sampler, - collate_fn=partial(padding_batch, is_training=is_training), - num_workers=num_workers) - return loader \ No newline at end of file + padded_audios = [] + audio_lens = [] + texts, text_lens = [], [] + for audio, text in batch: + padded_audio = np.zeros([audio.shape[0], max_length]) + padded_audio[:, :audio.shape[1]] = audio + if flatten: + padded_audio = padded_audio.flatten() + 
padded_audios.append(padded_audio) + audio_lens.append(audio.shape[1]) + + padded_text = np.zeros([max_text_length]) + if is_training: + padded_text[:len(text)] = text #ids + else: + padded_text[:len(text)] = [ord(t) for t in text] # string + texts.append(padded_text) + text_lens.append(len(text)) + + padded_audios = np.array(padded_audios).astype('float32') + audio_lens = np.array(audio_lens).astype('int64') + texts = np.array(texts).astype('int32') + text_lens = np.array(text_lens).astype('int64') + return padded_audios, texts, audio_lens, text_lens + + loader = DataLoader( + dataset, + batch_sampler=batch_sampler, + collate_fn=partial(padding_batch, is_training=is_training), + num_workers=num_workers) + return loader diff --git a/examples/aishell/conf/deepspeech2.yaml b/examples/aishell/conf/deepspeech2.yaml index d2d46eb44..8bbdfa262 100644 --- a/examples/aishell/conf/deepspeech2.yaml +++ b/examples/aishell/conf/deepspeech2.yaml @@ -38,7 +38,7 @@ training: save_interval: 1000 valid_interval: 1000 decoding: - batch_size: 128 + batch_size: 10 error_rate_type: cer decoding_method: ctc_beam_search lang_model_path: models/lm/zh_giga.no_cna_cmn.prune01244.klm @@ -48,4 +48,3 @@ decoding: cutoff_prob: 0.99 cutoff_top_n: 40 num_proc_bsearch: 8 - diff --git a/examples/aishell/local/run_data.sh b/examples/aishell/local/data.sh similarity index 100% rename from examples/aishell/local/run_data.sh rename to examples/aishell/local/data.sh diff --git a/examples/aishell/local/run_infer.sh b/examples/aishell/local/infer.sh similarity index 100% rename from examples/aishell/local/run_infer.sh rename to examples/aishell/local/infer.sh diff --git a/examples/aishell/local/run_infer_golden.sh b/examples/aishell/local/infer_golden.sh similarity index 100% rename from examples/aishell/local/run_infer_golden.sh rename to examples/aishell/local/infer_golden.sh diff --git a/examples/aishell/local/run_test.sh b/examples/aishell/local/test.sh similarity index 93% rename from examples/aishell/local/run_test.sh rename to examples/aishell/local/test.sh index 1015799b5..6e6544bdb 100644 --- a/examples/aishell/local/run_test.sh +++ b/examples/aishell/local/test.sh @@ -9,7 +9,6 @@ fi cd - > /dev/null -CUDA_VISIBLE_DEVICES=6 \ python3 -u ${MAIN_ROOT}/test.py \ --device 'gpu' \ --nproc 1 \ diff --git a/examples/aishell/local/run_test_golden.sh b/examples/aishell/local/test_golden.sh similarity index 100% rename from examples/aishell/local/run_test_golden.sh rename to examples/aishell/local/test_golden.sh diff --git a/examples/aishell/local/run_train.sh b/examples/aishell/local/train.sh similarity index 100% rename from examples/aishell/local/run_train.sh rename to examples/aishell/local/train.sh diff --git a/examples/aishell/run.sh b/examples/aishell/run.sh index 93bf86388..6cf8af2ba 100644 --- a/examples/aishell/run.sh +++ b/examples/aishell/run.sh @@ -3,19 +3,19 @@ source path.sh # prepare data -bash ./local/run_data.sh +bash ./local/data.sh # test pretrain model -bash ./local/run_test_golden.sh +bash ./local/test_golden.sh # test pretain model -bash ./local/run_infer_golden.sh +bash ./local/infer_golden.sh # train model -bash ./local/run_train.sh +bash ./local/train.sh # test model -bash ./local/run_test.sh +bash ./local/test.sh # infer model -bash ./local/run_infer.sh +bash ./local/infer.sh diff --git a/examples/librispeech/conf/augmentation.config b/examples/librispeech/conf/augmentation.config new file mode 100644 index 000000000..6c24da549 --- /dev/null +++ b/examples/librispeech/conf/augmentation.config @@ -0,0 
+1,8 @@ +[ + { + "type": "shift", + "params": {"min_shift_ms": -5, + "max_shift_ms": 5}, + "prob": 1.0 + } +] diff --git a/examples/librispeech/conf/deepspeech2.yaml b/examples/librispeech/conf/deepspeech2.yaml new file mode 100644 index 000000000..457a56b2e --- /dev/null +++ b/examples/librispeech/conf/deepspeech2.yaml @@ -0,0 +1,51 @@ +# https://yaml.org/type/float.html +data: + train_manifest: data/manifest.tiny + dev_manifest: data/manifest.tiny + test_manifest: data/manifest.tiny + mean_std_filepath: data/mean_std.npz + vocab_filepath: data/vocab.txt + augmentation_config: conf/augmentation.config + batch_size: 4 + max_duration: 27.0 + min_duration: 0.0 + specgram_type: linear + target_sample_rate: 16000 + max_freq: None + n_fft: None + stride_ms: 10.0 + window_ms: 20.0 + use_dB_normalization: True + target_dB: -20 + random_seed: 0 + keep_transcription_text: False + sortagrad: True + shuffle_method: batch_shuffle + num_workers: 0 +model: + num_conv_layers: 2 + num_rnn_layers: 3 + rnn_layer_size: 2048 + use_gru: True + share_rnn_weights: True +training: + n_epoch: 20 + lr: 1e-5 + weight_decay: 1e-06 + global_grad_clip: 400.0 + max_iteration: 500000 + plot_interval: 1000 + save_interval: 1000 + valid_interval: 1000 +decoding: + batch_size: 128 + error_rate_type: wer + decoding_method: ctc_beam_search + lang_model_path: models/lm/common_crawl_00.prune01111.trie.klm + alpha: 2.5 + beta: 0.3 + beam_size: 500 + cutoff_prob: 1.0 + cutoff_top_n: 40 + num_proc_bsearch: 8 + diff --git a/examples/librispeech/local/run_data.sh b/examples/librispeech/local/data.sh similarity index 100% rename from examples/librispeech/local/run_data.sh rename to examples/librispeech/local/data.sh diff --git a/examples/librispeech/local/run_infer.sh b/examples/librispeech/local/infer.sh similarity index 100% rename from examples/librispeech/local/run_infer.sh rename to examples/librispeech/local/infer.sh diff --git a/examples/librispeech/local/run_infer_golden.sh b/examples/librispeech/local/infer_golden.sh similarity index 100% rename from examples/librispeech/local/run_infer_golden.sh rename to examples/librispeech/local/infer_golden.sh diff --git a/examples/librispeech/local/run_test.sh b/examples/librispeech/local/test.sh similarity index 100% rename from examples/librispeech/local/run_test.sh rename to examples/librispeech/local/test.sh diff --git a/examples/librispeech/local/run_test_golden.sh b/examples/librispeech/local/test_golden.sh similarity index 100% rename from examples/librispeech/local/run_test_golden.sh rename to examples/librispeech/local/test_golden.sh diff --git a/examples/librispeech/local/run_train.sh b/examples/librispeech/local/train.sh similarity index 100% rename from examples/librispeech/local/run_train.sh rename to examples/librispeech/local/train.sh diff --git a/examples/librispeech/local/run_tune.sh b/examples/librispeech/local/tune.sh similarity index 100% rename from examples/librispeech/local/run_tune.sh rename to examples/librispeech/local/tune.sh diff --git a/examples/librispeech/models b/examples/librispeech/models new file mode 120000 index 000000000..9e68e9945 --- /dev/null +++ b/examples/librispeech/models @@ -0,0 +1 @@ +../../models \ No newline at end of file diff --git a/examples/librispeech/run.sh b/examples/librispeech/run.sh index c8e589139..c5f66ae1d 100644 --- a/examples/librispeech/run.sh +++ b/examples/librispeech/run.sh @@ -3,22 +3,16 @@ source path.sh # prepare data -bash ./local/run_data.sh - -# test pretrain model -bash ./local/run_test_golden.sh - -# test 
pretain model -bash ./local/run_infer_golden.sh +bash ./local/data.sh # train model -bash ./local/run_train.sh +bash ./local/train.sh # test model -bash ./local/run_test.sh +bash ./local/test.sh # infer model -bash ./local/run_infer.sh +bash ./local/infer.sh # tune model -bash ./local/run_tune.sh +#bash ./local/tune.sh diff --git a/examples/tiny/README.md b/examples/tiny/README.md index 498bc00e5..c3bfdc9c4 100644 --- a/examples/tiny/README.md +++ b/examples/tiny/README.md @@ -7,39 +7,39 @@ - Prepare the data ```bash - sh local/run_data.sh + bash local/data.sh ``` - `run_data.sh` will download dataset, generate manifests, collect normalizer's statistics and build vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech` and the corresponding manifest files generated in `${PWD}/data` as well as a mean stddev file and a vocabulary file. It has to be run for the very first time you run this dataset and is reusable for all further experiments. + `data.sh` will download dataset, generate manifests, collect normalizer's statistics and build vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech` and the corresponding manifest files generated in `${PWD}/data` as well as a mean stddev file and a vocabulary file. It has to be run for the very first time you run this dataset and is reusable for all further experiments. - Train your own ASR model ```bash - sh local/run_train.sh + bash local/train.sh ``` - `run_train.sh` will start a training job, with training logs printed to stdout and model checkpoint of every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints could be used for training resuming, inference, evaluation and deployment. + `train.sh` will start a training job, with training logs printed to stdout and model checkpoint of every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints could be used for training resuming, inference, evaluation and deployment. - Case inference with an existing model ```bash - sh local/run_infer.sh + bash local/infer.sh ``` - `run_infer.sh` will show us some speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good now as the current model is only trained with a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained (trained for several days, with the complete LibriSpeech) model and do the inference: + `infer.sh` will show us some speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good now as the current model is only trained with a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained (trained for several days, with the complete LibriSpeech) model and do the inference: ```bash - sh local/run_infer_golden.sh + bash local/infer_golden.sh ``` - Evaluate an existing model ```bash - sh local/run_test.sh + bash local/test.sh ``` - `run_test.sh` will evaluate the model with Word Error Rate (or Character Error Rate) measurement. Similarly, you can also download a well-trained model and test its performance: + `test.sh` will evaluate the model with Word Error Rate (or Character Error Rate) measurement. 
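For intuition, WER is a word-level edit distance normalized by the reference length, and CER is the same computation over characters. The snippet below is a compact stand-in for illustration only; the metrics actually used by `test.py` come from this repo's `utils.error_rate` module and may differ in details such as tokenization and case handling.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 reference words -> 0.33
```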
Similarly, you can also download a well-trained model and test its performance: ```bash - sh local/run_test_golden.sh + bash local/test_golden.sh ``` diff --git a/examples/tiny/local/run_data.sh b/examples/tiny/local/data.sh similarity index 100% rename from examples/tiny/local/run_data.sh rename to examples/tiny/local/data.sh diff --git a/examples/tiny/local/run_infer_golden.sh b/examples/tiny/local/infer_golden.sh similarity index 100% rename from examples/tiny/local/run_infer_golden.sh rename to examples/tiny/local/infer_golden.sh diff --git a/examples/tiny/local/run_test.sh b/examples/tiny/local/test.sh similarity index 100% rename from examples/tiny/local/run_test.sh rename to examples/tiny/local/test.sh diff --git a/examples/tiny/local/run_test_golden.sh b/examples/tiny/local/test_golden.sh similarity index 100% rename from examples/tiny/local/run_test_golden.sh rename to examples/tiny/local/test_golden.sh diff --git a/examples/tiny/local/run_train.sh b/examples/tiny/local/train.sh similarity index 100% rename from examples/tiny/local/run_train.sh rename to examples/tiny/local/train.sh diff --git a/examples/tiny/local/run_tune.sh b/examples/tiny/local/tune.sh similarity index 100% rename from examples/tiny/local/run_tune.sh rename to examples/tiny/local/tune.sh diff --git a/examples/tiny/run.sh b/examples/tiny/run.sh index 4fa15bd53..01ad06516 100644 --- a/examples/tiny/run.sh +++ b/examples/tiny/run.sh @@ -4,22 +4,16 @@ set -e source path.sh # prepare data -bash ./local/run_data.sh - -## test pretrain model -#bash ./local/run_test_golden.sh -# -## test pretain model -#bash ./local/run_infer_golden.sh +bash ./local/data.sh # train model -bash ./local/run_train.sh +bash ./local/train.sh # test model -bash ./local/run_test.sh +bash ./local/test.sh # infer model -bash ./local/run_infer.sh +bash ./local/infer.sh ## tune model -#bash ./local/run_tune.sh +#bash ./local/tune.sh diff --git a/model_utils/model.py b/model_utils/model.py index 4a5c030b4..9edae9da6 100644 --- a/model_utils/model.py +++ b/model_utils/model.py @@ -20,12 +20,12 @@ import time import logging import numpy as np from collections import defaultdict +from functools import partial import paddle from paddle import distributed as dist from paddle.io import DataLoader - from paddle.fluid.dygraph import base as imperative_base from paddle.fluid import layers from paddle.fluid import framework @@ -51,6 +51,7 @@ from utils.error_rate import char_errors, word_errors, cer, wer logger = logging.getLogger(__name__) + class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm): def __init__(self, clip_norm): super().__init__(clip_norm) @@ -70,7 +71,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm): merge_grad = layers.get_tensor_from_selected_rows(merge_grad) square = layers.square(merge_grad) sum_square = layers.reduce_sum(square) - logger.info(f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }") + logger.info( + f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }" + ) sum_square_list.append(sum_square) # all parameters have been filterd out @@ -85,8 +88,7 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm): shape=[1], dtype=global_norm_var.dtype, value=self.clip_norm) clip_var = layers.elementwise_div( x=max_global_norm, - y=layers.elementwise_max( - x=global_norm_var, y=max_global_norm)) + y=layers.elementwise_max(x=global_norm_var, y=max_global_norm)) for p, g in params_grads: if g is None: continue @@ 
-94,7 +96,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm): params_and_grads.append((p, g)) continue new_grad = layers.elementwise_mul(x=g, y=clip_var) - logger.info(f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }") + logger.info( + f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }" + ) params_and_grads.append((p, new_grad)) return params_and_grads @@ -106,12 +110,14 @@ def print_grads(model, logger=None): if logger: logger.info(msg) + def print_params(model, logger=None): for n, p in model.named_parameters(): msg = f"param: {n}: shape: {p.shape} stop_grad: {p.stop_gradient}" - if logger: + if logger: logger.info(msg) + class DeepSpeech2Trainer(Trainer): def __init__(self, config, args): super().__init__(config, args) @@ -126,8 +132,7 @@ class DeepSpeech2Trainer(Trainer): start = time.time() self.model.train() - audio, text, audio_len, text_len = batch_data - outputs = self.model(audio, text, audio_len, text_len) + outputs = self.model(*batch_data) loss = self.compute_losses(batch_data, outputs) loss.backward() @@ -204,7 +209,7 @@ class DeepSpeech2Trainer(Trainer): valid_losses = defaultdict(list) for i, batch in enumerate(self.valid_loader): audio, text, audio_len, text_len = batch - outputs = self.model(audio, text, audio_len, text_len) + outputs = self.model(*batch) loss = self.compute_losses(batch, outputs) metrics = self.compute_metrics(batch, outputs) @@ -243,8 +248,7 @@ class DeepSpeech2Trainer(Trainer): print_params(model, self.logger) - grad_clip = MyClipGradByGlobalNorm( - config.training.global_grad_clip) + grad_clip = MyClipGradByGlobalNorm(config.training.global_grad_clip) # optimizer = paddle.optimizer.Adam( # learning_rate=config.training.lr, @@ -313,7 +317,7 @@ class DeepSpeech2Trainer(Trainer): use_dB_normalization=config.data.use_dB_normalization, target_dB=config.data.target_dB, random_seed=config.data.random_seed, - keep_transcription_text=False) + keep_transcription_text=True) if self.parallel: batch_sampler = DeepSpeech2DistributedBatchSampler( @@ -338,14 +342,14 @@ class DeepSpeech2Trainer(Trainer): self.train_loader = DataLoader( train_dataset, batch_sampler=batch_sampler, - collate_fn=collate_fn, + collate_fn=SpeechCollator(is_training=True), num_workers=config.data.num_workers, ) self.valid_loader = DataLoader( dev_dataset, batch_size=config.data.batch_size, shuffle=False, drop_last=False, - collate_fn=collate_fn) + collate_fn=SpeechCollator(is_training=True)) self.logger.info("Setup train/valid Dataloader!") @@ -353,13 +357,14 @@ class DeepSpeech2Tester(DeepSpeech2Trainer): def __init__(self, config, args): super().__init__(config, args) - def id2token(self, texts, texts_len, vocab_list): + def ordid2token(self, texts, texts_len): + """ ord() id to chr() chr """ trans = [] for text, n in zip(texts, texts_len): n = n.numpy().item() ids = text[:n] - trans.append(''.join([vocab_list[i] for i in ids])) - return np.array(trans) + trans.append(''.join([chr(i) for i in ids])) + return trans def compute_metrics(self, inputs, outputs): cfg = self.config.decoding @@ -372,10 +377,8 @@ class DeepSpeech2Tester(DeepSpeech2Trainer): error_rate_func = cer if cfg.error_rate_type == 'cer' else wer vocab_list = self.test_loader.dataset.vocab_list - for t in vocab_list: - self.logger.info(f"vocab: {t}") - - target_transcripts = self.id2token(texts, texts_len, vocab_list) + + target_transcripts = self.ordid2token(texts, texts_len) result_transcripts = 
self.model.decode_probs( probs.numpy(), vocab_list, @@ -513,13 +516,12 @@ class DeepSpeech2Tester(DeepSpeech2Trainer): use_dB_normalization=config.data.use_dB_normalization, target_dB=config.data.target_dB, random_seed=config.data.random_seed, - keep_transcription_text=False) + keep_transcription_text=True) - collate_fn = SpeechCollator() self.test_loader = DataLoader( test_dataset, batch_size=config.decoding.batch_size, shuffle=False, drop_last=False, - collate_fn=collate_fn) + collate_fn=SpeechCollator(is_training=False)) self.logger.info("Setup test Dataloader!") diff --git a/model_utils/network.py b/model_utils/network.py index 00a3dd885..03c6163e5 100644 --- a/model_utils/network.py +++ b/model_utils/network.py @@ -31,32 +31,6 @@ logger = logging.getLogger(__name__) __all__ = ['DeepSpeech2', 'DeepSpeech2Loss'] -def ctc_loss(logits, - labels, - input_lengths, - label_lengths, - blank=0, - reduction='mean', - norm_by_times=False): - #logger.info("my ctc loss with norm by times") - ## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403 - loss_out = paddle.fluid.layers.warpctc( - logits, labels, blank, norm_by_times, input_lengths, label_lengths) - - loss_out = paddle.fluid.layers.squeeze(loss_out, [-1]) - logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ") - assert reduction in ['mean', 'sum', 'none'] - if reduction == 'mean': - loss_out = paddle.mean(loss_out / label_lengths) - elif reduction == 'sum': - loss_out = paddle.sum(loss_out) - logger.info(f"ctc loss: {loss_out}") - return loss_out - - -#F.ctc_loss = ctc_loss - - def brelu(x, t_min=0.0, t_max=24.0, name=None): t_min = paddle.to_tensor(t_min) t_max = paddle.to_tensor(t_max) @@ -161,7 +135,7 @@ class ConvStack(nn.Layer): self.conv_in = ConvBn( num_channels_in=1, num_channels_out=32, - kernel_size=(41, 11), #[D, T] + kernel_size=(41, 11), #[D, T] stride=(2, 3), padding=(20, 5), act='brelu') @@ -330,7 +304,6 @@ class GRUCellShare(nn.RNNCellBase): c = self._activation(x_c + r * h_c) # apply reset gate after mm h = (pre_hidden - c) * z + c # https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/fluid/layers/dynamic_gru_cn.html#dynamic-gru - #h = (1-z) * pre_hidden + z * c return h, h @@ -716,6 +689,32 @@ class DeepSpeech2(nn.Layer): beam_beta, beam_size, cutoff_prob, cutoff_top_n, num_processes) +def ctc_loss(logits, + labels, + input_lengths, + label_lengths, + blank=0, + reduction='mean', + norm_by_times=True): + #logger.info("my ctc loss with norm by times") + ## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403 + loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times, + input_lengths, label_lengths) + + loss_out = paddle.fluid.layers.squeeze(loss_out, [-1]) + logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ") + assert reduction in ['mean', 'sum', 'none'] + if reduction == 'mean': + loss_out = paddle.mean(loss_out / label_lengths) + elif reduction == 'sum': + loss_out = paddle.sum(loss_out) + logger.info(f"ctc loss: {loss_out}") + return loss_out + + +F.ctc_loss = ctc_loss + + class DeepSpeech2Loss(nn.Layer): def __init__(self, vocab_size): super().__init__()
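Taken together, the `keep_transcription_text=True` and `SpeechCollator(is_training=...)` changes above mean that evaluation batches carry raw transcript characters through the int32 text tensor: the collator stores each character's Unicode code point with `ord()`, and `DeepSpeech2Tester.ordid2token` recovers the string with `chr()`, so the reference text no longer needs the vocabulary list for scoring. The following is a minimal NumPy sketch of that round trip, simplified from `SpeechCollator.__call__` and `ordid2token` (Paddle tensors and audio padding omitted):

```python
import numpy as np

def collate_texts(texts, is_training=False):
    """Pad transcripts into an int array, mirroring the text branch of SpeechCollator.__call__.

    Training mode expects lists of token ids; evaluation mode expects raw strings,
    which are stored as Unicode code points via ord().
    """
    max_len = max(len(t) for t in texts)
    padded = np.zeros([len(texts), max_len], dtype='int32')
    lens = np.array([len(t) for t in texts], dtype='int64')
    for i, text in enumerate(texts):
        if is_training:
            padded[i, :len(text)] = text                      # token ids
        else:
            padded[i, :len(text)] = [ord(c) for c in text]    # raw characters
    return padded, lens

def ordid2token(padded, lens):
    """Recover the original strings, as DeepSpeech2Tester.ordid2token does at test time."""
    return [''.join(chr(int(i)) for i in row[:n]) for row, n in zip(padded, lens)]

# round trip for evaluation-mode (string) transcripts
padded, lens = collate_texts(["hello", "hi"], is_training=False)
assert ordid2token(padded, lens) == ["hello", "hi"]
```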