remove run prefix
using ord value as text id
pull/522/head
Hui Zhang 5 years ago
parent f121f851d9
commit 5fe1b40630

@@ -197,7 +197,7 @@ For more help on arguments:
```bash
python3 train.py --help
```
or refer to `example/librispeech/local/run_train.sh`.
or refer to `example/librispeech/local/train.sh`.
### Data Augmentation Pipeline
@@ -239,7 +239,7 @@ Be careful when utilizing the data augmentation technique, as improper augmentat
### Training for Mandarin Language
The key steps of training for Mandarin are the same as those for English, and we also provide an example of Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by ./models/aishell/download_model.sh) for users to try with ```sh run_infer_golden.sh``` and ```sh run_test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
The key steps of training for Mandarin are the same as those for English, and we also provide an example of Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by ./models/aishell/download_model.sh) for users to try with ```sh infer_golden.sh``` and ```sh test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
## Inference and Evaluation
@@ -299,7 +299,7 @@ For more help on arguments:
```
python3 infer.py --help
```
or refer to `example/librispeech/local/run_infer.sh`.
or refer to `example/librispeech/local/infer.sh`.
### Evaluate a Model
@@ -324,7 +324,7 @@ For more help on arguments:
```bash
python3 test.py --help
```
or refer to `example/librispeech/local/run_test.sh`.
or refer to `example/librispeech/local/test.sh`.
## Hyper-parameters Tuning
@@ -364,7 +364,7 @@ After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation
```bash
python3 tune.py --help
```
or refer to `example/librispeech/local/run_tune.sh`.
or refer to `example/librispeech/local/tune.sh`.
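For context, the $\alpha$ and $\beta$ being tuned here are the language-model weight and the word-insertion weight of the CTC beam search decoder. In the usual Deep Speech 2 style scoring (stated here as a hedged summary; see the decoder source for the exact form), a candidate transcription $\mathbf{c}$ is ranked by

$$Q(\mathbf{c}) = \log p_{\mathrm{ctc}}(\mathbf{c} \mid \mathbf{x}) + \alpha \log p_{\mathrm{lm}}(\mathbf{c}) + \beta \, \mathrm{word\_count}(\mathbf{c})$$

so the tuning trades off the acoustic model, the external language model, and a length bonus.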
## Trying Live Demo with Your Own Voice
@@ -403,7 +403,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking
Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can be run on one without any audio recording hardware, e.g. any remote server machine. If the server and client run on two separate machines, just make sure the `host_ip` and `host_port` arguments are set to an IP address and port that are actually reachable. No extra setup is needed if they run on a single machine.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/run_demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
For more help on arguments:
@@ -427,7 +427,7 @@ VoxForge European | 30.15 | 18.64
VoxForge Indian | 53.73 | 25.51
Baidu Internal Testset | 40.75 | 8.48
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```sh run_data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```sh data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
#### Benchmark Results for Mandarin Model (Character Error Rate)

@@ -197,7 +197,7 @@ python3 tools/build_vocab.py --help
```bash
python3 train.py --help
```
or refer to `example/librispeech/local/run_train.sh`.
or refer to `example/librispeech/local/train.sh`.
### Data Augmentation Pipeline
@@ -238,7 +238,7 @@ python3 train.py --help
### Training for Mandarin Language
The key steps of training for Mandarin are the same as those for English, and we provide an example of Mandarin training with Aishell in ```examples/aishell```. As mentioned above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by running ./models/aishell/download_model.sh) for users to try with ```run_infer_golden.sh``` and ```run_test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
The key steps of training for Mandarin are the same as those for English, and we provide an example of Mandarin training with Aishell in ```examples/aishell```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by running ./models/aishell/download_model.sh) for users to try with ```infer_golden.sh``` and ```test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
@@ -300,7 +300,7 @@ bash download_lm_ch.sh
```
python3 infer.py --help
```
or refer to `example/librispeech/local/run_infer.sh`.
or refer to `example/librispeech/local/infer.sh`.
### Evaluate a Model
@@ -325,7 +325,7 @@ python3 infer.py --help
```bash
python3 test.py --help
```
or refer to `example/librispeech/local/run_test.sh`.
or refer to `example/librispeech/local/test.sh`.
@@ -367,7 +367,7 @@ python3 test.py --help
```bash
python3 tune.py --help
```
or refer to `example/librispeech/local/run_tune.sh`.
or refer to `example/librispeech/local/tune.sh`.
## Trying Live Demo with Your Own Voice
@@ -406,7 +406,7 @@ python3 -u deploy/demo_client.py \
Note that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can be run on one without any audio recording hardware, e.g. any remote server machine. If the server and client run on two separate machines, just make sure the `host_ip` and `host_port` arguments are set to an IP address and port that are actually reachable. No extra setup is needed if they run on a single machine.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/run_demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with it. By running `examples/deploy_demo/demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
For more help on arguments:
@@ -430,7 +430,7 @@ VoxForge European | 30.15 | 18.64
VoxForge Indian | 53.73 | 25.51
Baidu Internal Testset | 40.75 | 8.48
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```run_data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```data.sh``` to get the VoxForge dialect manifest files. Note that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.
#### Benchmark Results for Mandarin Model (Character Error Rate)

@@ -428,7 +428,7 @@ class DeepSpeech2BatchSampler(BatchSampler):
class SpeechCollator():
def __init__(self, padding_to=-1):
def __init__(self, padding_to=-1, is_training=False):
"""
Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one batch.
@@ -438,6 +438,7 @@ class SpeechCollator():
target shape (only refers to the second axis).
"""
self._padding_to = padding_to
self._is_training = is_training
def __call__(self, batch):
new_batch = []
@@ -461,7 +462,10 @@ class SpeechCollator():
audio_lens.append(audio.shape[1])
# text
padded_text = np.zeros([max_text_length])
padded_text[:len(text)] = text
if self._is_training:
padded_text[:len(text)] = text  # token ids
else:
padded_text[:len(text)] = [ord(t) for t in text]  # raw string stored as ord() code points
texts.append(padded_text)
text_lens.append(len(text))
@@ -472,61 +476,61 @@ class SpeechCollator():
return padded_audios, texts, audio_lens, text_lens
def create_dataloader(manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config='{}',
max_duration=float('inf'),
min_duration=0.0,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
random_seed=0,
keep_transcription_text=False,
is_training=False,
batch_size=1,
num_workers=0,
sortagrad=False,
shuffle_method=None,
dist=False):
dataset = DeepSpeech2Dataset(
manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config=augmentation_config,
max_duration=max_duration,
min_duration=min_duration,
stride_ms=stride_ms,
window_ms=window_ms,
max_freq=max_freq,
specgram_type=specgram_type,
use_dB_normalization=use_dB_normalization,
random_seed=random_seed,
keep_transcription_text=keep_transcription_text)
if dist:
batch_sampler = DeepSpeech2DistributedBatchSampler(
dataset,
batch_size,
num_replicas=None,
rank=None,
shuffle=is_training,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
else:
batch_sampler = DeepSpeech2BatchSampler(
dataset,
shuffle=is_training,
batch_size=batch_size,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
def padding_batch(batch, padding_to=-1, flatten=False, is_training=True):
def create_dataloader(manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config='{}',
max_duration=float('inf'),
min_duration=0.0,
stride_ms=10.0,
window_ms=20.0,
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
random_seed=0,
keep_transcription_text=False,
is_training=False,
batch_size=1,
num_workers=0,
sortagrad=False,
shuffle_method=None,
dist=False):
dataset = DeepSpeech2Dataset(
manifest_path,
vocab_filepath,
mean_std_filepath,
augmentation_config=augmentation_config,
max_duration=max_duration,
min_duration=min_duration,
stride_ms=stride_ms,
window_ms=window_ms,
max_freq=max_freq,
specgram_type=specgram_type,
use_dB_normalization=use_dB_normalization,
random_seed=random_seed,
keep_transcription_text=keep_transcription_text)
if dist:
batch_sampler = DeepSpeech2DistributedBatchSampler(
dataset,
batch_size,
num_replicas=None,
rank=None,
shuffle=is_training,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
else:
batch_sampler = DeepSpeech2BatchSampler(
dataset,
shuffle=is_training,
batch_size=batch_size,
drop_last=is_training,
sortagrad=is_training,
shuffle_method=shuffle_method)
def padding_batch(batch, padding_to=-1, flatten=False, is_training=True):
"""
Padding audio features with zeros to make them have the same shape (or
a user-defined shape) within one batch.
@@ -536,42 +540,45 @@ def create_dataloader(manifest_path,
target shape (only refers to the second axis).
If `flatten` is True, features will be flattened to a 1-D array.
"""
new_batch = []
"""
new_batch = []
# get target shape
max_length = max([audio.shape[1] for audio, text in batch])
if padding_to != -1:
if padding_to < max_length:
raise ValueError("If padding_to is not -1, it should be larger "
"than any instance's shape in the batch")
max_length = padding_to
max_text_length = max([len(text) for audio, text in batch])
max_length = max([audio.shape[1] for audio, text in batch])
if padding_to != -1:
if padding_to < max_length:
raise ValueError("If padding_to is not -1, it should be larger "
"than any instance's shape in the batch")
max_length = padding_to
max_text_length = max([len(text) for audio, text in batch])
# padding
padded_audios = []
audio_lens = []
texts, text_lens = [], []
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
padded_audios.append(padded_audio)
audio_lens.append(audio.shape[1])
padded_text = np.zeros([max_text_length])
padded_text[:len(text)] = text
texts.append(padded_text)
text_lens.append(len(text))
padded_audios = np.array(padded_audios).astype('float32')
audio_lens = np.array(audio_lens).astype('int64')
texts = np.array(texts).astype('int32')
text_lens = np.array(text_lens).astype('int64')
return padded_audios, texts, audio_lens, text_lens
loader = DataLoader(
dataset,
batch_sampler=batch_sampler,
collate_fn=partial(padding_batch, is_training=is_training),
num_workers=num_workers)
return loader
padded_audios = []
audio_lens = []
texts, text_lens = [], []
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
padded_audios.append(padded_audio)
audio_lens.append(audio.shape[1])
padded_text = np.zeros([max_text_length])
if is_training:
padded_text[:len(text)] = text  # token ids
else:
padded_text[:len(text)] = [ord(t) for t in text]  # raw string stored as ord() code points
texts.append(padded_text)
text_lens.append(len(text))
padded_audios = np.array(padded_audios).astype('float32')
audio_lens = np.array(audio_lens).astype('int64')
texts = np.array(texts).astype('int32')
text_lens = np.array(text_lens).astype('int64')
return padded_audios, texts, audio_lens, text_lens
loader = DataLoader(
dataset,
batch_sampler=batch_sampler,
collate_fn=partial(padding_batch, is_training=is_training),
num_workers=num_workers)
return loader
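The collator and `padding_batch` changes above are the heart of the "using ord value as text id" commit: when `is_training` is false, the transcript stays a raw string and is packed into the padded batch as `ord()` code points, to be mapped back to text later with `chr()`. Below is a minimal standalone sketch of that round trip (illustration only; the helper names are hypothetical and this is not the repository's `SpeechCollator`):

```python
import numpy as np

def pad_transcripts(texts):
    """Pack raw transcript strings into a fixed-width int array of ord() code points."""
    seqs = [[ord(ch) for ch in t] for t in texts]
    max_len = max(len(s) for s in seqs)
    padded = np.zeros([len(seqs), max_len], dtype='int32')
    lens = np.array([len(s) for s in seqs], dtype='int64')
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
    return padded, lens

def unpack_transcripts(padded, lens):
    """Inverse mapping: chr() the non-padding entries back into strings."""
    return [''.join(chr(int(c)) for c in row[:n]) for row, n in zip(padded, lens)]

texts, text_lens = pad_transcripts(["hello world", "hi"])
assert unpack_transcripts(texts, text_lens) == ["hello world", "hi"]
```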

@@ -38,7 +38,7 @@ training:
save_interval: 1000
valid_interval: 1000
decoding:
batch_size: 128
batch_size: 10
error_rate_type: cer
decoding_method: ctc_beam_search
lang_model_path: models/lm/zh_giga.no_cna_cmn.prune01244.klm
@@ -48,4 +48,3 @@ decoding:
cutoff_prob: 0.99
cutoff_top_n: 40
num_proc_bsearch: 8

@@ -9,7 +9,6 @@ fi
cd - > /dev/null
CUDA_VISIBLE_DEVICES=6 \
python3 -u ${MAIN_ROOT}/test.py \
--device 'gpu' \
--nproc 1 \

@@ -3,19 +3,19 @@
source path.sh
# prepare data
bash ./local/run_data.sh
bash ./local/data.sh
# test pretrained model
bash ./local/run_test_golden.sh
bash ./local/test_golden.sh
# infer with pretrained model
bash ./local/run_infer_golden.sh
bash ./local/infer_golden.sh
# train model
bash ./local/run_train.sh
bash ./local/train.sh
# test model
bash ./local/run_test.sh
bash ./local/test.sh
# infer model
bash ./local/run_infer.sh
bash ./local/infer.sh

@@ -0,0 +1,8 @@
[
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
}
]
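This new `augmentation.config` enables a single "shift" augmentor: with probability 1.0, each utterance is shifted in time by a random offset between -5 ms and 5 ms. A rough standalone sketch of what such a shift does to a raw waveform (an illustration under these assumptions, not the repository's augmentor implementation):

```python
import numpy as np

def random_shift(samples, sample_rate, min_shift_ms=-5.0, max_shift_ms=5.0, rng=None):
    """Shift a 1-D waveform by a random offset in milliseconds, zero-padding the gap."""
    rng = rng or np.random.default_rng()
    shift = int(sample_rate * rng.uniform(min_shift_ms, max_shift_ms) / 1000.0)
    out = np.zeros_like(samples)
    if shift > 0:          # delay the audio: content moves right
        out[shift:] = samples[:len(samples) - shift]
    elif shift < 0:        # advance the audio: content moves left
        out[:shift] = samples[-shift:]
    else:
        out[:] = samples
    return out

wav = np.random.randn(16000).astype('float32')    # 1 s of audio at 16 kHz
augmented = random_shift(wav, sample_rate=16000)  # shifted by at most +/-5 ms (80 samples)
```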

@@ -0,0 +1,51 @@
# https://yaml.org/type/float.html
data:
train_manifest: data/manifest.tiny
dev_manifest: data/manifest.tiny
test_manifest: data/manifest.tiny
mean_std_filepath: data/mean_std.npz
vocab_filepath: data/vocab.txt
augmentation_config: conf/augmentation.config
batch_size: 4
max_duration: 27.0
min_duration: 0.0
specgram_type: linear
target_sample_rate: 16000
max_freq: None
n_fft: None
stride_ms: 10.0
window_ms: 20.0
use_dB_normalization: True
target_dB: -20
random_seed: 0
keep_transcription_text: False
sortagrad: True
shuffle_method: batch_shuffle
num_workers: 0
model:
num_conv_layers: 2
num_rnn_layers: 3
rnn_layer_size: 2048
use_gru: True
share_rnn_weights: True
training:
n_epoch: 20
lr: 1e-5
weight_decay: 1e-06
global_grad_clip: 400.0
max_iteration: 500000
plot_interval: 1000
save_interval: 1000
valid_interval: 1000
decoding:
batch_size: 128
error_rate_type: wer
decoding_method: ctc_beam_search
lang_model_path: models/lm/common_crawl_00.prune01111.trie.klm
alpha: 2.5
beta: 0.3
beam_size: 500
cutoff_prob: 1.0
cutoff_top_n: 40
num_proc_bsearch: 8
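The `# https://yaml.org/type/float.html` comment at the top of this config points at a real pitfall: PyYAML follows YAML 1.1, whose float tag requires a decimal point, so a bare `1e-5` loads as a string, and `None` loads as the string `'None'` rather than a null. A small sketch of the behaviour (assuming the config is read with PyYAML; the repository's own loader may coerce types differently):

```python
import yaml

cfg = yaml.safe_load("""
training:
  lr: 1e-5              # no '.', so YAML 1.1 treats this as a string
  weight_decay: 1.0e-6  # has a '.', loads as a float
data:
  max_freq: None        # loads as the string 'None', not a null
""")

assert isinstance(cfg['training']['lr'], str)
assert isinstance(cfg['training']['weight_decay'], float)
lr = float(cfg['training']['lr'])  # an explicit cast is the safe way to consume it
```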

@@ -3,22 +3,16 @@
source path.sh
# prepare data
bash ./local/run_data.sh
# test pretrained model
bash ./local/run_test_golden.sh
# infer with pretrained model
bash ./local/run_infer_golden.sh
bash ./local/data.sh
# train model
bash ./local/run_train.sh
bash ./local/train.sh
# test model
bash ./local/run_test.sh
bash ./local/test.sh
# infer model
bash ./local/run_infer.sh
bash ./local/infer.sh
# tune model
bash ./local/run_tune.sh
#bash ./local/tune.sh

@@ -7,39 +7,39 @@
- Prepare the data
```bash
sh local/run_data.sh
bash local/data.sh
```
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech`, the corresponding manifest files generated in `${PWD}/data`, as well as a mean-stddev file and a vocabulary file. It only needs to be run the first time you use this dataset, and the results are reusable for all further experiments.
`data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `${MAIN_ROOT}/dataset/librispeech`, the corresponding manifest files generated in `${PWD}/data`, as well as a mean-stddev file and a vocabulary file. It only needs to be run the first time you use this dataset, and the results are reusable for all further experiments.
- Train your own ASR model
```bash
sh local/run_train.sh
bash local/train.sh
```
`run_train.sh` will start a training job, with training logs printed to stdout and a model checkpoint for every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints can be used for resuming training, inference, evaluation and deployment.
`train.sh` will start a training job, with training logs printed to stdout and a model checkpoint for every pass/epoch saved to `${PWD}/checkpoints`. These checkpoints can be used for resuming training, inference, evaluation and deployment.
- Case inference with an existing model
```bash
sh local/run_infer.sh
bash local/infer.sh
```
`run_infer.sh` will show speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good for now, as the current model is only trained on a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained model (trained for several days on the complete LibriSpeech) and run inference with it:
`infer.sh` will show speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good for now, as the current model is only trained on a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained model (trained for several days on the complete LibriSpeech) and run inference with it:
```bash
sh local/run_infer_golden.sh
bash local/infer_golden.sh
```
- Evaluate an existing model
```bash
sh local/run_test.sh
bash local/test.sh
```
`run_test.sh` will evaluate the model using the Word Error Rate (or Character Error Rate) metric. Similarly, you can also download a well-trained model and test its performance:
`test.sh` will evaluate the model using the Word Error Rate (or Character Error Rate) metric. Similarly, you can also download a well-trained model and test its performance:
```bash
sh local/run_test_golden.sh
bash local/test_golden.sh
```

@@ -4,22 +4,16 @@ set -e
source path.sh
# prepare data
bash ./local/run_data.sh
## test pretrained model
#bash ./local/run_test_golden.sh
#
## infer with pretrained model
#bash ./local/run_infer_golden.sh
bash ./local/data.sh
# train model
bash ./local/run_train.sh
bash ./local/train.sh
# test model
bash ./local/run_test.sh
bash ./local/test.sh
# infer model
bash ./local/run_infer.sh
bash ./local/infer.sh
## tune model
#bash ./local/run_tune.sh
#bash ./local/tune.sh

@@ -20,12 +20,12 @@ import time
import logging
import numpy as np
from collections import defaultdict
from functools import partial
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.fluid.dygraph import base as imperative_base
from paddle.fluid import layers
from paddle.fluid import framework
@@ -51,6 +51,7 @@ from utils.error_rate import char_errors, word_errors, cer, wer
logger = logging.getLogger(__name__)
class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
def __init__(self, clip_norm):
super().__init__(clip_norm)
@@ -70,7 +71,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
square = layers.square(merge_grad)
sum_square = layers.reduce_sum(square)
logger.info(f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }")
logger.info(
f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
)
sum_square_list.append(sum_square)
# all parameters have been filtered out
@@ -85,8 +88,7 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
shape=[1], dtype=global_norm_var.dtype, value=self.clip_norm)
clip_var = layers.elementwise_div(
x=max_global_norm,
y=layers.elementwise_max(
x=global_norm_var, y=max_global_norm))
y=layers.elementwise_max(x=global_norm_var, y=max_global_norm))
for p, g in params_grads:
if g is None:
continue
@@ -94,7 +96,9 @@ class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
params_and_grads.append((p, g))
continue
new_grad = layers.elementwise_mul(x=g, y=clip_var)
logger.info(f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }")
logger.info(
f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
)
params_and_grads.append((p, new_grad))
return params_and_grads
@@ -106,12 +110,14 @@ def print_grads(model, logger=None):
if logger:
logger.info(msg)
def print_params(model, logger=None):
for n, p in model.named_parameters():
msg = f"param: {n}: shape: {p.shape} stop_grad: {p.stop_gradient}"
if logger:
if logger:
logger.info(msg)
class DeepSpeech2Trainer(Trainer):
def __init__(self, config, args):
super().__init__(config, args)
@@ -126,8 +132,7 @@ class DeepSpeech2Trainer(Trainer):
start = time.time()
self.model.train()
audio, text, audio_len, text_len = batch_data
outputs = self.model(audio, text, audio_len, text_len)
outputs = self.model(*batch_data)
loss = self.compute_losses(batch_data, outputs)
loss.backward()
@@ -204,7 +209,7 @@ class DeepSpeech2Trainer(Trainer):
valid_losses = defaultdict(list)
for i, batch in enumerate(self.valid_loader):
audio, text, audio_len, text_len = batch
outputs = self.model(audio, text, audio_len, text_len)
outputs = self.model(*batch)
loss = self.compute_losses(batch, outputs)
metrics = self.compute_metrics(batch, outputs)
@@ -243,8 +248,7 @@ class DeepSpeech2Trainer(Trainer):
print_params(model, self.logger)
grad_clip = MyClipGradByGlobalNorm(
config.training.global_grad_clip)
grad_clip = MyClipGradByGlobalNorm(config.training.global_grad_clip)
# optimizer = paddle.optimizer.Adam(
# learning_rate=config.training.lr,
@@ -313,7 +317,7 @@ class DeepSpeech2Trainer(Trainer):
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=False)
keep_transcription_text=True)
if self.parallel:
batch_sampler = DeepSpeech2DistributedBatchSampler(
@@ -338,14 +342,14 @@ self.train_loader = DataLoader(
self.train_loader = DataLoader(
train_dataset,
batch_sampler=batch_sampler,
collate_fn=collate_fn,
collate_fn=SpeechCollator(is_training=True),
num_workers=config.data.num_workers, )
self.valid_loader = DataLoader(
dev_dataset,
batch_size=config.data.batch_size,
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
collate_fn=SpeechCollator(is_training=True))
self.logger.info("Setup train/valid Dataloader!")
@@ -353,13 +357,14 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
def __init__(self, config, args):
super().__init__(config, args)
def id2token(self, texts, texts_len, vocab_list):
def ordid2token(self, texts, texts_len):
""" ord() id to chr() chr """
trans = []
for text, n in zip(texts, texts_len):
n = n.numpy().item()
ids = text[:n]
trans.append(''.join([vocab_list[i] for i in ids]))
return np.array(trans)
trans.append(''.join([chr(i) for i in ids]))
return trans
def compute_metrics(self, inputs, outputs):
cfg = self.config.decoding
@@ -372,10 +377,8 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
error_rate_func = cer if cfg.error_rate_type == 'cer' else wer
vocab_list = self.test_loader.dataset.vocab_list
for t in vocab_list:
self.logger.info(f"vocab: {t}")
target_transcripts = self.id2token(texts, texts_len, vocab_list)
target_transcripts = self.ordid2token(texts, texts_len)
result_transcripts = self.model.decode_probs(
probs.numpy(),
vocab_list,
@@ -513,13 +516,12 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
use_dB_normalization=config.data.use_dB_normalization,
target_dB=config.data.target_dB,
random_seed=config.data.random_seed,
keep_transcription_text=False)
keep_transcription_text=True)
collate_fn = SpeechCollator()
self.test_loader = DataLoader(
test_dataset,
batch_size=config.decoding.batch_size,
shuffle=False,
drop_last=False,
collate_fn=collate_fn)
collate_fn=SpeechCollator(is_training=False))
self.logger.info("Setup test Dataloader!")

@@ -31,32 +31,6 @@ logger = logging.getLogger(__name__)
__all__ = ['DeepSpeech2', 'DeepSpeech2Loss']
def ctc_loss(logits,
labels,
input_lengths,
label_lengths,
blank=0,
reduction='mean',
norm_by_times=False):
#logger.info("my ctc loss with norm by times")
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
loss_out = paddle.fluid.layers.warpctc(
logits, labels, blank, norm_by_times, input_lengths, label_lengths)
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ")
assert reduction in ['mean', 'sum', 'none']
if reduction == 'mean':
loss_out = paddle.mean(loss_out / label_lengths)
elif reduction == 'sum':
loss_out = paddle.sum(loss_out)
logger.info(f"ctc loss: {loss_out}")
return loss_out
#F.ctc_loss = ctc_loss
def brelu(x, t_min=0.0, t_max=24.0, name=None):
t_min = paddle.to_tensor(t_min)
t_max = paddle.to_tensor(t_max)
@@ -161,7 +135,7 @@ class ConvStack(nn.Layer):
self.conv_in = ConvBn(
num_channels_in=1,
num_channels_out=32,
kernel_size=(41, 11), #[D, T]
kernel_size=(41, 11), #[D, T]
stride=(2, 3),
padding=(20, 5),
act='brelu')
@@ -330,7 +304,6 @@ class GRUCellShare(nn.RNNCellBase):
c = self._activation(x_c + r * h_c) # apply reset gate after mm
h = (pre_hidden - c) * z + c
# https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/fluid/layers/dynamic_gru_cn.html#dynamic-gru
#h = (1-z) * pre_hidden + z * c
return h, h
@@ -716,6 +689,32 @@ class DeepSpeech2(nn.Layer):
beam_beta, beam_size, cutoff_prob, cutoff_top_n, num_processes)
def ctc_loss(logits,
labels,
input_lengths,
label_lengths,
blank=0,
reduction='mean',
norm_by_times=True):
#logger.info("my ctc loss with norm by times")
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times,
input_lengths, label_lengths)
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ")
assert reduction in ['mean', 'sum', 'none']
if reduction == 'mean':
loss_out = paddle.mean(loss_out / label_lengths)
elif reduction == 'sum':
loss_out = paddle.sum(loss_out)
logger.info(f"ctc loss: {loss_out}")
return loss_out
F.ctc_loss = ctc_loss
class DeepSpeech2Loss(nn.Layer):
def __init__(self, vocab_size):
super().__init__()
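The `ctc_loss` override above (now installed via `F.ctc_loss = ctc_loss`, with `norm_by_times=True` by default) keeps a 'mean' reduction in which each utterance's raw warpctc loss is divided by its own label length before averaging, so long transcripts do not dominate the batch loss. A tiny numeric sketch of that reduction (the numbers are made up for illustration):

```python
import numpy as np

raw_loss = np.array([12.0, 3.0, 8.0])  # per-utterance CTC losses from warpctc
label_lengths = np.array([6, 2, 4])    # transcript lengths for the same utterances

mean_reduced = np.mean(raw_loss / label_lengths)  # (2.0 + 1.5 + 2.0) / 3 = 1.833...
sum_reduced = np.sum(raw_loss)                    # 23.0, used when reduction='sum'
```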
