Support paddle 2.x (#538)
* 2.x model
* model test pass
* fix data
* fix soundfile with flac support
* one-thread dataloader test pass
* export feature size; add trainer and utils; add setup model and dataloader; update travis to use the Bionic dist
* add venv; test under venv
* fix unittest; train and valid
* add train and config
* add config and train script
* fix ctc cuda memcopy error
* fix imports
* fix train/valid log
* fix dataset batch shuffle: shift starts from 1; fix rank_zero_only decorator error; close tensorboard when training is over; add decoding config and code
* test process can run
* test with decoding
* test and infer with decoding
* fix infer
* fix ctc loss, lr schedule, sortagrad, logger
* aishell egs
* refactor train; add aishell egs
* fix dataset batch shuffle; add batch sampler log; print model parameters
* fix model and ctc
* sequence_mask made all inputs zeros, which caused grads to be zero (a bug of LessThanOp); add grad clip by global norm; add model train/test notebook
* ctc loss: remove run prefix; use ord value as text id
* use unk when training; compute_loss needs text ids; ord ids are used in test mode, which computes wer/cer
* fix tester
* add lr_decay; refactor code
* fix tools
* fix ci; add tune; fix gru model bugs; add dataset and model test
* fix decoding
* refactor repo; fix decoding
* fix musan and rir dataset
* refactor io, loss, conv, rnn, gradclip, model, utils
* fix ci and import
* refactor model; add export jit model
* add deploy bin and test it
* rm useless egs
* add layer tools
* refactor socket server; new model from pretrain
* remove useless
* fix instability loss and grad nan or inf for librispeech training
* fix sampler
* fix libri train.sh
* fix doc
* add license on cpp
* fix doc
* fix libri script
* fix install
* clip 5: wer 7.39; clip 400: wer 7.54; 1.8 clip-400 baseline: 7.49

pull/544/head
parent 054d795dc0
commit d7e753546a
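The notes above mention replacing the buggy sequence_mask workaround with gradient clipping by global norm, and compare clip norms 5 and 400 on LibriSpeech (wer 7.39 vs. 7.54). As a rough sketch of how such clipping is usually attached to an optimizer in Paddle 2.x (the model, learning rate, and layer sizes below are placeholders, not the trainer code from this PR):

import paddle

# Hedged sketch: stand-in model and hyperparameters, not the code added by this commit.
model = paddle.nn.Linear(161, 29)
# "clip 5" in the commit message corresponds to a global-norm threshold of 5.0.
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=5.0)
optimizer = paddle.optimizer.Adam(
    learning_rate=1e-3,
    parameters=model.parameters(),
    grad_clip=clip)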
@@ -1,2 +1,7 @@
.DS_Store
*.pyc
tools/venv
.vscode
*.log
*.pdmodel
*.pdiparams*
@@ -0,0 +1,389 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "emerging-meter",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/workspace/DeepSpeech-2.x/tools/venv/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
|
||||
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
|
||||
" def convert_to_list(value, n, name, dtype=np.int):\n",
|
||||
"/workspace/DeepSpeech-2.x/tools/venv/lib/python3.7/site-packages/scipy/fftpack/__init__.py:103: DeprecationWarning: The module numpy.dual is deprecated. Instead of using dual, use the functions directly from numpy or scipy.\n",
|
||||
" from numpy.dual import register_func\n",
|
||||
"/workspace/DeepSpeech-2.x/tools/venv/lib/python3.7/site-packages/scipy/special/orthogonal.py:81: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
|
||||
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
|
||||
" from numpy import (exp, inf, pi, sqrt, floor, sin, cos, around, int,\n",
|
||||
"/workspace/DeepSpeech-2.x/tools/venv/lib/python3.7/site-packages/numba/core/types/__init__.py:108: DeprecationWarning: `np.long` is a deprecated alias for `np.compat.long`. To silence this warning, use `np.compat.long` by itself. In the likely event your code does not need to work on Python 2 you can use the builtin `int` for which `np.compat.long` is itself an alias. Doing this will not modify any behaviour and is safe. When replacing `np.long`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
|
||||
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
|
||||
" long_ = _make_signed(np.long)\n",
|
||||
"/workspace/DeepSpeech-2.x/tools/venv/lib/python3.7/site-packages/numba/core/types/__init__.py:109: DeprecationWarning: `np.long` is a deprecated alias for `np.compat.long`. To silence this warning, use `np.compat.long` by itself. In the likely event your code does not need to work on Python 2 you can use the builtin `int` for which `np.compat.long` is itself an alias. Doing this will not modify any behaviour and is safe. When replacing `np.long`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.\n",
|
||||
"Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations\n",
|
||||
" ulong = _make_unsigned(np.long)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import math\n",
|
||||
"import random\n",
|
||||
"import tarfile\n",
|
||||
"import logging\n",
|
||||
"import numpy as np\n",
|
||||
"from collections import namedtuple\n",
|
||||
"from functools import partial\n",
|
||||
"\n",
|
||||
"import paddle\n",
|
||||
"from paddle.io import Dataset\n",
|
||||
"from paddle.io import DataLoader\n",
|
||||
"from paddle.io import BatchSampler\n",
|
||||
"from paddle.io import DistributedBatchSampler\n",
|
||||
"from paddle import distributed as dist\n",
|
||||
"\n",
|
||||
"from data_utils.utility import read_manifest\n",
|
||||
"from data_utils.augmentor.augmentation import AugmentationPipeline\n",
|
||||
"from data_utils.featurizer.speech_featurizer import SpeechFeaturizer\n",
|
||||
"from data_utils.speech import SpeechSegment\n",
|
||||
"from data_utils.normalizer import FeatureNormalizer\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"from data_utils.dataset import (\n",
|
||||
" DeepSpeech2Dataset,\n",
|
||||
" DeepSpeech2DistributedBatchSampler,\n",
|
||||
" DeepSpeech2BatchSampler,\n",
|
||||
" SpeechCollator,\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"id": "excessive-american",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def create_dataloader(manifest_path,\t\n",
|
||||
" vocab_filepath,\t\n",
|
||||
" mean_std_filepath,\t\n",
|
||||
" augmentation_config='{}',\t\n",
|
||||
" max_duration=float('inf'),\t\n",
|
||||
" min_duration=0.0,\t\n",
|
||||
" stride_ms=10.0,\t\n",
|
||||
" window_ms=20.0,\t\n",
|
||||
" max_freq=None,\t\n",
|
||||
" specgram_type='linear',\t\n",
|
||||
" use_dB_normalization=True,\t\n",
|
||||
" random_seed=0,\t\n",
|
||||
" keep_transcription_text=False,\t\n",
|
||||
" is_training=False,\t\n",
|
||||
" batch_size=1,\t\n",
|
||||
" num_workers=0,\t\n",
|
||||
" sortagrad=False,\t\n",
|
||||
" shuffle_method=None,\t\n",
|
||||
" dist=False):\t\n",
|
||||
"\n",
|
||||
" dataset = DeepSpeech2Dataset(\t\n",
|
||||
" manifest_path,\t\n",
|
||||
" vocab_filepath,\t\n",
|
||||
" mean_std_filepath,\t\n",
|
||||
" augmentation_config=augmentation_config,\t\n",
|
||||
" max_duration=max_duration,\t\n",
|
||||
" min_duration=min_duration,\t\n",
|
||||
" stride_ms=stride_ms,\t\n",
|
||||
" window_ms=window_ms,\t\n",
|
||||
" max_freq=max_freq,\t\n",
|
||||
" specgram_type=specgram_type,\t\n",
|
||||
" use_dB_normalization=use_dB_normalization,\t\n",
|
||||
" random_seed=random_seed,\t\n",
|
||||
" keep_transcription_text=keep_transcription_text)\t\n",
|
||||
"\n",
|
||||
" if dist:\t\n",
|
||||
" batch_sampler = DeepSpeech2DistributedBatchSampler(\t\n",
|
||||
" dataset,\t\n",
|
||||
" batch_size,\t\n",
|
||||
" num_replicas=None,\t\n",
|
||||
" rank=None,\t\n",
|
||||
" shuffle=is_training,\t\n",
|
||||
" drop_last=is_training,\t\n",
|
||||
" sortagrad=is_training,\t\n",
|
||||
" shuffle_method=shuffle_method)\t\n",
|
||||
" else:\t\n",
|
||||
" batch_sampler = DeepSpeech2BatchSampler(\t\n",
|
||||
" dataset,\t\n",
|
||||
" shuffle=is_training,\t\n",
|
||||
" batch_size=batch_size,\t\n",
|
||||
" drop_last=is_training,\t\n",
|
||||
" sortagrad=is_training,\t\n",
|
||||
" shuffle_method=shuffle_method)\t\n",
|
||||
"\n",
|
||||
" def padding_batch(batch, padding_to=-1, flatten=False, is_training=True):\t\n",
|
||||
" \"\"\"\t\n",
|
||||
" Padding audio features with zeros to make them have the same shape (or\t\n",
|
||||
" a user-defined shape) within one batch.\t\n",
|
||||
"\n",
|
||||
" If ``padding_to`` is -1, the maximum shape in the batch will be used\t\n",
|
||||
" as the target shape for padding. Otherwise, `padding_to` will be the\t\n",
|
||||
" target shape (only refers to the second axis).\t\n",
|
||||
"\n",
|
||||
" If `flatten` is True, features will be flattened to a 1-D array.\t\n",
|
||||
" \"\"\"\t\n",
|
||||
" new_batch = []\t\n",
|
||||
" # get target shape\t\n",
|
||||
" max_length = max([audio.shape[1] for audio, text in batch])\t\n",
|
||||
" if padding_to != -1:\t\n",
|
||||
" if padding_to < max_length:\t\n",
|
||||
" raise ValueError(\"If padding_to is not -1, it should be larger \"\t\n",
|
||||
" \"than any instance's shape in the batch\")\t\n",
|
||||
" max_length = padding_to\t\n",
|
||||
" max_text_length = max([len(text) for audio, text in batch])\t\n",
|
||||
" # padding\t\n",
|
||||
" padded_audios = []\t\n",
|
||||
" audio_lens = []\t\n",
|
||||
" texts, text_lens = [], []\t\n",
|
||||
" for audio, text in batch:\t\n",
|
||||
" padded_audio = np.zeros([audio.shape[0], max_length])\t\n",
|
||||
" padded_audio[:, :audio.shape[1]] = audio\t\n",
|
||||
" if flatten:\t\n",
|
||||
" padded_audio = padded_audio.flatten()\t\n",
|
||||
" padded_audios.append(padded_audio)\t\n",
|
||||
" audio_lens.append(audio.shape[1])\t\n",
|
||||
"\n",
|
||||
" padded_text = np.zeros([max_text_length])\n",
|
||||
" if is_training:\n",
|
||||
" padded_text[:len(text)] = text\t# ids\n",
|
||||
" else:\n",
|
||||
" padded_text[:len(text)] = [ord(t) for t in text] # string\n",
|
||||
" \n",
|
||||
" texts.append(padded_text)\t\n",
|
||||
" text_lens.append(len(text))\t\n",
|
||||
"\n",
|
||||
" padded_audios = np.array(padded_audios).astype('float32')\t\n",
|
||||
" audio_lens = np.array(audio_lens).astype('int64')\t\n",
|
||||
" texts = np.array(texts).astype('int32')\t\n",
|
||||
" text_lens = np.array(text_lens).astype('int64')\t\n",
|
||||
" return padded_audios, texts, audio_lens, text_lens\t\n",
|
||||
"\n",
|
||||
" loader = DataLoader(\t\n",
|
||||
" dataset,\t\n",
|
||||
" batch_sampler=batch_sampler,\t\n",
|
||||
" collate_fn=partial(padding_batch, is_training=is_training),\t\n",
|
||||
" num_workers=num_workers)\t\n",
|
||||
" return loader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"id": "naval-brave",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"{'num_samples': 5, 'beam_size': 500, 'num_proc_bsearch': 8, 'num_conv_layers': 2, 'num_rnn_layers': 3, 'rnn_layer_size': 2048, 'alpha': 2.5, 'beta': 0.3, 'cutoff_prob': 1.0, 'cutoff_top_n': 40, 'use_gru': False, 'use_gpu': True, 'share_rnn_weights': True, 'infer_manifest': 'examples/aishell/data/manifest.dev', 'mean_std_path': 'examples/aishell/data/mean_std.npz', 'vocab_path': 'examples/aishell/data/vocab.txt', 'lang_model_path': 'models/lm/common_crawl_00.prune01111.trie.klm', 'model_path': 'examples/aishell/checkpoints/step_final', 'decoding_method': 'ctc_beam_search', 'error_rate_type': 'wer', 'specgram_type': 'linear'}\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import sys\n",
|
||||
"import argparse\n",
|
||||
"import functools\n",
|
||||
"from utils.utility import add_arguments, print_arguments\n",
|
||||
"parser = argparse.ArgumentParser(description=__doc__)\n",
|
||||
"add_arg = functools.partial(add_arguments, argparser=parser)\n",
|
||||
"# yapf: disable\n",
|
||||
"add_arg('num_samples', int, 5, \"# of samples to infer.\")\n",
|
||||
"add_arg('beam_size', int, 500, \"Beam search width.\")\n",
|
||||
"add_arg('num_proc_bsearch', int, 8, \"# of CPUs for beam search.\")\n",
|
||||
"add_arg('num_conv_layers', int, 2, \"# of convolution layers.\")\n",
|
||||
"add_arg('num_rnn_layers', int, 3, \"# of recurrent layers.\")\n",
|
||||
"add_arg('rnn_layer_size', int, 2048, \"# of recurrent cells per layer.\")\n",
|
||||
"add_arg('alpha', float, 2.5, \"Coef of LM for beam search.\")\n",
|
||||
"add_arg('beta', float, 0.3, \"Coef of WC for beam search.\")\n",
|
||||
"add_arg('cutoff_prob', float, 1.0, \"Cutoff probability for pruning.\")\n",
|
||||
"add_arg('cutoff_top_n', int, 40, \"Cutoff number for pruning.\")\n",
|
||||
"add_arg('use_gru', bool, False, \"Use GRUs instead of simple RNNs.\")\n",
|
||||
"add_arg('use_gpu', bool, True, \"Use GPU or not.\")\n",
|
||||
"add_arg('share_rnn_weights',bool, True, \"Share input-hidden weights across \"\n",
|
||||
" \"bi-directional RNNs. Not for GRU.\")\n",
|
||||
"add_arg('infer_manifest', str,\n",
|
||||
" 'examples/aishell/data/manifest.dev',\n",
|
||||
" \"Filepath of manifest to infer.\")\n",
|
||||
"add_arg('mean_std_path', str,\n",
|
||||
" 'examples/aishell/data/mean_std.npz',\n",
|
||||
" \"Filepath of normalizer's mean & std.\")\n",
|
||||
"add_arg('vocab_path', str,\n",
|
||||
" 'examples/aishell/data/vocab.txt',\n",
|
||||
" \"Filepath of vocabulary.\")\n",
|
||||
"add_arg('lang_model_path', str,\n",
|
||||
" 'models/lm/common_crawl_00.prune01111.trie.klm',\n",
|
||||
" \"Filepath for language model.\")\n",
|
||||
"add_arg('model_path', str,\n",
|
||||
" 'examples/aishell/checkpoints/step_final',\n",
|
||||
" \"If None, the training starts from scratch, \"\n",
|
||||
" \"otherwise, it resumes from the pre-trained model.\")\n",
|
||||
"add_arg('decoding_method', str,\n",
|
||||
" 'ctc_beam_search',\n",
|
||||
" \"Decoding method. Options: ctc_beam_search, ctc_greedy\",\n",
|
||||
" choices = ['ctc_beam_search', 'ctc_greedy'])\n",
|
||||
"add_arg('error_rate_type', str,\n",
|
||||
" 'wer',\n",
|
||||
" \"Error rate type for evaluation.\",\n",
|
||||
" choices=['wer', 'cer'])\n",
|
||||
"add_arg('specgram_type', str,\n",
|
||||
" 'linear',\n",
|
||||
" \"Audio feature type. Options: linear, mfcc.\",\n",
|
||||
" choices=['linear', 'mfcc'])\n",
|
||||
"# yapf: enable\n",
|
||||
"args = parser.parse_args([])\n",
|
||||
"print(vars(args))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"id": "bearing-physics",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"batch_reader = create_dataloader(\n",
|
||||
" manifest_path=args.infer_manifest,\n",
|
||||
" vocab_filepath=args.vocab_path,\n",
|
||||
" mean_std_filepath=args.mean_std_path,\n",
|
||||
" augmentation_config='{}',\n",
|
||||
" #max_duration=float('inf'),\n",
|
||||
" max_duration=27.0,\n",
|
||||
" min_duration=0.0,\n",
|
||||
" stride_ms=10.0,\n",
|
||||
" window_ms=20.0,\n",
|
||||
" max_freq=None,\n",
|
||||
" specgram_type=args.specgram_type,\n",
|
||||
" use_dB_normalization=True,\n",
|
||||
" random_seed=0,\n",
|
||||
" keep_transcription_text=True,\n",
|
||||
" is_training=False,\n",
|
||||
" batch_size=args.num_samples,\n",
|
||||
" sortagrad=True,\n",
|
||||
" shuffle_method=None,\n",
|
||||
" dist=False)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 30,
|
||||
"id": "classified-melissa",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"test Tensor(shape=[5, 6], dtype=int32, place=CUDAPinnedPlace, stop_gradient=True,\n",
|
||||
" [[22823, 26102, 20195, 37324, 0 , 0 ],\n",
|
||||
" [22238, 26469, 23601, 22909, 0 , 0 ],\n",
|
||||
" [20108, 26376, 22235, 26085, 0 , 0 ],\n",
|
||||
" [36824, 35201, 20445, 25345, 32654, 24863],\n",
|
||||
" [29042, 27748, 21463, 23456, 0 , 0 ]])\n",
|
||||
"test raw 大时代里\n",
|
||||
"test raw 煲汤受宠\n",
|
||||
"audio len Tensor(shape=[5], dtype=int64, place=CUDAPinnedPlace, stop_gradient=True,\n",
|
||||
" [163, 167, 180, 186, 186])\n",
|
||||
"test len Tensor(shape=[5], dtype=int64, place=CUDAPlace(0), stop_gradient=True,\n",
|
||||
" [4, 4, 4, 6, 4])\n",
|
||||
"audio Tensor(shape=[5, 161, 186], dtype=float32, place=CUDAPinnedPlace, stop_gradient=True,\n",
|
||||
" [[[ 1.11669052, 0.79015088, 0.93658292, ..., 0. , 0. , 0. ],\n",
|
||||
" [ 0.83549136, 0.72643483, 0.83578080, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.89155018, -0.18894747, -0.53357804, ..., 0. , 0. , 0. ],\n",
|
||||
" ...,\n",
|
||||
" [ 0.33386710, -0.81240511, 0.12869737, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.17537928, 0.58380985, 0.70696265, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.84175998, 1.22041416, 0.07929770, ..., 0. , 0. , 0. ]],\n",
|
||||
"\n",
|
||||
" [[-0.35964420, 0.77392709, 0.71409988, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.15990183, 0.42962283, 0.06222462, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.31166190, -0.74864638, -0.52836996, ..., 0. , 0. , 0. ],\n",
|
||||
" ...,\n",
|
||||
" [-0.27546275, 0.32889456, 0.12410031, ..., 0. , 0. , 0. ],\n",
|
||||
" [ 0.16264282, 0.49418071, -0.15960945, ..., 0. , 0. , 0. ],\n",
|
||||
" [ 0.12476666, 0.00516864, 1.16021466, ..., 0. , 0. , 0. ]],\n",
|
||||
"\n",
|
||||
" [[ 0.90202141, 1.48541915, 0.92062062, ..., 0. , 0. , 0. ],\n",
|
||||
" [ 0.82661545, 1.37171340, 0.86746097, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.62287915, -0.48645937, 0.35041964, ..., 0. , 0. , 0. ],\n",
|
||||
" ...,\n",
|
||||
" [ 0.07376949, 0.07138316, 0.76355994, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.32306790, 0.43247896, 1.27311838, ..., 0. , 0. , 0. ],\n",
|
||||
" [-0.97667056, 0.60747612, 0.79181534, ..., 0. , 0. , 0. ]],\n",
|
||||
"\n",
|
||||
" [[ 0.72022128, 0.95428467, 0.92766261, ..., 0.29105374, -0.45564806, -0.62151009],\n",
|
||||
" [ 0.42083180, 0.49279949, 0.82724041, ..., -0.17333922, -1.45363355, -0.61673522],\n",
|
||||
" [-0.76116520, -0.84750438, -0.09512503, ..., -1.01497340, -1.42781055, -0.80859023],\n",
|
||||
" ...,\n",
|
||||
" [-0.23009977, 1.06155431, 1.09065628, ..., 0.25581080, 0.53794998, -1.22650719],\n",
|
||||
" [-1.37693381, 0.30778193, 0.17152318, ..., 0.51650339, 0.25580606, 0.83097816],\n",
|
||||
" [-1.62180591, 1.30567718, 1.09928656, ..., -0.77590007, 1.27712476, 0.53189957]],\n",
|
||||
"\n",
|
||||
" [[ 1.03205252, -0.51535392, 0.21077573, ..., 0.76618457, 1.27425683, 1.52250278],\n",
|
||||
" [ 0.82059991, 0.43990925, 0.13090958, ..., 0.86662549, 1.01687658, 1.48495352],\n",
|
||||
" [-0.75489789, -0.01997089, -0.65174174, ..., 0.09061214, -0.55211234, -0.01614586],\n",
|
||||
" ...,\n",
|
||||
" [ 0.50985396, 1.84555030, 0.79185146, ..., 1.13666189, 1.19898069, 1.98158395],\n",
|
||||
" [ 1.98721015, 2.52385354, 1.11714780, ..., 0.19416514, 1.11329341, 0.64460152],\n",
|
||||
" [ 2.69512844, 1.90993905, 0.50245082, ..., -0.50902629, 0.03333465, -1.24584770]]])\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for idx, (audio, text, audio_len, text_len) in enumerate(batch_reader()):\n",
|
||||
" print('test', text)\n",
|
||||
" print(\"test raw\", ''.join( chr(i) for i in text[0][:int(text_len[0])] ))\n",
|
||||
" print(\"test raw\", ''.join( chr(i) for i in text[-1][:int(text_len[-1])] ))\n",
|
||||
" print('audio len', audio_len)\n",
|
||||
" print('test len', text_len)\n",
|
||||
" print('audio', audio)\n",
|
||||
" break"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "unexpected-skating",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "minus-modern",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.0"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
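The notebook builds its dataloader with keep_transcription_text=True, so in test/infer mode the collate function stores each transcript character's Unicode code point via ord(), and the final cell recovers the raw text with chr(). Below is a minimal sketch of that round trip; the sample string is taken from the notebook's printed output, the rest is illustrative:

# ord()/chr() round trip used when keep_transcription_text=True (illustrative sketch).
text = "大时代里"                           # first transcript printed in the notebook output
ids = [ord(ch) for ch in text]              # [22823, 26102, 20195, 37324], matching the first tensor row
recovered = ''.join(chr(i) for i in ids)    # what the notebook's last cell does
assert recovered == text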
File diff suppressed because it is too large
@@ -1,381 +0,0 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Contains data generator for orgnaizing various audio data preprocessing
|
||||
pipeline and offering data reader interface of PaddlePaddle requirements.
|
||||
"""
|
||||
|
||||
import random
|
||||
import tarfile
|
||||
import multiprocessing
|
||||
import numpy as np
|
||||
import paddle.fluid as fluid
|
||||
from threading import local
|
||||
from data_utils.utility import read_manifest
|
||||
from data_utils.augmentor.augmentation import AugmentationPipeline
|
||||
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
|
||||
from data_utils.speech import SpeechSegment
|
||||
from data_utils.normalizer import FeatureNormalizer
|
||||
|
||||
|
||||
class DataGenerator(object):
|
||||
"""
|
||||
DataGenerator provides basic audio data preprocessing pipeline, and offers
|
||||
data reader interfaces of PaddlePaddle requirements.
|
||||
|
||||
:param vocab_filepath: Vocabulary filepath for indexing tokenized
|
||||
transcripts.
|
||||
:type vocab_filepath: str
|
||||
:param mean_std_filepath: File containing the pre-computed mean and stddev.
|
||||
:type mean_std_filepath: None|str
|
||||
:param augmentation_config: Augmentation configuration in json string.
|
||||
Details see AugmentationPipeline.__doc__.
|
||||
:type augmentation_config: str
|
||||
:param max_duration: Audio with duration (in seconds) greater than
|
||||
this will be discarded.
|
||||
:type max_duration: float
|
||||
:param min_duration: Audio with duration (in seconds) smaller than
|
||||
this will be discarded.
|
||||
:type min_duration: float
|
||||
:param stride_ms: Striding size (in milliseconds) for generating frames.
|
||||
:type stride_ms: float
|
||||
:param window_ms: Window size (in milliseconds) for generating frames.
|
||||
:type window_ms: float
|
||||
:param max_freq: Used when specgram_type is 'linear', only FFT bins
|
||||
corresponding to frequencies between [0, max_freq] are
|
||||
returned.
|
||||
:types max_freq: None|float
|
||||
:param specgram_type: Specgram feature type. Options: 'linear'.
|
||||
:type specgram_type: str
|
||||
:param use_dB_normalization: Whether to normalize the audio to -20 dB
|
||||
before extracting the features.
|
||||
:type use_dB_normalization: bool
|
||||
:param random_seed: Random seed.
|
||||
:type random_seed: int
|
||||
:param keep_transcription_text: If set to True, transcription text will
|
||||
be passed forward directly without
|
||||
converting to index sequence.
|
||||
:type keep_transcription_text: bool
|
||||
:param place: The place to run the program.
|
||||
:type place: CPUPlace or CUDAPlace
|
||||
:param is_training: If set to True, generate text data for training,
|
||||
otherwise, generate text data for infer.
|
||||
:type is_training: bool
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
vocab_filepath,
|
||||
mean_std_filepath,
|
||||
augmentation_config='{}',
|
||||
max_duration=float('inf'),
|
||||
min_duration=0.0,
|
||||
stride_ms=10.0,
|
||||
window_ms=20.0,
|
||||
max_freq=None,
|
||||
specgram_type='linear',
|
||||
use_dB_normalization=True,
|
||||
random_seed=0,
|
||||
keep_transcription_text=False,
|
||||
place=fluid.CPUPlace(),
|
||||
is_training=True):
|
||||
self._max_duration = max_duration
|
||||
self._min_duration = min_duration
|
||||
self._normalizer = FeatureNormalizer(mean_std_filepath)
|
||||
self._augmentation_pipeline = AugmentationPipeline(
|
||||
augmentation_config=augmentation_config, random_seed=random_seed)
|
||||
self._speech_featurizer = SpeechFeaturizer(
|
||||
vocab_filepath=vocab_filepath,
|
||||
specgram_type=specgram_type,
|
||||
stride_ms=stride_ms,
|
||||
window_ms=window_ms,
|
||||
max_freq=max_freq,
|
||||
use_dB_normalization=use_dB_normalization)
|
||||
self._rng = random.Random(random_seed)
|
||||
self._keep_transcription_text = keep_transcription_text
|
||||
self._epoch = 0
|
||||
self._is_training = is_training
|
||||
# for caching tar files info
|
||||
self._local_data = local()
|
||||
self._local_data.tar2info = {}
|
||||
self._local_data.tar2object = {}
|
||||
self._place = place
|
||||
|
||||
def process_utterance(self, audio_file, transcript):
|
||||
"""Load, augment, featurize and normalize for speech data.
|
||||
|
||||
:param audio_file: Filepath or file object of audio file.
|
||||
:type audio_file: str | file
|
||||
:param transcript: Transcription text.
|
||||
:type transcript: str
|
||||
:return: Tuple of audio feature tensor and data of transcription part,
|
||||
where transcription part could be token ids or text.
|
||||
:rtype: tuple of (2darray, list)
|
||||
"""
|
||||
if isinstance(audio_file, str) and audio_file.startswith('tar:'):
|
||||
speech_segment = SpeechSegment.from_file(
|
||||
self._subfile_from_tar(audio_file), transcript)
|
||||
else:
|
||||
speech_segment = SpeechSegment.from_file(audio_file, transcript)
|
||||
self._augmentation_pipeline.transform_audio(speech_segment)
|
||||
specgram, transcript_part = self._speech_featurizer.featurize(
|
||||
speech_segment, self._keep_transcription_text)
|
||||
specgram = self._normalizer.apply(specgram)
|
||||
return specgram, transcript_part
|
||||
|
||||
def batch_reader_creator(self,
|
||||
manifest_path,
|
||||
batch_size,
|
||||
padding_to=-1,
|
||||
flatten=False,
|
||||
sortagrad=False,
|
||||
shuffle_method="batch_shuffle"):
|
||||
"""
|
||||
Batch data reader creator for audio data. Return a callable generator
|
||||
function to produce batches of data.
|
||||
|
||||
Audio features within one batch will be padded with zeros to have the
|
||||
same shape, or a user-defined shape.
|
||||
|
||||
:param manifest_path: Filepath of manifest for audio files.
|
||||
:type manifest_path: str
|
||||
:param batch_size: Number of instances in a batch.
|
||||
:type batch_size: int
|
||||
:param padding_to: If set -1, the maximun shape in the batch
|
||||
will be used as the target shape for padding.
|
||||
Otherwise, `padding_to` will be the target shape.
|
||||
:type padding_to: int
|
||||
:param flatten: If set True, audio features will be flatten to 1darray.
|
||||
:type flatten: bool
|
||||
:param sortagrad: If set True, sort the instances by audio duration
|
||||
in the first epoch for speed up training.
|
||||
:type sortagrad: bool
|
||||
:param shuffle_method: Shuffle method. Options:
|
||||
'' or None: no shuffle.
|
||||
'instance_shuffle': instance-wise shuffle.
|
||||
'batch_shuffle': similarly-sized instances are
|
||||
put into batches, and then
|
||||
batch-wise shuffle the batches.
|
||||
For more details, please see
|
||||
``_batch_shuffle.__doc__``.
|
||||
'batch_shuffle_clipped': 'batch_shuffle' with
|
||||
head shift and tail
|
||||
clipping. For more
|
||||
details, please see
|
||||
``_batch_shuffle``.
|
||||
If sortagrad is True, shuffle is disabled
|
||||
for the first epoch.
|
||||
:type shuffle_method: None|str
|
||||
:return: Batch reader function, producing batches of data when called.
|
||||
:rtype: callable
|
||||
"""
|
||||
|
||||
def batch_reader():
|
||||
# read manifest
|
||||
manifest = read_manifest(
|
||||
manifest_path=manifest_path,
|
||||
max_duration=self._max_duration,
|
||||
min_duration=self._min_duration)
|
||||
# sort (by duration) or batch-wise shuffle the manifest
|
||||
if self._epoch == 0 and sortagrad:
|
||||
manifest.sort(key=lambda x: x["duration"])
|
||||
|
||||
else:
|
||||
if shuffle_method == "batch_shuffle":
|
||||
manifest = self._batch_shuffle(
|
||||
manifest, batch_size, clipped=False)
|
||||
elif shuffle_method == "batch_shuffle_clipped":
|
||||
manifest = self._batch_shuffle(
|
||||
manifest, batch_size, clipped=True)
|
||||
elif shuffle_method == "instance_shuffle":
|
||||
self._rng.shuffle(manifest)
|
||||
elif shuffle_method == None:
|
||||
pass
|
||||
else:
|
||||
raise ValueError("Unknown shuffle method %s." %
|
||||
shuffle_method)
|
||||
# prepare batches
|
||||
batch = []
|
||||
instance_reader = self._instance_reader_creator(manifest)
|
||||
|
||||
for instance in instance_reader():
|
||||
batch.append(instance)
|
||||
if len(batch) == batch_size:
|
||||
yield self._padding_batch(batch, padding_to, flatten)
|
||||
batch = []
|
||||
if len(batch) >= 1:
|
||||
yield self._padding_batch(batch, padding_to, flatten)
|
||||
self._epoch += 1
|
||||
|
||||
return batch_reader
|
||||
|
||||
@property
|
||||
def feeding(self):
|
||||
"""Returns data reader's feeding dict.
|
||||
|
||||
:return: Data feeding dict.
|
||||
:rtype: dict
|
||||
"""
|
||||
feeding_dict = {"audio_spectrogram": 0, "transcript_text": 1}
|
||||
return feeding_dict
|
||||
|
||||
@property
|
||||
def vocab_size(self):
|
||||
"""Return the vocabulary size.
|
||||
|
||||
:return: Vocabulary size.
|
||||
:rtype: int
|
||||
"""
|
||||
return self._speech_featurizer.vocab_size
|
||||
|
||||
@property
|
||||
def vocab_list(self):
|
||||
"""Return the vocabulary in list.
|
||||
|
||||
:return: Vocabulary in list.
|
||||
:rtype: list
|
||||
"""
|
||||
return self._speech_featurizer.vocab_list
|
||||
|
||||
def _parse_tar(self, file):
|
||||
"""Parse a tar file to get a tarfile object
|
||||
and a map containing tarinfoes
|
||||
"""
|
||||
result = {}
|
||||
f = tarfile.open(file)
|
||||
for tarinfo in f.getmembers():
|
||||
result[tarinfo.name] = tarinfo
|
||||
return f, result
|
||||
|
||||
def _subfile_from_tar(self, file):
|
||||
"""Get subfile object from tar.
|
||||
|
||||
It will return a subfile object from tar file
|
||||
and cached tar file info for next reading request.
|
||||
"""
|
||||
tarpath, filename = file.split(':', 1)[1].split('#', 1)
|
||||
if 'tar2info' not in self._local_data.__dict__:
|
||||
self._local_data.tar2info = {}
|
||||
if 'tar2object' not in self._local_data.__dict__:
|
||||
self._local_data.tar2object = {}
|
||||
if tarpath not in self._local_data.tar2info:
|
||||
object, infoes = self._parse_tar(tarpath)
|
||||
self._local_data.tar2info[tarpath] = infoes
|
||||
self._local_data.tar2object[tarpath] = object
|
||||
return self._local_data.tar2object[tarpath].extractfile(
|
||||
self._local_data.tar2info[tarpath][filename])
|
||||
|
||||
def _instance_reader_creator(self, manifest):
|
||||
"""
|
||||
Instance reader creator. Create a callable function to produce
|
||||
instances of data.
|
||||
|
||||
Instance: a tuple of ndarray of audio spectrogram and a list of
|
||||
token indices for transcript.
|
||||
"""
|
||||
|
||||
def reader():
|
||||
for instance in manifest:
|
||||
inst = self.process_utterance(instance["audio_filepath"],
|
||||
instance["text"])
|
||||
yield inst
|
||||
|
||||
return reader
|
||||
|
||||
def _padding_batch(self, batch, padding_to=-1, flatten=False):
|
||||
"""
|
||||
Padding audio features with zeros to make them have the same shape (or
|
||||
a user-defined shape) within one bach.
|
||||
|
||||
If ``padding_to`` is -1, the maximun shape in the batch will be used
|
||||
as the target shape for padding. Otherwise, `padding_to` will be the
|
||||
target shape (only refers to the second axis).
|
||||
|
||||
If `flatten` is True, features will be flatten to 1darray.
|
||||
"""
|
||||
new_batch = []
|
||||
# get target shape
|
||||
max_length = max([audio.shape[1] for audio, text in batch])
|
||||
if padding_to != -1:
|
||||
if padding_to < max_length:
|
||||
raise ValueError("If padding_to is not -1, it should be larger "
|
||||
"than any instance's shape in the batch")
|
||||
max_length = padding_to
|
||||
# padding
|
||||
padded_audios = []
|
||||
texts, text_lens = [], []
|
||||
audio_lens = []
|
||||
masks = []
|
||||
for audio, text in batch:
|
||||
padded_audio = np.zeros([audio.shape[0], max_length])
|
||||
padded_audio[:, :audio.shape[1]] = audio
|
||||
if flatten:
|
||||
padded_audio = padded_audio.flatten()
|
||||
padded_audios.append(padded_audio)
|
||||
if self._is_training:
|
||||
texts += text
|
||||
else:
|
||||
texts.append(text)
|
||||
text_lens.append(len(text))
|
||||
audio_lens.append(audio.shape[1])
|
||||
mask_shape0 = (audio.shape[0] - 1) // 2 + 1
|
||||
mask_shape1 = (audio.shape[1] - 1) // 3 + 1
|
||||
mask_max_len = (max_length - 1) // 3 + 1
|
||||
mask_ones = np.ones((mask_shape0, mask_shape1))
|
||||
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
|
||||
mask = np.repeat(
|
||||
np.reshape(
|
||||
np.concatenate((mask_ones, mask_zeros), axis=1),
|
||||
(1, mask_shape0, mask_max_len)),
|
||||
32,
|
||||
axis=0)
|
||||
masks.append(mask)
|
||||
padded_audios = np.array(padded_audios).astype('float32')
|
||||
if self._is_training:
|
||||
texts = np.expand_dims(np.array(texts).astype('int32'), axis=-1)
|
||||
texts = fluid.create_lod_tensor(
|
||||
texts, recursive_seq_lens=[text_lens], place=self._place)
|
||||
audio_lens = np.array(audio_lens).astype('int64').reshape([-1, 1])
|
||||
masks = np.array(masks).astype('float32')
|
||||
return padded_audios, texts, audio_lens, masks
|
||||
|
||||
def _batch_shuffle(self, manifest, batch_size, clipped=False):
|
||||
"""Put similarly-sized instances into minibatches for better efficiency
|
||||
and make a batch-wise shuffle.
|
||||
|
||||
1. Sort the audio clips by duration.
|
||||
2. Generate a random number `k`, k in [0, batch_size).
|
||||
3. Randomly shift `k` instances in order to create different batches
|
||||
for different epochs. Create minibatches.
|
||||
4. Shuffle the minibatches.
|
||||
|
||||
:param manifest: Manifest contents. List of dict.
|
||||
:type manifest: list
|
||||
:param batch_size: Batch size. This size is also used for generate
|
||||
a random number for batch shuffle.
|
||||
:type batch_size: int
|
||||
:param clipped: Whether to clip the heading (small shift) and trailing
|
||||
(incomplete batch) instances.
|
||||
:type clipped: bool
|
||||
:return: Batch shuffled mainifest.
|
||||
:rtype: list
|
||||
"""
|
||||
manifest.sort(key=lambda x: x["duration"])
|
||||
shift_len = self._rng.randint(0, batch_size - 1)
|
||||
batch_manifest = list(zip(*[iter(manifest[shift_len:])] * batch_size))
|
||||
self._rng.shuffle(batch_manifest)
|
||||
batch_manifest = [item for batch in batch_manifest for item in batch]
|
||||
if not clipped:
|
||||
res_len = len(manifest) - shift_len - len(batch_manifest)
|
||||
batch_manifest.extend(manifest[-res_len:])
|
||||
batch_manifest.extend(manifest[0:shift_len])
|
||||
return batch_manifest
|
@@ -1,20 +0,0 @@
|
||||
#ifndef CTC_GREEDY_DECODER_H
|
||||
#define CTC_GREEDY_DECODER_H
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
/* CTC Greedy (Best Path) Decoder
|
||||
*
|
||||
* Parameters:
|
||||
* probs_seq: 2-D vector that each element is a vector of probabilities
|
||||
* over vocabulary of one time step.
|
||||
* vocabulary: A vector of vocabulary.
|
||||
* Return:
|
||||
* The decoding result in string
|
||||
*/
|
||||
std::string ctc_greedy_decoder(
|
||||
const std::vector<std::vector<double>>& probs_seq,
|
||||
const std::vector<std::string>& vocabulary);
|
||||
|
||||
#endif // CTC_GREEDY_DECODER_H
|
@@ -0,0 +1,9 @@
ThreadPool/
build/
dist/
kenlm/
openfst-1.6.3/
openfst-1.6.3.tar.gz
swig_decoders.egg-info/
decoders_wrap.cxx
swig_decoders.py
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#include "ctc_beam_search_decoder.h"
|
||||
|
||||
#include <algorithm>
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#ifndef CTC_BEAM_SEARCH_DECODER_H_
|
||||
#define CTC_BEAM_SEARCH_DECODER_H_
|
||||
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#include "ctc_greedy_decoder.h"
|
||||
#include "decoder_utils.h"
|
||||
|
@@ -0,0 +1,34 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#ifndef CTC_GREEDY_DECODER_H
|
||||
#define CTC_GREEDY_DECODER_H
|
||||
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
/* CTC Greedy (Best Path) Decoder
|
||||
*
|
||||
* Parameters:
|
||||
* probs_seq: 2-D vector that each element is a vector of probabilities
|
||||
* over vocabulary of one time step.
|
||||
* vocabulary: A vector of vocabulary.
|
||||
* Return:
|
||||
* The decoding result in string
|
||||
*/
|
||||
std::string ctc_greedy_decoder(
|
||||
const std::vector<std::vector<double>>& probs_seq,
|
||||
const std::vector<std::string>& vocabulary);
|
||||
|
||||
#endif // CTC_GREEDY_DECODER_H
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#include "decoder_utils.h"
|
||||
|
||||
#include <algorithm>
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#ifndef DECODER_UTILS_H_
|
||||
#define DECODER_UTILS_H_
|
||||
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#include "path_trie.h"
|
||||
|
||||
#include <algorithm>
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#ifndef PATH_TRIE_H
|
||||
#define PATH_TRIE_H
|
||||
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#include "scorer.h"
|
||||
|
||||
#include <unistd.h>
|
@@ -1,3 +1,17 @@
|
||||
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
//
|
||||
// Licensed under the Apache License, Version 2.0 (the "License");
|
||||
// you may not use this file except in compliance with the License.
|
||||
// You may obtain a copy of the License at
|
||||
//
|
||||
// http://www.apache.org/licenses/LICENSE-2.0
|
||||
//
|
||||
// Unless required by applicable law or agreed to in writing, software
|
||||
// distributed under the License is distributed on an "AS IS" BASIS,
|
||||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
// See the License for the specific language governing permissions and
|
||||
// limitations under the License.
|
||||
|
||||
#ifndef SCORER_H_
|
||||
#define SCORER_H_
|
||||
|
@@ -0,0 +1,54 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Record wav from Microphone"""
|
||||
# http://people.csail.mit.edu/hubert/pyaudio/
|
||||
import pyaudio
|
||||
import wave
|
||||
|
||||
CHUNK = 1024
|
||||
FORMAT = pyaudio.paInt16
|
||||
CHANNELS = 1
|
||||
RATE = 16000
|
||||
RECORD_SECONDS = 5
|
||||
WAVE_OUTPUT_FILENAME = "output.wav"
|
||||
|
||||
p = pyaudio.PyAudio()
|
||||
|
||||
stream = p.open(
|
||||
format=FORMAT,
|
||||
channels=CHANNELS,
|
||||
rate=RATE,
|
||||
input=True,
|
||||
frames_per_buffer=CHUNK)
|
||||
|
||||
print("* recording")
|
||||
|
||||
frames = []
|
||||
|
||||
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
|
||||
data = stream.read(CHUNK)
|
||||
frames.append(data)
|
||||
|
||||
print("* done recording")
|
||||
|
||||
stream.stop_stream()
|
||||
stream.close()
|
||||
p.terminate()
|
||||
|
||||
wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
|
||||
wf.setnchannels(CHANNELS)
|
||||
wf.setsampwidth(p.get_sample_size(FORMAT))
|
||||
wf.setframerate(RATE)
|
||||
wf.writeframes(b''.join(frames))
|
||||
wf.close()
|
@@ -0,0 +1,207 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Server-end for the ASR demo."""
|
||||
import os
|
||||
import time
|
||||
import argparse
|
||||
import functools
|
||||
import paddle
|
||||
import numpy as np
|
||||
|
||||
from deepspeech.utils.socket_server import warm_up_test
|
||||
from deepspeech.utils.socket_server import AsrTCPServer
|
||||
from deepspeech.utils.socket_server import AsrRequestHandler
|
||||
|
||||
from deepspeech.training.cli import default_argument_parser
|
||||
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
|
||||
|
||||
from deepspeech.frontend.utility import read_manifest
|
||||
from deepspeech.utils.utility import add_arguments, print_arguments
|
||||
|
||||
from deepspeech.models.deepspeech2 import DeepSpeech2Model
|
||||
from deepspeech.io.dataset import ManifestDataset
|
||||
|
||||
from paddle.inference import Config
|
||||
from paddle.inference import create_predictor
|
||||
|
||||
|
||||
def init_predictor(args):
|
||||
if args.model_dir is not None:
|
||||
config = Config(args.model_dir)
|
||||
else:
|
||||
config = Config(args.model_file, args.params_file)
|
||||
|
||||
config.enable_memory_optim()
|
||||
if args.use_gpu:
|
||||
config.enable_use_gpu(memory_pool_init_size_mb=1000, device_id=0)
|
||||
else:
|
||||
# If mkldnn is not used, you can set the BLAS thread count instead.
|
||||
# The thread num should not be greater than the number of cores in the CPU.
|
||||
config.set_cpu_math_library_num_threads(4)
|
||||
config.enable_mkldnn()
|
||||
|
||||
predictor = create_predictor(config)
|
||||
return predictor
|
||||
|
||||
|
||||
def run(predictor, img):
|
||||
# copy img data to input tensor
|
||||
input_names = predictor.get_input_names()
|
||||
for i, name in enumerate(input_names):
|
||||
input_tensor = predictor.get_input_handle(name)
|
||||
#input_tensor.reshape(img[i].shape)
|
||||
#input_tensor.copy_from_cpu(img[i].copy())
|
||||
|
||||
# do the inference
|
||||
predictor.run()
|
||||
|
||||
results = []
|
||||
# get out data from output tensor
|
||||
output_names = predictor.get_output_names()
|
||||
for i, name in enumerate(output_names):
|
||||
output_tensor = predictor.get_output_handle(name)
|
||||
output_data = output_tensor.copy_to_cpu()
|
||||
results.append(output_data)
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def inference(config, args):
|
||||
predictor = init_predictor(args)
|
||||
|
||||
|
||||
def start_server(config, args):
|
||||
"""Start the ASR server"""
|
||||
dataset = ManifestDataset(
|
||||
config.data.test_manifest,
|
||||
config.data.vocab_filepath,
|
||||
config.data.mean_std_filepath,
|
||||
augmentation_config="{}",
|
||||
max_duration=config.data.max_duration,
|
||||
min_duration=config.data.min_duration,
|
||||
stride_ms=config.data.stride_ms,
|
||||
window_ms=config.data.window_ms,
|
||||
n_fft=config.data.n_fft,
|
||||
max_freq=config.data.max_freq,
|
||||
target_sample_rate=config.data.target_sample_rate,
|
||||
specgram_type=config.data.specgram_type,
|
||||
use_dB_normalization=config.data.use_dB_normalization,
|
||||
target_dB=config.data.target_dB,
|
||||
random_seed=config.data.random_seed,
|
||||
keep_transcription_text=True)
|
||||
|
||||
model = DeepSpeech2Model.from_pretrained(dataset, config,
|
||||
args.checkpoint_path)
|
||||
model.eval()
|
||||
|
||||
# prepare ASR inference handler
|
||||
def file_to_transcript(filename):
|
||||
feature = dataset.process_utterance(filename, "")
|
||||
audio = np.array([feature[0]]).astype('float32') #[1, D, T]
|
||||
audio_len = feature[0].shape[1]
|
||||
audio_len = np.array([audio_len]).astype('int64') # [1]
|
||||
|
||||
result_transcript = model.decode(
|
||||
paddle.to_tensor(audio),
|
||||
paddle.to_tensor(audio_len),
|
||||
vocab_list=dataset.vocab_list,
|
||||
decoding_method=config.decoding.decoding_method,
|
||||
lang_model_path=config.decoding.lang_model_path,
|
||||
beam_alpha=config.decoding.alpha,
|
||||
beam_beta=config.decoding.beta,
|
||||
beam_size=config.decoding.beam_size,
|
||||
cutoff_prob=config.decoding.cutoff_prob,
|
||||
cutoff_top_n=config.decoding.cutoff_top_n,
|
||||
num_processes=config.decoding.num_proc_bsearch)
|
||||
return result_transcript[0]
|
||||
|
||||
# warming up with utterances sampled from Librispeech
|
||||
print('-----------------------------------------------------------')
|
||||
print('Warming up ...')
|
||||
warm_up_test(
|
||||
audio_process_handler=file_to_transcript,
|
||||
manifest_path=args.warmup_manifest,
|
||||
num_test_cases=3)
|
||||
print('-----------------------------------------------------------')
|
||||
|
||||
# start the server
|
||||
server = AsrTCPServer(
|
||||
server_address=(args.host_ip, args.host_port),
|
||||
RequestHandlerClass=AsrRequestHandler,
|
||||
speech_save_dir=args.speech_save_dir,
|
||||
audio_process_handler=file_to_transcript)
|
||||
print("ASR Server Started.")
|
||||
server.serve_forever()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
start_server(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = default_argument_parser()
|
||||
add_arg = functools.partial(add_arguments, argparser=parser)
|
||||
# yapf: disable
|
||||
add_arg('host_ip', str,
|
||||
'localhost',
|
||||
"Server's IP address.")
|
||||
add_arg('host_port', int, 8086, "Server's IP port.")
|
||||
add_arg('speech_save_dir', str,
|
||||
'demo_cache',
|
||||
"Directory to save demo audios.")
|
||||
add_arg('warmup_manifest', str, None, "Filepath of manifest to warm up.")
|
||||
add_arg(
|
||||
"--model_file",
|
||||
type=str,
|
||||
default="",
|
||||
help="Model filename. Specify this when your model is a combined model."
|
||||
)
|
||||
add_arg(
|
||||
"--params_file",
|
||||
type=str,
|
||||
default="",
|
||||
help=
|
||||
"Parameter filename. Specify this when your model is a combined model."
|
||||
)
|
||||
add_arg(
|
||||
"--model_dir",
|
||||
type=str,
|
||||
default=None,
|
||||
help=
|
||||
"Model dir. If you load a non-combined model, specify the directory of the model."
|
||||
)
|
||||
add_arg("--use_gpu",
|
||||
type=bool,
|
||||
default=False,
|
||||
help="Whether to use GPU.")
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
# https://yaml.org/type/float.html
|
||||
config = get_cfg_defaults()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
|
||||
args.warmup_manifest = config.data.test_manifest
|
||||
print_arguments(args)
|
||||
|
||||
if args.dump_config:
|
||||
with open(args.dump_config, 'w') as f:
|
||||
print(config, file=f)
|
||||
|
||||
main(config, args)
|
@@ -0,0 +1,52 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Socket client to send wav to ASR server."""
|
||||
import struct
|
||||
import socket
|
||||
import argparse
|
||||
import wave
|
||||
|
||||
from deepspeech.utils.socket_server import socket_send
|
||||
|
||||
parser = argparse.ArgumentParser(description=__doc__)
|
||||
parser.add_argument(
|
||||
"--host_ip",
|
||||
default="localhost",
|
||||
type=str,
|
||||
help="Server IP address. (default: %(default)s)")
|
||||
parser.add_argument(
|
||||
"--host_port",
|
||||
default=8086,
|
||||
type=int,
|
||||
help="Server Port. (default: %(default)s)")
|
||||
args = parser.parse_args()
|
||||
|
||||
WAVE_OUTPUT_FILENAME = "output.wav"
|
||||
|
||||
|
||||
def main():
|
||||
wf = wave.open(WAVE_OUTPUT_FILENAME, 'rb')
|
||||
nframe = wf.getnframes()
|
||||
data = wf.readframes(nframe)
|
||||
print(f"Wave: {WAVE_OUTPUT_FILENAME}")
|
||||
print(f"Wave samples: {nframe}")
|
||||
print(f"Wave channels: {wf.getnchannels()}")
|
||||
print(f"Wave sample rate: {wf.getframerate()}")
|
||||
print(f"Wave sample width: {wf.getsampwidth()}")
|
||||
assert isinstance(data, bytes)
|
||||
socket_send(args.host_ip, args.host_port, data)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@@ -0,0 +1,134 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Server-end for the ASR demo."""
|
||||
import os
|
||||
import time
|
||||
import argparse
|
||||
import functools
|
||||
import paddle
|
||||
import numpy as np
|
||||
|
||||
from deepspeech.utils.socket_server import warm_up_test
|
||||
from deepspeech.utils.socket_server import AsrTCPServer
|
||||
from deepspeech.utils.socket_server import AsrRequestHandler
|
||||
|
||||
from deepspeech.training.cli import default_argument_parser
|
||||
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
|
||||
|
||||
from deepspeech.frontend.utility import read_manifest
|
||||
from deepspeech.utils.utility import add_arguments, print_arguments
|
||||
|
||||
from deepspeech.models.deepspeech2 import DeepSpeech2Model
|
||||
from deepspeech.io.dataset import ManifestDataset
|
||||
|
||||
|
||||
def start_server(config, args):
|
||||
"""Start the ASR server"""
|
||||
dataset = ManifestDataset(
|
||||
config.data.test_manifest,
|
||||
config.data.vocab_filepath,
|
||||
config.data.mean_std_filepath,
|
||||
augmentation_config="{}",
|
||||
max_duration=config.data.max_duration,
|
||||
min_duration=config.data.min_duration,
|
||||
stride_ms=config.data.stride_ms,
|
||||
window_ms=config.data.window_ms,
|
||||
n_fft=config.data.n_fft,
|
||||
max_freq=config.data.max_freq,
|
||||
target_sample_rate=config.data.target_sample_rate,
|
||||
specgram_type=config.data.specgram_type,
|
||||
use_dB_normalization=config.data.use_dB_normalization,
|
||||
target_dB=config.data.target_dB,
|
||||
random_seed=config.data.random_seed,
|
||||
keep_transcription_text=True)
|
||||
model = DeepSpeech2Model.from_pretrained(dataset, config,
|
||||
args.checkpoint_path)
|
||||
model.eval()
|
||||
|
||||
# prepare ASR inference handler
|
||||
def file_to_transcript(filename):
|
||||
feature = dataset.process_utterance(filename, "")
|
||||
audio = np.array([feature[0]]).astype('float32') #[1, D, T]
|
||||
audio_len = feature[0].shape[1]
|
||||
audio_len = np.array([audio_len]).astype('int64') # [1]
|
||||
|
||||
result_transcript = model.decode(
|
||||
paddle.to_tensor(audio),
|
||||
paddle.to_tensor(audio_len),
|
||||
vocab_list=dataset.vocab_list,
|
||||
decoding_method=config.decoding.decoding_method,
|
||||
lang_model_path=config.decoding.lang_model_path,
|
||||
beam_alpha=config.decoding.alpha,
|
||||
beam_beta=config.decoding.beta,
|
||||
beam_size=config.decoding.beam_size,
|
||||
cutoff_prob=config.decoding.cutoff_prob,
|
||||
cutoff_top_n=config.decoding.cutoff_top_n,
|
||||
num_processes=config.decoding.num_proc_bsearch)
|
||||
return result_transcript[0]
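# Note: model.decode() returns a list with one transcript per utterance in the
# batch; this handler feeds a single utterance, hence result_transcript[0].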
|
||||
|
||||
# warming up with utterances sampled from Librispeech
|
||||
print('-----------------------------------------------------------')
|
||||
print('Warming up ...')
|
||||
warm_up_test(
|
||||
audio_process_handler=file_to_transcript,
|
||||
manifest_path=args.warmup_manifest,
|
||||
num_test_cases=3)
|
||||
print('-----------------------------------------------------------')
|
||||
|
||||
# start the server
|
||||
server = AsrTCPServer(
|
||||
server_address=(args.host_ip, args.host_port),
|
||||
RequestHandlerClass=AsrRequestHandler,
|
||||
speech_save_dir=args.speech_save_dir,
|
||||
audio_process_handler=file_to_transcript)
|
||||
print("ASR Server Started.")
|
||||
server.serve_forever()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
start_server(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = default_argument_parser()
|
||||
add_arg = functools.partial(add_arguments, argparser=parser)
|
||||
# yapf: disable
|
||||
add_arg('host_ip', str,
|
||||
'localhost',
|
||||
"Server's IP address.")
|
||||
add_arg('host_port', int, 8086, "Server's IP port.")
|
||||
add_arg('speech_save_dir', str,
|
||||
'demo_cache',
|
||||
"Directory to save demo audios.")
|
||||
add_arg('warmup_manifest', str, None, "Filepath of manifest to warm up.")
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
# https://yaml.org/type/float.html
|
||||
config = get_cfg_defaults()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
|
||||
args.warmup_manifest = config.data.test_manifest
|
||||
print_arguments(args)
|
||||
|
||||
if args.dump_config:
|
||||
with open(args.dump_config, 'w') as f:
|
||||
print(config, file=f)
|
||||
|
||||
main(config, args)
|
@ -0,0 +1,58 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Export for DeepSpeech2 model."""
|
||||
|
||||
import io
|
||||
import logging
|
||||
import argparse
|
||||
import functools
|
||||
|
||||
from paddle import distributed as dist
|
||||
|
||||
from deepspeech.training.cli import default_argument_parser
|
||||
from deepspeech.utils.utility import print_arguments
|
||||
from deepspeech.utils.error_rate import char_errors, word_errors
|
||||
|
||||
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
|
||||
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
|
||||
|
||||
|
||||
def main_sp(config, args):
|
||||
exp = Tester(config, args)
|
||||
exp.setup()
|
||||
exp.run_export()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
main_sp(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = default_argument_parser()
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
# https://yaml.org/type/float.html
|
||||
config = get_cfg_defaults()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
if args.dump_config:
|
||||
with open(args.dump_config, 'w') as f:
|
||||
print(config, file=f)
|
||||
|
||||
main(config, args)
|
@ -0,0 +1,59 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Inferer for DeepSpeech2 model."""
|
||||
|
||||
import io
|
||||
import logging
|
||||
import argparse
|
||||
import functools
|
||||
|
||||
from paddle import distributed as dist
|
||||
|
||||
from deepspeech.training.cli import default_argument_parser
|
||||
from deepspeech.utils.utility import print_arguments
|
||||
from deepspeech.utils.error_rate import char_errors, word_errors
|
||||
|
||||
# TODO(hui zhang): dynamic load
|
||||
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
|
||||
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
|
||||
|
||||
|
||||
def main_sp(config, args):
|
||||
exp = Tester(config, args)
|
||||
exp.setup()
|
||||
exp.run_test()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
main_sp(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = default_argument_parser()
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
# https://yaml.org/type/float.html
|
||||
config = get_cfg_defaults()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
if args.dump_config:
|
||||
with open(args.dump_config, 'w') as f:
|
||||
print(config, file=f)
|
||||
|
||||
main(config, args)
|
@ -0,0 +1,58 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Evaluation for DeepSpeech2 model."""
|
||||
|
||||
import io
|
||||
import logging
|
||||
import argparse
|
||||
import functools
|
||||
|
||||
from paddle import distributed as dist
|
||||
|
||||
from deepspeech.training.cli import default_argument_parser
|
||||
from deepspeech.utils.utility import print_arguments
|
||||
from deepspeech.utils.error_rate import char_errors, word_errors
|
||||
|
||||
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
|
||||
from deepspeech.exps.deepspeech2.model import DeepSpeech2Tester as Tester
|
||||
|
||||
|
||||
def main_sp(config, args):
|
||||
exp = Tester(config, args)
|
||||
exp.setup()
|
||||
exp.run_test()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
main_sp(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = default_argument_parser()
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
# https://yaml.org/type/float.html
|
||||
config = get_cfg_defaults()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
if args.dump_config:
|
||||
with open(args.dump_config, 'w') as f:
|
||||
print(config, file=f)
|
||||
|
||||
main(config, args)
|
@ -0,0 +1,60 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Trainer for DeepSpeech2 model."""
|
||||
|
||||
import io
|
||||
import logging
|
||||
import argparse
|
||||
import functools
|
||||
|
||||
from paddle import distributed as dist
|
||||
|
||||
from deepspeech.utils.utility import print_arguments
|
||||
from deepspeech.training.cli import default_argument_parser
|
||||
|
||||
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
|
||||
from deepspeech.exps.deepspeech2.model import DeepSpeech2Trainer as Trainer
|
||||
|
||||
|
||||
def main_sp(config, args):
|
||||
exp = Trainer(config, args)
|
||||
exp.setup()
|
||||
exp.run()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
if args.device == "gpu" and args.nprocs > 1:
|
||||
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
|
||||
else:
|
||||
main_sp(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = default_argument_parser()
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
# https://yaml.org/type/float.html
|
||||
config = get_cfg_defaults()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
if args.dump_config:
|
||||
with open(args.dump_config, 'w') as f:
|
||||
print(config, file=f)
|
||||
|
||||
main(config, args)
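# Hypothetical launch command (flag names assumed from the attributes read
# above: config, device, nprocs, opts; the yaml path is made up):
#   python -u train.py --config conf/deepspeech2.yaml --device gpu --nprocs 4
# With device == "gpu" and nprocs > 1, main() spawns one trainer process per
# card via paddle.distributed.spawn; otherwise it trains in a single process.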
|
@ -0,0 +1,210 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Beam search parameters tuning for DeepSpeech2 model."""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import numpy as np
|
||||
import argparse
|
||||
import functools
|
||||
import gzip
|
||||
import logging
|
||||
|
||||
from paddle.io import DataLoader
|
||||
|
||||
from deepspeech.utils import error_rate
|
||||
from deepspeech.utils.utility import add_arguments, print_arguments
|
||||
|
||||
from deepspeech.models.deepspeech2 import DeepSpeech2Model
|
||||
from deepspeech.io.collator import SpeechCollator
|
||||
from deepspeech.io.dataset import ManifestDataset
|
||||
|
||||
from deepspeech.training.cli import default_argument_parser
|
||||
from deepspeech.exps.deepspeech2.config import get_cfg_defaults
|
||||
|
||||
|
||||
def tune(config, args):
|
||||
"""Tune parameters alpha and beta incrementally."""
|
||||
if not args.num_alphas >= 0:
|
||||
raise ValueError("num_alphas must be non-negative!")
|
||||
if not args.num_betas >= 0:
|
||||
raise ValueError("num_betas must be non-negative!")
|
||||
|
||||
dev_dataset = ManifestDataset(
|
||||
config.data.dev_manifest,
|
||||
config.data.vocab_filepath,
|
||||
config.data.mean_std_filepath,
|
||||
augmentation_config="{}",
|
||||
max_duration=config.data.max_duration,
|
||||
min_duration=config.data.min_duration,
|
||||
stride_ms=config.data.stride_ms,
|
||||
window_ms=config.data.window_ms,
|
||||
n_fft=config.data.n_fft,
|
||||
max_freq=config.data.max_freq,
|
||||
target_sample_rate=config.data.target_sample_rate,
|
||||
specgram_type=config.data.specgram_type,
|
||||
use_dB_normalization=config.data.use_dB_normalization,
|
||||
target_dB=config.data.target_dB,
|
||||
random_seed=config.data.random_seed,
|
||||
keep_transcription_text=True)
|
||||
|
||||
valid_loader = DataLoader(
|
||||
dev_dataset,
|
||||
batch_size=config.data.batch_size,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
collate_fn=SpeechCollator(is_training=False))
|
||||
|
||||
model = DeepSpeech2Model.from_pretrained(dev_dataset, config,
|
||||
args.checkpoint_path)
|
||||
model.eval()
|
||||
|
||||
# decoders only accept string encoded in utf-8
|
||||
vocab_list = valid_loader.dataset.vocab_list
|
||||
errors_func = error_rate.char_errors if config.decoding.error_rate_type == 'cer' else error_rate.word_errors
|
||||
|
||||
# create grid for search
|
||||
cand_alphas = np.linspace(args.alpha_from, args.alpha_to, args.num_alphas)
|
||||
cand_betas = np.linspace(args.beta_from, args.beta_to, args.num_betas)
|
||||
params_grid = [(alpha, beta) for alpha in cand_alphas
|
||||
for beta in cand_betas]
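# with the default num_alphas=45 and num_betas=8 below, this yields a
# 45 x 8 = 360 point (alpha, beta) search grid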
|
||||
|
||||
err_sum = [0.0 for i in range(len(params_grid))]
|
||||
err_ave = [0.0 for i in range(len(params_grid))]
|
||||
|
||||
num_ins, len_refs, cur_batch = 0, 0, 0
|
||||
# initialize external scorer
|
||||
model.decoder.init_decode(args.alpha_from, args.beta_from,
|
||||
config.decoding.lang_model_path, vocab_list,
|
||||
config.decoding.decoding_method)
|
||||
## incremental tuning parameters over multiple batches
|
||||
print("start tuning ...")
|
||||
for infer_data in valid_loader():
|
||||
if (args.num_batches >= 0) and (cur_batch >= args.num_batches):
|
||||
break
|
||||
|
||||
def ordid2token(texts, texts_len):
|
||||
""" ord() id to chr() chr """
|
||||
trans = []
|
||||
for text, n in zip(texts, texts_len):
|
||||
n = n.numpy().item()
|
||||
ids = text[:n]
|
||||
trans.append(''.join([chr(i) for i in ids]))
|
||||
return trans
|
||||
|
||||
audio, text, audio_len, text_len = infer_data
|
||||
target_transcripts = ordid2token(text, text_len)
|
||||
num_ins += audio.shape[0]
|
||||
|
||||
# model infer
|
||||
eouts, eouts_len = model.encoder(audio, audio_len)
|
||||
probs = model.decoder.probs(eouts)
|
||||
|
||||
# grid search
|
||||
for index, (alpha, beta) in enumerate(params_grid):
|
||||
print(f"tuneing: alpha={alpha} beta={beta}")
|
||||
result_transcripts = model.decoder.decode_probs(
|
||||
probs.numpy(), eouts_len, vocab_list,
|
||||
config.decoding.decoding_method,
|
||||
config.decoding.lang_model_path, alpha, beta,
|
||||
config.decoding.beam_size, config.decoding.cutoff_prob,
|
||||
config.decoding.cutoff_top_n, config.decoding.num_proc_bsearch)
|
||||
|
||||
for target, result in zip(target_transcripts, result_transcripts):
|
||||
errors, len_ref = errors_func(target, result)
|
||||
err_sum[index] += errors
|
||||
|
||||
# accumulate the length of references of every batch
|
||||
# in the first iteration
|
||||
if args.alpha_from == alpha and args.beta_from == beta:
|
||||
len_refs += len_ref
|
||||
|
||||
err_ave[index] = err_sum[index] / len_refs
|
||||
if index % 2 == 0:
|
||||
sys.stdout.write('.')
|
||||
sys.stdout.flush()
|
||||
print(f"tuneing: one grid done!")
|
||||
|
||||
# output on-line tuning result at the end of current batch
|
||||
err_ave_min = min(err_ave)
|
||||
min_index = err_ave.index(err_ave_min)
|
||||
print("\nBatch %d [%d/?], current opt (alpha, beta) = (%s, %s), "
|
||||
" min [%s] = %f" %
|
||||
(cur_batch, num_ins, "%.3f" % params_grid[min_index][0],
|
||||
"%.3f" % params_grid[min_index][1],
|
||||
config.decoding.error_rate_type, err_ave_min))
|
||||
cur_batch += 1
|
||||
|
||||
# output WER/CER at every (alpha, beta)
|
||||
print("\nFinal %s:\n" % config.decoding.error_rate_type)
|
||||
for index in range(len(params_grid)):
|
||||
print("(alpha, beta) = (%s, %s), [%s] = %f" %
|
||||
("%.3f" % params_grid[index][0], "%.3f" % params_grid[index][1],
|
||||
config.decoding.error_rate_type, err_ave[index]))
|
||||
|
||||
err_ave_min = min(err_ave)
|
||||
min_index = err_ave.index(err_ave_min)
|
||||
print("\nFinish tuning on %d batches, final opt (alpha, beta) = (%s, %s)" %
|
||||
(cur_batch, "%.3f" % params_grid[min_index][0],
|
||||
"%.3f" % params_grid[min_index][1]))
|
||||
|
||||
print("finish tuning")
|
||||
|
||||
|
||||
def main(config, args):
|
||||
tune(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = default_argument_parser()
|
||||
add_arg = functools.partial(add_arguments, argparser=parser)
|
||||
add_arg('num_batches', int, -1, "# of batches to tune on. "
|
||||
"Default -1, on whole dev set.")
|
||||
add_arg('num_alphas', int, 45, "# of alpha candidates for tuning.")
|
||||
add_arg('num_betas', int, 8, "# of beta candidates for tuning.")
|
||||
add_arg('alpha_from', float, 1.0, "Where alpha starts tuning from.")
|
||||
add_arg('alpha_to', float, 3.2, "Where alpha ends tuning with.")
|
||||
add_arg('beta_from', float, 0.1, "Where beta starts tuning from.")
|
||||
add_arg('beta_to', float, 0.45, "Where beta ends tuning with.")
|
||||
|
||||
add_arg('batch_size', int, 256, "# of samples per batch.")
|
||||
add_arg('beam_size', int, 500, "Beam search width.")
|
||||
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
|
||||
add_arg('cutoff_prob', float, 1.0, "Cutoff probability for pruning.")
|
||||
add_arg('cutoff_top_n', int, 40, "Cutoff number for pruning.")
|
||||
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
# https://yaml.org/type/float.html
|
||||
config = get_cfg_defaults()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
|
||||
config.data.batch_size = args.batch_size
|
||||
config.decoding.beam_size = args.beam_size
|
||||
config.decoding.num_proc_bsearch = args.num_proc_bsearch
|
||||
config.decoding.cutoff_prob = args.cutoff_prob
|
||||
config.decoding.cutoff_top_n = args.cutoff_top_n
|
||||
|
||||
config.freeze()
|
||||
print(config)
|
||||
|
||||
if args.dump_config:
|
||||
with open(args.dump_config, 'w') as f:
|
||||
print(config, file=f)
|
||||
|
||||
main(config, args)
|
@ -0,0 +1,84 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from yacs.config import CfgNode as CN
|
||||
from deepspeech.models.deepspeech2 import DeepSpeech2Model
|
||||
|
||||
_C = CN()
|
||||
_C.data = CN(
|
||||
dict(
|
||||
train_manifest="",
|
||||
dev_manifest="",
|
||||
test_manifest="",
|
||||
vocab_filepath="",
|
||||
mean_std_filepath="",
|
||||
augmentation_config="",
|
||||
max_duration=float('inf'),
|
||||
min_duration=0.0,
|
||||
stride_ms=10.0, # ms
|
||||
window_ms=20.0, # ms
|
||||
n_fft=None, # fft points
|
||||
max_freq=None, # None for samplerate/2
|
||||
specgram_type='linear', # 'linear', 'mfcc'
|
||||
target_sample_rate=16000, # sample rate
|
||||
use_dB_normalization=True,
|
||||
target_dB=-20,
|
||||
random_seed=0,
|
||||
keep_transcription_text=False,
|
||||
batch_size=32, # batch size
|
||||
num_workers=0, # data loader workers
|
||||
sortagrad=False, # sorted in first epoch when True
|
||||
shuffle_method="batch_shuffle", # 'batch_shuffle', 'instance_shuffle'
|
||||
))
|
||||
|
||||
_C.model = CN(
|
||||
dict(
|
||||
num_conv_layers=2,  # Number of stacking convolution layers.
num_rnn_layers=3,  # Number of stacking RNN layers.
rnn_layer_size=1024,  # RNN layer size (number of RNN cells).
use_gru=True,  # Use GRU if set True. Use simple RNN if set False.
share_rnn_weights=True  # Whether to share input-hidden weights between forward and backward directional RNNs. Notice that for GRU, weight sharing is not supported.
|
||||
))
|
||||
|
||||
DeepSpeech2Model.params(_C.model)
|
||||
|
||||
_C.training = CN(
|
||||
dict(
|
||||
lr=5e-4, # learning rate
|
||||
lr_decay=1.0, # learning rate decay
|
||||
weight_decay=1e-6, # the coeff of weight decay
|
||||
global_grad_clip=5.0, # the global norm clip
|
||||
n_epoch=50, # train epochs
|
||||
))
|
||||
|
||||
_C.decoding = CN(
|
||||
dict(
|
||||
alpha=2.5, # Coef of LM for beam search.
|
||||
beta=0.3, # Coef of WC for beam search.
|
||||
cutoff_prob=1.0, # Cutoff probability for pruning.
|
||||
cutoff_top_n=40, # Cutoff number for pruning.
|
||||
lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm', # Filepath for language model.
|
||||
decoding_method='ctc_beam_search', # Decoding method. Options: ctc_beam_search, ctc_greedy
|
||||
error_rate_type='wer', # Error rate type for evaluation. Options `wer`, 'cer'
|
||||
num_proc_bsearch=8, # # of CPUs for beam search.
|
||||
beam_size=500, # Beam search width.
|
||||
batch_size=128, # decoding batch size
|
||||
))
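# Note: in the beam-search scoring, alpha weights the language-model score and
# beta weights the word-count bonus (hence "Coef of LM" / "Coef of WC" above);
# the tune script in this change searches over both on the dev set.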
|
||||
|
||||
|
||||
def get_cfg_defaults():
|
||||
"""Get a yacs CfgNode object with default values for my_project."""
|
||||
# Return a clone so that the defaults will not be altered
|
||||
# This is for the "local variable" use pattern
|
||||
return _C.clone()
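# A minimal usage sketch (the yaml path below is hypothetical):
#   config = get_cfg_defaults()
#   config.merge_from_file("conf/deepspeech2.yaml")
#   config.merge_from_list(["training.n_epoch", "30"])
#   config.freeze()
# Because get_cfg_defaults() returns a clone, these per-experiment overrides
# never touch the module-level defaults in _C.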
|
@ -0,0 +1,424 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Contains DeepSpeech2 model."""
|
||||
|
||||
import io
|
||||
import sys
|
||||
import os
|
||||
import time
|
||||
import logging
|
||||
import numpy as np
|
||||
from collections import defaultdict
|
||||
from functools import partial
|
||||
from pathlib import Path
|
||||
|
||||
import paddle
|
||||
from paddle import distributed as dist
|
||||
from paddle.io import DataLoader
|
||||
|
||||
from deepspeech.training import Trainer
|
||||
from deepspeech.training.gradclip import MyClipGradByGlobalNorm
|
||||
|
||||
from deepspeech.utils import mp_tools
|
||||
from deepspeech.utils import layer_tools
|
||||
from deepspeech.utils import error_rate
|
||||
|
||||
from deepspeech.io.collator import SpeechCollator
|
||||
from deepspeech.io.sampler import SortagradDistributedBatchSampler
|
||||
from deepspeech.io.sampler import SortagradBatchSampler
|
||||
from deepspeech.io.dataset import ManifestDataset
|
||||
|
||||
from deepspeech.modules.loss import CTCLoss
|
||||
from deepspeech.models.deepspeech2 import DeepSpeech2Model
|
||||
from deepspeech.models.deepspeech2 import DeepSpeech2InferModel
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class DeepSpeech2Trainer(Trainer):
|
||||
def __init__(self, config, args):
|
||||
super().__init__(config, args)
|
||||
|
||||
def train_batch(self, batch_data):
|
||||
start = time.time()
|
||||
self.model.train()
|
||||
loss = self.model(*batch_data)
|
||||
loss.backward()
|
||||
layer_tools.print_grads(self.model, print_func=None)
|
||||
self.optimizer.step()
|
||||
self.optimizer.clear_grad()
|
||||
|
||||
iteration_time = time.time() - start
|
||||
|
||||
losses_np = {
|
||||
'train_loss': float(loss),
|
||||
'train_loss_div_batchsize':
|
||||
float(loss) / self.config.data.batch_size
|
||||
}
|
||||
msg = "Train: Rank: {}, ".format(dist.get_rank())
|
||||
msg += "epoch: {}, ".format(self.epoch)
|
||||
msg += "step: {}, ".format(self.iteration)
|
||||
msg += "time: {:>.3f}s, ".format(iteration_time)
|
||||
msg += ', '.join('{}: {:>.6f}'.format(k, v)
|
||||
for k, v in losses_np.items())
|
||||
self.logger.info(msg)
|
||||
|
||||
if dist.get_rank() == 0 and self.visualizer:
|
||||
for k, v in losses_np.items():
|
||||
self.visualizer.add_scalar("train/{}".format(k), v,
|
||||
self.iteration)
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
@paddle.no_grad()
|
||||
def valid(self):
|
||||
self.logger.info(
|
||||
f"Valid Total Examples: {len(self.valid_loader.dataset)}")
|
||||
self.model.eval()
|
||||
valid_losses = defaultdict(list)
|
||||
for i, batch in enumerate(self.valid_loader):
|
||||
loss = self.model(*batch)
|
||||
|
||||
valid_losses['val_loss'].append(float(loss))
|
||||
valid_losses['val_loss_div_batchsize'].append(
|
||||
float(loss) / self.config.data.batch_size)
|
||||
|
||||
# write visual log
|
||||
valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
|
||||
|
||||
# logging
|
||||
msg = f"Valid: Rank: {dist.get_rank()}, "
|
||||
msg += "epoch: {}, ".format(self.epoch)
|
||||
msg += "step: {}, ".format(self.iteration)
|
||||
msg += ', '.join('{}: {:>.6f}'.format(k, v)
|
||||
for k, v in valid_losses.items())
|
||||
self.logger.info(msg)
|
||||
|
||||
if self.visualizer:
|
||||
for k, v in valid_losses.items():
|
||||
self.visualizer.add_scalar("valid/{}".format(k), v,
|
||||
self.iteration)
|
||||
|
||||
def setup_model(self):
|
||||
config = self.config
|
||||
model = DeepSpeech2Model(
|
||||
feat_size=self.train_loader.dataset.feature_size,
|
||||
dict_size=self.train_loader.dataset.vocab_size,
|
||||
num_conv_layers=config.model.num_conv_layers,
|
||||
num_rnn_layers=config.model.num_rnn_layers,
|
||||
rnn_size=config.model.rnn_layer_size,
|
||||
use_gru=config.model.use_gru,
|
||||
share_rnn_weights=config.model.share_rnn_weights)
|
||||
|
||||
if self.parallel:
|
||||
model = paddle.DataParallel(model)
|
||||
|
||||
layer_tools.print_params(model, self.logger.info)
|
||||
|
||||
grad_clip = MyClipGradByGlobalNorm(config.training.global_grad_clip)
|
||||
lr_scheduler = paddle.optimizer.lr.ExponentialDecay(
|
||||
learning_rate=config.training.lr,
|
||||
gamma=config.training.lr_decay,
|
||||
verbose=True)
|
||||
optimizer = paddle.optimizer.Adam(
|
||||
learning_rate=lr_scheduler,
|
||||
parameters=model.parameters(),
|
||||
weight_decay=paddle.regularizer.L2Decay(
|
||||
config.training.weight_decay),
|
||||
grad_clip=grad_clip)
|
||||
|
||||
self.model = model
|
||||
self.optimizer = optimizer
|
||||
self.lr_scheduler = lr_scheduler
|
||||
self.logger.info("Setup model/optimizer/lr_scheduler!")
|
||||
|
||||
def setup_dataloader(self):
|
||||
config = self.config
|
||||
|
||||
train_dataset = ManifestDataset(
|
||||
config.data.train_manifest,
|
||||
config.data.vocab_filepath,
|
||||
config.data.mean_std_filepath,
|
||||
augmentation_config=io.open(
|
||||
config.data.augmentation_config, mode='r',
|
||||
encoding='utf8').read(),
|
||||
max_duration=config.data.max_duration,
|
||||
min_duration=config.data.min_duration,
|
||||
stride_ms=config.data.stride_ms,
|
||||
window_ms=config.data.window_ms,
|
||||
n_fft=config.data.n_fft,
|
||||
max_freq=config.data.max_freq,
|
||||
target_sample_rate=config.data.target_sample_rate,
|
||||
specgram_type=config.data.specgram_type,
|
||||
use_dB_normalization=config.data.use_dB_normalization,
|
||||
target_dB=config.data.target_dB,
|
||||
random_seed=config.data.random_seed,
|
||||
keep_transcription_text=False)
|
||||
|
||||
dev_dataset = ManifestDataset(
|
||||
config.data.dev_manifest,
|
||||
config.data.vocab_filepath,
|
||||
config.data.mean_std_filepath,
|
||||
augmentation_config="{}",
|
||||
max_duration=config.data.max_duration,
|
||||
min_duration=config.data.min_duration,
|
||||
stride_ms=config.data.stride_ms,
|
||||
window_ms=config.data.window_ms,
|
||||
n_fft=config.data.n_fft,
|
||||
max_freq=config.data.max_freq,
|
||||
target_sample_rate=config.data.target_sample_rate,
|
||||
specgram_type=config.data.specgram_type,
|
||||
use_dB_normalization=config.data.use_dB_normalization,
|
||||
target_dB=config.data.target_dB,
|
||||
random_seed=config.data.random_seed,
|
||||
keep_transcription_text=False)
|
||||
|
||||
if self.parallel:
|
||||
batch_sampler = SortagradDistributedBatchSampler(
|
||||
train_dataset,
|
||||
batch_size=config.data.batch_size,
|
||||
num_replicas=None,
|
||||
rank=None,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
sortagrad=config.data.sortagrad,
|
||||
shuffle_method=config.data.shuffle_method)
|
||||
else:
|
||||
batch_sampler = SortagradBatchSampler(
|
||||
train_dataset,
|
||||
shuffle=True,
|
||||
batch_size=config.data.batch_size,
|
||||
drop_last=True,
|
||||
sortagrad=config.data.sortagrad,
|
||||
shuffle_method=config.data.shuffle_method)
|
||||
|
||||
collate_fn = SpeechCollator(is_training=True)
|
||||
self.train_loader = DataLoader(
|
||||
train_dataset,
|
||||
batch_sampler=batch_sampler,
|
||||
collate_fn=collate_fn,
|
||||
num_workers=config.data.num_workers, )
|
||||
self.valid_loader = DataLoader(
|
||||
dev_dataset,
|
||||
batch_size=config.data.batch_size,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
collate_fn=collate_fn)
|
||||
self.logger.info("Setup train/valid Dataloader!")
|
||||
|
||||
|
||||
class DeepSpeech2Tester(DeepSpeech2Trainer):
|
||||
def __init__(self, config, args):
|
||||
super().__init__(config, args)
|
||||
|
||||
def ordid2token(self, texts, texts_len):
|
||||
""" ord() id to chr() chr """
|
||||
trans = []
|
||||
for text, n in zip(texts, texts_len):
|
||||
n = n.numpy().item()
|
||||
ids = text[:n]
|
||||
trans.append(''.join([chr(i) for i in ids]))
|
||||
return trans
|
||||
|
||||
def compute_metrics(self, audio, texts, audio_len, texts_len):
|
||||
cfg = self.config.decoding
|
||||
errors_sum, len_refs, num_ins = 0.0, 0, 0
|
||||
errors_func = error_rate.char_errors if cfg.error_rate_type == 'cer' else error_rate.word_errors
|
||||
error_rate_func = error_rate.cer if cfg.error_rate_type == 'cer' else error_rate.wer
|
||||
|
||||
vocab_list = self.test_loader.dataset.vocab_list
|
||||
|
||||
target_transcripts = self.ordid2token(texts, texts_len)
|
||||
result_transcripts = self.model.decode(
|
||||
audio,
|
||||
audio_len,
|
||||
vocab_list,
|
||||
decoding_method=cfg.decoding_method,
|
||||
lang_model_path=cfg.lang_model_path,
|
||||
beam_alpha=cfg.alpha,
|
||||
beam_beta=cfg.beta,
|
||||
beam_size=cfg.beam_size,
|
||||
cutoff_prob=cfg.cutoff_prob,
|
||||
cutoff_top_n=cfg.cutoff_top_n,
|
||||
num_processes=cfg.num_proc_bsearch)
|
||||
|
||||
for target, result in zip(target_transcripts, result_transcripts):
|
||||
errors, len_ref = errors_func(target, result)
|
||||
errors_sum += errors
|
||||
len_refs += len_ref
|
||||
num_ins += 1
|
||||
self.logger.info(
|
||||
"\nTarget Transcription: %s\nOutput Transcription: %s" %
|
||||
(target, result))
|
||||
self.logger.info("Current error rate [%s] = %f" % (
|
||||
cfg.error_rate_type, error_rate_func(target, result)))
|
||||
|
||||
return dict(
|
||||
errors_sum=errors_sum,
|
||||
len_refs=len_refs,
|
||||
num_ins=num_ins,
|
||||
error_rate=errors_sum / len_refs,
|
||||
error_rate_type=cfg.error_rate_type)
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
@paddle.no_grad()
|
||||
def test(self):
|
||||
self.logger.info(
|
||||
f"Test Total Examples: {len(self.test_loader.dataset)}")
|
||||
self.model.eval()
|
||||
cfg = self.config
|
||||
error_rate_type = None
|
||||
errors_sum, len_refs, num_ins = 0.0, 0, 0
|
||||
|
||||
for i, batch in enumerate(self.test_loader):
|
||||
metrics = self.compute_metrics(*batch)
|
||||
errors_sum += metrics['errors_sum']
|
||||
len_refs += metrics['len_refs']
|
||||
num_ins += metrics['num_ins']
|
||||
error_rate_type = metrics['error_rate_type']
|
||||
self.logger.info("Error rate [%s] (%d/?) = %f" %
|
||||
(error_rate_type, num_ins, errors_sum / len_refs))
|
||||
|
||||
# logging
|
||||
msg = "Test: "
|
||||
msg += "epoch: {}, ".format(self.epoch)
|
||||
msg += "step: {}, ".format(self.iteration)
|
||||
msg += ", Final error rate [%s] (%d/%d) = %f" % (
|
||||
error_rate_type, num_ins, num_ins, errors_sum / len_refs)
|
||||
self.logger.info(msg)
|
||||
|
||||
def run_test(self):
|
||||
self.resume_or_load()
|
||||
try:
|
||||
self.test()
|
||||
except KeyboardInterrupt:
|
||||
exit(-1)
|
||||
|
||||
def export(self):
|
||||
self.infer_model.eval()
|
||||
feat_dim = self.test_loader.dataset.feature_size
|
||||
paddle.jit.save(
|
||||
self.infer_model,
|
||||
self.args.export_path,
|
||||
input_spec=[
|
||||
paddle.static.InputSpec(
|
||||
shape=[None, feat_dim, None],
|
||||
dtype='float32'), # audio, [B,D,T]
|
||||
paddle.static.InputSpec(shape=[None],
|
||||
dtype='int64'), # audio_length, [B]
|
||||
])
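# Usage note (not part of this diff): paddle.jit.save writes the exported
# program to "<export_path>.pdmodel" and its weights to "<export_path>.pdiparams";
# the exported model can later be reloaded with paddle.jit.load(export_path)
# or served by the Paddle inference library.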
|
||||
|
||||
def run_export(self):
|
||||
try:
|
||||
self.export()
|
||||
except KeyboardInterrupt:
|
||||
exit(-1)
|
||||
|
||||
def setup(self):
|
||||
"""Setup the experiment.
|
||||
"""
|
||||
paddle.set_device(self.args.device)
|
||||
|
||||
self.setup_output_dir()
|
||||
self.setup_checkpointer()
|
||||
self.setup_logger()
|
||||
|
||||
self.setup_dataloader()
|
||||
self.setup_model()
|
||||
|
||||
self.iteration = 0
|
||||
self.epoch = 0
|
||||
|
||||
def setup_model(self):
|
||||
config = self.config
|
||||
model = DeepSpeech2Model(
|
||||
feat_size=self.test_loader.dataset.feature_size,
|
||||
dict_size=self.test_loader.dataset.vocab_size,
|
||||
num_conv_layers=config.model.num_conv_layers,
|
||||
num_rnn_layers=config.model.num_rnn_layers,
|
||||
rnn_size=config.model.rnn_layer_size,
|
||||
use_gru=config.model.use_gru,
|
||||
share_rnn_weights=config.model.share_rnn_weights)
|
||||
|
||||
infer_model = DeepSpeech2InferModel.from_pretrained(
|
||||
self.test_loader.dataset, config, self.args.checkpoint_path)
|
||||
|
||||
self.model = model
|
||||
self.infer_model = infer_model
|
||||
self.logger.info("Setup model!")
|
||||
|
||||
def setup_dataloader(self):
|
||||
config = self.config
|
||||
# return raw text
|
||||
test_dataset = ManifestDataset(
|
||||
config.data.test_manifest,
|
||||
config.data.vocab_filepath,
|
||||
config.data.mean_std_filepath,
|
||||
augmentation_config="{}",
|
||||
max_duration=config.data.max_duration,
|
||||
min_duration=config.data.min_duration,
|
||||
stride_ms=config.data.stride_ms,
|
||||
window_ms=config.data.window_ms,
|
||||
n_fft=config.data.n_fft,
|
||||
max_freq=config.data.max_freq,
|
||||
target_sample_rate=config.data.target_sample_rate,
|
||||
specgram_type=config.data.specgram_type,
|
||||
use_dB_normalization=config.data.use_dB_normalization,
|
||||
target_dB=config.data.target_dB,
|
||||
random_seed=config.data.random_seed,
|
||||
keep_transcription_text=True)
|
||||
|
||||
# return text ord id
|
||||
self.test_loader = DataLoader(
|
||||
test_dataset,
|
||||
batch_size=config.decoding.batch_size,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
collate_fn=SpeechCollator(is_training=False))
|
||||
self.logger.info("Setup test Dataloader!")
|
||||
|
||||
def setup_output_dir(self):
|
||||
"""Create a directory used for output.
|
||||
"""
|
||||
# output dir
|
||||
if self.args.output:
|
||||
output_dir = Path(self.args.output).expanduser()
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
else:
|
||||
output_dir = Path(
|
||||
self.args.checkpoint_path).expanduser().parent.parent
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
self.output_dir = output_dir
|
||||
|
||||
def setup_logger(self):
|
||||
"""Initialize a text logger to log the experiment.
|
||||
|
||||
Each process has its own text logger. The logging messages are written to
|
||||
the standard output and a text file named ``worker_n.log`` in the
|
||||
output directory, where ``n`` means the rank of the process.
|
||||
"""
|
||||
format = '[%(levelname)s %(asctime)s %(filename)s:%(lineno)d] %(message)s'
|
||||
formatter = logging.Formatter(fmt=format, datefmt='%Y/%m/%d %H:%M:%S')
|
||||
|
||||
logger.setLevel("INFO")
|
||||
|
||||
# global logger
|
||||
stdout = True
|
||||
save_path = ""
|
||||
logging.basicConfig(
|
||||
level=logging.DEBUG if stdout else logging.INFO,
|
||||
format=format,
|
||||
datefmt='%Y/%m/%d %H:%M:%S',
|
||||
filename=save_path if not stdout else None)
|
||||
self.logger = logger
|
@ -0,0 +1,128 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from paddle.io import DataLoader
|
||||
|
||||
from deepspeech.io.collator import SpeechCollator
|
||||
from deepspeech.io.sampler import SortagradDistributedBatchSampler
|
||||
from deepspeech.io.sampler import SortagradBatchSampler
|
||||
from deepspeech.io.dataset import ManifestDataset
|
||||
|
||||
|
||||
def create_dataloader(manifest_path,
|
||||
vocab_filepath,
|
||||
mean_std_filepath,
|
||||
augmentation_config='{}',
|
||||
max_duration=float('inf'),
|
||||
min_duration=0.0,
|
||||
stride_ms=10.0,
|
||||
window_ms=20.0,
|
||||
max_freq=None,
|
||||
specgram_type='linear',
|
||||
use_dB_normalization=True,
|
||||
random_seed=0,
|
||||
keep_transcription_text=False,
|
||||
is_training=False,
|
||||
batch_size=1,
|
||||
num_workers=0,
|
||||
sortagrad=False,
|
||||
shuffle_method=None,
|
||||
dist=False):
|
||||
|
||||
dataset = ManifestDataset(
|
||||
manifest_path,
|
||||
vocab_filepath,
|
||||
mean_std_filepath,
|
||||
augmentation_config=augmentation_config,
|
||||
max_duration=max_duration,
|
||||
min_duration=min_duration,
|
||||
stride_ms=stride_ms,
|
||||
window_ms=window_ms,
|
||||
max_freq=max_freq,
|
||||
specgram_type=specgram_type,
|
||||
use_dB_normalization=use_dB_normalization,
|
||||
random_seed=random_seed,
|
||||
keep_transcription_text=keep_transcription_text)
|
||||
|
||||
if dist:
|
||||
batch_sampler = SortagradDistributedBatchSampler(
|
||||
dataset,
|
||||
batch_size,
|
||||
num_replicas=None,
|
||||
rank=None,
|
||||
shuffle=is_training,
|
||||
drop_last=is_training,
|
||||
sortagrad=is_training,
|
||||
shuffle_method=shuffle_method)
|
||||
else:
|
||||
batch_sampler = SortagradBatchSampler(
|
||||
dataset,
|
||||
shuffle=is_training,
|
||||
batch_size=batch_size,
|
||||
drop_last=is_training,
|
||||
sortagrad=is_training,
|
||||
shuffle_method=shuffle_method)
|
||||
|
||||
def padding_batch(batch, padding_to=-1, flatten=False, is_training=True):
|
||||
"""
|
||||
Padding audio features with zeros to make them have the same shape (or
|
||||
a user-defined shape) within one batch.
|
||||
|
||||
If ``padding_to`` is -1, the maximum shape in the batch will be used
|
||||
as the target shape for padding. Otherwise, `padding_to` will be the
|
||||
target shape (only refers to the second axis).
|
||||
|
||||
If `flatten` is True, features will be flattened to a 1-D array.
|
||||
"""
|
||||
new_batch = []
|
||||
# get target shape
|
||||
max_length = max([audio.shape[1] for audio, text in batch])
|
||||
if padding_to != -1:
|
||||
if padding_to < max_length:
|
||||
raise ValueError("If padding_to is not -1, it should be larger "
|
||||
"than any instance's shape in the batch")
|
||||
max_length = padding_to
|
||||
max_text_length = max([len(text) for audio, text in batch])
|
||||
# padding
|
||||
padded_audios = []
|
||||
audio_lens = []
|
||||
texts, text_lens = [], []
|
||||
for audio, text in batch:
|
||||
padded_audio = np.zeros([audio.shape[0], max_length])
|
||||
padded_audio[:, :audio.shape[1]] = audio
|
||||
if flatten:
|
||||
padded_audio = padded_audio.flatten()
|
||||
padded_audios.append(padded_audio)
|
||||
audio_lens.append(audio.shape[1])
|
||||
|
||||
padded_text = np.zeros([max_text_length])
|
||||
if is_training:
|
||||
padded_text[:len(text)] = text #ids
|
||||
else:
|
||||
padded_text[:len(text)] = [ord(t) for t in text] # string
|
||||
texts.append(padded_text)
|
||||
text_lens.append(len(text))
|
||||
|
||||
padded_audios = np.array(padded_audios).astype('float32')
|
||||
audio_lens = np.array(audio_lens).astype('int64')
|
||||
texts = np.array(texts).astype('int32')
|
||||
text_lens = np.array(text_lens).astype('int64')
|
||||
return padded_audios, texts, audio_lens, text_lens
|
||||
|
||||
loader = DataLoader(
|
||||
dataset,
|
||||
batch_sampler=batch_sampler,
|
||||
collate_fn=partial(padding_batch, is_training=is_training),
|
||||
num_workers=num_workers)
|
||||
return loader
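# A usage sketch with hypothetical file paths (not part of this diff):
#   loader = create_dataloader("data/manifest.dev", "data/vocab.txt",
#                              "data/mean_std.npz", batch_size=32,
#                              is_training=False)
#   for padded_audios, texts, audio_lens, text_lens in loader:
#       ...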
|
@ -0,0 +1,73 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
import numpy as np
|
||||
from collections import namedtuple
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = [
|
||||
"SpeechCollator",
|
||||
]
|
||||
|
||||
|
||||
class SpeechCollator():
|
||||
def __init__(self, padding_to=-1, is_training=True):
|
||||
"""
|
||||
Padding audio features with zeros to make them have the same shape (or
|
||||
a user-defined shape) within one batch.
|
||||
|
||||
If ``padding_to`` is -1, the maximum shape in the batch will be used
|
||||
as the target shape for padding. Otherwise, `padding_to` will be the
|
||||
target shape (only refers to the second axis).
|
||||
"""
|
||||
self._padding_to = padding_to
|
||||
self._is_training = is_training
|
||||
|
||||
def __call__(self, batch):
|
||||
new_batch = []
|
||||
# get target shape
|
||||
max_length = max([audio.shape[1] for audio, _ in batch])
|
||||
if self._padding_to != -1:
|
||||
if self._padding_to < max_length:
|
||||
raise ValueError("If padding_to is not -1, it should be larger "
|
||||
"than any instance's shape in the batch")
|
||||
max_length = self._padding_to
|
||||
max_text_length = max([len(text) for _, text in batch])
|
||||
# padding
|
||||
padded_audios = []
|
||||
audio_lens = []
|
||||
texts, text_lens = [], []
|
||||
for audio, text in batch:
|
||||
# audio
|
||||
padded_audio = np.zeros([audio.shape[0], max_length])
|
||||
padded_audio[:, :audio.shape[1]] = audio
|
||||
padded_audios.append(padded_audio)
|
||||
audio_lens.append(audio.shape[1])
|
||||
# text
|
||||
padded_text = np.zeros([max_text_length])
|
||||
if self._is_training:
|
||||
padded_text[:len(text)] = text # token ids
|
||||
else:
|
||||
padded_text[:len(text)] = [ord(t)
|
||||
for t in text] # string, unicode ord
|
||||
texts.append(padded_text)
|
||||
text_lens.append(len(text))
|
||||
|
||||
padded_audios = np.array(padded_audios).astype('float32')
|
||||
audio_lens = np.array(audio_lens).astype('int64')
|
||||
texts = np.array(texts).astype('int32')
|
||||
text_lens = np.array(text_lens).astype('int64')
|
||||
return padded_audios, texts, audio_lens, text_lens
|
@ -0,0 +1,206 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import math
|
||||
import random
|
||||
import tarfile
|
||||
import logging
|
||||
import numpy as np
|
||||
from collections import namedtuple
|
||||
from functools import partial
|
||||
|
||||
from paddle.io import Dataset
|
||||
|
||||
from deepspeech.frontend.utility import read_manifest
|
||||
from deepspeech.frontend.augmentor.augmentation import AugmentationPipeline
|
||||
from deepspeech.frontend.featurizer.speech_featurizer import SpeechFeaturizer
|
||||
from deepspeech.frontend.speech import SpeechSegment
|
||||
from deepspeech.frontend.normalizer import FeatureNormalizer
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = [
|
||||
"ManifestDataset",
|
||||
]
|
||||
|
||||
|
||||
class ManifestDataset(Dataset):
|
||||
def __init__(self,
|
||||
manifest_path,
|
||||
vocab_filepath,
|
||||
mean_std_filepath,
|
||||
augmentation_config='{}',
|
||||
max_duration=float('inf'),
|
||||
min_duration=0.0,
|
||||
stride_ms=10.0,
|
||||
window_ms=20.0,
|
||||
n_fft=None,
|
||||
max_freq=None,
|
||||
target_sample_rate=16000,
|
||||
specgram_type='linear',
|
||||
use_dB_normalization=True,
|
||||
target_dB=-20,
|
||||
random_seed=0,
|
||||
keep_transcription_text=False):
|
||||
"""Manifest Dataset
|
||||
|
||||
Args:
|
||||
manifest_path (str): manifest json file path
|
||||
vocab_filepath (str): vocab file path
|
||||
mean_std_filepath (str): mean and std file path, which suffix is *.npy
|
||||
augmentation_config (str, optional): augmentation json str. Defaults to '{}'.
|
||||
max_duration (float, optional): audio length in seconds must be less than this. Defaults to float('inf').
|
||||
min_duration (float, optional): audio length in seconds must be greater than this. Defaults to 0.0.
|
||||
stride_ms (float, optional): stride size in ms. Defaults to 10.0.
|
||||
window_ms (float, optional): window size in ms. Defaults to 20.0.
|
||||
n_fft (int, optional): fft points for rfft. Defaults to None.
|
||||
max_freq (int, optional): max cut freq. Defaults to None.
|
||||
target_sample_rate (int, optional): target sample rate which used for training. Defaults to 16000.
|
||||
specgram_type (str, optional): 'linear' or 'mfcc'. Defaults to 'linear'.
|
||||
use_dB_normalization (bool, optional): do dB normalization. Defaults to True.
|
||||
target_dB (int, optional): target dB. Defaults to -20.
|
||||
random_seed (int, optional): for random generator. Defaults to 0.
|
||||
keep_transcription_text (bool, optional): if True (used outside training), keep the raw transcription text instead of converting it to token ids. Defaults to False.
|
||||
"""
|
||||
super().__init__()
|
||||
|
||||
self._max_duration = max_duration
|
||||
self._min_duration = min_duration
|
||||
self._normalizer = FeatureNormalizer(mean_std_filepath)
|
||||
self._augmentation_pipeline = AugmentationPipeline(
|
||||
augmentation_config=augmentation_config, random_seed=random_seed)
|
||||
self._speech_featurizer = SpeechFeaturizer(
|
||||
vocab_filepath=vocab_filepath,
|
||||
specgram_type=specgram_type,
|
||||
stride_ms=stride_ms,
|
||||
window_ms=window_ms,
|
||||
n_fft=n_fft,
|
||||
max_freq=max_freq,
|
||||
target_sample_rate=target_sample_rate,
|
||||
use_dB_normalization=use_dB_normalization,
|
||||
target_dB=target_dB)
|
||||
self._rng = random.Random(random_seed)
|
||||
self._keep_transcription_text = keep_transcription_text
|
||||
# for caching tar files info
|
||||
self._local_data = namedtuple('local_data', ['tar2info', 'tar2object'])
|
||||
self._local_data.tar2info = {}
|
||||
self._local_data.tar2object = {}
|
||||
|
||||
# read manifest
|
||||
self._manifest = read_manifest(
|
||||
manifest_path=manifest_path,
|
||||
max_duration=self._max_duration,
|
||||
min_duration=self._min_duration)
|
||||
self._manifest.sort(key=lambda x: x["duration"])
|
||||
|
||||
@property
|
||||
def manifest(self):
|
||||
return self._manifest
|
||||
|
||||
@property
|
||||
def vocab_size(self):
|
||||
"""Return the vocabulary size.
|
||||
|
||||
:return: Vocabulary size.
|
||||
:rtype: int
|
||||
"""
|
||||
return self._speech_featurizer.vocab_size
|
||||
|
||||
@property
|
||||
def vocab_list(self):
|
||||
"""Return the vocabulary in list.
|
||||
|
||||
:return: Vocabulary in list.
|
||||
:rtype: list
|
||||
"""
|
||||
return self._speech_featurizer.vocab_list
|
||||
|
||||
@property
|
||||
def feature_size(self):
|
||||
return self._speech_featurizer.feature_size
|
||||
|
||||
def _parse_tar(self, file):
|
||||
"""Parse a tar file to get a tarfile object
|
||||
and a map from member names to tarinfo objects.
|
||||
"""
|
||||
result = {}
|
||||
f = tarfile.open(file)
|
||||
for tarinfo in f.getmembers():
|
||||
result[tarinfo.name] = tarinfo
|
||||
return f, result
|
||||
|
||||
def _subfile_from_tar(self, file):
|
||||
"""Get subfile object from tar.
|
||||
|
||||
It returns a subfile object from the tar file
and caches the tar file info for the next reading request.
|
||||
"""
|
||||
tarpath, filename = file.split(':', 1)[1].split('#', 1)
|
||||
if 'tar2info' not in self._local_data.__dict__:
|
||||
self._local_data.tar2info = {}
|
||||
if 'tar2object' not in self._local_data.__dict__:
|
||||
self._local_data.tar2object = {}
|
||||
if tarpath not in self._local_data.tar2info:
|
||||
object, infoes = self._parse_tar(tarpath)
|
||||
self._local_data.tar2info[tarpath] = infoes
|
||||
self._local_data.tar2object[tarpath] = object
|
||||
return self._local_data.tar2object[tarpath].extractfile(
|
||||
self._local_data.tar2info[tarpath][filename])
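# e.g. an audio_filepath of the form "tar:/path/to/archive.tar#utt_0001.wav"
# (hypothetical names) resolves to that member file inside the tar archive,
# while the opened tarfile object and its member index are cached for reuse.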
|
||||
|
||||
def process_utterance(self, audio_file, transcript):
|
||||
"""Load, augment, featurize and normalize for speech data.
|
||||
|
||||
:param audio_file: Filepath or file object of audio file.
|
||||
:type audio_file: str | file
|
||||
:param transcript: Transcription text.
|
||||
:type transcript: str
|
||||
:return: Tuple of audio feature tensor and data of transcription part,
|
||||
where transcription part could be token ids or text.
|
||||
:rtype: tuple of (2darray, list)
|
||||
"""
|
||||
if isinstance(audio_file, str) and audio_file.startswith('tar:'):
|
||||
speech_segment = SpeechSegment.from_file(
|
||||
self._subfile_from_tar(audio_file), transcript)
|
||||
else:
|
||||
speech_segment = SpeechSegment.from_file(audio_file, transcript)
|
||||
self._augmentation_pipeline.transform_audio(speech_segment)
|
||||
specgram, transcript_part = self._speech_featurizer.featurize(
|
||||
speech_segment, self._keep_transcription_text)
|
||||
specgram = self._normalizer.apply(specgram)
|
||||
return specgram, transcript_part
|
||||
|
||||
def _instance_reader_creator(self, manifest):
|
||||
"""
|
||||
Instance reader creator. Create a callable function to produce
|
||||
instances of data.
|
||||
|
||||
Instance: a tuple of ndarray of audio spectrogram and a list of
|
||||
token indices for transcript.
|
||||
"""
|
||||
|
||||
def reader():
|
||||
for instance in manifest:
|
||||
inst = self.process_utterance(instance["audio_filepath"],
|
||||
instance["text"])
|
||||
yield inst
|
||||
|
||||
return reader
|
||||
|
||||
def __len__(self):
|
||||
return len(self._manifest)
|
||||
|
||||
def __getitem__(self, idx):
|
||||
instance = self._manifest[idx]
|
||||
return self.process_utterance(instance["audio_filepath"],
|
||||
instance["text"])
|
@ -0,0 +1,256 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import math
|
||||
import random
|
||||
import tarfile
|
||||
import logging
|
||||
import numpy as np
|
||||
from collections import namedtuple
|
||||
from functools import partial
|
||||
|
||||
import paddle
|
||||
from paddle.io import BatchSampler
|
||||
from paddle.io import DistributedBatchSampler
|
||||
from paddle import distributed as dist
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = [
|
||||
"SortagradDistributedBatchSampler",
|
||||
"SortagradBatchSampler",
|
||||
]
|
||||
|
||||
|
||||
def _batch_shuffle(indices, batch_size, epoch, clipped=False):
|
||||
"""Put similarly-sized instances into minibatches for better efficiency
|
||||
and make a batch-wise shuffle.
|
||||
|
||||
1. Sort the audio clips by duration.
|
||||
2. Generate a random number `k`, k in [0, batch_size).
|
||||
3. Randomly shift `k` instances in order to create different batches
|
||||
for different epochs. Create minibatches.
|
||||
4. Shuffle the minibatches.
|
||||
|
||||
:param indices: indexes. List of int.
|
||||
:type indices: list
|
||||
:param batch_size: Batch size. This size is also used to generate
|
||||
a random number for batch shuffle.
|
||||
:type batch_size: int
|
||||
:param clipped: Whether to clip the heading (small shift) and trailing
|
||||
(incomplete batch) instances.
|
||||
:type clipped: bool
|
||||
:return: Batch shuffled manifest.
|
||||
:rtype: list
|
||||
"""
|
||||
rng = np.random.RandomState(epoch)
|
||||
shift_len = rng.randint(0, batch_size - 1)
|
||||
batch_indices = list(zip(* [iter(indices[shift_len:])] * batch_size))
|
||||
rng.shuffle(batch_indices)
|
||||
batch_indices = [item for batch in batch_indices for item in batch]
|
||||
assert (clipped == False)
|
||||
if not clipped:
|
||||
res_len = len(indices) - shift_len - len(batch_indices)
|
||||
# when res_len is 0, will return whole list, len(List[-0:]) = len(List[:])
|
||||
if res_len != 0:
|
||||
batch_indices.extend(indices[-res_len:])
|
||||
batch_indices.extend(indices[0:shift_len])
|
||||
assert len(indices) == len(
|
||||
batch_indices
|
||||
), f"_batch_shuffle: {len(indices)} : {len(batch_indices)} : {res_len} - {shift_len}"
|
||||
return batch_indices
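# A minimal usage sketch (added for illustration, not part of the original
# diff). It only assumes a plain list of indices, as the samplers below pass
# in, and checks that batch shuffling is a permutation when clipped=False.
def _example_batch_shuffle_usage():
    indices = list(range(10))
    shuffled = _batch_shuffle(indices, batch_size=3, epoch=0, clipped=False)
    # with clipped=False the head shift and the trailing remainder are
    # re-appended, so no index is dropped
    assert sorted(shuffled) == indices
    return shuffled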
|
||||
|
||||
|
||||
class SortagradDistributedBatchSampler(DistributedBatchSampler):
|
||||
def __init__(self,
|
||||
dataset,
|
||||
batch_size,
|
||||
num_replicas=None,
|
||||
rank=None,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
sortagrad=False,
|
||||
shuffle_method="batch_shuffle"):
|
||||
"""Sortagrad Sampler for multi gpus.
|
||||
|
||||
Args:
|
||||
dataset (paddle.io.Dataset):
|
||||
batch_size (int): batch size for one gpu
|
||||
num_replicas (int, optional): world size or numbers of gpus. Defaults to None.
|
||||
rank (int, optional): rank id. Defaults to None.
|
||||
shuffle (bool, optional): Whether to shuffle the data. Defaults to False.
drop_last (bool, optional): Whether to drop the last batch when it is smaller than batch_size. Defaults to False.
sortagrad (bool, optional): If True, sort by utterance duration (SortaGrad) in the first epoch, then shuffle as usual in later epochs. Defaults to False.
shuffle_method (str, optional): shuffle method, "instance_shuffle" or "batch_shuffle". Defaults to "batch_shuffle".
|
||||
"""
|
||||
super().__init__(dataset, batch_size, num_replicas, rank, shuffle,
|
||||
drop_last)
|
||||
self._sortagrad = sortagrad
|
||||
self._shuffle_method = shuffle_method
|
||||
|
||||
def __iter__(self):
|
||||
num_samples = len(self.dataset)
|
||||
indices = np.arange(num_samples).tolist()
|
||||
indices += indices[:(self.total_size - len(indices))]
|
||||
assert len(indices) == self.total_size
|
||||
|
||||
# sort (by duration) or batch-wise shuffle the manifest
|
||||
if self.shuffle:
|
||||
if self.epoch == 0 and self._sortagrad:
|
||||
logger.info(
|
||||
f'rank: {dist.get_rank()} dataset sortagrad! epoch {self.epoch}'
|
||||
)
|
||||
else:
|
||||
logger.info(
|
||||
f'rank: {dist.get_rank()} dataset shuffle! epoch {self.epoch}'
|
||||
)
|
||||
if self._shuffle_method == "batch_shuffle":
|
||||
# use `batch_size * nranks` as the shuffle unit, otherwise the loss becomes
# unstable and gradients may turn nan/inf: different example lengths across
# the per-rank batches cause very different losses on different ranks,
# e.g. rank0 max length 20 while rank3 max length 1000
|
||||
indices = _batch_shuffle(
|
||||
indices,
|
||||
self.batch_size * self.nranks,
|
||||
self.epoch,
|
||||
clipped=False)
|
||||
elif self._shuffle_method == "instance_shuffle":
|
||||
np.random.RandomState(self.epoch).shuffle(indices)
|
||||
else:
|
||||
raise ValueError("Unknown shuffle method %s." %
|
||||
self._shuffle_method)
|
||||
assert len(
|
||||
indices
|
||||
) == self.total_size, f"batch shuffle examples error: {len(indices)} : {self.total_size}"
|
||||
|
||||
# slice `self.batch_size` examples by rank id
|
||||
def _get_indices_by_batch_size(indices):
|
||||
subsampled_indices = []
|
||||
last_batch_size = self.total_size % (self.batch_size * self.nranks)
|
||||
assert last_batch_size % self.nranks == 0
|
||||
last_local_batch_size = last_batch_size // self.nranks
|
||||
|
||||
for i in range(self.local_rank * self.batch_size,
|
||||
len(indices) - last_batch_size,
|
||||
self.batch_size * self.nranks):
|
||||
subsampled_indices.extend(indices[i:i + self.batch_size])
|
||||
|
||||
indices = indices[len(indices) - last_batch_size:]
|
||||
subsampled_indices.extend(
|
||||
indices[self.local_rank * last_local_batch_size:(
|
||||
self.local_rank + 1) * last_local_batch_size])
|
||||
return subsampled_indices
|
||||
|
||||
if self.nranks > 1:
|
||||
indices = _get_indices_by_batch_size(indices)
|
||||
|
||||
assert len(indices) == self.num_samples
|
||||
_sample_iter = iter(indices)
|
||||
|
||||
batch_indices = []
|
||||
for idx in _sample_iter:
|
||||
batch_indices.append(idx)
|
||||
if len(batch_indices) == self.batch_size:
|
||||
logger.info(
|
||||
f"rank: {dist.get_rank()} batch index: {batch_indices} ")
|
||||
yield batch_indices
|
||||
batch_indices = []
|
||||
if not self.drop_last and len(batch_indices) > 0:
|
||||
yield batch_indices
|
||||
|
||||
def __len__(self):
|
||||
num_samples = self.num_samples
|
||||
num_samples += int(not self.drop_last) * (self.batch_size - 1)
|
||||
return num_samples // self.batch_size
|
||||
|
||||
|
||||
class SortagradBatchSampler(BatchSampler):
|
||||
def __init__(self,
|
||||
dataset,
|
||||
batch_size,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
sortagrad=False,
|
||||
shuffle_method="batch_shuffle"):
|
||||
"""Sortagrad Sampler for one gpu.
|
||||
|
||||
Args:
|
||||
dataset (paddle.io.Dataset):
|
||||
batch_size (int): batch size for one gpu
|
||||
shuffle (bool, optional): Whether to shuffle the data. Defaults to False.
drop_last (bool, optional): Whether to drop the last batch when it is smaller than batch_size. Defaults to False.
sortagrad (bool, optional): If True, sort by utterance duration (SortaGrad) in the first epoch, then shuffle as usual in later epochs. Defaults to False.
shuffle_method (str, optional): shuffle method, "instance_shuffle" or "batch_shuffle". Defaults to "batch_shuffle".
|
||||
"""
|
||||
self.dataset = dataset
|
||||
|
||||
assert isinstance(batch_size, int) and batch_size > 0, \
|
||||
"batch_size should be a positive integer"
|
||||
self.batch_size = batch_size
|
||||
assert isinstance(shuffle, bool), \
|
||||
"shuffle should be a boolean value"
|
||||
self.shuffle = shuffle
|
||||
assert isinstance(drop_last, bool), \
|
||||
"drop_last should be a boolean number"
|
||||
|
||||
self.drop_last = drop_last
|
||||
self.epoch = 0
|
||||
self.num_samples = int(math.ceil(len(self.dataset) * 1.0))
|
||||
self.total_size = self.num_samples
|
||||
self._sortagrad = sortagrad
|
||||
self._shuffle_method = shuffle_method
|
||||
|
||||
def __iter__(self):
|
||||
num_samples = len(self.dataset)
|
||||
indices = np.arange(num_samples).tolist()
|
||||
indices += indices[:(self.total_size - len(indices))]
|
||||
assert len(indices) == self.total_size
|
||||
|
||||
# sort (by duration) or batch-wise shuffle the manifest
|
||||
if self.shuffle:
|
||||
if self.epoch == 0 and self._sortagrad:
|
||||
logger.info(f'dataset sortagrad! epoch {self.epoch}')
|
||||
else:
|
||||
logger.info(f'dataset shuffle! epoch {self.epoch}')
|
||||
if self._shuffle_method == "batch_shuffle":
|
||||
indices = _batch_shuffle(
|
||||
indices, self.batch_size, self.epoch, clipped=False)
|
||||
elif self._shuffle_method == "instance_shuffle":
|
||||
np.random.RandomState(self.epoch).shuffle(indices)
|
||||
else:
|
||||
raise ValueError("Unknown shuffle method %s." %
|
||||
self._shuffle_method)
|
||||
assert len(
|
||||
indices
|
||||
) == self.total_size, f"batch shuffle examples error: {len(indices)} : {self.total_size}"
|
||||
|
||||
assert len(indices) == self.num_samples
|
||||
_sample_iter = iter(indices)
|
||||
|
||||
batch_indices = []
|
||||
for idx in _sample_iter:
|
||||
batch_indices.append(idx)
|
||||
if len(batch_indices) == self.batch_size:
|
||||
logger.info(
|
||||
f"rank: {dist.get_rank()} batch index: {batch_indices} ")
|
||||
yield batch_indices
|
||||
batch_indices = []
|
||||
if not self.drop_last and len(batch_indices) > 0:
|
||||
yield batch_indices
|
||||
|
||||
self.epoch += 1
|
||||
|
||||
def __len__(self):
|
||||
num_samples = self.num_samples
|
||||
num_samples += int(not self.drop_last) * (self.batch_size - 1)
|
||||
return num_samples // self.batch_size
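# Hedged usage sketch (illustration only): `dataset` and `collate_fn` are
# assumptions standing in for the manifest dataset above; only the sampler
# arguments are taken from this file.
def _example_sortagrad_sampler_usage(dataset, collate_fn):
    sampler = SortagradBatchSampler(
        dataset,
        batch_size=4,
        shuffle=True,
        drop_last=False,
        sortagrad=True,
        shuffle_method="batch_shuffle")
    return paddle.io.DataLoader(
        dataset, batch_sampler=sampler, collate_fn=collate_fn)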
|
@ -0,0 +1,442 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import math
|
||||
import collections
|
||||
import numpy as np
|
||||
import logging
|
||||
from typing import Optional
|
||||
from yacs.config import CfgNode
|
||||
|
||||
import paddle
|
||||
from paddle import nn
|
||||
from paddle.nn import functional as F
|
||||
from paddle.nn import initializer as I
|
||||
|
||||
from deepspeech.modules.conv import ConvStack
|
||||
from deepspeech.modules.rnn import RNNStack
|
||||
from deepspeech.modules.mask import sequence_mask
|
||||
from deepspeech.modules.activation import brelu
|
||||
from deepspeech.utils import checkpoint
|
||||
from deepspeech.utils import layer_tools
|
||||
from deepspeech.decoders.swig_wrapper import Scorer
|
||||
from deepspeech.decoders.swig_wrapper import ctc_greedy_decoder
|
||||
from deepspeech.decoders.swig_wrapper import ctc_beam_search_decoder_batch
|
||||
|
||||
from deepspeech.modules.loss import CTCLoss
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = ['DeepSpeech2Model']
|
||||
|
||||
|
||||
class CRNNEncoder(nn.Layer):
|
||||
def __init__(self,
|
||||
feat_size,
|
||||
dict_size,
|
||||
num_conv_layers=2,
|
||||
num_rnn_layers=3,
|
||||
rnn_size=1024,
|
||||
use_gru=False,
|
||||
share_rnn_weights=True):
|
||||
super().__init__()
|
||||
self.rnn_size = rnn_size
|
||||
self.feat_size = feat_size # 161 for linear
|
||||
self.dict_size = dict_size
|
||||
|
||||
self.conv = ConvStack(feat_size, num_conv_layers)
|
||||
|
||||
i_size = self.conv.output_height # H after conv stack
|
||||
self.rnn = RNNStack(
|
||||
i_size=i_size,
|
||||
h_size=rnn_size,
|
||||
num_stacks=num_rnn_layers,
|
||||
use_gru=use_gru,
|
||||
share_rnn_weights=share_rnn_weights)
|
||||
|
||||
@property
|
||||
def output_size(self):
|
||||
return self.rnn_size * 2
|
||||
|
||||
def forward(self, audio, audio_len):
|
||||
"""
|
||||
audio: shape [B, D, T]
|
||||
text: shape [B, T]
|
||||
audio_len: shape [B]
|
||||
text_len: shape [B]
|
||||
"""
|
||||
"""Compute Encoder outputs
|
||||
|
||||
Args:
|
||||
audio (Tensor): [B, D, T]
|
||||
text (Tensor): [B, T]
|
||||
audio_len (Tensor): [B]
|
||||
text_len (Tensor): [B]
|
||||
Returns:
|
||||
x (Tensor): encoder outputs, [B, T, D]
|
||||
x_lens (Tensor): encoder length, [B]
|
||||
"""
|
||||
# [B, D, T] -> [B, C=1, D, T]
|
||||
x = audio.unsqueeze(1)
|
||||
x_lens = audio_len
|
||||
|
||||
# convolution group
|
||||
x, x_lens = self.conv(x, x_lens)
|
||||
|
||||
# convert data from convolution feature map to sequence of vectors
|
||||
#B, C, D, T = paddle.shape(x) # not work under jit
|
||||
x = x.transpose([0, 3, 1, 2]) #[B, T, C, D]
|
||||
#x = x.reshape([B, T, C * D]) #[B, T, C*D] # not work under jit
|
||||
x = x.reshape([0, 0, -1]) #[B, T, C*D]
|
||||
|
||||
# remove padding part
|
||||
x, x_lens = self.rnn(x, x_lens) #[B, T, D]
|
||||
return x, x_lens
|
||||
|
||||
|
||||
class CTCDecoder(nn.Layer):
|
||||
def __init__(self, enc_n_units, vocab_size):
|
||||
super().__init__()
|
||||
self.blank_id = vocab_size
|
||||
self.output = nn.Linear(enc_n_units,
|
||||
vocab_size + 1) # blank id is last id
|
||||
self.criterion = CTCLoss(self.blank_id)
|
||||
|
||||
self._ext_scorer = None
|
||||
|
||||
def forward(self, eout, eout_lens, texts, texts_len):
|
||||
"""Compute CTC Loss
|
||||
|
||||
Args:
|
||||
eout (Tensor):
|
||||
eout_lens (Tensor):
|
||||
texts (Tensor):
texts_len (Tensor):
Returns:
    loss (Tensor): [1]
|
||||
"""
|
||||
logits = self.output(eout)
|
||||
loss = self.criterion(logits, texts, eout_lens, texts_len)
|
||||
return loss
|
||||
|
||||
def probs(self, eouts, temperature=1.):
|
||||
"""Get CTC probabilities.
|
||||
Args:
|
||||
eouts (FloatTensor): `[B, T, enc_units]`
|
||||
Returns:
|
||||
probs (FloatTensor): `[B, T, vocab]`
|
||||
"""
|
||||
return F.softmax(self.output(eouts) / temperature, axis=-1)
|
||||
|
||||
def scores(self, eouts, temperature=1.):
|
||||
"""Get log-scale CTC probabilities.
|
||||
Args:
|
||||
eouts (FloatTensor): `[B, T, enc_units]`
|
||||
Returns:
|
||||
log_probs (FloatTensor): `[B, T, vocab]`
|
||||
"""
|
||||
return F.log_softmax(self.output(eouts) / temperature, axis=-1)
|
||||
|
||||
def _decode_batch_greedy(self, probs_split, vocab_list):
|
||||
"""Decode by best path for a batch of probs matrix input.
|
||||
:param probs_split: List of 2-D probability matrices, each consisting
of prob vectors for one speech utterance.
:type probs_split: List of matrix
|
||||
:param vocab_list: List of tokens in the vocabulary, for decoding.
|
||||
:type vocab_list: list
|
||||
:return: List of transcription texts.
|
||||
:rtype: List of str
|
||||
"""
|
||||
results = []
|
||||
for i, probs in enumerate(probs_split):
|
||||
output_transcription = ctc_greedy_decoder(
|
||||
probs_seq=probs, vocabulary=vocab_list)
|
||||
results.append(output_transcription)
|
||||
return results
|
||||
|
||||
def _init_ext_scorer(self, beam_alpha, beam_beta, language_model_path,
|
||||
vocab_list):
|
||||
"""Initialize the external scorer.
|
||||
:param beam_alpha: Parameter associated with language model.
|
||||
:type beam_alpha: float
|
||||
:param beam_beta: Parameter associated with word count.
|
||||
:type beam_beta: float
|
||||
:param language_model_path: Filepath for language model. If it is
|
||||
empty, the external scorer will be set to
|
||||
None, and the decoding method will be pure
|
||||
beam search without scorer.
|
||||
:type language_model_path: str|None
|
||||
:param vocab_list: List of tokens in the vocabulary, for decoding.
|
||||
:type vocab_list: list
|
||||
"""
|
||||
# init once
|
||||
if self._ext_scorer is not None:
|
||||
return
|
||||
|
||||
if language_model_path != '':
|
||||
logger.info("begin to initialize the external scorer "
|
||||
"for decoding")
|
||||
self._ext_scorer = Scorer(beam_alpha, beam_beta,
|
||||
language_model_path, vocab_list)
|
||||
lm_char_based = self._ext_scorer.is_character_based()
|
||||
lm_max_order = self._ext_scorer.get_max_order()
|
||||
lm_dict_size = self._ext_scorer.get_dict_size()
|
||||
logger.info("language model: "
|
||||
"is_character_based = %d," % lm_char_based +
|
||||
" max_order = %d," % lm_max_order + " dict_size = %d" %
|
||||
lm_dict_size)
|
||||
logger.info("end initializing scorer")
|
||||
else:
|
||||
self._ext_scorer = None
|
||||
logger.info("no language model provided, "
|
||||
"decoding by pure beam search without scorer.")
|
||||
|
||||
def _decode_batch_beam_search(self, probs_split, beam_alpha, beam_beta,
|
||||
beam_size, cutoff_prob, cutoff_top_n,
|
||||
vocab_list, num_processes):
|
||||
"""Decode by beam search for a batch of probs matrix input.
|
||||
:param probs_split: List of 2-D probability matrices, each consisting
of prob vectors for one speech utterance.
:type probs_split: List of matrix
|
||||
:param beam_alpha: Parameter associated with language model.
|
||||
:type beam_alpha: float
|
||||
:param beam_beta: Parameter associated with word count.
|
||||
:type beam_beta: float
|
||||
:param beam_size: Width for Beam search.
|
||||
:type beam_size: int
|
||||
:param cutoff_prob: Cutoff probability in pruning,
|
||||
default 1.0, no pruning.
|
||||
:type cutoff_prob: float
|
||||
:param cutoff_top_n: Cutoff number in pruning, only top cutoff_top_n
|
||||
characters with highest probs in vocabulary will be
|
||||
used in beam search, default 40.
|
||||
:type cutoff_top_n: int
|
||||
:param vocab_list: List of tokens in the vocabulary, for decoding.
|
||||
:type vocab_list: list
|
||||
:param num_processes: Number of processes (CPU) for decoder.
|
||||
:type num_processes: int
|
||||
:return: List of transcription texts.
|
||||
:rtype: List of str
|
||||
"""
|
||||
if self._ext_scorer is not None:
|
||||
self._ext_scorer.reset_params(beam_alpha, beam_beta)
|
||||
|
||||
# beam search decode
|
||||
num_processes = min(num_processes, len(probs_split))
|
||||
beam_search_results = ctc_beam_search_decoder_batch(
|
||||
probs_split=probs_split,
|
||||
vocabulary=vocab_list,
|
||||
beam_size=beam_size,
|
||||
num_processes=num_processes,
|
||||
ext_scoring_func=self._ext_scorer,
|
||||
cutoff_prob=cutoff_prob,
|
||||
cutoff_top_n=cutoff_top_n)
|
||||
|
||||
results = [result[0][1] for result in beam_search_results]
|
||||
return results
|
||||
|
||||
def init_decode(self, beam_alpha, beam_beta, lang_model_path, vocab_list,
|
||||
decoding_method):
|
||||
if decoding_method == "ctc_beam_search":
|
||||
self._init_ext_scorer(beam_alpha, beam_beta, lang_model_path,
|
||||
vocab_list)
|
||||
|
||||
def decode_probs(self, probs, logits_lens, vocab_list, decoding_method,
|
||||
lang_model_path, beam_alpha, beam_beta, beam_size,
|
||||
cutoff_prob, cutoff_top_n, num_processes):
|
||||
""" probs: activation after softmax
|
||||
logits_len: audio output lens
|
||||
"""
|
||||
probs_split = [probs[i, :l, :] for i, l in enumerate(logits_lens)]
|
||||
if decoding_method == "ctc_greedy":
|
||||
result_transcripts = self._decode_batch_greedy(
|
||||
probs_split=probs_split, vocab_list=vocab_list)
|
||||
elif decoding_method == "ctc_beam_search":
|
||||
result_transcripts = self._decode_batch_beam_search(
|
||||
probs_split=probs_split,
|
||||
beam_alpha=beam_alpha,
|
||||
beam_beta=beam_beta,
|
||||
beam_size=beam_size,
|
||||
cutoff_prob=cutoff_prob,
|
||||
cutoff_top_n=cutoff_top_n,
|
||||
vocab_list=vocab_list,
|
||||
num_processes=num_processes)
|
||||
else:
|
||||
raise ValueError(f"Not support: {decoding_method}")
|
||||
return result_transcripts
|
||||
|
||||
|
||||
class DeepSpeech2Model(nn.Layer):
|
||||
"""The DeepSpeech2 network structure.
|
||||
|
||||
:param audio_data: Audio spectrogram data layer.
|
||||
:type audio_data: Variable
|
||||
:param text_data: Transcription text data layer.
|
||||
:type text_data: Variable
|
||||
:param audio_len: Valid sequence length data layer.
|
||||
:type audio_len: Variable
|
||||
:param masks: Masks data layer to reset padding.
|
||||
:type masks: Variable
|
||||
:param dict_size: Dictionary size for tokenized transcription.
|
||||
:type dict_size: int
|
||||
:param num_conv_layers: Number of stacking convolution layers.
|
||||
:type num_conv_layers: int
|
||||
:param num_rnn_layers: Number of stacking RNN layers.
|
||||
:type num_rnn_layers: int
|
||||
:param rnn_size: RNN layer size (dimension of RNN cells).
|
||||
:type rnn_size: int
|
||||
:param use_gru: Use gru if set True. Use simple rnn if set False.
|
||||
:type use_gru: bool
|
||||
:param share_rnn_weights: Whether to share input-hidden weights between
|
||||
forward and backward direction RNNs.
|
||||
It is only available when use_gru=False.
|
||||
:type share_rnn_weights: bool
|
||||
:return: A tuple of an output unnormalized log probability layer (
|
||||
before softmax) and a ctc cost layer.
|
||||
:rtype: tuple of LayerOutput
|
||||
"""
|
||||
|
||||
@classmethod
|
||||
def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
|
||||
default = CfgNode(
|
||||
dict(
|
||||
num_conv_layers=2, #Number of stacking convolution layers.
|
||||
num_rnn_layers=3, #Number of stacking RNN layers.
|
||||
rnn_layer_size=1024, #RNN layer size (number of RNN cells).
|
||||
use_gru=True, #Use gru if set True. Use simple rnn if set False.
|
||||
share_rnn_weights=True #Whether to share input-hidden weights between forward and backward directional RNNs. Notice that for GRU, weight sharing is not supported.
|
||||
))
|
||||
if config is not None:
|
||||
config.merge_from_other_cfg(default)
|
||||
return default
|
||||
|
||||
def __init__(self,
|
||||
feat_size,
|
||||
dict_size,
|
||||
num_conv_layers=2,
|
||||
num_rnn_layers=3,
|
||||
rnn_size=1024,
|
||||
use_gru=False,
|
||||
share_rnn_weights=True):
|
||||
super().__init__()
|
||||
self.encoder = CRNNEncoder(
|
||||
feat_size=feat_size,
|
||||
dict_size=dict_size,
|
||||
num_conv_layers=num_conv_layers,
|
||||
num_rnn_layers=num_rnn_layers,
|
||||
rnn_size=rnn_size,
|
||||
use_gru=use_gru,
|
||||
share_rnn_weights=share_rnn_weights)
|
||||
assert (self.encoder.output_size == rnn_size * 2)
|
||||
self.decoder = CTCDecoder(
|
||||
enc_n_units=self.encoder.output_size, vocab_size=dict_size)
|
||||
|
||||
def forward(self, audio, text, audio_len, text_len):
|
||||
"""Compute Model loss
|
||||
|
||||
Args:
|
||||
audio (Tensor): [B, D, T]
|
||||
text (Tensor): [B, T]
|
||||
audio_len (Tensor): [B]
|
||||
text_len (Tensor): [B]
|
||||
|
||||
Returns:
|
||||
loss (Tensor): [1]
|
||||
"""
|
||||
|
||||
eouts, eouts_len = self.encoder(audio, audio_len)
|
||||
loss = self.decoder(eouts, eouts_len, text, text_len)
|
||||
return loss
|
||||
|
||||
@paddle.no_grad()
|
||||
def decode(self, audio, audio_len, vocab_list, decoding_method,
|
||||
lang_model_path, beam_alpha, beam_beta, beam_size, cutoff_prob,
|
||||
cutoff_top_n, num_processes):
|
||||
# init once
|
||||
# decoders only accept string encoded in utf-8
|
||||
self.decoder.init_decode(
|
||||
beam_alpha=beam_alpha,
|
||||
beam_beta=beam_beta,
|
||||
lang_model_path=lang_model_path,
|
||||
vocab_list=vocab_list,
|
||||
decoding_method=decoding_method)
|
||||
|
||||
eouts, eouts_len = self.encoder(audio, audio_len)
|
||||
probs = self.decoder.probs(eouts)
|
||||
return self.decoder.decode_probs(
|
||||
probs.numpy(), eouts_len, vocab_list, decoding_method,
|
||||
lang_model_path, beam_alpha, beam_beta, beam_size, cutoff_prob,
|
||||
cutoff_top_n, num_processes)
|
||||
|
||||
@classmethod
|
||||
def from_pretrained(cls, dataset, config, checkpoint_path):
|
||||
"""Build a DeepSpeech2Model model from a pretrained model.
|
||||
Parameters
|
||||
----------
|
||||
dataset: paddle.io.Dataset
|
||||
|
||||
config: yacs.config.CfgNode
|
||||
model configs
|
||||
|
||||
checkpoint_path: Path or str
|
||||
the path of pretrained model checkpoint, without extension name
|
||||
|
||||
Returns
|
||||
-------
|
||||
DeepSpeech2Model
|
||||
The model built from pretrained result.
|
||||
"""
|
||||
model = cls(feat_size=dataset.feature_size,
|
||||
dict_size=dataset.vocab_size,
|
||||
num_conv_layers=config.model.num_conv_layers,
|
||||
num_rnn_layers=config.model.num_rnn_layers,
|
||||
rnn_size=config.model.rnn_layer_size,
|
||||
use_gru=config.model.use_gru,
|
||||
share_rnn_weights=config.model.share_rnn_weights)
|
||||
checkpoint.load_parameters(model, checkpoint_path=checkpoint_path)
|
||||
layer_tools.summary(model)
|
||||
return model
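# Hedged usage sketch (illustration only): `dataset` and `config` are
# assumptions that just need the attributes read above; the checkpoint path
# below is hypothetical.
def _example_from_pretrained(dataset, config):
    model = DeepSpeech2Model.from_pretrained(
        dataset, config, checkpoint_path="exp/checkpoints/step-10000")
    model.eval()
    return model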
|
||||
|
||||
|
||||
class DeepSpeech2InferModel(DeepSpeech2Model):
|
||||
def __init__(self,
|
||||
feat_size,
|
||||
dict_size,
|
||||
num_conv_layers=2,
|
||||
num_rnn_layers=3,
|
||||
rnn_size=1024,
|
||||
use_gru=False,
|
||||
share_rnn_weights=True):
|
||||
super().__init__(
|
||||
feat_size=feat_size,
|
||||
dict_size=dict_size,
|
||||
num_conv_layers=num_conv_layers,
|
||||
num_rnn_layers=num_rnn_layers,
|
||||
rnn_size=rnn_size,
|
||||
use_gru=use_gru,
|
||||
share_rnn_weights=share_rnn_weights)
|
||||
|
||||
def forward(self, audio, audio_len):
|
||||
"""export model function
|
||||
|
||||
Args:
|
||||
audio (Tensor): [B, D, T]
|
||||
audio_len (Tensor): [B]
|
||||
|
||||
Returns:
|
||||
probs: probs after softmax
|
||||
"""
|
||||
eouts, eouts_len = self.encoder(audio, audio_len)
|
||||
probs = self.decoder.probs(eouts)
|
||||
return probs
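# Hedged export sketch (illustration only), roughly mirroring how a jit model
# could be saved from DeepSpeech2InferModel; the input specs, feature dim and
# output path below are assumptions, not taken from this diff.
def _example_export_infer_model(infer_model, feat_dim=161):
    from paddle.static import InputSpec
    static_model = paddle.jit.to_static(
        infer_model,
        input_spec=[
            InputSpec(shape=[None, feat_dim, None], dtype='float32'),  # audio
            InputSpec(shape=[None], dtype='int64'),  # audio_len
        ])
    paddle.jit.save(static_model, "exp/deepspeech2_export")
    return static_model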
|
@ -0,0 +1,13 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
@ -0,0 +1,32 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
import numpy as np
|
||||
|
||||
import paddle
|
||||
from paddle import nn
|
||||
from paddle.nn import functional as F
|
||||
from paddle.nn import initializer as I
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = ['brelu']
|
||||
|
||||
|
||||
def brelu(x, t_min=0.0, t_max=24.0, name=None):
|
||||
# paddle.to_tensor is dygraph_only, so it can not be used under JIT
|
||||
t_min = paddle.full(shape=[1], fill_value=t_min, dtype='float32')
|
||||
t_max = paddle.full(shape=[1], fill_value=t_max, dtype='float32')
|
||||
return x.maximum(t_min).minimum(t_max)
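# Small hedged check (illustration only): brelu clamps values into
# [t_min, t_max]; the inputs below are arbitrary.
def _example_brelu():
    x = paddle.to_tensor([-3.0, 5.0, 30.0])
    return brelu(x)  # expected: [0.0, 5.0, 24.0]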
|
@ -0,0 +1,147 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
|
||||
import paddle
|
||||
from paddle import nn
|
||||
from paddle.nn import functional as F
|
||||
from paddle.nn import initializer as I
|
||||
|
||||
from deepspeech.modules.mask import sequence_mask
|
||||
from deepspeech.modules.activation import brelu
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = ['ConvStack']
|
||||
|
||||
|
||||
class ConvBn(nn.Layer):
|
||||
"""Convolution layer with batch normalization.
|
||||
|
||||
:param kernel_size: The x dimension of a filter kernel. Or input a tuple for
|
||||
two image dimension.
|
||||
:type kernel_size: int|tuple|list
|
||||
:param num_channels_in: Number of input channels.
|
||||
:type num_channels_in: int
|
||||
:param num_channels_out: Number of output channels.
|
||||
:type num_channels_out: int
|
||||
:param stride: The x dimension of the stride. Or input a tuple for two
|
||||
image dimension.
|
||||
:type stride: int|tuple|list
|
||||
:param padding: The x dimension of the padding. Or input a tuple for two
|
||||
image dimension.
|
||||
:type padding: int|tuple|list
|
||||
:param act: Activation type, relu|brelu
|
||||
:type act: string
|
||||
:return: Batch norm layer after convolution layer.
|
||||
:rtype: Variable
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self, num_channels_in, num_channels_out, kernel_size, stride,
|
||||
padding, act):
|
||||
|
||||
super().__init__()
|
||||
assert len(kernel_size) == 2
|
||||
assert len(stride) == 2
|
||||
assert len(padding) == 2
|
||||
self.kernel_size = kernel_size
|
||||
self.stride = stride
|
||||
self.padding = padding
|
||||
|
||||
self.conv = nn.Conv2D(
|
||||
num_channels_in,
|
||||
num_channels_out,
|
||||
kernel_size=kernel_size,
|
||||
stride=stride,
|
||||
padding=padding,
|
||||
weight_attr=None,
|
||||
bias_attr=False,
|
||||
data_format='NCHW')
|
||||
|
||||
self.bn = nn.BatchNorm2D(
|
||||
num_channels_out,
|
||||
weight_attr=None,
|
||||
bias_attr=None,
|
||||
data_format='NCHW')
|
||||
self.act = F.relu if act == 'relu' else brelu
|
||||
|
||||
def forward(self, x, x_len):
|
||||
"""
|
||||
x(Tensor): audio, shape [B, C, D, T]
|
||||
"""
|
||||
x = self.conv(x)
|
||||
x = self.bn(x)
|
||||
x = self.act(x)
|
||||
|
||||
x_len = (x_len - self.kernel_size[1] + 2 * self.padding[1]
|
||||
) // self.stride[1] + 1
|
||||
|
||||
# reset padding part to 0
|
||||
masks = sequence_mask(x_len) #[B, T]
|
||||
masks = masks.unsqueeze(1).unsqueeze(1) # [B, 1, 1, T]
|
||||
x = x.multiply(masks)
|
||||
|
||||
return x, x_len
|
||||
|
||||
|
||||
class ConvStack(nn.Layer):
|
||||
"""Convolution group with stacked convolution layers.
|
||||
|
||||
:param feat_size: audio feature dim.
|
||||
:type feat_size: int
|
||||
:param num_stacks: Number of stacked convolution layers.
|
||||
:type num_stacks: int
|
||||
"""
|
||||
|
||||
def __init__(self, feat_size, num_stacks):
|
||||
super().__init__()
|
||||
self.feat_size = feat_size # D
|
||||
self.num_stacks = num_stacks
|
||||
|
||||
self.conv_in = ConvBn(
|
||||
num_channels_in=1,
|
||||
num_channels_out=32,
|
||||
kernel_size=(41, 11), #[D, T]
|
||||
stride=(2, 3),
|
||||
padding=(20, 5),
|
||||
act='brelu')
|
||||
|
||||
out_channel = 32
|
||||
self.conv_stack = nn.LayerList([
|
||||
ConvBn(
|
||||
num_channels_in=32,
|
||||
num_channels_out=out_channel,
|
||||
kernel_size=(21, 11),
|
||||
stride=(2, 1),
|
||||
padding=(10, 5),
|
||||
act='brelu') for i in range(num_stacks - 1)
|
||||
])
|
||||
|
||||
# conv output feat_dim
|
||||
output_height = (feat_size - 1) // 2 + 1
|
||||
for i in range(self.num_stacks - 1):
|
||||
output_height = (output_height - 1) // 2 + 1
|
||||
self.output_height = out_channel * output_height
|
||||
|
||||
def forward(self, x, x_len):
|
||||
"""
|
||||
x: shape [B, C, D, T]
|
||||
x_len : shape [B]
|
||||
"""
|
||||
x, x_len = self.conv_in(x, x_len)
|
||||
for i, conv in enumerate(self.conv_stack):
|
||||
x, x_len = conv(x, x_len)
|
||||
return x, x_len
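# A small worked example (illustration only) of the output-size math above:
# with feat_size=161 (linear spectrogram) and num_stacks=2, the first conv
# maps the frequency axis to (161 - 1)//2 + 1 = 81, the second to
# (81 - 1)//2 + 1 = 41, so output_height = 32 * 41 = 1312.
def _example_convstack_output_height():
    conv = ConvStack(feat_size=161, num_stacks=2)
    assert conv.output_height == 32 * 41
    return conv.output_height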
|
@ -0,0 +1,65 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
|
||||
import paddle
|
||||
from paddle import nn
|
||||
from paddle.nn import functional as F
|
||||
from paddle.nn import initializer as I
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = ['CTCLoss']
|
||||
|
||||
|
||||
def ctc_loss(logits,
|
||||
labels,
|
||||
input_lengths,
|
||||
label_lengths,
|
||||
blank=0,
|
||||
reduction='mean',
|
||||
norm_by_times=True):
|
||||
#logger.info("my ctc loss with norm by times")
|
||||
## https://github.com/PaddlePaddle/Paddle/blob/f5ca2db2cc/paddle/fluid/operators/warpctc_op.h#L403
|
||||
loss_out = paddle.fluid.layers.warpctc(logits, labels, blank, norm_by_times,
|
||||
input_lengths, label_lengths)
|
||||
|
||||
loss_out = paddle.fluid.layers.squeeze(loss_out, [-1])
|
||||
logger.info(f"warpctc loss: {loss_out}/{loss_out.shape} ")
|
||||
assert reduction in ['mean', 'sum', 'none']
|
||||
if reduction == 'mean':
|
||||
loss_out = paddle.mean(loss_out / label_lengths)
|
||||
elif reduction == 'sum':
|
||||
loss_out = paddle.sum(loss_out)
|
||||
logger.info(f"ctc loss: {loss_out}")
|
||||
return loss_out
|
||||
|
||||
|
||||
F.ctc_loss = ctc_loss
|
||||
|
||||
|
||||
class CTCLoss(nn.Layer):
|
||||
def __init__(self, blank_id):
|
||||
super().__init__()
|
||||
# last token id as blank id
|
||||
self.loss = nn.CTCLoss(blank=blank_id, reduction='sum')
|
||||
|
||||
def forward(self, logits, text, logits_len, text_len):
|
||||
# warp-ctc do softmax on activations
|
||||
# warp-ctc need activation with shape [T, B, V + 1]
|
||||
logits = logits.transpose([1, 0, 2])
|
||||
|
||||
ctc_loss = self.loss(logits, text, logits_len, text_len)
|
||||
return ctc_loss
|
@ -0,0 +1,34 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
|
||||
import paddle
|
||||
from paddle import nn
|
||||
from paddle.nn import functional as F
|
||||
from paddle.nn import initializer as I
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = ['sequence_mask']
|
||||
|
||||
|
||||
def sequence_mask(x_len, max_len=None, dtype='float32'):
|
||||
max_len = max_len or x_len.max()
|
||||
x_len = paddle.unsqueeze(x_len, -1)
|
||||
row_vector = paddle.arange(max_len)
|
||||
#mask = row_vector < x_len
|
||||
mask = row_vector > x_len # a bug: the `<` comparison above broadcasts incorrectly, so `>` is used as a workaround
|
||||
mask = paddle.cast(mask, dtype)
|
||||
return mask
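# Hedged usage sketch (illustration only). With the commented-out `<`
# comparison the mask would be 1 for valid frames and 0 for padding, which is
# how ConvBn and RNNStack consume it when zeroing padded positions.
def _example_sequence_mask():
    x_len = paddle.to_tensor([2, 3])
    return sequence_mask(x_len, max_len=4)  # shape [2, 4]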
|
@ -0,0 +1,310 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import math
|
||||
import logging
|
||||
|
||||
import paddle
|
||||
from paddle import nn
|
||||
from paddle.nn import functional as F
|
||||
from paddle.nn import initializer as I
|
||||
|
||||
from deepspeech.modules.mask import sequence_mask
|
||||
from deepspeech.modules.activation import brelu
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = ['RNNStack']
|
||||
|
||||
|
||||
class RNNCell(nn.RNNCellBase):
|
||||
r"""
|
||||
Elman RNN (SimpleRNN) cell. Given the inputs and previous states, it
|
||||
computes the outputs and updates states.
|
||||
The formula used is as follows:
|
||||
.. math::
|
||||
h_{t} & = act(x_{t} + b_{ih} + W_{hh}h_{t-1} + b_{hh})
|
||||
y_{t} & = h_{t}
|
||||
|
||||
where :math:`act` is for :attr:`activation`.
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
hidden_size,
|
||||
activation="tanh",
|
||||
weight_ih_attr=None,
|
||||
weight_hh_attr=None,
|
||||
bias_ih_attr=None,
|
||||
bias_hh_attr=None,
|
||||
name=None):
|
||||
super().__init__()
|
||||
std = 1.0 / math.sqrt(hidden_size)
|
||||
self.weight_hh = self.create_parameter(
|
||||
(hidden_size, hidden_size),
|
||||
weight_hh_attr,
|
||||
default_initializer=I.Uniform(-std, std))
|
||||
self.bias_ih = None
|
||||
self.bias_hh = self.create_parameter(
|
||||
(hidden_size, ),
|
||||
bias_hh_attr,
|
||||
is_bias=True,
|
||||
default_initializer=I.Uniform(-std, std))
|
||||
|
||||
self.hidden_size = hidden_size
|
||||
if activation not in ["tanh", "relu", "brelu"]:
|
||||
raise ValueError(
|
||||
"activation for SimpleRNNCell should be tanh or relu, "
|
||||
"but get {}".format(activation))
|
||||
self.activation = activation
|
||||
self._activation_fn = paddle.tanh \
|
||||
if activation == "tanh" \
|
||||
else F.relu
|
||||
if activation == 'brelu':
|
||||
self._activation_fn = brelu
|
||||
|
||||
def forward(self, inputs, states=None):
|
||||
if states is None:
|
||||
states = self.get_initial_states(inputs, self.state_shape)
|
||||
pre_h = states
|
||||
i2h = inputs
|
||||
if self.bias_ih is not None:
|
||||
i2h += self.bias_ih
|
||||
h2h = paddle.matmul(pre_h, self.weight_hh, transpose_y=True)
|
||||
if self.bias_hh is not None:
|
||||
h2h += self.bias_hh
|
||||
h = self._activation_fn(i2h + h2h)
|
||||
return h, h
|
||||
|
||||
@property
|
||||
def state_shape(self):
|
||||
return (self.hidden_size, )
|
||||
|
||||
|
||||
class GRUCell(nn.RNNCellBase):
|
||||
r"""
|
||||
Gated Recurrent Unit (GRU) RNN cell. Given the inputs and previous states,
|
||||
it computes the outputs and updates states.
|
||||
The formula for GRU used is as follows:
|
||||
.. math::
|
||||
r_{t} & = \sigma(W_{ir}x_{t} + b_{ir} + W_{hr}h_{t-1} + b_{hr})
|
||||
z_{t} & = \sigma(W_{iz}x_{t} + b_{iz} + W_{hz}h_{t-1} + b_{hz})
|
||||
\widetilde{h}_{t} & = \tanh(W_{ic}x_{t} + b_{ic} + r_{t} * (W_{hc}h_{t-1} + b_{hc}))
|
||||
h_{t} & = z_{t} * h_{t-1} + (1 - z_{t}) * \widetilde{h}_{t}
|
||||
y_{t} & = h_{t}
|
||||
|
||||
where :math:`\sigma` is the sigmoid function, and * is the elementwise
|
||||
multiplication operator.
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
input_size,
|
||||
hidden_size,
|
||||
weight_ih_attr=None,
|
||||
weight_hh_attr=None,
|
||||
bias_ih_attr=None,
|
||||
bias_hh_attr=None,
|
||||
name=None):
|
||||
super().__init__()
|
||||
std = 1.0 / math.sqrt(hidden_size)
|
||||
self.weight_hh = self.create_parameter(
|
||||
(3 * hidden_size, hidden_size),
|
||||
weight_hh_attr,
|
||||
default_initializer=I.Uniform(-std, std))
|
||||
self.bias_ih = None
|
||||
self.bias_hh = self.create_parameter(
|
||||
(3 * hidden_size, ),
|
||||
bias_hh_attr,
|
||||
is_bias=True,
|
||||
default_initializer=I.Uniform(-std, std))
|
||||
|
||||
self.hidden_size = hidden_size
|
||||
self.input_size = input_size
|
||||
self._gate_activation = F.sigmoid
|
||||
self._activation = paddle.tanh
|
||||
#self._activation = F.relu
|
||||
|
||||
def forward(self, inputs, states=None):
|
||||
if states is None:
|
||||
states = self.get_initial_states(inputs, self.state_shape)
|
||||
|
||||
pre_hidden = states
|
||||
x_gates = inputs
|
||||
if self.bias_ih is not None:
|
||||
x_gates = x_gates + self.bias_ih
|
||||
h_gates = paddle.matmul(pre_hidden, self.weight_hh, transpose_y=True)
|
||||
if self.bias_hh is not None:
|
||||
h_gates = h_gates + self.bias_hh
|
||||
|
||||
x_r, x_z, x_c = paddle.split(x_gates, num_or_sections=3, axis=1)
|
||||
h_r, h_z, h_c = paddle.split(h_gates, num_or_sections=3, axis=1)
|
||||
|
||||
r = self._gate_activation(x_r + h_r)
|
||||
z = self._gate_activation(x_z + h_z)
|
||||
c = self._activation(x_c + r * h_c) # apply reset gate after mm
|
||||
h = (pre_hidden - c) * z + c
|
||||
# https://www.paddlepaddle.org.cn/documentation/docs/zh/api/paddle/fluid/layers/dynamic_gru_cn.html#dynamic-gru
|
||||
|
||||
return h, h
|
||||
|
||||
@property
|
||||
def state_shape(self):
|
||||
r"""
|
||||
The `state_shape` of GRUCell is a shape `[hidden_size]` (-1 for batch
|
||||
size would be automatically inserted into shape). The shape corresponds
|
||||
to the shape of :math:`h_{t-1}`.
|
||||
"""
|
||||
return (self.hidden_size, )
|
||||
|
||||
|
||||
class BiRNNWithBN(nn.Layer):
|
||||
"""Bidirectonal simple rnn layer with sequence-wise batch normalization.
|
||||
The batch normalization is only performed on input-state weights.
|
||||
|
||||
:param name: Name of the layer parameters.
|
||||
:type name: string
|
||||
:param size: Dimension of RNN cells.
|
||||
:type size: int
|
||||
:param share_weights: Whether to share input-hidden weights between
|
||||
forward and backward directional RNNs.
|
||||
:type share_weights: bool
|
||||
:return: Bidirectional simple rnn layer.
|
||||
:rtype: Variable
|
||||
"""
|
||||
|
||||
def __init__(self, i_size, h_size, share_weights):
|
||||
super().__init__()
|
||||
self.share_weights = share_weights
|
||||
if self.share_weights:
|
||||
#input-hidden weights shared between bi-directional rnn.
|
||||
self.fw_fc = nn.Linear(i_size, h_size, bias_attr=False)
|
||||
# batch norm is only performed on input-state projection
|
||||
self.fw_bn = nn.BatchNorm1D(
|
||||
h_size, bias_attr=None, data_format='NLC')
|
||||
self.bw_fc = self.fw_fc
|
||||
self.bw_bn = self.fw_bn
|
||||
else:
|
||||
self.fw_fc = nn.Linear(i_size, h_size, bias_attr=False)
|
||||
self.fw_bn = nn.BatchNorm1D(
|
||||
h_size, bias_attr=None, data_format='NLC')
|
||||
self.bw_fc = nn.Linear(i_size, h_size, bias_attr=False)
|
||||
self.bw_bn = nn.BatchNorm1D(
|
||||
h_size, bias_attr=None, data_format='NLC')
|
||||
|
||||
self.fw_cell = RNNCell(hidden_size=h_size, activation='brelu')
|
||||
self.bw_cell = RNNCell(hidden_size=h_size, activation='brelu')
|
||||
self.fw_rnn = nn.RNN(
|
||||
self.fw_cell, is_reverse=False, time_major=False) #[B, T, D]
|
||||
self.bw_rnn = nn.RNN(
|
||||
self.bw_cell, is_reverse=True, time_major=False) #[B, T, D]
|
||||
|
||||
def forward(self, x, x_len):
|
||||
# x, shape [B, T, D]
|
||||
fw_x = self.fw_bn(self.fw_fc(x))
|
||||
bw_x = self.bw_bn(self.bw_fc(x))
|
||||
fw_x, _ = self.fw_rnn(inputs=fw_x, sequence_length=x_len)
|
||||
bw_x, _ = self.bw_rnn(inputs=bw_x, sequence_length=x_len)
|
||||
x = paddle.concat([fw_x, bw_x], axis=-1)
|
||||
return x, x_len
|
||||
|
||||
|
||||
class BiGRUWithBN(nn.Layer):
|
||||
"""Bidirectonal gru layer with sequence-wise batch normalization.
|
||||
The batch normalization is only performed on input-state weights.
|
||||
|
||||
:param name: Name of the layer.
|
||||
:type name: string
|
||||
:param input: Input layer.
|
||||
:type input: Variable
|
||||
:param size: Dimension of GRU cells.
|
||||
:type size: int
|
||||
:param act: Activation type.
|
||||
:type act: string
|
||||
:return: Bidirectional GRU layer.
|
||||
:rtype: Variable
|
||||
"""
|
||||
|
||||
def __init__(self, i_size, h_size, act):
|
||||
super().__init__()
|
||||
hidden_size = h_size * 3
|
||||
|
||||
self.fw_fc = nn.Linear(i_size, hidden_size, bias_attr=False)
|
||||
self.fw_bn = nn.BatchNorm1D(
|
||||
hidden_size, bias_attr=None, data_format='NLC')
|
||||
self.bw_fc = nn.Linear(i_size, hidden_size, bias_attr=False)
|
||||
self.bw_bn = nn.BatchNorm1D(
|
||||
hidden_size, bias_attr=None, data_format='NLC')
|
||||
|
||||
self.fw_cell = GRUCell(input_size=hidden_size, hidden_size=h_size)
|
||||
self.bw_cell = GRUCell(input_size=hidden_size, hidden_size=h_size)
|
||||
self.fw_rnn = nn.RNN(
|
||||
self.fw_cell, is_reverse=False, time_major=False) #[B, T, D]
|
||||
self.bw_rnn = nn.RNN(
|
||||
self.bw_cell, is_reverse=True, time_major=False) #[B, T, D]
|
||||
|
||||
def forward(self, x, x_len):
|
||||
# x, shape [B, T, D]
|
||||
fw_x = self.fw_bn(self.fw_fc(x))
|
||||
bw_x = self.bw_bn(self.bw_fc(x))
|
||||
fw_x, _ = self.fw_rnn(inputs=fw_x, sequence_length=x_len)
|
||||
bw_x, _ = self.bw_rnn(inputs=bw_x, sequence_length=x_len)
|
||||
x = paddle.concat([fw_x, bw_x], axis=-1)
|
||||
return x, x_len
|
||||
|
||||
|
||||
class RNNStack(nn.Layer):
|
||||
"""RNN group with stacked bidirectional simple RNN or GRU layers.
|
||||
|
||||
:param input: Input layer.
|
||||
:type input: Variable
|
||||
:param size: Dimension of RNN cells in each layer.
|
||||
:type size: int
|
||||
:param num_stacks: Number of stacked rnn layers.
|
||||
:type num_stacks: int
|
||||
:param use_gru: Use gru if set True. Use simple rnn if set False.
|
||||
:type use_gru: bool
|
||||
:param share_rnn_weights: Whether to share input-hidden weights between
|
||||
forward and backward directional RNNs.
|
||||
It is only available when use_gru=False.
|
||||
:type share_rnn_weights: bool
|
||||
:return: Output layer of the RNN group.
|
||||
:rtype: Variable
|
||||
"""
|
||||
|
||||
def __init__(self, i_size, h_size, num_stacks, use_gru, share_rnn_weights):
|
||||
super().__init__()
|
||||
self.rnn_stacks = nn.LayerList()
|
||||
for i in range(num_stacks):
|
||||
if use_gru:
|
||||
#default:GRU using tanh
|
||||
self.rnn_stacks.append(
|
||||
BiGRUWithBN(i_size=i_size, h_size=h_size, act="relu"))
|
||||
else:
|
||||
self.rnn_stacks.append(
|
||||
BiRNNWithBN(
|
||||
i_size=i_size,
|
||||
h_size=h_size,
|
||||
share_weights=share_rnn_weights))
|
||||
i_size = h_size * 2
|
||||
|
||||
def forward(self, x, x_len):
|
||||
"""
|
||||
x: shape [B, T, D]
|
||||
x_len: shape [B]
|
||||
"""
|
||||
for i, rnn in enumerate(self.rnn_stacks):
|
||||
x, x_len = rnn(x, x_len)
|
||||
masks = sequence_mask(x_len) #[B, T]
|
||||
masks = masks.unsqueeze(-1) # [B, T, 1]
|
||||
x = x.multiply(masks)
|
||||
return x, x_len
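# Hedged shape sketch (illustration only): every bidirectional layer outputs
# 2 * h_size features per frame, which is why CRNNEncoder.output_size above is
# rnn_size * 2. The sizes below are assumptions chosen for the sketch.
def _example_rnnstack_shapes():
    stack = RNNStack(i_size=1312, h_size=1024, num_stacks=3,
                     use_gru=False, share_rnn_weights=True)
    x = paddle.randn([2, 20, 1312])
    x_len = paddle.to_tensor([20, 15], dtype='int64')
    y, y_len = stack(x, x_len)  # y: [2, 20, 2048]
    return y, y_len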
|
@ -0,0 +1,69 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import argparse
|
||||
|
||||
|
||||
def default_argument_parser():
|
||||
r"""A simple yet genral argument parser for experiments with parakeet.
|
||||
|
||||
This is used in examples with parakeet, and it is intended to be used by
|
||||
other experiments with parakeet. It requires a minimal set of command line
|
||||
arguments to start a training script.
|
||||
|
||||
The ``--config`` and ``--opts`` are used to overwrite the default
|
||||
configuration.
|
||||
|
||||
The ``--data`` and ``--output`` specifies the data path and output path.
|
||||
Resuming training from existing progress at the output directory is the
|
||||
intended default behavior.
|
||||
|
||||
The ``--checkpoint_path`` specifies the checkpoint to load from.
|
||||
|
||||
The ``--device`` and ``--nprocs`` specifies how to run the training.
|
||||
|
||||
|
||||
See Also
|
||||
--------
|
||||
parakeet.training.experiment
|
||||
Returns
|
||||
-------
|
||||
argparse.ArgumentParser
|
||||
the parser
|
||||
"""
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
# yapf: disable
|
||||
# data and output
|
||||
parser.add_argument("--config", metavar="FILE", help="path of the config file to overwrite to default config with.")
|
||||
parser.add_argument("--dump-config", metavar="FILE", help="dump config to yaml file.")
|
||||
# parser.add_argument("--data", metavar="DATA_DIR", help="path to the datatset.")
|
||||
parser.add_argument("--output", metavar="OUTPUT_DIR", help="path to save checkpoint and logs.")
|
||||
|
||||
# load from saved checkpoint
|
||||
parser.add_argument("--checkpoint_path", type=str, help="path of the checkpoint to load")
|
||||
|
||||
# save jit model to
|
||||
parser.add_argument("--export_path", type=str, help="path of the jit model to save")
|
||||
|
||||
# running
|
||||
parser.add_argument("--device", type=str, default='gpu', choices=["cpu", "gpu"], help="device type to use, cpu and gpu are supported.")
|
||||
parser.add_argument("--nprocs", type=int, default=1, help="number of parallel processes to use.")
|
||||
|
||||
# overwrite extra config and default config
|
||||
#parser.add_argument("--opts", nargs=argparse.REMAINDER, help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
|
||||
parser.add_argument("--opts", type=str, default=[], nargs='+', help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
|
||||
# yapf: enable
|
||||
|
||||
return parser
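# Minimal usage sketch (illustration only): parse command line arguments; the
# values passed below are hypothetical.
def _example_parse_args():
    parser = default_argument_parser()
    return parser.parse_args(
        ["--output", "exp/deepspeech2", "--device", "cpu", "--nprocs", "1"])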
|
@ -0,0 +1,74 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import logging
|
||||
|
||||
import paddle
|
||||
from paddle.fluid.dygraph import base as imperative_base
|
||||
from paddle.fluid import layers
|
||||
from paddle.fluid import core
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class MyClipGradByGlobalNorm(paddle.nn.ClipGradByGlobalNorm):
|
||||
def __init__(self, clip_norm):
|
||||
super().__init__(clip_norm)
|
||||
|
||||
@imperative_base.no_grad
|
||||
def _dygraph_clip(self, params_grads):
|
||||
params_and_grads = []
|
||||
sum_square_list = []
|
||||
for p, g in params_grads:
|
||||
if g is None:
|
||||
continue
|
||||
if getattr(p, 'need_clip', True) is False:
|
||||
continue
|
||||
merge_grad = g
|
||||
if g.type == core.VarDesc.VarType.SELECTED_ROWS:
|
||||
merge_grad = layers.merge_selected_rows(g)
|
||||
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
|
||||
square = layers.square(merge_grad)
|
||||
sum_square = layers.reduce_sum(square)
|
||||
logger.info(
|
||||
f"Grad Before Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
|
||||
)
|
||||
sum_square_list.append(sum_square)
|
||||
|
||||
# all parameters have been filtered out
|
||||
if len(sum_square_list) == 0:
|
||||
return params_grads
|
||||
|
||||
global_norm_var = layers.concat(sum_square_list)
|
||||
global_norm_var = layers.reduce_sum(global_norm_var)
|
||||
global_norm_var = layers.sqrt(global_norm_var)
|
||||
logger.info(f"Grad Global Norm: {float(global_norm_var)}!!!!")
|
||||
max_global_norm = layers.fill_constant(
|
||||
shape=[1], dtype=global_norm_var.dtype, value=self.clip_norm)
|
||||
clip_var = layers.elementwise_div(
|
||||
x=max_global_norm,
|
||||
y=layers.elementwise_max(x=global_norm_var, y=max_global_norm))
|
||||
for p, g in params_grads:
|
||||
if g is None:
|
||||
continue
|
||||
if getattr(p, 'need_clip', True) is False:
|
||||
params_and_grads.append((p, g))
|
||||
continue
|
||||
new_grad = layers.elementwise_mul(x=g, y=clip_var)
|
||||
logger.info(
|
||||
f"Grad After Clip: {p.name}: {float(layers.sqrt(layers.reduce_sum(layers.square(merge_grad))) ) }"
|
||||
)
|
||||
params_and_grads.append((p, new_grad))
|
||||
|
||||
return params_and_grads
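# Hedged sketch (illustration only) of attaching this clipper to an optimizer;
# `model`, the learning rate and clip_norm below are assumptions.
def _example_grad_clip_usage(model):
    clip = MyClipGradByGlobalNorm(clip_norm=5.0)
    return paddle.optimizer.Adam(
        learning_rate=1e-3, parameters=model.parameters(), grad_clip=clip)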
|
@ -0,0 +1,327 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import time
|
||||
import logging
|
||||
import logging.handlers
|
||||
from pathlib import Path
|
||||
import numpy as np
|
||||
from collections import defaultdict
|
||||
|
||||
import paddle
|
||||
from paddle import distributed as dist
|
||||
from paddle.distributed.utils import get_gpus
|
||||
from tensorboardX import SummaryWriter
|
||||
|
||||
from deepspeech.utils import checkpoint
|
||||
from deepspeech.utils import mp_tools
|
||||
|
||||
__all__ = ["Trainer"]
|
||||
|
||||
|
||||
class Trainer():
|
||||
"""
|
||||
An experiment template in order to structure the training code and take
|
||||
care of saving, loading, logging and visualization. It's intended to
|
||||
be flexible and simple.
|
||||
|
||||
So it only handles output directory (create directory for the output,
|
||||
create a checkpoint directory, dump the config in use and create
|
||||
visualizer and logger) in a standard way without enforcing any
|
||||
input-output protocols to the model and dataloader. It leaves the main
|
||||
part for the user to implement their own (setup the model, criterion,
|
||||
optimizer, define a training step, define a validation function and
|
||||
customize all the text and visual logs).
|
||||
It does not save too much boilerplate code. The users still have to write
|
||||
the forward/backward/update manually, but they are free to add
|
||||
non-standard behaviors if needed.
|
||||
We have some conventions to follow.
|
||||
1. Experiment should have ``model``, ``optimizer``, ``train_loader`` and
|
||||
``valid_loader``, ``config`` and ``args`` attributes.
|
||||
2. The config should have a ``training`` field, which has
|
||||
``valid_interval``, ``save_interval`` and ``max_iteration`` keys. It is
|
||||
used as the trigger to invoke validation, checkpointing and stop of the
|
||||
experiment.
|
||||
3. There are four methods, namely ``train_batch``, ``valid``,
|
||||
``setup_model`` and ``setup_dataloader`` that should be implemented.
|
||||
Feel free to add/overwrite other methods and standalone functions if you
|
||||
need.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
config: yacs.config.CfgNode
|
||||
The configuration used for the experiment.
|
||||
|
||||
args: argparse.Namespace
|
||||
The parsed command line arguments.
|
||||
Examples
|
||||
--------
|
||||
>>> def main_sp(config, args):
|
||||
>>> exp = Trainer(config, args)
|
||||
>>> exp.setup()
|
||||
>>> exp.run()
|
||||
>>>
|
||||
>>> config = get_cfg_defaults()
|
||||
>>> parser = default_argument_parser()
|
||||
>>> args = parser.parse_args()
|
||||
>>> if args.config:
|
||||
>>> config.merge_from_file(args.config)
|
||||
>>> if args.opts:
|
||||
>>> config.merge_from_list(args.opts)
|
||||
>>> config.freeze()
|
||||
>>>
|
||||
>>> if args.nprocs > 1 and args.device == "gpu":
|
||||
>>> dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
|
||||
>>> else:
|
||||
>>> main_sp(config, args)
|
||||
"""
|
||||
|
||||
def __init__(self, config, args):
|
||||
self.config = config
|
||||
self.args = args
|
||||
self.optimizer = None
|
||||
self.visualizer = None
|
||||
self.output_dir = None
|
||||
self.checkpoint_dir = None
|
||||
self.logger = None
|
||||
|
||||
def setup(self):
|
||||
"""Setup the experiment.
|
||||
"""
|
||||
paddle.set_device(self.args.device)
|
||||
if self.parallel:
|
||||
self.init_parallel()
|
||||
|
||||
self.setup_output_dir()
|
||||
self.dump_config()
|
||||
self.setup_visualizer()
|
||||
self.setup_logger()
|
||||
self.setup_checkpointer()
|
||||
|
||||
self.setup_dataloader()
|
||||
self.setup_model()
|
||||
|
||||
self.iteration = 0
|
||||
self.epoch = 0
|
||||
|
||||
@property
|
||||
def parallel(self):
|
||||
"""A flag indicating whether the experiment should run with
|
||||
multiprocessing.
|
||||
"""
|
||||
return self.args.device == "gpu" and self.args.nprocs > 1
|
||||
|
||||
def init_parallel(self):
|
||||
"""Init environment for multiprocess training.
|
||||
"""
|
||||
dist.init_parallel_env()
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
def save(self):
|
||||
"""Save checkpoint (model parameters and optimizer states).
|
||||
"""
|
||||
checkpoint.save_parameters(self.checkpoint_dir, self.iteration,
|
||||
self.model, self.optimizer)
|
||||
|
||||
def resume_or_load(self):
|
||||
"""Resume from latest checkpoint at checkpoints in the output
|
||||
directory or load a specified checkpoint.
|
||||
|
||||
If ``args.checkpoint_path`` is not None, load the checkpoint, else
|
||||
resume training.
|
||||
"""
|
||||
iteration = checkpoint.load_parameters(
|
||||
self.model,
|
||||
self.optimizer,
|
||||
checkpoint_dir=self.checkpoint_dir,
|
||||
checkpoint_path=self.args.checkpoint_path)
|
||||
self.iteration = iteration
|
||||
|
||||
def new_epoch(self):
|
||||
"""Reset the train loader and increment ``epoch``.
|
||||
"""
|
||||
if self.parallel:
|
||||
# batch sampler epoch start from 0
|
||||
self.train_loader.batch_sampler.set_epoch(self.epoch)
|
||||
self.epoch += 1
|
||||
|
||||
def train(self):
|
||||
"""The training process.
|
||||
|
||||
It includes forward/backward/update and periodical validation and
|
||||
saving.
|
||||
"""
|
||||
self.logger.info(
|
||||
f"Train Total Examples: {len(self.train_loader.dataset)}")
|
||||
self.new_epoch()
|
||||
while self.epoch <= self.config.training.n_epoch:
|
||||
try:
|
||||
for batch in self.train_loader:
|
||||
self.iteration += 1
|
||||
self.train_batch(batch)
|
||||
except Exception as e:
|
||||
self.logger.error(e)
|
||||
pass
|
||||
|
||||
self.valid()
|
||||
self.save()
|
||||
self.lr_scheduler.step()
|
||||
self.new_epoch()
|
||||
|
||||
def run(self):
|
||||
"""The routine of the experiment after setup. This method is intended
|
||||
to be used by the user.
|
||||
"""
|
||||
self.resume_or_load()
|
||||
try:
|
||||
self.train()
|
||||
except KeyboardInterrupt:
|
||||
self.save()
|
||||
exit(-1)
|
||||
finally:
|
||||
self.destroy()
|
||||
|
||||
def setup_output_dir(self):
|
||||
"""Create a directory used for output.
|
||||
"""
|
||||
# output dir
|
||||
output_dir = Path(self.args.output).expanduser()
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
self.output_dir = output_dir
|
||||
|
||||
def setup_checkpointer(self):
|
||||
"""Create a directory used to save checkpoints into.
|
||||
|
||||
It is "checkpoints" inside the output directory.
|
||||
"""
|
||||
# checkpoint dir
|
||||
checkpoint_dir = self.output_dir / "checkpoints"
|
||||
checkpoint_dir.mkdir(exist_ok=True)
|
||||
|
||||
self.checkpoint_dir = checkpoint_dir
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
def destroy(self):
|
||||
# https://github.com/pytorch/fairseq/issues/2357
|
||||
if self.visualizer:
|
||||
self.visualizer.close()
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
def setup_visualizer(self):
|
||||
"""Initialize a visualizer to log the experiment.
|
||||
|
||||
The visual log is saved in the output directory.
|
||||
|
||||
Notes
|
||||
------
|
||||
Only the main process has a visualizer attached to it. Using multiple
|
||||
visualizers in multiple processes to write to the same log file may cause
|
||||
unexpected behaviors.
|
||||
"""
|
||||
# visualizer
|
||||
visualizer = SummaryWriter(logdir=str(self.output_dir))
|
||||
|
||||
self.visualizer = visualizer
|
||||
|
||||
def setup_logger(self):
|
||||
"""Initialize a text logger to log the experiment.
|
||||
|
||||
Each process has its own text logger. The logging messages are written to
|
||||
the standard output and a text file named ``worker_n.log`` in the
|
||||
output directory, where ``n`` means the rank of the process.
|
||||
when - how to split the log file by time interval
|
||||
'S' : Seconds
|
||||
'M' : Minutes
|
||||
'H' : Hours
|
||||
'D' : Days
|
||||
'W' : Week day
|
||||
default value: 'D'
|
||||
format - format of the log
|
||||
default format:
|
||||
%(levelname)s: %(asctime)s: %(filename)s:%(lineno)d * %(thread)d %(message)s
|
||||
INFO: 12-09 18:02:42: log.py:40 * 139814749787872 HELLO WORLD
|
||||
backup - how many backup file to keep
|
||||
default value: 7
|
||||
"""
|
||||
when = 'D'
|
||||
backup = 7
|
||||
format = '[%(levelname)s %(asctime)s %(filename)s:%(lineno)d] %(message)s'
|
||||
formatter = logging.Formatter(fmt=format, datefmt='%Y/%m/%d %H:%M:%S')
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
logger.setLevel("INFO")
|
||||
|
||||
stream_handler = logging.StreamHandler()
|
||||
stream_handler.setFormatter(formatter)
|
||||
logger.addHandler(stream_handler)
|
||||
|
||||
log_file = self.output_dir / 'worker_{}.log'.format(dist.get_rank())
|
||||
# file_handler = logging.FileHandler(str(log_file))
|
||||
# file_handler.setFormatter(formatter)
|
||||
# logger.addHandler(file_handler)
|
||||
|
||||
# handler = logging.handlers.TimedRotatingFileHandler(
|
||||
# str(self.output_dir / "warning.log"), when=when, backupCount=backup)
|
||||
# handler.setLevel(logging.WARNING)
|
||||
# handler.setFormatter(formatter)
|
||||
# logger.addHandler(handler)
|
||||
|
||||
# stop propagation, since propagating may print the
|
||||
# log multiple times
|
||||
logger.propagate = False
|
||||
|
||||
# global logger
|
||||
stdout = False
|
||||
save_path = log_file
|
||||
logging.basicConfig(
|
||||
level=logging.DEBUG if stdout else logging.INFO,
|
||||
format=format,
|
||||
datefmt='%Y/%m/%d %H:%M:%S',
|
||||
filename=save_path if not stdout else None)
|
||||
self.logger = logger
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
def dump_config(self):
|
||||
"""Save the configuration used for this experiment.
|
||||
|
||||
It is saved to ``config.yaml`` in the output directory at the
|
||||
beginning of the experiment.
|
||||
"""
|
||||
with open(self.output_dir / "config.yaml", 'wt') as f:
|
||||
print(self.config, file=f)
|
||||
|
||||
def train_batch(self):
|
||||
"""The training loop. A subclass should implement this method.
|
||||
"""
|
||||
raise NotImplementedError("train_batch should be implemented.")
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
@paddle.no_grad()
|
||||
def valid(self):
|
||||
"""The validation. A subclass should implement this method.
|
||||
"""
|
||||
raise NotImplementedError("valid should be implemented.")
|
||||
|
||||
def setup_model(self):
|
||||
"""Setup model, criterion and optimizer, etc. A subclass should
|
||||
implement this method.
|
||||
"""
|
||||
raise NotImplementedError("setup_model should be implemented.")
|
||||
|
||||
def setup_dataloader(self):
|
||||
"""Setup training dataloader and validation dataloader. A subclass
|
||||
should implement this method.
|
||||
"""
|
||||
raise NotImplementedError("setup_dataloader should be implemented.")
|
@ -0,0 +1,13 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
@ -0,0 +1,140 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import time
|
||||
import logging
|
||||
import numpy as np
|
||||
|
||||
import paddle
|
||||
from paddle import distributed as dist
|
||||
from paddle.nn import Layer
|
||||
from paddle.optimizer import Optimizer
|
||||
|
||||
from deepspeech.utils import mp_tools
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
__all__ = ["load_parameters", "save_parameters"]
|
||||
|
||||
|
||||
def _load_latest_checkpoint(checkpoint_dir: str) -> int:
|
||||
"""Get the iteration number corresponding to the latest saved checkpoint.
|
||||
Args:
|
||||
checkpoint_dir (str): the directory where checkpoint is saved.
|
||||
Returns:
|
||||
int: the latest iteration number.
|
||||
"""
|
||||
checkpoint_record = os.path.join(checkpoint_dir, "checkpoint")
|
||||
if (not os.path.isfile(checkpoint_record)):
|
||||
return 0
|
||||
|
||||
# Fetch the latest checkpoint index.
|
||||
with open(checkpoint_record, "rt") as handle:
|
||||
latest_checkpoint = handle.readlines()[-1].strip()
|
||||
step = latest_checkpoint.split(":")[-1]
|
||||
iteration = int(step.split("-")[-1])
|
||||
|
||||
return iteration
|
||||
|
||||
|
||||
def _save_checkpoint(checkpoint_dir: str, iteration: int):
|
||||
"""Save the iteration number of the latest model to be checkpointed.
|
||||
Args:
|
||||
checkpoint_dir (str): the directory where checkpoint is saved.
|
||||
iteration (int): the latest iteration number.
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
checkpoint_record = os.path.join(checkpoint_dir, "checkpoint")
|
||||
# Update the latest checkpoint index.
|
||||
with open(checkpoint_record, "a+") as handle:
|
||||
handle.write("model_checkpoint_path:step-{}\n".format(iteration))
|
||||
|
||||
|
||||
def load_parameters(model,
|
||||
optimizer=None,
|
||||
checkpoint_dir=None,
|
||||
checkpoint_path=None):
|
||||
"""Load a specific model checkpoint from disk.
|
||||
Args:
|
||||
model (Layer): model to load parameters.
|
||||
optimizer (Optimizer, optional): optimizer to load states if needed.
|
||||
Defaults to None.
|
||||
checkpoint_dir (str, optional): the directory where checkpoint is saved.
|
||||
checkpoint_path (str, optional): if specified, load the checkpoint
|
||||
stored in the checkpoint_path and the argument 'checkpoint_dir' will
|
||||
be ignored. Defaults to None.
|
||||
Returns:
|
||||
iteration (int): number of iterations that the loaded checkpoint has
|
||||
been trained.
|
||||
"""
|
||||
if checkpoint_path is not None:
|
||||
iteration = int(os.path.basename(checkpoint_path).split("-")[-1])
|
||||
elif checkpoint_dir is not None:
|
||||
iteration = _load_latest_checkpoint(checkpoint_dir)
|
||||
if iteration == 0:
|
||||
return iteration
|
||||
checkpoint_path = os.path.join(checkpoint_dir,
|
||||
"step-{}".format(iteration))
|
||||
else:
|
||||
raise ValueError(
|
||||
"At least one of 'checkpoint_dir' and 'checkpoint_path' should be specified!"
|
||||
)
|
||||
|
||||
rank = dist.get_rank()
|
||||
|
||||
params_path = checkpoint_path + ".pdparams"
|
||||
model_dict = paddle.load(params_path)
|
||||
model.set_state_dict(model_dict)
|
||||
logger.info(
|
||||
"[checkpoint] Rank {}: loaded model from {}".format(rank, params_path))
|
||||
|
||||
optimizer_path = checkpoint_path + ".pdopt"
|
||||
if optimizer and os.path.isfile(optimizer_path):
|
||||
optimizer_dict = paddle.load(optimizer_path)
|
||||
optimizer.set_state_dict(optimizer_dict)
|
||||
logger.info("[checkpoint] Rank {}: loaded optimizer state from {}".
|
||||
format(rank, optimizer_path))
|
||||
|
||||
return iteration
|
||||
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
def save_parameters(checkpoint_dir, iteration, model, optimizer=None):
|
||||
"""Checkpoint the latest trained model parameters.
|
||||
Args:
|
||||
checkpoint_dir (str): the directory where checkpoint is saved.
|
||||
iteration (int): the latest iteration number.
|
||||
model (Layer): model to be checkpointed.
|
||||
optimizer (Optimizer, optional): optimizer to be checkpointed.
|
||||
Defaults to None.
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
checkpoint_path = os.path.join(checkpoint_dir, "step-{}".format(iteration))
|
||||
|
||||
model_dict = model.state_dict()
|
||||
params_path = checkpoint_path + ".pdparams"
|
||||
paddle.save(model_dict, params_path)
|
||||
logger.info("[checkpoint] Saved model to {}".format(params_path))
|
||||
|
||||
if optimizer:
|
||||
opt_dict = optimizer.state_dict()
|
||||
optimizer_path = checkpoint_path + ".pdopt"
|
||||
paddle.save(opt_dict, optimizer_path)
|
||||
logger.info(
|
||||
"[checkpoint] Saved optimzier state to {}".format(optimizer_path))
|
||||
|
||||
_save_checkpoint(checkpoint_dir, iteration)
|
@ -0,0 +1,78 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import numpy as np
|
||||
from paddle import nn
|
||||
|
||||
__all__ = [
|
||||
"summary", "gradient_norm", "freeze", "unfreeze", "print_grads",
|
||||
"print_params"
|
||||
]
|
||||
|
||||
|
||||
def summary(layer: nn.Layer, print_func=print):
|
||||
num_params = num_elements = 0
|
||||
print_func("layer summary:")
|
||||
for name, param in layer.state_dict().items():
|
||||
print_func("{}|{}|{}".format(name, param.shape, np.prod(param.shape)))
|
||||
num_elements += np.prod(param.shape)
|
||||
num_params += 1
|
||||
print_func("layer has {} parameters, {} elements.".format(num_params,
|
||||
num_elements))
|
||||
|
||||
|
||||
def gradient_norm(layer: nn.Layer):
|
||||
grad_norm_dict = {}
|
||||
for name, param in layer.state_dict().items():
|
||||
if param.trainable:
|
||||
grad = param.gradient()
|
||||
grad_norm_dict[name] = np.linalg.norm(grad) / grad.size
|
||||
return grad_norm_dict
|
||||
|
||||
|
||||
def recursively_remove_weight_norm(layer: nn.Layer):
|
||||
for layer in layer.sublayers():
|
||||
try:
|
||||
nn.utils.remove_weight_norm(layer)
|
||||
except:
|
||||
# there is no weight norm hook in this layer
|
||||
pass
|
||||
|
||||
|
||||
def freeze(layer: nn.Layer):
|
||||
for param in layer.parameters():
|
||||
param.trainable = False
|
||||
|
||||
|
||||
def unfreeze(layer: nn.Layer):
|
||||
for param in layer.parameters():
|
||||
param.trainable = True
|
||||
|
||||
|
||||
def print_grads(model, print_func=print):
|
||||
for n, p in model.named_parameters():
|
||||
msg = f"param grad: {n}: shape: {p.shape} grad: {p.grad}"
|
||||
if print_func:
|
||||
print_func(msg)
|
||||
|
||||
|
||||
def print_params(model, print_func=print):
|
||||
total = 0.0
|
||||
for n, p in model.named_parameters():
|
||||
msg = f"param: {n}: shape: {p.shape} stop_grad: {p.stop_gradient}"
|
||||
total += np.prod(p.shape)
|
||||
if print_func:
|
||||
print_func(msg)
|
||||
if print_func:
|
||||
print_func(f"Total parameters: {total}!")
|
@ -0,0 +1,31 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import paddle
|
||||
from paddle import distributed as dist
|
||||
from functools import wraps
|
||||
|
||||
__all__ = ["rank_zero_only"]
|
||||
|
||||
|
||||
def rank_zero_only(func):
|
||||
@wraps(func)
|
||||
def wrapper(*args, **kwargs):
|
||||
rank = dist.get_rank()
|
||||
if rank != 0:
|
||||
return
|
||||
result = func(*args, **kwargs)
|
||||
return result
|
||||
|
||||
return wrapper
|
@ -0,0 +1,111 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import random
|
||||
import time
|
||||
from time import gmtime, strftime
|
||||
import socket
import socketserver
|
||||
import struct
|
||||
import wave
|
||||
|
||||
from deepspeech.frontend.utility import read_manifest
|
||||
|
||||
__all__ = ["socket_send", "warm_up_test", "AsrTCPServer", "AsrRequestHandler"]
|
||||
|
||||
|
||||
def socket_send(server_ip: str, server_port: str, data: bytes):
|
||||
# Connect to server and send data
|
||||
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||||
sock.connect((server_ip, server_port))
|
||||
sent = data
|
||||
sock.sendall(struct.pack('>i', len(sent)) + sent)
|
||||
print('Speech[length=%d] Sent.' % len(sent))
|
||||
# Receive data from the server and shut down
|
||||
received = sock.recv(1024)
|
||||
print("Recognition Results: {}".format(received.decode('utf8')))
|
||||
sock.close()
|
||||
|
||||
|
||||
def warm_up_test(audio_process_handler,
|
||||
manifest_path,
|
||||
num_test_cases,
|
||||
random_seed=0):
|
||||
"""Warming-up test."""
|
||||
manifest = read_manifest(manifest_path)
|
||||
rng = random.Random(random_seed)
|
||||
samples = rng.sample(manifest, num_test_cases)
|
||||
for idx, sample in enumerate(samples):
|
||||
print("Warm-up Test Case %d: %s", idx, sample['audio_filepath'])
|
||||
start_time = time.time()
|
||||
transcript = audio_process_handler(sample['audio_filepath'])
|
||||
finish_time = time.time()
|
||||
print("Response Time: %f, Transcript: %s" %
|
||||
(finish_time - start_time, transcript))
|
||||
|
||||
|
||||
class AsrTCPServer(socketserver.TCPServer):
|
||||
"""The ASR TCP Server."""
|
||||
|
||||
def __init__(self,
|
||||
server_address,
|
||||
RequestHandlerClass,
|
||||
speech_save_dir,
|
||||
audio_process_handler,
|
||||
bind_and_activate=True):
|
||||
self.speech_save_dir = speech_save_dir
|
||||
self.audio_process_handler = audio_process_handler
|
||||
socketserver.TCPServer.__init__(
|
||||
self, server_address, RequestHandlerClass, bind_and_activate=True)
|
||||
|
||||
|
||||
class AsrRequestHandler(socketserver.BaseRequestHandler):
|
||||
"""The ASR request handler."""
|
||||
|
||||
def handle(self):
|
||||
# receive data through TCP socket
|
||||
chunk = self.request.recv(1024)
|
||||
target_len = struct.unpack('>i', chunk[:4])[0]
|
||||
data = chunk[4:]
|
||||
while len(data) < target_len:
|
||||
chunk = self.request.recv(1024)
|
||||
data += chunk
|
||||
# write to file
|
||||
filename = self._write_to_file(data)
|
||||
|
||||
print("Received utterance[length=%d] from %s, saved to %s." %
|
||||
(len(data), self.client_address[0], filename))
|
||||
start_time = time.time()
|
||||
transcript = self.server.audio_process_handler(filename)
|
||||
finish_time = time.time()
|
||||
print("Response Time: %f, Transcript: %s" %
|
||||
(finish_time - start_time, transcript))
|
||||
self.request.sendall(transcript.encode('utf-8'))
|
||||
|
||||
def _write_to_file(self, data):
|
||||
# prepare save dir and filename
|
||||
if not os.path.exists(self.server.speech_save_dir):
|
||||
os.mkdir(self.server.speech_save_dir)
|
||||
timestamp = strftime("%Y%m%d%H%M%S", gmtime())
|
||||
out_filename = os.path.join(
|
||||
self.server.speech_save_dir,
|
||||
timestamp + "_" + self.client_address[0] + ".wav")
|
||||
# write to wav file
|
||||
file = wave.open(out_filename, 'wb')
|
||||
file.setnchannels(1)
|
||||
file.setsampwidth(2)
|
||||
file.setframerate(16000)
|
||||
file.writeframes(data)
|
||||
file.close()
|
||||
return out_filename
|
@ -0,0 +1,60 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Contains common utility functions."""
|
||||
|
||||
import numpy as np
|
||||
import distutils.util
|
||||
|
||||
__all__ = ['print_arguments', 'add_arguments']
|
||||
|
||||
|
||||
def print_arguments(args):
|
||||
"""Print argparse's arguments.
|
||||
|
||||
Usage:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("name", default="Jonh", type=str, help="User name.")
|
||||
args = parser.parse_args()
|
||||
print_arguments(args)
|
||||
|
||||
:param args: Input argparse.Namespace for printing.
|
||||
:type args: argparse.Namespace
|
||||
"""
|
||||
print("----------- Configuration Arguments -----------")
|
||||
for arg, value in sorted(vars(args).items()):
|
||||
print("%s: %s" % (arg, value))
|
||||
print("------------------------------------------------")
|
||||
|
||||
|
||||
def add_arguments(argname, type, default, help, argparser, **kwargs):
|
||||
"""Add argparse's argument.
|
||||
|
||||
Usage:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
add_argument("name", str, "Jonh", "User name.", parser)
|
||||
args = parser.parse_args()
|
||||
"""
|
||||
type = distutils.util.strtobool if type == bool else type
|
||||
argparser.add_argument(
|
||||
"--" + argname,
|
||||
default=default,
|
||||
type=type,
|
||||
help=help + ' Default: %(default)s.',
|
||||
**kwargs)
|
@ -1,251 +0,0 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
"""Server-end for the ASR demo."""
|
||||
import os
|
||||
import time
|
||||
import random
|
||||
import argparse
|
||||
import functools
|
||||
from time import gmtime, strftime
|
||||
import SocketServer
|
||||
import struct
|
||||
import wave
|
||||
import paddle.fluid as fluid
|
||||
import numpy as np
|
||||
import _init_paths
|
||||
from data_utils.data import DataGenerator
|
||||
from model_utils.model import DeepSpeech2Model
|
||||
from data_utils.utility import read_manifest
|
||||
from utils.utility import add_arguments, print_arguments
|
||||
|
||||
parser = argparse.ArgumentParser(description=__doc__)
|
||||
add_arg = functools.partial(add_arguments, argparser=parser)
|
||||
# yapf: disable
|
||||
add_arg('host_port', int, 8086, "Server's IP port.")
|
||||
add_arg('beam_size', int, 500, "Beam search width.")
|
||||
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
|
||||
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
|
||||
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
|
||||
add_arg('alpha', float, 2.5, "Coef of LM for beam search.")
|
||||
add_arg('beta', float, 0.3, "Coef of WC for beam search.")
|
||||
add_arg('cutoff_prob', float, 1.0, "Cutoff probability for pruning.")
|
||||
add_arg('cutoff_top_n', int, 40, "Cutoff number for pruning.")
|
||||
add_arg('use_gru', bool, False, "Use GRUs instead of simple RNNs.")
|
||||
add_arg('use_gpu', bool, True, "Use GPU or not.")
|
||||
add_arg('share_rnn_weights',bool, True, "Share input-hidden weights across "
|
||||
"bi-directional RNNs. Not for GRU.")
|
||||
add_arg('host_ip', str,
|
||||
'localhost',
|
||||
"Server's IP address.")
|
||||
add_arg('speech_save_dir', str,
|
||||
'demo_cache',
|
||||
"Directory to save demo audios.")
|
||||
add_arg('warmup_manifest', str,
|
||||
'data/librispeech/manifest.test-clean',
|
||||
"Filepath of manifest to warm up.")
|
||||
add_arg('mean_std_path', str,
|
||||
'data/librispeech/mean_std.npz',
|
||||
"Filepath of normalizer's mean & std.")
|
||||
add_arg('vocab_path', str,
|
||||
'data/librispeech/eng_vocab.txt',
|
||||
"Filepath of vocabulary.")
|
||||
add_arg('model_path', str,
|
||||
'./checkpoints/libri/step_final',
|
||||
"If None, the training starts from scratch, "
|
||||
"otherwise, it resumes from the pre-trained model.")
|
||||
add_arg('lang_model_path', str,
|
||||
'lm/data/common_crawl_00.prune01111.trie.klm',
|
||||
"Filepath for language model.")
|
||||
add_arg('decoding_method', str,
|
||||
'ctc_beam_search',
|
||||
"Decoding method. Options: ctc_beam_search, ctc_greedy",
|
||||
choices = ['ctc_beam_search', 'ctc_greedy'])
|
||||
add_arg('specgram_type', str,
|
||||
'linear',
|
||||
"Audio feature type. Options: linear, mfcc.",
|
||||
choices=['linear', 'mfcc'])
|
||||
# yapf: disable
|
||||
args = parser.parse_args()
|
||||
|
||||
|
||||
class AsrTCPServer(SocketServer.TCPServer):
|
||||
"""The ASR TCP Server."""
|
||||
|
||||
def __init__(self,
|
||||
server_address,
|
||||
RequestHandlerClass,
|
||||
speech_save_dir,
|
||||
audio_process_handler,
|
||||
bind_and_activate=True):
|
||||
self.speech_save_dir = speech_save_dir
|
||||
self.audio_process_handler = audio_process_handler
|
||||
SocketServer.TCPServer.__init__(
|
||||
self, server_address, RequestHandlerClass, bind_and_activate=True)
|
||||
|
||||
|
||||
class AsrRequestHandler(SocketServer.BaseRequestHandler):
|
||||
"""The ASR request handler."""
|
||||
|
||||
def handle(self):
|
||||
# receive data through TCP socket
|
||||
chunk = self.request.recv(1024)
|
||||
target_len = struct.unpack('>i', chunk[:4])[0]
|
||||
data = chunk[4:]
|
||||
while len(data) < target_len:
|
||||
chunk = self.request.recv(1024)
|
||||
data += chunk
|
||||
# write to file
|
||||
filename = self._write_to_file(data)
|
||||
|
||||
print("Received utterance[length=%d] from %s, saved to %s." %
|
||||
(len(data), self.client_address[0], filename))
|
||||
start_time = time.time()
|
||||
transcript = self.server.audio_process_handler(filename)
|
||||
finish_time = time.time()
|
||||
print("Response Time: %f, Transcript: %s" %
|
||||
(finish_time - start_time, transcript))
|
||||
self.request.sendall(transcript.encode('utf-8'))
|
||||
|
||||
def _write_to_file(self, data):
|
||||
# prepare save dir and filename
|
||||
if not os.path.exists(self.server.speech_save_dir):
|
||||
os.mkdir(self.server.speech_save_dir)
|
||||
timestamp = strftime("%Y%m%d%H%M%S", gmtime())
|
||||
out_filename = os.path.join(
|
||||
self.server.speech_save_dir,
|
||||
timestamp + "_" + self.client_address[0] + ".wav")
|
||||
# write to wav file
|
||||
file = wave.open(out_filename, 'wb')
|
||||
file.setnchannels(1)
|
||||
file.setsampwidth(4)
|
||||
file.setframerate(16000)
|
||||
file.writeframes(data)
|
||||
file.close()
|
||||
return out_filename
|
||||
|
||||
|
||||
def warm_up_test(audio_process_handler,
|
||||
manifest_path,
|
||||
num_test_cases,
|
||||
random_seed=0):
|
||||
"""Warming-up test."""
|
||||
manifest = read_manifest(manifest_path)
|
||||
rng = random.Random(random_seed)
|
||||
samples = rng.sample(manifest, num_test_cases)
|
||||
for idx, sample in enumerate(samples):
|
||||
print("Warm-up Test Case %d: %s", idx, sample['audio_filepath'])
|
||||
start_time = time.time()
|
||||
transcript = audio_process_handler(sample['audio_filepath'])
|
||||
finish_time = time.time()
|
||||
print("Response Time: %f, Transcript: %s" %
|
||||
(finish_time - start_time, transcript))
|
||||
|
||||
|
||||
def start_server():
|
||||
"""Start the ASR server"""
|
||||
# prepare data generator
|
||||
if args.use_gpu:
|
||||
place = fluid.CUDAPlace(0)
|
||||
else:
|
||||
place = fluid.CPUPlace()
|
||||
|
||||
data_generator = DataGenerator(
|
||||
vocab_filepath=args.vocab_path,
|
||||
mean_std_filepath=args.mean_std_path,
|
||||
augmentation_config='{}',
|
||||
specgram_type=args.specgram_type,
|
||||
keep_transcription_text=True,
|
||||
place = place,
|
||||
is_training = False)
|
||||
# prepare ASR model
|
||||
ds2_model = DeepSpeech2Model(
|
||||
vocab_size=data_generator.vocab_size,
|
||||
num_conv_layers=args.num_conv_layers,
|
||||
num_rnn_layers=args.num_rnn_layers,
|
||||
rnn_layer_size=args.rnn_layer_size,
|
||||
use_gru=args.use_gru,
|
||||
init_from_pretrained_model=args.model_path,
|
||||
place=place,
|
||||
share_rnn_weights=args.share_rnn_weights)
|
||||
|
||||
vocab_list = [chars for chars in data_generator.vocab_list]
|
||||
|
||||
if args.decoding_method == "ctc_beam_search":
|
||||
ds2_model.init_ext_scorer(args.alpha, args.beta, args.lang_model_path,
|
||||
vocab_list)
|
||||
# prepare ASR inference handler
|
||||
def file_to_transcript(filename):
|
||||
feature = data_generator.process_utterance(filename, "")
|
||||
audio_len = feature[0].shape[1]
|
||||
mask_shape0 = (feature[0].shape[0] - 1) // 2 + 1
|
||||
mask_shape1 = (feature[0].shape[1] - 1) // 3 + 1
|
||||
mask_max_len = (audio_len - 1) // 3 + 1
|
||||
mask_ones = np.ones((mask_shape0, mask_shape1))
|
||||
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
|
||||
mask = np.repeat(
|
||||
np.reshape(
|
||||
np.concatenate((mask_ones, mask_zeros), axis=1),
|
||||
(1, mask_shape0, mask_max_len)),
|
||||
32,
|
||||
axis=0)
|
||||
feature = (np.array([feature[0]]).astype('float32'),
|
||||
None,
|
||||
np.array([audio_len]).astype('int64').reshape([-1,1]),
|
||||
np.array([mask]).astype('float32'))
|
||||
probs_split = ds2_model.infer_batch_probs(
|
||||
infer_data=feature,
|
||||
feeding_dict=data_generator.feeding)
|
||||
|
||||
if args.decoding_method == "ctc_greedy":
|
||||
result_transcript = ds2_model.decode_batch_greedy(
|
||||
probs_split=probs_split,
|
||||
vocab_list=vocab_list)
|
||||
else:
|
||||
result_transcript = ds2_model.decode_batch_beam_search(
|
||||
probs_split=probs_split,
|
||||
beam_alpha=args.alpha,
|
||||
beam_beta=args.beta,
|
||||
beam_size=args.beam_size,
|
||||
cutoff_prob=args.cutoff_prob,
|
||||
cutoff_top_n=args.cutoff_top_n,
|
||||
vocab_list=vocab_list,
|
||||
num_processes=1)
|
||||
return result_transcript[0]
|
||||
|
||||
# warming up with utterrances sampled from Librispeech
|
||||
print('-----------------------------------------------------------')
|
||||
print('Warming up ...')
|
||||
warm_up_test(
|
||||
audio_process_handler=file_to_transcript,
|
||||
manifest_path=args.warmup_manifest,
|
||||
num_test_cases=3)
|
||||
print('-----------------------------------------------------------')
|
||||
|
||||
# start the server
|
||||
server = AsrTCPServer(
|
||||
server_address=(args.host_ip, args.host_port),
|
||||
RequestHandlerClass=AsrRequestHandler,
|
||||
speech_save_dir=args.speech_save_dir,
|
||||
audio_process_handler=file_to_transcript)
|
||||
print("ASR Server Started.")
|
||||
server.serve_forever()
|
||||
|
||||
|
||||
def main():
|
||||
print_arguments(args)
|
||||
start_server()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@ -0,0 +1,507 @@
|
||||
# DeepSpeech2 on PaddlePaddle
|
||||
|
||||
[中文版](README_cn.md)
|
||||
|
||||
*DeepSpeech2 on PaddlePaddle* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf), built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition via an easy-to-use, efficient and scalable implementation, including training, inference & testing modules, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
|
||||
|
||||
## Table of Contents
|
||||
- [Installation](#installation)
|
||||
- [Running in Docker Container](#running-in-docker-container)
|
||||
- [Getting Started](#getting-started)
|
||||
- [Data Preparation](#data-preparation)
|
||||
- [Training a Model](#training-a-model)
|
||||
- [Data Augmentation Pipeline](#data-augmentation-pipeline)
|
||||
- [Inference and Evaluation](#inference-and-evaluation)
|
||||
- [Hyper-parameters Tuning](#hyper-parameters-tuning)
|
||||
- [Training for Mandarin Language](#training-for-mandarin-language)
|
||||
- [Trying Live Demo with Your Own Voice](#trying-live-demo-with-your-own-voice)
|
||||
- [Released Models](#released-models)
|
||||
- [Experiments and Benchmarks](#experiments-and-benchmarks)
|
||||
- [Questions and Help](#questions-and-help)
|
||||
|
||||
|
||||
|
||||
## Installation
|
||||
|
||||
To avoid the trouble of environment setup, [running in a Docker container](#running-in-docker-container) is highly recommended. Otherwise, follow the guidelines below to install the dependencies manually.
|
||||
|
||||
### Prerequisites
|
||||
- Python >= 3.7
|
||||
- PaddlePaddle 1.8.5 (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
|
||||
|
||||
### Setup
|
||||
- Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:
|
||||
|
||||
```bash
|
||||
sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev
|
||||
```
|
||||
|
||||
or by installing them via `yum`:
|
||||
|
||||
```bash
|
||||
sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel python3-devel
|
||||
wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
|
||||
xz -d flac-1.3.1.tar.xz
|
||||
tar -xvf flac-1.3.1.tar
|
||||
cd flac-1.3.1
|
||||
./configure
|
||||
make
|
||||
make install
|
||||
```
|
||||
|
||||
- Run the setup script for the remaining dependencies
|
||||
|
||||
```bash
|
||||
git clone https://github.com/PaddlePaddle/DeepSpeech.git
|
||||
cd DeepSpeech
|
||||
pushd tools; make; popd
|
||||
source tools/venv/bin/activate
|
||||
bash setup.sh
|
||||
```
|
||||
|
||||
- Source the venv before running experiments.
|
||||
|
||||
```bash
|
||||
source tools/venv/bin/activate
|
||||
```
|
||||
|
||||
### Running in Docker Container
|
||||
|
||||
Docker is an open source tool to build, ship, and run distributed applications in an isolated environment. A Docker image for this project has been provided in [hub.docker.com](https://hub.docker.com) with all the dependencies installed, including the pre-built PaddlePaddle, CTC decoders, and other necessary Python and third-party packages. This Docker image requires NVIDIA GPU support, so please make sure a GPU is available and that [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.
|
||||
|
||||
Take the following steps to launch the Docker image:
|
||||
|
||||
- Download the Docker image
|
||||
|
||||
```bash
|
||||
nvidia-docker pull hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu
|
||||
```
|
||||
|
||||
- Clone this repository
|
||||
|
||||
```
|
||||
git clone https://github.com/PaddlePaddle/DeepSpeech.git
|
||||
```
|
||||
|
||||
- Run the Docker image
|
||||
|
||||
```bash
|
||||
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
|
||||
```
|
||||
Now go back and start from the [Getting Started](#getting-started) section; you can execute training, inference and hyper-parameter tuning in the Docker container in the same way.
|
||||
|
||||
|
||||
- Install PaddlePaddle
|
||||
|
||||
For example, for CUDA 10.1, CuDNN7.5:
|
||||
```bash
|
||||
python3 -m pip install paddlepaddle-gpu==1.8.0.post107
|
||||
```
|
||||
|
||||
## Getting Started
|
||||
|
||||
Several shell scripts provided in `./examples` will help you quickly try out the major modules, including data preparation, model training, case inference and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you understand how to make the engine work with your own data.
|
||||
|
||||
Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `--batch_size` to fit.
|
||||
|
||||
Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org/12/) for instance.
|
||||
|
||||
- Go to directory
|
||||
|
||||
```bash
|
||||
cd examples/tiny
|
||||
```
|
||||
|
||||
Notice that this is only a toy example with a tiny sampled subset of LibriSpeech. If you would like to try the complete dataset (which would take several days of training), please go to `examples/librispeech` instead.
|
||||
- Prepare the data
|
||||
|
||||
```bash
|
||||
sh run_data.sh
|
||||
```
|
||||
|
||||
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `./dataset/librispeech` and the corresponding manifest files generated in `./data/tiny`, as well as a mean-stddev file and a vocabulary file. It only has to be run the very first time you use this dataset, and the results are reusable for all further experiments.
|
||||
- Train your own ASR model
|
||||
|
||||
```bash
|
||||
sh run_train.sh
|
||||
```
|
||||
|
||||
`run_train.sh` will start a training job, with training logs printed to stdout and the model checkpoint of every pass/epoch saved to `./checkpoints/tiny`. These checkpoints can be used for resuming training, inference, evaluation and deployment.
|
||||
- Case inference with an existing model
|
||||
|
||||
```bash
|
||||
sh run_infer.sh
|
||||
```
|
||||
|
||||
`run_infer.sh` will show us some speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good now as the current model is only trained with a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained (trained for several days, with the complete LibriSpeech) model and do the inference:
|
||||
|
||||
```bash
|
||||
sh run_infer_golden.sh
|
||||
```
|
||||
- Evaluate an existing model
|
||||
|
||||
```bash
|
||||
sh run_test.sh
|
||||
```
|
||||
|
||||
`run_test.sh` will evaluate the model with Word Error Rate (or Character Error Rate) measurement. Similarly, you can also download a well-trained model and test its performance:
|
||||
|
||||
```bash
|
||||
sh run_test_golden.sh
|
||||
```
|
||||
|
||||
More detailed information is provided in the following sections. We wish you a happy journey with the *DeepSpeech2 on PaddlePaddle* ASR engine!
|
||||
|
||||
|
||||
## Data Preparation
|
||||
|
||||
### Generate Manifest
|
||||
|
||||
*DeepSpeech2 on PaddlePaddle* accepts a textual **manifest** file as its data set interface. A manifest file summarizes a set of speech data, with each line containing some meta data (e.g. filepath, transcription, duration) of one audio clip, in [JSON](http://www.json.org/) format, such as:
|
||||
|
||||
```
|
||||
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
|
||||
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}
|
||||
```
|
||||
|
||||
To use your custom data, you only need to generate such manifest files to summarize the dataset. Given such summarized manifests, training, inference and all other modules can be aware of where to access the audio files, as well as their meta data including the transcription labels.
|
||||
|
||||
For how to generate such manifest files, please refer to `data/librispeech/librispeech.py`, which will download data and generate manifest files for LibriSpeech dataset.
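
If your audio and transcriptions live elsewhere, a manifest can also be written with a few lines of Python. The snippet below is only a minimal sketch under the assumption that you already have a list of (audio path, transcription) pairs; it is not one of the provided scripts:

```python
import json

import soundfile as sf

# Hypothetical inputs: (audio_filepath, transcription) pairs from your own dataset.
samples = [
    ("/path/to/audio/utt_0001.wav", "hello world"),
    ("/path/to/audio/utt_0002.wav", "speech recognition"),
]

with open("manifest.custom", "w") as fout:
    for audio_filepath, text in samples:
        audio, sample_rate = sf.read(audio_filepath)
        duration = len(audio) / float(sample_rate)
        # One JSON object per line, matching the manifest format shown above.
        fout.write(json.dumps({
            "audio_filepath": audio_filepath,
            "duration": duration,
            "text": text,
        }) + "\n")
```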
|
||||
|
||||
### Compute Mean & Stddev for Normalizer
|
||||
|
||||
To perform z-score normalization (zero-mean, unit stddev) upon audio features, we have to estimate in advance the mean and standard deviation of the features, with some training samples:
|
||||
|
||||
```bash
|
||||
python3 tools/compute_mean_std.py \
|
||||
--num_samples 2000 \
|
||||
--specgram_type linear \
|
||||
--manifest_path data/librispeech/manifest.train \
|
||||
--output_path data/librispeech/mean_std.npz
|
||||
```
|
||||
|
||||
It will compute the mean and standard deviation of the power spectrum features over 2000 randomly sampled audio clips listed in `data/librispeech/manifest.train` and save the results to `data/librispeech/mean_std.npz` for further use.
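
The normalization itself is a plain z-score transform applied per feature dimension. As a rough sketch of how the saved statistics are used (the array shapes and the `.npz` key names here are assumptions for illustration; the actual application happens inside the data pipeline):

```python
import numpy as np

# Stand-in for extracted spectrogram frames, shaped (num_frames, feature_dim).
features = np.random.rand(100, 161).astype("float32")

stats = np.load("data/librispeech/mean_std.npz")
mean, std = stats["mean"], stats["std"]          # key names assumed here
normalized = (features - mean) / (std + 1e-20)   # zero mean, unit stddev
```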
|
||||
|
||||
|
||||
### Build Vocabulary
|
||||
|
||||
A vocabulary of possible characters is required to convert the transcription into a list of token indices for training, and in decoding, to convert from a list of indices back to text again. Such a character-based vocabulary can be built with `tools/build_vocab.py`.
|
||||
|
||||
```bash
|
||||
python3 tools/build_vocab.py \
|
||||
--count_threshold 0 \
|
||||
--vocab_path data/librispeech/eng_vocab.txt \
|
||||
--manifest_paths data/librispeech/manifest.train
|
||||
```
|
||||
|
||||
It will write a vocabulary file `data/librispeech/eng_vocab.txt` built from all the transcription text in `data/librispeech/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
|
||||
|
||||
### More Help
|
||||
|
||||
For more help on arguments:
|
||||
|
||||
```bash
|
||||
python3 data/librispeech/librispeech.py --help
|
||||
python3 tools/compute_mean_std.py --help
|
||||
python3 tools/build_vocab.py --help
|
||||
```
|
||||
|
||||
## Training a model
|
||||
|
||||
`train.py` is the main caller of the training module. Examples of usage are shown below.
|
||||
|
||||
- Start training from scratch with 8 GPUs:
|
||||
|
||||
```
|
||||
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train.py
|
||||
```
|
||||
|
||||
- Start training from scratch with CPUs:
|
||||
|
||||
```
|
||||
python3 train.py --use_gpu False
|
||||
```
|
||||
- Resume training from a checkpoint:
|
||||
|
||||
```
|
||||
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
|
||||
python3 train.py \
|
||||
--init_from_pretrained_model CHECKPOINT_PATH_TO_RESUME_FROM
|
||||
```
|
||||
|
||||
For more help on arguments:
|
||||
|
||||
```bash
|
||||
python3 train.py --help
|
||||
```
|
||||
or refer to `example/librispeech/run_train.sh`.
|
||||
|
||||
|
||||
## Data Augmentation Pipeline
|
||||
|
||||
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment the speech data by synthesizing new audio clips with small random perturbations (label-invariant transformations) applied to the raw audio. You don't have to do the synthesis on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
|
||||
|
||||
Six optional augmentation components are provided to be selected, configured and inserted into the processing pipeline.
|
||||
|
||||
- Volume Perturbation
|
||||
- Speed Perturbation
|
||||
- Shifting Perturbation
|
||||
- Online Bayesian normalization
|
||||
- Noise Perturbation (need background noise audio files)
|
||||
- Impulse Response (need impulse audio files)
|
||||
|
||||
In order to inform the trainer which augmentation components are needed and in what order they should be applied, an *augmentation configuration file* in [JSON](http://www.json.org/) format has to be prepared in advance. For example:
|
||||
|
||||
```
|
||||
[{
|
||||
"type": "speed",
|
||||
"params": {"min_speed_rate": 0.95,
|
||||
"max_speed_rate": 1.05},
|
||||
"prob": 0.6
|
||||
},
|
||||
{
|
||||
"type": "shift",
|
||||
"params": {"min_shift_ms": -5,
|
||||
"max_shift_ms": 5},
|
||||
"prob": 0.8
|
||||
}]
|
||||
```
|
||||
|
||||
When the `--augment_conf_file` argument of `trainer.py` is set to the path of the above example configuration file, every audio clip in every epoch will be processed as follows: with a 60% chance, it will first be speed perturbed with a speed rate sampled uniformly between 0.95 and 1.05, and then with an 80% chance it will be shifted in time by a randomly sampled offset between -5 ms and 5 ms. Finally, the newly synthesized audio clip will be fed into the feature extractor for further training.
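
Conceptually, the pipeline walks through the configured augmentors in order and applies each one with its own probability. The sketch below illustrates only that control flow with two toy stand-in perturbations; it is not the actual augmentation implementation in the data provider:

```python
import json
import random

import numpy as np

def speed_perturb(audio, min_speed_rate, max_speed_rate, rng):
    """Toy stand-in: resample the waveform by a randomly chosen speed rate."""
    rate = rng.uniform(min_speed_rate, max_speed_rate)
    new_len = int(len(audio) / rate)
    return np.interp(np.linspace(0, len(audio) - 1, new_len),
                     np.arange(len(audio)), audio)

def shift(audio, min_shift_ms, max_shift_ms, rng, sample_rate=16000):
    """Toy stand-in: shift the waveform in time by a random offset."""
    shift_ms = rng.uniform(min_shift_ms, max_shift_ms)
    return np.roll(audio, int(shift_ms / 1000.0 * sample_rate))

def augment(audio, config_path, rng):
    """Apply each configured augmentor with its own probability, in order."""
    with open(config_path) as f:
        augmentors = json.load(f)
    for aug in augmentors:
        if rng.uniform(0.0, 1.0) < aug["prob"]:
            if aug["type"] == "speed":
                audio = speed_perturb(audio, rng=rng, **aug["params"])
            elif aug["type"] == "shift":
                audio = shift(audio, rng=rng, **aug["params"])
    return audio

rng = random.Random(0)
waveform = np.random.randn(16000)  # one second of fake 16 kHz audio
augmented = augment(waveform, "augmentation.config", rng)
```

The real pipeline additionally supports the other augmentor types listed above and draws a fresh random decision for every clip in every epoch.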
|
||||
|
||||
For other configuration examples, please refer to `conf/augmenatation.config.example`.
|
||||
|
||||
Be careful when utilizing data augmentation, as improper augmentation will harm training due to an enlarged train-test gap.
|
||||
|
||||
## Inference and Evaluation
|
||||
|
||||
### Prepare Language Model
|
||||
|
||||
A language model is required to improve the decoder's performance. We have prepared two language models (with lossy compression) for users to download and try. One is for English and the other is for Mandarin. Users can simply run this to download the prepared language models:
|
||||
|
||||
```bash
|
||||
cd models/lm
|
||||
bash download_lm_en.sh
|
||||
bash download_lm_ch.sh
|
||||
```
|
||||
|
||||
If you wish to train a better language model of your own, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips on how we prepared our English and Mandarin language models. You can take them as a reference when you train your own.
|
||||
|
||||
#### English LM
|
||||
|
||||
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training:
|
||||
|
||||
* Characters not in \['A-Za-z0-9\s'\] (\s represents whitespace characters) are removed, and Arabic numerals are converted to English words, e.g. 1000 to one thousand.
|
||||
* Repeated whitespace characters are squeezed to one, and leading whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
|
||||
* Top 400,000 most frequent words are selected to build the vocabulary and the rest are replaced with 'UNKNOWNWORD'.
|
||||
|
||||
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model is trained with the arguments '-o 5 --prune 0 1 1 1 1'. '-o 5' means the max order of the language model is 5. '--prune 0 1 1 1 1' represents the count thresholds for each order; more specifically, it prunes singletons for orders two and higher. To save disk storage we convert the arpa file to a 'trie' binary file with the arguments '-a 22 -q 8 -b 8'. '-a' represents the maximum number of leading bits of pointers in the 'trie' to chop. '-q -b' are quantization parameters for probability and backoff.
|
||||
|
||||
#### Mandarin LM
|
||||
|
||||
Different from the English language model, the Mandarin language model is character-based, where each token is a Chinese character. We use an internal corpus to train the released Mandarin language models. The corpus contains billions of tokens. The preprocessing differs slightly from that of the English language model; the main steps include:
|
||||
|
||||
* The beginning and trailing whitespace characters are removed.
|
||||
* English punctuations and Chinese punctuations are removed.
|
||||
* A whitespace character between two tokens is inserted.
|
||||
|
||||
Please note that the released language models only contain simplified Chinese characters. After the preprocessing is done we can begin to train the language model. The key training arguments for the small LM are '-o 5 --prune 0 1 2 4 4', and just '-o 5' for the large LM. Please refer to the section above for the meaning of each argument. We also convert the arpa file to a binary file using default settings.
|
||||
|
||||
### Speech-to-text Inference
|
||||
|
||||
An inference module, `infer.py`, is provided to infer, decode and visualize speech-to-text results for several given audio clips. It helps to get an intuitive and qualitative evaluation of the ASR model's performance.
|
||||
|
||||
- Inference with GPU:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0 python3 infer.py
|
||||
```
|
||||
|
||||
- Inference with CPUs:
|
||||
|
||||
```bash
|
||||
python3 infer.py --use_gpu False
|
||||
```
|
||||
|
||||
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) otherwise utilizes a heuristic breadth-first graph search for reaching a near global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with argument `--decoding_method`.
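
For intuition, CTC greedy decoding can be sketched in a few lines: take the most likely token at each timestep, collapse consecutive repeats, and drop the blank token. This is only an illustration of the algorithm (the blank index is assumed to be the last one here), not the decoder implementation shipped with the repo:

```python
import numpy as np

def ctc_greedy_decode(probs, vocab_list, blank_id=None):
    """probs: (num_timesteps, vocab_size + 1) posteriors for one utterance."""
    if blank_id is None:
        blank_id = probs.shape[1] - 1  # assume the blank token is the last index
    best_path = np.argmax(probs, axis=1)
    # Collapse consecutive repeats, then remove blanks.
    collapsed = [t for i, t in enumerate(best_path)
                 if i == 0 or t != best_path[i - 1]]
    return "".join(vocab_list[t] for t in collapsed if t != blank_id)

vocab = ["a", "b", "c", " "]
probs = np.random.rand(50, len(vocab) + 1)
probs /= probs.sum(axis=1, keepdims=True)
print(ctc_greedy_decode(probs, vocab))
```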
|
||||
|
||||
For more help on arguments:
|
||||
|
||||
```
|
||||
python3 infer.py --help
|
||||
```
|
||||
or refer to `example/librispeech/run_infer.sh`.
|
||||
|
||||
### Evaluate a Model
|
||||
|
||||
To evaluate a model's performance quantitatively, please run:
|
||||
|
||||
- Evaluation with GPUs:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 test.py
|
||||
```
|
||||
|
||||
- Evaluation with CPUs:
|
||||
|
||||
```bash
|
||||
python3 test.py --use_gpu False
|
||||
```
|
||||
|
||||
The error rate (default: word error rate; can be set with `--error_rate_type`) will be printed.
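
As a reminder of what is being measured, the word error rate is the word-level edit distance between the hypothesis and the reference transcription, normalized by the reference length. A minimal sketch (the actual implementation lives in the repository's error-rate utilities):

```python
import numpy as np

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=np.int32)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[len(ref), len(hyp)] / float(len(ref))

print(wer("a cold lucid indifference reigned in his soul",
          "a cold lucid indifference rained in his soul"))
```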
|
||||
|
||||
For more help on arguments:
|
||||
|
||||
```bash
|
||||
python3 test.py --help
|
||||
```
|
||||
or refer to `example/librispeech/run_test.sh`.
|
||||
|
||||
## Hyper-parameters Tuning
|
||||
|
||||
The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertion weight) for the [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) often have a significant impact on the decoder's performance. It would be better to re-tune them on the validation set when the acoustic model is renewed.
|
||||
|
||||
`tools/tune.py` performs a 2-D grid search over the hyper-parameters $\alpha$ and $\beta$. You must provide the ranges of $\alpha$ and $\beta$, as well as the number of attempts for each.
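
Under the hood this is an exhaustive sweep over an $\alpha \times \beta$ grid, decoding the validation data and recording the error rate at each point. A simplified sketch of the enumeration, where `evaluate_error` is a hypothetical stand-in for decoding and scoring a validation batch:

```python
import itertools

import numpy as np

def grid_search(evaluate_error, alpha_from=1.0, alpha_to=3.2, num_alphas=45,
                beta_from=0.1, beta_to=0.45, num_betas=8):
    """Evaluate every (alpha, beta) pair and return the best one."""
    alphas = np.linspace(alpha_from, alpha_to, num_alphas)
    betas = np.linspace(beta_from, beta_to, num_betas)
    results = [(alpha, beta, evaluate_error(alpha, beta))
               for alpha, beta in itertools.product(alphas, betas)]
    return min(results, key=lambda r: r[2])

# Toy stand-in objective; the real script decodes with each (alpha, beta).
best = grid_search(lambda a, b: abs(a - 2.5) + abs(b - 0.3))
print("best (alpha, beta, error):", best)
```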
|
||||
|
||||
- Tuning with GPU:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
|
||||
python3 tools/tune.py \
|
||||
--alpha_from 1.0 \
|
||||
--alpha_to 3.2 \
|
||||
--num_alphas 45 \
|
||||
--beta_from 0.1 \
|
||||
--beta_to 0.45 \
|
||||
--num_betas 8
|
||||
```
|
||||
|
||||
- Tuning with CPU:
|
||||
|
||||
```bash
|
||||
python3 tools/tune.py --use_gpu False
|
||||
```
|
||||
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and optionally draw the error surface. A proper hyper-parameter range should include the global minimum of the error surface for WER/CER, as illustrated in the following figure.
|
||||
|
||||
<p align="center">
|
||||
<img src="docs/images/tuning_error_surface.png" width=550>
|
||||
<br/>An example error surface for tuning on the dev-clean set of LibriSpeech
|
||||
</p>
|
||||
|
||||
Usually, as the figure shows, the variation of the language model weight ($\alpha$) significantly affects the performance of the CTC beam search decoder. A better procedure is to first tune on several data batches (the number can be specified) to find out the proper range of hyper-parameters, then switch to the whole validation set to carry out an accurate tuning.
|
||||
|
||||
After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation modules to see if they really help improve the ASR performance. For more help:
|
||||
|
||||
```bash
|
||||
python3 tune.py --help
|
||||
```
|
||||
or refer to `example/librispeech/run_tune.sh`.
|
||||
|
||||
## Training for Mandarin Language
|
||||
|
||||
The key steps of training for Mandarin are the same as those for English, and we have also provided an example of Mandarin training with Aishell in ```examples/aishell```. As mentioned above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` to do data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by ```./models/aishell/download_model.sh```) for users to try with ```sh run_infer_golden.sh``` and ```sh run_test_golden.sh```. Notice that, different from the English LM, the Mandarin LM is character-based; please run ```tools/tune.py``` to find an optimal setting.
|
||||
|
||||
## Trying Live Demo with Your Own Voice
|
||||
|
||||
Until now, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but not yet with your own speech. `deploy/demo_english_server.py` and `deploy/demo_client.py` help you quickly build up a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo using your own voice.
|
||||
|
||||
To start the demo's server, please run this in one console:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0 \
|
||||
python3 deploy/demo_server.py \
|
||||
--host_ip localhost \
|
||||
--host_port 8086
|
||||
```
|
||||
|
||||
On the machine (which might not be the same machine) that will run the demo's client, please do the following installation before moving on.
|
||||
|
||||
For example, on MAC OS X:
|
||||
|
||||
```bash
|
||||
brew install portaudio
|
||||
pip install pyaudio
|
||||
pip install keyboard
|
||||
```
|
||||
|
||||
Then to start the client, please run this in another console:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0 \
|
||||
python3 -u deploy/demo_client.py \
|
||||
--host_ip 'localhost' \
|
||||
--host_port 8086
|
||||
```
|
||||
|
||||
Now, in the client console, press and hold the `whitespace` key and start speaking. When you finish your utterance, release the key and the speech-to-text results will be shown in the console. To quit the client, just press the `ESC` key.
|
||||
|
||||
Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` could be run on one without any audio recording hardware, e.g. any remote server machine. Just be careful to set the `host_ip` and `host_port` arguments to the actual accessible IP address and port if the server and client are running on two separate machines. Nothing needs to be done if they are running on a single machine.
|
||||
|
||||
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. By running `examples/mandarin/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update the `--model_path` argument in the script.
|
||||
|
||||
For more help on arguments:
|
||||
|
||||
```bash
|
||||
python3 deploy/demo_server.py --help
|
||||
python3 deploy/demo_client.py --help
|
||||
```
|
||||
|
||||
## Released Models

#### Speech Model Released

Language | Model Name | Training Data | Hours of Speech
:-----------: | :------------: | :----------: | -------:
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model_fluid.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h

#### Language Model Released

Language Model | Training Data | Token-based | Size | Descriptions
:-------------:| :------------:| :-----: | -----: | :-----------------
[English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; <br/> About 1.85 billion n-grams; <br/> 'trie' binary with '-a 22 -q 8 -b 8'
[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings
[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/> 'probing' binary with default settings

## Experiments and Benchmarks

#### Benchmark Results for English Models (Word Error Rate)

Test Set | LibriSpeech Model | BaiduEN8K Model
:--------------------- | ---------------: | -------------------:
LibriSpeech Test-Clean | 6.85 | 5.41
LibriSpeech Test-Other | 21.18 | 13.85
VoxForge American-Canadian | 12.12 | 7.13
VoxForge Commonwealth | 19.82 | 14.93
VoxForge European | 30.15 | 18.64
VoxForge Indian | 53.73 | 25.51
Baidu Internal Testset | 40.75 | 8.48

To reproduce the benchmark results on VoxForge data, we provide a script to download the data and generate the VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```sh run_data.sh``` to get the VoxForge dialect manifest files. Notice that VoxForge data may keep updating, so the generated manifest files may differ from those we evaluated on.

#### Benchmark Results for Mandarin Model (Character Error Rate)

Test Set | BaiduCN1.2k Model
:--------------------- | -------------------:
Baidu Internal Testset | 12.64

#### Acceleration with Multi-GPUs

We compare the training time with 1, 2, 4 and 8 Tesla V100 GPUs (on a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds), and a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the training time (in seconds) is printed on the blue bars.

<img src="docs/images/multi_gpu_speedup.png" width=450><br/>

| # of GPU | Acceleration Rate |
| -------- | --------------: |
| 1 | 1.00 X |
| 2 | 1.98 X |
| 4 | 3.73 X |
| 8 | 6.95 X |

`tools/profile.sh` provides such a profiling tool.

## Questions and Help

You are welcome to submit questions and bug reports in [GitHub Issues](https://github.com/PaddlePaddle/DeepSpeech/issues). You are also welcome to contribute to this project.

@ -0,0 +1,36 @@

# Data Augmentation Pipeline

Data augmentation is often a highly effective technique for boosting deep learning performance. We augment our speech data by synthesizing new audio with small random perturbations (label-invariant transformations) applied to the raw audio. You don't have to do the synthesis yourself, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.

Six optional augmentation components are provided, to be selected, configured and inserted into the processing pipeline:

- Volume Perturbation
- Speed Perturbation
- Shifting Perturbation
- Online Bayesian normalization
- Noise Perturbation (requires background noise audio files)
- Impulse Response (requires impulse audio files)

To inform the trainer which augmentation components are needed and in what order they should be applied, you must prepare an *augmentation configuration file* in [JSON](http://www.json.org/) format in advance. For example:

```
[{
    "type": "speed",
    "params": {"min_speed_rate": 0.95,
               "max_speed_rate": 1.05},
    "prob": 0.6
},
{
    "type": "shift",
    "params": {"min_shift_ms": -5,
               "max_shift_ms": 5},
    "prob": 0.8
}]
```

When the `augment_conf_file` argument is set to the path of the above example configuration file, every audio clip in every epoch will be processed as follows: with a 60% chance, it will first be speed-perturbed with a speed rate sampled uniformly between 0.95 and 1.05, and then with an 80% chance it will be shifted in time by an offset sampled randomly between -5 ms and 5 ms. Finally, the newly synthesized audio clip will be fed into the feature extractor for further training.
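
Conceptually, the data provider applies each configured augmentor to a clip with its own probability. The following Python sketch illustrates that logic only; the function names are illustrative placeholders, not the project's actual augmentor classes.

```python
import json
import random

# Placeholder transforms: a real augmentor would resample or pad/crop the audio.
def speed_perturb(audio, min_speed_rate, max_speed_rate):
    rate = random.uniform(min_speed_rate, max_speed_rate)
    return audio  # resample `audio` by `rate` in a real implementation

def shift_perturb(audio, min_shift_ms, max_shift_ms):
    offset_ms = random.uniform(min_shift_ms, max_shift_ms)
    return audio  # pad/crop `audio` by `offset_ms` in a real implementation

TRANSFORMS = {"speed": speed_perturb, "shift": shift_perturb}

def augment(audio, config_path):
    """Apply each configured augmentor to `audio` with its own probability."""
    with open(config_path) as f:
        pipeline = json.load(f)
    for entry in pipeline:
        if random.random() < entry["prob"]:
            audio = TRANSFORMS[entry["type"]](audio, **entry["params"])
    return audio
```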

For other configuration examples, please refer to `examples/conf/augmentation.config.example`.

Be careful when using data augmentation: improper augmentation can harm training by enlarging the gap between the training and test distributions.

@ -0,0 +1,16 @@

# Benchmarks

## Acceleration with Multi-GPUs

We compare the training time with 1, 2, 4 and 8 Tesla V100 GPUs (on a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds), and a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the training time (in seconds) is printed on the blue bars.

<img src="docs/images/multi_gpu_speedup.png" width=450><br/>

| # of GPU | Acceleration Rate |
| -------- | --------------: |
| 1 | 1.00 X |
| 2 | 1.98 X |
| 4 | 3.73 X |
| 8 | 6.95 X |

`utils/profile.sh` provides such a demo profiling tool; you can adapt it as needed.

@ -0,0 +1,43 @@

# Data Preparation

## Generate Manifest

*DeepSpeech2 on PaddlePaddle* accepts a textual **manifest** file as its data set interface. A manifest file summarizes a set of speech data, with each line containing the meta data (e.g. file path, transcription, duration) of one audio clip in [JSON](http://www.json.org/) format, such as:

```
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}
```

To use your custom data, you only need to generate such manifest files to summarize the dataset. Given such manifests, training, inference and all other modules can locate the audio files as well as their meta data, including the transcription labels.

For how to generate such manifest files, please refer to `examples/librispeech/local/librispeech.py`, which downloads the data and generates manifest files for the LibriSpeech dataset.
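
For custom data, a minimal generator only needs to emit one JSON line per clip in the format above. The sketch below is an illustration rather than the project's own manifest builder; it assumes `soundfile` is available for reading audio durations.

```python
import json

import soundfile  # assumed available for reading audio durations

def write_manifest(entries, manifest_path):
    """Write one JSON line per (audio_filepath, transcription) pair."""
    with open(manifest_path, "w", encoding="utf-8") as fout:
        for audio_filepath, text in entries:
            samples, sample_rate = soundfile.read(audio_filepath)
            record = {
                "audio_filepath": audio_filepath,
                "duration": round(len(samples) / sample_rate, 3),
                "text": text,
            }
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```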

## Compute Mean & Stddev for Normalizer

To perform z-score normalization (zero mean, unit stddev) on audio features, we have to estimate the mean and standard deviation of the features in advance from some training samples:

```bash
python3 utils/compute_mean_std.py \
--num_samples 2000 \
--specgram_type linear \
--manifest_path examples/librispeech/data/manifest.train \
--output_path examples/librispeech/data/mean_std.npz
```

It will compute the mean and standard deviation of the power spectrum features over 2000 randomly sampled audio clips listed in `examples/librispeech/data/manifest.train` and save the results to `examples/librispeech/data/mean_std.npz` for further use.
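
At feature-extraction time these statistics are used to z-score every spectrogram. A minimal sketch of that step follows; the key names `mean` and `std` are assumptions about the archive layout, so check the feature normalizer code for the exact names.

```python
import numpy as np

# Assumed key names inside mean_std.npz; verify against the feature normalizer.
stats = np.load("examples/librispeech/data/mean_std.npz")
mean, std = stats["mean"], stats["std"]

def normalize(features, eps=1e-20):
    """Z-score normalize a [num_frames, feature_dim] spectrogram."""
    return (features - mean) / (std + eps)
```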

## Build Vocabulary

A vocabulary of possible characters is required to convert a transcription into a list of token indices for training and, in decoding, to convert a list of indices back into text. Such a character-based vocabulary can be built with `utils/build_vocab.py`.

```bash
python3 utils/build_vocab.py \
--count_threshold 0 \
--vocab_path examples/librispeech/data/eng_vocab.txt \
--manifest_paths examples/librispeech/data/manifest.train
```

It will write a vocabulary file `examples/librispeech/data/eng_vocab.txt` built from all transcription text in `examples/librispeech/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
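
Conceptually, building the vocabulary is just counting characters across the manifest transcriptions and keeping those above the threshold. The following is a simplified sketch of that idea, not the actual `utils/build_vocab.py`; the real script's filtering and ordering rules may differ.

```python
import json
from collections import Counter

def build_vocab(manifest_paths, vocab_path, count_threshold=0):
    """Count characters in manifest transcriptions and write one token per line."""
    counter = Counter()
    for manifest_path in manifest_paths:
        with open(manifest_path, encoding="utf-8") as fin:
            for line in fin:
                counter.update(json.loads(line)["text"])
    with open(vocab_path, "w", encoding="utf-8") as fout:
        for token, count in sorted(counter.items()):
            if count > count_threshold:
                fout.write(token + "\n")
```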
@ -0,0 +1,80 @@

# Getting Started

Several shell scripts provided in `./examples/tiny/local` will help you quickly try out most of the major modules, including data preparation, model training, case inference and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you understand how to make the toolkit work with your own data.

Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `batch_size` to fit.

Let's take a tiny sampled subset of the [LibriSpeech dataset](http://www.openslr.org/12/) as an example.

- Go to the directory

```bash
cd examples/tiny
```

Notice that this is only a toy example with a tiny sampled subset of LibriSpeech. If you would like to try the complete dataset (which would take several days of training), please go to `examples/librispeech` instead.

- Source the environment

```bash
source path.sh
```

**You must do this before doing anything else.**

It sets `MAIN_ROOT` to the project directory and uses the `deepspeech2` model by default; you can change this in the script.

- Main entrypoint

```bash
bash run.sh
```

This is just a demo; please make sure every `step` works before moving on to the next `step`.

More detailed information is provided in the following sections. Wish you a happy journey with the *DeepSpeech on PaddlePaddle* ASR engine!

## Training a model

The key steps of training for Mandarin are the same as those for English, and we also provide an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference respectively. We have also prepared a pre-trained model (downloaded by local/download_model.sh) for users to try with ```sh infer_golden.sh``` and ```sh test_golden.sh```. Note that, unlike the English LM, the Mandarin LM is character-based; please run ```local/tune.sh``` to find an optimal setting.

## Speech-to-text Inference

An inference module, `infer.py`, is provided to infer, decode and visualize speech-to-text results for several given audio clips. It may help to give an intuitive and qualitative evaluation of the ASR model's performance.

```bash
CUDA_VISIBLE_DEVICES=0 bash local/infer.sh
```

We provide two types of CTC decoders: a *CTC greedy decoder* and a *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting the most likely token at each timestep, and is thus greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) instead uses a heuristic breadth-first graph search to approach global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the `decoding_method` argument.
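
To make the difference concrete, best-path (greedy) decoding just takes the argmax token per frame, collapses consecutive repeats, and drops the blank symbol. The sketch below is a generic illustration of that algorithm, not the project's decoder implementation; `vocab` and `blank_id` are assumed inputs.

```python
import numpy as np

def ctc_greedy_decode(probs, vocab, blank_id=0):
    """Best-path CTC decoding.

    probs: [num_timesteps, vocab_size] per-frame token probabilities.
    vocab: mapping from token id to character (blank assumed at `blank_id`).
    """
    best_path = np.argmax(probs, axis=1)
    chars = []
    prev_id = None
    for token_id in best_path:
        # Collapse consecutive repeats, then remove the blank token.
        if token_id != prev_id and token_id != blank_id:
            chars.append(vocab[token_id])
        prev_id = token_id
    return "".join(chars)
```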

## Evaluate a Model

To evaluate a model's performance quantitatively, please run:

```bash
CUDA_VISIBLE_DEVICES=0 bash local/test.sh
```

The error rate (word error rate by default; can be set with `error_rate_type`) will be printed.

For more help on arguments:

## Hyper-parameters Tuning

The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertion weight) for the [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) often have a significant impact on the decoder's performance. It is better to re-tune them on the validation set whenever the acoustic model is renewed.

`tune.py` performs a 2-D grid search over the hyper-parameters $\alpha$ and $\beta$. You must provide the range of $\alpha$ and $\beta$, as well as the number of attempts for each.
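
In essence, the grid search decodes the validation set once per ($\alpha$, $\beta$) point and keeps the point with the lowest error rate. The sketch below shows that loop under the assumption of a `decode_and_score(alpha, beta)` helper, which is hypothetical here; `tune.py` is the actual tool.

```python
import numpy as np

def grid_search(decode_and_score, alpha_range=(1.0, 3.2), beta_range=(0.1, 0.45),
                num_alphas=45, num_betas=8):
    """Return the (alpha, beta, error) with the lowest error on the grid.

    `decode_and_score(alpha, beta)` is an assumed helper that decodes the
    validation set with the given weights and returns its WER or CER.
    """
    best = (None, None, float("inf"))
    for alpha in np.linspace(*alpha_range, num=num_alphas):
        for beta in np.linspace(*beta_range, num=num_betas):
            error = decode_and_score(alpha, beta)
            if error < best[2]:
                best = (alpha, beta, error)
    return best
```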

```bash
CUDA_VISIBLE_DEVICES=0 bash local/tune.sh
```

The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and optionally draw the error surface. A proper hyper-parameter range should include the global minimum of the error surface for WER/CER, as illustrated in the following figure.

<p align="center">
<img src="docs/images/tuning_error_surface.png" width=550>
<br/>An example error surface for tuning on the dev-clean set of LibriSpeech
</p>

Usually, as the figure shows, the variation of the language model weight ($\alpha$) significantly affects the performance of the CTC beam search decoder. A better procedure is to first tune on several data batches (the number can be specified) to find the proper range of the hyper-parameters, and then switch to the whole validation set to carry out an accurate tuning.

After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation modules to see if they really help improve the ASR performance. For more help

@ -0,0 +1,81 @@

# Installation

To avoid the trouble of environment setup, [running in a Docker container](#running-in-docker-container) is highly recommended. Otherwise, follow the guidelines below to install the dependencies manually.

## Prerequisites
- Python >= 3.7
- PaddlePaddle 2.0.0 or later (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))

## Setup

- Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:

```bash
sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev
```

or via `yum`:

```bash
sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel python3-devel
wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
xz -d flac-1.3.1.tar.xz
tar -xvf flac-1.3.1.tar
cd flac-1.3.1
./configure
make
make install
```

- Run the setup script for the remaining dependencies

```bash
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
pushd tools; make; popd
source tools/venv/bin/activate
bash setup.sh
```

- Source the venv before running any experiment.

```bash
source tools/venv/bin/activate
```

## Running in Docker Container

Docker is an open source tool for building, shipping, and running distributed applications in an isolated environment. A Docker image for this project is provided on [hub.docker.com](https://hub.docker.com) with all the dependencies installed, including the pre-built PaddlePaddle, CTC decoders, and other necessary Python and third-party packages. This Docker image requires NVIDIA GPU support, so please make sure a GPU is available and [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.

Take the following steps to launch the Docker image:

- Download the Docker image

For example, pull the paddle 2.0.0 image:

```bash
nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.0.0-gpu-cuda10.1-cudnn7
```

- Clone this repository

```bash
git clone https://github.com/PaddlePaddle/DeepSpeech.git
```

- Run the Docker image

```bash
sudo nvidia-docker run --rm -it -v $(pwd)/DeepSpeech:/DeepSpeech registry.baidubce.com/paddlepaddle/paddle:2.0.0-gpu-cuda10.1-cudnn7 /bin/bash
```

Now you can execute training, inference and hyper-parameter tuning in the Docker container.

- Install PaddlePaddle

For example, for CUDA 10.1 and cuDNN 7.5, install paddle 2.0.0:

```bash
python3 -m pip install paddlepaddle-gpu==2.0.0
```

@ -0,0 +1,31 @@

# Prepare Language Model

A language model is required to improve the decoder's performance. We have prepared two language models (with lossy compression) for users to download and try: one for English and one for Mandarin. Users can simply run the following to download the prepared language models:

```bash
cd examples/aishell
source path.sh
bash local/download_lm_ch.sh
```

If you wish to train your own, better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips on how we prepared our English and Mandarin language models; you can take them as a reference when training your own.

## English LM

The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training (a rough sketch follows the list):

* Characters not in \['A-Za-z0-9\s'\] (\s represents whitespace characters) are removed, and Arabic numerals are converted to English words, e.g. 1000 to one thousand.
* Repeated whitespace characters are squeezed to one, and leading whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
* The top 400,000 most frequent words are selected to build the vocabulary, and the rest are replaced with 'UNKNOWNWORD'.
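
A rough Python sketch of the first two normalization steps is shown below; it only illustrates the rules above, and the `num2words` package used for spelling out numbers is an assumption, not necessarily the tool we actually used.

```python
import re

from num2words import num2words  # assumed helper for number-to-words conversion

def normalize_english_line(line):
    """Lowercase, spell out numbers, strip punctuation, squeeze whitespace."""
    line = line.lower()
    # Convert Arabic numerals to English words, e.g. "1000" -> "one thousand".
    line = re.sub(r"\d+", lambda m: num2words(int(m.group())), line)
    # Keep only lowercase letters, digits and whitespace.
    line = re.sub(r"[^a-z0-9\s]", " ", line)
    # Squeeze repeated whitespace and drop leading/trailing whitespace.
    return re.sub(r"\s+", " ", line).strip()
```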

Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model is trained with the arguments '-o 5 --prune 0 1 1 1 1'. '-o 5' means the max order of the language model is 5. '--prune 0 1 1 1 1' gives the count thresholds for each order; more specifically, it prunes singletons for orders two and higher. To save disk storage, we convert the arpa file to a 'trie' binary file with the arguments '-a 22 -q 8 -b 8'. '-a' is the maximum number of leading bits of pointers in the 'trie' to chop, and '-q -b' are quantization parameters for probability and backoff.

## Mandarin LM

Different from the English language model, the Mandarin language model is character-based, where each token is a Chinese character. We use an internal corpus to train the released Mandarin language models. The corpus contains billions of tokens. The preprocessing differs slightly from that for the English language model; the main steps include (a rough sketch follows the list):

* Leading and trailing whitespace characters are removed.
* English and Chinese punctuation marks are removed.
* A whitespace character is inserted between every two tokens.
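
The sketch below illustrates these steps in Python; it is only an approximation of the preprocessing described above, and the exact punctuation set used for the released models is internal.

```python
import re

# Anything that is neither a word character nor whitespace is treated as
# punctuation; this covers both English and Chinese punctuation marks.
_PUNCT = re.compile(r"[^\w\s]")

def normalize_mandarin_line(line):
    """Strip the line, drop punctuation, and put one space between tokens."""
    line = _PUNCT.sub("", line.strip())
    tokens = [ch for ch in line if not ch.isspace()]
    return " ".join(tokens)
```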

Please note that the released language models only contain simplified Chinese characters. After preprocessing is done, we can begin to train the language model. The key training arguments are '-o 5 --prune 0 1 2 4 4' for the small LM and '-o 5' for the large LM. Please refer to the section above for the meaning of each argument. We also convert the arpa file to a binary file using default settings.

@ -0,0 +1,9 @@

# Released Models

## Language Model Released

Language Model | Training Data | Token-based | Size | Descriptions
:-------------:| :------------:| :-----: | -----: | :-----------------
[English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; <br/> About 1.85 billion n-grams; <br/> 'trie' binary with '-a 22 -q 8 -b 8'
[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings
[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/> 'probing' binary with default settings

@ -0,0 +1,4 @@

data
ckpt*
demo_cache
*.log