update deepspeech to fluid api

pull/369/head
lfchener 5 years ago
parent d2bdd254a3
commit d74f4ff3f5

@ -1,6 +1,6 @@
# DeepSpeech2 on PaddlePaddle
*DeepSpeech2 on PaddlePaddle* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf) and built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition via an easy-to-use, efficient and scalable implementation, including training, inference & testing modules, distributed [PaddleCloud](https://github.com/PaddlePaddle/cloud) training, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
*DeepSpeech2 on PaddlePaddle* is an open-source implementation of an end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf) and built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial applications and academic research on speech recognition via an easy-to-use, efficient and scalable implementation, including training, inference & testing modules, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
## Table of Contents
- [Installation](#installation)
@ -10,7 +10,6 @@
- [Data Augmentation Pipeline](#data-augmentation-pipeline)
- [Inference and Evaluation](#inference-and-evaluation)
- [Running in Docker Container](#running-in-docker-container)
- [Distributed Cloud Training](#distributed-cloud-training)
- [Hyper-parameters Tuning](#hyper-parameters-tuning)
- [Training for Mandarin Language](#training-for-mandarin-language)
- [Trying Live Demo with Your Own Voice](#trying-live-demo-with-your-own-voice)
@ -22,13 +21,45 @@
## Installation
Because this project was developed with the PaddlePaddle V2 API, which is no longer officially maintained, we only support [running it in a Docker container](#running-in-docker-container) instead of building the environment from source code. We are going to release an update to the latest Paddle Fluid API very soon, so please keep an eye on this project.
To avoid the trouble of environment setup, [running in a Docker container](#running-in-docker-container) is highly recommended. Otherwise, follow the guidelines below to install the dependencies manually.
### Prerequisites
- Only Python 2.7 is supported
- The latest version of PaddlePaddle (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/beginners_guide/install/index_en.html))
### Setup
- Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:
```bash
sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig
```
or, installing them via `yum`:
```bash
sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel
wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
xz -d flac-1.3.1.tar.xz
tar -xvf flac-1.3.1.tar
cd flac-1.3.1
./configure
make
make install
```
- Run the setup script for the remaining dependencies
```bash
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
sh setup.sh
```
## Getting Started
Several shell scripts provided in `./examples` will help you quickly try out most major modules, including data preparation, model training, case inference and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you understand how to make it work with your own data.
Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES` and `--trainer_count`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `--batch_size` to fit, for example as shown below.
Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `--batch_size` to fit, for example as shown below.
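For instance, on a machine with only two GPUs you might invoke the training entry point along these lines (the flag values here are only an illustration, not recommended settings):
```bash
# run on GPUs 0 and 1 with a smaller batch size
CUDA_VISIBLE_DEVICES=0,1 python train.py --batch_size 32
# or, with no GPU at all
python train.py --use_gpu False --batch_size 16
```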
Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org/12/) for instance.
@ -45,7 +76,7 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
sh run_data.sh
```
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `~/.cache/paddle/dataset/speech/libri` and the corresponding manifest files generated in `./data/tiny`, as well as a mean-stddev file and a vocabulary file. It has to be run only the very first time you use this dataset, and the results are reusable for all further experiments.
`run_data.sh` will download the dataset, generate manifests, collect the normalizer's statistics and build the vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `./dataset/librispeech` and the corresponding manifest files generated in `./data/tiny`, as well as a mean-stddev file and a vocabulary file. It has to be run only the very first time you use this dataset, and the results are reusable for all further experiments.
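Each line of a generated manifest is a JSON object describing one utterance, along the lines of the following (paths will differ on your machine):
```
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
```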
- Train your own ASR model
```bash
@ -139,20 +170,20 @@ python tools/build_vocab.py --help
- Start training from scratch with 8 GPUs:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
```
- Start training from scratch with 16 CPUs:
- Start training from scratch with CPUs:
```
python train.py --use_gpu False --trainer_count 16
python train.py --use_gpu False
```
- Resume training from a checkpoint:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python train.py \
--init_model_path CHECKPOINT_PATH_TO_RESUME_FROM
--init_from_pretrain_model CHECKPOINT_PATH_TO_RESUME_FROM
```
For more help on arguments:
@ -162,6 +193,7 @@ python train.py --help
```
or refer to `example/librispeech/run_train.sh`.
## Data Augmentation Pipeline
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment our speech data by synthesizing new audio with small random perturbations (label-invariant transformations) added to the raw audio. You don't have to do the synthesis on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
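Augmentation is configured with a JSON file, such as the `conf/augmenatation.config.example` file mentioned later in this document. As a hedged illustration of the general shape only (the exact keys and value ranges below are assumptions, not a copy of the shipped example), each entry names a perturbation type, its parameters and an application probability:
```
[
  {
    "type": "speed",
    "params": {"min_speed_rate": 0.95, "max_speed_rate": 1.05},
    "prob": 0.5
  },
  {
    "type": "shift",
    "params": {"min_shift_ms": -5, "max_shift_ms": 5},
    "prob": 1.0
  }
]
```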
@ -206,8 +238,8 @@ A language model is required to improve the decoder's performance. We have prepa
```bash
cd models/lm
sh download_lm_en.sh
sh download_lm_ch.sh
bash download_lm_en.sh
bash download_lm_ch.sh
```
If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips to show how we prepared our English and Mandarin language models. You can take them as a reference when you train your own.
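As a rough, hedged sketch of the kind of commands involved (these are standard KenLM tools, not scripts shipped with this project; the n-gram order, pruning and file names are placeholders):
```bash
# build a 5-gram ARPA model from a normalized corpus,
# then convert it to a 'probing' binary for fast loading
lmplz -o 5 --prune 0 1 1 1 1 < corpus_normalized.txt > lm.arpa
build_binary probing lm.arpa lm.binary
```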
@ -216,7 +248,7 @@ If you wish to train your own better language model, please refer to [KenLM](htt
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and can be downloaded from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training (a small illustrative sketch follows this list):
* Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed, and Arabic numerals are converted to English words, e.g. 1000 becomes one thousand.
* Characters not in \['A-Za-z0-9\s'\] (\s represents whitespace characters) are removed, and Arabic numerals are converted to English words, e.g. 1000 becomes one thousand.
* Repeated whitespace characters are squeezed to one, and leading whitespace is removed. Note that all transcriptions are lowercase, so all characters are converted to lowercase.
* The top 400,000 most frequent words are selected to build the vocabulary, and the rest are replaced with 'UNKNOWNWORD'.
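A hedged sketch of what such normalization might look like (illustrative only, not the script used to build the released models; the number-to-word conversion is stubbed out for a single case):
```python
import re

def normalize_for_lm(line):
    """Illustrative corpus normalization: lowercase, keep only [a-z0-9\s'],
    squeeze repeated whitespace and strip leading/trailing spaces."""
    line = line.lower()
    # a real pipeline would spell out every number; here only a toy example
    line = line.replace("1000", "one thousand")
    line = re.sub(r"[^a-z0-9\s']", " ", line)
    line = re.sub(r"\s+", " ", line).strip()
    return line

print(normalize_for_lm("He paid  $1000, didn't he?"))  # "he paid one thousand didn't he"
```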
@ -239,13 +271,13 @@ An inference module caller `infer.py` is provided to infer, decode and visualize
- Inference with GPU:
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py --trainer_count 1
CUDA_VISIBLE_DEVICES=0 python infer.py
```
- Inference with CPUs:
```bash
python infer.py --use_gpu False --trainer_count 12
python infer.py --use_gpu False
```
We provide two types of CTC decoders: the *CTC greedy decoder* and the *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting the most likely token at each timestep, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873), by contrast, utilizes a heuristic breadth-first graph search to reach a near-global optimum; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the `--decoding_method` argument.
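To make the difference concrete, here is a minimal, illustrative sketch of greedy (best-path) CTC decoding; the function and variable names are hypothetical and this is not the decoder implementation shipped with this project:
```python
import numpy as np

def ctc_greedy_decode(probs_seq, vocabulary, blank_id):
    """Best-path decoding: take the argmax token at every timestep,
    collapse consecutive repeats, then remove blank tokens.

    probs_seq: array of shape (num_timesteps, num_tokens).
    vocabulary: list mapping token index -> character.
    """
    best_path = np.argmax(probs_seq, axis=1)
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank_id:
            decoded.append(vocabulary[idx])
        prev = idx
    return ''.join(decoded)
```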
@ -264,13 +296,13 @@ To evaluate a model's performance quantitatively, please run:
- Evaluation with GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py
```
- Evaluation with CPUs:
```bash
python test.py --use_gpu False --trainer_count 12
python test.py --use_gpu False
```
The error rate (default: word error rate; can be set with `--error_rate_type`) will be printed.
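For reference, the word error rate is the word-level edit distance between the hypothesis and the reference transcription, divided by the number of reference words. A minimal illustrative computation (not the utility actually used by `test.py`) could look like this:
```python
def wer(reference, hypothesis):
    """Word error rate = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return float(dp[len(ref)][len(hyp)]) / len(ref)

# e.g. wer("a cold lucid indifference", "a cold lucid in difference") == 0.5
```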
@ -293,7 +325,6 @@ The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertio
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python tools/tune.py \
--trainer_count 8 \
--alpha_from 1.0 \
--alpha_to 3.2 \
--num_alphas 45 \
@ -332,7 +363,7 @@ Take several steps to launch the Docker image:
- Download the Docker image
```bash
nvidia-docker pull paddlepaddle/deep_speech:latest-gpu
nvidia-docker pull hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu
```
- Clone this repository
@ -344,72 +375,10 @@ git clone https://github.com/PaddlePaddle/DeepSpeech.git
- Run the Docker image
```bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech paddlepaddle/deep_speech:latest-gpu /bin/bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
```
Now go back and start from the [Getting Started](#getting-started) section; you can execute training, inference and hyper-parameter tuning in the Docker container in the same way.
## Distributed Cloud Training
We also provide a cloud training module for users to do the distributed cluster training on [PaddleCloud](https://github.com/PaddlePaddle/cloud), to achieve a much faster training speed with multiple machines. To start with this, please first install PaddleCloud client and register a PaddleCloud account, as described in [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud).
Please take the following steps to submit a training job:
- Go to directory:
```bash
cd cloud
```
- Upload data:
Data must be uploaded to PaddleCloud filesystem to be accessed within a cloud job. `pcloud_upload_data.sh` helps do the data packing and uploading:
```bash
sh pcloud_upload_data.sh
```
Given input manifests, `pcloud_upload_data.sh` will:
- Extract the audio files listed in the input manifests.
- Pack them into a specified number of tar files.
- Upload these tar files to PaddleCloud filesystem.
- Create cloud manifests by replacing local filesystem paths with PaddleCloud filesystem paths. New manifests will be used to inform the cloud jobs of audio files' location and their meta information.
It needs to be done only once, the very first time you do cloud training. Later, the data is kept persistent on the cloud filesystem and is reusable for further job submissions.
For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Configure training arguments:
Configure the cloud job parameters in `pcloud_submit.sh` (e.g. `NUM_NODES`, `NUM_GPUS`, `CLOUD_TRAIN_DIR`, `JOB_NAME` etc.) and then configure other hyper-parameters for training in `pcloud_train.sh` (just as what you do for local training).
For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Submit the job:
By running:
```bash
sh pcloud_submit.sh
```
a training job has been submitted to PaddleCloud, with the job name printed to the console.
- Get training logs
Run this to list all the jobs you have submitted, as well as their running status:
```bash
paddlecloud get jobs
```
Run this, the corresponding job's logs will be printed.
```bash
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
For more information about the DeepSpeech2 training on PaddleCloud, please refer to
[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
## Training for Mandarin Language
@ -417,14 +386,13 @@ The key steps of training for Mandarin language are same to that of English lang
## Trying Live Demo with Your Own Voice
Until now, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but it has not yet been tested with your own speech. `deploy/demo_server.py` and `deploy/demo_client.py` help you quickly build up a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo using your own voice.
Until now, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but it has not yet been tested with your own speech. `deploy/demo_english_server.py` and `deploy/demo_client.py` help you quickly build up a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo using your own voice.
To start the demo's server, please run this in one console:
```bash
CUDA_VISIBLE_DEVICES=0 \
python deploy/demo_server.py \
--trainer_count 1 \
--host_ip localhost \
--host_port 8086
```
@ -436,7 +404,7 @@ For example, on MAC OS X:
```bash
brew install portaudio
pip install pyaudio
pip install pynput
pip install keyboard
```
Then to start the client, please run this in another console:
@ -452,7 +420,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking
Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can be run on one without any audio recording hardware, e.g. any remote server machine. If the server and client run on two separate machines, just be careful to set the `host_ip` and `host_port` arguments to an actually accessible IP address and port, as in the example below. Nothing needs to be done if they run on a single machine.
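For example, if the server were running on a remote machine reachable at 192.168.1.10 (a placeholder address, not a real deployment), the client might be started like this:
```bash
python -u deploy/demo_client.py \
--host_ip 192.168.1.10 \
--host_port 8086
```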
Please also refer to `examples/mandarin/run_demo_server.sh`, which will first download a pre-trained Mandarin model (trained with 3000 hours of internal speech data) and then start the demo server with the model. By running `examples/mandarin/run_demo_client.sh`, you can speak Mandarin to test it. If you would like to try some other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. By running `examples/mandarin/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update the `--model_path` argument in the script.
For more help on arguments:
@ -467,10 +435,10 @@ python deploy/demo_client.py --help
Language | Model Name | Training Data | Hours of Speech
:-----------: | :------------: | :----------: | -------:
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model_fluid.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
#### Language Model Released
@ -504,17 +472,16 @@ Baidu Internal Testset | 12.64
#### Acceleration with Multi-GPUs
We compare the training time with 1, 2, 4, 8, 16 Tesla K40m GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). And it shows that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
We compare the training time with 1, 2, 4, 8 Tesla V100 GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). And it shows that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
<img src="docs/images/multi_gpu_speedup.png" width=450><br/>
| # of GPU | Acceleration Rate |
| -------- | --------------: |
| 1 | 1.00 X |
| 2 | 1.97 X |
| 4 | 3.74 X |
| 8 | 6.21 X |
|16 | 10.70 X |
| 2 | 1.98 X |
| 4 | 3.73 X |
| 8 | 6.95 X |
`tools/profile.sh` provides such a profiling tool.

@ -1,7 +1,7 @@
# DeepSpeech2
# Speech Recognition: DeepSpeech2
*DeepSpeech2* is an open-source project of an end-to-end automatic speech recognition (ASR) engine built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform; for the underlying principles, please refer to [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf).
Our vision is to provide easy-to-use, efficient and scalable tools for speech recognition in both industrial applications and academic research, including training, inference and testing modules, distributed [PaddleCloud](https://github.com/PaddlePaddle/cloud) training, and demo deployment. We will also release several pre-trained English and Mandarin models.
*Speech Recognition: DeepSpeech2* is an open-source project of an end-to-end automatic speech recognition (ASR) engine built on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform; for the underlying principles, please refer to [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf).
Our vision is to provide easy-to-use, efficient and scalable tools for speech recognition in both industrial applications and academic research, including training, inference and testing modules as well as demo deployment. We will also release several pre-trained English and Mandarin models.
## Table of Contents
- [Installation](#安装)
@ -11,7 +11,6 @@
- [Data Augmentation Pipeline](#数据增强管道)
- [Inference and Evaluation](#推断和评价)
- [Running in Docker Container](#在Docker容器上运行)
- [Distributed Cloud Training](#分布式云训练)
- [Hyper-parameters Tuning](#超参数调整)
- [Training for Mandarin Language](#训练汉语语言)
- [Trying Live Demo with Your Own Voice](#用自己的声音尝试现场演示)
@ -20,18 +19,49 @@
- [Questions and Help](#问题和帮助)
## Installation
To avoid trouble with environment setup, [running in a Docker container](#在Docker容器上运行) is strongly recommended; otherwise, please follow the guidelines below to install the dependencies.
Because this project was developed on the PaddlePaddle V2 API, which is no longer officially maintained, we currently only support [running the project in a Docker container](#在Docker容器上运行) and do not support building the environment from source. We will upgrade this project to the latest Paddle Fluid API very soon, so please stay tuned.
### Prerequisites
- Only Python 2.7 is supported
- The latest version of PaddlePaddle (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/start))
### Setup
- Make sure the following libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. by installing them via `apt-get`:
```bash
sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig
```
Or, you can also install them via `yum`:
```bash
sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel
wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
xz -d flac-1.3.1.tar.xz
tar -xvf flac-1.3.1.tar
cd flac-1.3.1
./configure
make
make install
```
- Run the setup script to install the remaining dependencies
```bash
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
sh setup.sh
```
## Getting Started
Some shell scripts in `./examples` will help you quickly try out most major modules on a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)), covering data preparation, model training, case inference and model evaluation. Reading these examples will also help you understand how to apply them to your own dataset.
Some shell scripts in `./examples` will help you quickly try out most major modules on a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)), covering data preparation, model training, case inference and model evaluation. Reading these examples will also help you understand how to train a model with your own dataset.
Some scripts in the `./examples` directory are configured to use 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES` and `--trainer_count`. If you don't have any GPU available, please set `--use_gpu` to False so the program uses CPUs instead. In addition, if an out-of-memory problem occurs, just reduce `--batch_size`.
Some scripts in the `./examples` directory are configured to use 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES`. If you don't have any GPU available, please set `--use_gpu` to False so the program uses CPUs instead. In addition, if an out-of-memory problem occurs, just reduce `--batch_size`.
Let's first look at an example on a tiny sampled subset of the [LibriSpeech dataset](http://www.openslr.org/12/).
- Go to the directory
- Go to the directory
```bash
cd examples/tiny
@ -44,21 +74,21 @@
sh run_data.sh
```
Running the `run_data.sh` script will download the dataset, generate manifest files, collect the statistics needed for normalization, and build the vocabulary. Once data preparation is done, the downloaded data (only a portion of LibriSpeech) is located in `~/.cache/paddle/dataset/speech/libri`, and the corresponding manifest files, the mean/stddev file and the vocabulary file are in `./data/tiny`. This script must be executed the very first time you use this dataset, which is then reused in all subsequent experiments.
Running the `run_data.sh` script will download the dataset, generate manifest files, collect the statistics needed for normalization, and build the vocabulary. Once data preparation is done, the downloaded data (only a portion of LibriSpeech) is located in `dataset/librispeech`, and the corresponding manifest files, the mean/stddev file and the vocabulary file are in `./data/tiny`. This script must be executed the very first time you use this dataset, which is then reused in all subsequent experiments.
- Train your own ASR model
```bash
sh run_train.sh
```
`run_train.sh` will start the training task; training logs are printed to stdout, and a checkpoint of the model is saved to the `./checkpoints/tiny` directory every epoch. These checkpoints can be used to resume training, as well as for inference, evaluation and deployment.
`run_train.sh` will start the training task; training logs are printed to the terminal, and the model checkpoint of every epoch is saved to the `./checkpoints/tiny` directory. These checkpoints can be used to resume training, as well as for inference, evaluation and deployment.
- Run case inference with an existing model
```bash
sh run_infer.sh
```
`run_infer.sh` will use the trained model to show the speech-to-text decoding results for a few samples (10 by default). Since the current model was only trained on a portion of the LibriSpeech dataset, its performance may not be very good. To see the performance of a better model, you can download a well-trained model (trained on the full LibriSpeech for several days) and use it for inference.
`run_infer.sh` will use the trained model to show the speech-to-text decoding results for a few samples (10 by default). Since the current model was only trained on a portion of the LibriSpeech dataset, its performance may not be very good. To see the performance of a better model, you can download a well-trained model (trained on the full LibriSpeech for several days) and use it for inference.
```bash
sh run_infer_golden.sh
@ -82,20 +112,20 @@
### Generate Manifest
*Speech Recognition: DeepSpeech2* accepts textual **manifest** files as its data interface. A manifest file contains a set of speech data, with each line holding the metadata of one audio clip (e.g. file path, transcription, duration) in JSON format, as follows:
*Speech Recognition: DeepSpeech2* accepts textual **manifest** files as its data interface. A manifest file contains a set of speech data, with each line holding the metadata of one audio clip (e.g. file path, transcription, duration) in [JSON](http://www.json.org/) format, as follows:
```
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}
```
If you want to use custom data, you only need to generate manifest files in the above format. Given the manifest files, training, inference and all other modules can obtain the audio data together with its metadata.
If you want to use custom data, you only need to generate manifest files in the above format. Given the manifest files, training, inference and all other modules can access the audio data as well as the corresponding duration and label data.
For how to generate manifest files, please refer to `data/librispeech/librispeech.py`, which downloads the LibriSpeech dataset and generates the manifest files.
### Compute the Mean and Stddev for Normalization
To perform z-score normalization (zero mean, unit stddev) on audio features, we have to estimate the mean and standard deviation of the features from some training samples in advance:
To perform z-score normalization (zero mean, unit stddev) on audio features, we have to estimate the mean and standard deviation of the features from the training samples in advance:
```bash
python tools/compute_mean_std.py \
@ -105,11 +135,11 @@ python tools/compute_mean_std.py \
--output_path data/librispeech/mean_std.npz
```
The code above computes the mean and standard deviation of the power-spectrum features of 2000 randomly sampled audio clips listed in `data/librispeech/manifest.train`, and saves the results to `data/librispeech/mean_std.npz` for later use.
The code above computes the mean and standard deviation of the spectrogram features of 2000 randomly sampled utterances listed in `data/librispeech/manifest.train`, and saves the results to `data/librispeech/mean_std.npz` for later use.
### Build the Vocabulary
A vocabulary of the characters that may appear is needed to convert transcriptions into index sequences for training and decoding, and to convert index sequences back into text. The `tools/build_vocab.py` script generates such a character-based vocabulary.
We need a vocabulary containing the set of characters that may appear, in order to convert characters into indices during training and to convert indices back into text during decoding. The `tools/build_vocab.py` script generates such a character-based vocabulary.
```bash
python tools/build_vocab.py \
@ -118,7 +148,7 @@ python tools/build_vocab.py \
--manifest_paths data/librispeech/manifest.train
```
It writes all the transcription text in `data/librispeech/manifest.train` to the vocabulary file `data/librispeeech/eng_vocab.txt`, without any vocabulary truncation (`--count_threshold 0`).
It writes all the transcription text in `data/librispeech/manifest.train` to the vocabulary file `data/librispeeech/eng_vocab.txt`, without any vocabulary truncation (`--count_threshold 0`).
### More Help
@ -137,13 +167,13 @@ python tools/build_vocab.py --help
- Start training from scratch with 8 GPUs:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
```
- Start training from scratch with 16 CPUs:
- Start training from scratch with CPUs:
```
python train.py --use_gpu False --trainer_count 16
python train.py --use_gpu False
```
- Resume training from a checkpoint:
@ -151,7 +181,7 @@ python tools/build_vocab.py --help
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python train.py \
--init_model_path CHECKPOINT_PATH_TO_RESUME_FROM
--init_from_pretrain_model CHECKPOINT_PATH_TO_RESUME_FROM
```
For more help on arguments:
@ -161,11 +191,12 @@ python train.py --help
```
or refer to `example/librispeech/run_train.sh`.
## Data Augmentation Pipeline
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment our speech data by adding small random perturbations (label-invariant transformations) to the raw audio to synthesize new audio. You don't have to do the synthesis yourself, as it is already embedded in the data provider and is done on the fly, randomly synthesizing audio in each epoch during training.
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment our speech data by adding small random perturbations (label-invariant transformations) to the raw audio to synthesize new audio. You don't have to do the synthesis yourself, as it is already embedded in the data generator and can be done on the fly, randomly synthesizing audio in each epoch during training.
Six optional augmentation components are currently provided, which can be selected, configured and inserted into the processing pipeline:
Six optional augmentation components are currently provided, which can be selected, configured and inserted into the processing flow:
- Volume perturbation
- Speed perturbation
@ -195,7 +226,7 @@ python train.py --help
For other configuration examples, please refer to `conf/augmenatation.config.example`.
Be careful when using data augmentation: because it widens the difference between the training and test sets, inappropriate augmentation can be harmful to training.
Be careful when using data augmentation: because it widens the difference between the training and test sets, inappropriate augmentation can be harmful to the trained model and increase the gap between training and prediction.
## Inference and Evaluation
@ -205,18 +236,18 @@ python train.py --help
```bash
cd models/lm
sh download_lm_en.sh
sh download_lm_ch.sh
bash download_lm_en.sh
bash download_lm_ch.sh
```
If you want to train a better language model of your own, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips to show how we prepared our English and Mandarin language models; you can use them as a reference when you start training.
If you want to train a better language model of your own, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips to show how we prepared our English and Mandarin language models; you can use them as a reference when you train your own model.
#### English Language Model
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and can be downloaded from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training, as follows:
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and can be downloaded from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training, as follows:
* Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed, and Arabic numerals are converted to English words, e.g. "1000" to one thousand.
* Characters not in \['A-Za-z0-9\s'\] (\s represents whitespace characters) are removed, and Arabic numerals are converted to English words, e.g. "1000" to one thousand.
* Repeated whitespace characters are squeezed to one, and leading whitespace is removed. Note that all transcriptions are lowercase, so all characters are converted to lowercase.
* The top 400,000 most frequent words are selected to build the vocabulary, and the rest are replaced with 'UNKNOWNWORD'.
@ -224,7 +255,7 @@ sh download_lm_ch.sh
#### Mandarin Language Model
Unlike the English language model, the Mandarin language model is character-based, where each token is a Chinese character. We use an internal corpus to train the released Mandarin language model. The corpus contains billions of Chinese characters. The preprocessing differs only slightly from that of the English language model; the main steps include:
Unlike the English language model, the Mandarin language model is character-based, where each token is a Chinese character. We use an internal corpus to train the released Mandarin language model. The corpus contains billions of Chinese characters. The preprocessing has some small differences from that of the English language model; the main steps include:
* Leading and trailing whitespace characters are removed.
* English punctuation and Chinese punctuation are removed.
@ -234,18 +265,18 @@ sh download_lm_ch.sh
### Speech-to-text Inference
An inference module, `infer.py`, is provided to infer, decode and visualize speech-to-text results for some given audio clips. It helps to give an intuitive and qualitative evaluation of the ASR model's performance.
The inference module is invoked via `infer.py`; it can be used to infer, decode and output visualized speech-to-text results for some given audio clips. It helps to give an intuitive and qualitative evaluation of the ASR model's performance.
- Inference with GPU:
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py --trainer_count 1
CUDA_VISIBLE_DEVICES=0 python infer.py
```
- Inference with CPUs:
```bash
python infer.py --use_gpu False --trainer_count 12
python infer.py --use_gpu False
```
We provide two types of CTC decoders: the *CTC greedy decoder* and the *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting the most likely character at each timestep, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) additionally uses a heuristic breadth-first graph search to reach a near-global optimum; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the `--decoding_method` argument.
@ -261,16 +292,16 @@ python infer.py --help
To evaluate a model's performance quantitatively, please run:
- Evaluation with GPUs:
- Evaluation with GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py --trainer_count 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py
```
- Evaluation with CPUs:
```bash
python test.py --use_gpu False --trainer_count 12
python test.py --use_gpu False
```
The error rate (default: word error rate; can be set with `--error_rate_type`) will be printed.
@ -286,14 +317,13 @@ python test.py --help
The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertion weight) of the [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) have a significant impact on the decoder's performance. It is better to re-tune them on a validation set whenever the acoustic model is updated.
`tools/tune.py` performs a 2-D grid search over the hyper-parameters $\alpha$ and $\beta$. You must provide the ranges of $\alpha$ and $\beta$, as well as the number of attempts.
`tools/tune.py` performs a 2-D grid search over the hyper-parameters $\alpha$ and $\beta$. You must provide the ranges of $\alpha$ and $\beta$, as well as the number of attempts.
- Tuning with GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python tools/tune.py \
--trainer_count 8 \
--alpha_from 1.0 \
--alpha_to 3.2 \
--num_alphas 45 \
@ -307,14 +337,14 @@ python test.py --help
```bash
python tools/tune.py --use_gpu False
```
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and can optionally plot the error surface. A proper hyper-parameter range should include the global minimum of the WER/CER error surface, as shown in the figure below.
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and can plot the error surface. A proper hyper-parameter range should include the global minimum of the WER/CER error surface, as shown in the figure below.
<p align="center">
<img src="docs/images/tuning_error_surface.png" width=550>
<br/>An example error surface for tuning on the LibriSpeech dev-clean set
</p>
Usually, as shown in the figure, the variation of the language model weight ($\alpha$) significantly affects the performance of the CTC beam search decoder. A better practice is to first tune on several batches of data (the number can be specified) to find the proper range of the hyper-parameters, and then switch to the whole validation set for an exact tuning.
Usually, as shown in the figure, the variation of the language model weight ($\alpha$) significantly affects the performance of the CTC beam search decoder. A better practice is to first tune on several batches of data (the number can be specified) to find the proper range of the hyper-parameters, and then switch to the full validation set for precise tuning.
After tuning, you can reset $\alpha$ and $\beta$ in the inference and evaluation modules to check whether they really help to improve ASR performance. For more help, run:
@ -325,14 +355,14 @@ python tune.py --help
## Running in Docker Container
Docker is an open-source tool for building, shipping and running distributed applications in isolated environments. A Docker image of this project is provided on [hub.docker.com](https://hub.docker.com) with all dependencies installed, including the pre-built PaddlePaddle, the CTC decoders, and all necessary Python and third-party libraries. This Docker image requires NVIDIA GPU support, so please make sure it is available and that [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.
Docker is an open-source tool for building, shipping and running distributed applications in isolated environments. A Docker image of this project is provided on [hub.docker.com](https://hub.docker.com) with all dependencies installed, including the pre-built PaddlePaddle, the CTC decoders, and all necessary Python and third-party libraries. This Docker image requires NVIDIA GPU support, so please make sure it is available and that [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.
Take the following steps to launch the Docker image:
- Download the Docker image
```bash
nvidia-docker pull paddlepaddle/deep_speech:latest-gpu
nvidia-docker pull hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu
```
- Clone this repository with git
@ -344,90 +374,25 @@ git clone https://github.com/PaddlePaddle/DeepSpeech.git
- Run the Docker image
```bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech paddlepaddle/deep_speech:latest-gpu /bin/bash
sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
```
Now go back and start from the [Getting Started](#开始) section; you can execute model training, inference and hyper-parameter tuning in the Docker container in the same way.
## Distributed Cloud Training
We also provide a cloud training module based on [PaddleCloud](https://github.com/PaddlePaddle/cloud) for users to run cluster training and achieve faster training speed with multiple machines. To get started, please install the PaddleCloud client and register a PaddleCloud account as described in [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud).
Please take the following steps to submit a training job:
- Go to the directory:
```bash
cd cloud
```
- Upload data:
Data must be uploaded to the PaddleCloud filesystem to be accessible within a cloud job. `pcloud_upload_data.sh` handles the data packing and uploading:
```bash
sh pcloud_upload_data.sh
```
Given the manifest files, `pcloud_upload_data.sh` will:
- Extract the audio files listed in the input manifests.
- Pack them into a specified number of tar files.
- Upload these tar files to the PaddleCloud filesystem.
- Create cloud manifests by replacing local filesystem paths with PaddleCloud filesystem paths. The cloud job will use the new manifests to locate the audio files and their meta information.
For cloud training, the above steps only need to be done once. Afterwards the data stays unchanged on the cloud filesystem and can be reused for later jobs.
For argument details, please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Configure training arguments:
Configure the cloud job parameters in `pcloud_submit.sh` (e.g. `NUM_NODES`, `NUM_GPUS`, `CLOUD_TRAIN_DIR`, `JOB_NAME`, etc.), then configure the other training hyper-parameters in `pcloud_train.sh` (just as for local training).
For argument details, please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
- Submit the job:
By running:
```bash
sh pcloud_submit.sh
```
a training job is submitted to PaddleCloud, with the job name printed to the console.
- Get training logs
Run the following to list all the jobs you have submitted, as well as their running status:
```bash
paddlecloud get jobs
```
Run this, and the corresponding job's logs will be printed:
```bash
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
For more information about DeepSpeech2 training on PaddleCloud, please refer to
[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
## Training for Mandarin Language
The key steps of training for Mandarin are the same as for English, and we provide a Mandarin training example on Aishell in ```examples/aishell```. As described above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` for data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by executing ./models/aishell/download_model.sh) for users to try with ```run_infer_golden.sh``` and ```run_test_golden.sh```. Note that, unlike the English language model, the Mandarin language model is character-based; please run ```tools/tune.py``` to find the best settings.
The key steps of training for Mandarin are the same as for English, and we provide an example of Mandarin training with Aishell in ```examples/aishell```. As described above, please execute ```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh``` and ```sh run_infer.sh``` for data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by executing ./models/aishell/download_model.sh) for users to try with ```run_infer_golden.sh``` and ```run_test_golden.sh```. Note that, unlike the English language model, the Mandarin language model is character-based; please run ```tools/tune.py``` to find the best settings.
## Trying Live Demo with Your Own Voice
So far, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but it has not yet been tested with your own speech. `deploy/demo_server.py` and `deploy/demo_client.py` can quickly build a real-time demo system for the ASR engine with the trained model, enabling you to test and demo with your own voice.
So far, an ASR model has been trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files, but it has not yet been tested with your own speech. `deploy/demo_english_server.py` and `deploy/demo_client.py` can quickly build a real-time demo system for the ASR engine with the trained model, enabling you to test and demo with your own voice.
To start the demo server, please run this in one console:
```bash
CUDA_VISIBLE_DEVICES=0 \
python deploy/demo_server.py \
--trainer_count 1 \
--host_ip localhost \
--host_port 8086
```
@ -439,7 +404,7 @@ python deploy/demo_server.py \
```bash
brew install portaudio
pip install pyaudio
pip install pynput
pip install keyboard
```
然后启动客户端,请在另一个控制台中运行:
@ -451,11 +416,11 @@ python -u deploy/demo_client.py \
--host_port 8086
```
Now, in the client console, press and hold the `whitespace` key and start speaking. After you finish speaking, release the key so that the speech-to-text result shows up in the console. To quit the client, just press the `ESC` key.
Now, in the client console, press and hold the `space` key and start speaking. After you finish speaking, release the key so that the speech-to-text result shows up in the console. To quit the client, just press the `ESC` key.
Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` can be run on one without any audio recording hardware, e.g. any remote server machine. If the server and client run on two separate machines, just make sure that the `host_ip` and `host_port` arguments are set to an actually accessible IP address and port. Nothing needs to be done if they run on a single machine.
Please also refer to `examples/mandarin/run_demo_server.sh`, which will first download a pre-trained Mandarin model (trained with 3000 hours of internal speech data) and then start the demo server with the model. By running `examples/mandarin/run_demo_client.sh`, you can speak Mandarin to test it. If you would like to try other models, just update the `--model_path` argument in the script.
Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. By running `examples/mandarin/run_demo_client.sh`, you can speak English to test it. If you would like to try other models, just update the `--model_path` argument in the script.
For more help on arguments:
@ -470,10 +435,10 @@ python deploy/demo_client.py --help
Language | Model Name | Training Data | Hours of Speech
:-----------: | :------------: | :----------: | -------:
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz) | Baidu Internal English Dataset | 8628 h
Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model_fluid.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
#### Language Model Released
@ -483,9 +448,9 @@ Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baid
[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings
[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/> 'probing' binary with default settings
## Experiments and Benchmarks
## Experiments and Baselines
#### Benchmark Results for English Models (Word Error Rate)
#### Baseline Results for English Models (Word Error Rate)
Test Set | LibriSpeech Model | BaiduEN8K Model
:--------------------- | ---------------: | -------------------:
@ -500,7 +465,7 @@ Baidu Internal Testset  |   40.75 |   8.48
To reproduce the benchmark results on the VoxForge data, we provide a script to download the data and generate the VoxForge dialect manifest files. Please go to ```data/voxforge``` and execute ```run_data.sh``` to obtain the VoxForge dialect manifest files. Note that the VoxForge data may keep updating, so the generated manifest files may differ from the ones we evaluated on.
#### Benchmark Results for the Mandarin Model (Character Error Rate)
#### Baseline Results for the Mandarin Model (Character Error Rate)
Test Set | BaiduCN1.2k Model
:--------------------- | -------------------:
@ -508,17 +473,16 @@ Baidu Internal Testset | 12.64
#### Acceleration with Multi-GPUs
We compare the training time with 1, 2, 4, 8 and 16 Tesla K40m GPUs (on a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). The results show that a **near-linear** acceleration with multiple GPUs has been achieved. In the figure below, the training time (in seconds) is shown on the blue bars.
We compare the training time with 1, 2, 4 and 8 Tesla V100 GPUs (on a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). The results show that a **near-linear** acceleration with multiple GPUs has been achieved. In the figure below, the training time (in seconds) is shown on the blue bars.
<img src="docs/images/multi_gpu_speedup.png" width=450><br/>
| # of GPU | Acceleration Rate |
| -------- | --------------: |
| 1 | 1.00 X |
| 2 | 1.97 X |
| 4 | 3.74 X |
| 8 | 6.21 X |
|16 | 10.70 X |
| 2 | 1.98 X |
| 4 | 3.73 X |
| 8 | 6.95 X |
`tools/profile.sh` provides such a profiling tool.

@ -1,63 +0,0 @@
# Train DeepSpeech2 on PaddleCloud
>Note:
>Please make sure [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and the current directory is `deep_speech_2/cloud/`
## Step 1: Upload Data
Provided with several input manifests, `pcloud_upload_data.sh` will pack and upload all the containing audio files to PaddleCloud filesystem, and also generate some corresponding manifest files with updated cloud paths.
Please modify the following arguments in `pcloud_upload_data.sh`:
- `IN_MANIFESTS`: Paths (in local filesystem) of manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimiter.
- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimiter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths.
- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. A smaller `num_shards` requires more temporary local disk space for packing data.
By running:
```
sh pcloud_upload_data.sh
```
all the audio files will be uploaded to PaddleCloud filesystem, and you will get modified manifests files in `OUT_MANIFESTS`.
You have to take this step only once, the very first time you do the cloud training. Later on, the data is persistent on the cloud filesystem and reusable for further job submissions.
## Step 2: Configure Training
Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments:
- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Notice that the `audio_filepath` should be in the cloud filesystem, like those generated by `pcloud_upload_data.sh`.
- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation.
- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `BATCH_SIZE`: Training batch size for a single node.
- `NUM_GPU`: Number of GPUs allocated for a single node.
- `NUM_NODE`: Number of nodes (machines) allocated for this job.
- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes.
Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as what you can do in local training.
By running:
```
sh pcloud_submit.sh
```
you submit a training job to PaddleCloud. And you will see the job name when the submission is done.
## Step 3: Get Job Logs
Run this to list all the jobs you have submitted, as well as their running status:
```
paddlecloud get jobs
```
Run this, the corresponding job's logs will be printed.
```
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
## More Help
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).

@ -1,17 +0,0 @@
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)

@ -1,29 +0,0 @@
#! /usr/bin/env bash
TRAIN_MANIFEST="cloud/cloud_manifests/cloud.manifest.train"
DEV_MANIFEST="cloud/cloud_manifests/cloud.manifest.dev"
CLOUD_MODEL_DIR="./checkpoints"
BATCH_SIZE=512
NUM_GPU=8
NUM_NODE=1
IS_LOCAL="True"
JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S`
DS2_PATH=${PWD%/*}
cp -f pcloud_train.sh ${DS2_PATH}
paddlecloud submit \
-image bootstrapper:5000/paddlepaddle/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu ${NUM_GPU} \
-gpu ${NUM_GPU} \
-memory 64Gi \
-parallelism ${NUM_NODE} \
-pscpu 1 \
-pservers 1 \
-psmemory 64Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \
${DS2_PATH}
rm ${DS2_PATH}/pcloud_train.sh

@ -1,46 +0,0 @@
#! /usr/bin/env bash
TRAIN_MANIFEST=$1
DEV_MANIFEST=$2
MODEL_PATH=$3
NUM_GPU=$4
BATCH_SIZE=$5
IS_LOCAL=$6
python ./cloud/split_data.py \
--in_manifest_path=${TRAIN_MANIFEST} \
--out_manifest_path='/local.manifest.train'
python ./cloud/split_data.py \
--in_manifest_path=${DEV_MANIFEST} \
--out_manifest_path='/local.manifest.dev'
mkdir ./logs
python -u train.py \
--batch_size=${BATCH_SIZE} \
--trainer_count=${NUM_GPU} \
--num_passes=200 \
--num_proc_data=${NUM_GPU} \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--use_sortagrad=True \
--use_gru=False \
--use_gpu=True \
--is_local=${IS_LOCAL} \
--share_rnn_weights=True \
--train_manifest='/local.manifest.train' \
--dev_manifest='/local.manifest.dev' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--output_model_dir='./checkpoints' \
--output_model_dir=${MODEL_PATH} \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped' \
2>&1 | tee ./logs/train.log

@ -1,22 +0,0 @@
#! /usr/bin/env bash
mkdir cloud_manifests
IN_MANIFESTS="../data/librispeech/manifest.train ../data/librispeech/manifest.dev-clean ../data/librispeech/manifest.test-clean"
OUT_MANIFESTS="cloud_manifests/cloud.manifest.train cloud_manifests/cloud.manifest.dev cloud_manifests/cloud.manifest.test"
CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech"
NUM_SHARDS=50
python upload_data.py \
--in_manifest_paths ${IN_MANIFESTS} \
--out_manifest_paths ${OUT_MANIFESTS} \
--cloud_data_dir ${CLOUD_DATA_DIR} \
--num_shards ${NUM_SHARDS}
if [ $? -ne 0 ]
then
echo "Upload Data Failed!"
exit 1
fi
echo "All Done."

@ -1,41 +0,0 @@
"""This tool is used for splitting data into each node of
paddlecloud. This script should be called in paddlecloud.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import json
import argparse
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_path",
type=str,
required=True,
help="Input manifest path for all nodes.")
parser.add_argument(
"--out_manifest_path",
type=str,
required=True,
help="Output manifest file path for current node.")
args = parser.parse_args()
def split_data(in_manifest_path, out_manifest_path):
with open("/trainer_id", "r") as f:
trainer_id = int(f.readline()[:-1])
with open("/trainer_count", "r") as f:
trainer_count = int(f.readline()[:-1])
out_manifest = []
for index, json_line in enumerate(open(in_manifest_path, 'r')):
if (index % trainer_count) == trainer_id:
out_manifest.append("%s\n" % json_line.strip())
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
if __name__ == '__main__':
split_data(args.in_manifest_path, args.out_manifest_path)

@ -1,129 +0,0 @@
"""This script is for uploading data for DeepSpeech2 training on paddlecloud.
Steps:
1. Read original manifests and extract local sound files.
2. Tar all local sound files into multiple tar files and upload them.
3. Modify original manifests with updated paths in cloud filesystem.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import os
import tarfile
import sys
import argparse
import shutil
from subprocess import call
import _init_paths
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_paths",
default=[
"../datasets/manifest.train", "../datasets/manifest.dev",
"../datasets/manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of input manifests to load, pack and upload."
"(default: %(default)s)")
parser.add_argument(
"--out_manifest_paths",
default=[
"./cloud.manifest.train", "./cloud.manifest.dev",
"./cloud.manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of modified manifests to write to. "
"(default: %(default)s)")
parser.add_argument(
"--cloud_data_dir",
required=True,
type=str,
help="Destination directory on paddlecloud to upload data to.")
parser.add_argument(
"--num_shards",
default=10,
type=int,
help="Number of parts to split data to. (default: %(default)s)")
parser.add_argument(
"--local_tmp_dir",
default="./tmp/",
type=str,
help="Local directory for storing temporary data. (default: %(default)s)")
args = parser.parse_args()
def upload_data(in_manifest_path_list, out_manifest_path_list, local_tmp_dir,
upload_tar_dir, num_shards):
"""Extract and pack sound files listed in the manifest files into multple
tar files and upload them to padldecloud. Besides, generate new manifest
files with updated paths in paddlecloud.
"""
# compute total audio number
total_line = 0
for manifest_path in in_manifest_path_list:
with open(manifest_path, 'r') as f:
total_line += len(f.readlines())
line_per_tar = (total_line // num_shards) + 1
# pack and upload shard by shard
line_count, tar_file = 0, None
for manifest_path, out_manifest_path in zip(in_manifest_path_list,
out_manifest_path_list):
manifest = read_manifest(manifest_path)
out_manifest = []
for json_data in manifest:
sound_filepath = json_data['audio_filepath']
sound_filename = os.path.basename(sound_filepath)
if line_count % line_per_tar == 0:
if tar_file != None:
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
tar_name = 'part-%s-of-%s.tar' % (
str(line_count // line_per_tar).zfill(5),
str(num_shards).zfill(5))
tar_path = os.path.join(local_tmp_dir, tar_name)
tar_file = tarfile.open(tar_path, 'w')
tar_file.add(sound_filepath, arcname=sound_filename)
line_count += 1
json_data['audio_filepath'] = "tar:%s#%s" % (
os.path.join(upload_tar_dir, tar_name), sound_filename)
out_manifest.append("%s\n" % json.dumps(json_data))
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
pcloud_cp(out_manifest_path, upload_tar_dir)
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
def pcloud_mkdir(dir):
"""Make directory in PaddleCloud filesystem.
"""
if call(['paddlecloud', 'mkdir', dir]) != 0:
raise IOError("PaddleCloud mkdir failed: %s." % dir)
def pcloud_cp(src, dst):
"""Copy src from local filesytem to dst in PaddleCloud filesystem,
or downlowd src from PaddleCloud filesystem to dst in local filesystem.
"""
if call(['paddlecloud', 'cp', src, dst]) != 0:
raise IOError("PaddleCloud cp failed: from [%s] to [%s]." % (src, dst))
if __name__ == '__main__':
if not os.path.exists(args.local_tmp_dir):
os.makedirs(args.local_tmp_dir)
pcloud_mkdir(args.cloud_data_dir)
upload_data(args.in_manifest_paths, args.out_manifest_paths,
args.local_tmp_dir, args.cloud_data_dir, args.num_shards)
shutil.rmtree(args.local_tmp_dir)

@ -16,6 +16,7 @@ import argparse
import soundfile
import json
import codecs
import io
from data_utils.utility import download, unpack
URL_ROOT = "http://www.openslr.org/resources/12"
@ -68,12 +69,11 @@ def create_manifest(data_dir, manifest_path):
filename for filename in filelist if filename.endswith('trans.txt')
]
if len(text_filelist) > 0:
text_filepath = os.path.join(data_dir, subfolder, text_filelist[0])
for line in open(text_filepath):
text_filepath = os.path.join(subfolder, text_filelist[0])
for line in io.open(text_filepath, encoding="utf8"):
segments = line.strip().split()
text = ' '.join(segments[1:]).lower()
audio_filepath = os.path.join(data_dir, subfolder,
segments[0] + '.flac')
audio_filepath = os.path.join(subfolder, segments[0] + '.flac')
audio_data, samplerate = soundfile.read(audio_filepath)
duration = float(len(audio_data)) / samplerate
json_lines.append(

@ -16,6 +16,7 @@ import zipfile
import argparse
import soundfile
import json
import io
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
@ -88,7 +89,7 @@ def create_manifest(data_dir, manifest_path):
'duration': duration,
'text': ''
}))
with open(manifest_path, 'w') as out_file:
with io.open(manifest_path, mode='w', encoding='utf8') as out_file:
for line in json_lines:
out_file.write(line + '\n')

@ -3,7 +3,7 @@
# download data, generate manifests
PYTHONPATH=../../:$PYTHONPATH python voxforge.py \
--manifest_prefix='./manifest' \
--target_dir='~/.cache/paddle/dataset/speech/VoxForge' \
--target_dir='./dataset/VoxForge' \
--is_merge_dialect=True \
--dialects 'american' 'british' 'australian' 'european' 'irish' 'canadian' 'indian'

@ -18,7 +18,7 @@ import shutil
import subprocess
from data_utils.utility import download_multi, unpack, getfile_insensitive
DATA_HOME = '~/.cache/paddle/dataset/speech'
DATA_HOME = './dataset'
DATA_URL = 'http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/' \
'Audio/Main/16kHz_16bit'

@ -12,6 +12,7 @@ import resampy
from scipy import signal
import random
import copy
import io
class AudioSegment(object):
@ -154,7 +155,7 @@ class AudioSegment(object):
fileno = int(matches.group(2))
# read headers
f = open(filename, 'rb')
f = io.open(filename, mode='rb', encoding='utf8')
version = f.read(4)
num_utterances = struct.unpack("i", f.read(4))[0]
bytes_per_header = struct.unpack("i", f.read(4))[0]

@ -9,10 +9,9 @@ import random
import tarfile
import multiprocessing
import numpy as np
import paddle.v2 as paddle
import paddle.fluid as fluid
from threading import local
from data_utils.utility import read_manifest
from data_utils.utility import xmap_readers_mp
from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
from data_utils.speech import SpeechSegment
@ -51,14 +50,17 @@ class DataGenerator(object):
:param use_dB_normalization: Whether to normalize the audio to -20 dB
before extracting the features.
:type use_dB_normalization: bool
:param num_threads: Number of CPU threads for processing data.
:type num_threads: int
:param random_seed: Random seed.
:type random_seed: int
:param keep_transcription_text: If set to True, transcription text will
be passed forward directly without
converting to index sequence.
:type keep_transcription_text: bool
:param place: The place to run the program.
:type place: CPU or GPU
:param is_training: If set to True, generate text data for training,
otherwise, generate text data for infer.
:type is_training: bool
"""
def __init__(self,
@ -72,9 +74,10 @@ class DataGenerator(object):
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
num_threads=multiprocessing.cpu_count() // 2,
random_seed=0,
keep_transcription_text=False):
keep_transcription_text=False,
place=fluid.CPUPlace(),
is_training=True):
self._max_duration = max_duration
self._min_duration = min_duration
self._normalizer = FeatureNormalizer(mean_std_filepath)
@ -87,14 +90,15 @@ class DataGenerator(object):
window_ms=window_ms,
max_freq=max_freq,
use_dB_normalization=use_dB_normalization)
self._num_threads = num_threads
self._rng = random.Random(random_seed)
self._keep_transcription_text = keep_transcription_text
self._epoch = 0
self._is_training = is_training
# for caching tar files info
self._local_data = local()
self._local_data.tar2info = {}
self._local_data.tar2object = {}
self._place = place
def process_utterance(self, audio_file, transcript):
"""Load, augment, featurize and normalize for speech data.
@ -121,7 +125,6 @@ class DataGenerator(object):
def batch_reader_creator(self,
manifest_path,
batch_size,
min_batch_size=1,
padding_to=-1,
flatten=False,
sortagrad=False,
@ -137,9 +140,6 @@ class DataGenerator(object):
:type manifest_path: basestring
:param batch_size: Number of instances in a batch.
:type batch_size: int
:param min_batch_size: Any batch with batch size smaller than this will
be discarded. (To be deprecated in the future.)
:type min_batch_size: int
:param padding_to: If set -1, the maximun shape in the batch
will be used as the target shape for padding.
Otherwise, `padding_to` will be the target shape.
@ -178,6 +178,7 @@ class DataGenerator(object):
# sort (by duration) or batch-wise shuffle the manifest
if self._epoch == 0 and sortagrad:
manifest.sort(key=lambda x: x["duration"])
else:
if shuffle_method == "batch_shuffle":
manifest = self._batch_shuffle(
@ -193,18 +194,16 @@ class DataGenerator(object):
raise ValueError("Unknown shuffle method %s." %
shuffle_method)
# prepare batches
instance_reader, cleanup = self._instance_reader_creator(manifest)
batch = []
try:
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self._padding_batch(batch, padding_to, flatten)
batch = []
if len(batch) >= min_batch_size:
instance_reader = self._instance_reader_creator(manifest)
for instance in instance_reader():
batch.append(instance)
if len(batch) == batch_size:
yield self._padding_batch(batch, padding_to, flatten)
finally:
cleanup()
batch = []
if len(batch) >= 1:
yield self._padding_batch(batch, padding_to, flatten)
self._epoch += 1
return batch_reader
@ -276,13 +275,11 @@ class DataGenerator(object):
def reader():
for instance in manifest:
yield instance
inst = self.process_utterance(instance["audio_filepath"],
instance["text"]),
yield inst[0]
reader, cleanup_callback = xmap_readers_mp(
lambda instance: self.process_utterance(instance["audio_filepath"], instance["text"]),
reader, self._num_threads, 4096)
return reader, cleanup_callback
return reader
def _padding_batch(self, batch, padding_to=-1, flatten=False):
"""
@ -304,14 +301,43 @@ class DataGenerator(object):
"than any instance's shape in the batch")
max_length = padding_to
# padding
padded_audios = []
texts, text_lens = [], []
audio_lens = []
masks = []
for audio, text in batch:
padded_audio = np.zeros([audio.shape[0], max_length])
padded_audio[:, :audio.shape[1]] = audio
if flatten:
padded_audio = padded_audio.flatten()
padded_instance = [padded_audio, text, audio.shape[1]]
new_batch.append(padded_instance)
return new_batch
padded_audios.append(padded_audio)
if self._is_training:
texts += text
else:
texts.append(text)
text_lens.append(len(text))
audio_lens.append(audio.shape[1])
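# Note: the mask built below appears to mirror the conv front end's
# subsampling (assumed: frequency downsampled by 2, time by 3, with 32
# output channels), so padded timesteps can be zeroed out after the
# convolution layers.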
mask_shape0 = (audio.shape[0] - 1) // 2 + 1
mask_shape1 = (audio.shape[1] - 1) // 3 + 1
mask_max_len = (max_length - 1) // 3 + 1
mask_ones = np.ones((mask_shape0, mask_shape1))
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
mask = np.repeat(
np.reshape(
np.concatenate((mask_ones, mask_zeros), axis=1),
(1, mask_shape0, mask_max_len)),
32,
axis=0)
masks.append(mask)
padded_audios = np.array(padded_audios).astype('float32')
if self._is_training:
texts = fluid.create_lod_tensor(
np.array(texts).astype('int32'),
recursive_seq_lens=[text_lens],
place=self._place)
audio_lens = np.array(audio_lens).astype('int64').reshape([-1, 1])
masks = np.array(masks).astype('float32')
return padded_audios, texts, audio_lens, masks
def _batch_shuffle(self, manifest, batch_size, clipped=False):
"""Put similarly-sized instances into minibatches for better efficiency

@ -11,7 +11,7 @@ import time
from Queue import Queue
from threading import Thread
from multiprocessing import Process, Manager, Value
from paddle.v2.dataset.common import md5file
from paddle.dataset.common import md5file
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
@ -88,127 +88,3 @@ def unpack(filepath, target_dir, rm_tar=False):
class XmapEndSignal():
pass
def xmap_readers_mp(mapper, reader, process_num, buffer_size, order=False):
"""A multiprocessing pipeline wrapper for the data reader.
:param mapper: Function to map sample.
:type mapper: callable
:param reader: Given data reader.
:type reader: callable
:param process_num: Number of processes in the pipeline
:type process_num: int
:param buffer_size: Maximal buffer size.
:type buffer_size: int
:return: The wrappered reader and cleanup callback
:rtype: tuple
"""
end_flag = XmapEndSignal()
read_workers = []
handle_workers = []
flush_workers = []
read_exit_flag = Value('i', 0)
handle_exit_flag = Value('i', 0)
flush_exit_flag = Value('i', 0)
# define a worker to read samples from reader to in_queue with order flag
def order_read_worker(reader, in_queue):
for order_id, sample in enumerate(reader()):
if read_exit_flag.value == 1: break
in_queue.put((order_id, sample))
in_queue.put(end_flag)
# the reading worker should not exit until all handling work exited
while handle_exit_flag.value == 0 or read_exit_flag.value == 0:
time.sleep(0.001)
# define a worker to handle samples from in_queue by mapper and put results
# to out_queue with order
def order_handle_worker(in_queue, out_queue, mapper, out_order):
ins = in_queue.get()
while not isinstance(ins, XmapEndSignal):
if handle_exit_flag.value == 1: break
order_id, sample = ins
result = mapper(sample)
while order_id != out_order[0]:
time.sleep(0.001)
out_queue.put(result)
out_order[0] += 1
ins = in_queue.get()
in_queue.put(end_flag)
out_queue.put(end_flag)
# wait for exit of flushing worker
while flush_exit_flag.value == 0 or handle_exit_flag.value == 0:
time.sleep(0.001)
read_exit_flag.value = 1
handle_exit_flag.value = 1
# define a thread worker to flush samples from Manager.Queue to Queue
# for acceleration
def flush_worker(in_queue, out_queue):
finish = 0
while finish < process_num and flush_exit_flag.value == 0:
sample = in_queue.get()
if isinstance(sample, XmapEndSignal):
finish += 1
else:
out_queue.put(sample)
out_queue.put(end_flag)
handle_exit_flag.value = 1
flush_exit_flag.value = 1
def cleanup():
# first exit flushing workers
flush_exit_flag.value = 1
for w in flush_workers:
w.join()
# next exit handling workers
handle_exit_flag.value = 1
for w in handle_workers:
w.join()
# last exit reading workers
read_exit_flag.value = 1
for w in read_workers:
w.join()
def xreader():
# prepare shared memory
manager = Manager()
in_queue = manager.Queue(buffer_size)
out_queue = manager.Queue(buffer_size)
out_order = manager.list([0])
# start a read worker in a process
target = order_read_worker
p = Process(target=target, args=(reader, in_queue))
p.daemon = True
p.start()
read_workers.append(p)
# start handle_workers with multiple processes
target = order_handle_worker
args = (in_queue, out_queue, mapper, out_order)
workers = [
Process(target=target, args=args) for _ in xrange(process_num)
]
for w in workers:
w.daemon = True
w.start()
handle_workers.append(w)
# start a thread to read data from slow Manager.Queue
flush_queue = Queue(buffer_size)
t = Thread(target=flush_worker, args=(out_queue, flush_queue))
t.daemon = True
t.start()
flush_workers.append(t)
# get results
sample = flush_queue.get()
while not isinstance(sample, XmapEndSignal):
yield sample
sample = flush_queue.get()
return xreader, cleanup

@ -102,7 +102,7 @@ def ctc_beam_search_decoder(probs_seq,
probs_b_prev, probs_nb_prev = {'\t': 1.0}, {'\t': 0.0}
## extend prefix in loop
for time_step in xrange(len(probs_seq)):
for time_step in range(len(probs_seq)):
# prefix_set_next: the set containing candidate prefixes
# probs_b_cur: prefixes' probability ending with blank in current step
# probs_nb_cur: prefixes' probability ending with non-blank in current step
@ -114,7 +114,7 @@ def ctc_beam_search_decoder(probs_seq,
if cutoff_prob < 1.0 or cutoff_top_n < cutoff_len:
prob_idx = sorted(prob_idx, key=lambda asd: asd[1], reverse=True)
cutoff_len, cum_prob = 0, 0.0
for i in xrange(len(prob_idx)):
for i in range(len(prob_idx)):
cum_prob += prob_idx[i][1]
cutoff_len += 1
if cum_prob >= cutoff_prob:
@ -127,7 +127,7 @@ def ctc_beam_search_decoder(probs_seq,
probs_b_cur[l], probs_nb_cur[l] = 0.0, 0.0
# extend prefix by traversing prob_idx
for index in xrange(cutoff_len):
for index in range(cutoff_len):
c, prob_c = prob_idx[index][0], prob_idx[index][1]
if c == blank_id:

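For readers unfamiliar with the pruning step above: at each time step the vocabulary is cut down to the most probable symbols, stopping once their cumulative probability reaches `cutoff_prob` or `cutoff_top_n` symbols have been kept. A standalone, simplified sketch of that rule (the decoder itself tracks `cutoff_len` and the cumulative sum slightly differently):

```python
def prune_vocab(step_probs, cutoff_prob=0.99, cutoff_top_n=40):
    # Sort symbol indices by probability; keep the smallest prefix whose
    # cumulative probability reaches cutoff_prob (capped at cutoff_top_n).
    prob_idx = sorted(enumerate(step_probs), key=lambda kv: kv[1], reverse=True)
    kept, cum_prob = [], 0.0
    for idx, p in prob_idx:
        kept.append((idx, p))
        cum_prob += p
        if cum_prob >= cutoff_prob or len(kept) >= cutoff_top_n:
            break
    return kept

print(prune_vocab([0.5, 0.3, 0.15, 0.05], cutoff_prob=0.9))
# [(0, 0.5), (1, 0.3), (2, 0.15)]
```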
@ -1,5 +1,5 @@
"""Client-end for the ASR demo."""
from pynput import keyboard
import keyboard
import struct
import socket
import sys
@ -23,22 +23,17 @@ is_recording = False
enable_trigger_record = True
def on_press(key):
"""On-press keyboard callback function."""
def on_press_release(x):
"""Keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.space:
press = keyboard.KeyboardEvent('down', 28, 'space')
release = keyboard.KeyboardEvent('up', 28, 'space')
if x.event_type == 'down' and x.name == press.name:
if (not is_recording) and enable_trigger_record:
sys.stdout.write("Start Recording ... ")
sys.stdout.flush()
is_recording = True
def on_release(key):
"""On-release keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.esc:
return False
elif key == keyboard.Key.space:
if x.event_type == 'up' and x.name == release.name:
if is_recording == True:
is_recording = False
@ -80,9 +75,10 @@ def main():
stream.start_stream()
# prepare keyboard listener
with keyboard.Listener(
on_press=on_press, on_release=on_release) as listener:
listener.join()
while (1):
keyboard.hook(on_press_release)
if keyboard.record('esc'):
break
# close up
stream.stop_stream()
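
The demo client now uses the `keyboard` package instead of `pynput`. As a rough illustration of the press/release pattern it relies on (assuming the `keyboard` package is installed and the process has the input-hook permissions it needs; this is not the demo's exact code):

```python
import keyboard

is_recording = False

def on_event(event):
    global is_recording
    if event.name != 'space':
        return
    if event.event_type == 'down' and not is_recording:
        print("Start Recording ...")
        is_recording = True
    elif event.event_type == 'up' and is_recording:
        print("Stop Recording.")
        is_recording = False

keyboard.hook(on_event)  # deliver every key press/release to the callback
keyboard.wait('esc')     # block until ESC, then fall through and exit
```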

@ -8,7 +8,8 @@ from time import gmtime, strftime
import SocketServer
import struct
import wave
import paddle.v2 as paddle
import paddle.fluid as fluid
import numpy as np
import _init_paths
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
@ -141,13 +142,19 @@ def warm_up_test(audio_process_handler,
def start_server():
"""Start the ASR server"""
# prepare data generator
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1,
keep_transcription_text=True)
keep_transcription_text=True,
place = place,
is_training = False)
# prepare ASR model
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
@ -155,7 +162,8 @@ def start_server():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
init_from_pretrain_model=args.model_path,
place=place,
share_rnn_weights=args.share_rnn_weights)
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
@ -166,8 +174,24 @@ def start_server():
# prepare ASR inference handler
def file_to_transcript(filename):
feature = data_generator.process_utterance(filename, "")
audio_len = feature[0].shape[1]
mask_shape0 = (feature[0].shape[0] - 1) // 2 + 1
mask_shape1 = (feature[0].shape[1] - 1) // 3 + 1
mask_max_len = (audio_len - 1) // 3 + 1
mask_ones = np.ones((mask_shape0, mask_shape1))
mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1))
mask = np.repeat(
np.reshape(
np.concatenate((mask_ones, mask_zeros), axis=1),
(1, mask_shape0, mask_max_len)),
32,
axis=0)
feature = (np.array([feature[0]]).astype('float32'),
None,
np.array([audio_len]).astype('int64').reshape([-1,1]),
np.array([mask]).astype('float32'))
probs_split = ds2_model.infer_batch_probs(
infer_data=[feature],
infer_data=feature,
feeding_dict=data_generator.feeding)
if args.decoding_method == "ctc_greedy":
@ -207,7 +231,6 @@ def start_server():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
start_server()

Binary image file not shown (before: 153 KiB, after: 206 KiB).

@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/aishell/aishell.py \
--manifest_prefix='data/aishell/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/Aishell'
--target_dir='./dataset/aishell'
if [ $? -ne 0 ]; then
echo "Prepare Aishell failed. Terminated."

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
@ -31,7 +30,7 @@ python -u infer.py \
--infer_manifest='data/aishell/manifest.test' \
--mean_std_path='data/aishell/mean_std.npz' \
--vocab_path='data/aishell/vocab.txt' \
--model_path='checkpoints/aishell/params.latest.tar.gz' \
--model_path='checkpoints/aishell/step_final' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/aishell > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
@ -35,12 +34,12 @@ python -u infer.py \
--cutoff_prob=0.99 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--infer_manifest='data/aishell/manifest.test' \
--mean_std_path='models/aishell/mean_std.npz' \
--vocab_path='models/aishell/vocab.txt' \
--model_path='models/aishell/params.tar.gz' \
--model_path='models/aishell' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -15,10 +15,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
@ -32,7 +30,7 @@ python -u test.py \
--test_manifest='data/aishell/manifest.test' \
--mean_std_path='data/aishell/mean_std.npz' \
--vocab_path='data/aishell/vocab.txt' \
--model_path='checkpoints/aishell/params.latest.tar.gz' \
--model_path='checkpoints/aishell/step_final' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_ch.sh
bash download_lm_ch.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/aishell > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=300 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/aishell/manifest.test' \
--mean_std_path='models/aishell/mean_std.npz' \
--vocab_path='models/aishell/vocab.txt' \
--model_path='models/aishell/params.tar.gz' \
--model_path='models/aishell' \
--lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='cer' \

@ -3,17 +3,18 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an exists model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u train.py \
--batch_size=64 \
--trainer_count=8 \
--num_passes=50 \
--num_proc_data=16 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=1024 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=120000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
@ -30,7 +31,7 @@ python -u train.py \
--output_model_dir='./checkpoints/aishell' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
--shuffle_method='batch_shuffle_clipped' \
if [ $? -ne 0 ]; then
echo "Failed in training!"

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=5 \
--num_conv_layers=2 \
@ -35,12 +34,12 @@ python -u infer.py \
--cutoff_prob=1.0 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u test.py \
--batch_size=128 \
--trainer_count=4 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
@ -36,12 +35,12 @@ python -u test.py \
--cutoff_prob=1.0 \
--cutoff_top_n=40 \
--use_gru=True \
--use_gpu=True \
--use_gpu=False \
--share_rnn_weights=False \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -14,7 +14,7 @@ cd - > /dev/null
# download well-trained model
cd models/baidu_en8k > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -40,7 +40,7 @@ python -u deploy/demo_server.py \
--warmup_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/baidu_en8k/mean_std.npz' \
--vocab_path='models/baidu_en8k/vocab.txt' \
--model_path='models/baidu_en8k/params.tar.gz' \
--model_path='models/baidu_en8k' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--specgram_type='linear'

@ -5,7 +5,7 @@ cd ../.. > /dev/null
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \
--manifest_prefix='data/librispeech/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/libri' \
--target_dir='./dataset/librispeech' \
--full_download='True'
if [ $? -ne 0 ]; then

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
@ -31,7 +30,7 @@ python -u infer.py \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--model_path='checkpoints/libri/params.latest.tar.gz' \
--model_path='checkpoints/libri/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
@ -40,7 +39,7 @@ python -u infer.py \
--infer_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -15,10 +15,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
@ -32,7 +30,7 @@ python -u test.py \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='data/librispeech/vocab.txt' \
--model_path='checkpoints/libri/params.latest.tar.gz' \
--model_path='checkpoints/libri/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/librispeech/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -3,17 +3,19 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an exists model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u train.py \
--batch_size=160 \
--trainer_count=8 \
--num_passes=50 \
--num_proc_data=16 \
--batch_size=20 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=280000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
@ -30,7 +32,7 @@ python -u train.py \
--output_model_dir='./checkpoints/libri' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
--shuffle_method='batch_shuffle_clipped' \
if [ $? -ne 0 ]; then
echo "Failed in training!"

@ -7,7 +7,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u tools/tune.py \
--num_batches=-1 \
--batch_size=128 \
--trainer_count=4 \
--beam_size=500 \
--num_proc_bsearch=12 \
--num_conv_layers=2 \
@ -27,7 +26,7 @@ python -u tools/tune.py \
--tune_manifest='data/librispeech/manifest.dev-clean' \
--mean_std_path='data/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--error_rate_type='wer' \
--specgram_type='linear'

@ -7,11 +7,10 @@ if [ ! -e data/tiny ]; then
mkdir data/tiny
fi
# download data, generate manifests
PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \
--manifest_prefix='data/tiny/manifest' \
--target_dir='~/.cache/paddle/dataset/speech/libri' \
--target_dir='./dataset/librispeech' \
--full_download='False'
if [ $? -ne 0 ]; then
@ -21,12 +20,11 @@ fi
head -n 64 data/tiny/manifest.dev-clean > data/tiny/manifest.tiny
# build vocabulary
python tools/build_vocab.py \
--count_threshold=0 \
--vocab_path='data/tiny/vocab.txt' \
--manifest_paths='data/tiny/manifest.dev-clean'
--manifest_paths='data/tiny/manifest.tiny'
if [ $? -ne 0 ]; then
echo "Build vocabulary failed. Terminated."
@ -47,5 +45,5 @@ if [ $? -ne 0 ]; then
fi
echo "Tiny data preparation done."
echo "LibriSpeech Data preparation done."
exit 0

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -15,7 +15,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
@ -28,10 +27,10 @@ python -u infer.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--infer_manifest='data/tiny/manifest.tiny' \
--infer_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/tiny/params.pass-19.tar.gz' \
--model_path='./checkpoints/tiny/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,7 +24,6 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0 \
python -u infer.py \
--num_samples=10 \
--trainer_count=1 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_conv_layers=2 \
@ -40,7 +39,7 @@ python -u infer.py \
--infer_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -14,11 +14,9 @@ cd - > /dev/null
# evaluate model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=16 \
--trainer_count=8 \
--batch_size=128 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
@ -29,10 +27,10 @@ python -u test.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--test_manifest='data/tiny/manifest.tiny' \
--test_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/tiny/params.pass-19.tar.gz' \
--model_path='checkpoints/tiny/step_final' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -4,7 +4,7 @@ cd ../.. > /dev/null
# download language model
cd models/lm > /dev/null
sh download_lm_en.sh
bash download_lm_en.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -13,7 +13,7 @@ cd - > /dev/null
# download well-trained model
cd models/librispeech > /dev/null
sh download_model.sh
bash download_model.sh
if [ $? -ne 0 ]; then
exit 1
fi
@ -24,10 +24,8 @@ cd - > /dev/null
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -u test.py \
--batch_size=128 \
--trainer_count=8 \
--beam_size=500 \
--num_proc_bsearch=8 \
--num_proc_data=8 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
@ -41,7 +39,7 @@ python -u test.py \
--test_manifest='data/tiny/manifest.test-clean' \
--mean_std_path='models/librispeech/mean_std.npz' \
--vocab_path='models/librispeech/vocab.txt' \
--model_path='models/librispeech/params.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--decoding_method='ctc_beam_search' \
--error_rate_type='wer' \

@ -3,17 +3,18 @@
cd ../.. > /dev/null
# train model
# if you wish to resume from an exists model, uncomment --init_model_path
# if you wish to resume from an existing model, uncomment --init_from_pretrain_model
export FLAGS_sync_nccl_allreduce=0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u train.py \
--batch_size=16 \
--trainer_count=4 \
--num_passes=20 \
--num_proc_data=1 \
--batch_size=4 \
--num_epoch=20 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--num_iter_print=1 \
--save_epoch=1 \
--num_samples=64 \
--learning_rate=1e-5 \
--max_duration=27.0 \
--min_duration=0.0 \
@ -30,10 +31,10 @@ python -u train.py \
--output_model_dir='./checkpoints/tiny' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
--shuffle_method='batch_shuffle_clipped' \
if [ $? -ne 0 ]; then
echo "Fail in training!"
echo "Failed in training!"
exit 1
fi

@ -3,11 +3,10 @@
cd ../.. > /dev/null
# grid-search for hyper-parameters in language model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -u tools/tune.py \
--num_batches=1 \
--batch_size=24 \
--trainer_count=8 \
--num_batches=-1 \
--batch_size=128 \
--beam_size=500 \
--num_proc_bsearch=12 \
--num_conv_layers=2 \
@ -24,10 +23,10 @@ python -u tools/tune.py \
--use_gru=False \
--use_gpu=True \
--share_rnn_weights=True \
--tune_manifest='data/tiny/manifest.tiny' \
--tune_manifest='data/tiny/manifest.dev-clean' \
--mean_std_path='data/tiny/mean_std.npz' \
--vocab_path='data/tiny/vocab.txt' \
--model_path='checkpoints/params.pass-9.tar.gz' \
--model_path='models/librispeech' \
--lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \
--error_rate_type='wer' \
--specgram_type='linear'

@ -3,9 +3,13 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import argparse
import functools
import paddle.v2 as paddle
import paddle.fluid as fluid
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
from utils.error_rate import wer, cer
@ -15,7 +19,6 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('num_samples', int, 10, "# of samples to infer.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
@ -63,20 +66,25 @@ args = parser.parse_args()
def infer():
"""Inference for DeepSpeech2."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1,
keep_transcription_text=True)
keep_transcription_text=True,
place = place,
is_training = False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.infer_manifest,
batch_size=args.num_samples,
min_batch_size=1,
sortagrad=False,
shuffle_method=None)
infer_data = batch_reader().next()
infer_data = next(batch_reader())
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
@ -84,16 +92,19 @@ def infer():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.model_path)
# decoders only accept string encoded in utf-8
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
if args.decoding_method == "ctc_greedy":
ds2_model.logger.info("start inference ...")
probs_split = ds2_model.infer_batch_probs(infer_data=infer_data,
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
result_transcripts = ds2_model.decode_batch_greedy(
probs_split=probs_split,
vocab_list=vocab_list)
@ -101,9 +112,11 @@ def infer():
ds2_model.init_ext_scorer(args.alpha, args.beta, args.lang_model_path,
vocab_list)
ds2_model.logger.info("start inference ...")
probs_split = ds2_model.infer_batch_probs(infer_data=infer_data,
probs_split= ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
result_transcripts = ds2_model.decode_batch_beam_search(
result_transcripts= ds2_model.decode_batch_beam_search(
probs_split=probs_split,
beam_alpha=args.alpha,
beam_beta=args.beta,
@ -114,7 +127,7 @@ def infer():
num_processes=args.num_proc_bsearch)
error_rate_func = cer if args.error_rate_type == 'cer' else wer
target_transcripts = [data[1] for data in infer_data]
target_transcripts = infer_data[1]
for target, result in zip(target_transcripts, result_transcripts):
print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
@ -125,9 +138,6 @@ def infer():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
infer()
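
One small but easy-to-miss change above: `batch_reader().next()` becomes `next(batch_reader())`. Python 2 generators expose a `.next()` method, while Python 3 renames it to `__next__()`, so the built-in `next()` is the spelling that works on both:

```python
def gen():
    yield 1

g = gen()
print(next(g))    # portable across Python 2 and 3
# g.next()        # AttributeError on Python 3
```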

@ -10,8 +10,13 @@ import logging
import gzip
import copy
import inspect
import cPickle as pickle
import collections
import multiprocessing
import numpy as np
from distutils.dir_util import mkpath
import paddle.v2 as paddle
import paddle.fluid as fluid
import paddle.fluid.compiler as compiler
from decoders.swig_wrapper import Scorer
from decoders.swig_wrapper import ctc_greedy_decoder
from decoders.swig_wrapper import ctc_beam_search_decoder_batch
@ -32,37 +37,197 @@ class DeepSpeech2Model(object):
:type num_rnn_layers: int
:param rnn_layer_size: RNN layer size (number of RNN cells).
:type rnn_layer_size: int
:param pretrained_model_path: Pretrained model path. If None, will train
from stratch.
:type pretrained_model_path: basestring|None
:param use_gru: Use gru if set True. Use simple rnn if set False.
:type use_gru: bool
:param share_rnn_weights: Whether to share input-hidden weights between
forward and backward directional RNNs.Notice that
for GRU, weight sharing is not supported.
:type share_rnn_weights: bool
:param place: Program running place.
:type place: CPUPlace|CUDAPlace
:param init_from_pretrain_model: Pretrained model path. If None, will train
from scratch.
:type init_from_pretrain_model: string|None
:param output_model_dir: Output model directory. If None, output to current directory.
:type output_model_dir: string|None
"""
def __init__(self, vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size, use_gru, pretrained_model_path,
share_rnn_weights):
self._create_network(vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size, use_gru, share_rnn_weights)
self._create_parameters(pretrained_model_path)
self._inferer = None
self._loss_inferer = None
self._ext_scorer = None
def __init__(self,
vocab_size,
num_conv_layers,
num_rnn_layers,
rnn_layer_size,
use_gru=False,
share_rnn_weights=True,
place=fluid.CPUPlace(),
init_from_pretrain_model=None,
output_model_dir=None):
self._vocab_size = vocab_size
self._num_conv_layers = num_conv_layers
self._num_rnn_layers = num_rnn_layers
self._rnn_layer_size = rnn_layer_size
self._use_gru = use_gru
self._share_rnn_weights = share_rnn_weights
self._place = place
self._init_from_pretrain_model = init_from_pretrain_model
self._output_model_dir = output_model_dir
self._ext_scorer = None
self.logger = logging.getLogger("")
self.logger.setLevel(level=logging.INFO)
def create_network(self, is_infer=False):
"""Create data layers and model network.
:param is_infer: Whether to build the network for inference instead of training.
:type is_infer: bool
:return reader: Reader for input.
:rtype reader: reader generator
:return log_probs: An output unnormalized log probability layer.
:rtype log_probs: Variable
:return loss: A ctc loss layer.
:rtype loss: Variable
"""
if not is_infer:
input_fields = {
'names': ['audio_data', 'text_data', 'seq_len_data', 'masks'],
'shapes': [[-1, 161, 161], [-1, 1], [-1, 1], [-1, 32, 81, 1]],
'dtypes': ['float32', 'int32', 'int64', 'float32'],
'lod_levels': [0, 1, 0, 0]
}
inputs = [
fluid.layers.data(
name=input_fields['names'][i],
shape=input_fields['shapes'][i],
dtype=input_fields['dtypes'][i],
lod_level=input_fields['lod_levels'][i])
for i in range(len(input_fields['names']))
]
reader = fluid.io.PyReader(
feed_list=inputs,
capacity=64,
iterable=False,
use_double_buffer=True)
(audio_data, text_data, seq_len_data, masks) = inputs
else:
audio_data = fluid.layers.data(
name='audio_data',
shape=[-1, 161, 161],
dtype='float32',
lod_level=0)
seq_len_data = fluid.layers.data(
name='seq_len_data', shape=[-1, 1], dtype='int64', lod_level=0)
masks = fluid.layers.data(
name='masks',
shape=[-1, 32, 81, 1],
dtype='float32',
lod_level=0)
text_data = None
reader = fluid.DataFeeder([audio_data, seq_len_data, masks],
self._place)
log_probs, loss = deep_speech_v2_network(
audio_data=audio_data,
text_data=text_data,
seq_len_data=seq_len_data,
masks=masks,
dict_size=self._vocab_size,
num_conv_layers=self._num_conv_layers,
num_rnn_layers=self._num_rnn_layers,
rnn_size=self._rnn_layer_size,
use_gru=self._use_gru,
share_rnn_weights=self._share_rnn_weights)
return reader, log_probs, loss
def init_from_pretrain_model(self, exe, program):
'''Init params from pretrain model. '''
assert isinstance(self._init_from_pretrain_model, str)
if not os.path.exists(self._init_from_pretrain_model):
print(self._init_from_pretrain_model)
raise Warning("The pretrained params do not exist.")
return False
fluid.io.load_params(
exe,
self._init_from_pretrain_model,
main_program=program,
filename="params.pdparams")
print("finish initing model from pretrained params from %s" %
(self._init_from_pretrain_model))
pre_epoch = 0
dir_name = self._init_from_pretrain_model.split('_')
if len(dir_name) >= 2 and dir_name[-2].endswith('epoch') and dir_name[
-1].isdigit():
pre_epoch = int(dir_name[-1])
return pre_epoch + 1
def save_param(self, exe, program, dirname):
'''Save model params to dirname'''
assert isinstance(self._output_model_dir, str)
param_dir = os.path.join(self._output_model_dir)
if not os.path.exists(param_dir):
os.mkdir(param_dir)
fluid.io.save_params(
exe,
os.path.join(param_dir, dirname),
main_program=program,
filename="params.pdparams")
print("save parameters at %s" % (os.path.join(param_dir, dirname)))
return True
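
`init_from_pretrain_model` and `save_param` above are thin wrappers around `fluid.io.load_params` / `fluid.io.save_params` with a single `params.pdparams` file per checkpoint directory. A toy round-trip under that assumption (the tiny program and the paths are illustrative):

```python
import paddle.fluid as fluid

prog, startup = fluid.Program(), fluid.Program()
with fluid.program_guard(prog, startup):
    x = fluid.layers.data(name='x', shape=[4], dtype='float32')
    fluid.layers.fc(input=x, size=2)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(startup)

# All parameters of `prog` go into one file: ./ckpt/epoch_0/params.pdparams
fluid.io.save_params(exe, './ckpt/epoch_0', main_program=prog,
                     filename='params.pdparams')
fluid.io.load_params(exe, './ckpt/epoch_0', main_program=prog,
                     filename='params.pdparams')
```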
def test(self, exe, dev_batch_reader, test_program, test_pyreader,
fetch_list):
'''Test the model.
:param exe:The executor of program.
:type exe: Executor
:param dev_batch_reader: The reader of test data.
:type dev_batch_reader: read generator
:param test_program: The program of test.
:type test_program: Program
:param test_pyreader: PyReader for the test program.
:type test_pyreader: PyReader
:param fetch_list: Fetch list.
:type fetch_list: list
:return: An output unnormalized log probability.
:rtype: array
'''
test_pyreader.start()
epoch_loss = []
while True:
try:
each_loss = exe.run(
program=test_program,
fetch_list=fetch_list,
return_numpy=False)
epoch_loss.extend(np.array(each_loss[0]))
except fluid.core.EOFException:
test_pyreader.reset()
break
return np.mean(np.array(epoch_loss))
def train(self,
train_batch_reader,
dev_batch_reader,
feeding_dict,
learning_rate,
gradient_clipping,
num_passes,
output_model_dir,
is_local=True,
num_epoch,
batch_size,
num_samples,
save_epoch=100,
num_iterations_print=100,
test_off=False):
"""Train the model.
@ -78,104 +243,138 @@ class DeepSpeech2Model(object):
:type learning_rate: float
:param gradient_clipping: Gradient clipping threshold.
:type gradient_clipping: float
:param num_passes: Number of training epochs.
:type num_passes: int
:param num_epoch: Number of training epochs.
:type num_epoch: int
:param batch_size: Batch size.
:type batch_size: int
:param num_samples: Number of training samples.
:type num_samples: int
:param save_epoch: Number of epochs between checkpoint saves.
:type save_epoch: int
:param num_iterations_print: Number of training iterations for printing
a training loss.
:type rnn_iteratons_print: int
:param is_local: Set to False if running with pserver with multi-nodes.
:type is_local: bool
:param output_model_dir: Directory for saving the model (every pass).
:type output_model_dir: basestring
:type num_iterations_print: int
:param test_off: Turn off testing.
:type test_off: bool
"""
# prepare model output directory
if not os.path.exists(output_model_dir):
mkpath(output_model_dir)
if not os.path.exists(self._output_model_dir):
mkpath(self._output_model_dir)
# adapt the feeding dict and reader according to the network
# adapt the feeding dict according to the network
adapted_feeding_dict = self._adapt_feeding_dict(feeding_dict)
adapted_train_batch_reader = self._adapt_data(train_batch_reader)
adapted_dev_batch_reader = self._adapt_data(dev_batch_reader)
# prepare optimizer and trainer
optimizer = paddle.optimizer.Adam(
learning_rate=learning_rate,
gradient_clipping_threshold=gradient_clipping)
trainer = paddle.trainer.SGD(
cost=self._loss,
parameters=self._parameters,
update_equation=optimizer,
is_local=is_local)
# create event handler
def event_handler(event):
global start_time, cost_sum, cost_counter
if isinstance(event, paddle.event.EndIteration):
cost_sum += event.cost
cost_counter += 1
if (event.batch_id + 1) % num_iterations_print == 0:
output_model_path = os.path.join(output_model_dir,
"params.latest.tar.gz")
with gzip.open(output_model_path, 'w') as f:
trainer.save_parameter_to_tar(f)
print("\nPass: %d, Batch: %d, TrainCost: %f" %
(event.pass_id, event.batch_id + 1,
cost_sum / cost_counter))
cost_sum, cost_counter = 0.0, 0
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.BeginPass):
start_time = time.time()
cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass):
if test_off:
print("\n------- Time: %d sec, Pass: %d" %
(time.time() - start_time, event.pass_id))
else:
result = trainer.test(
reader=adapted_dev_batch_reader,
feeding=adapted_feeding_dict)
print(
"\n------- Time: %d sec, Pass: %d, "
"ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
output_model_path = os.path.join(
output_model_dir, "params.pass-%d.tar.gz" % event.pass_id)
with gzip.open(output_model_path, 'w') as f:
trainer.save_parameter_to_tar(f)
# run train
trainer.train(
reader=adapted_train_batch_reader,
event_handler=event_handler,
num_passes=num_passes,
feeding=adapted_feeding_dict)
# TODO(@pkuyym) merge this function into infer_batch
def infer_loss_batch(self, infer_data):
"""Model inference. Infer the ctc loss for a batch of speech
utterances.
:param infer_data: List of utterances to infer, with each utterance a
tuple of audio features and transcription text (empty
string).
:type infer_data: list
:return: List of ctc loss.
:rtype: List of float
"""
# define inferer
if self._loss_inferer == None:
self._loss_inferer = paddle.inference.Inference(
output_layer=self._loss, parameters=self._parameters)
# run inference
return self._loss_inferer.infer(input=infer_data)
if isinstance(self._place, fluid.CUDAPlace):
dev_count = fluid.core.get_cuda_device_count()
else:
dev_count = int(os.environ.get('CPU_NUM', 1))
# prepare the network
train_program = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_program, startup_prog):
with fluid.unique_name.guard():
train_pyreader, log_probs, ctc_loss = self.create_network()
# prepare optimizer
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=fluid.layers.exponential_decay(
learning_rate=learning_rate,
decay_steps=num_samples / batch_size / dev_count,
decay_rate=0.83,
staircase=True))
fluid.clip.set_gradient_clip(
clip=fluid.clip.GradientClipByGlobalNorm(
clip_norm=gradient_clipping))
optimizer.minimize(loss=ctc_loss)
test_prog = fluid.Program()
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, _, ctc_loss = self.create_network()
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(self._place)
exe.run(startup_prog)
# init from some pretrain models, to better solve the current task
pre_epoch = 0
if self._init_from_pretrain_model:
pre_epoch = self.init_from_pretrain_model(exe, train_program)
build_strategy = compiler.BuildStrategy()
exec_strategy = fluid.ExecutionStrategy()
# pass the build_strategy to with_data_parallel API
compiled_prog = compiler.CompiledProgram(
train_program).with_data_parallel(
loss_name=ctc_loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
train_pyreader.decorate_batch_generator(train_batch_reader)
test_pyreader.decorate_batch_generator(dev_batch_reader)
# run train
for epoch_id in range(num_epoch):
train_pyreader.start()
epoch_loss = []
time_begin = time.time()
batch_id = 0
step = 0
while True:
try:
fetch_list = [ctc_loss.name]
if batch_id % num_iterations_print == 0:
fetch = exe.run(
program=compiled_prog,
fetch_list=fetch_list,
return_numpy=False)
each_loss = fetch[0]
epoch_loss.extend(np.array(each_loss[0]) / batch_size)
print("epoch: %d, batch: %d, train loss: %f\n" %
(epoch_id, batch_id,
np.mean(each_loss[0]) / batch_size))
else:
each_loss = exe.run(
program=compiled_prog,
fetch_list=[],
return_numpy=False)
batch_id = batch_id + 1
except fluid.core.EOFException:
train_pyreader.reset()
break
time_end = time.time()
used_time = time_end - time_begin
if test_off:
print("\n--------Time: %f sec, epoch: %d, train loss: %f\n" %
(used_time, epoch_id, np.mean(np.array(epoch_loss))))
else:
print('\n----------Begin test...')
test_loss = self.test(
exe,
dev_batch_reader=dev_batch_reader,
test_program=test_prog,
test_pyreader=test_pyreader,
fetch_list=[ctc_loss])
print(
"--------Time: %f sec, epoch: %d, train loss: %f, test loss: %f"
% (used_time, epoch_id + pre_epoch,
np.mean(np.array(epoch_loss)), test_loss / batch_size))
if (epoch_id + 1) % save_epoch == 0:
self.save_param(exe, train_program,
"epoch_" + str(epoch_id + pre_epoch))
self.save_param(exe, train_program, "step_final")
print("\n------------Training finished!!!-------------")
def infer_batch_probs(self, infer_data, feeding_dict):
"""Infer the prob matrices for a batch of speech utterances.
:param infer_data: List of utterances to infer, with each utterance
consisting of a tuple of audio features and
transcription text (empty string).
@ -188,26 +387,55 @@ class DeepSpeech2Model(object):
:rtype: List of matrix
"""
# define inferer
if self._inferer == None:
self._inferer = paddle.inference.Inference(
output_layer=self._log_probs, parameters=self._parameters)
infer_program = fluid.Program()
startup_prog = fluid.Program()
# adapt the feeding dict according to the network
adapted_feeding_dict = self._adapt_feeding_dict(feeding_dict)
adapted_infer_data = self._adapt_data(infer_data)
# prepare the network
with fluid.program_guard(infer_program, startup_prog):
with fluid.unique_name.guard():
feeder, log_probs, _ = self.create_network(is_infer=True)
infer_program = infer_program.clone(for_test=True)
exe = fluid.Executor(self._place)
exe.run(startup_prog)
# init param from pretrain_model
if not self._init_from_pretrain_model:
exit("No pretrain model file path!")
self.init_from_pretrain_model(exe, infer_program)
infer_results = []
time_begin = time.time()
# run inference
infer_results = self._inferer.infer(
input=adapted_infer_data, feeding=adapted_feeding_dict)
start_pos = [0] * (len(adapted_infer_data) + 1)
for i in xrange(len(adapted_infer_data)):
start_pos[i + 1] = start_pos[i] + adapted_infer_data[i][3][0]
for i in range(infer_data[0].shape[0]):
each_log_probs = exe.run(
program=infer_program,
feed=feeder.feed(
[[infer_data[0][i], infer_data[2][i], infer_data[3][i]]]),
fetch_list=[log_probs],
return_numpy=False)
infer_results.extend(np.array(each_log_probs[0]))
# slice result
infer_results = np.array(infer_results)
seq_len = (infer_data[2] - 1) // 3 + 1
start_pos = [0] * (infer_data[0].shape[0] + 1)
for i in range(infer_data[0].shape[0]):
start_pos[i + 1] = start_pos[i] + seq_len[i][0]
probs_split = [
infer_results[start_pos[i]:start_pos[i + 1]]
for i in xrange(0, len(adapted_infer_data))
for i in range(0, infer_data[0].shape[0])
]
return probs_split
def decode_batch_greedy(self, probs_split, vocab_list):
"""Decode by best path for a batch of probs matrix input.
:param probs_split: List of 2-D probability matrix, and each consists
of prob vectors for one speech utterancce.
:param probs_split: List of matrix
@ -221,12 +449,12 @@ class DeepSpeech2Model(object):
output_transcription = ctc_greedy_decoder(
probs_seq=probs, vocabulary=vocab_list)
results.append(output_transcription)
print(results)
return results
def init_ext_scorer(self, beam_alpha, beam_beta, language_model_path,
vocab_list):
"""Initialize the external scorer.
:param beam_alpha: Parameter associated with language model.
:type beam_alpha: float
:param beam_beta: Parameter associated with word count.
@ -261,7 +489,6 @@ class DeepSpeech2Model(object):
beam_size, cutoff_prob, cutoff_top_n,
vocab_list, num_processes):
"""Decode by beam search for a batch of probs matrix input.
:param probs_split: List of 2-D probability matrix, and each consists
of prob vectors for one speech utterancce.
:param probs_split: List of matrix
@ -319,124 +546,16 @@ class DeepSpeech2Model(object):
if isinstance(feeding_dict, dict):
adapted_feeding_dict["sequence_offset"] = len(adapted_feeding_dict)
adapted_feeding_dict["sequence_length"] = len(adapted_feeding_dict)
for i in xrange(self._num_conv_layers):
for i in range(self._num_conv_layers):
adapted_feeding_dict["conv%d_index_range" %i] = \
len(adapted_feeding_dict)
elif isinstance(feeding_dict, list):
adapted_feeding_dict.append("sequence_offset")
adapted_feeding_dict.append("sequence_length")
for i in xrange(self._num_conv_layers):
for i in range(self._num_conv_layers):
adapted_feeding_dict.append("conv%d_index_range" % i)
else:
raise ValueError("Type of feeding_dict is %s, not supported." %
type(feeding_dict))
return adapted_feeding_dict
def _adapt_data(self, data):
"""Adapt data according to network struct.
For each convolution layer in the conv_group, to remove impacts from
padding data, we can multiply zero to the padding part of the outputs
of each batch normalization layer. We add a scale_sub_region layer after
each batch normalization layer to reset the padding data.
For rnn layers, to remove impacts from padding data, we can truncate the
padding part before the output data is fed into the first rnn layer. We use
sub_seq layer to achieve this.
:param data: Data from data_provider.
:type data: list|function
:return: Adapted data.
:rtype: list|function
"""
def adapt_instance(instance):
if len(instance) < 2 or len(instance) > 3:
raise ValueError("Size of instance should be 2 or 3.")
padded_audio = instance[0]
text = instance[1]
# no padding part
if len(instance) == 2:
audio_len = padded_audio.shape[1]
else:
audio_len = instance[2]
adapted_instance = [padded_audio, text]
# Stride size for conv0 is (3, 2)
# Stride size for conv1 to convN is (1, 2)
# Same as the network, hard-coded here
padded_conv0_h = (padded_audio.shape[0] - 1) // 2 + 1
padded_conv0_w = (padded_audio.shape[1] - 1) // 3 + 1
valid_w = (audio_len - 1) // 3 + 1
adapted_instance += [
[0], # sequence offset, always 0
[valid_w], # valid sequence length
# Index ranges for channel, height and width
# Please refer scale_sub_region layer to see details
[1, 32, 1, padded_conv0_h, valid_w + 1, padded_conv0_w]
]
pre_padded_h = padded_conv0_h
for i in xrange(self._num_conv_layers - 1):
padded_h = (pre_padded_h - 1) // 2 + 1
pre_padded_h = padded_h
adapted_instance += [
[1, 32, 1, padded_h, valid_w + 1, padded_conv0_w]
]
return adapted_instance
if isinstance(data, list):
return map(adapt_instance, data)
elif inspect.isgeneratorfunction(data):
def adapted_reader():
for instance in data():
yield map(adapt_instance, instance)
return adapted_reader
else:
raise ValueError("Type of data is %s, not supported." % type(data))
def _create_parameters(self, model_path=None):
"""Load or create model parameters."""
if model_path is None:
self._parameters = paddle.parameters.create(self._loss)
else:
self._parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_path))
def _create_network(self, vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size, use_gru, share_rnn_weights):
"""Create data layers and model network."""
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram",
type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(vocab_size))
seq_offset_data = paddle.layer.data(
name='sequence_offset',
type=paddle.data_type.integer_value_sequence(1))
seq_len_data = paddle.layer.data(
name='sequence_length',
type=paddle.data_type.integer_value_sequence(1))
index_range_datas = []
for i in xrange(num_rnn_layers):
index_range_datas.append(
paddle.layer.data(
name='conv%d_index_range' % i,
type=paddle.data_type.dense_vector(6)))
self._log_probs, self._loss = deep_speech_v2_network(
audio_data=audio_data,
text_data=text_data,
seq_offset_data=seq_offset_data,
seq_len_data=seq_len_data,
index_range_datas=index_range_datas,
dict_size=vocab_size,
num_conv_layers=num_conv_layers,
num_rnn_layers=num_rnn_layers,
rnn_size=rnn_layer_size,
use_gru=use_gru,
share_rnn_weights=share_rnn_weights)
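
The rewritten `train()` and `infer_batch_probs()` follow the standard Fluid 1.x pattern: build a `Program` under `program_guard`, run the startup program once to initialize parameters, then call `Executor.run()` (optionally through a `CompiledProgram` for multi-device data parallelism). A self-contained toy example of that pattern, unrelated to the DeepSpeech network itself:

```python
import numpy as np
import paddle.fluid as fluid

train_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    x = fluid.layers.data(name='x', shape=[1], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')
    pred = fluid.layers.fc(input=x, size=1)
    loss = fluid.layers.reduce_mean(fluid.layers.square(pred - y))
    fluid.optimizer.AdamOptimizer(learning_rate=1e-2).minimize(loss)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(startup_prog)                      # initialize parameters once

for step in range(100):
    xs = np.random.rand(8, 1).astype('float32')
    ys = 2.0 * xs
    loss_val, = exe.run(train_prog,
                        feed={'x': xs, 'y': ys},
                        fetch_list=[loss])
```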

@ -1,188 +1,322 @@
"""Contains DeepSpeech2 layers and networks."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle
import collections
import paddle.fluid as fluid
import numpy as np
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride,
padding, act, index_range_data):
padding, act, masks, name):
"""Convolution layer with batch normalization.
:param input: Input layer.
:type input: LayerOutput
:type input: Variable
:param filter_size: The x dimension of a filter kernel. Or input a tuple for
two image dimension.
:type filter_size: int|tuple|list
:param num_channels_in: Number of input channels.
:type num_channels_in: int
:type num_channels_out: Number of output channels.
:type num_channels_in: out
:param num_channels_out: Number of output channels.
:type num_channels_out: int
:param stride: The x dimension of the stride. Or input a tuple for two
image dimension.
:type stride: int|tuple|list
:param padding: The x dimension of the padding. Or input a tuple for two
image dimension.
:type padding: int|tuple|list
:param act: Activation type.
:type act: BaseActivation
:param index_range_data: Index range to indicate sub region.
:type index_range_data: LayerOutput
:type act: string
:param masks: Masks data layer to reset padding.
:type masks: Variable
:param name: Name of the layer.
:type name: string
:return: Batch norm layer after convolution layer.
:rtype: LayerOutput
:rtype: Variable
"""
conv_layer = paddle.layer.img_conv(
conv_layer = fluid.layers.conv2d(
input=input,
filter_size=filter_size,
num_channels=num_channels_in,
num_filters=num_channels_out,
filter_size=filter_size,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
param_attr=fluid.ParamAttr(name=name + '_conv2d_weight'),
act=None,
bias_attr=False)
batch_norm = paddle.layer.batch_norm(input=conv_layer, act=act)
batch_norm = fluid.layers.batch_norm(
input=conv_layer,
act=act,
param_attr=fluid.ParamAttr(name=name + '_batch_norm_weight'),
bias_attr=fluid.ParamAttr(name=name + '_batch_norm_bias'),
moving_mean_name=name + '_batch_norm_moving_mean',
moving_variance_name=name + '_batch_norm_moving_variance')
# reset padding part to 0
scale_sub_region = paddle.layer.scale_sub_region(
batch_norm, index_range_data, value=0.0)
return scale_sub_region
padding_reset = fluid.layers.elementwise_mul(batch_norm, masks)
return padding_reset
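
The `elementwise_mul` with `masks` replaces the old `scale_sub_region` layer: after batch norm, activations in the padded time steps are multiplied by zero so they cannot leak into later layers. The effect, in plain NumPy (shapes are the same illustrative ones used for the data-side mask earlier):

```python
import numpy as np

bn_out = np.random.randn(32, 81, 67)             # (channels, height, padded width)
mask = np.concatenate([np.ones((32, 81, 40)),    # 40 valid frames
                       np.zeros((32, 81, 27))],  # 27 padded frames
                      axis=2)
masked = bn_out * mask
assert (masked[:, :, 40:] == 0).all()            # padded part is reset to zero
```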
def simple_rnn(input, size, param_attr=None, bias_attr=None, is_reverse=False):
'''A simple rnn layer.
:param input: Input layer.
:type input: Variable
:param size: Number of RNN cells.
:type size: int
:param param_attr: Parameter properties of the hidden layer weights that
can be learned.
:type param_attr: ParamAttr
:param bias_attr: Bias properties of the hidden layer weights that can be learned.
:type bias_attr: ParamAttr
:param is_reverse: Whether to run the RNN over the reversed sequence.
:type is_reverse: bool
:return: A simple RNN layer.
:rtype: Variable
'''
if is_reverse:
input = fluid.layers.sequence_reverse(x=input)
pad_value = fluid.layers.assign(input=np.array([0.0], dtype=np.float32))
input, length = fluid.layers.sequence_pad(input, pad_value)
rnn = fluid.layers.StaticRNN()
input = fluid.layers.transpose(input, [1, 0, 2])
with rnn.step():
in_ = rnn.step_input(input)
mem = rnn.memory(shape=[-1, size], batch_ref=in_)
out = fluid.layers.fc(
input=mem,
size=size,
act=None,
param_attr=param_attr,
bias_attr=bias_attr)
out = fluid.layers.elementwise_add(out, in_)
out = fluid.layers.brelu(out)
rnn.update_memory(mem, out)
rnn.output(out)
out = rnn()
out = fluid.layers.transpose(out, [1, 0, 2])
out = fluid.layers.sequence_unpad(x=out, length=length)
if is_reverse:
out = fluid.layers.sequence_reverse(x=out)
return out
def bidirectional_simple_rnn_bn_layer(name, input, size, act, share_weights):
def bidirectional_simple_rnn_bn_layer(name, input, size, share_weights):
"""Bidirectonal simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
:param name: Name of the layer.
:param name: Name of the layer parameters.
:type name: string
:param input: Input layer.
:type input: LayerOutput
:type input: Variable
:param size: Number of RNN cells.
:type size: int
:param act: Activation type.
:type act: BaseActivation
:param share_weights: Whether to share input-hidden weights between
forward and backward directional RNNs.
:type share_weights: bool
:return: Bidirectional simple rnn layer.
:rtype: LayerOutput
:rtype: Variable
"""
if share_weights:
# input-hidden weights shared between bi-direcitonal rnn.
input_proj = paddle.layer.fc(
#input-hidden weights shared between bi-directional rnn.
input_proj = fluid.layers.fc(
input=input,
size=size,
act=paddle.activation.Linear(),
act=None,
param_attr=fluid.ParamAttr(name=name + '_fc_weight'),
bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn = paddle.layer.batch_norm(
input=input_proj, act=paddle.activation.Linear())
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=True)
input_proj_bn = fluid.layers.batch_norm(
input=input_proj,
act=None,
param_attr=fluid.ParamAttr(name=name + '_batch_norm_weight'),
bias_attr=fluid.ParamAttr(name=name + '_batch_norm_bias'),
moving_mean_name=name + '_batch_norm_moving_mean',
moving_variance_name=name + '_batch_norm_moving_variance')
# forward and backward in time
forward_rnn = simple_rnn(
input=input_proj_bn,
size=size,
param_attr=fluid.ParamAttr(name=name + '_forward_rnn_weight'),
bias_attr=fluid.ParamAttr(name=name + '_forward_rnn_bias'),
is_reverse=False)
reverse_rnn = simple_rnn(
input=input_proj_bn,
size=size,
param_attr=fluid.ParamAttr(name=name + '_reverse_rnn_weight'),
bias_attr=fluid.ParamAttr(name=name + '_reverse_rnn_bias'),
is_reverse=True)
else:
input_proj_forward = paddle.layer.fc(
input_proj_forward = fluid.layers.fc(
input=input,
size=size,
act=paddle.activation.Linear(),
act=None,
param_attr=fluid.ParamAttr(name=name + '_forward_fc_weight'),
bias_attr=False)
input_proj_backward = paddle.layer.fc(
input_proj_backward = fluid.layers.fc(
input=input,
size=size,
act=paddle.activation.Linear(),
act=None,
param_attr=fluid.ParamAttr(name=name + '_reverse_fc_weight'),
bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn_forward = paddle.layer.batch_norm(
input=input_proj_forward, act=paddle.activation.Linear())
input_proj_bn_backward = paddle.layer.batch_norm(
input=input_proj_backward, act=paddle.activation.Linear())
#batch norm is only performed on input-state projection
input_proj_bn_forward = fluid.layers.batch_norm(
input=input_proj_forward,
act=None,
param_attr=fluid.ParamAttr(
name=name + '_forward_batch_norm_weight'),
bias_attr=fluid.ParamAttr(name=name + '_forward_batch_norm_bias'),
moving_mean_name=name + '_forward_batch_norm_moving_mean',
moving_variance_name=name + '_forward_batch_norm_moving_variance')
input_proj_bn_backward = fluid.layers.batch_norm(
input=input_proj_backward,
act=None,
param_attr=fluid.ParamAttr(
name=name + '_reverse_batch_norm_weight'),
bias_attr=fluid.ParamAttr(name=name + '_reverse_batch_norm_bias'),
moving_mean_name=name + '_reverse_batch_norm_moving_mean',
moving_variance_name=name + '_reverse_batch_norm_moving_variance')
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn_forward, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn_backward, act=act, reverse=True)
return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
forward_rnn = simple_rnn(
input=input_proj_bn_forward,
size=size,
param_attr=fluid.ParamAttr(name=name + '_forward_rnn_weight'),
bias_attr=fluid.ParamAttr(name=name + '_forward_rnn_bias'),
is_reverse=False)
reverse_rnn = simple_rnn(
input=input_proj_bn_backward,
size=size,
param_attr=fluid.ParamAttr(name=name + '_reverse_rnn_weight'),
bias_attr=fluid.ParamAttr(name=name + '_reverse_rnn_bias'),
is_reverse=True)
out = fluid.layers.concat(input=[forward_rnn, reverse_rnn], axis=1)
return out
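A hypothetical call to the layer above, using the new Fluid-style signature (no `act` argument); the input name and width below are placeholders, not code from this patch. The output is twice as wide as `size`, since the forward and reverse passes are concatenated along the feature axis.

```python
import paddle.fluid as fluid

# Illustrative only: one bidirectional simple-RNN layer with shared
# input-hidden weights; the output width is 2 * size.
rnn_in = fluid.layers.data(
    name='demo_rnn_in', shape=[1024], dtype='float32', lod_level=1)
bi_rnn_out = bidirectional_simple_rnn_bn_layer(
    name='layer_2', input=rnn_in, size=1024, share_weights=True)
```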
def bidirectional_gru_bn_layer(name, input, size, act):
"""Bidirectonal gru layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
:param name: Name of the layer.
:type name: string
:param input: Input layer.
:type input: LayerOutput
:param size: Number of RNN cells.
:type input: Variable
:param size: Number of GRU cells.
:type size: int
:param act: Activation type.
:type act: BaseActivation
:return: Bidirectional simple rnn layer.
:rtype: LayerOutput
:type act: string
:return: Bidirectional GRU layer.
:rtype: Variable
"""
input_proj_forward = paddle.layer.fc(
input_proj_forward = fluid.layers.fc(
input=input,
size=size * 3,
act=paddle.activation.Linear(),
act=None,
param_attr=fluid.ParamAttr(name=name + '_forward_fc_weight'),
bias_attr=False)
input_proj_backward = paddle.layer.fc(
input_proj_backward = fluid.layers.fc(
input=input,
size=size * 3,
act=paddle.activation.Linear(),
act=None,
param_attr=fluid.ParamAttr(name=name + '_reverse_fc_weight'),
bias_attr=False)
# batch norm is only performed on input-related projections
input_proj_bn_forward = paddle.layer.batch_norm(
input=input_proj_forward, act=paddle.activation.Linear())
input_proj_bn_backward = paddle.layer.batch_norm(
input=input_proj_backward, act=paddle.activation.Linear())
# forward and backward in time
forward_gru = paddle.layer.grumemory(
input=input_proj_bn_forward, act=act, reverse=False)
backward_gru = paddle.layer.grumemory(
input=input_proj_bn_backward, act=act, reverse=True)
return paddle.layer.concat(input=[forward_gru, backward_gru])
#batch norm is only performed on input-related projections
input_proj_bn_forward = fluid.layers.batch_norm(
input=input_proj_forward,
act=None,
param_attr=fluid.ParamAttr(name=name + '_forward_batch_norm_weight'),
bias_attr=fluid.ParamAttr(name=name + '_forward_batch_norm_bias'),
moving_mean_name=name + '_forward_batch_norm_moving_mean',
moving_variance_name=name + '_forward_batch_norm_moving_variance')
input_proj_bn_backward = fluid.layers.batch_norm(
input=input_proj_backward,
act=None,
param_attr=fluid.ParamAttr(name=name + '_reverse_batch_norm_weight'),
bias_attr=fluid.ParamAttr(name=name + '_reverse_batch_norm_bias'),
moving_mean_name=name + '_reverse_batch_norm_moving_mean',
moving_variance_name=name + '_reverse_batch_norm_moving_variance')
#forward and backward in time
forward_gru = fluid.layers.dynamic_gru(
input=input_proj_bn_forward,
size=size,
gate_activation='sigmoid',
candidate_activation=act,
param_attr=fluid.ParamAttr(name=name + '_forward_gru_weight'),
bias_attr=fluid.ParamAttr(name=name + '_forward_gru_bias'),
is_reverse=False)
reverse_gru = fluid.layers.dynamic_gru(
input=input_proj_bn_backward,
size=size,
gate_activation='sigmoid',
candidate_activation=act,
param_attr=fluid.ParamAttr(name=name + '_reverse_gru_weight'),
bias_attr=fluid.ParamAttr(name=name + '_reverse_gru_bias'),
is_reverse=True)
return fluid.layers.concat(input=[forward_gru, reverse_gru], axis=1)
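One point worth spelling out: `fluid.layers.dynamic_gru` does not project its input internally, it expects the input to already be `3 * size` wide (update gate, reset gate and candidate), which is why the projections above use `size=size * 3`. A minimal sketch mirroring the forward branch, with placeholder names:

```python
import paddle.fluid as fluid

# Illustrative only: project to 3 * size, then run the GRU over the sequence.
x = fluid.layers.data(name='demo_x', shape=[512], dtype='float32', lod_level=1)
proj = fluid.layers.fc(input=x, size=1024 * 3, act=None, bias_attr=False)
gru_out = fluid.layers.dynamic_gru(
    input=proj,
    size=1024,
    gate_activation='sigmoid',
    candidate_activation='relu',
    is_reverse=False)
```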
def conv_group(input, num_stacks, index_range_datas):
def conv_group(input, num_stacks, seq_len_data, masks):
"""Convolution group with stacked convolution layers.
:param input: Input layer.
:type input: LayerOutput
:type input: Variable
:param num_stacks: Number of stacked convolution layers.
:type num_stacks: int
:param index_range_datas: Index ranges for each convolution layer.
:type index_range_datas: tuple|list
:param seq_len_data: Valid sequence length data layer.
:type seq_len_data: Variable
:param masks: Masks data layer to reset padding.
:type masks: Variable
:return: Output layer of the convolution group.
:rtype: LayerOutput
:rtype: Variable
"""
filter_size = (41, 11)
stride = (2, 3)
padding = (20, 5)
conv = conv_bn_layer(
input=input,
filter_size=(11, 41),
filter_size=filter_size,
num_channels_in=1,
num_channels_out=32,
stride=(3, 2),
padding=(5, 20),
act=paddle.activation.BRelu(),
index_range_data=index_range_datas[0])
for i in xrange(num_stacks - 1):
stride=stride,
padding=padding,
act="brelu",
masks=masks,
name='layer_0', )
seq_len_data = (np.array(seq_len_data) - filter_size[1] + 2 * padding[1]
) // stride[1] + 1
output_height = (161 - 1) // 2 + 1
for i in range(num_stacks - 1):
#reshape masks
output_height = (output_height - 1) // 2 + 1
masks = fluid.layers.slice(
masks, axes=[2], starts=[0], ends=[output_height])
conv = conv_bn_layer(
input=conv,
filter_size=(11, 21),
filter_size=(21, 11),
num_channels_in=32,
num_channels_out=32,
stride=(1, 2),
padding=(5, 10),
act=paddle.activation.BRelu(),
index_range_data=index_range_datas[i + 1])
output_num_channels = 32
output_height = 160 // pow(2, num_stacks) + 1
return conv, output_num_channels, output_height
stride=(2, 1),
padding=(10, 5),
act="brelu",
masks=masks,
name='layer_{}'.format(i + 1), )
output_num_channels = 32
return conv, output_num_channels, output_height, seq_len_data
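To make the shape bookkeeping above concrete, here is the arithmetic for the default 161-bin linear spectrogram with `num_stacks=2`; the snippet is a worked example only, not code from this patch.

```python
# Frequency (height) axis: filter 41, stride 2, padding 20 in layer_0,
# then filter 21, stride 2, padding 10 in each later stack.
h0 = 161
h1 = (h0 - 41 + 2 * 20) // 2 + 1   # 81, same as (161 - 1) // 2 + 1 above
h2 = (h1 - 21 + 2 * 10) // 2 + 1   # 41, same as (h1 - 1) // 2 + 1 above

# Time (width) axis: only layer_0 downsamples (filter 11, stride 3, padding 5);
# later stacks use a time stride of 1, so seq_len_data is recomputed only once.
def time_steps_after_layer_0(t0):
    return (t0 - 11 + 2 * 5) // 3 + 1

assert (h1, h2, time_steps_after_layer_0(300)) == (81, 41, 100)
```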
def rnn_group(input, size, num_stacks, use_gru, share_rnn_weights):
"""RNN group with stacked bidirectional simple RNN layers.
def rnn_group(input, size, num_stacks, num_conv_layers, use_gru,
share_rnn_weights):
"""RNN group with stacked bidirectional simple RNN or GRU layers.
:param input: Input layer.
:type input: LayerOutput
:type input: Variable
:param size: Number of RNN cells in each layer.
:type size: int
:param num_stacks: Number of stacked rnn layers.
@ -194,32 +328,30 @@ def rnn_group(input, size, num_stacks, use_gru, share_rnn_weights):
It is only available when use_gru=False.
:type share_weights: bool
:return: Output layer of the RNN group.
:rtype: LayerOutput
:rtype: Variable
"""
output = input
for i in xrange(num_stacks):
for i in range(num_stacks):
if use_gru:
output = bidirectional_gru_bn_layer(
name=str(i),
name='layer_{}'.format(i + num_conv_layers),
input=output,
size=size,
act=paddle.activation.Relu())
# BRelu does not support hppl, need to add later. Use Relu instead.
act="relu")
else:
name = 'layer_{}'.format(i + num_conv_layers)
output = bidirectional_simple_rnn_bn_layer(
name=str(i),
name=name,
input=output,
size=size,
act=paddle.activation.BRelu(),
share_weights=share_rnn_weights)
return output
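Because every sub-layer now carries an explicit parameter name, the stack created here continues the `layer_{i}` numbering begun in `conv_group`, which keeps parameter names stable across checkpoints. A hypothetical call, with illustrative sizes only:

```python
import paddle.fluid as fluid

# Illustrative only: three stacked bidirectional GRU layers on top of a
# 2-layer conv front end, so the RNN layers are named layer_2..layer_4.
seq = fluid.layers.data(
    name='demo_seq', shape=[1312], dtype='float32', lod_level=1)
rnn_out = rnn_group(
    input=seq,                 # e.g. 32 channels * 41 rows = 1312 features
    size=2048,
    num_stacks=3,
    num_conv_layers=2,
    use_gru=True,
    share_rnn_weights=False)   # only meaningful when use_gru=False
```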
def deep_speech_v2_network(audio_data,
text_data,
seq_offset_data,
seq_len_data,
index_range_datas,
masks,
dict_size,
num_conv_layers=2,
num_rnn_layers=3,
@ -227,17 +359,14 @@ def deep_speech_v2_network(audio_data,
use_gru=False,
share_rnn_weights=True):
"""The DeepSpeech2 network structure.
:param audio_data: Audio spectrogram data layer.
:type audio_data: LayerOutput
:type audio_data: Variable
:param text_data: Transcription text data layer.
:type text_data: LayerOutput
:param seq_offset_data: Sequence offset data layer.
:type seq_offset_data: LayerOutput
:type text_data: Variable
:param seq_len_data: Valid sequence length data layer.
:type seq_len_data: LayerOutput
:param index_range_datas: Index ranges data layers.
:type index_range_datas: tuple|list
:type seq_len_data: Variable
:param masks: Masks data layer to reset padding.
:type masks: Variable
:param dict_size: Dictionary size for tokenized transcription.
:type dict_size: int
:param num_conv_layers: Number of stacking convolution layers.
@ -254,49 +383,53 @@ def deep_speech_v2_network(audio_data,
:type share_weights: bool
:return: A tuple of an output unnormalized log probability layer (
before softmax) and a ctc cost layer.
:rtype: tuple of LayerOutput
:rtype: tuple of Variable
"""
audio_data = fluid.layers.unsqueeze(audio_data, axes=[1])
# convolution group
conv_group_output, conv_group_num_channels, conv_group_height = conv_group(
conv_group_output, conv_group_num_channels, conv_group_height, seq_len_data = conv_group(
input=audio_data,
num_stacks=num_conv_layers,
index_range_datas=index_range_datas)
seq_len_data=seq_len_data,
masks=masks)
# convert data from convolution feature maps to a sequence of vectors
conv2seq = paddle.layer.block_expand(
input=conv_group_output,
num_channels=conv_group_num_channels,
stride_x=1,
stride_y=1,
block_x=1,
block_y=conv_group_height)
transpose = fluid.layers.transpose(conv_group_output, perm=[0, 3, 1, 2])
reshape_conv_output = fluid.layers.reshape(
x=transpose,
shape=[0, -1, conv_group_height * conv_group_num_channels],
inplace=False)
# remove padding part
remove_padding_data = paddle.layer.sub_seq(
input=conv2seq,
offsets=seq_offset_data,
sizes=seq_len_data,
act=paddle.activation.Linear(),
bias_attr=False)
# rnn group
seq_len_data = fluid.layers.reshape(seq_len_data, [-1])
sequence = fluid.layers.sequence_unpad(
x=reshape_conv_output, length=seq_len_data)
#rnn group
rnn_group_output = rnn_group(
input=remove_padding_data,
input=sequence,
size=rnn_size,
num_stacks=num_rnn_layers,
num_conv_layers=num_conv_layers,
use_gru=use_gru,
share_rnn_weights=share_rnn_weights)
fc = paddle.layer.fc(
fc = fluid.layers.fc(
input=rnn_group_output,
size=dict_size + 1,
act=paddle.activation.Linear(),
bias_attr=True)
# probability distribution with softmax
log_probs = paddle.layer.mixed(
input=paddle.layer.identity_projection(input=fc),
act=paddle.activation.Softmax())
# ctc cost
ctc_loss = paddle.layer.warp_ctc(
input=fc,
label=text_data,
size=dict_size + 1,
blank=dict_size,
norm_by_times=True)
return log_probs, ctc_loss
act=None,
param_attr=fluid.ParamAttr(
name='layer_{}'.format(num_conv_layers + num_rnn_layers) +
'_fc_weight'),
bias_attr=fluid.ParamAttr(
name='layer_{}'.format(num_conv_layers + num_rnn_layers) +
'_fc_bias'))
# probability distribution with softmax
log_probs = fluid.layers.softmax(fc)
log_probs.persistable = True
if not text_data:
return log_probs, None
else:
#ctc cost
ctc_loss = fluid.layers.warpctc(
input=fc, label=text_data, blank=dict_size, norm_by_times=True)
ctc_loss = fluid.layers.reduce_sum(ctc_loss)
return log_probs, ctc_loss
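For orientation, a minimal wiring sketch of the whole network follows. The real feed Variables are created elsewhere in the model code (outside this diff), so every name and shape below is an assumption for illustration only.

```python
import paddle.fluid as fluid

# Assumed feed layers; shapes are placeholders, not the repository's actual ones.
audio_data = fluid.layers.data(
    name='audio_data', shape=[161, -1], dtype='float32')         # freq x time
text_data = fluid.layers.data(
    name='text_data', shape=[1], dtype='int32', lod_level=1)     # CTC labels
seq_len_data = fluid.layers.data(
    name='seq_len_data', shape=[1], dtype='int64')               # valid frames
masks = fluid.layers.data(
    name='masks', shape=[32, 81, -1], dtype='float32')           # conv padding mask

log_probs, ctc_loss = deep_speech_v2_network(
    audio_data=audio_data,
    text_data=text_data,
    seq_len_data=seq_len_data,
    masks=masks,
    dict_size=28,             # illustrative vocabulary size
    num_conv_layers=2,
    num_rnn_layers=3,
    rnn_size=2048,
    use_gru=False,
    share_rnn_weights=True)
```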

@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz'
MD5=0ee83aa15fba421e5de8fc66c8feb350
TARGET=./aishell_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz'
MD5=2bf0cc8b6d5da2a2a787b5cc36a496b5
TARGET=./aishell_model_fluid.tar.gz
echo "Download Aishell model ..."

@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz'
MD5=5fe7639e720d51b3c3bdf7a1470c6272
TARGET=./baidu_en8k_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz'
MD5=7e58fbf64aa4ecf639b049792ddcf788
TARGET=./baidu_en8k_model_fluid.tar.gz
echo "Download BaiduEn8k model ..."

@ -2,9 +2,9 @@
. ../../utils/utility.sh
URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz'
MD5=1f72d0c5591f453362f0caa09dd57618
TARGET=./librispeech_model.tar.gz
URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz'
MD5=fafb11fe57c3ecd107147056453f5348
TARGET=./librispeech_model_fluid.tar.gz
echo "Download LibriSpeech model ..."

@ -1,4 +1,4 @@
scipy==0.13.1
scipy==1.2.1
resampy==0.1.5
SoundFile==0.9.0.post1
python_speech_features

@ -5,7 +5,7 @@ from __future__ import print_function
import argparse
import functools
import paddle.v2 as paddle
import paddle.fluid as fluid
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
from utils.error_rate import char_errors, word_errors
@ -15,10 +15,8 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 128, "Minibatch size.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
@ -64,17 +62,22 @@ args = parser.parse_args()
def evaluate():
"""Evaluate on whole test data for DeepSpeech2."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_proc_data,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.test_manifest,
batch_size=args.batch_size,
min_batch_size=1,
sortagrad=False,
shuffle_method=None)
@ -84,8 +87,9 @@ def evaluate():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.model_path)
# decoders only accept string encoded in utf-8
vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
@ -115,7 +119,7 @@ def evaluate():
cutoff_top_n=args.cutoff_top_n,
vocab_list=vocab_list,
num_processes=args.num_proc_bsearch)
target_transcripts = [data[1] for data in infer_data]
target_transcripts = infer_data[1]
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
@ -131,9 +135,6 @@ def evaluate():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
evaluate()

@ -9,14 +9,13 @@ function join_by { local IFS="$1"; shift; echo "$*"; }
for NUM_GPUS in 16 8 4 2 1
do
DEVICES=$(join_by , $(seq 0 $(($NUM_GPUS-1))))
BATCH_SIZE=$(($BATCH_SIZE_PER_GPU * $NUM_GPUS))
BATCH_SIZE=$(($BATCH_SIZE_PER_GPU))
CUDA_VISIBLE_DEVICES=$DEVICES \
python train.py \
--batch_size=$BATCH_SIZE \
--num_passes=1 \
--num_epoch=1 \
--test_off=True \
--trainer_count=$NUM_GPUS \
--min_duration=$MIN_DURATION \
--max_duration=$MAX_DURATION > tmp.log 2>&1
@ -24,7 +23,7 @@ do
exit 1
fi
cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$3}'
cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$2}'
rm tmp.log
done

@ -10,7 +10,7 @@ import argparse
import functools
import gzip
import logging
import paddle.v2 as paddle
import paddle.fluid as fluid
import _init_paths
from data_utils.data import DataGenerator
from model_utils.model import DeepSpeech2Model
@ -26,7 +26,6 @@ add_arg('batch_size', int, 256, "# of samples per batch.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('beam_size', int, 500, "Beam search width.")
add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
@ -77,13 +76,19 @@ def tune():
if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!")
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
data_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_proc_data,
keep_transcription_text=True)
keep_transcription_text=True,
place=place,
is_training=False)
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.tune_manifest,
@ -97,7 +102,8 @@ def tune():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.model_path,
place=place,
init_from_pretrain_model=args.model_path,
share_rnn_weights=args.share_rnn_weights)
# decoders only accept string encoded in utf-8
@ -109,8 +115,8 @@ def tune():
params_grid = [(alpha, beta) for alpha in cand_alphas
for beta in cand_betas]
err_sum = [0.0 for i in xrange(len(params_grid))]
err_ave = [0.0 for i in xrange(len(params_grid))]
err_sum = [0.0 for i in range(len(params_grid))]
err_ave = [0.0 for i in range(len(params_grid))]
num_ins, len_refs, cur_batch = 0, 0, 0
# initialize external scorer
ds2_model.init_ext_scorer(args.alpha_from, args.beta_from,
@ -123,7 +129,7 @@ def tune():
probs_split = ds2_model.infer_batch_probs(
infer_data=infer_data,
feeding_dict=data_generator.feeding)
target_transcripts = [ data[1] for data in infer_data ]
target_transcripts = infer_data[1]
num_ins += len(target_transcripts)
# grid search
@ -137,7 +143,6 @@ def tune():
cutoff_top_n=args.cutoff_top_n,
vocab_list=vocab_list,
num_processes=args.num_proc_bsearch)
for target, result in zip(target_transcripts, result_transcripts):
errors, len_ref = errors_func(target, result)
err_sum[index] += errors
@ -163,7 +168,7 @@ def tune():
# output WER/CER at every (alpha, beta)
print("\nFinal %s:\n" % args.error_rate_type)
for index in xrange(len(params_grid)):
for index in range(len(params_grid)):
print("(alpha, beta) = (%s, %s), [%s] = %f"
% ("%.3f" % params_grid[index][0], "%.3f" % params_grid[index][1],
args.error_rate_type, err_ave[index]))
@ -179,9 +184,6 @@ def tune():
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count)
tune()

@ -5,23 +5,25 @@ from __future__ import print_function
import argparse
import functools
import paddle.v2 as paddle
import io
from model_utils.model import DeepSpeech2Model
from data_utils.data import DataGenerator
from utils.utility import add_arguments, print_arguments
import paddle.fluid as fluid
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 256, "Minibatch size.")
add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
add_arg('num_passes', int, 200, "# of training epochs.")
add_arg('num_proc_data', int, 16, "# of CPUs for data preprocessing.")
add_arg('num_epoch', int, 200, "# of training epochs.")
add_arg('num_conv_layers', int, 2, "# of convolution layers.")
add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
add_arg('num_iter_print', int, 100, "Every # iterations for printing "
add_arg('num_iter_print', int, 100, "Every # batches to print "
"train cost.")
add_arg('save_epoch', int, 10, "Every # epochs to save checkpoint and model params.")
add_arg('num_samples', int, 10000, "The num of train samples.")
add_arg('learning_rate', float, 5e-4, "Learning rate.")
add_arg('max_duration', float, 27.0, "Longest audio duration allowed.")
add_arg('min_duration', float, 0.0, "Shortest audio duration allowed.")
@ -31,7 +33,12 @@ add_arg('use_gpu', bool, True, "Use GPU or not.")
add_arg('use_gru', bool, False, "Use GRUs instead of simple RNNs.")
add_arg('is_local', bool, True, "Use pserver or not.")
add_arg('share_rnn_weights',bool, True, "Share input-hidden weights across "
"bi-directional RNNs. Not for GRU.")
"bi-directional RNNs. Not for GRU.")
add_arg('init_from_pretrain_model',str,
None,
"If None, the training starts from scratch, "
"otherwise, it resumes from the pre-trained model.")
add_arg('train_manifest', str,
'data/librispeech/manifest.train',
"Filepath of train manifest.")
@ -44,10 +51,6 @@ add_arg('mean_std_path', str,
add_arg('vocab_path', str,
'data/librispeech/vocab.txt',
"Filepath of vocabulary.")
add_arg('init_model_path', str,
None,
"If None, the training starts from scratch, "
"otherwise, it resumes from the pre-trained model.")
add_arg('output_model_dir', str,
"./checkpoints/libri",
"Directory for saving checkpoints.")
@ -68,30 +71,33 @@ args = parser.parse_args()
def train():
"""DeepSpeech2 training."""
if args.use_gpu:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
train_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config=open(args.augment_conf_path, 'r').read(),
augmentation_config=io.open(args.augment_conf_path, mode='r', encoding='utf8').read(),
max_duration=args.max_duration,
min_duration=args.min_duration,
specgram_type=args.specgram_type,
num_threads=args.num_proc_data)
place=place)
dev_generator = DataGenerator(
vocab_filepath=args.vocab_path,
mean_std_filepath=args.mean_std_path,
augmentation_config="{}",
specgram_type=args.specgram_type,
num_threads=args.num_proc_data)
place=place)
train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest,
batch_size=args.batch_size,
min_batch_size=args.trainer_count,
sortagrad=args.use_sortagrad if args.init_model_path is None else False,
sortagrad=args.use_sortagrad if args.init_from_pretrain_model is None else False,
shuffle_method=args.shuffle_method)
dev_batch_reader = dev_generator.batch_reader_creator(
manifest_path=args.dev_manifest,
batch_size=args.batch_size,
min_batch_size=1, # must be 1, but will have errors.
sortagrad=False,
shuffle_method=None)
@ -101,27 +107,27 @@ def train():
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
use_gru=args.use_gru,
pretrained_model_path=args.init_model_path,
share_rnn_weights=args.share_rnn_weights)
share_rnn_weights=args.share_rnn_weights,
place=place,
init_from_pretrain_model=args.init_from_pretrain_model,
output_model_dir=args.output_model_dir)
ds2_model.train(
train_batch_reader=train_batch_reader,
dev_batch_reader=dev_batch_reader,
feeding_dict=train_generator.feeding,
learning_rate=args.learning_rate,
gradient_clipping=400,
num_passes=args.num_passes,
batch_size=args.batch_size,
num_samples=args.num_samples,
num_epoch=args.num_epoch,
save_epoch=args.save_epoch,
num_iterations_print=args.num_iter_print,
output_model_dir=args.output_model_dir,
is_local=args.is_local,
test_off=args.test_off)
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu,
rnn_use_batch=True,
trainer_count=args.trainer_count,
log_clipping=True)
train()

@ -36,15 +36,15 @@ def _levenshtein_distance(ref, hyp):
distance = np.zeros((2, n + 1), dtype=np.int32)
# initialize distance matrix
for j in xrange(n + 1):
for j in range(n + 1):
distance[0][j] = j
# calculate levenshtein distance
for i in xrange(1, m + 1):
for i in range(1, m + 1):
prev_row_idx = (i - 1) % 2
cur_row_idx = i % 2
distance[cur_row_idx][0] = i
for j in xrange(1, n + 1):
for j in range(1, n + 1):
if ref[i - 1] == hyp[j - 1]:
distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
else:
