diff --git a/README.md b/README.md
index 0dcf8b602..806070084 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # DeepSpeech2 on PaddlePaddle
 
-*DeepSpeech2 on PaddlePaddle* is an open-source implementation of end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf), with [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial application and academic research on speech recognition, via an easy-to-use, efficient and scalable implementation, including training, inference & testing module, distributed [PaddleCloud](https://github.com/PaddlePaddle/cloud) training, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
+*DeepSpeech2 on PaddlePaddle* is an open-source implementation of end-to-end Automatic Speech Recognition (ASR) engine, based on [Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf), with [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial application and academic research on speech recognition, via an easy-to-use, efficient and scalable implementation, including training, inference & testing module, and demo deployment. Besides, several pre-trained models for both English and Mandarin are also released.
 
 ## Table of Contents
 - [Installation](#installation)
@@ -10,7 +10,6 @@
 - [Data Augmentation Pipeline](#data-augmentation-pipeline)
 - [Inference and Evaluation](#inference-and-evaluation)
 - [Running in Docker Container](#running-in-docker-container)
-- [Distributed Cloud Training](#distributed-cloud-training)
 - [Hyper-parameters Tuning](#hyper-parameters-tuning)
 - [Training for Mandarin Language](#training-for-mandarin-language)
 - [Trying Live Demo with Your Own Voice](#trying-live-demo-with-your-own-voice)
@@ -22,13 +21,45 @@
 
 ## Installation
 
-For this project was developed in PaddlePaddle V2 API, which is not maintained officially any more, we only support [running it in Docker container](#running-in-docker-container), instead of building environment from source code. And we are going to release the update to the latest Paddle Fluid API very soon, please keep an eye on this project.
+To avoid the trouble of environment setup, [running in Docker container](#running-in-docker-container) is highly recommended. Otherwise follow the guidelines below to install the dependencies manually.
+
+### Prerequisites
+- Python 2.7 only supported
+- PaddlePaddle 1.6 version (Coming soon ...)
+
+### Setup
+- Make sure these libraries or tools installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost` and `swig`, e.g. installing them via `apt-get`:
+
+```bash
+sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig
+```
+
+or, installing them via `yum`:
+
+```bash
+sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel
+wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
+xz -d flac-1.3.1.tar.xz
+tar -xvf flac-1.3.1.tar
+cd flac-1.3.1
+./configure
+make
+make install
+```
+
+- Run the setup script for the remaining dependencies
+
+```bash
+git clone https://github.com/PaddlePaddle/DeepSpeech.git
+cd DeepSpeech
+sh setup.sh
+```
 
 ## Getting Started
 
 Several shell scripts provided in `./examples` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference and model evaluation, with a few public dataset (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your own data.
 
-Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES` and `--trainer_count`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if out-of-memory problem occurs, just reduce `--batch_size` to fit.
+Some of the scripts in `./examples` are configured with 8 GPUs. If you don't have 8 GPUs available, please modify `CUDA_VISIBLE_DEVICES`. If you don't have any GPU available, please set `--use_gpu` to False to use CPUs instead. Besides, if out-of-memory problem occurs, just reduce `--batch_size` to fit.
 
 Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org/12/) for instance.
 
@@ -45,7 +76,7 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
     sh run_data.sh
     ```
 
-    `run_data.sh` will download dataset, generate manifests, collect normalizer's statistics and build vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `~/.cache/paddle/dataset/speech/libri` and the corresponding manifest files generated in `./data/tiny` as well as a mean stddev file and a vocabulary file. It has to be run for the very first time you run this dataset and is reusable for all further experiments.
+    `run_data.sh` will download dataset, generate manifests, collect normalizer's statistics and build vocabulary. Once the data preparation is done, you will find the data (only part of LibriSpeech) downloaded in `./dataset/librispeech` and the corresponding manifest files generated in `./data/tiny` as well as a mean stddev file and a vocabulary file. It has to be run for the very first time you run this dataset and is reusable for all further experiments.
 - Train your own ASR model
 
     ```bash
@@ -139,20 +170,20 @@ python tools/build_vocab.py --help
 - Start training from scratch with 8 GPUs:
 
     ```
-    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --trainer_count 8
+    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py
     ```
 
-- Start training from scratch with 16 CPUs:
+- Start training from scratch with CPUs:
 
     ```
-    python train.py --use_gpu False --trainer_count 16
+    python train.py --use_gpu False
     ```
 - Resume training from a checkpoint:
 
     ```
     CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
     python train.py \
-    --init_model_path CHECKPOINT_PATH_TO_RESUME_FROM
+    --init_from_pretrained_model CHECKPOINT_PATH_TO_RESUME_FROM
     ```
 
 For more help on arguments:
@@ -162,6 +193,7 @@ python train.py --help
 ```
 or refer to `example/librispeech/run_train.sh`.
 
+
 ## Data Augmentation Pipeline
 
 Data augmentation has often been a highly effective technique to boost the deep learning performance. We augment our speech data by synthesizing new audios with small random perturbation (label-invariant transformation) added upon raw audios. You don't have to do the syntheses on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
@@ -206,8 +238,8 @@ A language model is required to improve the decoder's performance. We have prepa
 
 ```bash
 cd models/lm
-sh download_lm_en.sh
-sh download_lm_ch.sh
+bash download_lm_en.sh
+bash download_lm_ch.sh
 ```
 
 If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips to show how we preparing our English and Mandarin language models. You can take it as a reference when you train your own.
@@ -216,7 +248,7 @@ If you wish to train your own better language model, please refer to [KenLM](htt
 
 The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training:
 
-  * Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand.
+  * Characters not in \['A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand.
   * Repeated whitespace characters are squeezed to one and the beginning whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
   * Top 400,000 most frequent words are selected to build the vocabulary and the rest are replaced with 'UNKNOWNWORD'.
 
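Reviewer note: the English-corpus preprocessing steps listed in the hunk above (character filtering, whitespace squeezing, lowercasing, vocabulary capping) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script the project actually used: the names `normalize_line`, `build_vocab` and `replace_rare` are hypothetical, the conversion of Arabic numbers to English words is left out, and the sketch strips trailing as well as leading whitespace.

```python
import re
from collections import Counter

# Assumption: after lowercasing, keep only letters, digits, whitespace
# and apostrophes, mirroring the character class quoted in the README.
_DISALLOWED = re.compile(r"[^a-z0-9\s']")
_WHITESPACE = re.compile(r"\s+")


def normalize_line(line):
    """Lowercase, drop disallowed characters and squeeze whitespace."""
    line = line.lower()
    line = _DISALLOWED.sub("", line)
    line = _WHITESPACE.sub(" ", line)
    # strip() removes leading and trailing whitespace; the README only
    # mentions the leading part.
    return line.strip()


def build_vocab(lines, max_size):
    """Return the set of the max_size most frequent words."""
    counts = Counter(word for line in lines for word in line.split())
    return {word for word, _ in counts.most_common(max_size)}


def replace_rare(line, vocab, unk="UNKNOWNWORD"):
    """Replace every out-of-vocabulary word with the unknown-word token."""
    return " ".join(w if w in vocab else unk for w in line.split())


if __name__ == "__main__":
    corpus = [normalize_line(s) for s in ["A b, a!", "a   C b"]]
    vocab = build_vocab(corpus, max_size=2)
    print([replace_rare(line, vocab) for line in corpus])
```

For the README's setting, `max_size` would be 400000; the tiny value in the usage example only keeps the demonstration readable.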
@@ -239,13 +271,13 @@ An inference module caller `infer.py` is provided to infer, decode and visualize
 - Inference with GPU:
 
     ```bash
-    CUDA_VISIBLE_DEVICES=0 python infer.py --trainer_count 1
+    CUDA_VISIBLE_DEVICES=0 python infer.py
     ```
 
 - Inference with CPUs:
 
     ```bash
-    python infer.py --use_gpu False --trainer_count 12
+    python infer.py --use_gpu False
     ```
 
 We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) otherwise utilizes a heuristic breadth-first graph search for reaching a near global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with argument `--decoding_method`.
@@ -264,13 +296,13 @@ To evaluate a model's performance quantitatively, please run:
 - Evaluation with GPUs:
 
     ```bash
-    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py --trainer_count 8
+    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py
     ```
 
 - Evaluation with CPUs:
 
     ```bash
-    python test.py --use_gpu False --trainer_count 12
+    python test.py --use_gpu False
     ```
 
 The error rate (default: word error rate; can be set with `--error_rate_type`) will be printed.
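Reviewer note: for readers unfamiliar with the two ideas these hunks rely on, here is a minimal sketch of best-path (greedy) CTC decoding and of the word error rate. It is an illustration only, not the code in `infer.py` or `test.py`. The function names are hypothetical, and placing the blank label at index `len(vocabulary)` is an assumption of this sketch.

```python
def ctc_greedy_decode(probs, vocabulary, blank_index=None):
    """Best-path CTC decoding.

    probs: one list of class probabilities per timestep.
    vocabulary: list of output characters.
    blank_index: index of the CTC blank (assumed to be the last class).
    """
    if blank_index is None:
        blank_index = len(vocabulary)
    # Pick the most likely class at every timestep.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]
    decoded = []
    previous = None
    for index in best_path:
        # Collapse consecutive repeats, then drop blanks.
        if index != previous and index != blank_index:
            decoded.append(vocabulary[index])
        previous = index
    return "".join(decoded)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance computed with one rolling row.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1] / float(len(ref_words))


if __name__ == "__main__":
    frames = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
    print(ctc_greedy_decode(frames, ["a", "b"]))
    print(word_error_rate("the cat sat", "the bat sat down"))
```

The beam search decoder differs in that it keeps several candidate prefixes per timestep and rescores them with the language model, which this sketch does not attempt.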
@@ -293,7 +325,6 @@ The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertio
     ```bash
     CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
     python tools/tune.py \
-    --trainer_count 8 \
     --alpha_from 1.0 \
     --alpha_to 3.2 \
     --num_alphas 45 \
@@ -332,7 +363,7 @@ Take several steps to launch the Docker image:
 - Download the Docker image
 
 ```bash
-nvidia-docker pull paddlepaddle/deep_speech:latest-gpu
+nvidia-docker pull hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu
 ```
 
 - Clone this repository
@@ -344,72 +375,10 @@ git clone https://github.com/PaddlePaddle/DeepSpeech.git
 - Run the Docker image
 
 ```bash
-sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech paddlepaddle/deep_speech:latest-gpu /bin/bash
+sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash
 ```
 Now go back and start from the [Getting Started](#getting-started) section, you can execute training, inference and hyper-parameters tuning similarly in the Docker container.
 
-## Distributed Cloud Training
-
-We also provide a cloud training module for users to do the distributed cluster training on [PaddleCloud](https://github.com/PaddlePaddle/cloud), to achieve a much faster training speed with multiple machines. To start with this, please first install PaddleCloud client and register a PaddleCloud account, as described in [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud).
-
-Please take the following steps to submit a training job:
-
-- Go to directory:
-
-    ```bash
-    cd cloud
-    ```
-- Upload data:
-
-    Data must be uploaded to PaddleCloud filesystem to be accessed within a cloud job. `pcloud_upload_data.sh` helps do the data packing and uploading:
-
-    ```bash
-    sh pcloud_upload_data.sh
-    ```
-
-    Given input manifests, `pcloud_upload_data.sh` will:
-
-    - Extract the audio files listed in the input manifests.
-    - Pack them into a specified number of tar files.
-    - Upload these tar files to PaddleCloud filesystem.
-    - Create cloud manifests by replacing local filesystem paths with PaddleCloud filesystem paths. New manifests will be used to inform the cloud jobs of audio files' location and their meta information.
-
-    It should be done only once for the very first time to do the cloud training. Later, the data is kept persisitent on the cloud filesystem and reusable for further job submissions.
-
-    For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
-
- - Configure training arguments:
-
-    Configure the cloud job parameters in `pcloud_submit.sh` (e.g. `NUM_NODES`, `NUM_GPUS`, `CLOUD_TRAIN_DIR`, `JOB_NAME` etc.) and then configure other hyper-parameters for training in `pcloud_train.sh` (just as what you do for local training).
-
-    For argument details please refer to [Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
-
- - Submit the job:
-
-    By running:
-
-    ```bash
-    sh pcloud_submit.sh
-    ```
-    a training job has been submitted to PaddleCloud, with the job name printed to the console.
-
- - Get training logs
-
-    Run this to list all the jobs you have submitted, as well as their running status:
-
-    ```bash
-    paddlecloud get jobs
-    ```
-
-    Run this, the corresponding job's logs will be printed.
-    ```bash
-    paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
-    ```
-
-For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
-
-For more information about the DeepSpeech2 training on PaddleCloud, please refer to
-[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud).
 
 ## Training for Mandarin Language
 
@@ -417,14 +386,13 @@ The key steps of training for Mandarin language are same to that of English lang
 
 ## Trying Live Demo with Your Own Voice
 
-Until now, an ASR model is trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files. But it is not yet tested with your own speech. `deploy/demo_server.py` and `deploy/demo_client.py` helps quickly build up a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo, with your own voice.
+Until now, an ASR model is trained and tested qualitatively (`infer.py`) and quantitatively (`test.py`) with existing audio files. But it is not yet tested with your own speech. `deploy/demo_english_server.py` and `deploy/demo_client.py` helps quickly build up a real-time demo ASR engine with the trained model, enabling you to test and play around with the demo, with your own voice.
 
 To start the demo's server, please run this in one console:
 
 ```bash
 CUDA_VISIBLE_DEVICES=0 \
 python deploy/demo_server.py \
---trainer_count 1 \
 --host_ip localhost \
 --host_port 8086
 ```
@@ -436,7 +404,7 @@ For example, on MAC OS X:
 ```bash
 brew install portaudio
 pip install pyaudio
-pip install pynput
+pip install keyboard
 ```
 
 Then to start the client, please run this in another console:
@@ -452,7 +420,7 @@ Now, in the client console, press the `whitespace` key, hold, and start speaking
 
 Notice that `deploy/demo_client.py` must be run on a machine with a microphone device, while `deploy/demo_server.py` could be run on one without any audio recording hardware, e.g. any remote server machine. Just be careful to set the `host_ip` and `host_port` argument with the actual accessible IP address and port, if the server and client are running with two separate machines. Nothing should be done if they are running on one single machine.
 
-Please also refer to `examples/mandarin/run_demo_server.sh`, which will first download a pre-trained Mandarin model (trained with 3000 hours of internal speech data) and then start the demo server with the model. With running `examples/mandarin/run_demo_client.sh`, you can speak Mandarin to test it. If you would like to try some other models, just update `--model_path` argument in the script.  
+Please also refer to `examples/deploy_demo/run_english_demo_server.sh`, which will first download a pre-trained English model (trained with 3000 hours of internal speech data) and then start the demo server with the model. With running `examples/mandarin/run_demo_client.sh`, you can speak English to test it. If you would like to try some other models, just update `--model_path` argument in the script.  
 
 For more help on arguments:
 
@@ -467,10 +435,10 @@ python deploy/demo_client.py --help
 
 Language | Model Name | Training Data | Hours of Speech
 :-----------: | :------------: | :----------: | -------:
-English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
-English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz) | Baidu Internal English Dataset | 8628 h
-Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
-Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
+English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h
+English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz) | Baidu Internal English Dataset | 8628 h
+Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h
+Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model_fluid.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h
 
 #### Language Model Released
 
@@ -504,17 +472,16 @@ Baidu Internal Testset | 12.64
 
 #### Acceleration with Multi-GPUs
 
-We compare the training time with 1, 2, 4, 8, 16 Tesla K40m GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). And it shows that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
+We compare the training time with 1, 2, 4, 8 Tesla V100 GPUs (with a subset of LibriSpeech samples whose audio durations are between 6.0 and 7.0 seconds). And it shows that a **near-linear** acceleration with multiple GPUs has been achieved. In the following figure, the time (in seconds) cost for training is printed on the blue bars.
 
 | # of GPU | Acceleration Rate |
 | -------- | --------------: |
 | 1 | 1.00 X |
-| 2 | 1.97 X |
-| 4 | 3.74 X |
-| 8 | 6.21 X |
-|16 | 10.70 X |
+| 2 | 1.98 X |
+| 4 | 3.73 X |
+| 8 | 6.95 X |
 
 `tools/profile.sh` provides such a profiling tool.
 
diff --git a/README_cn.md b/README_cn.md
index 06bee58bf..90ae6f48c 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -1,17 +1,16 @@
-# DeepSpeech2
+# 语音识别: DeepSpeech2
 
-*DeepSpeech2* 是一个采用[PaddlePaddle](https://github.com/PaddlePaddle/Paddle)平台的端到端自动语音识别(ASR)引擎的开源项目,具体原理请参考这篇论文[Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf)。
-我们的愿景是为语音识别在工业应用和学术研究上,提供易于使用、高效和可扩展的工具,包括训练,推理,测试模块,以及分布式的[PaddleCloud](https://github.com/PaddlePaddle/cloud)训练和demo部署。同时,我们还将发布一些预训练好的英语和普通话模型。
+*DeepSpeech2*是一个采用[PaddlePaddle](https://github.com/PaddlePaddle/Paddle)平台的端到端自动语音识别(ASR)引擎的开源项目,具体原理参考这篇论文[Baidu's Deep Speech 2 paper](http://proceedings.mlr.press/v48/amodei16.pdf)。
+我们的愿景是为语音识别在工业应用和学术研究上,提供易于使用、高效和可扩展的工具,包括训练,推理,测试模块,以及 demo 部署。同时,我们还将发布一些预训练好的英语和普通话模型。
 
 ## 目录
 - [安装](#安装)
 - [开始](#开始)
 - [数据准备](#数据准备)
 - [训练模型](#训练模型)
-- [数据增强管道](#数据增强管道)
+- [数据增强流水线](#数据增强流水线)
 - [推断和评价](#推断和评价)
-- [在Docker容器上运行](#在Docker容器上运行)
-- [分布式云训练](#分布式云训练)
+- [在 Docker 容器上运行](#在Docker容器上运行)
 - [超参数调整](#超参数调整)
 - [训练汉语语言](#训练汉语语言)
 - [用自己的声音尝试现场演示](#用自己的声音尝试现场演示)
@@ -20,45 +19,76 @@
 - [问题和帮助](#问题和帮助)
 
 ## 安装
+为了避免环境配置问题,强烈建议在[Docker容器上运行](#在Docker容器上运行),否则请按照下面的指南安装依赖项。
 
-因该项目基于 PaddlePaddle V2 API 开发,其已不再被官方维护,目前我们仅支持 [在 Docker 容器中运行该项目](#在Docker容器上运行),而不支持从源码构建环境。我们很快会将这个项目升级到最新的 Paddle Fluid API,请保持关注。
+### 前提
+- 只支持Python 2.7
+- PaddlePaddle 1.6 版本(即将发布)
+
+### 安装
+- 请确保以下库或工具已安装完毕:`pkg-config`, `flac`, `ogg`, `vorbis`, `boost` 和 `swig`, 如可以通过`apt-get`安装:
+
+```bash
+sudo apt-get install -y pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig
+```
+
+或者,也可以通过`yum`安装:
+
+```bash
+sudo yum install pkgconfig libogg-devel libvorbis-devel boost-devel
+wget https://ftp.osuosl.org/pub/xiph/releases/flac/flac-1.3.1.tar.xz
+xz -d
flac-1.3.1.tar.xz +tar -xvf flac-1.3.1.tar +cd flac-1.3.1 +./configure +make +make install +``` + +- 运行脚本安装其余的依赖项 + +```bash +git clone https://github.com/PaddlePaddle/DeepSpeech.git +cd DeepSpeech +sh setup.sh +``` ## 开始 -`./examples`里的一些shell脚本将帮助我们在一些公开数据集(比如:[LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)) 进行快速尝试,包括了数据准备,模型训练,案例推断和模型评价。阅读这些例子将帮助你理解如何应用你的数据集。 +`./examples`里的一些 shell 脚本将帮助我们在一些公开数据集(比如:[LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)) 进行快速尝试,包括了数据准备,模型训练,案例推断和模型评价。阅读这些例子将帮助你理解如何使用你的数据集训练模型。 -`./examples`目录中的一些脚本配置使用了8个GPU。如果你没有8个可用的GPU,请修改`CUDA_VISIBLE_DEVICES`和`--trainer_count`。如果你没有可用的GPU,请设置`--use_gpu`为False,这样程序会用CPU代替GPU。另外如果发生内存不足的问题,减小`--batch_size`即可。 +`./examples`目录中的一些脚本配置使用了 8 个 GPU。如果你没有 8 个可用的 GPU,请修改环境变量`CUDA_VISIBLE_DEVICES`。如果你没有可用的 GPU,请设置`--use_gpu`为 False,这样程序会用 CPU 代替 GPU。另外如果发生内存不足的问题,减小`--batch_size`即可。 让我们先看看[LibriSpeech dataset](http://www.openslr.org/12/)小样本集的例子。 -- 转到目录 +- 进入目录 ```bash cd examples/tiny ``` - 注意这仅仅是LibriSpeech一个小数据集的例子。如果你想尝试完整的数据集(可能需要花好几天来训练模型),请使用这个路径`examples/librispeech`。 + 注意这仅仅是 LibriSpeech 一个小数据集的例子。如果你想尝试完整的数据集(可能需要花好几天来训练模型),请使用这个路径`examples/librispeech`。 - 准备数据 ```bash sh run_data.sh ``` - 运行`run_data.sh`脚本将会下载数据集,产出manifests文件,收集一些归一化需要的统计信息并建立词表。当数据准备完成之后,下载完的数据(仅有LibriSpeech一部分)在`~/.cache/paddle/dataset/speech/libri`中;其对应的manifest文件,均值标准差和词表文件在`./data/tiny`中。在第一次执行的时候一定要执行这个脚本,在接下来所有的实验中我们都会用到这个数据集。 -- 训练你自己的ASR模型 + 运行`run_data.sh`脚本将会下载数据集,产出 manifests 文件,收集一些归一化需要的统计信息并建立词表。当数据准备完成之后,下载完的数据(仅有 LibriSpeech 一部分)在`dataset/librispeech`中;其对应的 manifest 文件,均值标准差和词表文件在`./data/tiny`中。在第一次执行的时候一定要执行这个脚本,在接下来所有的实验中我们都会用到这个数据集。 +- 训练你自己的 ASR 模型 ```bash sh run_train.sh ``` - `run_train.sh`将会启动训练任务,训练日志会打印到stdout,并且模型每个时期(epoch)的检查点都会保存到`./checkpoints/tiny`目录中。这些检查点可以用来恢复训练,推断,评价和部署。 + `run_train.sh`将会启动训练任务,训练日志会打印到终端,并且模型每个 epoch 的 checkpoint 都会保存到`./checkpoints/tiny`目录中。这些 checkpoint 可以用来恢复训练,推断,评价和部署。 - 用已有的模型进行案例推断 ```bash sh 
run_infer.sh ``` - `run_infer.sh`将会利用训完的模型展现一些(默认10个)样本语音到文本的解码结果。由于当前模型只使用了LibriSpeech一部分数据集训练,因此性能可能不会太好。为了看到更好模型上的表现,你可以下载一个已训练好的模型(用完整的LibriSpeech训练了好几天)来做推断。 + `run_infer.sh`将会利用训练好的模型展现一些(默认 10 个)样本语音到文本的解码结果。由于当前模型只使用了 LibriSpeech 一部分数据集训练,因此性能可能不会太好。为了看到更好模型上的表现,你可以下载一个已训练好的模型(用完整的 LibriSpeech 训练了好几天)来做推断。 ```bash sh run_infer_golden.sh @@ -75,27 +105,27 @@ sh run_test_golden.sh ``` -更多细节会在接下来的章节中阐述。祝你在*语音识别: DeepSpeech2*ASR引擎学习中过得愉快! +更多细节会在接下来的章节中阐述。祝你在*DeepSpeech2*ASR引擎学习中过得愉快! ## 数据准备 ### 生成Manifest -*语音识别: DeepSpeech2*接受文本**manifest**文件作为数据接口。manifest文件包含了一系列语音数据,其中每一行代表一个json格式的音频元数据(比如文件路径,描述,时长)。具体格式如下: +*DeepSpeech2*接受文本**manifest**文件作为数据接口。manifest 文件包含了一系列语音数据,其中每一行代表一个[JSON](http://www.json.org/)格式的音频元数据(比如文件路径,描述,时长)。具体格式如下: ``` {"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"} {"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"} ``` -如果你要使用自定义数据,你只需要按照以上格式生成自己的manifest文件即可。训练,推断以及其他所有模块都能够根据manifest文件获取到音频数据,包括他们的元数据。 +如果你要使用自定义数据,你只需要按照以上格式生成自己的 manifest 文件即可。给定 manifest 文件,训练、推断以及其它所有模块都能够访问到音频数据以及对应的时长和标签数据。 -关于如何生成manifest文件,请参考`data/librispeech/librispeech.py`。该脚本将会下载LibriSpeech数据集并生成manifest文件。 +关于如何生成 manifest 文件,请参考`data/librispeech/librispeech.py`。该脚本将会下载 LibriSpeech 数据集并生成 manifest 文件。 ### 计算均值和标准差用于归一化 -为了对音频特征进行z-score归一化(零均值,单位标准差),我们必须预估一些训练样本特征的均值和标准差: +为了对音频特征进行 z-score 归一化(零均值,单位标准差),我们必须预估训练样本特征的均值和标准差: ```bash python tools/compute_mean_std.py \ @@ -105,11 +135,11 @@ python tools/compute_mean_std.py \ --output_path data/librispeech/mean_std.npz ``` -以上这段代码会计算在`data/librispeech/manifest.train`路径中,2000个随机采样音频剪辑的功率谱特征均值和标准差,并将结果保存在`data/librispeech/mean_std.npz`中,方便以后使用。 +以上这段代码会计算在`data/librispeech/manifest.train`路径中,2000 个随机采样的语音频谱特征的均值和标准差,并将结果保存在`data/librispeech/mean_std.npz`中,方便以后使用。 ### 建立词表 
-转换录音为索引用于训练,解码,再将一系列索引转换为文本等操作需要一个可能会出现字符集合的词表。`tools/build_vocab.py`脚本将生成这种基于字符的词表。 +我们需要一个包含可能会出现的字符集合的词表来在训练的时候将字符转换成索引,并在解码的时候将索引转换回文本。`tools/build_vocab.py`脚本将生成这种基于字符的词表。 ```bash python tools/build_vocab.py \ @@ -118,7 +148,7 @@ python tools/build_vocab.py \ --manifest_paths data/librispeech/manifest.train ``` -他将`data/librispeech/manifest.train`目录中的所有录音文本写入词表文件`data/librispeeech/eng_vocab.txt`,并且没有词汇截断(`--count_threshold 0`)。 +它将`data/librispeech/manifest.train`目录中的所有录音文本写入词表文件`data/librispeeech/eng_vocab.txt`,并且没有词汇截断(`--count_threshold 0`)。 ### 更多帮助 @@ -134,16 +164,16 @@ python tools/build_vocab.py --help `train.py`是训练模块的主要调用者。使用示例如下。 -- 开始使用8片GPU训练: +- 开始使用 8 片 GPU 训练: ``` - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py --trainer_count 8 + CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py ``` -- 开始使用16片GPU训练: +- 开始使用 CPU 训练: ``` - python train.py --use_gpu False --trainer_count 16 + python train.py --use_gpu False ``` - 从检查点恢复训练: @@ -151,7 +181,7 @@ python tools/build_vocab.py --help ``` CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python train.py \ - --init_model_path CHECKPOINT_PATH_TO_RESUME_FROM + --init_from_pretrained_model CHECKPOINT_PATH_TO_RESUME_FROM ``` 获得更多帮助: @@ -161,11 +191,12 @@ python train.py --help ``` 或参考 `example/librispeech/run_train.sh`. 
-## 数据增强管道 -数据增强是用来提升深度学习性能的非常有效的技术。我们通过在原始音频中添加小随机扰动(标签不变转换)获得新音频来增强我们的语音数据。你不必自己合成,因为数据增强已经嵌入到数据提供者中,能在训练模型时每个epoch中随机的合成音频。 +## 数据增强流水线 + +数据增强是用来提升深度学习性能的非常有效的技术。我们通过在原始音频中添加小的随机扰动(标签不变转换)获得新音频来增强我们的语音数据。你不必自己合成,因为数据增强已经嵌入到数据生成器中并且能够即时完成,在训练模型的每个epoch中随机合成音频。 -目前提供六个可选的增强组件供选择,配置并插入处理流水线。 +目前提供六个可选的增强组件供选择,配置并插入处理过程。 - 音量扰动 - 速度扰动 @@ -191,11 +222,11 @@ python train.py --help }] ``` -当`trainer.py`的`--augment_conf_file`参数被设置为上述示例配置文件的路径时,每个epoch中的每个音频片段都将被处理。首先,均匀随机采样速率会有60%的概率在0.95和1.05之间对音频片段进行速度扰动。然后,音频片段有80%的概率在时间上被挪移,挪移偏差值是-5毫秒和5毫秒之间的随机采样。最后,这个新合成的音频片段将被传送给特征提取器,以用于接下来的训练。 +当`trainer.py`的`--augment_conf_file`参数被设置为上述示例配置文件的路径时,每个 epoch 中的每个音频片段都将被处理。首先,均匀随机采样速率会有60%的概率在 0.95 和 1.05 之间对音频片段进行速度扰动。然后,音频片段有 80% 的概率在时间上被挪移,挪移偏差值是 -5 毫秒和 5 毫秒之间的随机采样。最后,这个新合成的音频片段将被传送给特征提取器,以用于接下来的训练。 有关其他配置实例,请参考`conf/augmenatation.config.example`. -使用数据增强技术时要小心,由于扩大了训练和测试集的差异,不恰当的增强会对训练模型不利。 +使用数据增强技术时要小心,由于扩大了训练和测试集的差异,不恰当的增强会对训练模型不利,导致训练和预测的差距增大。 ## 推断和评价 @@ -205,50 +236,50 @@ python train.py --help ```bash cd models/lm -sh download_lm_en.sh -sh download_lm_ch.sh +bash download_lm_en.sh +bash download_lm_ch.sh ``` -如果你想训练自己更好的语言模型,请参考[KenLM](https://github.com/kpu/kenlm)获取教程。在这里,我们提供一些技巧来展示我们如何准备我们的英语和普通话模型。开始训练的时候,你可以参考这些技巧。 +如果你想训练自己更好的语言模型,请参考[KenLM](https://github.com/kpu/kenlm)获取教程。在这里,我们提供一些技巧来展示我们如何准备我们的英语和普通话模型。当你训练自己的模型的时候,可以参考这些技巧。 #### 英语语言模型 -英语语料库来自[Common Crawl Repository](http://commoncrawl.org),您可以从[statmt](http://data.statmt.org/ngrams/deduped_en)下载它。我们使用en.00部分来训练我们的英语语言模型。训练前有一些预处理步骤如下: +英语语料库来自[Common Crawl Repository](http://commoncrawl.org),你可以从[statmt](http://data.statmt.org/ngrams/deduped_en)下载它。我们使用en.00部分来训练我们的英语语言模型。训练前有如下的一些预处理过程: - * 不在\[A-Za-z0-9\s'\](\s表示空白字符)中的字符将被删除,阿拉伯数字被转换为英文数字,比如“1000”转换为one thousand。 + * 不在\['A-Za-z0-9\s'\](\s表示空白字符)中的字符将被删除,阿拉伯数字被转换为英文数字,比如“1000”转换为 one thousand。 * 重复的空白字符被压缩为一个,并且开始的空白字符将被删除。请注意,所有的录音都是小写字母,因此所有字符都转换为小写字母。 - * 选择前40万个最常用的单词来建立词表,其余部分将被替换为“UNKNOWNWORD”。 + * 选择前 40 万个最常用的单词来建立词表,其余部分将被替换为“UNKNOWNWORD”。 
-现在预处理完成了,我们得到一个干净的语料库来训练语言模型。我们发布的语言模型版本使用了参数“-o 5 --prune 0 1 1 1 1”来训练。“-o 5”表示语言模型的最大order为5。“--prune 0 1 1 1 1”表示每个order的计数阈值,更具体地说,它将第2个以及更高的order修剪为单个。为了节省磁盘存储空间,我们将使用参数“-a 22 -q 8 -b 8”将arpa文件转换为“trie”二进制文件。“-a”表示在“trie”中用于切分的指针的最高位数。“-q -b”是概率和退避的量化参数。 +现在预处理完成了,我们得到一个干净的语料库来训练语言模型。我们发布的语言模型版本使用了参数“-o 5 --prune 0 1 1 1 1”来训练。“-o 5”表示语言模型的最大order为 5。“--prune 0 1 1 1 1”表示每个 order 的计数阈值,更具体地说,它将第 2 个以及更高的 order 修剪为单个。为了节省磁盘存储空间,我们将使用参数“-a 22 -q 8 -b 8”将 arpa 文件转换为“trie”二进制文件。“-a”表示在“trie”中用于切分的指针的最高位数。“-q -b”是概率和退避的量化参数。 #### 普通话语言模型 -与英语语言模型不同的是,普通话语言模型是基于字符的,其中每一位都是中文汉字。我们使用内部语料库来训练发布的汉语语言模型。该语料库包含数十亿汉字。预处理阶段与英语语言模型差别很小,主要步骤包括: +与英语语言模型不同的是,普通话语言模型是基于字符的,其中每一位都是中文汉字。我们使用内部语料库来训练发布的汉语语言模型。该语料库包含数十亿汉字。预处理阶段与英语语言模型有一些小的差别,主要步骤包括: * 删除开始和结尾的空白字符。 * 删除英文标点和中文标点。 * 在两个字符之间插入空白字符。 -请注意,发布的语言模型只包含中文简体字。预处理完成后,我们开始训练语言模型。这个小的语言模型训练关键参数是“-o 5 --prune 0 1 2 4 4”,“-o 5”是针对大语言模型。请参考上面的部分了解每个参数的含义。我们还使用默认设置将arpa文件转换为二进制文件。 +请注意,发布的语言模型只包含中文简体字。预处理完成后,我们开始训练语言模型。这个小的语言模型训练关键参数是“-o 5 --prune 0 1 2 4 4”,“-o 5”是针对大语言模型。请参考上面的部分了解每个参数的含义。我们还使用默认设置将 arpa 文件转换为二进制文件。 ### 语音到文本推断 -推断模块调用者为`infer.py`,可以用来推断,解码,以及给一些给定音频剪辑进行可视化语音到文本的结果。这有助于对ASR模型的性能进行直观和定性的评估。 +推断模块使用`infer.py`进行调用,可以用来推断,解码,以及输出一些给定音频片段可视化到文本的结果。这有助于对ASR模型的性能进行直观和定性的评估。 -- GPU版本的推断: +- GPU 版本的推断: ```bash - CUDA_VISIBLE_DEVICES=0 python infer.py --trainer_count 1 + CUDA_VISIBLE_DEVICES=0 python infer.py ``` -- CPU版本的推断: +- CPU 版本的推断: ```bash - python infer.py --use_gpu False --trainer_count 12 + python infer.py --use_gpu False ``` -我们提供两种类型的CTC解码器:*CTC贪心解码器*和*CTC波束搜索解码器*。*CTC贪心解码器*是简单的最佳路径解码算法的实现,在每个时间步选择最可能的字符,因此是贪心的并且是局部最优的。[*CTC波束搜索解码器*](https://arxiv.org/abs/1408.2873)另外使用了启发式广度优先图搜索以达到近似全局最优; 它也需要预先训练的KenLM语言模型以获得更好的评分和排名。解码器类型可以用参数`--decoding_method`设置。 +我们提供两种类型的 CTC 解码器:*CTC贪心解码器*和*CTC波束搜索解码器*。*CTC贪心解码器*是简单的最佳路径解码算法的实现,在每个时间步选择最可能的字符,因此是贪心的并且是局部最优的。[*CTC波束搜索解码器*](https://arxiv.org/abs/1408.2873)另外使用了启发式广度优先图搜索以达到近似全局最优; 它也需要预先训练的KenLM语言模型以获得更好的评分和排名。解码器类型可以用参数`--decoding_method`设置。 获得更多帮助: @@ -261,16 
+292,16 @@ python infer.py --help 要定量评估模型的性能,请运行: -- 带GPU版本评估 +- GPU 版本评估 ```bash - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py --trainer_count 8 + CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python test.py ``` -- CPU版本评估 +- CPU 版本评估 ```bash - python test.py --use_gpu False --trainer_count 12 + python test.py --use_gpu False ``` 错误率(默认:误字率;可以用--error_rate_type设置)将被打印出来。 @@ -286,14 +317,13 @@ python test.py --help [*CTC波束搜索解码器*](https://arxiv.org/abs/1408.2873)的超参数$\alpha$(语言模型权重)和$\beta$(单词插入权重)对解码器的性能有非常显著的影响。当声学模型更新时,最好在验证集上重新调整它们。 -`tools/tune.py`会进行2维网格查找超参数$\alpha$和$\beta$。您必须提供$\alpha$和$\beta$的范围,以及尝试的次数。 +`tools/tune.py`会进行2维网格查找超参数$\alpha$和$\beta$。你必须提供$\alpha$和$\beta$的范围,以及尝试的次数。 -- 带GPU版的调整: +- GPU 版的调整: ```bash CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python tools/tune.py \ - --trainer_count 8 \ --alpha_from 1.0 \ --alpha_to 3.2 \ --num_alphas 45 \ @@ -302,21 +332,21 @@ python test.py --help --num_betas 8 ``` -- CPU版的调整: +- CPU 版的调整: ```bash python tools/tune.py --use_gpu False ``` -网格搜索将会在超参数空间的每个点处打印出WER(误字率)或者CER(字符错误率),并且可选择绘出误差曲面。合适的超参数范围应包括WER/CER误差表面的全局最小值,如下图所示。 +网格搜索将会在超参数空间的每个点处打印出 WER (误字率)或者 CER (字符错误率),并且可绘出误差曲面。一个合适的超参数范围应包括 WER/CER 误差表面的全局最小值,如下图所示。


[Figure: example error surface from tuning on the dev-clean set of LibriSpeech]

-通常,如图所示,语言模型权重($\alpha$)的变化显著影响CTC波束搜索解码器的性能。更好的方法是首先调整多批数据(可指定数量)以找出适当的超参数范围,然后更改为整个验证集以进行精确调整。 +通常,如图所示,语言模型权重($\alpha$)的变化显著影响 CTC波束搜索解码器的性能。更好的方法是首先调整多批数据(可指定数量)以找出适当的超参数范围,然后更改为完整的验证集以进行精确调整。 -调整之后,您可以在推理和评价模块中重置$\alpha$和$\beta$,以检查它们是否真的有助于提高ASR性能。更多帮助如下: +调整之后,您可以在推理和评价模块中重置$\alpha$和$\beta$,以检查它们是否真的有助于提高 ASR 性能。更多帮助如下: ```bash python tune.py --help @@ -325,121 +355,56 @@ python tune.py --help ## 在Docker容器上运行 -Docker是一个开源工具,用于在孤立的环境中构建,发布和运行分布式应用程序。此项目的Docker镜像已在[hub.docker.com](https://hub.docker.com)中提供,并安装了所有依赖项,其中包括预先构建的PaddlePaddle,CTC解码器以及其他必要的Python和第三方库。这个Docker映像需要NVIDIA GPU的支持,所以请确保它的可用性并已完成[nvidia-docker](https://github.com/NVIDIA/nvidia-docker)的安装。 +Docker 是一个开源工具,用于在孤立的环境中构建、发布和运行分布式应用程序。此项目的 Docker 镜像已在[hub.docker.com](https://hub.docker.com)中提供,并安装了所有依赖项,其中包括预先构建的PaddlePaddle,CTC解码器以及其他必要的 Python 和第三方库。这个 Docker 映像需要NVIDIA GPU的支持,所以请确保它的可用性并已完成[nvidia-docker](https://github.com/NVIDIA/nvidia-docker)的安装。 -采取以下步骤来启动Docker镜像: +采取以下步骤来启动 Docker 镜像: -- 下载Docker镜像 +- 下载 Docker 镜像 ```bash -nvidia-docker pull paddlepaddle/deep_speech:latest-gpu +nvidia-docker pull hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu ``` -- git clone这个资源库 +- git clone 这个资源库 ``` git clone https://github.com/PaddlePaddle/DeepSpeech.git ``` -- 运行Docker镜像 +- 运行 Docker 镜像 ```bash -sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech paddlepaddle/deep_speech:latest-gpu /bin/bash +sudo nvidia-docker run -it -v $(pwd)/DeepSpeech:/DeepSpeech hub.baidubce.com/paddlepaddle/deep_speech_fluid:latest-gpu /bin/bash ``` 现在返回并从[开始](#开始)部分开始,您可以在Docker容器中同样执行模型训练,推断和超参数调整。 -## 分布式云训练 - -我们还为用户提供云训练模块[PaddleCloud](https://github.com/PaddlePaddle/cloud)以便用户进行集群训练,利用多台机器达到更快的训练速度。首先,请按照[PaddleCloud用法](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud)安装PaddleCloud客户端并注册PaddleCloud账户。 - -请按照以下步骤提交训练任务: - -- 转到目录: - - ```bash - cd cloud - ``` -- 上传数据: - - 
数据必须上传到PaddleCloud文件系统才能在云作业中访问。`pcloud_upload_data.sh`负责进行数据打包和上传: - - ```bash - sh pcloud_upload_data.sh - ``` - - 给定manifest文件,`pcloud_upload_data.sh`会进行以下处理: - - - 提取输入清单中列出的音频文件。 - - 将它们打包成指定数量的tar文件。 - - 将这些tar文件上传到PaddleCloud文件系统。 - - 通过用PaddleCloud文件系统路径替换本地文件系统路径来创建云manifest文件。云作业将通过新的manifest文件获取到音频文件的位置及其元信息。 - - 对于云训练模型来说以上步骤只需做一次。之后这些数据会在云文件系统上保持不变,并可在之后的任务中反复使用。 - - 有关参数的详细信息,请参考[在PaddleCloud上训练DeepSpeech2](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud)。 - - - 配置训练参数 - - 在`pcloud_submit.sh`中配置云任务参数(例如`NUM_NODES`,`NUM_GPUS`,`CLOUD_TRAIN_DIR`,`JOB_NAME`等),然后在`pcloud_train.sh`中配置其他的超参数训练(和本地训练一样)。 - - 有关参数的详细信息,请参阅[在PaddleCloud上训练DeepSpeech2](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud)。 - - - - 提交任务 - - 运行: - - ```bash - sh pcloud_submit.sh - ``` - 一个训练任务已经提交给PaddleCloud,并将任务名输出到控制台。 - - - 获取训练日志 - - 执行以下命令以列出你提交的所有任务以及它们的运行状态: - - ```bash - paddlecloud get jobs - ``` - - 运行此操作,将打印相应的任务日志。 - - ```bash - paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME - ``` - -有关PaddleCloud用法的更多信息,请参阅[PaddleCloud用法](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务)。 - -有关PaddleCloud的DeepSpeech2训练的更多信息,请参阅 -[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/cloud). 
## 训练普通话语言 -普通话语言训练与英语训练的关键步骤相同,我们提供了一个```examples/aishell```中Aishell的普通话训练例子。如上所述,请执行```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh```和```sh run_infer.sh```做相应的数据准备,训练,测试和推断。我们还准备了一个预训练过的模型(执行./models/aishell/download_model.sh下载)供用户使用```run_infer_golden.sh```和```run_test_golden.sh```来。请注意,与英语语言模型不同,普通话语言模型是基于汉字的,请运行```tools/tune.py```来查找最佳设置。 +普通话语言训练与英语训练的关键步骤相同,我们提供了一个使用 Aishell 进行普通话训练的例子```examples/aishell```。如上所述,请执行```sh run_data.sh```, ```sh run_train.sh```, ```sh run_test.sh```和```sh run_infer.sh```做相应的数据准备,训练,测试和推断。我们还准备了一个预训练过的模型(执行./models/aishell/download_model.sh下载)供用户使用```run_infer_golden.sh```和```run_test_golden.sh```来。请注意,与英语语言模型不同,普通话语言模型是基于汉字的,请运行```tools/tune.py```来查找最佳设置。 ##用自己的声音尝试现场演示 -到目前为止,一个ASR模型已经训练完毕,并且进行了定性测试(`infer.py`)和用现有的音频文件进行定量测试(`test.py`)。但目前还没有用你自己的声音进行测试。`deploy/demo_server.py`和`deploy/demo_client.py`能够快速构建一个利用训完的模型,对ASR引擎进行实时演示系统,使你能够用自己的语音测试和演示。 +到目前为止,一个 ASR 模型已经训练完毕,并且用现有的音频文件进行了定性测试(`infer.py`)和定量测试(`test.py`)。但目前还没有用你自己的声音进行测试。`deploy/demo_english_server.py`和`deploy/demo_client.py`能够快速构建一个利用已训练好的模型对ASR引擎进行实时演示的系统,使你能够用自己的语音测试和演示。 要启动演示服务,请在控制台中运行: ```bash CUDA_VISIBLE_DEVICES=0 \ python deploy/demo_server.py \ ---trainer_count 1 \ --host_ip localhost \ --host_port 8086 ``` -对于运行demo客户端的机器(可能不是同一台机器),请在继续之前执行以下安装。 +对于运行 demo 客户端的机器(可能不是同一台机器),请在继续之前执行以下安装。 -比如,对于MAC OS X机器: +比如,对于 MAC OS X 机器: ```bash brew install portaudio pip install pyaudio -pip install pynput +pip install keyboard ``` 然后启动客户端,请在另一个控制台中运行: @@ -451,11 +416,11 @@ python -u deploy/demo_client.py \ --host_port 8086 ``` -现在,在客户端控制台中,按下`whitespace`键,按住并开始讲话。讲话完毕请释放该键以让控制台中显示的语音到文本结果。要退出客户端,只需按`ESC`键。 +现在,在客户端控制台中,按下`空格`键,按住并开始讲话。讲话完毕请释放该键以让控制台中显示语音的文本结果。要退出客户端,只需按`ESC`键。 请注意,`deploy/demo_client.py`必须在带麦克风设备的机器上运行,而`deploy/demo_server.py`可以在没有任何录音硬件的情况下运行,例如任何远程服务器机器。如果服务器和客户端使用两台独立的机器运行,只需要注意将`host_ip`和`host_port`参数设置为实际可访问的IP地址和端口。如果它们在单台机器上运行,则不用作任何处理。 
-请参考`examples/mandarin/run_demo_server.sh`,它将首先下载一个预先训练过的普通话模型(用3000小时的内部语音数据训练),然后用模型启动演示服务器。通过运行`examples/mandarin/run_demo_client.sh`,你可以说普通话来测试它。如果您想尝试其他模型,只需更新脚本中的`--model_path`参数即可。 +请参考`examples/deploy_demo/run_english_demo_server.sh`,它将首先下载一个预先训练过的英语模型(用3000小时的内部语音数据训练),然后用模型启动演示服务器。通过运行`examples/mandarin/run_demo_client.sh`,你可以说英语来测试它。如果您想尝试其他模型,只需更新脚本中的`--model_path`参数即可。 获得更多帮助: @@ -470,10 +435,10 @@ python deploy/demo_client.py --help 语种 | 模型名 | 训练数据 | 语音时长 :-----------: | :------------: | :----------: | -------: -English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h -English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz) | Baidu Internal English Dataset | 8628 h -Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h -Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h +English | [LibriSpeech Model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz) | [LibriSpeech Dataset](http://www.openslr.org/12/) | 960 h +English | [BaiduEN8k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz) | Baidu Internal English Dataset | 8628 h +Mandarin | [Aishell Model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz) | [Aishell Dataset](http://www.openslr.org/33/) | 151 h +Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baidu_cn1.2k_model_fluid.tar.gz) | Baidu Internal Mandarin Dataset | 1204 h #### 语言模型发布 @@ -483,9 +448,9 @@ Mandarin | [BaiduCN1.2k Model](https://deepspeech.bj.bcebos.com/demo_models/baid [Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | 
Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings
[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/>
'probing' binary with default settings -## 实验和基准 +## 实验和baseline -#### 英语模型的基准测试结果(字错误率) +#### 英语模型的baseline测试结果(字错误率) 测试集 | LibriSpeech Model | BaiduEN8K Model :--------------------- | ---------------: | -------------------: @@ -500,7 +465,7 @@ Baidu Internal Testset  |   40.75 |   8.48 为了在VoxForge数据上重现基准测试结果,我们提供了一个脚本来下载数据并生成VoxForge方言manifest文件。请到```data/voxforge```执行````run_data.sh```来获取VoxForge方言manifest文件。请注意,VoxForge数据可能会持续更新,生成的清单文件可能与我们评估的清单文件有所不同。 -#### 普通话模型的基准测试结果(字符错误率) +#### 普通话模型的baseline测试结果(字符错误率) 测试集 | BaiduCN1.2k Model :--------------------- | -------------------: @@ -508,17 +473,16 @@ Baidu Internal Testset | 12.64 #### 多GPU加速 -我们对1,2,4,8,16个Tesla K40m GPU的训练时间(LibriSpeech样本的子集,其音频持续时间介于6.0和7.0秒之间)进行比较。它表明,已经实现了具有多个GPU的**近线性**加速。在下图中,训练的时间(以秒为单位)显示在蓝色条上。 +我们对1,2,4,8个Tesla V100 GPU的训练时间(LibriSpeech样本的子集,其音频持续时间介于6.0和7.0秒之间)进行比较。它表明,已经实现了具有多个GPU的**近线性**加速。在下图中,训练的时间(以秒为单位)显示在蓝色条上。
| # of GPU | 加速比 | | -------- | --------------: | | 1 | 1.00 X | -| 2 | 1.97 X | -| 4 | 3.74 X | -| 8 | 6.21 X | -|16 | 10.70 X | +| 2 | 1.98 X | +| 4 | 3.73 X | +| 8 | 6.95 X | `tools/profile.sh`提供了上述分析工具. diff --git a/cloud/README.md b/cloud/README.md deleted file mode 100644 index a5be1c420..000000000 --- a/cloud/README.md +++ /dev/null @@ -1,63 +0,0 @@ -# Train DeepSpeech2 on PaddleCloud - ->Note: ->Please make sure [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has be installed and current directory is `deep_speech_2/cloud/` - -## Step 1: Upload Data - -Provided with several input manifests, `pcloud_upload_data.sh` will pack and upload all the containing audio files to PaddleCloud filesystem, and also generate some corresponding manifest files with updated cloud paths. - -Please modify the following arguments in `pcloud_upload_data.sh`: - -- `IN_MANIFESTS`: Paths (in local filesystem) of manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimeter. -- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimeter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths. -- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it. -- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. Smaller `num_shards` requires larger temoporal local disk space for packing data. - -By running: - -``` -sh pcloud_upload_data.sh -``` -all the audio files will be uploaded to PaddleCloud filesystem, and you will get modified manifests files in `OUT_MANIFESTS`. 
- -You have to take this step only once, in the very first time you do the cloud training. Later on, the data is persisitent on the cloud filesystem and reusable for further job submissions. - -## Step 2: Configure Training - -Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments: - -- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Notice that the`audio_filepath` should be in cloud filesystem, like those generated by `pcloud_upload_data.sh`. -- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation. -- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it. -- `BATCH_SIZE`: Training batch size for a single node. -- `NUM_GPU`: Number of GPUs allocated for a single node. -- `NUM_NODE`: Number of nodes (machines) allocated for this job. -- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes. - -Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as what you can do in local training. - -By running: - -``` -sh pcloud_submit.sh -``` -you submit a training job to PaddleCloud. And you will see the job name when the submission is done. - - -## Step 3 Get Job Logs - -Run this to list all the jobs you have submitted, as well as their running status: - -``` -paddlecloud get jobs -``` - -Run this, the corresponding job's logs will be printed. -``` -paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME -``` - -## More Help - -For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务). 
diff --git a/cloud/_init_paths.py b/cloud/_init_paths.py deleted file mode 100644 index 3305d7488..000000000 --- a/cloud/_init_paths.py +++ /dev/null @@ -1,17 +0,0 @@ -"""Set up paths for DS2""" -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import os.path -import sys - - -def add_path(path): - if path not in sys.path: - sys.path.insert(0, path) - - -this_dir = os.path.dirname(__file__) -proj_path = os.path.join(this_dir, '..') -add_path(proj_path) diff --git a/cloud/pcloud_submit.sh b/cloud/pcloud_submit.sh deleted file mode 100644 index 99e458db9..000000000 --- a/cloud/pcloud_submit.sh +++ /dev/null @@ -1,29 +0,0 @@ -#! /usr/bin/env bash - -TRAIN_MANIFEST="cloud/cloud_manifests/cloud.manifest.train" -DEV_MANIFEST="cloud/cloud_manifests/cloud.manifest.dev" -CLOUD_MODEL_DIR="./checkpoints" -BATCH_SIZE=512 -NUM_GPU=8 -NUM_NODE=1 -IS_LOCAL="True" - -JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S` -DS2_PATH=${PWD%/*} -cp -f pcloud_train.sh ${DS2_PATH} - -paddlecloud submit \ --image bootstrapper:5000/paddlepaddle/pcloud_ds2:latest \ --jobname ${JOB_NAME} \ --cpu ${NUM_GPU} \ --gpu ${NUM_GPU} \ --memory 64Gi \ --parallelism ${NUM_NODE} \ --pscpu 1 \ --pservers 1 \ --psmemory 64Gi \ --passes 1 \ --entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \ -${DS2_PATH} - -rm ${DS2_PATH}/pcloud_train.sh diff --git a/cloud/pcloud_train.sh b/cloud/pcloud_train.sh deleted file mode 100644 index d0c47dece..000000000 --- a/cloud/pcloud_train.sh +++ /dev/null @@ -1,46 +0,0 @@ -#! 
/usr/bin/env bash - -TRAIN_MANIFEST=$1 -DEV_MANIFEST=$2 -MODEL_PATH=$3 -NUM_GPU=$4 -BATCH_SIZE=$5 -IS_LOCAL=$6 - -python ./cloud/split_data.py \ ---in_manifest_path=${TRAIN_MANIFEST} \ ---out_manifest_path='/local.manifest.train' - -python ./cloud/split_data.py \ ---in_manifest_path=${DEV_MANIFEST} \ ---out_manifest_path='/local.manifest.dev' - -mkdir ./logs - -python -u train.py \ ---batch_size=${BATCH_SIZE} \ ---trainer_count=${NUM_GPU} \ ---num_passes=200 \ ---num_proc_data=${NUM_GPU} \ ---num_conv_layers=2 \ ---num_rnn_layers=3 \ ---rnn_layer_size=2048 \ ---num_iter_print=100 \ ---learning_rate=5e-4 \ ---max_duration=27.0 \ ---min_duration=0.0 \ ---use_sortagrad=True \ ---use_gru=False \ ---use_gpu=True \ ---is_local=${IS_LOCAL} \ ---share_rnn_weights=True \ ---train_manifest='/local.manifest.train' \ ---dev_manifest='/local.manifest.dev' \ ---mean_std_path='data/librispeech/mean_std.npz' \ ---vocab_path='data/librispeech/vocab.txt' \ ---output_model_dir='./checkpoints' \ ---output_model_dir=${MODEL_PATH} \ ---augment_conf_path='conf/augmentation.config' \ ---specgram_type='linear' \ ---shuffle_method='batch_shuffle_clipped' \ -2>&1 | tee ./logs/train.log diff --git a/cloud/pcloud_upload_data.sh b/cloud/pcloud_upload_data.sh deleted file mode 100644 index 71bb4af19..000000000 --- a/cloud/pcloud_upload_data.sh +++ /dev/null @@ -1,22 +0,0 @@ -#! /usr/bin/env bash - -mkdir cloud_manifests - -IN_MANIFESTS="../data/librispeech/manifest.train ../data/librispeech/manifest.dev-clean ../data/librispeech/manifest.test-clean" -OUT_MANIFESTS="cloud_manifests/cloud.manifest.train cloud_manifests/cloud.manifest.dev cloud_manifests/cloud.manifest.test" -CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech" -NUM_SHARDS=50 - -python upload_data.py \ ---in_manifest_paths ${IN_MANIFESTS} \ ---out_manifest_paths ${OUT_MANIFESTS} \ ---cloud_data_dir ${CLOUD_DATA_DIR} \ ---num_shards ${NUM_SHARDS} - -if [ $? -ne 0 ] -then - echo "Upload Data Failed!" 
- exit 1 -fi - -echo "All Done." diff --git a/cloud/split_data.py b/cloud/split_data.py deleted file mode 100644 index 3496d52bf..000000000 --- a/cloud/split_data.py +++ /dev/null @@ -1,41 +0,0 @@ -"""This tool is used for splitting data into each node of -paddlecloud. This script should be called in paddlecloud. -""" -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import os -import json -import argparse - -parser = argparse.ArgumentParser(description=__doc__) -parser.add_argument( - "--in_manifest_path", - type=str, - required=True, - help="Input manifest path for all nodes.") -parser.add_argument( - "--out_manifest_path", - type=str, - required=True, - help="Output manifest file path for current node.") -args = parser.parse_args() - - -def split_data(in_manifest_path, out_manifest_path): - with open("/trainer_id", "r") as f: - trainer_id = int(f.readline()[:-1]) - with open("/trainer_count", "r") as f: - trainer_count = int(f.readline()[:-1]) - - out_manifest = [] - for index, json_line in enumerate(open(in_manifest_path, 'r')): - if (index % trainer_count) == trainer_id: - out_manifest.append("%s\n" % json_line.strip()) - with open(out_manifest_path, 'w') as f: - f.writelines(out_manifest) - - -if __name__ == '__main__': - split_data(args.in_manifest_path, args.out_manifest_path) diff --git a/cloud/upload_data.py b/cloud/upload_data.py deleted file mode 100644 index 9973f8c76..000000000 --- a/cloud/upload_data.py +++ /dev/null @@ -1,129 +0,0 @@ -"""This script is for uploading data for DeepSpeech2 training on paddlecloud. - -Steps: -1. Read original manifests and extract local sound files. -2. Tar all local sound files into multiple tar files and upload them. -3. Modify original manifests with updated paths in cloud filesystem. 
-""" -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import json -import os -import tarfile -import sys -import argparse -import shutil -from subprocess import call -import _init_paths -from data_utils.utils import read_manifest - -parser = argparse.ArgumentParser(description=__doc__) -parser.add_argument( - "--in_manifest_paths", - default=[ - "../datasets/manifest.train", "../datasets/manifest.dev", - "../datasets/manifest.test" - ], - type=str, - nargs='+', - help="Local filepaths of input manifests to load, pack and upload." - "(default: %(default)s)") -parser.add_argument( - "--out_manifest_paths", - default=[ - "./cloud.manifest.train", "./cloud.manifest.dev", - "./cloud.manifest.test" - ], - type=str, - nargs='+', - help="Local filepaths of modified manifests to write to. " - "(default: %(default)s)") -parser.add_argument( - "--cloud_data_dir", - required=True, - type=str, - help="Destination directory on paddlecloud to upload data to.") -parser.add_argument( - "--num_shards", - default=10, - type=int, - help="Number of parts to split data to. (default: %(default)s)") -parser.add_argument( - "--local_tmp_dir", - default="./tmp/", - type=str, - help="Local directory for storing temporary data. (default: %(default)s)") -args = parser.parse_args() - - -def upload_data(in_manifest_path_list, out_manifest_path_list, local_tmp_dir, - upload_tar_dir, num_shards): - """Extract and pack sound files listed in the manifest files into multple - tar files and upload them to padldecloud. Besides, generate new manifest - files with updated paths in paddlecloud. 
- """ - # compute total audio number - total_line = 0 - for manifest_path in in_manifest_path_list: - with open(manifest_path, 'r') as f: - total_line += len(f.readlines()) - line_per_tar = (total_line // num_shards) + 1 - - # pack and upload shard by shard - line_count, tar_file = 0, None - for manifest_path, out_manifest_path in zip(in_manifest_path_list, - out_manifest_path_list): - manifest = read_manifest(manifest_path) - out_manifest = [] - for json_data in manifest: - sound_filepath = json_data['audio_filepath'] - sound_filename = os.path.basename(sound_filepath) - if line_count % line_per_tar == 0: - if tar_file != None: - tar_file.close() - pcloud_cp(tar_path, upload_tar_dir) - os.remove(tar_path) - tar_name = 'part-%s-of-%s.tar' % ( - str(line_count // line_per_tar).zfill(5), - str(num_shards).zfill(5)) - tar_path = os.path.join(local_tmp_dir, tar_name) - tar_file = tarfile.open(tar_path, 'w') - tar_file.add(sound_filepath, arcname=sound_filename) - line_count += 1 - json_data['audio_filepath'] = "tar:%s#%s" % ( - os.path.join(upload_tar_dir, tar_name), sound_filename) - out_manifest.append("%s\n" % json.dumps(json_data)) - with open(out_manifest_path, 'w') as f: - f.writelines(out_manifest) - pcloud_cp(out_manifest_path, upload_tar_dir) - tar_file.close() - pcloud_cp(tar_path, upload_tar_dir) - os.remove(tar_path) - - -def pcloud_mkdir(dir): - """Make directory in PaddleCloud filesystem. - """ - if call(['paddlecloud', 'mkdir', dir]) != 0: - raise IOError("PaddleCloud mkdir failed: %s." % dir) - - -def pcloud_cp(src, dst): - """Copy src from local filesytem to dst in PaddleCloud filesystem, - or downlowd src from PaddleCloud filesystem to dst in local filesystem. - """ - if call(['paddlecloud', 'cp', src, dst]) != 0: - raise IOError("PaddleCloud cp failed: from [%s] to [%s]." 
% (src, dst)) - - -if __name__ == '__main__': - if not os.path.exists(args.local_tmp_dir): - os.makedirs(args.local_tmp_dir) - pcloud_mkdir(args.cloud_data_dir) - - upload_data(args.in_manifest_paths, args.out_manifest_paths, - args.local_tmp_dir, args.cloud_data_dir, args.num_shards) - - shutil.rmtree(args.local_tmp_dir) diff --git a/data/librispeech/librispeech.py b/data/librispeech/librispeech.py index 9a8e1c287..07cc09339 100644 --- a/data/librispeech/librispeech.py +++ b/data/librispeech/librispeech.py @@ -16,6 +16,7 @@ import argparse import soundfile import json import codecs +import io from data_utils.utility import download, unpack URL_ROOT = "http://www.openslr.org/resources/12" @@ -68,12 +69,11 @@ def create_manifest(data_dir, manifest_path): filename for filename in filelist if filename.endswith('trans.txt') ] if len(text_filelist) > 0: - text_filepath = os.path.join(data_dir, subfolder, text_filelist[0]) - for line in open(text_filepath): + text_filepath = os.path.join(subfolder, text_filelist[0]) + for line in io.open(text_filepath, encoding="utf8"): segments = line.strip().split() text = ' '.join(segments[1:]).lower() - audio_filepath = os.path.join(data_dir, subfolder, - segments[0] + '.flac') + audio_filepath = os.path.join(subfolder, segments[0] + '.flac') audio_data, samplerate = soundfile.read(audio_filepath) duration = float(len(audio_data)) / samplerate json_lines.append( diff --git a/data/noise/chime3_background.py b/data/noise/chime3_background.py index f79ca7335..1aa7f8df8 100644 --- a/data/noise/chime3_background.py +++ b/data/noise/chime3_background.py @@ -16,6 +16,7 @@ import zipfile import argparse import soundfile import json +import io from paddle.v2.dataset.common import md5file DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') @@ -88,7 +89,7 @@ def create_manifest(data_dir, manifest_path): 'duration': duration, 'text': '' })) - with open(manifest_path, 'w') as out_file: + with io.open(manifest_path, mode='w', 
encoding='utf8') as out_file: for line in json_lines: out_file.write(line + '\n') diff --git a/data/voxforge/run_data.sh b/data/voxforge/run_data.sh index c6ff71118..0276744ae 100644 --- a/data/voxforge/run_data.sh +++ b/data/voxforge/run_data.sh @@ -3,7 +3,7 @@ # download data, generate manifests PYTHONPATH=../../:$PYTHONPATH python voxforge.py \ --manifest_prefix='./manifest' \ ---target_dir='~/.cache/paddle/dataset/speech/VoxForge' \ +--target_dir='./dataset/VoxForge' \ --is_merge_dialect=True \ --dialects 'american' 'british' 'australian' 'european' 'irish' 'canadian' 'indian' diff --git a/data/voxforge/voxforge.py b/data/voxforge/voxforge.py index 63f052bd7..b86b0f004 100644 --- a/data/voxforge/voxforge.py +++ b/data/voxforge/voxforge.py @@ -18,7 +18,7 @@ import shutil import subprocess from data_utils.utility import download_multi, unpack, getfile_insensitive -DATA_HOME = '~/.cache/paddle/dataset/speech' +DATA_HOME = './dataset' DATA_URL = 'http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/' \ 'Audio/Main/16kHz_16bit' diff --git a/data_utils/audio.py b/data_utils/audio.py index 3fb782951..e0feb21f3 100644 --- a/data_utils/audio.py +++ b/data_utils/audio.py @@ -12,6 +12,7 @@ import resampy from scipy import signal import random import copy +import io class AudioSegment(object): @@ -154,7 +155,7 @@ class AudioSegment(object): fileno = int(matches.group(2)) # read headers - f = open(filename, 'rb') + f = io.open(filename, mode='rb', encoding='utf8') version = f.read(4) num_utterances = struct.unpack("i", f.read(4))[0] bytes_per_header = struct.unpack("i", f.read(4))[0] diff --git a/data_utils/data.py b/data_utils/data.py index f79a6395a..0fb2a88ba 100644 --- a/data_utils/data.py +++ b/data_utils/data.py @@ -9,10 +9,9 @@ import random import tarfile import multiprocessing import numpy as np -import paddle.v2 as paddle +import paddle.fluid as fluid from threading import local from data_utils.utility import read_manifest -from data_utils.utility 
import xmap_readers_mp from data_utils.augmentor.augmentation import AugmentationPipeline from data_utils.featurizer.speech_featurizer import SpeechFeaturizer from data_utils.speech import SpeechSegment @@ -51,14 +50,17 @@ class DataGenerator(object): :param use_dB_normalization: Whether to normalize the audio to -20 dB before extracting the features. :type use_dB_normalization: bool - :param num_threads: Number of CPU threads for processing data. - :type num_threads: int :param random_seed: Random seed. :type random_seed: int :param keep_transcription_text: If set to True, transcription text will be passed forward directly without converting to index sequence. :type keep_transcription_text: bool + :param place: The place to run the program. + :type place: CPUPlace or CUDAPlace + :param is_training: If set to True, generate text data for training, + otherwise, generate text data for infer. + :type is_training: bool """ def __init__(self, @@ -72,9 +74,10 @@ class DataGenerator(object): max_freq=None, specgram_type='linear', use_dB_normalization=True, - num_threads=multiprocessing.cpu_count() // 2, random_seed=0, - keep_transcription_text=False): + keep_transcription_text=False, + place=fluid.CPUPlace(), + is_training=True): self._max_duration = max_duration self._min_duration = min_duration self._normalizer = FeatureNormalizer(mean_std_filepath) @@ -87,14 +90,15 @@ class DataGenerator(object): window_ms=window_ms, max_freq=max_freq, use_dB_normalization=use_dB_normalization) - self._num_threads = num_threads self._rng = random.Random(random_seed) self._keep_transcription_text = keep_transcription_text self._epoch = 0 + self._is_training = is_training # for caching tar files info self._local_data = local() self._local_data.tar2info = {} self._local_data.tar2object = {} + self._place = place def process_utterance(self, audio_file, transcript): """Load, augment, featurize and normalize for speech data. 
@@ -121,7 +125,6 @@ class DataGenerator(object): def batch_reader_creator(self, manifest_path, batch_size, - min_batch_size=1, padding_to=-1, flatten=False, sortagrad=False, @@ -137,9 +140,6 @@ class DataGenerator(object): :type manifest_path: basestring :param batch_size: Number of instances in a batch. :type batch_size: int - :param min_batch_size: Any batch with batch size smaller than this will - be discarded. (To be deprecated in the future.) - :type min_batch_size: int :param padding_to: If set -1, the maximun shape in the batch will be used as the target shape for padding. Otherwise, `padding_to` will be the target shape. @@ -178,6 +178,7 @@ class DataGenerator(object): # sort (by duration) or batch-wise shuffle the manifest if self._epoch == 0 and sortagrad: manifest.sort(key=lambda x: x["duration"]) + else: if shuffle_method == "batch_shuffle": manifest = self._batch_shuffle( @@ -193,18 +194,16 @@ class DataGenerator(object): raise ValueError("Unknown shuffle method %s." % shuffle_method) # prepare batches - instance_reader, cleanup = self._instance_reader_creator(manifest) batch = [] - try: - for instance in instance_reader(): - batch.append(instance) - if len(batch) == batch_size: - yield self._padding_batch(batch, padding_to, flatten) - batch = [] - if len(batch) >= min_batch_size: + instance_reader = self._instance_reader_creator(manifest) + + for instance in instance_reader(): + batch.append(instance) + if len(batch) == batch_size: yield self._padding_batch(batch, padding_to, flatten) - finally: - cleanup() + batch = [] + if len(batch) >= 1: + yield self._padding_batch(batch, padding_to, flatten) self._epoch += 1 return batch_reader @@ -276,13 +275,11 @@ class DataGenerator(object): def reader(): for instance in manifest: - yield instance + inst = self.process_utterance(instance["audio_filepath"], + instance["text"]), + yield inst[0] - reader, cleanup_callback = xmap_readers_mp( - lambda instance: self.process_utterance(instance["audio_filepath"], 
instance["text"]), - reader, self._num_threads, 4096) - - return reader, cleanup_callback + return reader def _padding_batch(self, batch, padding_to=-1, flatten=False): """ @@ -304,14 +301,43 @@ class DataGenerator(object): "than any instance's shape in the batch") max_length = padding_to # padding + padded_audios = [] + texts, text_lens = [], [] + audio_lens = [] + masks = [] for audio, text in batch: padded_audio = np.zeros([audio.shape[0], max_length]) padded_audio[:, :audio.shape[1]] = audio if flatten: padded_audio = padded_audio.flatten() - padded_instance = [padded_audio, text, audio.shape[1]] - new_batch.append(padded_instance) - return new_batch + padded_audios.append(padded_audio) + if self._is_training: + texts += text + else: + texts.append(text) + text_lens.append(len(text)) + audio_lens.append(audio.shape[1]) + mask_shape0 = (audio.shape[0] - 1) // 2 + 1 + mask_shape1 = (audio.shape[1] - 1) // 3 + 1 + mask_max_len = (max_length - 1) // 3 + 1 + mask_ones = np.ones((mask_shape0, mask_shape1)) + mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1)) + mask = np.repeat( + np.reshape( + np.concatenate((mask_ones, mask_zeros), axis=1), + (1, mask_shape0, mask_max_len)), + 32, + axis=0) + masks.append(mask) + padded_audios = np.array(padded_audios).astype('float32') + if self._is_training: + texts = fluid.create_lod_tensor( + np.array(texts).astype('int32'), + recursive_seq_lens=[text_lens], + place=self._place) + audio_lens = np.array(audio_lens).astype('int64').reshape([-1, 1]) + masks = np.array(masks).astype('float32') + return padded_audios, texts, audio_lens, masks def _batch_shuffle(self, manifest, batch_size, clipped=False): """Put similarly-sized instances into minibatches for better efficiency diff --git a/data_utils/utility.py b/data_utils/utility.py index 89a74c41a..7143f7ded 100644 --- a/data_utils/utility.py +++ b/data_utils/utility.py @@ -11,7 +11,7 @@ import time from Queue import Queue from threading import Thread from 
multiprocessing import Process, Manager, Value -from paddle.v2.dataset.common import md5file +from paddle.dataset.common import md5file def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0): @@ -88,127 +88,3 @@ def unpack(filepath, target_dir, rm_tar=False): class XmapEndSignal(): pass - - -def xmap_readers_mp(mapper, reader, process_num, buffer_size, order=False): - """A multiprocessing pipeline wrapper for the data reader. - - :param mapper: Function to map sample. - :type mapper: callable - :param reader: Given data reader. - :type reader: callable - :param process_num: Number of processes in the pipeline - :type process_num: int - :param buffer_size: Maximal buffer size. - :type buffer_size: int - :return: The wrappered reader and cleanup callback - :rtype: tuple - """ - end_flag = XmapEndSignal() - - read_workers = [] - handle_workers = [] - flush_workers = [] - - read_exit_flag = Value('i', 0) - handle_exit_flag = Value('i', 0) - flush_exit_flag = Value('i', 0) - - # define a worker to read samples from reader to in_queue with order flag - def order_read_worker(reader, in_queue): - for order_id, sample in enumerate(reader()): - if read_exit_flag.value == 1: break - in_queue.put((order_id, sample)) - in_queue.put(end_flag) - # the reading worker should not exit until all handling work exited - while handle_exit_flag.value == 0 or read_exit_flag.value == 0: - time.sleep(0.001) - - # define a worker to handle samples from in_queue by mapper and put results - # to out_queue with order - def order_handle_worker(in_queue, out_queue, mapper, out_order): - ins = in_queue.get() - while not isinstance(ins, XmapEndSignal): - if handle_exit_flag.value == 1: break - order_id, sample = ins - result = mapper(sample) - while order_id != out_order[0]: - time.sleep(0.001) - out_queue.put(result) - out_order[0] += 1 - ins = in_queue.get() - in_queue.put(end_flag) - out_queue.put(end_flag) - # wait for exit of flushing worker - while flush_exit_flag.value 
== 0 or handle_exit_flag.value == 0: - time.sleep(0.001) - read_exit_flag.value = 1 - handle_exit_flag.value = 1 - - # define a thread worker to flush samples from Manager.Queue to Queue - # for acceleration - def flush_worker(in_queue, out_queue): - finish = 0 - while finish < process_num and flush_exit_flag.value == 0: - sample = in_queue.get() - if isinstance(sample, XmapEndSignal): - finish += 1 - else: - out_queue.put(sample) - out_queue.put(end_flag) - handle_exit_flag.value = 1 - flush_exit_flag.value = 1 - - def cleanup(): - # first exit flushing workers - flush_exit_flag.value = 1 - for w in flush_workers: - w.join() - # next exit handling workers - handle_exit_flag.value = 1 - for w in handle_workers: - w.join() - # last exit reading workers - read_exit_flag.value = 1 - for w in read_workers: - w.join() - - def xreader(): - # prepare shared memory - manager = Manager() - in_queue = manager.Queue(buffer_size) - out_queue = manager.Queue(buffer_size) - out_order = manager.list([0]) - - # start a read worker in a process - target = order_read_worker - p = Process(target=target, args=(reader, in_queue)) - p.daemon = True - p.start() - read_workers.append(p) - - # start handle_workers with multiple processes - target = order_handle_worker - args = (in_queue, out_queue, mapper, out_order) - workers = [ - Process(target=target, args=args) for _ in xrange(process_num) - ] - for w in workers: - w.daemon = True - w.start() - handle_workers.append(w) - - # start a thread to read data from slow Manager.Queue - flush_queue = Queue(buffer_size) - t = Thread(target=flush_worker, args=(out_queue, flush_queue)) - t.daemon = True - t.start() - flush_workers.append(t) - - # get results - sample = flush_queue.get() - while not isinstance(sample, XmapEndSignal): - yield sample - sample = flush_queue.get() - - return xreader, cleanup diff --git a/decoders/decoders_deprecated.py b/decoders/decoders_deprecated.py index 17b28b0d0..b9248b58b 100644 --- 
a/decoders/decoders_deprecated.py +++ b/decoders/decoders_deprecated.py @@ -102,7 +102,7 @@ def ctc_beam_search_decoder(probs_seq, probs_b_prev, probs_nb_prev = {'\t': 1.0}, {'\t': 0.0} ## extend prefix in loop - for time_step in xrange(len(probs_seq)): + for time_step in range(len(probs_seq)): # prefix_set_next: the set containing candidate prefixes # probs_b_cur: prefixes' probability ending with blank in current step # probs_nb_cur: prefixes' probability ending with non-blank in current step @@ -114,7 +114,7 @@ def ctc_beam_search_decoder(probs_seq, if cutoff_prob < 1.0 or cutoff_top_n < cutoff_len: prob_idx = sorted(prob_idx, key=lambda asd: asd[1], reverse=True) cutoff_len, cum_prob = 0, 0.0 - for i in xrange(len(prob_idx)): + for i in range(len(prob_idx)): cum_prob += prob_idx[i][1] cutoff_len += 1 if cum_prob >= cutoff_prob: @@ -127,7 +127,7 @@ def ctc_beam_search_decoder(probs_seq, probs_b_cur[l], probs_nb_cur[l] = 0.0, 0.0 # extend prefix by travering prob_idx - for index in xrange(cutoff_len): + for index in range(cutoff_len): c, prob_c = prob_idx[index][0], prob_idx[index][1] if c == blank_id: diff --git a/deploy/demo_client.py b/deploy/demo_client.py index ddf4dd1bf..7f8869462 100644 --- a/deploy/demo_client.py +++ b/deploy/demo_client.py @@ -1,5 +1,5 @@ """Client-end for the ASR demo.""" -from pynput import keyboard +import keyboard import struct import socket import sys @@ -23,22 +23,17 @@ is_recording = False enable_trigger_record = True -def on_press(key): - """On-press keyboard callback function.""" +def on_press_release(x): + """Keyboard callback function.""" global is_recording, enable_trigger_record - if key == keyboard.Key.space: + press = keyboard.KeyboardEvent('down', 28, 'space') + release = keyboard.KeyboardEvent('up', 28, 'space') + if x.event_type == 'down' and x.name == press.name: if (not is_recording) and enable_trigger_record: sys.stdout.write("Start Recording ... 
") sys.stdout.flush() is_recording = True - - -def on_release(key): - """On-release keyboard callback function.""" - global is_recording, enable_trigger_record - if key == keyboard.Key.esc: - return False - elif key == keyboard.Key.space: + if x.event_type == 'up' and x.name == release.name: if is_recording == True: is_recording = False @@ -80,9 +75,10 @@ def main(): stream.start_stream() # prepare keyboard listener - with keyboard.Listener( - on_press=on_press, on_release=on_release) as listener: - listener.join() + while (1): + keyboard.hook(on_press_release) + if keyboard.record('esc'): + break # close up stream.stop_stream() diff --git a/deploy/demo_server.py b/deploy/demo_server.py index 1cafb7a58..68fcb245f 100644 --- a/deploy/demo_server.py +++ b/deploy/demo_server.py @@ -8,7 +8,8 @@ from time import gmtime, strftime import SocketServer import struct import wave -import paddle.v2 as paddle +import paddle.fluid as fluid +import numpy as np import _init_paths from data_utils.data import DataGenerator from model_utils.model import DeepSpeech2Model @@ -141,13 +142,19 @@ def warm_up_test(audio_process_handler, def start_server(): """Start the ASR server""" # prepare data generator + if args.use_gpu: + place = fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + data_generator = DataGenerator( vocab_filepath=args.vocab_path, mean_std_filepath=args.mean_std_path, augmentation_config='{}', specgram_type=args.specgram_type, - num_threads=1, - keep_transcription_text=True) + keep_transcription_text=True, + place = place, + is_training = False) # prepare ASR model ds2_model = DeepSpeech2Model( vocab_size=data_generator.vocab_size, @@ -155,7 +162,8 @@ def start_server(): num_rnn_layers=args.num_rnn_layers, rnn_layer_size=args.rnn_layer_size, use_gru=args.use_gru, - pretrained_model_path=args.model_path, + init_from_pretrained_model=args.model_path, + place=place, share_rnn_weights=args.share_rnn_weights) vocab_list = [chars.encode("utf-8") for chars in 
data_generator.vocab_list] @@ -166,8 +174,24 @@ def start_server(): # prepare ASR inference handler def file_to_transcript(filename): feature = data_generator.process_utterance(filename, "") + audio_len = feature[0].shape[1] + mask_shape0 = (feature[0].shape[0] - 1) // 2 + 1 + mask_shape1 = (feature[0].shape[1] - 1) // 3 + 1 + mask_max_len = (audio_len - 1) // 3 + 1 + mask_ones = np.ones((mask_shape0, mask_shape1)) + mask_zeros = np.zeros((mask_shape0, mask_max_len - mask_shape1)) + mask = np.repeat( + np.reshape( + np.concatenate((mask_ones, mask_zeros), axis=1), + (1, mask_shape0, mask_max_len)), + 32, + axis=0) + feature = (np.array([feature[0]]).astype('float32'), + None, + np.array([audio_len]).astype('int64').reshape([-1,1]), + np.array([mask]).astype('float32')) probs_split = ds2_model.infer_batch_probs( - infer_data=[feature], + infer_data=feature, feeding_dict=data_generator.feeding) if args.decoding_method == "ctc_greedy": @@ -207,7 +231,6 @@ def start_server(): def main(): print_arguments(args) - paddle.init(use_gpu=args.use_gpu, trainer_count=1) start_server() diff --git a/docs/images/multi_gpu_speedup.png b/docs/images/multi_gpu_speedup.png index 57a803bac..286de5151 100755 Binary files a/docs/images/multi_gpu_speedup.png and b/docs/images/multi_gpu_speedup.png differ diff --git a/examples/aishell/run_data.sh b/examples/aishell/run_data.sh index eb0388d84..93ea6c291 100644 --- a/examples/aishell/run_data.sh +++ b/examples/aishell/run_data.sh @@ -5,7 +5,7 @@ cd ../.. > /dev/null # download data, generate manifests PYTHONPATH=.:$PYTHONPATH python data/aishell/aishell.py \ --manifest_prefix='data/aishell/manifest' \ ---target_dir='~/.cache/paddle/dataset/speech/Aishell' +--target_dir='./dataset/aishell' if [ $? -ne 0 ]; then echo "Prepare Aishell failed. Terminated." 
diff --git a/examples/aishell/run_infer.sh b/examples/aishell/run_infer.sh index e8bd9eab1..c38325d17 100644 --- a/examples/aishell/run_infer.sh +++ b/examples/aishell/run_infer.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_ch.sh +bash download_lm_ch.sh if [ $? -ne 0 ]; then exit 1 fi @@ -15,7 +15,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0 \ python -u infer.py \ --num_samples=10 \ ---trainer_count=1 \ --beam_size=300 \ --num_proc_bsearch=8 \ --num_conv_layers=2 \ @@ -31,7 +30,7 @@ python -u infer.py \ --infer_manifest='data/aishell/manifest.test' \ --mean_std_path='data/aishell/mean_std.npz' \ --vocab_path='data/aishell/vocab.txt' \ ---model_path='checkpoints/aishell/params.latest.tar.gz' \ +--model_path='checkpoints/aishell/step_final' \ --lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='cer' \ diff --git a/examples/aishell/run_infer_golden.sh b/examples/aishell/run_infer_golden.sh index 68f5a521a..56d3365d9 100644 --- a/examples/aishell/run_infer_golden.sh +++ b/examples/aishell/run_infer_golden.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_ch.sh +bash download_lm_ch.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/aishell > /dev/null -sh download_model.sh +bash download_model.sh if [ $?
-ne 0 ]; then exit 1 fi @@ -24,7 +24,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0 \ python -u infer.py \ --num_samples=10 \ ---trainer_count=1 \ --beam_size=300 \ --num_proc_bsearch=8 \ --num_conv_layers=2 \ @@ -35,12 +34,12 @@ python -u infer.py \ --cutoff_prob=0.99 \ --cutoff_top_n=40 \ --use_gru=True \ ---use_gpu=True \ +--use_gpu=False \ --share_rnn_weights=False \ --infer_manifest='data/aishell/manifest.test' \ --mean_std_path='models/aishell/mean_std.npz' \ --vocab_path='models/aishell/vocab.txt' \ ---model_path='models/aishell/params.tar.gz' \ +--model_path='models/aishell' \ --lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='cer' \ diff --git a/examples/aishell/run_test.sh b/examples/aishell/run_test.sh index 35dfca82f..2867444be 100644 --- a/examples/aishell/run_test.sh +++ b/examples/aishell/run_test.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_ch.sh +bash download_lm_ch.sh if [ $? -ne 0 ]; then exit 1 fi @@ -15,10 +15,8 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u test.py \ --batch_size=128 \ ---trainer_count=8 \ --beam_size=300 \ --num_proc_bsearch=8 \ ---num_proc_data=8 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=1024 \ @@ -32,7 +30,7 @@ python -u test.py \ --test_manifest='data/aishell/manifest.test' \ --mean_std_path='data/aishell/mean_std.npz' \ --vocab_path='data/aishell/vocab.txt' \ ---model_path='checkpoints/aishell/params.latest.tar.gz' \ +--model_path='checkpoints/aishell/step_final' \ --lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='cer' \ diff --git a/examples/aishell/run_test_golden.sh b/examples/aishell/run_test_golden.sh index 8b5e65595..799f382f5 100644 --- a/examples/aishell/run_test_golden.sh +++ b/examples/aishell/run_test_golden.sh @@ -4,7 +4,7 @@ cd ../.. 
> /dev/null # download language model cd models/lm > /dev/null -sh download_lm_ch.sh +bash download_lm_ch.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/aishell > /dev/null -sh download_model.sh +bash download_model.sh if [ $? -ne 0 ]; then exit 1 fi @@ -24,10 +24,8 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u test.py \ --batch_size=128 \ ---trainer_count=8 \ --beam_size=300 \ --num_proc_bsearch=8 \ ---num_proc_data=8 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=1024 \ @@ -41,7 +39,7 @@ python -u test.py \ --test_manifest='data/aishell/manifest.test' \ --mean_std_path='models/aishell/mean_std.npz' \ --vocab_path='models/aishell/vocab.txt' \ ---model_path='models/aishell/params.tar.gz' \ +--model_path='models/aishell' \ --lang_model_path='models/lm/zh_giga.no_cna_cmn.prune01244.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='cer' \ diff --git a/examples/aishell/run_train.sh b/examples/aishell/run_train.sh index e09205cb4..335473fcf 100644 --- a/examples/aishell/run_train.sh +++ b/examples/aishell/run_train.sh @@ -3,17 +3,18 @@ cd ../.. 
> /dev/null # train model -# if you wish to resume from an exists model, uncomment --init_model_path +# if you wish to resume from an existing model, uncomment --init_from_pretrained_model +export FLAGS_sync_nccl_allreduce=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u train.py \ --batch_size=64 \ ---trainer_count=8 \ ---num_passes=50 \ ---num_proc_data=16 \ +--num_epoch=50 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=1024 \ --num_iter_print=100 \ +--save_epoch=1 \ +--num_samples=120000 \ --learning_rate=5e-4 \ --max_duration=27.0 \ --min_duration=0.0 \ @@ -30,7 +31,7 @@ python -u train.py \ --output_model_dir='./checkpoints/aishell' \ --augment_conf_path='conf/augmentation.config' \ --specgram_type='linear' \ ---shuffle_method='batch_shuffle_clipped' +--shuffle_method='batch_shuffle_clipped' \ if [ $? -ne 0 ]; then echo "Failed in training!" diff --git a/examples/baidu_en8k/run_infer_golden.sh b/examples/baidu_en8k/run_infer_golden.sh index 68cf2fc9f..2f3f0acf7 100644 --- a/examples/baidu_en8k/run_infer_golden.sh +++ b/examples/baidu_en8k/run_infer_golden.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/baidu_en8k > /dev/null -sh download_model.sh +bash download_model.sh if [ $?
-ne 0 ]; then exit 1 fi @@ -24,7 +24,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0 \ python -u infer.py \ --num_samples=10 \ ---trainer_count=1 \ --beam_size=500 \ --num_proc_bsearch=5 \ --num_conv_layers=2 \ @@ -35,12 +34,12 @@ python -u infer.py \ --cutoff_prob=1.0 \ --cutoff_top_n=40 \ --use_gru=True \ ---use_gpu=True \ +--use_gpu=False \ --share_rnn_weights=False \ --infer_manifest='data/librispeech/manifest.test-clean' \ --mean_std_path='models/baidu_en8k/mean_std.npz' \ --vocab_path='models/baidu_en8k/vocab.txt' \ ---model_path='models/baidu_en8k/params.tar.gz' \ +--model_path='models/baidu_en8k' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/baidu_en8k/run_test_golden.sh b/examples/baidu_en8k/run_test_golden.sh index b471ac65d..612e71a01 100644 --- a/examples/baidu_en8k/run_test_golden.sh +++ b/examples/baidu_en8k/run_test_golden.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/baidu_en8k > /dev/null -sh download_model.sh +bash download_model.sh if [ $? 
-ne 0 ]; then exit 1 fi @@ -24,7 +24,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0,1,2,3 \ python -u test.py \ --batch_size=128 \ ---trainer_count=4 \ --beam_size=500 \ --num_proc_bsearch=8 \ --num_proc_data=8 \ @@ -36,12 +35,12 @@ python -u test.py \ --cutoff_prob=1.0 \ --cutoff_top_n=40 \ --use_gru=True \ ---use_gpu=True \ +--use_gpu=False \ --share_rnn_weights=False \ --test_manifest='data/librispeech/manifest.test-clean' \ --mean_std_path='models/baidu_en8k/mean_std.npz' \ --vocab_path='models/baidu_en8k/vocab.txt' \ ---model_path='models/baidu_en8k/params.tar.gz' \ +--model_path='models/baidu_en8k' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/deploy_demo/run_english_demo_server.sh b/examples/deploy_demo/run_english_demo_server.sh index 67532770c..d67559f33 100644 --- a/examples/deploy_demo/run_english_demo_server.sh +++ b/examples/deploy_demo/run_english_demo_server.sh @@ -5,7 +5,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -14,7 +14,7 @@ cd - > /dev/null # download well-trained model cd models/baidu_en8k > /dev/null -sh download_model.sh +bash download_model.sh if [ $? -ne 0 ]; then exit 1 fi @@ -40,7 +40,7 @@ python -u deploy/demo_server.py \ --warmup_manifest='data/tiny/manifest.test-clean' \ --mean_std_path='models/baidu_en8k/mean_std.npz' \ --vocab_path='models/baidu_en8k/vocab.txt' \ ---model_path='models/baidu_en8k/params.tar.gz' \ +--model_path='models/baidu_en8k' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --specgram_type='linear' diff --git a/examples/librispeech/run_data.sh b/examples/librispeech/run_data.sh index 6e170c12a..e4db1ac9b 100644 --- a/examples/librispeech/run_data.sh +++ b/examples/librispeech/run_data.sh @@ -5,7 +5,7 @@ cd ../.. 
> /dev/null # download data, generate manifests PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \ --manifest_prefix='data/librispeech/manifest' \ ---target_dir='~/.cache/paddle/dataset/speech/libri' \ +--target_dir='./dataset/librispeech' \ --full_download='True' if [ $? -ne 0 ]; then diff --git a/examples/librispeech/run_infer.sh b/examples/librispeech/run_infer.sh index 44b97bacf..91d8ff2eb 100644 --- a/examples/librispeech/run_infer.sh +++ b/examples/librispeech/run_infer.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -15,7 +15,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0 \ python -u infer.py \ --num_samples=10 \ ---trainer_count=1 \ --beam_size=500 \ --num_proc_bsearch=8 \ --num_conv_layers=2 \ @@ -31,7 +30,7 @@ python -u infer.py \ --infer_manifest='data/librispeech/manifest.test-clean' \ --mean_std_path='data/librispeech/mean_std.npz' \ --vocab_path='data/librispeech/vocab.txt' \ ---model_path='checkpoints/libri/params.latest.tar.gz' \ +--model_path='checkpoints/libri/step_final' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/librispeech/run_infer_golden.sh b/examples/librispeech/run_infer_golden.sh index 173790903..eb8121294 100644 --- a/examples/librispeech/run_infer_golden.sh +++ b/examples/librispeech/run_infer_golden.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/librispeech > /dev/null -sh download_model.sh +bash download_model.sh if [ $? 
-ne 0 ]; then exit 1 fi @@ -24,7 +24,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0 \ python -u infer.py \ --num_samples=10 \ ---trainer_count=1 \ --beam_size=500 \ --num_proc_bsearch=8 \ --num_conv_layers=2 \ @@ -40,7 +39,7 @@ python -u infer.py \ --infer_manifest='data/librispeech/manifest.test-clean' \ --mean_std_path='models/librispeech/mean_std.npz' \ --vocab_path='models/librispeech/vocab.txt' \ ---model_path='models/librispeech/params.tar.gz' \ +--model_path='models/librispeech' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/librispeech/run_test.sh b/examples/librispeech/run_test.sh index 11cd74116..9eebbbf24 100644 --- a/examples/librispeech/run_test.sh +++ b/examples/librispeech/run_test.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -15,10 +15,8 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u test.py \ --batch_size=128 \ ---trainer_count=8 \ --beam_size=500 \ --num_proc_bsearch=8 \ ---num_proc_data=8 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=2048 \ @@ -32,7 +30,7 @@ python -u test.py \ --test_manifest='data/librispeech/manifest.test-clean' \ --mean_std_path='data/librispeech/mean_std.npz' \ --vocab_path='data/librispeech/vocab.txt' \ ---model_path='checkpoints/libri/params.latest.tar.gz' \ +--model_path='checkpoints/libri/step_final' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/librispeech/run_test_golden.sh b/examples/librispeech/run_test_golden.sh index 41dbc0dae..abd895925 100644 --- a/examples/librispeech/run_test_golden.sh +++ b/examples/librispeech/run_test_golden.sh @@ -4,7 +4,7 @@ cd ../.. 
> /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/librispeech > /dev/null -sh download_model.sh +bash download_model.sh if [ $? -ne 0 ]; then exit 1 fi @@ -24,10 +24,8 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u test.py \ --batch_size=128 \ ---trainer_count=8 \ --beam_size=500 \ --num_proc_bsearch=8 \ ---num_proc_data=8 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=2048 \ @@ -41,7 +39,7 @@ python -u test.py \ --test_manifest='data/librispeech/manifest.test-clean' \ --mean_std_path='models/librispeech/mean_std.npz' \ --vocab_path='models/librispeech/vocab.txt' \ ---model_path='models/librispeech/params.tar.gz' \ +--model_path='models/librispeech' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/librispeech/run_train.sh b/examples/librispeech/run_train.sh index 87e08721b..a568bf221 100644 --- a/examples/librispeech/run_train.sh +++ b/examples/librispeech/run_train.sh @@ -3,17 +3,19 @@ cd ../.. 
> /dev/null # train model -# if you wish to resume from an exists model, uncomment --init_model_path +# if you wish to resume from an existing model, uncomment --init_from_pretrained_model +export FLAGS_sync_nccl_allreduce=0 + CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u train.py \ ---batch_size=160 \ ---trainer_count=8 \ ---num_passes=50 \ ---num_proc_data=16 \ +--batch_size=20 \ +--num_epoch=50 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=2048 \ --num_iter_print=100 \ +--save_epoch=1 \ +--num_samples=280000 \ --learning_rate=5e-4 \ --max_duration=27.0 \ --min_duration=0.0 \ @@ -30,7 +32,7 @@ python -u train.py \ --output_model_dir='./checkpoints/libri' \ --augment_conf_path='conf/augmentation.config' \ --specgram_type='linear' \ ---shuffle_method='batch_shuffle_clipped' +--shuffle_method='batch_shuffle_clipped' \ if [ $? -ne 0 ]; then echo "Failed in training!" diff --git a/examples/librispeech/run_tune.sh b/examples/librispeech/run_tune.sh index 9fc9cbb9d..af6e9dafd 100644 --- a/examples/librispeech/run_tune.sh +++ b/examples/librispeech/run_tune.sh @@ -7,7 +7,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 \ python -u tools/tune.py \ --num_batches=-1 \ --batch_size=128 \ ---trainer_count=4 \ --beam_size=500 \ --num_proc_bsearch=12 \ --num_conv_layers=2 \ @@ -27,7 +26,7 @@ python -u tools/tune.py \ --tune_manifest='data/librispeech/manifest.dev-clean' \ --mean_std_path='data/librispeech/mean_std.npz' \ --vocab_path='models/librispeech/vocab.txt' \ ---model_path='models/librispeech/params.tar.gz' \ +--model_path='models/librispeech' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --error_rate_type='wer' \ --specgram_type='linear' diff --git a/examples/tiny/run_data.sh b/examples/tiny/run_data.sh index ba55d284a..1428194d3 100644 --- a/examples/tiny/run_data.sh +++ b/examples/tiny/run_data.sh @@ -7,11 +7,10 @@ if [ !
-e data/tiny ]; then mkdir data/tiny fi - # download data, generate manifests PYTHONPATH=.:$PYTHONPATH python data/librispeech/librispeech.py \ --manifest_prefix='data/tiny/manifest' \ ---target_dir='~/.cache/paddle/dataset/speech/libri' \ +--target_dir='./dataset/librispeech' \ --full_download='False' if [ $? -ne 0 ]; then @@ -21,12 +20,11 @@ fi head -n 64 data/tiny/manifest.dev-clean > data/tiny/manifest.tiny - # build vocabulary python tools/build_vocab.py \ --count_threshold=0 \ --vocab_path='data/tiny/vocab.txt' \ ---manifest_paths='data/tiny/manifest.dev-clean' +--manifest_paths='data/tiny/manifest.tiny' if [ $? -ne 0 ]; then echo "Build vocabulary failed. Terminated." @@ -47,5 +45,5 @@ if [ $? -ne 0 ]; then fi -echo "Tiny data preparation done." +echo "LibriSpeech Data preparation done." exit 0 diff --git a/examples/tiny/run_infer.sh b/examples/tiny/run_infer.sh index 0cc140c8e..bded0e7b6 100644 --- a/examples/tiny/run_infer.sh +++ b/examples/tiny/run_infer.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? 
-ne 0 ]; then exit 1 fi @@ -15,7 +15,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0 \ python -u infer.py \ --num_samples=10 \ ---trainer_count=1 \ --beam_size=500 \ --num_proc_bsearch=8 \ --num_conv_layers=2 \ @@ -28,10 +27,10 @@ python -u infer.py \ --use_gru=False \ --use_gpu=True \ --share_rnn_weights=True \ ---infer_manifest='data/tiny/manifest.tiny' \ +--infer_manifest='data/tiny/manifest.test-clean' \ --mean_std_path='data/tiny/mean_std.npz' \ --vocab_path='data/tiny/vocab.txt' \ ---model_path='checkpoints/tiny/params.pass-19.tar.gz' \ +--model_path='./checkpoints/tiny/step_final' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/tiny/run_infer_golden.sh b/examples/tiny/run_infer_golden.sh index cf9aa84c9..33662622d 100644 --- a/examples/tiny/run_infer_golden.sh +++ b/examples/tiny/run_infer_golden.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/librispeech > /dev/null -sh download_model.sh +bash download_model.sh if [ $? 
-ne 0 ]; then exit 1 fi @@ -24,7 +24,6 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0 \ python -u infer.py \ --num_samples=10 \ ---trainer_count=1 \ --beam_size=500 \ --num_proc_bsearch=8 \ --num_conv_layers=2 \ @@ -40,7 +39,7 @@ python -u infer.py \ --infer_manifest='data/tiny/manifest.test-clean' \ --mean_std_path='models/librispeech/mean_std.npz' \ --vocab_path='models/librispeech/vocab.txt' \ ---model_path='models/librispeech/params.tar.gz' \ +--model_path='models/librispeech' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/tiny/run_test.sh b/examples/tiny/run_test.sh index a9fe5b936..1dfc65e19 100644 --- a/examples/tiny/run_test.sh +++ b/examples/tiny/run_test.sh @@ -4,7 +4,7 @@ cd ../.. > /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -14,11 +14,9 @@ cd - > /dev/null # evaluate model CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u test.py \ ---batch_size=16 \ ---trainer_count=8 \ +--batch_size=128 \ --beam_size=500 \ --num_proc_bsearch=8 \ ---num_proc_data=8 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=2048 \ @@ -29,10 +27,10 @@ python -u test.py \ --use_gru=False \ --use_gpu=True \ --share_rnn_weights=True \ ---test_manifest='data/tiny/manifest.tiny' \ +--test_manifest='data/tiny/manifest.test-clean' \ --mean_std_path='data/tiny/mean_std.npz' \ --vocab_path='data/tiny/vocab.txt' \ ---model_path='checkpoints/tiny/params.pass-19.tar.gz' \ +--model_path='checkpoints/tiny/step_final' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/tiny/run_test_golden.sh b/examples/tiny/run_test_golden.sh index e87ce6eef..542552657 100644 --- a/examples/tiny/run_test_golden.sh +++ b/examples/tiny/run_test_golden.sh @@ -4,7 +4,7 @@ cd ../.. 
> /dev/null # download language model cd models/lm > /dev/null -sh download_lm_en.sh +bash download_lm_en.sh if [ $? -ne 0 ]; then exit 1 fi @@ -13,7 +13,7 @@ cd - > /dev/null # download well-trained model cd models/librispeech > /dev/null -sh download_model.sh +bash download_model.sh if [ $? -ne 0 ]; then exit 1 fi @@ -24,10 +24,8 @@ cd - > /dev/null CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ python -u test.py \ --batch_size=128 \ ---trainer_count=8 \ --beam_size=500 \ --num_proc_bsearch=8 \ ---num_proc_data=8 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=2048 \ @@ -41,7 +39,7 @@ python -u test.py \ --test_manifest='data/tiny/manifest.test-clean' \ --mean_std_path='models/librispeech/mean_std.npz' \ --vocab_path='models/librispeech/vocab.txt' \ ---model_path='models/librispeech/params.tar.gz' \ +--model_path='models/librispeech' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --decoding_method='ctc_beam_search' \ --error_rate_type='wer' \ diff --git a/examples/tiny/run_train.sh b/examples/tiny/run_train.sh index e03a8aff0..95ccd2bc6 100644 --- a/examples/tiny/run_train.sh +++ b/examples/tiny/run_train.sh @@ -3,17 +3,18 @@ cd ../.. 
> /dev/null # train model -# if you wish to resume from an exists model, uncomment --init_model_path +# if you wish to resume from an existing model, uncomment --init_from_pretrained_model +export FLAGS_sync_nccl_allreduce=0 CUDA_VISIBLE_DEVICES=0,1,2,3 \ python -u train.py \ ---batch_size=16 \ ---trainer_count=4 \ ---num_passes=20 \ ---num_proc_data=1 \ +--batch_size=4 \ +--num_epoch=20 \ --num_conv_layers=2 \ --num_rnn_layers=3 \ --rnn_layer_size=2048 \ ---num_iter_print=100 \ +--num_iter_print=1 \ +--save_epoch=1 \ +--num_samples=64 \ --learning_rate=1e-5 \ --max_duration=27.0 \ --min_duration=0.0 \ @@ -30,7 +31,7 @@ python -u train.py \ --output_model_dir='./checkpoints/tiny' \ --augment_conf_path='conf/augmentation.config' \ --specgram_type='linear' \ ---shuffle_method='batch_shuffle_clipped' +--shuffle_method='batch_shuffle_clipped' \ if [ $? -ne 0 ]; then - echo "Fail in training!" + echo "Failed in training!" exit 1 fi diff --git a/examples/tiny/run_tune.sh b/examples/tiny/run_tune.sh index 89f8adf45..87bcb67b1 100644 --- a/examples/tiny/run_tune.sh +++ b/examples/tiny/run_tune.sh @@ -3,11 +3,10 @@ cd ../..
> /dev/null # grid-search for hyper-parameters in language model -CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ python -u tools/tune.py \ ---num_batches=1 \ ---batch_size=24 \ ---trainer_count=8 \ +--num_batches=-1 \ +--batch_size=128 \ --beam_size=500 \ --num_proc_bsearch=12 \ --num_conv_layers=2 \ @@ -24,10 +23,10 @@ python -u tools/tune.py \ --use_gru=False \ --use_gpu=True \ --share_rnn_weights=True \ ---tune_manifest='data/tiny/manifest.tiny' \ +--tune_manifest='data/tiny/manifest.dev-clean' \ --mean_std_path='data/tiny/mean_std.npz' \ --vocab_path='data/tiny/vocab.txt' \ ---model_path='checkpoints/params.pass-9.tar.gz' \ +--model_path='models/librispeech' \ --lang_model_path='models/lm/common_crawl_00.prune01111.trie.klm' \ --error_rate_type='wer' \ --specgram_type='linear' diff --git a/infer.py b/infer.py index f4d75685b..e43aa5266 100644 --- a/infer.py +++ b/infer.py @@ -3,11 +3,16 @@ from __future__ import absolute_import from __future__ import division from __future__ import print_function +import sys +reload(sys) +sys.setdefaultencoding('utf-8') + import argparse import functools -import paddle.v2 as paddle +import paddle.fluid as fluid from data_utils.data import DataGenerator from model_utils.model import DeepSpeech2Model +from model_utils.model_check import check_cuda, check_version from utils.error_rate import wer, cer from utils.utility import add_arguments, print_arguments @@ -15,7 +20,6 @@ parser = argparse.ArgumentParser(description=__doc__) add_arg = functools.partial(add_arguments, argparser=parser) # yapf: disable add_arg('num_samples', int, 10, "# of samples to infer.") -add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).") add_arg('beam_size', int, 500, "Beam search width.") add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.") add_arg('num_conv_layers', int, 2, "# of convolution layers.") @@ -63,20 +67,31 @@ args = parser.parse_args() def infer(): """Inference for DeepSpeech2.""" + + # check 
if set use_gpu=True in paddlepaddle cpu version + check_cuda(args.use_gpu) + # check if paddlepaddle version is satisfied + check_version() + + if args.use_gpu: + place = fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + data_generator = DataGenerator( vocab_filepath=args.vocab_path, mean_std_filepath=args.mean_std_path, augmentation_config='{}', specgram_type=args.specgram_type, - num_threads=1, - keep_transcription_text=True) + keep_transcription_text=True, + place = place, + is_training = False) batch_reader = data_generator.batch_reader_creator( manifest_path=args.infer_manifest, batch_size=args.num_samples, - min_batch_size=1, sortagrad=False, shuffle_method=None) - infer_data = batch_reader().next() + infer_data = next(batch_reader()) ds2_model = DeepSpeech2Model( vocab_size=data_generator.vocab_size, @@ -84,16 +99,19 @@ def infer(): num_rnn_layers=args.num_rnn_layers, rnn_layer_size=args.rnn_layer_size, use_gru=args.use_gru, - pretrained_model_path=args.model_path, - share_rnn_weights=args.share_rnn_weights) + share_rnn_weights=args.share_rnn_weights, + place=place, + init_from_pretrained_model=args.model_path) # decoders only accept string encoded in utf-8 vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list] if args.decoding_method == "ctc_greedy": ds2_model.logger.info("start inference ...") - probs_split = ds2_model.infer_batch_probs(infer_data=infer_data, + probs_split = ds2_model.infer_batch_probs( + infer_data=infer_data, feeding_dict=data_generator.feeding) + result_transcripts = ds2_model.decode_batch_greedy( probs_split=probs_split, vocab_list=vocab_list) @@ -101,9 +119,11 @@ def infer(): ds2_model.init_ext_scorer(args.alpha, args.beta, args.lang_model_path, vocab_list) ds2_model.logger.info("start inference ...") - probs_split = ds2_model.infer_batch_probs(infer_data=infer_data, + probs_split= ds2_model.infer_batch_probs( + infer_data=infer_data, feeding_dict=data_generator.feeding) - result_transcripts = 
ds2_model.decode_batch_beam_search( + + result_transcripts= ds2_model.decode_batch_beam_search( probs_split=probs_split, beam_alpha=args.alpha, beam_beta=args.beta, @@ -114,7 +134,7 @@ def infer(): num_processes=args.num_proc_bsearch) error_rate_func = cer if args.error_rate_type == 'cer' else wer - target_transcripts = [data[1] for data in infer_data] + target_transcripts = infer_data[1] for target, result in zip(target_transcripts, result_transcripts): print("\nTarget Transcription: %s\nOutput Transcription: %s" % (target, result)) @@ -125,9 +145,6 @@ def infer(): def main(): print_arguments(args) - paddle.init(use_gpu=args.use_gpu, - rnn_use_batch=True, - trainer_count=args.trainer_count) infer() diff --git a/model_utils/model.py b/model_utils/model.py index 4b3764bf2..fe1ae43d0 100644 --- a/model_utils/model.py +++ b/model_utils/model.py @@ -10,8 +10,13 @@ import logging import gzip import copy import inspect +import cPickle as pickle +import collections +import multiprocessing +import numpy as np from distutils.dir_util import mkpath -import paddle.v2 as paddle +import paddle.fluid as fluid +import paddle.fluid.compiler as compiler from decoders.swig_wrapper import Scorer from decoders.swig_wrapper import ctc_greedy_decoder from decoders.swig_wrapper import ctc_beam_search_decoder_batch @@ -32,37 +37,201 @@ class DeepSpeech2Model(object): :type num_rnn_layers: int :param rnn_layer_size: RNN layer size (number of RNN cells). :type rnn_layer_size: int - :param pretrained_model_path: Pretrained model path. If None, will train - from stratch. - :type pretrained_model_path: basestring|None + :param use_gru: Use gru if set True. Use simple rnn if set False. + :type use_gru: bool :param share_rnn_weights: Whether to share input-hidden weights between forward and backward directional RNNs.Notice that for GRU, weight sharing is not supported. :type share_rnn_weights: bool + :param place: Program running place. 
+ :type place: CPUPlace or CUDAPlace + :param init_from_pretrained_model: Pretrained model path. If None, will train + from scratch. + :type init_from_pretrained_model: string|None + :param output_model_dir: Output model directory. If None, output to current directory. + :type output_model_dir: string|None """ - def __init__(self, vocab_size, num_conv_layers, num_rnn_layers, - rnn_layer_size, use_gru, pretrained_model_path, - share_rnn_weights): - self._create_network(vocab_size, num_conv_layers, num_rnn_layers, - rnn_layer_size, use_gru, share_rnn_weights) - self._create_parameters(pretrained_model_path) - self._inferer = None - self._loss_inferer = None - self._ext_scorer = None + def __init__(self, + vocab_size, + num_conv_layers, + num_rnn_layers, + rnn_layer_size, + use_gru=False, + share_rnn_weights=True, + place=fluid.CPUPlace(), + init_from_pretrained_model=None, + output_model_dir=None): + self._vocab_size = vocab_size self._num_conv_layers = num_conv_layers + self._num_rnn_layers = num_rnn_layers + self._rnn_layer_size = rnn_layer_size + self._use_gru = use_gru + self._share_rnn_weights = share_rnn_weights + self._place = place + self._init_from_pretrained_model = init_from_pretrained_model + self._output_model_dir = output_model_dir + self._ext_scorer = None self.logger = logging.getLogger("") self.logger.setLevel(level=logging.INFO) + def create_network(self, is_infer=False): + """Create data layers and model network. + :param is_infer: Whether to create a network for inference. + :type is_infer: bool + :return reader: Reader for input. + :rtype reader: read generator + :return log_probs: An output unnormalized log probability layer. + :rtype log_probs: Variable + :return loss: A ctc loss layer. 
+ :rtype loss: Variable + """ + + if not is_infer: + input_fields = { + 'names': ['audio_data', 'text_data', 'seq_len_data', 'masks'], + 'shapes': + [[None, 161, None], [None, 1], [None, 1], [None, 32, 81, None]], + 'dtypes': ['float32', 'int32', 'int64', 'float32'], + 'lod_levels': [0, 1, 0, 0] + } + + inputs = [ + fluid.data( + name=input_fields['names'][i], + shape=input_fields['shapes'][i], + dtype=input_fields['dtypes'][i], + lod_level=input_fields['lod_levels'][i]) + for i in range(len(input_fields['names'])) + ] + + reader = fluid.io.DataLoader.from_generator( + feed_list=inputs, + capacity=64, + iterable=False, + use_double_buffer=True) + + (audio_data, text_data, seq_len_data, masks) = inputs + else: + audio_data = fluid.data( + name='audio_data', + shape=[None, 161, None], + dtype='float32', + lod_level=0) + seq_len_data = fluid.data( + name='seq_len_data', + shape=[None, 1], + dtype='int64', + lod_level=0) + masks = fluid.data( + name='masks', + shape=[None, 32, 81, None], + dtype='float32', + lod_level=0) + text_data = None + reader = fluid.DataFeeder([audio_data, seq_len_data, masks], + self._place) + + log_probs, loss = deep_speech_v2_network( + audio_data=audio_data, + text_data=text_data, + seq_len_data=seq_len_data, + masks=masks, + dict_size=self._vocab_size, + num_conv_layers=self._num_conv_layers, + num_rnn_layers=self._num_rnn_layers, + rnn_size=self._rnn_layer_size, + use_gru=self._use_gru, + share_rnn_weights=self._share_rnn_weights) + return reader, log_probs, loss + + def init_from_pretrained_model(self, exe, program): + '''Init params from pretrain model. 
''' + + assert isinstance(self._init_from_pretrained_model, str) + + if not os.path.exists(self._init_from_pretrained_model): + print(self._init_from_pretrained_model) + raise Warning("The pretrained params do not exist.") + return False + fluid.io.load_params( + exe, + self._init_from_pretrained_model, + main_program=program, + filename="params.pdparams") + + print("finish initing model from pretrained params from %s" % + (self._init_from_pretrained_model)) + + pre_epoch = 0 + dir_name = self._init_from_pretrained_model.split('_') + if len(dir_name) >= 2 and dir_name[-2].endswith('epoch') and dir_name[ + -1].isdigit(): + pre_epoch = int(dir_name[-1]) + + return pre_epoch + 1 + + def save_param(self, exe, program, dirname): + '''Save model params to dirname''' + + assert isinstance(self._output_model_dir, str) + + param_dir = os.path.join(self._output_model_dir) + + if not os.path.exists(param_dir): + os.mkdir(param_dir) + + fluid.io.save_params( + exe, + os.path.join(param_dir, dirname), + main_program=program, + filename="params.pdparams") + print("save parameters at %s" % (os.path.join(param_dir, dirname))) + + return True + + def test(self, exe, dev_batch_reader, test_program, test_reader, + fetch_list): + '''Test the model. + + :param exe: The executor of the program. + :type exe: Executor + :param dev_batch_reader: The reader of test data. + :type dev_batch_reader: read generator + :param test_program: The program of test. + :type test_program: Program + :param test_reader: Reader of test. + :type test_reader: Reader + :param fetch_list: Fetch list. + :type fetch_list: list + :return: Mean loss over the test data. 
+ :rtype: float + ''' + test_reader.start() + epoch_loss = [] + while True: + try: + each_loss = exe.run( + program=test_program, + fetch_list=fetch_list, + return_numpy=False) + epoch_loss.extend(np.array(each_loss[0])) + + except fluid.core.EOFException: + test_reader.reset() + break + return np.mean(np.array(epoch_loss)) + def train(self, train_batch_reader, dev_batch_reader, feeding_dict, learning_rate, gradient_clipping, - num_passes, - output_model_dir, - is_local=True, + num_epoch, + batch_size, + num_samples, + save_epoch=100, num_iterations_print=100, test_off=False): """Train the model. @@ -78,104 +247,138 @@ class DeepSpeech2Model(object): :type learning_rate: float :param gradient_clipping: Gradient clipping threshold. :type gradient_clipping: float - :param num_passes: Number of training epochs. - :type num_passes: int + :param num_epoch: Number of training epochs. + :type num_epoch: int + :param batch_size: Number of samples per batch. + :type batch_size: int + :param num_samples: Number of training samples. + :type num_samples: int + :param save_epoch: Number of training epochs between saving checkpoints and params. + :type save_epoch: int :param num_iterations_print: Number of training iterations for printing a training loss. - :type rnn_iteratons_print: int - :param is_local: Set to False if running with pserver with multi-nodes. - :type is_local: bool - :param output_model_dir: Directory for saving the model (every pass). - :type output_model_dir: basestring + :type num_iterations_print: int :param test_off: Turn off testing. 
:type test_off: bool """ # prepare model output directory - if not os.path.exists(output_model_dir): - mkpath(output_model_dir) + if not os.path.exists(self._output_model_dir): + mkpath(self._output_model_dir) - # adapt the feeding dict and reader according to the network + # adapt the feeding dict according to the network adapted_feeding_dict = self._adapt_feeding_dict(feeding_dict) - adapted_train_batch_reader = self._adapt_data(train_batch_reader) - adapted_dev_batch_reader = self._adapt_data(dev_batch_reader) - - # prepare optimizer and trainer - optimizer = paddle.optimizer.Adam( - learning_rate=learning_rate, - gradient_clipping_threshold=gradient_clipping) - trainer = paddle.trainer.SGD( - cost=self._loss, - parameters=self._parameters, - update_equation=optimizer, - is_local=is_local) - - # create event handler - def event_handler(event): - global start_time, cost_sum, cost_counter - if isinstance(event, paddle.event.EndIteration): - cost_sum += event.cost - cost_counter += 1 - if (event.batch_id + 1) % num_iterations_print == 0: - output_model_path = os.path.join(output_model_dir, - "params.latest.tar.gz") - with gzip.open(output_model_path, 'w') as f: - trainer.save_parameter_to_tar(f) - print("\nPass: %d, Batch: %d, TrainCost: %f" % - (event.pass_id, event.batch_id + 1, - cost_sum / cost_counter)) - cost_sum, cost_counter = 0.0, 0 - else: - sys.stdout.write('.') - sys.stdout.flush() - if isinstance(event, paddle.event.BeginPass): - start_time = time.time() - cost_sum, cost_counter = 0.0, 0 - if isinstance(event, paddle.event.EndPass): - if test_off: - print("\n------- Time: %d sec, Pass: %d" % - (time.time() - start_time, event.pass_id)) - else: - result = trainer.test( - reader=adapted_dev_batch_reader, - feeding=adapted_feeding_dict) - print( - "\n------- Time: %d sec, Pass: %d, " - "ValidationCost: %s" % - (time.time() - start_time, event.pass_id, result.cost)) - output_model_path = os.path.join( - output_model_dir, "params.pass-%d.tar.gz" % 
event.pass_id) - with gzip.open(output_model_path, 'w') as f: - trainer.save_parameter_to_tar(f) - - # run train - trainer.train( - reader=adapted_train_batch_reader, - event_handler=event_handler, - num_passes=num_passes, - feeding=adapted_feeding_dict) - - # TODO(@pkuyym) merge this function into infer_batch - def infer_loss_batch(self, infer_data): - """Model inference. Infer the ctc loss for a batch of speech - utterances. - - :param infer_data: List of utterances to infer, with each utterance a - tuple of audio features and transcription text (empty - string). - :type infer_data: list - :return: List of ctc loss. - :rtype: List of float - """ - # define inferer - if self._loss_inferer == None: - self._loss_inferer = paddle.inference.Inference( - output_layer=self._loss, parameters=self._parameters) - # run inference - return self._loss_inferer.infer(input=infer_data) + + if isinstance(self._place, fluid.CUDAPlace): + dev_count = fluid.core.get_cuda_device_count() + else: + dev_count = int(os.environ.get('CPU_NUM', 1)) + + # prepare the network + train_program = fluid.Program() + startup_prog = fluid.Program() + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_reader, log_probs, ctc_loss = self.create_network() + # prepare optimizer + optimizer = fluid.optimizer.AdamOptimizer( + learning_rate=fluid.layers.exponential_decay( + learning_rate=learning_rate, + decay_steps=num_samples / batch_size / dev_count, + decay_rate=0.83, + staircase=True)) + fluid.clip.set_gradient_clip( + clip=fluid.clip.GradientClipByGlobalNorm( + clip_norm=gradient_clipping)) + optimizer.minimize(loss=ctc_loss) + + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_reader, _, ctc_loss = self.create_network() + + test_prog = test_prog.clone(for_test=True) + + exe = fluid.Executor(self._place) + exe.run(startup_prog) + + # init from some pretrain models, to better solve the 
current task + pre_epoch = 0 + if self._init_from_pretrained_model: + pre_epoch = self.init_from_pretrained_model(exe, train_program) + + build_strategy = compiler.BuildStrategy() + exec_strategy = fluid.ExecutionStrategy() + + # pass the build_strategy to with_data_parallel API + compiled_prog = compiler.CompiledProgram( + train_program).with_data_parallel( + loss_name=ctc_loss.name, + build_strategy=build_strategy, + exec_strategy=exec_strategy) + + train_reader.set_batch_generator(train_batch_reader) + test_reader.set_batch_generator(dev_batch_reader) + + # run train + for epoch_id in range(num_epoch): + train_reader.start() + epoch_loss = [] + time_begin = time.time() + batch_id = 0 + step = 0 + while True: + try: + fetch_list = [ctc_loss.name] + + if batch_id % num_iterations_print == 0: + fetch = exe.run( + program=compiled_prog, + fetch_list=fetch_list, + return_numpy=False) + each_loss = fetch[0] + epoch_loss.extend(np.array(each_loss[0]) / batch_size) + + print("epoch: %d, batch: %d, train loss: %f\n" % + (epoch_id, batch_id, + np.mean(each_loss[0]) / batch_size)) + + else: + each_loss = exe.run( + program=compiled_prog, + fetch_list=[], + return_numpy=False) + + batch_id = batch_id + 1 + except fluid.core.EOFException: + train_reader.reset() + break + time_end = time.time() + used_time = time_end - time_begin + if test_off: + print("\n--------Time: %f sec, epoch: %d, train loss: %f\n" % + (used_time, epoch_id, np.mean(np.array(epoch_loss)))) + else: + print('\n----------Begin test...') + test_loss = self.test( + exe, + dev_batch_reader=dev_batch_reader, + test_program=test_prog, + test_reader=test_reader, + fetch_list=[ctc_loss]) + print( + "--------Time: %f sec, epoch: %d, train loss: %f, test loss: %f" + % (used_time, epoch_id + pre_epoch, + np.mean(np.array(epoch_loss)), test_loss / batch_size)) + if (epoch_id + 1) % save_epoch == 0: + self.save_param(exe, train_program, + "epoch_" + str(epoch_id + pre_epoch)) + + self.save_param(exe, train_program, 
"step_final") + + print("\n------------Training finished!!!-------------") def infer_batch_probs(self, infer_data, feeding_dict): """Infer the prob matrices for a batch of speech utterances. - :param infer_data: List of utterances to infer, with each utterance consisting of a tuple of audio features and transcription text (empty string). @@ -188,26 +391,55 @@ class DeepSpeech2Model(object): :rtype: List of matrix """ # define inferer - if self._inferer == None: - self._inferer = paddle.inference.Inference( - output_layer=self._log_probs, parameters=self._parameters) + infer_program = fluid.Program() + startup_prog = fluid.Program() + + # adapt the feeding dict according to the network adapted_feeding_dict = self._adapt_feeding_dict(feeding_dict) - adapted_infer_data = self._adapt_data(infer_data) + + # prepare the network + with fluid.program_guard(infer_program, startup_prog): + with fluid.unique_name.guard(): + feeder, log_probs, _ = self.create_network(is_infer=True) + + infer_program = infer_program.clone(for_test=True) + exe = fluid.Executor(self._place) + exe.run(startup_prog) + + # init param from pretrained_model + if not self._init_from_pretrained_model: + exit("No pretrain model file path!") + self.init_from_pretrained_model(exe, infer_program) + + infer_results = [] + time_begin = time.time() + # run inference - infer_results = self._inferer.infer( - input=adapted_infer_data, feeding=adapted_feeding_dict) - start_pos = [0] * (len(adapted_infer_data) + 1) - for i in xrange(len(adapted_infer_data)): - start_pos[i + 1] = start_pos[i] + adapted_infer_data[i][3][0] + for i in range(infer_data[0].shape[0]): + each_log_probs = exe.run( + program=infer_program, + feed=feeder.feed( + [[infer_data[0][i], infer_data[2][i], infer_data[3][i]]]), + fetch_list=[log_probs], + return_numpy=False) + infer_results.extend(np.array(each_log_probs[0])) + + # slice result + infer_results = np.array(infer_results) + seq_len = (infer_data[2] - 1) // 3 + 1 + + start_pos = [0] * 
(infer_data[0].shape[0] + 1) + for i in range(infer_data[0].shape[0]): + start_pos[i + 1] = start_pos[i] + seq_len[i][0] probs_split = [ infer_results[start_pos[i]:start_pos[i + 1]] - for i in xrange(0, len(adapted_infer_data)) + for i in range(0, infer_data[0].shape[0]) ] + return probs_split def decode_batch_greedy(self, probs_split, vocab_list): """Decode by best path for a batch of probs matrix input. - :param probs_split: List of 2-D probability matrix, and each consists of prob vectors for one speech utterancce. :param probs_split: List of matrix @@ -221,12 +453,12 @@ class DeepSpeech2Model(object): output_transcription = ctc_greedy_decoder( probs_seq=probs, vocabulary=vocab_list) results.append(output_transcription) + print(results) return results def init_ext_scorer(self, beam_alpha, beam_beta, language_model_path, vocab_list): """Initialize the external scorer. - :param beam_alpha: Parameter associated with language model. :type beam_alpha: float :param beam_beta: Parameter associated with word count. @@ -261,7 +493,6 @@ class DeepSpeech2Model(object): beam_size, cutoff_prob, cutoff_top_n, vocab_list, num_processes): """Decode by beam search for a batch of probs matrix input. - :param probs_split: List of 2-D probability matrix, and each consists of prob vectors for one speech utterancce. 
:param probs_split: List of matrix @@ -319,124 +550,16 @@ class DeepSpeech2Model(object): if isinstance(feeding_dict, dict): adapted_feeding_dict["sequence_offset"] = len(adapted_feeding_dict) adapted_feeding_dict["sequence_length"] = len(adapted_feeding_dict) - for i in xrange(self._num_conv_layers): + for i in range(self._num_conv_layers): adapted_feeding_dict["conv%d_index_range" %i] = \ len(adapted_feeding_dict) elif isinstance(feeding_dict, list): adapted_feeding_dict.append("sequence_offset") adapted_feeding_dict.append("sequence_length") - for i in xrange(self._num_conv_layers): + for i in range(self._num_conv_layers): adapted_feeding_dict.append("conv%d_index_range" % i) else: raise ValueError("Type of feeding_dict is %s, not supported." % type(feeding_dict)) return adapted_feeding_dict - - def _adapt_data(self, data): - """Adapt data according to network struct. - - For each convolution layer in the conv_group, to remove impacts from - padding data, we can multiply zero to the padding part of the outputs - of each batch normalization layer. We add a scale_sub_region layer after - each batch normalization layer to reset the padding data. - For rnn layers, to remove impacts from padding data, we can truncate the - padding part before output data feeded into the first rnn layer. We use - sub_seq layer to achieve this. - - :param data: Data from data_provider. - :type data: list|function - :return: Adapted data. 
- :rtype: list|function - """ - - def adapt_instance(instance): - if len(instance) < 2 or len(instance) > 3: - raise ValueError("Size of instance should be 2 or 3.") - padded_audio = instance[0] - text = instance[1] - # no padding part - if len(instance) == 2: - audio_len = padded_audio.shape[1] - else: - audio_len = instance[2] - adapted_instance = [padded_audio, text] - # Stride size for conv0 is (3, 2) - # Stride size for conv1 to convN is (1, 2) - # Same as the network, hard-coded here - padded_conv0_h = (padded_audio.shape[0] - 1) // 2 + 1 - padded_conv0_w = (padded_audio.shape[1] - 1) // 3 + 1 - valid_w = (audio_len - 1) // 3 + 1 - adapted_instance += [ - [0], # sequence offset, always 0 - [valid_w], # valid sequence length - # Index ranges for channel, height and width - # Please refer scale_sub_region layer to see details - [1, 32, 1, padded_conv0_h, valid_w + 1, padded_conv0_w] - ] - pre_padded_h = padded_conv0_h - for i in xrange(self._num_conv_layers - 1): - padded_h = (pre_padded_h - 1) // 2 + 1 - pre_padded_h = padded_h - adapted_instance += [ - [1, 32, 1, padded_h, valid_w + 1, padded_conv0_w] - ] - return adapted_instance - - if isinstance(data, list): - return map(adapt_instance, data) - elif inspect.isgeneratorfunction(data): - - def adapted_reader(): - for instance in data(): - yield map(adapt_instance, instance) - - return adapted_reader - else: - raise ValueError("Type of data is %s, not supported." % type(data)) - - def _create_parameters(self, model_path=None): - """Load or create model parameters.""" - if model_path is None: - self._parameters = paddle.parameters.create(self._loss) - else: - self._parameters = paddle.parameters.Parameters.from_tar( - gzip.open(model_path)) - - def _create_network(self, vocab_size, num_conv_layers, num_rnn_layers, - rnn_layer_size, use_gru, share_rnn_weights): - """Create data layers and model network.""" - # paddle.data_type.dense_array is used for variable batch input. 
- # The size 161 * 161 is only an placeholder value and the real shape - # of input batch data will be induced during training. - audio_data = paddle.layer.data( - name="audio_spectrogram", - type=paddle.data_type.dense_array(161 * 161)) - text_data = paddle.layer.data( - name="transcript_text", - type=paddle.data_type.integer_value_sequence(vocab_size)) - seq_offset_data = paddle.layer.data( - name='sequence_offset', - type=paddle.data_type.integer_value_sequence(1)) - seq_len_data = paddle.layer.data( - name='sequence_length', - type=paddle.data_type.integer_value_sequence(1)) - index_range_datas = [] - for i in xrange(num_rnn_layers): - index_range_datas.append( - paddle.layer.data( - name='conv%d_index_range' % i, - type=paddle.data_type.dense_vector(6))) - - self._log_probs, self._loss = deep_speech_v2_network( - audio_data=audio_data, - text_data=text_data, - seq_offset_data=seq_offset_data, - seq_len_data=seq_len_data, - index_range_datas=index_range_datas, - dict_size=vocab_size, - num_conv_layers=num_conv_layers, - num_rnn_layers=num_rnn_layers, - rnn_size=rnn_layer_size, - use_gru=use_gru, - share_rnn_weights=share_rnn_weights) diff --git a/model_utils/model_check.py b/model_utils/model_check.py new file mode 100644 index 000000000..bf2c424fd --- /dev/null +++ b/model_utils/model_check.py @@ -0,0 +1,49 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import sys +import paddle +import paddle.fluid as fluid + + +def check_cuda(use_cuda, err = \ + "\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \ + Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n" + ): + """ + Log error and exit when set use_gpu=true in paddlepaddle + cpu version. + """ + try: + if use_cuda == True and fluid.is_compiled_with_cuda() == False: + print(err) + sys.exit(1) + except Exception as e: + pass + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + print(err) + sys.exit(1) diff --git a/model_utils/network.py b/model_utils/network.py index 7b4b8ab20..3a4f1dc38 100644 --- a/model_utils/network.py +++ b/model_utils/network.py @@ -1,189 +1,323 @@ -"""Contains DeepSpeech2 layers and networks.""" from __future__ import absolute_import from __future__ import division from __future__ import print_function -import paddle.v2 as paddle +import collections +import paddle.fluid as fluid +import numpy as np def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride, - padding, act, index_range_data): + padding, act, masks, name): """Convolution layer with batch normalization. :param input: Input layer. - :type input: LayerOutput + :type input: Variable :param filter_size: The x dimension of a filter kernel. Or input a tuple for two image dimension. :type filter_size: int|tuple|list :param num_channels_in: Number of input channels. :type num_channels_in: int - :type num_channels_out: Number of output channels. - :type num_channels_in: out + :param num_channels_out: Number of output channels. 
+ :type num_channels_out: int + :param stride: The x dimension of the stride. Or input a tuple for two + image dimension. + :type stride: int|tuple|list :param padding: The x dimension of the padding. Or input a tuple for two image dimension. :type padding: int|tuple|list :param act: Activation type. - :type act: BaseActivation - :param index_range_data: Index range to indicate sub region. - :type index_range_data: LayerOutput + :type act: string + :param masks: Masks data layer to reset padding. + :type masks: Variable + :param name: Name of the layer. + :type name: string :return: Batch norm layer after convolution layer. - :rtype: LayerOutput + :rtype: Variable + """ - conv_layer = paddle.layer.img_conv( + conv_layer = fluid.layers.conv2d( input=input, - filter_size=filter_size, - num_channels=num_channels_in, num_filters=num_channels_out, + filter_size=filter_size, stride=stride, padding=padding, - act=paddle.activation.Linear(), + param_attr=fluid.ParamAttr(name=name + '_conv2d_weight'), + act=None, bias_attr=False) - batch_norm = paddle.layer.batch_norm(input=conv_layer, act=act) + + batch_norm = fluid.layers.batch_norm( + input=conv_layer, + act=act, + param_attr=fluid.ParamAttr(name=name + '_batch_norm_weight'), + bias_attr=fluid.ParamAttr(name=name + '_batch_norm_bias'), + moving_mean_name=name + '_batch_norm_moving_mean', + moving_variance_name=name + '_batch_norm_moving_variance') + # reset padding part to 0 - scale_sub_region = paddle.layer.scale_sub_region( - batch_norm, index_range_data, value=0.0) - return scale_sub_region + padding_reset = fluid.layers.elementwise_mul(batch_norm, masks) + return padding_reset + + +def simple_rnn(input, size, param_attr=None, bias_attr=None, is_reverse=False): + '''A simple rnn layer. + :param input: input layer. + :type input: Variable + :param size: Dimension of RNN cells. 
+ :type size: int + :param param_attr: Parameter properties of hidden layer weights that + can be learned + :type param_attr: ParamAttr + :param bias_attr: Bias properties of hidden layer weights that can be learned + :type bias_attr: ParamAttr + :param is_reverse: Whether to calculate the inverse RNN + :type is_reverse: bool + :return: A simple RNN layer. + :rtype: Variable + ''' + if is_reverse: + input = fluid.layers.sequence_reverse(x=input) + + pad_value = fluid.layers.assign(input=np.array([0.0], dtype=np.float32)) + input, length = fluid.layers.sequence_pad(input, pad_value) + rnn = fluid.layers.StaticRNN() + input = fluid.layers.transpose(input, [1, 0, 2]) + with rnn.step(): + in_ = rnn.step_input(input) + mem = rnn.memory(shape=[-1, size], batch_ref=in_) + out = fluid.layers.fc( + input=mem, + size=size, + act=None, + param_attr=param_attr, + bias_attr=bias_attr) + out = fluid.layers.elementwise_add(out, in_) + out = fluid.layers.brelu(out) + rnn.update_memory(mem, out) + rnn.output(out) + + out = rnn() + out = fluid.layers.transpose(out, [1, 0, 2]) + out = fluid.layers.sequence_unpad(x=out, length=length) + if is_reverse: + out = fluid.layers.sequence_reverse(x=out) + return out -def bidirectional_simple_rnn_bn_layer(name, input, size, act, share_weights): + +def bidirectional_simple_rnn_bn_layer(name, input, size, share_weights): """Bidirectonal simple rnn layer with sequence-wise batch normalization. The batch normalization is only performed on input-state weights. - - :param name: Name of the layer. + :param name: Name of the layer parameters. :type name: string :param input: Input layer. - :type input: LayerOutput - :param size: Number of RNN cells. + :type input: Variable + :param size: Dimension of RNN cells. :type size: int - :param act: Activation type. - :type act: BaseActivation :param share_weights: Whether to share input-hidden weights between forward and backward directional RNNs. 
     :type share_weights: bool
     :return: Bidirectional simple rnn layer.
-    :rtype: LayerOutput
+    :rtype: Variable
     """
     if share_weights:
-        # input-hidden weights shared between bi-direcitonal rnn.
-        input_proj = paddle.layer.fc(
+        #input-hidden weights shared between bi-directional rnn.
+        input_proj = fluid.layers.fc(
             input=input,
             size=size,
-            act=paddle.activation.Linear(),
+            act=None,
+            param_attr=fluid.ParamAttr(name=name + '_fc_weight'),
             bias_attr=False)
+
         # batch norm is only performed on input-state projection
-        input_proj_bn = paddle.layer.batch_norm(
-            input=input_proj, act=paddle.activation.Linear())
-        # forward and backward in time
-        forward_simple_rnn = paddle.layer.recurrent(
-            input=input_proj_bn, act=act, reverse=False)
-        backward_simple_rnn = paddle.layer.recurrent(
-            input=input_proj_bn, act=act, reverse=True)
+        input_proj_bn = fluid.layers.batch_norm(
+            input=input_proj,
+            act=None,
+            param_attr=fluid.ParamAttr(name=name + '_batch_norm_weight'),
+            bias_attr=fluid.ParamAttr(name=name + '_batch_norm_bias'),
+            moving_mean_name=name + '_batch_norm_moving_mean',
+            moving_variance_name=name + '_batch_norm_moving_variance')
+        #forward and backward in time
+        forward_rnn = simple_rnn(
+            input=input_proj_bn,
+            size=size,
+            param_attr=fluid.ParamAttr(name=name + '_forward_rnn_weight'),
+            bias_attr=fluid.ParamAttr(name=name + '_forward_rnn_bias'),
+            is_reverse=False)
+
+        reverse_rnn = simple_rnn(
+            input=input_proj_bn,
+            size=size,
+            param_attr=fluid.ParamAttr(name=name + '_reverse_rnn_weight'),
+            bias_attr=fluid.ParamAttr(name=name + '_reverse_rnn_bias'),
+            is_reverse=True)
     else:
-        input_proj_forward = paddle.layer.fc(
+        input_proj_forward = fluid.layers.fc(
             input=input,
             size=size,
-            act=paddle.activation.Linear(),
+            act=None,
+            param_attr=fluid.ParamAttr(name=name + '_forward_fc_weight'),
             bias_attr=False)
-        input_proj_backward = paddle.layer.fc(
+        input_proj_backward = fluid.layers.fc(
             input=input,
             size=size,
-            act=paddle.activation.Linear(),
+            act=None,
+            param_attr=fluid.ParamAttr(name=name + '_reverse_fc_weight'),
             bias_attr=False)
-        # batch norm is only performed on input-state projection
-        input_proj_bn_forward = paddle.layer.batch_norm(
-            input=input_proj_forward, act=paddle.activation.Linear())
-        input_proj_bn_backward = paddle.layer.batch_norm(
-            input=input_proj_backward, act=paddle.activation.Linear())
+        #batch norm is only performed on input-state projection
+        input_proj_bn_forward = fluid.layers.batch_norm(
+            input=input_proj_forward,
+            act=None,
+            param_attr=fluid.ParamAttr(
+                name=name + '_forward_batch_norm_weight'),
+            bias_attr=fluid.ParamAttr(name=name + '_forward_batch_norm_bias'),
+            moving_mean_name=name + '_forward_batch_norm_moving_mean',
+            moving_variance_name=name + '_forward_batch_norm_moving_variance')
+        input_proj_bn_backward = fluid.layers.batch_norm(
+            input=input_proj_backward,
+            act=None,
+            param_attr=fluid.ParamAttr(
+                name=name + '_reverse_batch_norm_weight'),
+            bias_attr=fluid.ParamAttr(name=name + '_reverse_batch_norm_bias'),
+            moving_mean_name=name + '_reverse_batch_norm_moving_mean',
+            moving_variance_name=name + '_reverse_batch_norm_moving_variance')
         # forward and backward in time
-        forward_simple_rnn = paddle.layer.recurrent(
-            input=input_proj_bn_forward, act=act, reverse=False)
-        backward_simple_rnn = paddle.layer.recurrent(
-            input=input_proj_bn_backward, act=act, reverse=True)
-
-    return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
+        forward_rnn = simple_rnn(
+            input=input_proj_bn_forward,
+            size=size,
+            param_attr=fluid.ParamAttr(name=name + '_forward_rnn_weight'),
+            bias_attr=fluid.ParamAttr(name=name + '_forward_rnn_bias'),
+            is_reverse=False)
+        reverse_rnn = simple_rnn(
+            input=input_proj_bn_backward,
+            size=size,
+            param_attr=fluid.ParamAttr(name=name + '_reverse_rnn_weight'),
+            bias_attr=fluid.ParamAttr(name=name + '_reverse_rnn_bias'),
+            is_reverse=True)
+    out = fluid.layers.concat(input=[forward_rnn, reverse_rnn], axis=1)
+    return out


 def bidirectional_gru_bn_layer(name, input, size, act):
     """Bidirectonal gru layer with sequence-wise batch normalization.
     The batch normalization is only performed on input-state weights.
-
     :param name: Name of the layer.
     :type name: string
     :param input: Input layer.
-    :type input: LayerOutput
-    :param size: Number of RNN cells.
+    :type input: Variable
+    :param size: Dimension of GRU cells.
     :type size: int
     :param act: Activation type.
-    :type act: BaseActivation
-    :return: Bidirectional simple rnn layer.
-    :rtype: LayerOutput
+    :type act: string
+    :return: Bidirectional GRU layer.
+    :rtype: Variable
     """
-    input_proj_forward = paddle.layer.fc(
+    input_proj_forward = fluid.layers.fc(
         input=input,
         size=size * 3,
-        act=paddle.activation.Linear(),
+        act=None,
+        param_attr=fluid.ParamAttr(name=name + '_forward_fc_weight'),
         bias_attr=False)
-    input_proj_backward = paddle.layer.fc(
+    input_proj_backward = fluid.layers.fc(
         input=input,
         size=size * 3,
-        act=paddle.activation.Linear(),
+        act=None,
+        param_attr=fluid.ParamAttr(name=name + '_reverse_fc_weight'),
         bias_attr=False)
-    # batch norm is only performed on input-related projections
-    input_proj_bn_forward = paddle.layer.batch_norm(
-        input=input_proj_forward, act=paddle.activation.Linear())
-    input_proj_bn_backward = paddle.layer.batch_norm(
-        input=input_proj_backward, act=paddle.activation.Linear())
-    # forward and backward in time
-    forward_gru = paddle.layer.grumemory(
-        input=input_proj_bn_forward, act=act, reverse=False)
-    backward_gru = paddle.layer.grumemory(
-        input=input_proj_bn_backward, act=act, reverse=True)
-    return paddle.layer.concat(input=[forward_gru, backward_gru])
+    #batch norm is only performed on input-related projections
+    input_proj_bn_forward = fluid.layers.batch_norm(
+        input=input_proj_forward,
+        act=None,
+        param_attr=fluid.ParamAttr(name=name + '_forward_batch_norm_weight'),
+        bias_attr=fluid.ParamAttr(name=name + '_forward_batch_norm_bias'),
+        moving_mean_name=name + '_forward_batch_norm_moving_mean',
+        moving_variance_name=name + '_forward_batch_norm_moving_variance')
+    input_proj_bn_backward = fluid.layers.batch_norm(
+        input=input_proj_backward,
+        act=None,
+        param_attr=fluid.ParamAttr(name=name + '_reverse_batch_norm_weight'),
+        bias_attr=fluid.ParamAttr(name=name + '_reverse_batch_norm_bias'),
+        moving_mean_name=name + '_reverse_batch_norm_moving_mean',
+        moving_variance_name=name + '_reverse_batch_norm_moving_variance')
+    #forward and backward in time
+    forward_gru = fluid.layers.dynamic_gru(
+        input=input_proj_bn_forward,
+        size=size,
+        gate_activation='sigmoid',
+        candidate_activation=act,
+        param_attr=fluid.ParamAttr(name=name + '_forward_gru_weight'),
+        bias_attr=fluid.ParamAttr(name=name + '_forward_gru_bias'),
+        is_reverse=False)
+    reverse_gru = fluid.layers.dynamic_gru(
+        input=input_proj_bn_backward,
+        size=size,
+        gate_activation='sigmoid',
+        candidate_activation=act,
+        param_attr=fluid.ParamAttr(name=name + '_reverse_gru_weight'),
+        bias_attr=fluid.ParamAttr(name=name + '_reverse_gru_bias'),
+        is_reverse=True)
+    return fluid.layers.concat(input=[forward_gru, reverse_gru], axis=1)


-def conv_group(input, num_stacks, index_range_datas):
+def conv_group(input, num_stacks, seq_len_data, masks):
     """Convolution group with stacked convolution layers.
-
     :param input: Input layer.
-    :type input: LayerOutput
+    :type input: Variable
     :param num_stacks: Number of stacked convolution layers.
     :type num_stacks: int
-    :param index_range_datas: Index ranges for each convolution layer.
-    :type index_range_datas: tuple|list
+    :param seq_len_data:Valid sequence length data layer.
+    :type seq_len_data:Variable
+    :param masks: Masks data layer to reset padding.
+    :type masks: Variable
     :return: Output layer of the convolution group.
-    :rtype: LayerOutput
+    :rtype: Variable
     """
+    filter_size = (41, 11)
+    stride = (2, 3)
+    padding = (20, 5)
     conv = conv_bn_layer(
         input=input,
-        filter_size=(11, 41),
+        filter_size=filter_size,
         num_channels_in=1,
         num_channels_out=32,
-        stride=(3, 2),
-        padding=(5, 20),
-        act=paddle.activation.BRelu(),
-        index_range_data=index_range_datas[0])
-    for i in xrange(num_stacks - 1):
+        stride=stride,
+        padding=padding,
+        act="brelu",
+        masks=masks,
+        name='layer_0', )
+
+    seq_len_data = (np.array(seq_len_data) - filter_size[1] + 2 * padding[1]
+                    ) // stride[1] + 1
+
+    output_height = (161 - 1) // 2 + 1
+
+    for i in range(num_stacks - 1):
+        #reshape masks
+        output_height = (output_height - 1) // 2 + 1
+        masks = fluid.layers.slice(
+            masks, axes=[2], starts=[0], ends=[output_height])
         conv = conv_bn_layer(
             input=conv,
-            filter_size=(11, 21),
+            filter_size=(21, 11),
             num_channels_in=32,
             num_channels_out=32,
-            stride=(1, 2),
-            padding=(5, 10),
-            act=paddle.activation.BRelu(),
-            index_range_data=index_range_datas[i + 1])
-    output_num_channels = 32
-    output_height = 160 // pow(2, num_stacks) + 1
-    return conv, output_num_channels, output_height
+            stride=(2, 1),
+            padding=(10, 5),
+            act="brelu",
+            masks=masks,
+            name='layer_{}'.format(i + 1), )
+    output_num_channels = 32
+    return conv, output_num_channels, output_height, seq_len_data


-def rnn_group(input, size, num_stacks, use_gru, share_rnn_weights):
-    """RNN group with stacked bidirectional simple RNN layers.
+def rnn_group(input, size, num_stacks, num_conv_layers, use_gru,
+              share_rnn_weights):
+    """RNN group with stacked bidirectional simple RNN or GRU layers.

     :param input: Input layer.
-    :type input: LayerOutput
-    :param size: Number of RNN cells in each layer.
+    :type input: Variable
+    :param size: Dimension of RNN cells in each layer.
     :type size: int
     :param num_stacks: Number of stacked rnn layers.
     :type num_stacks: int
@@ -194,32 +328,30 @@ def rnn_group(input, size, num_stacks, use_gru, share_rnn_weights):
                           It is only available when use_gru=False.
     :type share_weights: bool
     :return: Output layer of the RNN group.
-    :rtype: LayerOutput
+    :rtype: Variable
     """
     output = input
-    for i in xrange(num_stacks):
+    for i in range(num_stacks):
         if use_gru:
             output = bidirectional_gru_bn_layer(
-                name=str(i),
+                name='layer_{}'.format(i + num_conv_layers),
                 input=output,
                 size=size,
-                act=paddle.activation.Relu())
-            # BRelu does not support hppl, need to add later. Use Relu instead.
+                act="relu")
         else:
+            name = 'layer_{}'.format(i + num_conv_layers)
             output = bidirectional_simple_rnn_bn_layer(
-                name=str(i),
+                name=name,
                 input=output,
                 size=size,
-                act=paddle.activation.BRelu(),
                 share_weights=share_rnn_weights)
     return output


 def deep_speech_v2_network(audio_data,
                            text_data,
-                           seq_offset_data,
                            seq_len_data,
-                           index_range_datas,
+                           masks,
                            dict_size,
                            num_conv_layers=2,
                            num_rnn_layers=3,
@@ -227,24 +359,21 @@ def deep_speech_v2_network(audio_data,
                            use_gru=False,
                            share_rnn_weights=True):
     """The DeepSpeech2 network structure.
-
     :param audio_data: Audio spectrogram data layer.
-    :type audio_data: LayerOutput
+    :type audio_data: Variable
     :param text_data: Transcription text data layer.
-    :type text_data: LayerOutput
-    :param seq_offset_data: Sequence offset data layer.
-    :type seq_offset_data: LayerOutput
+    :type text_data: Variable
     :param seq_len_data: Valid sequence length data layer.
-    :type seq_len_data: LayerOutput
-    :param index_range_datas: Index ranges data layers.
-    :type index_range_datas: tuple|list
+    :type seq_len_data: Variable
+    :param masks: Masks data layer to reset padding.
+    :type masks: Variable
     :param dict_size: Dictionary size for tokenized transcription.
     :type dict_size: int
     :param num_conv_layers: Number of stacking convolution layers.
     :type num_conv_layers: int
     :param num_rnn_layers: Number of stacking RNN layers.
     :type num_rnn_layers: int
-    :param rnn_size: RNN layer size (number of RNN cells).
+    :param rnn_size: RNN layer size (dimension of RNN cells).
     :type rnn_size: int
     :param use_gru: Use gru if set True. Use simple rnn if set False.
     :type use_gru: bool
@@ -254,49 +383,53 @@ def deep_speech_v2_network(audio_data,
     :type share_weights: bool
     :return: A tuple of an output unnormalized log probability layer (
              before softmax) and a ctc cost layer.
-    :rtype: tuple of LayerOutput
+    :rtype: tuple of LayerOutput
     """
+    audio_data = fluid.layers.unsqueeze(audio_data, axes=[1])
+
     # convolution group
-    conv_group_output, conv_group_num_channels, conv_group_height = conv_group(
+    conv_group_output, conv_group_num_channels, conv_group_height, seq_len_data = conv_group(
         input=audio_data,
         num_stacks=num_conv_layers,
-        index_range_datas=index_range_datas)
+        seq_len_data=seq_len_data,
+        masks=masks)
+
     # convert data form convolution feature map to sequence of vectors
-    conv2seq = paddle.layer.block_expand(
-        input=conv_group_output,
-        num_channels=conv_group_num_channels,
-        stride_x=1,
-        stride_y=1,
-        block_x=1,
-        block_y=conv_group_height)
+    transpose = fluid.layers.transpose(conv_group_output, perm=[0, 3, 1, 2])
+    reshape_conv_output = fluid.layers.reshape(
+        x=transpose,
+        shape=[0, -1, conv_group_height * conv_group_num_channels],
+        inplace=False)
     # remove padding part
-    remove_padding_data = paddle.layer.sub_seq(
-        input=conv2seq,
-        offsets=seq_offset_data,
-        sizes=seq_len_data,
-        act=paddle.activation.Linear(),
-        bias_attr=False)
-    # rnn group
+    seq_len_data = fluid.layers.reshape(seq_len_data, [-1])
+    sequence = fluid.layers.sequence_unpad(
+        x=reshape_conv_output, length=seq_len_data)
+    #rnn group
     rnn_group_output = rnn_group(
-        input=remove_padding_data,
+        input=sequence,
         size=rnn_size,
         num_stacks=num_rnn_layers,
+        num_conv_layers=num_conv_layers,
         use_gru=use_gru,
         share_rnn_weights=share_rnn_weights)
-    fc = paddle.layer.fc(
+    fc = fluid.layers.fc(
         input=rnn_group_output,
         size=dict_size + 1,
-        act=paddle.activation.Linear(),
-        bias_attr=True)
-    # probability distribution with softmax
-    log_probs = paddle.layer.mixed(
-        input=paddle.layer.identity_projection(input=fc),
-        act=paddle.activation.Softmax())
-    # ctc cost
-    ctc_loss = paddle.layer.warp_ctc(
-        input=fc,
-        label=text_data,
-        size=dict_size + 1,
-        blank=dict_size,
-        norm_by_times=True)
-    return log_probs, ctc_loss
+        act=None,
+        param_attr=fluid.ParamAttr(
+            name='layer_{}'.format(num_conv_layers + num_rnn_layers) +
+            '_fc_weight'),
+        bias_attr=fluid.ParamAttr(
+            name='layer_{}'.format(num_conv_layers + num_rnn_layers) +
+            '_fc_bias'))
+    # probability distribution with softmax
+    log_probs = fluid.layers.softmax(fc)
+    log_probs.persistable = True
+    if not text_data:
+        return log_probs, None
+    else:
+        #ctc cost
+        ctc_loss = fluid.layers.warpctc(
+            input=fc, label=text_data, blank=dict_size, norm_by_times=True)
+        ctc_loss = fluid.layers.reduce_sum(ctc_loss)
+        return log_probs, ctc_loss
diff --git a/models/aishell/download_model.sh b/models/aishell/download_model.sh
index 1c4be79fa..76ac4d005 100644
--- a/models/aishell/download_model.sh
+++ b/models/aishell/download_model.sh
@@ -2,9 +2,9 @@

 . ../../utils/utility.sh

-URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model.tar.gz'
-MD5=0ee83aa15fba421e5de8fc66c8feb350
-TARGET=./aishell_model.tar.gz
+URL='https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_fluid.tar.gz'
+MD5=2bf0cc8b6d5da2a2a787b5cc36a496b5
+TARGET=./aishell_model_fluid.tar.gz


 echo "Download Aishell model ..."
diff --git a/models/baidu_en8k/download_model.sh b/models/baidu_en8k/download_model.sh
index 9ce672825..bbdb32b61 100644
--- a/models/baidu_en8k/download_model.sh
+++ b/models/baidu_en8k/download_model.sh
@@ -2,9 +2,9 @@

 . ../../utils/utility.sh

-URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model.tar.gz'
-MD5=5fe7639e720d51b3c3bdf7a1470c6272
-TARGET=./baidu_en8k_model.tar.gz
+URL='https://deepspeech.bj.bcebos.com/demo_models/baidu_en8k_model_fluid.tar.gz'
+MD5=7e58fbf64aa4ecf639b049792ddcf788
+TARGET=./baidu_en8k_model_fluid.tar.gz


 echo "Download BaiduEn8k model ..."
diff --git a/models/librispeech/download_model.sh b/models/librispeech/download_model.sh
index 123bcb818..edf853054 100644
--- a/models/librispeech/download_model.sh
+++ b/models/librispeech/download_model.sh
@@ -2,9 +2,9 @@

 . ../../utils/utility.sh

-URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model.tar.gz'
-MD5=1f72d0c5591f453362f0caa09dd57618
-TARGET=./librispeech_model.tar.gz
+URL='https://deepspeech.bj.bcebos.com/eng_models/librispeech_model_fluid.tar.gz'
+MD5=fafb11fe57c3ecd107147056453f5348
+TARGET=./librispeech_model_fluid.tar.gz


 echo "Download LibriSpeech model ..."
diff --git a/requirements.txt b/requirements.txt
index e104f633c..8c57208a6 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,4 @@
-scipy==0.13.1
+scipy==1.2.1
 resampy==0.1.5
 SoundFile==0.9.0.post1
 python_speech_features
diff --git a/test.py b/test.py
index e5a3346a0..12df55219 100644
--- a/test.py
+++ b/test.py
@@ -5,9 +5,10 @@ from __future__ import print_function

 import argparse
 import functools
-import paddle.v2 as paddle
+import paddle.fluid as fluid
 from data_utils.data import DataGenerator
 from model_utils.model import DeepSpeech2Model
+from model_utils.model_check import check_cuda, check_version
 from utils.error_rate import char_errors, word_errors
 from utils.utility import add_arguments, print_arguments

@@ -15,10 +16,8 @@ parser = argparse.ArgumentParser(description=__doc__)
 add_arg = functools.partial(add_arguments, argparser=parser)
 # yapf: disable
 add_arg('batch_size', int, 128, "Minibatch size.")
-add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
 add_arg('beam_size', int, 500, "Beam search width.")
 add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
-add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
 add_arg('num_conv_layers', int, 2, "# of convolution layers.")
 add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
 add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
@@ -64,17 +63,28 @@ args = parser.parse_args()

 def evaluate():
     """Evaluate on whole test data for DeepSpeech2."""
+
+    # check if set use_gpu=True in paddlepaddle cpu version
+    check_cuda(args.use_gpu)
+    # check if paddlepaddle version is satisfied
+    check_version()
+
+    if args.use_gpu:
+        place = fluid.CUDAPlace(0)
+    else:
+        place = fluid.CPUPlace()
+
     data_generator = DataGenerator(
         vocab_filepath=args.vocab_path,
         mean_std_filepath=args.mean_std_path,
         augmentation_config='{}',
         specgram_type=args.specgram_type,
-        num_threads=args.num_proc_data,
-        keep_transcription_text=True)
+        keep_transcription_text=True,
+        place = place,
+        is_training = False)
     batch_reader = data_generator.batch_reader_creator(
         manifest_path=args.test_manifest,
         batch_size=args.batch_size,
-        min_batch_size=1,
         sortagrad=False,
         shuffle_method=None)

@@ -84,8 +94,9 @@ def evaluate():
         num_rnn_layers=args.num_rnn_layers,
         rnn_layer_size=args.rnn_layer_size,
         use_gru=args.use_gru,
-        pretrained_model_path=args.model_path,
-        share_rnn_weights=args.share_rnn_weights)
+        share_rnn_weights=args.share_rnn_weights,
+        place=place,
+        init_from_pretrained_model=args.model_path)

     # decoders only accept string encoded in utf-8
     vocab_list = [chars.encode("utf-8") for chars in data_generator.vocab_list]
@@ -115,7 +126,7 @@ def evaluate():
             cutoff_top_n=args.cutoff_top_n,
             vocab_list=vocab_list,
             num_processes=args.num_proc_bsearch)
-        target_transcripts = [data[1] for data in infer_data]
+        target_transcripts = infer_data[1]

         for target, result in zip(target_transcripts, result_transcripts):
             errors, len_ref = errors_func(target, result)
@@ -131,9 +142,6 @@ def evaluate():

 def main():
     print_arguments(args)
-    paddle.init(use_gpu=args.use_gpu,
-                rnn_use_batch=True,
-                trainer_count=args.trainer_count)
     evaluate()


diff --git a/tools/profile.sh b/tools/profile.sh
index 19abe7ede..830a67615 100644
--- a/tools/profile.sh
+++ b/tools/profile.sh
@@ -9,14 +9,13 @@ function join_by { local IFS="$1"; shift; echo "$*"; }
 for NUM_GPUS in 16 8 4 2 1
 do
     DEVICES=$(join_by , $(seq 0 $(($NUM_GPUS-1))))
-    BATCH_SIZE=$(($BATCH_SIZE_PER_GPU * $NUM_GPUS))
+    BATCH_SIZE=$(($BATCH_SIZE_PER_GPU))

     CUDA_VISIBLE_DEVICES=$DEVICES \
     python train.py \
     --batch_size=$BATCH_SIZE \
-    --num_passes=1 \
+    --num_epoch=1 \
     --test_off=True \
-    --trainer_count=$NUM_GPUS \
     --min_duration=$MIN_DURATION \
     --max_duration=$MAX_DURATION > tmp.log 2>&1

@@ -24,7 +23,7 @@ do
         exit 1
     fi

-    cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$3}'
+    cat tmp.log | grep "Time" | awk '{print "GPU Num: " "'"$NUM_GPUS"'" " Time: "$2}'

     rm tmp.log
 done
diff --git a/tools/tune.py b/tools/tune.py
index da785189f..7996e4d53 100644
--- a/tools/tune.py
+++ b/tools/tune.py
@@ -10,7 +10,7 @@ import argparse
 import functools
 import gzip
 import logging
-import paddle.v2 as paddle
+import paddle.fluid as fluid
 import _init_paths
 from data_utils.data import DataGenerator
 from model_utils.model import DeepSpeech2Model
@@ -26,7 +26,6 @@ add_arg('batch_size', int, 256, "# of samples per batch.")
 add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
 add_arg('beam_size', int, 500, "Beam search width.")
 add_arg('num_proc_bsearch', int, 8, "# of CPUs for beam search.")
-add_arg('num_proc_data', int, 8, "# of CPUs for data preprocessing.")
 add_arg('num_conv_layers', int, 2, "# of convolution layers.")
 add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
 add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
@@ -77,13 +76,19 @@ def tune():
     if not args.num_betas >= 0:
         raise ValueError("num_betas must be non-negative!")

+    if args.use_gpu:
+        place = fluid.CUDAPlace(0)
+    else:
+        place = fluid.CPUPlace()
+
     data_generator = DataGenerator(
         vocab_filepath=args.vocab_path,
         mean_std_filepath=args.mean_std_path,
         augmentation_config='{}',
         specgram_type=args.specgram_type,
-        num_threads=args.num_proc_data,
-        keep_transcription_text=True)
+        keep_transcription_text=True,
+        place = place,
+        is_training = False)

     batch_reader = data_generator.batch_reader_creator(
         manifest_path=args.tune_manifest,
@@ -97,7 +102,8 @@ def tune():
         num_rnn_layers=args.num_rnn_layers,
         rnn_layer_size=args.rnn_layer_size,
         use_gru=args.use_gru,
-        pretrained_model_path=args.model_path,
+        place=place,
+        init_from_pretrained_model=args.model_path,
         share_rnn_weights=args.share_rnn_weights)

     # decoders only accept string encoded in utf-8
@@ -109,8 +115,8 @@ def tune():
     params_grid = [(alpha, beta) for alpha in cand_alphas
                    for beta in cand_betas]

-    err_sum = [0.0 for i in xrange(len(params_grid))]
-    err_ave = [0.0 for i in xrange(len(params_grid))]
+    err_sum = [0.0 for i in range(len(params_grid))]
+    err_ave = [0.0 for i in range(len(params_grid))]
     num_ins, len_refs, cur_batch = 0, 0, 0
     # initialize external scorer
     ds2_model.init_ext_scorer(args.alpha_from, args.beta_from,
@@ -123,7 +129,7 @@ def tune():
         probs_split = ds2_model.infer_batch_probs(
             infer_data=infer_data,
             feeding_dict=data_generator.feeding)
-        target_transcripts = [ data[1] for data in infer_data ]
+        target_transcripts = infer_data[1]

         num_ins += len(target_transcripts)
         # grid search
@@ -137,7 +143,6 @@ def tune():
                 cutoff_top_n=args.cutoff_top_n,
                 vocab_list=vocab_list,
                 num_processes=args.num_proc_bsearch)
-
             for target, result in zip(target_transcripts, result_transcripts):
                 errors, len_ref = errors_func(target, result)
                 err_sum[index] += errors
@@ -163,7 +168,7 @@ def tune():

     # output WER/CER at every (alpha, beta)
     print("\nFinal %s:\n" % args.error_rate_type)
-    for index in xrange(len(params_grid)):
+    for index in range(len(params_grid)):
         print("(alpha, beta) = (%s, %s), [%s] = %f"
             % ("%.3f" % params_grid[index][0], "%.3f" % params_grid[index][1],
             args.error_rate_type, err_ave[index]))
@@ -179,9 +184,6 @@ def tune():

 def main():
     print_arguments(args)
-    paddle.init(use_gpu=args.use_gpu,
-                rnn_use_batch=True,
-                trainer_count=args.trainer_count)
     tune()


diff --git a/train.py b/train.py
index 16415713f..5dae4ccdd 100644
--- a/train.py
+++ b/train.py
@@ -5,23 +5,26 @@ from __future__ import print_function

 import argparse
 import functools
-import paddle.v2 as paddle
+import io
 from model_utils.model import DeepSpeech2Model
+from model_utils.model_check import check_cuda, check_version
 from data_utils.data import DataGenerator
 from utils.utility import add_arguments, print_arguments

+import paddle.fluid as fluid
+
 parser = argparse.ArgumentParser(description=__doc__)
 add_arg = functools.partial(add_arguments, argparser=parser)
 # yapf: disable
 add_arg('batch_size', int, 256, "Minibatch size.")
-add_arg('trainer_count', int, 8, "# of Trainers (CPUs or GPUs).")
-add_arg('num_passes', int, 200, "# of training epochs.")
-add_arg('num_proc_data', int, 16, "# of CPUs for data preprocessing.")
+add_arg('num_epoch', int, 200, "# of training epochs.")
 add_arg('num_conv_layers', int, 2, "# of convolution layers.")
 add_arg('num_rnn_layers', int, 3, "# of recurrent layers.")
 add_arg('rnn_layer_size', int, 2048, "# of recurrent cells per layer.")
-add_arg('num_iter_print', int, 100, "Every # iterations for printing "
+add_arg('num_iter_print', int, 100, "Every # batch for printing "
         "train cost.")
+add_arg('save_epoch', int, 10, "# Every # batch for save checkpoint and model params ")
+add_arg('num_samples', int, 10000, "The num of train samples.")
 add_arg('learning_rate', float, 5e-4, "Learning rate.")
 add_arg('max_duration', float, 27.0, "Longest audio duration allowed.")
 add_arg('min_duration', float, 0.0, "Shortest audio duration allowed.")
@@ -31,7 +34,12 @@ add_arg('use_gpu', bool, True, "Use GPU or not.")
 add_arg('use_gru', bool, False, "Use GRUs instead of simple RNNs.")
 add_arg('is_local', bool, True, "Use pserver or not.")
 add_arg('share_rnn_weights',bool, True, "Share input-hidden weights across "
-        "bi-directional RNNs. Not for GRU.")
+        "bi-directional RNNs. Not for GRU.")
+add_arg('init_from_pretrained_model',str,
+        None,
+        "If None, the training starts from scratch, "
+        "otherwise, it resumes from the pre-trained model.")
+
 add_arg('train_manifest', str,
         'data/librispeech/manifest.train',
         "Filepath of train manifest.")
@@ -44,10 +52,6 @@ add_arg('mean_std_path', str,
 add_arg('vocab_path', str,
         'data/librispeech/vocab.txt',
         "Filepath of vocabulary.")
-add_arg('init_model_path', str,
-        None,
-        "If None, the training starts from scratch, "
-        "otherwise, it resumes from the pre-trained model.")
 add_arg('output_model_dir', str,
         "./checkpoints/libri",
         "Directory for saving checkpoints.")
@@ -68,30 +72,39 @@ args = parser.parse_args()

 def train():
     """DeepSpeech2 training."""
+
+    # check if set use_gpu=True in paddlepaddle cpu version
+    check_cuda(args.use_gpu)
+    # check if paddlepaddle version is satisfied
+    check_version()
+
+    if args.use_gpu:
+        place = fluid.CUDAPlace(0)
+    else:
+        place = fluid.CPUPlace()
+
     train_generator = DataGenerator(
         vocab_filepath=args.vocab_path,
         mean_std_filepath=args.mean_std_path,
-        augmentation_config=open(args.augment_conf_path, 'r').read(),
+        augmentation_config=io.open(args.augment_conf_path, mode='r', encoding='utf8').read(),
         max_duration=args.max_duration,
         min_duration=args.min_duration,
         specgram_type=args.specgram_type,
-        num_threads=args.num_proc_data)
+        place=place)
     dev_generator = DataGenerator(
         vocab_filepath=args.vocab_path,
         mean_std_filepath=args.mean_std_path,
         augmentation_config="{}",
         specgram_type=args.specgram_type,
-        num_threads=args.num_proc_data)
+        place = place)
     train_batch_reader = train_generator.batch_reader_creator(
         manifest_path=args.train_manifest,
         batch_size=args.batch_size,
-        min_batch_size=args.trainer_count,
-        sortagrad=args.use_sortagrad if args.init_model_path is None else False,
+        sortagrad=args.use_sortagrad if args.init_from_pretrained_model is None else False,
         shuffle_method=args.shuffle_method)
     dev_batch_reader = dev_generator.batch_reader_creator(
         manifest_path=args.dev_manifest,
         batch_size=args.batch_size,
-        min_batch_size=1, # must be 1, but will have errors.
         sortagrad=False,
         shuffle_method=None)

@@ -101,27 +114,27 @@ def train():
         num_rnn_layers=args.num_rnn_layers,
         rnn_layer_size=args.rnn_layer_size,
         use_gru=args.use_gru,
-        pretrained_model_path=args.init_model_path,
-        share_rnn_weights=args.share_rnn_weights)
+        share_rnn_weights=args.share_rnn_weights,
+        place=place,
+        init_from_pretrained_model=args.init_from_pretrained_model,
+        output_model_dir=args.output_model_dir)
+
     ds2_model.train(
         train_batch_reader=train_batch_reader,
         dev_batch_reader=dev_batch_reader,
         feeding_dict=train_generator.feeding,
         learning_rate=args.learning_rate,
         gradient_clipping=400,
-        num_passes=args.num_passes,
+        batch_size=args.batch_size,
+        num_samples=args.num_samples,
+        num_epoch=args.num_epoch,
+        save_epoch=args.save_epoch,
         num_iterations_print=args.num_iter_print,
-        output_model_dir=args.output_model_dir,
-        is_local=args.is_local,
         test_off=args.test_off)


 def main():
     print_arguments(args)
-    paddle.init(use_gpu=args.use_gpu,
-                rnn_use_batch=True,
-                trainer_count=args.trainer_count,
-                log_clipping=True)
     train()


diff --git a/utils/error_rate.py b/utils/error_rate.py
index 9aa900174..d84d9f875 100644
--- a/utils/error_rate.py
+++ b/utils/error_rate.py
@@ -36,15 +36,15 @@ def _levenshtein_distance(ref, hyp):
     distance = np.zeros((2, n + 1), dtype=np.int32)

     # initialize distance matrix
-    for j in xrange(n + 1):
+    for j in range(n + 1):
         distance[0][j] = j

     # calculate levenshtein distance
-    for i in xrange(1, m + 1):
+    for i in range(1, m + 1):
         prev_row_idx = (i - 1) % 2
         cur_row_idx = i % 2
         distance[cur_row_idx][0] = i
-        for j in xrange(1, n + 1):
+        for j in range(1, n + 1):
             if ref[i - 1] == hyp[j - 1]:
                 distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
             else: