Merge branch 'develop' of https://github.com/PaddlePaddle/DeepSpeech into new_config

pull/1297/head
huangyuxin 3 years ago
commit 50ceca9d56

@ -0,0 +1,2 @@
# Changelog

@ -171,7 +171,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 👏🏻 2021.12.10: PaddleSpeech CLI is available for Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese) and Text-to-Speech.
### Community
- Scan the QR code below with your WeChat to access the official technical exchange group. We look forward to your participation.
- Scan the QR code below with your WeChat (reply 【语音】 after your friend request is approved) to access the official technical exchange group. We look forward to your participation.
<div align="center">
<img src="https://raw.githubusercontent.com/yt605155624/lanceTest/main/images/wechat_4.jpg" width = "300" />

@ -175,7 +175,7 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
- 👏🏻 2021.12.10: PaddleSpeech CLI is now available! It covers audio classification, speech recognition, speech translation (English to Chinese), and text-to-speech.
### Technical Exchange Group
Scan the QR code with WeChat to join the official exchange group for more efficient Q&A and exchanges with developers from all industries. We look forward to your joining.
Scan the QR code with WeChat (reply 【语音】 after your friend request is approved) to join the official exchange group for more efficient Q&A and exchanges with developers from all industries. We look forward to your joining.
<div align="center">
<img src="https://raw.githubusercontent.com/yt605155624/lanceTest/main/images/wechat_4.jpg" width = "300" />

@ -6,7 +6,7 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
|:---- |:----------------------------------------------------------- |:----|
| Easy | (1) Use the command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac (does not support M1 chip), Windows |
| Medium | Support major functions, such as using the `ready-made` examples and using PaddleSpeech to train your own model. | Linux |
| Hard | Support the full functions of PaddleSpeech, including training an n-gram language model, Montreal-Forced-Aligner, and so on. And you are ready to become a developer! | Ubuntu |
| Hard | Support the full functions of PaddleSpeech, including using the join ctc decoder with Kaldi, training an n-gram language model, Montreal-Forced-Aligner, and so on. And you are ready to become a developer! | Ubuntu |
## Prerequisites
- Python >= 3.7
@ -52,12 +52,19 @@ sudo apt install build-essential
conda install -y -c conda-forge gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
### Install PaddleSpeech
You can use the following command:
Some users may fail to install `kaldiio` due to the default download source; you can install `pytest-runner` first:
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Then you can use the following commands:
```bash
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple
```
> If you encounter problems downloading **nltk_data** while using paddlespeech, it may be due to a poor network; we suggest you download the [nltk_data](https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz) provided by us and extract it to your `${HOME}` directory.
> If you fail to install paddlespeech-ctcdecoders, don't worry; it does not affect normal use.
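For reference, a minimal sketch of fetching the archive mentioned above and extracting it into `${HOME}` (assuming `wget` and `tar` are available) looks like this:
```bash
# Download the nltk_data archive provided by PaddleSpeech
wget https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz
# Extract it into your home directory so paddlespeech can find it
tar -xzf nltk_data.tar.gz -C ${HOME}
```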
## Medium: Get the Major Functions (Supports Linux)
If you want to get the major functions of `paddlespeech`, you need to do the following steps:
### Git clone PaddleSpeech
@ -117,6 +124,8 @@ python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/
### Install PaddleSpeech
You can install `paddlespeech` with the following commands, and then you can use the `ready-made` examples in `paddlespeech`:
```bash
# Some users may fail to install `kaldiio` due to the default download source; install `pytest-runner` first
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
# Make sure you are in the root directory of PaddleSpeech
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple
```
@ -182,8 +191,11 @@ conda activate tools/venv
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### Install PaddlePaddle
Some users may fail to install `kaldiio` due to the default download source; you can install `pytest-runner` first:
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Make sure you have a GPU and that the paddlepaddle version matches your environment. For example, for CUDA 10.2 and cuDNN 7.5, install paddle 2.2.0:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
```

@ -5,7 +5,7 @@
| :--- | :----------------------------------------------------------- | :------------------ |
| Easy | (1) Use the command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac (does not support M1 chip), Windows |
| Medium | Supports the major functions of PaddleSpeech, such as using the models in the ready-made examples and training your own model with PaddleSpeech. | Linux |
| Hard | Supports all functions of PaddleSpeech, including training a language model, using forced alignment, and so on. And you are ready to become a developer! | Ubuntu |
| Hard | Supports all functions of PaddleSpeech, including decoding with the join ctc decoder together with Kaldi, training a language model, using forced alignment, and so on. And you are ready to become a developer! | Ubuntu |
## Prerequisites
- Python >= 3.7
- The latest version of PaddlePaddle (see the [installation guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
@ -49,12 +49,19 @@ sudo apt install build-essential
conda install -y -c conda-forge gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
### Install PaddleSpeech
You can use the following commands:
Some users may fail to install kaldiio due to the default download source; it is recommended to install pytest-runner first:
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Then you can use the following commands:
```bash
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple
```
> If you encounter problems downloading **nltk_data** while using paddlespeech, it may be due to a poor network; we suggest you download the [nltk_data](https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz) we provide and extract it to your `${HOME}` directory.
> If paddlespeech-ctcdecoders fails to install, don't worry; it does not affect normal use.
## Medium: Get the Major Functions (Supports Linux)
If you want to use the major functions of `paddlespeech`, you need to complete the following steps:
### Git clone PaddleSpeech
@ -111,6 +118,8 @@ python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/
### Install PaddleSpeech
Finally, install `paddlespeech`, and then you can use the ready-made examples in `paddlespeech`:
```bash
# Some users may fail to install kaldiio due to the default download source; install pytest-runner first
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
# Make sure you are in the root directory of the PaddleSpeech project
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple
```
@ -176,6 +185,11 @@ conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developer Mode
Some users may fail to install kaldiio due to the default download source; it is recommended to install pytest-runner first:
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Then install PaddleSpeech:
```bash
pip install -e .[develop] -i https://pypi.tuna.tsinghua.edu.cn/simple
```

@ -21,7 +21,11 @@
"|FB-RAWs|Filter Bank Random Window Discriminators|\n",
"\n",
"<br></br>\n",
"Overall comparison of GAN Vocoders on the CSMSC dataset\n",
"The overall comparison of GAN Vocoders on the CSMSC dataset is as follows, \n",
"\n",
"Test machine: 1 x Tesla V100-32G, 40-core Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz\n",
"\n",
"Test environment: Python 3.7.0, paddlepaddle 2.2.0\n",
"\n",
"Model|Date|Input|Generator<br>Loss|Discriminator<br>Loss|Need<br>Finetune|Training<br>Steps|Finetune<br>Steps|Batch<br>Size|ips<br>(gen only)<br>(gen + dis)|Static Model<br>Size (gen)|RTF<br>(GPU)|\n",
":-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|\n",

@ -0,0 +1,159 @@
# How to Release a Package
## Using conda Instead of System Dependencies
conda can be used to replace some of the system dependencies that would otherwise be installed with apt-get, which makes the project usable on systems other than Ubuntu.
conda can install sox, libsndfile, swig, and the other dependencies that paddlespeech needs:
```bash
conda install -y -c conda-forge sox libsndfile
```
Some systems lack the libbzip2 library, which paddlespeech also needs; it can likewise be installed with conda:
```bash
conda install -y -c conda-forge bzip2
```
conda can also install the C++ toolchain dependencies on Linux:
```bash
conda install -y -c conda-forge gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
#### Remaining issue: compiling kenlm in a conda environment fails. Linking errors currently occur when building kenlm inside a conda environment.
Dependencies known to be required so far:
```bash
conda install -c conda-forge eigen boost cmake
```
## Building Python Packages
#### Create a PyPI account
Create an account on PyPI.
#### Install twine
```bash
pip install twine
```
#### Build the Python package
Write the package's setup.py, then build a wheel with the following command:
```bash
python setup.py bdist_wheel
```
To build a source distribution instead, use:
```bash
python setup.py sdist
```
#### Upload the package
```bash
twine upload dist/<wheel file>
```
After entering your account and password, the wheel package is uploaded.
#### More information on releasing Python packages
See this [document](https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/?highlight=find_packages) for details.
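As a minimal illustration of the build commands above (the package name, version, and dependency below are placeholders, not the real paddlespeech metadata), a `setup.py` might look like:
```python
# setup.py -- minimal sketch; all names and versions here are placeholders
from setuptools import find_packages, setup

setup(
    name="example_pkg",           # placeholder distribution name
    version="0.1.0",              # placeholder version
    packages=find_packages(),     # collect all Python packages in the project
    install_requires=["numpy"],   # placeholder runtime dependency
)
```
With this file in place, `python setup.py bdist_wheel` writes the wheel into `dist/`, which is the directory `twine upload` reads from.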
## Manylinux: Lowering the glibc Requirement of pip Packages with C++ Dependencies
To make a pip wheel that has C++ dependencies usable on more Linux systems, its glibc requirement needs to be lowered, which means building the wheel inside a manylinux docker image. To check a system's glibc version, run `ldd --version`.
### Manylinux
For details about manylinux, see the GitHub project [pypa/manylinux](https://github.com/pypa/manylinux).
manylinux1 supports CentOS 5 and above, manylinux2010 supports CentOS 6 and above, and manylinux2014 supports CentOS 7 and above.
Using manylinux2010 currently covers essentially all Linux production environments. manylinux1 is not recommended: the base system is old and building on it is difficult.
### Pull manylinux2010
```bash
docker pull quay.io/pypa/manylinux2010_x86_64
```
### Use manylinux2010
Start the manylinux2010 docker container:
```bash
docker run -it xxxxxx
```
The manylinux2010 docker image ships with swig and a range of Python versions. Note that you should not download conda yourself to set up a build environment; use the environments that come with the docker image to build the package.
Select the Python version:
```bash
export PATH="/opt/python/cp37-cp37m/bin/:$PATH"
#export PATH="/opt/python/cp38-cp38/bin/:$PATH"
#export PATH="/opt/python/cp39-cp39/bin/:$PATH"
```
Then build the package as usual. After building, use [auditwheel](https://github.com/pypa/auditwheel) to lower the platform tag of the built wheel.
Show the glibc requirement of a wheel:
```bash
auditwheel show <wheel file>
```
Lower the platform version of a wheel:
```bash
auditwheel repair <wheel file>
```
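Putting the steps together, a rough end-to-end sketch might look as follows; the `/io` mount point and the wheel paths are assumptions, not fixed conventions:
```bash
# Start the manylinux2010 container with the project mounted at /io (assumed mount point)
docker run -it -v "$(pwd)":/io quay.io/pypa/manylinux2010_x86_64 bash

# Inside the container: pick a Python, build, then inspect and repair the wheel
export PATH="/opt/python/cp37-cp37m/bin/:$PATH"
cd /io
python setup.py bdist_wheel
auditwheel show dist/*.whl    # inspect the platform/glibc tags of the built wheel
auditwheel repair dist/*.whl  # write a manylinux-tagged wheel into wheelhouse/
```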
## Separating install Mode and develop Mode
In setup.py you can split the install dependencies (the base dependencies) from the develop dependencies (extra dependencies for developers): `install_requires` in the setup info lists the install dependencies, while the `develop` key of `extras_require` lists the develop dependencies.
A normal install can be done with:
```bash
pip install .
```
Installing an already released package with pip is also a normal install:
```bash
pip install paddlespeech
```
Developers can install as follows, which installs not only the install dependencies but also the develop dependencies, i.e. the final dependencies = install dependencies + develop dependencies:
```bash
pip install -e .[develop]
```
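A sketch of how this split might look inside `setup.py` (the dependency names below are placeholders, not the actual paddlespeech requirement lists):
```python
# setup.py fragment -- placeholder dependency names for illustration only
from setuptools import find_packages, setup

setup(
    name="example_pkg",
    version="0.1.0",
    packages=find_packages(),
    # installed by `pip install .` and by `pip install example_pkg`
    install_requires=["numpy", "pyyaml"],
    # additionally installed by `pip install -e .[develop]`
    extras_require={"develop": ["pytest", "pre-commit"]},
)
```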
## Dynamic Installation of Python Packages
The pip package itself can be used to install packages dynamically at runtime:
```python
import pip
# `package_name` is the name of the package to install at runtime
if int(pip.__version__.split('.')[0]) > 9:
    # pip >= 10 moved `main` into the internal package
    from pip._internal import main
else:
    from pip import main
main(['install', package_name])
```
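An alternative sketch that avoids pip's internal API is to invoke pip through a subprocess; the helper name below is illustrative:
```python
import subprocess
import sys


def dynamic_install(package_name: str) -> None:
    """Install a package at runtime by running pip in a subprocess."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
```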

@ -15,11 +15,13 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
--dur-file=durations.txt \
--output-dir=dump_finetune \
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt \
--dataset=baker \
--rootdir=~/datasets/BZNSYP/
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 local/link_wav.py \
python3 link_wav.py \
--old-dump-dir=dump \
--dump-dir=dump_finetune
fi

@ -15,11 +15,13 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
--dur-file=durations.txt \
--output-dir=dump_finetune \
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt \
--dataset=baker \
--rootdir=~/datasets/BZNSYP/
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 local/link_wav.py \
python3 link_wav.py \
--old-dump-dir=dump \
--dump-dir=dump_finetune
fi

@ -1,85 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from operator import itemgetter
from pathlib import Path
import jsonlines
import numpy as np
def main():
# parse config and args
parser = argparse.ArgumentParser(
description="Preprocess audio and then extract features .")
parser.add_argument(
"--old-dump-dir",
default=None,
type=str,
help="directory to dump feature files.")
parser.add_argument(
"--dump-dir",
type=str,
required=True,
help="directory to finetune dump feature files.")
args = parser.parse_args()
old_dump_dir = Path(args.old_dump_dir).expanduser()
old_dump_dir = old_dump_dir.resolve()
dump_dir = Path(args.dump_dir).expanduser()
# use absolute path
dump_dir = dump_dir.resolve()
dump_dir.mkdir(parents=True, exist_ok=True)
assert old_dump_dir.is_dir()
assert dump_dir.is_dir()
for sub in ["train", "dev", "test"]:
# symlink the *-wave.npy files in old_dump_dir to the corresponding locations in dump_dir
output_dir = dump_dir / sub
output_dir.mkdir(parents=True, exist_ok=True)
results = []
for name in os.listdir(output_dir / "raw"):
# 003918_feats.npy
utt_id = name.split("_")[0]
mel_path = output_dir / ("raw/" + name)
gen_mel = np.load(mel_path)
wave_name = utt_id + "_wave.npy"
wav = np.load(old_dump_dir / sub / ("raw/" + wave_name))
os.symlink(old_dump_dir / sub / ("raw/" + wave_name),
output_dir / ("raw/" + wave_name))
num_sample = wav.shape[0]
num_frames = gen_mel.shape[0]
wav_path = output_dir / ("raw/" + wave_name)
record = {
"utt_id": utt_id,
"num_samples": num_sample,
"num_frames": num_frames,
"feats": str(mel_path),
"wave": str(wav_path),
}
results.append(record)
results.sort(key=itemgetter("utt_id"))
with jsonlines.open(output_dir / "raw/metadata.jsonl", 'w') as writer:
for item in results:
writer.write(item)
if __name__ == "__main__":
main()

@ -149,13 +149,13 @@ class DeepSpeech2Model(nn.Layer):
"""Compute Model loss
Args:
audio (Tenosr): [B, T, D]
audio (Tensor): [B, T, D]
audio_len (Tensor): [B]
text (Tensor): [B, U]
text_len (Tensor): [B]
Returns:
loss (Tenosr): [1]
loss (Tensor): [1]
"""
eouts, eouts_len = self.encoder(audio, audio_len)
loss = self.decoder(eouts, eouts_len, text, text_len)

@ -95,16 +95,16 @@ optional arguments:
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1) as the neural vocoder.
Download pretrained parallel wavegan model from [pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip) and unzip it.
Download pretrained parallel wavegan model from [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip) and unzip it.
```bash
unzip pwg_vctk_ckpt_0.5.zip
unzip pwg_vctk_ckpt_0.1.1.zip
```
Parallel WaveGAN checkpoint contains files listed below.
```text
pwg_vctk_ckpt_0.5
├── pwg_default.yaml # default config used to train parallel wavegan
├── pwg_snapshot_iter_1000000.pdz # generator parameters of parallel wavegan
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
pwg_vctk_ckpt_0.1.1
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_1500000.pdz # generator parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash

@ -12,9 +12,9 @@ python3 ${BIN_DIR}/../synthesize.py \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \
--voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \

@ -12,9 +12,9 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \
--voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \

@ -132,15 +132,15 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
Pretrained models can be downloaded here [pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip).
Pretrained models can be downloaded here [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip).
Parallel WaveGAN checkpoint contains files listed below.
```text
pwg_vctk_ckpt_0.5
├── pwg_default.yaml # default config used to train parallel wavegan
├── pwg_snapshot_iter_1000000.pdz # generator parameters of parallel wavegan
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
pwg_vctk_ckpt_0.1.1
├── default.yaml # default config used to train parallel wavegan
├── snapshot_iter_1500000.pdz # generator parameters of parallel wavegan
└── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -70,7 +70,7 @@ lambda_adv: 4.0 # Loss balancing coefficient.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 8 # Batch size.
batch_size: 6 # Batch size.
batch_max_steps: 24000 # Length of each audio in batch. Make sure dividable by n_shift.
num_workers: 2 # Number of workers in DataLoader.
@ -100,7 +100,7 @@ discriminator_grad_norm: 1 # Discriminator's gradient norm.
# INTERVAL SETTING #
###########################################################
discriminator_train_start_steps: 100000 # Number of steps to start to train discriminator.
train_max_steps: 1000000 # Number of training steps.
train_max_steps: 1500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.

@ -0,0 +1,2 @@
# Changelog

@ -31,6 +31,7 @@ from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.transform.transformation import Transformation
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@ -425,6 +426,7 @@ class ASRExecutor(BaseExecutor):
logger.exception(e)
return False
@stats_wrapper
def __call__(self,
audio_file: os.PathLike,
model: str='conformer_wenetspeech',

@ -26,6 +26,7 @@ from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
from paddleaudio import load
from paddleaudio.features import LogMelSpectrogram
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@ -245,6 +246,7 @@ class CLSExecutor(BaseExecutor):
logger.exception(e)
return False
@stats_wrapper
def __call__(self,
audio_file: os.PathLike,
model: str='panns_cnn14',

@ -30,6 +30,7 @@ from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.s2t.utils.utility import UpdateConfig
@ -334,6 +335,7 @@ class STExecutor(BaseExecutor):
logger.exception(e)
return False
@stats_wrapper
def __call__(self,
audio_file: os.PathLike,
model: str='fat_st_ted',

@ -26,6 +26,7 @@ from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
__all__ = ['TextExecutor']
@ -272,6 +273,7 @@ class TextExecutor(BaseExecutor):
logger.exception(e)
return False
@stats_wrapper
def __call__(
self,
text: str,

@ -29,6 +29,7 @@ from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
@ -155,15 +156,15 @@ pretrained_models = {
},
"pwgan_vctk-en": {
'url':
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip',
'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip',
'md5':
'322ca688aec9b127cec2788b65aa3d52',
'b3da1defcde3e578be71eb284cb89f2c',
'config':
'pwg_default.yaml',
'default.yaml',
'ckpt':
'pwg_snapshot_iter_1000000.pdz',
'snapshot_iter_1500000.pdz',
'speech_stats':
'pwg_stats.npy',
'feats_stats.npy',
},
# mb_melgan
"mb_melgan_csmsc-zh": {
@ -645,6 +646,7 @@ class TTSExecutor(BaseExecutor):
logger.exception(e)
return False
@stats_wrapper
def __call__(self,
text: str,
am: str='fastspeech2_csmsc',

@ -11,22 +11,36 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import hashlib
import inspect
import json
import os
import tarfile
import threading
import time
import uuid
import zipfile
from typing import Any
from typing import Dict
import paddle
import paddleaudio
import requests
import yaml
from paddle.framework import load
from . import download
from .. import __version__
from .entry import commands
requests.adapters.DEFAULT_RETRIES = 3
__all__ = [
'cli_register',
'get_command',
'download_and_decompress',
'load_state_dict_from_url',
'stats_wrapper',
]
@ -101,6 +115,13 @@ def download_and_decompress(archive: Dict[str, str], path: str) -> os.PathLike:
if not os.path.isdir(uncompress_path):
download._decompress(filepath)
else:
StatsWorker(
task='download',
version=__version__,
extra_info={
'download_url': archive['url'],
'paddle_version': paddle.__version__
}).start()
uncompress_path = download.get_path_from_url(archive['url'], path,
archive['md5'])
@ -146,3 +167,171 @@ def _get_sub_home(directory):
PPSPEECH_HOME = _get_paddlespcceh_home()
MODEL_HOME = _get_sub_home('models')
CONF_HOME = _get_sub_home('conf')
def _md5(text: str):
'''Calculate the md5 value of the input text.'''
md5code = hashlib.md5(text.encode())
return md5code.hexdigest()
class ConfigCache:
def __init__(self):
self._data = {}
self._initialize()
self.file = os.path.join(CONF_HOME, 'cache.yaml')
if not os.path.exists(self.file):
self.flush()
return
with open(self.file, 'r') as file:
try:
cfg = yaml.load(file, Loader=yaml.FullLoader)
self._data.update(cfg)
except:
self.flush()
@property
def cache_info(self):
return self._data['cache_info']
def _initialize(self):
# Set default configuration values.
cache_info = _md5(str(uuid.uuid1())[-12:]) + "-" + str(int(time.time()))
self._data['cache_info'] = cache_info
def flush(self):
'''Flush the current configuration into the configuration file.'''
with open(self.file, 'w') as file:
cfg = json.loads(json.dumps(self._data))
yaml.dump(cfg, file)
stats_api = "http://paddlepaddle.org.cn/paddlehub/stat"
cache_info = ConfigCache().cache_info
class StatsWorker(threading.Thread):
def __init__(self,
task="asr",
model=None,
version=__version__,
extra_info={}):
threading.Thread.__init__(self)
self._task = task
self._model = model
self._version = version
self._extra_info = extra_info
def run(self):
params = {
'task': self._task,
'version': self._version,
'from': 'ppspeech'
}
if self._model:
params['model'] = self._model
self._extra_info.update({
'cache_info': cache_info,
})
params.update({"extra": json.dumps(self._extra_info)})
try:
requests.get(stats_api, params)
except Exception:
pass
return
def _note_one_stat(cls_name, params={}):
task = cls_name.replace('Executor', '').lower() # XXExecutor
extra_info = {
'paddle_version': paddle.__version__,
}
if 'model' in params:
model = params['model']
else:
model = None
if 'audio_file' in params:
try:
_, sr = paddleaudio.load(params['audio_file'])
except Exception:
sr = -1
if task == 'asr':
extra_info.update({
'lang': params['lang'],
'inp_sr': sr,
'model_sr': params['sample_rate'],
})
elif task == 'st':
extra_info.update({
'lang':
params['src_lang'] + '-' + params['tgt_lang'],
'inp_sr':
sr,
'model_sr':
params['sample_rate'],
})
elif task == 'tts':
model = params['am']
extra_info.update({
'lang': params['lang'],
'vocoder': params['voc'],
})
elif task == 'cls':
extra_info.update({
'inp_sr': sr,
})
elif task == 'text':
extra_info.update({
'sub_task': params['task'],
'lang': params['lang'],
})
else:
return
StatsWorker(
task=task,
model=model,
version=__version__,
extra_info=extra_info, ).start()
def _parse_args(func, *args, **kwargs):
# FullArgSpec(args, varargs, varkw, defaults, kwonlyargs, kwonlydefaults, annotations)
argspec = inspect.getfullargspec(func)
keys = argspec[0]
if keys[0] == 'self': # Remove self pointer.
keys = keys[1:]
default_values = argspec[3]
values = [None] * (len(keys) - len(default_values))
values.extend(list(default_values))
params = dict(zip(keys, values))
for idx, v in enumerate(args):
params[keys[idx]] = v
for k, v in kwargs.items():
params[k] = v
return params
def stats_wrapper(executor_func):
def _warpper(self, *args, **kwargs):
try:
_note_one_stat(
type(self).__name__, _parse_args(executor_func, *args,
**kwargs))
except Exception:
pass
return executor_func(self, *args, **kwargs)
return _warpper

@ -62,7 +62,7 @@ class Scorer(object):
"""Evaluation function, gathering all the different scores
and return the final one.
:param sentence: The input sentence for evalutation
:param sentence: The input sentence for evaluation
:type sentence: str
:param log: Whether return the score in log representation.
:type log: bool

@ -183,7 +183,7 @@ std::vector<std::pair<double, std::string>> ctc_beam_search_decoder(
std::sort(
prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);
// compute aproximate ctc score as the return score, without affecting the
// compute approximate ctc score as the return score, without affecting the
// return order of decoding result. To delete when decoder gets stable.
for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
double approx_ctc = prefixes[i]->score;

@ -26,7 +26,7 @@ std::vector<std::pair<size_t, float>> get_pruned_log_probs(
for (size_t i = 0; i < prob_step.size(); ++i) {
prob_idx.push_back(std::pair<int, double>(i, prob_step[i]));
}
// pruning of vacobulary
// pruning of vocabulary
size_t cutoff_len = prob_step.size();
if (cutoff_prob < 1.0 || cutoff_top_n < cutoff_len) {
std::sort(prob_idx.begin(),

@ -223,7 +223,7 @@ void Scorer::fill_dictionary(bool add_space) {
* This gets rid of "epsilon" transitions in the FST.
* These are transitions that don't require a string input to be taken.
* Getting rid of them is necessary to make the FST determinisitc, but
* Getting rid of them is necessary to make the FST deterministic, but
* can greatly increase the size of the FST
*/
fst::RmEpsilon(&dictionary);

@ -154,7 +154,7 @@ class CTCPrefixScorer(BatchPartialScorerInterface):
Args:
state: The states of hyps
Returns: exteded state
Returns: extended state
"""
new_state = []

@ -11,7 +11,7 @@ class CTCPrefixScorePD():
which is based on Algorithm 2 in WATANABE et al.
"HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,"
but extended to efficiently compute the label probablities for multiple
but extended to efficiently compute the label probabilities for multiple
hypotheses simultaneously
See also Seki et al. "Vectorized Beam Search for CTC-Attention-Based
Speech Recognition," In INTERSPEECH (pp. 3825-3829), 2019.
@ -272,7 +272,7 @@ class CTCPrefixScore():
which is based on Algorithm 2 in WATANABE et al.
"HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,"
but extended to efficiently compute the probablities of multiple labels
but extended to efficiently compute the probabilities of multiple labels
simultaneously
"""

@ -238,7 +238,9 @@ class U2Trainer(Trainer):
preprocess_conf=config.preprocess_config,
n_iter_processes=config.num_workers,
subsampling_factor=1,
num_encs=1)
num_encs=1,
dist_sampler=False,
shortest_first=False)
self.valid_loader = BatchDataLoader(
json_file=config.dev_manifest,
@ -257,7 +259,9 @@ class U2Trainer(Trainer):
preprocess_conf=config.preprocess_config,
n_iter_processes=config.num_workers,
subsampling_factor=1,
num_encs=1)
num_encs=1,
dist_sampler=False,
shortest_first=False)
logger.info("Setup train/valid Dataloader!")
else:
decode_batch_size = config.get('decode', dict()).get(

@ -78,7 +78,8 @@ class BatchDataLoader():
load_aux_input: bool=False,
load_aux_output: bool=False,
num_encs: int=1,
dist_sampler: bool=False):
dist_sampler: bool=False,
shortest_first: bool=False):
self.json_file = json_file
self.train_mode = train_mode
self.use_sortagrad = sortagrad == -1 or sortagrad > 0
@ -97,6 +98,7 @@ class BatchDataLoader():
self.load_aux_input = load_aux_input
self.load_aux_output = load_aux_output
self.dist_sampler = dist_sampler
self.shortest_first = shortest_first
# read json data
with jsonlines.open(json_file, 'r') as reader:
@ -113,7 +115,7 @@ class BatchDataLoader():
maxlen_out,
minibatches, # for debug
min_batch_size=mini_batch_size,
shortest_first=self.use_sortagrad,
shortest_first=self.shortest_first or self.use_sortagrad,
count=batch_count,
batch_bins=batch_bins,
batch_frames_in=batch_frames_in,
@ -149,13 +151,13 @@ class BatchDataLoader():
self.reader)
if self.dist_sampler:
self.sampler = DistributedBatchSampler(
self.batch_sampler = DistributedBatchSampler(
dataset=self.dataset,
batch_size=1,
shuffle=not self.use_sortagrad if self.train_mode else False,
drop_last=False, )
else:
self.sampler = BatchSampler(
self.batch_sampler = BatchSampler(
dataset=self.dataset,
batch_size=1,
shuffle=not self.use_sortagrad if self.train_mode else False,
@ -163,7 +165,7 @@ class BatchDataLoader():
self.dataloader = DataLoader(
dataset=self.dataset,
batch_sampler=self.sampler,
batch_sampler=self.batch_sampler,
collate_fn=batch_collate,
num_workers=self.n_iter_processes, )
@ -194,5 +196,6 @@ class BatchDataLoader():
echo += f"load_aux_input: {self.load_aux_input}, "
echo += f"load_aux_output: {self.load_aux_output}, "
echo += f"dist_sampler: {self.dist_sampler}, "
echo += f"shortest_first: {self.shortest_first}, "
echo += f"file: {self.json_file}"
return echo

@ -151,13 +151,13 @@ class DeepSpeech2Model(nn.Layer):
"""Compute Model loss
Args:
audio (Tenosr): [B, T, D]
audio (Tensor): [B, T, D]
audio_len (Tensor): [B]
text (Tensor): [B, U]
text_len (Tensor): [B]
Returns:
loss (Tenosr): [1]
loss (Tensor): [1]
"""
eouts, eouts_len = self.encoder(audio, audio_len)
loss = self.decoder(eouts, eouts_len, text, text_len)

@ -279,13 +279,13 @@ class DeepSpeech2ModelOnline(nn.Layer):
"""Compute Model loss
Args:
audio (Tenosr): [B, T, D]
audio (Tensor): [B, T, D]
audio_len (Tensor): [B]
text (Tensor): [B, U]
text_len (Tensor): [B]
Returns:
loss (Tenosr): [1]
loss (Tensor): [1]
"""
eouts, eouts_len, final_state_h_box, final_state_c_box = self.encoder(
audio, audio_len, None, None)

@ -680,8 +680,8 @@ class U2BaseModel(ASRInterface, nn.Layer):
"""u2 decoding.
Args:
feats (Tenosr): audio features, (B, T, D)
feats_lengths (Tenosr): (B)
feats (Tensor): audio features, (B, T, D)
feats_lengths (Tensor): (B)
text_feature (TextFeaturizer): text feature object.
decoding_method (str): decoding mode, e.g.
'attention', 'ctc_greedy_search',

@ -478,8 +478,8 @@ class U2STBaseModel(nn.Layer):
"""u2 decoding.
Args:
feats (Tenosr): audio features, (B, T, D)
feats_lengths (Tenosr): (B)
feats (Tensor): audio features, (B, T, D)
feats_lengths (Tensor): (B)
text_feature (TextFeaturizer): text feature object.
decoding_method (str): decoding mode, e.g.
'fullsentence',

@ -39,10 +39,6 @@ except ImportError:
except Exception as e:
logger.info("paddlespeech_ctcdecoders not installed!")
#try:
#except Exception as e:
# logger.info("ctcdecoder not installed!")
__all__ = ['CTCDecoder']
@ -85,10 +81,10 @@ class CTCDecoderBase(nn.Layer):
Args:
hs_pad (Tensor): batch of padded hidden state sequences (B, Tmax, D)
hlens (Tensor): batch of lengths of hidden state sequences (B)
ys_pad (Tenosr): batch of padded character id sequence tensor (B, Lmax)
ys_pad (Tensor): batch of padded character id sequence tensor (B, Lmax)
ys_lens (Tensor): batch of lengths of character sequence (B)
Returns:
loss (Tenosr): ctc loss value, scalar.
loss (Tensor): ctc loss value, scalar.
"""
logits = self.ctc_lo(self.dropout(hs_pad))
loss = self.criterion(logits, ys_pad, hlens, ys_lens)
@ -256,8 +252,8 @@ class CTCDecoder(CTCDecoderBase):
"""ctc decoding with probs.
Args:
probs (Tenosr): activation after softmax
logits_lens (Tenosr): audio output lens
probs (Tensor): activation after softmax
logits_lens (Tensor): audio output lens
vocab_list ([type]): [description]
decoding_method ([type]): [description]
lang_model_path ([type]): [description]

@ -54,7 +54,7 @@ def make_pad_mask(lengths: paddle.Tensor) -> paddle.Tensor:
[0, 0, 0, 1, 1],
[0, 0, 1, 1, 1]]
"""
# (TODO: Hui Zhang): jit not support Tenosr.dim() and Tensor.ndim
# (TODO: Hui Zhang): jit not support Tensor.dim() and Tensor.ndim
# assert lengths.dim() == 1
batch_size = int(lengths.shape[0])
max_len = int(lengths.max())

@ -67,18 +67,19 @@ class WarmupLR(LRScheduler):
super().__init__(learning_rate, last_epoch, verbose)
def __repr__(self):
return f"{self.__class__.__name__}(warmup_steps={self.warmup_steps})"
return f"{self.__class__.__name__}(warmup_steps={self.warmup_steps}, lr={self.base_lr}, last_epoch={self.last_epoch})"
def get_lr(self):
# self.last_epoch start from zero
step_num = self.last_epoch + 1
return self.base_lr * self.warmup_steps**0.5 * min(
step_num**-0.5, step_num * self.warmup_steps**-1.5)
def set_step(self, step: int=None):
'''
It will update the learning rate in optimizer according to current ``epoch`` .
It will update the learning rate in optimizer according to current ``epoch`` .
The new learning rate will take effect on next ``optimizer.step`` .
Args:
step (int, None): specify current epoch. Default: None. Auto-increment from last_epoch=-1.
Returns:
@ -94,7 +95,7 @@ class ConstantLR(LRScheduler):
learning_rate (float): The initial learning rate. It is a python float number.
last_epoch (int, optional): The index of last epoch. Can be set to restart training. Default: -1, means initial learning rate.
verbose (bool, optional): If ``True``, prints a message to stdout for each update. Default: ``False`` .
Returns:
``ConstantLR`` instance to schedule learning rate.
"""

@ -222,7 +222,7 @@ class Trainer():
batch_sampler = self.train_loader.batch_sampler
if isinstance(batch_sampler, paddle.io.DistributedBatchSampler):
logger.debug(
f"train_loader.batch_sample set epoch: {self.epoch}")
f"train_loader.batch_sample.set_epoch: {self.epoch}")
batch_sampler.set_epoch(self.epoch)
def before_train(self):

@ -57,7 +57,7 @@ def filter_valid_args(args: Dict[Text, Any], valid_keys: List[Text]):
return new_args
def filter_out_tenosr(args: Dict[Text, Any]):
def filter_out_tensor(args: Dict[Text, Any]):
return {key: val for key, val in args.items() if not has_tensor(val)}
@ -65,5 +65,5 @@ def instance_class(module_class, args: Dict[Text, Any]):
valid_keys = inspect.signature(module_class).parameters.keys()
new_args = filter_valid_args(args, valid_keys)
logger.info(
f"Instance: {module_class.__name__} {filter_out_tenosr(new_args)}.")
f"Instance: {module_class.__name__} {filter_out_tensor(new_args)}.")
return module_class(**new_args)

@ -21,6 +21,8 @@ import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from tqdm import tqdm
import os
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
@ -30,6 +32,8 @@ from paddlespeech.t2s.modules.normalizer import ZScore
def evaluate(args, fastspeech2_config):
rootdir = Path(args.rootdir).expanduser()
assert rootdir.is_dir()
# construct dataset for evaluation
with open(args.phones_dict, "r") as f:
@ -41,9 +45,16 @@ def evaluate(args, fastspeech2_config):
for phn, id in phn_id:
phone_dict[phn] = int(id)
if args.speaker_dict:
with open(args.speaker_dict, 'rt') as f:
spk_id_list = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id_list)
else:
spk_num=None
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size, odim=odim, **fastspeech2_config["model"])
idim=vocab_size, odim=odim, **fastspeech2_config["model"], spk_num=spk_num)
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
@ -65,7 +76,34 @@ def evaluate(args, fastspeech2_config):
sentences, speaker_set = get_phn_dur(args.dur_file)
merge_silence(sentences)
for i, utt_id in enumerate(sentences):
if args.dataset == "baker":
wav_files = sorted(list((rootdir / "Wave").rglob("*.wav")))
# split data into 3 sections
num_train = 9800
num_dev = 100
train_wav_files = wav_files[:num_train]
dev_wav_files = wav_files[num_train:num_train + num_dev]
test_wav_files = wav_files[num_train + num_dev:]
elif args.dataset == "aishell3":
sub_num_dev = 5
wav_dir = rootdir / "train" / "wav"
train_wav_files = []
dev_wav_files = []
test_wav_files = []
for speaker in os.listdir(wav_dir):
wav_files = sorted(list((wav_dir / speaker).rglob("*.wav")))
if len(wav_files) > 100:
train_wav_files += wav_files[:-sub_num_dev * 2]
dev_wav_files += wav_files[-sub_num_dev * 2:-sub_num_dev]
test_wav_files += wav_files[-sub_num_dev:]
else:
train_wav_files += wav_files
train_wav_files = [os.path.basename(str(str_path)) for str_path in train_wav_files]
dev_wav_files = [os.path.basename(str(str_path)) for str_path in dev_wav_files]
test_wav_files = [os.path.basename(str(str_path)) for str_path in test_wav_files]
for i, utt_id in enumerate(tqdm(sentences)):
phones = sentences[utt_id][0]
durations = sentences[utt_id][1]
speaker = sentences[utt_id][2]
@ -82,21 +120,30 @@ def evaluate(args, fastspeech2_config):
phone_ids = [phone_dict[phn] for phn in phones]
phone_ids = paddle.to_tensor(np.array(phone_ids))
if args.speaker_dict:
speaker_id = int([item[1] for item in spk_id_list if speaker == item[0]][0])
speaker_id = paddle.to_tensor(speaker_id)
else:
speaker_id = None
durations = paddle.to_tensor(np.array(durations))
# the generated mel and the ground-truth mel may differ by 1 or 2 frames, but batch_fn will fix that
# split data into 3 sections
if args.dataset == "baker":
num_train = 9800
num_dev = 100
if i in range(0, num_train):
wav_path = utt_id + ".wav"
if wav_path in train_wav_files:
sub_output_dir = output_dir / ("train/raw")
elif i in range(num_train, num_train + num_dev):
elif wav_path in dev_wav_files:
sub_output_dir = output_dir / ("dev/raw")
else:
elif wav_path in test_wav_files:
sub_output_dir = output_dir / ("test/raw")
sub_output_dir.mkdir(parents=True, exist_ok=True)
with paddle.no_grad():
mel = fastspeech2_inference(phone_ids, durations=durations)
mel = fastspeech2_inference(phone_ids, durations=durations, spk_id=speaker_id)
np.save(sub_output_dir / (utt_id + "_feats.npy"), mel)
@ -109,6 +156,8 @@ def main():
default="baker",
type=str,
help="name of dataset, should be in {baker, ljspeech, vctk} now")
parser.add_argument(
"--rootdir", default=None, type=str, help="directory to dataset.")
parser.add_argument(
"--fastspeech2-config", type=str, help="fastspeech2 config file.")
parser.add_argument(
@ -126,6 +175,12 @@ def main():
type=str,
default="phone_id_map.txt",
help="phone vocabulary file.")
parser.add_argument(
"--speaker-dict",
type=str,
default=None,
help="speaker id map file.")
parser.add_argument(
"--dur-file", default=None, type=str, help="path to durations.txt.")

File diff suppressed because one or more lines are too long

@ -18,7 +18,7 @@ from pathlib import Path
import jsonlines
import numpy as np
from tqdm import tqdm
def main():
# parse config and args
@ -52,9 +52,9 @@ def main():
output_dir = dump_dir / sub
output_dir.mkdir(parents=True, exist_ok=True)
results = []
for name in os.listdir(output_dir / "raw"):
# 003918_feats.npy
utt_id = name.split("_")[0]
files = os.listdir(output_dir / "raw")
for name in tqdm(files):
utt_id = name.split("_feats.npy")[0]
mel_path = output_dir / ("raw/" + name)
gen_mel = np.load(mel_path)
wave_name = utt_id + "_wave.npy"