Merge branch 'develop' of https://github.com/PaddlePaddle/PaddleSpeech into add_new_tacotron2, test=tts

pull/1314/head
TianYuan 4 years ago
commit 1bf1a876ae

@@ -32,6 +32,12 @@ pull_request_rules:
    actions:
      label:
        remove: ["conflicts"]
+  - name: "auto add label=Dataset"
+    conditions:
+      - files~=^dataset/
+    actions:
+      label:
+        add: ["Dataset"]
  - name: "auto add label=S2T"
    conditions:
      - files~=^paddlespeech/s2t/
@@ -50,18 +56,30 @@ pull_request_rules:
    actions:
      label:
        add: ["Audio"]
-  - name: "auto add label=TextProcess"
+  - name: "auto add label=Vector"
+    conditions:
+      - files~=^paddlespeech/vector/
+    actions:
+      label:
+        add: ["Vector"]
+  - name: "auto add label=Text"
    conditions:
      - files~=^paddlespeech/text/
    actions:
      label:
-        add: ["TextProcess"]
+        add: ["Text"]
  - name: "auto add label=Example"
    conditions:
      - files~=^examples/
    actions:
      label:
        add: ["Example"]
+  - name: "auto add label=CLI"
+    conditions:
+      - files~=^paddlespeech/cli
+    actions:
+      label:
+        add: ["CLI"]
  - name: "auto add label=Demo"
    conditions:
      - files~=^demos/
@@ -70,13 +88,13 @@ pull_request_rules:
        add: ["Demo"]
  - name: "auto add label=README"
    conditions:
-      - files~=README.md
+      - files~=(README.md|READEME_cn.md)
    actions:
      label:
        add: ["README"]
  - name: "auto add label=Documentation"
    conditions:
-      - files~=^docs/
+      - files~=^(docs/|CHANGELOG.md|paddleaudio/CHANGELOG.md)
    actions:
      label:
        add: ["Documentation"]
@@ -88,10 +106,16 @@ pull_request_rules:
        add: ["CI"]
  - name: "auto add label=Installation"
    conditions:
-      - files~=^(tools/|setup.py|setup.sh)
+      - files~=^(tools/|setup.py|setup.cfg|setup_audio.py)
    actions:
      label:
        add: ["Installation"]
+  - name: "auto add label=Test"
+    conditions:
+      - files~=^(tests/)
+    actions:
+      label:
+        add: ["Test"]
  - name: "auto add label=mergify"
    conditions:
      - files~=^.mergify.yml
@@ -106,7 +130,7 @@ pull_request_rules:
        add: ["Docker"]
  - name: "auto add label=Deployment"
    conditions:
-      - files~=^speechnn/
+      - files~=^speechx/
    actions:
      label:
        add: ["Deployment"]

@@ -1,2 +1,11 @@
# Changelog
Date: 2022-1-10, Author: Jackwaterveg.
Added features to the CLI:
- Support English (librispeech/asr1/transformer).
- Support choosing `decode_method` for conformer and transformer models.
- Refactor the config, using the unified config.
- PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1297
***
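To make the new CLI options concrete, here is a usage sketch that simply combines the English model and the `decode_method` flag introduced by this PR (the values are the argparse choices shown later in this diff):
```bash
# English ASR with an explicit decoding method (both added in this release)
paddlespeech asr --model transformer_librispeech --lang en \
    --decode_method attention_rescoring --input ./en.wav
```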

@@ -23,7 +23,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
### 3. Usage
- Command Line (Recommended)
```bash
+# Chinese
paddlespeech asr --input ./zh.wav
+# English
+paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
```
(It doesn't matter if the package `paddlespeech-ctcdecoders` is not found; this package is optional.)
@@ -43,7 +46,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
```bash
+# Chinese
[2021-12-08 13:12:34,063] [ INFO] [utils.py] [L225] - ASR Result: 我认为跑步最重要的就是给我带来了身体健康
+# English
+[2022-01-12 11:51:10,815] [ INFO] - ASR Result: i knocked at the door on the ancient side of the building
```
- Python API
@@ -77,3 +83,4 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by
| Model | Language | Sample Rate |
| :--- | :---: | :---: |
| conformer_wenetspeech | zh | 16000 |
+| transformer_librispeech | en | 16000 |
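As a companion to the command-line examples above, here is a minimal Python API sketch. It is based on the `ASRExecutor` class modified later in this commit; the import path and keyword names are inferred from that file, so treat the snippet as illustrative rather than official documentation.
```python
from paddlespeech.cli.asr.infer import ASRExecutor  # assumed module path of the CLI executor

asr = ASRExecutor()
# The English model and decode_method are the options added in this release.
result = asr(
    audio_file="./en.wav",
    model="transformer_librispeech",
    lang="en",
    sample_rate=16000,
    decode_method="attention_rescoring")
print(result)
```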

@@ -2,7 +2,7 @@
# Speech Recognition
## Introduction
-Speech recognition solves the problem of having a computer program automatically transcribe speech.
+Speech recognition is a technique that uses a computer program to automatically transcribe speech.
This demo is an implementation that recognizes text from a given audio file; it can be done with a single `PaddleSpeech` command or a few lines of Python code.
## Usage
@@ -21,7 +21,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
### 3. Usage
- Command line (recommended)
```bash
+# Chinese
paddlespeech asr --input ./zh.wav
+# English
+paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
```
(If an error says the Python package `paddlespeech-ctcdecoders` cannot be found, that is fine; this package is optional.)
@@ -41,7 +44,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
```bash
+# Chinese
[2021-12-08 13:12:34,063] [ INFO] [utils.py] [L225] - ASR Result: 我认为跑步最重要的就是给我带来了身体健康
+# English
+[2022-01-12 11:51:10,815] [ INFO] - ASR Result: i knocked at the door on the ancient side of the building
```
- Python API
@@ -74,3 +80,4 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
| Model | Language | Sample Rate |
| :--- | :---: | :---: |
| conformer_wenetspeech | zh | 16000 |
+| transformer_librispeech | en | 16000 |

@@ -1,40 +0,0 @@
# Data Augmentation Pipeline
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment the speech data by synthesizing new audio with small random perturbations (label-invariant transformations) applied to the raw audio. You don't have to do the synthesis yourself, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
The following optional augmentation components are provided to be selected, configured, and inserted into the processing pipeline.
* Audio
- Volume Perturbation
- Speed Perturbation
- Shifting Perturbation
- Online Bayesian normalization
- Noise Perturbation (need background noise audio files)
- Impulse Response (need impulse audio files)
* Feature
- SpecAugment
- Adaptive SpecAugment
To inform the trainer of what augmentation components are needed and what their processing orders are, it is required to prepare in advance an *augmentation configuration file* in [JSON](http://www.json.org/) format. For example:
```
[{
"type": "speed",
"params": {"min_speed_rate": 0.95,
"max_speed_rate": 1.05},
"prob": 0.6
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 0.8
}]
```
When the `augment_conf_file` argument is set to the path of the above example configuration file, every audio clip in every epoch is processed as follows: with a 60% chance it is first speed-perturbed with a speed rate sampled uniformly between 0.95 and 1.05, and then with an 80% chance it is shifted in time by a random offset between -5 ms and 5 ms. Finally, the newly synthesized audio clip is fed into the feature extractor for further training.
For other configuration examples, please refer to `examples/conf/augmentation.example.json`.
Be careful when utilizing the data augmentation technique, as improper augmentation will harm the training, due to the enlarged train-test gap.

@@ -27,7 +27,6 @@ Contents
   asr/models_introduction
   asr/data_preparation
-   asr/augmentation
   asr/feature_list
   asr/ngram_lm

@@ -5,14 +5,13 @@
### Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
-[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/aishell_ds2_online_cer8.00_release.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
-[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
-[Conformer Online Aishell ASR1 Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz) | Aishell Dataset | Char-based | 283 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0594 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
-[Conformer Offline Aishell ASR1 Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0547 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
-[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
-[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../example/librispeech/asr1)
-[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/transformer.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../example/librispeech/asr1)
-[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../example/librispeech/asr2)
+[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
+[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
+[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.056 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
+[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1)
+[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0337 | 960 h | [Conformer Librispeech ASR1](../../example/librispeech/asr1)
+[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../example/librispeech/asr1)
+[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../example/librispeech/asr2)
### Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions
@@ -25,7 +24,7 @@ Language Model | Training Data | Token-based | Size | Descriptions
| Model | Training Data | Token-based | Size | Descriptions | BLEU | Example Link |
| :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
-| [Transformer FAT-ST MTL En-Zh](https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/fat_st_ted-en-zh.tar.gz) | Ted-En-Zh| Spm| | Encoder:Transformer, Decoder:Transformer, <br />Decoding method: Attention | 20.80 | [Transformer Ted-En-Zh ST1](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/ted_en_zh/st1) |
+| (only for CLI)[Transformer FAT-ST MTL En-Zh](https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz) | Ted-En-Zh| Spm| | Encoder:Transformer, Decoder:Transformer, <br />Decoding method: Attention | 20.80 | [Transformer Ted-En-Zh ST1](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/ted_en_zh/st1) |
## Text-to-Speech Models

@@ -0,0 +1,42 @@
# TTS Papers
## Text Frontend
### Polyphone
- [【g2pM】g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset](https://arxiv.org/abs/2004.03136)
- [Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT](https://www1.se.cuhk.edu.hk/~hccl/publications/pub/201909_INTERSPEECH_DongyangDAI.pdf)
### Text Normalization
#### English
- [applenob/text_normalization](https://github.com/applenob/text_normalization)
### G2P
#### English
- [cmusphinx/g2p-seq2seq](https://github.com/cmusphinx/g2p-seq2seq)
## Acoustic Models
- [【AdaSpeech3】AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style](https://arxiv.org/abs/2107.02530)
- [【AdaSpeech2】AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data](https://arxiv.org/abs/2104.09715)
- [【AdaSpeech】AdaSpeech: Adaptive Text to Speech for Custom Voice](https://arxiv.org/abs/2103.00993)
- [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
- [【FastPitch】FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)
- [【SpeedySpeech】SpeedySpeech: Efficient Neural Speech Synthesis](https://arxiv.org/abs/2008.03802)
- [【FastSpeech】FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
- [【Transformer TTS】Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
- [【Tacotron2】Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
## Vocoders
- [【RefineGAN】RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses](https://arxiv.org/abs/2111.00962)
- [【Fre-GAN】Fre-GAN: Adversarial Frequency-consistent Audio Synthesis](https://arxiv.org/abs/2106.02297)
- [【StyleMelGAN】StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization](https://arxiv.org/abs/2011.01557)
- [【Multi-band MelGAN】Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech](https://arxiv.org/abs/2005.05106)
- [【HiFi-GAN】HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)
- [【VocGAN】VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network](https://arxiv.org/abs/2007.15256)
- [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
- [【MelGAN】MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis](https://arxiv.org/abs/1910.06711)
- [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- [【LPCNet】LPCNet: Improving Neural Speech Synthesis Through Linear Prediction](https://arxiv.org/abs/1810.11846)
- [【WaveRNN】Efficient Neural Audio Synthesis](https://arxiv.org/abs/1802.08435)
## GAN TTS
- [【GAN TTS】High Fidelity Speech Synthesis with Adversarial Networks](https://arxiv.org/abs/1909.11646)
## Voice Cloning
- [【SV2TTS】Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis](https://arxiv.org/abs/1806.04558)
- [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)

@@ -1,30 +1,38 @@
-# How to Release Packages
+# Simplified Installation and Package Release
+## Questions:
+1. [How can the apt-installed system dependencies on Ubuntu be removed?](#conda-代替系统依赖)
+2. [How can both regular-user and developer installs be supported, while minimizing the dependencies a regular user needs?](#区分install模式和develop模式)
+3. [How can Python packages be installed dynamically?](#python-包的动态安装)
+4. [How is a Python project built into a package?](#python-编包方法)
+5. [What preparation is needed before a release?](#关于发包前的准备工作)
+6. [What needs attention when releasing a package that contains C++?](#manylinux)
## Using conda Instead of System Dependencies
conda can be used to replace some of the system dependencies installed with apt-get, so that the project works on systems other than Ubuntu.
conda can install the dependencies that paddlespeech needs, such as sox, libsndfile, and swig:
```bash
conda install -y -c conda-forge sox libsndfile
```
Some systems lack the libbzip2 library, which paddlespeech also needs; it can likewise be installed with conda:
```bash
conda install -y -c bzip2
```
conda can also install the Linux C++ toolchain dependencies:
```bash
conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
#### Remaining issue: building kenlm in a conda environment fails; compiling kenlm there currently ends with link errors.
Known required dependencies so far:
@@ -32,7 +40,39 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
conda install -c conda-forge eigen boost cmake
```
## Distinguishing Install Mode from Develop Mode
In setup.py, the install dependencies (the basic ones) can be separated from the develop dependencies (the extra ones for developers). In the setup info, `install_requires` sets the install dependencies, while the `develop` key inside `extras_require` sets the develop dependencies.
A regular installation uses:
```bash
pip install .
```
Installing an already released package with pip is also a regular installation:
```
pip install paddlespeech
```
Developers can instead install as follows, which installs not only the install dependencies but also the develop dependencies, i.e. final dependencies = install dependencies + develop dependencies (a minimal setup.py sketch follows below):
```bash
pip install -e .[develop]
```
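To make the install/develop split concrete, here is a minimal, hypothetical `setup.py` sketch; the package name and dependency lists are placeholders, not PaddleSpeech's actual metadata:
```python
# Hypothetical setup.py showing install vs. develop dependencies.
from setuptools import find_packages, setup

setup(
    name="example_pkg",              # placeholder project name
    version="0.1.1",
    packages=find_packages(),
    # installed by `pip install .` or `pip install example_pkg`
    install_requires=["numpy", "pyyaml"],
    # installed only when requested, e.g. `pip install -e .[develop]`
    extras_require={"develop": ["pytest", "pre-commit"]},
)
```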
## Dynamic Installation of Python Packages
The pip package itself can be used to install another package at runtime:
```python
import pip
if int(pip.__version__.split('.')[0]) > 9:
    from pip._internal import main
else:
    from pip import main
# package_name is the distribution to install, e.g. 'paddlespeech'
main(['install', package_name])
```
## Building a Python Package
@@ -48,7 +88,7 @@ pip install twine
#### Building the wheel
After writing the package's setup.py, build the wheel package with:
```bash
python setup.py bdist_wheel
@@ -66,21 +106,44 @@ python setup.py sdist
twine upload dist/<wheel file>
```
After entering the account name and password, the wheel package is uploaded.
#### Release metadata for a Python package
The main reference is this [document](https://packaging.python.org/en/latest/guides/distributing-packages-using-setuptools/?highlight=find_packages).
## Preparation Before a Release
#### Create a release branch
A branch needs to be created before releasing. For example, to release the official 0.1.0 package, create an r0.1 branch and put the 0.1.0 tag on that branch. Before creating the branch, an rc version (e.g. 0.1.0rc0) can optionally be released as a trial package; once the rc package passes testing, create the branch. To release a 0.1.1 package, merge into the r0.1 branch, tag it, and complete the release.
For the tagging commands, see [Git Basics](https://git-scm.com/book/zh/v2/Git-%E5%9F%BA%E7%A1%80-%E6%89%93%E6%A0%87%E7%AD%BE); a lightweight tag is enough:
```bash
git tag r0.1.1 commit_id
```
Then use `git push` to push the local tag to the remote repo:
```bash
git push origin r0.1.1
```
After tagging, remember to write the release note.
In summary, the preparation steps are:
- Release an rc package from the develop branch
- After the rc package passes, create the release branch
- Tag it
- Release the package
- Write the release note
-## Manylinux: lowering the glibc requirement of pip packages with C++ dependencies
+## ManyLinux
To make a pip wheel that contains C++ dependencies usable on more Linux systems, its own glibc requirement has to be lowered, which means building the wheel inside a manylinux docker image. To check a system's glibc version, run `ldd --version`.
### Manylinux
For Manylinux, see the notes in the GitHub project [pypa/manylinux](https://github.com/pypa/manylinux).
manylinux1 supports CentOS 5 and above, manylinux2010 supports CentOS 6 and above, and manylinux2014 supports CentOS 7 and above.
Currently manylinux2010 basically covers every Linux production environment. manylinux1 is not recommended: the system is old and building on it is difficult.
@@ -98,7 +161,7 @@ docker pull quay.io/pypa/manylinux1_x86_64
docker run -it xxxxxx
```
-The Many Linux 2010 docker environment ships with swig and a range of Python versions. Do not download conda yourself to set up the build environment; use the environment provided by the docker image to build the pip package.
+The manylinux2010 docker environment ships with swig and a range of Python versions. Do not download conda yourself to set up the build environment; use the environment provided by the docker image to build the pip package.
Set up Python:
```bash
@@ -119,41 +182,3 @@ auditwheel show <wheel file>
```bash
auditwheel repair <wheel file>
```
## Distinguishing Install Mode from Develop Mode
In setup.py, the install dependencies (the basic ones) can be separated from the develop dependencies (the extra ones for developers). In the setup info, `install_requires` sets the install dependencies, while the `develop` key inside `extras_require` sets the develop dependencies.
A regular installation uses:
```bash
pip install .
```
Installing an already released package with pip is also a regular installation:
```
pip install paddlespeech
```
Developers can install as follows, which installs not only the install dependencies but also the develop dependencies, i.e. final dependencies = install dependencies + develop dependencies:
```bash
pip install -e .[develop]
```
## Dynamic Installation of Python Packages
The pip package can be used for dynamic installation:
```python
import pip
if int(pip.__version__.split('.')[0]) > 9:
    from pip._internal import main
else:
    from pip import main
main(['install', package_name])
```

@@ -3,9 +3,9 @@ decode_batch_size: 128
error_rate_type: cer
decoding_method: attention  # 'attention', 'ctc_greedy_search', 'ctc_prefix_beam_search', 'attention_rescoring'
ctc_weight: 0.5  # ctc weight for attention rescoring decode mode.
-decoding_chunk_size: -1  # decoding chunk size. Defaults to -1.
+decoding_chunk_size: 16  # decoding chunk size. Defaults to -1.
    # <0: for decoding, use full chunk.
    # >0: for decoding, use fixed chunk size as set.
    # 0: used for training, it's prohibited here.
num_decoding_left_chunks: -1  # number of left chunks for decoding. Defaults to -1.
-simulate_streaming: False  # simulate streaming inference. Defaults to False.
+simulate_streaming: True  # simulate streaming inference. Defaults to False.

@@ -257,6 +257,7 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
    --output_dir=exp/default/test_e2e \
    --phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
    --speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
-    --spk_id=0
+    --spk_id=0 \
+    --inference_dir=exp/default/inference
```

@@ -0,0 +1,19 @@
#!/bin/bash
train_output_path=$1
stage=0
stop_stage=0
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_aishell3 \
--voc=pwgan_aishell3 \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0
fi

@@ -20,4 +20,5 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
    --output_dir=${train_output_path}/test_e2e \
    --phones_dict=dump/phone_id_map.txt \
    --speaker_dict=dump/speaker_id_map.txt \
-    --spk_id=0
+    --spk_id=0 \
+    --inference_dir=${train_output_path}/inference

@@ -21,7 +21,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
-    python3 link_wav.py \
+    python3 ${MAIN_ROOT}/utils/link_wav.py \
        --old-dump-dir=dump \
        --dump-dir=dump_finetune
fi

@@ -21,7 +21,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
-    python3 link_wav.py \
+    python3 ${MAIN_ROOT}/utils/link_wav.py \
        --old-dump-dir=dump \
        --dump-dir=dump_finetune
fi

@@ -10,11 +10,11 @@ Run the command below to get the results of the test.
```bash
./run.sh
```
-The `avg WER` of g2p is: 0.027495061517943988
+The `avg WER` of g2p is: 0.027124048652822204
```text
,--------------------------------------------------------------------.
| | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+-----------------+-----------------------------------------|
-| Sum/Avg| 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 |
+| Sum/Avg| 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.2 |
`--------------------------------------------------------------------'
```

@@ -7,11 +7,11 @@ Run the command below to get the results of the test.
```bash
./run.sh
```
-The `avg CER` of text normalization is: 0.006388318503308237
+The `avg CER` of text normalization is: 0.00730093543235227
```text
,-----------------------------------------------------------------.
| | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+--------------+-----------------------------------------|
-| Sum/Avg| 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 |
+| Sum/Avg| 125 2254 | 99.4 0.1 0.5 0.2 0.8 4.8 |
`-----------------------------------------------------------------'
```

@@ -1,8 +1,9 @@
-batch_size: 5
+batch_size: 1
error_rate_type: char-bleu
decoding_method: fullsentence  # 'fullsentence', 'simultaneous'
beam_size: 10
word_reward: 0.7
+maxlenratio: 0.3
decoding_chunk_size: -1  # decoding chunk size. Defaults to -1.
    # <0: for decoding, use full chunk.
    # >0: for decoding, use fixed chunk size as set.

@@ -1,9 +1,10 @@
-batch_size: 5
+batch_size: 1
error_rate_type: char-bleu
decoding_method: fullsentence  # 'fullsentence', 'simultaneous'
beam_size: 10
word_reward: 0.7
+maxlenratio: 0.3
decoding_chunk_size: -1  # decoding chunk size. Defaults to -1.
    # <0: for decoding, use full chunk.
    # >0: for decoding, use fixed chunk size as set.

@@ -240,13 +240,14 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
    --am_ckpt=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \
    --am_stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \
    --voc=pwgan_vctk \
-    --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \
-    --voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \
-    --voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \
+    --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
+    --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
+    --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
    --lang=en \
    --text=${BIN_DIR}/../sentences_en.txt \
    --output_dir=exp/default/test_e2e \
    --phones_dict=fastspeech2_nosil_vctk_ckpt_0.5/phone_id_map.txt \
    --speaker_dict=fastspeech2_nosil_vctk_ckpt_0.5/speaker_id_map.txt \
-    --spk_id=0
+    --spk_id=0 \
+    --inference_dir=exp/default/inference
```

@@ -0,0 +1,20 @@
#!/bin/bash
train_output_path=$1
stage=0
stop_stage=0
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_vctk \
--voc=pwgan_vctk \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--lang=en
fi

@@ -20,4 +20,5 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
    --output_dir=${train_output_path}/test_e2e \
    --phones_dict=dump/phone_id_map.txt \
    --speaker_dict=dump/speaker_id_map.txt \
-    --spk_id=0
+    --spk_id=0 \
+    --inference_dir=${train_output_path}/inference

@@ -12,4 +12,4 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-__version__ = '0.1.0'
+__version__ = '0.1.1'

@@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
-import io
import os
import sys
from typing import List
@@ -23,9 +22,9 @@ import librosa
import numpy as np
import paddle
import soundfile
-import yaml
from yacs.config import CfgNode
+from ..download import get_path_from_url
from ..executor import BaseExecutor
from ..log import logger
from ..utils import cli_register
@@ -46,22 +45,65 @@ pretrained_models = {
    # "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
    "conformer_wenetspeech-zh-16k": {
        'url':
-        'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/conformer.model.tar.gz',
+        'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz',
        'md5':
-        '54e7a558a6e020c2f5fb224874943f97',
+        '76cb19ed857e6623856b7cd7ebbfeda4',
        'cfg_path':
-        'conf/conformer.yaml',
+        'model.yaml',
        'ckpt_path':
        'exp/conformer/checkpoints/wenetspeech',
    },
+    "transformer_librispeech-en-16k": {
+        'url':
+        'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz',
+        'md5':
+        '2c667da24922aad391eacafe37bc1660',
+        'cfg_path':
+        'model.yaml',
+        'ckpt_path':
+        'exp/transformer/checkpoints/avg_10',
+    },
+    "deepspeech2offline_aishell-zh-16k": {
+        'url':
+        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz',
+        'md5':
+        '932c3593d62fe5c741b59b31318aa314',
+        'cfg_path':
+        'model.yaml',
+        'ckpt_path':
+        'exp/deepspeech2/checkpoints/avg_1',
+        'lm_url':
+        'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
+        'lm_md5':
+        '29e02312deb2e59b3c8686c7966d4fe3'
+    },
+    "deepspeech2online_aishell-zh-16k": {
+        'url':
+        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
+        'md5':
+        'd5e076217cf60486519f72c217d21b9b',
+        'cfg_path':
+        'model.yaml',
+        'ckpt_path':
+        'exp/deepspeech2_online/checkpoints/avg_1',
+        'lm_url':
+        'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
+        'lm_md5':
+        '29e02312deb2e59b3c8686c7966d4fe3'
+    },
}

model_alias = {
-    "ds2_offline": "paddlespeech.s2t.models.ds2:DeepSpeech2Model",
-    "ds2_online": "paddlespeech.s2t.models.ds2_online:DeepSpeech2ModelOnline",
-    "conformer": "paddlespeech.s2t.models.u2:U2Model",
-    "transformer": "paddlespeech.s2t.models.u2:U2Model",
-    "wenetspeech": "paddlespeech.s2t.models.u2:U2Model",
+    "deepspeech2offline":
+    "paddlespeech.s2t.models.ds2:DeepSpeech2Model",
+    "deepspeech2online":
+    "paddlespeech.s2t.models.ds2_online:DeepSpeech2ModelOnline",
+    "conformer":
+    "paddlespeech.s2t.models.u2:U2Model",
+    "transformer":
+    "paddlespeech.s2t.models.u2:U2Model",
+    "wenetspeech":
+    "paddlespeech.s2t.models.u2:U2Model",
}
@@ -85,7 +127,8 @@ class ASRExecutor(BaseExecutor):
            '--lang',
            type=str,
            default='zh',
-            help='Choose model language. zh or en')
+            help='Choose model language. zh or en, zh:[conformer_wenetspeech-zh-16k], en:[transformer_librispeech-en-16k]'
+        )
        self.parser.add_argument(
            "--sample_rate",
            type=int,
@@ -97,6 +140,15 @@ class ASRExecutor(BaseExecutor):
            type=str,
            default=None,
            help='Config of asr task. Use deault config when it is None.')
+        self.parser.add_argument(
+            '--decode_method',
+            type=str,
+            default='attention_rescoring',
+            choices=[
+                'ctc_greedy_search', 'ctc_prefix_beam_search', 'attention',
+                'attention_rescoring'
+            ],
+            help='only support transformer and conformer model')
        self.parser.add_argument(
            '--ckpt_path',
            type=str,
@@ -136,6 +188,7 @@ class ASRExecutor(BaseExecutor):
                       lang: str='zh',
                       sample_rate: int=16000,
                       cfg_path: Optional[os.PathLike]=None,
+                       decode_method: str='attention_rescoring',
                       ckpt_path: Optional[os.PathLike]=None):
        """
        Init model and other resources from a specific path.
@@ -159,51 +212,44 @@ class ASRExecutor(BaseExecutor):
        else:
            self.cfg_path = os.path.abspath(cfg_path)
            self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
-            res_path = os.path.dirname(
+            self.res_path = os.path.dirname(
                os.path.dirname(os.path.abspath(self.cfg_path)))
        #Init body.
        self.config = CfgNode(new_allowed=True)
        self.config.merge_from_file(self.cfg_path)
-        self.config.decoding.decoding_method = "attention_rescoring"
        with UpdateConfig(self.config):
-            if "ds2_online" in model_type or "ds2_offline" in model_type:
+            if "deepspeech2online" in model_type or "deepspeech2offline" in model_type:
                from paddlespeech.s2t.io.collator import SpeechCollator
-                self.config.collator.vocab_filepath = os.path.join(
-                    res_path, self.config.collator.vocab_filepath)
-                self.config.collator.mean_std_filepath = os.path.join(
-                    res_path, self.config.collator.cmvn_path)
+                self.vocab = self.config.vocab_filepath
+                self.config.decode.lang_model_path = os.path.join(
+                    MODEL_HOME, 'language_model',
+                    self.config.decode.lang_model_path)
                self.collate_fn_test = SpeechCollator.from_config(self.config)
                self.text_feature = TextFeaturizer(
-                    unit_type=self.config.collator.unit_type,
-                    vocab=self.config.collator.vocab_filepath,
-                    spm_model_prefix=self.config.collator.spm_model_prefix)
-                self.config.model.input_dim = self.collate_fn_test.feature_size
-                self.config.model.output_dim = self.text_feature.vocab_size
+                    unit_type=self.config.unit_type, vocab=self.vocab)
+                lm_url = pretrained_models[tag]['lm_url']
+                lm_md5 = pretrained_models[tag]['lm_md5']
+                self.download_lm(
+                    lm_url,
+                    os.path.dirname(self.config.decode.lang_model_path), lm_md5)
            elif "conformer" in model_type or "transformer" in model_type or "wenetspeech" in model_type:
-                self.config.collator.vocab_filepath = os.path.join(
-                    res_path, self.config.collator.vocab_filepath)
-                self.config.collator.augmentation_config = os.path.join(
-                    res_path, self.config.collator.augmentation_config)
-                self.config.collator.spm_model_prefix = os.path.join(
-                    res_path, self.config.collator.spm_model_prefix)
+                self.config.spm_model_prefix = os.path.join(
+                    self.res_path, self.config.spm_model_prefix)
                self.text_feature = TextFeaturizer(
-                    unit_type=self.config.collator.unit_type,
-                    vocab=self.config.collator.vocab_filepath,
-                    spm_model_prefix=self.config.collator.spm_model_prefix)
-                self.config.model.input_dim = self.config.collator.feat_dim
-                self.config.model.output_dim = self.text_feature.vocab_size
+                    unit_type=self.config.unit_type,
+                    vocab=self.config.vocab_filepath,
+                    spm_model_prefix=self.config.spm_model_prefix)
+                self.config.decode.decoding_method = decode_method
            else:
                raise Exception("wrong type")
-        # Enter the path of model root
        model_name = model_type[:model_type.rindex(
            '_')]  # model_type: {model_name}_{dataset}
        model_class = dynamic_import(model_name, model_alias)
-        model_conf = self.config.model
-        logger.info(model_conf)
+        model_conf = self.config
        model = model_class.from_config(model_conf)
        self.model = model
        self.model.eval()
@@ -222,7 +268,7 @@ class ASRExecutor(BaseExecutor):
        logger.info("Preprocess audio_file:" + audio_file)
        # Get the object for feature extraction
-        if "ds2_online" in model_type or "ds2_offline" in model_type:
+        if "deepspeech2online" in model_type or "deepspeech2offline" in model_type:
            audio, _ = self.collate_fn_test.process_utterance(
                audio_file=audio_file, transcript=" ")
            audio_len = audio.shape[0]
@@ -236,18 +282,7 @@ class ASRExecutor(BaseExecutor):
        elif "conformer" in model_type or "transformer" in model_type or "wenetspeech" in model_type:
            logger.info("get the preprocess conf")
-            preprocess_conf_file = self.config.collator.augmentation_config
-            # redirect the cmvn path
-            with io.open(preprocess_conf_file, encoding="utf-8") as f:
-                preprocess_conf = yaml.safe_load(f)
-                for idx, process in enumerate(preprocess_conf["process"]):
-                    if process['type'] == "cmvn_json":
-                        preprocess_conf["process"][idx][
-                            "cmvn_path"] = os.path.join(
-                                self.res_path,
-                                preprocess_conf["process"][idx]["cmvn_path"])
-                        break
-            logger.info(preprocess_conf)
+            preprocess_conf = self.config.preprocess_config
            preprocess_args = {"train": False}
            preprocessing = Transformation(preprocess_conf)
            logger.info("read the audio file")
@@ -289,10 +324,10 @@ class ASRExecutor(BaseExecutor):
        Model inference and result stored in self.output.
        """
-        cfg = self.config.decoding
+        cfg = self.config.decode
        audio = self._inputs["audio"]
        audio_len = self._inputs["audio_len"]
-        if "ds2_online" in model_type or "ds2_offline" in model_type:
+        if "deepspeech2online" in model_type or "deepspeech2offline" in model_type:
            result_transcripts = self.model.decode(
                audio,
                audio_len,
@@ -328,6 +363,13 @@ class ASRExecutor(BaseExecutor):
        """
        return self._outputs["result"]
+    def download_lm(self, url, lm_dir, md5sum):
+        download_path = get_path_from_url(
+            url=url,
+            root_dir=lm_dir,
+            md5sum=md5sum,
+            decompress=False, )
    def _pcm16to32(self, audio):
        assert (audio.dtype == np.int16)
        audio = audio.astype("float32")
@@ -414,12 +456,13 @@ class ASRExecutor(BaseExecutor):
        config = parser_args.config
        ckpt_path = parser_args.ckpt_path
        audio_file = parser_args.input
+        decode_method = parser_args.decode_method
        force_yes = parser_args.yes
        device = parser_args.device
        try:
            res = self(audio_file, model, lang, sample_rate, config, ckpt_path,
-                       force_yes, device)
+                       decode_method, force_yes, device)
            logger.info('ASR Result: {}'.format(res))
            return True
        except Exception as e:
@@ -434,6 +477,7 @@ class ASRExecutor(BaseExecutor):
                 sample_rate: int=16000,
                 config: os.PathLike=None,
                 ckpt_path: os.PathLike=None,
+                 decode_method: str='attention_rescoring',
                 force_yes: bool=False,
                 device=paddle.get_device()):
        """
@@ -442,7 +486,8 @@ class ASRExecutor(BaseExecutor):
        audio_file = os.path.abspath(audio_file)
        self._check(audio_file, sample_rate, force_yes)
        paddle.set_device(device)
-        self._init_from_path(model, lang, sample_rate, config, ckpt_path)
+        self._init_from_path(model, lang, sample_rate, config, decode_method,
+                             ckpt_path)
        self.preprocess(model, audio_file)
        self.infer(model)
        res = self.postprocess()  # Retrieve result of asr.

@@ -40,11 +40,11 @@ __all__ = ["STExecutor"]
pretrained_models = {
    "fat_st_ted-en-zh": {
        "url":
-        "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/fat_st_ted-en-zh.tar.gz",
+        "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz",
        "md5":
-        "fa0a7425b91b4f8d259c70b2aca5ae67",
+        "d62063f35a16d91210a71081bd2dd557",
        "cfg_path":
-        "conf/transformer_mtl_noam.yaml",
+        "model.yaml",
        "ckpt_path":
        "exp/transformer_mtl_noam/checkpoints/fat_st_ted-en-zh.pdparams",
    }
@@ -170,24 +170,19 @@ class STExecutor(BaseExecutor):
        #Init body.
        self.config = CfgNode(new_allowed=True)
        self.config.merge_from_file(self.cfg_path)
-        self.config.decoding.decoding_method = "fullsentence"
+        self.config.decode.decoding_method = "fullsentence"
        with UpdateConfig(self.config):
-            self.config.collator.vocab_filepath = os.path.join(
-                res_path, self.config.collator.vocab_filepath)
-            self.config.collator.cmvn_path = os.path.join(
-                res_path, self.config.collator.cmvn_path)
-            self.config.collator.spm_model_prefix = os.path.join(
-                res_path, self.config.collator.spm_model_prefix)
+            self.config.cmvn_path = os.path.join(
+                res_path, self.config.cmvn_path)
+            self.config.spm_model_prefix = os.path.join(
+                res_path, self.config.spm_model_prefix)
            self.text_feature = TextFeaturizer(
-                unit_type=self.config.collator.unit_type,
-                vocab=self.config.collator.vocab_filepath,
-                spm_model_prefix=self.config.collator.spm_model_prefix)
-            self.config.model.input_dim = self.config.collator.feat_dim
-            self.config.model.output_dim = self.text_feature.vocab_size
-            model_conf = self.config.model
-            logger.info(model_conf)
+                unit_type=self.config.unit_type,
+                vocab=self.config.vocab_filepath,
+                spm_model_prefix=self.config.spm_model_prefix)
+            model_conf = self.config
        model_name = model_type[:model_type.rindex(
            '_')]  # model_type: {model_name}_{dataset}
        model_class = dynamic_import(model_name, model_alias)
@@ -218,7 +213,7 @@ class STExecutor(BaseExecutor):
        logger.info("Preprocess audio_file:" + audio_file)
        if "fat_st" in model_type:
-            cmvn = self.config.collator.cmvn_path
+            cmvn = self.config.cmvn_path
            utt_name = "_tmp"
        # Get the object for feature extraction
@@ -284,7 +279,7 @@ class STExecutor(BaseExecutor):
        """
        Model inference and result stored in self.output.
        """
-        cfg = self.config.decoding
+        cfg = self.config.decode
        audio = self._inputs["audio"]
        audio_len = self._inputs["audio_len"]
        if model_type == "fat_st_ted":

@@ -524,10 +524,10 @@ class U2Tester(U2Trainer):
            List[paddle.static.InputSpec]: input spec.
        """
        from paddlespeech.s2t.models.u2 import U2InferModel
-        infer_model = U2InferModel.from_pretrained(self.train_loader,
+        infer_model = U2InferModel.from_pretrained(self.test_loader,
                                                   self.config.clone(),
                                                   self.args.checkpoint_path)
-        feat_dim = self.train_loader.feat_dim
+        feat_dim = self.test_loader.feat_dim
        input_spec = [
            paddle.static.InputSpec(shape=[1, None, feat_dim],
                                    dtype='float32'),  # audio, [B,T,D]

@@ -285,7 +285,7 @@ class U2STTrainer(Trainer):
                subsampling_factor=1,
                load_aux_output=load_transcript,
                num_encs=1,
-                dist_sampler=True)
+                dist_sampler=False)
            logger.info("Setup train/valid Dataloader!")
        else:
            # test dataset, return raw text
@@ -408,6 +408,7 @@ class U2STTester(U2STTrainer):
                decoding_method=decode_cfg.decoding_method,
                beam_size=decode_cfg.beam_size,
                word_reward=decode_cfg.word_reward,
+                maxlenratio=decode_cfg.maxlenratio,
                decoding_chunk_size=decode_cfg.decoding_chunk_size,
                num_decoding_left_chunks=decode_cfg.num_decoding_left_chunks,
                simulate_streaming=decode_cfg.simulate_streaming)
@@ -435,6 +436,7 @@ class U2STTester(U2STTrainer):
                decoding_method=decode_cfg.decoding_method,
                beam_size=decode_cfg.beam_size,
                word_reward=decode_cfg.word_reward,
+                maxlenratio=decode_cfg.maxlenratio,
                decoding_chunk_size=decode_cfg.decoding_chunk_size,
                num_decoding_left_chunks=decode_cfg.num_decoding_left_chunks,
                simulate_streaming=decode_cfg.simulate_streaming)

@@ -117,7 +117,8 @@ class FeatureNormalizer(object):
            self._compute_mean_std(manifest_path, featurize_func, num_samples,
                                   num_workers)
        else:
-            self._read_mean_std_from_file(mean_std_filepath)
+            mean_std = mean_std_filepath
+            self._read_mean_std_from_file(mean_std)
    def apply(self, features):
        """Normalize features to be of zero mean and unit stddev.
@@ -131,10 +132,14 @@ class FeatureNormalizer(object):
        """
        return (features - self._mean) * self._istd
-    def _read_mean_std_from_file(self, filepath, eps=1e-20):
+    def _read_mean_std_from_file(self, mean_std, eps=1e-20):
        """Load mean and std from file."""
-        filetype = filepath.split(".")[-1]
-        mean, istd = load_cmvn(filepath, filetype=filetype)
+        if isinstance(mean_std, list):
+            mean = mean_std[0]['cmvn_stats']['mean']
+            istd = mean_std[0]['cmvn_stats']['istd']
+        else:
+            filetype = mean_std.split(".")[-1]
+            mean, istd = load_cmvn(mean_std, filetype=filetype)
        self._mean = np.expand_dims(mean, axis=0)
        self._istd = np.expand_dims(istd, axis=0)
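For readers skimming the diff, this is roughly what the new list form accepted by `_read_mean_std_from_file` looks like; the field names come from the branch added above, while the numbers are made-up placeholders:
```python
# Hypothetical pre-loaded CMVN stats in the list format handled by the new branch.
mean_std = [{"cmvn_stats": {"mean": [-1.2, 0.3, 4.5], "istd": [0.8, 1.1, 0.9]}}]
mean = mean_std[0]["cmvn_stats"]["mean"]   # per-dimension feature means
istd = mean_std[0]["cmvn_stats"]["istd"]   # per-dimension inverse standard deviations
```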

@ -264,14 +264,17 @@ class U2STBaseModel(nn.Layer):
speech_lengths: paddle.Tensor, speech_lengths: paddle.Tensor,
beam_size: int=10, beam_size: int=10,
word_reward: float=0.0, word_reward: float=0.0,
maxlenratio: float=0.5,
decoding_chunk_size: int=-1, decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1, num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False, ) -> paddle.Tensor: simulate_streaming: bool=False, ) -> paddle.Tensor:
""" Apply beam search on attention decoder """ Apply beam search on attention decoder with length penalty
Args: Args:
speech (paddle.Tensor): (batch, max_len, feat_dim) speech (paddle.Tensor): (batch, max_len, feat_dim)
speech_length (paddle.Tensor): (batch, ) speech_length (paddle.Tensor): (batch, )
beam_size (int): beam size for beam search beam_size (int): beam size for beam search
word_reward (float): word reward used in beam search
maxlenratio (float): max length ratio to bound the length of translated text
decoding_chunk_size (int): decoding chunk for dynamic chunk decoding_chunk_size (int): decoding chunk for dynamic chunk
trained model. trained model.
<0: for decoding, use full chunk. <0: for decoding, use full chunk.
@ -284,90 +287,90 @@ class U2STBaseModel(nn.Layer):
""" """
assert speech.shape[0] == speech_lengths.shape[0] assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0 assert decoding_chunk_size != 0
assert speech.shape[0] == 1
device = speech.place device = speech.place
batch_size = speech.shape[0]
# Let's assume B = batch_size and N = beam_size # Let's assume B = batch_size and N = beam_size
# 1. Encoder # 1. Encoder and init hypothesis
encoder_out, encoder_mask = self._forward_encoder( encoder_out, encoder_mask = self._forward_encoder(
speech, speech_lengths, decoding_chunk_size, speech, speech_lengths, decoding_chunk_size,
num_decoding_left_chunks, num_decoding_left_chunks,
simulate_streaming) # (B, maxlen, encoder_dim) simulate_streaming) # (B, maxlen, encoder_dim)
maxlen = encoder_out.shape[1]
encoder_dim = encoder_out.shape[2] maxlen = max(int(encoder_out.shape[1] * maxlenratio), 5)
# --- old implementation (batched beam search over B*N hypotheses) ---
running_size = batch_size * beam_size
encoder_out = encoder_out.unsqueeze(1).repeat(1, beam_size, 1, 1).view(
    running_size, maxlen, encoder_dim)  # (B*N, maxlen, encoder_dim)
encoder_mask = encoder_mask.unsqueeze(1).repeat(
    1, beam_size, 1, 1).view(running_size, 1, maxlen)  # (B*N, 1, max_len)
hyps = paddle.ones(
    [running_size, 1], dtype=paddle.long).fill_(self.sos)  # (B*N, 1)
# log scale score
scores = paddle.to_tensor(
    [0.0] + [-float('inf')] * (beam_size - 1), dtype=paddle.float)
scores = scores.to(device).repeat(batch_size).unsqueeze(1).to(
    device)  # (B*N, 1)
end_flag = paddle.zeros_like(scores, dtype=paddle.bool)  # (B*N, 1)
cache: Optional[List[paddle.Tensor]] = None
# 2. Decoder forward step by step
for i in range(1, maxlen + 1):
    # Stop if all batch and all beam produce eos
    # TODO(Hui Zhang): if end_flag.sum() == running_size:
    if end_flag.cast(paddle.int64).sum() == running_size:
        break
    # 2.1 Forward decoder step
    hyps_mask = subsequent_mask(i).unsqueeze(0).repeat(
        running_size, 1, 1).to(device)  # (B*N, i, i)
    # logp: (B*N, vocab)
    logp, cache = self.st_decoder.forward_one_step(
        encoder_out, encoder_mask, hyps, hyps_mask, cache)
    # 2.2 First beam prune: select topk best prob at current time
    top_k_logp, top_k_index = logp.topk(beam_size)  # (B*N, N)
    top_k_logp += word_reward
    top_k_logp = mask_finished_scores(top_k_logp, end_flag)
    top_k_index = mask_finished_preds(top_k_index, end_flag, self.eos)
    # 2.3 Second beam prune: select topk score with history
    scores = scores + top_k_logp  # (B*N, N), broadcast add
    scores = scores.view(batch_size, beam_size * beam_size)  # (B, N*N)
    scores, offset_k_index = scores.topk(k=beam_size)  # (B, N)
    scores = scores.view(-1, 1)  # (B*N, 1)
    # 2.4 Compute base index in top_k_index,
    # regard top_k_index as (B*N*N), regard offset_k_index as (B*N),
    # then find offset_k_index in top_k_index
    base_k_index = paddle.arange(batch_size).view(-1, 1).repeat(
        1, beam_size)  # (B, N)
    base_k_index = base_k_index * beam_size * beam_size
    best_k_index = base_k_index.view(-1) + offset_k_index.view(-1)  # (B*N)
    # 2.5 Update best hyps
    best_k_pred = paddle.index_select(
        top_k_index.view(-1), index=best_k_index, axis=0)  # (B*N)
    best_hyps_index = best_k_index // beam_size
    last_best_k_hyps = paddle.index_select(
        hyps, index=best_hyps_index, axis=0)  # (B*N, i)
    hyps = paddle.cat(
        (last_best_k_hyps, best_k_pred.view(-1, 1)), dim=1)  # (B*N, i+1)
    # 2.6 Update end flag
    end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)
# TODO: length normalization
best_index = paddle.argmax(scores, axis=-1).long()  # (B)
best_hyps_index = best_index + paddle.arange(
    batch_size, dtype=paddle.long) * beam_size
best_hyps = paddle.index_select(hyps, index=best_hyps_index, axis=0)
best_hyps = best_hyps[:, 1:]
return best_hyps
# --- new implementation (hypothesis-list beam search with word reward) ---
hyp = {"score": 0.0, "yseq": [self.sos], "cache": None}
hyps = [hyp]
ended_hyps = []
cur_best_score = -float("inf")
cache = None
# 2. Decoder forward step by step
for i in range(1, maxlen + 1):
    ys = paddle.ones((len(hyps), i), dtype=paddle.long)
    if hyps[0]["cache"] is not None:
        cache = [
            paddle.ones(
                (len(hyps), i - 1, hyp_cache.shape[-1]),
                dtype=paddle.float32) for hyp_cache in hyps[0]["cache"]
        ]
    for j, hyp in enumerate(hyps):
        ys[j, :] = paddle.to_tensor(hyp["yseq"])
        if hyps[0]["cache"] is not None:
            for k in range(len(cache)):
                cache[k][j] = hyps[j]["cache"][k]
    ys_mask = subsequent_mask(i).unsqueeze(0).to(device)
    logp, cache = self.st_decoder.forward_one_step(
        encoder_out.repeat(len(hyps), 1, 1),
        encoder_mask.repeat(len(hyps), 1, 1), ys, ys_mask, cache)
    hyps_best_kept = []
    for j, hyp in enumerate(hyps):
        top_k_logp, top_k_index = logp[j:j + 1].topk(beam_size)
        for b in range(beam_size):
            new_hyp = {}
            new_hyp["score"] = hyp["score"] + float(top_k_logp[0, b])
            new_hyp["yseq"] = [0] * (1 + len(hyp["yseq"]))
            new_hyp["yseq"][:len(hyp["yseq"])] = hyp["yseq"]
            new_hyp["yseq"][len(hyp["yseq"])] = int(top_k_index[0, b])
            new_hyp["cache"] = [cache_[j] for cache_ in cache]
            # will be (2 x beam) hyps at most
            hyps_best_kept.append(new_hyp)
        hyps_best_kept = sorted(
            hyps_best_kept, key=lambda x: -x["score"])[:beam_size]
    # sort and get nbest
    hyps = hyps_best_kept
    if i == maxlen:
        for hyp in hyps:
            hyp["yseq"].append(self.eos)
    # finalize the ended hypotheses with word reward (by length)
    remained_hyps = []
    for hyp in hyps:
        if hyp["yseq"][-1] == self.eos:
            hyp["score"] += (i - 1) * word_reward
            cur_best_score = max(cur_best_score, hyp["score"])
            ended_hyps.append(hyp)
        else:
            # stop while guarantee the optimality
            if hyp["score"] + maxlen * word_reward > cur_best_score:
                remained_hyps.append(hyp)
    # stop prediction when there is no unended hypothesis
    if not remained_hyps:
        break
    hyps = remained_hyps
# 3. Select best of best
best_hyp = max(ended_hyps, key=lambda x: x["score"])
return paddle.to_tensor([best_hyp["yseq"][1:]])
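The added search keeps a running best finished score and drops any unfinished hypothesis that could not overtake it even after collecting the maximum remaining word reward. A standalone sketch of that pruning rule, in plain Python with toy scores; the helper name and the `len(yseq) - 2` reward count are illustrative only, assuming `yseq` holds `sos ... eos`:

```python
def prune(hyps, cur_best_score, maxlen, word_reward, eos):
    """Split hypotheses into finished and still-viable ones (illustrative sketch)."""
    ended, remained = [], []
    for hyp in hyps:
        if hyp["yseq"][-1] == eos:
            # finished: add the length-based word reward once, then retire it
            hyp["score"] += (len(hyp["yseq"]) - 2) * word_reward
            cur_best_score = max(cur_best_score, hyp["score"])
            ended.append(hyp)
        elif hyp["score"] + maxlen * word_reward > cur_best_score:
            # unfinished, but it could still beat the current best
            remained.append(hyp)
    return ended, remained, cur_best_score

# toy usage: the second hypothesis is pruned because even a full reward cannot save it
hyps = [{"yseq": [1, 5, 2], "score": -3.0}, {"yseq": [1, 7, 9], "score": -20.0}]
print(prune(hyps, -float("inf"), maxlen=10, word_reward=0.5, eos=2))
```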
# @jit.to_static # @jit.to_static
def subsampling_rate(self) -> int: def subsampling_rate(self) -> int:
@ -472,6 +475,7 @@ class U2STBaseModel(nn.Layer):
decoding_method: str, decoding_method: str,
beam_size: int, beam_size: int,
word_reward: float=0.0, word_reward: float=0.0,
maxlenratio: float=0.5,
decoding_chunk_size: int=-1, decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1, num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False): simulate_streaming: bool=False):
@ -507,6 +511,7 @@ class U2STBaseModel(nn.Layer):
feats_lengths, feats_lengths,
beam_size=beam_size, beam_size=beam_size,
word_reward=word_reward, word_reward=word_reward,
maxlenratio=maxlenratio,
decoding_chunk_size=decoding_chunk_size, decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks, num_decoding_left_chunks=num_decoding_left_chunks,
simulate_streaming=simulate_streaming) simulate_streaming=simulate_streaming)

@ -14,7 +14,6 @@
"""This module provides functions to calculate bleu score in different level. """This module provides functions to calculate bleu score in different level.
e.g. wer for word-level, cer for char-level. e.g. wer for word-level, cer for char-level.
""" """
import nltk
import numpy as np import numpy as np
import sacrebleu import sacrebleu
@ -114,6 +113,5 @@ class ErrorCalculator():
seq_true_text = "".join(seq_true).replace(self.space, " ") seq_true_text = "".join(seq_true).replace(self.space, " ")
seqs_hat.append(seq_hat_text) seqs_hat.append(seq_hat_text)
seqs_true.append(seq_true_text) seqs_true.append(seq_true_text)
# removed:
bleu = nltk.bleu_score.corpus_bleu([[ref] for ref in seqs_true],
                                   seqs_hat)
return bleu * 100
# added:
bleu = sacrebleu.corpus_bleu(seqs_hat, [[ref] for ref in seqs_true])
return bleu.score * 100
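For context, `sacrebleu.corpus_bleu` takes a list of hypothesis strings plus a list of reference streams and returns an object whose `score` field is the corpus BLEU; a minimal usage sketch with made-up sentences:

```python
import sacrebleu

hyps = ["the cat sat on the mat", "hello there"]
refs = ["the cat is on the mat", "hello there"]
# one reference stream, parallel to the hypotheses
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)  # corpus-level BLEU, reported on a 0-100 scale
```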

@ -14,9 +14,11 @@
import argparse import argparse
from pathlib import Path from pathlib import Path
import numpy
import soundfile as sf import soundfile as sf
from paddle import inference from paddle import inference
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend from paddlespeech.t2s.frontend.zh_frontend import Frontend
@ -29,20 +31,38 @@ def main():
'--am', '--am',
type=str, type=str,
default='fastspeech2_csmsc', default='fastspeech2_csmsc',
choices=['speedyspeech_csmsc', 'fastspeech2_csmsc'], choices=[
'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_aishell3',
'fastspeech2_vctk'
],
help='Choose acoustic model type of tts task.') help='Choose acoustic model type of tts task.')
parser.add_argument( parser.add_argument(
"--phones_dict", type=str, default=None, help="phone vocabulary file.") "--phones_dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument( parser.add_argument(
"--tones_dict", type=str, default=None, help="tone vocabulary file.") "--tones_dict", type=str, default=None, help="tone vocabulary file.")
parser.add_argument(
"--speaker_dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
'--spk_id',
type=int,
default=0,
help='spk id for multi speaker acoustic model')
# voc # voc
parser.add_argument( parser.add_argument(
'--voc', '--voc',
type=str, type=str,
default='pwgan_csmsc', default='pwgan_csmsc',
choices=['pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc'], choices=[
'pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc', 'pwgan_aishell3',
'pwgan_vctk'
],
help='Choose vocoder type of tts task.') help='Choose vocoder type of tts task.')
# other # other
parser.add_argument(
'--lang',
type=str,
default='zh',
help='Choose model language. zh or en')
parser.add_argument( parser.add_argument(
"--text", "--text",
type=str, type=str,
@ -53,8 +73,12 @@ def main():
args, _ = parser.parse_known_args() args, _ = parser.parse_known_args()
# removed:
frontend = Frontend(
    phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
# added:
# frontend
if args.lang == 'zh':
    frontend = Frontend(
        phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
elif args.lang == 'en':
    frontend = English(phone_vocab_path=args.phones_dict)
print("frontend done!") print("frontend done!")
# model: {model_name}_{dataset} # model: {model_name}_{dataset}
@ -83,30 +107,53 @@ def main():
print("in new inference") print("in new inference")
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f: with open(args.text, 'rt') as f:
for line in f: for line in f:
items = line.strip().split() items = line.strip().split()
utt_id = items[0] utt_id = items[0]
sentence = "".join(items[1:]) if args.lang == 'zh':
sentence = "".join(items[1:])
elif args.lang == 'en':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence)) sentences.append((utt_id, sentence))
get_tone_ids = False get_tone_ids = False
get_spk_id = False
if am_name == 'speedyspeech': if am_name == 'speedyspeech':
get_tone_ids = True get_tone_ids = True
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
get_spk_id = True
spk_id = numpy.array([args.spk_id])
am_input_names = am_predictor.get_input_names() am_input_names = am_predictor.get_input_names()
print("am_input_names:", am_input_names)
merge_sentences = True
for utt_id, sentence in sentences: for utt_id, sentence in sentences:
# removed:
input_ids = frontend.get_input_ids(
    sentence, merge_sentences=True, get_tone_ids=get_tone_ids)
phone_ids = input_ids["phone_ids"]
# added:
if args.lang == 'zh':
    input_ids = frontend.get_input_ids(
        sentence,
        merge_sentences=merge_sentences,
        get_tone_ids=get_tone_ids)
    phone_ids = input_ids["phone_ids"]
elif args.lang == 'en':
    input_ids = frontend.get_input_ids(
        sentence, merge_sentences=merge_sentences)
    phone_ids = input_ids["phone_ids"]
else:
    print("lang should in {'zh', 'en'}!")
if get_tone_ids: if get_tone_ids:
tone_ids = input_ids["tone_ids"] tone_ids = input_ids["tone_ids"]
tones = tone_ids[0].numpy() tones = tone_ids[0].numpy()
tones_handle = am_predictor.get_input_handle(am_input_names[1]) tones_handle = am_predictor.get_input_handle(am_input_names[1])
tones_handle.reshape(tones.shape) tones_handle.reshape(tones.shape)
tones_handle.copy_from_cpu(tones) tones_handle.copy_from_cpu(tones)
if get_spk_id:
spk_id_handle = am_predictor.get_input_handle(am_input_names[1])
spk_id_handle.reshape(spk_id.shape)
spk_id_handle.copy_from_cpu(spk_id)
phones = phone_ids[0].numpy() phones = phone_ids[0].numpy()
phones_handle = am_predictor.get_input_handle(am_input_names[0]) phones_handle = am_predictor.get_input_handle(am_input_names[0])
phones_handle.reshape(phones.shape) phones_handle.reshape(phones.shape)
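The reshape/copy_from_cpu calls above follow the usual Paddle Inference handle cycle; a condensed sketch of one full predict round (file names and the phone-id array below are placeholders, not values from this script):

```python
import numpy as np
from paddle.inference import Config, create_predictor

# hypothetical exported files; real paths come from the inference directory
config = Config("am.pdmodel", "am.pdiparams")
predictor = create_predictor(config)

input_names = predictor.get_input_names()
in_handle = predictor.get_input_handle(input_names[0])
phones = np.array([3, 14, 15, 9, 2], dtype=np.int64)  # placeholder phone ids
in_handle.reshape(phones.shape)
in_handle.copy_from_cpu(phones)

predictor.run()

output_names = predictor.get_output_names()
out_handle = predictor.get_output_handle(output_names[0])
mel = out_handle.copy_to_cpu()  # numpy array, e.g. (T, n_mels)
```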

@ -0,0 +1,246 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# generate mels using durations.txt
# for mb melgan finetune
# what if the length does not match the original mel?
import argparse
import os
from pathlib import Path
import numpy as np
import paddle
import yaml
from tqdm import tqdm
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.speedyspeech import SpeedySpeech
from paddlespeech.t2s.models.speedyspeech import SpeedySpeechInference
from paddlespeech.t2s.modules.normalizer import ZScore
def evaluate(args, speedyspeech_config):
rootdir = Path(args.rootdir).expanduser()
assert rootdir.is_dir()
# construct dataset for evaluation
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
phone_dict = {}
for phn, id in phn_id:
phone_dict[phn] = int(id)
with open(args.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id)
print("tone_size:", tone_size)
frontend = Frontend(
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
if args.speaker_dict:
with open(args.speaker_dict, 'rt') as f:
spk_id_list = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id_list)
else:
spk_num = None
model = SpeedySpeech(
vocab_size=vocab_size,
tone_size=tone_size,
**speedyspeech_config["model"],
spk_num=spk_num)
model.set_state_dict(
paddle.load(args.speedyspeech_checkpoint)["main_params"])
model.eval()
stat = np.load(args.speedyspeech_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
speedyspeech_normalizer = ZScore(mu, std)
speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer,
model)
speedyspeech_inference.eval()
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
sentences, speaker_set = get_phn_dur(args.dur_file)
merge_silence(sentences)
if args.dataset == "baker":
wav_files = sorted(list((rootdir / "Wave").rglob("*.wav")))
# split data into 3 sections
num_train = 9800
num_dev = 100
train_wav_files = wav_files[:num_train]
dev_wav_files = wav_files[num_train:num_train + num_dev]
test_wav_files = wav_files[num_train + num_dev:]
elif args.dataset == "aishell3":
sub_num_dev = 5
wav_dir = rootdir / "train" / "wav"
train_wav_files = []
dev_wav_files = []
test_wav_files = []
for speaker in os.listdir(wav_dir):
wav_files = sorted(list((wav_dir / speaker).rglob("*.wav")))
if len(wav_files) > 100:
train_wav_files += wav_files[:-sub_num_dev * 2]
dev_wav_files += wav_files[-sub_num_dev * 2:-sub_num_dev]
test_wav_files += wav_files[-sub_num_dev:]
else:
train_wav_files += wav_files
train_wav_files = [
os.path.basename(str(str_path)) for str_path in train_wav_files
]
dev_wav_files = [
os.path.basename(str(str_path)) for str_path in dev_wav_files
]
test_wav_files = [
os.path.basename(str(str_path)) for str_path in test_wav_files
]
for i, utt_id in enumerate(tqdm(sentences)):
phones = sentences[utt_id][0]
durations = sentences[utt_id][1]
speaker = sentences[utt_id][2]
# trim the leading and trailing sil
if args.cut_sil:
if phones[0] == "sil" and len(durations) > 1:
durations = durations[1:]
phones = phones[1:]
if phones[-1] == 'sil' and len(durations) > 1:
durations = durations[:-1]
phones = phones[:-1]
phones, tones = frontend._get_phone_tone(phones, get_tone_ids=True)
if tones:
tone_ids = frontend._t2id(tones)
tone_ids = paddle.to_tensor(tone_ids)
if phones:
phone_ids = frontend._p2id(phones)
phone_ids = paddle.to_tensor(phone_ids)
if args.speaker_dict:
speaker_id = int(
[item[1] for item in spk_id_list if speaker == item[0]][0])
speaker_id = paddle.to_tensor(speaker_id)
else:
speaker_id = None
durations = paddle.to_tensor(np.array(durations))
durations = paddle.unsqueeze(durations, axis=0)
# the generated mel may differ from the ground truth by 1 or 2 frames, but batch_fn will fix that
# split data into 3 sections
wav_path = utt_id + ".wav"
if wav_path in train_wav_files:
sub_output_dir = output_dir / ("train/raw")
elif wav_path in dev_wav_files:
sub_output_dir = output_dir / ("dev/raw")
elif wav_path in test_wav_files:
sub_output_dir = output_dir / ("test/raw")
sub_output_dir.mkdir(parents=True, exist_ok=True)
with paddle.no_grad():
mel = speedyspeech_inference(
phone_ids, tone_ids, durations=durations, spk_id=speaker_id)
np.save(sub_output_dir / (utt_id + "_feats.npy"), mel)
def main():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(
description="Synthesize with speedyspeech & parallel wavegan.")
parser.add_argument(
"--dataset",
default="baker",
type=str,
help="name of dataset, should in {baker, ljspeech, vctk} now")
parser.add_argument(
"--rootdir", default=None, type=str, help="directory to dataset.")
parser.add_argument(
"--speedyspeech-config", type=str, help="speedyspeech config file.")
parser.add_argument(
"--speedyspeech-checkpoint",
type=str,
help="speedyspeech checkpoint to load.")
parser.add_argument(
"--speedyspeech-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
)
parser.add_argument(
"--phones-dict",
type=str,
default="phone_id_map.txt",
help="phone vocabulary file.")
parser.add_argument(
"--tones-dict",
type=str,
default="tone_id_map.txt",
help="tone vocabulary file.")
parser.add_argument(
"--speaker-dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
"--dur-file", default=None, type=str, help="path to durations.txt.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
def str2bool(str):
return True if str.lower() == 'true' else False
parser.add_argument(
"--cut-sil",
type=str2bool,
default=True,
help="whether cut sil in the edge of audio")
args = parser.parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should >= 0 !")
with open(args.speedyspeech_config) as f:
speedyspeech_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(speedyspeech_config)
evaluate(args, speedyspeech_config)
if __name__ == "__main__":
main()
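On the open question in the header comment (what to do when the generated length differs from the original mel), one option is to trim or pad to the reference length before saving. A sketch under the assumption that both mels are stored as (n_frames, n_mels) arrays; the helper below is not part of the script:

```python
import numpy as np


def align_to_reference(gen_mel: np.ndarray, ref_mel: np.ndarray) -> np.ndarray:
    """Trim or edge-pad the generated mel so its frame count matches the reference.

    Assumes both arrays are (n_frames, n_mels); the 1-2 frame mismatch
    mentioned above is usually all that needs fixing.
    """
    diff = gen_mel.shape[0] - ref_mel.shape[0]
    if diff > 0:
        gen_mel = gen_mel[:ref_mel.shape[0]]
    elif diff < 0:
        gen_mel = np.pad(gen_mel, ((0, -diff), (0, 0)), mode="edge")
    return gen_mel
```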

@ -165,9 +165,16 @@ def evaluate(args):
# acoustic model # acoustic model
if am_name == 'fastspeech2': if am_name == 'fastspeech2':
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict: if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
# removed:
print(
    "Haven't test dygraph to static for multi speaker fastspeech2 now!"
)
# added:
am_inference = jit.to_static(
    am_inference,
    input_spec=[
        InputSpec([-1], dtype=paddle.int64),
        InputSpec([1], dtype=paddle.int64)
    ])
paddle.jit.save(am_inference,
                os.path.join(args.inference_dir, args.am))
am_inference = paddle.jit.load(
    os.path.join(args.inference_dir, args.am))
else: else:
am_inference = jit.to_static( am_inference = jit.to_static(
am_inference, am_inference,
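The added export path is the usual dygraph-to-static round trip: trace with InputSpec, save, then load the static program back. A minimal sketch on a toy layer; the TinyAM layer, shapes, and output directory are placeholders, not the FastSpeech2 model:

```python
import os
import paddle
from paddle.static import InputSpec


class TinyAM(paddle.nn.Layer):
    """Placeholder layer standing in for the acoustic model."""

    def __init__(self):
        super().__init__()
        self.emb = paddle.nn.Embedding(10, 8)

    def forward(self, phone_ids, spk_id):
        return self.emb(phone_ids) + self.emb(spk_id)


layer = TinyAM()
static_layer = paddle.jit.to_static(
    layer,
    input_spec=[
        InputSpec([-1], dtype=paddle.int64),  # variable-length phone ids
        InputSpec([1], dtype=paddle.int64),   # single speaker id
    ])
os.makedirs("exp", exist_ok=True)
paddle.jit.save(static_layer, os.path.join("exp", "tiny_am"))
reloaded = paddle.jit.load(os.path.join("exp", "tiny_am"))
out = reloaded(paddle.to_tensor([1, 2, 3]), paddle.to_tensor([0]))
```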

@ -65,6 +65,7 @@ class ToneSandhi():
self.must_not_neural_tone_words = { self.must_not_neural_tone_words = {
"男子", "女子", "分子", "原子", "量子", "莲子", "石子", "瓜子", "电子" "男子", "女子", "分子", "原子", "量子", "莲子", "石子", "瓜子", "电子"
} }
self.punc = ":,;。?!“”‘’':,;.?!"
# the meaning of jieba pos tag: https://blog.csdn.net/weixin_44174352/article/details/113731041 # the meaning of jieba pos tag: https://blog.csdn.net/weixin_44174352/article/details/113731041
# e.g. # e.g.
@ -147,7 +148,9 @@ class ToneSandhi():
finals[i] = finals[i][:-1] + "2" finals[i] = finals[i][:-1] + "2"
# "一" before non-tone4 should be yi4, e.g. 一天 # "一" before non-tone4 should be yi4, e.g. 一天
else: else:
finals[i] = finals[i][:-1] + "4" # if "一" is followed by punctuation, it is still pronounced with the first tone
if word[i + 1] not in self.punc:
finals[i] = finals[i][:-1] + "4"
return finals return finals
def _split_word(self, word: str) -> List[str]: def _split_word(self, word: str) -> List[str]:
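The new check keeps "一" in the first tone when the next character is punctuation instead of forcing tone 4; a tiny simplified sketch of just that branch (not the full ToneSandhi logic):

```python
PUNC = ":,;。?!“”‘’':,;.?!"


def yi_tone(word: str, i: int) -> str:
    """Tone for "一" at position i, simplified to the punctuation rule above."""
    nxt = word[i + 1] if i + 1 < len(word) else ""
    if nxt and nxt not in PUNC:
        return "4"  # e.g. 一天 -> yi4
    return "1"      # before punctuation (or at the end) it stays yi1


print(yi_tone("一天", 0))   # 4
print(yi_tone("第一,", 1))  # 1
```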

@ -626,7 +626,7 @@ class FastSpeech2(nn.Layer):
hs = hs + e_embs + p_embs hs = hs + e_embs + p_embs
# (B, Lmax, adim) # (B, Lmax, adim)
hs = self.length_regulator(hs, d_outs, alpha) hs = self.length_regulator(hs, d_outs, alpha, is_inference=True)
else: else:
d_outs = self.duration_predictor(hs, d_masks) d_outs = self.duration_predictor(hs, d_masks)
# use groundtruth in training # use groundtruth in training
@ -637,7 +637,7 @@ class FastSpeech2(nn.Layer):
hs = hs + e_embs + p_embs hs = hs + e_embs + p_embs
# (B, Lmax, adim) # (B, Lmax, adim)
hs = self.length_regulator(hs, ds) hs = self.length_regulator(hs, ds, is_inference=False)
# forward decoder # forward decoder
if olens is not None and not is_inference: if olens is not None and not is_inference:
@ -780,7 +780,7 @@ class FastSpeech2(nn.Layer):
elif self.spk_embed_integration_type == "concat": elif self.spk_embed_integration_type == "concat":
# concat hidden states with spk embeds and then apply projection # concat hidden states with spk embeds and then apply projection
spk_emb = F.normalize(spk_emb).unsqueeze(1).expand( spk_emb = F.normalize(spk_emb).unsqueeze(1).expand(
shape=[-1, hs.shape[1], -1]) shape=[-1, paddle.shape(hs)[1], -1])
hs = self.spk_projection(paddle.concat([hs, spk_emb], axis=-1)) hs = self.spk_projection(paddle.concat([hs, spk_emb], axis=-1))
else: else:
raise NotImplementedError("support only add or concat.") raise NotImplementedError("support only add or concat.")
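Replacing `hs.shape[1]` with `paddle.shape(hs)[1]` matters for dygraph-to-static export: `paddle.shape` returns the run-time shape as a tensor, so the expanded length is not frozen when the program is traced. A small illustration on a toy tensor:

```python
import paddle

x = paddle.randn([2, 5, 8])
static_len = x.shape[1]           # plain Python value, baked in when converting to static
runtime_len = paddle.shape(x)[1]  # integer tensor, evaluated at run time
print(static_len, runtime_len)
```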

@ -11,32 +11,12 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import numpy as np
import paddle import paddle
from paddle import nn from paddle import nn
from paddlespeech.t2s.modules.nets_utils import initialize
from paddlespeech.t2s.modules.positional_encoding import sinusoid_position_encoding from paddlespeech.t2s.modules.positional_encoding import sinusoid_position_encoding
from paddlespeech.t2s.modules.predictor.length_regulator import LengthRegulator
def expand(encodings: paddle.Tensor, durations: paddle.Tensor) -> paddle.Tensor:
"""
encodings: (B, T, C)
durations: (B, T)
"""
batch_size, t_enc = durations.shape
durations = durations.numpy()
slens = np.sum(durations, -1)
t_dec = np.max(slens)
M = np.zeros([batch_size, t_dec, t_enc])
for i in range(batch_size):
k = 0
for j in range(t_enc):
d = durations[i, j]
M[i, k:k + d, j] = 1
k += d
M = paddle.to_tensor(M, dtype=encodings.dtype)
encodings = paddle.matmul(M, encodings)
return encodings
class ResidualBlock(nn.Layer): class ResidualBlock(nn.Layer):
@ -176,19 +156,25 @@ class SpeedySpeechDecoder(nn.Layer):
class SpeedySpeech(nn.Layer): class SpeedySpeech(nn.Layer):
# removed:
def __init__(self,
             vocab_size,
             encoder_hidden_size,
             encoder_kernel_size,
             encoder_dilations,
             duration_predictor_hidden_size,
             decoder_hidden_size,
             decoder_output_size,
             decoder_kernel_size,
             decoder_dilations,
             tone_size=None,
             spk_num=None):
# added:
def __init__(
        self,
        vocab_size,
        encoder_hidden_size,
        encoder_kernel_size,
        encoder_dilations,
        duration_predictor_hidden_size,
        decoder_hidden_size,
        decoder_output_size,
        decoder_kernel_size,
        decoder_dilations,
        tone_size=None,
        spk_num=None,
        init_type: str="xavier_uniform", ):
super().__init__() super().__init__()
# initialize parameters
initialize(self, init_type)
encoder = SpeedySpeechEncoder(vocab_size, tone_size, encoder = SpeedySpeechEncoder(vocab_size, tone_size,
encoder_hidden_size, encoder_kernel_size, encoder_hidden_size, encoder_kernel_size,
encoder_dilations, spk_num) encoder_dilations, spk_num)
@ -199,6 +185,10 @@ class SpeedySpeech(nn.Layer):
self.encoder = encoder self.encoder = encoder
self.duration_predictor = duration_predictor self.duration_predictor = duration_predictor
self.decoder = decoder self.decoder = decoder
# define length regulator
self.length_regulator = LengthRegulator()
nn.initializer.set_global_initializer(None)
def forward(self, text, tones, durations, spk_id: paddle.Tensor=None): def forward(self, text, tones, durations, spk_id: paddle.Tensor=None):
# input of embedding must be int64 # input of embedding must be int64
@ -213,7 +203,7 @@ class SpeedySpeech(nn.Layer):
# expand encodings # expand encodings
durations_to_expand = durations durations_to_expand = durations
encodings = expand(encodings, durations_to_expand) encodings = self.length_regulator(encodings, durations_to_expand)
# decode # decode
# remove positional encoding here # remove positional encoding here
@ -222,7 +212,7 @@ class SpeedySpeech(nn.Layer):
decoded = self.decoder(encodings) decoded = self.decoder(encodings)
return decoded, pred_durations return decoded, pred_durations
def inference(self, text, tones=None, spk_id=None): def inference(self, text, tones=None, durations=None, spk_id=None):
# text: [T] # text: [T]
# tones: [T] # tones: [T]
# input of embedding must be int64 # input of embedding must be int64
@ -234,24 +224,15 @@ class SpeedySpeech(nn.Layer):
encodings = self.encoder(text, tones, spk_id) encodings = self.encoder(text, tones, spk_id)
# removed:
pred_durations = self.duration_predictor(encodings)  # (1, T)
durations_to_expand = paddle.round(pred_durations.exp())  # (1, T)
durations_to_expand = (durations_to_expand).astype(paddle.int64)
slens = paddle.sum(durations_to_expand, -1)  # [1]
t_dec = slens[0]  # [1]
t_enc = paddle.shape(pred_durations)[-1]
M = paddle.zeros([1, t_dec, t_enc])
k = paddle.full([1], 0, dtype=paddle.int64)
for j in range(t_enc):
    d = durations_to_expand[0, j]
    # If the d == 0, slice action is meaningless and not supported
    if d >= 1:
        M[0, k:k + d, j] = 1
    k += d
encodings = paddle.matmul(M, encodings)
# added:
if durations is None:
    pred_durations = self.duration_predictor(encodings)
    durations_to_expand = paddle.round(pred_durations.exp())
    durations_to_expand = durations_to_expand.astype(paddle.int64)
else:
    durations_to_expand = durations
encodings = self.length_regulator(
    encodings, durations_to_expand, is_inference=True)
shape = paddle.shape(encodings) shape = paddle.shape(encodings)
t_dec, feature_size = shape[1], shape[2] t_dec, feature_size = shape[1], shape[2]
@ -266,7 +247,8 @@ class SpeedySpeechInference(nn.Layer):
self.normalizer = normalizer self.normalizer = normalizer
self.acoustic_model = speedyspeech_model self.acoustic_model = speedyspeech_model
def forward(self, phones, tones, spk_id=None): def forward(self, phones, tones, durations=None, spk_id=None):
normalized_mel = self.acoustic_model.inference(phones, tones, spk_id) normalized_mel = self.acoustic_model.inference(
phones, tones, durations=durations, spk_id=spk_id)
logmel = self.normalizer.inverse(normalized_mel) logmel = self.normalizer.inverse(normalized_mel)
return logmel return logmel

@ -115,8 +115,8 @@ class DurationPredictor(nn.Layer):
Returns Returns
---------- ----------
Tensor Tensor
Batch of predicted durations in log domain (B, Tmax). Batch of predicted durations in log domain (B, Tmax).
""" """
return self._forward(xs, x_masks, False) return self._forward(xs, x_masks, False)

@ -13,6 +13,7 @@
# limitations under the License. # limitations under the License.
# Modified from espnet(https://github.com/espnet/espnet) # Modified from espnet(https://github.com/espnet/espnet)
"""Length regulator related modules.""" """Length regulator related modules."""
import numpy as np
import paddle import paddle
from paddle import nn from paddle import nn
@ -43,6 +44,28 @@ class LengthRegulator(nn.Layer):
super().__init__() super().__init__()
self.pad_value = pad_value self.pad_value = pad_value
# expand_numpy is faster than expand
def expand_numpy(self, encodings: paddle.Tensor,
durations: paddle.Tensor) -> paddle.Tensor:
"""
encodings: (B, T, C)
durations: (B, T)
"""
batch_size, t_enc = durations.shape
durations = durations.numpy()
slens = np.sum(durations, -1)
t_dec = np.max(slens)
M = np.zeros([batch_size, t_dec, t_enc])
for i in range(batch_size):
k = 0
for j in range(t_enc):
d = durations[i, j]
M[i, k:k + d, j] = 1
k += d
M = paddle.to_tensor(M, dtype=encodings.dtype)
encodings = paddle.matmul(M, encodings)
return encodings
def expand(self, encodings: paddle.Tensor, def expand(self, encodings: paddle.Tensor,
durations: paddle.Tensor) -> paddle.Tensor: durations: paddle.Tensor) -> paddle.Tensor:
""" """
@ -50,28 +73,29 @@ class LengthRegulator(nn.Layer):
durations: (B, T) durations: (B, T)
""" """
batch_size, t_enc = paddle.shape(durations) batch_size, t_enc = paddle.shape(durations)
slens = durations.sum(-1) slens = paddle.sum(durations, -1)
t_dec = slens.max() t_dec = paddle.max(slens)
M = paddle.zeros([batch_size, t_dec, t_enc]) M = paddle.zeros([batch_size, t_dec, t_enc])
for i in range(batch_size): for i in range(batch_size):
k = 0 k = 0
for j in range(t_enc): for j in range(t_enc):
d = durations[i, j] d = durations[i, j]
# If the d == 0, slice action is meaningless and not supported in paddle
if d >= 1: if d >= 1:
M[i, k:k + d, j] = 1 M[i, k:k + d, j] = 1
k += d k += d
encodings = paddle.matmul(M, encodings) encodings = paddle.matmul(M, encodings)
return encodings return encodings
def forward(self, xs, ds, alpha=1.0): def forward(self, xs, ds, alpha=1.0, is_inference=False):
"""Calculate forward propagation. """Calculate forward propagation.
Parameters Parameters
---------- ----------
xs : Tensor xs : Tensor
Batch of sequences of char or phoneme embeddings (B, Tmax, D). Batch of sequences of char or phoneme embeddings (B, Tmax, D).
ds : LongTensor ds : Tensor(int64)
Batch of durations of each frame (B, T). Batch of durations of each frame (B, T).
alpha : float, optional alpha : float, optional
Alpha value to control speed of speech. Alpha value to control speed of speech.
@ -85,4 +109,7 @@ class LengthRegulator(nn.Layer):
assert alpha > 0 assert alpha > 0
ds = paddle.round(ds.cast(dtype=paddle.float32) * alpha) ds = paddle.round(ds.cast(dtype=paddle.float32) * alpha)
ds = ds.cast(dtype=paddle.int64) ds = ds.cast(dtype=paddle.int64)
return self.expand(xs, ds) if is_inference:
return self.expand(xs, ds)
else:
return self.expand_numpy(xs, ds)
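Both expand paths build a (B, T_dec, T_enc) matrix M whose column j contains d_j ones, so `M @ encodings` repeats encoder frame j exactly d_j times; a small numpy sketch with toy durations:

```python
import numpy as np

durations = np.array([[2, 1, 3]])                             # (B=1, T_enc=3) frames per phone
encodings = np.arange(6, dtype=np.float32).reshape(1, 3, 2)   # (B, T_enc, C=2)

t_dec = durations.sum(-1).max()                               # 6 output frames
M = np.zeros([1, t_dec, durations.shape[1]], dtype=np.float32)
k = 0
for j, d in enumerate(durations[0]):
    M[0, k:k + d, j] = 1
    k += d
expanded = M @ encodings                                      # (1, 6, 2): frame j repeated d_j times
print(expanded[0])
```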

@ -0,0 +1,409 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
def length_to_mask(length, max_len=None, dtype=None):
assert len(length.shape) == 1
if max_len is None:
max_len = length.max().astype(
'int').item() # using arange to generate mask
mask = paddle.arange(
max_len, dtype=length.dtype).expand(
(len(length), max_len)) < length.unsqueeze(1)
if dtype is None:
dtype = length.dtype
mask = paddle.to_tensor(mask, dtype=dtype)
return mask
class Conv1d(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
kernel_size,
stride=1,
padding="same",
dilation=1,
groups=1,
bias=True,
padding_mode="reflect", ):
super().__init__()
self.kernel_size = kernel_size
self.stride = stride
self.dilation = dilation
self.padding = padding
self.padding_mode = padding_mode
self.conv = nn.Conv1D(
in_channels,
out_channels,
self.kernel_size,
stride=self.stride,
padding=0,
dilation=self.dilation,
groups=groups,
bias_attr=bias, )
def forward(self, x):
if self.padding == "same":
x = self._manage_padding(x, self.kernel_size, self.dilation,
self.stride)
else:
raise ValueError(f"Padding must be 'same'. Got {self.padding}")
return self.conv(x)
def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
L_in = x.shape[-1] # Detecting input shape
padding = self._get_padding_elem(L_in, stride, kernel_size,
dilation) # Time padding
x = F.pad(
x, padding, mode=self.padding_mode,
data_format="NCL") # Applying padding
return x
def _get_padding_elem(self,
L_in: int,
stride: int,
kernel_size: int,
dilation: int):
if stride > 1:
n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)
L_out = stride * (n_steps - 1) + kernel_size * dilation
padding = [kernel_size // 2, kernel_size // 2]
else:
L_out = (L_in - dilation * (kernel_size - 1) - 1) // stride + 1
padding = [(L_in - L_out) // 2, (L_in - L_out) // 2]
return padding
class BatchNorm1d(nn.Layer):
def __init__(
self,
input_size,
eps=1e-05,
momentum=0.9,
weight_attr=None,
bias_attr=None,
data_format='NCL',
use_global_stats=None, ):
super().__init__()
self.norm = nn.BatchNorm1D(
input_size,
epsilon=eps,
momentum=momentum,
weight_attr=weight_attr,
bias_attr=bias_attr,
data_format=data_format,
use_global_stats=use_global_stats, )
def forward(self, x):
x_n = self.norm(x)
return x_n
class TDNNBlock(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
kernel_size,
dilation,
activation=nn.ReLU, ):
super().__init__()
self.conv = Conv1d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=kernel_size,
dilation=dilation, )
self.activation = activation()
self.norm = BatchNorm1d(input_size=out_channels)
def forward(self, x):
return self.norm(self.activation(self.conv(x)))
class Res2NetBlock(nn.Layer):
def __init__(self, in_channels, out_channels, scale=8, dilation=1):
super().__init__()
assert in_channels % scale == 0
assert out_channels % scale == 0
in_channel = in_channels // scale
hidden_channel = out_channels // scale
self.blocks = nn.LayerList([
TDNNBlock(
in_channel, hidden_channel, kernel_size=3, dilation=dilation)
for i in range(scale - 1)
])
self.scale = scale
def forward(self, x):
y = []
for i, x_i in enumerate(paddle.chunk(x, self.scale, axis=1)):
if i == 0:
y_i = x_i
elif i == 1:
y_i = self.blocks[i - 1](x_i)
else:
y_i = self.blocks[i - 1](x_i + y_i)
y.append(y_i)
y = paddle.concat(y, axis=1)
return y
class SEBlock(nn.Layer):
def __init__(self, in_channels, se_channels, out_channels):
super().__init__()
self.conv1 = Conv1d(
in_channels=in_channels, out_channels=se_channels, kernel_size=1)
self.relu = paddle.nn.ReLU()
self.conv2 = Conv1d(
in_channels=se_channels, out_channels=out_channels, kernel_size=1)
self.sigmoid = paddle.nn.Sigmoid()
def forward(self, x, lengths=None):
L = x.shape[-1]
if lengths is not None:
mask = length_to_mask(lengths * L, max_len=L)
mask = mask.unsqueeze(1)
total = mask.sum(axis=2, keepdim=True)
s = (x * mask).sum(axis=2, keepdim=True) / total
else:
s = x.mean(axis=2, keepdim=True)
s = self.relu(self.conv1(s))
s = self.sigmoid(self.conv2(s))
return s * x
class AttentiveStatisticsPooling(nn.Layer):
def __init__(self, channels, attention_channels=128, global_context=True):
super().__init__()
self.eps = 1e-12
self.global_context = global_context
if global_context:
self.tdnn = TDNNBlock(channels * 3, attention_channels, 1, 1)
else:
self.tdnn = TDNNBlock(channels, attention_channels, 1, 1)
self.tanh = nn.Tanh()
self.conv = Conv1d(
in_channels=attention_channels,
out_channels=channels,
kernel_size=1)
def forward(self, x, lengths=None):
C, L = x.shape[1], x.shape[2] # KP: (N, C, L)
def _compute_statistics(x, m, axis=2, eps=self.eps):
mean = (m * x).sum(axis)
std = paddle.sqrt(
(m * (x - mean.unsqueeze(axis)).pow(2)).sum(axis).clip(eps))
return mean, std
if lengths is None:
lengths = paddle.ones([x.shape[0]])
# Make binary mask of shape [N, 1, L]
mask = length_to_mask(lengths * L, max_len=L)
mask = mask.unsqueeze(1)
# Expand the temporal context of the pooling layer by allowing the
# self-attention to look at global properties of the utterance.
if self.global_context:
total = mask.sum(axis=2, keepdim=True).astype('float32')
mean, std = _compute_statistics(x, mask / total)
mean = mean.unsqueeze(2).tile((1, 1, L))
std = std.unsqueeze(2).tile((1, 1, L))
attn = paddle.concat([x, mean, std], axis=1)
else:
attn = x
# Apply layers
attn = self.conv(self.tanh(self.tdnn(attn)))
# Filter out zero-paddings
attn = paddle.where(
mask.tile((1, C, 1)) == 0,
paddle.ones_like(attn) * float("-inf"), attn)
attn = F.softmax(attn, axis=2)
mean, std = _compute_statistics(x, attn)
# Append mean and std of the batch
pooled_stats = paddle.concat((mean, std), axis=1)
pooled_stats = pooled_stats.unsqueeze(2)
return pooled_stats
class SERes2NetBlock(nn.Layer):
def __init__(
self,
in_channels,
out_channels,
res2net_scale=8,
se_channels=128,
kernel_size=1,
dilation=1,
activation=nn.ReLU, ):
super().__init__()
self.out_channels = out_channels
self.tdnn1 = TDNNBlock(
in_channels,
out_channels,
kernel_size=1,
dilation=1,
activation=activation, )
self.res2net_block = Res2NetBlock(out_channels, out_channels,
res2net_scale, dilation)
self.tdnn2 = TDNNBlock(
out_channels,
out_channels,
kernel_size=1,
dilation=1,
activation=activation, )
self.se_block = SEBlock(out_channels, se_channels, out_channels)
self.shortcut = None
if in_channels != out_channels:
self.shortcut = Conv1d(
in_channels=in_channels,
out_channels=out_channels,
kernel_size=1, )
def forward(self, x, lengths=None):
residual = x
if self.shortcut:
residual = self.shortcut(x)
x = self.tdnn1(x)
x = self.res2net_block(x)
x = self.tdnn2(x)
x = self.se_block(x, lengths)
return x + residual
class EcapaTdnn(nn.Layer):
def __init__(
self,
input_size,
lin_neurons=192,
activation=nn.ReLU,
channels=[512, 512, 512, 512, 1536],
kernel_sizes=[5, 3, 3, 3, 1],
dilations=[1, 2, 3, 4, 1],
attention_channels=128,
res2net_scale=8,
se_channels=128,
global_context=True, ):
super().__init__()
assert len(channels) == len(kernel_sizes)
assert len(channels) == len(dilations)
self.channels = channels
self.blocks = nn.LayerList()
self.emb_size = lin_neurons
# The initial TDNN layer
self.blocks.append(
TDNNBlock(
input_size,
channels[0],
kernel_sizes[0],
dilations[0],
activation, ))
# SE-Res2Net layers
for i in range(1, len(channels) - 1):
self.blocks.append(
SERes2NetBlock(
channels[i - 1],
channels[i],
res2net_scale=res2net_scale,
se_channels=se_channels,
kernel_size=kernel_sizes[i],
dilation=dilations[i],
activation=activation, ))
# Multi-layer feature aggregation
self.mfa = TDNNBlock(
channels[-1],
channels[-1],
kernel_sizes[-1],
dilations[-1],
activation, )
# Attentive Statistical Pooling
self.asp = AttentiveStatisticsPooling(
channels[-1],
attention_channels=attention_channels,
global_context=global_context, )
self.asp_bn = BatchNorm1d(input_size=channels[-1] * 2)
# Final linear transformation
self.fc = Conv1d(
in_channels=channels[-1] * 2,
out_channels=self.emb_size,
kernel_size=1, )
def forward(self, x, lengths=None):
"""
Compute embeddings.
Args:
x (paddle.Tensor): Input log-fbanks with shape (N, n_mels, T).
lengths (paddle.Tensor, optional): Length proportions of batch length with shape (N). Defaults to None.
Returns:
paddle.Tensor: Output embeddings with shape (N, self.emb_size, 1)
"""
xl = []
for layer in self.blocks:
try:
x = layer(x, lengths=lengths)
except TypeError:
x = layer(x)
xl.append(x)
# Multi-layer feature aggregation
x = paddle.concat(xl[1:], axis=1)
x = self.mfa(x)
# Attentive Statistical Pooling
x = self.asp(x, lengths=lengths)
x = self.asp_bn(x)
# Final linear transformation
x = self.fc(x)
return x
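A quick smoke test of the new model, assuming EcapaTdnn is importable from wherever this file lands in the package; the batch size, mel bins, and frame count below are arbitrary:

```python
import paddle

# assumes EcapaTdnn from this file is importable, e.g.
# from paddlespeech.vector.models.ecapa_tdnn import EcapaTdnn
model = EcapaTdnn(input_size=80, lin_neurons=192)
model.eval()

fbank = paddle.randn([4, 80, 200])  # (N, n_mels, T) log-fbank features
lengths = paddle.ones([4])          # full-length utterances (length proportions)
with paddle.no_grad():
    emb = model(fbank, lengths=lengths)
print(emb.shape)  # [4, 192, 1]
```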

@ -43,5 +43,6 @@ typeguard
unidecode unidecode
visualdl visualdl
webrtcvad webrtcvad
yacs yacs~=0.1.8
yq yq
zhon

@ -17,6 +17,7 @@ import io
import os import os
import subprocess as sp import subprocess as sp
import sys import sys
import paddlespeech
from pathlib import Path from pathlib import Path
from setuptools import Command from setuptools import Command
@ -41,7 +42,6 @@ requirements = {
"loguru", "loguru",
"matplotlib", "matplotlib",
"nara_wpe", "nara_wpe",
"nltk",
"pandas", "pandas",
"paddleaudio", "paddleaudio",
"paddlenlp", "paddlenlp",
@ -61,7 +61,7 @@ requirements = {
"typeguard", "typeguard",
"visualdl", "visualdl",
"webrtcvad", "webrtcvad",
"yacs", "yacs~=0.1.8",
], ],
"develop": [ "develop": [
"ConfigArgParse", "ConfigArgParse",
@ -78,6 +78,7 @@ requirements = {
"unidecode", "unidecode",
"yq", "yq",
"pre-commit", "pre-commit",
"zhon",
] ]
} }
@ -127,7 +128,7 @@ def _post_install(install_lib_dir):
print("tools install.") print("tools install.")
# ctcdecoder # ctcdecoder
ctcdecoder_dir = HERE / 'paddlespeech/s2t/decoders/ctcdecoder/swig' ctcdecoder_dir = HERE / 'third_party/ctc_decoders'
with pushd(ctcdecoder_dir): with pushd(ctcdecoder_dir):
check_call("bash -e setup.sh") check_call("bash -e setup.sh")
print("ctcdecoder install.") print("ctcdecoder install.")
@ -172,7 +173,7 @@ class UploadCommand(Command):
setup_info = dict( setup_info = dict(
# Metadata # Metadata
name='paddlespeech', name='paddlespeech',
version='0.1.0', version=paddlespeech.__version__,
author='PaddlePaddle Speech and Language Team', author='PaddlePaddle Speech and Language Team',
author_email='paddlesl@baidu.com', author_email='paddlesl@baidu.com',
url='https://github.com/PaddlePaddle/PaddleSpeech', url='https://github.com/PaddlePaddle/PaddleSpeech',

@ -0,0 +1,77 @@
cmake_minimum_required(VERSION 3.14 FATAL_ERROR)
project(deepspeech VERSION 0.1)
set(CMAKE_VERBOSE_MAKEFILE on)
# set std-14
set(CMAKE_CXX_STANDARD 14)
# include file
include(FetchContent)
include(ExternalProject)
# fc_patch dir
set(FETCHCONTENT_QUIET off)
get_filename_component(fc_patch "fc_patch" REALPATH BASE_DIR "${CMAKE_SOURCE_DIR}")
set(FETCHCONTENT_BASE_DIR ${fc_patch})
###############################################################################
# Option Configurations
###############################################################################
# option configurations
option(TEST_DEBUG "option for debug" OFF)
###############################################################################
# Include third party
###############################################################################
# #example for include third party
# FetchContent_Declare()
# # FetchContent_MakeAvailable was not added until CMake 3.14
# FetchContent_MakeAvailable()
# include_directories()
# ABSEIL-CPP
include(FetchContent)
FetchContent_Declare(
absl
GIT_REPOSITORY "https://github.com/abseil/abseil-cpp.git"
GIT_TAG "20210324.1"
)
FetchContent_MakeAvailable(absl)
# libsndfile
include(FetchContent)
FetchContent_Declare(
libsndfile
GIT_REPOSITORY "https://github.com/libsndfile/libsndfile.git"
GIT_TAG "1.0.31"
)
FetchContent_MakeAvailable(libsndfile)
###############################################################################
# Add local library
###############################################################################
# system lib
find_package()
# if dir have CmakeLists.txt
add_subdirectory()
# if dir do not have CmakeLists.txt
add_library(lib_name STATIC file.cc)
target_link_libraries(lib_name item0 item1)
add_dependencies(lib_name depend-target)
###############################################################################
# Library installation
###############################################################################
install()
###############################################################################
# Build binary file
###############################################################################
add_executable()
target_link_libraries()

@ -0,0 +1,2 @@
aux_source_directory(. DIR_LIB_SRCS)
add_library(decoder STATIC ${DIR_LIB_SRCS})

@ -0,0 +1,26 @@
#!/bin/bash
set -e
# Audio classification
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/dog.wav
paddlespeech cls --input ./cat.wav --topk 10
# Punctuation_restoration
paddlespeech text --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
# Speech_recognition
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
paddlespeech asr --input ./zh.wav
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
# Text To Speech
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --am speedyspeech_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --voc mb_melgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --voc style_melgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --voc hifigan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --am fastspeech2_aishell3 --voc pwgan_aishell3 --input "你好,欢迎使用百度飞桨深度学习框架!" --spk_id 0
paddlespeech tts --am fastspeech2_ljspeech --voc pwgan_ljspeech --lang en --input "hello world"
paddlespeech tts --am fastspeech2_vctk --voc pwgan_vctk --input "hello, boys" --lang en --spk_id 0
# Speech Translation (only support linux)
paddlespeech st --input ./en.wav

@ -0,0 +1,201 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

@ -0,0 +1,165 @@
GNU LESSER GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
This version of the GNU Lesser General Public License incorporates
the terms and conditions of version 3 of the GNU General Public
License, supplemented by the additional permissions listed below.
0. Additional Definitions.
As used herein, "this License" refers to version 3 of the GNU Lesser
General Public License, and the "GNU GPL" refers to version 3 of the GNU
General Public License.
"The Library" refers to a covered work governed by this License,
other than an Application or a Combined Work as defined below.
An "Application" is any work that makes use of an interface provided
by the Library, but which is not otherwise based on the Library.
Defining a subclass of a class defined by the Library is deemed a mode
of using an interface provided by the Library.
A "Combined Work" is a work produced by combining or linking an
Application with the Library. The particular version of the Library
with which the Combined Work was made is also called the "Linked
Version".
The "Minimal Corresponding Source" for a Combined Work means the
Corresponding Source for the Combined Work, excluding any source code
for portions of the Combined Work that, considered in isolation, are
based on the Application, and not on the Linked Version.
The "Corresponding Application Code" for a Combined Work means the
object code and/or source code for the Application, including any data
and utility programs needed for reproducing the Combined Work from the
Application, but excluding the System Libraries of the Combined Work.
1. Exception to Section 3 of the GNU GPL.
You may convey a covered work under sections 3 and 4 of this License
without being bound by section 3 of the GNU GPL.
2. Conveying Modified Versions.
If you modify a copy of the Library, and, in your modifications, a
facility refers to a function or data to be supplied by an Application
that uses the facility (other than as an argument passed when the
facility is invoked), then you may convey a copy of the modified
version:
a) under this License, provided that you make a good faith effort to
ensure that, in the event an Application does not supply the
function or data, the facility still operates, and performs
whatever part of its purpose remains meaningful, or
b) under the GNU GPL, with none of the additional permissions of
this License applicable to that copy.
3. Object Code Incorporating Material from Library Header Files.
The object code form of an Application may incorporate material from
a header file that is part of the Library. You may convey such object
code under terms of your choice, provided that, if the incorporated
material is not limited to numerical parameters, data structure
layouts and accessors, or small macros, inline functions and templates
(ten or fewer lines in length), you do both of the following:
a) Give prominent notice with each copy of the object code that the
Library is used in it and that the Library and its use are
covered by this License.
b) Accompany the object code with a copy of the GNU GPL and this license
document.
4. Combined Works.
You may convey a Combined Work under terms of your choice that,
taken together, effectively do not restrict modification of the
portions of the Library contained in the Combined Work and reverse
engineering for debugging such modifications, if you also do each of
the following:
a) Give prominent notice with each copy of the Combined Work that
the Library is used in it and that the Library and its use are
covered by this License.
b) Accompany the Combined Work with a copy of the GNU GPL and this license
document.
c) For a Combined Work that displays copyright notices during
execution, include the copyright notice for the Library among
these notices, as well as a reference directing the user to the
copies of the GNU GPL and this license document.
d) Do one of the following:
0) Convey the Minimal Corresponding Source under the terms of this
License, and the Corresponding Application Code in a form
suitable for, and under terms that permit, the user to
recombine or relink the Application with a modified version of
the Linked Version to produce a modified Combined Work, in the
manner specified by section 6 of the GNU GPL for conveying
Corresponding Source.
1) Use a suitable shared library mechanism for linking with the
Library. A suitable mechanism is one that (a) uses at run time
a copy of the Library already present on the user's computer
system, and (b) will operate properly with a modified version
of the Library that is interface-compatible with the Linked
Version.
e) Provide Installation Information, but only if you would otherwise
be required to provide such information under section 6 of the
GNU GPL, and only to the extent that such information is
necessary to install and execute a modified version of the
Combined Work produced by recombining or relinking the
Application with a modified version of the Linked Version. (If
you use option 4d0, the Installation Information must accompany
the Minimal Corresponding Source and Corresponding Application
Code. If you use option 4d1, you must provide the Installation
Information in the manner specified by section 6 of the GNU GPL
for conveying Corresponding Source.)
5. Combined Libraries.
You may place library facilities that are a work based on the
Library side by side in a single library together with other library
facilities that are not Applications and are not covered by this
License, and convey such a combined library under terms of your
choice, if you do both of the following:
a) Accompany the combined library with a copy of the same work based
on the Library, uncombined with any other library facilities,
conveyed under the terms of this License.
b) Give prominent notice with the combined library that part of it
is a work based on the Library, and explaining where to find the
accompanying uncombined form of the same work.
6. Revised Versions of the GNU Lesser General Public License.
The Free Software Foundation may publish revised and/or new versions
of the GNU Lesser General Public License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the
Library as you received it specifies that a certain numbered version
of the GNU Lesser General Public License "or any later version"
applies to it, you have the option of following the terms and
conditions either of that published version or of any later version
published by the Free Software Foundation. If the Library as you
received it does not specify a version number of the GNU Lesser
General Public License, you may choose any version of the GNU Lesser
General Public License ever published by the Free Software Foundation.
If the Library as you received it specifies that a proxy can decide
whether future versions of the GNU Lesser General Public License shall
apply, that proxy's public statement of acceptance of any version is
permanent authorization for you to choose that version for the
Library.

@ -0,0 +1,8 @@
Most of the code here is licensed under the Apache License 2.0.
There are exceptions that have their own licenses, listed below.
scorer.h and scorer.cpp are under the LGPL license.
The two files include header files from the KenLM project.
For the rest:
The default license of paddlespeech-ctcdecoders is Apache License 2.0.

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,6 +1,6 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
// //
// Licensed under the Apache License, Version 2.0 (the "License"); // Licensed under the Apache License, Version 2.0 (the "COPYING.APACHE2.0");
// you may not use this file except in compliance with the License. // you may not use this file except in compliance with the License.
// You may obtain a copy of the License at // You may obtain a copy of the License at
// //

@ -1,16 +1,4 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Licensed under GNU Lesser General Public License v3 (LGPLv3) (LGPL-3) (the "COPYING.LESSER.3");
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "scorer.h" #include "scorer.h"
@ -20,8 +8,6 @@
#include "lm/config.hh" #include "lm/config.hh"
#include "lm/model.hh" #include "lm/model.hh"
#include "lm/state.hh" #include "lm/state.hh"
#include "util/string_piece.hh"
#include "util/tokenize_piece.hh"
#include "decoder_utils.h" #include "decoder_utils.h"

@ -1,16 +1,4 @@
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. // Licensed under GNU Lesser General Public License v3 (LGPLv3) (LGPL-3) (the "COPYING.LESSER.3");
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef SCORER_H_ #ifndef SCORER_H_
#define SCORER_H_ #define SCORER_H_
@ -23,7 +11,6 @@
#include "lm/enumerate_vocab.hh" #include "lm/enumerate_vocab.hh"
#include "lm/virtual_interface.hh" #include "lm/virtual_interface.hh"
#include "lm/word_index.hh" #include "lm/word_index.hh"
#include "util/string_piece.hh"
#include "path_trie.h" #include "path_trie.h"

@ -127,11 +127,11 @@ decoders_module = [
setup( setup(
name='paddlespeech_ctcdecoders', name='paddlespeech_ctcdecoders',
version='0.1.0', version='0.1.1',
description="CTC decoders in paddlespeech", description="CTC decoders in paddlespeech",
author="PaddlePaddle Speech and Language Team", author="PaddlePaddle Speech and Language Team",
author_email="paddlesl@baidu.com", author_email="paddlesl@baidu.com",
url="https://github.com/PaddlePaddle/PaddleSpeech", url="https://github.com/PaddlePaddle/PaddleSpeech",
license='Apache 2.0', license='Apache 2.0, GNU Lesser General Public License v3 (LGPLv3) (LGPL-3)',
ext_modules=decoders_module, ext_modules=decoders_module,
py_modules=['swig_decoders']) py_modules=['swig_decoders'])

@ -34,7 +34,7 @@ make -j4
pushd ../src pushd ../src
OPENBLAS_DIR=${KALDI_DIR}/../OpenBLAS OPENBLAS_DIR=${KALDI_DIR}/../OpenBLAS
mkdir -p ${OPENBLAS_DIR}/install mkdir -p ${OPENBLAS_DIR}/install
if [ $SHARED == true ]; if [ $SHARED == true ]; then
./configure --shared --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${OPENBLAS_DIR}/install ./configure --shared --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${OPENBLAS_DIR}/install
else else
./configure --static --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${OPENBLAS_DIR}/install ./configure --static --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${OPENBLAS_DIR}/install
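Because the side-by-side rendering above is hard to read, the corrected conditional after this patch is sketched below; the closing `fi` is an assumption based on the surrounding script, which the hunk does not show.
```bash
if [ $SHARED == true ]; then
    # Build a shared Kaldi with OpenBLAS as the math library, CUDA disabled.
    ./configure --shared --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${OPENBLAS_DIR}/install
else
    # Build a static Kaldi with the same math settings.
    ./configure --static --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${OPENBLAS_DIR}/install
fi
```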

@ -0,0 +1,183 @@
#!/usr/bin/env python3
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
'''
Merge training configs into a single inference config.
The single inference config is for the CLI, which takes only one config to do inference.
The training configs include: model config, preprocess config, decode config, vocab file and cmvn file.
'''
import yaml
import json
import os
import argparse
import math
from yacs.config import CfgNode
from paddlespeech.s2t.frontend.utility import load_dict
from contextlib import redirect_stdout
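# save() dumps a CfgNode to a YAML file; redirect_stdout sends the printed config.dump() output into the file.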
def save(save_path, config):
with open(save_path, 'w') as fp:
with redirect_stdout(fp):
print(config.dump())
def load(save_path):
config = CfgNode(new_allowed=True)
config.merge_from_file(save_path)
return config
def load_json(json_path):
with open(json_path) as f:
json_content = json.load(f)
return json_content
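# Remove one entry from the config; key_list gives the path of keys leading to the entry to drop.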
def remove_config_part(config, key_list):
if len(key_list) == 0:
return
for i in range(len(key_list) -1):
config = config[key_list[i]]
config.pop(key_list[-1])
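# Convert accumulated CMVN statistics (sums, squared sums and frame count) into per-dimension
# mean and inverse standard deviation; the variance is floored at 1e-20 to avoid division by zero.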
def load_cmvn_from_json(cmvn_stats):
means = cmvn_stats['mean_stat']
variance = cmvn_stats['var_stat']
count = cmvn_stats['frame_num']
for i in range(len(means)):
means[i] /= count
variance[i] = variance[i] / count - means[i] * means[i]
if variance[i] < 1.0e-20:
variance[i] = 1.0e-20
variance[i] = 1.0 / math.sqrt(variance[i])
cmvn_stats = {"mean":means, "istd":variance}
return cmvn_stats
def merge_configs(
conf_path = "conf/conformer.yaml",
preprocess_path = "conf/preprocess.yaml",
decode_path = "conf/tuning/decode.yaml",
vocab_path = "data/vocab.txt",
cmvn_path = "data/mean_std.json",
save_path = "conf/conformer_infer.yaml",
):
# Load the configs
config = load(conf_path)
decode_config = load(decode_path)
vocab_list = load_dict(vocab_path)
    # If the kaldi features are used, do not load the cmvn file here
if cmvn_path.split(".")[-1] == 'json':
cmvn_stats = load_json(cmvn_path)
if os.path.exists(preprocess_path):
preprocess_config = load(preprocess_path)
for idx, process in enumerate(preprocess_config["process"]):
if process['type'] == "cmvn_json":
preprocess_config["process"][idx][
"cmvn_path"] = cmvn_stats
break
config.preprocess_config = preprocess_config
else:
cmvn_stats = load_cmvn_from_json(cmvn_stats)
config.mean_std_filepath = [{"cmvn_stats":cmvn_stats}]
config.augmentation_config = ''
    # the cmvn file ends with .ark
else:
config.cmvn_path = cmvn_path
    # Update the config
config.vocab_filepath = vocab_list
config.input_dim = config.feat_dim
config.output_dim = len(config.vocab_filepath)
config.decode = decode_config
# Remove some parts of the config
if os.path.exists(preprocess_path):
remove_train_list = ["train_manifest",
"dev_manifest",
"test_manifest",
"n_epoch",
"accum_grad",
"global_grad_clip",
"optim",
"optim_conf",
"scheduler",
"scheduler_conf",
"log_interval",
"checkpoint",
"shuffle_method",
"weight_decay",
"ctc_grad_norm_type",
"minibatches",
"subsampling_factor",
"batch_bins",
"batch_count",
"batch_frames_in",
"batch_frames_inout",
"batch_frames_out",
"sortagrad",
"feat_dim",
"stride_ms",
"window_ms",
"batch_size",
"maxlen_in",
"maxlen_out",
]
else:
remove_train_list = ["train_manifest",
"dev_manifest",
"test_manifest",
"n_epoch",
"accum_grad",
"global_grad_clip",
"log_interval",
"checkpoint",
"lr",
"lr_decay",
"batch_size",
"shuffle_method",
"weight_decay",
"sortagrad",
"num_workers",
]
for item in remove_train_list:
try:
remove_config_part(config, [item])
        except Exception:
            print(item + " cannot be removed")
# Save the config
save(save_path, config)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
prog='Config merge', add_help=True)
    parser.add_argument(
        '--cfg_pth', type=str, default='conf/transformer.yaml', help='origin config file')
    parser.add_argument(
        '--pre_pth', type=str, default="conf/preprocess.yaml", help='preprocess config file')
    parser.add_argument(
        '--dcd_pth', type=str, default="conf/tuning/decode.yaml", help='decode config file')
    parser.add_argument(
        '--vb_pth', type=str, default="data/lang_char/vocab.txt", help='vocab file')
    parser.add_argument(
        '--cmvn_pth', type=str, default="data/mean_std.json", help='cmvn file')
    parser.add_argument(
        '--save_pth', type=str, default="conf/transformer_infer.yaml", help='path to save the merged inference config')
parser_args = parser.parse_args()
merge_configs(
conf_path = parser_args.cfg_pth,
decode_path = parser_args.dcd_pth,
preprocess_path = parser_args.pre_pth,
vocab_path = parser_args.vb_pth,
cmvn_path = parser_args.cmvn_pth,
save_path = parser_args.save_pth,
)
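A hypothetical command-line invocation of this utility is sketched below. The script path and file name are assumptions for illustration; the argument names and default values come from the parser above.
```bash
# Hypothetical script location; the diff does not show where the file lives.
python3 utils/merge_yaml_config.py \
    --cfg_pth conf/transformer.yaml \
    --pre_pth conf/preprocess.yaml \
    --dcd_pth conf/tuning/decode.yaml \
    --vb_pth data/lang_char/vocab.txt \
    --cmvn_pth data/mean_std.json \
    --save_pth conf/transformer_infer.yaml
```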

@ -20,6 +20,7 @@ import jsonlines
import numpy as np import numpy as np
from tqdm import tqdm from tqdm import tqdm
def main(): def main():
# parse config and args # parse config and args
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
@ -58,9 +59,18 @@ def main():
mel_path = output_dir / ("raw/" + name) mel_path = output_dir / ("raw/" + name)
gen_mel = np.load(mel_path) gen_mel = np.load(mel_path)
wave_name = utt_id + "_wave.npy" wave_name = utt_id + "_wave.npy"
wav = np.load(old_dump_dir / sub / ("raw/" + wave_name)) try:
os.symlink(old_dump_dir / sub / ("raw/" + wave_name), wav = np.load(old_dump_dir / sub / ("raw/" + wave_name))
output_dir / ("raw/" + wave_name)) os.symlink(old_dump_dir / sub / ("raw/" + wave_name),
output_dir / ("raw/" + wave_name))
except FileNotFoundError:
print("delete " + name +
" because it cannot be found in the dump folder")
os.remove(output_dir / "raw" / name)
continue
except FileExistsError:
print("file " + name + " exists, skip.")
continue
num_sample = wav.shape[0] num_sample = wav.shape[0]
num_frames = gen_mel.shape[0] num_frames = gen_mel.shape[0]
wav_path = output_dir / ("raw/" + wave_name) wav_path = output_dir / ("raw/" + wave_name)
