Merge branch 'develop' into webdataset

pull/2062/head
huangyuxin 3 years ago
commit 05d41523ad

@ -1,3 +1,4 @@
([简体中文](./README_cn.md)|English) ([简体中文](./README_cn.md)|English)
<p align="center"> <p align="center">
<img src="./docs/images/PaddleSpeech_logo.png" /> <img src="./docs/images/PaddleSpeech_logo.png" />
@ -494,6 +495,14 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
<a href = "./examples/aishell3/vc1">ge2e-fastspeech2-aishell3</a> <a href = "./examples/aishell3/vc1">ge2e-fastspeech2-aishell3</a>
</td> </td>
</tr> </tr>
<tr>
<td rowspan="3">End-to-End</td>
<td>VITS</td>
<td >CSMSC</td>
<td>
<a href = "./examples/csmsc/vits">VITS-csmsc</a>
</td>
</tr>
</tbody> </tbody>
</table> </table>

@ -1,3 +1,4 @@
(简体中文|[English](./README.md)) (简体中文|[English](./README.md))
<p align="center"> <p align="center">
<img src="./docs/images/PaddleSpeech_logo.png" /> <img src="./docs/images/PaddleSpeech_logo.png" />
@ -481,6 +482,15 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
<a href = "./examples/aishell3/vc1">ge2e-fastspeech2-aishell3</a> <a href = "./examples/aishell3/vc1">ge2e-fastspeech2-aishell3</a>
</td> </td>
</tr> </tr>
</tr>
<tr>
<td rowspan="3">端到端</td>
<td>VITS</td>
<td >CSMSC</td>
<td>
<a href = "./examples/csmsc/vits">VITS-csmsc</a>
</td>
</tr>
</tbody> </tbody>
</table> </table>

@ -1,4 +1,4 @@
# [Aidatatang_200zh](http://www.openslr.org/62/) # [Aidatatang_200zh](http://openslr.elda.org/62/)
Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
The contents and the corresponding descriptions of the corpus include: The contents and the corresponding descriptions of the corpus include:

@ -1,3 +1,3 @@
# [Aishell1](http://www.openslr.org/33/) # [Aishell1](http://openslr.elda.org/33/)
This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, of which utterance contains 11 domains, including smart home, autonomous driving, and industrial production. The whole recording was put in quiet indoor environment, using 3 different devices at the same time: high fidelity microphone (44.1kHz, 16-bit,); Android-system mobile phone (16kHz, 16-bit), iOS-system mobile phone (16kHz, 16-bit). Audios in high fidelity were re-sampled to 16kHz to build AISHELL- ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. The manual transcription accuracy rate is above 95%, through professional speech annotation and strict quality inspection. The corpus is divided into training, development and testing sets. ( This database is free for academic research, not in the commerce, if without permission. ) This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, of which utterance contains 11 domains, including smart home, autonomous driving, and industrial production. The whole recording was put in quiet indoor environment, using 3 different devices at the same time: high fidelity microphone (44.1kHz, 16-bit,); Android-system mobile phone (16kHz, 16-bit), iOS-system mobile phone (16kHz, 16-bit). Audios in high fidelity were re-sampled to 16kHz to build AISHELL- ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. The manual transcription accuracy rate is above 95%, through professional speech annotation and strict quality inspection. The corpus is divided into training, development and testing sets. ( This database is free for academic research, not in the commerce, if without permission. )

@ -31,7 +31,7 @@ from utils.utility import unpack
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL_ROOT = 'http://www.openslr.org/resources/33' URL_ROOT = 'http://openslr.elda.org/resources/33'
# URL_ROOT = 'https://openslr.magicdatatech.com/resources/33' # URL_ROOT = 'https://openslr.magicdatatech.com/resources/33'
DATA_URL = URL_ROOT + '/data_aishell.tgz' DATA_URL = URL_ROOT + '/data_aishell.tgz'
MD5_DATA = '2f494334227864a8a8fec932999db9d8' MD5_DATA = '2f494334227864a8a8fec932999db9d8'
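
The `MD5_DATA` constant in this hunk is the checksum that the fetched archive is presumably compared against after the mirror change. A minimal sketch of such a check, using only `hashlib`; the helper name `md5_matches` is hypothetical and is not the repository's `utils.utility.download` implementation:

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Return True if the file at `path` has the given hex MD5 digest.

    `md5_matches` is a hypothetical helper written for illustration only.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read in chunks so a multi-gigabyte archive never has to fit in memory.
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest() == expected_md5.lower()
```

Because the checksum depends only on the file contents, the same `MD5_DATA` value should remain valid when `URL_ROOT` points at a different mirror of the same archive.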

@ -31,7 +31,7 @@ import soundfile
from utils.utility import download from utils.utility import download
from utils.utility import unpack from utils.utility import unpack
URL_ROOT = "http://www.openslr.org/resources/12" URL_ROOT = "http://openslr.elda.org/resources/12"
#URL_ROOT = "https://openslr.magicdatatech.com/resources/12" #URL_ROOT = "https://openslr.magicdatatech.com/resources/12"
URL_TEST_CLEAN = URL_ROOT + "/test-clean.tar.gz" URL_TEST_CLEAN = URL_ROOT + "/test-clean.tar.gz"
URL_TEST_OTHER = URL_ROOT + "/test-other.tar.gz" URL_TEST_OTHER = URL_ROOT + "/test-other.tar.gz"

@ -1,4 +1,4 @@
# [MagicData](http://www.openslr.org/68/) # [MagicData](http://openslr.elda.org/68/)
MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and freely published for non-commercial use. MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and freely published for non-commercial use.
The contents and the corresponding descriptions of the corpus include: The contents and the corresponding descriptions of the corpus include:

@ -30,7 +30,7 @@ import soundfile
from utils.utility import download from utils.utility import download
from utils.utility import unpack from utils.utility import unpack
URL_ROOT = "http://www.openslr.org/resources/31" URL_ROOT = "http://openslr.elda.org/resources/31"
URL_TRAIN_CLEAN = URL_ROOT + "/train-clean-5.tar.gz" URL_TRAIN_CLEAN = URL_ROOT + "/train-clean-5.tar.gz"
URL_DEV_CLEAN = URL_ROOT + "/dev-clean-2.tar.gz" URL_DEV_CLEAN = URL_ROOT + "/dev-clean-2.tar.gz"

@ -34,7 +34,7 @@ from utils.utility import unpack
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL_ROOT = 'https://www.openslr.org/resources/17' URL_ROOT = 'https://openslr.elda.org/resources/17'
DATA_URL = URL_ROOT + '/musan.tar.gz' DATA_URL = URL_ROOT + '/musan.tar.gz'
MD5_DATA = '0c472d4fc0c5141eca47ad1ffeb2a7df' MD5_DATA = '0c472d4fc0c5141eca47ad1ffeb2a7df'

@ -1,4 +1,4 @@
# [Primewords](http://www.openslr.org/47/) # [Primewords](http://openslr.elda.org/47/)
This free Chinese Mandarin speech corpus set is released by Shanghai Primewords Information Technology Co., Ltd. This free Chinese Mandarin speech corpus set is released by Shanghai Primewords Information Technology Co., Ltd.
The corpus is recorded by smart mobile phones from 296 native Chinese speakers. The transcription accuracy is larger than 98%, at the confidence level of 95%. It is free for academic use. The corpus is recorded by smart mobile phones from 296 native Chinese speakers. The transcription accuracy is larger than 98%, at the confidence level of 95%. It is free for academic use.

@ -34,7 +34,7 @@ from utils.utility import unzip
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL_ROOT = '--no-check-certificate http://www.openslr.org/resources/28' URL_ROOT = '--no-check-certificate https://us.openslr.org/resources/28/rirs_noises.zip'
DATA_URL = URL_ROOT + '/rirs_noises.zip' DATA_URL = URL_ROOT + '/rirs_noises.zip'
MD5_DATA = 'e6f48e257286e05de56413b4779d8ffb' MD5_DATA = 'e6f48e257286e05de56413b4779d8ffb'
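
In this hunk the new `URL_ROOT` value already ends in the archive name, while the unchanged `DATA_URL` line still appends `/rirs_noises.zip`. The string arithmetic can be checked directly from the two assignments shown above; the `ALT_*` names are hypothetical and only illustrate one way the duplication could be avoided, not what the commit does:

```python
# Values copied from the right-hand (new) side of the hunk.
URL_ROOT = '--no-check-certificate https://us.openslr.org/resources/28/rirs_noises.zip'
DATA_URL = URL_ROOT + '/rirs_noises.zip'
print(DATA_URL)  # the archive name appears twice at the end

# Hypothetical alternative (an assumption, not part of the commit):
# keep URL_ROOT as a directory so the concatenation yields one file name.
ALT_URL_ROOT = '--no-check-certificate https://us.openslr.org/resources/28'
ALT_DATA_URL = ALT_URL_ROOT + '/rirs_noises.zip'
print(ALT_DATA_URL)
```

The leading `--no-check-certificate` looks like a `wget` option carried inside the string; how the download helper consumes it is not visible in this diff.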

@ -1 +1 @@
# [FreeST](http://www.openslr.org/38/) # [FreeST](http://openslr.elda.org/38/)

@ -1,4 +1,4 @@
# [THCHS30](http://www.openslr.org/18/) # [THCHS30](http://openslr.elda.org/18/)
This is the *data part* of the `THCHS30 2015` acoustic data This is the *data part* of the `THCHS30 2015` acoustic data
& scripts dataset. & scripts dataset.

@ -32,7 +32,7 @@ from utils.utility import unpack
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL_ROOT = 'http://www.openslr.org/resources/18' URL_ROOT = 'http://openslr.elda.org/resources/18'
# URL_ROOT = 'https://openslr.magicdatatech.com/resources/18' # URL_ROOT = 'https://openslr.magicdatatech.com/resources/18'
DATA_URL = URL_ROOT + '/data_thchs30.tgz' DATA_URL = URL_ROOT + '/data_thchs30.tgz'
TEST_NOISE_URL = URL_ROOT + '/test-noise.tgz' TEST_NOISE_URL = URL_ROOT + '/test-noise.tgz'

@ -1,23 +0,0 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright 2021 Mobvoi Inc. All Rights Reserved.
# Author: zhendong.peng@mobvoi.com (Zhendong Peng)
import argparse
from flask import Flask
from flask import render_template
parser = argparse.ArgumentParser(description='training your network')
parser.add_argument('--port', default=19999, type=int, help='port id')
args = parser.parse_args()
app = Flask(__name__)
@app.route('/')
def index():
return render_template('index.html')
if __name__ == '__main__':
app.run(host='0.0.0.0', port=args.port, debug=True)

@ -1,18 +1,20 @@
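# paddlespeech serving web demo # paddlespeech serving web demo
- Thanks to the [wenet](https://github.com/wenet-e2e/wenet) team for the front-end demo code. ![image](./paddle_web_demo.png)
step1: Start the streaming speech recognition server
## Usage ```
### 1. Start the web service on the local machine # Start the streaming speech recognition service
``` cd PaddleSpeech/demos/streaming_asr_server
python app.py paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application_faster.yaml
```
``` step2: Open `index.html` in the `web` directory with Google Chrome
### 2. Browser on the local machine step3: Click `连接` (Connect) to verify that the WebSocket connection succeeds
step4: Click Start Recording (a pop-up asks for permission; allow recording)
Enter 127.0.0.1:19999 in the browser to see the web demo.
![image](./paddle_web_demo.png)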

@ -1,453 +0,0 @@
/*
* @Author: baipengxia
* @Date: 2021-03-12 11:44:28
* @Last Modified by: baipengxia
* @Last Modified time: 2021-03-12 15:14:24
*/
/** COMMON RESET **/
* {
-webkit-tap-highlight-color: rgba(0, 0, 0, 0);
}
body,
h1,
h2,
h3,
h4,
h5,
h6,
hr,
p,
dl,
dt,
dd,
ul,
ol,
li,
fieldset,
lengend,
button,
input,
textarea,
th,
td {
margin: 0;
padding: 0;
color: #000;
}
body {
font-size: 14px;
}
html, body {
min-width: 1200px;
}
button,
input,
select,
textarea {
font-size: 14px;
}
h1 {
font-size: 18px;
}
h2 {
font-size: 14px;
}
h3 {
font-size: 14px;
}
ul,
ol,
li {
list-style: none;
}
a {
text-decoration: none;
}
a:hover {
text-decoration: none;
}
fieldset,
img {
border: none;
}
table {
border-collapse: collapse;
border-spacing: 0;
}
i {
font-style: normal;
}
label {
position: inherit;
}
.clearfix:after {
content: ".";
display: block;
height: 0;
clear: both;
visibility: hidden;
}
.clearfix {
zoom: 1;
display: block;
}
html,
body {
font-family: Tahoma, Arial, 'microsoft yahei', 'Roboto', 'Droid Sans', 'Helvetica Neue', 'Droid Sans Fallback', 'Heiti SC', 'Hiragino Sans GB', 'Simsun', 'sans-self';
}
.audio-banner {
width: 100%;
overflow: auto;
padding: 0;
background: url('../image/voice-dictation.svg');
background-size: cover;
}
.weaper {
width: 1200px;
height: 155px;
margin: 72px auto;
}
.text-content {
width: 670px;
height: 100%;
float: left;
}
.text-content .title {
font-size: 34px;
font-family: 'PingFangSC-Medium';
font-weight: 500;
color: rgba(255, 255, 255, 1);
line-height: 48px;
}
.text-content .con {
font-size: 16px;
font-family: PingFangSC-Light;
font-weight: 300;
color: rgba(255, 255, 255, 1);
line-height: 30px;
}
.img-con {
width: 416px;
height: 100%;
float: right;
}
.img-con img {
width: 100%;
height: 100%;
}
.con-container {
margin-top: 34px;
}
.audio-advantage {
background: #f8f9fa;
}
.asr-advantage {
width: 1200px;
margin: 0 auto;
}
.asr-advantage h2 {
text-align: center;
font-size: 22px;
padding: 30px 0 0 0;
}
.asr-advantage > ul > li {
box-sizing: border-box;
padding: 0 16px;
width: 33%;
text-align: center;
margin-bottom: 35px;
}
.asr-advantage > ul > li .icons{
margin-top: 10px;
margin-bottom: 20px;
width: 42px;
height: 42px;
}
.service-item-content {
margin-top: 35px;
display: flex;
justify-content: center;
flex-wrap: wrap;
}
.service-item-content img {
width: 160px;
vertical-align: bottom;
}
.service-item-content > li {
box-sizing: border-box;
padding: 0 16px;
width: 33%;
text-align: center;
margin-bottom: 35px;
}
.service-item-content > li .service-item-content-title {
line-height: 1.5;
font-weight: 700;
margin-top: 10px;
}
.service-item-content > li .service-item-content-desc {
margin-top: 5px;
line-height: 1.8;
color: #657384;
}
.audio-scene-con {
width: 100%;
padding-bottom: 84px;
background: #fff;
}
.audio-scene {
overflow: auto;
width: 1200px;
background: #fff;
text-align: center;
padding: 0;
margin: 0 auto;
}
.audio-scene h2 {
padding: 30px 0 0 0;
font-size: 22px;
text-align: center;
}
.audio-experience {
width: 100%;
height: 538px;
background: #fff;
padding: 0;
margin: 0;
overflow: auto;
}
.asr-box {
width: 1200px;
height: 394px;
margin: 64px auto;
}
.asr-box h2 {
font-size: 22px;
text-align: center;
margin-bottom: 64px;
}
.voice-container {
position: relative;
width: 1200px;
height: 308px;
background: rgba(255, 255, 255, 1);
border-radius: 8px;
border: 1px solid rgba(225, 225, 225, 1);
}
.voice-container .voice {
height: 236px;
width: 100%;
border-radius: 8px;
}
.voice-container .voice textarea {
height: 100%;
width: 100%;
border: none;
outline: none;
border-radius: 8px;
padding: 25px;
font-size: 14px;
box-sizing: border-box;
resize: none;
}
.voice-input {
width: 100%;
height: 72px;
box-sizing: border-box;
padding-left: 35px;
background: rgba(242, 244, 245, 1);
border-radius: 8px;
line-height: 72px;
}
.voice-input .el-select {
width: 492px;
}
.start-voice {
display: inline-block;
margin-left: 10px;
}
.start-voice .time {
margin-right: 25px;
}
.asr-advantage > ul > li {
margin-bottom: 77px;
}
#msg {
width: 100%;
line-height: 40px;
font-size: 14px;
margin-left: 330px;
}
#captcha {
margin-left: 350px !important;
display: inline-block;
position: relative;
}
.black {
position: fixed;
width: 100%;
height: 100%;
z-index: 5;
background: rgba(0, 0, 0, 0.5);
top: 0;
left: 0;
}
.container {
position: fixed;
z-index: 6;
top: 25%;
left: 10%;
}
.audio-scene-con {
width: 100%;
padding-bottom: 84px;
background: #fff;
}
#sound {
color: #fff;
cursor: pointer;
background: #147ede;
padding: 10px;
margin-top: 30px;
margin-left: 135px;
width: 176px;
height: 30px !important;
text-align: center;
line-height: 30px !important;
border-radius: 10px;
}
.con-ten {
position: absolute;
width: 100%;
height: 100%;
z-index: 5;
background: #fff;
opacity: 0.5;
top: 0;
left: 0;
}
.websocket-url {
width: 320px;
height: 20px;
border: 1px solid #dcdfe6;
line-height: 20px;
padding: 10px;
border-radius: 4px;
}
.voice-btn {
color: #fff;
background-color: #409eff;
font-weight: 500;
padding: 12px 20px;
font-size: 14px;
border-radius: 4px;
border: 0;
cursor: pointer;
}
.voice-btn.end {
display: none;
}
.result-text {
background: #fff;
padding: 20px;
}
.voice-footer {
border-top: 1px solid #dddede;
background: #f7f9fa;
text-align: center;
margin-bottom: 8px;
color: #333;
font-size: 12px;
padding: 20px 0;
}
/** line animate **/
.time-box {
display: none;
margin-left: 10px;
width: 300px;
}
.total-time {
font-size: 14px;
color: #545454;
}
.voice-btn.end.show,
.time-box.show {
display: inline;
}
.start-taste-line {
margin-right: 20px;
display: inline-block;
}
.start-taste-line hr {
background-color: #187cff;
width: 3px;
height: 8px;
margin: 0 3px;
display: inline-block;
border: none;
}
.hr {
animation: note 0.2s ease-in-out;
animation-iteration-count: infinite;
animation-direction: alternate;
}
.hr-one {
animation-delay: -0.9s;
}
.hr-two {
animation-delay: -0.8s;
}
.hr-three {
animation-delay: -0.7s;
}
.hr-four {
animation-delay: -0.6s;
}
.hr-five {
animation-delay: -0.5s;
}
.hr-six {
animation-delay: -0.4s;
}
.hr-seven {
animation-delay: -0.3s;
}
.hr-eight {
animation-delay: -0.2s;
}
.hr-nine {
animation-delay: -0.1s;
}
@keyframes note {
from {
transform: scaleY(1);
}
to {
transform: scaleY(4);
}
}

@ -1,133 +0,0 @@
SoundRecognizer = {
rec: null,
wave: null,
SampleRate: 16000,
testBitRate: 16,
isCloseRecorder: false,
SendInterval: 300,
realTimeSendTryType: 'pcm',
realTimeSendTryEncBusy: 0,
realTimeSendTryTime: 0,
realTimeSendTryNumber: 0,
transferUploadNumberMax: 0,
realTimeSendTryChunk: null,
soundType: "pcm",
init: function (config) {
this.soundType = config.soundType || 'pcm';
this.SampleRate = config.sampleRate || 16000;
this.recwaveElm = config.recwaveElm || '';
this.TransferUpload = config.translerCallBack || this.TransferProcess;
this.initRecorder();
},
RealTimeSendTryReset: function (type) {
this.realTimeSendTryType = type;
this.realTimeSendTryTime = 0;
},
RealTimeSendTry: function (rec, isClose) {
var that = this;
var t1 = Date.now(), endT = 0, recImpl = Recorder.prototype;
if (this.realTimeSendTryTime == 0) {
this.realTimeSendTryTime = t1;
this.realTimeSendTryEncBusy = 0;
this.realTimeSendTryNumber = 0;
this.transferUploadNumberMax = 0;
this.realTimeSendTryChunk = null;
}
if (!isClose && t1 - this.realTimeSendTryTime < this.SendInterval) {
return; // wait until the buffer has reached the configured send interval before transmitting
}
this.realTimeSendTryTime = t1;
var number = ++this.realTimeSendTryNumber;
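// Reuse SampleData for continuous chunk processing; the sample-rate conversion comes along as a side effect
var chunk = Recorder.SampleData(rec.buffers, rec.srcSampleRate, this.SampleRate, this.realTimeSendTryChunk, { frameType: isClose ? "" : this.realTimeSendTryType });
// Clear buffers that have already been processed to free memory and support long recordings; stop() must not be called when recording finishes, because the data has already been cleared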
for (var i = this.realTimeSendTryChunk ? this.realTimeSendTryChunk.index : 0; i < chunk.index; i++) {
rec.buffers[i] = null;
}
this.realTimeSendTryChunk = chunk;
// No new data, or too little data at close, so mock transcoding is not possible
if (chunk.data.length == 0 || isClose && chunk.data.length < 2000) {
this.TransferUpload(number, null, 0, null, isClose);
return;
}
// Handle a blocked real-time encoding queue
if (!isClose) {
if (this.realTimeSendTryEncBusy >= 2) {
console.log("编码队列阻塞,已丢弃一帧", 1);
return;
}
}
this.realTimeSendTryEncBusy++;
// Transcode in real time to mp3 or wav through the mock method
var encStartTime = Date.now();
var recMock = Recorder({
type: this.realTimeSendTryType
, sampleRate: this.SampleRate // sample rate
, bitRate: this.testBitRate // bit rate
});
recMock.mock(chunk.data, chunk.sampleRate);
recMock.stop(function (blob, duration) {
that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--);
blob.encTime = Date.now() - encStartTime;
// Once transcoding finishes, push the result to the transfer callback
that.TransferUpload(number, blob, duration, recMock, isClose);
}, function (msg) {
that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--);
// Transcoding error? It is unclear when this could ever happen
console.log("不应该出现的错误:" + msg, 1);
});
},
recordClose: function () {
try {
this.rec.close(function () {
this.isCloseRecorder = true;
});
this.RealTimeSendTry(this.rec, true); // final send
} catch (ex) {
// recordClose();
}
},
recordEnd: function () {
try {
this.rec.stop(function (blob, time) {
this.recordClose();
}, function (s) {
this.recordClose();
});
} catch (ex) {
}
},
initRecorder: function () {
var that = this;
var rec = Recorder({
type: that.soundType
, bitRate: that.testBitRate
, sampleRate: that.SampleRate
, onProcess: function (buffers, level, time, sampleRate) {
that.wave.input(buffers[buffers.length - 1], level, sampleRate);
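that.RealTimeSendTry(rec, false); // push into real-time processing; because the format is unknown, the call is simplified and does not use buffers and bufferSampleRate, which are identical to the data in rec.buffers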
}
});
rec.open(function () {
that.wave = Recorder.FrequencyHistogramView({
elem: that.recwaveElm, lineCount: 90
, position: 0
, minHeight: 1
, stripeEnable: false
});
rec.start();
that.isCloseRecorder = false;
that.RealTimeSendTryReset(that.soundType); // reset
});
this.rec = rec;
},
TransferProcess: function (number, blobOrNull, duration, blobRec, isClose) {
}
}
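
The pacing logic in `RealTimeSendTry` above can be restated compactly: a chunk is sent only once `SendInterval` (300 ms) has elapsed since the last attempt, and a frame is dropped when two encodes are already in flight, unless the recorder is closing. A minimal Python sketch of just that gating, with the class and method names being hypothetical and the audio handling left out:

```python
class SendThrottle:
    """Gate sends by elapsed time and by the number of in-flight encodes.

    Mirrors the interval and busy checks of RealTimeSendTry; names are
    hypothetical and no audio data is processed here.
    """

    def __init__(self, interval_ms=300, max_busy=2):
        self.interval_ms = interval_ms
        self.max_busy = max_busy
        self.last_ms = None   # time of the last accepted attempt
        self.busy = 0         # encodes currently in flight

    def try_send(self, now_ms, is_close=False):
        if self.last_ms is None:
            self.last_ms = now_ms
        # Not enough time has passed since the last attempt: keep buffering.
        if not is_close and now_ms - self.last_ms < self.interval_ms:
            return False
        self.last_ms = now_ms
        # Encoder queue is backed up: drop this frame.
        if not is_close and self.busy >= self.max_busy:
            return False
        self.busy += 1
        return True

    def done(self):
        """Call when an encode finishes, successfully or not."""
        if self.busy:
            self.busy -= 1
```

A closing call (`is_close=True`) bypasses both checks, which matches the final send issued from `recordClose`.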

@ -1,6 +0,0 @@
/*
录音
https://github.com/xiangyuecn/Recorder
src: engine/pcm.js
*/
!function(){"use strict";Recorder.prototype.enc_pcm={stable:!0,testmsg:"pcm为未封装的原始音频数据pcm数据文件无法直接播放支持位数8位、16位填在比特率里面采样率取值无限制"},Recorder.prototype.pcm=function(e,t,r){var a=this.set,n=e.length,o=8==a.bitRate?8:16,c=new ArrayBuffer(n*(o/8)),s=new DataView(c),l=0;if(8==o)for(var p=0;p<n;p++,l++){var i=128+(e[p]>>8);s.setInt8(l,i,!0)}else for(p=0;p<n;p++,l+=2)s.setInt16(l,e[p],!0);t(new Blob([s.buffer],{type:"audio/pcm"}))},Recorder.pcm2wav=function(e,a,n){e.slice&&null!=e.type&&(e={blob:e});var o=e.sampleRate||16e3,c=e.bitRate||16;if(e.sampleRate&&e.bitRate||console.warn("pcm2wav必须提供sampleRate和bitRate"),Recorder.prototype.wav){var s=new FileReader;s.onloadend=function(){var e;if(8==c){var t=new Uint8Array(s.result);e=new Int16Array(t.length);for(var r=0;r<t.length;r++)e[r]=t[r]-128<<8}else e=new Int16Array(s.result);Recorder({type:"wav",sampleRate:o,bitRate:c}).mock(e,o).stop(function(e,t){a(e,t)},n)},s.readAsArrayBuffer(e.blob)}else n("pcm2wav必须先加载wav编码器wav.js")}}();
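
The minified `pcm` encoder above writes 16-bit samples as little-endian signed integers, and for an 8-bit setting maps each sample to an unsigned byte with `128 + (sample >> 8)`. A short Python sketch of the same conversion, assuming the input is a sequence of signed 16-bit integers; `encode_pcm` is a hypothetical name:

```python
import struct


def encode_pcm(samples, bits=16):
    """Encode signed 16-bit samples as raw PCM bytes (8-bit or 16-bit)."""
    if bits == 8:
        # Keep the high byte and shift to the unsigned range 0..255.
        return bytes(128 + (s >> 8) for s in samples)
    # Any other setting falls back to 16-bit little-endian, as in the JS code.
    return struct.pack('<%dh' % len(samples), *samples)
```

The output has no header, which is why the original test message notes that raw PCM files cannot be played directly.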

@ -1,6 +0,0 @@
/*
录音
https://github.com/xiangyuecn/Recorder
src: engine/wav.js
*/
!function(){"use strict";Recorder.prototype.enc_wav={stable:!0,testmsg:"支持位数8位、16位填在比特率里面采样率取值无限制"},Recorder.prototype.wav=function(t,e,n){var r=this.set,a=t.length,o=r.sampleRate,f=8==r.bitRate?8:16,i=a*(f/8),s=new ArrayBuffer(44+i),c=new DataView(s),u=0,v=function(t){for(var e=0;e<t.length;e++,u++)c.setUint8(u,t.charCodeAt(e))},w=function(t){c.setUint16(u,t,!0),u+=2},l=function(t){c.setUint32(u,t,!0),u+=4};if(v("RIFF"),l(36+i),v("WAVE"),v("fmt "),l(16),w(1),w(1),l(o),l(o*(f/8)),w(f/8),w(f),v("data"),l(i),8==f)for(var p=0;p<a;p++,u++){var d=128+(t[p]>>8);c.setInt8(u,d,!0)}else for(p=0;p<a;p++,u+=2)c.setInt16(u,t[p],!0);e(new Blob([c.buffer],{type:"audio/wav"}))}}();
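
The minified `wav` encoder above prepends a 44-byte RIFF header for mono PCM before the sample data. The field order can be read off the sequence of writes in that code; the Python sketch below reproduces it with `struct`. The function name `wav_header` is hypothetical:

```python
import struct


def wav_header(num_samples, sample_rate=16000, bits=16):
    """Build the 44-byte RIFF/WAVE header for mono PCM data."""
    bytes_per_sample = bits // 8
    data_size = num_samples * bytes_per_sample
    return struct.pack(
        '<4sI4s4sIHHIIHH4sI',
        b'RIFF', 36 + data_size, b'WAVE',
        b'fmt ', 16,                     # size of the fmt chunk
        1,                               # audio format: 1 = PCM
        1,                               # channel count: mono
        sample_rate,
        sample_rate * bytes_per_sample,  # byte rate
        bytes_per_sample,                # block align
        bits,                            # bits per sample
        b'data', data_size,
    )
```

Concatenating this header with 16-bit little-endian samples yields a file that standard WAV readers, such as Python's `wave` module, can open.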

@ -1,6 +0,0 @@
/*
录音
https://github.com/xiangyuecn/Recorder
src: extensions/frequency.histogram.view.js
*/
!function(){"use strict";var t=function(t){return new e(t)},e=function(t){var e=this,r={scale:2,fps:20,lineCount:30,widthRatio:.6,spaceWidth:0,minHeight:0,position:-1,mirrorEnable:!1,stripeEnable:!0,stripeHeight:3,stripeMargin:6,fallDuration:1e3,stripeFallDuration:3500,linear:[0,"rgba(0,187,17,1)",.5,"rgba(255,215,0,1)",1,"rgba(255,102,0,1)"],stripeLinear:null,shadowBlur:0,shadowColor:"#bbb",stripeShadowBlur:-1,stripeShadowColor:"",onDraw:function(t,e){}};for(var a in t)r[a]=t[a];e.set=t=r;var i=t.elem;i&&("string"==typeof i?i=document.querySelector(i):i.length&&(i=i[0])),i&&(t.width=i.offsetWidth,t.height=i.offsetHeight);var o=t.scale,l=t.width*o,n=t.height*o,h=e.elem=document.createElement("div"),s=["","transform-origin:0 0;","transform:scale("+1/o+");"];h.innerHTML='<div style="width:'+t.width+"px;height:"+t.height+'px;overflow:hidden"><div style="width:'+l+"px;height:"+n+"px;"+s.join("-webkit-")+s.join("-ms-")+s.join("-moz-")+s.join("")+'"><canvas/></div></div>';var f=e.canvas=h.querySelector("canvas");e.ctx=f.getContext("2d");if(f.width=l,f.height=n,i&&(i.innerHTML="",i.appendChild(h)),!Recorder.LibFFT)throw new Error("需要lib.fft.js支持");e.fft=Recorder.LibFFT(1024),e.lastH=[],e.stripesH=[]};e.prototype=t.prototype={genLinear:function(t,e,r,a){for(var i=t.createLinearGradient(0,r,0,a),o=0;o<e.length;)i.addColorStop(e[o++],e[o++]);return i},input:function(t,e,r){var a=this;a.sampleRate=r,a.pcmData=t,a.pcmPos=0,a.inputTime=Date.now(),a.schedule()},schedule:function(){var t=this,e=t.set,r=Math.floor(1e3/e.fps);t.timer||(t.timer=setInterval(function(){t.schedule()},r));var a=Date.now(),i=t.drawTime||0;if(a-t.inputTime>1.3*e.stripeFallDuration)return clearInterval(t.timer),void(t.timer=0);if(!(a-i<r)){t.drawTime=a;for(var o=t.fft.bufferSize,l=t.pcmData,n=t.pcmPos,h=new Int16Array(o),s=0;s<o&&n<l.length;s++,n++)h[s]=l[n];t.pcmPos=n;var f=t.fft.transform(h);t.draw(f,t.sampleRate)}},draw:function(t,e){var 
r=this,a=r.set,i=r.ctx,o=a.scale,l=a.width*o,n=a.height*o,h=a.lineCount,s=r.fft.bufferSize,f=a.position,d=Math.abs(a.position),c=1==f?0:n,p=n;d<1&&(c=p/=2,p=Math.floor(p*(1+d)),c=Math.floor(0<f?c*(1-d):c*(1+d)));for(var u=r.lastH,v=r.stripesH,w=Math.ceil(p/(a.fallDuration/(1e3/a.fps))),g=Math.ceil(p/(a.stripeFallDuration/(1e3/a.fps))),m=a.stripeMargin*o,M=1<<(Math.round(Math.log(s)/Math.log(2)+3)<<1),b=Math.log(M)/Math.log(10),L=20*Math.log(32767)/Math.log(10),y=s/2,S=Math.min(y,Math.floor(5e3*y/(e/2))),C=S==y,H=C?h:Math.round(.8*h),R=S/H,D=C?0:(y-S)/(h-H),x=0,F=0;F<h;F++){var T=Math.ceil(x);x+=F<H?R:D;for(var B=Math.min(Math.ceil(x),y),E=0,j=T;j<B;j++)E=Math.max(E,Math.abs(t[j]));var I=M<E?Math.floor(17*(Math.log(E)/Math.log(10)-b)):0,q=p*Math.min(I/L,1);u[F]=(u[F]||0)-w,q<u[F]&&(q=u[F]),q<0&&(q=0),u[F]=q;var z=v[F]||0;if(q&&z<q+m)v[F]=q+m;else{var P=z-g;P<0&&(P=0),v[F]=P}}i.clearRect(0,0,l,n);var W=r.genLinear(i,a.linear,c,c-p),k=a.stripeLinear&&r.genLinear(i,a.stripeLinear,c,c-p)||W,A=r.genLinear(i,a.linear,c,c+p),G=a.stripeLinear&&r.genLinear(i,a.stripeLinear,c,c+p)||A;i.shadowBlur=a.shadowBlur*o,i.shadowColor=a.shadowColor;var V=a.mirrorEnable,J=V?2*h-1:h,K=a.widthRatio,N=a.spaceWidth*o;0!=N&&(K=(l-N*(J+1))/l);for(var O=Math.max(1*o,Math.floor(l*K/J)),Q=(l-J*O)/(J+1),U=a.minHeight*o,X=V?l/2-(Q+O/2):0,Y=(F=0,X);F<h;F++)Y+=Q,$=Math.floor(Y),q=Math.max(u[F],U),0!=c&&(_=c-q,i.fillStyle=W,i.fillRect($,_,O,q)),c!=n&&(i.fillStyle=A,i.fillRect($,c,O,q)),Y+=O;if(a.stripeEnable){var Z=a.stripeShadowBlur;i.shadowBlur=(-1==Z?a.shadowBlur:Z)*o,i.shadowColor=a.stripeShadowColor||a.shadowColor;var $,_,tt=a.stripeHeight*o;for(F=0,Y=X;F<h;F++)Y+=Q,$=Math.floor(Y),q=v[F],0!=c&&((_=c-q-tt)<0&&(_=0),i.fillStyle=k,i.fillRect($,_,O,tt)),c!=n&&(n<(_=c+q)+tt&&(_=n-tt),i.fillStyle=G,i.fillRect($,_,O,tt)),Y+=O}if(V){var et=Math.floor(l/2);i.save(),i.scale(-1,1),i.drawImage(r.canvas,Math.ceil(l/2),0,et,n,-et,0,et,n),i.restore()}a.onDraw(t,e)}},Recorder.FrequencyHistogramView=t}();

@ -1,6 +0,0 @@
/*
录音
https://github.com/xiangyuecn/Recorder
src: extensions/lib.fft.js
*/
Recorder.LibFFT=function(r){"use strict";var s,v,d,l,F,b,g,m;return function(r){var o,t,a,f;for(s=Math.round(Math.log(r)/Math.log(2)),d=((v=1<<s)<<2)*Math.sqrt(2),l=[],F=[],b=[0],g=[0],m=[],o=0;o<v;o++){for(a=o,f=t=0;t!=s;t++)f<<=1,f|=1&a,a>>>=1;m[o]=f}var n,u=2*Math.PI/v;for(o=(v>>1)-1;0<o;o--)n=o*u,g[o]=Math.cos(n),b[o]=Math.sin(n)}(r),{transform:function(r){var o,t,a,f,n,u,e,h,M=1,i=s-1;for(o=0;o!=v;o++)l[o]=r[m[o]],F[o]=0;for(o=s;0!=o;o--){for(t=0;t!=M;t++)for(n=g[t<<i],u=b[t<<i],a=t;a<v;a+=M<<1)e=n*l[f=a+M]-u*F[f],h=n*F[f]+u*l[f],l[f]=l[a]-e,F[f]=F[a]-h,l[a]+=e,F[a]+=h;M<<=1,i--}t=v>>1;var c=new Float64Array(t);for(n=-(u=d),o=t;0!=o;o--)e=l[o],h=F[o],c[o-1]=n<e&&e<u&&n<h&&h<u?0:Math.round(e*e+h*h);return c},bufferSize:v}};

@ -1,156 +0,0 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>PaddleSpeech Serving-语音实时转写</title>
<link rel="shortcut icon" href="./static/paddle.ico">
<script src="../static/js/jquery-3.2.1.min.js"></script>
<script src="../static/js/recorder/recorder-core.js"></script>
<script src="../static/js/recorder/extensions/lib.fft.js"></script>
<script src="../static/js/recorder/extensions/frequency.histogram.view.js"></script>
<script src="../static/js/recorder/engine/pcm.js"></script>
<script src="../static/js/SoundRecognizer.js"></script>
<link rel="stylesheet" href="../static/css/style.css">
<link rel="stylesheet" href="../static/css/font-awesome.min.css">
</head>
<body>
<div class="asr-content">
<div class="audio-banner">
<div class="weaper">
<div class="text-content">
<p><span class="title">PaddleSpeech Serving简介</span></p>
<p class="con-container">
<span class="con">PaddleSpeech 是基于飞桨 PaddlePaddle 的语音方向的开源模型库用于语音和音频中的各种关键任务的开发。PaddleSpeech Serving是基于python + fastapi 的语音算法模型的C/S类型后端服务旨在统一paddle speech下的各语音算子来对外提供后端服务。</span>
</p>
</div>
<div class="img-con">
<img src="../static/image/PaddleSpeech_logo.png" alt="" />
</div>
</div>
</div>
<div class="audio-experience">
<div class="asr-box">
<h2>产品体验</h2>
<div id="client-word-recorder" style="position: relative;">
<div class="pd">
<div style="text-align:center;height:20px;width:100%;
border:0px solid #bcbcbc;color:#000;box-sizing: border-box;display:inline-block"
class="recwave">
</div>
</div>
</div>
<div class="voice-container">
<div class="voice-input">
<span>WebSocket URL</span>
<input type="text" id="socketUrl" class="websocket-url" value="ws://127.0.0.1:8091/ws/asr"
placeholder="请输入服务器地址ws://127.0.0.1:8091/ws/asr">
<div class="start-voice">
<button type="primary" id="beginBtn" class="voice-btn">
<span class="fa fa-microphone"> 开始识别</span>
</button>
<button type="primary" id="endBtn" class="voice-btn end">
<span class="fa fa-microphone-slash"> 结束识别</span>
</button>
<div id="timeBox" class="time-box flex-display-1">
<span class="total-time">识别中,<i id="timeCount"></i> 秒后自动停止识别</span>
</div>
</div>
</div>
<div class="voice">
<div class="result-text" id="resultPanel">此处显示识别结果</div>
</div>
</div>
</div>
</div>
</div>
<script>
var wenetWs = null
var timeLoop = null
var result = ""
$(document).ready(function () {
$('#beginBtn').on('click', startRecording)
$('#endBtn').on('click', stopRecording)
})
function openWebSocket(url) {
if ("WebSocket" in window) {
wenetWs = new WebSocket(url)
wenetWs.onopen = function () {
console.log("Websocket 连接成功,开始识别")
wenetWs.send(JSON.stringify({
"signal": "start"
}))
}
wenetWs.onmessage = function (_msg) { parseResult(_msg.data) }
wenetWs.onclose = function () {
console.log("WebSocket 连接断开")
}
wenetWs.onerror = function () { console.log("WebSocket 连接失败") }
}
}
function parseResult(data) {
var data = JSON.parse(data)
console.log('result json:', data)
var result = data.result
console.log(result)
$("#resultPanel").html(result)
}
function TransferUpload(number, blobOrNull, duration, blobRec, isClose) {
if (blobOrNull) {
var blob = blobOrNull
var encTime = blob.encTime
var reader = new FileReader()
reader.onloadend = function () { wenetWs.send(reader.result) }
reader.readAsArrayBuffer(blob)
}
}
function startRecording() {
// Check socket url
var socketUrl = $('#socketUrl').val()
if (!socketUrl.trim()) {
alert('请输入 WebSocket 服务器地址ws://127.0.0.1:8091/ws/asr')
$('#socketUrl').focus()
return
}
// init recorder
SoundRecognizer.init({
soundType: 'pcm',
sampleRate: 16000,
recwaveElm: '.recwave',
translerCallBack: TransferUpload
})
openWebSocket(socketUrl)
// Change button state
$('#beginBtn').hide()
$('#endBtn, #timeBox').addClass('show')
// Start countdown
var seconds = 180
$('#timeCount').text(seconds)
timeLoop = setInterval(function () {
seconds--
$('#timeCount').text(seconds)
if (seconds === 0) {
stopRecording()
}
}, 1000)
}
function stopRecording() {
wenetWs.send(JSON.stringify({ "signal": "end" }))
SoundRecognizer.recordClose()
$('#endBtn').add($('#timeBox')).removeClass('show')
$('#beginBtn').show()
$('#timeCount').text('')
clearInterval(timeLoop)
}
</script>
</body>
</html>
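
The script in this removed page shows the message flow the browser client used against `ws://127.0.0.1:8091/ws/asr`: a JSON `{"signal": "start"}` text frame on open, binary PCM chunks while recording, a JSON `{"signal": "end"}` frame on stop, and server replies whose `result` field holds the transcript. A minimal Python sketch of that framing, with no network code; the function names are hypothetical, the 9600-byte chunk size is an assumption (300 ms of 16 kHz, 16-bit mono audio), and the ordering is simplified so that `end` always comes last:

```python
import json


def build_frames(pcm_bytes, chunk_bytes=9600):
    """Yield the frames a client sends for one utterance, in order."""
    yield json.dumps({"signal": "start"})       # text frame
    for i in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[i:i + chunk_bytes]      # binary frame of raw PCM
    yield json.dumps({"signal": "end"})         # text frame


def parse_result(message):
    """Return the `result` field of a server message, or None if absent."""
    return json.loads(message).get("result")
```

Each yielded `str` would go out as a WebSocket text frame and each `bytes` object as a binary frame; any WebSocket client library could carry them.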

@ -22,6 +22,7 @@ onnxruntime
pandas pandas
paddlenlp paddlenlp
paddlespeech_feat paddlespeech_feat
Pillow>=9.0.0
praatio==5.0.0 praatio==5.0.0
pypinyin pypinyin
pypinyin-dict pypinyin-dict

@ -10,7 +10,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz)| Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers| 0.0554 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) | inference/python | [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz)| Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers| 0.0554 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) | inference/python |
[Conformer Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz) | WenetSpeech Dataset | Char-based | 457 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.11 (test\_net) 0.1879 (test\_meeting) |-| 10000 h |- | python | [Conformer Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz) | WenetSpeech Dataset | Char-based | 457 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.11 (test\_net) 0.1879 (test\_meeting) |-| 10000 h |- | python |
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) | python | [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) | python |
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0464 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) | python | [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_1.0.1.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0460 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) | python |
[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1) | python | [Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1) | python |
[Ds2 Offline Librispeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_offline_librispeech_ckpt_1.0.1.model.tar.gz)| Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers| - |0.0467| 960 h | [Ds2 Offline Librispeech ASR0](../../examples/librispeech/asr0) | inference/python | [Ds2 Offline Librispeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_offline_librispeech_ckpt_1.0.1.model.tar.gz)| Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers| - |0.0467| 960 h | [Ds2 Offline Librispeech ASR0](../../examples/librispeech/asr0) | inference/python |
[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0338 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1) | python | [Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0338 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1) | python |

@ -2,13 +2,13 @@
## Conformer ## Conformer
paddle version: 2.2.2 paddle version: 2.2.2
paddlespeech version: 0.2.0 paddlespeech version: 1.0.1
| Model | Params | Config | Augmentation| Test set | Decode method | Loss | CER | | Model | Params | Config | Augmentation| Test set | Decode method | Loss | CER |
| --- | --- | --- | --- | --- | --- | --- | --- | | --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention | - | 0.0530 | | conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention | - | 0.0522 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | ctc_greedy_search | - | 0.0495 | | conformer | 47.07M | conf/conformer.yaml | spec_aug | test | ctc_greedy_search | - | 0.0481 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug| test | ctc_prefix_beam_search | - | 0.0494 | | conformer | 47.07M | conf/conformer.yaml | spec_aug| test | ctc_prefix_beam_search | - | 0.0480 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention_rescoring | - | 0.0464 | | conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention_rescoring | - | 0.0460 |
## Conformer Streaming ## Conformer Streaming

@ -57,7 +57,7 @@ feat_dim: 80
stride_ms: 10.0 stride_ms: 10.0
window_ms: 25.0 window_ms: 25.0
sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 64 batch_size: 32
maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced
maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced
minibatches: 0 # for debug minibatches: 0 # for debug
@ -73,10 +73,10 @@ num_encs: 1
########################################### ###########################################
# Training # # Training #
########################################### ###########################################
n_epoch: 240 n_epoch: 150
accum_grad: 2 accum_grad: 8
global_grad_clip: 5.0 global_grad_clip: 5.0
dist_sampler: True dist_sampler: False
optim: adam optim: adam
optim_conf: optim_conf:
lr: 0.002 lr: 0.002
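
The hunk above halves `batch_size` (64 to 32) and quadruples `accum_grad` (2 to 8). A minimal sketch of the arithmetic, assuming the nominal number of samples per optimizer step is `batch_size * accum_grad * n_devices` (the helper name and the `n_devices` parameter are hypothetical, and the automatic reduction triggered by `maxlen_in`/`maxlen_out` is ignored):

```python
def effective_batch_size(batch_size: int, accum_grad: int, n_devices: int = 1) -> int:
    """Nominal samples per optimizer step under the assumed formula."""
    return batch_size * accum_grad * n_devices


# Values from the old and new sides of the config hunk, on a single device.
old = effective_batch_size(batch_size=64, accum_grad=2)
new = effective_batch_size(batch_size=32, accum_grad=8)
print(old, new)  # 128 256
```

Under that assumption the per-step memory footprint drops while the nominal batch per update doubles.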

@ -144,3 +144,34 @@ optional arguments:
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model ## Pretrained Model
The pretrained model can be downloaded here:
- [vits_csmsc_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/vits/vits_csmsc_ckpt_1.1.0.zip) (add_blank=true)
The VITS checkpoint contains the files listed below.
```text
vits_csmsc_ckpt_1.1.0
├── default.yaml # default config used to train vits
├── phone_id_map.txt # phone vocabulary file when training vits
└── snapshot_iter_350000.pdz # model parameters and optimizer states
```
Note: this checkpoint is not good enough yet; a better one is still being trained.
You can use the following script to synthesize speech for the sentences in `${BIN_DIR}/../sentences.txt` using the pretrained VITS model.
```bash
source path.sh
add_blank=true
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
--config=vits_csmsc_ckpt_1.1.0/default.yaml \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
--phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
--output_dir=exp/default/test_e2e \
--text=${BIN_DIR}/../sentences.txt \
--add-blank=${add_blank}
```

@ -3,6 +3,11 @@
config_path=$1 config_path=$1
train_output_path=$2 train_output_path=$2
# install monotonic_align
cd ${MAIN_ROOT}/paddlespeech/t2s/models/vits/monotonic_align
python3 setup.py build_ext --inplace
cd -
python3 ${BIN_DIR}/train.py \ python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \ --train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \ --dev-metadata=dump/dev/norm/metadata.jsonl \

@ -74,7 +74,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# convert the m4a to wav # convert the m4a to wav
# and we will not delete the original m4a file # and we will not delete the original m4a file
echo "start to convert the m4a to wav" echo "start to convert the m4a to wav"
bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/test/ || exit 1; bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/ || exit 1;
if [ $? -ne 0 ]; then if [ $? -ne 0 ]; then
echo "Convert voxceleb2 dataset from m4a to wav failed. Terminated." echo "Convert voxceleb2 dataset from m4a to wav failed. Terminated."

@ -14,10 +14,8 @@
# Modified from espnet(https://github.com/espnet/espnet) # Modified from espnet(https://github.com/espnet/espnet)
"""Spec Augment module for preprocessing i.e., data augmentation""" """Spec Augment module for preprocessing i.e., data augmentation"""
import random import random
import numpy import numpy
from PIL import Image from PIL import Image
from PIL.Image import BICUBIC
from .functional import FuncTrans from .functional import FuncTrans
@ -46,9 +44,10 @@ def time_warp(x, max_time_warp=80, inplace=False, mode="PIL"):
warped = random.randrange(center - window, center + warped = random.randrange(center - window, center +
window) + 1 # 1 ... t - 1 window) + 1 # 1 ... t - 1
left = Image.fromarray(x[:center]).resize((x.shape[1], warped), BICUBIC) left = Image.fromarray(x[:center]).resize((x.shape[1], warped),
Image.BICUBIC)
right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped), right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped),
BICUBIC) Image.BICUBIC)
if inplace: if inplace:
x[:warped] = left x[:warped] = left
x[warped:] = right x[warped:] = right

@ -133,11 +133,11 @@ class ASRExecutor(BaseExecutor):
""" """
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
logger.info("start to init the model") logger.debug("start to init the model")
# default max_len: unit:second # default max_len: unit:second
self.max_len = 50 self.max_len = 50
if hasattr(self, 'model'): if hasattr(self, 'model'):
logger.info('Model had been initialized.') logger.debug('Model had been initialized.')
return return
if cfg_path is None or ckpt_path is None: if cfg_path is None or ckpt_path is None:
@ -151,15 +151,15 @@ class ASRExecutor(BaseExecutor):
self.ckpt_path = os.path.join( self.ckpt_path = os.path.join(
self.res_path, self.res_path,
self.task_resource.res_dict['ckpt_path'] + ".pdparams") self.task_resource.res_dict['ckpt_path'] + ".pdparams")
logger.info(self.res_path) logger.debug(self.res_path)
else: else:
self.cfg_path = os.path.abspath(cfg_path) self.cfg_path = os.path.abspath(cfg_path)
self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams") self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
self.res_path = os.path.dirname( self.res_path = os.path.dirname(
os.path.dirname(os.path.abspath(self.cfg_path))) os.path.dirname(os.path.abspath(self.cfg_path)))
logger.info(self.cfg_path) logger.debug(self.cfg_path)
logger.info(self.ckpt_path) logger.debug(self.ckpt_path)
#Init body. #Init body.
self.config = CfgNode(new_allowed=True) self.config = CfgNode(new_allowed=True)
@ -216,7 +216,7 @@ class ASRExecutor(BaseExecutor):
max_len = self.config.encoder_conf.max_len max_len = self.config.encoder_conf.max_len
self.max_len = frame_shift_ms * max_len * subsample_rate self.max_len = frame_shift_ms * max_len * subsample_rate
logger.info( logger.debug(
f"The asr server limit max duration len: {self.max_len}") f"The asr server limit max duration len: {self.max_len}")
def preprocess(self, model_type: str, input: Union[str, os.PathLike]): def preprocess(self, model_type: str, input: Union[str, os.PathLike]):
@ -227,15 +227,15 @@ class ASRExecutor(BaseExecutor):
audio_file = input audio_file = input
if isinstance(audio_file, (str, os.PathLike)): if isinstance(audio_file, (str, os.PathLike)):
logger.info("Preprocess audio_file:" + audio_file) logger.debug("Preprocess audio_file:" + audio_file)
# Get the object for feature extraction # Get the object for feature extraction
if "deepspeech2" in model_type or "conformer" in model_type or "transformer" in model_type: if "deepspeech2" in model_type or "conformer" in model_type or "transformer" in model_type:
logger.info("get the preprocess conf") logger.debug("get the preprocess conf")
preprocess_conf = self.config.preprocess_config preprocess_conf = self.config.preprocess_config
preprocess_args = {"train": False} preprocess_args = {"train": False}
preprocessing = Transformation(preprocess_conf) preprocessing = Transformation(preprocess_conf)
logger.info("read the audio file") logger.debug("read the audio file")
audio, audio_sample_rate = soundfile.read( audio, audio_sample_rate = soundfile.read(
audio_file, dtype="int16", always_2d=True) audio_file, dtype="int16", always_2d=True)
if self.change_format: if self.change_format:
@ -255,7 +255,7 @@ class ASRExecutor(BaseExecutor):
else: else:
audio = audio[:, 0] audio = audio[:, 0]
logger.info(f"audio shape: {audio.shape}") logger.debug(f"audio shape: {audio.shape}")
# fbank # fbank
audio = preprocessing(audio, **preprocess_args) audio = preprocessing(audio, **preprocess_args)
@ -264,19 +264,19 @@ class ASRExecutor(BaseExecutor):
self._inputs["audio"] = audio self._inputs["audio"] = audio
self._inputs["audio_len"] = audio_len self._inputs["audio_len"] = audio_len
logger.info(f"audio feat shape: {audio.shape}") logger.debug(f"audio feat shape: {audio.shape}")
else: else:
raise Exception("wrong type") raise Exception("wrong type")
logger.info("audio feat process success") logger.debug("audio feat process success")
@paddle.no_grad() @paddle.no_grad()
def infer(self, model_type: str): def infer(self, model_type: str):
""" """
Model inference and result stored in self.output. Model inference and result stored in self.output.
""" """
logger.info("start to infer the model to get the output") logger.debug("start to infer the model to get the output")
cfg = self.config.decode cfg = self.config.decode
audio = self._inputs["audio"] audio = self._inputs["audio"]
audio_len = self._inputs["audio_len"] audio_len = self._inputs["audio_len"]
@ -293,7 +293,7 @@ class ASRExecutor(BaseExecutor):
self._outputs["result"] = result_transcripts[0] self._outputs["result"] = result_transcripts[0]
elif "conformer" in model_type or "transformer" in model_type: elif "conformer" in model_type or "transformer" in model_type:
logger.info( logger.debug(
f"we will use the transformer like model : {model_type}") f"we will use the transformer like model : {model_type}")
try: try:
result_transcripts = self.model.decode( result_transcripts = self.model.decode(
@ -352,7 +352,7 @@ class ASRExecutor(BaseExecutor):
logger.error("Please input the right audio file path") logger.error("Please input the right audio file path")
return False return False
logger.info("checking the audio file format......") logger.debug("checking the audio file format......")
try: try:
audio, audio_sample_rate = soundfile.read( audio, audio_sample_rate = soundfile.read(
audio_file, dtype="int16", always_2d=True) audio_file, dtype="int16", always_2d=True)
@ -374,7 +374,7 @@ class ASRExecutor(BaseExecutor):
sox input_audio.xx --rate 8k --bits 16 --channels 1 output_audio.wav \n \ sox input_audio.xx --rate 8k --bits 16 --channels 1 output_audio.wav \n \
") ")
return False return False
logger.info("The sample rate is %d" % audio_sample_rate) logger.debug("The sample rate is %d" % audio_sample_rate)
if audio_sample_rate != self.sample_rate: if audio_sample_rate != self.sample_rate:
logger.warning("The sample rate of the input file is not {}.\n \ logger.warning("The sample rate of the input file is not {}.\n \
The program will resample the wav file to {}.\n \ The program will resample the wav file to {}.\n \
@ -383,28 +383,28 @@ class ASRExecutor(BaseExecutor):
".format(self.sample_rate, self.sample_rate)) ".format(self.sample_rate, self.sample_rate))
if force_yes is False: if force_yes is False:
while (True): while (True):
logger.info( logger.debug(
"Whether to change the sample rate and the channel. Y: change the sample. N: exit the prgream." "Whether to change the sample rate and the channel. Y: change the sample. N: exit the prgream."
) )
content = input("Input(Y/N):") content = input("Input(Y/N):")
if content.strip() == "Y" or content.strip( if content.strip() == "Y" or content.strip(
) == "y" or content.strip() == "yes" or content.strip( ) == "y" or content.strip() == "yes" or content.strip(
) == "Yes": ) == "Yes":
logger.info( logger.debug(
"change the sampele rate, channel to 16k and 1 channel" "change the sampele rate, channel to 16k and 1 channel"
) )
break break
elif content.strip() == "N" or content.strip( elif content.strip() == "N" or content.strip(
) == "n" or content.strip() == "no" or content.strip( ) == "n" or content.strip() == "no" or content.strip(
) == "No": ) == "No":
logger.info("Exit the program") logger.debug("Exit the program")
return False return False
else: else:
logger.warning("Not regular input, please input again") logger.warning("Not regular input, please input again")
self.change_format = True self.change_format = True
else: else:
logger.info("The audio file format is right") logger.debug("The audio file format is right")
self.change_format = False self.change_format = False
return True return True

@ -92,7 +92,7 @@ class CLSExecutor(BaseExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
if hasattr(self, 'model'): if hasattr(self, 'model'):
logger.info('Model had been initialized.') logger.debug('Model had been initialized.')
return return
if label_file is None or ckpt_path is None: if label_file is None or ckpt_path is None:
@ -135,14 +135,14 @@ class CLSExecutor(BaseExecutor):
Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet). Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet).
""" """
feat_conf = self._conf['feature'] feat_conf = self._conf['feature']
logger.info(feat_conf) logger.debug(feat_conf)
waveform, _ = load( waveform, _ = load(
file=audio_file, file=audio_file,
sr=feat_conf['sample_rate'], sr=feat_conf['sample_rate'],
mono=True, mono=True,
dtype='float32') dtype='float32')
if isinstance(audio_file, (str, os.PathLike)): if isinstance(audio_file, (str, os.PathLike)):
logger.info("Preprocessing audio_file:" + audio_file) logger.debug("Preprocessing audio_file:" + audio_file)
# Feature extraction # Feature extraction
feature_extractor = LogMelSpectrogram( feature_extractor = LogMelSpectrogram(

@ -61,7 +61,7 @@ def _get_unique_endpoints(trainer_endpoints):
continue continue
ips.add(ip) ips.add(ip)
unique_endpoints.add(endpoint) unique_endpoints.add(endpoint)
logger.info("unique_endpoints {}".format(unique_endpoints)) logger.debug("unique_endpoints {}".format(unique_endpoints))
return unique_endpoints return unique_endpoints
@ -96,7 +96,7 @@ def get_path_from_url(url,
# data, and the same ip will only download data once. # data, and the same ip will only download data once.
unique_endpoints = _get_unique_endpoints(ParallelEnv().trainer_endpoints[:]) unique_endpoints = _get_unique_endpoints(ParallelEnv().trainer_endpoints[:])
if osp.exists(fullpath) and check_exist and _md5check(fullpath, md5sum): if osp.exists(fullpath) and check_exist and _md5check(fullpath, md5sum):
logger.info("Found {}".format(fullpath)) logger.debug("Found {}".format(fullpath))
else: else:
if ParallelEnv().current_endpoint in unique_endpoints: if ParallelEnv().current_endpoint in unique_endpoints:
fullpath = _download(url, root_dir, md5sum, method=method) fullpath = _download(url, root_dir, md5sum, method=method)
@ -118,7 +118,7 @@ def _get_download(url, fullname):
try: try:
req = requests.get(url, stream=True) req = requests.get(url, stream=True)
except Exception as e: # requests.exceptions.ConnectionError except Exception as e: # requests.exceptions.ConnectionError
logger.info("Downloading {} from {} failed with exception {}".format( logger.debug("Downloading {} from {} failed with exception {}".format(
fname, url, str(e))) fname, url, str(e)))
return False return False
@ -190,7 +190,7 @@ def _download(url, path, md5sum=None, method='get'):
fullname = osp.join(path, fname) fullname = osp.join(path, fname)
retry_cnt = 0 retry_cnt = 0
logger.info("Downloading {} from {}".format(fname, url)) logger.debug("Downloading {} from {}".format(fname, url))
while not (osp.exists(fullname) and _md5check(fullname, md5sum)): while not (osp.exists(fullname) and _md5check(fullname, md5sum)):
if retry_cnt < DOWNLOAD_RETRY_LIMIT: if retry_cnt < DOWNLOAD_RETRY_LIMIT:
retry_cnt += 1 retry_cnt += 1
@ -209,7 +209,7 @@ def _md5check(fullname, md5sum=None):
if md5sum is None: if md5sum is None:
return True return True
logger.info("File {} md5 checking...".format(fullname)) logger.debug("File {} md5 checking...".format(fullname))
md5 = hashlib.md5() md5 = hashlib.md5()
with open(fullname, 'rb') as f: with open(fullname, 'rb') as f:
for chunk in iter(lambda: f.read(4096), b""): for chunk in iter(lambda: f.read(4096), b""):
@ -217,8 +217,8 @@ def _md5check(fullname, md5sum=None):
calc_md5sum = md5.hexdigest() calc_md5sum = md5.hexdigest()
if calc_md5sum != md5sum: if calc_md5sum != md5sum:
logger.info("File {} md5 check failed, {}(calc) != " logger.debug("File {} md5 check failed, {}(calc) != "
"{}(base)".format(fullname, calc_md5sum, md5sum)) "{}(base)".format(fullname, calc_md5sum, md5sum))
return False return False
return True return True
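
The checksum logic in `_md5check` reads the file in 4096-byte chunks rather than loading it whole. A standalone sketch of the same pattern using only the standard library; the function name `md5_matches` and the temporary file are illustrative and not part of the repository:

```python
import hashlib
import os
import tempfile


def md5_matches(path, expected_md5=None, chunk_size=4096):
    """Return True if the file's MD5 equals expected_md5, or if none is given."""
    if expected_md5 is None:
        return True
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5


# Exercise the helper on a throwaway file.
payload = b"paddlespeech" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

good = md5_matches(tmp_path, hashlib.md5(payload).hexdigest())
bad = md5_matches(tmp_path, "0" * 32)
skipped = md5_matches(tmp_path)
os.remove(tmp_path)
```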
@ -227,7 +227,7 @@ def _decompress(fname):
""" """
Decompress for zip and tar file Decompress for zip and tar file
""" """
logger.info("Decompressing {}...".format(fname)) logger.debug("Decompressing {}...".format(fname))
# For protecting decompressing interrupted, # For protecting decompressing interrupted,
# decompress to fpath_tmp directory firstly, if decompress # decompress to fpath_tmp directory firstly, if decompress

@ -217,7 +217,7 @@ class BaseExecutor(ABC):
logging.getLogger(name) for name in logging.root.manager.loggerDict logging.getLogger(name) for name in logging.root.manager.loggerDict
] ]
for l in loggers: for l in loggers:
l.disabled = True l.setLevel(logging.ERROR)
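
The change above swaps `l.disabled = True` for `l.setLevel(logging.ERROR)`. A small sketch of the behavioural difference, using only the standard `logging` module; the `ListHandler` class, the `collect` helper, and the logger names are hypothetical and exist only for this illustration:

```python
import logging


class ListHandler(logging.Handler):
    """Collects messages in a list so the effect can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def collect(name, configure):
    """Log one INFO and one ERROR message after applying `configure`."""
    logger = logging.getLogger(name)
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    handler = ListHandler()
    logger.addHandler(handler)
    configure(logger)
    logger.info("loading model")
    logger.error("checkpoint missing")
    return handler.messages


def disable(logger):
    logger.disabled = True  # old behaviour: everything is dropped


def errors_only(logger):
    logger.setLevel(logging.ERROR)  # new behaviour: errors still surface


silenced = collect("demo.disabled", disable)
errors = collect("demo.errors_only", errors_only)
```

With `disabled` nothing gets through, including errors; with the ERROR threshold only the informational message is suppressed.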
def show_rtf(self, info: Dict[str, List[float]]): def show_rtf(self, info: Dict[str, List[float]]):
""" """

@ -88,7 +88,7 @@ class KWSExecutor(BaseExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
if hasattr(self, 'model'): if hasattr(self, 'model'):
logger.info('Model had been initialized.') logger.debug('Model had been initialized.')
return return
if ckpt_path is None: if ckpt_path is None:
@ -141,7 +141,7 @@ class KWSExecutor(BaseExecutor):
assert os.path.isfile(audio_file) assert os.path.isfile(audio_file)
waveform, _ = load(audio_file) waveform, _ = load(audio_file)
if isinstance(audio_file, (str, os.PathLike)): if isinstance(audio_file, (str, os.PathLike)):
logger.info("Preprocessing audio_file:" + audio_file) logger.debug("Preprocessing audio_file:" + audio_file)
# Feature extraction # Feature extraction
waveform = paddle.to_tensor(waveform).unsqueeze(0) waveform = paddle.to_tensor(waveform).unsqueeze(0)

@ -49,7 +49,7 @@ class Logger(object):
self.handler.setFormatter(self.format) self.handler.setFormatter(self.format)
self.logger.addHandler(self.handler) self.logger.addHandler(self.handler)
self.logger.setLevel(logging.DEBUG) self.logger.setLevel(logging.INFO)
self.logger.propagate = False self.logger.propagate = False
def __call__(self, log_level: str, msg: str): def __call__(self, log_level: str, msg: str):
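
Lowering the CLI logger from `DEBUG` to `INFO` is what makes the `logger.info` to `logger.debug` conversions elsewhere in this commit take effect. A minimal sketch with the standard `logging` module, assuming a plain stream handler; the logger name and the message texts are illustrative:

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("cli_level_demo")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)  # threshold chosen in the hunk above
logger.propagate = False

logger.debug("start to init the model")  # below the threshold, dropped
logger.info("result ready")              # at the threshold, kept

output = stream.getvalue()
```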

@ -110,7 +110,7 @@ class STExecutor(BaseExecutor):
""" """
decompressed_path = download_and_decompress(self.kaldi_bins, MODEL_HOME) decompressed_path = download_and_decompress(self.kaldi_bins, MODEL_HOME)
decompressed_path = os.path.abspath(decompressed_path) decompressed_path = os.path.abspath(decompressed_path)
logger.info("Kaldi_bins stored in: {}".format(decompressed_path)) logger.debug("Kaldi_bins stored in: {}".format(decompressed_path))
if "LD_LIBRARY_PATH" in os.environ: if "LD_LIBRARY_PATH" in os.environ:
os.environ["LD_LIBRARY_PATH"] += f":{decompressed_path}" os.environ["LD_LIBRARY_PATH"] += f":{decompressed_path}"
else: else:
@ -128,7 +128,7 @@ class STExecutor(BaseExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
if hasattr(self, 'model'): if hasattr(self, 'model'):
logger.info('Model had been initialized.') logger.debug('Model had been initialized.')
return return
if cfg_path is None or ckpt_path is None: if cfg_path is None or ckpt_path is None:
@ -140,8 +140,8 @@ class STExecutor(BaseExecutor):
self.ckpt_path = os.path.join( self.ckpt_path = os.path.join(
self.task_resource.res_dir, self.task_resource.res_dir,
self.task_resource.res_dict['ckpt_path']) self.task_resource.res_dict['ckpt_path'])
logger.info(self.cfg_path) logger.debug(self.cfg_path)
logger.info(self.ckpt_path) logger.debug(self.ckpt_path)
res_path = self.task_resource.res_dir res_path = self.task_resource.res_dir
else: else:
self.cfg_path = os.path.abspath(cfg_path) self.cfg_path = os.path.abspath(cfg_path)
@ -192,7 +192,7 @@ class STExecutor(BaseExecutor):
Input content can be a file(wav). Input content can be a file(wav).
""" """
audio_file = os.path.abspath(wav_file) audio_file = os.path.abspath(wav_file)
logger.info("Preprocess audio_file:" + audio_file) logger.debug("Preprocess audio_file:" + audio_file)
if "fat_st" in model_type: if "fat_st" in model_type:
cmvn = self.config.cmvn_path cmvn = self.config.cmvn_path

@ -98,7 +98,7 @@ class TextExecutor(BaseExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
if hasattr(self, 'model'): if hasattr(self, 'model'):
logger.info('Model had been initialized.') logger.debug('Model had been initialized.')
return return
self.task = task self.task = task

@ -173,16 +173,23 @@ class TTSExecutor(BaseExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'): if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'):
logger.info('Models had been initialized.') logger.debug('Models had been initialized.')
return return
# am # am
if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None:
use_pretrained_am = True
else:
use_pretrained_am = False
am_tag = am + '-' + lang am_tag = am + '-' + lang
self.task_resource.set_task_model( self.task_resource.set_task_model(
model_tag=am_tag, model_tag=am_tag,
model_type=0, # am model_type=0, # am
skip_download=not use_pretrained_am,
version=None, # default version version=None, # default version
) )
if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None: if use_pretrained_am:
self.am_res_path = self.task_resource.res_dir self.am_res_path = self.task_resource.res_dir
self.am_config = os.path.join(self.am_res_path, self.am_config = os.path.join(self.am_res_path,
self.task_resource.res_dict['config']) self.task_resource.res_dict['config'])
@ -193,9 +200,9 @@ class TTSExecutor(BaseExecutor):
# must have phones_dict in acoustic # must have phones_dict in acoustic
self.phones_dict = os.path.join( self.phones_dict = os.path.join(
self.am_res_path, self.task_resource.res_dict['phones_dict']) self.am_res_path, self.task_resource.res_dict['phones_dict'])
logger.info(self.am_res_path) logger.debug(self.am_res_path)
logger.info(self.am_config) logger.debug(self.am_config)
logger.info(self.am_ckpt) logger.debug(self.am_ckpt)
else: else:
self.am_config = os.path.abspath(am_config) self.am_config = os.path.abspath(am_config)
self.am_ckpt = os.path.abspath(am_ckpt) self.am_ckpt = os.path.abspath(am_ckpt)
@ -220,13 +227,19 @@ class TTSExecutor(BaseExecutor):
self.speaker_dict = speaker_dict self.speaker_dict = speaker_dict
# voc # voc
if voc_ckpt is None or voc_config is None or voc_stat is None:
use_pretrained_voc = True
else:
use_pretrained_voc = False
voc_tag = voc + '-' + lang voc_tag = voc + '-' + lang
self.task_resource.set_task_model( self.task_resource.set_task_model(
model_tag=voc_tag, model_tag=voc_tag,
model_type=1, # vocoder model_type=1, # vocoder
skip_download=not use_pretrained_voc,
version=None, # default version version=None, # default version
) )
if voc_ckpt is None or voc_config is None or voc_stat is None: if use_pretrained_voc:
self.voc_res_path = self.task_resource.voc_res_dir self.voc_res_path = self.task_resource.voc_res_dir
self.voc_config = os.path.join( self.voc_config = os.path.join(
self.voc_res_path, self.task_resource.voc_res_dict['config']) self.voc_res_path, self.task_resource.voc_res_dict['config'])
@ -235,9 +248,9 @@ class TTSExecutor(BaseExecutor):
self.voc_stat = os.path.join( self.voc_stat = os.path.join(
self.voc_res_path, self.voc_res_path,
self.task_resource.voc_res_dict['speech_stats']) self.task_resource.voc_res_dict['speech_stats'])
logger.info(self.voc_res_path) logger.debug(self.voc_res_path)
logger.info(self.voc_config) logger.debug(self.voc_config)
logger.info(self.voc_ckpt) logger.debug(self.voc_ckpt)
else: else:
self.voc_config = os.path.abspath(voc_config) self.voc_config = os.path.abspath(voc_config)
self.voc_ckpt = os.path.abspath(voc_ckpt) self.voc_ckpt = os.path.abspath(voc_ckpt)
@ -254,21 +267,18 @@ class TTSExecutor(BaseExecutor):
with open(self.phones_dict, "r") as f: with open(self.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()] phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id) vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
tone_size = None tone_size = None
if self.tones_dict: if self.tones_dict:
with open(self.tones_dict, "r") as f: with open(self.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()] tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id) tone_size = len(tone_id)
print("tone_size:", tone_size)
spk_num = None spk_num = None
if self.speaker_dict: if self.speaker_dict:
with open(self.speaker_dict, 'rt') as f: with open(self.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()] spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id) spk_num = len(spk_id)
print("spk_num:", spk_num)
# frontend # frontend
if lang == 'zh': if lang == 'zh':
@ -278,7 +288,6 @@ class TTSExecutor(BaseExecutor):
elif lang == 'en': elif lang == 'en':
self.frontend = English(phone_vocab_path=self.phones_dict) self.frontend = English(phone_vocab_path=self.phones_dict)
print("frontend done!")
# acoustic model # acoustic model
odim = self.am_config.n_mels odim = self.am_config.n_mels
@ -311,7 +320,6 @@ class TTSExecutor(BaseExecutor):
am_normalizer = ZScore(am_mu, am_std) am_normalizer = ZScore(am_mu, am_std)
self.am_inference = am_inference_class(am_normalizer, am) self.am_inference = am_inference_class(am_normalizer, am)
self.am_inference.eval() self.am_inference.eval()
print("acoustic model done!")
# vocoder # vocoder
# model: {model_name}_{dataset} # model: {model_name}_{dataset}
@ -334,7 +342,6 @@ class TTSExecutor(BaseExecutor):
voc_normalizer = ZScore(voc_mu, voc_std) voc_normalizer = ZScore(voc_mu, voc_std)
self.voc_inference = voc_inference_class(voc_normalizer, voc) self.voc_inference = voc_inference_class(voc_normalizer, voc)
self.voc_inference.eval() self.voc_inference.eval()
print("voc done!")
def preprocess(self, input: Any, *args, **kwargs): def preprocess(self, input: Any, *args, **kwargs):
""" """
@ -375,7 +382,7 @@ class TTSExecutor(BaseExecutor):
text, merge_sentences=merge_sentences) text, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
else: else:
print("lang should in {'zh', 'en'}!") logger.error("lang should in {'zh', 'en'}!")
self.frontend_time = time.time() - frontend_st self.frontend_time = time.time() - frontend_st
self.am_time = 0 self.am_time = 0
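Review note: the hunks above drop the ad-hoc `print` progress messages from `TTSExecutor` and report the unsupported-language case through `logger.error` instead. The sketch below shows the effect with the standard `logging` module. The logger name `demo.tts` and the helpers `make_logger` and `check_lang` are hypothetical, and PaddleSpeech's own `logger` object may be configured differently.

```python
import io
import logging


def make_logger(level: int, stream: io.StringIO) -> logging.Logger:
    """Build an isolated logger that writes to `stream` (hypothetical helper)."""
    logger = logging.getLogger("demo.tts")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def check_lang(logger: logging.Logger, lang: str) -> bool:
    """Return True for a supported language, log an error otherwise."""
    # Progress detail goes to DEBUG, so it stays hidden at the INFO level.
    logger.debug("frontend lang: %s", lang)
    if lang not in {"zh", "en"}:
        logger.error("lang should be in {'zh', 'en'}!")
        return False
    return True
```

With the level set to `logging.INFO`, a supported language produces no output at all, while an unsupported one still surfaces as an `ERROR` record.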

@ -117,7 +117,7 @@ class VectorExecutor(BaseExecutor):
# stage 2: read the input data and store them as a list # stage 2: read the input data and store them as a list
task_source = self.get_input_source(parser_args.input) task_source = self.get_input_source(parser_args.input)
logger.info(f"task source: {task_source}") logger.debug(f"task source: {task_source}")
# stage 3: process the audio one by one # stage 3: process the audio one by one
# we do action according the task type # we do action according the task type
@ -127,13 +127,13 @@ class VectorExecutor(BaseExecutor):
try: try:
# extract the speaker audio embedding # extract the speaker audio embedding
if parser_args.task == "spk": if parser_args.task == "spk":
logger.info("do vector spk task") logger.debug("do vector spk task")
res = self(input_, model, sample_rate, config, ckpt_path, res = self(input_, model, sample_rate, config, ckpt_path,
device) device)
task_result[id_] = res task_result[id_] = res
elif parser_args.task == "score": elif parser_args.task == "score":
logger.info("do vector score task") logger.debug("do vector score task")
logger.info(f"input content {input_}") logger.debug(f"input content {input_}")
if len(input_.split()) != 2: if len(input_.split()) != 2:
logger.error( logger.error(
f"vector score task input {input_} wav num is not two," f"vector score task input {input_} wav num is not two,"
@ -142,7 +142,7 @@ class VectorExecutor(BaseExecutor):
# get the enroll and test embedding # get the enroll and test embedding
enroll_audio, test_audio = input_.split() enroll_audio, test_audio = input_.split()
logger.info( logger.debug(
f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}" f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}"
) )
enroll_embedding = self(enroll_audio, model, sample_rate, enroll_embedding = self(enroll_audio, model, sample_rate,
@ -158,8 +158,8 @@ class VectorExecutor(BaseExecutor):
has_exceptions = True has_exceptions = True
task_result[id_] = f'{e.__class__.__name__}: {e}' task_result[id_] = f'{e.__class__.__name__}: {e}'
logger.info("task result as follows: ") logger.debug("task result as follows: ")
logger.info(f"{task_result}") logger.debug(f"{task_result}")
# stage 4: process the all the task results # stage 4: process the all the task results
self.process_task_results(parser_args.input, task_result, self.process_task_results(parser_args.input, task_result,
@ -207,7 +207,7 @@ class VectorExecutor(BaseExecutor):
""" """
if not hasattr(self, "score_func"): if not hasattr(self, "score_func"):
self.score_func = paddle.nn.CosineSimilarity(axis=0) self.score_func = paddle.nn.CosineSimilarity(axis=0)
logger.info("create the cosine score function ") logger.debug("create the cosine score function ")
score = self.score_func( score = self.score_func(
paddle.to_tensor(enroll_embedding), paddle.to_tensor(enroll_embedding),
@ -244,7 +244,7 @@ class VectorExecutor(BaseExecutor):
sys.exit(-1) sys.exit(-1)
# stage 1: set the paddle runtime host device # stage 1: set the paddle runtime host device
logger.info(f"device type: {device}") logger.debug(f"device type: {device}")
paddle.device.set_device(device) paddle.device.set_device(device)
# stage 2: read the specific pretrained model # stage 2: read the specific pretrained model
@ -283,7 +283,7 @@ class VectorExecutor(BaseExecutor):
# stage 0: avoid to init the mode again # stage 0: avoid to init the mode again
self.task = task self.task = task
if hasattr(self, "model"): if hasattr(self, "model"):
logger.info("Model has been initialized") logger.debug("Model has been initialized")
return return
# stage 1: get the model and config path # stage 1: get the model and config path
@ -294,7 +294,7 @@ class VectorExecutor(BaseExecutor):
sample_rate_str = "16k" if sample_rate == 16000 else "8k" sample_rate_str = "16k" if sample_rate == 16000 else "8k"
tag = model_type + "-" + sample_rate_str tag = model_type + "-" + sample_rate_str
self.task_resource.set_task_model(tag, version=None) self.task_resource.set_task_model(tag, version=None)
logger.info(f"load the pretrained model: {tag}") logger.debug(f"load the pretrained model: {tag}")
# get the model from the pretrained list # get the model from the pretrained list
# we download the pretrained model and store it in the res_path # we download the pretrained model and store it in the res_path
self.res_path = self.task_resource.res_dir self.res_path = self.task_resource.res_dir
@ -312,19 +312,19 @@ class VectorExecutor(BaseExecutor):
self.res_path = os.path.dirname( self.res_path = os.path.dirname(
os.path.dirname(os.path.abspath(self.cfg_path))) os.path.dirname(os.path.abspath(self.cfg_path)))
logger.info(f"start to read the ckpt from {self.ckpt_path}") logger.debug(f"start to read the ckpt from {self.ckpt_path}")
logger.info(f"read the config from {self.cfg_path}") logger.debug(f"read the config from {self.cfg_path}")
logger.info(f"get the res path {self.res_path}") logger.debug(f"get the res path {self.res_path}")
# stage 2: read and config and init the model body # stage 2: read and config and init the model body
self.config = CfgNode(new_allowed=True) self.config = CfgNode(new_allowed=True)
self.config.merge_from_file(self.cfg_path) self.config.merge_from_file(self.cfg_path)
# stage 3: get the model name to instance the model network with dynamic_import # stage 3: get the model name to instance the model network with dynamic_import
logger.info("start to dynamic import the model class") logger.debug("start to dynamic import the model class")
model_name = model_type[:model_type.rindex('_')] model_name = model_type[:model_type.rindex('_')]
model_class = self.task_resource.get_model_class(model_name) model_class = self.task_resource.get_model_class(model_name)
logger.info(f"model name {model_name}") logger.debug(f"model name {model_name}")
model_conf = self.config.model model_conf = self.config.model
backbone = model_class(**model_conf) backbone = model_class(**model_conf)
model = SpeakerIdetification( model = SpeakerIdetification(
@ -333,11 +333,11 @@ class VectorExecutor(BaseExecutor):
self.model.eval() self.model.eval()
# stage 4: load the model parameters # stage 4: load the model parameters
logger.info("start to set the model parameters to model") logger.debug("start to set the model parameters to model")
model_dict = paddle.load(self.ckpt_path) model_dict = paddle.load(self.ckpt_path)
self.model.set_state_dict(model_dict) self.model.set_state_dict(model_dict)
logger.info("create the model instance success") logger.debug("create the model instance success")
@paddle.no_grad() @paddle.no_grad()
def infer(self, model_type: str): def infer(self, model_type: str):
@ -349,14 +349,14 @@ class VectorExecutor(BaseExecutor):
# stage 0: get the feat and length from _inputs # stage 0: get the feat and length from _inputs
feats = self._inputs["feats"] feats = self._inputs["feats"]
lengths = self._inputs["lengths"] lengths = self._inputs["lengths"]
logger.info("start to do backbone network model forward") logger.debug("start to do backbone network model forward")
logger.info( logger.debug(
f"feats shape:{feats.shape}, lengths shape: {lengths.shape}") f"feats shape:{feats.shape}, lengths shape: {lengths.shape}")
# stage 1: get the audio embedding # stage 1: get the audio embedding
# embedding from (1, emb_size, 1) -> (emb_size) # embedding from (1, emb_size, 1) -> (emb_size)
embedding = self.model.backbone(feats, lengths).squeeze().numpy() embedding = self.model.backbone(feats, lengths).squeeze().numpy()
logger.info(f"embedding size: {embedding.shape}") logger.debug(f"embedding size: {embedding.shape}")
# stage 2: put the embedding and dim info to _outputs property # stage 2: put the embedding and dim info to _outputs property
# the embedding type is numpy.array # the embedding type is numpy.array
@ -380,12 +380,13 @@ class VectorExecutor(BaseExecutor):
""" """
audio_file = input_file audio_file = input_file
if isinstance(audio_file, (str, os.PathLike)): if isinstance(audio_file, (str, os.PathLike)):
logger.info(f"Preprocess audio file: {audio_file}") logger.debug(f"Preprocess audio file: {audio_file}")
# stage 1: load the audio sample points # stage 1: load the audio sample points
# Note: this process must match the training process # Note: this process must match the training process
waveform, sr = load_audio(audio_file) waveform, sr = load_audio(audio_file)
logger.info(f"load the audio sample points, shape is: {waveform.shape}") logger.debug(
f"load the audio sample points, shape is: {waveform.shape}")
# stage 2: get the audio feat # stage 2: get the audio feat
# Note: Now we only support fbank feature # Note: Now we only support fbank feature
@ -396,9 +397,9 @@ class VectorExecutor(BaseExecutor):
n_mels=self.config.n_mels, n_mels=self.config.n_mels,
window_size=self.config.window_size, window_size=self.config.window_size,
hop_length=self.config.hop_size) hop_length=self.config.hop_size)
logger.info(f"extract the audio feat, shape is: {feat.shape}") logger.debug(f"extract the audio feat, shape is: {feat.shape}")
except Exception as e: except Exception as e:
logger.info(f"feat occurs exception {e}") logger.debug(f"feat occurs exception {e}")
sys.exit(-1) sys.exit(-1)
feat = paddle.to_tensor(feat).unsqueeze(0) feat = paddle.to_tensor(feat).unsqueeze(0)
@ -411,11 +412,11 @@ class VectorExecutor(BaseExecutor):
# stage 4: store the feat and length in the _inputs, # stage 4: store the feat and length in the _inputs,
# which will be used in other function # which will be used in other function
logger.info(f"feats shape: {feat.shape}") logger.debug(f"feats shape: {feat.shape}")
self._inputs["feats"] = feat self._inputs["feats"] = feat
self._inputs["lengths"] = lengths self._inputs["lengths"] = lengths
logger.info("audio extract the feat success") logger.debug("audio extract the feat success")
def _check(self, audio_file: str, sample_rate: int): def _check(self, audio_file: str, sample_rate: int):
"""Check if the model sample match the audio sample rate """Check if the model sample match the audio sample rate
@ -441,7 +442,7 @@ class VectorExecutor(BaseExecutor):
logger.error("Please input the right audio file path") logger.error("Please input the right audio file path")
return False return False
logger.info("checking the aduio file format......") logger.debug("checking the aduio file format......")
try: try:
audio, audio_sample_rate = soundfile.read( audio, audio_sample_rate = soundfile.read(
audio_file, dtype="float32", always_2d=True) audio_file, dtype="float32", always_2d=True)
@ -458,7 +459,7 @@ class VectorExecutor(BaseExecutor):
") ")
return False return False
logger.info(f"The sample rate is {audio_sample_rate}") logger.debug(f"The sample rate is {audio_sample_rate}")
if audio_sample_rate != self.sample_rate: if audio_sample_rate != self.sample_rate:
logger.error("The sample rate of the input file is not {}.\n \ logger.error("The sample rate of the input file is not {}.\n \
@ -468,6 +469,6 @@ class VectorExecutor(BaseExecutor):
".format(self.sample_rate, self.sample_rate)) ".format(self.sample_rate, self.sample_rate))
sys.exit(-1) sys.exit(-1)
else: else:
logger.info("The audio file format is right") logger.debug("The audio file format is right")
return True return True
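Review note: the score task in `VectorExecutor` compares an enroll embedding with a test embedding through `paddle.nn.CosineSimilarity(axis=0)`. The same quantity can be sketched in plain Python without Paddle. The function name `cosine_score` and the `eps` guard against a zero norm are assumptions made for this illustration, not part of the diff.

```python
import math
from typing import Sequence


def cosine_score(enroll: Sequence[float],
                 test: Sequence[float],
                 eps: float = 1e-8) -> float:
    """Cosine similarity of two 1-D embeddings (hypothetical helper)."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(
        sum(b * b for b in test))
    # eps only matters when one of the embeddings is all zeros.
    return dot / max(norm, eps)
```

A score near 1 means the two embeddings point in the same direction, 0 means they are orthogonal, and a score near -1 means they point in opposite directions.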

@ -60,6 +60,7 @@ class CommonTaskResource:
def set_task_model(self, def set_task_model(self,
model_tag: str, model_tag: str,
model_type: int=0, model_type: int=0,
skip_download: bool=False,
version: Optional[str]=None): version: Optional[str]=None):
"""Set model tag and version of current task. """Set model tag and version of current task.
@ -83,16 +84,18 @@ class CommonTaskResource:
self.version = version self.version = version
self.res_dict = self.pretrained_models[model_tag][version] self.res_dict = self.pretrained_models[model_tag][version]
self._format_path(self.res_dict) self._format_path(self.res_dict)
self.res_dir = self._fetch(self.res_dict, if not skip_download:
self._get_model_dir(model_type)) self.res_dir = self._fetch(self.res_dict,
self._get_model_dir(model_type))
else: else:
assert self.task == 'tts', 'Vocoder will only be used in tts task.' assert self.task == 'tts', 'Vocoder will only be used in tts task.'
self.voc_model_tag = model_tag self.voc_model_tag = model_tag
self.voc_version = version self.voc_version = version
self.voc_res_dict = self.pretrained_models[model_tag][version] self.voc_res_dict = self.pretrained_models[model_tag][version]
self._format_path(self.voc_res_dict) self._format_path(self.voc_res_dict)
self.voc_res_dir = self._fetch(self.voc_res_dict, if not skip_download:
self._get_model_dir(model_type)) self.voc_res_dir = self._fetch(self.voc_res_dict,
self._get_model_dir(model_type))
@staticmethod @staticmethod
def get_model_class(model_name) -> List[object]: def get_model_class(model_name) -> List[object]:
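Review note: the new `skip_download` flag lets `set_task_model` record the tag, version and resource dictionary without calling `_fetch`. The sketch below reproduces only that control flow. The class `DemoTaskResource`, the injected `fetch` callable and the explicit `version` argument are hypothetical simplifications; the real class resolves versions and model directories itself.

```python
from typing import Callable, Dict, Optional


class DemoTaskResource:
    """Simplified stand-in for the resource class touched by this hunk."""

    def __init__(self,
                 pretrained_models: Dict[str, Dict[str, dict]],
                 fetch: Callable[[dict], str]):
        self.pretrained_models = pretrained_models
        self._fetch = fetch  # assumed to download and return a directory
        self.res_dict: Optional[dict] = None
        self.res_dir: Optional[str] = None

    def set_task_model(self,
                       model_tag: str,
                       version: str,
                       skip_download: bool = False) -> None:
        self.model_tag = model_tag
        self.version = version
        self.res_dict = self.pretrained_models[model_tag][version]
        # Metadata is always set; the download is the only optional step.
        if not skip_download:
            self.res_dir = self._fetch(self.res_dict)
```

One consequence worth checking in review: when `skip_download=True`, `res_dir` is never assigned here, so callers that read it afterwards need to supply their own paths.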

@ -35,12 +35,6 @@ if __name__ == "__main__":
# save jit model to # save jit model to
parser.add_argument( parser.add_argument(
"--export_path", type=str, help="path of the jit model to save") "--export_path", type=str, help="path of the jit model to save")
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
args = parser.parse_args() args = parser.parse_args()
print_arguments(args) print_arguments(args)

@ -35,12 +35,6 @@ if __name__ == "__main__":
# save asr result to # save asr result to
parser.add_argument( parser.add_argument(
"--result_file", type=str, help="path of save the asr result") "--result_file", type=str, help="path of save the asr result")
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
args = parser.parse_args() args = parser.parse_args()
print_arguments(args, globals()) print_arguments(args, globals())

@ -38,12 +38,6 @@ if __name__ == "__main__":
#load jit model from #load jit model from
parser.add_argument( parser.add_argument(
"--export_path", type=str, help="path of the jit model to save") "--export_path", type=str, help="path of the jit model to save")
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
parser.add_argument( parser.add_argument(
"--enable-auto-log", action="store_true", help="use auto log") "--enable-auto-log", action="store_true", help="use auto log")
args = parser.parse_args() args = parser.parse_args()

@ -31,12 +31,6 @@ def main(config, args):
if __name__ == "__main__": if __name__ == "__main__":
parser = default_argument_parser() parser = default_argument_parser()
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
args = parser.parse_args() args = parser.parse_args()
print_arguments(args, globals()) print_arguments(args, globals())

@ -16,7 +16,6 @@ import random
import numpy as np import numpy as np
from PIL import Image from PIL import Image
from PIL.Image import BICUBIC
from paddlespeech.s2t.frontend.augmentor.base import AugmentorBase from paddlespeech.s2t.frontend.augmentor.base import AugmentorBase
from paddlespeech.s2t.utils.log import Log from paddlespeech.s2t.utils.log import Log
@ -164,9 +163,9 @@ class SpecAugmentor(AugmentorBase):
window) + 1 # 1 ... t - 1 window) + 1 # 1 ... t - 1
left = Image.fromarray(x[:center]).resize((x.shape[1], warped), left = Image.fromarray(x[:center]).resize((x.shape[1], warped),
BICUBIC) Image.BICUBIC)
right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped), right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped),
BICUBIC) Image.BICUBIC)
if self.inplace: if self.inplace:
x[:warped] = left x[:warped] = left
x[warped:] = right x[warped:] = right

@ -226,10 +226,10 @@ class TextFeaturizer():
sos_id = vocab_list.index(SOS) if SOS in vocab_list else -1 sos_id = vocab_list.index(SOS) if SOS in vocab_list else -1
space_id = vocab_list.index(SPACE) if SPACE in vocab_list else -1 space_id = vocab_list.index(SPACE) if SPACE in vocab_list else -1
logger.info(f"BLANK id: {blank_id}") logger.debug(f"BLANK id: {blank_id}")
logger.info(f"UNK id: {unk_id}") logger.debug(f"UNK id: {unk_id}")
logger.info(f"EOS id: {eos_id}") logger.debug(f"EOS id: {eos_id}")
logger.info(f"SOS id: {sos_id}") logger.debug(f"SOS id: {sos_id}")
logger.info(f"SPACE id: {space_id}") logger.debug(f"SPACE id: {space_id}")
logger.info(f"MASKCTC id: {maskctc_id}") logger.debug(f"MASKCTC id: {maskctc_id}")
return token2id, id2token, vocab_list, unk_id, eos_id, blank_id return token2id, id2token, vocab_list, unk_id, eos_id, blank_id
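Review note: the ids logged in this hunk come from lookups of the form `vocab_list.index(SOS) if SOS in vocab_list else -1`. A small sketch of that pattern follows. The helper name `special_id` and the token spellings used in the usage lines are assumptions, not values taken from the diff.

```python
from typing import List


def special_id(vocab_list: List[str], token: str) -> int:
    """Return the index of `token`, or -1 when the vocabulary lacks it."""
    return vocab_list.index(token) if token in vocab_list else -1


vocab = ["<blank>", "<unk>", "a", "b", "<eos>"]  # hypothetical vocabulary
blank_id = special_id(vocab, "<blank>")
space_id = special_id(vocab, "<space>")  # absent from this vocabulary
```

Because -1 is also a valid Python index (the last element), callers should test for it explicitly rather than using the result to index into the vocabulary.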

@ -827,7 +827,7 @@ class U2Model(U2DecodeModel):
# encoder # encoder
encoder_type = configs.get('encoder', 'transformer') encoder_type = configs.get('encoder', 'transformer')
logger.info(f"U2 Encoder type: {encoder_type}") logger.debug(f"U2 Encoder type: {encoder_type}")
if encoder_type == 'transformer': if encoder_type == 'transformer':
encoder = TransformerEncoder( encoder = TransformerEncoder(
input_dim, global_cmvn=global_cmvn, **configs['encoder_conf']) input_dim, global_cmvn=global_cmvn, **configs['encoder_conf'])
@ -894,7 +894,7 @@ class U2Model(U2DecodeModel):
if checkpoint_path: if checkpoint_path:
infos = checkpoint.Checkpoint().load_parameters( infos = checkpoint.Checkpoint().load_parameters(
model, checkpoint_path=checkpoint_path) model, checkpoint_path=checkpoint_path)
logger.info(f"checkpoint info: {infos}") logger.debug(f"checkpoint info: {infos}")
layer_tools.summary(model) layer_tools.summary(model)
return model return model

@ -37,9 +37,9 @@ class CTCLoss(nn.Layer):
self.loss = nn.CTCLoss(blank=blank, reduction=reduction) self.loss = nn.CTCLoss(blank=blank, reduction=reduction)
self.batch_average = batch_average self.batch_average = batch_average
logger.info( logger.debug(
f"CTCLoss Loss reduction: {reduction}, div-bs: {batch_average}") f"CTCLoss Loss reduction: {reduction}, div-bs: {batch_average}")
logger.info(f"CTCLoss Grad Norm Type: {grad_norm_type}") logger.debug(f"CTCLoss Grad Norm Type: {grad_norm_type}")
assert grad_norm_type in ('instance', 'batch', 'frame', None) assert grad_norm_type in ('instance', 'batch', 'frame', None)
self.norm_by_times = False self.norm_by_times = False
@ -70,7 +70,8 @@ class CTCLoss(nn.Layer):
param = {} param = {}
self._kwargs = {k: v for k, v in kwargs.items() if k in param} self._kwargs = {k: v for k, v in kwargs.items() if k in param}
_notin = {k: v for k, v in kwargs.items() if k not in param} _notin = {k: v for k, v in kwargs.items() if k not in param}
logger.info(f"{self.loss} kwargs:{self._kwargs}, not support: {_notin}") logger.debug(
f"{self.loss} kwargs:{self._kwargs}, not support: {_notin}")
def forward(self, logits, ys_pad, hlens, ys_lens): def forward(self, logits, ys_pad, hlens, ys_lens):
"""Compute CTC loss. """Compute CTC loss.

@ -82,6 +82,12 @@ def default_argument_parser(parser=None):
type=int, type=int,
default=1, default=1,
help="number of parallel processes. 0 for cpu.") help="number of parallel processes. 0 for cpu.")
train_group.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
train_group.add_argument( train_group.add_argument(
"--config", metavar="CONFIG_FILE", help="config file.") "--config", metavar="CONFIG_FILE", help="config file.")
train_group.add_argument( train_group.add_argument(
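Review note: this hunk moves `--nxpu` into `default_argument_parser`, which is why the per-script copies are deleted in the earlier hunks. The sketch below registers the option once on a shared parser with plain `argparse`. The names `demo_argument_parser` and `pick_device`, the group title, and the GPU-before-XPU priority are assumptions; the diff only states that the CPU is used when both counts are 0.

```python
import argparse


def demo_argument_parser(parser=None) -> argparse.ArgumentParser:
    """Stand-in for a shared parser that owns the device options."""
    if parser is None:
        parser = argparse.ArgumentParser()
    train_group = parser.add_argument_group(title="Train Options")
    train_group.add_argument(
        "--ngpu",
        type=int,
        default=1,
        help="number of parallel processes. 0 for cpu.")
    train_group.add_argument(
        "--nxpu",
        type=int,
        default=0,
        choices=[0, 1],
        help="if nxpu == 0 and ngpu == 0, use cpu.")
    return parser


def pick_device(args: argparse.Namespace) -> str:
    """Map the parsed counts to a device name (priority is an assumption)."""
    if args.ngpu > 0:
        return "gpu"
    if args.nxpu > 0:
        return "xpu"
    return "cpu"
```

Defining the flag in one place also avoids the `argparse` conflict error that a script would raise if it added `--nxpu` again on top of the shared parser.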

@ -94,7 +94,7 @@ def pad_sequence(sequences: List[paddle.Tensor],
for i, tensor in enumerate(sequences): for i, tensor in enumerate(sequences):
length = tensor.shape[0] length = tensor.shape[0]
# use index notation to prevent duplicate references to the tensor # use index notation to prevent duplicate references to the tensor
logger.info( logger.debug(
f"length {length}, out_tensor {out_tensor.shape}, tensor {tensor.shape}" f"length {length}, out_tensor {out_tensor.shape}, tensor {tensor.shape}"
) )
if batch_first: if batch_first:

@ -123,7 +123,6 @@ class TTSClientExecutor(BaseExecutor):
time_end = time.time() time_end = time.time()
time_consume = time_end - time_start time_consume = time_end - time_start
response_dict = res.json() response_dict = res.json()
logger.info(response_dict["message"])
logger.info("Save synthesized audio successfully on %s." % (output)) logger.info("Save synthesized audio successfully on %s." % (output))
logger.info("Audio duration: %f s." % logger.info("Audio duration: %f s." %
(response_dict['result']['duration'])) (response_dict['result']['duration']))
@ -702,7 +701,6 @@ class VectorClientExecutor(BaseExecutor):
test_audio=args.test, test_audio=args.test,
task=task) task=task)
time_end = time.time() time_end = time.time()
logger.info(f"The vector: {res}")
logger.info("Response time %f s." % (time_end - time_start)) logger.info("Response time %f s." % (time_end - time_start))
return True return True
except Exception as e: except Exception as e:

@ -30,7 +30,7 @@ class ACSEngine(BaseEngine):
"""The ACSEngine Engine """The ACSEngine Engine
""" """
super(ACSEngine, self).__init__() super(ACSEngine, self).__init__()
logger.info("Create the ACSEngine Instance") logger.debug("Create the ACSEngine Instance")
self.word_list = [] self.word_list = []
def init(self, config: dict): def init(self, config: dict):
@ -42,7 +42,7 @@ class ACSEngine(BaseEngine):
Returns: Returns:
bool: The engine instance flag bool: The engine instance flag
""" """
logger.info("Init the acs engine") logger.debug("Init the acs engine")
try: try:
self.config = config self.config = config
self.device = self.config.get("device", paddle.get_device()) self.device = self.config.get("device", paddle.get_device())
@ -50,7 +50,7 @@ class ACSEngine(BaseEngine):
# websocket default ping timeout is 20 seconds # websocket default ping timeout is 20 seconds
self.ping_timeout = self.config.get("ping_timeout", 20) self.ping_timeout = self.config.get("ping_timeout", 20)
paddle.set_device(self.device) paddle.set_device(self.device)
logger.info(f"ACS Engine set the device: {self.device}") logger.debug(f"ACS Engine set the device: {self.device}")
except BaseException as e: except BaseException as e:
logger.error( logger.error(
@ -66,7 +66,9 @@ class ACSEngine(BaseEngine):
self.url = "ws://" + self.config.asr_server_ip + ":" + str( self.url = "ws://" + self.config.asr_server_ip + ":" + str(
self.config.asr_server_port) + "/paddlespeech/asr/streaming" self.config.asr_server_port) + "/paddlespeech/asr/streaming"
logger.info("Init the acs engine successfully") logger.info("Initialize acs server engine successfully on device: %s." %
(self.device))
return True return True
def read_search_words(self): def read_search_words(self):
@ -95,12 +97,12 @@ class ACSEngine(BaseEngine):
Returns: Returns:
_type_: _description_ _type_: _description_
""" """
logger.info("send a message to the server") logger.debug("send a message to the server")
if self.url is None: if self.url is None:
logger.error("No asr server, please input valid ip and port") logger.error("No asr server, please input valid ip and port")
return "" return ""
ws = websocket.WebSocket() ws = websocket.WebSocket()
logger.info(f"set the ping timeout: {self.ping_timeout} seconds") logger.debug(f"set the ping timeout: {self.ping_timeout} seconds")
ws.connect(self.url, ping_timeout=self.ping_timeout) ws.connect(self.url, ping_timeout=self.ping_timeout)
audio_info = json.dumps( audio_info = json.dumps(
{ {
@ -123,7 +125,7 @@ class ACSEngine(BaseEngine):
logger.info(f"audio result: {msg}") logger.info(f"audio result: {msg}")
# 3. send chunk audio data to engine # 3. send chunk audio data to engine
logger.info("send the end signal") logger.debug("send the end signal")
audio_info = json.dumps( audio_info = json.dumps(
{ {
"name": "test.wav", "name": "test.wav",
@ -197,7 +199,7 @@ class ACSEngine(BaseEngine):
start = max(time_stamp[m.start(0)]['bg'] - offset, 0) start = max(time_stamp[m.start(0)]['bg'] - offset, 0)
end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed) end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed)
logger.info(f'start: {start}, end: {end}') logger.debug(f'start: {start}, end: {end}')
acs_result.append({'w': w, 'bg': start, 'ed': end}) acs_result.append({'w': w, 'bg': start, 'ed': end})
return acs_result, asr_result return acs_result, asr_result
@ -212,7 +214,7 @@ class ACSEngine(BaseEngine):
Returns: Returns:
acs_result, asr_result: the acs result and the asr result acs_result, asr_result: the acs result and the asr result
""" """
logger.info("start to process the audio content search") logger.debug("start to process the audio content search")
msg = self.get_asr_content(io.BytesIO(audio_data)) msg = self.get_asr_content(io.BytesIO(audio_data))
acs_result, asr_result = self.get_macthed_word(msg) acs_result, asr_result = self.get_macthed_word(msg)
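Review note: the matching step in `ACSEngine` widens each hit by `offset` and clamps it to `[0, max_ed]`, as the `start`/`end` lines in the hunk above show. The sketch below isolates that calculation. It assumes each `time_stamp` entry holds exactly one character under key `w`, so that a string index maps directly to a list index; the function name `match_words` and the use of `re.escape` are also assumptions.

```python
import re
from typing import Dict, List, Tuple


def match_words(time_stamp: List[Dict],
                word_list: List[str],
                offset: float = 0.0) -> Tuple[List[Dict], str]:
    """Find each search word in the transcript and return padded time spans."""
    asr_result = "".join(item["w"] for item in time_stamp)
    if not time_stamp:
        return [], asr_result
    max_ed = time_stamp[-1]["ed"]
    acs_result = []
    for w in word_list:
        for m in re.finditer(re.escape(w), asr_result):
            # Pad the span on both sides, but stay inside the audio.
            start = max(time_stamp[m.start(0)]["bg"] - offset, 0)
            end = min(time_stamp[m.end(0) - 1]["ed"] + offset, max_ed)
            acs_result.append({"w": w, "bg": start, "ed": end})
    return acs_result, asr_result
```

The clamping matters at the edges: a hit on the first character cannot start before 0, and a hit on the last character cannot end after the final timestamp.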

@ -44,7 +44,7 @@ class PaddleASRConnectionHanddler:
asr_engine (ASREngine): the global asr engine asr_engine (ASREngine): the global asr engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"create an paddle asr connection handler to process the websocket connection" "create an paddle asr connection handler to process the websocket connection"
) )
self.config = asr_engine.config # server config self.config = asr_engine.config # server config
@ -152,12 +152,12 @@ class PaddleASRConnectionHanddler:
self.output_reset() self.output_reset()
def extract_feat(self, samples: ByteString): def extract_feat(self, samples: ByteString):
logger.info("Online ASR extract the feat") logger.debug("Online ASR extract the feat")
samples = np.frombuffer(samples, dtype=np.int16) samples = np.frombuffer(samples, dtype=np.int16)
assert samples.ndim == 1 assert samples.ndim == 1
self.num_samples += samples.shape[0] self.num_samples += samples.shape[0]
logger.info( logger.debug(
f"This package receive {samples.shape[0]} pcm data. Global samples:{self.num_samples}" f"This package receive {samples.shape[0]} pcm data. Global samples:{self.num_samples}"
) )
@ -168,7 +168,7 @@ class PaddleASRConnectionHanddler:
else: else:
assert self.remained_wav.ndim == 1 # (T,) assert self.remained_wav.ndim == 1 # (T,)
self.remained_wav = np.concatenate([self.remained_wav, samples]) self.remained_wav = np.concatenate([self.remained_wav, samples])
logger.info( logger.debug(
f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}" f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}"
) )
@ -202,14 +202,14 @@ class PaddleASRConnectionHanddler:
# update remained wav # update remained wav
self.remained_wav = self.remained_wav[self.n_shift * num_frames:] self.remained_wav = self.remained_wav[self.n_shift * num_frames:]
logger.info( logger.debug(
f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}" f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}"
) )
logger.info( logger.debug(
f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}" f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}"
) )
logger.info(f"global samples: {self.num_samples}") logger.debug(f"global samples: {self.num_samples}")
logger.info(f"global frames: {self.num_frames}") logger.debug(f"global frames: {self.num_frames}")
def decode(self, is_finished=False): def decode(self, is_finished=False):
"""advance decoding """advance decoding
@ -237,7 +237,7 @@ class PaddleASRConnectionHanddler:
return return
num_frames = self.cached_feat.shape[1] num_frames = self.cached_feat.shape[1]
logger.info( logger.debug(
f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
) )
@ -355,7 +355,7 @@ class ASRServerExecutor(ASRExecutor):
lm_url = self.task_resource.res_dict['lm_url'] lm_url = self.task_resource.res_dict['lm_url']
lm_md5 = self.task_resource.res_dict['lm_md5'] lm_md5 = self.task_resource.res_dict['lm_md5']
logger.info(f"Start to load language model {lm_url}") logger.debug(f"Start to load language model {lm_url}")
self.download_lm( self.download_lm(
lm_url, lm_url,
os.path.dirname(self.config.decode.lang_model_path), lm_md5) os.path.dirname(self.config.decode.lang_model_path), lm_md5)
@ -367,7 +367,7 @@ class ASRServerExecutor(ASRExecutor):
if "deepspeech2" in self.model_type: if "deepspeech2" in self.model_type:
# AM predictor # AM predictor
logger.info("ASR engine start to init the am predictor") logger.debug("ASR engine start to init the am predictor")
self.am_predictor = onnx_infer.get_sess( self.am_predictor = onnx_infer.get_sess(
model_path=self.am_model, sess_conf=self.am_predictor_conf) model_path=self.am_model, sess_conf=self.am_predictor_conf)
else: else:
@ -400,7 +400,7 @@ class ASRServerExecutor(ASRExecutor):
self.num_decoding_left_chunks = num_decoding_left_chunks self.num_decoding_left_chunks = num_decoding_left_chunks
# conf for paddleinference predictor or onnx # conf for paddleinference predictor or onnx
self.am_predictor_conf = am_predictor_conf self.am_predictor_conf = am_predictor_conf
logger.info(f"model_type: {self.model_type}") logger.debug(f"model_type: {self.model_type}")
sample_rate_str = '16k' if sample_rate == 16000 else '8k' sample_rate_str = '16k' if sample_rate == 16000 else '8k'
tag = model_type + '-' + lang + '-' + sample_rate_str tag = model_type + '-' + lang + '-' + sample_rate_str
@ -422,12 +422,11 @@ class ASRServerExecutor(ASRExecutor):
# self.res_path, self.task_resource.res_dict[ # self.res_path, self.task_resource.res_dict[
# 'params']) if am_params is None else os.path.abspath(am_params) # 'params']) if am_params is None else os.path.abspath(am_params)
logger.info("Load the pretrained model:") logger.debug("Load the pretrained model:")
logger.info(f" tag = {tag}") logger.debug(f" tag = {tag}")
logger.info(f" res_path: {self.res_path}") logger.debug(f" res_path: {self.res_path}")
logger.info(f" cfg path: {self.cfg_path}") logger.debug(f" cfg path: {self.cfg_path}")
logger.info(f" am_model path: {self.am_model}") logger.debug(f" am_model path: {self.am_model}")
# logger.info(f" am_params path: {self.am_params}")
#Init body. #Init body.
self.config = CfgNode(new_allowed=True) self.config = CfgNode(new_allowed=True)
@ -436,7 +435,7 @@ class ASRServerExecutor(ASRExecutor):
if self.config.spm_model_prefix: if self.config.spm_model_prefix:
self.config.spm_model_prefix = os.path.join( self.config.spm_model_prefix = os.path.join(
self.res_path, self.config.spm_model_prefix) self.res_path, self.config.spm_model_prefix)
logger.info(f"spm model path: {self.config.spm_model_prefix}") logger.debug(f"spm model path: {self.config.spm_model_prefix}")
self.vocab = self.config.vocab_filepath self.vocab = self.config.vocab_filepath
@ -450,7 +449,7 @@ class ASRServerExecutor(ASRExecutor):
# AM predictor # AM predictor
self.init_model() self.init_model()
logger.info(f"create the {model_type} model success") logger.debug(f"create the {model_type} model success")
return True return True
@ -501,7 +500,7 @@ class ASREngine(BaseEngine):
"If all GPU or XPU is used, you can set the server to 'cpu'") "If all GPU or XPU is used, you can set the server to 'cpu'")
sys.exit(-1) sys.exit(-1)
logger.info(f"paddlespeech_server set the device: {self.device}") logger.debug(f"paddlespeech_server set the device: {self.device}")
if not self.init_model(): if not self.init_model():
logger.error( logger.error(
@ -509,7 +508,8 @@ class ASREngine(BaseEngine):
) )
return False return False
logger.info("Initialize ASR server engine successfully.") logger.info("Initialize ASR server engine successfully on device: %s." %
(self.device))
return True return True
def new_handler(self): def new_handler(self):

@ -44,7 +44,7 @@ class PaddleASRConnectionHanddler:
asr_engine (ASREngine): the global asr engine asr_engine (ASREngine): the global asr engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"create a paddle asr connection handler to process the websocket connection" "create a paddle asr connection handler to process the websocket connection"
) )
self.config = asr_engine.config # server config self.config = asr_engine.config # server config
@ -157,7 +157,7 @@ class PaddleASRConnectionHanddler:
assert samples.ndim == 1 assert samples.ndim == 1
self.num_samples += samples.shape[0] self.num_samples += samples.shape[0]
logger.info( logger.debug(
f"This package receives {samples.shape[0]} pcm data. Global samples:{self.num_samples}" f"This package receives {samples.shape[0]} pcm data. Global samples:{self.num_samples}"
) )
@ -168,7 +168,7 @@ class PaddleASRConnectionHanddler:
else: else:
assert self.remained_wav.ndim == 1 # (T,) assert self.remained_wav.ndim == 1 # (T,)
self.remained_wav = np.concatenate([self.remained_wav, samples]) self.remained_wav = np.concatenate([self.remained_wav, samples])
logger.info( logger.debug(
f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}" f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}"
) )
@ -202,14 +202,14 @@ class PaddleASRConnectionHanddler:
# update remained wav # update remained wav
self.remained_wav = self.remained_wav[self.n_shift * num_frames:] self.remained_wav = self.remained_wav[self.n_shift * num_frames:]
logger.info( logger.debug(
f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}" f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}"
) )
logger.info( logger.debug(
f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}" f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}"
) )
logger.info(f"global samples: {self.num_samples}") logger.debug(f"global samples: {self.num_samples}")
logger.info(f"global frames: {self.num_frames}") logger.debug(f"global frames: {self.num_frames}")
def decode(self, is_finished=False): def decode(self, is_finished=False):
"""advance decoding """advance decoding
@ -237,13 +237,13 @@ class PaddleASRConnectionHanddler:
return return
num_frames = self.cached_feat.shape[1] num_frames = self.cached_feat.shape[1]
logger.info( logger.debug(
f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
) )
# the cached feat must be larger than decoding_window # the cached feat must be larger than decoding_window
if num_frames < decoding_window and not is_finished: if num_frames < decoding_window and not is_finished:
logger.info( logger.debug(
f"frame feat num is less than {decoding_window}, please input more pcm data" f"frame feat num is less than {decoding_window}, please input more pcm data"
) )
return None, None return None, None
@ -294,7 +294,7 @@ class PaddleASRConnectionHanddler:
Returns: Returns:
logprob: posterior probability. logprob: posterior probability.
""" """
logger.info("start to decode one chunk for deepspeech2") logger.debug("start to decode one chunk for deepspeech2")
input_names = self.am_predictor.get_input_names() input_names = self.am_predictor.get_input_names()
audio_handle = self.am_predictor.get_input_handle(input_names[0]) audio_handle = self.am_predictor.get_input_handle(input_names[0])
audio_len_handle = self.am_predictor.get_input_handle(input_names[1]) audio_len_handle = self.am_predictor.get_input_handle(input_names[1])
@ -369,7 +369,7 @@ class ASRServerExecutor(ASRExecutor):
lm_url = self.task_resource.res_dict['lm_url'] lm_url = self.task_resource.res_dict['lm_url']
lm_md5 = self.task_resource.res_dict['lm_md5'] lm_md5 = self.task_resource.res_dict['lm_md5']
logger.info(f"Start to load language model {lm_url}") logger.debug(f"Start to load language model {lm_url}")
self.download_lm( self.download_lm(
lm_url, lm_url,
os.path.dirname(self.config.decode.lang_model_path), lm_md5) os.path.dirname(self.config.decode.lang_model_path), lm_md5)
@ -381,7 +381,7 @@ class ASRServerExecutor(ASRExecutor):
if "deepspeech2" in self.model_type: if "deepspeech2" in self.model_type:
# AM predictor # AM predictor
logger.info("ASR engine start to init the am predictor") logger.debug("ASR engine start to init the am predictor")
self.am_predictor = init_predictor( self.am_predictor = init_predictor(
model_file=self.am_model, model_file=self.am_model,
params_file=self.am_params, params_file=self.am_params,
@ -415,7 +415,7 @@ class ASRServerExecutor(ASRExecutor):
self.num_decoding_left_chunks = num_decoding_left_chunks self.num_decoding_left_chunks = num_decoding_left_chunks
# conf for paddleinference predictor or onnx # conf for paddleinference predictor or onnx
self.am_predictor_conf = am_predictor_conf self.am_predictor_conf = am_predictor_conf
logger.info(f"model_type: {self.model_type}") logger.debug(f"model_type: {self.model_type}")
sample_rate_str = '16k' if sample_rate == 16000 else '8k' sample_rate_str = '16k' if sample_rate == 16000 else '8k'
tag = model_type + '-' + lang + '-' + sample_rate_str tag = model_type + '-' + lang + '-' + sample_rate_str
@ -437,12 +437,12 @@ class ASRServerExecutor(ASRExecutor):
self.res_path = os.path.dirname( self.res_path = os.path.dirname(
os.path.dirname(os.path.abspath(self.cfg_path))) os.path.dirname(os.path.abspath(self.cfg_path)))
logger.info("Load the pretrained model:") logger.debug("Load the pretrained model:")
logger.info(f" tag = {tag}") logger.debug(f" tag = {tag}")
logger.info(f" res_path: {self.res_path}") logger.debug(f" res_path: {self.res_path}")
logger.info(f" cfg path: {self.cfg_path}") logger.debug(f" cfg path: {self.cfg_path}")
logger.info(f" am_model path: {self.am_model}") logger.debug(f" am_model path: {self.am_model}")
logger.info(f" am_params path: {self.am_params}") logger.debug(f" am_params path: {self.am_params}")
#Init body. #Init body.
self.config = CfgNode(new_allowed=True) self.config = CfgNode(new_allowed=True)
@ -451,7 +451,7 @@ class ASRServerExecutor(ASRExecutor):
if self.config.spm_model_prefix: if self.config.spm_model_prefix:
self.config.spm_model_prefix = os.path.join( self.config.spm_model_prefix = os.path.join(
self.res_path, self.config.spm_model_prefix) self.res_path, self.config.spm_model_prefix)
logger.info(f"spm model path: {self.config.spm_model_prefix}") logger.debug(f"spm model path: {self.config.spm_model_prefix}")
self.vocab = self.config.vocab_filepath self.vocab = self.config.vocab_filepath
@ -465,7 +465,7 @@ class ASRServerExecutor(ASRExecutor):
# AM predictor # AM predictor
self.init_model() self.init_model()
logger.info(f"create the {model_type} model success") logger.debug(f"create the {model_type} model success")
return True return True
@ -516,7 +516,7 @@ class ASREngine(BaseEngine):
"If all GPU or XPU is used, you can set the server to 'cpu'") "If all GPU or XPU is used, you can set the server to 'cpu'")
sys.exit(-1) sys.exit(-1)
logger.info(f"paddlespeech_server set the device: {self.device}") logger.debug(f"paddlespeech_server set the device: {self.device}")
if not self.init_model(): if not self.init_model():
logger.error( logger.error(
@ -524,7 +524,9 @@ class ASREngine(BaseEngine):
) )
return False return False
logger.info("Initialize ASR server engine successfully.") logger.info("Initialize ASR server engine successfully on device: %s." %
(self.device))
return True return True
def new_handler(self): def new_handler(self):

@ -49,7 +49,7 @@ class PaddleASRConnectionHanddler:
asr_engine (ASREngine): the global asr engine asr_engine (ASREngine): the global asr engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"create a paddle asr connection handler to process the websocket connection" "create a paddle asr connection handler to process the websocket connection"
) )
self.config = asr_engine.config # server config self.config = asr_engine.config # server config
@ -107,7 +107,7 @@ class PaddleASRConnectionHanddler:
# acoustic model # acoustic model
self.model = self.asr_engine.executor.model self.model = self.asr_engine.executor.model
self.continuous_decoding = self.config.continuous_decoding self.continuous_decoding = self.config.continuous_decoding
logger.info(f"continue decoding: {self.continuous_decoding}") logger.debug(f"continue decoding: {self.continuous_decoding}")
# ctc decoding config # ctc decoding config
self.ctc_decode_config = self.asr_engine.executor.config.decode self.ctc_decode_config = self.asr_engine.executor.config.decode
@ -207,7 +207,7 @@ class PaddleASRConnectionHanddler:
assert samples.ndim == 1 assert samples.ndim == 1
self.num_samples += samples.shape[0] self.num_samples += samples.shape[0]
logger.info( logger.debug(
f"This package receives {samples.shape[0]} pcm data. Global samples:{self.num_samples}" f"This package receives {samples.shape[0]} pcm data. Global samples:{self.num_samples}"
) )
@ -218,7 +218,7 @@ class PaddleASRConnectionHanddler:
else: else:
assert self.remained_wav.ndim == 1 # (T,) assert self.remained_wav.ndim == 1 # (T,)
self.remained_wav = np.concatenate([self.remained_wav, samples]) self.remained_wav = np.concatenate([self.remained_wav, samples])
logger.info( logger.debug(
f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}" f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}"
) )
@ -252,14 +252,14 @@ class PaddleASRConnectionHanddler:
# update remained wav # update remained wav
self.remained_wav = self.remained_wav[self.n_shift * num_frames:] self.remained_wav = self.remained_wav[self.n_shift * num_frames:]
logger.info( logger.debug(
f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}" f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}"
) )
logger.info( logger.debug(
f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}" f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}"
) )
logger.info(f"global samples: {self.num_samples}") logger.debug(f"global samples: {self.num_samples}")
logger.info(f"global frames: {self.num_frames}") logger.debug(f"global frames: {self.num_frames}")
def decode(self, is_finished=False): def decode(self, is_finished=False):
"""advance decoding """advance decoding
@ -283,24 +283,24 @@ class PaddleASRConnectionHanddler:
stride = subsampling * decoding_chunk_size stride = subsampling * decoding_chunk_size
if self.cached_feat is None: if self.cached_feat is None:
logger.info("no audio feat, please input more pcm data") logger.debug("no audio feat, please input more pcm data")
return return
num_frames = self.cached_feat.shape[1] num_frames = self.cached_feat.shape[1]
logger.info( logger.debug(
f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
) )
# the cached feat must be larger than decoding_window # the cached feat must be larger than decoding_window
if num_frames < decoding_window and not is_finished: if num_frames < decoding_window and not is_finished:
logger.info( logger.debug(
f"frame feat num is less than {decoding_window}, please input more pcm data" f"frame feat num is less than {decoding_window}, please input more pcm data"
) )
return None, None return None, None
# if is_finished=True, we need at least context frames # if is_finished=True, we need at least context frames
if num_frames < context: if num_frames < context:
logger.info( logger.debug(
f"last {num_frames} is less than context {context} frames, and we cannot do model forward" f"last {num_frames} is less than context {context} frames, and we cannot do model forward"
) )
return None, None return None, None
@ -354,7 +354,7 @@ class PaddleASRConnectionHanddler:
Returns: Returns:
logprob: posterior probability. logprob: posterior probability.
""" """
logger.info("start to decode one chunk for deepspeech2") logger.debug("start to decode one chunk for deepspeech2")
input_names = self.am_predictor.get_input_names() input_names = self.am_predictor.get_input_names()
audio_handle = self.am_predictor.get_input_handle(input_names[0]) audio_handle = self.am_predictor.get_input_handle(input_names[0])
audio_len_handle = self.am_predictor.get_input_handle(input_names[1]) audio_len_handle = self.am_predictor.get_input_handle(input_names[1])
@ -391,7 +391,7 @@ class PaddleASRConnectionHanddler:
self.decoder.next(output_chunk_probs, output_chunk_lens) self.decoder.next(output_chunk_probs, output_chunk_lens)
trans_best, trans_beam = self.decoder.decode() trans_best, trans_beam = self.decoder.decode()
logger.info(f"decode one best result for deepspeech2: {trans_best[0]}") logger.debug(f"decode one best result for deepspeech2: {trans_best[0]}")
return trans_best[0] return trans_best[0]
@paddle.no_grad() @paddle.no_grad()
@ -402,7 +402,7 @@ class PaddleASRConnectionHanddler:
# reset endpoint state # reset endpoint state
self.endpoint_state = False self.endpoint_state = False
logger.info( logger.debug(
"Conformer/Transformer: start to decode with advanced_decoding method" "Conformer/Transformer: start to decode with advanced_decoding method"
) )
cfg = self.ctc_decode_config cfg = self.ctc_decode_config
@ -427,25 +427,25 @@ class PaddleASRConnectionHanddler:
stride = subsampling * decoding_chunk_size stride = subsampling * decoding_chunk_size
if self.cached_feat is None: if self.cached_feat is None:
logger.info("no audio feat, please input more pcm data") logger.debug("no audio feat, please input more pcm data")
return return
# (B=1,T,D) # (B=1,T,D)
num_frames = self.cached_feat.shape[1] num_frames = self.cached_feat.shape[1]
logger.info( logger.debug(
f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
) )
# the cached feat must be larger than decoding_window # the cached feat must be larger than decoding_window
if num_frames < decoding_window and not is_finished: if num_frames < decoding_window and not is_finished:
logger.info( logger.debug(
f"frame feat num is less than {decoding_window}, please input more pcm data" f"frame feat num is less than {decoding_window}, please input more pcm data"
) )
return None, None return None, None
# if is_finished=True, we need at least context frames # if is_finished=True, we need at least context frames
if num_frames < context: if num_frames < context:
logger.info( logger.debug(
f"last {num_frames} is less than context {context} frames, and we cannot do model forward" f"last {num_frames} is less than context {context} frames, and we cannot do model forward"
) )
return None, None return None, None
@ -489,7 +489,7 @@ class PaddleASRConnectionHanddler:
self.encoder_out = ys self.encoder_out = ys
else: else:
self.encoder_out = paddle.concat([self.encoder_out, ys], axis=1) self.encoder_out = paddle.concat([self.encoder_out, ys], axis=1)
logger.info( logger.debug(
f"This connection handler encoder out shape: {self.encoder_out.shape}" f"This connection handler encoder out shape: {self.encoder_out.shape}"
) )
@ -513,7 +513,8 @@ class PaddleASRConnectionHanddler:
if self.endpointer.endpoint_detected(ctc_probs.numpy(), if self.endpointer.endpoint_detected(ctc_probs.numpy(),
decoding_something): decoding_something):
self.endpoint_state = True self.endpoint_state = True
logger.info(f"Endpoint is detected at {self.num_frames} frame.") logger.debug(
f"Endpoint is detected at {self.num_frames} frame.")
# advance cache of feat # advance cache of feat
assert self.cached_feat.shape[0] == 1 #(B=1,T,D) assert self.cached_feat.shape[0] == 1 #(B=1,T,D)
@ -526,7 +527,7 @@ class PaddleASRConnectionHanddler:
def update_result(self): def update_result(self):
"""Conformer/Transformer hyps to result. """Conformer/Transformer hyps to result.
""" """
logger.info("update the final result") logger.debug("update the final result")
hyps = self.hyps hyps = self.hyps
# output results and tokenids # output results and tokenids
@ -560,16 +561,16 @@ class PaddleASRConnectionHanddler:
only for conformer and transformer model. only for conformer and transformer model.
""" """
if "deepspeech2" in self.model_type: if "deepspeech2" in self.model_type:
logger.info("deepspeech2 does not support rescoring decoding.") logger.debug("deepspeech2 does not support rescoring decoding.")
return return
if "attention_rescoring" != self.ctc_decode_config.decoding_method: if "attention_rescoring" != self.ctc_decode_config.decoding_method:
logger.info( logger.debug(
f"decoding method not match: {self.ctc_decode_config.decoding_method}, need attention_rescoring" f"decoding method not match: {self.ctc_decode_config.decoding_method}, need attention_rescoring"
) )
return return
logger.info("rescoring the final result") logger.debug("rescoring the final result")
# last decoding for last audio # last decoding for last audio
self.searcher.finalize_search() self.searcher.finalize_search()
@ -685,7 +686,6 @@ class PaddleASRConnectionHanddler:
"bg": global_offset_in_sec + start, "bg": global_offset_in_sec + start,
"ed": global_offset_in_sec + end "ed": global_offset_in_sec + end
}) })
# logger.info(f"{word_time_stamp[-1]}")
self.word_time_stamp = word_time_stamp self.word_time_stamp = word_time_stamp
logger.info(f"word time stamp: {self.word_time_stamp}") logger.info(f"word time stamp: {self.word_time_stamp}")
@ -707,13 +707,13 @@ class ASRServerExecutor(ASRExecutor):
lm_url = self.task_resource.res_dict['lm_url'] lm_url = self.task_resource.res_dict['lm_url']
lm_md5 = self.task_resource.res_dict['lm_md5'] lm_md5 = self.task_resource.res_dict['lm_md5']
logger.info(f"Start to load language model {lm_url}") logger.debug(f"Start to load language model {lm_url}")
self.download_lm( self.download_lm(
lm_url, lm_url,
os.path.dirname(self.config.decode.lang_model_path), lm_md5) os.path.dirname(self.config.decode.lang_model_path), lm_md5)
elif "conformer" in self.model_type or "transformer" in self.model_type: elif "conformer" in self.model_type or "transformer" in self.model_type:
with UpdateConfig(self.config): with UpdateConfig(self.config):
logger.info("start to create the stream conformer asr engine") logger.debug("start to create the stream conformer asr engine")
# update the decoding method # update the decoding method
if self.decode_method: if self.decode_method:
self.config.decode.decoding_method = self.decode_method self.config.decode.decoding_method = self.decode_method
@ -726,7 +726,7 @@ class ASRServerExecutor(ASRExecutor):
if self.config.decode.decoding_method not in [ if self.config.decode.decoding_method not in [
"ctc_prefix_beam_search", "attention_rescoring" "ctc_prefix_beam_search", "attention_rescoring"
]: ]:
logger.info( logger.debug(
"we set the decoding_method to attention_rescoring") "we set the decoding_method to attention_rescoring")
self.config.decode.decoding_method = "attention_rescoring" self.config.decode.decoding_method = "attention_rescoring"
@ -739,7 +739,7 @@ class ASRServerExecutor(ASRExecutor):
def init_model(self) -> None: def init_model(self) -> None:
if "deepspeech2" in self.model_type: if "deepspeech2" in self.model_type:
# AM predictor # AM predictor
logger.info("ASR engine start to init the am predictor") logger.debug("ASR engine start to init the am predictor")
self.am_predictor = init_predictor( self.am_predictor = init_predictor(
model_file=self.am_model, model_file=self.am_model,
params_file=self.am_params, params_file=self.am_params,
@ -748,7 +748,7 @@ class ASRServerExecutor(ASRExecutor):
# load model # load model
# model_type: {model_name}_{dataset} # model_type: {model_name}_{dataset}
model_name = self.model_type[:self.model_type.rindex('_')] model_name = self.model_type[:self.model_type.rindex('_')]
logger.info(f"model name: {model_name}") logger.debug(f"model name: {model_name}")
model_class = self.task_resource.get_model_class(model_name) model_class = self.task_resource.get_model_class(model_name)
model = model_class.from_config(self.config) model = model_class.from_config(self.config)
self.model = model self.model = model
@ -782,7 +782,7 @@ class ASRServerExecutor(ASRExecutor):
self.num_decoding_left_chunks = num_decoding_left_chunks self.num_decoding_left_chunks = num_decoding_left_chunks
# conf for paddleinference predictor or onnx # conf for paddleinference predictor or onnx
self.am_predictor_conf = am_predictor_conf self.am_predictor_conf = am_predictor_conf
logger.info(f"model_type: {self.model_type}") logger.debug(f"model_type: {self.model_type}")
sample_rate_str = '16k' if sample_rate == 16000 else '8k' sample_rate_str = '16k' if sample_rate == 16000 else '8k'
tag = model_type + '-' + lang + '-' + sample_rate_str tag = model_type + '-' + lang + '-' + sample_rate_str
@ -804,12 +804,12 @@ class ASRServerExecutor(ASRExecutor):
self.res_path = os.path.dirname( self.res_path = os.path.dirname(
os.path.dirname(os.path.abspath(self.cfg_path))) os.path.dirname(os.path.abspath(self.cfg_path)))
logger.info("Load the pretrained model:") logger.debug("Load the pretrained model:")
logger.info(f" tag = {tag}") logger.debug(f" tag = {tag}")
logger.info(f" res_path: {self.res_path}") logger.debug(f" res_path: {self.res_path}")
logger.info(f" cfg path: {self.cfg_path}") logger.debug(f" cfg path: {self.cfg_path}")
logger.info(f" am_model path: {self.am_model}") logger.debug(f" am_model path: {self.am_model}")
logger.info(f" am_params path: {self.am_params}") logger.debug(f" am_params path: {self.am_params}")
#Init body. #Init body.
self.config = CfgNode(new_allowed=True) self.config = CfgNode(new_allowed=True)
@ -818,7 +818,7 @@ class ASRServerExecutor(ASRExecutor):
if self.config.spm_model_prefix: if self.config.spm_model_prefix:
self.config.spm_model_prefix = os.path.join( self.config.spm_model_prefix = os.path.join(
self.res_path, self.config.spm_model_prefix) self.res_path, self.config.spm_model_prefix)
logger.info(f"spm model path: {self.config.spm_model_prefix}") logger.debug(f"spm model path: {self.config.spm_model_prefix}")
self.vocab = self.config.vocab_filepath self.vocab = self.config.vocab_filepath
@ -832,7 +832,7 @@ class ASRServerExecutor(ASRExecutor):
# AM predictor # AM predictor
self.init_model() self.init_model()
logger.info(f"create the {model_type} model success") logger.debug(f"create the {model_type} model success")
return True return True
@ -883,7 +883,7 @@ class ASREngine(BaseEngine):
"If all GPU or XPU is used, you can set the server to 'cpu'") "If all GPU or XPU is used, you can set the server to 'cpu'")
sys.exit(-1) sys.exit(-1)
logger.info(f"paddlespeech_server set the device: {self.device}") logger.debug(f"paddlespeech_server set the device: {self.device}")
if not self.init_model(): if not self.init_model():
logger.error( logger.error(
@ -891,7 +891,9 @@ class ASREngine(BaseEngine):
) )
return False return False
logger.info("Initialize ASR server engine successfully.") logger.info("Initialize ASR server engine successfully on device: %s." %
(self.device))
return True return True
def new_handler(self): def new_handler(self):

@ -65,10 +65,10 @@ class ASRServerExecutor(ASRExecutor):
self.task_resource.res_dict['model']) self.task_resource.res_dict['model'])
self.am_params = os.path.join(self.res_path, self.am_params = os.path.join(self.res_path,
self.task_resource.res_dict['params']) self.task_resource.res_dict['params'])
logger.info(self.res_path) logger.debug(self.res_path)
logger.info(self.cfg_path) logger.debug(self.cfg_path)
logger.info(self.am_model) logger.debug(self.am_model)
logger.info(self.am_params) logger.debug(self.am_params)
else: else:
self.cfg_path = os.path.abspath(cfg_path) self.cfg_path = os.path.abspath(cfg_path)
self.am_model = os.path.abspath(am_model) self.am_model = os.path.abspath(am_model)
@ -236,16 +236,16 @@ class PaddleASRConnectionHandler(ASRServerExecutor):
if self._check( if self._check(
io.BytesIO(audio_data), self.asr_engine.config.sample_rate, io.BytesIO(audio_data), self.asr_engine.config.sample_rate,
self.asr_engine.config.force_yes): self.asr_engine.config.force_yes):
logger.info("start running asr engine") logger.debug("start running asr engine")
self.preprocess(self.asr_engine.config.model_type, self.preprocess(self.asr_engine.config.model_type,
io.BytesIO(audio_data)) io.BytesIO(audio_data))
st = time.time() st = time.time()
self.infer(self.asr_engine.config.model_type) self.infer(self.asr_engine.config.model_type)
infer_time = time.time() - st infer_time = time.time() - st
self.output = self.postprocess() # Retrieve result of asr. self.output = self.postprocess() # Retrieve result of asr.
logger.info("end inferring asr engine") logger.debug("end inferring asr engine")
else: else:
logger.info("file check failed!") logger.error("file check failed!")
self.output = None self.output = None
logger.info("inference time: {}".format(infer_time)) logger.info("inference time: {}".format(infer_time))

@ -104,7 +104,7 @@ class PaddleASRConnectionHandler(ASRServerExecutor):
if self._check( if self._check(
io.BytesIO(audio_data), self.asr_engine.config.sample_rate, io.BytesIO(audio_data), self.asr_engine.config.sample_rate,
self.asr_engine.config.force_yes): self.asr_engine.config.force_yes):
logger.info("start run asr engine") logger.debug("start run asr engine")
self.preprocess(self.asr_engine.config.model, self.preprocess(self.asr_engine.config.model,
io.BytesIO(audio_data)) io.BytesIO(audio_data))
st = time.time() st = time.time()
@ -112,7 +112,7 @@ class PaddleASRConnectionHandler(ASRServerExecutor):
infer_time = time.time() - st infer_time = time.time() - st
self.output = self.postprocess() # Retrieve result of asr. self.output = self.postprocess() # Retrieve result of asr.
else: else:
logger.info("file check failed!") logger.error("file check failed!")
self.output = None self.output = None
logger.info("inference time: {}".format(infer_time)) logger.info("inference time: {}".format(infer_time))

@ -67,22 +67,22 @@ class CLSServerExecutor(CLSExecutor):
self.params_path = os.path.abspath(params_path) self.params_path = os.path.abspath(params_path)
self.label_file = os.path.abspath(label_file) self.label_file = os.path.abspath(label_file)
logger.info(self.cfg_path) logger.debug(self.cfg_path)
logger.info(self.model_path) logger.debug(self.model_path)
logger.info(self.params_path) logger.debug(self.params_path)
logger.info(self.label_file) logger.debug(self.label_file)
# config # config
with open(self.cfg_path, 'r') as f: with open(self.cfg_path, 'r') as f:
self._conf = yaml.safe_load(f) self._conf = yaml.safe_load(f)
logger.info("Read cfg file successfully.") logger.debug("Read cfg file successfully.")
# labels # labels
self._label_list = [] self._label_list = []
with open(self.label_file, 'r') as f: with open(self.label_file, 'r') as f:
for line in f: for line in f:
self._label_list.append(line.strip()) self._label_list.append(line.strip())
logger.info("Read label file successfully.") logger.debug("Read label file successfully.")
# Create predictor # Create predictor
self.predictor_conf = predictor_conf self.predictor_conf = predictor_conf
@ -90,7 +90,7 @@ class CLSServerExecutor(CLSExecutor):
model_file=self.model_path, model_file=self.model_path,
params_file=self.params_path, params_file=self.params_path,
predictor_conf=self.predictor_conf) predictor_conf=self.predictor_conf)
logger.info("Create predictor successfully.") logger.debug("Create predictor successfully.")
@paddle.no_grad() @paddle.no_grad()
def infer(self): def infer(self):
@ -148,7 +148,8 @@ class CLSEngine(BaseEngine):
logger.error(e) logger.error(e)
return False return False
logger.info("Initialize CLS server engine successfully.") logger.info("Initialize CLS server engine successfully on device: %s." %
(self.device))
return True return True
@ -160,7 +161,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
cls_engine (CLSEngine): The CLS engine cls_engine (CLSEngine): The CLS engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleCLSConnectionHandler to process the cls request") "Create PaddleCLSConnectionHandler to process the cls request")
self._inputs = OrderedDict() self._inputs = OrderedDict()
@ -183,7 +184,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
self.infer() self.infer()
infer_time = time.time() - st infer_time = time.time() - st
logger.info("inference time: {}".format(infer_time)) logger.debug("inference time: {}".format(infer_time))
logger.info("cls engine type: inference") logger.info("cls engine type: inference")
def postprocess(self, topk: int): def postprocess(self, topk: int):

@ -88,7 +88,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
cls_engine (CLSEngine): The CLS engine cls_engine (CLSEngine): The CLS engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleCLSConnectionHandler to process the cls request") "Create PaddleCLSConnectionHandler to process the cls request")
self._inputs = OrderedDict() self._inputs = OrderedDict()
@ -110,7 +110,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
self.infer() self.infer()
infer_time = time.time() - st infer_time = time.time() - st
logger.info("inference time: {}".format(infer_time)) logger.debug("inference time: {}".format(infer_time))
logger.info("cls engine type: python") logger.info("cls engine type: python")
def postprocess(self, topk: int): def postprocess(self, topk: int):

@ -13,7 +13,7 @@
# limitations under the License. # limitations under the License.
from typing import Text from typing import Text
from ..utils.log import logger from paddlespeech.cli.log import logger
__all__ = ['EngineFactory'] __all__ = ['EngineFactory']

@ -45,7 +45,7 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
logger.error("Please check the engine type.") logger.error("Please check the engine type.")
try: try:
logger.info("Start to warm up tts engine.") logger.debug("Start to warm up tts engine.")
for i in range(warm_up_time): for i in range(warm_up_time):
connection_handler = PaddleTTSConnectionHandler(tts_engine) connection_handler = PaddleTTSConnectionHandler(tts_engine)
if flag_online: if flag_online:
@ -53,7 +53,7 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
text=sentence, text=sentence,
lang=tts_engine.lang, lang=tts_engine.lang,
am=tts_engine.config.am): am=tts_engine.config.am):
logger.info( logger.debug(
f"The first response time of the {i} warm up: {connection_handler.first_response_time} s" f"The first response time of the {i} warm up: {connection_handler.first_response_time} s"
) )
break break
@ -62,7 +62,7 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
st = time.time() st = time.time()
connection_handler.infer(text=sentence) connection_handler.infer(text=sentence)
et = time.time() et = time.time()
logger.info( logger.debug(
f"The response time of the {i} warm up: {et - st} s") f"The response time of the {i} warm up: {et - st} s")
except Exception as e: except Exception as e:
logger.error("Failed to warm up on tts engine.") logger.error("Failed to warm up on tts engine.")

@ -28,7 +28,7 @@ class PaddleTextConnectionHandler:
text_engine (TextEngine): The Text engine text_engine (TextEngine): The Text engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleTextConnectionHandler to process the text request") "Create PaddleTextConnectionHandler to process the text request")
self.text_engine = text_engine self.text_engine = text_engine
self.task = self.text_engine.executor.task self.task = self.text_engine.executor.task
@ -130,7 +130,7 @@ class TextEngine(BaseEngine):
"""The Text Engine """The Text Engine
""" """
super(TextEngine, self).__init__() super(TextEngine, self).__init__()
logger.info("Create the TextEngine Instance") logger.debug("Create the TextEngine Instance")
def init(self, config: dict): def init(self, config: dict):
"""Init the Text Engine """Init the Text Engine
@ -141,7 +141,7 @@ class TextEngine(BaseEngine):
Returns: Returns:
bool: The engine instance flag bool: The engine instance flag
""" """
logger.info("Init the text engine") logger.debug("Init the text engine")
try: try:
self.config = config self.config = config
if self.config.device: if self.config.device:
@ -150,7 +150,7 @@ class TextEngine(BaseEngine):
self.device = paddle.get_device() self.device = paddle.get_device()
paddle.set_device(self.device) paddle.set_device(self.device)
logger.info(f"Text Engine set the device: {self.device}") logger.debug(f"Text Engine set the device: {self.device}")
except BaseException as e: except BaseException as e:
logger.error( logger.error(
"Set device failed, please check if device is already used and the parameter 'device' in the yaml file" "Set device failed, please check if device is already used and the parameter 'device' in the yaml file"
@ -168,5 +168,6 @@ class TextEngine(BaseEngine):
ckpt_path=config.ckpt_path, ckpt_path=config.ckpt_path,
vocab_file=config.vocab_file) vocab_file=config.vocab_file)
logger.info("Init the text engine successfully") logger.info("Initialize Text server engine successfully on device: %s."
% (self.device))
return True return True
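
Note on the new success message: it builds the string eagerly with `"... %s." % (self.device)`. As a possible alternative, not something this commit does, the standard `logging` API also accepts the arguments separately and defers the formatting until a handler needs the message. The sketch below assumes a plain standard-library logger; `ListHandler`, `log_device` and the logger name `engine_init_demo` are hypothetical names used only here.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def log_device(device: str) -> list:
    handler = ListHandler()
    demo_logger = logging.getLogger("engine_init_demo")  # hypothetical logger name
    demo_logger.handlers = [handler]
    demo_logger.propagate = False
    demo_logger.setLevel(logging.INFO)

    # Eager formatting, as written in the diff.
    demo_logger.info(
        "Initialize Text server engine successfully on device: %s." % (device))
    # Deferred formatting: logging interpolates the argument only if the record is emitted.
    demo_logger.info(
        "Initialize Text server engine successfully on device: %s.", device)
    return handler.messages
```

Both calls produce the same text for a string device such as `"cpu"`; the deferred form simply skips the interpolation when the level filters the record out.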

@ -62,7 +62,7 @@ class TTSServerExecutor(TTSExecutor):
(hasattr(self, 'am_encoder_infer_sess') and (hasattr(self, 'am_encoder_infer_sess') and
hasattr(self, 'am_decoder_sess') and hasattr( hasattr(self, 'am_decoder_sess') and hasattr(
self, 'am_postnet_sess'))) and hasattr(self, 'voc_inference'): self, 'am_postnet_sess'))) and hasattr(self, 'voc_inference'):
logger.info('Models had been initialized.') logger.debug('Models had been initialized.')
return return
# am # am
am_tag = am + '-' + lang am_tag = am + '-' + lang
@ -85,8 +85,7 @@ class TTSServerExecutor(TTSExecutor):
else: else:
self.am_ckpt = os.path.abspath(am_ckpt[0]) self.am_ckpt = os.path.abspath(am_ckpt[0])
self.phones_dict = os.path.abspath(phones_dict) self.phones_dict = os.path.abspath(phones_dict)
self.am_res_path = os.path.dirname( self.am_res_path = os.path.dirname(os.path.abspath(am_ckpt))
os.path.abspath(am_ckpt))
# create am sess # create am sess
self.am_sess = get_sess(self.am_ckpt, am_sess_conf) self.am_sess = get_sess(self.am_ckpt, am_sess_conf)
@ -119,8 +118,7 @@ class TTSServerExecutor(TTSExecutor):
self.am_postnet = os.path.abspath(am_ckpt[2]) self.am_postnet = os.path.abspath(am_ckpt[2])
self.phones_dict = os.path.abspath(phones_dict) self.phones_dict = os.path.abspath(phones_dict)
self.am_stat = os.path.abspath(am_stat) self.am_stat = os.path.abspath(am_stat)
self.am_res_path = os.path.dirname( self.am_res_path = os.path.dirname(os.path.abspath(am_ckpt[0]))
os.path.abspath(am_ckpt[0]))
# create am sess # create am sess
self.am_encoder_infer_sess = get_sess(self.am_encoder_infer, self.am_encoder_infer_sess = get_sess(self.am_encoder_infer,
@ -130,9 +128,9 @@ class TTSServerExecutor(TTSExecutor):
self.am_mu, self.am_std = np.load(self.am_stat) self.am_mu, self.am_std = np.load(self.am_stat)
logger.info(f"self.phones_dict: {self.phones_dict}") logger.debug(f"self.phones_dict: {self.phones_dict}")
logger.info(f"am model dir: {self.am_res_path}") logger.debug(f"am model dir: {self.am_res_path}")
logger.info("Create am sess successfully.") logger.debug("Create am sess successfully.")
# voc model info # voc model info
voc_tag = voc + '-' + lang voc_tag = voc + '-' + lang
@ -149,16 +147,16 @@ class TTSServerExecutor(TTSExecutor):
else: else:
self.voc_ckpt = os.path.abspath(voc_ckpt) self.voc_ckpt = os.path.abspath(voc_ckpt)
self.voc_res_path = os.path.dirname(os.path.abspath(self.voc_ckpt)) self.voc_res_path = os.path.dirname(os.path.abspath(self.voc_ckpt))
logger.info(self.voc_res_path) logger.debug(self.voc_res_path)
# create voc sess # create voc sess
self.voc_sess = get_sess(self.voc_ckpt, voc_sess_conf) self.voc_sess = get_sess(self.voc_ckpt, voc_sess_conf)
logger.info("Create voc sess successfully.") logger.debug("Create voc sess successfully.")
with open(self.phones_dict, "r") as f: with open(self.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()] phn_id = [line.strip().split() for line in f.readlines()]
self.vocab_size = len(phn_id) self.vocab_size = len(phn_id)
logger.info(f"vocab_size: {self.vocab_size}") logger.debug(f"vocab_size: {self.vocab_size}")
# frontend # frontend
self.tones_dict = None self.tones_dict = None
@ -169,7 +167,7 @@ class TTSServerExecutor(TTSExecutor):
elif lang == 'en': elif lang == 'en':
self.frontend = English(phone_vocab_path=self.phones_dict) self.frontend = English(phone_vocab_path=self.phones_dict)
logger.info("frontend done!") logger.debug("frontend done!")
class TTSEngine(BaseEngine): class TTSEngine(BaseEngine):
@ -267,7 +265,7 @@ class PaddleTTSConnectionHandler:
tts_engine (TTSEngine): The TTS engine tts_engine (TTSEngine): The TTS engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleTTSConnectionHandler to process the tts request") "Create PaddleTTSConnectionHandler to process the tts request")
self.tts_engine = tts_engine self.tts_engine = tts_engine

@ -102,16 +102,22 @@ class TTSServerExecutor(TTSExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'): if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'):
logger.info('Models had been initialized.') logger.debug('Models had been initialized.')
return return
# am model info # am model info
if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None:
use_pretrained_am = True
else:
use_pretrained_am = False
am_tag = am + '-' + lang am_tag = am + '-' + lang
self.task_resource.set_task_model( self.task_resource.set_task_model(
model_tag=am_tag, model_tag=am_tag,
model_type=0, # am model_type=0, # am
skip_download=not use_pretrained_am,
version=None, # default version version=None, # default version
) )
if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None: if use_pretrained_am:
self.am_res_path = self.task_resource.res_dir self.am_res_path = self.task_resource.res_dir
self.am_config = os.path.join(self.am_res_path, self.am_config = os.path.join(self.am_res_path,
self.task_resource.res_dict['config']) self.task_resource.res_dict['config'])
@ -122,29 +128,33 @@ class TTSServerExecutor(TTSExecutor):
# must have phones_dict in acoustic # must have phones_dict in acoustic
self.phones_dict = os.path.join( self.phones_dict = os.path.join(
self.am_res_path, self.task_resource.res_dict['phones_dict']) self.am_res_path, self.task_resource.res_dict['phones_dict'])
print("self.phones_dict:", self.phones_dict) logger.debug(self.am_res_path)
logger.info(self.am_res_path) logger.debug(self.am_config)
logger.info(self.am_config) logger.debug(self.am_ckpt)
logger.info(self.am_ckpt)
else: else:
self.am_config = os.path.abspath(am_config) self.am_config = os.path.abspath(am_config)
self.am_ckpt = os.path.abspath(am_ckpt) self.am_ckpt = os.path.abspath(am_ckpt)
self.am_stat = os.path.abspath(am_stat) self.am_stat = os.path.abspath(am_stat)
self.phones_dict = os.path.abspath(phones_dict) self.phones_dict = os.path.abspath(phones_dict)
self.am_res_path = os.path.dirname(os.path.abspath(self.am_config)) self.am_res_path = os.path.dirname(os.path.abspath(self.am_config))
print("self.phones_dict:", self.phones_dict)
self.tones_dict = None self.tones_dict = None
self.speaker_dict = None self.speaker_dict = None
# voc model info # voc model info
if voc_ckpt is None or voc_config is None or voc_stat is None:
use_pretrained_voc = True
else:
use_pretrained_voc = False
voc_tag = voc + '-' + lang voc_tag = voc + '-' + lang
self.task_resource.set_task_model( self.task_resource.set_task_model(
model_tag=voc_tag, model_tag=voc_tag,
model_type=1, # vocoder model_type=1, # vocoder
skip_download=not use_pretrained_voc,
version=None, # default version version=None, # default version
) )
if voc_ckpt is None or voc_config is None or voc_stat is None: if use_pretrained_voc:
self.voc_res_path = self.task_resource.voc_res_dir self.voc_res_path = self.task_resource.voc_res_dir
self.voc_config = os.path.join( self.voc_config = os.path.join(
self.voc_res_path, self.task_resource.voc_res_dict['config']) self.voc_res_path, self.task_resource.voc_res_dict['config'])
@ -153,9 +163,9 @@ class TTSServerExecutor(TTSExecutor):
self.voc_stat = os.path.join( self.voc_stat = os.path.join(
self.voc_res_path, self.voc_res_path,
self.task_resource.voc_res_dict['speech_stats']) self.task_resource.voc_res_dict['speech_stats'])
logger.info(self.voc_res_path) logger.debug(self.voc_res_path)
logger.info(self.voc_config) logger.debug(self.voc_config)
logger.info(self.voc_ckpt) logger.debug(self.voc_ckpt)
else: else:
self.voc_config = os.path.abspath(voc_config) self.voc_config = os.path.abspath(voc_config)
self.voc_ckpt = os.path.abspath(voc_ckpt) self.voc_ckpt = os.path.abspath(voc_ckpt)
@ -172,7 +182,6 @@ class TTSServerExecutor(TTSExecutor):
with open(self.phones_dict, "r") as f: with open(self.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()] phn_id = [line.strip().split() for line in f.readlines()]
self.vocab_size = len(phn_id) self.vocab_size = len(phn_id)
print("vocab_size:", self.vocab_size)
# frontend # frontend
if lang == 'zh': if lang == 'zh':
@ -182,7 +191,6 @@ class TTSServerExecutor(TTSExecutor):
elif lang == 'en': elif lang == 'en':
self.frontend = English(phone_vocab_path=self.phones_dict) self.frontend = English(phone_vocab_path=self.phones_dict)
print("frontend done!")
# am infer info # am infer info
self.am_name = am[:am.rindex('_')] self.am_name = am[:am.rindex('_')]
@ -197,7 +205,6 @@ class TTSServerExecutor(TTSExecutor):
self.am_name + '_inference') self.am_name + '_inference')
self.am_inference = am_inference_class(am_normalizer, am) self.am_inference = am_inference_class(am_normalizer, am)
self.am_inference.eval() self.am_inference.eval()
print("acoustic model done!")
# voc infer info # voc infer info
self.voc_name = voc[:voc.rindex('_')] self.voc_name = voc[:voc.rindex('_')]
@ -208,7 +215,6 @@ class TTSServerExecutor(TTSExecutor):
'_inference') '_inference')
self.voc_inference = voc_inference_class(voc_normalizer, voc) self.voc_inference = voc_inference_class(voc_normalizer, voc)
self.voc_inference.eval() self.voc_inference.eval()
print("voc done!")
class TTSEngine(BaseEngine): class TTSEngine(BaseEngine):
@ -297,7 +303,7 @@ class PaddleTTSConnectionHandler:
tts_engine (TTSEngine): The TTS engine tts_engine (TTSEngine): The TTS engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleTTSConnectionHandler to process the tts request") "Create PaddleTTSConnectionHandler to process the tts request")
self.tts_engine = tts_engine self.tts_engine = tts_engine
@ -357,7 +363,7 @@ class PaddleTTSConnectionHandler:
text, merge_sentences=merge_sentences) text, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
else: else:
print("lang should in {'zh', 'en'}!") logger.error("lang should in {'zh', 'en'}!")
frontend_et = time.time() frontend_et = time.time()
self.frontend_time = frontend_et - frontend_st self.frontend_time = frontend_et - frontend_st
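
Note on the `use_pretrained_am` / `use_pretrained_voc` change in this file: the "is any path missing?" test is now computed once, before `set_task_model`, so that the same flag can both set `skip_download` and choose the branch that resolves the paths. A minimal sketch of that control flow follows. It does not call PaddleSpeech; `resolve_am_resources`, the `pretrained_dir` argument and the placeholder file names are assumptions made for illustration, since the real names come from `task_resource.res_dict`.

```python
import os
from typing import Optional


def resolve_am_resources(am_ckpt: Optional[str],
                         am_config: Optional[str],
                         am_stat: Optional[str],
                         phones_dict: Optional[str],
                         pretrained_dir: str) -> dict:
    # Same test as in the diff: any missing path means "fall back to the pretrained model".
    use_pretrained_am = (am_ckpt is None or am_config is None or
                         am_stat is None or phones_dict is None)

    if use_pretrained_am:
        # Placeholder file names (hypothetical); the real code reads them from res_dict.
        paths = {
            "am_config": os.path.join(pretrained_dir, "config.yml"),
            "am_ckpt": os.path.join(pretrained_dir, "ckpt.pdz"),
            "am_stat": os.path.join(pretrained_dir, "speech_stats.npy"),
            "phones_dict": os.path.join(pretrained_dir, "phone_id_map.txt"),
        }
    else:
        # User-supplied files are used as given, converted to absolute paths.
        paths = {
            "am_config": os.path.abspath(am_config),
            "am_ckpt": os.path.abspath(am_ckpt),
            "am_stat": os.path.abspath(am_stat),
            "phones_dict": os.path.abspath(phones_dict),
        }

    # Mirrors skip_download=not use_pretrained_am in the diff.
    paths["skip_download"] = not use_pretrained_am
    return paths
```

When every path is supplied, `skip_download` is true and nothing needs to be fetched; leaving any one of them as `None` switches the whole group to the pretrained resources.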

@ -65,16 +65,22 @@ class TTSServerExecutor(TTSExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
if hasattr(self, 'am_predictor') and hasattr(self, 'voc_predictor'): if hasattr(self, 'am_predictor') and hasattr(self, 'voc_predictor'):
logger.info('Models had been initialized.') logger.debug('Models had been initialized.')
return return
# am # am
if am_model is None or am_params is None or phones_dict is None:
use_pretrained_am = True
else:
use_pretrained_am = False
am_tag = am + '-' + lang am_tag = am + '-' + lang
self.task_resource.set_task_model( self.task_resource.set_task_model(
model_tag=am_tag, model_tag=am_tag,
model_type=0, # am model_type=0, # am
skip_download=not use_pretrained_am,
version=None, # default version version=None, # default version
) )
if am_model is None or am_params is None or phones_dict is None: if use_pretrained_am:
self.am_res_path = self.task_resource.res_dir self.am_res_path = self.task_resource.res_dir
self.am_model = os.path.join(self.am_res_path, self.am_model = os.path.join(self.am_res_path,
self.task_resource.res_dict['model']) self.task_resource.res_dict['model'])
@ -85,16 +91,16 @@ class TTSServerExecutor(TTSExecutor):
self.am_res_path, self.task_resource.res_dict['phones_dict']) self.am_res_path, self.task_resource.res_dict['phones_dict'])
self.am_sample_rate = self.task_resource.res_dict['sample_rate'] self.am_sample_rate = self.task_resource.res_dict['sample_rate']
logger.info(self.am_res_path) logger.debug(self.am_res_path)
logger.info(self.am_model) logger.debug(self.am_model)
logger.info(self.am_params) logger.debug(self.am_params)
else: else:
self.am_model = os.path.abspath(am_model) self.am_model = os.path.abspath(am_model)
self.am_params = os.path.abspath(am_params) self.am_params = os.path.abspath(am_params)
self.phones_dict = os.path.abspath(phones_dict) self.phones_dict = os.path.abspath(phones_dict)
self.am_sample_rate = am_sample_rate self.am_sample_rate = am_sample_rate
self.am_res_path = os.path.dirname(os.path.abspath(self.am_model)) self.am_res_path = os.path.dirname(os.path.abspath(self.am_model))
logger.info("self.phones_dict: {}".format(self.phones_dict)) logger.debug("self.phones_dict: {}".format(self.phones_dict))
# for speedyspeech # for speedyspeech
self.tones_dict = None self.tones_dict = None
@ -113,13 +119,19 @@ class TTSServerExecutor(TTSExecutor):
self.speaker_dict = speaker_dict self.speaker_dict = speaker_dict
# voc # voc
if voc_model is None or voc_params is None:
use_pretrained_voc = True
else:
use_pretrained_voc = False
voc_tag = voc + '-' + lang voc_tag = voc + '-' + lang
self.task_resource.set_task_model( self.task_resource.set_task_model(
model_tag=voc_tag, model_tag=voc_tag,
model_type=1, # vocoder model_type=1, # vocoder
skip_download=not use_pretrained_voc,
version=None, # default version version=None, # default version
) )
if voc_model is None or voc_params is None: if use_pretrained_voc:
self.voc_res_path = self.task_resource.voc_res_dir self.voc_res_path = self.task_resource.voc_res_dir
self.voc_model = os.path.join( self.voc_model = os.path.join(
self.voc_res_path, self.task_resource.voc_res_dict['model']) self.voc_res_path, self.task_resource.voc_res_dict['model'])
@ -127,9 +139,9 @@ class TTSServerExecutor(TTSExecutor):
self.voc_res_path, self.task_resource.voc_res_dict['params']) self.voc_res_path, self.task_resource.voc_res_dict['params'])
self.voc_sample_rate = self.task_resource.voc_res_dict[ self.voc_sample_rate = self.task_resource.voc_res_dict[
'sample_rate'] 'sample_rate']
logger.info(self.voc_res_path) logger.debug(self.voc_res_path)
logger.info(self.voc_model) logger.debug(self.voc_model)
logger.info(self.voc_params) logger.debug(self.voc_params)
else: else:
self.voc_model = os.path.abspath(voc_model) self.voc_model = os.path.abspath(voc_model)
self.voc_params = os.path.abspath(voc_params) self.voc_params = os.path.abspath(voc_params)
@ -144,21 +156,21 @@ class TTSServerExecutor(TTSExecutor):
with open(self.phones_dict, "r") as f: with open(self.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()] phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id) vocab_size = len(phn_id)
logger.info("vocab_size: {}".format(vocab_size)) logger.debug("vocab_size: {}".format(vocab_size))
tone_size = None tone_size = None
if self.tones_dict: if self.tones_dict:
with open(self.tones_dict, "r") as f: with open(self.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()] tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id) tone_size = len(tone_id)
logger.info("tone_size: {}".format(tone_size)) logger.debug("tone_size: {}".format(tone_size))
spk_num = None spk_num = None
if self.speaker_dict: if self.speaker_dict:
with open(self.speaker_dict, 'rt') as f: with open(self.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()] spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id) spk_num = len(spk_id)
logger.info("spk_num: {}".format(spk_num)) logger.debug("spk_num: {}".format(spk_num))
# frontend # frontend
if lang == 'zh': if lang == 'zh':
@ -168,7 +180,7 @@ class TTSServerExecutor(TTSExecutor):
elif lang == 'en': elif lang == 'en':
self.frontend = English(phone_vocab_path=self.phones_dict) self.frontend = English(phone_vocab_path=self.phones_dict)
logger.info("frontend done!") logger.debug("frontend done!")
# Create am predictor # Create am predictor
self.am_predictor_conf = am_predictor_conf self.am_predictor_conf = am_predictor_conf
@ -176,7 +188,7 @@ class TTSServerExecutor(TTSExecutor):
model_file=self.am_model, model_file=self.am_model,
params_file=self.am_params, params_file=self.am_params,
predictor_conf=self.am_predictor_conf) predictor_conf=self.am_predictor_conf)
logger.info("Create AM predictor successfully.") logger.debug("Create AM predictor successfully.")
# Create voc predictor # Create voc predictor
self.voc_predictor_conf = voc_predictor_conf self.voc_predictor_conf = voc_predictor_conf
@ -184,7 +196,7 @@ class TTSServerExecutor(TTSExecutor):
model_file=self.voc_model, model_file=self.voc_model,
params_file=self.voc_params, params_file=self.voc_params,
predictor_conf=self.voc_predictor_conf) predictor_conf=self.voc_predictor_conf)
logger.info("Create Vocoder predictor successfully.") logger.debug("Create Vocoder predictor successfully.")
@paddle.no_grad() @paddle.no_grad()
def infer(self, def infer(self,
@ -316,7 +328,8 @@ class TTSEngine(BaseEngine):
logger.error(e) logger.error(e)
return False return False
logger.info("Initialize TTS server engine successfully.") logger.info("Initialize TTS server engine successfully on device: %s." %
(self.device))
return True return True
@ -328,7 +341,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
tts_engine (TTSEngine): The TTS engine tts_engine (TTSEngine): The TTS engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleTTSConnectionHandler to process the tts request") "Create PaddleTTSConnectionHandler to process the tts request")
self.tts_engine = tts_engine self.tts_engine = tts_engine
@ -366,23 +379,23 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
if target_fs == 0 or target_fs > original_fs: if target_fs == 0 or target_fs > original_fs:
target_fs = original_fs target_fs = original_fs
wav_tar_fs = wav wav_tar_fs = wav
logger.info( logger.debug(
"The sample rate of synthesized audio is the same as model, which is {}Hz". "The sample rate of synthesized audio is the same as model, which is {}Hz".
format(original_fs)) format(original_fs))
else: else:
wav_tar_fs = librosa.resample( wav_tar_fs = librosa.resample(
np.squeeze(wav), original_fs, target_fs) np.squeeze(wav), original_fs, target_fs)
logger.info( logger.debug(
"The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.". "The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.".
format(original_fs, target_fs)) format(original_fs, target_fs))
# transform volume # transform volume
wav_vol = wav_tar_fs * volume wav_vol = wav_tar_fs * volume
logger.info("Transform the volume of the audio successfully.") logger.debug("Transform the volume of the audio successfully.")
# transform speed # transform speed
try: # windows not support soxbindings try: # windows not support soxbindings
wav_speed = change_speed(wav_vol, speed, target_fs) wav_speed = change_speed(wav_vol, speed, target_fs)
logger.info("Transform the speed of the audio successfully.") logger.debug("Transform the speed of the audio successfully.")
except ServerBaseException: except ServerBaseException:
raise ServerBaseException( raise ServerBaseException(
ErrorCode.SERVER_INTERNAL_ERR, ErrorCode.SERVER_INTERNAL_ERR,
@ -399,7 +412,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
wavfile.write(buf, target_fs, wav_speed) wavfile.write(buf, target_fs, wav_speed)
base64_bytes = base64.b64encode(buf.read()) base64_bytes = base64.b64encode(buf.read())
wav_base64 = base64_bytes.decode('utf-8') wav_base64 = base64_bytes.decode('utf-8')
logger.info("Audio to string successfully.") logger.debug("Audio to string successfully.")
# save audio # save audio
if audio_path is not None: if audio_path is not None:
@ -487,15 +500,15 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
logger.error(e) logger.error(e)
sys.exit(-1) sys.exit(-1)
logger.info("AM model: {}".format(self.config.am)) logger.debug("AM model: {}".format(self.config.am))
logger.info("Vocoder model: {}".format(self.config.voc)) logger.debug("Vocoder model: {}".format(self.config.voc))
logger.info("Language: {}".format(lang)) logger.debug("Language: {}".format(lang))
logger.info("tts engine type: python") logger.info("tts engine type: python")
logger.info("audio duration: {}".format(duration)) logger.info("audio duration: {}".format(duration))
logger.info("frontend inference time: {}".format(self.frontend_time)) logger.debug("frontend inference time: {}".format(self.frontend_time))
logger.info("AM inference time: {}".format(self.am_time)) logger.debug("AM inference time: {}".format(self.am_time))
logger.info("Vocoder inference time: {}".format(self.voc_time)) logger.debug("Vocoder inference time: {}".format(self.voc_time))
logger.info("total inference time: {}".format(infer_time)) logger.info("total inference time: {}".format(infer_time))
logger.info( logger.info(
"postprocess (change speed, volume, target sample rate) time: {}". "postprocess (change speed, volume, target sample rate) time: {}".
@ -503,6 +516,6 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
logger.info("total generate audio time: {}".format(infer_time + logger.info("total generate audio time: {}".format(infer_time +
postprocess_time)) postprocess_time))
logger.info("RTF: {}".format(rtf)) logger.info("RTF: {}".format(rtf))
logger.info("device: {}".format(self.tts_engine.device)) logger.debug("device: {}".format(self.tts_engine.device))
return lang, target_sample_rate, duration, wav_base64 return lang, target_sample_rate, duration, wav_base64
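
Note on the post-processing context shown in this file: the target sample rate is clamped to the model's rate, and resampling happens only in the `else` branch. The sketch below isolates that decision as a pure function so the three cases are easy to see. `choose_target_fs` is a hypothetical helper, not part of PaddleSpeech, and it leaves out the actual `librosa.resample` call.

```python
from typing import Tuple


def choose_target_fs(target_fs: int, original_fs: int) -> Tuple[int, bool]:
    """Return (effective sample rate, whether the resample branch is taken)."""
    # 0 means "not requested"; a rate above the model's rate is not honoured either.
    if target_fs == 0 or target_fs > original_fs:
        return original_fs, False
    # Otherwise the diff's else branch runs and the audio is resampled.
    return target_fs, True
```

For a 24000 Hz model, a request of 0 or 48000 keeps 24000 Hz without resampling, while a request of 16000 takes the resample branch.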

@ -105,7 +105,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
tts_engine (TTSEngine): The TTS engine tts_engine (TTSEngine): The TTS engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleTTSConnectionHandler to process the tts request") "Create PaddleTTSConnectionHandler to process the tts request")
self.tts_engine = tts_engine self.tts_engine = tts_engine
@ -143,23 +143,23 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
if target_fs == 0 or target_fs > original_fs: if target_fs == 0 or target_fs > original_fs:
target_fs = original_fs target_fs = original_fs
wav_tar_fs = wav wav_tar_fs = wav
logger.info( logger.debug(
"The sample rate of synthesized audio is the same as model, which is {}Hz". "The sample rate of synthesized audio is the same as model, which is {}Hz".
format(original_fs)) format(original_fs))
else: else:
wav_tar_fs = librosa.resample( wav_tar_fs = librosa.resample(
np.squeeze(wav), original_fs, target_fs) np.squeeze(wav), original_fs, target_fs)
logger.info( logger.debug(
"The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.". "The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.".
format(original_fs, target_fs)) format(original_fs, target_fs))
# transform volume # transform volume
wav_vol = wav_tar_fs * volume wav_vol = wav_tar_fs * volume
logger.info("Transform the volume of the audio successfully.") logger.debug("Transform the volume of the audio successfully.")
# transform speed # transform speed
try: # windows not support soxbindings try: # windows not support soxbindings
wav_speed = change_speed(wav_vol, speed, target_fs) wav_speed = change_speed(wav_vol, speed, target_fs)
logger.info("Transform the speed of the audio successfully.") logger.debug("Transform the speed of the audio successfully.")
except ServerBaseException: except ServerBaseException:
raise ServerBaseException( raise ServerBaseException(
ErrorCode.SERVER_INTERNAL_ERR, ErrorCode.SERVER_INTERNAL_ERR,
@ -176,7 +176,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
wavfile.write(buf, target_fs, wav_speed) wavfile.write(buf, target_fs, wav_speed)
base64_bytes = base64.b64encode(buf.read()) base64_bytes = base64.b64encode(buf.read())
wav_base64 = base64_bytes.decode('utf-8') wav_base64 = base64_bytes.decode('utf-8')
logger.info("Audio to string successfully.") logger.debug("Audio to string successfully.")
# save audio # save audio
if audio_path is not None: if audio_path is not None:
@ -264,15 +264,15 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
logger.error(e) logger.error(e)
sys.exit(-1) sys.exit(-1)
logger.info("AM model: {}".format(self.config.am)) logger.debug("AM model: {}".format(self.config.am))
logger.info("Vocoder model: {}".format(self.config.voc)) logger.debug("Vocoder model: {}".format(self.config.voc))
logger.info("Language: {}".format(lang)) logger.debug("Language: {}".format(lang))
logger.info("tts engine type: python") logger.info("tts engine type: python")
logger.info("audio duration: {}".format(duration)) logger.info("audio duration: {}".format(duration))
logger.info("frontend inference time: {}".format(self.frontend_time)) logger.debug("frontend inference time: {}".format(self.frontend_time))
logger.info("AM inference time: {}".format(self.am_time)) logger.debug("AM inference time: {}".format(self.am_time))
logger.info("Vocoder inference time: {}".format(self.voc_time)) logger.debug("Vocoder inference time: {}".format(self.voc_time))
logger.info("total inference time: {}".format(infer_time)) logger.info("total inference time: {}".format(infer_time))
logger.info( logger.info(
"postprocess (change speed, volume, target sample rate) time: {}". "postprocess (change speed, volume, target sample rate) time: {}".
@ -280,6 +280,6 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
logger.info("total generate audio time: {}".format(infer_time + logger.info("total generate audio time: {}".format(infer_time +
postprocess_time)) postprocess_time))
logger.info("RTF: {}".format(rtf)) logger.info("RTF: {}".format(rtf))
logger.info("device: {}".format(self.tts_engine.device)) logger.debug("device: {}".format(self.tts_engine.device))
return lang, target_sample_rate, duration, wav_base64 return lang, target_sample_rate, duration, wav_base64

@ -33,7 +33,7 @@ class PaddleVectorConnectionHandler:
vector_engine (VectorEngine): The Vector engine vector_engine (VectorEngine): The Vector engine
""" """
super().__init__() super().__init__()
logger.info( logger.debug(
"Create PaddleVectorConnectionHandler to process the vector request") "Create PaddleVectorConnectionHandler to process the vector request")
self.vector_engine = vector_engine self.vector_engine = vector_engine
self.executor = self.vector_engine.executor self.executor = self.vector_engine.executor
@ -54,7 +54,7 @@ class PaddleVectorConnectionHandler:
Returns: Returns:
str: the punctuation text str: the punctuation text
""" """
logger.info( logger.debug(
f"start to extract the do vector {self.task} from the http request") f"start to extract the do vector {self.task} from the http request")
if self.task == "spk" and task == "spk": if self.task == "spk" and task == "spk":
embedding = self.extract_audio_embedding(audio_data) embedding = self.extract_audio_embedding(audio_data)
@ -81,17 +81,17 @@ class PaddleVectorConnectionHandler:
Returns: Returns:
float: the score between enroll and test audio float: the score between enroll and test audio
""" """
logger.info("start to extract the enroll audio embedding") logger.debug("start to extract the enroll audio embedding")
enroll_emb = self.extract_audio_embedding(enroll_audio) enroll_emb = self.extract_audio_embedding(enroll_audio)
logger.info("start to extract the test audio embedding") logger.debug("start to extract the test audio embedding")
test_emb = self.extract_audio_embedding(test_audio) test_emb = self.extract_audio_embedding(test_audio)
logger.info( logger.debug(
"start to get the score between the enroll and test embedding") "start to get the score between the enroll and test embedding")
score = self.executor.get_embeddings_score(enroll_emb, test_emb) score = self.executor.get_embeddings_score(enroll_emb, test_emb)
logger.info(f"get the enroll vs test score: {score}") logger.debug(f"get the enroll vs test score: {score}")
return score return score
@paddle.no_grad() @paddle.no_grad()
@ -106,11 +106,12 @@ class PaddleVectorConnectionHandler:
# because the soundfile will change the io.BytesIO(audio) to the end # because the soundfile will change the io.BytesIO(audio) to the end
# thus we should convert the base64 string to io.BytesIO when we need the audio data # thus we should convert the base64 string to io.BytesIO when we need the audio data
if not self.executor._check(io.BytesIO(audio), sample_rate): if not self.executor._check(io.BytesIO(audio), sample_rate):
logger.info("check the audio sample rate occurs error") logger.debug("check the audio sample rate occurs error")
return np.array([0.0]) return np.array([0.0])
waveform, sr = load_audio(io.BytesIO(audio)) waveform, sr = load_audio(io.BytesIO(audio))
logger.info(f"load the audio sample points, shape is: {waveform.shape}") logger.debug(
f"load the audio sample points, shape is: {waveform.shape}")
# stage 2: get the audio feat # stage 2: get the audio feat
# Note: Now we only support fbank feature # Note: Now we only support fbank feature
@ -121,9 +122,9 @@ class PaddleVectorConnectionHandler:
n_mels=self.config.n_mels, n_mels=self.config.n_mels,
window_size=self.config.window_size, window_size=self.config.window_size,
hop_length=self.config.hop_size) hop_length=self.config.hop_size)
logger.info(f"extract the audio feats, shape is: {feats.shape}") logger.debug(f"extract the audio feats, shape is: {feats.shape}")
except Exception as e: except Exception as e:
logger.info(f"feats occurs exception {e}") logger.error(f"feats occurs exception {e}")
sys.exit(-1) sys.exit(-1)
feats = paddle.to_tensor(feats).unsqueeze(0) feats = paddle.to_tensor(feats).unsqueeze(0)
@ -159,7 +160,7 @@ class VectorEngine(BaseEngine):
"""The Vector Engine """The Vector Engine
""" """
super(VectorEngine, self).__init__() super(VectorEngine, self).__init__()
logger.info("Create the VectorEngine Instance") logger.debug("Create the VectorEngine Instance")
def init(self, config: dict): def init(self, config: dict):
"""Init the Vector Engine """Init the Vector Engine
@ -170,7 +171,7 @@ class VectorEngine(BaseEngine):
Returns: Returns:
bool: The engine instance flag bool: The engine instance flag
""" """
logger.info("Init the vector engine") logger.debug("Init the vector engine")
try: try:
self.config = config self.config = config
if self.config.device: if self.config.device:
@ -179,7 +180,7 @@ class VectorEngine(BaseEngine):
self.device = paddle.get_device() self.device = paddle.get_device()
paddle.set_device(self.device) paddle.set_device(self.device)
logger.info(f"Vector Engine set the device: {self.device}") logger.debug(f"Vector Engine set the device: {self.device}")
except BaseException as e: except BaseException as e:
logger.error( logger.error(
"Set device failed, please check if device is already used and the parameter 'device' in the yaml file" "Set device failed, please check if device is already used and the parameter 'device' in the yaml file"
@ -196,5 +197,7 @@ class VectorEngine(BaseEngine):
ckpt_path=config.ckpt_path, ckpt_path=config.ckpt_path,
task=config.task) task=config.task)
logger.info("Init the Vector engine successfully") logger.info(
"Initialize Vector server engine successfully on device: %s." %
(self.device))
return True return True

@ -138,7 +138,7 @@ class ASRWsAudioHandler:
Returns: Returns:
str: the final asr result str: the final asr result
""" """
logging.info("send a message to the server") logging.debug("send a message to the server")
if self.url is None: if self.url is None:
logger.error("No asr server, please input valid ip and port") logger.error("No asr server, please input valid ip and port")
@ -160,7 +160,7 @@ class ASRWsAudioHandler:
separators=(',', ': ')) separators=(',', ': '))
await ws.send(audio_info) await ws.send(audio_info)
msg = await ws.recv() msg = await ws.recv()
logger.info("client receive msg={}".format(msg)) logger.debug("client receive msg={}".format(msg))
# 3. send chunk audio data to engine # 3. send chunk audio data to engine
for chunk_data in self.read_wave(wavfile_path): for chunk_data in self.read_wave(wavfile_path):
@ -170,7 +170,7 @@ class ASRWsAudioHandler:
if self.punc_server and len(msg["result"]) > 0: if self.punc_server and len(msg["result"]) > 0:
msg["result"] = self.punc_server.run(msg["result"]) msg["result"] = self.punc_server.run(msg["result"])
logger.info("client receive msg={}".format(msg)) logger.debug("client receive msg={}".format(msg))
# 4. we must send finished signal to the server # 4. we must send finished signal to the server
audio_info = json.dumps( audio_info = json.dumps(
@ -310,7 +310,7 @@ class TTSWsHandler:
start_request = json.dumps({"task": "tts", "signal": "start"}) start_request = json.dumps({"task": "tts", "signal": "start"})
await ws.send(start_request) await ws.send(start_request)
msg = await ws.recv() msg = await ws.recv()
logger.info(f"client receive msg={msg}") logger.debug(f"client receive msg={msg}")
msg = json.loads(msg) msg = json.loads(msg)
session = msg["session"] session = msg["session"]
@ -319,7 +319,7 @@ class TTSWsHandler:
request = json.dumps({"text": text_base64}) request = json.dumps({"text": text_base64})
st = time.time() st = time.time()
await ws.send(request) await ws.send(request)
logging.info("send a message to the server") logging.debug("send a message to the server")
# 4. Process the received response # 4. Process the received response
message = await ws.recv() message = await ws.recv()
@ -543,7 +543,6 @@ class VectorHttpHandler:
"sample_rate": sample_rate, "sample_rate": sample_rate,
} }
logger.info(self.url)
res = requests.post(url=self.url, data=json.dumps(data)) res = requests.post(url=self.url, data=json.dumps(data))
return res.json() return res.json()

@ -169,7 +169,7 @@ def save_audio(bytes_data, audio_path, sample_rate: int=24000) -> bool:
sample_rate=sample_rate) sample_rate=sample_rate)
os.remove("./tmp.pcm") os.remove("./tmp.pcm")
else: else:
print("Only supports saved audio format is pcm or wav") logger.error("Only supports saved audio format is pcm or wav")
return False return False
return True return True

@ -1,59 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import functools
import logging
__all__ = [
'logger',
]
class Logger(object):
def __init__(self, name: str=None):
name = 'PaddleSpeech' if not name else name
self.logger = logging.getLogger(name)
log_config = {
'DEBUG': 10,
'INFO': 20,
'TRAIN': 21,
'EVAL': 22,
'WARNING': 30,
'ERROR': 40,
'CRITICAL': 50,
'EXCEPTION': 100,
}
for key, level in log_config.items():
logging.addLevelName(level, key)
if key == 'EXCEPTION':
self.__dict__[key.lower()] = self.logger.exception
else:
self.__dict__[key.lower()] = functools.partial(self.__call__,
level)
self.format = logging.Formatter(
fmt='[%(asctime)-15s] [%(levelname)8s] - %(message)s')
self.handler = logging.StreamHandler()
self.handler.setFormatter(self.format)
self.logger.addHandler(self.handler)
self.logger.setLevel(logging.DEBUG)
self.logger.propagate = False
def __call__(self, log_level: str, msg: str):
self.logger.log(log_level, msg)
logger = Logger()
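
Review note: the module removed above registered named levels and exposed one callable per level, which is why most of the hunks in this commit only swap `logger.info` for `logger.debug`. A minimal standard-library sketch of that pattern follows. The class name `DemoLogger`, the logger name `"demo"`, the in-memory stream, and the INFO threshold are assumptions chosen for illustration (the removed class set its threshold to DEBUG); this is not the project's implementation.

```python
import functools
import io
import logging


class DemoLogger:
    """Hypothetical stand-in for the removed Logger class."""

    LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}

    def __init__(self, name="demo", stream=None, level=logging.INFO):
        self.logger = logging.getLogger(name)
        for key, value in self.LEVELS.items():
            logging.addLevelName(value, key)
            # Expose demo_logger.debug(...), demo_logger.info(...), and so on.
            setattr(self, key.lower(), functools.partial(self._log, value))
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("[%(levelname)8s] - %(message)s"))
        self.logger.handlers = [handler]
        self.logger.setLevel(level)
        # Keep records out of the root logger to avoid duplicate output.
        self.logger.propagate = False

    def _log(self, level, msg):
        self.logger.log(level, msg)


buffer = io.StringIO()
demo_logger = DemoLogger(stream=buffer, level=logging.INFO)
demo_logger.debug("load the audio sample points")  # dropped at INFO threshold
demo_logger.info("engine initialized")             # written to the stream
captured = buffer.getvalue()
```

With an INFO threshold, messages demoted to `debug` no longer reach the handler, which is the practical effect of the level changes in this commit when the server runs at a less verbose level.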

@ -16,11 +16,11 @@ from typing import Optional
import onnxruntime as ort import onnxruntime as ort
from .log import logger from paddlespeech.cli.log import logger
def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None): def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None):
logger.info(f"ort sessconf: {sess_conf}") logger.debug(f"ort sessconf: {sess_conf}")
sess_options = ort.SessionOptions() sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
if sess_conf.get('graph_optimization_level', 99) == 0: if sess_conf.get('graph_optimization_level', 99) == 0:
@ -34,7 +34,7 @@ def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None):
# fastspeech2/mb_melgan can't use trt now! # fastspeech2/mb_melgan can't use trt now!
if sess_conf.get("use_trt", 0): if sess_conf.get("use_trt", 0):
providers = ['TensorrtExecutionProvider'] providers = ['TensorrtExecutionProvider']
logger.info(f"ort providers: {providers}") logger.debug(f"ort providers: {providers}")
if 'cpu_threads' in sess_conf: if 'cpu_threads' in sess_conf:
sess_options.intra_op_num_threads = sess_conf.get("cpu_threads", 0) sess_options.intra_op_num_threads = sess_conf.get("cpu_threads", 0)

@ -13,6 +13,8 @@
import base64 import base64
import math import math
from paddlespeech.cli.log import logger
def wav2base64(wav_file: str): def wav2base64(wav_file: str):
""" """
@ -61,7 +63,7 @@ def get_chunks(data, block_size, pad_size, step):
elif step == "voc": elif step == "voc":
data_len = data.shape[0] data_len = data.shape[0]
else: else:
print("Please set correct type to get chunks, am or voc") logger.error("Please set correct type to get chunks, am or voc")
chunks = [] chunks = []
n = math.ceil(data_len / block_size) n = math.ceil(data_len / block_size)
@ -73,7 +75,7 @@ def get_chunks(data, block_size, pad_size, step):
elif step == "voc": elif step == "voc":
chunks.append(data[start:end, :]) chunks.append(data[start:end, :])
else: else:
print("Please set correct type to get chunks, am or voc") logger.error("Please set correct type to get chunks, am or voc")
return chunks return chunks
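
Review note: only part of `get_chunks` is visible in this hunk, so the following is a sketch under stated assumptions rather than the function itself. It assumes each chunk covers `block_size` frames plus up to `pad_size` frames of context on either side, and it works on a plain Python list instead of an array. The name `split_with_padding` is hypothetical. It also raises on an unknown `step`, because in the visible code the `else` branch only logs an error, after which `data_len` appears to be unset when `math.ceil(data_len / block_size)` runs.

```python
import math


def split_with_padding(frames, block_size, pad_size, step="voc"):
    """Split a sequence into overlapping chunks (hypothetical helper)."""
    if step not in ("am", "voc"):
        # Fail early instead of continuing with an undefined length.
        raise ValueError("step must be 'am' or 'voc'")
    data_len = len(frames)
    chunks = []
    n = math.ceil(data_len / block_size)
    for i in range(n):
        # Extend each block by pad_size on both sides, clipped to the data.
        start = max(0, i * block_size - pad_size)
        end = min((i + 1) * block_size + pad_size, data_len)
        chunks.append(frames[start:end])
    return chunks


example_chunks = split_with_padding(list(range(10)), block_size=4, pad_size=1)
```

Ten frames with `block_size=4` give three chunks; the middle one carries one frame of context on each side.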

@ -141,71 +141,133 @@ class FastSpeech2(nn.Layer):
init_dec_alpha: float=1.0, ): init_dec_alpha: float=1.0, ):
"""Initialize FastSpeech2 module. """Initialize FastSpeech2 module.
Args: Args:
idim (int): Dimension of the inputs. idim (int):
odim (int): Dimension of the outputs. Dimension of the inputs.
adim (int): Attention dimension. odim (int):
aheads (int): Number of attention heads. Dimension of the outputs.
elayers (int): Number of encoder layers. adim (int):
eunits (int): Number of encoder hidden units. Attention dimension.
dlayers (int): Number of decoder layers. aheads (int):
dunits (int): Number of decoder hidden units. Number of attention heads.
postnet_layers (int): Number of postnet layers. elayers (int):
postnet_chans (int): Number of postnet channels. Number of encoder layers.
postnet_filts (int): Kernel size of postnet. eunits (int):
postnet_dropout_rate (float): Dropout rate in postnet. Number of encoder hidden units.
use_scaled_pos_enc (bool): Whether to use trainable scaled pos encoding. dlayers (int):
use_batch_norm (bool): Whether to use batch normalization in encoder prenet. Number of decoder layers.
encoder_normalize_before (bool): Whether to apply layernorm layer before encoder block. dunits (int):
decoder_normalize_before (bool): Whether to apply layernorm layer before decoder block. Number of decoder hidden units.
encoder_concat_after (bool): Whether to concatenate attention layer's input and output in encoder. postnet_layers (int):
decoder_concat_after (bool): Whether to concatenate attention layer's input and output in decoder. Number of postnet layers.
reduction_factor (int): Reduction factor. postnet_chans (int):
encoder_type (str): Encoder type ("transformer" or "conformer"). Number of postnet channels.
decoder_type (str): Decoder type ("transformer" or "conformer"). postnet_filts (int):
transformer_enc_dropout_rate (float): Dropout rate in encoder except attention and positional encoding. Kernel size of postnet.
transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding. postnet_dropout_rate (float):
transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module. Dropout rate in postnet.
transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding. use_scaled_pos_enc (bool):
transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding. Whether to use trainable scaled pos encoding.
transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module. use_batch_norm (bool):
conformer_pos_enc_layer_type (str): Pos encoding layer type in conformer. Whether to use batch normalization in encoder prenet.
conformer_self_attn_layer_type (str): Self-attention layer type in conformer encoder_normalize_before (bool):
conformer_activation_type (str): Activation function type in conformer. Whether to apply layernorm layer before encoder block.
use_macaron_style_in_conformer (bool): Whether to use macaron style FFN. decoder_normalize_before (bool):
use_cnn_in_conformer (bool): Whether to use CNN in conformer. Whether to apply layernorm layer before decoder block.
zero_triu (bool): Whether to use zero triu in relative self-attention module. encoder_concat_after (bool):
conformer_enc_kernel_size (int): Kernel size of encoder conformer. Whether to concatenate attention layer's input and output in encoder.
conformer_dec_kernel_size (int): Kernel size of decoder conformer. decoder_concat_after (bool):
duration_predictor_layers (int): Number of duration predictor layers. Whether to concatenate attention layer's input and output in decoder.
duration_predictor_chans (int): Number of duration predictor channels. reduction_factor (int):
duration_predictor_kernel_size (int): Kernel size of duration predictor. Reduction factor.
duration_predictor_dropout_rate (float): Dropout rate in duration predictor. encoder_type (str):
pitch_predictor_layers (int): Number of pitch predictor layers. Encoder type ("transformer" or "conformer").
pitch_predictor_chans (int): Number of pitch predictor channels. decoder_type (str):
pitch_predictor_kernel_size (int): Kernel size of pitch predictor. Decoder type ("transformer" or "conformer").
pitch_predictor_dropout_rate (float): Dropout rate in pitch predictor. transformer_enc_dropout_rate (float):
pitch_embed_kernel_size (float): Kernel size of pitch embedding. Dropout rate in encoder except attention and positional encoding.
pitch_embed_dropout_rate (float): Dropout rate for pitch embedding. transformer_enc_positional_dropout_rate (float):
stop_gradient_from_pitch_predictor (bool): Whether to stop gradient from pitch predictor to encoder. Dropout rate after encoder positional encoding.
energy_predictor_layers (int): Number of energy predictor layers. transformer_enc_attn_dropout_rate (float):
energy_predictor_chans (int): Number of energy predictor channels. Dropout rate in encoder self-attention module.
energy_predictor_kernel_size (int): Kernel size of energy predictor. transformer_dec_dropout_rate (float):
energy_predictor_dropout_rate (float): Dropout rate in energy predictor. Dropout rate in decoder except attention & positional encoding.
energy_embed_kernel_size (float): Kernel size of energy embedding. transformer_dec_positional_dropout_rate (float):
energy_embed_dropout_rate (float): Dropout rate for energy embedding. Dropout rate after decoder positional encoding.
stop_gradient_from_energy_predictor (bool): Whether to stop gradient from energy predictor to encoder. transformer_dec_attn_dropout_rate (float):
spk_num (Optional[int]): Number of speakers. If not None, assume that the spk_embed_dim is not None, Dropout rate in decoder self-attention module.
conformer_pos_enc_layer_type (str):
Pos encoding layer type in conformer.
conformer_self_attn_layer_type (str):
Self-attention layer type in conformer
conformer_activation_type (str):
Activation function type in conformer.
use_macaron_style_in_conformer (bool):
Whether to use macaron style FFN.
use_cnn_in_conformer (bool):
Whether to use CNN in conformer.
zero_triu (bool):
Whether to use zero triu in relative self-attention module.
conformer_enc_kernel_size (int):
Kernel size of encoder conformer.
conformer_dec_kernel_size (int):
Kernel size of decoder conformer.
duration_predictor_layers (int):
Number of duration predictor layers.
duration_predictor_chans (int):
Number of duration predictor channels.
duration_predictor_kernel_size (int):
Kernel size of duration predictor.
duration_predictor_dropout_rate (float):
Dropout rate in duration predictor.
pitch_predictor_layers (int):
Number of pitch predictor layers.
pitch_predictor_chans (int):
Number of pitch predictor channels.
pitch_predictor_kernel_size (int):
Kernel size of pitch predictor.
pitch_predictor_dropout_rate (float):
Dropout rate in pitch predictor.
pitch_embed_kernel_size (float):
Kernel size of pitch embedding.
pitch_embed_dropout_rate (float):
Dropout rate for pitch embedding.
stop_gradient_from_pitch_predictor (bool):
Whether to stop gradient from pitch predictor to encoder.
energy_predictor_layers (int):
Number of energy predictor layers.
energy_predictor_chans (int):
Number of energy predictor channels.
energy_predictor_kernel_size (int):
Kernel size of energy predictor.
energy_predictor_dropout_rate (float):
Dropout rate in energy predictor.
energy_embed_kernel_size (float):
Kernel size of energy embedding.
energy_embed_dropout_rate (float):
Dropout rate for energy embedding.
stop_gradient_from_energy_predictor (bool):
Whether to stop gradient from energy predictor to encoder.
spk_num (Optional[int]):
Number of speakers. If not None, assume that the spk_embed_dim is not None,
spk_ids will be provided as the input and use spk_embedding_table. spk_ids will be provided as the input and use spk_embedding_table.
spk_embed_dim (Optional[int]): Speaker embedding dimension. If not None, spk_embed_dim (Optional[int]):
Speaker embedding dimension. If not None,
assume that spk_emb will be provided as the input or spk_num is not None. assume that spk_emb will be provided as the input or spk_num is not None.
spk_embed_integration_type (str): How to integrate speaker embedding. spk_embed_integration_type (str):
tone_num (Optional[int]): Number of tones. If not None, assume that the How to integrate speaker embedding.
tone_num (Optional[int]):
Number of tones. If not None, assume that the
tone_ids will be provided as the input and use tone_embedding_table. tone_ids will be provided as the input and use tone_embedding_table.
tone_embed_dim (Optional[int]): Tone embedding dimension. If not None, assume that tone_num is not None. tone_embed_dim (Optional[int]):
tone_embed_integration_type (str): How to integrate tone embedding. Tone embedding dimension. If not None, assume that tone_num is not None.
init_type (str): How to initialize transformer parameters. tone_embed_integration_type (str):
init_enc_alpha (float): Initial value of alpha in scaled pos encoding of the encoder. How to integrate tone embedding.
init_dec_alpha (float): Initial value of alpha in scaled pos encoding of the decoder. init_type (str):
How to initialize transformer parameters.
init_enc_alpha (float):
Initial value of alpha in scaled pos encoding of the encoder.
init_dec_alpha (float):
Initial value of alpha in scaled pos encoding of the decoder.
""" """
assert check_argument_types() assert check_argument_types()
@ -258,7 +320,6 @@ class FastSpeech2(nn.Layer):
padding_idx=self.padding_idx) padding_idx=self.padding_idx)
if encoder_type == "transformer": if encoder_type == "transformer":
print("encoder_type is transformer")
self.encoder = TransformerEncoder( self.encoder = TransformerEncoder(
idim=idim, idim=idim,
attention_dim=adim, attention_dim=adim,
@ -275,7 +336,6 @@ class FastSpeech2(nn.Layer):
positionwise_layer_type=positionwise_layer_type, positionwise_layer_type=positionwise_layer_type,
positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) positionwise_conv_kernel_size=positionwise_conv_kernel_size, )
elif encoder_type == "conformer": elif encoder_type == "conformer":
print("encoder_type is conformer")
self.encoder = ConformerEncoder( self.encoder = ConformerEncoder(
idim=idim, idim=idim,
attention_dim=adim, attention_dim=adim,
@ -362,7 +422,6 @@ class FastSpeech2(nn.Layer):
# NOTE: we use encoder as decoder # NOTE: we use encoder as decoder
# because fastspeech's decoder is the same as encoder # because fastspeech's decoder is the same as encoder
if decoder_type == "transformer": if decoder_type == "transformer":
print("decoder_type is transformer")
self.decoder = TransformerEncoder( self.decoder = TransformerEncoder(
idim=0, idim=0,
attention_dim=adim, attention_dim=adim,
@ -380,7 +439,6 @@ class FastSpeech2(nn.Layer):
positionwise_layer_type=positionwise_layer_type, positionwise_layer_type=positionwise_layer_type,
positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) positionwise_conv_kernel_size=positionwise_conv_kernel_size, )
elif decoder_type == "conformer": elif decoder_type == "conformer":
print("decoder_type is conformer")
self.decoder = ConformerEncoder( self.decoder = ConformerEncoder(
idim=0, idim=0,
attention_dim=adim, attention_dim=adim,
@ -453,20 +511,29 @@ class FastSpeech2(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
text(Tensor(int64)): Batch of padded token ids (B, Tmax). text(Tensor(int64)):
text_lengths(Tensor(int64)): Batch of lengths of each input (B,). Batch of padded token ids (B, Tmax).
speech(Tensor): Batch of padded target features (B, Lmax, odim). text_lengths(Tensor(int64)):
speech_lengths(Tensor(int64)): Batch of the lengths of each target (B,). Batch of lengths of each input (B,).
durations(Tensor(int64)): Batch of padded durations (B, Tmax). speech(Tensor):
pitch(Tensor): Batch of padded token-averaged pitch (B, Tmax, 1). Batch of padded target features (B, Lmax, odim).
energy(Tensor): Batch of padded token-averaged energy (B, Tmax, 1). speech_lengths(Tensor(int64)):
tone_id(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). Batch of the lengths of each target (B,).
spk_emb(Tensor, optional): Batch of speaker embeddings (B, spk_embed_dim). durations(Tensor(int64)):
spk_id(Tensor, optional(int64)): Batch of speaker ids (B,) Batch of padded durations (B, Tmax).
pitch(Tensor):
Batch of padded token-averaged pitch (B, Tmax, 1).
energy(Tensor):
Batch of padded token-averaged energy (B, Tmax, 1).
tone_id(Tensor, optional(int64)):
Batch of padded tone ids (B, Tmax).
spk_emb(Tensor, optional):
Batch of speaker embeddings (B, spk_embed_dim).
spk_id(Tensor, optional(int64)):
Batch of speaker ids (B,)
Returns: Returns:
""" """
# input of embedding must be int64 # input of embedding must be int64
@ -662,20 +729,28 @@ class FastSpeech2(nn.Layer):
"""Generate the sequence of features given the sequences of characters. """Generate the sequence of features given the sequences of characters.
Args: Args:
text(Tensor(int64)): Input sequence of characters (T,). text(Tensor(int64)):
durations(Tensor, optional (int64)): Groundtruth of duration (T,). Input sequence of characters (T,).
pitch(Tensor, optional): Groundtruth of token-averaged pitch (T, 1). durations(Tensor, optional (int64)):
energy(Tensor, optional): Groundtruth of token-averaged energy (T, 1). Groundtruth of duration (T,).
alpha(float, optional): Alpha to control the speed. pitch(Tensor, optional):
use_teacher_forcing(bool, optional): Whether to use teacher forcing. Groundtruth of token-averaged pitch (T, 1).
energy(Tensor, optional):
Groundtruth of token-averaged energy (T, 1).
alpha(float, optional):
Alpha to control the speed.
use_teacher_forcing(bool, optional):
Whether to use teacher forcing.
If true, groundtruth of duration, pitch and energy will be used. If true, groundtruth of duration, pitch and energy will be used.
spk_emb(Tensor, optional, optional): Speaker embedding vector (spk_embed_dim,). (Default value = None) spk_emb(Tensor, optional, optional):
spk_id(Tensor, optional(int64), optional): spk ids (1,). (Default value = None) Speaker embedding vector (spk_embed_dim,). (Default value = None)
tone_id(Tensor, optional(int64), optional): tone ids (T,). (Default value = None) spk_id(Tensor, optional(int64), optional):
spk ids (1,). (Default value = None)
tone_id(Tensor, optional(int64), optional):
tone ids (T,). (Default value = None)
Returns: Returns:
""" """
# input of embedding must be int64 # input of embedding must be int64
x = paddle.cast(text, 'int64') x = paddle.cast(text, 'int64')
@ -724,8 +799,10 @@ class FastSpeech2(nn.Layer):
"""Integrate speaker embedding with hidden states. """Integrate speaker embedding with hidden states.
Args: Args:
hs(Tensor): Batch of hidden state sequences (B, Tmax, adim). hs(Tensor):
spk_emb(Tensor): Batch of speaker embeddings (B, spk_embed_dim). Batch of hidden state sequences (B, Tmax, adim).
spk_emb(Tensor):
Batch of speaker embeddings (B, spk_embed_dim).
Returns: Returns:
@ -749,8 +826,10 @@ class FastSpeech2(nn.Layer):
"""Integrate tone embedding with hidden states. """Integrate tone embedding with hidden states.
Args: Args:
hs(Tensor): Batch of hidden state sequences (B, Tmax, adim). hs(Tensor):
tone_embs(Tensor): Batch of tone embeddings (B, Tmax, tone_embed_dim). Batch of hidden state sequences (B, Tmax, adim).
tone_embs(Tensor):
Batch of tone embeddings (B, Tmax, tone_embed_dim).
Returns: Returns:
@ -773,10 +852,12 @@ class FastSpeech2(nn.Layer):
"""Make masks for self-attention. """Make masks for self-attention.
Args: Args:
ilens(Tensor): Batch of lengths (B,). ilens(Tensor):
Batch of lengths (B,).
Returns: Returns:
Tensor: Mask tensor for self-attention. dtype=paddle.bool Tensor:
Mask tensor for self-attention. dtype=paddle.bool
Examples: Examples:
>>> ilens = [5, 3] >>> ilens = [5, 3]
@ -858,19 +939,32 @@ class StyleFastSpeech2Inference(FastSpeech2Inference):
""" """
Args: Args:
text(Tensor(int64)): Input sequence of characters (T,). text(Tensor(int64)):
durations(paddle.Tensor/np.ndarray, optional (int64)): Groundtruth of duration (T,), this will overwrite the set of durations_scale and durations_bias Input sequence of characters (T,).
durations(paddle.Tensor/np.ndarray, optional (int64)):
Groundtruth of duration (T,), this will overwrite the set of durations_scale and durations_bias
durations_scale(int/float, optional): durations_scale(int/float, optional):
durations_bias(int/float, optional): durations_bias(int/float, optional):
pitch(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged pitch (T, 1), this will overwrite the set of pitch_scale and pitch_bias
pitch_scale(int/float, optional): In denormed HZ domain. pitch(paddle.Tensor/np.ndarray, optional):
pitch_bias(int/float, optional): In denormed HZ domain. Groundtruth of token-averaged pitch (T, 1), this will overwrite the set of pitch_scale and pitch_bias
energy(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged energy (T, 1), this will overwrite the set of energy_scale and energy_bias pitch_scale(int/float, optional):
energy_scale(int/float, optional): In denormed domain. In denormed HZ domain.
energy_bias(int/float, optional): In denormed domain. pitch_bias(int/float, optional):
robot: bool: (Default value = False) In denormed HZ domain.
spk_emb: (Default value = None) energy(paddle.Tensor/np.ndarray, optional):
spk_id: (Default value = None) Groundtruth of token-averaged energy (T, 1), this will overwrite the set of energy_scale and energy_bias
energy_scale(int/float, optional):
In denormed domain.
energy_bias(int/float, optional):
In denormed domain.
robot(bool) (Default value = False):
spk_emb(Default value = None):
spk_id(Default value = None):
Returns: Returns:
Tensor: logmel Tensor: logmel
@ -949,8 +1043,10 @@ class FastSpeech2Loss(nn.Layer):
use_weighted_masking: bool=False): use_weighted_masking: bool=False):
"""Initialize feed-forward Transformer loss module. """Initialize feed-forward Transformer loss module.
Args: Args:
use_masking (bool): Whether to apply masking for padded part in loss calculation. use_masking (bool):
use_weighted_masking (bool): Whether to apply weighted masking in loss calculation. Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool):
Whether to apply weighted masking in loss calculation.
""" """
assert check_argument_types() assert check_argument_types()
super().__init__() super().__init__()
@ -982,17 +1078,28 @@ class FastSpeech2Loss(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
after_outs(Tensor): Batch of outputs after postnets (B, Lmax, odim). after_outs(Tensor):
before_outs(Tensor): Batch of outputs before postnets (B, Lmax, odim). Batch of outputs after postnets (B, Lmax, odim).
d_outs(Tensor): Batch of outputs of duration predictor (B, Tmax). before_outs(Tensor):
p_outs(Tensor): Batch of outputs of pitch predictor (B, Tmax, 1). Batch of outputs before postnets (B, Lmax, odim).
e_outs(Tensor): Batch of outputs of energy predictor (B, Tmax, 1). d_outs(Tensor):
ys(Tensor): Batch of target features (B, Lmax, odim). Batch of outputs of duration predictor (B, Tmax).
ds(Tensor): Batch of durations (B, Tmax). p_outs(Tensor):
ps(Tensor): Batch of target token-averaged pitch (B, Tmax, 1). Batch of outputs of pitch predictor (B, Tmax, 1).
es(Tensor): Batch of target token-averaged energy (B, Tmax, 1). e_outs(Tensor):
ilens(Tensor): Batch of the lengths of each input (B,). Batch of outputs of energy predictor (B, Tmax, 1).
olens(Tensor): Batch of the lengths of each target (B,). ys(Tensor):
Batch of target features (B, Lmax, odim).
ds(Tensor):
Batch of durations (B, Tmax).
ps(Tensor):
Batch of target token-averaged pitch (B, Tmax, 1).
es(Tensor):
Batch of target token-averaged energy (B, Tmax, 1).
ilens(Tensor):
Batch of the lengths of each input (B,).
olens(Tensor):
Batch of the lengths of each target (B,).
Returns: Returns:
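
Review note: the docstrings above describe padded batches with per-item lengths (`ilens`, `olens`) and a `use_masking` option that excludes padded positions from the loss. The sketch below illustrates that idea in plain Python with nested lists. The helper names `make_non_pad_mask_list` and `masked_l1` are hypothetical, and the real module operates on Paddle tensors, so treat this as an illustration of the masking concept only.

```python
def make_non_pad_mask_list(lengths):
    """Return one row of booleans per item; True marks a real (unpadded) step."""
    max_len = max(lengths)
    return [[t < n for t in range(max_len)] for n in lengths]


def masked_l1(pred, target, lengths):
    """Mean absolute error over unpadded positions only."""
    mask = make_non_pad_mask_list(lengths)
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, keep in zip(p_row, t_row, m_row):
            if keep:
                total += abs(p - t)
                count += 1
    return total / count


mask_example = make_non_pad_mask_list([5, 3])
loss_example = masked_l1(
    pred=[[1.0, 2.0], [3.0, 9.0]],
    target=[[0.0, 2.0], [1.0, 0.0]],
    lengths=[2, 1], )
```

In `loss_example` the second item has length 1, so its padded second position (9.0 against 0.0) does not contribute to the average.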

@ -50,20 +50,34 @@ class HiFiGANGenerator(nn.Layer):
init_type: str="xavier_uniform", ): init_type: str="xavier_uniform", ):
"""Initialize HiFiGANGenerator module. """Initialize HiFiGANGenerator module.
Args: Args:
in_channels (int): Number of input channels. in_channels (int):
out_channels (int): Number of output channels. Number of input channels.
channels (int): Number of hidden representation channels. out_channels (int):
global_channels (int): Number of global conditioning channels. Number of output channels.
kernel_size (int): Kernel size of initial and final conv layer. channels (int):
upsample_scales (list): List of upsampling scales. Number of hidden representation channels.
upsample_kernel_sizes (list): List of kernel sizes for upsampling layers. global_channels (int):
resblock_kernel_sizes (list): List of kernel sizes for residual blocks. Number of global conditioning channels.
resblock_dilations (list): List of dilation lists for residual blocks. kernel_size (int):
use_additional_convs (bool): Whether to use additional conv layers in residual blocks. Kernel size of initial and final conv layer.
bias (bool): Whether to add bias parameter in convolution layers. upsample_scales (list):
nonlinear_activation (str): Activation function module name. List of upsampling scales.
nonlinear_activation_params (dict): Hyperparameters for activation function. upsample_kernel_sizes (list):
use_weight_norm (bool): Whether to use weight norm. List of kernel sizes for upsampling layers.
resblock_kernel_sizes (list):
List of kernel sizes for residual blocks.
resblock_dilations (list):
List of dilation lists for residual blocks.
use_additional_convs (bool):
Whether to use additional conv layers in residual blocks.
bias (bool):
Whether to add bias parameter in convolution layers.
nonlinear_activation (str):
Activation function module name.
nonlinear_activation_params (dict):
Hyperparameters for activation function.
use_weight_norm (bool):
Whether to use weight norm.
If set to true, it will be applied to all of the conv layers. If set to true, it will be applied to all of the conv layers.
""" """
super().__init__() super().__init__()
@ -199,9 +213,10 @@ class HiFiGANGenerator(nn.Layer):
def inference(self, c, g: Optional[paddle.Tensor]=None): def inference(self, c, g: Optional[paddle.Tensor]=None):
"""Perform inference. """Perform inference.
Args: Args:
c (Tensor): Input tensor (T, in_channels). c (Tensor):
normalize_before (bool): Whether to perform normalization. Input tensor (T, in_channels).
g (Optional[Tensor]): Global conditioning tensor (global_channels, 1). g (Optional[Tensor]):
Global conditioning tensor (global_channels, 1).
Returns: Returns:
Tensor: Tensor:
Output tensor (T * prod(upsample_scales), out_channels). Output tensor (T * prod(upsample_scales), out_channels).
@ -233,20 +248,33 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
"""Initialize HiFiGANPeriodDiscriminator module. """Initialize HiFiGANPeriodDiscriminator module.
Args: Args:
in_channels (int): Number of input channels. in_channels (int):
out_channels (int): Number of output channels. Number of input channels.
period (int): Period. out_channels (int):
kernel_sizes (list): Kernel sizes of initial conv layers and the final conv layer. Number of output channels.
channels (int): Number of initial channels. period (int):
downsample_scales (list): List of downsampling scales. Period.
max_downsample_channels (int): Number of maximum downsampling channels. kernel_sizes (list):
use_additional_convs (bool): Whether to use additional conv layers in residual blocks. Kernel sizes of initial conv layers and the final conv layer.
bias (bool): Whether to add bias parameter in convolution layers. channels (int):
nonlinear_activation (str): Activation function module name. Number of initial channels.
nonlinear_activation_params (dict): Hyperparameters for activation function. downsample_scales (list):
use_weight_norm (bool): Whether to use weight norm. List of downsampling scales.
max_downsample_channels (int):
Number of maximum downsampling channels.
use_additional_convs (bool):
Whether to use additional conv layers in residual blocks.
bias (bool):
Whether to add bias parameter in convolution layers.
nonlinear_activation (str):
Activation function module name.
nonlinear_activation_params (dict):
Hyperparameters for activation function.
use_weight_norm (bool):
Whether to use weight norm.
If set to true, it will be applied to all of the conv layers. If set to true, it will be applied to all of the conv layers.
use_spectral_norm (bool): Whether to use spectral norm. use_spectral_norm (bool):
Whether to use spectral norm.
If set to true, it will be applied to all of the conv layers. If set to true, it will be applied to all of the conv layers.
""" """
super().__init__() super().__init__()
@ -298,7 +326,8 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
c (Tensor): Input tensor (B, in_channels, T). c (Tensor):
Input tensor (B, in_channels, T).
Returns: Returns:
list: List of each layer's tensors. list: List of each layer's tensors.
""" """
@ -367,8 +396,10 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
"""Initialize HiFiGANMultiPeriodDiscriminator module. """Initialize HiFiGANMultiPeriodDiscriminator module.
Args: Args:
periods (list): List of periods. periods (list):
discriminator_params (dict): Parameters for hifi-gan period discriminator module. List of periods.
discriminator_params (dict):
Parameters for hifi-gan period discriminator module.
The period parameter will be overwritten. The period parameter will be overwritten.
""" """
super().__init__() super().__init__()
@ -385,7 +416,8 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x (Tensor): Input noise signal (B, 1, T). x (Tensor):
Input noise signal (B, 1, T).
Returns: Returns:
List: List of list of each discriminator outputs, which consists of each layer output tensors. List: List of list of each discriminator outputs, which consists of each layer output tensors.
""" """
@ -417,16 +449,25 @@ class HiFiGANScaleDiscriminator(nn.Layer):
"""Initialize HiFiGAN scale discriminator module. """Initialize HiFiGAN scale discriminator module.
Args: Args:
in_channels (int): Number of input channels. in_channels (int):
out_channels (int): Number of output channels. Number of input channels.
kernel_sizes (list): List of four kernel sizes. The first will be used for the first conv layer, out_channels (int):
Number of output channels.
kernel_sizes (list):
List of four kernel sizes. The first will be used for the first conv layer,
and the second is for downsampling part, and the remaining two are for output layers. and the second is for downsampling part, and the remaining two are for output layers.
channels (int): Initial number of channels for conv layer. channels (int):
max_downsample_channels (int): Maximum number of channels for downsampling layers. Initial number of channels for conv layer.
bias (bool): Whether to add bias parameter in convolution layers. max_downsample_channels (int):
downsample_scales (list): List of downsampling scales. Maximum number of channels for downsampling layers.
nonlinear_activation (str): Activation function module name. bias (bool):
nonlinear_activation_params (dict): Hyperparameters for activation function. Whether to add bias parameter in convolution layers.
downsample_scales (list):
List of downsampling scales.
nonlinear_activation (str):
Activation function module name.
nonlinear_activation_params (dict):
Hyperparameters for activation function.
use_weight_norm (bool): Whether to use weight norm. use_weight_norm (bool): Whether to use weight norm.
If set to true, it will be applied to all of the conv layers. If set to true, it will be applied to all of the conv layers.
use_spectral_norm (bool): Whether to use spectral norm. use_spectral_norm (bool): Whether to use spectral norm.
@ -614,7 +655,8 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x (Tensor): Input noise signal (B, 1, T). x (Tensor):
Input noise signal (B, 1, T).
Returns: Returns:
List: List of list of each discriminator outputs, which consists of each layer output tensors. List: List of list of each discriminator outputs, which consists of each layer output tensors.
""" """
@ -675,14 +717,21 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
"""Initialize HiFiGAN multi-scale + multi-period discriminator module. """Initialize HiFiGAN multi-scale + multi-period discriminator module.
Args: Args:
scales (int): Number of multi-scales. scales (int):
scale_downsample_pooling (str): Pooling module name for downsampling of the inputs. Number of multi-scales.
scale_downsample_pooling_params (dict): Parameters for the above pooling module. scale_downsample_pooling (str):
scale_discriminator_params (dict): Parameters for hifi-gan scale discriminator module. Pooling module name for downsampling of the inputs.
follow_official_norm (bool): Whether to follow the norm setting of the official implementation. Pooling module name for downsampling of the inputs.
Parameters for the above pooling module.
scale_discriminator_params (dict):
Parameters for hifi-gan scale discriminator module.
follow_official_norm (bool):
Whether to follow the norm setting of the official implementation.
The first discriminator uses spectral norm and the other discriminators use weight norm. The first discriminator uses spectral norm and the other discriminators use weight norm.
periods (list): List of periods. periods (list):
period_discriminator_params (dict): Parameters for hifi-gan period discriminator module. List of periods.
period_discriminator_params (dict):
Parameters for hifi-gan period discriminator module.
The period parameter will be overwritten. The period parameter will be overwritten.
""" """
super().__init__() super().__init__()
@ -704,7 +753,8 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x (Tensor): Input noise signal (B, 1, T). x (Tensor):
Input noise signal (B, 1, T).
Returns: Returns:
List: List:
List of list of each discriminator outputs, List of list of each discriminator outputs,

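The HiFiGAN generator docstring above gives the output length in terms of T and prod(upsample_scales). A minimal sketch of that arithmetic follows, assuming each upsampling layer stretches the time axis by exactly its scale. The helper name `upsampled_length` and the scale values are illustrative assumptions, not taken from this commit.

```python
from math import prod
from typing import Sequence


def upsampled_length(num_frames: int, upsample_scales: Sequence[int]) -> int:
    """Return the number of output samples for `num_frames` input frames.

    Hypothetical helper: assumes every upsampling layer multiplies the
    time axis by its scale, so the total factor is the product of scales.
    """
    return num_frames * prod(upsample_scales)


# Illustrative values only, not read from any config in this commit.
print(upsampled_length(100, [8, 8, 2, 2]))  # 100 * 256 = 25600
```

The total factor is a product, not a power, so the length grows linearly with the number of input frames.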
@ -53,24 +53,38 @@ class MelGANGenerator(nn.Layer):
"""Initialize MelGANGenerator module. """Initialize MelGANGenerator module.
Args: Args:
in_channels (int): Number of input channels. in_channels (int):
out_channels (int): Number of output channels, Number of input channels.
out_channels (int):
Number of output channels,
the number of sub-band is out_channels in multi-band melgan. the number of sub-band is out_channels in multi-band melgan.
kernel_size (int): Kernel size of initial and final conv layer. kernel_size (int):
channels (int): Initial number of channels for conv layer. Kernel size of initial and final conv layer.
bias (bool): Whether to add bias parameter in convolution layers. channels (int):
upsample_scales (List[int]): List of upsampling scales. Initial number of channels for conv layer.
stack_kernel_size (int): Kernel size of dilated conv layers in residual stack. bias (bool):
stacks (int): Number of stacks in a single residual stack. Whether to add bias parameter in convolution layers.
nonlinear_activation (Optional[str], optional): Non linear activation in upsample network, by default None upsample_scales (List[int]):
nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the nonlinear activation in the upsample network, List of upsampling scales.
by default {} stack_kernel_size (int):
pad (str): Padding function module name before dilated convolution layer. Kernel size of dilated conv layers in residual stack.
pad_params (dict): Hyperparameters for padding function. stacks (int):
use_final_nonlinear_activation (nn.Layer): Activation function for the final layer. Number of stacks in a single residual stack.
use_weight_norm (bool): Whether to use weight norm. nonlinear_activation (Optional[str], optional):
Non linear activation in upsample network, by default None
nonlinear_activation_params (Dict[str, Any], optional):
Parameters passed to the nonlinear activation in the upsample network, by default {}
pad (str):
Padding function module name before dilated convolution layer.
pad_params (dict):
Hyperparameters for padding function.
use_final_nonlinear_activation (nn.Layer):
Activation function for the final layer.
use_weight_norm (bool):
Whether to use weight norm.
If set to true, it will be applied to all of the conv layers. If set to true, it will be applied to all of the conv layers.
use_causal_conv (bool): Whether to use causal convolution. use_causal_conv (bool):
Whether to use causal convolution.
""" """
super().__init__() super().__init__()
@ -194,7 +208,8 @@ class MelGANGenerator(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
c (Tensor): Input tensor (B, in_channels, T). c (Tensor):
Input tensor (B, in_channels, T).
Returns: Returns:
Tensor: Output tensor (B, out_channels, T * prod(upsample_scales)). Tensor: Output tensor (B, out_channels, T * prod(upsample_scales)).
""" """
@ -244,7 +259,8 @@ class MelGANGenerator(nn.Layer):
"""Perform inference. """Perform inference.
Args: Args:
c (Union[Tensor, ndarray]): Input tensor (T, in_channels). c (Union[Tensor, ndarray]):
Input tensor (T, in_channels).
Returns: Returns:
Tensor: Output tensor (out_channels * T * prod(upsample_scales), 1). Tensor: Output tensor (out_channels * T * prod(upsample_scales), 1).
""" """
@ -279,20 +295,30 @@ class MelGANDiscriminator(nn.Layer):
"""Initialize MelGAN discriminator module. """Initialize MelGAN discriminator module.
Args: Args:
in_channels (int): Number of input channels. in_channels (int):
out_channels (int): Number of output channels. Number of input channels.
out_channels (int):
Number of output channels.
kernel_sizes (List[int]): List of two kernel sizes. The prod will be used for the first conv layer, kernel_sizes (List[int]): List of two kernel sizes. The prod will be used for the first conv layer,
and the first and the second kernel sizes will be used for the last two layers. and the first and the second kernel sizes will be used for the last two layers.
For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15, For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15,
the last two layers' kernel size will be 5 and 3, respectively. the last two layers' kernel size will be 5 and 3, respectively.
channels (int): Initial number of channels for conv layer. channels (int):
max_downsample_channels (int): Maximum number of channels for downsampling layers. Initial number of channels for conv layer.
bias (bool): Whether to add bias parameter in convolution layers. max_downsample_channels (int):
downsample_scales (List[int]): List of downsampling scales. Maximum number of channels for downsampling layers.
nonlinear_activation (str): Activation function module name. bias (bool):
nonlinear_activation_params (dict): Hyperparameters for activation function. Whether to add bias parameter in convolution layers.
pad (str): Padding function module name before dilated convolution layer. downsample_scales (List[int]):
pad_params (dict): Hyperparameters for padding function. List of downsampling scales.
nonlinear_activation (str):
Activation function module name.
nonlinear_activation_params (dict):
Hyperparameters for activation function.
pad (str):
Padding function module name before dilated convolution layer.
pad_params (dict):
Hyperparameters for padding function.
""" """
super().__init__() super().__init__()
@ -364,7 +390,8 @@ class MelGANDiscriminator(nn.Layer):
def forward(self, x): def forward(self, x):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x (Tensor): Input noise signal (B, 1, T). x (Tensor):
Input noise signal (B, 1, T).
Returns: Returns:
List: List of output tensors of each layer (for feat_match_loss). List: List of output tensors of each layer (for feat_match_loss).
""" """
@ -406,22 +433,37 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
"""Initialize MelGAN multi-scale discriminator module. """Initialize MelGAN multi-scale discriminator module.
Args: Args:
in_channels (int): Number of input channels. in_channels (int):
out_channels (int): Number of output channels. Number of input channels.
scales (int): Number of multi-scales. out_channels (int):
downsample_pooling (str): Pooling module name for downsampling of the inputs. Number of output channels.
downsample_pooling_params (dict): Parameters for the above pooling module. scales (int):
kernel_sizes (List[int]): List of two kernel sizes. The product will be used for the first conv layer, Number of multi-scales.
downsample_pooling (str):
Pooling module name for downsampling of the inputs.
downsample_pooling_params (dict):
Parameters for the above pooling module.
kernel_sizes (List[int]):
List of two kernel sizes. The product will be used for the first conv layer,
and the first and the second kernel sizes will be used for the last two layers. and the first and the second kernel sizes will be used for the last two layers.
channels (int): Initial number of channels for conv layer. channels (int):
max_downsample_channels (int): Maximum number of channels for downsampling layers. Initial number of channels for conv layer.
bias (bool): Whether to add bias parameter in convolution layers. max_downsample_channels (int):
downsample_scales (List[int]): List of downsampling scales. Maximum number of channels for downsampling layers.
nonlinear_activation (str): Activation function module name. bias (bool):
nonlinear_activation_params (dict): Hyperparameters for activation function. Whether to add bias parameter in convolution layers.
pad (str): Padding function module name before dilated convolution layer. downsample_scales (List[int]):
pad_params (dict): Hyperparameters for padding function. List of downsampling scales.
use_causal_conv (bool): Whether to use causal convolution. nonlinear_activation (str):
Activation function module name.
nonlinear_activation_params (dict):
Hyperparameters for activation function.
pad (str):
Padding function module name before dilated convolution layer.
pad_params (dict):
Hyperparameters for padding function.
use_causal_conv (bool):
Whether to use causal convolution.
""" """
super().__init__() super().__init__()
@ -464,7 +506,8 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
def forward(self, x): def forward(self, x):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x (Tensor): Input noise signal (B, 1, T). x (Tensor):
Input noise signal (B, 1, T).
Returns: Returns:
List: List of list of each discriminator outputs, which consists of each layer output tensors. List: List of list of each discriminator outputs, which consists of each layer output tensors.
""" """

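The MelGANDiscriminator docstring above states that the first conv layer uses the product of the two kernel sizes and the last two layers use them individually. A small sketch of that rule follows. The function name `melgan_discriminator_kernel_sizes` is a hypothetical helper for illustration, not part of the module.

```python
from math import prod
from typing import List, Tuple


def melgan_discriminator_kernel_sizes(kernel_sizes: List[int]) -> Tuple[int, int, int]:
    """Derive (first_layer, second_to_last, last) kernel sizes.

    Hypothetical helper mirroring the docstring rule: the first layer
    uses the product of both sizes, the last two layers use each size.
    """
    if len(kernel_sizes) != 2:
        raise ValueError("kernel_sizes must contain exactly two values")
    first_layer = prod(kernel_sizes)
    return first_layer, kernel_sizes[0], kernel_sizes[1]


# The example from the docstring: [5, 3] gives 5 * 3 = 15, then 5 and 3.
print(melgan_discriminator_kernel_sizes([5, 3]))  # (15, 5, 3)
```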
@ -54,20 +54,34 @@ class StyleMelGANGenerator(nn.Layer):
"""Initialize Style MelGAN generator. """Initialize Style MelGAN generator.
Args: Args:
in_channels (int): Number of input noise channels. in_channels (int):
aux_channels (int): Number of auxiliary input channels. Number of input noise channels.
channels (int): Number of channels for conv layer. aux_channels (int):
out_channels (int): Number of output channels. Number of auxiliary input channels.
kernel_size (int): Kernel size of conv layers. channels (int):
dilation (int): Dilation factor for conv layers. Number of channels for conv layer.
bias (bool): Whether to add bias parameter in convolution layers. out_channels (int):
noise_upsample_scales (list): List of noise upsampling scales. Number of output channels.
noise_upsample_activation (str): Activation function module name for noise upsampling. kernel_size (int):
noise_upsample_activation_params (dict): Hyperparameters for the above activation function. Kernel size of conv layers.
upsample_scales (list): List of upsampling scales. dilation (int):
upsample_mode (str): Upsampling mode in TADE layer. Dilation factor for conv layers.
gated_function (str): Gated function in TADEResBlock ("softmax" or "sigmoid"). bias (bool):
use_weight_norm (bool): Whether to use weight norm. Whether to add bias parameter in convolution layers.
noise_upsample_scales (list):
List of noise upsampling scales.
noise_upsample_activation (str):
Activation function module name for noise upsampling.
noise_upsample_activation_params (dict):
Hyperparameters for the above activation function.
upsample_scales (list):
List of upsampling scales.
upsample_mode (str):
Upsampling mode in TADE layer.
gated_function (str):
Gated function in TADEResBlock ("softmax" or "sigmoid").
use_weight_norm (bool):
Whether to use weight norm.
If set to true, it will be applied to all of the conv layers. If set to true, it will be applied to all of the conv layers.
""" """
super().__init__() super().__init__()
@ -194,7 +208,8 @@ class StyleMelGANGenerator(nn.Layer):
def inference(self, c): def inference(self, c):
"""Perform inference. """Perform inference.
Args: Args:
c (Tensor): Input tensor (T, in_channels). c (Tensor):
Input tensor (T, in_channels).
Returns: Returns:
Tensor: Output tensor (T * prod(upsample_scales), out_channels). Tensor: Output tensor (T * prod(upsample_scales), out_channels).
""" """
@ -258,11 +273,16 @@ class StyleMelGANDiscriminator(nn.Layer):
"""Initialize Style MelGAN discriminator. """Initialize Style MelGAN discriminator.
Args: Args:
repeats (int): Number of repetitions to apply RWD. repeats (int):
window_sizes (list): List of random window sizes. Number of repetitions to apply RWD.
pqmf_params (list): List of list of Parameters for PQMF modules window_sizes (list):
discriminator_params (dict): Parameters for base discriminator module. List of random window sizes.
use_weight_nom (bool): Whether to apply weight normalization. pqmf_params (list):
List of list of Parameters for PQMF modules
discriminator_params (dict):
Parameters for base discriminator module.
use_weight_nom (bool):
Whether to apply weight normalization.
""" """
super().__init__() super().__init__()
@ -299,7 +319,8 @@ class StyleMelGANDiscriminator(nn.Layer):
def forward(self, x): def forward(self, x):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x (Tensor): Input tensor (B, 1, T). x (Tensor):
Input tensor (B, 1, T).
Returns: Returns:
List: List of discriminator outputs, #items in the list will be List: List of discriminator outputs, #items in the list will be
equal to repeats * #discriminators. equal to repeats * #discriminators.

@ -32,29 +32,45 @@ class PWGGenerator(nn.Layer):
"""Wave Generator for Parallel WaveGAN """Wave Generator for Parallel WaveGAN
Args: Args:
in_channels (int, optional): Number of channels of the input waveform, by default 1 in_channels (int, optional):
out_channels (int, optional): Number of channels of the output waveform, by default 1 Number of channels of the input waveform, by default 1
kernel_size (int, optional): Kernel size of the residual blocks inside, by default 3 out_channels (int, optional):
layers (int, optional): Number of residual blocks inside, by default 30 Number of channels of the output waveform, by default 1
stacks (int, optional): The number of groups to split the residual blocks into, by default 3 kernel_size (int, optional):
Kernel size of the residual blocks inside, by default 3
layers (int, optional):
Number of residual blocks inside, by default 30
stacks (int, optional):
The number of groups to split the residual blocks into, by default 3
Within each group, the dilation of the residual block grows exponentially. Within each group, the dilation of the residual block grows exponentially.
residual_channels (int, optional): Residual channel of the residual blocks, by default 64 residual_channels (int, optional):
gate_channels (int, optional): Gate channel of the residual blocks, by default 128 Residual channel of the residual blocks, by default 64
skip_channels (int, optional): Skip channel of the residual blocks, by default 64 gate_channels (int, optional):
aux_channels (int, optional): Auxiliary channel of the residual blocks, by default 80 Gate channel of the residual blocks, by default 128
aux_context_window (int, optional): The context window size of the first convolution applied to the skip_channels (int, optional):
auxiliary input, by default 2 Skip channel of the residual blocks, by default 64
dropout (float, optional): Dropout of the residual blocks, by default 0. aux_channels (int, optional):
bias (bool, optional): Whether to use bias in residual blocks, by default True Auxiliary channel of the residual blocks, by default 80
use_weight_norm (bool, optional): Whether to use weight norm in all convolutions, by default True aux_context_window (int, optional):
use_causal_conv (bool, optional): Whether to use causal padding in the upsample network and residual The context window size of the first convolution applied to the auxiliary input, by default 2
blocks, by default False dropout (float, optional):
upsample_scales (List[int], optional): Upsample scales of the upsample network, by default [4, 4, 4, 4] Dropout of the residual blocks, by default 0.
nonlinear_activation (Optional[str], optional): Non linear activation in upsample network, by default None bias (bool, optional):
nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the nonlinear activation in the upsample network, Whether to use bias in residual blocks, by default True
by default {} use_weight_norm (bool, optional):
interpolate_mode (str, optional): Interpolation mode of the upsample network, by default "nearest" Whether to use weight norm in all convolutions, by default True
freq_axis_kernel_size (int, optional): Kernel size along the frequency axis of the upsample network, by default 1 use_causal_conv (bool, optional):
Whether to use causal padding in the upsample network and residual blocks, by default False
upsample_scales (List[int], optional):
Upsample scales of the upsample network, by default [4, 4, 4, 4]
nonlinear_activation (Optional[str], optional):
Non linear activation in upsample network, by default None
nonlinear_activation_params (Dict[str, Any], optional):
Parameters passed to the nonlinear activation in the upsample network, by default {}
interpolate_mode (str, optional):
Interpolation mode of the upsample network, by default "nearest"
freq_axis_kernel_size (int, optional):
Kernel size along the frequency axis of the upsample network, by default 1
""" """
def __init__( def __init__(
@ -147,9 +163,11 @@ class PWGGenerator(nn.Layer):
"""Generate waveform. """Generate waveform.
Args: Args:
x(Tensor): Shape (N, C_in, T), The input waveform. x(Tensor):
c(Tensor): Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). It Shape (N, C_in, T), The input waveform.
is upsampled to match the time resolution of the input. c(Tensor):
Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram).
It is upsampled to match the time resolution of the input.
Returns: Returns:
Tensor: Shape (N, C_out, T), the generated waveform. Tensor: Shape (N, C_out, T), the generated waveform.
@ -195,8 +213,10 @@ class PWGGenerator(nn.Layer):
"""Waveform generation. This function is used for single instance inference. """Waveform generation. This function is used for single instance inference.
Args: Args:
c(Tensor, optional): Shape (T', C_aux), the auxiliary input, by default None c(Tensor, optional):
x(Tensor, optional): Shape (T, C_in), the noise waveform, by default None Shape (T', C_aux), the auxiliary input, by default None
x(Tensor, optional):
Shape (T, C_in), the noise waveform, by default None
Returns: Returns:
Tensor: Shape (T, C_out), the generated waveform Tensor: Shape (T, C_out), the generated waveform
@ -214,20 +234,28 @@ class PWGDiscriminator(nn.Layer):
"""A convolutional discriminator for audio. """A convolutional discriminator for audio.
Args: Args:
in_channels (int, optional): Number of channels of the input audio, by default 1 in_channels (int, optional):
out_channels (int, optional): Output feature size, by default 1 Number of channels of the input audio, by default 1
kernel_size (int, optional): Kernel size of convolutional sublayers, by default 3 out_channels (int, optional):
layers (int, optional): Number of layers, by default 10 Output feature size, by default 1
conv_channels (int, optional): Feature size of the convolutional sublayers, by default 64 kernel_size (int, optional):
dilation_factor (int, optional): The factor with which dilation of each convolutional sublayers grows Kernel size of convolutional sublayers, by default 3
layers (int, optional):
Number of layers, by default 10
conv_channels (int, optional):
Feature size of the convolutional sublayers, by default 64
dilation_factor (int, optional):
The factor with which dilation of each convolutional sublayers grows
exponentially if it is greater than 1, else the dilation of each convolutional sublayers grows linearly, exponentially if it is greater than 1, else the dilation of each convolutional sublayers grows linearly,
by default 1 by default 1
nonlinear_activation (str, optional): The activation after each convolutional sublayer, by default "leakyrelu" nonlinear_activation (str, optional):
nonlinear_activation_params (Dict[str, Any], optional): The parameters passed to the activation's initializer, by default The activation after each convolutional sublayer, by default "leakyrelu"
{"negative_slope": 0.2} nonlinear_activation_params (Dict[str, Any], optional):
bias (bool, optional): Whether to use bias in convolutional sublayers, by default True The parameters passed to the activation's initializer, by default {"negative_slope": 0.2}
use_weight_norm (bool, optional): Whether to use weight normalization at all convolutional sublayers, bias (bool, optional):
by default True Whether to use bias in convolutional sublayers, by default True
use_weight_norm (bool, optional):
Whether to use weight normalization at all convolutional sublayers, by default True
""" """
def __init__( def __init__(
@ -290,7 +318,8 @@ class PWGDiscriminator(nn.Layer):
""" """
Args: Args:
x (Tensor): Shape (N, in_channels, num_samples), the input audio. x (Tensor):
Shape (N, in_channels, num_samples), the input audio.
Returns: Returns:
Tensor: Shape (N, out_channels, num_samples), the predicted logits. Tensor: Shape (N, out_channels, num_samples), the predicted logits.
@ -318,24 +347,35 @@ class ResidualPWGDiscriminator(nn.Layer):
"""A wavenet-style discriminator for audio. """A wavenet-style discriminator for audio.
Args: Args:
in_channels (int, optional): Number of channels of the input audio, by default 1 in_channels (int, optional):
out_channels (int, optional): Output feature size, by default 1 Number of channels of the input audio, by default 1
kernel_size (int, optional): Kernel size of residual blocks, by default 3 out_channels (int, optional):
layers (int, optional): Number of residual blocks, by default 30 Output feature size, by default 1
stacks (int, optional): Number of groups of residual blocks, within which the dilation kernel_size (int, optional):
Kernel size of residual blocks, by default 3
layers (int, optional):
Number of residual blocks, by default 30
stacks (int, optional):
Number of groups of residual blocks, within which the dilation
of each residual blocks grows exponentially, by default 3 of each residual blocks grows exponentially, by default 3
residual_channels (int, optional): Residual channels of residual blocks, by default 64 residual_channels (int, optional):
gate_channels (int, optional): Gate channels of residual blocks, by default 128 Residual channels of residual blocks, by default 64
skip_channels (int, optional): Skip channels of residual blocks, by default 64 gate_channels (int, optional):
dropout (float, optional): Dropout probability of residual blocks, by default 0. Gate channels of residual blocks, by default 128
bias (bool, optional): Whether to use bias in residual blocks, by default True skip_channels (int, optional):
use_weight_norm (bool, optional): Whether to use weight normalization in all convolutional layers, Skip channels of residual blocks, by default 64
by default True dropout (float, optional):
use_causal_conv (bool, optional): Whether to use causal convolution in residual blocks, by default False Dropout probability of residual blocks, by default 0.
nonlinear_activation (str, optional): Activation after convolutions other than those in residual blocks, bias (bool, optional):
by default "leakyrelu" Whether to use bias in residual blocks, by default True
nonlinear_activation_params (Dict[str, Any], optional): Parameters to pass to the activation, use_weight_norm (bool, optional):
by default {"negative_slope": 0.2} Whether to use weight normalization in all convolutional layers, by default True
use_causal_conv (bool, optional):
Whether to use causal convolution in residual blocks, by default False
nonlinear_activation (str, optional):
Activation after convolutions other than those in residual blocks, by default "leakyrelu"
nonlinear_activation_params (Dict[str, Any], optional):
Parameters to pass to the activation, by default {"negative_slope": 0.2}
""" """
def __init__( def __init__(
@ -405,7 +445,8 @@ class ResidualPWGDiscriminator(nn.Layer):
def forward(self, x): def forward(self, x):
""" """
Args: Args:
x(Tensor): Shape (N, in_channels, num_samples), the input audio. x(Tensor):
Shape (N, in_channels, num_samples), the input audio.
Returns: Returns:
Tensor: Shape (N, out_channels, num_samples), the predicted logits. Tensor: Shape (N, out_channels, num_samples), the predicted logits.

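The PWGGenerator docstring above says the residual blocks are split into `stacks` groups and that the dilation grows exponentially within each group. A minimal sketch of such a schedule follows. The helper name `stacked_dilations` and the growth base of 2 are assumptions for illustration; the docstring does not state the base.

```python
from typing import List


def stacked_dilations(layers: int = 30, stacks: int = 3, base: int = 2) -> List[int]:
    """Return one dilation per residual block.

    Hypothetical helper: the dilation restarts at 1 at the start of each
    stack and is multiplied by `base` (assumed to be 2) for each block.
    """
    if layers % stacks != 0:
        raise ValueError("layers must be divisible by stacks")
    layers_per_stack = layers // stacks
    return [base ** (i % layers_per_stack) for i in range(layers)]


dilations = stacked_dilations()
print(dilations[:10])  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(dilations[10])   # 1, the second stack restarts the schedule
```

With the documented defaults (30 layers, 3 stacks), each stack holds 10 blocks under this assumption.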
@ -29,10 +29,14 @@ class ResidualBlock(nn.Layer):
n: int=2): n: int=2):
"""SpeedySpeech encoder module. """SpeedySpeech encoder module.
Args: Args:
channels (int, optional): Feature size of the residual output(and also the input). channels (int, optional):
kernel_size (int, optional): Kernel size of the 1D convolution. Feature size of the residual output(and also the input).
dilation (int, optional): Dilation of the 1D convolution. kernel_size (int, optional):
n (int): Number of blocks. Kernel size of the 1D convolution.
dilation (int, optional):
Dilation of the 1D convolution.
n (int):
Number of blocks.
""" """
super().__init__() super().__init__()
@ -57,7 +61,8 @@ class ResidualBlock(nn.Layer):
def forward(self, x: paddle.Tensor): def forward(self, x: paddle.Tensor):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x(Tensor): Batch of input sequences (B, hidden_size, Tmax). x(Tensor):
Batch of input sequences (B, hidden_size, Tmax).
Returns: Returns:
Tensor: The residual output (B, hidden_size, Tmax). Tensor: The residual output (B, hidden_size, Tmax).
""" """
@ -89,8 +94,10 @@ class TextEmbedding(nn.Layer):
def forward(self, text: paddle.Tensor, tone: paddle.Tensor=None): def forward(self, text: paddle.Tensor, tone: paddle.Tensor=None):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
text(Tensor(int64)): Batch of padded token ids (B, Tmax). text(Tensor(int64)):
tone(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). Batch of padded token ids (B, Tmax).
tone(Tensor, optional(int64)):
Batch of padded tone ids (B, Tmax).
Returns: Returns:
Tensor: The residual output (B, Tmax, embedding_size). Tensor: The residual output (B, Tmax, embedding_size).
""" """
@ -109,12 +116,18 @@ class TextEmbedding(nn.Layer):
class SpeedySpeechEncoder(nn.Layer): class SpeedySpeechEncoder(nn.Layer):
"""SpeedySpeech encoder module. """SpeedySpeech encoder module.
Args: Args:
vocab_size (int): Dimension of the inputs. vocab_size (int):
tone_size (Optional[int]): Number of tones. Dimension of the inputs.
hidden_size (int): Number of encoder hidden units. tone_size (Optional[int]):
kernel_size (int): Kernel size of encoder. Number of tones.
dilations (List[int]): Dilations of encoder. hidden_size (int):
spk_num (Optional[int]): Number of speakers. Number of encoder hidden units.
kernel_size (int):
Kernel size of encoder.
dilations (List[int]):
Dilations of encoder.
spk_num (Optional[int]):
Number of speakers.
""" """
def __init__(self, def __init__(self,
@ -161,9 +174,12 @@ class SpeedySpeechEncoder(nn.Layer):
spk_id: paddle.Tensor=None): spk_id: paddle.Tensor=None):
"""Encoder input sequence. """Encoder input sequence.
Args: Args:
text(Tensor(int64)): Batch of padded token ids (B, Tmax). text(Tensor(int64)):
tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). Batch of padded token ids (B, Tmax).
spk_id(Tensor, optional(int64)): Batch of speaker ids (B,) tones(Tensor, optional(int64)):
Batch of padded tone ids (B, Tmax).
spk_id(Tensor, optional(int64)):
Batch of speaker ids (B,)
Returns: Returns:
Tensor: Output tensor (B, Tmax, hidden_size). Tensor: Output tensor (B, Tmax, hidden_size).
@ -192,7 +208,8 @@ class DurationPredictor(nn.Layer):
def forward(self, x: paddle.Tensor): def forward(self, x: paddle.Tensor):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
x(Tensor): Batch of input sequences (B, Tmax, hidden_size). x(Tensor):
Batch of input sequences (B, Tmax, hidden_size).
Returns: Returns:
Tensor: Batch of predicted durations in log domain (B, Tmax). Tensor: Batch of predicted durations in log domain (B, Tmax).
@ -212,10 +229,14 @@ class SpeedySpeechDecoder(nn.Layer):
]): ]):
"""SpeedySpeech decoder module. """SpeedySpeech decoder module.
Args: Args:
hidden_size (int): Number of decoder hidden units. hidden_size (int):
kernel_size (int): Kernel size of decoder. Number of decoder hidden units.
output_size (int): Dimension of the outputs. kernel_size (int):
dilations (List[int]): Dilations of decoder. Kernel size of decoder.
output_size (int):
Dimension of the outputs.
dilations (List[int]):
Dilations of decoder.
""" """
super().__init__() super().__init__()
res_blocks = [ res_blocks = [
@ -230,7 +251,8 @@ class SpeedySpeechDecoder(nn.Layer):
def forward(self, x): def forward(self, x):
"""Decoder input sequence. """Decoder input sequence.
Args: Args:
x(Tensor): Input tensor (B, time, hidden_size). x(Tensor):
Input tensor (B, time, hidden_size).
Returns: Returns:
Tensor: Output tensor (B, time, output_size). Tensor: Output tensor (B, time, output_size).
@ -261,18 +283,30 @@ class SpeedySpeech(nn.Layer):
positional_dropout_rate: int=0.1): positional_dropout_rate: int=0.1):
"""Initialize SpeedySpeech module. """Initialize SpeedySpeech module.
Args: Args:
vocab_size (int): Dimension of the inputs. vocab_size (int):
encoder_hidden_size (int): Number of encoder hidden units. Dimension of the inputs.
encoder_kernel_size (int): Kernel size of encoder. encoder_hidden_size (int):
encoder_dilations (List[int]): Dilations of encoder. Number of encoder hidden units.
duration_predictor_hidden_size (int): Number of duration predictor hidden units. encoder_kernel_size (int):
decoder_hidden_size (int): Number of decoder hidden units. Kernel size of encoder.
decoder_kernel_size (int): Kernel size of decoder. encoder_dilations (List[int]):
decoder_dilations (List[int]): Dilations of decoder. Dilations of encoder.
decoder_output_size (int): Dimension of the outputs. duration_predictor_hidden_size (int):
tone_size (Optional[int]): Number of tones. Number of duration predictor hidden units.
spk_num (Optional[int]): Number of speakers. decoder_hidden_size (int):
init_type (str): How to initialize transformer parameters. Number of decoder hidden units.
decoder_kernel_size (int):
Kernel size of decoder.
decoder_dilations (List[int]):
Dilations of decoder.
decoder_output_size (int):
Dimension of the outputs.
tone_size (Optional[int]):
Number of tones.
spk_num (Optional[int]):
Number of speakers.
init_type (str):
How to initialize transformer parameters.
""" """
super().__init__() super().__init__()
@ -304,14 +338,20 @@ class SpeedySpeech(nn.Layer):
spk_id: paddle.Tensor=None): spk_id: paddle.Tensor=None):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
text(Tensor(int64)): Batch of padded token ids (B, Tmax). text(Tensor(int64)):
durations(Tensor(int64)): Batch of padded durations (B, Tmax). Batch of padded token ids (B, Tmax).
tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). durations(Tensor(int64)):
spk_id(Tensor, optional(int64)): Batch of speaker ids (B,) Batch of padded durations (B, Tmax).
tones(Tensor, optional(int64)):
Batch of padded tone ids (B, Tmax).
spk_id(Tensor, optional(int64)):
Batch of speaker ids (B,)
Returns: Returns:
Tensor: Output tensor (B, T_frames, decoder_output_size). Tensor:
Tensor: Predicted durations (B, Tmax). Output tensor (B, T_frames, decoder_output_size).
Tensor:
Predicted durations (B, Tmax).
""" """
# input of embedding must be int64 # input of embedding must be int64
text = paddle.cast(text, 'int64') text = paddle.cast(text, 'int64')
@ -336,10 +376,14 @@ class SpeedySpeech(nn.Layer):
spk_id: paddle.Tensor=None): spk_id: paddle.Tensor=None):
"""Generate the sequence of features given the sequences of characters. """Generate the sequence of features given the sequences of characters.
Args: Args:
text(Tensor(int64)): Input sequence of characters (T,). text(Tensor(int64)):
tones(Tensor, optional(int64)): Batch of padded tone ids (T, ). Input sequence of characters (T,).
durations(Tensor, optional (int64)): Groundtruth of duration (T,). tones(Tensor, optional(int64)):
spk_id(Tensor, optional(int64), optional): spk ids (1,). (Default value = None) Batch of padded tone ids (T, ).
durations(Tensor, optional (int64)):
Groundtruth of duration (T,).
spk_id(Tensor, optional(int64), optional):
spk ids (1,). (Default value = None)
Returns: Returns:
Tensor: logmel (T, decoder_output_size). Tensor: logmel (T, decoder_output_size).
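The SpeedySpeech docstrings above take token-level inputs of length `Tmax` plus a `durations` tensor and return frame-level output of length `T_frames`. One common way to bridge the two is a length-regulator style expansion, where each token encoding is repeated for its number of frames. The diff does not show the mechanism, so the sketch below is an assumption, written with NumPy rather than Paddle; `expand_by_durations` is a hypothetical helper name.

```python
import numpy as np


def expand_by_durations(encodings, durations):
    """Repeat each row of `encodings` (T, hidden) `durations[i]` times."""
    # Hypothetical helper: a length-regulator style expansion, assumed here
    # to be how token-level features are stretched to frame level.
    return np.repeat(encodings, durations, axis=0)


encodings = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])  # (T=3, hidden=2)
durations = np.array([2, 0, 3])  # frames assigned to each token
frames = expand_by_durations(encodings, durations)
print(frames.shape)  # (5, 2): T_frames equals durations.sum()
```

Under this assumption a token with duration 0 contributes no frames, and `T_frames` is the sum of the durations.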

@ -83,38 +83,67 @@ class Tacotron2(nn.Layer):
init_type: str="xavier_uniform", ): init_type: str="xavier_uniform", ):
"""Initialize Tacotron2 module. """Initialize Tacotron2 module.
Args: Args:
idim (int): Dimension of the inputs. idim (int):
odim (int): Dimension of the outputs. Dimension of the inputs.
embed_dim (int): Dimension of the token embedding. odim (int):
elayers (int): Number of encoder blstm layers. Dimension of the outputs.
eunits (int): Number of encoder blstm units. embed_dim (int):
econv_layers (int): Number of encoder conv layers. Dimension of the token embedding.
econv_filts (int): Number of encoder conv filter size. elayers (int):
econv_chans (int): Number of encoder conv filter channels. Number of encoder blstm layers.
dlayers (int): Number of decoder lstm layers. eunits (int):
dunits (int): Number of decoder lstm units. Number of encoder blstm units.
prenet_layers (int): Number of prenet layers. econv_layers (int):
prenet_units (int): Number of prenet units. Number of encoder conv layers.
postnet_layers (int): Number of postnet layers. econv_filts (int):
postnet_filts (int): Number of postnet filter size. Number of encoder conv filter size.
postnet_chans (int): Number of postnet filter channels. econv_chans (int):
output_activation (str): Name of activation function for outputs. Number of encoder conv filter channels.
adim (int): Number of dimension of mlp in attention. dlayers (int):
aconv_chans (int): Number of attention conv filter channels. Number of decoder lstm layers.
aconv_filts (int): Number of attention conv filter size. dunits (int):
cumulate_att_w (bool): Whether to cumulate previous attention weight. Number of decoder lstm units.
use_batch_norm (bool): Whether to use batch normalization. prenet_layers (int):
use_concate (bool): Whether to concat enc outputs w/ dec lstm outputs. Number of prenet layers.
reduction_factor (int): Reduction factor. prenet_units (int):
spk_num (Optional[int]): Number of speakers. If set to > 1, assume that the Number of prenet units.
postnet_layers (int):
Number of postnet layers.
postnet_filts (int):
Number of postnet filter size.
postnet_chans (int):
Number of postnet filter channels.
output_activation (str):
Name of activation function for outputs.
adim (int):
Number of dimension of mlp in attention.
aconv_chans (int):
Number of attention conv filter channels.
aconv_filts (int):
Number of attention conv filter size.
cumulate_att_w (bool):
Whether to cumulate previous attention weight.
use_batch_norm (bool):
Whether to use batch normalization.
use_concate (bool):
Whether to concat enc outputs w/ dec lstm outputs.
reduction_factor (int):
Reduction factor.
spk_num (Optional[int]):
Number of speakers. If set to > 1, assume that the
sids will be provided as the input and use sid embedding layer. sids will be provided as the input and use sid embedding layer.
lang_num (Optional[int]): Number of languages. If set to > 1, assume that the lang_num (Optional[int]):
Number of languages. If set to > 1, assume that the
lids will be provided as the input and use lid embedding layer. lids will be provided as the input and use lid embedding layer.
spk_embed_dim (Optional[int]): Speaker embedding dimension. If set to > 0, spk_embed_dim (Optional[int]):
Speaker embedding dimension. If set to > 0,
assume that spk_emb will be provided as the input. assume that spk_emb will be provided as the input.
spk_embed_integration_type (str): How to integrate speaker embedding. spk_embed_integration_type (str):
dropout_rate (float): Dropout rate. How to integrate speaker embedding.
zoneout_rate (float): Zoneout rate. dropout_rate (float):
Dropout rate.
zoneout_rate (float):
Zoneout rate.
""" """
assert check_argument_types() assert check_argument_types()
super().__init__() super().__init__()
@ -230,18 +259,28 @@ class Tacotron2(nn.Layer):
"""Calculate forward propagation. """Calculate forward propagation.
Args: Args:
text (Tensor(int64)): Batch of padded character ids (B, T_text). text (Tensor(int64)):
text_lengths (Tensor(int64)): Batch of lengths of each input batch (B,). Batch of padded character ids (B, T_text).
speech (Tensor): Batch of padded target features (B, T_feats, odim). text_lengths (Tensor(int64)):
speech_lengths (Tensor(int64)): Batch of the lengths of each target (B,). Batch of lengths of each input batch (B,).
spk_emb (Optional[Tensor]): Batch of speaker embeddings (B, spk_embed_dim). speech (Tensor):
spk_id (Optional[Tensor]): Batch of speaker IDs (B, 1). Batch of padded target features (B, T_feats, odim).
lang_id (Optional[Tensor]): Batch of language IDs (B, 1). speech_lengths (Tensor(int64)):
Batch of the lengths of each target (B,).
spk_emb (Optional[Tensor]):
Batch of speaker embeddings (B, spk_embed_dim).
spk_id (Optional[Tensor]):
Batch of speaker IDs (B, 1).
lang_id (Optional[Tensor]):
Batch of language IDs (B, 1).
Returns: Returns:
Tensor: Loss scalar value. Tensor:
Dict: Statistics to be monitored. Loss scalar value.
Tensor: Weight value if not joint training else model outputs. Dict:
Statistics to be monitored.
Tensor:
Weight value if not joint training else model outputs.
""" """
text = text[:, :text_lengths.max()] text = text[:, :text_lengths.max()]
@ -329,18 +368,30 @@ class Tacotron2(nn.Layer):
"""Generate the sequence of features given the sequences of characters. """Generate the sequence of features given the sequences of characters.
Args: Args:
text (Tensor(int64)): Input sequence of characters (T_text,). text (Tensor(int64)):
speech (Optional[Tensor]): Feature sequence to extract style (N, idim). Input sequence of characters (T_text,).
spk_emb (Optional[Tensor]): Speaker embedding (spk_embed_dim,). speech (Optional[Tensor]):
spk_id (Optional[Tensor]): Speaker ID (1,). Feature sequence to extract style (N, idim).
lang_id (Optional[Tensor]): Language ID (1,). spk_emb (Optional[Tensor]):
threshold (float): Threshold in inference. Speaker embedding (spk_embed_dim,).
minlenratio (float): Minimum length ratio in inference. spk_id (Optional[Tensor]):
maxlenratio (float): Maximum length ratio in inference. Speaker ID (1,).
use_att_constraint (bool): Whether to apply attention constraint. lang_id (Optional[Tensor]):
backward_window (int): Backward window in attention constraint. Language ID (1,).
forward_window (int): Forward window in attention constraint. threshold (float):
use_teacher_forcing (bool): Whether to use teacher forcing. Threshold in inference.
minlenratio (float):
Minimum length ratio in inference.
maxlenratio (float):
Maximum length ratio in inference.
use_att_constraint (bool):
Whether to apply attention constraint.
backward_window (int):
Backward window in attention constraint.
forward_window (int):
Forward window in attention constraint.
use_teacher_forcing (bool):
Whether to use teacher forcing.
Returns: Returns:
Dict[str, Tensor] Dict[str, Tensor]

@ -49,66 +49,124 @@ class TransformerTTS(nn.Layer):
https://arxiv.org/pdf/1809.08895.pdf https://arxiv.org/pdf/1809.08895.pdf
Args: Args:
idim (int): Dimension of the inputs. idim (int):
odim (int): Dimension of the outputs. Dimension of the inputs.
embed_dim (int, optional): Dimension of character embedding. odim (int):
eprenet_conv_layers (int, optional): Number of encoder prenet convolution layers. Dimension of the outputs.
eprenet_conv_chans (int, optional): Number of encoder prenet convolution channels. embed_dim (int, optional):
eprenet_conv_filts (int, optional): Filter size of encoder prenet convolution. Dimension of character embedding.
dprenet_layers (int, optional): Number of decoder prenet layers. eprenet_conv_layers (int, optional):
dprenet_units (int, optional): Number of decoder prenet hidden units. Number of encoder prenet convolution layers.
elayers (int, optional): Number of encoder layers. eprenet_conv_chans (int, optional):
eunits (int, optional): Number of encoder hidden units. Number of encoder prenet convolution channels.
adim (int, optional): Number of attention transformation dimensions. eprenet_conv_filts (int, optional):
aheads (int, optional): Number of heads for multi head attention. Filter size of encoder prenet convolution.
dlayers (int, optional): Number of decoder layers. dprenet_layers (int, optional):
dunits (int, optional): Number of decoder hidden units. Number of decoder prenet layers.
postnet_layers (int, optional): Number of postnet layers. dprenet_units (int, optional):
postnet_chans (int, optional): Number of postnet channels. Number of decoder prenet hidden units.
postnet_filts (int, optional): Filter size of postnet. elayers (int, optional):
use_scaled_pos_enc (bool, optional): Whether to use trainable scaled positional encoding. Number of encoder layers.
use_batch_norm (bool, optional): Whether to use batch normalization in encoder prenet. eunits (int, optional):
encoder_normalize_before (bool, optional): Whether to perform layer normalization before encoder block. Number of encoder hidden units.
decoder_normalize_before (bool, optional): Whether to perform layer normalization before decoder block. adim (int, optional):
encoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in encoder. Number of attention transformation dimensions.
decoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in decoder. aheads (int, optional):
positionwise_layer_type (str, optional): Position-wise operation type. Number of heads for multi head attention.
positionwise_conv_kernel_size (int, optional): Kernel size in position wise conv 1d. dlayers (int, optional):
reduction_factor (int, optional): Reduction factor. Number of decoder layers.
spk_embed_dim (int, optional): Number of speaker embedding dimensions. dunits (int, optional):
spk_embed_integration_type (str, optional): How to integrate speaker embedding. Number of decoder hidden units.
use_gst (str, optional): Whether to use global style token. postnet_layers (int, optional):
gst_tokens (int, optional): The number of GST embeddings. Number of postnet layers.
gst_heads (int, optional): The number of heads in GST multihead attention. postnet_chans (int, optional):
gst_conv_layers (int, optional): The number of conv layers in GST. Number of postnet channels.
gst_conv_chans_list (Sequence[int], optional): List of the number of channels of conv layers in GST. postnet_filts (int, optional):
gst_conv_kernel_size (int, optional): Kernel size of conv layers in GST. Filter size of postnet.
gst_conv_stride (int, optional): Stride size of conv layers in GST. use_scaled_pos_enc (bool, optional):
gst_gru_layers (int, optional): The number of GRU layers in GST. Whether to use trainable scaled positional encoding.
gst_gru_units (int, optional): The number of GRU units in GST. use_batch_norm (bool, optional):
transformer_lr (float, optional): Initial value of learning rate. Whether to use batch normalization in encoder prenet.
transformer_warmup_steps (int, optional): Optimizer warmup steps. encoder_normalize_before (bool, optional):
transformer_enc_dropout_rate (float, optional): Dropout rate in encoder except attention and positional encoding. Whether to perform layer normalization before encoder block.
transformer_enc_positional_dropout_rate (float, optional): Dropout rate after encoder positional encoding. decoder_normalize_before (bool, optional):
transformer_enc_attn_dropout_rate (float, optional): Dropout rate in encoder self-attention module. Whether to perform layer normalization before decoder block.
transformer_dec_dropout_rate (float, optional): Dropout rate in decoder except attention & positional encoding. encoder_concat_after (bool, optional):
transformer_dec_positional_dropout_rate (float, optional): Dropout rate after decoder positional encoding. Whether to concatenate attention layer's input and output in encoder.
transformer_dec_attn_dropout_rate (float, optional): Dropout rate in decoder self-attention module. decoder_concat_after (bool, optional):
transformer_enc_dec_attn_dropout_rate (float, optional): Dropout rate in encoder-decoder attention module. Whether to concatenate attention layer's input and output in decoder.
init_type (str, optional): How to initialize transformer parameters. positionwise_layer_type (str, optional):
init_enc_alpha (float, optional): Initial value of alpha in scaled pos encoding of the encoder. Position-wise operation type.
init_dec_alpha (float, optional): Initial value of alpha in scaled pos encoding of the decoder. positionwise_conv_kernel_size (int, optional):
eprenet_dropout_rate (float, optional): Dropout rate in encoder prenet. Kernel size in position wise conv 1d.
dprenet_dropout_rate (float, optional): Dropout rate in decoder prenet. reduction_factor (int, optional):
postnet_dropout_rate (float, optional): Dropout rate in postnet. Reduction factor.
use_masking (bool, optional): Whether to apply masking for padded part in loss calculation. spk_embed_dim (int, optional):
use_weighted_masking (bool, optional): Whether to apply weighted masking in loss calculation. Number of speaker embedding dimensions.
bce_pos_weight (float, optional): Positive sample weight in bce calculation (only for use_masking=true). spk_embed_integration_type (str, optional):
loss_type (str, optional): How to calculate loss. How to integrate speaker embedding.
use_guided_attn_loss (bool, optional): Whether to use guided attention loss. use_gst (str, optional):
num_heads_applied_guided_attn (int, optional): Number of heads in each layer to apply guided attention loss. Whether to use global style token.
num_layers_applied_guided_attn (int, optional): Number of layers to apply guided attention loss. gst_tokens (int, optional):
List of module names to apply guided attention loss. The number of GST embeddings.
gst_heads (int, optional):
The number of heads in GST multihead attention.
gst_conv_layers (int, optional):
The number of conv layers in GST.
gst_conv_chans_list (Sequence[int], optional):
List of the number of channels of conv layers in GST.
gst_conv_kernel_size (int, optional):
Kernel size of conv layers in GST.
gst_conv_stride (int, optional):
Stride size of conv layers in GST.
gst_gru_layers (int, optional):
The number of GRU layers in GST.
gst_gru_units (int, optional):
The number of GRU units in GST.
transformer_lr (float, optional):
Initial value of learning rate.
transformer_warmup_steps (int, optional):
Optimizer warmup steps.
transformer_enc_dropout_rate (float, optional):
Dropout rate in encoder except attention and positional encoding.
transformer_enc_positional_dropout_rate (float, optional):
Dropout rate after encoder positional encoding.
transformer_enc_attn_dropout_rate (float, optional):
Dropout rate in encoder self-attention module.
transformer_dec_dropout_rate (float, optional):
Dropout rate in decoder except attention & positional encoding.
transformer_dec_positional_dropout_rate (float, optional):
Dropout rate after decoder positional encoding.
transformer_dec_attn_dropout_rate (float, optional):
Dropout rate in decoder self-attention module.
transformer_enc_dec_attn_dropout_rate (float, optional):
Dropout rate in encoder-decoder attention module.
init_type (str, optional):
How to initialize transformer parameters.
init_enc_alpha (float, optional):
Initial value of alpha in scaled pos encoding of the encoder.
init_dec_alpha (float, optional):
Initial value of alpha in scaled pos encoding of the decoder.
eprenet_dropout_rate (float, optional):
Dropout rate in encoder prenet.
dprenet_dropout_rate (float, optional):
Dropout rate in decoder prenet.
postnet_dropout_rate (float, optional):
Dropout rate in postnet.
use_masking (bool, optional):
Whether to apply masking for padded part in loss calculation.
use_weighted_masking (bool, optional):
Whether to apply weighted masking in loss calculation.
bce_pos_weight (float, optional):
Positive sample weight in bce calculation (only for use_masking=true).
loss_type (str, optional):
How to calculate loss.
use_guided_attn_loss (bool, optional):
Whether to use guided attention loss.
num_heads_applied_guided_attn (int, optional):
Number of heads in each layer to apply guided attention loss.
num_layers_applied_guided_attn (int, optional):
Number of layers to apply guided attention loss.
""" """
def __init__( def __init__(

@ -33,8 +33,10 @@ def fold(x, n_group):
"""Fold audio or spectrogram's temporal dimension into groups. """Fold audio or spectrogram's temporal dimension into groups.
Args: Args:
x(Tensor): The input tensor. shape=(*, time_steps) x(Tensor):
n_group(int): The size of a group. The input tensor. shape=(*, time_steps)
n_group(int):
The size of a group.
Returns: Returns:
Tensor: Folded tensor. shape=(*, time_steps // n_group, group) Tensor: Folded tensor. shape=(*, time_steps // n_group, group)
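As a rough illustration of the shape contract in the `fold` docstring above, the sketch below reproduces it with NumPy rather than Paddle. The name `fold_sketch` and the sample values are assumptions for illustration, not code from this commit, and the sketch assumes `time_steps` is divisible by `n_group`.

```python
import numpy as np


def fold_sketch(x, n_group):
    """Reshape (*, time_steps) into (*, time_steps // n_group, n_group)."""
    # Hypothetical NumPy stand-in for the Paddle `fold` documented above.
    *leading, time_steps = x.shape
    if time_steps % n_group != 0:
        raise ValueError("time_steps must be divisible by n_group")
    return x.reshape(*leading, time_steps // n_group, n_group)


audio = np.arange(2 * 12).reshape(2, 12)  # (batch=2, time_steps=12)
folded = fold_sketch(audio, n_group=4)
print(folded.shape)  # (2, 3, 4)
```

Each row of the folded tensor holds `n_group` consecutive time steps, so adjacent samples stay adjacent along the last axis.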
@ -53,7 +55,8 @@ class UpsampleNet(nn.LayerList):
on mel and time dimension. on mel and time dimension.
Args: Args:
upscale_factors(List[int], optional): Time upsampling factors for each Conv2DTranspose Layer. upscale_factors(List[int], optional):
Time upsampling factors for each Conv2DTranspose Layer.
The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose
Layers. Each upscale_factor is used as the ``stride`` for the Layers. Each upscale_factor is used as the ``stride`` for the
corresponding Conv2DTranspose. Defaults to [16, 16], which is the default corresponding Conv2DTranspose. Defaults to [16, 16], which is the default
@ -94,8 +97,10 @@ class UpsampleNet(nn.LayerList):
"""Forward pass of the ``UpsampleNet`` """Forward pass of the ``UpsampleNet``
Args: Args:
x(Tensor): The input spectrogram. shape=(batch_size, input_channels, time_steps) x(Tensor):
trim_conv_artifact(bool, optional): Trim deconvolution artifact at each layer. Defaults to False. The input spectrogram. shape=(batch_size, input_channels, time_steps)
trim_conv_artifact(bool, optional):
Trim deconvolution artifact at each layer. Defaults to False.
Returns: Returns:
Tensor: The upsampled spectrogram. shape=(batch_size, input_channels, time_steps * upsample_factor) Tensor: The upsampled spectrogram. shape=(batch_size, input_channels, time_steps * upsample_factor)
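The `upsample_factor` in the return shape above is not defined in the visible docstring. Since each entry of `upscale_factors` is used as the stride of one Conv2DTranspose layer, a reasonable reading is that the overall factor is the product of the entries. The sketch below works through that arithmetic; `upsampled_length` is a hypothetical helper, and it assumes no frames are trimmed (that is, `trim_conv_artifact` has no effect on length).

```python
import math


def upsampled_length(time_steps, upscale_factors=(16, 16)):
    """Frame count after stacking transposed convolutions with these strides."""
    # Assumption: each layer multiplies the time axis by its stride exactly.
    return time_steps * math.prod(upscale_factors)


print(upsampled_length(80))  # 20480, i.e. 80 * 16 * 16
```

With the default `[16, 16]`, every input frame therefore maps to 256 output time steps under this assumption.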
@ -123,10 +128,14 @@ class ResidualBlock(nn.Layer):
and output. and output.
Args: Args:
channels (int): Feature size of the input. channels (int):
cond_channels (int): Feature size of the condition. Feature size of the input.
kernel_size (Tuple[int]): Kernel size of the Convolution2d applied to the input. cond_channels (int):
dilations (int): Dilations of the Convolution2d applied to the input. Feature size of the condition.
kernel_size (Tuple[int]):
Kernel size of the Convolution2d applied to the input.
dilations (int):
Dilations of the Convolution2d applied to the input.
""" """
def __init__(self, channels, cond_channels, kernel_size, dilations): def __init__(self, channels, cond_channels, kernel_size, dilations):
@ -173,12 +182,16 @@ class ResidualBlock(nn.Layer):
"""Compute output for a whole folded sequence. """Compute output for a whole folded sequence.
Args: Args:
x (Tensor): The input. [shape=(batch_size, channel, height, width)] x (Tensor):
condition (Tensor [shape=(batch_size, condition_channel, height, width)]): The local condition. The input. [shape=(batch_size, channel, height, width)]
condition (Tensor [shape=(batch_size, condition_channel, height, width)]):
The local condition.
Returns: Returns:
res (Tensor): The residual output. [shape=(batch_size, channel, height, width)] res (Tensor):
skip (Tensor): The skip output. [shape=(batch_size, channel, height, width)] The residual output. [shape=(batch_size, channel, height, width)]
skip (Tensor):
The skip output. [shape=(batch_size, channel, height, width)]
""" """
x_in = x x_in = x
x = self.conv(x) x = self.conv(x)
@ -216,12 +229,16 @@ class ResidualBlock(nn.Layer):
"""Compute the output for a row and update the buffer. """Compute the output for a row and update the buffer.
Args: Args:
x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width) x_row (Tensor):
condition_row (Tensor): A row of the condition. shape=(batch_size, condition_channel, 1, width) A row of the input. shape=(batch_size, channel, 1, width)
condition_row (Tensor):
A row of the condition. shape=(batch_size, condition_channel, 1, width)
Returns: Returns:
res (Tensor): A row of the residual output. shape=(batch_size, channel, 1, width) res (Tensor):
skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width) A row of the residual output. shape=(batch_size, channel, 1, width)
skip (Tensor):
A row of the skip output. shape=(batch_size, channel, 1, width)
""" """
x_row_in = x_row x_row_in = x_row
@ -258,11 +275,16 @@ class ResidualNet(nn.LayerList):
"""A stack of several ResidualBlocks. It merges condition at each layer. """A stack of several ResidualBlocks. It merges condition at each layer.
Args: Args:
n_layer (int): Number of ResidualBlocks in the ResidualNet. n_layer (int):
residual_channels (int): Feature size of each ResidualBlock. Number of ResidualBlocks in the ResidualNet.
condition_channels (int): Feature size of the condition. residual_channels (int):
kernel_size (Tuple[int]): Kernel size of each ResidualBlock. Feature size of each ResidualBlock.
dilations_h (List[int]): Dilation in height dimension of every ResidualBlock. condition_channels (int):
Feature size of the condition.
kernel_size (Tuple[int]):
Kernel size of each ResidualBlock.
dilations_h (List[int]):
Dilation in height dimension of every ResidualBlock.
Raises: Raises:
ValueError: If the length of dilations_h does not equal n_layers. ValueError: If the length of dilations_h does not equal n_layers.
@ -288,11 +310,13 @@ class ResidualNet(nn.LayerList):
"""Compute the output given the input and the condition. """Compute the output given the input and the condition.
Args: Args:
x (Tensor): The input. shape=(batch_size, channel, height, width) x (Tensor):
condition (Tensor): The local condition. shape=(batch_size, condition_channel, height, width) The input. shape=(batch_size, channel, height, width)
condition (Tensor):
The local condition. shape=(batch_size, condition_channel, height, width)
Returns: Returns:
Tensor : The output, which is an aggregation of all the skip outputs. shape=(batch_size, channel, height, width) Tensor: The output, which is an aggregation of all the skip outputs. shape=(batch_size, channel, height, width)
""" """
skip_connections = [] skip_connections = []
@ -312,12 +336,16 @@ class ResidualNet(nn.LayerList):
"""Compute the output for a row and update the buffers. """Compute the output for a row and update the buffers.
Args: Args:
x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width) x_row (Tensor):
condition_row (Tensor): A row of the condition. shape=(batch_size, condition_channel, 1, width) A row of the input. shape=(batch_size, channel, 1, width)
condition_row (Tensor):
A row of the condition. shape=(batch_size, condition_channel, 1, width)
Returns: Returns:
res (Tensor): A row of the residual output. shape=(batch_size, channel, 1, width) res (Tensor):
skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width) A row of the residual output. shape=(batch_size, channel, 1, width)
skip (Tensor):
A row of the skip output. shape=(batch_size, channel, 1, width)
""" """
skip_connections = [] skip_connections = []
@ -337,11 +365,16 @@ class Flow(nn.Layer):
sampling. sampling.
Args: Args:
n_layers (int): Number of ResidualBlocks in the Flow. n_layers (int):
channels (int): Feature size of the ResidualBlocks. Number of ResidualBlocks in the Flow.
mel_bands (int): Feature size of the mel spectrogram (mel bands). channels (int):
kernel_size (Tuple[int]): Kernel size of each ResidualBlock in the Flow. Feature size of the ResidualBlocks.
n_group (int): Number of timesteps to be folded into a group. mel_bands (int):
Feature size of the mel spectrogram (mel bands).
kernel_size (Tuple[int]):
Kernel size of each ResidualBlock in the Flow.
n_group (int):
Number of timesteps to be folded into a group.
""" """
dilations_dict = { dilations_dict = {
8: [1, 1, 1, 1, 1, 1, 1, 1], 8: [1, 1, 1, 1, 1, 1, 1, 1],
@ -393,11 +426,14 @@ class Flow(nn.Layer):
a sample from p(X) into a sample from p(Z). a sample from p(X) into a sample from p(Z).
Args: Args:
x (Tensor): An input sample of the distribution p(X). shape=(batch, 1, height, width) x (Tensor):
condition (Tensor): The local condition. shape=(batch, condition_channel, height, width) An input sample of the distribution p(X). shape=(batch, 1, height, width)
condition (Tensor):
The local condition. shape=(batch, condition_channel, height, width)
Returns: Returns:
z (Tensor): shape(batch, 1, height, width), the transformed sample. z (Tensor):
shape(batch, 1, height, width), the transformed sample.
Tuple[Tensor, Tensor]: Tuple[Tensor, Tensor]:
The parameter of the transformation. The parameter of the transformation.
logs (Tensor): shape(batch, 1, height - 1, width), the log scale of the transformation from x to z. logs (Tensor): shape(batch, 1, height - 1, width), the log scale of the transformation from x to z.
@ -433,8 +469,10 @@ class Flow(nn.Layer):
p(Z) and transform the sample. It is an autoregressive transformation. p(Z) and transform the sample. It is an autoregressive transformation.
Args: Args:
z(Tensor): A sample of the distribution p(Z). shape=(batch, 1, time_steps) z(Tensor):
condition(Tensor): The local condition. shape=(batch, condition_channel, time_steps) A sample of the distribution p(Z). shape=(batch, 1, time_steps)
condition(Tensor):
The local condition. shape=(batch, condition_channel, time_steps)
Returns: Returns:
Tensor: Tensor:
The transformed sample. shape=(batch, 1, height, width) The transformed sample. shape=(batch, 1, height, width)
@ -462,12 +500,18 @@ class WaveFlow(nn.LayerList):
flows. flows.
Args: Args:
n_flows (int): Number of flows in the WaveFlow model. n_flows (int):
n_layers (int): Number of ResidualBlocks in each Flow. Number of flows in the WaveFlow model.
n_group (int): Number of timesteps to fold as a group. n_layers (int):
channels (int): Feature size of each ResidualBlock. Number of ResidualBlocks in each Flow.
mel_bands (int): Feature size of mel spectrogram (mel bands). n_group (int):
kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock. Number of timesteps to fold as a group.
channels (int):
Feature size of each ResidualBlock.
mel_bands (int):
Feature size of mel spectrogram (mel bands).
kernel_size (Union[int, List[int]]):
Kernel size of the convolution layer in each ResidualBlock.
""" """
def __init__(self, n_flows, n_layers, n_group, channels, mel_bands, def __init__(self, n_flows, n_layers, n_group, channels, mel_bands,
@ -518,12 +562,16 @@ class WaveFlow(nn.LayerList):
condition. condition.
Args: Args:
x (Tensor): The audio. shape=(batch_size, time_steps) x (Tensor):
condition (Tensor): The local condition (mel spectrogram here). shape=(batch_size, condition channel, time_steps) The audio. shape=(batch_size, time_steps)
condition (Tensor):
The local condition (mel spectrogram here). shape=(batch_size, condition channel, time_steps)
Returns: Returns:
Tensor: The transformed random variable. shape=(batch_size, time_steps) Tensor:
Tensor: The log determinant of the jacobian of the transformation from x to z. shape=(1,) The transformed random variable. shape=(batch_size, time_steps)
Tensor:
The log determinant of the jacobian of the transformation from x to z. shape=(1,)
""" """
# x: (B, T) # x: (B, T)
# condition: (B, C, T) upsampled condition # condition: (B, C, T) upsampled condition
@ -559,12 +607,13 @@ class WaveFlow(nn.LayerList):
autoregressive manner. autoregressive manner.
Args: Args:
z (Tensor): A sample of the distribution p(Z). shape=(batch, 1, time_steps) z (Tensor):
condition (Tensor): The local condition. shape=(batch, condition_channel, time_steps) A sample of the distribution p(Z). shape=(batch, 1, time_steps)
condition (Tensor):
The local condition. shape=(batch, condition_channel, time_steps)
Returns: Returns:
Tensor: The transformed sample (audio here). shape=(batch_size, time_steps) Tensor: The transformed sample (audio here). shape=(batch_size, time_steps)
""" """
z, condition = self._trim(z, condition) z, condition = self._trim(z, condition)
@ -590,13 +639,20 @@ class ConditionalWaveFlow(nn.LayerList):
"""ConditionalWaveFlow, a UpsampleNet with a WaveFlow model. """ConditionalWaveFlow, a UpsampleNet with a WaveFlow model.
Args: Args:
upsample_factors (List[int]): Upsample factors for the upsample net. upsample_factors (List[int]):
n_flows (int): Number of flows in the WaveFlow model. Upsample factors for the upsample net.
n_layers (int): Number of ResidualBlocks in each Flow. n_flows (int):
n_group (int): Number of timesteps to fold as a group. Number of flows in the WaveFlow model.
channels (int): Feature size of each ResidualBlock. n_layers (int):
n_mels (int): Feature size of mel spectrogram (mel bands). Number of ResidualBlocks in each Flow.
kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock. n_group (int):
Number of timesteps to fold as a group.
channels (int):
Feature size of each ResidualBlock.
n_mels (int):
Feature size of mel spectrogram (mel bands).
kernel_size (Union[int, List[int]]):
Kernel size of the convolution layer in each ResidualBlock.
""" """
def __init__(self, def __init__(self,
@ -622,12 +678,16 @@ class ConditionalWaveFlow(nn.LayerList):
the determinant of the jacobian of the transformation from x to z. the determinant of the jacobian of the transformation from x to z.
Args: Args:
audio(Tensor): The audio. shape=(B, T) audio(Tensor):
mel(Tensor): The mel spectrogram. shape=(B, C_mel, T_mel) The audio. shape=(B, T)
mel(Tensor):
The mel spectrogram. shape=(B, C_mel, T_mel)
Returns: Returns:
Tensor: The inversely transformed random variable z (x to z). shape=(B, T) Tensor:
Tensor: the log of the determinant of the jacobian of the transformation from x to z. shape=(1,) The inversely transformed random variable z (x to z). shape=(B, T)
Tensor:
the log of the determinant of the jacobian of the transformation from x to z. shape=(1,)
""" """
condition = self.encoder(mel) condition = self.encoder(mel)
z, log_det_jacobian = self.decoder(audio, condition) z, log_det_jacobian = self.decoder(audio, condition)
@ -638,10 +698,12 @@ class ConditionalWaveFlow(nn.LayerList):
"""Generate raw audio given mel spectrogram. """Generate raw audio given mel spectrogram.
Args: Args:
mel(np.ndarray): Mel spectrogram of an utterance (in log-magnitude). shape=(C_mel, T_mel) mel(np.ndarray):
Mel spectrogram of an utterance (in log-magnitude). shape=(C_mel, T_mel)
Returns: Returns:
Tensor: The synthesized audio, where ``T <= T_mel * upsample_factors``. shape=(B, T) Tensor:
The synthesized audio, where ``T <= T_mel * upsample_factors``. shape=(B, T)
""" """
start = time.time() start = time.time()
condition = self.encoder(mel, trim_conv_artifact=True) # (B, C, T) condition = self.encoder(mel, trim_conv_artifact=True) # (B, C, T)
@ -657,7 +719,8 @@ class ConditionalWaveFlow(nn.LayerList):
"""Generate raw audio given mel spectrogram. """Generate raw audio given mel spectrogram.
Args: Args:
mel(np.ndarray): Mel spectrogram of an utterance (in log-magnitude). shape=(C_mel, T_mel) mel(np.ndarray):
Mel spectrogram of an utterance (in log-magnitude). shape=(C_mel, T_mel)
Returns: Returns:
np.ndarray: The synthesized audio. shape=(T,) np.ndarray: The synthesized audio. shape=(T,)
@ -673,8 +736,10 @@ class ConditionalWaveFlow(nn.LayerList):
"""Build a ConditionalWaveFlow model from a pretrained model. """Build a ConditionalWaveFlow model from a pretrained model.
Args: Args:
config(yacs.config.CfgNode): model configs config(yacs.config.CfgNode):
checkpoint_path(Path or str): the path of pretrained model checkpoint, without extension name model configs
checkpoint_path(Path or str):
the path of pretrained model checkpoint, without extension name
Returns: Returns:
ConditionalWaveFlow: The model built from the pretrained checkpoint. ConditionalWaveFlow: The model built from the pretrained checkpoint.
@ -694,8 +759,8 @@ class WaveFlowLoss(nn.Layer):
"""Criterion of a WaveFlow model. """Criterion of a WaveFlow model.
Args: Args:
sigma (float): The standard deviation of the gaussian noise used in WaveFlow, sigma (float):
by default 1.0. The standard deviation of the gaussian noise used in WaveFlow, by default 1.0.
""" """
def __init__(self, sigma=1.0): def __init__(self, sigma=1.0):
@ -708,8 +773,10 @@ class WaveFlowLoss(nn.Layer):
log_det_jacobian of transformation from x to z. log_det_jacobian of transformation from x to z.
Args: Args:
z(Tensor): The transformed random variable (x to z). shape=(B, T) z(Tensor):
log_det_jacobian(Tensor): The log of the determinant of the jacobian matrix of the The transformed random variable (x to z). shape=(B, T)
log_det_jacobian(Tensor):
The log of the determinant of the jacobian matrix of the
transformation from x to z. shape=(1,) transformation from x to z. shape=(1,)
Returns: Returns:
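Review note: the docstring above takes `z` and `log_det_jacobian` and a Gaussian `sigma`, which suggests a per-sample negative log-likelihood under a zero-mean Gaussian prior. The sketch below is a pure-Python version of that criterion, assuming the usual change-of-variables form and averaging over the number of elements in `z`. The function name `waveflow_nll` and the normalization are assumptions, not taken from this diff.

```python
import math
from typing import Sequence


def waveflow_nll(z: Sequence[float],
                 log_det_jacobian: float,
                 sigma: float=1.0) -> float:
    """Average negative log-likelihood of z under N(0, sigma^2).

    Assumed form: (sum(z^2) / (2 sigma^2) - log|det J|) / N
                  + 0.5 * log(2 pi) + log(sigma)
    Hypothetical helper for illustration only.
    """
    n = len(z)
    if n == 0:
        raise ValueError("z must not be empty")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Quadratic term of the Gaussian log-density, summed over all elements.
    energy = sum(v * v for v in z) / (2.0 * sigma * sigma)
    # Normalizing constant of the Gaussian, per element.
    const = 0.5 * math.log(2.0 * math.pi) + math.log(sigma)
    return (energy - log_det_jacobian) / n + const
```

A larger `log_det_jacobian` lowers the loss, which is why the flow is trained to both whiten `z` and keep the transformation's volume change accounted for.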
@ -726,7 +793,8 @@ class ConditionalWaveFlow2Infer(ConditionalWaveFlow):
"""Generate raw audio given mel spectrogram. """Generate raw audio given mel spectrogram.
Args: Args:
mel (np.ndarray): Mel spectrogram of an utterance (in log-magnitude). shape=(C_mel, T_mel) mel (np.ndarray):
Mel spectrogram of an utterance (in log-magnitude). shape=(C_mel, T_mel)
Returns: Returns:
np.ndarray: The synthesized audio. shape=(T,) np.ndarray: The synthesized audio. shape=(T,)

@ -165,19 +165,29 @@ class WaveRNN(nn.Layer):
init_type: str="xavier_uniform", ): init_type: str="xavier_uniform", ):
''' '''
Args: Args:
rnn_dims (int, optional): Hidden dims of RNN Layers. rnn_dims (int, optional):
fc_dims (int, optional): Dims of FC Layers. Hidden dims of RNN Layers.
bits (int, optional): bit depth of signal. fc_dims (int, optional):
aux_context_window (int, optional): The context window size of the first convolution applied to the Dims of FC Layers.
auxiliary input, by default 2 bits (int, optional):
upsample_scales (List[int], optional): Upsample scales of the upsample network. bit depth of signal.
aux_channels (int, optional): Auxiliary channel of the residual blocks. aux_context_window (int, optional):
compute_dims (int, optional): Dims of Conv1D in MelResNet. The context window size of the first convolution applied to the auxiliary input, by default 2
res_out_dims (int, optional): Dims of output in MelResNet. upsample_scales (List[int], optional):
res_blocks (int, optional): Number of residual blocks. Upsample scales of the upsample network.
mode (str, optional): Output mode of the WaveRNN vocoder. aux_channels (int, optional):
Auxiliary channel of the residual blocks.
compute_dims (int, optional):
Dims of Conv1D in MelResNet.
res_out_dims (int, optional):
Dims of output in MelResNet.
res_blocks (int, optional):
Number of residual blocks.
mode (str, optional):
Output mode of the WaveRNN vocoder.
`MOL` for Mixture of Logistic Distribution, and `RAW` for quantized bits as the model's output. `MOL` for Mixture of Logistic Distribution, and `RAW` for quantized bits as the model's output.
init_type (str): How to initialize parameters. init_type (str):
How to initialize parameters.
''' '''
super().__init__() super().__init__()
self.mode = mode self.mode = mode
@ -226,8 +236,10 @@ class WaveRNN(nn.Layer):
def forward(self, x, c): def forward(self, x, c):
''' '''
Args: Args:
x (Tensor): wav sequence, [B, T] x (Tensor):
c (Tensor): mel spectrogram [B, C_aux, T'] wav sequence, [B, T]
c (Tensor):
mel spectrogram [B, C_aux, T']
T = (T' - 2 * aux_context_window ) * hop_length T = (T' - 2 * aux_context_window ) * hop_length
Returns: Returns:
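Review note: the relation `T = (T' - 2 * aux_context_window) * hop_length` in the docstring above is easy to get wrong when preparing batches. The sketch below computes the expected waveform length for a given number of mel frames. The helper name `expected_wav_length` is hypothetical, and it assumes that `hop_length` equals the product of `upsample_scales`, which this diff does not state.

```python
import math
from typing import Sequence


def expected_wav_length(n_mel_frames: int,
                        aux_context_window: int,
                        upsample_scales: Sequence[int]) -> int:
    """Return T for T' mel frames, per T = (T' - 2 * window) * hop_length.

    Assumes hop_length == prod(upsample_scales). Hypothetical helper.
    """
    hop_length = math.prod(upsample_scales)
    # The context window consumes `aux_context_window` frames on each side.
    usable_frames = n_mel_frames - 2 * aux_context_window
    if usable_frames <= 0:
        raise ValueError("not enough mel frames for the context window")
    return usable_frames * hop_length
```

For example, 104 frames with a window of 2 and scales `[4, 5, 3, 5]` leave 100 usable frames at 300 samples each.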
@ -280,10 +292,14 @@ class WaveRNN(nn.Layer):
gen_display: bool=False): gen_display: bool=False):
""" """
Args: Args:
c(Tensor): input mels, (T', C_aux) c(Tensor):
batched(bool): generate in batch or not input mels, (T', C_aux)
target(int): target number of samples to be generated in each batch entry batched(bool):
overlap(int): number of samples for crossfading between batches generate in batch or not
target(int):
target number of samples to be generated in each batch entry
overlap(int):
number of samples for crossfading between batches
mu_law(bool) mu_law(bool)
Returns: Returns:
wav sequence: Output (T' * prod(upsample_scales), out_channels, C_out). wav sequence: Output (T' * prod(upsample_scales), out_channels, C_out).
@ -404,7 +420,8 @@ class WaveRNN(nn.Layer):
def pad_tensor(self, x, pad, side='both'): def pad_tensor(self, x, pad, side='both'):
''' '''
Args: Args:
x(Tensor): mel, [1, n_frames, 80] x(Tensor):
mel, [1, n_frames, 80]
pad(int): pad(int):
side(str, optional): (Default value = 'both') side(str, optional): (Default value = 'both')
@ -428,12 +445,15 @@ class WaveRNN(nn.Layer):
Overlap will be used for crossfading in xfade_and_unfold() Overlap will be used for crossfading in xfade_and_unfold()
Args: Args:
x(Tensor): Upsampled conditioning features. mels or aux x(Tensor):
Upsampled conditioning features. mels or aux
shape=(1, T, features) shape=(1, T, features)
mels: [1, T, 80] mels: [1, T, 80]
aux: [1, T, 128] aux: [1, T, 128]
target(int): Target timesteps for each index of batch target(int):
overlap(int): Timesteps for both xfade and rnn warmup Target timesteps for each index of batch
overlap(int):
Timesteps for both xfade and rnn warmup
Returns: Returns:
Tensor: Tensor:
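Review note: the `target` and `overlap` arguments described above control how a long conditioning sequence is split into overlapping segments for batched generation. The sketch below shows one common folding scheme on a plain Python list: each fold has `target + 2 * overlap` timesteps and consecutive folds share `overlap` timesteps. It is an illustration only; the padding rule and the helper signature are assumptions and may differ from the method in this file.

```python
from typing import List


def fold_with_overlap(x: List[float],
                      target: int,
                      overlap: int,
                      pad_value: float=0.0) -> List[List[float]]:
    """Split x into folds of length target + 2 * overlap.

    Consecutive folds start target + overlap steps apart, so they share
    `overlap` timesteps that can later be crossfaded. Illustrative sketch.
    """
    if target <= 0 or overlap < 0:
        raise ValueError("target must be positive and overlap non-negative")
    total_len = len(x)
    if total_len <= overlap:
        raise ValueError("input must be longer than the overlap")
    step = target + overlap
    num_folds = (total_len - overlap) // step
    # Timesteps left over after the last complete fold.
    remaining = total_len - (num_folds * step + overlap)
    padded = list(x)
    if remaining != 0:
        # Add one more fold and pad the tail so that it is complete.
        num_folds += 1
        padded += [pad_value] * (target + 2 * overlap - remaining)
    fold_len = target + 2 * overlap
    return [padded[i * step:i * step + fold_len] for i in range(num_folds)]
```

With ten timesteps, `target=3` and `overlap=1`, this yields three folds of five timesteps, the last one padded with zeros.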

Some files were not shown because too many files have changed in this diff
