Merge branch 'PaddlePaddle:develop' into develop

commit 5c8b75ee18 by liangym (pull/2779/head)

@ -157,6 +157,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural Language Processing (NLP) and Computer Vision (CV).
### Recent Update
- 🎉 2022.11.30: Add [TTS Android Demo](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/TTSAndroid).
- 👑 2022.11.18: Add [Whisper CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/pull/2640), supporting multi-language recognition and translation.
- 🔥 2022.11.18: Add [Wav2vec2 CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_ssl), supporting ASR and feature extraction.
- 🎉 2022.11.17: Add [male voice for TTS](https://github.com/PaddlePaddle/PaddleSpeech/pull/2660).

@ -164,7 +164,8 @@
### Recent Update
- 🎉 2022.11.30: Add [TTS Android deployment demo](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/TTSAndroid).
- 👑 2022.11.18: Add [Whisper CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/pull/2640), supporting recognition and translation for multiple languages.
- 🔥 2022.11.18: Add [Wav2vec2 CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_ssl), supporting ASR and feature extraction.
- 🎉 2022.11.17: Add [high-quality male voice for TTS](https://github.com/PaddlePaddle/PaddleSpeech/pull/2660).
- 🔥 2022.11.07: Add [U2/U2++ high-performance streaming ASR C++ deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).

@ -0,0 +1,13 @@
*.iml
.gradle
/local.properties
/.idea/caches
/.idea/libraries
/.idea/modules.xml
/.idea/workspace.xml
/.idea/navEditor.xml
/.idea/assetWizardSettings.xml
.DS_Store
/build
/captures
.externalNativeBuild

@ -0,0 +1,189 @@
# TTS Java API Demo Usage Guide
This demo implements text-to-speech on Android. It is easy to use and open to extension, e.g. you can run your own trained models in it.
This document mainly describes how to run the TTS demo.
## How to Run the TTS Demo
### Environment Setup
1. Install Android Studio on your local machine; for detailed instructions, see the [Android Studio official site](https://developer.android.com/studio).
2. Prepare an Android phone and enable USB debugging: `Settings -> find Developer options -> enable Developer options and USB debugging`.
**Note**
> If your Android Studio does not have the NDK configured yet, follow the [Install and configure the NDK and CMake](https://developer.android.com/studio/projects/install-ndk) section of the Android Studio user guide to set it up first. You can pick the latest NDK version, or use the same NDK version as the Paddle Lite prediction library.
### Deployment Steps
1. Open the TTSAndroid project in Android Studio.
2. Connect the phone to your computer, enable USB debugging and file transfer mode, and connect the device in Android Studio (the phone must allow installing apps over USB).
**Note:**
>1. If you hit NDK configuration errors while importing, building, or running the project, open `File > Project Structure > SDK Location` and set `Android NDK location` to the path of the NDK installed on your machine.
>2. If you downloaded the NDK through Android Studio's SDK Tools (see "Environment Setup" above), you can simply pick the default path from the drop-down list.
>3. Alternatively, you can configure the NDK path manually by adding `ndk.dir=/root/android-ndk-r20b` to the `TTSAndroid/local.properties` file.
>4. If none of the above resolves the NDK configuration error, try updating the Android Gradle plugin version following the [Update the Android Gradle plugin](https://developer.android.com/studio/releases/gradle-plugin?hl=zh-cn#updating-plugin) section of the official Android Studio documentation.
3. Click the Run button to build the app automatically and install it on the phone. (This step downloads the Paddle Lite prediction library and the models, so a network connection is required.)
On success you should see the following:
- Pic 1: the app installed on the phone.
- Pic 2: the app after launch; choose the text to synthesize from the drop-down list.
- Pic 3: after synthesis, tap the button to play the audio.
<p align="center"><img width="350" height="500" src="https://user-images.githubusercontent.com/24568452/204450217-d166588a-5341-4565-8662-0f8129284bba.png"/><img width="350" height="500" src="https://user-images.githubusercontent.com/24568452/204450231-d6f3105c-276a-4af5-a3ba-864d9f5ee24e.png"/><img width="350" height="500" src="https://user-images.githubusercontent.com/24568452/204450269-0ddf46ec-eedd-4c90-8a0d-e915622fdf3e.png"/></p>
## Updating the Prediction Library
* Paddle Lite project: [https://github.com/PaddlePaddle/Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite).
  Follow the [Paddle Lite source compilation docs](https://www.paddlepaddle.org.cn/lite/v2.11/source_compile/compile_env.html) to build the Android prediction library.
* The final build artifacts are in `inference_lite_lib.xxx.xxx` under `build.lite.xxx.xxx.xxx`.
* Replace the Java libraries:
  * JAR: replace `TTSAndroid/app/libs/PaddlePredictor.jar` in the demo with the generated
    `build.lite.android.xxx.gcc/inference_lite_lib.android.xxx/java/jar/PaddlePredictor.jar`.
  * Java .so (arm64-v8a): replace `TTSAndroid/app/src/main/jniLibs/arm64-v8a/libpaddle_lite_jni.so` in the demo with the generated
    `build.lite.android.armv8.gcc/inference_lite_lib.android.armv8/java/so/libpaddle_lite_jni.so`.
## Demo Contents
This section first gives an overview of the TTS demo's code structure, then describes what each Java module does.
<p align="center">
<img width="442" alt="image" src="https://user-images.githubusercontent.com/24568452/204455080-4f96fe55-6058-4235-bb92-cc98cfcc8bb6.png">
</p>
### Key Files
1. `Predictor.java`: the prediction code.
```bash
# Location:
TTSAndroid/app/src/main/java/com/baidu/paddle/lite/demo/tts/Predictor.java
```
2. `fastspeech2_csmsc_arm.nb` and `mb_melgan_csmsc_arm.nb`: the model files (Paddle Lite models converted with the opt tool), from [fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip) and [mb_melgan_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_pdlite_1.3.0.zip) respectively.
```bash
# Location:
TTSAndroid/app/src/main/assets/models/cpu/fastspeech2_csmsc_arm.nb
TTSAndroid/app/src/main/assets/models/cpu/mb_melgan_csmsc_arm.nb
```
3. `libpaddle_lite_jni.so` and `PaddlePredictor.jar`: the Paddle Lite Java prediction library and JAR.
```bash
# Location:
TTSAndroid/app/src/main/jniLibs/arm64-v8a/libpaddle_lite_jni.so
TTSAndroid/app/libs/PaddlePredictor.jar
```
> To replace the dynamic library and JAR, copy the new .so into the `TTSAndroid/app/src/main/jniLibs/arm64-v8a/` directory and the new JAR into the `TTSAndroid/app/libs/` directory.
4. `build.gradle`: the Gradle script that defines the build. (No changes needed; it defines the automatic download of the Paddle Lite prediction library and the models.)
```bash
# Location:
TTSAndroid/app/build.gradle
```
If you need to update the models and the prediction library manually, just comment out the `download*` tasks in the Gradle script and place the new prediction library in the corresponding directories.
### Java Side
* Model storage: unpack the downloaded models into the `app/src/main/assets/models` directory.
* The TTSAndroid Java package lives in `app/src/main/java/com/baidu/paddle/lite/demo/tts` and implements the app's UI and message handling.
* MainActivity implements app creation, running, and release. Pay particular attention to the `onLoadModel` and `onRunModel` functions, which pass values from the UI and drive inference.
```java
public boolean onLoadModel() {
return predictor.init(MainActivity.this, modelPath, AMmodelName, VOCmodelName, cpuThreadNum,
cpuPowerMode);
}
public boolean onRunModel() {
return predictor.isLoaded() && predictor.runModel(phones);
}
```
* SettingsActivity implements updating and displaying the elements of the settings screen, such as the model path, thread count, and input shape. Adding or removing a UI element is done in this class:
    - Default parameter values can be found in `app/src/main/res/values/strings.xml`
    - Each element's ID and value correspond to entries in the `app/src/main/res/xml/settings.xml`
      and `app/src/main/res/values/strings.xml` files
    - Modifying this part is not recommended; if you add a new attribute, follow the same format
* Predictor implements TTS model prediction with the Java API. Pay particular attention to the `init` and `runModel` functions, which implement on-device Paddle Lite inference:
```java
// Initialization: sets up the acoustic model and vocoder predictors
public boolean init(Context appCtx, String modelPath, String AMmodelName, String VOCmodelName, int cpuThreadNum, String cpuPowerMode);
// Runs model inference on a phone-ID sequence
public boolean runModel(float[] phones);
```
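A minimal usage sketch (assumptions: `context` is the current Activity, the models are the ones bundled under `assets/models/cpu`, and `wavFile` is a writable output path):
```java
// Hedged sketch built only from the Predictor and Utils classes shown in this demo.
void synthesizeOnce(android.content.Context context, float[] phones, String wavFile) {
    Predictor predictor = new Predictor();
    boolean loaded = predictor.init(context, "models/cpu",
            "fastspeech2_csmsc_arm.nb", "mb_melgan_csmsc_arm.nb",
            1, "LITE_POWER_HIGH");
    if (loaded && predictor.runModel(phones)) {
        try {
            // predictor.wav holds the synthesized 24 kHz mono float samples
            Utils.rawToWave(wavFile, predictor.wav, 24000);
        } catch (java.io.IOException e) {
            e.printStackTrace();
        }
    }
    predictor.releaseModel();
}
```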
## Code Walkthrough (Running Prediction with the Paddle Lite `Java API`)
The Android example is built on the Java API; calling the Paddle Lite `Java API` involves the five steps sketched below. For a more detailed `API` description, see [Paddle Lite Java API](https://www.paddlepaddle.org.cn/lite/v2.11/api_reference/java_api_doc.html).
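A condensed sketch of the five steps, using only the API calls that appear in `Predictor.java` in this demo (the model path and CPU settings are illustrative):
```java
import com.baidu.paddle.lite.MobileConfig;
import com.baidu.paddle.lite.PaddlePredictor;
import com.baidu.paddle.lite.PowerMode;
import com.baidu.paddle.lite.Tensor;

public class LiteApiSketch {
    public static float[] infer(String nbModelFile, float[] phones) {
        // 1. Create a MobileConfig and point it at the optimized .nb model
        MobileConfig config = new MobileConfig();
        config.setModelFromFile(nbModelFile);
        config.setThreads(1);
        config.setPowerMode(PowerMode.LITE_POWER_HIGH);
        // 2. Create the PaddlePredictor from the config
        PaddlePredictor predictor = PaddlePredictor.createPaddlePredictor(config);
        // 3. Fill the input tensor: resize to the phone-ID sequence length, then set data
        Tensor input = predictor.getInput(0);
        input.resize(new long[]{phones.length});
        input.setData(phones);
        // 4. Run inference
        predictor.run();
        // 5. Read the output tensor back as a flat float array
        return predictor.getOutput(0).getFloatData();
    }
}
```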
## Updating the Model and Input
### Updating the Model
1. Put the optimized models in the `TTSAndroid/app/src/main/assets/models/cpu/` directory. You can swap in any acoustic model and vocoder in `*_pdlite_*.zip/*_arm.nb` format from [released_model.md](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md); note that changing the acoustic model also requires updating the `sentencesToChoose` array in `TTSAndroid/app/src/main/java/com/baidu/paddle/lite/demo/tts/MainActivity.java` accordingly.
2. If the model names are exactly the same as those in the project, i.e. `fastspeech2_csmsc_arm.nb` (assuming the acoustic model's `phone_id_map.txt` is also the same) and `mb_melgan_csmsc_arm.nb`, no code changes are needed; otherwise, update `AMmodelName` and `VOCmodelName` in `TTSAndroid/app/src/main/java/com/baidu/paddle/lite/demo/tts/MainActivity.java`.
<p align="center">
<img src="https://user-images.githubusercontent.com/24568452/204458299-25e305a6-7cbb-4308-86ee-03f146bb938e.png">
</p>
3. If the new model's input/output tensor count, shapes, or dtypes differ, update `TTSAndroid/app/src/main/java/com/baidu/paddle/lite/demo/tts/Predictor.java` accordingly.
### Updating the Input
**This demo does not include a text frontend module.** Preset sentences are selected from the drop-down list and mapped to the corresponding phone IDs in the code. **If you need a text frontend, you have to implement it yourself.** For `phone_id_map.txt`, see [fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip). A sketch of such a mapping follows.
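A hedged sketch of the phone-to-ID mapping step; the `PhoneIdMapper` class is hypothetical, `phone_id_map.txt` is assumed to contain one `<phone> <id>` pair per line, and a real frontend would also need text normalization and G2P:
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: turns a pre-tokenized phone sequence into the
// float[] of IDs that Predictor.runModel expects.
public class PhoneIdMapper {
    private final Map<String, Float> phoneToId = new HashMap<>();

    public PhoneIdMapper(InputStream phoneIdMapTxt) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(phoneIdMapTxt))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length == 2) {
                    phoneToId.put(parts[0], Float.parseFloat(parts[1]));
                }
            }
        }
    }

    // phones: symbols already produced by a G2P / text frontend
    public float[] toIds(String[] phones) {
        float[] ids = new float[phones.length];
        for (int i = 0; i < phones.length; i++) {
            Float id = phoneToId.get(phones[i]);
            if (id == null) {
                throw new IllegalArgumentException("Unknown phone: " + phones[i]);
            }
            ids[i] = id;
        }
        return ids;
    }
}
```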
## Updating TTS Parameters via the Settings Screen
### Settings Screen Parameters
The Settings button in the app lets you update the demo's TTS parameters. The following are currently supported
(default values can be found in `app/src/main/res/values/strings.xml`):
- CPU settings:
    - power_mode: defaults to `LITE_POWER_HIGH`
    - thread_num: defaults to 1
### Updating Parameters on the Settings Screen
1. Open the app, tap the `:` menu icon in the top-right corner, and choose `Settings...` to open the settings screen.
2. Check Enable custom settings ☑️ on the settings screen, then update the desired parameters.
3. For example, to change the thread count, set CPU Thread Num to 4 and go back to the main screen. The app reloads the model automatically; selecting a sentence from the drop-down list triggers synthesis, and when it finishes, the 4-thread latency and result are printed.
## Performance Optimization
If the current performance does not meet your needs and you want to improve the model's performance further, see the [performance optimization docs](https://github.com/PaddlePaddle/Paddle-Lite-Demo#%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96).
## Release
[2022-11-29-app-release.apk](https://paddlespeech.bj.bcebos.com/demos/TTSAndroid/2022-11-29-app-release.apk)
## More
This demo was merged from [yt605155624/TTSAndroid](https://github.com/yt605155624/TTSAndroid).

@ -0,0 +1,108 @@
import java.security.MessageDigest
apply plugin: 'com.android.application'
android {
compileSdkVersion 28
defaultConfig {
applicationId "com.baidu.paddle.lite.demo.tts"
minSdkVersion 15
targetSdkVersion 28
versionCode 1
versionName "1.0"
testInstrumentationRunner "android.support.test.runner.AndroidJUnitRunner"
}
buildTypes {
release {
minifyEnabled false
proguardFiles getDefaultProguardFile('proguard-android-optimize.txt'), 'proguard-rules.pro'
}
}
}
dependencies {
implementation fileTree(include: ['*.jar'], dir: 'libs')
implementation 'com.android.support:appcompat-v7:28.0.0'
implementation 'com.android.support.constraint:constraint-layout:1.1.3'
implementation 'com.android.support:design:28.0.0'
testImplementation 'junit:junit:4.12'
androidTestImplementation 'com.android.support.test:runner:1.0.2'
androidTestImplementation 'com.android.support.test.espresso:espresso-core:3.0.2'
implementation files('libs/PaddlePredictor.jar')
}
def paddleLiteLibs = 'https://paddlespeech.bj.bcebos.com/demos/TTSAndroid/paddle_lite_libs_68b66fd3.tar.gz'
task downloadAndExtractPaddleLiteLibs(type: DefaultTask) {
doFirst {
println "Downloading and extracting Paddle Lite libs"
}
doLast {
// Prepare cache folder for libs
if (!file("cache").exists()) {
mkdir "cache"
}
// Generate cache name for libs
MessageDigest messageDigest = MessageDigest.getInstance('MD5')
messageDigest.update(paddleLiteLibs.bytes)
String cacheName = new BigInteger(1, messageDigest.digest()).toString(32)
// Download libs
if (!file("cache/${cacheName}.tar.gz").exists()) {
ant.get(src: paddleLiteLibs, dest: file("cache/${cacheName}.tar.gz"))
}
// Unpack libs
if (!file("cache/${cacheName}").exists()) {
copy {
from tarTree("cache/${cacheName}.tar.gz")
into "cache/${cacheName}"
}
}
// Copy PaddlePredictor.jar
if (!file("libs/PaddlePredictor.jar").exists()) {
copy {
from "cache/${cacheName}/java/PaddlePredictor.jar"
into "libs"
}
}
if (!file("src/main/jniLibs/arm64-v8a/libpaddle_lite_jni.so").exists()) {
copy {
from "cache/${cacheName}/java/libs/arm64-v8a/"
into "src/main/jniLibs/arm64-v8a"
}
}
}
}
preBuild.dependsOn downloadAndExtractPaddleLiteLibs
def paddleLiteModels = [['src' : 'https://paddlespeech.bj.bcebos.com/demos/TTSAndroid/fs2cnn_mbmelgan_cpu_v1.3.0.tar.gz',
'dest': 'src/main/assets/models'],]
task downloadAndExtractPaddleLiteModels(type: DefaultTask) {
doFirst {
println "Downloading and extracting Paddle Lite models"
}
doLast {
// Prepare cache folder for models
String cachePath = "cache"
if (!file("${cachePath}").exists()) {
mkdir "${cachePath}"
}
paddleLiteModels.eachWithIndex { model, index ->
MessageDigest messageDigest = MessageDigest.getInstance('MD5')
messageDigest.update(model.src.bytes)
String cacheName = new BigInteger(1, messageDigest.digest()).toString(32)
// Download the target model if it does not exist
boolean copyFiles = !file("${model.dest}").exists()
if (!file("${cachePath}/${cacheName}.tar.gz").exists()) {
ant.get(src: model.src, dest: file("${cachePath}/${cacheName}.tar.gz"))
copyFiles = true // force copying files from the latest archive
}
// Copy model file
if (copyFiles) {
copy {
from tarTree("${cachePath}/${cacheName}.tar.gz")
into "${model.dest}"
}
}
}
}
}
preBuild.dependsOn downloadAndExtractPaddleLiteModels

@ -0,0 +1,21 @@
# Add project specific ProGuard rules here.
# You can control the set of applied configuration files using the
# proguardFiles setting in build.gradle.
#
# For more details, see
# http://developer.android.com/guide/developing/tools/proguard.html
# If your project uses WebView with JS, uncomment the following
# and specify the fully qualified class name to the JavaScript interface
# class:
#-keepclassmembers class fqcn.of.javascript.interface.for.webview {
# public *;
#}
# Uncomment this to preserve the line number information for
# debugging stack traces.
#-keepattributes SourceFile,LineNumberTable
# If you keep the line number information, uncomment this to
# hide the original source file name.
#-renamesourcefileattribute SourceFile

@ -0,0 +1,26 @@
package com.baidu.paddle.lite.demo.tts;
import android.content.Context;
import android.support.test.InstrumentationRegistry;
import android.support.test.runner.AndroidJUnit4;
import org.junit.Test;
import org.junit.runner.RunWith;
import static org.junit.Assert.*;
/**
* Instrumented test, which will execute on an Android device.
*
* @see <a href="http://d.android.com/tools/testing">Testing documentation</a>
*/
@RunWith(AndroidJUnit4.class)
public class ExampleInstrumentedTest {
@Test
public void useAppContext() {
// Context of the app under test.
Context appContext = InstrumentationRegistry.getTargetContext();
assertEquals("com.baidu.paddle.lite.demo", appContext.getPackageName());
}
}

@ -0,0 +1,27 @@
<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
package="com.baidu.paddle.lite.demo.tts">
<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE" />
<uses-permission android:name="android.permission.READ_EXTERNAL_STORAGE" />
<application
android:allowBackup="true"
android:icon="@drawable/logo"
android:label="@string/app_name"
android:roundIcon="@drawable/logo"
android:supportsRtl="true"
android:theme="@style/AppTheme">
<activity android:name="com.baidu.paddle.lite.demo.tts.MainActivity">
<intent-filter>
<action android:name="android.intent.action.MAIN" />
<category android:name="android.intent.category.LAUNCHER" />
</intent-filter>
</activity>
<activity
android:name="com.baidu.paddle.lite.demo.tts.SettingsActivity"
android:label="Settings"></activity>
</application>
</manifest>

@ -0,0 +1,122 @@
/*
* Copyright (C) 2014 The Android Open Source Project
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.baidu.paddle.lite.demo.tts;
import android.content.res.Configuration;
import android.os.Bundle;
import android.preference.PreferenceActivity;
import android.support.annotation.LayoutRes;
import android.support.v7.app.ActionBar;
import android.support.v7.app.AppCompatDelegate;
import android.view.MenuInflater;
import android.view.View;
import android.view.ViewGroup;
/**
* A {@link android.preference.PreferenceActivity} which implements and proxies the necessary calls
* to be used with AppCompat.
* <p>
* This technique can be used with an {@link android.app.Activity} class, not just
* {@link android.preference.PreferenceActivity}.
*/
public abstract class AppCompatPreferenceActivity extends PreferenceActivity {
private AppCompatDelegate mDelegate;
@Override
protected void onCreate(Bundle savedInstanceState) {
getDelegate().installViewFactory();
getDelegate().onCreate(savedInstanceState);
super.onCreate(savedInstanceState);
}
@Override
protected void onPostCreate(Bundle savedInstanceState) {
super.onPostCreate(savedInstanceState);
getDelegate().onPostCreate(savedInstanceState);
}
public ActionBar getSupportActionBar() {
return getDelegate().getSupportActionBar();
}
@Override
public MenuInflater getMenuInflater() {
return getDelegate().getMenuInflater();
}
@Override
public void setContentView(@LayoutRes int layoutResID) {
getDelegate().setContentView(layoutResID);
}
@Override
public void setContentView(View view) {
getDelegate().setContentView(view);
}
@Override
public void setContentView(View view, ViewGroup.LayoutParams params) {
getDelegate().setContentView(view, params);
}
@Override
public void addContentView(View view, ViewGroup.LayoutParams params) {
getDelegate().addContentView(view, params);
}
@Override
protected void onPostResume() {
super.onPostResume();
getDelegate().onPostResume();
}
@Override
protected void onTitleChanged(CharSequence title, int color) {
super.onTitleChanged(title, color);
getDelegate().setTitle(title);
}
@Override
public void onConfigurationChanged(Configuration newConfig) {
super.onConfigurationChanged(newConfig);
getDelegate().onConfigurationChanged(newConfig);
}
@Override
protected void onStop() {
super.onStop();
getDelegate().onStop();
}
@Override
protected void onDestroy() {
super.onDestroy();
getDelegate().onDestroy();
}
public void invalidateOptionsMenu() {
getDelegate().invalidateOptionsMenu();
}
private AppCompatDelegate getDelegate() {
if (mDelegate == null) {
mDelegate = AppCompatDelegate.create(this, null);
}
return mDelegate;
}
}

@ -0,0 +1,400 @@
package com.baidu.paddle.lite.demo.tts;
import android.Manifest;
import android.app.ProgressDialog;
import android.content.Intent;
import android.content.SharedPreferences;
import android.content.pm.PackageManager;
import android.media.MediaPlayer;
import android.os.Bundle;
import android.os.Environment;
import android.os.Handler;
import android.os.HandlerThread;
import android.os.Message;
import android.preference.PreferenceManager;
import android.support.annotation.NonNull;
import android.support.v4.app.ActivityCompat;
import android.support.v4.content.ContextCompat;
import android.support.v7.app.AppCompatActivity;
import android.text.method.ScrollingMovementMethod;
import android.util.Log;
import android.view.Menu;
import android.view.MenuInflater;
import android.view.MenuItem;
import android.view.View;
import android.widget.AdapterView;
import android.widget.ArrayAdapter;
import android.widget.Button;
import android.widget.Spinner;
import android.widget.TextView;
import android.widget.Toast;
import java.io.File;
import java.io.IOException;
public class MainActivity extends AppCompatActivity implements View.OnClickListener, MediaPlayer.OnPreparedListener, MediaPlayer.OnErrorListener, AdapterView.OnItemSelectedListener {
public static final int REQUEST_LOAD_MODEL = 0;
public static final int REQUEST_RUN_MODEL = 1;
public static final int RESPONSE_LOAD_MODEL_SUCCESSED = 0;
public static final int RESPONSE_LOAD_MODEL_FAILED = 1;
public static final int RESPONSE_RUN_MODEL_SUCCESSED = 2;
public static final int RESPONSE_RUN_MODEL_FAILED = 3;
public MediaPlayer mediaPlayer = new MediaPlayer();
private static final String TAG = MainActivity.class.getSimpleName();
protected ProgressDialog pbLoadModel = null;
protected ProgressDialog pbRunModel = null;
// Receive messages from worker thread
protected Handler receiver = null;
// Send command to worker thread
protected Handler sender = null;
// Worker thread to load & run the model
protected HandlerThread worker = null;
// UI components of the TTS demo
protected TextView tvInputSetting;
protected TextView tvInferenceTime;
protected Button btn_play;
protected Button btn_pause;
protected Button btn_stop;
// Model settings of the TTS demo
protected String modelPath = "";
protected int cpuThreadNum = 1;
protected String cpuPowerMode = "";
protected Predictor predictor = new Predictor();
int sampleRate = 24000;
private final String wavName = "tts_output.wav";
private final String wavFile = Environment.getExternalStorageDirectory() + File.separator + wavName;
private final String AMmodelName = "fastspeech2_csmsc_arm.nb";
private final String VOCmodelName = "mb_melgan_csmsc_arm.nb";
private float[] phones = {};
private final float[][] sentencesToChoose = {
// 009901 昨日,这名“伤者”与医生全部被警方依法刑事拘留。
{261, 231, 175, 116, 179, 262, 44, 154, 126, 177, 19, 262, 42, 241, 72, 177, 56, 174, 245, 37, 186, 37, 49, 151, 127, 69, 19, 179, 72, 69, 4, 260, 126, 177, 116, 151, 239, 153, 141},
// 009902 钱伟长想到上海来办学校是经过深思熟虑的。
{174, 83, 213, 39, 20, 260, 89, 40, 30, 177, 22, 71, 9, 153, 8, 37, 17, 260, 251, 260, 99, 179, 177, 116, 151, 125, 70, 233, 177, 51, 176, 108, 177, 184, 153, 242, 40, 45},
// 009903 她见我一进门就骂,吃饭时也骂,骂得我抬不起头。
{182, 2, 151, 85, 232, 73, 151, 123, 154, 52, 151, 143, 154, 5, 179, 39, 113, 69, 17, 177, 114, 105, 154, 5, 179, 154, 5, 40, 45, 232, 182, 8, 37, 186, 174, 74, 182, 168},
// 009904 李述德在离开之前,只说了一句“柱驼杀父亲了”。
{153, 74, 177, 186, 40, 42, 261, 10, 153, 73, 152, 7, 262, 113, 174, 83, 179, 262, 115, 177, 230, 153, 45, 73, 151, 242, 180, 262, 186, 182, 231, 177, 2, 69, 186, 174, 124, 153, 45},
// 009905 这种车票和保险单捆绑出售属于重复性购买。
{262, 44, 262, 163, 39, 41, 173, 99, 71, 42, 37, 28, 260, 84, 40, 14, 179, 152, 220, 37, 21, 39, 183, 177, 170, 179, 177, 185, 240, 39, 162, 69, 186, 260, 128, 70, 170, 154, 9},
// 009906 戴佩妮的男友西米露接唱情歌,让她非常开心。
{40, 10, 173, 49, 155, 72, 40, 45, 155, 15, 142, 260, 72, 154, 74, 153, 186, 179, 151, 103, 39, 22, 174, 126, 70, 41, 179, 175, 22, 182, 2, 69, 46, 39, 20, 152, 7, 260, 120},
// 009907 观大势、谋大局、出大策始终是该院的办院方针。
{70, 199, 40, 5, 177, 116, 154, 168, 40, 5, 151, 240, 179, 39, 183, 40, 5, 38, 44, 179, 177, 115, 262, 161, 177, 116, 70, 7, 247, 40, 45, 37, 17, 247, 69, 19, 262, 51},
// 009908 他们骑着摩托回家,正好为农忙时的父母帮忙。
{182, 2, 154, 55, 174, 73, 262, 45, 154, 157, 182, 230, 71, 212, 151, 77, 180, 262, 59, 71, 29, 214, 155, 162, 154, 20, 177, 114, 40, 45, 69, 186, 154, 185, 37, 19, 154, 20},
// 009909 但是因为还没到退休年龄,只能掰着指头捱日子。
{40, 17, 177, 116, 120, 214, 71, 8, 154, 47, 40, 30, 182, 214, 260, 140, 155, 83, 153, 126, 180, 262, 115, 155, 57, 37, 7, 262, 45, 262, 115, 182, 171, 8, 175, 116, 261, 112},
// 009910 这几天雨水不断,人们恨不得待在家里不出门。
{262, 44, 151, 74, 182, 82, 240, 177, 213, 37, 184, 40, 202, 180, 175, 52, 154, 55, 71, 54, 37, 186, 40, 42, 40, 7, 261, 10, 151, 77, 153, 74, 37, 186, 39, 183, 154, 52}
};
@Override
public void onClick(View v) {
switch (v.getId()) {
case R.id.btn_play:
if (!mediaPlayer.isPlaying()) {
mediaPlayer.start();
}
break;
case R.id.btn_pause:
if (mediaPlayer.isPlaying()) {
mediaPlayer.pause();
}
break;
case R.id.btn_stop:
if (mediaPlayer.isPlaying()) {
mediaPlayer.reset();
initMediaPlayer();
}
break;
default:
break;
}
}
private void initMediaPlayer() {
try {
File file = new File(wavFile);
// Point the MediaPlayer at the synthesized audio file
mediaPlayer.setDataSource(file.getPath());
// Move the MediaPlayer into the Prepared state so start() can be called;
// the WAV file is small and local, so a synchronous prepare() is sufficient
mediaPlayer.prepare();
// Enabling this listener would start playback as soon as the player is prepared
// mediaPlayer.setOnPreparedListener(this);
} catch (Exception e) {
e.printStackTrace();
}
}
@Override
public void onPrepared(MediaPlayer player) {
player.start();
}
@Override
public boolean onError(MediaPlayer mp, int what, int extra) {
// The MediaPlayer has moved to the Error state, must be reset!
mediaPlayer.reset();
initMediaPlayer();
return true;
}
@Override
protected void onCreate(Bundle savedInstanceState) {
requestAllPermissions();
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
// Initialize the UI controls
Spinner spinner = findViewById(R.id.spinner1);
// Build the data source
String[] sentences = getResources().getStringArray(R.array.text);
// Create the Adapter and bind the data source; the arguments are the host
// Activity, the system drop-down item style, and the string array
ArrayAdapter<String> adapter = new ArrayAdapter<String>(this, android.R.layout.simple_spinner_dropdown_item, sentences);
spinner.setAdapter(adapter); // bind the Adapter to the control
spinner.setOnItemSelectedListener(this);
btn_play = findViewById(R.id.btn_play);
btn_pause = findViewById(R.id.btn_pause);
btn_stop = findViewById(R.id.btn_stop);
btn_play.setOnClickListener(this);
btn_pause.setOnClickListener(this);
btn_stop.setOnClickListener(this);
btn_play.setVisibility(View.INVISIBLE);
btn_pause.setVisibility(View.INVISIBLE);
btn_stop.setVisibility(View.INVISIBLE);
// Clear all setting items to avoid app crashing due to the incorrect settings
SharedPreferences sharedPreferences = PreferenceManager.getDefaultSharedPreferences(this);
SharedPreferences.Editor editor = sharedPreferences.edit();
editor.clear();
editor.commit();
// Prepare the worker thread for model loading and inference
receiver = new Handler() {
@Override
public void handleMessage(Message msg) {
switch (msg.what) {
case RESPONSE_LOAD_MODEL_SUCCESSED:
pbLoadModel.dismiss();
onLoadModelSuccessed();
break;
case RESPONSE_LOAD_MODEL_FAILED:
pbLoadModel.dismiss();
Toast.makeText(MainActivity.this, "Load model failed!", Toast.LENGTH_SHORT).show();
onLoadModelFailed();
break;
case RESPONSE_RUN_MODEL_SUCCESSED:
pbRunModel.dismiss();
onRunModelSuccessed();
break;
case RESPONSE_RUN_MODEL_FAILED:
pbRunModel.dismiss();
Toast.makeText(MainActivity.this, "Run model failed!", Toast.LENGTH_SHORT).show();
onRunModelFailed();
break;
default:
break;
}
}
};
worker = new HandlerThread("Predictor Worker");
worker.start();
sender = new Handler(worker.getLooper()) {
public void handleMessage(Message msg) {
switch (msg.what) {
case REQUEST_LOAD_MODEL:
// Load the model
if (onLoadModel()) {
receiver.sendEmptyMessage(RESPONSE_LOAD_MODEL_SUCCESSED);
} else {
receiver.sendEmptyMessage(RESPONSE_LOAD_MODEL_FAILED);
}
break;
case REQUEST_RUN_MODEL:
// Run model if model is loaded
if (onRunModel()) {
receiver.sendEmptyMessage(RESPONSE_RUN_MODEL_SUCCESSED);
} else {
receiver.sendEmptyMessage(RESPONSE_RUN_MODEL_FAILED);
}
break;
default:
break;
}
}
};
// Setup the UI components
tvInputSetting = findViewById(R.id.tv_input_setting);
tvInferenceTime = findViewById(R.id.tv_inference_time);
tvInputSetting.setMovementMethod(ScrollingMovementMethod.getInstance());
}
@Override
protected void onResume() {
super.onResume();
boolean settingsChanged = false;
SharedPreferences sharedPreferences = PreferenceManager.getDefaultSharedPreferences(this);
String model_path = sharedPreferences.getString(getString(R.string.MODEL_PATH_KEY),
getString(R.string.MODEL_PATH_DEFAULT));
settingsChanged |= !model_path.equalsIgnoreCase(modelPath);
int cpu_thread_num = Integer.parseInt(sharedPreferences.getString(getString(R.string.CPU_THREAD_NUM_KEY),
getString(R.string.CPU_THREAD_NUM_DEFAULT)));
settingsChanged |= cpu_thread_num != cpuThreadNum;
String cpu_power_mode =
sharedPreferences.getString(getString(R.string.CPU_POWER_MODE_KEY),
getString(R.string.CPU_POWER_MODE_DEFAULT));
settingsChanged |= !cpu_power_mode.equalsIgnoreCase(cpuPowerMode);
if (settingsChanged) {
modelPath = model_path;
cpuThreadNum = cpu_thread_num;
cpuPowerMode = cpu_power_mode;
// Update UI
tvInputSetting.setText("Model: " + modelPath.substring(modelPath.lastIndexOf("/") + 1) + "\n" + "CPU" +
" Thread Num: " + cpuThreadNum + "\n" + "CPU Power Mode: " + cpuPowerMode + "\n");
tvInputSetting.scrollTo(0, 0);
// Reload model if configure has been changed
loadModel();
}
}
public void loadModel() {
pbLoadModel = ProgressDialog.show(this, "", "Loading model...", false, false);
sender.sendEmptyMessage(REQUEST_LOAD_MODEL);
}
public void runModel() {
pbRunModel = ProgressDialog.show(this, "", "Running model...", false, false);
sender.sendEmptyMessage(REQUEST_RUN_MODEL);
}
public boolean onLoadModel() {
return predictor.init(MainActivity.this, modelPath, AMmodelName, VOCmodelName, cpuThreadNum,
cpuPowerMode);
}
public boolean onRunModel() {
return predictor.isLoaded() && predictor.runModel(phones);
}
public boolean onLoadModelSuccessed() {
// Optionally run the model right after it loads:
// runModel();
return true;
}
public void onLoadModelFailed() {
}
public void onRunModelSuccessed() {
// Obtain results and update UI
btn_play.setVisibility(View.VISIBLE);
btn_pause.setVisibility(View.VISIBLE);
btn_stop.setVisibility(View.VISIBLE);
// RTF = inference time / audio duration, where audio duration (ms) = wav.length / sampleRate * 1000
tvInferenceTime.setText("Inference done\nInference time: " + predictor.inferenceTime() + " ms"
+ "\nRTF: " + predictor.inferenceTime() * sampleRate / (predictor.wav.length * 1000) + "\nAudio saved in " + wavFile);
try {
Utils.rawToWave(wavFile, predictor.wav, sampleRate);
} catch (IOException e) {
e.printStackTrace();
}
if (ContextCompat.checkSelfPermission(MainActivity.this,
Manifest.permission.WRITE_EXTERNAL_STORAGE) != PackageManager.PERMISSION_GRANTED) {
ActivityCompat.requestPermissions(MainActivity.this, new String[]{Manifest.permission.WRITE_EXTERNAL_STORAGE}, 1);
} else {
// Initialize the MediaPlayer
initMediaPlayer();
}
}
public void onRunModelFailed() {
}
public void onSettingsClicked() {
startActivity(new Intent(MainActivity.this, SettingsActivity.class));
}
@Override
public boolean onCreateOptionsMenu(Menu menu) {
MenuInflater inflater = getMenuInflater();
inflater.inflate(R.menu.menu_action_options, menu);
return true;
}
@Override
public boolean onOptionsItemSelected(MenuItem item) {
switch (item.getItemId()) {
case android.R.id.home:
finish();
break;
case R.id.settings:
onSettingsClicked();
}
return super.onOptionsItemSelected(item);
}
@Override
public void onRequestPermissionsResult(int requestCode, @NonNull String[] permissions,
@NonNull int[] grantResults) {
super.onRequestPermissionsResult(requestCode, permissions, grantResults);
if (grantResults[0] != PackageManager.PERMISSION_GRANTED) {
Toast.makeText(this, "Permission Denied", Toast.LENGTH_SHORT).show();
}
}
@Override
protected void onDestroy() {
if (predictor != null) {
predictor.releaseModel();
}
worker.quit();
super.onDestroy();
if (mediaPlayer != null) {
mediaPlayer.stop();
mediaPlayer.release();
}
}
private boolean requestAllPermissions() {
// The demo only needs to write the synthesized WAV file to external storage
if (ContextCompat.checkSelfPermission(this, Manifest.permission.WRITE_EXTERNAL_STORAGE)
!= PackageManager.PERMISSION_GRANTED) {
ActivityCompat.requestPermissions(this, new String[]{Manifest.permission.WRITE_EXTERNAL_STORAGE},
0);
return false;
}
return true;
}
@Override
public void onItemSelected(AdapterView<?> parent, View view, int position, long id) {
if (position > 0) {
phones = sentencesToChoose[position - 1];
runModel();
}
}
@Override
public void onNothingSelected(AdapterView<?> parent) {
}
}

@ -0,0 +1,149 @@
package com.baidu.paddle.lite.demo.tts;
import android.content.Context;
import android.util.Log;
import com.baidu.paddle.lite.MobileConfig;
import com.baidu.paddle.lite.PaddlePredictor;
import com.baidu.paddle.lite.PowerMode;
import com.baidu.paddle.lite.Tensor;
import java.io.File;
import java.util.Date;
public class Predictor {
private static final String TAG = Predictor.class.getSimpleName();
public boolean isLoaded = false;
public int cpuThreadNum = 1;
public String cpuPowerMode = "LITE_POWER_HIGH";
public String modelPath = "";
protected PaddlePredictor AMPredictor = null;
protected PaddlePredictor VOCPredictor = null;
protected float inferenceTime = 0;
protected float[] wav;
public boolean init(Context appCtx, String modelPath, String AMmodelName, String VOCmodelName, int cpuThreadNum, String cpuPowerMode) {
// Release model if exists
releaseModel();
AMPredictor = loadModel(appCtx, modelPath, AMmodelName, cpuThreadNum, cpuPowerMode);
if (AMPredictor == null) {
return false;
}
VOCPredictor = loadModel(appCtx, modelPath, VOCmodelName, cpuThreadNum, cpuPowerMode);
if (VOCPredictor == null) {
return false;
}
isLoaded = true;
return true;
}
protected PaddlePredictor loadModel(Context appCtx, String modelPath, String modelName, int cpuThreadNum, String cpuPowerMode) {
// Load model
if (modelPath.isEmpty()) {
return null;
}
String realPath = modelPath;
if (modelPath.charAt(0) != '/') {
// Read model files from the custom path if the first character of the model path is '/',
// otherwise copy the model from assets into the cache directory
realPath = appCtx.getCacheDir() + "/" + modelPath;
// Copy the model files from assets onto the device
Utils.copyDirectoryFromAssets(appCtx, modelPath, realPath);
}
if (realPath.isEmpty()) {
return null;
}
MobileConfig config = new MobileConfig();
config.setModelFromFile(realPath + File.separator + modelName);
Log.e(TAG, "File:" + realPath + File.separator + modelName);
config.setThreads(cpuThreadNum);
if (cpuPowerMode.equalsIgnoreCase("LITE_POWER_HIGH")) {
config.setPowerMode(PowerMode.LITE_POWER_HIGH);
} else if (cpuPowerMode.equalsIgnoreCase("LITE_POWER_LOW")) {
config.setPowerMode(PowerMode.LITE_POWER_LOW);
} else if (cpuPowerMode.equalsIgnoreCase("LITE_POWER_FULL")) {
config.setPowerMode(PowerMode.LITE_POWER_FULL);
} else if (cpuPowerMode.equalsIgnoreCase("LITE_POWER_NO_BIND")) {
config.setPowerMode(PowerMode.LITE_POWER_NO_BIND);
} else if (cpuPowerMode.equalsIgnoreCase("LITE_POWER_RAND_HIGH")) {
config.setPowerMode(PowerMode.LITE_POWER_RAND_HIGH);
} else if (cpuPowerMode.equalsIgnoreCase("LITE_POWER_RAND_LOW")) {
config.setPowerMode(PowerMode.LITE_POWER_RAND_LOW);
} else {
Log.e(TAG, "Unknown cpu power mode!");
return null;
}
return PaddlePredictor.createPaddlePredictor(config);
}
public void releaseModel() {
AMPredictor = null;
VOCPredictor = null;
isLoaded = false;
cpuThreadNum = 1;
cpuPowerMode = "LITE_POWER_HIGH";
modelPath = "";
}
public boolean runModel(float[] phones) {
if (!isLoaded()) {
return false;
}
Date start = new Date();
Tensor am_output_handle = getAMOutput(phones, AMPredictor);
wav = getVOCOutput(am_output_handle, VOCPredictor);
Date end = new Date();
inferenceTime = (end.getTime() - start.getTime());
return true;
}
public Tensor getAMOutput(float[] phones, PaddlePredictor am_predictor) {
Tensor phones_handle = am_predictor.getInput(0);
long[] dims = {phones.length};
phones_handle.resize(dims);
phones_handle.setData(phones);
am_predictor.run();
Tensor am_output_handle = am_predictor.getOutput(0);
// [?, 80]
// long outputShape[] = am_output_handle.shape();
float[] am_output_data = am_output_handle.getFloatData();
// [? x 80]
// long[] am_output_data_shape = {am_output_data.length};
// Log.e(TAG, Arrays.toString(am_output_data));
// Print the mel array row by row:
// for (int i = 0; i < outputShape[0]; i++) {
// Log.e(TAG, Arrays.toString(Arrays.copyOfRange(am_output_data, i * 80, (i + 1) * 80)));
// }
// voc_predictor needs to know the input shape, so return the Tensor itself
// rather than the flattened one-dimensional float array
return am_output_handle;
}
public float[] getVOCOutput(Tensor input, PaddlePredictor voc_predictor) {
Tensor mel_handle = voc_predictor.getInput(0);
// [?, 80]
long[] dims = input.shape();
mel_handle.resize(dims);
float[] am_output_data = input.getFloatData();
mel_handle.setData(am_output_data);
voc_predictor.run();
Tensor voc_output_handle = voc_predictor.getOutput(0);
// [? x 300, 1]
// long[] outputShape = voc_output_handle.shape();
float[] voc_output_data = voc_output_handle.getFloatData();
// long[] voc_output_data_shape = {voc_output_data.length};
return voc_output_data;
}
public boolean isLoaded() {
return AMPredictor != null && VOCPredictor != null && isLoaded;
}
public float inferenceTime() {
return inferenceTime;
}
}

@ -0,0 +1,111 @@
package com.baidu.paddle.lite.demo.tts;
import android.content.SharedPreferences;
import android.os.Bundle;
import android.preference.CheckBoxPreference;
import android.preference.EditTextPreference;
import android.preference.ListPreference;
import android.support.v7.app.ActionBar;
import java.util.ArrayList;
import java.util.List;
public class SettingsActivity extends AppCompatPreferenceActivity implements SharedPreferences.OnSharedPreferenceChangeListener {
ListPreference lpChoosePreInstalledModel = null;
CheckBoxPreference cbEnableCustomSettings = null;
EditTextPreference etModelPath = null;
ListPreference lpCPUThreadNum = null;
ListPreference lpCPUPowerMode = null;
List<String> preInstalledModelPaths = null;
List<String> preInstalledCPUThreadNums = null;
List<String> preInstalledCPUPowerModes = null;
@Override
public void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
addPreferencesFromResource(R.xml.settings);
ActionBar supportActionBar = getSupportActionBar();
if (supportActionBar != null) {
supportActionBar.setDisplayHomeAsUpEnabled(true);
}
// Initialize the pre-installed models
preInstalledModelPaths = new ArrayList<String>();
preInstalledCPUThreadNums = new ArrayList<String>();
preInstalledCPUPowerModes = new ArrayList<String>();
preInstalledModelPaths.add(getString(R.string.MODEL_PATH_DEFAULT));
preInstalledCPUThreadNums.add(getString(R.string.CPU_THREAD_NUM_DEFAULT));
preInstalledCPUPowerModes.add(getString(R.string.CPU_POWER_MODE_DEFAULT));
// Setup UI components
lpChoosePreInstalledModel = (ListPreference) findPreference(getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY));
String[] preInstalledModelNames = new String[preInstalledModelPaths.size()];
for (int i = 0; i < preInstalledModelPaths.size(); i++) {
preInstalledModelNames[i] = preInstalledModelPaths.get(i).substring(preInstalledModelPaths.get(i).lastIndexOf("/") + 1);
}
lpChoosePreInstalledModel.setEntries(preInstalledModelNames);
lpChoosePreInstalledModel.setEntryValues(preInstalledModelPaths.toArray(new String[preInstalledModelPaths.size()]));
lpCPUThreadNum = (ListPreference) findPreference(getString(R.string.CPU_THREAD_NUM_KEY));
lpCPUPowerMode = (ListPreference) findPreference(getString(R.string.CPU_POWER_MODE_KEY));
cbEnableCustomSettings = (CheckBoxPreference) findPreference(getString(R.string.ENABLE_CUSTOM_SETTINGS_KEY));
etModelPath = (EditTextPreference) findPreference(getString(R.string.MODEL_PATH_KEY));
etModelPath.setTitle("Model Path (SDCard: " + Utils.getSDCardDirectory() + ")");
}
private void reloadPreferenceAndUpdateUI() {
SharedPreferences sharedPreferences = getPreferenceScreen().getSharedPreferences();
boolean enableCustomSettings = sharedPreferences.getBoolean(getString(R.string.ENABLE_CUSTOM_SETTINGS_KEY), false);
String modelPath = sharedPreferences.getString(getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY), getString(R.string.MODEL_PATH_DEFAULT));
int modelIdx = lpChoosePreInstalledModel.findIndexOfValue(modelPath);
if (modelIdx >= 0 && modelIdx < preInstalledModelPaths.size()) {
if (!enableCustomSettings) {
SharedPreferences.Editor editor = sharedPreferences.edit();
editor.putString(getString(R.string.MODEL_PATH_KEY), preInstalledModelPaths.get(modelIdx));
editor.putString(getString(R.string.CPU_THREAD_NUM_KEY), preInstalledCPUThreadNums.get(modelIdx));
editor.putString(getString(R.string.CPU_POWER_MODE_KEY), preInstalledCPUPowerModes.get(modelIdx));
editor.commit();
}
lpChoosePreInstalledModel.setSummary(modelPath);
}
cbEnableCustomSettings.setChecked(enableCustomSettings);
etModelPath.setEnabled(enableCustomSettings);
lpCPUThreadNum.setEnabled(enableCustomSettings);
lpCPUPowerMode.setEnabled(enableCustomSettings);
modelPath = sharedPreferences.getString(getString(R.string.MODEL_PATH_KEY), getString(R.string.MODEL_PATH_DEFAULT));
String cpuThreadNum = sharedPreferences.getString(getString(R.string.CPU_THREAD_NUM_KEY), getString(R.string.CPU_THREAD_NUM_DEFAULT));
String cpuPowerMode = sharedPreferences.getString(getString(R.string.CPU_POWER_MODE_KEY), getString(R.string.CPU_POWER_MODE_DEFAULT));
etModelPath.setSummary(modelPath);
etModelPath.setText(modelPath);
lpCPUThreadNum.setValue(cpuThreadNum);
lpCPUThreadNum.setSummary(cpuThreadNum);
lpCPUPowerMode.setValue(cpuPowerMode);
lpCPUPowerMode.setSummary(cpuPowerMode);
}
@Override
protected void onResume() {
super.onResume();
getPreferenceScreen().getSharedPreferences().registerOnSharedPreferenceChangeListener(this);
reloadPreferenceAndUpdateUI();
}
@Override
protected void onPause() {
super.onPause();
getPreferenceScreen().getSharedPreferences().unregisterOnSharedPreferenceChangeListener(this);
}
@Override
public void onSharedPreferenceChanged(SharedPreferences sharedPreferences, String key) {
if (key.equals(getString(R.string.CHOOSE_PRE_INSTALLED_MODEL_KEY))) {
SharedPreferences.Editor editor = sharedPreferences.edit();
editor.putBoolean(getString(R.string.ENABLE_CUSTOM_SETTINGS_KEY), false);
editor.commit();
}
reloadPreferenceAndUpdateUI();
}
}

@ -0,0 +1,155 @@
package com.baidu.paddle.lite.demo.tts;
import static java.lang.Math.abs;
import android.content.Context;
import android.os.Environment;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
public class Utils {
public static void copyFileFromAssets(Context appCtx, String srcPath, String dstPath) {
if (srcPath.isEmpty() || dstPath.isEmpty()) {
return;
}
InputStream is = null;
OutputStream os = null;
try {
is = new BufferedInputStream(appCtx.getAssets().open(srcPath));
os = new BufferedOutputStream(new FileOutputStream(new File(dstPath)));
byte[] buffer = new byte[1024];
int length = 0;
while ((length = is.read(buffer)) != -1) {
os.write(buffer, 0, length);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
os.close();
is.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
public static void copyDirectoryFromAssets(Context appCtx, String srcDir, String dstDir) {
if (srcDir.isEmpty() || dstDir.isEmpty()) {
return;
}
try {
if (!new File(dstDir).exists()) {
new File(dstDir).mkdirs();
}
for (String fileName : appCtx.getAssets().list(srcDir)) {
String srcSubPath = srcDir + File.separator + fileName;
String dstSubPath = dstDir + File.separator + fileName;
if (new File(srcSubPath).isDirectory()) {
copyDirectoryFromAssets(appCtx, srcSubPath, dstSubPath);
} else {
copyFileFromAssets(appCtx, srcSubPath, dstSubPath);
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
public static String getSDCardDirectory() {
return Environment.getExternalStorageDirectory().getAbsolutePath();
}
public static void rawToWave(String file, float[] data, int samplerate) throws IOException {
// Create the empty WAV file
File waveFile = new File(file);
waveFile.createNewFile();
// The following block converts the raw float samples into a 16-bit PCM WAV
DataOutputStream output = null;
try {
output = new DataOutputStream(new FileOutputStream(waveFile));
// WAVE header
// chunk id
writeString(output, "RIFF");
// chunk size
writeInt(output, 36 + data.length * 2);
// format
writeString(output, "WAVE");
// subchunk 1 id
writeString(output, "fmt ");
// subchunk 1 size
writeInt(output, 16);
// audio format (1 = PCM)
writeShort(output, (short) 1);
// number of channels
writeShort(output, (short) 1);
// sample rate
writeInt(output, samplerate);
// byte rate
writeInt(output, samplerate * 2);
// block align
writeShort(output, (short) 2);
// bits per sample
writeShort(output, (short) 16);
// subchunk 2 id
writeString(output, "data");
// subchunk 2 size
writeInt(output, data.length * 2);
short[] short_data = FloatArray2ShortArray(data);
for (int i = 0; i < short_data.length; i++) {
writeShort(output, short_data[i]);
}
} finally {
if (output != null) {
output.close();
}
}
}
// WAV headers are little-endian, while DataOutputStream writes big-endian,
// so emit the bytes manually from least to most significant
private static void writeInt(final DataOutputStream output, final int value) throws IOException {
output.write(value);
output.write(value >> 8);
output.write(value >> 16);
output.write(value >> 24);
}
private static void writeShort(final DataOutputStream output, final short value) throws IOException {
output.write(value);
output.write(value >> 8);
}
private static void writeString(final DataOutputStream output, final String value) throws IOException {
for (int i = 0; i < value.length(); i++) {
output.write(value.charAt(i));
}
}
public static short[] FloatArray2ShortArray(float[] values) {
// Find the peak magnitude (floored at 0.01 to avoid amplifying near-silence)
float mmax = (float) 0.01;
short[] ret = new short[values.length];
for (int i = 0; i < values.length; i++) {
if (abs(values[i]) > mmax) {
mmax = abs(values[i]);
}
}
// Peak-normalize into the int16 range and truncate to shorts
for (int i = 0; i < values.length; i++) {
values[i] = values[i] * (32767 / mmax);
ret[i] = (short) (values[i]);
}
return ret;
}
}

@ -0,0 +1,20 @@
<?xml version="1.0" encoding="utf-8"?>
<selector xmlns:android="http://schemas.android.com/apk/res/android">
<item android:state_pressed="false"><!--没点击按钮的时候-->
<shape android:shape="rectangle"><!--按钮形状-->
<solid android:color="#008577" /><!--按钮背景填充色-->
<corners android:radius="10dp" />
<stroke android:width="1dp" android:color="#009688" /><!--按钮边框-->
</shape>
</item>
<item android:state_pressed="true">
<shape android:shape="rectangle"><!--按钮形状-->
<solid android:color="#C3009688" /><!--按钮背景填充色-->
<corners android:radius="10dp" />
<stroke android:width="1dp" android:color="#009688" /><!--按钮边框-->
</shape>
</item>
</selector>

Binary file not shown. (new image, 9.1 KiB)

Binary file not shown. (new image, 35 KiB)

@ -0,0 +1,112 @@
<?xml version="1.0" encoding="utf-8"?>
<android.support.constraint.ConstraintLayout xmlns:android="http://schemas.android.com/apk/res/android"
xmlns:tools="http://schemas.android.com/tools"
android:layout_width="match_parent"
android:layout_height="match_parent"
tools:context=".MainActivity">
<RelativeLayout
android:layout_width="match_parent"
android:layout_height="match_parent">
<ImageView
android:id="@+id/logo"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:layout_marginTop="20dp"
android:src="@drawable/paddlespeech_logo" />
<LinearLayout
android:id="@+id/v_input_info"
android:layout_width="fill_parent"
android:layout_height="wrap_content"
android:layout_below="@+id/logo"
android:layout_alignParentTop="true"
android:layout_marginTop="120dp"
android:orientation="vertical">
<TextView
android:id="@+id/tv_input_setting"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:layout_marginLeft="12dp"
android:layout_marginTop="10dp"
android:layout_marginRight="12dp"
android:layout_marginBottom="5dp"
android:lineSpacingExtra="4dp"
android:maxLines="6"
android:scrollbars="vertical"
android:singleLine="false"
android:text=""
android:textColor="#3C3C3C" />
<Spinner
android:id="@+id/spinner1"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:dropDownSelector="#63D81B60"
android:spinnerMode="dropdown" />
<TextView
android:id="@+id/tv_inference_time"
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:layout_below="@+id/spinner1"
android:layout_centerHorizontal="true"
android:layout_centerVertical="true"
android:layout_marginLeft="12dp"
android:layout_marginTop="50dp"
android:layout_marginRight="12dp"
android:layout_marginBottom="5dp"
android:gravity="start"
android:lineSpacingExtra="4dp"
android:maxLines="6"
android:textColor="#3C3C3C" />
<LinearLayout
android:id="@+id/btns"
android:layout_width="match_parent"
android:layout_height="match_parent"
android:layout_below="@+id/tv_inference_time"
android:layout_marginLeft="10dp"
android:layout_marginTop="30dp">
<Button
android:id="@+id/btn_play"
android:layout_width="60dp"
android:layout_height="40dp"
android:background="@drawable/button_drawable"
android:text="Play"
android:textAllCaps="false"
android:textColor="#ffffff" />
<Button
android:id="@+id/btn_pause"
android:layout_width="60dp"
android:layout_height="40dp"
android:layout_marginLeft="3dp"
android:background="@drawable/button_drawable"
android:text="Pause"
android:textAllCaps="false"
android:textColor="#ffffff" />
<Button
android:id="@+id/btn_stop"
android:layout_width="60dp"
android:layout_height="40dp"
android:layout_marginLeft="3dp"
android:background="@drawable/button_drawable"
android:text="Stop"
android:textAllCaps="false"
android:textColor="#ffffff" />
</LinearLayout>
</LinearLayout>
</RelativeLayout>
</android.support.constraint.ConstraintLayout>

@ -0,0 +1,9 @@
<menu xmlns:android="http://schemas.android.com/apk/res/android"
xmlns:app="http://schemas.android.com/apk/res-auto">
<group>
<item
android:id="@+id/settings"
android:title="Settings..."
app:showAsAction="withText" />
</group>
</menu>

@ -0,0 +1,44 @@
<?xml version="1.0" encoding="utf-8"?>
<resources>
<string-array name="cpu_thread_num_entries">
<item>1 thread</item>
<item>2 threads</item>
<item>4 threads</item>
<item>8 threads</item>
</string-array>
<string-array name="cpu_thread_num_values">
<item>1</item>
<item>2</item>
<item>4</item>
<item>8</item>
</string-array>
<string-array name="cpu_power_mode_entries">
<item>HIGH(only big cores)</item>
<item>LOW(only LITTLE cores)</item>
<item>FULL(all cores)</item>
<item>NO_BIND(depends on system)</item>
<item>RAND_HIGH</item>
<item>RAND_LOW</item>
</string-array>
<string-array name="cpu_power_mode_values">
<item>LITE_POWER_HIGH</item>
<item>LITE_POWER_LOW</item>
<item>LITE_POWER_FULL</item>
<item>LITE_POWER_NO_BIND</item>
<item>LITE_POWER_RAND_HIGH</item>
<item>LITE_POWER_RAND_LOW</item>
</string-array>
<string-array name="text">
<item>Please select a sentence to be synthesized</item>
<item>昨日,这名“伤者”与医生全部被警方依法刑事拘留。</item>
<item>钱伟长想到上海来办学校是经过深思熟虑的。</item>
<item>她见我一进门就骂,吃饭时也骂,骂得我抬不起头。</item>
<item>李述德在离开之前,只说了一句“柱驼杀父亲了”。</item>
<item>这种车票和保险单捆绑出售属于重复性购买。</item>
<item>戴佩妮的男友西米露接唱情歌,让她非常开心。</item>
<item>观大势、谋大局、出大策始终是该院的办院方针。</item>
<item>他们骑着摩托回家,正好为农忙时的父母帮忙。</item>
<item>但是因为还没到退休年龄,只能掰着指头捱日子。</item>
<item>这几天雨水不断,人们恨不得待在家里不出门。</item>
</string-array>
</resources>

@ -0,0 +1,6 @@
<?xml version="1.0" encoding="utf-8"?>
<resources>
<color name="colorPrimary">#008577</color>
<color name="colorPrimaryDark">#00574B</color>
<color name="colorAccent">#D81B60</color>
</resources>

@ -0,0 +1,12 @@
<resources>
<string name="app_name">TTS</string>
<string name="CHOOSE_PRE_INSTALLED_MODEL_KEY">CHOOSE_PRE_INSTALLED_MODEL_KEY</string>
<string name="ENABLE_CUSTOM_SETTINGS_KEY">ENABLE_CUSTOM_SETTINGS_KEY</string>
<string name="MODEL_PATH_KEY">MODEL_PATH_KEY</string>
<string name="CPU_THREAD_NUM_KEY">CPU_THREAD_NUM_KEY</string>
<string name="CPU_POWER_MODE_KEY">CPU_POWER_MODE_KEY</string>
<string name="MODEL_PATH_DEFAULT">models/cpu</string>
<string name="CPU_THREAD_NUM_DEFAULT">1</string>
<string name="CPU_POWER_MODE_DEFAULT">LITE_POWER_HIGH</string>
</resources>

@ -0,0 +1,16 @@
<resources>
<!-- Base application theme. -->
<style name="AppTheme" parent="Theme.AppCompat.Light.DarkActionBar">
<!-- Customize your theme here. -->
<item name="colorPrimary">@color/colorPrimary</item>
<item name="colorPrimaryDark">@color/colorPrimaryDark</item>
<item name="colorAccent">@color/colorAccent</item>
<item name="actionOverflowMenuStyle">@style/OverflowMenuStyle</item>
</style>
<style name="OverflowMenuStyle" parent="Widget.AppCompat.Light.PopupMenu.Overflow">
<item name="overlapAnchor">false</item>
</style>
</resources>

@ -0,0 +1,39 @@
<?xml version="1.0" encoding="utf-8"?>
<PreferenceScreen xmlns:android="http://schemas.android.com/apk/res/android">
<PreferenceCategory android:title="Model Settings">
<ListPreference
android:defaultValue="@string/MODEL_PATH_DEFAULT"
android:key="@string/CHOOSE_PRE_INSTALLED_MODEL_KEY"
android:negativeButtonText="@null"
android:positiveButtonText="@null"
android:title="Choose pre-installed models" />
<CheckBoxPreference
android:defaultValue="false"
android:key="@string/ENABLE_CUSTOM_SETTINGS_KEY"
android:summaryOff="Disable"
android:summaryOn="Enable"
android:title="Enable custom settings" />
<EditTextPreference
android:defaultValue="@string/MODEL_PATH_DEFAULT"
android:key="@string/MODEL_PATH_KEY"
android:title="Model Path" />
</PreferenceCategory>
<PreferenceCategory android:title="CPU Settings">
<ListPreference
android:defaultValue="@string/CPU_THREAD_NUM_DEFAULT"
android:entries="@array/cpu_thread_num_entries"
android:entryValues="@array/cpu_thread_num_values"
android:key="@string/CPU_THREAD_NUM_KEY"
android:negativeButtonText="@null"
android:positiveButtonText="@null"
android:title="CPU Thread Num" />
<ListPreference
android:defaultValue="@string/CPU_POWER_MODE_DEFAULT"
android:entries="@array/cpu_power_mode_entries"
android:entryValues="@array/cpu_power_mode_values"
android:key="@string/CPU_POWER_MODE_KEY"
android:negativeButtonText="@null"
android:positiveButtonText="@null"
android:title="CPU Power Mode" />
</PreferenceCategory>
</PreferenceScreen>

@ -0,0 +1,17 @@
package com.baidu.paddle.lite.demo.tts;
import static org.junit.Assert.assertEquals;
import org.junit.Test;
/**
* Example local unit test, which will execute on the development machine (host).
*
* @see <a href="http://d.android.com/tools/testing">Testing documentation</a>
*/
public class ExampleUnitTest {
@Test
public void addition_isCorrect() {
assertEquals(4, 2 + 2);
}
}

@ -0,0 +1,27 @@
// Top-level build file where you can add configuration options common to all sub-projects/modules.
buildscript {
repositories {
google()
jcenter()
}
dependencies {
classpath 'com.android.tools.build:gradle:4.1.0'
// NOTE: Do not place your application dependencies here; they belong
// in the individual module build.gradle files
}
}
allprojects {
repositories {
google()
jcenter()
}
}
task clean(type: Delete) {
delete rootProject.buildDir
}

@ -0,0 +1,15 @@
# Project-wide Gradle settings.
# IDE (e.g. Android Studio) users:
# Gradle settings configured through the IDE *will override*
# any settings specified in this file.
# For more details on how to configure your build environment visit
# http://www.gradle.org/docs/current/userguide/build_environment.html
# Specifies the JVM arguments used for the daemon process.
# The setting is particularly useful for tweaking memory settings.
org.gradle.jvmargs=-Xmx1536m
# When configured, Gradle will run in incubating parallel mode.
# This option should only be used with decoupled projects. More details, visit
# http://www.gradle.org/docs/current/userguide/multi_project_builds.html#sec:decoupled_projects
# org.gradle.parallel=true

@ -0,0 +1,6 @@
#Wed Jun 16 14:31:28 CST 2021
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.0-all.zip

@ -0,0 +1,172 @@
#!/usr/bin/env sh
##############################################################################
##
## Gradle start up script for UN*X
##
##############################################################################
# Attempt to set APP_HOME
# Resolve links: $0 may be a link
PRG="$0"
# Need this for relative symlinks.
while [ -h "$PRG" ] ; do
ls=`ls -ld "$PRG"`
link=`expr "$ls" : '.*-> \(.*\)$'`
if expr "$link" : '/.*' > /dev/null; then
PRG="$link"
else
PRG=`dirname "$PRG"`"/$link"
fi
done
SAVED="`pwd`"
cd "`dirname \"$PRG\"`/" >/dev/null
APP_HOME="`pwd -P`"
cd "$SAVED" >/dev/null
APP_NAME="Gradle"
APP_BASE_NAME=`basename "$0"`
# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS=""
# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD="maximum"
warn () {
echo "$*"
}
die () {
echo
echo "$*"
echo
exit 1
}
# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "`uname`" in
CYGWIN* )
cygwin=true
;;
Darwin* )
darwin=true
;;
MINGW* )
msys=true
;;
NONSTOP* )
nonstop=true
;;
esac
CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar
# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
# IBM's JDK on AIX uses strange locations for the executables
JAVACMD="$JAVA_HOME/jre/sh/java"
else
JAVACMD="$JAVA_HOME/bin/java"
fi
if [ ! -x "$JAVACMD" ] ; then
die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
else
JAVACMD="java"
which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
# Increase the maximum file descriptors if we can.
if [ "$cygwin" = "false" -a "$darwin" = "false" -a "$nonstop" = "false" ] ; then
MAX_FD_LIMIT=`ulimit -H -n`
if [ $? -eq 0 ] ; then
if [ "$MAX_FD" = "maximum" -o "$MAX_FD" = "max" ] ; then
MAX_FD="$MAX_FD_LIMIT"
fi
ulimit -n $MAX_FD
if [ $? -ne 0 ] ; then
warn "Could not set maximum file descriptor limit: $MAX_FD"
fi
else
warn "Could not query maximum file descriptor limit: $MAX_FD_LIMIT"
fi
fi
# For Darwin, add options to specify how the application appears in the dock
if $darwin; then
GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\""
fi
# For Cygwin, switch paths to Windows format before running java
if $cygwin ; then
APP_HOME=`cygpath --path --mixed "$APP_HOME"`
CLASSPATH=`cygpath --path --mixed "$CLASSPATH"`
JAVACMD=`cygpath --unix "$JAVACMD"`
# We build the pattern for arguments to be converted via cygpath
ROOTDIRSRAW=`find -L / -maxdepth 1 -mindepth 1 -type d 2>/dev/null`
SEP=""
for dir in $ROOTDIRSRAW ; do
ROOTDIRS="$ROOTDIRS$SEP$dir"
SEP="|"
done
OURCYGPATTERN="(^($ROOTDIRS))"
# Add a user-defined pattern to the cygpath arguments
if [ "$GRADLE_CYGPATTERN" != "" ] ; then
OURCYGPATTERN="$OURCYGPATTERN|($GRADLE_CYGPATTERN)"
fi
# Now convert the arguments - kludge to limit ourselves to /bin/sh
i=0
for arg in "$@" ; do
CHECK=`echo "$arg"|egrep -c "$OURCYGPATTERN" -`
CHECK2=`echo "$arg"|egrep -c "^-"` ### Determine if an option
if [ $CHECK -ne 0 ] && [ $CHECK2 -eq 0 ] ; then ### Added a condition
eval `echo args$i`=`cygpath --path --ignore --mixed "$arg"`
else
eval `echo args$i`="\"$arg\""
fi
i=$((i+1))
done
case $i in
(0) set -- ;;
(1) set -- "$args0" ;;
(2) set -- "$args0" "$args1" ;;
(3) set -- "$args0" "$args1" "$args2" ;;
(4) set -- "$args0" "$args1" "$args2" "$args3" ;;
(5) set -- "$args0" "$args1" "$args2" "$args3" "$args4" ;;
(6) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" ;;
(7) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" ;;
(8) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" ;;
(9) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" "$args8" ;;
esac
fi
# Escape application args
save () {
for i do printf %s\\n "$i" | sed "s/'/'\\\\''/g;1s/^/'/;\$s/\$/' \\\\/" ; done
echo " "
}
APP_ARGS=$(save "$@")
# Collect all arguments for the java command, following the shell quoting and substitution rules
eval set -- $DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS"
# by default we should be in the correct project dir, but when run from Finder on Mac, the cwd is wrong
if [ "$(uname)" = "Darwin" ] && [ "$HOME" = "$PWD" ]; then
cd "$(dirname "$0")"
fi
exec "$JAVACMD" "$@"

@ -0,0 +1,84 @@
@if "%DEBUG%" == "" @echo off
@rem ##########################################################################
@rem
@rem Gradle startup script for Windows
@rem
@rem ##########################################################################
@rem Set local scope for the variables with windows NT shell
if "%OS%"=="Windows_NT" setlocal
set DIRNAME=%~dp0
if "%DIRNAME%" == "" set DIRNAME=.
set APP_BASE_NAME=%~n0
set APP_HOME=%DIRNAME%
@rem Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
set DEFAULT_JVM_OPTS=
@rem Find java.exe
if defined JAVA_HOME goto findJavaFromJavaHome
set JAVA_EXE=java.exe
%JAVA_EXE% -version >NUL 2>&1
if "%ERRORLEVEL%" == "0" goto init
echo.
echo ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.
goto fail
:findJavaFromJavaHome
set JAVA_HOME=%JAVA_HOME:"=%
set JAVA_EXE=%JAVA_HOME%/bin/java.exe
if exist "%JAVA_EXE%" goto init
echo.
echo ERROR: JAVA_HOME is set to an invalid directory: %JAVA_HOME%
echo.
echo Please set the JAVA_HOME variable in your environment to match the
echo location of your Java installation.
goto fail
:init
@rem Get command-line arguments, handling Windows variants
if not "%OS%" == "Windows_NT" goto win9xME_args
:win9xME_args
@rem Slurp the command line arguments.
set CMD_LINE_ARGS=
set _SKIP=2
:win9xME_args_slurp
if "x%~1" == "x" goto execute
set CMD_LINE_ARGS=%*
:execute
@rem Setup the command line
set CLASSPATH=%APP_HOME%\gradle\wrapper\gradle-wrapper.jar
@rem Execute Gradle
"%JAVA_EXE%" %DEFAULT_JVM_OPTS% %JAVA_OPTS% %GRADLE_OPTS% "-Dorg.gradle.appname=%APP_BASE_NAME%" -classpath "%CLASSPATH%" org.gradle.wrapper.GradleWrapperMain %CMD_LINE_ARGS%
:end
@rem End local scope for the variables with windows NT shell
if "%ERRORLEVEL%"=="0" goto mainEnd
:fail
rem Set variable GRADLE_EXIT_CONSOLE if you need the _script_ return code instead of
rem the _cmd.exe /c_ return code!
if not "" == "%GRADLE_EXIT_CONSOLE%" exit 1
exit /b 1
:mainEnd
if "%OS%"=="Windows_NT" endlocal
:omega

@ -19,11 +19,12 @@ numpydoc
onnxruntime==1.10.0 onnxruntime==1.10.0
opencc opencc
paddlenlp paddlenlp
# use paddlepaddle == 2.3.* according to: https://github.com/PaddlePaddle/Paddle/issues/48243
paddlepaddle>=2.2.2,<2.4.0
paddlespeech_ctcdecoders paddlespeech_ctcdecoders
paddlespeech_feat paddlespeech_feat
pandas pandas
pathos==0.2.8
pattern_singleton pattern_singleton
Pillow>=9.0.0 Pillow>=9.0.0
praatio==5.0.0 praatio==5.0.0

@ -23,10 +23,17 @@ Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: | :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: |
[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | - | 1.18 GB |Pre-trained Wav2vec2.0 Model | - | - | - | [Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | - | 1.18 GB |Pre-trained Wav2vec2.0 Model | - | - | - |
[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 718 MB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) | [Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 718 MB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
[Wav2vec2-large-wenetspeech-self Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2-large-wenetspeech-self_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | - | 714 MB |Pre-trained Wav2vec2.0 Model | - | - | - |
[Wav2vec2ASR-large-aishell1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Wenetspeech Dataset (1w h) | aishell1 (train set) | 1.17 GB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0453 | - | - |
### Whisper Model
Demo Link | Training Data | Size | Descriptions | CER | Model
:-----------: | :-----:| :-------: | :-----: | :-----: |:---------:|
[Whisper](../../demos/whisper) | 680kh from internet | large: 5.8G,</br>medium: 2.9G,</br>small: 923M,</br>base: 277M,</br>tiny: 145M | Encoder:Transformer,</br> Decoder:Transformer, </br>Decoding method: </br>Greedy search | 2.7 </br>(large, Librispeech) | [whisper-large](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-large-model.tar.gz) </br>[whisper-medium](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-medium-model.tar.gz) </br>[whisper-medium-English-only](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-medium-en-model.tar.gz) </br>[whisper-small](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-small-model.tar.gz) </br>[whisper-small-English-only](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-small-en-model.tar.gz) </br>[whisper-base](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-base-model.tar.gz) </br>[whisper-base-English-only](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-base-en-model.tar.gz) </br>[whisper-tiny](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-tiny-model.tar.gz) </br>[whisper-tiny-English-only](https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-tiny-en-model.tar.gz)
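For reference, a minimal sketch of driving the Whisper demo from Python is shown below; the `WhisperExecutor` import path and argument names are assumptions based on the demo linked above, so check `demos/whisper` for the exact interface.

```python
# Hypothetical usage sketch; see demos/whisper for the authoritative interface.
from paddlespeech.cli.whisper import WhisperExecutor  # assumed import path

whisper = WhisperExecutor()
result = whisper(
    model='whisper',
    task='transcribe',       # 'translate' would target English, per the table
    size='base',             # tiny/base/small/medium/large, as listed above
    audio_file='./zh.wav')   # hypothetical input file
print(result)
```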
### Language Model based on NGram ### Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions |Language Model | Training Data | Token-based | Size | Descriptions|
:------------:| :------------:|:------------: | :------------: | :------------: | :------------: | :------------: | :------------: | :------------: | :------------: |
[English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; <br/> About 1.85 billion n-grams; <br/> 'trie' binary with '-a 22 -q 8 -b 8' [English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; <br/> About 1.85 billion n-grams; <br/> 'trie' binary with '-a 22 -q 8 -b 8'
[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings [Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings
[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/> 'probing' binary with default settings [Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/> 'probing' binary with default settings

@ -0,0 +1,74 @@
# This example mainly follows the FastSpeech2 with CSMSC example
This example contains code used to train a rhythm-aware version of the [Fastspeech2](https://arxiv.org/abs/2006.04558) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. The dataset is then in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can directly download the rhythm version of the MFA result from [baker_alignment_tone.zip](https://paddlespeech.bj.bcebos.com/Rhy_e2e/baker_alignment_tone.zip), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
Remember that in our repo you should add the `--rhy-with-duration` flag to obtain the rhythm information.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
- synthesize waveform from a text file.
5. inference using the static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── energy_stats.npy
├── norm
├── pitch_stats.npy
├── raw
└── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch, and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set and located in `dump/train/*_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and the id of each utterance.
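A quick way to inspect one record is sketched below; the exact field names are assumptions, so check your own dump.

```python
# Minimal sketch: peek at the first record of a metadata.jsonl file.
import json

with open("dump/train/metadata.jsonl") as f:
    first = json.loads(f.readline())
# Expected keys (assumption): phones, text_lengths, speech_lengths, durations,
# feature paths, speaker, and the utterance id.
print(sorted(first.keys()))
```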
For more details, you can refer to [FastSpeech2 with CSMSC](../tts3).
## Pretrained Model
Pretrained FastSpeech2 model for the end-to-end rhythm version:
- [fastspeech2_rhy_csmsc_ckpt_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_rhy_csmsc_ckpt_1.3.0.zip)
This FastSpeech2 checkpoint contains files listed below.
```text
fastspeech2_rhy_csmsc_ckpt_1.3.0
├── default.yaml # default config used to train fastspeech2
├── phone_id_map.txt # phone vocabulary file when training fastspeech2
├── snapshot_iter_153000.pdz # model parameters and optimizer states
├── durations.txt # the intermediate output of preprocess.sh
├── energy_stats.npy
├── pitch_stats.npy
└── speech_stats.npy # statistics used to normalize spectrogram when training fastspeech2
```

@ -0,0 +1 @@
../../tts3/conf/default.yaml

@ -0,0 +1 @@
../../tts3/local/preprocess.sh

@ -0,0 +1 @@
../../tts3/local/synthesize.sh

@ -0,0 +1,119 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--inference_dir=${train_output_path}/inference \
--use_rhy=True
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz \
--voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--inference_dir=${train_output_path}/inference \
--use_rhy=True
fi
# the pretrained models haven't been released yet
# style melgan
# style melgan's dygraph-to-static conversion is not ready yet
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--use_rhy=True
# --inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--inference_dir=${train_output_path}/inference \
--use_rhy=True
fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "in wavernn syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=wavernn_csmsc \
--voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
--voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
--voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--inference_dir=${train_output_path}/inference \
--use_rhy=True
fi

@ -0,0 +1 @@
../../tts3/local/train.sh

@ -0,0 +1,38 @@
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional args `$1`, `$2`, ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
### please place the rhythm MFA result here
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize, vocoder is pwgan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is pwgan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@ -25,6 +25,7 @@ import librosa
import numpy as np import numpy as np
import paddle import paddle
import soundfile import soundfile
from paddlenlp.transformers import AutoTokenizer
from yacs.config import CfgNode from yacs.config import CfgNode
from ..executor import BaseExecutor from ..executor import BaseExecutor
@ -50,7 +51,7 @@ class SSLExecutor(BaseExecutor):
self.parser.add_argument( self.parser.add_argument(
'--model', '--model',
type=str, type=str,
default='wav2vec2ASR_librispeech', default=None,
choices=[ choices=[
tag[:tag.index('-')] tag[:tag.index('-')]
for tag in self.task_resource.pretrained_models.keys() for tag in self.task_resource.pretrained_models.keys()
@ -123,7 +124,7 @@ class SSLExecutor(BaseExecutor):
help='Increase logger verbosity of current task.') help='Increase logger verbosity of current task.')
def _init_from_path(self, def _init_from_path(self,
model_type: str='wav2vec2ASR_librispeech', model_type: str=None,
task: str='asr', task: str='asr',
lang: str='en', lang: str='en',
sample_rate: int=16000, sample_rate: int=16000,
@ -134,6 +135,18 @@ class SSLExecutor(BaseExecutor):
Init model and other resources from a specific path. Init model and other resources from a specific path.
""" """
logger.debug("start to init the model") logger.debug("start to init the model")
if model_type is None:
if lang == 'en':
model_type = 'wav2vec2ASR_librispeech'
elif lang == 'zh':
model_type = 'wav2vec2ASR_aishell1'
else:
logger.error(
"invalid lang, please input --lang en or --lang zh")
logger.debug(
"Model type had not been specified, default {} was used.".
format(model_type))
# default max_len: unit:second # default max_len: unit:second
self.max_len = 50 self.max_len = 50
if hasattr(self, 'model'): if hasattr(self, 'model'):
@ -167,9 +180,13 @@ class SSLExecutor(BaseExecutor):
self.config.merge_from_file(self.cfg_path) self.config.merge_from_file(self.cfg_path)
if task == 'asr': if task == 'asr':
with UpdateConfig(self.config): with UpdateConfig(self.config):
if lang == 'en':
self.text_feature = TextFeaturizer( self.text_feature = TextFeaturizer(
unit_type=self.config.unit_type, unit_type=self.config.unit_type,
vocab=self.config.vocab_filepath) vocab=self.config.vocab_filepath)
elif lang == 'zh':
self.text_feature = AutoTokenizer.from_pretrained(
self.config.tokenizer)
self.config.decode.decoding_method = decode_method self.config.decode.decoding_method = decode_method
model_name = model_type[:model_type.rindex( model_name = model_type[:model_type.rindex(
'_')] # model_type: {model_name}_{dataset} '_')] # model_type: {model_name}_{dataset}
@ -253,7 +270,8 @@ class SSLExecutor(BaseExecutor):
audio, audio,
text_feature=self.text_feature, text_feature=self.text_feature,
decoding_method=cfg.decoding_method, decoding_method=cfg.decoding_method,
beam_size=cfg.beam_size) beam_size=cfg.beam_size,
tokenizer=getattr(self.config, 'tokenizer', None))
self._outputs["result"] = result_transcripts[0][0] self._outputs["result"] = result_transcripts[0][0]
except Exception as e: except Exception as e:
logger.exception(e) logger.exception(e)
@ -413,7 +431,7 @@ class SSLExecutor(BaseExecutor):
@stats_wrapper @stats_wrapper
def __call__(self, def __call__(self,
audio_file: os.PathLike, audio_file: os.PathLike,
model: str='wav2vec2ASR_librispeech', model: str=None,
task: str='asr', task: str='asr',
lang: str='en', lang: str='en',
sample_rate: int=16000, sample_rate: int=16000,
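With the change above, `--model` becomes optional and is resolved from `--lang`; a minimal usage sketch (the wav path is hypothetical):

```python
# Minimal sketch of the new language-based default; import path per this diff.
from paddlespeech.cli.ssl import SSLExecutor

ssl_executor = SSLExecutor()
# model=None now resolves to wav2vec2ASR_librispeech for lang='en'
# and wav2vec2ASR_aishell1 for lang='zh'.
text = ssl_executor(audio_file='./zh.wav', model=None, task='asr', lang='zh')
print(text)
```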

@ -70,6 +70,38 @@ ssl_dynamic_pretrained_models = {
'exp/wav2vec2ASR/checkpoints/avg_1.pdparams', 'exp/wav2vec2ASR/checkpoints/avg_1.pdparams',
}, },
}, },
"wav2vec2-zh-16k": {
'1.3': {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2-large-wenetspeech-self_ckpt_1.3.0.model.tar.gz',
'md5':
'00ea4975c05d1bb58181205674052fe1',
'cfg_path':
'model.yaml',
'ckpt_path':
'chinese-wav2vec2-large',
'model':
'chinese-wav2vec2-large.pdparams',
'params':
'chinese-wav2vec2-large.pdparams',
},
},
"wav2vec2ASR_aishell1-zh-16k": {
'1.3': {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr3/wav2vec2ASR-large-aishell1_ckpt_1.3.0.model.tar.gz',
'md5':
'ac8fa0a6345e6a7535f6fabb5e59e218',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/wav2vec2ASR/checkpoints/avg_1',
'model':
'exp/wav2vec2ASR/checkpoints/avg_1.pdparams',
'params':
'exp/wav2vec2ASR/checkpoints/avg_1.pdparams',
},
},
} }
# --------------------------------- # ---------------------------------
@ -1658,3 +1690,16 @@ g2pw_onnx_models = {
}, },
}, },
} }
# ---------------------------------
# ------------- Rhy_frontend ---------------
# ---------------------------------
rhy_frontend_models = {
'rhy_e2e': {
'1.0': {
'url':
'https://paddlespeech.bj.bcebos.com/Rhy_e2e/rhy_frontend.zip',
'md5': '6624a77393de5925d5a84400b363d8ef',
},
},
}
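The new `rhy_frontend_models` entry is consumed the same way as the other registries; a sketch mirroring the `RhyPredictor` code later in this change:

```python
# Sketch: fetch the rhythm frontend model registered above.
from paddlespeech.cli.utils import download_and_decompress
from paddlespeech.resource.pretrained_models import rhy_frontend_models
from paddlespeech.utils.env import MODEL_HOME

path = download_and_decompress(rhy_frontend_models['rhy_e2e']['1.0'], MODEL_HOME)
print(path)  # directory holding rhy_default.yaml, rhy_token and the checkpoint
```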

@ -1173,10 +1173,6 @@ class Wav2Vec2ConfigPure():
self.proj_codevector_dim = config.proj_codevector_dim self.proj_codevector_dim = config.proj_codevector_dim
self.diversity_loss_weight = config.diversity_loss_weight self.diversity_loss_weight = config.diversity_loss_weight
# ctc loss
self.ctc_loss_reduction = config.ctc_loss_reduction
self.ctc_zero_infinity = config.ctc_zero_infinity
# adapter # adapter
self.add_adapter = config.add_adapter self.add_adapter = config.add_adapter
self.adapter_kernel_size = config.adapter_kernel_size self.adapter_kernel_size = config.adapter_kernel_size

@ -76,28 +76,66 @@ class Wav2vec2ASR(nn.Layer):
feats: paddle.Tensor, feats: paddle.Tensor,
text_feature: Dict[str, int], text_feature: Dict[str, int],
decoding_method: str, decoding_method: str,
beam_size: int): beam_size: int,
tokenizer: str=None):
batch_size = feats.shape[0] batch_size = feats.shape[0]
if decoding_method == 'ctc_prefix_beam_search' and batch_size > 1: if decoding_method == 'ctc_prefix_beam_search' and batch_size > 1:
logger.error( raise ValueError(
f'decoding mode {decoding_method} must be running with batch_size == 1' f"decoding mode {decoding_method} must be running with batch_size == 1"
) )
logger.error(f"current batch_size is {batch_size}")
sys.exit(1)
if decoding_method == 'ctc_greedy_search': if decoding_method == 'ctc_greedy_search':
if tokenizer is None:
hyps = self.ctc_greedy_search(feats) hyps = self.ctc_greedy_search(feats)
res = [text_feature.defeaturize(hyp) for hyp in hyps] res = [text_feature.defeaturize(hyp) for hyp in hyps]
res_tokenids = [hyp for hyp in hyps] res_tokenids = [hyp for hyp in hyps]
else:
hyps = self.ctc_greedy_search(feats)
res = []
res_tokenids = []
for sequence in hyps:
# Decode token terms to words
predicted_tokens = text_feature.convert_ids_to_tokens(
sequence)
tmp_res = []
tmp_res_tokenids = []
for c in predicted_tokens:
if c == "[CLS]":
continue
elif c == "[SEP]" or c == "[PAD]":
break
else:
tmp_res.append(c)
tmp_res_tokenids.append(text_feature.vocab[c])
res.append(''.join(tmp_res))
res_tokenids.append(tmp_res_tokenids)
# ctc_prefix_beam_search and attention_rescoring only return one # ctc_prefix_beam_search and attention_rescoring only return one
# result in List[int], change it to List[List[int]] for compatible # result in List[int], change it to List[List[int]] for compatible
# with other batch decoding mode # with other batch decoding mode
elif decoding_method == 'ctc_prefix_beam_search': elif decoding_method == 'ctc_prefix_beam_search':
assert feats.shape[0] == 1 assert feats.shape[0] == 1
if tokenizer is None:
hyp = self.ctc_prefix_beam_search(feats, beam_size) hyp = self.ctc_prefix_beam_search(feats, beam_size)
res = [text_feature.defeaturize(hyp)] res = [text_feature.defeaturize(hyp)]
res_tokenids = [hyp] res_tokenids = [hyp]
else:
hyp = self.ctc_prefix_beam_search(feats, beam_size)
res = []
res_tokenids = []
predicted_tokens = text_feature.convert_ids_to_tokens(hyp)
tmp_res = []
tmp_res_tokenids = []
for c in predicted_tokens:
if c == "[CLS]":
continue
elif c == "[SEP]" or c == "[PAD]":
break
else:
tmp_res.append(c)
tmp_res_tokenids.append(text_feature.vocab[c])
res.append(''.join(tmp_res))
res_tokenids.append(tmp_res_tokenids)
else: else:
raise ValueError( raise ValueError(
f"wav2vec2 not support decoding method: {decoding_method}") f"wav2vec2 not support decoding method: {decoding_method}")

@ -17,10 +17,10 @@ from pathlib import Path
import soundfile as sf import soundfile as sf
from timer import timer from timer import timer
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_am_output
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_predictor
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_voc_output
from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.exps.syn_utils import get_lite_am_output
from paddlespeech.t2s.exps.syn_utils import get_lite_predictor
from paddlespeech.t2s.exps.syn_utils import get_lite_voc_output
from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.exps.syn_utils import get_sentences

@ -18,13 +18,13 @@ import numpy as np
import soundfile as sf import soundfile as sf
from timer import timer from timer import timer
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_am_sublayer_output
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_predictor
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_streaming_am_output
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_voc_output
from paddlespeech.t2s.exps.syn_utils import denorm from paddlespeech.t2s.exps.syn_utils import denorm
from paddlespeech.t2s.exps.syn_utils import get_chunks from paddlespeech.t2s.exps.syn_utils import get_chunks
from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.exps.syn_utils import get_lite_am_sublayer_output
from paddlespeech.t2s.exps.syn_utils import get_lite_predictor
from paddlespeech.t2s.exps.syn_utils import get_lite_streaming_am_output
from paddlespeech.t2s.exps.syn_utils import get_lite_voc_output
from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.exps.syn_utils import get_sentences
from paddlespeech.t2s.exps.syn_utils import run_frontend from paddlespeech.t2s.exps.syn_utils import run_frontend
from paddlespeech.t2s.utils import str2bool from paddlespeech.t2s.utils import str2bool

@ -0,0 +1,111 @@
import os
from pathlib import Path
from typing import Optional
import numpy as np
from paddlelite.lite import create_paddle_predictor
from paddlelite.lite import MobileConfig
from .syn_utils import run_frontend
# Paddle-Lite
def get_lite_predictor(model_dir: Optional[os.PathLike]=None,
model_file: Optional[os.PathLike]=None,
cpu_threads: int=1):
config = MobileConfig()
config.set_model_from_file(str(Path(model_dir) / model_file))
predictor = create_paddle_predictor(config)
return predictor
def get_lite_am_output(
input: str,
am_predictor,
am: str,
frontend: object,
lang: str='zh',
merge_sentences: bool=True,
speaker_dict: Optional[os.PathLike]=None,
spk_id: int=0, ):
am_name = am[:am.rindex('_')]
am_dataset = am[am.rindex('_') + 1:]
get_spk_id = False
get_tone_ids = False
if am_name == 'speedyspeech':
get_tone_ids = True
if am_dataset in {"aishell3", "vctk", "mix"} and speaker_dict:
get_spk_id = True
spk_id = np.array([spk_id])
frontend_dict = run_frontend(
frontend=frontend,
text=input,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids,
lang=lang)
if get_tone_ids:
tone_ids = frontend_dict['tone_ids']
tones = tone_ids[0].numpy()
tones_handle = am_predictor.get_input(1)
tones_handle.from_numpy(tones)
if get_spk_id:
spk_id_handle = am_predictor.get_input(1)
spk_id_handle.from_numpy(spk_id)
phone_ids = frontend_dict['phone_ids']
phones = phone_ids[0].numpy()
phones_handle = am_predictor.get_input(0)
phones_handle.from_numpy(phones)
am_predictor.run()
am_output_handle = am_predictor.get_output(0)
am_output_data = am_output_handle.numpy()
return am_output_data
def get_lite_voc_output(voc_predictor, input):
mel_handle = voc_predictor.get_input(0)
mel_handle.from_numpy(input)
voc_predictor.run()
voc_output_handle = voc_predictor.get_output(0)
wav = voc_output_handle.numpy()
return wav
def get_lite_am_sublayer_output(am_sublayer_predictor, input):
input_handle = am_sublayer_predictor.get_input(0)
input_handle.from_numpy(input)
am_sublayer_predictor.run()
am_sublayer_handle = am_sublayer_predictor.get_output(0)
am_sublayer_output = am_sublayer_handle.numpy()
return am_sublayer_output
def get_lite_streaming_am_output(input: str,
am_encoder_infer_predictor,
am_decoder_predictor,
am_postnet_predictor,
frontend,
lang: str='zh',
merge_sentences: bool=True):
get_tone_ids = False
frontend_dict = run_frontend(
frontend=frontend,
text=input,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids,
lang=lang)
phone_ids = frontend_dict['phone_ids']
phones = phone_ids[0].numpy()
am_encoder_infer_output = get_lite_am_sublayer_output(
am_encoder_infer_predictor, input=phones)
am_decoder_output = get_lite_am_sublayer_output(
am_decoder_predictor, input=am_encoder_infer_output)
am_postnet_output = get_lite_am_sublayer_output(
am_postnet_predictor, input=np.transpose(am_decoder_output, (0, 2, 1)))
am_output_data = am_decoder_output + np.transpose(am_postnet_output,
(0, 2, 1))
normalized_mel = am_output_data[0]
return normalized_mel
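These helpers compose into a simple non-streaming Paddle-Lite pipeline; a sketch under assumed model file names (the `.nb` files and paths are hypothetical):

```python
# Sketch: text -> mel (acoustic model) -> wav (vocoder) with Paddle-Lite.
import soundfile as sf

from paddlespeech.t2s.exps.lite_syn_utils import get_lite_am_output
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_predictor
from paddlespeech.t2s.exps.lite_syn_utils import get_lite_voc_output
from paddlespeech.t2s.exps.syn_utils import get_frontend

frontend = get_frontend(lang='zh', phones_dict='dump/phone_id_map.txt')
am_predictor = get_lite_predictor('./inference', 'fastspeech2_csmsc.nb')
voc_predictor = get_lite_predictor('./inference', 'hifigan_csmsc.nb')

mel = get_lite_am_output('你好,欢迎使用语音合成。', am_predictor,
                         am='fastspeech2_csmsc', frontend=frontend)
wav = get_lite_voc_output(voc_predictor, mel)
sf.write('output.wav', wav, samplerate=24000)  # CSMSC models are 24 kHz
```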

@ -26,8 +26,6 @@ import paddle
from paddle import inference from paddle import inference
from paddle import jit from paddle import jit
from paddle.static import InputSpec from paddle.static import InputSpec
from paddlelite.lite import create_paddle_predictor
from paddlelite.lite import MobileConfig
from yacs.config import CfgNode from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable from paddlespeech.t2s.datasets.data_table import DataTable
@ -163,10 +161,13 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
# frontend # frontend
def get_frontend(lang: str='zh', def get_frontend(lang: str='zh',
phones_dict: Optional[os.PathLike]=None, phones_dict: Optional[os.PathLike]=None,
tones_dict: Optional[os.PathLike]=None): tones_dict: Optional[os.PathLike]=None,
use_rhy=False):
if lang == 'zh': if lang == 'zh':
frontend = Frontend( frontend = Frontend(
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict) phone_vocab_path=phones_dict,
tone_vocab_path=tones_dict,
use_rhy=use_rhy)
elif lang == 'en': elif lang == 'en':
frontend = English(phone_vocab_path=phones_dict) frontend = English(phone_vocab_path=phones_dict)
elif lang == 'mix': elif lang == 'mix':
@ -512,105 +513,3 @@ def get_sess(model_path: Optional[os.PathLike],
sess = ort.InferenceSession( sess = ort.InferenceSession(
model_path, providers=providers, sess_options=sess_options) model_path, providers=providers, sess_options=sess_options)
return sess return sess
# Paddle-Lite
def get_lite_predictor(model_dir: Optional[os.PathLike]=None,
model_file: Optional[os.PathLike]=None,
cpu_threads: int=1):
config = MobileConfig()
config.set_model_from_file(str(Path(model_dir) / model_file))
predictor = create_paddle_predictor(config)
return predictor
def get_lite_am_output(
input: str,
am_predictor,
am: str,
frontend: object,
lang: str='zh',
merge_sentences: bool=True,
speaker_dict: Optional[os.PathLike]=None,
spk_id: int=0, ):
am_name = am[:am.rindex('_')]
am_dataset = am[am.rindex('_') + 1:]
get_spk_id = False
get_tone_ids = False
if am_name == 'speedyspeech':
get_tone_ids = True
if am_dataset in {"aishell3", "vctk", "mix"} and speaker_dict:
get_spk_id = True
spk_id = np.array([spk_id])
frontend_dict = run_frontend(
frontend=frontend,
text=input,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids,
lang=lang)
if get_tone_ids:
tone_ids = frontend_dict['tone_ids']
tones = tone_ids[0].numpy()
tones_handle = am_predictor.get_input(1)
tones_handle.from_numpy(tones)
if get_spk_id:
spk_id_handle = am_predictor.get_input(1)
spk_id_handle.from_numpy(spk_id)
phone_ids = frontend_dict['phone_ids']
phones = phone_ids[0].numpy()
phones_handle = am_predictor.get_input(0)
phones_handle.from_numpy(phones)
am_predictor.run()
am_output_handle = am_predictor.get_output(0)
am_output_data = am_output_handle.numpy()
return am_output_data
def get_lite_voc_output(voc_predictor, input):
mel_handle = voc_predictor.get_input(0)
mel_handle.from_numpy(input)
voc_predictor.run()
voc_output_handle = voc_predictor.get_output(0)
wav = voc_output_handle.numpy()
return wav
def get_lite_am_sublayer_output(am_sublayer_predictor, input):
input_handle = am_sublayer_predictor.get_input(0)
input_handle.from_numpy(input)
am_sublayer_predictor.run()
am_sublayer_handle = am_sublayer_predictor.get_output(0)
am_sublayer_output = am_sublayer_handle.numpy()
return am_sublayer_output
def get_lite_streaming_am_output(input: str,
am_encoder_infer_predictor,
am_decoder_predictor,
am_postnet_predictor,
frontend,
lang: str='zh',
merge_sentences: bool=True):
get_tone_ids = False
frontend_dict = run_frontend(
frontend=frontend,
text=input,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids,
lang=lang)
phone_ids = frontend_dict['phone_ids']
phones = phone_ids[0].numpy()
am_encoder_infer_output = get_lite_am_sublayer_output(
am_encoder_infer_predictor, input=phones)
am_decoder_output = get_lite_am_sublayer_output(
am_decoder_predictor, input=am_encoder_infer_output)
am_postnet_output = get_lite_am_sublayer_output(
am_postnet_predictor, input=np.transpose(am_decoder_output, (0, 2, 1)))
am_output_data = am_decoder_output + np.transpose(am_postnet_output,
(0, 2, 1))
normalized_mel = am_output_data[0]
return normalized_mel

@ -27,6 +27,7 @@ from paddlespeech.t2s.exps.syn_utils import get_sentences
from paddlespeech.t2s.exps.syn_utils import get_voc_inference from paddlespeech.t2s.exps.syn_utils import get_voc_inference
from paddlespeech.t2s.exps.syn_utils import run_frontend from paddlespeech.t2s.exps.syn_utils import run_frontend
from paddlespeech.t2s.exps.syn_utils import voc_to_static from paddlespeech.t2s.exps.syn_utils import voc_to_static
from paddlespeech.t2s.utils import str2bool
def evaluate(args): def evaluate(args):
@ -49,7 +50,8 @@ def evaluate(args):
frontend = get_frontend( frontend = get_frontend(
lang=args.lang, lang=args.lang,
phones_dict=args.phones_dict, phones_dict=args.phones_dict,
tones_dict=args.tones_dict) tones_dict=args.tones_dict,
use_rhy=args.use_rhy)
print("frontend done!") print("frontend done!")
# acoustic model # acoustic model
@ -240,6 +242,11 @@ def parse_args():
type=str, type=str,
help="text to synthesize, a 'utt_id sentence' pair per line.") help="text to synthesize, a 'utt_id sentence' pair per line.")
parser.add_argument("--output_dir", type=str, help="output dir.") parser.add_argument("--output_dir", type=str, help="output dir.")
parser.add_argument(
"--use_rhy",
type=str2bool,
default=False,
help="run rhythm frontend or not")
args = parser.parse_args() args = parser.parse_args()
return args return args
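`str2bool` lets `--use_rhy` accept values like `True`/`false` on the command line; a minimal sketch of such a helper (the real one lives in `paddlespeech.t2s.utils` and may differ):

```python
# Simplified sketch; the actual str2bool in paddlespeech.t2s.utils may differ.
def str2bool(value: str) -> bool:
    return value.strip().lower() in ("true", "t", "yes", "y", "1")
```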

@ -0,0 +1,14 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .rhy_predictor import *

@ -0,0 +1,106 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import re
import paddle
import yaml
from paddlenlp.transformers import ErnieTokenizer
from yacs.config import CfgNode
from paddlespeech.cli.utils import download_and_decompress
from paddlespeech.resource.pretrained_models import rhy_frontend_models
from paddlespeech.text.models.ernie_linear import ErnieLinear
from paddlespeech.utils.env import MODEL_HOME
DefinedClassifier = {
'ErnieLinear': ErnieLinear,
}
model_version = '1.0'
class RhyPredictor():
def __init__(
self,
model_dir: os.PathLike=MODEL_HOME, ):
uncompress_path = download_and_decompress(
rhy_frontend_models['rhy_e2e'][model_version], model_dir)
with open(os.path.join(uncompress_path, 'rhy_default.yaml')) as f:
config = CfgNode(yaml.safe_load(f))
self.punc_list = []
with open(os.path.join(uncompress_path, 'rhy_token'), 'r') as f:
for line in f:
self.punc_list.append(line.strip())
self.punc_list = [0] + self.punc_list
self.make_rhy_dict()
self.model = DefinedClassifier["ErnieLinear"](**config["model"])
pretrained_token = config['data_params']['pretrained_token']
self.tokenizer = ErnieTokenizer.from_pretrained(pretrained_token)
state_dict = paddle.load(
os.path.join(uncompress_path, 'snapshot_iter_2600_main_params.pdz'))
self.model.set_state_dict(state_dict)
self.model.eval()
def _clean_text(self, text):
text = text.lower()
text = re.sub('[^A-Za-z0-9\u4e00-\u9fa5]', '', text)
text = re.sub(f'[{"".join([p for p in self.punc_list][1:])}]', '', text)
return text
def preprocess(self, text, tokenizer):
clean_text = self._clean_text(text)
assert len(clean_text) > 0, f'Invalid input string: {text}'
tokenized_input = tokenizer(
list(clean_text), return_length=True, is_split_into_words=True)
_inputs = dict()
_inputs['input_ids'] = tokenized_input['input_ids']
_inputs['seg_ids'] = tokenized_input['token_type_ids']
_inputs['seq_len'] = tokenized_input['seq_len']
return _inputs
def get_prediction(self, raw_text):
_inputs = self.preprocess(raw_text, self.tokenizer)
seq_len = _inputs['seq_len']
input_ids = paddle.to_tensor(_inputs['input_ids']).unsqueeze(0)
seg_ids = paddle.to_tensor(_inputs['seg_ids']).unsqueeze(0)
logits, _ = self.model(input_ids, seg_ids)
preds = paddle.argmax(logits, axis=-1).squeeze(0)
tokens = self.tokenizer.convert_ids_to_tokens(
_inputs['input_ids'][1:seq_len - 1])
labels = preds[1:seq_len - 1].tolist()
assert len(tokens) == len(labels)
# add 0 for non punc
text = ''
for t, l in zip(tokens, labels):
text += t
if l != 0: # Non punc.
text += self.punc_list[l]
return text
def make_rhy_dict(self):
self.rhy_dict = {}
for i, p in enumerate(self.punc_list[1:]):
self.rhy_dict[p] = 'sp' + str(i + 1)
def pinyin_align(self, pinyins, rhy_pre):
final_py = []
j = 0
for i in range(len(rhy_pre)):
if rhy_pre[i] in self.rhy_dict:
final_py.append(self.rhy_dict[rhy_pre[i]])
else:
final_py.append(pinyins[j])
j += 1
return final_py
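`RhyPredictor` can also be exercised on its own; a sketch (the sample sentence is arbitrary and the predicted break tokens depend on the downloaded model):

```python
# Sketch: run the rhythm predictor directly.
predictor = RhyPredictor()  # downloads rhy_frontend.zip on first use
tagged = predictor.get_prediction("今天天气真不错")  # text with break tokens
print(tagged)
```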

@ -30,6 +30,7 @@ from pypinyin_dict.phrase_pinyin_data import large_pinyin
from paddlespeech.t2s.frontend.g2pw import G2PWOnnxConverter from paddlespeech.t2s.frontend.g2pw import G2PWOnnxConverter
from paddlespeech.t2s.frontend.generate_lexicon import generate_lexicon from paddlespeech.t2s.frontend.generate_lexicon import generate_lexicon
from paddlespeech.t2s.frontend.rhy_prediction.rhy_predictor import RhyPredictor
from paddlespeech.t2s.frontend.tone_sandhi import ToneSandhi from paddlespeech.t2s.frontend.tone_sandhi import ToneSandhi
from paddlespeech.t2s.frontend.zh_normalization.text_normlization import TextNormalizer from paddlespeech.t2s.frontend.zh_normalization.text_normlization import TextNormalizer
from paddlespeech.t2s.ssml.xml_processor import MixTextProcessor from paddlespeech.t2s.ssml.xml_processor import MixTextProcessor
@ -82,11 +83,13 @@ class Frontend():
def __init__(self, def __init__(self,
g2p_model="g2pW", g2p_model="g2pW",
phone_vocab_path=None, phone_vocab_path=None,
tone_vocab_path=None): tone_vocab_path=None,
use_rhy=False):
self.mix_ssml_processor = MixTextProcessor() self.mix_ssml_processor = MixTextProcessor()
self.tone_modifier = ToneSandhi() self.tone_modifier = ToneSandhi()
self.text_normalizer = TextNormalizer() self.text_normalizer = TextNormalizer()
self.punc = ":,;。?!“”‘’':,;.?!" self.punc = ":,;。?!“”‘’':,;.?!"
self.rhy_phns = ['sp1', 'sp2', 'sp3', 'sp4']
self.phrases_dict = { self.phrases_dict = {
'开户行': [['ka1i'], ['hu4'], ['hang2']], '开户行': [['ka1i'], ['hu4'], ['hang2']],
'发卡行': [['fa4'], ['ka3'], ['hang2']], '发卡行': [['fa4'], ['ka3'], ['hang2']],
@ -105,6 +108,10 @@ class Frontend():
'': [['lei5']], '': [['lei5']],
'掺和': [['chan1'], ['huo5']] '掺和': [['chan1'], ['huo5']]
} }
self.use_rhy = use_rhy
if use_rhy:
self.rhy_predictor = RhyPredictor()
print("Rhythm predictor loaded.")
# g2p_model can be pypinyin and g2pM and g2pW # g2p_model can be pypinyin and g2pM and g2pW
self.g2p_model = g2p_model self.g2p_model = g2p_model
if self.g2p_model == "g2pM": if self.g2p_model == "g2pM":
@ -195,9 +202,13 @@ class Frontend():
segments = sentences segments = sentences
phones_list = [] phones_list = []
for seg in segments: for seg in segments:
if self.use_rhy:
seg = self.rhy_predictor._clean_text(seg)
phones = [] phones = []
# Replace all English words in the sentence # Replace all English words in the sentence
seg = re.sub('[a-zA-Z]+', '', seg) seg = re.sub('[a-zA-Z]+', '', seg)
if self.use_rhy:
seg = self.rhy_predictor.get_prediction(seg)
seg_cut = psg.lcut(seg) seg_cut = psg.lcut(seg)
initials = [] initials = []
finals = [] finals = []
@ -205,11 +216,18 @@ class Frontend():
# For better results on polyphonic words, the whole sentence is predicted at once
if self.g2p_model == "g2pW": if self.g2p_model == "g2pW":
try: try:
if self.use_rhy:
seg = self.rhy_predictor._clean_text(seg)
pinyins = self.g2pW_model(seg)[0] pinyins = self.g2pW_model(seg)[0]
except Exception: except Exception:
# the g2pW model takes traditional-character input; for simplified words it cannot cover, fall back to g2pM
print("[%s] not in g2pW dict,use g2pM" % seg) print("[%s] not in g2pW dict,use g2pM" % seg)
pinyins = self.g2pM_model(seg, tone=True, char_split=False) pinyins = self.g2pM_model(seg, tone=True, char_split=False)
if self.use_rhy:
rhy_text = self.rhy_predictor.get_prediction(seg)
final_py = self.rhy_predictor.pinyin_align(pinyins,
rhy_text)
pinyins = final_py
pre_word_length = 0 pre_word_length = 0
for word, pos in seg_cut: for word, pos in seg_cut:
sub_initials = [] sub_initials = []
@ -271,7 +289,7 @@ class Frontend():
phones.append(c) phones.append(c)
if c and c in self.punc: if c and c in self.punc:
phones.append('sp') phones.append('sp')
if v and v not in self.punc: if v and v not in self.punc and v not in self.rhy_phns:
phones.append(v) phones.append(v)
phones_list.append(phones) phones_list.append(phones)
if merge_sentences: if merge_sentences:
@ -330,7 +348,7 @@ class Frontend():
phones.append(c) phones.append(c)
if c and c in self.punc: if c and c in self.punc:
phones.append('sp') phones.append('sp')
if v and v not in self.punc: if v and v not in self.punc and v not in self.rhy_phns:
phones.append(v) phones.append(v)
phones_list.append(phones) phones_list.append(phones)
if merge_sentences: if merge_sentences:
@ -504,6 +522,11 @@ class Frontend():
print("----------------------------") print("----------------------------")
return [sum(all_phonemes, [])] return [sum(all_phonemes, [])]
def add_sp_if_no(self, phonemes):
if not phonemes[-1][-1].startswith('sp'):
phonemes[-1].append('sp4')
return phonemes
def get_input_ids(self, def get_input_ids(self,
sentence: str, sentence: str,
merge_sentences: bool=True, merge_sentences: bool=True,
@ -519,6 +542,8 @@ class Frontend():
merge_sentences=merge_sentences, merge_sentences=merge_sentences,
print_info=print_info, print_info=print_info,
robot=robot) robot=robot)
if self.use_rhy:
phonemes = self.add_sp_if_no(phonemes)
result = {} result = {}
phones = [] phones = []
tones = [] tones = []
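End to end, the rhythm frontend is enabled with `use_rhy=True`; a sketch (the vocab path is hypothetical and the exact phones depend on the model):

```python
# Sketch: rhythm-aware g2p through the Chinese frontend.
from paddlespeech.t2s.frontend.zh_frontend import Frontend  # assumed module path

frontend = Frontend(phone_vocab_path='dump/phone_id_map.txt', use_rhy=True)
outs = frontend.get_input_ids("今天天气真不错", merge_sentences=True)
print(outs["phone_ids"])  # phone sequence now includes sp1..sp4 rhythm breaks
```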

@ -47,7 +47,7 @@ base = [
"onnxruntime==1.10.0", "onnxruntime==1.10.0",
"opencc", "opencc",
"pandas", "pandas",
"paddlenlp", "paddlenlp>=2.4.3",
"paddlespeech_feat", "paddlespeech_feat",
"Pillow>=9.0.0", "Pillow>=9.0.0",
"praatio==5.0.0", "praatio==5.0.0",
@ -71,11 +71,10 @@ base = [
"prettytable", "prettytable",
"zhon", "zhon",
"colorlog", "colorlog",
"pathos == 0.2.8", "pathos==0.2.8",
"braceexpand", "braceexpand",
"pyyaml", "pyyaml",
"pybind11", "pybind11",
"paddlelite",
"paddleslim==2.3.4", "paddleslim==2.3.4",
] ]

@ -22,7 +22,7 @@ We develop under:
1. First, launch the docker container.
```
docker run --privileged --net=host --ipc=host -it --rm -v /path/to/paddlespeech:/workspace --name=dev registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7 /bin/bash
```
* More `Paddle` docker images can be found [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html).

@ -1,4 +1,4 @@
# Customized ASR
## Introduction
These scripts are tutorials to show you how to build your own decoding graph.

@ -4,3 +4,4 @@
* `websocket` - Streaming ASR with websocket for deepspeech2_aishell. * `websocket` - Streaming ASR with websocket for deepspeech2_aishell.
* `aishell` - Streaming Decoding under aishell dataset, for local WER test. * `aishell` - Streaming Decoding under aishell dataset, for local WER test.
* `onnx` - Example to convert deepspeech2 to onnx format.

@ -1,12 +1,57 @@
# Aishell - Deepspeech2 Streaming # Aishell - Deepspeech2 Streaming
> We recommend using the U2/U2++ model instead of DS2; please see [here](../../u2pp_ol/wenetspeech/).

A C++ deployment example that uses the deepspeech2 model to recognize `wav` files and compute the `CER`. We use AISHELL-1 as the test data.

## Source path.sh
```bash
. path.sh
```
SpeechX binaries are under `echo $SPEECHX_BUILD`; for more info please see `path.sh`.

## Recognize with linear feature
```bash
bash run.sh
```
`run.sh` has multiple stages; for details please see `run.sh`:
1. download the dataset, model, and lm
2. convert the cmvn format and compute features
3. decode w/o lm from features
4. decode w/ ngram lm from features
5. decode w/ TLG graph from features
6. recognize w/ TLG graph from wav input
### Recognize with `.scp` file for wav
This script uses `recognizer_main` to recognize wav files.
The input is an `scp` file which looks like this:
```text
# head data/split1/1/aishell_test.scp
BAC009S0764W0121 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0121.wav
BAC009S0764W0122 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0122.wav
...
BAC009S0764W0125 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0125.wav
```
If you want to recognize one wav, you can make the `scp` file like this:
```text
key path/to/wav/file
```
Then specify the `--wav_rspecifier=` param for the `recognizer_main` bin. For the meaning of other flags, please see the help:
```bash
recognizer_main --help
```
For an example of using `recognizer_main`, please see `run.sh`.
### CTC Prefix Beam Search w/o LM
@ -25,7 +70,7 @@ Mandarin -> 7.86 % N=104768 C=96865 S=7573 D=330 I=327
Other -> 0.00 % N=0 C=0 S=0 D=0 I=0 Other -> 0.00 % N=0 C=0 S=0 D=0 I=0
``` ```
### CTC TLG WFST
LM: [aishell train](http://paddlespeech.bj.bcebos.com/speechx/examples/ds2_ol/aishell/aishell_graph.zip) LM: [aishell train](http://paddlespeech.bj.bcebos.com/speechx/examples/ds2_ol/aishell/aishell_graph.zip)
--acoustic_scale=1.2 --acoustic_scale=1.2
@ -43,8 +88,11 @@ Mandarin -> 10.93 % N=104762 C=93410 S=9779 D=1573 I=95
Other -> 100.00 % N=3 C=0 S=1 D=2 I=0 Other -> 100.00 % N=3 C=0 S=1 D=2 I=0
``` ```
## Recognize with fbank feature
This script is the same as `run.sh`, but uses the fbank feature.
```bash
bash run_fbank.sh
```
@ -66,7 +114,7 @@ Mandarin -> 5.82 % N=104762 C=99386 S=4941 D=435 I=720
English -> 0.00 % N=0 C=0 S=0 D=0 I=0 English -> 0.00 % N=0 C=0 S=0 D=0 I=0
``` ```
### CTC TLG WFST
LM: [aishell train](https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph2.zip) LM: [aishell train](https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph2.zip)
``` ```
@ -75,7 +123,11 @@ Mandarin -> 9.57 % N=104762 C=94817 S=4325 D=5620 I=84
Other -> 100.00 % N=3 C=0 S=1 D=2 I=0 Other -> 100.00 % N=3 C=0 S=1 D=2 I=0
``` ```
## Build TLG WFST graph
The script builds the TLG WFST graph; it depends on `srilm`, so please make sure it is installed.
For more information please see the script below.
```bash
bash ./local/run_build_tlg.sh
```

@ -22,6 +22,7 @@ mkdir -p $data
if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
if [ ! -f $data/speech.ngram.zh.tar.gz ];then if [ ! -f $data/speech.ngram.zh.tar.gz ];then
# download ngram
pushd $data pushd $data
wget -c http://paddlespeech.bj.bcebos.com/speechx/examples/ngram/zh/speech.ngram.zh.tar.gz wget -c http://paddlespeech.bj.bcebos.com/speechx/examples/ngram/zh/speech.ngram.zh.tar.gz
tar xvzf speech.ngram.zh.tar.gz tar xvzf speech.ngram.zh.tar.gz
@ -29,6 +30,7 @@ if [ $stage -le -1 ] && [ $stop_stage -ge -1 ]; then
fi fi
if [ ! -f $ckpt_dir/data/mean_std.json ]; then if [ ! -f $ckpt_dir/data/mean_std.json ]; then
# download model
mkdir -p $ckpt_dir mkdir -p $ckpt_dir
pushd $ckpt_dir pushd $ckpt_dir
wget -c https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/WIP1_asr0_deepspeech2_online_wenetspeech_ckpt_1.0.0a.model.tar.gz wget -c https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/WIP1_asr0_deepspeech2_online_wenetspeech_ckpt_1.0.0a.model.tar.gz
@ -43,6 +45,7 @@ if [ ! -f $unit ]; then
fi fi
if ! which ngram-count; then if ! which ngram-count; then
# need srilm install
pushd $MAIN_ROOT/tools pushd $MAIN_ROOT/tools
make srilm.done make srilm.done
popd popd
@ -71,7 +74,7 @@ lm=data/local/lm
mkdir -p $lm mkdir -p $lm
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# Train lm # Train ngram lm
cp $text $lm/text cp $text $lm/text
local/aishell_train_lms.sh local/aishell_train_lms.sh
echo "build LM done." echo "build LM done."
@ -94,8 +97,8 @@ cmvn=$data/cmvn_fbank.ark
wfst=$data/lang_test wfst=$data/lang_test
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
if [ ! -d $data/test ]; then if [ ! -d $data/test ]; then
# download test dataset
pushd $data pushd $data
wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_test.zip wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_test.zip
unzip aishell_test.zip unzip aishell_test.zip
@ -108,6 +111,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj ./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj
# convert cmvn format
cmvn-json2kaldi --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn cmvn-json2kaldi --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn
fi fi
@ -116,7 +120,7 @@ label_file=aishell_result
export GLOG_logtostderr=1 export GLOG_logtostderr=1
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# TLG decoder # recognize w/ TLG graph
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/check_tlg.log \ utils/run.pl JOB=1:$nj $data/split${nj}/JOB/check_tlg.log \
recognizer_main \ recognizer_main \
--wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \ --wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \

@ -32,6 +32,7 @@ exp=$PWD/exp
aishell_wav_scp=aishell_test.scp aishell_wav_scp=aishell_test.scp
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
if [ ! -d $data/test ]; then if [ ! -d $data/test ]; then
# download dataset
pushd $data pushd $data
wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_test.zip wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_test.zip
unzip aishell_test.zip unzip aishell_test.zip
@ -43,6 +44,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
fi fi
if [ ! -f $ckpt_dir/data/mean_std.json ]; then if [ ! -f $ckpt_dir/data/mean_std.json ]; then
# download model
mkdir -p $ckpt_dir mkdir -p $ckpt_dir
pushd $ckpt_dir pushd $ckpt_dir
wget -c https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz wget -c https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz
@ -52,6 +54,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
lm=$data/zh_giga.no_cna_cmn.prune01244.klm lm=$data/zh_giga.no_cna_cmn.prune01244.klm
if [ ! -f $lm ]; then if [ ! -f $lm ]; then
# download kenlm bin
pushd $data pushd $data
wget -c https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm wget -c https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm
popd popd
@ -68,7 +71,7 @@ export GLOG_logtostderr=1
cmvn=$data/cmvn.ark cmvn=$data/cmvn.ark
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# 3. gen linear feat # 3. convert cmvn format and compute linear feat
cmvn_json2kaldi_main --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn cmvn_json2kaldi_main --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn
./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj ./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj
@ -82,7 +85,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
fi fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# recognizer # decode w/o lm
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wolm.log \ utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wolm.log \
ctc_beam_search_decoder_main \ ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \ --feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
@ -101,7 +104,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
fi fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# decode with lm # decode w/ ngram lm with feature input
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.lm.log \ utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.lm.log \
ctc_beam_search_decoder_main \ ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \ --feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
@ -124,6 +127,7 @@ wfst=$data/wfst/
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
mkdir -p $wfst mkdir -p $wfst
if [ ! -f $wfst/aishell_graph.zip ]; then if [ ! -f $wfst/aishell_graph.zip ]; then
# download TLG graph
pushd $wfst pushd $wfst
wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph.zip wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph.zip
unzip aishell_graph.zip unzip aishell_graph.zip
@ -133,7 +137,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# TLG decoder # decoder w/ TLG graph with feature input
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wfst.log \ utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.wfst.log \
ctc_tlg_decoder_main \ ctc_tlg_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \ --feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
@ -154,7 +158,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# TLG decoder # recognize from wav file w/ TLG graph
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recognizer.log \ utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recognizer.log \
recognizer_main \ recognizer_main \
--wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \ --wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \

@@ -68,7 +68,7 @@ export GLOG_logtostderr=1
cmvn=$data/cmvn_fbank.ark
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
-# 3. gen linear feat
+# 3. convert cmvn format and compute fbank feat
cmvn_json2kaldi_main --json_file=$ckpt_dir/data/mean_std.json --cmvn_write_path=$cmvn --binary=false
./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj
@@ -82,7 +82,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
-# recognizer
+# decode w/o lm by feature
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.fbank.wolm.log \
ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/fbank_feat.scp \
@@ -100,7 +100,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
-# decode with lm
+# decode with ngram lm by feature
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.fbank.lm.log \
ctc_beam_search_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/fbank_feat.scp \
@@ -131,7 +131,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
-# TLG decoder
+# decode w/ TLG graph by feature
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/recog.fbank.wfst.log \
ctc_tlg_decoder_main \
--feature_rspecifier=scp:$data/split${nj}/JOB/fbank_feat.scp \
@@ -153,6 +153,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
+# recognize w/ TLG graph by wav
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/fbank_recognizer.log \
recognizer_main \
--wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \

@@ -0,0 +1,78 @@
# Streaming DeepSpeech2 Server with WebSocket
This example shows how to serve a streaming deepspeech2 model over `websocket`. For deepspeech2 model training, please see [here](../../../../examples/aishell/asr0/).
The websocket protocol is the same as that of [PaddleSpeech Server](../../../../demos/streaming_asr_server/);
for details of the implementation, please see [here](../../../speechx/protocol/websocket/).
## Source path.sh
```bash
. path.sh
```
The SpeechX binaries live in the directory printed by `echo $SPEECHX_BUILD`; see `path.sh` for more info.
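As a quick sanity check after sourcing, you can list the binaries (a sketch; `SPEECHX_BUILD` is the variable set by `path.sh`):

```bash
# SPEECHX_BUILD should point at the compiled SpeechX binaries
echo $SPEECHX_BUILD
ls "$SPEECHX_BUILD" | head
```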
## Start WebSocket Server
```bash
bash websocket_server.sh
```
The output looks like this:
```text
I1130 02:19:32.029882 12856 cmvn_json2kaldi_main.cc:39] cmvn josn path: /workspace/zhanghui/PaddleSpeech/speechx/examples/ds2_ol/websocket/data/model/data/mean_std.json
I1130 02:19:32.032230 12856 cmvn_json2kaldi_main.cc:73] nframe: 907497
I1130 02:19:32.032564 12856 cmvn_json2kaldi_main.cc:85] cmvn stats have write into: /workspace/zhanghui/PaddleSpeech/speechx/examples/ds2_ol/websocket/data/cmvn.ark
I1130 02:19:32.032579 12856 cmvn_json2kaldi_main.cc:86] Binary: 1
I1130 02:19:32.798342 12937 feature_pipeline.h:53] cmvn file: /workspace/zhanghui/PaddleSpeech/speechx/examples/ds2_ol/websocket/data/cmvn.ark
I1130 02:19:32.798542 12937 feature_pipeline.h:58] dither: 0
I1130 02:19:32.798583 12937 feature_pipeline.h:60] frame shift ms: 10
I1130 02:19:32.798588 12937 feature_pipeline.h:62] feature type: linear
I1130 02:19:32.798596 12937 feature_pipeline.h:80] frame length ms: 20
I1130 02:19:32.798601 12937 feature_pipeline.h:88] subsampling rate: 4
I1130 02:19:32.798606 12937 feature_pipeline.h:90] nnet receptive filed length: 7
I1130 02:19:32.798611 12937 feature_pipeline.h:92] nnet chunk size: 1
I1130 02:19:32.798615 12937 feature_pipeline.h:94] frontend fill zeros: 0
I1130 02:19:32.798630 12937 nnet_itf.h:52] subsampling rate: 4
I1130 02:19:32.798635 12937 nnet_itf.h:54] model path: /workspace/zhanghui/PaddleSpeech/speechx/examples/ds2_ol/websocket/data/model/exp/deepspeech2_online/checkpoints//avg_1.jit.pdmodel
I1130 02:19:32.798640 12937 nnet_itf.h:57] param path: /workspace/zhanghui/PaddleSpeech/speechx/examples/ds2_ol/websocket/data/model/exp/deepspeech2_online/checkpoints//avg_1.jit.pdiparams
I1130 02:19:32.798643 12937 nnet_itf.h:59] DS2 param:
I1130 02:19:32.798647 12937 nnet_itf.h:61] cache names: chunk_state_h_box,chunk_state_c_box
I1130 02:19:32.798652 12937 nnet_itf.h:63] cache shape: 5-1-1024,5-1-1024
I1130 02:19:32.798656 12937 nnet_itf.h:65] input names: audio_chunk,audio_chunk_lens,chunk_state_h_box,chunk_state_c_box
I1130 02:19:32.798660 12937 nnet_itf.h:67] output names: softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0
I1130 02:19:32.798664 12937 ctc_tlg_decoder.h:41] fst path: /workspace/zhanghui/PaddleSpeech/speechx/examples/ds2_ol/websocket/data/wfst//TLG.fst
I1130 02:19:32.798669 12937 ctc_tlg_decoder.h:42] fst symbole table: /workspace/zhanghui/PaddleSpeech/speechx/examples/ds2_ol/websocket/data/wfst//words.txt
I1130 02:19:32.798673 12937 ctc_tlg_decoder.h:47] LatticeFasterDecoder max active: 7500
I1130 02:19:32.798677 12937 ctc_tlg_decoder.h:49] LatticeFasterDecoder beam: 15
I1130 02:19:32.798681 12937 ctc_tlg_decoder.h:50] LatticeFasterDecoder lattice_beam: 7.5
I1130 02:19:32.798708 12937 websocket_server_main.cc:37] Listening at port 8082
```
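The last log line reports the listening port. To confirm the server is actually accepting connections, a standard socket query works (assuming a Linux host; `8082` is the port from the log above):

```bash
# check that something is listening on the websocket port
ss -tln | grep 8082
```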
## Start WebSocket Client
```bash
bash websocket_client.sh
```
This script uses the AISHELL-1 test data to call the websocket server.
The input is specified by `--wav_rspecifier=scp:$data/$aishell_wav_scp`.
The `scp` file looks like this:
```text
# head data/split1/1/aishell_test.scp
BAC009S0764W0121 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0121.wav
BAC009S0764W0122 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0122.wav
...
BAC009S0764W0125 /workspace/PaddleSpeech/speechx/examples/u2pp_ol/wenetspeech/data/test/S0764/BAC009S0764W0125.wav
```
If you want to recognize a single wav, you can make an `scp` file like this:
```text
key path/to/wav/file
```
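For example, a hypothetical single-utterance run could look like the sketch below (the scp path and utterance key are illustrative; `--wav_rspecifier` is the flag mentioned above):

```bash
# write a one-entry scp, then point the client at it
echo "my_utt /path/to/my.wav" > data/one_wav.scp
# inside websocket_client.sh, set:
#   --wav_rspecifier=scp:data/one_wav.scp
bash websocket_client.sh
```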

@@ -6,13 +6,14 @@ This example will demonstrate how to use the u2/u2++ model to recognize `wav`
## Testing with Aishell Test Data
-### Source `path.sh` first
+## Source path.sh
```bash
-source path.sh
+. path.sh
```
-All bins are under `echo $SPEECHX_BUILD` dir.
+The SpeechX binaries live in the directory printed by `echo $SPEECHX_BUILD`; see `path.sh` for more info.
### Download dataset and model

@@ -83,5 +83,10 @@ fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# decode with wav input
-./loca/recognizer.sh
+./local/recognizer.sh
+fi
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+# decode with wav input with quantized model
+./local/recognizer_quant.sh
fi
