{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://github.com/PaddlePaddle/PaddleSpeech\"><img style=\"position: absolute; z-index: 999; top: 0; right: 0; border: 0; width: 128px; height: 128px;\" src=\"https://nosir.github.io/cleave.js/images/right-graphite@2x.png\" alt=\"Fork me on GitHub\"></a>\n",
"\n",
"# 『听』和『说』\n",
"人类通过听觉获取的信息大约占所有感知信息的 20% ~ 30%。声音存储了丰富的语义以及时序信息,由专门负责听觉的器官接收信号,产生一系列连锁刺激后,在人类大脑的皮层听区进行处理分析,获取语义和知识。近年来,随着深度学习算法上的进步以及不断丰厚的硬件资源条件,**文本转语音Text-to-Speech, TTS** 技术在移动、虚拟娱乐等领域得到了广泛的应用。</font>\n",
"## \"听\"书\n",
"使用 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) 直接获取书籍上的文字。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# download demo sources\n",
"!mkdir download\n",
"!wget -P download https://paddlespeech.bj.bcebos.com/tutorial/tts/ocr_result.jpg\n",
"!wget -P download https://paddlespeech.bj.bcebos.com/tutorial/tts/ocr.wav\n",
"!wget -P download https://paddlespeech.bj.bcebos.com/tutorial/tts/tts_lips.mp4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import IPython.display as dp\n",
"from PIL import Image\n",
"img_path = 'download/ocr_result.jpg'\n",
"im = Image.open(img_path)\n",
"dp.display(im)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"使用 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech),阅读上一步识别出来的文字。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dp.Audio(\"download/ocr.wav\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"具体实现代码详见 [Story Talker](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/story_talker)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 偶像开口说话\n",
"*元宇宙来袭,构造你的虚拟人!* 看看 [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN) 怎样合成唇形让WiFi之母——海蒂·拉玛说话。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"html_str = '''\n",
"<video controls width=\"600\" height=\"360\" src=\"{}\">animation</video>\n",
"'''.format(\"download/tts_lips.mp4\")\n",
"dp.display(HTML(html_str))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"具体实现代码请参考 [Metaverse](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/metaverse)。\n",
"\n",
"下面让我们来系统地学习语音方面的知识,看看怎样使用 **PaddleSpeech** 实现基本的语音功能以及怎样结合光学字符识别Optical Character RecognitionOCR、自然语言处理Natural Language ProcessingNLP等技术“听”书、让名人开口说话。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 前言\n",
"## 背景知识\n",
"为了更好地了解文本转语音任务的要素,我们先简要地回顾一下文本转语音的发展历史。如果你对此已经有所了解,或希望能尽快使用代码实现,请直接跳至[实践](#实践)。\n",
"### 定义\n",
"<!----\n",
"Note: \n",
"1.此句抄自 [李沐Dive into Dive Learning](https://zh-v2.d2l.ai/chapter_introduction/index.html)\n",
"2.修改参考A survey on Neural Speech Sysnthesis.\n",
"---> \n",
"文本转语音又称语音合成Speech Sysnthesis指的是将一段文本按照一定需求转化成对应的音频这种特性决定了的输出数据比输入输入长得多。文本转语音是一项包含了语义学、声学、数字信号处理以及机器学习的等多项学科的交叉任务。虽然辨识低质量音频文件的内容对人类来说很容易但这对计算机来说并非易事。\n",
"\n",
"按照不同的应用需求,更广义的语音合成研究包括:*语音转换*,例如说话人转换、语音到歌唱转换、语音情感转换、口音转换等;*歌唱合成*,例如歌词到歌唱转换、可视语音合成等。\n",
"\n",
"### 发展历史\n",
"\n",
"<!--\n",
"以下摘自维基百科 https://en.wikipedia.org/wiki/Speech_synthesis\n",
"--->\n",
"\n",
"在第二次工业革命之前语音的合成主要以机械式的音素合成为主。1779年德裔丹麦科学家 Christian Gottlieb Kratzenstein 建造了人类的声道模型使其可以产生五个长元音。1791年 Wolfgang von Kempelen 添加了唇和舌的模型使其能够发出辅音和元音。贝尔实验室于20世纪30年代发明了声码器Vocoder将语音自动分解为音调和共振此项技术由 Homer Dudley 改进为键盘式合成器并于 1939年纽约世界博览会展出。\n",
"\n",
"第一台基于计算机的语音合成系统起源于20世纪50年代。1961年IBM 的 John Larry Kelly以及 Louis Gerstman 使用 IBM 704 计算机合成语音成为贝尔实验室最著名的成就之一。1975年第一代语音合成系统之一 —— MUSAMUltichannel Speaking Automation问世其由一个独立的硬件和配套的软件组成。1978年发行的第二个版本也可以进行无伴奏演唱。90 年代的主流是采用 MIT 和贝尔实验室的系统,并结合自然语言处理模型。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/55035de353b042cd8c4468819b2d36e2fcc89bffdf2b442fa4c7b0b5499e1592\"></center>\n",
"\n",
"### 主流方法\n",
"\n",
"当前的主流方法分为**基于统计参数的语音合成**、**波形拼接语音合成**、**混合方法**以及**端到端神经网络语音合成**。基于参数的语音合成包含隐马尔可夫模型Hidden Markov Model,HMM以及深度学习网络Deep Neural NetworkDNN。端到端的方法保函声学模型+声码器以及“完全”端到端方法。\n",
"\n",
"\n",
"## 基于深度学习的语音合成技术\n",
"\n",
"### 语音合成基本知识\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/10859679d74745ab82fb6f5c9984a95152c25b0e3dce4515b120c8997a6752d8\"></center>\n",
"<br></br>\n",
"\n",
"语音合成流水线包含 <font color=\"#ff0000\">**文本前端Text Frontend**</font> 、<font color=\"#ff0000\">**声学模型Acoustic Model**</font> 和 <font color=\"#ff0000\">**声码器Vocoder**</font> 三个主要模块:\n",
"- 通过文本前端模块将原始文本转换为字符/音素。\n",
"- 通过声学模型将字符/音素转换为声学特征如线性频谱图、mel 频谱图、LPC 特征等。\n",
"- 通过声码器将声学特征转换为波形。"
]
},
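{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a mental model, the whole pipeline is just function composition. Below is a minimal sketch; the function and argument names are illustrative rather than PaddleSpeech APIs, and the real classes are introduced in the [Practice](#Practice) section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the three-stage TTS pipeline described above.\n",
"# `frontend`, `acoustic_model` and `vocoder` stand for any callables playing\n",
"# these roles; the concrete PaddleSpeech classes are used later in this notebook.\n",
"def synthesize(text, frontend, acoustic_model, vocoder):\n",
"    phone_ids = frontend(text)       # text -> phoneme ids\n",
"    mel = acoustic_model(phone_ids)  # phoneme ids -> mel spectrogram\n",
"    wav = vocoder(mel)               # mel spectrogram -> waveform\n",
"    return wav"
]
},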
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 实践"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 安装 paddlespeech"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install --upgrade pip && pip install paddlespeech -U"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"环境安装请参考 [Installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md) 教程。 \n",
"下面使用 **PaddleSpeech** 提供的预训练模型合成中文语音。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据及模型准备"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 获取PaddlePaddle预训练模型"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget -P download https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip\n",
"!unzip -d download download/pwg_baker_ckpt_0.4.zip\n",
"!wget -P download https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip\n",
"!unzip -d download download/fastspeech2_nosil_baker_ckpt_0.4.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!tree download/pwg_baker_ckpt_0.4\n",
"!tree download/fastspeech2_nosil_baker_ckpt_0.4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 导入 Python 包"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 本项目的依赖需要用到 nltk 包,但是有时会因为网络原因导致不好下载,此处手动下载一下放到百度服务器的包\n",
"!wget https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz\n",
"!tar zxvf nltk_data.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 设置 gpu 环境\n",
"%env CUDA_VISIBLE_DEVICES=0\n",
"\n",
"import logging\n",
"import sys\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import argparse\n",
"import os\n",
"from pathlib import Path\n",
"import IPython.display as dp\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import paddle\n",
"import soundfile as sf\n",
"import yaml\n",
"from paddlespeech.t2s.frontend.zh_frontend import Frontend\n",
"from paddlespeech.t2s.models.fastspeech2 import FastSpeech2\n",
"from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference\n",
"from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator\n",
"from paddlespeech.t2s.models.parallel_wavegan import PWGInference\n",
"from paddlespeech.t2s.modules.normalizer import ZScore\n",
"from yacs.config import CfgNode"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 设置预训练模型的路径"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fastspeech2_config = \"download/fastspeech2_nosil_baker_ckpt_0.4/default.yaml\"\n",
"fastspeech2_checkpoint = \"download/fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz\"\n",
"fastspeech2_stat = \"download/fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy\"\n",
"pwg_config = \"download/pwg_baker_ckpt_0.4/pwg_default.yaml\"\n",
"pwg_checkpoint = \"download/pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz\"\n",
"pwg_stat = \"download/pwg_baker_ckpt_0.4/pwg_stats.npy\"\n",
"phones_dict = \"download/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt\"\n",
"# 读取 conf 配置文件并结构化\n",
"with open(fastspeech2_config) as f:\n",
" fastspeech2_config = CfgNode(yaml.safe_load(f))\n",
"with open(pwg_config) as f:\n",
" pwg_config = CfgNode(yaml.safe_load(f))\n",
"print(\"========Config========\")\n",
"print(fastspeech2_config)\n",
"print(\"---------------------\")\n",
"print(pwg_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 文本前端Text Frontend\n",
"\n",
"一个文本前端模块主要包含:\n",
"- 分段Text Segmentation\n",
"- 文本正则化Text Normalization, TN\n",
"- 分词Word Segmentation, 主要是在中文中)\n",
"- 词性标注Part-of-Speech, PoS\n",
"- 韵律预测Prosody\n",
"- 字音转换Grapheme-to-PhonemeG2P\n",
"<font size=2>Grapheme: **语言**书写系统的最小有意义单位; Phoneme: 区分单词的最小**语音**单位)</font>\n",
" - 多音字Polyphone\n",
" - 变调Tone Sandhi\n",
" - “一”、“不”变\n",
" - 三声变调\n",
" - 轻声变调\n",
" - 儿化音\n",
" - 方言\n",
"- ...\n",
"\n",
"(输入给声学模型之前,还需要把音素序列转换为 id\n",
"\n",
"\n",
"其中最重要的模块是<font color=\"#ff0000\"> 文本正则化 </font>模块和<font color=\"#ff0000\"> 字音转换TTS 中更常用 G2P 代指) </font>模块。\n",
"\n",
"\n",
"各模块输出示例:\n",
"```text\n",
"• Text: 全国一共有112所211高校\n",
"• Text Normalization: 全国一共有一百一十二所二一一高校\n",
"• Word Segmentation: 全国/一共/有/一百一十二/所/二一一/高校/\n",
"• G2P注意此句中“一”的读音:\n",
" quan2 guo2 yi2 gong4 you3 yi4 bai3 yi1 shi2 er4 suo3 er4 yao1 yao1 gao1 xiao4\n",
" (可以进一步把声母和韵母分开)\n",
" q uan2 g uo2 y i2 g ong4 y ou3 y i4 b ai3 y i1 sh i2 er4 s uo3 er4 y ao1 y ao1 g ao1 x iao4\n",
" (把音调和声韵母分开)\n",
" q uan g uo y i g ong y ou y i b ai y i sh i er s uo er y ao y ao g ao x iao\n",
" 0 2 0 2 0 2 0 4 0 3 ...\n",
"• Prosody (prosodic words #1, prosodic phrases #2, intonation phrases #3, sentence #4):\n",
" 全国#2一共有#2一百#1一十二所#2二一一#1高校#4\n",
" (分词的结果一般是固定的,但是不同人习惯不同,可能有不同的韵律)\n",
"```\n",
"\n",
"文本前端模块的设计需要结合很多专业的语义学知识和经验。人类在读文本的时候可以自然而然地读出正确的发音,但是这些先验知识计算机并不知晓。\n",
"例如,对于一个句子的分词:\n",
"\n",
"```text\n",
"我也想过过过儿过过的生活\n",
"我也想/过过/过儿/过过的/生活\n",
"\n",
"货拉拉拉不拉拉布拉多\n",
"货拉拉/拉不拉/拉布拉多\n",
"\n",
"南京市长江大桥\n",
"南京市长/江大桥\n",
"南京市/长江大桥\n",
"```\n",
"或者是词的变调和儿化音:\n",
"```\n",
"你要不要和我们一起出去玩?\n",
"你要不2声要和我们一4声起出去玩\n",
"\n",
"不好,我要一个人出去。\n",
"不4声我要一2声个人出去。\n",
"\n",
"(以下每个词的所有字都是三声的,请你读一读,体会一下在读的时候,是否每个字都被读成了三声?)\n",
"纸老虎、虎骨酒、展览馆、岂有此理、手表厂有五种好产品\n",
"```\n",
"又或是多音字,这类情况通常需要先正确分词:\n",
"```text\n",
"人要行,干一行行一行,一行行行行行;\n",
"人要是不行,干一行不行一行,一行不行行行不行。\n",
"\n",
"佟大为妻子产下一女\n",
"\n",
"海水朝朝朝朝朝朝朝落\n",
"浮云长长长长长长长消\n",
"```\n",
"\n",
"PaddleSpeech Text-to-Speech的文本前端解决方案:\n",
"- [文本正则](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/tn)\n",
"- [G2P](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p)\n",
" - 多音字模块: pypinyin/g2pM\n",
" - 变调模块: 用分词 + 规则"
]
},
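{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building PaddleSpeech's full frontend, here is a quick G2P illustration with pypinyin, one of the polyphone backends listed above. This is only a sketch: note that pypinyin leaves the digits untouched, which is exactly why a text normalization module has to run first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal G2P sketch with pypinyin (install it first if needed: pip install pypinyin)\n",
"from pypinyin import lazy_pinyin, Style\n",
"\n",
"text = \"全国一共有112所211高校\"\n",
"# Style.TONE3 appends the tone number to each syllable, e.g. 'quan2'\n",
"print(lazy_pinyin(text, style=Style.TONE3))\n",
"# The digits '112' and '211' pass through unchanged: text normalization\n",
"# has to expand them into Chinese numerals before G2P, as described above."
]
},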
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 构造文本前端对象\n",
"传入`phones_dict`,把相应的`phones`转换成`phone_ids`。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 传入 phones_dict 会把相应的 phones 转换成 phone_ids\n",
"frontend = Frontend(phone_vocab_path=phones_dict)\n",
"print(\"Frontend done!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 调用文本前端\n",
"\n",
"文本前端对输入数据进行正则化时会进行分句,若`merge_sentences`设置为`False`,则所有分句的 `phone_ids` 构成一个 `List`;若设置为`True``input_ids[\"phone_ids\"][0]`则表示整句的`phone_ids`。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# input = \"我每天中午12:00起床\"\n",
"# input = \"我出生于2005/11/08那天的最低气温达到-10°C\"\n",
"input = \"你好,欢迎使用百度飞桨框架进行深度学习研究!\"\n",
"input_ids = frontend.get_input_ids(input, merge_sentences=True, print_info=True)\n",
"phone_ids = input_ids[\"phone_ids\"][0]\n",
"print(\"phone_ids:%s\"%phone_ids)"
]
},
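{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the `merge_sentences=False` behavior described above, the following sketch reuses the `frontend` object and `input` from the previous cell; each sub-sentence then gets its own `phone_ids` entry."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# with merge_sentences=False, phone_ids holds one entry per sub-sentence\n",
"input_ids = frontend.get_input_ids(input, merge_sentences=False, print_info=False)\n",
"for i, ids in enumerate(input_ids[\"phone_ids\"]):\n",
"    print(\"sub-sentence %d: %d phones\" % (i, len(ids)))"
]
},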
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 用深度学习实现文本前端\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/85a5cd8aef1e444cbb980a2f1f184316247bbb7870a34925a77b799802df8ef0\"></center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 声学模型Acoustic Model\n",
"\n",
"声学模型将字符/音素转换为声学特征如线性频谱图、mel 频谱图、LPC 特征等,声学特征以 “帧” 为单位,一般一帧是 10ms 左右,一个音素一般对应 5~20 帧左右, 声学模型需要解决的是 <font color=\"#ff0000\">“不等长序列间的映射问题”</font>,“不等长”是指,同一个人发不同音素的持续时间不同,同一个人在不同时刻说同一句话的语速可能不同,对应各个音素的持续时间不同,不同人说话的特色不同,对应各个音素的持续时间不同。这是一个困难的“一对多”问题。\n",
"```\n",
"# 卡尔普陪外孙玩滑梯\n",
"000001|baker_corpus|sil 20 k 12 a2 4 er2 10 p 12 u3 12 p 9 ei2 9 uai4 15 s 11 uen1 12 uan2 14 h 10 ua2 11 t 15 i1 16 sil 20\n",
"```\n",
"\n",
"声学模型主要分为自回归模型和非自回归模型,其中自回归模型在 `t` 时刻的预测需要依赖 `t-1` 时刻的输出作为输入,预测时间长,但是音质相对较好,非自回归模型不存在预测上的依赖关系,预测时间快,音质相对较差。\n",
"\n",
"主流声学模型发展的脉络:\n",
"- 自回归模型:\n",
" - Tacotron\n",
" - Tacotron2\n",
" - Transformer TTS\n",
"- 非自回归模型:\n",
" - FastSpeech\n",
" - SpeedySpeech\n",
" - FastPitch\n",
" - FastSpeech2\n",
" - ...\n",
" \n",
"在本教程中,我们使用 `FastSpeech2` 作为声学模型。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/6b6d671713ec4d20a0e60653c7a5d4ae3c35b1d1e58b4cc39e0bc82ad4a341d9\"></center>\n",
"<br><center> FastSpeech2 网络结构图</center></br>\n",
"\n",
"\n",
"PaddleSpeech TTS 实现的 FastSpeech2 与论文不同的地方在于,我们使用的是 phone 级别的 `pitch` 和 `energy`(与 FastPitch 类似),这样的合成结果可以更加**稳定**。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/862c21456c784c41a83a308b7d9707f0810cc3b3c6f94ed48c60f5d32d0072f0\"></center>\n",
"<br><center> FastPitch 网络结构图</center></br>\n",
"\n",
"更多关于[语音合成模型的发展及改进](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md)。"
]
},
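{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the frame bookkeeping concrete, here is a sketch that parses the Baker-corpus duration annotation shown above. The 12.5 ms per frame figure is an assumption based on this corpus's configuration (fs=24000 with a hop size of 300 samples); check the loaded `fastspeech2_config` if your setup differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# parse the duration annotation above: alternating phone / frame-count pairs\n",
"ann = (\"sil 20 k 12 a2 4 er2 10 p 12 u3 12 p 9 ei2 9 uai4 15 s 11 \"\n",
"       \"uen1 12 uan2 14 h 10 ua2 11 t 15 i1 16 sil 20\")\n",
"tokens = ann.split()\n",
"phones = tokens[0::2]\n",
"frames = [int(n) for n in tokens[1::2]]\n",
"ms_per_frame = 300 / 24000 * 1000  # assumed hop of 300 samples at 24 kHz = 12.5 ms\n",
"for p, n in zip(phones, frames):\n",
"    print(\"%4s: %2d frames = %6.1f ms\" % (p, n, n * ms_per_frame))\n",
"print(\"total: %.2f s\" % (sum(frames) * ms_per_frame / 1000))"
]
},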
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 初始化声学模型 FastSpeech2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with open(phones_dict, \"r\") as f:\n",
" phn_id = [line.strip().split() for line in f.readlines()]\n",
"vocab_size = len(phn_id)\n",
"print(\"vocab_size:\", vocab_size)\n",
"odim = fastspeech2_config.n_mels\n",
"model = FastSpeech2(\n",
" idim=vocab_size, odim=odim, **fastspeech2_config[\"model\"])\n",
"# 加载预训练模型参数\n",
"model.set_state_dict(paddle.load(fastspeech2_checkpoint)[\"main_params\"])\n",
"# 推理阶段不启用 batch norm 和 dropout\n",
"model.eval()\n",
"stat = np.load(fastspeech2_stat)\n",
"# 读取数据预处理阶段数据集的均值和标准差\n",
"mu, std = stat\n",
"mu, std = paddle.to_tensor(mu), paddle.to_tensor(std)\n",
"# 构造归一化的新模型\n",
"fastspeech2_normalizer = ZScore(mu, std)\n",
"fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)\n",
"fastspeech2_inference.eval()\n",
"print(\"FastSpeech2 done!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 调用声学模型"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with paddle.no_grad():\n",
" mel = fastspeech2_inference(phone_ids)\n",
"print(\"shepe of mel (n_frames x n_mels):\")\n",
"print(mel.shape)\n",
"# 绘制声学模型输出的 mel 频谱\n",
"fig, ax = plt.subplots(figsize=(16, 6))\n",
"im = ax.imshow(mel.T, aspect='auto',origin='lower')\n",
"plt.title('Mel Spectrogram')\n",
"plt.xlabel('Time')\n",
"plt.ylabel('Frequency')\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 声码器Vocoder\n",
"声码器将声学特征转换为波形。声码器需要解决的是 <font color=\"#ff0000\">“信息缺失的补全问题”</font>。信息缺失是指,在音频波形转换为频谱图的时候,存在**相位信息**的缺失,在频谱图转换为 mel 频谱图的时候,存在**频域压缩**导致的信息缺失假设音频的采样率是16kHZ, 一帧的音频有 10ms也就是说1s 的音频有 16000 个采样点,而 1s 中包含 100 帧,每一帧有 160 个采样点,声码器的作用就是将一个频谱帧变成音频波形的 160 个采样点,所以声码器中一般会包含**上采样**模块。\n",
"\n",
"与声学模型类似,声码器也分为自回归模型和非自回归模型, 更细致的分类如下:\n",
"\n",
"- Autoregression\n",
" - WaveNet\n",
" - WaveRNN\n",
" - LPCNet\n",
"- Flow\n",
" - <font color=\"#ff0000\">WaveFlow</font>\n",
" - WaveGlow\n",
" - FloWaveNet\n",
" - Parallel WaveNet\n",
"- GAN\n",
"\t- WaveGAN\n",
" - <font color=\"#ff0000\">Parallel WaveGAN</font>\n",
" - <font color=\"#ff0000\">MelGAN</font>\n",
" - <font color=\"#ff0000\">Style MelGAN</font>\n",
" - <font color=\"#ff0000\">Multi Band MelGAN</font>\n",
" - <font color=\"#ff0000\">HiFi GAN</font>\n",
"- VAE\n",
" - Wave-VAE\n",
"- Diffusion\n",
" - WaveGrad\n",
" - DiffWave\n",
"\n",
"PaddleSpeech TTS 主要实现了百度的 `WaveFlow` 和一些主流的 GAN Vocoder, 在本教程中,我们使用 `Parallel WaveGAN` 作为声码器。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/9eafa4e5642d45309e6e8883bff46380407b3858d0934bf5896868281316ce94\" width=\"700\"></center>\n",
"<br><center>图1Parallel WaveGAN 网络结构图</center></br>\n",
"\n",
"各 GAN Vocoder 的生成器和判别器的 Loss 的区别如下表格所示:\n",
"\n",
"Model | Generator Loss |Discriminator Loss\n",
":-------------:| :------------:| :-----\n",
"Mel GAN| adversial loss <br> Feature Matching | Multi-Scale Discriminator |\n",
"Parallel Wave GAN|adversial loss <br> Multi-resolution STFT loss | adversial loss|\n",
"Multi-Band Mel GAN | adversial loss <br> full band Multi-resolution STFT loss <br> sub band Multi-resolution STFT loss |Multi-Scale Discriminator|\n",
"HiFi GAN |adversial loss <br> Feature Matching <br> Mel-Spectrogram Loss | Multi-Scale Discriminator <br> Multi-Period Discriminator| \n"
]
},
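{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the promised sanity check of the frame/sample arithmetic, using this notebook's actual configuration instead of the illustrative 16 kHz figures. It assumes the loaded CSMSC config exposes `fs` and `n_shift`, and reuses the `mel` computed by FastSpeech2 above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# frame/sample bookkeeping for the models used in this notebook\n",
"fs = fastspeech2_config.fs        # 24000 for these CSMSC models\n",
"hop = fastspeech2_config.n_shift  # hop size in samples per frame\n",
"print(\"ms per frame:\", 1000 * hop / fs)\n",
"print(\"samples per frame (vocoder upsampling factor):\", hop)\n",
"print(\"expected waveform length for %d frames: %d samples\" % (mel.shape[0], mel.shape[0] * hop))"
]
},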
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 初始化声码器 Parallel WaveGAN"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vocoder = PWGGenerator(**pwg_config[\"generator_params\"])\n",
"# 模型加载预训练参数\n",
"vocoder.set_state_dict(paddle.load(pwg_checkpoint)[\"generator_params\"])\n",
"vocoder.remove_weight_norm()\n",
"# 推理阶段不启用 batch norm 和 dropout\n",
"vocoder.eval()\n",
"# 读取数据预处理阶段数据集的均值和标准差\n",
"stat = np.load(pwg_stat)\n",
"mu, std = stat\n",
"mu, std = paddle.to_tensor(mu), paddle.to_tensor(std)\n",
"pwg_normalizer = ZScore(mu, std)\n",
"# 构建归一化的模型\n",
"pwg_inference = PWGInference(pwg_normalizer, vocoder)\n",
"pwg_inference.eval()\n",
"print(\"Parallel WaveGAN done!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 调用声码器"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with paddle.no_grad():\n",
" wav = pwg_inference(mel)\n",
"print(\"shepe of wav (time x n_channels):%s\"%wav.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 绘制声码器输出的波形图\n",
"wave_data = wav.numpy().T\n",
"time = np.arange(0, wave_data.shape[1]) * (1.0 / fastspeech2_config.fs)\n",
"fig, ax = plt.subplots(figsize=(16, 6))\n",
"plt.plot(time, wave_data[0])\n",
"plt.title('Waveform')\n",
"plt.xlabel('Time (seconds)')\n",
"plt.ylabel('Amplitude (normed)')\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 播放音频"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dp.Audio(wav.numpy().T, rate=fastspeech2_config.fs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 保存音频"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir output\n",
"sf.write(\n",
" \"output/output.wav\",\n",
" wav.numpy(),\n",
" samplerate=fastspeech2_config.fs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 进阶 —— 个性化调节\n",
"FastSpeech2 模型可以个性化地调节音素时长、音调和能量,通过一些简单的调节就可以获得一些有意思的效果。\n",
"\n",
"例如对于以下的原始音频`\"凯莫瑞安联合体的经济崩溃,迫在眉睫\"`。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 原始音频\n",
"dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speed/x1_001.wav\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# speed x 1.2\n",
"dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speed/x1.2_001.wav\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# speed x 0.8\n",
"dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speed/x0.8_001.wav\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pitch x 1.3(童声)\n",
"dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/child_voice/001.wav\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# robot\n",
"dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/robot/001.wav\"))"
]
},
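{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a naive waveform-domain speedup applied to the `wav` synthesized earlier. This is explicitly *not* FastSpeech2's mechanism: plain resampling shifts the pitch along with the speed, whereas FastSpeech2 scales the predicted phoneme durations and leaves the pitch intact."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# naive waveform-domain speed change (NOT FastSpeech2's duration control:\n",
"# index-skipping resampling raises the pitch together with the speed)\n",
"speed = 1.2\n",
"w = wav.numpy().flatten()\n",
"idx = np.arange(0, len(w), speed).astype(\"int64\")\n",
"dp.display(dp.Audio(w[idx], rate=fastspeech2_config.fs))"
]
},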
{
"cell_type": "markdown",
"metadata": {},
"source": [
"具体实现代码请参考 [Style FastSpeech2](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/style_fs2)。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 用 PaddleSpeech 训练 TTS 模型\n",
"PaddleSpeech 的 examples 是按照 数据集/模型 的结构安排的:\n",
"```text\n",
"examples \n",
"├── aishell3\n",
"│ ├── README.md\n",
"│ ├── tts3\n",
"│ └── vc0\n",
"├── csmsc\n",
"│ ├── README.md\n",
"│ ├── tts2\n",
"│ ├── tts3\n",
"│ ├── voc1\n",
"│ └── voc3\n",
"├── ...\n",
"└── ...\n",
"```\n",
"我们在每个数据集的 README.md 介绍了子目录和模型的对应关系, 在 TTS 中有如下对应关系:\n",
"```text\n",
"tts0 - Tacotron2\n",
"tts1 - TransformerTTS\n",
"tts2 - SpeedySpeech\n",
"tts3 - FastSpeech2\n",
"voc0 - WaveFlow\n",
"voc1 - Parallel WaveGAN\n",
"voc2 - MelGAN\n",
"voc3 - MultiBand MelGAN\n",
"```\n",
"### 基于 CSMCS 数据集训练 FastSpeech2 模型\n",
"```bash\n",
"git clone https://github.com/PaddlePaddle/PaddleSpeech.git\n",
"cd examples/csmsc/tts3\n",
"```\n",
"根据 README.md, 下载 CSMCS 数据集和其对应的强制对齐文件, 并放置在对应的位置\n",
"```bash\n",
"./run.sh\n",
"```\n",
"`run.sh` 中包含预处理、训练、合成、静态图推理等步骤:\n",
"\n",
"```bash\n",
"#!/bin/bash\n",
"set -e\n",
"source path.sh\n",
"gpus=0,1\n",
"stage=0\n",
"stop_stage=100\n",
"conf_path=conf/default.yaml\n",
"train_output_path=exp/default\n",
"ckpt_name=snapshot_iter_153.pdz\n",
"\n",
"# with the following command, you can choice the stage range you want to run\n",
"# such as `./run.sh --stage 0 --stop-stage 0`\n",
"# this can not be mixed use with `$1`, `$2` ...\n",
"source ${MAIN_ROOT}/utils/parse_options.sh || exit 1\n",
"\n",
"if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n",
" # prepare data\n",
" bash ./local/preprocess.sh ${conf_path} || exit -1\n",
"fi\n",
"if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n",
" # train model, all `ckpt` under `train_output_path/checkpoints/` dir\n",
" CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1\n",
"fi\n",
"if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n",
" # synthesize, vocoder is pwgan\n",
" CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1\n",
"fi\n",
"if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n",
" # synthesize_e2e, vocoder is pwgan\n",
" CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1\n",
"fi\n",
"if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then\n",
" # inference with static model\n",
" CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1\n",
"fi\n",
"```\n",
"\n",
"### 基于 CSMCS 数据集训练 Parallel WaveGAN 模型\n",
"```bash\n",
"git clone https://github.com/PaddlePaddle/PaddleSpeech.git\n",
"cd examples/csmsc/voc1\n",
"```\n",
"根据 README.md, 下载 CSMCS 数据集和其对应的强制对齐文件, 并放置在对应的位置\n",
"```bash\n",
"./run.sh\n",
"```\n",
"`run.sh` 中包含预处理、训练、合成等步骤:\n",
"```bash\n",
"#!/bin/bash\n",
"set -e\n",
"source path.sh\n",
"gpus=0,1\n",
"stage=0\n",
"stop_stage=100\n",
"conf_path=conf/default.yaml\n",
"train_output_path=exp/default\n",
"ckpt_name=snapshot_iter_5000.pdz\n",
"\n",
"# with the following command, you can choice the stage range you want to run\n",
"# such as `./run.sh --stage 0 --stop-stage 0`\n",
"# this can not be mixed use with `$1`, `$2` ...\n",
"source ${MAIN_ROOT}/utils/parse_options.sh || exit 1\n",
"\n",
"if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n",
" # prepare data\n",
" ./local/preprocess.sh ${conf_path} || exit -1\n",
"fi\n",
"if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n",
" # train model, all `ckpt` under `train_output_path/checkpoints/` dir\n",
" CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1\n",
"fi\n",
"if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n",
" # synthesize\n",
" CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1\n",
"fi\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FAQ\n",
"\n",
"- 需要注意的问题\n",
"- 经验与分享\n",
"- 用户的其他问题"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 作业\n",
"在 CSMSC 数据集上利用 FastSpeech2 和 Parallel WaveGAN 实现一个中文 TTS 系统。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 关注 PaddleSpeech\n",
"请关注我们的 [Github Repo](https://github.com/PaddlePaddle/PaddleSpeech/),非常欢迎加入以下微信群参与讨论:\n",
"- 扫描二维码\n",
"- 添加运营小姐姐微信\n",
"- 通过后回复【语音】\n",
"- 系统自动邀请加入技术群\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/bca0bc75dce14b53af44e374e64fc91aeeb13c075c894d6aabed033148f65377\" ></center>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}