{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 『听』和『说』\n", "人类通过听觉获取的信息大约占所有感知信息的 20% ~ 30%。声音存储了丰富的语义以及时序信息,由专门负责听觉的器官接收信号,产生一系列连锁刺激后,在人类大脑的皮层听区进行处理分析,获取语义和知识。近年来,随着深度学习算法的进步以及硬件资源的不断丰富,**文本转语音(Text-to-Speech, TTS)** 技术在移动、虚拟娱乐等领域得到了广泛的应用。\n", "## “听”书\n", "使用 [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) 直接获取书籍上的文字。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# download demo sources\n", "!mkdir download\n", "!wget -P download https://paddlespeech.bj.bcebos.com/tutorial/tts/ocr_result.jpg\n", "!wget -P download https://paddlespeech.bj.bcebos.com/tutorial/tts/ocr.wav\n", "!wget -P download https://paddlespeech.bj.bcebos.com/tutorial/tts/tts_lips.mp4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import IPython.display as dp\n", "from PIL import Image\n", "img_path = 'download/ocr_result.jpg'\n", "im = Image.open(img_path)\n", "dp.display(im)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech),阅读上一步识别出来的文字。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dp.Audio(\"download/ocr.wav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "具体实现代码详见 [Story Talker](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/story_talker)。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 偶像开口说话\n", "*元宇宙来袭,构造你的虚拟人!* 看看 [PaddleGAN](https://github.com/PaddlePaddle/PaddleGAN) 怎样合成唇形,让 WiFi 之母——海蒂·拉玛说话。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML\n", "# 用 HTML video 标签在 notebook 中内嵌视频(标签属性为示意写法)\n", "html_str = '''\n", "<video controls width=\"600\" src=\"{}\"></video>\n", "'''.format(\"download/tts_lips.mp4\")\n", "dp.display(HTML(html_str))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "具体实现代码请参考 [Metaverse](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/metaverse)。\n", "\n", "下面让我们来系统地学习语音方面的知识,看看怎样使用 **PaddleSpeech** 实现基本的语音功能,以及怎样结合光学字符识别(Optical Character Recognition,OCR)、自然语言处理(Natural Language Processing,NLP)等技术“听”书、让名人开口说话。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 前言\n", "## 背景知识\n", "为了更好地了解文本转语音任务的要素,我们先简要地回顾一下文本转语音的发展历史。如果你对此已经有所了解,或希望能尽快使用代码实现,请直接跳至[实践](#实践)。\n", "### 定义\n", "\n", "文本转语音,又称语音合成(Speech Synthesis),指的是将一段文本按照一定需求转化成对应的音频,这一特性决定了它的输出数据比输入数据长得多。文本转语音是一项涉及语义学、声学、数字信号处理以及机器学习等多项学科的交叉任务。虽然辨识低质量音频文件的内容对人类来说很容易,但这对计算机来说并非易事。\n", "\n", "按照不同的应用需求,更广义的语音合成研究包括:*语音转换*,例如说话人转换、语音到歌唱转换、语音情感转换、口音转换等;*歌唱合成*,例如歌词到歌唱转换、可视语音合成等。\n", "\n", "### 发展历史\n", "\n", "在第二次工业革命之前,语音的合成主要以机械式的音素合成为主。1779年,德裔丹麦科学家 Christian Gottlieb Kratzenstein 建造了人类声道的模型,使其可以产生五个长元音。1791年,Wolfgang von Kempelen 添加了唇和舌的模型,使其能够发出辅音和元音。贝尔实验室于20世纪30年代发明了声码器(Vocoder),将语音自动分解为音调和共振,此项技术由 Homer Dudley 改进为键盘式合成器,并于 1939 年在纽约世界博览会上展出。\n", "\n", "第一批基于计算机的语音合成系统诞生于20世纪50年代。1961年,贝尔实验室的 John Larry Kelly 和 Louis Gerstman 使用 IBM 704 计算机合成语音,这成为贝尔实验室最著名的成就之一。1975年,第一代语音合成系统之一 —— MUSA(MUltichannel Speaking Automation)问世,其由一个独立的硬件和配套的软件组成。1978年发行的第二个版本也可以进行无伴奏演唱。90 年代的主流是采用 MIT 和贝尔实验室的系统,并结合自然语言处理模型。\n", "\n",
"### 主流方法\n", "\n", "当前的主流方法分为**基于统计参数的语音合成**、**波形拼接语音合成**、**混合方法**以及**端到端神经网络语音合成**。其中,基于统计参数的语音合成包含隐马尔可夫模型(Hidden Markov Model,HMM)方法以及深度神经网络(Deep Neural Network,DNN)方法;端到端的方法包含“声学模型 + 声码器”以及“完全”端到端方法。\n", "\n", "\n", "## 基于深度学习的语音合成技术\n", "\n", "### 语音合成基本知识\n", "\n",
"语音合成流水线包含 **文本前端(Text Frontend)**、**声学模型(Acoustic Model)** 和 **声码器(Vocoder)** 三个主要模块:\n", "- 通过文本前端模块将原始文本转换为字符/音素。\n", "- 通过声学模型将字符/音素转换为声学特征,如线性频谱图、mel 频谱图、LPC 特征等。\n", "- 通过声码器将声学特征转换为波形。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 实践" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 安装 paddlespeech" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade pip && pip install paddlespeech -U" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "环境安装请参考 [Installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md) 教程。 \n", "下面使用 **PaddleSpeech** 提供的预训练模型合成中文语音。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据及模型准备" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 获取 PaddlePaddle 预训练模型" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -P download https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip\n", "!unzip -d download download/pwg_baker_ckpt_0.4.zip\n", "!wget -P download https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip\n", "!unzip -d download download/fastspeech2_nosil_baker_ckpt_0.4.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!tree download/pwg_baker_ckpt_0.4\n", "!tree download/fastspeech2_nosil_baker_ckpt_0.4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 导入 Python 包" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 本项目依赖 nltk 包,其数据有时因网络原因难以直接下载,此处手动下载预先存放在百度服务器上的副本\n", "!wget https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz\n", "!tar zxvf nltk_data.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 设置 GPU 环境\n", "%env CUDA_VISIBLE_DEVICES=0\n", "\n", "import logging\n", "import sys\n", "import warnings\n", "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import argparse\n", "import os\n", "from pathlib import Path\n", "import IPython.display as dp\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import paddle\n", "import soundfile as sf\n", "import yaml\n", "from paddlespeech.t2s.frontend.zh_frontend import Frontend\n", "from paddlespeech.t2s.models.fastspeech2 import FastSpeech2\n", "from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference\n", "from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator\n", "from paddlespeech.t2s.models.parallel_wavegan import PWGInference\n", "from paddlespeech.t2s.modules.normalizer import ZScore\n", "from yacs.config import CfgNode" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 设置预训练模型的路径" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fastspeech2_config = \"download/fastspeech2_nosil_baker_ckpt_0.4/default.yaml\"\n", "fastspeech2_checkpoint = \"download/fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz\"\n", "fastspeech2_stat = \"download/fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy\"\n", "pwg_config = \"download/pwg_baker_ckpt_0.4/pwg_default.yaml\"\n", "pwg_checkpoint = \"download/pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz\"\n",
"pwg_stat = \"download/pwg_baker_ckpt_0.4/pwg_stats.npy\"\n", "phones_dict = \"download/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt\"\n", "# 读取 conf 配置文件并结构化\n", "with open(fastspeech2_config) as f:\n", " fastspeech2_config = CfgNode(yaml.safe_load(f))\n", "with open(pwg_config) as f:\n", " pwg_config = CfgNode(yaml.safe_load(f))\n", "print(\"========Config========\")\n", "print(fastspeech2_config)\n", "print(\"---------------------\")\n", "print(pwg_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 文本前端(Text Frontend)\n", "\n", "一个文本前端模块主要包含:\n", "- 分段(Text Segmentation)\n", "- 文本正则化(Text Normalization, TN)\n", "- 分词(Word Segmentation, 主要是在中文中)\n", "- 词性标注(Part-of-Speech, PoS)\n", "- 韵律预测(Prosody)\n", "- 字音转换(Grapheme-to-Phoneme,G2P)\n", "(Grapheme: **语言**书写系统的最小有意义单位;Phoneme: 区分单词的最小**语音**单位)\n", "    - 多音字(Polyphone)\n", "    - 变调(Tone Sandhi)\n", "        - “一”、“不”变调\n", "        - 三声变调\n", "        - 轻声变调\n", "        - 儿化音\n", "        - 方言\n", "- ...\n", "\n", "(输入给声学模型之前,还需要把音素序列转换为 id)\n", "\n", "其中最重要的模块是**文本正则化**模块和**字音转换**(TTS 中更常用 G2P 代指)模块。\n", "\n", "各模块输出示例:\n", "```text\n", "• Text: 全国一共有112所211高校\n", "• Text Normalization: 全国一共有一百一十二所二一一高校\n", "• Word Segmentation: 全国/一共/有/一百一十二/所/二一一/高校/\n", "• G2P(注意此句中“一”的读音):\n", "    quan2 guo2 yi2 gong4 you3 yi4 bai3 yi1 shi2 er4 suo3 er4 yao1 yao1 gao1 xiao4\n", "    (可以进一步把声母和韵母分开)\n", "    q uan2 g uo2 y i2 g ong4 y ou3 y i4 b ai3 y i1 sh i2 er4 s uo3 er4 y ao1 y ao1 g ao1 x iao4\n", "    (把音调和声韵母分开)\n", "    q uan g uo y i g ong y ou y i b ai y i sh i er s uo er y ao y ao g ao x iao\n", "    0 2 0 2 0 2 0 4 0 3 ...\n", "• Prosody (prosodic words #1, prosodic phrases #2, intonation phrases #3, sentence #4):\n", "    全国#2一共有#2一百#1一十二所#2二一一#1高校#4\n", "    (分词的结果一般是固定的,但是不同人习惯不同,可能有不同的韵律)\n", "```\n", "\n", "文本前端模块的设计需要结合很多专业的语义学知识和经验。人类在读文本的时候可以自然而然地读出正确的发音,但是这些先验知识计算机并不知晓。\n", "例如,对于一个句子的分词:\n", "\n", "```text\n", "我也想过过过儿过过的生活\n", "我也想/过过/过儿/过过的/生活\n", "\n", "货拉拉拉不拉拉布拉多\n", "货拉拉/拉不拉/拉布拉多\n", "\n", "南京市长江大桥\n", "南京市长/江大桥\n", "南京市/长江大桥\n", "```\n", "或者是词的变调和儿化音:\n", "```\n", "你要不要和我们一起出去玩?\n", "你要不(2声)要和我们一(4声)起出去玩(儿)?\n", "\n", "不好,我要一个人出去。\n", "不(4声)好,我要一(2声)个人出去。\n", "\n", "(以下每个词的所有字都是三声的,请你读一读,体会一下在读的时候,是否每个字都被读成了三声?)\n", "纸老虎、虎骨酒、展览馆、岂有此理、手表厂有五种好产品\n", "```\n", "又或是多音字,这类情况通常需要先正确分词:\n", "```text\n", "人要行,干一行行一行,一行行行行行;\n", "人要是不行,干一行不行一行,一行不行行行不行。\n", "\n", "佟大为妻子产下一女\n", "\n", "海水朝朝朝朝朝朝朝落\n", "浮云长长长长长长长消\n", "```\n", "\n", "PaddleSpeech Text-to-Speech 的文本前端解决方案:\n", "- [文本正则](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/tn)\n", "- [G2P](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p)\n", "    - 多音字模块: pypinyin/g2pM\n", "    - 变调模块: 用分词 + 规则" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 构造文本前端对象\n", "传入 `phones_dict`,把相应的 `phones` 转换成 `phone_ids`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 传入 phones_dict 会把相应的 phones 转换成 phone_ids\n", "frontend = Frontend(phone_vocab_path=phones_dict)\n", "print(\"Frontend done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 调用文本前端\n", "\n", "文本前端对输入数据进行正则化时会进行分句。若 `merge_sentences` 设置为 `False`,则所有分句的 `phone_ids` 构成一个 `List`;若设置为 `True`,则 `input_ids[\"phone_ids\"][0]` 表示整句的 `phone_ids`。下方代码单元中也给出了 `merge_sentences=False` 的示意用法。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# input = \"我每天中午12:00起床\"\n", "# input = \"我出生于2005/11/08,那天的最低气温达到-10°C\"\n", "input = \"你好,欢迎使用百度飞桨框架进行深度学习研究!\"\n", "input_ids = frontend.get_input_ids(input, merge_sentences=True, print_info=True)\n",
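"# 补充示例:merge_sentences=False 时,每个分句各自返回一组 phone_ids,\n", "# 可以用下面的写法逐句查看(示意用法,取消注释后运行):\n", "# for ids in frontend.get_input_ids(input, merge_sentences=False)[\"phone_ids\"]:\n", "#     print(ids)\n",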
"phone_ids = input_ids[\"phone_ids\"][0]\n", "print(\"phone_ids:%s\"%phone_ids)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 用深度学习实现文本前端\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 声学模型(Acoustic Model)\n", "\n", "声学模型将字符/音素转换为声学特征,如线性频谱图、mel 频谱图、LPC 特征等。声学特征以“帧”为单位,一般一帧是 10ms 左右,一个音素一般对应 5~20 帧。声学模型需要解决的是“不等长序列间的映射问题”:同一个人发不同音素的持续时间不同;同一个人在不同时刻说同一句话的语速可能不同,各音素的持续时间随之变化;不同人说话的特点不同,各音素的持续时间同样不同。因此这是一个困难的“一对多”问题。下面是 CSMSC(baker)数据集中一条“音素 + 时长(帧数)”的标注示例(本节稍后给出解析这行标注的示意代码):\n", "```\n", "# 卡尔普陪外孙玩滑梯\n", "000001|baker_corpus|sil 20 k 12 a2 4 er2 10 p 12 u3 12 p 9 ei2 9 uai4 15 s 11 uen1 12 uan2 14 h 10 ua2 11 t 15 i1 16 sil 20\n", "```\n", "\n", "声学模型主要分为自回归模型和非自回归模型:自回归模型在 `t` 时刻的预测需要依赖 `t-1` 时刻的输出作为输入,预测时间长,但音质相对较好;非自回归模型不存在预测上的依赖关系,预测时间快,音质相对较差。\n", "\n", "主流声学模型发展的脉络:\n", "- 自回归模型:\n", "    - Tacotron\n", "    - Tacotron2\n", "    - Transformer TTS\n", "- 非自回归模型:\n", "    - FastSpeech\n", "    - SpeedySpeech\n", "    - FastPitch\n", "    - FastSpeech2\n", "    - ...\n", "\n", "在本教程中,我们使用 `FastSpeech2` 作为声学模型。\n",
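"\n", "回到上面的标注行:它由“音素 帧数”对组成,可以用几行代码解析并换算成时长。下面是一个示意草图,其中帧移按 300/24000 ≈ 12.5ms 取值,仅为假设,实际应以配置文件中的 n_shift 与 fs 为准:\n", "```python\n", "line = \"000001|baker_corpus|sil 20 k 12 a2 4 er2 10 p 12 u3 12 p 9 ei2 9 uai4 15 s 11 uen1 12 uan2 14 h 10 ua2 11 t 15 i1 16 sil 20\"\n", "fields = line.split(\"|\")[2].split()\n", "phones = fields[::2]                        # 音素序列\n", "durations = [int(x) for x in fields[1::2]]  # 每个音素占的帧数\n", "frame_shift = 300 / 24000  # 假设帧移为 12.5 ms\n", "print(list(zip(phones, durations))[:3])\n", "print(\"总帧数:\", sum(durations), \"约 %.2f 秒\" % (sum(durations) * frame_shift))\n", "```\n",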
"\n", "<center>FastSpeech2 网络结构图</center>\n", "\n", "PaddleSpeech TTS 实现的 FastSpeech2 与论文不同的地方在于,我们使用的是 phone 级别的 `pitch` 和 `energy`(与 FastPitch 类似),这样的合成结果可以更加**稳定**。\n",
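"\n", "下面用几行 NumPy 代码示意“把帧级 pitch 按音素时长平均为 phone 级别”的思路(数据均为虚构的示意值,并非 PaddleSpeech 的内部实现):\n", "```python\n", "import numpy as np\n", "\n", "frame_pitch = np.array([220.0, 221.0, 219.0, 180.0, 182.0, 181.0, 179.0])  # 帧级 f0(示意值)\n", "durations = [3, 4]  # 两个音素分别占 3 帧和 4 帧(示意值)\n", "\n", "phone_pitch, start = [], 0\n", "for d in durations:\n", "    phone_pitch.append(frame_pitch[start:start + d].mean())  # 对该音素覆盖的帧取平均\n", "    start += d\n", "print(phone_pitch)  # [220.0, 180.5]\n", "```\n",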
"\n", "<center>FastPitch 网络结构图</center>\n", "\n", "更多内容请参考[语音合成模型的发展及改进](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md)。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 初始化声学模型 FastSpeech2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(phones_dict, \"r\") as f:\n", " phn_id = [line.strip().split() for line in f.readlines()]\n", "vocab_size = len(phn_id)\n", "print(\"vocab_size:\", vocab_size)\n", "odim = fastspeech2_config.n_mels\n", "model = FastSpeech2(\n", " idim=vocab_size, odim=odim, **fastspeech2_config[\"model\"])\n", "# 加载预训练模型参数\n", "model.set_state_dict(paddle.load(fastspeech2_checkpoint)[\"main_params\"])\n", "# 推理阶段不启用 batch norm 和 dropout\n", "model.eval()\n", "stat = np.load(fastspeech2_stat)\n", "# 读取数据预处理阶段数据集的均值和标准差\n", "mu, std = stat\n", "mu, std = paddle.to_tensor(mu), paddle.to_tensor(std)\n", "# 构造归一化的新模型\n", "fastspeech2_normalizer = ZScore(mu, std)\n", "fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)\n", "fastspeech2_inference.eval()\n", "print(\"FastSpeech2 done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 调用声学模型" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with paddle.no_grad():\n", " mel = fastspeech2_inference(phone_ids)\n", "print(\"shape of mel (n_frames x n_mels):\")\n", "print(mel.shape)\n", "# 绘制声学模型输出的 mel 频谱\n", "fig, ax = plt.subplots(figsize=(16, 6))\n", "im = ax.imshow(mel.T, aspect='auto', origin='lower')\n", "plt.title('Mel Spectrogram')\n", "plt.xlabel('Time')\n", "plt.ylabel('Frequency')\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 声码器(Vocoder)\n", "声码器将声学特征转换为波形,它需要解决的是“信息缺失的补全问题”。信息缺失是指:在音频波形转换为频谱图时,存在**相位信息**的缺失;在频谱图转换为 mel 频谱图时,存在**频域压缩**导致的信息缺失。假设音频的采样率是 16kHz,一帧对应 10ms,那么 1s 的音频有 16000 个采样点,对应 100 帧,每一帧有 160 个采样点。声码器的作用就是将一个频谱帧变成音频波形的 160 个采样点,所以声码器中一般会包含**上采样**模块(这个换算见下方示意代码)。\n", "\n", "与声学模型类似,声码器也分为自回归模型和非自回归模型,更细致的分类如下:\n", "\n", "- Autoregression\n", "    - WaveNet\n", "    - WaveRNN\n", "    - LPCNet\n", "- Flow\n", "    - WaveFlow\n", "    - WaveGlow\n", "    - FloWaveNet\n", "    - Parallel WaveNet\n", "- GAN\n", "    - WaveGAN\n", "    - Parallel WaveGAN\n", "    - MelGAN\n", "    - Style MelGAN\n", "    - Multi Band MelGAN\n", "    - HiFi GAN\n", "- VAE\n", "    - Wave-VAE\n", "- Diffusion\n", "    - WaveGrad\n", "    - DiffWave\n", "\n", "PaddleSpeech TTS 主要实现了百度的 `WaveFlow` 和一些主流的 GAN Vocoder,在本教程中,我们使用 `Parallel WaveGAN` 作为声码器。\n", "\n",
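"正文中“一帧对应 160 个采样点”的换算可以直接算出来(16kHz 采样率与 10ms 帧移沿用正文的假设):\n", "```python\n", "fs = 16000            # 采样率:每秒 16000 个采样点(正文假设)\n", "frame_shift_ms = 10   # 帧移:一帧对应 10 ms(正文假设)\n", "samples_per_frame = fs * frame_shift_ms // 1000\n", "print(samples_per_frame)  # 160,即声码器要把每个频谱帧上采样成 160 个波形采样点\n", "```\n", "\n",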
"<center>图1:Parallel WaveGAN 网络结构图</center>\n", "\n",
"各 GAN Vocoder 的生成器和判别器的 Loss 的区别如下表所示:\n", "\n", "Model | Generator Loss | Discriminator Loss\n", ":---: | :--- | :---\n", "MelGAN | adversarial loss<br>Feature Matching | Multi-Scale Discriminator\n", "Parallel WaveGAN | adversarial loss<br>Multi-resolution STFT loss | adversarial loss\n", "Multi-Band MelGAN | adversarial loss<br>full band Multi-resolution STFT loss<br>sub band Multi-resolution STFT loss | Multi-Scale Discriminator\n", "HiFi GAN | adversarial loss<br>Feature Matching<br>Mel-Spectrogram Loss | Multi-Scale Discriminator<br>Multi-Period Discriminator\n"
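, "\n", "表中 Parallel WaveGAN 使用的多分辨率 STFT 损失,思路是在多组(FFT 大小、帧移)配置下分别比较真实波形与生成波形的 STFT 幅度谱。下面给出一个基于 numpy/scipy 的简化示意实现(分辨率参数为常见取值,并非 PaddleSpeech 的具体实现):\n", "```python\n", "import numpy as np\n", "from scipy.signal import stft\n", "\n", "def stft_mag(x, n_fft, hop):\n", "    # 计算 STFT 幅度谱,加极小量避免 log(0)\n", "    _, _, spec = stft(x, nperseg=n_fft, noverlap=n_fft - hop)\n", "    return np.abs(spec) + 1e-7\n", "\n", "def multi_resolution_stft_loss(wav_fake, wav_real,\n", "                               resolutions=((512, 128), (1024, 256), (2048, 512))):\n", "    loss = 0.0\n", "    for n_fft, hop in resolutions:\n", "        m_f, m_r = stft_mag(wav_fake, n_fft, hop), stft_mag(wav_real, n_fft, hop)\n", "        sc = np.linalg.norm(m_r - m_f) / np.linalg.norm(m_r)  # 谱收敛项\n", "        mag = np.mean(np.abs(np.log(m_r) - np.log(m_f)))      # 对数幅度谱 L1 项\n", "        loss += sc + mag\n", "    return loss / len(resolutions)\n", "\n", "# 示意用法:对两段 1 秒的随机波形计算损失\n", "print(multi_resolution_stft_loss(np.random.randn(16000), np.random.randn(16000)))\n", "```\n"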
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 初始化声码器 Parallel WaveGAN" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vocoder = PWGGenerator(**pwg_config[\"generator_params\"])\n", "# 加载预训练模型参数\n", "vocoder.set_state_dict(paddle.load(pwg_checkpoint)[\"generator_params\"])\n", "vocoder.remove_weight_norm()\n", "# 推理阶段不启用 batch norm 和 dropout\n", "vocoder.eval()\n", "# 读取数据预处理阶段数据集的均值和标准差\n", "stat = np.load(pwg_stat)\n", "mu, std = stat\n", "mu, std = paddle.to_tensor(mu), paddle.to_tensor(std)\n", "pwg_normalizer = ZScore(mu, std)\n", "# 构建归一化的模型\n", "pwg_inference = PWGInference(pwg_normalizer, vocoder)\n", "pwg_inference.eval()\n", "print(\"Parallel WaveGAN done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 调用声码器" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with paddle.no_grad():\n", " wav = pwg_inference(mel)\n", "print(\"shape of wav (time x n_channels): %s\" % wav.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 绘制声码器输出的波形图\n", "wave_data = wav.numpy().T\n", "time = np.arange(0, wave_data.shape[1]) * (1.0 / fastspeech2_config.fs)\n", "fig, ax = plt.subplots(figsize=(16, 6))\n", "plt.plot(time, wave_data[0])\n", "plt.title('Waveform')\n", "plt.xlabel('Time (seconds)')\n", "plt.ylabel('Amplitude (normed)')\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 播放音频" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dp.Audio(wav.numpy().T, rate=fastspeech2_config.fs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 保存音频" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir output\n", "sf.write(\n", " \"output/output.wav\",\n", " wav.numpy(),\n", " samplerate=fastspeech2_config.fs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 进阶 —— 个性化调节\n", "FastSpeech2 模型可以个性化地调节音素时长、音调和能量,通过一些简单的调节就可以获得一些有意思的效果。\n", "\n", "例如,对于以下的原始音频 `\"凯莫瑞安联合体的经济崩溃,迫在眉睫\"`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 原始音频\n", "dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speed/x1_001.wav\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# speed x 1.2\n", "dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speed/x1.2_001.wav\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# speed x 0.8\n", "dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speed/x0.8_001.wav\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pitch x 1.3(童声)\n", "dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/child_voice/001.wav\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# robot\n", "dp.display(dp.Audio(url=\"https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/robot/001.wav\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "具体实现代码请参考 [Style FastSpeech2](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/style_fs2)。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 用 PaddleSpeech 训练 TTS 模型\n",
"PaddleSpeech 的 examples 是按照“数据集/模型”的结构安排的:\n", "```text\n", "examples\n", "├── aishell3\n", "│ ├── README.md\n", "│ ├── tts3\n", "│ └── vc0\n", "├── csmsc\n", "│ ├── README.md\n", "│ ├── tts2\n", "│ ├── tts3\n", "│ ├── voc1\n", "│ └── voc3\n", "├── ...\n", "└── ...\n", "```\n", "我们在每个数据集的 README.md 中介绍了子目录和模型的对应关系,在 TTS 中有如下对应关系:\n", "```text\n", "tts0 - Tacotron2\n", "tts1 - TransformerTTS\n", "tts2 - SpeedySpeech\n", "tts3 - FastSpeech2\n", "voc0 - WaveFlow\n", "voc1 - Parallel WaveGAN\n", "voc2 - MelGAN\n", "voc3 - MultiBand MelGAN\n", "```\n", "### 基于 CSMSC 数据集训练 FastSpeech2 模型\n", "```bash\n", "git clone https://github.com/PaddlePaddle/PaddleSpeech.git\n", "cd PaddleSpeech/examples/csmsc/tts3\n", "```\n", "根据 README.md,下载 CSMSC 数据集及其对应的强制对齐文件,并放置在对应的位置。\n", "```bash\n", "./run.sh\n", "```\n", "`run.sh` 中包含预处理、训练、合成、静态图推理等步骤:\n", "\n", "```bash\n", "#!/bin/bash\n", "set -e\n", "source path.sh\n", "gpus=0,1\n", "stage=0\n", "stop_stage=100\n", "conf_path=conf/default.yaml\n", "train_output_path=exp/default\n", "ckpt_name=snapshot_iter_153.pdz\n", "\n", "# with the following command, you can choose the stage range you want to run\n", "# such as `./run.sh --stage 0 --stop-stage 0`\n", "# this cannot be mixed with `$1`, `$2` ...\n", "source ${MAIN_ROOT}/utils/parse_options.sh || exit 1\n", "\n", "if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n", " # prepare data\n", " bash ./local/preprocess.sh ${conf_path} || exit -1\n", "fi\n", "if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n", " # train model, all `ckpt` under `train_output_path/checkpoints/` dir\n", " CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1\n", "fi\n", "if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n", " # synthesize, vocoder is pwgan\n", " CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1\n", "fi\n", "if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then\n", " # synthesize_e2e, vocoder is pwgan\n", " CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1\n", "fi\n", "if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then\n", " # inference with static model\n", " CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1\n", "fi\n", "```\n", "\n", "### 基于 CSMSC 数据集训练 Parallel WaveGAN 模型\n", "```bash\n", "git clone https://github.com/PaddlePaddle/PaddleSpeech.git\n", "cd PaddleSpeech/examples/csmsc/voc1\n", "```\n", "根据 README.md,下载 CSMSC 数据集及其对应的强制对齐文件,并放置在对应的位置。\n", "```bash\n", "./run.sh\n", "```\n", "`run.sh` 中包含预处理、训练、合成等步骤:\n", "```bash\n", "#!/bin/bash\n", "set -e\n", "source path.sh\n", "gpus=0,1\n", "stage=0\n", "stop_stage=100\n", "conf_path=conf/default.yaml\n", "train_output_path=exp/default\n", "ckpt_name=snapshot_iter_5000.pdz\n", "\n", "# with the following command, you can choose the stage range you want to run\n", "# such as `./run.sh --stage 0 --stop-stage 0`\n", "# this cannot be mixed with `$1`, `$2` ...\n", "source ${MAIN_ROOT}/utils/parse_options.sh || exit 1\n", "\n", "if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then\n", " # prepare data\n", " ./local/preprocess.sh ${conf_path} || exit -1\n", "fi\n", "if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then\n", " # train model, all `ckpt` under `train_output_path/checkpoints/` dir\n", " CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1\n", "fi\n", "if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then\n", " # synthesize\n",
" CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1\n", "fi\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# FAQ\n", "\n", "- 需要注意的问题\n", "- 经验与分享\n", "- 用户的其他问题" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 作业\n", "在 CSMSC 数据集上利用 FastSpeech2 和 Parallel WaveGAN 实现一个中文 TTS 系统。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 关注 PaddleSpeech\n", "请关注我们的 [GitHub Repo](https://github.com/PaddlePaddle/PaddleSpeech/),非常欢迎加入以下微信群参与讨论:\n", "- 扫描二维码\n", "- 添加运营小姐姐微信\n", "- 通过后回复【语音】\n", "- 系统自动邀请加入技术群\n", "\n", "(此处为微信交流群二维码图片)
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }