diff --git a/README.md b/README.md index 8a83ac619..c501e0c37 100644 --- a/README.md +++ b/README.md @@ -74,9 +74,9 @@ Just a quick test of our functions: [English ASR](link/hubdetail?name=deepspeech Developers can have a try of our model with only a few lines of code. -A tiny *ASR* DeepSpeech2 model training on toy set of LibriSpeech: +A tiny **ASR** DeepSpeech2 model trained on a toy subset of LibriSpeech: -```shell +```bash cd examples/tiny/s0/ # source the environment source path.sh @@ -86,16 +86,34 @@ bash local/data.sh bash local/test.sh conf/deepspeech2.yaml ckptfile offline ``` -For *TTS*, try FastSpeech2 on LJSpeech: -- Download LJSpeech-1.1 from the [ljspeech official website](https://keithito.com/LJ-Speech-Dataset/), our prepared durations for fastspeech2 [ljspeech_alignment](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz). -- The pretrained models are seperated into two parts: [fastspeech2_nosil_ljspeech_ckpt](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) and [pwg_ljspeech_ckpt](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip). Please download then unzip to `./model/fastspeech2` and `./model/pwg` respectively.
-- Assume your path to the dataset is `~/datasets/LJSpeech-1.1` and `./ljspeech_alignment` accordingly, preprocess your data and then use our pretrained model to synthesize: -```shell -bash ./local/preprocess.sh conf/default.yaml -bash ./local/synthesize_e2e.sh conf/default.yaml ./model/fastspeech2/snapshot_iter_100000.pdz ./model/pwg/pwg_snapshot_iter_400000.pdz -``` +For **TTS**, try pretrained FastSpeech2 + Parallel WaveGAN on CSMSC: +```bash +cd examples/csmsc/tts3 +# download the pretrained models and unzip them +wget https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip +unzip pwg_baker_ckpt_0.4.zip +wget https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip +unzip fastspeech2_nosil_baker_ckpt_0.4.zip +# source the environment +source path.sh +# run end-to-end synthesis +FLAGS_allocator_strategy=naive_best_fit \ +FLAGS_fraction_of_gpu_memory_to_use=0.01 \ +python3 ${BIN_DIR}/synthesize_e2e.py \ + --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \ + --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \ + --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \ + --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --text=${BIN_DIR}/../sentences.txt \ + --output-dir=exp/default/test_e2e \ + --inference-dir=exp/default/inference \ + --device="gpu" \ + --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt +``` If you want to try more functions like training and tuning, please see [ASR getting started](docs/source/asr/getting_started.md) and [TTS Basic Use](/docs/source/tts/basic_usage.md).
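The README commands above drive a two-stage pipeline: FastSpeech2 maps phonemes to a mel spectrogram, then Parallel WaveGAN maps the spectrogram to a waveform. A minimal sketch of that control flow, where the function names, frames-per-phoneme count, and hop size are illustrative placeholders rather than the real PaddleSpeech API:

```python
def acoustic_model(phoneme_ids, frames_per_phoneme=4, n_mels=80):
    # hypothetical stand-in for FastSpeech2: phoneme ids -> mel frames
    n_frames = len(phoneme_ids) * frames_per_phoneme
    return [[0.0] * n_mels for _ in range(n_frames)]

def vocoder(mel, hop_size=256):
    # hypothetical stand-in for Parallel WaveGAN: mel frames -> waveform samples
    return [0.0] * (len(mel) * hop_size)

def synthesize(phoneme_ids):
    mel = acoustic_model(phoneme_ids)  # stage 1: text/phonemes -> spectrogram
    return vocoder(mel)                # stage 2: spectrogram -> waveform

# 3 phonemes -> 12 mel frames -> 3072 waveform samples
wav = synthesize([5, 12, 7])
```

The point is the decoupling: either stage can be swapped (e.g. a different vocoder) as long as the mel interface is preserved, which is why the checkpoints above are downloaded as two separate archives.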
diff --git a/docs/source/_static/custom.css b/docs/source/_static/custom.css new file mode 100644 index 000000000..bb65c51a9 --- /dev/null +++ b/docs/source/_static/custom.css @@ -0,0 +1,5 @@ +.wy-nav-content { + max-width: 80%; +} +.table table{ background:#b9b9b9} +.table table td{ background:#FFF; } diff --git a/docs/source/conf.py b/docs/source/conf.py index c41884ef8..f2f75ce3e 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -79,6 +79,9 @@ smartquotes = False # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] html_logo = '../images/paddle.png' +html_css_files = [ + 'custom.css', +] # -- Extension configuration ------------------------------------------------- # numpydoc_show_class_members = False diff --git a/docs/source/tts/demo.rst b/docs/source/tts/demo.rst index 948fc056e..09c4d25ad 100644 --- a/docs/source/tts/demo.rst +++ b/docs/source/tts/demo.rst @@ -27,74 +27,106 @@ Analysis/synthesis Audio samples generated from ground-truth spectrograms with a vocoder. .. raw:: html - + LJSpeech(English)

- + +
+
- - + + + + + + + + + + + + + - + + + + + + + + + + + + + + + + + -TTS -------------------- + + + + + -Audio samples generated by a TTS system. Text is first transformed into spectrogram by a text-to-spectrogram model, then the spectrogram is converted into raw audio by a vocoder. + + + + + -.. raw:: html + + +
GT WaveFlow Text GT WaveFlow
Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition + + + +
in being comparatively modern. - + + +
For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process +
produced the block books, which were the immediate predecessors of the true printed book + +
the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing. +
Death is just a part of life, something we're all destined to do. + + + +
I think it's hard winning a war with words. + + + +
Don’t argue with the people of strong determination, because they may change the fact! + + + +
Love you three thousand times. + + + +
+ +
+
- + CSMSC(Chinese) +
+
+
- - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
TransformerTTS + WaveFlow Tacotron2 + WaveFlow Text SpeedySpeech + ParallelWaveGAN FastSpeech2 + ParallelWaveGAN
凯莫瑞安联合体的经济崩溃,迫在眉睫。 + + + +
对于所有想要离开那片废土,去寻找更美好生活的人来说。 + + + +
克哈,是你们所有人安全的港湾。 + + + +
为了保护尤摩扬人民不受异虫的残害,我所做的,比他们自己的领导委员会都多。 + +
无论他们如何诽谤我,我将继续为所有泰伦人的最大利益,而努力奋斗。 + +
身为你们的元首,我带领泰伦人实现了人类统治领地和经济的扩张。 + +
我们将继续成长,用行动回击那些只会说风凉话,不愿意和我们相向而行的害群之马。 + +
帝国武装力量,无数的优秀儿女,正时刻守卫着我们的家园大门,但是他们孤木难支。 @@ -259,183 +608,289 @@ Audio samples generated by a TTS system. Text is first transformed into spectrog +
凡是今天应征入伍者,所获的所有刑罚罪责,减半。 + +
+ +
+
+ + +Multi-Speaker TTS +------------------- + +PaddleSpeech also supports multi-speaker TTS. Here we provide audio demos generated by FastSpeech2 + ParallelWaveGAN trained on the AISHELL-3 multi-speaker TTS dataset. + + + +.. raw:: html + +
+ + + + + + +
Text Origin Generated
+
+
+
+ + +Duration control in FastSpeech2 +-------------------------------------- +Our FastSpeech2 can control ``duration``, ``pitch`` and ``energy``; here we provide audio demos of duration control. ``duration`` is the duration of each phoneme: reducing ``duration`` speeds the audio up, while increasing ``duration`` slows it down. + +The ``duration`` of different phonemes in a sentence can be scaled by different ratios (for example, to slow down one word while keeping the speed of the others). Here we apply a single fixed scale ratio to all phonemes to control the overall ``speed`` of the audio. + +Duration control in FastSpeech2 changes the speed of the audio while keeping the pitch unchanged. (In some speech tools, increasing the speed also raises the pitch, and vice versa.) + +.. raw:: html + +
+
+ + + + + + + + + + + + + + -
Speed(0.8x) Speed(1x) Speed(1.2x)
+ + +
+ +
- - - - - - - + + + + + + + + + + + + - + + + + + + + + + + + + + + -
SpeedySpeech + ParallelWaveGAN FastSpeech2 + ParallelWaveGAN
+ + + +
+ + +
+ + +
+ + +
+ + +
+ +
+ +
+
+
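The fixed-ratio duration control described in this section can be sketched in a few lines. This is a hypothetical illustration, not PaddleSpeech's API: predicted per-phoneme frame counts are divided by a single speed ratio before the length regulator expands the encoder states.

```python
def scale_durations(durations, speed):
    # durations: predicted frames per phoneme; speed > 1 shortens them
    # (faster speech), speed < 1 lengthens them (slower speech).
    # A single fixed ratio is applied to every phoneme, so relative
    # phoneme lengths (and hence pitch contours) are preserved.
    return [max(1, round(d / speed)) for d in durations]

durations = [4, 6, 2, 8]             # hypothetical frame counts
scale_durations(durations, 2.0)      # 2x faster -> [2, 3, 1, 4]
scale_durations(durations, 0.5)      # 2x slower -> [8, 12, 4, 16]
```

Clamping to at least one frame keeps every phoneme audible even at aggressive speed-ups.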
Chinese TTS with/without text frontend @@ -447,12 +902,15 @@ We use ``FastSpeech2`` + ``ParallelWaveGAN`` here. .. raw:: html -
+
+
- - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
With Text Frontend Without Text Frontend Text With Text Frontend Without Text Frontend
他只是一个纸老虎。 + +
手表厂有五种好产品。 + +
老板的轿车需要保养。 + +
我们所有人都好喜欢你呀。 + +
岂有此理。 +
虎骨酒多少钱一瓶。 + +
这件事情需要冷处理。 + +
这个老奶奶是个大喇叭。 + +
我喜欢说相声。 + +
有一天,我路过了一栋楼。 +
+ +
+
-
\ No newline at end of file + \ No newline at end of file diff --git a/docs/source/tts/demo_2.rst b/docs/source/tts/demo_2.rst index 37922fcbf..2f0ca7cdb 100644 --- a/docs/source/tts/demo_2.rst +++ b/docs/source/tts/demo_2.rst @@ -5,3 +5,283 @@ This is an audio demo page to contrast PaddleSpeech TTS and Espnet TTS, We use t We use Espnet's released models here. FastSpeech2 + Parallel WaveGAN in CSMSC + +.. raw:: html + + +
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Text Espnet TTS PaddleSpeech TTS
早上好,今天是2020/10/29,最低温度是-3°C。 + + + +
你好,我的编号是37249,很高兴为您服务。 + + + +
我们公司有37249个人。 + + + +
我出生于2005年10月8日。 + + + +
我们习惯在12:30吃中午饭。 + + + +
只要有超过3/4的人投票同意,你就会成为我们的新班长。 + + + +
我要买一只价值999.9元的手表。 + + + +
我的手机号是18544139121,欢迎来电。 + + + +
明天有62%的概率降雨。 + + + +
手表厂有五种好产品。 + + + +
跑马场有五百匹很勇敢的千里马。 + + + +
有一天,我看到了一栋楼,我顿感不妙,因为我看不清里面有没有人。 + + + +
史小姐拿着小雨伞去找她的老保姆了。 + + + +
不要相信这个老奶奶说的话,她一点儿也不好。 + + + +
+
diff --git a/docs/source/tts/test_sentence.txt b/docs/source/tts/test_sentence.txt new file mode 100644 index 000000000..933f47491 --- /dev/null +++ b/docs/source/tts/test_sentence.txt @@ -0,0 +1,14 @@ +001 早上好,今天是2020/10/29,最低温度是-3°C。 +002 你好,我的编号是37249,很高兴为您服务。 +003 我们公司有37249个人。 +004 我出生于2005年10月8日。 +005 我们习惯在12:30吃中午饭。 +006 只要有超过3/4的人投票同意,你就会成为我们的新班长。 +007 我要买一只价值999.9元的手表。 +008 我的手机号是18544139121,欢迎来电。 +009 明天有62%的概率降雨。 +010 手表厂有五种好产品。 +011 跑马场有五百匹很勇敢的千里马。 +012 有一天,我看到了一栋楼,我顿感不妙,因为我看不清里面有没有人。 +013 史小姐拿着小雨伞去找她的老保姆了。 +014 不要相信这个老奶奶说的话,她一点儿也不好。 \ No newline at end of file diff --git a/parakeet/models/fastspeech2/fastspeech2.py b/parakeet/models/fastspeech2/fastspeech2.py index 0dbbb7bd9..192517b16 100644 --- a/parakeet/models/fastspeech2/fastspeech2.py +++ b/parakeet/models/fastspeech2/fastspeech2.py @@ -419,9 +419,18 @@ class FastSpeech2(nn.Layer): if is_inference: # (B, Tmax) - d_outs = self.duration_predictor.inference(hs, d_masks) + if ds is not None: + d_outs = ds + else: + d_outs = self.duration_predictor.inference(hs, d_masks) + if ps is not None: + p_outs = ps + if es is not None: + e_outs = es + # use ground truth where provided, otherwise use prediction # (B, Tmax, 1) + p_embs = self.pitch_embed(p_outs.transpose((0, 2, 1))).transpose( (0, 2, 1)) e_embs = self.energy_embed(e_outs.transpose((0, 2, 1))).transpose( @@ -516,7 +525,7 @@ class FastSpeech2(nn.Layer): x = paddle.cast(text, 'int64') y = speech spemb = spembs - if durations: + if durations is not None: d = paddle.cast(durations, 'int64') p, e = pitch, energy # setup batch axis @@ -534,9 +543,12 @@ class FastSpeech2(nn.Layer): if use_teacher_forcing: # use groundtruth of duration, pitch, and energy - ds, ps, es = d.unsqueeze(0), p.unsqueeze(0), e.unsqueeze(0) + ds = d.unsqueeze(0) if d is not None else None + ps = p.unsqueeze(0) if p is not None else None + es = e.unsqueeze(0) if e is not None else None # (1, L, odim) - _, outs, *_ = self._forward( + _, outs, d_outs, *_ = 
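The teacher-forcing change in the fastspeech2.py hunk above boils down to one pattern: at inference time, use a ground-truth value when the caller supplies one, otherwise fall back to the predictor. A minimal pure-Python sketch of that logic (toy functions, not the real Paddle layer); note it compares with ``is not None``, mirroring the ``if durations:`` → ``if durations is not None:`` fix, so an all-zero or empty tensor still counts as "provided":

```python
def predict_duration(hs):
    # hypothetical fallback predictor: one frame per encoder state
    return [1] * len(hs)

def forward(hs, ds=None, is_inference=True):
    # Use ground-truth durations ``ds`` when supplied (teacher forcing);
    # otherwise predict them. Truthiness checks would wrongly treat an
    # empty/zero value as "absent", hence the explicit ``is not None``.
    if is_inference and ds is not None:
        return ds
    return predict_duration(hs)

hs = [[0.0] * 8 for _ in range(4)]  # 4 encoder states (toy values)
forward(hs)                         # predicted: [1, 1, 1, 1]
forward(hs, ds=[2, 3, 1, 4])        # teacher-forced: [2, 3, 1, 4]
```

The same override pattern is applied independently to pitch (``ps``) and energy (``es``) in the patch, which is what enables the duration-control demos while letting pitch and energy stay predicted.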
self._forward( xs, ilens, ys, @@ -545,10 +557,11 @@ class FastSpeech2(nn.Layer): es=es, spembs=spembs, spk_id=spk_id, - tone_id=tone_id) + tone_id=tone_id, + is_inference=True) else: # (1, L, odim) - _, outs, *_ = self._forward( + _, outs, d_outs, *_ = self._forward( xs, ilens, ys,