diff --git a/.mergify.yml b/.mergify.yml index 03e57e14b..6ec28ae81 100644 --- a/.mergify.yml +++ b/.mergify.yml @@ -39,12 +39,30 @@ pull_request_rules: actions: label: remove: ["conflicts"] - - name: "auto add label=enhancement" + - name: "auto add label=S2T" conditions: - files~=^deepspeech/ actions: label: - add: ["enhancement"] + add: ["S2T"] + - name: "auto add label=T2S" + conditions: + - files~=^parakeet/ + actions: + label: + add: ["T2S"] + - name: "auto add label=Audio" + conditions: + - files~=^paddleaudio/ + actions: + label: + add: ["Audio"] + - name: "auto add label=TextProcess" + conditions: + - files~=^text_processing/ + actions: + label: + add: ["TextProcess"] - name: "auto add label=Example" conditions: - files~=^examples/ diff --git a/.readthedocs.yml b/.readthedocs.yml index 702ae6dae..dc38a20fc 100644 --- a/.readthedocs.yml +++ b/.readthedocs.yml @@ -7,7 +7,7 @@ version: 2 # Build documentation in the docs/ directory with Sphinx sphinx: - configuration: docs/src/conf.py + configuration: docs/source/conf.py # Build documentation with MkDocs #mkdocs: @@ -20,11 +20,6 @@ formats: [] python: version: 3.7 install: - - method: pip - path: . - extra_requirements: - - doc - - requirements: docs/requirements.txt diff --git a/README.md b/README.md index 468f42a61..8a83ac619 100644 --- a/README.md +++ b/README.md @@ -10,10 +10,9 @@ English | [简体中文](README_ch.md)

- Quick Start - | Tutorials - | Models List - + Quick Start + | Tutorials + | Models List

------------------------------------------------------------------------------------ @@ -27,37 +26,31 @@ how they can install it, how they can use it --> -**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for two critical tasks in Speech - **Automatic Speech Recognition (ASR)** and **Text-To-Speech Synthesis (TTS)**, with modules involving state-of-art and influential models. +**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech, with state-of-art and influential models. -Via the easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial application and academic research, including training, inference & testing module, and deployment. Besides, this toolkit also features at: -- **Fast and Light-weight**: we provide a high-speed and ultra-lightweight model that is convenient for industrial deployment. +Via the easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial application and academic research, including training, inference & testing modules, and deployment process. To be more specific, this toolkit features at: +- **Fast and Light-weight**: we provide high-speed and ultra-lightweight models that are convenient for industrial deployment. - **Rule-based Chinese frontend**: our frontend contains Text Normalization (TN) and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt Chinese context. -- **Varieties of Functions that Vitalize Research**: - - *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of both ASR and TTS, and uses datasets like LibriSpeech, LJSpeech, AIShell, etc. See also [model lists](#models-list) for more details. - - *Support of ASR streaming and non-streaming data*: This toolkit contains non-streaming/streaming models like [DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf), [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100) and [U2](https://arxiv.org/pdf/2012.05481.pdf). +- **Varieties of Functions that Vitalize both Industrial and Academia**: + - *Implementation of critical audio tasks*: this toolkit contains audio functions like Speech Translation (ST), Automatic Speech Recognition (ASR), Text-To-Speech Synthesis (TTS), Voice Cloning(VC), Punctuation Restoration, etc. + - *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of the speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also [model lists](#models-list) for more details. + - *Cross-domain application*: as an extension of the application of traditional audio tasks, we combine the aforementioned tasks with other fields like NLP. Let's install PaddleSpeech with only a few lines of code! >Note: The official name is still deepspeech. 2021/10/26 -``` shell -# 1. Install essential libraries and paddlepaddle first. -# install prerequisites -sudo apt-get install -y sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev libsndfile1 -# `pip install paddlepaddle-gpu` instead if you are using GPU. -pip install paddlepaddle - -# 2.Then install PaddleSpeech. +If you are using Ubuntu, PaddleSpeech can be set up with pip installation (with root privilege). 
+```shell git clone https://github.com/PaddlePaddle/DeepSpeech.git cd DeepSpeech pip install -e . ``` - ## Table of Contents The contents of this README is as follow: -- [Alternative Installation](#installation) +- [Alternative Installation](#alternative-installation) - [Quick Start](#quick-start) - [Models List](#models-list) - [Tutorials](#tutorials) @@ -75,10 +68,13 @@ The base environment in this page is If you want to set up PaddleSpeech in other environment, please see the [ASR installation](docs/source/asr/install.md) and [TTS installation](docs/source/tts/install.md) documents for all the alternatives. ## Quick Start +> Note: the current links to `English ASR` and `English TTS` are not valid. -> Note: `ckptfile` should be replaced by real path that represents files or folders later. Similarly, `exp/default` is the folder that contains the pretrained models. +Just a quick test of our functions: [English ASR](link/hubdetail?name=deepspeech2_aishell&en_category=AutomaticSpeechRecognition) and [English TTS](link/hubdetail?name=fastspeech2_baker&en_category=TextToSpeech) by typing message or upload your own audio file. -Try a tiny ASR DeepSpeech2 model training on toy set of LibriSpeech: +Developers can have a try of our model with only a few lines of code. + +A tiny *ASR* DeepSpeech2 model training on toy set of LibriSpeech: ```shell cd examples/tiny/s0/ @@ -90,12 +86,13 @@ bash local/data.sh bash local/test.sh conf/deepspeech2.yaml ckptfile offline ``` -For TTS, try FastSpeech2 on LJSpeech: -- Download LJSpeech-1.1 from the [ljspeech official website](https://keithito.com/LJ-Speech-Dataset/) and our prepared durations for fastspeech2 [ljspeech_alignment](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz). +For *TTS*, try FastSpeech2 on LJSpeech: +- Download LJSpeech-1.1 from the [ljspeech official website](https://keithito.com/LJ-Speech-Dataset/), our prepared durations for fastspeech2 [ljspeech_alignment](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz). +- The pretrained models are seperated into two parts: [fastspeech2_nosil_ljspeech_ckpt](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) and [pwg_ljspeech_ckpt](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip). Please download then unzip to `./model/fastspeech2` and `./model/pwg` respectively. - Assume your path to the dataset is `~/datasets/LJSpeech-1.1` and `./ljspeech_alignment` accordingly, preprocess your data and then use our pretrained model to synthesize: ```shell bash ./local/preprocess.sh conf/default.yaml -bash ./local/synthesize_e2e.sh conf/default.yaml exp/default ckptfile +bash ./local/synthesize_e2e.sh conf/default.yaml ./model/fastspeech2/snapshot_iter_100000.pdz ./model/pwg/pwg_snapshot_iter_400000.pdz ``` @@ -104,14 +101,17 @@ If you want to try more functions like training and tuning, please see [ASR gett ## Models List +PaddleSpeech supports a series of most popular models, summarized in [released models](./docs/source/released_model.md) with available pretrained models. - -PaddleSpeech ASR supports a lot of mainstream models, which are summarized as follow. For more information, please refer to [ASR Models](./docs/source/asr/released_model.md). +ASR module contains *Acoustic Model* and *Language Model*, with the following details: +> Note: The `Link` should be code path rather than download links. 
+ + @@ -125,7 +125,7 @@ The current hyperlinks redirect to [Previous Parakeet](https://github.com/Paddle - + @@ -200,7 +200,7 @@ PaddleSpeech TTS mainly contains three modules: *Text Frontend*, *Acoustic Model @@ -208,41 +208,41 @@ PaddleSpeech TTS mainly contains three modules: *Text Frontend*, *Acoustic Model - + - + @@ -250,26 +250,26 @@ PaddleSpeech TTS mainly contains three modules: *Text Frontend*, *Acoustic Model @@ -277,14 +277,14 @@ PaddleSpeech TTS mainly contains three modules: *Text Frontend*, *Acoustic Model diff --git a/deepspeech/exps/deepspeech2/model.py b/deepspeech/exps/deepspeech2/model.py index 7b929f8b7..6424cfdf3 100644 --- a/deepspeech/exps/deepspeech2/model.py +++ b/deepspeech/exps/deepspeech2/model.py @@ -167,6 +167,11 @@ class DeepSpeech2Trainer(Trainer): logger.info(f"{model}") layer_tools.print_params(model, logger.info) + self.model = model + logger.info("Setup model!") + + if not self.train: + return grad_clip = ClipGradByGlobalNormWithLog( config.training.global_grad_clip) @@ -180,74 +185,77 @@ class DeepSpeech2Trainer(Trainer): weight_decay=paddle.regularizer.L2Decay( config.training.weight_decay), grad_clip=grad_clip) - - self.model = model self.optimizer = optimizer self.lr_scheduler = lr_scheduler - logger.info("Setup model/optimizer/lr_scheduler!") + logger.info("Setup optimizer/lr_scheduler!") + def setup_dataloader(self): config = self.config.clone() config.defrost() - config.collator.keep_transcription_text = False - - config.data.manifest = config.data.train_manifest - train_dataset = ManifestDataset.from_config(config) - - config.data.manifest = config.data.dev_manifest - dev_dataset = ManifestDataset.from_config(config) - - config.data.manifest = config.data.test_manifest - test_dataset = ManifestDataset.from_config(config) - - if self.parallel: - batch_sampler = SortagradDistributedBatchSampler( + if self.train: + # train + config.data.manifest = config.data.train_manifest + train_dataset = ManifestDataset.from_config(config) + if self.parallel: + batch_sampler = SortagradDistributedBatchSampler( + train_dataset, + batch_size=config.collator.batch_size, + num_replicas=None, + rank=None, + shuffle=True, + drop_last=True, + sortagrad=config.collator.sortagrad, + shuffle_method=config.collator.shuffle_method) + else: + batch_sampler = SortagradBatchSampler( + train_dataset, + shuffle=True, + batch_size=config.collator.batch_size, + drop_last=True, + sortagrad=config.collator.sortagrad, + shuffle_method=config.collator.shuffle_method) + + config.collator.keep_transcription_text = False + collate_fn_train = SpeechCollator.from_config(config) + self.train_loader = DataLoader( train_dataset, - batch_size=config.collator.batch_size, - num_replicas=None, - rank=None, - shuffle=True, - drop_last=True, - sortagrad=config.collator.sortagrad, - shuffle_method=config.collator.shuffle_method) + batch_sampler=batch_sampler, + collate_fn=collate_fn_train, + num_workers=config.collator.num_workers) + + # dev + config.data.manifest = config.data.dev_manifest + dev_dataset = ManifestDataset.from_config(config) + + config.collator.augmentation_config = "" + config.collator.keep_transcription_text = False + collate_fn_dev = SpeechCollator.from_config(config) + self.valid_loader = DataLoader( + dev_dataset, + batch_size=int(config.collator.batch_size), + shuffle=False, + drop_last=False, + collate_fn=collate_fn_dev, + num_workers=config.collator.num_workers) + logger.info("Setup train/valid Dataloader!") else: - batch_sampler = SortagradBatchSampler( - train_dataset, - 
shuffle=True, - batch_size=config.collator.batch_size, - drop_last=True, - sortagrad=config.collator.sortagrad, - shuffle_method=config.collator.shuffle_method) - - collate_fn_train = SpeechCollator.from_config(config) - - config.collator.augmentation_config = "" - collate_fn_dev = SpeechCollator.from_config(config) - - config.collator.keep_transcription_text = True - config.collator.augmentation_config = "" - collate_fn_test = SpeechCollator.from_config(config) - - self.train_loader = DataLoader( - train_dataset, - batch_sampler=batch_sampler, - collate_fn=collate_fn_train, - num_workers=config.collator.num_workers) - self.valid_loader = DataLoader( - dev_dataset, - batch_size=int(config.collator.batch_size), - shuffle=False, - drop_last=False, - collate_fn=collate_fn_dev, - num_workers=config.collator.num_workers) - self.test_loader = DataLoader( - test_dataset, - batch_size=config.decoding.batch_size, - shuffle=False, - drop_last=False, - collate_fn=collate_fn_test, - num_workers=config.collator.num_workers) - logger.info("Setup train/valid/test Dataloader!") + # test + config.data.manifest = config.data.test_manifest + test_dataset = ManifestDataset.from_config(config) + + config.collator.augmentation_config = "" + config.collator.keep_transcription_text = True + collate_fn_test = SpeechCollator.from_config(config) + + self.test_loader = DataLoader( + test_dataset, + batch_size=config.decoding.batch_size, + shuffle=False, + drop_last=False, + collate_fn=collate_fn_test, + num_workers=config.collator.num_workers) + logger.info("Setup test Dataloader!") class DeepSpeech2Tester(DeepSpeech2Trainer): diff --git a/deepspeech/exps/u2/model.py b/deepspeech/exps/u2/model.py index 7806aaa49..e47a59eda 100644 --- a/deepspeech/exps/u2/model.py +++ b/deepspeech/exps/u2/model.py @@ -172,7 +172,7 @@ class U2Trainer(Trainer): dist.get_rank(), total_loss / num_seen_utts)) return total_loss, num_seen_utts - def train(self): + def do_train(self): """The training process control by step.""" # !!!IMPORTANT!!! # Try to export the model by script, if fails, we should refine diff --git a/deepspeech/exps/u2_kaldi/model.py b/deepspeech/exps/u2_kaldi/model.py index f86243269..663c36d8b 100644 --- a/deepspeech/exps/u2_kaldi/model.py +++ b/deepspeech/exps/u2_kaldi/model.py @@ -173,7 +173,7 @@ class U2Trainer(Trainer): dist.get_rank(), total_loss / num_seen_utts)) return total_loss, num_seen_utts - def train(self): + def do_train(self): """The training process control by step.""" # !!!IMPORTANT!!! # Try to export the model by script, if fails, we should refine diff --git a/deepspeech/exps/u2_st/model.py b/deepspeech/exps/u2_st/model.py index c5df44c67..1f638e64c 100644 --- a/deepspeech/exps/u2_st/model.py +++ b/deepspeech/exps/u2_st/model.py @@ -184,7 +184,7 @@ class U2STTrainer(Trainer): dist.get_rank(), total_loss / num_seen_utts)) return total_loss, num_seen_utts - def train(self): + def do_train(self): """The training process control by step.""" # !!!IMPORTANT!!! 
# Try to export the model by script, if fails, we should refine diff --git a/deepspeech/training/trainer.py b/deepspeech/training/trainer.py index 2c2389203..2da838047 100644 --- a/deepspeech/training/trainer.py +++ b/deepspeech/training/trainer.py @@ -134,6 +134,10 @@ class Trainer(): logger.info( f"Benchmark reset batch-size: {self.args.benchmark_batch_size}") + @property + def train(self): + return self._train + @contextmanager def eval(self): self._train = False @@ -248,7 +252,7 @@ class Trainer(): sys.exit( f"Reach benchmark-max-step: {self.args.benchmark_max_step}") - def train(self): + def do_train(self): """The training process control by epoch.""" self.before_train() @@ -321,7 +325,7 @@ class Trainer(): """ try: with Timer("Training Done: {}"): - self.train() + self.do_train() except KeyboardInterrupt: exit(-1) finally: @@ -432,7 +436,7 @@ class Trainer(): beginning of the experiment. """ config_file = self.config_dir / "config.yaml" - if self._train and config_file.exists(): + if self.train and config_file.exists(): time_stamp = time.strftime("%Y_%m_%d_%H_%M_%s", time.gmtime()) target_path = self.config_dir / ".".join( [time_stamp, "config.yaml"]) diff --git a/docs/images/paddle.png b/docs/images/paddle.png new file mode 100644 index 000000000..bc1135abf Binary files /dev/null and b/docs/images/paddle.png differ diff --git a/docs/images/tuning_error_surface.png b/docs/images/tuning_error_surface.png new file mode 100644 index 000000000..2204cee2f Binary files /dev/null and b/docs/images/tuning_error_surface.png differ diff --git a/docs/requirements.txt b/docs/requirements.txt new file mode 100644 index 000000000..11e0d4b46 --- /dev/null +++ b/docs/requirements.txt @@ -0,0 +1,7 @@ +myst-parser +numpydoc +recommonmark>=0.5.0 +sphinx +sphinx-autobuild +sphinx-markdown-tables +sphinx_rtd_theme diff --git a/docs/source/asr/deepspeech_architecture.md b/docs/source/asr/models_introduction.md similarity index 80% rename from docs/source/asr/deepspeech_architecture.md rename to docs/source/asr/models_introduction.md index be9471d93..c99093bd6 100644 --- a/docs/source/asr/deepspeech_architecture.md +++ b/docs/source/asr/models_introduction.md @@ -1,6 +1,5 @@ -# Deepspeech2 -## Streaming - +# Models introduction +## Streaming DeepSpeech2 The implemented arcitecure of Deepspeech2 online model is based on [Deepspeech2 model](https://arxiv.org/pdf/1512.02595.pdf) with some changes. The model is mainly composed of 2D convolution subsampling layer and stacked single direction rnn layers. @@ -14,8 +13,8 @@ In addition, the training process and the testing process are also introduced. The arcitecture of the model is shown in Fig.1.

- -
Fig.1 The Arcitecture of deepspeech2 online model + +
Fig.1 The Architecture of deepspeech2 online model

@@ -23,13 +22,13 @@ The arcitecture of the model is shown in Fig.1. #### Vocabulary For English data, the vocabulary dictionary is composed of 26 English characters with " ' ", space, \ and \. The \ represents the blank label in CTC, the \ represents the unknown character and the \ represents the start and the end characters. For mandarin, the vocabulary dictionary is composed of chinese characters statisticed from the training set and three additional characters are added. The added characters are \, \ and \. For both English and mandarin data, we set the default indexs that \=0, \=1 and \= last index. ``` - # The code to build vocabulary - cd examples/aishell/s0 - python3 ../../../utils/build_vocab.py \ - --unit_type="char" \ - --count_threshold=0 \ - --vocab_path="data/vocab.txt" \ - --manifest_paths "data/manifest.train.raw" "data/manifest.dev.raw" +# The code to build vocabulary +cd examples/aishell/s0 +python3 ../../../utils/build_vocab.py \ + --unit_type="char" \ + --count_threshold=0 \ + --vocab_path="data/vocab.txt" \ + --manifest_paths "data/manifest.train.raw" "data/manifest.dev.raw" # vocabulary for aishell dataset (Mandarin) vi examples/aishell/s0/data/vocab.txt @@ -41,29 +40,29 @@ vi examples/librispeech/s0/data/vocab.txt #### CMVN For CMVN, a subset or the full of traininig set is chosed and be used to compute the feature mean and std. ``` - # The code to compute the feature mean and std +# The code to compute the feature mean and std cd examples/aishell/s0 python3 ../../../utils/compute_mean_std.py \ - --manifest_path="data/manifest.train.raw" \ - --spectrum_type="linear" \ - --delta_delta=false \ - --stride_ms=10.0 \ - --window_ms=20.0 \ - --sample_rate=16000 \ - --use_dB_normalization=True \ - --num_samples=2000 \ - --num_workers=10 \ - --output_path="data/mean_std.json" + --manifest_path="data/manifest.train.raw" \ + --spectrum_type="linear" \ + --delta_delta=false \ + --stride_ms=10.0 \ + --window_ms=20.0 \ + --sample_rate=16000 \ + --use_dB_normalization=True \ + --num_samples=2000 \ + --num_workers=10 \ + --output_path="data/mean_std.json" ``` #### Feature Extraction - For feature extraction, three methods are implemented, which are linear (FFT without using filter bank), fbank and mfcc. - Currently, the released deepspeech2 online model use the linear feature extraction method. - ``` - The code for feature extraction - vi deepspeech/frontend/featurizer/audio_featurizer.py - ``` +For feature extraction, three methods are implemented, which are linear (FFT without using filter bank), fbank and mfcc. +Currently, the released deepspeech2 online model use the linear feature extraction method. +``` +The code for feature extraction +vi deepspeech/frontend/featurizer/audio_featurizer.py +``` ### Encoder The encoder is composed of two 2D convolution subsampling layers and a number of stacked single direction rnn layers. The 2D convolution subsampling layers extract feature representation from the raw audio feature and reduce the length of audio feature at the same time. After passing through the convolution subsampling layers, then the feature representation are input into the stacked rnn layers. For the stacked rnn layers, LSTM cell and GRU cell are provided to use. Adding one fully connected (fc) layer after the stacked rnn layers is optional. If the number of stacked rnn layers is less than 5, adding one fc layer after stacked rnn layers is recommand. 
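For illustration, a minimal Paddle sketch of the encoder structure described above (two 2-D convolution subsampling layers followed by forward-only RNN layers). The class name, channel counts and layer sizes are assumptions for the example only; the toolkit's actual implementation lives in `deepspeech/models/ds2_online/deepspeech2.py`.

```python
import paddle
import paddle.nn as nn

class ToyStreamingEncoder(nn.Layer):
    """Illustrative sketch: 2-D conv subsampling + stacked forward-only RNN layers."""

    def __init__(self, feat_dim=161, rnn_size=1024, num_rnn_layers=5):
        super().__init__()
        # two conv layers subsample the time axis (4x) while extracting local features
        self.conv = nn.Sequential(
            nn.Conv2D(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2D(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU())
        conv_out_dim = 32 * ((feat_dim + 3) // 4)  # feature dim after two stride-2 convs
        # forward-only (unidirectional) RNN stack keeps the encoder streamable
        self.rnn = nn.LSTM(conv_out_dim, rnn_size, num_layers=num_rnn_layers,
                           direction="forward")

    def forward(self, x):                      # x: (batch, n_frames, feat_dim)
        x = self.conv(x.unsqueeze(1))          # (batch, 32, n_frames//4, feat_dim//4)
        b, c, t, f = x.shape
        x = x.transpose([0, 2, 1, 3]).reshape([b, t, c * f])
        out, _ = self.rnn(x)                   # (batch, n_frames//4, rnn_size)
        return out

feats = paddle.randn([4, 200, 161])            # (batch, frames, spectrum bins)
print(ToyStreamingEncoder()(feats).shape)      # [4, 50, 1024]
```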
@@ -84,11 +83,11 @@ vi deepspeech/models/ds2_online/deepspeech2.py vi deepspeech/modules/ctc.py ``` -## Training Process +### Training Process Using the command below, you can train the deepspeech2 online model. ``` - cd examples/aishell/s0 - bash run.sh --stage 0 --stop_stage 2 --model_type online --conf_path conf/deepspeech2_online.yaml +cd examples/aishell/s0 +bash run.sh --stage 0 --stop_stage 2 --model_type online --conf_path conf/deepspeech2_online.yaml ``` The detail commands are: ``` @@ -127,11 +126,11 @@ fi By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 stages are used for training process. The stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, vocabulary dictionary and CMVN file will be generated in "./data/". The stage 1 is used for training the model, the log files and model checkpoint is saved in "exp/deepspeech2_online/". The stage 2 is used to generated final model for predicting by averaging the top-k model parameters based on validation loss. -## Testing Process +### Testing Process Using the command below, you can test the deepspeech2 online model. - ``` - bash run.sh --stage 3 --stop_stage 5 --model_type online --conf_path conf/deepspeech2_online.yaml - ``` +``` +bash run.sh --stage 3 --stop_stage 5 --model_type online --conf_path conf/deepspeech2_online.yaml +``` The detail commands are: ``` conf_path=conf/deepspeech2_online.yaml @@ -139,7 +138,7 @@ avg_num=1 model_type=online avg_ckpt=avg_${avg_num} - if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then # test ckpt avg_n CUDA_VISIBLE_DEVICES=2 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${model_type}|| exit -1 fi @@ -156,19 +155,16 @@ fi ``` After the training process, we use stage 3,4,5 for testing process. The stage 3 is for testing the model generated in the stage 2 and provided the CER index of the test set. The stage 4 is for transforming the model from dynamic graph to static graph by using "paddle.jit" library. The stage 5 is for testing the model in static graph. - -## Non-Streaming +## Non-Streaming DeepSpeech2 The deepspeech2 offline model is similarity to the deepspeech2 online model. The main difference between them is the offline model use the stacked bi-directional rnn layers while the online model use the single direction rnn layers and the fc layer is not used. For the stacked bi-directional rnn layers in the offline model, the rnn cell and gru cell are provided to use. The arcitecture of the model is shown in Fig.2.

- -
Fig.2 The Arcitecture of deepspeech2 offline model + +
Fig.2 The Architecture of deepspeech2 offline model

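As a rough illustration of the contrast described above, the two variants differ mainly in the RNN direction; the sizes below are assumptions, not the released configurations:

```python
import paddle.nn as nn

# streaming (online) model: stacked forward-only RNN layers, optionally followed by an fc layer
online_rnn = nn.GRU(input_size=1312, hidden_size=1024, num_layers=5, direction="forward")
# non-streaming (offline) model: stacked bidirectional RNN layers, no fc layer afterwards
offline_rnn = nn.GRU(input_size=1312, hidden_size=1024, num_layers=3, direction="bidirect")
```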
- - For data preparation and decoder, the deepspeech2 offline model is same with the deepspeech2 online model. The code of encoder and decoder for deepspeech2 offline model is in: @@ -180,7 +176,7 @@ The training process and testing process of deepspeech2 offline model is very si Only some changes should be noticed. For training and testing, the "model_type" and the "conf_path" must be set. - ``` +``` # Training offline cd examples/aishell/s0 bash run.sh --stage 0 --stop_stage 2 --model_type offline --conf_path conf/deepspeech2.yaml diff --git a/docs/source/asr/getting_started.md b/docs/source/asr/quick_start.md similarity index 83% rename from docs/source/asr/getting_started.md rename to docs/source/asr/quick_start.md index 478f3bb38..da1620e90 100644 --- a/docs/source/asr/getting_started.md +++ b/docs/source/asr/quick_start.md @@ -1,5 +1,4 @@ -# Getting Started - +# Quick Start of Speech-To-Text Several shell scripts provided in `./examples/tiny/local` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference and model evaluation, with a few public dataset (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your own data. Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if out-of-memory problem occurs, just reduce `batch_size` to fit. @@ -11,68 +10,52 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org ```bash cd examples/tiny ``` - Notice that this is only a toy example with a tiny sampled subset of LibriSpeech. If you would like to try with the complete dataset (would take several days for training), please go to `examples/librispeech` instead. - - Source env - ```bash source path.sh ``` - **Must do this before starting do anything.** - Set `MAIN_ROOT` as project dir. Using defualt `deepspeech2` model as default, you can change this in the script. - + **Must do this before you start to do anything.** + Set `MAIN_ROOT` as project dir. Using defualt `deepspeech2` model as `MODEL`, you can change this in the script. - Main entrypoint - ```bash bash run.sh ``` - This just a demo, please make sure every `step` is work fine when do next `step`. + This is just a demo, please make sure every `step` works well before next `step`. More detailed information are provided in the following sections. Wish you a happy journey with the *DeepSpeech on PaddlePaddle* ASR engine! ## Training a model -The key steps of training for Mandarin language are same to that of English language and we have also provided an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by local/download_model.sh) for users to try with ```sh infer_golden.sh``` and ```sh test_golden.sh```. Notice that, different from English LM, the Mandarin LM is character-based and please run ```local/tune.sh``` to find an optimal setting. 
+The key steps of training for Mandarin language are same to that of English language and we have also provided an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh```and ```sh infer.sh```to do data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by local/download_model.sh) for users to try with ```sh infer_golden.sh```and ```sh test_golden.sh```. Notice that, different from English LM, the Mandarin LM is character-based and please run ```local/tune.sh```to find an optimal setting. ## Speech-to-text Inference An inference module caller `infer.py` is provided to infer, decode and visualize speech-to-text results for several given audio clips. It might help to have an intuitive and qualitative evaluation of the ASR model's performance. - ```bash CUDA_VISIBLE_DEVICES=0 bash local/infer.sh ``` - We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) otherwise utilizes a heuristic breadth-first graph search for reaching a near global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with argument `decoding_method`. ## Evaluate a Model - To evaluate a model's performance quantitatively, please run: - ```bash CUDA_VISIBLE_DEVICES=0 bash local/test.sh ``` - The error rate (default: word error rate; can be set with `error_rate_type`) will be printed. -For more help on arguments: - ## Hyper-parameters Tuning - The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertion weight) for the [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) often have a significant impact on the decoder's performance. It would be better to re-tune them on the validation set when the acoustic model is renewed. `tune.py` performs a 2-D grid search over the hyper-parameter $\alpha$ and $\beta$. You must provide the range of $\alpha$ and $\beta$, as well as the number of their attempts. - - ```bash CUDA_VISIBLE_DEVICES=0 bash local/tune.sh ``` - The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameters space, and draw the error surface optionally. A proper hyper-parameters range should include the global minima of the error surface for WER/CER, as illustrated in the following figure.

- -
An example error surface for tuning on the dev-clean set of LibriSpeech + +
An example error surface for tuning on the dev-clean set of LibriSpeech

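An error surface like the one above comes from sweeping the two hyper-parameters over a grid. The sketch below mirrors that procedure; the ranges, step counts and the `decode_and_score` helper are assumptions standing in for the real `tune.py` logic:

```python
import itertools
import numpy as np

def grid_search(alphas, betas, decode_and_score):
    """Evaluate WER/CER at every (alpha, beta) grid point and return the best pair."""
    errors = np.zeros((len(alphas), len(betas)))
    for (i, alpha), (j, beta) in itertools.product(enumerate(alphas), enumerate(betas)):
        errors[i, j] = decode_and_score(alpha=alpha, beta=beta)
        print(f"alpha={alpha:.2f} beta={beta:.2f} error={errors[i, j]:.4f}")
    i_best, j_best = np.unravel_index(errors.argmin(), errors.shape)
    return alphas[i_best], betas[j_best], errors

# toy scoring function so the sketch runs; replace it with real beam-search decoding + WER/CER
toy_score = lambda alpha, beta: (alpha - 2.5) ** 2 + (beta - 0.3) ** 2
best_alpha, best_beta, surface = grid_search(
    np.linspace(1.0, 3.2, 9), np.linspace(0.1, 0.45, 5), toy_score)
```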
Usually, as the figure shows, the variation of language model weight ($\alpha$) significantly affect the performance of CTC beam search decoder. And a better procedure is to first tune on serveral data batches (the number can be specified) to find out the proper range of hyper-parameters, then change to the whole validation set to carray out an accurate tuning. diff --git a/docs/source/asr/released_model.md b/docs/source/asr/released_model.md deleted file mode 100644 index dc3a176b0..000000000 --- a/docs/source/asr/released_model.md +++ /dev/null @@ -1,28 +0,0 @@ -# Released Models - -## Acoustic Model Released in paddle 2.X -Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech -:-------------:| :------------:| :-----: | -----: | :----------------- |:--------- | :---------- | :--------- -[Ds2 Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.0824 |-| 151 h -[Ds2 Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.065 |-| 151 h -[Conformer Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz) | Aishell Dataset | Char-based | 283 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention + CTC | 0.0594 |-| 151 h -[Conformer Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention | 0.0547 |-| 151 h -[Conformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz) | Librispeech Dataset | Word-based | 287 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention |-| 0.0325 | 960 h -[Transformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/transformer.release.tar.gz) | Librispeech Dataset | Word-based | 195 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention |-| 0.0544 | 960 h - -## Acoustic Model Transformed from paddle 1.8 -Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech -:-------------:| :------------:| :-----: | -----: | :----------------- | :---------- | :---------- | :--------- -[Ds2 Offline Aishell model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_v1.8_to_v2.x.tar.gz)|Aishell Dataset| Char-based| 234 MB| 2 Conv + 3 bidirectional GRU layers| 0.0804 |-| 151 h| -[Ds2 Offline Librispeech model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_v1.8_to_v2.x.tar.gz)|Librispeech Dataset| Word-based| 307 MB| 2 Conv + 3 bidirectional sharing weight RNN layers |-| 0.0685| 960 h| -[Ds2 Offline Baidu en8k model](https://deepspeech.bj.bcebos.com/eng_models/baidu_en8k_v1.8_to_v2.x.tar.gz)|Baidu Internal English Dataset| Word-based| 273 MB| 2 Conv + 3 bidirectional GRU layers |-| 0.0541 | 8628 h| - - - -## Language Model Released - -Language Model | Training Data | Token-based | Size | Descriptions -:-------------:| :------------:| :-----: | -----: | :----------------- -[English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | 
[CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1;
About 1.85 billion n-grams;
'trie' binary with '-a 22 -q 8 -b 8' -[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4;
About 0.13 billion n-grams;
'probing' binary with default settings -[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning;
About 3.7 billion n-grams;
'probing' binary with default settings diff --git a/docs/source/conf.py b/docs/source/conf.py index 7e32a22c4..c41884ef8 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -23,6 +23,8 @@ import recommonmark.parser import sphinx_rtd_theme +autodoc_mock_imports = ["soundfile", "librosa"] + # -- Project information ----------------------------------------------------- project = 'paddle speech' @@ -46,10 +48,10 @@ pygments_style = 'sphinx' extensions = [ 'sphinx.ext.autodoc', 'sphinx.ext.viewcode', - 'sphinx_rtd_theme', + "sphinx_rtd_theme", 'sphinx.ext.mathjax', - 'sphinx.ext.autosummary', 'numpydoc', + 'sphinx.ext.autosummary', 'myst_parser', ] @@ -76,6 +78,7 @@ smartquotes = False # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] +html_logo = '../images/paddle.png' # -- Extension configuration ------------------------------------------------- # numpydoc_show_class_members = False diff --git a/docs/source/index.rst b/docs/source/index.rst index 3c196d2d8..06bc2f3fa 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -10,34 +10,44 @@ Contents .. toctree:: :maxdepth: 1 :caption: Introduction - - asr/deepspeech_architecture + introduction .. toctree:: :maxdepth: 1 - :caption: Getting_started - - asr/install - asr/getting_started - + :caption: Quick Start + install + asr/quick_start + tts/quick_start + .. toctree:: :maxdepth: 1 - :caption: More Information + :caption: Speech-To-Text + asr/models_introduction asr/data_preparation asr/augmentation asr/feature_list - asr/ngram_lm - + asr/ngram_lm .. toctree:: :maxdepth: 1 - :caption: Released_model + :caption: Text-To-Speech - asr/released_model + tts/basic_usage + tts/advanced_usage + tts/zh_text_frontend + tts/models_introduction + tts/gan_vocoder + tts/demo + tts/demo_2 +.. toctree:: + :maxdepth: 1 + :caption: Released Models + + released_model .. toctree:: :maxdepth: 1 @@ -45,3 +55,8 @@ Contents asr/reference + + + + + diff --git a/docs/source/asr/install.md b/docs/source/install.md similarity index 92% rename from docs/source/asr/install.md rename to docs/source/install.md index 8cecba125..0c27a4db3 100644 --- a/docs/source/asr/install.md +++ b/docs/source/install.md @@ -8,7 +8,7 @@ To avoid the trouble of environment setup, [running in Docker container](#runnin ## Setup (Important) -- Make sure these libraries or tools installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost`, `sox, and `swig`, e.g. installing them via `apt-get`: +- Make sure these libraries or tools installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost`, `sox`, and `swig`, e.g. installing them via `apt-get`: ```bash sudo apt-get install -y sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev @@ -44,6 +44,14 @@ bash setup.sh source tools/venv/bin/activate ``` +## Simple Setup + +```python +git clone https://github.com/PaddlePaddle/DeepSpeech.git +cd DeepSpeech +pip install -e . +``` + ## Running in Docker Container (optional) Docker is an open source tool to build, ship, and run distributed applications in an isolated environment. A Docker image for this project has been provided in [hub.docker.com](https://hub.docker.com) with all the dependencies installed. This Docker image requires the support of NVIDIA GPU, so please make sure its availiability and the [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed. 
diff --git a/docs/source/introduction.md b/docs/source/introduction.md new file mode 100644 index 000000000..2f71b104f --- /dev/null +++ b/docs/source/introduction.md @@ -0,0 +1,33 @@ +# PaddleSpeech + +## What is PaddleSpeech? +PaddleSpeech is an open-source toolkit on PaddlePaddle platform for two critical tasks in Speech - Speech-To-Text (Automatic Speech Recognition, ASR) and Text-To-Speech Synthesis (TTS), with modules involving state-of-art and influential models. + +## What can PaddleSpeech do? + +### Speech-To-Text +(An introduce of ASR in PaddleSpeech is needed here!) + +### Text-To-Speech +TTS mainly consists of components below: +- Implementation of models and commonly used neural network layers. +- Dataset abstraction and common data preprocessing pipelines. +- Ready-to-run experiments. + +PaddleSpeech TTS provides you with a complete TTS pipeline, including: +- Text FrontEnd + - Rule based Chinese frontend. +- Acoustic Models + - FastSpeech2 + - SpeedySpeech + - TransformerTTS + - Tacotron2 +- Vocoders + - Multi Band MelGAN + - Parallel WaveGAN + - WaveFlow +- Voice Cloning + - Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis + - GE2E + +Text-To-Speech helps you to train TTS models with simple commands. diff --git a/docs/source/released_model.md b/docs/source/released_model.md new file mode 100644 index 000000000..3b60f15a2 --- /dev/null +++ b/docs/source/released_model.md @@ -0,0 +1,55 @@ +# Released Models + +## Speech-To-Text Models +### Acoustic Model Released in paddle 2.X +Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech +:-------------:| :------------:| :-----: | -----: | :----------------- |:--------- | :---------- | :--------- +[Ds2 Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.0824 |-| 151 h +[Ds2 Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.065 |-| 151 h +[Conformer Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz) | Aishell Dataset | Char-based | 283 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention + CTC | 0.0594 |-| 151 h +[Conformer Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention | 0.0547 |-| 151 h +[Conformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz) | Librispeech Dataset | Word-based | 287 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention |-| 0.0325 | 960 h +[Transformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/transformer.release.tar.gz) | Librispeech Dataset | Word-based | 195 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention |-| 0.0544 | 960 h + +### Acoustic Model Transformed from paddle 1.8 +Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech +:-------------:| :------------:| :-----: | -----: | :----------------- | :---------- | :---------- | :--------- +[Ds2 Offline Aishell 
model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_v1.8_to_v2.x.tar.gz)|Aishell Dataset| Char-based| 234 MB| 2 Conv + 3 bidirectional GRU layers| 0.0804 |-| 151 h| +[Ds2 Offline Librispeech model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_v1.8_to_v2.x.tar.gz)|Librispeech Dataset| Word-based| 307 MB| 2 Conv + 3 bidirectional sharing weight RNN layers |-| 0.0685| 960 h| +[Ds2 Offline Baidu en8k model](https://deepspeech.bj.bcebos.com/eng_models/baidu_en8k_v1.8_to_v2.x.tar.gz)|Baidu Internal English Dataset| Word-based| 273 MB| 2 Conv + 3 bidirectional GRU layers |-| 0.0541 | 8628 h| + +### Language Model Released + +Language Model | Training Data | Token-based | Size | Descriptions +:-------------:| :------------:| :-----: | -----: | :----------------- +[English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1;
About 1.85 billion n-grams;
'trie' binary with '-a 22 -q 8 -b 8' +[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4;
About 0.13 billion n-grams;
'probing' binary with default settings +[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning;
About 3.7 billion n-grams;
'probing' binary with default settings + +## Text-To-Speech Models +### Acoustic Models +Model Type | Dataset| Example Link | Pretrained Models +:-------------:| :------------:| :-----: | :----- +Tacotron2|LJSpeech|[tacotron2-vctk](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip) +TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.4.zip) +SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_nosil_baker_ckpt_0.5.zip) +FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip) +FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip) +FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) +FastSpeech2| VCTK |[fastspeech2-csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_vctk_ckpt_0.5.zip) + + +### Vocoders + +Model Type | Dataset| Example Link | Pretrained Models +:-------------:| :------------:| :-----: | :----- +WaveFlow| LJSpeech |[waveflow-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc0)|[waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip) +Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip.](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip) +Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip) +Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_vctk_ckpt_0.5.zip) + +### Voice Cloning +Model Type | Dataset| Example Link | Pretrained Models +:-------------:| :------------:| :-----: | :----- +GE2E| AISHELL-3, etc. 
|[ge2e](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/ge2e)|[ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip) +GE2E + Tactron2| AISHELL-3 |[ge2e-tactron2-aishell3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/aishell3/vc0)|[tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip) diff --git a/docs/source/tts/advanced_usage.md b/docs/source/tts/advanced_usage.md index 529800444..297f274f7 100644 --- a/docs/source/tts/advanced_usage.md +++ b/docs/source/tts/advanced_usage.md @@ -1,6 +1,5 @@ - # Advanced Usage -This sections covers how to extend parakeet by implementing your own models and experiments. Guidelines on implementation are also elaborated. +This sections covers how to extend TTS by implementing your own models and experiments. Guidelines on implementation are also elaborated. For the general deep learning experiment, there are several parts to deal with: 1. Preprocess the data according to the needs of the model, and iterate the dataset by batch. @@ -8,7 +7,7 @@ For the general deep learning experiment, there are several parts to deal with: 3. Write out the training process (generally including forward / backward calculation, parameter update, log recording, visualization, periodic evaluation, etc.). 5. Configure and run the experiment. -## Parakeet's Model Components +## PaddleSpeech TTS's Model Components In order to balance the reusability and function of models, we divide models into several types according to its characteristics. For the commonly used modules that can be used as part of other larger models, we try to implement them as simple and universal as possible, because they will be reused. Modules with trainable parameters are generally implemented as subclasses of `paddle.nn.Layer`. Modules without trainable parameters can be directly implemented as a function, and its input and output are `paddle.Tensor`. @@ -68,11 +67,11 @@ There are two common ways to define a model which consists of several modules. ``` When a model is a complicated and made up of several components, each of which has a separate functionality, and can be replaced by other components with the same functionality, we prefer to define it in this way. -In the directory structure of Parakeet, modules with high reusability are placed in `parakeet.modules`, but models for specific tasks are placed in `parakeet.models`. When developing a new model, developers need to consider the feasibility of splitting the modules, and the degree of generality of the modules, and place them in appropriate directories. +In the directory structure of PaddleSpeech TTS, modules with high reusability are placed in `parakeet.modules`, but models for specific tasks are placed in `parakeet.models`. When developing a new model, developers need to consider the feasibility of splitting the modules, and the degree of generality of the modules, and place them in appropriate directories. -## Parakeet's Data Components +## PaddleSpeech TTS's Data Components Another critical componnet for a deep learning project is data. -Parakeet uses the following methods for training data: +PaddleSpeech TTS uses the following methods for training data: 1. Preprocess the data. 2. Load the preprocessed data for training. 
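The two-step pattern above — preprocess once, then load the precomputed features during training — can be sketched with `paddle.io`; the field layout and padding scheme below are assumptions for illustration, not the toolkit's actual data classes:

```python
import numpy as np
import paddle
from paddle.io import DataLoader, Dataset

class PrecomputedFeatureDataset(Dataset):
    """Illustrative sketch: wraps features written by a separate preprocessing step."""

    def __init__(self, examples):
        # each example: (mel spectrogram of shape [n_frames, n_mels], n_frames)
        self.examples = examples

    def __getitem__(self, idx):
        return self.examples[idx]

    def __len__(self):
        return len(self.examples)

def pad_batch(batch):
    # pad variable-length utterances to the longest one in the batch
    feats, lengths = zip(*batch)
    max_len = max(lengths)
    padded = np.stack([np.pad(f, ((0, max_len - n), (0, 0))) for f, n in zip(feats, lengths)])
    return paddle.to_tensor(padded), paddle.to_tensor(np.array(lengths, dtype="int64"))

# dummy arrays standing in for the output of the preprocessing stage
examples = [(np.random.randn(n, 80).astype("float32"), n) for n in (120, 95, 143)]
loader = DataLoader(PrecomputedFeatureDataset(examples), batch_size=2, collate_fn=pad_batch)
for mels, n_frames in loader:
    print(mels.shape, n_frames.numpy())
```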
@@ -154,7 +153,7 @@ def _convert(self, meta_datum: Dict[str, Any]) -> Dict[str, Any]: return example ``` -## Parakeet's Training Components +## PaddleSpeech TTS's Training Components A typical training process includes the following processes: 1. Iterate the dataset. 2. Process batch data. @@ -164,7 +163,7 @@ A typical training process includes the following processes: 6. Write logs, visualize, and in some cases save necessary intermediate results. 7. Save the state of the model and optimizer. -Here, we mainly introduce the training related components of Parakeet and why we designed it like this. +Here, we mainly introduce the training related components of TTS in Pa and why we designed it like this. ### Global Repoter When training and modifying Deep Learning models,logging is often needed, and it has even become the key to model debugging and modifying. We usually use various visualization tools,such as , `visualdl` in `paddle`, `tensorboard` in `tensorflow` and `vidsom`, `wnb` ,etc. Besides, `logging` and `print` are usuaally used for different purpose. @@ -245,7 +244,7 @@ def test_reporter_scope(): In this way, when we write modular components, we can directly call `report`. The caller will decide where to report as long as it's ready for `OBSERVATION`, then it opens a `scope` and calls the component within this `scope`. - The `Trainer` in Parakeet report the information in this way. + The `Trainer` in PaddleSpeech TTS report the information in this way. ```python while True: self.observation = {} @@ -269,7 +268,7 @@ We made an abstraction for these intermediate processes, that is, `Updater`, whi ### Visualizer Because we choose observation as the communication mode, we can simply write the things in observation into `visualizer`. -## Parakeet's Configuration Components +## PaddleSpeech TTS's Configuration Components Deep learning experiments often have many options to configure. These configurations can be roughly divided into several categories. 1. Data source and data processing mode configuration. 2. Save path configuration of experimental results. @@ -293,28 +292,26 @@ The following is the basic `ArgumentParser`: 3. `--output-dir` is the dir to save the training results.(if there are checkpoints in `checkpoints/` of `--output-dir` , it's defalut to reload the newest checkpoint to train) 4. `--device` and `--nprocs` determine operation modes,`--device` specifies the type of running device, whether to run on `cpu` or `gpu`. `--nprocs` refers to the number of training processes. If `nprocs` > 1, it means that multi process parallel training is used. (Note: currently only GPU multi card multi process training is supported.) -Developers can refer to the examples in `Parakeet/examples` to write the default configuration file when adding new experiments. +Developers can refer to the examples in `examples` to write the default configuration file when adding new experiments. 
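The options listed above correspond to a small `argparse` definition. A minimal sketch follows; the help strings and defaults are assumptions, and the real parser also defines the data/config arguments omitted here:

```python
import argparse

parser = argparse.ArgumentParser(description="Train a PaddleSpeech TTS model.")
parser.add_argument("--output-dir", type=str,
                    help="dir for training results; the newest checkpoint under "
                         "checkpoints/ is reloaded if it exists")
parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"],
                    help="type of device to run on")
parser.add_argument("--nprocs", type=int, default=1,
                    help="number of training processes; values > 1 enable "
                         "multi-process parallel training (GPU only)")
args = parser.parse_args()
```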
-## Parakeet's Experiment template +## PaddleSpeech TTS's Experiment template -The experimental codes in Parakeet are generally organized as follows: +The experimental codes in PaddleSpeech TTS are generally organized as follows: ```text -├── conf -│ └── default.yaml (defalut config) -├── README.md (help information) -├── batch_fn.py (organize metadata into batch) -├── config.py (code to read default config) -├── *_updater.py (Updater of a specific model) -├── preprocess.py (data preprocessing code) -├── preprocess.sh (script to call data preprocessing.py) -├── synthesis.py (synthesis from metadata) -├── synthesis.sh (script to call synthesis.py) -├── synthesis_e2e.py (synthesis from raw text) -├── synthesis_e2e.sh (script to call synthesis_e2e.py) -├── train.py (train code) -└── run.sh (script to call train.py) +. +├── README.md (help information) +├── conf +│ └── default.yaml (defalut config) +├── local +│ ├── preprocess.sh (script to call data preprocessing.py) +│ ├── synthesize.sh (script to call synthesis.py) +│ ├── synthesize_e2e.sh (script to call synthesis_e2e.py) +│ └──train.sh (script to call train.py) +├── path.sh (script include paths to be sourced) +└── run.sh (script to call scripts in local) ``` +The `*.py` files called by above `*.sh` are located `${BIN_DIR}/` We add a named argument. `--output-dir` to each training script to specify the output directory. The directory structure is as follows, It's best for developers to follow this specification: ```text @@ -330,4 +327,4 @@ exp/default/ └── test/ (output dir of synthesis results) ``` -You can view the examples we provide in `Parakeet/examples`. These experiments are provided to users as examples which can be run directly. Users are welcome to add new models and experiments and contribute code to Parakeet. +You can view the examples we provide in `examples`. These experiments are provided to users as examples which can be run directly. Users are welcome to add new models and experiments and contribute code to PaddleSpeech. diff --git a/docs/source/tts/basic_usage.md b/docs/source/tts/basic_usage.md deleted file mode 100644 index fc2a5bad1..000000000 --- a/docs/source/tts/basic_usage.md +++ /dev/null @@ -1,115 +0,0 @@ -# Basic Usage -This section shows how to use pretrained models provided by parakeet and make inference with them. - -Pretrained models in v0.4 are provided in a archive. Extract it to get a folder like this: -``` -checkpoint_name/ -├──default.yaml -├──snapshot_iter_76000.pdz -├──speech_stats.npy -└──phone_id_map.txt -``` -`default.yaml` stores the config used to train the model. -`snapshot_iter_N.pdz` is the chechpoint file, where `N` is the steps it has been trained. -`*_stats.npy` is the stats file of feature if it has been normalized before training. -`phone_id_map.txt` is the map of phonemes to phoneme_ids. - -The example code below shows how to use the models for prediction. -## Acoustic Models (text to spectrogram) -The code below show how to use a `FastSpeech2` model. After loading the pretrained model, use it and normalizer object to construct a prediction object,then use fastspeech2_inferencet(phone_ids) to generate spectrograms, which can be further used to synthesize raw audio with a vocoder. 
- -```python -from pathlib import Path -import numpy as np -import paddle -import yaml -from yacs.config import CfgNode -from parakeet.models.fastspeech2 import FastSpeech2 -from parakeet.models.fastspeech2 import FastSpeech2Inference -from parakeet.modules.normalizer import ZScore -# Parakeet/examples/fastspeech2/baker/frontend.py -from frontend import Frontend - -# load the pretrained model -checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4") -with open(checkpoint_dir / "phone_id_map.txt", "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] -vocab_size = len(phn_id) -with open(checkpoint_dir / "default.yaml") as f: - fastspeech2_config = CfgNode(yaml.safe_load(f)) -odim = fastspeech2_config.n_mels -model = FastSpeech2( - idim=vocab_size, odim=odim, **fastspeech2_config["model"]) -model.set_state_dict( - paddle.load(args.fastspeech2_checkpoint)["main_params"]) -model.eval() - -# load stats file -stat = np.load(checkpoint_dir / "speech_stats.npy") -mu, std = stat -mu = paddle.to_tensor(mu) -std = paddle.to_tensor(std) -fastspeech2_normalizer = ZScore(mu, std) - -# construct a prediction object -fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) - -# load Chinese Frontend -frontend = Frontend(checkpoint_dir / "phone_id_map.txt") - -# text to spectrogram -sentence = "你好吗?" -input_ids = frontend.get_input_ids(sentence, merge_sentences=True) -phone_ids = input_ids["phone_ids"] -flags = 0 -# The output of Chinese text frontend is segmented -for part_phone_ids in phone_ids: - with paddle.no_grad(): - temp_mel = fastspeech2_inference(part_phone_ids) - if flags == 0: - mel = temp_mel - flags = 1 - else: - mel = paddle.concat([mel, temp_mel]) -``` - -## Vocoder (spectrogram to wave) -The code below show how to use a ` Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and normalizer object to construct a prediction object,then use pwg_inference(mel) to generate raw audio (in wav format). - -```python -from pathlib import Path -import numpy as np -import paddle -import soundfile as sf -import yaml -from yacs.config import CfgNode -from parakeet.models.parallel_wavegan import PWGGenerator -from parakeet.models.parallel_wavegan import PWGInference -from parakeet.modules.normalizer import ZScore - -# load the pretrained model -checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4") -with open(checkpoint_dir / "pwg_default.yaml") as f: - pwg_config = CfgNode(yaml.safe_load(f)) -vocoder = PWGGenerator(**pwg_config["generator_params"]) -vocoder.set_state_dict(paddle.load(args.pwg_params)) -vocoder.remove_weight_norm() -vocoder.eval() - -# load stats file -stat = np.load(checkpoint_dir / "pwg_stats.npy") -mu, std = stat -mu = paddle.to_tensor(mu) -std = paddle.to_tensor(std) -pwg_normalizer = ZScore(mu, std) - -# construct a prediction object -pwg_inference = PWGInference(pwg_normalizer, vocoder) - -# spectrogram to wave -wav = pwg_inference(mel) -sf.write( - audio_path, - wav.numpy(), - samplerate=fastspeech2_config.fs) -``` diff --git a/docs/source/tts/demo.rst b/docs/source/tts/demo.rst index a6f18f88d..948fc056e 100644 --- a/docs/source/tts/demo.rst +++ b/docs/source/tts/demo.rst @@ -11,7 +11,7 @@ The main processes of TTS include: When training ``Tacotron2``、``TransformerTTS`` and ``WaveFlow``, we use English single speaker TTS dataset `LJSpeech `_ by default. However, when training ``SpeedySpeech``, ``FastSpeech2`` and ``ParallelWaveGAN``, we use Chinese single speaker dataset `CSMSC `_ by default. 
-In the future, ``Parakeet`` will mainly use Chinese TTS datasets for default examples. +In the future, ``PaddleSpeech TTS`` will mainly use Chinese TTS datasets for default examples. Here, we will display three types of audio samples: @@ -441,7 +441,7 @@ Audio samples generated by a TTS system. Text is first transformed into spectrog Chinese TTS with/without text frontend -------------------------------------- -We provide a complete Chinese text frontend module in ``Parakeet``. ``Text Normalization`` and ``G2P`` are the most important modules in text frontend, We assume that the texts are normalized already, and mainly compare ``G2P`` module here. +We provide a complete Chinese text frontend module in ``PaddleSpeech TTS``. ``Text Normalization`` and ``G2P`` are the most important modules in a text frontend. We assume that the texts are already normalized and mainly compare the ``G2P`` module here. We use ``FastSpeech2`` + ``ParallelWaveGAN`` here. diff --git a/docs/source/tts/demo_2.rst b/docs/source/tts/demo_2.rst new file mode 100644 index 000000000..37922fcbf --- /dev/null +++ b/docs/source/tts/demo_2.rst @@ -0,0 +1,7 @@ +Audio Sample (PaddleSpeech TTS vs. ESPnet TTS) +============================================== + +This is an audio demo page contrasting PaddleSpeech TTS and ESPnet TTS. We use each toolkit's own modules (Text Frontend, Acoustic Model and Vocoder) here. +We use ESPnet's released models here. + +FastSpeech2 + Parallel WaveGAN in CSMSC diff --git a/docs/source/tts/gan_vocoder.md b/docs/source/tts/gan_vocoder.md new file mode 100644 index 000000000..4931f3072 --- /dev/null +++ b/docs/source/tts/gan_vocoder.md @@ -0,0 +1,9 @@ +# GAN Vocoders +This is a brief introduction to GAN vocoders; we mainly introduce the losses of the different vocoders here. + +Model | Generator Loss |Discriminator Loss +:-------------:| :------------:| :----- +Parallel Wave GAN| adversarial loss
Feature Matching | Multi-Scale Discriminator | +Mel GAN | adversarial loss
Multi-resolution STFT loss | adversarial loss| +Multi-Band Mel GAN | adversarial loss
full band Multi-resolution STFT loss
sub band Multi-resolution STFT loss |Multi-Scale Discriminator| +HiFi GAN | adversarial loss
Feature Matching
Mel-Spectrogram Loss | Multi-Scale Discriminator
Multi-Period Discriminato | diff --git a/docs/source/tts/index.rst b/docs/source/tts/index.rst deleted file mode 100644 index 74abe60d7..000000000 --- a/docs/source/tts/index.rst +++ /dev/null @@ -1,45 +0,0 @@ -.. parakeet documentation master file, created by - sphinx-quickstart on Fri Sep 10 14:22:24 2021. - You can adapt this file completely to your liking, but it should at least - contain the root `toctree` directive. - -Parakeet -==================================== - -``parakeet`` is a deep learning based text-to-speech toolkit built upon ``paddlepaddle`` framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by `Baidu Research `_ and other research groups. - -``parakeet`` mainly consists of components below. - -#. Implementation of models and commonly used neural network layers. -#. Dataset abstraction and common data preprocessing pipelines. -#. Ready-to-run experiments. - -.. toctree:: - :maxdepth: 1 - :caption: Introduction - - introduction - -.. toctree:: - :maxdepth: 1 - :caption: Getting started - - install - basic_usage - advanced_usage - cn_text_frontend - released_models - -.. toctree:: - :maxdepth: 1 - :caption: Demos - - demo - - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/docs/source/tts/install.md b/docs/source/tts/install.md deleted file mode 100644 index b092acff0..000000000 --- a/docs/source/tts/install.md +++ /dev/null @@ -1,47 +0,0 @@ -# Installation -## Install PaddlePaddle -Parakeet requires PaddlePaddle as its backend. Note that 2.1.2 or newer versions of paddle is required. - -Since paddlepaddle has multiple packages depending on the device (cpu or gpu) and the dependency libraries, it is recommended to install a proper package of paddlepaddle with respect to the device and dependency library versons via `pip`. - -Installing paddlepaddle with conda or build paddlepaddle from source is also supported. Please refer to [PaddlePaddle installation](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) for more details. - -Example instruction to install paddlepaddle via pip is listed below. - -### PaddlePaddle with GPU -```python -# PaddlePaddle for CUDA10.1 -python -m pip install paddlepaddle-gpu==2.1.2.post101 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html -# PaddlePaddle for CUDA10.2 -python -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple -# PaddlePaddle for CUDA11.0 -python -m pip install paddlepaddle-gpu==2.1.2.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html -# PaddlePaddle for CUDA11.2 -python -m pip install paddlepaddle-gpu==2.1.2.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html -``` -### PaddlePaddle with CPU -```python -python -m pip install paddlepaddle==2.1.2 -i https://mirror.baidu.com/pypi/simple -``` -## Install libsndfile -Experimemts in parakeet often involve audio and spectrum processing, thus `librosa` and `soundfile` are required. `soundfile` requires a extra C library `libsndfile`, which is not always handled by pip. - -For Windows and Mac users, `libsndfile` is also installed when installing `soundfile` via pip, but for Linux users, installing `libsndfile` via system package manager is required. Example commands for popular distributions are listed below. 
-```bash -# ubuntu, debian -sudo apt-get install libsndfile1 -# centos, fedora -sudo yum install libsndfile -# openSUSE -sudo zypper in libsndfile -``` -For any problem with installtion of soundfile, please refer to [SoundFile](https://pypi.org/project/SoundFile/). -## Install Parakeet -There are two ways to install parakeet according to the purpose of using it. - - 1. If you want to run experiments provided by parakeet or add new models and experiments, it is recommended to clone the project from github (Parakeet), and install it in editable mode. - ```python - git clone https://github.com/PaddlePaddle/Parakeet - cd Parakeet - pip install -e . - ``` diff --git a/docs/source/tts/introduction.md b/docs/source/tts/introduction.md deleted file mode 100644 index d350565cd..000000000 --- a/docs/source/tts/introduction.md +++ /dev/null @@ -1,27 +0,0 @@ -# Parakeet - PAddle PARAllel text-to-speech toolKIT - -## What is Parakeet? -Parakeet is a deep learning based text-to-speech toolkit built upon paddlepaddle framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by Baidu Research and other research groups. - -## What can Parakeet do? -Parakeet mainly consists of components below: -- Implementation of models and commonly used neural network layers. -- Dataset abstraction and common data preprocessing pipelines. -- Ready-to-run experiments. - -Parakeet provides you with a complete TTS pipeline, including: -- Text FrontEnd - - Rule based Chinese frontend. -- Acoustic Models - - FastSpeech2 - - SpeedySpeech - - TransformerTTS - - Tacotron2 -- Vocoders - - Parallel WaveGAN - - WaveFlow -- Voice Cloning - - Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis - - GE2E - -Parakeet helps you to train TTS models with simple commands. diff --git a/docs/source/tts/released_models.md b/docs/source/tts/models_introduction.md similarity index 82% rename from docs/source/tts/released_models.md rename to docs/source/tts/models_introduction.md index 7899c1c5d..b13297582 100644 --- a/docs/source/tts/released_models.md +++ b/docs/source/tts/models_introduction.md @@ -1,12 +1,12 @@ -# Released Models -TTS system mainly includes three modules: `text frontend`, `Acoustic model` and `Vocoder`. We introduce a rule based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable models. +# Models introduction +TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We introduce a rule based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable models. The main processes of TTS include: 1. Convert the original text into characters/phonemes, through `text frontend` module. 2. Convert characters/phonemes into acoustic features , such as linear spectrogram, mel spectrogram, LPC features, etc. through `Acoustic models`. 3. Convert acoustic features into waveforms through `Vocoders`. -A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by Parakeet are acoustic models and vocoders. +A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by PaddleSpeech TTS are acoustic models and vocoders. 
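At inference time, the three stages above correspond to three callable objects that are simply chained. The sketch below is illustrative only: `frontend`, `am_inference` and `voc_inference` stand for whichever text frontend, acoustic model inference wrapper and vocoder inference wrapper you have loaded (concrete loading code appears in the quick start section), and the output sample rate depends on the vocoder checkpoint.

```python
# Minimal sketch, assuming `frontend`, `am_inference` and `voc_inference`
# were constructed as in the FastSpeech2 / Parallel WaveGAN examples below.
import paddle
import soundfile as sf

sentence = "你好吗?"

# 1. Text frontend: raw text -> phoneme ids (the Chinese frontend returns segments).
phone_ids = frontend.get_input_ids(sentence, merge_sentences=True)["phone_ids"]

with paddle.no_grad():
    # 2. Acoustic model: phoneme ids -> mel spectrogram.
    mel = paddle.concat([am_inference(part) for part in phone_ids])
    # 3. Vocoder: mel spectrogram -> waveform.
    wav = voc_inference(mel)

# 24000 Hz is an assumption; use the sample rate from the vocoder's config.
sf.write("output.wav", wav.numpy(), samplerate=24000)
```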
## Acoustic Models ### Modeling Objectives of Acoustic Models @@ -27,14 +27,14 @@ At present, there are two mainstream acoustic model structures. - Acoustic decoder (N Frames - > N Frames).
- Sequence to sequence acoustic model: - M Tokens - > N Frames.
### Tacotron2 @@ -54,7 +54,7 @@ At present, there are two mainstream acoustic model structures. - CBHG postprocess. - Vocoder: Griffin-Lim.
**Advantage of Tacotron:** @@ -89,10 +89,10 @@ At present, there are two mainstream acoustic model structures. - The alignment matrix of previous time is considered at the step `t` of decoder.
-You can find Parakeet's tacotron2 example at `Parakeet/examples/tacotron2`. +You can find PaddleSpeech TTS's tacotron2 with LJSpeech dataset example at [examples/ljspeech/tts0](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts0). ### TransformerTTS **Disadvantages of the Tacotrons:** @@ -118,7 +118,7 @@ Transformer TTS is a combination of Tacotron2 and Transformer. - Positional Encoding.
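Because the Transformer blocks themselves are order-agnostic, the positional encoding listed above is what injects sequence order. Below is a minimal NumPy sketch of the standard sinusoidal encoding; note that Transformer TTS additionally learns a scale factor on this encoding, which the sketch omits.

```python
import numpy as np

def sinusoid_position_encoding(num_positions: int, feature_size: int) -> np.ndarray:
    """Standard sinusoidal positional encoding from "Attention Is All You Need"."""
    positions = np.arange(num_positions)[:, None]                 # (T, 1)
    dims = np.arange(feature_size)[None, :]                       # (1, D)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / feature_size)
    angles = positions * angle_rates                              # (T, D)
    encoding = np.zeros((num_positions, feature_size))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                   # even dims: sin
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dims: cos
    return encoding

# e.g. added to phoneme embeddings of shape (T, D) before the encoder
pe = sinusoid_position_encoding(num_positions=100, feature_size=256)
```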
#### Transformer TTS @@ -138,7 +138,7 @@ Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2. - Uniform scale position encoding may have a negative impact on input or output sequences.
**Disadvantages of Transformer TTS:** @@ -146,7 +146,7 @@ Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2. - The ability to perceive local information is weak, and local information is more related to pronunciation. - Stability is worse than Tacotron2. -You can find Parakeet's Transformer TTS example at `Parakeet/examples/transformer_tts`. +You can find PaddleSpeech TTS's Transformer TTS with LJSpeech dataset example at [examples/ljspeech/tts1](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts1). ### FastSpeech2 @@ -184,14 +184,14 @@ Instead of using the encoder-attention-decoder based architecture as adopted by • Can be generated in parallel (decoding time is less affected by sequence length)
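What makes the parallel generation above possible is FastSpeech's length regulator: token-level encoder outputs are expanded to frame level according to predicted phoneme durations, so the decoder can process all frames at once instead of autoregressively. A minimal NumPy sketch of that expansion (illustrative, not the toolkit's implementation):

```python
import numpy as np

def length_regulate(encoder_outputs: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Expand token-level hidden states to frame level by repeating each token
    according to its (predicted or ground-truth) duration in frames.

    encoder_outputs: (num_tokens, hidden_size)
    durations:       (num_tokens,) integer frame counts
    returns:         (sum(durations), hidden_size)
    """
    return np.repeat(encoder_outputs, durations, axis=0)

# 3 phonemes with durations of 2, 4 and 3 frames -> a 9-frame decoder input
hidden = np.random.randn(3, 256)
frames = length_regulate(hidden, np.array([2, 4, 3]))
print(frames.shape)  # (9, 256)
```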
#### FastPitch [FastPitch](https://arxiv.org/abs/2006.06873) follows FastSpeech. A single pitch value is predicted for every temporal location, which improves the overall quality of synthesized speech.
#### FastSpeech2 @@ -209,10 +209,10 @@ Instead of using the encoder-attention-decoder based architecture as adopted by FastSpeech2 is similar to FastPitch but introduces more variation information of speech.
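One detail worth making concrete: the FastPitch-style variance features used in the CSMSC recipe linked below are token-averaged, i.e. the frame-level pitch or energy contour is averaged over the frames belonging to each phoneme (using the durations) before being used as a prediction target. A hedged NumPy sketch of that averaging, with hypothetical helper names:

```python
import numpy as np

def average_by_duration(frame_values: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Average a frame-level contour (e.g. pitch or energy) over each token's frames.

    frame_values: (num_frames,) frame-level values
    durations:    (num_tokens,) frame counts per token, sum(durations) == num_frames
    returns:      (num_tokens,) token-averaged values
    """
    boundaries = np.cumsum(durations)[:-1]
    segments = np.split(frame_values, boundaries)
    return np.array([seg.mean() if len(seg) else 0.0 for seg in segments])

pitch = np.abs(np.random.randn(9))            # fake 9-frame pitch contour
durations = np.array([2, 4, 3])               # 3 phonemes
print(average_by_duration(pitch, durations))  # 3 token-level pitch values
```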
-You can find Parakeet's FastSpeech2/FastPitch example at `Parakeet/examples/fastspeech2`, We use token-averaged pitch and energy values introduced in FastPitch rather than frame level ones in FastSpeech2. +You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts3). We use token-averaged pitch and energy values introduced in FastPitch rather than frame-level ones in FastSpeech2. ### SpeedySpeech [SpeedySpeech](https://arxiv.org/abs/2008.03802) simplify the teacher-student architecture of FastSpeech and provide a fast and stable training procedure. @@ -223,10 +223,10 @@ You can find Parakeet's FastSpeech2/FastPitch example at `Parakeet/examples/fast - Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.
-You can find Parakeet's SpeedySpeech example at `Parakeet/examples/speedyspeech/baker`. +You can find PaddleSpeech TTS's SpeedySpeech with CSMSC dataset example at [examples/csmsc/tts2](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts2). ## Vocoders In speech synthesis, the main task of the vocoder is to convert the spectral parameters predicted by the acoustic model into the final speech waveform. @@ -276,7 +276,7 @@ Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Paralle - It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smalller than WaveGlow (87.9M). - It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in [Parallel WaveNet](https://arxiv.org/abs/1711.10433) and [ClariNet](https://openreview.net/pdf?id=HklY120cYm), which simplifies the training pipeline and reduces the cost of development. -You can find Parakeet's WaveFlow example at `Parakeet/examples/waveflow`. +You can find PaddleSpeech TTS's WaveFlow with LJSpeech dataset example at [examples/ljspeech/voc0](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc0). ### Parallel WaveGAN [Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN based training method. @@ -286,10 +286,10 @@ You can find Parakeet's WaveFlow example at `Parakeet/examples/waveflow`. - Use non-causal convolution instead of causal convolution. - The input is random Gaussian white noise. - The model is non-autoregressive both in training and prediction, which is fast -- Multi-resolution STFT loss. +- Multi-resolution STFT loss.
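Of the properties listed above, the multi-resolution STFT loss is the one most worth spelling out: generated and ground-truth waveforms are compared in the magnitude-spectral domain at several FFT/hop/window sizes, each resolution combining a spectral-convergence term and a log-magnitude L1 term. The sketch below uses `librosa` purely for illustration; the resolutions and weighting here are assumptions, not the exact recipe used in the example linked next.

```python
import librosa
import numpy as np

def stft_loss(pred: np.ndarray, target: np.ndarray, n_fft: int, hop: int, win: int) -> float:
    """Spectral convergence + log STFT magnitude loss at one resolution."""
    p = np.abs(librosa.stft(pred, n_fft=n_fft, hop_length=hop, win_length=win))
    t = np.abs(librosa.stft(target, n_fft=n_fft, hop_length=hop, win_length=win))
    sc = np.linalg.norm(t - p, "fro") / (np.linalg.norm(t, "fro") + 1e-7)
    log_mag = np.mean(np.abs(np.log(t + 1e-7) - np.log(p + 1e-7)))
    return float(sc + log_mag)

def multi_resolution_stft_loss(pred: np.ndarray, target: np.ndarray) -> float:
    # typical (n_fft, hop, win) triples; the per-resolution losses are averaged
    resolutions = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]
    return float(np.mean([stft_loss(pred, target, *r) for r in resolutions]))
```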
-You can find Parakeet's Parallel WaveGAN example at `Parakeet/examples/parallelwave_gan/baker`. +You can find PaddleSpeech TTS's Parallel WaveGAN with CSMSC example at [examples/csmsc/voc1](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1). diff --git a/docs/source/tts/quick_start.md b/docs/source/tts/quick_start.md new file mode 100644 index 000000000..f5d16bbfc --- /dev/null +++ b/docs/source/tts/quick_start.md @@ -0,0 +1,193 @@ +# Quick Start of Text-To-Speech +The examples in PaddleSpeech are mainly classified by datasets, the TTS datasets we mainly used are: +* CSMCS (Mandarin single speaker) +* AISHELL3 (Mandarin multiple speaker) +* LJSpeech (English single speaker) +* VCTK (English multiple speaker) + +The models in PaddleSpeech TTS have the following mapping relationship: +* tts0 - Tactron2 +* tts1 - TransformerTTS +* tts2 - SpeedySpeech +* tts3 - FastSpeech2 +* voc0 - WaveFlow +* voc1 - Parallel WaveGAN +* voc2 - MelGAN +* voc3 - MultiBand MelGAN +* vc0 - Tactron2 Voice Clone with GE2E + +## Quick Start + +Let's take a FastSpeech2 + Parallel WaveGAN with CSMSC dataset for instance. (./examples/csmsc/)(https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc) + +### Train Parallel WaveGAN with CSMSC +- Go to directory + ```bash + cd examples/csmsc/voc1 + ``` +- Source env + ```bash + source path.sh + ``` + **Must do this before you start to do anything.** + Set `MAIN_ROOT` as project dir. Using `parallelwave_gan` model as `MODEL`. + +- Main entrypoint + ```bash + bash run.sh + ``` + This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`. +### Train FastSpeech2 with CSMSC +- Go to directory + ```bash + cd examples/csmsc/tts3 + ``` +- Source env + ```bash + source path.sh + ``` + **Must do this before you start to do anything.** + Set `MAIN_ROOT` as project dir. Using `fastspeech2` model as `MODEL`. +- Main entrypoint + ```bash + bash run.sh + ``` + This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`. + +The steps in `run.sh` mainly include: +- source path. +- preprocess the dataset, +- train the model. +- synthesize waveform from metadata.jsonl. +- synthesize waveform from text file. (in acoustic models) +- inference using static model. (optional) + +For more details , you can see `README.md` in examples. + +## Pipeline of TTS +This section shows how to use pretrained models provided by TTS and make inference with them. + +Pretrained models in TTS are provided in a archive. Extract it to get a folder like this: +**Acoustic Models:** +```text +checkpoint_name +├── default.yaml +├── snapshot_iter_*.pdz +├── speech_stats.npy +├── phone_id_map.txt +├── spk_id_map.txt (optimal) +└── tone_id_map.txt (optimal) +``` +**Vocoders:** +```text +checkpoint_name +├── default.yaml +├── snapshot_iter_*.pdz +└── stats.npy +``` +- `default.yaml` stores the config used to train the model. +- `snapshot_iter_*.pdz` is the chechpoint file, where `*` is the steps it has been trained. +- `*_stats.npy` is the stats file of feature if it has been normalized before training. +- `phone_id_map.txt` is the map of phonemes to phoneme_ids. +- `tone_id_map.txt` is the map of tones to tones_ids, when you split tones and phones before training acoustic models. (for example in our csmsc/speedyspeech example) +- `spk_id_map.txt` is the map of spkeaker to spk_ids in multi-spk acoustic models. 
(for example in our aishell3/fastspeech2 example) + +The example code below shows how to use the models for prediction. +### Acoustic Models (text to spectrogram) +The code below show how to use a `FastSpeech2` model. After loading the pretrained model, use it and normalizer object to construct a prediction object,then use `fastspeech2_inferencet(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder. + +```python +from pathlib import Path +import numpy as np +import paddle +import yaml +from yacs.config import CfgNode +from parakeet.models.fastspeech2 import FastSpeech2 +from parakeet.models.fastspeech2 import FastSpeech2Inference +from parakeet.modules.normalizer import ZScore +# examples/fastspeech2/baker/frontend.py +from frontend import Frontend + +# load the pretrained model +checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4") +with open(checkpoint_dir / "phone_id_map.txt", "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] +vocab_size = len(phn_id) +with open(checkpoint_dir / "default.yaml") as f: + fastspeech2_config = CfgNode(yaml.safe_load(f)) +odim = fastspeech2_config.n_mels +model = FastSpeech2( + idim=vocab_size, odim=odim, **fastspeech2_config["model"]) +model.set_state_dict( + paddle.load(args.fastspeech2_checkpoint)["main_params"]) +model.eval() + +# load stats file +stat = np.load(checkpoint_dir / "speech_stats.npy") +mu, std = stat +mu = paddle.to_tensor(mu) +std = paddle.to_tensor(std) +fastspeech2_normalizer = ZScore(mu, std) + +# construct a prediction object +fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) + +# load Chinese Frontend +frontend = Frontend(checkpoint_dir / "phone_id_map.txt") + +# text to spectrogram +sentence = "你好吗?" +input_ids = frontend.get_input_ids(sentence, merge_sentences=True) +phone_ids = input_ids["phone_ids"] +flags = 0 +# The output of Chinese text frontend is segmented +for part_phone_ids in phone_ids: + with paddle.no_grad(): + temp_mel = fastspeech2_inference(part_phone_ids) + if flags == 0: + mel = temp_mel + flags = 1 + else: + mel = paddle.concat([mel, temp_mel]) +``` + +### Vocoder (spectrogram to wave) +The code below show how to use a ` Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and normalizer object to construct a prediction object,then use `pwg_inference(mel)` to generate raw audio (in wav format). 
+ +```python +from pathlib import Path +import numpy as np +import paddle +import soundfile as sf +import yaml +from yacs.config import CfgNode +from parakeet.models.parallel_wavegan import PWGGenerator +from parakeet.models.parallel_wavegan import PWGInference +from parakeet.modules.normalizer import ZScore + +# load the pretrained model +checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4") +with open(checkpoint_dir / "pwg_default.yaml") as f: + pwg_config = CfgNode(yaml.safe_load(f)) +vocoder = PWGGenerator(**pwg_config["generator_params"]) +vocoder.set_state_dict(paddle.load(args.pwg_params)) +vocoder.remove_weight_norm() +vocoder.eval() + +# load stats file +stat = np.load(checkpoint_dir / "pwg_stats.npy") +mu, std = stat +mu = paddle.to_tensor(mu) +std = paddle.to_tensor(std) +pwg_normalizer = ZScore(mu, std) + +# construct a prediction object +pwg_inference = PWGInference(pwg_normalizer, vocoder) + +# spectrogram to wave +wav = pwg_inference(mel) +sf.write( + audio_path, + wav.numpy(), + samplerate=fastspeech2_config.fs) +``` diff --git a/docs/source/tts/cn_text_frontend.md b/docs/source/tts/zh_text_frontend.md similarity index 96% rename from docs/source/tts/cn_text_frontend.md rename to docs/source/tts/zh_text_frontend.md index 8a0251792..26255ad31 100644 --- a/docs/source/tts/cn_text_frontend.md +++ b/docs/source/tts/zh_text_frontend.md @@ -1,5 +1,5 @@ # Chinese Rule Based Text Frontend -TTS system mainly includes three modules: `text frontend`, `Acoustic model` and `Vocoder`. We provide a complete Chinese text frontend module in Parakeet, see exapmle in `Parakeet/examples/text_frontend/`. +A TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We provide a complete Chinese text frontend module in PaddleSpeech TTS, see exapmle in [examples/other/text_frontend/](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/text_frontend). A text frontend module mainly includes: - Text Segmentation diff --git a/examples/other/text_frontend/README.md b/examples/other/text_frontend/README.md index d5c6dd4d3..0bf6e72dc 100644 --- a/examples/other/text_frontend/README.md +++ b/examples/other/text_frontend/README.md @@ -20,5 +20,19 @@ Run the command below to get the results of test. ./run.sh ``` The `avg WER` of g2p is: 0.027495061517943988 +```text + ,--------------------------------------------------------------------. + | | # Snt # Wrd | Corr Sub Del Ins Err S.Err | + |--------+-----------------+-----------------------------------------| + | Sum/Avg| 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 | + `--------------------------------------------------------------------' +``` The `avg CER` of text normalization is: 0.006388318503308237 +```text + ,-----------------------------------------------------------------. 
+ | | # Snt # Wrd | Corr Sub Del Ins Err S.Err | + |--------+--------------+-----------------------------------------| + | Sum/Avg| 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 | + `-----------------------------------------------------------------' +``` diff --git a/text_processing/.gitignore b/text_processing/.gitignore new file mode 100644 index 000000000..e400141b2 --- /dev/null +++ b/text_processing/.gitignore @@ -0,0 +1,7 @@ +data +glove +.pyc +checkpoints +epoch +__pycache__ +glove.840B.300d.zip diff --git a/text_processing/README.md b/text_processing/README.md new file mode 100644 index 000000000..294af01d1 --- /dev/null +++ b/text_processing/README.md @@ -0,0 +1,25 @@ +# PaddleSpeechTask +A speech library to deal with a series of related front-end and back-end tasks + +## 环境 +- python==3.6.13 +- paddle==2.1.1 + +## 中/英文文本加标点任务 punctuation restoration: + +### 数据集: data +- 中文数据来源:data/chinese +1.iwlst2012zh +2.平凡的世界 + +- 英文数据来源: data/english +1.iwlst2012en + +- iwlst2012数据获取过程见data/README.md + +### 模型:speechtask/punctuation_restoration/model +1.BLSTM模型 + +2.BertLinear模型 + +3.BertBLSTM模型 diff --git a/text_processing/examples/punctuation_restoration/chinese/README.md b/text_processing/examples/punctuation_restoration/chinese/README.md new file mode 100644 index 000000000..1fcd954ca --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/README.md @@ -0,0 +1,35 @@ +# 中文实验例程 +## 测试数据: +- IWLST2012中文:test2012 + +## 运行代码 +- 运行 `run.sh 0 0 conf/train_conf/bertBLSTM_zh.yaml 1 conf/data_conf/chinese.yaml ` + +## 实验结果: +- BertLinear + - 实验配置:conf/train_conf/bertLinear_zh.yaml + - 测试结果 + + | | COMMA | PERIOD | QUESTION | OVERALL | + |-----------|-----------|-----------|-----------|--------- | + |Precision | 0.425665 | 0.335190 | 0.698113 | 0.486323 | + |Recall | 0.511278 | 0.572108 | 0.787234 | 0.623540 | + |F1 | 0.464560 | 0.422717 | 0.740000 | 0.542426 | + +- BertBLSTM + - 实验配置:conf/train_conf/bertBLSTM_zh.yaml + - 测试结果 avg_1 + + | | COMMA | PERIOD | QUESTION | OVERALL | + |-----------|-----------|-----------|-----------|--------- | + |Precision | 0.469484 | 0.550604 | 0.801887 | 0.607325 | + |Recall | 0.580271 | 0.592408 | 0.817308 | 0.663329 | + |F1 | 0.519031 | 0.570741 | 0.809524 | 0.633099 | + + - BertBLSTM/avg_1测试标贝合成数据 + + | | COMMA | PERIOD | QUESTION | OVERALL | + |-----------|-----------|-----------|-----------|--------- | + |Precision | 0.217192 | 0.196339 | 0.820717 | 0.411416 | + |Recall | 0.205922 | 0.892531 | 0.416162 | 0.504872 | + |F1 | 0.211407 | 0.321873 | 0.552279 | 0.361853 | diff --git a/text_processing/examples/punctuation_restoration/chinese/conf/blstm.yaml b/text_processing/examples/punctuation_restoration/chinese/conf/blstm.yaml new file mode 100644 index 000000000..9b1a2e010 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/conf/blstm.yaml @@ -0,0 +1,34 @@ +data: + language: chinese + raw_path: /data4/mahaoxin/PaddleSpeechTask/data/chinese/PFDSJ #path to raw dataset + raw_train_file: train + raw_dev_file: dev + raw_test_file: test + vocab_file: vocab + punc_file: punc_vocab + save_path: data/PFDSJ #path to save dataset + seq_len: 100 + batch_size: 10 + sortagrad: True + shuffle_method: batch_shuffle + num_workers: 0 + +model_type: blstm +model_params: + vocab_size: 3751 + embedding_size: 200 + hidden_size: 100 + num_layers: 3 + num_class: 5 + init_scale: 0.1 + +training: + n_epoch: 32 + lr: !!float 1e-4 + lr_decay: 1.0 + weight_decay: !!float 1e-06 + global_grad_clip: 5.0 + log_interval: 10 + + + diff --git 
a/text_processing/examples/punctuation_restoration/chinese/conf/data_conf/chinese.yaml b/text_processing/examples/punctuation_restoration/chinese/conf/data_conf/chinese.yaml new file mode 100644 index 000000000..191bfd3e6 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/conf/data_conf/chinese.yaml @@ -0,0 +1,7 @@ +type: chinese +raw_path: /data4/mahaoxin/PaddleSpeechTask/data/chinese/iwslt2012_zh #path to raw dataset +raw_train_file: iwslt2012_train_zh +raw_dev_file: iwslt2010_dev_zh +raw_test_file: biaobei_asr +punc_file: punc_vocab +save_path: data/iwslt2012_zh #path to save dataset \ No newline at end of file diff --git a/text_processing/examples/punctuation_restoration/chinese/conf/train_conf/bertBLSTM_zh.yaml b/text_processing/examples/punctuation_restoration/chinese/conf/train_conf/bertBLSTM_zh.yaml new file mode 100644 index 000000000..d1f58aac1 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/conf/train_conf/bertBLSTM_zh.yaml @@ -0,0 +1,49 @@ +data: + dataset_type: Bert + train_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/train + dev_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/dev + test_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/test2012_revise + data_params: + pretrained_token: bert-base-chinese + punc_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/punc_vocab + seq_len: 100 + batch_size: 64 + sortagrad: True + shuffle_method: batch_shuffle + num_workers: 0 + +checkpoint: + kbest_n: 5 + latest_n: 10 + metric_type: F1 + + +model_type: BertBLSTM +model_params: + pretrained_token: bert-base-chinese + output_size: 4 + dropout: 0.0 + bert_size: 768 + blstm_size: 128 + num_blstm_layers: 2 + init_scale: 0.1 + +# model_type: BertChLinear +# model_params: bert-base-chinese +# pretrained_token: +# output_size: 4 +# dropout: 0.0 +# bert_size: 768 + +training: + n_epoch: 100 + lr: !!float 1e-5 + lr_decay: 1.0 + weight_decay: !!float 1e-06 + global_grad_clip: 5.0 + log_interval: 10 + log_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/log/bertBLSTM_zh0812.log + +testing: + log_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/log/test_bertBLSTM_zh0812.log + diff --git a/text_processing/examples/punctuation_restoration/chinese/conf/train_conf/bertLinear_zh.yaml b/text_processing/examples/punctuation_restoration/chinese/conf/train_conf/bertLinear_zh.yaml new file mode 100644 index 000000000..c422e840e --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/conf/train_conf/bertLinear_zh.yaml @@ -0,0 +1,42 @@ +data: + dataset_type: Bert + train_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/train + dev_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/dev + test_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/test2012 + data_params: + pretrained_token: bert-base-chinese + punc_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/data/iwslt2012_zh/punc_vocab + seq_len: 100 + batch_size: 32 + sortagrad: True + shuffle_method: batch_shuffle + num_workers: 0 + +checkpoint: + kbest_n: 10 + latest_n: 10 + metric_type: F1 + + +model_type: BertLinear +model_params: + pretrained_token: 
bert-base-uncased + output_size: 4 + dropout: 0.2 + bert_size: 768 + hiddensize: 1568 + + +training: + n_epoch: 50 + lr: !!float 1e-5 + lr_decay: 1.0 + weight_decay: !!float 1e-06 + global_grad_clip: 5.0 + log_interval: 10 + log_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/log/train_linear0812.log + +testing: + log_interval: 10 + log_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/chinese/log/test_linear0812.log + diff --git a/text_processing/examples/punctuation_restoration/chinese/local/avg.sh b/text_processing/examples/punctuation_restoration/chinese/local/avg.sh new file mode 100644 index 000000000..b8c14c662 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/local/avg.sh @@ -0,0 +1,23 @@ +#! /usr/bin/env bash + +if [ $# != 2 ]; then + echo "usage: ${0} ckpt_dir avg_num" + exit -1 +fi + +ckpt_dir=${1} +average_num=${2} +decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams + +python3 -u ${BIN_DIR}/avg_model.py \ +--dst_model ${decode_checkpoint} \ +--ckpt_dir ${ckpt_dir} \ +--num ${average_num} \ +--val_best + +if [ $? -ne 0 ]; then + echo "Failed in avg ckpt!" + exit 1 +fi + +exit 0 \ No newline at end of file diff --git a/text_processing/examples/punctuation_restoration/chinese/local/data.sh b/text_processing/examples/punctuation_restoration/chinese/local/data.sh new file mode 100644 index 000000000..aff7203cc --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/local/data.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +if [ $# != 1 ];then + echo "usage: ${0} data_pre_conf" + echo $1 + exit -1 +fi + +data_pre_conf=$1 + +python3 -u ${BIN_DIR}/pre_data.py \ +--config ${data_pre_conf} + +if [ $? -ne 0 ]; then + echo "Failed in training!" + exit 1 +fi + +exit 0 diff --git a/text_processing/examples/punctuation_restoration/chinese/local/test.sh b/text_processing/examples/punctuation_restoration/chinese/local/test.sh new file mode 100644 index 000000000..6db75ca2a --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/local/test.sh @@ -0,0 +1,32 @@ + +#!/bin/bash + +if [ $# != 2 ];then + echo "usage: ${0} config_path ckpt_path_prefix" + exit -1 +fi + +ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}') +echo "using $ngpu gpus..." + +device=gpu +if [ ${ngpu} == 0 ];then + device=cpu +fi +config_path=$1 +ckpt_prefix=$2 + + +python3 -u ${BIN_DIR}/test.py \ +--device ${device} \ +--nproc 1 \ +--config ${config_path} \ +--result_file ${ckpt_prefix}.rsl \ +--checkpoint_path ${ckpt_prefix} + +if [ $? -ne 0 ]; then + echo "Failed in evaluation!" + exit 1 +fi + +exit 0 diff --git a/text_processing/examples/punctuation_restoration/chinese/local/train.sh b/text_processing/examples/punctuation_restoration/chinese/local/train.sh new file mode 100644 index 000000000..f6bd2c983 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/local/train.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +if [ $# != 2 ];then + echo "usage: CUDA_VISIBLE_DEVICES=0 ${0} config_path ckpt_name" + exit -1 +fi + +ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}') +echo "using $ngpu gpus..." + +config_path=$1 +ckpt_name=$2 + +device=gpu +if [ ${ngpu} == 0 ];then + device=cpu +fi + +mkdir -p exp + +python3 -u ${BIN_DIR}/train.py \ +--device ${device} \ +--nproc ${ngpu} \ +--config ${config_path} \ +--output exp/${ckpt_name} + +if [ $? -ne 0 ]; then + echo "Failed in training!" 
+ exit 1 +fi + +exit 0 diff --git a/text_processing/examples/punctuation_restoration/chinese/path.sh b/text_processing/examples/punctuation_restoration/chinese/path.sh new file mode 100644 index 000000000..8154cc78f --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/path.sh @@ -0,0 +1,13 @@ +export MAIN_ROOT=${PWD}/../../../ + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/ + + +export BIN_DIR=${MAIN_ROOT}/speechtask/punctuation_restoration/bin diff --git a/text_processing/examples/punctuation_restoration/chinese/run.sh b/text_processing/examples/punctuation_restoration/chinese/run.sh new file mode 100644 index 000000000..bb3d25d4b --- /dev/null +++ b/text_processing/examples/punctuation_restoration/chinese/run.sh @@ -0,0 +1,47 @@ +#!/bin/bash +set -e +source path.sh + + +## stage, gpu, data_pre_config, train_config, avg_num +if [ $# -lt 4 ]; then + echo "usage: bash ./run.sh stage gpu train_config avg_num data_config" + echo "eg: bash ./run.sh 0 0 train_config 1 data_config " + exit -1 +fi + +stage=$1 +stop_stage=100 +gpus=$2 +conf_path=$3 +avg_num=$4 +avg_ckpt=avg_${avg_num} +ckpt=$(basename ${conf_path} | awk -F'.' '{print $1}') +echo "checkpoint name ${ckpt}" + +if [ $stage -le 0 ]; then + if [ $# -eq 5 ]; then + data_pre_conf=$5 + # prepare data + bash ./local/data.sh ${data_pre_conf} || exit -1 + else + echo "data_pre_conf is not exist!" + exit -1 + fi +fi + + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `exp` dir + CUDA_VISIBLE_DEVICES=${gpus} bash ./local/train.sh ${conf_path} ${ckpt} +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # avg n best model + bash ./local/avg.sh exp/${ckpt}/checkpoints ${avg_num} +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # test ckpt avg_n + CUDA_VISIBLE_DEVICES=${gpus} bash ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1 +fi diff --git a/text_processing/examples/punctuation_restoration/english/README.md b/text_processing/examples/punctuation_restoration/english/README.md new file mode 100644 index 000000000..7955bb7d5 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/README.md @@ -0,0 +1,23 @@ +# 英文实验例程 +## 测试数据: +- IWLST2012英文:test2011 + +## 运行代码 +- 运行 `run.sh 0 0 conf/train_conf/bertBLSTM_base_en.yaml 1 conf/data_conf/english.yaml ` + + +## 相关论文实验结果: +> * Nagy, Attila, Bence Bial, and Judit Ács. "Automatic punctuation restoration with BERT models." 
arXiv preprint arXiv:2101.07343 (2021)* +> + + +## 实验结果: +- BertBLSTM + - 实验配置:conf/train_conf/bertLinear_en.yaml + - 测试结果:exp/bertLinear_enRe/checkpoints/3.pdparams + + | | COMMA | PERIOD | QUESTION | OVERALL | + |-----------|-----------|-----------|-----------|--------- | + |Precision |0.667910 |0.715778 |0.822222 |0.735304 | + |Recall |0.755274 |0.868188 |0.804348 |0.809270 | + |F1 |0.708911 |0.784651 |0.813187 |0.768916 | diff --git a/text_processing/examples/punctuation_restoration/english/conf/data_conf/english.yaml b/text_processing/examples/punctuation_restoration/english/conf/data_conf/english.yaml new file mode 100644 index 000000000..44834f28c --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/conf/data_conf/english.yaml @@ -0,0 +1,7 @@ +type: english +raw_path: /data4/mahaoxin/PaddleSpeechTask/data/english/iwslt2012_en #path to raw dataset +raw_train_file: iwslt2012_train_en +raw_dev_file: iwslt2010_dev_en +raw_test_file: iwslt2011_test_en +punc_file: punc_vocab +save_path: data/iwslt2012_en #path to save dataset \ No newline at end of file diff --git a/text_processing/examples/punctuation_restoration/english/conf/train_conf/bertBLSTM_base_en.yaml b/text_processing/examples/punctuation_restoration/english/conf/train_conf/bertBLSTM_base_en.yaml new file mode 100644 index 000000000..7f4383d48 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/conf/train_conf/bertBLSTM_base_en.yaml @@ -0,0 +1,47 @@ +data: + dataset_type: Bert + train_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/train + dev_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/dev + test_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/test2011 + data_params: + pretrained_token: bert-base-uncased #english + punc_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/punc_vocab + seq_len: 50 + batch_size: 32 + sortagrad: True + shuffle_method: batch_shuffle + num_workers: 0 + +checkpoint: + kbest_n: 10 + latest_n: 10 + +model_type: BertBLSTM +model_params: + pretrained_token: bert-base-uncased + output_size: 4 + dropout: 0.0 + bert_size: 768 + blstm_size: 128 + num_blstm_layers: 2 + init_scale: 0.2 +# model_type: BertChLinear +# model_params: +# pretrained_token: bert-large-uncased +# output_size: 4 +# dropout: 0.0 +# bert_size: 768 + +training: + n_epoch: 100 + lr: !!float 1e-5 + lr_decay: 1.0 + weight_decay: !!float 1e-06 + global_grad_clip: 5.0 + log_interval: 10 + log_path: log/bertBLSTM_base0812.log + +testing: + log_path: log/testbertBLSTM_base0812.log + + diff --git a/text_processing/examples/punctuation_restoration/english/conf/train_conf/bertLinear_en.yaml b/text_processing/examples/punctuation_restoration/english/conf/train_conf/bertLinear_en.yaml new file mode 100644 index 000000000..8cac98894 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/conf/train_conf/bertLinear_en.yaml @@ -0,0 +1,39 @@ +data: + dataset_type: Bert + train_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/train + dev_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/dev + test_path: /data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/test2011 + data_params: + pretrained_token: bert-base-uncased #english + punc_path: 
/data4/mahaoxin/PaddleSpeechTask/examples/punctuation_restoration/english/data/iwslt2012_en/punc_vocab + seq_len: 100 + batch_size: 32 + sortagrad: True + shuffle_method: batch_shuffle + num_workers: 0 + +checkpoint: + kbest_n: 10 + latest_n: 10 + +model_type: BertLinear +model_params: + pretrained_token: bert-base-uncased + output_size: 4 + dropout: 0.2 + bert_size: 768 + hiddensize: 1568 + +training: + n_epoch: 20 + lr: !!float 1e-5 + lr_decay: 1.0 + weight_decay: !!float 1e-06 + global_grad_clip: 3.0 + log_interval: 10 + log_path: log/train_linear0820.log + +testing: + log_path: log/test2011_linear0820.log + + diff --git a/text_processing/examples/punctuation_restoration/english/local/avg.sh b/text_processing/examples/punctuation_restoration/english/local/avg.sh new file mode 100644 index 000000000..b8c14c662 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/local/avg.sh @@ -0,0 +1,23 @@ +#! /usr/bin/env bash + +if [ $# != 2 ]; then + echo "usage: ${0} ckpt_dir avg_num" + exit -1 +fi + +ckpt_dir=${1} +average_num=${2} +decode_checkpoint=${ckpt_dir}/avg_${average_num}.pdparams + +python3 -u ${BIN_DIR}/avg_model.py \ +--dst_model ${decode_checkpoint} \ +--ckpt_dir ${ckpt_dir} \ +--num ${average_num} \ +--val_best + +if [ $? -ne 0 ]; then + echo "Failed in avg ckpt!" + exit 1 +fi + +exit 0 \ No newline at end of file diff --git a/text_processing/examples/punctuation_restoration/english/local/data.sh b/text_processing/examples/punctuation_restoration/english/local/data.sh new file mode 100644 index 000000000..1b0c62b17 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/local/data.sh @@ -0,0 +1,18 @@ +#!/bin/bash + +if [ $# != 1 ];then + echo "usage: ${0} config_path" + exit -1 +fi + +config_path=$1 + +python3 -u ${BIN_DIR}/pre_data.py \ +--config ${config_path} + +if [ $? -ne 0 ]; then + echo "Failed in training!" + exit 1 +fi + +exit 0 diff --git a/text_processing/examples/punctuation_restoration/english/local/test.sh b/text_processing/examples/punctuation_restoration/english/local/test.sh new file mode 100644 index 000000000..6db75ca2a --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/local/test.sh @@ -0,0 +1,32 @@ + +#!/bin/bash + +if [ $# != 2 ];then + echo "usage: ${0} config_path ckpt_path_prefix" + exit -1 +fi + +ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}') +echo "using $ngpu gpus..." + +device=gpu +if [ ${ngpu} == 0 ];then + device=cpu +fi +config_path=$1 +ckpt_prefix=$2 + + +python3 -u ${BIN_DIR}/test.py \ +--device ${device} \ +--nproc 1 \ +--config ${config_path} \ +--result_file ${ckpt_prefix}.rsl \ +--checkpoint_path ${ckpt_prefix} + +if [ $? -ne 0 ]; then + echo "Failed in evaluation!" + exit 1 +fi + +exit 0 diff --git a/text_processing/examples/punctuation_restoration/english/local/train.sh b/text_processing/examples/punctuation_restoration/english/local/train.sh new file mode 100644 index 000000000..f6bd2c983 --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/local/train.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +if [ $# != 2 ];then + echo "usage: CUDA_VISIBLE_DEVICES=0 ${0} config_path ckpt_name" + exit -1 +fi + +ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}') +echo "using $ngpu gpus..." + +config_path=$1 +ckpt_name=$2 + +device=gpu +if [ ${ngpu} == 0 ];then + device=cpu +fi + +mkdir -p exp + +python3 -u ${BIN_DIR}/train.py \ +--device ${device} \ +--nproc ${ngpu} \ +--config ${config_path} \ +--output exp/${ckpt_name} + +if [ $? 
-ne 0 ]; then + echo "Failed in training!" + exit 1 +fi + +exit 0 diff --git a/text_processing/examples/punctuation_restoration/english/path.sh b/text_processing/examples/punctuation_restoration/english/path.sh new file mode 100644 index 000000000..8154cc78f --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/path.sh @@ -0,0 +1,13 @@ +export MAIN_ROOT=${PWD}/../../../ + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/ + + +export BIN_DIR=${MAIN_ROOT}/speechtask/punctuation_restoration/bin diff --git a/text_processing/examples/punctuation_restoration/english/run.sh b/text_processing/examples/punctuation_restoration/english/run.sh new file mode 100644 index 000000000..bb3d25d4b --- /dev/null +++ b/text_processing/examples/punctuation_restoration/english/run.sh @@ -0,0 +1,47 @@ +#!/bin/bash +set -e +source path.sh + + +## stage, gpu, data_pre_config, train_config, avg_num +if [ $# -lt 4 ]; then + echo "usage: bash ./run.sh stage gpu train_config avg_num data_config" + echo "eg: bash ./run.sh 0 0 train_config 1 data_config " + exit -1 +fi + +stage=$1 +stop_stage=100 +gpus=$2 +conf_path=$3 +avg_num=$4 +avg_ckpt=avg_${avg_num} +ckpt=$(basename ${conf_path} | awk -F'.' '{print $1}') +echo "checkpoint name ${ckpt}" + +if [ $stage -le 0 ]; then + if [ $# -eq 5 ]; then + data_pre_conf=$5 + # prepare data + bash ./local/data.sh ${data_pre_conf} || exit -1 + else + echo "data_pre_conf is not exist!" + exit -1 + fi +fi + + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `exp` dir + CUDA_VISIBLE_DEVICES=${gpus} bash ./local/train.sh ${conf_path} ${ckpt} +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # avg n best model + bash ./local/avg.sh exp/${ckpt}/checkpoints ${avg_num} +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # test ckpt avg_n + CUDA_VISIBLE_DEVICES=${gpus} bash ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1 +fi diff --git a/text_processing/requirements.txt b/text_processing/requirements.txt new file mode 100644 index 000000000..685ab029e --- /dev/null +++ b/text_processing/requirements.txt @@ -0,0 +1,6 @@ +numpy +pyyaml +tensorboardX +tqdm +ujson +yacs diff --git a/text_processing/speechtask/punctuation_restoration/bin/avg_model.py b/text_processing/speechtask/punctuation_restoration/bin/avg_model.py new file mode 100644 index 000000000..a012e2581 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/bin/avg_model.py @@ -0,0 +1,112 @@ +#!/usr/bin/env python3 +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse +import glob +import json +import os + +import numpy as np +import paddle + + +def main(args): + paddle.set_device('cpu') + + val_scores = [] + beat_val_scores = [] + selected_epochs = [] + if args.val_best: + jsons = glob.glob(f'{args.ckpt_dir}/[!train]*.json') + for y in jsons: + with open(y, 'r') as f: + dic_json = json.load(f) + loss = dic_json['F1'] + epoch = dic_json['epoch'] + if epoch >= args.min_epoch and epoch <= args.max_epoch: + val_scores.append((epoch, loss)) + + val_scores = np.array(val_scores) + sort_idx = np.argsort(val_scores[:, 1]) + sorted_val_scores = val_scores[sort_idx] + path_list = [ + args.ckpt_dir + '/{}.pdparams'.format(int(epoch)) + for epoch in sorted_val_scores[:args.num, 0] + ] + + beat_val_scores = sorted_val_scores[:args.num, 1] + selected_epochs = sorted_val_scores[:args.num, 0].astype(np.int64) + print("best val scores = " + str(beat_val_scores)) + print("selected epochs = " + str(selected_epochs)) + else: + path_list = glob.glob(f'{args.ckpt_dir}/[!avg][!final]*.pdparams') + path_list = sorted(path_list, key=os.path.getmtime) + path_list = path_list[-args.num:] + + print(path_list) + + avg = None + num = args.num + assert num == len(path_list) + for path in path_list: + print(f'Processing {path}') + states = paddle.load(path) + if avg is None: + avg = states + else: + for k in avg.keys(): + avg[k] += states[k] + # average + for k in avg.keys(): + if avg[k] is not None: + avg[k] /= num + + paddle.save(avg, args.dst_model) + print(f'Saving to {args.dst_model}') + + meta_path = os.path.splitext(args.dst_model)[0] + '.avg.json' + with open(meta_path, 'w') as f: + data = json.dumps({ + "avg_ckpt": args.dst_model, + "ckpt": path_list, + "epoch": selected_epochs.tolist(), + "val_loss": beat_val_scores.tolist(), + }) + f.write(data + "\n") + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description='average model') + parser.add_argument('--dst_model', required=True, help='averaged model') + parser.add_argument( + '--ckpt_dir', required=True, help='ckpt model dir for average') + parser.add_argument( + '--val_best', action="store_true", help='averaged model') + parser.add_argument( + '--num', default=5, type=int, help='nums for averaged model') + parser.add_argument( + '--min_epoch', + default=0, + type=int, + help='min epoch used for averaging model') + parser.add_argument( + '--max_epoch', + default=65536, # Big enough + type=int, + help='max epoch used for averaging model') + + args = parser.parse_args() + print(args) + + main(args) diff --git a/text_processing/speechtask/punctuation_restoration/bin/pre_data.py b/text_processing/speechtask/punctuation_restoration/bin/pre_data.py new file mode 100644 index 000000000..a074d7e3d --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/bin/pre_data.py @@ -0,0 +1,48 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Data preparation for punctuation_restoration task.""" +import yaml +from speechtask.punctuation_restoration.utils.default_parser import default_argument_parser +from speechtask.punctuation_restoration.utils.punct_pre import process_chinese_pure_senetence +from speechtask.punctuation_restoration.utils.punct_pre import process_english_pure_senetence +from speechtask.punctuation_restoration.utils.utility import print_arguments + + +# create dataset from raw data files +def main(config, args): + print("Start preparing data from raw data.") + if (config['type'] == 'chinese'): + process_chinese_pure_senetence(config) + elif (config['type'] == 'english'): + print('english!!!!') + process_english_pure_senetence(config) + else: + print('Error: Type should be chinese or english!!!!') + raise ValueError('Type should be chinese or english') + + print("Finish preparing data.") + + +if __name__ == "__main__": + parser = default_argument_parser() + args = parser.parse_args() + print_arguments(args, globals()) + + # https://yaml.org/type/float.html + with open(args.config, "r") as f: + config = yaml.load(f, Loader=yaml.FullLoader) + + # config.freeze() + print(config) + main(config, args) diff --git a/text_processing/speechtask/punctuation_restoration/bin/test.py b/text_processing/speechtask/punctuation_restoration/bin/test.py new file mode 100644 index 000000000..17892fdb7 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/bin/test.py @@ -0,0 +1,45 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Evaluation for model.""" +import yaml +from speechtask.punctuation_restoration.training.trainer import Tester +from speechtask.punctuation_restoration.utils.default_parser import default_argument_parser +from speechtask.punctuation_restoration.utils.utility import print_arguments + + +def main_sp(config, args): + exp = Tester(config, args) + exp.setup() + exp.run_test() + + +def main(config, args): + main_sp(config, args) + + +if __name__ == "__main__": + parser = default_argument_parser() + args = parser.parse_args() + print_arguments(args, globals()) + + # https://yaml.org/type/float.html + with open(args.config, "r") as f: + config = yaml.load(f, Loader=yaml.FullLoader) + + print(config) + if args.dump_config: + with open(args.dump_config, 'w') as f: + print(config, file=f) + + main(config, args) diff --git a/text_processing/speechtask/punctuation_restoration/bin/train.py b/text_processing/speechtask/punctuation_restoration/bin/train.py new file mode 100644 index 000000000..1ffd79b7b --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/bin/train.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Trainer for punctuation_restoration task.""" +import yaml +from paddle import distributed as dist +from speechtask.punctuation_restoration.training.trainer import Trainer +from speechtask.punctuation_restoration.utils.default_parser import default_argument_parser +from speechtask.punctuation_restoration.utils.utility import print_arguments + + +def main_sp(config, args): + exp = Trainer(config, args) + exp.setup() + exp.run() + + +def main(config, args): + if args.device == "gpu" and args.nprocs > 1: + dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs) + else: + main_sp(config, args) + + +if __name__ == "__main__": + parser = default_argument_parser() + args = parser.parse_args() + print_arguments(args, globals()) + + # https://yaml.org/type/float.html + with open(args.config, "r") as f: + config = yaml.load(f, Loader=yaml.FullLoader) + + print(config) + if args.dump_config: + with open(args.dump_config, 'w') as f: + print(config, file=f) + + main(config, args) diff --git a/text_processing/speechtask/punctuation_restoration/io/__init__.py b/text_processing/speechtask/punctuation_restoration/io/__init__.py new file mode 100644 index 000000000..185a92b8d --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/io/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/text_processing/speechtask/punctuation_restoration/io/collator.py b/text_processing/speechtask/punctuation_restoration/io/collator.py new file mode 100644 index 000000000..5b63b5847 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/io/collator.py @@ -0,0 +1,64 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import numpy as np + +__all__ = ["TextCollator"] + + +class TextCollator(): + def __init__(self, padding_value): + self.padding_value = padding_value + + def __call__(self, batch): + """batch examples + Args: + batch ([List]): batch is (text, punctuation) + text (List[int] ) shape (batch, L) + punctuation (List[int] or str): shape (batch, L) + Returns: + tuple(text, punctuation): batched data. + text : (B, Lmax) + punctuation : (B, Lmax) + """ + texts = [] + punctuations = [] + for text, punctuation in batch: + + texts.append(text) + punctuations.append(punctuation) + + #[B, T, D] + x_pad = self.pad_sequence(texts).astype(np.int64) + # print(x_pad.shape) + # pad_list(audios, 0.0).astype(np.float32) + # ilens = np.array(audio_lens).astype(np.int64) + y_pad = self.pad_sequence(punctuations).astype(np.int64) + # print(y_pad.shape) + # olens = np.array(text_lens).astype(np.int64) + return x_pad, y_pad + + def pad_sequence(self, sequences): + # assuming trailing dimensions and type of all the Tensors + # in sequences are same and fetching those from sequences[0] + max_len = max([len(s) for s in sequences]) + out_dims = (len(sequences), max_len) + + out_tensor = np.full(out_dims, + self.padding_value) #, dtype=sequences[0].dtype) + for i, tensor in enumerate(sequences): + length = len(tensor) + # use index notation to prevent duplicate references to the tensor + out_tensor[i, :length] = tensor + + return out_tensor diff --git a/text_processing/speechtask/punctuation_restoration/io/common.py b/text_processing/speechtask/punctuation_restoration/io/common.py new file mode 100644 index 000000000..3ed4a6041 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/io/common.py @@ -0,0 +1,55 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import codecs +import re +import unicodedata + +import ujson + +PAD = "" +UNK = "" +NUM = "" +END = "" +SPACE = "_SPACE" + + +def write_json(filename, dataset): + with codecs.open(filename, mode="w", encoding="utf-8") as f: + ujson.dump(dataset, f) + + +def word_convert(word, keep_number=True, lowercase=True): + if not keep_number: + if is_digit(word): + word = NUM + if lowercase: + word = word.lower() + return word + + +def is_digit(word): + try: + float(word) + return True + except ValueError: + pass + try: + unicodedata.numeric(word) + return True + except (TypeError, ValueError): + pass + result = re.compile(r'^[-+]?[0-9]+,[0-9]+$').match(word) + if result: + return True + return False diff --git a/text_processing/speechtask/punctuation_restoration/io/dataset.py b/text_processing/speechtask/punctuation_restoration/io/dataset.py new file mode 100644 index 000000000..17c13c387 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/io/dataset.py @@ -0,0 +1,310 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import random + +import numpy as np +import paddle +from paddle.io import Dataset +from paddlenlp.transformers import BertTokenizer +# from speechtask.punctuation_restoration.utils.punct_prepro import load_dataset + +__all__ = ["PuncDataset", "PuncDatasetFromBertTokenizer"] + + +class PuncDataset(Dataset): + """Representing a Dataset + superclass + ---------- + data.Dataset : + Dataset is a abstract class, representing the real data. + """ + + def __init__(self, train_path, vocab_path, punc_path, seq_len=100): + # 检查文件是否存在 + print(train_path) + print(vocab_path) + assert os.path.exists(train_path), "train文件不存在" + assert os.path.exists(vocab_path), "词典文件不存在" + assert os.path.exists(punc_path), "标点文件不存在" + self.seq_len = seq_len + + self.word2id = self.load_vocab( + vocab_path, extra_word_list=['', '']) + self.id2word = {v: k for k, v in self.word2id.items()} + self.punc2id = self.load_vocab(punc_path, extra_word_list=[" "]) + self.id2punc = {k: v for (v, k) in self.punc2id.items()} + + tmp_seqs = open(train_path, encoding='utf-8').readlines() + self.txt_seqs = [i for seq in tmp_seqs for i in seq.split()] + # print(self.txt_seqs[:10]) + # with open('./txt_seq', 'w', encoding='utf-8') as w: + # print(self.txt_seqs, file=w) + self.preprocess(self.txt_seqs) + print('---punc-') + print(self.punc2id) + + def __len__(self): + """return the sentence nums in .txt + """ + return self.in_len + + def __getitem__(self, index): + """返回指定索引的张量对 (输入文本id的序列 , 其对应的标点id序列) + Parameters + ---------- + index : int 索引 + """ + return self.input_data[index], self.label[index] + + def load_vocab(self, vocab_path, extra_word_list=[], encoding='utf-8'): + n = len(extra_word_list) + with open(vocab_path, encoding='utf-8') as vf: + vocab = {word.strip(): i + n for i, word in enumerate(vf)} + for i, word in enumerate(extra_word_list): + vocab[word] = i + return vocab + + def preprocess(self, txt_seqs: list): + """将文本转为单词和应预测标点的id pair + Parameters + ---------- + txt : 文本 + 文本每个单词跟随一个空格,符号也跟一个空格 + """ + input_data = [] + label = [] + input_r = [] + label_r = [] + # txt_seqs is a list like: ['char', 'char', 'char', '*,*', 'char', ......] 
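+ # Walk through the token stream: punctuation tokens are skipped as inputs;
+ # every ordinary token is mapped to its vocab id and labelled with the id of
+ # the punctuation that follows it, or with the blank label " " if none does.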
+ count = 0 + length = len(txt_seqs) + for token in txt_seqs: + count += 1 + if count == length: + break + if token in self.punc2id: + continue + punc = txt_seqs[count] + if punc not in self.punc2id: + # print('标点{}:'.format(count), self.punc2id[" "]) + label.append(self.punc2id[" "]) + input_data.append( + self.word2id.get(token, self.word2id[""])) + input_r.append(token) + label_r.append(' ') + else: + # print('标点{}:'.format(count), self.punc2id[punc]) + label.append(self.punc2id[punc]) + input_data.append( + self.word2id.get(token, self.word2id[""])) + input_r.append(token) + label_r.append(punc) + if len(input_data) != len(label): + assert 'error: length input_data != label' + # code below is for using 100 as a hidden size + print(len(input_data)) + self.in_len = len(input_data) // self.seq_len + len_tmp = self.in_len * self.seq_len + input_data = input_data[:len_tmp] + label = label[:len_tmp] + + self.input_data = paddle.to_tensor( + np.array(input_data, dtype='int64').reshape(-1, self.seq_len)) + self.label = paddle.to_tensor( + np.array(label, dtype='int64').reshape(-1, self.seq_len)) + + +# unk_token='[UNK]' +# sep_token='[SEP]' +# pad_token='[PAD]' +# cls_token='[CLS]' +# mask_token='[MASK]' + + +class PuncDatasetFromBertTokenizer(Dataset): + """Representing a Dataset + superclass + ---------- + data.Dataset : + Dataset is a abstract class, representing the real data. + """ + + def __init__(self, + train_path, + is_eval, + pretrained_token, + punc_path, + seq_len=100): + # 检查文件是否存在 + print(train_path) + self.tokenizer = BertTokenizer.from_pretrained( + pretrained_token, do_lower_case=True) + self.paddingID = self.tokenizer.pad_token_id + assert os.path.exists(train_path), "train文件不存在" + assert os.path.exists(punc_path), "标点文件不存在" + self.seq_len = seq_len + + self.punc2id = self.load_vocab(punc_path, extra_word_list=[" "]) + self.id2punc = {k: v for (v, k) in self.punc2id.items()} + + tmp_seqs = open(train_path, encoding='utf-8').readlines() + self.txt_seqs = [i for seq in tmp_seqs for i in seq.split()] + # print(self.txt_seqs[:10]) + # with open('./txt_seq', 'w', encoding='utf-8') as w: + # print(self.txt_seqs, file=w) + if (is_eval): + self.preprocess(self.txt_seqs) + else: + self.preprocess_shift(self.txt_seqs) + print("data len: %d" % (len(self.input_data))) + print('---punc-') + print(self.punc2id) + + def __len__(self): + """return the sentence nums in .txt + """ + return self.in_len + + def __getitem__(self, index): + """返回指定索引的张量对 (输入文本id的序列 , 其对应的标点id序列) + Parameters + ---------- + index : int 索引 + """ + return self.input_data[index], self.label[index] + + def load_vocab(self, vocab_path, extra_word_list=[], encoding='utf-8'): + n = len(extra_word_list) + with open(vocab_path, encoding='utf-8') as vf: + vocab = {word.strip(): i + n for i, word in enumerate(vf)} + for i, word in enumerate(extra_word_list): + vocab[word] = i + return vocab + + def preprocess(self, txt_seqs: list): + """将文本转为单词和应预测标点的id pair + Parameters + ---------- + txt : 文本 + 文本每个单词跟随一个空格,符号也跟一个空格 + """ + input_data = [] + label = [] + # txt_seqs is a list like: ['char', 'char', 'char', '*,*', 'char', ......] 
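+ # Same labelling scheme as PuncDataset.preprocess, but each word is first
+ # split into BERT sub-word ids ([CLS]/[SEP] stripped); the punctuation label
+ # goes on the word's last sub-word and earlier sub-words get the blank " ".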
+ count = 0 + for i in range(len(txt_seqs) - 1): + word = txt_seqs[i] + punc = txt_seqs[i + 1] + if word in self.punc2id: + continue + + token = self.tokenizer(word) + x = token["input_ids"][1:-1] + input_data.extend(x) + + for i in range(len(x) - 1): + label.append(self.punc2id[" "]) + + if punc not in self.punc2id: + # print('标点{}:'.format(count), self.punc2id[" "]) + label.append(self.punc2id[" "]) + else: + label.append(self.punc2id[punc]) + + if len(input_data) != len(label): + assert 'error: length input_data != label' + # code below is for using 100 as a hidden size + + # print(len(input_data[0])) + # print(len(label)) + self.in_len = len(input_data) // self.seq_len + len_tmp = self.in_len * self.seq_len + input_data = input_data[:len_tmp] + label = label[:len_tmp] + # # print(input_data) + # print(type(input_data)) + # tmp=np.array(input_data) + # print('--~~~~~~~~~~~~~') + # print(type(tmp)) + # print(tmp.shape) + self.input_data = paddle.to_tensor( + np.array(input_data, dtype='int64').reshape( + -1, self.seq_len)) #, dtype='int64' + self.label = paddle.to_tensor( + np.array(label, dtype='int64').reshape( + -1, self.seq_len)) #, dtype='int64' + + def preprocess_shift(self, txt_seqs: list): + """将文本转为单词和应预测标点的id pair + Parameters + ---------- + txt : 文本 + 文本每个单词跟随一个空格,符号也跟一个空格 + """ + input_data = [] + label = [] + # txt_seqs is a list like: ['char', 'char', 'char', '*,*', 'char', ......] + count = 0 + for i in range(len(txt_seqs) - 1): + word = txt_seqs[i] + punc = txt_seqs[i + 1] + if word in self.punc2id: + continue + + token = self.tokenizer(word) + x = token["input_ids"][1:-1] + input_data.extend(x) + + for i in range(len(x) - 1): + label.append(self.punc2id[" "]) + + if punc not in self.punc2id: + # print('标点{}:'.format(count), self.punc2id[" "]) + label.append(self.punc2id[" "]) + else: + label.append(self.punc2id[punc]) + + if len(input_data) != len(label): + assert 'error: length input_data != label' + + # print(len(input_data[0])) + # print(len(label)) + start = 0 + processed_data = [] + processed_label = [] + while (start < len(input_data) - self.seq_len): + # end=start+self.seq_len + end = random.randint(start + self.seq_len // 2, + start + self.seq_len) + processed_data.append(input_data[start:end]) + processed_label.append(label[start:end]) + + start = start + random.randint(1, self.seq_len // 2) + + self.in_len = len(processed_data) + # # print(input_data) + # print(type(input_data)) + # tmp=np.array(input_data) + # print('--~~~~~~~~~~~~~') + # print(type(tmp)) + # print(tmp.shape) + self.input_data = processed_data + #paddle.to_tensor(np.array(processed_data, dtype='int64')) #, dtype='int64' + self.label = processed_label + #paddle.to_tensor(np.array(processed_label, dtype='int64')) #, dtype='int64' + + +if __name__ == '__main__': + dataset = PuncDataset() diff --git a/text_processing/speechtask/punctuation_restoration/model/BertBLSTM.py b/text_processing/speechtask/punctuation_restoration/model/BertBLSTM.py new file mode 100644 index 000000000..bc953adfd --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/model/BertBLSTM.py @@ -0,0 +1,74 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import paddle +import paddle.nn as nn +import paddle.nn.initializer as I +from paddlenlp.transformers import BertForTokenClassification + + +class BertBLSTMPunc(nn.Layer): + def __init__(self, + pretrained_token="bert-large-uncased", + output_size=4, + dropout=0.0, + bert_size=768, + blstm_size=128, + num_blstm_layers=2, + init_scale=0.1): + super(BertBLSTMPunc, self).__init__() + self.output_size = output_size + self.bert = BertForTokenClassification.from_pretrained( + pretrained_token, num_classes=bert_size) + # self.bert_vocab_size = vocab_size + # self.bn = nn.BatchNorm1d(segment_size*self.bert_vocab_size) + # self.fc = nn.Linear(segment_size*self.bert_vocab_size, output_size) + + self.lstm = nn.LSTM( + input_size=bert_size, + hidden_size=blstm_size, + num_layers=num_blstm_layers, + direction="bidirect", + weight_ih_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale)), + weight_hh_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale))) + + # NOTE dense*2 使用bert中间层 dense hidden_state self.bert_size + self.dropout = nn.Dropout(dropout) + self.fc = nn.Linear(blstm_size * 2, output_size) + self.softmax = nn.Softmax() + + def forward(self, x): + # print('input :', x.shape) + x = self.bert(x) #[0] + # print('after bert :', x.shape) + + y, (_, _) = self.lstm(x) + # print('after lstm :', y.shape) + y = self.fc(self.dropout(y)) + y = paddle.reshape(y, shape=[-1, self.output_size]) + # print('after fc :', y.shape) + + logit = self.softmax(y) + # print('after softmax :', logit.shape) + + return y, logit + + +if __name__ == '__main__': + print('start model') + model = BertBLSTMPunc() + x = paddle.randint(low=0, high=40, shape=[2, 5]) + print(x) + y, logit = model(x) diff --git a/text_processing/speechtask/punctuation_restoration/model/BertLinear.py b/text_processing/speechtask/punctuation_restoration/model/BertLinear.py new file mode 100644 index 000000000..854f522cf --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/model/BertLinear.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
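+# BertLinearPunc: a BERT token-classification backbone whose per-token outputs
+# are passed through a small feed-forward head (two Linear layers with dropout
+# and ReLU) to predict one punctuation class per input token.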
+import paddle +import paddle.nn as nn +from paddlenlp.transformers import BertForTokenClassification + + +class BertLinearPunc(nn.Layer): + def __init__(self, + pretrained_token="bert-base-uncased", + output_size=4, + dropout=0.2, + bert_size=768, + hiddensize=1568): + super(BertLinearPunc, self).__init__() + self.output_size = output_size + self.bert = BertForTokenClassification.from_pretrained( + pretrained_token, num_classes=bert_size) + # self.bert_vocab_size = vocab_size + # self.bn = nn.BatchNorm1d(segment_size*self.bert_vocab_size) + # self.fc = nn.Linear(segment_size*self.bert_vocab_size, output_size) + + # NOTE dense*2 使用bert中间层 dense hidden_state self.bert_size + self.dropout1 = nn.Dropout(dropout) + self.fc1 = nn.Linear(bert_size, hiddensize) + self.dropout2 = nn.Dropout(dropout) + self.relu = nn.ReLU() + self.fc2 = nn.Linear(hiddensize, output_size) + self.softmax = nn.Softmax() + + def forward(self, x): + # print('input :', x.shape) + x = self.bert(x) #[0] + # print('after bert :', x.shape) + + x = self.fc1(self.dropout1(x)) + x = self.fc2(self.relu(self.dropout2(x))) + x = paddle.reshape(x, shape=[-1, self.output_size]) + # print('after fc :', x.shape) + + logit = self.softmax(x) + # print('after softmax :', logit.shape) + + return x, logit + + +if __name__ == '__main__': + print('start model') + model = BertLinearPunc() + x = paddle.randint(low=0, high=40, shape=[2, 5]) + print(x) + y, logit = model(x) diff --git a/text_processing/speechtask/punctuation_restoration/model/blstm.py b/text_processing/speechtask/punctuation_restoration/model/blstm.py new file mode 100644 index 000000000..fcfd31a3e --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/model/blstm.py @@ -0,0 +1,89 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import paddle +import paddle.nn as nn +import paddle.nn.initializer as I + + +class BiLSTM(nn.Layer): + """LSTM for Punctuation Restoration + """ + + def __init__(self, + vocab_size, + embedding_size, + hidden_size, + num_layers, + num_class, + init_scale=0.1): + super(BiLSTM, self).__init__() + # hyper parameters + self.vocab_size = vocab_size + self.embedding_size = embedding_size + self.hidden_size = hidden_size + self.num_layers = num_layers + self.num_class = num_class + + # 网络中的层 + self.embedding = nn.Embedding( + vocab_size, + embedding_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale))) + # print(hidden_size) + # print(embedding_size) + self.lstm = nn.LSTM( + input_size=embedding_size, + hidden_size=hidden_size, + num_layers=num_layers, + direction="bidirect", + weight_ih_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale)), + weight_hh_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale))) + # Here is a one direction LSTM. 
If bidirection LSTM, (hidden_size*2(,)) + self.fc = nn.Linear( + in_features=hidden_size * 2, + out_features=num_class, + weight_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale)), + bias_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale))) + # self.fc = nn.Linear(hidden_size, num_class) + + self.softmax = nn.Softmax() + + def forward(self, input): + """The forward process of Net + Parameters + ---------- + inputs : tensor + Training data, batch first + """ + # Inherit the knowledge of context + + # hidden = self.init_hidden(inputs.size(0)) + # print('input_size',inputs.size()) + embedding = self.embedding(input) + # print('embedding_size', embedding.size()) + # packed = pack_sequence(embedding, inputs_lengths, batch_first=True) + # embedding本身是同样长度的,用这个函数主要是为了用pack + # ***************************************************************************** + y, (_, _) = self.lstm(embedding) + + # print(y.size()) + y = self.fc(y) + y = paddle.reshape(y, shape=[-1, self.num_class]) + logit = self.softmax(y) + return y, logit diff --git a/text_processing/speechtask/punctuation_restoration/model/lstm.py b/text_processing/speechtask/punctuation_restoration/model/lstm.py new file mode 100644 index 000000000..5ec685337 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/model/lstm.py @@ -0,0 +1,85 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
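+# RnnLm: embedding + LSTM + Linear classifier that predicts a punctuation class
+# for every input token; CrossEntropyLossForLm computes the per-token
+# cross-entropy used to train it.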
+import paddle +import paddle.nn as nn +import paddle.nn.initializer as I + + +class RnnLm(nn.Layer): + def __init__(self, + vocab_size, + punc_size, + hidden_size, + num_layers=1, + init_scale=0.1, + dropout=0.0): + super(RnnLm, self).__init__() + self.hidden_size = hidden_size + self.num_layers = num_layers + self.init_scale = init_scale + self.punc_size = punc_size + + self.embedder = nn.Embedding( + vocab_size, + hidden_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale))) + + self.lstm = nn.LSTM( + input_size=hidden_size, + hidden_size=hidden_size, + num_layers=num_layers, + dropout=dropout, + weight_ih_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale)), + weight_hh_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale))) + + self.fc = nn.Linear( + hidden_size, + punc_size, + weight_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale)), + bias_attr=paddle.ParamAttr(initializer=I.Uniform( + low=-init_scale, high=init_scale))) + + self.dropout = nn.Dropout(p=dropout) + self.softmax = nn.Softmax() + + def forward(self, inputs): + x = inputs + x_emb = self.embedder(x) + x_emb = self.dropout(x_emb) + + y, (_, _) = self.lstm(x_emb) + + y = self.dropout(y) + y = self.fc(y) + y = paddle.reshape(y, shape=[-1, self.punc_size]) + logit = self.softmax(y) + return y, logit + + +class CrossEntropyLossForLm(nn.Layer): + def __init__(self): + super(CrossEntropyLossForLm, self).__init__() + + def forward(self, y, label): + label = paddle.unsqueeze(label, axis=2) + loss = paddle.nn.functional.cross_entropy( + input=y, label=label, reduction='none') + loss = paddle.squeeze(loss, axis=[2]) + loss = paddle.mean(loss, axis=[0]) + loss = paddle.sum(loss) + return loss diff --git a/text_processing/speechtask/punctuation_restoration/modules/__init__.py b/text_processing/speechtask/punctuation_restoration/modules/__init__.py new file mode 100644 index 000000000..185a92b8d --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/modules/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/text_processing/speechtask/punctuation_restoration/modules/activation.py b/text_processing/speechtask/punctuation_restoration/modules/activation.py new file mode 100644 index 000000000..6a13e4aab --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/modules/activation.py @@ -0,0 +1,141 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from collections import OrderedDict + +import paddle +from paddle import nn + +__all__ = ["get_activation", "brelu", "LinearGLUBlock", "ConvGLUBlock"] + + +def brelu(x, t_min=0.0, t_max=24.0, name=None): + # paddle.to_tensor is dygraph_only can not work under JIT + t_min = paddle.full(shape=[1], fill_value=t_min, dtype='float32') + t_max = paddle.full(shape=[1], fill_value=t_max, dtype='float32') + return x.maximum(t_min).minimum(t_max) + + +class LinearGLUBlock(nn.Layer): + """A linear Gated Linear Units (GLU) block.""" + + def __init__(self, idim: int): + """ GLU. + Args: + idim (int): input and output dimension + """ + super().__init__() + self.fc = nn.Linear(idim, idim * 2) + + def forward(self, xs): + return glu(self.fc(xs), dim=-1) + + +class ConvGLUBlock(nn.Layer): + def __init__(self, kernel_size, in_ch, out_ch, bottlececk_dim=0, + dropout=0.): + """A convolutional Gated Linear Units (GLU) block. + + Args: + kernel_size (int): kernel size + in_ch (int): number of input channels + out_ch (int): number of output channels + bottlececk_dim (int): dimension of the bottleneck layers for computational efficiency. Defaults to 0. + dropout (float): dropout probability. Defaults to 0.. + """ + + super().__init__() + + self.conv_residual = None + if in_ch != out_ch: + self.conv_residual = nn.utils.weight_norm( + nn.Conv2D( + in_channels=in_ch, out_channels=out_ch, kernel_size=(1, 1)), + name='weight', + dim=0) + self.dropout_residual = nn.Dropout(p=dropout) + + self.pad_left = ConstantPad2d((0, 0, kernel_size - 1, 0), 0) + + layers = OrderedDict() + if bottlececk_dim == 0: + layers['conv'] = nn.utils.weight_norm( + nn.Conv2D( + in_channels=in_ch, + out_channels=out_ch * 2, + kernel_size=(kernel_size, 1)), + name='weight', + dim=0) + # TODO(hirofumi0810): padding? + layers['dropout'] = nn.Dropout(p=dropout) + layers['glu'] = GLU() + + elif bottlececk_dim > 0: + layers['conv_in'] = nn.utils.weight_norm( + nn.Conv2D( + in_channels=in_ch, + out_channels=bottlececk_dim, + kernel_size=(1, 1)), + name='weight', + dim=0) + layers['dropout_in'] = nn.Dropout(p=dropout) + layers['conv_bottleneck'] = nn.utils.weight_norm( + nn.Conv2D( + in_channels=bottlececk_dim, + out_channels=bottlececk_dim, + kernel_size=(kernel_size, 1)), + name='weight', + dim=0) + layers['dropout'] = nn.Dropout(p=dropout) + layers['glu'] = GLU() + layers['conv_out'] = nn.utils.weight_norm( + nn.Conv2D( + in_channels=bottlececk_dim, + out_channels=out_ch * 2, + kernel_size=(1, 1)), + name='weight', + dim=0) + layers['dropout_out'] = nn.Dropout(p=dropout) + + self.layers = nn.Sequential(layers) + + def forward(self, xs): + """Forward pass. 
+ Args: + xs (FloatTensor): `[B, in_ch, T, feat_dim]` + Returns: + out (FloatTensor): `[B, out_ch, T, feat_dim]` + """ + residual = xs + if self.conv_residual is not None: + residual = self.dropout_residual(self.conv_residual(residual)) + xs = self.pad_left(xs) # `[B, embed_dim, T+kernel-1, 1]` + xs = self.layers(xs) # `[B, out_ch * 2, T ,1]` + xs = xs + residual + return xs + + +def get_activation(act): + """Return activation function.""" + # Lazy load to avoid unused import + activation_funcs = { + "hardtanh": paddle.nn.Hardtanh, + "tanh": paddle.nn.Tanh, + "relu": paddle.nn.ReLU, + "selu": paddle.nn.SELU, + "swish": paddle.nn.Swish, + "gelu": paddle.nn.GELU, + "brelu": brelu, + } + + return activation_funcs[act]() diff --git a/text_processing/speechtask/punctuation_restoration/modules/attention.py b/text_processing/speechtask/punctuation_restoration/modules/attention.py new file mode 100644 index 000000000..1a7363c4d --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/modules/attention.py @@ -0,0 +1,229 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Multi-Head Attention layer definition.""" +import math +from typing import Optional +from typing import Tuple + +import paddle +from paddle import nn +from paddle.nn import initializer as I + +__all__ = ["MultiHeadedAttention", "RelPositionMultiHeadedAttention"] + +# Relative Positional Encodings +# https://www.jianshu.com/p/c0608efcc26f +# https://zhuanlan.zhihu.com/p/344604604 + + +class MultiHeadedAttention(nn.Layer): + """Multi-Head Attention layer.""" + + def __init__(self, n_head: int, n_feat: int, dropout_rate: float): + """Construct an MultiHeadedAttention object. + Args: + n_head (int): The number of heads. + n_feat (int): The number of features. + dropout_rate (float): Dropout rate. + """ + super().__init__() + assert n_feat % n_head == 0 + # We assume d_v always equals d_k + self.d_k = n_feat // n_head + self.h = n_head + self.linear_q = nn.Linear(n_feat, n_feat) + self.linear_k = nn.Linear(n_feat, n_feat) + self.linear_v = nn.Linear(n_feat, n_feat) + self.linear_out = nn.Linear(n_feat, n_feat) + self.dropout = nn.Dropout(p=dropout_rate) + + def forward_qkv(self, + query: paddle.Tensor, + key: paddle.Tensor, + value: paddle.Tensor + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: + """Transform query, key and value. + Args: + query (paddle.Tensor): Query tensor (#batch, time1, size). + key (paddle.Tensor): Key tensor (#batch, time2, size). + value (paddle.Tensor): Value tensor (#batch, time2, size). + Returns: + paddle.Tensor: Transformed query tensor, size + (#batch, n_head, time1, d_k). + paddle.Tensor: Transformed key tensor, size + (#batch, n_head, time2, d_k). + paddle.Tensor: Transformed value tensor, size + (#batch, n_head, time2, d_k). 
+ """ + n_batch = query.size(0) + q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k) + k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k) + v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k) + q = q.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k) + k = k.transpose([0, 2, 1, 3]) # (batch, head, time2, d_k) + v = v.transpose([0, 2, 1, 3]) # (batch, head, time2, d_k) + + return q, k, v + + def forward_attention(self, + value: paddle.Tensor, + scores: paddle.Tensor, + mask: Optional[paddle.Tensor]) -> paddle.Tensor: + """Compute attention context vector. + Args: + value (paddle.Tensor): Transformed value, size + (#batch, n_head, time2, d_k). + scores (paddle.Tensor): Attention score, size + (#batch, n_head, time1, time2). + mask (paddle.Tensor): Mask, size (#batch, 1, time2) or + (#batch, time1, time2). + Returns: + paddle.Tensor: Transformed value weighted + by the attention score, (#batch, time1, d_model). + """ + n_batch = value.size(0) + if mask is not None: + mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2) + scores = scores.masked_fill(mask, -float('inf')) + attn = paddle.softmax( + scores, axis=-1).masked_fill(mask, + 0.0) # (batch, head, time1, time2) + else: + attn = paddle.softmax( + scores, axis=-1) # (batch, head, time1, time2) + + p_attn = self.dropout(attn) + x = paddle.matmul(p_attn, value) # (batch, head, time1, d_k) + x = x.transpose([0, 2, 1, 3]).contiguous().view( + n_batch, -1, self.h * self.d_k) # (batch, time1, d_model) + + return self.linear_out(x) # (batch, time1, d_model) + + def forward(self, + query: paddle.Tensor, + key: paddle.Tensor, + value: paddle.Tensor, + mask: Optional[paddle.Tensor]) -> paddle.Tensor: + """Compute scaled dot product attention. + Args: + query (torch.Tensor): Query tensor (#batch, time1, size). + key (torch.Tensor): Key tensor (#batch, time2, size). + value (torch.Tensor): Value tensor (#batch, time2, size). + mask (torch.Tensor): Mask tensor (#batch, 1, time2) or + (#batch, time1, time2). + Returns: + torch.Tensor: Output tensor (#batch, time1, d_model). + """ + q, k, v = self.forward_qkv(query, key, value) + scores = paddle.matmul(q, + k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k) + return self.forward_attention(v, scores, mask) + + +class RelPositionMultiHeadedAttention(MultiHeadedAttention): + """Multi-Head Attention layer with relative position encoding.""" + + def __init__(self, n_head, n_feat, dropout_rate): + """Construct an RelPositionMultiHeadedAttention object. + Paper: https://arxiv.org/abs/1901.02860 + Args: + n_head (int): The number of heads. + n_feat (int): The number of features. + dropout_rate (float): Dropout rate. 
+ """ + super().__init__(n_head, n_feat, dropout_rate) + # linear transformation for positional encoding + self.linear_pos = nn.Linear(n_feat, n_feat, bias_attr=False) + # these two learnable bias are used in matrix c and matrix d + # as described in https://arxiv.org/abs/1901.02860 Section 3.3 + #self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k)) + #self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k)) + #torch.nn.init.xavier_uniform_(self.pos_bias_u) + #torch.nn.init.xavier_uniform_(self.pos_bias_v) + pos_bias_u = self.create_parameter( + [self.h, self.d_k], default_initializer=I.XavierUniform()) + self.add_parameter('pos_bias_u', pos_bias_u) + pos_bias_v = self.create_parameter( + (self.h, self.d_k), default_initializer=I.XavierUniform()) + self.add_parameter('pos_bias_v', pos_bias_v) + + def rel_shift(self, x, zero_triu: bool=False): + """Compute relative positinal encoding. + Args: + x (paddle.Tensor): Input tensor (batch, head, time1, time1). + zero_triu (bool): If true, return the lower triangular part of + the matrix. + Returns: + paddle.Tensor: Output tensor. (batch, head, time1, time1) + """ + zero_pad = paddle.zeros( + (x.size(0), x.size(1), x.size(2), 1), dtype=x.dtype) + x_padded = paddle.cat([zero_pad, x], dim=-1) + + x_padded = x_padded.view(x.size(0), x.size(1), x.size(3) + 1, x.size(2)) + x = x_padded[:, :, 1:].view_as(x) # [B, H, T1, T1] + + if zero_triu: + ones = paddle.ones((x.size(2), x.size(3))) + x = x * paddle.tril(ones, x.size(3) - x.size(2))[None, None, :, :] + + return x + + def forward(self, + query: paddle.Tensor, + key: paddle.Tensor, + value: paddle.Tensor, + pos_emb: paddle.Tensor, + mask: Optional[paddle.Tensor]): + """Compute 'Scaled Dot Product Attention' with rel. positional encoding. + Args: + query (paddle.Tensor): Query tensor (#batch, time1, size). + key (paddle.Tensor): Key tensor (#batch, time2, size). + value (paddle.Tensor): Value tensor (#batch, time2, size). + pos_emb (paddle.Tensor): Positional embedding tensor + (#batch, time1, size). + mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or + (#batch, time1, time2). + Returns: + paddle.Tensor: Output tensor (#batch, time1, d_model). + """ + q, k, v = self.forward_qkv(query, key, value) + q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k) + + n_batch_pos = pos_emb.size(0) + p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k) + p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k) + + # (batch, head, time1, d_k) + q_with_bias_u = (q + self.pos_bias_u).transpose([0, 2, 1, 3]) + # (batch, head, time1, d_k) + q_with_bias_v = (q + self.pos_bias_v).transpose([0, 2, 1, 3]) + + # compute attention score + # first compute matrix a and matrix c + # as described in https://arxiv.org/abs/1901.02860 Section 3.3 + # (batch, head, time1, time2) + matrix_ac = paddle.matmul(q_with_bias_u, k.transpose([0, 1, 3, 2])) + + # compute matrix b and matrix d + # (batch, head, time1, time2) + matrix_bd = paddle.matmul(q_with_bias_v, p.transpose([0, 1, 3, 2])) + # Remove rel_shift since it is useless in speech recognition, + # and it requires special attention for streaming. 
+ # matrix_bd = self.rel_shift(matrix_bd) + + scores = (matrix_ac + matrix_bd) / math.sqrt( + self.d_k) # (batch, head, time1, time2) + + return self.forward_attention(v, scores, mask) diff --git a/text_processing/speechtask/punctuation_restoration/modules/crf.py b/text_processing/speechtask/punctuation_restoration/modules/crf.py new file mode 100644 index 000000000..0a53ae6f8 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/modules/crf.py @@ -0,0 +1,366 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import paddle +from paddle import nn + +__all__ = ['CRF'] + + +class CRF(nn.Layer): + """ + Linear-chain Conditional Random Field (CRF). + + Args: + nb_labels (int): number of labels in your tagset, including special symbols. + bos_tag_id (int): integer representing the beginning of sentence symbol in + your tagset. + eos_tag_id (int): integer representing the end of sentence symbol in your tagset. + pad_tag_id (int, optional): integer representing the pad symbol in your tagset. + If None, the model will treat the PAD as a normal tag. Otherwise, the model + will apply constraints for PAD transitions. + batch_first (bool): Whether the first dimension represents the batch dimension. + """ + + def __init__(self, + nb_labels: int, + bos_tag_id: int, + eos_tag_id: int, + pad_tag_id: int=None, + batch_first: bool=True): + super().__init__() + + self.nb_labels = nb_labels + self.BOS_TAG_ID = bos_tag_id + self.EOS_TAG_ID = eos_tag_id + self.PAD_TAG_ID = pad_tag_id + self.batch_first = batch_first + + # initialize transitions from a random uniform distribution between -0.1 and 0.1 + self.transitions = self.create_parameter( + [self.nb_labels, self.nb_labels], + default_initializer=nn.initializer.Uniform(-0.1, 0.1)) + self.init_weights() + + def init_weights(self): + # enforce contraints (rows=from, columns=to) with a big negative number + # so exp(-10000) will tend to zero + + # no transitions allowed to the beginning of sentence + self.transitions[:, self.BOS_TAG_ID] = -10000.0 + # no transition alloed from the end of sentence + self.transitions[self.EOS_TAG_ID, :] = -10000.0 + + if self.PAD_TAG_ID is not None: + # no transitions from padding + self.transitions[self.PAD_TAG_ID, :] = -10000.0 + # no transitions to padding + self.transitions[:, self.PAD_TAG_ID] = -10000.0 + # except if the end of sentence is reached + # or we are already in a pad position + self.transitions[self.PAD_TAG_ID, self.EOS_TAG_ID] = 0.0 + self.transitions[self.PAD_TAG_ID, self.PAD_TAG_ID] = 0.0 + + def forward(self, + emissions: paddle.Tensor, + tags: paddle.Tensor, + mask: paddle.Tensor=None) -> paddle.Tensor: + """Compute the negative log-likelihood. See `log_likelihood` method.""" + nll = -self.log_likelihood(emissions, tags, mask=mask) + return nll + + def log_likelihood(self, emissions, tags, mask=None): + """Compute the probability of a sequence of tags given a sequence of + emissions scores. 
+ + Args: + emissions (paddle.Tensor): Sequence of emissions for each label. + Shape of (batch_size, seq_len, nb_labels) if batch_first is True, + (seq_len, batch_size, nb_labels) otherwise. + tags (paddle.LongTensor): Sequence of labels. + Shape of (batch_size, seq_len) if batch_first is True, + (seq_len, batch_size) otherwise. + mask (paddle.FloatTensor, optional): Tensor representing valid positions. + If None, all positions are considered valid. + Shape of (batch_size, seq_len) if batch_first is True, + (seq_len, batch_size) otherwise. + + Returns: + paddle.Tensor: sum of the log-likelihoods for each sequence in the batch. + Shape of () + """ + # fix tensors order by setting batch as the first dimension + if not self.batch_first: + emissions = emissions.transpose(0, 1) + tags = tags.transpose(0, 1) + + if mask is None: + mask = paddle.ones(emissions.shape[:2], dtype=paddle.float) + + scores = self._compute_scores(emissions, tags, mask=mask) + partition = self._compute_log_partition(emissions, mask=mask) + return paddle.sum(scores - partition) + + def decode(self, emissions, mask=None): + """Find the most probable sequence of labels given the emissions using + the Viterbi algorithm. + + Args: + emissions (paddle.Tensor): Sequence of emissions for each label. + Shape (batch_size, seq_len, nb_labels) if batch_first is True, + (seq_len, batch_size, nb_labels) otherwise. + mask (paddle.FloatTensor, optional): Tensor representing valid positions. + If None, all positions are considered valid. + Shape (batch_size, seq_len) if batch_first is True, + (seq_len, batch_size) otherwise. + + Returns: + paddle.Tensor: the viterbi score for the for each batch. + Shape of (batch_size,) + list of lists: the best viterbi sequence of labels for each batch. [B, T] + """ + # fix tensors order by setting batch as the first dimension + if not self.batch_first: + emissions = emissions.transpose(0, 1) + tags = tags.transpose(0, 1) + + if mask is None: + mask = paddle.ones(emissions.shape[:2], dtype=paddle.float) + + scores, sequences = self._viterbi_decode(emissions, mask) + return scores, sequences + + def _compute_scores(self, emissions, tags, mask): + """Compute the scores for a given batch of emissions with their tags. + + Args: + emissions (paddle.Tensor): (batch_size, seq_len, nb_labels) + tags (Paddle.LongTensor): (batch_size, seq_len) + mask (Paddle.FloatTensor): (batch_size, seq_len) + + Returns: + paddle.Tensor: Scores for each batch. + Shape of (batch_size,) + """ + batch_size, seq_length = tags.shape + scores = paddle.zeros([batch_size]) + + # save first and last tags to be used later + first_tags = tags[:, 0] + last_valid_idx = mask.int().sum(1) - 1 + + # TODO(Hui Zhang): not support fancy index. 
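+ # Workaround: build explicit [batch_idx, position] index pairs and use
+ # gather_nd instead of the fancy-index gather shown in the commented line below.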
+ # last_tags = tags.gather(last_valid_idx.unsqueeze(1), axis=1).squeeze() + batch_idx = paddle.arange(batch_size, dtype=last_valid_idx.dtype) + gather_last_valid_idx = paddle.stack( + [batch_idx, last_valid_idx], axis=-1) + last_tags = tags.gather_nd(gather_last_valid_idx) + + # add the transition from BOS to the first tags for each batch + # t_scores = self.transitions[self.BOS_TAG_ID, first_tags] + t_scores = self.transitions[self.BOS_TAG_ID].gather(first_tags) + + # add the [unary] emission scores for the first tags for each batch + # for all batches, the first word, see the correspondent emissions + # for the first tags (which is a list of ids): + # emissions[:, 0, [tag_1, tag_2, ..., tag_nblabels]] + # e_scores = emissions[:, 0].gather(1, first_tags.unsqueeze(1)).squeeze() + gather_first_tags_idx = paddle.stack([batch_idx, first_tags], axis=-1) + e_scores = emissions[:, 0].gather_nd(gather_first_tags_idx) + + # the scores for a word is just the sum of both scores + scores += e_scores + t_scores + + # now lets do this for each remaining word + for i in range(1, seq_length): + + # we could: iterate over batches, check if we reached a mask symbol + # and stop the iteration, but vecotrizing is faster due to gpu, + # so instead we perform an element-wise multiplication + is_valid = mask[:, i] + + previous_tags = tags[:, i - 1] + current_tags = tags[:, i] + + # calculate emission and transition scores as we did before + # e_scores = emissions[:, i].gather(1, current_tags.unsqueeze(1)).squeeze() + gather_current_tags_idx = paddle.stack( + [batch_idx, current_tags], axis=-1) + e_scores = emissions[:, i].gather_nd(gather_current_tags_idx) + # t_scores = self.transitions[previous_tags, current_tags] + gather_transitions_idx = paddle.stack( + [previous_tags, current_tags], axis=-1) + t_scores = self.transitions.gather_nd(gather_transitions_idx) + + # apply the mask + e_scores = e_scores * is_valid + t_scores = t_scores * is_valid + + scores += e_scores + t_scores + + # add the transition from the end tag to the EOS tag for each batch + # scores += self.transitions[last_tags, self.EOS_TAG_ID] + scores += self.transitions.gather(last_tags)[:, self.EOS_TAG_ID] + + return scores + + def _compute_log_partition(self, emissions, mask): + """Compute the partition function in log-space using the forward-algorithm. + + Args: + emissions (paddle.Tensor): (batch_size, seq_len, nb_labels) + mask (Paddle.FloatTensor): (batch_size, seq_len) + + Returns: + paddle.Tensor: the partition scores for each batch. 
+ Shape of (batch_size,) + """ + batch_size, seq_length, nb_labels = emissions.shape + + # in the first iteration, BOS will have all the scores + alphas = self.transitions[self.BOS_TAG_ID, :].unsqueeze( + 0) + emissions[:, 0] + + for i in range(1, seq_length): + # (bs, nb_labels) -> (bs, 1, nb_labels) + e_scores = emissions[:, i].unsqueeze(1) + + # (nb_labels, nb_labels) -> (bs, nb_labels, nb_labels) + t_scores = self.transitions.unsqueeze(0) + + # (bs, nb_labels) -> (bs, nb_labels, 1) + a_scores = alphas.unsqueeze(2) + + scores = e_scores + t_scores + a_scores + new_alphas = paddle.logsumexp(scores, axis=1) + + # set alphas if the mask is valid, otherwise keep the current values + is_valid = mask[:, i].unsqueeze(-1) + alphas = is_valid * new_alphas + (1 - is_valid) * alphas + + # add the scores for the final transition + last_transition = self.transitions[:, self.EOS_TAG_ID] + end_scores = alphas + last_transition.unsqueeze(0) + + # return a *log* of sums of exps + return paddle.logsumexp(end_scores, axis=1) + + def _viterbi_decode(self, emissions, mask): + """Compute the viterbi algorithm to find the most probable sequence of labels + given a sequence of emissions. + + Args: + emissions (paddle.Tensor): (batch_size, seq_len, nb_labels) + mask (Paddle.FloatTensor): (batch_size, seq_len) + + Returns: + paddle.Tensor: the viterbi score for the for each batch. + Shape of (batch_size,) + list of lists of ints: the best viterbi sequence of labels for each batch + """ + batch_size, seq_length, nb_labels = emissions.shape + + # in the first iteration, BOS will have all the scores and then, the max + alphas = self.transitions[self.BOS_TAG_ID, :].unsqueeze( + 0) + emissions[:, 0] + + backpointers = [] + + for i in range(1, seq_length): + # (bs, nb_labels) -> (bs, 1, nb_labels) + e_scores = emissions[:, i].unsqueeze(1) + + # (nb_labels, nb_labels) -> (bs, nb_labels, nb_labels) + t_scores = self.transitions.unsqueeze(0) + + # (bs, nb_labels) -> (bs, nb_labels, 1) + a_scores = alphas.unsqueeze(2) + + # combine current scores with previous alphas + scores = e_scores + t_scores + a_scores + + # so far is exactly like the forward algorithm, + # but now, instead of calculating the logsumexp, + # we will find the highest score and the tag associated with it + # max_scores, max_score_tags = paddle.max(scores, axis=1) + max_scores = paddle.max(scores, axis=1) + max_score_tags = paddle.argmax(scores, axis=1) + + # set alphas if the mask is valid, otherwise keep the current values + is_valid = mask[:, i].unsqueeze(-1) + alphas = is_valid * max_scores + (1 - is_valid) * alphas + + # add the max_score_tags for our list of backpointers + # max_scores has shape (batch_size, nb_labels) so we transpose it to + # be compatible with our previous loopy version of viterbi + backpointers.append(max_score_tags.t()) + + # add the scores for the final transition + last_transition = self.transitions[:, self.EOS_TAG_ID] + end_scores = alphas + last_transition.unsqueeze(0) + + # get the final most probable score and the final most probable tag + # max_final_scores, max_final_tags = paddle.max(end_scores, axis=1) + max_final_scores = paddle.max(end_scores, axis=1) + max_final_tags = paddle.argmax(end_scores, axis=1) + + # find the best sequence of labels for each sample in the batch + best_sequences = [] + emission_lengths = mask.int().sum(axis=1) + for i in range(batch_size): + + # recover the original sentence length for the i-th sample in the batch + sample_length = emission_lengths[i].item() + + # recover the max tag for 
the last timestep + sample_final_tag = max_final_tags[i].item() + + # limit the backpointers until the last but one + # since the last corresponds to the sample_final_tag + sample_backpointers = backpointers[:sample_length - 1] + + # follow the backpointers to build the sequence of labels + sample_path = self._find_best_path(i, sample_final_tag, + sample_backpointers) + + # add this path to the list of best sequences + best_sequences.append(sample_path) + + return max_final_scores, best_sequences + + def _find_best_path(self, sample_id, best_tag, backpointers): + """Auxiliary function to find the best path sequence for a specific sample. + + Args: + sample_id (int): sample index in the range [0, batch_size) + best_tag (int): tag which maximizes the final score + backpointers (list of lists of tensors): list of pointers with + shape (seq_len_i-1, nb_labels, batch_size) where seq_len_i + represents the length of the ith sample in the batch + + Returns: + list of ints: a list of tag indexes representing the bast path + """ + # add the final best_tag to our best path + best_path = [best_tag] + + # traverse the backpointers in backwards + for backpointers_t in reversed(backpointers): + + # recover the best_tag at this timestep + best_tag = backpointers_t[best_tag][sample_id].item() + + # append to the beginning of the list so we don't need to reverse it later + best_path.insert(0, best_tag) + + return best_path diff --git a/text_processing/speechtask/punctuation_restoration/training/__init__.py b/text_processing/speechtask/punctuation_restoration/training/__init__.py new file mode 100644 index 000000000..185a92b8d --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/training/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/text_processing/speechtask/punctuation_restoration/training/loss.py b/text_processing/speechtask/punctuation_restoration/training/loss.py new file mode 100644 index 000000000..356dfcab1 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/training/loss.py @@ -0,0 +1,98 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class FocalLossHX(nn.Layer): + def __init__(self, gamma=0, size_average=True): + super(FocalLoss, self).__init__() + self.gamma = gamma + self.size_average = size_average + + def forward(self, input, target): + # print('input') + # print(input.shape) + # print(target.shape) + + if input.dim() > 2: + input = paddle.reshape( + input, + shape=[input.size(0), input.size(1), -1]) # N,C,H,W => N,C,H*W + input = input.transpose(1, 2) # N,C,H*W => N,H*W,C + input = paddle.reshape( + input, shape=[-1, input.size(2)]) # N,H*W,C => N*H*W,C + target = paddle.reshape(target, shape=[-1]) + + logpt = F.log_softmax(input) + # print('logpt') + # print(logpt.shape) + # print(logpt) + + # get true class column from each row + all_rows = paddle.arange(len(input)) + # print(target) + log_pt = logpt.numpy()[all_rows.numpy(), target.numpy()] + + pt = paddle.to_tensor(log_pt, dtype='float64').exp() + ce = F.cross_entropy(input, target, reduction='none') + # print('ce') + # print(ce.shape) + + loss = (1 - pt)**self.gamma * ce + # print('ce:%f'%ce.mean()) + # print('fl:%f'%loss.mean()) + if self.size_average: + return loss.mean() + else: + return loss.sum() + + +class FocalLoss(nn.Layer): + """ + Focal Loss. + Code referenced from: + https://github.com/clcarwin/focal_loss_pytorch/blob/master/focalloss.py + Args: + gamma (float): the coefficient of Focal Loss. + ignore_index (int64): Specifies a target value that is ignored + and does not contribute to the input gradient. Default ``255``. + """ + + def __init__(self, gamma=2.0): + super(FocalLoss, self).__init__() + self.gamma = gamma + + def forward(self, logit, label): + #####logit = F.softmax(logit) + # logit = paddle.reshape( + # logit, [logit.shape[0], logit.shape[1], -1]) # N,C,H,W => N,C,H*W + # logit = paddle.transpose(logit, [0, 2, 1]) # N,C,H*W => N,H*W,C + # logit = paddle.reshape(logit, + # [-1, logit.shape[2]]) # N,H*W,C => N*H*W,C + label = paddle.reshape(label, [-1, 1]) + range_ = paddle.arange(0, label.shape[0]) + range_ = paddle.unsqueeze(range_, axis=-1) + label = paddle.cast(label, dtype='int64') + label = paddle.concat([range_, label], axis=-1) + logpt = F.log_softmax(logit) + logpt = paddle.gather_nd(logpt, label) + + pt = paddle.exp(logpt.detach()) + loss = -1 * (1 - pt)**self.gamma * logpt + loss = paddle.mean(loss) + # print(loss) + # print(logpt) + return loss diff --git a/text_processing/speechtask/punctuation_restoration/training/trainer.py b/text_processing/speechtask/punctuation_restoration/training/trainer.py new file mode 100644 index 000000000..2dce88a3f --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/training/trainer.py @@ -0,0 +1,651 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging +import time +from collections import defaultdict +from pathlib import Path + +import numpy as np +import paddle +import paddle.nn as nn +import pandas as pd +from paddle import distributed as dist +from paddle.io import DataLoader +from sklearn.metrics import classification_report +from sklearn.metrics import f1_score +from sklearn.metrics import precision_recall_fscore_support +from speechtask.punctuation_restoration.io.dataset import PuncDataset +from speechtask.punctuation_restoration.io.dataset import PuncDatasetFromBertTokenizer +from speechtask.punctuation_restoration.model.BertBLSTM import BertBLSTMPunc +from speechtask.punctuation_restoration.model.BertLinear import BertLinearPunc +from speechtask.punctuation_restoration.model.blstm import BiLSTM +from speechtask.punctuation_restoration.model.lstm import RnnLm +from speechtask.punctuation_restoration.utils import layer_tools +from speechtask.punctuation_restoration.utils import mp_tools +from speechtask.punctuation_restoration.utils.checkpoint import Checkpoint +from tensorboardX import SummaryWriter + +__all__ = ["Trainer", "Tester"] + +DefinedClassifier = { + "lstm": RnnLm, + "blstm": BiLSTM, + "BertLinear": BertLinearPunc, + "BertBLSTM": BertBLSTMPunc +} + +DefinedLoss = { + "ce": nn.CrossEntropyLoss, +} + +DefinedDataset = { + 'PuncCh': PuncDataset, + 'Bert': PuncDatasetFromBertTokenizer, +} + + +class Trainer(): + """ + An experiment template in order to structure the training code and take + care of saving, loading, logging, visualization stuffs. It"s intended to + be flexible and simple. + + So it only handles output directory (create directory for the output, + create a checkpoint directory, dump the config in use and create + visualizer and logger) in a standard way without enforcing any + input-output protocols to the model and dataloader. It leaves the main + part for the user to implement their own (setup the model, criterion, + optimizer, define a training step, define a validation function and + customize all the text and visual logs). + It does not save too much boilerplate code. The users still have to write + the forward/backward/update mannually, but they are free to add + non-standard behaviors if needed. + We have some conventions to follow. + 1. Experiment should have ``model``, ``optimizer``, ``train_loader`` and + ``valid_loader``, ``config`` and ``args`` attributes. + 2. The config should have a ``training`` field, which has + ``valid_interval``, ``save_interval`` and ``max_iteration`` keys. It is + used as the trigger to invoke validation, checkpointing and stop of the + experiment. + 3. There are four methods, namely ``train_batch``, ``valid``, + ``setup_model`` and ``setup_dataloader`` that should be implemented. + Feel free to add/overwrite other methods and standalone functions if you + need. + + Parameters + ---------- + config: yacs.config.CfgNode + The configuration used for the experiment. + + args: argparse.Namespace + The parsed command line arguments. 
+ Examples + -------- + >>> def main_sp(config, args): + >>> exp = Trainer(config, args) + >>> exp.setup() + >>> exp.run() + >>> + >>> config = get_cfg_defaults() + >>> parser = default_argument_parser() + >>> args = parser.parse_args() + >>> if args.config: + >>> config.merge_from_file(args.config) + >>> if args.opts: + >>> config.merge_from_list(args.opts) + >>> config.freeze() + >>> + >>> if args.nprocs > 1 and args.device == "gpu": + >>> dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs) + >>> else: + >>> main_sp(config, args) + """ + + def __init__(self, config, args): + self.config = config + self.args = args + self.optimizer = None + self.visualizer = None + self.output_dir = None + self.checkpoint_dir = None + self.iteration = 0 + self.epoch = 0 + + def setup(self): + """Setup the experiment. + """ + self.setup_logger() + paddle.set_device(self.args.device) + if self.parallel: + self.init_parallel() + + self.setup_output_dir() + self.dump_config() + self.setup_visualizer() + self.setup_checkpointer() + + self.setup_model() + + self.setup_dataloader() + + self.iteration = 0 + self.epoch = 0 + + @property + def parallel(self): + """A flag indicating whether the experiment should run with + multiprocessing. + """ + return self.args.device == "gpu" and self.args.nprocs > 1 + + def init_parallel(self): + """Init environment for multiprocess training. + """ + dist.init_parallel_env() + + @mp_tools.rank_zero_only + def save(self, tag=None, infos: dict=None): + """Save checkpoint (model parameters and optimizer states). + + Args: + tag (int or str, optional): None for step, else using tag, e.g epoch. Defaults to None. + infos (dict, optional): meta data to save. Defaults to None. + """ + + infos = infos if infos else dict() + infos.update({ + "step": self.iteration, + "epoch": self.epoch, + "lr": self.optimizer.get_lr() + }) + self.checkpointer.add_checkpoint(self.checkpoint_dir, self.iteration + if tag is None else tag, self.model, + self.optimizer, infos) + + def resume_or_scratch(self): + """Resume from latest checkpoint at checkpoints in the output + directory or load a specified checkpoint. + + If ``args.checkpoint_path`` is not None, load the checkpoint, else + resume training. + """ + scratch = None + infos = self.checkpointer.load_parameters( + self.model, + self.optimizer, + checkpoint_dir=self.checkpoint_dir, + checkpoint_path=self.args.checkpoint_path) + if infos: + # restore from ckpt + self.iteration = infos["step"] + self.epoch = infos["epoch"] + scratch = False + else: + self.iteration = 0 + self.epoch = 0 + scratch = True + + return scratch + + def new_epoch(self): + """Reset the train loader seed and increment `epoch`. + """ + self.epoch += 1 + if self.parallel: + self.train_loader.batch_sampler.set_epoch(self.epoch) + + def train(self): + """The training process control by epoch.""" + from_scratch = self.resume_or_scratch() + + if from_scratch: + # save init model, i.e. 
0 epoch + self.save(tag="init") + + self.lr_scheduler.step(self.iteration) + if self.parallel: + self.train_loader.batch_sampler.set_epoch(self.epoch) + + self.logger.info( + f"Train Total Examples: {len(self.train_loader.dataset)}") + self.punc_list = [] + for i in range(len(self.train_loader.dataset.id2punc)): + self.punc_list.append(self.train_loader.dataset.id2punc[i]) + while self.epoch < self.config["training"]["n_epoch"]: + self.model.train() + self.total_label_train = [] + self.total_predict_train = [] + try: + data_start_time = time.time() + for batch_index, batch in enumerate(self.train_loader): + dataload_time = time.time() - data_start_time + msg = "Train: Rank: {}, ".format(dist.get_rank()) + msg += "epoch: {}, ".format(self.epoch) + msg += "step: {}, ".format(self.iteration) + msg += "batch : {}/{}, ".format(batch_index + 1, + len(self.train_loader)) + msg += "lr: {:>.8f}, ".format(self.lr_scheduler()) + msg += "data time: {:>.3f}s, ".format(dataload_time) + self.train_batch(batch_index, batch, msg) + data_start_time = time.time() + t = classification_report( + self.total_label_train, + self.total_predict_train, + target_names=self.punc_list) + self.logger.info(t) + except Exception as e: + self.logger.error(e) + raise e + + total_loss, F1_score = self.valid() + self.logger.info("Epoch {} Val info val_loss {}, F1_score {}". + format(self.epoch, total_loss, F1_score)) + if self.visualizer: + self.visualizer.add_scalars("epoch", { + "total_loss": total_loss, + "lr": self.lr_scheduler() + }, self.epoch) + + self.save( + tag=self.epoch, infos={"val_loss": total_loss, + "F1": F1_score}) + # step lr every epoch + self.lr_scheduler.step() + self.new_epoch() + + def run(self): + """The routine of the experiment after setup. This method is intended + to be used by the user. + """ + try: + self.train() + except KeyboardInterrupt: + self.save() + exit(-1) + finally: + self.destory() + self.logger.info("Training Done.") + + def setup_output_dir(self): + """Create a directory used for output. + """ + # output dir + output_dir = Path(self.args.output).expanduser() + output_dir.mkdir(parents=True, exist_ok=True) + + self.output_dir = output_dir + + def setup_checkpointer(self): + """Create a directory used to save checkpoints into. + + It is "checkpoints" inside the output directory. 
+        """
+        # checkpoint dir
+        self.checkpointer = Checkpoint(self.logger,
+                                       self.config["checkpoint"]["kbest_n"],
+                                       self.config["checkpoint"]["latest_n"])
+
+        checkpoint_dir = self.output_dir / "checkpoints"
+        checkpoint_dir.mkdir(exist_ok=True)
+
+        self.checkpoint_dir = checkpoint_dir
+
+    def setup_logger(self):
+        LOG_FORMAT = "%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s"
+        format_str = logging.Formatter(
+            '%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s'
+        )
+        logging.basicConfig(
+            filename=self.config["training"]["log_path"],
+            level=logging.INFO,
+            format=LOG_FORMAT)
+        self.logger = logging.getLogger(__name__)
+
+        self.logger.setLevel(logging.INFO)  # set the log level
+        sh = logging.StreamHandler()  # also log to the console
+        sh.setFormatter(format_str)  # set the console log format
+        self.logger.addHandler(sh)  # attach the handler to the logger
+
+        self.logger.info("Setup logger!")
+
+    @mp_tools.rank_zero_only
+    def destory(self):
+        """Close the visualizer to avoid hanging after training"""
+        # https://github.com/pytorch/fairseq/issues/2357
+        if self.visualizer:
+            self.visualizer.close()
+
+    @mp_tools.rank_zero_only
+    def setup_visualizer(self):
+        """Initialize a visualizer to log the experiment.
+
+        The visual log is saved in the output directory.
+
+        Notes
+        ------
+        Only the main process owns a visualizer. Writing to the same log
+        file from multiple visualizers in multiple processes may cause
+        unexpected behavior.
+        """
+        # visualizer
+        visualizer = SummaryWriter(logdir=str(self.output_dir))
+        self.visualizer = visualizer
+
+    @mp_tools.rank_zero_only
+    def dump_config(self):
+        """Save the configuration used for this experiment.
+
+        It is saved to ``config.yaml`` in the output directory at the
+        beginning of the experiment.
+ """ + with open(self.output_dir / "config.yaml", "wt") as f: + print(self.config, file=f) + + def train_batch(self, batch_index, batch_data, msg): + start = time.time() + + input, label = batch_data + label = paddle.reshape(label, shape=[-1]) + y, logit = self.model(input) + pred = paddle.argmax(logit, axis=1) + self.total_label_train.extend(label.numpy().tolist()) + self.total_predict_train.extend(pred.numpy().tolist()) + # self.total_predict.append(logit.numpy().tolist()) + # print('--after model----') + # # print(label.shape) + # # print(pred.shape) + # # print('--!!!!!!!!!!!!!----') + # print("self.total_label") + # print(self.total_label) + # print("self.total_predict") + # print(self.total_predict) + loss = self.crit(y, label) + + loss.backward() + layer_tools.print_grads(self.model, print_func=None) + self.optimizer.step() + self.optimizer.clear_grad() + iteration_time = time.time() - start + + losses_np = { + "train_loss": float(loss), + } + msg += "train time: {:>.3f}s, ".format(iteration_time) + msg += "batch size: {}, ".format(self.config["data"]["batch_size"]) + msg += ", ".join("{}: {:>.6f}".format(k, v) + for k, v in losses_np.items()) + self.logger.info(msg) + # print(msg) + + if dist.get_rank() == 0 and self.visualizer: + for k, v in losses_np.items(): + self.visualizer.add_scalar("train/{}".format(k), v, + self.iteration) + self.iteration += 1 + + @paddle.no_grad() + def valid(self): + self.logger.info( + f"Valid Total Examples: {len(self.valid_loader.dataset)}") + self.model.eval() + valid_losses = defaultdict(list) + num_seen_utts = 1 + total_loss = 0.0 + valid_total_label = [] + valid_total_predict = [] + for i, batch in enumerate(self.valid_loader): + input, label = batch + label = paddle.reshape(label, shape=[-1]) + y, logit = self.model(input) + pred = paddle.argmax(logit, axis=1) + valid_total_label.extend(label.numpy().tolist()) + valid_total_predict.extend(pred.numpy().tolist()) + loss = self.crit(y, label) + + if paddle.isfinite(loss): + num_utts = batch[1].shape[0] + num_seen_utts += num_utts + total_loss += float(loss) * num_utts + valid_losses["val_loss"].append(float(loss)) + + if (i + 1) % self.config["training"]["log_interval"] == 0: + valid_dump = {k: np.mean(v) for k, v in valid_losses.items()} + valid_dump["val_history_loss"] = total_loss / num_seen_utts + + # logging + msg = f"Valid: Rank: {dist.get_rank()}, " + msg += "epoch: {}, ".format(self.epoch) + msg += "step: {}, ".format(self.iteration) + msg += "batch : {}/{}, ".format(i + 1, len(self.valid_loader)) + msg += ", ".join("{}: {:>.6f}".format(k, v) + for k, v in valid_dump.items()) + self.logger.info(msg) + # print(msg) + + self.logger.info("Rank {} Val info val_loss {}".format( + dist.get_rank(), total_loss / num_seen_utts)) + # print("Rank {} Val info val_loss {} acc: {}".format( + # dist.get_rank(), total_loss / num_seen_utts, acc)) + F1_score = f1_score( + valid_total_label, valid_total_predict, average="macro") + return total_loss / num_seen_utts, F1_score + + def setup_model(self): + config = self.config + + model = DefinedClassifier[self.config["model_type"]]( + **self.config["model_params"]) + self.crit = DefinedLoss[self.config["loss_type"]](**self.config[ + "loss"]) if "loss_type" in self.config else DefinedLoss["ce"]() + + if self.parallel: + model = paddle.DataParallel(model) + + self.logger.info(f"{model}") + layer_tools.print_params(model, self.logger.info) + + lr_scheduler = paddle.optimizer.lr.ExponentialDecay( + learning_rate=config["training"]["lr"], + 
gamma=config["training"]["lr_decay"], + verbose=True) + optimizer = paddle.optimizer.Adam( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=paddle.regularizer.L2Decay( + config["training"]["weight_decay"])) + + self.model = model + self.optimizer = optimizer + self.lr_scheduler = lr_scheduler + self.logger.info("Setup model/criterion/optimizer/lr_scheduler!") + + def setup_dataloader(self): + print("setup_dataloader!!!") + config = self.config["data"].copy() + + print(config["batch_size"]) + + train_dataset = DefinedDataset[config["dataset_type"]]( + train_path=config["train_path"], **config["data_params"]) + dev_dataset = DefinedDataset[config["dataset_type"]]( + train_path=config["dev_path"], **config["data_params"]) + + # train_dataset = config["dataset_type"](os.path.join(config["save_path"], "train"), + # os.path.join(config["save_path"], config["vocab_file"]), + # os.path.join(config["save_path"], config["punc_file"]), + # config["seq_len"]) + + # dev_dataset = PuncDataset(os.path.join(config["save_path"], "dev"), + # os.path.join(config["save_path"], config["vocab_file"]), + # os.path.join(config["save_path"], config["punc_file"]), + # config["seq_len"]) + + # if self.parallel: + # batch_sampler = SortagradDistributedBatchSampler( + # train_dataset, + # batch_size=config["batch_size"], + # num_replicas=None, + # rank=None, + # shuffle=True, + # drop_last=True, + # sortagrad=config["sortagrad"], + # shuffle_method=config["shuffle_method"]) + # else: + # batch_sampler = SortagradBatchSampler( + # train_dataset, + # shuffle=True, + # batch_size=config["batch_size"], + # drop_last=True, + # sortagrad=config["sortagrad"], + # shuffle_method=config["shuffle_method"]) + + self.train_loader = DataLoader( + train_dataset, + num_workers=config["num_workers"], + batch_size=config["batch_size"]) + self.valid_loader = DataLoader( + dev_dataset, + batch_size=config["batch_size"], + shuffle=False, + drop_last=False, + num_workers=config["num_workers"]) + self.logger.info("Setup train/valid Dataloader!") + + +class Tester(Trainer): + def __init__(self, config, args): + super().__init__(config, args) + + @mp_tools.rank_zero_only + @paddle.no_grad() + def test(self): + self.logger.info( + f"Test Total Examples: {len(self.test_loader.dataset)}") + self.punc_list = [] + for i in range(len(self.test_loader.dataset.id2punc)): + self.punc_list.append(self.test_loader.dataset.id2punc[i]) + self.model.eval() + test_total_label = [] + test_total_predict = [] + with open(self.args.result_file, 'w') as fout: + for i, batch in enumerate(self.test_loader): + input, label = batch + label = paddle.reshape(label, shape=[-1]) + y, logit = self.model(input) + pred = paddle.argmax(logit, axis=1) + test_total_label.extend(label.numpy().tolist()) + test_total_predict.extend(pred.numpy().tolist()) + # print(type(logit)) + + # logging + msg = "Test: " + msg += "epoch: {}, ".format(self.epoch) + msg += "step: {}, ".format(self.iteration) + self.logger.info(msg) + # print(msg) + t = classification_report( + test_total_label, test_total_predict, target_names=self.punc_list) + print(t) + t2 = self.evaluation(test_total_label, test_total_predict) + print(t2) + + def evaluation(self, y_pred, y_test): + precision, recall, f1, _ = precision_recall_fscore_support( + y_test, y_pred, average=None, labels=[1, 2, 3]) + overall = precision_recall_fscore_support( + y_test, y_pred, average='macro', labels=[1, 2, 3]) + result = pd.DataFrame( + np.array([precision, recall, f1]), + columns=list(['O', 'COMMA', 
'PERIOD', 'QUESTION'])[1:], + index=['Precision', 'Recall', 'F1']) + result['OVERALL'] = overall[:3] + return result + + def run_test(self): + self.resume_or_scratch() + try: + self.test() + except KeyboardInterrupt: + exit(-1) + + def setup(self): + """Setup the experiment. + """ + paddle.set_device(self.args.device) + self.setup_logger() + self.setup_output_dir() + self.setup_checkpointer() + + self.setup_dataloader() + self.setup_model() + + self.iteration = 0 + self.epoch = 0 + + def setup_model(self): + config = self.config + model = DefinedClassifier[self.config["model_type"]]( + **self.config["model_params"]) + + self.model = model + self.logger.info("Setup model!") + + def setup_dataloader(self): + config = self.config["data"].copy() + + test_dataset = DefinedDataset[config["dataset_type"]]( + train_path=config["test_path"], **config["data_params"]) + + self.test_loader = DataLoader( + test_dataset, + batch_size=config["batch_size"], + shuffle=False, + drop_last=False) + self.logger.info("Setup test Dataloader!") + + def setup_output_dir(self): + """Create a directory used for output. + """ + # output dir + if self.args.output: + output_dir = Path(self.args.output).expanduser() + output_dir.mkdir(parents=True, exist_ok=True) + else: + output_dir = Path( + self.args.checkpoint_path).expanduser().parent.parent + output_dir.mkdir(parents=True, exist_ok=True) + + self.output_dir = output_dir + + def setup_logger(self): + LOG_FORMAT = "%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s" + format_str = logging.Formatter( + '%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s' + ) + logging.basicConfig( + filename=self.config["testing"]["log_path"], + level=logging.INFO, + format=LOG_FORMAT) + self.logger = logging.getLogger(__name__) + # self.logger = logging.getLogger(self.config["training"]["log_path"].strip().split('/')[-1].split('.')[0]) + + self.logger.setLevel(logging.INFO) #设置日志级别 + sh = logging.StreamHandler() #往屏幕上输出 + sh.setFormatter(format_str) #设置屏幕上显示的格式 + self.logger.addHandler(sh) #把对象加到logger里 + + self.logger.info('info') + print("setup test logger!!!") diff --git a/text_processing/speechtask/punctuation_restoration/utils/__init__.py b/text_processing/speechtask/punctuation_restoration/utils/__init__.py new file mode 100644 index 000000000..185a92b8d --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/utils/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/text_processing/speechtask/punctuation_restoration/utils/checkpoint.py b/text_processing/speechtask/punctuation_restoration/utils/checkpoint.py new file mode 100644 index 000000000..1ad4b5b36 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/utils/checkpoint.py @@ -0,0 +1,304 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import glob +import json +import os +import re +from pathlib import Path +from typing import Text +from typing import Union + +import paddle +from paddle import distributed as dist +from paddle.optimizer import Optimizer +from speechtask.punctuation_restoration.utils import mp_tools +# from speechtask.punctuation_restoration.utils.log import Log + +# logger = Log(__name__).getlog() + +__all__ = ["Checkpoint"] + + +class Checkpoint(): + def __init__(self, + logger, + kbest_n: int=5, + latest_n: int=1, + metric_type='val_loss'): + self.best_records: Mapping[Path, float] = {} + self.latest_records = [] + self.kbest_n = kbest_n + self.latest_n = latest_n + self._save_all = (kbest_n == -1) + self.logger = logger + self.metric_type = metric_type + + def add_checkpoint(self, + checkpoint_dir, + tag_or_iteration: Union[int, Text], + model: paddle.nn.Layer, + optimizer: Optimizer=None, + infos: dict=None): + """Save checkpoint in best_n and latest_n. + Args: + checkpoint_dir (str): the directory where checkpoint is saved. + tag_or_iteration (int or str): the latest iteration(step or epoch) number or tag. + model (Layer): model to be checkpointed. + optimizer (Optimizer, optional): optimizer to be checkpointed. + infos (dict or None)): any info you want to save. + metric_type (str, optional): metric type. Defaults to 'val_loss'. + """ + metric_type = self.metric_type + if (metric_type not in infos.keys()): + self._save_parameters(checkpoint_dir, tag_or_iteration, model, + optimizer, infos) + return + + #save best + if self._should_save_best(infos[metric_type]): + self._save_best_checkpoint_and_update( + infos[metric_type], checkpoint_dir, tag_or_iteration, model, + optimizer, infos) + #save latest + self._save_latest_checkpoint_and_update( + checkpoint_dir, tag_or_iteration, model, optimizer, infos) + + if isinstance(tag_or_iteration, int): + self._save_checkpoint_record(checkpoint_dir, tag_or_iteration) + + def load_parameters(self, + model, + optimizer=None, + checkpoint_dir=None, + checkpoint_path=None, + record_file="checkpoint_latest"): + """Load a last model checkpoint from disk. + Args: + model (Layer): model to load parameters. + optimizer (Optimizer, optional): optimizer to load states if needed. + Defaults to None. + checkpoint_dir (str, optional): the directory where checkpoint is saved. + checkpoint_path (str, optional): if specified, load the checkpoint + stored in the checkpoint_path(prefix) and the argument 'checkpoint_dir' will + be ignored. Defaults to None. + record_file "checkpoint_latest" or "checkpoint_best" + Returns: + configs (dict): epoch or step, lr and other meta info should be saved. 
+ """ + configs = {} + + if checkpoint_path is not None: + pass + elif checkpoint_dir is not None and record_file is not None: + # load checkpint from record file + checkpoint_record = os.path.join(checkpoint_dir, record_file) + iteration = self._load_checkpoint_idx(checkpoint_record) + if iteration == -1: + return configs + checkpoint_path = os.path.join(checkpoint_dir, + "{}".format(iteration)) + else: + raise ValueError( + "At least one of 'checkpoint_path' or 'checkpoint_dir' should be specified!" + ) + + rank = dist.get_rank() + + params_path = checkpoint_path + ".pdparams" + model_dict = paddle.load(params_path) + model.set_state_dict(model_dict) + self.logger.info( + "Rank {}: loaded model from {}".format(rank, params_path)) + + optimizer_path = checkpoint_path + ".pdopt" + if optimizer and os.path.isfile(optimizer_path): + optimizer_dict = paddle.load(optimizer_path) + optimizer.set_state_dict(optimizer_dict) + self.logger.info("Rank {}: loaded optimizer state from {}".format( + rank, optimizer_path)) + + info_path = re.sub('.pdparams$', '.json', params_path) + if os.path.exists(info_path): + with open(info_path, 'r') as fin: + configs = json.load(fin) + return configs + + def load_latest_parameters(self, + model, + optimizer=None, + checkpoint_dir=None, + checkpoint_path=None): + """Load a last model checkpoint from disk. + Args: + model (Layer): model to load parameters. + optimizer (Optimizer, optional): optimizer to load states if needed. + Defaults to None. + checkpoint_dir (str, optional): the directory where checkpoint is saved. + checkpoint_path (str, optional): if specified, load the checkpoint + stored in the checkpoint_path(prefix) and the argument 'checkpoint_dir' will + be ignored. Defaults to None. + Returns: + configs (dict): epoch or step, lr and other meta info should be saved. + """ + return self.load_parameters(model, optimizer, checkpoint_dir, + checkpoint_path, "checkpoint_latest") + + def load_best_parameters(self, + model, + optimizer=None, + checkpoint_dir=None, + checkpoint_path=None): + """Load a last model checkpoint from disk. + Args: + model (Layer): model to load parameters. + optimizer (Optimizer, optional): optimizer to load states if needed. + Defaults to None. + checkpoint_dir (str, optional): the directory where checkpoint is saved. + checkpoint_path (str, optional): if specified, load the checkpoint + stored in the checkpoint_path(prefix) and the argument 'checkpoint_dir' will + be ignored. Defaults to None. + Returns: + configs (dict): epoch or step, lr and other meta info should be saved. 
+ """ + return self.load_parameters(model, optimizer, checkpoint_dir, + checkpoint_path, "checkpoint_best") + + def _should_save_best(self, metric: float) -> bool: + if not self._best_full(): + return True + + # already full + worst_record_path = max(self.best_records, key=self.best_records.get) + # worst_record_path = max(self.best_records.iteritems(), key=operator.itemgetter(1))[0] + worst_metric = self.best_records[worst_record_path] + return metric < worst_metric + + def _best_full(self): + return (not self._save_all) and len(self.best_records) == self.kbest_n + + def _latest_full(self): + return len(self.latest_records) == self.latest_n + + def _save_best_checkpoint_and_update(self, metric, checkpoint_dir, + tag_or_iteration, model, optimizer, + infos): + # remove the worst + if self._best_full(): + worst_record_path = max(self.best_records, + key=self.best_records.get) + self.best_records.pop(worst_record_path) + if (worst_record_path not in self.latest_records): + self.logger.info( + "remove the worst checkpoint: {}".format(worst_record_path)) + self._del_checkpoint(checkpoint_dir, worst_record_path) + + # add the new one + self._save_parameters(checkpoint_dir, tag_or_iteration, model, + optimizer, infos) + self.best_records[tag_or_iteration] = metric + + def _save_latest_checkpoint_and_update( + self, checkpoint_dir, tag_or_iteration, model, optimizer, infos): + # remove the old + if self._latest_full(): + to_del_fn = self.latest_records.pop(0) + if (to_del_fn not in self.best_records.keys()): + self.logger.info( + "remove the latest checkpoint: {}".format(to_del_fn)) + self._del_checkpoint(checkpoint_dir, to_del_fn) + self.latest_records.append(tag_or_iteration) + + self._save_parameters(checkpoint_dir, tag_or_iteration, model, + optimizer, infos) + + def _del_checkpoint(self, checkpoint_dir, tag_or_iteration): + checkpoint_path = os.path.join(checkpoint_dir, + "{}".format(tag_or_iteration)) + for filename in glob.glob(checkpoint_path + ".*"): + os.remove(filename) + self.logger.info("delete file: {}".format(filename)) + + def _load_checkpoint_idx(self, checkpoint_record: str) -> int: + """Get the iteration number corresponding to the latest saved checkpoint. + Args: + checkpoint_path (str): the saved path of checkpoint. + Returns: + int: the latest iteration number. -1 for no checkpoint to load. + """ + if not os.path.isfile(checkpoint_record): + return -1 + + # Fetch the latest checkpoint index. + with open(checkpoint_record, "rt") as handle: + latest_checkpoint = handle.readlines()[-1].strip() + iteration = int(latest_checkpoint.split(":")[-1]) + return iteration + + def _save_checkpoint_record(self, checkpoint_dir: str, iteration: int): + """Save the iteration number of the latest model to be checkpoint record. + Args: + checkpoint_dir (str): the directory where checkpoint is saved. + iteration (int): the latest iteration number. 
+        Returns:
+            None
+        """
+        checkpoint_record_latest = os.path.join(checkpoint_dir,
+                                                "checkpoint_latest")
+        checkpoint_record_best = os.path.join(checkpoint_dir, "checkpoint_best")
+
+        with open(checkpoint_record_best, "w") as handle:
+            for i in self.best_records.keys():
+                handle.write("model_checkpoint_path:{}\n".format(i))
+        with open(checkpoint_record_latest, "w") as handle:
+            for i in self.latest_records:
+                handle.write("model_checkpoint_path:{}\n".format(i))
+
+    @mp_tools.rank_zero_only
+    def _save_parameters(self,
+                         checkpoint_dir: str,
+                         tag_or_iteration: Union[int, str],
+                         model: paddle.nn.Layer,
+                         optimizer: Optimizer=None,
+                         infos: dict=None):
+        """Checkpoint the latest trained model parameters.
+        Args:
+            checkpoint_dir (str): the directory where checkpoint is saved.
+            tag_or_iteration (int or str): the latest iteration (step or epoch) number.
+            model (Layer): model to be checkpointed.
+            optimizer (Optimizer, optional): optimizer to be checkpointed.
+                Defaults to None.
+            infos (dict or None): any info you want to save.
+        Returns:
+            None
+        """
+        checkpoint_path = os.path.join(checkpoint_dir,
+                                       "{}".format(tag_or_iteration))
+
+        model_dict = model.state_dict()
+        params_path = checkpoint_path + ".pdparams"
+        paddle.save(model_dict, params_path)
+        self.logger.info("Saved model to {}".format(params_path))
+
+        if optimizer:
+            opt_dict = optimizer.state_dict()
+            optimizer_path = checkpoint_path + ".pdopt"
+            paddle.save(opt_dict, optimizer_path)
+            self.logger.info(
+                "Saved optimizer state to {}".format(optimizer_path))
+
+        info_path = re.sub(r'\.pdparams$', '.json', params_path)
+        infos = {} if infos is None else infos
+        with open(info_path, 'w') as fout:
+            data = json.dumps(infos)
+            fout.write(data)
diff --git a/text_processing/speechtask/punctuation_restoration/utils/default_parser.py b/text_processing/speechtask/punctuation_restoration/utils/default_parser.py
new file mode 100644
index 000000000..b83d989d6
--- /dev/null
+++ b/text_processing/speechtask/punctuation_restoration/utils/default_parser.py
@@ -0,0 +1,74 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+
+
+def default_argument_parser():
+    r"""A simple yet general argument parser for experiments with parakeet.
+
+    This is used in examples with parakeet. And it is intended to be used by
+    other experiments with parakeet. It requires a minimal set of command line
+    arguments to start a training script.
+
+    The ``--config`` and ``--opts`` are used to overwrite the default
+    configuration.
+
+    The ``--data`` and ``--output`` specify the data path and output path.
+    Resuming training from existing progress at the output directory is the
+    intended default behavior.
+
+    The ``--checkpoint_path`` specifies the checkpoint to load from.
+
+    The ``--device`` and ``--nprocs`` specify how to run the training.
+
+
+    See Also
+    --------
+    parakeet.training.experiment
+
+    Returns
+    -------
+    argparse.ArgumentParser
+        the parser
+    """
+    parser = argparse.ArgumentParser()
+
+    # yapf: disable
+    # data and output
+    parser.add_argument("--config", metavar="FILE", help="path of the config file used to overwrite the default config.")
+    parser.add_argument("--dump-config", metavar="FILE", help="dump config to yaml file.")
+    # parser.add_argument("--data", metavar="DATA_DIR", help="path to the dataset.")
+    parser.add_argument("--output", metavar="OUTPUT_DIR", help="path to save checkpoint and logs.")
+
+    # load from saved checkpoint
+    parser.add_argument("--checkpoint_path", type=str, help="path of the checkpoint to load")
+
+    # save jit model to
+    parser.add_argument("--export_path", type=str, help="path of the jit model to save")
+
+    # save asr result to
+    parser.add_argument("--result_file", type=str, help="path to save the asr result")
+
+    # running
+    parser.add_argument("--device", type=str, default='gpu', choices=["cpu", "gpu"],
+                        help="device type to use, cpu and gpu are supported.")
+    parser.add_argument("--nprocs", type=int, default=1, help="number of parallel processes to use.")
+
+    # overwrite extra config and default config
+    # parser.add_argument("--opts", nargs=argparse.REMAINDER,
+    #                     help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
+    parser.add_argument("--opts", type=str, default=[], nargs='+',
+                        help="options to overwrite --config file and the default config, passing in KEY VALUE pairs")
+    # yapf: enable
+
+    return parser
diff --git a/text_processing/speechtask/punctuation_restoration/utils/layer_tools.py b/text_processing/speechtask/punctuation_restoration/utils/layer_tools.py
new file mode 100644
index 000000000..fb076c0c7
--- /dev/null
+++ b/text_processing/speechtask/punctuation_restoration/utils/layer_tools.py
@@ -0,0 +1,88 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import numpy as np +from paddle import nn + +__all__ = [ + "summary", "gradient_norm", "freeze", "unfreeze", "print_grads", + "print_params" +] + + +def summary(layer: nn.Layer, print_func=print): + if print_func is None: + return + num_params = num_elements = 0 + for name, param in layer.state_dict().items(): + if print_func: + print_func( + "{} | {} | {}".format(name, param.shape, np.prod(param.shape))) + num_elements += np.prod(param.shape) + num_params += 1 + if print_func: + num_elements = num_elements / 1024**2 + print_func( + f"Total parameters: {num_params}, {num_elements:.2f}M elements.") + + +def print_grads(model, print_func=print): + if print_func is None: + return + for n, p in model.named_parameters(): + msg = f"param grad: {n}: shape: {p.shape} grad: {p.grad}" + print_func(msg) + + +def print_params(model, print_func=print): + if print_func is None: + return + total = 0.0 + num_params = 0.0 + for n, p in model.named_parameters(): + msg = f"{n} | {p.shape} | {np.prod(p.shape)} | {not p.stop_gradient}" + total += np.prod(p.shape) + num_params += 1 + if print_func: + print_func(msg) + if print_func: + total = total / 1024**2 + print_func(f"Total parameters: {num_params}, {total:.2f}M elements.") + + +def gradient_norm(layer: nn.Layer): + grad_norm_dict = {} + for name, param in layer.state_dict().items(): + if param.trainable: + grad = param.gradient() # return numpy.ndarray + grad_norm_dict[name] = np.linalg.norm(grad) / grad.size + return grad_norm_dict + + +def recursively_remove_weight_norm(layer: nn.Layer): + for layer in layer.sublayers(): + try: + nn.utils.remove_weight_norm(layer) + except ValueError as e: + # ther is not weight norm hoom in this layer + pass + + +def freeze(layer: nn.Layer): + for param in layer.parameters(): + param.trainable = False + + +def unfreeze(layer: nn.Layer): + for param in layer.parameters(): + param.trainable = True diff --git a/text_processing/speechtask/punctuation_restoration/utils/mp_tools.py b/text_processing/speechtask/punctuation_restoration/utils/mp_tools.py new file mode 100644 index 000000000..d3e25aab6 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/utils/mp_tools.py @@ -0,0 +1,30 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from functools import wraps + +from paddle import distributed as dist + +__all__ = ["rank_zero_only"] + + +def rank_zero_only(func): + @wraps(func) + def wrapper(*args, **kwargs): + rank = dist.get_rank() + if rank != 0: + return + result = func(*args, **kwargs) + return result + + return wrapper diff --git a/text_processing/speechtask/punctuation_restoration/utils/punct_pre.py b/text_processing/speechtask/punctuation_restoration/utils/punct_pre.py new file mode 100644 index 000000000..7f1431829 --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/utils/punct_pre.py @@ -0,0 +1,163 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import shutil + +CHINESE_PUNCTUATION_MAPPING = { + 'O': '', + ',': ",", + '。': '。', + '?': '?', +} + + +def process_one_file_chinese(raw_path, save_path): + f = open(raw_path, 'r', encoding='utf-8') + save_file = open(save_path, 'w', encoding='utf-8') + for line in f.readlines(): + line = line.strip().replace(' ', '').replace(' ', '') + for i in line: + save_file.write(i + ' ') + save_file.write('\n') + save_file.close() + + +def process_chinese_pure_senetence(config): + ####need raw_path, raw_train_file, raw_dev_file, raw_test_file, punc_file, save_path + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_train_file"])), "train file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_dev_file"])), "dev file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_test_file"])), "test file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "punc_file"])), "punc file doesn't exist." + + train_file = os.path.join(config["raw_path"], config["raw_train_file"]) + dev_file = os.path.join(config["raw_path"], config["raw_dev_file"]) + test_file = os.path.join(config["raw_path"], config["raw_test_file"]) + if not os.path.exists(config["save_path"]): + os.makedirs(config["save_path"]) + + shutil.copy( + os.path.join(config["raw_path"], config["punc_file"]), + os.path.join(config["save_path"], config["punc_file"])) + + process_one_file_chinese(train_file, + os.path.join(config["save_path"], "train")) + process_one_file_chinese(dev_file, os.path.join(config["save_path"], "dev")) + process_one_file_chinese(test_file, + os.path.join(config["save_path"], "test")) + + +def process_one_chinese_pair(raw_path, save_path): + + f = open(raw_path, 'r', encoding='utf-8') + save_file = open(save_path, 'w', encoding='utf-8') + for line in f.readlines(): + if (len(line.strip().split()) == 2): + word, punc = line.strip().split() + save_file.write(word + ' ' + CHINESE_PUNCTUATION_MAPPING[punc]) + if (punc == "。"): + save_file.write("\n") + else: + save_file.write(" ") + save_file.close() + + +def process_chinese_pair(config): + ### need raw_path, raw_train_file, raw_dev_file, raw_test_file, punc_file, save_path + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_train_file"])), "train file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_dev_file"])), "dev file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_test_file"])), "test file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "punc_file"])), "punc file doesn't exist." 
+ + train_file = os.path.join(config["raw_path"], config["raw_train_file"]) + dev_file = os.path.join(config["raw_path"], config["raw_dev_file"]) + test_file = os.path.join(config["raw_path"], config["raw_test_file"]) + + process_one_chinese_pair(train_file, + os.path.join(config["save_path"], "train")) + process_one_chinese_pair(dev_file, os.path.join(config["save_path"], "dev")) + process_one_chinese_pair(test_file, + os.path.join(config["save_path"], "test")) + + shutil.copy( + os.path.join(config["raw_path"], config["punc_file"]), + os.path.join(config["save_path"], config["punc_file"])) + + +english_punc = [',', '.', '?'] +ignore_english_punc = ['\"', '/'] + + +def process_one_file_english(raw_path, save_path): + f = open(raw_path, 'r', encoding='utf-8') + save_file = open(save_path, 'w', encoding='utf-8') + for line in f.readlines(): + for i in ignore_english_punc: + line = line.replace(i, '') + for i in english_punc: + line = line.replace(i, ' ' + i) + wordlist = line.strip().split(' ') + # print(type(wordlist)) + # print(wordlist) + for i in wordlist: + save_file.write(i + ' ') + save_file.write('\n') + save_file.close() + + +def process_english_pure_senetence(config): + ####need raw_path, raw_train_file, raw_dev_file, raw_test_file, punc_file, save_path + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_train_file"])), "train file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_dev_file"])), "dev file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "raw_test_file"])), "test file doesn't exist." + assert os.path.exists( + os.path.join(config["raw_path"], config[ + "punc_file"])), "punc file doesn't exist." + + train_file = os.path.join(config["raw_path"], config["raw_train_file"]) + dev_file = os.path.join(config["raw_path"], config["raw_dev_file"]) + test_file = os.path.join(config["raw_path"], config["raw_test_file"]) + if not os.path.exists(config["save_path"]): + os.makedirs(config["save_path"]) + + shutil.copy( + os.path.join(config["raw_path"], config["punc_file"]), + os.path.join(config["save_path"], config["punc_file"])) + + process_one_file_english(train_file, + os.path.join(config["save_path"], "train")) + process_one_file_english(dev_file, os.path.join(config["save_path"], "dev")) + process_one_file_english(test_file, + os.path.join(config["save_path"], "test")) diff --git a/text_processing/speechtask/punctuation_restoration/utils/utility.py b/text_processing/speechtask/punctuation_restoration/utils/utility.py new file mode 100644 index 000000000..64570026b --- /dev/null +++ b/text_processing/speechtask/punctuation_restoration/utils/utility.py @@ -0,0 +1,81 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Contains common utility functions.""" +import distutils.util +import math +import os +from typing import List + +__all__ = ['print_arguments', 'add_arguments', "log_add"] + + +def print_arguments(args, info=None): + """Print argparse's arguments. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. + :type args: argparse.Namespace + """ + filename = "" + if info: + filename = info["__file__"] + filename = os.path.basename(filename) + print(f"----------- {filename} Configuration Arguments -----------") + for arg, value in sorted(vars(args).items()): + print("%s: %s" % (arg, value)) + print("-----------------------------------------------------------") + + +def add_arguments(argname, type, default, help, argparser, **kwargs): + """Add argparse's argument. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + add_argument("name", str, "Jonh", "User name.", parser) + args = parser.parse_args() + """ + type = distutils.util.strtobool if type == bool else type + argparser.add_argument( + "--" + argname, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def log_add(args: List[int]) -> float: + """Stable log add + + Args: + args (List[int]): log scores + + Returns: + float: sum of log scores + """ + if all(a == -float('inf') for a in args): + return -float('inf') + a_max = max(args) + lsp = math.log(sum(math.exp(a - a_max) for a in args)) + return a_max + lsp
[README.md model-list tables (HTML) in this diff: link targets updated, anchor text unchanged. Rows affected:
Acoustic Model / Aishell / 2 Conv + 5 LSTM layers with only forward direction / Ds2 Online Aishell Model;
Text Frontend / chinese-fronted;
Tacotron2 / LJSpeech / tacotron2-vctk;
TransformerTTS / transformer-ljspeech;
SpeedySpeech / CSMSC / speedyspeech-csmsc;
FastSpeech2 / AISHELL-3 / fastspeech2-aishell3; VCTK / fastspeech2-vctk; LJSpeech / fastspeech2-ljspeech; CSMSC / fastspeech2-csmsc;
WaveFlow / LJSpeech / waveflow-ljspeech;
Parallel WaveGAN / LJSpeech / PWGAN-ljspeech; VCTK / PWGAN-vctk; CSMSC / PWGAN-csmsc;
GE2E / AISHELL-3, etc. / ge2e;
GE2E + Tacotron2 / AISHELL-3 / ge2e-tactron2-aishell3]