Merge branch 'develop' into fix_whisper

pull/3880/head
yinfan98 10 months ago
commit c328cfbd3f

@@ -2,7 +2,8 @@
 * asr0 - deepspeech2 Streaming/Non-Streaming
 * asr1 - transformer/conformer Streaming/Non-Streaming
-* asr2 - transformer/conformer Streaming/Non-Streaming with Kaldi feature
+* ~~asr2 - transformer/conformer Streaming/Non-Streaming with Kaldi feature~~
+* asr3 - wav2vec2 Non-Streaming
 ## Data

@@ -103,12 +103,19 @@ If you want to train the model, you can use the script below to execute stage 0
 ```bash
 bash run.sh --stage 0 --stop_stage 1
 ```
-or you can run these scripts in the command line (only use CPU).
+Or you can run these scripts in the command line (only use CPU).
 ```bash
 source path.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
 ```
+If you want to use GPU, you can run these scripts in the command line (suppose you have only 1 GPU).
+```bash
+source path.sh
+bash ./local/data.sh
+CUDA_VISIBLE_DEVICES=0 ./local/train.sh conf/deepspeech2.yaml deepspeech2
+```
 ## Stage 2: Top-k Models Averaging
 After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
 ```bash
@@ -148,7 +155,7 @@ source path.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
 avg.sh best exp/deepspeech2/checkpoints 1
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_10
 ```
 ## Pretrained Model
 You can get the pretrained models from [this](../../../docs/source/released_model.md).
@@ -157,14 +164,14 @@ using the `tar` scripts to unpack the model and then you can use the script to t
 For example:
 ```
-wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
-tar xzvf asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz
+tar xzvf asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz
 source path.sh
 # If you have process the data and get the manifest file you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
 bash local/data.sh --stage 2 --stop_stage 2
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_10
 ```
 The performance of the released models are shown in [this](./RESULTS.md)
 ## Stage 4: Static graph model Export
@@ -178,7 +185,7 @@ This stage is to transform dygraph to static graph.
 If you already have a dynamic graph model, you can run this script:
 ```bash
 source path.sh
-./local/export.sh deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 exp/deepspeech2/checkpoints/avg_1.jit offline
+./local/export.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_10 exp/deepspeech2/checkpoints/avg_10.jit
 ```
 ## Stage 5: Static graph Model Testing
 Similar to stage 3, the static graph model can also be tested.
@@ -190,7 +197,7 @@ Similar to stage 3, the static graph model can also be tested.
 ```
 If you already have exported the static graph, you can run this script:
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_export.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1.jit offline
+CUDA_VISIBLE_DEVICES= ./local/test_export.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_10.jit
 ```
 ## Stage 6: Single Audio File Inference
 In some situations, you want to use the trained model to do the inference for the single audio file. You can use stage 5. The code is shown below
@@ -202,8 +209,8 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
 ```
 you can train the model by yourself, or you can download the pretrained model by the script below:
 ```bash
-wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
-tar xzvf asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz
+tar xzvf asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz
 ```
 You can download the audio demo:
 ```bash
@@ -211,5 +218,5 @@ wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wa
 ```
 You need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of the audio demo by running the script below.
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_1 data/demo_01_03.wav
+CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_10 data/demo_01_03.wav
 ```
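For readers who want to see what the "Top-k Models Averaging" stage described in the README hunks above boils down to, here is a hedged Python sketch of parameter averaging over `.pdparams` checkpoints. The checkpoint file names are invented, and this is only an illustration of the idea, not the actual `avg.sh` implementation.

```python
# Hedged sketch of top-k checkpoint averaging; not the repo's avg.sh implementation.
import paddle


def average_checkpoints(ckpt_paths):
    """Average the parameters of several .pdparams checkpoints."""
    avg = None
    for path in ckpt_paths:
        state = paddle.load(path)  # dict: parameter name -> Tensor
        if avg is None:
            avg = {k: v.astype('float64') for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] = avg[k] + v.astype('float64')
    n = len(ckpt_paths)
    return {k: (v / n).astype('float32') for k, v in avg.items()}


# Hypothetical usage mirroring `avg.sh best exp/deepspeech2/checkpoints 1`:
merged = average_checkpoints(["exp/deepspeech2/checkpoints/epoch_50.pdparams"])
paddle.save(merged, "exp/deepspeech2/checkpoints/avg_1.pdparams")
```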

@@ -144,7 +144,7 @@ source path.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
 avg.sh best exp/deepspeech2/checkpoints 1
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_1
 ```
 ## Stage 4: Static graph model Export
 This stage is to transform dygraph to static graph.
@@ -185,5 +185,5 @@ wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.w
 ```
 You can train a model by yourself, then you need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of the audio demo by running the script below.
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 data/demo_002_en.wav
+CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_1 data/demo_002_en.wav
 ```

@@ -148,7 +148,7 @@ or you can run these scripts in the command line (only use CPU).
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 20
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
 ```
 ## Pretrained Model
 You can get the pretrained transformer or conformer from [this](../../../docs/source/released_model.md).
@@ -163,7 +163,7 @@ source path.sh
 # If you have process the data and get the manifest file you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
 bash local/data.sh --stage 2 --stop_stage 2
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
 ```
 The performance of the released models are shown in [here](./RESULTS.md).
@@ -192,8 +192,8 @@ bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 20
 # test stage is optional
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
-CUDA_VISIBLE_DEVICES= ./local/align.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/align.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
 ```
 ## Stage 5: Single Audio File Inference
 In some situations, you want to use the trained model to do the inference for the single audio file. You can use stage 5. The code is shown below
@@ -214,5 +214,5 @@ wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.w
 ```
 You need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of the audio demo by running the script below.
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20 data/demo_002_en.wav
+CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20 data/demo_002_en.wav
 ```

@@ -6,6 +6,15 @@ This example contains code used to train a [DiffSinger](https://arxiv.org/abs/21
 ### Download and Extract
 Download Opencpop from it's [Official Website](https://wenet.org.cn/opencpop/download/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/Opencpop`.
+### pip install
+<!-- Comment: Cause ppdiffusers will install newest huggingface_hub, but cached_download function has been removed, So need to install the specified version.>
+<!-- TODO: If the corresponding dependency library is OK, it needs to be deleted.-->
+```shell
+pip install huggingface_hub==0.25.2
+```
 ## Get Started
 Assume the path to the dataset is `~/datasets/Opencpop`.
 Run the command below to
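As an optional, hedged convenience for the `huggingface_hub` pin introduced in the hunk above: the version string comes from the README change itself, but the runtime check below is only a suggestion, not part of the patch.

```python
# Hedged helper: verify the pinned huggingface_hub version before preprocessing.
import importlib.metadata

version = importlib.metadata.version("huggingface_hub")
if version != "0.25.2":
    raise RuntimeError(
        f"Expected huggingface_hub==0.25.2 (see the pip install step above), got {version}")
```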

@@ -7,6 +7,13 @@
 ### 下载并解压
 从 [官方网站](https://wenet.org.cn/opencpop/download/) 下载数据集
+### pip 安装
+<!-- 注释: 因为ppdiffusion会安装最新的huggingface_hub，但cached_download功能已被删除，所以需要安装指定的版本。>
+<!-- 待完成: 如果相应的依赖库正常，则将其删除。-->
+```shell
+pip install huggingface_hub==0.25.2
+```
 ## 开始
 假设数据集的路径是 `~/datasets/Opencpop`.
 运行下面的命令会进行如下操作：

@@ -16,5 +16,5 @@ python3 test_g2p.py --input-dir=data/g2p --output-dir=exp/g2p
 # whether use sclite to get more detail information of WER
 if [ "$USE_SCLITE" = true ];then
     echo "Start sclite g2p ..."
-    ${MAIN_ROOT}/tools/sctk/bin/sclite -i wsj -r ./exp/g2p/text.ref.clean trn -h ./exp/g2p/text.g2p trn -e utf-8 -o all
+    ${MAIN_ROOT}/tools/extras/sctk/bin/sclite -i wsj -r ./exp/g2p/text.ref.clean trn -h ./exp/g2p/text.g2p trn -e utf-8 -o all
 fi

@@ -27,7 +27,6 @@ The document below will describe the scripts in `run.sh` in detail.
 The path.sh contains the environment variables.
 ```bash
 . ./path.sh
-. ./cmd.sh
 ```
 This script needs to be run first. And another script is also needed:
 ```bash
@@ -67,7 +66,6 @@ bash run.sh --stage 0 --stop_stage 0
 You can also just run these scripts in your command line.
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 ```
 After processing the data, the `data` directory will look like this:
@@ -103,7 +101,6 @@ bash run.sh --stage 0 --stop_stage 1
 or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 ```
@@ -124,7 +121,6 @@ or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 10
@@ -144,11 +140,10 @@ bash run.sh --stage 0 --stop_stage 3
 or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 10
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_10
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_10
 ```
 ## Pretrained Model
 You can get the pretrained transformer or conformer from [this](../../../docs/source/released_model.md).
@@ -163,7 +158,7 @@ source path.sh
 # If you have process the data and get the manifest file you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
 bash local/data.sh --stage 2 --stop_stage 2
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_10
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_10
 ```
 The performance of the released models are shown in [here](./RESULTS.md).
@@ -186,5 +181,5 @@ wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wa
 ```
 You need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of the audio demo by running the script below.
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml exp/conformer/checkpoints/avg_10 data/demo_01_03.wav
+CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_10 data/demo_01_03.wav
 ```

@@ -30,5 +30,5 @@ TESS音频情绪分类任务。
 $ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns_mfcc.yaml
 $ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns_logmelspectrogram.yaml
 $ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns_melspectrogram.yaml
-$ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns_pectrogram.yaml
+$ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns_spectrogram.yaml
 ```

@@ -51,7 +51,7 @@ You can set the local variables (except `ckpt`) when you use the `run.sh`
 For example, you can set the `gpus` and `avg_num` when you use the command line.:
 ```bash
-bash run.sh --gpus 0,1 --avg_num 20
+bash run.sh --gpus 0,1 --avg_num 1
 ```
 ## Stage 0: Data processing
 To use this example, you need to process data firstly and you can use stage 0 in the `run.sh` to do this. The code is shown below:
@@ -134,7 +134,7 @@ The test stage is to evaluate the model performance. The code of the test stage
 ```bash
 if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
     # test ckpt avg_n
-    CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/test.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt}|| exit -1
 fi
 ```
 If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3 :
@@ -147,7 +147,7 @@ source path.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
 avg.sh best exp/deepspeech2/checkpoints 1
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_1
 ```
 ## Stage 4: Static graph model Export
 This stage is to transform dygraph to static graph.

@@ -26,7 +26,6 @@ The document below will describe the scripts in ```run.sh```in detail.
 The path.sh contains the environment variables.
 ```bash
 . ./path.sh
-. ./cmd.sh
 ```
 This script needs to be run first. And another script is also needed:
 ```bash
@@ -64,7 +63,6 @@ bash run.sh --stage 0 --stop_stage 0
 You can also just run these scripts in your command line.
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 ```
 After processing the data, the ``data`` directory will look like this:
@@ -100,7 +98,6 @@ bash run.sh --stage 0 --stop_stage 1
 or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
 ```## Stage 2: Top-k Models Averaging
@@ -119,7 +116,6 @@ bash run.sh --stage 0 --stop_stage 2
 or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
 avg.sh best exp/transformer/checkpoints 1
@@ -139,7 +135,6 @@ bash run.sh --stage 0 --stop_stage 3
 or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
 avg.sh best exp/transformer/checkpoints 1
@@ -166,7 +161,6 @@ bash run.sh --stage 4 --stop_stage 4
 or you can also use these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
 avg.sh best exp/transformer/checkpoints 1

@@ -13,3 +13,7 @@
 # limitations under the License.
 import _locale
 _locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])

@@ -177,8 +177,9 @@ def th_accuracy(pad_outputs: paddle.Tensor,
     Returns:
         float: Accuracy value (0.0 - 1.0).
     """
-    pad_pred = pad_outputs.view(pad_targets.shape[0], pad_targets.shape[1],
-                                pad_outputs.shape[1]).argmax(2)
+    pad_pred = pad_outputs.reshape(
+        [pad_targets.shape[0], pad_targets.shape[1],
+         pad_outputs.shape[1]]).argmax(2)
     mask = pad_targets != ignore_label
     #TODO(Hui Zhang): sum not support bool type
     # numerator = paddle.sum(
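To make the `.view` to `.reshape` migration in `th_accuracy` concrete, here is a hedged, self-contained sketch of the same shape handling; the tensor sizes and the `ignore_label` value are invented for illustration.

```python
# Toy illustration of the reshape-based accuracy computation (sizes are made up).
import paddle

batch, tmax, vocab = 2, 5, 10
pad_outputs = paddle.rand([batch * tmax, vocab])       # flattened decoder logits
pad_targets = paddle.randint(0, vocab, [batch, tmax])  # padded target ids
ignore_label = -1

# paddle.Tensor.reshape takes the target shape as a list, unlike the
# torch-style .view(*sizes) call that this commit removes.
pad_pred = pad_outputs.reshape([batch, tmax, vocab]).argmax(2)

mask = pad_targets != ignore_label
correct = paddle.logical_and(pad_pred == pad_targets, mask)
accuracy = float(correct.astype('int64').sum()) / float(mask.astype('int64').sum())
print(accuracy)
```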

@@ -24,7 +24,7 @@ from scipy.special import softmax
 # yapf: disable
 parser = argparse.ArgumentParser()
 parser.add_argument("--model_dir", type=str, required=True, default="./export", help="The directory to static model.")
-parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
+parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'gcu'], default="gpu", help="Select which device to train model, defaults to gpu.")
 parser.add_argument("--wav", type=str, required=True, help="Audio file to infer.")
 parser.add_argument("--batch_size", type=int, default=1, help="Batch size per GPU/CPU for training.")
 parser.add_argument('--use_tensorrt', type=eval, default=False, choices=[True, False], help='Enable to use tensorrt to speed up.')

@@ -32,9 +32,6 @@ def main(config, args):
 if __name__ == "__main__":
     parser = default_argument_parser()
-    # save jit model to
-    parser.add_argument(
-        "--export_path", type=str, help="path of the jit model to save")
     args = parser.parse_args()
     print_arguments(args)

@@ -32,9 +32,6 @@ def main(config, args):
 if __name__ == "__main__":
     parser = default_argument_parser()
-    # save asr result to
-    parser.add_argument(
-        "--result_file", type=str, help="path of save the asr result")
     args = parser.parse_args()
     print_arguments(args, globals())

@@ -32,12 +32,6 @@ def main(config, args):
 if __name__ == "__main__":
     parser = default_argument_parser()
-    # save asr result to
-    parser.add_argument(
-        "--result_file", type=str, help="path of save the asr result")
-    #load jit model from
-    parser.add_argument(
-        "--export_path", type=str, help="path of the jit model to save")
     parser.add_argument(
         "--enable-auto-log", action="store_true", help="use auto log")
     args = parser.parse_args()

@@ -75,7 +75,7 @@ class DeepSpeech2Tester_hub():
         feat = self.preprocessing(audio, **self.preprocess_args)
         logger.info(f"feat shape: {feat.shape}")
-        audio_len = paddle.to_tensor(feat.shape[0])
+        audio_len = paddle.to_tensor(feat.shape[0]).unsqueeze(0)
         audio = paddle.to_tensor(feat, dtype='float32').unsqueeze(axis=0)
         result_transcripts = self.compute_result_transcripts(
@@ -171,10 +171,6 @@ def main(config, args):
 if __name__ == "__main__":
     parser = default_argument_parser()
-    parser.add_argument("--audio_file", type=str, help='audio file path')
-    # save asr result to
-    parser.add_argument(
-        "--result_file", type=str, help="path of save the asr result")
     args = parser.parse_args()
     print_arguments(args, globals())
     if not os.path.isfile(args.audio_file):

@@ -335,7 +335,12 @@ class DeepSpeech2Tester(DeepSpeech2Trainer):
             self.test_loader, self.config, self.args.checkpoint_path)
         infer_model.eval()
         static_model = infer_model.export()
+        try:
             logger.info(f"Export code: {static_model.forward.code}")
+        except:
+            logger.info(
+                f"Fail to print Export code, static_model.forward.code can not be run."
+            )
         paddle.jit.save(static_model, self.args.export_path)
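For context on the `try/except` added around the export log, here is a rough, generic sketch of a dygraph-to-static export with `paddle.jit.to_static` and `paddle.jit.save`; the toy layer and output path are assumptions and do not reflect DeepSpeech2's real `export()` method.

```python
# Generic dygraph-to-static export sketch (toy model, hypothetical path).
import paddle
from paddle.static import InputSpec


class ToyModel(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(16, 4)

    def forward(self, x):
        return paddle.nn.functional.softmax(self.fc(x), axis=-1)


model = ToyModel()
model.eval()
static_model = paddle.jit.to_static(
    model, input_spec=[InputSpec(shape=[None, 16], dtype='float32')])
try:
    # Printing the transcribed program can fail for some models, which is why
    # the hunk above wraps the same logging call in try/except.
    print(static_model.forward.code)
except Exception:
    print("forward.code not available for this model")
paddle.jit.save(static_model, "exp/toy/checkpoints/avg_1.jit")
```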

@@ -18,7 +18,7 @@ from yacs.config import CfgNode
 from paddlespeech.s2t.exps.hubert.model import HubertASRTester as Tester
 from paddlespeech.s2t.training.cli import default_argument_parser
-from paddlespeech.s2t.utils.utility import print_arguments
+from paddlespeech.utils.argparse import print_arguments
 def main_sp(config, args):

@@ -19,7 +19,7 @@ from yacs.config import CfgNode
 from paddlespeech.s2t.exps.hubert.model import HubertASRTrainer as Trainer
 from paddlespeech.s2t.training.cli import default_argument_parser
-from paddlespeech.s2t.utils.utility import print_arguments
+from paddlespeech.utils.argparse import print_arguments
 def main_sp(config, args):

@@ -75,7 +75,7 @@ class U2Infer():
             feat = self.preprocessing(audio, **self.preprocess_args)
             logger.info(f"feat shape: {feat.shape}")
-            ilen = paddle.to_tensor(feat.shape[0])
+            ilen = paddle.to_tensor(feat.shape[0]).unsqueeze(0)
             xs = paddle.to_tensor(feat, dtype='float32').unsqueeze(0)
             decode_config = self.config.decode
             logger.info(f"decode cfg: {decode_config}")

@@ -78,7 +78,7 @@ class U2Infer():
             if self.args.debug:
                 np.savetxt("feat.transform.txt", feat)
-            ilen = paddle.to_tensor(feat.shape[0])
+            ilen = paddle.to_tensor(feat.shape[0]).unsqueeze(0)
             xs = paddle.to_tensor(feat, dtype='float32').unsqueeze(0)
             decode_config = self.config.decode
             logger.info(f"decode cfg: {decode_config}")

@@ -37,8 +37,6 @@ if __name__ == "__main__":
     # save asr result to
     parser.add_argument(
         '--dict-path', type=str, default=None, help='dict path.')
-    parser.add_argument(
-        "--result_file", type=str, help="path of save the asr result")
     args = parser.parse_args()
     print_arguments(args, globals())

@@ -104,11 +104,6 @@ def main(config, args):
 if __name__ == "__main__":
     parser = default_argument_parser()
-    # save asr result to
-    parser.add_argument(
-        "--result_file", type=str, help="path of save the asr result")
-    parser.add_argument(
-        "--audio_file", type=str, help="path of the input audio file")
     args = parser.parse_args()
     config = CfgNode(new_allowed=True)

@@ -84,13 +84,13 @@ class HubertASR(nn.Layer):
     def forward(self, wav, wavs_lens_rate, target, target_lens):
         if self.normalize_wav:
-            wav = F.layer_norm(wav, wav.shape)
+            wav = F.layer_norm(wav, wav.shape[1:])
         # Extract wav2vec output
         out = self.hubert.extract_features(wav)[0]
         # We normalize the output if required
         if self.output_norm:
-            out = F.layer_norm(out, out.shape)
+            out = F.layer_norm(out, out.shape[1:])
         if self.training and hasattr(self.config, 'spec_augment'):
             feats = self.spec_augment(out)
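The `wav.shape` to `wav.shape[1:]` change above narrows `F.layer_norm`'s `normalized_shape` so that only the non-batch dimensions are normalized. A minimal sketch with invented shapes:

```python
# Minimal sketch of layer_norm over non-batch dims; shapes are arbitrary.
import paddle
import paddle.nn.functional as F

wav = paddle.randn([4, 16000])            # [batch, samples]
# Passing wav.shape (i.e. [4, 16000]) would treat the batch axis as a feature
# axis; wav.shape[1:] normalizes each utterance independently.
normed = F.layer_norm(wav, wav.shape[1:])
print(normed.shape)                       # [4, 16000]
```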

@@ -190,7 +190,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
         r_loss_att = self.criterion_att(r_decoder_out, r_ys_out_pad)
         loss_att = loss_att * (1 - reverse_weight) + r_loss_att * reverse_weight
         acc_att = th_accuracy(
-            decoder_out.view(-1, self.vocab_size),
+            decoder_out.reshape([-1, self.vocab_size]),
             ys_out_pad,
             ignore_label=self.ignore_id, )
         return loss_att, acc_att
@@ -271,11 +271,13 @@ class U2BaseModel(ASRInterface, nn.Layer):
         maxlen = encoder_out.shape[1]
         encoder_dim = encoder_out.shape[2]
         running_size = batch_size * beam_size
-        encoder_out = encoder_out.unsqueeze(1).repeat(1, beam_size, 1, 1).view(
-            running_size, maxlen, encoder_dim)  # (B*N, maxlen, encoder_dim)
+        encoder_out = encoder_out.unsqueeze(1).repeat(
+            1, beam_size, 1, 1).reshape(
+                [running_size, maxlen,
+                 encoder_dim])  # (B*N, maxlen, encoder_dim)
         encoder_mask = encoder_mask.unsqueeze(1).repeat(
-            1, beam_size, 1, 1).view(running_size, 1,
-                                     maxlen)  # (B*N, 1, max_len)
+            1, beam_size, 1, 1).reshape([running_size, 1,
+                                         maxlen])  # (B*N, 1, max_len)
         hyps = paddle.ones(
             [running_size, 1], dtype=paddle.long).fill_(self.sos)  # (B*N, 1)
@@ -305,34 +307,35 @@ class U2BaseModel(ASRInterface, nn.Layer):
             # 2.3 Seconde beam prune: select topk score with history
             scores = scores + top_k_logp  # (B*N, N), broadcast add
-            scores = scores.view(batch_size, beam_size * beam_size)  # (B, N*N)
+            scores = scores.reshape(
+                [batch_size, beam_size * beam_size])  # (B, N*N)
             scores, offset_k_index = scores.topk(k=beam_size)  # (B, N)
-            scores = scores.view(-1, 1)  # (B*N, 1)
+            scores = scores.reshape([-1, 1])  # (B*N, 1)
             # 2.4. Compute base index in top_k_index,
             # regard top_k_index as (B*N*N),regard offset_k_index as (B*N),
             # then find offset_k_index in top_k_index
-            base_k_index = paddle.arange(batch_size).view(-1, 1).repeat(
+            base_k_index = paddle.arange(batch_size).reshape([-1, 1]).repeat(
                 1, beam_size)  # (B, N)
             base_k_index = base_k_index * beam_size * beam_size
-            best_k_index = base_k_index.view(-1) + offset_k_index.view(
-                -1)  # (B*N)
+            best_k_index = base_k_index.reshape([-1]) + offset_k_index.reshape(
+                [-1])  # (B*N)
             # 2.5 Update best hyps
             best_k_pred = paddle.index_select(
-                top_k_index.view(-1), index=best_k_index, axis=0)  # (B*N)
+                top_k_index.reshape([-1]), index=best_k_index, axis=0)  # (B*N)
             best_hyps_index = best_k_index // beam_size
             last_best_k_hyps = paddle.index_select(
                 hyps, index=best_hyps_index, axis=0)  # (B*N, i)
             hyps = paddle.cat(
-                (last_best_k_hyps, best_k_pred.view(-1, 1)),
+                (last_best_k_hyps, best_k_pred.reshape([-1, 1])),
                 dim=1)  # (B*N, i+1)
             # 2.6 Update end flag
-            end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1)
+            end_flag = paddle.equal(hyps[:, -1], self.eos).reshape([-1, 1])
         # 3. Select best of best
-        scores = scores.view(batch_size, beam_size)
+        scores = scores.reshape([batch_size, beam_size])
         # TODO: length normalization
         best_index = paddle.argmax(scores, axis=-1).long()  # (B)
         best_hyps_index = best_index + paddle.arange(
@@ -379,7 +382,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
         ctc_probs = self.ctc.log_softmax(encoder_out)  # (B, maxlen, vocab_size)
         topk_prob, topk_index = ctc_probs.topk(1, axis=2)  # (B, maxlen, 1)
-        topk_index = topk_index.view(batch_size, maxlen)  # (B, maxlen)
+        topk_index = topk_index.reshape([batch_size, maxlen])  # (B, maxlen)
         pad_mask = make_pad_mask(encoder_out_lens)  # (B, maxlen)
         topk_index = topk_index.masked_fill_(pad_mask, self.eos)  # (B, maxlen)
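To visualize the beam-expansion reshape that the hunks above migrate from `.view(...)` to `.reshape([...])`, here is a toy, hedged example; it uses vanilla `paddle.tile` instead of the in-repo `.repeat(...)` helper, and the batch/beam sizes are made up.

```python
# Toy beam expansion: (B, T, D) -> (B*N, T, D); numbers are invented.
import paddle

batch_size, beam_size, maxlen, encoder_dim = 2, 3, 5, 8
encoder_out = paddle.randn([batch_size, maxlen, encoder_dim])

running_size = batch_size * beam_size
expanded = paddle.tile(encoder_out.unsqueeze(1),
                       [1, beam_size, 1, 1]).reshape(
                           [running_size, maxlen, encoder_dim])
print(expanded.shape)   # [6, 5, 8] == (B*N, maxlen, encoder_dim)
```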

@@ -129,7 +129,7 @@ def _compute_mask_indices(
         [sequence_length for _ in range(batch_size)])
     # SpecAugment mask to fill
-    spec_aug_mask = np.zeros((batch_size, sequence_length), dtype=np.bool)
+    spec_aug_mask = np.zeros((batch_size, sequence_length), dtype=np.bool_)
     spec_aug_mask_idxs = []
     max_num_masked_span = compute_num_masked_span(sequence_length)
@@ -207,9 +207,9 @@ def _sample_negative_indices(features_shape: Tuple,
     sampled_negative_indices = np.zeros(
         shape=(batch_size, sequence_length, num_negatives), dtype=np.int32)
-    mask_time_indices = (mask_time_indices.astype(np.bool)
+    mask_time_indices = (mask_time_indices.astype(np.bool_)
                          if mask_time_indices is not None else
-                         np.ones(features_shape, dtype=np.bool))
+                         np.ones(features_shape, dtype=np.bool_))
     for batch_idx in range(batch_size):
         high = mask_time_indices[batch_idx].sum() - 1
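The `np.bool` to `np.bool_` edits above track NumPy's removal of the deprecated `np.bool` alias (gone as of NumPy 1.24). A one-line reminder of the working spelling:

```python
# np.bool was removed from NumPy's namespace; use np.bool_ (or plain bool).
import numpy as np

spec_aug_mask = np.zeros((2, 10), dtype=np.bool_)
print(spec_aug_mask.dtype)   # bool
```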

@@ -714,13 +714,13 @@ class MultiheadAttention(nn.Layer):
             else:
                 if self.beam_size > 1 and bsz == key.size(1):
                     # key is [T, bsz*beam_size, C], reduce to [T, bsz, C]
-                    key = key.view(
-                        key.size(0), -1, self.beam_size,
-                        key.size(2))[:, :, 0, :]
+                    key = key.reshape(
+                        [key.size(0), -1, self.beam_size,
+                         key.size(2)])[:, :, 0, :]
                     if key_padding_mask is not None:
-                        key_padding_mask = key_padding_mask.view(
-                            -1, self.beam_size,
-                            key_padding_mask.size(1))[:, 0, :]
+                        key_padding_mask = key_padding_mask.reshape(
+                            [-1, self.beam_size,
+                             key_padding_mask.size(1)])[:, 0, :]
             k = self.k_proj(key)
             v = self.v_proj(key)
@@ -1476,7 +1476,7 @@ def compute_mask_indices(
                 lens = np.fromiter(
                     (e - s if e - s >= length + min_space else 0
                      for s, e in parts),
-                    np.int, )
+                    np.int_, )
                 l_sum = np.sum(lens)
                 if l_sum == 0:
                     break

@@ -88,7 +88,7 @@ def compute_amplitude(waveforms, lengths=None, amp_type="avg", scale="linear"):
             out = paddle.mean(paddle.abs(waveforms), axis=1, keepdim=True)
         else:
             wav_sum = paddle.sum(paddle.abs(waveforms), axis=1, keepdim=True)
-            out = wav_sum / lengths
+            out = wav_sum / lengths.astype(wav_sum.dtype)
     elif amp_type == "peak":
         out = paddle.max(paddle.abs(waveforms), axis=1, keepdim=True)[0]
     else:
@@ -248,4 +248,4 @@ def notch_filter(notch_freq, filter_width=101, notch_width=0.05):
     hhpf[pad] += 1
     # Adding filters creates notch filter
-    return (hlpf + hhpf).view(1, -1, 1)
+    return (hlpf + hhpf).reshape([1, -1, 1])
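Regarding the `lengths.astype(wav_sum.dtype)` cast in `compute_amplitude`: dividing a float tensor by integer lengths can trip Paddle's elementwise dtype checks (the exact behaviour depends on the Paddle version), so casting first is the safe route. A hedged toy example with arbitrary values:

```python
# Toy average-amplitude computation with an explicit dtype cast; values are arbitrary.
import paddle

waveforms = paddle.randn([2, 1000])                  # [batch, samples]
lengths = paddle.to_tensor([[1000], [800]])          # int64, [batch, 1]
wav_sum = paddle.sum(paddle.abs(waveforms), axis=1, keepdim=True)
out = wav_sum / lengths.astype(wav_sum.dtype)        # average magnitude per utterance
print(out.shape)                                     # [2, 1]
```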

@@ -743,7 +743,7 @@ class SpecAugment(paddle.nn.Layer):
         time = x.shape[2]
         if time - window <= window:
-            return x.view(*original_size)
+            return x.reshape([*original_size])
         # compute center and corresponding window
         c = paddle.randint(window, time - window, (1, ))[0]
@@ -762,7 +762,7 @@ class SpecAugment(paddle.nn.Layer):
         x[:, :, :w] = left
         x[:, :, w:] = right
-        return x.view(*original_size)
+        return x.reshape([*original_size])
     def mask_along_axis(self, x, dim):
         """Mask along time or frequency axis.
@@ -775,7 +775,7 @@ class SpecAugment(paddle.nn.Layer):
         """
         original_size = x.shape
         if x.dim() == 4:
-            x = x.view(-1, x.shape[2], x.shape[3])
+            x = x.reshape([-1, x.shape[2], x.shape[3]])
         batch, time, fea = x.shape
@@ -795,7 +795,7 @@ class SpecAugment(paddle.nn.Layer):
             (batch, n_mask)).unsqueeze(2)
         # compute masks
-        arange = paddle.arange(end=D).view(1, 1, -1)
+        arange = paddle.arange(end=D).reshape([1, 1, -1])
         mask = (mask_pos <= arange) * (arange < (mask_pos + mask_len))
         mask = mask.any(axis=1)
@@ -811,7 +811,7 @@ class SpecAugment(paddle.nn.Layer):
         # same to x.masked_fill_(mask, val)
         y = paddle.full(x.shape, val, x.dtype)
         x = paddle.where(mask, y, x)
-        return x.view(*original_size)
+        return x.reshape([*original_size])
 class TimeDomainSpecAugment(nn.Layer):
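The `arange` reshape in `mask_along_axis` above feeds a broadcast comparison that builds the SpecAugment mask. A small worked example with invented mask positions and lengths:

```python
# Worked example of the broadcast mask used by mask_along_axis; numbers are made up.
import paddle

D = 8                                               # size of the masked axis
mask_pos = paddle.to_tensor([[[2]], [[5]]])         # (batch=2, n_mask=1, 1)
mask_len = paddle.to_tensor([[[3]], [[2]]])         # (batch=2, n_mask=1, 1)

arange = paddle.arange(end=D).reshape([1, 1, -1])   # (1, 1, D)
mask = (mask_pos <= arange) * (arange < (mask_pos + mask_len))
mask = mask.any(axis=1)                             # (batch, D)
print(mask.astype('int32').numpy())
# [[0 0 1 1 1 0 0 0]
#  [0 0 0 0 0 1 1 0]]
```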

@@ -59,13 +59,13 @@ class Wav2vec2ASR(nn.Layer):
     def forward(self, wav, wavs_lens_rate, target, target_lens):
         if self.normalize_wav:
-            wav = F.layer_norm(wav, wav.shape)
+            wav = F.layer_norm(wav, wav.shape[1:])
         # Extract wav2vec output
         out = self.wav2vec2(wav)[0]
         # We normalize the output if required
         if self.output_norm:
-            out = F.layer_norm(out, out.shape)
+            out = F.layer_norm(out, out.shape[1:])
         if self.training and hasattr(self.config, 'spec_augment'):
             feats = self.spec_augment(out)

@@ -19,6 +19,9 @@ from typing import Tuple
 import paddle
 import paddle.nn as nn
 import paddle.nn.functional as F
+from .wavlm_paddle import WavLM
+from .wavlm_paddle import WavLMConfig
 from paddlespeech.s2t.models.wav2vec2.modules.VanillaNN import VanillaNN
 from paddlespeech.s2t.models.wav2vec2.processing.speech_augmentation import SpecAugment
 from paddlespeech.s2t.modules.ctc import CTCDecoderBase as CTC
@@ -26,8 +29,6 @@ from paddlespeech.s2t.modules.initializer import DefaultInitializerContext
 from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank
 from paddlespeech.s2t.utils.utility import log_add
-from .wavlm_paddle import WavLM, WavLMConfig
 class WavLMASR(nn.Layer):
     def __init__(self, config: dict):
@@ -56,13 +57,13 @@ class WavLMASR(nn.Layer):
     def forward(self, wav, wavs_lens_rate, target, target_lens):
         if self.normalize_wav:
-            wav = F.layer_norm(wav, wav.shape)
+            wav = F.layer_norm(wav, wav.shape[1:])
         # Extract wav2vec output
         out = self.wavlm(wav)
         # We normalize the output if required
         if self.output_norm:
-            out = F.layer_norm(out, out.shape)
+            out = F.layer_norm(out, out.shape[1:])
         if self.training and hasattr(self.config, 'spec_augment'):
             feats = self.spec_augment(out)

@@ -6,25 +6,24 @@
 # Based on fairseq code bases
 # https://github.com/pytorch/fairseq
 # --------------------------------------------------------
-import math
 import logging
-from typing import List, Optional, Tuple
+import math
+from typing import List
+from typing import Optional
+from typing import Tuple
 import numpy as np
 import paddle
 import paddle.nn as nn
 import paddle.nn.functional as F
-from paddle.nn import LayerNorm
 from paddle import Tensor
-from .modules.modules import (
-    MultiheadAttention,
-    SamePad,
-    get_activation_fn,
-    TransposeLast,
-    GLU_Linear,
-)
+from paddle.nn import LayerNorm
+from .modules.modules import get_activation_fn
+from .modules.modules import GLU_Linear
+from .modules.modules import MultiheadAttention
+from .modules.modules import SamePad
+from .modules.modules import TransposeLast
 logger = logging.getLogger(__name__)
@@ -38,8 +37,7 @@ def compute_mask_indices(
         mask_other: float=0.0,
         min_masks: int=0,
         no_overlap: bool=False,
-        min_space: int = 0,
-) -> np.ndarray:
+        min_space: int=0, ) -> np.ndarray:
     """
     Computes random mask spans for a given shape
@@ -65,9 +63,7 @@ def compute_mask_indices(
     all_num_mask = int(
         # add a random number for probabilistic rounding
-        mask_prob * all_sz / float(mask_length)
-        + np.random.rand()
-    )
+        mask_prob * all_sz / float(mask_length) + np.random.rand())
     all_num_mask = max(min_masks, all_num_mask)
@@ -77,9 +73,7 @@ def compute_mask_indices(
             sz = all_sz - padding_mask[i].long().sum().item()
             num_mask = int(
                 # add a random number for probabilistic rounding
-                mask_prob * sz / float(mask_length)
-                + np.random.rand()
-            )
+                mask_prob * sz / float(mask_length) + np.random.rand())
             num_mask = max(min_masks, num_mask)
         else:
             sz = all_sz
@@ -88,7 +82,8 @@ def compute_mask_indices(
         if mask_type == "static":
             lengths = np.full(num_mask, mask_length)
         elif mask_type == "uniform":
-            lengths = np.random.randint(mask_other, mask_length * 2 + 1, size=num_mask)
+            lengths = np.random.randint(
+                mask_other, mask_length * 2 + 1, size=num_mask)
         elif mask_type == "normal":
             lengths = np.random.normal(mask_length, mask_other, size=num_mask)
             lengths = [max(1, int(round(x))) for x in lengths]
@@ -119,9 +114,9 @@ def compute_mask_indices(
             min_length = min(lengths)
             for length in sorted(lengths, reverse=True):
                 lens = np.fromiter(
-                    (e - s if e - s >= length + min_space else 0 for s, e in parts),
-                    np.int,
-                )
+                    (e - s if e - s >= length + min_space else 0
+                     for s, e in parts),
+                    np.int_, )
                 l_sum = np.sum(lens)
                 if l_sum == 0:
                     break
@@ -137,13 +132,10 @@ def compute_mask_indices(
             mask_idc = np.random.choice(sz - min_len, num_mask, replace=False)
-            mask_idc = np.asarray(
-                [
-                    mask_idc[j] + offset
-                    for j in range(len(mask_idc))
-                    for offset in range(lengths[j])
-                ]
-            )
+            mask_idc = np.asarray([
+                mask_idc[j] + offset
+                for j in range(len(mask_idc)) for offset in range(lengths[j])
+            ])
             mask_idcs.append(np.unique(mask_idc[mask_idc < sz]))
@@ -217,8 +209,7 @@ class WavLMConfig:
 class WavLM(nn.Layer):
     def __init__(
             self,
-            cfg: WavLMConfig,
-    ) -> None:
+            cfg: WavLMConfig, ) -> None:
         super().__init__()
         logger.info(f"WavLM Config: {cfg.__dict__}")
@@ -230,14 +221,11 @@ class WavLM(nn.Layer):
             conv_layers=feature_enc_layers,
             dropout=0.0,
             mode=cfg.extractor_mode,
-            conv_bias=cfg.conv_bias,
-        )
-        self.post_extract_proj = (
-            nn.Linear(self.embed, cfg.encoder_embed_dim)
-            if self.embed != cfg.encoder_embed_dim
-            else None
-        )
+            conv_bias=cfg.conv_bias, )
+        self.post_extract_proj = (nn.Linear(self.embed, cfg.encoder_embed_dim)
+                                  if self.embed != cfg.encoder_embed_dim else
+                                  None)
         self.mask_prob = cfg.mask_prob
         self.mask_selection = cfg.mask_selection
@@ -260,8 +248,7 @@ class WavLM(nn.Layer):
         self.mask_emb = self.create_parameter(
             shape=[cfg.encoder_embed_dim],
-            default_initializer=nn.initializer.Uniform(),
-        )
+            default_initializer=nn.initializer.Uniform(), )
         self.encoder = TransformerEncoder(cfg)
         self.layer_norm = LayerNorm(self.embed)
@@ -278,8 +265,7 @@ class WavLM(nn.Layer):
                 self.mask_other,
                 min_masks=2,
                 no_overlap=self.no_mask_overlap,
-                min_space=self.mask_min_space,
-            )
+                min_space=self.mask_min_space, )
             # mask_indices = torch.from_numpy(mask_indices).to(x.device)
             mask_indices = paddle.to_tensor(mask_indices, dtype='int64')
             x[mask_indices] = self.mask_emb
@@ -295,28 +281,24 @@ class WavLM(nn.Layer):
                 self.mask_channel_selection,
                 self.mask_channel_other,
                 no_overlap=self.no_mask_channel_overlap,
-                min_space=self.mask_channel_min_space,
-            )
+                min_space=self.mask_channel_min_space, )
             mask_channel_indices = (
                 # torch.from_numpy(mask_channel_indices)
                 paddle.to_tensor(mask_channel_indices, dtype='int64')
-                .to(x.device)
-                .unsqueeze(1)
-                .expand(-1, T, -1)
-            )
+                .to(x.device).unsqueeze(1).expand(-1, T, -1))
             x[mask_channel_indices] = 0
         return x, mask_indices
     def forward_padding_mask(
-            self, features: Tensor, padding_mask: Tensor,
-    ) -> Tensor:
+            self,
+            features: Tensor,
+            padding_mask: Tensor, ) -> Tensor:
         extra = padding_mask.size(1) % features.size(1)
         if extra > 0:
             padding_mask = padding_mask[:, :-extra]
         padding_mask = padding_mask.view(
-            padding_mask.size(0), features.size(1), -1
-        )
+            padding_mask.size(0), features.size(1), -1)
         padding_mask = padding_mask.all(-1)
         return padding_mask
@@ -327,8 +309,7 @@ class WavLM(nn.Layer):
             mask: bool=False,
             ret_conv: bool=False,
             output_layer: Optional[int]=None,
-            ret_layer_results: bool = False,
-    ):
+            ret_layer_results: bool=False, ):
         if self.feature_grad_mult > 0:
             features = self.feature_extractor(source)
@@ -351,9 +332,7 @@ class WavLM(nn.Layer):
         features = self.dropout_input(features)
         if mask:
-            x, mask_indices = self.apply_mask(
-                features, padding_mask
-            )
+            x, mask_indices = self.apply_mask(features, padding_mask)
         else:
             x = features
@@ -366,10 +345,14 @@ class WavLM(nn.Layer):
         x, layer_results = self.encoder(
             x,
             padding_mask=padding_mask,
-            layer=None if output_layer is None else output_layer - 1
-        )
+            layer=None if output_layer is None else output_layer - 1)
         # print(f"Debugging: x.shape: {x.shape}, x.mean(): {x.mean()}, x.std(): {x.std()}")
-        res = {"x": x, "padding_mask": padding_mask, "features": features, "layer_results": layer_results}
+        res = {
+            "x": x,
+            "padding_mask": padding_mask,
+            "features": features,
+            "layer_results": layer_results
+        }
         feature = res["features"] if ret_conv else res["x"]
         if ret_layer_results:
@@ -381,14 +364,12 @@
 class ConvFeatureExtractionModel(nn.Layer):
-    def __init__(
-        self,
+    def __init__(self,
                  conv_layers: List[Tuple[int, int, int]],
                  dropout: float=0.0,
                  mode: str="default",
                  conv_bias: bool=False,
-        conv_type: str = "default"
-    ):
+                 conv_type: str="default"):
         super().__init__()
         assert mode in {"default", "layer_norm"}
@@ -400,16 +381,19 @@ class ConvFeatureExtractionModel(nn.Layer):
                       stride,
                       is_layer_norm=False,
                       is_group_norm=False,
-            conv_bias=False,
-        ):
+                      conv_bias=False, ):
             def make_conv():
-                conv = nn.Conv1D(n_in, n_out, k, stride=stride, bias_attr=conv_bias,
-                                 weight_attr=nn.initializer.KaimingNormal())
+                conv = nn.Conv1D(
+                    n_in,
+                    n_out,
+                    k,
+                    stride=stride,
+                    bias_attr=conv_bias,
+                    weight_attr=nn.initializer.KaimingNormal())
                 # nn.init.kaiming_normal_(conv.weight)
                 return conv
-            assert (
-                is_layer_norm and is_group_norm
-            ) == False, "layer norm and group norm are exclusive"
+            assert (is_layer_norm and is_group_norm
+                    ) == False, "layer norm and group norm are exclusive"
             if is_layer_norm:
@@ -419,19 +403,18 @@ class ConvFeatureExtractionModel(nn.Layer):
                     nn.Sequential(
                         TransposeLast(),
                         nn.LayerNorm(normalized_shape=dim, epsilon=1e-5),
-                        TransposeLast(),
-                    ),
-                    nn.GELU(),
-                )
+                        TransposeLast(), ),
+                    nn.GELU(), )
             elif is_group_norm:
                 return nn.Sequential(
                     make_conv(),
                     nn.Dropout(p=dropout),
-                    nn.GroupNorm(num_groups=dim, num_channels=dim, epsilon=1e-5),
-                    nn.GELU(),
-                )
+                    nn.GroupNorm(
+                        num_groups=dim, num_channels=dim, epsilon=1e-5),
+                    nn.GELU(), )
             else:
-                return nn.Sequential(make_conv(), nn.Dropout(p=dropout), nn.GELU())
+                return nn.Sequential(
+                    make_conv(), nn.Dropout(p=dropout), nn.GELU())
         self.conv_type = conv_type
         if self.conv_type == "default":
@@ -449,9 +432,7 @@ class ConvFeatureExtractionModel(nn.Layer):
                         stride,
                         is_layer_norm=mode == "layer_norm",
                         is_group_norm=mode == "default" and i == 0,
-                        conv_bias=conv_bias,
-                    )
-                )
+                        conv_bias=conv_bias, ))
                 in_d = dim
         elif self.conv_type == "conv2d":
             in_d = 1
@@ -460,9 +441,7 @@ class ConvFeatureExtractionModel(nn.Layer):
                 assert len(cl) == 3
                 (dim, k, stride) = cl
-                self.conv_layers.append(
-                    paddle.nn.Conv2D(in_d, dim, k, stride)
-                )
+                self.conv_layers.append(paddle.nn.Conv2D(in_d, dim, k, stride))
                 self.conv_layers.append(paddle.nn.ReLU())
                 in_d = dim
         elif self.conv_type == "custom":
@@ -473,17 +452,13 @@ class ConvFeatureExtractionModel(nn.Layer):
                 assert len(cl) == 3
                 (dim, k, stride) = cl
                 self.conv_layers.append(
-                    paddle.nn.Conv2D(in_d, dim, k, stride, padding=1)
-                )
-                self.conv_layers.append(
-                    paddle.nn.LayerNorm([dim, idim])
-                )
+                    paddle.nn.Conv2D(in_d, dim, k, stride, padding=1))
+                self.conv_layers.append(paddle.nn.LayerNorm([dim, idim]))
                 self.conv_layers.append(paddle.nn.ReLU())
                 in_d = dim
                 if (i + 1) % 2 == 0:
                     self.conv_layers.append(
-                        paddle.nn.MaxPool2D(2, stride=2, ceil_mode=True)
-                    )
+                        paddle.nn.MaxPool2D(2, stride=2, ceil_mode=True))
                     idim = int(math.ceil(idim / 2))
             else:
                 pass
@ -518,8 +493,8 @@ class TransformerEncoder(nn.Layer):
self.dropout = args.dropout self.dropout = args.dropout
self.embedding_dim = args.encoder_embed_dim self.embedding_dim = args.encoder_embed_dim
dropout = 0 dropout = 0
std = math.sqrt((4 * (1.0 - dropout)) / (args.conv_pos * self.embedding_dim)) std = math.sqrt(
(4 * (1.0 - dropout)) / (args.conv_pos * self.embedding_dim))
self.pos_conv = nn.Conv1D( self.pos_conv = nn.Conv1D(
self.embedding_dim, self.embedding_dim,
@ -528,15 +503,16 @@ class TransformerEncoder(nn.Layer):
padding=args.conv_pos // 2, padding=args.conv_pos // 2,
groups=args.conv_pos_groups, groups=args.conv_pos_groups,
weight_attr=nn.initializer.Normal(mean=0, std=std), weight_attr=nn.initializer.Normal(mean=0, std=std),
bias_attr=True bias_attr=True)
)
# nn.init.normal_(self.pos_conv.weight, mean=0, std=std) # nn.init.normal_(self.pos_conv.weight, mean=0, std=std)
# nn.init.constant_(self.pos_conv.bias, 0) # nn.init.constant_(self.pos_conv.bias, 0)
# self.pos_conv = nn.utils.weight_norm(self.pos_conv, name="weight", dim=2) # self.pos_conv = nn.utils.weight_norm(self.pos_conv, name="weight", dim=2)
# self.pos_conv.weight_g = self.pos_conv.weight_g.unsqueeze(0).unsqueeze(0) # self.pos_conv.weight_g = self.pos_conv.weight_g.unsqueeze(0).unsqueeze(0)
self.pos_conv = nn.utils.weight_norm(self.pos_conv, name="weight", dim=2) self.pos_conv = nn.utils.weight_norm(
self.pos_conv = nn.Sequential(self.pos_conv, SamePad(args.conv_pos), nn.GELU()) self.pos_conv, name="weight", dim=2)
self.pos_conv = nn.Sequential(self.pos_conv,
SamePad(args.conv_pos), nn.GELU())
if hasattr(args, "relative_position_embedding"): if hasattr(args, "relative_position_embedding"):
self.relative_position_embedding = args.relative_position_embedding self.relative_position_embedding = args.relative_position_embedding
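The `pos_conv` stack being reflowed above is the convolutional relative positional embedding used in wav2vec2-style encoders: a grouped Conv1D with weight normalization on the kernel axis, trimmed back to the input length and passed through GELU. Below is a hedged, self-contained sketch; the `SamePad` class is a simplified stand-in for the project's helper, and the hyperparameter values are invented.

```python
import math
import paddle
import paddle.nn as nn

class SamePad(nn.Layer):
    """Drop the extra trailing frame produced by an even kernel size."""
    def __init__(self, kernel_size):
        super().__init__()
        self.remove = 1 if kernel_size % 2 == 0 else 0

    def forward(self, x):
        return x[:, :, :-self.remove] if self.remove > 0 else x

embedding_dim, conv_pos, conv_pos_groups, dropout = 768, 128, 16, 0.0
std = math.sqrt((4 * (1.0 - dropout)) / (conv_pos * embedding_dim))
pos_conv = nn.Conv1D(
    embedding_dim,
    embedding_dim,
    kernel_size=conv_pos,
    padding=conv_pos // 2,
    groups=conv_pos_groups,
    weight_attr=nn.initializer.Normal(mean=0, std=std),
    bias_attr=True)
pos_conv = nn.utils.weight_norm(pos_conv, name="weight", dim=2)
pos_conv = nn.Sequential(pos_conv, SamePad(conv_pos), nn.GELU())

x = paddle.randn([2, embedding_dim, 50])   # (batch, channels, time)
print(pos_conv(x).shape)                   # [2, 768, 50]
```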
@ -547,8 +523,7 @@ class TransformerEncoder(nn.Layer):
self.num_buckets = 0 self.num_buckets = 0
self.max_distance = 0 self.max_distance = 0
self.layers = nn.LayerList( self.layers = nn.LayerList([
[
TransformerSentenceEncoderLayer( TransformerSentenceEncoderLayer(
embedding_dim=self.embedding_dim, embedding_dim=self.embedding_dim,
ffn_embedding_dim=args.encoder_ffn_embed_dim, ffn_embedding_dim=args.encoder_ffn_embed_dim,
@ -558,14 +533,13 @@ class TransformerEncoder(nn.Layer):
activation_dropout=args.activation_dropout, activation_dropout=args.activation_dropout,
activation_fn=args.activation_fn, activation_fn=args.activation_fn,
layer_norm_first=args.layer_norm_first, layer_norm_first=args.layer_norm_first,
has_relative_attention_bias=(self.relative_position_embedding and i == 0), has_relative_attention_bias=(
self.relative_position_embedding and i == 0),
num_buckets=self.num_buckets, num_buckets=self.num_buckets,
max_distance=self.max_distance, max_distance=self.max_distance,
gru_rel_pos=args.gru_rel_pos, gru_rel_pos=args.gru_rel_pos, )
)
for i in range(args.encoder_layers) for i in range(args.encoder_layers)
] ])
)
self.layer_norm_first = args.layer_norm_first self.layer_norm_first = args.layer_norm_first
self.layer_norm = LayerNorm(self.embedding_dim) self.layer_norm = LayerNorm(self.embedding_dim)
@ -574,14 +548,19 @@ class TransformerEncoder(nn.Layer):
# self.apply(init_bert_params) # self.apply(init_bert_params)
def forward(self, x, padding_mask=None, streaming_mask=None, layer=None): def forward(self, x, padding_mask=None, streaming_mask=None, layer=None):
x, layer_results = self.extract_features(x, padding_mask, streaming_mask, layer) x, layer_results = self.extract_features(x, padding_mask,
streaming_mask, layer)
# print("x.shape", x.shape) # print("x.shape", x.shape)
if self.layer_norm_first and layer is None: if self.layer_norm_first and layer is None:
x = self.layer_norm(x) x = self.layer_norm(x)
return x, layer_results return x, layer_results
def extract_features(self, x, padding_mask=None, streaming_mask=None, tgt_layer=None): def extract_features(self,
x,
padding_mask=None,
streaming_mask=None,
tgt_layer=None):
if padding_mask is not None: if padding_mask is not None:
x[padding_mask] = 0 x[padding_mask] = 0
@ -598,7 +577,6 @@ class TransformerEncoder(nn.Layer):
# x = x.transpose(0, 1) # x = x.transpose(0, 1)
x = x.transpose([1, 0, 2]) x = x.transpose([1, 0, 2])
layer_results = [] layer_results = []
z = None z = None
if tgt_layer is not None: if tgt_layer is not None:
@ -608,7 +586,12 @@ class TransformerEncoder(nn.Layer):
for i, layer in enumerate(self.layers): for i, layer in enumerate(self.layers):
dropout_probability = np.random.random() dropout_probability = np.random.random()
if not self.training or (dropout_probability > self.layerdrop): if not self.training or (dropout_probability > self.layerdrop):
x, z, pos_bias = layer(x, self_attn_padding_mask=padding_mask, need_weights=False,self_attn_mask=streaming_mask, pos_bias=pos_bias) x, z, pos_bias = layer(
x,
self_attn_padding_mask=padding_mask,
need_weights=False,
self_attn_mask=streaming_mask,
pos_bias=pos_bias)
if tgt_layer is not None: if tgt_layer is not None:
layer_results.append((x, z)) layer_results.append((x, z))
if i == tgt_layer: if i == tgt_layer:
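The loop above implements LayerDrop: at training time each encoder layer is skipped with probability `layerdrop`, while at inference every layer always runs. A minimal sketch of just that control flow, with a toy stack of Linear layers standing in for the transformer layers:

```python
import numpy as np
import paddle
import paddle.nn as nn

class LayerDropStack(nn.Layer):
    def __init__(self, layers, layerdrop=0.05):
        super().__init__()
        self.layers = nn.LayerList(layers)
        self.layerdrop = layerdrop

    def forward(self, x):
        for layer in self.layers:
            dropout_probability = np.random.random()
            # Skip the layer with probability `layerdrop`, but only in training.
            if not self.training or dropout_probability > self.layerdrop:
                x = layer(x)
        return x

stack = LayerDropStack([nn.Linear(16, 16) for _ in range(4)], layerdrop=0.5)
stack.train()
print(stack(paddle.randn([2, 16])).shape)   # [2, 16]
```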
@ -645,8 +628,7 @@ class TransformerSentenceEncoderLayer(nn.Layer):
num_buckets: int=0, num_buckets: int=0,
max_distance: int=0, max_distance: int=0,
rescale_init: bool=False, rescale_init: bool=False,
gru_rel_pos: bool = True, gru_rel_pos: bool=True, ) -> None:
) -> None:
super().__init__() super().__init__()
# Initialize parameters # Initialize parameters
@ -666,8 +648,7 @@ class TransformerSentenceEncoderLayer(nn.Layer):
num_buckets=num_buckets, num_buckets=num_buckets,
max_distance=max_distance, max_distance=max_distance,
rescale_init=rescale_init, rescale_init=rescale_init,
gru_rel_pos=gru_rel_pos, gru_rel_pos=gru_rel_pos, )
)
self.dropout1 = nn.Dropout(dropout) self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(self.activation_dropout) self.dropout2 = nn.Dropout(self.activation_dropout)
@ -679,7 +660,8 @@ class TransformerSentenceEncoderLayer(nn.Layer):
self.self_attn_layer_norm = LayerNorm(self.embedding_dim) self.self_attn_layer_norm = LayerNorm(self.embedding_dim)
if self.activation_name == "glu": if self.activation_name == "glu":
self.fc1 = GLU_Linear(self.embedding_dim, ffn_embedding_dim, "swish") self.fc1 = GLU_Linear(self.embedding_dim, ffn_embedding_dim,
"swish")
else: else:
self.fc1 = nn.Linear(self.embedding_dim, ffn_embedding_dim) self.fc1 = nn.Linear(self.embedding_dim, ffn_embedding_dim)
self.fc2 = nn.Linear(ffn_embedding_dim, self.embedding_dim) self.fc2 = nn.Linear(ffn_embedding_dim, self.embedding_dim)
@ -687,14 +669,12 @@ class TransformerSentenceEncoderLayer(nn.Layer):
# layer norm associated with the position wise feed-forward NN # layer norm associated with the position wise feed-forward NN
self.final_layer_norm = LayerNorm(self.embedding_dim) self.final_layer_norm = LayerNorm(self.embedding_dim)
def forward( def forward(self,
self,
x: Tensor, x: Tensor,
self_attn_mask: Tensor=None, self_attn_mask: Tensor=None,
self_attn_padding_mask: Tensor=None, self_attn_padding_mask: Tensor=None,
need_weights: bool=False, need_weights: bool=False,
pos_bias=None pos_bias=None):
):
""" """
LayerNorm is applied either before or after the self-attention/ffn LayerNorm is applied either before or after the self-attention/ffn
modules similar to the original Transformer implementation. modules similar to the original Transformer implementation.
@ -710,8 +690,7 @@ class TransformerSentenceEncoderLayer(nn.Layer):
key_padding_mask=self_attn_padding_mask, key_padding_mask=self_attn_padding_mask,
need_weights=False, need_weights=False,
attn_mask=self_attn_mask, attn_mask=self_attn_mask,
position_bias=pos_bias position_bias=pos_bias)
)
# import pdb; pdb.set_trace() # import pdb; pdb.set_trace()
x = self.dropout1(x) x = self.dropout1(x)
x = residual + x x = residual + x
@ -734,8 +713,7 @@ class TransformerSentenceEncoderLayer(nn.Layer):
key_padding_mask=self_attn_padding_mask, key_padding_mask=self_attn_padding_mask,
need_weights=need_weights, need_weights=need_weights,
attn_mask=self_attn_mask, attn_mask=self_attn_mask,
position_bias=pos_bias position_bias=pos_bias)
)
x = self.dropout1(x) x = self.dropout1(x)
x = residual + x x = residual + x
@ -2,11 +2,11 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/__init__.py) # Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/__init__.py)
from paddlespeech.s2t.models.whisper.whipser import decode from paddlespeech.s2t.models.whisper.whisper import decode
from paddlespeech.s2t.models.whisper.whipser import DecodingOptions from paddlespeech.s2t.models.whisper.whisper import DecodingOptions
from paddlespeech.s2t.models.whisper.whipser import DecodingResult from paddlespeech.s2t.models.whisper.whisper import DecodingResult
from paddlespeech.s2t.models.whisper.whipser import detect_language from paddlespeech.s2t.models.whisper.whisper import detect_language
from paddlespeech.s2t.models.whisper.whipser import log_mel_spectrogram from paddlespeech.s2t.models.whisper.whisper import log_mel_spectrogram
from paddlespeech.s2t.models.whisper.whipser import ModelDimensions from paddlespeech.s2t.models.whisper.whisper import ModelDimensions
from paddlespeech.s2t.models.whisper.whipser import transcribe from paddlespeech.s2t.models.whisper.whisper import transcribe
from paddlespeech.s2t.models.whisper.whipser import Whisper from paddlespeech.s2t.models.whisper.whisper import Whisper
@ -971,8 +971,14 @@ class ApplyTimestampRules(LogitFilter):
# if sum of probability over timestamps is above any other token, sample timestamp # if sum of probability over timestamps is above any other token, sample timestamp
logprobs = F.log_softmax(logits, axis=-1, dtype='float32') logprobs = F.log_softmax(logits, axis=-1, dtype='float32')
for k in range(tokens.shape[0]): for k in range(tokens.shape[0]):
timestamp_logprob = paddle.logsumexp( # When using paddle.logsumexp on a 32GB Tesla-V100 GPU, we encountered CUDA error 700.
logprobs[k, self.tokenizer.timestamp_begin:], axis=-1) # To bypass this issue in CI, we have decomposed the operation into separate steps.
# It introduces a precision difference of about 2e-6.
# TODO: revert this once logsumexp has been fixed.
timestamp_logprob = paddle.exp(
logprobs[k, self.tokenizer.timestamp_begin:])
timestamp_logprob = paddle.sum(timestamp_logprob, axis=-1)
timestamp_logprob = paddle.log(timestamp_logprob)
max_text_token_logprob = paddle.max( max_text_token_logprob = paddle.max(
logprobs[k, :self.tokenizer.timestamp_begin]) logprobs[k, :self.tokenizer.timestamp_begin])
if timestamp_logprob > max_text_token_logprob: if timestamp_logprob > max_text_token_logprob:
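The workaround replaces `paddle.logsumexp` with its definition, log(sum(exp(x))), skipping the usual max-subtraction trick that makes logsumexp numerically stable; that is where the roughly 2e-6 drift mentioned in the comment comes from. A quick, self-contained check of the equivalence (the tensor here is random and only for illustration):

```python
import paddle

x = paddle.randn([1500], dtype='float32')   # stand-in for one row of logprobs

ref = paddle.logsumexp(x, axis=-1)
# Decomposed form used above: exp -> sum -> log, without subtracting max(x).
dec = paddle.log(paddle.sum(paddle.exp(x), axis=-1))

print(float(paddle.abs(ref - dec).max()))   # expected to be on the order of 1e-6
```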
@ -129,8 +129,8 @@ class MultiHeadedAttention(nn.Layer):
p_attn = self.dropout(attn) p_attn = self.dropout(attn)
x = paddle.matmul(p_attn, value) # (batch, head, time1, d_k) x = paddle.matmul(p_attn, value) # (batch, head, time1, d_k)
x = x.transpose([0, 2, 1, 3]).reshape([n_batch, -1, self.h * x = x.transpose([0, 2, 1, 3]).reshape(
self.d_k]) # (batch, time1, d_model) [n_batch, -1, self.h * self.d_k]) # (batch, time1, d_model)
return self.linear_out(x) # (batch, time1, d_model) return self.linear_out(x) # (batch, time1, d_model)
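The reshape being reflowed here folds the attention heads back into the model dimension: (batch, head, time1, d_k) becomes (batch, time1, h·d_k). A small shape walk-through with invented sizes:

```python
import paddle

n_batch, h, time1, d_k = 2, 4, 10, 16
x = paddle.randn([n_batch, h, time1, d_k])      # (batch, head, time1, d_k)
x = x.transpose([0, 2, 1, 3]).reshape(
    [n_batch, -1, h * d_k])                     # (batch, time1, d_model)
print(x.shape)                                  # [2, 10, 64]
```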
@ -280,8 +280,8 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
(x.shape[0], x.shape[1], x.shape[2], 1), dtype=x.dtype) (x.shape[0], x.shape[1], x.shape[2], 1), dtype=x.dtype)
x_padded = paddle.concat([zero_pad, x], axis=-1) x_padded = paddle.concat([zero_pad, x], axis=-1)
x_padded = x_padded.view(x.shape[0], x.shape[1], x.shape[3] + 1, x_padded = x_padded.reshape(
x.shape[2]) [x.shape[0], x.shape[1], x.shape[3] + 1, x.shape[2]])
x = x_padded[:, :, 1:].view_as(x) # [B, H, T1, T1] x = x_padded[:, :, 1:].view_as(x) # [B, H, T1, T1]
if zero_triu: if zero_triu:
@ -349,7 +349,8 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
new_cache = paddle.concat((k, v), axis=-1) new_cache = paddle.concat((k, v), axis=-1)
n_batch_pos = pos_emb.shape[0] n_batch_pos = pos_emb.shape[0]
p = self.linear_pos(pos_emb).reshape([n_batch_pos, -1, self.h, self.d_k]) p = self.linear_pos(pos_emb).reshape(
[n_batch_pos, -1, self.h, self.d_k])
p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k) p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
# (batch, head, time1, d_k) # (batch, head, time1, d_k)
@ -138,7 +138,7 @@ class Pitch():
input: np.ndarray, input: np.ndarray,
use_continuous_f0: bool=True, use_continuous_f0: bool=True,
use_log_f0: bool=True) -> np.ndarray: use_log_f0: bool=True) -> np.ndarray:
input = input.astype(np.float) input = input.astype(np.float_)
frame_period = 1000 * self.hop_length / self.sr frame_period = 1000 * self.hop_length / self.sr
f0, timeaxis = pyworld.dio( f0, timeaxis = pyworld.dio(
input, input,
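The cast is changed to `np.float_` (an alias for `np.float64`) because, to my understanding, pyworld's DIO extractor requires a float64 signal, and the bare `np.float` alias no longer exists in recent NumPy. A hedged usage sketch with a made-up sample rate and hop length:

```python
import numpy as np
import pyworld

sr, hop_length = 24000, 300
x = np.random.randn(sr).astype(np.float_)       # 1 second of fake audio, float64
frame_period = 1000 * hop_length / sr           # milliseconds per frame, as above

f0, timeaxis = pyworld.dio(x, sr, frame_period=frame_period)
print(f0.shape, timeaxis.shape)
```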
@ -112,7 +112,7 @@ def parse_args():
parser.add_argument( parser.add_argument(
"--device", "--device",
default="gpu", default="gpu",
choices=["gpu", "cpu", "xpu", "npu", "mlu"], choices=["gpu", "cpu", "xpu", "npu", "mlu", "gcu"],
help="Device selected for inference.", ) help="Device selected for inference.", )
parser.add_argument('--cpu_threads', type=int, default=1) parser.add_argument('--cpu_threads', type=int, default=1)
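For completeness, a tiny sketch of the expanded `--device` flag; only the two options shown in the hunk are reproduced, and the rest of the real parser is omitted.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device",
    default="gpu",
    choices=["gpu", "cpu", "xpu", "npu", "mlu", "gcu"],
    help="Device selected for inference.")
parser.add_argument("--cpu_threads", type=int, default=1)

args = parser.parse_args(["--device", "gcu"])
print(args.device)   # gcu
```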
@ -841,6 +841,9 @@ class FastSpeech2(nn.Layer):
spk_emb = self.spk_projection(F.normalize(spk_emb)) spk_emb = self.spk_projection(F.normalize(spk_emb))
hs = hs + spk_emb.unsqueeze(1) hs = hs + spk_emb.unsqueeze(1)
elif self.spk_embed_integration_type == "concat": elif self.spk_embed_integration_type == "concat":
# at synthesis time a single utterance yields a 1-D `spk_emb`, so add the batch dim
if spk_emb.dim() == 1:
spk_emb = spk_emb.unsqueeze(0)
# concat hidden states with spk embeds and then apply projection # concat hidden states with spk embeds and then apply projection
spk_emb = F.normalize(spk_emb).unsqueeze(1).expand( spk_emb = F.normalize(spk_emb).unsqueeze(1).expand(
shape=[-1, paddle.shape(hs)[1], -1]) shape=[-1, paddle.shape(hs)[1], -1])
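The guard added above covers single-utterance synthesis, where `spk_emb` arrives as a 1-D vector and must gain a batch dimension before the normalize/expand/concat sequence. A shape sketch with invented dimensions (the real `adim` and speaker-embedding size come from the model config):

```python
import paddle
import paddle.nn.functional as F

hs = paddle.randn([1, 50, 384])        # (batch, T, adim)
spk_emb = paddle.randn([256])          # 1-D: one utterance at synthesis time

if spk_emb.dim() == 1:                 # the guard added in this hunk
    spk_emb = spk_emb.unsqueeze(0)     # -> (1, spk_dim)

spk_emb = F.normalize(spk_emb).unsqueeze(1).expand(
    shape=[-1, hs.shape[1], -1])       # -> (1, T, spk_dim)
hs = paddle.concat([hs, spk_emb], axis=-1)
print(hs.shape)                        # [1, 50, 640]
```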
@ -29,7 +29,27 @@ def is_broadcastable(shp1, shp2):
def broadcast_shape(shp1, shp2): def broadcast_shape(shp1, shp2):
result = [] result = []
for a, b in zip(shp1[::-1], shp2[::-1]): for a, b in zip(shp1[::-1], shp2[::-1]):
is_a_int = isinstance(a, int)
is_b_int = isinstance(b, int)
if is_a_int and is_b_int:
result.append(max(a, b)) result.append(max(a, b))
else:
dtype = None
if hasattr(a, 'dtype'):
dtype = a.dtype
if hasattr(b, 'dtype'):
dtype = b.dtype
if (is_a_int):
a = paddle.full((), a, dtype=dtype)
if (is_b_int):
b = paddle.full((), b, dtype=dtype)
result.append(paddle.maximum(a, b))
return result[::-1] return result[::-1]
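The patched helper now accepts dimensions that are a mix of Python ints and 0-D Tensors (as produced by `paddle.shape` under dynamic-to-static conversion). Below is a condensed, self-contained copy of that logic with a usage example; the sample shapes are invented.

```python
import paddle

def broadcast_shape(shp1, shp2):
    result = []
    for a, b in zip(shp1[::-1], shp2[::-1]):
        is_a_int = isinstance(a, int)
        is_b_int = isinstance(b, int)
        if is_a_int and is_b_int:
            result.append(max(a, b))
        else:
            # Promote the plain int to a 0-D Tensor so paddle.maximum applies.
            dtype = None
            if hasattr(a, 'dtype'):
                dtype = a.dtype
            if hasattr(b, 'dtype'):
                dtype = b.dtype
            if is_a_int:
                a = paddle.full((), a, dtype=dtype)
            if is_b_int:
                b = paddle.full((), b, dtype=dtype)
            result.append(paddle.maximum(a, b))
    return result[::-1]

shp1 = [2, paddle.full((), 1, dtype='int32'), 4]   # one dim is a 0-D Tensor
shp2 = [2, 3, 1]
print(broadcast_shape(shp1, shp2))                 # [2, Tensor(3), 4]
```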
@ -67,7 +67,7 @@ class PositionalEncoding(nn.Layer):
pe[:, 0::2] = paddle.sin(position * div_term) pe[:, 0::2] = paddle.sin(position * div_term)
pe[:, 1::2] = paddle.cos(position * div_term) pe[:, 1::2] = paddle.cos(position * div_term)
pe = pe.unsqueeze(0) pe = pe.unsqueeze(0)
self.pe = pe self.pe = paddle.assign(pe)
def forward(self, x: paddle.Tensor): def forward(self, x: paddle.Tensor):
"""Add positional encoding. """Add positional encoding.
@ -36,7 +36,7 @@ def convert_dtype_to_np_dtype_(dtype):
elif dtype is core.VarDesc.VarType.FP16: elif dtype is core.VarDesc.VarType.FP16:
return np.float16 return np.float16
elif dtype is core.VarDesc.VarType.BOOL: elif dtype is core.VarDesc.VarType.BOOL:
return np.bool return np.bool_
elif dtype is core.VarDesc.VarType.INT32: elif dtype is core.VarDesc.VarType.INT32:
return np.int32 return np.int32
elif dtype is core.VarDesc.VarType.INT64: elif dtype is core.VarDesc.VarType.INT64:
@ -53,7 +53,6 @@ base = [
"pandas", "pandas",
"paddleaudio>=1.1.0", "paddleaudio>=1.1.0",
"paddlenlp>=2.4.8", "paddlenlp>=2.4.8",
"paddlepaddle-gpu==2.5.1",
"paddleslim>=2.3.4", "paddleslim>=2.3.4",
"ppdiffusers>=0.9.0", "ppdiffusers>=0.9.0",
"paddlespeech_feat", "paddlespeech_feat",
@ -67,6 +66,7 @@ base = [
"pyyaml", "pyyaml",
"resampy", "resampy",
"sacrebleu", "sacrebleu",
"soundfile",
"textgrid", "textgrid",
"timer", "timer",
"ToJyutping==0.2.1", "ToJyutping==0.2.1",