[s2t] add whisper asr large model (#2640)

* add whisper asr large model decoding, test=asr
* fix code style.
* fix json code style.
* remove resource and fix code style.
* fix yapf
* add cli and demos, fix some code style.
* fix some problem by comment.
* fix yapf
3 years ago · b1d3f59bcb
parent dc9d3baf51
commit b1d3f59bcb
16 changed files with 2789 additions and 3 deletions
--- a/demos/whisper/README.md
+++ b/demos/whisper/README.md
@@ -0,0 +1,89 @@
 ([简体中文](./README_cn.md)|English)
 ## Introduction
 Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
 The Whisper model was trained by OpenAI: https://github.com/openai/whisper
 ## Usage
 ### 1. Installation
 See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
 You can install paddlespeech in one of three ways: easy, medium, or hard.
 ### 2. Prepare Input File
 The input of this demo should be a WAV file (`.wav`), and its sample rate must match the sample rate of the model.
 A sample file for this demo can be downloaded:
 ```bash
 wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 ```
 ### 3. Usage
 - Command Line(Recommended)
   ```bash
   # to recognize text 
   paddlespeech whisper --task transcribe --input ./zh.wav
   # to recognize text and translate to English
   paddlespeech whisper --task translate --input ./zh.wav
   ```
   Usage:
   ```bash
   paddlespeech whisper --help
   ```
   Arguments:
   - `input`(required): Audio file to recognize.
   - `model`: Model type of the ASR task. Default: `whisper`. The model size is selected separately with `size` (default: `large`, currently the only supported size).
   - `task`: Output type, either `transcribe` or `translate`. Default: `transcribe`.
   - `lang`: Language to decode. Default: `None`, in which case the model detects the language itself; set it to force a specific language.
   - `sample_rate`: Sample rate of the model. Default: `16000`. Other sample rates are not supported yet.
   - `config`: Config file of the ASR task. The pretrained model's config is used when it is `None`. Default: `None`.
   - `ckpt_path`: Model checkpoint. The pretrained model is used when it is `None`. Default: `None`.
   - `yes`: Flag that takes no value. When set, every prompt from the program is accepted automatically, including the request to resample the audio. Default: `False`.
   - `device`: Device used for model inference. Default: the default PaddlePaddle device in the current environment.
   - `verbose`: Show the log information.
 - Python API
   ```python
   import paddle
   from paddlespeech.cli.whisper import WhisperExecutor
   whisper_executor = WhisperExecutor()
   # to recognize text 
   text = whisper_executor(
       model='whisper',  # model family; `size` defaults to 'large'
       task='transcribe',
       sample_rate=16000,
       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
       ckpt_path=None,
       audio_file='./zh.wav',
       device=paddle.get_device())
   print('Transcribe Result: \n{}'.format(text))
   # to recognize text and translate to English
   translation = whisper_executor(
       model='whisper',  # model family; `size` defaults to 'large'
       task='translate',
       sample_rate=16000,
       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
       ckpt_path=None,
       audio_file='./zh.wav',
       device=paddle.get_device())
   print('Translate Result: \n{}'.format(translation))
   ```
   Output:
   ```bash
   Transcribe Result:
   Detected language: Chinese
   [00:00.000 --> 00:05.000] 我认为跑步最重要的就是给我带来了身体健康
   {'text': '我认为跑步最重要的就是给我带来了身体健康', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': '我认为跑步最重要的就是给我带来了身体健康', 'tokens': [50364, 1654, 7422, 97, 13992, 32585, 31429, 8661, 24928, 1546, 5620, 49076, 4845, 99, 34912, 19847, 29485, 44201, 6346, 115, 50614], 'temperature': 0.0, 'avg_logprob': -0.23577967557040128, 'compression_ratio': 0.28169014084507044, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
   Translate Result:
   Detected language: Chinese
   [00:00.000 --> 00:05.000]  I think the most important thing about running is that it brings me good health.
   {'text': ' I think the most important thing about running is that it brings me good health.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': ' I think the most important thing about running is that it brings me good health.', 'tokens': [50364, 286, 519, 264, 881, 1021, 551, 466, 2614, 307, 300, 309, 5607, 385, 665, 1585, 13, 50614], 'temperature': 0.0, 'avg_logprob': -0.47945233395225123, 'compression_ratio': 1.095890410958904, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
   ```
--- a/demos/whisper/README_cn.md
+++ b/demos/whisper/README_cn.md
@@ -0,0 +1,91 @@
 (简体中文|[English](./README.md))
 # Whisper模型
 ## 介绍
 Whisper是一种通用的语音识别模型。它是在多种音频的大数据集上训练的，也是一个多任务模型，可以执行多语言语音识别以及语音翻译和语言识别。
 Whisper模型由OpenAI Whisper训练 https://github.com/openai/whisper
 ## 使用方法
 ### 1. 安装
 请看[安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md)。
 你可以从 easy，medium，hard 三种方式中选择一种方式安装。
 ### 2. 准备输入
 这个 demo 的输入应该是一个 WAV 文件（`.wav`），并且采样率必须与模型的采样率相同。
 可以下载此 demo 的示例音频：
 ```bash
 wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 ```
 ### 3. 使用方法
 - 命令行 (推荐使用)
   ```bash
   # 识别文本
   paddlespeech whisper --task transcribe --input ./zh.wav
   # 将语音翻译成英语
   paddlespeech whisper --task translate --input ./zh.wav
   ```
   使用方法：
   ```bash
   paddlespeech whisper --help
   ```
   参数：
   - `input`(必须输入)：用于识别的音频文件。
   - `model`：ASR 任务的模型，默认值：`whisper`。模型大小由 `size` 单独指定（默认值：`large`，目前仅支持该大小）。
   - `task`：输出类别，默认值：`transcribe`。
   - `lang`：模型语言，默认值：`None`，强制设定识别出的语言，默认为模型自行判定。
   - `sample_rate`：音频采样率，默认值：`16000`，目前Whisper暂不支持其他采样率。
   - `config`：ASR 任务的参数文件，若不设置则使用预训练模型中的默认配置，默认值：`None`。
   - `ckpt_path`：模型参数文件，若不设置则下载解码模型使用，默认值：`None`。
   - `yes`：不需要设置额外的参数，一旦设置了该参数，说明你默认同意程序的所有请求，其中包括自动转换输入音频的采样率。默认值：`False`。
   - `device`：执行预测的设备，默认值：当前系统下 paddlepaddle 的默认 device。
   - `verbose`：如果使用，显示 logger 信息。
 - Python API
   ```python
   import paddle
   from paddlespeech.cli.whisper import WhisperExecutor
   whisper_executor = WhisperExecutor()
   # 识别文本
   text = whisper_executor(
       model='whisper',  # model family; `size` defaults to 'large'
       task='transcribe',
       sample_rate=16000,
       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
       ckpt_path=None,
       audio_file='./zh.wav',
       device=paddle.get_device())
   print('Transcribe Result: \n{}'.format(text))
   # 将语音翻译成英语
   translation = whisper_executor(
       model='whisper',  # model family; `size` defaults to 'large'
       task='translate',
       sample_rate=16000,
       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
       ckpt_path=None,
       audio_file='./zh.wav',
       device=paddle.get_device())
   print('Translate Result: \n{}'.format(translation))
   ```
   输出：
   ```bash
   Transcribe Result:
   Detected language: Chinese
   [00:00.000 --> 00:05.000] 我认为跑步最重要的就是给我带来了身体健康
   {'text': '我认为跑步最重要的就是给我带来了身体健康', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': '我认为跑步最重要的就是给我带来了身体健康', 'tokens': [50364, 1654, 7422, 97, 13992, 32585, 31429, 8661, 24928, 1546, 5620, 49076, 4845, 99, 34912, 19847, 29485, 44201, 6346, 115, 50614], 'temperature': 0.0, 'avg_logprob': -0.23577967557040128, 'compression_ratio': 0.28169014084507044, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
   Translate Result:
   Detected language: Chinese
   [00:00.000 --> 00:05.000]  I think the most important thing about running is that it brings me good health.
   {'text': ' I think the most important thing about running is that it brings me good health.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': ' I think the most important thing about running is that it brings me good health.', 'tokens': [50364, 286, 519, 264, 881, 1021, 551, 466, 2614, 307, 300, 309, 5607, 385, 665, 1585, 13, 50614], 'temperature': 0.0, 'avg_logprob': -0.47945233395225123, 'compression_ratio': 1.095890410958904, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
   ```
--- a/demos/whisper/run.sh
+++ b/demos/whisper/run.sh
@@ -0,0 +1,10 @@
 #!/bin/bash
 # audio download
 wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 # to recognize text 
 paddlespeech whisper --task transcribe --input ./zh.wav
 # to recognize text and translate to English
 paddlespeech whisper --task translate --input ./zh.wav
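Review note on the demo output: the README prints each recognized segment as `[start --> end] text`, followed by the raw result dictionary. A minimal sketch of how the `segments` list in that dictionary could be rendered in the same form. `format_timestamp` and `render_segments` are hypothetical helpers written for this note, not functions from this commit, and hours are left out for brevity.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm (hours are not handled here)."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render_segments(result: dict) -> list:
    """Turn each segment of a result dictionary into one display line."""
    lines = []
    for seg in result["segments"]:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text']}")
    return lines


# Trimmed copy of the translate result shown in the README.
result = {
    "text": " I think the most important thing about running is that it brings me good health.",
    "segments": [{
        "id": 0,
        "start": 0.0,
        "end": 5.0,
        "text": " I think the most important thing about running is that it brings me good health.",
    }],
    "language": "zh",
}

lines = render_segments(result)
for line in lines:
    print(line)
```

The segment text keeps its leading space, which is why the README output shows two spaces after the closing bracket for the English translation.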
--- a/paddlespeech/cli/base_commands.py
+++ b/paddlespeech/cli/base_commands.py
@@ -83,7 +83,8 @@ model_name_format = {
    'st': 'Model-Source language-Target language',
    'text': 'Model-Task-Language',
    'tts': 'Model-Language',
-    'vector': 'Model-Sample Rate'
+    'vector': 'Model-Sample Rate',
+    'whisper': 'Model-Language-Sample Rate'
 }
@@ -94,7 +95,9 @@ class StatsCommand:
    def __init__(self):
        self.parser = argparse.ArgumentParser(
            prog='paddlespeech.stats', add_help=True)
-        self.task_choices = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws']
+        self.task_choices = [
+            'asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws', 'whisper'
+        ]
        self.parser.add_argument(
            '--task',
            type=str,
@@ -141,6 +144,10 @@ _commands = {
    'tts': ['Text to Speech infer command.', 'TTSExecutor'],
    'vector': ['Speech to vector embedding infer command.', 'VectorExecutor'],
    'kws': ['Keyword Spotting infer command.', 'KWSExecutor'],
+    'whisper': [
+        'Whisper model for speech to text or translate speech to English.',
+        'WhisperExecutor'
+    ]
 }
 for com, info in _commands.items():
--- a/paddlespeech/cli/whisper/__init__.py
+++ b/paddlespeech/cli/whisper/__init__.py
@@ -0,0 +1,14 @@
 # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from .infer import WhisperExecutor
--- a/paddlespeech/cli/whisper/infer.py
+++ b/paddlespeech/cli/whisper/infer.py
@@ -0,0 +1,468 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import argparse
 import io
 import os
 import sys
 import time
 from collections import OrderedDict
 from typing import List
 from typing import Optional
 from typing import Union
 import librosa
 import numpy as np
 import paddle
 import soundfile
 from yacs.config import CfgNode
 from ..download import get_path_from_url
 from ..executor import BaseExecutor
 from ..log import logger
 from ..utils import CLI_TIMER
 from ..utils import stats_wrapper
 from ..utils import timer_register
 from paddlespeech.s2t.models.whisper import log_mel_spectrogram
 from paddlespeech.s2t.models.whisper import ModelDimensions
 from paddlespeech.s2t.models.whisper import Whisper
 from paddlespeech.s2t.utils.utility import UpdateConfig
 __all__ = ['WhisperExecutor']
@timer_register
 class WhisperExecutor(BaseExecutor):
    def __init__(self):
        super().__init__('whisper')
        self.parser = argparse.ArgumentParser(
            prog='paddlespeech.whisper', add_help=True)
        self.parser.add_argument(
            '--input', type=str, default=None, help='Audio file to recognize.')
        self.parser.add_argument(
            '--model',
            type=str,
            default='whisper',
            choices=[
                tag[:tag.index('-')]
                for tag in self.task_resource.pretrained_models.keys()
            ],
            help='Choose model type of asr task.')
        self.parser.add_argument(
            '--lang',
            type=str,
            default='None',
            help='Choose model decode language. Default is None, recognized by model.'
        )
        self.parser.add_argument(
            '--task',
            type=str,
            default='transcribe',
            choices=["transcribe", "translate"],
            help='Choose task type: transcribe or translate.')
        self.parser.add_argument(
            '--size',
            type=str,
            default='large',
            help='Choose model size. Only large is supported now, large:[whisper-large-16k]'
        )
        self.parser.add_argument(
            "--sample_rate",
            type=int,
            default=16000,
            choices=[16000],
            help='Choose the audio sample rate of the model. Only 16000 is supported.')
        self.parser.add_argument(
            '--config',
            type=str,
            default=None,
            help='Config of asr task. Use default config when it is None.')
        self.parser.add_argument(
            '--decode_method',
            type=str,
            default='ctc_prefix_beam_search',
            choices=['ctc_greedy_search', 'ctc_prefix_beam_search'],
            help='only support transformer and conformer model')
        self.parser.add_argument(
            '--ckpt_path',
            type=str,
            default=None,
            help='Checkpoint file of model.')
        self.parser.add_argument(
            '--yes',
            '-y',
            action="store_true",
            default=False,
            help='No additional parameters required. \
            Once set this parameter, it means accepting the request of the program by default, \
            which includes transforming the audio sample rate')
        self.parser.add_argument(
            '--rtf',
            action="store_true",
            default=False,
            help='Show Real-time Factor(RTF).')
        self.parser.add_argument(
            '--device',
            type=str,
            default=paddle.get_device(),
            help='Choose device to execute model inference.')
        self.parser.add_argument(
            '-d',
            '--job_dump_result',
            action='store_true',
            help='Save job result into file.')
        self.parser.add_argument(
            '-v',
            '--verbose',
            action='store_true',
            help='Increase logger verbosity of current task.')
    def _init_from_path(self,
                        model_type: str='whisper',
                        lang: str='None',
                        task: str='transcribe',
                        size: str='large',
                        sample_rate: int=16000,
                        cfg_path: Optional[os.PathLike]=None,
                        decode_method: str='ctc_prefix_beam_search',
                        num_decoding_left_chunks: int=-1,
                        ckpt_path: Optional[os.PathLike]=None):
        """
        Init model and other resources from a specific path.
        """
        logger.debug("start to init the model")
        # default max_len: unit:second
        self.max_len = 50
        if hasattr(self, 'model'):
            logger.debug('Model had been initialized.')
            return
        if cfg_path is None or ckpt_path is None:
            sample_rate_str = '16k' if sample_rate == 16000 else '8k'
            tag = model_type + '-' + size + '-' + sample_rate_str
            self.task_resource.set_task_model(tag, version=None)
            self.res_path = self.task_resource.res_dir
            self.cfg_path = os.path.join(
                self.res_path, self.task_resource.res_dict['cfg_path'])
            self.ckpt_path = os.path.join(
                self.res_path,
                self.task_resource.res_dict['ckpt_path'] + ".pdparams")
            logger.debug(self.res_path)
        else:
            self.cfg_path = os.path.abspath(cfg_path)
            self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
            self.res_path = os.path.dirname(
                os.path.dirname(os.path.abspath(self.cfg_path)))
        logger.debug(self.cfg_path)
        logger.debug(self.ckpt_path)
        #Init body.
        self.config = CfgNode(new_allowed=True)
        self.config.merge_from_file(self.cfg_path)
        with UpdateConfig(self.config):
            if "whisper" in model_type:
                # NOTE: the 'resuource_*' keys are spelled as in pretrained_models.py.
                resource_url = self.task_resource.res_dict['resuource_data']
                resource_md5 = self.task_resource.res_dict['resuource_data_md5']
                resource_path = self.task_resource.res_dict['resuource_path']
                self.download_resource(resource_url, resource_path,
                                       resource_md5)
            else:
                raise Exception("wrong type")
        # load model
        model_dict = paddle.load(self.ckpt_path)
        dims = ModelDimensions(**model_dict["dims"])
        self.model = Whisper(dims)
        self.model.load_dict(model_dict)
        self.model.eval()
        #set task
        if task is not None:
            self.task = task
        #set language
        if lang is not None:
            self.language = lang
    def preprocess(self, model_type: str, input: Union[str, os.PathLike]):
        """
        Input preprocess and return paddle.Tensor stored in self.input.
        Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet).
        """
        audio_file = input
        if isinstance(audio_file, (str, os.PathLike)):
            logger.debug("Preprocess audio_file:" + audio_file)
        elif isinstance(audio_file, io.BytesIO):
            audio_file.seek(0)
        # Get the object for feature extraction
        # whisper hard-coded audio hyperparameters, params in paddlespeech/s2t/models/whisper/whisper.py
        logger.debug("read the audio file")
        audio, audio_sample_rate = soundfile.read(
            audio_file, dtype="float32", always_2d=True)
        if self.change_format:
            # `audio` was read as float32 above, so it is downmixed and
            # resampled directly; the int16 helpers would reject float input.
            if audio.shape[1] >= 2:
                audio = audio.mean(axis=1)
            else:
                audio = audio[:, 0]
            audio = librosa.resample(
                audio, orig_sr=audio_sample_rate, target_sr=self.sample_rate)
            audio_sample_rate = self.sample_rate
        else:
            audio = audio[:, 0]
        logger.debug(f"audio shape: {audio.shape}")
        # fbank
        audio = log_mel_spectrogram(audio)
        audio_len = paddle.to_tensor(audio.shape[0])
        #audio = paddle.to_tensor(audio, dtype='float32').unsqueeze(axis=0)
        self._inputs["audio"] = audio
        self._inputs["audio_len"] = audio_len
        logger.debug(f"audio feat shape: {audio.shape}")
        logger.debug("audio feat process success")
    @paddle.no_grad()
    def infer(self, model_type: str):
        """
        Model inference and result stored in self.output.
        """
        logger.debug("start to infer the model to get the output")
        cfg = self.config
        audio = self._inputs["audio"]
        if cfg.temperature_increment_on_fallback is not None:
            temperature = tuple(
                np.arange(cfg.temperature, 1.0 + 1e-6,
                          cfg.temperature_increment_on_fallback))
        else:
            temperature = [cfg.temperature]
        self._outputs["result"] = self.model.transcribe(
            audio,
            verbose=cfg.verbose,
            task=self.task,
            language=self.language,
            temperature=temperature,
            compression_ratio_threshold=cfg.compression_ratio_threshold,
            logprob_threshold=cfg.logprob_threshold,
            best_of=cfg.best_of,
            beam_size=cfg.beam_size,
            patience=cfg.patience,
            length_penalty=cfg.length_penalty,
            initial_prompt=cfg.initial_prompt,
            condition_on_previous_text=cfg.condition_on_previous_text,
            no_speech_threshold=cfg.no_speech_threshold)
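Review note on the temperature handling in `infer`: when `temperature_increment_on_fallback` is set, the `np.arange` call expands a single starting temperature into an ascending sequence ending at 1.0, presumably so that decoding can be retried at higher temperatures as in the upstream OpenAI implementation. A minimal pure-Python sketch of that expansion. `temperature_schedule` is a hypothetical helper, and the values `0.0` and `0.2` are assumed example settings, since the contents of `whisper.yaml` are not part of this diff.

```python
import math


def temperature_schedule(start, increment, stop=1.0):
    """Return the temperatures to try, lowest first."""
    if increment is None:
        # No fallback configured: a single temperature is used.
        return (start,)
    # Approximates np.arange(start, stop + 1e-6, increment); the small
    # epsilon keeps `stop` itself inside the half-open range.
    count = math.ceil((stop + 1e-6 - start) / increment)
    return tuple(start + i * increment for i in range(count))


schedule = temperature_schedule(0.0, 0.2)
print([round(t, 2) for t in schedule])
print(temperature_schedule(0.0, None))
```

Without the `1e-6` term the half-open range would stop at 0.8 and the final temperature of 1.0 would be dropped.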
    def postprocess(self) -> Union[str, os.PathLike]:
        """
            Output postprocess and return human-readable results such as texts and audio files.
        """
        return self._outputs["result"]
    def download_resource(self, url, lm_dir, md5sum):
        download_path = get_path_from_url(
            url=url,
            root_dir=lm_dir,
            md5sum=md5sum,
            decompress=True, )
    def _pcm16to32(self, audio):
        assert (audio.dtype == np.int16)
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits
        audio = audio / (2**(bits - 1))
        return audio
    def _pcm32to16(self, audio):
        assert (audio.dtype == np.float32)
        bits = np.iinfo(np.int16).bits
        audio = audio * (2**(bits - 1))
        audio = np.round(audio).astype("int16")
        return audio
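Review note on `_pcm16to32` and `_pcm32to16`: both helpers scale by `2**(bits - 1)`, which is 32768 for 16-bit audio. A minimal sketch of the same arithmetic on plain Python lists, so it runs without NumPy. `pcm16_to_float` and `float_to_pcm16` are hypothetical names, and the `clamp` option is an addition of this sketch that the helpers in the diff do not have.

```python
INT16_BITS = 16
SCALE = 2 ** (INT16_BITS - 1)  # 32768, the same factor the helpers use


def pcm16_to_float(samples):
    """Map int16 sample values to floats in [-1.0, 1.0)."""
    return [s / SCALE for s in samples]


def float_to_pcm16(samples, clamp=True):
    """Map floats back to integer sample values, rounding to nearest."""
    out = []
    for x in samples:
        value = round(x * SCALE)
        if clamp:
            # +1.0 scales to 32768, one past the largest int16 value.
            value = max(-SCALE, min(SCALE - 1, value))
        out.append(value)
    return out


pcm = [-32768, -1, 0, 1, 16384, 32767]
as_float = pcm16_to_float(pcm)
round_trip = float_to_pcm16(as_float)
print(as_float[0], as_float[-1])
print(round_trip == pcm)
print(float_to_pcm16([1.0]), float_to_pcm16([1.0], clamp=False))
```

Because the scale is a power of two, the int16 to float to int16 round trip is exact. The only value that needs care is a float sample of exactly +1.0, which falls outside the int16 range after scaling.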
    def _check(self, audio_file: str, sample_rate: int, force_yes: bool=False):
        self.sample_rate = sample_rate
        if self.sample_rate != 16000 and self.sample_rate != 8000:
            logger.error(
                "invalid sample rate, please input --sample_rate 8000 or --sample_rate 16000")
            return False
        if isinstance(audio_file, (str, os.PathLike)):
            if not os.path.isfile(audio_file):
                logger.error("Please input the right audio file path")
                return False
        elif isinstance(audio_file, io.BytesIO):
            audio_file.seek(0)
        logger.debug("checking the audio file format......")
        try:
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
            audio_duration = audio.shape[0] / audio_sample_rate
            if audio_duration > self.max_len:
                logger.error(
                    f"Please input an audio file shorter than {self.max_len} seconds.\n"
                )
                return False
        except Exception as e:
            logger.exception(e)
            logger.error(
                f"can not open the audio file, please check the audio file({audio_file}) format is 'wav'. \n \
                 you can try to use sox to change the file format.\n \
                 For example: \n \
                 sample rate: 16k \n \
                 sox input_audio.xx --rate 16k --bits 16 --channels 1 output_audio.wav \n \
                 sample rate: 8k \n \
                 sox input_audio.xx --rate 8k --bits 16 --channels 1 output_audio.wav \n \
                 ")
            return False
        logger.debug("The sample rate is %d" % audio_sample_rate)
        if audio_sample_rate != self.sample_rate:
            logger.warning("The sample rate of the input file is not {}.\n \
                            The program will resample the wav file to {}.\n \
                            If the result does not meet your expectations,\n \
                            Please input the 16k 16 bit 1 channel wav file. \
                        ".format(self.sample_rate, self.sample_rate))
            if force_yes is False:
                while (True):
                    logger.debug(
                        "Whether to change the sample rate and the channel. Y: change the sample. N: exit the program."
                    )
                    content = input("Input(Y/N):").strip()
                    if content in ("Y", "y", "yes", "Yes"):
                        logger.debug(
                            "change the sample rate, channel to 16k and 1 channel"
                        )
                        break
                    elif content in ("N", "n", "no", "No"):
                        logger.debug("Exit the program")
                        return False
                    else:
                        logger.warning("Not regular input, please input again")
            self.change_format = True
        else:
            logger.debug("The audio file format is right")
            self.change_format = False
        return True
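Review note on the confirmation prompt in `_check`: the loop accepts a fixed set of spellings for yes and no and asks again for anything else. A minimal sketch of the same decision as a pure function, which is easier to test than an interactive loop. `parse_confirmation` is a hypothetical helper, and it is slightly more permissive than the diff because it lowercases the answer, so `YES` is accepted as well.

```python
from typing import Optional

YES_ANSWERS = {"y", "yes"}
NO_ANSWERS = {"n", "no"}


def parse_confirmation(answer: str) -> Optional[bool]:
    """Return True for yes, False for no, None when the answer is unclear."""
    token = answer.strip().lower()
    if token in YES_ANSWERS:
        return True
    if token in NO_ANSWERS:
        return False
    return None


for raw in [" Yes ", "N", "maybe"]:
    print(repr(raw), "->", parse_confirmation(raw))
```

A caller would keep prompting while the function returns `None`, which matches the "please input again" branch of the loop.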
    def execute(self, argv: List[str]) -> bool:
        """
            Command line entry.
        """
        parser_args = self.parser.parse_args(argv)
        model = parser_args.model
        lang = parser_args.lang
        task = parser_args.task
        size = parser_args.size
        sample_rate = parser_args.sample_rate
        config = parser_args.config
        ckpt_path = parser_args.ckpt_path
        decode_method = parser_args.decode_method
        force_yes = parser_args.yes
        rtf = parser_args.rtf
        device = parser_args.device
        if not parser_args.verbose:
            self.disable_task_loggers()
        task_source = self.get_input_source(parser_args.input)
        task_results = OrderedDict()
        has_exceptions = False
        for id_, input_ in task_source.items():
            try:
                res = self(
                    audio_file=input_,
                    model=model,
                    lang=lang,
                    task=task,
                    size=size,
                    sample_rate=sample_rate,
                    config=config,
                    ckpt_path=ckpt_path,
                    decode_method=decode_method,
                    force_yes=force_yes,
                    rtf=rtf,
                    device=device)
                task_results[id_] = res
            except Exception as e:
                has_exceptions = True
                task_results[id_] = f'{e.__class__.__name__}: {e}'
        if rtf:
            self.show_rtf(CLI_TIMER[self.__class__.__name__])
        self.process_task_results(parser_args.input, task_results,
                                  parser_args.job_dump_result)
        if has_exceptions:
            return False
        else:
            return True
    @stats_wrapper
    def __call__(self,
                 audio_file: os.PathLike,
                 model: str='whisper',
                 lang: str='None',
                 task: str='transcribe',
                 size: str='large',
                 sample_rate: int=16000,
                 config: os.PathLike=None,
                 ckpt_path: os.PathLike=None,
                 decode_method: str='attention_rescoring',
                 num_decoding_left_chunks: int=-1,
                 force_yes: bool=False,
                 rtf: bool=False,
                 device=paddle.get_device()):
        """
        Python API to call an executor.
        """
        audio_file = os.path.abspath(audio_file)
        paddle.set_device(device)
        self._init_from_path(model, lang, task, size, sample_rate, config,
                             decode_method, num_decoding_left_chunks, ckpt_path)
        if not self._check(audio_file, sample_rate, force_yes):
            sys.exit(-1)
        if rtf:
            k = self.__class__.__name__
            CLI_TIMER[k]['start'].append(time.time())
        self.preprocess(model, audio_file)
        self.infer(model)
        res = self.postprocess()  # Retrieve result of asr.
        if rtf:
            CLI_TIMER[k]['end'].append(time.time())
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
            CLI_TIMER[k]['extra'].append(audio.shape[0] / audio_sample_rate)
        return res
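Review note on model selection: `_init_from_path` builds the resource tag as `model_type + '-' + size + '-' + sample_rate_str`, and the `--model` choices are derived from the part of each registered tag before the first `-`. A minimal sketch of both steps using the single tag registered in this commit. `model_choices` and `build_tag` are hypothetical stand-ins for the inline expressions in the diff.

```python
# The only tag registered in whisper_dynamic_pretrained_models in this commit.
PRETRAINED_TAGS = ["whisper-large-16k"]


def model_choices(tags):
    """Model names offered by --model: the text before the first '-'."""
    return [tag[:tag.index("-")] for tag in tags]


def build_tag(model_type: str, size: str, sample_rate: int) -> str:
    """Compose the resource tag the same way _init_from_path does."""
    sample_rate_str = "16k" if sample_rate == 16000 else "8k"
    return model_type + "-" + size + "-" + sample_rate_str


print(model_choices(PRETRAINED_TAGS))
print(build_tag("whisper", "large", 16000))
print(build_tag("whisper-large", "large", 16000) in PRETRAINED_TAGS)
```

With the defaults the tag resolves to `whisper-large-16k`. A model name that already contains the size, such as `whisper-large`, yields `whisper-large-large-16k`, which is not a registered tag.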
--- a/paddlespeech/resource/model_alias.py
+++ b/paddlespeech/resource/model_alias.py
@@ -29,6 +29,11 @@ model_alias = {
    "transformer": ["paddlespeech.s2t.models.u2:U2Model"],
    "wenetspeech": ["paddlespeech.s2t.models.u2:U2Model"],
    # ---------------------------------
    # ------------ Whisper ------------
    # ---------------------------------
    "whisper": ["paddlespeech.s2t.models.whisper:Whisper"],
    # ---------------------------------
    # -------------- CLS --------------
    # ---------------------------------
--- a/paddlespeech/resource/pretrained_models.py
+++ b/paddlespeech/resource/pretrained_models.py
@@ -25,6 +25,7 @@ __all__ = [
    'tts_static_pretrained_models',
    'tts_onnx_pretrained_models',
    'vector_dynamic_pretrained_models',
    'whisper_dynamic_pretrained_models',
 ]
 # The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
@@ -424,6 +425,31 @@ asr_onnx_pretrained_models = {
    },
 }
 whisper_dynamic_pretrained_models = {
    "whisper-large-16k": {
        '1.3': {
            'url':
            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/whisper-large-model.tar.gz',
            'md5':
            '364c4d670835e5ca489045e1c29d75fe',
            'cfg_path':
            'whisper.yaml',
            'ckpt_path':
            'whisper-large-model',
            'model':
            'whisper-large-model.pdparams',
            'params':
            'whisper-large-model.pdparams',
            'resuource_data':
            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
            'resuource_data_md5':
            '37a0a8abdb3641a51194f79567a93b61',
            'resuource_path':
            'paddlespeech/s2t/models/whisper',
        },
    },
 }
 # ---------------------------------
 # -------------- CLS --------------
 # ---------------------------------
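Review note on the `md5` and `resuource_data_md5` entries: each download URL is paired with an MD5 digest, which `download_resource` passes to `get_path_from_url` as `md5sum`. How that function verifies the file is not shown in this diff, so the following is only a generic sketch of such a check using the standard library. `md5_of_file` and `matches_md5` are hypothetical helpers, and the temporary file stands in for a downloaded archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """Compare a file's digest with the expected hex string."""
    return md5_of_file(path) == expected.lower()


# Stand-in for a downloaded archive.
payload = b"whisper demo payload"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

ok = matches_md5(demo_path, hashlib.md5(payload).hexdigest())
# The digest registered for the model archive does not match the demo file.
bad = matches_md5(demo_path, "364c4d670835e5ca489045e1c29d75fe")
os.remove(demo_path)
print(ok, bad)
```

Reading in chunks keeps memory use flat, which matters for an archive the size of a large model checkpoint.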
--- a/paddlespeech/resource/resource.py
+++ b/paddlespeech/resource/resource.py
@@ -22,7 +22,7 @@ from ..utils.dynamic_import import dynamic_import
 from ..utils.env import MODEL_HOME
 from .model_alias import model_alias
-task_supported = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws']
+task_supported = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws', 'whisper']
 model_format_supported = ['dynamic', 'static', 'onnx']
 inference_mode_supported = ['online', 'offline']
--- a/paddlespeech/s2t/exps/whisper/test_wav.py
+++ b/paddlespeech/s2t/exps/whisper/test_wav.py
@@ -0,0 +1,122 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # Modified from Whisper (https://github.com/openai/whisper/whisper/)
 import os.path
 import sys
 import distutils
 import numpy as np
 import paddle
 import soundfile
 from yacs.config import CfgNode
 from paddlespeech.s2t.models.whisper import log_mel_spectrogram
 from paddlespeech.s2t.models.whisper import ModelDimensions
 from paddlespeech.s2t.models.whisper import transcribe
 from paddlespeech.s2t.models.whisper import Whisper
 from paddlespeech.s2t.training.cli import default_argument_parser
 from paddlespeech.s2t.utils.log import Log
 logger = Log(__name__).getlog()
 class WhisperInfer():
    def __init__(self, config, args):
        self.args = args
        self.config = config
        self.audio_file = args.audio_file
        paddle.set_device('gpu' if self.args.ngpu > 0 else 'cpu')
        config.pop("ngpu")
        #load_model
        model_dict = paddle.load(self.config.model_file)
        config.pop("model_file")
        dims = ModelDimensions(**model_dict["dims"])
        self.model = Whisper(dims)
        self.model.load_dict(model_dict)
    def run(self):
        check(self.audio_file)
        with paddle.no_grad():
            temperature = self.config.pop("temperature")
            temperature_increment_on_fallback = self.config.pop(
                "temperature_increment_on_fallback")
            if temperature_increment_on_fallback is not None:
                # fallback schedule: temperature, temperature + step, ..., up to 1.0
                temperature = tuple(
                    np.arange(temperature, 1.0 + 1e-6,
                              temperature_increment_on_fallback))
            else:
                temperature = [temperature]
            # load audio
            mel = log_mel_spectrogram(self.audio_file)
            result = transcribe(
                self.model, mel, temperature=temperature, **self.config)
            if self.args.result_file is not None:
                with open(self.args.result_file, 'w') as f:
                    f.write(str(result))
            return result
 def check(audio_file: str):
    if not os.path.isfile(audio_file):
        print("Please input the right audio file path")
        sys.exit(-1)
    logger.info("checking the audio file format......")
    try:
        _, sample_rate = soundfile.read(audio_file)
    except Exception as e:
        logger.error(str(e))
        logger.error(
            "can not open the wav file, please check the audio file format")
        sys.exit(-1)
    logger.info("The sample rate is %d" % sample_rate)
    assert (sample_rate == 16000)
    logger.info("The audio file format is right")
 def main(config, args):
    WhisperInfer(config, args).run()
 if __name__ == "__main__":
    parser = default_argument_parser()
    # save asr result to
    parser.add_argument(
        "--result_file", type=str, help="path to save the asr result")
    parser.add_argument(
        "--audio_file", type=str, help="path of the input audio file")
    parser.add_argument(
        "--debug",
        type=distutils.util.strtobool,
        default=False,
        help="for debug.")
    args = parser.parse_args()
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)
    if args.decode_cfg:
        decode_confs = CfgNode(new_allowed=True)
        decode_confs.merge_from_file(args.decode_cfg)
        config.decode = decode_confs
    if args.opts:
        config.merge_from_list(args.opts)
    config.freeze()
    main(config, args)
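
The temperature handling in `WhisperInfer.run` expands a starting temperature and a fallback increment into a tuple of temperatures. A minimal sketch of that expansion follows, written without NumPy so it runs on its own. The helper name `temperature_schedule` is hypothetical and not part of this commit; it is assumed to mirror `np.arange(temperature, 1.0 + 1e-6, increment)`.

```python
import math
from typing import Optional, Tuple


def temperature_schedule(temperature: float,
                         increment: Optional[float]) -> Tuple[float, ...]:
    """Hypothetical helper: expand a start temperature into a fallback schedule."""
    if increment is None:
        # No fallback configured: a single temperature is used.
        return (temperature, )
    # Same length rule as np.arange(start, stop, step): ceil((stop - start) / step).
    stop = 1.0 + 1e-6
    count = max(0, math.ceil((stop - temperature) / increment))
    return tuple(temperature + i * increment for i in range(count))


schedule = temperature_schedule(0.0, 0.2)
print([round(t, 2) for t in schedule])  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
print(temperature_schedule(0.5, None))  # (0.5,)
```

The small `1e-6` added to the upper bound is what lets the schedule include `1.0` itself despite floating-point rounding.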
--- a/paddlespeech/s2t/models/whisper/__init__.py
+++ b/paddlespeech/s2t/models/whisper/__init__.py
@ -0,0 +1,12 @@
 # MIT License, Copyright (c) 2022 OpenAI.
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 # 
 # Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/__init__.py)
 from paddlespeech.s2t.models.whisper.whipser import decode
 from paddlespeech.s2t.models.whisper.whipser import DecodingOptions
 from paddlespeech.s2t.models.whisper.whipser import DecodingResult
 from paddlespeech.s2t.models.whisper.whipser import detect_language
 from paddlespeech.s2t.models.whisper.whipser import log_mel_spectrogram
 from paddlespeech.s2t.models.whisper.whipser import ModelDimensions
 from paddlespeech.s2t.models.whisper.whipser import transcribe
 from paddlespeech.s2t.models.whisper.whipser import Whisper
--- a/paddlespeech/s2t/models/whisper/tokenizer.py
+++ b/paddlespeech/s2t/models/whisper/tokenizer.py
@ -0,0 +1,360 @@
 # MIT License, Copyright (c) 2022 OpenAI.
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 # 
 # Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/tokenizer.py)
 import os
 from dataclasses import dataclass
 from functools import lru_cache
 from typing import List
 from typing import Optional
 from typing import Tuple
 from typing import Union
 import numpy as np
 import paddle
 from paddlenlp.transformers import GPTTokenizer
 LANGUAGES = {
    "en": "english",
    "zh": "chinese",
    "de": "german",
    "es": "spanish",
    "ru": "russian",
    "ko": "korean",
    "fr": "french",
    "ja": "japanese",
    "pt": "portuguese",
    "tr": "turkish",
    "pl": "polish",
    "ca": "catalan",
    "nl": "dutch",
    "ar": "arabic",
    "sv": "swedish",
    "it": "italian",
    "id": "indonesian",
    "hi": "hindi",
    "fi": "finnish",
    "vi": "vietnamese",
    "iw": "hebrew",
    "uk": "ukrainian",
    "el": "greek",
    "ms": "malay",
    "cs": "czech",
    "ro": "romanian",
    "da": "danish",
    "hu": "hungarian",
    "ta": "tamil",
    "no": "norwegian",
    "th": "thai",
    "ur": "urdu",
    "hr": "croatian",
    "bg": "bulgarian",
    "lt": "lithuanian",
    "la": "latin",
    "mi": "maori",
    "ml": "malayalam",
    "cy": "welsh",
    "sk": "slovak",
    "te": "telugu",
    "fa": "persian",
    "lv": "latvian",
    "bn": "bengali",
    "sr": "serbian",
    "az": "azerbaijani",
    "sl": "slovenian",
    "kn": "kannada",
    "et": "estonian",
    "mk": "macedonian",
    "br": "breton",
    "eu": "basque",
    "is": "icelandic",
    "hy": "armenian",
    "ne": "nepali",
    "mn": "mongolian",
    "bs": "bosnian",
    "kk": "kazakh",
    "sq": "albanian",
    "sw": "swahili",
    "gl": "galician",
    "mr": "marathi",
    "pa": "punjabi",
    "si": "sinhala",
    "km": "khmer",
    "sn": "shona",
    "yo": "yoruba",
    "so": "somali",
    "af": "afrikaans",
    "oc": "occitan",
    "ka": "georgian",
    "be": "belarusian",
    "tg": "tajik",
    "sd": "sindhi",
    "gu": "gujarati",
    "am": "amharic",
    "yi": "yiddish",
    "lo": "lao",
    "uz": "uzbek",
    "fo": "faroese",
    "ht": "haitian creole",
    "ps": "pashto",
    "tk": "turkmen",
    "nn": "nynorsk",
    "mt": "maltese",
    "sa": "sanskrit",
    "lb": "luxembourgish",
    "my": "myanmar",
    "bo": "tibetan",
    "tl": "tagalog",
    "mg": "malagasy",
    "as": "assamese",
    "tt": "tatar",
    "haw": "hawaiian",
    "ln": "lingala",
    "ha": "hausa",
    "ba": "bashkir",
    "jw": "javanese",
    "su": "sundanese",
 }
 # language code lookup by name, with a few language aliases
 TO_LANGUAGE_CODE = {
    **{language: code for code, language in LANGUAGES.items()},
    "burmese": "my",
    "valencian": "ca",
    "flemish": "nl",
    "haitian": "ht",
    "letzeburgesch": "lb",
    "pushto": "ps",
    "panjabi": "pa",
    "moldavian": "ro",
    "moldovan": "ro",
    "sinhalese": "si",
    "castilian": "es",
 }
@dataclass(frozen=True)
 class Tokenizer:
    """A thin wrapper around `GPTTokenizer` providing quick access to special tokens"""
    tokenizer: "GPTTokenizer"
    language: Optional[str]
    sot_sequence: Tuple[int]
    def encode(self, text, **kwargs):
        return self.tokenizer.encode(text, **kwargs)
    def decode(self,
               token_ids: Union[int, List[int], np.ndarray, paddle.Tensor],
               **kwargs):
        if len(token_ids) > 1:
            ids_list = []
            for ids in token_ids:
                if paddle.is_tensor(ids):
                    ids = ids.item()
                if ids < len(self.tokenizer):
                    ids_list.append(ids)
            token_ids = ids_list
        return self.tokenizer.decode(token_ids, **kwargs)
    def decode_with_timestamps(self, tokens) -> str:
        """
        Timestamp tokens are above the special tokens' id range and are ignored by `decode()`.
        This method decodes given tokens with timestamps tokens annotated, e.g. "<|1.08|>".
        """
        outputs = [[]]
        for token in tokens:
            if token >= self.timestamp_begin:
                timestamp = f"<|{(token - self.timestamp_begin) * 0.02:.2f}|>"
                outputs.append(timestamp)
                outputs.append([])
            else:
                outputs[-1].append(token)
        outputs = [
            s if isinstance(s, str) else self.tokenizer.decode(s)
            for s in outputs
        ]
        return "".join(outputs)
    @property
    @lru_cache()
    def eot(self) -> int:
        return self.tokenizer.eos_token_id
    @property
    @lru_cache()
    def sot(self) -> int:
        return self._get_single_token_id("<|startoftranscript|>")
    @property
    @lru_cache()
    def sot_lm(self) -> int:
        return self._get_single_token_id("<|startoflm|>")
    @property
    @lru_cache()
    def sot_prev(self) -> int:
        return self._get_single_token_id("<|startofprev|>")
    @property
    @lru_cache()
    def no_speech(self) -> int:
        return self._get_single_token_id("<|nospeech|>")
    @property
    @lru_cache()
    def no_timestamps(self) -> int:
        return self._get_single_token_id("<|notimestamps|>")
    @property
    @lru_cache()
    def timestamp_begin(self) -> int:
        return self.tokenizer.all_special_ids[-1] + 1
    @property
    @lru_cache()
    def language_token(self) -> int:
        """Returns the token id corresponding to the value of the `language` field"""
        if self.language is None:
            raise ValueError(
                "This tokenizer does not have language token configured")
        additional_tokens = dict(
            zip(
                self.tokenizer.additional_special_tokens,
                self.tokenizer.additional_special_tokens_ids, ))
        candidate = f"<|{self.language}|>"
        if candidate in additional_tokens:
            return additional_tokens[candidate]
        raise KeyError(f"Language {self.language} not found in tokenizer.")
    @property
    @lru_cache()
    def all_language_tokens(self) -> Tuple[int]:
        result = []
        for token, token_id in zip(
                self.tokenizer.additional_special_tokens,
                self.tokenizer.additional_special_tokens_ids, ):
            if token.strip("<|>") in LANGUAGES:
                result.append(token_id)
        return tuple(result)
    @property
    @lru_cache()
    def all_language_codes(self) -> Tuple[str]:
        return tuple(
            self.decode([l]).strip("<|>") for l in self.all_language_tokens)
    @property
    @lru_cache()
    def sot_sequence_including_notimestamps(self) -> Tuple[int]:
        return tuple(list(self.sot_sequence) + [self.no_timestamps])
    @property
    @lru_cache()
    def non_speech_tokens(self) -> Tuple[int]:
        """
        Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech
        annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.
        - ♪♪♪
        - ( SPEAKING FOREIGN LANGUAGE )
        - [DAVID] Hey there,
        keeping basic punctuations like commas, periods, question marks, exclamation points, etc.
        """
        symbols = list("\"#()*+/:;<=>@[\\]^_`{|}~「」『』")
        symbols += "<< >> <<< >>> -- --- -( -[ (' (\" (( )) ((( ))) [[ ]] {{ }} ♪♪ ♪♪♪".split(
        )
        # symbols that may be a single token or multiple tokens depending on the tokenizer.
        # In case they're multiple tokens, suppress the first token, which is safe because:
        # These are between U+2640 and U+267F miscellaneous symbols that are okay to suppress
        # in generations, and in the 3-byte UTF-8 representation they share the first two bytes.
        miscellaneous = set("♩♪♫♬♭♮♯")
        assert all(0x2640 <= ord(c) <= 0x267F for c in miscellaneous)
        # allow hyphens "-" and single quotes "'" between words, but not at the beginning of a word
        result = {
            self.tokenizer.encode(" -").input_ids[0],
            self.tokenizer.encode(" '").input_ids[0]
        }
        for symbol in symbols + list(miscellaneous):
            for tokens in [
                    self.tokenizer.encode(symbol).input_ids,
                    self.tokenizer.encode(" " + symbol).input_ids
            ]:
                if len(tokens) == 1 or symbol in miscellaneous:
                    result.add(tokens[0])
        return tuple(sorted(result))
    def _get_single_token_id(self, text) -> int:
        tokens = self.tokenizer.encode(text).input_ids
        assert len(tokens) == 1, f"{text} is not encoded as a single token"
        return tokens[0]
@lru_cache(maxsize=None)
 def build_tokenizer(name: str="gpt2"):
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    path = os.path.join(os.path.dirname(__file__), "assets", name)
    tokenizer = GPTTokenizer.from_pretrained(path)
    specials = [
        "<|startoftranscript|>",
        * [f"<|{lang}|>" for lang in LANGUAGES.keys()],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
    ]
    tokenizer.add_special_tokens(dict(additional_special_tokens=specials))
    return tokenizer
@lru_cache(maxsize=None)
 def get_tokenizer(
        multilingual: bool,
        *,
        task: Optional[str]=None,  # Literal["transcribe", "translate", None]
        language: Optional[str]=None, ) -> Tokenizer:
    if language is not None:
        language = language.lower()
        if language not in LANGUAGES:
            if language in TO_LANGUAGE_CODE:
                language = TO_LANGUAGE_CODE[language]
            else:
                raise ValueError(f"Unsupported language: {language}")
    if multilingual:
        tokenizer_name = "multilingual"
        task = task or "transcribe"
        language = language or "en"
    else:
        tokenizer_name = "gpt2"
        task = None
        language = None
    tokenizer = build_tokenizer(name=tokenizer_name)
    all_special_ids: List[int] = tokenizer.all_special_ids
    sot: int = all_special_ids[1]
    translate: int = all_special_ids[-6]
    transcribe: int = all_special_ids[-5]
    langs = tuple(LANGUAGES.keys())
    sot_sequence = [sot]
    if language is not None:
        sot_sequence.append(sot + 1 + langs.index(language))
    if task is not None:
        sot_sequence.append(transcribe if task == "transcribe" else translate)
    return Tokenizer(
        tokenizer=tokenizer,
        language=language,
        sot_sequence=tuple(sot_sequence))
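
`get_tokenizer` relies on the special tokens receiving consecutive ids in the order in which `build_tokenizer` lists them: the language token is found as `sot + 1 + langs.index(language)`, and the task tokens are read from fixed positions at the end of `all_special_ids`. The sketch below illustrates that layout with a three-language subset of `LANGUAGES`. The id `EOT_ID = 1000` and the helper names are hypothetical, chosen only for illustration; real ids depend on the vocabulary that is loaded.

```python
from typing import List, Tuple

# Three-language subset of the LANGUAGES table, kept short for illustration.
LANGUAGES = {"en": "english", "zh": "chinese", "es": "spanish"}
TO_LANGUAGE_CODE = {
    **{name: code for code, name in LANGUAGES.items()},
    "castilian": "es",  # alias taken from the table in tokenizer.py
}

# Hypothetical id of "<|endoftext|>"; the real value depends on the vocabulary.
EOT_ID = 1000


def special_token_ids() -> List[int]:
    """Assumed layout: specials get consecutive ids right after EOT_ID."""
    specials = [
        "<|startoftranscript|>",
        *[f"<|{code}|>" for code in LANGUAGES],
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
    ]
    return [EOT_ID] + [EOT_ID + 1 + i for i in range(len(specials))]


def resolve_language(language: str) -> str:
    """Accept a code ("zh"), a name ("chinese") or an alias ("castilian")."""
    language = language.lower()
    if language in LANGUAGES:
        return language
    if language in TO_LANGUAGE_CODE:
        return TO_LANGUAGE_CODE[language]
    raise ValueError(f"Unsupported language: {language}")


def build_sot_sequence(task: str, language: str) -> Tuple[int, ...]:
    ids = special_token_ids()
    sot, translate, transcribe = ids[1], ids[-6], ids[-5]
    # Language tokens directly follow <|startoftranscript|>, in table order.
    lang_id = sot + 1 + tuple(LANGUAGES).index(resolve_language(language))
    task_id = transcribe if task == "transcribe" else translate
    return (sot, lang_id, task_id)


print(build_sot_sequence("transcribe", "Castilian"))  # (1001, 1004, 1006)
print(build_sot_sequence("translate", "zh"))          # (1001, 1003, 1005)
```

Because the lookup is purely positional, adding or reordering entries in the special-token list would silently shift the task and language ids, so the index arithmetic and the list have to be kept in sync.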
--- a/paddlespeech/s2t/models/whisper/utils.py
+++ b/paddlespeech/s2t/models/whisper/utils.py
@ -0,0 +1,92 @@
 # MIT License, Copyright (c) 2022 OpenAI.
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 # 
 # Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/utils.py)
 import zlib
 from typing import Iterator
 from typing import TextIO
 def exact_div(x, y):
    assert x % y == 0
    return x // y
 def str2bool(string):
    str2val = {"True": True, "False": False}
    if string in str2val:
        return str2val[string]
    else:
        raise ValueError(f"Expected one of {set(str2val.keys())}, got {string}")
 def optional_int(string):
    return None if string == "None" else int(string)
 def optional_float(string):
    return None if string == "None" else float(string)
 def compression_ratio(text) -> float:
    return len(text) / len(zlib.compress(text.encode("utf-8")))
 def format_timestamp(seconds: float,
                     always_include_hours: bool=False,
                     decimal_marker: str='.'):
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)
    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000
    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000
    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000
    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
 def write_txt(transcript: Iterator[dict], file: TextIO):
    for segment in transcript:
        print(segment['text'].strip(), file=file, flush=True)
 def write_vtt(transcript: Iterator[dict], file: TextIO):
    print("WEBVTT\n", file=file)
    for segment in transcript:
        print(
            f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n"
            f"{segment['text'].strip().replace('-->', '->')}\n",
            file=file,
            flush=True, )
 def write_srt(transcript: Iterator[dict], file: TextIO):
    """
    Write a transcript to a file in SRT format.
    Example usage:
        from pathlib import Path
        from paddlespeech.s2t.models.whisper.utils import write_srt
        result = transcribe(model, audio_path, temperature=temperature, **args)
        # save SRT
        audio_basename = Path(audio_path).stem
        with open(Path(output_dir) / (audio_basename + ".srt"), "w", encoding="utf-8") as srt:
            write_srt(result["segments"], file=srt)
    """
    for i, segment in enumerate(transcript, start=1):
        # write srt lines
        print(
            f"{i}\n"
            f"{format_timestamp(segment['start'], always_include_hours=True, decimal_marker=',')} --> "
            f"{format_timestamp(segment['end'], always_include_hours=True, decimal_marker=',')}\n"
            f"{segment['text'].strip().replace('-->', '->')}\n",
            file=file,
            flush=True, )
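
To make the timestamp formatting in `utils.py` concrete, the sketch below restates `format_timestamp` and `write_srt` in standalone form and writes two segments to an in-memory buffer. The segment values are hypothetical and exist only for illustration; `divmod` is used in place of the repeated floor-division and subtraction, which is assumed to be equivalent.

```python
import io
from typing import Iterator, TextIO


def format_timestamp(seconds: float,
                     always_include_hours: bool = False,
                     decimal_marker: str = ".") -> str:
    # Work in integer milliseconds to avoid floating-point drift.
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


def write_srt(transcript: Iterator[dict], file: TextIO) -> None:
    # SRT uses 1-based cue numbers, mandatory hours and a comma before the ms.
    for i, segment in enumerate(transcript, start=1):
        start = format_timestamp(
            segment["start"], always_include_hours=True, decimal_marker=",")
        end = format_timestamp(
            segment["end"], always_include_hours=True, decimal_marker=",")
        # "-->" inside the text would be read as a cue separator, so soften it.
        text = segment["text"].strip().replace("-->", "->")
        print(f"{i}\n{start} --> {end}\n{text}\n", file=file, flush=True)


# Hypothetical segments, shaped like the entries of result["segments"].
segments = [
    {"start": 0.0, "end": 1.08, "text": " hello "},
    {"start": 3661.5, "end": 3662.0, "text": "a --> b"},
]
buffer = io.StringIO()
write_srt(segments, file=buffer)
print(buffer.getvalue())
```

The WebVTT writer differs only in its header line, the optional hours field and the `.` decimal marker, which is why both writers share `format_timestamp`.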
--- a/paddlespeech/s2t/models/whisper/whipser.py
+++ b/paddlespeech/s2t/models/whisper/whipser.py
--- a/paddlespeech/s2t/models/whisper/whisper_LICENSE
+++ b/paddlespeech/s2t/models/whisper/whisper_LICENSE
@ -0,0 +1,21 @@
 MIT License
 Copyright (c) 2022 OpenAI
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
--- a/tests/unit/cli/test_cli.sh
+++ b/tests/unit/cli/test_cli.sh
@ -94,5 +94,11 @@ paddlespeech stats --task text
 paddlespeech stats --task vector
 paddlespeech stats --task st
 # whisper: recognize text
 paddlespeech whisper --task transcribe --input ./zh.wav
 # whisper recognize text and translate to English
 paddlespeech whisper --task translate --input ./zh.wav
 echo -e "\033[32mTest success !!!\033[0m"