Add st demo.

pull/1089/head
KP 3 years ago
parent 3e780dfe1f
commit 70a8a75476

@@ -3,7 +3,7 @@
 ## Introduction
 Audio tagging is the task of labelling an audio clip with one or more labels or tags, including music tagging, acoustic scene classification, audio event classification, etc.
-This demo is an implementation to tag an audio file with 527 [AudioSet](https://research.google.com/audioset/) labels. It can be done by a single command line or a few lines in python using `PaddleSpeech`.
+This demo is an implementation to tag an audio file with 527 [AudioSet](https://research.google.com/audioset/) labels. It can be done by a single command or a few lines in python using `PaddleSpeech`.
 ## Usage
 ### 1. Installation
@@ -86,7 +86,7 @@ wget https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav
 ### 4.Pretrained Models
-Here is a list of pretrained models released by PaddleSpeech and can be used by command and python api:
+Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
 | Model | Sample Rate |
 | :--- | :---: |

@@ -3,7 +3,7 @@
 ## Introduction
 ASR, or Automatic Speech Recognition, refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text).
-This demo is an implementation to recognize text from a specific audio file. It can be done by a single command line or a few lines in python using `PaddleSpeech`.
+This demo is an implementation to recognize text from a specific audio file. It can be done by a single command or a few lines in python using `PaddleSpeech`.
 ## Usage
 ### 1. Installation
@@ -32,7 +32,7 @@ wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 - `input`(required): Audio file to recognize.
 - `model`: Model type of asr task. Default: `conformer_wenetspeech`.
 - `lang`: Model language. Default: `zh`.
-- `sr`: Sample rate of the model. Default: `16000`.
+- `sample_rate`: Sample rate of the model. Default: `16000`.
 - `config`: Config of asr task. Use pretrained model when it is None. Default: `None`.
 - `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
 - `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment.
@@ -68,7 +68,7 @@ wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 ### 4.Pretrained Models
-Here is a list of pretrained models released by PaddleSpeech and can be used by command and python api:
+Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
 | Model | Language | Sample Rate |
 | :--- | :---: | :---: |

@@ -0,0 +1,77 @@
# Speech Translation
## Introduction
Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language.
This demo is an implementation to recognize text from a specific audio file and translate it to the target language. It can be done by a single command or a few lines in python using `PaddleSpeech`.
## Usage
### 1. Installation
```bash
pip install paddlespeech
```
### 2. Prepare Input File
Input of this demo should be a WAV file (`.wav`).
Here are sample files for this demo that can be downloaded:
```bash
wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
### 3. Usage
- Command Line (Recommended)
```bash
paddlespeech st --input ~/en.wav
```
Usage:
```bash
paddlespeech st --help
```
Arguments:
- `input`(required): Audio file to recognize and translate.
- `model`: Model type of st task. Default: `fat_st_ted`.
- `src_lang`: Source language. Default: `en`.
- `tgt_lang`: Target language. Default: `zh`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
- `config`: Config of st task. Use pretrained model when it is None. Default: `None`.
- `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
- `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment.
Output:
```bash
[2021-12-09 11:13:03,178] [ INFO] [utils.py] [L225] - ST Result: ['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
- Python API
```python
import paddle
from paddlespeech.cli import STExecutor
st_executor = STExecutor()
text = st_executor(
model='fat_st_ted',
src_lang='en',
tgt_lang='zh',
sample_rate=16000,
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./en.wav',
device=paddle.get_device())
print('ST Result: \n{}'.format(text))
```
Output:
```bash
ST Result:
['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
### 4.Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
| Model | Source Language | Target Language |
| :--- | :---: | :---: |
| fat_st_ted | en | zh |

@@ -88,6 +88,7 @@ class ASRExecutor(BaseExecutor):
             '--model',
             type=str,
             default='conformer_wenetspeech',
+            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
             help='Choose model type of asr task.')
         self.parser.add_argument(
             '--lang',
@@ -95,7 +96,7 @@ class ASRExecutor(BaseExecutor):
             default='zh',
             help='Choose model language. zh or en')
         self.parser.add_argument(
-            "--sr",
+            "--sample_rate",
             type=int,
             default=16000,
             choices=[8000, 16000],
@@ -200,8 +201,8 @@ class ASRExecutor(BaseExecutor):
             raise Exception("wrong type")
         # Enter the path of model root
-        model_name = ''.join(
-            model_type.split('_')[:-1])  # model_type: {model_name}_{dataset}
+        model_name = model_type[:model_type.rindex(
+            '_')]  # model_type: {model_name}_{dataset}
         model_class = dynamic_import(model_name, model_alias)
         model_conf = self.config.model
         logger.info(model_conf)
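Note on the hunk above: the new slicing keeps every underscore inside the model name, which the old `''.join(model_type.split('_')[:-1])` silently dropped. A minimal sketch of the difference (the multi-part name below is illustrative):

```python
# Two ways to strip the trailing "_{dataset}" part of a model_type string.
def old_model_name(model_type: str) -> str:
    # Joins all but the last "_" chunk with no separator,
    # so underscores inside the model name itself are lost.
    return ''.join(model_type.split('_')[:-1])

def new_model_name(model_type: str) -> str:
    # Cuts at the last underscore, keeping the name intact.
    return model_type[:model_type.rindex('_')]

print(old_model_name('conformer_wenetspeech'))        # conformer
print(new_model_name('conformer_wenetspeech'))        # conformer
# A multi-part name (hypothetical here) shows where the old version breaks:
print(old_model_name('deepspeech2_offline_aishell'))  # deepspeech2offline
print(new_model_name('deepspeech2_offline_aishell'))  # deepspeech2_offline
```

Both agree for single-underscore names, so the change only matters once model names may contain underscores of their own.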
@@ -314,7 +315,7 @@ class ASRExecutor(BaseExecutor):
                 num_processes=cfg.num_proc_bsearch)
             self._outputs["result"] = result_transcripts[0]
-        elif "conformer" in model_type or "transformer" in model_type or "wenetspeech" in model_type:
+        elif "conformer" in model_type or "transformer" in model_type:
             result_transcripts = self.model.decode(
                 audio,
                 audio_len,
@@ -419,7 +420,7 @@ class ASRExecutor(BaseExecutor):
         model = parser_args.model
         lang = parser_args.lang
-        sample_rate = parser_args.sr
+        sample_rate = parser_args.sample_rate
         config = parser_args.config
         ckpt_path = parser_args.ckpt_path
         audio_file = parser_args.input

@@ -81,6 +81,7 @@ class CLSExecutor(BaseExecutor):
             '--model',
             type=str,
             default='panns_cnn14',
+            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
             help='Choose model type of cls task.')
         self.parser.add_argument(
             '--config',
@@ -250,7 +251,6 @@ class CLSExecutor(BaseExecutor):
         Python API to call an executor.
         """
         audio_file = os.path.abspath(audio_file)
-        # self._check(audio_file, sample_rate)
         paddle.set_device(device)
         self._init_from_path(model, config, ckpt_path, label_file)
         self.preprocess(audio_file)

@@ -23,9 +23,6 @@ import numpy as np
 import paddle
 import soundfile
 from kaldiio import WriteHelper
-from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
-from paddlespeech.s2t.utils.dynamic_import import dynamic_import
-from paddlespeech.s2t.utils.utility import UpdateConfig
 from yacs.config import CfgNode
 from ..executor import BaseExecutor
@@ -33,11 +30,14 @@ from ..utils import cli_register
 from ..utils import download_and_decompress
 from ..utils import logger
 from ..utils import MODEL_HOME
+from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
+from paddlespeech.s2t.utils.dynamic_import import dynamic_import
+from paddlespeech.s2t.utils.utility import UpdateConfig
 __all__ = ["STExecutor"]
 pretrained_models = {
-    "fat_st_ted_en-zh": {
+    "fat_st_ted-en-zh": {
         "url":
         "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/fat_st_ted-en-zh.tar.gz",
         "md5":
@@ -49,7 +49,7 @@ pretrained_models = {
     }
 }
-model_alias = {"fat_st_ted": "paddlespeech.s2t.models.u2_st:U2STModel"}
+model_alias = {"fat_st": "paddlespeech.s2t.models.u2_st:U2STModel"}
 kaldi_bins = {
     "url":
@@ -70,9 +70,10 @@ class STExecutor(BaseExecutor):
         self.parser.add_argument(
             "--input", type=str, required=True, help="Audio file to translate.")
         self.parser.add_argument(
-            "--model_type",
+            "--model",
             type=str,
             default="fat_st_ted",
+            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
             help="Choose model type of st task.")
         self.parser.add_argument(
             "--src_lang",
@@ -91,7 +92,7 @@ class STExecutor(BaseExecutor):
             choices=[16000],
             help='Choose the audio sample rate of the model. 8000 or 16000')
         self.parser.add_argument(
-            "--cfg_path",
+            "--config",
             type=str,
             default=None,
             help="Config of st task. Use default config when it is None.")
@@ -150,7 +151,7 @@ class STExecutor(BaseExecutor):
             return
         if cfg_path is None or ckpt_path is None:
-            tag = model_type + "_" + src_lang + "-" + tgt_lang
+            tag = model_type + "-" + src_lang + "-" + tgt_lang
             res_path = self._get_pretrained_path(tag)
             self.cfg_path = os.path.join(res_path,
                                          pretrained_models[tag]["cfg_path"])
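The key rename from `fat_st_ted_en-zh` to `fat_st_ted-en-zh` and the separator change above work together: the first `-` in a tag now cleanly splits the model name from the language pair, which is what the new `choices=[tag[:tag.index('-')] ...]` expression relies on. A small sketch (the dict is trimmed to the key only):

```python
# Keys follow the new "{model_name}-{src_lang}-{tgt_lang}" layout.
pretrained_models = {"fat_st_ted-en-zh": {}}

# choices for --model: everything before the first "-".
choices = [tag[:tag.index('-')] for tag in pretrained_models.keys()]
print(choices)  # ['fat_st_ted']

# Lookup tag rebuilt from the CLI arguments, matching the key above.
model, src_lang, tgt_lang = 'fat_st_ted', 'en', 'zh'
tag = model + "-" + src_lang + "-" + tgt_lang
print(tag)  # fat_st_ted-en-zh
assert tag in pretrained_models

# With the old key "fat_st_ted_en-zh", the same choices expression would
# have produced 'fat_st_ted_en', which is not a valid model name.
```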
@@ -186,7 +187,9 @@ class STExecutor(BaseExecutor):
         model_conf = self.config.model
         logger.info(model_conf)
-        model_class = dynamic_import(model_type, model_alias)
+        model_name = model_type[:model_type.rindex(
+            '_')]  # model_type: {model_name}_{dataset}
+        model_class = dynamic_import(model_name, model_alias)
         self.model = model_class.from_config(model_conf)
         self.model.eval()
@@ -213,7 +216,7 @@ class STExecutor(BaseExecutor):
         audio_file = os.path.abspath(wav_file)
         logger.info("Preprocess audio_file:" + audio_file)
-        if model_type == "fat_st_ted":
+        if "fat_st" in model_type:
             cmvn = self.config.collator.cmvn_path
             utt_name = "_tmp"
@@ -321,25 +324,25 @@ class STExecutor(BaseExecutor):
         """
         parser_args = self.parser.parse_args(argv)
-        model_type = parser_args.model_type
+        model = parser_args.model
         src_lang = parser_args.src_lang
         tgt_lang = parser_args.tgt_lang
         sample_rate = parser_args.sample_rate
-        cfg_path = parser_args.cfg_path
+        config = parser_args.config
         ckpt_path = parser_args.ckpt_path
         audio_file = parser_args.input
         device = parser_args.device
         try:
-            res = self(model_type, src_lang, tgt_lang, sample_rate, cfg_path,
-                       ckpt_path, audio_file, device)
+            res = self(model, src_lang, tgt_lang, sample_rate, config,
+                       ckpt_path, audio_file, device)
             logger.info("ST Result: {}".format(res))
             return True
         except Exception as e:
-            print(e)
+            logger.exception(e)
             return False
-    def __call__(self, model_type, src_lang, tgt_lang, sample_rate, cfg_path,
-                 ckpt_path, audio_file, device):
+    def __call__(self, model, src_lang, tgt_lang, sample_rate, config,
+                 ckpt_path, audio_file, device):
         """
         Python API to call an executor.
@@ -347,10 +350,9 @@ class STExecutor(BaseExecutor):
         audio_file = os.path.abspath(audio_file)
         self._check(audio_file, sample_rate)
         paddle.set_device(device)
-        self._init_from_path(model_type, src_lang, tgt_lang, cfg_path,
-                             ckpt_path)
-        self.preprocess(audio_file, model_type)
-        self.infer(model_type)
-        res = self.postprocess(model_type)
+        self._init_from_path(model, src_lang, tgt_lang, config, ckpt_path)
+        self.preprocess(audio_file, model)
+        self.infer(model)
+        res = self.postprocess(model)
         return res
