Add st demo.

pull/1089/head
KP 3 years ago
parent 3e780dfe1f
commit 70a8a75476

@@ -3,7 +3,7 @@
 ## Introduction
 Audio tagging is the task of labelling an audio clip with one or more labels or tags, including music tagging, acoustic scene classification, audio event classification, etc.
-This demo is an implementation to tag an audio file with 527 [AudioSet](https://research.google.com/audioset/) labels. It can be done by a single command line or a few lines in python using `PaddleSpeech`.
+This demo is an implementation to tag an audio file with 527 [AudioSet](https://research.google.com/audioset/) labels. It can be done by a single command or a few lines in python using `PaddleSpeech`.
 ## Usage
 ### 1. Installation
@@ -86,7 +86,7 @@ wget https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav
 ### 4.Pretrained Models
-Here is a list of pretrained models released by PaddleSpeech and can be used by command and python api:
+Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
 | Model | Sample Rate |
 | :--- | :---: |

@@ -3,7 +3,7 @@
 ## Introduction
 ASR, or Automatic Speech Recognition, refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text).
-This demo is an implementation to recognize text from a specific audio file. It can be done by a single command line or a few lines in python using `PaddleSpeech`.
+This demo is an implementation to recognize text from a specific audio file. It can be done by a single command or a few lines in python using `PaddleSpeech`.
 ## Usage
 ### 1. Installation
@@ -32,7 +32,7 @@ wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 - `input`(required): Audio file to recognize.
 - `model`: Model type of asr task. Default: `conformer_wenetspeech`.
 - `lang`: Model language. Default: `zh`.
-- `sr`: Sample rate of the model. Default: `16000`.
+- `sample_rate`: Sample rate of the model. Default: `16000`.
 - `config`: Config of asr task. Use pretrained model when it is None. Default: `None`.
 - `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
 - `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment.
@@ -68,7 +68,7 @@ wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 ### 4.Pretrained Models
-Here is a list of pretrained models released by PaddleSpeech and can be used by command and python api:
+Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
 | Model | Language | Sample Rate |
 | :--- | :---: | :---: |

@@ -0,0 +1,77 @@
# Speech Translation
## Introduction
Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language.
This demo is an implementation to recognize text from a specific audio file and translate it to the target language. It can be done by a single command or a few lines in python using `PaddleSpeech`.
## Usage
### 1. Installation
```bash
pip install paddlespeech
```
### 2. Prepare Input File
Input of this demo should be a WAV file (`.wav`).
Here are sample files for this demo that can be downloaded:
```bash
wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
### 3. Usage
- Command Line (Recommended)
```bash
paddlespeech st --input ~/en.wav
```
Usage:
```bash
paddlespeech st --help
```
Arguments:
- `input`(required): Audio file to recognize and translate.
- `model`: Model type of st task. Default: `fat_st_ted`.
- `src_lang`: Source language. Default: `en`.
- `tgt_lang`: Target language. Default: `zh`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
- `config`: Config of st task. Use pretrained model when it is None. Default: `None`.
- `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
- `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment.
Output:
```bash
[2021-12-09 11:13:03,178] [ INFO] [utils.py] [L225] - ST Result: ['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
- Python API
```python
import paddle
from paddlespeech.cli import STExecutor
st_executor = STExecutor()
text = st_executor(
model='fat_st_ted',
src_lang='en',
tgt_lang='zh',
sample_rate=16000,
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./en.wav',
device=paddle.get_device())
print('ST Result: \n{}'.format(text))
```
Output:
```bash
ST Result:
['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
### 4.Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
| Model | Source Language | Target Language |
| :--- | :---: | :---: |
| fat_st_ted | en | zh |

@@ -88,6 +88,7 @@ class ASRExecutor(BaseExecutor):
             '--model',
             type=str,
             default='conformer_wenetspeech',
+            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
             help='Choose model type of asr task.')
         self.parser.add_argument(
             '--lang',
@@ -95,7 +96,7 @@ class ASRExecutor(BaseExecutor):
             default='zh',
             help='Choose model language. zh or en')
         self.parser.add_argument(
-            "--sr",
+            "--sample_rate",
             type=int,
             default=16000,
             choices=[8000, 16000],
@@ -200,8 +201,8 @@ class ASRExecutor(BaseExecutor):
             raise Exception("wrong type")
         # Enter the path of model root
-        model_name = ''.join(
-            model_type.split('_')[:-1])  # model_type: {model_name}_{dataset}
+        model_name = model_type[:model_type.rindex(
+            '_')]  # model_type: {model_name}_{dataset}
         model_class = dynamic_import(model_name, model_alias)
         model_conf = self.config.model
         logger.info(model_conf)
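Note on the hunk above: the new slicing keeps every underscore inside the model name, which the old `''.join(model_type.split('_')[:-1])` silently dropped. A minimal sketch of the difference (the multi-part name below is illustrative):

```python
# Two ways to strip the trailing "_{dataset}" part of a model_type string.
def old_model_name(model_type: str) -> str:
    # Joins all but the last "_" chunk with no separator,
    # so underscores inside the model name itself are lost.
    return ''.join(model_type.split('_')[:-1])

def new_model_name(model_type: str) -> str:
    # Cuts at the last underscore, keeping the name intact.
    return model_type[:model_type.rindex('_')]

print(old_model_name('conformer_wenetspeech'))        # conformer
print(new_model_name('conformer_wenetspeech'))        # conformer
# A multi-part name (hypothetical here) shows where the old version breaks:
print(old_model_name('deepspeech2_offline_aishell'))  # deepspeech2offline
print(new_model_name('deepspeech2_offline_aishell'))  # deepspeech2_offline
```

Both agree for single-underscore names, so the change only matters once model names may contain underscores of their own.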
@@ -314,7 +315,7 @@ class ASRExecutor(BaseExecutor):
                 num_processes=cfg.num_proc_bsearch)
             self._outputs["result"] = result_transcripts[0]
-        elif "conformer" in model_type or "transformer" in model_type or "wenetspeech" in model_type:
+        elif "conformer" in model_type or "transformer" in model_type:
             result_transcripts = self.model.decode(
                 audio,
                 audio_len,
@@ -419,7 +420,7 @@ class ASRExecutor(BaseExecutor):
         model = parser_args.model
         lang = parser_args.lang
-        sample_rate = parser_args.sr
+        sample_rate = parser_args.sample_rate
         config = parser_args.config
         ckpt_path = parser_args.ckpt_path
         audio_file = parser_args.input

@@ -81,6 +81,7 @@ class CLSExecutor(BaseExecutor):
             '--model',
             type=str,
             default='panns_cnn14',
+            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
             help='Choose model type of cls task.')
         self.parser.add_argument(
             '--config',
@@ -250,7 +251,6 @@ class CLSExecutor(BaseExecutor):
         Python API to call an executor.
         """
         audio_file = os.path.abspath(audio_file)
-        # self._check(audio_file, sample_rate)
         paddle.set_device(device)
         self._init_from_path(model, config, ckpt_path, label_file)
         self.preprocess(audio_file)

@@ -23,9 +23,6 @@ import numpy as np
 import paddle
 import soundfile
 from kaldiio import WriteHelper
-from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
-from paddlespeech.s2t.utils.dynamic_import import dynamic_import
-from paddlespeech.s2t.utils.utility import UpdateConfig
 from yacs.config import CfgNode
 from ..executor import BaseExecutor
@@ -33,11 +30,14 @@ from ..utils import cli_register
 from ..utils import download_and_decompress
 from ..utils import logger
 from ..utils import MODEL_HOME
+from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
+from paddlespeech.s2t.utils.dynamic_import import dynamic_import
+from paddlespeech.s2t.utils.utility import UpdateConfig
 __all__ = ["STExecutor"]
 pretrained_models = {
-    "fat_st_ted_en-zh": {
+    "fat_st_ted-en-zh": {
         "url":
         "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/fat_st_ted-en-zh.tar.gz",
         "md5":
@@ -49,7 +49,7 @@ pretrained_models = {
     }
 }
-model_alias = {"fat_st_ted": "paddlespeech.s2t.models.u2_st:U2STModel"}
+model_alias = {"fat_st": "paddlespeech.s2t.models.u2_st:U2STModel"}
 kaldi_bins = {
     "url":
@@ -70,9 +70,10 @@ class STExecutor(BaseExecutor):
         self.parser.add_argument(
             "--input", type=str, required=True, help="Audio file to translate.")
         self.parser.add_argument(
-            "--model_type",
+            "--model",
             type=str,
             default="fat_st_ted",
+            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
             help="Choose model type of st task.")
         self.parser.add_argument(
             "--src_lang",
@@ -91,7 +92,7 @@ class STExecutor(BaseExecutor):
             choices=[16000],
             help='Choose the audio sample rate of the model. 8000 or 16000')
         self.parser.add_argument(
-            "--cfg_path",
+            "--config",
             type=str,
             default=None,
             help="Config of st task. Use default config when it is None.")
@@ -150,7 +151,7 @@ class STExecutor(BaseExecutor):
             return
         if cfg_path is None or ckpt_path is None:
-            tag = model_type + "_" + src_lang + "-" + tgt_lang
+            tag = model_type + "-" + src_lang + "-" + tgt_lang
             res_path = self._get_pretrained_path(tag)
             self.cfg_path = os.path.join(res_path,
                                          pretrained_models[tag]["cfg_path"])
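The key rename from `fat_st_ted_en-zh` to `fat_st_ted-en-zh` and the separator change above work together: the first `-` in a tag now cleanly splits the model name from the language pair, which is what the new `choices=[tag[:tag.index('-')] ...]` expression relies on. A small sketch (the dict is trimmed to the key only):

```python
# Keys follow the new "{model_name}-{src_lang}-{tgt_lang}" layout.
pretrained_models = {"fat_st_ted-en-zh": {}}

# choices for --model: everything before the first "-".
choices = [tag[:tag.index('-')] for tag in pretrained_models.keys()]
print(choices)  # ['fat_st_ted']

# Lookup tag rebuilt from the CLI arguments, matching the key above.
model, src_lang, tgt_lang = 'fat_st_ted', 'en', 'zh'
tag = model + "-" + src_lang + "-" + tgt_lang
print(tag)  # fat_st_ted-en-zh
assert tag in pretrained_models

# With the old key "fat_st_ted_en-zh", the same choices expression would
# have produced 'fat_st_ted_en', which is not a valid model name.
```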
@@ -186,7 +187,9 @@ class STExecutor(BaseExecutor):
         model_conf = self.config.model
         logger.info(model_conf)
-        model_class = dynamic_import(model_type, model_alias)
+        model_name = model_type[:model_type.rindex(
+            '_')]  # model_type: {model_name}_{dataset}
+        model_class = dynamic_import(model_name, model_alias)
         self.model = model_class.from_config(model_conf)
         self.model.eval()
@@ -213,7 +216,7 @@ class STExecutor(BaseExecutor):
         audio_file = os.path.abspath(wav_file)
         logger.info("Preprocess audio_file:" + audio_file)
-        if model_type == "fat_st_ted":
+        if "fat_st" in model_type:
             cmvn = self.config.collator.cmvn_path
             utt_name = "_tmp"
@@ -321,25 +324,25 @@ class STExecutor(BaseExecutor):
         """
         parser_args = self.parser.parse_args(argv)
-        model_type = parser_args.model_type
+        model = parser_args.model
         src_lang = parser_args.src_lang
         tgt_lang = parser_args.tgt_lang
         sample_rate = parser_args.sample_rate
-        cfg_path = parser_args.cfg_path
+        config = parser_args.config
         ckpt_path = parser_args.ckpt_path
         audio_file = parser_args.input
         device = parser_args.device
         try:
-            res = self(model_type, src_lang, tgt_lang, sample_rate, cfg_path,
-                       ckpt_path, audio_file, device)
+            res = self(model, src_lang, tgt_lang, sample_rate, config,
+                       ckpt_path, audio_file, device)
             logger.info("ST Result: {}".format(res))
             return True
         except Exception as e:
-            print(e)
+            logger.exception(e)
             return False
-    def __call__(self, model_type, src_lang, tgt_lang, sample_rate, cfg_path,
-                 ckpt_path, audio_file, device):
+    def __call__(self, model, src_lang, tgt_lang, sample_rate, config,
+                 ckpt_path, audio_file, device):
         """
         Python API to call an executor.
@@ -347,10 +350,9 @@ class STExecutor(BaseExecutor):
         audio_file = os.path.abspath(audio_file)
         self._check(audio_file, sample_rate)
         paddle.set_device(device)
-        self._init_from_path(model_type, src_lang, tgt_lang, cfg_path,
-                             ckpt_path)
-        self.preprocess(audio_file, model_type)
-        self.infer(model_type)
-        res = self.postprocess(model_type)
+        self._init_from_path(model, src_lang, tgt_lang, config, ckpt_path)
+        self.preprocess(audio_file, model)
+        self.infer(model)
+        res = self.postprocess(model)
         return res
