Silero VAD Deployment Example

This directory provides an example, infer_onnx_silero_vad, that quickly deploys the Silero VAD model on CPU/GPU.

Before deployment, two steps require confirmation.

Taking VAD inference on Linux as an example, compilation and testing can be completed by running the following commands in this directory.

mkdir build
cd build
# Download the FastDeploy precompiled library. Choose the version that suits your platform from the `FastDeploy Precompiled Library` mentioned above
wget https://bj.bcebos.com/fastdeploy/release/cpp/fastdeploy-linux-x64-x.x.x.tgz
tar xvf fastdeploy-linux-x64-x.x.x.tgz
cmake .. -DFASTDEPLOY_INSTALL_DIR=${PWD}/fastdeploy-linux-x64-x.x.x
make -j

# Download the VAD model archive and the test audio. After decompressing the archive, place the model and the test audio in the same directory as infer_onnx_silero_vad.cc
wget https://bj.bcebos.com/paddlehub/fastdeploy/silero_vad.tgz
wget https://bj.bcebos.com/paddlehub/fastdeploy/silero_vad_sample.wav

# Run inference
./infer_onnx_silero_vad ../silero_vad.onnx ../silero_vad_sample.wav

VAD C++ Interface

Vad Class

Vad::Vad(const std::string& model_file,
    const fastdeploy::RuntimeOption& custom_option = fastdeploy::RuntimeOption())

Parameter

  • model_file(str): Model file path
  • custom_option(RuntimeOption): Backend inference configuration. Defaults to a default-constructed RuntimeOption, i.e. the default configuration

setAudioCofig function

Sets the audio-related parameters. Must be called before the init function.

void Vad::setAudioCofig(int sr, int frame_ms, float threshold, int min_silence_duration_ms, int speech_pad_ms);

Parameter

  • sr(int): Sampling rate of the input audio
  • frame_ms(int): Length of each detection frame in milliseconds; used to calculate the detection window size
  • threshold(float): Probability threshold applied to the model output when judging a frame
  • min_silence_duration_ms(int): Duration threshold in milliseconds used to decide whether a stretch of audio counts as silence
  • speech_pad_ms(int): Padding in milliseconds used when calculating the end time of a speech segment
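
The millisecond parameters above ultimately have to be expressed in audio samples. This document does not give the exact formula used inside Vad, so the following is only a minimal sketch that assumes the usual conversion samples = sr * ms / 1000. msToSamples is a hypothetical helper and is not part of the Vad API.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the Vad class): converts a duration in
// milliseconds to a number of samples at sampling rate `sr` (in Hz).
// Assumption: samples = sr * ms / 1000, truncated toward zero.
inline std::int64_t msToSamples(int sr, int ms) {
    return static_cast<std::int64_t>(sr) * ms / 1000;
}
```

Under this assumption, a 32 ms frame at a 16000 Hz sampling rate corresponds to a 512-sample detection window.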

init function

Used to initialize audio-related parameters.

void Vad::init();

loadAudio function

Load audio.

void Vad::loadAudio(const std::string& wavPath);

Parameter

  • wavPath(str): Audio file path

Predict function

Used to run model inference.

bool Vad::Predict();

getResult function

Used to obtain the inference results.

std::vector<std::map<std::string, float>> Vad::getResult(
            float removeThreshold = 1.6, float expandHeadThreshold = 0.32, float expandTailThreshold = 0,
            float mergeThreshold = 0.3);

Parameter

  • removeThreshold(float): Threshold for discarding result segments; detected segments that are too short are discarded according to this threshold
  • expandHeadThreshold(float): Offset applied at the start of a segment; the detected start time may be too close to the speech, so the start time is moved earlier accordingly
  • expandTailThreshold(float): Offset applied at the end of a segment; the detected end time may be too close to the speech, so the end time is moved later accordingly
  • mergeThreshold(float): Threshold for merging result segments; segments that lie very close to each other are combined into one according to this threshold

The result type is std::vector<std::map<std::string, float>>.

The output is a list in which each element is a speech segment.

For each segment, the key 'start' gives the start time and the key 'end' gives the end time.
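
As an illustration of reading this structure, the following sketch walks over a result list and formats each segment as one line of text. formatSegments is a hypothetical helper, not part of the Vad API, and the two-decimal output format is an arbitrary choice.

```cpp
#include <iomanip>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: renders each speech segment as "speech: <start> - <end>".
// Uses at() so that a segment missing either key raises std::out_of_range
// instead of silently inserting a default value.
inline std::string formatSegments(
    const std::vector<std::map<std::string, float>>& segments) {
    std::ostringstream os;
    os << std::fixed << std::setprecision(2);
    for (const auto& seg : segments) {
        os << "speech: " << seg.at("start") << " - " << seg.at("end") << "\n";
    }
    return os.str();
}
```

The string returned for the value of Vad::getResult() can then be written to std::cout or to a log file.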

Tips

  1. The setAudioCofig function must be called before the init function
  2. The sampling rate of the input audio file must match the sampling rate set in the code
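
The call order implied by the interface description can be sketched as follows. The function is templated on the detector type so that the sketch compiles without FastDeploy; with the real library, the Vad class described in this document would be passed in. detectSpeech is a hypothetical wrapper, the numeric arguments to setAudioCofig are illustrative values rather than documented defaults, and calling loadAudio before Predict simply follows the order in which the functions are listed here.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical wrapper (not part of the Vad API). VadT is any type offering
// the member functions documented above.
template <typename VadT>
std::vector<std::map<std::string, float>> detectSpeech(
    VadT& vad, const std::string& wavPath, int sampleRate) {
    // Tip 1: configure first, then initialize.
    // Illustrative values: 32 ms frames, threshold 0.5,
    // 200 ms minimum silence, 0 ms speech padding.
    vad.setAudioCofig(sampleRate, 32, 0.5f, 200, 0);
    vad.init();
    // Tip 2: sampleRate must match the sampling rate of the file at wavPath.
    vad.loadAudio(wavPath);
    if (!vad.Predict()) {
        return {};  // inference failed, so there are no segments to report
    }
    return vad.getResult();
}
```

Keeping the whole sequence in one function makes it harder to call init before setAudioCofig by accident.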