diff --git a/demos/audio_searching/README.md b/demos/audio_searching/README.md index 2bce9313..c0df12ec 100644 --- a/demos/audio_searching/README.md +++ b/demos/audio_searching/README.md @@ -3,20 +3,25 @@ # Audio Searching ## Introduction -As the Internet continues to evolve, unstructured data such as emails, social media photos, live videos, and customer service voice calls have become increasingly common. If we want to process the data on a computer, we need to use embedding technology to transform the data into vector and store, index, and query it +As the Internet continues to evolve, unstructured data such as emails, social media photos, live videos, and customer service voice calls have become increasingly common. If we want to process the data on a computer, we need to use embedding technology to transform the data into vector and store, index, and query it. -However, when there is a large amount of data, such as hundreds of millions of audio tracks, it is more difficult to do a similarity search. The exhaustive method is feasible, but very time consuming. For this scenario, this demo will introduce how to build an audio similarity retrieval system using the open source vector database Milvus +However, when there is a large amount of data, such as hundreds of millions of audio tracks, it is more difficult to do a similarity search. The exhaustive method is feasible, but very time consuming. For this scenario, this demo will introduce how to build an audio similarity retrieval system using the open source vector database Milvus. -Audio retrieval (speech, music, speaker, etc.) enables querying and finding similar sounds (or the same speaker) in a large amount of audio data. The audio similarity retrieval system can be used to identify similar sound effects, minimize intellectual property infringement, quickly retrieve the voice print library, and help enterprises control fraud and identity theft. 
Audio retrieval also plays an important role in the classification and statistical analysis of audio data +Audio retrieval (speech, music, speaker, etc.) enables querying and finding similar sounds (or the same speaker) in a large amount of audio data. The audio similarity retrieval system can be used to identify similar sound effects, minimize intellectual property infringement, quickly retrieve the voice print library, and help enterprises control fraud and identity theft. Audio retrieval also plays an important role in the classification and statistical analysis of audio data. -In this demo, you will learn how to build an audio retrieval system to retrieve similar sound snippets. The uploaded audio clips are converted into vector data using paddlespeech-based pre-training models (audio classification model, speaker recognition model, etc.) and stored in Milvus. Milvus automatically generates a unique ID for each vector, then stores the ID and the corresponding audio information (audio ID, audio speaker ID, etc.) in MySQL to complete the library construction. During retrieval, users upload test audio to obtain vector, and then conduct vector similarity search in Milvus. The retrieval result returned by Milvus is vector ID, and the corresponding audio information can be queried in MySQL by ID +In this demo, you will learn how to build an audio retrieval system to retrieve similar sound snippets. The uploaded audio clips are converted into vector data using paddlespeech-based pre-training models (audio classification model, speaker recognition model, etc.) and stored in Milvus. Milvus automatically generates a unique ID for each vector, then stores the ID and the corresponding audio information (audio ID, audio speaker ID, etc.) in MySQL to complete the library construction. 
During retrieval, users upload test audio to obtain vector, and then conduct vector similarity search in Milvus. The retrieval result returned by Milvus is vector ID, and the corresponding audio information can be queried in MySQL by ID. ![Workflow of an audio searching system](./img/audio_searching.png) -Note:this demo uses the [CN-Celeb](http://openslr.org/82/) dataset of at least 650,000 audio entries and 3000 speakers to build the audio vector library, which is then retrieved using a preset distance calculation. The dataset can also use other, Adjust as needed, e.g. Librispeech, VoxCeleb, UrbanSound, GloVe, MNIST, etc +Note: this demo uses the [CN-Celeb](http://openslr.org/82/) dataset of at least 650,000 audio entries and 3000 speakers to build the audio vector library, which is then retrieved using a preset distance calculation. Other datasets can also be used as needed, e.g. Librispeech, VoxCeleb, UrbanSound, GloVe, MNIST, etc. ## Usage -### 1. Prepare MySQL and Milvus services by docker-compose +### 1. Prepare PaddleSpeech +Audio vector extraction requires a model trained with PaddleSpeech, so please make sure that PaddleSpeech has been installed before running. For the specific installation steps, see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). + +You can choose one of the easy, medium, and hard methods to install PaddleSpeech. + +### 2. Prepare MySQL and Milvus services by docker-compose The audio similarity search system requires Milvus, MySQL services. We can start these containers with one click through [docker-compose.yaml](./docker-compose.yaml), so please make sure you have [installed Docker Engine](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/) before running. 
then ```bash @@ -45,7 +50,7 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" 15c84a506754 qingen1/paddlespeech-audio-search-client:2.3 "/bin/bash -c '/usr/…" 22 hours ago Up 22 hours (healthy) 0.0.0.0:8068->80/tcp audio-webclient ``` -### 2. Start API Server +### 3. Start API Server Then to start the system server, and it provides HTTP backend services. - Install the Python packages @@ -75,73 +80,120 @@ Then to start the system server, and it provides HTTP backend services. Then start the server with Fastapi. ```bash - export PYTHONPATH=$PYTHONPATH:./src + export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio python src/main.py ``` Then you will see the Application is started: ```bash - INFO: Started server process [3949] - 2022-03-07 17:39:14,864 | INFO | server.py | serve | 75 | Started server process [3949] + INFO: Started server process [13352] + 2022-03-26 22:45:30,838 | INFO | server.py | serve | 75 | Started server process [13352] INFO: Waiting for application startup. - 2022-03-07 17:39:14,865 | INFO | on.py | startup | 45 | Waiting for application startup. + 2022-03-26 22:45:30,839 | INFO | on.py | startup | 45 | Waiting for application startup. INFO: Application startup complete. - 2022-03-07 17:39:14,866 | INFO | on.py | startup | 59 | Application startup complete. + 2022-03-26 22:45:30,839 | INFO | on.py | startup | 59 | Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit) - 2022-03-07 17:39:14,867 | INFO | server.py | _log_started_message | 206 | Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit) + 2022-03-26 22:45:30,840 | INFO | server.py | _log_started_message | 206 | Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit) ``` -### 3. Usage +### 4. 
Usage - Prepare data ```bash wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz ``` Note: If you want to build a quick demo, you can use ./src/test_main.py:download_audio_data function, it downloads 20 audio files , Subsequent results show this collection as an example - - scripts test (recommend!) +- Scripts test (Recommended) - The internal process is downloading data, loading the Paddlespeech model, extracting embedding, storing library, retrieving and deleting library + The internal process is downloading data, loading the paddlespeech model, extracting embedding, storing library, retrieving and deleting library ```bash python ./src/test_main.py ``` Output: ```bash - Checkpoint path: %your model path% + Downloading https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz ... + ... + Unpacking ./example_audio.tar.gz ... + [2022-03-26 22:50:54,987] [ INFO] - checking the aduio file format...... + [2022-03-26 22:50:54,987] [ INFO] - The sample rate is 16000 + [2022-03-26 22:50:54,987] [ INFO] - The audio file format is right + [2022-03-26 22:50:54,988] [ INFO] - device type: cpu + [2022-03-26 22:50:54,988] [ INFO] - load the pretrained model: ecapatdnn_voxceleb12-16k + [2022-03-26 22:50:54,990] [ INFO] - Downloading sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz from https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz + ... 
+ [2022-03-26 22:51:17,285] [ INFO] - start to dynamic import the model class + [2022-03-26 22:51:17,285] [ INFO] - model name ecapatdnn + [2022-03-26 22:51:23,864] [ INFO] - start to set the model parameters to model + [2022-03-26 22:54:08,115] [ INFO] - create the model instance success + [2022-03-26 22:54:08,116] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_ + searching/example_audio/knife_hit_iron3.wav + [2022-03-26 22:54:08,116] [ INFO] - load the audio sample points, shape is: (11012,) + [2022-03-26 22:54:08,150] [ INFO] - extract the audio feat, shape is: (80, 69) + [2022-03-26 22:54:08,152] [ INFO] - feats shape: [1, 80, 69] + [2022-03-26 22:54:08,154] [ INFO] - audio extract the feat success + [2022-03-26 22:54:08,155] [ INFO] - start to do backbone network model forward + [2022-03-26 22:54:08,155] [ INFO] - feats shape:[1, 80, 69], lengths shape: [1] + [2022-03-26 22:54:08,433] [ INFO] - embedding size: (192,) Extracting feature from audio No. 1 , 20 audios in total + [2022-03-26 22:54:08,435] [ INFO] - checking the aduio file format...... 
+ [2022-03-26 22:54:08,435] [ INFO] - The sample rate is 16000 + [2022-03-26 22:54:08,436] [ INFO] - The audio file format is right + [2022-03-26 22:54:08,436] [ INFO] - device type: cpu + [2022-03-26 22:54:08,436] [ INFO] - Model has been initialized + [2022-03-26 22:54:08,436] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/sword_wielding.wav + [2022-03-26 22:54:08,436] [ INFO] - load the audio sample points, shape is: (6391,) + [2022-03-26 22:54:08,452] [ INFO] - extract the audio feat, shape is: (80, 40) + [2022-03-26 22:54:08,454] [ INFO] - feats shape: [1, 80, 40] + [2022-03-26 22:54:08,454] [ INFO] - audio extract the feat success + [2022-03-26 22:54:08,454] [ INFO] - start to do backbone network model forward + [2022-03-26 22:54:08,455] [ INFO] - feats shape:[1, 80, 40], lengths shape: [1] + [2022-03-26 22:54:08,633] [ INFO] - embedding size: (192,) Extracting feature from audio No. 2 , 20 audios in total ... - 2022-03-09 17:22:13,870 | INFO | main.py | load_audios | 85 | Successfully loaded data, total count: 20 - 2022-03-09 17:22:13,898 | INFO | main.py | count_audio | 147 | Successfully count the number of data! - 2022-03-09 17:22:13,918 | INFO | main.py | audio_path | 57 | Successfully load audio: ./example_audio/test.wav + 2022-03-26 22:54:15,892 | INFO | main.py | load_audios | 85 | Successfully loaded data, total count: 20 + 2022-03-26 22:54:15,908 | INFO | main.py | count_audio | 148 | Successfully count the number of data! + [2022-03-26 22:54:15,916] [ INFO] - checking the aduio file format...... 
+ [2022-03-26 22:54:15,916] [ INFO] - The sample rate is 16000 + [2022-03-26 22:54:15,916] [ INFO] - The audio file format is right + [2022-03-26 22:54:15,916] [ INFO] - device type: cpu + [2022-03-26 22:54:15,916] [ INFO] - Model has been initialized + [2022-03-26 22:54:15,916] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/test.wav + [2022-03-26 22:54:15,917] [ INFO] - load the audio sample points, shape is: (8456,) + [2022-03-26 22:54:15,923] [ INFO] - extract the audio feat, shape is: (80, 53) + [2022-03-26 22:54:15,924] [ INFO] - feats shape: [1, 80, 53] + [2022-03-26 22:54:15,924] [ INFO] - audio extract the feat success + [2022-03-26 22:54:15,924] [ INFO] - start to do backbone network model forward + [2022-03-26 22:54:15,924] [ INFO] - feats shape:[1, 80, 53], lengths shape: [1] + [2022-03-26 22:54:16,051] [ INFO] - embedding size: (192,) ... - 2022-03-09 17:22:32,580 | INFO | main.py | search_local_audio | 131 | search result http://testserver/data?audio_path=./example_audio/test.wav, distance 0.0 - 2022-03-09 17:22:32,580 | INFO | main.py | search_local_audio | 131 | search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, distance 0.021805256605148315 - 2022-03-09 17:22:32,580 | INFO | main.py | search_local_audio | 131 | search result http://testserver/data?audio_path=./example_audio/knife_cut_into_flesh.wav, distance 0.052762262523174286 + 2022-03-26 22:54:16,086 | INFO | main.py | search_local_audio | 132 | search result http://testserver/data?audio_path=./example_audio/test.wav, score 100.0 + 2022-03-26 22:54:16,087 | INFO | main.py | search_local_audio | 132 | search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, score 29.182177782058716 + 2022-03-26 22:54:16,087 | INFO | main.py | search_local_audio | 132 | search result http://testserver/data?audio_path=./example_audio/knife_cut_into_body.wav, score 22.73637056350708 ... 
- 2022-03-09 17:22:32,582 | INFO | main.py | search_local_audio | 135 | Successfully searched similar audio! - 2022-03-09 17:22:33,658 | INFO | main.py | drop_tables | 159 | Successfully drop tables in Milvus and MySQL! + 2022-03-26 22:54:16,088 | INFO | main.py | search_local_audio | 136 | Successfully searched similar audio! + 2022-03-26 22:54:17,164 | INFO | main.py | drop_tables | 160 | Successfully drop tables in Milvus and MySQL! ``` -- GUI test (optional) +- GUI test (Optional) - Navigate to 127.0.0.1:8068 in your browser to access the front-end interface + Navigate to 127.0.0.1:8068 in your browser to access the front-end interface. - Note: If the browser and the service are not on the same machine, then the IP needs to be changed to the IP of the machine where the service is located, and the corresponding API_URL in docker-compose.yaml needs to be changed and the service can be restarted + Note: If the browser and the service are not on the same machine, replace 127.0.0.1 with the IP of the machine where the service is located, change the corresponding API_URL in docker-compose.yaml, and then re-run docker-compose with docker-compose.yaml for the change to take effect. - Insert data - Download the data and decompress it to a path named /home/speech/data. Then enter /home/speech/data in the address bar of the upload page to upload the data + Download the data on the server and decompress it to a directory, for example, /home/speech/data/. Then enter /home/speech/data/ in the address bar of the upload page to upload the data. ![](./img/insert.png) - Search for similar audio - Select the magnifying glass icon on the left side of the interface. Then, press the "Default Target Audio File" button and upload a .wav sound file you'd like to search. Results will be displayed + Select the magnifying glass icon on the left side of the interface. 
Then, press the "Default Target Audio File" button and upload a .wav sound file from the client you'd like to search. Results will be displayed. ![](./img/search.png) -### 4.Result +### 5.Result machine configuration: - OS: CentOS release 7.6 @@ -157,9 +209,9 @@ recall and elapsed time statistics are shown in the following figure: ![](./img/result.png) -The retrieval framework based on Milvus takes about 2.9 milliseconds to retrieve on the premise of 90% recall rate, and it takes about 500 milliseconds for feature extraction (testing audio takes about 5 seconds), that is, a single audio test takes about 503 milliseconds in total, which can meet most application scenarios +The retrieval framework based on Milvus takes about 2.9 milliseconds to retrieve on the premise of 90% recall rate, and it takes about 500 milliseconds for feature extraction (testing audio takes about 5 seconds), that is, a single audio test takes about 503 milliseconds in total, which can meet most application scenarios. 
-### 5.Pretrained Models +### 6.Pretrained Models Here is a list of pretrained models released by PaddleSpeech : diff --git a/demos/audio_searching/README_cn.md b/demos/audio_searching/README_cn.md index a4cb7312..c851bd0f 100644 --- a/demos/audio_searching/README_cn.md +++ b/demos/audio_searching/README_cn.md @@ -4,21 +4,26 @@ # 音频相似性检索 ## 介绍 -随着互联网不断发展,电子邮件、社交媒体照片、直播视频、客服语音等非结构化数据已经变得越来越普遍。如果想要使用计算机来处理这些数据,需要使用 embedding 技术将这些数据转化为向量 vector,然后进行存储、建索引、并查询 +随着互联网不断发展,电子邮件、社交媒体照片、直播视频、客服语音等非结构化数据已经变得越来越普遍。如果想要使用计算机来处理这些数据,需要使用 embedding 技术将这些数据转化为向量 vector,然后进行存储、建索引、并查询。 -但是,当数据量很大,比如上亿条音频要做相似度搜索,就比较困难了。穷举法固然可行,但非常耗时。针对这种场景,该 demo 将介绍如何使用开源向量数据库 Milvus 搭建音频相似度检索系统 +但是,当数据量很大,比如上亿条音频要做相似度搜索,就比较困难了。穷举法固然可行,但非常耗时。针对这种场景,该 demo 将介绍如何使用开源向量数据库 Milvus 搭建音频相似度检索系统。 -音频检索(如演讲、音乐、说话人等检索)实现了在海量音频数据中查询并找出相似声音(或相同说话人)片段。音频相似性检索系统可用于识别相似的音效、最大限度减少知识产权侵权等,还可以快速的检索声纹库、帮助企业控制欺诈和身份盗用等。在音频数据的分类和统计分析中,音频检索也发挥着重要作用 +音频检索(如演讲、音乐、说话人等检索)实现了在海量音频数据中查询并找出相似声音(或相同说话人)片段。音频相似性检索系统可用于识别相似的音效、最大限度减少知识产权侵权等,还可以快速的检索声纹库、帮助企业控制欺诈和身份盗用等。在音频数据的分类和统计分析中,音频检索也发挥着重要作用。 -在本 demo 中,你将学会如何构建一个音频检索系统,用来检索相似的声音片段。使用基于 PaddleSpeech 预训练模型(音频分类模型,说话人识别模型等)将上传的音频片段转换为向量数据,并存储在 Milvus 中。Milvus 自动为每个向量生成唯一的 ID,然后将 ID 和 相应的音频信息(音频id,音频的说话人id等等)存储在 MySQL,这样就完成建库的工作。用户在检索时,上传测试音频,得到向量,然后在 Milvus 中进行向量相似度搜索,Milvus 返回的检索结果为向量 ID,通过 ID 在 MySQL 内部查询相应的音频信息即可 +在本 demo 中,你将学会如何构建一个音频检索系统,用来检索相似的声音片段。使用基于 PaddleSpeech 预训练模型(音频分类模型,说话人识别模型等)将上传的音频片段转换为向量数据,并存储在 Milvus 中。Milvus 自动为每个向量生成唯一的 ID,然后将 ID 和 相应的音频信息(音频id,音频的说话人id等等)存储在 MySQL,这样就完成建库的工作。用户在检索时,上传测试音频,得到向量,然后在 Milvus 中进行向量相似度搜索,Milvus 返回的检索结果为向量 ID,通过 ID 在 MySQL 内部查询相应的音频信息即可。 ![音频检索流程图](./img/audio_searching.png) -注:该 demo 使用 [CN-Celeb](http://openslr.org/82/) 数据集,包括至少 650000 条音频,3000 个说话人,来建立音频向量库(音频特征,或音频说话人特征),然后通过预设的距离计算方式进行音频(或说话人)检索,这里面数据集也可以使用其他的,根据需要调整,如Librispeech,VoxCeleb,UrbanSound,GloVe,MNIST等 +注:该 demo 使用 [CN-Celeb](http://openslr.org/82/) 数据集,包括至少 650000 条音频,3000 
个说话人,来建立音频向量库(音频特征,或音频说话人特征),然后通过预设的距离计算方式进行音频(或说话人)检索,这里面数据集也可以使用其他的,根据需要调整,如Librispeech,VoxCeleb,UrbanSound,GloVe,MNIST等。 ## 使用方法 -### 1. MySQL 和 Milvus 安装 -音频相似度搜索系统需要用到 Milvus, MySQL 服务。 我们可以通过 [docker-compose.yaml](./docker-compose.yaml) 一键启动这些容器,所以请确保在运行之前已经安装了 [Docker Engine](https://docs.docker.com/engine/install/) 和 [Docker Compose](https://docs.docker.com/compose/install/)。 即 +### 1. PaddleSpeech 安装 +音频向量的提取需要用到基于 PaddleSpeech 训练的模型,所以请确保在运行之前已经安装了 PaddleSpeech,具体安装步骤,详见[安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md)。 + +你可以从 easy,medium,hard 三种方式中选择一种方式安装。 + +### 2. MySQL 和 Milvus 安装 +音频相似性的检索需要用到 Milvus, MySQL 服务。 我们可以通过 [docker-compose.yaml](./docker-compose.yaml) 一键启动这些容器,所以请确保在运行之前已经安装了 [Docker Engine](https://docs.docker.com/engine/install/) 和 [Docker Compose](https://docs.docker.com/compose/install/)。 即 ```bash docker-compose -f docker-compose.yaml up -d @@ -47,8 +52,8 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" ``` -### 2. 配置并启动 API 服务 -启动系统服务程序,它会提供基于 Http 后端服务 +### 3. 配置并启动 API 服务 +启动系统服务程序,它会提供基于 HTTP 后端服务。 - 安装服务依赖的 python 基础包 @@ -77,24 +82,24 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" 启动用 Fastapi 构建的服务 ```bash - export PYTHONPATH=$PYTHONPATH:./src + export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio python src/main.py ``` 然后你会看到应用程序启动: ```bash - INFO: Started server process [3949] - 2022-03-07 17:39:14,864 | INFO | server.py | serve | 75 | Started server process [3949] + INFO: Started server process [13352] + 2022-03-26 22:45:30,838 | INFO | server.py | serve | 75 | Started server process [13352] INFO: Waiting for application startup. - 2022-03-07 17:39:14,865 | INFO | on.py | startup | 45 | Waiting for application startup. + 2022-03-26 22:45:30,839 | INFO | on.py | startup | 45 | Waiting for application startup. INFO: Application startup complete. 
- 2022-03-07 17:39:14,866 | INFO | on.py | startup | 59 | Application startup complete. + 2022-03-26 22:45:30,839 | INFO | on.py | startup | 59 | Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit) - 2022-03-07 17:39:14,867 | INFO | server.py | _log_started_message | 206 | Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit) + 2022-03-26 22:45:30,840 | INFO | server.py | _log_started_message | 206 | Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit) ``` -### 3. 测试方法 +### 4. 测试方法 - 准备数据 ```bash wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz @@ -110,40 +115,88 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" 输出: ```bash - Checkpoint path: %your model path% + Downloading https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz ... + ... + Unpacking ./example_audio.tar.gz ... + [2022-03-26 22:50:54,987] [ INFO] - checking the aduio file format...... + [2022-03-26 22:50:54,987] [ INFO] - The sample rate is 16000 + [2022-03-26 22:50:54,987] [ INFO] - The audio file format is right + [2022-03-26 22:50:54,988] [ INFO] - device type: cpu + [2022-03-26 22:50:54,988] [ INFO] - load the pretrained model: ecapatdnn_voxceleb12-16k + [2022-03-26 22:50:54,990] [ INFO] - Downloading sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz from https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_0.tar.gz + ... 
+ [2022-03-26 22:51:17,285] [ INFO] - start to dynamic import the model class + [2022-03-26 22:51:17,285] [ INFO] - model name ecapatdnn + [2022-03-26 22:51:23,864] [ INFO] - start to set the model parameters to model + [2022-03-26 22:54:08,115] [ INFO] - create the model instance success + [2022-03-26 22:54:08,116] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_ + searching/example_audio/knife_hit_iron3.wav + [2022-03-26 22:54:08,116] [ INFO] - load the audio sample points, shape is: (11012,) + [2022-03-26 22:54:08,150] [ INFO] - extract the audio feat, shape is: (80, 69) + [2022-03-26 22:54:08,152] [ INFO] - feats shape: [1, 80, 69] + [2022-03-26 22:54:08,154] [ INFO] - audio extract the feat success + [2022-03-26 22:54:08,155] [ INFO] - start to do backbone network model forward + [2022-03-26 22:54:08,155] [ INFO] - feats shape:[1, 80, 69], lengths shape: [1] + [2022-03-26 22:54:08,433] [ INFO] - embedding size: (192,) Extracting feature from audio No. 1 , 20 audios in total + [2022-03-26 22:54:08,435] [ INFO] - checking the aduio file format...... 
+ [2022-03-26 22:54:08,435] [ INFO] - The sample rate is 16000 + [2022-03-26 22:54:08,436] [ INFO] - The audio file format is right + [2022-03-26 22:54:08,436] [ INFO] - device type: cpu + [2022-03-26 22:54:08,436] [ INFO] - Model has been initialized + [2022-03-26 22:54:08,436] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/sword_wielding.wav + [2022-03-26 22:54:08,436] [ INFO] - load the audio sample points, shape is: (6391,) + [2022-03-26 22:54:08,452] [ INFO] - extract the audio feat, shape is: (80, 40) + [2022-03-26 22:54:08,454] [ INFO] - feats shape: [1, 80, 40] + [2022-03-26 22:54:08,454] [ INFO] - audio extract the feat success + [2022-03-26 22:54:08,454] [ INFO] - start to do backbone network model forward + [2022-03-26 22:54:08,455] [ INFO] - feats shape:[1, 80, 40], lengths shape: [1] + [2022-03-26 22:54:08,633] [ INFO] - embedding size: (192,) Extracting feature from audio No. 2 , 20 audios in total ... - 2022-03-09 17:22:13,870 | INFO | main.py | load_audios | 85 | Successfully loaded data, total count: 20 - 2022-03-09 17:22:13,898 | INFO | main.py | count_audio | 147 | Successfully count the number of data! - 2022-03-09 17:22:13,918 | INFO | main.py | audio_path | 57 | Successfully load audio: ./example_audio/test.wav + 2022-03-26 22:54:15,892 | INFO | main.py | load_audios | 85 | Successfully loaded data, total count: 20 + 2022-03-26 22:54:15,908 | INFO | main.py | count_audio | 148 | Successfully count the number of data! + [2022-03-26 22:54:15,916] [ INFO] - checking the aduio file format...... 
+ [2022-03-26 22:54:15,916] [ INFO] - The sample rate is 16000 + [2022-03-26 22:54:15,916] [ INFO] - The audio file format is right + [2022-03-26 22:54:15,916] [ INFO] - device type: cpu + [2022-03-26 22:54:15,916] [ INFO] - Model has been initialized + [2022-03-26 22:54:15,916] [ INFO] - Preprocess audio file: /home/zhaoqingen/PaddleSpeech/demos/audio_searching/example_audio/test.wav + [2022-03-26 22:54:15,917] [ INFO] - load the audio sample points, shape is: (8456,) + [2022-03-26 22:54:15,923] [ INFO] - extract the audio feat, shape is: (80, 53) + [2022-03-26 22:54:15,924] [ INFO] - feats shape: [1, 80, 53] + [2022-03-26 22:54:15,924] [ INFO] - audio extract the feat success + [2022-03-26 22:54:15,924] [ INFO] - start to do backbone network model forward + [2022-03-26 22:54:15,924] [ INFO] - feats shape:[1, 80, 53], lengths shape: [1] + [2022-03-26 22:54:16,051] [ INFO] - embedding size: (192,) ... - 2022-03-09 17:22:32,580 | INFO | main.py | search_local_audio | 131 | search result http://testserver/data?audio_path=./example_audio/test.wav, distance 0.0 - 2022-03-09 17:22:32,580 | INFO | main.py | search_local_audio | 131 | search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, distance 0.021805256605148315 - 2022-03-09 17:22:32,580 | INFO | main.py | search_local_audio | 131 | search result http://testserver/data?audio_path=./example_audio/knife_cut_into_flesh.wav, distance 0.052762262523174286 + 2022-03-26 22:54:16,086 | INFO | main.py | search_local_audio | 132 | search result http://testserver/data?audio_path=./example_audio/test.wav, score 100.0 + 2022-03-26 22:54:16,087 | INFO | main.py | search_local_audio | 132 | search result http://testserver/data?audio_path=./example_audio/knife_chopping.wav, score 29.182177782058716 + 2022-03-26 22:54:16,087 | INFO | main.py | search_local_audio | 132 | search result http://testserver/data?audio_path=./example_audio/knife_cut_into_body.wav, score 22.73637056350708 ... 
- 2022-03-09 17:22:32,582 | INFO | main.py | search_local_audio | 135 | Successfully searched similar audio! - 2022-03-09 17:22:33,658 | INFO | main.py | drop_tables | 159 | Successfully drop tables in Milvus and MySQL! + 2022-03-26 22:54:16,088 | INFO | main.py | search_local_audio | 136 | Successfully searched similar audio! + 2022-03-26 22:54:17,164 | INFO | main.py | drop_tables | 160 | Successfully drop tables in Milvus and MySQL! ``` + - 前端测试(可选) 在浏览器中输入 127.0.0.1:8068 访问前端页面 - 注:如果浏览器和服务不在同一台机器上,那么 IP 需要修改成服务所在的机器 IP,并且 docker-compose.yaml 中相应的 API_URL 也要修改,并重新起服务即可 + 注:如果浏览器和服务不在同一台机器上,那么 IP 需要修改成服务所在的机器 IP,并且 docker-compose.yaml 中相应的 API_URL 也要修改,然后重新执行 docker-compose.yaml 文件,使修改生效。 - 上传音频 - 下载数据并解压到一文件夹,假设为 /home/speech/data,那么在上传页面地址栏输入 /home/speech/data 进行数据上传 + 在服务端下载数据并解压到一文件夹,假设为 /home/speech/data/,那么在上传页面地址栏输入 /home/speech/data/ 进行数据上传 ![](./img/insert.png) - 检索相似音频 - 选择左上角放大镜,点击 “Default Target Audio File” 按钮,上传测试音频,接着你将看到检索结果 + 选择左上角放大镜,点击 “Default Target Audio File” 按钮,从客户端上传测试音频,接着你将看到检索结果 ![](./img/search.png) -### 4. 结果 +### 5. 结果 机器配置: - 操作系统: CentOS release 7.6 @@ -158,9 +211,9 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…" ![](./img/result.png) -基于 Milvus 的检索框架在召回率 90% 的前提下,检索耗时约 2.9 毫秒,加上特征提取(Embedding)耗时约 500毫秒(测试音频时长约 5秒),即单条音频测试总共耗时约 503 毫秒,可以满足大多数应用场景 +基于 Milvus 的检索框架在召回率 90% 的前提下,检索耗时约 2.9 毫秒,加上特征提取(Embedding)耗时约 500 毫秒(测试音频时长约 5 秒),即单条音频测试总共耗时约 503 毫秒,可以满足大多数应用场景。 -### 5. 预训练模型 +### 6. 
预训练模型 以下是 PaddleSpeech 提供的预训练模型列表: diff --git a/demos/audio_searching/requirements.txt b/demos/audio_searching/requirements.txt index 95c6140d..057c6ab9 100644 --- a/demos/audio_searching/requirements.txt +++ b/demos/audio_searching/requirements.txt @@ -1,7 +1,8 @@ diskcache==5.2.1 +dtaidistance==2.3.1 fastapi librosa==0.8.0 -numpy +numpy==1.21.0 pydantic pymilvus==2.0.1 pymysql diff --git a/demos/audio_searching/src/config.py b/demos/audio_searching/src/config.py index 70ac494c..3d6d3d43 100644 --- a/demos/audio_searching/src/config.py +++ b/demos/audio_searching/src/config.py @@ -16,7 +16,7 @@ import os ############### Milvus Configuration ############### MILVUS_HOST = os.getenv("MILVUS_HOST", "127.0.0.1") MILVUS_PORT = int(os.getenv("MILVUS_PORT", "19530")) -VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "2048")) +VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "192")) INDEX_FILE_SIZE = int(os.getenv("INDEX_FILE_SIZE", "1024")) METRIC_TYPE = os.getenv("METRIC_TYPE", "L2") DEFAULT_TABLE = os.getenv("DEFAULT_TABLE", "audio_table") diff --git a/demos/audio_searching/src/encode.py b/demos/audio_searching/src/encode.py index eba5c48c..83b9e3df 100644 --- a/demos/audio_searching/src/encode.py +++ b/demos/audio_searching/src/encode.py @@ -15,7 +15,12 @@ import os import librosa import numpy as np +from config import DEFAULT_TABLE + from logs import LOGGER +from paddlespeech.cli import VectorExecutor + +vector_executor = VectorExecutor() def get_audio_embedding(path): @@ -23,16 +28,9 @@ def get_audio_embedding(path): Use vpr_inference to generate embedding of audio """ try: - RESAMPLE_RATE = 16000 - audio, _ = librosa.load(path, sr=RESAMPLE_RATE, mono=True) - - # TODO add infer/python interface to get embedding, now fake it by rand - # vpr = ECAPATDNN(checkpoint_path=None, device='cuda') - # embedding = vpr.inference(audio) - np.random.seed(hash(os.path.basename(path)) % 1000000) - embedding = np.random.rand(1, 2048) + embedding = 
vector_executor(audio_file=path) embedding = embedding / np.linalg.norm(embedding) - embedding = embedding.tolist()[0] + embedding = embedding.tolist() return embedding except Exception as e: LOGGER.error(f"Error with embedding:{e}") diff --git a/demos/audio_searching/src/test_main.py b/demos/audio_searching/src/test_main.py index 331208ff..32030bae 100644 --- a/demos/audio_searching/src/test_main.py +++ b/demos/audio_searching/src/test_main.py @@ -11,12 +11,12 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import zipfile - -import gdown from fastapi.testclient import TestClient from main import app +from utils.utility import download +from utils.utility import unpack + client = TestClient(app) @@ -24,11 +24,11 @@ def download_audio_data(): """ download audio data """ - url = 'https://drive.google.com/uc?id=1bKu21JWBfcZBuEuzFEvPoAX6PmRrgnUp' - gdown.download(url) - - with zipfile.ZipFile('example_audio.zip', 'r') as zip_ref: - zip_ref.extractall('./example_audio') + url = "https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz" + md5sum = "52ac69316c1aa1fdef84da7dd2c67b39" + target_dir = "./" + filepath = download(url, md5sum, target_dir) + unpack(filepath, target_dir, True) def test_drop(): diff --git a/paddlespeech/cli/vector/infer.py b/paddlespeech/cli/vector/infer.py index 91974761..56eccd13 100644 --- a/paddlespeech/cli/vector/infer.py +++ b/paddlespeech/cli/vector/infer.py @@ -169,7 +169,7 @@ class VectorExecutor(BaseExecutor): @stats_wrapper def __call__(self, audio_file: os.PathLike, - model: str='ecapatdnn-voxceleb12', + model: str='ecapatdnn_voxceleb12', sample_rate: int=16000, config: os.PathLike=None, ckpt_path: os.PathLike=None,
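
The reworked `get_audio_embedding` in `src/encode.py` takes the vector returned by `VectorExecutor`, scales it to unit L2 norm, and converts it to a plain list. The post-processing step on its own can be sketched as below. This is only an illustration under stated assumptions: the helper name `normalize_embedding` is hypothetical and not part of this change, a random vector stands in for the real model output so the snippet runs without PaddleSpeech installed, and the zero-vector guard is an addition that the patched code does not have.

```python
import numpy as np

# Matches the new VECTOR_DIMENSION default in src/config.py.
VECTOR_DIMENSION = 192


def normalize_embedding(embedding: np.ndarray) -> list:
    """Scale a 1-D embedding to unit L2 norm and return it as a plain list.

    Hypothetical helper for illustration; the patch does this inline.
    """
    norm = np.linalg.norm(embedding)
    if norm == 0:
        # Guard added for this sketch only; the patched code has no such check.
        raise ValueError("cannot normalize a zero vector")
    return (embedding / norm).tolist()


# Stand-in for the model output (assumption): the real code calls
# vector_executor(audio_file=path) to obtain a 192-dimensional vector.
rng = np.random.default_rng(0)
fake_embedding = rng.random(VECTOR_DIMENSION)

unit_embedding = normalize_embedding(fake_embedding)
print(len(unit_embedding))
```

Since `METRIC_TYPE` defaults to `L2` and the stored vectors have unit length, the squared L2 distance between two embeddings equals `2 - 2 * cos`, so ranking by L2 distance gives the same order as ranking by cosine similarity.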