|
|
|
|
|
|
|
|
{
|
|
|
|
|
"cells": [
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"<a href=\"https://github.com/PaddlePaddle/PaddleSpeech\"><img style=\"position: absolute; z-index: 999; top: 0; right: 0; border: 0; width: 128px; height: 128px;\" src=\"https://nosir.github.io/cleave.js/images/right-graphite@2x.png\" alt=\"Fork me on GitHub\"></a>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# 1. 识别声音\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" 通过听取声音,人的大脑会获取到大量的信息,其中的一个场景是识别和归类,如:识别熟悉的亲人或朋友的声音、识别不同乐器发出的声音和识别不同环境产生的声音,等等。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" 我们可以根据不同声音的特征(频率,音色等)进行区分,这种区分行为的本质,就是对声音进行分类。</font>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"声音分类根据用途还可以继续细分:\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* 副语言识别:说话人识别(Speaker Recognition), 情绪识别(Speech Emotion Recognition),性别分类(Speaker gender classification)\n",
|
|
|
|
|
"* 音乐识别:音乐流派分类(Music Genre Classification)\n",
|
|
|
|
|
"* 场景识别:环境声音分类(Environmental Sound Classification)\n",
|
|
|
|
|
"* 声音事件检测:各个环境中的声学事件检测\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/2b3fdd6dd3b24360ab7448e1aa47bb93d7610aaf79fd4f25aa0a8ff131493261\"></center>\n",
|
|
|
|
|
"<center>图片来源:http://speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Speaker%20(v3).pdf</center>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"## 1.1 Audio Tagging\n",
|
|
|
|
|
"使用 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) 的预训练模型对一段音频做实时的声音检测,结果如下视频所示。"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"%%HTML\n",
|
|
|
|
|
"<center><video width=\"800\" controls>\n",
|
|
|
|
|
" <source src=\"https://paddlespeech.bj.bcebos.com/PaddleAudio/audio_tagging_demo.mp4\" type=\"video/mp4\">\n",
|
|
|
|
|
"</video></center>"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"# 2. 音频和特征提取"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"# 环境准备:安装paddlespeech和paddleaudio\n",
|
|
|
|
|
"!pip install --upgrade pip && pip install paddlespeech paddleaudio -U"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"import warnings\n",
|
|
|
|
|
"warnings.filterwarnings(\"ignore\")\n",
|
|
|
|
|
"import IPython\n",
|
|
|
|
|
"import numpy as np\n",
|
|
|
|
|
"import matplotlib.pyplot as plt\n",
|
|
|
|
|
"%matplotlib inline"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"## 2.1 数字音频\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### 2.1.1 声音信号和音频文件\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"下面通过一个例子观察音频文件的波形,直观地了解数字音频文件的包含的内容。"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"# 获取示例音频\n",
|
|
|
|
|
"!test -f ./dog.wav || wget https://paddlespeech.bj.bcebos.com/PaddleAudio/dog.wav\n",
|
|
|
|
|
"IPython.display.Audio('./dog.wav')"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"from paddleaudio import load\n",
|
|
|
|
|
"data, sr = load(file='./dog.wav', mono=True, dtype='float32') # 单通道,float32音频样本点\n",
|
|
|
|
|
"print('wav shape: {}'.format(data.shape))\n",
|
|
|
|
|
"print('sample rate: {}'.format(sr))\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# 展示音频波形\n",
|
|
|
|
|
"plt.figure()\n",
|
|
|
|
|
"plt.plot(data)\n",
|
|
|
|
|
"plt.show()"
|
|
|
|
|
]
|
|
|
|
|
},
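{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"The waveform above is just a sequence of sample values. As a quick sanity check, the number of samples divided by the sample rate gives the clip's duration in seconds:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Duration in seconds = number of samples / sample rate.\n",
"duration = data.shape[0] / sr\n",
"print('duration: {:.2f} s'.format(duration))"
]
},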
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"!paddlespeech cls --input ./dog.wav"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 2.2 音频特征提取\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### 2.2.1 短时傅里叶变换\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" 对于一段音频,一般会将整段音频进行分帧,每一帧含有一定长度的信号数据,一般使用 `25ms`,帧与帧之间的移动距离称为帧移,一般使用 `10ms`,然后对每一帧的信号数据加窗后,进行短时傅立叶变换(STFT)得到时频谱。\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"通过按照上面的对一段音频进行分帧后,我们可以用傅里叶变换来分析每一帧信号的频率特性。将每一帧的频率信息拼接后,可以获得该音频不同时刻的频率特征——Spectrogram,也称作为语谱图。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/8ef98c95137442a797c9204e1108e585facf7124ee964edc845f2c849a39347f\"></center>\n",
|
|
|
|
|
"<center>图片参考:DLHLP 李宏毅 语音识别课程PPT;https://www.shong.win/2016/04/09/fft/</center>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<br></br>\n",
|
|
|
|
|
"下面例子采用 `paddle.signal.stft` 演示如何提取示例音频的频谱特征,并进行可视化:"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"import paddle\n",
|
|
|
|
|
"import numpy as np\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"x = paddle.to_tensor(data)\n",
|
|
|
|
|
"n_fft = 1024\n",
|
|
|
|
|
"win_length = 1024\n",
|
|
|
|
|
"hop_length = 512\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# [D, T]\n",
|
|
|
|
|
"spectrogram = paddle.signal.stft(x, n_fft=1024, win_length=1024, hop_length=512, onesided=True) \n",
|
|
|
|
|
"print('spectrogram.shape: {}'.format(spectrogram.shape))\n",
|
|
|
|
|
"print('spectrogram.dtype: {}'.format(spectrogram.dtype))\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"spec = np.log(np.abs(spectrogram.numpy())**2)\n",
|
|
|
|
|
"plt.figure()\n",
|
|
|
|
|
"plt.title(\"Log Power Spectrogram\")\n",
|
|
|
|
|
"plt.imshow(spec[:100, :], origin='lower')\n",
|
|
|
|
|
"plt.show()"
|
|
|
|
|
]
|
|
|
|
|
},
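{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"As a side check, we can verify the number of STFT frames. Assuming `paddle.signal.stft` keeps its default `center=True` padding, the expected frame count is `1 + len(data) // hop_length`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# With the default center=True, paddle.signal.stft pads the signal so\n",
"# that the number of frames is 1 + floor(len(x) / hop_length).\n",
"expected_frames = 1 + len(data) // hop_length\n",
"print('expected frames: {}'.format(expected_frames))\n",
"print('actual frames:   {}'.format(spectrogram.shape[1]))"
]
},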
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"### 2.2.2 LogFBank\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"研究表明,人类对声音的感知是非线性的,随着声音频率的增加,人对更高频率的声音的区分度会不断下降。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"例如同样是相差 500Hz 的频率,一般人可以轻松分辨出声音中 500Hz 和 1,000Hz 之间的差异,但是很难分辨出 10,000Hz 和 10,500Hz 之间的差异。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"因此,学者提出了梅尔频率,在该频率计量方式下,人耳对相同数值的频率变化的感知程度是一样的。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/18fac30a88bd46c88a6a8bfdec580b42ff3f6b6ef0b54bb68cb1c217f31c18d7\" width=500></center>\n",
|
|
|
|
|
"<center>图片来源:https://www.researchgate.net/figure/Curve-relationship-between-frequency-signal-with-its-mel-frequency-scale-Algorithm-1_fig3_221910348</center>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"关于梅尔频率的计算,其会对原始频率的低频的部分进行较多的采样,从而对应更多的频率,而对高频的声音进行较少的采样,从而对应较少的频率。使得人耳对梅尔频率的低频和高频的区分性一致。\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/7762cef8fa0e4b10b7f566a0e705609af7704f6a1d2b4e8bac44abe724f9c866\" ></center>\n",
|
|
|
|
|
"<center>图片来源:https://ww2.mathworks.cn/help/audio/ref/mfcc.html</center>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"Mel Fbank 的计算过程如下,而我们一般都是使用 LogFBank 作为识别特征:\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/e7e6c2e221f642af9e618de768dada99258ec5d97b314035b21dd3e217941a67\" ></center>\n",
|
|
|
|
|
"<center>图片来源:https://ww2.mathworks.cn/help/audio/ref/mfcc.html</center>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<br></br>\n",
|
|
|
|
|
"下面例子采用 `paddleaudio.features.LogMelSpectrogram` 演示如何提取示例音频的 LogFBank:"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"from paddleaudio.features import LogMelSpectrogram\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# - sr: 音频文件的采样率。\n",
|
|
|
|
|
"# - n_fft: FFT样本点个数。\n",
|
|
|
|
|
"# - hop_length: 音频帧之间的间隔。\n",
|
|
|
|
|
"# - win_length: 窗函数的长度。\n",
|
|
|
|
|
"# - window: 窗函数种类。\n",
|
|
|
|
|
"# - n_mels: 梅尔刻度数量。\n",
|
|
|
|
|
"feature_extractor2 = LogMelSpectrogram(\n",
|
|
|
|
|
" sr=sr, \n",
|
|
|
|
|
" n_fft=n_fft, \n",
|
|
|
|
|
" hop_length=hop_length, \n",
|
|
|
|
|
" win_length=win_length, \n",
|
|
|
|
|
" window='hann', \n",
|
|
|
|
|
" n_mels=64)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"x = paddle.to_tensor(data).unsqueeze(0) # [B, L]\n",
|
|
|
|
|
"log_fbank = feature_extractor2(x) # [B, D, T]\n",
|
|
|
|
|
"log_fbank = log_fbank.squeeze(0) # [D, T]\n",
|
|
|
|
|
"print('log_fbank.shape: {}'.format(log_fbank.shape))\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"plt.figure()\n",
|
|
|
|
|
"plt.imshow(log_fbank.numpy(), origin='lower')\n",
|
|
|
|
|
"plt.show()"
|
|
|
|
|
]
|
|
|
|
|
},
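{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"To make the nonlinearity above concrete, here is a minimal sketch of the mel mapping using the HTK-style formula m = 2595 * log10(1 + f / 700) (one common convention; the exact variant used by `paddleaudio` may differ), revisiting the 500 Hz and 10,000 Hz example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# HTK-style mel scale (one common convention): m = 2595 * log10(1 + f / 700)\n",
"def hz_to_mel(f):\n",
"    return 2595.0 * np.log10(1.0 + f / 700.0)\n",
"\n",
"# The same 500 Hz gap is wide on the mel axis at low frequencies\n",
"# and narrow at high frequencies, matching human perception.\n",
"for f1, f2 in [(500, 1000), (10000, 10500)]:\n",
"    print('{} Hz vs {} Hz: {:.1f} mel apart'.format(f1, f2, hz_to_mel(f2) - hz_to_mel(f1)))"
]
},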
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 2.3 声音分类方法\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### 2.3.1 传统机器学习方法\n",
|
|
|
|
|
"在传统的声音和信号的研究领域中,声音特征是一类包含丰富先验知识的手工特征,如频谱图、梅尔频谱和梅尔频率倒谱系数等。\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"因此在一些分类的应用上,可以采用传统的机器学习方法例如决策树、svm和随机森林等方法。\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"一个典型的应用案例是:男声和女声分类。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/943905088eef48b48e4b94f7ff4c475060937868ca474b61bdcc55fc155b283e\" width=800></center>\n",
|
|
|
|
|
"<center>图片来源:https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0179403.g001</center>"
|
|
|
|
|
]
|
|
|
|
|
},
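{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"For illustration only (this is not part of the tutorial's pipeline, and the data below is a random placeholder), a minimal sketch of a traditional classifier on handcrafted features, assuming scikit-learn is available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.svm import SVC\n",
"\n",
"# Placeholder data: one fixed-size handcrafted feature vector per clip\n",
"# (in practice, e.g. MFCCs averaged over time) with a toy binary label.\n",
"rng = np.random.default_rng(0)\n",
"feats = rng.normal(size=(200, 20))     # [N, D]\n",
"labels = rng.integers(0, 2, size=200)  # e.g. 0 = male, 1 = female\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    feats, labels, test_size=0.2, random_state=0)\n",
"clf = SVC(kernel='rbf').fit(X_train, y_train)\n",
"print('test accuracy: {:.2f}'.format(clf.score(X_test, y_test)))"
]
},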
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"### 2.3.2 深度学习方法\n",
|
|
|
|
|
"传统机器学习方法可以捕捉声音特征的差异(例如男声和女声的声音在音高上往往差异较大)并实现分类任务。\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"而深度学习方法则可以突破特征的限制,更灵活的组网方式和更深的网络层次,可以更好地提取声音的高层特征,从而获得更好的分类指标。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"随着深度学习算法的快速发展和在分类任务上的优异表现,当下流行的声音分类模型无一不是采用深度学习网络搭建而成的,如 [AudioCLIP[1]](https://arxiv.org/pdf/2106.13043v1.pdf)、[PANNs[2]](https://arxiv.org/pdf/1912.10211v5.pdf) 和 [Audio Spectrogram Transformer[3]](https://arxiv.org/pdf/2104.01778v3.pdf) 等。\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/bc2c0352c4124b1d866696fd5d8165efbdca5d60f21648729258b62981ef600a\" ></center>\n",
|
|
|
|
|
"<center>图片来源:https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5</center>"
|
|
|
|
|
]
|
|
|
|
|
},
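{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"For illustration, a toy sketch of a CNN classifier over log-mel input; this is an assumed minimal architecture, not the PANNs network used later in this tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import paddle\n",
"import paddle.nn as nn\n",
"\n",
"# A toy CNN (not the PANNs architecture): maps a log-mel spectrogram\n",
"# of shape [B, 1, T, n_mels] to class logits.\n",
"class TinyAudioCNN(nn.Layer):\n",
"    def __init__(self, num_class):\n",
"        super().__init__()\n",
"        self.conv = nn.Sequential(\n",
"            nn.Conv2D(1, 16, kernel_size=3, padding=1), nn.ReLU(),\n",
"            nn.MaxPool2D(2),\n",
"            nn.Conv2D(16, 32, kernel_size=3, padding=1), nn.ReLU(),\n",
"            nn.AdaptiveAvgPool2D(1))  # global pooling -> [B, 32, 1, 1]\n",
"        self.fc = nn.Linear(32, num_class)\n",
"\n",
"    def forward(self, x):\n",
"        x = self.conv(x).flatten(1)  # [B, 32]\n",
"        return self.fc(x)\n",
"\n",
"net = TinyAudioCNN(num_class=50)\n",
"dummy = paddle.randn([2, 1, 100, 64])  # [B, 1, T, n_mels]\n",
"print(net(dummy).shape)  # [2, 50]"
]
},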
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"### 2.3.3 Pretrain + Finetune\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"在声音分类和声音检测的场景中(如环境声音分类、情绪识别和音乐流派分类等)由于可获取的据集有限,且语音数据标注的成本高,用户可以收集到的数据集体量往往较小,这种数据量稀少的情况对于模型训练是非常不利的。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"预训练模型能够减少领域数据的需求量,并达到较高的识别准确率。在CV和NLP领域中,有诸如 MobileNet、VGG19、YOLO、BERT 和 ERNIE 等开源的预训练模型,在图像检测、图像分类、文本分类和文本生成等各自领域内的任务中,使用预训练模型在下游任务的数据集上进行 finetune ,往往可以更快和更容易获得较好的效果和指标。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"相较于 CV 领域的 ImageNet 数据集,谷歌在 2017 年开放了一个大规模的音频数据集 [AudioSet[4]](https://ieeexplore.ieee.org/document/7952261),它是目前最大的用于音频分类任务的数据集。该数据集包含了 632 类的音频类别以及 2084320 条人工标记的每段 10 秒长度的声音剪辑片段(包括 527 个标签),数据总时长为 5,800 小时。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/334a00c3ca4d4feb90982bb882897eeae2c82a6521b54b46bc64cb68289cdd92\" width=480></center>\n",
|
|
|
|
|
"<center>图片来源:https://research.google.com/audioset/ontology/index.html</center>\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"`PANNs`([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition[2]](https://arxiv.org/pdf/1912.10211.pdf))是基于 AudioSet 数据集训练的声音分类/识别的模型,其中`PANNs-CNN14`在测试集上取得了较好的效果:mAP 为 0.431,AUC 为 0.973,d-prime 为 2.732,经过预训练后,该模型可以用于提取音频的 embbedding ,适合用于声音分类和声音检测等下游任务。本示例将使用 `PANNs` 的预训练模型 Finetune 完成声音分类的任务。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/812d3268cc5b46c88bd23fb9ebaa89196081a14409724b4c87e96498c78c930e\" width=480></center>\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"本教程选取 `PANNs` 中的预训练模型 `cnn14` 作为 backbone,用于提取声音的深层特征,`SoundClassifer`创建下游的分类网络,实现对输入音频的分类。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1954041f63ae49e2bc1f858ca43433140dfc70a513a8479aa9eb5ca8841cb2ac\" width=600></center>"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"# 3. 实践:环境声音分类\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"## 3.1 数据集准备\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"此课程选取了[ESC-50: Dataset for Environmental Sound Classification[5]](https://github.com/karolpiczak/ESC-50) 数据集作为示例。\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"ESC-50是一个包含有 2000 个带标签的环境声音样本,音频样本采样率为 44,100Hz 的单通道音频文件,所有样本根据标签被划分为 50 个类别,每个类别有 40 个样本。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"音频样本可分为 5 个主要类别:\n",
|
|
|
|
|
" - 动物声音(Animals)\n",
|
|
|
|
|
" - 自然界产生的声音和水声(Natural soundscapes & water sounds)\n",
|
|
|
|
|
" - 人类发出的非语言声音(Human, non-speech sounds)\n",
|
|
|
|
|
" - 室内声音(Interior/domestic sounds)\n",
|
|
|
|
|
" - 室外声音和一般噪声(Exterior/urban noises)。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"ESC-50 数据集中的提供的 `meta/esc50.csv` 文件包含的部分信息如下:\n",
|
|
|
|
|
"```\n",
|
|
|
|
|
" filename,fold,target,category,esc10,src_file,take\n",
|
|
|
|
|
" 1-100038-A-14.wav,1,14,chirping_birds,False,100038,A\n",
|
|
|
|
|
" 1-100210-A-36.wav,1,36,vacuum_cleaner,False,100210,A\n",
|
|
|
|
|
" 1-101296-A-19.wav,1,19,thunderstorm,False,101296,A\n",
|
|
|
|
|
" ...\n",
|
|
|
|
|
"```\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" - filename: 音频文件名字。 \n",
|
|
|
|
|
" - fold: 数据集自身提供的N-Fold验证信息,用于切分训练集和验证集。\n",
|
|
|
|
|
" - target: 标签数值。\n",
|
|
|
|
|
" - category: 标签文本信息。\n",
|
|
|
|
|
" - esc10: 文件是否为ESC-10的数据集子集。\n",
|
|
|
|
|
" - src_file: 原始音频文件前缀。\n",
|
|
|
|
|
" - take: 原始文件的截取段落信息。\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"在此声音分类的任务中,我们将`target`作为训练过程的分类标签。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### 3.1.1 数据集初始化\n",
|
|
|
|
|
"调用以下代码自动下载并读取数据集音频文件,创建训练集和验证集。"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"from paddleaudio.datasets import ESC50\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"train_ds = ESC50(mode='train')\n",
|
|
|
|
|
"dev_ds = ESC50(mode='dev')"
|
|
|
|
|
]
|
|
|
|
|
},
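{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"For reference, the `fold` column described above is what drives the train/dev split. A manual equivalent with pandas might look like the sketch below, where `./esc50.csv` is a hypothetical local copy of `meta/esc50.csv`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hold out one fold for validation and train on the rest, mirroring\n",
"# the fold-based split described in 3.1.\n",
"meta = pd.read_csv('./esc50.csv')  # hypothetical local copy of meta/esc50.csv\n",
"dev_fold = 1\n",
"train_meta = meta[meta['fold'] != dev_fold]\n",
"dev_meta = meta[meta['fold'] == dev_fold]\n",
"print(len(train_meta), len(dev_meta))  # 1600 400 with ESC-50's 5 folds"
]
},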
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"### 3.1.2 特征提取\n",
|
|
|
|
|
"通过下列代码,用 `paddleaudio.features.LogMelSpectrogram` 初始化一个音频特征提取器,在训练过程中实时提取音频的 LogFBank 特征,其中主要的参数如下: "
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"feature_extractor = LogMelSpectrogram(sr=44100, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window='hann', n_mels=64)"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 3.2 模型\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### 3.2.1 选取预训练模型\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"选取`cnn14`作为 backbone,用于提取音频的特征:"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"from paddlespeech.cls.models import cnn14\n",
|
|
|
|
|
"backbone = cnn14(pretrained=True, extract_embedding=True)"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"### 3.2.2 构建分类模型\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"`SoundClassifer`接收`cnn14`作为backbone模型,并创建下游的分类网络:"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"import paddle.nn as nn\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"class SoundClassifier(nn.Layer):\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" def __init__(self, backbone, num_class, dropout=0.1):\n",
|
|
|
|
|
" super().__init__()\n",
|
|
|
|
|
" self.backbone = backbone\n",
|
|
|
|
|
" self.dropout = nn.Dropout(dropout)\n",
|
|
|
|
|
" self.fc = nn.Linear(self.backbone.emb_size, num_class)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" def forward(self, x):\n",
|
|
|
|
|
" x = x.unsqueeze(1)\n",
|
|
|
|
|
" x = self.backbone(x)\n",
|
|
|
|
|
" x = self.dropout(x)\n",
|
|
|
|
|
" logits = self.fc(x)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" return logits\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"model = SoundClassifier(backbone, num_class=len(ESC50.label_list))"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 3.3 Finetune"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"1. 创建 DataLoader "
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"batch_size = 16\n",
|
|
|
|
|
"train_loader = paddle.io.DataLoader(train_ds, batch_size=batch_size, shuffle=True)\n",
|
|
|
|
|
"dev_loader = paddle.io.DataLoader(dev_ds, batch_size=batch_size,)"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"2. 定义优化器和 Loss"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"optimizer = paddle.optimizer.Adam(learning_rate=1e-4, parameters=model.parameters())\n",
|
|
|
|
|
"criterion = paddle.nn.loss.CrossEntropyLoss()"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"3. 启动模型训练 "
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 15,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"from paddleaudio.utils import logger\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"epochs = 20\n",
|
|
|
|
|
"steps_per_epoch = len(train_loader)\n",
|
|
|
|
|
"log_freq = 10\n",
|
|
|
|
|
"eval_freq = 10\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"for epoch in range(1, epochs + 1):\n",
|
|
|
|
|
" model.train()\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" avg_loss = 0\n",
|
|
|
|
|
" num_corrects = 0\n",
|
|
|
|
|
" num_samples = 0\n",
|
|
|
|
|
" for batch_idx, batch in enumerate(train_loader):\n",
|
|
|
|
|
" waveforms, labels = batch\n",
|
|
|
|
|
" feats = feature_extractor(waveforms)\n",
|
|
|
|
|
" feats = paddle.transpose(feats, [0, 2, 1]) # [B, N, T] -> [B, T, N]\n",
|
|
|
|
|
" logits = model(feats)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" loss = criterion(logits, labels)\n",
|
|
|
|
|
" loss.backward()\n",
|
|
|
|
|
" optimizer.step()\n",
|
|
|
|
|
" if isinstance(optimizer._learning_rate,\n",
|
|
|
|
|
" paddle.optimizer.lr.LRScheduler):\n",
|
|
|
|
|
" optimizer._learning_rate.step()\n",
|
|
|
|
|
" optimizer.clear_grad()\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" # Calculate loss\n",
|
|
|
|
|
" avg_loss += loss.numpy()[0]\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" # Calculate metrics\n",
|
|
|
|
|
" preds = paddle.argmax(logits, axis=1)\n",
|
|
|
|
|
" num_corrects += (preds == labels).numpy().sum()\n",
|
|
|
|
|
" num_samples += feats.shape[0]\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" if (batch_idx + 1) % log_freq == 0:\n",
|
|
|
|
|
" lr = optimizer.get_lr()\n",
|
|
|
|
|
" avg_loss /= log_freq\n",
|
|
|
|
|
" avg_acc = num_corrects / num_samples\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" print_msg = 'Epoch={}/{}, Step={}/{}'.format(\n",
|
|
|
|
|
" epoch, epochs, batch_idx + 1, steps_per_epoch)\n",
|
|
|
|
|
" print_msg += ' loss={:.4f}'.format(avg_loss)\n",
|
|
|
|
|
" print_msg += ' acc={:.4f}'.format(avg_acc)\n",
|
|
|
|
|
" print_msg += ' lr={:.6f}'.format(lr)\n",
|
|
|
|
|
" logger.train(print_msg)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" avg_loss = 0\n",
|
|
|
|
|
" num_corrects = 0\n",
|
|
|
|
|
" num_samples = 0\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" if epoch % eval_freq == 0 and batch_idx + 1 == steps_per_epoch:\n",
|
|
|
|
|
" model.eval()\n",
|
|
|
|
|
" num_corrects = 0\n",
|
|
|
|
|
" num_samples = 0\n",
|
|
|
|
|
" with logger.processing('Evaluation on validation dataset'):\n",
|
|
|
|
|
" for batch_idx, batch in enumerate(dev_loader):\n",
|
|
|
|
|
" waveforms, labels = batch\n",
|
|
|
|
|
" feats = feature_extractor(waveforms)\n",
|
|
|
|
|
" feats = paddle.transpose(feats, [0, 2, 1])\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" logits = model(feats)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" preds = paddle.argmax(logits, axis=1)\n",
|
|
|
|
|
" num_corrects += (preds == labels).numpy().sum()\n",
|
|
|
|
|
" num_samples += feats.shape[0]\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" print_msg = '[Evaluation result]'\n",
|
|
|
|
|
" print_msg += ' dev_acc={:.4f}'.format(num_corrects / num_samples)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
" logger.eval(print_msg)"
|
|
|
|
|
]
|
|
|
|
|
},
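{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"The loop above does not persist the finetuned weights. A minimal sketch for saving and restoring them (the checkpoint path below is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"\n",
"ckpt_dir = './checkpoints'  # arbitrary path\n",
"os.makedirs(ckpt_dir, exist_ok=True)\n",
"paddle.save(model.state_dict(), os.path.join(ckpt_dir, 'model.pdparams'))\n",
"paddle.save(optimizer.state_dict(), os.path.join(ckpt_dir, 'model.pdopt'))\n",
"\n",
"# To restore later:\n",
"# model.set_state_dict(paddle.load(os.path.join(ckpt_dir, 'model.pdparams')))"
]
},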
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 3.4 音频预测\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"执行预测,获取 Top K 分类结果:"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"top_k = 10\n",
|
|
|
|
|
"wav_file = './dog.wav'\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"waveform, sr = load(wav_file)\n",
|
|
|
|
|
"feature_extractor = LogMelSpectrogram(sr=sr, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window='hann', n_mels=64)\n",
|
|
|
|
|
"feats = feature_extractor(paddle.to_tensor(paddle.to_tensor(waveform).unsqueeze(0)))\n",
|
|
|
|
|
"feats = paddle.transpose(feats, [0, 2, 1]) # [B, N, T] -> [B, T, N]\n",
|
|
|
|
|
"print(feats.shape)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"logits = model(feats)\n",
|
|
|
|
|
"probs = nn.functional.softmax(logits, axis=1).numpy()\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"sorted_indices = probs[0].argsort()\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"msg = f'[{wav_file}]\\n'\n",
|
|
|
|
|
"for idx in sorted_indices[-top_k:]:\n",
|
|
|
|
|
" msg += f'{ESC50.label_list[idx]}: {probs[0][idx]:.5f}\\n'\n",
|
|
|
|
|
"print(msg)"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"collapsed": false
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"# 4. 作业\n",
|
|
|
|
|
"1. 使用开发模式安装 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) \n",
|
|
|
|
|
"环境要求:docker, Ubuntu 16.04,root user。 \n",
|
|
|
|
|
"参考安装方法:[使用Docker安装paddlespeech](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md#hard-get-the-full-funciton-on-your-mechine)\n",
|
|
|
|
|
"1. 在 [MusicSpeech](http://marsyas.info/downloads/datasets.html) 数据集上完成 music/speech 二分类。 \n",
|
|
|
|
|
"2. 在 [GTZAN Genre Collection](http://marsyas.info/downloads/datasets.html) 音乐分类数据集上利用 PANNs 预训练模型实现音乐类别十分类。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# 5. 关注 PaddleSpeech\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"请关注我们的 [Github Repo](https://github.com/PaddlePaddle/PaddleSpeech/),非常欢迎加入以下微信群参与讨论:\n",
|
|
|
|
|
"- 扫描二维码\n",
|
|
|
|
|
"- 添加运营小姐姐微信\n",
|
|
|
|
|
"- 通过后回复【语音】\n",
|
|
|
|
|
"- 系统自动邀请加入技术群\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/87bc7da42bcc401bae41d697f13d8b362bfdfd7198f14096b6d46b4004f09613\" width=\"300\" height=\"300\" ></center>\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# 6. 参考文献\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"[1] Guzhov, A., Raue, F., Hees, J., & Dengel, A.R. (2021). AudioCLIP: Extending CLIP to Image, Text and Audio. ArXiv, abs/2106.13043.\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"[2] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M.D. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2880-2894.\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"[3] Gong, Y., Chung, Y., & Glass, J.R. (2021). AST: Audio Spectrogram Transformer. ArXiv, abs/2104.01778.\n",
|
|
|
|
|
" \n",
|
|
|
|
|
"[4] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776-780.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"[5] Piczak, K.J. (2015). ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd ACM international conference on Multimedia.\n"
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"metadata": {
|
|
|
|
|
"kernelspec": {
|
|
|
|
|
"display_name": "Python 3",
|
|
|
|
|
"language": "python",
|
|
|
|
|
"name": "py35-paddle1.2.0"
|
|
|
|
|
},
|
|
|
|
|
"language_info": {
|
|
|
|
|
"codemirror_mode": {
|
|
|
|
|
"name": "ipython",
|
|
|
|
|
"version": 3
|
|
|
|
|
},
|
|
|
|
|
"file_extension": ".py",
|
|
|
|
|
"mimetype": "text/x-python",
|
|
|
|
|
"name": "python",
|
|
|
|
|
"nbconvert_exporter": "python",
|
|
|
|
|
"pygments_lexer": "ipython3",
|
|
|
|
|
"version": "3.7.4"
|
|
|
|
|
}
|
|
|
|
|
},
|
|
|
|
|
"nbformat": 4,
|
|
|
|
|
"nbformat_minor": 1
|
|
|
|
|
}
|