{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "\"Fork\n", "\n", "# 1. 识别声音\n", " \n", " 通过听取声音,人的大脑会获取到大量的信息,其中的一个场景是识别和归类,如:识别熟悉的亲人或朋友的声音、识别不同乐器发出的声音和识别不同环境产生的声音,等等。\n", "\n", " 我们可以根据不同声音的特征(频率,音色等)进行区分,这种区分行为的本质,就是对声音进行分类。\n", "\n", "声音分类根据用途还可以继续细分:\n", "\n", "* 副语言识别:说话人识别(Speaker Recognition), 情绪识别(Speech Emotion Recognition),性别分类(Speaker gender classification)\n", "* 音乐识别:音乐流派分类(Music Genre Classification)\n", "* 场景识别:环境声音分类(Environmental Sound Classification)\n", "* 声音事件检测:各个环境中的声学事件检测\n", " \n", "\n", "
\n", "
图片来源:http://speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Speaker%20(v3).pdf
\n", "\n", "## 1.1 Audio Tagging\n", "使用 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) 的预训练模型对一段音频做实时的声音检测,结果如下视频所示。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%%HTML\n", "
" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# 2. 音频和特征提取" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 环境准备:安装paddlespeech和paddleaudio\n", "!pip install --upgrade pip && pip install paddlespeech paddleaudio -U" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "import IPython\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "\n", "\n", "## 2.1 数字音频\n", "\n", "### 2.1.1 声音信号和音频文件\n", " \n", "下面通过一个例子观察音频文件的波形,直观地了解数字音频文件的包含的内容。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# 获取示例音频\n", "!test -f ./dog.wav || wget https://paddlespeech.bj.bcebos.com/PaddleAudio/dog.wav\n", "IPython.display.Audio('./dog.wav')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from paddleaudio import load\n", "data, sr = load(file='./dog.wav', mono=True, dtype='float32') # 单通道,float32音频样本点\n", "print('wav shape: {}'.format(data.shape))\n", "print('sample rate: {}'.format(sr))\n", "\n", "# 展示音频波形\n", "plt.figure()\n", "plt.plot(data)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!paddlespeech cls --input ./dog.wav" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 2.2 音频特征提取\n", "\n", "### 2.2.1 短时傅里叶变换\n", "\n", " 对于一段音频,一般会将整段音频进行分帧,每一帧含有一定长度的信号数据,一般使用 `25ms`,帧与帧之间的移动距离称为帧移,一般使用 `10ms`,然后对每一帧的信号数据加窗后,进行短时傅立叶变换(STFT)得到时频谱。\n", " \n", "通过按照上面的对一段音频进行分帧后,我们可以用傅里叶变换来分析每一帧信号的频率特性。将每一帧的频率信息拼接后,可以获得该音频不同时刻的频率特征——Spectrogram,也称作为语谱图。\n", "\n", "
\n", "
图片参考:DLHLP 李宏毅 语音识别课程PPT;https://www.shong.win/2016/04/09/fft/
\n", "\n", "

\n", "下面例子采用 `paddle.signal.stft` 演示如何提取示例音频的频谱特征,并进行可视化:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import paddle\n", "import numpy as np\n", "\n", "x = paddle.to_tensor(data)\n", "n_fft = 1024\n", "win_length = 1024\n", "hop_length = 512\n", "\n", "# [D, T]\n", "spectrogram = paddle.signal.stft(x, n_fft=1024, win_length=1024, hop_length=512, onesided=True) \n", "print('spectrogram.shape: {}'.format(spectrogram.shape))\n", "print('spectrogram.dtype: {}'.format(spectrogram.dtype))\n", "\n", "\n", "spec = np.log(np.abs(spectrogram.numpy())**2)\n", "plt.figure()\n", "plt.title(\"Log Power Spectrogram\")\n", "plt.imshow(spec[:100, :], origin='lower')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### 2.2.2 LogFBank\n", "\n", "研究表明,人类对声音的感知是非线性的,随着声音频率的增加,人对更高频率的声音的区分度会不断下降。\n", "\n", "例如同样是相差 500Hz 的频率,一般人可以轻松分辨出声音中 500Hz 和 1,000Hz 之间的差异,但是很难分辨出 10,000Hz 和 10,500Hz 之间的差异。\n", "\n", "因此,学者提出了梅尔频率,在该频率计量方式下,人耳对相同数值的频率变化的感知程度是一样的。\n", "\n", "
\n", "
图片来源:https://www.researchgate.net/figure/Curve-relationship-between-frequency-signal-with-its-mel-frequency-scale-Algorithm-1_fig3_221910348
\n", "\n", "关于梅尔频率的计算,其会对原始频率的低频的部分进行较多的采样,从而对应更多的频率,而对高频的声音进行较少的采样,从而对应较少的频率。使得人耳对梅尔频率的低频和高频的区分性一致。\n", "
\n", "
图片来源:https://ww2.mathworks.cn/help/audio/ref/mfcc.html
\n", "\n", "Mel Fbank 的计算过程如下,而我们一般都是使用 LogFBank 作为识别特征:\n", "
\n", "
图片来源:https://ww2.mathworks.cn/help/audio/ref/mfcc.html
\n", "\n", "

\n", "下面例子采用 `paddleaudio.features.LogMelSpectrogram` 演示如何提取示例音频的 LogFBank:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from paddleaudio.features import LogMelSpectrogram\n", "\n", "# - sr: 音频文件的采样率。\n", "# - n_fft: FFT样本点个数。\n", "# - hop_length: 音频帧之间的间隔。\n", "# - win_length: 窗函数的长度。\n", "# - window: 窗函数种类。\n", "# - n_mels: 梅尔刻度数量。\n", "feature_extractor2 = LogMelSpectrogram(\n", " sr=sr, \n", " n_fft=n_fft, \n", " hop_length=hop_length, \n", " win_length=win_length, \n", " window='hann', \n", " n_mels=64)\n", "\n", "x = paddle.to_tensor(data).unsqueeze(0) # [B, L]\n", "log_fbank = feature_extractor2(x) # [B, D, T]\n", "log_fbank = log_fbank.squeeze(0) # [D, T]\n", "print('log_fbank.shape: {}'.format(log_fbank.shape))\n", "\n", "plt.figure()\n", "plt.imshow(log_fbank.numpy(), origin='lower')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 2.3 声音分类方法\n", "\n", "### 2.3.1 传统机器学习方法\n", "在传统的声音和信号的研究领域中,声音特征是一类包含丰富先验知识的手工特征,如频谱图、梅尔频谱和梅尔频率倒谱系数等。\n", " \n", "因此在一些分类的应用上,可以采用传统的机器学习方法例如决策树、svm和随机森林等方法。\n", " \n", "一个典型的应用案例是:男声和女声分类。\n", "\n", "
\n", "
图片来源:https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0179403.g001
" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### 2.3.2 深度学习方法\n", "传统机器学习方法可以捕捉声音特征的差异(例如男声和女声的声音在音高上往往差异较大)并实现分类任务。\n", " \n", "而深度学习方法则可以突破特征的限制,更灵活的组网方式和更深的网络层次,可以更好地提取声音的高层特征,从而获得更好的分类指标。\n", "\n", "随着深度学习算法的快速发展和在分类任务上的优异表现,当下流行的声音分类模型无一不是采用深度学习网络搭建而成的,如 [AudioCLIP[1]](https://arxiv.org/pdf/2106.13043v1.pdf)、[PANNs[2]](https://arxiv.org/pdf/1912.10211v5.pdf) 和 [Audio Spectrogram Transformer[3]](https://arxiv.org/pdf/2104.01778v3.pdf) 等。\n", "
\n", "
图片来源:https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5
" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### 2.3.3 Pretrain + Finetune\n", "\n", "\n", "在声音分类和声音检测的场景中(如环境声音分类、情绪识别和音乐流派分类等)由于可获取的据集有限,且语音数据标注的成本高,用户可以收集到的数据集体量往往较小,这种数据量稀少的情况对于模型训练是非常不利的。\n", "\n", "预训练模型能够减少领域数据的需求量,并达到较高的识别准确率。在CV和NLP领域中,有诸如 MobileNet、VGG19、YOLO、BERT 和 ERNIE 等开源的预训练模型,在图像检测、图像分类、文本分类和文本生成等各自领域内的任务中,使用预训练模型在下游任务的数据集上进行 finetune ,往往可以更快和更容易获得较好的效果和指标。\n", "\n", "相较于 CV 领域的 ImageNet 数据集,谷歌在 2017 年开放了一个大规模的音频数据集 [AudioSet[4]](https://ieeexplore.ieee.org/document/7952261),它是目前最大的用于音频分类任务的数据集。该数据集包含了 632 类的音频类别以及 2084320 条人工标记的每段 10 秒长度的声音剪辑片段(包括 527 个标签),数据总时长为 5,800 小时。\n", "\n", "
\n", "
图片来源:https://research.google.com/audioset/ontology/index.html
\n", " \n", "`PANNs`([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition[2]](https://arxiv.org/pdf/1912.10211.pdf))是基于 AudioSet 数据集训练的声音分类/识别的模型,其中`PANNs-CNN14`在测试集上取得了较好的效果:mAP 为 0.431,AUC 为 0.973,d-prime 为 2.732,经过预训练后,该模型可以用于提取音频的 embbedding ,适合用于声音分类和声音检测等下游任务。本示例将使用 `PANNs` 的预训练模型 Finetune 完成声音分类的任务。\n", "\n", "
\n", " \n", "本教程选取 `PANNs` 中的预训练模型 `cnn14` 作为 backbone,用于提取声音的深层特征,`SoundClassifer`创建下游的分类网络,实现对输入音频的分类。\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# 3. 实践:环境声音分类\n", "\n", "## 3.1 数据集准备\n", "\n", "此课程选取了[ESC-50: Dataset for Environmental Sound Classification[5]](https://github.com/karolpiczak/ESC-50) 数据集作为示例。\n", " \n", "ESC-50是一个包含有 2000 个带标签的环境声音样本,音频样本采样率为 44,100Hz 的单通道音频文件,所有样本根据标签被划分为 50 个类别,每个类别有 40 个样本。\n", "\n", "音频样本可分为 5 个主要类别:\n", " - 动物声音(Animals)\n", " - 自然界产生的声音和水声(Natural soundscapes & water sounds)\n", " - 人类发出的非语言声音(Human, non-speech sounds)\n", " - 室内声音(Interior/domestic sounds)\n", " - 室外声音和一般噪声(Exterior/urban noises)。\n", "\n", "\n", "ESC-50 数据集中的提供的 `meta/esc50.csv` 文件包含的部分信息如下:\n", "```\n", " filename,fold,target,category,esc10,src_file,take\n", " 1-100038-A-14.wav,1,14,chirping_birds,False,100038,A\n", " 1-100210-A-36.wav,1,36,vacuum_cleaner,False,100210,A\n", " 1-101296-A-19.wav,1,19,thunderstorm,False,101296,A\n", " ...\n", "```\n", "\n", " - filename: 音频文件名字。 \n", " - fold: 数据集自身提供的N-Fold验证信息,用于切分训练集和验证集。\n", " - target: 标签数值。\n", " - category: 标签文本信息。\n", " - esc10: 文件是否为ESC-10的数据集子集。\n", " - src_file: 原始音频文件前缀。\n", " - take: 原始文件的截取段落信息。\n", " \n", "在此声音分类的任务中,我们将`target`作为训练过程的分类标签。\n", "\n", "### 3.1.1 数据集初始化\n", "调用以下代码自动下载并读取数据集音频文件,创建训练集和验证集。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from paddleaudio.datasets import ESC50\n", "\n", "train_ds = ESC50(mode='train')\n", "dev_ds = ESC50(mode='dev')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### 3.1.2 特征提取\n", "通过下列代码,用 `paddleaudio.features.LogMelSpectrogram` 初始化一个音频特征提取器,在训练过程中实时提取音频的 LogFBank 特征,其中主要的参数如下: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "feature_extractor = LogMelSpectrogram(sr=44100, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window='hann', n_mels=64)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 3.2 模型\n", "\n", "### 3.2.1 选取预训练模型\n", "\n", "选取`cnn14`作为 backbone,用于提取音频的特征:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from paddlespeech.cls.models import cnn14\n", "backbone = cnn14(pretrained=True, extract_embedding=True)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "### 3.2.2 构建分类模型\n", "\n", "`SoundClassifer`接收`cnn14`作为backbone模型,并创建下游的分类网络:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import paddle.nn as nn\n", "\n", "\n", "class SoundClassifier(nn.Layer):\n", "\n", " def __init__(self, backbone, num_class, dropout=0.1):\n", " super().__init__()\n", " self.backbone = backbone\n", " self.dropout = nn.Dropout(dropout)\n", " self.fc = nn.Linear(self.backbone.emb_size, num_class)\n", "\n", " def forward(self, x):\n", " x = x.unsqueeze(1)\n", " x = self.backbone(x)\n", " x = self.dropout(x)\n", " logits = self.fc(x)\n", "\n", " return logits\n", "\n", "model = SoundClassifier(backbone, num_class=len(ESC50.label_list))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 3.3 Finetune" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "1. 创建 DataLoader " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "batch_size = 16\n", "train_loader = paddle.io.DataLoader(train_ds, batch_size=batch_size, shuffle=True)\n", "dev_loader = paddle.io.DataLoader(dev_ds, batch_size=batch_size,)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "2. 定义优化器和 Loss" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "optimizer = paddle.optimizer.Adam(learning_rate=1e-4, parameters=model.parameters())\n", "criterion = paddle.nn.loss.CrossEntropyLoss()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "3. 启动模型训练 " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from paddleaudio.utils import logger\n", "\n", "epochs = 20\n", "steps_per_epoch = len(train_loader)\n", "log_freq = 10\n", "eval_freq = 10\n", "\n", "for epoch in range(1, epochs + 1):\n", " model.train()\n", "\n", " avg_loss = 0\n", " num_corrects = 0\n", " num_samples = 0\n", " for batch_idx, batch in enumerate(train_loader):\n", " waveforms, labels = batch\n", " feats = feature_extractor(waveforms)\n", " feats = paddle.transpose(feats, [0, 2, 1]) # [B, N, T] -> [B, T, N]\n", " logits = model(feats)\n", "\n", " loss = criterion(logits, labels)\n", " loss.backward()\n", " optimizer.step()\n", " if isinstance(optimizer._learning_rate,\n", " paddle.optimizer.lr.LRScheduler):\n", " optimizer._learning_rate.step()\n", " optimizer.clear_grad()\n", "\n", " # Calculate loss\n", " avg_loss += loss.numpy()[0]\n", "\n", " # Calculate metrics\n", " preds = paddle.argmax(logits, axis=1)\n", " num_corrects += (preds == labels).numpy().sum()\n", " num_samples += feats.shape[0]\n", "\n", " if (batch_idx + 1) % log_freq == 0:\n", " lr = optimizer.get_lr()\n", " avg_loss /= log_freq\n", " avg_acc = num_corrects / num_samples\n", "\n", " print_msg = 'Epoch={}/{}, Step={}/{}'.format(\n", " epoch, epochs, batch_idx + 1, steps_per_epoch)\n", " print_msg += ' loss={:.4f}'.format(avg_loss)\n", " print_msg += ' acc={:.4f}'.format(avg_acc)\n", " print_msg += ' lr={:.6f}'.format(lr)\n", " logger.train(print_msg)\n", "\n", " avg_loss = 0\n", " num_corrects = 0\n", " num_samples = 0\n", "\n", " if epoch % eval_freq == 0 and batch_idx + 1 == steps_per_epoch:\n", " model.eval()\n", " num_corrects = 0\n", " num_samples = 0\n", " with logger.processing('Evaluation on validation dataset'):\n", " for batch_idx, batch in enumerate(dev_loader):\n", " waveforms, labels = batch\n", " feats = feature_extractor(waveforms)\n", " feats = paddle.transpose(feats, [0, 2, 1])\n", " \n", " logits = model(feats)\n", "\n", " preds = paddle.argmax(logits, axis=1)\n", " num_corrects += (preds == labels).numpy().sum()\n", " num_samples += feats.shape[0]\n", "\n", " print_msg = '[Evaluation result]'\n", " print_msg += ' dev_acc={:.4f}'.format(num_corrects / num_samples)\n", "\n", " logger.eval(print_msg)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 3.4 音频预测\n", "\n", "执行预测,获取 Top K 分类结果:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "top_k = 10\n", "wav_file = './dog.wav'\n", "\n", "waveform, sr = load(wav_file)\n", "feature_extractor = LogMelSpectrogram(sr=sr, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window='hann', n_mels=64)\n", "feats = feature_extractor(paddle.to_tensor(paddle.to_tensor(waveform).unsqueeze(0)))\n", "feats = paddle.transpose(feats, [0, 2, 1]) # [B, N, T] -> [B, T, N]\n", "print(feats.shape)\n", "\n", "logits = model(feats)\n", "probs = nn.functional.softmax(logits, axis=1).numpy()\n", "\n", "sorted_indices = probs[0].argsort()\n", "\n", "msg = f'[{wav_file}]\\n'\n", "for idx in sorted_indices[-top_k:]:\n", " msg += f'{ESC50.label_list[idx]}: {probs[0][idx]:.5f}\\n'\n", "print(msg)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# 4. 作业\n", "1. 使用开发模式安装 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) \n", "环境要求:docker, Ubuntu 16.04,root user。 \n", "参考安装方法:[使用Docker安装paddlespeech](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md#hard-get-the-full-funciton-on-your-mechine)\n", "1. 在 [MusicSpeech](http://marsyas.info/downloads/datasets.html) 数据集上完成 music/speech 二分类。 \n", "2. 在 [GTZAN Genre Collection](http://marsyas.info/downloads/datasets.html) 音乐分类数据集上利用 PANNs 预训练模型实现音乐类别十分类。\n", "\n", "\n", "# 5. 关注 PaddleSpeech\n", "\n", "请关注我们的 [Github Repo](https://github.com/PaddlePaddle/PaddleSpeech/),非常欢迎加入以下微信群参与讨论:\n", "- 扫描二维码\n", "- 添加运营小姐姐微信\n", "- 通过后回复【语音】\n", "- 系统自动邀请加入技术群\n", "\n", "\n", "
\n", "\n", "# 6. 参考文献\n", "\n", "[1] Guzhov, A., Raue, F., Hees, J., & Dengel, A.R. (2021). AudioCLIP: Extending CLIP to Image, Text and Audio. ArXiv, abs/2106.13043.\n", " \n", "[2] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M.D. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2880-2894.\n", " \n", "[3] Gong, Y., Chung, Y., & Glass, J.R. (2021). AST: Audio Spectrogram Transformer. ArXiv, abs/2104.01778.\n", " \n", "[4] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776-780.\n", "\n", "[5] Piczak, K.J. (2015). ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd ACM international conference on Multimedia.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "py35-paddle1.2.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 1 }