You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
PaddleSpeech/docs/tutorial/cls/cls_tutorial.ipynb

651 lines
26 KiB

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://github.com/PaddlePaddle/PaddleSpeech\"><img style=\"position: absolute; z-index: 999; top: 0; right: 0; border: 0; width: 128px; height: 128px;\" src=\"https://nosir.github.io/cleave.js/images/right-graphite@2x.png\" alt=\"Fork me on GitHub\"></a>\n",
"\n",
"# 1. 识别声音\n",
" \n",
" 通过听取声音,人的大脑会获取到大量的信息,其中的一个场景是识别和归类,如:识别熟悉的亲人或朋友的声音、识别不同乐器发出的声音和识别不同环境产生的声音,等等。\n",
"\n",
" 我们可以根据不同声音的特征(频率,音色等)进行区分,这种区分行为的本质,就是对声音进行分类。</font>\n",
"\n",
"声音分类根据用途还可以继续细分:\n",
"\n",
"* 副语言识别说话人识别Speaker Recognition, 情绪识别Speech Emotion Recognition性别分类Speaker gender classification\n",
"* 音乐识别音乐流派分类Music Genre Classification\n",
"* 场景识别环境声音分类Environmental Sound Classification\n",
"* 声音事件检测:各个环境中的声学事件检测\n",
" \n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/2b3fdd6dd3b24360ab7448e1aa47bb93d7610aaf79fd4f25aa0a8ff131493261\"></center>\n",
"<center>图片来源http://speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/Speaker%20(v3).pdf</center>\n",
"\n",
"## 1.1 Audio Tagging\n",
"使用 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) 的预训练模型对一段音频做实时的声音检测,结果如下视频所示。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%HTML\n",
"<center><video width=\"800\" controls>\n",
" <source src=\"https://paddlespeech.bj.bcebos.com/PaddleAudio/audio_tagging_demo.mp4\" type=\"video/mp4\">\n",
"</video></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. 音频和特征提取"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 环境准备安装paddlespeech和paddleaudio\n",
"!pip install --upgrade pip && pip install paddlespeech paddleaudio -U"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"import IPython\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"## 2.1 数字音频\n",
"\n",
"### 2.1.1 声音信号和音频文件\n",
" \n",
"下面通过一个例子观察音频文件的波形,直观地了解数字音频文件的包含的内容。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 获取示例音频\n",
"!test -f ./dog.wav || wget https://paddlespeech.bj.bcebos.com/PaddleAudio/dog.wav\n",
"IPython.display.Audio('./dog.wav')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from paddleaudio import load\n",
"data, sr = load(file='./dog.wav', mono=True, dtype='float32') # 单通道float32音频样本点\n",
"print('wav shape: {}'.format(data.shape))\n",
"print('sample rate: {}'.format(sr))\n",
"\n",
"# 展示音频波形\n",
"plt.figure()\n",
"plt.plot(data)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!paddlespeech cls --input ./dog.wav"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.2 音频特征提取\n",
"\n",
"### 2.2.1 短时傅里叶变换\n",
"\n",
" 对于一段音频,一般会将整段音频进行分帧,每一帧含有一定长度的信号数据,一般使用 `25ms`,帧与帧之间的移动距离称为帧移,一般使用 `10ms`然后对每一帧的信号数据加窗后进行短时傅立叶变换STFT得到时频谱。\n",
" \n",
"通过按照上面的对一段音频进行分帧后我们可以用傅里叶变换来分析每一帧信号的频率特性。将每一帧的频率信息拼接后可以获得该音频不同时刻的频率特征——Spectrogram也称作为语谱图。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/8ef98c95137442a797c9204e1108e585facf7124ee964edc845f2c849a39347f\"></center>\n",
"<center>图片参考DLHLP 李宏毅 语音识别课程PPThttps://www.shong.win/2016/04/09/fft/</center>\n",
"\n",
"<br></br>\n",
"下面例子采用 `paddle.signal.stft` 演示如何提取示例音频的频谱特征,并进行可视化:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import paddle\n",
"import numpy as np\n",
"\n",
"data, sr = load(file='./dog.wav', sr=32000, mono=True, dtype='float32')\n",
"x = paddle.to_tensor(data)\n",
"n_fft = 1024\n",
"win_length = 1024\n",
"hop_length = 320\n",
"\n",
"# [D, T]\n",
"spectrogram = paddle.signal.stft(x, n_fft=n_fft, win_length=win_length, hop_length=hop_length, onesided=True) \n",
"print('spectrogram.shape: {}'.format(spectrogram.shape))\n",
"print('spectrogram.dtype: {}'.format(spectrogram.dtype))\n",
"\n",
"\n",
"spec = np.log(np.abs(spectrogram.numpy())**2)\n",
"plt.figure()\n",
"plt.title(\"Log Power Spectrogram\")\n",
"plt.imshow(spec[:100, :], origin='lower')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2.2 LogFBank\n",
"\n",
"研究表明,人类对声音的感知是非线性的,随着声音频率的增加,人对更高频率的声音的区分度会不断下降。\n",
"\n",
"例如同样是相差 500Hz 的频率,一般人可以轻松分辨出声音中 500Hz 和 1,000Hz 之间的差异,但是很难分辨出 10,000Hz 和 10,500Hz 之间的差异。\n",
"\n",
"因此,学者提出了梅尔频率,在该频率计量方式下,人耳对相同数值的频率变化的感知程度是一样的。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/18fac30a88bd46c88a6a8bfdec580b42ff3f6b6ef0b54bb68cb1c217f31c18d7\" width=500></center>\n",
"<center>图片来源https://www.researchgate.net/figure/Curve-relationship-between-frequency-signal-with-its-mel-frequency-scale-Algorithm-1_fig3_221910348</center>\n",
"\n",
"关于梅尔频率的计算,其会对原始频率的低频的部分进行较多的采样,从而对应更多的频率,而对高频的声音进行较少的采样,从而对应较少的频率。使得人耳对梅尔频率的低频和高频的区分性一致。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/7762cef8fa0e4b10b7f566a0e705609af7704f6a1d2b4e8bac44abe724f9c866\" ></center>\n",
"<center>图片来源https://ww2.mathworks.cn/help/audio/ref/mfcc.html</center>\n",
"\n",
"Mel Fbank 的计算过程如下,而我们一般都是使用 LogFBank 作为识别特征:\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/e7e6c2e221f642af9e618de768dada99258ec5d97b314035b21dd3e217941a67\" ></center>\n",
"<center>图片来源https://ww2.mathworks.cn/help/audio/ref/mfcc.html</center>\n",
"\n",
"<br></br>\n",
"下面例子采用 `paddleaudio.features.LogMelSpectrogram` 演示如何提取示例音频的 LogFBank:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from paddleaudio.features import LogMelSpectrogram\n",
"\n",
"f_min=50.0\n",
"f_max=14000.0\n",
"n_mels=64\n",
"\n",
"# - sr: 音频文件的采样率。\n",
"# - n_fft: FFT样本点个数。\n",
"# - hop_length: 音频帧之间的间隔。\n",
"# - win_length: 窗函数的长度。\n",
"# - window: 窗函数种类。\n",
"# - n_mels: 梅尔刻度数量。\n",
"feature_extractor2 = LogMelSpectrogram(\n",
" sr=sr, \n",
" n_fft=n_fft, \n",
" hop_length=hop_length, \n",
" win_length=win_length, \n",
" window='hann', \n",
" f_min=f_min,\n",
" f_max=f_max,\n",
" n_mels=n_mels)\n",
"\n",
"x = paddle.to_tensor(data).unsqueeze(0) # [B, L]\n",
"log_fbank = feature_extractor2(x) # [B, D, T]\n",
"log_fbank = log_fbank.squeeze(0) # [D, T]\n",
"print('log_fbank.shape: {}'.format(log_fbank.shape))\n",
"\n",
"plt.figure()\n",
"plt.imshow(log_fbank.numpy(), origin='lower')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.3 声音分类方法\n",
"\n",
"### 2.3.1 传统机器学习方法\n",
"在传统的声音和信号的研究领域中,声音特征是一类包含丰富先验知识的手工特征,如频谱图、梅尔频谱和梅尔频率倒谱系数等。\n",
" \n",
"因此在一些分类的应用上可以采用传统的机器学习方法例如决策树、svm和随机森林等方法。\n",
" \n",
"一个典型的应用案例是:男声和女声分类。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/943905088eef48b48e4b94f7ff4c475060937868ca474b61bdcc55fc155b283e\" width=800></center>\n",
"<center>图片来源https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0179403.g001</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3.2 深度学习方法\n",
"传统机器学习方法可以捕捉声音特征的差异(例如男声和女声的声音在音高上往往差异较大)并实现分类任务。\n",
" \n",
"而深度学习方法则可以突破特征的限制,更灵活的组网方式和更深的网络层次,可以更好地提取声音的高层特征,从而获得更好的分类指标。\n",
"\n",
"随着深度学习算法的快速发展和在分类任务上的优异表现,当下流行的声音分类模型无一不是采用深度学习网络搭建而成的,如 [AudioCLIP[1]](https://arxiv.org/pdf/2106.13043v1.pdf)、[PANNs[2]](https://arxiv.org/pdf/1912.10211v5.pdf) 和 [Audio Spectrogram Transformer[3]](https://arxiv.org/pdf/2104.01778v3.pdf) 等。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/bc2c0352c4124b1d866696fd5d8165efbdca5d60f21648729258b62981ef600a\" ></center>\n",
"<center>图片来源https://towardsdatascience.com/audio-deep-learning-made-simple-sound-classification-step-by-step-cebc936bbe5</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3.3 Pretrain + Finetune\n",
"\n",
"\n",
"在声音分类和声音检测的场景中(如环境声音分类、情绪识别和音乐流派分类等)由于可获取的据集有限,且语音数据标注的成本高,用户可以收集到的数据集体量往往较小,这种数据量稀少的情况对于模型训练是非常不利的。\n",
"\n",
"预训练模型能够减少领域数据的需求量并达到较高的识别准确率。在CV和NLP领域中有诸如 MobileNet、VGG19、YOLO、BERT 和 ERNIE 等开源的预训练模型,在图像检测、图像分类、文本分类和文本生成等各自领域内的任务中,使用预训练模型在下游任务的数据集上进行 finetune ,往往可以更快和更容易获得较好的效果和指标。\n",
"\n",
"相较于 CV 领域的 ImageNet 数据集,谷歌在 2017 年开放了一个大规模的音频数据集 [AudioSet[4]](https://ieeexplore.ieee.org/document/7952261),它是目前最大的用于音频分类任务的数据集。该数据集包含了 632 类的音频类别以及 2084320 条人工标记的每段 10 秒长度的声音剪辑片段(包括 527 个标签),数据总时长为 5,800 小时。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/334a00c3ca4d4feb90982bb882897eeae2c82a6521b54b46bc64cb68289cdd92\" width=480></center>\n",
"<center>图片来源https://research.google.com/audioset/ontology/index.html</center>\n",
" \n",
"`PANNs`([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition[2]](https://arxiv.org/pdf/1912.10211.pdf))是基于 AudioSet 数据集训练的声音分类/识别的模型,其中`PANNs-CNN14`在测试集上取得了较好的效果mAP 为 0.431AUC 为 0.973d-prime 为 2.732,经过预训练后,该模型可以用于提取音频的 embbedding ,适合用于声音分类和声音检测等下游任务。本示例将使用 `PANNs` 的预训练模型 Finetune 完成声音分类的任务。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/812d3268cc5b46c88bd23fb9ebaa89196081a14409724b4c87e96498c78c930e\" width=480></center>\n",
" \n",
"本教程选取 `PANNs` 中的预训练模型 `cnn14` 作为 backbone用于提取声音的深层特征`SoundClassifer`创建下游的分类网络,实现对输入音频的分类。\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1954041f63ae49e2bc1f858ca43433140dfc70a513a8479aa9eb5ca8841cb2ac\" width=600></center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. 实践:环境声音分类\n",
"\n",
"## 3.1 数据集准备\n",
"\n",
"此课程选取了[ESC-50: Dataset for Environmental Sound Classification[5]](https://github.com/karolpiczak/ESC-50) 数据集作为示例。\n",
" \n",
"ESC-50是一个包含有 2000 个带标签的环境声音样本,音频样本采样率为 44,100Hz 的单通道音频文件,所有样本根据标签被划分为 50 个类别,每个类别有 40 个样本。\n",
"\n",
"音频样本可分为 5 个主要类别:\n",
" - 动物声音Animals\n",
" - 自然界产生的声音和水声Natural soundscapes & water sounds\n",
" - 人类发出的非语言声音Human, non-speech sounds\n",
" - 室内声音Interior/domestic sounds\n",
" - 室外声音和一般噪声Exterior/urban noises。\n",
"\n",
"\n",
"ESC-50 数据集中的提供的 `meta/esc50.csv` 文件包含的部分信息如下:\n",
"```\n",
" filename,fold,target,category,esc10,src_file,take\n",
" 1-100038-A-14.wav,1,14,chirping_birds,False,100038,A\n",
" 1-100210-A-36.wav,1,36,vacuum_cleaner,False,100210,A\n",
" 1-101296-A-19.wav,1,19,thunderstorm,False,101296,A\n",
" ...\n",
"```\n",
"\n",
" - filename: 音频文件名字。 \n",
" - fold: 数据集自身提供的N-Fold验证信息用于切分训练集和验证集。\n",
" - target: 标签数值。\n",
" - category: 标签文本信息。\n",
" - esc10: 文件是否为ESC-10的数据集子集。\n",
" - src_file: 原始音频文件前缀。\n",
" - take: 原始文件的截取段落信息。\n",
" \n",
"在此声音分类的任务中,我们将`target`作为训练过程的分类标签。\n",
"\n",
"### 3.1.1 数据集初始化\n",
"调用以下代码自动下载并读取数据集音频文件,创建训练集和验证集。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from paddleaudio.datasets import ESC50\n",
"\n",
"train_ds = ESC50(mode='train', sample_rate=sr)\n",
"dev_ds = ESC50(mode='dev', sample_rate=sr)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1.2 特征提取\n",
"通过下列代码,用 `paddleaudio.features.LogMelSpectrogram` 初始化一个音频特征提取器,在训练过程中实时提取音频的 LogFBank 特征,其中主要的参数如下: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"feature_extractor = LogMelSpectrogram(\n",
" sr=sr, \n",
" n_fft=n_fft, \n",
" hop_length=hop_length, \n",
" win_length=win_length, \n",
" window='hann', \n",
" f_min=f_min,\n",
" f_max=f_max,\n",
" n_mels=n_mels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 模型\n",
"\n",
"### 3.2.1 选取预训练模型\n",
"\n",
"选取`cnn14`作为 backbone用于提取音频的特征"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from paddlespeech.cls.models import cnn14\n",
"backbone = cnn14(pretrained=True, extract_embedding=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2.2 构建分类模型\n",
"\n",
"`SoundClassifer`接收`cnn14`作为backbone模型并创建下游的分类网络"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import paddle.nn as nn\n",
"\n",
"\n",
"class SoundClassifier(nn.Layer):\n",
"\n",
" def __init__(self, backbone, num_class, dropout=0.1):\n",
" super().__init__()\n",
" self.backbone = backbone\n",
" self.dropout = nn.Dropout(dropout)\n",
" self.fc = nn.Linear(self.backbone.emb_size, num_class)\n",
"\n",
" def forward(self, x):\n",
" x = x.unsqueeze(1)\n",
" x = self.backbone(x)\n",
" x = self.dropout(x)\n",
" logits = self.fc(x)\n",
"\n",
" return logits\n",
"\n",
"model = SoundClassifier(backbone, num_class=len(ESC50.label_list))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 Finetune"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. 创建 DataLoader "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"batch_size = 16\n",
"train_loader = paddle.io.DataLoader(train_ds, batch_size=batch_size, shuffle=True)\n",
"dev_loader = paddle.io.DataLoader(dev_ds, batch_size=batch_size,)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. 定义优化器和 Loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"optimizer = paddle.optimizer.Adam(learning_rate=1e-4, parameters=model.parameters())\n",
"criterion = paddle.nn.loss.CrossEntropyLoss()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. 启动模型训练 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from paddleaudio.utils import logger\n",
"\n",
"epochs = 20\n",
"steps_per_epoch = len(train_loader)\n",
"log_freq = 10\n",
"eval_freq = 10\n",
"\n",
"for epoch in range(1, epochs + 1):\n",
" model.train()\n",
"\n",
" avg_loss = 0\n",
" num_corrects = 0\n",
" num_samples = 0\n",
" for batch_idx, batch in enumerate(train_loader):\n",
" waveforms, labels = batch\n",
" feats = feature_extractor(waveforms)\n",
" feats = paddle.transpose(feats, [0, 2, 1]) # [B, N, T] -> [B, T, N]\n",
" logits = model(feats)\n",
"\n",
" loss = criterion(logits, labels)\n",
" loss.backward()\n",
" optimizer.step()\n",
" if isinstance(optimizer._learning_rate,\n",
" paddle.optimizer.lr.LRScheduler):\n",
" optimizer._learning_rate.step()\n",
" optimizer.clear_grad()\n",
"\n",
" # Calculate loss\n",
" avg_loss += loss.numpy()[0]\n",
"\n",
" # Calculate metrics\n",
" preds = paddle.argmax(logits, axis=1)\n",
" num_corrects += (preds == labels).numpy().sum()\n",
" num_samples += feats.shape[0]\n",
"\n",
" if (batch_idx + 1) % log_freq == 0:\n",
" lr = optimizer.get_lr()\n",
" avg_loss /= log_freq\n",
" avg_acc = num_corrects / num_samples\n",
"\n",
" print_msg = 'Epoch={}/{}, Step={}/{}'.format(\n",
" epoch, epochs, batch_idx + 1, steps_per_epoch)\n",
" print_msg += ' loss={:.4f}'.format(avg_loss)\n",
" print_msg += ' acc={:.4f}'.format(avg_acc)\n",
" print_msg += ' lr={:.6f}'.format(lr)\n",
" logger.train(print_msg)\n",
"\n",
" avg_loss = 0\n",
" num_corrects = 0\n",
" num_samples = 0\n",
"\n",
" if epoch % eval_freq == 0 and batch_idx + 1 == steps_per_epoch:\n",
" model.eval()\n",
" num_corrects = 0\n",
" num_samples = 0\n",
" with logger.processing('Evaluation on validation dataset'):\n",
" for batch_idx, batch in enumerate(dev_loader):\n",
" waveforms, labels = batch\n",
" feats = feature_extractor(waveforms)\n",
" feats = paddle.transpose(feats, [0, 2, 1])\n",
" \n",
" logits = model(feats)\n",
"\n",
" preds = paddle.argmax(logits, axis=1)\n",
" num_corrects += (preds == labels).numpy().sum()\n",
" num_samples += feats.shape[0]\n",
"\n",
" print_msg = '[Evaluation result]'\n",
" print_msg += ' dev_acc={:.4f}'.format(num_corrects / num_samples)\n",
"\n",
" logger.eval(print_msg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.4 音频预测\n",
"\n",
"执行预测,获取 Top K 分类结果:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"top_k = 10\n",
"wav_file = './dog.wav'\n",
"\n",
"waveform, _ = load(wav_file, sr)\n",
"feats = feature_extractor(paddle.to_tensor(paddle.to_tensor(waveform).unsqueeze(0)))\n",
"feats = paddle.transpose(feats, [0, 2, 1]) # [B, N, T] -> [B, T, N]\n",
"print(feats.shape)\n",
"\n",
"logits = model(feats)\n",
"probs = nn.functional.softmax(logits, axis=1).numpy()\n",
"\n",
"sorted_indices = probs[0].argsort()\n",
"\n",
"msg = f'[{wav_file}]\\n'\n",
"for idx in sorted_indices[-1:-top_k-1:-1]:\n",
" msg += f'{ESC50.label_list[idx]}: {probs[0][idx]:.5f}\\n'\n",
"print(msg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. 作业\n",
"1. 使用开发模式安装 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech) \n",
"环境要求docker, Ubuntu 16.04root user。 \n",
"参考安装方法:[使用Docker安装paddlespeech](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md#hard-get-the-full-funciton-on-your-mechine)\n",
"1. 在 [MusicSpeech](http://marsyas.info/downloads/datasets.html) 数据集上完成 music/speech 二分类。 \n",
"2. 在 [GTZAN Genre Collection](http://marsyas.info/downloads/datasets.html) 音乐分类数据集上利用 PANNs 预训练模型实现音乐类别十分类。\n",
"\n",
"关于如何自定义分类数据集,请参考文档 [PaddleSpeech/docs/source/cls/custom_dataset.md](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/cls/custom_dataset.md)\n",
"\n",
"# 5. 关注 PaddleSpeech\n",
"\n",
"请关注我们的 [Github Repo](https://github.com/PaddlePaddle/PaddleSpeech/),非常欢迎加入以下微信群参与讨论:\n",
"- 扫描二维码\n",
"- 添加运营小姐姐微信\n",
"- 通过后回复【语音】\n",
"- 系统自动邀请加入技术群\n",
"\n",
"\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/87bc7da42bcc401bae41d697f13d8b362bfdfd7198f14096b6d46b4004f09613\" width=\"300\" height=\"300\" ></center>\n",
"\n",
"# 6. 参考文献\n",
"\n",
"[1] Guzhov, A., Raue, F., Hees, J., & Dengel, A.R. (2021). AudioCLIP: Extending CLIP to Image, Text and Audio. ArXiv, abs/2106.13043.\n",
" \n",
"[2] Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M.D. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2880-2894.\n",
" \n",
"[3] Gong, Y., Chung, Y., & Glass, J.R. (2021). AST: Audio Spectrogram Transformer. ArXiv, abs/2104.01778.\n",
" \n",
"[4] Gemmeke, J.F., Ellis, D.P., Freedman, D., Jansen, A., Lawrence, W., Moore, R.C., Plakal, M., & Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776-780.\n",
"\n",
"[5] Piczak, K.J. (2015). ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd ACM international conference on Multimedia.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py37",
"language": "python",
"name": "py37"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}