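{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## NLP Example\n", "\n", "### Dataset: Disasters on social media\n", "\n", "Twitter (social media) carries a great deal of information. Some posts concern real disasters, disease outbreaks, or riots, while others are just jokes or movie plots. How can we get a machine to tell these two kinds of discussion apart?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why it matters\n", "We will try to correctly predict which tweets are about disasters. This is a highly relevant problem because:\n", "\n", "* It is actionable for anyone trying to extract signal from noise (a police department, in this case).\n", "* It is tricky, because relying on keywords is harder here than in most similar tasks, such as spam detection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task\n", "Classify the text data, then analyze the results to find what the predictions are attributed to." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "from tensorflow import keras\n", "import nltk\n", "import pandas as pd\n", "import numpy as np\n", "import re\n", "import codecs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data input\n", "The raw file needs its character encoding cleaned up first." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "input_file = codecs.open(\"data/socialmedia_relevant_cols.csv\", \"r\", encoding='utf-8', errors='replace')\n", "output_file = open(\"data/socialmedia_relevant_cols_clean.csv\", \"w\", encoding='utf-8')\n", "\n", "def sanitize_characters(raw, clean):\n", "    # Copy every line from raw to clean; bytes that are not valid UTF-8\n", "    # were already substituted on read because of errors='replace'.\n", "    for line in raw:\n", "        clean.write(line)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "sanitize_characters(input_file, output_file)\n", "# Close both handles so the cleaned file is flushed to disk before pandas reads it.\n", "input_file.close()\n", "output_file.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data preprocessing" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "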
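<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>text</th>\n", "      <th>choose_one</th>\n", "      <th>class_label</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>Just happened a terrible car crash</td>\n", "      <td>Relevant</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>Our Deeds are the Reason of this #earthquake M...</td>\n", "      <td>Relevant</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>Heard about #earthquake is different cities, s...</td>\n", "      <td>Relevant</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>there is a forest fire at spot pond, geese are...</td>\n", "      <td>Relevant</td>\n", "      <td>1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>Forest fire near La Ronge Sask. Canada</td>\n", "      <td>Relevant</td>\n", "      <td>1</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>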
" ], "text/plain": [ " text choose_one class_label\n", "0 Just happened a terrible car crash Relevant 1\n", "1 Our Deeds are the Reason of this #earthquake M... Relevant 1\n", "2 Heard about #earthquake is different cities, s... Relevant 1\n", "3 there is a forest fire at spot pond, geese are... Relevant 1\n", "4 Forest fire near La Ronge Sask. Canada Relevant 1" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "questions = pd.read_csv(\"data/socialmedia_relevant_cols_clean.csv\")\n", "questions.columns=['text', 'choose_one', 'class_label']\n", "questions.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }