Create NLP处理实例-checkpoint.ipynb

pull/2/head
benjas 5 years ago
parent fab852c889
commit 22bd64ef1b

@@ -0,0 +1,202 @@
{
"cells": [
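{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NLP Processing Example\n",
"\n",
"### Dataset: Disasters on social media\n",
"\n",
"Twitter (social media) carries a great deal of information. Some posts concern real disasters, diseases, or riots, while others are only jokes or movie plots. How can we get a machine to tell these two kinds of discussion apart?"
]
},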
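{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Why this matters\n",
"We will try to correctly predict which tweets are about disasters. This is a highly relevant problem because:\n",
"\n",
"* It is actionable for anyone trying to extract signal from noise (such as police departments in this case).\n",
"* It is tricky because relying on keywords is harder here than in most cases, such as spam detection."
]
},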
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task\n",
"Classify the text data, then analyze the results to find the relevant attributions."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"import sklearn\n",
"from tensorflow import keras\n",
"import nltk\n",
"import pandas as pd\n",
"import numpy as np\n",
"import re\n",
"import codecs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data input\n",
"The data needs some encoding cleanup first."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"# errors='replace' swaps undecodable bytes for U+FFFD while reading\n",
"input_file = codecs.open(\"data/socialmedia_relevant_cols.csv\", \"r\", encoding='utf-8', errors='replace')\n",
"output_file = open(\"data/socialmedia_relevant_cols_clean.csv\", \"w\", encoding='utf-8')\n",
"\n",
"def sanitize_characters(raw, clean):\n",
"    # Copy line by line, using the file objects passed in rather than the globals\n",
"    for line in raw:\n",
"        clean.write(line)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"sanitize_characters(input_file, output_file)\n",
"output_file.close()  # flush buffered writes before pandas reads the cleaned file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data preprocessing"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>choose_one</th>\n",
" <th>class_label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Just happened a terrible car crash</td>\n",
" <td>Relevant</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Our Deeds are the Reason of this #earthquake M...</td>\n",
" <td>Relevant</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heard about #earthquake is different cities, s...</td>\n",
" <td>Relevant</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>there is a forest fire at spot pond, geese are...</td>\n",
" <td>Relevant</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Forest fire near La Ronge Sask. Canada</td>\n",
" <td>Relevant</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text choose_one class_label\n",
"0 Just happened a terrible car crash Relevant 1\n",
"1 Our Deeds are the Reason of this #earthquake M... Relevant 1\n",
"2 Heard about #earthquake is different cities, s... Relevant 1\n",
"3 there is a forest fire at spot pond, geese are... Relevant 1\n",
"4 Forest fire near La Ronge Sask. Canada Relevant 1"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"questions = pd.read_csv(\"data/socialmedia_relevant_cols_clean.csv\")\n",
"questions.columns=['text', 'choose_one', 'class_label']\n",
"questions.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}