diff --git a/机器学习竞赛实战_优胜解决方案/机器学习实战小项目/文本特征处理方法对比/.ipynb_checkpoints/NLP处理实例-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/机器学习实战小项目/文本特征处理方法对比/.ipynb_checkpoints/NLP处理实例-checkpoint.ipynb new file mode 100644 index 0000000..8c3d0f3 --- /dev/null +++ b/机器学习竞赛实战_优胜解决方案/机器学习实战小项目/文本特征处理方法对比/.ipynb_checkpoints/NLP处理实例-checkpoint.ipynb @@ -0,0 +1,202 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## NLP Processing Example\n", + "\n", + "### Dataset: Disasters on social media\n", + "\n", + "Twitter (social media) carries a huge volume of posts. Some are about real disasters, disease outbreaks, or riots, while others are only jokes or movie plots. How can we get a machine to tell these two kinds of discussion apart?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why this matters\n", + "We will try to correctly predict which tweets are about disasters. This is a highly relevant problem because:\n", + "\n", + "* It is actionable for anyone trying to extract signal from noise (for example, a police department in this case).\n", + "* It is tricky, because relying on keywords is harder here than in most cases, such as spam detection." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Task\n", + "Classify the text data, then analyze which features the predictions can be attributed to." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "import sklearn\n", + "from tensorflow import keras\n", + "import nltk\n", + "import pandas as pd\n", + "import numpy as np\n", + "import re\n", + "import codecs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data input\n", + "The data needs some encoding cleanup first." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "input_file = codecs.open(\"data/socialmedia_relevant_cols.csv\", \"r\",encoding='utf-8', errors='replace')\n", + "output_file = open(\"data/socialmedia_relevant_cols_clean.csv\", \"w\",encoding='utf-8')\n", + "\n", + "def sanitize_characters(raw, clean):    \n", + "    for line in raw:  # errors='replace' has already substituted undecodable bytes\n", + "        clean.write(line)\n", + "    clean.close()  # flush to disk before pd.read_csv reads the file back" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "sanitize_characters(input_file, output_file)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data preprocessing" + ] + }, + { + 
"cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table>\n", + "  <thead>\n", + "    <tr><th></th><th>text</th><th>choose_one</th><th>class_label</th></tr>\n", + "  </thead>\n", + "  <tbody>\n", + "    <tr><th>0</th><td>Just happened a terrible car crash</td><td>Relevant</td><td>1</td></tr>\n", + "    <tr><th>1</th><td>Our Deeds are the Reason of this #earthquake M...</td><td>Relevant</td><td>1</td></tr>\n", + "    <tr><th>2</th><td>Heard about #earthquake is different cities, s...</td><td>Relevant</td><td>1</td></tr>\n", + "    <tr><th>3</th><td>there is a forest fire at spot pond, geese are...</td><td>Relevant</td><td>1</td></tr>\n", + "    <tr><th>4</th><td>Forest fire near La Ronge Sask. Canada</td><td>Relevant</td><td>1</td></tr>\n", + "  </tbody>\n", + "</table>
" + ], + "text/plain": [ + " text choose_one class_label\n", + "0 Just happened a terrible car crash Relevant 1\n", + "1 Our Deeds are the Reason of this #earthquake M... Relevant 1\n", + "2 Heard about #earthquake is different cities, s... Relevant 1\n", + "3 there is a forest fire at spot pond, geese are... Relevant 1\n", + "4 Forest fire near La Ronge Sask. Canada Relevant 1" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "questions = pd.read_csv(\"data/socialmedia_relevant_cols_clean.csv\")\n", + "questions.columns=['text', 'choose_one', 'class_label']\n", + "questions.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}