diff --git a/机器学习竞赛实战_优胜解决方案/机器学习实战小项目/文本特征处理方法对比/.ipynb_checkpoints/NLP处理实例-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/机器学习实战小项目/文本特征处理方法对比/.ipynb_checkpoints/NLP处理实例-checkpoint.ipynb
new file mode 100644
index 0000000..8c3d0f3
--- /dev/null
+++ b/机器学习竞赛实战_优胜解决方案/机器学习实战小项目/文本特征处理方法对比/.ipynb_checkpoints/NLP处理实例-checkpoint.ipynb
@@ -0,0 +1,202 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## NLP Processing Example\n",
+    "\n",
+    "### Dataset: Disasters on social media\n",
+    "\n",
+    "Twitter (social media) carries a huge volume of posts. Some report real disasters, disease outbreaks, or riots, while others are just jokes or movie plots. How can we get a machine to tell these two kinds of discussion apart?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Why this matters\n",
+    "We will try to correctly predict which tweets are about disasters. This is a highly relevant problem because:\n",
+    "\n",
+    "* It is actionable for anyone trying to extract signal from noise (a police department, in this case).\n",
+    "* It is tricky, because relying on keywords is harder here than in most cases, such as spam detection."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Task\n",
+    "Classify the text data, then analyze the results to find what the predictions are attributed to."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sklearn\n",
+    "from tensorflow import keras\n",
+    "import nltk\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import re\n",
+    "import codecs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data input\n",
+    "The data needs some encoding cleanup first."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "input_file = codecs.open(\"data/socialmedia_relevant_cols.csv\", \"r\", encoding='utf-8', errors='replace')\n",
+    "output_file = open(\"data/socialmedia_relevant_cols_clean.csv\", \"w\", encoding='utf-8')\n",
+    "\n",
+    "def sanitize_characters(raw, clean):\n",
+    "    for line in raw:  # read from the argument, not the module-level handle\n",
+    "        out = line  # undecodable bytes were already replaced by errors='replace'\n",
+    "        clean.write(out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sanitize_characters(input_file, output_file)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data preprocessing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [

|   | text | choose_one | class_label |
|---|---|---|---|
| 0 | Just happened a terrible car crash | Relevant | 1 |
| 1 | Our Deeds are the Reason of this #earthquake M... | Relevant | 1 |
| 2 | Heard about #earthquake is different cities, s... | Relevant | 1 |
| 3 | there is a forest fire at spot pond, geese are... | Relevant | 1 |
| 4 | Forest fire near La Ronge Sask. Canada | Relevant | 1 |