Add. Loading Word2vec

pull/2/head
benjas 5 years ago
parent b10cc20af1
commit 7b787d6c09

@ -1,6 +1,256 @@
{
"cells": [],
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentiment Analysis with LSTM"
]
},
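{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Applications of Deep Learning in Natural Language Processing\n",
"Natural language processing (NLP) is about teaching machines to process or understand human language. Its main application areas include:\n",
"\n",
"* Dialogue systems - chatbots (such as XiaoIce)\n",
"* Sentiment analysis - identifying the sentiment of a piece of text (the task in this chapter)\n",
"* Image-text mapping - combining CNNs and RNNs\n",
"* Machine translation - translating one language into another\n",
"* Speech recognition - converting speech into text, as in Honor of Kings\n",
"\n",
"Please review [Chapter 4: Recurrent Neural Networks and Word Vectors Explained](https://github.com/ben1234560/AiLearning-Theory-Applying/blob/master/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E5%85%A5%E9%97%A8/%E7%AC%AC%E5%9B%9B%E7%AB%A0%E2%80%94%E2%80%94%E9%80%92%E5%BD%92%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E4%B8%8E%E8%AF%8D%E5%90%91%E9%87%8F%E5%8E%9F%E7%90%86%E8%A7%A3%E8%AF%BB.md)"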
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word Vector Models\n",
"Computers only understand numbers!\n",
"<img src=\"assets/20210112092212.png\" width=\"100%\">\n",
"We can convert every word in a sentence into a vector.\n",
"<img src=\"assets/20210112092241.png\" width=\"100%\">\n",
"All of these vectors have the same dimensionality."
]
},
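{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word vectors carry spatial meaning; they are not an arbitrary mapping. For example, we want the words “love” and “adore” to be related in the vector space, because they have similar definitions and are used in similar contexts. The vector representation of a word is also called a word embedding.\n",
"<img src=\"assets/20210112095444.png\" width=\"50%\">\n",
"As the figure above shows, in the word vectors built by word2vec, words with similar meanings lie close together in the high-dimensional space, while words with different meanings lie far apart."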
]
},
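{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Word2Vec\n",
"To obtain these word embeddings, we use a model called \"Word2Vec\". Put simply, the model infers a vector for each word from the contexts in which the word appears. If two words can be substituted for each other in the same contexts, their vectors end up very close together. In natural language, context is essential for working out what a word means. Take the words \"adore\" and \"love\" mentioned earlier and consider the contexts below.\n",
"<img src=\"assets/20210112100552.png\" width=\"50%\">\n",
"From these sentences we can see that both words are generally used in a positive sense and usually appear before nouns or noun phrases. This suggests that the two words can be substituted for each other and that their meanings are very close. Context is also very important for analyzing the grammatical structure of a sentence. So the job of this model is to take a large collection of sentences (Wikipedia, for example) and output one vector for every distinct word. The output of a Word2Vec model is called an embedding matrix.\n",
"<img src=\"assets/20210112100616.png\" width=\"70%\">\n",
"This embedding matrix contains one vector for every word in the training set. Traditionally, such embedding matrices are very large."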
]
},
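The idea that similar words lie close together can be checked numerically with cosine similarity. The snippet below is a minimal sketch, not part of the original notebook: the three-dimensional vectors in `toy_embeddings` are invented for illustration, whereas real Word2Vec or GloVe vectors have 50 to 300 dimensions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-dimensional embeddings, made up for this example only.
toy_embeddings = {
    "love": np.array([0.9, 0.8, 0.1]),
    "adore": np.array([0.85, 0.75, 0.2]),
    "table": np.array([-0.2, 0.1, 0.9]),
}

sim_close = cosine_similarity(toy_embeddings["love"], toy_embeddings["adore"])
sim_far = cosine_similarity(toy_embeddings["love"], toy_embeddings["table"])
print(sim_close > sim_far)  # True
```

With real embeddings the same function can be applied to rows of the embedding matrix; only the toy vectors are an assumption here.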
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recurrent Neural Networks (RNNs)"
]
},
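{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have word vectors as the input to our neural network, let's look at the network we need to build. A distinctive feature of NLP data is that it is sequential: each word in a sentence depends on the words before and after it. Because of this dependency, we use a recurrent neural network to process this kind of sequence data. The structure of a recurrent neural network differs somewhat from the feedforward networks you have seen before. A feedforward network consists of three parts: an input layer, hidden layers, and an output layer.\n",
"<img src=\"assets/20210112104331.png\" width=\"70%\">\n",
"\n",
"The main difference between a feedforward network and an RNN is that the RNN takes time into account. In an RNN, each word in a sentence is associated with a time step. In practice, the number of time steps equals the maximum sequence length.\n",
"<img src=\"assets/20210112104505.png\" width=\"70%\">\n",
"Each time step also has an associated intermediate state, a new component called the hidden state vector h(t). Abstractly, this vector encapsulates and summarizes all the information seen in the previous time steps, just as x(t) is a vector that encapsulates all the information about one particular word.\n",
"\n",
"The hidden state is a function of the current word vector and the hidden state vector from the previous step. The two terms are summed and the sum is passed through an activation function.\n",
"<img src=\"assets/20210112105004.png\" width=\"50%\">\n",
"\n",
"<img src=\"assets/20210112105055.png\" width=\"100%\">\n",
"As the figure above shows, the first word \"The\" (x(t-1)) goes through the neuron computation (W x(t-1)) to produce the feature vector h(t-1), which is then passed on to the second word, \"movie\". This repeats until the final step, which takes all of the preceding features into account.\n",
"\n",
"The figure also reveals a problem: the earlier an input appears, the harder it is for the network to still perceive it. This is commonly known as the vanishing gradient problem, and it is the reason LSTM is introduced. LSTM was covered in the previous chapter, so it is not repeated here.\n",
"\n",
"See [Chapter 4: Recurrent Neural Networks and Word Vectors Explained](https://github.com/ben1234560/AiLearning-Theory-Applying/blob/master/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E5%85%A5%E9%97%A8/%E7%AC%AC%E5%9B%9B%E7%AB%A0%E2%80%94%E2%80%94%E9%80%92%E5%BD%92%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E4%B8%8E%E8%AF%8D%E5%90%91%E9%87%8F%E5%8E%9F%E7%90%86%E8%A7%A3%E8%AF%BB.md)"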
]
},
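The hidden-state recurrence described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the notebook's model: `tanh` is chosen as the activation, the weight names `W_x` and `W_h` and all sizes are hypothetical, and the inputs are random numbers standing in for word vectors.

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h):
    """Run a vanilla RNN over a sequence and return every hidden state.

    inputs has shape (time_steps, input_dim): one word vector per time step.
    """
    hidden = np.zeros(W_h.shape[0])  # h(0): nothing has been seen yet
    states = []
    for x_t in inputs:
        # h(t) = tanh(W_x x(t) + W_h h(t-1))
        hidden = np.tanh(W_x @ x_t + W_h @ hidden)
        states.append(hidden)
    return np.stack(states)


rng = np.random.default_rng(0)
time_steps, input_dim, hidden_dim = 10, 50, 64  # illustrative sizes only
sentence = rng.normal(size=(time_steps, input_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

states = rnn_forward(sentence, W_x, W_h)
print(states.shape)  # (10, 64)
```

The last row of `states` is the hidden state after the final word, which is the vector a classifier would typically read. An LSTM replaces the single `tanh` update with gated updates but keeps the same loop over time steps.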
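{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Project Workflow\n",
"\n",
" 1. Build the word vectors: use the gensim library, or use pre-trained vectors directly\n",
" 2. Map words to IDs\n",
" 3. Build the RNN architecture\n",
" 4. Train the model\n",
" 5. Evaluate the results"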
]
},
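{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the Data\n",
"First, we need to create the word vectors. For simplicity, we use a pre-trained model.\n",
"\n",
"As one of the biggest players in this field, Google has trained a word2vec model on a massive dataset containing about 100 billion words. From that model, Google produced 3 million word vectors, each with 300 dimensions.\n",
"\n",
"Ideally we would use these vectors to build our model, but because that word vector matrix is quite large (3.6 GB), we use a smaller ready-made matrix trained with GloVe instead. It contains 400,000 word vectors, each with 50 dimensions.\n",
"\n",
"We will import two data structures: a Python list of the 400,000 words, and a 400,000 `*` 50 embedding matrix that holds all of the word vector values."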
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset directory looks like this:\n",
"<img src=\"assets/20210112141331.png\" width=\"50%\">\n",
"The negativeReviews and positiveReviews folders each contain one review per txt file."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded the word list!\n",
"Loaded the word vectors!\n",
"400000\n",
"(400000, 50)\n"
]
}
],
"source": [
"import numpy as np\n",
"wordsList = np.load('./training_data/wordsList.npy')\n",
"print('Loaded the word list!')\n",
"wordsList = wordsList.tolist()  # Originally loaded as a NumPy array of bytes\n",
"wordsList = [word.decode('UTF-8') for word in wordsList]  # Decode each word from UTF-8 bytes to str\n",
"wordVectors = np.load('./training_data/wordVectors.npy')\n",
"print('Loaded the word vectors!')\n",
"print(len(wordsList))  # Number of words in the vocabulary\n",
"print(wordVectors.shape)  # (vocabulary size, vector dimension)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also search the word list for a word such as “baseball” and then look up its vector in the embedding matrix, as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-1.9327 , 1.0421 , -0.78515 , 0.91033 , 0.22711 , -0.62158 ,\n",
" -1.6493 , 0.07686 , -0.5868 , 0.058831, 0.35628 , 0.68916 ,\n",
" -0.50598 , 0.70473 , 1.2664 , -0.40031 , -0.020687, 0.80863 ,\n",
" -0.90566 , -0.074054, -0.87675 , -0.6291 , -0.12685 , 0.11524 ,\n",
" -0.55685 , -1.6826 , -0.26291 , 0.22632 , 0.713 , -1.0828 ,\n",
" 2.1231 , 0.49869 , 0.066711, -0.48226 , -0.17897 , 0.47699 ,\n",
" 0.16384 , 0.16537 , -0.11506 , -0.15962 , -0.94926 , -0.42833 ,\n",
" -0.59457 , 1.3566 , -0.27506 , 0.19918 , -0.36008 , 0.55667 ,\n",
" -0.70315 , 0.17157 ], dtype=float32)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baseballIndex = wordsList.index('baseball')\n",
"wordVectors[baseballIndex]"
]
},
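{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have the vectors, the first step is to take an input sentence and build its vector representation. Suppose the input sentence is “I thought the movie was incredible and inspiring”. To get its word vectors we can use TensorFlow's embedding lookup function. It takes two arguments: the embedding matrix (in our case, the word vector matrix) and the index of each word."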
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"maxSeqLength = 10  # Maximum number of words in a sentence\n",
"numDimensions = 50  # Dimensions of each word vector (the GloVe vectors loaded above are 50-dimensional)\n",
"firstSentence = np.zeros((maxSeqLength), dtype='int32')\n",
"firstSentence[0] = wordsList.index(\"i\")\n",
"firstSentence[1] = wordsList.index(\"thought\")\n",
"firstSentence[2] = wordsList.index(\"the\")\n",
"firstSentence[3] = wordsList.index(\"movie\")\n",
"firstSentence[4] = wordsList.index(\"was\")\n",
"firstSentence[5] = wordsList.index(\"incredible\")\n",
"firstSentence[6] = wordsList.index(\"and\")\n",
"firstSentence[7] = wordsList.index(\"inspiring\")\n",
"# Positions beyond the end of the sentence stay 0 as padding\n",
"print(firstSentence.shape)\n",
"print(firstSentence)  # Shows the row index for each word"
]
},
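Filling the index array one position at a time works for a single sentence, but the same word-to-ID mapping can be wrapped in a small helper. The sketch below is an assumption-laden illustration, not code from the notebook: `encode_sentence`, `toy_words`, and the choice to map unknown words to index 0 are all hypothetical, and a dictionary is used so that each lookup avoids the linear scan performed by `list.index`.

```python
import numpy as np


def encode_sentence(sentence, word_to_index, max_seq_length, unknown_index=0):
    """Map a sentence to a fixed-length array of word indices, padded with 0."""
    ids = np.zeros(max_seq_length, dtype="int32")
    tokens = sentence.lower().split()[:max_seq_length]  # truncate long input
    for position, token in enumerate(tokens):
        ids[position] = word_to_index.get(token, unknown_index)
    return ids


# Hypothetical miniature vocabulary standing in for the 400,000-word list.
toy_words = ["<pad>", "i", "thought", "the", "movie", "was",
             "incredible", "and", "inspiring"]
word_to_index = {word: index for index, word in enumerate(toy_words)}

ids = encode_sentence("I thought the movie was incredible and inspiring",
                      word_to_index, 10)
print(ids)  # [1 2 3 4 5 6 7 8 0 0]
```

Splitting on whitespace is a simplification; real review text would also need punctuation stripped before the lookup.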
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import tensorflow as tf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -85,6 +85,156 @@
"See [Chapter 4: Recurrent Neural Networks and Word Vectors Explained](https://github.com/ben1234560/AiLearning-Theory-Applying/blob/master/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E5%85%A5%E9%97%A8/%E7%AC%AC%E5%9B%9B%E7%AB%A0%E2%80%94%E2%80%94%E9%80%92%E5%BD%92%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E4%B8%8E%E8%AF%8D%E5%90%91%E9%87%8F%E5%8E%9F%E7%90%86%E8%A7%A3%E8%AF%BB.md)"
]
},
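{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data pipeline is shown in the figure below:\n",
"<img src=\"assets/20210112143537.png\" width=\"100%\">\n",
"\n",
"The output is a 10*50 matrix: 10 words, each represented by a 50-dimensional vector. The lookup simply finds the vector that corresponds to each word index."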
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# tf.Session is the TensorFlow 1.x API\n",
"with tf.Session() as sess:\n",
"    print(tf.nn.embedding_lookup(wordVectors, firstSentence).eval().shape)"
]
},
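For a single embedding matrix, the lookup amounts to gathering rows by index, which NumPy fancy indexing reproduces without a TensorFlow session. The following is a rough sketch for intuition only: `toy_vectors` and `toy_sentence` are hypothetical stand-ins for `wordVectors` and `firstSentence`, filled with random values and arbitrary indices.

```python
import numpy as np

# Hypothetical stand-ins for the real embedding matrix and index array.
vocab_size, num_dimensions = 1000, 50
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(vocab_size, num_dimensions)).astype("float32")
toy_sentence = np.array([41, 804, 201, 534, 15, 7, 5, 9, 0, 0], dtype="int32")

# Row gather: embedded[i] is the vector stored at row toy_sentence[i].
embedded = toy_vectors[toy_sentence]
print(embedded.shape)  # (10, 50)
```

The two trailing zeros are padding positions, so they both pick up row 0 of the matrix.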
{
"cell_type": "code",
"execution_count": null,
