@@ -215,13 +215,468 @@
"print(firstSentence) #Shows the row index for each word"
"print(firstSentence) #Shows the row index for each word"
]
]
},
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据管道如下图所示:\n",
"<img src=\"assets/20210112143537.png\" width=\"100%\">\n",
"\n",
"输出数据是一个 10*50 的词矩阵,其中包括 10 个词,每个词的向量维度是 50。就是去找到这些词对应的向量"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"with tf.Session() as sess:\n",
"print(tf.nn.embedding_lookup(wordVectors,firstSentence).eval().shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在整个训练集上面构造索引之前,我们先花一些时间来可视化我们所拥有的数据类型。这将帮助我们去决定如何设置最大序列长度的最佳值。在前面的例子中,我们设置了最大长度为 10, 但这个值在很大程度上取决于你输入的数据。\n",
"\n",
"训练集我们使用的是 IMDB 数据集。这个数据集包含 25000 条电影数据,其中 12500 条正向数据, 12500 条负向数据。这些数据都是存储在一个文本文件中,首先我们需要做的就是去解析这个文件。正向数据包含在一个文件中,负向数据包含在另一个文件中。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from os import listdir\n",
"from os.path import isfile, join\n",
"# 指定数据集位置,由于提供的数据都是一个个单独的文件,所以得一个个读取\n",
"positiveFiles = ['./training_data/positiveReviews/' + f for f in listdir('./training_data/positiveReviews/') if isfile(join('./training_data/positiveReviews/', f))]\n",
"negativeFiles = ['./training_data/negativeReviews/' + f for f in listdir('./training_data/negativeReviews/') if isfile(join('./training_data/negativeReviews/', f))]\n",
"numWords = []\n",
"for pf in positiveFiles:\n",
" with open(pf, \"r\", encoding='utf-8') as f:\n",
" line=f.readline()\n",
" counter = len(line.split())\n",
" numWords.append(counter) \n",
"print('Positive files finished')\n",
"\n",
"for nf in negativeFiles:\n",
" with open(nf, \"r\", encoding='utf-8') as f:\n",
" line=f.readline()\n",
" counter = len(line.split())\n",
" numWords.append(counter) \n",
"print('Negative files finished')\n",
"\n",
"numFiles = len(numWords)\n",
"print('The total number of files is', numFiles)\n",
"print('The total number of words in the files is', sum(numWords))\n",
"print('The average number of words in the files is', sum(numWords)/len(numWords))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"plt.hist(numWords, 50)\n",
"plt.xlabel('Sequence Length')\n",
"plt.ylabel('Frequency')\n",
"plt.axis([0, 1200, 0, 8000])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"从直方图和句子的平均单词数,我们认为将句子最大长度设置为绝大多数的长度 250 是可行的。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"maxSeqLength = 250"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Look at one of the reviews\n",
"fname = positiveFiles[3] #Can use any valid index (not just 3)\n",
"with open(fname) as f:\n",
"    for lines in f:\n",
"        print(lines)\n",
"        exit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"接下来,我们将它转换成一个索引矩阵。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 删除标点符号、括号、问号等,只留下字母数字字符\n",
"import re\n",
"strip_special_chars = re.compile(\"[^A-Za-z0-9 ]+\")\n",
"\n",
"def cleanSentences(string):\n",
" string = string.lower().replace(\"<br />\", \" \")\n",
" return re.sub(strip_special_chars, \"\", string.lower())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"firstFile = np.zeros((maxSeqLength), dtype='int32')\n",
"with open(fname) as f:\n",
" indexCounter = 0\n",
" line=f.readline()\n",
" cleanedLine = cleanSentences(line)\n",
" split = cleanedLine.split()\n",
" for word in split:\n",
" try:\n",
" firstFile[indexCounter] = wordsList.index(word)\n",
" except ValueError:\n",
" firstFile[indexCounter] = 399999 #Vector for unknown words\n",
" indexCounter = indexCounter + 1\n",
"firstFile"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"现在,我们用相同的方法来处理全部的 25000 条评论。我们将导入电影训练集,并且得到一个 25000 * 250 的矩阵。这是一个计算成本非常高的过程,可以直接使用理好的索引矩阵文件。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ids = np.load('./training_data/idsMatrix.npy')"
]
},
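{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to build the index matrix yourself instead of loading the precomputed file, the cell below is a minimal sketch of how that could be done. It assumes positiveFiles, negativeFiles, cleanSentences, wordsList, numFiles and maxSeqLength are defined as above, with positive reviews occupying the first rows; note that wordsList.index is a linear search, so this loop is very slow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: building the 25000 x 250 index matrix from scratch (slow; the precomputed file is recommended)\n",
"ids = np.zeros((numFiles, maxSeqLength), dtype='int32')\n",
"fileCounter = 0\n",
"for fname in positiveFiles + negativeFiles:\n",
"    with open(fname, \"r\", encoding='utf-8') as f:\n",
"        indexCounter = 0\n",
"        line = f.readline()\n",
"        cleanedLine = cleanSentences(line)\n",
"        for word in cleanedLine.split():\n",
"            if indexCounter >= maxSeqLength:\n",
"                break  # truncate long reviews\n",
"            try:\n",
"                ids[fileCounter][indexCounter] = wordsList.index(word)\n",
"            except ValueError:\n",
"                ids[fileCounter][indexCounter] = 399999  # unknown-word index\n",
"            indexCounter = indexCounter + 1\n",
"    fileCounter = fileCounter + 1\n",
"# np.save('./training_data/idsMatrix.npy', ids)"
]
},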
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 构建LSTM网络模型"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### RNN Model\n",
"现在,我们可以开始构建我们的 TensorFlow 图模型。首先, 我们需要去定义一些超参数, 比如批处理大小, LSTM的单元个数, 分类类别和训练次数。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"batchSize = 24 # 梯度处理的大小\n",
"lstmUnits = 64 # 隐藏层神经元数量\n",
"numClasses = 2 # 分类数量, n/p\n",
"iterations = 50000 # 迭代次数"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"与大多数 TensorFlow 图一样,现在我们需要指定两个占位符,一个用于数据输入,另一个用于标签数据。对于占位符,最重要的一点就是确定好维度。\n",
"\n",
"标签占位符代表一组值,每一个值都为 [1,0] 或者 [0,1],这个取决于数据是正向的还是负向的。输入占位符,是一个整数化的索引数组。 \n",
"\n",
"<img src=\"assets/20210112150210.png\" width=\"100%\">"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tf.reset_default_graph()\n",
"\n",
"labels = tf.placeholder(tf.float32, [batchSize, numClasses])\n",
"input_data = tf.placeholder(tf.int32, [batchSize, maxSeqLength])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一旦,我们设置了我们的输入数据占位符,我们可以调用 tf. nn. embedding lookup0函数来得到我们的词向量。该函数最后将返回一个三维向量,第一个维度是批处理大小,第二个维度是句子长度,第三个维度是词向量长度。更清晰的表达,如下图际示"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = tf.Variable(tf.zeros([batchSize, maxSeqLength, numDimensions]),dtype=tf.float32)\n",
"data = tf.nn.embedding_lookup(wordVectors,input_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"现在我们已经得到了我们想要的数据形式,那么揭晓了我们看看如何才能将这种数据形式输入到我们的 LSTM 网络中。首先,我们使用 tf.nn.rnn_cell.BasicLSTMCell 函数,这个函数输入的参数是一个整数,表示需要几个 LSTM 单元。这是我们设置的一个超参数,我们需要对这个数值进行调试从而来找到最优的解。然后,我们会设置一个 dropout 参数,以此来避免一些过拟合。\n",
"\n",
"最后,我们将 LSTM cell 和三维的数据输入到 tf.nn.dynamic_rnn ,这个函数的功能是展开整个网络,并且构建一整个 RNN 模型。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits) # 基本单元\n",
"lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75) # 解决一些过拟合问题, output_keep_prob保留比例, 这个在LSTM的讲解中有解释过\n",
"value, _ = tf.nn.dynamic_rnn(lstmCell, data, dtype=tf.float32) # 构建网络, value是值h, _ 是中间传递结果,这里不需要分析所以去掉"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"堆栈 LSTM 网络是一个比较好的网络架构。也就是前一个LSTM 隐藏层的输出是下一个LSTM的输入。堆栈LSTM可以帮助模型记住更多的上下文信息, 但是带来的弊端是训练参数会增加很多, 模型的训练时间会很长, 过拟合的几率也会增加。\n",
"\n",
"dynamic RNN 函数的第一个输出可以被认为是最后的隐藏状态向量。这个向量将被重新确定维度,然后乘以最后的权重矩阵和一个偏置项来获得最终的输出值。"
]
},
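{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model in this notebook uses a single LSTM layer. The cell below is only a hedged sketch of how stacking could look with tf.contrib.rnn.MultiRNNCell; numLayers is an illustrative value and the stacked cell is not used in the rest of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: a stacked (multi-layer) LSTM; not used by the model below\n",
"numLayers = 2  # illustrative choice, not taken from the original model\n",
"def makeCell():\n",
"    cell = tf.contrib.rnn.BasicLSTMCell(lstmUnits)\n",
"    return tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=0.75)\n",
"stackedCell = tf.contrib.rnn.MultiRNNCell([makeCell() for _ in range(numLayers)])\n",
"# value, _ = tf.nn.dynamic_rnn(stackedCell, data, dtype=tf.float32)"
]
},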
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 权重参数初始化\n",
"weight = tf.Variable(tf.truncated_normal([lstmUnits, numClasses]))\n",
"bias = tf.Variable(tf.constant(0.1, shape=[numClasses]))\n",
"value = tf.transpose(value, [1, 0, 2])\n",
"# 获取最终的结果值\n",
"last = tf.gather(value, int(value.get_shape()[0]) - 1) # 去ht\n",
"prediction = (tf.matmul(last, weight) + bias) # 最终连上w和b"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"接下来, 我们需要定义正确的预测函数和正确率评估参数。正确的预测形式是查看最后输出的0-1向量是否和标记的0-1向量相同。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))\n",
"accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))"
]
},
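{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a sketch with made-up numbers, computed with numpy outside the graph) of how the argmax comparison above works:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example of the accuracy computation above, using numpy instead of the graph\n",
"demoPred = np.array([[2.3, -1.1], [0.2, 0.9]])  # two rows of example logits\n",
"demoLabels = np.array([[1, 0], [1, 0]])         # both examples labelled positive\n",
"correct = np.argmax(demoPred, 1) == np.argmax(demoLabels, 1)\n",
"print(correct.mean())  # 0.5: the first row is classified correctly, the second is not"
]
},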
{
"cell_type": "markdown",
"metadata": {},
"source": [
"之后,我们使用一个标准的交叉熵损失函数来作为损失值。对于优化器,我们选择 Adam, 并且采用默认的学习率"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))\n",
"optimizer = tf.train.AdamOptimizer().minimize(loss)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 训练与测试结果"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 超参数调整\n",
"\n",
"选择合适的超参数来训练你的神经网络是至关重要的。你会发现你的训练损失值与你选择的优化器( Adam, Adadelta, SGD, 等等) , 学习率和网络架构都有很大的关系。特别是在RNN和LSTM中, 单元数量和词向量的大小都是重要因素。\n",
"\n",
" * 学习率: RNN最难的一点就是它的训练非常困难, 因为时间步骤很长。那么, 学习率就变得非常重要了。如果我们将学习率设置的很大, 那么学习曲线就会波动性很大, 如果我们将学习率设置的很小, 那么训练过程就会非常缓慢。根据经验, 将学习率默认设置为 0.001 是一个比较好的开始。如果训练的非常缓慢,那么你可以适当的增大这个值,如果训练过程非常的不稳定,那么你可以适当的减小这个值。\n",
" \n",
" * 优化器:这个在研究中没有一个一致的选择,但是 Adam 优化器被广泛的使用。\n",
" \n",
" * LSTM单元的数量: 这个值很大程度上取决于输入文本的平均长度。而更多的单元数量可以帮助模型存储更多的文本信息, 当然模型的训练时间就会增加很多, 并且计算成本会非常昂贵。\n",
" \n",
" * 词向量维度: 词向量的维度一般我们设置为50到300。维度越多意味着可以存储更多的单词信息, 但是你需要付出的是更昂贵的计算成本。"
]
},
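{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of the knobs above (not part of the model already built), the cell below collects illustrative values in one place and shows how an explicit learning rate could be passed to AdamOptimizer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: hyperparameters discussed above, gathered for experimentation (illustrative values)\n",
"hyperparams = {\n",
"    'learning_rate': 0.001,   # Adam's default; raise if training is too slow, lower if it is unstable\n",
"    'lstm_units': 64,         # more units store more context but cost more compute\n",
"    'word_vector_dim': 50,    # the GloVe vectors used here are 50-d; 50-300 is a common range\n",
"}\n",
"# Example of an optimizer with an explicit learning rate (the model above uses the default):\n",
"# optimizer = tf.train.AdamOptimizer(learning_rate=hyperparams['learning_rate']).minimize(loss)"
]
},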
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 训练\n",
"\n",
"训练过程的基本思路是,我们首先先定义一个 TensorFlow 会话。然后,我们加载一批评论和对应的标签。接下来,我们调用会话的 run 函数。这个函数有两个参数,第一个参数被称为 fetches 参数,这个参数定义了我们感兴趣的值。我们希望通过我们的优化器来最小化损失函数。第二个参数被称为 feed_dict 参数。这个数据结构就是我们提供给我们的占位符。我们需要将一个批处理的评论和标签输入模型,然后不断对这一组训练数据进行循环训练。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 辅助函数\n",
"from random import randint\n",
"# 制作batch数据, 通过数据集索引位置来设置训练集和预测集\n",
"# 并让batch中正负样本各占一半, 同事给定其当前标签\n",
"def getTrainBatch():\n",
" labels = []\n",
" arr = np.zeros([batchSize, maxSeqLength])\n",
" for i in range(batchSize):\n",
" if (i % 2 == 0): \n",
" num = randint(1,11499)\n",
" labels.append([1,0])\n",
" else:\n",
" num = randint(13499,24999)\n",
" labels.append([0,1])\n",
" arr[i] = ids[num-1:num]\n",
" return arr, labels\n",
"\n",
"def getTestBatch():\n",
" labels = []\n",
" arr = np.zeros([batchSize, maxSeqLength])\n",
" for i in range(batchSize):\n",
" num = randint(11499,13499)\n",
" if (num <= 12499):\n",
" labels.append([1,0])\n",
" else:\n",
" labels.append([0,1])\n",
" arr[i] = ids[num-1:num]\n",
" return arr, labels"
]
},
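{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the batch helpers (illustrative only; demoArr and demoLabels are throwaway names):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: one training batch and its shape\n",
"demoArr, demoLabels = getTrainBatch()\n",
"print(demoArr.shape)    # (24, 250): batchSize x maxSeqLength\n",
"print(len(demoLabels))  # 24 one-hot labels, alternating positive/negative"
]
},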
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sess = tf.InteractiveSession()\n",
"saver = tf.train.Saver()\n",
"sess.run(tf.global_variables_initializer())\n",
"\n",
"for i in range(iterations):\n",
" # 上面定义的, 拿到batch数据的函数\n",
" nextBatch, nextBatchLabels = getTrainBatch();\n",
" sess.run(optimizer, {input_data: nextBatch, labels: nextBatchLabels}) \n",
" # 隔一万次打印一次当前结果\n",
" if (i % 1000 == 0 and i != 0):\n",
" loss_ = sess.run(loss, {input_data: nextBatch, labels: nextBatchLabels})\n",
" accuracy_ = sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})\n",
"\n",
" print(\"iteration {}/{}...\".format(i+1, iterations),\n",
" \"loss {}...\".format(loss_),\n",
" \"accuracy {}...\".format(accuracy_)) \n",
" # Save the network every 10,000 training iterations, 隔一万次保存一次, 防止后面效果并没有继续变好\n",
" if (i % 10000 == 0 and i != 0):\n",
" save_path = saver.save(sess, \"models/pretrained_lstm.ckpt\", global_step=i)\n",
" print(\"saved to %s\" % save_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"上面结果稍微过拟合, 因为accuracy达到了1.0,查看可视化结果\n",
"<img src=\"assets/20210112153302.png\" width=\"70%\">\n",
"<img src=\"assets/20210112153547.png\" width=\"70%\">"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"查看上面的训练曲线,我们发现这个模型的训练结果还是不错的。损失值在稳定的下降,正确率也不断的在接近 100% 。然而,当分析训练曲线的时候,我们应该注意到我们的模型可能在训练集上面已经过拟合了。过拟合是机器学习中一个非常常见的问题,表示模型在训练集上面拟合的太好了,但是在测试集上面的泛化能力就会差很多。也就是说,如果你在训练集上面取得了损失值是 0 的模型,但是这个结果也不一定是最好的结果。当我们训练 LSTM 的时候,提前终止是一种常见的防止过拟合的方法。基本思路是,我们在训练集上面进行模型训练,同事不断的在测试集上面测量它的性能。一旦测试误差停止下降了,或者误差开始增大了,那么我们就需要停止训练了。因为这个迹象表明,我们网络的性能开始退化了。\n",
"\n",
"导入一个预训练的模型需要使用 TensorFlow 的另一个会话函数,称为 Server ,然后利用这个会话函数来调用 restore 函数。这个函数包括两个参数,一个表示当前的会话,另一个表示保存的模型。"
]
},
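{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook's own training loop above simply saved checkpoints every 10,000 iterations. The cell below is only a hedged sketch of the early-stopping idea just described, reusing getTrainBatch/getTestBatch and the graph, session and saver defined earlier; patience is an illustrative value. After this sketch, the notebook reloads a pretrained checkpoint as described above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: training with early stopping on a held-out batch (illustration only, not the loop used above)\n",
"bestTestAcc = 0.0\n",
"badChecks = 0\n",
"patience = 5  # stop after this many evaluations without improvement (illustrative)\n",
"for i in range(iterations):\n",
"    trainBatch, trainLabels = getTrainBatch()\n",
"    sess.run(optimizer, {input_data: trainBatch, labels: trainLabels})\n",
"    if (i % 1000 == 0 and i != 0):\n",
"        testBatch, testLabels = getTestBatch()\n",
"        testAcc = sess.run(accuracy, {input_data: testBatch, labels: testLabels})\n",
"        if testAcc > bestTestAcc:\n",
"            bestTestAcc = testAcc\n",
"            badChecks = 0\n",
"            saver.save(sess, \"models/pretrained_lstm.ckpt\", global_step=i)  # keep the best model so far\n",
"        else:\n",
"            badChecks = badChecks + 1\n",
"        if badChecks >= patience:\n",
"            print(\"Early stopping at iteration\", i)\n",
"            break"
]
},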
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sess = tf.InteractiveSession()\n",
"saver = tf.train.Saver()\n",
"saver.restore(sess, tf.train.latest_checkpoint('models'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"然后,从我们的测试集中导入一些电影评论。请注意,这些评论是模型从来没有看见过的。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iterations = 10\n",
"for i in range(iterations):\n",
" nextBatch, nextBatchLabels = getTestBatch();\n",
" print(\"Accuracy for this batch:\", (sess.run(accuracy, {input_data: nextBatch, labels: nextBatchLabels})) * 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"测试结果中有些高有些低, 可以看出依然存在过拟合现象, 目前构建的模型也是比较简单的模型, 在实际运用中会堆叠不止一层的LSTM\n",
"\n",
"我们自己测试的时候可以多试几个超参数,特别是词向量维度。具体可以参考上面的超参数调整。"
]
},
{