@@ -1238,15 +1238,34 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model Optimization: Hyperparameter Tuning\n",
"\n",
"**In machine learning, optimizing a model means finding the best set of hyperparameters for a particular problem.**\n",
"\n",
"* Model hyperparameters are settings for a machine learning algorithm that the data scientist tunes before training, such as the number of trees in a random forest or the number of neighbors used in K-nearest neighbors regression.\n",
"* Model parameters are what the model learns during training, such as the weights in a linear regression.\n",
"\n",
"**[Tuning the model hyperparameters](http://scikit-learn.org/stable/modules/grid_search.html) controls the balance between underfitting and overfitting in a model.**\n",
"\n",
"* We can try to correct underfitting by building a more complex model, such as using more trees in a random forest or more layers in a neural network. An underfit model has high bias, which occurs when the model does not have enough capacity (degrees of freedom) to learn the relationship between the features and the target.\n",
"* We can try to correct overfitting by limiting the model's complexity and applying regularization. This might mean reducing the degree of a polynomial regression or removing layers from a neural network. An overfit model has high variance and has effectively memorized the training set. Both underfitting and overfitting lead to poor generalization on the test set."
]
},
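{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of this distinction (not part of the original analysis; the toy data below is made up), `n_estimators` is a hyperparameter we choose before training, while the coefficients of a linear regression are parameters learned from the data:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Toy data, purely for illustration\n",
"X_toy = np.random.RandomState(42).rand(100, 3)\n",
"y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + 0.1\n",
"\n",
"# Hyperparameter: set by us before training\n",
"rf = RandomForestRegressor(n_estimators=100).fit(X_toy, y_toy)\n",
"\n",
"# Parameters: learned by the model during training\n",
"lr = LinearRegression().fit(X_toy, y_toy)\n",
"print(lr.coef_, lr.intercept_)"
]
},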
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hyperparameter Tuning with Random Search and Cross Validation\n",
"\n",
"* Random search refers to the method by which we choose hyperparameters to evaluate: we define a range of options and then randomly sample combinations to try. This is in contrast to grid search, which evaluates every combination we specify. In general, random search is better when our knowledge of the best hyperparameters is limited; we can use random search to narrow down the options and then use grid search over a more limited range.\n",
"\n",
"* Cross validation is the method used to assess the performance of a set of hyperparameters. Rather than splitting the training set into separate training and validation sets, which reduces the amount of data we can train on, we use K-fold cross validation. This means dividing the training data into K folds and running an iterative process in which we first train on K-1 of the folds and then evaluate performance on the K-th fold. We repeat this process K times, so eventually we will have tested on every example in the training data, and crucially, each iteration is evaluated on data the model was not trained on. At the end of K-fold cross validation, we take the average error across the K iterations as the final performance measure, and then train the model on all of the training data. The performance we record is then used to compare different combinations of hyperparameters.\n",
"\n",
"A picture of k-fold cross validation with k = 5 is shown below:\n",
"\n",
"<img src=\"data/kfold_cv.png\" width=\"70%\">\n",
"\n",
"We will implement random search with cross validation to select the best hyperparameters for the GBDT regressor. We first define a grid, then run an iterative process: randomly sample a set of hyperparameters from the grid, evaluate them with 4-fold cross validation, and then select the best-performing set."
]
},
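{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what the cross-validation step computes (assuming `X` and `y` stand for the training features and target; these are placeholder names, not necessarily the notebook's variables), `cross_val_score` performs the fold splitting and held-out evaluation described above:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingRegressor\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"# Score one hyperparameter setting with 4-fold CV: scikit-learn trains\n",
"# on 3 folds and evaluates on the held-out fold, 4 times over\n",
"scores = cross_val_score(GradientBoostingRegressor(n_estimators=100),\n",
"                         X, y, cv=4, scoring='neg_mean_absolute_error')\n",
"\n",
"# The average held-out error is the performance measure for this setting\n",
"print(-scores.mean())"
]
},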
{
@@ -1283,6 +1302,27 @@
"                       'max_features': max_features} "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We selected 6 different hyperparameters to tune for the GBDT regressor. These all affect the model in different ways that are hard to determine in advance, and the only way to find the best combination for a specific problem is to test them! To learn about the hyperparameters, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor). For now, just know that we are trying to find the best combination of hyperparameters, and since no theory tells us which will work best, we simply have to evaluate them, like running an experiment!\n",
"\n",
"We create the random search object and pass in the following parameters:\n",
"\n",
"* `estimator`: the selected model\n",
"* `param_distributions`: the distribution of parameters we defined\n",
"* `cv`: the number of folds for k-fold cross validation\n",
"* `n_iter`: the number of different combinations to try\n",
"* `scoring`: the metric used for evaluation\n",
"* `n_jobs`: the number of cores to run in parallel (-1 uses all available)\n",
"* `verbose`: how much information to display (1 displays a limited amount)\n",
"* `return_train_score`: return the training score for each cross-validation fold\n",
"* `random_state`: a fixed random seed so the data splits are the same on every run\n",
"\n",
"The random search object is trained the same way as any other scikit-learn model. After training, we can compare all the different hyperparameter combinations and find the best-performing one; a sketch of this setup follows below."
]
},
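{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of that setup, assuming the hyperparameter grid from the earlier cell is named `hyperparameter_grid` and that `X` and `y` are the training data (placeholder names); `n_iter = 25` here is an arbitrary choice for illustration:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingRegressor\n",
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"random_cv = RandomizedSearchCV(estimator = GradientBoostingRegressor(random_state = 42),\n",
"                               param_distributions = hyperparameter_grid,\n",
"                               cv = 4, n_iter = 25,\n",
"                               scoring = 'neg_mean_absolute_error',\n",
"                               n_jobs = -1, verbose = 1,\n",
"                               return_train_score = True,\n",
"                               random_state = 42)\n",
"\n",
"# Fit works the same way as for any other scikit-learn estimator\n",
"random_cv.fit(X, y)"
]
},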
{
"cell_type": "markdown",
"metadata": {},
@@ -1790,7 +1830,7 @@
}
],
"source": [
"# Get all of the cv results and sort by the test performance\n",
"random_results = pd.DataFrame(random_cv.cv_results_).sort_values('mean_test_score', ascending = False)\n",
"\n",
"random_results.head(10)"
@@ -1824,6 +1864,25 @@
"random_cv.best_estimator_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The best GBDT model has the following hyperparameters:\n",
"* `loss = lad`\n",
"* `n_estimators = 500`\n",
"* `max_depth = 5`\n",
"* `min_samples_leaf = 6`\n",
"* `min_samples_split = 6`\n",
"* `max_features = None` \n",
"\n",
"Using random search is a good way to narrow down the range of possible hyperparameters. Initially, we had no idea which combination would work best, but this narrows the range of options.\n",
"\n",
"We can feed the random search results, and values around them, into a grid search to look for hyperparameters that perform even better than the best ones found by random search.\n",
"\n",
"Here, we use a grid search that only tests n_estimators (the number of trees), and then plot the training and testing performance to understand what increasing the number of trees does for our model."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -1839,7 +1898,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Create a range of trees to evaluate\n",
"trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}\n",
"\n",
"model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,\n",
@@ -1848,14 +1907,10 @@
"                                  max_features = None,\n",
"                                  random_state = 42)\n",
"\n",
"# Grid search object using the range of trees and the gradient boosting model\n",
"grid_search = GridSearchCV(estimator = model, \n",
"                           param_grid = trees_grid, \n",
"                           cv = 4, \n",
"                           scoring = 'neg_mean_absolute_error', \n",
"                           verbose = 1,\n",
"                           n_jobs = -1, \n",
"                           return_train_score = True)\n"
]
},
{
@@ -1936,10 +1991,10 @@
}
],
"source": [
"# Get the results into a dataframe\n",
"results = pd.DataFrame(grid_search.cv_results_)\n",
"\n",
"# Plot the training and testing error vs number of trees\n",
"figsize(8, 8)\n",
"plt.style.use('fivethirtyeight')\n",
"plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')\n",
@@ -2160,7 +2215,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"From the plot above, our model is overfitting! The training error is significantly lower than the testing error, which means the model learns the training data very well but fails to generalize to the test data. As the number of trees grows, the amount of overfitting increases: both the testing and training error decrease with more trees, but the training error decreases faster.\n",
"\n",
"There will always be a gap between the training and testing error (training error is always lower), but if the gap is significant, we should try to reduce overfitting, either by getting more training data or by decreasing the model's complexity through hyperparameter tuning or regularization. For the GBDT regressor, some options include reducing the number of trees, reducing the maximum depth of each tree, and increasing the minimum number of samples in a leaf node (a hypothetical sketch follows below). To dig deeper into GBDT, see [this article](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/). For now, we will use the best-performing model and accept that it may be overfitting."
]
},
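{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hypothetical, more regularized configuration along those lines (the specific values are illustrative, not tuned):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingRegressor\n",
"\n",
"# Illustrative only: fewer trees, shallower trees, and larger leaves\n",
"# all reduce the model's capacity to memorize the training set\n",
"regularized_model = GradientBoostingRegressor(loss = 'lad',\n",
"                                              n_estimators = 200,\n",
"                                              max_depth = 3,\n",
"                                              min_samples_leaf = 10,\n",
"                                              random_state = 42)"
]
},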
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate the Model on the Test Set\n",
"\n",
"We use the best model from hyperparameter tuning to make predictions on the test set. Our model has never seen the test set before, so its performance should be a good indicator of how the model would perform if deployed in production.\n",
"\n",
"For comparison, we evaluate the performance of both the default model and the optimized model, along the lines of the sketch below."
]
},
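{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of that comparison, assuming `default_model` and `final_model` as defined in the next cell, with `X`/`y` as the training data and `X_test`/`y_test` as the held-out test set (placeholder names):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_absolute_error\n",
"\n",
"# Fit both models on the full training set\n",
"default_model.fit(X, y)\n",
"final_model.fit(X, y)\n",
"\n",
"# Predict on the held-out test set\n",
"default_pred = default_model.predict(X_test)\n",
"final_pred = final_model.predict(X_test)\n",
"\n",
"print('Default model MAE: %0.4f' % mean_absolute_error(y_test, default_pred))\n",
"print('Final model MAE:   %0.4f' % mean_absolute_error(y_test, final_pred))"
]
},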
{
@@ -2188,10 +2256,10 @@
}
],
"source": [
"# Default model\n",
"default_model = GradientBoostingRegressor(random_state = 42)\n",
"\n",
"# Select the best model\n",
"final_model = grid_search.best_estimator_\n",
"\n",
"final_model"
@@ -2260,7 +2328,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing the results on the test set, the training times are similar and the tuned model improves on the default by roughly 10%, which shows that our optimization was effective."
]
},
{
@@ -2291,6 +2359,15 @@
"plt.title('Test Values and Predictions');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The density of the predicted values closely tracks the density of the actual test values.\n",
"\n",
"Another diagnostic plot is a histogram of the residuals, shown below. Ideally, we would like the residuals to be normally distributed, meaning the model's errors are equally likely in both directions (too high and too low)."
]
},
{
"cell_type": "code",
"execution_count": 34,
@@ -2310,16 +2387,41 @@
"source": [
"figsize(6, 6)\n",
"\n",
"# Calculate the residuals \n",
"residuals = final_pred - y_test\n",
"\n",
"# Plot the residuals in a histogram\n",
"plt.hist(residuals, color = 'red', bins = 20,\n",
"         edgecolor = 'black')\n",
"plt.xlabel('Error'); plt.ylabel('Count')\n",
"plt.title('Distribution of Residuals');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The residuals are close to normally distributed, with a few noticeable outliers on the low end. These indicate errors where the model's prediction was far below the true value."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n",
"In this first part, we carried out several key steps of the machine learning pipeline:\n",
"\n",
"* Built baseline models and compared the performance of several candidates\n",
"* Tuned the model's hyperparameters to optimize it for the problem\n",
"* Evaluated the best model on the test set\n",
"\n",
"The results show that machine learning is applicable to our problem, and the final model can predict a building's ENERGY STAR score to within 9.1 points. We also saw that hyperparameter tuning improved the model's performance, although at a considerable cost in time invested. This is a good reminder that proper feature engineering and gathering more data (if possible) have a much bigger payoff than fine-tuning the model. We also observed the trade-off between runtime and accuracy, one of the many considerations we have to weigh when designing machine learning models.\n",
"\n",
"We know our model is accurate, but do we know why it makes the predictions it does? The next step in the machine learning process is crucial: trying to understand how the model makes its predictions. Achieving high accuracy is great, but if we can also figure out why the model predicts accurately, we can use that information to better understand the problem. For example, which features does the model rely on to infer the ENERGY STAR score? Can we use this model for feature selection and implement a simpler, more interpretable model?\n",
"\n",
"In the final notebook, we will try to answer these questions and draw final conclusions from the project."
]
},
{
"cell_type": "code",
"execution_count": null,