From 6fbfaf3a64023b64023122e3c3f760d30cf794fb Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Mon, 28 Dec 2020 16:55:32 +0800
Subject: [PATCH] Add introductory comments
---
 ...筑能源利用率预测-checkpoint.ipynb | 144 +++++++++++++++---
 ...2_建模_建筑能源利用率预测.ipynb | 144 +++++++++++++++---
2 files changed, 246 insertions(+), 42 deletions(-)
diff --git a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb
index 5e87960..22062fc 100644
--- a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb
@@ -1238,15 +1238,34 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### 调参"
+ "## Model Optimization: Hyperparameter Tuning\n",
+ "\n",
+ "**In machine learning, optimizing a model means finding the best set of hyperparameters for a particular problem.**\n",
+ "\n",
+ "* Model hyperparameters are the settings of a machine learning algorithm that the data scientist sets before training, such as the number of trees in a random forest or the number of neighbors used in K-nearest-neighbors regression.\n",
+ "* Model parameters are what the model learns during training, such as the weights in a linear regression.\n",
+ "\n",
+ "**[Tuning the model hyperparameters](http://scikit-learn.org/stable/modules/grid_search.html) controls the balance between underfitting and overfitting in a model.**\n",
+ "\n",
+ "* We can try to correct for underfitting by making a more complex model, such as using more trees in a random forest or more layers in a neural network. An underfit model has high bias, which occurs when the model does not have enough capacity (degrees of freedom) to learn the relationship between the features and the target.\n",
+ "* We can try to correct for overfitting by limiting the model's complexity and applying regularization. This might mean decreasing the degree of a polynomial regression or using fewer layers in a neural network. An overfit model has high variance and has in effect memorized the training set. Both underfitting and overfitting lead to poor generalization on the test set.\n",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Cross Validation\n",
- ""
+ "### Hyperparameter Tuning with Random Search and Cross Validation\n",
+ "\n",
+ "* Random search refers to the method by which we choose the hyperparameter combinations to evaluate: we define a set of options and then randomly sample combinations to try. This contrasts with grid search, which evaluates every single combination we specify. Generally, random search is better when our knowledge of the best model hyperparameters is limited; we can use random search to narrow down the options and then use grid search over the more limited range.\n",
+ "\n",
+ "* Cross validation is the method used to assess the performance of a set of hyperparameters. Rather than splitting the training data into separate training and validation sets, which reduces the amount of data we can use for training, we use K-fold cross validation. This means dividing the training data into K folds and running an iterative process in which we first train on K-1 of the folds and then evaluate performance on the Kth fold. We repeat this process K times, so eventually we will have tested on every example in the training data, with the key point being that each iteration is evaluated on data the model was not trained on. At the end of K-fold cross validation, we take the average error across the K iterations as the final performance measure and then train the model on all of the training data at once. The recorded performance is then used to compare different combinations of hyperparameters.\n",
+ "\n",
+ "A picture of k-fold cross validation with k = 5 is shown below:\n",
+ "\n",
+ "(figure: 5-fold cross validation diagram)\n",
+ "\n",
+ "Here we will implement random search with cross validation to select the optimal hyperparameters for the GBDT regressor. We first define a grid, then run an iterative process: randomly sample a set of hyperparameters from the grid, evaluate them using 4-fold cross validation, and then select the best-performing set.\n",
]
},
{
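The K-fold procedure described in this hunk can be sketched with scikit-learn's `cross_val_score` — a minimal illustration on synthetic stand-in data, not the notebook's actual building dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's building-energy features (hypothetical data)
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = X @ np.array([1.5, -2.0, 0.5, 3.0, 0.0]) + rng.normal(0, 0.1, 200)

model = GradientBoostingRegressor(random_state=42)

# K = 5: train on 4 folds, score on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')

# The final performance measure is the average error over the K iterations
mean_mae = -scores.mean()
print(round(mean_mae, 3))
```

The notebook itself uses 4 folds inside the random search; `cv=5` here only mirrors the k = 5 figure referenced above.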
@@ -1283,6 +1302,27 @@
" 'max_features': max_features} "
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We selected 6 different hyperparameters to tune for the GBDT regressor. These will all affect the model in ways that are difficult to determine ahead of time, and the only way to find the best combination for a specific problem is to test them! To read about the hyperparameters, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor). For now, just know that we are searching for the best combination of hyperparameters; because no theory tells us which will work best, we simply have to evaluate them, like running an experiment!\n",
+ "\n",
+ "We create the random search object and pass in the following arguments:\n",
+ "\n",
+ "* `estimator`: the model to use\n",
+ "* `param_distributions`: the distribution of parameters we defined\n",
+ "* `cv`: the number of folds to use for k-fold cross validation\n",
+ "* `n_iter`: the number of different combinations to try\n",
+ "* `scoring`: the metric to use when evaluating candidates\n",
+ "* `n_jobs`: the number of cores to run in parallel (-1 will use all available)\n",
+ "* `verbose`: how much information to display (1 displays a limited amount)\n",
+ "* `return_train_score`: return the training score for each cross-validation fold\n",
+ "* `random_state`: a fixed random seed so the data splits are the same on every run\n",
+ "\n",
+ "The random search object is fit in the same way as any other scikit-learn model. After fitting, we can compare all the different hyperparameter combinations and find the best-performing one.\n",
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
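The random-search setup described in the cell above might look like the following sketch; the data and `param_distributions` here are toy stand-ins for the notebook's features and `hyperparameter_grid`, not the real ones:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the notebook's training set
rng = np.random.RandomState(42)
X = rng.rand(120, 4)
y = X[:, 0] * 2.0 + rng.normal(0, 0.1, 120)

# Toy stand-in for the notebook's hyperparameter grid
param_distributions = {'n_estimators': [50, 100, 200],
                       'max_depth': [2, 3, 5],
                       'min_samples_leaf': [2, 4, 6]}

random_cv = RandomizedSearchCV(estimator=GradientBoostingRegressor(random_state=42),
                               param_distributions=param_distributions,
                               cv=4,                     # 4-fold cross validation
                               n_iter=5,                 # combinations to try
                               scoring='neg_mean_absolute_error',
                               n_jobs=1,                 # -1 would use all cores, as in the notebook
                               verbose=0,
                               return_train_score=True,
                               random_state=42)

# Fit exactly like any other scikit-learn model
random_cv.fit(X, y)
print(random_cv.best_params_)
```

After fitting, `random_cv.cv_results_` holds one row per sampled combination, which is what the notebook sorts by `mean_test_score` below.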
@@ -1790,7 +1830,7 @@
}
],
"source": [
- "# 获取所有cv结果并按测试性能排序\n",
+ "# Get all of the cv results and sort by the test performance\n",
"random_results = pd.DataFrame(random_cv.cv_results_).sort_values('mean_test_score', ascending = False)\n",
"\n",
"random_results.head(10)"
@@ -1824,6 +1864,25 @@
"random_cv.best_estimator_"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The best GBDT model has the following hyperparameters:\n",
+ "* `loss = lad`\n",
+ "* `n_estimators = 500`\n",
+ "* `max_depth = 5`\n",
+ "* `min_samples_leaf = 6`\n",
+ "* `min_samples_split = 6`\n",
+ "* `max_features = None` \n",
+ "\n",
+ "Using random search is a good way to narrow down the range of possible hyperparameters to try. Initially, we had no idea which combination would work best, but this narrows the range of options.\n",
+ "\n",
+ "We could use the random search results to inform a grid search, building a grid around the best values it found, in order to locate hyperparameters that may perform even better.\n",
+ "\n",
+ "Here, however, we will use a grid search that only tests n_estimators (the number of trees), and then plot the training and testing performance to understand what increasing the number of trees does for our model.\n",
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
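Seeding a grid search with values near the random search's best result, as suggested above, can be done programmatically. A sketch, where `best_n` is a hypothetical value standing in for the random-search output:

```python
# Hypothetical best value returned by the random search
best_n = 500

# Probe a band of values around it, mirroring the notebook's trees_grid
trees_grid = {'n_estimators': list(range(best_n - 400, best_n + 301, 50))}

print(trees_grid['n_estimators'][0], trees_grid['n_estimators'][-1])
# → 100 800
```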
@@ -1839,7 +1898,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# 创建一系列要评估的树\n",
+ "# Create a range of trees to evaluate\n",
"trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}\n",
"\n",
"model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,\n",
@@ -1848,14 +1907,10 @@
" max_features = None,\n",
" random_state = 42)\n",
"\n",
- "# 使用树的范围和随机森林模型的网格搜索对象\n",
- "grid_search = GridSearchCV(estimator = model, \n",
- " param_grid=trees_grid, \n",
- " cv = 4, \n",
- " scoring = 'neg_mean_absolute_error', \n",
- " verbose = 1,\n",
- " n_jobs = -1, \n",
- " return_train_score = True)\n"
+ "# Grid Search Object using the trees range and the random forest model\n",
+ "grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4, \n",
+ " scoring = 'neg_mean_absolute_error', verbose = 1,\n",
+ " n_jobs = -1, return_train_score = True)"
]
},
{
@@ -1936,10 +1991,10 @@
}
],
"source": [
- "# 将结果导入数据框\n",
+ "# Get the results into a dataframe\n",
"results = pd.DataFrame(grid_search.cv_results_)\n",
"\n",
- "# 绘制训练误差和测试误差与树木数量的关系图\n",
+ "# Plot the training and testing error vs number of trees\n",
"figsize(8, 8)\n",
"plt.style.use('fivethirtyeight')\n",
"plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')\n",
@@ -2160,7 +2215,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### 测试模型"
+ "From the plot above we can see that our model is overfitting! The training error is significantly lower than the testing error, which means the model has learned the training data very well but cannot generalize to the test data. Moreover, as the number of trees increases, the amount of overfitting grows. Both the test and training error decrease as the number of trees increases, but the training error decreases more rapidly.\n",
+ "\n",
+ "There will always be a gap between the training and testing error (the training error is always lower), but if the gap is significant, we want to try to reduce overfitting, either by getting more training data or by decreasing the model's complexity through hyperparameter tuning or regularization. For the GBDT regressor, some options include reducing the number of trees, reducing the maximum depth of each tree, and increasing the minimum number of samples in a leaf node. For a deeper look at GBDT, see [this article](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/). For now, we will use the model with the best performance and accept that it may be overfitting.\n",
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Evaluate the Model on the Test Set\n",
+ "\n",
+ "We use the best model from hyperparameter tuning to make predictions on the test set. Our model has never seen the test set before, so its performance here should be a good indicator of how the model would perform if deployed in production.\n",
+ "\n",
+ "For comparison, we evaluate the performance of both the default model and the tuned model.\n",
]
},
{
@@ -2188,10 +2256,10 @@
}
],
"source": [
- "# 默认模型\n",
+ "# Default model\n",
"default_model = GradientBoostingRegressor(random_state = 42)\n",
"\n",
- "# 选择最佳模型\n",
+ "# Select the best model\n",
"final_model = grid_search.best_estimator_\n",
"\n",
"final_model"
@@ -2260,7 +2328,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "对比测试结果,训练时间近似,模型得到差不多10%的提升。"
+ "Comparing the test results: the training time is similar, but the tuned model achieves roughly a 10% improvement, which shows that our optimization paid off."
]
},
{
@@ -2291,6 +2359,15 @@
"plt.title('Test Values and Predictions');"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The distribution of the predicted values tracks the distribution of the true values fairly closely.\n",
+ "\n",
+ "The diagnostic plot below is a histogram of the residuals. Ideally, we would hope the residuals are normally distributed, meaning the model's errors are symmetric in both directions (too high and too low).\n",
+ ]
+ },
{
"cell_type": "code",
"execution_count": 34,
@@ -2310,16 +2387,41 @@
"source": [
"figsize = (6, 6)\n",
"\n",
- "# 计算残差\n",
+ "# Calculate the residuals \n",
"residuals = final_pred - y_test\n",
"\n",
- "# 绘制残差分布直方图\n",
+ "# Plot the residuals in a histogram\n",
"plt.hist(residuals, color = 'red', bins = 20,\n",
" edgecolor = 'black')\n",
"plt.xlabel('Error'); plt.ylabel('Count')\n",
"plt.title('Distribution of Residuals');"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The residuals are close to normally distributed, with a few noticeable outliers on the low end. These indicate predictions that fell far below the true value."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusions\n",
+ "In this part of the project, we carried out several key steps of the machine learning workflow:\n",
+ "\n",
+ "* Built baseline models and compared the performance of several candidate models\n",
+ "* Tuned the model hyperparameters to optimize for our problem\n",
+ "* Evaluated the best model on the test set\n",
+ "\n",
+ "The results show that machine learning is applicable to our problem: the final model can predict a building's Energy Star Score to within about 9.1 points. We also saw that hyperparameter tuning can improve model performance, although at a considerable cost in time invested. This is a good reminder that proper feature engineering and gathering more data (if possible) have a far larger payoff than fine-tuning the model. We also observed the trade-off between runtime and accuracy, one of the many factors we have to consider when designing machine learning models.\n",
+ "\n",
+ "We know our model is accurate, but do we know why it makes the predictions it does? The next step in the machine learning process is crucial: trying to understand how the model makes its predictions. Achieving high accuracy is great, but if we can figure out why the model predicts accurately, we can use that information to better understand the problem. For example, which features does the model rely on to infer the Energy Star Score? Could we use this model for feature selection, and arrive at a simpler model that is easier to interpret?\n",
+ "\n",
+ "In the final notebook, we will try to answer these questions and draw final conclusions from the project."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
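The low-end outliers noted in this hunk can be quantified with the residuals' skewness; a negative value confirms the longer tail on the low side. A sketch using fabricated residuals that mimic the histogram's shape, for illustration only:

```python
import numpy as np

rng = np.random.RandomState(42)
# Stand-in residuals: mostly symmetric noise plus a few large negative outliers
residuals = np.concatenate([rng.normal(0, 5, 200), [-60.0, -50.0, -45.0]])

# Sample skewness: the third standardized moment
skew = ((residuals - residuals.mean()) ** 3).mean() / residuals.std() ** 3
print(round(skew, 2))
```

A clearly negative skew, as here, matches the diagnosis that the model's worst errors are predictions far below the true values.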
diff --git a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/2_建模_建筑能源利用率预测.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/2_建模_建筑能源利用率预测.ipynb
index 5e87960..22062fc 100644
--- a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/2_建模_建筑能源利用率预测.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/2_建模_建筑能源利用率预测.ipynb
@@ -1238,15 +1238,34 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### 调参"
+ "## Model Optimization: Hyperparameter Tuning\n",
+ "\n",
+ "**In machine learning, optimizing a model means finding the best set of hyperparameters for a particular problem.**\n",
+ "\n",
+ "* Model hyperparameters are the settings of a machine learning algorithm that the data scientist sets before training, such as the number of trees in a random forest or the number of neighbors used in K-nearest-neighbors regression.\n",
+ "* Model parameters are what the model learns during training, such as the weights in a linear regression.\n",
+ "\n",
+ "**[Tuning the model hyperparameters](http://scikit-learn.org/stable/modules/grid_search.html) controls the balance between underfitting and overfitting in a model.**\n",
+ "\n",
+ "* We can try to correct for underfitting by making a more complex model, such as using more trees in a random forest or more layers in a neural network. An underfit model has high bias, which occurs when the model does not have enough capacity (degrees of freedom) to learn the relationship between the features and the target.\n",
+ "* We can try to correct for overfitting by limiting the model's complexity and applying regularization. This might mean decreasing the degree of a polynomial regression or using fewer layers in a neural network. An overfit model has high variance and has in effect memorized the training set. Both underfitting and overfitting lead to poor generalization on the test set.\n",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Cross Validation\n",
- ""
+ "### Hyperparameter Tuning with Random Search and Cross Validation\n",
+ "\n",
+ "* Random search refers to the method by which we choose the hyperparameter combinations to evaluate: we define a set of options and then randomly sample combinations to try. This contrasts with grid search, which evaluates every single combination we specify. Generally, random search is better when our knowledge of the best model hyperparameters is limited; we can use random search to narrow down the options and then use grid search over the more limited range.\n",
+ "\n",
+ "* Cross validation is the method used to assess the performance of a set of hyperparameters. Rather than splitting the training data into separate training and validation sets, which reduces the amount of data we can use for training, we use K-fold cross validation. This means dividing the training data into K folds and running an iterative process in which we first train on K-1 of the folds and then evaluate performance on the Kth fold. We repeat this process K times, so eventually we will have tested on every example in the training data, with the key point being that each iteration is evaluated on data the model was not trained on. At the end of K-fold cross validation, we take the average error across the K iterations as the final performance measure and then train the model on all of the training data at once. The recorded performance is then used to compare different combinations of hyperparameters.\n",
+ "\n",
+ "A picture of k-fold cross validation with k = 5 is shown below:\n",
+ "\n",
+ "(figure: 5-fold cross validation diagram)\n",
+ "\n",
+ "Here we will implement random search with cross validation to select the optimal hyperparameters for the GBDT regressor. We first define a grid, then run an iterative process: randomly sample a set of hyperparameters from the grid, evaluate them using 4-fold cross validation, and then select the best-performing set.\n",
]
},
{
@@ -1283,6 +1302,27 @@
" 'max_features': max_features} "
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We selected 6 different hyperparameters to tune for the GBDT regressor. These will all affect the model in ways that are difficult to determine ahead of time, and the only way to find the best combination for a specific problem is to test them! To read about the hyperparameters, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor). For now, just know that we are searching for the best combination of hyperparameters; because no theory tells us which will work best, we simply have to evaluate them, like running an experiment!\n",
+ "\n",
+ "We create the random search object and pass in the following arguments:\n",
+ "\n",
+ "* `estimator`: the model to use\n",
+ "* `param_distributions`: the distribution of parameters we defined\n",
+ "* `cv`: the number of folds to use for k-fold cross validation\n",
+ "* `n_iter`: the number of different combinations to try\n",
+ "* `scoring`: the metric to use when evaluating candidates\n",
+ "* `n_jobs`: the number of cores to run in parallel (-1 will use all available)\n",
+ "* `verbose`: how much information to display (1 displays a limited amount)\n",
+ "* `return_train_score`: return the training score for each cross-validation fold\n",
+ "* `random_state`: a fixed random seed so the data splits are the same on every run\n",
+ "\n",
+ "The random search object is fit in the same way as any other scikit-learn model. After fitting, we can compare all the different hyperparameter combinations and find the best-performing one.\n",
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -1790,7 +1830,7 @@
}
],
"source": [
- "# 获取所有cv结果并按测试性能排序\n",
+ "# Get all of the cv results and sort by the test performance\n",
"random_results = pd.DataFrame(random_cv.cv_results_).sort_values('mean_test_score', ascending = False)\n",
"\n",
"random_results.head(10)"
@@ -1824,6 +1864,25 @@
"random_cv.best_estimator_"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The best GBDT model has the following hyperparameters:\n",
+ "* `loss = lad`\n",
+ "* `n_estimators = 500`\n",
+ "* `max_depth = 5`\n",
+ "* `min_samples_leaf = 6`\n",
+ "* `min_samples_split = 6`\n",
+ "* `max_features = None` \n",
+ "\n",
+ "Using random search is a good way to narrow down the range of possible hyperparameters to try. Initially, we had no idea which combination would work best, but this narrows the range of options.\n",
+ "\n",
+ "We could use the random search results to inform a grid search, building a grid around the best values it found, in order to locate hyperparameters that may perform even better.\n",
+ "\n",
+ "Here, however, we will use a grid search that only tests n_estimators (the number of trees), and then plot the training and testing performance to understand what increasing the number of trees does for our model.\n",
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -1839,7 +1898,7 @@
"metadata": {},
"outputs": [],
"source": [
- "# 创建一系列要评估的树\n",
+ "# Create a range of trees to evaluate\n",
"trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}\n",
"\n",
"model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,\n",
@@ -1848,14 +1907,10 @@
" max_features = None,\n",
" random_state = 42)\n",
"\n",
- "# 使用树的范围和随机森林模型的网格搜索对象\n",
- "grid_search = GridSearchCV(estimator = model, \n",
- " param_grid=trees_grid, \n",
- " cv = 4, \n",
- " scoring = 'neg_mean_absolute_error', \n",
- " verbose = 1,\n",
- " n_jobs = -1, \n",
- " return_train_score = True)\n"
+ "# Grid Search Object using the trees range and the random forest model\n",
+ "grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4, \n",
+ " scoring = 'neg_mean_absolute_error', verbose = 1,\n",
+ " n_jobs = -1, return_train_score = True)"
]
},
{
@@ -1936,10 +1991,10 @@
}
],
"source": [
- "# 将结果导入数据框\n",
+ "# Get the results into a dataframe\n",
"results = pd.DataFrame(grid_search.cv_results_)\n",
"\n",
- "# 绘制训练误差和测试误差与树木数量的关系图\n",
+ "# Plot the training and testing error vs number of trees\n",
"figsize(8, 8)\n",
"plt.style.use('fivethirtyeight')\n",
"plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')\n",
@@ -2160,7 +2215,20 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### 测试模型"
+ "From the plot above we can see that our model is overfitting! The training error is significantly lower than the testing error, which means the model has learned the training data very well but cannot generalize to the test data. Moreover, as the number of trees increases, the amount of overfitting grows. Both the test and training error decrease as the number of trees increases, but the training error decreases more rapidly.\n",
+ "\n",
+ "There will always be a gap between the training and testing error (the training error is always lower), but if the gap is significant, we want to try to reduce overfitting, either by getting more training data or by decreasing the model's complexity through hyperparameter tuning or regularization. For the GBDT regressor, some options include reducing the number of trees, reducing the maximum depth of each tree, and increasing the minimum number of samples in a leaf node. For a deeper look at GBDT, see [this article](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/). For now, we will use the model with the best performance and accept that it may be overfitting.\n",
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Evaluate the Model on the Test Set\n",
+ "\n",
+ "We use the best model from hyperparameter tuning to make predictions on the test set. Our model has never seen the test set before, so its performance here should be a good indicator of how the model would perform if deployed in production.\n",
+ "\n",
+ "For comparison, we evaluate the performance of both the default model and the tuned model.\n",
]
},
{
@@ -2188,10 +2256,10 @@
}
],
"source": [
- "# 默认模型\n",
+ "# Default model\n",
"default_model = GradientBoostingRegressor(random_state = 42)\n",
"\n",
- "# 选择最佳模型\n",
+ "# Select the best model\n",
"final_model = grid_search.best_estimator_\n",
"\n",
"final_model"
@@ -2260,7 +2328,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "对比测试结果,训练时间近似,模型得到差不多10%的提升。"
+ "Comparing the test results: the training time is similar, but the tuned model achieves roughly a 10% improvement, which shows that our optimization paid off."
]
},
{
@@ -2291,6 +2359,15 @@
"plt.title('Test Values and Predictions');"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The distribution of the predicted values tracks the distribution of the true values fairly closely.\n",
+ "\n",
+ "The diagnostic plot below is a histogram of the residuals. Ideally, we would hope the residuals are normally distributed, meaning the model's errors are symmetric in both directions (too high and too low).\n",
+ ]
+ },
{
"cell_type": "code",
"execution_count": 34,
@@ -2310,16 +2387,41 @@
"source": [
"figsize = (6, 6)\n",
"\n",
- "# 计算残差\n",
+ "# Calculate the residuals \n",
"residuals = final_pred - y_test\n",
"\n",
- "# 绘制残差分布直方图\n",
+ "# Plot the residuals in a histogram\n",
"plt.hist(residuals, color = 'red', bins = 20,\n",
" edgecolor = 'black')\n",
"plt.xlabel('Error'); plt.ylabel('Count')\n",
"plt.title('Distribution of Residuals');"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The residuals are close to normally distributed, with a few noticeable outliers on the low end. These indicate predictions that fell far below the true value."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusions\n",
+ "In this part of the project, we carried out several key steps of the machine learning workflow:\n",
+ "\n",
+ "* Built baseline models and compared the performance of several candidate models\n",
+ "* Tuned the model hyperparameters to optimize for our problem\n",
+ "* Evaluated the best model on the test set\n",
+ "\n",
+ "The results show that machine learning is applicable to our problem: the final model can predict a building's Energy Star Score to within about 9.1 points. We also saw that hyperparameter tuning can improve model performance, although at a considerable cost in time invested. This is a good reminder that proper feature engineering and gathering more data (if possible) have a far larger payoff than fine-tuning the model. We also observed the trade-off between runtime and accuracy, one of the many factors we have to consider when designing machine learning models.\n",
+ "\n",
+ "We know our model is accurate, but do we know why it makes the predictions it does? The next step in the machine learning process is crucial: trying to understand how the model makes its predictions. Achieving high accuracy is great, but if we can figure out why the model predicts accurately, we can use that information to better understand the problem. For example, which features does the model rely on to infer the Energy Star Score? Could we use this model for feature selection, and arrive at a simpler model that is easier to interpret?\n",
+ "\n",
+ "In the final notebook, we will try to answer these questions and draw final conclusions from the project."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,