From 6fbfaf3a64023b64023122e3c3f760d30cf794fb Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Mon, 28 Dec 2020 16:55:32 +0800
Subject: [PATCH] Add comment of Introduction

---
 ...筑能源利用率预测-checkpoint.ipynb | 144 +++++++++++++++---
 ...2_建模_建筑能源利用率预测.ipynb   | 144 +++++++++++++++---
 2 files changed, 246 insertions(+), 42 deletions(-)

diff --git a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb
index 5e87960..22062fc 100644
--- a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/2_建模_建筑能源利用率预测-checkpoint.ipynb
@@ -1238,15 +1238,34 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 调参"
+    "## Model optimization: hyperparameter tuning\n",
+    "\n",
+    "**In machine learning, optimizing a model means finding the best set of hyperparameters for a particular problem.**\n",
+    "\n",
+    "* Model hyperparameters are best thought of as settings of the learning algorithm that the data scientist tunes before training, such as the number of trees in a random forest or the number of neighbors used in K-nearest-neighbors regression.\n",
+    "* Model parameters are what the model learns during training, such as the weights in a linear regression.\n",
+    "\n",
+    "**[Tuning the model hyperparameters](http://scikit-learn.org/stable/modules/grid_search.html) controls the balance between underfitting and overfitting.**\n",
+    "\n",
+    "* We can try to correct underfitting by building a more complex model, for example using more trees in a random forest or more layers in a neural network. A model that underfits has high bias, which occurs when the model does not have enough capacity (degrees of freedom) to learn the relationship between the features and the target.\n",
+    "* We can try to correct overfitting by limiting the model's complexity and applying regularization. This might mean lowering the degree of a polynomial regression or removing layers from a neural network. A model that overfits has high variance and has effectively memorized the training set. Both underfitting and overfitting lead to poor generalization performance on the test set."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Cross Validation\n",
-    ""
+    "### Hyperparameter tuning with random search and cross validation\n",
+    "\n",
+    "* Random search refers to how we choose the hyperparameter combinations to evaluate: we define a range of options and then randomly sample combinations to try. This contrasts with grid search, which evaluates every combination we specify. In general, random search is better when our knowledge of the best hyperparameters is limited; we can use it to narrow down the options and then use grid search over a more focused range.\n",
+    "\n",
+    "* Cross validation is how we evaluate a chosen combination of hyperparameters. Rather than splitting the training set into separate training and validation sets, which reduces the amount of training data we can use, we use K-fold cross validation. This means dividing the training data into K folds and then running an iterative process in which we first train on K-1 of the folds and then evaluate performance on the K-th fold. We repeat this process K times, so eventually we will have tested on every example in the training data, with the key point that each evaluation is done on data the model was not trained on. At the end of K-fold cross validation we take the average error across the K iterations as the final performance measure and then train the model on all of the training data at once. The recorded performance is what we use to compare different combinations of hyperparameters.\n",
+    "\n",
+    "A picture of K-fold cross validation with k = 5 is shown below:\n",
+    "\n",
+    "\n",
+    "We will implement random search with cross validation to select the optimal hyperparameters for the GBDT regressor. We first define a grid of options and then run an iterative process: randomly sample a set of hyperparameters from the grid, evaluate them with 4-fold cross validation, and then select the best-performing set."
    ]
   },
   {
@@ -1283,6 +1302,27 @@
     "                       'max_features': max_features} "
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We selected 6 different hyperparameters to tune for the GBDT regressor. They all affect the model in ways that are hard to determine in advance, and the only way to find the best combination for a particular problem is to test them! To learn about the hyperparameters, a good place to start is the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor). For now, just know that we are looking for the best combination of hyperparameters; because no theory tells us which will work best, we simply have to evaluate them, much like running an experiment!\n",
+    "\n",
+    "We create the random search object and pass in the following arguments:\n",
+    "\n",
+    "* `estimator`: the model to tune\n",
+    "* `param_distributions`: the distribution of parameters we defined\n",
+    "* `cv`: the number of folds to use for k-fold cross validation\n",
+    "* `n_iter`: the number of different combinations to try\n",
+    "* `scoring`: the metric used to evaluate candidates\n",
+    "* `n_jobs`: the number of cores to run in parallel (-1 uses all available)\n",
+    "* `verbose`: how much information to display (1 shows a limited amount)\n",
+    "* `return_train_score`: return the training score for every cross-validation fold\n",
+    "* `random_state`: a fixed random seed so the data splits are the same on every run\n",
+    "\n",
+    "The random search object is trained in the same way as any other scikit-learn model. After training, we can compare all of the different hyperparameter combinations and find the one that performs best."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
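The new cells above spell out what gets passed to the random search object. As a minimal, self-contained sketch of how those pieces fit together (the training arrays and grid values below are illustrative stand-ins rather than the notebook's actual `hyperparameter_grid` and data, and the loss-function option is left out so the snippet runs on current scikit-learn releases):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Stand-in training data; the notebook uses its processed building-energy features.
X_train, y_train = make_regression(n_samples=500, n_features=20, noise=10, random_state=42)

# Illustrative grid in the spirit of the one defined in the cell above.
hyperparameter_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [2, 3, 5, 10],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'min_samples_split': [2, 4, 6, 10],
    'max_features': [None, 'sqrt', 'log2'],
}

# Sample 10 random combinations and score each with 4-fold cross validation,
# exactly as the argument list above describes.
random_cv = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=hyperparameter_grid,
    cv=4,
    n_iter=10,
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
    verbose=1,
    return_train_score=True,
    random_state=42,
)

random_cv.fit(X_train, y_train)
print(random_cv.best_params_)    # best sampled combination
print(-random_cv.best_score_)    # its mean cross-validated MAE (sign flipped back)
```

The fitted object exposes `cv_results_` and `best_estimator_`, which is what the hunks below read from.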
"我们选择了6个ä¸åŒçš„è¶…å‚æ•°æ¥è°ƒèŠ‚GBDT回归。这些都会以ä¸åŒçš„æ–¹å¼å½±å“模型,而这些方å¼å¾ˆéš¾æå‰ç¡®å®šï¼Œè¦æ‰¾åˆ°é’ˆå¯¹ç‰¹å®šé—®é¢˜çš„æœ€ä½³ç»„åˆï¼Œå”¯ä¸€çš„æ–¹æ³•就是测试它们ï¼è¦äº†è§£è¶…傿•°ï¼Œå»ºè®®æŸ¥çœ‹[Scikit Learn文档](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)。现在,åªè¦çŸ¥é“æˆ‘ä»¬æ­£åœ¨åŠªåŠ›å¯»æ‰¾è¶…å‚æ•°çš„æœ€ä½³ç»„åˆï¼Œå› ä¸ºæ²¡æœ‰ç†è®ºå‘Šè¯‰æˆ‘们哪一个最有效,我们åªéœ€è¦è¯„估它们,就åƒè¿è¡Œä¸€ä¸ªå®žéªŒä¸€æ ·ï¼\n", + "\n", + "æˆ‘ä»¬åˆ›å»ºéšæœºæœç´¢å¯¹è±¡ï¼Œå¹¶ä¼ å…¥ä»¥ä¸‹å‚数:\n", + "\n", + "* `estimator`:所选的模型\n", + "* `param_distributions`:æˆ‘ä»¬å®šä¹‰çš„å‚æ•°åˆ†å¸ƒ\n", + "* `cv`: 用于k-fold交å‰éªŒè¯çš„æŠ˜å æ•°\n", + "* `n_iter`:è¦å°è¯•çš„ä¸åŒç»„åˆçš„æ•°é‡\n", + "* `scoring`:评估使用的指标\n", + "* `n_jobs`:并行è¿è¡Œçš„内核数(-1将使用所有å¯ç”¨çš„)\n", + "* `verbose`ï¼šæ˜¾ç¤ºå‚æ•°ä¿¡æ¯ï¼ˆ1æ˜¾ç¤ºæœ‰é™æ•°é‡ï¼‰\n", + "* `return_train_score`:返回æ¯ä¸ªäº¤å‰éªŒè¯æŠ˜å çš„训练分数\n", + "* `random_state`ï¼šä½¿ç”¨çš„å›ºå®šçš„éšæœºæ•°ç§å­æ•°ï¼Œä»¥ä¾¿æ¯æ¬¡æ‹†åˆ†çš„æ•°æ®ç›¸åŒ\n", + "\n", + "éšæœºæœç´¢å¯¹è±¡çš„训练方法与任何其他scikit学习模型的方法相åŒã€‚训练åŽï¼Œæˆ‘们å¯ä»¥æ¯”较所有ä¸åŒçš„è¶…å‚æ•°ç»„åˆï¼Œæ‰¾å‡ºæ€§èƒ½æœ€å¥½çš„一个。" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1790,7 +1830,7 @@ } ], "source": [ - "# èŽ·å–æ‰€æœ‰cv结果并按测试性能排åº\n", + "# Get all of the cv results and sort by the test performance\n", "random_results = pd.DataFrame(random_cv.cv_results_).sort_values('mean_test_score', ascending = False)\n", "\n", "random_results.head(10)" @@ -1824,6 +1864,25 @@ "random_cv.best_estimator_" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "最佳GBDTæ¨¡åž‹å…·æœ‰ä»¥ä¸‹è¶…å‚æ•°ï¼š\n", + "* `loss = lad`\n", + "* `n_estimators = 500`\n", + "* `max_depth = 5`\n", + "* `min_samples_leaf = 6`\n", + "* `min_samples_split = 6`\n", + "* `max_features = None` \n", + "\n", + "ä½¿ç”¨éšæœºæœç´¢æ˜¯ç¼©å°å¯èƒ½çš„è¶…å‚æ•°èŒƒå›´çš„好方法。最åˆï¼Œæˆ‘们ä¸çŸ¥é“哪ç§ç»„åˆæœ€æœ‰æ•ˆï¼ŒçŽ°åœ¨ç¼©å°äº†é€‰æ‹©èŒƒå›´ã€‚\n", + "\n", + "我们å¯ä»¥ä½¿ç”¨éšæœºæœç´¢çš„结果åŠå…¶å·¦å³å€¼åŠ å…¥åˆ°ç½‘æ ¼æœç´¢ï¼Œä»¥æ‰¾åˆ°è¶…傿•°ä¸­æ¯”éšæœºæœç´¢çš„æ•ˆæžœæœ€å¥½çš„那些。\n", + "\n", + "在这里,我们使用的网格æœç´¢åªæµ‹è¯•n_estimators(树的个数),然åŽç»˜åˆ¶è®­ç»ƒå’Œæµ‹è¯•性能图,以了解增加树的数é‡å¯¹æˆ‘们的模型有什么作用。" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1839,7 +1898,7 @@ "metadata": {}, "outputs": [], "source": [ - "# 创建一系列è¦è¯„估的树\n", + "# Create a range of trees to evaluate\n", "trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}\n", "\n", "model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,\n", @@ -1848,14 +1907,10 @@ " max_features = None,\n", " random_state = 42)\n", "\n", - "# ä½¿ç”¨æ ‘çš„èŒƒå›´å’Œéšæœºæ£®æž—模型的网格æœç´¢å¯¹è±¡\n", - "grid_search = GridSearchCV(estimator = model, \n", - " param_grid=trees_grid, \n", - " cv = 4, \n", - " scoring = 'neg_mean_absolute_error', \n", - " verbose = 1,\n", - " n_jobs = -1, \n", - " return_train_score = True)\n" + "# Grid Search Object using the trees range and the random forest model\n", + "grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4, \n", + " scoring = 'neg_mean_absolute_error', verbose = 1,\n", + " n_jobs = -1, return_train_score = True)" ] }, { @@ -1936,10 +1991,10 @@ } ], "source": [ - "# å°†ç»“æžœå¯¼å…¥æ•°æ®æ¡†\n", + "# Get the results into a dataframe\n", "results = pd.DataFrame(grid_search.cv_results_)\n", "\n", - "# 绘制训练误差和测试误差与树木数é‡çš„关系图\n", + "# Plot the training and testing error vs number of trees\n", "figsize(8, 8)\n", "plt.style.use('fivethirtyeight')\n", 
"plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')\n", @@ -2160,7 +2215,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 测试模型" + "从上é¢çœ‹å‡ºï¼Œæˆ‘们的模型是过度拟åˆï¼è®­ç»ƒè¯¯å·®æ˜¾è‘—低于测试误差,说明该模型对训练数æ®çš„学习效果很好,但ä¸èƒ½æŽ¨å¹¿åˆ°æµ‹è¯•æ•°æ®ä¸­ã€‚éšç€æ ‘木数é‡çš„增加,过度拟åˆçš„æ•°é‡ä¹Ÿä¼šå¢žåŠ ã€‚æµ‹è¯•è¯¯å·®å’Œè®­ç»ƒè¯¯å·®å‡éšæ ‘æ•°çš„å¢žåŠ è€Œå‡å°ï¼Œä½†è®­ç»ƒè¯¯å·®çš„å‡å°é€Ÿåº¦è¾ƒå¿«ã€‚\n", + "\n", + "åœ¨è®­ç»ƒè¯¯å·®å’Œæµ‹è¯•è¯¯å·®ä¹‹é—´æ€»æ˜¯æœ‰å·®å¼‚çš„ï¼ˆè®­ç»ƒè¯¯å·®æ€»æ˜¯è¾ƒä½Žçš„ï¼‰ï¼Œä½†æ˜¯å¦‚æžœæœ‰æ˜¾è‘—çš„å·®å¼‚ï¼Œæˆ‘ä»¬å¸Œæœ›é€šè¿‡èŽ·å¾—æ›´å¤šçš„è®­ç»ƒæ•°æ®æˆ–é€šè¿‡è¶…å‚æ•°è°ƒæ•´æˆ–正则化æ¥é™ä½Žæ¨¡åž‹çš„夿‚性æ¥å°è¯•å‡å°‘过拟åˆã€‚对于GBDT回归模型,一些选项包括å‡å°‘树的数é‡ã€å‡å°‘æ¯æ£µæ ‘的最大深度以åŠå¢žåŠ å¶èŠ‚ç‚¹ä¸­çš„æœ€å°æ ·æœ¬æ•°ã€‚如果想进一步研究GBDTå¯ä»¥äº†è§£[该文章](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)。目å‰ï¼Œæˆ‘们将使用性能最好的模型,并接å—它å¯èƒ½ä¼šè¿‡æ‹Ÿåˆã€‚" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 在测试集上评估模型\n", + "\n", + "æˆ‘ä»¬ä½¿ç”¨è¶…å‚æ•°è°ƒæ•´çš„æœ€ä½³æ¨¡åž‹åœ¨æµ‹è¯•集上进行预测。此å‰ï¼Œæˆ‘们的模型从未è§è¿‡æµ‹è¯•集,因此性能应该是一个很好的指标,表明如果在生产中部署模型,它将如何执行。\n", + "\n", + "ä¸ºäº†è¿›è¡Œæ¯”è¾ƒï¼Œæˆ‘ä»¬æ¯”è¾ƒé»˜è®¤æ¨¡åž‹ã€æœ€ä¼˜æ¨¡åž‹çš„æ€§èƒ½ã€‚" ] }, { @@ -2188,10 +2256,10 @@ } ], "source": [ - "# 默认模型\n", + "# Default model\n", "default_model = GradientBoostingRegressor(random_state = 42)\n", "\n", - "# 选择最佳模型\n", + "# Select the best model\n", "final_model = grid_search.best_estimator_\n", "\n", "final_model" @@ -2260,7 +2328,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "对比测试结果,训练时间近似,模型得到差ä¸å¤š10%çš„æå‡ã€‚" + "对比测试结果,训练时间近似,模型得到差ä¸å¤š10%çš„æå‡ã€‚è¯æ˜Žæˆ‘们的优化是有效的。" ] }, { @@ -2291,6 +2359,15 @@ "plt.title('Test Values and Predictions');" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "预测值和真实值近乎密度线较拟åˆã€‚\n", + "\n", + "下é¢çš„è¯Šæ–­å›¾æ˜¯æ®‹å·®ç›´æ–¹å›¾ã€‚ç†æƒ³æƒ…况下,我们希望残差是正æ€åˆ†å¸ƒçš„,这æ„å‘³ç€æ¨¡åž‹åœ¨ä¸¤ä¸ªæ–¹å‘(高和低)上都是错误的。" + ] + }, { "cell_type": "code", "execution_count": 34, @@ -2310,16 +2387,41 @@ "source": [ "figsize = (6, 6)\n", "\n", - "# 计算残差\n", + "# Calculate the residuals \n", "residuals = final_pred - y_test\n", "\n", - "# 绘制残差分布直方图\n", + "# Plot the residuals in a histogram\n", "plt.hist(residuals, color = 'red', bins = 20,\n", " edgecolor = 'black')\n", "plt.xlabel('Error'); plt.ylabel('Count')\n", "plt.title('Distribution of Residuals');" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "残差接近正常分布,低端有几个明显的离群点。这表明模型的预测值有远低于真实值的错误。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 总结\n", + "在这第一部分中,我们执行了机器学习æµç¨‹ä¸­çš„几个关键:\n", + "\n", + "* å»ºç«‹åŸºç¡€æ¨¡åž‹ï¼Œæ¯”è¾ƒå¤šç§æ¨¡åž‹æ€§èƒ½æŒ‡æ ‡\n", + "* æ¨¡åž‹è¶…å‚æ•°è°ƒå‚,针对问题进行优化\n", + "* 在测试集上评估最佳模型\n", + "\n", + "结果表明,机器学习å¯ä»¥åº”用于我们的问题,最终的模型能够预测建筑物的能æºä¹‹æ˜Ÿçš„得分在9.1åˆ†ä»¥å†…ã€‚æˆ‘ä»¬è¿˜çœ‹åˆ°ï¼Œè¶…å‚æ•°è°ƒæ•´èƒ½å¤Ÿæ”¹å–„模型的性能,尽管在时间投入方é¢è¦ä»˜å‡ºç›¸å½“大的代价。这是一个很好的æé†’,正确的特性工程和收集更多的数æ®ï¼ˆå¦‚æžœå¯èƒ½çš„è¯ï¼‰æ¯”微调模型有更大的回报。我们还观察到è¿è¡Œæ—¶ä¸Žç²¾åº¦çš„æƒè¡¡ï¼Œè¿™æ˜¯æˆ‘ä»¬åœ¨è®¾è®¡æœºå™¨å­¦ä¹ æ¨¡åž‹æ—¶å¿…é¡»è€ƒè™‘çš„è®¸å¤šå› ç´ ä¹‹ä¸€ã€‚\n", + "\n", + "æˆ‘ä»¬çŸ¥é“æˆ‘们的模型是准确的,但是我们知é“它为什么会åšå‡ºè¿™æ ·çš„预测å—?机器学习过程的下一步是至关é‡è¦çš„:试图ç†è§£æ¨¡åž‹æ˜¯å¦‚何åšå‡ºé¢„测的。实现高精度是很好的,但是如果我们能够弄清楚为什么模型能够准确地预测,那么我们就å¯ä»¥åˆ©ç”¨è¿™äº›ä¿¡æ¯æ¥æ›´å¥½åœ°ç†è§£é—®é¢˜ã€‚例如,该模型ä¾èµ–å“ªäº›ç‰¹å¾æ¥æŽ¨æ–­èƒ½æºä¹‹æ˜Ÿåˆ†æ•°ï¼Ÿæ˜¯å¦å¯ä»¥ä½¿ç”¨æ­¤æ¨¡åž‹è¿›è¡Œç‰¹å¾é€‰æ‹©ï¼Œå¹¶å®žçŽ°æ›´æ˜“äºŽè§£é‡Šçš„ç®€å•æ¨¡åž‹ï¼Ÿ\n", + "\n", + "在最åŽçš„notebook中,我们将å°è¯•回答这些问题,并从项目中得出最终结论。" + ] + }, { "cell_type": "code", 
"execution_count": null, diff --git a/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/2_建模_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/2_建模_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb index 5e87960..22062fc 100644 --- a/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/2_建模_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb +++ b/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/2_建模_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb @@ -1238,15 +1238,34 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### è°ƒå‚" + "## æ¨¡åž‹ä¼˜åŒ–â€”â€”è¶…å‚æ•°è°ƒä¼˜\n", + "\n", + "**在机器学习中,优化模型æ„味ç€ä¸ºä¸€ä¸ªç‰¹å®šçš„é—®é¢˜æ‰¾åˆ°ä¸€ç»„æœ€ä½³çš„è¶…å‚æ•°ã€‚**\n", + "\n", + "* æ¨¡åž‹è¶…å‚æ•°è¢«è®¤ä¸ºæ˜¯æœºå™¨å­¦ä¹ ç®—æ³•çš„æœ€ä½³å‚æ•°ï¼Œç”±æ•°æ®ç§‘学家在训练å‰å¯¹å…¶è¿›è¡Œè°ƒæ•´ã€‚ä¾‹å¦‚éšæœºæž—中的树数,或K近邻回归中使用的邻居数。\n", + "* æ¨¡åž‹å‚æ•°æ˜¯æ¨¡åž‹åœ¨è®­ç»ƒè¿‡ç¨‹ä¸­å­¦ä¹ åˆ°çš„,例如线性回归中的æƒé‡ã€‚\n", + "\n", + "**[è°ƒæ•´æ¨¡åž‹è¶…å‚æ•°](http://scikit-learn.org/stable/modules/grid_search.html)控制模型中欠拟åˆä¸Žè¿‡æ‹Ÿåˆçš„平衡。**\n", + "\n", + "* 我们å¯ä»¥å°è¯•é€šè¿‡å»ºç«‹ä¸€ä¸ªæ›´å¤æ‚的模型æ¥çº æ­£æ¬ æ‹Ÿåˆï¼Œä¾‹å¦‚åœ¨éšæœºæ£®æž—中使用更多的树,或者在神ç»ç½‘络中使用更多的层。欠拟åˆçš„æ¨¡åž‹å…·æœ‰å¾ˆé«˜çš„å差,当我们的模型没有足够的能力(自由度)æ¥å­¦ä¹ ç‰¹å¾å’Œç›®æ ‡ä¹‹é—´çš„å…³ç³»æ—¶ï¼Œå°±ä¼šå‡ºçŽ°è¿™ç§æƒ…况。\n", + "* 我们å¯ä»¥å°è¯•通过é™åˆ¶æ¨¡åž‹çš„夿‚性和应用正则化æ¥çº æ­£è¿‡åº¦æ‹Ÿåˆã€‚è¿™å¯èƒ½æ„味ç€å‡å°‘多项å¼å›žå½’的次数,或者在神ç»ç½‘络中å‡å°‘网络层次等。过度拟åˆçš„æ¨¡åž‹å…·æœ‰å¾ˆé«˜çš„æ–¹å·®ï¼Œå¹¶ä¸”实际上已ç»è®°ä½äº†è®­ç»ƒé›†ã€‚欠拟åˆå’Œè¿‡æ‹Ÿåˆéƒ½ä¼šå¯¼è‡´æµ‹è¯•集的泛化性能较差。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Cross Validation\n", - "" + "### åŸºäºŽéšæœºæœç´¢å’Œäº¤å‰éªŒè¯çš„傿•°è°ƒä¼˜\n", + "\n", + "* éšæœºæœç´¢æŒ‡çš„æ˜¯æˆ‘ä»¬é€‰æ‹©è¶…å‚æ•°è¿›è¡Œè¯„估的方法:定义一系列选项,然åŽéšæœºé€‰æ‹©ç»„åˆè¿›è¡Œå°è¯•。这与网格æœç´¢å½¢æˆå¯¹æ¯”,网格æœç´¢è¯„估我们指定的æ¯ä¸ªç»„åˆã€‚一般æ¥è¯´ï¼Œå½“æˆ‘ä»¬å¯¹æœ€ä½³æ¨¡åž‹è¶…å‚æ•°çš„çŸ¥è¯†æœ‰é™æ—¶ï¼Œéšæœºæœç´¢ä¼šæ›´å¥½ï¼Œæˆ‘们å¯ä»¥ä½¿ç”¨éšæœºæœç´¢ç¼©å°é€‰é¡¹èŒƒå›´ï¼Œç„¶åŽä½¿ç”¨ç½‘æ ¼æœç´¢æ¥é€‰æ‹©èŒƒå›´æ›´æœ‰é™çš„选项。\n", + "\n", + "* 交å‰éªŒè¯æ˜¯ç”¨æ¥è¯„ä¼°è¶…å‚æ•°æ€§èƒ½çš„æ–¹æ³•。我们使用K-Fold交å‰éªŒè¯ï¼Œè€Œä¸æ˜¯å°†è®­ç»ƒé›†åˆ†æˆå•独的训练集和验è¯é›†ï¼Œä»Žè€Œå‡å°‘我们å¯ä»¥ä½¿ç”¨çš„训练数æ®é‡ã€‚è¿™æ„味ç€å°†è®­ç»ƒæ•°æ®åˆ†æˆK个折å ï¼Œç„¶åŽç»è¿‡ä¸€ä¸ªè¿­ä»£è¿‡ç¨‹ï¼Œæˆ‘们首先对K-1个折å è¿›è¡Œè®­ç»ƒï¼Œç„¶åŽåœ¨ç¬¬K个折å ä¸Šè¯„估性能。我们é‡å¤è¿™ä¸ªè¿‡ç¨‹K次,所以最终我们将在训练数æ®ä¸­çš„æ¯ä¸ªä¾‹å­ä¸Šè¿›è¡Œæµ‹è¯•ï¼Œå…³é”®æ˜¯æˆ‘ä»¬è¦æµ‹è¯•çš„æ¯ä¸ªè¿­ä»£éƒ½æ˜¯åœ¨æˆ‘们没有训练的数æ®ä¸Šè¿›è¡Œçš„。在K次交å‰éªŒè¯ç»“æŸæ—¶ï¼Œæˆ‘们将K次迭代的平å‡è¯¯å·®ä½œä¸ºæœ€ç»ˆçš„æ€§èƒ½åº¦é‡ï¼Œå¯¹æ‰€æœ‰çš„训练数æ®è¿›è¡Œè®­ç»ƒã€‚ç„¶åŽï¼Œæˆ‘ä»¬è®°å½•çš„æ€§èƒ½ç”¨äºŽæ¯”è¾ƒè¶…å‚æ•°çš„ä¸åŒç»„åˆã€‚\n", + "\n", + "使用k=5进行k折交å‰éªŒè¯çš„图片如下所示:\n", + "\n", + "\n", + "\n", + "æˆ‘ä»¬å°†å®žçŽ°éšæœºæœç´¢å’Œäº¤å‰éªŒè¯ï¼Œä»¥é€‰æ‹©æœ€ä½³çš„è¶…å‚æ•°çš„GBDT回归。我们首先定义一个网格,然åŽå½¢æˆä¸€ä¸ªè¿­ä»£è¿‡ç¨‹ï¼šä»Žç½‘æ ¼ä¸­éšæœºæŠ½å–ä¸€ç»„è¶…å‚æ•°ï¼Œä½¿ç”¨4折交å‰éªŒè¯è¯„ä¼°è¶…å‚æ•°ï¼Œç„¶åŽé€‰æ‹©æ€§èƒ½æœ€å¥½çš„è¶…å‚æ•°ã€‚" ] }, { @@ -1283,6 +1302,27 @@ " 'max_features': max_features} " ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "我们选择了6个ä¸åŒçš„è¶…å‚æ•°æ¥è°ƒèŠ‚GBDT回归。这些都会以ä¸åŒçš„æ–¹å¼å½±å“模型,而这些方å¼å¾ˆéš¾æå‰ç¡®å®šï¼Œè¦æ‰¾åˆ°é’ˆå¯¹ç‰¹å®šé—®é¢˜çš„æœ€ä½³ç»„åˆï¼Œå”¯ä¸€çš„æ–¹æ³•就是测试它们ï¼è¦äº†è§£è¶…傿•°ï¼Œå»ºè®®æŸ¥çœ‹[Scikit Learn文档](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)。现在,åªè¦çŸ¥é“æˆ‘ä»¬æ­£åœ¨åŠªåŠ›å¯»æ‰¾è¶…å‚æ•°çš„æœ€ä½³ç»„åˆï¼Œå› ä¸ºæ²¡æœ‰ç†è®ºå‘Šè¯‰æˆ‘们哪一个最有效,我们åªéœ€è¦è¯„估它们,就åƒè¿è¡Œä¸€ä¸ªå®žéªŒä¸€æ ·ï¼\n", + "\n", + "æˆ‘ä»¬åˆ›å»ºéšæœºæœç´¢å¯¹è±¡ï¼Œå¹¶ä¼ å…¥ä»¥ä¸‹å‚数:\n", + "\n", + "* `estimator`:所选的模型\n", + "* 
@@ -1283,6 +1302,27 @@
     "                       'max_features': max_features} "
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We selected 6 different hyperparameters to tune for the GBDT regressor. They all affect the model in ways that are hard to determine in advance, and the only way to find the best combination for a particular problem is to test them! To learn about the hyperparameters, a good place to start is the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor). For now, just know that we are looking for the best combination of hyperparameters; because no theory tells us which will work best, we simply have to evaluate them, much like running an experiment!\n",
+    "\n",
+    "We create the random search object and pass in the following arguments:\n",
+    "\n",
+    "* `estimator`: the model to tune\n",
+    "* `param_distributions`: the distribution of parameters we defined\n",
+    "* `cv`: the number of folds to use for k-fold cross validation\n",
+    "* `n_iter`: the number of different combinations to try\n",
+    "* `scoring`: the metric used to evaluate candidates\n",
+    "* `n_jobs`: the number of cores to run in parallel (-1 uses all available)\n",
+    "* `verbose`: how much information to display (1 shows a limited amount)\n",
+    "* `return_train_score`: return the training score for every cross-validation fold\n",
+    "* `random_state`: a fixed random seed so the data splits are the same on every run\n",
+    "\n",
+    "The random search object is trained in the same way as any other scikit-learn model. After training, we can compare all of the different hyperparameter combinations and find the one that performs best."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
@@ -1790,7 +1830,7 @@
    }
   ],
   "source": [
-    "# 获取所有cv结果并按测试性能排序\n",
+    "# Get all of the cv results and sort by the test performance\n",
    "random_results = pd.DataFrame(random_cv.cv_results_).sort_values('mean_test_score', ascending = False)\n",
    "\n",
    "random_results.head(10)"
    ]
   },
@@ -1824,6 +1864,25 @@
    "random_cv.best_estimator_"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The best GBDT model has the following hyperparameters:\n",
+    "* `loss = lad`\n",
+    "* `n_estimators = 500`\n",
+    "* `max_depth = 5`\n",
+    "* `min_samples_leaf = 6`\n",
+    "* `min_samples_split = 6`\n",
+    "* `max_features = None`\n",
+    "\n",
+    "Using random search is a good way to narrow down the range of plausible hyperparameters. Initially we had no idea which combination would work best, but now the choices are much narrower.\n",
+    "\n",
+    "We can feed the results of the random search, together with values close to them, into a grid search to look for hyperparameter settings that perform even better than the ones random search found.\n",
+    "\n",
+    "Here the grid search only tests n_estimators (the number of trees); we then plot the training and testing performance to see what increasing the number of trees does for our model."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -1839,7 +1898,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# 创建一系列要评估的树\n",
+    "# Create a range of trees to evaluate\n",
    "trees_grid = {'n_estimators': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800]}\n",
    "\n",
    "model = GradientBoostingRegressor(loss = 'lad', max_depth = 5,\n",
@@ -1848,14 +1907,10 @@
    "                                  max_features = None,\n",
    "                                  random_state = 42)\n",
    "\n",
-    "# 使用树的范围和随机森林模型的网格搜索对象\n",
-    "grid_search = GridSearchCV(estimator = model, \n",
-    "                           param_grid=trees_grid, \n",
-    "                           cv = 4, \n",
-    "                           scoring = 'neg_mean_absolute_error', \n",
-    "                           verbose = 1,\n",
-    "                           n_jobs = -1, \n",
-    "                           return_train_score = True)\n"
+    "# Grid Search Object using the trees range and the random forest model\n",
+    "grid_search = GridSearchCV(estimator = model, param_grid=trees_grid, cv = 4, \n",
+    "                           scoring = 'neg_mean_absolute_error', verbose = 1,\n",
+    "                           n_jobs = -1, return_train_score = True)"
    ]
   },
   {
@@ -1936,10 +1991,10 @@
    }
   ],
   "source": [
-    "# 将结果导入数据框\n",
+    "# Get the results into a dataframe\n",
    "results = pd.DataFrame(grid_search.cv_results_)\n",
    "\n",
-    "# 绘制训练误差和测试误差与树木数量的关系图\n",
+    "# Plot the training and testing error vs number of trees\n",
    "figsize(8, 8)\n",
    "plt.style.use('fivethirtyeight')\n",
    "plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Testing Error')\n",
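The hunk above plots `-1 * mean_test_score` against the number of trees; the same train/test gap the next cell discusses can be read directly off `cv_results_`. A rough, self-contained sketch (small stand-in data and grid, purely to show where the columns come from):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Stand-in data and a deliberately small grid.
X, y = make_regression(n_samples=400, n_features=10, noise=10, random_state=42)

grid_search = GridSearchCV(
    estimator=GradientBoostingRegressor(max_depth=5, random_state=42),
    param_grid={'n_estimators': [50, 100, 200, 400]},
    cv=4,
    scoring='neg_mean_absolute_error',
    return_train_score=True,
)
grid_search.fit(X, y)

results = pd.DataFrame(grid_search.cv_results_)
# Flip the sign so both columns are MAE; the difference between them is the
# overfitting signal discussed in the next cell.
gap = pd.DataFrame({
    'n_estimators': results['param_n_estimators'],
    'train_mae': -results['mean_train_score'],
    'test_mae': -results['mean_test_score'],
})
gap['gap'] = gap['test_mae'] - gap['train_mae']
print(gap)
```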
"åœ¨è®­ç»ƒè¯¯å·®å’Œæµ‹è¯•è¯¯å·®ä¹‹é—´æ€»æ˜¯æœ‰å·®å¼‚çš„ï¼ˆè®­ç»ƒè¯¯å·®æ€»æ˜¯è¾ƒä½Žçš„ï¼‰ï¼Œä½†æ˜¯å¦‚æžœæœ‰æ˜¾è‘—çš„å·®å¼‚ï¼Œæˆ‘ä»¬å¸Œæœ›é€šè¿‡èŽ·å¾—æ›´å¤šçš„è®­ç»ƒæ•°æ®æˆ–é€šè¿‡è¶…å‚æ•°è°ƒæ•´æˆ–正则化æ¥é™ä½Žæ¨¡åž‹çš„夿‚性æ¥å°è¯•å‡å°‘过拟åˆã€‚对于GBDT回归模型,一些选项包括å‡å°‘树的数é‡ã€å‡å°‘æ¯æ£µæ ‘的最大深度以åŠå¢žåŠ å¶èŠ‚ç‚¹ä¸­çš„æœ€å°æ ·æœ¬æ•°ã€‚如果想进一步研究GBDTå¯ä»¥äº†è§£[该文章](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)。目å‰ï¼Œæˆ‘们将使用性能最好的模型,并接å—它å¯èƒ½ä¼šè¿‡æ‹Ÿåˆã€‚" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 在测试集上评估模型\n", + "\n", + "æˆ‘ä»¬ä½¿ç”¨è¶…å‚æ•°è°ƒæ•´çš„æœ€ä½³æ¨¡åž‹åœ¨æµ‹è¯•集上进行预测。此å‰ï¼Œæˆ‘们的模型从未è§è¿‡æµ‹è¯•集,因此性能应该是一个很好的指标,表明如果在生产中部署模型,它将如何执行。\n", + "\n", + "ä¸ºäº†è¿›è¡Œæ¯”è¾ƒï¼Œæˆ‘ä»¬æ¯”è¾ƒé»˜è®¤æ¨¡åž‹ã€æœ€ä¼˜æ¨¡åž‹çš„æ€§èƒ½ã€‚" ] }, { @@ -2188,10 +2256,10 @@ } ], "source": [ - "# 默认模型\n", + "# Default model\n", "default_model = GradientBoostingRegressor(random_state = 42)\n", "\n", - "# 选择最佳模型\n", + "# Select the best model\n", "final_model = grid_search.best_estimator_\n", "\n", "final_model" @@ -2260,7 +2328,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "对比测试结果,训练时间近似,模型得到差ä¸å¤š10%çš„æå‡ã€‚" + "对比测试结果,训练时间近似,模型得到差ä¸å¤š10%çš„æå‡ã€‚è¯æ˜Žæˆ‘们的优化是有效的。" ] }, { @@ -2291,6 +2359,15 @@ "plt.title('Test Values and Predictions');" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "预测值和真实值近乎密度线较拟åˆã€‚\n", + "\n", + "下é¢çš„è¯Šæ–­å›¾æ˜¯æ®‹å·®ç›´æ–¹å›¾ã€‚ç†æƒ³æƒ…况下,我们希望残差是正æ€åˆ†å¸ƒçš„,这æ„å‘³ç€æ¨¡åž‹åœ¨ä¸¤ä¸ªæ–¹å‘(高和低)上都是错误的。" + ] + }, { "cell_type": "code", "execution_count": 34, @@ -2310,16 +2387,41 @@ "source": [ "figsize = (6, 6)\n", "\n", - "# 计算残差\n", + "# Calculate the residuals \n", "residuals = final_pred - y_test\n", "\n", - "# 绘制残差分布直方图\n", + "# Plot the residuals in a histogram\n", "plt.hist(residuals, color = 'red', bins = 20,\n", " edgecolor = 'black')\n", "plt.xlabel('Error'); plt.ylabel('Count')\n", "plt.title('Distribution of Residuals');" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "残差接近正常分布,低端有几个明显的离群点。这表明模型的预测值有远低于真实值的错误。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 总结\n", + "在这第一部分中,我们执行了机器学习æµç¨‹ä¸­çš„几个关键:\n", + "\n", + "* å»ºç«‹åŸºç¡€æ¨¡åž‹ï¼Œæ¯”è¾ƒå¤šç§æ¨¡åž‹æ€§èƒ½æŒ‡æ ‡\n", + "* æ¨¡åž‹è¶…å‚æ•°è°ƒå‚,针对问题进行优化\n", + "* 在测试集上评估最佳模型\n", + "\n", + "结果表明,机器学习å¯ä»¥åº”用于我们的问题,最终的模型能够预测建筑物的能æºä¹‹æ˜Ÿçš„得分在9.1åˆ†ä»¥å†…ã€‚æˆ‘ä»¬è¿˜çœ‹åˆ°ï¼Œè¶…å‚æ•°è°ƒæ•´èƒ½å¤Ÿæ”¹å–„模型的性能,尽管在时间投入方é¢è¦ä»˜å‡ºç›¸å½“大的代价。这是一个很好的æé†’,正确的特性工程和收集更多的数æ®ï¼ˆå¦‚æžœå¯èƒ½çš„è¯ï¼‰æ¯”微调模型有更大的回报。我们还观察到è¿è¡Œæ—¶ä¸Žç²¾åº¦çš„æƒè¡¡ï¼Œè¿™æ˜¯æˆ‘ä»¬åœ¨è®¾è®¡æœºå™¨å­¦ä¹ æ¨¡åž‹æ—¶å¿…é¡»è€ƒè™‘çš„è®¸å¤šå› ç´ ä¹‹ä¸€ã€‚\n", + "\n", + "æˆ‘ä»¬çŸ¥é“æˆ‘们的模型是准确的,但是我们知é“它为什么会åšå‡ºè¿™æ ·çš„预测å—?机器学习过程的下一步是至关é‡è¦çš„:试图ç†è§£æ¨¡åž‹æ˜¯å¦‚何åšå‡ºé¢„测的。实现高精度是很好的,但是如果我们能够弄清楚为什么模型能够准确地预测,那么我们就å¯ä»¥åˆ©ç”¨è¿™äº›ä¿¡æ¯æ¥æ›´å¥½åœ°ç†è§£é—®é¢˜ã€‚例如,该模型ä¾èµ–å“ªäº›ç‰¹å¾æ¥æŽ¨æ–­èƒ½æºä¹‹æ˜Ÿåˆ†æ•°ï¼Ÿæ˜¯å¦å¯ä»¥ä½¿ç”¨æ­¤æ¨¡åž‹è¿›è¡Œç‰¹å¾é€‰æ‹©ï¼Œå¹¶å®žçŽ°æ›´æ˜“äºŽè§£é‡Šçš„ç®€å•æ¨¡åž‹ï¼Ÿ\n", + "\n", + "在最åŽçš„notebook中,我们将å°è¯•回答这些问题,并从项目中得出最终结论。" + ] + }, { "cell_type": "code", "execution_count": null,