From f65f03d2ef342f28c50b143a289eee0feb5772c0 Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Sun, 27 Dec 2020 11:47:07 +0800
Subject: [PATCH] Add comment of Feature Engineering and Selection

---
 ...筑能源利用率预测-checkpoint.ipynb | 84 ++++++++++++++-----
 ...处理_建筑能源利用率预测.ipynb | 84 ++++++++++++++-----
 2 files changed, 128 insertions(+), 40 deletions(-)

diff --git a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb
index 330d60a..95b06f1 100644
--- a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb
@@ -2991,7 +2991,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 共线"
+    "### Removing Collinear Features\n",
+    "In this dataset, Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) are highly correlated, because they are just slightly different ways of calculating energy use intensity."
    ]
   },
   {
@@ -3021,6 +3022,24 @@
     " 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Highly correlated features are generally removed, keeping only one of them to give the model the necessary information.\n",
+    "\n",
+    "Removing collinear features reduces model complexity by decreasing the number of features, which can help the model generalize. It also helps us interpret the model, because we only have to reason about a single variable, such as EUI, rather than how both Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) affect the score.\n",
+    "\n",
+    "There are many other methods for this, such as the [variance inflation factor](http://www.statisticshowto.com/variance-inflation-factor/). Here we use a simpler approach and remove features whose correlation coefficient with another feature exceeds a certain threshold (the correlation between two features, not with the score: we want variables that are highly correlated with the score!). For a more thorough discussion of removing collinear variables, see [this notebook on Kaggle](https://www.kaggle.com/robertoruiz/dealing-with-multicollinearity/code)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "下é¢çš„通过比较两个特å¾ï¼ŒåŸºäºŽæˆ‘们为相关系数选择的阈值æ¥åˆ é™¤å…±çº¿ç‰¹å¾ã€‚它还打å°å‡ºå®ƒåŽ»é™¤çš„ç›¸å…³æ€§ï¼Œè¿™æ ·æˆ‘ä»¬å°±å¯ä»¥çœ‹åˆ°è°ƒæ•´é˜ˆå€¼çš„æ•ˆæžœã€‚如果特å¾ä¹‹é—´çš„相关系数超过这个值,我们将使用0.6的阈值æ¥åˆ é™¤ä¸€å¯¹ç‰¹å¾ä¸­çš„一个。" + ] + }, { "cell_type": "code", "execution_count": 27, @@ -3030,25 +3049,27 @@ "def remove_collinear_features(x, threshold):\n", " '''\n", " Objective:\n", - " 删除数æ®å¸§ä¸­ç›¸å…³ç³»æ•°å¤§äºŽé˜ˆå€¼çš„共线特å¾ã€‚ 删除共线特å¾å¯ä»¥å¸®åŠ©æ¨¡åž‹æ³›åŒ–å¹¶æé«˜æ¨¡åž‹çš„å¯è§£é‡Šæ€§ã€‚\n", + " Remove collinear features in a dataframe with a correlation coefficient\n", + " greater than the threshold. Removing collinear features can help a model\n", + " to generalize and improves the interpretability of the model.\n", " \n", " Inputs: \n", - " 阈值:删除任何相关性大于此值的特å¾\n", + " threshold: any features with correlations greater than this value are removed\n", " \n", " Output: \n", - " 仅包å«éžé«˜å…±çº¿ç‰¹å¾çš„æ•°æ®å¸§\n", + " dataframe that contains only the non-highly-collinear features\n", " '''\n", " \n", - " # ä¸è¦åˆ é™¤èƒ½æºä¹‹æ˜Ÿå¾—分之间的相关性\n", + " # Dont want to remove correlations between Energy Star Score\n", " y = x['score']\n", " x = x.drop(columns = ['score'])\n", " \n", - " # 计算相关性矩阵\n", + " # Calculate the correlation matrix\n", " corr_matrix = x.corr()\n", " iters = range(len(corr_matrix.columns) - 1)\n", " drop_cols = []\n", "\n", - " # 迭代相关性矩阵并比较相关性\n", + " # Iterate through the correlation matrix and compare correlations\n", " for i in iters:\n", " for j in range(i):\n", " item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n", @@ -3056,13 +3077,13 @@ " row = item.index\n", " val = abs(item.values)\n", " \n", - " # 如果相关性超过阈值\n", + " # If correlation exceeds the threshold\n", " if val >= threshold:\n", - " # æ‰“å°æœ‰ç›¸å…³æ€§çš„特å¾å’Œç›¸å…³å€¼\n", + " # Print the correlated features and the correlation value\n", " # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n", " 
drop_cols.append(col.values[0])\n", "\n", - " # 删除æ¯å¯¹ç›¸å…³åˆ—中的一个\n", + " # Drop one of each pair of correlated columns\n", " drops = set(drop_cols)\n", " x = x.drop(columns = drops)\n", " x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n", @@ -3070,7 +3091,7 @@ " 'log_Water Use (All Water Sources) (kgal)',\n", " 'Largest Property Use Type - Gross Floor Area (ft²)'])\n", " \n", - " # 将得分添加回数æ®\n", + " # Add the score back in to the data\n", " x['score'] = y\n", " \n", " return x" @@ -3090,7 +3111,7 @@ } ], "source": [ - "# 删除大于指定相关系数的共线特å¾\n", + "# Remove the collinear features above a specified correlation coefficient\n", "features = remove_collinear_features(features, 0.6);" ] }, @@ -3111,7 +3132,7 @@ } ], "source": [ - "# 删除所有 na 值的列\n", + "# Remove any columns with all nan values\n", "features = features.dropna(axis=1, how = 'all')\n", "features.shape" ] @@ -3120,7 +3141,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### æ•°æ®é›†åˆ’分" + "现在数æ®é›†æœ‰64个特性(其中一个列是target)。这ä»ç„¶æ˜¯ç›¸å½“å¤šï¼Œä½†ä¸»è¦æ˜¯å› ä¸ºæˆ‘们有一个One-Hot的分类å˜é‡ã€‚此外,虽然大é‡çš„特å¾å¯¹äºŽçº¿æ€§å›žå½’等模型å¯èƒ½å­˜åœ¨é—®é¢˜ï¼Œä½†éšæœºæ£®æž—ç­‰æ¨¡åž‹æ‰§è¡Œéšå¼ç‰¹å¾é€‰æ‹©ï¼Œå¹¶è‡ªåŠ¨ç¡®å®šå“ªäº›ç‰¹å¾åœ¨è®­ç»ƒè¿‡ç¨‹ä¸­æ˜¯é‡è¦çš„。还有其他的特性选择步骤,但是现在我们将ä¿ç•™æˆ‘们所有的特性,看看模型是如何执行的。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**附加的特å¾é€‰æ‹©**\n", + "\n", + "有很多的特å¾é€‰æ‹©æ–¹æ³•,常用的方法有主æˆåˆ†åˆ†æž(PCA),它将特å¾ä¿æŒæœ€å¤§æ–¹å·®çš„å‡å°‘,以é™ä½Žç»´æ•°ï¼Œæˆ–独立æˆåˆ†åˆ†æž(ICA),其目的是在一组特å¾ä¸­æ‰¾åˆ°ç‹¬ç«‹çš„æºã€‚ç„¶è€Œï¼Œè™½ç„¶è¿™äº›æ–¹æ³•æœ‰æ•ˆåœ°å‡å°‘了特性的数é‡ï¼Œä½†æ˜¯å®ƒä»¬åˆ›å»ºäº†æ²¡æœ‰ç‰©ç†æ„义的新特性,从而使得解释模型几乎是ä¸å¯èƒ½çš„。在实际场景中,很少有é‡åˆ°è¿™ä¹ˆåšä¸”效果æå‡çš„。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### æ•°æ®é›†åˆ’分\n", + "\n", + "在机器学习中,我们总是需è¦å°†æˆ‘们的特å¾åˆ†ä¸ºä¸¤ç»„:å³è®­ç»ƒé›†å’Œé¢„测集(å¯èƒ½è¿˜ä¼šå¤šä¸€ä¸ªéªŒè¯é›†ï¼‰ã€‚\n", + "\n", + 
"我们使用测试集æ¥è¯„估模型学习到的映射。模型从未在测试集上看到答案,但必须在仅使用特å¾ä¸çŸ¥é“答案的情况下进行预测。然åŽå°†æµ‹è¯•集的预测与真实目标进行比较,从而估计出我们的模型在实际开上线时的性能。\n", + "\n", + "对于我们的问题,我们将首先æå–所有没有能æºä¹‹æ˜Ÿåˆ†æ•°çš„建筑数æ®ï¼ˆæˆ‘们ä¸çŸ¥é“这些建筑的真实答案,因此它们对训练或测试没有帮助)。然åŽï¼Œæˆ‘们将具有能æºä¹‹æ˜Ÿåˆ†æ•°çš„建筑数æ®åˆ†æˆ30%的测试集和70%的训练集。\n", + "\n", + "使用scikit learn将数æ®åˆ†æˆéšæœºçš„è®­ç»ƒå’Œæµ‹è¯•é›†å¾ˆç®€å•。我们å¯ä»¥è®¾ç½®æ‹†åˆ†çš„éšæœºçжæ€ï¼Œä»¥ç¡®ä¿ç»“果一致。" ] }, { @@ -3138,7 +3183,7 @@ } ], "source": [ - "# æå–没有得分的建筑物和带有得分的建筑物\n", + "# Extract the buildings with no score and the buildings with a score\n", "no_score = features[features['score'].isna()]\n", "score = features[features['score'].notnull()]\n", "\n", @@ -3163,15 +3208,14 @@ } ], "source": [ - "# 将特å¾å’Œç›®æ ‡åˆ†ç¦»å¼€\n", - "features = score.drop(columns = 'score')\n", + "# Separate out the features and targets\n", + "features = score.drop(columns='score')\n", "targets = pd.DataFrame(score['score'])\n", "\n", - "# 用 nan æ›¿æ¢ inf and -inf (required for later imputation)\n", + "# Replace the inf and -inf with nan (required for later imputation)\n", "features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n", "\n", - "# 按照 7:3 的比例划分训练集和测试集\n", - "\n", + "# Split into 70% training and 30% testing set\n", "X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n", "\n", "print(X.shape)\n", diff --git a/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/1_æ•°æ®é¢„处ç†_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/1_æ•°æ®é¢„处ç†_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb index 330d60a..95b06f1 100644 --- a/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/1_æ•°æ®é¢„处ç†_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb +++ b/机器学习竞赛实战_优胜解决方案/建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹/1_æ•°æ®é¢„处ç†_建筑能æºåˆ©ç”¨çŽ‡é¢„æµ‹.ipynb @@ -2991,7 +2991,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### 共线" + "### 去除共线特å¾\n", + "在数æ®é›†ä¸­ï¼ŒWeather Normalized Site EUI (kBtu/ft²和Site EUI (kBtu/ft²)é«˜åº¦ç›¸å…³ï¼Œå› ä¸ºå®ƒä»¬åªæ˜¯è®¡ç®—能æºä½¿ç”¨å¼ºåº¦çš„æ–¹æ³•略有ä¸åŒã€‚" ] }, { @@ -3021,6 +3022,24 @@ " 
'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "一般高度关è”çš„ç‰¹å¾æˆ‘们是去除的,åªä¿ç•™ä¸€ä¸ªä¸ºæ¨¡åž‹æä¾›å¿…è¦ä¿¡æ¯ã€‚\n", + "\n", + "åŽ»é™¤å…±çº¿ç‰¹å¾æ˜¯ä¸€ç§é€šè¿‡å‡å°‘ç‰¹å¾æ•°é‡æ¥é™ä½Žæ¨¡åž‹å¤æ‚度的方法,有助于æé«˜æ¨¡åž‹çš„æ³›åŒ–能力。它还å¯ä»¥å¸®åŠ©æˆ‘ä»¬è§£é‡Šæ¨¡åž‹ï¼Œå› ä¸ºæˆ‘ä»¬åªéœ€è¦æ‹…心å•一å˜é‡ï¼Œæ¯”如EUIï¼Œè€Œä¸æ˜¯Weather Normalized Site EUI (kBtu/ft²和Site EUI (kBtu/ft²)如何影å“得分。\n", + "\n", + "除了这个还有很多方法,如:[方差膨胀因å­/系数](http://www.statisticshowto.com/variance-inflation-factor/),这里我们将使用更简å•的方法,并删除相关系数高于æŸä¸ªé˜ˆå€¼çš„特å¾ï¼ˆä¸æ˜¯ä¸Žåˆ†æ•°ç›¸å…³ï¼Œæ˜¯ä¸¤å˜é‡ä¹‹é—´çš„相关,我们需è¦ä¸Žåˆ†æ•°é«˜åº¦ç›¸å…³çš„å˜é‡ï¼ï¼‰ã€‚关于删除共线å˜é‡çš„æ›´å½»åº•的讨论,å¯ä»¥å‚考这个[this notebook on Kaggle](https://www.kaggle.com/robertoruiz/dealing-with-multicollinearity/code)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "下é¢çš„通过比较两个特å¾ï¼ŒåŸºäºŽæˆ‘们为相关系数选择的阈值æ¥åˆ é™¤å…±çº¿ç‰¹å¾ã€‚它还打å°å‡ºå®ƒåŽ»é™¤çš„ç›¸å…³æ€§ï¼Œè¿™æ ·æˆ‘ä»¬å°±å¯ä»¥çœ‹åˆ°è°ƒæ•´é˜ˆå€¼çš„æ•ˆæžœã€‚如果特å¾ä¹‹é—´çš„相关系数超过这个值,我们将使用0.6的阈值æ¥åˆ é™¤ä¸€å¯¹ç‰¹å¾ä¸­çš„一个。" + ] + }, { "cell_type": "code", "execution_count": 27, @@ -3030,25 +3049,27 @@ "def remove_collinear_features(x, threshold):\n", " '''\n", " Objective:\n", - " 删除数æ®å¸§ä¸­ç›¸å…³ç³»æ•°å¤§äºŽé˜ˆå€¼çš„共线特å¾ã€‚ 删除共线特å¾å¯ä»¥å¸®åŠ©æ¨¡åž‹æ³›åŒ–å¹¶æé«˜æ¨¡åž‹çš„å¯è§£é‡Šæ€§ã€‚\n", + " Remove collinear features in a dataframe with a correlation coefficient\n", + " greater than the threshold. 
Removing collinear features can help a model\n", + " to generalize and improves the interpretability of the model.\n", " \n", " Inputs: \n", - " 阈值:删除任何相关性大于此值的特å¾\n", + " threshold: any features with correlations greater than this value are removed\n", " \n", " Output: \n", - " 仅包å«éžé«˜å…±çº¿ç‰¹å¾çš„æ•°æ®å¸§\n", + " dataframe that contains only the non-highly-collinear features\n", " '''\n", " \n", - " # ä¸è¦åˆ é™¤èƒ½æºä¹‹æ˜Ÿå¾—分之间的相关性\n", + " # Dont want to remove correlations between Energy Star Score\n", " y = x['score']\n", " x = x.drop(columns = ['score'])\n", " \n", - " # 计算相关性矩阵\n", + " # Calculate the correlation matrix\n", " corr_matrix = x.corr()\n", " iters = range(len(corr_matrix.columns) - 1)\n", " drop_cols = []\n", "\n", - " # 迭代相关性矩阵并比较相关性\n", + " # Iterate through the correlation matrix and compare correlations\n", " for i in iters:\n", " for j in range(i):\n", " item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n", @@ -3056,13 +3077,13 @@ " row = item.index\n", " val = abs(item.values)\n", " \n", - " # 如果相关性超过阈值\n", + " # If correlation exceeds the threshold\n", " if val >= threshold:\n", - " # æ‰“å°æœ‰ç›¸å…³æ€§çš„特å¾å’Œç›¸å…³å€¼\n", + " # Print the correlated features and the correlation value\n", " # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n", " drop_cols.append(col.values[0])\n", "\n", - " # 删除æ¯å¯¹ç›¸å…³åˆ—中的一个\n", + " # Drop one of each pair of correlated columns\n", " drops = set(drop_cols)\n", " x = x.drop(columns = drops)\n", " x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n", @@ -3070,7 +3091,7 @@ " 'log_Water Use (All Water Sources) (kgal)',\n", " 'Largest Property Use Type - Gross Floor Area (ft²)'])\n", " \n", - " # 将得分添加回数æ®\n", + " # Add the score back in to the data\n", " x['score'] = y\n", " \n", " return x" @@ -3090,7 +3111,7 @@ } ], "source": [ - "# 删除大于指定相关系数的共线特å¾\n", + "# Remove the collinear features above a specified correlation coefficient\n", "features = 
remove_collinear_features(features, 0.6);" ] }, @@ -3111,7 +3132,7 @@ } ], "source": [ - "# 删除所有 na 值的列\n", + "# Remove any columns with all nan values\n", "features = features.dropna(axis=1, how = 'all')\n", "features.shape" ] @@ -3120,7 +3141,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### æ•°æ®é›†åˆ’分" + "现在数æ®é›†æœ‰64个特性(其中一个列是target)。这ä»ç„¶æ˜¯ç›¸å½“å¤šï¼Œä½†ä¸»è¦æ˜¯å› ä¸ºæˆ‘们有一个One-Hot的分类å˜é‡ã€‚此外,虽然大é‡çš„特å¾å¯¹äºŽçº¿æ€§å›žå½’等模型å¯èƒ½å­˜åœ¨é—®é¢˜ï¼Œä½†éšæœºæ£®æž—ç­‰æ¨¡åž‹æ‰§è¡Œéšå¼ç‰¹å¾é€‰æ‹©ï¼Œå¹¶è‡ªåŠ¨ç¡®å®šå“ªäº›ç‰¹å¾åœ¨è®­ç»ƒè¿‡ç¨‹ä¸­æ˜¯é‡è¦çš„。还有其他的特性选择步骤,但是现在我们将ä¿ç•™æˆ‘们所有的特性,看看模型是如何执行的。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**附加的特å¾é€‰æ‹©**\n", + "\n", + "有很多的特å¾é€‰æ‹©æ–¹æ³•,常用的方法有主æˆåˆ†åˆ†æž(PCA),它将特å¾ä¿æŒæœ€å¤§æ–¹å·®çš„å‡å°‘,以é™ä½Žç»´æ•°ï¼Œæˆ–独立æˆåˆ†åˆ†æž(ICA),其目的是在一组特å¾ä¸­æ‰¾åˆ°ç‹¬ç«‹çš„æºã€‚ç„¶è€Œï¼Œè™½ç„¶è¿™äº›æ–¹æ³•æœ‰æ•ˆåœ°å‡å°‘了特性的数é‡ï¼Œä½†æ˜¯å®ƒä»¬åˆ›å»ºäº†æ²¡æœ‰ç‰©ç†æ„义的新特性,从而使得解释模型几乎是ä¸å¯èƒ½çš„。在实际场景中,很少有é‡åˆ°è¿™ä¹ˆåšä¸”效果æå‡çš„。" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### æ•°æ®é›†åˆ’分\n", + "\n", + "在机器学习中,我们总是需è¦å°†æˆ‘们的特å¾åˆ†ä¸ºä¸¤ç»„:å³è®­ç»ƒé›†å’Œé¢„测集(å¯èƒ½è¿˜ä¼šå¤šä¸€ä¸ªéªŒè¯é›†ï¼‰ã€‚\n", + "\n", + "我们使用测试集æ¥è¯„估模型学习到的映射。模型从未在测试集上看到答案,但必须在仅使用特å¾ä¸çŸ¥é“答案的情况下进行预测。然åŽå°†æµ‹è¯•集的预测与真实目标进行比较,从而估计出我们的模型在实际开上线时的性能。\n", + "\n", + "对于我们的问题,我们将首先æå–所有没有能æºä¹‹æ˜Ÿåˆ†æ•°çš„建筑数æ®ï¼ˆæˆ‘们ä¸çŸ¥é“这些建筑的真实答案,因此它们对训练或测试没有帮助)。然åŽï¼Œæˆ‘们将具有能æºä¹‹æ˜Ÿåˆ†æ•°çš„建筑数æ®åˆ†æˆ30%的测试集和70%的训练集。\n", + "\n", + "使用scikit learn将数æ®åˆ†æˆéšæœºçš„è®­ç»ƒå’Œæµ‹è¯•é›†å¾ˆç®€å•。我们å¯ä»¥è®¾ç½®æ‹†åˆ†çš„éšæœºçжæ€ï¼Œä»¥ç¡®ä¿ç»“果一致。" ] }, { @@ -3138,7 +3183,7 @@ } ], "source": [ - "# æå–没有得分的建筑物和带有得分的建筑物\n", + "# Extract the buildings with no score and the buildings with a score\n", "no_score = features[features['score'].isna()]\n", "score = features[features['score'].notnull()]\n", "\n", @@ -3163,15 +3208,14 @@ } ], "source": [ - "# 将特å¾å’Œç›®æ ‡åˆ†ç¦»å¼€\n", - "features = score.drop(columns = 
'score')\n", + "# Separate out the features and targets\n", + "features = score.drop(columns='score')\n", "targets = pd.DataFrame(score['score'])\n", "\n", - "# 用 nan æ›¿æ¢ inf and -inf (required for later imputation)\n", + "# Replace the inf and -inf with nan (required for later imputation)\n", "features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n", "\n", - "# 按照 7:3 的比例划分训练集和测试集\n", - "\n", + "# Split into 70% training and 30% testing set\n", "X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n", "\n", "print(X.shape)\n",