diff --git a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb
index 330d60a..95b06f1 100644
--- a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb
@@ -2991,7 +2991,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 共线"
+    "### Removing Collinear Features\n",
+    "In this dataset, Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) are highly correlated, because they are only slightly different ways of calculating the energy use intensity."
    ]
   },
   {
@@ -3021,6 +3022,47 @@
     "                  'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When features are this highly correlated, we generally remove all but one of them; keeping a single one still provides the model with the necessary information.\n",
+    "\n",
+    "Removing collinear features reduces model complexity by cutting down the number of features, which helps the model generalize. It also helps us interpret the model, because we only have to worry about a single variable, such as the EUI, rather than about how both Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) affect the score.\n",
+    "\n",
+    "There are more rigorous ways to deal with collinearity, such as the [variance inflation factor](http://www.statisticshowto.com/variance-inflation-factor/). Here we will use a simpler approach and drop one feature from any pair whose correlation coefficient exceeds a threshold (this is the correlation between two features, not with the score; we do want features that are highly correlated with the score!). For a more thorough discussion of removing collinear variables, see [this notebook on Kaggle](https://www.kaggle.com/robertoruiz/dealing-with-multicollinearity/code)."
+   ]
+  },
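+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Purely as an illustration of the variance inflation factor mentioned above, here is a minimal, hypothetical sketch of how it could be computed. It assumes the statsmodels package is available and reuses the notebook's `features` dataframe and pandas import; it is not part of the original pipeline."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Hypothetical sketch (not part of the original pipeline): compute the VIF of\n",
+    "# each numeric feature; values above roughly 10 are a common collinearity flag.\n",
+    "from statsmodels.stats.outliers_influence import variance_inflation_factor\n",
+    "\n",
+    "numeric = features.select_dtypes('number').dropna()\n",
+    "vifs = pd.Series([variance_inflation_factor(numeric.values, i)\n",
+    "                  for i in range(numeric.shape[1])], index=numeric.columns)\n",
+    "vifs.sort_values(ascending=False).head()"
+   ]
+  },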
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The code below removes collinear features by comparing them pairwise against the threshold we choose for the correlation coefficient, and it can print each correlation it removes so that we can see the effect of adjusting the threshold. We will use a threshold of 0.6: if the absolute correlation between a pair of features exceeds this value, one of the pair is dropped."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 27,
@@ -3030,25 +3072,27 @@
     "def remove_collinear_features(x, threshold):\n",
     "    '''\n",
     "    Objective:\n",
-    "        删除数据帧中相关系数大于阈值的共线特征。 删除共线特征可以帮助模型泛化并提高模型的可解释性。\n",
+    "        Remove collinear features in a dataframe with a correlation coefficient\n",
+    "        greater than the threshold. Removing collinear features can help a model\n",
+    "        to generalize and improves the interpretability of the model.\n",
     "        \n",
     "    Inputs: \n",
-    "        阈值:删除任何相关性大于此值的特征\n",
+    "        threshold: any features with correlations greater than this value are removed\n",
     "        \n",
     "    Output: \n",
-    "        仅包含非高共线特征的数据帧\n",
+    "        dataframe that contains only the non-highly-collinear features\n",
     "    '''\n",
     "    \n",
-    "    # 不要删除能源之星得分之间的相关性\n",
+    "    # Don't want to remove correlations with the Energy Star Score\n",
     "    y = x['score']\n",
     "    x = x.drop(columns = ['score'])\n",
     "    \n",
-    "    # 计算相关性矩阵\n",
+    "    # Calculate the correlation matrix\n",
     "    corr_matrix = x.corr()\n",
     "    iters = range(len(corr_matrix.columns) - 1)\n",
     "    drop_cols = []\n",
     "\n",
-    "    # 迭代相关性矩阵并比较相关性\n",
+    "    # Iterate through the correlation matrix and compare correlations\n",
     "    for i in iters:\n",
     "        for j in range(i):\n",
     "            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n",
@@ -3056,13 +3100,13 @@
     "            row = item.index\n",
     "            val = abs(item.values)\n",
     "            \n",
-    "            # 如果相关性超过阈值\n",
+    "            # If the correlation exceeds the threshold\n",
     "            if val >= threshold:\n",
-    "                # 打印有相关性的特征和相关值\n",
+    "                # Print the correlated features and the correlation value\n",
     "                # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n",
     "                drop_cols.append(col.values[0])\n",
     "\n",
-    "    # 删除每对相关列中的一个\n",
+    "    # Drop one of each pair of correlated columns\n",
     "    drops = set(drop_cols)\n",
     "    x = x.drop(columns = drops)\n",
     "    x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n",
@@ -3070,7 +3114,7 @@
     "                         'log_Water Use (All Water Sources) (kgal)',\n",
     "                         'Largest Property Use Type - Gross Floor Area (ft²)'])\n",
     "    \n",
-    "    # 将得分添加回数据\n",
+    "    # Add the score back into the data\n",
     "    x['score'] = y\n",
     "    \n",
     "    return x"
@@ -3090,7 +3134,7 @@
     }
    ],
    "source": [
-    "# 删除大于指定相关系数的共线特征\n",
+    "# Remove the collinear features above a specified correlation coefficient\n",
     "features = remove_collinear_features(features, 0.6);"
    ]
   },
@@ -3111,7 +3155,7 @@
     }
    ],
    "source": [
-    "# 删除所有 na 值的列\n",
+    "# Remove any columns where all the values are nan\n",
     "features = features.dropna(axis=1, how = 'all')\n",
     "features.shape"
    ]
@@ -3120,7 +3164,53 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 数据集划分"
+    "The data now has 64 columns (one of which is the target). This is still quite a few features, but mostly because of the one-hot encoded categorical variables. Moreover, while a large number of features can be a problem for models such as linear regression, models such as random forests perform implicit feature selection and automatically determine which features are important during training. There are further feature selection steps we could take, but for now we will keep all the features and see how the models perform."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Additional feature selection**\n",
+    "\n",
+    "There are plenty of feature selection methods. Popular ones include principal component analysis (PCA), which reduces the dimensionality while retaining the components with the greatest variance, and independent component analysis (ICA), which aims to recover the independent sources underlying a set of features. However, while these methods are effective at reducing the number of features, they create new features with no physical meaning and therefore make interpreting the model nearly impossible; in practice it is rare for them to improve results on a problem like this one."
+   ]
+  },
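+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For reference only, here is a minimal sketch of the PCA route with scikit-learn, assuming a hypothetical imputed, all-numeric matrix `X_numeric` that is not defined in this notebook; we skip this step because the resulting components would have no physical meaning."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Hypothetical sketch (not used in this notebook): keep enough principal\n",
+    "# components to explain 95% of the variance.\n",
+    "from sklearn.decomposition import PCA\n",
+    "\n",
+    "pca = PCA(n_components=0.95)  # a float in (0, 1) selects by explained variance\n",
+    "X_reduced = pca.fit_transform(X_numeric)  # X_numeric: imputed numeric features\n",
+    "print(X_reduced.shape, pca.explained_variance_ratio_.sum())"
+   ]
+  },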
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Splitting the Data\n",
+    "\n",
+    "In machine learning, we always need to separate our data into two sets: a training set and a testing set (sometimes with an additional validation set as well).\n",
+    "\n",
+    "We use the testing set to evaluate the mapping the model has learned. The model never sees the answers for the testing set; it has to make predictions using only the features. We then compare the predictions on the testing set with the true targets to estimate how our model would perform if it were actually deployed.\n",
+    "\n",
+    "For our problem, we will first set aside all the buildings that have no Energy Star Score (we do not know the true answers for these buildings, so they are no use for training or testing). Then we will split the buildings that do have a score into a 70% training set and a 30% testing set.\n",
+    "\n",
+    "Splitting the data into random training and testing sets is straightforward with scikit-learn, and we can set the random state of the split to ensure consistent results."
+   ]
+  },
   {
@@ -3138,7 +3228,7 @@
     }
    ],
    "source": [
-    "# 提取没有得分的建筑物和带有得分的建筑物\n",
+    "# Extract the buildings with no score and the buildings with a score\n",
     "no_score = features[features['score'].isna()]\n",
     "score = features[features['score'].notnull()]\n",
     "\n",
@@ -3163,15 +3253,14 @@
     }
    ],
    "source": [
-    "# 将特征和目标分离开\n",
-    "features = score.drop(columns = 'score')\n",
+    "# Separate out the features and targets\n",
+    "features = score.drop(columns='score')\n",
     "targets = pd.DataFrame(score['score'])\n",
     "\n",
-    "# 用 nan 替换 inf and -inf (required for later imputation)\n",
+    "# Replace the inf and -inf with nan (required for later imputation)\n",
     "features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n",
     "\n",
-    "# 按照 7:3 的比例划分训练集和测试集\n",
-    "\n",
+    "# Split into 70% training and 30% testing set\n",
     "X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n",
     "\n",
     "print(X.shape)\n",