From f65f03d2ef342f28c50b143a289eee0feb5772c0 Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Sun, 27 Dec 2020 11:47:07 +0800
Subject: [PATCH] Add comments on Feature Engineering and Selection

---
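Note: the markdown added by this patch mentions the variance inflation factor (VIF) as an alternative to the pairwise-correlation check. Below is a minimal NumPy-only sketch of that idea, not code from the notebook. The function name `variance_inflation_factors` and the synthetic data are illustrative assumptions. It computes VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the remaining columns.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X (hypothetical helper)."""
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        y = X[:, j]
        # Regress column j on all other columns plus an intercept term.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        # With an intercept, the residual mean is zero, so this is 1 - SSR/SST.
        r_squared = 1.0 - residual.var() / y.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # almost a copy of x1
x3 = rng.normal(size=500)             # independent of the others
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

The two near-duplicate columns should show a large VIF while the independent one stays close to 1. A commonly cited rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity, though the right cutoff depends on the problem.
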
 ...建筑能源利用率预测-checkpoint.ipynb | 84 ++++++++++++++-----
 ...预处理_建筑能源利用率预测.ipynb | 84 ++++++++++++++-----
 2 files changed, 128 insertions(+), 40 deletions(-)
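
Note: the correlation-threshold idea documented in this patch can also be written in a vectorized form. The sketch below is not the notebook's `remove_collinear_features`. The name `drop_collinear`, its `target` parameter, and the demo data are assumptions for illustration. Unlike the notebook's loop, it inspects every pair of feature columns exactly once through the strict upper triangle of the absolute correlation matrix.

```python
import numpy as np
import pandas as pd


def drop_collinear(df: pd.DataFrame, target: str, threshold: float) -> pd.DataFrame:
    """Drop one feature from every pair whose |correlation| >= threshold (hypothetical helper)."""
    # Exclude the target so it can never be dropped, and use numeric columns only.
    features = df.drop(columns=[target]).select_dtypes(include="number")
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return df.drop(columns=to_drop)


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=200),  # nearly collinear with "a"
    "b": rng.normal(size=200),
    "score": rng.normal(size=200),
})
reduced = drop_collinear(demo, target="score", threshold=0.6)
print(list(reduced.columns))
```

Within each correlated pair the later column is the one removed, so the result depends on column order. That is a design choice of this sketch, not something the patch prescribes.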

diff --git a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb
index 330d60a..95b06f1 100644
--- a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/.ipynb_checkpoints/1_数据预处理_建筑能源利用率预测-checkpoint.ipynb
@@ -2991,7 +2991,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 共线"
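+    "### Removing Collinear Features\n",
+    "In this dataset, Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) are highly correlated, because they are just slightly different ways of calculating energy use intensity."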
    ]
   },
   {
@@ -3021,6 +3022,24 @@
     "                                                                        'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
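+    "Highly correlated features are generally redundant, so we drop them and keep only one of each group to give the model the information it needs.\n",
+    "\n",
+    "Removing collinear features reduces model complexity by decreasing the number of features, which can help the model generalize. It also makes the model easier to interpret, because we only have to reason about a single variable, such as EUI, rather than how both Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) affect the score.\n",
+    "\n",
+    "There are many methods for this, such as the [variance inflation factor](http://www.statisticshowto.com/variance-inflation-factor/). Here we use a simpler approach and remove features whose correlation coefficient with another feature exceeds a threshold (the correlation between two features, not with the score: we want features that are highly correlated with the score!). For a more thorough discussion of removing collinear variables, see [this notebook on Kaggle](https://www.kaggle.com/robertoruiz/dealing-with-multicollinearity/code)."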
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
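+    "The function below removes collinear features based on a correlation-coefficient threshold that we choose, dropping one of the two features in each compared pair. It can also print the correlations it removes (the print call is commented out by default), so we can see the effect of adjusting the threshold. We use a threshold of 0.6: if the correlation coefficient between two features reaches this value, one of the pair is removed."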
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 27,
@@ -3030,25 +3049,27 @@
     "def remove_collinear_features(x, threshold):\n",
     "    '''\n",
     "    Objective:\n",
-    "       删除数据帧中相关系数大于阈值的共线特征。 删除共线特征可以帮助模型泛化并提高模型的可解释性。\n",
+    "        Remove collinear features in a dataframe with a correlation coefficient\n",
+    "        greater than the threshold. Removing collinear features can help a model\n",
+    "        to generalize and improve the interpretability of the model.\n",
     "        \n",
     "    Inputs: \n",
-    "        阈值：删除任何相关性大于此值的特征\n",
+    "        threshold: any features with correlations greater than this value are removed\n",
     "    \n",
     "    Output: \n",
-    "        仅包含非高共线特征的数据帧\n",
+    "        dataframe that contains only the non-highly-collinear features\n",
     "    '''\n",
     "    \n",
-    "    # 不要删除能源之星得分之间的相关性\n",
+    "    # Keep the Energy Star Score out of the correlation check so it is never dropped\n",
     "    y = x['score']\n",
     "    x = x.drop(columns = ['score'])\n",
     "    \n",
-    "    # 计算相关性矩阵\n",
+    "    # Calculate the correlation matrix\n",
     "    corr_matrix = x.corr()\n",
     "    iters = range(len(corr_matrix.columns) - 1)\n",
     "    drop_cols = []\n",
     "\n",
-    "    # 迭代相关性矩阵并比较相关性\n",
+    "    # Iterate through the correlation matrix and compare correlations\n",
     "    for i in iters:\n",
     "        for j in range(i):\n",
     "            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n",
@@ -3056,13 +3077,13 @@
     "            row = item.index\n",
     "            val = abs(item.values)\n",
     "            \n",
-    "            # 如果相关性超过阈值\n",
+    "            # If correlation exceeds the threshold\n",
     "            if val >= threshold:\n",
-    "                # 打印有相关性的特征和相关值\n",
+    "                # Print the correlated features and the correlation value\n",
     "                # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n",
     "                drop_cols.append(col.values[0])\n",
     "\n",
-    "    # 删除每对相关列中的一个\n",
+    "    # Drop one of each pair of correlated columns\n",
     "    drops = set(drop_cols)\n",
     "    x = x.drop(columns = drops)\n",
     "    x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n",
@@ -3070,7 +3091,7 @@
     "                          'log_Water Use (All Water Sources) (kgal)',\n",
     "                          'Largest Property Use Type - Gross Floor Area (ft²)'])\n",
     "    \n",
-    "    # 将得分添加回数据\n",
+    "    # Add the score back in to the data\n",
     "    x['score'] = y\n",
     "               \n",
     "    return x"
@@ -3090,7 +3111,7 @@
     }
    ],
    "source": [
-    "# 删除大于指定相关系数的共线特征\n",
+    "# Remove the collinear features above a specified correlation coefficient\n",
     "features = remove_collinear_features(features, 0.6);"
    ]
   },
@@ -3111,7 +3132,7 @@
     }
    ],
    "source": [
-    "# 删除所有 na 值的列\n",
+    "# Remove any columns with all nan values\n",
     "features  = features.dropna(axis=1, how = 'all')\n",
     "features.shape"
    ]
@@ -3120,7 +3141,31 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 数据集划分"
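+    "The dataset now has 64 columns (one of which is the target). That is still quite a few, mainly because we one-hot encoded the categorical variables. Moreover, while a large number of features can be a problem for models such as linear regression, models such as random forests perform implicit feature selection and automatically determine which features are important during training. There are other feature selection steps we could take, but for now we keep all of the features and see how the model performs."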
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
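+    "**Additional Feature Selection**\n",
+    "\n",
+    "There are many other feature selection methods. Popular ones include principal component analysis (PCA), which transforms the features into fewer dimensions that preserve the greatest variance, and independent component analysis (ICA), which aims to find the independent sources in a set of features. However, while these methods are effective at reducing the number of features, they create new features with no physical meaning, which makes interpreting the model nearly impossible. In practice, it is rare to see them applied with an actual improvement in results."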
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
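+    "### Splitting the Dataset\n",
+    "\n",
+    "In machine learning, we always need to split our data into two sets: a training set and a testing set (and possibly a validation set as well).\n",
+    "\n",
+    "We use the testing set to evaluate the mapping the model has learned. The model never sees the answers for the testing set and must make predictions using only the features. We then compare the predictions on the testing set with the true targets to estimate how the model will perform once deployed.\n",
+    "\n",
+    "For our problem, we first extract all the buildings without an Energy Star Score (we do not know the true answer for these buildings, so they are of no use for training or testing). Then we split the buildings that do have an Energy Star Score into a 30% testing set and a 70% training set.\n",
+    "\n",
+    "Splitting the data into random training and testing sets is simple with scikit-learn. We can set the random state of the split to ensure consistent results."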
    ]
   },
   {
@@ -3138,7 +3183,7 @@
     }
    ],
    "source": [
-    "# 提取没有得分的建筑物和带有得分的建筑物\n",
+    "# Extract the buildings with no score and the buildings with a score\n",
     "no_score = features[features['score'].isna()]\n",
     "score = features[features['score'].notnull()]\n",
     "\n",
@@ -3163,15 +3208,14 @@
     }
    ],
    "source": [
-    "# 将特征和目标分离开\n",
-    "features = score.drop(columns = 'score')\n",
+    "# Separate out the features and targets\n",
+    "features = score.drop(columns='score')\n",
     "targets = pd.DataFrame(score['score'])\n",
     "\n",
-    "# 用 nan 替换 inf and -inf （required for later imputation）\n",
+    "# Replace the inf and -inf with nan (required for later imputation)\n",
     "features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n",
     "\n",
-    "# 按照 7：3 的比例划分训练集和测试集\n",
-    "\n",
+    "# Split into 70% training and 30% testing set\n",
     "X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n",
     "\n",
     "print(X.shape)\n",
diff --git a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb
index 330d60a..95b06f1 100644
--- a/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/建筑能源利用率预测/1_数据预处理_建筑能源利用率预测.ipynb
@@ -2991,7 +2991,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 共线"
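+    "### Removing Collinear Features\n",
+    "In this dataset, Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) are highly correlated, because they are just slightly different ways of calculating energy use intensity."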
    ]
   },
   {
@@ -3021,6 +3022,24 @@
     "                                                                        'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
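+    "Highly correlated features are generally redundant, so we drop them and keep only one of each group to give the model the information it needs.\n",
+    "\n",
+    "Removing collinear features reduces model complexity by decreasing the number of features, which can help the model generalize. It also makes the model easier to interpret, because we only have to reason about a single variable, such as EUI, rather than how both Weather Normalized Site EUI (kBtu/ft²) and Site EUI (kBtu/ft²) affect the score.\n",
+    "\n",
+    "There are many methods for this, such as the [variance inflation factor](http://www.statisticshowto.com/variance-inflation-factor/). Here we use a simpler approach and remove features whose correlation coefficient with another feature exceeds a threshold (the correlation between two features, not with the score: we want features that are highly correlated with the score!). For a more thorough discussion of removing collinear variables, see [this notebook on Kaggle](https://www.kaggle.com/robertoruiz/dealing-with-multicollinearity/code)."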
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
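+    "The function below removes collinear features based on a correlation-coefficient threshold that we choose, dropping one of the two features in each compared pair. It can also print the correlations it removes (the print call is commented out by default), so we can see the effect of adjusting the threshold. We use a threshold of 0.6: if the correlation coefficient between two features reaches this value, one of the pair is removed."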
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 27,
@@ -3030,25 +3049,27 @@
     "def remove_collinear_features(x, threshold):\n",
     "    '''\n",
     "    Objective:\n",
-    "       删除数据帧中相关系数大于阈值的共线特征。 删除共线特征可以帮助模型泛化并提高模型的可解释性。\n",
+    "        Remove collinear features in a dataframe with a correlation coefficient\n",
+    "        greater than the threshold. Removing collinear features can help a model\n",
+    "        to generalize and improve the interpretability of the model.\n",
     "        \n",
     "    Inputs: \n",
-    "        阈值：删除任何相关性大于此值的特征\n",
+    "        threshold: any features with correlations greater than this value are removed\n",
     "    \n",
     "    Output: \n",
-    "        仅包含非高共线特征的数据帧\n",
+    "        dataframe that contains only the non-highly-collinear features\n",
     "    '''\n",
     "    \n",
-    "    # 不要删除能源之星得分之间的相关性\n",
+    "    # Keep the Energy Star Score out of the correlation check so it is never dropped\n",
     "    y = x['score']\n",
     "    x = x.drop(columns = ['score'])\n",
     "    \n",
-    "    # 计算相关性矩阵\n",
+    "    # Calculate the correlation matrix\n",
     "    corr_matrix = x.corr()\n",
     "    iters = range(len(corr_matrix.columns) - 1)\n",
     "    drop_cols = []\n",
     "\n",
-    "    # 迭代相关性矩阵并比较相关性\n",
+    "    # Iterate through the correlation matrix and compare correlations\n",
     "    for i in iters:\n",
     "        for j in range(i):\n",
     "            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n",
@@ -3056,13 +3077,13 @@
     "            row = item.index\n",
     "            val = abs(item.values)\n",
     "            \n",
-    "            # 如果相关性超过阈值\n",
+    "            # If correlation exceeds the threshold\n",
     "            if val >= threshold:\n",
-    "                # 打印有相关性的特征和相关值\n",
+    "                # Print the correlated features and the correlation value\n",
     "                # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n",
     "                drop_cols.append(col.values[0])\n",
     "\n",
-    "    # 删除每对相关列中的一个\n",
+    "    # Drop one of each pair of correlated columns\n",
     "    drops = set(drop_cols)\n",
     "    x = x.drop(columns = drops)\n",
     "    x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n",
@@ -3070,7 +3091,7 @@
     "                          'log_Water Use (All Water Sources) (kgal)',\n",
     "                          'Largest Property Use Type - Gross Floor Area (ft²)'])\n",
     "    \n",
-    "    # 将得分添加回数据\n",
+    "    # Add the score back in to the data\n",
     "    x['score'] = y\n",
     "               \n",
     "    return x"
@@ -3090,7 +3111,7 @@
     }
    ],
    "source": [
-    "# 删除大于指定相关系数的共线特征\n",
+    "# Remove the collinear features above a specified correlation coefficient\n",
     "features = remove_collinear_features(features, 0.6);"
    ]
   },
@@ -3111,7 +3132,7 @@
     }
    ],
    "source": [
-    "# 删除所有 na 值的列\n",
+    "# Remove any columns with all nan values\n",
     "features  = features.dropna(axis=1, how = 'all')\n",
     "features.shape"
    ]
@@ -3120,7 +3141,31 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### 数据集划分"
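+    "The dataset now has 64 columns (one of which is the target). That is still quite a few, mainly because we one-hot encoded the categorical variables. Moreover, while a large number of features can be a problem for models such as linear regression, models such as random forests perform implicit feature selection and automatically determine which features are important during training. There are other feature selection steps we could take, but for now we keep all of the features and see how the model performs."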
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
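+    "**Additional Feature Selection**\n",
+    "\n",
+    "There are many other feature selection methods. Popular ones include principal component analysis (PCA), which transforms the features into fewer dimensions that preserve the greatest variance, and independent component analysis (ICA), which aims to find the independent sources in a set of features. However, while these methods are effective at reducing the number of features, they create new features with no physical meaning, which makes interpreting the model nearly impossible. In practice, it is rare to see them applied with an actual improvement in results."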
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
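+    "### Splitting the Dataset\n",
+    "\n",
+    "In machine learning, we always need to split our data into two sets: a training set and a testing set (and possibly a validation set as well).\n",
+    "\n",
+    "We use the testing set to evaluate the mapping the model has learned. The model never sees the answers for the testing set and must make predictions using only the features. We then compare the predictions on the testing set with the true targets to estimate how the model will perform once deployed.\n",
+    "\n",
+    "For our problem, we first extract all the buildings without an Energy Star Score (we do not know the true answer for these buildings, so they are of no use for training or testing). Then we split the buildings that do have an Energy Star Score into a 30% testing set and a 70% training set.\n",
+    "\n",
+    "Splitting the data into random training and testing sets is simple with scikit-learn. We can set the random state of the split to ensure consistent results."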
    ]
   },
   {
@@ -3138,7 +3183,7 @@
     }
    ],
    "source": [
-    "# 提取没有得分的建筑物和带有得分的建筑物\n",
+    "# Extract the buildings with no score and the buildings with a score\n",
     "no_score = features[features['score'].isna()]\n",
     "score = features[features['score'].notnull()]\n",
     "\n",
@@ -3163,15 +3208,14 @@
     }
    ],
    "source": [
-    "# 将特征和目标分离开\n",
-    "features = score.drop(columns = 'score')\n",
+    "# Separate out the features and targets\n",
+    "features = score.drop(columns='score')\n",
     "targets = pd.DataFrame(score['score'])\n",
     "\n",
-    "# 用 nan 替换 inf and -inf （required for later imputation）\n",
+    "# Replace the inf and -inf with nan (required for later imputation)\n",
     "features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n",
     "\n",
-    "# 按照 7：3 的比例划分训练集和测试集\n",
-    "\n",
+    "# Split into 70% training and 30% testing set\n",
     "X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n",
     "\n",
     "print(X.shape)\n",