Add comment of Feature Engineering and Selection

pull/2/head
benjas 5 years ago
parent 64705d565b
commit f65f03d2ef

@@ -2991,7 +2991,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### 共线" "### 去除共线特征\n",
"在数据集中Weather Normalized Site EUI (kBtu/ft²和Site EUI (kBtu/ft²)高度相关,因为它们只是计算能源使用强度的方法略有不同。"
] ]
}, },
{ {
@@ -3021,6 +3022,24 @@
" 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);" " 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);"
] ]
}, },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一般高度关联的特征我们是去除的,只保留一个为模型提供必要信息。\n",
"\n",
"去除共线特征是一种通过减少特征数量来降低模型复杂度的方法有助于提高模型的泛化能力。它还可以帮助我们解释模型因为我们只需要担心单一变量比如EUI而不是Weather Normalized Site EUI (kBtu/ft²和Site EUI (kBtu/ft²)如何影响得分。\n",
"\n",
"除了这个还有很多方法,如:[方差膨胀因子/系数](http://www.statisticshowto.com/variance-inflation-factor/),这里我们将使用更简单的方法,并删除相关系数高于某个阈值的特征(不是与分数相关,是两变量之间的相关,我们需要与分数高度相关的变量!)。关于删除共线变量的更彻底的讨论,可以参考这个[this notebook on Kaggle](https://www.kaggle.com/robertoruiz/dealing-with-multicollinearity/code)."
]
},
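For reference, a variance inflation factor check can be sketched in a few lines. This is a minimal, hedged example using `statsmodels` (not part of the original notebook); the commented-out usage on the notebook's `features` DataFrame is an assumption.

```python
# Minimal VIF sketch using statsmodels (assumed installed); illustrative only.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(df: pd.DataFrame) -> pd.Series:
    """Return the variance inflation factor for every column of df."""
    return pd.Series(
        [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
        index=df.columns,
    )

# Hypothetical usage on this notebook's `features` (numeric, NaN-free slice);
# VIF values above roughly 5-10 are commonly flagged as collinear.
# vif = compute_vif(features.select_dtypes('number').dropna())
# print(vif.sort_values(ascending=False).head())
```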
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面的通过比较两个特征基于我们为相关系数选择的阈值来删除共线特征。它还打印出它去除的相关性这样我们就可以看到调整阈值的效果。如果特征之间的相关系数超过这个值我们将使用0.6的阈值来删除一对特征中的一个。"
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 27, "execution_count": 27,
@@ -3030,25 +3049,27 @@
"def remove_collinear_features(x, threshold):\n", "def remove_collinear_features(x, threshold):\n",
" '''\n", " '''\n",
" Objective:\n", " Objective:\n",
" 删除数据帧中相关系数大于阈值的共线特征。 删除共线特征可以帮助模型泛化并提高模型的可解释性。\n", " Remove collinear features in a dataframe with a correlation coefficient\n",
" greater than the threshold. Removing collinear features can help a model\n",
" to generalize and improves the interpretability of the model.\n",
" \n", " \n",
" Inputs: \n", " Inputs: \n",
" 阈值:删除任何相关性大于此值的特征\n", " threshold: any features with correlations greater than this value are removed\n",
" \n", " \n",
" Output: \n", " Output: \n",
" 仅包含非高共线特征的数据帧\n", " dataframe that contains only the non-highly-collinear features\n",
" '''\n", " '''\n",
" \n", " \n",
" # 不要删除能源之星得分之间的相关性\n", " # Dont want to remove correlations between Energy Star Score\n",
" y = x['score']\n", " y = x['score']\n",
" x = x.drop(columns = ['score'])\n", " x = x.drop(columns = ['score'])\n",
" \n", " \n",
" # 计算相关性矩阵\n", " # Calculate the correlation matrix\n",
" corr_matrix = x.corr()\n", " corr_matrix = x.corr()\n",
" iters = range(len(corr_matrix.columns) - 1)\n", " iters = range(len(corr_matrix.columns) - 1)\n",
" drop_cols = []\n", " drop_cols = []\n",
"\n", "\n",
" # 迭代相关性矩阵并比较相关性\n", " # Iterate through the correlation matrix and compare correlations\n",
" for i in iters:\n", " for i in iters:\n",
" for j in range(i):\n", " for j in range(i):\n",
" item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n", " item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n",
@@ -3056,13 +3077,13 @@
" row = item.index\n", " row = item.index\n",
" val = abs(item.values)\n", " val = abs(item.values)\n",
" \n", " \n",
" # 如果相关性超过阈值\n", " # If correlation exceeds the threshold\n",
" if val >= threshold:\n", " if val >= threshold:\n",
" # 打印有相关性的特征和相关值\n", " # Print the correlated features and the correlation value\n",
" # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n", " # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n",
" drop_cols.append(col.values[0])\n", " drop_cols.append(col.values[0])\n",
"\n", "\n",
" # 删除每对相关列中的一个\n", " # Drop one of each pair of correlated columns\n",
" drops = set(drop_cols)\n", " drops = set(drop_cols)\n",
" x = x.drop(columns = drops)\n", " x = x.drop(columns = drops)\n",
" x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n", " x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n",
@@ -3070,7 +3091,7 @@
" 'log_Water Use (All Water Sources) (kgal)',\n", " 'log_Water Use (All Water Sources) (kgal)',\n",
" 'Largest Property Use Type - Gross Floor Area (ft²)'])\n", " 'Largest Property Use Type - Gross Floor Area (ft²)'])\n",
" \n", " \n",
" # 将得分添加回数据\n", " # Add the score back in to the data\n",
" x['score'] = y\n", " x['score'] = y\n",
" \n", " \n",
" return x" " return x"
@@ -3090,7 +3111,7 @@
} }
], ],
"source": [ "source": [
"# 删除大于指定相关系数的共线特征\n", "# Remove the collinear features above a specified correlation coefficient\n",
"features = remove_collinear_features(features, 0.6);" "features = remove_collinear_features(features, 0.6);"
] ]
}, },
@@ -3111,7 +3132,7 @@
} }
], ],
"source": [ "source": [
"# 删除所有 na 值的列\n", "# Remove any columns with all nan values\n",
"features = features.dropna(axis=1, how = 'all')\n", "features = features.dropna(axis=1, how = 'all')\n",
"features.shape" "features.shape"
] ]
@@ -3120,7 +3141,31 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### 数据集划分" "现在数据集有64个特性(其中一个列是target)。这仍然是相当多但主要是因为我们有一个One-Hot的分类变量。此外虽然大量的特征对于线性回归等模型可能存在问题但随机森林等模型执行隐式特征选择并自动确定哪些特征在训练过程中是重要的。还有其他的特性选择步骤但是现在我们将保留我们所有的特性看看模型是如何执行的。"
]
},
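To make the implicit-feature-selection point concrete, here is a minimal sketch (not from the original notebook) showing how a fitted random forest exposes per-feature importances; the data and names are synthetic.

```python
# Minimal sketch of a random forest's implicit feature selection
# (synthetic data; the notebook's real training happens in a later section).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)),
                      columns=[f'feat_{i}' for i in range(5)])
# Only feat_0 and feat_1 actually drive the target
y_demo = 3 * X_demo['feat_0'] - 2 * X_demo['feat_1'] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# The informative features receive nearly all of the importance mass
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```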
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**附加的特征选择**\n",
"\n",
"有很多的特征选择方法,常用的方法有主成分分析(PCA),它将特征保持最大方差的减少,以降低维数,或独立成分分析(ICA),其目的是在一组特征中找到独立的源。然而,虽然这些方法有效地减少了特性的数量,但是它们创建了没有物理意义的新特性,从而使得解释模型几乎是不可能的。在实际场景中,很少有遇到这么做且效果提升的。"
]
},
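As a reference point for the PCA option mentioned above, here is a minimal, self-contained sketch with scikit-learn on synthetic data (an illustration, not the notebook's pipeline):

```python
# Minimal PCA sketch (scikit-learn) on synthetic data; illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 10))  # stand-in for a numeric feature matrix

# PCA is variance-based, so features should be put on a common scale first
X_scaled = StandardScaler().fit_transform(X_demo)

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# The new columns are linear mixtures of the originals: compact, but with
# no physical meaning, which is why interpretation becomes difficult.
print(X_reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```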
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数据集划分\n",
"\n",
"在机器学习中,我们总是需要将我们的特征分为两组:即训练集和预测集(可能还会多一个验证集)。\n",
"\n",
"我们使用测试集来评估模型学习到的映射。模型从未在测试集上看到答案,但必须在仅使用特征不知道答案的情况下进行预测。然后将测试集的预测与真实目标进行比较,从而估计出我们的模型在实际开上线时的性能。\n",
"\n",
"对于我们的问题我们将首先提取所有没有能源之星分数的建筑数据我们不知道这些建筑的真实答案因此它们对训练或测试没有帮助。然后我们将具有能源之星分数的建筑数据分成30%的测试集和70%的训练集。\n",
"\n",
"使用scikit learn将数据分成随机的训练和测试集很简单。我们可以设置拆分的随机状态以确保结果一致。"
] ]
}, },
{ {
@@ -3138,7 +3183,7 @@
} }
], ],
"source": [ "source": [
"# 提取没有得分的建筑物和带有得分的建筑物\n", "# Extract the buildings with no score and the buildings with a score\n",
"no_score = features[features['score'].isna()]\n", "no_score = features[features['score'].isna()]\n",
"score = features[features['score'].notnull()]\n", "score = features[features['score'].notnull()]\n",
"\n", "\n",
@@ -3163,15 +3208,14 @@
} }
], ],
"source": [ "source": [
"# 将特征和目标分离开\n", "# Separate out the features and targets\n",
"features = score.drop(columns='score')\n", "features = score.drop(columns='score')\n",
"targets = pd.DataFrame(score['score'])\n", "targets = pd.DataFrame(score['score'])\n",
"\n", "\n",
"# 用 nan 替换 inf and -inf required for later imputation\n", "# Replace the inf and -inf with nan (required for later imputation)\n",
"features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n", "features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n",
"\n", "\n",
"# 按照 73 的比例划分训练集和测试集\n", "# Split into 70% training and 30% testing set\n",
"\n",
"X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n", "X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n",
"\n", "\n",
"print(X.shape)\n", "print(X.shape)\n",
