Add comment of Feature Engineering and Selection

pull/2/head
benjas 5 years ago
parent 64705d565b
commit f65f03d2ef

@@ -2991,7 +2991,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### 共线" "### 去除共线特征\n",
"在数据集中Weather Normalized Site EUI (kBtu/ft²和Site EUI (kBtu/ft²)高度相关,因为它们只是计算能源使用强度的方法略有不同。"
] ]
}, },
{ {
@@ -3021,6 +3022,24 @@
" 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);" " 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1]);"
] ]
}, },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"一般高度关联的特征我们是去除的,只保留一个为模型提供必要信息。\n",
"\n",
"去除共线特征是一种通过减少特征数量来降低模型复杂度的方法有助于提高模型的泛化能力。它还可以帮助我们解释模型因为我们只需要担心单一变量比如EUI而不是Weather Normalized Site EUI (kBtu/ft²和Site EUI (kBtu/ft²)如何影响得分。\n",
"\n",
"除了这个还有很多方法,如:[方差膨胀因子/系数](http://www.statisticshowto.com/variance-inflation-factor/),这里我们将使用更简单的方法,并删除相关系数高于某个阈值的特征(不是与分数相关,是两变量之间的相关,我们需要与分数高度相关的变量!)。关于删除共线变量的更彻底的讨论,可以参考这个[this notebook on Kaggle](https://www.kaggle.com/robertoruiz/dealing-with-multicollinearity/code)."
]
},
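For reference, a variance inflation factor check can be sketched in a few lines. This is a minimal, hedged example using `statsmodels` (not part of the original notebook); the commented-out usage on the notebook's `features` DataFrame is an assumption.

```python
# Minimal VIF sketch using statsmodels (assumed installed); illustrative only.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(df: pd.DataFrame) -> pd.Series:
    """Return the variance inflation factor for every column of df."""
    return pd.Series(
        [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
        index=df.columns,
    )

# Hypothetical usage on this notebook's `features` (numeric, NaN-free slice);
# VIF values above roughly 5-10 are commonly flagged as collinear.
# vif = compute_vif(features.select_dtypes('number').dropna())
# print(vif.sort_values(ascending=False).head())
```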
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面的通过比较两个特征基于我们为相关系数选择的阈值来删除共线特征。它还打印出它去除的相关性这样我们就可以看到调整阈值的效果。如果特征之间的相关系数超过这个值我们将使用0.6的阈值来删除一对特征中的一个。"
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 27, "execution_count": 27,
@@ -3030,25 +3049,27 @@
"def remove_collinear_features(x, threshold):\n", "def remove_collinear_features(x, threshold):\n",
" '''\n", " '''\n",
" Objective:\n", " Objective:\n",
" 删除数据帧中相关系数大于阈值的共线特征。 删除共线特征可以帮助模型泛化并提高模型的可解释性。\n", " Remove collinear features in a dataframe with a correlation coefficient\n",
" greater than the threshold. Removing collinear features can help a model\n",
" to generalize and improves the interpretability of the model.\n",
" \n", " \n",
" Inputs: \n", " Inputs: \n",
" 阈值:删除任何相关性大于此值的特征\n", " threshold: any features with correlations greater than this value are removed\n",
" \n", " \n",
" Output: \n", " Output: \n",
" 仅包含非高共线特征的数据帧\n", " dataframe that contains only the non-highly-collinear features\n",
" '''\n", " '''\n",
" \n", " \n",
" # 不要删除能源之星得分之间的相关性\n", " # Dont want to remove correlations between Energy Star Score\n",
" y = x['score']\n", " y = x['score']\n",
" x = x.drop(columns = ['score'])\n", " x = x.drop(columns = ['score'])\n",
" \n", " \n",
" # 计算相关性矩阵\n", " # Calculate the correlation matrix\n",
" corr_matrix = x.corr()\n", " corr_matrix = x.corr()\n",
" iters = range(len(corr_matrix.columns) - 1)\n", " iters = range(len(corr_matrix.columns) - 1)\n",
" drop_cols = []\n", " drop_cols = []\n",
"\n", "\n",
" # 迭代相关性矩阵并比较相关性\n", " # Iterate through the correlation matrix and compare correlations\n",
" for i in iters:\n", " for i in iters:\n",
" for j in range(i):\n", " for j in range(i):\n",
" item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n", " item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]\n",
@@ -3056,13 +3077,13 @@
" row = item.index\n", " row = item.index\n",
" val = abs(item.values)\n", " val = abs(item.values)\n",
" \n", " \n",
" # 如果相关性超过阈值\n", " # If correlation exceeds the threshold\n",
" if val >= threshold:\n", " if val >= threshold:\n",
" # 打印有相关性的特征和相关值\n", " # Print the correlated features and the correlation value\n",
" # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n", " # print(col.values[0], \"|\", row.values[0], \"|\", round(val[0][0], 2))\n",
" drop_cols.append(col.values[0])\n", " drop_cols.append(col.values[0])\n",
"\n", "\n",
" # 删除每对相关列中的一个\n", " # Drop one of each pair of correlated columns\n",
" drops = set(drop_cols)\n", " drops = set(drop_cols)\n",
" x = x.drop(columns = drops)\n", " x = x.drop(columns = drops)\n",
" x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n", " x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', \n",
@@ -3070,7 +3091,7 @@
" 'log_Water Use (All Water Sources) (kgal)',\n", " 'log_Water Use (All Water Sources) (kgal)',\n",
" 'Largest Property Use Type - Gross Floor Area (ft²)'])\n", " 'Largest Property Use Type - Gross Floor Area (ft²)'])\n",
" \n", " \n",
" # 将得分添加回数据\n", " # Add the score back in to the data\n",
" x['score'] = y\n", " x['score'] = y\n",
" \n", " \n",
" return x" " return x"
@@ -3090,7 +3111,7 @@
} }
], ],
"source": [ "source": [
"# 删除大于指定相关系数的共线特征\n", "# Remove the collinear features above a specified correlation coefficient\n",
"features = remove_collinear_features(features, 0.6);" "features = remove_collinear_features(features, 0.6);"
] ]
}, },
@@ -3111,7 +3132,7 @@
} }
], ],
"source": [ "source": [
"# 删除所有 na 值的列\n", "# Remove any columns with all nan values\n",
"features = features.dropna(axis=1, how = 'all')\n", "features = features.dropna(axis=1, how = 'all')\n",
"features.shape" "features.shape"
] ]
@@ -3120,7 +3141,31 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### 数据集划分" "现在数据集有64个特性(其中一个列是target)。这仍然是相当多但主要是因为我们有一个One-Hot的分类变量。此外虽然大量的特征对于线性回归等模型可能存在问题但随机森林等模型执行隐式特征选择并自动确定哪些特征在训练过程中是重要的。还有其他的特性选择步骤但是现在我们将保留我们所有的特性看看模型是如何执行的。"
]
},
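To make the implicit-feature-selection point concrete, here is a minimal sketch (not from the original notebook) showing how a fitted random forest exposes per-feature importances; the data and names are synthetic.

```python
# Minimal sketch of a random forest's implicit feature selection
# (synthetic data; the notebook's real training happens in a later section).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, 5)),
                      columns=[f'feat_{i}' for i in range(5)])
# Only feat_0 and feat_1 actually drive the target
y_demo = 3 * X_demo['feat_0'] - 2 * X_demo['feat_1'] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# The informative features receive nearly all of the importance mass
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```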
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**附加的特征选择**\n",
"\n",
"有很多的特征选择方法,常用的方法有主成分分析(PCA),它将特征保持最大方差的减少,以降低维数,或独立成分分析(ICA),其目的是在一组特征中找到独立的源。然而,虽然这些方法有效地减少了特性的数量,但是它们创建了没有物理意义的新特性,从而使得解释模型几乎是不可能的。在实际场景中,很少有遇到这么做且效果提升的。"
]
},
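As a reference point for the PCA option mentioned above, here is a minimal, self-contained sketch with scikit-learn on synthetic data (an illustration, not the notebook's pipeline):

```python
# Minimal PCA sketch (scikit-learn) on synthetic data; illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 10))  # stand-in for a numeric feature matrix

# PCA is variance-based, so features should be put on a common scale first
X_scaled = StandardScaler().fit_transform(X_demo)

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# The new columns are linear mixtures of the originals: compact, but with
# no physical meaning, which is why interpretation becomes difficult.
print(X_reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```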
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 数据集划分\n",
"\n",
"在机器学习中,我们总是需要将我们的特征分为两组:即训练集和预测集(可能还会多一个验证集)。\n",
"\n",
"我们使用测试集来评估模型学习到的映射。模型从未在测试集上看到答案,但必须在仅使用特征不知道答案的情况下进行预测。然后将测试集的预测与真实目标进行比较,从而估计出我们的模型在实际开上线时的性能。\n",
"\n",
"对于我们的问题我们将首先提取所有没有能源之星分数的建筑数据我们不知道这些建筑的真实答案因此它们对训练或测试没有帮助。然后我们将具有能源之星分数的建筑数据分成30%的测试集和70%的训练集。\n",
"\n",
"使用scikit learn将数据分成随机的训练和测试集很简单。我们可以设置拆分的随机状态以确保结果一致。"
] ]
}, },
{ {
@@ -3138,7 +3183,7 @@
} }
], ],
"source": [ "source": [
"# 提取没有得分的建筑物和带有得分的建筑物\n", "# Extract the buildings with no score and the buildings with a score\n",
"no_score = features[features['score'].isna()]\n", "no_score = features[features['score'].isna()]\n",
"score = features[features['score'].notnull()]\n", "score = features[features['score'].notnull()]\n",
"\n", "\n",
@@ -3163,15 +3208,14 @@
} }
], ],
"source": [ "source": [
"# 将特征和目标分离开\n", "# Separate out the features and targets\n",
"features = score.drop(columns='score')\n", "features = score.drop(columns='score')\n",
"targets = pd.DataFrame(score['score'])\n", "targets = pd.DataFrame(score['score'])\n",
"\n", "\n",
"# 用 nan 替换 inf and -inf required for later imputation\n", "# Replace the inf and -inf with nan (required for later imputation)\n",
"features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n", "features = features.replace({np.inf: np.nan, -np.inf: np.nan})\n",
"\n", "\n",
"# 按照 73 的比例划分训练集和测试集\n", "# Split into 70% training and 30% testing set\n",
"\n",
"X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n", "X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)\n",
"\n", "\n",
"print(X.shape)\n", "print(X.shape)\n",
