Add comment of Correlations between Features and Target

pull/2/head
benjas 5 years ago
parent 2d2162d77d
commit dd12d38563

@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Building Metrics Data\n", "## Building Energy Usage Data\n",
"[Source](https://github.com/WillKoehrsen/machine-learning-project-walkthrough)\n", "[Source](https://github.com/WillKoehrsen/machine-learning-project-walkthrough)\n",
"\n", "\n",
"### Project Goal\n", "### Project Goal\n",
@ -737,7 +737,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 4, "execution_count": 7,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -746,7 +746,7 @@
"(11746, 60)" "(11746, 60)"
] ]
}, },
"execution_count": 4, "execution_count": 7,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -779,7 +779,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 3, "execution_count": 8,
"metadata": { "metadata": {
"scrolled": true "scrolled": true
}, },
@ -870,7 +870,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 4, "execution_count": 9,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -887,7 +887,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 5, "execution_count": 10,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -1507,7 +1507,7 @@
"max 155101.000000 " "max 155101.000000 "
] ]
}, },
"execution_count": 5, "execution_count": 10,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
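@ -1528,13 +1528,13 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Handling Missing Values\n", "### Missing Values\n",
"The fraction of missing values in each column; a function is provided here." "Compute the fraction of missing values in each column; a function is provided here."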
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 6, "execution_count": 11,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -1570,7 +1570,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 7, "execution_count": 12,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -1939,7 +1939,7 @@
"Largest Property Use Type - Gross Floor Area (ft²) 0.0 " "Largest Property Use Type - Gross Floor Area (ft²) 0.0 "
] ]
}, },
"execution_count": 7, "execution_count": 12,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -1948,9 +1948,18 @@
"missing_values_table(data)" "missing_values_table(data)"
] ]
}, },
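{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general we do not want to lose any data, but we also do not want data that hurts the model, so we remove as much meaningless or harmful data as we can. Here I drop the columns with more than 50% missing values.\n",
"\n",
"In a real business scenario we once tried using a feature with a high missing rate. The model's results rose sharply; on the surface it looked like a good feature, and it also had the highest feature importance. That made us suspicious, so we traced the data source and found that a large share of the positive samples had this value while the vast majority of negative samples did not. The model had simply learned that any sample with a value was positive, which is not correct. After we revised the SQL statement and pulled the dataset again, the model results returned to normal."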
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 8, "execution_count": 13,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -1972,7 +1981,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 9, "execution_count": 14,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -1992,7 +2001,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 10, "execution_count": 15,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -2001,7 +2010,7 @@
"Text(0.5, 1.0, 'Energy Star Score Distribution')" "Text(0.5, 1.0, 'Energy Star Score Distribution')"
] ]
}, },
"execution_count": 10, "execution_count": 15,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
}, },
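@ -2034,14 +2043,16 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Scores of 1 and 100 occur quite often. The PDF shows that the usage figures are submitted by individuals for each building; scores that are too high or too low are both suspect, but the score is the label, and we cannot alter the label ourselves.\n", "Scores of 1 and 100 occur far too often. In principle the scores should be fairly evenly distributed, or higher scores should be rarer. As introduced above, building information is submitted by individuals, and individuals may report lower electricity use to inflate their score; the plot above shows that high-scoring buildings are in fact more numerous.\n",
"\n",
"None of this is objective, and we would like a more objective scoring standard.\n",
"\n", "\n",
"Energy Use Intensity (EUI) is the total energy use divided by the building's floor area in square feet. This energy use is not self-reported, so it is a more objective measure of a building's energy efficiency." "Energy Use Intensity (EUI) is the total energy use divided by the building's floor area in square feet. This energy use is not self-reported, so it is a more objective measure of a building's energy efficiency."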
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 11, "execution_count": 16,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
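@ -2069,12 +2080,12 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"All of the values are on the small side, which indicates that there are one or a few very large extreme values." "The vast majority of the values are on the small side; the x-axis would only span this wide a range if there were extreme maxima, which indicates that there are one or a few very large extreme values."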
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 12, "execution_count": 17,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -2091,7 +2102,7 @@
"Name: Site EUI (kBtu/ft²), dtype: float64" "Name: Site EUI (kBtu/ft²), dtype: float64"
] ]
}, },
"execution_count": 12, "execution_count": 17,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -2102,7 +2113,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 13, "execution_count": 18,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -2121,7 +2132,7 @@
"Name: Site EUI (kBtu/ft²), dtype: float64" "Name: Site EUI (kBtu/ft²), dtype: float64"
] ]
}, },
"execution_count": 13, "execution_count": 18,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -2139,7 +2150,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 14, "execution_count": 19,
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -2354,7 +2365,7 @@
"8068 East Williamsburg ... " "8068 East Williamsburg ... "
] ]
}, },
"execution_count": 14, "execution_count": 19,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
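@ -2367,23 +2378,18 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Possible causes of outliers:\n", "Possible causes of anomalous values include input errors, faulty measuring equipment, incorrect units, or the values may be genuine.\n",
"\n",
"* Input error\n",
"* Faulty measuring equipment\n",
"* Incorrect units\n",
"* Or they may be genuine.\n",
"\n", "\n",
"We generally discard outliers." "We generally discard anomalous values because they do not represent the actual state of the data. (In a few scenarios, we do need these anomalous values.)"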
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Removing Outliers\n", "## Removing Anomalous Values\n",
"\n", "\n",
"When handling outliers, do not rely on subjective judgment, which can lose data. In removing [outliers](https://people.richland.edu/james/lecture/m170/ch03-pos.html), be as conservative as possible:\n", "When handling outliers, do not rely on subjective judgment, which can lose data. In removing [anomalous values](https://people.richland.edu/james/lecture/m170/ch03-pos.html), be as conservative as possible:\n",
"\n", "\n",
"* On the low end, values below $\\text{First Quartile} -3 * \\text{Interquartile Range}$\n", "* On the low end, values below $\\text{First Quartile} -3 * \\text{Interquartile Range}$\n",
"* On the high end, values above $\\text{Third Quartile} + 3 * \\text{Interquartile Range}$" "* On the high end, values above $\\text{Third Quartile} + 3 * \\text{Interquartile Range}$"
@ -2391,7 +2397,7 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 15, "execution_count": 20,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -2400,10 +2406,10 @@
"third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']\n", "third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']\n",
"\n", "\n",
"# Interquartile range\n", "# Interquartile range\n",
"iqr = third_quartile -first_quartile\n", "iqr = third_quartile - first_quartile\n",
"\n", "\n",
"# Remove outliers\n", "# Remove outliers\n",
"data = data [(data['Site EUI (kBtu/ft²)']>(first_quartile - 3*iqr)) & (data['Site EUI (kBtu/ft²)']<(third_quartile + 3*iqr))]" "data = data[(data['Site EUI (kBtu/ft²)']>(first_quartile - 3*iqr)) & (data['Site EUI (kBtu/ft²)']<(third_quartile + 3*iqr))]"
] ]
}, },
{ {
@ -2455,7 +2461,7 @@
"# Create a list of buildings with more than 100 measurements\n", "# Create a list of buildings with more than 100 measurements\n",
"types = data.dropna(subset=['score'])\n", "types = data.dropna(subset=['score'])\n",
"Alltypes_num = types['Largest Property Use Type'].value_counts()\n", "Alltypes_num = types['Largest Property Use Type'].value_counts()\n",
"types = list(Alltypes_num[Alltypes_num.values > 100].index) # keep the use types with more than 100 records" "types = list(Alltypes_num[Alltypes_num.values > 100].index)"
] ]
}, },
{ {

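The conservative rule described in the notebook's markdown cell (drop values below the first quartile minus 3 × IQR or above the third quartile plus 3 × IQR) can be sketched outside the notebook as follows. This is only an illustration under stated assumptions: it works on a plain Python list rather than the notebook's `data` DataFrame, the function name `remove_extreme_outliers` and the sample values are hypothetical, and the quartiles come from the standard library rather than from `describe()`.

```python
from statistics import quantiles


def remove_extreme_outliers(values, k=3.0):
    """Keep only values strictly inside (Q1 - k*IQR, Q3 + k*IQR).

    Hypothetical helper; k=3.0 mirrors the factor used in the notebook.
    """
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if lower < v < upper]


# Made-up sample: nine ordinary values and one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
cleaned = remove_extreme_outliers(sample)
print(cleaned)
```

Using a factor of 3 instead of the more common 1.5 keeps the filter conservative, so only values far from the bulk of the distribution are dropped.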