@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building Metrics Data\n",
"## Building Energy Use Data\n",
"[Source](https://github.com/WillKoehrsen/machine-learning-project-walkthrough)\n",
"\n",
"### Project Goals\n",
@@ -737,7 +737,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -746,7 +746,7 @@
"(11746, 60)"
]
},
"execution_count": 4,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
@@ -779,7 +779,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 8,
"metadata": {
"scrolled": true
},
@@ -870,7 +870,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
@@ -887,7 +887,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 10,
"metadata": {},
"outputs": [
{
@@ -1507,7 +1507,7 @@
"max 155101.000000 "
]
},
"execution_count": 5,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
@@ -1528,13 +1528,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling Missing Values\n",
"The fraction of missing values in each column; a function is provided here."
"### Missing Values\n",
"Compute the fraction of missing values in each column; a function is provided here for that."
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
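The body of the notebook's `missing_values_table` function is not visible in this diff. As a minimal sketch of what such a helper could look like, assuming it only needs per-column missing counts and percentages, the version below is an illustration and not the author's implementation; the column labels and the `demo` frame are made up for the example.

```python
import numpy as np
import pandas as pd


def missing_values_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first.

    Hypothetical re-implementation: the notebook's own function body
    is not shown in the diff.
    """
    missing_count = df.isnull().sum()
    missing_percent = 100 * missing_count / len(df)
    table = pd.DataFrame(
        {"Missing Values": missing_count, "% of Total Values": missing_percent}
    )
    # Keep only the columns that actually have missing entries
    table = table[table["Missing Values"] > 0]
    return table.sort_values("% of Total Values", ascending=False).round(1)


# Toy data, invented for illustration only
demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [1, 2, 3, 4],
        "c": [np.nan, 1.0, 2.0, 3.0],
    }
)
result = missing_values_table(demo)
print(result)
```

Columns without missing values (here `b`) are left out of the table so that the output stays focused on the columns that need a decision.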
@@ -1570,7 +1570,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@@ -1939,7 +1939,7 @@
"Largest Property Use Type - Gross Floor Area (ft²) 0.0 "
]
},
"execution_count": 7,
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
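@@ -1948,9 +1948,18 @@
"missing_values_table(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general we do not want to discard any data, but we also do not want data that hurts the model, so we remove as much uninformative or harmful data as we can. Here I drop the columns with more than 50% missing values.\n",
"\n",
"A real business case: we once tried a feature with a high missing rate, and the model's metrics jumped sharply. On the surface this looked good, and the feature even ranked first in feature importance. That made us suspicious, so we traced the data back to its source and found that a large share of the positive samples had a value for this feature while the vast majority of the negative samples did not. The model had simply learned that a present value means a positive sample, which is wrong. After we fixed the SQL query and pulled the dataset again, the model's results returned to normal."
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 13,
"metadata": {},
"outputs": [
{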
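The 50% rule described in the markdown cell above can be sketched as follows. This is an illustration under assumptions: the helper name `drop_sparse_columns`, its `threshold` parameter, and the toy frame are hypothetical, and the notebook's actual cell is not visible in this diff.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    # isnull().mean() gives the fraction of missing entries per column
    missing_fraction = df.isnull().mean()
    to_drop = missing_fraction[missing_fraction > threshold].index
    return df.drop(columns=to_drop)


# Toy data, invented for illustration only
demo = pd.DataFrame(
    {
        "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
        "half_missing": [np.nan, np.nan, 1.0, 2.0],       # exactly 50% missing
        "complete": [1, 2, 3, 4],
    }
)
cleaned = drop_sparse_columns(demo)
print(list(cleaned.columns))
```

The comparison is strict, so a column that is exactly 50% missing is kept. Whether the boundary should be inclusive is a judgment call the text does not settle.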
@@ -1972,7 +1981,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
@@ -1992,7 +2001,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 15,
"metadata": {},
"outputs": [
{
@@ -2001,7 +2010,7 @@
"Text(0.5, 1.0, 'Energy Star Score Distribution')"
]
},
"execution_count": 10,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
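@@ -2034,14 +2043,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Scores of 1 and 100 are noticeably frequent. According to the PDF, the scores are derived from usage figures that individuals submit for each building, so both the very high and the very low values are suspect. The score is the label, however, and we cannot modify labels by hand.\n",
"Scores of 1 and 100 are far too frequent. One would expect the scores to be spread fairly evenly, or to become rarer as they get higher. As described above, the building information is self-reported, and individuals may under-report electricity use to inflate their score; the plot above indeed shows that high-scoring buildings are the more numerous.\n",
"\n",
"None of this is objective, and we would like a more objective measure.\n",
"\n",
"Energy Use Intensity (EUI) is the total energy use divided by the building's floor area in square feet. This energy use is not self-reported, so it is a more objective measure of a building's energy efficiency."
]
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 16,
"metadata": {},
"outputs": [
{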
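To make the EUI definition concrete, here is a small worked example. The function name `site_eui` and the input numbers are hypothetical and chosen only for illustration; in the dataset the value is already provided in the `Site EUI (kBtu/ft²)` column.

```python
def site_eui(total_energy_kbtu: float, floor_area_ft2: float) -> float:
    """Energy Use Intensity: total energy use divided by floor area."""
    if floor_area_ft2 <= 0:
        raise ValueError("floor area must be positive")
    return total_energy_kbtu / floor_area_ft2


# A made-up building using 4,000,000 kBtu over 50,000 ft²
example_eui = site_eui(4_000_000, 50_000)
print(example_eui)  # 80.0 kBtu/ft²
```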
@@ -2069,12 +2080,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the values appear small, which indicates that there are one or a very few extremely large values."
"The vast majority of the values appear small. The x-axis would only span such a wide range if extremely large values were present, which indicates that there are one or a very few extreme values."
]
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 17,
"metadata": {},
"outputs": [
{
@@ -2091,7 +2102,7 @@
"Name: Site EUI (kBtu/ft²), dtype: float64"
]
},
"execution_count": 12,
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
@@ -2102,7 +2113,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 18,
"metadata": {},
"outputs": [
{
@@ -2121,7 +2132,7 @@
"Name: Site EUI (kBtu/ft²), dtype: float64"
]
},
"execution_count": 13,
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
@@ -2139,7 +2150,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -2354,7 +2365,7 @@
"8068 East Williamsburg ... "
]
},
"execution_count": 14,
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
@@ -2367,23 +2378,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Possible causes of outliers:\n",
"\n",
"* Data entry errors\n",
"* Faulty measurement equipment\n",
"* Incorrect units\n",
"* Or they may be genuine values.\n",
"Anomalous values can come from data entry errors, faulty measurement equipment, or incorrect units, or they may be genuine.\n",
"\n",
"We generally discard outliers."
"We generally discard anomalous values because they do not reflect the true state of the data. (In a few scenarios, these are exactly the values we need.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing Outliers\n",
"## Removing Anomalous Values\n",
"\n",
"When handling outliers, avoid subjective judgments that needlessly throw away data. This is how to remove [outliers](https://people.richland.edu/james/lecture/m170/ch03-pos.html), being as conservative as possible:\n",
"When handling outliers, avoid subjective judgments that needlessly throw away data. This is how to remove [anomalous values](https://people.richland.edu/james/lecture/m170/ch03-pos.html), being as conservative as possible:\n",
"\n",
"* On the low end, values below $\\text{First Quartile} - 3 * \\text{Interquartile Range}$\n",
"* On the high end, values above $\\text{Third Quartile} + 3 * \\text{Interquartile Range}$"
@@ -2391,7 +2397,7 @@
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
@@ -2400,10 +2406,10 @@
"third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']\n",
"\n",
"# Interquartile range\n",
"iqr = third_quartile -first_quartile\n",
"iqr = third_quartile - first_quartile\n",
"\n",
"# Remove outliers\n",
"data = data [(data['Site EUI (kBtu/ft²)']>(first_quartile - 3*iqr)) & (data['Site EUI (kBtu/ft²)']<(third_quartile + 3*iqr))]"
"data = data[(data['Site EUI (kBtu/ft²)']>(first_quartile - 3*iqr)) & (data['Site EUI (kBtu/ft²)']<(third_quartile + 3*iqr))]"
]
},
{
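The same quartile-based filter can be wrapped in a reusable helper, which makes the thresholds easier to inspect. The sketch below assumes a generic numeric column; the function name `remove_extreme_outliers`, the multiplier parameter `k`, and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd


def remove_extreme_outliers(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    """Keep rows whose value lies strictly inside [Q1 - k*IQR, Q3 + k*IQR]."""
    first_quartile = df[column].quantile(0.25)
    third_quartile = df[column].quantile(0.75)
    iqr = third_quartile - first_quartile
    lower = first_quartile - k * iqr
    upper = third_quartile + k * iqr
    # Comparisons with NaN are False, so rows with a missing value are dropped too
    mask = (df[column] > lower) & (df[column] < upper)
    return df[mask]


# Toy data with one extreme value, invented for illustration only
demo = pd.DataFrame({"eui": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 10000.0]})
filtered = remove_extreme_outliers(demo, "eui")
print(filtered["eui"].max())  # 100.0
```

With `k = 3` the rule is deliberately conservative compared with the more common `1.5 * IQR` fence, so only extreme values are removed. Note that, as in the notebook's cell, rows where the column is missing are also filtered out as a side effect.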
@@ -2455,7 +2461,7 @@
"# Create a list of buildings with more than 100 measurements\n",
"types = data.dropna(subset=['score'])\n",
"Alltypes_num = types['Largest Property Use Type'].value_counts()\n",
"types = list(Alltypes_num[Alltypes_num.values > 100].index) # keep the types with more than 100 records"
"types = list(Alltypes_num[Alltypes_num.values > 100].index)"
]
},
{
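The `value_counts` pattern in the cell above can be tried on a small frame. In this sketch the column name `use_type`, the toy rows, and the lowered threshold (`min_count = 1` instead of 100) are assumptions made only so that the example is small enough to read.

```python
import pandas as pd

# Toy data, invented for illustration only
demo = pd.DataFrame(
    {
        "use_type": ["Office"] * 4 + ["Hotel"] * 2 + ["Library"],
        "score": [80, 55, None, 70, 65, 90, 40],
    }
)

min_count = 1  # the notebook uses 100 on the real dataset

# Only rows with a score count towards a type's number of measurements
scored = demo.dropna(subset=["score"])
counts = scored["use_type"].value_counts()

# Keep the types that occur more than `min_count` times
common_types = list(counts[counts > min_count].index)
print(common_types)  # ['Office', 'Hotel']
```

Dropping the rows without a score first matters: a type with many buildings but few scored ones would otherwise pass the threshold and then produce a nearly empty plot.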