|
|
|
@ -5,14 +5,21 @@
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 建筑指标数据\n",
|
|
|
|
|
"目标:对每个建筑的能源利用率评分,1-100之间,回归任务"
|
|
|
|
|
"[来源](https://github.com/WillKoehrsen/machine-learning-project-walkthrough)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### 项目目标\n",
|
|
|
|
|
"* 确定能源之星评分数据集中的预测因素。\n",
|
|
|
|
|
"* 使用提供的建筑能源数据开发模型,并预测建筑物的能源之星的得分(0-100的连续值)。\n",
|
|
|
|
|
"* 解释模型结果。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"基于项目目标,我们需要做的是一个回归模型。"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 工作流程\n",
|
|
|
|
|
"## 机器学习——工作流程\n",
|
|
|
|
|
"1. 数据清洗与格式转换\n",
|
|
|
|
|
"2. 探索性数据分析\n",
|
|
|
|
|
"3. 特征工程\n",
|
|
|
|
@ -22,16 +29,16 @@
|
|
|
|
|
"7. 解释模型\n",
|
|
|
|
|
"8. 提交答案\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"这些过程并不是完全的从头到尾,可能在4的时候发现1的数据清洗有问题,再回来做1"
|
|
|
|
|
"这些过程并不是严格的从头到尾,可能在4建立模型时,发现1的数据清洗有问题,再回来做1,该项目包含3个notebook"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 导入所需的基本工具包\n",
|
|
|
|
|
"### 导入所需的基本工具包\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"有些默认参数可以设置"
|
|
|
|
|
"可设置默认参数"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
@ -40,13 +47,18 @@
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"# 操作数据必备包\n",
|
|
|
|
|
"import pandas as pd\n",
|
|
|
|
|
"import numpy as np\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"pd.options.mode.chained_assignment = None # 消除警告,比如说提示版本升级之类的\n",
|
|
|
|
|
"# 消除警告,比如说提示版本升级之类的\n",
|
|
|
|
|
"import warnings\n",
|
|
|
|
|
"warnings.simplefilter('ignore')\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"pd.set_option('display.max_columns', 60) # 设置最大显示列为60\n",
|
|
|
|
|
"# 设置最大显示列为60,还有max_rows则是设置最大列\n",
|
|
|
|
|
"pd.set_option('display.max_columns', 60)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# # Matplotlib 可视化\n",
|
|
|
|
|
"import matplotlib.pyplot as plt\n",
|
|
|
|
|
"%matplotlib inline\n",
|
|
|
|
|
"\n",
|
|
|
|
@ -64,7 +76,8 @@
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"## 数据清洗"
|
|
|
|
|
"## 数据清洗\n",
|
|
|
|
|
"[pandas](https://pandas.pydata.org/pandas-docs/stable/)读取数据"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
@ -722,11 +735,37 @@
|
|
|
|
|
"data.head() # display top of dataframe"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": 4,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [
|
|
|
|
|
{
|
|
|
|
|
"data": {
|
|
|
|
|
"text/plain": [
|
|
|
|
|
"(11746, 60)"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
"execution_count": 4,
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"output_type": "execute_result"
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"source": [
|
|
|
|
|
"data.shape"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"数据具体情况在数据文件夹下的pdf里"
|
|
|
|
|
"数据集共有60列,我们并不知道这些列的具体意思,虽然机器学习中,我们可以不用理解列,只需要放进去让模型告诉我们哪个重要。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"但在实际场景中,如果想要模型效果有更好的提升,就不可以避免的要对某个别列做相应处理,或者交叉特征,甚至需要向业务人员解释。\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"具体列信息可以参考data目录下的2016_nyc_benchmarking_data_disclosure_definitions.pdf,其中**ENERGY STAR Score是标签列(0-100的评分)**\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"而ENERGY STAR Score的评分方法也很简单,建筑持有者提供自我报告的能源使用情况,根据这些提供的自我报告数据来评分排名。"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
@ -825,7 +864,8 @@
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"上面都是non-null,不一定是没有缺失值(np.nan),可能是缺失值的标记符号不一样,查看上面的数据,中间有很大部分是Not Available,所以Not Available应该就是缺失值"
|
|
|
|
|
"* 大部分数据被记录为object,在处理前必须转换为数值型,如float。\n",
|
|
|
|
|
"* 上面都是non-null表示没有缺失值,但不一定真的没有缺失值(np.nan),可能是缺失值的标记符号不一样,查看上面的数据,中间有很大部分是Not Available,Not Available应该就是缺失值。"
|
|
|
|
|
]
|
|
|
|
|
},
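The two observations in the cell above (object-typed columns and a `Not Available` placeholder) suggest a conversion step. What follows is only a minimal sketch of one way to do it, not necessarily what the notebook does: the toy frame `toy` and the `numeric_cols` list are hypothetical stand-ins for the real dataset and its numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real 60-column dataset
toy = pd.DataFrame({
    "ENERGY STAR Score": ["75", "Not Available", "42"],
    "Site EUI": ["80.1", "120.5", "Not Available"],
    "Borough": ["Manhattan", "Not Available", "Queens"],
})

# Turn the textual placeholder into a real missing value
toy = toy.replace("Not Available", np.nan)

# Assumption: we know which columns should be numeric
numeric_cols = ["ENERGY STAR Score", "Site EUI"]
for col in numeric_cols:
    # errors="coerce" maps anything unparseable to NaN instead of raising
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

After this step the numeric columns are `float64`, and `info()` would report fewer non-null entries for any column that contained the placeholder.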
|
|
|
|
|
{
|
|
|
|
@ -1477,6 +1517,13 @@
|
|
|
|
|
"data.describe()"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"现在能看到部分列的count没有达到11746个,即表明其中有NaN值。"
|
|
|
|
|
]
|
|
|
|
|
},
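To make the gap between `count` and the row total explicit, missing values can be tallied per column. This is a small sketch on a hypothetical toy frame (`toy`, with made-up column names `score`, `area`, and `id`), not on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real dataset has 11746 rows
toy = pd.DataFrame({
    "score": [75.0, np.nan, 42.0, np.nan],
    "area": [1000.0, 2500.0, np.nan, 400.0],
    "id": [1, 2, 3, 4],
})

# Number and percentage of missing values per column
missing_count = toy.isna().sum()
missing_pct = 100 * missing_count / len(toy)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
summary = summary.sort_values("percent", ascending=False)
print(summary)

# describe() counts only non-missing entries, so count + missing == number of rows
print(toy.describe().loc["count"])
```

Sorting by percentage puts the columns with the most missing data first, which is a convenient starting point for deciding what to drop or impute.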
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {},
|
|
|
|
|