{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 介绍:机器学习项目第三部分\n", "在这个项目中,我们将通过一个完整的机器学习问题来处理真实场景的数据集。利用建筑能源数据,建立一个模型来预测建筑的能源之星的评分,使之成为一个有监督的回归、机器学习任务。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 机器学习——工作流程\n", "1. 数据清洗与格式转换\n", "2. 探索性数据分析\n", "3. 特征工程与选择\n", "4. 建立基础模型,比较多种模型性能指标\n", "5. 模型超参数调参,针对问题进行优化\n", "6. 在测试集上评估最佳模型\n", "7. 尽可能解释模型结果\n", "8. 得出结论,并提交答案\n", "\n", "在这里,我们将专注于最后两个步骤,并尝试窥视我们所构建模型的黑匣子。我们知道它是准确的吗,因为它可以预测能量之星的分数在真实值的9.1分之内,但是它究竟是如何做出预测的呢?我们将研究一些方法来尝试理解GBDT,然后得出结论(data文件夹中有已完成的报告)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 导入工具包" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Pandas and numpy for data manipulation\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# No warnings about setting value on copy of slice\n", "pd.options.mode.chained_assignment = None\n", "pd.set_option('display.max_columns', 60)\n", "\n", "# Matplotlib for visualization\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# Set default font size\n", "plt.rcParams['font.size'] = 24\n", "\n", "from IPython.core.pylabtools import figsize\n", "\n", "# Seaborn for visualization\n", "import seaborn as sns\n", "\n", "sns.set(font_scale = 2)\n", "\n", "# Imputing missing values\n", "from sklearn.preprocessing import Imputer, MinMaxScaler\n", "\n", "# Machine Learning Models\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "\n", "from sklearn import tree\n", "\n", "# LIME for explaining predictions\n", "import lime \n", "import lime.lime_tabular" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Read in data into dataframes \n", "train_features = pd.read_csv('data/training_features.csv')\n", "test_features = pd.read_csv('data/testing_features.csv')\n", "train_labels = pd.read_csv('data/training_labels.csv')\n", "test_labels = pd.read_csv('data/testing_labels.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 重新创建最终模型" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "D:\\Anaconda3\\lib\\site-packages\\sklearn\\utils\\deprecation.py:66: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. 
 { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "D:\\Anaconda3\\lib\\site-packages\\sklearn\\utils\\deprecation.py:66: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.\n", "  warnings.warn(msg, category=DeprecationWarning)\n" ] } ], "source": [ "# Create an imputer object with a median filling strategy\n", "imputer = Imputer(strategy = 'median')\n", "\n", "# Train on the training features\n", "imputer.fit(train_features)\n", "\n", "# Transform both training data and testing data\n", "X = imputer.transform(train_features)\n", "X_test = imputer.transform(test_features)\n", "\n", "# Convert y to one-dimensional array (vector)\n", "y = np.array(train_labels).reshape((-1, ))\n", "y_test = np.array(test_labels).reshape((-1, ))" ] },
 { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Function to calculate the mean absolute error\n", "def mae(y_true, y_pred):\n", "    return np.mean(abs(y_true - y_pred))" ] },
 { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,\n", "             learning_rate=0.1, loss='lad', max_depth=5,\n", "             max_features=None, max_leaf_nodes=None,\n", "             min_impurity_decrease=0.0, min_impurity_split=None,\n", "             min_samples_leaf=6, min_samples_split=6,\n", "             min_weight_fraction_leaf=0.0, n_estimators=800,\n", "             n_iter_no_change=None, presort='auto',\n", "             random_state=42, subsample=1.0, tol=0.0001,\n", "             validation_fraction=0.1, verbose=0, warm_start=False)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = GradientBoostingRegressor(loss='lad', max_depth=5, max_features=None,\n", "                                  min_samples_leaf=6, min_samples_split=6,\n", "                                  n_estimators=800, random_state=42)\n", "\n", "model.fit(X, y)" ] },
 { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Final Model Performance on the test set: MAE = 9.0839\n" ] } ], "source": [ "# Make predictions on the test set\n", "model_pred = model.predict(X_test)\n", "\n", "print('Final Model Performance on the test set: MAE = %0.4f' % mae(y_test, model_pred))" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting the Model\n", "\n", "Machine learning is often criticized as a [black box](https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/): we put data in one side and it hands us answers out the other. While those answers are usually very accurate, the model tells us nothing about how it actually made its predictions. This is true to some extent, but there are ways to try to discover how a model \"thinks\", such as the [Locally Interpretable Model-agnostic Explainer (LIME)](https://arxiv.org/pdf/1602.04938.pdf), which explains a single prediction by fitting a linear regression around it, and a linear regression is a model that is easy to interpret!\n", "\n", "We will explore several ways of interpreting the model:\n", "* Feature importances\n", "* Locally Interpretable Model-agnostic Explainer (LIME)\n", "* Examining a single decision tree in the ensemble" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Importances\n", "\n", "One of the most basic ways to interpret an ensemble of decision trees is through the feature importances, which can be read as the variables that are most predictive of the target. While the actual details of how the importances are computed are fairly involved (and beyond the scope of this notebook), we can use their relative values to compare features and determine which ones are most relevant to our problem. Extracting the feature importances from a trained ensemble of trees is quite easy; we will store them in a dataframe to analyze and visualize them, as sketched below." ] },
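 { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of how such a feature-importance dataframe can be pulled from the trained model: only `model.feature_importances_` and the training feature names come from the model itself, the `feature`/`importance` column names mirror the table displayed in the next cell, and the sorting and `head(10)` call are assumptions about how that table was produced." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: pull feature importances from the trained GBDT into a dataframe\n", "# (column names mirror the displayed table; sorting and head(10) are assumptions)\n", "feature_importances_sketch = pd.DataFrame({\n", "    'feature': list(train_features.columns),\n", "    'importance': model.feature_importances_\n", "}).sort_values('importance', ascending=False).reset_index(drop=True)\n", "\n", "feature_importances_sketch.head(10)" ] },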
\n", " | feature | \n", "importance | \n", "
---|---|---|
0 | \n", "Site EUI (kBtu/ft²) | \n", "0.452163 | \n", "
1 | \n", "Weather Normalized Site Electricity Intensity ... | \n", "0.249107 | \n", "
2 | \n", "Water Intensity (All Water Sources) (gal/ft²) | \n", "0.056662 | \n", "
3 | \n", "Property Id | \n", "0.031396 | \n", "
4 | \n", "Largest Property Use Type_Non-Refrigerated War... | \n", "0.025153 | \n", "
5 | \n", "DOF Gross Floor Area | \n", "0.025003 | \n", "
6 | \n", "log_Water Intensity (All Water Sources) (gal/ft²) | \n", "0.022335 | \n", "
7 | \n", "Largest Property Use Type_Multifamily Housing | \n", "0.021462 | \n", "
8 | \n", "Order | \n", "0.020169 | \n", "
9 | \n", "log_Direct GHG Emissions (Metric Tons CO2e) | \n", "0.019410 | \n", "