Feat: Add performance comparison between XGBoost and LightGBM

pull/2/head
benjas 5 years ago
parent c49584ad85
commit 733e412c1a

@@ -22,7 +22,7 @@
 "7. Interpret the model results wherever possible\n",
 "8. Draw conclusions and submit the answers\n",
 "\n",
-"In the previous notebook we covered 1-3 and part of 4; this notebook focuses on 4-6."
+"In the previous notebook we covered steps 1-3 and part of step 4; this notebook focuses on steps 4-6."
 ]
 },
 {
@@ -34,17 +34,20 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": 3,
 "metadata": {},
 "outputs": [],
 "source": [
 "import pandas as pd\n",
 "import numpy as np\n",
 "\n",
-"pd.options.mode.chained_assignment = None  # silence warnings, e.g. version-upgrade notices\n",
+"# Silence warnings, e.g. version-upgrade notices\n",
+"import warnings\n",
+"warnings.simplefilter('ignore')\n",
 "\n",
 "pd.set_option('display.max_columns', 60)  # show at most 60 columns\n",
 "\n",
+"# Matplotlib visualization\n",
 "import matplotlib.pyplot as plt\n",
 "%matplotlib inline\n",
 "\n",
@@ -68,9 +71,17 @@
 "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Load the data\n",
+"Load the data prepared in the previous notebook."
+]
+},
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 4,
 "metadata": {},
 "outputs": [
 {
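The cell that actually reads the data is truncated in this diff. For orientation only, a minimal sketch of what such a cell presumably does; the file names here are hypothetical, since the real paths are not visible in this commit:

```python
# Hypothetical loading step; the real file names are not shown in this diff.
import pandas as pd

train = pd.read_csv('train_features.csv')  # features prepared in the previous notebook
test = pd.read_csv('test_features.csv')
train.head()  # the hunk below previews the resulting 5 rows x 64 columns
```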
@@ -100,7 +111,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": 5,
 "metadata": {},
 "outputs": [
 {
@@ -860,7 +871,7 @@
 "[5 rows x 64 columns]"
 ]
 },
-"execution_count": 4,
+"execution_count": 5,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -875,7 +886,11 @@
 "source": [
 "### Missing-value imputation\n",
 "\n",
-"We use sklearn's Imputer object to fill missing values; the test set is filled with the statistics from the dataset. Avoid using test-set data to process the test set as far as possible, since at the start we would not know it anyway; see [Data Leakage](https://www.kaggle.com/dansbecker/data-leakage)."
+"In general we recommend filling missing values before they enter the model. Admittedly, XGBoost, LightGBM and similar libraries handle missing values natively: XGBoost, for example, treats them as a sparse matrix and ignores their actual values when splitting a node; the missing samples are assigned to the left and to the right subtree in turn, the loss is computed for each, and the better direction is chosen. Generally speaking, though, we still need to handle missing values ourselves.\n",
+"\n",
+"In the previous notebook we simply dropped features with a missing rate above 50%. Here we focus on how to fill in the remaining missing values; many imputation methods exist, and we use simple median imputation ([a deeper discussion of missing-value imputation](http://www.stat.columbia.edu/~gelman/arm/missing.pdf)).\n",
+"\n",
+"We use sklearn's Imputer object to fill missing values; the test set is filled with the statistics fitted on the training set. Avoid using test-set data to process the test set as far as possible, since at the start we would not know it anyway; otherwise we cause data \"leakage\" (see [Data Leakage](https://www.kaggle.com/dansbecker/data-leakage))."
 ]
 },
 {
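The fit-on-train-only rule above is the key point. A minimal sketch, assuming `train` and `test` hold the feature DataFrames; note that the notebook's older `Imputer` object has been replaced by `SimpleImputer` in current scikit-learn:

```python
# Median imputation fitted on the training set only, to avoid data leakage.
# `train` and `test` are assumed names for the feature DataFrames.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputer.fit(train)                  # medians are computed from the training set
X_train = imputer.transform(train)  # the same medians fill both sets,
X_test = imputer.transform(test)    # so no test-set statistic leaks in
```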
@@ -938,7 +953,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Feature standardization and normalization"
+"## Feature scaling\n",
+"\n",
+"The last step before building models is to scale the features. This is necessary because the features are in different units, and we want to normalize them so that the units do not affect the algorithm (see the [car-price case in the regression-analysis chapter](https://github.com/ben1234560/AiLearning-Theory-Applying/blob/master/notebook_%E5%BF%85%E5%A4%87%E6%95%B0%E5%AD%A6%E5%9F%BA%E7%A1%80/%E5%9B%9E%E5%BD%92%E5%88%86%E6%9E%90%E7%AB%A0%E8%8A%82/%E6%A1%88%E4%BE%8B%EF%BC%9A%E6%B1%BD%E8%BD%A6%E4%BB%B7%E6%A0%BC%E9%A2%84%E6%B5%8B%E4%BB%BB%E5%8A%A1.ipynb)).\n",
+"\n",
+"Linear regression and random forests do not need feature scaling, but other methods such as support vector machines and k-nearest neighbors do, because they consider the Euclidean distance between observations. So when we compare multiple algorithms, it is best to scale the features.\n",
+"\n",
+"There are two ways to scale:\n",
+"* For each value, subtract the feature's mean and divide by its standard deviation. This is called standardization; each feature ends up with mean 0 and standard deviation 1.\n",
+"* For each value, subtract the feature's minimum and divide by its maximum minus its minimum (the range). This ensures all values of a feature fall between 0 and 1, and is called scaling to a range, or normalization.\n",
+"\n",
+"As with imputation, when we fit the scaler we only want to fit it on the training set; when transforming, we transform both the training and the test set."
 ]
 },
 {
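A minimal sketch of the range (min-max) variant described above, again fitted only on the training set; `X_train`/`X_test` are assumed to be the imputed arrays from the previous step:

```python
# Min-max scaling to [0, 1], fitted on the training set only.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X_train)                  # min and range come from the training set
X_train = scaler.transform(X_train)  # every feature now lies in [0, 1]
X_test = scaler.transform(X_test)    # the same min/range is reused here
```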
@@ -974,18 +999,18 @@
 "metadata": {},
 "source": [
 "### Chosen machine-learning algorithms (a regression problem)\n",
-" 1. Linear Regression\n",
-" 2. Support Vector Machine Regression\n",
-" 3. Random Forest Regression\n",
-" 4. Gradient Boosting Regression\n",
-" 5. K-Nearest Neighbors Regression\n",
+" 1. Linear Regression 线性回归\n",
+" 2. Support Vector Machine Regression 支持向量机回归\n",
+" 3. Random Forest Regression 随机森林回归\n",
+" 4. Gradient Boosting Regression GBDT回归\n",
+" 5. K-Nearest Neighbors Regression K近邻回归\n",
 "\n",
 "We use default parameters first and tune them later."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 12,
+"execution_count": 11,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -1009,7 +1034,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 13,
+"execution_count": 12,
 "metadata": {},
 "outputs": [
 {
@@ -1029,7 +1054,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 14,
+"execution_count": 13,
 "metadata": {},
 "outputs": [
 {
@@ -1049,17 +1074,9 @@
 },
 {
 "cell_type": "code",
-"execution_count": 15,
+"execution_count": 14,
 "metadata": {},
 "outputs": [
-{
-"name": "stderr",
-"output_type": "stream",
-"text": [
-"D:\\Anaconda3\\lib\\site-packages\\sklearn\\ensemble\\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.\n",
-"  \"10 in version 0.20 to 100 in 0.22.\", FutureWarning)\n"
-]
-},
 {
 "name": "stdout",
 "output_type": "stream",
@@ -1077,7 +1094,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 16,
+"execution_count": 15,
 "metadata": {},
 "outputs": [
 {
@@ -1097,7 +1114,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 17,
+"execution_count": 16,
 "metadata": {},
 "outputs": [
 {
@@ -1161,7 +1178,60 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The ensemble algorithm performs better. Because only default parameters are used here, this is somewhat unfair to models such as SVM whose results depend heavily on their parameters."
+"The ensemble algorithm (GBDT) performs better. Because only default parameters are used here, this is somewhat unfair to models such as SVM whose results depend heavily on their parameters.\n",
+"\n",
+"Of course, from these results we can conclude that machine learning is applicable, because all the models significantly beat the baseline!\n",
+"\n",
+"We have not used XGBoost/LightGBM yet. At the level of the algorithmic formulas, XGBoost/LightGBM improve on GBDT: they are optimizations and refinements built on GB and GBDT. Below we also give XGBoost and LightGBM a quick try. Here, GBDT serves as the representative boosting algorithm."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 21,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"[15:35:29] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.\n",
+"XGBoost Regression Performance on the test set: MAE = 9.9936\n"
+]
+}
+],
+"source": [
+"import xgboost as xgb\n",
+"xgb_model = xgb.XGBRegressor()\n",
+"xgb_mae = fit_and_evaluate(xgb_model)\n",
+"\n",
+"print('XGBoost Regression Performance on the test set: MAE = %0.4f' % xgb_mae)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 22,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"lightgbm Regression Performance on the test set: MAE = 9.3751\n"
+]
+}
+],
+"source": [
+"import lightgbm as lgb\n",
+"lgb_model = lgb.LGBMRegressor()\n",
+"lgb_mae = fit_and_evaluate(lgb_model)\n",
+"print('lightgbm Regression Performance on the test set: MAE = %0.4f' % lgb_mae)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can see that XGBoost and LightGBM are clearly better than GBDT. Since both arrived quite a while after random forests, though, this comparison is obviously unfair, so we will stick with GBDT as the model. In real projects we recommend XGBoost and LightGBM: use whichever works better."
+]
+},
+{
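Since the closing markdown stresses that comparing models under default parameters is unfair, a fairer follow-up would score GBDT, XGBoost and LightGBM on identical cross-validation folds. A minimal sketch, assuming the `X_train`/`y_train` arrays from the cells above:

```python
# Same folds for every model, so the comparison is apples to apples.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb

folds = KFold(n_splits=5, shuffle=True, random_state=42)
for name, model in [('GBDT', GradientBoostingRegressor()),
                    ('XGBoost', xgb.XGBRegressor()),
                    ('LightGBM', lgb.LGBMRegressor())]:
    scores = cross_val_score(model, X_train, y_train, cv=folds,
                             scoring='neg_mean_absolute_error')
    print('%s CV MAE = %0.4f' % (name, -scores.mean()))
```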
