Add. Normalize

master
benjas 3 years ago
parent 44ec4b7795
commit c3b408ee69

@ -243,7 +243,7 @@
},
{
"cell_type": "markdown",
"id": "f5102024",
"id": "0ef70ec2",
"metadata": {},
"source": [
"## Categorical features\n",
@ -253,7 +253,7 @@
{
"cell_type": "code",
"execution_count": 17,
"id": "65eeb045",
"id": "38d54603",
"metadata": {},
"outputs": [
{
@ -280,7 +280,7 @@
},
{
"cell_type": "markdown",
"id": "34da34b3",
"id": "7cca3ea7",
"metadata": {},
"source": [
"## Splitting\n",
@ -291,7 +291,7 @@
},
{
"cell_type": "markdown",
"id": "8c42b4ed",
"id": "735d477e",
"metadata": {},
"source": [
"## Combining / transforming / interacting\n",
@ -301,7 +301,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1df43cfd",
"id": "f79c747e",
"metadata": {},
"outputs": [],
"source": [
@ -310,7 +310,7 @@
},
{
"cell_type": "markdown",
"id": "054ee902",
"id": "12240f96",
"metadata": {},
"source": [
"This helps LGBM relate card1 and card2 to the target jointly, without having to split on each of them separately at tree nodes.\n",
@ -321,7 +321,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "4cf0ee0f",
"id": "baed14ad",
"metadata": {},
"outputs": [],
"source": [
@ -330,7 +330,7 @@
},
{
"cell_type": "markdown",
"id": "7d60d0b6",
"id": "56db7969",
"metadata": {},
"source": [
"## Frequency encoding\n",
@ -340,7 +340,7 @@
{
"cell_type": "code",
"execution_count": 19,
"id": "bb167930",
"id": "30ff5b0b",
"metadata": {},
"outputs": [
{
@ -420,7 +420,7 @@
},
{
"cell_type": "markdown",
"id": "48229c86",
"id": "13d14185",
"metadata": {},
"source": [
"## Aggregations / group statistics\n",
@ -432,7 +432,7 @@
{
"cell_type": "code",
"execution_count": 20,
"id": "76380f6f",
"id": "373291cf",
"metadata": {},
"outputs": [
{
@ -518,19 +518,130 @@
},
{
"cell_type": "markdown",
"id": "fd72f933",
"id": "e638bb48",
"metadata": {},
"source": [
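"The feature here adds to each row the mean of color_counts for that row's color group. LGBM can therefore tell whether a row's color_counts value is unusually rare for its color group."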
]
},
{
"cell_type": "markdown",
"id": "90d66547",
"metadata": {},
"source": [
"## Normalization\n",
"You can normalize a column against itself. For example:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "b60781c7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>color</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.956183</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.239046</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.434274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.239046</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-0.956183</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" color\n",
"0 -0.956183\n",
"1 0.239046\n",
"2 1.434274\n",
"3 0.239046\n",
"4 -0.956183"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(['green','blue','red','blue','green'],columns=['color'])\n",
"df['color'],_ = df['color'].factorize()\n",
"df['color'] = ( df['color']-df['color'].mean() ) / df['color'].std()\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "5e7c1b67",
"metadata": {},
"source": [
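"Or you can normalize one column against another. For example, if you create a group statistic (as described above) that gives the weekly mean of D3, you can then remove the time dependence with:"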
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ed09035",
"id": "19520699",
"metadata": {},
"outputs": [],
"source": []
"source": [
"df['D3_remove_time'] = df['D3'] - df['D3_week_mean']"
]
},
{
"cell_type": "markdown",
"id": "28892593",
"metadata": {},
"source": [
"The new variable D3_remove_time no longer increases over time, because we have normalized it against the effect of time."
]
},
{
"cell_type": "markdown",
"id": "ecf0eb32",
"metadata": {},
"source": [
"## Outlier removal / smoothing"
]
}
],
"metadata": {
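
The second normalization cell in this commit uses a `D3_week_mean` column that the diff does not show being built. A minimal sketch of how both normalizations could fit together, assuming a hypothetical `week` column and made-up D3 values (neither appears in the notebook), and assuming the weekly mean is computed with a pandas group transform:

```python
import pandas as pd

# Hypothetical toy data: the `week` column and the D3 values are invented
# for illustration and are not taken from the notebook.
df = pd.DataFrame({
    "week": [1, 1, 2, 2, 3, 3],
    "D3":   [1.0, 3.0, 11.0, 13.0, 21.0, 23.0],
})

# Group statistic: mean of D3 within each week, broadcast back to every row.
df["D3_week_mean"] = df.groupby("week")["D3"].transform("mean")

# Normalize D3 against its weekly mean, as in the notebook cell.
df["D3_remove_time"] = df["D3"] - df["D3_week_mean"]

# Normalize a column against itself (z-score); Series.std() uses ddof=1.
df["D3_z"] = (df["D3"] - df["D3"].mean()) / df["D3"].std()

print(df)
```

In this toy data D3 climbs by about 10 per week, while `D3_remove_time` stays centered on zero in every week, which is the effect the markdown cell describes.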

