Add. Outlier Removal / Smooth

master
benjas 4 years ago
parent c3b408ee69
commit 008e2d5689

@ -243,7 +243,7 @@
},
{
"cell_type": "markdown",
"id": "0ef70ec2",
"id": "f72a6efa",
"metadata": {},
"source": [
"## 分类特征\n",
@ -253,7 +253,7 @@
{
"cell_type": "code",
"execution_count": 17,
"id": "38d54603",
"id": "224ba994",
"metadata": {},
"outputs": [
{
@ -280,7 +280,7 @@
},
{
"cell_type": "markdown",
"id": "7cca3ea7",
"id": "ba369f7a",
"metadata": {},
"source": [
"## Splitting\n",
@ -291,7 +291,7 @@
},
{
"cell_type": "markdown",
"id": "735d477e",
"id": "6512d8e2",
"metadata": {},
"source": [
"## 组合/转化/交互\n",
@ -301,7 +301,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "f79c747e",
"id": "7e1bbabe",
"metadata": {},
"outputs": [],
"source": [
@ -310,7 +310,7 @@
},
{
"cell_type": "markdown",
"id": "12240f96",
"id": "f195a2c7",
"metadata": {},
"source": [
"这有助于LGBM将card1和card2一起去与目标关联并不会在树节点分裂他们。\n",
@ -321,7 +321,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "baed14ad",
"id": "8f2bea13",
"metadata": {},
"outputs": [],
"source": [
@ -330,7 +330,7 @@
},
{
"cell_type": "markdown",
"id": "56db7969",
"id": "a8ab1bcb",
"metadata": {},
"source": [
"## 频率编码\n",
@ -340,7 +340,7 @@
{
"cell_type": "code",
"execution_count": 19,
"id": "30ff5b0b",
"id": "87cca857",
"metadata": {},
"outputs": [
{
@ -420,7 +420,7 @@
},
{
"cell_type": "markdown",
"id": "13d14185",
"id": "7986b33c",
"metadata": {},
"source": [
"## 聚合/组统计\n",
@ -432,7 +432,7 @@
{
"cell_type": "code",
"execution_count": 20,
"id": "373291cf",
"id": "e8f7106e",
"metadata": {},
"outputs": [
{
@ -518,7 +518,7 @@
},
{
"cell_type": "markdown",
"id": "e638bb48",
"id": "9da30631",
"metadata": {},
"source": [
"此处的功能向每一行添加color_counts该行color组的平均值。因此LGBM 现在可以判断color_counts对它们的color组是否为极少数的部分。"
@ -526,7 +526,7 @@
},
{
"cell_type": "markdown",
"id": "90d66547",
"id": "790eb030",
"metadata": {},
"source": [
"## 标准化\n",
@ -536,7 +536,7 @@
{
"cell_type": "code",
"execution_count": 22,
"id": "b60781c7",
"id": "14474d08",
"metadata": {},
"outputs": [
{
@ -611,7 +611,7 @@
},
{
"cell_type": "markdown",
"id": "5e7c1b67",
"id": "cdfb237a",
"metadata": {},
"source": [
"或者你可以针对一列标准化另一列。例如如果你创建一个组统计数据如上所述来指示D3每周的平均值。然后你可以通过"
@ -620,7 +620,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "19520699",
"id": "d426927e",
"metadata": {},
"outputs": [],
"source": [
@ -629,7 +629,7 @@
},
{
"cell_type": "markdown",
"id": "28892593",
"id": "78fc4991",
"metadata": {},
"source": [
"D3_remove_time随着时间的推移新变量不再增加因为我们已经针对时间的影响对其进行了标准化。"
@ -637,11 +637,22 @@
},
{
"cell_type": "markdown",
"id": "ecf0eb32",
"id": "da699903",
"metadata": {},
"source": [
"## 离群值去除/平滑"
"## 离群值去除/平滑\n",
"通常,你希望从数据中删除异常,因为它们会混淆你的模型。然而,在风控等比赛中,我们想要发现异常,所以要谨慎使用平滑技术。\n",
"\n",
"这些方法背后的想法是确定和删除不常见的值。例如,通过使用变量的频率编码,你可以删除所有出现小于 0.1% 的值,方法是将它们替换为 -9999 之类的新值(请注意,您应该使用与 NAN 使用的值不同的值)。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19087e59",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
