@@ -21,7 +21,7 @@
 "id": "5eb53e03",
 "metadata": {},
 "source": [
-"## About Encoding\n",
+"### About Encoding\n",
 "When encoding a feature, it is best to encode the training and test sets together so that both share the same mapping, as shown below:"
 ]
 },
@@ -43,7 +43,7 @@
 "id": "83412ecb",
 "metadata": {},
 "source": [
-"## NaN Processing\n",
+"### NaN Processing\n",
 "If you pass np.nan to LGBM, then at each tree node split it splits on the non-NaN values and sends all NaNs to either the left or the right child, whichever gives the better split.\n",
 "\n",
 "As a result, NaNs get special treatment at every node, which can lead to overfitting.\n",
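@@ -74,7 +74,7 @@
 "id": "31c076fc",
 "metadata": {},
 "source": [
-"## Label Encoding / Factorizing / Memory Reduction\n",
+"### Label Encoding / Factorizing / Memory Reduction\n",
 "Label encoding (factorizing) converts a (string, category, object) column to integers. It is similar to pd.get_dummies(), except that it produces a single column: for a column with dozens of distinct values, pd.get_dummies() would create dozens of columns and make the data sparser."
 ]
 },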
@@ -246,7 +246,7 @@
 "id": "f72a6efa",
 "metadata": {},
 "source": [
-"## Categorical Features\n",
+"### Categorical Features\n",
 "For categorical variables, you can either tell LGBM that they are categorical (at the cost of more memory) or have LGBM treat them as numeric (in which case they must be label encoded first)."
 ]
 },
@@ -283,7 +283,7 @@
 "id": "ba369f7a",
 "metadata": {},
 "source": [
-"## Splitting\n",
+"### Splitting\n",
 "A single (string or numeric) column can be split into two columns.\n",
 "\n",
 "For example, a string column such as id_30, with values like \"Mac OS X 10_9_5\", can be split into the operating system \"Mac OS X\" and the version \"10_9_5\". Likewise, a number such as \"1230.45\" can be split into the yuan part \"1230\" and the fen part \"45\". LGBM cannot see these pieces on its own, so you need to split them out."
@@ -294,7 +294,7 @@
 "id": "6512d8e2",
 "metadata": {},
 "source": [
-"## Combining / Transforming / Interaction\n",
+"### Combining / Transforming / Interaction\n",
 "Two (string or numeric) columns can be combined into one column. For example, card1 and card2 can be combined into a new column."
 ]
 },
@@ -333,7 +333,7 @@
 "id": "a8ab1bcb",
 "metadata": {},
 "source": [
-"## Frequency Encoding\n",
+"### Frequency Encoding\n",
 "Frequency encoding is a powerful technique that lets LGBM see whether a column value is rare or common. For example, if you want LGBM to “see” which colors are used infrequently, try:"
 ]
 },
@@ -423,7 +423,7 @@
 "id": "7986b33c",
 "metadata": {},
 "source": [
-"## Aggregations / Group Statistics\n",
+"### Aggregations / Group Statistics\n",
 "Giving LGBM group statistics lets it determine whether a value is common or rare for a particular group.\n",
 "\n",
 "You compute group statistics by giving pandas three things: the group, the variable of interest, and the type of statistic. For example:"
@@ -529,7 +529,7 @@
 "id": "790eb030",
 "metadata": {},
 "source": [
-"## Standardization\n",
+"### Standardization\n",
 "A column can be standardized against itself. For example:"
 ]
 },
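@@ -640,19 +640,11 @@
 "id": "da699903",
 "metadata": {},
 "source": [
-"## Outlier Removal / Smoothing\n",
+"### Outlier Removal / Smoothing\n",
 "Normally you want to remove anomalies from your data because they confuse your model. However, in competitions such as risk control, the goal is to find anomalies, so use smoothing techniques with caution.\n",
 "\n",
 "The idea behind these methods is to identify and remove uncommon values. For example, using a variable's frequency encoding, you can remove all values that appear less than 0.1% of the time by replacing them with a new value such as -9999 (note that this should be different from the value you use for NaN)."
 ]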
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "19087e59",
-"metadata": {},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {