From 008e2d5689b676331f7e023bc673248410326ce7 Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Mon, 30 Aug 2021 14:30:08 +0800
Subject: [PATCH] Add. Outlier Removal / Smooth

---
 ...re Engineering Techniques-checkpoint.ipynb | 49 ++++++++++++-------
 .../Feature Engineering Techniques.ipynb      | 49 ++++++++++++-------
 2 files changed, 60 insertions(+), 38 deletions(-)

diff --git a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
index 7fb59ae..f78b3b7 100644
--- a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
+++ b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
@@ -243,7 +243,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0ef70ec2",
+   "id": "f72a6efa",
    "metadata": {},
    "source": [
     "## Categorical Features\n",
@@ -253,7 +253,7 @@
   {
    "cell_type": "code",
    "execution_count": 17,
-   "id": "38d54603",
+   "id": "224ba994",
    "metadata": {},
    "outputs": [
     {
@@ -280,7 +280,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7cca3ea7",
+   "id": "ba369f7a",
    "metadata": {},
    "source": [
     "## Splitting\n",
@@ -291,7 +291,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "735d477e",
+   "id": "6512d8e2",
    "metadata": {},
    "source": [
     "## Combining / Transforming / Interaction\n",
@@ -301,7 +301,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "f79c747e",
+   "id": "7e1bbabe",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -310,7 +310,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "12240f96",
+   "id": "f195a2c7",
    "metadata": {},
    "source": [
     "This helps LGBM relate card1 and card2 to the target together, without splitting on them separately at tree nodes.\n",
@@ -321,7 +321,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "baed14ad",
+   "id": "8f2bea13",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -330,7 +330,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "56db7969",
+   "id": "a8ab1bcb",
    "metadata": {},
    "source": [
     "## Frequency Encoding\n",
@@ -340,7 +340,7 @@
   {
    "cell_type": "code",
    "execution_count": 19,
-   "id": "30ff5b0b",
+   "id": "87cca857",
    "metadata": {},
    "outputs": [
     {
@@ -420,7 +420,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "13d14185",
+   "id": "7986b33c",
    "metadata": {},
    "source": [
     "## Aggregations / Group Statistics\n",
@@ -432,7 +432,7 @@
   {
    "cell_type": "code",
    "execution_count": 20,
-   "id": "373291cf",
+   "id": "e8f7106e",
    "metadata": {},
    "outputs": [
     {
@@ -518,7 +518,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e638bb48",
+   "id": "9da30631",
    "metadata": {},
    "source": [
     "The feature here adds to each row the mean color_counts of that row's color group. LGBM can therefore tell whether a row's color_counts is an uncommon value for its color group."
@@ -526,7 +526,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "90d66547",
+   "id": "790eb030",
    "metadata": {},
    "source": [
     "## Standardization\n",
@@ -536,7 +536,7 @@
   {
    "cell_type": "code",
    "execution_count": 22,
-   "id": "b60781c7",
+   "id": "14474d08",
    "metadata": {},
    "outputs": [
     {
@@ -611,7 +611,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5e7c1b67",
+   "id": "cdfb237a",
    "metadata": {},
    "source": [
     "Alternatively, you can standardize one column against another. For example, if you create a group statistic (as described above) that gives the weekly mean of D3, you can then standardize D3 against it with"
@@ -620,7 +620,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "19520699",
+   "id": "d426927e",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -629,7 +629,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "28892593",
+   "id": "78fc4991",
    "metadata": {},
    "source": [
     "The new variable D3_remove_time no longer increases over time, because we have standardized it against the effect of time."
@@ -637,11 +637,22 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ecf0eb32",
+   "id": "da699903",
    "metadata": {},
    "source": [
-    "## Outlier Removal / Smoothing"
+    "## Outlier Removal / Smoothing\n",
+    "Normally you want to remove anomalies from your data because they confuse your model. In competitions such as risk control, however, we want to find anomalies, so use smoothing techniques with caution.\n",
+    "\n",
+    "The idea behind these methods is to identify and remove uncommon values. For example, using a variable's frequency encoding, you can remove every value that appears in fewer than 0.1% of rows by replacing it with a new value such as -9999 (note that this should differ from the value you use for NaN)."
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "19087e59",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
diff --git a/竞赛优胜技巧/Feature Engineering Techniques.ipynb b/竞赛优胜技巧/Feature Engineering Techniques.ipynb
index 7fb59ae..f78b3b7 100644
--- a/竞赛优胜技巧/Feature Engineering Techniques.ipynb
+++ b/竞赛优胜技巧/Feature Engineering Techniques.ipynb
@@ -243,7 +243,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0ef70ec2",
+   "id": "f72a6efa",
    "metadata": {},
    "source": [
     "## Categorical Features\n",
@@ -253,7 +253,7 @@
   {
    "cell_type": "code",
    "execution_count": 17,
-   "id": "38d54603",
+   "id": "224ba994",
    "metadata": {},
    "outputs": [
     {
@@ -280,7 +280,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7cca3ea7",
+   "id": "ba369f7a",
    "metadata": {},
    "source": [
     "## Splitting\n",
@@ -291,7 +291,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "735d477e",
+   "id": "6512d8e2",
    "metadata": {},
    "source": [
     "## Combining / Transforming / Interaction\n",
@@ -301,7 +301,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "f79c747e",
+   "id": "7e1bbabe",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -310,7 +310,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "12240f96",
+   "id": "f195a2c7",
    "metadata": {},
    "source": [
     "This helps LGBM relate card1 and card2 to the target together, without splitting on them separately at tree nodes.\n",
@@ -321,7 +321,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "baed14ad",
+   "id": "8f2bea13",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -330,7 +330,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "56db7969",
+   "id": "a8ab1bcb",
    "metadata": {},
    "source": [
     "## Frequency Encoding\n",
@@ -340,7 +340,7 @@
   {
    "cell_type": "code",
    "execution_count": 19,
-   "id": "30ff5b0b",
+   "id": "87cca857",
    "metadata": {},
    "outputs": [
     {
@@ -420,7 +420,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "13d14185",
+   "id": "7986b33c",
    "metadata": {},
    "source": [
     "## Aggregations / Group Statistics\n",
@@ -432,7 +432,7 @@
   {
    "cell_type": "code",
    "execution_count": 20,
-   "id": "373291cf",
+   "id": "e8f7106e",
    "metadata": {},
    "outputs": [
     {
@@ -518,7 +518,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e638bb48",
+   "id": "9da30631",
    "metadata": {},
    "source": [
     "The feature here adds to each row the mean color_counts of that row's color group. LGBM can therefore tell whether a row's color_counts is an uncommon value for its color group."
@@ -526,7 +526,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "90d66547",
+   "id": "790eb030",
    "metadata": {},
    "source": [
     "## Standardization\n",
@@ -536,7 +536,7 @@
   {
    "cell_type": "code",
    "execution_count": 22,
-   "id": "b60781c7",
+   "id": "14474d08",
    "metadata": {},
    "outputs": [
     {
@@ -611,7 +611,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5e7c1b67",
+   "id": "cdfb237a",
    "metadata": {},
    "source": [
     "Alternatively, you can standardize one column against another. For example, if you create a group statistic (as described above) that gives the weekly mean of D3, you can then standardize D3 against it with"
@@ -620,7 +620,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "19520699",
+   "id": "d426927e",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -629,7 +629,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "28892593",
+   "id": "78fc4991",
    "metadata": {},
    "source": [
     "The new variable D3_remove_time no longer increases over time, because we have standardized it against the effect of time."
@@ -637,11 +637,22 @@
   },
   {
    "cell_type": "markdown",
-   "id": "ecf0eb32",
+   "id": "da699903",
    "metadata": {},
    "source": [
-    "## Outlier Removal / Smoothing"
+    "## Outlier Removal / Smoothing\n",
+    "Normally you want to remove anomalies from your data because they confuse your model. In competitions such as risk control, however, we want to find anomalies, so use smoothing techniques with caution.\n",
+    "\n",
+    "The idea behind these methods is to identify and remove uncommon values. For example, using a variable's frequency encoding, you can remove every value that appears in fewer than 0.1% of rows by replacing it with a new value such as -9999 (note that this should differ from the value you use for NaN)."
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "19087e59",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {
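
The patch adds an empty code cell after the outlier-removal text. As a rough illustration of the technique that text describes (frequency-encode a column, then replace values seen in fewer than 0.1% of rows with a sentinel), a minimal pandas sketch might look like the following. This is not the notebook author's code: the helper name `replace_rare_values`, its parameters, and the toy data are assumptions made for the example; only the 0.1% threshold and the -9999 sentinel come from the text.

```python
import numpy as np
import pandas as pd


def replace_rare_values(series, min_freq=0.001, rare_value=-9999):
    """Replace values whose relative frequency is below min_freq.

    Hypothetical helper for illustration; not part of the notebook.
    """
    # Frequency encoding: map each value to the share of non-missing rows
    # it occupies. value_counts() drops NaN by default, so NaN maps to NaN.
    freq = series.map(series.value_counts(normalize=True))
    # Comparisons against NaN are False, so missing values stay untouched
    # and remain distinguishable from the rare-value sentinel.
    return series.mask(freq < min_freq, rare_value)


# Toy data (assumed): value 7 appears once among 1999 non-missing rows,
# which is below the 0.1% threshold; the last row is missing.
values = pd.Series([1] * 1200 + [2] * 798 + [7] + [np.nan])
cleaned = replace_rare_values(values)
```

Keeping NaN out of the replacement follows the note in the text that the sentinel should differ from whatever value is used for missing data. In an anomaly-detection setting the text's caution applies: rare values may be exactly the signal, so the threshold deserves validation rather than a fixed default.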