From 004fea8295417aaf8a17590b24ef8695fe76acf5 Mon Sep 17 00:00:00 2001 From: benjas <909336740@qq.com> Date: Mon, 30 Aug 2021 11:50:57 +0800 Subject: [PATCH] Add. Combining / Transforming / Interaction --- ...re Engineering Techniques-checkpoint.ipynb | 47 +++++++++++++++++-- .../Feature Engineering Techniques.ipynb | 47 +++++++++++++++++-- 2 files changed, 86 insertions(+), 8 deletions(-) diff --git a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb index c48a018..e502a2a 100644 --- a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb +++ b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb @@ -243,7 +243,7 @@ }, { "cell_type": "markdown", - "id": "f81f2de6", + "id": "6091bf47", "metadata": {}, "source": [ "## 分类特征\n", @@ -253,7 +253,7 @@ { "cell_type": "code", "execution_count": 10, - "id": "bf7c9e8a", + "id": "6061ae00", "metadata": {}, "outputs": [ { @@ -280,7 +280,7 @@ }, { "cell_type": "markdown", - "id": "3412c82b", + "id": "94a95b95", "metadata": {}, "source": [ "## Splitting\n", @@ -289,10 +289,49 @@ "例如,id_30诸如\"Mac OS X 10_9_5\"之类的字符串列可以拆分为操作系统\"Mac OS X\"和版本\"10_9_5\"。或者例如数字\"1230.45\"可以拆分为元\" 1230\"和分\"45\"。LGBM 无法单独看到这些片段,需要将它们拆分。" ] }, + { + "cell_type": "markdown", + "id": "87e8b887", + "metadata": {}, + "source": [ + "## 组合/转化/交互\n", + "两个(字符串或数字)列可以合并为一列。例如card1,card2可以成为一个新列" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92515211", + "metadata": {}, + "outputs": [], + "source": [ + "df['uid'] = df[‘card1’].astype(str)+’_’+df[‘card2’].astype(str)" + ] + }, + { + "cell_type": "markdown", + "id": "b9c66a13", + "metadata": {}, + "source": [ + "这有助于LGBM将card1和card2一起去与目标关联,并不会在树节点分裂他们。\n", + "\n", + "但这种uid = card1_card2可能与目标相关,现在LGBM会将其拆分。数字列可以与加法、减法、乘法等组合使用。一个数字示例是" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d50f2c15", + "metadata": {}, + "outputs": [], + "source": [ + "df['x1_x2'] = df['x1'] * df['x2']" + ] + }, { "cell_type": "code", "execution_count": null, - "id": "e78b77c4", + "id": "3ce3cb5f", "metadata": {}, "outputs": [], "source": [] diff --git a/竞赛优胜技巧/Feature Engineering Techniques.ipynb b/竞赛优胜技巧/Feature Engineering Techniques.ipynb index c48a018..e502a2a 100644 --- a/竞赛优胜技巧/Feature Engineering Techniques.ipynb +++ b/竞赛优胜技巧/Feature Engineering Techniques.ipynb @@ -243,7 +243,7 @@ }, { "cell_type": "markdown", - "id": "f81f2de6", + "id": "6091bf47", "metadata": {}, "source": [ "## 分类特征\n", @@ -253,7 +253,7 @@ { "cell_type": "code", "execution_count": 10, - "id": "bf7c9e8a", + "id": "6061ae00", "metadata": {}, "outputs": [ { @@ -280,7 +280,7 @@ }, { "cell_type": "markdown", - "id": "3412c82b", + "id": "94a95b95", "metadata": {}, "source": [ "## Splitting\n", @@ -289,10 +289,49 @@ "例如,id_30诸如\"Mac OS X 10_9_5\"之类的字符串列可以拆分为操作系统\"Mac OS X\"和版本\"10_9_5\"。或者例如数字\"1230.45\"可以拆分为元\" 1230\"和分\"45\"。LGBM 无法单独看到这些片段,需要将它们拆分。" ] }, + { + "cell_type": "markdown", + "id": "87e8b887", + "metadata": {}, + "source": [ + "## 组合/转化/交互\n", + "两个(字符串或数字)列可以合并为一列。例如card1,card2可以成为一个新列" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92515211", + "metadata": {}, + "outputs": [], + "source": [ + "df['uid'] = df[‘card1’].astype(str)+’_’+df[‘card2’].astype(str)" + ] + }, + { + "cell_type": "markdown", + "id": "b9c66a13", + "metadata": {}, + "source": [ + "这有助于LGBM将card1和card2一起去与目标关联,并不会在树节点分裂他们。\n", + "\n", + "但这种uid = card1_card2可能与目标相关,现在LGBM会将其拆分。数字列可以与加法、减法、乘法等组合使用。一个数字示例是" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d50f2c15", + "metadata": {}, + "outputs": [], + "source": [ + "df['x1_x2'] = df['x1'] * df['x2']" + ] + }, { "cell_type": "code", "execution_count": null, - "id": "e78b77c4", + "id": "3ce3cb5f", "metadata": {}, "outputs": [], "source": []