diff --git a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
index 0333905..563ea58 100644
--- a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
+++ b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
@@ -243,7 +243,7 @@
},
{
"cell_type": "markdown",
- "id": "ac6cd644",
+ "id": "f5102024",
"metadata": {},
"source": [
"## Categorical Features\n",
@@ -252,8 +252,8 @@
},
{
"cell_type": "code",
- "execution_count": 10,
- "id": "e3285fce",
+ "execution_count": 17,
+ "id": "65eeb045",
"metadata": {},
"outputs": [
{
@@ -280,7 +280,7 @@
},
{
"cell_type": "markdown",
- "id": "d5016f4c",
+ "id": "34da34b3",
"metadata": {},
"source": [
"## Splitting\n",
@@ -291,7 +291,7 @@
},
{
"cell_type": "markdown",
- "id": "66f3fb03",
+ "id": "8c42b4ed",
"metadata": {},
"source": [
"## Combining / Transforming / Interacting\n",
@@ -301,16 +301,16 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "7a267704",
+ "id": "1df43cfd",
"metadata": {},
"outputs": [],
"source": [
- "df['uid'] = df[‘card1’].astype(str)+’_’+df[‘card2’].astype(str)"
+ "df['uid'] = df['card1'].astype(str)+'_'+df['card2'].astype(str)"
]
},
{
"cell_type": "markdown",
- "id": "7e41e460",
+ "id": "054ee902",
"metadata": {},
"source": [
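"This helps LGBM relate `card1` and `card2` to the target as a single combined feature, rather than splitting on each of them separately at tree nodes.\n",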
@@ -321,7 +321,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "15b54354",
+ "id": "4cf0ee0f",
"metadata": {},
"outputs": [],
"source": [
@@ -330,7 +330,7 @@
},
{
"cell_type": "markdown",
- "id": "e38268bf",
+ "id": "7d60d0b6",
"metadata": {},
"source": [
"## Frequency Encoding\n",
@@ -339,8 +339,8 @@
},
{
"cell_type": "code",
- "execution_count": 12,
- "id": "4f6983bd",
+ "execution_count": 19,
+ "id": "bb167930",
"metadata": {},
"outputs": [
{
@@ -407,7 +407,7 @@
"4 0 2"
]
},
- "execution_count": 12,
+ "execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
@@ -418,10 +418,116 @@
"df"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "48229c86",
+ "metadata": {},
+ "source": [
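+ "## Aggregations / Group Statistics\n",
+ "Giving LGBM group statistics lets it determine whether a value is common or rare for a particular group.\n",
+ "\n",
+ "Group statistics are computed by giving pandas three things: the group, the variable of interest, and the type of statistic. For example:"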
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "76380f6f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>color</th><th>color_counts</th><th>color_counts_sum</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>0</td><td>2</td><td>4</td></tr>\n",
+ " <tr><th>1</th><td>1</td><td>2</td><td>4</td></tr>\n",
+ " <tr><th>2</th><td>2</td><td>1</td><td>1</td></tr>\n",
+ " <tr><th>3</th><td>1</td><td>2</td><td>4</td></tr>\n",
+ " <tr><th>4</th><td>0</td><td>2</td><td>4</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " color color_counts color_counts_sum\n",
+ "0 0 2 4\n",
+ "1 1 2 4\n",
+ "2 2 1 1\n",
+ "3 1 2 4\n",
+ "4 0 2 4"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "temp = df.groupby('color')['color_counts'].agg(['mean']).rename({'mean':'color_counts_mean'},axis=1)\n",
+ "df = pd.merge(df,temp,on='color',how='left')\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fd72f933",
+ "metadata": {},
+ "source": [
+ "This feature adds to each row the mean of `color_counts` for that row's `color` group. LGBM can then tell whether a row's `color_counts` value is unusual for its `color` group."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
- "id": "26f55eeb",
+ "id": "9ed09035",
"metadata": {},
"outputs": [],
"source": []
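Review note on the `uid` cell above: the quote fix makes the line valid Python. A minimal sketch of the combined feature follows, assuming toy values for `card1` and `card2` (the values are hypothetical; only the column names come from the notebook). The `pd.factorize` step is an assumption added for illustration, in case the string column needs integer codes before it is passed to LGBM.

```python
import pandas as pd

# Hypothetical toy values; only the column names card1/card2 come from the notebook.
df = pd.DataFrame({
    "card1": [1000, 1000, 2000, 2000],
    "card2": [111, 222, 111, 111],
})

# Combine the two columns into one interaction feature, as in the notebook cell.
df["uid"] = df["card1"].astype(str) + "_" + df["card2"].astype(str)

# Assumption (not in the notebook): integer-encode the string feature.
codes, uniques = pd.factorize(df["uid"])
df["uid_code"] = codes
```

Rows that share both `card1` and `card2` get the same `uid`, so a single split on the combined feature can isolate them.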
diff --git a/竞赛优胜技巧/Feature Engineering Techniques.ipynb b/竞赛优胜技巧/Feature Engineering Techniques.ipynb
index 0333905..563ea58 100644
--- a/竞赛优胜技巧/Feature Engineering Techniques.ipynb
+++ b/竞赛优胜技巧/Feature Engineering Techniques.ipynb
@@ -243,7 +243,7 @@
},
{
"cell_type": "markdown",
- "id": "ac6cd644",
+ "id": "f5102024",
"metadata": {},
"source": [
"## Categorical Features\n",
@@ -252,8 +252,8 @@
},
{
"cell_type": "code",
- "execution_count": 10,
- "id": "e3285fce",
+ "execution_count": 17,
+ "id": "65eeb045",
"metadata": {},
"outputs": [
{
@@ -280,7 +280,7 @@
},
{
"cell_type": "markdown",
- "id": "d5016f4c",
+ "id": "34da34b3",
"metadata": {},
"source": [
"## Splitting\n",
@@ -291,7 +291,7 @@
},
{
"cell_type": "markdown",
- "id": "66f3fb03",
+ "id": "8c42b4ed",
"metadata": {},
"source": [
"## Combining / Transforming / Interacting\n",
@@ -301,16 +301,16 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "7a267704",
+ "id": "1df43cfd",
"metadata": {},
"outputs": [],
"source": [
- "df['uid'] = df[‘card1’].astype(str)+’_’+df[‘card2’].astype(str)"
+ "df['uid'] = df['card1'].astype(str)+'_'+df['card2'].astype(str)"
]
},
{
"cell_type": "markdown",
- "id": "7e41e460",
+ "id": "054ee902",
"metadata": {},
"source": [
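"This helps LGBM relate `card1` and `card2` to the target as a single combined feature, rather than splitting on each of them separately at tree nodes.\n",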
@@ -321,7 +321,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "15b54354",
+ "id": "4cf0ee0f",
"metadata": {},
"outputs": [],
"source": [
@@ -330,7 +330,7 @@
},
{
"cell_type": "markdown",
- "id": "e38268bf",
+ "id": "7d60d0b6",
"metadata": {},
"source": [
"## Frequency Encoding\n",
@@ -339,8 +339,8 @@
},
{
"cell_type": "code",
- "execution_count": 12,
- "id": "4f6983bd",
+ "execution_count": 19,
+ "id": "bb167930",
"metadata": {},
"outputs": [
{
@@ -407,7 +407,7 @@
"4 0 2"
]
},
- "execution_count": 12,
+ "execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
@@ -418,10 +418,116 @@
"df"
]
},
+ {
+ "cell_type": "markdown",
+ "id": "48229c86",
+ "metadata": {},
+ "source": [
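+ "## Aggregations / Group Statistics\n",
+ "Giving LGBM group statistics lets it determine whether a value is common or rare for a particular group.\n",
+ "\n",
+ "Group statistics are computed by giving pandas three things: the group, the variable of interest, and the type of statistic. For example:"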
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "76380f6f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>color</th><th>color_counts</th><th>color_counts_sum</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>0</td><td>2</td><td>4</td></tr>\n",
+ " <tr><th>1</th><td>1</td><td>2</td><td>4</td></tr>\n",
+ " <tr><th>2</th><td>2</td><td>1</td><td>1</td></tr>\n",
+ " <tr><th>3</th><td>1</td><td>2</td><td>4</td></tr>\n",
+ " <tr><th>4</th><td>0</td><td>2</td><td>4</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " color color_counts color_counts_sum\n",
+ "0 0 2 4\n",
+ "1 1 2 4\n",
+ "2 2 1 1\n",
+ "3 1 2 4\n",
+ "4 0 2 4"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "temp = df.groupby('color')['color_counts'].agg(['mean']).rename({'mean':'color_counts_mean'},axis=1)\n",
+ "df = pd.merge(df,temp,on='color',how='left')\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fd72f933",
+ "metadata": {},
+ "source": [
+ "This feature adds to each row the mean of `color_counts` for that row's `color` group. LGBM can then tell whether a row's `color_counts` value is unusual for its `color` group."
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,
- "id": "26f55eeb",
+ "id": "9ed09035",
"metadata": {},
"outputs": [],
"source": []
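Review note on the new group-statistics cell: the sketch below reproduces the same groupby/agg/merge pattern on a small frame and compares it with `groupby(...).transform`, which is offered here as an alternative and is not used in the notebook. The toy frame is an assumption; its `color` values were chosen to match the rows shown in the cell output, and `color_counts` is rebuilt with `value_counts` as a stand-in for the notebook's frequency-encoding cell.

```python
import pandas as pd

# Toy frame (assumption): color values chosen to match the rows in the cell output.
df = pd.DataFrame({"color": [0, 1, 2, 1, 0]})

# Frequency encoding: how many times each color occurs.
df["color_counts"] = df["color"].map(df["color"].value_counts())

# Group statistic via groupby + agg + merge, following the notebook cell.
temp = (
    df.groupby("color")["color_counts"]
    .agg(["mean"])
    .rename({"mean": "color_counts_mean"}, axis=1)
)
merged = pd.merge(df, temp, on="color", how="left")

# Alternative (not in the notebook): transform broadcasts the group mean
# back onto every row, so no temporary frame or merge is needed.
df["color_counts_mean"] = df.groupby("color")["color_counts"].transform("mean")
```

Both routes give each row the mean of `color_counts` within its `color` group. The merge form extends naturally to several statistics at once, while `transform` is shorter for a single one.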