Add. Categorical Features

master
benjas 4 years ago
parent 272052520e
commit 3174f5a5cf

@ -2,7 +2,7 @@
"cells": [ "cells": [
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "278c7a1e", "id": "8d942947",
"metadata": {}, "metadata": {},
"source": [ "source": [
"# Feature Engineering Techniques" "# Feature Engineering Techniques"
@ -10,7 +10,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "67f256b4", "id": "d08d515b",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Adapted from https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575" "Adapted from https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"
@ -18,7 +18,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "5a28bcf6", "id": "5eb53e03",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## About Encoding\n", "## About Encoding\n",
@ -28,7 +28,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"id": "c0edffa6", "id": "eb00d32d",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -40,7 +40,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "3bd8a464", "id": "83412ecb",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## NaN Processing\n", "## NaN Processing\n",
@ -54,7 +54,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"id": "e2c552c7", "id": "8b093d6f",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -63,7 +63,7 @@
}, },
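{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "fe85c377", "id": "978c9dc6",
"metadata": {}, "metadata": {},
"source": [ "source": [
"This way LGBM no longer gives NaN special treatment; it treats it like any other number. Try both approaches and see which one gives the highest CV score." "This way LGBM no longer gives NaN special treatment; it treats it like any other number. Try both approaches and see which one gives the highest CV score."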
@ -71,7 +71,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "05e77c5a", "id": "31c076fc",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Label Encoding / Factorizing / Memory Reduction\n", "## Label Encoding / Factorizing / Memory Reduction\n",
@ -81,7 +81,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 14, "execution_count": 14,
"id": "554159aa", "id": "ceef72c3",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -157,7 +157,7 @@
}, },
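{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "e5bf12a9", "id": "eca60e6f",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Afterwards, the column can be converted to int8, int16, or int32 to reduce memory, depending on whether its max is less than 128, less than 32768, or neither." "Afterwards, the column can be converted to int8, int16, or int32 to reduce memory, depending on whether its max is less than 128, less than 32768, or neither."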
@ -166,7 +166,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 21, "execution_count": 21,
"id": "863fee6f", "id": "fcd6f4e3",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -196,7 +196,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 22, "execution_count": 22,
"id": "1a6bac81", "id": "a40af2b8",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -221,7 +221,7 @@
}, },
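{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "0951f3c7", "id": "3728adee",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In addition, to reduce memory, people apply the popular memory_reduce function to the other columns.\n", "In addition, to reduce memory, people apply the popular memory_reduce function to the other columns.\n",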
@ -232,7 +232,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 23, "execution_count": 23,
"id": "88368fc6", "id": "03948c52",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -244,7 +244,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"id": "1ecd48ce", "id": "bb624a66",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [] "source": []

@ -2,7 +2,7 @@
"cells": [ "cells": [
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "278c7a1e", "id": "8d942947",
"metadata": {}, "metadata": {},
"source": [ "source": [
"# Feature Engineering Techniques" "# Feature Engineering Techniques"
@ -10,7 +10,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "67f256b4", "id": "d08d515b",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Adapted from https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575" "Adapted from https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"
@ -18,7 +18,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "5a28bcf6", "id": "5eb53e03",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## About Encoding\n", "## About Encoding\n",
@ -28,7 +28,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"id": "c0edffa6", "id": "eb00d32d",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -40,7 +40,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "3bd8a464", "id": "83412ecb",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## NaN Processing\n", "## NaN Processing\n",
@ -54,7 +54,7 @@
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"id": "e2c552c7", "id": "8b093d6f",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -63,7 +63,7 @@
}, },
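{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "fe85c377", "id": "978c9dc6",
"metadata": {}, "metadata": {},
"source": [ "source": [
"This way LGBM no longer gives NaN special treatment; it treats it like any other number. Try both approaches and see which one gives the highest CV score." "This way LGBM no longer gives NaN special treatment; it treats it like any other number. Try both approaches and see which one gives the highest CV score."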
@ -71,7 +71,7 @@
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "05e77c5a", "id": "31c076fc",
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Label Encoding / Factorizing / Memory Reduction\n", "## Label Encoding / Factorizing / Memory Reduction\n",
@ -80,8 +80,8 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 14, "execution_count": 1,
"id": "554159aa", "id": "ceef72c3",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -142,7 +142,7 @@
"4 0" "4 0"
] ]
}, },
"execution_count": 14, "execution_count": 1,
"metadata": {}, "metadata": {},
"output_type": "execute_result" "output_type": "execute_result"
} }
@ -157,7 +157,7 @@
}, },
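{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "e5bf12a9", "id": "eca60e6f",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Afterwards, the column can be converted to int8, int16, or int32 to reduce memory, depending on whether its max is less than 128, less than 32768, or neither." "Afterwards, the column can be converted to int8, int16, or int32 to reduce memory, depending on whether its max is less than 128, less than 32768, or neither."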
@ -165,8 +165,8 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 21, "execution_count": 2,
"id": "863fee6f", "id": "fcd6f4e3",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -195,8 +195,8 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 22, "execution_count": 3,
"id": "1a6bac81", "id": "a40af2b8",
"metadata": {}, "metadata": {},
"outputs": [ "outputs": [
{ {
@ -221,7 +221,7 @@
}, },
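{ {
"cell_type": "markdown", "cell_type": "markdown",
"id": "0951f3c7", "id": "3728adee",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In addition, to reduce memory, people apply the popular memory_reduce function to the other columns.\n", "In addition, to reduce memory, people apply the popular memory_reduce function to the other columns.\n",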
@ -231,8 +231,8 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": 23, "execution_count": 4,
"id": "88368fc6", "id": "03948c52",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@ -241,10 +241,47 @@
" if df[col].dtype=='int64': df[col] = df[col].astype('int32')" " if df[col].dtype=='int64': df[col] = df[col].astype('int32')"
] ]
}, },
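{
"cell_type": "markdown",
"id": "f1d175ca",
"metadata": {},
"source": [
"## Categorical Features\n",
"For categorical variables, you can either tell LGBM that they are categorical (at the cost of more memory) or tell LGBM to treat them as numeric (in which case they must be label encoded first)."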
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "333baf5e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5 entries, 0 to 4\n",
"Data columns (total 1 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 color 5 non-null category\n",
"dtypes: category(1)\n",
"memory usage: 265.0 bytes\n"
]
}
],
"source": [
"df = pd.DataFrame(['green','blue','red','blue','green'],columns=['color'])\n",
"df['color'],_ = df['color'].factorize()\n",
"df['color'] = df['color'].astype('category') # convert to a categorical feature and check memory usage; for reference, the int8 version uses 133.0 bytes\n",
"df.info()"
]
},
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"id": "1ecd48ce", "id": "28f791bd",
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [] "source": []

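Review note: the new cell compares the `category` dtype against the 133.0 bytes quoted for int8, but the int8 measurement itself is not part of this commit. A minimal sketch of that comparison follows, assuming only pandas. The helper `smallest_int_dtype` is hypothetical (it is not in the notebook) and simply encodes the 128 / 32768 thresholds described in the markdown cell above. Exact byte counts vary by pandas version, so none are asserted here.

```python
import pandas as pd


def smallest_int_dtype(max_code: int) -> str:
    """Hypothetical helper: narrowest signed int dtype that can hold max_code."""
    if max_code < 128:
        return "int8"
    if max_code < 32768:
        return "int16"
    return "int32"


df = pd.DataFrame({"color": ["green", "blue", "red", "blue", "green"]})

# Label encode: factorize assigns integer codes in order of first appearance.
codes, uniques = pd.factorize(df["color"])

# Option 1: keep the codes as a compact integer column (the model sees a numeric feature).
as_int = pd.Series(codes, name="color").astype(smallest_int_dtype(int(codes.max())))

# Option 2: mark the same codes as categorical (the model can treat it as a category).
as_cat = as_int.astype("category")

# index=False isolates the column data from the index overhead.
print(as_int.dtype, as_int.memory_usage(index=False))
print(as_cat.dtype, as_cat.memory_usage(index=False))
```

The categorical column stores the codes plus a separate table of categories, which is why it comes out larger than the plain int8 column on a frame this small.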