Add. Splitting

master
benjas 3 years ago
parent 3174f5a5cf
commit a141ea68d7

@ -80,7 +80,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": 1,
"id": "ceef72c3",
"metadata": {},
"outputs": [
@ -142,7 +142,7 @@
"4 0"
]
},
"execution_count": 14,
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
@ -165,7 +165,7 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 2,
"id": "fcd6f4e3",
"metadata": {},
"outputs": [
@ -195,7 +195,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 3,
"id": "a40af2b8",
"metadata": {},
"outputs": [
@ -231,7 +231,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 4,
"id": "03948c52",
"metadata": {},
"outputs": [],
@ -241,10 +241,58 @@
" if df[col].dtype=='int64': df[col] = df[col].astype('int32')"
]
},
{
"cell_type": "markdown",
"id": "f81f2de6",
"metadata": {},
"source": [
"## 分类特征\n",
"对于分类变量,可以选择告诉 LGBM 它们是分类的(但内存会增加),或者可以告诉 LGBM 将其视为数字(首先需要对其进行标签编码)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "bf7c9e8a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5 entries, 0 to 4\n",
"Data columns (total 1 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 color 5 non-null category\n",
"dtypes: category(1)\n",
"memory usage: 265.0 bytes\n"
]
}
],
"source": [
"df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n",
"df['color'],_ = df['color'].factorize()\n",
"df['color'] = df['color'].astype('category') # 转成分类特征并查看内存使用情况已知int8内存使用是: 133.0 bytes\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "3412c82b",
"metadata": {},
"source": [
"## Splitting\n",
"可以通过拆分将单个(字符串或数字)列分成两列。\n",
"\n",
"例如id_30诸如\"Mac OS X 10_9_5\"之类的字符串列可以拆分为操作系统\"Mac OS X\"和版本\"10_9_5\"。或者例如数字\"1230.45\"可以拆分为元\" 1230\"和分\"45\"。LGBM 无法单独看到这些片段,需要将它们拆分。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb624a66",
"id": "e78b77c4",
"metadata": {},
"outputs": [],
"source": []

@ -243,7 +243,7 @@
},
{
"cell_type": "markdown",
"id": "f1d175ca",
"id": "f81f2de6",
"metadata": {},
"source": [
"## 分类特征\n",
@ -253,7 +253,7 @@
{
"cell_type": "code",
"execution_count": 10,
"id": "333baf5e",
"id": "bf7c9e8a",
"metadata": {},
"outputs": [
{
@ -278,10 +278,21 @@
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "3412c82b",
"metadata": {},
"source": [
"## Splitting\n",
"可以通过拆分将单个(字符串或数字)列分成两列。\n",
"\n",
"例如id_30诸如\"Mac OS X 10_9_5\"之类的字符串列可以拆分为操作系统\"Mac OS X\"和版本\"10_9_5\"。或者例如数字\"1230.45\"可以拆分为元\" 1230\"和分\"45\"。LGBM 无法单独看到这些片段,需要将它们拆分。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28f791bd",
"id": "e78b77c4",
"metadata": {},
"outputs": [],
"source": []
