Add. Splitting

3 years ago · a141ea68d7
parent 3174f5a5cf
commit a141ea68d7
2 changed files with 68 additions and 9 deletions
--- a/竞赛优胜技巧/.ipynb_checkpoints/Feature
+++ b/竞赛优胜技巧/.ipynb_checkpoints/Feature
@ -80,7 +80,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 1,
   "id": "ceef72c3",
   "metadata": {},
   "outputs": [
@ -142,7 +142,7 @@
       "4      0"
      ]
     },
-     "execution_count": 14,
+     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
@ -165,7 +165,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 2,
   "id": "fcd6f4e3",
   "metadata": {},
   "outputs": [
@ -195,7 +195,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 3,
   "id": "a40af2b8",
   "metadata": {},
   "outputs": [
@ -231,7 +231,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 23,
+   "execution_count": 4,
   "id": "03948c52",
   "metadata": {},
   "outputs": [],
@ -241,10 +241,58 @@
    "    if df[col].dtype=='int64': df[col] = df[col].astype('int32')"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "f81f2de6",
+   "metadata": {},
+   "source": [
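+    "## Categorical features\n",
+    "For categorical variables, you can either tell LGBM that they are categorical (at the cost of higher memory use) or have LGBM treat them as numeric (which requires label-encoding them first)."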
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "bf7c9e8a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 5 entries, 0 to 4\n",
+      "Data columns (total 1 columns):\n",
+      " #   Column  Non-Null Count  Dtype   \n",
+      "---  ------  --------------  -----   \n",
+      " 0   color   5 non-null      category\n",
+      "dtypes: category(1)\n",
+      "memory usage: 265.0 bytes\n"
+     ]
+    }
+   ],
+   "source": [
+    "df = pd.DataFrame(['green','blue','red','blue','green'],columns=['color'])\n",
+    "df['color'],_ = df['color'].factorize()\n",
+    "df['color'] = df['color'].astype('category')  # Convert to a categorical feature and check memory usage (for reference, int8 uses 133.0 bytes)\n",
+    "df.info()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3412c82b",
+   "metadata": {},
+   "source": [
+    "## Splitting\n",
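+    "A single (string or numeric) column can be split into two columns.\n",
+    "\n",
+    "For example, a string column such as id_30, with values like \"Mac OS X 10_9_5\", can be split into the operating system \"Mac OS X\" and the version \"10_9_5\". Similarly, a number such as \"1230.45\" can be split into whole units (yuan) \"1230\" and hundredths (fen) \"45\". LGBM cannot see these pieces on its own, so they must be split into separate columns."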
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "bb624a66",
+   "id": "e78b77c4",
   "metadata": {},
   "outputs": [],
   "source": []
--- a/竞赛优胜技巧/Feature
+++ b/竞赛优胜技巧/Feature
@ -243,7 +243,7 @@
  },
  {
   "cell_type": "markdown",
-   "id": "f1d175ca",
+   "id": "f81f2de6",
   "metadata": {},
   "source": [
    "## Categorical features\n",
@ -253,7 +253,7 @@
  {
   "cell_type": "code",
   "execution_count": 10,
-   "id": "333baf5e",
+   "id": "bf7c9e8a",
   "metadata": {},
   "outputs": [
    {
@ -278,10 +278,21 @@
    "df.info()"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "3412c82b",
+   "metadata": {},
+   "source": [
+    "## Splitting\n",
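+    "A single (string or numeric) column can be split into two columns.\n",
+    "\n",
+    "For example, a string column such as id_30, with values like \"Mac OS X 10_9_5\", can be split into the operating system \"Mac OS X\" and the version \"10_9_5\". Similarly, a number such as \"1230.45\" can be split into whole units (yuan) \"1230\" and hundredths (fen) \"45\". LGBM cannot see these pieces on its own, so they must be split into separate columns."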
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "28f791bd",
+   "id": "e78b77c4",
   "metadata": {},
   "outputs": [],
   "source": []