Create Feature Engineering Techniques.ipynb

5 years ago · 272052520e
parent bf9d94cae2
commit 272052520e
2 changed files with 548 additions and 0 deletions
--- a/竞赛优胜技巧/.ipynb_checkpoints/Feature
+++ b/竞赛优胜技巧/.ipynb_checkpoints/Feature
@@ -0,0 +1,274 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "278c7a1e",
+   "metadata": {},
+   "source": [
+    "# Feature Engineering Techniques"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "67f256b4",
+   "metadata": {},
+   "source": [
+    "Adapted from: https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5a28bcf6",
+   "metadata": {},
+   "source": [
+    "## About Encoding\n",
+    "When encoding a feature, it is best to encode the training and test sets together, as shown below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c0edffa6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
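+    "import pandas as pd\n",
+    "\n",
+    "# Stack the train and test values of `col` so both sets share one encoding\n",
+    "df = pd.concat([train[col], test[col]], axis=0)\n",
+    "# PERFORM FEATURE ENGINEERING HERE\n",
+    "# Split back by position; .values avoids index-alignment surprises\n",
+    "train[col] = df[:len(train)].values\n",
+    "test[col] = df[len(train):].values"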
+   ]
+  },
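+  {
+   "cell_type": "markdown",
+   "id": "a1f0c3d2",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of the pattern above, assuming two small hypothetical frames `train` and `test` with a string column `card` (the names and values are illustrative, not taken from the competition data). Factorizing the concatenated column gives each category the same integer code in both sets, which is the point of encoding them together."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b2e4d5f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical toy data; 'card' and its values are illustrative only\n",
+    "train = pd.DataFrame({'card': ['visa', 'mastercard', 'visa']})\n",
+    "test = pd.DataFrame({'card': ['amex', 'visa']})\n",
+    "\n",
+    "col = 'card'\n",
+    "df = pd.concat([train[col], test[col]], axis=0)\n",
+    "# factorize() returns (integer codes, unique values) for the combined column\n",
+    "codes, uniques = df.factorize()\n",
+    "# Split the shared codes back by position\n",
+    "train[col] = codes[:len(train)]\n",
+    "test[col] = codes[len(train):]\n",
+    "print(train[col].tolist(), test[col].tolist())"
+   ]
+  },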
+  {
+   "cell_type": "markdown",
+   "id": "3bd8a464",
+   "metadata": {},
+   "source": [
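+    "## NaN Handling\n",
+    "If you pass np.nan to LGBM, then at each tree node split it splits the non-NaN values and then sends all NaNs to either the left or the right child, whichever works best.\n",
+    "\n",
+    "NaNs therefore get special treatment at every node, which can lead to overfitting.\n",
+    "\n",
+    "Simply converting all NaNs to a negative number lower than every non-NaN value (for example -999) can help prevent this overfitting on the test set."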
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e2c552c7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Assign the result back; fillna(inplace=True) on a column selection may not update df\n",
+    "df[col] = df[col].fillna(-999)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe85c377",
+   "metadata": {},
+   "source": [
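+    "LGBM then no longer gives NaNs special treatment; it pays them the same attention as any other number. Try both approaches and see which gives the higher CV score."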
+   ]
+  },
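+  {
+   "cell_type": "markdown",
+   "id": "c3d7e8a9",
+   "metadata": {},
+   "source": [
+    "A small sketch of the fill step, assuming a hypothetical numeric column `D1` (an illustrative name and values). Because the trick relies on the sentinel being lower than every real value, the sketch checks that ordering before filling."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d4a9b0c1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical toy column; 'D1' is an illustrative name\n",
+    "df = pd.DataFrame({'D1': [0.0, 14.0, np.nan, 3.0, np.nan]})\n",
+    "\n",
+    "sentinel = -999\n",
+    "# min() skips NaN, so this compares against the smallest real value\n",
+    "assert sentinel < df['D1'].min()\n",
+    "df['D1'] = df['D1'].fillna(sentinel)\n",
+    "print(df['D1'].tolist())"
+   ]
+  },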
+  {
+   "cell_type": "markdown",
+   "id": "05e77c5a",
+   "metadata": {},
+   "source": [
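+    "## Label Encoding / Factorizing / Memory Reduction\n",
+    "Label encoding (factorizing) converts a (string, category, object) column to integers. It serves a similar purpose to pd.get_dummies(), but for a column with dozens of distinct values pd.get_dummies() produces dozens of columns and makes the data sparser, whereas label encoding keeps a single column."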
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "554159aa",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>color</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>1</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   color\n",
+       "0      0\n",
+       "1      1\n",
+       "2      2\n",
+       "3      1\n",
+       "4      0"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "df = pd.DataFrame(['green', 'blue', 'red', 'blue', 'green'], columns=['color'])\n",
+    "# factorize() returns (integer codes, unique values); keep only the codes\n",
+    "df['color'], _ = df['color'].factorize()\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5bf12a9",
+   "metadata": {},
+   "source": [
+    "Afterwards, you can cast the column to int8, int16, or int32 to reduce memory, depending on whether its max is less than 128, less than 32768, or larger."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "id": "863fee6f",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 5 entries, 0 to 4\n",
+      "Data columns (total 1 columns):\n",
+      " #   Column  Non-Null Count  Dtype\n",
+      "---  ------  --------------  -----\n",
+      " 0   color   5 non-null      int8 \n",
+      "dtypes: int8(1)\n",
+      "memory usage: 133.0 bytes\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Pick the smallest signed integer dtype that can hold the largest code\n",
+    "if df['color'].max() < 128:\n",
+    "    df['color'] = df['color'].astype('int8')\n",
+    "elif df['color'].max() < 32768:\n",
+    "    df['color'] = df['color'].astype('int16')\n",
+    "else:\n",
+    "    df['color'] = df['color'].astype('int32')\n",
+    "df.info()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "1a6bac81",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 5 entries, 0 to 4\n",
+      "Data columns (total 1 columns):\n",
+      " #   Column  Non-Null Count  Dtype\n",
+      "---  ------  --------------  -----\n",
+      " 0   color   5 non-null      int32\n",
+      "dtypes: int32(1)\n",
+      "memory usage: 148.0 bytes\n"
+     ]
+    }
+   ],
+   "source": [
+    "df['color'] = df['color'].astype('int32')  # with int32, the reported memory usage rises to 148 bytes\n",
+    "df.info()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0951f3c7",
+   "metadata": {},
+   "source": [
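+    "To reduce memory further, people often apply the popular memory_reduce function to the other columns.\n",
+    "\n",
+    "A simpler and safer approach is to convert all float64 columns to float32 and all int64 columns to int32. (It is best to avoid float16. You can use int8 and int16 if you like.)"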
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 23,
+   "id": "88368fc6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for col in df.columns:\n",
+    "    if df[col].dtype=='float64': df[col] = df[col].astype('float32')\n",
+    "    if df[col].dtype=='int64': df[col] = df[col].astype('int32')"
+   ]
+  },
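+  {
+   "cell_type": "markdown",
+   "id": "e5b1c2d3",
+   "metadata": {},
+   "source": [
+    "A sketch of the simpler approach on a hypothetical mixed-dtype frame (the column names `amt` and `cnt` are illustrative). It builds a dtype map with `select_dtypes` and applies it in a single `astype` call. The last line hints at one reason to avoid float16: with an 11-bit significand it cannot represent the integer 2049 exactly."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f6c2d3e4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical toy frame; column names are illustrative\n",
+    "df = pd.DataFrame({'amt': [1.5, 2.25, 3.0], 'cnt': [1, 2, 3]})\n",
+    "\n",
+    "# Map every float64 column to float32 and every int64 column to int32\n",
+    "dtype_map = {c: 'float32' for c in df.select_dtypes(include='float64').columns}\n",
+    "dtype_map.update({c: 'int32' for c in df.select_dtypes(include='int64').columns})\n",
+    "df = df.astype(dtype_map)\n",
+    "print(df.dtypes.to_dict())\n",
+    "\n",
+    "# float16 rounds 2049 to the nearest representable value, 2048\n",
+    "print(np.float16(2049))"
+   ]
+  },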
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1ecd48ce",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}