{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8d942947",
   "metadata": {},
   "source": [
    "# 特征工程技术"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d08d515b",
   "metadata": {},
   "source": [
    "搬运参考：https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5eb53e03",
   "metadata": {},
   "source": [
    "### 关于编码\n",
    "在执行编码时，最好训练和测试集一起编码，如下所示"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb00d32d",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.concat([train[col],test[col]],axis=0)\n",
    "# PERFORM FEATURE ENGINEERING HERE\n",
    "train[col] = df[:len(train)]\n",
    "test[col] = df[len(train):]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83412ecb",
   "metadata": {},
   "source": [
    "### NAN值加工\n",
    "如果将np.nan给LGBM，那么在每个树节点分裂时，它会分裂非 NAN 值，然后将所有 NAN 发送到左节点或右节点，这取决于什么是最好的。\n",
    "\n",
    "因此，NAN 在每个节点都得到特殊处理，并且可能会变得过拟合。\n",
    "\n",
    "通过简单地将所有 NAN 转换为低于所有非 NAN 值的负数（例如 - 999），来防止测试集过拟合。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b093d6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[col].fillna(-999, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "978c9dc6",
   "metadata": {},
   "source": [
    "这样LGBM将不再过度处理 NAN。相反，它会给予它与其他数字相同的关注。可以尝试两种方法，看看哪个给出了最高的CV。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31c076fc",
   "metadata": {},
   "source": [
    "### 标签编码/因式分解/内存减少\n",
    "标签编码（分解）将（字符串、类别、对象）列转换为整数。类似get_dummies，不同点在于如果有几十个取值，如果用pd.get_dummies()则会得到好几十列，增加了数据的稀疏性"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ceef72c3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   color\n",
       "0      0\n",
       "1      1\n",
       "2      2\n",
       "3      1\n",
       "4      0"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n",
    "df['color'],_ = df['color'].factorize()\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eca60e6f",
   "metadata": {},
   "source": [
    "之后，可以将其转换为 int8、int16 或 int32用以减少内存，具体取决于 max 是否小于 128、小于 32768。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fcd6f4e3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 5 entries, 0 to 4\n",
      "Data columns (total 1 columns):\n",
      " #   Column  Non-Null Count  Dtype\n",
      "---  ------  --------------  -----\n",
      " 0   color   5 non-null      int8 \n",
      "dtypes: int8(1)\n",
      "memory usage: 133.0 bytes\n"
     ]
    }
   ],
   "source": [
    "if df['color'].max()<128:\n",
    "    df['color'] = df['color'].astype('int8')\n",
    "elif df['color'].max()<32768:\n",
    "    df['color'] = df['color'].astype('int16')\n",
    "else: df['color'] = df['color'].astype('int32')\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a40af2b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 5 entries, 0 to 4\n",
      "Data columns (total 1 columns):\n",
      " #   Column  Non-Null Count  Dtype\n",
      "---  ------  --------------  -----\n",
      " 0   color   5 non-null      int32\n",
      "dtypes: int32(1)\n",
      "memory usage: 148.0 bytes\n"
     ]
    }
   ],
   "source": [
    "df['color'] = df['color'].astype('int32')  # 如果使用int32，可以看到memory usage: 变成148了\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3728adee",
   "metadata": {},
   "source": [
    "另外为了减少内存，人们memory_reduce在其他列上使用流行的功能。\n",
    "\n",
    "一种更简单、更安全的方法是将所有 float64 转换为 float32，将所有 int64 转换为 int32。（最好避免使用 float16。如果你愿意，可以使用 int8 和 int16）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "03948c52",
   "metadata": {},
   "outputs": [],
   "source": [
    "for col in df.columns:\n",
    "    if df[col].dtype=='float64': df[col] = df[col].astype('float32')\n",
    "    if df[col].dtype=='int64': df[col] = df[col].astype('int32')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35181a29",
   "metadata": {},
   "source": [
    "### 测试修改数据大小后，结果会不会发生变化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4cbbf955",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import time\n",
    "\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "from sklearn.model_selection import StratifiedKFold, train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.preprocessing import OrdinalEncoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "6aaae6e5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "七分类任务，处理前： [1 2 3 4 5 6 7]\n",
      "[5 5 2 ... 3 3 3]\n",
      "七分类任务，处理后： [0. 1. 2. 3. 4. 5. 6.]\n",
      "[4. 4. 1. ... 2. 2. 2.]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_covtype\n",
    "data = fetch_covtype()  # 森林植被类型\n",
    "# 预处理\n",
    "X, y = data['data'], data['target']\n",
    "# 由于模型标签需要从0开始，所以数字需要全部减1\n",
    "print('七分类任务，处理前：',np.unique(y))\n",
    "print(y)\n",
    "ord = OrdinalEncoder()\n",
    "y = ord.fit_transform(y.reshape(-1, 1))\n",
    "y = y.reshape(-1, )\n",
    "print('七分类任务，处理后：',np.unique(y))\n",
    "print(y)\n",
    "\n",
    "X = pd.DataFrame(X,columns=data.feature_names)\n",
    "X = X.iloc[:,:20]  # 数据集过大，这里仅用前20列做演示\n",
    "\n",
    "y = pd.DataFrame(y, columns=data.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "816e36fd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 581012 entries, 0 to 581011\n",
      "Data columns (total 20 columns):\n",
      " #   Column                              Non-Null Count   Dtype  \n",
      "---  ------                              --------------   -----  \n",
      " 0   Elevation                           581012 non-null  float64\n",
      " 1   Aspect                              581012 non-null  float64\n",
      " 2   Slope                               581012 non-null  float64\n",
      " 3   Horizontal_Distance_To_Hydrology    581012 non-null  float64\n",
      " 4   Vertical_Distance_To_Hydrology      581012 non-null  float64\n",
      " 5   Horizontal_Distance_To_Roadways     581012 non-null  float64\n",
      " 6   Hillshade_9am                       581012 non-null  float64\n",
      " 7   Hillshade_Noon                      581012 non-null  float64\n",
      " 8   Hillshade_3pm                       581012 non-null  float64\n",
      " 9   Horizontal_Distance_To_Fire_Points  581012 non-null  float64\n",
      " 10  Wilderness_Area_0                   581012 non-null  float64\n",
      " 11  Wilderness_Area_1                   581012 non-null  float64\n",
      " 12  Wilderness_Area_2                   581012 non-null  float64\n",
      " 13  Wilderness_Area_3                   581012 non-null  float64\n",
      " 14  Soil_Type_0                         581012 non-null  float64\n",
      " 15  Soil_Type_1                         581012 non-null  float64\n",
      " 16  Soil_Type_2                         581012 non-null  float64\n",
      " 17  Soil_Type_3                         581012 non-null  float64\n",
      " 18  Soil_Type_4                         581012 non-null  float64\n",
      " 19  Soil_Type_5                         581012 non-null  float64\n",
      "dtypes: float64(20)\n",
      "memory usage: 88.7 MB\n"
     ]
    }
   ],
   "source": [
    "X.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "022527cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "f1de7808",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第 1 次训练...\n",
      "0.951731022434877\n",
      "第 2 次训练...\n",
      "0.952789514900648\n",
      "第 3 次训练...\n",
      "0.9518510869003975\n",
      "第 4 次训练...\n",
      "0.9518855097158396\n",
      "第 5 次训练...\n",
      "0.952023200977608\n",
      "5折泛化，验证集AC：0.952\n",
      "Wall time: 1h 48min 7s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "val_acc_num=0\n",
    "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):\n",
    "    print(\"第 {} 次训练...\".format(fold_+1))\n",
    "    train_x, trai_y = X.loc[trn_idx], y.loc[trn_idx]\n",
    "    vali_x, vali_y = X.loc[val_idx], y.loc[val_idx]\n",
    "\n",
    "    random_forest = RandomForestClassifier(n_estimators=1000,oob_score=True)\n",
    "    random_forest.fit(train_x, trai_y.values.ravel())  # .values.ravel()是把DF格式变成1-D数组，上面出现的警告用这个解决\n",
    "\n",
    "    # ===============验证集AUC操作===================\n",
    "    pred_y = random_forest.predict(vali_x)\n",
    "    print(accuracy_score(pred_y,vali_y.values.ravel()))\n",
    "    val_acc_num += accuracy_score(pred_y,vali_y.values.ravel())\n",
    "    \n",
    "print(\"5折泛化，验证集ACC：{0:.7f}\".format(val_acc_num/5))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5577a53b",
   "metadata": {},
   "source": [
    "上面输出的是3位尾数，5折加起来的均值为：0.952056"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "9f4453ec",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 581012 entries, 0 to 581011\n",
      "Data columns (total 20 columns):\n",
      " #   Column                              Non-Null Count   Dtype  \n",
      "---  ------                              --------------   -----  \n",
      " 0   Elevation                           581012 non-null  float32\n",
      " 1   Aspect                              581012 non-null  float32\n",
      " 2   Slope                               581012 non-null  float32\n",
      " 3   Horizontal_Distance_To_Hydrology    581012 non-null  float32\n",
      " 4   Vertical_Distance_To_Hydrology      581012 non-null  float32\n",
      " 5   Horizontal_Distance_To_Roadways     581012 non-null  float32\n",
      " 6   Hillshade_9am                       581012 non-null  float32\n",
      " 7   Hillshade_Noon                      581012 non-null  float32\n",
      " 8   Hillshade_3pm                       581012 non-null  float32\n",
      " 9   Horizontal_Distance_To_Fire_Points  581012 non-null  float32\n",
      " 10  Wilderness_Area_0                   581012 non-null  float32\n",
      " 11  Wilderness_Area_1                   581012 non-null  float32\n",
      " 12  Wilderness_Area_2                   581012 non-null  float32\n",
      " 13  Wilderness_Area_3                   581012 non-null  float32\n",
      " 14  Soil_Type_0                         581012 non-null  float32\n",
      " 15  Soil_Type_1                         581012 non-null  float32\n",
      " 16  Soil_Type_2                         581012 non-null  float32\n",
      " 17  Soil_Type_3                         581012 non-null  float32\n",
      " 18  Soil_Type_4                         581012 non-null  float32\n",
      " 19  Soil_Type_5                         581012 non-null  float32\n",
      "dtypes: float32(20)\n",
      "memory usage: 44.3 MB\n"
     ]
    }
   ],
   "source": [
    "# 优化X内存使用率\n",
    "for col in X.columns:\n",
    "    if X[col].dtype=='float64': X[col] = X[col].astype('float32')\n",
    "    if X[col].dtype=='int64': X[col] = [col].astype('int32')\n",
    "X.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "56e985d4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第 1 次训练...\n",
      "0.9512577127957109\n",
      "第 2 次训练...\n",
      "0.9529013880880872\n",
      "第 3 次训练...\n",
      "0.9521092580162132\n",
      "第 4 次训练...\n",
      "0.9516961842309083\n",
      "第 5 次训练...\n",
      "0.9520145952737474\n",
      "5折泛化，验证集AC：0.952\n",
      "Wall time: 1h 57min 11s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "val_acc_num=0\n",
    "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):\n",
    "    print(\"第 {} 次训练...\".format(fold_+1))\n",
    "    train_x, trai_y = X.loc[trn_idx], y.loc[trn_idx]\n",
    "    vali_x, vali_y = X.loc[val_idx], y.loc[val_idx]\n",
    "\n",
    "    random_forest = RandomForestClassifier(n_estimators=1000,oob_score=True)\n",
    "    random_forest.fit(train_x, trai_y.values.ravel())  # .values.ravel()是把DF格式变成1-D数组，上面出现的警告用这个解决\n",
    "\n",
    "    # ===============验证集AUC操作===================\n",
    "    pred_y = random_forest.predict(vali_x)\n",
    "    print(accuracy_score(pred_y,vali_y.values.ravel()))\n",
    "    val_acc_num += accuracy_score(pred_y,vali_y.values.ravel())\n",
    "    \n",
    "print(\"5折泛化，验证集ACC：{0:.7f}\".format(val_acc_num/5))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6dc51d51",
   "metadata": {},
   "source": [
    "5折加起来的均值为：0.9519958，和上方的0.952056，仅差0.0000602，我测试了三轮，结果是接近一致的波动，由此证明减少内存是不影响结果的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "17fd117b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 封装好的代码，原文链接：https://blog.csdn.net/wushaowu2014/article/details/86561141\n",
    "def reduce_mem_usage(df):\n",
    "    \"\"\" iterate through all the columns of a dataframe and modify the data type\n",
    "        to reduce memory usage.        \n",
    "    \"\"\"\n",
    "    start_mem = df.memory_usage().sum() / 1024**2\n",
    "    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))\n",
    "    \n",
    "    for col in df.columns:\n",
    "        col_type = df[col].dtype\n",
    "        \n",
    "        if col_type != object:\n",
    "            c_min = df[col].min()\n",
    "            c_max = df[col].max()\n",
    "            if str(col_type)[:3] == 'int':\n",
    "                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
    "                    df[col] = df[col].astype(np.int8)\n",
    "                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
    "                    df[col] = df[col].astype(np.int16)\n",
    "                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
    "                    df[col] = df[col].astype(np.int32)\n",
    "                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
    "                    df[col] = df[col].astype(np.int64)  \n",
    "            else:\n",
    "                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
    "                    df[col] = df[col].astype(np.float16)\n",
    "                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
    "                    df[col] = df[col].astype(np.float32)\n",
    "                else:\n",
    "                    df[col] = df[col].astype(np.float64)\n",
    "        else:\n",
    "            df[col] = df[col].astype('category')\n",
    " \n",
    "    end_mem = df.memory_usage().sum() / 1024**2\n",
    "    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))\n",
    "    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))\n",
    "    \n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "690184eb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Memory usage of dataframe is 88.66 MB\n",
      "Memory usage after optimization is: 22.16 MB\n",
      "Decreased by 75.0%\n"
     ]
    }
   ],
   "source": [
    "X = reduce_mem_usage(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f72a6efa",
   "metadata": {},
   "source": [
    "### 分类特征\n",
    "对于分类变量，可以选择告诉 LGBM 它们是分类的（但内存会增加），或者可以告诉 LGBM 将其视为数字（首先需要对其进行标签编码）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "224ba994",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 5 entries, 0 to 4\n",
      "Data columns (total 1 columns):\n",
      " #   Column  Non-Null Count  Dtype   \n",
      "---  ------  --------------  -----   \n",
      " 0   color   5 non-null      category\n",
      "dtypes: category(1)\n",
      "memory usage: 265.0 bytes\n"
     ]
    }
   ],
   "source": [
    "df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n",
    "df['color'],_ = df['color'].factorize()\n",
    "df['color'] = df['color'].astype('category')  # 转成分类特征并查看内存使用情况（已知int8内存使用是: 133.0 bytes）\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba369f7a",
   "metadata": {},
   "source": [
    "### Splitting\n",
    "可以通过拆分将单个（字符串或数字）列分成两列。\n",
    "\n",
    "例如，id_30诸如\"Mac OS X 10_9_5\"之类的字符串列可以拆分为操作系统\"Mac OS X\"和版本\"10_9_5\"。或者例如数字\"1230.45\"可以拆分为元\" 1230\"和分\"45\"。LGBM 无法单独看到这些片段，需要将它们拆分。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6512d8e2",
   "metadata": {},
   "source": [
    "### 组合/转化/交互\n",
    "两个（字符串或数字）列可以合并为一列。例如card1，card2可以成为一个新列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e1bbabe",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['uid'] = df['card1'].astype(str)+'_'+df['card2'].astype(str)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f195a2c7",
   "metadata": {},
   "source": [
    "这有助于LGBM将card1和card2一起去与目标关联，并不会在树节点分裂他们。\n",
    "\n",
    "但这种uid = card1_card2可能与目标相关，现在LGBM会将其拆分。数字列可以与加法、减法、乘法等组合使用。一个数字示例是"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f2bea13",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['x1_x2'] = df['x1'] * df['x2']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8ab1bcb",
   "metadata": {},
   "source": [
    "### 频率编码\n",
    "频率编码是一种强大的技术，它允许 LGBM 查看列值是罕见的还是常见的。例如，如果您希望 LGBM“查看”哪些颜色不常使用，请尝试"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "87cca857",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "      <th>color_counts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  color  color_counts\n",
       "0     0             2\n",
       "1     1             2\n",
       "2     2             1\n",
       "3     1             2\n",
       "4     0             2"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp = df['color'].value_counts().to_dict()\n",
    "df['color_counts'] = df['color'].map(temp)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7986b33c",
   "metadata": {},
   "source": [
    "### 聚合/组统计\n",
    "为 LGBM 提供组统计数据允许 LGBM 确定某个值对于特定组是常见的还是罕见的。\n",
    "\n",
    "可以通过为 pandas 提供 3 个变量来计算组统计数据。你给它组、感兴趣的变量和统计类型。例如"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "e8f7106e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "      <th>color_counts</th>\n",
       "      <th>color_counts_sum</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  color  color_counts  color_counts_sum\n",
       "0     0             2                 4\n",
       "1     1             2                 4\n",
       "2     2             1                 1\n",
       "3     1             2                 4\n",
       "4     0             2                 4"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp = df.groupby('color')['color_counts'].agg(['mean']).rename({'mean':'color_counts_mean'},axis=1)\n",
    "df = pd.merge(df,temp,on='color',how='left')\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9da30631",
   "metadata": {},
   "source": [
    "此处的功能向每一行添加color_counts该行color组的平均值。因此，LGBM 现在可以判断color_counts对它们的color组是否为极少数的部分。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "790eb030",
   "metadata": {},
   "source": [
    "### 标准化\n",
    "可以针对自己对列进行标准化。例如"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "14474d08",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>color</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.956183</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.239046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.434274</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.239046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.956183</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      color\n",
       "0 -0.956183\n",
       "1  0.239046\n",
       "2  1.434274\n",
       "3  0.239046\n",
       "4 -0.956183"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n",
    "df['color'],_ = df['color'].factorize()\n",
    "df['color'] = ( df['color']-df['color'].mean() ) / df['color'].std()\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdfb237a",
   "metadata": {},
   "source": [
    "或者你可以针对一列标准化另一列。例如，如果你创建一个组统计数据（如上所述）来指示D3每周的平均值。然后你可以通过"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d426927e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['D3_remove_time'] = df['D3'] - df['D3_week_mean']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78fc4991",
   "metadata": {},
   "source": [
    "D3_remove_time随着时间的推移，新变量不再增加，因为我们已经针对时间的影响对其进行了标准化。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "da699903",
   "metadata": {},
   "source": [
    "### 离群值去除/平滑\n",
    "通常，你希望从数据中删除异常，因为它们会混淆你的模型。然而，在风控等比赛中，我们想要发现异常，所以要谨慎使用平滑技术。\n",
    "\n",
    "这些方法背后的想法是确定和删除不常见的值。例如，通过使用变量的频率编码，你可以删除所有出现小于 0.1% 的值，方法是将它们替换为 -9999 之类的新值（请注意，您应该使用与 NAN 使用的值不同的值）。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}