You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

1008 lines
30 KiB

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "markdown",
"id": "8d942947",
"metadata": {},
"source": [
"# 特征工程技术"
]
},
{
"cell_type": "markdown",
"id": "d08d515b",
"metadata": {},
"source": [
"搬运参考https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"
]
},
{
"cell_type": "markdown",
"id": "5eb53e03",
"metadata": {},
"source": [
"### 关于编码\n",
"在执行编码时,最好训练和测试集一起编码,如下所示"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb00d32d",
"metadata": {},
"outputs": [],
"source": [
"df = pd.concat([train[col],test[col]],axis=0)\n",
"# PERFORM FEATURE ENGINEERING HERE\n",
"train[col] = df[:len(train)]\n",
"test[col] = df[len(train):]"
]
},
{
"cell_type": "markdown",
"id": "83412ecb",
"metadata": {},
"source": [
"### NAN值加工\n",
"如果将np.nan给LGBM那么在每个树节点分裂时它会分裂非 NAN 值,然后将所有 NAN 发送到左节点或右节点,这取决于什么是最好的。\n",
"\n",
"因此NAN 在每个节点都得到特殊处理,并且可能会变得过拟合。\n",
"\n",
"通过简单地将所有 NAN 转换为低于所有非 NAN 值的负数(例如 - 999来防止测试集过拟合。"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b093d6f",
"metadata": {},
"outputs": [],
"source": [
"df[col].fillna(-999, inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "978c9dc6",
"metadata": {},
"source": [
"这样LGBM将不再过度处理 NAN。相反它会给予它与其他数字相同的关注。可以尝试两种方法看看哪个给出了最高的CV。"
]
},
{
"cell_type": "markdown",
"id": "31c076fc",
"metadata": {},
"source": [
"### 标签编码/因式分解/内存减少\n",
"标签编码分解字符串、类别、对象列转换为整数。类似get_dummies不同点在于如果有几十个取值如果用pd.get_dummies()则会得到好几十列,增加了数据的稀疏性"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ceef72c3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>color</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" color\n",
"0 0\n",
"1 1\n",
"2 2\n",
"3 1\n",
"4 0"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n",
"df['color'],_ = df['color'].factorize()\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "eca60e6f",
"metadata": {},
"source": [
"之后,可以将其转换为 int8、int16 或 int32用以减少内存具体取决于 max 是否小于 128、小于 32768。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "fcd6f4e3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5 entries, 0 to 4\n",
"Data columns (total 1 columns):\n",
" # Column Non-Null Count Dtype\n",
"--- ------ -------------- -----\n",
" 0 color 5 non-null int8 \n",
"dtypes: int8(1)\n",
"memory usage: 133.0 bytes\n"
]
}
],
"source": [
"if df['color'].max()<128:\n",
" df['color'] = df['color'].astype('int8')\n",
"elif df['color'].max()<32768:\n",
" df['color'] = df['color'].astype('int16')\n",
"else: df['color'] = df['color'].astype('int32')\n",
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a40af2b8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5 entries, 0 to 4\n",
"Data columns (total 1 columns):\n",
" # Column Non-Null Count Dtype\n",
"--- ------ -------------- -----\n",
" 0 color 5 non-null int32\n",
"dtypes: int32(1)\n",
"memory usage: 148.0 bytes\n"
]
}
],
"source": [
"df['color'] = df['color'].astype('int32') # 如果使用int32可以看到memory usage: 变成148了\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "3728adee",
"metadata": {},
"source": [
"另外为了减少内存人们memory_reduce在其他列上使用流行的功能。\n",
"\n",
"一种更简单、更安全的方法是将所有 float64 转换为 float32将所有 int64 转换为 int32。最好避免使用 float16。如果你愿意可以使用 int8 和 int16。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "03948c52",
"metadata": {},
"outputs": [],
"source": [
"for col in df.columns:\n",
" if df[col].dtype=='float64': df[col] = df[col].astype('float32')\n",
" if df[col].dtype=='int64': df[col] = df[col].astype('int32')"
]
},
{
"cell_type": "markdown",
"id": "35181a29",
"metadata": {},
"source": [
"### 测试修改数据大小后,结果会不会发生变化"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "4cbbf955",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import time\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"from sklearn.model_selection import StratifiedKFold, train_test_split\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.preprocessing import OrdinalEncoder"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "6aaae6e5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"七分类任务,处理前: [1 2 3 4 5 6 7]\n",
"[5 5 2 ... 3 3 3]\n",
"七分类任务,处理后: [0. 1. 2. 3. 4. 5. 6.]\n",
"[4. 4. 1. ... 2. 2. 2.]\n"
]
}
],
"source": [
"from sklearn.datasets import fetch_covtype\n",
"data = fetch_covtype() # 森林植被类型\n",
"# 预处理\n",
"X, y = data['data'], data['target']\n",
"# 由于模型标签需要从0开始所以数字需要全部减1\n",
"print('七分类任务,处理前:',np.unique(y))\n",
"print(y)\n",
"ord = OrdinalEncoder()\n",
"y = ord.fit_transform(y.reshape(-1, 1))\n",
"y = y.reshape(-1, )\n",
"print('七分类任务,处理后:',np.unique(y))\n",
"print(y)\n",
"\n",
"X = pd.DataFrame(X,columns=data.feature_names)\n",
"X = X.iloc[:,:20] # 数据集过大这里仅用前20列做演示\n",
"\n",
"y = pd.DataFrame(y, columns=data.target_names)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "816e36fd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 581012 entries, 0 to 581011\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Elevation 581012 non-null float64\n",
" 1 Aspect 581012 non-null float64\n",
" 2 Slope 581012 non-null float64\n",
" 3 Horizontal_Distance_To_Hydrology 581012 non-null float64\n",
" 4 Vertical_Distance_To_Hydrology 581012 non-null float64\n",
" 5 Horizontal_Distance_To_Roadways 581012 non-null float64\n",
" 6 Hillshade_9am 581012 non-null float64\n",
" 7 Hillshade_Noon 581012 non-null float64\n",
" 8 Hillshade_3pm 581012 non-null float64\n",
" 9 Horizontal_Distance_To_Fire_Points 581012 non-null float64\n",
" 10 Wilderness_Area_0 581012 non-null float64\n",
" 11 Wilderness_Area_1 581012 non-null float64\n",
" 12 Wilderness_Area_2 581012 non-null float64\n",
" 13 Wilderness_Area_3 581012 non-null float64\n",
" 14 Soil_Type_0 581012 non-null float64\n",
" 15 Soil_Type_1 581012 non-null float64\n",
" 16 Soil_Type_2 581012 non-null float64\n",
" 17 Soil_Type_3 581012 non-null float64\n",
" 18 Soil_Type_4 581012 non-null float64\n",
" 19 Soil_Type_5 581012 non-null float64\n",
"dtypes: float64(20)\n",
"memory usage: 88.7 MB\n"
]
}
],
"source": [
"X.info()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "022527cd",
"metadata": {},
"outputs": [],
"source": [
"folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "f1de7808",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"第 1 次训练...\n",
"0.951731022434877\n",
"第 2 次训练...\n",
"0.952789514900648\n",
"第 3 次训练...\n",
"0.9518510869003975\n",
"第 4 次训练...\n",
"0.9518855097158396\n",
"第 5 次训练...\n",
"0.952023200977608\n",
"5折泛化验证集AC0.952\n",
"Wall time: 1h 48min 7s\n"
]
}
],
"source": [
"%%time\n",
"val_acc_num=0\n",
"for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):\n",
" print(\"第 {} 次训练...\".format(fold_+1))\n",
" train_x, trai_y = X.loc[trn_idx], y.loc[trn_idx]\n",
" vali_x, vali_y = X.loc[val_idx], y.loc[val_idx]\n",
"\n",
" random_forest = RandomForestClassifier(n_estimators=1000,oob_score=True)\n",
" random_forest.fit(train_x, trai_y.values.ravel()) # .values.ravel()是把DF格式变成1-D数组上面出现的警告用这个解决\n",
"\n",
" # ===============验证集AUC操作===================\n",
" pred_y = random_forest.predict(vali_x)\n",
" print(accuracy_score(pred_y,vali_y.values.ravel()))\n",
" val_acc_num += accuracy_score(pred_y,vali_y.values.ravel())\n",
" \n",
"print(\"5折泛化验证集ACC{0:.7f}\".format(val_acc_num/5))"
]
},
{
"cell_type": "markdown",
"id": "5577a53b",
"metadata": {},
"source": [
"上面输出的是3位尾数5折加起来的均值为0.952056"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "9f4453ec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 581012 entries, 0 to 581011\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Elevation 581012 non-null float32\n",
" 1 Aspect 581012 non-null float32\n",
" 2 Slope 581012 non-null float32\n",
" 3 Horizontal_Distance_To_Hydrology 581012 non-null float32\n",
" 4 Vertical_Distance_To_Hydrology 581012 non-null float32\n",
" 5 Horizontal_Distance_To_Roadways 581012 non-null float32\n",
" 6 Hillshade_9am 581012 non-null float32\n",
" 7 Hillshade_Noon 581012 non-null float32\n",
" 8 Hillshade_3pm 581012 non-null float32\n",
" 9 Horizontal_Distance_To_Fire_Points 581012 non-null float32\n",
" 10 Wilderness_Area_0 581012 non-null float32\n",
" 11 Wilderness_Area_1 581012 non-null float32\n",
" 12 Wilderness_Area_2 581012 non-null float32\n",
" 13 Wilderness_Area_3 581012 non-null float32\n",
" 14 Soil_Type_0 581012 non-null float32\n",
" 15 Soil_Type_1 581012 non-null float32\n",
" 16 Soil_Type_2 581012 non-null float32\n",
" 17 Soil_Type_3 581012 non-null float32\n",
" 18 Soil_Type_4 581012 non-null float32\n",
" 19 Soil_Type_5 581012 non-null float32\n",
"dtypes: float32(20)\n",
"memory usage: 44.3 MB\n"
]
}
],
"source": [
"# 优化X内存使用率\n",
"for col in X.columns:\n",
" if X[col].dtype=='float64': X[col] = X[col].astype('float32')\n",
" if X[col].dtype=='int64': X[col] = [col].astype('int32')\n",
"X.info()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "56e985d4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"第 1 次训练...\n",
"0.9512577127957109\n",
"第 2 次训练...\n",
"0.9529013880880872\n",
"第 3 次训练...\n",
"0.9521092580162132\n",
"第 4 次训练...\n",
"0.9516961842309083\n",
"第 5 次训练...\n",
"0.9520145952737474\n",
"5折泛化验证集AC0.952\n",
"Wall time: 1h 57min 11s\n"
]
}
],
"source": [
"%%time\n",
"val_acc_num=0\n",
"for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):\n",
" print(\"第 {} 次训练...\".format(fold_+1))\n",
" train_x, trai_y = X.loc[trn_idx], y.loc[trn_idx]\n",
" vali_x, vali_y = X.loc[val_idx], y.loc[val_idx]\n",
"\n",
" random_forest = RandomForestClassifier(n_estimators=1000,oob_score=True)\n",
" random_forest.fit(train_x, trai_y.values.ravel()) # .values.ravel()是把DF格式变成1-D数组上面出现的警告用这个解决\n",
"\n",
" # ===============验证集AUC操作===================\n",
" pred_y = random_forest.predict(vali_x)\n",
" print(accuracy_score(pred_y,vali_y.values.ravel()))\n",
" val_acc_num += accuracy_score(pred_y,vali_y.values.ravel())\n",
" \n",
"print(\"5折泛化验证集ACC{0:.7f}\".format(val_acc_num/5))"
]
},
{
"cell_type": "markdown",
"id": "6dc51d51",
"metadata": {},
"source": [
"5折加起来的均值为0.9519958和上方的0.952056仅差0.0000602,我测试了三轮,结果是接近一致的波动,由此证明减少内存是不影响结果的。"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "17fd117b",
"metadata": {},
"outputs": [],
"source": [
"# 封装好的代码原文链接https://blog.csdn.net/wushaowu2014/article/details/86561141\n",
"def reduce_mem_usage(df):\n",
" \"\"\" iterate through all the columns of a dataframe and modify the data type\n",
" to reduce memory usage. \n",
" \"\"\"\n",
" start_mem = df.memory_usage().sum() / 1024**2\n",
" print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))\n",
" \n",
" for col in df.columns:\n",
" col_type = df[col].dtype\n",
" \n",
" if col_type != object:\n",
" c_min = df[col].min()\n",
" c_max = df[col].max()\n",
" if str(col_type)[:3] == 'int':\n",
" if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
" df[col] = df[col].astype(np.int8)\n",
" elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
" df[col] = df[col].astype(np.int16)\n",
" elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
" df[col] = df[col].astype(np.int32)\n",
" elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
" df[col] = df[col].astype(np.int64) \n",
" else:\n",
" if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
" df[col] = df[col].astype(np.float16)\n",
" elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
" df[col] = df[col].astype(np.float32)\n",
" else:\n",
" df[col] = df[col].astype(np.float64)\n",
" else:\n",
" df[col] = df[col].astype('category')\n",
" \n",
" end_mem = df.memory_usage().sum() / 1024**2\n",
" print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))\n",
" print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))\n",
" \n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "690184eb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Memory usage of dataframe is 88.66 MB\n",
"Memory usage after optimization is: 22.16 MB\n",
"Decreased by 75.0%\n"
]
}
],
"source": [
"X = reduce_mem_usage(X)"
]
},
{
"cell_type": "markdown",
"id": "f72a6efa",
"metadata": {},
"source": [
"### 分类特征\n",
"对于分类变量,可以选择告诉 LGBM 它们是分类的(但内存会增加),或者可以告诉 LGBM 将其视为数字(首先需要对其进行标签编码)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "224ba994",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 5 entries, 0 to 4\n",
"Data columns (total 1 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 color 5 non-null category\n",
"dtypes: category(1)\n",
"memory usage: 265.0 bytes\n"
]
}
],
"source": [
"df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n",
"df['color'],_ = df['color'].factorize()\n",
"df['color'] = df['color'].astype('category') # 转成分类特征并查看内存使用情况已知int8内存使用是: 133.0 bytes\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"id": "ba369f7a",
"metadata": {},
"source": [
"### Splitting\n",
"可以通过拆分将单个(字符串或数字)列分成两列。\n",
"\n",
"例如id_30诸如\"Mac OS X 10_9_5\"之类的字符串列可以拆分为操作系统\"Mac OS X\"和版本\"10_9_5\"。或者例如数字\"1230.45\"可以拆分为元\" 1230\"和分\"45\"。LGBM 无法单独看到这些片段,需要将它们拆分。"
]
},
{
"cell_type": "markdown",
"id": "6512d8e2",
"metadata": {},
"source": [
"### 组合/转化/交互\n",
"两个字符串或数字列可以合并为一列。例如card1card2可以成为一个新列"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e1bbabe",
"metadata": {},
"outputs": [],
"source": [
"df['uid'] = df['card1'].astype(str)+'_'+df['card2'].astype(str)"
]
},
{
"cell_type": "markdown",
"id": "f195a2c7",
"metadata": {},
"source": [
"这有助于LGBM将card1和card2一起去与目标关联并不会在树节点分裂他们。\n",
"\n",
"但这种uid = card1_card2可能与目标相关现在LGBM会将其拆分。数字列可以与加法、减法、乘法等组合使用。一个数字示例是"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f2bea13",
"metadata": {},
"outputs": [],
"source": [
"df['x1_x2'] = df['x1'] * df['x2']"
]
},
{
"cell_type": "markdown",
"id": "a8ab1bcb",
"metadata": {},
"source": [
"### 频率编码\n",
"频率编码是一种强大的技术,它允许 LGBM 查看列值是罕见的还是常见的。例如,如果您希望 LGBM“查看”哪些颜色不常使用请尝试"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "87cca857",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>color</th>\n",
" <th>color_counts</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" color color_counts\n",
"0 0 2\n",
"1 1 2\n",
"2 2 1\n",
"3 1 2\n",
"4 0 2"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp = df['color'].value_counts().to_dict()\n",
"df['color_counts'] = df['color'].map(temp)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "7986b33c",
"metadata": {},
"source": [
"### 聚合/组统计\n",
"为 LGBM 提供组统计数据允许 LGBM 确定某个值对于特定组是常见的还是罕见的。\n",
"\n",
"可以通过为 pandas 提供 3 个变量来计算组统计数据。你给它组、感兴趣的变量和统计类型。例如"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "e8f7106e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>color</th>\n",
" <th>color_counts</th>\n",
" <th>color_counts_sum</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" color color_counts color_counts_sum\n",
"0 0 2 4\n",
"1 1 2 4\n",
"2 2 1 1\n",
"3 1 2 4\n",
"4 0 2 4"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp = df.groupby('color')['color_counts'].agg(['mean']).rename({'mean':'color_counts_mean'},axis=1)\n",
"df = pd.merge(df,temp,on='color',how='left')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "9da30631",
"metadata": {},
"source": [
"此处的功能向每一行添加color_counts该行color组的平均值。因此LGBM 现在可以判断color_counts对它们的color组是否为极少数的部分。"
]
},
{
"cell_type": "markdown",
"id": "790eb030",
"metadata": {},
"source": [
"### 标准化\n",
"可以针对自己对列进行标准化。例如"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "14474d08",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>color</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-0.956183</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.239046</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1.434274</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.239046</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-0.956183</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" color\n",
"0 -0.956183\n",
"1 0.239046\n",
"2 1.434274\n",
"3 0.239046\n",
"4 -0.956183"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n",
"df['color'],_ = df['color'].factorize()\n",
"df['color'] = ( df['color']-df['color'].mean() ) / df['color'].std()\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "cdfb237a",
"metadata": {},
"source": [
"或者你可以针对一列标准化另一列。例如如果你创建一个组统计数据如上所述来指示D3每周的平均值。然后你可以通过"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d426927e",
"metadata": {},
"outputs": [],
"source": [
"df['D3_remove_time'] = df['D3'] - df['D3_week_mean']"
]
},
{
"cell_type": "markdown",
"id": "78fc4991",
"metadata": {},
"source": [
"D3_remove_time随着时间的推移新变量不再增加因为我们已经针对时间的影响对其进行了标准化。"
]
},
{
"cell_type": "markdown",
"id": "da699903",
"metadata": {},
"source": [
"### 离群值去除/平滑\n",
"通常,你希望从数据中删除异常,因为它们会混淆你的模型。然而,在风控等比赛中,我们想要发现异常,所以要谨慎使用平滑技术。\n",
"\n",
"这些方法背后的想法是确定和删除不常见的值。例如,通过使用变量的频率编码,你可以删除所有出现小于 0.1% 的值,方法是将它们替换为 -9999 之类的新值(请注意,您应该使用与 NAN 使用的值不同的值)。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}