{ "cells": [ { "cell_type": "markdown", "id": "8d942947", "metadata": {}, "source": [ "# 特征工程技术" ] }, { "cell_type": "markdown", "id": "d08d515b", "metadata": {}, "source": [ "搬运参考:https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575" ] }, { "cell_type": "markdown", "id": "5eb53e03", "metadata": {}, "source": [ "### 关于编码\n", "在执行编码时,最好训练和测试集一起编码,如下所示" ] }, { "cell_type": "code", "execution_count": null, "id": "eb00d32d", "metadata": {}, "outputs": [], "source": [ "df = pd.concat([train[col],test[col]],axis=0)\n", "# PERFORM FEATURE ENGINEERING HERE\n", "train[col] = df[:len(train)]\n", "test[col] = df[len(train):]" ] }, { "cell_type": "markdown", "id": "83412ecb", "metadata": {}, "source": [ "### NAN值加工\n", "如果将np.nan给LGBM,那么在每个树节点分裂时,它会分裂非 NAN 值,然后将所有 NAN 发送到左节点或右节点,这取决于什么是最好的。\n", "\n", "因此,NAN 在每个节点都得到特殊处理,并且可能会变得过拟合。\n", "\n", "通过简单地将所有 NAN 转换为低于所有非 NAN 值的负数(例如 - 999),来防止测试集过拟合。" ] }, { "cell_type": "code", "execution_count": null, "id": "8b093d6f", "metadata": {}, "outputs": [], "source": [ "df[col].fillna(-999, inplace=True)" ] }, { "cell_type": "markdown", "id": "978c9dc6", "metadata": {}, "source": [ "这样LGBM将不再过度处理 NAN。相反,它会给予它与其他数字相同的关注。可以尝试两种方法,看看哪个给出了最高的CV。" ] }, { "cell_type": "markdown", "id": "31c076fc", "metadata": {}, "source": [ "### 标签编码/因式分解/内存减少\n", "标签编码(分解)将(字符串、类别、对象)列转换为整数。类似get_dummies,不同点在于如果有几十个取值,如果用pd.get_dummies()则会得到好几十列,增加了数据的稀疏性" ] }, { "cell_type": "code", "execution_count": 1, "id": "ceef72c3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
color
00
11
22
31
40
\n", "
" ], "text/plain": [ " color\n", "0 0\n", "1 1\n", "2 2\n", "3 1\n", "4 0" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n", "df['color'],_ = df['color'].factorize()\n", "df" ] }, { "cell_type": "markdown", "id": "eca60e6f", "metadata": {}, "source": [ "之后,可以将其转换为 int8、int16 或 int32用以减少内存,具体取决于 max 是否小于 128、小于 32768。" ] }, { "cell_type": "code", "execution_count": 2, "id": "fcd6f4e3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5 entries, 0 to 4\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 color 5 non-null int8 \n", "dtypes: int8(1)\n", "memory usage: 133.0 bytes\n" ] } ], "source": [ "if df['color'].max()<128:\n", " df['color'] = df['color'].astype('int8')\n", "elif df['color'].max()<32768:\n", " df['color'] = df['color'].astype('int16')\n", "else: df['color'] = df['color'].astype('int32')\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 3, "id": "a40af2b8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5 entries, 0 to 4\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 color 5 non-null int32\n", "dtypes: int32(1)\n", "memory usage: 148.0 bytes\n" ] } ], "source": [ "df['color'] = df['color'].astype('int32') # 如果使用int32,可以看到memory usage: 变成148了\n", "df.info()" ] }, { "cell_type": "markdown", "id": "3728adee", "metadata": {}, "source": [ "另外为了减少内存,人们memory_reduce在其他列上使用流行的功能。\n", "\n", "一种更简单、更安全的方法是将所有 float64 转换为 float32,将所有 int64 转换为 int32。(最好避免使用 float16。如果你愿意,可以使用 int8 和 int16)。" ] }, { "cell_type": "code", "execution_count": 4, "id": "03948c52", "metadata": {}, "outputs": [], "source": [ "for col in df.columns:\n", " if 
df[col].dtype=='float64': df[col] = df[col].astype('float32')\n", " if df[col].dtype=='int64': df[col] = df[col].astype('int32')" ] }, { "cell_type": "markdown", "id": "35181a29", "metadata": {}, "source": [ "### Testing whether shrinking the data types changes the results" ] }, { "cell_type": "code", "execution_count": 6, "id": "4cbbf955", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import time\n", "\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "from sklearn.model_selection import StratifiedKFold, train_test_split\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.preprocessing import OrdinalEncoder" ] }, { "cell_type": "code", "execution_count": 24, "id": "6aaae6e5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7-class task, before: [1 2 3 4 5 6 7]\n", "[5 5 2 ... 3 3 3]\n", "7-class task, after: [0. 1. 2. 3. 4. 5. 6.]\n", "[4. 4. 1. ... 2. 2. 2.]\n" ] } ], "source": [ "from sklearn.datasets import fetch_covtype\n", "data = fetch_covtype() # forest cover type dataset\n", "# preprocessing\n", "X, y = data['data'], data['target']\n", "# model labels need to start from 0, so map the classes 1-7 down to 0-6\n", "print('7-class task, before:',np.unique(y))\n", "print(y)\n", "enc = OrdinalEncoder()\n", "y = enc.fit_transform(y.reshape(-1, 1))\n", "y = y.reshape(-1, )\n", "print('7-class task, after:',np.unique(y))\n", "print(y)\n", "\n", "X = pd.DataFrame(X,columns=data.feature_names)\n", "X = X.iloc[:,:20] # the dataset is large, so only the first 20 columns are used for this demo\n", "\n", "y = pd.DataFrame(y, columns=data.target_names)" ] }, { "cell_type": "code", "execution_count": 25, "id": "816e36fd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 581012 entries, 0 to 581011\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Elevation 581012 non-null float64\n", " 1 Aspect 581012 non-null float64\n", " 2 Slope 581012 non-null float64\n", " 3 Horizontal_Distance_To_Hydrology 581012 non-null float64\n", " 4 
Vertical_Distance_To_Hydrology 581012 non-null float64\n", " 5 Horizontal_Distance_To_Roadways 581012 non-null float64\n", " 6 Hillshade_9am 581012 non-null float64\n", " 7 Hillshade_Noon 581012 non-null float64\n", " 8 Hillshade_3pm 581012 non-null float64\n", " 9 Horizontal_Distance_To_Fire_Points 581012 non-null float64\n", " 10 Wilderness_Area_0 581012 non-null float64\n", " 11 Wilderness_Area_1 581012 non-null float64\n", " 12 Wilderness_Area_2 581012 non-null float64\n", " 13 Wilderness_Area_3 581012 non-null float64\n", " 14 Soil_Type_0 581012 non-null float64\n", " 15 Soil_Type_1 581012 non-null float64\n", " 16 Soil_Type_2 581012 non-null float64\n", " 17 Soil_Type_3 581012 non-null float64\n", " 18 Soil_Type_4 581012 non-null float64\n", " 19 Soil_Type_5 581012 non-null float64\n", "dtypes: float64(20)\n", "memory usage: 88.7 MB\n" ] } ], "source": [ "X.info()" ] }, { "cell_type": "code", "execution_count": 20, "id": "022527cd", "metadata": {}, "outputs": [], "source": [ "folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)" ] }, { "cell_type": "code", "execution_count": 21, "id": "f1de7808", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training fold 1...\n", "0.951731022434877\n", "Training fold 2...\n", "0.952789514900648\n", "Training fold 3...\n", "0.9518510869003975\n", "Training fold 4...\n", "0.9518855097158396\n", "Training fold 5...\n", "0.952023200977608\n", "5-fold mean validation ACC: 0.952\n", "Wall time: 1h 48min 7s\n" ] } ], "source": [ "%%time\n", "val_acc_num=0\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):\n", "    print(\"Training fold {}...\".format(fold_+1))\n", "    train_x, train_y = X.loc[trn_idx], y.loc[trn_idx]\n", "    vali_x, vali_y = X.loc[val_idx], y.loc[val_idx]\n", "\n", "    random_forest = RandomForestClassifier(n_estimators=1000,oob_score=True)\n", "    random_forest.fit(train_x, train_y.values.ravel()) # .values.ravel() turns the DataFrame into a 1-D array, which fixes the warning mentioned above\n", "\n", "    # =============== validation-set accuracy ===============\n", "    pred_y = 
random_forest.predict(vali_x)\n", "    print(accuracy_score(pred_y,vali_y.values.ravel()))\n", "    val_acc_num += accuracy_score(pred_y,vali_y.values.ravel())\n", "    \n", "print(\"5-fold mean validation ACC: {0:.7f}\".format(val_acc_num/5))" ] }, { "cell_type": "markdown", "id": "5577a53b", "metadata": {}, "source": [ "The summary line above was printed with only 3 decimals; the full mean over the 5 folds is 0.952056." ] }, { "cell_type": "code", "execution_count": 22, "id": "9f4453ec", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 581012 entries, 0 to 581011\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Elevation 581012 non-null float32\n", " 1 Aspect 581012 non-null float32\n", " 2 Slope 581012 non-null float32\n", " 3 Horizontal_Distance_To_Hydrology 581012 non-null float32\n", " 4 Vertical_Distance_To_Hydrology 581012 non-null float32\n", " 5 Horizontal_Distance_To_Roadways 581012 non-null float32\n", " 6 Hillshade_9am 581012 non-null float32\n", " 7 Hillshade_Noon 581012 non-null float32\n", " 8 Hillshade_3pm 581012 non-null float32\n", " 9 Horizontal_Distance_To_Fire_Points 581012 non-null float32\n", " 10 Wilderness_Area_0 581012 non-null float32\n", " 11 Wilderness_Area_1 581012 non-null float32\n", " 12 Wilderness_Area_2 581012 non-null float32\n", " 13 Wilderness_Area_3 581012 non-null float32\n", " 14 Soil_Type_0 581012 non-null float32\n", " 15 Soil_Type_1 581012 non-null float32\n", " 16 Soil_Type_2 581012 non-null float32\n", " 17 Soil_Type_3 581012 non-null float32\n", " 18 Soil_Type_4 581012 non-null float32\n", " 19 Soil_Type_5 581012 non-null float32\n", "dtypes: float32(20)\n", "memory usage: 44.3 MB\n" ] } ], "source": [ "# reduce the memory footprint of X\n", "for col in X.columns:\n", "    if X[col].dtype=='float64': X[col] = X[col].astype('float32')\n", "    if X[col].dtype=='int64': X[col] = X[col].astype('int32')\n", "X.info()" ] }, { "cell_type": "code", "execution_count": 23, "id": "56e985d4", "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "第 1 次训练...\n", "0.9512577127957109\n", "第 2 次训练...\n", "0.9529013880880872\n", "第 3 次训练...\n", "0.9521092580162132\n", "第 4 次训练...\n", "0.9516961842309083\n", "第 5 次训练...\n", "0.9520145952737474\n", "5折泛化,验证集AC:0.952\n", "Wall time: 1h 57min 11s\n" ] } ], "source": [ "%%time\n", "val_acc_num=0\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X, y)):\n", " print(\"第 {} 次训练...\".format(fold_+1))\n", " train_x, trai_y = X.loc[trn_idx], y.loc[trn_idx]\n", " vali_x, vali_y = X.loc[val_idx], y.loc[val_idx]\n", "\n", " random_forest = RandomForestClassifier(n_estimators=1000,oob_score=True)\n", " random_forest.fit(train_x, trai_y.values.ravel()) # .values.ravel()是把DF格式变成1-D数组,上面出现的警告用这个解决\n", "\n", " # ===============验证集AUC操作===================\n", " pred_y = random_forest.predict(vali_x)\n", " print(accuracy_score(pred_y,vali_y.values.ravel()))\n", " val_acc_num += accuracy_score(pred_y,vali_y.values.ravel())\n", " \n", "print(\"5折泛化,验证集ACC:{0:.7f}\".format(val_acc_num/5))" ] }, { "cell_type": "markdown", "id": "6dc51d51", "metadata": {}, "source": [ "5折加起来的均值为:0.9519958,和上方的0.952056,仅差0.0000602,我测试了三轮,结果是接近一致的波动,由此证明减少内存是不影响结果的。" ] }, { "cell_type": "code", "execution_count": 26, "id": "17fd117b", "metadata": {}, "outputs": [], "source": [ "# 封装好的代码,原文链接:https://blog.csdn.net/wushaowu2014/article/details/86561141\n", "def reduce_mem_usage(df):\n", " \"\"\" iterate through all the columns of a dataframe and modify the data type\n", " to reduce memory usage. 
\n", " \"\"\"\n", " start_mem = df.memory_usage().sum() / 1024**2\n", " print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))\n", " \n", " for col in df.columns:\n", " col_type = df[col].dtype\n", " \n", " if col_type != object:\n", " c_min = df[col].min()\n", " c_max = df[col].max()\n", " if str(col_type)[:3] == 'int':\n", " if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n", " df[col] = df[col].astype(np.int8)\n", " elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n", " df[col] = df[col].astype(np.int16)\n", " elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n", " df[col] = df[col].astype(np.int32)\n", " elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n", " df[col] = df[col].astype(np.int64) \n", " else:\n", " if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n", " df[col] = df[col].astype(np.float16)\n", " elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n", " df[col] = df[col].astype(np.float32)\n", " else:\n", " df[col] = df[col].astype(np.float64)\n", " else:\n", " df[col] = df[col].astype('category')\n", " \n", " end_mem = df.memory_usage().sum() / 1024**2\n", " print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))\n", " print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))\n", " \n", " return df" ] }, { "cell_type": "code", "execution_count": 27, "id": "690184eb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Memory usage of dataframe is 88.66 MB\n", "Memory usage after optimization is: 22.16 MB\n", "Decreased by 75.0%\n" ] } ], "source": [ "X = reduce_mem_usage(X)" ] }, { "cell_type": "markdown", "id": "f72a6efa", "metadata": {}, "source": [ "### 分类特征\n", "对于分类变量,可以选择告诉 LGBM 它们是分类的(但内存会增加),或者可以告诉 LGBM 将其视为数字(首先需要对其进行标签编码)" ] }, { "cell_type": "code", "execution_count": 17, "id": "224ba994", "metadata": {}, "outputs": [ { 
"name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5 entries, 0 to 4\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 color 5 non-null category\n", "dtypes: category(1)\n", "memory usage: 265.0 bytes\n" ] } ], "source": [ "df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n", "df['color'],_ = df['color'].factorize()\n", "df['color'] = df['color'].astype('category') # 转成分类特征并查看内存使用情况(已知int8内存使用是: 133.0 bytes)\n", "df.info()" ] }, { "cell_type": "markdown", "id": "ba369f7a", "metadata": {}, "source": [ "### Splitting\n", "可以通过拆分将单个(字符串或数字)列分成两列。\n", "\n", "例如,id_30诸如\"Mac OS X 10_9_5\"之类的字符串列可以拆分为操作系统\"Mac OS X\"和版本\"10_9_5\"。或者例如数字\"1230.45\"可以拆分为元\" 1230\"和分\"45\"。LGBM 无法单独看到这些片段,需要将它们拆分。" ] }, { "cell_type": "markdown", "id": "6512d8e2", "metadata": {}, "source": [ "### 组合/转化/交互\n", "两个(字符串或数字)列可以合并为一列。例如card1,card2可以成为一个新列" ] }, { "cell_type": "code", "execution_count": null, "id": "7e1bbabe", "metadata": {}, "outputs": [], "source": [ "df['uid'] = df['card1'].astype(str)+'_'+df['card2'].astype(str)" ] }, { "cell_type": "markdown", "id": "f195a2c7", "metadata": {}, "source": [ "这有助于LGBM将card1和card2一起去与目标关联,并不会在树节点分裂他们。\n", "\n", "但这种uid = card1_card2可能与目标相关,现在LGBM会将其拆分。数字列可以与加法、减法、乘法等组合使用。一个数字示例是" ] }, { "cell_type": "code", "execution_count": null, "id": "8f2bea13", "metadata": {}, "outputs": [], "source": [ "df['x1_x2'] = df['x1'] * df['x2']" ] }, { "cell_type": "markdown", "id": "a8ab1bcb", "metadata": {}, "source": [ "### 频率编码\n", "频率编码是一种强大的技术,它允许 LGBM 查看列值是罕见的还是常见的。例如,如果您希望 LGBM“查看”哪些颜色不常使用,请尝试" ] }, { "cell_type": "code", "execution_count": 19, "id": "87cca857", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorcolor_counts
002
112
221
312
402
\n", "
" ], "text/plain": [ " color color_counts\n", "0 0 2\n", "1 1 2\n", "2 2 1\n", "3 1 2\n", "4 0 2" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp = df['color'].value_counts().to_dict()\n", "df['color_counts'] = df['color'].map(temp)\n", "df" ] }, { "cell_type": "markdown", "id": "7986b33c", "metadata": {}, "source": [ "### 聚合/组统计\n", "为 LGBM 提供组统计数据允许 LGBM 确定某个值对于特定组是常见的还是罕见的。\n", "\n", "可以通过为 pandas 提供 3 个变量来计算组统计数据。你给它组、感兴趣的变量和统计类型。例如" ] }, { "cell_type": "code", "execution_count": 20, "id": "e8f7106e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorcolor_countscolor_counts_sum
0024
1124
2211
3124
4024
\n", "
" ], "text/plain": [ " color color_counts color_counts_sum\n", "0 0 2 4\n", "1 1 2 4\n", "2 2 1 1\n", "3 1 2 4\n", "4 0 2 4" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp = df.groupby('color')['color_counts'].agg(['mean']).rename({'mean':'color_counts_mean'},axis=1)\n", "df = pd.merge(df,temp,on='color',how='left')\n", "df" ] }, { "cell_type": "markdown", "id": "9da30631", "metadata": {}, "source": [ "此处的功能向每一行添加color_counts该行color组的平均值。因此,LGBM 现在可以判断color_counts对它们的color组是否为极少数的部分。" ] }, { "cell_type": "markdown", "id": "790eb030", "metadata": {}, "source": [ "### 标准化\n", "可以针对自己对列进行标准化。例如" ] }, { "cell_type": "code", "execution_count": 22, "id": "14474d08", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
color
0-0.956183
10.239046
21.434274
30.239046
4-0.956183
\n", "
" ], "text/plain": [ " color\n", "0 -0.956183\n", "1 0.239046\n", "2 1.434274\n", "3 0.239046\n", "4 -0.956183" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(['green','bule','red','bule','green'],columns=['color'])\n", "df['color'],_ = df['color'].factorize()\n", "df['color'] = ( df['color']-df['color'].mean() ) / df['color'].std()\n", "df" ] }, { "cell_type": "markdown", "id": "cdfb237a", "metadata": {}, "source": [ "或者你可以针对一列标准化另一列。例如,如果你创建一个组统计数据(如上所述)来指示D3每周的平均值。然后你可以通过" ] }, { "cell_type": "code", "execution_count": null, "id": "d426927e", "metadata": {}, "outputs": [], "source": [ "df['D3_remove_time'] = df['D3'] - df['D3_week_mean']" ] }, { "cell_type": "markdown", "id": "78fc4991", "metadata": {}, "source": [ "D3_remove_time随着时间的推移,新变量不再增加,因为我们已经针对时间的影响对其进行了标准化。" ] }, { "cell_type": "markdown", "id": "da699903", "metadata": {}, "source": [ "### 离群值去除/平滑\n", "通常,你希望从数据中删除异常,因为它们会混淆你的模型。然而,在风控等比赛中,我们想要发现异常,所以要谨慎使用平滑技术。\n", "\n", "这些方法背后的想法是确定和删除不常见的值。例如,通过使用变量的频率编码,你可以删除所有出现小于 0.1% 的值,方法是将它们替换为 -9999 之类的新值(请注意,您应该使用与 NAN 使用的值不同的值)。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }