{
"cells": [
{
"cell_type": "markdown",
"id": "33127151",
"metadata": {},
"source": [
"# Automated Feature Engineering"
]
},
{
"cell_type": "markdown",
"id": "66dfb30d",
"metadata": {},
"source": [
"### Conclusion: results are mediocre\n",
"Adapted from: https://www.kaggle.com/liananapalkova/automated-feature-engineering-for-titanic-dataset"
]
},
{
"cell_type": "markdown",
"id": "91896713",
"metadata": {},
"source": [
"### 1. Introduction\n",
"If you have ever created hundreds of features by hand for an ML project (and I am sure you have), you will be glad to learn how a Python package called \"featuretools\" can help with this task. The good news is that the package is easy to use: its goal is to automate feature engineering. Human expertise is irreplaceable, of course, but \"featuretools\" can automate a great deal of the routine work. For exploration purposes we use the fetch_covtype dataset here.\n",
"\n",
"This notebook covers two main steps:\n",
"\n",
"First, automated feature engineering (the \"featuretools\" package) grows the feature set from the original 54 features to N features.\n",
"\n",
"Second, feature-reduction and selection methods pick the X most relevant of those N features."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "522eb443",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]\n"
]
}
],
"source": [
"import sys\n",
"print(sys.version)  # Python version info"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "51e62bae",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simpleNote: you may need to restart the kernel to use updated packages.\n",
"Collecting featuretools\n",
" Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8f/32/b5d02df152aff86f720524540ae516a8e15d7a8c53bd4ee06e2b1ed0c263/featuretools-0.26.2-py3-none-any.whl (327 kB)\n",
"Requirement already satisfied: numpy>=1.16.6 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (1.19.5)\n",
"Requirement already satisfied: dask[dataframe]>=2.12.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (2021.4.0)\n",
"Requirement already satisfied: pyyaml>=5.4 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (5.4.1)\n",
"Requirement already satisfied: tqdm>=4.32.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (4.59.0)\n",
"Requirement already satisfied: scipy>=1.3.2 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (1.6.2)\n",
"Requirement already satisfied: click>=7.0.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (7.1.2)\n",
"Requirement already satisfied: pandas<2.0.0,>=1.2.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (1.2.4)\n",
"Requirement already satisfied: psutil>=5.6.6 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (5.8.0)\n",
"Requirement already satisfied: distributed>=2.12.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (2021.4.0)\n",
"Requirement already satisfied: cloudpickle>=0.4.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from featuretools) (1.6.0)\n",
"Requirement already satisfied: partd>=0.3.10 in d:\\programdata\\anaconda3\\lib\\site-packages (from dask[dataframe]>=2.12.0->featuretools) (1.2.0)\n",
"Requirement already satisfied: fsspec>=0.6.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from dask[dataframe]>=2.12.0->featuretools) (0.9.0)\n",
"Requirement already satisfied: toolz>=0.8.2 in d:\\programdata\\anaconda3\\lib\\site-packages (from dask[dataframe]>=2.12.0->featuretools) (0.11.1)\n",
"Requirement already satisfied: tblib>=1.6.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from distributed>=2.12.0->featuretools) (1.7.0)\n",
"Requirement already satisfied: zict>=0.1.3 in d:\\programdata\\anaconda3\\lib\\site-packages (from distributed>=2.12.0->featuretools) (2.0.0)\n",
"Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in d:\\programdata\\anaconda3\\lib\\site-packages (from distributed>=2.12.0->featuretools) (2.3.0)\n",
"Requirement already satisfied: tornado>=6.0.3 in d:\\programdata\\anaconda3\\lib\\site-packages (from distributed>=2.12.0->featuretools) (6.1)\n",
"Requirement already satisfied: msgpack>=0.6.0 in d:\\programdata\\anaconda3\\lib\\site-packages (from distributed>=2.12.0->featuretools) (1.0.2)\n",
"Requirement already satisfied: setuptools in d:\\programdata\\anaconda3\\lib\\site-packages (from distributed>=2.12.0->featuretools) (52.0.0.post20210125)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in d:\\programdata\\anaconda3\\lib\\site-packages (from pandas<2.0.0,>=1.2.0->featuretools) (2.8.1)\n",
"Requirement already satisfied: pytz>=2017.3 in d:\\programdata\\anaconda3\\lib\\site-packages (from pandas<2.0.0,>=1.2.0->featuretools) (2021.1)\n",
"Requirement already satisfied: locket in d:\\programdata\\anaconda3\\lib\\site-packages\\locket-0.2.1-py3.8.egg (from partd>=0.3.10->dask[dataframe]>=2.12.0->featuretools) (0.2.1)\n",
"Requirement already satisfied: six>=1.5 in d:\\programdata\\anaconda3\\lib\\site-packages (from python-dateutil>=2.7.3->pandas<2.0.0,>=1.2.0->featuretools) (1.15.0)\n",
"Requirement already satisfied: heapdict in d:\\programdata\\anaconda3\\lib\\site-packages (from zict>=0.1.3->distributed>=2.12.0->featuretools) (1.0.1)\n",
"Installing collected packages: featuretools\n",
"Successfully installed featuretools-0.26.2\n",
"\n"
]
}
],
"source": [
"pip install featuretools"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "43cc9a46",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import time\n",
"import gc\n",
"import pandas as pd\n",
"\n",
"import featuretools as ft\n",
"from featuretools.primitives import *\n",
"from featuretools.variable_types import Numeric\n",
"from sklearn.svm import LinearSVC\n",
"from sklearn.feature_selection import SelectFromModel\n",
"# Import the required libraries; pip install any that are missing\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.preprocessing import OrdinalEncoder\n",
"from sklearn.metrics import log_loss"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4c17c0bc",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_covtype\n",
"data = fetch_covtype()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "bcce5a3d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Seven-class task, before encoding: [1 2 3 4 5 6 7]\n",
"[5 5 2 ... 3 3 3]\n",
"Seven-class task, after encoding: [0. 1. 2. 3. 4. 5. 6.]\n",
"[4. 4. 1. ... 2. 2. 2.]\n"
]
}
],
"source": [
"# Preprocessing\n",
"X, y = data['data'], data['target']\n",
"# Model labels must start at 0, so shift the classes 1..7 down to 0..6\n",
"print('Seven-class task, before encoding:', np.unique(y))\n",
"print(y)\n",
"enc = OrdinalEncoder()  # renamed from `ord` to avoid shadowing the built-in ord()\n",
"y = enc.fit_transform(y.reshape(-1, 1))\n",
"y = y.reshape(-1, )\n",
"print('Seven-class task, after encoding:', np.unique(y))\n",
"print(y)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4afeeca5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" index Elevation Aspect Slope Horizontal_Distance_To_Hydrology \\\n",
"0 0 2596.0 51.0 3.0 258.0 \n",
"1 1 2590.0 56.0 2.0 212.0 \n",
"\n",
" Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways \\\n",
"0 0.0 510.0 \n",
"1 -6.0 390.0 \n",
"\n",
" Hillshade_9am Hillshade_Noon Hillshade_3pm \\\n",
"0 221.0 232.0 148.0 \n",
"1 220.0 235.0 151.0 \n",
"\n",
" Horizontal_Distance_To_Fire_Points Wilderness_Area_0 Wilderness_Area_1 \\\n",
"0 6279.0 1.0 0.0 \n",
"1 6225.0 1.0 0.0 \n",
"\n",
" Wilderness_Area_2 Wilderness_Area_3 Soil_Type_0 Soil_Type_1 \\\n",
"0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 \n",
"\n",
" Soil_Type_2 Soil_Type_3 Soil_Type_4 \n",
"0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = pd.DataFrame(X,columns=data.feature_names)\n",
"X = X.reset_index()\n",
"X = X.iloc[:, :20]  # The full dataset is wide; keep only the first 20 columns for this demo\n",
"X.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "af6722f2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" index Cover_Type\n",
"0 0 4.0\n",
"1 1 4.0"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = pd.DataFrame(y, columns=data.target_names)\n",
"y = y.reset_index()\n",
"y.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2d34ab5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 581012 entries, 0 to 581011\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 index 581012 non-null int64 \n",
" 1 Elevation 581012 non-null float64\n",
" 2 Aspect 581012 non-null float64\n",
" 3 Slope 581012 non-null float64\n",
" 4 Horizontal_Distance_To_Hydrology 581012 non-null float64\n",
" 5 Vertical_Distance_To_Hydrology 581012 non-null float64\n",
" 6 Horizontal_Distance_To_Roadways 581012 non-null float64\n",
" 7 Hillshade_9am 581012 non-null float64\n",
" 8 Hillshade_Noon 581012 non-null float64\n",
" 9 Hillshade_3pm 581012 non-null float64\n",
" 10 Horizontal_Distance_To_Fire_Points 581012 non-null float64\n",
" 11 Wilderness_Area_0 581012 non-null float64\n",
" 12 Wilderness_Area_1 581012 non-null float64\n",
" 13 Wilderness_Area_2 581012 non-null float64\n",
" 14 Wilderness_Area_3 581012 non-null float64\n",
" 15 Soil_Type_0 581012 non-null float64\n",
" 16 Soil_Type_1 581012 non-null float64\n",
" 17 Soil_Type_2 581012 non-null float64\n",
" 18 Soil_Type_3 581012 non-null float64\n",
" 19 Soil_Type_4 581012 non-null float64\n",
"dtypes: float64(19), int64(1)\n",
"memory usage: 88.7 MB\n"
]
}
],
"source": [
"X.info()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "1551c241",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 581012 entries, 0 to 581011\n",
"Data columns (total 20 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 index 581012 non-null int32 \n",
" 1 Elevation 581012 non-null float32\n",
" 2 Aspect 581012 non-null float32\n",
" 3 Slope 581012 non-null float32\n",
" 4 Horizontal_Distance_To_Hydrology 581012 non-null float32\n",
" 5 Vertical_Distance_To_Hydrology 581012 non-null float32\n",
" 6 Horizontal_Distance_To_Roadways 581012 non-null float32\n",
" 7 Hillshade_9am 581012 non-null float32\n",
" 8 Hillshade_Noon 581012 non-null float32\n",
" 9 Hillshade_3pm 581012 non-null float32\n",
" 10 Horizontal_Distance_To_Fire_Points 581012 non-null float32\n",
" 11 Wilderness_Area_0 581012 non-null float32\n",
" 12 Wilderness_Area_1 581012 non-null float32\n",
" 13 Wilderness_Area_2 581012 non-null float32\n",
" 14 Wilderness_Area_3 581012 non-null float32\n",
" 15 Soil_Type_0 581012 non-null float32\n",
" 16 Soil_Type_1 581012 non-null float32\n",
" 17 Soil_Type_2 581012 non-null float32\n",
" 18 Soil_Type_3 581012 non-null float32\n",
" 19 Soil_Type_4 581012 non-null float32\n",
"dtypes: float32(19), int32(1)\n",
"memory usage: 44.3 MB\n"
]
}
],
"source": [
"# Downcast dtypes to reduce memory usage\n",
"for col in X.columns:\n",
"    if X[col].dtype == 'float64': X[col] = X[col].astype('float32')\n",
"    if X[col].dtype == 'int64': X[col] = X[col].astype('int32')\n",
"X.info()  # memory usage is halved"
]
},
{
"cell_type": "markdown",
"id": "f68429bf",
"metadata": {},
"source": [
"### 2. Running automated feature engineering\n",
"First confirm there are no NaN values; handle any NaNs before this step. For reference:"
]
},
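{
"cell_type": "markdown",
"id": "a7e1f001",
"metadata": {},
"source": [
"A minimal NaN check (an added sketch, not part of the original run; assumes `X` is the DataFrame built above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7e1f002",
"metadata": {},
"outputs": [],
"source": [
"# Count missing values across the whole frame; DFS assumes a clean input\n",
"n_missing = X.isna().sum().sum()\n",
"assert n_missing == 0, 'Handle NaN values before building the EntitySet'"
]
},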
{
"cell_type": "code",
"execution_count": 8,
"id": "06f24545",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the entity_from_dataframe docstring via the class, since `es` is not defined yet\n",
"ft.EntitySet.entity_from_dataframe?"
]
},
{
"cell_type": "markdown",
"id": "e3f82b96",
"metadata": {},
"source": [
"After creating the entity set, new features can be generated from so-called feature primitives.\n",
"\n",
"These fall into two categories:\n",
"\n",
"* Aggregation: these functions group the child data points of each parent and then compute statistics such as mean, min, max, or standard deviation. Aggregations work across multiple tables, using the relationships between them.\n",
"\n",
"* Transform: these functions operate on one or more columns of a single table.\n",
"\n",
"We can create virtual tables with the \"normalize_entity\" function, which lets us apply aggregation and transform primitives to generate new features. To build such tables we use the categorical, Boolean, and integer variables."
]
},
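{
"cell_type": "markdown",
"id": "a7e1f003",
"metadata": {},
"source": [
"As a quick intuition (an added sketch in plain pandas, with hypothetical toy data): an aggregation collapses child rows per parent value, while a transform maps columns row by row:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7e1f004",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'parent': ['a', 'a', 'b'], 'value': [1, 2, 3]})\n",
"agg = toy.groupby('parent')['value'].mean()  # aggregation: one row per parent\n",
"trans = toy['value'] * 2                     # transform: one row per input row\n",
"print(agg.to_dict(), list(trans))"
]
},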
{
"cell_type": "code",
"execution_count": 9,
"id": "f2c69a94",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Entityset: fetch_covtype_data\n",
" Entities:\n",
" X [Rows: 581012, Columns: 20]\n",
" Relationships:\n",
" No relationships"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es = ft.EntitySet(id = 'fetch_covtype_data')\n",
"es = es.entity_from_dataframe(entity_id = 'X', dataframe = X, \n",
" variable_types = \n",
" {\n",
" 'Aspect': ft.variable_types.Categorical,\n",
" 'Slope': ft.variable_types.Categorical,\n",
" 'Hillshade_9am': ft.variable_types.Categorical,\n",
" 'Hillshade_Noon': ft.variable_types.Categorical,\n",
" 'Hillshade_3pm': ft.variable_types.Categorical,\n",
" 'Wilderness_Area_0': ft.variable_types.Boolean,\n",
" 'Wilderness_Area_1': ft.variable_types.Boolean,\n",
" 'Wilderness_Area_2': ft.variable_types.Boolean,\n",
" 'Wilderness_Area_3': ft.variable_types.Boolean,\n",
" 'Soil_Type_0': ft.variable_types.Boolean,\n",
" 'Soil_Type_1': ft.variable_types.Boolean,\n",
" 'Soil_Type_2': ft.variable_types.Boolean,\n",
" 'Soil_Type_3': ft.variable_types.Boolean,\n",
" 'Soil_Type_4': ft.variable_types.Boolean\n",
" },\n",
" index = 'index')\n",
"\n",
"es"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "770130bc",
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"Entityset: fetch_covtype_data\n",
" Entities:\n",
" X [Rows: 581012, Columns: 20]\n",
" Wilderness_Area_0 [Rows: 2, Columns: 1]\n",
" Wilderness_Area_1 [Rows: 2, Columns: 1]\n",
" Wilderness_Area_2 [Rows: 2, Columns: 1]\n",
" Wilderness_Area_3 [Rows: 2, Columns: 1]\n",
" Soil_Type_0 [Rows: 2, Columns: 1]\n",
" Soil_Type_1 [Rows: 2, Columns: 1]\n",
" Soil_Type_2 [Rows: 2, Columns: 1]\n",
" Soil_Type_3 [Rows: 2, Columns: 1]\n",
" Soil_Type_4 [Rows: 2, Columns: 1]\n",
" Relationships:\n",
" X.Wilderness_Area_0 -> Wilderness_Area_0.Wilderness_Area_0\n",
" X.Wilderness_Area_1 -> Wilderness_Area_1.Wilderness_Area_1\n",
" X.Wilderness_Area_2 -> Wilderness_Area_2.Wilderness_Area_2\n",
" X.Wilderness_Area_3 -> Wilderness_Area_3.Wilderness_Area_3\n",
" X.Soil_Type_0 -> Soil_Type_0.Soil_Type_0\n",
" X.Soil_Type_1 -> Soil_Type_1.Soil_Type_1\n",
" X.Soil_Type_2 -> Soil_Type_2.Soil_Type_2\n",
" X.Soil_Type_3 -> Soil_Type_3.Soil_Type_3\n",
" X.Soil_Type_4 -> Soil_Type_4.Soil_Type_4"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Wilderness_Area_0', index='Wilderness_Area_0')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Wilderness_Area_1', index='Wilderness_Area_1')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Wilderness_Area_2', index='Wilderness_Area_2')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Wilderness_Area_3', index='Wilderness_Area_3')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Soil_Type_0', index='Soil_Type_0')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Soil_Type_1', index='Soil_Type_1')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Soil_Type_2', index='Soil_Type_2')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Soil_Type_3', index='Soil_Type_3')\n",
"es = es.normalize_entity(base_entity_id='X', new_entity_id='Soil_Type_4', index='Soil_Type_4')\n",
"es"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "352fa085",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" name type dask_compatible koalas_compatible \\\n",
"0 sum aggregation True True \n",
"1 first aggregation False False \n",
"2 last aggregation False False \n",
"3 trend aggregation False False \n",
"4 n_most_common aggregation False False \n",
"5 time_since_last aggregation False False \n",
"6 std aggregation True True \n",
"7 median aggregation False False \n",
"8 count aggregation True True \n",
"9 percent_true aggregation True False \n",
"10 time_since_first aggregation False False \n",
"11 max aggregation True True \n",
"12 any aggregation True False \n",
"13 mode aggregation False False \n",
"14 entropy aggregation False False \n",
"15 min aggregation True True \n",
"16 all aggregation True False \n",
"17 skew aggregation False False \n",
"18 mean aggregation True True \n",
"19 avg_time_between aggregation False False \n",
"20 num_unique aggregation True True \n",
"21 num_true aggregation True False \n",
"\n",
" description \\\n",
"0 Calculates the total addition, ignoring `NaN`. \n",
"1 Determines the first value in a list. \n",
"2 Determines the last value in a list. \n",
"3 Calculates the trend of a variable over time. \n",
"4 Determines the `n` most common elements. \n",
"5 Calculates the time elapsed since the last datetime (default in seconds). \n",
"6 Computes the dispersion relative to the mean value, ignoring `NaN`. \n",
"7 Determines the middlemost number in a list of values. \n",
"8 Determines the total number of values, excluding `NaN`. \n",
"9 Determines the percent of `True` values. \n",
"10 Calculates the time elapsed since the first datetime (in seconds). \n",
"11 Calculates the highest value, ignoring `NaN` values. \n",
"12 Determines if any value is 'True' in a list. \n",
"13 Determines the most commonly repeated value. \n",
"14 Calculates the entropy for a categorical variable \n",
"15 Calculates the smallest value, ignoring `NaN` values. \n",
"16 Calculates if all values are 'True' in a list. \n",
"17 Computes the extent to which a distribution differs from a normal distribution. \n",
"18 Computes the average for a list of values. \n",
"19 Computes the average number of seconds between consecutive events. \n",
"20 Determines the number of distinct values, ignoring `NaN` values. \n",
"21 Counts the number of `True` values. \n",
"\n",
" valid_inputs return_type \n",
"0 Numeric Numeric \n",
"1 Variable None \n",
"2 Variable None \n",
"3 DatetimeTimeIndex, Numeric Numeric \n",
"4 Discrete Discrete \n",
"5 DatetimeTimeIndex Numeric \n",
"6 Numeric Numeric \n",
"7 Numeric Numeric \n",
"8 Index Numeric \n",
"9 Boolean Numeric \n",
"10 DatetimeTimeIndex Numeric \n",
"11 Numeric Numeric \n",
"12 Boolean Boolean \n",
"13 Discrete None \n",
"14 Categorical Numeric \n",
"15 Numeric Numeric \n",
"16 Boolean Boolean \n",
"17 Numeric Numeric \n",
"18 Numeric Numeric \n",
"19 DatetimeTimeIndex Numeric \n",
"20 Discrete Numeric \n",
"21 Boolean Numeric "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"primitives = ft.list_primitives()\n",
"pd.options.display.max_colwidth = 100\n",
"primitives[primitives['type'] == 'aggregation']"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7762885f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" name type dask_compatible koalas_compatible \\\n",
"22 greater_than transform True False \n",
"23 less_than transform True True \n",
"24 and transform True True \n",
"25 less_than_scalar transform True True \n",
"26 modulo_numeric transform True True \n",
".. ... ... ... ... \n",
"79 is_weekend transform True True \n",
"80 num_characters transform True True \n",
"81 latitude transform False False \n",
"82 cum_sum transform False False \n",
"83 subtract_numeric_scalar transform True True \n",
"\n",
" description \\\n",
"22 Determines if values in one list are greater than another list. \n",
"23 Determines if values in one list are less than another list. \n",
"24 Element-wise logical AND of two lists. \n",
"25 Determines if values are less than a given scalar. \n",
"26 Element-wise modulo of two lists. \n",
".. ... \n",
"79 Determines if a date falls on a weekend. \n",
"80 Calculates the number of characters in a string. \n",
"81 Returns the first tuple value in a list of LatLong tuples. \n",
"82 Calculates the cumulative sum. \n",
"83 Subtract a scalar from each element in the list. \n",
"\n",
" valid_inputs return_type \n",
"22 Ordinal, Datetime, Numeric Boolean \n",
"23 Ordinal, Datetime, Numeric Boolean \n",
"24 Boolean Boolean \n",
"25 Ordinal, Datetime, Numeric Boolean \n",
"26 Numeric Numeric \n",
".. ... ... \n",
"79 Datetime Boolean \n",
"80 NaturalLanguage Numeric \n",
"81 LatLong Numeric \n",
"82 Numeric Numeric \n",
"83 Numeric Numeric \n",
"\n",
"[62 rows x 7 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"primitives[primitives['type'] == 'transform']"
]
},
{
"cell_type": "markdown",
"id": "2a1baf81",
"metadata": {},
"source": [
"Now we apply a deep feature synthesis (DFS) function, which generates new features by automatically applying the appropriate primitives; here we use a depth of 2. The higher the depth, the more primitives are stacked on top of each other."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "6d3df2f7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wall time: 1min 3s\n"
]
}
],
"source": [
"%%time\n",
"features, feature_names = ft.dfs(entityset = es, \n",
" target_entity = 'X', \n",
" max_depth = 2)"
]
},
{
"cell_type": "markdown",
"id": "3c16b6f0",
"metadata": {},
"source": [
"Here is the list of new features. For example, \"Wilderness_Area_0.MEAN(X.Elevation)\" is the mean of the Elevation values for each unique value of Wilderness_Area_0, i.e. the average Elevation over all rows sharing the same Wilderness_Area_0."
]
},
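{
"cell_type": "markdown",
"id": "a7e1f005",
"metadata": {},
"source": [
"As a sanity check, that aggregation can be reproduced with a plain pandas groupby-transform (an added sketch; assumes `X` still holds the columns used above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7e1f006",
"metadata": {},
"outputs": [],
"source": [
"# Mean Elevation per unique Wilderness_Area_0 value, broadcast back to every row,\n",
"# i.e. the same quantity DFS names Wilderness_Area_0.MEAN(X.Elevation)\n",
"manual_mean = X.groupby('Wilderness_Area_0')['Elevation'].transform('mean')\n",
"manual_mean.head(2)"
]
},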
{
"cell_type": "code",
"execution_count": 15,
"id": "9a44a98a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<Feature: ...>,\n",
" <Feature: ...>,\n",