# Automated Feature Engineering

### Conclusion: mediocre results
Adapted from: https://www.kaggle.com/liananapalkova/automated-feature-engineering-for-titanic-dataset

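### 1. Introduction
If you have ever hand-crafted hundreds of features for an ML project (and I bet you have), you will be glad to learn how the Python package "featuretools" can help with this task. The good news is that the package is easy to use. Its goal is to automate feature engineering. Human expertise is of course irreplaceable, but "featuretools" can automate a large amount of routine work. For exploration purposes, this notebook uses the fetch_covtype dataset.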

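This notebook covers two main steps:

First, use automated feature engineering (the "featuretools" package) to grow the feature set to N features. The full dataset has 54 features; only the first 19 are used below.

Second, apply feature reduction and selection methods to pick the X most relevant of those N features.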

In [1]:
import sys
print(sys.version)  # version info

3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]


In [5]:
pip install featuretools

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Note: you may need to restart the kernel to use updated packages.
Collecting featuretools
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8f/32/b5d02df152aff86f720524540ae516a8e15d7a8c53bd4ee06e2b1ed0c263/featuretools-0.26.2-py3-none-any.whl (327 kB)
Installing collected packages: featuretools
Successfully installed featuretools-0.26.2



In [19]:
import numpy as np
import time
import gc
import pandas as pd

import featuretools as ft
from featuretools.primitives import *
from featuretools.variable_types import Numeric
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
# Import the models; install anything missing with pip install xxx

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import log_loss

In [2]:
from sklearn.datasets import fetch_covtype
data = fetch_covtype()

In [3]:
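# Preprocessing
X, y = data['data'], data['target']
# Model labels need to start at 0, so every label is shifted down by 1
print('7-class task, before:', np.unique(y))
print(y)
encoder = OrdinalEncoder()  # renamed from `ord` to avoid shadowing the built-in
y = encoder.fit_transform(y.reshape(-1, 1))
y = y.reshape(-1, )
print('7-class task, after:', np.unique(y))
print(y)

7-class task, before: [1 2 3 4 5 6 7]
[5 5 2 ... 3 3 3]
7-class task, after: [0. 1. 2. 3. 4. 5. 6.]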
[4. 4. 1. ... 2. 2. 2.]
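
As a side note, the encoder above returns float labels (`0.`, `1.`, ...). A minimal sketch of two integer-preserving alternatives follows, assuming the labels are consecutive integers starting at 1 as in this dataset; the `labels` array is a made-up sample, not the real target.

```python
import numpy as np

# Made-up sample containing every class 1..7 at least once.
labels = np.array([5, 5, 2, 1, 7, 3, 4, 6])

# Option 1: plain subtraction, valid only for consecutive 1-based labels.
shifted = labels - 1

# Option 2: remap arbitrary labels to 0..K-1 by their sorted order.
classes, remapped = np.unique(labels, return_inverse=True)

print(shifted)
print(remapped)
```

Both arrays are identical here; the second form would also work if the labels had gaps.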


In [4]:
X = pd.DataFrame(X,columns=data.feature_names)
X = X.reset_index()
X = X.iloc[:,:20]  # the dataset is large, so only the first 20 columns (index plus 19 features) are used for this demo
X.head(2)

Unnamed: 0,index,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area_0,Wilderness_Area_1,Wilderness_Area_2,Wilderness_Area_3,Soil_Type_0,Soil_Type_1,Soil_Type_2,Soil_Type_3,Soil_Type_4
0,0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
y = pd.DataFrame(y, columns=data.target_names)
y = y.reset_index()
y.head(2)

Unnamed: 0,index,Cover_Type
0,0,4.0
1,1,4.0


In [6]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 20 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   index                               581012 non-null  int64  
 1   Elevation                           581012 non-null  float64
 2   Aspect                              581012 non-null  float64
 3   Slope                               581012 non-null  float64
 4   Horizontal_Distance_To_Hydrology    581012 non-null  float64
 5   Vertical_Distance_To_Hydrology      581012 non-null  float64
 6   Horizontal_Distance_To_Roadways     581012 non-null  float64
 7   Hillshade_9am                       581012 non-null  float64
 8   Hillshade_Noon                      581012 non-null  float64
 9   Hillshade_3pm                       581012 non-null  float64
 10  Horizontal_Distance_To_Fire_Points  581012 non-null  float64
 11  Wilderness_Area_0         

In [7]:
# Downcast dtypes to reduce memory usage
for col in X.columns:
    if X[col].dtype=='float64': X[col] = X[col].astype('float32')
    if X[col].dtype=='int64': X[col] = X[col].astype('int32')
X.info()  # memory usage is roughly halved

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 20 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   index                               581012 non-null  int32  
 1   Elevation                           581012 non-null  float32
 2   Aspect                              581012 non-null  float32
 3   Slope                               581012 non-null  float32
 4   Horizontal_Distance_To_Hydrology    581012 non-null  float32
 5   Vertical_Distance_To_Hydrology      581012 non-null  float32
 6   Horizontal_Distance_To_Roadways     581012 non-null  float32
 7   Hillshade_9am                       581012 non-null  float32
 8   Hillshade_Noon                      581012 non-null  float32
 9   Hillshade_3pm                       581012 non-null  float32
 10  Horizontal_Distance_To_Fire_Points  581012 non-null  float32
 11  Wilderness_Area_0         

### 2. Performing automated feature engineering
First check whether the data contains NaN values and handle any that are found.
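
A minimal sketch of such a check, using a toy frame with made-up values. The median fill is only one possible strategy and is an assumption here, not something the original notebook prescribes; on the real data the same calls would run on `X`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and two deliberately missing entries.
toy = pd.DataFrame({
    "Elevation": [2596.0, np.nan, 2804.0],
    "Slope": [3.0, 2.0, np.nan],
    "Aspect": [51.0, 56.0, 139.0],
})

# Count NaN values per column and show only the affected columns.
nan_counts = toy.isna().sum()
print(nan_counts[nan_counts > 0])

# One simple option (an assumption): fill each column with its median.
filled = toy.fillna(toy.median(numeric_only=True))
print(filled)
```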

In [8]:
es.entity_from_dataframe?

Object `es.entity_from_dataframe` not found.


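Once the entity set is created, new features can be generated using so-called feature primitives.

They fall into two categories:

* Aggregation: these functions group the child data points of each parent and then compute statistics such as the mean, minimum, maximum, or standard deviation. Aggregations work across multiple tables, using the relationships between them.

* Transform: these functions operate on one or more columns of a single table.

We can use the "normalize_entity" function to create virtual tables, which lets us apply both aggregation and transform functions to generate new features. To create such tables we will use the categorical, Boolean, and integer variables.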
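
The two primitive categories, and the role of a normalized parent table, can be pictured with plain pandas. This is only a rough analogy under stated assumptions: the values are made up, the column name `Elevation + Slope` is hypothetical, and the code does not call "featuretools" at all.

```python
import pandas as pd

# Toy child table; column names mirror the dataset, values are made up.
child = pd.DataFrame({
    "Wilderness_Area_0": [1.0, 1.0, 0.0, 0.0, 0.0],
    "Elevation": [2600.0, 3000.0, 2400.0, 2700.0, 3000.0],
    "Slope": [3.0, 2.0, 9.0, 12.0, 6.0],
})

# Roughly what a normalized table holds: one parent row per unique key value.
parent = child[["Wilderness_Area_0"]].drop_duplicates().reset_index(drop=True)

# Aggregation-style feature: summarise the child rows of each parent (mean),
# then broadcast the statistic back onto every child row.
child["Wilderness_Area_0.MEAN(X.Elevation)"] = (
    child.groupby("Wilderness_Area_0")["Elevation"].transform("mean")
)

# Transform-style feature: computed row by row from columns of one table.
child["Elevation + Slope"] = child["Elevation"] + child["Slope"]

print(parent)
print(child)
```

Every row that shares a `Wilderness_Area_0` value receives the same aggregated number, which is why such features carry little information when the parent table has only two rows.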

In [9]:
es = ft.EntitySet(id = 'fetch_covtype_data')
es = es.entity_from_dataframe(entity_id = 'X', dataframe = X, 
                              variable_types = 
                              {
                                  'Aspect': ft.variable_types.Categorical,
                                  'Slope': ft.variable_types.Categorical,
                                  'Hillshade_9am': ft.variable_types.Categorical,
                                  'Hillshade_Noon': ft.variable_types.Categorical,
                                  'Hillshade_3pm': ft.variable_types.Categorical,
                                  'Wilderness_Area_0': ft.variable_types.Boolean,
                                  'Wilderness_Area_1': ft.variable_types.Boolean,
                                  'Wilderness_Area_2': ft.variable_types.Boolean,
                                  'Wilderness_Area_3': ft.variable_types.Boolean,
                                  'Soil_Type_0': ft.variable_types.Boolean,
                                  'Soil_Type_1': ft.variable_types.Boolean,
                                  'Soil_Type_2': ft.variable_types.Boolean,
                                  'Soil_Type_3': ft.variable_types.Boolean,
                                  'Soil_Type_4': ft.variable_types.Boolean
                              },
                              index = 'index')

es

Entityset: fetch_covtype_data
  Entities:
    X [Rows: 581012, Columns: 20]
  Relationships:
    No relationships

In [10]:
# Create one parent entity for each Boolean column
for col in ['Wilderness_Area_0', 'Wilderness_Area_1', 'Wilderness_Area_2', 'Wilderness_Area_3',
            'Soil_Type_0', 'Soil_Type_1', 'Soil_Type_2', 'Soil_Type_3', 'Soil_Type_4']:
    es = es.normalize_entity(base_entity_id='X', new_entity_id=col, index=col)
es

Entityset: fetch_covtype_data
  Entities:
    X [Rows: 581012, Columns: 20]
    Wilderness_Area_0 [Rows: 2, Columns: 1]
    Wilderness_Area_1 [Rows: 2, Columns: 1]
    Wilderness_Area_2 [Rows: 2, Columns: 1]
    Wilderness_Area_3 [Rows: 2, Columns: 1]
    Soil_Type_0 [Rows: 2, Columns: 1]
    Soil_Type_1 [Rows: 2, Columns: 1]
    Soil_Type_2 [Rows: 2, Columns: 1]
    Soil_Type_3 [Rows: 2, Columns: 1]
    Soil_Type_4 [Rows: 2, Columns: 1]
  Relationships:
    X.Wilderness_Area_0 -> Wilderness_Area_0.Wilderness_Area_0
    X.Wilderness_Area_1 -> Wilderness_Area_1.Wilderness_Area_1
    X.Wilderness_Area_2 -> Wilderness_Area_2.Wilderness_Area_2
    X.Wilderness_Area_3 -> Wilderness_Area_3.Wilderness_Area_3
    X.Soil_Type_0 -> Soil_Type_0.Soil_Type_0
    X.Soil_Type_1 -> Soil_Type_1.Soil_Type_1
    X.Soil_Type_2 -> Soil_Type_2.Soil_Type_2
    X.Soil_Type_3 -> Soil_Type_3.Soil_Type_3
    X.Soil_Type_4 -> Soil_Type_4.Soil_Type_4

In [11]:
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100
primitives[primitives['type'] == 'aggregation'].head(primitives[primitives['type'] == 'aggregation'].shape[0])

Unnamed: 0,name,type,dask_compatible,koalas_compatible,description,valid_inputs,return_type
0,sum,aggregation,True,True,"Calculates the total addition, ignoring `NaN`.",Numeric,Numeric
1,first,aggregation,False,False,Determines the first value in a list.,Variable,
2,last,aggregation,False,False,Determines the last value in a list.,Variable,
3,trend,aggregation,False,False,Calculates the trend of a variable over time.,"DatetimeTimeIndex, Numeric",Numeric
4,n_most_common,aggregation,False,False,Determines the `n` most common elements.,Discrete,Discrete
5,time_since_last,aggregation,False,False,Calculates the time elapsed since the last datetime (default in seconds).,DatetimeTimeIndex,Numeric
6,std,aggregation,True,True,"Computes the dispersion relative to the mean value, ignoring `NaN`.",Numeric,Numeric
7,median,aggregation,False,False,Determines the middlemost number in a list of values.,Numeric,Numeric
8,count,aggregation,True,True,"Determines the total number of values, excluding `NaN`.",Index,Numeric
9,percent_true,aggregation,True,False,Determines the percent of `True` values.,Boolean,Numeric


In [12]:
primitives[primitives['type'] == 'transform'].head(primitives[primitives['type'] == 'transform'].shape[0])

Unnamed: 0,name,type,dask_compatible,koalas_compatible,description,valid_inputs,return_type
22,greater_than,transform,True,False,Determines if values in one list are greater than another list.,"Ordinal, Datetime, Numeric",Boolean
23,less_than,transform,True,True,Determines if values in one list are less than another list.,"Ordinal, Datetime, Numeric",Boolean
24,and,transform,True,True,Element-wise logical AND of two lists.,Boolean,Boolean
25,less_than_scalar,transform,True,True,Determines if values are less than a given scalar.,"Ordinal, Datetime, Numeric",Boolean
26,modulo_numeric,transform,True,True,Element-wise modulo of two lists.,Numeric,Numeric
...,...,...,...,...,...,...,...
79,is_weekend,transform,True,True,Determines if a date falls on a weekend.,Datetime,Boolean
80,num_characters,transform,True,True,Calculates the number of characters in a string.,NaturalLanguage,Numeric
81,latitude,transform,False,False,Returns the first tuple value in a list of LatLong tuples.,LatLong,Numeric
82,cum_sum,transform,False,False,Calculates the cumulative sum.,Numeric,Numeric


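Now we apply deep feature synthesis (DFS), which generates new features by automatically applying suitable aggregations. A depth of 2 is chosen here; the higher the depth, the more primitives are stacked.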

In [14]:
%%time
features, feature_names = ft.dfs(entityset = es, 
                                 target_entity = 'X', 
                                 max_depth = 2)

Wall time: 1min 3s


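This is the list of generated features. For example, "Wilderness_Area_0.MEAN(X.Elevation)" is the mean of Elevation for each unique value of Wilderness_Area_0, that is, the mean Elevation over all rows that share the same Wilderness_Area_0 value.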

In [15]:
feature_names

[<Feature: Elevation>,
 <Feature: Horizontal_Distance_To_Hydrology>,
 <Feature: Vertical_Distance_To_Hydrology>,
 <Feature: Horizontal_Distance_To_Roadways>,
 <Feature: Horizontal_Distance_To_Fire_Points>,
 <Feature: Aspect>,
 <Feature: Slope>,
 <Feature: Hillshade_9am>,
 <Feature: Hillshade_Noon>,
 <Feature: Hillshade_3pm>,
 <Feature: Wilderness_Area_0>,
 <Feature: Wilderness_Area_1>,
 <Feature: Wilderness_Area_2>,
 <Feature: Wilderness_Area_3>,
 <Feature: Soil_Type_0>,
 <Feature: Soil_Type_1>,
 <Feature: Soil_Type_2>,
 <Feature: Soil_Type_3>,
 <Feature: Soil_Type_4>,
 <Feature: Wilderness_Area_0.COUNT(X)>,
 <Feature: Wilderness_Area_0.MAX(X.Elevation)>,
 <Feature: Wilderness_Area_0.MAX(X.Horizontal_Distance_To_Fire_Points)>,
 <Feature: Wilderness_Area_0.MAX(X.Horizontal_Distance_To_Hydrology)>,
 <Feature: Wilderness_Area_0.MAX(X.Horizontal_Distance_To_Roadways)>,
 <Feature: Wilderness_Area_0.MAX(X.Vertical_Distance_To_Hydrology)>,
 <Feature: Wilderness_Area_0.MEAN(X.Elevation)>,
 <Fe

In [16]:
features[features['Elevation'] == 2596][["Wilderness_Area_0.MEAN(X.Elevation)","Elevation","Wilderness_Area_0"]].head()

Unnamed: 0_level_0,Wilderness_Area_0.MEAN(X.Elevation),Elevation,Wilderness_Area_0
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,3000.267334,2596.0,1.0
561,3000.267334,2596.0,1.0
2062,2926.053223,2596.0,0.0
6946,2926.053223,2596.0,0.0
6976,2926.053223,2596.0,0.0


In [17]:
features.shape

(581012, 532)

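Using "featuretools", we grew the feature set from 19 to 532 features in about a minute.

"featuretools" is a powerful package that saves time when creating new features from multiple data tables. However, it does not fully replace human domain knowledge. In addition, we now face another problem, known as the "curse of dimensionality".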

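### 3. The "curse of dimensionality": feature reduction and selection

To deal with the "curse of dimensionality", we need to apply feature reduction and selection, which means removing low-value features from the data. Keep in mind, though, that feature selection can affect the performance of the ML model. The tricky part is that designing an ML model involves an element of art. It is definitely not a deterministic process with strict rules that guarantee success if followed. Getting an accurate model means applying, combining, and comparing dozens of methods. In this notebook I will not explain every possible way to handle the "curse of dimensionality". I will focus on the following:

* Identifying collinear features

* Detecting the most relevant features with a linear model penalized with the L1 norm

#### 3.1 Identifying collinear features

Collinearity means a high correlation between independent features. If we keep such features in the model, it can be hard to assess the effect of each independent feature on the target variable. We will therefore detect and remove them, although a manual review should be applied before removal.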

In [37]:
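# Features whose correlation exceeds this threshold are dropped
threshold = 0.95

# Matrix of absolute correlation coefficients
corr_matrix = features.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal); np.bool no longer exists in recent NumPy, so use bool
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper.head(50)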

Unnamed: 0,Elevation,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Horizontal_Distance_To_Fire_Points,Aspect,Slope,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type_4.STD(X.Elevation),Soil_Type_4.STD(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.STD(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.STD(X.Horizontal_Distance_To_Roadways),Soil_Type_4.STD(X.Vertical_Distance_To_Hydrology),Soil_Type_4.SUM(X.Elevation),Soil_Type_4.SUM(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.SUM(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.SUM(X.Horizontal_Distance_To_Roadways),Soil_Type_4.SUM(X.Vertical_Distance_To_Hydrology)
Elevation,,0.306229,0.093306,0.365559,0.148022,0.015735,0.242697,0.112179,0.205887,0.059148,...,0.150376,0.150376,0.150376,0.150376,0.150376,0.150376,0.150376,0.150376,0.150376,0.150376
Horizontal_Distance_To_Hydrology,,,0.606236,0.07203,0.051874,0.017376,0.010607,0.027088,0.04679,0.05233,...,0.00937,0.00937,0.00937,0.00937,0.00937,0.00937,0.00937,0.00937,0.00937,0.00937
Vertical_Distance_To_Hydrology,,,,0.046372,0.069913,0.070305,0.274976,0.166333,0.110957,0.034902,...,0.026772,0.026772,0.026772,0.026772,0.026772,0.026772,0.026772,0.026772,0.026772,0.026772
Horizontal_Distance_To_Roadways,,,,,0.33158,0.025121,0.215914,0.034349,0.189461,0.106119,...,0.061607,0.061607,0.061607,0.061607,0.061607,0.061607,0.061607,0.061607,0.061607,0.061607
Horizontal_Distance_To_Fire_Points,,,,,,0.109172,0.185662,0.132669,0.057329,0.047981,...,0.051845,0.051845,0.051845,0.051845,0.051845,0.051845,0.051845,0.051845,0.051845,0.051845
Aspect,,,,,,,0.078728,0.579273,0.336103,0.646944,...,0.008938,0.008938,0.008938,0.008938,0.008938,0.008938,0.008938,0.008938,0.008938,0.008938
Slope,,,,,,,,0.327199,0.526911,0.175854,...,0.072311,0.072311,0.072311,0.072311,0.072311,0.072311,0.072311,0.072311,0.072311,0.072311
Hillshade_9am,,,,,,,,,0.010037,0.780296,...,0.046514,0.046514,0.046514,0.046514,0.046514,0.046514,0.046514,0.046514,0.046514,0.046514
Hillshade_Noon,,,,,,,,,,0.594274,...,0.062044,0.062044,0.062044,0.062044,0.062044,0.062044,0.062044,0.062044,0.062044,0.062044
Hillshade_3pm,,,,,,,,,,,...,0.0069,0.0069,0.0069,0.0069,0.0069,0.0069,0.0069,0.0069,0.0069,0.0069


In [38]:
# Collect the features whose correlation with an earlier feature exceeds the threshold
collinear_features = [column for column in upper.columns if any(upper[column] > threshold)]

print('There are %d features to remove.' % (len(collinear_features)))

There are 407 features to remove.


In [39]:
features_filtered = features.drop(columns = collinear_features)

print('The number of features that passed the collinearity threshold: ', features_filtered.shape[1])

The number of features that passed the collinearity threshold:  125


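Note, however, that removing features purely by correlation, without understanding what is being removed, is not a good idea. Highly correlated features that nonetheless differ in meaningful ways may need extra treatment, so manual review is necessary. That topic is beyond the scope of this kernel.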
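
To make the upper-triangle filter easier to follow, here is a small self-contained sketch on synthetic data. The helper name `find_collinear_columns` and the toy columns are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def find_collinear_columns(frame, threshold=0.95):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Upper triangle only (k=1 skips the diagonal), so each pair is checked once
    # and the first column of a correlated pair is always kept.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 1.0,    # linear function of "a", so perfectly correlated
    "b": rng.normal(size=200),    # independent noise
})

to_drop = find_collinear_columns(toy)
print(to_drop)
print(toy.drop(columns=to_drop).columns.tolist())
```

Because only the upper triangle is inspected, column order decides which member of a correlated pair survives, which is one reason a manual review is worthwhile.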

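#### 3.2 Detecting the most relevant features with an L1-penalized linear model
The next step is to use a linear model penalized with the L1 norm.

Note that in practice the test-set labels are unknown, so we first split the data into a training set and a test set.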
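
The idea of fitting the selector on the training split only and then reusing the same column mask on the test split can be sketched on synthetic data. Everything here is an assumption for illustration: the data comes from `make_classification`, and the sizes and `random_state` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data: 30 features, of which 5 carry signal.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=30, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

# The L1 penalty drives many coefficients to exactly zero;
# the model only ever sees the training labels.
svc = LinearSVC(C=0.01, penalty="l1", dual=False, random_state=0).fit(X_tr, y_tr)

# SelectFromModel keeps the columns with non-zero coefficients.
selector = SelectFromModel(svc, prefit=True)
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)   # same mask, no test labels involved

print(X_tr_sel.shape, X_te_sel.shape)
```

A smaller `C` means a stronger penalty and fewer surviving features.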

In [46]:
df = pd.merge(features_filtered, y, on=['index'])
df.head(2)

Unnamed: 0,index,Elevation,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Horizontal_Distance_To_Fire_Points,Aspect,Slope,Hillshade_9am,Hillshade_Noon,...,Soil_Type_4.MIN(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.MIN(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.MODE(X.Soil_Type_0),Soil_Type_4.MODE(X.Soil_Type_1),Soil_Type_4.MODE(X.Soil_Type_2),Soil_Type_4.MODE(X.Soil_Type_3),Soil_Type_4.MODE(X.Wilderness_Area_0),Soil_Type_4.MODE(X.Wilderness_Area_1),Soil_Type_4.MODE(X.Wilderness_Area_2),Cover_Type
0,0,2596.0,258.0,0.0,510.0,6279.0,51.0,3.0,221.0,232.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
1,1,2590.0,212.0,-6.0,390.0,6225.0,56.0,2.0,220.0,235.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0


In [48]:
train_df, test_df = train_test_split(df,random_state=42)
train_df.head(2)

Unnamed: 0,index,Elevation,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Horizontal_Distance_To_Fire_Points,Aspect,Slope,Hillshade_9am,Hillshade_Noon,...,Soil_Type_4.MIN(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.MIN(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.MODE(X.Soil_Type_0),Soil_Type_4.MODE(X.Soil_Type_1),Soil_Type_4.MODE(X.Soil_Type_2),Soil_Type_4.MODE(X.Soil_Type_3),Soil_Type_4.MODE(X.Wilderness_Area_0),Soil_Type_4.MODE(X.Wilderness_Area_1),Soil_Type_4.MODE(X.Wilderness_Area_2),Cover_Type
442216,442216,2833.0,60.0,26.0,1890.0,1211.0,258.0,26.0,148.0,244.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
20198,20198,3008.0,339.0,7.0,6427.0,2971.0,45.0,2.0,220.0,234.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [53]:
features_positive = features_filtered.loc[:, features_filtered.ge(0).all()]

train_X = train_df.drop(columns='Cover_Type')
train_y = train_df['Cover_Type']

test_X = test_df.drop(columns='Cover_Type')
test_y = test_df['Cover_Type']
test_X.head(2)

Unnamed: 0,index,Elevation,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Horizontal_Distance_To_Fire_Points,Aspect,Slope,Hillshade_9am,Hillshade_Noon,...,Soil_Type_3.NUM_UNIQUE(X.Wilderness_Area_3),Soil_Type_4.MIN(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.MIN(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.MODE(X.Soil_Type_0),Soil_Type_4.MODE(X.Soil_Type_1),Soil_Type_4.MODE(X.Soil_Type_2),Soil_Type_4.MODE(X.Soil_Type_3),Soil_Type_4.MODE(X.Wilderness_Area_0),Soil_Type_4.MODE(X.Wilderness_Area_1),Soil_Type_4.MODE(X.Wilderness_Area_2)
250728,250728,3351.0,726.0,124.0,3813.0,2271.0,206.0,27.0,192.0,252.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
246788,246788,2732.0,212.0,1.0,1082.0,912.0,129.0,7.0,231.0,236.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(train_X, train_y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(train_X)
X_selected_df = pd.DataFrame(X_new, columns=[train_X.columns[i] for i in range(len(train_X.columns)) if model.get_support()[i]])
X_selected_df.shape



(435759, 36)

In [55]:
X_selected_df.columns

Index(['Elevation', 'Horizontal_Distance_To_Hydrology',
       'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
       'Horizontal_Distance_To_Fire_Points', 'Aspect', 'Slope',
       'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Wilderness_Area_0',
       'Wilderness_Area_1', 'Wilderness_Area_2', 'Wilderness_Area_3',
       'Soil_Type_0', 'Soil_Type_1', 'Soil_Type_2', 'Soil_Type_3',
       'Soil_Type_4', 'Wilderness_Area_0.NUM_UNIQUE(X.Aspect)',
       'Wilderness_Area_2.MODE(X.Aspect)',
       'Wilderness_Area_2.NUM_UNIQUE(X.Aspect)',
       'Wilderness_Area_2.NUM_UNIQUE(X.Soil_Type_1)',
       'Wilderness_Area_2.NUM_UNIQUE(X.Soil_Type_2)',
       'Wilderness_Area_2.NUM_UNIQUE(X.Soil_Type_3)',
       'Wilderness_Area_3.NUM_UNIQUE(X.Aspect)',
       'Wilderness_Area_3.NUM_UNIQUE(X.Soil_Type_1)',
       'Wilderness_Area_3.NUM_UNIQUE(X.Soil_Type_2)',
       'Wilderness_Area_3.NUM_UNIQUE(X.Soil_Type_3)',
       'Soil_Type_1.MODE(X.Hillshade_9am)',
       'Soil_T

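### 4. Training and testing a single model

Finally, we build a basic random forest classifier. Note that I skipped some basic steps such as cross-validation and learning-curve analysis.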

In [59]:
%%time
random_forest = RandomForestClassifier(n_estimators=500,oob_score=True)
random_forest.fit(X_selected_df, train_y)

Wall time: 12min 18s


RandomForestClassifier(n_estimators=500, oob_score=True)

### 5. Evaluating the results

In [60]:
# Evaluate on the test set
Y_pred = random_forest.predict(test_X[X_selected_df.columns])
print(accuracy_score(Y_pred,test_y))  # RF

0.9439598493662782


In [67]:
"""
del features_filtered
del features_positive
del fetch_covtype
del df, X,y, X_selected_df,train,test,train_df,test_df,train_X,train_y
"""
gc.collect()

51238

#### 5.1 Comparing with the score on the original features

In [8]:
org_df = pd.merge(X, y, on=['index'])
org_train_df, org_test_df = train_test_split(org_df,random_state=42)
org_train_X = org_train_df.drop(columns='Cover_Type')
org_train_y = org_train_df['Cover_Type']

org_test_X = org_test_df.drop(columns='Cover_Type')
org_test_y = org_test_df['Cover_Type']
org_test_X.head(2)

Unnamed: 0,index,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,Wilderness_Area_0,Wilderness_Area_1,Wilderness_Area_2,Wilderness_Area_3,Soil_Type_0,Soil_Type_1,Soil_Type_2,Soil_Type_3,Soil_Type_4
250728,250728,3351.0,206.0,27.0,726.0,124.0,3813.0,192.0,252.0,180.0,2271.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
246788,246788,2732.0,129.0,7.0,212.0,1.0,1082.0,231.0,236.0,137.0,912.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
%%time
random_forest = RandomForestClassifier(n_estimators=500,oob_score=True)
random_forest.fit(org_train_X, org_train_y)
pred_org_test_y = random_forest.predict(org_test_X)
print(accuracy_score(pred_org_test_y,org_test_y))  # RF

0.9673328605949619
Wall time: 14min 30s


#### 5.2 Score with all generated features (no reduction or selection)

In [18]:
df = pd.merge(features, y, on=['index'])
df.head(2)

Unnamed: 0,index,Elevation,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Horizontal_Distance_To_Fire_Points,Aspect,Slope,Hillshade_9am,Hillshade_Noon,...,Soil_Type_4.STD(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.STD(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.STD(X.Horizontal_Distance_To_Roadways),Soil_Type_4.STD(X.Vertical_Distance_To_Hydrology),Soil_Type_4.SUM(X.Elevation),Soil_Type_4.SUM(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.SUM(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.SUM(X.Horizontal_Distance_To_Roadways),Soil_Type_4.SUM(X.Vertical_Distance_To_Hydrology),Cover_Type
0,0,2596.0,258.0,0.0,510.0,6279.0,51.0,3.0,221.0,232.0,...,1324.050751,212.689925,1558.361956,58.279989,1715981000.0,1149499000.0,156171328.0,1364632000.0,26848308.0,4.0
1,1,2590.0,212.0,-6.0,390.0,6225.0,56.0,2.0,220.0,235.0,...,1324.050751,212.689925,1558.361956,58.279989,1715981000.0,1149499000.0,156171328.0,1364632000.0,26848308.0,4.0


In [20]:
del features, X
gc.collect()

3256

In [22]:
train_df, test_df = train_test_split(df,random_state=42)
train_X = train_df.drop(columns='Cover_Type')
train_y = train_df['Cover_Type']

test_X = test_df.drop(columns='Cover_Type')
test_y = test_df['Cover_Type']
test_X.head(2)

Unnamed: 0,index,Elevation,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Horizontal_Distance_To_Fire_Points,Aspect,Slope,Hillshade_9am,Hillshade_Noon,...,Soil_Type_4.STD(X.Elevation),Soil_Type_4.STD(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.STD(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.STD(X.Horizontal_Distance_To_Roadways),Soil_Type_4.STD(X.Vertical_Distance_To_Hydrology),Soil_Type_4.SUM(X.Elevation),Soil_Type_4.SUM(X.Horizontal_Distance_To_Fire_Points),Soil_Type_4.SUM(X.Horizontal_Distance_To_Hydrology),Soil_Type_4.SUM(X.Horizontal_Distance_To_Roadways),Soil_Type_4.SUM(X.Vertical_Distance_To_Hydrology)
250728,250728,3351.0,726.0,124.0,3813.0,2271.0,206.0,27.0,192.0,252.0,...,277.045517,1324.050751,212.689925,1558.361956,58.279989,1715981000.0,1149499000.0,156171328.0,1364632000.0,26848308.0
246788,246788,2732.0,212.0,1.0,1082.0,912.0,129.0,7.0,231.0,236.0,...,277.045517,1324.050751,212.689925,1558.361956,58.279989,1715981000.0,1149499000.0,156171328.0,1364632000.0,26848308.0


In [23]:
del df, train_df, test_df
gc.collect()

45

In [24]:
%%time
random_forest = RandomForestClassifier(n_estimators=500,oob_score=True)
random_forest.fit(train_X, train_y)
pred_y = random_forest.predict(test_X)
print(accuracy_score(pred_y,test_y))  # RF

0.9442352309418738
Wall time: 30min 31s


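Judging from these results, on this dataset both the expanded feature set and the expanded-then-filtered feature set perform worse than the original features. I also asked some friends, and their results were mediocre as well, yet the approach gets plenty of upvotes on Kaggle. If you have tried it on a dataset where it improved the score, please contact me.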
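
For a side-by-side view, the three test accuracies reported above can be compared with a few lines of Python. The numbers are copied from the cell outputs; the dictionary keys are just descriptive labels chosen for this sketch.

```python
# Test accuracies copied from the random forest runs above.
results = {
    "original features": 0.9673328605949619,
    "all 532 generated features": 0.9442352309418738,
    "36 selected features": 0.9439598493662782,
}

baseline = results["original features"]
best = max(results, key=results.get)

# Print each run with its gap to the baseline in percentage points.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    gap = (acc - baseline) * 100
    print(f"{name:<28} {acc:.4f}  ({gap:+.2f} pts vs. original)")
```

Both generated-feature runs land a little over two percentage points below the baseline.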