From 3174f5a5cfda739a3c2857620c248f3f4ec0d973 Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Mon, 30 Aug 2021 11:30:05 +0800
Subject: [PATCH] Add. Categorical Features

---
 ...re Engineering Techniques-checkpoint.ipynb | 30 ++++----
 .../Feature Engineering Techniques.ipynb      | 77 ++++++++++++++-----
 2 files changed, 72 insertions(+), 35 deletions(-)

diff --git a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
index 20c2e02..6b17c59 100644
--- a/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
+++ b/竞赛优胜技巧/.ipynb_checkpoints/Feature Engineering Techniques-checkpoint.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "278c7a1e",
+   "id": "8d942947",
    "metadata": {},
    "source": [
     "# Feature Engineering Techniques"
@@ -10,7 +10,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "67f256b4",
+   "id": "d08d515b",
    "metadata": {},
    "source": [
     "Adapted from: https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"
@@ -18,7 +18,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5a28bcf6",
+   "id": "5eb53e03",
    "metadata": {},
    "source": [
     "## About Encoding\n",
@@ -28,7 +28,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "c0edffa6",
+   "id": "eb00d32d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -40,7 +40,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3bd8a464",
+   "id": "83412ecb",
    "metadata": {},
    "source": [
     "## Handling NaN Values\n",
@@ -54,7 +54,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "e2c552c7",
+   "id": "8b093d6f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -63,7 +63,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fe85c377",
+   "id": "978c9dc6",
    "metadata": {},
    "source": [
     "This way LGBM no longer gives NaN special treatment; it handles NaN like any other number. Try both approaches and see which yields the highest CV score."
@@ -71,7 +71,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "05e77c5a",
+   "id": "31c076fc",
    "metadata": {},
    "source": [
     "## Label Encoding / Factorizing / Memory Reduction\n",
@@ -81,7 +81,7 @@
   {
    "cell_type": "code",
    "execution_count": 14,
-   "id": "554159aa",
+   "id": "ceef72c3",
    "metadata": {},
    "outputs": [
     {
@@ -157,7 +157,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e5bf12a9",
+   "id": "eca60e6f",
    "metadata": {},
    "source": [
     "Afterwards, you can cast the column to int8, int16, or int32 to reduce memory, depending on whether its max is below 128, below 32768, or neither."
@@ -166,7 +166,7 @@
   {
    "cell_type": "code",
    "execution_count": 21,
-   "id": "863fee6f",
+   "id": "fcd6f4e3",
    "metadata": {},
    "outputs": [
     {
@@ -196,7 +196,7 @@
   {
    "cell_type": "code",
    "execution_count": 22,
-   "id": "1a6bac81",
+   "id": "a40af2b8",
    "metadata": {},
    "outputs": [
     {
@@ -221,7 +221,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0951f3c7",
+   "id": "3728adee",
    "metadata": {},
    "source": [
     "Also, to reduce memory, people apply the popular memory_reduce function to the other columns.\n",
@@ -232,7 +232,7 @@
   {
    "cell_type": "code",
    "execution_count": 23,
-   "id": "88368fc6",
+   "id": "03948c52",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -244,7 +244,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "1ecd48ce",
+   "id": "bb624a66",
    "metadata": {},
    "outputs": [],
    "source": []
diff --git a/竞赛优胜技巧/Feature Engineering Techniques.ipynb b/竞赛优胜技巧/Feature Engineering Techniques.ipynb
index 20c2e02..9736043 100644
--- a/竞赛优胜技巧/Feature Engineering Techniques.ipynb
+++ b/竞赛优胜技巧/Feature Engineering Techniques.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "278c7a1e",
+   "id": "8d942947",
    "metadata": {},
    "source": [
     "# Feature Engineering Techniques"
@@ -10,7 +10,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "67f256b4",
+   "id": "d08d515b",
    "metadata": {},
    "source": [
     "Adapted from: https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575"
@@ -18,7 +18,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5a28bcf6",
+   "id": "5eb53e03",
    "metadata": {},
    "source": [
     "## About Encoding\n",
@@ -28,7 +28,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "c0edffa6",
+   "id": "eb00d32d",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -40,7 +40,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3bd8a464",
+   "id": "83412ecb",
    "metadata": {},
    "source": [
     "## Handling NaN Values\n",
@@ -54,7 +54,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "e2c552c7",
+   "id": "8b093d6f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -63,7 +63,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fe85c377",
+   "id": "978c9dc6",
    "metadata": {},
    "source": [
     "This way LGBM no longer gives NaN special treatment; it handles NaN like any other number. Try both approaches and see which yields the highest CV score."
@@ -71,7 +71,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "05e77c5a",
+   "id": "31c076fc",
    "metadata": {},
    "source": [
     "## Label Encoding / Factorizing / Memory Reduction\n",
@@ -80,8 +80,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
-   "id": "554159aa",
+   "execution_count": 1,
+   "id": "ceef72c3",
    "metadata": {},
    "outputs": [
     {
@@ -142,7 +142,7 @@
        "4 0"
       ]
      },
-     "execution_count": 14,
+     "execution_count": 1,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -157,7 +157,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e5bf12a9",
+   "id": "eca60e6f",
    "metadata": {},
    "source": [
     "Afterwards, you can cast the column to int8, int16, or int32 to reduce memory, depending on whether its max is below 128, below 32768, or neither."
@@ -165,8 +165,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 21,
-   "id": "863fee6f",
+   "execution_count": 2,
+   "id": "fcd6f4e3",
    "metadata": {},
    "outputs": [
     {
@@ -195,8 +195,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
-   "id": "1a6bac81",
+   "execution_count": 3,
+   "id": "a40af2b8",
    "metadata": {},
    "outputs": [
     {
@@ -221,7 +221,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "0951f3c7",
+   "id": "3728adee",
    "metadata": {},
    "source": [
     "Also, to reduce memory, people apply the popular memory_reduce function to the other columns.\n",
@@ -231,8 +231,8 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 23,
-   "id": "88368fc6",
+   "execution_count": 4,
+   "id": "03948c52",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -241,10 +241,47 @@
     "    if df[col].dtype=='int64': df[col] = df[col].astype('int32')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "f1d175ca",
+   "metadata": {},
+   "source": [
+    "## Categorical Features\n",
+    "For categorical variables, you can either tell LGBM that they are categorical (at the cost of more memory), or have LGBM treat them as numeric (in which case they must be label encoded first)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "333baf5e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<class 'pandas.core.frame.DataFrame'>\n",
+      "RangeIndex: 5 entries, 0 to 4\n",
+      "Data columns (total 1 columns):\n",
+      " #   Column  Non-Null Count  Dtype   \n",
+      "---  ------  --------------  -----   \n",
+      " 0   color   5 non-null      category\n",
+      "dtypes: category(1)\n",
+      "memory usage: 265.0 bytes\n"
+     ]
+    }
+   ],
+   "source": [
+    "df = pd.DataFrame(['green','blue','red','blue','green'],columns=['color'])\n",
+    "df['color'],_ = df['color'].factorize()\n",
+    "df['color'] = df['color'].astype('category') # Convert to a categorical feature and check memory usage (for comparison, int8 uses 133.0 bytes)\n",
+    "df.info()"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "1ecd48ce",
+   "id": "28f791bd",
    "metadata": {},
    "outputs": [],
    "source": []
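
For readers following the notebook in this patch, the label-encoding and downcasting steps it describes can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's code: the helper names `factorize` and `smallest_int_dtype` are hypothetical, and unlike pandas' `factorize` this version does not treat missing values specially.

```python
def factorize(values):
    """Label-encode values in order of first appearance.

    Hypothetical helper, similar in spirit to pandas' factorize,
    but with no special handling of missing values.
    """
    codes, uniques, index = [], [], {}
    for value in values:
        if value not in index:
            index[value] = len(uniques)  # next unused integer code
            uniques.append(value)
        codes.append(index[value])
    return codes, uniques


def smallest_int_dtype(max_code):
    """Name the narrowest signed integer dtype that can hold max_code.

    Hypothetical helper following the thresholds quoted in the notebook:
    below 128 fits int8, below 32768 fits int16, otherwise use int32.
    """
    if max_code < 128:
        return "int8"
    if max_code < 32768:
        return "int16"
    return "int32"


colors = ["green", "blue", "red", "blue", "green"]
codes, uniques = factorize(colors)
dtype = smallest_int_dtype(max(codes))
print(codes, uniques, dtype)
```

With only three distinct colors the largest code is 2, so the narrowest dtype suffices. Whether to then hand the column to LGBM as numeric or as categorical is the trade-off the new notebook cell measures with `df.info()`.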