From 7311b35c872496b265b2361fa7ffd305f4793a43 Mon Sep 17 00:00:00 2001
From: benjas <909336740@qq.com>
Date: Tue, 19 Jan 2021 16:22:16 +0800
Subject: [PATCH] Add. Making plans based on sample differences
---
...归-信用卡欺诈检测-checkpoint.ipynb | 371 +++++++++++++++++-
.../逻辑回归-信用卡欺诈检测.ipynb | 54 +++
2 files changed, 423 insertions(+), 2 deletions(-)
diff --git a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb
index 2fd6442..b2653f6 100644
--- a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb
@@ -1,6 +1,373 @@
{
- "cells": [],
- "metadata": {},
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
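+ "## Credit Card Fraud Detection\n",
+ "Using credit card transaction records, build a classification model that predicts which transactions are fraudulent and which are normal.\n",
+ "\n",
+ "My cleaned copy of the data: https://pan.baidu.com/s/18vPGelYCXGqp5OCWZWz36A (extraction code: de0f)\n",
+ "\n",
+ "Kaggle dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud#creditcard.csv\n",
+ "\n",
+ "Kesci dataset: https://www.kesci.com/mw/dataset/5b56a592fc7e9000103c0442"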
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Task objective:\n",
+ "Classify the transactions in the dataset as normal or fraudulent, and predict a 0/1 label for each record in the test data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
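+ "### Workflow:\n",
+ "* Load the data and look for problems\n",
+ "* Propose solutions to those problems\n",
+ "* Split the dataset\n",
+ "* Compare evaluation methods\n",
+ "* Logistic regression model\n",
+ "* Analyze the modeling results\n",
+ "* Compare the approaches\n",
+ "\n",
+ "### Main issues addressed:\n",
+ " (1) In this project we first inspect the data and find that the classes are imbalanced. Before starting any task, always check the data first, see what problems it has, and choose solutions that address them.\n",
+ " (2) We consider two methods here, undersampling and oversampling, and run them as parallel experiments. When a problem comes up, do not commit to a single path: without comparison there is no basis for improvement. The usual practice is to build a baseline model, compare the candidate methods against it, and keep the one that fits best. Thinking ahead and preparing several options before starting is what gives you results to choose from.\n",
+ " (3) Before modeling, the data needs preprocessing such as standardization and missing-value imputation. Because this dataset ships with its features already prepared, feature engineering is not covered here; later projects introduce it step by step. Data preprocessing is the most important and the most difficult stage of the whole task: the data sets the upper bound, and the model can only approach it.\n",
+ " (4) Choose the evaluation method first, then build the model. The goal of modeling is a good result, but the best result never comes on the first attempt and many iterations are needed, so a suitable evaluation method is essential. Common choices are the ROC curve, AUC, recall, and precision; a custom metric can also be defined for the problem at hand.\n",
+ " (5) Choose a suitable algorithm. Here we use logistic regression. It is used less often nowadays, but it remains a representative algorithm in finance, where its simplicity, derivability, and interpretability make it popular.\n",
+ " (6) Hyperparameter tuning also matters: different settings lead to different results. Later projects go into more tuning detail. Consult the library's API documentation to understand what each parameter means before choosing a value.\n",
+ " (7) Results must be judged against the real task. A model can look good offline (in development) and perform very differently after deployment, so a test environment is indispensable."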
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "D:\\Anaconda3\\lib\\importlib\\_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject\n",
+ " return f(*args, **kwds)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Import libraries\n",
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "# Render plots inline in the notebook (a trailing comment on the magic line would be parsed as an argument)\n",
+ "%matplotlib inline"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>Time</th>\n",
+ "      <th>V1</th>\n",
+ "      <th>V2</th>\n",
+ "      <th>V3</th>\n",
+ "      <th>V4</th>\n",
+ "      <th>V5</th>\n",
+ "      <th>V6</th>\n",
+ "      <th>V7</th>\n",
+ "      <th>V8</th>\n",
+ "      <th>V9</th>\n",
+ "      <th>...</th>\n",
+ "      <th>V21</th>\n",
+ "      <th>V22</th>\n",
+ "      <th>V23</th>\n",
+ "      <th>V24</th>\n",
+ "      <th>V25</th>\n",
+ "      <th>V26</th>\n",
+ "      <th>V27</th>\n",
+ "      <th>V28</th>\n",
+ "      <th>Amount</th>\n",
+ "      <th>Class</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>0</th>\n",
+ "      <td>0.0</td>\n",
+ "      <td>-1.359807</td>\n",
+ "      <td>-0.072781</td>\n",
+ "      <td>2.536347</td>\n",
+ "      <td>1.378155</td>\n",
+ "      <td>-0.338321</td>\n",
+ "      <td>0.462388</td>\n",
+ "      <td>0.239599</td>\n",
+ "      <td>0.098698</td>\n",
+ "      <td>0.363787</td>\n",
+ "      <td>...</td>\n",
+ "      <td>-0.018307</td>\n",
+ "      <td>0.277838</td>\n",
+ "      <td>-0.110474</td>\n",
+ "      <td>0.066928</td>\n",
+ "      <td>0.128539</td>\n",
+ "      <td>-0.189115</td>\n",
+ "      <td>0.133558</td>\n",
+ "      <td>-0.021053</td>\n",
+ "      <td>149.62</td>\n",
+ "      <td>0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>1</th>\n",
+ "      <td>0.0</td>\n",
+ "      <td>1.191857</td>\n",
+ "      <td>0.266151</td>\n",
+ "      <td>0.166480</td>\n",
+ "      <td>0.448154</td>\n",
+ "      <td>0.060018</td>\n",
+ "      <td>-0.082361</td>\n",
+ "      <td>-0.078803</td>\n",
+ "      <td>0.085102</td>\n",
+ "      <td>-0.255425</td>\n",
+ "      <td>...</td>\n",
+ "      <td>-0.225775</td>\n",
+ "      <td>-0.638672</td>\n",
+ "      <td>0.101288</td>\n",
+ "      <td>-0.339846</td>\n",
+ "      <td>0.167170</td>\n",
+ "      <td>0.125895</td>\n",
+ "      <td>-0.008983</td>\n",
+ "      <td>0.014724</td>\n",
+ "      <td>2.69</td>\n",
+ "      <td>0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>2</th>\n",
+ "      <td>1.0</td>\n",
+ "      <td>-1.358354</td>\n",
+ "      <td>-1.340163</td>\n",
+ "      <td>1.773209</td>\n",
+ "      <td>0.379780</td>\n",
+ "      <td>-0.503198</td>\n",
+ "      <td>1.800499</td>\n",
+ "      <td>0.791461</td>\n",
+ "      <td>0.247676</td>\n",
+ "      <td>-1.514654</td>\n",
+ "      <td>...</td>\n",
+ "      <td>0.247998</td>\n",
+ "      <td>0.771679</td>\n",
+ "      <td>0.909412</td>\n",
+ "      <td>-0.689281</td>\n",
+ "      <td>-0.327642</td>\n",
+ "      <td>-0.139097</td>\n",
+ "      <td>-0.055353</td>\n",
+ "      <td>-0.059752</td>\n",
+ "      <td>378.66</td>\n",
+ "      <td>0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>3</th>\n",
+ "      <td>1.0</td>\n",
+ "      <td>-0.966272</td>\n",
+ "      <td>-0.185226</td>\n",
+ "      <td>1.792993</td>\n",
+ "      <td>-0.863291</td>\n",
+ "      <td>-0.010309</td>\n",
+ "      <td>1.247203</td>\n",
+ "      <td>0.237609</td>\n",
+ "      <td>0.377436</td>\n",
+ "      <td>-1.387024</td>\n",
+ "      <td>...</td>\n",
+ "      <td>-0.108300</td>\n",
+ "      <td>0.005274</td>\n",
+ "      <td>-0.190321</td>\n",
+ "      <td>-1.175575</td>\n",
+ "      <td>0.647376</td>\n",
+ "      <td>-0.221929</td>\n",
+ "      <td>0.062723</td>\n",
+ "      <td>0.061458</td>\n",
+ "      <td>123.50</td>\n",
+ "      <td>0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>4</th>\n",
+ "      <td>2.0</td>\n",
+ "      <td>-1.158233</td>\n",
+ "      <td>0.877737</td>\n",
+ "      <td>1.548718</td>\n",
+ "      <td>0.403034</td>\n",
+ "      <td>-0.407193</td>\n",
+ "      <td>0.095921</td>\n",
+ "      <td>0.592941</td>\n",
+ "      <td>-0.270533</td>\n",
+ "      <td>0.817739</td>\n",
+ "      <td>...</td>\n",
+ "      <td>-0.009431</td>\n",
+ "      <td>0.798278</td>\n",
+ "      <td>-0.137458</td>\n",
+ "      <td>0.141267</td>\n",
+ "      <td>-0.206010</td>\n",
+ "      <td>0.502292</td>\n",
+ "      <td>0.219422</td>\n",
+ "      <td>0.215153</td>\n",
+ "      <td>69.99</td>\n",
+ "      <td>0</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>5 rows × 31 columns</p>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " Time V1 V2 V3 V4 V5 V6 V7 \\\n",
+ "0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n",
+ "1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 \n",
+ "2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 \n",
+ "3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 \n",
+ "4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 \n",
+ "\n",
+ " V8 V9 ... V21 V22 V23 V24 V25 \\\n",
+ "0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n",
+ "1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 \n",
+ "2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 \n",
+ "3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 \n",
+ "4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 \n",
+ "\n",
+ " V26 V27 V28 Amount Class \n",
+ "0 -0.189115 0.133558 -0.021053 149.62 0 \n",
+ "1 0.125895 -0.008983 0.014724 2.69 0 \n",
+ "2 -0.139097 -0.055353 -0.059752 378.66 0 \n",
+ "3 -0.221929 0.062723 0.061458 123.50 0 \n",
+ "4 0.502292 0.219422 0.215153 69.99 0 \n",
+ "\n",
+ "[5 rows x 31 columns]"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Load the data\n",
+ "data = pd.read_csv(\"data/creditcard.csv\")\n",
+ "data.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
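+ "### About the data:\n",
+ "The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. Because of confidentiality constraints, the original features and further background on the data are not provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise."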
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 284315\n",
+ "1 492\n",
+ "Name: Class, dtype: int64\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZcAAAETCAYAAAD6R0vDAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAaBElEQVR4nO3dfbQddX3v8ffHACo+gRIQAhgs0Yq2IkZkab1VqRBtLdglCr2V1EWl9cJq7XX1gq7eYm2x2FWlclVaqKkBHxDwiVpsimhrbVUIlvKoTUQkIRQi4ckHQPB7/5jfqZvDOSc7ZPY+OSfv11p77dnf+c3Mb/ZJzufMb2bPTlUhSVKfHjXbHZAkzT+GiySpd4aLJKl3hoskqXeGiySpd4aLJKl3hos0hCSLk1SSHbZwud9M8pVR9WuK7d2Y5JemmfeSJN8aV1+0fTNctE1pvxx/lOT7A4+9Zrtf80FV/UtVPXNz7ZK8I8lHxtEnzV+Gi7ZFr66qxw88NkxusKVHENo2+HPbfhgumhMGhqWOS3IT8MVWvyDJfyW5K8mXkzx7YJl/SvJbA68fMkTV1vc7SdYkuSPJB5KkzVuQ5C+SfC/JDcAvb6Z/+yT5VJKNSW5P8v5p2r0vybokdye5IslLBuYdnGR1m3drkve2+mOSfKSt984klyfZY4buHJjkqvaefCLJY9p6Xppk/cD2Tkpyc5J7knwryaFJlgFvB17fjhr/o7XdK8lFSTYlWZvkTQPreWySle09vD7J/5m0nRvbtq4CfpBkhyQnJ/l22/Z1SV4z6ef0r0lOb/t7Q5IXtfq6JLclWT7Tz0Ozz3DRXPOLwLOAw9vrzwNLgN2BbwAf3cL1/QrwAuC5wOsG1vumNu95wFLgtdOtIMkC4HPAd4HFwCLgvGmaXw4cCDwZ+BhwwcQvf+B9wPuq6onAzwDnt/py4EnAPsBTgN8BfjTDPr0OWAbsB/w88JtT9PmZwInAC6rqCW2/b6yqfwDeBXyiHTU+ty3ycWA9sBfde/GuJIe2eae0/X468ArgN6bo0zF0Ab1LVT0AfBt4SduvPwY+kmTPgfYvBK5q+/sxuvfzBcD+bf3vT/L4Gd4DzTLDRduiz7S/WO9M8plJ895RVT+oqh8BVNWKqrqnqu4D3gE8N8mTtmBbp1XVnVV1E/Alul/80P2C/suqWldVm4A/m2EdB9P90v2D1rd7q2rKk/hV9ZGqur2qHqiq9wCPBibOg/wY2D/JblX1/ar62kD9KcD+VfVgVV1RVXfP0J8zqmpD6/ffDezToAfbtg9IsmNV3VhV355qZUn2AX4BOKnt25XA3wBvaE1eB7yrqu6oqvXAGdP0ad3Az+2C1sefVNUngDV07+OE71TV31bVg8An6IL1nVV1X1X9I3A/XdBoG2W4aFt0ZFXt0h5HTpq3bmKiDV2d1oZX7gZubLN224Jt/dfA9A+Bib+G9xrcFt1RyXT2Ab7b/iKfUZK3tqGju5LcSfeX+0R/jwOeAXyzDX39SqufC6wCzkuyIcmfJ9nxEezTf6uqtcBb6AL5tiTnzXDhxF7Apqq6Z6D2XbojtIn5g+/V4PSUtSTHJrly4o8I4Dk89Od268D0RCBNrnnksg0zXDTXDN7G+9eBI4BfovslvbjV055/AOw80P6pW7CdW+hCY8K+M7RdB+y7uZPV7fzKSXR/6e9aVbsAd030t6rWVNUxdEN87wYuTPK4qvpxVf1xVR0AvIhuuO7YLdiXKVXVx6rqF4Cn0b2v756YNanpBuDJSZ4wUNsXuLlN3wLsPTBv8H37781NTCR5GnA23bDcU9r7cA0//blpHjBcNJc9AbgPuJ0uRN41af6VwK8l2TnJ/nRHBsM6H/jdJHsn2RU4eYa2l9H9gj0tyePaCfgXT9PfB4CNwA5J/gh44sTMJL+RZGFV/QS4s5UfTPKyJD/Xzu3cTTdM9uAW7MvDJHlmkpcneTRw
L92RwMQ6bwUWJ3kUQFWtA/4N+LO2bz9P915OnN86H3hbkl2TLKILjZk8ji5sNra+vJHuyEXziOGiuewcuuGZm4HrgK9Nmn863dj8rcBKtuxk/9l0Q1H/QXehwKema9jOC7ya7hzATXQnvl8/RdNVdBcg/Gfr9708dLhoGXBtku/Tndw/uqrupTviupAuWK4H/hnY2s+hPBo4Dfge3TDa7nRXiQFc0J5vT/KNNn0M3ZHhBuDTwClVdUmb9066ff4O8IXW1/um23BVXQe8B/gq3c/m54B/3cr90TYmflmYpD4leTNdMP7ibPdFs8cjF0lbJcmeSV6c5FHtEue30h3daDvmp2Ulba2dgL+m+1zNnXSfSfngrPZIs85hMUlS7xwWkyT1znCRJPXOcy7NbrvtVosXL57tbkjSnHLFFVd8r6oWTq4bLs3ixYtZvXr1bHdDkuaUJFPeGslhMUlS7wwXSVLvDBdJUu8MF0lS7wwXSVLvDBdJUu8MF0lS7wwXSVLv/BDlHLP45L+f7S7MKzee9suz3QVpXvLIRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1LuRhUuSfZJ8Kcn1Sa5N8nut/o4kNye5sj1eNbDM25KsTfKtJIcP1Je12tokJw/U90vy9SRrknwiyU6t/uj2em2bv3hU+ylJerhRHrk8ALy1qp4FHAKckOSANu/0qjqwPS4GaPOOBp4NLAM+mGRBkgXAB4BXAgcAxwys591tXUuAO4DjWv044I6q2h84vbWTJI3JyMKlqm6pqm+06XuA64FFMyxyBHBeVd1XVd8B1gIHt8faqrqhqu4HzgOOSBLg5cCFbfmVwJED61rZpi8EDm3tJUljMJZzLm1Y6nnA11vpxCRXJVmRZNdWWwSsG1hsfatNV38KcGdVPTCp/pB1tfl3tfaT+3V8ktVJVm/cuHGr9lGS9FMjD5ckjwc+Cbylqu4GzgR+BjgQuAV4z0TTKRavR1CfaV0PLVSdVVVLq2rpwoULZ9wPSdLwRhouSXakC5aPVtWnAKrq1qp6sKp+ApxNN+wF3ZHHPgOL7w1smKH+PWCXJDtMqj9kXW3+k4BN/e6dJGk6o7xaLMCHgOur6r0D9T0Hmr0GuKZNXwQc3a702g9YAlwGXA4saVeG7UR30v+iqirgS8Br2/LLgc8OrGt5m34t8MXWXpI0Bjtsvskj9mLgDcDVSa5stbfTXe11IN0w1Y3AbwNU1bVJzgeuo7vS7ISqehAgyYnAKmABsKKqrm3rOwk4L8mfAv9OF2a053OTrKU7Yjl6hPspSZpkZOFSVV9h6nMfF8+wzKnAqVPUL55quaq6gZ8Oqw3W7wWO2pL+SpL64yf0JUm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvRtZuCTZJ8mXklyf5Nokv9fqT05ySZI17XnXVk+SM5KsTXJVkoMG1rW8tV+TZPlA/flJrm7LnJEkM21DkjQeozxyeQB4a1U9CzgEOCHJAcDJwKVVtQS4tL0GeCWwpD2OB86ELiiAU4AXAgcDpwyExZmt7cRyy1p9um1IksZgZOFSVbdU1Tfa9D3A9cAi4AhgZWu2EjiyTR8BnFOdrwG7JNkTOBy4pKo2VdUdwCXAsjbviVX11aoq4JxJ65pqG5KkMRjLOZcki4HnAV8H9qiqW6ALIGD31mwRsG5gsfWtNlN9/RR1ZtiGJGkMRh4uSR4PfBJ4S1XdPVPTKWr1COpb0rfjk6xOsnrjxo1bsqgkaQYj
DZckO9IFy0er6lOtfGsb0qI939bq64F9BhbfG9iwmfreU9Rn2sZDVNVZVbW0qpYuXLjwke2kJOlhRnm1WIAPAddX1XsHZl0ETFzxtRz47ED92HbV2CHAXW1IaxVwWJJd24n8w4BVbd49SQ5p2zp20rqm2oYkaQx2GOG6Xwy8Abg6yZWt9nbgNOD8JMcBNwFHtXkXA68C1gI/BN4IUFWbkvwJcHlr986q2tSm3wx8GHgs8Pn2YIZtSJLGYGThUlVfYerzIgCHTtG+gBOmWdcKYMUU9dXAc6ao3z7VNiRJ4+En9CVJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvRsqXJI87LMkkiRNZ9gjl79KclmS/5Vkl5H2SJI05w0VLlX1C8D/pLuB5OokH0vyipH2TJI0Zw19zqWq1gB/CJwE/CJwRpJvJvm1UXVOkjQ3DXvO5eeTnE73bZIvB17dvr745cDpI+yfJGkOGvbGle8HzgbeXlU/mihW1YYkfziSnkmS5qxhw+VVwI+q6kGAJI8CHlNVP6yqc0fWO0nSnDTsOZcv0H1nyoSdW02SpIcZNlweU1Xfn3jRpnceTZckSXPdsOHygyQHTbxI8nzgRzO0lyRtx4Y95/IW4IIkG9rrPYHXj6ZLkqS5bqhwqarLk/ws8Ey6ry7+ZlX9eKQ9kyTNWcMeuQC8AFjclnleEqrqnJH0SpI0pw0VLknOBX4GuBJ4sJULMFwkSQ8z7JHLUuCAqqpRdkaSND8Me7XYNcBTR9kRSdL8MeyRy27AdUkuA+6bKFbVr46kV5KkOW3YcHnHKDshSZpfhr0U+Z+TPA1YUlVfSLIzsGC0XZMkzVXD3nL/TcCFwF+30iLgM6PqlCRpbhv2hP4JwIuBu+G/vzhs95kWSLIiyW1JrhmovSPJzUmubI9XDcx7W5K1Sb6V5PCB+rJWW5vk5IH6fkm+nmRNkk8k2anVH91er23zFw+5j5KkngwbLvdV1f0TL5LsQPc5l5l8GFg2Rf30qjqwPS5u6zsAOBp4dlvmg0kWJFkAfAB4JXAAcExrC/Dutq4lwB3Aca1+HHBHVe1P90Vm7x5yHyVJPRk2XP45yduBxyZ5BXAB8HczLVBVXwY2Dbn+I4Dzquq+qvoOsBY4uD3WVtUNLdzOA45IErpvwbywLb8SOHJgXSvb9IXAoa29JGlMhg2Xk4GNwNXAbwMXA4/0GyhPTHJVGzbbtdUWAesG2qxvtenqTwHurKoHJtUfsq42/67WXpI0JkOFS1X9pKrOrqqjquq1bfqRfFr/TLrbyBwI3AK8p9WnOrKoR1CfaV0Pk+T4JKuTrN64ceNM/ZYkbYFh7y32Hab4BV1VT9+SjVXVrQPrPBv4XHu5HthnoOnewMTt/aeqfw/YJckO7ehksP3Euta3c0NPYprhuao6CzgLYOnSpd7aRpJ6siX3FpvwGOAo4MlburEke1bVLe3la+huKwNwEfCxJO8F9gKWAJfRHYUsSbIfcDPdSf9fr6pK8iXgtXTnYZYDnx1Y13Lgq23+F70nmiSN17Aforx9Uukvk3wF+KPplknyceClwG5J1gOnAC9NciDdUdCNdOdvqKprk5wPXAc8AJxQVQ+29ZwIrKL70OaKqrq2beIk4Lwkfwr8O/ChVv8QcG6StXRHLEcPs4+SpP4MOyx20MDLR9EdyTxhpmWq6pgpyh+aojbR/lTg1CnqF9NdQDC5fgPd1WST6/fSHVlJkmbJsMNi7xmYfoDuqON1vfdGkjQvDDss9rJRd0SSNH8MOyz2v2eaX1Xv7ac7kqT5YEuuFnsB3ZVYAK8GvsxDP+AoSRKwZV8WdlBV3QPdDSiBC6rqt0bVMUnS3DXs7V/2Be4feH0/sLj33kiS5oVhj1zOBS5L8mm6z6i8BjhnZL2SJM1pw14tdmqSzwMvaaU3VtW/j65bkqS5bNhhMYCdgbur6n109+3ab0R9kiTNccN+zfEpdLdbeVsr7Qh8ZFSdkiTN
bcMeubwG+FXgBwBVtYHN3P5FkrT9GjZc7m93Fi6AJI8bXZckSXPdsOFyfpK/pvsOlTcBXwDOHl23JElz2bBXi/1FklcAdwPPBP6oqi4Zac8kSXPWZsMlyQJgVVX9EmCgSJI2a7PDYu1Lu36Y5Elj6I8kaR4Y9hP69wJXJ7mEdsUYQFX97kh6JUma04YNl79vD0mSNmvGcEmyb1XdVFUrx9UhSdLct7lzLp+ZmEjyyRH3RZI0T2wuXDIw/fRRdkSSNH9sLlxqmmlJkqa1uRP6z01yN90RzGPbNO11VdUTR9o7SdKcNGO4VNWCcXVEkjR/bMn3uUiSNBTDRZLUO8NFktQ7w0WS1LuRhUuSFUluS3LNQO3JSS5JsqY979rqSXJGkrVJrkpy0MAyy1v7NUmWD9Sfn+TqtswZSTLTNiRJ4zPKI5cPA8sm1U4GLq2qJcCl7TXAK4El7XE8cCZ0QQGcArwQOBg4ZSAszmxtJ5ZbtpltSJLGZGThUlVfBjZNKh8BTNynbCVw5ED9nOp8je4bL/cEDgcuqapNVXUH3ffJLGvznlhVX21fv3zOpHVNtQ1J0piM+5zLHlV1C0B73r3VFwHrBtqtb7WZ6uunqM+0DUnSmGwrJ/QzRa0eQX3LNpocn2R1ktUbN27c0sUlSdMYd7jc2oa0aM+3tfp6YJ+BdnsDGzZT33uK+kzbeJiqOquqllbV0oULFz7inZIkPdS4w+UiYOKKr+XAZwfqx7arxg4B7mpDWquAw5Ls2k7kHwasavPuSXJIu0rs2EnrmmobkqQxGfabKLdYko8DLwV2S7Ke7qqv04DzkxwH3AQc1ZpfDLwKWAv8EHgjQFVtSvInwOWt3TurauIigTfTXZH2WODz7cEM25AkjcnIwqWqjplm1qFTtC3ghGnWswJYMUV9NfCcKeq3T7UNSdL4bCsn9CVJ84jhIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSerdrIRLkhuTXJ3kyiSrW+3JSS5JsqY979rqSXJGkrVJrkpy0MB6lrf2a5IsH6g/v61/bVs2499LSdp+zeaRy8uq6sCqWtpenwxcWlVLgEvba4BXAkva43jgTOjCCDgFeCFwMHDKRCC1NscPLLds9LsjSZqwLQ2LHQGsbNMrgSMH6udU52vALkn2BA4HLqmqTVV1B3AJsKzNe2JVfbWqCjhnYF2SpDGYrXAp4B+TXJHk+Fbbo6puAWjPu7f6ImDdwLLrW22m+vop6pKkMdlhlrb74qrakGR34JIk35yh7VTnS+oR1B++4i7YjgfYd999Z+6xJGlos3LkUlUb2vNtwKfpzpnc2oa0aM+3tebrgX0GFt8b2LCZ+t5T1Kfqx1lVtbSqli5cuHBrd0uS1Iw9XJI8LskTJqaBw4BrgIuAiSu+lgOfbdMXAce2q8YOAe5qw2argMOS7NpO5B8GrGrz7klySLtK7NiBdUmSxmA2hsX2AD7drg7eAfhYVf1DksuB85McB9wEHNXaXwy8ClgL/BB4I0BVbUryJ8Dlrd07q2pTm34z8GHgscDn20OSNCZjD5equgF47hT124FDp6gXcMI061oBrJiivhp4zlZ3VpL0iGxLlyJLkuYJw0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktS7eRsuSZYl+VaStUlOnu3+SNL2ZF6GS5IFwAeAVwIHAMckOWB2eyVJ2495GS7AwcDaqrqhqu4HzgOO
mOU+SdJ2Y4fZ7sCILALWDbxeD7xwcqMkxwPHt5ffT/KtMfRte7Eb8L3Z7sTm5N2z3QPNgjnxb3MOedpUxfkaLpmiVg8rVJ0FnDX67mx/kqyuqqWz3Q9pMv9tjsd8HRZbD+wz8HpvYMMs9UWStjvzNVwuB5Yk2S/JTsDRwEWz3CdJ2m7My2GxqnogyYnAKmABsKKqrp3lbm1vHG7Utsp/m2OQqoedipAkaavM12ExSdIsMlwkSb0zXCRJvZuXJ/Q1Xkl+lu4OCIvoPk+0Abioqq6f1Y5JmjUeuWirJDmJ7vY6AS6juww8wMe9Yai2ZUneONt9mM+8WkxbJcl/As+uqh9Pqu8EXFtVS2anZ9LMktxUVfvOdj/mK4fFtLV+AuwFfHdSfc82T5o1Sa6abhawxzj7sr0xXLS13gJcmmQNP71Z6L7A/sCJs9YrqbMHcDhwx6R6gH8bf3e2H4aLtkpV/UOSZ9B9zcEiuv+064HLq+rBWe2cBJ8DHl9VV06ekeSfxt+d7YfnXCRJvfNqMUlS7wwXSVLvDBdpFiR5apLzknw7yXVJLk7yjCTXzHbfpD54Ql8asyQBPg2srKqjW+1AvDRW84hHLtL4vQz4cVX91UShXc00cSk3SRYn+Zck32iPF7X6nkm+nOTKJNckeUmSBUk+3F5fneT3x79L0kN55CKN33OAKzbT5jbgFVV1b5IlwMeBpcCvA6uq6tQkC4CdgQOBRVX1HIAku4yu69JwDBdp27Qj8P42XPYg8IxWvxxYkWRH4DNVdWWSG4CnJ/l/wN8D/zgrPZYGOCwmjd+1wPM30+b3gVuB59IdsewEUFVfBv4HcDNwbpJjq+qO1u6fgBOAvxlNt6XhGS7S+H0ReHSSN00UkrwAeNpAmycBt1TVT4A3AAtau6cBt1XV2cCHgIOS7AY8qqo+Cfxf4KDx7IY0PYfFpDGrqkryGuAv29cS3AvcSHeftgkfBD6Z5CjgS8APWv2lwB8k+THwfeBYutvu/G2SiT8W3zbynZA2w9u/SJJ657CYJKl3hoskqXeGiySpd4aLJKl3hoskqXeGiySpd4aLJKl3hoskqXf/H9Km0Sac++BJAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Distribution of the class labels\n",
+ "count_classes = data['Class'].value_counts(sort=True).sort_index() # number of samples in each class\n",
+ "count_classes.plot(kind='bar') # bar chart of the class counts\n",
+ "plt.title(\"Fraud class histogram\")\n",
+ "plt.xlabel(\"Class\")\n",
+ "plt.ylabel(\"Frequency\")\n",
+ "print(count_classes)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
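+ "The positive and negative classes are clearly imbalanced: there are only 492 positive samples (class 1) against roughly 280,000 negative ones. If we train on the data as-is, the model can easily learn that predicting every sample as negative already gives an accuracy of about 99.8% (284315/284807).\n",
+ "\n",
+ "We must not let the model learn this kind of shortcut.\n",
+ "\n",
+ "There are two ways to address this:\n",
+ "* Make the 1s as numerous as the 0s, i.e. bring class 1 up to roughly 280,000 samples (oversampling).\n",
+ "* Make the 0s as few as the 1s, i.e. keep only 492 of the 280,000 samples (undersampling).\n",
+ "\n",
+ "Comparing the two options:\n",
+ "* The first requires generating additional data. That data is synthetic, and synthetic samples can distort what the model learns, so performance on real data may suffer.\n",
+ "* The second discards real data, leaving the model less to learn from, which can also weaken it."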
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.3"
+ }
+ },
"nbformat": 4,
"nbformat_minor": 2
}
diff --git a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb
index 215025f..b2653f6 100644
--- a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb
+++ b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb
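@@ -287,6 +287,60 @@
 "The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. Because of confidentiality constraints, the original features and further background on the data are not provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise."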
]
},
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "0 284315\n",
+ "1 492\n",
+ "Name: Class, dtype: int64\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZcAAAETCAYAAAD6R0vDAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjAsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+17YcXAAAaBElEQVR4nO3dfbQddX3v8ffHACo+gRIQAhgs0Yq2IkZkab1VqRBtLdglCr2V1EWl9cJq7XX1gq7eYm2x2FWlclVaqKkBHxDwiVpsimhrbVUIlvKoTUQkIRQi4ckHQPB7/5jfqZvDOSc7ZPY+OSfv11p77dnf+c3Mb/ZJzufMb2bPTlUhSVKfHjXbHZAkzT+GiySpd4aLJKl3hoskqXeGiySpd4aLJKl3hos0hCSLk1SSHbZwud9M8pVR9WuK7d2Y5JemmfeSJN8aV1+0fTNctE1pvxx/lOT7A4+9Zrtf80FV/UtVPXNz7ZK8I8lHxtEnzV+Gi7ZFr66qxw88NkxusKVHENo2+HPbfhgumhMGhqWOS3IT8MVWvyDJfyW5K8mXkzx7YJl/SvJbA68fMkTV1vc7SdYkuSPJB5KkzVuQ5C+SfC/JDcAvb6Z/+yT5VJKNSW5P8v5p2r0vybokdye5IslLBuYdnGR1m3drkve2+mOSfKSt984klyfZY4buHJjkqvaefCLJY9p6Xppk/cD2Tkpyc5J7knwryaFJlgFvB17fjhr/o7XdK8lFSTYlWZvkTQPreWySle09vD7J/5m0nRvbtq4CfpBkhyQnJ/l22/Z1SV4z6ef0r0lOb/t7Q5IXtfq6JLclWT7Tz0Ozz3DRXPOLwLOAw9vrzwNLgN2BbwAf3cL1/QrwAuC5wOsG1vumNu95wFLgtdOtIMkC4HPAd4HFwCLgvGmaXw4cCDwZ+BhwwcQvf+B9wPuq6onAzwDnt/py4EnAPsBTgN8BfjTDPr0OWAbsB/w88JtT9PmZwInAC6rqCW2/b6yqfwDeBXyiHTU+ty3ycWA9sBfde/GuJIe2eae0/X468ArgN6bo0zF0Ab1LVT0AfBt4SduvPwY+kmTPgfYvBK5q+/sxuvfzBcD+bf3vT/L4Gd4DzTLDRduiz7S/WO9M8plJ895RVT+oqh8BVNWKqrqnqu4D3gE8N8mTtmBbp1XVnVV1E/Alul/80P2C/suqWldVm4A/m2EdB9P90v2D1rd7q2rKk/hV9ZGqur2qHqiq9wCPBibOg/wY2D/JblX1/ar62kD9KcD+VfVgVV1RVXfP0J8zqmpD6/ffDezToAfbtg9IsmNV3VhV355qZUn2AX4BOKnt25XA3wBvaE1eB7yrqu6oqvXAGdP0ad3Az+2C1sefVNUngDV07+OE71TV31bVg8An6IL1nVV1X1X9I3A/XdBoG2W4aFt0ZFXt0h5HTpq3bmKiDV2d1oZX7gZubLN224Jt/dfA9A+Bib+G9xrcFt1RyXT2Ab7b/iKfUZK3tqGju5LcSfeX+0R/jwOeAXyzDX39SqufC6wCzkuyIcmfJ9nxEezTf6uqtcBb6AL5tiTnzXDhxF7Apqq6Z6D2XbojtIn5g+/V4PSUtSTHJrly4o8I4Dk89Od268D0RCBNrnnksg0zXDTXDN7G+9eBI4BfovslvbjV055/AOw80P6pW7CdW+hCY8K+M7RdB+y7uZPV7fzKSXR/6e9aVbsAd030t6rWVNUxdEN87wYuTPK4qvpxVf1xVR0AvIhuuO7YLdiXKVXVx6rqF4Cn0b2v756YNanpBuDJSZ4wUNsXuLlN3wLsPTBv8H37781NTCR5GnA23bDcU9r7cA0//blpHjBcNJc9AbgPuJ0uRN41af6VwK8l2TnJ/nRHBsM6H/jdJHsn2RU4eYa2l9H9gj0tyePaCfgXT9PfB4CNwA5J/gh44sTMJL+RZGFV/QS4s5UfTPKyJD/Xzu3cTTdM9uAW7MvDJHlmkpcneTRw
L92RwMQ6bwUWJ3kUQFWtA/4N+LO2bz9P915OnN86H3hbkl2TLKILjZk8ji5sNra+vJHuyEXziOGiuewcuuGZm4HrgK9Nmn863dj8rcBKtuxk/9l0Q1H/QXehwKema9jOC7ya7hzATXQnvl8/RdNVdBcg/Gfr9708dLhoGXBtku/Tndw/uqrupTviupAuWK4H/hnY2s+hPBo4Dfge3TDa7nRXiQFc0J5vT/KNNn0M3ZHhBuDTwClVdUmb9066ff4O8IXW1/um23BVXQe8B/gq3c/m54B/3cr90TYmflmYpD4leTNdMP7ibPdFs8cjF0lbJcmeSV6c5FHtEue30h3daDvmp2Ulba2dgL+m+1zNnXSfSfngrPZIs85hMUlS7xwWkyT1znCRJPXOcy7NbrvtVosXL57tbkjSnHLFFVd8r6oWTq4bLs3ixYtZvXr1bHdDkuaUJFPeGslhMUlS7wwXSVLvDBdJUu8MF0lS7wwXSVLvDBdJUu8MF0lS7wwXSVLv/BDlHLP45L+f7S7MKzee9suz3QVpXvLIRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1LuRhUuSfZJ8Kcn1Sa5N8nut/o4kNye5sj1eNbDM25KsTfKtJIcP1Je12tokJw/U90vy9SRrknwiyU6t/uj2em2bv3hU+ylJerhRHrk8ALy1qp4FHAKckOSANu/0qjqwPS4GaPOOBp4NLAM+mGRBkgXAB4BXAgcAxwys591tXUuAO4DjWv044I6q2h84vbWTJI3JyMKlqm6pqm+06XuA64FFMyxyBHBeVd1XVd8B1gIHt8faqrqhqu4HzgOOSBLg5cCFbfmVwJED61rZpi8EDm3tJUljMJZzLm1Y6nnA11vpxCRXJVmRZNdWWwSsG1hsfatNV38KcGdVPTCp/pB1tfl3tfaT+3V8ktVJVm/cuHGr9lGS9FMjD5ckjwc+Cbylqu4GzgR+BjgQuAV4z0TTKRavR1CfaV0PLVSdVVVLq2rpwoULZ9wPSdLwRhouSXakC5aPVtWnAKrq1qp6sKp+ApxNN+wF3ZHHPgOL7w1smKH+PWCXJDtMqj9kXW3+k4BN/e6dJGk6o7xaLMCHgOur6r0D9T0Hmr0GuKZNXwQc3a702g9YAlwGXA4saVeG7UR30v+iqirgS8Br2/LLgc8OrGt5m34t8MXWXpI0Bjtsvskj9mLgDcDVSa5stbfTXe11IN0w1Y3AbwNU1bVJzgeuo7vS7ISqehAgyYnAKmABsKKqrm3rOwk4L8mfAv9OF2a053OTrKU7Yjl6hPspSZpkZOFSVV9h6nMfF8+wzKnAqVPUL55quaq6gZ8Oqw3W7wWO2pL+SpL64yf0JUm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvRtZuCTZJ8mXklyf5Nokv9fqT05ySZI17XnXVk+SM5KsTXJVkoMG1rW8tV+TZPlA/flJrm7LnJEkM21DkjQeozxyeQB4a1U9CzgEOCHJAcDJwKVVtQS4tL0GeCWwpD2OB86ELiiAU4AXAgcDpwyExZmt7cRyy1p9um1IksZgZOFSVbdU1Tfa9D3A9cAi4AhgZWu2EjiyTR8BnFOdrwG7JNkTOBy4pKo2VdUdwCXAsjbviVX11aoq4JxJ65pqG5KkMRjLOZcki4HnAV8H9qiqW6ALIGD31mwRsG5gsfWtNlN9/RR1ZtiGJGkMRh4uSR4PfBJ4S1XdPVPTKWr1COpb0rfjk6xOsnrjxo1bsqgkaQYj
DZckO9IFy0er6lOtfGsb0qI939bq64F9BhbfG9iwmfreU9Rn2sZDVNVZVbW0qpYuXLjwke2kJOlhRnm1WIAPAddX1XsHZl0ETFzxtRz47ED92HbV2CHAXW1IaxVwWJJd24n8w4BVbd49SQ5p2zp20rqm2oYkaQx2GOG6Xwy8Abg6yZWt9nbgNOD8JMcBNwFHtXkXA68C1gI/BN4IUFWbkvwJcHlr986q2tSm3wx8GHgs8Pn2YIZtSJLGYGThUlVfYerzIgCHTtG+gBOmWdcKYMUU9dXAc6ao3z7VNiRJ4+En9CVJvTNcJEm9M1wkSb0zXCRJvTNcJEm9M1wkSb0zXCRJvRsqXJI87LMkkiRNZ9gjl79KclmS/5Vkl5H2SJI05w0VLlX1C8D/pLuB5OokH0vyipH2TJI0Zw19zqWq1gB/CJwE/CJwRpJvJvm1UXVOkjQ3DXvO5eeTnE73bZIvB17dvr745cDpI+yfJGkOGvbGle8HzgbeXlU/mihW1YYkfziSnkmS5qxhw+VVwI+q6kGAJI8CHlNVP6yqc0fWO0nSnDTsOZcv0H1nyoSdW02SpIcZNlweU1Xfn3jRpnceTZckSXPdsOHygyQHTbxI8nzgRzO0lyRtx4Y95/IW4IIkG9rrPYHXj6ZLkqS5bqhwqarLk/ws8Ey6ry7+ZlX9eKQ9kyTNWcMeuQC8AFjclnleEqrqnJH0SpI0pw0VLknOBX4GuBJ4sJULMFwkSQ8z7JHLUuCAqqpRdkaSND8Me7XYNcBTR9kRSdL8MeyRy27AdUkuA+6bKFbVr46kV5KkOW3YcHnHKDshSZpfhr0U+Z+TPA1YUlVfSLIzsGC0XZMkzVXD3nL/TcCFwF+30iLgM6PqlCRpbhv2hP4JwIuBu+G/vzhs95kWSLIiyW1JrhmovSPJzUmubI9XDcx7W5K1Sb6V5PCB+rJWW5vk5IH6fkm+nmRNkk8k2anVH91er23zFw+5j5KkngwbLvdV1f0TL5LsQPc5l5l8GFg2Rf30qjqwPS5u6zsAOBp4dlvmg0kWJFkAfAB4JXAAcExrC/Dutq4lwB3Aca1+HHBHVe1P90Vm7x5yHyVJPRk2XP45yduBxyZ5BXAB8HczLVBVXwY2Dbn+I4Dzquq+qvoOsBY4uD3WVtUNLdzOA45IErpvwbywLb8SOHJgXSvb9IXAoa29JGlMhg2Xk4GNwNXAbwMXA4/0GyhPTHJVGzbbtdUWAesG2qxvtenqTwHurKoHJtUfsq42/67WXpI0JkOFS1X9pKrOrqqjquq1bfqRfFr/TLrbyBwI3AK8p9WnOrKoR1CfaV0Pk+T4JKuTrN64ceNM/ZYkbYFh7y32Hab4BV1VT9+SjVXVrQPrPBv4XHu5HthnoOnewMTt/aeqfw/YJckO7ehksP3Euta3c0NPYprhuao6CzgLYOnSpd7aRpJ6siX3FpvwGOAo4MlburEke1bVLe3la+huKwNwEfCxJO8F9gKWAJfRHYUsSbIfcDPdSf9fr6pK8iXgtXTnYZYDnx1Y13Lgq23+F70nmiSN17Aforx9Uukvk3wF+KPplknyceClwG5J1gOnAC9NciDdUdCNdOdvqKprk5wPXAc8AJxQVQ+29ZwIrKL70OaKqrq2beIk4Lwkfwr8O/ChVv8QcG6StXRHLEcPs4+SpP4MOyx20MDLR9EdyTxhpmWq6pgpyh+aojbR/lTg1CnqF9NdQDC5fgPd1WST6/fSHVlJkmbJsMNi7xmYfoDuqON1vfdGkjQvDDss9rJRd0SSNH8MOyz2v2eaX1Xv7ac7kqT5YEuuFnsB3ZVYAK8GvsxDP+AoSRKwZV8WdlBV3QPdDSiBC6rqt0bVMUnS3DXs7V/2Be4feH0/sLj33kiS5oVhj1zOBS5L8mm6z6i8BjhnZL2SJM1pw14tdmqSzwMvaaU3VtW/j65bkqS5bNhhMYCdgbur6n109+3ab0R9kiTNccN+zfEpdLdbeVsr7Qh8ZFSdkiTN
bcMeubwG+FXgBwBVtYHN3P5FkrT9GjZc7m93Fi6AJI8bXZckSXPdsOFyfpK/pvsOlTcBXwDOHl23JElz2bBXi/1FklcAdwPPBP6oqi4Zac8kSXPWZsMlyQJgVVX9EmCgSJI2a7PDYu1Lu36Y5Elj6I8kaR4Y9hP69wJXJ7mEdsUYQFX97kh6JUma04YNl79vD0mSNmvGcEmyb1XdVFUrx9UhSdLct7lzLp+ZmEjyyRH3RZI0T2wuXDIw/fRRdkSSNH9sLlxqmmlJkqa1uRP6z01yN90RzGPbNO11VdUTR9o7SdKcNGO4VNWCcXVEkjR/bMn3uUiSNBTDRZLUO8NFktQ7w0WS1LuRhUuSFUluS3LNQO3JSS5JsqY979rqSXJGkrVJrkpy0MAyy1v7NUmWD9Sfn+TqtswZSTLTNiRJ4zPKI5cPA8sm1U4GLq2qJcCl7TXAK4El7XE8cCZ0QQGcArwQOBg4ZSAszmxtJ5ZbtpltSJLGZGThUlVfBjZNKh8BTNynbCVw5ED9nOp8je4bL/cEDgcuqapNVXUH3ffJLGvznlhVX21fv3zOpHVNtQ1J0piM+5zLHlV1C0B73r3VFwHrBtqtb7WZ6uunqM+0DUnSmGwrJ/QzRa0eQX3LNpocn2R1ktUbN27c0sUlSdMYd7jc2oa0aM+3tfp6YJ+BdnsDGzZT33uK+kzbeJiqOquqllbV0oULFz7inZIkPdS4w+UiYOKKr+XAZwfqx7arxg4B7mpDWquAw5Ls2k7kHwasavPuSXJIu0rs2EnrmmobkqQxGfabKLdYko8DLwV2S7Ke7qqv04DzkxwH3AQc1ZpfDLwKWAv8EHgjQFVtSvInwOWt3TurauIigTfTXZH2WODz7cEM25AkjcnIwqWqjplm1qFTtC3ghGnWswJYMUV9NfCcKeq3T7UNSdL4bCsn9CVJ84jhIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSeqd4SJJ6p3hIknqneEiSerdrIRLkhuTXJ3kyiSrW+3JSS5JsqY979rqSXJGkrVJrkpy0MB6lrf2a5IsH6g/v61/bVs2499LSdp+zeaRy8uq6sCqWtpenwxcWlVLgEvba4BXAkva43jgTOjCCDgFeCFwMHDKRCC1NscPLLds9LsjSZqwLQ2LHQGsbNMrgSMH6udU52vALkn2BA4HLqmqTVV1B3AJsKzNe2JVfbWqCjhnYF2SpDGYrXAp4B+TXJHk+Fbbo6puAWjPu7f6ImDdwLLrW22m+vop6pKkMdlhlrb74qrakGR34JIk35yh7VTnS+oR1B++4i7YjgfYd999Z+6xJGlos3LkUlUb2vNtwKfpzpnc2oa0aM+3tebrgX0GFt8b2LCZ+t5T1Kfqx1lVtbSqli5cuHBrd0uS1Iw9XJI8LskTJqaBw4BrgIuAiSu+lgOfbdMXAce2q8YOAe5qw2argMOS7NpO5B8GrGrz7klySLtK7NiBdUmSxmA2hsX2AD7drg7eAfhYVf1DksuB85McB9wEHNXaXwy8ClgL/BB4I0BVbUryJ8Dlrd07q2pTm34z8GHgscDn20OSNCZjD5equgF47hT124FDp6gXcMI061oBrJiivhp4zlZ3VpL0iGxLlyJLkuYJw0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktQ7w0WS1DvDRZLUO8NFktS7eRsuSZYl+VaStUlOnu3+SNL2ZF6GS5IFwAeAVwIHAMckOWB2eyVJ2495GS7AwcDaqrqhqu4HzgOO
mOU+SdJ2Y4fZ7sCILALWDbxeD7xwcqMkxwPHt5ffT/KtMfRte7Eb8L3Z7sTm5N2z3QPNgjnxb3MOedpUxfkaLpmiVg8rVJ0FnDX67mx/kqyuqqWz3Q9pMv9tjsd8HRZbD+wz8HpvYMMs9UWStjvzNVwuB5Yk2S/JTsDRwEWz3CdJ2m7My2GxqnogyYnAKmABsKKqrp3lbm1vHG7Utsp/m2OQqoedipAkaavM12ExSdIsMlwkSb0zXCRJvZuXJ/Q1Xkl+lu4OCIvoPk+0Abioqq6f1Y5JmjUeuWirJDmJ7vY6AS6juww8wMe9Yai2ZUneONt9mM+8WkxbJcl/As+uqh9Pqu8EXFtVS2anZ9LMktxUVfvOdj/mK4fFtLV+AuwFfHdSfc82T5o1Sa6abhawxzj7sr0xXLS13gJcmmQNP71Z6L7A/sCJs9YrqbMHcDhwx6R6gH8bf3e2H4aLtkpV/UOSZ9B9zcEiuv+064HLq+rBWe2cBJ8DHl9VV06ekeSfxt+d7YfnXCRJvfNqMUlS7wwXSVLvDBdpFiR5apLzknw7yXVJLk7yjCTXzHbfpD54Ql8asyQBPg2srKqjW+1AvDRW84hHLtL4vQz4cVX91UShXc00cSk3SRYn+Zck32iPF7X6nkm+nOTKJNckeUmSBUk+3F5fneT3x79L0kN55CKN33OAKzbT5jbgFVV1b5IlwMeBpcCvA6uq6tQkC4CdgQOBRVX1HIAku4yu69JwDBdp27Qj8P42XPYg8IxWvxxYkWRH4DNVdWWSG4CnJ/l/wN8D/zgrPZYGOCwmjd+1wPM30+b3gVuB59IdsewEUFVfBv4HcDNwbpJjq+qO1u6fgBOAvxlNt6XhGS7S+H0ReHSSN00UkrwAeNpAmycBt1TVT4A3AAtau6cBt1XV2cCHgIOS7AY8qqo+Cfxf4KDx7IY0PYfFpDGrqkryGuAv29cS3AvcSHeftgkfBD6Z5CjgS8APWv2lwB8k+THwfeBYutvu/G2SiT8W3zbynZA2w9u/SJJ657CYJKl3hoskqXeGiySpd4aLJKl3hoskqXeGiySpd4aLJKl3hoskqXf/H9Km0Sac++BJAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Distribution of the class labels\n",
+ "count_classes = data['Class'].value_counts(sort=True).sort_index() # number of samples in each class\n",
+ "count_classes.plot(kind='bar') # bar chart of the class counts\n",
+ "plt.title(\"Fraud class histogram\")\n",
+ "plt.xlabel(\"Class\")\n",
+ "plt.ylabel(\"Frequency\")\n",
+ "print(count_classes)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
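+ "The positive and negative classes are clearly imbalanced: there are only 492 positive samples (class 1) against roughly 280,000 negative ones. If we train on the data as-is, the model can easily learn that predicting every sample as negative already gives an accuracy of about 99.8% (284315/284807).\n",
+ "\n",
+ "We must not let the model learn this kind of shortcut.\n",
+ "\n",
+ "There are two ways to address this:\n",
+ "* Make the 1s as numerous as the 0s, i.e. bring class 1 up to roughly 280,000 samples (oversampling).\n",
+ "* Make the 0s as few as the 1s, i.e. keep only 492 of the 280,000 samples (undersampling).\n",
+ "\n",
+ "Comparing the two options:\n",
+ "* The first requires generating additional data. That data is synthetic, and synthetic samples can distort what the model learns, so performance on real data may suffer.\n",
+ "* The second discards real data, leaving the model less to learn from, which can also weaken it."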
+ ]
+ },
{
"cell_type": "code",
"execution_count": null,