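diff --git a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb index 2fd6442..b2653f6 100644 --- a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb +++ b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/.ipynb_checkpoints/逻辑回归-信用卡欺诈检测-checkpoint.ipynb @@ -1,6 +1,373 @@ { - "cells": [], - "metadata": {}, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Credit Card Fraud Detection\n", + "Using credit card transaction records, build a classification model that predicts which transactions are fraudulent and which are normal.\n", + "\n", + "My prepared copy of the data: https://pan.baidu.com/s/18vPGelYCXGqp5OCWZWz36A (extraction code: de0f)\n", + "\n", + "Kaggle dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud#creditcard.csv\n", + "\n", + "Kesci dataset: https://www.kesci.com/mw/dataset/5b56a592fc7e9000103c0442" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Task objective:\n", + "Classify the transactions in the dataset as normal or fraudulent, and predict a 0/1 label for each transaction in the test data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Workflow:\n", + "* Load the data and examine the problem\n", + "* Propose solutions for the problems found\n", + "* Split the dataset\n", + "* Compare evaluation methods\n", + "* Logistic regression model\n", + "* Analyze the modeling results\n", + "* Compare the approaches\n", + "\n", + "### Key points addressed:\n", + " (1) In this project we first examine the data and find that the classes are imbalanced. Before starting any task, always check the data first, see what problems it has, and choose solutions that target those problems.\n", + " (2) Two approaches are proposed here, undersampling and oversampling, and they are compared experimentally. Whenever a problem comes up, do not commit to a single path: without comparison there is no optimization. The usual practice is to build a baseline model, compare several methods against it, and pick the most suitable one. Think ahead and prepare alternatives before starting, so that there are results to choose from.\n", + " (3) Before modeling, the data needs preprocessing such as standardization and missing-value imputation; these steps are essential. Because the features are already given here, feature engineering is not covered yet and will be introduced gradually in later projects. Data preprocessing is in fact the most important and most difficult stage of the whole task: the data sets the upper bound, and the model can only approach it.\n", + " (4) Choose the evaluation method first, then build the model. The goal of modeling is to get results, but the best result never comes on the first try and many attempts are needed, so a suitable evaluation method is essential, for example the common ROC curve, AUC, recall, and precision, or a custom metric defined for the actual problem.\n", + " (5) Choose a suitable algorithm. Logistic regression is used here; it is used less often nowadays, but it remains a highly representative algorithm in finance, where it is popular because it is simple, derivable, and interpretable.\n", + " (6) Hyperparameter tuning is also very important: different parameter settings lead to different results. Later projects will cover tuning in more detail; consult the library's API documentation to understand what each parameter means before choosing values.\n", + " (7) Results must always be tied to the actual task. A model sometimes performs well offline (in development) but very differently once deployed, so a testing environment is also indispensable." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": 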
{}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "D:\\Anaconda3\\lib\\importlib\\_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject\n", + " return f(*args, **kwds)\n" + ] + } + ], + "source": [ + "# Import libraries; the %matplotlib inline magic below renders plots inside the notebook\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "\n", + "%matplotlib inline" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[HTML rendering of data.head() omitted: 5 rows × 31 columns; the markup was lost in extraction and the text/plain output that follows carries the same values]
\n", + "" + ], + "text/plain": [ + " Time V1 V2 V3 V4 V5 V6 V7 \\\n", + "0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 \n", + "1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 \n", + "2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 \n", + "3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 \n", + "4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 \n", + "\n", + " V8 V9 ... V21 V22 V23 V24 V25 \\\n", + "0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539 \n", + "1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170 \n", + "2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642 \n", + "3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376 \n", + "4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010 \n", + "\n", + " V26 V27 V28 Amount Class \n", + "0 -0.189115 0.133558 -0.021053 149.62 0 \n", + "1 0.125895 -0.008983 0.014724 2.69 0 \n", + "2 -0.139097 -0.055353 -0.059752 378.66 0 \n", + "3 -0.221929 0.062723 0.061458 123.50 0 \n", + "4 0.502292 0.219422 0.215153 69.99 0 \n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Load the data\n", + "data = pd.read_csv(\"data/creditcard.csv\")\n", + "data.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### About the data:\n", + "The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. Due to confidentiality constraints, the original features and further background information about the data are not provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 284315\n", + "1 492\n", + "Name: Class, dtype: int64\n" + ] + }, + { + "data": { + "image/png": "[base64 PNG data omitted: bar chart of transaction counts per class; x-axis: Class, y-axis: Frequency]\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Distribution of the class labels\n", + "count_classes = data['Class'].value_counts().sort_index() # number of samples in each class\n", + "count_classes.plot(kind='bar') # draw a bar chart\n", + "plt.title(\"Fraud class histogram\")\n", + "plt.xlabel(\"Class\")\n", + "plt.ylabel(\"Frequency\")\n", + "print(count_classes)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The two classes are clearly imbalanced: there are only 492 positive samples (Class = 1) against roughly 280,000 negative ones. If the model is trained on this data directly, it can easily learn that predicting every sample as negative already gives an accuracy of about 99.83% (284315 / 284807).\n", + "\n", + "We must not let the model learn this kind of shortcut.\n", + "\n", + "There are two ways to address this:\n", + "* Make class 1 as large as class 0, i.e. about 280,000 samples of class 1 (oversampling).\n", + "* Make class 0 as small as class 1, i.e. keep only 492 of the roughly 280,000 class 0 samples (undersampling).\n", + "\n", + "Comparing the two options:\n", + "* The first requires generating extra samples; these are synthetic, and synthetic data can hurt the model's performance on real data.\n", + "* The second discards real data, so the model has less to learn from and may be weaker as a result." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, "nbformat": 4, "nbformat_minor": 2 }
diff --git a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb index 215025f..b2653f6 100644 --- a/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb +++ b/机器学习竞赛实战_优胜解决方案/信用卡欺诈检测/逻辑回归-信用卡欺诈检测.ipynb @@ -287,6 +287,60 @@ "The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284807 transactions. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions. Due to confidentiality constraints, the original features and further background information about the data are not provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise." ] }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 284315\n", + "1 492\n", + "Name: Class, dtype: int64\n" + ] + }, + { + "data": { + "image/png": "[base64 PNG data omitted: bar chart of transaction counts per class; x-axis: Class, y-axis: Frequency]\n", + "text/plain": [ + "" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Distribution of the class labels\n", + "count_classes = data['Class'].value_counts().sort_index() # number of samples in each class\n", + "count_classes.plot(kind='bar') # draw a bar chart\n", + "plt.title(\"Fraud class histogram\")\n", + "plt.xlabel(\"Class\")\n", + "plt.ylabel(\"Frequency\")\n", + "print(count_classes)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The two classes are clearly imbalanced: there are only 492 positive samples (Class = 1) against roughly 280,000 negative ones. If the model is trained on this data directly, it can easily learn that predicting every sample as negative already gives an accuracy of about 99.83% (284315 / 284807).\n", + "\n", + "We must not let the model learn this kind of shortcut.\n", + "\n", + "There are two ways to address this:\n", + "* Make class 1 as large as class 0, i.e. about 280,000 samples of class 1 (oversampling).\n", + "* Make class 0 as small as class 1, i.e. keep only 492 of the roughly 280,000 class 0 samples (undersampling).\n", + "\n", + "Comparing the two options:\n", + "* The first requires generating extra samples; these are synthetic, and synthetic data can hurt the model's performance on real data.\n", + "* The second discards real data, so the model has less to learn from and may be weaker as a result." + ] + }, { "cell_type": "code", "execution_count": null,
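The last markdown cell added by this diff describes the imbalance and the two resampling options but stops before any code for them. The following is a minimal sketch of those ideas and is not part of the notebook. The helper names (`majority_baseline_accuracy`, `undersample`, `oversample`) are hypothetical. The demo frame is synthetic: it reuses only the class counts printed in the diff (284315 and 492), with made-up `Amount` values. The oversampling shown is plain random duplication of minority rows, chosen for simplicity; the notebook may well use a different technique.

```python
import numpy as np
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy of a model that always predicts the most frequent class."""
    return float(labels.value_counts(normalize=True).max())


def undersample(df: pd.DataFrame, label_col: str = "Class", seed: int = 0) -> pd.DataFrame:
    """Randomly keep as many majority rows as there are minority rows (binary labels assumed)."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = int(counts.min())
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label].sample(n=n_minority, random_state=seed)
    # Shuffle so the two classes are interleaved
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed).reset_index(drop=True)


def oversample(df: pd.DataFrame, label_col: str = "Class", seed: int = 0) -> pd.DataFrame:
    """Randomly duplicate minority rows (sampling with replacement) up to the majority size."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_majority = int(counts.max())
    majority = df[df[label_col] != minority_label]
    minority = df[df[label_col] == minority_label].sample(
        n=n_majority, replace=True, random_state=seed
    )
    return pd.concat([majority, minority]).sample(frac=1, random_state=seed).reset_index(drop=True)


# Synthetic stand-in: only the class counts come from the notebook output above
rng = np.random.default_rng(0)
n_normal, n_fraud = 284315, 492
demo = pd.DataFrame({
    "Amount": rng.exponential(scale=88.0, size=n_normal + n_fraud),  # made-up values
    "Class": np.r_[np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)],
})

baseline = majority_baseline_accuracy(demo["Class"])
under = undersample(demo)
over = oversample(demo)

print(f"always-predict-0 accuracy: {baseline:.4f}")
print("undersampled class counts:", under["Class"].value_counts().to_dict())
print("oversampled class counts:", over["Class"].value_counts().to_dict())
```

With the real data, `demo` would be replaced by the `data` frame loaded from `creditcard.csv`. Resampling is normally applied to the training split only, so that the held-out split keeps the original class ratio.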