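From d4bb88515a2240b9cb48a519f780f8217a611be2 Mon Sep 17 00:00:00 2001 From: benjas <909336740@qq.com> Date: Sun, 24 Jan 2021 16:07:12 +0800 Subject: [PATCH] Add. Data cleaning --- .../数据清洗-checkpoint.ipynb | 1480 ++++++++++++++++- .../数据清洗.ipynb | 1057 ++++++++++++ 2 files changed, 2535 insertions(+), 2 deletions(-) diff --git a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb index 2fd6442..074d363 100644 --- a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb +++ b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/数据清洗-checkpoint.ipynb @@ -1,6 +1,1482 @@ { - "cells": [], - "metadata": {}, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task: Predicting JD.com User Purchase Intent" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Background:\n", + "As China's largest self-operated e-commerce company, JD.com has kept growing rapidly while building a base of hundreds of millions of loyal users and accumulating massive amounts of real data. Finding patterns in historical data to predict users' future purchase needs, so that the most suitable products reach the people who need them most, is a key problem in applying big data to precision marketing and a core technology for any e-commerce platform upgrading toward intelligent services.\n", + "\n", + "Using real (anonymized) user, product, and behavior data from JD.com, the task is to apply data mining and machine learning to build a model that predicts user purchases and outputs matches between high-potential users and target products, providing high-quality target groups for precision marketing.\n", + "\n", + "Goal: use historical sales data for products across multiple JD.com categories to build a model that predicts each user's purchase intent, over the next 5 days, for products in a target category." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Datasets:\n", + "The following JD.com datasets are used:\n", + "\n", + "* JData_User.csv: user data, 105,321 users\n", + "* JData_Comment.csv: product comments, 558,552 records\n", + "* JData_Product.csv: candidate products for prediction, 24,187 records\n", + "* JData_Action_201602.csv: February action records, 11,485,424 records\n", + "* JData_Action_201603.csv: March action records, 25,916,378 records\n", + "* JData_Action_201604.csv: April action records, 13,199,934 records" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**JData_User.csv: user data**\n", + "\n", + "|Field|Meaning|Notes|\n", + "|-|-|-|\n", + "|user_id|User ID|Anonymized|\n", + "|age|Age|-1 means unknown|\n", + "|sex|Sex|0 = male, 1 = female, 2 = unknown|\n", + "|user_lv_cd|User level|Ordered enum; a larger value means a higher level|\n", + "|user_reg_tm|Registration date|Day granularity|" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + 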
"**JData_Comment.csv评论数据**\n", + "\n", + "|字段|意义|备注|\n", + "|-|-|-|\n", + "|dt|截止时间|天,到2016-02-01|\n", + "|sku_id|商品编号|脱敏|\n", + "|comment_num|累积评论数分段|0表示无评论,1表是1条,2表示2-10条,3表示11-50条,5表示大于50条|\n", + "|has_bad_comment|是否有差评|0表示无,1表示有|\n", + "|bad_comment_rate|差评率|差评数占总评论数的比率|" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**JData_Product.csv商品数据**\n", + "\n", + "|字段|意义|备注|\n", + "|-|-|-|\n", + "|sku_id|商品编号|脱敏|\n", + "|a1|属性1|枚举,-1表未知|\n", + "|a2|属性2|枚举,-1表未知|\n", + "|a3|属性3|枚举,-1表未知|\n", + "|cate|品牌ID|脱敏|\n", + "|brand|品牌ID|脱敏|" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**JData_Action_xx.csv商品数据**\n", + "\n", + "|字段|意义|备注|\n", + "|-|-|-|\n", + "|user_id|用户ID|脱敏|\n", + "|sku_id|商品编号|脱敏|\n", + "|time|行为时间||\n", + "|model_id|点击板块的编号|脱敏|\n", + "|type|行为类型|1.浏览商品详情页;2.加入购物车;3.购物车删除;4.下单;5.关注;6.点击;|\n", + "|cate|品牌ID|脱敏|\n", + "|brand|品牌ID|脱敏|" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 数据挖掘流程:\n", + "(一).数据清洗\n", + "1. 数据集完整性验证\n", + "2. 数据集中是否存在缺失值\n", + "3. 数据集中各特征数值应该如何处理\n", + "4. 哪些数据是我们想要的,哪些是可以过滤掉的\n", + "5. 将有价值数据信息做成新的数据源\n", + "6. 去除无行为交互的商品和用户\n", + "7. 去掉浏览量很大而购买量很少的用户(惰性用户或爬虫用户)\n", + "\n", + "(二).数据理解与分析\n", + "1. 掌握各个特征的含义\n", + "2. 观察数据有哪些特点,是否可利用来建模\n", + "3. 可视化展示便于分析\n", + "4. 用户的购买意向是否随着时间等因素变化\n", + "(三).特征提取\n", + "1. 基于清洗后的数据集哪些特征是有价值\n", + "2. 分别对用户与商品以及其之间构成的行为进行特征提取\n", + "3. 行为因素中哪些是核心?如何提取?\n", + "4. 瞬时行为特征or累计行为特征?\n", + "\n", + "(四).模型建立\n", + "1. 使用机器学习算法进行预测\n", + "2. 参数设置与调节\n", + "3. 
Dataset splitting" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dataset integrity check\n", + "First check that the users in JData_User match the users in JData_Action, so that every action in the behavior data was produced by a user in the user data.\n", + "\n", + "Approach: use pd.merge to join the user_id column from JData_User with the Action data, then check whether the number of Action rows decreases. Example:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " sku data\n", + "0 a 1\n", + "1 a 1\n", + "2 c 3\n" + ] + } + ], + "source": [ + "# Test the approach on toy data\n", + "import pandas as pd\n", + "df1 = pd.DataFrame({'sku':['a','a','e','c'], 'data':[1,1,2,3]})\n", + "df2 = pd.DataFrame({'sku':['a','b','c']})\n", + "print(pd.merge(df1,df2))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Only rows whose key appears in both frames are printed." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Is action of Feb. from User file? True\n", + "Is action of Mar. from User file? True\n", + "Is action of Apr. from User file? True\n" + ] + } + ], + "source": [ + "# Dataset verification\n", + "def user_action_check():\n", + "    df_user = pd.read_csv('data/JData_User.csv',encoding='gbk')\n", + "    df_sku = df_user.loc[:,'user_id'].to_frame()\n", + "    df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n", + "    # pd.merge(df_sku, df_month2) is an inner join on user_id (intersection, not union), so an unchanged row count shows that every user_id in the action data exists in df_user\n", + "    print('Is action of Feb. from User file? ', len(df_month2) == len(pd.merge(df_sku,df_month2)))\n", + "    df_month3 = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n", + "    print('Is action of Mar. from User file? ', len(df_month3) == len(pd.merge(df_sku,df_month3)))\n", + "    df_month4 = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n", + "    print('Is action of Apr. from User file? 
', len(df_month4) == len(pd.merge(df_sku,df_month4)))\n", + "\n", + "user_action_check()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Conclusion: every user in the action datasets also appears in the User dataset.\n", + "\n", + "Because the row counts are identical before and after the merge, the user IDs in Action are guaranteed to be a subset of the IDs in User." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Check for duplicate records\n", + "Remove fully duplicated records from each data file. Note that duplicates may be meaningful, for example when a user buys several items at once or adds multiple units of a product to the cart." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Duplicate records\n", + "def deduplicate(filepath, filename, newpath):\n", + "    df_file = pd.read_csv(filepath,encoding='gbk')\n", + "    before = df_file.shape[0]\n", + "    df_file.drop_duplicates(inplace=True) # rows identical in every column count as duplicates; inplace=True modifies the original DataFrame\n", + "    after = df_file.shape[0]\n", + "    n_dup = before-after # difference in row count before and after\n", + "    print('Number of duplicate records for ' + filename + ' is: ' + str(n_dup))\n", + "    if n_dup != 0:\n", + "        df_file.to_csv(newpath, index=None)\n", + "    else:\n", + "        print('No duplicate records in ' + filename)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of duplicate records for Feb. action is: 2756093\n", + "Number of duplicate records for Mar. action is: 7085038\n", + "Number of duplicate records for Apr. action is: 3672710\n", + "Number of duplicate records for Comment is: 0\n", + "No duplicate records in Comment\n", + "Number of duplicate records for Product is: 0\n", + "No duplicate records in Product\n", + "Number of duplicate records for User is: 0\n", + "No duplicate records in User\n" + ] + } + ], + "source": [ + "deduplicate('data/JData_Action_201602.csv', 'Feb. action', 'data/JData_Action_201602_dedup.csv')\n", + "deduplicate('data/JData_Action_201603.csv', 'Mar. action', 'data/JData_Action_201603_dedup.csv')\n", + "deduplicate('data/JData_Action_201604.csv', 'Apr. 
action', 'data/JData_Action_201604_dedup.csv')\n", + "deduplicate('data/JData_Comment.csv', 'Comment', 'data/JData_Comment_dedup.csv')\n", + "deduplicate('data/JData_Product.csv', 'Product', 'data/JData_Product_dedup.csv')\n", + "deduplicate('data/JData_User.csv', 'User', 'data/JData_User_dedup.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idsku_idtimemodel_idcatebrand
type
1217637821763782176378021763782176378
26366366360636636
3146414641464014641464
437373703737
5198119811981019811981
6575597575597575597545054575597575597
\n", + "
" + ], + "text/plain": [ + " user_id sku_id time model_id cate brand\n", + "type \n", + "1 2176378 2176378 2176378 0 2176378 2176378\n", + "2 636 636 636 0 636 636\n", + "3 1464 1464 1464 0 1464 1464\n", + "4 37 37 37 0 37 37\n", + "5 1981 1981 1981 0 1981 1981\n", + "6 575597 575597 575597 545054 575597 575597" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# 查看重复数据\n", + "df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n", + "IsDuplicated = df_month2.duplicated()\n", + "df_d = df_month2[IsDuplicated]\n", + "df_d.groupby('type').count() # 发现重复数据大多数都是由于浏览(1),或者点击(6)产生" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 检查是否存在注册时间在2016年-4月-15号之后的用户\n", + "统计的是4月15号前的客户行为,不应该包含4月15号后的注册客户。" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idagesexuser_lv_cduser_reg_tm
7457207458-12.012016-04-15
746320746426-35岁2.022016-04-15
746720746836-45岁2.032016-04-15
7472207473-12.012016-04-15
748220748326-35岁2.032016-04-15
749220749316-25岁2.032016-04-15
749320749416-25岁2.032016-04-15
750320750416-25岁2.042016-04-15
751020751146-55岁2.052016-04-15
7512207513-12.012016-04-15
751820751926-35岁2.022016-04-15
752120752226-35岁0.032016-04-15
7525207526-12.032016-04-15
7533207534-12.012016-04-15
754320754426-35岁2.032016-04-15
7544207545-12.012016-04-15
755120755226-35岁2.032016-04-15
755320755416-25岁2.042016-04-15
854520854616-25岁0.022016-04-29
939420939516-25岁1.022016-05-11
1036221036356岁以上2.022016-05-24
10367210368-12.012016-05-24
1101921102036-45岁2.032016-06-06
1201421201536-45岁2.022016-07-05
1385021385126-35岁2.032016-09-11
14542214543-12.012016-10-05
1674621674716-25岁2.012016-11-25
\n", + "
" + ], + "text/plain": [ + " user_id age sex user_lv_cd user_reg_tm\n", + "7457 207458 -1 2.0 1 2016-04-15\n", + "7463 207464 26-35岁 2.0 2 2016-04-15\n", + "7467 207468 36-45岁 2.0 3 2016-04-15\n", + "7472 207473 -1 2.0 1 2016-04-15\n", + "7482 207483 26-35岁 2.0 3 2016-04-15\n", + "7492 207493 16-25岁 2.0 3 2016-04-15\n", + "7493 207494 16-25岁 2.0 3 2016-04-15\n", + "7503 207504 16-25岁 2.0 4 2016-04-15\n", + "7510 207511 46-55岁 2.0 5 2016-04-15\n", + "7512 207513 -1 2.0 1 2016-04-15\n", + "7518 207519 26-35岁 2.0 2 2016-04-15\n", + "7521 207522 26-35岁 0.0 3 2016-04-15\n", + "7525 207526 -1 2.0 3 2016-04-15\n", + "7533 207534 -1 2.0 1 2016-04-15\n", + "7543 207544 26-35岁 2.0 3 2016-04-15\n", + "7544 207545 -1 2.0 1 2016-04-15\n", + "7551 207552 26-35岁 2.0 3 2016-04-15\n", + "7553 207554 16-25岁 2.0 4 2016-04-15\n", + "8545 208546 16-25岁 0.0 2 2016-04-29\n", + "9394 209395 16-25岁 1.0 2 2016-05-11\n", + "10362 210363 56岁以上 2.0 2 2016-05-24\n", + "10367 210368 -1 2.0 1 2016-05-24\n", + "11019 211020 36-45岁 2.0 3 2016-06-06\n", + "12014 212015 36-45岁 2.0 2 2016-07-05\n", + "13850 213851 26-35岁 2.0 3 2016-09-11\n", + "14542 214543 -1 2.0 1 2016-10-05\n", + "16746 216747 16-25岁 2.0 1 2016-11-25" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check user who’s user_reg_tm >= '2016-4-15'\n", + "df_user = pd.read_csv('./data/JData_User.csv',encoding='gbk')\n", + "df_user['user_reg_tm']=pd.to_datetime(df_user['user_reg_tm']) \n", + "df_user.loc[df_user.user_reg_tm>= '2016-4-15']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "检查依然存在4月15号后注册的,如果这些客户没有4月15号后的行为数据,说明要删除。" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idsku_idtimemodel_idtypecatebrand
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n", + "Index: []" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_month = pd.read_csv('data/JData_Action_201604.csv')\n", + "df_month['time'] = pd.to_datetime(df_month['time'])\n", + "df_month.loc[df_month.time >= '2016-4-16']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "说明客户没有交互数据,所以这一批客户不需要删除" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 行为数据中的user_id为浮点型,进行INT类型转换" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "int64\n", + "int64\n", + "int64\n" + ] + } + ], + "source": [ + "df_month = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n", + "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n", + "print (df_month['user_id'].dtype)\n", + "df_month.to_csv('data/JData_Action_201602.csv',index=None)\n", + " \n", + "df_month = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n", + "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n", + "print (df_month['user_id'].dtype)\n", + "df_month.to_csv('data/JData_Action_201603.csv',index=None)\n", + " \n", + "df_month = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n", + "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n", + "print (df_month['user_id'].dtype)\n", + "df_month.to_csv('data/JData_Action_201604.csv',index=None)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 年龄区间的处理\n", + "查看用户年龄分布,并做特征编码" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 3.0 46570\n", + " 4.0 30336\n", + "-1.0 14412\n", + " 2.0 8797\n", + " 5.0 3325\n", + " 6.0 1871\n", + " 
1.0 7\n", + "Name: age, dtype: int64\n" + ] + } + ], + "source": [ + "age_mapping = {\n", + "    '15岁以下': 1,\n", + "    '16-25岁': 2,\n", + "    '26-35岁': 3,\n", + "    '36-45岁': 4,\n", + "    '46-55岁': 5,\n", + "    '56岁以上': 6,\n", + "    '-1': -1\n", + "}\n", + "df_user['age'] = df_user['age'].map(age_mapping)\n", + "print(df_user.age.value_counts())\n", + "df_user.to_csv('data/JData_User.csv',index=None)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To carry out the cleaning steps above, we first build simple user behavior features and item behavior features, stored in two tables: user_table and item_table." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### user_table features:\n", + "\n", + "* user_id (user ID), age, sex,\n", + "* user_lv_cd (user level), browse_num (browse count),\n", + "* addcart_num (add-to-cart count), delcart_num (remove-from-cart count),\n", + "* buy_num (purchase count), favor_num (favorite count),\n", + "* click_num (click count), buy_addcart_ratio (purchase to add-to-cart conversion rate),\n", + "* buy_browse_ratio (purchase to browse conversion rate),\n", + "* buy_click_ratio (purchase to click conversion rate),\n", + "* buy_favor_ratio (purchase to favorite conversion rate)\n", + "\n", + "### item_table features:\n", + "* sku_id (product ID), attr1, attr2,\n", + "* attr3, cate, brand, browse_num,\n", + "* addcart_num, delcart_num,\n", + "* buy_num, favor_num, click_num,\n", + "* buy_addcart_ratio, buy_browse_ratio,\n", + "* buy_click_ratio, buy_favor_ratio,\n", + "* comment_num (comment count),\n", + "* has_bad_comment (has negative comments),\n", + "* bad_comment_rate (negative comment rate)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build user_table" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "# Define file names\n", + "ACTION_201602_FILE = \"data/JData_Action_201602.csv\" # ~11M rows\n", + "ACTION_201603_FILE = \"data/JData_Action_201603.csv\" # ~26M rows\n", + "ACTION_201604_FILE = \"data/JData_Action_201604.csv\" # ~13M rows\n", + "COMMENT_FILE = \"data/JData_Comment.csv\" # ~560K rows\n", + "PRODUCT_FILE = \"data/JData_Product.csv\" # ~24K rows\n", + "USER_FILE = \"data/JData_User.csv\" # ~105K rows\n", + " \n", + "USER_TABLE_FILE = \"data/user_table.csv\"\n", + 
"ITEM_TABLE_FILE = \"data/item_table.csv\"" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "from collections import Counter" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "# 功能函数: 对每一个user分组的数据进行统计\n", + "def add_type_count(group):\n", + " behavior_type = group.type.astype(int) \n", + " # 用户行为类别\n", + " type_cnt = Counter(behavior_type)\n", + " # 1: 浏览 2: 加购 3: 删除\n", + " # 4: 购买 5: 收藏 6: 点击\n", + " group['browse_num'] = type_cnt[1]\n", + " group['addcart_num'] = type_cnt[2]\n", + " group['delcart_num'] = type_cnt[3]\n", + " group['buy_num'] = type_cnt[4]\n", + " group['favor_num'] = type_cnt[5]\n", + " group['click_num'] = type_cnt[6]\n", + " \n", + " return group[['user_id', 'browse_num', 'addcart_num',\n", + " 'delcart_num', 'buy_num', 'favor_num',\n", + " 'click_num']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "由于用户行为数据量较大,一次性读入可能造成内存错误(Memory Error),因此使用pandas的分块(chunk)读取" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "#对action数据进行统计\n", + "#根据自己调节chunk_size大小\n", + "def get_from_action_data(fname, chunk_size=50000):\n", + " reader = pd.read_csv(fname, header=0, iterator=True,encoding='gbk')\n", + " chunks = []\n", + " loop = True\n", + " while loop:\n", + " try:\n", + " # 只读取user_id和type两个字段\n", + " chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n", + " chunks.append(chunk)\n", + " except StopIteration:\n", + " loop = False\n", + " print(\"Iteration is stopped\")\n", + " # 将块拼接为pandas dataframe格式\n", + " df_ac = pd.concat(chunks, ignore_index=True)\n", + " # 按user_id分组,对每一组进行统计,as_index 表示无索引形式返回数据\n", + " df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n", + " # 将重复的行丢弃\n", + " df_ac = df_ac.drop_duplicates('user_id')\n", + " \n", + " return df_ac" + ] + }, + { + 
"cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# 将各个action数据的统计量进行聚合\n", + "def merge_action_data():\n", + " df_ac = []\n", + " df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", + " df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", + " df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", + " \n", + " df_ac = pd.concat(df_ac, ignore_index=True)\n", + " # 用户在不同action表中统计量求和\n", + " df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n", + " # 构造转化率字段\n", + " df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n", + " df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", + " df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", + " df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n", + " \n", + " # 将大于1的转化率字段置为1(100%)\n", + " df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", + " df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", + " df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", + " df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", + " \n", + " return df_ac" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "# 从FJData_User表中抽取需要的字段\n", + "def get_from_jdata_user():\n", + " df_usr = pd.read_csv(USER_FILE, header=0,encoding='gbk')\n", + " df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n", + " return df_usr" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idagesexuser_lv_cd
02000016.02.05
1200002-1.00.01
22000034.01.04
3200004-1.02.01
42000052.00.04
\n", + "
" + ], + "text/plain": [ + " user_id age sex user_lv_cd\n", + "0 200001 6.0 2.0 5\n", + "1 200002 -1.0 0.0 1\n", + "2 200003 4.0 1.0 4\n", + "3 200004 -1.0 2.0 1\n", + "4 200005 2.0 0.0 4" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "user_base = get_from_jdata_user()\n", + "user_base.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Iteration is stopped\n", + "Iteration is stopped\n", + "Iteration is stopped\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idbrowse_numaddcart_numdelcart_numbuy_numfavor_numclick_numbuy_addcart_ratiobuy_browse_ratiobuy_click_ratiobuy_favor_ratio
02000012122213104140.0454550.0047170.0024151.0
120000223810004840.0000000.0000000.000000NaN
220000322141014200.0000000.0000000.0000000.0
320000452000061NaN0.0000000.000000NaN
420000510623121610.5000000.0094340.0062110.5
\n", + "
" + ], + "text/plain": [ + " user_id browse_num addcart_num delcart_num buy_num favor_num \\\n", + "0 200001 212 22 13 1 0 \n", + "1 200002 238 1 0 0 0 \n", + "2 200003 221 4 1 0 1 \n", + "3 200004 52 0 0 0 0 \n", + "4 200005 106 2 3 1 2 \n", + "\n", + " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n", + "0 414 0.045455 0.004717 0.002415 \n", + "1 484 0.000000 0.000000 0.000000 \n", + "2 420 0.000000 0.000000 0.000000 \n", + "3 61 NaN 0.000000 0.000000 \n", + "4 161 0.500000 0.009434 0.006211 \n", + "\n", + " buy_favor_ratio \n", + "0 1.0 \n", + "1 NaN \n", + "2 0.0 \n", + "3 NaN \n", + "4 0.5 " + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "user_behavior = merge_action_data()\n", + "user_behavior.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "# 连接成一张表,类似于SQL的左连接(left join)\n", + "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n", + "# 保存为user_table.csv\n", + "user_behavior.to_csv(USER_TABLE_FILE, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
user_idagesexuser_lv_cdbrowse_numaddcart_numdelcart_numbuy_numfavor_numclick_numbuy_addcart_ratiobuy_browse_ratiobuy_click_ratiobuy_favor_ratio
02000016.02.05212.022.013.01.00.0414.00.0454550.0047170.0024151.0
1200002-1.00.01238.01.00.00.00.0484.00.0000000.0000000.000000NaN
22000034.01.04221.04.01.00.01.0420.00.0000000.0000000.0000000.0
3200004-1.02.0152.00.00.00.00.061.0NaN0.0000000.000000NaN
42000052.00.04106.02.03.01.02.0161.00.5000000.0094340.0062110.5
\n", + "
" + ], + "text/plain": [ + " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n", + "0 200001 6.0 2.0 5 212.0 22.0 13.0 \n", + "1 200002 -1.0 0.0 1 238.0 1.0 0.0 \n", + "2 200003 4.0 1.0 4 221.0 4.0 1.0 \n", + "3 200004 -1.0 2.0 1 52.0 0.0 0.0 \n", + "4 200005 2.0 0.0 4 106.0 2.0 3.0 \n", + "\n", + " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n", + "0 1.0 0.0 414.0 0.045455 0.004717 \n", + "1 0.0 0.0 484.0 0.000000 0.000000 \n", + "2 0.0 1.0 420.0 0.000000 0.000000 \n", + "3 0.0 0.0 61.0 NaN 0.000000 \n", + "4 1.0 2.0 161.0 0.500000 0.009434 \n", + "\n", + " buy_click_ratio buy_favor_ratio \n", + "0 0.002415 1.0 \n", + "1 0.000000 NaN \n", + "2 0.000000 0.0 \n", + "3 0.000000 NaN \n", + "4 0.006211 0.5 " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "user_table = pd.read_csv(USER_TABLE_FILE)\n", + "user_table.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, "nbformat": 4, "nbformat_minor": 2 } diff --git a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb index 3d4b56e..074d363 100644 --- a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb +++ b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/数据清洗.ipynb @@ -393,6 +393,1063 @@ "df_d.groupby('type').count() # 发现重复数据大多数都是由于浏览(1),或者点击(6)产生" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 检查是否存在注册时间在2016年-4月-15号之后的用户\n", + "统计的是4月15号前的客户行为,不应该包含4月15号后的注册客户。" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + 
{ + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " user_id age sex user_lv_cd user_reg_tm\n", + "7457 207458 -1 2.0 1 2016-04-15\n", + "7463 207464 26-35岁 2.0 2 2016-04-15\n", + "7467 207468 36-45岁 2.0 3 2016-04-15\n", + "7472 207473 -1 2.0 1 2016-04-15\n", + "7482 207483 26-35岁 2.0 3 2016-04-15\n", + "7492 207493 16-25岁 2.0 3 2016-04-15\n", + "7493 207494 16-25岁 2.0 3 2016-04-15\n", + "7503 207504 16-25岁 2.0 4 2016-04-15\n", + "7510 207511 46-55岁 2.0 5 2016-04-15\n", + "7512 207513 -1 2.0 1 2016-04-15\n", + "7518 207519 26-35岁 2.0 2 2016-04-15\n", + "7521 207522 26-35岁 0.0 3 2016-04-15\n", + "7525 207526 -1 2.0 3 2016-04-15\n", + "7533 207534 -1 2.0 1 2016-04-15\n", + "7543 207544 26-35岁 2.0 3 2016-04-15\n", + "7544 207545 -1 2.0 1 2016-04-15\n", + "7551 207552 26-35岁 2.0 3 2016-04-15\n", + "7553 207554 16-25岁 2.0 4 2016-04-15\n", + "8545 208546 16-25岁 0.0 2 2016-04-29\n", + "9394 209395 16-25岁 1.0 2 2016-05-11\n", + "10362 210363 56岁以上 2.0 2 2016-05-24\n", + "10367 210368 -1 2.0 1 2016-05-24\n", + "11019 211020 36-45岁 2.0 3 2016-06-06\n", + "12014 212015 36-45岁 2.0 2 2016-07-05\n", + "13850 213851 26-35岁 2.0 3 2016-09-11\n", + "14542 214543 -1 2.0 1 2016-10-05\n", + "16746 216747 16-25岁 2.0 1 2016-11-25" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Check users whose user_reg_tm >= '2016-4-15'\n", + "df_user = pd.read_csv('./data/JData_User.csv', encoding='gbk')\n", + "df_user['user_reg_tm'] = pd.to_datetime(df_user['user_reg_tm'])\n", + "df_user.loc[df_user.user_reg_tm >= '2016-4-15']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Some users are still registered on or after April 15, although the behavior data ends on April 15. If these users have no behavior records after April 15, the late registration dates are most likely a recording error and the users can be kept; if they do have such records, they should be removed." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + "Empty DataFrame\n", + "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n", + "Index: []" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_month = pd.read_csv('data/JData_Action_201604.csv')\n", + "df_month['time'] = pd.to_datetime(df_month['time'])\n", + "df_month.loc[df_month.time >= '2016-4-16']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result is empty: these users have no interaction records after April 15, so this batch does not need to be removed." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### user_id is stored as a float in the behavior data; convert it to INT" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "int64\n", + "int64\n", + "int64\n" + ] + } + ], + "source": [ + "df_month = pd.read_csv('data/JData_Action_201602.csv', encoding='gbk')\n", + "df_month['user_id'] = df_month['user_id'].astype('int64')\n", + "print(df_month['user_id'].dtype)\n", + "df_month.to_csv('data/JData_Action_201602.csv', index=False)\n", + " \n", + "df_month = pd.read_csv('data/JData_Action_201603.csv', encoding='gbk')\n", + "df_month['user_id'] = df_month['user_id'].astype('int64')\n", + "print(df_month['user_id'].dtype)\n", + "df_month.to_csv('data/JData_Action_201603.csv', index=False)\n", + " \n", + "df_month = pd.read_csv('data/JData_Action_201604.csv', encoding='gbk')\n", + "df_month['user_id'] = df_month['user_id'].astype('int64')\n", + "print(df_month['user_id'].dtype)\n", + "df_month.to_csv('data/JData_Action_201604.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Handling age ranges\n", + "Inspect the distribution of user ages and encode the age ranges as integer categories." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 3.0 46570\n", + " 4.0 30336\n", + "-1.0 14412\n", + " 2.0 8797\n", + " 5.0 3325\n", + " 6.0 1871\n", + " 
 1.0 7\n", + "Name: age, dtype: int64\n" + ] + } + ], + "source": [ + "age_mapping = {\n", + "    '15岁以下': 1,\n", + "    '16-25岁': 2,\n", + "    '26-35岁': 3,\n", + "    '36-45岁': 4,\n", + "    '46-55岁': 5,\n", + "    '56岁以上': 6,\n", + "    '-1': -1\n", + "}\n", + "df_user['age'] = df_user['age'].map(age_mapping)\n", + "print(df_user.age.value_counts())\n", + "df_user.to_csv('data/JData_User.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To support the cleaning steps above, we first build simple behavior features for users and for items, stored in two tables: user_table and item_table." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### user_table features:\n", + "* user_id (user ID), age, sex,\n", + "* user_lv_cd (user level), browse_num (number of page views),\n", + "* addcart_num (number of add-to-cart actions), delcart_num (number of remove-from-cart actions),\n", + "* buy_num (number of purchases), favor_num (number of favorites),\n", + "* click_num (number of clicks), buy_addcart_ratio (purchase to add-to-cart conversion rate),\n", + "* buy_browse_ratio (purchase to browse conversion rate),\n", + "* buy_click_ratio (purchase to click conversion rate),\n", + "* buy_favor_ratio (purchase to favorite conversion rate)\n", + "\n", + "### item_table features:\n", + "* sku_id (item ID), attr1, attr2,\n", + "* attr3, cate, brand, browse_num,\n", + "* addcart_num, delcart_num,\n", + "* buy_num, favor_num, click_num,\n", + "* buy_addcart_ratio, buy_browse_ratio,\n", + "* buy_click_ratio, buy_favor_ratio,\n", + "* comment_num (number of comments),\n", + "* has_bad_comment (whether the item has negative comments),\n", + "* bad_comment_rate (negative comment rate)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Building user_table" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "# Define file names\n", + "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"  # ~11M rows\n", + "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"  # ~26M rows\n", + "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"  # ~13M rows\n", + "COMMENT_FILE = \"data/JData_Comment.csv\"  # ~560K rows\n", + "PRODUCT_FILE = \"data/JData_Product.csv\"  # ~24K rows\n", + "USER_FILE = \"data/JData_User.csv\"  # ~105K rows\n", + " \n", + "USER_TABLE_FILE = \"data/user_table.csv\"\n", + 
"ITEM_TABLE_FILE = \"data/item_table.csv\"" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "from collections import Counter" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "# Helper: count each behavior type within one user's group\n", + "def add_type_count(group):\n", + "    behavior_type = group['type'].astype(int)\n", + "    # Behavior types recorded for this user\n", + "    type_cnt = Counter(behavior_type)\n", + "    # 1: browse  2: add to cart  3: remove from cart\n", + "    # 4: purchase  5: favorite  6: click\n", + "    group['browse_num'] = type_cnt[1]\n", + "    group['addcart_num'] = type_cnt[2]\n", + "    group['delcart_num'] = type_cnt[3]\n", + "    group['buy_num'] = type_cnt[4]\n", + "    group['favor_num'] = type_cnt[5]\n", + "    group['click_num'] = type_cnt[6]\n", + " \n", + "    return group[['user_id', 'browse_num', 'addcart_num',\n", + "                  'delcart_num', 'buy_num', 'favor_num',\n", + "                  'click_num']]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Because the user behavior data is large, reading it all at once may raise a MemoryError, so it is read in chunks with pandas." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# Compute statistics over one action file\n", + "# Adjust chunk_size to fit the available memory\n", + "def get_from_action_data(fname, chunk_size=50000):\n", + "    reader = pd.read_csv(fname, header=0, iterator=True, encoding='gbk')\n", + "    chunks = []\n", + "    loop = True\n", + "    while loop:\n", + "        try:\n", + "            # Keep only the user_id and type fields\n", + "            chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n", + "            chunks.append(chunk)\n", + "        except StopIteration:\n", + "            loop = False\n", + "            print(\"Iteration is stopped\")\n", + "    # Concatenate the chunks into a single pandas DataFrame\n", + "    df_ac = pd.concat(chunks, ignore_index=True)\n", + "    # Group by user_id and count per group; as_index=False keeps user_id as a column instead of the index\n", + "    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n", + "    # Drop the duplicated rows, leaving one row per user\n", + "    df_ac = df_ac.drop_duplicates('user_id')\n", + " \n", + "    return df_ac" + ] + }, + { + 
"cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# Aggregate the statistics from all action files\n", + "def merge_action_data():\n", + "    df_ac = []\n", + "    df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", + "    df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", + "    df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", + " \n", + "    df_ac = pd.concat(df_ac, ignore_index=True)\n", + "    # Sum each user's counts across the action files\n", + "    df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n", + "    # Build the conversion-rate fields\n", + "    df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n", + "    df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", + "    df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", + "    df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n", + " \n", + "    # Cap conversion rates greater than 1 at 1 (100%)\n", + "    df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", + "    df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", + "    df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", + "    df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", + " \n", + "    return df_ac" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "# Extract the required fields from the JData_User table\n", + "def get_from_jdata_user():\n", + "    df_usr = pd.read_csv(USER_FILE, header=0, encoding='gbk')\n", + "    df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n", + "    return df_usr" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " user_id age sex user_lv_cd\n", + "0 200001 6.0 2.0 5\n", + "1 200002 -1.0 0.0 1\n", + "2 200003 4.0 1.0 4\n", + "3 200004 -1.0 2.0 1\n", + "4 200005 2.0 0.0 4" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "user_base = get_from_jdata_user()\n", + "user_base.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Iteration is stopped\n", + "Iteration is stopped\n", + "Iteration is stopped\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " user_id browse_num addcart_num delcart_num buy_num favor_num \\\n", + "0 200001 212 22 13 1 0 \n", + "1 200002 238 1 0 0 0 \n", + "2 200003 221 4 1 0 1 \n", + "3 200004 52 0 0 0 0 \n", + "4 200005 106 2 3 1 2 \n", + "\n", + " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n", + "0 414 0.045455 0.004717 0.002415 \n", + "1 484 0.000000 0.000000 0.000000 \n", + "2 420 0.000000 0.000000 0.000000 \n", + "3 61 NaN 0.000000 0.000000 \n", + "4 161 0.500000 0.009434 0.006211 \n", + "\n", + " buy_favor_ratio \n", + "0 1.0 \n", + "1 NaN \n", + "2 0.0 \n", + "3 NaN \n", + "4 0.5 " + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "user_behavior = merge_action_data()\n", + "user_behavior.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "# Join into a single table, similar to a SQL left join\n", + "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n", + "# Save as user_table.csv\n", + "user_behavior.to_csv(USER_TABLE_FILE, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n", + "0 200001 6.0 2.0 5 212.0 22.0 13.0 \n", + "1 200002 -1.0 0.0 1 238.0 1.0 0.0 \n", + "2 200003 4.0 1.0 4 221.0 4.0 1.0 \n", + "3 200004 -1.0 2.0 1 52.0 0.0 0.0 \n", + "4 200005 2.0 0.0 4 106.0 2.0 3.0 \n", + "\n", + " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n", + "0 1.0 0.0 414.0 0.045455 0.004717 \n", + "1 0.0 0.0 484.0 0.000000 0.000000 \n", + "2 0.0 1.0 420.0 0.000000 0.000000 \n", + "3 0.0 0.0 61.0 NaN 0.000000 \n", + "4 1.0 2.0 161.0 0.500000 0.009434 \n", + "\n", + " buy_click_ratio buy_favor_ratio \n", + "0 0.002415 1.0 \n", + "1 0.000000 NaN \n", + "2 0.000000 0.0 \n", + "3 0.000000 NaN \n", + "4 0.006211 0.5 " + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "user_table = pd.read_csv(USER_TABLE_FILE)\n", + "user_table.head()" + ] + }, { "cell_type": "code", "execution_count": null,