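diff --git a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/1-数据清洗-checkpoint.ipynb b/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/1-数据清洗-checkpoint.ipynb deleted file mode 100644 index 82734f5..0000000 --- a/机器学习竞赛实战_优胜解决方案/京东用户购买意向预测/.ipynb_checkpoints/1-数据清洗-checkpoint.ipynb +++ /dev/null @@ -1,3347 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Task: Predicting JD.com User Purchase Intent" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Background:\n", - "JD.com, China's largest self-operated e-commerce platform, has built up hundreds of millions of loyal users and accumulated a massive volume of real data while sustaining rapid growth. Finding patterns in historical data, predicting users' future purchase needs, and matching the most suitable products with the people who need them most is a key problem for big-data applications in precision marketing, and a core technology that every e-commerce platform needs for its intelligent upgrade.\n", - "\n", - "Starting from real (anonymized) user, product, and behavior data from the JD.com marketplace, we use data mining techniques and machine learning algorithms to build a model that predicts which products users will buy, and output matches between high-potential users and target products, providing a high-quality target group for precision marketing." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Goal:\n", - "Using historical sales data for products in several JD.com categories, build a model that predicts each user's purchase intent, over the next 5 days, for products in a given target category." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Datasets:\n", - "The datasets used here come from JD.com:\n", - "\n", - "* JData_User.csv: user dataset, 105,321 users\n", - "* JData_Comment.csv: product comments, 558,552 records\n", - "* JData_Product.csv: set of products to predict, 24,187 records\n", - "* JData_Action_201602.csv: February action records, 11,485,424 records\n", - "* JData_Action_201603.csv: March action records, 25,916,378 records\n", - "* JData_Action_201604.csv: April action records, 13,199,934 records" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**JData_User.csv: user data**\n", - "\n", - "|Field|Meaning|Notes|\n", - "|-|-|-|\n", - "|user_id|User ID|Anonymized|\n", - "|age|Age|-1 means unknown|\n", - "|sex|Sex|0 = male, 1 = female, 2 = unknown|\n", - "|user_lv_cd|User level|Ordered enumeration; a higher level has a larger value|\n", - "|user_reg_tm|Registration date|Day granularity|" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**JData_Comment.csv: comment data**\n", - "\n", - "|Field|Meaning|Notes|\n", - "|-|-|-|\n", - "|dt|Cut-off date|Day granularity, through 2016-02-01|\n", - "|sku_id|Product ID|Anonymized|\n", - "|comment_num|Cumulative comment count, bucketed|0 = no comments, 1 = 1 comment, 2 = 2-10 comments, 3 = 11-50 comments, 4 = more than 50 comments|\n", - "|has_bad_comment|Has negative comments|0 = no, 1 = yes|\n", - "|bad_comment_rate|Negative comment rate|Negative comments as a share of all comments|" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, 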
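- "source": [ - "**JData_Product.csv: product data**\n", - "\n", - "|Field|Meaning|Notes|\n", - "|-|-|-|\n", - "|sku_id|Product ID|Anonymized|\n", - "|a1|Attribute 1|Enumeration; -1 means unknown|\n", - "|a2|Attribute 2|Enumeration; -1 means unknown|\n", - "|a3|Attribute 3|Enumeration; -1 means unknown|\n", - "|cate|Category ID|Anonymized|\n", - "|brand|Brand ID|Anonymized|" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**JData_Action_xx.csv: action data**\n", - "\n", - "|Field|Meaning|Notes|\n", - "|-|-|-|\n", - "|user_id|User ID|Anonymized|\n", - "|sku_id|Product ID|Anonymized|\n", - "|time|Action time||\n", - "|model_id|ID of the clicked page module|Anonymized|\n", - "|type|Action type|1. view product detail page; 2. add to cart; 3. remove from cart; 4. place order; 5. follow (favorite); 6. click|\n", - "|cate|Category ID|Anonymized|\n", - "|brand|Brand ID|Anonymized|" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data mining workflow:\n", - "(1). Data cleaning\n", - "1. Verify dataset integrity\n", - "2. Check whether the datasets contain missing values\n", - "3. Decide how the values of each feature should be handled\n", - "4. Decide which data we want and which can be filtered out\n", - "5. Turn the valuable information into a new data source\n", - "6. Remove products and users with no interactions\n", - "7. Remove users with very many views but very few purchases (inactive users or crawlers)\n", - "\n", - "(2). Data understanding and analysis\n", - "1. Understand what each feature means\n", - "2. Observe the characteristics of the data and whether they can be used for modeling\n", - "3. Visualize the data to make analysis easier\n", - "4. Check whether users' purchase intent changes with time and other factors\n", - "\n", - "(3). Feature extraction\n", - "1. Decide which features of the cleaned dataset are valuable\n", - "2. Extract features for users, for products, and for the interactions between them\n", - "3. Which behavioral factors are the core ones, and how do we extract them?\n", - "4. Instantaneous behavior features or cumulative behavior features?\n", - "\n", - "(4). Model building\n", - "1. Predict with machine learning algorithms\n", - "2. Set and tune parameters\n", - "3. 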
Dataset splitting" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Dataset integrity check\n", - "First, check that the users in JData_User are consistent with the users in JData_Action, so that every action in the action data was produced by a user in the user data.\n", - "\n", - "Approach: use pd.merge to join the user_id column of User with the Action data and check whether the number of Action rows shrinks. Example:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " sku data\n", - "0 a 1\n", - "1 a 1\n", - "2 c 3\n" - ] - } - ], - "source": [ - "# Toy example of the approach\n", - "import pandas as pd\n", - "df1 = pd.DataFrame({'sku':['a','a','e','c'], 'data':[1,1,2,3]})\n", - "df2 = pd.DataFrame({'sku':['a','b','c']})\n", - "print(pd.merge(df1,df2))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Only the rows whose key appears in both DataFrames are printed." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Is action of Feb. from User file? True\n", - "Is action of Mar. from User file? True\n", - "Is action of Apr. from User file? True\n" - ] - } - ], - "source": [ - "# Dataset validation\n", - "def user_action_check():\n", - "    df_user = pd.read_csv('data/JData_User.csv',encoding='gbk')\n", - "    df_sku = df_user.loc[:,'user_id'].to_frame()\n", - "    df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n", - "    # pd.merge(df_sku, df_month2) joins on user_id and keeps the intersection (inner join), not the union;\n", - "    # equal lengths before and after show that every user_id in action also appears in df_user\n", - "    print ('Is action of Feb. from User file? ', len(df_month2) == len(pd.merge(df_sku,df_month2)))\n", - "    df_month3 = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n", - "    print ('Is action of Mar. from User file? ', len(df_month3) == len(pd.merge(df_sku,df_month3)))\n", - "    df_month4 = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n", - "    print ('Is action of Apr. from User file? 
', len(df_month4) == len(pd.merge(df_sku,df_month4)))\n", - "\n", - "user_action_check()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Conclusion: the users in the User dataset are consistent with the users in the action datasets.\n", - "\n", - "Because the row counts before and after the merge are equal, the user IDs in Action are guaranteed to be a subset of the IDs in User." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Check for duplicate records\n", - "Remove records that are exact duplicates within each data file. Note that duplicates may be meaningful: for example, a user may buy several items at once, or add several units of an item to the cart at the same time." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "# Duplicate records\n", - "def deduplicate(filepath, filename, newpath):\n", - "    df_file = pd.read_csv(filepath,encoding='gbk')\n", - "    before = df_file.shape[0]\n", - "    df_file.drop_duplicates(inplace=True) # rows equal in every column count as duplicates; inplace=True drops them from the original DataFrame\n", - "    after = df_file.shape[0]\n", - "    n_dup = before-after # difference before vs. after\n", - "    print ('Number of duplicate records for ' + filename + ' is: ' + str(n_dup))\n", - "    if n_dup != 0:\n", - "        df_file.to_csv(newpath, index=None)\n", - "    else:\n", - "        print ('No duplicate records in ' + filename)" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of duplicate records for Feb. action is: 2756093\n", - "Number of duplicate records for Mar. action is: 7085038\n", - "Number of duplicate records for Apr. action is: 3672710\n", - "Number of duplicate records for Comment is: 0\n", - "No duplicate records in Comment\n", - "Number of duplicate records for Product is: 0\n", - "No duplicate records in Product\n", - "Number of duplicate records for User is: 0\n", - "No duplicate records in User\n" - ] - } - ], - "source": [ - "deduplicate('data/JData_Action_201602.csv', 'Feb. action', 'data/JData_Action_201602_dedup.csv')\n", - "deduplicate('data/JData_Action_201603.csv', 'Mar. action', 'data/JData_Action_201603_dedup.csv')\n", - "deduplicate('data/JData_Action_201604.csv', 'Apr. 
action', 'data/JData_Action_201604_dedup.csv')\n", - "deduplicate('data/JData_Comment.csv', 'Comment', 'data/JData_Comment_dedup.csv')\n", - "deduplicate('data/JData_Product.csv', 'Product', 'data/JData_Product_dedup.csv')\n", - "deduplicate('data/JData_User.csv', 'User', 'data/JData_User_dedup.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id sku_id time model_id cate brand\n", - "type \n", - "1 2176378 2176378 2176378 0 2176378 2176378\n", - "2 636 636 636 0 636 636\n", - "3 1464 1464 1464 0 1464 1464\n", - "4 37 37 37 0 37 37\n", - "5 1981 1981 1981 0 1981 1981\n", - "6 575597 575597 575597 545054 575597 575597" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# 查看重复数据\n", - "df_month2 = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n", - "IsDuplicated = df_month2.duplicated()\n", - "df_d = df_month2[IsDuplicated]\n", - "df_d.groupby('type').count() # 发现重复数据大多数都是由于浏览(1),或者点击(6)产生" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 检查是否存在注册时间在2016年-4月-15号之后的用户\n", - "统计的是4月15号前的客户行为,不应该包含4月15号后的注册客户。" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd user_reg_tm\n", - "7457 207458 -1 2.0 1 2016-04-15\n", - "7463 207464 26-35岁 2.0 2 2016-04-15\n", - "7467 207468 36-45岁 2.0 3 2016-04-15\n", - "7472 207473 -1 2.0 1 2016-04-15\n", - "7482 207483 26-35岁 2.0 3 2016-04-15\n", - "7492 207493 16-25岁 2.0 3 2016-04-15\n", - "7493 207494 16-25岁 2.0 3 2016-04-15\n", - "7503 207504 16-25岁 2.0 4 2016-04-15\n", - "7510 207511 46-55岁 2.0 5 2016-04-15\n", - "7512 207513 -1 2.0 1 2016-04-15\n", - "7518 207519 26-35岁 2.0 2 2016-04-15\n", - "7521 207522 26-35岁 0.0 3 2016-04-15\n", - "7525 207526 -1 2.0 3 2016-04-15\n", - "7533 207534 -1 2.0 1 2016-04-15\n", - "7543 207544 26-35岁 2.0 3 2016-04-15\n", - "7544 207545 -1 2.0 1 2016-04-15\n", - "7551 207552 26-35岁 2.0 3 2016-04-15\n", - "7553 207554 16-25岁 2.0 4 2016-04-15\n", - "8545 208546 16-25岁 0.0 2 2016-04-29\n", - "9394 209395 16-25岁 1.0 2 2016-05-11\n", - "10362 210363 56岁以上 2.0 2 2016-05-24\n", - "10367 210368 -1 2.0 1 2016-05-24\n", - "11019 211020 36-45岁 2.0 3 2016-06-06\n", - "12014 212015 36-45岁 2.0 2 2016-07-05\n", - "13850 213851 26-35岁 2.0 3 2016-09-11\n", - "14542 214543 -1 2.0 1 2016-10-05\n", - "16746 216747 16-25岁 2.0 1 2016-11-25" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# check user who’s user_reg_tm >= '2016-4-15'\n", - "df_user = pd.read_csv('./data/JData_User.csv',encoding='gbk')\n", - "df_user['user_reg_tm']=pd.to_datetime(df_user['user_reg_tm']) \n", - "df_user.loc[df_user.user_reg_tm>= '2016-4-15']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "检查依然存在4月15号后注册的,如果这些客户没有4月15号后的行为数据,说明要删除。" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - "Empty DataFrame\n", - "Columns: [user_id, sku_id, time, model_id, type, cate, brand]\n", - "Index: []" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_month = pd.read_csv('data/JData_Action_201604.csv')\n", - "df_month['time'] = pd.to_datetime(df_month['time'])\n", - "df_month.loc[df_month.time >= '2016-4-16']" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "说明客户没有交互数据,所以这一批客户不需要删除" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 行为数据中的user_id为浮点型,进行INT类型转换" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "int64\n", - "int64\n", - "int64\n" - ] - } - ], - "source": [ - "df_month = pd.read_csv('data/JData_Action_201602.csv',encoding='gbk')\n", - "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n", - "print (df_month['user_id'].dtype)\n", - "df_month.to_csv('data/JData_Action_201602.csv',index=None)\n", - " \n", - "df_month = pd.read_csv('data/JData_Action_201603.csv',encoding='gbk')\n", - "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n", - "print (df_month['user_id'].dtype)\n", - "df_month.to_csv('data/JData_Action_201603.csv',index=None)\n", - " \n", - "df_month = pd.read_csv('data/JData_Action_201604.csv',encoding='gbk')\n", - "df_month['user_id'] = df_month['user_id'].apply(lambda x:int(x))\n", - "print (df_month['user_id'].dtype)\n", - "df_month.to_csv('data/JData_Action_201604.csv',index=None)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 年龄区间的处理\n", - "查看用户年龄分布,并做特征编码" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " 3.0 46570\n", - " 4.0 30336\n", - "-1.0 14412\n", - " 2.0 8797\n", - " 5.0 3325\n", - " 6.0 1871\n", - " 
1.0 7\n", - "Name: age, dtype: int64\n" - ] - } - ], - "source": [ - "# Keys are the raw age-range labels stored in the data ('15岁以下' = 15 and under, '56岁以上' = 56 and over)\n", - "age_mapping = {\n", - "    '15岁以下': 1,\n", - "    '16-25岁': 2,\n", - "    '26-35岁': 3,\n", - "    '36-45岁': 4,\n", - "    '46-55岁': 5,\n", - "    '56岁以上': 6,\n", - "    '-1' :-1\n", - "    }\n", - "df_user['age'] = df_user['age'].map(age_mapping)\n", - "print(df_user.age.value_counts())\n", - "df_user.to_csv('data/JData_User.csv',index=None)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To support the cleaning steps above, we first build simple user behavior features and item behavior features, stored in two tables: user_table and item_table." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### user_table features:\n", - "* user_id (user ID), age, sex,\n", - "* user_lv_cd (user level), browse_num (number of views),\n", - "* addcart_num (number of add-to-cart actions), delcart_num (number of remove-from-cart actions),\n", - "* buy_num (number of purchases), favor_num (number of favorites),\n", - "* click_num (number of clicks), buy_addcart_ratio (purchase to add-to-cart conversion rate),\n", - "* buy_browse_ratio (purchase to view conversion rate),\n", - "* buy_click_ratio (purchase to click conversion rate),\n", - "* buy_favor_ratio (purchase to favorite conversion rate)\n", - "\n", - "### item_table features:\n", - "* sku_id (product ID), attr1, attr2,\n", - "* attr3, cate, brand, browse_num,\n", - "* addcart_num, delcart_num,\n", - "* buy_num, favor_num, click_num,\n", - "* buy_addcart_ratio, buy_browse_ratio,\n", - "* buy_click_ratio, buy_favor_ratio,\n", - "* comment_num (number of comments),\n", - "* has_bad_comment (whether there are negative comments),\n", - "* bad_comment_rate (negative comment rate)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Building user_table" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "# File names\n", - "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", - "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", - "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n", - "COMMENT_FILE = \"data/JData_Comment.csv\"\n", - "PRODUCT_FILE = \"data/JData_Product.csv\"\n", - "USER_FILE = \"data/JData_User.csv\"\n", - "\n", - "USER_TABLE_FILE = \"data/user_table.csv\"\n", - "ITEM_TABLE_FILE = \"data/item_table.csv\"" - ] - 
}, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "from collections import Counter" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "# Helper: compute per-type action counts for one user's group of rows\n", - "def add_type_count(group):\n", - "    behavior_type = group.type.astype(int)\n", - "    # Count how often each action type occurs\n", - "    type_cnt = Counter(behavior_type)\n", - "    # 1: view 2: add to cart 3: remove from cart\n", - "    # 4: purchase 5: favorite 6: click\n", - "    group['browse_num'] = type_cnt[1]\n", - "    group['addcart_num'] = type_cnt[2]\n", - "    group['delcart_num'] = type_cnt[3]\n", - "    group['buy_num'] = type_cnt[4]\n", - "    group['favor_num'] = type_cnt[5]\n", - "    group['click_num'] = type_cnt[6]\n", - "\n", - "    return group[['user_id', 'browse_num', 'addcart_num',\n", - "                  'delcart_num', 'buy_num', 'favor_num',\n", - "                  'click_num']]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The user action data is large, and reading it in one go may raise a MemoryError, so we read it in chunks with pandas." - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [], - "source": [ - "# Aggregate statistics over the action data\n", - "# Adjust chunk_size to suit your machine's memory\n", - "def get_from_action_data(fname, chunk_size=50000):\n", - "    reader = pd.read_csv(fname, header=0, iterator=True,encoding='gbk')\n", - "    chunks = []\n", - "    loop = True\n", - "    while loop:\n", - "        try:\n", - "            # Keep only the user_id and type columns\n", - "            chunk = reader.get_chunk(chunk_size)[[\"user_id\", \"type\"]]\n", - "            chunks.append(chunk)\n", - "        except StopIteration:\n", - "            loop = False\n", - "            print(\"Iteration is stopped\")\n", - "    # Concatenate the chunks into a single pandas DataFrame\n", - "    df_ac = pd.concat(chunks, ignore_index=True)\n", - "    # Group by user_id and compute the counts for each group; as_index=False keeps user_id as a column instead of the index\n", - "    df_ac = df_ac.groupby(['user_id'], as_index=False).apply(add_type_count)\n", - "    # Every row of a group carries the same counts, so keep one row per user\n", - "    df_ac = df_ac.drop_duplicates('user_id')\n", - "\n", - "    return df_ac" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - 
"metadata": {}, - "outputs": [], - "source": [ - "# 将各个action数据的统计量进行聚合\n", - "def merge_action_data():\n", - " df_ac = []\n", - " df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", - " df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", - " df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", - " \n", - " df_ac = pd.concat(df_ac, ignore_index=True)\n", - " # 用户在不同action表中统计量求和\n", - " df_ac = df_ac.groupby(['user_id'], as_index=False).sum()\n", - " # 构造转化率字段\n", - " df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n", - " df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", - " df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", - " df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n", - " \n", - " # 将大于1的转化率字段置为1(100%)\n", - " df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", - " df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", - " df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", - " df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", - " \n", - " return df_ac" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [], - "source": [ - "# 从FJData_User表中抽取需要的字段\n", - "def get_from_jdata_user():\n", - " df_usr = pd.read_csv(USER_FILE, header=0,encoding='gbk')\n", - " df_usr = df_usr[[\"user_id\", \"age\", \"sex\", \"user_lv_cd\"]]\n", - " return df_usr" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd\n", - "0 200001 6.0 2.0 5\n", - "1 200002 -1.0 0.0 1\n", - "2 200003 4.0 1.0 4\n", - "3 200004 -1.0 2.0 1\n", - "4 200005 2.0 0.0 4" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "user_base = get_from_jdata_user()\n", - "user_base.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Iteration is stopped\n", - "Iteration is stopped\n", - "Iteration is stopped\n" - ] - }, - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id browse_num addcart_num delcart_num buy_num favor_num \\\n", - "0 200001 212 22 13 1 0 \n", - "1 200002 238 1 0 0 0 \n", - "2 200003 221 4 1 0 1 \n", - "3 200004 52 0 0 0 0 \n", - "4 200005 106 2 3 1 2 \n", - "\n", - " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n", - "0 414 0.045455 0.004717 0.002415 \n", - "1 484 0.000000 0.000000 0.000000 \n", - "2 420 0.000000 0.000000 0.000000 \n", - "3 61 NaN 0.000000 0.000000 \n", - "4 161 0.500000 0.009434 0.006211 \n", - "\n", - " buy_favor_ratio \n", - "0 1.0 \n", - "1 NaN \n", - "2 0.0 \n", - "3 NaN \n", - "4 0.5 " - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "user_behavior = merge_action_data()\n", - "user_behavior.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "# 连接成一张表,类似于SQL的左连接(left join)\n", - "user_behavior = pd.merge(user_base, user_behavior, on=['user_id'], how='left')\n", - "# 保存为user_table.csv\n", - "user_behavior.to_csv(USER_TABLE_FILE, index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n", - "0 200001 6.0 2.0 5 212.0 22.0 13.0 \n", - "1 200002 -1.0 0.0 1 238.0 1.0 0.0 \n", - "2 200003 4.0 1.0 4 221.0 4.0 1.0 \n", - "3 200004 -1.0 2.0 1 52.0 0.0 0.0 \n", - "4 200005 2.0 0.0 4 106.0 2.0 3.0 \n", - "\n", - " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n", - "0 1.0 0.0 414.0 0.045455 0.004717 \n", - "1 0.0 0.0 484.0 0.000000 0.000000 \n", - "2 0.0 1.0 420.0 0.000000 0.000000 \n", - "3 0.0 0.0 61.0 NaN 0.000000 \n", - "4 1.0 2.0 161.0 0.500000 0.009434 \n", - "\n", - " buy_click_ratio buy_favor_ratio \n", - "0 0.002415 1.0 \n", - "1 0.000000 NaN \n", - "2 0.000000 0.0 \n", - "3 0.000000 NaN \n", - "4 0.006211 0.5 " - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "user_table = pd.read_csv(USER_TABLE_FILE)\n", - "user_table.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 构建Item_table" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# 定义文件名\n", - "ACTION_201602_FILE = \"data/JData_Action_201602.csv\"\n", - "ACTION_201603_FILE = \"data/JData_Action_201603.csv\"\n", - "ACTION_201604_FILE = \"data/JData_Action_201604.csv\"\n", - "COMMENT_FILE = \"data/JData_Comment.csv\"\n", - "PRODUCT_FILE = \"data/JData_Product.csv\"\n", - "USER_FILE = \"data/JData_User.csv\"\n", - "\n", - "USER_TABLE_FILE = \"data/user_table.csv\"\n", - "ITEM_TABLE_FILE = \"data/item_table.csv\"" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "from collections import Counter" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "# 读取Product中商品\n", - "def get_from_jdata_product():\n", - " df_item = pd.read_csv(PRODUCT_FILE, 
header=0,encoding='gbk')\n", - " return df_item" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "# Helper: compute per-type action counts for one product's group of rows\n", - "def add_type_count(group):\n", - "    behavior_type = group.type.astype(int)\n", - "    type_cnt = Counter(behavior_type)\n", - "\n", - "    group['browse_num'] = type_cnt[1]\n", - "    group['addcart_num'] = type_cnt[2]\n", - "    group['delcart_num'] = type_cnt[3]\n", - "    group['buy_num'] = type_cnt[4]\n", - "    group['favor_num'] = type_cnt[5]\n", - "    group['click_num'] = type_cnt[6]\n", - "\n", - "    return group[['sku_id', 'browse_num', 'addcart_num',\n", - "                  'delcart_num', 'buy_num', 'favor_num',\n", - "                  'click_num']]" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "# Aggregate statistics over the action data\n", - "def get_from_action_data(fname, chunk_size=50000):\n", - "    reader = pd.read_csv(fname, header=0, iterator=True,encoding='gbk')\n", - "    chunks = []\n", - "    loop = True\n", - "    while loop:\n", - "        try:\n", - "            chunk = reader.get_chunk(chunk_size)[[\"sku_id\", \"type\"]]\n", - "            chunks.append(chunk)\n", - "        except StopIteration:\n", - "            loop = False\n", - "            print(\"Iteration is stopped\")\n", - "\n", - "    df_ac = pd.concat(chunks, ignore_index=True)\n", - "    df_ac = df_ac.groupby(['sku_id'], as_index=False).apply(add_type_count)\n", - "    df_ac = df_ac.drop_duplicates('sku_id')\n", - "\n", - "    return df_ac" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "# Get product data from the comments; if a product has comments on several dates, keep the latest one\n", - "def get_from_jdata_comment():\n", - "    df_cmt = pd.read_csv(COMMENT_FILE, header=0)\n", - "    df_cmt['dt'] = pd.to_datetime(df_cmt['dt'])\n", - "    # Boolean mask: True for rows whose dt equals the latest dt of their sku_id\n", - "    idx = df_cmt.groupby(['sku_id'])['dt'].transform('max') == df_cmt['dt']\n", - "    df_cmt = df_cmt[idx]\n", - "\n", - "    return df_cmt[['sku_id', 'comment_num',\n", - "                   'has_bad_comment', 
'bad_comment_rate']]" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "def merge_action_data():\n", - " df_ac = []\n", - " df_ac.append(get_from_action_data(fname=ACTION_201602_FILE))\n", - " df_ac.append(get_from_action_data(fname=ACTION_201603_FILE))\n", - " df_ac.append(get_from_action_data(fname=ACTION_201604_FILE))\n", - " \n", - " df_ac = pd.concat(df_ac, ignore_index=True)\n", - " df_ac = df_ac.groupby(['sku_id'], as_index=False).sum()\n", - "\n", - " df_ac['buy_addcart_ratio'] = df_ac['buy_num'] / df_ac['addcart_num']\n", - " df_ac['buy_browse_ratio'] = df_ac['buy_num'] / df_ac['browse_num']\n", - " df_ac['buy_click_ratio'] = df_ac['buy_num'] / df_ac['click_num']\n", - " df_ac['buy_favor_ratio'] = df_ac['buy_num'] / df_ac['favor_num']\n", - " \n", - " df_ac.loc[df_ac['buy_addcart_ratio'] > 1., 'buy_addcart_ratio'] = 1.\n", - " df_ac.loc[df_ac['buy_browse_ratio'] > 1., 'buy_browse_ratio'] = 1.\n", - " df_ac.loc[df_ac['buy_click_ratio'] > 1., 'buy_click_ratio'] = 1.\n", - " df_ac.loc[df_ac['buy_favor_ratio'] > 1., 'buy_favor_ratio'] = 1.\n", - " \n", - " return df_ac" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " sku_id a1 a2 a3 cate brand\n", - "0 10 3 1 1 8 489\n", - "1 100002 3 2 2 8 489\n", - "2 100003 1 -1 -1 8 30\n", - "3 100006 1 2 1 8 545\n", - "4 10001 -1 1 2 8 244" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "item_base = get_from_jdata_product()\n", - "item_base.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Iteration is stopped\n", - "Iteration is stopped\n", - "Iteration is stopped\n" - ] - }, - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " sku_id browse_num addcart_num delcart_num buy_num favor_num \\\n", - "0 2 55 0 0 0 0 \n", - "1 18 2 0 0 0 0 \n", - "2 36 107 4 0 0 1 \n", - "3 37 5 0 0 0 0 \n", - "4 40 79 2 2 0 0 \n", - "\n", - " click_num buy_addcart_ratio buy_browse_ratio buy_click_ratio \\\n", - "0 79 NaN 0.0 0.0 \n", - "1 2 NaN 0.0 0.0 \n", - "2 186 0.0 0.0 0.0 \n", - "3 10 NaN 0.0 0.0 \n", - "4 179 0.0 0.0 0.0 \n", - "\n", - " buy_favor_ratio \n", - "0 NaN \n", - "1 NaN \n", - "2 0.0 \n", - "3 NaN \n", - "4 NaN " - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "item_behavior = merge_action_data()\n", - "item_behavior.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " sku_id comment_num has_bad_comment bad_comment_rate\n", - "512006 1000 3 1 0.0417\n", - "512007 10000 2 0 0.0000\n", - "512008 100011 4 1 0.0376\n", - "512009 100018 3 0 0.0000\n", - "512010 100020 3 0 0.0000" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "item_comment = get_from_jdata_comment()\n", - "item_comment.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [], - "source": [ - "# SQL: left join\n", - "item_behavior = pd.merge(item_base, item_behavior, on=['sku_id'], how='left')\n", - "item_behavior = pd.merge(item_behavior, item_comment, on=['sku_id'], how='left')\n", - " \n", - "item_behavior.to_csv(ITEM_TABLE_FILE, index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " sku_id a1 a2 a3 cate brand browse_num addcart_num delcart_num \\\n", - "0 10 3 1 1 8 489 NaN NaN NaN \n", - "1 100002 3 2 2 8 489 NaN NaN NaN \n", - "2 100003 1 -1 -1 8 30 NaN NaN NaN \n", - "3 100006 1 2 1 8 545 NaN NaN NaN \n", - "4 10001 -1 1 2 8 244 NaN NaN NaN \n", - "\n", - " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n", - "0 NaN NaN NaN NaN NaN \n", - "1 NaN NaN NaN NaN NaN \n", - "2 NaN NaN NaN NaN NaN \n", - "3 NaN NaN NaN NaN NaN \n", - "4 NaN NaN NaN NaN NaN \n", - "\n", - " buy_click_ratio buy_favor_ratio comment_num has_bad_comment \\\n", - "0 NaN NaN NaN NaN \n", - "1 NaN NaN NaN NaN \n", - "2 NaN NaN NaN NaN \n", - "3 NaN NaN NaN NaN \n", - "4 NaN NaN NaN NaN \n", - "\n", - " bad_comment_rate \n", - "0 NaN \n", - "1 NaN \n", - "2 NaN \n", - "3 NaN \n", - "4 NaN " - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "item_table = pd.read_csv(ITEM_TABLE_FILE)\n", - "item_table.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 数据清洗\n", - "\n", - "用户清洗" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd browse_num \\\n", - "count 105,321.000 105,318.000 105,318.000 105,321.000 105,180.000 \n", - "mean 252,661.000 2.773 1.113 3.850 180.466 \n", - "std 30,403.698 1.672 0.956 1.072 273.437 \n", - "min 200,001.000 -1.000 0.000 1.000 0.000 \n", - "25% 226,331.000 3.000 0.000 3.000 40.000 \n", - "50% 252,661.000 3.000 2.000 4.000 94.000 \n", - "75% 278,991.000 4.000 2.000 5.000 212.000 \n", - "max 305,321.000 6.000 2.000 5.000 7,605.000 \n", - "\n", - " addcart_num delcart_num buy_num favor_num click_num \\\n", - "count 105,180.000 105,180.000 105,180.000 105,180.000 105,180.000 \n", - "mean 5.471 2.434 0.459 1.045 291.222 \n", - "std 10.618 5.600 1.048 3.442 460.031 \n", - "min 0.000 0.000 0.000 0.000 0.000 \n", - "25% 0.000 0.000 0.000 0.000 59.000 \n", - "50% 2.000 0.000 0.000 0.000 148.000 \n", - "75% 6.000 3.000 1.000 0.000 342.000 \n", - "max 369.000 231.000 50.000 99.000 15,302.000 \n", - "\n", - " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \n", - "count 72,129.000 105,172.000 103,197.000 45,986.000 \n", - "mean 0.147 0.005 0.009 0.552 \n", - "std 0.270 0.022 0.074 0.473 \n", - "min 0.000 0.000 0.000 0.000 \n", - "25% 0.000 0.000 0.000 0.000 \n", - "50% 0.000 0.000 0.000 1.000 \n", - "75% 0.167 0.002 0.001 1.000 \n", - "max 1.000 1.000 1.000 1.000 " - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_user = pd.read_csv('data/User_table.csv',header=0)\n", - "pd.options.display.float_format = '{:,.3f}'.format #输出格式设置,保留三位小数\n", - "df_user.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "由上述统计信息发现: 第一行中根据User_id统计发现有105321个用户,发现有3个用户没有age,sex字段\n", - "\n", - "根据浏览、加购、删购、购买等记录却只有105180条记录,说明存在用户无任何交互记录,因此可以删除上述用户" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "删除没有age,sex字段的用户" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - 
"outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd browse_num addcart_num delcart_num \\\n", - "34072 234073 nan nan 1 32.000 6.000 4.000 \n", - "38905 238906 nan nan 1 171.000 3.000 2.000 \n", - "67704 267705 nan nan 1 342.000 18.000 8.000 \n", - "\n", - " buy_num favor_num click_num buy_addcart_ratio buy_browse_ratio \\\n", - "34072 1.000 0.000 41.000 0.167 0.031 \n", - "38905 2.000 3.000 464.000 0.667 0.012 \n", - "67704 0.000 0.000 743.000 0.000 0.000 \n", - "\n", - " buy_click_ratio buy_favor_ratio \n", - "34072 0.024 1.000 \n", - "38905 0.004 0.667 \n", - "67704 0.000 nan " - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_user[df_user['age'].isnull()]" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd browse_num \\\n", - "count 105,318.000 105,318.000 105,318.000 105,318.000 105,177.000 \n", - "mean 252,661.164 2.773 1.113 3.850 180.466 \n", - "std 30,404.012 1.672 0.956 1.071 273.440 \n", - "min 200,001.000 -1.000 0.000 1.000 0.000 \n", - "25% 226,330.250 3.000 0.000 3.000 40.000 \n", - "50% 252,661.500 3.000 2.000 4.000 94.000 \n", - "75% 278,991.750 4.000 2.000 5.000 212.000 \n", - "max 305,321.000 6.000 2.000 5.000 7,605.000 \n", - "\n", - " addcart_num delcart_num buy_num favor_num click_num \\\n", - "count 105,177.000 105,177.000 105,177.000 105,177.000 105,177.000 \n", - "mean 5.471 2.434 0.459 1.045 291.219 \n", - "std 10.618 5.600 1.048 3.442 460.034 \n", - "min 0.000 0.000 0.000 0.000 0.000 \n", - "25% 0.000 0.000 0.000 0.000 59.000 \n", - "50% 2.000 0.000 0.000 0.000 148.000 \n", - "75% 6.000 3.000 1.000 0.000 342.000 \n", - "max 369.000 231.000 50.000 99.000 15,302.000 \n", - "\n", - " buy_addcart_ratio buy_browse_ratio buy_click_ratio buy_favor_ratio \n", - "count 72,126.000 105,169.000 103,194.000 45,984.000 \n", - "mean 0.147 0.005 0.009 0.552 \n", - "std 0.270 0.022 0.074 0.473 \n", - "min 0.000 0.000 0.000 0.000 \n", - "25% 0.000 0.000 0.000 0.000 \n", - "50% 0.000 0.000 0.000 1.000 \n", - "75% 0.167 0.002 0.001 1.000 \n", - "max 1.000 1.000 1.000 1.000 " - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "delete_list = df_user[df_user['age'].isnull()].index\n", - "df_user.drop(delete_list,axis=0,inplace=True)\n", - "df_user.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "删除无交互记录的用户" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "105177\n" - ] - } - ], - "source": [ - "df_naction = df_user[(df_user['browse_num'].isnull()) & (df_user['addcart_num'].isnull()) & 
(df_user['delcart_num'].isnull()) & (df_user['buy_num'].isnull()) & (df_user['favor_num'].isnull()) & (df_user['click_num'].isnull())]\n", - "df_user.drop(df_naction.index,axis=0,inplace=True)\n", - "print(len(df_user))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Count and remove users with no purchase records" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "75694\n" - ] - } - ], - "source": [ - "df_bzero = df_user[df_user['buy_num']==0]\n", - "# print the number of users whose purchase count is 0\n", - "print(len(df_bzero))" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd browse_num addcart_num \\\n", - "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n", - "mean 250,746.445 2.914 1.025 4.272 302.488 10.525 \n", - "std 29,979.676 1.490 0.959 0.808 391.535 14.301 \n", - "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n", - "25% 225,058.500 3.000 0.000 4.000 76.000 3.000 \n", - "50% 249,144.000 3.000 1.000 4.000 178.000 6.000 \n", - "75% 276,252.500 4.000 2.000 5.000 381.000 13.000 \n", - "max 305,318.000 6.000 2.000 5.000 7,605.000 288.000 \n", - "\n", - " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n", - "count 29,483.000 29,483.000 29,483.000 29,483.000 29,483.000 \n", - "mean 4.673 1.637 1.677 486.653 0.360 \n", - "std 7.568 1.412 4.584 658.671 0.320 \n", - "min 0.000 1.000 0.000 0.000 0.004 \n", - "25% 0.000 1.000 0.000 116.000 0.118 \n", - "50% 2.000 1.000 0.000 282.000 0.250 \n", - "75% 6.000 2.000 1.000 604.000 0.500 \n", - "max 178.000 50.000 96.000 15,302.000 1.000 \n", - "\n", - " buy_browse_ratio buy_click_ratio buy_favor_ratio \n", - "count 29,483.000 29,483.000 29,483.000 \n", - "mean 0.018 0.030 0.862 \n", - "std 0.038 0.136 0.287 \n", - "min 0.000 0.000 0.010 \n", - "25% 0.004 0.002 1.000 \n", - "50% 0.008 0.005 1.000 \n", - "75% 0.018 0.012 1.000 \n", - "max 1.000 1.000 1.000 " - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_user = df_user[df_user['buy_num']!=0] # 只要有购买记录的\n", - "df_user.describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "删除爬虫及惰性用户\n", - "\n", - "由上表所知,浏览购买转换比和点击购买转换比均值为0.018,0.030,因此这里认为浏览购买转换比和点击购买转换比小于0.0005的用户为惰性用户" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "90\n" - ] - } - ], - "source": [ - "bindex = df_user[df_user['buy_browse_ratio']<0.0005].index\n", - "print (len(bindex))\n", - 
"df_user.drop(bindex,axis=0,inplace=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "323\n" - ] - } - ], - "source": [ - "cindex = df_user[df_user['buy_click_ratio']<0.0005].index\n", - "print (len(cindex))\n", - "df_user.drop(cindex,axis=0,inplace=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " user_id age sex user_lv_cd browse_num addcart_num \\\n", - "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n", - "mean 250,767.099 2.910 1.028 4.268 280.260 10.145 \n", - "std 29,998.870 1.492 0.959 0.809 325.129 13.443 \n", - "min 200,001.000 -1.000 0.000 2.000 1.000 0.000 \n", - "25% 225,036.000 3.000 0.000 4.000 75.000 3.000 \n", - "50% 249,200.500 3.000 1.000 4.000 174.000 6.000 \n", - "75% 276,284.000 4.000 2.000 5.000 366.000 13.000 \n", - "max 305,318.000 6.000 2.000 5.000 5,007.000 288.000 \n", - "\n", - " delcart_num buy_num favor_num click_num buy_addcart_ratio \\\n", - "count 29,070.000 29,070.000 29,070.000 29,070.000 29,070.000 \n", - "mean 4.457 1.644 1.589 447.113 0.364 \n", - "std 6.998 1.420 4.294 530.994 0.320 \n", - "min 0.000 1.000 0.000 0.000 0.004 \n", - "25% 0.000 1.000 0.000 114.000 0.125 \n", - "50% 2.000 1.000 0.000 275.000 0.250 \n", - "75% 6.000 2.000 1.000 585.000 0.500 \n", - "max 158.000 50.000 69.000 8,156.000 1.000 \n", - "\n", - " buy_browse_ratio buy_click_ratio buy_favor_ratio \n", - "count 29,070.000 29,070.000 29,070.000 \n", - "mean 0.019 0.031 0.866 \n", - "std 0.038 0.137 0.282 \n", - "min 0.001 0.001 0.018 \n", - "25% 0.004 0.002 1.000 \n", - "50% 0.008 0.005 1.000 \n", - "75% 0.018 0.012 1.000 \n", - "max 1.000 1.000 1.000 " - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_user.describe()" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [], - "source": [ - "df_user.to_csv(\"data/JData_FUser.csv\", index=None)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - 
"version": "3.7.3" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}